Step 4: Malware Visualisation Context

Binary-to-image for malware families

Malware visualisation: binaries as images

In 2011, Nataraj et al. demonstrated that treating the raw bytes of a malware binary as pixel values produces images whose textures are visually distinctive for each malware family. This technique converts a detection problem into an image classification problem, which is exactly what CNNs excel at.

Step	Operation	Example
1	Read binary as raw bytes	`4D 5A 90 00 03 00 ...`
2	Interpret each byte as a greyscale pixel (0-255)	`[77, 90, 144, 0, 3, 0, ...]`
3	Reshape to a 2D grid	32x32, 64x64, or 128x128 depending on file size
4	Feed the image to a CNN	Same Conv-Pool-Dense architecture as MNIST
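
Steps 1 to 3 can be traced on the six example bytes from the table. This is a minimal sketch: the 2x3 grid is chosen only so that the six bytes fit, and is not a realistic image size.

```python
import numpy as np

# The six example bytes from the table (0x4D 0x5A is the ASCII pair "MZ").
raw_bytes = bytes.fromhex("4D 5A 90 00 03 00")

# Step 2: each byte becomes one greyscale pixel value in the range 0-255.
pixels = np.frombuffer(raw_bytes, dtype=np.uint8)
print(pixels.tolist())  # [77, 90, 144, 0, 3, 0]

# Step 3: reshape the flat sequence into rows (2x3 here, purely for illustration).
tiny_image = pixels.reshape(2, 3)
print(tiny_image.shape)  # (2, 3)
```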

The conversion code

import numpy as np

def binary_to_image(file_path, img_size=32):
    """Convert a binary file to a greyscale image."""
    with open(file_path, 'rb') as f:
        raw_bytes = f.read()

    # Convert bytes to numpy array of uint8 values (0-255)
    byte_array = np.frombuffer(raw_bytes, dtype=np.uint8)

    # Truncate or pad to fill a square image
    total_pixels = img_size * img_size
    if len(byte_array) >= total_pixels:
        pixels = byte_array[:total_pixels]
    else:
        pixels = np.pad(byte_array, (0, total_pixels - len(byte_array)))

    # Reshape to 2D image
    image = pixels.reshape(img_size, img_size)
    return image
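
A CNN usually expects a batch of floating-point inputs with a channel axis rather than a single uint8 grid. The sketch below shows one common way to prepare that. It assumes a channels-last layout `(N, H, W, 1)` and scaling to the range [0, 1]; `to_cnn_batch` is a hypothetical helper name, and the synthetic images stand in for outputs of `binary_to_image`.

```python
import numpy as np

def to_cnn_batch(images):
    """Stack 2D uint8 images into a float32 batch of shape (N, H, W, 1).

    Hypothetical helper: assumes all images share the same size and that the
    model expects channels-last input scaled to the range [0, 1].
    """
    batch = np.stack(images).astype(np.float32) / 255.0
    return batch[..., np.newaxis]  # add the single greyscale channel

# Synthetic stand-ins for two 32x32 outputs of binary_to_image.
rng = np.random.default_rng(0)
fake_images = [rng.integers(0, 256, size=(32, 32), dtype=np.uint8) for _ in range(2)]

batch = to_cnn_batch(fake_images)
print(batch.shape, batch.dtype)  # (2, 32, 32, 1) float32
```

If the framework in use expects channels-first input instead, the new axis would go at position 1 rather than at the end.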

Why malware families look different

Different binary sections produce characteristic visual patterns:

Binary section	Visual pattern	Why
.text (code)	Fine-grained texture with regular patterns	Compiled instructions have structured byte patterns
.data (initialised data)	Smooth gradients or blocks of similar values	Strings, constants, and configuration data
.rsrc (resources)	Distinct blocks, sometimes recognisable icons	Embedded images, dialogs, version info
Packed/encrypted	High-entropy noise (random-looking)	Packing destroys structure, producing uniform randomness
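
The "high-entropy noise" of packed or encrypted regions can be measured directly. The sketch below computes Shannon entropy in bits per byte for three synthetic buffers. The buffers and the `byte_entropy` helper are illustrative assumptions, not samples taken from real binaries.

```python
import numpy as np

def byte_entropy(data):
    """Shannon entropy of a non-empty byte sequence, in bits per byte (0.0 to 8.0)."""
    values = np.frombuffer(data, dtype=np.uint8)
    counts = np.bincount(values, minlength=256)
    probs = counts[counts > 0] / values.size
    return float(np.sum(probs * np.log2(1.0 / probs)))

# Synthetic stand-ins for different kinds of content (illustrative only).
padding = bytes(4096)                              # all zero bytes
text_like = b"GET /index.html HTTP/1.1\r\n" * 150  # repetitive string data
rng = np.random.default_rng(0)
random_like = rng.integers(0, 256, size=4096, dtype=np.uint8).tobytes()  # packed/encrypted stand-in

for name, buf in [("padding", padding), ("text-like", text_like), ("random-like", random_like)]:
    print(f"{name:12s} {byte_entropy(buf):.2f} bits/byte")
```

The random buffer should score close to the 8 bits/byte maximum, while the repetitive text scores much lower and the zero padding scores 0.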

Malware families share code and structure, so variants of the same family produce visually similar images even after minor modifications like changing C2 addresses or encryption keys.

CNN vs traditional detection

Method	Strengths	Weaknesses
Hash-based (MD5/SHA256)	Fast, exact, effectively zero false positives with a strong hash such as SHA-256	One byte change = completely different hash; trivially evaded
Signature-based (YARA)	Pattern matching, expert-crafted, explainable	Manual effort to write rules; misses unknown variants
CNN on binary images	Learns family-level structure; can detect previously unseen variants of known families	Requires labelled training data; less explainable; slower inference

In practice, these methods are complementary. Hash lookup handles known samples in microseconds. YARA catches known patterns. The CNN can catch novel variants that evade both, because it detects structural similarity at the family level rather than exact bytes or hand-written patterns.
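
The contrast between exact-match hashing and structural similarity can be illustrated with two synthetic byte buffers that differ only in an embedded address. Everything here is an illustrative assumption: the buffers are random stand-ins for binaries, the address strings come from documentation IP ranges, and the fraction of matching pixels is only a crude proxy for what a trained CNN would pick up.

```python
import hashlib
import numpy as np

# Random bytes standing in for the shared body of two malware variants.
rng = np.random.default_rng(42)
body = rng.integers(0, 256, size=1024, dtype=np.uint8).tobytes()

# Two "variants": identical except for a 12-byte embedded address field.
variant_a = body[:500] + b"192.0.2.10\x00\x00" + body[512:]
variant_b = body[:500] + b"198.51.100.7" + body[512:]

# Hash-based view: the two digests do not match.
hash_a = hashlib.sha256(variant_a).hexdigest()
hash_b = hashlib.sha256(variant_b).hexdigest()
print(hash_a == hash_b)  # False

# Image-based view: almost every pixel is unchanged.
img_a = np.frombuffer(variant_a, dtype=np.uint8).reshape(32, 32)
img_b = np.frombuffer(variant_b, dtype=np.uint8).reshape(32, 32)
same_fraction = float((img_a == img_b).mean())
print(f"{same_fraction:.3f} of pixels identical")
```

Real variants often shift offsets or change more than a few bytes, so actual similarity is usually less clear-cut than in this toy case.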

Think Deeper

Try this:

A malware analyst asks: 'Can we just use file hashes instead of this image stuff?' What is the fundamental limitation of hash-based detection that CNN-based visualisation addresses?

Hashes are exact-match only: change one byte and the hash is completely different. Malware authors trivially evade hash detection by recompiling or repacking. CNN-based visualisation detects structural similarity instead. Variants of the same malware family produce visually similar images because they share code sections, data layouts, and overall byte-level structure. Note that the image reflects the static file contents, not runtime behaviour. The CNN learns these family-level patterns, which survive minor modifications.

Cybersecurity tie-in: Malware visualisation with CNNs has been studied for production security systems, not just in the classroom. Microsoft's research on malware-as-image classification reported that CNNs trained on binary visualisations can classify malware with over 98% accuracy on samples held out from training. This approach complements traditional signature-based detection by catching variants that change their hash but preserve their overall structure.
