Step 4: Malware Visualisation Context

Binary-to-image for malware families

Malware visualisation: binaries as images

In 2011, Nataraj et al. showed that interpreting the raw bytes of a malware binary as greyscale pixel values produces images that are visually distinctive for each malware family. This turns a malware detection problem into an image classification problem, which is exactly what CNNs excel at.

Step | Operation | Example
1 | Read binary as raw bytes | 4D 5A 90 00 03 00 ...
2 | Interpret each byte as a greyscale pixel (0-255) | [77, 90, 144, 0, 3, 0, ...]
3 | Reshape to a 2D grid | 32x32, 64x64, or 128x128 depending on file size
4 | Feed the image to a CNN | Same Conv-Pool-Dense architecture as MNIST
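
The example values in the table can be checked by hand with a few lines of standard-library Python. This is only a sketch of steps 1 to 3 on the six example bytes; the 2x3 grid and the `width` variable are illustrative stand-ins for a real 32x32 image.

```python
# The six example bytes from the table ("MZ" marks the start of a DOS/PE header).
raw_bytes = bytes.fromhex("4D 5A 90 00 03 00")

# Step 2: iterating over a bytes object yields integers in the range 0-255.
pixels = list(raw_bytes)

# Step 3: cut the flat list into rows. A 2x3 grid stands in for 32x32 here.
width = 3
grid = [pixels[i:i + width] for i in range(0, len(pixels), width)]

print(pixels)  # [77, 90, 144, 0, 3, 0]
print(grid)    # [[77, 90, 144], [0, 3, 0]]
```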

The conversion code

import numpy as np

def binary_to_image(file_path, img_size=32):
    """Convert a binary file to a greyscale image."""
    with open(file_path, 'rb') as f:
        raw_bytes = f.read()

    # Convert bytes to numpy array of uint8 values (0-255)
    byte_array = np.frombuffer(raw_bytes, dtype=np.uint8)

    # Truncate, or zero-pad at the end, so the data fills exactly one square image
    total_pixels = img_size * img_size
    if len(byte_array) >= total_pixels:
        pixels = byte_array[:total_pixels]
    else:
        pixels = np.pad(byte_array, (0, total_pixels - len(byte_array)))

    # Reshape to 2D image
    image = pixels.reshape(img_size, img_size)
    return image
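
The table earlier mentions choosing 32x32, 64x64, or 128x128 depending on file size, but the function above leaves `img_size` to the caller. One possible way to pick it is sketched below. `pick_image_size` is a hypothetical helper, and its rule (the smallest listed size whose pixel count holds the whole file) is an assumption made for illustration, not a rule stated in this lesson.

```python
def pick_image_size(num_bytes, sizes=(32, 64, 128)):
    """Return the smallest side length whose square can hold num_bytes.

    Hypothetical helper: the candidate sizes come from the lesson's table,
    but the selection rule itself is an assumption.
    """
    for side in sizes:
        if num_bytes <= side * side:
            return side
    # Files larger than the biggest image are truncated to it.
    return sizes[-1]


for n in (600, 3000, 200_000):
    print(n, "bytes ->", pick_image_size(n))
# 600 bytes -> 32
# 3000 bytes -> 64
# 200000 bytes -> 128
```

In use, the result would be passed as `img_size`, with the byte count taken from something like `os.path.getsize(file_path)`. A CNN expects one fixed input shape, so in practice a single size is usually chosen for the whole dataset rather than per file.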

Why malware families look different

Different binary sections produce characteristic visual patterns:

Binary section | Visual pattern | Why
.text (code) | Fine-grained texture with regular patterns | Compiled instructions have structured byte patterns
.data (initialised data) | Smooth gradients or blocks of similar values | Strings, constants, and configuration data
.rsrc (resources) | Distinct blocks, sometimes recognisable icons | Embedded images, dialogs, version info
Packed/encrypted | High-entropy noise (random-looking) | Packing destroys structure, producing uniform randomness
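
The "high-entropy noise" in the last row can be made concrete by measuring Shannon entropy in bits per byte. The sketch below uses synthetic data rather than real binary sections. `byte_entropy` is an illustrative helper, and `os.urandom` is only a stand-in for packed or encrypted content.

```python
import math
import os
from collections import Counter


def byte_entropy(data):
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # Sum of p * log2(1/p) over every byte value that occurs.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


samples = {
    "zero padding": bytes(4096),                          # one repeated value
    "string data": b"GET /index.html HTTP/1.1\r\n" * 150,  # repetitive text
    "random bytes": os.urandom(4096),                     # stand-in for packed data
}

for name, blob in samples.items():
    print(f"{name:13s} {byte_entropy(blob):.2f} bits/byte")
```

Values near 0 correspond to flat blocks in the image, while values close to 8 correspond to the random-looking texture of packed or encrypted regions.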

Malware families share code and structure, so variants of the same family produce visually similar images even after minor modifications like changing C2 addresses or encryption keys.

CNN vs traditional detection

Method | Strengths | Weaknesses
Hash-based (MD5/SHA256) | Fast, exact, effectively zero false positives | A one-byte change yields a completely different hash; trivially evaded
Signature-based (YARA) | Pattern matching, expert-crafted, explainable | Manual effort to write rules; misses unknown variants
CNN on binary images | Learns family-level structure; detects unknown variants | Requires training data; less explainable; slower inference

In practice, these methods are complementary. Hash lookup handles known samples in microseconds. YARA catches known patterns. The CNN catches novel variants that evade both by detecting structural similarity at the family level.
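
The contrast between hashing and the image representation can be seen on a toy example. The sketch below builds a synthetic byte string (not real malware), flips a single byte to simulate a trivially modified variant, and compares SHA-256 digests with pixel-level differences. `to_pixels`, `SIDE`, and the sample contents are all illustrative assumptions.

```python
import hashlib

SIDE = 32  # image side length, so 32 * 32 = 1024 pixels


def to_pixels(data, side=SIDE):
    """Truncate or zero-pad data to side*side bytes; each byte is one pixel."""
    total = side * side
    return data[:total].ljust(total, b"\x00")


# Synthetic stand-in for a sample: "MZ" plus 1000 bytes of patterned filler.
original = b"MZ" + bytes(i % 251 for i in range(1000))

# A "variant" differing in one byte, for example a patched constant.
patched = bytearray(original)
patched[500] ^= 0xFF
variant = bytes(patched)

hash_a = hashlib.sha256(original).hexdigest()
hash_b = hashlib.sha256(variant).hexdigest()

pixels_a = to_pixels(original)
pixels_b = to_pixels(variant)
changed = sum(a != b for a, b in zip(pixels_a, pixels_b))

print("hashes equal:", hash_a == hash_b)                 # hashes equal: False
print("pixels changed:", changed, "of", SIDE * SIDE)     # pixels changed: 1 of 1024
```

A single flipped byte is the best case for the image approach. Real variants usually change more bytes than this, and how much similarity survives depends on the kind of modification.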

Think Deeper

A malware analyst asks: 'Can we just use file hashes instead of this image stuff?' What is the fundamental limitation of hash-based detection that CNN-based visualisation addresses?

Hashes are exact-match only: change one byte and the hash is completely different, so malware authors trivially evade hash detection by recompiling or packing. CNN-based visualisation instead detects structural similarity. Variants of the same malware family produce visually similar images because they share code sections, data layouts, and instruction patterns, and the CNN learns these family-level patterns, which survive minor modifications.

Cybersecurity tie-in: Malware visualisation with CNNs is actively used in production security systems. Microsoft's research on malware-as-image classification demonstrated that CNNs trained on binary visualisations can classify malware families with over 98% accuracy, even on previously unseen samples from the same family. This approach complements traditional signature-based detection by catching variants that change their hash but preserve their structural DNA.
