Malware visualisation: binaries as images
In 2011, Nataraj et al. demonstrated that treating raw malware bytes as pixel values produces images that are visually distinctive for each malware family. This technique turns a detection problem into an image classification problem, which is exactly what CNNs excel at.
| Step | Operation | Example |
|---|---|---|
| 1 | Read binary as raw bytes | 4D 5A 90 00 03 00 ... |
| 2 | Interpret each byte as a greyscale pixel (0-255) | [77, 90, 144, 0, 3, 0, ...] |
| 3 | Reshape to a 2D grid | 32x32, 64x64, or 128x128 depending on file size |
| 4 | Feed the image to a CNN | Same Conv-Pool-Dense architecture as MNIST |
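The first three rows can be reproduced directly on the example bytes from the table. This is only an illustration: the 2x3 grid is chosen because there are six example bytes, and it is not a size that would be used for real files.

```python
import numpy as np

# Step 1: the six example bytes from the table ("MZ" is the DOS header magic).
raw_bytes = bytes.fromhex("4D 5A 90 00 03 00")

# Step 2: each byte becomes one greyscale intensity in the range 0-255.
pixels = np.frombuffer(raw_bytes, dtype=np.uint8)
print(pixels.tolist())  # [77, 90, 144, 0, 3, 0]

# Step 3: reshape to a 2D grid (2x3 here only because there are six bytes).
image = pixels.reshape(2, 3)
print(image.shape)  # (2, 3)
```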
The conversion code
import numpy as np

def binary_to_image(file_path, img_size=32):
    """Convert a binary file to a greyscale image."""
    with open(file_path, 'rb') as f:
        raw_bytes = f.read()
    # Convert bytes to numpy array of uint8 values (0-255)
    byte_array = np.frombuffer(raw_bytes, dtype=np.uint8)
    # Truncate or pad to fill a square image
    total_pixels = img_size * img_size
    if len(byte_array) >= total_pixels:
        pixels = byte_array[:total_pixels]
    else:
        pixels = np.pad(byte_array, (0, total_pixels - len(byte_array)))
    # Reshape to 2D image
    image = pixels.reshape(img_size, img_size)
    return image
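Before training, images are usually scaled and given batch and channel axes. The sketch below assumes a channels-last layout and scaling to the range 0 to 1, which are common conventions rather than anything specified above. `bytes_to_image` and `to_cnn_batch` are hypothetical helper names; the first mirrors the truncate-or-pad logic of `binary_to_image` but works on in-memory bytes so the example runs without files.

```python
import numpy as np

def bytes_to_image(raw_bytes, img_size=32):
    """In-memory variant of the conversion: truncate or zero-pad, then reshape."""
    byte_array = np.frombuffer(raw_bytes, dtype=np.uint8)
    total_pixels = img_size * img_size
    if len(byte_array) >= total_pixels:
        pixels = byte_array[:total_pixels]
    else:
        pixels = np.pad(byte_array, (0, total_pixels - len(byte_array)))
    return pixels.reshape(img_size, img_size)

def to_cnn_batch(images):
    """Stack uint8 images into a float32 batch of shape (N, H, W, 1) in [0, 1]."""
    batch = np.stack(images).astype(np.float32) / 255.0
    return batch[..., np.newaxis]  # add a single greyscale channel

# Synthetic stand-ins for file contents: one shorter and one longer than 32*32 bytes.
short_sample = bytes(range(256))      # 256 bytes, zero-padded to 1024
long_sample = bytes(range(256)) * 8   # 2048 bytes, truncated to 1024

batch = to_cnn_batch([bytes_to_image(short_sample), bytes_to_image(long_sample)])
print(batch.shape)  # (2, 32, 32, 1)
print(batch.dtype)  # float32
```

Note that truncation discards everything beyond the first `img_size * img_size` bytes, so with `img_size=32` only the first 1024 bytes of each file influence the image.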
Why malware families look different
Different binary sections produce characteristic visual patterns:
| Binary section | Visual pattern | Why |
|---|---|---|
| .text (code) | Fine-grained texture with regular patterns | Compiled instructions have structured byte patterns |
| .data (initialised data) | Smooth gradients or blocks of similar values | Strings, constants, and configuration data |
| .rsrc (resources) | Distinct blocks, sometimes recognisable icons | Embedded images, dialogs, version info |
| Packed/encrypted | High-entropy noise (random-looking) | Compression or encryption removes visible structure, leaving bytes that look close to uniformly random |
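One way to make the "high-entropy noise" row concrete is to measure the Shannon entropy of the byte values. The sketch below runs on synthetic data, not real binaries: `os.urandom` output stands in for packed or encrypted content, a repeated four-byte pattern stands in for structured content, and `byte_entropy` is a hypothetical helper name.

```python
import os
import numpy as np

def byte_entropy(raw_bytes):
    """Shannon entropy of the byte-value distribution, in bits per byte (0 to 8)."""
    counts = np.bincount(np.frombuffer(raw_bytes, dtype=np.uint8), minlength=256)
    probs = counts[counts > 0] / len(raw_bytes)
    return float(-(probs * np.log2(probs)).sum())

structured = b"\x55\x8b\xec\x83" * 1024  # four distinct byte values, repeated
random_like = os.urandom(4096)           # stand-in for packed/encrypted bytes

print(byte_entropy(structured))   # 2.0
print(byte_entropy(random_like))  # close to 8; the exact value varies per run
```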
Malware families share code and structure, so variants of the same family produce visually similar images even after minor modifications like changing C2 addresses or encryption keys.
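The effect can be illustrated with synthetic data. In the sketch below, two byte buffers differ only in a four-byte field standing in for a C2 address. The buffers, the field offset, and the 32x32 size are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic "family" body: 1024 bytes shared by both variants.
base = rng.integers(0, 256, size=1024, dtype=np.uint8)

variant_a = base.copy()
variant_b = base.copy()
# Hypothetical 4-byte config field at offset 100, e.g. an embedded IPv4 address.
variant_a[100:104] = [10, 0, 0, 1]
variant_b[100:104] = [192, 168, 1, 7]

image_a = variant_a.reshape(32, 32)
image_b = variant_b.reshape(32, 32)

changed = int((image_a != image_b).sum())
print(changed)                     # 4
print(1 - changed / image_a.size)  # fraction of identical pixels, about 0.996
```

The pixel-wise comparison works here only because the edit does not shift any bytes. An insertion or deletion would move every later pixel.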
CNN vs traditional detection
| Method | Strengths | Weaknesses |
|---|---|---|
| Hash-based (MD5/SHA-256) | Fast, exact, effectively no false positives | A one-byte change yields a completely different hash; trivially evaded |
| Signature-based (YARA) | Pattern matching, expert-crafted, explainable | Manual effort to write rules; misses unknown variants |
| CNN on binary images | Learns family-level structure; can detect unseen variants of known families | Requires training data; less explainable; slower inference |
In practice, these methods are complementary. Hash lookup handles known samples in microseconds. YARA catches known patterns. The CNN can catch novel variants that evade both, because it detects structural similarity at the family level.
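This layering could be sketched as a short triage function. Everything here is a placeholder: the hash set, the byte pattern (a plain substring check standing in for a real YARA rule), and `cnn_predict`, a hypothetical callable that would wrap a trained model.

```python
import hashlib

def triage(raw_bytes, known_hashes, byte_patterns, cnn_predict):
    """Return (verdict, layer) from the cheapest layer that gives an answer."""
    # Layer 1: exact match on the SHA-256 digest of the whole file.
    if hashlib.sha256(raw_bytes).hexdigest() in known_hashes:
        return "malicious", "hash"
    # Layer 2: substring check as a stand-in for signature rules such as YARA.
    if any(pattern in raw_bytes for pattern in byte_patterns):
        return "malicious", "signature"
    # Layer 3: learned classifier (hypothetical callable returning a label).
    return cnn_predict(raw_bytes), "cnn"

def cnn_predict(raw_bytes):
    """Dummy model for the example: always returns the same label."""
    return "suspicious"

# Placeholder data, invented for this example.
known_sample = b"MZ" + b"\x00" * 62 + b"payload-v1"
variant = b"MZ" + b"\x00" * 62 + b"payload-v2"  # one byte differs
known_hashes = {hashlib.sha256(known_sample).hexdigest()}
byte_patterns = [b"payload-v1"]

print(triage(known_sample, known_hashes, byte_patterns, cnn_predict))  # ('malicious', 'hash')
print(triage(variant, known_hashes, byte_patterns, cnn_predict))       # ('suspicious', 'cnn')
```

The variant misses both the hash set and the byte pattern, so only the third layer gets a chance to flag it.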
Think Deeper
A malware analyst asks: 'Can we just use file hashes instead of this image stuff?' What is the fundamental limitation of hash-based detection that CNN-based visualisation addresses?