From URLs to Numbers
A model can't read a URL directly. You first need to extract features, turning structural properties of the URL into numbers:
| Feature | Why phishing URLs score higher |
|---|---|
| url_length | Phishing URLs pad with random characters to obscure the real domain |
| num_dots | Deep subdomains: login.bank.com.phish.evil.com |
| has_at_symbol | user@evil.com/redirect: the browser treats the text before @ as credentials, not as the destination |
| uses_https | Counter-intuitive: many phishing sites now use HTTPS, so HTTPS alone does not signal safety |
| num_subdomains | Deep nesting: a.b.c.d.evil.com |
| has_ip_address | http://192.168.1.1/login instead of a domain name |
| num_hyphens | secure-login-paypal.com mimics legitimate domains |
| path_length | Long paths with encoded redirect parameters |
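To make the table concrete, here is a minimal sketch of how these features might be computed from a raw URL string using only the Python standard library. The function name `extract_features` and the sample URLs are hypothetical, and the subdomain count rests on a naive assumption (the registered domain is the last two host labels) that fails for suffixes such as `.co.uk`.

```python
import ipaddress
from urllib.parse import urlsplit


def extract_features(url: str) -> dict:
    """Turn one URL string into the numeric features listed in the table."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # host only, without port or credentials

    # Does the host parse as a literal IPv4/IPv6 address?
    try:
        ipaddress.ip_address(host)
        has_ip = 1
    except ValueError:
        has_ip = 0

    # Naive assumption: the registered domain is the last two labels.
    # A public-suffix list would be needed to handle e.g. .co.uk correctly.
    labels = [label for label in host.split(".") if label]
    num_subdomains = 0 if has_ip else max(len(labels) - 2, 0)

    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "has_at_symbol": int("@" in url),
        "uses_https": int(parts.scheme == "https"),
        "num_subdomains": num_subdomains,
        "has_ip_address": has_ip,
        "num_hyphens": url.count("-"),
        "path_length": len(parts.path),
    }


# Hypothetical URLs, for illustration only.
for url in [
    "https://www.example.com/login",
    "http://192.168.1.1/login",
    "https://login.bank.com.phish.evil.com/secure-update",
]:
    print(url, extract_features(url))
```

Applying a function like this to every URL in a dataset yields one numeric row per URL, which is the shape a model expects.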
Key Code Pattern
```python
import numpy as np
import pandas as pd

# Synthetic phishing dataset with realistic distributions
np.random.seed(42)
n = 1000

# Legitimate URLs (label=0): shorter, fewer dots, rarely have @
legit = pd.DataFrame({
    'url_length': np.random.normal(45, 15, n//2).clip(10),
    'num_dots': np.random.poisson(2, n//2),
    'has_at_symbol': np.random.binomial(1, 0.01, n//2),
    'has_ip_address': np.random.binomial(1, 0.02, n//2),
    'num_hyphens': np.random.poisson(0.5, n//2),
    'label': 0
})

# Phishing URLs (label=1): longer, more dots, more suspicious features
phish = pd.DataFrame({
    'url_length': np.random.normal(85, 25, n//2).clip(10),
    'num_dots': np.random.poisson(5, n//2),
    'has_at_symbol': np.random.binomial(1, 0.15, n//2),
    'has_ip_address': np.random.binomial(1, 0.20, n//2),
    'num_hyphens': np.random.poisson(3, n//2),
    'label': 1
})

# Stack both classes and compare the per-class mean of each feature
df = pd.concat([legit, phish]).reset_index(drop=True)
print(df.groupby('label').mean())
```
Exploring Feature Distributions
Before modelling, compare each feature's distribution across classes. Features with clear separation between phishing and legitimate will be the most useful to the model.
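One way to put a number on "clear separation" is sketched below. It assumes a small synthetic dataset with parameters mirroring the lesson's example, and it uses a standardized mean difference (the gap between class means divided by a pooled standard deviation) as the separation score. That score is a choice made for this illustration, not something the lesson prescribes.

```python
import numpy as np
import pandas as pd

# Small synthetic dataset (assumed parameters, mirroring the lesson's example).
rng = np.random.default_rng(42)
half = 500
df = pd.DataFrame({
    "url_length": np.concatenate(
        [rng.normal(45, 15, half), rng.normal(85, 25, half)]
    ).clip(10),
    "num_dots": np.concatenate([rng.poisson(2, half), rng.poisson(5, half)]),
    "has_at_symbol": np.concatenate(
        [rng.binomial(1, 0.01, half), rng.binomial(1, 0.15, half)]
    ),
    "label": np.repeat([0, 1], half),  # 0 = legitimate, 1 = phishing
})

features = [col for col in df.columns if col != "label"]
grouped = df.groupby("label")[features]
means, stds = grouped.mean(), grouped.std()

# Standardized mean difference: gap between class means in pooled-std units.
pooled_std = np.sqrt((stds.loc[0] ** 2 + stds.loc[1] ** 2) / 2)
separation = (means.loc[1] - means.loc[0]).abs() / pooled_std
separation = separation.sort_values(ascending=False)

print(means.T)      # per-class mean of every feature
print(separation)   # larger value = less overlap between the classes
```

Features near the top of `separation` overlap least between the two classes, so they are the first candidates to inspect with histograms or box plots.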
Check for class imbalance too: if 95% of your data is legitimate, the model can get 95% accuracy by always predicting "legitimate" — while missing every single phishing attempt.
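The accuracy trap can be reproduced in a few lines. This sketch assumes a toy label vector with a 95/5 split and a "model" that always predicts legitimate; both are made up for illustration.

```python
import numpy as np

# Assumed toy labels: 950 legitimate (0) and 50 phishing (1) URLs.
y_true = np.array([0] * 950 + [1] * 50)

# A "model" that ignores its input and always predicts legitimate.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()

# Recall on the phishing class: phishing URLs caught / all phishing URLs.
caught = ((y_pred == 1) & (y_true == 1)).sum()
phishing_recall = caught / (y_true == 1).sum()

print("class balance:", np.bincount(y_true) / len(y_true))
print(f"accuracy: {accuracy:.2f}")                # 0.95
print(f"phishing recall: {phishing_recall:.2f}")  # 0.00
```

Checking the class balance and a per-class metric such as recall alongside accuracy exposes this failure immediately.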
Think Deeper
Try this:
A phishing URL uses HTTPS and has a valid certificate. Does that make it safe? Which features would still catch it?
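uses_https = 1 alone does not indicate safety: a valid certificate shows that the connection is encrypted, not that the site is trustworthy, and most phishing sites now use HTTPS. But features like a high url_length, has_at_symbol, has_ip_address, and a high num_hyphens would still flag it. This is why a model combines multiple features rather than relying on one.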
Cybersecurity tie-in: Feature engineering is where domain expertise matters most. A generic data scientist might not know that browsers do not treat the text before @ in a URL as the destination, or that IP-based URLs are a red flag. Your security knowledge is the competitive advantage: it tells you which features to extract from raw data.