Step 2: Feature Engineering URLs

Turn raw URLs into numbers a model can learn from

From URLs to Numbers

A model can't read a URL. You need to extract features — turn structural properties into numbers:

| Feature | Why phishing URLs score higher |
| --- | --- |
| url_length | Phishing URLs pad with random characters to obscure the real domain |
| num_dots | Deep subdomains: login.bank.com.phish.evil.com |
| has_at_symbol | user@evil.com/redirect: browser ignores text before @ |
| uses_https | Counter-intuitive: many phishing sites now use HTTPS |
| num_subdomains | Deep nesting: a.b.c.d.evil.com |
| has_ip_address | http://192.168.1.1/login instead of a domain name |
| num_hyphens | secure-login-paypal.com mimics legitimate domains |
| path_length | Long paths with encoded redirect parameters |
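
To make the table concrete, here is a minimal sketch of how these features might be pulled from a raw URL string with Python's standard library. The names extract_url_features and IPV4_HOST are hypothetical, and the subdomain count assumes the last two host labels are the registered domain, which is wrong for suffixes such as .co.uk.

```python
import re
from urllib.parse import urlparse

# Matches a dotted-quad IPv4 host such as 192.168.1.1.
# Illustrative only: it does not validate octet ranges or handle IPv6.
IPV4_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def extract_url_features(url: str) -> dict:
    """Turn one URL string into the numeric features listed in the table."""
    parsed = urlparse(url)
    host = parsed.hostname or ""  # host only: no userinfo, no port
    has_ip = bool(IPV4_HOST.match(host))
    # Crude assumption: the last two labels form the registered domain,
    # so every label before them counts as a subdomain.
    num_subdomains = 0 if has_ip else max(host.count(".") - 1, 0)
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "has_at_symbol": int("@" in url),
        "uses_https": int(parsed.scheme == "https"),
        "num_subdomains": num_subdomains,
        "has_ip_address": int(has_ip),
        "num_hyphens": url.count("-"),
        "path_length": len(parsed.path),
    }


features = extract_url_features(
    "http://login.bank.com.phish.evil.com/secure-login/verify"
)
print(features)
```

Each URL becomes one row of numbers, so applying this function to a list of URLs would yield a table shaped like the synthetic dataset in the next section.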

Key Code Pattern

```python
import numpy as np
import pandas as pd

# Synthetic phishing dataset with realistic distributions
np.random.seed(42)
n = 1000

# Legitimate URLs (label=0): shorter, fewer dots, rarely have @
legit = pd.DataFrame({
    'url_length':     np.random.normal(45, 15, n//2).clip(10),
    'num_dots':       np.random.poisson(2, n//2),
    'has_at_symbol':  np.random.binomial(1, 0.01, n//2),
    'has_ip_address': np.random.binomial(1, 0.02, n//2),
    'num_hyphens':    np.random.poisson(0.5, n//2),
    'label':          0
})

# Phishing URLs (label=1): longer, more dots, more suspicious features
phish = pd.DataFrame({
    'url_length':     np.random.normal(85, 25, n//2).clip(10),
    'num_dots':       np.random.poisson(5, n//2),
    'has_at_symbol':  np.random.binomial(1, 0.15, n//2),
    'has_ip_address': np.random.binomial(1, 0.20, n//2),
    'num_hyphens':    np.random.poisson(3, n//2),
    'label':          1
})

df = pd.concat([legit, phish]).reset_index(drop=True)
print(df.groupby('label').mean())
```

Exploring Feature Distributions

Before modelling, compare each feature's distribution across classes. Features with clear separation between phishing and legitimate will be the most useful to the model.
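
One rough way to put a number on "clear separation" is the gap between class means measured in pooled standard deviations. The sketch below is only one heuristic among several, and it builds its own small synthetic sample (made-up values, three features) so that it runs on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500  # rows per class, an arbitrary choice for illustration

# Small synthetic sample: first n rows legitimate (0), next n rows phishing (1).
df = pd.DataFrame({
    "url_length": np.concatenate(
        [rng.normal(45, 15, n), rng.normal(85, 25, n)]
    ).clip(10),
    "num_dots": np.concatenate([rng.poisson(2, n), rng.poisson(5, n)]),
    "has_at_symbol": np.concatenate(
        [rng.binomial(1, 0.01, n), rng.binomial(1, 0.15, n)]
    ),
    "label": np.repeat([0, 1], n),
})

grouped = df.groupby("label")
means = grouped.mean()
stds = grouped.std()

# Standardized mean difference: class-mean gap in pooled-std units.
mean_gap = means.loc[1] - means.loc[0]
pooled_std = np.sqrt((stds.loc[0] ** 2 + stds.loc[1] ** 2) / 2)
separation = (mean_gap / pooled_std).sort_values(ascending=False)
print(separation)
```

Features near the top of the printed ranking differ most between the two classes. A rare binary flag such as has_at_symbol tends to rank lower on this measure even though it can be a strong signal when it does appear, so a histogram per class is still worth a look.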

Check for class imbalance too: if 95% of your data is legitimate, the model can get 95% accuracy by always predicting "legitimate" — while missing every single phishing attempt.
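
The arithmetic behind that warning can be checked in a few lines. This sketch uses a hypothetical label set of 950 legitimate and 50 phishing URLs and a "model" that always predicts legitimate.

```python
import numpy as np

# Hypothetical imbalanced labels: 950 legitimate (0) and 50 phishing (1).
y_true = np.array([0] * 950 + [1] * 50)

# A "model" that always predicts legitimate.
y_pred = np.zeros_like(y_true)

# Share of each class in the data: index 0 is legitimate, index 1 is phishing.
class_share = np.bincount(y_true) / len(y_true)

accuracy = (y_pred == y_true).mean()
# Recall on the phishing class: share of true phishing URLs that were caught.
phishing_recall = (y_pred[y_true == 1] == 1).mean()

print(f"class share: {class_share}")
print(f"accuracy: {accuracy:.2f}")                # accuracy: 0.95
print(f"phishing recall: {phishing_recall:.2f}")  # phishing recall: 0.00
```

Accuracy looks strong while recall on the phishing class is zero, which is why per-class metrics matter more than overall accuracy on imbalanced data.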

Think Deeper

A phishing URL uses HTTPS and has a valid certificate. Does that make it safe? Which features would still catch it?

No. uses_https = 1 alone does not indicate safety, because many phishing sites now use HTTPS. Features such as a high url_length, has_at_symbol, has_ip_address, and num_hyphens can still flag the URL when it shows those traits. This is why ML uses multiple features, not just one.

Cybersecurity tie-in: Feature engineering is where domain expertise matters most. A generic data scientist might not know that browsers ignore text before @ in URLs, or that IP-based URLs are a red flag. Your security knowledge is the competitive advantage: it tells you which features to extract from raw data.
