From Words to Vectors: Sentence Embeddings
In Lesson 4.1 you saw word embeddings, which assign one vector per token. Sentence embeddings go further: they encode an entire sentence as a single dense vector that captures its overall meaning.
| Sentence | Vector (384 dims) | Note |
|---|---|---|
| "The system was compromised via phishing" | [0.23, -0.45, 0.87, ...] | |
| "A spear-phishing email led to the breach" | [0.21, -0.42, 0.89, ...] | Similar meaning, similar vector |
| "Pizza delivery takes 30 minutes" | [-0.55, 0.31, -0.12, ...] | Different meaning, distant vector |
Sentences with similar meanings produce similar vectors, regardless of the specific words used. This is why sentence embeddings are the foundation of modern RAG systems.
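How do many token vectors become one sentence vector? One common approach is mean pooling: average the token vectors (this is, in fact, how all-MiniLM-L6-v2 pools its token outputs). A toy sketch with made-up 4-dimensional word vectors:

```python
import numpy as np

# Toy 4-dimensional "word embeddings" -- made-up values for illustration only
word_vecs = {
    "system":      np.array([0.9, 0.1, -0.3, 0.2]),
    "compromised": np.array([0.7, -0.6, 0.1, 0.4]),
    "via":         np.array([0.0, 0.1, 0.0, 0.1]),
    "phishing":    np.array([0.8, -0.5, 0.2, 0.3]),
}

def sentence_embedding(tokens: list[str]) -> np.ndarray:
    """Mean-pool the token vectors into a single sentence vector."""
    return np.mean([word_vecs[t] for t in tokens], axis=0)

vec = sentence_embedding(["system", "compromised", "via", "phishing"])
print(vec.shape)  # (4,) -- one vector, no matter how many tokens
```

Real models learn far richer pooled representations than a plain average of static word vectors, but the shape of the operation is the same: many token vectors in, one sentence vector out.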
Why Sentence Embeddings Beat Keywords
Keyword search matches exact strings. Sentence embeddings match meaning:
| Query | Keyword match? | Embedding match? |
|---|---|---|
| "How to detect credential theft" | Misses documents about "password dumping" | Finds documents about password dumping, Mimikatz, LSASS |
| "Lateral movement techniques" | Misses documents about "pivoting between hosts" | Finds documents about pivoting, pass-the-hash, RDP hopping |
| "Data exfiltration via DNS" | Requires exact phrase | Finds documents about DNS tunnelling, covert channels |
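The failure mode in the left column is easy to reproduce. A minimal sketch of naive keyword search (the documents and query below are illustrative examples, not from a real corpus):

```python
# Tiny corpus -- the first two docs are relevant to credential theft,
# but neither contains the literal words "credential theft"
docs = [
    "Attackers used password dumping tools to harvest credentials",
    "Mimikatz extracted secrets from LSASS memory",
    "Quarterly budget report submitted on time",
]

def keyword_search(query: str, docs: list[str]) -> list[str]:
    """Return documents containing every query word verbatim."""
    words = query.lower().split()
    return [d for d in docs if all(w in d.lower() for w in words)]

# No document contains the word "theft", so keyword search
# returns nothing -- even though the first two documents are
# exactly what the analyst is looking for
print(keyword_search("credential theft", docs))  # []
```

An embedding model, by contrast, would place "credential theft", "password dumping", and "Mimikatz" close together in vector space and retrieve the first two documents.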
The SentenceTransformer Model
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load model (80MB, excellent quality/speed trade-off;
# downloads automatically on first use)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode sentences into vectors
sentences = [
    "Brute force attack detected on SSH port",
    "Multiple failed login attempts on SSH service",
    "Quarterly budget report submitted",
]
embeddings = model.encode(sentences)  # shape: (3, 384)

# Compare every pair of sentences
sims = cosine_similarity(embeddings)
print(f"SSH sentences: {sims[0][1]:.3f}")  # ~0.85 (similar)
print(f"SSH vs budget: {sims[0][2]:.3f}")  # ~0.05 (unrelated)
```
Cosine Similarity Refresher
Cosine similarity measures the angle between two vectors, ignoring magnitude:
| Score | Meaning | Example |
|---|---|---|
| 1.0 | Identical meaning | Same sentence, rephrased |
| 0.7 - 0.9 | Closely related | "SSH brute force" vs "failed SSH logins" |
| 0.3 - 0.6 | Loosely related | "SSH brute force" vs "network attack" |
| 0.0 - 0.2 | Unrelated | "SSH brute force" vs "pizza delivery" |
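The formula behind these scores is cos(θ) = (a · b) / (|a| |b|). A hand-rolled sketch using the truncated 3-dimensional vectors from the first table (real embeddings have 384 dimensions, but the math is identical), assuming NumPy:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (|a| * |b|) -- direction only, magnitude cancels."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.23, -0.45, 0.87])   # "system compromised via phishing"
b = np.array([0.21, -0.42, 0.89])   # similar direction (paraphrase)
c = np.array([-0.55, 0.31, -0.12])  # different direction (pizza)

print(f"{cosine(a, b):.3f}")       # 0.999 -- nearly identical direction
print(f"{cosine(a, c):.3f}")       # negative -- pointing away
print(f"{cosine(a, 10 * b):.3f}")  # 0.999 -- scaling b changes nothing
```

The last line is the "ignoring magnitude" property in action: multiplying a vector by 10 leaves its direction, and therefore its cosine similarity, unchanged.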
In the next step, you will use cosine similarity to build a full semantic search engine.
Think Deeper
Encode these two sentences: "User logged in from VPN" and "Employee accessed system remotely via VPN". What is their cosine similarity? Now encode "Pizza delivery at noon". How do its similarities to the first two compare?