From Words to Vectors: Sentence Embeddings
In Lesson 4.1 you saw word embeddings, which assign one vector per token. Sentence embeddings go further: they encode an entire sentence as a single dense vector that captures its overall meaning.
| Sentence | Vector (384 dims) | Note |
|---|---|---|
| "The system was compromised via phishing" | [0.23, -0.45, 0.87, ...] | |
| "A spear-phishing email led to the breach" | [0.21, -0.42, 0.89, ...] | Similar meaning, similar vector |
| "Pizza delivery takes 30 minutes" | [-0.55, 0.31, -0.12, ...] | Different meaning, distant vector |
Sentences with similar meanings produce similar vectors, regardless of the specific words used. This is why sentence embeddings are the foundation of modern RAG systems.
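How do many token vectors become one sentence vector? One common approach is mean pooling: average the token vectors (this is, in fact, how all-MiniLM-L6-v2 pools its token outputs). A toy sketch with made-up 4-dimensional word vectors:

```python
import numpy as np

# Toy 4-dimensional "word embeddings" -- made-up values for illustration only
word_vecs = {
    "system":      np.array([0.9, 0.1, -0.3, 0.2]),
    "compromised": np.array([0.7, -0.6, 0.1, 0.4]),
    "via":         np.array([0.0, 0.1, 0.0, 0.1]),
    "phishing":    np.array([0.8, -0.5, 0.2, 0.3]),
}

def sentence_embedding(tokens: list[str]) -> np.ndarray:
    """Mean-pool the token vectors into a single sentence vector."""
    return np.mean([word_vecs[t] for t in tokens], axis=0)

vec = sentence_embedding(["system", "compromised", "via", "phishing"])
print(vec.shape)  # (4,) -- one vector, no matter how many tokens
```

Real models learn far richer pooled representations than a plain average of static word vectors, but the shape of the operation is the same: many token vectors in, one sentence vector out.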
Why Sentence Embeddings Beat Keywords
Keyword search matches exact strings. Sentence embeddings match meaning:
| Query | Keyword match? | Embedding match? |
|---|---|---|
| "How to detect credential theft" | Misses documents about "password dumping" | Finds documents about password dumping, Mimikatz, LSASS |
| "Lateral movement techniques" | Misses documents about "pivoting between hosts" | Finds documents about pivoting, pass-the-hash, RDP hopping |
| "Data exfiltration via DNS" | Requires exact phrase | Finds documents about DNS tunnelling, covert channels |
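The failure mode in the left column is easy to reproduce. A minimal sketch of naive keyword search (the documents and query below are illustrative examples, not from a real corpus):

```python
# Tiny corpus -- the first two docs are relevant to credential theft,
# but neither contains the literal words "credential theft"
docs = [
    "Attackers used password dumping tools to harvest credentials",
    "Mimikatz extracted secrets from LSASS memory",
    "Quarterly budget report submitted on time",
]

def keyword_search(query: str, docs: list[str]) -> list[str]:
    """Return documents containing every query word verbatim."""
    words = query.lower().split()
    return [d for d in docs if all(w in d.lower() for w in words)]

# No document contains the word "theft", so keyword search
# returns nothing -- even though the first two documents are
# exactly what the analyst is looking for
print(keyword_search("credential theft", docs))  # []
```

An embedding model, by contrast, would place "credential theft", "password dumping", and "Mimikatz" close together in vector space and retrieve the first two documents.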
The SentenceTransformer Model
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load model (80MB, excellent quality/speed trade-off;
# downloads automatically on first use)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode sentences into vectors
sentences = [
    "Brute force attack detected on SSH port",
    "Multiple failed login attempts on SSH service",
    "Quarterly budget report submitted",
]
embeddings = model.encode(sentences)  # shape: (3, 384)

# Compare every pair of sentences
sims = cosine_similarity(embeddings)
print(f"SSH sentences: {sims[0][1]:.3f}")  # ~0.85 (similar)
print(f"SSH vs budget: {sims[0][2]:.3f}")  # ~0.05 (unrelated)
```
Cosine Similarity Refresher
Cosine similarity measures the angle between two vectors, ignoring magnitude:
| Score | Meaning | Example |
|---|---|---|
| 1.0 | Identical meaning | Same sentence, rephrased |
| 0.7 - 0.9 | Closely related | "SSH brute force" vs "failed SSH logins" |
| 0.3 - 0.6 | Loosely related | "SSH brute force" vs "network attack" |
| 0.0 - 0.2 | Unrelated | "SSH brute force" vs "pizza delivery" |
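The formula behind these scores is cos(θ) = (a · b) / (|a| |b|). A hand-rolled sketch using the truncated 3-dimensional vectors from the first table (real embeddings have 384 dimensions, but the math is identical), assuming NumPy:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (|a| * |b|) -- direction only, magnitude cancels."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.23, -0.45, 0.87])   # "system compromised via phishing"
b = np.array([0.21, -0.42, 0.89])   # similar direction (paraphrase)
c = np.array([-0.55, 0.31, -0.12])  # different direction (pizza)

print(f"{cosine(a, b):.3f}")       # 0.999 -- nearly identical direction
print(f"{cosine(a, c):.3f}")       # negative -- pointing away
print(f"{cosine(a, 10 * b):.3f}")  # 0.999 -- scaling b changes nothing
```

The last line is the "ignoring magnitude" property in action: multiplying a vector by 10 leaves its direction, and therefore its cosine similarity, unchanged.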
In the next step, you will use cosine similarity to build a full semantic search engine.
Think Deeper
Encode these two sentences: "User logged in from VPN" and "Employee accessed system remotely via VPN". What is their cosine similarity? Now encode "Pizza delivery at noon". How do its similarities to the first two compare?