# Semantic Search: The Two-Phase Architecture
A semantic search engine has two phases. The first runs once (offline indexing); the second runs every time a user searches (real-time query):
| Phase | When | What happens | Cost |
|---|---|---|---|
| Phase 1: Indexing | Once, offline | Encode all documents into an embedding matrix (N x 384) | Slow (seconds to minutes) |
| Phase 2: Query | Every search | Encode query, compute cosine similarity, return top-k | Fast (milliseconds) |
Phase 2 is fast because all document embeddings are pre-computed. Only the single query needs encoding at search time.
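To see why, here is a minimal sketch of the Phase 2 math with random vectors standing in for real embeddings (no model needed): once the document matrix is normalized, scoring a query against every document is a single matrix-vector product.

```python
import numpy as np

# Random stand-ins for a real (N, 384) embedding matrix and a query vector.
rng = np.random.default_rng(0)
doc_embeddings = rng.standard_normal((1000, 384))  # Phase 1 output (pre-computed)
query_vec = rng.standard_normal(384)               # encoded at search time

# Cosine similarity = dot product of L2-normalized vectors.
docs_norm = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
query_norm = query_vec / np.linalg.norm(query_vec)
scores = docs_norm @ query_norm                    # shape: (1000,) -- one score per document

top_k = np.argsort(scores)[::-1][:3]               # indices of the 3 best documents
```

Even with 1,000 documents, the scoring step is one `(1000, 384) @ (384,)` multiply, which is why query latency stays in the millisecond range.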
## Building the Index

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Your security knowledge base
documents = [
    "Mimikatz extracts credentials from LSASS process memory",
    "SSH brute force: repeated failed login attempts on port 22",
    "DNS tunnelling encodes data in DNS query subdomains",
    "Ransomware encrypts files and demands cryptocurrency payment",
    "Phishing emails impersonate trusted senders to steal credentials",
]

# Phase 1: encode all documents (done once)
doc_embeddings = model.encode(documents)
# shape: (5, 384)
```
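Because Phase 1 is the slow step, a real index is usually persisted to disk so it is built only once. A minimal sketch, using NumPy's `.npy` format and a random stand-in for the real embedding matrix:

```python
import numpy as np

# Stand-in for the output of model.encode(documents).
doc_embeddings = np.random.rand(5, 384).astype(np.float32)

np.save("doc_embeddings.npy", doc_embeddings)  # persist the index after Phase 1
loaded = np.load("doc_embeddings.npy")         # reload at service start-up
```

At query time the service loads the saved matrix instead of re-encoding every document.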
## Querying the Index

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def search(query, doc_embeddings, documents, k=3):
    """Print the top-k most relevant documents for a query."""
    query_vec = model.encode([query])  # (1, 384)
    scores = cosine_similarity(query_vec, doc_embeddings)[0]
    top_k = np.argsort(scores)[::-1][:k]
    for rank, idx in enumerate(top_k, 1):
        print(f"{rank}. [{scores[idx]:.3f}] {documents[idx]}")

search("how to detect credential dumping", doc_embeddings, documents)
# 1. [0.71] Mimikatz extracts credentials from LSASS process memory
# 2. [0.48] Phishing emails impersonate trusted senders to steal credentials
# 3. [0.31] SSH brute force: repeated failed login attempts on port 22
```
The search function found the Mimikatz document even though the query never mentioned "Mimikatz" or "LSASS". This is the power of semantic matching.
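For contrast, a naive keyword-overlap baseline (a hypothetical scorer, not part of the pipeline above) fails completely on this query: the top documents share no literal words with it, so every keyword score is zero.

```python
documents = [
    "Mimikatz extracts credentials from LSASS process memory",
    "SSH brute force: repeated failed login attempts on port 22",
]
query = "how to detect credential dumping"

def keyword_score(query, doc):
    """Fraction of query words that literally appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

scores = [keyword_score(query, d) for d in documents]
# Both documents score 0.0 -- even "credential" vs "credentials" fails
# to match literally, while the embedding model ranks Mimikatz first.
```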
## Why This Matters: The Retrieval Half of RAG
What you just built is the retrieval component of a RAG (Retrieval-Augmented Generation) pipeline. In the next lesson on RAG, you will combine this retrieval mechanism with an LLM to generate grounded answers:
| Component | What it does | You built it in |
|---|---|---|
| Sentence embeddings | Encode documents and queries as vectors | Step 2 (this lesson) |
| Semantic search | Find the most relevant documents for a query | Step 3 (this lesson) |
| Chunking | Split long documents into embeddable pieces | Lesson 4.4, Step 1 |
| LLM generation | Answer the question using retrieved context | Lesson 4.4, Step 3 |
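As a preview of how the pieces connect, here is a hypothetical sketch of the step that joins retrieval to generation: the retrieved documents are formatted into a prompt for the LLM. The `retrieved_docs` list stands in for the output of the `search` function above; the LLM call itself belongs to the next lesson.

```python
# Stand-in for the documents returned by search() for this query.
retrieved_docs = [
    "Mimikatz extracts credentials from LSASS process memory",
    "Phishing emails impersonate trusted senders to steal credentials",
]
question = "how to detect credential dumping"

# Assemble retrieved context into a grounded prompt for the LLM.
context = "\n".join(f"- {doc}" for doc in retrieved_docs)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
# `prompt` is then sent to an LLM, which generates a grounded answer.
```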
## Think Deeper
Your knowledge base has 50 security advisories. A SOC analyst queries "how to detect credential dumping". The top result is about Mimikatz and LSASS. But the second result is about password spraying. Is this a retrieval failure? Why or why not?