Step 3: Semantic Search

Build a search engine over a security knowledge base


Semantic Search: The Two-Phase Architecture

A semantic search engine has two phases. The first runs once (offline indexing); the second runs every time a user searches (real-time query):

| Phase | When | What happens | Cost |
|---|---|---|---|
| Phase 1: Indexing | Once, offline | Encode all documents into an embedding matrix (N × 384) | Slow (seconds to minutes) |
| Phase 2: Query | Every search | Encode query, compute cosine similarity, return top-k | Fast (milliseconds) |

Phase 2 is fast because all document embeddings are pre-computed. Only the single query needs encoding at search time.

Building the Index

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Your security knowledge base
documents = [
    "Mimikatz extracts credentials from LSASS process memory",
    "SSH brute force: repeated failed login attempts on port 22",
    "DNS tunnelling encodes data in DNS query subdomains",
    "Ransomware encrypts files and demands cryptocurrency payment",
    "Phishing emails impersonate trusted senders to steal credentials",
]

# Phase 1: encode all documents (done once)
doc_embeddings = model.encode(documents)
# shape: (5, 384)
```
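Because Phase 1 is the slow step, the embedding matrix is worth persisting so a restarted process can load it instead of re-encoding every document. A minimal NumPy-only sketch; the 3-dimensional vectors and the `doc_embeddings.npy` filename are placeholders for illustration, standing in for the real (5, 384) matrix:

```python
import numpy as np

# Toy stand-in for model.encode(documents) output.
# Real shape would be (5, 384); these tiny vectors are hypothetical.
doc_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
])

# Persist the Phase 1 index to disk (done once, offline).
np.save("doc_embeddings.npy", doc_embeddings)

# Later (e.g. at service startup), load instead of re-encoding:
loaded = np.load("doc_embeddings.npy")
assert loaded.shape == doc_embeddings.shape
```

For larger corpora you would typically move from a flat `.npy` file to a vector index (e.g. an approximate nearest-neighbour library), but the principle is the same: encode once, reuse on every query.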

Querying the Index

```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search(query, doc_embeddings, documents, k=3):
    """Print the top-k most relevant documents for a query."""
    query_vec = model.encode([query])             # (1, 384)
    scores = cosine_similarity(query_vec, doc_embeddings)[0]
    top_k = np.argsort(scores)[::-1][:k]

    for rank, idx in enumerate(top_k, 1):
        print(f"{rank}. [{scores[idx]:.3f}] {documents[idx]}")

search("how to detect credential dumping", doc_embeddings, documents)
# 1. [0.71] Mimikatz extracts credentials from LSASS process memory
# 2. [0.48] Phishing emails impersonate trusted senders to steal credentials
# 3. [0.31] SSH brute force: repeated failed login attempts on port 22
```
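Under the hood, `cosine_similarity` is just normalization plus a dot product. If you normalize the document matrix once at index time, the entire query phase reduces to a single matrix-vector product, which is why it runs in milliseconds. A NumPy-only sketch with hypothetical 4-dimensional embeddings standing in for the real (N, 384) matrix:

```python
import numpy as np

# Hypothetical unit-scale embeddings (real ones would be 384-dim).
docs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
query = np.array([1.0, 0.0, 0.0, 0.0])

# Normalize once at index time (Phase 1)...
docs_norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)

# ...then cosine similarity at query time (Phase 2) is one mat-vec product.
q_norm = query / np.linalg.norm(query)
scores = docs_norm @ q_norm
# scores: 1.0 for doc 0, 0.6 for doc 1, 0.0 for doc 2

top_k = np.argsort(scores)[::-1][:2]
# top_k: doc 0 first, then doc 1
```

This is the same computation the `search` function performs via scikit-learn, just written out by hand to show where the speed comes from.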

The search function found the Mimikatz document even though the query never mentioned "Mimikatz" or "LSASS". This is the power of semantic matching.

Why This Matters: The Retrieval Half of RAG

What you just built is the retrieval component of a RAG (Retrieval-Augmented Generation) pipeline. In the next lesson on RAG, you will combine this retrieval mechanism with an LLM to generate grounded answers:

| Component | What it does | You built it in |
|---|---|---|
| Sentence embeddings | Encode documents and queries as vectors | Step 2 (this lesson) |
| Semantic search | Find the most relevant documents for a query | Step 3 (this step) |
| Chunking | Split long documents into embeddable pieces | Lesson 4.4, Step 1 |
| LLM generation | Answer the question using retrieved context | Lesson 4.4, Step 3 |

Think Deeper

Your knowledge base has 50 security advisories. A SOC analyst queries 'how to detect credential dumping'. The top result is about Mimikatz and LSASS. But the second result is about password spraying. Is this a retrieval failure? Why or why not?

Not a failure -- it is a semantic neighbourhood effect. Password spraying and credential dumping are both credential-theft techniques, so their embeddings are close in vector space. In a security context, this is actually useful: related techniques surface together. However, for precise retrieval you might need to re-rank results or use hybrid search (semantic + keyword) to separate closely related but distinct techniques.
Cybersecurity tie-in: A semantic search engine over your organisation's threat intelligence lets analysts query "how do we respond to credential dumping?" and get relevant runbooks, past incident reports, and CVE advisories -- even if those documents never use the exact phrase "credential dumping". This replaces brittle keyword search with meaning-based retrieval.
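The hybrid-search idea mentioned above can be sketched as a weighted blend of the semantic score and a simple keyword-overlap score. Everything here is illustrative: the cosine values, the `alpha` weight, and the `keyword_score` helper are assumptions for the sketch, not a standard recipe:

```python
import numpy as np

documents = [
    "Mimikatz extracts credentials from LSASS process memory",
    "Password spraying tries common passwords across many accounts",
]
# Assumed cosine similarities from the semantic phase (illustrative values).
semantic_scores = np.array([0.71, 0.55])

def keyword_score(query, doc):
    """Fraction of query terms that appear verbatim in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms)

query = "credential dumping from LSASS memory"
kw = np.array([keyword_score(query, d) for d in documents])

# Weighted blend; alpha is a tunable assumption, not a standard value.
alpha = 0.7
hybrid = alpha * semantic_scores + (1 - alpha) * kw
```

Here the exact-match terms ("from", "LSASS", "memory") boost the Mimikatz document, widening the gap between it and the semantically related but distinct password-spraying document. Production systems usually replace the toy `keyword_score` with BM25 and may add a cross-encoder re-ranker on the top results.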
