# Semantic Search: The Two-Phase Architecture
A semantic search engine has two phases. The first runs once (offline indexing); the second runs every time a user searches (real-time query):
| Phase | When | What happens | Cost |
|---|---|---|---|
| Phase 1: Indexing | Once, offline | Encode all documents into an embedding matrix (N x 384) | Slow (seconds to minutes) |
| Phase 2: Query | Every search | Encode query, compute cosine similarity, return top-k | Fast (milliseconds) |
Phase 2 is fast because all document embeddings are pre-computed. Only the single query needs encoding at search time.
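To see why, here is a minimal sketch of the Phase 2 math with random vectors standing in for real embeddings (no model needed): once the document matrix is normalized, scoring a query against every document is a single matrix-vector product.

```python
import numpy as np

# Random stand-ins for a real (N, 384) embedding matrix and a query vector.
rng = np.random.default_rng(0)
doc_embeddings = rng.standard_normal((1000, 384))  # Phase 1 output (pre-computed)
query_vec = rng.standard_normal(384)               # encoded at search time

# Cosine similarity = dot product of L2-normalized vectors.
docs_norm = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
query_norm = query_vec / np.linalg.norm(query_vec)
scores = docs_norm @ query_norm                    # shape: (1000,) -- one score per document

top_k = np.argsort(scores)[::-1][:3]               # indices of the 3 best documents
```

Even with 1,000 documents, the scoring step is one `(1000, 384) @ (384,)` multiply, which is why query latency stays in the millisecond range.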
## Building the Index

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Your security knowledge base
documents = [
    "Mimikatz extracts credentials from LSASS process memory",
    "SSH brute force: repeated failed login attempts on port 22",
    "DNS tunnelling encodes data in DNS query subdomains",
    "Ransomware encrypts files and demands cryptocurrency payment",
    "Phishing emails impersonate trusted senders to steal credentials",
]

# Phase 1: encode all documents (done once)
doc_embeddings = model.encode(documents)
# shape: (5, 384)
```
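Because Phase 1 is the slow step, a real index is usually persisted to disk so it is built only once. A minimal sketch, using NumPy's `.npy` format and a random stand-in for the real embedding matrix:

```python
import numpy as np

# Stand-in for the output of model.encode(documents).
doc_embeddings = np.random.rand(5, 384).astype(np.float32)

np.save("doc_embeddings.npy", doc_embeddings)  # persist the index after Phase 1
loaded = np.load("doc_embeddings.npy")         # reload at service start-up
```

At query time the service loads the saved matrix instead of re-encoding every document.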
## Querying the Index

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def search(query, doc_embeddings, documents, k=3):
    """Print the top-k most relevant documents for a query."""
    query_vec = model.encode([query])  # (1, 384)
    scores = cosine_similarity(query_vec, doc_embeddings)[0]
    top_k = np.argsort(scores)[::-1][:k]
    for rank, idx in enumerate(top_k, 1):
        print(f"{rank}. [{scores[idx]:.3f}] {documents[idx]}")

search("how to detect credential dumping", doc_embeddings, documents)
# 1. [0.71] Mimikatz extracts credentials from LSASS process memory
# 2. [0.48] Phishing emails impersonate trusted senders to steal credentials
# 3. [0.31] SSH brute force: repeated failed login attempts on port 22
```
The search function found the Mimikatz document even though the query never mentioned "Mimikatz" or "LSASS". This is the power of semantic matching.
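For contrast, a naive keyword-overlap baseline (a hypothetical scorer, not part of the pipeline above) fails completely on this query: the top documents share no literal words with it, so every keyword score is zero.

```python
documents = [
    "Mimikatz extracts credentials from LSASS process memory",
    "SSH brute force: repeated failed login attempts on port 22",
]
query = "how to detect credential dumping"

def keyword_score(query, doc):
    """Fraction of query words that literally appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

scores = [keyword_score(query, d) for d in documents]
# Both documents score 0.0 -- even "credential" vs "credentials" fails
# to match literally, while the embedding model ranks Mimikatz first.
```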
## Why This Matters: The Retrieval Half of RAG
What you just built is the retrieval component of a RAG (Retrieval-Augmented Generation) pipeline. In the next lesson on RAG, you will combine this retrieval mechanism with an LLM to generate grounded answers:
| Component | What it does | You built it in |
|---|---|---|
| Sentence embeddings | Encode documents and queries as vectors | Step 2 (this lesson) |
| Semantic search | Find the most relevant documents for a query | Step 3 (this lesson) |
| Chunking | Split long documents into embeddable pieces | Lesson 4.4, Step 1 |
| LLM generation | Answer the question using retrieved context | Lesson 4.4, Step 3 |
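As a preview of how the pieces connect, here is a hypothetical sketch of the step that joins retrieval to generation: the retrieved documents are formatted into a prompt for the LLM. The `retrieved_docs` list stands in for the output of the `search` function above; the LLM call itself belongs to the next lesson.

```python
# Stand-in for the documents returned by search() for this query.
retrieved_docs = [
    "Mimikatz extracts credentials from LSASS process memory",
    "Phishing emails impersonate trusted senders to steal credentials",
]
question = "how to detect credential dumping"

# Assemble retrieved context into a grounded prompt for the LLM.
context = "\n".join(f"- {doc}" for doc in retrieved_docs)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
# `prompt` is then sent to an LLM, which generates a grounded answer.
```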
## Think Deeper
Your knowledge base has 50 security advisories. A SOC analyst queries "how to detect credential dumping". The top result is about Mimikatz and LSASS. But the second result is about password spraying. Is this a retrieval failure? Why or why not?