# From Chunks to a Vector Index
After chunking, you encode each chunk into a vector using a sentence embedding model. All chunk vectors together form a vector index -- an embedding matrix you search against at query time:
| Step | Input | Output | When |
|---|---|---|---|
| 1. Chunk | Raw documents | List of N text chunks | Once (offline) |
| 2. Encode | N text chunks | Embedding matrix (N x 384) | Once (offline) |
| 3. Query | User question | Query vector (1 x 384) | Every search |
| 4. Rank | Query vector vs embedding matrix | Cosine similarity scores (N,) | Every search |
| 5. Return | Top-k scores | k most relevant chunks | Every search |
Steps 1-2 are done once. Steps 3-5 run in milliseconds because the expensive encoding is already done.
## Building the Vector Index

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Assume chunks is the list of strings produced by step 1
chunks = [
    "Mimikatz can extract plaintext passwords from LSASS process memory...",
    "To detect LSASS dumping, monitor for process access events (Sysmon ID 10)...",
    "SSH brute force attacks generate many failed authentication events...",
    "DNS tunnelling encodes exfiltrated data in DNS query subdomains...",
    "Ransomware typically encrypts files using AES-256 and stores the key...",
]

# Encode all chunks (done once, offline)
chunk_embeddings = model.encode(chunks)
print(f"Index shape: {chunk_embeddings.shape}")  # (5, 384)
```
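One build-time optimisation worth knowing: if you L2-normalise the embeddings once when building the index, cosine similarity reduces to a plain dot product, so query-time ranking becomes a single matrix multiply. (The `sentence-transformers` `encode` method can do this for you via `normalize_embeddings=True`.) A minimal sketch of the equivalence, using random vectors as stand-ins for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
index = rng.normal(size=(5, 384))   # stand-in for chunk_embeddings
query = rng.normal(size=(1, 384))   # stand-in for a query vector

# L2-normalise rows to unit length once, at build time
index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query, axis=1, keepdims=True)

# Cosine similarity computed directly from its definition
cos = (query @ index.T)[0] / (
    np.linalg.norm(query) * np.linalg.norm(index, axis=1)
)

# With unit vectors, a plain dot product yields identical scores
dot = (query_n @ index_n.T)[0]
assert np.allclose(cos, dot)
```

This matters once your index grows: the dot product form maps straight onto optimised BLAS routines and onto approximate-nearest-neighbour libraries that expect normalised vectors.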
## Retrieving Relevant Chunks

```python
def retrieve(query, chunk_embeddings, chunks, k=3):
    """Retrieve the top-k most relevant chunks for a query."""
    query_vec = model.encode([query])
    scores = cosine_similarity(query_vec, chunk_embeddings)[0]
    top_k_indices = np.argsort(scores)[::-1][:k]
    results = []
    for idx in top_k_indices:
        results.append({
            "chunk": chunks[idx],
            "score": float(scores[idx]),
            "index": int(idx),
        })
    return results

# Test it
results = retrieve("how to detect credential dumping", chunk_embeddings, chunks)
for r in results:
    print(f"[{r['score']:.3f}] {r['chunk'][:80]}...")
```
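A note on scale: `np.argsort` sorts all N scores, which is O(N log N) and fine for a handful of chunks, but for large indexes `np.argpartition` finds the top-k candidates in O(N) and only those k need sorting. A sketch on synthetic scores, assuming the same descending top-k selection used above:

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.random(100_000)  # synthetic similarity scores
k = 3

# Full sort: O(N log N)
top_sorted = np.argsort(scores)[::-1][:k]

# Partition: O(N) to isolate the k largest (unordered), then sort only those k
candidates = np.argpartition(scores, -k)[-k:]
top_partitioned = candidates[np.argsort(scores[candidates])[::-1]]

assert np.array_equal(top_sorted, top_partitioned)
```

For indexes beyond a few hundred thousand vectors, even this linear scan becomes the bottleneck, which is the point where approximate-nearest-neighbour libraries replace exact brute-force search.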
## Evaluating Retrieval Quality
Before connecting retrieval to an LLM, verify that the right chunks come back for your expected queries:
| Test query | Expected top chunk | Pass? |
|---|---|---|
| "how to detect credential dumping" | LSASS dumping detection (Sysmon ID 10) | Check top-1 |
| "what is DNS tunnelling" | DNS tunnelling encodes data in subdomains | Check top-1 |
| "ransomware encryption method" | Ransomware typically encrypts with AES-256 | Check top-1 |
If the correct chunk does not appear in the top-3 results, the problem is in your chunking strategy or embedding model -- not in the LLM. Fix retrieval before adding generation.
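That spot check is easy to automate so it can run after every change to chunking or embedding. A minimal harness sketch: it assumes a `retrieve`-style callable whose results carry an `"index"` key, as in the function above; here it is exercised with a stubbed retriever so the example is self-contained:

```python
def evaluate_retrieval(test_cases, retrieve_fn, k=3):
    """Check that each expected chunk index appears in the top-k results.

    test_cases: list of (query, expected_chunk_index) pairs.
    retrieve_fn: callable returning ranked result dicts with an "index" key.
    Returns the list of failing cases (empty means retrieval is healthy).
    """
    failures = []
    for query, expected_idx in test_cases:
        top_indices = [r["index"] for r in retrieve_fn(query, k=k)]
        if expected_idx not in top_indices:
            failures.append((query, expected_idx, top_indices))
    return failures

# Stub standing in for the real embedding-based retrieve()
def fake_retrieve(query, k=3):
    ranking = {"how to detect credential dumping": [1, 0, 2]}
    return [{"index": i} for i in ranking.get(query, [0, 1, 2])[:k]]

failures = evaluate_retrieval(
    [("how to detect credential dumping", 1)], fake_retrieve
)
print(f"{len(failures)} failing queries")
```

Swapping `fake_retrieve` for the real `retrieve` (with `chunk_embeddings` and `chunks` bound in) turns this into a regression test for the whole retrieval layer.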
## Think Deeper

Your security knowledge base has a chunk about "SSH brute force detection" and another about "SSH key rotation best practices". A user queries "how to secure SSH". Which chunk ranks higher? Is this the right behaviour for an incident responder vs a compliance auditor?