# The Full RAG Pipeline
RAG combines three components you have already built individually. Now they connect end-to-end:
| Stage | Component | What happens |
|---|---|---|
| R - Retrieval | Semantic search (Lesson 4.2 + Step 3) | Encode query, find top-k chunks from vector index |
| A - Augmentation | Prompt construction | Inject retrieved chunks into the LLM prompt as context |
| G - Generation | LLM API call (Lesson 4.3) | Model reads the context and generates a grounded answer |
The key insight: the LLM is not memorising your security documents -- it reads them at inference time, injected as context. This means the knowledge base can be updated without retraining.
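The pipeline code below calls a `retrieve()` helper built in Step 3. If yours differs, a minimal cosine-similarity retriever looks something like this sketch; note it takes a precomputed query embedding rather than raw text (the function name and the `score` key are illustrative, not from the lesson):

```python
import numpy as np

def retrieve_by_embedding(query_emb, chunk_embeddings, chunks, k=3):
    """Return the k chunks most similar to the query embedding."""
    # Normalise so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    m = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    scores = m @ q                       # one similarity score per chunk
    top = np.argsort(scores)[::-1][:k]   # indices of the best matches
    return [{"chunk": chunks[i], "score": float(scores[i])} for i in top]
```

The `{"chunk": ..., "score": ...}` result shape mirrors what the pipeline below expects when it reads `r["chunk"]`.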
## The Augmented Prompt
```python
def build_rag_prompt(question, retrieved_chunks):
    """Build a prompt that grounds the LLM in retrieved context."""
    context = "\n\n---\n\n".join(retrieved_chunks)

    system = """You are a security analyst assistant.
Answer the question based ONLY on the provided context.
If the context does not contain enough information, say:
'The provided context does not contain information about this.'
Do not use your general knowledge. Cite specific details from the context."""

    user_message = f"""Context:
{context}

Question: {question}"""

    return system, user_message
```
The instruction "based ONLY on the provided context" is the most important line. Without it, the model blends its pre-training knowledge with your documents, making it impossible to verify where the answer came from.
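To see exactly what the model receives, you can assemble the user message by hand with toy chunks (the chunk text here is made up for illustration):

```python
# Stand-ins for real retrieved chunks
chunks = [
    "Mimikatz accesses LSASS memory to extract credentials.",
    "Sysmon Event ID 10 logs process access to lsass.exe.",
]

# Same assembly as build_rag_prompt: chunks joined by a separator,
# then the question appended at the end
context = "\n\n---\n\n".join(chunks)
user_message = f"Context:\n{context}\n\nQuestion: How do I detect Mimikatz?"
print(user_message)
```

The `---` separators make chunk boundaries visible to the model, which helps it cite individual sources rather than blending them.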
## Putting It All Together
```python
from llm_client import get_client

provider, client = get_client()

def rag_query(question, chunk_embeddings, chunks, k=3):
    """Full RAG pipeline: retrieve, augment, generate."""
    # 1. RETRIEVE
    results = retrieve(question, chunk_embeddings, chunks, k=k)
    retrieved_texts = [r["chunk"] for r in results]

    # 2. AUGMENT
    system, user_msg = build_rag_prompt(question, retrieved_texts)

    # 3. GENERATE
    answer = client.chat(
        system=system,
        messages=[{"role": "user", "content": user_msg}],
        max_tokens=400,
    )

    return {
        "answer": answer,
        "sources": results,  # for attribution
    }

# Ask a question
result = rag_query("How do I detect Mimikatz?", chunk_embeddings, chunks)
print(result["answer"])
print(f"\nBased on {len(result['sources'])} retrieved chunks")
```
## RAG vs Pure LLM: Why Grounding Matters
| Property | Pure LLM | RAG |
|---|---|---|
| Knowledge source | Pre-training data (stale) | Your documents (up-to-date) |
| Knowledge cutoff | Fixed at training date | Updated when you add documents |
| Internal policies | Unknown to the model | Indexed and searchable |
| Hallucination risk | High -- model invents plausible answers | Lower -- answers are grounded in retrieved context, though not eliminated |
| Attribution | None -- cannot cite sources | Yes -- you know which chunks were used |
| Cost per query | Lower (no retrieval step) | Slightly higher (embedding + retrieval + generation) |
For security work, the attribution and accuracy benefits of RAG far outweigh the marginal cost increase. When an analyst asks "what is our response procedure for ransomware?", the answer must come from your runbook -- not the model's general knowledge.
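Attribution is the payoff of returning `sources` from `rag_query`. A quick way to surface it to the analyst (the `result` dict below is a toy example standing in for real pipeline output):

```python
# Toy pipeline output: an answer plus the chunks that supported it
result = {
    "answer": "Monitor for processes reading LSASS memory.",
    "sources": [
        {"chunk": "Mimikatz accesses LSASS memory to extract credentials.", "score": 0.91},
        {"chunk": "Sysmon Event ID 10 logs process access to lsass.exe.", "score": 0.87},
    ],
}

# Print each supporting chunk with its similarity score
for i, src in enumerate(result["sources"], 1):
    print(f"[{i}] score={src['score']:.2f}  {src['chunk']}")
```

Showing scores alongside chunks lets the analyst judge how strongly each source actually matched the question before trusting the answer.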
## Think Deeper
Ask your RAG pipeline a question that IS in the knowledge base, then ask one that is NOT. Compare the two responses. Does the model admit when it does not know?