# The Full RAG Pipeline
RAG combines three components you have already built individually. Now they connect end-to-end:
| Stage | Component | What happens |
|---|---|---|
| R - Retrieval | Semantic search (Lesson 4.2 + Step 3) | Encode query, find top-k chunks from vector index |
| A - Augmentation | Prompt construction | Inject retrieved chunks into the LLM prompt as context |
| G - Generation | LLM API call (Lesson 4.3) | Model reads the context and generates a grounded answer |
The key insight: the LLM is not memorising your security documents -- it reads them at inference time, injected as context. This means the knowledge base can be updated without retraining.
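The pipeline code below calls a `retrieve()` helper built in Step 3. If yours differs, a minimal cosine-similarity retriever looks something like this sketch; note it takes a precomputed query embedding rather than raw text (the function name and the `score` key are illustrative, not from the lesson):

```python
import numpy as np

def retrieve_by_embedding(query_emb, chunk_embeddings, chunks, k=3):
    """Return the k chunks most similar to the query embedding."""
    # Normalise so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    m = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    scores = m @ q                       # one similarity score per chunk
    top = np.argsort(scores)[::-1][:k]   # indices of the best matches
    return [{"chunk": chunks[i], "score": float(scores[i])} for i in top]
```

The `{"chunk": ..., "score": ...}` result shape mirrors what the pipeline below expects when it reads `r["chunk"]`.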
## The Augmented Prompt
```python
def build_rag_prompt(question, retrieved_chunks):
    """Build a prompt that grounds the LLM in retrieved context."""
    context = "\n\n---\n\n".join(retrieved_chunks)

    system = """You are a security analyst assistant.
Answer the question based ONLY on the provided context.
If the context does not contain enough information, say:
'The provided context does not contain information about this.'
Do not use your general knowledge. Cite specific details from the context."""

    user_message = f"""Context:
{context}

Question: {question}"""

    return system, user_message
```
The instruction "based ONLY on the provided context" is the most important line. Without it, the model blends its pre-training knowledge with your documents, making it impossible to verify where the answer came from.
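To see exactly what the model receives, you can assemble the user message by hand with toy chunks (the chunk text here is made up for illustration):

```python
# Stand-ins for real retrieved chunks
chunks = [
    "Mimikatz accesses LSASS memory to extract credentials.",
    "Sysmon Event ID 10 logs process access to lsass.exe.",
]

# Same assembly as build_rag_prompt: chunks joined by a separator,
# then the question appended at the end
context = "\n\n---\n\n".join(chunks)
user_message = f"Context:\n{context}\n\nQuestion: How do I detect Mimikatz?"
print(user_message)
```

The `---` separators make chunk boundaries visible to the model, which helps it cite individual sources rather than blending them.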
## Putting It All Together
```python
from llm_client import get_client

provider, client = get_client()

def rag_query(question, chunk_embeddings, chunks, k=3):
    """Full RAG pipeline: retrieve, augment, generate."""
    # 1. RETRIEVE
    results = retrieve(question, chunk_embeddings, chunks, k=k)
    retrieved_texts = [r["chunk"] for r in results]

    # 2. AUGMENT
    system, user_msg = build_rag_prompt(question, retrieved_texts)

    # 3. GENERATE
    answer = client.chat(
        system=system,
        messages=[{"role": "user", "content": user_msg}],
        max_tokens=400,
    )

    return {
        "answer": answer,
        "sources": results,  # for attribution
    }

# Ask a question
result = rag_query("How do I detect Mimikatz?", chunk_embeddings, chunks)
print(result["answer"])
print(f"\nBased on {len(result['sources'])} retrieved chunks")
```
## RAG vs Pure LLM: Why Grounding Matters
| Property | Pure LLM | RAG |
|---|---|---|
| Knowledge source | Pre-training data (stale) | Your documents (up-to-date) |
| Knowledge cutoff | Fixed at training date | Updated when you add documents |
| Internal policies | Unknown to the model | Indexed and searchable |
| Hallucination risk | High -- model invents plausible answers | Lower -- answers are grounded in retrieved context, though not eliminated |
| Attribution | None -- cannot cite sources | Yes -- you know which chunks were used |
| Cost per query | Lower (no retrieval step) | Slightly higher (embedding + retrieval + generation) |
For security work, the attribution and accuracy benefits of RAG far outweigh the marginal cost increase. When an analyst asks "what is our response procedure for ransomware?", the answer must come from your runbook -- not the model's general knowledge.
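Attribution is the payoff of returning `sources` from `rag_query`. A quick way to surface it to the analyst (the `result` dict below is a toy example standing in for real pipeline output):

```python
# Toy pipeline output: an answer plus the chunks that supported it
result = {
    "answer": "Monitor for processes reading LSASS memory.",
    "sources": [
        {"chunk": "Mimikatz accesses LSASS memory to extract credentials.", "score": 0.91},
        {"chunk": "Sysmon Event ID 10 logs process access to lsass.exe.", "score": 0.87},
    ],
}

# Print each supporting chunk with its similarity score
for i, src in enumerate(result["sources"], 1):
    print(f"[{i}] score={src['score']:.2f}  {src['chunk']}")
```

Showing scores alongside chunks lets the analyst judge how strongly each source actually matched the question before trusting the answer.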
## Think Deeper
Ask your RAG pipeline a question that IS in the knowledge base, then ask one that is NOT. Compare the two responses. Does the model admit when it does not know?