Why Documents Must Be Chunked
Sentence embedding models have a maximum input length (typically 128-512 tokens); input beyond that limit is usually truncated silently. A CVE advisory or incident report can run to thousands of words, so you must split documents into chunks before embedding.
But chunking is not just a technical constraint -- it determines retrieval quality:
| Dimension | Chunks too large | Chunks too small |
|---|---|---|
| Embedding quality | Averages too much content -- embeddings become vague | Loses cross-sentence context -- embeddings are narrow |
| Retrieval precision | Retrieved chunks contain irrelevant information | Answers split across multiple chunks |
| Attribution | Hard to pinpoint which part of a chunk is relevant | Many more chunks to search through |
Rule of thumb: 100-300 words per chunk, with 20-50 word overlap.
Three Chunking Strategies
| Strategy | How it works | Trade-offs |
|---|---|---|
| Fixed-size | Split every N words, no overlap | Simple, fast; risk of cutting mid-sentence |
| Fixed-size with overlap | Split every N words, overlap by M words | General purpose; boundary content preserved |
| Sentence-based | Split on sentence boundaries; group sentences into chunks | High quality; respects natural structure |
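The fixed-size approach is implemented below. Sentence-based chunking can be sketched as follows: split on sentence boundaries, then greedily pack whole sentences into chunks up to a word budget. This is a minimal sketch; the regex-based sentence splitter is a naive assumption (a real pipeline might use nltk or spaCy), and `chunk_by_sentences` and `max_words` are illustrative names, not from the text above.

```python
import re

def chunk_by_sentences(text, max_words=200):
    """Group whole sentences into chunks of at most max_words words each."""
    # Naive split: a sentence ends at ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks only ever grow by whole sentences, no sentence is ever cut in half; the cost is that chunk sizes vary rather than being exactly N words.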
Fixed-Size Chunking with Overlap
```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping word-level chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunk = " ".join(words[start:start + chunk_size])
        chunks.append(chunk)
        start += chunk_size - overlap  # step forward by (chunk_size - overlap)
    return chunks

# Example: 1000-word document
with open("cve_advisory.txt", encoding="utf-8") as f:
    doc = f.read()

chunks = chunk_text(doc, chunk_size=200, overlap=50)
print(f"Document: {len(doc.split())} words")
print(f"Chunks: {len(chunks)}")
print(f"Chunk 1: {len(chunks[0].split())} words")
```
Overlap ensures that a sentence spanning a chunk boundary appears in both chunks. Without overlap, a question about that sentence might fail to retrieve either chunk.
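The boundary effect is easiest to see on a toy document. The sketch below uses the same word-level chunking scheme as `chunk_text` above, with a 10-word "document" and small parameters chosen purely for illustration:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping word-level chunks."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

# Toy 10-word document, chunk_size=6, overlap=2 (step = 4 words):
doc = "w1 w2 w3 w4 w5 w6 w7 w8 w9 w10"
for c in chunk_text(doc, chunk_size=6, overlap=2):
    print(c)
# w1 w2 w3 w4 w5 w6
# w5 w6 w7 w8 w9 w10
# w9 w10
```

Words w5 and w6 sit on the first boundary and appear in both the first and second chunks, so a query about that region can match either one. With overlap=0 they would land in exactly one chunk each.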
Chunk Size vs Retrieval Quality
Choosing the right chunk size is an empirical decision. Here is a practical starting point for security documents:
| Document type | Recommended chunk size | Overlap | Reasoning |
|---|---|---|---|
| CVE advisories | 150-200 words | 30 words | Vulnerability + mitigation often in adjacent paragraphs |
| Incident reports | 200-300 words | 50 words | Timeline entries are longer and interconnected |
| Policy documents | 100-150 words | 20 words | Short, self-contained sections |
| Threat intel feeds | 100-200 words | 30 words | IOCs and descriptions are compact |
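The table above can be wired into code as a lookup of per-document-type presets. This is a hypothetical sketch: the key names and the midpoint values chosen from each recommended range are illustrative assumptions, not prescribed by the table.

```python
def chunk_words(text, chunk_size, overlap):
    """Word-level chunking with overlap (same scheme as chunk_text above)."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

# Hypothetical presets: midpoints of the recommended ranges in the table.
CHUNK_PRESETS = {
    "cve_advisory":    {"chunk_size": 175, "overlap": 30},
    "incident_report": {"chunk_size": 250, "overlap": 50},
    "policy":          {"chunk_size": 125, "overlap": 20},
    "threat_intel":    {"chunk_size": 150, "overlap": 30},
}

def chunk_document(doc_type, text):
    """Chunk text using the preset for its type, falling back to the rule of thumb."""
    preset = CHUNK_PRESETS.get(doc_type, {"chunk_size": 200, "overlap": 50})
    return chunk_words(text, **preset)
```

Keeping the presets in one place makes it easy to tune them empirically per corpus, which is the point of the table: these are starting values, not final answers.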
Think Deeper
Chunk a 1000-word CVE advisory with chunk_size=100 and overlap=0, then again with overlap=20. Count the chunks produced. Which approach would you trust more for a question that falls right on a chunk boundary?