Step 2: Document Chunking

Split documents into embeddable pieces


Why Documents Must Be Chunked

Sentence embedding models have a maximum input length (typically 128-512 tokens). A CVE advisory or incident report can be thousands of words. You must split documents into chunks before embedding.
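A rough pre-flight check can flag documents that exceed the model's window before you embed them. The sketch below uses an assumed limit of 512 tokens and a 0.75-words-per-token heuristic; exact counts require the embedding model's own tokenizer:

```python
MAX_TOKENS = 512          # assumed model limit; check your embedding model's docs
WORDS_PER_TOKEN = 0.75    # rough heuristic for English text, not an exact ratio

def fits_in_model(text, max_tokens=MAX_TOKENS):
    """Approximate check for whether a text fits in the embedding window."""
    approx_tokens = len(text.split()) / WORDS_PER_TOKEN
    return approx_tokens <= max_tokens

print(fits_in_model("A short advisory sentence."))   # → True
print(fits_in_model("word " * 2000))                 # → False: ~2667 tokens
```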

But chunking is not just a technical constraint -- it determines retrieval quality:

| Problem | Chunks too large | Chunks too small |
| --- | --- | --- |
| Embedding quality | Averages too much content -- embeddings become vague | Loses cross-sentence context -- embeddings are narrow |
| Retrieval precision | Retrieved chunks contain irrelevant information | Answers split across multiple chunks |
| Attribution | Hard to pinpoint which part of a chunk is relevant | Many more chunks to search through |

Rule of thumb: 100-300 words per chunk, with 20-50 word overlap.
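These numbers also determine how many chunks a document yields. For a sliding window that advances by `chunk_size - overlap` words, an N-word document produces roughly `ceil(N / step)` chunks; a small helper (illustrative, not part of the lab code) makes this concrete:

```python
import math

def num_chunks(n_words, chunk_size, overlap):
    """Chunk count for a window that steps forward by chunk_size - overlap words."""
    step = chunk_size - overlap
    return math.ceil(n_words / step)

print(num_chunks(1000, 200, 50))   # → 7 chunks for a 1000-word document
```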

Three Chunking Strategies

| Strategy | How it works | Trade-offs |
| --- | --- | --- |
| Fixed-size | Split every N words, no overlap | Simple, fast; risk of cutting mid-sentence |
| Fixed-size with overlap | Split every N words, overlap by M words | General purpose; boundary content preserved |
| Sentence-based | Split on sentence boundaries; group sentences into chunks | High quality; respects natural structure |
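The sentence-based strategy can be sketched as follows. The regex splitter here is a deliberately naive stand-in; a production pipeline would use a proper sentence tokenizer such as nltk or spaCy:

```python
import re

def chunk_by_sentences(text, max_words=200):
    """Group whole sentences into chunks of at most max_words words."""
    # Naive splitter: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Flush the current chunk if adding this sentence would overflow it
        if current and current_len + n > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Note that a single sentence longer than `max_words` still becomes its own oversized chunk; sentence boundaries are never cut.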

Fixed-Size Chunking with Overlap

def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping word-level chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += chunk_size - overlap    # step forward by (chunk_size - overlap)
    return chunks

# Example: 1000-word document
with open("cve_advisory.txt") as f:
    doc = f.read()
chunks = chunk_text(doc, chunk_size=200, overlap=50)
print(f"Document: {len(doc.split())} words")
print(f"Chunks:   {len(chunks)}")
print(f"Chunk 1:  {len(chunks[0].split())} words")

Overlap ensures that a sentence spanning a chunk boundary appears in both chunks. Without overlap, a question about that sentence might fail to retrieve either chunk.
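A toy document of numbered words makes the overlap visible: each chunk repeats the last `overlap` words of its predecessor. The minimal chunker below mirrors the function above so the snippet runs on its own:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Minimal word-level chunker with overlap (mirrors the function above)."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

# Toy "document" of 12 numbered words.
doc = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_text(doc, chunk_size=5, overlap=2):
    print(chunk)
# w0 w1 w2 w3 w4
# w3 w4 w5 w6 w7
# w6 w7 w8 w9 w10
# w9 w10 w11
```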

Chunk Size vs Retrieval Quality

Choosing the right chunk size is an empirical decision. Here is a practical starting point for security documents:

| Document type | Recommended chunk size | Overlap | Reasoning |
| --- | --- | --- | --- |
| CVE advisories | 150-200 words | 30 words | Vulnerability + mitigation often in adjacent paragraphs |
| Incident reports | 200-300 words | 50 words | Timeline entries are longer and interconnected |
| Policy documents | 100-150 words | 20 words | Short, self-contained sections |
| Threat intel feeds | 100-200 words | 30 words | IOCs and descriptions are compact |

Think Deeper

Chunk a 1000-word CVE advisory with chunk_size=100 and overlap=0, then again with overlap=20. Count the chunks produced. Which approach would you trust more for a question that falls right on a chunk boundary?

Without overlap you get 10 chunks; with overlap=20 the step shrinks to 80 words, so you get 13 chunks. The overlapping version duplicates boundary sentences, so a question about content near a split point has a better chance of retrieving a complete answer. In security advisories, mitigation steps often follow vulnerability descriptions -- if the split falls between them, a zero-overlap chunking loses the connection. Overlap is a safety net against boundary information loss.
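A quick simulation confirms the counts; the chunker is repeated here so the snippet is self-contained:

```python
def chunk_text(text, chunk_size, overlap):
    """Minimal word-level chunker with overlap (same logic as the lab code)."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

doc = " ".join(f"w{i}" for i in range(1000))   # stand-in for a 1000-word advisory
print(len(chunk_text(doc, 100, 0)))    # → 10 chunks
print(len(chunk_text(doc, 100, 20)))   # → 13 chunks
```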
Cybersecurity tie-in: Chunking quality directly affects retrieval accuracy. If a CVE advisory's vulnerability description ends up in one chunk and its mitigation steps in another, a query like "how to fix CVE-2024-1234" might retrieve the vulnerability chunk but miss the mitigation. Overlap is your first defence against this failure mode. For critical security documents, test your chunking strategy by verifying that key question-answer pairs retrieve the correct chunks.
