Why Documents Must Be Chunked
Sentence embedding models have a maximum input length (typically 128-512 tokens); input beyond that limit is usually truncated silently. A CVE advisory or incident report can run to thousands of words, so you must split documents into chunks before embedding.
But chunking is not just a technical constraint -- it determines retrieval quality:
| Dimension | Chunks too large | Chunks too small |
|---|---|---|
| Embedding quality | Averages too much content -- embeddings become vague | Loses cross-sentence context -- embeddings are narrow |
| Retrieval precision | Retrieved chunks contain irrelevant information | Answers split across multiple chunks |
| Attribution | Hard to pinpoint which part of a chunk is relevant | Many more chunks to search through |
Rule of thumb: 100-300 words per chunk, with 20-50 word overlap.
Three Chunking Strategies
| Strategy | How it works | Trade-offs |
|---|---|---|
| Fixed-size | Split every N words, no overlap | Simple, fast; risk of cutting mid-sentence |
| Fixed-size with overlap | Split every N words, overlap by M words | General purpose; boundary content preserved |
| Sentence-based | Split on sentence boundaries; group sentences into chunks | High quality; respects natural structure |
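The fixed-size approach is implemented below. Sentence-based chunking can be sketched as follows: split on sentence boundaries, then greedily pack whole sentences into chunks up to a word budget. This is a minimal sketch; the regex-based sentence splitter is a naive assumption (a real pipeline might use nltk or spaCy), and `chunk_by_sentences` and `max_words` are illustrative names, not from the text above.

```python
import re

def chunk_by_sentences(text, max_words=200):
    """Group whole sentences into chunks of at most max_words words each."""
    # Naive split: a sentence ends at ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks only ever grow by whole sentences, no sentence is ever cut in half; the cost is that chunk sizes vary rather than being exactly N words.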
Fixed-Size Chunking with Overlap
```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping word-level chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunk = " ".join(words[start:start + chunk_size])
        chunks.append(chunk)
        start += chunk_size - overlap  # step forward by (chunk_size - overlap)
    return chunks

# Example: 1000-word document
with open("cve_advisory.txt", encoding="utf-8") as f:
    doc = f.read()

chunks = chunk_text(doc, chunk_size=200, overlap=50)
print(f"Document: {len(doc.split())} words")
print(f"Chunks: {len(chunks)}")
print(f"Chunk 1: {len(chunks[0].split())} words")
```
Overlap ensures that a sentence spanning a chunk boundary appears in both chunks. Without overlap, a question about that sentence might fail to retrieve either chunk.
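The boundary effect is easiest to see on a toy document. The sketch below uses the same word-level chunking scheme as `chunk_text` above, with a 10-word "document" and small parameters chosen purely for illustration:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping word-level chunks."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

# Toy 10-word document, chunk_size=6, overlap=2 (step = 4 words):
doc = "w1 w2 w3 w4 w5 w6 w7 w8 w9 w10"
for c in chunk_text(doc, chunk_size=6, overlap=2):
    print(c)
# w1 w2 w3 w4 w5 w6
# w5 w6 w7 w8 w9 w10
# w9 w10
```

Words w5 and w6 sit on the first boundary and appear in both the first and second chunks, so a query about that region can match either one. With overlap=0 they would land in exactly one chunk each.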
Chunk Size vs Retrieval Quality
Choosing the right chunk size is an empirical decision. Here is a practical starting point for security documents:
| Document type | Recommended chunk size | Overlap | Reasoning |
|---|---|---|---|
| CVE advisories | 150-200 words | 30 words | Vulnerability + mitigation often in adjacent paragraphs |
| Incident reports | 200-300 words | 50 words | Timeline entries are longer and interconnected |
| Policy documents | 100-150 words | 20 words | Short, self-contained sections |
| Threat intel feeds | 100-200 words | 30 words | IOCs and descriptions are compact |
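The table above can be wired into code as a lookup of per-document-type presets. This is a hypothetical sketch: the key names and the midpoint values chosen from each recommended range are illustrative assumptions, not prescribed by the table.

```python
def chunk_words(text, chunk_size, overlap):
    """Word-level chunking with overlap (same scheme as chunk_text above)."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

# Hypothetical presets: midpoints of the recommended ranges in the table.
CHUNK_PRESETS = {
    "cve_advisory":    {"chunk_size": 175, "overlap": 30},
    "incident_report": {"chunk_size": 250, "overlap": 50},
    "policy":          {"chunk_size": 125, "overlap": 20},
    "threat_intel":    {"chunk_size": 150, "overlap": 30},
}

def chunk_document(doc_type, text):
    """Chunk text using the preset for its type, falling back to the rule of thumb."""
    preset = CHUNK_PRESETS.get(doc_type, {"chunk_size": 200, "overlap": 50})
    return chunk_words(text, **preset)
```

Keeping the presets in one place makes it easy to tune them empirically per corpus, which is the point of the table: these are starting values, not final answers.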
Think Deeper
Chunk a 1000-word CVE advisory with chunk_size=100 and overlap=0, then again with overlap=20. Count the chunks produced. Which approach would you trust more for a question that falls right on a chunk boundary?