From Embeddings to Search
In Lesson 4.1 you learned that embeddings place text into a high-dimensional space where similar meaning = nearby coordinates. In Lesson 4.2 you encoded sentences and measured cosine similarity between them.
That works for comparing a handful of sentences. But what happens when you have millions of them and need to ask: "which ones are nearest to this query?"
That is what a vector database is for — Pinecone, Weaviate, Qdrant, Chroma, FAISS, pgvector, and others all do the same thing: store vectors, find nearest neighbours, fast.
What Is a "Document"?
In vector-DB land, a document is just one piece of text you want to be searchable. It is not necessarily a Word file or a PDF:
| Real thing | What gets stored as a "document" |
|---|---|
| A 200-page playbook | Each paragraph or section becomes one document — chopping it up is called chunking (next step) |
| A SOC ticket | The ticket summary + description as a single document |
| A Slack message | One message = one document |
| A firewall log line | One log entry = one document |
| A knowledge-base article | Title + body as one document, or each section as its own |
The rule of thumb: a document is whatever sized chunk of text you want the search to return as a single hit.
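That chopping step can be sketched in a few lines. This is a minimal, illustrative chunker (the function name and the 40-character merge threshold are made up for this sketch, not a library API): it splits on blank lines and merges very short fragments into their neighbour so every chunk is worth returning as a single hit.

```python
def chunk_by_paragraph(text: str, min_chars: int = 40) -> list[str]:
    """Split on blank lines; merge fragments shorter than min_chars
    into the previous chunk so headings don't become tiny documents."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    for para in paragraphs:
        if chunks and len(chunks[-1]) < min_chars:
            chunks[-1] = chunks[-1] + " " + para  # merge short fragment
        else:
            chunks.append(para)
    return chunks

playbook = (
    "Containment\n\n"
    "Isolate the host from the network.\n\n"
    "Then capture a memory image before rebooting."
)
print(chunk_by_paragraph(playbook))
```

Here the lone heading "Containment" gets merged into the paragraph after it, so the search never returns a heading by itself. Real pipelines use fancier strategies (sentence windows, token budgets, overlap), but the principle is the same.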
Step 1 — INDEX (done once)
Every document is converted to a vector once by an embedding model and stored in the database. Documents that talk about the same thing land near each other automatically:
```python
# Index time -- done once per document
vec = embedding_model.encode("Ransomware encrypted the file server")
vector_db.upsert(id="ticket-4291", vector=vec, metadata={"severity": "high"})
```
Nobody told the database that "ransomware" and "Cryptolocker" describe the same threat. The clustering is a free side effect of the embedding model you already know from Lesson 4.2.
Step 2 — SEARCH (done every query)
At query time the user's question gets encoded with the same embedding model and dropped onto the same map. The database returns the K closest vectors by distance — those are the most semantically similar documents:
```python
# Query time -- done on every user question
query_vec = embedding_model.encode("How do I detect ransomware on a host?")
hits = vector_db.search(vector=query_vec, top_k=3)
# [{"id": "ticket-4291", "score": 0.91, "metadata": {...}}, ...]
```
That is the entire algorithm. Two functions: upsert (write a vector) and search (find K nearest). Everything else — RAG, semantic cache, dedup — is built on these two calls.
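To make those two calls concrete, here is a toy in-memory version of the whole database, with hand-made 3-d vectors standing in for real embeddings (the class and its method names mirror the snippets above but are invented for this sketch). Real vector DBs add approximate-nearest-neighbour indexes so search stays fast at millions of vectors; this brute-force scan is only fine for a few thousand.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

class MiniVectorDB:
    def __init__(self):
        self._rows = {}  # id -> (vector, metadata)

    def upsert(self, id, vector, metadata=None):
        """Write (or overwrite) one document's vector."""
        self._rows[id] = (vector, metadata or {})

    def search(self, vector, top_k=3):
        """Score every stored vector against the query, return top_k."""
        scored = [
            {"id": id_, "score": cosine(vector, vec), "metadata": meta}
            for id_, (vec, meta) in self._rows.items()
        ]
        scored.sort(key=lambda hit: hit["score"], reverse=True)
        return scored[:top_k]

# Hand-made vectors stand in for embedding_model.encode() output.
db = MiniVectorDB()
db.upsert("ticket-4291", [0.9, 0.1, 0.0], {"severity": "high"})
db.upsert("ticket-1007", [0.1, 0.9, 0.2], {"severity": "low"})

hits = db.search([0.8, 0.2, 0.1], top_k=1)
print(hits[0]["id"])  # ticket-4291 -- the nearest neighbour
```

Swap the hand-made vectors for real model.encode() output and this is, conceptually, the entire product.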
Keyword Search vs Semantic Search
A SQL WHERE text LIKE '%ransomware%' only finds documents containing the literal word. A vector database finds documents with similar meaning, even when the surface words are different:
| Document text | Contains "ransomware"? | Found by vector search? |
|---|---|---|
| "Files were encrypted and a ransom note appeared" | No | Yes |
| "Encryption malware demanding bitcoin payment" | No | Yes |
| "Cryptolocker variant detected on host" | No | Yes |
This is the difference between keyword search and semantic search — and it is why RAG uses a vector database, not a SQL LIKE query.
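The keyword half of that table is easy to verify yourself: a LIKE-style match is just a literal substring check, and none of the three documents contain the word.

```python
# The three documents from the table: all describe ransomware,
# none contain the literal word, so a substring match misses all three.
docs = [
    "Files were encrypted and a ransom note appeared",
    "Encryption malware demanding bitcoin payment",
    "Cryptolocker variant detected on host",
]
keyword_hits = [d for d in docs if "ransomware" in d.lower()]
print(len(keyword_hits))  # 0 -- the keyword column of the table
```

Reproducing the semantic column requires a real embedding model (e.g. the all-MiniLM-L6-v2 model from Lesson 4.2), which would place all three documents near the query vector despite the missing keyword.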
When to Use a Vector Database
| You have | Use a vector DB? |
|---|---|
| 50 internal policy PDFs and want a chatbot to answer questions | Yes — this is the canonical use case (RAG) |
| 10 million SOC tickets you want to find "things like this one" in | Yes — semantic similarity over a huge corpus |
| A structured table with user_id, email, last_login | No — normal SQL query, no embeddings needed |
| Real-time stream of 100k events/sec for exact rule matching | No — use a SIEM / Sigma rules; vector search is too slow |
Think Deeper
A SQL query WHERE text LIKE '%credential theft%' returns 3 results from your ticket database. You suspect there are more incidents described with different wording. How would a vector database help, and what would you need to build one?
You would need three things: (1) an embedding model (e.g. all-MiniLM-L6-v2 from Lesson 4.2), (2) encode every ticket once with model.encode(), and (3) store the vectors in a vector DB. At query time, encode the question with the same model and call search(top_k=10). The keyword gap that SQL cannot bridge is exactly what semantic search solves.