Step 1: Vector Databases

Why they exist and how they work


From Embeddings to Search

In Lesson 4.1 you learned that embeddings place text into a high-dimensional space where similar meaning = nearby coordinates. In Lesson 4.2 you encoded sentences and measured cosine similarity between them.

That works for comparing a handful of sentences. But what happens when you have millions of them and need to ask: "which ones are nearest to this query?"

That is what a vector database is for — Pinecone, Weaviate, Qdrant, Chroma, FAISS, pgvector, and others all do the same thing: store vectors, find nearest neighbours, fast.

What Is a "Document"?

In vector-DB land, a document is just one piece of text you want to be searchable. It is not necessarily a Word file or a PDF:

| Real thing | What gets stored as a "document" |
| --- | --- |
| A 200-page playbook | Each paragraph or section becomes one document — chopping it up is called chunking (next step) |
| A SOC ticket | The ticket summary + description as a single document |
| A Slack message | One message = one document |
| A firewall log line | One log entry = one document |
| A knowledge-base article | Title + body as one document, or each section as its own |

The rule of thumb: a document is whatever sized chunk of text you want the search to return as a single hit.
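The playbook row above hints at the simplest possible form of chunking (covered properly in the next step): split on paragraph boundaries and treat each paragraph as one document. A deliberately naive sketch, purely for illustration:

```python
def paragraphs_to_documents(text: str) -> list[str]:
    """Split raw text on blank lines, one paragraph per document.

    A naive chunker -- real chunking strategies (next step) also
    handle size limits and overlap between chunks.
    """
    return [p.strip() for p in text.split("\n\n") if p.strip()]

playbook = (
    "Isolate the affected host.\n\n"
    "Collect memory and disk images.\n\n"
    "Notify the incident response lead."
)
docs = paragraphs_to_documents(playbook)
# three paragraphs -> three separately searchable documents
```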

Step 1 — INDEX (done once)

Every document is converted to a vector once by an embedding model and stored in the database. Documents that talk about the same thing land near each other automatically:

# Index time -- done once per document
vec = embedding_model.encode("Ransomware encrypted the file server")
vector_db.upsert(id="ticket-4291", vector=vec, metadata={"severity": "high"})

Nobody told the database that ransomware and SQL injection are both threats. The clustering is a free side effect of the embedding model you already know from Lesson 4.2.

Step 2 — SEARCH (done every query)

At query time the user's question gets encoded with the same embedding model and dropped onto the same map. The database returns the K closest vectors by distance — those are the most semantically similar documents:

# Query time -- done on every user question
query_vec = embedding_model.encode("How do I detect ransomware on a host?")
hits = vector_db.search(vector=query_vec, top_k=3)
# [{"id": "ticket-4291", "score": 0.91, "metadata": {...}}, ...]

That is the entire algorithm. Two functions: upsert (write a vector) and search (find K nearest). Everything else — RAG, semantic cache, dedup — is built on these two calls.
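To make the two-call mental model concrete, here is a toy in-memory "vector database" exposing exactly those two operations. The class name and structure are invented for illustration; a production system replaces the linear scan with an approximate-nearest-neighbour index (e.g. HNSW) so search stays fast at millions of vectors:

```python
import math

class ToyVectorDB:
    """Illustrative in-memory store: upsert vectors, search by cosine similarity."""

    def __init__(self):
        self.rows = {}  # id -> (vector, metadata)

    def upsert(self, id, vector, metadata=None):
        self.rows[id] = (vector, metadata or {})

    def search(self, vector, top_k=3):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb)

        # Brute-force scan over every stored vector, highest score first
        scored = [
            {"id": id_, "score": cosine(vector, vec), "metadata": meta}
            for id_, (vec, meta) in self.rows.items()
        ]
        return sorted(scored, key=lambda h: h["score"], reverse=True)[:top_k]

db = ToyVectorDB()
db.upsert("ticket-4291", [0.9, 0.1, 0.0], {"severity": "high"})
db.upsert("ticket-1007", [0.1, 0.9, 0.0], {"severity": "low"})
hits = db.search([0.8, 0.2, 0.0], top_k=1)
# hits[0]["id"] == "ticket-4291" -- the stored vector nearest the query
```

The API surface of real vector databases is essentially this; what they add is the indexing machinery that makes the nearest-neighbour scan sub-linear.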

Keyword Search vs Semantic Search

A SQL query such as WHERE text LIKE '%ransomware%' finds only documents containing the literal word. A vector database finds documents with similar meaning, even when the surface words are different:

| Document text | Contains "ransomware"? | Found by vector search? |
| --- | --- | --- |
| "Files were encrypted and a ransom note appeared" | No | Yes |
| "Encryption malware demanding bitcoin payment" | No | Yes |
| "Cryptolocker variant detected on host" | No | Yes |

This is the difference between keyword search and semantic search — and it is why RAG uses a vector database, not a SQL LIKE query.
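The keyword gap is easy to reproduce: none of the three documents above survive a literal-substring filter. The semantic side is sketched in comments because it needs a downloaded embedding model (e.g. all-MiniLM-L6-v2 from Lesson 4.2):

```python
docs = [
    "Files were encrypted and a ransom note appeared",
    "Encryption malware demanding bitcoin payment",
    "Cryptolocker variant detected on host",
]

# Keyword search: the equivalent of SQL LIKE '%ransomware%'
keyword_hits = [d for d in docs if "ransomware" in d.lower()]
# keyword_hits == [] -- zero results, despite three relevant documents

# Semantic search (sketch -- assumes an embedding model and vector DB
# set up as in the earlier snippets):
#   query_vec = embedding_model.encode("ransomware")
#   hits = vector_db.search(vector=query_vec, top_k=3)
# All three documents rank highly because their meanings sit near the query.
```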

When to Use a Vector Database

| You have | Use a vector DB? |
| --- | --- |
| 50 internal policy PDFs and want a chatbot to answer questions | Yes — this is the canonical use case (RAG) |
| 10 million SOC tickets you want to find "things like this one" in | Yes — semantic similarity over a huge corpus |
| A structured table with user_id, email, last_login | No — normal SQL query, no embeddings needed |
| Real-time stream of 100k events/sec for exact rule matching | No — use a SIEM / Sigma rules; vector search is too slow |

Think Deeper

A SQL query WHERE text LIKE '%credential theft%' returns 3 results from your ticket database. You suspect there are more incidents described with different wording. How would a vector database help, and what would you need to build one?

A vector database would find tickets described as 'password harvesting', 'LSASS memory dumping', or 'stolen login tokens' — all semantically similar to 'credential theft' even though the exact words never appear. To build one you need: (1) an embedding model (e.g. all-MiniLM-L6-v2 from Lesson 4.2), (2) encode every ticket once with model.encode(), and (3) store the vectors in a vector DB. At query time, encode the question with the same model and call search(top_k=10). The keyword gap that SQL cannot bridge is exactly what semantic search solves.
Cybersecurity tie-in: Vector databases are the backbone of every modern security knowledge assistant. When an analyst asks "how do we respond to ransomware?", the system encodes that question, finds the nearest chunks in your indexed runbooks, and passes them to the LLM. The quality of that answer depends entirely on what was indexed and how it was chunked — which is what the next three steps teach you to build.
