Step 2: How Guardrails Work

Inbound + outbound scanning, detection methods


Two Checkpoints

AI Guardrails operates as an inline scanning layer with two checkpoints:

| Checkpoint | What It Scans | What It Catches |
|---|---|---|
| Inbound (user → LLM) | User's prompt before it reaches the LLM | Prompt injection, jailbreak attempts, prohibited topics |
| Outbound (LLM → user) | LLM's response before the user sees it | Hallucinated content, data leakage, toxic output |
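The two-checkpoint flow can be sketched in a few lines. This is a minimal illustration, not a real product's API: `scan_inbound`, `scan_outbound`, and the `call_llm` stub are hypothetical names, and the checks themselves are toy placeholders for the detection methods covered below.

```python
# Toy blocklist standing in for real inbound detection logic.
BLOCKED_PHRASES = ["ignore previous instructions", "disregard your system prompt"]

def scan_inbound(prompt: str) -> bool:
    """Checkpoint 1: return True if the user prompt is safe to forward."""
    lowered = prompt.lower()
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)

def scan_outbound(response: str) -> bool:
    """Checkpoint 2: return True if the LLM response is safe to show."""
    # Toy check: block responses that echo an obvious secret pattern.
    return "api_key=" not in response.lower()

def guarded_call(prompt: str, call_llm) -> str:
    """Run both checkpoints around a single LLM call."""
    if not scan_inbound(prompt):
        return "[blocked: prompt failed inbound scan]"
    response = call_llm(prompt)
    if not scan_outbound(response):
        return "[blocked: response failed outbound scan]"
    return response
```

Note that the outbound checkpoint runs even when the inbound scan passes: a benign-looking prompt can still elicit data leakage or toxic output, which is why both checkpoints exist.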

Detection Methods

| Method | How It Works | Latency |
|---|---|---|
| Pattern matching | Known attack signatures and templates | 1–5 ms |
| NLP classification | ML models trained on attack datasets | 5–20 ms |
| Semantic analysis | Embedding-based comparison to known attacks | 10–30 ms |
| Contextual analysis | Full conversation context, catches multi-turn attacks | 15–40 ms |
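Pattern matching, the fastest method in the table, is essentially signature scanning. A rough sketch, with a toy signature list (real guardrails ship much larger, curated and regularly updated sets):

```python
import re

# Toy signatures for two well-known attack templates; illustrative only.
SIGNATURES = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now in developer mode", re.IGNORECASE),
]

def pattern_match(prompt: str) -> bool:
    """Return True if the prompt matches any known attack signature."""
    return any(sig.search(prompt) for sig in SIGNATURES)
```

Signature checks are cheap because they are pure string/regex operations, but they only catch attacks someone has already seen, which is why the slower ML-based methods sit behind them.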

Total end-to-end: ~20–50 ms — invisible compared to LLM generation time (500ms–3s).

Connection to What You Know

  • Stage 1 (Classification) — attack detection is a classification problem: safe, injection, jailbreak, toxic
  • Stage 4 (Embeddings) — semantic analysis uses the same embedding + cosine similarity approach from RAG
  • Stage 1 (Evaluation) — the same precision/recall tradeoff applies: too aggressive = blocks legitimate prompts, too permissive = misses attacks
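To make the Stage 4 connection concrete, here is a sketch of embedding-based semantic analysis: compare the prompt's embedding to embeddings of known attacks via cosine similarity. The `0.85` threshold is an illustrative assumption, and the vectors would come from whatever embedding model the stack already uses; this is not any specific product's implementation.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, exactly as used for retrieval in RAG."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_flag(prompt_vec: np.ndarray,
                  attack_vecs: list[np.ndarray],
                  threshold: float = 0.85) -> bool:
    """Flag the prompt if it is too similar to any known attack embedding."""
    return any(cosine_similarity(prompt_vec, v) >= threshold
               for v in attack_vecs)
```

The threshold is where the precision/recall tradeoff from Stage 1 shows up directly: lowering it catches more paraphrased attacks (higher recall) but flags more legitimate prompts (lower precision).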

Think Deeper

Your guardrails add 30ms latency. A customer says that's too slow. How do you respond?

30ms is invisible to users — LLM generation itself takes 500ms–3s. The guardrails latency is less than 5% of total response time. Compare: a WAF adds 1–5ms but can't detect prompt injection at all. The tradeoff is 30ms of latency for catching attacks that no other security layer can.
Key insight: Guardrails use the same ML techniques you've learned throughout this program. Classification, embeddings, semantic similarity — you can explain how they work, not just that they work. That technical depth is your competitive advantage.
