Step 1: The Threat Landscape

Prompt injection, jailbreaks, and why WAFs don't work

1 ExplorePlay below

›

2 ReadUnderstand

›

3 BuildHands-on lab

›

💡 ReflectThink deeper

Attack Categories

Category	What the Attacker Does	Risk
Prompt injection	Embeds hidden instructions to override system prompt	Bypasses application controls
Jailbreak	Social engineering to bypass safety training	LLM generates harmful content
Data extraction	Tricks LLM into revealing system prompts or RAG context	Leaks proprietary data
Indirect injection	Hides instructions in documents/emails the LLM processes	Manipulates output without direct interaction
Toxic content	Gets LLM to produce harmful content	Reputational and legal liability
PII extraction	Extracts personal data from context	Privacy violations, regulatory penalties

Why Traditional Security Doesn't Work

Traditional Control	Why It Fails
WAF rules	Prompt injections are natural language — no SQL syntax to match
Input validation	Attacks are semantically valid text, not malformed input
Output filtering	Regex can't catch creative encoding or rephrased content
Rate limiting	A single well-crafted prompt is enough
Authentication	The attacker is often a legitimate, authenticated user

LLM applications need a purpose-built security layer that understands natural language semantics.

Think Deeper

Try this:

An attacker uses indirect prompt injection — hiding instructions in a document the LLM summarises. How does this bypass user-facing guardrails?

User-facing guardrails scan the user's prompt, which is innocuous ('Summarise this document'). The malicious instructions are in the document content, which may bypass input scanning. Defense requires scanning both the prompt and the context (RAG documents, tool outputs) before they enter the LLM.

Key insight: "Ignore previous instructions" is valid English, not a SQL injection or script tag. No existing security tool was designed to detect semantic attacks on AI systems. This is why AI Guardrails is a new product category, not an extension of WAF.

← Previous ← → to navigate Next →