Step 1: The Threat Landscape

Prompt injection, jailbreaks, and why WAFs don't work

1 ExplorePlay below
2 ReadUnderstand
3 BuildHands-on lab
💡 ReflectThink deeper

Attack Categories

CategoryWhat the Attacker DoesRisk
Prompt injectionEmbeds hidden instructions to override system promptBypasses application controls
JailbreakSocial engineering to bypass safety trainingLLM generates harmful content
Data extractionTricks LLM into revealing system prompts or RAG contextLeaks proprietary data
Indirect injectionHides instructions in documents/emails the LLM processesManipulates output without direct interaction
Toxic contentGets LLM to produce harmful contentReputational and legal liability
PII extractionExtracts personal data from contextPrivacy violations, regulatory penalties

Why Traditional Security Doesn't Work

Traditional ControlWhy It Fails
WAF rulesPrompt injections are natural language — no SQL syntax to match
Input validationAttacks are semantically valid text, not malformed input
Output filteringRegex can't catch creative encoding or rephrased content
Rate limitingA single well-crafted prompt is enough
AuthenticationThe attacker is often a legitimate, authenticated user

LLM applications need a purpose-built security layer that understands natural language semantics.

Loading...
Loading...

Think Deeper

An attacker uses indirect prompt injection — hiding instructions in a document the LLM summarises. How does this bypass user-facing guardrails?

User-facing guardrails scan the user's prompt, which is innocuous ('Summarise this document'). The malicious instructions are in the document content, which may bypass input scanning. Defense requires scanning both the prompt and the context (RAG documents, tool outputs) before they enter the LLM.
Key insight: "Ignore previous instructions" is valid English, not a SQL injection or script tag. No existing security tool was designed to detect semantic attacks on AI systems. This is why AI Guardrails is a new product category, not an extension of WAF.

Loading...