Attack Categories
| Category | What the Attacker Does | Risk |
|---|---|---|
| Prompt injection | Embeds hidden instructions to override system prompt | Bypasses application controls |
| Jailbreak | Social engineering to bypass safety training | LLM generates harmful content |
| Data extraction | Tricks LLM into revealing system prompts or RAG context | Leaks proprietary data |
| Indirect injection | Hides instructions in documents/emails the LLM processes | Manipulates output without direct interaction |
| Toxic content | Gets LLM to produce harmful content | Reputational and legal liability |
| PII extraction | Extracts personal data from context | Privacy violations, regulatory penalties |
Why Traditional Security Doesn't Work
| Traditional Control | Why It Fails |
|---|---|
| WAF rules | Prompt injections are natural language — no SQL syntax to match |
| Input validation | Attacks are semantically valid text, not malformed input |
| Output filtering | Regex can't catch creative encoding or rephrased content |
| Rate limiting | A single well-crafted prompt is enough |
| Authentication | The attacker is often a legitimate, authenticated user |
LLM applications need a purpose-built security layer that understands natural language semantics.
Loading...
Loading...
Think Deeper
Try this:
An attacker uses indirect prompt injection — hiding instructions in a document the LLM summarises. How does this bypass user-facing guardrails?
User-facing guardrails scan the user's prompt, which is innocuous ('Summarise this document'). The malicious instructions are in the document content, which may bypass input scanning. Defense requires scanning both the prompt and the context (RAG documents, tool outputs) before they enter the LLM.
Key insight: "Ignore previous instructions" is valid English, not a SQL injection
or script tag. No existing security tool was designed to detect semantic attacks on AI systems.
This is why AI Guardrails is a new product category, not an extension of WAF.