Step 3: Hands-On Lab

Attack an LLM, observe what gets caught

1 ExplorePlay below
2 ReadUnderstand
3 BuildHands-on lab
💡 ReflectThink deeper

Lab: Lakera-Demo

A web-based testing platform with 24+ documented attack vectors across 9 categories. You'll attack a real LLM through Lakera's input/output scanners and watch which payloads slip through.

Open repo on GitHub

Lab Guide (PDF)

The Playground guide is your map for this entire step. Keep it open in a second tab while you work through the exercises.

🛡
Lakera Playground Guide
Walkthrough of the split-screen Playground, the attack library, and the multi-vendor benchmark (Lakera vs Azure vs LLM Guard).

One-shot setup

Run these from your terminal:

git clone https://github.com/alshawwaf/Lakera-Demo.git
cd Lakera-Demo
python -m venv venv
# Windows: venv\Scripts\activate
# macOS/Linux: source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env -- add your LAKERA_API_KEY and LLM provider key
python app.py

Open http://127.0.0.1:9000 in your browser. The Playground guide PDF (above) walks you through the UI panel by panel.

Exercise 1 — Test the attack library (20 min)

Open the Playground and run at least five built-in attacks from different categories:

  • Jailbreak (role-play bypass)
  • Prompt injection (instruction override)
  • Data extraction (system-prompt leak)
  • PII extraction
  • Toxic content generation

For each attack, record three things in your notes: caught? (y/n), category Lakera assigned, and confidence score. You'll need this table for Exercise 3.

Exercise 2 — Craft a novel attack (15 min)

Write your own attack prompt that is not in the built-in library. Pick at least one technique:

  • Encode the payload in a non-English language
  • Split the malicious instruction across multiple turns
  • Embed the attack inside a "summarise this document" request (indirect injection)
  • Rewrite a known jailbreak using synonyms / paraphrase

Did the guardrails catch it? Why or why not? If it slipped through, the guide PDF's Detection Methods section explains which scanner type would have caught it.

Exercise 3 — Benchmark and stack (10 min)

Open the Benchmark view in the Playground (Lakera vs Azure Content Safety vs LLM Guard). Answer:

  1. Which attack category has the highest detection rate across all three? The lowest?
  2. If you could only deploy two of the three vendors, which combination maximises coverage? Why?
  3. What would you scan on the output side that input scanners can't catch?
Loading...
Loading...

Think Deeper

You successfully jailbroke the LLM using a role-play scenario. The guardrails didn't catch it. What does this tell you about defense strategy?

No single detection method catches everything. Role-play jailbreaks exploit the model's instruction-following training. Defense should be layered: pattern matching catches known attacks, NLP classifiers catch intent, semantic analysis catches variations, and output scanning catches whatever slipped through input scanning.
Key insight: You just attacked an LLM application. Some attacks got through. This is normal — no single defense layer is perfect. The goal is to make attacks significantly harder, not impossible. Layer multiple detection methods, scan both input AND output, and log everything for audit.

Loading...