The Chat API Structure
Every modern LLM chat API -- Claude, OpenAI, Gemini, Ollama -- is built around the same three core parameters. Once you learn them, you can use any provider:
| Parameter | Type | Purpose | Example |
|---|---|---|---|
| `system` | String | Defines the model's role and behaviour (invisible to the end user) | `"You are a SOC analyst. Be concise."` |
| `messages` | List of dicts | The conversation history: alternating user/assistant turns | `[{"role": "user", "content": "Analyse this log..."}]` |
| `max_tokens` | Integer | Hard cap on response length (100 tokens ≈ 75 words) | `200` |
The system prompt shapes every response. The messages list carries context. The max_tokens cap controls cost and latency.
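Before any HTTP machinery gets involved, those three parameters are just plain data. A sketch of a request as a Python dict -- the exact wire format differs slightly per provider, but these three fields always appear in some form:

```python
# Sketch of a chat request as plain data. The log line in the user
# message is an invented example, not from any real dataset.
request = {
    "system": "You are a SOC analyst. Be concise.",   # role and behaviour
    "messages": [                                     # conversation history
        {"role": "user", "content": "Analyse this log: 50 failed root logins"}
    ],
    "max_tokens": 200,                                # response-length cap
}

# Every turn in messages carries exactly a role and content.
for turn in request["messages"]:
    assert turn["role"] in {"user", "assistant"}
```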
Your First API Call
```python
from llm_client import get_client

provider, client = get_client()  # auto-detects your API key
print(f"Using provider: {provider}")

response = client.chat(
    system="You are a cybersecurity analyst. Be concise and technical.",
    messages=[
        {"role": "user", "content": "What is a reverse shell?"}
    ],
    max_tokens=200,
)
print(response)
```
The `llm_client.py` helper wraps Claude, OpenAI, Gemini, and Ollama behind a common interface. Set whichever API key you have and the code works identically.
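Because `messages` carries the full history, a multi-turn conversation is just list maintenance: append the assistant's reply, then the next user turn, and resend the whole list. A minimal sketch -- the real `client.chat` call from above needs an API key, so a stand-in reply is used here:

```python
messages = [{"role": "user", "content": "What is a reverse shell?"}]

# reply = client.chat(system=..., messages=messages, max_tokens=200)
reply = "A reverse shell is an outbound connection from the victim host..."  # stand-in

# Append the assistant turn, then the follow-up user turn.
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "How would I detect one in netflow data?"})

# Most providers require roles to alternate user/assistant.
roles = [m["role"] for m in messages]
print(roles)  # ['user', 'assistant', 'user']
```

The model has no memory between calls: if a turn is missing from the list, the model never saw it.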
Understanding Tokens and Cost
| Provider | Input cost (per 1M tokens) | Output cost (per 1M tokens) | Free tier? |
|---|---|---|---|
| Claude (Sonnet) | $3 | $15 | No |
| OpenAI (GPT-4o) | $2.50 | $10 | No |
| Gemini (Flash) | Free tier | Free tier | Yes |
| Ollama (local) | Free | Free | N/A -- runs locally |
A 1,000-word security report analysis costs on the order of a cent on the paid providers. For development and learning, the cost is negligible. Ollama runs entirely on your machine with no internet required after the initial model download.
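The per-million-token rates in the table translate into real cost with simple arithmetic. A quick estimator, with rates hard-coded from the table and token counts derived from the rough 100-tokens-per-75-words ratio:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate: float, output_rate: float) -> float:
    """Cost in dollars; rates are dollars per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A 1,000-word report is roughly 1,333 input tokens (100 tokens ~ 75 words);
# assume a ~300-token answer. Claude Sonnet rates: $3 in, $15 out.
cost = estimate_cost(1_333, 300, 3.00, 15.00)
print(f"${cost:.4f}")  # under a cent
```

Note that output tokens dominate the bill at these rates, which is another reason to keep `max_tokens` tight.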
Request/Response Flow
Understanding the data flow helps you debug issues and optimise performance:
| Step | What happens | Where |
|---|---|---|
| 1. Build request | Assemble system + messages + max_tokens | Your Python code |
| 2. Send HTTPS | Request travels to provider API (or localhost for Ollama) | Network |
| 3. Tokenise | Provider converts text to token IDs | Provider server |
| 4. Generate | Model predicts tokens one at a time | Provider GPU |
| 5. Return | Response string sent back | Network |
Typical latency: 1-5 seconds for moderate-length responses. Most of the time is spent in step 4 (generation).
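Since generation dominates wall-clock time, timing the call end to end is usually enough to spot latency problems. A minimal sketch using only the standard library; `fake_chat` is a stand-in for the real `client.chat` call:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in for client.chat(...) -- swap in the real call once a key is set.
def fake_chat():
    time.sleep(0.05)  # simulate generation latency
    return "response text"

reply, elapsed = timed(fake_chat)
print(f"{elapsed:.2f}s")
```

If a response feels slow, check whether a large `max_tokens` is letting the model generate far longer answers than you need.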
Think Deeper
Make an API call with `max_tokens=10`, then the same prompt with `max_tokens=500`. How does the response change? What happens to the cost?