Architecture Overview
The Simple Chat API is the most basic GenAI architecture pattern. It consists of a single, stateless request-response cycle: a user sends a message, the system prepends a system prompt, calls an LLM, and returns the generated text. No memory, no retrieval, no tool use — just a direct conversation turn.
When to Use
- Single-turn Q&A applications (FAQ bots, helpdesk)
- Text transformation tasks (summarization, translation, reformatting)
- Code generation or completion from a single prompt
- Prototyping and validating prompt designs before adding complexity
Complexity Level
Low. This is the starting point for every GenAI project. If this pattern solves your problem, do not add complexity. Many production applications are just well-crafted system prompts with good error handling.
Start with the simplest architecture that works. You can always layer on memory, RAG, or tool use later, whereas premature complexity is one of the most common and costly mistakes in GenAI engineering.
Architecture Diagram
Architecture diagram — Simple Chat API: stateless request-response with system prompt injection
Components Deep Dive
System Prompt
Defines the LLM's role, personality, constraints, and output format. This is your primary lever for controlling behavior. Keep it clear, specific, and tested.
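One way to keep a system prompt clear and testable is to assemble it from named parts rather than editing one long string. The sketch below is illustrative only: the function name `build_system_prompt`, its parameters, and the example wording are assumptions, not part of any provider SDK.

```python
def build_system_prompt(role: str, constraints: list[str], output_format: str) -> str:
    """Assemble a system prompt from explicit, separately reviewable parts."""
    lines = [f"You are {role}.", "", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines += ["", f"Output format: {output_format}"]
    return "\n".join(lines)


# Hypothetical helpdesk example.
prompt = build_system_prompt(
    role="a support assistant for an internal IT helpdesk",
    constraints=["Answer only IT questions.", "Say so when you do not know."],
    output_format="At most three short sentences of plain text.",
)
print(prompt)
```

Keeping role, constraints, and format separate makes it easier to change one of them and re-run the same test prompts against the result.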
Temperature
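Controls randomness in token selection. 0.0 = near-deterministic (factual tasks), around 0.7 = a balance of consistency and variety, values at the top of the range = highly varied output. The accepted range depends on the provider: the Anthropic API takes 0.0 to 1.0, while the OpenAI API takes 0.0 to 2.0. Even at 0.0, identical outputs across calls are not guaranteed. Always tune for your use case.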
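To make the effect concrete, the sketch below applies temperature scaling to a made-up set of logits for three candidate tokens. The logit values and the function name `softmax_with_temperature` are assumptions for illustration; providers apply this step internally and may combine it with other sampling controls.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw logits to probabilities, dividing by temperature first."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the top logit.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]
for t in (0.0, 0.3, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token; higher temperatures flatten the distribution so less likely tokens are sampled more often.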
Max Tokens
Upper bound on output length. Set this to avoid runaway generation costs. Note that input tokens + max_tokens must fit within the model's context window, and that output which reaches the limit is cut off mid-response, so leave some headroom.
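The budget constraint can be checked before the request is sent. The helper below is a minimal sketch: `clamp_max_tokens` is a hypothetical name, the context-window size is an illustrative number, and `input_tokens` is assumed to come from your provider's token-counting facility.

```python
def clamp_max_tokens(input_tokens: int, requested_max_tokens: int, context_window: int) -> int:
    """Return the largest max_tokens value that still fits in the context window."""
    remaining = context_window - input_tokens
    if remaining <= 0:
        raise ValueError("Prompt alone exceeds the context window")
    return min(requested_max_tokens, remaining)


# Illustrative numbers only; check your model's documented limits.
print(clamp_max_tokens(input_tokens=199_500, requested_max_tokens=1024, context_window=200_000))  # -> 500
print(clamp_max_tokens(input_tokens=1_200, requested_max_tokens=1024, context_window=200_000))  # -> 1024
```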
Streaming
Delivers tokens incrementally via SSE (Server-Sent Events). Reduces perceived latency: the user sees the first token as soon as it is generated, often well under a second, instead of waiting several seconds for the full completion. Essential for chat UIs.
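On the server side, relaying a stream to a browser mostly means wrapping each chunk in the SSE wire format (a `data:` line followed by a blank line). The generator below is a sketch under assumptions: the function name `to_sse`, the JSON payload shape, and the `done` event are illustrative choices that the client would need to agree on.

```python
import json
from typing import Iterable, Iterator


def to_sse(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap text chunks as Server-Sent Events frames."""
    for chunk in chunks:
        # JSON-encode so newlines inside a chunk cannot break the framing.
        payload = json.dumps({"text": chunk})
        yield f"data: {payload}\n\n"
    # Hypothetical end-of-stream marker.
    yield "event: done\ndata: {}\n\n"


frames = list(to_sse(["Hel", "lo"]))
print(frames[0], end="")
```

A web framework would send these frames with the `text/event-stream` content type, feeding `to_sse` from the model's text stream.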
API Gateway
Handles authentication, rate limiting, request validation, and API key management. Sits between the client and the LLM provider to add security and control.
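Rate limiting is often implemented as a token bucket per API key. The class below is a minimal in-memory sketch, assuming a single process; the name `TokenBucket` and its parameters are hypothetical, and a real gateway would typically keep this state in a shared store.

```python
import time


class TokenBucket:
    """Per-key rate limiter: `capacity` burst, refilled at `rate` tokens per second."""

    def __init__(self, capacity: int, rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock  # injectable so tests can control time
        self.buckets: dict[str, tuple[float, float]] = {}  # key -> (tokens, last_seen)

    def allow(self, api_key: str) -> bool:
        now = self.clock()
        tokens, last = self.buckets.get(api_key, (float(self.capacity), now))
        # Refill in proportion to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[api_key] = (tokens, now)
            return False
        self.buckets[api_key] = (tokens - 1, now)
        return True


limiter = TokenBucket(capacity=2, rate=1.0)
print([limiter.allow("key-a") for _ in range(3)])
```

When `allow` returns False, the gateway would normally respond with HTTP 429 instead of forwarding the request to the LLM provider.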
Error Handling
Handle rate limits (429), timeouts, malformed responses, and provider outages. Implement retries with exponential backoff and circuit breaker patterns.
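A circuit breaker complements retries: after repeated failures it stops calling the provider for a cooldown period instead of adding load to an outage. The class below is a simplified sketch; the name `CircuitBreaker`, the default thresholds, and the single-trial half-open behavior are assumptions, not a specific library's API.

```python
import time


class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a trial call after `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("Circuit open: skipping call to provider")
            # Cooldown elapsed: half-open, so let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap the provider call, for example `breaker.call(chat, "Hello")`, and serve a fallback message when the breaker raises.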
Implementation
Basic Chat Completion
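import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()


def chat(user_message: str, system: str = "You are a helpful assistant.") -> str:
    """Single-turn chat completion."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=system,  # the system prompt is a top-level parameter here
        messages=[{"role": "user", "content": user_message}],
        temperature=0.3,
    )
    # The reply is a list of content blocks; the first one holds the text.
    return response.content[0].text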
Streaming Response
def chat_stream(user_message: str, system: str = "You are a helpful assistant.") -> None:
    """Stream tokens as they are generated."""
    with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": user_message}],
    ) as stream:
        # text_stream yields text deltas as they arrive.
        for text in stream.text_stream:
            print(text, end="", flush=True)
Error Handling with Retry
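import random
import time

from anthropic import APITimeoutError, RateLimitError


def chat_with_retry(user_message: str, max_retries: int = 3) -> str:
    """Call chat(), retrying rate limits and timeouts with exponential backoff."""
    # Note: the SDK client also retries some errors itself (see its max_retries option).
    for attempt in range(max_retries):
        try:
            return chat(user_message)
        except (RateLimitError, APITimeoutError) as exc:
            if attempt == max_retries - 1:
                # Out of attempts: surface the original error to the caller.
                raise RuntimeError("Max retries exceeded") from exc
            wait = 2 ** attempt + random.random()  # exponential backoff plus jitter
            print(f"{type(exc).__name__}. Retrying in {wait:.1f}s...")
            time.sleep(wait)
    raise ValueError("max_retries must be at least 1")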
Data Flow
Here is the step-by-step flow of a single request through the Simple Chat API architecture:
1. Client sends request: user message plus optional parameters (temperature, max_tokens) via HTTP POST
2. API gateway validates: check auth token, rate limits, input length, and content policy
3. System prompt attached: the server-side system prompt is added to the request, either as a leading system message or as a separate system parameter (as in the Anthropic example above), and is never exposed to the client
4. LLM API called: request forwarded to the model provider (Anthropic, OpenAI, etc.)
5. Tokens generated: model generates output tokens autoregressively
6. Response returned: complete text (or streamed chunks) sent back to the client
7. Logging & metrics: log latency, token counts, and errors for observability
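The steps above can be sketched as a single handler function. This is a simplified illustration, not a full service: `handle_request`, `MAX_INPUT_CHARS`, and the `fake_llm` stand-in are hypothetical names, authentication and rate limiting are assumed to happen in the gateway, and the provider call is injected so the example runs offline.

```python
import time
import uuid
from typing import Callable

SYSTEM_PROMPT = "You are a helpful assistant."  # kept server-side
MAX_INPUT_CHARS = 4000  # hypothetical input limit


def handle_request(user_message: str, call_llm: Callable[[str, str], str]) -> dict:
    """Run one stateless turn: validate, attach system prompt, call model, log."""
    request_id = str(uuid.uuid4())
    # Step 2: validate input before spending tokens.
    if not user_message.strip():
        return {"request_id": request_id, "error": "empty message"}
    if len(user_message) > MAX_INPUT_CHARS:
        return {"request_id": request_id, "error": "message too long"}
    # Steps 3-6: attach the system prompt server-side, then call the model.
    started = time.perf_counter()
    text = call_llm(SYSTEM_PROMPT, user_message)
    latency_ms = (time.perf_counter() - started) * 1000
    # Step 7: emit metrics (printed here; a real service would use a logger).
    print(f"request_id={request_id} latency_ms={latency_ms:.1f} chars_out={len(text)}")
    return {"request_id": request_id, "text": text}


def fake_llm(system: str, user: str) -> str:
    """Stand-in for the provider call so the sketch runs without network access."""
    return f"echo: {user}"


print(handle_request("What is SSE?", fake_llm)["text"])
```

In a real deployment, `call_llm` would be a thin wrapper around the provider SDK, such as the `chat` function shown earlier with its arguments reordered.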
Trade-offs & Considerations
| Advantage | Limitation |
|---|---|
| Simplest possible architecture | No conversation memory (stateless) |
| Low latency (single API call) | Cannot access external data or tools |
| Easy to debug and test | Limited to model's training knowledge |
| Minimal infrastructure needed | System prompt engineering can be finicky |
| Low cost per request | No built-in content grounding |
Move to Architecture 02 (Conversational Chatbot) when users need multi-turn context. Move to Architecture 03 (RAG) when the model needs access to your proprietary data.
Production Checklist
- API key rotation and secrets management (e.g., AWS Secrets Manager, GCP Secret Manager)
- Rate limiting per user/API key to prevent abuse
- Input validation: max length, content filtering, injection detection
- Output validation: format checks, PII scanning, toxicity filtering
- Structured logging: request ID, latency, token usage, model version
- Retry logic with exponential backoff and circuit breaker
- Cost monitoring and alerting on token spend
- Prompt versioning and A/B testing framework
- Health check endpoint for load balancer
- Graceful degradation when LLM provider is down
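As an illustration of the structured-logging item, the sketch below serializes one request's metrics as a single JSON line. The function name `build_log_record` and the field names are assumptions; adapt them to whatever schema your log pipeline expects.

```python
import json
import time
from typing import Optional


def build_log_record(request_id: str, model: str, latency_ms: float,
                     input_tokens: int, output_tokens: int,
                     error: Optional[str] = None) -> str:
    """Serialize one request's metrics as a single JSON log line."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "model": model,
        "latency_ms": round(latency_ms, 1),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "status": "error" if error else "ok",
    }
    if error:
        record["error"] = error
    return json.dumps(record, sort_keys=True)


# Hypothetical values for one successful request.
line = build_log_record("req-123", "claude-sonnet-4-20250514", 842.37, 57, 210)
print(line)
```

One JSON object per line keeps the logs easy to parse for cost monitoring and latency dashboards.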