⚡ Architecture 01 · Foundation

Simple Chat API

The most fundamental GenAI pattern: a single stateless LLM call with a system prompt. Every generative AI application starts here. Understand request/response flow, prompt design, temperature tuning, token limits, streaming, and error handling.

System Prompts · Temperature & Sampling · Streaming Responses · Error Handling
1

Architecture Overview

The Simple Chat API is the most basic GenAI architecture pattern. It consists of a single, stateless request-response cycle: a user sends a message, the system prepends a system prompt, calls an LLM, and returns the generated text. No memory, no retrieval, no tool use — just a direct conversation turn.

When to Use

  • Single-turn Q&A applications (FAQ bots, helpdesk)
  • Text transformation tasks (summarization, translation, reformatting)
  • Code generation or completion from a single prompt
  • Prototyping and validating prompt designs before adding complexity

Complexity Level

Low. This is the starting point for every GenAI project. If this pattern solves your problem, do not add additional complexity. Many production applications are just well-crafted system prompts with good error handling.

Tip

Start with the simplest architecture that works. You can always layer on memory, RAG, or tool use later; premature complexity is one of the most common mistakes in GenAI engineering.

2

Architecture Diagram

User (sends message) → prompt → API Gateway (+ system prompt; temperature, max_tokens; validation, rate limit) → API call → LLM (generates response) → response (text or stream)

Architecture diagram — Simple Chat API: stateless request-response with system prompt injection

3

Components Deep Dive

💬

System Prompt

Defines the LLM's role, personality, constraints, and output format. This is your primary lever for controlling behavior. Keep it clear, specific, and tested.

🌡

Temperature

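Controls randomness in token selection. 0.0 = most deterministic (factual tasks), though identical outputs are not guaranteed; around 0.7 = creative balance; 1.0 and above = highly creative. The allowed range depends on the provider: Anthropic's API accepts 0.0 to 1.0, while some other providers accept higher values. Always tune for your use case.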
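To make the effect of temperature concrete, the sketch below applies temperature scaling to a toy set of logits before a softmax. The function name `softmax_with_temperature` and the logit values are illustrative assumptions; real providers do this inside the model server and may combine it with other sampling controls such as top-p.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the largest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.2, 0.7, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, which is why low values suit factual tasks and higher values produce more varied output.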

📏

Max Tokens

Upper bound on output length. Set it to cap generation cost and latency. Keep in mind that input tokens + max_tokens must fit within the model's context window.
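The budget rule above (input tokens + max_tokens within the context window) can be checked before a request is sent. This is a minimal sketch: the helper names are hypothetical, the 200,000-token window in the usage lines is an assumed figure, and the input token count is assumed to come from your provider's tokenizer or token-counting endpoint.

```python
def fits_context_window(input_tokens: int, max_tokens: int, context_window: int) -> bool:
    """Return True if the prompt plus the output budget fits in the window."""
    return input_tokens + max_tokens <= context_window

def clamp_max_tokens(input_tokens: int, requested: int, context_window: int) -> int:
    """Shrink the output budget so the request fits; fail if the prompt alone is too long."""
    available = context_window - input_tokens
    if available <= 0:
        raise ValueError("Input alone exceeds the context window")
    return min(requested, available)

CONTEXT_WINDOW = 200_000  # assumed value; check your model's documentation
print(fits_context_window(1_000, 1_024, CONTEXT_WINDOW))
print(clamp_max_tokens(199_500, 1_024, CONTEXT_WINDOW))
```

Clamping is one option; rejecting or truncating the input is another, and which is right depends on whether a shortened answer is acceptable for the use case.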

Streaming

Delivers tokens incrementally, typically via SSE (Server-Sent Events). Cuts perceived latency because the first tokens appear as soon as they are generated rather than after the full response completes. Essential for chat UIs.
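If your own server relays the stream to a browser, each chunk has to be framed as an SSE event. The sketch below shows one way to do that; the `to_sse_events` helper, the JSON payload shape, and the `[DONE]` sentinel are assumptions for illustration, not something the SSE format or any particular provider requires.

```python
import json
from typing import Iterable, Iterator

def to_sse_events(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap each text chunk in a Server-Sent Events frame."""
    for chunk in chunks:
        # JSON-encode the chunk so embedded newlines cannot break SSE framing.
        payload = json.dumps({"text": chunk})
        yield f"data: {payload}\n\n"
    # "[DONE]" is an assumed end-of-stream sentinel, not part of the SSE format.
    yield "data: [DONE]\n\n"

for event in to_sse_events(["Hel", "lo", "!\n"]):
    print(repr(event))
```

Each event ends with a blank line, which is how an SSE client knows a frame is complete.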

🛡

API Gateway

Handles authentication, rate limiting, request validation, and API key management. Sits between the client and the LLM provider to add security and control.
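A gateway's checks can be sketched in a few lines. The example below is an in-memory illustration only: `SlidingWindowLimiter`, `validate_request`, the `MAX_INPUT_CHARS` limit, and the status codes chosen for each failure are assumptions, and a real deployment would usually keep rate-limit state in a shared store so that all gateway instances see the same counts.

```python
import time
from collections import defaultdict, deque

MAX_INPUT_CHARS = 4000  # assumed limit; tune for your model and use case

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds for each API key."""

    def __init__(self, limit: int, window_s: float, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.hits = defaultdict(deque)  # api_key -> timestamps of recent requests

    def allow(self, api_key: str) -> bool:
        now = self.clock()
        recent = self.hits[api_key]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window_s:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True

def validate_request(api_key: str, message: str, valid_keys: set[str],
                     limiter: SlidingWindowLimiter) -> tuple[int, str]:
    """Return an (HTTP status, reason) pair for an incoming chat request."""
    if api_key not in valid_keys:
        return 401, "invalid API key"
    if not message.strip():
        return 400, "empty message"
    if len(message) > MAX_INPUT_CHARS:
        return 413, "message too long"
    if not limiter.allow(api_key):
        return 429, "rate limit exceeded"
    return 200, "ok"

limiter = SlidingWindowLimiter(limit=2, window_s=60)
print(validate_request("demo-key", "Hello", {"demo-key"}, limiter))
```

Cheap checks (auth, length) run before the rate limiter so that rejected requests do not consume a caller's quota.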

Error Handling

Handle rate limits (429), timeouts, malformed responses, and provider outages. Implement retries with exponential backoff and circuit breaker patterns.
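The circuit breaker mentioned above can be sketched as a small wrapper that stops calling a failing provider for a cooldown period. The class below is a simplified illustration with assumed names and defaults (`failure_threshold`, `cooldown_s`); production-grade breakers typically add per-error-type policies and thread safety.

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency until a cooldown has passed."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise RuntimeError("Circuit open: skipping call")
            # Cooldown elapsed: let one trial call through (half-open state).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result

breaker = CircuitBreaker(failure_threshold=3, cooldown_s=30.0)
print(breaker.call(lambda: "provider reachable"))
```

While the circuit is open, callers fail fast instead of piling more requests onto a provider that is already struggling, which pairs naturally with a fallback response.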

4

Implementation

Basic Chat Completion

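import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

def chat(user_message: str, system: str = "You are a helpful assistant.") -> str:
    """Single-turn chat completion."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,   # upper bound on output length
        system=system,     # system prompt is a top-level parameter, not a message
        messages=[{"role": "user", "content": user_message}],
        temperature=0.3,   # low temperature for more consistent answers
    )
    # The response holds a list of content blocks; for a plain reply the first is text.
    return response.content[0].text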

Streaming Response

def chat_stream(user_message: str, system: str = "You are a helpful assistant."):
    """Stream tokens as they are generated."""
    with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": user_message}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)

Error Handling with Retry

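import time
from anthropic import RateLimitError, APITimeoutError

def chat_with_retry(user_message: str, max_retries: int = 3) -> str:
    """Call chat(), retrying rate limits and timeouts with exponential backoff."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return chat(user_message)
        except (RateLimitError, APITimeoutError) as err:
            last_error = err
            if attempt == max_retries - 1:
                break  # no retries left, so skip the final sleep
            wait = 2 ** attempt  # exponential backoff: 1s, 2s, 4s, ...
            kind = "Rate limited" if isinstance(err, RateLimitError) else "Timeout"
            print(f"{kind}. Retrying in {wait}s...")
            time.sleep(wait)
    # Chain the last provider error so the root cause stays visible in tracebacks.
    raise RuntimeError("Max retries exceeded") from last_error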
5

Data Flow

Here is the step-by-step flow of a single request through the Simple Chat API architecture:

  • 1. Client sends request — User message + optional parameters (temperature, max_tokens) via HTTP POST
  • 2. API gateway validates — Check auth token, rate limits, input length, and content policy
  • 3. System prompt prepended — Server-side system prompt is attached to the request, either as a separate system parameter or as the first message depending on the provider (never exposed to client)
  • 4. LLM API called — Request forwarded to the model provider (Anthropic, OpenAI, etc.)
  • 5. Tokens generated — Model generates output tokens autoregressively
  • 6. Response returned — Complete text (or streamed chunks) sent back to client
  • 7. Logging & metrics — Log latency, token counts, and errors for observability
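The steps above can be sketched as a single handler function. This is an illustration under stated assumptions: `handle_chat_request`, the `is_authorized` flag, the response dictionary shape, and the status codes are hypothetical, and the model call is passed in as `call_llm` so the flow can be exercised without a provider account.

```python
import logging
import time
import uuid
from typing import Callable

logger = logging.getLogger("simple_chat")

SYSTEM_PROMPT = "You are a helpful assistant."  # kept server-side, never sent to the client

def handle_chat_request(
    user_message: str,
    is_authorized: bool,
    call_llm: Callable[[str, str], str],
) -> dict:
    """Walk one request through steps 1-7 and return a JSON-serializable reply."""
    request_id = str(uuid.uuid4())
    # Step 2: validate before spending any tokens.
    if not is_authorized:
        return {"request_id": request_id, "status": 401, "error": "unauthorized"}
    if not user_message.strip():
        return {"request_id": request_id, "status": 400, "error": "empty message"}
    started = time.perf_counter()
    try:
        # Steps 3-5: attach the system prompt and call the model provider.
        text = call_llm(SYSTEM_PROMPT, user_message)
    except Exception:
        logger.exception("LLM call failed request_id=%s", request_id)
        return {"request_id": request_id, "status": 502, "error": "upstream failure"}
    latency_ms = (time.perf_counter() - started) * 1000
    # Step 7: record latency for observability.
    logger.info("chat ok request_id=%s latency_ms=%.1f", request_id, latency_ms)
    # Step 6: return the completed text to the client.
    return {"request_id": request_id, "status": 200, "text": text}

# Stand-in for a real provider call, used here only to demonstrate the flow.
reply = handle_chat_request("Hello", True, lambda system, msg: f"echo: {msg}")
print(reply["status"], reply["text"])
```

With the `chat` function from the Implementation section, `call_llm` could be `lambda system, msg: chat(msg, system=system)`.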
6

Trade-offs & Considerations

Advantage | Limitation
Simplest possible architecture | No conversation memory (stateless)
Low latency (single API call) | Cannot access external data or tools
Easy to debug and test | Limited to model's training knowledge
Minimal infrastructure needed | System prompt engineering can be finicky
Low cost per request | No built-in content grounding
When to upgrade

Move to Architecture 02 (Conversational Chatbot) when users need multi-turn context. Move to Architecture 03 (RAG) when the model needs access to your proprietary data.

7

Production Checklist

  • API key rotation and secrets management (e.g., AWS Secrets Manager, GCP Secret Manager)
  • Rate limiting per user/API key to prevent abuse
  • Input validation: max length, content filtering, injection detection
  • Output validation: format checks, PII scanning, toxicity filtering
  • Structured logging: request ID, latency, token usage, model version
  • Retry logic with exponential backoff and circuit breaker
  • Cost monitoring and alerting on token spend
  • Prompt versioning and A/B testing framework
  • Health check endpoint for load balancer
  • Graceful degradation when LLM provider is down
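As one example from the checklist, structured logging can be approximated with the standard library alone. The sketch below emits one JSON object per log line; the `JsonFormatter` class, the field list, and the sample values are illustrative assumptions, and many teams use a dedicated logging library for this instead.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    FIELDS = ("request_id", "latency_ms", "input_tokens", "output_tokens", "model")

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        # Copy over only the known fields that the caller passed via `extra`.
        for field in self.FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("chat_api")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Sample values below are made up for illustration.
log.info("chat_completion", extra={
    "request_id": "req-123",
    "latency_ms": 842,
    "input_tokens": 57,
    "output_tokens": 210,
    "model": "claude-sonnet-4-20250514",
})
```

Logging token counts alongside the request ID also gives cost monitoring a data source, since spend can be aggregated from the same records.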