Chat Completions Interface
The Chat Completions interface is the universal contract between your application code and a large language model. Regardless of whether you are working with OpenAI's GPT-4o, Anthropic's Claude Sonnet, Google's Gemini, or an open-source model hosted on your own infrastructure, the fundamental interaction pattern is identical: you construct an array of messages that represent a conversation, you send that array along with configuration parameters to an HTTP endpoint, and you receive back a response containing the model's generated text plus metadata about the generation process. Understanding this interface at a deep level is not optional for production work. It is the foundation on which every other concept in this module is built, and misunderstanding any part of it will create subtle bugs that are extremely difficult to diagnose later.
The reason this interface exists as a "chat" format rather than a simple prompt-in, text-out format is both historical and deeply practical. Early LLM APIs from the GPT-3 era used a simple completions endpoint where you sent a single string and received a continuation. This created an enormous number of problems in practice: developers had to manually format multi-turn conversations by concatenating strings with hand-chosen delimiters, there was no clean way to distinguish between instructions and user input, and prompt injection attacks were trivially easy because everything was one undifferentiated blob of text. The chat format solves all of these problems by giving each message a role that the model can treat differently during inference, establishing clear semantic boundaries between different types of input.
When you make a chat completions request, you are providing a structured context window that the model uses to understand its role, the conversation history, and what is expected of it. The model processes the entire messages array as a single input sequence, applies its attention mechanism across all the tokens in all the messages, and generates a response token by token. The distinction between roles is typically encoded as special tokens in the model's vocabulary via its chat template, which means the model genuinely treats system, user, and assistant messages differently during the attention computation. In most models, a system message carries more persistent instructional weight than a user message, which is why behavioral constraints belong there.
The quality of your results depends enormously on how you construct your messages array. A sloppy messages array with vague system prompts, missing conversation history, or poorly structured user messages will produce dramatically worse results than a carefully constructed one, even with the same model and the same underlying question. This is why prompt engineering is an entire module unto itself. But before you can engineer prompts effectively, you need to understand the mechanical interface through which those prompts are delivered, which is exactly what this section covers in detail.
One crucial aspect that beginners overlook is that LLM APIs are inherently stateless. Each request is completely independent and the provider maintains zero conversation state between requests. Your application code is entirely responsible for maintaining the conversation history and sending the relevant portions with each request. For short conversations this is trivial, but for long conversations or multi-session applications, you need strategies for persistence in a database, truncation when approaching context limits, and summarization to compress old messages while preserving essential context.
The Messages Array and Role System
The messages array is an ordered list of message objects, each containing at minimum a role field and a content field. The role is one of three values in the OpenAI convention: system, user, or assistant. Anthropic uses a slightly different convention where the system message is a separate top-level parameter rather than a member of the messages array, but the conceptual model is identical. The system role sets the persona, behavioral constraints, and task instructions for the model. The user role represents the human or the application acting on behalf of the human. The assistant role represents previous model responses, which you include in multi-turn conversations so the model has context about what it already said and can maintain consistency across turns.
The ordering of messages matters critically. Messages are processed sequentially as if they were a transcript of a conversation, and the model generates its response as the next turn after the final message in the array. If you want to implement a multi-turn chatbot, you maintain the full conversation history by appending each new user message and each model response to the array, then sending the entire growing array with each request. This means the token cost of a long conversation grows with the number of turns, which is why context window management and summarization strategies become important for production applications that need to support long conversations without exceeding context limits or running up excessive costs.
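A minimal sketch of this append-and-resend loop. The `call_llm` function here is a hypothetical stand-in for a real SDK call; only the history-management logic is the point:

```python
def call_llm(messages: list[dict]) -> str:
    """Stand-in for a real chat completions call; returns the assistant text."""
    return f"(reply to: {messages[-1]['content']})"

def chat_turn(history: list[dict], user_input: str) -> str:
    """Append the user message, call the model, append and return its reply."""
    history.append({"role": "user", "content": user_input})
    reply = call_llm(history)
    history.append({"role": "assistant", "content": reply})
    return reply

# Each turn grows the array by two messages, which is why token cost
# grows with the number of turns.
history = [{"role": "system", "content": "You are a helpful assistant."}]
chat_turn(history, "Hi")
chat_turn(history, "Tell me more")
```

Note that the entire growing `history` list is what gets sent on every request; nothing is stored server-side.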
The content field can be a simple string for text-only messages, but modern APIs support multimodal content where the content is an array of content blocks. For example, in OpenAI's API you can send an image by providing a content array with a text block and an image_url block. In Anthropic's API, you can send images as base64-encoded content blocks or as URL references. This multimodal capability is what powers vision-language applications like document analysis, image captioning, and visual question answering. The content block format also enables structured content like tool use results, which we cover in the structured outputs section later in this module.
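As a sketch of the OpenAI-style multimodal shape, the helper below builds a user message with a text block and a base64 data-URL image block. The helper name is ours; the block field names follow OpenAI's documented content format, but verify details against current docs:

```python
import base64

def image_message(text: str, image_bytes: bytes,
                  media_type: str = "image/png") -> dict:
    """Build an OpenAI-style multimodal user message: one text block
    plus one base64 data-URL image block."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": f"data:{media_type};base64,{b64}"}},
        ],
    }

msg = image_message("What is in this image?", b"\x89PNG...")
```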
A subtle but important detail is that the system message is not just a suggestion. It establishes the model's behavioral frame for the entire conversation. Research and practical experience consistently show that detailed, specific system messages produce dramatically better results than vague ones. A system message like "You are a helpful assistant" is almost useless. A system message like "You are a senior Python developer reviewing code for security vulnerabilities. Report each finding with severity, the file and line number, the vulnerability type by OWASP category, and a specific remediation with code example. If no vulnerabilities are found, say so explicitly." will produce focused, structured, actionable output. The specificity of your system message is the single highest-leverage knob you have for controlling output quality.
One advanced pattern worth understanding is few-shot examples within the messages array. Instead of describing what you want in the system message alone, you demonstrate it by including example user/assistant message pairs before the actual user query. You include a user message with an example input and an assistant message with the ideal output format, followed by the real user message. The model sees these examples and mimics the pattern. This technique is extraordinarily powerful for getting consistent output formats, and it works because the model treats the example assistant messages as if it had actually generated them, creating a strong pattern for continuation that the model will follow faithfully.
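The few-shot pattern can be packaged as a small builder. Everything here is illustrative (the function and example labels are ours), but the resulting array is exactly the shape described above:

```python
def few_shot_messages(system: str, examples: list[tuple[str, str]],
                      query: str) -> list[dict]:
    """Build a messages array with example user/assistant pairs
    placed before the real user query."""
    messages = [{"role": "system", "content": system}]
    for example_input, ideal_output in examples:
        messages.append({"role": "user", "content": example_input})
        # The model treats this as its own prior output, anchoring the format.
        messages.append({"role": "assistant", "content": ideal_output})
    messages.append({"role": "user", "content": query})
    return messages

msgs = few_shot_messages(
    "Classify sentiment as POSITIVE or NEGATIVE.",
    [("I love this!", "POSITIVE"), ("Terrible service.", "NEGATIVE")],
    "The food was amazing.",
)
```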
Another important consideration for production applications is the handling of message truncation when conversations grow long. Each model has a finite context window, and if your messages array exceeds that window, the API will return an error. Strategies for managing this include: sliding window truncation where you keep only the most recent N messages, summary-based compression where you periodically summarize older messages into a condensed form, and semantic truncation where you selectively remove messages that are least relevant to the current query. The right strategy depends on your application, but every production chatbot needs one of these approaches.
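Of the three strategies, sliding window truncation is the simplest to sketch. One subtlety worth encoding: the system message should survive truncation even though it is the oldest message:

```python
def sliding_window(messages: list[dict], max_messages: int) -> list[dict]:
    """Keep the system message(s) plus the most recent max_messages turns.
    A real implementation would budget by tokens rather than message count."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]

history = [{"role": "system", "content": "Be concise."}] + [
    {"role": "user", "content": str(i)} for i in range(10)
]
trimmed = sliding_window(history, 4)
```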
The practical implication of the stateless design is that every single API call must be entirely self-contained: the messages array must contain everything the model needs to understand the conversation context and generate an appropriate response. You cannot rely on the model remembering previous calls because there is no memory between calls whatsoever. This design choice was deliberate because it makes the API simple, horizontally scalable, and deterministic for a given input, but it places the entire burden of state management on the application developer. Later modules will cover how frameworks like LangGraph manage this state automatically through checkpointed conversation graphs.
Request and Response Schema
The request to a chat completions endpoint contains several important fields beyond the messages array. The model field specifies which model to use, such as gpt-4o or claude-sonnet-4-20250514. The temperature parameter controls randomness in the output: a value of 0.0 produces nearly deterministic output by always choosing the highest-probability token, while 1.0 or higher produces more creative and varied output. For factual tasks, code generation, and structured outputs, use temperature 0.0. For creative writing and brainstorming, use 0.7 to 1.0. The max_tokens field caps the length of the generated response. This is a hard limit, not a target, so always set it generously to avoid truncation of responses that the model would otherwise complete naturally.
The top_p parameter, also known as nucleus sampling, is an alternative to temperature for controlling randomness. It limits the model to considering only the smallest set of tokens whose cumulative probability exceeds the threshold. For example, a top_p of 0.1 means the model only considers tokens in the top 10% of the probability mass. In practice, most developers use either temperature or top_p but not both simultaneously, as their effects overlap and combining them can produce unpredictable behavior. OpenAI's documentation explicitly recommends adjusting one and leaving the other at its default value. The frequency_penalty and presence_penalty parameters, which are OpenAI-specific, penalize token repetition and are useful for reducing the verbatim repetition that models sometimes exhibit in long outputs.
The response object from a chat completions call contains the generated message along with critical metadata. The choices array contains one or more response options, usually just one unless you set n greater than 1 to request multiple completions. Each choice contains a message object with the assistant's response and a finish_reason field. The finish_reason is essential for understanding why the model stopped generating: stop means the model naturally completed its response, length means it hit the max_tokens limit and your response is truncated, tool_calls means the model wants to invoke a tool, and content_filter means the response was blocked by safety filters. Always check finish_reason in production code because a length finish_reason means you are losing information.
The usage field in the response reports token consumption: prompt_tokens for input token count, completion_tokens for generated token count, and total_tokens for the sum. This is the basis for cost calculation since providers charge per token. Understanding token counts is critical for budgeting. A detailed system prompt might use 500 tokens, a conversation history might use 3000 tokens, and a long response might use 2000 tokens. At GPT-4o pricing of $2.50 and $10.00 per million input and output tokens respectively, that single request costs roughly three cents. These numbers add up very quickly at scale.
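The cost arithmetic from this paragraph can be captured in a small helper (prices are the GPT-4o figures quoted above; check current pricing pages before relying on them):

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request, given per-million-token prices."""
    return (prompt_tokens * input_price_per_m
            + completion_tokens * output_price_per_m) / 1_000_000

# The example above: 500 + 3000 input tokens, 2000 output tokens
cost = request_cost(3500, 2000, 2.50, 10.00)  # 0.02875, roughly three cents
```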
An often-overlooked aspect of the response schema is the id field, which provides a unique identifier for the completion request. This identifier is invaluable for debugging and audit logging. If a user reports a problematic response, you can use the completion ID to trace back to the exact request. Similarly, the created timestamp tells you when the response was generated, and the model field in the response confirms which model actually served the request, which may differ from what you requested if a model alias resolves to a specific snapshot version.
For Anthropic's Messages API, the response schema has notable differences. Instead of a choices array, Anthropic returns a top-level content array containing content blocks. Each block has a type, usually text but possibly tool_use, and the corresponding data. The stop_reason field serves the same purpose as OpenAI's finish_reason but uses different values: end_turn for natural completion, max_tokens for truncation, and tool_use when the model wants to call a tool. Anthropic also returns an id starting with msg_ and a usage object with input_tokens and output_tokens fields. Understanding these schema differences is essential when writing provider-agnostic code, which is exactly what LiteLLM in Section 5 solves.
One nuanced difference between providers relates to reasoning capabilities. Anthropic's Claude models support an explicit thinking feature where you can enable extended thinking with a budget of thinking tokens. The model produces a visible thinking content block before its final response. OpenAI's equivalent is the o1 and o3 reasoning models which perform chain-of-thought internally but do not expose the reasoning tokens to the caller by default. This architectural difference affects how you design prompts for complex reasoning tasks and how you budget for token costs, because reasoning tokens count toward your usage even when the content is hidden.
Temperature and sampling parameter ranges also differ subtly between providers. OpenAI's temperature parameter ranges from 0 to 2, while Anthropic's ranges from 0 to 1. Both default to approximately 1.0, but the effective randomness at the same numeric value differs between models because the underlying probability distributions are calibrated differently. In practice, temperature 0 for deterministic tasks and around 0.3 to 0.7 for creative tasks works reasonably well across both providers. However, if you are doing rigorous A/B testing between providers, be aware that temperature values are not directly comparable across different model families.
Authentication is handled via bearer tokens for both providers but with different header conventions. OpenAI uses the standard Authorization: Bearer sk-... header while Anthropic uses a custom x-api-key: sk-ant-... header. Anthropic also requires an anthropic-version header specifying the API version. Both the OpenAI and Anthropic SDKs abstract these differences completely, reading the API key from environment variables and setting the correct headers automatically. This is one of the primary reasons to use the SDKs rather than making raw HTTP calls.
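The SDKs set these headers for you, so the sketch below is purely to make the two conventions concrete; the `anthropic-version` value is a known version string but treat it as an assumption to check against current docs:

```python
def openai_headers(api_key: str) -> dict:
    """Standard bearer-token header convention used by OpenAI."""
    return {"Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"}

def anthropic_headers(api_key: str, version: str = "2023-06-01") -> dict:
    """Anthropic's custom api-key header plus required version header."""
    return {"x-api-key": api_key,
            "anthropic-version": version,
            "Content-Type": "application/json"}
```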
OpenAI vs Anthropic: Key Differences
The most visible structural difference between the two APIs is system message handling. OpenAI places the system message as the first element in the messages array with role set to system. Anthropic extracts it to a separate top-level system parameter. Anthropic's rationale is that the system prompt is fundamentally a configuration parameter for the model's behavior rather than a conversational turn, and separating it makes the API semantics cleaner. In practice this means provider-switching code must handle the system message extraction, not just endpoint and authentication changes.
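The extraction step is mechanical. A sketch (the helper name is ours) that converts an OpenAI-style array into Anthropic's `(system, messages)` shape:

```python
def to_anthropic(openai_messages: list[dict]) -> tuple[str, list[dict]]:
    """Split an OpenAI-style messages array into Anthropic's separate
    system parameter and system-free messages array."""
    system_parts = [m["content"] for m in openai_messages
                    if m["role"] == "system"]
    rest = [m for m in openai_messages if m["role"] != "system"]
    return "\n\n".join(system_parts), rest

system, msgs = to_anthropic([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
])
```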
Error handling patterns differ as well. OpenAI returns standard HTTP error codes with a JSON error body containing error.type, error.message, and error.code. Anthropic returns errors with a type field in the body such as invalid_request_error, authentication_error, rate_limit_error, or overloaded_error. Rate limiting headers differ too: OpenAI uses x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens, while Anthropic uses anthropic-ratelimit-requests-remaining and anthropic-ratelimit-tokens-remaining. Understanding these differences is critical for implementing proper retry logic and rate limit management in production systems that call both providers.
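A sketch of reading the remaining budget from response headers, using the header names described above (the helper and its defaults are ours):

```python
def remaining_budget(headers: dict, provider: str) -> tuple[int, int]:
    """Return (remaining_requests, remaining_tokens) from rate-limit
    response headers, per the provider's header convention."""
    if provider == "openai":
        req_key = "x-ratelimit-remaining-requests"
        tok_key = "x-ratelimit-remaining-tokens"
    else:  # anthropic
        req_key = "anthropic-ratelimit-requests-remaining"
        tok_key = "anthropic-ratelimit-tokens-remaining"
    return int(headers.get(req_key, 0)), int(headers.get(tok_key, 0))
```

Feeding these numbers into a client-side throttle lets you slow down before the provider starts returning 429s.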
Multimodal content handling also differs between providers. OpenAI expects image content as a content array with objects containing type image_url and a URL or base64 data URL. Anthropic expects images as content blocks with type image containing a source object that specifies media type and data. Token counting for images differs too: OpenAI uses a tile-based system where cost depends on resolution, while Anthropic charges based on image dimensions with a specific formula. These differences mean that provider-switching code must handle content format transformation, not just endpoint routing.
One important area where the providers diverge is in their approach to structured outputs. OpenAI offers a native response_format parameter with json_schema mode that constrains generation at the token level. Anthropic achieves structured outputs through the tool use mechanism, where you define a tool with an input schema and the model generates structured JSON to call that tool. Both approaches are effective, but the implementation code is quite different. The instructor library, covered in Section 6, abstracts these differences and provides a unified Pydantic-based interface for structured outputs across both providers.
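For reference, a sketch of the OpenAI-side `response_format` payload. The shape below follows OpenAI's documented json_schema mode, but treat the exact field names as an assumption to verify against current docs:

```python
def json_schema_format(name: str, schema: dict) -> dict:
    """Build an OpenAI-style response_format payload for
    schema-constrained generation."""
    return {"type": "json_schema",
            "json_schema": {"name": name, "strict": True, "schema": schema}}

fmt = json_schema_format("finding", {
    "type": "object",
    "properties": {"severity": {"type": "string"}},
    "required": ["severity"],
    "additionalProperties": False,
})
```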
Pricing structures also differ in ways that affect architecture decisions. Anthropic's prompt caching provides a 90% discount on cached input tokens with explicit cache control annotations, while OpenAI's automatic caching provides a 50% discount for sequences over 1024 tokens with no code changes required. Anthropic's models generally have larger context windows (200K tokens for Claude versus 128K for GPT-4o) but higher per-token prices for output. These tradeoffs mean that the optimal provider choice depends heavily on your specific workload: long-context tasks with heavy prompt reuse favor Anthropic, while high-volume simple tasks favor OpenAI's cheaper models.
The streaming implementations also differ in meaningful ways. OpenAI streams individual token deltas through the choices array's delta field, with the stream terminated by a data: [DONE] sentinel. Anthropic uses a richer event-based streaming protocol with distinct event types: message_start, content_block_start, content_block_delta, content_block_stop, and message_stop. The Anthropic approach provides more granular lifecycle information but requires more complex parsing logic. Both SDKs abstract these differences behind iterator interfaces, but if you need to implement custom stream processing, understanding the underlying protocol for each provider is essential.
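To make the Anthropic event lifecycle concrete, a sketch that dispatches over simulated events (plain dicts standing in for the SDK's typed event objects):

```python
def accumulate_anthropic(events: list[dict]) -> str:
    """Accumulate text from Anthropic-style stream events,
    simulated here as plain dicts."""
    buffer = []
    for event in events:
        if event["type"] == "content_block_delta":
            buffer.append(event["delta"].get("text", ""))
        elif event["type"] == "message_stop":
            break  # end of stream
    return "".join(buffer)

events = [
    {"type": "message_start"},
    {"type": "content_block_start"},
    {"type": "content_block_delta", "delta": {"text": "Hel"}},
    {"type": "content_block_delta", "delta": {"text": "lo"}},
    {"type": "content_block_stop"},
    {"type": "message_stop"},
]
```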
# OpenAI Chat Completions request
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from env

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a senior Python developer."},
        {"role": "user", "content": "Explain decorators with an example."},
    ],
    temperature=0.0,
    max_tokens=1024,
)

text = response.choices[0].message.content
reason = response.choices[0].finish_reason  # "stop" | "length" | "tool_calls"
usage = response.usage  # prompt_tokens, completion_tokens
# --- Anthropic Messages request ---
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    system="You are a senior Python developer.",  # separate param
    messages=[
        {"role": "user", "content": "Explain decorators with an example."},
    ],
    temperature=0.0,
    max_tokens=1024,
)

text = message.content[0].text
reason = message.stop_reason  # "end_turn" | "max_tokens" | "tool_use"
usage = message.usage  # input_tokens, output_tokens
The chat completions interface is conceptually identical across providers: messages array plus configuration in, generated text plus metadata out. The differences are in system message placement, response schema field names, and authentication headers. LiteLLM (Section 5) normalizes all of these differences behind a unified interface.
Provider SDKs
While you could interact with LLM APIs using raw HTTP requests via httpx or requests, nobody does this in production. Provider SDKs handle authentication, request construction, response parsing, automatic retries, connection pooling, streaming, and type safety. The three most important SDKs for this course are the OpenAI Python SDK, the Anthropic Python SDK, and LiteLLM which provides a unified interface across all providers. Understanding all three gives you the flexibility to use the right tool for each situation: the native SDKs when you need provider-specific features, and LiteLLM when you want portability across providers without rewriting your application code.
The OpenAI Python SDK is the most widely used LLM SDK in the ecosystem. It is a thin, well-typed wrapper around the OpenAI REST API that uses Pydantic models for all request and response types. The SDK handles authentication by reading the OPENAI_API_KEY environment variable or accepting it as a constructor parameter. It constructs the proper HTTP request with the correct headers and endpoint, parses the JSON response into typed Python objects, and provides both synchronous and asynchronous clients via the OpenAI and AsyncOpenAI classes. The typed response objects mean you get IDE autocompletion and type checking for every field, which dramatically reduces bugs compared to working with raw dictionaries.
The Anthropic Python SDK follows a very similar design philosophy. It provides Anthropic and AsyncAnthropic clients, reads ANTHROPIC_API_KEY from the environment, and returns typed response objects with full IDE support. The SDK handles the Anthropic-specific authentication header and version header automatically. One notable feature of the Anthropic SDK is its built-in streaming support through the with client.messages.stream() context manager, which provides a convenient iterator interface with separate streams for text content and tool use content, making it easier to build applications that process streamed responses.
LiteLLM takes a fundamentally different approach: instead of being a single provider's SDK, it acts as a translation layer that converts a single unified API call into the appropriate provider-specific call. You call litellm.completion() with an OpenAI-compatible messages format, and LiteLLM translates the request to whatever provider you specify via the model name. For example, model gpt-4o routes to OpenAI, model anthropic/claude-sonnet-4-20250514 routes to Anthropic, and model groq/llama-3.1-70b-versatile routes to Groq. This unified interface is extraordinarily valuable for production systems that need to switch between providers without rewriting application code.
Each SDK has its own error hierarchy that you must understand for robust error handling. The OpenAI SDK raises openai.APIError as the base exception, with specific subclasses like RateLimitError for 429 responses, AuthenticationError for 401 responses, BadRequestError for 400 responses, and APIConnectionError for network failures. The Anthropic SDK similarly raises anthropic.APIError with subclasses for each error type. LiteLLM wraps provider-specific errors into its own exception types but also surfaces the underlying provider error for debugging. Catching and handling specific exception types allows you to implement targeted retry logic for different failure modes.
Production LLM applications need robust retry logic because LLM APIs are inherently unreliable at scale. Rate limiting is the most common transient failure, as providers limit requests per minute and tokens per minute, and bursts of traffic will inevitably hit these limits. Server errors occur during provider outages or high-load periods. Network errors occur due to DNS issues, connection timeouts, or TLS handshake failures. The tenacity library is the standard Python solution for implementing retry logic with exponential backoff, jitter, and configurable retry conditions. Both the OpenAI and Anthropic SDKs include some built-in retry logic configurable via the max_retries parameter, but for production systems you often want more control over different retry strategies for different error types.
An important consideration when choosing between native SDKs and LiteLLM is feature completeness versus portability. Native SDKs always support the latest provider features immediately, while LiteLLM may lag behind on newly released features. If you need Anthropic's extended thinking, OpenAI's structured output mode, or any other cutting-edge provider feature, the native SDK is the safest choice. However, LiteLLM covers the vast majority of common use cases well, and the productivity gain from having a single interface that works across all providers is enormous for teams that need multi-provider support.
Connection management is another reason to use SDKs rather than raw HTTP. Both the OpenAI and Anthropic SDKs maintain internal connection pools via httpx, reusing TCP connections across multiple API calls. This eliminates the overhead of establishing a new TLS handshake for every request, which can save 100-200ms per call. The SDKs also handle HTTP/2 multiplexing where supported, allowing multiple concurrent requests over a single connection. For high-throughput applications making hundreds of calls per second, this connection pooling can meaningfully reduce both latency and resource consumption on your application server.
Type safety provided by the SDKs deserves emphasis. Both SDKs return fully typed Pydantic objects, which means your IDE can provide autocompletion for response fields, your type checker can catch field access errors at development time rather than runtime, and your code is self-documenting because the types explicitly declare what fields are available. When you write response.choices[0].message.content, your IDE knows that choices is a list, that each choice has a message attribute, and that message has a content attribute that is a string. This is vastly superior to working with raw JSON dictionaries where typos in field names become runtime KeyError exceptions.
For asynchronous applications, which include virtually all web servers and high-throughput processing pipelines, the async variants of the SDKs are essential. The AsyncOpenAI and AsyncAnthropic clients use async httpx under the hood, which means they yield control to the event loop while waiting for network responses. This allows a single Python process to handle hundreds or thousands of concurrent LLM requests without threading. Using the synchronous client in an async context, such as inside a FastAPI endpoint, blocks the entire event loop and serializes all requests, destroying your server's throughput. Always use the async client in async code.
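The concurrency win can be sketched with pure asyncio. Here `fake_llm_call` is a stand-in for an `AsyncOpenAI` request; the point is that `asyncio.gather` overlaps the network waits instead of serializing them:

```python
import asyncio

async def fake_llm_call(prompt: str) -> str:
    """Stand-in for an AsyncOpenAI call; yields to the event loop
    the same way real network I/O does."""
    await asyncio.sleep(0.01)
    return f"answer:{prompt}"

async def answer_all(prompts: list[str]) -> list[str]:
    """Run all requests concurrently on one event loop; total wall time
    approaches the slowest single call, not the sum of all calls."""
    return await asyncio.gather(*(fake_llm_call(p) for p in prompts))

results = asyncio.run(answer_all(["a", "b", "c"]))
```

In production you would add a semaphore to cap in-flight requests so concurrency stays under the provider's rate limits.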
The SDKs also provide important observability features. Both include request and response logging capabilities that you can hook into for monitoring and debugging. The OpenAI SDK supports custom HTTP client injection, allowing you to wrap the underlying httpx client with middleware for metrics collection, request tracing, or audit logging. The Anthropic SDK provides similar extensibility through its client configuration. LiteLLM goes further with its built-in callback system, which can automatically send metrics to observability platforms like Langfuse, Helicone, or Datadog without any custom instrumentation code.
The Same Call in Three SDKs
# ---- 1. OpenAI SDK ----
from openai import OpenAI

oai = OpenAI()  # OPENAI_API_KEY env var
rsp = oai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Return only valid JSON."},
        {"role": "user", "content": "List 3 Python web frameworks."},
    ],
    temperature=0,
    max_tokens=256,
)
print(rsp.choices[0].message.content)

# ---- 2. Anthropic SDK ----
from anthropic import Anthropic

ant = Anthropic()  # ANTHROPIC_API_KEY env var
msg = ant.messages.create(
    model="claude-sonnet-4-20250514",
    system="Return only valid JSON.",
    messages=[
        {"role": "user", "content": "List 3 Python web frameworks."},
    ],
    temperature=0,
    max_tokens=256,
)
print(msg.content[0].text)

# ---- 3. LiteLLM (unified) ----
import litellm

# Same interface for ANY provider
for model in [
    "gpt-4o",
    "anthropic/claude-sonnet-4-20250514",
    "groq/llama-3.1-70b-versatile",
]:
    rsp = litellm.completion(
        model=model,
        messages=[
            {"role": "system", "content": "Return only valid JSON."},
            {"role": "user", "content": "List 3 Python web frameworks."},
        ],
        temperature=0,
        max_tokens=256,
    )
    print(f"{model}: {rsp.choices[0].message.content}")
Error Handling and Retry with Tenacity
import openai
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
)

@retry(
    retry=retry_if_exception_type((
        openai.RateLimitError,       # 429
        openai.APIConnectionError,   # network
        openai.InternalServerError,  # 500
    )),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    stop=stop_after_attempt(5),
)
def robust_completion(messages: list, model: str = "gpt-4o") -> str:
    """Call LLM with automatic retry on transient failures."""
    client = openai.OpenAI()
    try:
        rsp = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0,
            max_tokens=1024,
        )
        if rsp.choices[0].finish_reason == "length":
            raise ValueError("Response truncated -- increase max_tokens")
        return rsp.choices[0].message.content
    except openai.BadRequestError as e:
        # Don't retry 400s -- the request itself is wrong
        raise ValueError(f"Bad request: {e}") from e
Never retry 400 Bad Request or 401 Unauthorized errors. These indicate bugs in your code or invalid credentials, not transient failures. Only retry 429, 500, 502, 503, and connection errors.
Streaming Responses
Without streaming, the user experience of an LLM application is terrible: the user submits a prompt, stares at a blank screen for 5 to 30 seconds while the model generates its entire response, and then sees the complete text appear all at once. Streaming solves this by delivering the response token by token as the model generates each one, so the user sees text appearing almost immediately and can start reading while generation continues. Every production LLM application uses streaming for user-facing interactions. The time to first token is typically 200 to 500 milliseconds, compared to 5 to 30 seconds for the complete non-streamed response, and this improvement in perceived responsiveness is transformative for user experience and engagement.
The streaming protocol used by LLM APIs is Server-Sent Events (SSE), an HTTP-based protocol defined in the HTML specification. SSE works over a standard HTTP connection: the client sends a normal HTTP request with stream: true in the request body, and the server responds with Content-Type text/event-stream and keeps the connection open. The server then sends a series of events, each formatted as one or more lines prefixed with data: and separated by blank lines. Each event contains a JSON object with a delta representing the incremental change since the last event. The stream ends with a data: [DONE] sentinel in the OpenAI convention, or with a message_stop event (preceded by a message_delta event carrying the stop_reason) in the Anthropic convention.
Understanding the delta accumulation pattern is essential because it is the core loop of any streaming LLM application. The first delta typically contains role information but no content. Subsequent deltas contain content fragments of varying length, sometimes a single character and sometimes a whole word or phrase. The final delta contains an empty content field and the finish_reason. Your accumulation code must handle all of these cases: skip deltas with no content, append content deltas to a buffer, and check for the finish signal to know when generation is complete. Getting this wrong leads to missing text, duplicated text, or applications that hang waiting for a termination signal that has already passed.
In a web application context, you need to forward the stream from the LLM API to the client browser. FastAPI provides the StreamingResponse class for exactly this purpose. You create an async generator function that consumes the LLM stream and yields formatted chunks, then wrap it in a StreamingResponse with the text/event-stream media type. The client-side JavaScript uses the Fetch API with a ReadableStream reader or the EventSource API to consume the stream. This end-to-end streaming pipeline from user browser to your FastAPI server to LLM API and back is the standard architecture for any chat-style application, and you will implement it multiple times throughout this course.
There are important nuances to production streaming that are easy to overlook. First, error handling is more complex with streaming because the HTTP connection has already returned a 200 status code before the model starts generating. If an error occurs mid-stream, the error arrives as a stream event rather than an HTTP error code. Your code must handle these mid-stream errors gracefully. Second, connection management is critical: if the client disconnects mid-stream because the user navigates away, your server should detect this and cancel the upstream LLM API call to avoid paying for tokens that nobody will see.
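Disconnect handling can be factored into a small wrapper around the upstream stream. This is a minimal sketch with a pluggable `is_disconnected` check (in FastAPI you would pass the request's `is_disconnected` method); the source and disconnect check below are fake stand-ins for demonstration:

```python
import asyncio

async def forward_stream(source, is_disconnected):
    """Yield chunks from `source`, stopping early once the client disconnects."""
    async for chunk in source:
        if await is_disconnected():  # in FastAPI: request.is_disconnected()
            break  # exiting closes the generator, cancelling the upstream call
        yield chunk

# Demo with a fake source and a "client" that disconnects after two chunks.
async def demo():
    async def source():
        for piece in ["a", "b", "c", "d"]:
            yield piece

    seen = []

    async def disconnected():
        return len(seen) >= 2

    async for chunk in forward_stream(source(), disconnected):
        seen.append(chunk)
    return seen

result = asyncio.run(demo())  # ["a", "b"]
```

Breaking out of the consuming loop closes the async generator, which is what stops you paying for tokens nobody will see.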
SSE Protocol Details
The Server-Sent Events protocol is simple but rigid. Each event is a sequence of field lines, where each field has the format field: value\n. The most common field is data: which carries the payload. Events are separated by a blank line consisting of two newline characters. The event field can specify an event type for routing, and the id field provides a last-event-ID for reconnection. In LLM APIs, only the data field is typically used, containing a JSON-serialized chunk object. The SSE specification also supports automatic reconnection via the retry field, but LLM APIs do not use this because each stream is a unique generation that cannot be resumed from a checkpoint.
The raw SSE format on the wire looks like this: the server sends data: {"id":"chatcmpl-...","choices":[{"delta":{"content":"Hello"}}]} followed by a blank line, then data: {"id":"chatcmpl-...","choices":[{"delta":{"content":" world"}}]} and so on. Each delta object contains just the new content since the last event. Your application code accumulates these deltas by concatenating the content strings to build the complete response. The SDK libraries handle all the SSE parsing for you, exposing the stream as a Python async iterator that yields parsed chunk objects, but understanding the underlying protocol is valuable for debugging when things go wrong.
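As an illustration (the SDKs do this for you), a minimal parser for this wire format might look like the following; the `raw` string mimics the OpenAI-style events described above:

```python
import json

raw = (
    'data: {"choices":[{"delta":{"content":"Hello"}}]}\n\n'
    'data: {"choices":[{"delta":{"content":" world"}}]}\n\n'
    'data: [DONE]\n\n'
)

def parse_sse(raw: str) -> str:
    """Accumulate content deltas from raw OpenAI-style SSE text."""
    parts = []
    for event in raw.split("\n\n"):  # events are separated by a blank line
        for line in event.splitlines():
            if not line.startswith("data: "):
                continue
            payload = line[len("data: "):]
            if payload == "[DONE]":  # OpenAI end-of-stream sentinel
                return "".join(parts)
            delta = json.loads(payload)["choices"][0]["delta"]
            if delta.get("content"):
                parts.append(delta["content"])
    return "".join(parts)
```

A real parser must also handle events split across TCP reads, which is exactly the buffering work the SDKs hide from you.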
Another production concern is token counting for streamed responses. When streaming, most providers do not send the token usage information until the final event. This means you cannot calculate cost in real-time during streaming. If you need real-time token counting to implement a hard token budget that stops generation mid-stream, you need to count tokens client-side using a tokenizer library like tiktoken. However, client-side counting is approximate because the tokenization of the accumulated text may differ slightly from the model's internal tokenization due to partial token boundaries in the stream.
The performance characteristics of streaming are worth understanding in detail. The time to first token is determined by the model's prefill phase, which is the time it takes to process all input tokens through the transformer layers. Longer prompts have longer time to first token because there are more input tokens to process. After the first token, the inter-token latency is typically very consistent, determined by the model's decode speed and the provider's infrastructure. For GPT-4o, inter-token latency is roughly 10 to 20 milliseconds, meaning a 500-token response takes about 5 to 10 seconds to stream completely but the user sees the first token within 200 to 400 milliseconds.
Backpressure is a consideration for high-throughput streaming applications. If your server processes stream events faster than the client can consume them, you can end up buffering large amounts of data in memory. In ASGI servers like Uvicorn, the framework handles TCP-level backpressure automatically, but your application-level buffers should be bounded. Similarly, if you are processing the stream by applying a content filter to each chunk before forwarding it, the processing time adds to the effective latency. Keep stream processing as lightweight as possible to maintain the responsiveness benefit that streaming provides.
One powerful pattern enabled by streaming is the ability to display thinking or reasoning tokens to the user in real-time. Anthropic's extended thinking feature streams the thinking content before the final response, allowing you to show a "thinking" phase with visible reasoning followed by the actual answer. This transparency can significantly improve user trust and engagement. Implementing this requires parsing the stream events for content block types and routing thinking blocks to a separate UI element from the response blocks.
Streaming also enables a technique called progressive rendering in user interfaces, where you process and render partial content as it arrives rather than waiting for the complete response. For example, if the model is generating a Markdown table, you can start rendering the table headers as soon as they arrive and add rows progressively. If the model is generating code, you can start syntax highlighting the first lines while the rest is still being generated. This creates a much more fluid and responsive user experience compared to waiting for the complete response and rendering it all at once.
For applications that need to process or transform streamed content before displaying it, consider implementing a stream transformation pipeline. This is an async generator that consumes the raw LLM stream, applies transformations such as content filtering, citation injection, or format conversion, and yields the transformed chunks. FastAPI's StreamingResponse works perfectly with these chained async generators, allowing you to compose multiple stream processing stages without buffering the entire response in memory.
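A sketch of such a pipeline, using a fake upstream stream and a hypothetical citation-stripping stage chained as async generators:

```python
import asyncio
import re

async def fake_llm_stream():
    """Stand-in for the raw LLM stream."""
    for piece in ["The answer ", "is 42 ", "[doc1]."]:
        yield piece

async def strip_citations(stream):
    """Transformation stage: remove bracketed citation markers from each chunk.
    Note: a marker split across chunk boundaries would need extra buffering."""
    async for chunk in stream:
        yield re.sub(r"\[.*?\]", "", chunk)

async def collect(stream):
    return "".join([chunk async for chunk in stream])

result = asyncio.run(collect(strip_citations(fake_llm_stream())))
```

Because each stage yields as soon as it has output, no stage buffers the full response, preserving the latency benefit of streaming.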
Finally, testing streaming code requires special consideration. Unit testing a streaming endpoint means verifying that chunks arrive in the correct order, that the accumulated content matches the expected output, that errors mid-stream are handled gracefully, and that client disconnection is detected and propagated. The httpx library provides an async streaming response mock that you can use in tests. You should also have integration tests that verify end-to-end streaming from a real LLM API, since the SSE event format and timing characteristics of real providers can differ from mocks in subtle ways.
Async Streaming Implementation
import asyncio

from openai import AsyncOpenAI
from anthropic import AsyncAnthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

# --- OpenAI async streaming ---
async def stream_openai(prompt: str):
    client = AsyncOpenAI()
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=2048,
    )
    full_text = ""
    async for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            full_text += delta.content
            print(delta.content, end="", flush=True)
    print()
    return full_text

# --- Anthropic async streaming ---
async def stream_anthropic(prompt: str):
    client = AsyncAnthropic()
    full_text = ""
    async with client.messages.stream(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=2048,
    ) as stream:
        async for text in stream.text_stream:
            full_text += text
            print(text, end="", flush=True)
    print()
    return full_text

# --- FastAPI streaming endpoint ---
@app.get("/stream")
async def stream_endpoint(prompt: str):
    async def event_generator():
        client = AsyncOpenAI()
        stream = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            max_tokens=2048,
        )
        async for chunk in stream:
            delta = chunk.choices[0].delta
            if delta.content:
                yield f"data: {delta.content}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
    )
Always use the async client (AsyncOpenAI, AsyncAnthropic) in FastAPI endpoints. The sync client blocks the event loop, meaning your server handles only one request at a time. Async clients yield control while waiting for network I/O, enabling thousands of concurrent streams.
Cost Management
LLM API costs can escalate from a few dollars during prototyping to thousands of dollars per day in production, and many teams discover this the hard way. Understanding the economics of LLM API usage is not an afterthought but a core engineering requirement that should influence your architecture from day one. The cost model is straightforward: you pay per token, with separate rates for input tokens and output tokens. Output tokens are always more expensive than input tokens, typically by a factor of 3 to 5, because output tokens require sequential autoregressive generation while input tokens can be processed in parallel during the prefill phase.
The token itself is not a word. It is a subword unit produced by the model's tokenizer. Common English words are often a single token, but less common words may be split into multiple tokens. As a rough heuristic, one token is approximately three-quarters of a word in English, or about four characters. The exact mapping depends on the specific tokenizer: OpenAI models use the cl100k_base or o200k_base tokenizer depending on the model, Anthropic uses their own tokenizer, and open-source models use various SentencePiece or BPE tokenizers. The tiktoken library lets you count tokens for OpenAI models precisely, which is essential for predicting costs before sending a request.
The pricing landscape across providers varies dramatically, and choosing the right model for each task is the single most impactful cost optimization. A task that works well with GPT-4o-mini at $0.15 and $0.60 per million tokens does not need GPT-4o at $2.50 and $10.00 per million tokens. Many production systems use a tiered approach: fast and cheap models for simple tasks like classification, extraction, and routing; mid-tier models for standard tasks like summarization and question answering; and premium models only for complex tasks like multi-step reasoning and creative writing. This tiering alone can reduce costs by 80% or more compared to using a premium model for everything.
Beyond model selection, there are several structural cost optimizations. Prompt caching, offered by both Anthropic and OpenAI, dramatically reduces input token costs for requests that share a common prefix. If you send the same system prompt with every request, the provider can cache the key-value computation for that prefix and charge a reduced rate for the cached portion. Anthropic's prompt caching charges 90% less for cache hits, and OpenAI's caching provides a 50% discount on cached input tokens. The batch API offered by OpenAI provides a 50% discount on all token costs in exchange for accepting higher latency, with results delivered within 24 hours instead of real-time. This is perfect for bulk processing tasks like dataset annotation, evaluation runs, or offline analysis.
Application-level caching with Redis or a similar key-value store provides even more dramatic savings by completely avoiding redundant API calls. If two users ask the same question, you can serve the cached response instantly without making an API call at all. The implementation involves hashing the messages array after normalization and checking Redis before calling the LLM. Cache hit rates vary enormously by application: a customer support bot might see 30 to 50% cache hits because many users ask the same questions, while a creative writing assistant might see nearly 0% because every request is unique. Even modest cache hit rates translate to significant cost savings and latency improvements.
Token counting before sending a request is an essential production practice. By counting input tokens with tiktoken, you can predict the cost of a request before incurring it, enforce per-request or per-user budget limits, detect accidentally large prompts which is a common bug where an entire document is included instead of a summary, and optimize context window usage by trimming conversation history to fit within a target token budget. The counting is fast, taking only microseconds for typical prompts, and should be part of your standard request pipeline.
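One such use, sketched here with a crude four-characters-per-token stand-in for a real tiktoken count, is trimming conversation history to a token budget while always preserving the system message:

```python
def rough_count(msg: dict) -> int:
    """Crude stand-in for a real tokenizer: ~4 characters per token."""
    return max(1, len(msg["content"]) // 4)

def trim_history(messages: list, max_tokens: int, count=rough_count) -> list:
    """Keep the system message plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count(m) for m in system)
    kept = []
    for m in reversed(rest):  # walk backwards from the newest turn
        cost = count(m)
        if cost > budget:
            break
        kept.append(m)
        budget -= cost
    return system + list(reversed(kept))

history = [
    {"role": "system", "content": "x" * 8},      # ~2 tokens
    {"role": "user", "content": "a" * 40},       # ~10 tokens
    {"role": "assistant", "content": "b" * 40},  # ~10 tokens
    {"role": "user", "content": "c" * 40},       # ~10 tokens
]
trimmed = trim_history(history, max_tokens=25)  # drops the oldest user turn
```

In production you would swap `rough_count` for an exact tokenizer count, but the trimming logic is identical.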
Token Pricing Comparison (early 2025)
| Model | Provider | Input / 1M tokens | Output / 1M tokens | Context |
|---|---|---|---|---|
| gpt-4o | OpenAI | $2.50 | $10.00 | 128K |
| gpt-4o-mini | OpenAI | $0.15 | $0.60 | 128K |
| claude-sonnet-4-20250514 | Anthropic | $3.00 | $15.00 | 200K |
| gemini-2.0-flash | Google | $0.10 | $0.40 | 1M |
Monitoring and alerting on API costs should be implemented from day one of any production deployment. This means logging every API call with its token usage and model, aggregating costs by user, feature, and model, setting budget alerts at daily and monthly thresholds, and building dashboards that show cost trends over time. LiteLLM includes built-in cost tracking that can write to a database, making this significantly easier than implementing cost monitoring from scratch. Without monitoring, it is frighteningly easy for a bug like an infinite retry loop or an accidentally large context to run up thousands of dollars in API costs before anyone notices.
One often-overlooked cost factor is the interaction between context window usage and quality. Sending a longer prompt with more context typically produces better results, but it also costs more. There is an optimal point where adding more context no longer improves quality enough to justify the cost increase. Finding this point requires empirical testing for your specific use case, which is why the evaluation module is so closely linked to cost management. A well-designed evaluation pipeline lets you measure quality at different context sizes and find the cost-quality sweet spot for each task type in your application.
For organizations with high-volume usage, negotiated enterprise pricing and commitment-based discounts can reduce costs further. Both OpenAI and Anthropic offer enterprise agreements with volume discounts, higher rate limits, dedicated capacity, and data processing agreements. If your projected monthly spend exceeds $10,000, it is almost always worth initiating an enterprise conversation. The discount structures are not publicly documented but typically range from 15 to 40% depending on volume commitment and contract length.
Finally, consider the hidden costs beyond token pricing. Rate limiting can be a cost in terms of throughput, as requests queue up and latency increases when you hit limits. You may need to provision higher rate limit tiers, which some providers charge for, or distribute requests across multiple API keys. Network egress costs from cloud providers can add up for high-volume streaming responses. And the engineering cost of optimizing prompts, managing caches, and monitoring costs is itself a significant investment that should be factored into your total cost of ownership calculation for any LLM-powered feature.
Cost Calculator and Caching Setup
import hashlib
import json

import redis
import tiktoken

# ---- Token counting with tiktoken ----
PRICING = {
    "gpt-4o": (2.50, 10.00),  # (input, output) per 1M tokens
    "gpt-4o-mini": (0.15, 0.60),
    "claude-sonnet": (3.00, 15.00),
    "gemini-flash": (0.10, 0.40),
}

def get_encoding(model: str):
    """Exact encoding for OpenAI models; fall back to o200k_base as an
    approximation for non-OpenAI model names."""
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        return tiktoken.get_encoding("o200k_base")

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens with tiktoken (exact for OpenAI models)."""
    return len(get_encoding(model).encode(text))

def estimate_cost(
    messages: list,
    model: str = "gpt-4o",
    est_output_tokens: int = 500,
) -> dict:
    """Estimate cost before sending a request."""
    enc = get_encoding(model)
    input_tokens = sum(
        len(enc.encode(m["content"])) + 4  # +4 for per-message role overhead
        for m in messages
    )
    rate_in, rate_out = PRICING.get(model, (2.50, 10.00))
    cost_in = (input_tokens / 1_000_000) * rate_in
    cost_out = (est_output_tokens / 1_000_000) * rate_out
    return {
        "input_tokens": input_tokens,
        "est_output_tokens": est_output_tokens,
        "est_cost_usd": round(cost_in + cost_out, 6),
    }

# ---- Redis response cache ----
rdb = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL = 3600  # 1 hour

def cache_key(messages: list, model: str) -> str:
    """Deterministic hash of the request for cache lookup."""
    blob = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return "llm:" + hashlib.sha256(blob.encode()).hexdigest()

def cached_completion(messages: list, model: str = "gpt-4o") -> str:
    """LLM call with Redis caching."""
    key = cache_key(messages, model)
    cached = rdb.get(key)
    if cached:
        return cached.decode()
    # Cache miss -- call the LLM
    import openai
    rsp = openai.OpenAI().chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
        max_tokens=1024,
    )
    text = rsp.choices[0].message.content
    rdb.setex(key, CACHE_TTL, text)
    return text
Anthropic's prompt caching gives a 90% discount on cached input tokens. Add cache_control: {"type": "ephemeral"} to your system message content block. OpenAI automatically caches input prefixes for sequences over 1024 tokens at a 50% discount with no code changes needed. The Batch API provides 50% off all tokens but returns results within 24 hours.
LiteLLM Provider Switching
LiteLLM is a Python library that provides a unified interface for calling over 100 LLM providers through a single API. The core insight is simple but powerful: if you write your application code against the OpenAI-compatible interface and use LiteLLM as the client layer, you can switch between providers by changing a single string without modifying any other code. This provider portability is invaluable for production systems because it enables cost optimization by routing to cheaper models for simpler tasks, reliability by falling back to alternative providers during outages, and future-proofing by adopting new models as they become available without code changes.
LiteLLM uses a model naming convention to identify providers: the model name is either a bare name which defaults to OpenAI such as gpt-4o, or a prefixed name with the provider such as anthropic/claude-sonnet-4-20250514, groq/llama-3.1-70b-versatile, bedrock/anthropic.claude-3-sonnet, or vertex_ai/gemini-2.0-flash. LiteLLM reads the prefix to determine which provider to route to, then translates the OpenAI-compatible request into the provider-specific format, sends the request, and translates the provider-specific response back into the OpenAI-compatible format. The translation layer handles all the differences: system message placement, response schema, authentication headers, and error types.
The litellm.Router class is where LiteLLM becomes truly powerful for production systems. The Router supports multiple model deployments, automatic fallbacks, load balancing, and cost tracking. You configure it with a list of model groups, where each group contains one or more model deployments. When you make a request to a model group, the Router selects a deployment based on the configured strategy such as round-robin, least-busy, latency-based, or cost-based, and if that deployment fails, it automatically falls back to the next deployment in the group. This is exactly the pattern you need for high-availability LLM applications.
Fallback chains are the most common Router pattern. You define a primary model, usually the best quality, and one or more fallback models, usually cheaper or hosted on different infrastructure. If the primary model fails due to rate limiting, outage, or timeout, the Router automatically retries with the next model in the chain. For example, you might configure GPT-4o as primary, Claude Sonnet as the first fallback, and Llama 3.1 on Groq as the second fallback. This ensures your application stays responsive even during provider outages, and the fallback to cheaper models during rate limiting provides a natural cost pressure valve that prevents budget overruns.
Load balancing across multiple deployments of the same model is another key Router feature. If you have multiple API keys for the same provider, common in enterprise setups to get higher aggregate rate limits, you can configure each key as a separate deployment and the Router will distribute requests across them. The Router supports several load balancing strategies: simple round-robin, weighted round-robin to send more traffic to faster or cheaper deployments, and least-busy routing to the deployment with the fewest pending requests. For applications that need maximum throughput, distributing requests across multiple keys can multiply your effective rate limit by the number of keys.
Cost tracking is built into the Router and provides real-time visibility into API spending. The Router logs every request with its token usage, model, and calculated cost. You can query aggregated costs by model, by user, by time period, or by any custom metadata you attach to requests. This data is essential for budgeting, for identifying cost anomalies, and for making informed decisions about model selection. The Router can also enforce spending limits: you can set a maximum daily or monthly budget, and the Router will reject requests that would exceed the limit, protecting you from runaway costs.
The LiteLLM Proxy Server extends the library's capabilities into a standalone service that acts as an OpenAI-compatible API gateway. Instead of using LiteLLM as a Python library embedded in your application, you run the proxy as a separate service and point your application code at it as if it were the OpenAI API. The proxy provides all the Router features plus additional capabilities like virtual API keys where you can issue keys to team members with individual budgets and model access controls, request and response logging to a database, per-key rate limiting, and a web UI for monitoring. For teams with multiple applications or developers accessing LLM APIs, the proxy centralizes management and provides organization-wide visibility.
An important consideration when using LiteLLM is that while the translation layer handles the common case well, provider-specific features may not be fully available through the unified interface. For example, Anthropic's extended thinking feature, OpenAI's structured output mode, or provider-specific parameters may require passing extra parameters that LiteLLM forwards as-is. LiteLLM supports this through provider-specific keyword arguments, but you should always test that provider-specific features work correctly through LiteLLM rather than assuming compatibility. The library is actively maintained and coverage improves with each release, but edge cases do exist.
Integration with observability tools is another strength of LiteLLM. It supports callbacks that fire on every request, allowing you to send metrics to Prometheus, Datadog, or any monitoring system. You can also integrate with LLM observability platforms like Langfuse, Helicone, or LangSmith by configuring the appropriate callback. This means that by routing all your LLM calls through LiteLLM, you automatically get centralized logging, cost tracking, and observability without instrumenting each individual call site.
For practical purposes, LiteLLM serves as the default LLM client layer in most projects. The standard pattern is to import litellm, configure the Router with your model deployments and fallbacks, and use router.completion() or router.acompletion() everywhere in your application code. This gives you provider portability, fault tolerance, and cost tracking out of the box, with minimal code overhead compared to using a provider SDK directly. When you need provider-specific features, you can always drop down to the native SDK for that particular call while keeping LiteLLM for everything else.
A practical tip for development environments: LiteLLM can route to local models served by Ollama or vLLM using the openai/ prefix with a custom api_base parameter. This allows you to develop and test your application locally using free open-source models, then switch to cloud providers in production by changing only the model configuration. This workflow dramatically reduces development costs and eliminates the need for API keys during the development and testing phases of your project.
Router with Fallback Chain
from litellm import Router

# Define model deployments with fallback chain:
# Primary: GPT-4o -> Fallback 1: Claude -> Fallback 2: Llama/Groq
model_list = [
    {
        "model_name": "main-llm",
        "litellm_params": {
            "model": "gpt-4o",
            "api_key": "sk-...",
        },
    },
    {
        "model_name": "main-llm",
        "litellm_params": {
            "model": "anthropic/claude-sonnet-4-20250514",
            "api_key": "sk-ant-...",
        },
    },
    {
        "model_name": "main-llm",
        "litellm_params": {
            "model": "groq/llama-3.1-70b-versatile",
            "api_key": "gsk-...",
        },
    },
]

router = Router(
    model_list=model_list,
    fallbacks=[{"main-llm": ["main-llm"]}],
    num_retries=2,
    timeout=30,
    routing_strategy="least-busy",
)

# Usage -- identical to litellm.completion()
async def call_with_fallback(prompt: str) -> str:
    response = await router.acompletion(
        model="main-llm",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=1024,
    )
    return response.choices[0].message.content

# Cost tracking via callbacks
import litellm
litellm.success_callback = ["langfuse"]  # or "lunary", "helicone"
Run litellm --model gpt-4o --port 4000 to start a local OpenAI-compatible proxy. Point any OpenAI SDK client at http://localhost:4000 and it routes through LiteLLM with all fallback and tracking features. Add --config config.yaml for full Router configuration.
Structured Outputs
One of the most important capabilities for production LLM applications is the ability to extract structured data from model outputs. When you ask a model to extract key entities from a document, you do not want a free-form paragraph. You want a JSON object with specific fields that your downstream code can parse reliably. Without structured output enforcement, you are at the mercy of the model's formatting choices, which vary between calls, between models, and even between runs with the same model at non-zero temperature. Structured outputs solve this by constraining the model's generation to produce valid data conforming to a predetermined schema.
There are several approaches to structured outputs, ranging from simple prompting to provider-native enforcement to library-based validation. The simplest approach, asking the model to respond in JSON via the prompt, works surprisingly well with modern models but provides no guarantee. The model might produce invalid JSON, include extra text before or after the JSON block, or use field names that differ from what you expected. For prototyping this is acceptable, but for production code that needs to parse the output programmatically, you need much stronger guarantees than a polite request in the prompt.
OpenAI provides a native structured output feature called response_format with type json_schema. When you provide a JSON schema, OpenAI's API constrains the model's token generation to only produce tokens that are valid according to the schema. This is not post-hoc validation. It is built into the generation process itself, which means the model physically cannot produce output that violates the schema. The schema supports standard JSON Schema features including required fields, type constraints, enums, arrays, nested objects, and descriptions. This is the most reliable approach when using OpenAI models because it guarantees schema compliance with zero chance of parsing failure.
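The request-side shape looks roughly like the following; the schema name and fields here are illustrative, and the dict would be passed as the response_format parameter to chat.completions.create:

```python
# Hypothetical schema for an entity-extraction task. Strict mode requires
# additionalProperties: false and every property listed under "required".
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "entity_extraction",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "entities": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["entities"],
            "additionalProperties": False,
        },
    },
}

# Usage (requires an API key):
# client.chat.completions.create(model="gpt-4o", messages=...,
#                                response_format=response_format)
```

With strict mode enabled, the returned message content is guaranteed to parse as JSON matching this schema.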
Anthropic does not have an equivalent native JSON schema mode, but provides a powerful alternative through the tool use feature, also known as function calling. You define a tool with a name, description, and input schema using JSON Schema, and when the model decides to use the tool, it generates a structured JSON input that conforms to the schema. By designing your tool definition to represent the desired output structure, you effectively get structured outputs. The key insight is that you are not actually using the tool for its function-calling purpose. You are repurposing the tool schema as a structured output constraint. This pattern is so common that it has become an established idiom in the Anthropic developer community.
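A sketch of the idiom follows; the tool name and fields are illustrative. The tool is never executed: its input_schema simply constrains the JSON the model emits, and forcing tool_choice guarantees the model "calls" it.

```python
# Hypothetical tool definition repurposed as an output contract.
record_entities_tool = {
    "name": "record_entities",
    "description": "Record every entity found in the document.",
    "input_schema": {
        "type": "object",
        "properties": {
            "entities": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "entity_type": {
                            "type": "string",
                            "enum": ["person", "org", "location", "date"],
                        },
                    },
                    "required": ["name", "entity_type"],
                },
            },
        },
        "required": ["entities"],
    },
}

# Usage (requires an API key): forcing tool_choice makes the response a
# structured tool input matching input_schema rather than free text.
# client.messages.create(
#     model="claude-sonnet-4-20250514",
#     max_tokens=1024,
#     tools=[record_entities_tool],
#     tool_choice={"type": "tool", "name": "record_entities"},
#     messages=[{"role": "user", "content": document}],
# )
```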
The instructor library by Jason Liu is the most popular Python library for working with structured LLM outputs. Instructor wraps the OpenAI and Anthropic SDKs with a Pydantic-based interface: you define your desired output structure as a Pydantic model, and instructor handles the schema conversion, API call, response parsing, and validation automatically. If the model's response fails Pydantic validation, instructor can automatically retry with the validation error message appended to the conversation, giving the model a chance to correct its output. This retry-on-validation pattern typically produces valid structured output within one to two retries even for complex schemas.
The power of Pydantic models as output schemas goes beyond basic type checking. You can use Pydantic validators to enforce business logic constraints such as minimum and maximum values, string patterns, cross-field consistency rules, and custom validation functions. Instructor surfaces these validation failures to the model during retries, effectively teaching the model your business rules through error feedback. This creates a remarkably robust extraction pipeline where the model handles the unstructured-to-structured conversion and Pydantic handles the quality assurance, with automatic retry bridging any gaps.
For production systems, structured outputs enable a critical architectural pattern: the LLM as a data extraction or classification component in a larger pipeline. Instead of using the LLM's output as user-facing text, you use it as structured data that feeds into downstream computation, database storage, or API calls. For example, you might use an LLM to extract invoice line items as a list of Pydantic objects, then store those objects in a database and compute totals programmatically. The structured output guarantee means your downstream code can trust the schema and focus on business logic rather than defensive parsing of unpredictable text.
Performance considerations for structured outputs are important to understand. JSON schema mode in OpenAI adds minimal latency because the schema constraint is applied during generation. However, complex schemas with many nested objects and enums can slightly reduce generation speed because the constraint checking at each token position is more computationally expensive. Instructor's retry mechanism adds latency proportional to the number of retries. If validation fails twice before succeeding, you have paid for three API calls instead of one. To minimize retries, keep schemas as simple as possible, provide clear field descriptions, and include few-shot examples in the prompt showing correctly structured output.
Nested and recursive schemas deserve special attention. OpenAI's JSON schema mode supports nested objects and arrays but has limitations on deeply recursive or self-referencing schemas. If you need to extract tree structures or recursive data, you may need to flatten the schema and post-process the output into the desired structure. Instructor handles this more gracefully because Pydantic natively supports recursive models, and the validation-retry loop can guide the model toward producing valid recursive structures. For deeply nested schemas, it often helps to decompose the extraction into multiple simpler calls rather than trying to extract everything in one monolithic call.
Testing structured outputs requires a different approach than testing free-text outputs. Since the output conforms to a known schema, you can write deterministic assertions: check that required fields are present, values are within expected ranges, enums contain valid values, and lists have the expected length. This makes structured output pipelines much more testable than free-text pipelines, which is another strong reason to prefer structured outputs in production systems. The evaluation module covers this in detail, but the key insight is that structured outputs transform the inherently non-deterministic LLM output into something you can test with standard software testing techniques.
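A minimal sketch of such deterministic assertions, using an illustrative entity schema; the entity object is constructed by hand here in place of a live API response.

```python
from pydantic import BaseModel, Field
from typing import List, Literal

class ExtractedEntity(BaseModel):
    name: str
    entity_type: Literal["person", "org", "location", "date"]
    confidence: float = Field(..., ge=0.0, le=1.0)

def check_extraction(entities: List[ExtractedEntity]) -> None:
    # Deterministic checks that work because the schema is known in advance
    assert entities, "expected at least one entity"
    for e in entities:
        assert e.name.strip(), "names must be non-empty"
        assert 0.0 <= e.confidence <= 1.0

# In a real test, `entities` would come from the structured LLM call
entities = [ExtractedEntity(name="Anthropic", entity_type="org", confidence=0.97)]
check_extraction(entities)  # passes silently when the output is well-formed
```

The same assertions can run against recorded fixtures in CI, turning a non-deterministic model into a component with a testable contract.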
One advanced pattern is using structured outputs for chain-of-thought reasoning with structured results. You define a Pydantic model with a reasoning field of type string and a result field with your structured data type, and the model is forced to explain its reasoning before producing the structured answer. This combines the quality benefits of chain-of-thought prompting with the reliability benefits of structured outputs. The reasoning field serves as both a quality improvement mechanism and an audit trail for debugging when the structured result is incorrect, because you can read the reasoning to understand why the model made a particular extraction decision.
Instructor with Pydantic Validation
```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field, field_validator
from typing import List, Literal

# Define the output schema with Pydantic
class ExtractedEntity(BaseModel):
    name: str = Field(..., description="Entity name")
    entity_type: Literal["person", "org", "location", "date"]
    confidence: float = Field(..., ge=0.0, le=1.0)

    @field_validator("name")
    @classmethod
    def name_not_empty(cls, v):
        if not v.strip():
            raise ValueError("Entity name cannot be empty")
        return v.strip()

class ExtractionResult(BaseModel):
    reasoning: str = Field(..., description="Step-by-step reasoning")
    entities: List[ExtractedEntity] = Field(default_factory=list)
    document_summary: str = Field(..., max_length=200)

# Patch the OpenAI client with instructor
client = instructor.from_openai(OpenAI())

# Structured call -- returns a Pydantic model, not raw text
result = client.chat.completions.create(
    model="gpt-4o",
    response_model=ExtractionResult,
    max_retries=3,
    messages=[
        {"role": "system", "content": "Extract entities from the document."},
        {"role": "user", "content": """
            Anthropic, based in San Francisco, announced on March 4 2025
            that CEO Dario Amodei will present new safety research at
            the United Nations headquarters in New York.
        """},
    ],
    temperature=0,
)

# result is a validated Pydantic object
print(result.reasoning)
for entity in result.entities:
    print(f"  {entity.name} ({entity.entity_type}) conf={entity.confidence}")

# Works with Anthropic too:
from anthropic import Anthropic

ant_client = instructor.from_anthropic(Anthropic())
result2 = ant_client.messages.create(
    model="claude-sonnet-4-20250514",
    response_model=ExtractionResult,
    max_retries=3,
    max_tokens=1024,
    messages=[{"role": "user", "content": "Extract entities from: ..."}],
)
```
Always include a reasoning or thinking field in your Pydantic model. Forcing the model to explain its reasoning before producing structured fields consistently improves extraction accuracy. It is chain-of-thought prompting built directly into the output schema.
Architecture Diagrams
Figure 1 -- The complete lifecycle of an LLM API call, from application code through the SDK, network, provider infrastructure, model, and back.
Figure 2 -- LiteLLM Router fallback chain: requests go to GPT-4o first; on failure they fall back to Claude Sonnet, then Llama 3.1 on Groq.
Interview Ready
LLM APIs expose large language models as stateless HTTP services. You send a messages array containing system, user, and assistant roles, along with configuration like model name, temperature, and max_tokens, to a chat completions endpoint, and you receive generated text plus token usage metadata. Because every request is independent, your application must manage conversation history itself. Provider SDKs from OpenAI and Anthropic handle authentication, serialization, retries, and type safety. Streaming via Server-Sent Events delivers tokens incrementally for responsive UIs with 200-500ms time to first token. Structured outputs using JSON schema mode or the instructor library guarantee parseable responses conforming to a Pydantic schema. Cost management revolves around token economics: choosing the right model tier, caching prompts, counting tokens with tiktoken before sending, and monitoring spend. LiteLLM provides a unified interface across 100+ providers with automatic fallbacks, load balancing, and cost tracking, making your application provider-agnostic. Production resilience comes from exponential backoff retries with tenacity, distinguishing retryable errors like 429 and 503 from non-retryable ones like 400 and 401.
Common Interview Questions
| Question | What They're Really Asking |
|---|---|
| How does the Chat Completions API work and what are the message roles? | Do you understand the stateless request/response contract and the semantic difference between system, user, and assistant messages? |
| How do you handle streaming responses from an LLM API in a production web application? | Can you implement SSE-based streaming with async clients, forward streams through FastAPI, and handle mid-stream errors and client disconnections? |
| What strategies do you use to manage LLM API costs at scale? | Do you think about model tiering, token counting with tiktoken, prompt caching, response caching with Redis, batch APIs, and spend monitoring? |
| How do you guarantee structured output from an LLM? | Can you compare prompt-based JSON, OpenAI's native json_schema mode, Anthropic's tool-use-as-schema pattern, and instructor with Pydantic validation and retries? |
| How do you build a resilient LLM client that handles rate limits, outages, and provider switching? | Do you know exponential backoff with tenacity, which HTTP errors are retryable, and how LiteLLM Router provides fallback chains and load balancing? |
Model Answers
Q1: How does the Chat Completions API work and what are the message roles?
The Chat Completions API accepts a JSON payload containing a model identifier, a messages array, and generation parameters like temperature and max_tokens. Each message has a role and content. The system role sets behavioral instructions and persona for the model. The user role carries the human input. The assistant role contains prior model responses for multi-turn context. The API is completely stateless, so every request must include the full conversation history. The response returns generated text in a choices array, a finish_reason indicating why generation stopped (stop, length, or tool_calls), and a usage object reporting prompt and completion token counts for cost calculation.
```python
# Minimal chat completions call
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a code reviewer."},
        {"role": "user", "content": "Review this function for bugs."},
    ],
    temperature=0,
    max_tokens=1024,
)

# Always check finish_reason -- "length" means truncated output
if response.choices[0].finish_reason == "length":
    raise ValueError("Response was truncated")
```
Q2: How do you handle streaming responses from an LLM API in a production web application?
You set stream=True in the API call, which returns tokens incrementally via Server-Sent Events rather than waiting for the full response. Each SSE event contains a delta with a content fragment that you accumulate into the complete response. In a FastAPI application, you use AsyncOpenAI or AsyncAnthropic to avoid blocking the event loop, wrap the token stream in an async generator, and return it via StreamingResponse with media type text/event-stream. Critical production considerations include detecting client disconnections to cancel upstream API calls and avoid wasting tokens, handling mid-stream errors that arrive as stream events rather than HTTP error codes, and using the async client exclusively in async contexts since the sync client blocks the entire event loop.
```python
from openai import AsyncOpenAI

async def stream_to_client(prompt: str):
    client = AsyncOpenAI()
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=2048,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            yield f"data: {delta.content}\n\n"
    yield "data: [DONE]\n\n"
```
Q3: What strategies do you use to manage LLM API costs at scale?
The highest-impact optimization is model tiering: routing simple tasks like classification to cheap models like GPT-4o-mini at $0.15/$0.60 per million tokens while reserving expensive models like GPT-4o at $2.50/$10.00 for complex reasoning. Second, I use token counting with tiktoken before sending requests to predict costs, enforce per-request budgets, and catch accidentally large prompts. Third, prompt caching at the provider level gives 50-90% discounts on repeated prefixes. Fourth, application-level caching with Redis eliminates redundant API calls entirely by hashing the messages array and serving cached responses. Fifth, I monitor every call's token usage and cost, aggregate by user and feature, and set daily budget alerts. The OpenAI Batch API provides an additional 50% discount for non-real-time workloads.
```python
import tiktoken

MAX_BUDGET = 0.05  # per-request ceiling in dollars; tune to your workload

enc = tiktoken.encoding_for_model("gpt-4o")
input_tokens = len(enc.encode(prompt_text))
est_cost = (input_tokens / 1_000_000) * 2.50  # input rate for gpt-4o
if est_cost > MAX_BUDGET:
    raise ValueError(f"Request would cost ${est_cost:.4f}, exceeds budget")
```
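The application-level response cache mentioned above can be sketched like this. A plain dict stands in for Redis so the example is self-contained; in production you would swap it for a Redis client with `SETEX` and a TTL. The `cache_key` and `cached_completion` helpers are illustrative names.

```python
import hashlib
import json

cache: dict = {}  # stand-in for Redis

def cache_key(model: str, messages: list) -> str:
    # Hash the exact request payload so only identical requests collide
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(model: str, messages: list, call_api) -> str:
    key = cache_key(model, messages)
    if key in cache:
        return cache[key]        # cache hit: zero API cost
    result = call_api(model, messages)
    cache[key] = result          # with Redis: SETEX key ttl result
    return result

msgs = [{"role": "user", "content": "What is your refund policy?"}]
first = cached_completion("gpt-4o", msgs, lambda m, ms: "30-day refunds.")
second = cached_completion("gpt-4o", msgs, lambda m, ms: "NEVER CALLED")
print(second)  # "30-day refunds." -- served from cache, no second API call
```

Normalizing the payload with `sort_keys=True` before hashing keeps key order from defeating the cache.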
Q4: How do you guarantee structured output from an LLM?
There are three levels of reliability. The weakest is prompt-based: asking the model to return JSON, which provides no schema guarantee. The strongest for OpenAI is native json_schema mode via the response_format parameter, which constrains token generation at inference time so the model physically cannot produce invalid JSON. For Anthropic, you repurpose the tool use mechanism by defining a tool whose input schema matches your desired output structure. The most practical approach across both providers is the instructor library, which wraps the SDK with Pydantic models as output schemas. You define your structure as a Pydantic class with validators, and instructor handles schema conversion, API calls, parsing, and automatic retries on validation failure, feeding the error back to the model so it can self-correct.
```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class SentimentResult(BaseModel):
    reasoning: str = Field(..., description="Explain your analysis")
    sentiment: str = Field(..., pattern="^(positive|negative|neutral)$")
    confidence: float = Field(..., ge=0.0, le=1.0)

client = instructor.from_openai(OpenAI())
result = client.chat.completions.create(
    model="gpt-4o",
    response_model=SentimentResult,
    max_retries=3,
    messages=[{"role": "user", "content": "Analyze: 'The product is great!'"}],
)  # result is a validated Pydantic object, never raw text
```
Q5: How do you build a resilient LLM client that handles rate limits, outages, and provider switching?
Resilience starts with retry logic using tenacity with exponential backoff and jitter. I only retry transient errors: 429 (rate limit), 500/502/503 (server errors), and connection errors. I never retry 400 (bad request) or 401 (auth error) since those indicate code bugs, not transient failures. For provider-level resilience, I use LiteLLM Router with a fallback chain: the primary model is tried first, and on failure the Router automatically tries the next provider in the chain. Load balancing across multiple API keys multiplies effective rate limits. The Router also tracks per-request cost and can enforce spending limits. For maximum reliability, I configure a three-tier fallback like GPT-4o primary, Claude Sonnet fallback, and Llama on Groq as the final fallback, ensuring the application stays responsive even during a full provider outage.
```python
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "llm", "litellm_params": {"model": "gpt-4o"}},
        {"model_name": "llm", "litellm_params": {"model": "anthropic/claude-sonnet-4-20250514"}},
        {"model_name": "llm", "litellm_params": {"model": "groq/llama-3.1-70b-versatile"}},
    ],
    fallbacks=[{"llm": ["llm"]}],
    num_retries=2,
    timeout=30,
)

msgs = [{"role": "user", "content": "..."}]

# Automatically tries the next deployment on 429/503/timeout
# (must be awaited inside an async function)
resp = await router.acompletion(model="llm", messages=msgs)
```
Scenario: Design the LLM API layer for a customer support chatbot serving 10,000 concurrent users with sub-second response times and a monthly budget of $5,000.
Approach: Use a tiered model strategy: route FAQ-style questions through a classifier to GPT-4o-mini ($0.15/$0.60 per 1M tokens) handling 80% of traffic, and escalate complex queries to GPT-4o. Implement streaming via SSE for all user-facing responses to achieve 200-400ms time to first token. Deploy a Redis response cache keyed on normalized message hashes to eliminate redundant calls, targeting 30-40% cache hit rate. Use LiteLLM Router with Claude Sonnet as a fallback for GPT-4o outages. Count tokens with tiktoken before each request and enforce a per-user daily budget of $0.50. Use Anthropic prompt caching for the system prompt (90% discount on cached tokens). Set up cost monitoring with daily alerts at 80% of the $167 daily budget threshold. Use the async SDK exclusively with connection pooling to handle concurrency without thread overhead. Manage conversation history with a sliding window of the last 10 messages plus a summarized context block, keeping each request under 4,000 input tokens.
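The sliding-window history management in the approach above can be sketched as follows; `build_context` is a hypothetical helper, and the summary string would in practice be produced by a periodic summarization call.

```python
def build_context(system_prompt: str, summary: str,
                  history: list, window: int = 10) -> list:
    # Keep the last `window` turns verbatim; older turns are represented
    # only by the running summary, bounding input tokens per request.
    recent = history[-window:]
    return (
        [{"role": "system", "content": system_prompt},
         {"role": "system", "content": f"Conversation so far: {summary}"}]
        + recent
    )

history = [{"role": "user", "content": f"msg {i}"} for i in range(25)]
msgs = build_context("You are a support agent.",
                     "User asked about billing.", history)
print(len(msgs))  # 12: two system messages plus the 10 most recent turns
```

A token-count check with tiktoken on the assembled `msgs` would then enforce the 4,000-input-token ceiling before the request is sent.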
Common Mistakes to Avoid
Using openai.OpenAI() inside a FastAPI endpoint blocks the entire event loop, serializing all requests and destroying server throughput. Always use openai.AsyncOpenAI() in async contexts. This single mistake can reduce your server's capacity from thousands of concurrent requests to effectively one at a time.
Retrying 400 Bad Request or 401 Unauthorized errors wastes time and money because these indicate bugs in your code or invalid credentials, not transient failures. Only retry 429 (rate limit), 500/502/503 (server errors), and connection timeouts. Use tenacity's retry_if_exception_type to target specific exception classes rather than catching all errors.
When finish_reason is "length" instead of "stop", the model's response was truncated at the max_tokens limit and you are missing content. Production code must check this field and either increase max_tokens, request a continuation, or log a warning. Silently accepting truncated responses leads to incomplete data extraction, broken JSON, and subtle downstream bugs that are extremely difficult to diagnose.