Module 05 — APIs & Hosting Phase

LLM Hosting & API Exposure

Running large language models in production means much more than calling someone else's API. This module covers the complete stack: choosing an inference engine, configuring GPU memory, exposing models through REST APIs, containerizing for cloud deployment, and operating self-hosted endpoints at scale. You will learn when self-hosting is worth the engineering investment, how to squeeze maximum throughput from every GPU dollar, and how to build production-grade FastAPI services around open-weight models.

vLLM & TGI
Ollama
FastAPI Wrappers
SageMaker Endpoints
Docker & ECS
GPU Optimization
01

Why Self-Host LLMs

Plain Language

Think of using an LLM API like renting an apartment: it is convenient, someone else handles the plumbing, and you can move in quickly. Self-hosting is like buying a house — you have full control over the property, you can renovate however you like, but you are also responsible for every repair. The question is not which option is universally better, but which makes sense for your situation. Some organizations have strict data residency requirements that make it illegal to send customer data to a third-party API provider. Others need to run models at extremely high volumes where per-token pricing becomes prohibitively expensive compared to renting GPU hardware outright. Still others have fine-tuned custom models that only exist as local weight files and cannot be deployed through any commercial API provider.

The economic crossover point is surprisingly concrete. Most commercial APIs charge between one and thirty dollars per million tokens, depending on the model size. A single A100 GPU rented from a cloud provider costs roughly two to three dollars per hour, and a well-optimized inference engine can push through hundreds of thousands of tokens per minute on that hardware. If your application generates more than a few million tokens per day consistently, the math starts favoring self-hosting. But this calculation only covers compute — you also need to factor in the engineering time to set up, monitor, and maintain the infrastructure, the cost of handling failovers and scaling events, and the operational burden of keeping models updated.

Privacy and compliance provide another compelling reason. In industries like healthcare, finance, and government, regulations like HIPAA, SOC 2, and GDPR may require that data never leaves a controlled environment. Self-hosting lets you run inference inside your own VPC, on your own hardware, with full audit trails. No request ever touches an external server, which dramatically simplifies compliance reviews. Even when regulations do not strictly require it, many enterprises prefer the risk posture of keeping sensitive data internal.

Latency is a third consideration. When you call a commercial API, your request travels over the internet to the provider's data center, queues behind other customers' requests, and then returns. This round trip can add 200 to 500 milliseconds of overhead before the first token even starts generating. A self-hosted model running in the same data center as your application eliminates this network latency entirely, which matters enormously for real-time applications like coding assistants, interactive agents, and voice-driven interfaces where every millisecond of delay degrades the user experience.

Finally, self-hosting gives you complete control over model versions, quantization levels, and serving configurations. You can pin a specific model checkpoint for reproducibility, swap between quantization levels to balance speed and quality, and implement custom preprocessing pipelines that would be impossible through a standard API. This flexibility is essential for teams that need deterministic behavior across deployments or that are running experimental models not available through any commercial provider.

Deep Dive

The decision framework for self-hosting versus API consumption should be evaluated across five dimensions: cost at scale, data governance, latency requirements, model customization needs, and operational maturity. Each dimension has a clear threshold beyond which self-hosting becomes the better choice, and understanding these thresholds prevents both premature optimization and missed opportunities.

Cost modeling requires careful token-volume analysis. Consider a RAG application processing 10,000 queries per day, each consuming roughly 2,000 input tokens and generating 500 output tokens. At GPT-4o pricing of $2.50 per million input tokens and $10 per million output tokens, the daily cost is approximately $50 for input and $50 for output — $100 per day or $3,000 per month. A single A100-80GB instance on AWS (p4d.24xlarge with 8 GPUs) costs roughly $32.77 per hour, or about $23,600 per month. That seems more expensive until you realize that a well-optimized vLLM deployment on that instance can serve a 70B parameter model handling 50+ concurrent requests. If your query volume grows to 100,000 queries per day, the API cost jumps to $30,000 per month while the infrastructure cost remains fixed. The crossover point depends heavily on the model you choose — smaller open-weight models like Llama 3 8B can run on much cheaper hardware, making self-hosting economical at lower volumes.
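The break-even arithmetic above can be captured in a small helper. The prices and instance rate are the illustrative figures from this section, not current quotes:

```python
def monthly_api_cost(queries_per_day: int,
                     input_tokens: int = 2_000,
                     output_tokens: int = 500,
                     input_price: float = 2.50,    # $ per 1M input tokens
                     output_price: float = 10.00,  # $ per 1M output tokens
                     days: int = 30) -> float:
    """Monthly spend on a per-token commercial API."""
    daily = (queries_per_day * input_tokens / 1e6) * input_price \
          + (queries_per_day * output_tokens / 1e6) * output_price
    return daily * days

def monthly_gpu_cost(hourly_rate: float = 32.77, hours: int = 720) -> float:
    """Fixed monthly cost of an always-on GPU instance (p4d.24xlarge rate)."""
    return hourly_rate * hours

print(monthly_api_cost(10_000))   # 3000.0  -> API is cheaper
print(monthly_api_cost(100_000))  # 30000.0 -> self-hosting is cheaper
print(monthly_gpu_cost())         # ~23594  -> fixed regardless of volume
```

The fixed-cost line is what makes the crossover: API spend scales linearly with volume while the GPU bill does not.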

Data governance goes beyond simple compliance checkboxes. When you send data to an API provider, you are trusting their security practices, their employee access controls, their data retention policies, and their subprocessor agreements. Even providers that promise not to train on your data still store your requests temporarily for abuse monitoring and billing. Self-hosting eliminates this entire trust chain. Your requests flow from your application server to your inference server over a private network, logs are stored in your own systems, and you control every aspect of data lifecycle management. For organizations handling PII, PHI, financial data, or classified information, this level of control is not optional — it is a hard requirement.

Latency optimization in self-hosted deployments can achieve remarkable results. A vLLM server running locally can deliver time-to-first-token (TTFT) under 50 milliseconds for a 7B model on a modern GPU, compared to 300-800ms for commercial APIs that include network overhead and queuing. For streaming applications, this translates to a dramatically snappier user experience. The inter-token latency — time between each generated token — is also more predictable in self-hosted deployments because you control the batch size and scheduling policy, whereas API providers may throttle your request during peak hours or batch your request with others in ways that increase variance.

Quantization is a key self-hosting capability that directly impacts the cost-performance tradeoff. The same Llama 3 70B model that requires two A100-80GB GPUs at full FP16 precision (140GB of weight storage) can fit on a single GPU when quantized to 4-bit precision using GPTQ or AWQ, reducing memory to roughly 35GB. The quality degradation from 4-bit quantization is often minimal for production use cases — benchmarks typically show a 1-3% drop on standard evaluation metrics. This means you can often serve a much larger model on the same hardware, getting better quality at the same cost.

The operational maturity dimension is often underestimated. Self-hosting an LLM is not a "deploy and forget" operation. You need monitoring for GPU utilization, memory pressure, request latency percentiles, and error rates. You need autoscaling to handle traffic spikes. You need a model management pipeline to test new model versions, perform canary deployments, and roll back if quality degrades. You need health checks, graceful shutdown procedures, and disaster recovery plans. Teams that are not ready for this operational burden should start with APIs and migrate to self-hosting only when the economic or compliance case becomes compelling enough to justify the investment.

Decision Framework

Start with commercial APIs unless you have: (1) more than $5K/month in API spend, (2) strict data residency requirements, (3) latency-critical use cases under 100ms TTFT, or (4) custom fine-tuned models. When two or more conditions apply, self-hosting pays for itself within 3-6 months.
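The four conditions and the two-condition rule above can be encoded as a small helper (a sketch of this framework, not a substitute for a real cost analysis):

```python
def hosting_decision(monthly_api_spend_usd: float,
                     data_sensitive: bool,
                     ttft_budget_ms: float,
                     custom_finetune: bool) -> str:
    """Apply the four thresholds; two or more hits favor self-hosting."""
    hits = sum([
        monthly_api_spend_usd > 5_000,  # (1) API spend above $5K/month
        data_sensitive,                 # (2) data residency requirements
        ttft_budget_ms < 100,           # (3) latency-critical TTFT budget
        custom_finetune,                # (4) custom fine-tuned weights
    ])
    return "self-host" if hits >= 2 else "commercial API"

print(hosting_decision(8_000, True, 300, False))   # self-host
print(hosting_decision(1_000, False, 300, False))  # commercial API
```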

Figure 1 — Self-hosting decision flowchart: data sensitivity, volume, and latency thresholds. The flow: if the data is sensitive or regulated, self-host; otherwise, if API spend exceeds $5K/month or TTFT under 100ms is required, self-host; otherwise use a commercial API.
02

Inference Engines: vLLM & TGI

Plain Language

An inference engine is the runtime software that loads a language model's weights into GPU memory and executes the mathematical operations needed to generate text. Think of it like a web server for AI — just as Nginx or Apache serve web pages, vLLM and TGI serve language model predictions. The engine handles all the low-level complexity: allocating GPU memory efficiently, batching multiple requests together to maximize throughput, and managing the key-value cache that prevents redundant computation during text generation.

The two dominant open-source inference engines are vLLM (from UC Berkeley) and Text Generation Inference (TGI, from Hugging Face). Both solve the same fundamental problem — making LLM inference fast and efficient — but they approach it differently. vLLM introduced a breakthrough technique called PagedAttention that manages GPU memory the way an operating system manages virtual memory, allowing it to serve many more concurrent requests than naive implementations. TGI, on the other hand, is tightly integrated with the Hugging Face ecosystem and offers a polished production experience with built-in features like token streaming, watermarking, and automatic batching.

If you have ever wondered why running a model locally feels slow compared to ChatGPT, the answer is usually the inference engine. A naive PyTorch implementation generates tokens one at a time, waiting for each token to finish before starting the next request. Production inference engines like vLLM exploit a technique called continuous batching, where new requests are dynamically inserted into an ongoing batch without waiting for existing requests to finish. This keeps the GPU busy at all times instead of idling between requests, which can improve throughput by 10x to 30x compared to static batching approaches.

Another critical optimization is KV-cache management. During text generation, the model computes attention over all previous tokens. Without caching, this computation grows quadratically with sequence length. Both vLLM and TGI maintain a key-value cache that stores previously computed attention states, so each new token only requires attention computation against the new position. The clever part is how this cache is managed — vLLM's PagedAttention allocates cache memory in small pages (similar to OS virtual memory pages), avoiding the massive contiguous memory allocations that cause out-of-memory errors when serving many concurrent requests with varying sequence lengths.
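As a rough sketch, per-token KV-cache memory is 2 (one K and one V tensor) × layers × KV heads × head dimension × bytes per value. Using Llama 3 8B's published configuration (32 layers, 8 grouped-query KV heads, head dimension 128, FP16):

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # 2 tensors (K and V) per layer, each kv_heads x head_dim values
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token(32, 8, 128)        # Llama 3 8B, FP16
print(per_token / 1024, "KiB per token")                # 128.0 KiB
print(per_token * 8192 / 2**30, "GiB for a full 8K-token context")  # 1.0
```

A single request reserving its full 8,192-token context would pin a gigabyte of cache; PagedAttention's on-demand pages are what make that reservation unnecessary.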

Deep Dive

vLLM has become the de facto standard for high-throughput LLM serving. Its core innovation, PagedAttention, partitions the KV cache into fixed-size blocks (pages) that can be stored non-contiguously in GPU memory. Traditional serving systems allocate a contiguous chunk of GPU memory for each request's KV cache, sized for the maximum possible sequence length. This leads to massive memory waste — a request that generates only 100 tokens still reserves memory for the full 4,096 or 8,192 token context window. PagedAttention eliminates this waste by allocating pages on demand, allowing memory to be shared across requests and reclaimed immediately when a request completes. In practice, this enables vLLM to serve 2-4x more concurrent requests than traditional approaches on the same hardware.

Starting a vLLM server is straightforward. The vllm serve command launches an OpenAI-compatible API server that accepts the exact same request format as the OpenAI Chat Completions API. This means you can point any application that uses the OpenAI SDK at your vLLM server simply by changing the base URL — no code changes required. The server supports streaming, function calling, multiple model formats (HuggingFace, GPTQ, AWQ, GGUF), and tensor parallelism across multiple GPUs.

# Install vLLM
pip install vllm

# Start serving Llama 3 8B with OpenAI-compatible API
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --max-num-seqs 256

# Key flags explained:
# --dtype auto            → Use FP16 or BF16 based on GPU capability
# --max-model-len 8192    → Maximum context window per request
# --gpu-memory-utilization → Fraction of GPU memory for KV cache (0.90 = 90%)
# --enable-chunked-prefill → Process long prompts in chunks, reducing TTFT
# --max-num-seqs 256      → Maximum concurrent requests in a batch

Tensor parallelism allows you to split a model across multiple GPUs when a single GPU doesn't have enough memory. For a 70B parameter model that requires roughly 140GB at FP16, you would need two A100-80GB GPUs. vLLM handles this transparently with the --tensor-parallel-size flag. The model's weight matrices are sharded across GPUs, and vLLM manages the cross-GPU communication during inference. Performance scales nearly linearly for up to 4 GPUs, with some overhead from inter-GPU communication beyond that.

# Serve a 70B model across 2 GPUs
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.92

# Serve a quantized 4-bit model (fits on single GPU)
vllm serve TheBloke/Llama-3-70B-Instruct-GPTQ \
  --quantization gptq \
  --max-model-len 4096 \
  --dtype float16

Text Generation Inference (TGI) by Hugging Face takes a different architectural approach. Written in Rust for the core server with a Python model layer, TGI achieves excellent performance through custom CUDA kernels, Flash Attention 2 integration, and an efficient Rust-based HTTP server. TGI's standout features include built-in support for speculative decoding (using a smaller draft model to speed up generation from a larger model), Paged Attention (adopted from vLLM's research), and tight integration with Hugging Face Hub for model downloading and caching.

# Run TGI via Docker (recommended approach)
docker run --gpus all --shm-size 1g \
  -p 8080:80 \
  -v /data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --max-batch-prefill-tokens 4096

# Call TGI endpoint
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "What is the capital of France?",
    "parameters": {"max_new_tokens": 100, "temperature": 0.7}
  }'
| Feature | vLLM | TGI |
|---|---|---|
| Language | Python + CUDA | Rust + Python + CUDA |
| API Format | OpenAI-compatible | Custom + OpenAI-compatible |
| PagedAttention | Native (invented here) | Adopted |
| Tensor Parallelism | Yes (simple flag) | Yes |
| Speculative Decoding | Yes | Yes (built-in) |
| Quantization | GPTQ, AWQ, FP8, GGUF | GPTQ, AWQ, EETQ, bitsandbytes |
| Deployment | pip install / Docker | Docker (primary) |
| Best For | Max throughput, OpenAI drop-in | HuggingFace ecosystem integration |
Recommendation

For most production deployments, start with vLLM — its OpenAI-compatible API makes it a drop-in replacement that requires zero code changes. Use TGI when you need tight Hugging Face Hub integration or specific features like built-in watermarking.

03

Ollama for Local Development

Plain Language

Ollama is to local LLM development what Docker was to containerization — it packages the entire complexity of model downloading, quantization, and serving into a single command-line tool that just works. Instead of manually downloading model weights, configuring CUDA drivers, managing Python environments, and setting up serving infrastructure, you simply run ollama run llama3 and start chatting. This radical simplicity makes Ollama the ideal tool for local development, prototyping, and testing workflows before deploying to production infrastructure with vLLM or TGI.

Under the hood, Ollama uses the llama.cpp inference engine, which is written in C++ and optimized for running on consumer hardware. It supports both GPU and CPU inference, meaning you can run smaller models even on a laptop without a dedicated GPU. On a MacBook with Apple Silicon (M1/M2/M3), Ollama leverages the Metal framework for GPU acceleration, achieving surprisingly fast inference — a 7B parameter model can generate 30-50 tokens per second on an M2 Pro, which is more than fast enough for interactive development.

Ollama's model library is curated and versioned, similar to Docker Hub for container images. Models are specified with names and tags like llama3:8b, mistral:7b-instruct, or phi3:mini. Each model is automatically quantized to a sensible default (usually Q4_K_M for a good balance of quality and speed), and Ollama handles all the weight file management, caching, and GPU memory allocation. You can also create custom "Modelfiles" — similar to Dockerfiles — that specify a base model, system prompt, temperature settings, and other parameters, creating a reproducible model configuration that can be shared across your team.

For development workflows, Ollama exposes an API on localhost:11434 that follows a similar pattern to the OpenAI API. Many LLM frameworks — including LangChain, LlamaIndex, and Haystack — have native Ollama integrations, meaning you can develop your entire RAG pipeline or agent system locally against Ollama and then switch to a production vLLM or API endpoint simply by changing a configuration variable. This local-first development pattern is invaluable because it eliminates API costs during the iterative development phase, provides faster iteration cycles with zero network latency, and allows you to work offline.

Deep Dive

Installation and basic usage demonstrate Ollama's design philosophy of minimal friction. On macOS, you install it via a downloadable installer; on Linux, a one-line curl command handles everything. Once installed, the ollama serve command starts a background daemon that manages model loading, GPU allocation, and API serving. Models are pulled on demand — the first time you request a model, Ollama downloads and caches it; subsequent uses load directly from cache.

# Install Ollama (Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model interactively
ollama run llama3:8b

# Pull a model without running
ollama pull mistral:7b-instruct
ollama pull phi3:mini
ollama pull codellama:13b

# List downloaded models
ollama list

# Show model details (quantization, size, parameters)
ollama show llama3:8b

Modelfiles give you Dockerfile-like control over model configuration. You define a base model, system prompt, and parameter overrides in a simple text file, then build a custom model from it. This is powerful for creating purpose-specific models — for example, a "code reviewer" model that uses CodeLlama with a specialized system prompt and lower temperature for more deterministic output.

# Create a Modelfile for a custom coding assistant
# Save as "Modelfile.code-review"

FROM codellama:13b

SYSTEM """You are an expert code reviewer. Analyze the provided code for:
1. Bugs and potential runtime errors
2. Security vulnerabilities
3. Performance issues
4. Code style and readability
Provide specific, actionable feedback with corrected code examples."""

PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER stop "</review>"

# Build and run the custom model
ollama create code-reviewer -f Modelfile.code-review
ollama run code-reviewer

Ollama's REST API on localhost:11434 provides both chat and completion endpoints. The chat endpoint accepts a list of messages (just like the OpenAI API), while the generate endpoint accepts a raw prompt. Both support streaming via newline-delimited JSON. Here is how you interact with Ollama programmatically from Python:

import requests
import json

# Chat completion (non-streaming)
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3:8b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain KV caching in transformers."}
        ],
        "stream": False
    }
)
result = response.json()
print(result["message"]["content"])

# Streaming chat completion
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3:8b",
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if not line:           # skip keep-alive blank lines
        continue
    chunk = json.loads(line)
    if chunk.get("done"):  # final chunk carries timing stats, not content
        break
    print(chunk["message"]["content"], end="", flush=True)

The OpenAI-compatible endpoint is available at /v1/chat/completions, meaning you can use the official OpenAI Python SDK with Ollama by simply overriding the base URL. This is the recommended approach for development because it means your code is portable — switch from Ollama to OpenAI or vLLM by changing a single environment variable:

from openai import OpenAI

# Point OpenAI SDK at local Ollama server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"   # Ollama doesn't need a real key
)

response = client.chat.completions.create(
    model="llama3:8b",
    messages=[
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "What is continuous batching?"}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
Development Workflow

Use OPENAI_BASE_URL=http://localhost:11434/v1 as an environment variable in your .env file during development. Your code uses the standard OpenAI SDK everywhere, and switching to production (OpenAI, vLLM, or any OpenAI-compatible endpoint) requires only changing this one variable.

04

Building FastAPI LLM Services

Plain Language

While vLLM and TGI provide raw inference endpoints, production applications usually need additional functionality that sits between the user and the model: input validation, authentication, rate limiting, prompt templates, output parsing, logging, and business-specific logic. FastAPI is the Python web framework of choice for building these wrapper services because it provides automatic request validation through Pydantic models, native async support for handling many concurrent requests efficiently, and automatic OpenAPI documentation generation.

Think of the architecture as a three-layer sandwich. At the bottom is your inference engine (vLLM or Ollama) running the actual model. In the middle is your FastAPI service that handles all the application logic — checking user permissions, selecting the right prompt template, routing to different models based on the request type, logging interactions for evaluation, and formatting the response. At the top is your frontend or client application that talks to the FastAPI service through a clean, well-documented REST API. This separation of concerns is crucial because it lets you swap out the inference engine without touching application logic, update prompt templates without redeploying the model, and scale the API layer independently from the GPU-intensive inference layer.

Streaming is particularly important for LLM applications because language models generate tokens one at a time, and users expect to see text appearing progressively rather than waiting for the entire response. FastAPI supports Server-Sent Events (SSE) through its StreamingResponse class, which pairs perfectly with the streaming APIs provided by vLLM and Ollama. Your FastAPI service receives tokens from the inference engine as they are generated and immediately forwards them to the client, creating a smooth, ChatGPT-like streaming experience.

Error handling in LLM services requires special attention because inference can fail in ways that typical web services do not. The model might run out of GPU memory during a long generation, the inference engine might crash and need restarting, or a request might time out because the model is overloaded. Your FastAPI service should handle all of these gracefully — retrying against a backup model, returning partial responses when possible, and providing clear error messages that help clients recover. Circuit breaker patterns, where you temporarily stop sending requests to a failing backend, are particularly useful for maintaining service availability during inference engine issues.
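A minimal circuit breaker can be sketched in a few lines. This is an illustrative pattern, not a specific library (hardened implementations exist, e.g. `pybreaker`); the thresholds are arbitrary defaults:

```python
import time

class CircuitBreaker:
    """Open the circuit after N consecutive failures; after a cooldown,
    allow a single probe request through (half-open state)."""
    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at = None                  # half-open: permit a probe
            self.failures = self.max_failures - 1  # one more failure re-opens
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()
```

In a FastAPI handler, you would call `breaker.allow()` before forwarding to the backend and return a 503 immediately when the circuit is open, sparing the struggling inference engine from a thundering herd of retries.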

Deep Dive

A production-grade FastAPI LLM service starts with well-defined Pydantic models for request and response validation. This ensures that malformed requests are rejected before they reach the inference engine, and that responses have a consistent structure that clients can rely on. The request model should validate the messages array, enforce limits on max_tokens to prevent runaway generation costs, and sanitize the temperature and other sampling parameters to valid ranges.

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
from openai import AsyncOpenAI
from typing import AsyncGenerator
import json, time, logging

logger = logging.getLogger(__name__)
app = FastAPI(title="LLM Service", version="1.0.0")

# --- Pydantic Models ---

class Message(BaseModel):
    role: str = Field(..., pattern="^(system|user|assistant)$")
    content: str = Field(..., min_length=1, max_length=100_000)

class ChatRequest(BaseModel):
    messages: list[Message] = Field(..., min_length=1)
    model: str = "llama3:8b"
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    max_tokens: int = Field(default=1024, ge=1, le=4096)
    stream: bool = False

class ChatResponse(BaseModel):
    content: str
    model: str
    usage: dict
    latency_ms: float

# --- Client Setup ---

# Points to vLLM or Ollama backend
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",  # vLLM
    api_key="not-needed"
)

The non-streaming endpoint wraps the OpenAI client call with error handling, timing, and structured logging. This is the simpler path and works well for backend-to-backend communication where the client does not need progressive updates:

@app.post("/v1/chat", response_model=ChatResponse)
async def chat(req: ChatRequest):
    start = time.perf_counter()
    try:
        response = await client.chat.completions.create(
            model=req.model,
            messages=[m.model_dump() for m in req.messages],
            temperature=req.temperature,
            max_tokens=req.max_tokens,
        )
    except Exception as e:
        logger.error(f"Inference failed: {e}")
        raise HTTPException(status_code=502, detail="Inference backend unavailable")

    elapsed = (time.perf_counter() - start) * 1000
    choice = response.choices[0]

    logger.info(f"model={req.model} tokens={response.usage.total_tokens} latency={elapsed:.0f}ms")

    return ChatResponse(
        content=choice.message.content,
        model=response.model,
        usage={
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
            "total_tokens": response.usage.total_tokens,
        },
        latency_ms=round(elapsed, 2)
    )

The streaming endpoint is where FastAPI's async capabilities truly shine. The pattern uses an async generator that yields Server-Sent Events (SSE) as tokens arrive from the inference engine. Each event is a JSON object containing the token delta, formatted according to the SSE protocol with data: prefix and double newline separator. The client receives these events in real time, displaying each token as it arrives:

async def stream_tokens(req: ChatRequest) -> AsyncGenerator[str, None]:
    """Async generator that yields SSE-formatted token chunks."""
    try:
        stream = await client.chat.completions.create(
            model=req.model,
            messages=[m.model_dump() for m in req.messages],
            temperature=req.temperature,
            max_tokens=req.max_tokens,
            stream=True,
        )
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                event = json.dumps({"content": delta})
                yield f"data: {event}\n\n"
        yield "data: [DONE]\n\n"
    except Exception as e:
        error = json.dumps({"error": str(e)})
        yield f"data: {error}\n\n"

@app.post("/v1/chat/stream")
async def chat_stream(req: ChatRequest):
    return StreamingResponse(
        stream_tokens(req),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # Disable Nginx buffering
        }
    )

Health checks and readiness probes are essential for container orchestration. The health endpoint should verify that the inference backend is actually responding, not just that the FastAPI process is alive. A liveness probe checks if the process is healthy, while a readiness probe checks if it can serve traffic (the model is loaded and warm):

@app.get("/health")
async def health():
    """Readiness probe: verifies inference backend is responding."""
    try:
        models = await client.models.list()
        return {"status": "healthy", "models": [m.id for m in models.data]}
    except Exception as e:
        raise HTTPException(status_code=503, detail=f"Backend unhealthy: {e}")

@app.get("/live")
async def liveness():
    """Liveness probe: confirms process is alive."""
    return {"status": "alive"}
Production Tip

Run your FastAPI service with uvicorn app:app --workers 4 --loop uvloop for maximum async performance. The uvloop event loop is significantly faster than the default asyncio loop for I/O-bound workloads like proxying to inference engines.

05

AWS SageMaker Endpoints

Plain Language

AWS SageMaker is Amazon's managed machine learning platform, and its real-time inference endpoints provide a fully managed way to deploy LLMs to production without managing any GPU infrastructure yourself. Think of SageMaker as a "valet parking" service for your model — you hand over your model weights and configuration, and AWS handles provisioning GPU instances, loading the model, setting up load balancing, configuring auto-scaling, managing health checks, and rotating instances during updates. You interact with your model through a simple API call, and AWS bills you per hour of instance uptime rather than per token.

The primary deployment path for LLMs on SageMaker uses the Hugging Face Deep Learning Container (DLC), which comes pre-installed with TGI or vLLM and all necessary CUDA drivers. You specify the model ID from Hugging Face Hub, the instance type (which determines the GPU), and SageMaker handles downloading the model, loading it into memory, and exposing it as an HTTPS endpoint. The entire setup takes about 10-15 minutes, and the endpoint automatically gets an AWS-managed TLS certificate, IAM authentication, and CloudWatch monitoring.

SageMaker's real power shows in production operations. Auto-scaling policies automatically add or remove GPU instances based on metrics like invocations per instance, GPU utilization, or custom CloudWatch metrics. Shadow deployments (called "production variants") let you route a percentage of traffic to a new model version, compare quality and performance, and gradually shift traffic — essentially A/B testing your model deployments. Blue/green deployments let you swap model versions with zero downtime. These operational features are extremely difficult to build yourself but come standard with SageMaker.
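Auto-scaling for an endpoint variant is attached through the Application Auto Scaling API, typically with a target-tracking policy on the built-in InvocationsPerInstance metric. A sketch (endpoint and variant names are placeholders; capacity bounds and cooldowns are illustrative):

```python
def scaling_policy_config(target_invocations_per_instance: int = 50) -> dict:
    """Target-tracking config keyed to SageMaker's built-in metric."""
    return {
        "TargetValue": float(target_invocations_per_instance),
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,  # wait 5 min before removing instances
        "ScaleOutCooldown": 60,  # react quickly to traffic spikes
    }

def attach_autoscaling(endpoint_name: str, variant: str = "AllTraffic") -> None:
    import boto3  # imported lazily so the config helper needs no AWS deps
    client = boto3.client("application-autoscaling")
    resource_id = f"endpoint/{endpoint_name}/variant/{variant}"
    client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=1,
        MaxCapacity=4,
    )
    client.put_scaling_policy(
        PolicyName=f"{endpoint_name}-invocations-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=scaling_policy_config(),
    )
```

The asymmetric cooldowns are deliberate: scaling out fast protects latency during spikes, while scaling in slowly avoids thrashing when traffic oscillates.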

The main tradeoff is cost and flexibility. SageMaker instances are more expensive than equivalent bare EC2 instances — you pay a premium for the managed infrastructure. A p4d.24xlarge (8x A100-80GB) costs about $32.77 per hour on EC2 but roughly 30% more through SageMaker. You also have less control over the runtime environment and cannot customize the inference engine as deeply as a raw deployment. For many teams, this premium is well worth the operational simplicity, but cost-sensitive high-volume deployments may prefer direct EC2 or ECS deployment.
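
A quick back-of-envelope sketch of what that roughly 30% premium means per month, using the p4d prices quoted in this section (actual prices vary by region and change over time):

```python
# Monthly cost of the SageMaker premium on a p4d.24xlarge.
# Hourly rates are the on-demand prices quoted in this section;
# they are illustrative, not current list prices.
EC2_HOURLY = 32.77        # p4d.24xlarge on EC2
SAGEMAKER_HOURLY = 42.60  # ml.p4d.24xlarge on SageMaker
HOURS_PER_MONTH = 24 * 30

ec2_monthly = EC2_HOURLY * HOURS_PER_MONTH
sm_monthly = SAGEMAKER_HOURLY * HOURS_PER_MONTH
premium = sm_monthly - ec2_monthly

print(f"EC2:       ${ec2_monthly:,.0f}/month")
print(f"SageMaker: ${sm_monthly:,.0f}/month")
print(f"Premium:   ${premium:,.0f}/month ({premium / ec2_monthly:.0%})")
```

At 24/7 utilization the premium is several thousand dollars a month on this instance class, which is the number to weigh against the engineering time the managed service saves you.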

Deep Dive

Deploying an LLM to SageMaker involves three core concepts: a Model (the artifacts and container), an EndpointConfig (the instance type and scaling settings), and an Endpoint (the live serving infrastructure). The Hugging Face DLC simplifies this by letting you specify just a model ID and container image — SageMaker handles downloading weights from Hugging Face Hub during deployment. Here is a complete deployment using the SageMaker Python SDK:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()
sess = sagemaker.Session()

# Get the latest TGI container image URI
image_uri = get_huggingface_llm_image_uri(
    backend="huggingface",   # "huggingface" = TGI, or use "lmi" for vLLM
    region=sess.boto_region_name,
    version="2.3.1"
)

# Define the model
model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",
        "HF_TOKEN": "hf_YOUR_TOKEN_HERE",  # For gated models
        "SM_NUM_GPUS": "1",
        "MAX_INPUT_TOKENS": "4096",
        "MAX_TOTAL_TOKENS": "8192",
        "MAX_BATCH_PREFILL_TOKENS": "4096",
    }
)

# Deploy to a real-time endpoint
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # 1x A10G 24GB — perfect for 8B models
    endpoint_name="llama3-8b-endpoint",
    container_startup_health_check_timeout=600,  # 10 min for model download
)

print(f"Endpoint ready: {predictor.endpoint_name}")

Once deployed, invoking the endpoint uses the SageMaker runtime client. The request format follows the TGI API specification. Here are both a direct invocation and a streaming invocation:

import boto3, json

runtime = boto3.client("sagemaker-runtime")

# --- Non-streaming invocation ---
payload = {
    "inputs": "What are the benefits of self-hosting LLMs?",
    "parameters": {
        "max_new_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True
    }
}

response = runtime.invoke_endpoint(
    EndpointName="llama3-8b-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload)
)

result = json.loads(response["Body"].read().decode())
print(result[0]["generated_text"])

# --- Streaming invocation ---
response = runtime.invoke_endpoint_with_response_stream(
    EndpointName="llama3-8b-endpoint",
    ContentType="application/json",
    Body=json.dumps({**payload, "stream": True})
)

for event in response["Body"]:
    if "PayloadPart" in event:  # skip non-data events in the stream
        print(event["PayloadPart"]["Bytes"].decode(), end="", flush=True)

Auto-scaling is configured through Application Auto Scaling policies attached to the endpoint. The most common approach is target-tracking scaling based on invocations per instance, which automatically adds instances when traffic increases and removes them when it decreases. For cost optimization, scheduled scaling actions can shrink the fleet to its minimum during off-hours (note that classic real-time endpoints cannot scale below one instance):

import boto3

aas = boto3.client("application-autoscaling")

# Register the endpoint as a scalable target
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/llama3-8b-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Create target-tracking scaling policy
aas.put_scaling_policy(
    PolicyName="llama3-scaling",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/llama3-8b-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,  # Target: 50 invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,   # Wait 5 min before scaling in
        "ScaleOutCooldown": 60,   # Scale out quickly (1 min)
    }
)
| SageMaker Instance | GPU | VRAM | Best For | ~Cost/hr |
|---|---|---|---|---|
| ml.g5.xlarge | 1x A10G | 24 GB | 7B models (quantized) | $1.41 |
| ml.g5.2xlarge | 1x A10G | 24 GB | 7-8B models (FP16) | $1.52 |
| ml.g5.12xlarge | 4x A10G | 96 GB | 70B models (quantized) | $7.09 |
| ml.p4d.24xlarge | 8x A100 | 640 GB | 70B+ models (FP16) | $42.60 |
| ml.g6.xlarge | 1x L4 | 24 GB | 7B models (cost-optimized) | $0.98 |
Cost Warning

SageMaker endpoints bill by the hour even when idle. Real-time endpoints cannot scale below one instance, so configure auto-scaling with the lowest minimum you can tolerate, and use scheduled actions to scale down (or delete) dev/staging endpoints outside business hours. A single ml.g5.2xlarge running 24/7 costs ~$1,100/month.
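
The scheduled scale-down mentioned above can be sketched as a pair of Application Auto Scaling actions. This builds the request parameters as plain dicts so the shape is easy to inspect; the endpoint and variant names are illustrative, and applying them requires passing each dict to `put_scheduled_action` on a real `application-autoscaling` client:

```python
import json

# Sketch: scheduled off-hours scale-down for a SageMaker endpoint.
# Real-time endpoints cannot go below one instance, so "off" for a
# dev/staging endpoint in practice means deleting and recreating it.
RESOURCE_ID = "endpoint/llama3-8b-endpoint/variant/AllTraffic"

def scheduled_action(name: str, cron: str, min_cap: int, max_cap: int) -> dict:
    """Build the request dict for application-autoscaling put_scheduled_action."""
    return {
        "ServiceNamespace": "sagemaker",
        "ScheduledActionName": name,
        "ResourceId": RESOURCE_ID,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "Schedule": cron,  # cron expressions are evaluated in UTC
        "ScalableTargetAction": {"MinCapacity": min_cap, "MaxCapacity": max_cap},
    }

night = scheduled_action("scale-down-nights", "cron(0 1 * * ? *)", 1, 1)
day = scheduled_action("scale-up-mornings", "cron(0 13 * * ? *)", 1, 4)

# To apply: boto3.client("application-autoscaling").put_scheduled_action(**night)
print(json.dumps(night, indent=2))
```

Pinning both MinCapacity and MaxCapacity to 1 overnight prevents target-tracking from scaling back out until the morning action restores the normal ceiling.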

06

Docker & ECS Deployment

Plain Language

While SageMaker provides a fully managed deployment experience, many teams prefer the flexibility and cost control of deploying LLMs using Docker containers on AWS ECS (Elastic Container Service) or EKS (Elastic Kubernetes Service). Think of Docker as a shipping container for your software — it packages your inference engine, model configuration, FastAPI wrapper, and all dependencies into a single portable unit that runs identically on your laptop, in CI/CD, and in production. ECS is the AWS service that orchestrates these containers, handling scheduling, scaling, networking, and health management.

The typical architecture for a containerized LLM deployment consists of two containers running side-by-side in the same ECS task: the inference engine container (vLLM or TGI with the model loaded) and the API wrapper container (your FastAPI service). These containers communicate over localhost within the task, and only the API container is exposed to external traffic through an Application Load Balancer. This sidecar pattern gives you clean separation between the inference engine and your application logic, and lets you update either container independently.

The biggest challenge with containerized LLM deployment is the model loading time. Model weights for a 7B parameter model are roughly 14GB at FP16, and for a 70B model they can exceed 140GB. Downloading these weights from S3 or Hugging Face Hub every time a container starts would add 5-15 minutes to your startup time, which is unacceptable for auto-scaling scenarios where you need new capacity quickly. The solution is to bake the model weights into the Docker image itself, or more commonly, use an EFS (Elastic File System) volume mount that pre-caches the weights. When a new container starts, it mounts the EFS volume and loads weights directly from the shared filesystem, reducing startup time to 1-3 minutes.
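
The startup arithmetic above is easy to sketch. The precision-to-bytes mapping is standard; the sustained download throughput is an illustrative assumption (real throughput from S3 or the Hub is often well below the instance's line rate):

```python
# Rough sizing: model weight footprint by precision, and how long a
# cold download takes at a given sustained throughput. Weights only;
# KV cache and activations come on top at serve time.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(params_b: float, precision: str) -> float:
    """Weight footprint in GB for a model with params_b billion parameters."""
    return params_b * BYTES_PER_PARAM[precision]

def download_minutes(size_gb: float, gbit_per_s: float) -> float:
    """Cold-download time at a sustained network throughput."""
    return size_gb * 8 / gbit_per_s / 60

for model, params in [("7B", 7), ("70B", 70)]:
    size = weights_gb(params, "fp16")
    print(f"{model}: ~{size:.0f} GB FP16, "
          f"~{download_minutes(size, 1.0):.1f} min at 1 Gbit/s sustained")
```

The 70B case is what makes EFS caching non-negotiable: tens of minutes of download per container start is incompatible with reactive auto-scaling.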

GPU access in Docker requires the NVIDIA Container Toolkit, which allows containers to access the host's GPU hardware. On ECS, you specify GPU requirements in your task definition, and ECS automatically schedules the task on an instance that has available GPUs. The instance types for GPU workloads (p4d, g5, g6 families) come pre-configured with NVIDIA drivers, making the setup relatively straightforward. The key configuration step is requesting GPUs in your container definition using the resourceRequirements field.

Deep Dive

A production Dockerfile for a vLLM-based LLM service starts from the official vLLM image (which includes CUDA, PyTorch, and all inference dependencies) and adds your FastAPI wrapper on top. Building on the official base image keeps the final image lean while including all necessary components:

# Dockerfile for LLM inference service
FROM vllm/vllm-openai:latest AS base

# Install additional dependencies for the API wrapper
RUN pip install fastapi uvicorn pydantic

# Copy application code
WORKDIR /app
COPY app/ /app/

# Expose ports: 8000 for vLLM, 8080 for FastAPI wrapper
EXPOSE 8000 8080

# Startup script that launches both vLLM and FastAPI
COPY start.sh /app/start.sh
RUN chmod +x /app/start.sh

ENTRYPOINT ["/app/start.sh"]

#!/bin/bash
# start.sh: launch vLLM, wait for it to become healthy, then start the API wrapper
# Launch vLLM server in the background
vllm serve ${MODEL_ID:-meta-llama/Meta-Llama-3-8B-Instruct} \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --max-model-len ${MAX_MODEL_LEN:-8192} \
  --gpu-memory-utilization ${GPU_MEM_UTIL:-0.90} &

# Wait for vLLM to be ready
echo "Waiting for vLLM to start..."
until curl -s http://localhost:8000/health > /dev/null 2>&1; do
  sleep 2
done
echo "vLLM is ready!"

# Launch FastAPI wrapper
uvicorn app.main:app --host 0.0.0.0 --port 8080 --workers 2

The ECS Task Definition specifies the container configuration, GPU requirements, and resource limits. Here is a JSON task definition (as passed to aws ecs register-task-definition) for a GPU-accelerated LLM service:

# ecs-task-definition.json
{
  "family": "llm-inference",
  "requiresCompatibilities": ["EC2"],
  "networkMode": "awsvpc",
  "cpu": "8192",
  "memory": "30720",
  "containerDefinitions": [
    {
      "name": "llm-server",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/llm-service:latest",
      "essential": true,
      "portMappings": [
        {"containerPort": 8080, "protocol": "tcp"}
      ],
      "resourceRequirements": [
        {"type": "GPU", "value": "1"}
      ],
      "environment": [
        {"name": "MODEL_ID", "value": "meta-llama/Meta-Llama-3-8B-Instruct"},
        {"name": "MAX_MODEL_LEN", "value": "8192"},
        {"name": "GPU_MEM_UTIL", "value": "0.90"}
      ],
      "mountPoints": [
        {"sourceVolume": "model-cache", "containerPath": "/root/.cache/huggingface"}
      ],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 10,
        "retries": 3,
        "startPeriod": 300
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/llm-inference",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "llm"
        }
      }
    }
  ],
  "volumes": [
    {
      "name": "model-cache",
      "efsVolumeConfiguration": {
        "fileSystemId": "fs-0123456789abcdef0",
        "rootDirectory": "/models"
      }
    }
  ]
}

An API Gateway sits in front of the Application Load Balancer to provide rate limiting, API key authentication, and request throttling. This is particularly important for LLM services because a single malicious request with a high max_tokens value could occupy a GPU for minutes. The API Gateway enforces per-client rate limits and request size limits before traffic even reaches your ECS service:

Client → HTTPS → API Gateway (rate limiting, auth, throttling) → ALB (health checks) → ECS cluster on a GPU instance, running the FastAPI wrapper on :8080 (validation, logging) which talks over localhost to the vLLM engine on :8000 (GPU inference); an EFS volume provides the model-weight cache, and CloudWatch collects logs, metrics, and alarms.
Figure 2 — Production deployment architecture: API Gateway → ALB → ECS with vLLM + FastAPI sidecar pattern
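
Gateway-level limits can be complemented by defense-in-depth inside the wrapper itself. A sketch of a parameter sanitizer that clamps abusive generation settings before they reach the inference engine (in a real service this logic would live in the Pydantic request model; the field names and limits are illustrative):

```python
# Sketch: clamp generation parameters to safe ranges before proxying
# to the inference engine, so no single request can monopolize a GPU.
LIMITS = {"max_new_tokens": (1, 1024), "temperature": (0.0, 2.0)}

def sanitize(params: dict) -> dict:
    """Clamp known generation parameters to safe ranges; drop unknown keys."""
    out = {}
    for key, (lo, hi) in LIMITS.items():
        if key in params:
            out[key] = min(max(params[key], lo), hi)
    return out

print(sanitize({"max_new_tokens": 100_000, "temperature": 0.7, "evil": 1}))
# -> {'max_new_tokens': 1024, 'temperature': 0.7}
```

Silently capping rather than rejecting means well-behaved clients never see an error for asking slightly too much, while a hostile request with an enormous max_new_tokens is defanged.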

For teams that prefer Infrastructure as Code, here is a minimal AWS CDK deployment that creates the ECS service with GPU support:

# cdk_stack.py — minimal ECS GPU deployment
from aws_cdk import (
    Stack, Duration,
    aws_ec2 as ec2,
    aws_ecs as ecs,
    aws_ecs_patterns as ecs_patterns,
)
from constructs import Construct

class LLMStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs):
        super().__init__(scope, id, **kwargs)

        vpc = ec2.Vpc(self, "LLMVpc", max_azs=2)

        cluster = ecs.Cluster(self, "LLMCluster", vpc=vpc)

        # Add GPU capacity (g5.2xlarge = 1x A10G 24GB)
        cluster.add_capacity(
            "GPUCapacity",
            instance_type=ec2.InstanceType("g5.2xlarge"),
            machine_image=ecs.EcsOptimizedImage.amazon_linux2(
                hardware_type=ecs.AmiHardwareType.GPU
            ),
            desired_capacity=1,
            min_capacity=1,
            max_capacity=3,
        )

        task = ecs.Ec2TaskDefinition(self, "LLMTask",
            network_mode=ecs.NetworkMode.AWS_VPC
        )

        container = task.add_container("LLMContainer",
            image=ecs.ContainerImage.from_registry(
                "your-ecr-repo/llm-service:latest"
            ),
            memory_limit_mib=28672,
            gpu_count=1,
            environment={
                "MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",
            },
            logging=ecs.LogDrivers.aws_logs(stream_prefix="llm"),
            health_check=ecs.HealthCheck(
                command=["CMD-SHELL", "curl -f http://localhost:8080/health"],
                interval=Duration.seconds(30),
                start_period=Duration.seconds(300),  # ECS caps container start_period at 300s
            )
        )
        container.add_port_mappings(ecs.PortMapping(container_port=8080))

        # Create ALB-fronted ECS service
        ecs_patterns.ApplicationLoadBalancedEc2Service(
            self, "LLMService",
            cluster=cluster,
            task_definition=task,
            desired_count=1,
            health_check_grace_period=Duration.seconds(600),
        )
EFS Pre-warming

Before your first deployment, pre-download model weights to EFS using a one-time EC2 instance or ECS task: huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir /mnt/efs/models/llama3-8b. This ensures subsequent container starts load from cache instead of downloading 14+ GB over the network.

🎯

Interview Ready

How to Explain This in 2 Minutes

Elevator Pitch

LLM hosting is the practice of running large language models on your own infrastructure instead of relying on third-party APIs. You choose an inference engine like vLLM or TGI, select appropriate GPU hardware, apply quantization (such as GPTQ, AWQ, or GGUF) to fit larger models into less memory, and expose the model through a REST API. The key trade-off is control versus operational burden: self-hosting gives you lower latency, full data privacy, and predictable costs at scale, but you take on GPU management, scaling, and monitoring. The decision hinges on token volume, data sensitivity, and latency requirements — once API spend exceeds roughly $5K per month or you need sub-100ms time-to-first-token, self-hosting typically wins.

Likely Interview Questions

| Question | What They're Really Asking |
|---|---|
| When would you self-host an LLM instead of using a commercial API? | Can you reason about cost, compliance, latency, and operational trade-offs rather than defaulting to one approach? |
| What is vLLM's PagedAttention and why does it matter? | Do you understand the GPU memory bottleneck in LLM serving and how modern engines solve it? |
| How does GGUF quantization differ from GPTQ and AWQ? | Can you explain quantization trade-offs and pick the right format for a given deployment target (CPU vs GPU)? |
| How would you choose a GPU for serving a 70B parameter model? | Do you know how to estimate VRAM requirements, understand tensor parallelism, and balance cost against throughput? |
| What strategies would you use to optimize latency and throughput for a self-hosted LLM? | Can you discuss continuous batching, KV-cache tuning, speculative decoding, and autoscaling as a coherent system? |

Model Answers

When would you self-host an LLM instead of using a commercial API?
Self-hosting makes sense when at least two of these conditions are met: API costs exceed $5K per month, data governance rules prohibit sending data to third parties, latency requirements demand sub-100ms time-to-first-token, or you need to serve a custom fine-tuned model. At high token volumes, a single A100 GPU running vLLM can serve hundreds of thousands of tokens per minute at a fixed hourly cost, whereas API pricing scales linearly with usage. However, self-hosting demands operational maturity — monitoring, autoscaling, failover, and model update pipelines — so teams should start with APIs and migrate when the economics or compliance case is clear.
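
The break-even claim above can be sanity-checked with simple arithmetic. All three inputs here are illustrative assumptions (a blended API rate, a typical A100 rental price, and a plausible single-replica throughput):

```python
# Sanity check on the self-host break-even point. Illustrative numbers:
# blended API price, one rented A100 running an optimized engine.
API_PRICE_PER_M_TOKENS = 10.0   # blended input+output, $/1M tokens
GPU_HOURLY = 2.50               # A100 rental, $/hour
GPU_TOKENS_PER_MIN = 100_000    # sustained throughput of one replica

gpu_monthly = GPU_HOURLY * 24 * 30
tokens_per_month = GPU_TOKENS_PER_MIN * 60 * 24 * 30
api_cost_same_volume = tokens_per_month / 1e6 * API_PRICE_PER_M_TOKENS

print(f"GPU:               ${gpu_monthly:,.0f}/month")
print(f"API, same volume:  ${api_cost_same_volume:,.0f}/month")

# Break-even volume: tokens/month at which API spend equals the GPU bill
breakeven_tokens = gpu_monthly / API_PRICE_PER_M_TOKENS * 1e6
print(f"Break-even volume: {breakeven_tokens / 1e6:,.0f}M tokens/month")
```

Under these assumptions the GPU pays for itself at roughly 180M tokens per month, about 6M per day, consistent with the "few million tokens per day" rule of thumb; the large gap between the GPU bill and the API cost at full utilization is what funds the operational overhead.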

What is vLLM's PagedAttention and why does it matter?
PagedAttention manages the KV cache in non-contiguous memory pages, similar to how an operating system manages virtual memory. Traditional serving allocates a contiguous block for each request sized to the maximum sequence length, wasting enormous amounts of GPU memory on short sequences. PagedAttention allocates pages on demand and reclaims them immediately when a request finishes, enabling 2-4x more concurrent requests on the same hardware. This directly translates to higher throughput and lower cost per token in production.
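
The memory arithmetic behind that answer, sketched for a Llama-3-8B-shaped model (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16; the shape figures are approximate):

```python
# Why PagedAttention matters: KV-cache bytes per token, and how much a
# contiguous max-length allocation wastes on a typical short sequence.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 8, 128, 2

# Factor of 2 for the K and V tensors
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES
print(f"KV cache: {kv_bytes_per_token / 1024:.0f} KB per token")

max_len, actual_len = 8192, 500
reserved = max_len * kv_bytes_per_token    # contiguous pre-allocation
used = actual_len * kv_bytes_per_token     # what the request really needs
print(f"Contiguous reservation: {reserved / 2**30:.2f} GB per request")
print(f"Used at {actual_len} tokens: {used / 2**20:.1f} MB "
      f"({1 - used / reserved:.0%} wasted)")
```

A naive contiguous allocator reserves about 1 GB per request here while a 500-token conversation uses ~62 MB; paging reclaims that ~94% gap, which is where the 2-4x concurrency gain comes from.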

How does GGUF quantization differ from GPTQ and AWQ?
GGUF is the format used by llama.cpp and Ollama, optimized for CPU and Apple Silicon inference with support for mixed-precision schemes like Q4_K_M that quantize different layers at different bit widths. GPTQ and AWQ are GPU-focused formats that use calibration data to minimize quantization error — GPTQ applies one-shot weight quantization while AWQ protects salient weight channels that disproportionately affect accuracy. For GPU-based production serving with vLLM or TGI, GPTQ or AWQ is preferred. For local development on laptops or CPU-based edge deployment, GGUF is the standard choice.

How would you choose a GPU for serving a 70B parameter model?
A 70B model at FP16 requires roughly 140GB of VRAM just for weights. A single A100-80GB is insufficient, so you either use tensor parallelism across two A100-80GB GPUs or quantize to 4-bit (roughly 35GB) to fit on one card. For production throughput, two A100s with tensor parallelism gives better performance because the KV cache also needs significant memory. When budgeting, compare the cost of an A100 instance ($2-3/hour) against H100 ($5-8/hour, but 2-3x faster inference) — the H100 often wins on cost-per-token despite the higher hourly rate due to its superior throughput.
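
The fit check described above reduces to a few lines of arithmetic. This sketch treats whatever VRAM remains after loading weights as the KV-cache budget, ignoring activation and framework overhead for simplicity:

```python
# Rough VRAM fit check for a 70B model: weight footprint per precision,
# then leftover KV-cache budget on common A100 configurations.
def weight_gb(params_b: float, bytes_per_param: float) -> float:
    return params_b * bytes_per_param

for label, bpp in [("FP16", 2.0), ("4-bit", 0.5)]:
    print(f"70B {label}: ~{weight_gb(70, bpp):.0f} GB of weights")

for gpus, vram in [(1, 80), (2, 160)]:
    kv_budget = vram - weight_gb(70, 2.0)
    verdict = (f"{kv_budget:.0f} GB left for KV cache"
               if kv_budget > 0 else "does not fit")
    print(f"FP16 on {gpus}x A100-80GB: {verdict}")
```

FP16 weights alone overflow a single 80 GB card, while two cards with tensor parallelism leave only ~20 GB of KV-cache headroom, which is why 4-bit quantization (fitting comfortably on one card with ample cache space) is so attractive at this scale.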

What strategies would you use to optimize latency and throughput for a self-hosted LLM?
Start with continuous batching to keep GPUs saturated — vLLM does this by default, dynamically inserting new requests into running batches. Tune gpu-memory-utilization to 0.90-0.95 to maximize KV-cache space. Enable chunked prefill to reduce time-to-first-token for long prompts. Consider speculative decoding with a smaller draft model to speed up generation from a larger target model. For scaling, deploy behind a load balancer with autoscaling based on GPU utilization and request queue depth, and use model replicas rather than larger GPU counts when throughput rather than single-request latency is the bottleneck.
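
The "scale out with replicas" advice above implies a simple capacity calculation. A sketch, where the peak request rate, average response length, and per-replica throughput are all illustrative assumptions to be replaced with your own benchmarks:

```python
import math

# Sketch: fleet sizing from peak demand, keeping each replica below a
# utilization headroom so latency stays stable under bursts.
def replicas_needed(peak_req_per_s: float, avg_output_tokens: float,
                    tokens_per_s_per_replica: float,
                    headroom: float = 0.7) -> int:
    """Replicas so each runs at ~headroom fraction of max throughput."""
    demand = peak_req_per_s * avg_output_tokens      # tokens/s to generate
    usable = tokens_per_s_per_replica * headroom
    return math.ceil(demand / usable)

print(replicas_needed(peak_req_per_s=12, avg_output_tokens=400,
                      tokens_per_s_per_replica=2500))
```

Running at 70% rather than 100% of benchmarked throughput leaves room for traffic spikes and for the queue-depth autoscaler to react before requests start timing out.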

System Design Scenario

Design Challenge

Your company processes 50,000 customer support tickets per day and wants to deploy a Llama 3 70B model to draft responses. Each ticket averages 800 input tokens and 400 output tokens. Data must stay within your AWS VPC due to PII concerns. Design the hosting architecture: choose your inference engine, GPU instance type, number of replicas, quantization strategy, and autoscaling policy. Estimate the monthly cost and compare it against using the Claude API at $3 per million input tokens and $15 per million output tokens. Explain your latency target and how you would monitor quality drift over time.

Common Mistakes

  • Ignoring operational costs in the self-host vs API comparison. Teams often compare only compute costs against API pricing, forgetting the engineering hours spent on infrastructure setup, monitoring dashboards, on-call rotations, model updates, and incident response. A fair comparison includes at least 0.5-1 FTE of ongoing operational overhead for a self-hosted deployment.
  • Over-provisioning GPU memory by skipping quantization. Running a 70B model at full FP16 across four GPUs when a 4-bit AWQ quantized version achieves nearly identical quality on a single GPU wastes three-quarters of your hardware budget. Always benchmark quantized variants against your specific evaluation set before defaulting to full precision.
  • Treating GPU selection as a pure VRAM calculation. VRAM determines whether a model fits, but memory bandwidth and compute throughput determine how fast it runs. An H100 has 3.35 TB/s memory bandwidth versus the A100's 2 TB/s, which translates directly to faster token generation because LLM inference is memory-bandwidth-bound during the decode phase. Choosing the cheapest GPU that fits the model often leads to unacceptable latency.