Module 13 — Cloud Phase

AWS Cloud Services for GenAI

AWS provides the complete infrastructure stack for building, deploying, and operating GenAI applications at scale. This module covers Amazon Bedrock (managed foundation model access), SageMaker (custom model training and hosting), Lambda and API Gateway (serverless inference), S3 and DynamoDB (data storage for RAG pipelines), and the networking, security, and cost management patterns essential for production GenAI workloads on AWS.

Amazon Bedrock
SageMaker
Lambda + API Gateway
S3 & DynamoDB
IAM & Security
Cost Optimization
01

Amazon Bedrock

Plain Language

Amazon Bedrock is AWS's managed service for accessing foundation models from multiple providers through a single API. Instead of managing separate accounts and integrations with each model provider (Anthropic, Meta, Cohere, Stability AI, and others), you get a unified interface to Claude, Llama, Titan, Command, and other models. On-demand usage is billed per token with no upfront commitment, and your prompts and completions stay within your AWS account: they are not used to train the underlying models. Think of Bedrock as an AI model marketplace with built-in enterprise features: VPC endpoints for private connectivity, CloudTrail for audit logging, and IAM for fine-grained access control.

Bedrock goes beyond basic model access with three higher-level features. Knowledge Bases provide a fully managed RAG pipeline: you point Bedrock at an S3 bucket of documents, and it handles chunking, embedding, vector storage (OpenSearch Serverless by default), and retrieval. Agents let you build tool-using agents that can query databases, call APIs, and execute multi-step workflows, with the orchestration managed by AWS. Guardrails provide configurable content filters, PII detection, and topic boundaries as a managed service. Together, these features let you build a complete GenAI application without managing any ML infrastructure.

The key advantage of Bedrock over calling model providers directly is enterprise governance. All API calls flow through your AWS account, which means they are subject to your IAM policies, VPC network rules, CloudTrail audit logs, and AWS Config compliance rules. You can restrict which models specific teams can access, enforce data residency by choosing specific AWS regions, and monitor usage and costs through standard AWS billing tools. For organizations that already invest heavily in AWS, Bedrock is the path of least resistance to GenAI adoption.

The tradeoff is that Bedrock's model versions may lag behind the providers' direct APIs by days or weeks. When Anthropic releases a new Claude model, it appears on their API immediately but may take time to arrive on Bedrock. Bedrock also does not support every parameter and feature that each provider's native API offers. For applications where having the absolute latest model version is critical, calling the provider directly may be better; for everything else, Bedrock's governance and integration benefits outweigh the slight delay.

Deep Dive

import boto3, json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# --- Invoke Claude via Bedrock ---
response = bedrock.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    contentType="application/json",
    accept="application/json",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": "Explain Amazon Bedrock in 3 sentences."
        }]
    })
)

result = json.loads(response["body"].read())
print(result["content"][0]["text"])

# --- Streaming response ---
response = bedrock.invoke_model_with_response_stream(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    contentType="application/json",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": "Write a haiku about cloud computing."}]
    })
)

for event in response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    if chunk["type"] == "content_block_delta":
        print(chunk["delta"]["text"], end="", flush=True)
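
The invoke_model calls above use Anthropic's request format, so switching to another provider's model means rewriting the body. Bedrock also exposes the Converse API (the converse method on the bedrock-runtime client), which accepts one message shape for the models that support it. The sketch below is illustrative rather than definitive: the helper names build_converse_request, extract_text, and ask are hypothetical, and the network call is isolated in ask so the other helpers can be exercised without AWS credentials.

```python
def build_converse_request(model_id: str, prompt: str, max_tokens: int = 512) -> dict:
    """Build keyword arguments for the bedrock-runtime converse() call."""
    return {
        "modelId": model_id,
        # Converse messages carry a list of content blocks instead of a plain string.
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens},
    }


def extract_text(response: dict) -> str:
    """Join the text blocks of a converse() response into one string."""
    blocks = response["output"]["message"]["content"]
    return "".join(block["text"] for block in blocks if "text" in block)


def ask(model_id: str, prompt: str) -> str:
    """Send one prompt through the Converse API (requires AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3 installed

    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    return extract_text(client.converse(**build_converse_request(model_id, prompt)))
```

With this shape, moving from Claude to another Converse-capable model should only require changing the model ID passed to ask.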

Bedrock Knowledge Bases provide a fully managed RAG pipeline. You create a knowledge base, point it at an S3 data source, and Bedrock handles the entire indexing pipeline:

# Query a Bedrock Knowledge Base (managed RAG)
bedrock_agent = boto3.client("bedrock-agent-runtime")

response = bedrock_agent.retrieve_and_generate(
    input={"text": "What is our company's return policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID_HERE",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0"
        }
    }
)

print(response["output"]["text"])
for citation in response.get("citations", []):
    for ref in citation["retrievedReferences"]:
        print(f"  Source: {ref['location']['s3Location']['uri']}")
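
When you want the retrieval step without Bedrock's built-in generation (for example, to assemble your own prompt), the bedrock-agent-runtime client also offers a retrieve call that returns the matching chunks. The sketch below assumes that response shape (a retrievalResults list whose entries carry content.text and a score); the helper names build_retrieve_request and format_context are hypothetical, and the actual call is shown only in a comment.

```python
def build_retrieve_request(kb_id: str, query: str, top_k: int = 5) -> dict:
    """Build keyword arguments for bedrock-agent-runtime's retrieve() call."""
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    }


def format_context(results: list, min_score: float = 0.0) -> str:
    """Turn retrieval results into a numbered context block for a prompt."""
    lines = []
    for result in results:
        # Skip weak matches; results without a score are treated as 0.0.
        if result.get("score", 0.0) < min_score:
            continue
        lines.append(f"[{len(lines) + 1}] {result['content']['text']}")
    return "\n".join(lines)


# Usage against a real knowledge base (requires AWS credentials):
#   response = bedrock_agent.retrieve(**build_retrieve_request("KB_ID_HERE", "return policy"))
#   context = format_context(response["retrievalResults"], min_score=0.4)
```

The min_score cutoff of 0.4 in the usage comment is an arbitrary placeholder; a sensible threshold depends on the embedding model and your documents.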

| Bedrock Model | Provider | Input ($/1M tokens) | Output ($/1M tokens) | Best For |
|---|---|---|---|---|
| Claude 3.5 Sonnet | Anthropic | $3.00 | $15.00 | Complex reasoning, coding |
| Claude 3 Haiku | Anthropic | $0.25 | $1.25 | Fast, cost-effective tasks |
| Llama 3 70B | Meta | $2.65 | $3.50 | Open-weight, customizable |
| Titan Text Express | Amazon | $0.20 | $0.60 | Simple tasks, lowest cost |
| Command R+ | Cohere | $3.00 | $15.00 | RAG-optimized |

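
To make the per-token prices concrete, the sketch below turns the table into a per-request cost estimate. The prices are copied from the table and may be out of date, and the function name request_cost is hypothetical.

```python
# USD per 1M tokens (input, output), copied from the table above.
# Check current Bedrock pricing before relying on these numbers.
PRICES = {
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3 Haiku": (0.25, 1.25),
    "Llama 3 70B": (2.65, 3.50),
    "Titan Text Express": (0.20, 0.60),
    "Command R+": (3.00, 15.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the on-demand cost in USD of a single request."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Compare every model for a 2,000-token prompt with a 500-token answer.
    for name in PRICES:
        print(f"{name}: ${request_cost(name, 2000, 500):.6f}")
```

For that 2,000-in, 500-out request, Claude 3.5 Sonnet comes to about $0.0135 and Claude 3 Haiku to about $0.0011, a 12x difference per call.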
Bedrock + LiteLLM

Use LiteLLM as a unified SDK to call Bedrock models with the same OpenAI-compatible code you use for other providers: litellm.completion(model="bedrock/anthropic.claude-3-5-sonnet...", messages=[...]). This makes your code portable between Bedrock, direct API, and self-hosted models.

02

SageMaker for GenAI

Plain Language

While Bedrock provides access to pre-built foundation models, SageMaker is where you go when you need custom model training, fine-tuning, or self-hosted inference with full control. SageMaker provides managed Jupyter notebooks for experimentation, training jobs that scale to multi-GPU clusters, and real-time inference endpoints that auto-scale based on traffic. For GenAI specifically, SageMaker JumpStart provides one-click deployment of popular open models (Llama, Mistral, Falcon) on optimized GPU instances with TGI or vLLM as the inference engine.

The typical SageMaker workflow for GenAI is: (1) Use SageMaker Studio for experimentation and prompt engineering, (2) Fine-tune a model using SageMaker Training Jobs with your custom dataset stored in S3, (3) Deploy the fine-tuned model to a SageMaker Endpoint with auto-scaling, and (4) Monitor performance using SageMaker Model Monitor. Module 05 covered SageMaker endpoints in detail — this section focuses on the broader ecosystem and how SageMaker integrates with other AWS services for end-to-end GenAI applications.

SageMaker's Inference Components feature is particularly powerful for cost optimization. Instead of dedicating an entire GPU instance to a single model, you can host multiple models on the same instance and dynamically allocate GPU memory between them. This means you can run a large model for complex queries and a small model for simple queries on the same hardware, switching between them based on the request. This can reduce inference costs by 50-70% compared to dedicated endpoints for each model.

Deep Dive

import sagemaker
from sagemaker.jumpstart.model import JumpStartModel

# One-click deployment of Llama 3 via JumpStart
model = JumpStartModel(
    model_id="meta-textgeneration-llama-3-8b-instruct",
    role=sagemaker.get_execution_role(),
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="llama3-jumpstart",
)

# Invoke
response = predictor.predict({
    "inputs": "What are the benefits of RAG?",
    "parameters": {"max_new_tokens": 256, "temperature": 0.7}
})
print(response)
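
The endpoint above runs a fixed single instance. Auto-scaling for SageMaker endpoints is configured through the Application Auto Scaling service rather than through SageMaker itself. The sketch below builds the two requests involved (register a scalable target, then attach a target-tracking policy). The variant name "AllTraffic", the capacity limits, the target of 50 invocations per instance, and the helper names are assumptions for illustration, not values taken from this module.

```python
def scaling_target(endpoint_name: str, variant: str = "AllTraffic",
                   min_instances: int = 1, max_instances: int = 4) -> dict:
    """Arguments for application-autoscaling register_scalable_target()."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }


def scaling_policy(endpoint_name: str, variant: str = "AllTraffic",
                   invocations_per_instance: float = 50.0) -> dict:
    """Arguments for application-autoscaling put_scaling_policy()."""
    return {
        "PolicyName": f"{endpoint_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Add or remove instances to hold this many invocations per instance per minute.
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # seconds to wait before removing capacity
            "ScaleOutCooldown": 60,   # seconds to wait before adding more capacity
        },
    }


def apply_autoscaling(endpoint_name: str) -> None:
    """Register the target and policy (requires AWS credentials)."""
    import boto3  # imported lazily so the builders above work without boto3

    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**scaling_target(endpoint_name))
    client.put_scaling_policy(**scaling_policy(endpoint_name))
```

Calling apply_autoscaling("llama3-jumpstart") would attach the policy to the endpoint deployed above, assuming its production variant uses the assumed name.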
Clean Up Endpoints

SageMaker endpoints bill by the hour even when idle. Always delete endpoints after experimentation: predictor.delete_endpoint(). In non-production environments, automate the cleanup as well, for example with a scheduled job that deletes endpoints that have received no invocations recently.

03

Serverless GenAI

Plain Language

Not every GenAI application needs GPU instances running 24/7. For many use cases — document processing pipelines, chatbot backends with moderate traffic, scheduled report generation — a serverless architecture using Lambda + API Gateway + Bedrock is the most cost-effective approach. You pay only when requests are being processed, with zero cost during idle periods. Lambda handles the application logic (prompt construction, RAG orchestration, response formatting), API Gateway provides the HTTP endpoint with authentication, and Bedrock provides the model inference.

The main limitations of serverless GenAI are timeouts and cold start latency. Lambda itself allows up to 15 minutes per invocation, and LLM calls through Bedrock typically complete in 2-15 seconds depending on the model and output length, well within that limit. The tighter constraint is usually API Gateway, whose default integration timeout is 29 seconds for REST APIs (30 seconds maximum for HTTP APIs), so a long generation can fail at the gateway even though the Lambda function is still running. Streaming responses also do not fit the traditional request-response pattern; they require Lambda response streaming, a feature that lets a function send partial output as it is generated. For applications that need streaming, Lambda function URLs with response streaming or API Gateway WebSocket APIs are the usual solutions.

Deep Dive

# Lambda function for Bedrock-powered chatbot
import json, boto3

bedrock = boto3.client("bedrock-runtime")

def handler(event, context):
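    # API Gateway proxy integrations deliver the request body as a JSON string.
    try:
        body = json.loads(event.get("body") or "{}")
        user_message = body["message"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Reject malformed requests instead of failing with a 500.
        return {
            "statusCode": 400,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"error": "Request body must be JSON with a 'message' field"}),
        }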

    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        contentType="application/json",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": user_message}]
        })
    )

    result = json.loads(response["body"].read())

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({
            "response": result["content"][0]["text"],
            "model": "claude-3-haiku",
            "usage": result["usage"]
        })
    }
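
Because the handler depends on a live Bedrock client, it is awkward to test locally. One option is to pass the client in, so a stub can stand in for Bedrock during tests. The sketch below is one way to do that; make_handler, FakeBedrock, and FakeBody are hypothetical names, and the stubbed payload only mimics the fields the handler reads.

```python
import json


class FakeBody:
    """Mimics the streaming body object returned by invoke_model."""

    def __init__(self, payload: dict):
        self._raw = json.dumps(payload).encode("utf-8")

    def read(self) -> bytes:
        return self._raw


class FakeBedrock:
    """Stand-in for the bedrock-runtime client; records the last request body."""

    def invoke_model(self, **kwargs):
        self.last_request = json.loads(kwargs["body"])
        return {"body": FakeBody({
            "content": [{"type": "text", "text": "stubbed reply"}],
            "usage": {"input_tokens": 5, "output_tokens": 2},
        })}


def make_handler(client, model_id: str):
    """Return a Lambda-style handler bound to the given client and model."""
    def handler(event, context):
        body = json.loads(event.get("body") or "{}")
        response = client.invoke_model(
            modelId=model_id,
            contentType="application/json",
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 512,
                "messages": [{"role": "user", "content": body.get("message", "")}],
            }),
        )
        result = json.loads(response["body"].read())
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"response": result["content"][0]["text"],
                                "usage": result["usage"]}),
        }
    return handler


# Local invocation with an API Gateway-style event; no AWS calls are made.
fake = FakeBedrock()
handler = make_handler(fake, "anthropic.claude-3-haiku-20240307-v1:0")
reply = handler({"body": json.dumps({"message": "Hello"})}, None)
```

In the deployed function, the same factory would be called once at import time with boto3.client("bedrock-runtime") instead of the stub.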
04

Data Services for GenAI

Plain Language

GenAI applications need several types of data storage: S3 for document files (PDFs, images) that feed RAG pipelines, OpenSearch Serverless or Aurora PostgreSQL with pgvector for vector embeddings, DynamoDB for conversation history and session state, and ElastiCache (Redis) for caching embeddings and LLM responses. Choosing the right storage for each data type is critical for both performance and cost.

For RAG vector storage on AWS, you have three main options. OpenSearch Serverless is fully managed and integrates natively with Bedrock Knowledge Bases, so there is no vector database to operate. Aurora PostgreSQL with pgvector keeps vectors in your existing relational database alongside structured data, which is ideal if you already use Aurora. A dedicated vector database, such as Qdrant self-hosted on ECS or Pinecone as an external managed service, offers advanced features but adds infrastructure or another vendor to manage. For most teams starting out, OpenSearch Serverless through Bedrock Knowledge Bases is the simplest path.

Deep Dive

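import time

import boto3
from boto3.dynamodb.conditions import Key

# DynamoDB for conversation history.
# Assumes a table with partition key "session_id" (string) and sort key
# "timestamp" (number), with TTL enabled on the "ttl" attribute.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("conversations")

def save_message(session_id: str, role: str, content: str):
    now = time.time()
    table.put_item(Item={
        "session_id": session_id,
        "timestamp": int(now * 1000),  # millisecond sort key
        "role": role,
        "content": content,
        "ttl": int(now) + 86400 * 7,  # expire after 7 days (epoch seconds)
    })

def get_history(session_id: str, limit: int = 20) -> list:
    # Read newest-first so Limit keeps the most recent messages,
    # then reverse to return them in chronological order.
    response = table.query(
        KeyConditionExpression=Key("session_id").eq(session_id),
        ScanIndexForward=False,
        Limit=limit,
    )
    items = reversed(response["Items"])
    return [{"role": i["role"], "content": i["content"]} for i in items]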
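
Stored history still has to be turned into a messages list for the model. The sketch below trims the history to the most recent turns and appends the new user message. It assumes the stored turns alternate between user and assistant, and the function name build_messages and the default of 10 messages are illustrative choices.

```python
def build_messages(history: list, user_message: str, max_messages: int = 10) -> list:
    """Keep the most recent turns and append the new user message."""
    # Guard against max_messages=0, since history[-0:] would return the whole list.
    recent = history[-max_messages:] if max_messages > 0 else []
    # The Anthropic Messages format expects the conversation to open with a user turn,
    # so drop any leading assistant turns left over after trimming.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return recent + [{"role": "user", "content": user_message}]


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "Retrieval-augmented generation."},
]
messages = build_messages(history, "Give me an example.", max_messages=3)
```

Trimming by message count is the simplest policy; a production system might trim by estimated token count instead, since long messages dominate the prompt cost.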
05

Security & IAM

Plain Language

AWS security for GenAI follows the shared responsibility model: AWS secures the infrastructure, you secure your application and data. The critical security controls for GenAI applications are: IAM policies that restrict which models and actions each service or user can access, VPC endpoints that keep Bedrock traffic off the public internet, KMS encryption for data at rest and TLS for data in transit, CloudTrail for audit logging of every model invocation, and AWS Config rules that enforce compliance standards automatically.

The principle of least privilege is especially important for GenAI. A Lambda function that calls Bedrock should have an IAM policy that allows only the specific model IDs it needs, not blanket access to all Bedrock models. A SageMaker endpoint role should have access to only the S3 bucket containing its model artifacts, not your entire S3 estate. These fine-grained policies limit the blast radius if any single component is compromised.

Deep Dive

IAM policy: least-privilege Bedrock access
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": [
        "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku*",
        "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet*"
      ]
    },
    {
      "Effect": "Deny",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "arn:aws:bedrock:*::foundation-model/anthropic.claude-3-opus*",
      "Condition": {
        "StringNotEquals": {"aws:PrincipalTag/team": "ml-research"}
      }
    }
  ]
}
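
Hand-writing a policy like this for every function invites copy-paste drift. A minimal sketch of generating the Allow statement from a list of approved model IDs follows; the function name bedrock_invoke_policy and the Sid value are hypothetical, and the ARN format matches the foundation-model ARNs used in the policy above.

```python
import json


def bedrock_invoke_policy(model_ids: list, region: str = "us-east-1") -> dict:
    """Build an IAM policy document that allows invoking only the given models."""
    resources = [
        f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
        for model_id in model_ids
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "InvokeApprovedModelsOnly",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
            ],
            "Resource": resources,
        }],
    }


policy = bedrock_invoke_policy(["anthropic.claude-3-haiku-20240307-v1:0"])
policy_json = json.dumps(policy, indent=2)
```

The resulting policy_json string can be attached to a role as an inline policy, for example through the IAM client's put_role_policy call with PolicyDocument=policy_json.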
06

Cost Management

Plain Language

GenAI costs on AWS can escalate quickly without proper controls. The main cost drivers are: model inference (per-token charges for Bedrock, per-hour charges for SageMaker endpoints), GPU instances (g5.2xlarge at ~$1.50/hr adds up to ~$1,100/month if left running), vector database storage (OpenSearch Serverless has minimum charges), and data transfer. The key cost optimization strategies are: use the cheapest model that meets quality requirements (Haiku before Sonnet, Titan before Claude), cache responses so that identical prompts are not sent to the model twice, auto-scale SageMaker endpoints (including scale-to-zero for dev), and set up AWS Budgets alerts before costs surprise you.

Provisioned Throughput on Bedrock lets you reserve model capacity at a fixed hourly rate per model unit, which can be significantly cheaper than on-demand pricing at high, steady volumes. If you expect to spend more than roughly $10,000/month on Bedrock on-demand calls, provisioned throughput typically saves 30-50%. The discounted rates require a commitment term (one month or six months), and in return you get dedicated throughput that is not subject to on-demand throttling, which matters for production applications that need consistent performance.
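
Whether provisioned throughput pays off is a break-even calculation against your on-demand bill. The sketch below shows the arithmetic only. The hourly rate in the example is a made-up placeholder, not a published price, and the function names are hypothetical. It also assumes the purchased model units can actually carry your traffic.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def provisioned_monthly_cost(hourly_rate_per_unit: float, model_units: int = 1) -> float:
    """Monthly cost of keeping the given number of model units provisioned."""
    return hourly_rate_per_unit * model_units * HOURS_PER_MONTH


def savings_fraction(on_demand_monthly: float, provisioned_monthly: float) -> float:
    """Fraction saved by switching; negative means provisioned costs more."""
    return (on_demand_monthly - provisioned_monthly) / on_demand_monthly


# Hypothetical example: $10,000/month on-demand versus one unit at $10/hour.
provisioned = provisioned_monthly_cost(hourly_rate_per_unit=10.0, model_units=1)
saved = savings_fraction(10_000.0, provisioned)
```

With these placeholder numbers the provisioned cost is $7,300/month, a 27% saving. At $5,000/month of on-demand spend the same unit would cost 46% more than on-demand, which is why the decision depends on sustained volume.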

Deep Dive

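import boto3

# Set up a CloudWatch alarm on estimated Bedrock charges.
# Billing metrics are published only in us-east-1 and require billing
# alerts to be enabled in the account's billing preferences.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="BedrockMonthToDateCostAlarm",
    MetricName="EstimatedCharges",
    Namespace="AWS/Billing",
    Statistic="Maximum",
    Period=21600,  # billing data is refreshed only a few times per day
    EvaluationPeriods=1,
    # EstimatedCharges is cumulative for the current month, so this alarm
    # fires once month-to-date Bedrock spend passes $100 (not $100/day).
    Threshold=100.0,
    ComparisonOperator="GreaterThanThreshold",
    Dimensions=[
        # Assumed value: confirm the exact ServiceName shown for this metric in your account.
        {"Name": "ServiceName", "Value": "AmazonBedrock"},
        {"Name": "Currency", "Value": "USD"},
    ],
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:cost-alerts"],  # placeholder account ID
)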

| Cost Strategy | Savings | Effort |
|---|---|---|
| Use smallest effective model (Haiku instead of Sonnet) | 80-90% | Low (just test quality) |
| Cache identical prompts (Redis/ElastiCache) | 30-70% | Medium |
| Provisioned Throughput for high volume | 30-50% | Low (commitment required) |
| Auto-scale SageMaker to zero off-hours | 50-70% | Medium |
| Prompt engineering (shorter prompts) | 20-40% | Medium |

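
The caching strategy in the table can be sketched as a lookup keyed on a hash of the full request. The example below uses an in-memory dict as the store, and a Redis or ElastiCache client would take its place in production. The class name ResponseCache and the call_model callback are hypothetical, and caching like this is only appropriate when repeated prompts should return the same answer.

```python
import hashlib
import json


class ResponseCache:
    """Cache model responses keyed on a hash of the request parameters."""

    def __init__(self, store=None):
        # Any dict-like object works; an in-memory dict is used by default.
        self.store = {} if store is None else store
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model_id: str, messages: list, max_tokens: int) -> str:
        # sort_keys makes the serialization stable, so equal requests hash equally.
        payload = json.dumps(
            {"model": model_id, "messages": messages, "max_tokens": max_tokens},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model_id, messages, max_tokens, call_model):
        """Return a cached response, or call the model once and store the result."""
        cache_key = self.key(model_id, messages, max_tokens)
        if cache_key in self.store:
            self.hits += 1
            return self.store[cache_key]
        self.misses += 1
        result = call_model(model_id, messages, max_tokens)
        self.store[cache_key] = result
        return result
```

Including the model ID and generation parameters in the key prevents a response produced by one model or token limit from being served for a different configuration.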
Day-One Budget Alerts

Set up AWS Budgets alerts before deploying any GenAI workload. A misconfigured agent loop can generate thousands of API calls in minutes. Set alerts at 50%, 80%, and 100% of your expected monthly budget, and add a hard stop (Lambda + EventBridge) that disables resources if spending exceeds 150%.
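
The 50%, 80%, and 100% alerts can be created in one call to the AWS Budgets API. The sketch below builds the request pieces for the budgets client's create_budget call. The budget name, email address, and account ID passed in are placeholders you would supply, the helper names are hypothetical, and the 150% hard stop is not covered here.

```python
def budget_definition(name: str, monthly_limit_usd: float) -> dict:
    """The Budget argument for create_budget(): a monthly cost budget in USD."""
    return {
        "BudgetName": name,
        "BudgetLimit": {"Amount": str(monthly_limit_usd), "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    }


def budget_notifications(email: str, thresholds=(50, 80, 100)) -> list:
    """One email notification per threshold, as a percentage of the budget."""
    return [
        {
            "Notification": {
                "NotificationType": "ACTUAL",  # alert on actual, not forecasted, spend
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": float(percent),
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
        }
        for percent in thresholds
    ]


def create_genai_budget(account_id: str, name: str, monthly_limit_usd: float, email: str) -> None:
    """Create the budget and its alerts (requires AWS credentials)."""
    import boto3  # imported lazily so the builders above work without boto3

    boto3.client("budgets").create_budget(
        AccountId=account_id,
        Budget=budget_definition(name, monthly_limit_usd),
        NotificationsWithSubscribers=budget_notifications(email),
    )
```

Switching NotificationType to "FORECASTED" would alert earlier, when AWS projects the month will exceed a threshold rather than after it has.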


Interview Ready

How to Explain This in 2 Minutes

Elevator Pitch

AWS provides the full stack for deploying GenAI applications in production. Amazon Bedrock is the managed service for accessing foundation models like Claude, Llama, and Titan through a single API — you get enterprise governance (IAM, VPC, CloudTrail) without managing any ML infrastructure. For custom models, SageMaker handles training, fine-tuning, and hosting on GPU instances with auto-scaling. For cost-effective architectures, Lambda plus API Gateway plus Bedrock gives you a fully serverless GenAI backend that costs zero when idle. The data layer uses S3 for document storage, OpenSearch Serverless for vector embeddings in RAG pipelines, and DynamoDB for conversation history. Security follows least-privilege IAM policies scoped to specific model IDs, VPC endpoints to keep traffic private, and KMS encryption throughout. The biggest operational concern is cost — GPU instances and per-token charges add up fast, so you need budget alerts, auto-scaling policies, and model selection strategies (use Haiku before Sonnet) from day one.

Likely Interview Questions

| Question | What They're Really Asking |
|---|---|
| When would you use Bedrock vs. SageMaker for GenAI? | Do you understand that Bedrock is for managed model access with zero infrastructure, while SageMaker is for custom training, fine-tuning, and self-hosted inference with full control? |
| How would you build a serverless GenAI application on AWS? | Can you architect a Lambda + API Gateway + Bedrock pipeline, handle streaming responses, and explain the cost and latency tradeoffs compared to always-on instances? |
| How do you secure LLM access on AWS? | Can you design IAM policies scoped to specific model ARNs, set up VPC endpoints for private connectivity, and implement CloudTrail audit logging for compliance? |
| How would you implement RAG on AWS? | Do you know the managed path (Bedrock Knowledge Bases with OpenSearch Serverless) vs. the custom path (Aurora pgvector, custom chunking, SageMaker embeddings) and when to use each? |
| How do you control GenAI costs on AWS? | Can you identify the main cost drivers (per-token inference, GPU instance hours, vector DB minimums) and propose specific strategies like model tiering, prompt caching, provisioned throughput, and auto-scaling to zero? |

Model Answers

Bedrock vs. SageMaker: I use Bedrock when I need access to frontier models like Claude or Llama without managing infrastructure — it provides a pay-per-token API with built-in governance through IAM, VPC endpoints, and CloudTrail. I switch to SageMaker when I need to fine-tune a model on proprietary data, host an open-weight model with custom inference logic, or need features like inference components for multi-model hosting on shared GPU instances. In practice, many production systems use both: Bedrock for the primary LLM calls and SageMaker for custom embedding models or specialized fine-tuned models.

Serverless GenAI Architecture: For moderate-traffic applications, I architect with Lambda handling request processing and prompt construction, API Gateway providing the HTTP endpoint with API key authentication and rate limiting, and Bedrock for model inference. DynamoDB stores conversation history with TTL for automatic cleanup. The key advantage is zero cost during idle periods. For streaming responses, I use Lambda Function URLs with response streaming instead of the traditional API Gateway request-response pattern. The main constraint is Lambda's 15-minute timeout, but Bedrock calls typically complete in 2-15 seconds.

Cost Control Strategy: I implement cost controls at multiple levels. First, model tiering — route simple queries to Haiku or Titan and only escalate to Sonnet for complex tasks, saving 80-90% on those requests. Second, prompt caching with ElastiCache Redis to avoid recomputing identical prompts. Third, SageMaker auto-scaling with scale-to-zero for development endpoints. Fourth, AWS Budget alerts at 50%, 80%, and 100% thresholds with an automated kill switch via Lambda and EventBridge at 150%. For high-volume production, I evaluate Bedrock Provisioned Throughput which saves 30-50% over on-demand pricing above roughly $10,000 per month.

System Design Scenario

Design Prompt

Design a production RAG-powered customer support system on AWS that handles 50,000 queries per day, ingests 10,000 knowledge base documents from S3, and must respond within 3 seconds. The system needs multi-tenant isolation (each customer sees only their own documents), audit logging for compliance, and a total monthly budget of $5,000. Describe your architecture choices for model selection (Bedrock vs. SageMaker), vector storage, document processing pipeline, caching strategy, security boundaries between tenants, and how you would monitor quality and costs in production.

Common Mistakes

  • Leaving SageMaker endpoints running after experimentation: GPU instances like g5.2xlarge cost ~$1.50/hr (~$1,100/month). Always delete endpoints when not in use, automate cleanup of idle endpoints in non-production environments, and use auto-scaling with scale-to-zero for development workloads.
  • Using blanket IAM permissions for Bedrock: Granting bedrock:InvokeModel on * allows access to all models, including expensive ones like Claude Opus. Always scope IAM policies to specific model ARNs and use condition keys to restrict access by team or environment.
  • Skipping cost monitoring until the first bill arrives: GenAI workloads can generate surprising costs from misconfigured agent loops, unexpected traffic spikes, or developers experimenting with large models. Set up AWS Budget alerts and CloudWatch alarms for Bedrock spend on day one, before deploying any workload.