Cloud Cost Management
Financial governance is essential in the cloud because the pay-as-you-go model can lead to unexpected costs if resources are not managed properly. Google Cloud provides multiple tools and strategies for controlling and optimizing cloud spending.
Cost Optimization Strategies
Right-Sizing
Match VM machine types to actual workload needs. Compute Engine generates right-sizing recommendations from the usage metrics that Cloud Monitoring collects, so you can avoid paying for idle CPU and memory.
Sustained Use Discounts
Automatic discounts (up to 30%, depending on the machine series) for eligible VMs that run for more than 25% of the billing month. No commitment required. Applied to Compute Engine and GKE nodes.
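As a rough illustration of how the discount accrues, the sketch below assumes the incremental tiers published for N1 machine types, where each successive quarter of the month is billed at 100%, 80%, 60%, and 40% of the base rate. Other machine series use different tiers or are not eligible, and the function name is hypothetical.

```python
# Assumed incremental sustained-use tiers for N1 machine types:
# (upper bound of the usage fraction, multiplier applied to the base rate).
N1_TIERS = [(0.25, 1.0), (0.50, 0.8), (0.75, 0.6), (1.00, 0.4)]


def effective_discount(usage_fraction: float) -> float:
    """Overall discount (0..1) for a VM that ran for `usage_fraction`
    of the month, under the assumed tiers."""
    if not 0 < usage_fraction <= 1:
        raise ValueError("usage_fraction must be in (0, 1]")
    billed = 0.0  # usage weighted by each tier's multiplier
    lower = 0.0
    for upper, multiplier in N1_TIERS:
        if usage_fraction <= lower:
            break
        span = min(usage_fraction, upper) - lower
        billed += span * multiplier
        lower = upper
    return 1 - billed / usage_fraction


for fraction in (0.25, 0.50, 0.75, 1.00):
    print(f"{fraction:.0%} of month: {effective_discount(fraction):.0%} discount")
```

Under these assumptions the discount only starts after the first quarter of the month and reaches the 30% ceiling for a VM that runs the whole month.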
Committed Use Discounts
One- or three-year commitments for discounts of up to 57% on most compute resources. Best for predictable, steady-state workloads. You commit to a quantity of vCPUs and memory, not to specific machine types.
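Why commitments suit steady workloads can be seen with a simplified comparison. The sketch below normalizes the on-demand price of one unit of capacity for a full month to 1.0 and ignores sustained use discounts; the function name and the utilization figures are illustrative assumptions.

```python
def cheaper_option(utilization: float, cud_discount: float) -> str:
    """Compare on-demand and committed cost for one unit of capacity.

    utilization:  fraction of the month the capacity is actually used (0..1).
    cud_discount: committed use discount as a fraction (e.g. 0.57).
    """
    on_demand_cost = utilization         # pay only for the hours used
    committed_cost = 1.0 - cud_discount  # pay for every hour, at a discount
    return "commit" if committed_cost < on_demand_cost else "on-demand"


print(cheaper_option(utilization=0.95, cud_discount=0.57))
print(cheaper_option(utilization=0.30, cud_discount=0.57))
```

In this simplified model the break-even point is a utilization of `1 - cud_discount`: below it, the commitment costs more than paying on demand.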
Spot / Preemptible VMs
Discounts of 60-91% for interruptible workloads. Ideal for batch processing, CI/CD, and fault-tolerant jobs. Google can reclaim the VM at any time, with 30 seconds' notice.
Autoscaling
Scale resources dynamically based on demand. Scale down during off-peak hours. Serverless services (Cloud Run, BigQuery) scale to zero when idle.
Lifecycle Policies
Automatically transition Cloud Storage objects to cheaper classes (Nearline, Coldline, Archive) or delete old data. Reduces long-term storage costs.
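A lifecycle policy is a list of rules, each pairing an action with a condition. The sketch below builds one in the JSON layout that, to my understanding, `gcloud storage buckets update --lifecycle-file` accepts; the age thresholds and the helper name are illustrative assumptions.

```python
import json


def lifecycle_config(nearline_days: int, coldline_days: int, delete_days: int) -> dict:
    """Build a lifecycle configuration that moves objects to cheaper
    storage classes as they age and finally deletes them."""
    if not nearline_days < coldline_days < delete_days:
        raise ValueError("ages must increase: nearline < coldline < delete")
    return {
        "rule": [
            {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
             "condition": {"age": nearline_days}},
            {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
             "condition": {"age": coldline_days}},
            {"action": {"type": "Delete"},
             "condition": {"age": delete_days}},
        ]
    }


# Example thresholds (assumed): Nearline at 30 days, Coldline at 90, delete at 365.
config = lifecycle_config(nearline_days=30, coldline_days=90, delete_days=365)
print(json.dumps(config, indent=2))
```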
The exam often asks "how to reduce cloud costs." Key answers: right-size VMs, use committed/sustained use discounts, use serverless (scale to zero), use Spot VMs for batch, set storage lifecycle rules, set budget alerts.
Billing and Pricing
Google Cloud billing is organized around billing accounts that are linked to projects. Understanding the billing structure is important both for cost management and for the CDL exam.
Billing Structure
| Component | Description |
|---|---|
| Billing Account | Payment instrument linked to one or more projects. Contains payment method, billing contacts, and invoices. |
| Project | Resource container and billing boundary. Each project is linked to at most one billing account at a time. All resource costs are charged to the project's billing account. |
| Budgets & Alerts | Set spending thresholds (actual or forecasted) that trigger email/Pub/Sub notifications. Alerts do NOT stop spending by default. |
| Billing Export | Export detailed billing data to BigQuery for analysis, or to Cloud Storage for archival. |
| Labels | Key-value tags applied to resources. Used to organize and filter billing reports (e.g., team:frontend, env:prod). |
Budget alerts are notifications only — they do NOT automatically stop or cap spending. To enforce spending limits, you must set up programmatic responses using Pub/Sub notifications + Cloud Functions to disable billing or shut down resources.
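A programmatic response starts by decoding the budget notification that arrives on the Pub/Sub topic. The sketch below assumes the payload is base64-encoded JSON carrying `costAmount` and `budgetAmount` fields; the handler names are hypothetical, and the actual call that disables billing or stops resources is left as a stub.

```python
import base64
import json


def should_cap_spending(pubsub_data_b64: str) -> bool:
    """Decode a budget notification and report whether the actual cost
    has reached the budgeted amount."""
    payload = json.loads(base64.b64decode(pubsub_data_b64).decode("utf-8"))
    return payload["costAmount"] >= payload["budgetAmount"]


def handle_budget_message(pubsub_data_b64: str) -> str:
    """Return the action a function would take for this notification."""
    if should_cap_spending(pubsub_data_b64):
        # Hypothetical hook: a real function would call the Cloud Billing API
        # here to unlink the project, or stop specific resources.
        return "cap"
    return "ok"


# Simulated notification; the field values are made up.
sample = base64.b64encode(json.dumps({
    "budgetDisplayName": "Monthly Budget",
    "costAmount": 1040.0,
    "budgetAmount": 1000.0,
    "currencyCode": "USD",
}).encode("utf-8")).decode("ascii")

print(handle_budget_message(sample))
```

Keeping the decision logic in a pure function makes it testable without deploying anything. Disabling billing shuts down all billable resources in the project, so this kind of automation is usually reserved for non-production projects.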
Google Cloud Pricing Principles
- Pay-as-you-go: No upfront costs. Pay only for what you use.
- Per-second billing: Compute Engine bills per second, with a one-minute minimum.
- No termination fees: Stop using a service at any time without penalty. Committed use discounts are the exception: you pay for the commitment for its full term.
- Custom machine types: Choose the exact number of vCPUs and amount of memory to avoid paying for unused resources.
- Free tier: Many services include a free usage tier (e.g., BigQuery 1 TiB/month of query processing, Cloud Functions 2M invocations/month).
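The per-second rule with a one-minute minimum can be made concrete with a small calculation. The hourly rate below is a made-up figure, and the helper names are hypothetical.

```python
def billed_seconds(runtime_seconds: int) -> int:
    """Seconds billed for one VM run: per-second, with a 60-second minimum."""
    if runtime_seconds <= 0:
        return 0
    return max(60, runtime_seconds)


def run_cost(runtime_seconds: int, hourly_rate: float) -> float:
    """Cost of one run at the given hourly rate."""
    return billed_seconds(runtime_seconds) * hourly_rate / 3600


HOURLY_RATE = 0.10  # assumed price in USD per hour, for illustration only
for seconds in (20, 60, 61, 3600):
    print(f"{seconds:>5}s run: billed {billed_seconds(seconds)}s, "
          f"cost ${run_cost(seconds, HOURLY_RATE):.4f}")
```

A 20-second run is billed as 60 seconds, while a 61-second run is billed as exactly 61 seconds.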
# List billing accounts
gcloud billing accounts list
# Link a project to a billing account
gcloud billing projects link my-project \
  --billing-account=0X0X0X-0X0X0X-0X0X0X
# Create a budget alert
gcloud billing budgets create \
  --billing-account=0X0X0X-0X0X0X-0X0X0X \
  --display-name="Monthly Budget" \
  --budget-amount=1000USD \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.9 \
  --threshold-rule=percent=1.0
# Export billing to BigQuery
# The export itself is enabled in the Cloud console (Billing > Billing export);
# at the time of writing there is no gcloud command for it.
# Create the destination dataset first:
bq mk --dataset my-project:billing_export
Resource Hierarchy
Google Cloud organizes resources in a hierarchy that controls access (IAM), policies, and billing. Understanding this hierarchy is critical for both governance and the exam.
[ Organization ] (example.com)
|
|-- [ Folder: Engineering ]
| |-- [ Project: web-prod ]
| |-- [ Project: web-staging ]
| |-- [ Folder: ML Team ]
| |-- [ Project: ml-training ]
| |-- [ Project: ml-serving ]
|
|-- [ Folder: Finance ]
|-- [ Project: finance-prod ]
|-- [ Project: finance-analytics ]
| Level | Purpose | Key Properties |
|---|---|---|
| Organization | Root node, maps to your domain | Created with Cloud Identity or Google Workspace. Org-level policies and IAM apply to everything. |
| Folder | Group projects by team, department, or environment | Can be nested up to 10 levels. IAM and policies inherit to child folders and projects. |
| Project | Container for resources and billing | Has a unique project ID (globally unique), project name, and project number. Linked to one billing account. |
| Resource | Individual GCP services | VMs, buckets, datasets, etc. Belong to exactly one project. |
The project is the fundamental unit of Google Cloud. It is both the billing boundary and the resource container. Every API call is made in the context of a project. Best practice: separate projects for different environments (dev, staging, prod) and different teams.
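Inheritance down the hierarchy can be modeled as a walk from a node up to the organization, collecting role bindings along the way. This is a simplified sketch, not how IAM is implemented: the `Node` class, the group names, and the roles chosen are illustrative assumptions, and deny policies are ignored.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """An organization, folder, or project in the resource hierarchy."""
    name: str
    parent: Optional["Node"] = None
    bindings: set = field(default_factory=set)  # {(member, role), ...}


def effective_bindings(node: Node) -> set:
    """Union of the bindings set on the node and on all of its ancestors."""
    result = set()
    current = node
    while current is not None:
        result |= current.bindings
        current = current.parent
    return result


org = Node("example.com", bindings={("group:admins@example.com", "roles/viewer")})
engineering = Node("Engineering", parent=org,
                   bindings={("group:eng@example.com", "roles/editor")})
finance = Node("Finance", parent=org)
web_prod = Node("web-prod", parent=engineering)
finance_prod = Node("finance-prod", parent=finance)

print(sorted(effective_bindings(web_prod)))
print(sorted(effective_bindings(finance_prod)))
```

The binding set on the Engineering folder reaches `web-prod` but not `finance-prod`, while the organization-level binding reaches both.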
Site Reliability Engineering (SRE)
SRE is Google's approach to operations and reliability, often summarized as "what happens when you ask a software engineer to design an operations team." SRE concepts are heavily tested on the CDL exam because Google Cloud applies these principles throughout its services.
Core SRE Concepts
| Concept | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A quantitative measure of some aspect of service health | Request latency (e.g., 99th percentile), availability (% of successful requests) |
| SLO (Service Level Objective) | Target value for an SLI over a time period | "99.9% of requests succeed within 200ms over 30 days" |
| SLA (Service Level Agreement) | Business contract with consequences for missing SLO | "If availability drops below 99.9%, customer receives service credits" |
| Error Budget | Allowed amount of unreliability (100% - SLO) | 99.9% SLO = 0.1% error budget = ~43 min/month of downtime allowed |
| Toil | Repetitive, manual, automatable operational work | Manual deployments, ticket-driven restarts, manual scaling |
SLI = what you measure (latency, availability). SLO = internal target (99.9%). SLA = external contract with penalties. SLOs should be stricter than SLAs, so that the team is warned before a contractual breach. Error budget = the difference between 100% and the SLO; spend it on feature velocity.
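The arithmetic behind an error budget is simple enough to check by hand. The sketch below converts an availability SLO into allowed downtime over a window, assuming a 30-day window by default; the function name is hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime permitted by an availability SLO
    over a window of `window_days` days."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


for slo in (0.99, 0.999, 0.9999):
    print(f"SLO {slo}: {error_budget_minutes(slo):.1f} minutes per 30 days")
```

Each additional nine cuts the budget by a factor of ten, which is why a tighter SLO is a cost decision as much as a reliability one.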
Error Budget Philosophy
The error budget is a powerful concept: it quantifies exactly how much unreliability a service can tolerate. If the error budget is healthy (few errors), teams can push new features faster. If the error budget is depleted, teams must focus on reliability before shipping new features.
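One way to picture this policy is as a check on how much of the budget is left, here measured in failed requests rather than minutes. The decision labels and function names are illustrative assumptions, not a prescribed Google process.

```python
def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


def release_decision(slo: float, total_requests: int, failed_requests: int) -> str:
    """Ship while budget remains; otherwise prioritize reliability work."""
    remaining = budget_remaining(slo, total_requests, failed_requests)
    return "ship features" if remaining > 0 else "focus on reliability"


# With a 99.9% SLO, 1,000,000 requests allow about 1,000 failures.
print(release_decision(0.999, 1_000_000, 400))
print(release_decision(0.999, 1_000_000, 1_200))
```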
Blameless Postmortems
After incidents, focus on systemic causes and process improvements, not individual blame. Document what happened, why, and how to prevent recurrence.
Toil Reduction
Automate repetitive operational tasks. SRE teams should spend <50% of time on toil and >50% on engineering work that improves reliability and automation.
Monitoring & Alerting
Monitor SLIs, alert on SLO violations (not individual errors). Use symptom-based alerting (user-facing problems) rather than cause-based alerting (internal metrics).
DevOps Practices
DevOps is a set of practices that combines software development (Dev) and IT operations (Ops) to shorten the development lifecycle and deliver high-quality software continuously. Google Cloud provides tools that support key DevOps practices.
DORA Metrics (DevOps Research and Assessment)
Google's DORA team identified four key metrics that measure DevOps performance:
| Metric | What It Measures | Elite Performance |
|---|---|---|
| Deployment Frequency | How often code is deployed to production | Multiple times per day |
| Lead Time for Changes | Time from code commit to running in production | Less than 1 hour |
| Change Failure Rate | % of deployments that cause a failure | 0-15% |
| Time to Restore | Time to recover from a failure | Less than 1 hour |
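The four metrics in the table can be derived from a plain deployment log. The sketch below is one possible way to compute them, assuming each record carries commit and deploy timestamps plus a failure flag; the `Deployment` record and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only for failed deployments


def dora_metrics(deployments: list, window_days: int) -> dict:
    """Compute the four DORA metrics over a window of `window_days` days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at
                     for d in failures if d.restored_at is not None]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


start = datetime(2024, 1, 1, 9, 0)
history = [
    Deployment(start, start + timedelta(minutes=30)),
    Deployment(start + timedelta(days=1), start + timedelta(days=1, minutes=45),
               failed=True, restored_at=start + timedelta(days=1, minutes=65)),
    Deployment(start + timedelta(days=2), start + timedelta(days=2, minutes=20)),
]
metrics = dora_metrics(history, window_days=3)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Medians are used here instead of means so that a single slow deployment does not dominate the result; that is a design choice of this sketch.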
Google Cloud DevOps Tools
Cloud Source Repositories
Private Git repositories hosted on Google Cloud. Integrates with Cloud Build so that a push to a branch or tag can trigger a CI/CD build.
Cloud Build
Serverless CI/CD platform. Build, test, and deploy containers. Supports Docker, Buildpacks, and custom build steps. Pay per build-minute.
Artifact Registry
Store and manage container images, language packages (npm, Maven, Python), and OS packages. Vulnerability scanning built-in.
Cloud Deploy
Managed continuous delivery to GKE, Cloud Run, and Anthos. Promotion-based pipeline with approval gates and rollback support.
# Build from source code and push the image to Artifact Registry,
# which supersedes Container Registry (gcr.io).
# my-images is a placeholder for an existing Docker repository.
gcloud builds submit --tag us-central1-docker.pkg.dev/my-project/my-images/my-app .
# Set up a build trigger on a GitHub repo
gcloud builds triggers create github \
  --repo-name=my-repo \
  --repo-owner=my-org \
  --branch-pattern="^main$" \
  --build-config=cloudbuild.yaml
# Deploy to Cloud Run
gcloud run deploy my-service \
  --image=us-central1-docker.pkg.dev/my-project/my-images/my-app \
  --region=us-central1 \
  --allow-unauthenticated
Cloud Monitoring and Logging
Google Cloud's operations suite (formerly Stackdriver) provides integrated monitoring, logging, error reporting, and tracing for applications running on Google Cloud and beyond.
| Service | Purpose | Key Features |
|---|---|---|
| Cloud Monitoring | Metrics, dashboards, alerting | 1500+ built-in metrics, custom metrics, uptime checks, alerting policies, SLO monitoring |
| Cloud Logging | Log management | Centralized logs from all GCP services, log-based metrics, log routing to BigQuery/Storage/Pub/Sub |
| Cloud Trace | Distributed tracing | Track request latency across microservices, identify bottlenecks, latency distributions |
| Error Reporting | Error aggregation | Automatically group and count errors, link to source code, alert on new errors |
| Cloud Profiler | Production profiling | Continuously profile CPU and memory in production with minimal overhead (<5%) |
Alerting Best Practices
- Alert on symptoms, not causes — Alert when users are affected (high latency, errors), not when a single metric changes.
- Use multi-condition policies — Combine conditions to reduce false alarms (e.g., high error rate AND high traffic).
- Set appropriate thresholds — Too sensitive = alert fatigue. Too loose = missed incidents.
- Route to the right team — Use notification channels (email, Slack, PagerDuty, Pub/Sub) per severity.
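The multi-condition idea from the list above can be sketched as a predicate that fires only when both conditions hold. The thresholds and the function name are illustrative assumptions; in practice the conditions would be defined in a Cloud Monitoring alerting policy rather than in application code.

```python
def should_alert(error_rate: float, requests_per_minute: float,
                 max_error_rate: float = 0.01,
                 min_traffic: float = 100.0) -> bool:
    """Fire only when the error rate is high AND there is enough traffic
    for that rate to be meaningful."""
    return error_rate > max_error_rate and requests_per_minute >= min_traffic


# One failure out of two requests at 3 a.m. is a 50% error rate, but not an incident.
print(should_alert(error_rate=0.50, requests_per_minute=2))
# A 5% error rate under real load is.
print(should_alert(error_rate=0.05, requests_per_minute=5000))
```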
Know the difference between Cloud Monitoring (metrics and dashboards) and Cloud Logging (log data). Monitoring answers "what is happening now?" Logging answers "what happened and why?" Trace answers "where is the bottleneck in this request?"
Google Cloud Sustainability
Google has been carbon neutral since 2007 and aims to run on 24/7 carbon-free energy by 2030. The CDL exam includes questions about Google's environmental commitments because sustainability is a key decision factor for many organizations choosing a cloud provider.
Google's Sustainability Commitments
- Carbon neutral since 2007: Achieved initially through carbon offsets. Since 2017, Google has also matched 100% of its annual electricity use with renewable energy purchases.
- 24/7 CFE by 2030: Goal to run every data center on carbon-free energy every hour of every day.
- PUE of about 1.1: Power Usage Effectiveness is total facility energy divided by IT equipment energy, so lower is better. The industry average is ~1.6. Google's data centers are among the most energy-efficient in the world.
- Circular economy: Reuse and recycle server components. Custom-designed hardware reduces waste.
- Water stewardship: Goal to be water-positive by 2030 (replenish more freshwater than consumed).
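To see what the PUE figures in the list mean in energy terms: PUE is total facility energy divided by the energy used by IT equipment, so everything above 1.0 is overhead such as cooling and power distribution. The sketch below applies that definition to a made-up IT load.

```python
def facility_energy_kwh(it_energy_kwh: float, pue: float) -> float:
    """Total facility energy implied by PUE = total energy / IT energy."""
    if pue < 1:
        raise ValueError("PUE cannot be below 1.0")
    return it_energy_kwh * pue


IT_LOAD_KWH = 1000.0  # assumed IT equipment energy, for illustration only
for pue in (1.1, 1.6):
    total = facility_energy_kwh(IT_LOAD_KWH, pue)
    print(f"PUE {pue}: {total:.0f} kWh total, {total - IT_LOAD_KWH:.0f} kWh overhead")
```

For the same IT load, a facility at PUE 1.6 spends six times as much energy on overhead as one at PUE 1.1.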
Customer Sustainability Tools
Carbon Footprint Dashboard
View your workloads' carbon emissions in the Cloud Console. Track gross and net emissions by project, service, and region over time.
Region Carbon Data
Google publishes carbon-free energy percentage per region. Choose low-carbon regions (e.g., Oregon, Finland, Iowa) for environmentally-conscious deployments.
Active Assist
AI-powered recommendations for idle resources, right-sizing, and committed use. Reducing waste saves both money and carbon.
Moving from on-premises data centers to Google Cloud can reduce carbon emissions by an estimated 5-10x due to Google's efficient infrastructure, renewable energy, and custom hardware. The exam may frame this as a benefit of cloud migration.
Exam Tips
Section 6 (~17%) covers cost optimization, SRE concepts (SLI/SLO/SLA/error budget), and DevOps practices. Know the resource hierarchy (Organization → Folder → Project), how billing accounts link to projects, how budget alerts work, DORA metrics, and Google's sustainability claims.
Quick Reference
- "How to control cloud costs?" → Budgets, alerts, right-sizing, CUDs, Spot VMs, autoscaling
- "What measures service reliability?" → SLIs (metrics), SLOs (targets), SLAs (contracts)
- "What is error budget?" → 100% - SLO = allowed unreliability. Spend it on innovation.
- "How to measure DevOps performance?" → DORA metrics (deploy frequency, lead time, failure rate, restore time)
- "Where to see billing data for analysis?" → Export billing to BigQuery
- "How to organize GCP resources?" → Organization → Folders → Projects
- "What is Google's carbon goal?" → 24/7 carbon-free energy by 2030
- "Budget alerts stop spending?" → No! Alerts are notifications only. Use Cloud Functions + Pub/Sub to automate shutdown.