Scaling with Google Cloud Operations

1

Cloud Cost Management

Financial governance is essential in the cloud because the pay-as-you-go model can lead to unexpected costs if resources are not managed properly. Google Cloud provides multiple tools and strategies for controlling and optimizing cloud spending.

Cost Optimization Strategies

R

Right-Sizing

Match VM machine types to actual workload needs. Google recommends right-sizing based on Cloud Monitoring metrics. Avoid paying for idle CPU/memory.

S

Sustained Use Discounts

Automatic discounts (up to 30%) for VMs running more than 25% of a month. No commitment required. Applied to Compute Engine and GKE nodes.

C

Committed Use Discounts

1 or 3 year commitments for up to 57% discount on compute. Best for predictable, steady-state workloads. Commit to vCPUs and memory, not specific machine types.

P

Spot / Preemptible VMs

60-91% discount for interruptible workloads. Ideal for batch processing, CI/CD, and fault-tolerant jobs. Google can reclaim with 30s notice.

A

Autoscaling

Scale resources dynamically based on demand. Scale down during off-peak hours. Serverless services (Cloud Run, BigQuery) scale to zero when idle.

L

Lifecycle Policies

Automatically transition Cloud Storage objects to cheaper classes (Nearline, Coldline, Archive) or delete old data. Reduces long-term storage costs.

Exam Tip

The exam often asks "how to reduce cloud costs." Key answers: right-size VMs, use committed/sustained use discounts, use serverless (scale to zero), use Spot VMs for batch, set storage lifecycle rules, set budget alerts.

2

Billing and Pricing

Google Cloud billing is organized around billing accounts that are linked to projects. Understanding the billing structure is important for both cost management and for the CDL exam.

Billing Structure

Component	Description
Billing Account	Payment instrument linked to one or more projects. Contains payment method, billing contacts, and invoices.
Project	Resource container and billing boundary. Each project is linked to exactly one billing account. All resource costs are charged to the project's billing account.
Budgets & Alerts	Set spending thresholds (actual or forecasted) that trigger email/Pub/Sub notifications. Alerts do NOT stop spending by default.
Billing Export	Export detailed billing data to BigQuery for analysis, or to Cloud Storage for archival.
Labels	Key-value tags applied to resources. Used to organize and filter billing reports (e.g., team:frontend, env:prod).

Important

Budget alerts are notifications only — they do NOT automatically stop or cap spending. To enforce spending limits, you must set up programmatic responses using Pub/Sub notifications + Cloud Functions to disable billing or shut down resources.

Google Cloud Pricing Principles

Pay-as-you-go — No upfront costs. Pay only for what you use.
Per-second billing — Compute Engine bills per second (minimum 1 minute).
No termination fees — Stop using a service anytime without penalties (except CUDs).
Custom machine types — Choose exact vCPU and memory ratios to avoid paying for unused resources.
Free tier — Many services include a free usage tier (e.g., BigQuery 1 TB/month queries, Cloud Functions 2M invocations/month).

# List billing accounts
gcloud billing accounts list

# Link a project to a billing account
gcloud billing projects link my-project \
  --billing-account=0X0X0X-0X0X0X-0X0X0X

# Create a budget alert
gcloud billing budgets create \
  --billing-account=0X0X0X-0X0X0X-0X0X0X \
  --display-name="Monthly Budget" \
  --budget-amount=1000USD \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.9 \
  --threshold-rule=percent=1.0

# Export billing to BigQuery
gcloud billing export bigquery enable \
  --billing-account=0X0X0X-0X0X0X-0X0X0X \
  --project=my-project \
  --dataset=billing_export

3

Resource Hierarchy

Google Cloud organizes resources in a hierarchy that controls access (IAM), policies, and billing. Understanding this hierarchy is critical for both governance and the exam.


  [ Organization ]  (example.com)
       |
       |-- [ Folder: Engineering ]
       |       |-- [ Project: web-prod ]
       |       |-- [ Project: web-staging ]
       |       |-- [ Folder: ML Team ]
       |              |-- [ Project: ml-training ]
       |              |-- [ Project: ml-serving ]
       |
       |-- [ Folder: Finance ]
               |-- [ Project: finance-prod ]
               |-- [ Project: finance-analytics ]

Resource hierarchy: Organization → Folders → Projects → Resources

Level	Purpose	Key Properties
Organization	Root node, maps to your domain	Created with Cloud Identity or Google Workspace. Org-level policies and IAM apply to everything.
Folder	Group projects by team, department, or environment	Can be nested up to 10 levels. IAM and policies inherit to child folders and projects.
Project	Container for resources and billing	Has a unique project ID (globally unique), project name, and project number. Linked to one billing account.
Resource	Individual GCP services	VMs, buckets, datasets, etc. Belong to exactly one project.

Key Concept

The project is the fundamental unit of Google Cloud. It is both the billing boundary and the resource container. Every API call is made in the context of a project. Best practice: separate projects for different environments (dev, staging, prod) and different teams.

4

Site Reliability Engineering (SRE)

SRE is Google's approach to operations and reliability, defined as "what happens when you ask a software engineer to design an operations function." SRE concepts are heavily tested on the CDL exam because Google Cloud applies these principles throughout its services.

Core SRE Concepts

Concept	Definition	Example
SLI (Service Level Indicator)	A measurable metric that defines service health	Request latency (p99 < 200ms), availability (% of successful requests)
SLO (Service Level Objective)	Target value for an SLI over a time period	"99.9% of requests succeed within 200ms over 30 days"
SLA (Service Level Agreement)	Business contract with consequences for missing SLO	"If availability drops below 99.9%, customer receives service credits"
Error Budget	Allowed amount of unreliability (100% - SLO)	99.9% SLO = 0.1% error budget = ~43 min/month of downtime allowed
Toil	Repetitive, manual, automatable operational work	Manual deployments, ticket-driven restarts, manual scaling

SLI vs SLO vs SLA

SLI = what you measure (latency, availability). SLO = internal target (99.9%). SLA = external contract with penalties. SLOs are always tighter than SLAs. Error budget = the difference between 100% and SLO — spend it on feature velocity.

Error Budget Philosophy

The error budget is a powerful concept: it quantifies exactly how much unreliability a service can tolerate. If the error budget is healthy (few errors), teams can push new features faster. If the error budget is depleted, teams must focus on reliability before shipping new features.

B

Blameless Postmortems

After incidents, focus on systemic causes and process improvements, not individual blame. Document what happened, why, and how to prevent recurrence.

T

Toil Reduction

Automate repetitive operational tasks. SRE teams should spend <50% of time on toil and >50% on engineering work that improves reliability and automation.

M

Monitoring & Alerting

Monitor SLIs, alert on SLO violations (not individual errors). Use symptom-based alerting (user-facing problems) rather than cause-based alerting (internal metrics).

5

DevOps Practices

DevOps is a set of practices that combines software development (Dev) and IT operations (Ops) to shorten the development lifecycle and deliver high-quality software continuously. Google Cloud provides tools that support key DevOps practices.

DORA Metrics (DevOps Research and Assessment)

Google's DORA team identified four key metrics that measure DevOps performance:

Metric	What It Measures	Elite Performance
Deployment Frequency	How often code is deployed to production	Multiple times per day
Lead Time for Changes	Time from code commit to running in production	Less than 1 hour
Change Failure Rate	% of deployments that cause a failure	0-15%
Time to Restore	Time to recover from a failure	Less than 1 hour

Google Cloud DevOps Tools

R

Cloud Source Repositories

Private Git repositories hosted on Google Cloud. Integrates with Cloud Build for CI/CD triggers on push or PR.

B

Cloud Build

Serverless CI/CD platform. Build, test, and deploy containers. Supports Docker, Buildpacks, and custom build steps. Pay per build-minute.

A

Artifact Registry

Store and manage container images, language packages (npm, Maven, Python), and OS packages. Vulnerability scanning built-in.

D

Cloud Deploy

Managed continuous delivery to GKE, Cloud Run, and Anthos. Promotion-based pipeline with approval gates and rollback support.

# Trigger a Cloud Build from source code
gcloud builds submit --tag gcr.io/my-project/my-app .

# Set up a build trigger on a Git repo
gcloud builds triggers create github \
  --repo-name=my-repo \
  --repo-owner=my-org \
  --branch-pattern="^main$" \
  --build-config=cloudbuild.yaml

# Deploy to Cloud Run
gcloud run deploy my-service \
  --image=gcr.io/my-project/my-app \
  --region=us-central1 \
  --allow-unauthenticated

6

Cloud Monitoring and Logging

Google Cloud's operations suite (formerly Stackdriver) provides integrated monitoring, logging, error reporting, and tracing for applications running on Google Cloud and beyond.

Service	Purpose	Key Features
Cloud Monitoring	Metrics, dashboards, alerting	1500+ built-in metrics, custom metrics, uptime checks, alerting policies, SLO monitoring
Cloud Logging	Log management	Centralized logs from all GCP services, log-based metrics, log routing to BigQuery/Storage/Pub/Sub
Cloud Trace	Distributed tracing	Track request latency across microservices, identify bottlenecks, latency distributions
Error Reporting	Error aggregation	Automatically group and count errors, link to source code, alert on new errors
Cloud Profiler	Production profiling	Continuously profile CPU and memory in production with minimal overhead (<5%)

Alerting Best Practices

Alert on symptoms, not causes — Alert when users are affected (high latency, errors), not when a single metric changes.
Use multi-condition policies — Combine conditions to reduce false alarms (e.g., high error rate AND high traffic).
Set appropriate thresholds — Too sensitive = alert fatigue. Too loose = missed incidents.
Route to the right team — Use notification channels (email, Slack, PagerDuty, Pub/Sub) per severity.

Exam Tip

Know the difference between Cloud Monitoring (metrics and dashboards) and Cloud Logging (log data). Monitoring answers "what is happening now?" Logging answers "what happened and why?" Trace answers "where is the bottleneck in this request?"

7

Google Cloud Sustainability

Google has been carbon neutral since 2007 and aims to run on 24/7 carbon-free energy by 2030. The CDL exam includes questions about Google's environmental commitments because sustainability is a key decision factor for many organizations choosing a cloud provider.

Google's Sustainability Commitments

Carbon neutral since 2007 — Google matches 100% of electricity with renewable energy purchases.
24/7 CFE by 2030 — Goal to run every data center on carbon-free energy every hour of every day.
1.1 PUE — Power Usage Effectiveness. Industry average is ~1.6. Google's data centers are among the most energy-efficient in the world.
Circular economy — Reuse and recycle server components. Custom-designed hardware reduces waste.
Water stewardship — Water-positive by 2030 (replenish more freshwater than consumed).

Customer Sustainability Tools

C

Carbon Footprint Dashboard

View your workloads' carbon emissions in the Cloud Console. Track gross and net emissions by project, service, and region over time.

R

Region Carbon Data

Google publishes carbon-free energy percentage per region. Choose low-carbon regions (e.g., Oregon, Finland, Iowa) for environmentally-conscious deployments.

A

Active Assist

AI-powered recommendations for idle resources, right-sizing, and committed use. Reducing waste saves both money and carbon.

Key Concept

Moving from on-premises data centers to Google Cloud can reduce carbon emissions by an estimated 5-10x due to Google's efficient infrastructure, renewable energy, and custom hardware. The exam may frame this as a benefit of cloud migration.

8

Exam Tips

Exam Strategy

Section 6 (~17%) covers cost optimization, SRE concepts (SLI/SLO/SLA/error budget), and DevOps practices. Know the billing hierarchy (Organization → Folder → Project), how budget alerts work, DORA metrics, and Google's sustainability claims.

Quick Reference

"How to control cloud costs?" → Budgets, alerts, right-sizing, CUDs, Spot VMs, autoscaling
"What measures service reliability?" → SLIs (metrics), SLOs (targets), SLAs (contracts)
"What is error budget?" → 100% - SLO = allowed unreliability. Spend it on innovation.
"How to measure DevOps performance?" → DORA metrics (deploy frequency, lead time, failure rate, restore time)
"Where to see billing data for analysis?" → Export billing to BigQuery
"How to organize GCP resources?" → Organization → Folders → Projects
"What is Google's carbon goal?" → 24/7 carbon-free energy by 2030
"Budget alerts stop spending?" → No! Alerts are notifications only. Use Cloud Functions + Pub/Sub to automate shutdown.