Cloud Monitoring
Metrics and Dashboards
Cloud Monitoring automatically collects system metrics from Google Cloud services. You can also send custom metrics via the Monitoring API or OpenTelemetry.
| Metric Type | Source | Examples | Retention |
|---|---|---|---|
| GCP System Metrics | Automatic from GCP services | CPU utilization, disk IOPS, LB latency | 24 months |
| Custom Metrics | Application code via API/OTEL | Business KPIs, queue depth, cache hit rate | 24 months |
| Agent Metrics | Ops Agent on VMs | Memory, swap, disk, processes | 24 months |
| External Metrics | Prometheus, Datadog, etc. | Third-party application metrics | Varies |
# Install Ops Agent on a Compute Engine VM
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install
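# The agent's collection can be extended in /etc/google-cloud-ops-agent/config.yaml;
# a minimal sketch that also tails an application log (receiver name and path are assumptions):
#   logging:
#     receivers:
#       myapp_logs:
#         type: files
#         include_paths:
#           - /var/log/myapp/*.log
#     service:
#       pipelines:
#         myapp_pipeline:
#           receivers: [myapp_logs]
sudo systemctl restart google-cloud-ops-agent   # apply the new configuration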
# Define a custom metric descriptor via the Monitoring API (descriptors are also
# auto-created the first time a point is written to a custom.googleapis.com/... type)
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"type": "custom.googleapis.com/myapp/queue_depth", "metricKind": "GAUGE",
       "valueType": "INT64", "description": "Number of items in the processing queue"}' \
  "https://monitoring.googleapis.com/v3/projects/my-project/metricDescriptors"
# Create a dashboard from a JSON definition
gcloud monitoring dashboards create \
--config-from-file=dashboard.json
Alerting Policies
Alerting policies define conditions, notification channels, and documentation for automated incident detection.
# Create an alerting policy for high CPU (the policies group is in the alpha/beta release track)
gcloud alpha monitoring policies create \
  --display-name="High CPU Alert" \
  --combiner=OR \
  --condition-display-name="CPU > 80% for 5 min" \
  --condition-filter='metric.type="compute.googleapis.com/instance/cpu/utilization"
    AND resource.type="gce_instance"' \
  --if='> 0.8' \
  --duration=300s \
  --notification-channels=projects/my-project/notificationChannels/12345 \
  --documentation="Investigate high CPU. Check for runaway processes or traffic spike."
Uptime Checks
Uptime checks probe endpoints from multiple global locations. They detect outages before users report them. Uptime checks can verify HTTP response codes, response body content, and SSL certificate expiration.
Combine uptime checks with SLO monitoring. An uptime check tells you if the service is reachable. An SLO tells you if the service is meeting its quality targets. Both are needed for comprehensive reliability monitoring.
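As a sketch, an HTTPS uptime check can also be created programmatically through the Monitoring API (the host, project, and check settings below are assumptions):
# Create an HTTPS uptime check that probes the homepage every 60 seconds
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "displayName": "homepage-uptime",
        "monitoredResource": {"type": "uptime_url",
          "labels": {"host": "www.example.com", "project_id": "my-project"}},
        "httpCheck": {"path": "/", "port": 443, "useSsl": true, "validateSsl": true},
        "period": "60s", "timeout": "10s"
      }' \
  "https://monitoring.googleapis.com/v3/projects/my-project/uptimeCheckConfigs"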
Cloud Logging
Log Types
| Log Type | What It Captures | Default Retention | Can Disable? |
|---|---|---|---|
| Admin Activity | API calls that modify resources (create, delete, update) | 400 days | No (always on) |
| Data Access | API calls that read data or metadata | 30 days | Yes (off by default for most) |
| System Event | Google-initiated actions (live migration, maintenance) | 400 days | No |
| Policy Denied | Denied actions due to security policy violations | 30 days | No |
| Platform Logs | GKE, Cloud Run, App Engine application logs | 30 days | Yes |
Data Access logs are off by default for most services. If a compliance question asks about tracking who read BigQuery data or Cloud Storage objects, you must enable Data Access audit logs. Admin Activity logs are always enabled and cannot be disabled.
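As a sketch, enabling Data Access logs for Cloud Storage at the project level means adding an auditConfigs entry to the project IAM policy and re-applying it (the project ID is an assumption):
# Enable Data Access audit logs for Cloud Storage on the project
gcloud projects get-iam-policy my-project --format=yaml > policy.yaml
# Append to policy.yaml:
#   auditConfigs:
#   - service: storage.googleapis.com
#     auditLogConfigs:
#     - logType: ADMIN_READ
#     - logType: DATA_READ
#     - logType: DATA_WRITE
gcloud projects set-iam-policy my-project policy.yaml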
Log Sinks and Routing
Log sinks export logs to external destinations for long-term storage, analysis, or SIEM integration.
| Destination | Use Case | Retention |
|---|---|---|
| Cloud Storage | Long-term archival, compliance | Configurable (lifecycle policies) |
| BigQuery | Log analytics, SQL queries on logs | Configurable (table expiration) |
| Pub/Sub | Real-time streaming to SIEM or custom pipeline | N/A (real-time) |
| Another Project | Centralized logging across org | Depends on destination |
# Create a log sink to BigQuery for audit logs
gcloud logging sinks create audit-to-bq \
bigquery.googleapis.com/projects/my-project/datasets/audit_logs \
--log-filter='logName:"cloudaudit.googleapis.com"' \
--use-partitioned-tables
# Create a log sink to Cloud Storage for long-term archival
gcloud logging sinks create all-logs-archive \
storage.googleapis.com/my-project-log-archive \
--log-filter='severity >= WARNING'
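# Each sink writes through a generated service account (its writer identity), which must
# be granted access on the destination; a sketch for the BigQuery sink created above:
WRITER=$(gcloud logging sinks describe audit-to-bq --format='value(writerIdentity)')
gcloud projects add-iam-policy-binding my-project \
  --member="$WRITER" \
  --role="roles/bigquery.dataEditor"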
# Create a log-based metric for error counting
gcloud logging metrics create api-errors \
--description="Count of API 5xx errors" \
--log-filter='resource.type="cloud_run_revision"
AND httpRequest.status >= 500'
SRE Principles
SLIs, SLOs, and SLAs
| Concept | Definition | Example | Who Defines It |
|---|---|---|---|
| SLI (Service Level Indicator) | A measurable metric of service quality | Request latency p99, error rate, availability % | Engineering team |
| SLO (Service Level Objective) | A target value or range for an SLI | "99.9% of requests complete in <200ms" | Engineering + product |
| SLA (Service Level Agreement) | A contractual commitment with consequences | "99.9% uptime or credit refund" | Business + legal |
SLOs should be stricter than SLAs. If your SLA promises 99.9% uptime, your internal SLO should target 99.95%. This gives you an error budget buffer before violating the contractual SLA. Cloud Monitoring supports creating SLO monitors with burn-rate alerting.
Error Budgets
An error budget is the amount of unreliability your SLO allows. If your SLO is 99.9% availability, your error budget is 0.1% (about 43 minutes per month of allowed downtime).
- Budget remaining — Continue deploying new features, take calculated risks.
- Budget exhausted — Freeze deployments, focus on reliability improvements, reduce change velocity.
- Budget policy — Define what happens at 50%, 75%, 100% budget consumption. Automate alerts.
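The arithmetic is simple enough to sanity-check from the command line; this sketch assumes a 99.9% SLO over a 30-day window and a 10x burn rate:
# Error budget in minutes, and how quickly a 10x burn rate exhausts it
echo "scale=1; (1 - 0.999) * 30 * 24 * 60" | bc   # ~43.2 minutes of budget per 30 days
echo "scale=1; 30 / 10" | bc                      # budget exhausted in ~3 days at 10x burn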
# Create an SLO in Cloud Monitoring via the Service Monitoring API (SLOs are attached to a
# monitored service; slo.json defines a request-based availability SLI over a 30-day rolling window)
# slo.json:
# {
#   "displayName": "Availability SLO",
#   "goal": 0.999,
#   "rollingPeriod": "2592000s",
#   "serviceLevelIndicator": {"requestBased": {"goodTotalRatio": {
#     "goodServiceFilter": "metric.type=\"run.googleapis.com/request_count\" AND metric.labels.response_code_class=\"2xx\"",
#     "totalServiceFilter": "metric.type=\"run.googleapis.com/request_count\""}}}
# }
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @slo.json \
  "https://monitoring.googleapis.com/v3/projects/my-project/services/my-cloud-run-service/serviceLevelObjectives"
Reliability Engineering
Chaos Engineering
Chaos engineering is the practice of intentionally injecting failures into a system to test its resilience. The PCA exam tests your understanding of when and how to apply chaos engineering.
Failure Injection
Terminate VMs, kill pods, introduce network latency, simulate disk failures. Validate that autoscaling, health checks, and failover work as designed.
Game Days
Scheduled chaos experiments with the team present. Practice incident response procedures. Document findings and remediation actions.
Steady-State Hypothesis
Define what "normal" looks like before injecting chaos. Measure whether the system returns to steady state after the experiment.
- Start small — Begin with non-production environments, graduate to production.
- Have a rollback plan — Every chaos experiment must be reversible.
- Monitor continuously — Watch SLIs during experiments to detect cascading failures.
- Automate over time — Move from manual game days to automated chaos frameworks (Chaos Monkey, Litmus).
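A minimal failure-injection sketch on GKE (the app label, deployment name, and zone are assumptions): delete one pod, watch the deployment self-heal, then drain a node to approximate a zone-level event.
# Kill one random pod and confirm the deployment replaces it
kubectl get pods -l app=checkout -o name | shuf -n 1 | xargs kubectl delete
kubectl get deployment checkout -w
# Drain a node in one zone to approximate the start of a zone failure
NODE=$(kubectl get nodes -l topology.kubernetes.io/zone=us-central1-a \
  -o jsonpath='{.items[0].metadata.name}')
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data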
Load Testing
Load testing validates that your architecture handles expected and peak traffic. On GCP, common approaches include:
| Tool | Type | Best For | GCP Integration |
|---|---|---|---|
| Locust | Open-source, Python | HTTP/gRPC, scriptable | Run on GKE or Compute Engine |
| k6 (Grafana) | Open-source, JS | HTTP, developer-friendly | Run on GKE, export to Cloud Monitoring |
| JMeter | Open-source, Java | Complex protocols, GUI | Run on Compute Engine |
| Cloud Tasks + Pub/Sub | GCP-native | Async load generation | Native integration |
Always load test against production-like environments. Testing against an under-provisioned staging environment does not validate production resilience. Use the same machine types, autoscaling configurations, and database tiers as production.
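For example, a headless Locust run against a production-like environment might look like this (the locustfile and target host are assumptions):
# Drive 500 simulated users for 15 minutes, ramping at 50 users per second
locust -f locustfile.py --headless \
  --users 500 --spawn-rate 50 --run-time 15m \
  --host https://staging.example.com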
Release Management
Release Strategies on GCP
| Strategy | GCP Service | Configuration | Monitoring |
|---|---|---|---|
| Canary (Cloud Run) | Cloud Run traffic splitting | --to-revisions=new=10,old=90 | Error rate, latency by revision |
| Canary (GKE) | Cloud Deploy canary strategy | Percentage-based promotion in pipeline | Custom metrics, SLO burn rate |
| Blue/Green (GKE) | Service routing (Istio/ASM) | VirtualService weight shifting | Service mesh telemetry |
| Rolling (MIG) | Instance Group Updater | --max-surge=3 --max-unavailable=0 | Health check pass rate |
| Feature Flags | Application-level (LaunchDarkly, custom) | Conditional code paths | Per-feature error rates |
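Two of these strategies as gcloud sketches (the service, revision, instance template, and location names are assumptions):
# Cloud Run canary: shift 10% of traffic to the new revision
gcloud run services update-traffic my-service \
  --region=us-central1 \
  --to-revisions=my-service-00042-new=10,my-service-00041-old=90
# MIG rolling update with surge capacity and zero unavailable instances
gcloud compute instance-groups managed rolling-action start-update my-mig \
  --zone=us-central1-a \
  --version=template=my-template-v2 \
  --max-surge=3 --max-unavailable=0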
Incident Response
A structured incident response process is essential for operations excellence:
- Detect — Alerting policies, uptime checks, SLO burn-rate alerts trigger notifications.
- Triage — Determine severity (P1-P4), assign incident commander, open communication channel.
- Mitigate — Rollback deployment, scale resources, enable failover. Focus on restoring service, not root cause.
- Resolve — Fix the underlying issue. Deploy fix through normal CI/CD pipeline.
- Post-mortem — Blameless analysis of what happened, what went well, what to improve. Document action items.
Blameless post-mortems are a core SRE practice. Focus on systemic improvements (better monitoring, automated rollback, improved testing) rather than individual blame. Google publishes post-mortem templates that the exam may reference.
Distributed Tracing
Cloud Trace collects latency data from applications to help identify performance bottlenecks. It integrates with Cloud Run, GKE, App Engine, and custom applications via OpenTelemetry.
- Automatic Tracing — Cloud Run, App Engine, and Cloud Functions automatically report traces.
- Custom Instrumentation — Use OpenTelemetry SDK to add spans for custom code paths.
- Trace Analysis — View request waterfall diagrams, identify slow spans, correlate with logs.
- Latency Distribution — Analyze p50, p95, p99 latency across services and time periods.
# Python — OpenTelemetry tracing with Cloud Trace exporter
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Set up Cloud Trace exporter
trace.set_tracer_provider(TracerProvider())
tracer_provider = trace.get_tracer_provider()
cloud_trace_exporter = CloudTraceSpanExporter(project_id="my-project")
tracer_provider.add_span_processor(BatchSpanProcessor(cloud_trace_exporter))
# Instrument your code
tracer = trace.get_tracer("my-service")
with tracer.start_as_current_span("process-request") as span:
    span.set_attribute("user.id", user_id)
    result = process_data(data)  # This call will be traced
    span.set_attribute("result.count", len(result))
The three pillars of observability are metrics (Cloud Monitoring), logs (Cloud Logging), and traces (Cloud Trace). A complete observability strategy uses all three. Metrics tell you something is wrong, logs tell you what happened, traces tell you where in the request path it happened.
Exam Tips
"A company notices intermittent 502 errors from their Cloud Run service during peak hours..."
Answer: Check Cloud Run metrics (concurrency, instance count, request latency). Likely cause: insufficient max-instances or low concurrency setting. Increase --max-instances and --concurrency. Set up Cloud Monitoring alerting on 5xx error rate and p99 latency. Use Cloud Trace to identify slow request paths.
"An SRE team needs to define SLOs for a critical API with 99.95% availability target..."
Answer: Define SLIs (request success rate, latency p99). Create SLO in Cloud Monitoring with 99.95% target on a 30-day rolling window. Set up burn-rate alerting at 2x, 5x, and 10x consumption rates. Error budget = 0.05% = ~21.6 minutes/month. When budget is <25%, reduce deployment frequency.
"A team wants to ensure their GKE application can survive a zone failure..."
Answer: Use a regional GKE cluster (nodes across 3 zones). Pod Disruption Budgets to maintain availability during maintenance. Pod anti-affinity rules to spread replicas across zones. Run chaos experiments: kill a node pool in one zone and verify service remains healthy.
"A company needs to retain all audit logs for 7 years for compliance..."
Answer: Create a log sink to Cloud Storage with a retention-locked bucket (7-year retention policy). Admin Activity logs are retained 400 days by default — the sink ensures long-term archival. Use Cloud Storage Coldline or Archive class for cost efficiency. Optionally sink to BigQuery for queryable compliance reporting.
Operations excellence is about proactive reliability, not reactive firefighting. The exam rewards answers that include monitoring, alerting, automation, and continuous improvement. Always mention SLOs, error budgets, and blameless post-mortems when discussing operational practices.