Section 02 — ~16% of Exam

Exploring Data Transformation with Google Cloud

Learn how Google Cloud enables organizations to unlock the value of their data through managed data warehousing, real-time streaming, object storage, and business intelligence tools.

Data value chain
BigQuery data warehouse
Cloud Storage classes
Pub/Sub & Dataflow streaming
Looker & BI
1

The Data Value Chain

Data by itself has little value. It becomes valuable when it is collected, stored, processed, analyzed, and activated. Google Cloud provides managed services for every stage of this value chain, allowing organizations to build complete data platforms without managing infrastructure.

Stages of the Data Value Chain

StageActivityGoogle Cloud Services
IngestCollect data from sourcesPub/Sub, Datastream, Transfer Service, IoT Core
StorePersist data durablyCloud Storage, BigQuery, Cloud SQL, Bigtable, Firestore
ProcessTransform and clean dataDataflow, Dataproc, Dataprep, Cloud Data Fusion
AnalyzeQuery and gain insightsBigQuery, Looker, Data Studio, Connected Sheets
ActivateUse insights to drive actionVertex AI, BigQuery ML, Looker dashboards, APIs
Key Concept

Google Cloud's data services are serverless by design. BigQuery, Dataflow, Pub/Sub, and Cloud Storage all auto-scale without you managing clusters or servers. This is a key differentiator vs. on-premises Hadoop/Spark clusters.

Data Types

S

Structured Data

Organized in rows and columns with a fixed schema. Stored in relational databases (Cloud SQL, Spanner) or BigQuery. Examples: transactions, CRM records, inventory.

U

Unstructured Data

No predefined format. Stored in Cloud Storage as objects. Examples: images, videos, PDFs, audio files, emails. Over 80% of enterprise data is unstructured.

M

Semi-Structured Data

Has some organizational properties but no rigid schema. Examples: JSON, XML, log files, IoT sensor readings. Stored in Firestore, Bigtable, or BigQuery (nested/repeated fields).

2

BigQuery — Cloud Data Warehouse

BigQuery is Google Cloud's fully managed, serverless, petabyte-scale data warehouse. It enables super-fast SQL queries using the processing power of Google's infrastructure. BigQuery is one of the most heavily tested services on the CDL exam.

Architecture: Separation of Storage and Compute

BigQuery's key architectural innovation is the separation of storage and compute. Data is stored in Google's distributed storage system (Colossus), while compute resources for queries are provisioned from a shared pool (Dremel engine). This means you can scale storage and compute independently.


  [Data Sources]
       |
  [ Ingest: Streaming / Batch / Transfer ]
       |
  +---------+      +-----------+
  | Colossus |      |  Dremel   |
  | (Storage)|      | (Compute) |
  +---------+      +-----------+
       |                 |
       +--------+--------+
                |
      [ BigQuery SQL Engine ]
                |
    +-----------+-----------+
    |           |           |
  [Looker]  [Data Studio] [BQML]
        
BigQuery architecture: storage and compute are independent, enabling autoscaling queries

Key BigQuery Features

  • Serverless — No infrastructure to manage. No clusters to provision or tune.
  • Standard SQL — ANSI-compliant SQL (no proprietary query language).
  • Built-in ML — BigQuery ML lets you create and execute ML models using SQL.
  • Streaming ingestion — Insert rows in real-time for near-instant analysis.
  • Geospatial analytics — Built-in GIS functions for location-based queries.
  • Column-level security — Fine-grained access control on individual columns.
  • BI Engine — In-memory analysis layer for sub-second dashboard queries.

BigQuery Pricing Models

ModelHow It WorksBest For
On-demand$6.25 per TB of data scanned by queriesVariable/unpredictable workloads, exploration
Capacity (Editions)Pay for reserved compute slots (per-second billing)Predictable, heavy query workloads
Storage$0.02/GB/month (active), $0.01/GB/month (long-term)All workloads; long-term = 90+ days unmodified
Free tier1 TB queries + 10 GB storage per monthLearning, prototyping, small analytics
Exam Tip

BigQuery on-demand pricing charges by data scanned, not compute time. To reduce costs: use partitioned tables, clustered tables, and SELECT only the columns you need (avoid SELECT *).

# Query a public dataset (free tier)
bq query --use_legacy_sql=false \
  'SELECT name, SUM(number) as total
   FROM `bigquery-public-data.usa_names.usa_1910_current`
   WHERE year > 2000
   GROUP BY name
   ORDER BY total DESC
   LIMIT 10'

# Create a dataset and table
bq mk --dataset my_project:my_dataset
bq load --autodetect \
  my_dataset.sales_data \
  gs://my-bucket/sales.csv
3

Cloud Storage

Cloud Storage is Google Cloud's object storage service for storing any amount of unstructured data. It offers 11 nines (99.999999999%) durability, global accessibility, and four storage classes to optimize cost based on access frequency.

Storage Classes

ClassMinimum Storage DurationUse CaseStorage Cost (per GB/month)
StandardNoneFrequently accessed data, websites, streaming~$0.020
Nearline30 daysBackups, data accessed < once/month~$0.010
Coldline90 daysDisaster recovery, data accessed < once/quarter~$0.004
Archive365 daysLong-term retention, regulatory compliance~$0.0012
Important

Storage class affects cost and access charges, not performance or durability. All classes have the same 11 nines durability and millisecond access latency. Nearline/Coldline/Archive have higher per-operation retrieval costs and minimum storage durations.

Object Lifecycle Management

Lifecycle rules automatically transition or delete objects based on age, creation date, storage class, or other conditions. This is a key cost optimization feature for the exam.

# Set lifecycle rule: move to Coldline after 90 days, delete after 365 days
# lifecycle.json
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
        "condition": {"age": 90}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"age": 365}
      }
    ]
  }
}

# Apply lifecycle policy
gsutil lifecycle set lifecycle.json gs://my-bucket

Key Cloud Storage Concepts

B

Buckets

Top-level containers for objects. Globally unique names. Choose location type: region, dual-region, or multi-region. IAM controls access at bucket or object level.

V

Versioning

Keep historical versions of objects. Prevents accidental deletion or overwrites. Combined with lifecycle rules to limit version retention and control costs.

R

Retention Policies

Lock objects for a minimum period (regulatory compliance). Bucket Lock makes the retention policy permanent and immutable — even admins cannot delete objects early.

T

Transfer Service

Move large datasets from other clouds (AWS S3, Azure), on-premises, or between buckets. Schedule recurring transfers with bandwidth controls.

4

Database Services

Google Cloud offers multiple managed database services, each optimized for different workloads. The CDL exam tests your ability to match the right database to a given scenario.

ServiceTypeBest ForKey Feature
Cloud SQLRelational (MySQL, PostgreSQL, SQL Server)Web apps, CMS, e-commerceManaged backups, replicas, patches
Cloud SpannerRelational, globally distributedGlobal transactions, financial systemsUnlimited scale + strong consistency
FirestoreNoSQL documentMobile/web apps, user profiles, catalogsReal-time sync, offline support
Cloud BigtableNoSQL wide-columnIoT, time-series, analytics (1M+ rows/sec)Sub-10ms latency at petabyte scale
MemorystoreIn-memory (Redis, Memcached)Caching, session management, leaderboardsSub-millisecond latency
AlloyDBPostgreSQL-compatibleEnterprise workloads, OLTP + analytics4x faster than standard PostgreSQL
Decision Guide

Need SQL + global consistency? Cloud Spanner. Need SQL + standard relational? Cloud SQL. Need NoSQL for mobile/web? Firestore. Need NoSQL for massive throughput? Bigtable. Need caching? Memorystore.

5

Streaming and Messaging: Pub/Sub

Pub/Sub (Publish/Subscribe) is Google Cloud's fully managed messaging service for building event-driven architectures. It decouples event producers from consumers, enabling asynchronous communication between microservices, data pipelines, and applications.

How Pub/Sub Works


  [Publisher 1]  [Publisher 2]  [Publisher 3]
       |              |              |
       +------+-------+------+------+
              |              |
         [ Topic A ]    [ Topic B ]
              |              |
       +------+------+      |
       |             |       |
  [Subscription 1] [Sub 2] [Sub 3]
       |             |       |
  [Subscriber A] [Sub B] [Sub C]
  (push/pull)   (push)   (pull)
        
Pub/Sub: publishers send messages to topics; subscribers receive via subscriptions

Key Pub/Sub Features

  • At-least-once delivery — Every message is delivered at least once (may see duplicates).
  • Push and pull — Pull: subscribers poll for messages. Push: Pub/Sub sends HTTP POST to an endpoint.
  • Message retention — Up to 31 days (configurable). Replay old messages for recovery.
  • Ordering — Optional message ordering within an ordering key for FIFO guarantees.
  • Dead letter topics — Route undeliverable messages to a separate topic for debugging.
  • Serverless & auto-scaling — Handles millions of messages per second with no capacity planning.
# Create a topic and subscription
gcloud pubsub topics create my-events
gcloud pubsub subscriptions create my-sub \
  --topic=my-events \
  --ack-deadline=60

# Publish a message
gcloud pubsub topics publish my-events \
  --message="Hello from CDL study guide"

# Pull messages
gcloud pubsub subscriptions pull my-sub --auto-ack --limit=5
Key Concept

Pub/Sub is the ingestion layer in most real-time data pipelines. A common architecture is: Pub/Sub (ingest) → Dataflow (process) → BigQuery (analyze). This pattern appears frequently on the CDL exam.

6

Dataflow and Dataproc

Google Cloud offers two main data processing services. Dataflow is the serverless option based on Apache Beam. Dataproc is the managed Hadoop/Spark option for teams with existing Spark/Hadoop code.

FeatureDataflowDataproc
Based onApache Beam (Google-created)Apache Hadoop / Spark
ManagementFully serverless, auto-scalingManaged clusters (you choose size)
Batch & StreamUnified model for bothSpark Streaming or batch
Use caseNew pipelines, real-time ETLExisting Hadoop/Spark migrations
PricingPer worker per secondPer cluster per second (+ VMs)
ProvisioningNo cluster setupCluster in ~90 seconds
Exam Decision Rule

New data pipeline with no existing code? Choose Dataflow (serverless, Beam). Migrating existing Hadoop/Spark jobs? Choose Dataproc (managed clusters, same APIs).

Cloud Data Fusion

Cloud Data Fusion is a fully managed, visual ETL/ELT service. It provides a drag-and-drop interface for building data pipelines without writing code. Under the hood, it uses Dataproc for execution. Best for teams that prefer visual pipeline design or lack coding expertise.

7

Looker and Business Intelligence

Looker is Google Cloud's enterprise business intelligence (BI) platform. It provides a governed semantic layer (LookML) that ensures consistent metric definitions across an organization, plus dashboards, reports, and embedded analytics capabilities.

Google Cloud BI Tools

ToolAudienceKey Feature
LookerEnterprise analysts, data teamsLookML semantic model, governed metrics, embedded analytics
Looker Studio (formerly Data Studio)Business users, marketersFree, drag-and-drop dashboards, 800+ connectors
Connected SheetsSpreadsheet usersQuery BigQuery directly from Google Sheets — no SQL needed
BigQuery BI EngineDashboard developersIn-memory acceleration for sub-second dashboard queries
Key Concept

LookML is Looker's modeling language that defines relationships, metrics, and business logic in a reusable, version-controlled layer. It ensures that "revenue" means the same thing whether viewed in a dashboard, a report, or an embedded application.

8

Exam Tips and Service Selection Guide

Exam Strategy

Section 2 (~16% of exam) focuses on data service selection. For every scenario, ask: Is the data structured or unstructured? Is it batch or streaming? What size and latency requirements? Then map to the right service.

Service Selection Decision Tree

  • "Store and analyze petabytes of structured data with SQL" → BigQuery
  • "Store images, videos, backups, or any unstructured data" → Cloud Storage
  • "Build a real-time event-driven pipeline" → Pub/Sub + Dataflow + BigQuery
  • "Migrate existing Hadoop/Spark jobs" → Dataproc
  • "Need a relational database for a web app" → Cloud SQL
  • "Need global transactions with relational consistency" → Cloud Spanner
  • "Build dashboards for business users" → Looker / Looker Studio
  • "Real-time NoSQL for mobile/web apps" → Firestore
  • "Massive throughput NoSQL for IoT/time-series" → Bigtable

Common Exam Scenarios

Q

Cost Optimization

"How to reduce BigQuery costs?" → Use partitioned/clustered tables, avoid SELECT *, use on-demand for sporadic queries, editions for heavy use.

Q

Data Lifecycle

"How to manage storage costs for aging data?" → Cloud Storage lifecycle rules to transition Standard → Nearline → Coldline → Archive → Delete.

Q

Real-Time Analytics

"Analyze clickstream data in real-time?" → Pub/Sub (ingest) → Dataflow (transform) → BigQuery (query) → Looker (visualize).