Data Transformation with Google Cloud

1

The Data Value Chain

Data by itself has little value. It becomes valuable when it is collected, stored, processed, analyzed, and activated. Google Cloud provides managed services for every stage of this value chain, allowing organizations to build complete data platforms without managing infrastructure.

Stages of the Data Value Chain

Stage	Activity	Google Cloud Services
Ingest	Collect data from sources	Pub/Sub, Datastream, Transfer Service, IoT Core
Store	Persist data durably	Cloud Storage, BigQuery, Cloud SQL, Bigtable, Firestore
Process	Transform and clean data	Dataflow, Dataproc, Dataprep, Cloud Data Fusion
Analyze	Query and gain insights	BigQuery, Looker, Data Studio, Connected Sheets
Activate	Use insights to drive action	Vertex AI, BigQuery ML, Looker dashboards, APIs

Key Concept

Google Cloud's data services are serverless by design. BigQuery, Dataflow, Pub/Sub, and Cloud Storage all auto-scale without you managing clusters or servers. This is a key differentiator vs. on-premises Hadoop/Spark clusters.

Data Types

S

Structured Data

Organized in rows and columns with a fixed schema. Stored in relational databases (Cloud SQL, Spanner) or BigQuery. Examples: transactions, CRM records, inventory.

U

Unstructured Data

No predefined format. Stored in Cloud Storage as objects. Examples: images, videos, PDFs, audio files, emails. Over 80% of enterprise data is unstructured.

M

Semi-Structured Data

Has some organizational properties but no rigid schema. Examples: JSON, XML, log files, IoT sensor readings. Stored in Firestore, Bigtable, or BigQuery (nested/repeated fields).

2

BigQuery — Cloud Data Warehouse

BigQuery is Google Cloud's fully managed, serverless, petabyte-scale data warehouse. It enables super-fast SQL queries using the processing power of Google's infrastructure. BigQuery is one of the most heavily tested services on the CDL exam.

Architecture: Separation of Storage and Compute

BigQuery's key architectural innovation is the separation of storage and compute. Data is stored in Google's distributed storage system (Colossus), while compute resources for queries are provisioned from a shared pool (Dremel engine). This means you can scale storage and compute independently.


  [Data Sources]
       |
  [ Ingest: Streaming / Batch / Transfer ]
       |
  +---------+      +-----------+
  | Colossus |      |  Dremel   |
  | (Storage)|      | (Compute) |
  +---------+      +-----------+
       |                 |
       +--------+--------+
                |
      [ BigQuery SQL Engine ]
                |
    +-----------+-----------+
    |           |           |
  [Looker]  [Data Studio] [BQML]

BigQuery architecture: storage and compute are independent, enabling autoscaling queries

Key BigQuery Features

Serverless — No infrastructure to manage. No clusters to provision or tune.
Standard SQL — ANSI-compliant SQL (no proprietary query language).
Built-in ML — BigQuery ML lets you create and execute ML models using SQL.
Streaming ingestion — Insert rows in real-time for near-instant analysis.
Geospatial analytics — Built-in GIS functions for location-based queries.
Column-level security — Fine-grained access control on individual columns.
BI Engine — In-memory analysis layer for sub-second dashboard queries.

BigQuery Pricing Models

Model	How It Works	Best For
On-demand	$6.25 per TB of data scanned by queries	Variable/unpredictable workloads, exploration
Capacity (Editions)	Pay for reserved compute slots (per-second billing)	Predictable, heavy query workloads
Storage	$0.02/GB/month (active), $0.01/GB/month (long-term)	All workloads; long-term = 90+ days unmodified
Free tier	1 TB queries + 10 GB storage per month	Learning, prototyping, small analytics

Exam Tip

BigQuery on-demand pricing charges by data scanned, not compute time. To reduce costs: use partitioned tables, clustered tables, and SELECT only the columns you need (avoid SELECT *).

# Query a public dataset (free tier)
bq query --use_legacy_sql=false \
  'SELECT name, SUM(number) as total
   FROM `bigquery-public-data.usa_names.usa_1910_current`
   WHERE year > 2000
   GROUP BY name
   ORDER BY total DESC
   LIMIT 10'

# Create a dataset and table
bq mk --dataset my_project:my_dataset
bq load --autodetect \
  my_dataset.sales_data \
  gs://my-bucket/sales.csv

3

Cloud Storage

Cloud Storage is Google Cloud's object storage service for storing any amount of unstructured data. It offers 11 nines (99.999999999%) durability, global accessibility, and four storage classes to optimize cost based on access frequency.

Storage Classes

Class	Minimum Storage Duration	Use Case	Storage Cost (per GB/month)
Standard	None	Frequently accessed data, websites, streaming	~$0.020
Nearline	30 days	Backups, data accessed < once/month	~$0.010
Coldline	90 days	Disaster recovery, data accessed < once/quarter	~$0.004
Archive	365 days	Long-term retention, regulatory compliance	~$0.0012

Important

Storage class affects cost and access charges, not performance or durability. All classes have the same 11 nines durability and millisecond access latency. Nearline/Coldline/Archive have higher per-operation retrieval costs and minimum storage durations.

Object Lifecycle Management

Lifecycle rules automatically transition or delete objects based on age, creation date, storage class, or other conditions. This is a key cost optimization feature for the exam.

# Set lifecycle rule: move to Coldline after 90 days, delete after 365 days
# lifecycle.json
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
        "condition": {"age": 90}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"age": 365}
      }
    ]
  }
}

# Apply lifecycle policy
gsutil lifecycle set lifecycle.json gs://my-bucket

Key Cloud Storage Concepts

B

Buckets

Top-level containers for objects. Globally unique names. Choose location type: region, dual-region, or multi-region. IAM controls access at bucket or object level.

V

Versioning

Keep historical versions of objects. Prevents accidental deletion or overwrites. Combined with lifecycle rules to limit version retention and control costs.

R

Retention Policies

Lock objects for a minimum period (regulatory compliance). Bucket Lock makes the retention policy permanent and immutable — even admins cannot delete objects early.

T

Transfer Service

Move large datasets from other clouds (AWS S3, Azure), on-premises, or between buckets. Schedule recurring transfers with bandwidth controls.

4

Database Services

Google Cloud offers multiple managed database services, each optimized for different workloads. The CDL exam tests your ability to match the right database to a given scenario.

Service	Type	Best For	Key Feature
Cloud SQL	Relational (MySQL, PostgreSQL, SQL Server)	Web apps, CMS, e-commerce	Managed backups, replicas, patches
Cloud Spanner	Relational, globally distributed	Global transactions, financial systems	Unlimited scale + strong consistency
Firestore	NoSQL document	Mobile/web apps, user profiles, catalogs	Real-time sync, offline support
Cloud Bigtable	NoSQL wide-column	IoT, time-series, analytics (1M+ rows/sec)	Sub-10ms latency at petabyte scale
Memorystore	In-memory (Redis, Memcached)	Caching, session management, leaderboards	Sub-millisecond latency
AlloyDB	PostgreSQL-compatible	Enterprise workloads, OLTP + analytics	4x faster than standard PostgreSQL

Decision Guide

Need SQL + global consistency? Cloud Spanner. Need SQL + standard relational? Cloud SQL. Need NoSQL for mobile/web? Firestore. Need NoSQL for massive throughput? Bigtable. Need caching? Memorystore.

5

Streaming and Messaging: Pub/Sub

Pub/Sub (Publish/Subscribe) is Google Cloud's fully managed messaging service for building event-driven architectures. It decouples event producers from consumers, enabling asynchronous communication between microservices, data pipelines, and applications.

How Pub/Sub Works


  [Publisher 1]  [Publisher 2]  [Publisher 3]
       |              |              |
       +------+-------+------+------+
              |              |
         [ Topic A ]    [ Topic B ]
              |              |
       +------+------+      |
       |             |       |
  [Subscription 1] [Sub 2] [Sub 3]
       |             |       |
  [Subscriber A] [Sub B] [Sub C]
  (push/pull)   (push)   (pull)

Pub/Sub: publishers send messages to topics; subscribers receive via subscriptions

Key Pub/Sub Features

At-least-once delivery — Every message is delivered at least once (may see duplicates).
Push and pull — Pull: subscribers poll for messages. Push: Pub/Sub sends HTTP POST to an endpoint.
Message retention — Up to 31 days (configurable). Replay old messages for recovery.
Ordering — Optional message ordering within an ordering key for FIFO guarantees.
Dead letter topics — Route undeliverable messages to a separate topic for debugging.
Serverless & auto-scaling — Handles millions of messages per second with no capacity planning.

# Create a topic and subscription
gcloud pubsub topics create my-events
gcloud pubsub subscriptions create my-sub \
  --topic=my-events \
  --ack-deadline=60

# Publish a message
gcloud pubsub topics publish my-events \
  --message="Hello from CDL study guide"

# Pull messages
gcloud pubsub subscriptions pull my-sub --auto-ack --limit=5

Key Concept

Pub/Sub is the ingestion layer in most real-time data pipelines. A common architecture is: Pub/Sub (ingest) → Dataflow (process) → BigQuery (analyze). This pattern appears frequently on the CDL exam.

6

Dataflow and Dataproc

Google Cloud offers two main data processing services. Dataflow is the serverless option based on Apache Beam. Dataproc is the managed Hadoop/Spark option for teams with existing Spark/Hadoop code.

Feature	Dataflow	Dataproc
Based on	Apache Beam (Google-created)	Apache Hadoop / Spark
Management	Fully serverless, auto-scaling	Managed clusters (you choose size)
Batch & Stream	Unified model for both	Spark Streaming or batch
Use case	New pipelines, real-time ETL	Existing Hadoop/Spark migrations
Pricing	Per worker per second	Per cluster per second (+ VMs)
Provisioning	No cluster setup	Cluster in ~90 seconds

Exam Decision Rule

New data pipeline with no existing code? Choose Dataflow (serverless, Beam). Migrating existing Hadoop/Spark jobs? Choose Dataproc (managed clusters, same APIs).

Cloud Data Fusion

Cloud Data Fusion is a fully managed, visual ETL/ELT service. It provides a drag-and-drop interface for building data pipelines without writing code. Under the hood, it uses Dataproc for execution. Best for teams that prefer visual pipeline design or lack coding expertise.

7

Looker and Business Intelligence

Looker is Google Cloud's enterprise business intelligence (BI) platform. It provides a governed semantic layer (LookML) that ensures consistent metric definitions across an organization, plus dashboards, reports, and embedded analytics capabilities.

Google Cloud BI Tools

Tool	Audience	Key Feature
Looker	Enterprise analysts, data teams	LookML semantic model, governed metrics, embedded analytics
Looker Studio (formerly Data Studio)	Business users, marketers	Free, drag-and-drop dashboards, 800+ connectors
Connected Sheets	Spreadsheet users	Query BigQuery directly from Google Sheets — no SQL needed
BigQuery BI Engine	Dashboard developers	In-memory acceleration for sub-second dashboard queries

Key Concept

LookML is Looker's modeling language that defines relationships, metrics, and business logic in a reusable, version-controlled layer. It ensures that "revenue" means the same thing whether viewed in a dashboard, a report, or an embedded application.

8

Exam Tips and Service Selection Guide

Exam Strategy

Section 2 (~16% of exam) focuses on data service selection. For every scenario, ask: Is the data structured or unstructured? Is it batch or streaming? What size and latency requirements? Then map to the right service.

Service Selection Decision Tree

"Store and analyze petabytes of structured data with SQL" → BigQuery
"Store images, videos, backups, or any unstructured data" → Cloud Storage
"Build a real-time event-driven pipeline" → Pub/Sub + Dataflow + BigQuery
"Migrate existing Hadoop/Spark jobs" → Dataproc
"Need a relational database for a web app" → Cloud SQL
"Need global transactions with relational consistency" → Cloud Spanner
"Build dashboards for business users" → Looker / Looker Studio
"Real-time NoSQL for mobile/web apps" → Firestore
"Massive throughput NoSQL for IoT/time-series" → Bigtable

Common Exam Scenarios

Q

Cost Optimization

"How to reduce BigQuery costs?" → Use partitioned/clustered tables, avoid SELECT *, use on-demand for sporadic queries, editions for heavy use.

Q

Data Lifecycle

"How to manage storage costs for aging data?" → Cloud Storage lifecycle rules to transition Standard → Nearline → Coldline → Archive → Delete.

Q

Real-Time Analytics

"Analyze clickstream data in real-time?" → Pub/Sub (ingest) → Dataflow (transform) → BigQuery (query) → Looker (visualize).

Exploring Data Transformation with Google Cloud

The Data Value Chain

Stages of the Data Value Chain

Data Types

Structured Data

Unstructured Data

Semi-Structured Data

BigQuery — Cloud Data Warehouse

Architecture: Separation of Storage and Compute

Key BigQuery Features

BigQuery Pricing Models

Cloud Storage

Storage Classes

Object Lifecycle Management

Key Cloud Storage Concepts

Buckets

Versioning

Retention Policies

Transfer Service

Database Services

Streaming and Messaging: Pub/Sub

How Pub/Sub Works

Key Pub/Sub Features

Dataflow and Dataproc

Cloud Data Fusion

Looker and Business Intelligence

Google Cloud BI Tools

Exam Tips and Service Selection Guide

Service Selection Decision Tree

Common Exam Scenarios

Cost Optimization

Data Lifecycle

Real-Time Analytics