The Data Value Chain
Data by itself has little value. It becomes valuable when it is collected, stored, processed, analyzed, and activated. Google Cloud provides managed services for every stage of this value chain, allowing organizations to build complete data platforms without managing infrastructure.
Stages of the Data Value Chain
| Stage | Activity | Google Cloud Services |
|---|---|---|
| Ingest | Collect data from sources | Pub/Sub, Datastream, Transfer Service, IoT Core |
| Store | Persist data durably | Cloud Storage, BigQuery, Cloud SQL, Bigtable, Firestore |
| Process | Transform and clean data | Dataflow, Dataproc, Dataprep, Cloud Data Fusion |
| Analyze | Query and gain insights | BigQuery, Looker, Data Studio, Connected Sheets |
| Activate | Use insights to drive action | Vertex AI, BigQuery ML, Looker dashboards, APIs |
Google Cloud's data services are serverless by design. BigQuery, Dataflow, Pub/Sub, and Cloud Storage all auto-scale without you managing clusters or servers. This is a key differentiator vs. on-premises Hadoop/Spark clusters.
Data Types
Structured Data
Organized in rows and columns with a fixed schema. Stored in relational databases (Cloud SQL, Spanner) or BigQuery. Examples: transactions, CRM records, inventory.
Unstructured Data
No predefined format. Stored in Cloud Storage as objects. Examples: images, videos, PDFs, audio files, emails. Over 80% of enterprise data is unstructured.
Semi-Structured Data
Has some organizational properties but no rigid schema. Examples: JSON, XML, log files, IoT sensor readings. Stored in Firestore, Bigtable, or BigQuery (nested/repeated fields).
BigQuery — Cloud Data Warehouse
BigQuery is Google Cloud's fully managed, serverless, petabyte-scale data warehouse. It enables super-fast SQL queries using the processing power of Google's infrastructure. BigQuery is one of the most heavily tested services on the CDL exam.
Architecture: Separation of Storage and Compute
BigQuery's key architectural innovation is the separation of storage and compute. Data is stored in Google's distributed storage system (Colossus), while compute resources for queries are provisioned from a shared pool (Dremel engine). This means you can scale storage and compute independently.
[Data Sources]
|
[ Ingest: Streaming / Batch / Transfer ]
|
+---------+ +-----------+
| Colossus | | Dremel |
| (Storage)| | (Compute) |
+---------+ +-----------+
| |
+--------+--------+
|
[ BigQuery SQL Engine ]
|
+-----------+-----------+
| | |
[Looker] [Data Studio] [BQML]
Key BigQuery Features
- Serverless — No infrastructure to manage. No clusters to provision or tune.
- Standard SQL — ANSI-compliant SQL (no proprietary query language).
- Built-in ML — BigQuery ML lets you create and execute ML models using SQL.
- Streaming ingestion — Insert rows in real-time for near-instant analysis.
- Geospatial analytics — Built-in GIS functions for location-based queries.
- Column-level security — Fine-grained access control on individual columns.
- BI Engine — In-memory analysis layer for sub-second dashboard queries.
BigQuery Pricing Models
| Model | How It Works | Best For |
|---|---|---|
| On-demand | $6.25 per TB of data scanned by queries | Variable/unpredictable workloads, exploration |
| Capacity (Editions) | Pay for reserved compute slots (per-second billing) | Predictable, heavy query workloads |
| Storage | $0.02/GB/month (active), $0.01/GB/month (long-term) | All workloads; long-term = 90+ days unmodified |
| Free tier | 1 TB queries + 10 GB storage per month | Learning, prototyping, small analytics |
BigQuery on-demand pricing charges by data scanned, not compute time. To reduce costs: use partitioned tables, clustered tables, and SELECT only the columns you need (avoid SELECT *).
# Query a public dataset (free tier)
bq query --use_legacy_sql=false \
'SELECT name, SUM(number) as total
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year > 2000
GROUP BY name
ORDER BY total DESC
LIMIT 10'
# Create a dataset and table
bq mk --dataset my_project:my_dataset
bq load --autodetect \
my_dataset.sales_data \
gs://my-bucket/sales.csv
Cloud Storage
Cloud Storage is Google Cloud's object storage service for storing any amount of unstructured data. It offers 11 nines (99.999999999%) durability, global accessibility, and four storage classes to optimize cost based on access frequency.
Storage Classes
| Class | Minimum Storage Duration | Use Case | Storage Cost (per GB/month) |
|---|---|---|---|
| Standard | None | Frequently accessed data, websites, streaming | ~$0.020 |
| Nearline | 30 days | Backups, data accessed < once/month | ~$0.010 |
| Coldline | 90 days | Disaster recovery, data accessed < once/quarter | ~$0.004 |
| Archive | 365 days | Long-term retention, regulatory compliance | ~$0.0012 |
Storage class affects cost and access charges, not performance or durability. All classes have the same 11 nines durability and millisecond access latency. Nearline/Coldline/Archive have higher per-operation retrieval costs and minimum storage durations.
Object Lifecycle Management
Lifecycle rules automatically transition or delete objects based on age, creation date, storage class, or other conditions. This is a key cost optimization feature for the exam.
# Set lifecycle rule: move to Coldline after 90 days, delete after 365 days
# lifecycle.json
{
"lifecycle": {
"rule": [
{
"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
"condition": {"age": 90}
},
{
"action": {"type": "Delete"},
"condition": {"age": 365}
}
]
}
}
# Apply lifecycle policy
gsutil lifecycle set lifecycle.json gs://my-bucket
Key Cloud Storage Concepts
Buckets
Top-level containers for objects. Globally unique names. Choose location type: region, dual-region, or multi-region. IAM controls access at bucket or object level.
Versioning
Keep historical versions of objects. Prevents accidental deletion or overwrites. Combined with lifecycle rules to limit version retention and control costs.
Retention Policies
Lock objects for a minimum period (regulatory compliance). Bucket Lock makes the retention policy permanent and immutable — even admins cannot delete objects early.
Transfer Service
Move large datasets from other clouds (AWS S3, Azure), on-premises, or between buckets. Schedule recurring transfers with bandwidth controls.
Database Services
Google Cloud offers multiple managed database services, each optimized for different workloads. The CDL exam tests your ability to match the right database to a given scenario.
| Service | Type | Best For | Key Feature |
|---|---|---|---|
| Cloud SQL | Relational (MySQL, PostgreSQL, SQL Server) | Web apps, CMS, e-commerce | Managed backups, replicas, patches |
| Cloud Spanner | Relational, globally distributed | Global transactions, financial systems | Unlimited scale + strong consistency |
| Firestore | NoSQL document | Mobile/web apps, user profiles, catalogs | Real-time sync, offline support |
| Cloud Bigtable | NoSQL wide-column | IoT, time-series, analytics (1M+ rows/sec) | Sub-10ms latency at petabyte scale |
| Memorystore | In-memory (Redis, Memcached) | Caching, session management, leaderboards | Sub-millisecond latency |
| AlloyDB | PostgreSQL-compatible | Enterprise workloads, OLTP + analytics | 4x faster than standard PostgreSQL |
Need SQL + global consistency? Cloud Spanner. Need SQL + standard relational? Cloud SQL. Need NoSQL for mobile/web? Firestore. Need NoSQL for massive throughput? Bigtable. Need caching? Memorystore.
Streaming and Messaging: Pub/Sub
Pub/Sub (Publish/Subscribe) is Google Cloud's fully managed messaging service for building event-driven architectures. It decouples event producers from consumers, enabling asynchronous communication between microservices, data pipelines, and applications.
How Pub/Sub Works
[Publisher 1] [Publisher 2] [Publisher 3]
| | |
+------+-------+------+------+
| |
[ Topic A ] [ Topic B ]
| |
+------+------+ |
| | |
[Subscription 1] [Sub 2] [Sub 3]
| | |
[Subscriber A] [Sub B] [Sub C]
(push/pull) (push) (pull)
Key Pub/Sub Features
- At-least-once delivery — Every message is delivered at least once (may see duplicates).
- Push and pull — Pull: subscribers poll for messages. Push: Pub/Sub sends HTTP POST to an endpoint.
- Message retention — Up to 31 days (configurable). Replay old messages for recovery.
- Ordering — Optional message ordering within an ordering key for FIFO guarantees.
- Dead letter topics — Route undeliverable messages to a separate topic for debugging.
- Serverless & auto-scaling — Handles millions of messages per second with no capacity planning.
# Create a topic and subscription
gcloud pubsub topics create my-events
gcloud pubsub subscriptions create my-sub \
--topic=my-events \
--ack-deadline=60
# Publish a message
gcloud pubsub topics publish my-events \
--message="Hello from CDL study guide"
# Pull messages
gcloud pubsub subscriptions pull my-sub --auto-ack --limit=5
Pub/Sub is the ingestion layer in most real-time data pipelines. A common architecture is: Pub/Sub (ingest) → Dataflow (process) → BigQuery (analyze). This pattern appears frequently on the CDL exam.
Dataflow and Dataproc
Google Cloud offers two main data processing services. Dataflow is the serverless option based on Apache Beam. Dataproc is the managed Hadoop/Spark option for teams with existing Spark/Hadoop code.
| Feature | Dataflow | Dataproc |
|---|---|---|
| Based on | Apache Beam (Google-created) | Apache Hadoop / Spark |
| Management | Fully serverless, auto-scaling | Managed clusters (you choose size) |
| Batch & Stream | Unified model for both | Spark Streaming or batch |
| Use case | New pipelines, real-time ETL | Existing Hadoop/Spark migrations |
| Pricing | Per worker per second | Per cluster per second (+ VMs) |
| Provisioning | No cluster setup | Cluster in ~90 seconds |
New data pipeline with no existing code? Choose Dataflow (serverless, Beam). Migrating existing Hadoop/Spark jobs? Choose Dataproc (managed clusters, same APIs).
Cloud Data Fusion
Cloud Data Fusion is a fully managed, visual ETL/ELT service. It provides a drag-and-drop interface for building data pipelines without writing code. Under the hood, it uses Dataproc for execution. Best for teams that prefer visual pipeline design or lack coding expertise.
Looker and Business Intelligence
Looker is Google Cloud's enterprise business intelligence (BI) platform. It provides a governed semantic layer (LookML) that ensures consistent metric definitions across an organization, plus dashboards, reports, and embedded analytics capabilities.
Google Cloud BI Tools
| Tool | Audience | Key Feature |
|---|---|---|
| Looker | Enterprise analysts, data teams | LookML semantic model, governed metrics, embedded analytics |
| Looker Studio (formerly Data Studio) | Business users, marketers | Free, drag-and-drop dashboards, 800+ connectors |
| Connected Sheets | Spreadsheet users | Query BigQuery directly from Google Sheets — no SQL needed |
| BigQuery BI Engine | Dashboard developers | In-memory acceleration for sub-second dashboard queries |
LookML is Looker's modeling language that defines relationships, metrics, and business logic in a reusable, version-controlled layer. It ensures that "revenue" means the same thing whether viewed in a dashboard, a report, or an embedded application.
Exam Tips and Service Selection Guide
Section 2 (~16% of exam) focuses on data service selection. For every scenario, ask: Is the data structured or unstructured? Is it batch or streaming? What size and latency requirements? Then map to the right service.
Service Selection Decision Tree
- "Store and analyze petabytes of structured data with SQL" → BigQuery
- "Store images, videos, backups, or any unstructured data" → Cloud Storage
- "Build a real-time event-driven pipeline" → Pub/Sub + Dataflow + BigQuery
- "Migrate existing Hadoop/Spark jobs" → Dataproc
- "Need a relational database for a web app" → Cloud SQL
- "Need global transactions with relational consistency" → Cloud Spanner
- "Build dashboards for business users" → Looker / Looker Studio
- "Real-time NoSQL for mobile/web apps" → Firestore
- "Massive throughput NoSQL for IoT/time-series" → Bigtable
Common Exam Scenarios
Cost Optimization
"How to reduce BigQuery costs?" → Use partitioned/clustered tables, avoid SELECT *, use on-demand for sporadic queries, editions for heavy use.
Data Lifecycle
"How to manage storage costs for aging data?" → Cloud Storage lifecycle rules to transition Standard → Nearline → Coldline → Archive → Delete.
Real-Time Analytics
"Analyze clickstream data in real-time?" → Pub/Sub (ingest) → Dataflow (transform) → BigQuery (query) → Looker (visualize).