Google Cloud Certification

Professional Data Engineer Exam Guide

Master every domain of the GCP Professional Data Engineer certification: designing data processing systems, building and operationalizing pipelines, choosing storage, preparing data for analytics and ML, and automating production workloads. Five comprehensive study guides with hands-on notebooks.

5 study guides · 5 exam sections · 5 Colab notebooks · 30+ GCP services · 100% exam coverage

Section 1 · ~22% of Exam
Designing Data Processing Systems
Security and compliance, reliability and fidelity, flexibility and portability, data migrations. IAM, encryption, Dataflow, Dataform, Cloud Data Fusion, BigQuery Data Transfer Service.
IAM Dataflow Data Fusion Migrations
Section 2 · ~25% of Exam
Ingesting and Processing Data
Pipeline planning and building, batch vs streaming, deploying and operationalizing pipelines. Dataflow, Apache Beam, Dataproc, Pub/Sub, Cloud Composer, Spark, Kafka.
Beam Dataproc Pub/Sub Streaming
Section 3 · ~20% of Exam
Storing the Data
Storage system selection, data warehouse design, data lake architecture, data platform design. BigQuery, BigLake, Bigtable, Spanner, AlloyDB, Cloud SQL, Firestore, Memorystore, Dataplex.
BigQuery Spanner Bigtable Dataplex
Section 4 · ~15% of Exam
Preparing and Using Data for Analysis
Data visualization, AI/ML data preparation, data sharing and governance, security and privacy. BI Engine, BigQuery ML, Analytics Hub, DLP API, data masking.
BQML BI Engine DLP Analytics Hub
Section 5 · ~18% of Exam
Maintaining and Automating Data Workloads
Resource optimization, automation and repeatability, workload organization, monitoring and troubleshooting, fault tolerance and data integrity.
Composer Monitoring Reservations DAGs

Data Engineering Glossary
Key GCP services and concepts you need to know for the Professional Data Engineer exam.
BigQuery
Serverless, petabyte-scale enterprise data warehouse. Supports SQL analytics, streaming ingestion, ML (BQML), and BI Engine for sub-second dashboards.
Dataflow
Fully managed, serverless stream and batch processing service based on Apache Beam. Supports auto-scaling, exactly-once semantics, and windowing.
Apache Beam
Open-source unified programming model for batch and streaming. Portable across runners (Dataflow, Spark, Flink). Uses PCollections, ParDo, and windowing.
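Beam's fixed windowing can be illustrated with a small, stdlib-only Python sketch (this is not the Beam API itself; `assign_fixed_windows` is a hypothetical helper): each event timestamp is bucketed into a non-overlapping window of fixed duration, the way a `FixedWindows` strategy buckets a PCollection by event time.

```python
from collections import defaultdict

def assign_fixed_windows(events, window_secs=60):
    """Group (timestamp, value) events into fixed, non-overlapping windows
    keyed by each window's start time."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // window_secs) * window_secs  # align to window boundary
        windows[window_start].append(value)
    return dict(windows)

events = [(5, "a"), (59, "b"), (61, "c"), (130, "d")]
print(assign_fixed_windows(events, window_secs=60))
# {0: ['a', 'b'], 60: ['c'], 120: ['d']}
```

In real Beam code the same effect comes from applying `WindowInto(FixedWindows(60))` before a grouping transform; the sketch only shows the bucketing arithmetic.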
Pub/Sub
Global, real-time messaging service for event-driven architectures. At-least-once delivery, push/pull subscriptions, dead-letter topics, message ordering.
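At-least-once delivery means a subscriber may receive the same message more than once, so handlers should be idempotent. A minimal stdlib-only sketch of deduplicating on message ID (the dict-shaped messages are hypothetical, not the Pub/Sub client's types):

```python
def process_messages(messages, handler):
    """Apply handler once per unique message_id, skipping redeliveries."""
    seen = set()
    results = []
    for msg in messages:
        if msg["message_id"] in seen:
            continue  # duplicate redelivery: ack and skip
        seen.add(msg["message_id"])
        results.append(handler(msg["data"]))
    return results

msgs = [
    {"message_id": "m1", "data": 10},
    {"message_id": "m2", "data": 20},
    {"message_id": "m1", "data": 10},  # redelivered duplicate
]
print(process_messages(msgs, lambda d: d * 2))  # [20, 40]
```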
Dataproc
Managed Apache Spark and Hadoop service. Supports autoscaling clusters, preemptible VMs, and Dataproc Serverless for on-demand Spark jobs without cluster management.
Cloud Composer
Managed Apache Airflow service for workflow orchestration. DAG-based scheduling, cross-service coordination, built-in GCP operators and sensors.
Bigtable
Managed wide-column NoSQL database for low-latency, high-throughput workloads. Ideal for time-series, IoT, and analytics at petabyte scale. HBase API compatible.
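Bigtable stores rows sorted by key, so row-key design matters for time-series data: prefixing with a device ID spreads writes across tablets (avoiding hotspots from monotonically increasing timestamps), and a reversed timestamp makes the newest reading sort first. A conceptual sketch (the `10**13` bound and key format are illustrative assumptions, not a Bigtable API):

```python
def iot_row_key(device_id, epoch_millis):
    """Build a row key of the form '<device>#<reversed-timestamp>' so that
    the most recent reading for a device sorts lexicographically first."""
    max_millis = 10**13  # assumed upper bound used to reverse the timestamp
    reversed_ts = max_millis - epoch_millis
    return f"{device_id}#{reversed_ts:013d}"  # zero-pad so string order == numeric order

k_new = iot_row_key("sensor-42", 1_700_000_000_000)
k_old = iot_row_key("sensor-42", 1_600_000_000_000)
print(k_new < k_old)  # True: newer readings sort before older ones
```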
Cloud Spanner
Globally distributed, strongly consistent relational database. Unlimited scale with SQL, ACID transactions, and a 99.999% availability SLA for multi-region configurations.
AlloyDB
Fully managed PostgreSQL-compatible database. Up to 4x faster than standard PostgreSQL for transactional workloads, 100x faster for analytical queries.
BigLake
Unified storage engine for data lakes and warehouses. Query data in Cloud Storage and BigQuery with consistent governance, access control, and fine-grained security.
Dataplex
Intelligent data fabric for unified data management. Auto-discovers, classifies, and governs data across data lakes, warehouses, and marts. Supports data quality tasks.
Cloud Data Fusion
Fully managed, code-free ETL/ELT service built on CDAP. Visual drag-and-drop pipeline builder with 200+ connectors for enterprise data integration.
Dataform
SQL-based data transformation service in BigQuery. Git-integrated, SQLX-based, with dependency management, testing, and scheduling for ELT workflows.
Datastream
Serverless change data capture (CDC) and replication service. Streams data from MySQL, PostgreSQL, Oracle, and AlloyDB into BigQuery and Cloud Storage in real time.
Cloud DLP (Sensitive Data Protection)
Data loss prevention API for discovering, classifying, and de-identifying sensitive data (PII, PHI, PCI). Supports masking, tokenization, and k-anonymity.
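Character masking of the kind DLP applies can be sketched in plain Python; this regex-based stand-in only illustrates the idea (the real service does inspection and transformation server-side via its API, not client-side regexes):

```python
import re

def mask_ssn(text):
    """Replace all but the last four digits of a US-SSN-shaped string with '*',
    in the spirit of a DLP character-mask transformation."""
    def mask(match):
        ssn = match.group(0)
        return "*" * (len(ssn) - 4) + ssn[-4:]
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", mask, text)

print(mask_ssn("SSN on file: 123-45-6789"))  # SSN on file: *******6789
```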
Analytics Hub
Data exchange platform for securely sharing BigQuery datasets across organizations. Zero-copy linked datasets, published listings, fine-grained access control.
Cloud SQL
Managed relational database for MySQL, PostgreSQL, and SQL Server. Automated backups, replication, failover, and encryption. Best for OLTP web/app workloads.
Firestore
Serverless NoSQL document database with real-time sync, offline support, and ACID transactions. Two modes: Native (mobile/web) and Datastore (server-side).
Memorystore
Managed in-memory data store for Redis and Memcached. Sub-millisecond latency for caching, session management, and real-time analytics leaderboards.
BigQuery ML (BQML)
Train and deploy ML models using SQL in BigQuery. Supports linear/logistic regression, k-means, time series (ARIMA_PLUS), boosted trees, DNN, and imported TensorFlow models.
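BQML's `linear_reg` model type fits an ordinary-least-squares regression; a stdlib-only sketch of the one-feature closed form (for intuition only, not BQML's actual solver):

```python
def fit_linear(xs, ys):
    """Closed-form OLS for a single feature: slope and intercept of the
    best-fit line, the same model family as model_type='linear_reg'."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_linear([1, 2, 3, 4], [3, 5, 7, 9])  # data on the line y = 2x + 1
print(slope, intercept)  # 2.0 1.0
```

In BigQuery itself the equivalent step is a `CREATE MODEL ... OPTIONS(model_type='linear_reg')` statement followed by `ML.PREDICT`.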
Cloud Storage
Object storage with four classes: Standard, Nearline (30-day minimum storage duration), Coldline (90-day), Archive (365-day). Lifecycle policies, versioning, and uniform bucket-level access.
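The minimum-duration tiers map naturally to access recency; a small sketch of the decision a lifecycle rule automates (the thresholds mirror the minimum storage durations and are an illustrative heuristic, not Google's pricing guidance):

```python
def suggest_storage_class(days_since_last_access):
    """Pick the coldest class whose minimum storage duration fits
    how long the object has gone without being read."""
    if days_since_last_access >= 365:
        return "ARCHIVE"
    if days_since_last_access >= 90:
        return "COLDLINE"
    if days_since_last_access >= 30:
        return "NEARLINE"
    return "STANDARD"

print(suggest_storage_class(10))   # STANDARD
print(suggest_storage_class(400))  # ARCHIVE
```

In practice this logic lives in a bucket lifecycle policy (e.g. `SetStorageClass` after N days since creation or last access) rather than application code.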
Database Migration Service
Serverless service for migrating MySQL, PostgreSQL, SQL Server, Oracle, and AlloyDB to Cloud SQL, AlloyDB, or Cloud Spanner with minimal downtime.
Transfer Appliance
Physical appliance for offline data transfer to Google Cloud. For datasets too large for network transfer (typically 20+ TB). Rack-mountable, encrypted.
BI Engine
In-memory analysis service for sub-second interactive dashboards in BigQuery. Integrates with Looker Studio and Connected Sheets. Automatic smart tuning.