Google Cloud Certification

Professional Data Engineer Exam Guide

Master every domain of the GCP Professional Data Engineer certification: designing data processing systems, building and operationalizing pipelines, choosing storage, preparing data for analytics and ML, and automating production workloads. Five comprehensive study guides with hands-on notebooks.

5 study guides · 5 exam sections · 5 Colab notebooks · 30+ GCP services · 100% exam coverage

Section 1 · ~22% of Exam
Designing Data Processing Systems
Security and compliance, reliability and fidelity, flexibility and portability, data migrations. IAM, encryption, Dataflow, Dataform, Cloud Data Fusion, BigQuery Data Transfer Service.
IAM Dataflow Data Fusion Migrations
Section 2 · ~25% of Exam
Ingesting and Processing Data
Pipeline planning and building, batch vs streaming, deploying and operationalizing pipelines. Dataflow, Apache Beam, Dataproc, Pub/Sub, Cloud Composer, Spark, Kafka.
Beam Dataproc Pub/Sub Streaming
Section 3 · ~20% of Exam
Storing the Data
Storage system selection, data warehouse design, data lake architecture, data platform design. BigQuery, BigLake, Bigtable, Spanner, AlloyDB, Cloud SQL, Firestore, Memorystore, Dataplex.
BigQuery Spanner Bigtable Dataplex
Section 4 · ~15% of Exam
Preparing and Using Data for Analysis
Data visualization, AI/ML data preparation, data sharing and governance, security and privacy. BI Engine, BigQuery ML, Analytics Hub, DLP API, data masking.
BQML BI Engine DLP Analytics Hub
Section 5 · ~18% of Exam
Maintaining and Automating Data Workloads
Resource optimization, automation and repeatability, workload organization, monitoring and troubleshooting, fault tolerance and data integrity.
Composer Monitoring Reservations DAGs

Data Engineering Glossary
Key GCP services and concepts you need to know for the Professional Data Engineer exam.
BigQuery
Serverless, petabyte-scale enterprise data warehouse. Supports SQL analytics, streaming ingestion, ML (BQML), and BI Engine for sub-second dashboards.
Dataflow
Fully managed, serverless stream and batch processing service based on Apache Beam. Supports auto-scaling, exactly-once semantics, and windowing.
Apache Beam
Open-source unified programming model for batch and streaming. Portable across runners (Dataflow, Spark, Flink). Uses PCollections, ParDo, and windowing.
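Beam's fixed windowing can be illustrated with a small, stdlib-only Python sketch (this is not the Beam API itself; `assign_fixed_windows` is a hypothetical helper): each event timestamp is bucketed into a non-overlapping window of fixed duration, the way a `FixedWindows` strategy buckets a PCollection by event time.

```python
from collections import defaultdict

def assign_fixed_windows(events, window_secs=60):
    """Group (timestamp, value) events into fixed, non-overlapping windows
    keyed by each window's start time."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // window_secs) * window_secs  # align to window boundary
        windows[window_start].append(value)
    return dict(windows)

events = [(5, "a"), (59, "b"), (61, "c"), (130, "d")]
print(assign_fixed_windows(events, window_secs=60))
# {0: ['a', 'b'], 60: ['c'], 120: ['d']}
```

In real Beam code the same effect comes from applying `WindowInto(FixedWindows(60))` before a grouping transform; the sketch only shows the bucketing arithmetic.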
Pub/Sub
Global, real-time messaging service for event-driven architectures. At-least-once delivery, push/pull subscriptions, dead-letter topics, message ordering.
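At-least-once delivery means a subscriber may receive the same message more than once, so handlers should be idempotent. A minimal stdlib-only sketch of deduplicating on message ID (the dict-shaped messages are hypothetical, not the Pub/Sub client's types):

```python
def process_messages(messages, handler):
    """Apply handler once per unique message_id, skipping redeliveries."""
    seen = set()
    results = []
    for msg in messages:
        if msg["message_id"] in seen:
            continue  # duplicate redelivery: ack and skip
        seen.add(msg["message_id"])
        results.append(handler(msg["data"]))
    return results

msgs = [
    {"message_id": "m1", "data": 10},
    {"message_id": "m2", "data": 20},
    {"message_id": "m1", "data": 10},  # redelivered duplicate
]
print(process_messages(msgs, lambda d: d * 2))  # [20, 40]
```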
Dataproc
Managed Apache Spark and Hadoop service. Supports autoscaling clusters, preemptible VMs, and Dataproc Serverless for on-demand Spark jobs without cluster management.
Cloud Composer
Managed Apache Airflow service for workflow orchestration. DAG-based scheduling, cross-service coordination, built-in GCP operators and sensors.
Bigtable
Managed wide-column NoSQL database for low-latency, high-throughput workloads. Ideal for time-series, IoT, and analytics at petabyte scale. HBase API compatible.
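Bigtable stores rows sorted by key, so row-key design matters for time-series data: prefixing with a device ID spreads writes across tablets (avoiding hotspots from monotonically increasing timestamps), and a reversed timestamp makes the newest reading sort first. A conceptual sketch (the `10**13` bound and key format are illustrative assumptions, not a Bigtable API):

```python
def iot_row_key(device_id, epoch_millis):
    """Build a row key of the form '<device>#<reversed-timestamp>' so that
    the most recent reading for a device sorts lexicographically first."""
    max_millis = 10**13  # assumed upper bound used to reverse the timestamp
    reversed_ts = max_millis - epoch_millis
    return f"{device_id}#{reversed_ts:013d}"  # zero-pad so string order == numeric order

k_new = iot_row_key("sensor-42", 1_700_000_000_000)
k_old = iot_row_key("sensor-42", 1_600_000_000_000)
print(k_new < k_old)  # True: newer readings sort before older ones
```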
Cloud Spanner
Globally distributed, strongly consistent relational database. Unlimited scale with SQL, ACID transactions, and a 99.999% availability SLA for multi-region configurations.
AlloyDB
Fully managed PostgreSQL-compatible database. Up to 4x faster than standard PostgreSQL for transactional workloads, 100x faster for analytical queries.
BigLake
Unified storage engine for data lakes and warehouses. Query data in Cloud Storage and BigQuery with consistent governance, access control, and fine-grained security.
Dataplex
Intelligent data fabric for unified data management. Auto-discovers, classifies, and governs data across data lakes, warehouses, and marts. Supports data quality tasks.
Cloud Data Fusion
Fully managed, code-free ETL/ELT service built on CDAP. Visual drag-and-drop pipeline builder with 200+ connectors for enterprise data integration.
Dataform
SQL-based data transformation service in BigQuery. Git-integrated, SQLX-based, with dependency management, testing, and scheduling for ELT workflows.
Datastream
Serverless change data capture (CDC) and replication service. Streams data from MySQL, PostgreSQL, Oracle, and AlloyDB into BigQuery and Cloud Storage in real time.
Cloud DLP (Sensitive Data Protection)
Data loss prevention API for discovering, classifying, and de-identifying sensitive data (PII, PHI, PCI). Supports masking, tokenization, and k-anonymity.
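Character masking of the kind DLP applies can be sketched in plain Python; this regex-based stand-in only illustrates the idea (the real service does inspection and transformation server-side via its API, not client-side regexes):

```python
import re

def mask_ssn(text):
    """Replace all but the last four digits of a US-SSN-shaped string with '*',
    in the spirit of a DLP character-mask transformation."""
    def mask(match):
        ssn = match.group(0)
        return "*" * (len(ssn) - 4) + ssn[-4:]
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", mask, text)

print(mask_ssn("SSN on file: 123-45-6789"))  # SSN on file: *******6789
```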
Analytics Hub
Data exchange platform for securely sharing BigQuery datasets across organizations. Zero-copy linked datasets, published listings, fine-grained access control.
Cloud SQL
Managed relational database for MySQL, PostgreSQL, and SQL Server. Automated backups, replication, failover, and encryption. Best for OLTP web/app workloads.
Firestore
Serverless NoSQL document database with real-time sync, offline support, and ACID transactions. Two modes: Native (mobile/web) and Datastore (server-side).
Memorystore
Managed in-memory data store for Redis and Memcached. Sub-millisecond latency for caching, session management, and real-time analytics leaderboards.
BigQuery ML (BQML)
Train and deploy ML models using SQL in BigQuery. Supports linear/logistic regression, k-means, time series (ARIMA_PLUS), boosted trees, DNN, and imported TensorFlow models.
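BQML's `linear_reg` model type fits an ordinary-least-squares regression; a stdlib-only sketch of the one-feature closed form (for intuition only, not BQML's actual solver):

```python
def fit_linear(xs, ys):
    """Closed-form OLS for a single feature: slope and intercept of the
    best-fit line, the same model family as model_type='linear_reg'."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_linear([1, 2, 3, 4], [3, 5, 7, 9])  # data on the line y = 2x + 1
print(slope, intercept)  # 2.0 1.0
```

In BigQuery itself the equivalent step is a `CREATE MODEL ... OPTIONS(model_type='linear_reg')` statement followed by `ML.PREDICT`.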
Cloud Storage
Object storage with four classes: Standard, Nearline (30-day minimum storage duration), Coldline (90-day), Archive (365-day). Lifecycle policies, versioning, and uniform bucket-level access.
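The minimum-duration tiers map naturally to access recency; a small sketch of the decision a lifecycle rule automates (the thresholds mirror the minimum storage durations and are an illustrative heuristic, not Google's pricing guidance):

```python
def suggest_storage_class(days_since_last_access):
    """Pick the coldest class whose minimum storage duration fits
    how long the object has gone without being read."""
    if days_since_last_access >= 365:
        return "ARCHIVE"
    if days_since_last_access >= 90:
        return "COLDLINE"
    if days_since_last_access >= 30:
        return "NEARLINE"
    return "STANDARD"

print(suggest_storage_class(10))   # STANDARD
print(suggest_storage_class(400))  # ARCHIVE
```

In practice this logic lives in a bucket lifecycle policy (e.g. `SetStorageClass` after N days since creation or last access) rather than application code.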
Database Migration Service
Serverless service for migrating MySQL, PostgreSQL, SQL Server, Oracle, and AlloyDB to Cloud SQL, AlloyDB, or Cloud Spanner with minimal downtime.
Transfer Appliance
Physical appliance for offline data transfer to Google Cloud. For datasets too large for network transfer (typically 20+ TB). Rack-mountable, encrypted.
BI Engine
In-memory analysis service for sub-second interactive dashboards in BigQuery. Integrates with Looker Studio and Connected Sheets. Automatic smart tuning.