Data Preparation Pipeline on GCP
Why Data Preparation Matters
Data scientists spend 60-80% of their time on data preparation. The quality of your ML model is bounded by the quality of your data. Garbage in, garbage out.
On GCP, data preparation is not a single tool but a pipeline of services working together. Choosing the right tool for each stage is critical for cost, performance, and maintainability.
Pipeline Stages
The exam tests your ability to choose the right data tool for the scenario. The three main contenders are Dataflow, Dataprep, and Dataproc. Each has a distinct sweet spot — understanding the boundaries is more important than memorizing syntax.
Dataflow (Apache Beam)
Cloud Dataflow is Google Cloud's fully managed, serverless service for running Apache Beam pipelines. It handles both batch and streaming data processing with the same programming model.
Core Pipeline Concepts
| Concept | Description | Example |
|---|---|---|
| Pipeline | The entire data processing workflow, defined as a DAG of transforms | pipeline = beam.Pipeline() |
| PCollection | An immutable, distributed dataset. The input and output of every transform. | Lines of text, rows of data, images |
| Transform (PTransform) | An operation on a PCollection. Applied with the | (pipe) operator. |
Map, Filter, GroupByKey, ParDo |
| Source & Sink | Where data enters and leaves the pipeline | Read from GCS/BigQuery, write to BigQuery/GCS/Pub/Sub |
| Window | Groups streaming data into finite chunks by time | Fixed windows (every 5 min), sliding windows, session windows |
| Watermark | Tracks event-time progress to handle late-arriving data | Allows processing even when some events arrive out of order |
Apache Beam Pipeline Example
# Apache Beam pipeline — runs on Dataflow (or locally for testing)
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
# Define pipeline options
options = PipelineOptions([
'--runner=DataflowRunner', # Use 'DirectRunner' for local testing
'--project=my-gcp-project',
'--region=us-central1',
'--temp_location=gs://my-bucket/temp',
'--staging_location=gs://my-bucket/staging',
])
# Build the pipeline
with beam.Pipeline(options=options) as p:
(
p
| 'ReadFromBQ' >> beam.io.ReadFromBigQuery(
query='SELECT * FROM `project.dataset.table`',
use_standard_sql=True,
)
| 'FilterValid' >> beam.Filter(lambda row: row['amount'] > 0)
| 'TransformData' >> beam.Map(
lambda row: {
'customer_id': row['customer_id'],
'amount_normalized': row['amount'] / 1000.0,
'category': row['category'].lower(),
}
)
| 'WriteToBQ' >> beam.io.WriteToBigQuery(
'project:dataset.output_table',
schema='customer_id:STRING,amount_normalized:FLOAT,category:STRING',
write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
)
)
# Streaming pipeline — process events from Pub/Sub in real time
with beam.Pipeline(options=options) as p:
(
p
| 'ReadPubSub' >> beam.io.ReadFromPubSub(
topic='projects/my-project/topics/events',
)
| 'ParseJSON' >> beam.Map(json.loads)
| 'Window5Min' >> beam.WindowInto(
beam.window.FixedWindows(300) # 5-minute windows
)
| 'GroupByUser' >> beam.GroupByKey()
| 'Aggregate' >> beam.Map(compute_user_stats)
| 'WriteBQ' >> beam.io.WriteToBigQuery(...)
)
Dataflow's killer feature is unified batch and streaming. The same Apache Beam code runs as batch (bounded PCollection from GCS/BQ) or streaming (unbounded PCollection from Pub/Sub). No need to maintain two separate pipelines.
When the question mentions "serverless" data processing or "both batch and streaming", the answer is almost always Dataflow. Dataflow is also the default answer for ETL/ELT pipelines unless there is a specific reason to use Spark (existing Spark code, Spark MLlib).
Dataprep by Trifacta
Cloud Dataprep is a visual, no-code data wrangling tool built by Trifacta (now part of Alteryx). It provides an interactive web UI for exploring, cleaning, and transforming data without writing code.
When to Use Dataprep
- Visual data exploration — Dataprep shows data distributions, anomalies, and patterns automatically
- Non-technical users — Business analysts and data stewards who do not write Python or SQL
- One-off data cleaning — Ad-hoc data preparation tasks, not recurring production pipelines
- Data profiling — Quickly understand data quality (nulls, outliers, distributions)
- Recipe-based transforms — Build a sequence of transformation "recipes" that can be saved and reused
Under the hood, Dataprep generates and executes Dataflow jobs. So it is essentially a visual front-end for Dataflow with AI-assisted suggestions for transforms.
Dataprep is not suitable for production streaming pipelines. It is designed for interactive, ad-hoc data wrangling. For production ETL, use Dataflow directly (code-based). Dataprep also has row count limitations on the visual preview (though the actual execution via Dataflow can handle large datasets).
Dataproc (Managed Spark/Hadoop)
Cloud Dataproc is Google Cloud's managed service for running Apache Spark, Hadoop, Flink, and Presto clusters. It provisions clusters in 90 seconds and integrates with GCS, BigQuery, and other GCP services.
When to Use Dataproc
- Existing Spark/Hadoop code — Lift-and-shift from on-premises Hadoop clusters
- Spark MLlib — Distributed ML training using Spark's ML library
- Complex graph processing — GraphX, Spark GraphFrames
- PySpark/Scala Spark jobs — Teams that already know Spark
- Ephemeral clusters — Spin up a cluster for a job, shut it down when done (cost-effective)
# Create a Dataproc cluster via gcloud CLI
gcloud dataproc clusters create my-spark-cluster \
--region=us-central1 \
--master-machine-type=n1-standard-4 \
--worker-machine-type=n1-standard-4 \
--num-workers=2 \
--image-version="2.1-debian11" \
--enable-component-gateway \
--optional-components=JUPYTER
# Submit a PySpark job
gcloud dataproc jobs submit pyspark \
gs://my-bucket/scripts/etl_job.py \
--cluster=my-spark-cluster \
--region=us-central1 \
-- --input=gs://my-bucket/raw/ --output=gs://my-bucket/processed/
# Delete cluster when done (ephemeral pattern)
gcloud dataproc clusters delete my-spark-cluster \
--region=us-central1 --quiet
The ephemeral cluster pattern is a best practice for Dataproc: create a cluster, run the job, delete the cluster. Use Dataproc Workflow Templates or Dataproc Serverless (Spark Batch) to automate this. You only pay while the cluster is running.
If the question mentions "existing Spark code", "Hadoop migration", or "Spark MLlib", the answer is Dataproc. If it mentions "serverless" with no existing Spark/Hadoop, the answer is Dataflow.
Decision Guide: Dataflow vs Dataprep vs Dataproc
Comparison Table
| Criteria | Dataflow | Dataprep | Dataproc |
|---|---|---|---|
| Engine | Apache Beam | Dataflow (under the hood) | Apache Spark / Hadoop |
| Management | Fully serverless | Fully serverless (visual UI) | Managed clusters (or serverless Spark) |
| Batch | Yes | Yes | Yes |
| Streaming | Yes (unified model) | No | Yes (Spark Structured Streaming) |
| Coding Required | Yes (Python/Java) | No (visual UI) | Yes (Python/Scala/Java) |
| Best For | New ETL pipelines, streaming, GCP-native | Ad-hoc data wrangling, exploration | Existing Spark/Hadoop, Spark MLlib |
| Scaling | Auto-scales workers | Auto-scales (via Dataflow) | Manual or autoscaling clusters |
| Cost Model | Per vCPU-hour + GB-hour | Dataflow cost + Trifacta fee | Per cluster VM-hour |
Default: Dataflow. Override to Dataproc only if there is existing Spark/Hadoop code or Spark MLlib is needed. Override to Dataprep only if the user is non-technical and the task is ad-hoc exploration. For in-warehouse transforms on structured data already in BigQuery, use BigQuery SQL.
Pre-trained ML APIs Deep Dive
Google Cloud offers pre-trained ML APIs that require zero training data and zero ML expertise. Send data via REST or client library, get structured predictions back. These are the fastest path to adding ML to an application.
Cloud Vision API
Label Detection
Identifies objects, locations, activities, animal species, and products in an image. Returns labels with confidence scores.
OCR (Text Detection)
TEXT_DETECTION for sparse text (signs, menus). DOCUMENT_TEXT_DETECTION for dense documents (pages, articles). Returns text with bounding boxes.
Face Detection
Detects faces with landmarks (eyes, nose, mouth). Returns joy, sorrow, anger, surprise likelihoods. Does NOT identify individuals (no face recognition).
Safe Search
Detects explicit content: adult, violence, racy, spoof, medical. Returns likelihood levels (VERY_UNLIKELY to VERY_LIKELY). Essential for user-generated content moderation.
Landmark Detection
Identifies famous landmarks (Eiffel Tower, Taj Mahal). Returns name, confidence, geographic coordinates (latitude/longitude).
Logo Detection
Identifies logos of well-known companies and brands. Returns logo name, confidence score, and bounding polygon.
# Cloud Vision API — Label Detection (Python)
from google.cloud import vision
client = vision.ImageAnnotatorClient()
# From a GCS URI
image = vision.Image(source=vision.ImageSource(
image_uri="gs://my-bucket/photo.jpg"
))
# Or from a URL
image = vision.Image(source=vision.ImageSource(
image_uri="https://example.com/photo.jpg"
))
response = client.label_detection(image=image, max_results=10)
for label in response.label_annotations:
print(f"{label.description}: {label.score:.2%}")
Cloud Natural Language API
| Feature | What It Returns | Use Case |
|---|---|---|
| Sentiment Analysis | Score (-1 to +1) and magnitude (0 to ∞) for document and each sentence | Product review analysis, social media monitoring, customer feedback |
| Entity Analysis | Named entities (PERSON, LOCATION, ORGANIZATION, EVENT, etc.) with salience scores | Information extraction, knowledge graph building, content tagging |
| Syntax Analysis | Part-of-speech tags, dependency parse tree, lemmatization | Grammar analysis, text preprocessing, linguistic research |
| Content Classification | Categories from a taxonomy of 700+ categories (e.g., "/Arts & Entertainment/Music") | Content organization, ad targeting, content filtering |
| Entity Sentiment | Sentiment for each entity mentioned in the text | Brand monitoring ("iPhone is great but MacBook is overpriced") |
# Cloud NLP API — Sentiment + Entity Analysis
from google.cloud import language_v1
client = language_v1.LanguageServiceClient()
document = language_v1.Document(
content="Google Cloud's Vertex AI is an excellent ML platform.",
type_=language_v1.Document.Type.PLAIN_TEXT,
)
# Sentiment
sentiment = client.analyze_sentiment(document=document).document_sentiment
print(f"Sentiment: score={sentiment.score:.2f}, magnitude={sentiment.magnitude:.2f}")
# Entities
entities = client.analyze_entities(document=document).entities
for entity in entities:
print(f" {entity.name}: {entity.type_.name} (salience={entity.salience:.2f})")
Cloud Speech-to-Text
Converts audio to text using Google's speech recognition models. Supports 120+ languages, streaming recognition (real-time), and batch transcription (long audio files).
- Synchronous — Audio up to 1 minute, immediate response
- Asynchronous — Audio up to 480 minutes, results via polling or callback
- Streaming — Real-time transcription as audio flows in
| Feature | Description |
|---|---|
| Speaker Diarization | Identifies different speakers in the audio ("Speaker 1", "Speaker 2") |
| Automatic Punctuation | Adds periods, commas, question marks to transcription |
| Word Timestamps | Returns start and end time for each word |
| Speech Adaptation | Boost recognition of specific words/phrases (industry jargon, product names) |
| Multi-channel | Transcribe separate audio channels independently (e.g., caller vs agent) |
# Cloud Speech-to-Text — Transcribe audio from GCS
from google.cloud import speech
client = speech.SpeechClient()
audio = speech.RecognitionAudio(
uri="gs://my-bucket/audio-sample.wav"
)
config = speech.RecognitionConfig(
encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
sample_rate_hertz=16000,
language_code="en-US",
enable_automatic_punctuation=True,
enable_word_time_offsets=True,
diarization_config=speech.SpeakerDiarizationConfig(
enable_speaker_diarization=True,
min_speaker_count=2,
max_speaker_count=4,
),
)
# For audio > 1 min, use long_running_recognize
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=300)
for result in response.results:
print(f"Transcript: {result.alternatives[0].transcript}")
print(f"Confidence: {result.alternatives[0].confidence:.2%}")
Cloud Translation API
Two tiers of translation service:
| Feature | Basic (v2) | Advanced (v3) |
|---|---|---|
| Model | Google NMT | Google NMT + AutoML Translation |
| Custom Glossary | No | Yes — domain-specific term mapping |
| Custom Model | No | Yes — train on your parallel corpus |
| Batch Translation | No | Yes — translate entire documents/files |
| Languages | 100+ | 100+ |
| Use Case | Simple translations, language detection | Enterprise, glossaries, domain-specific |
# Cloud Translation API v3 (Advanced)
from google.cloud import translate_v2 as translate
client = translate.Client()
# Detect language
detection = client.detect_language("こんにちは世界")
print(f"Detected: {detection['language']} ({detection['confidence']:.0%})")
# Translate
result = client.translate(
"Machine learning is transforming industries.",
target_language="es", # Spanish
)
print(f"Translation: {result['translatedText']}")
Video Intelligence API
Analyzes video content stored in Cloud Storage. Processes entire videos and returns time-stamped annotations.
Label Detection
Identifies objects, actions, and concepts throughout the video with timestamps. Segment-level and shot-level labels.
Shot Change Detection
Detects scene transitions (cuts, fades, dissolves). Returns start/end timestamps for each shot. Essential for video editing workflows.
Explicit Content Detection
Frame-by-frame analysis for adult/violent content. Returns likelihood per frame. Required for platforms with user-uploaded video.
Object Tracking
Tracks objects across frames with bounding boxes. Identifies and follows specific objects through the video timeline.
Text Detection in Video
OCR on video frames. Detects and extracts text visible in the video (signs, captions, overlays) with timestamps.
Speech Transcription
Transcribes speech in video to text. Same engine as Speech-to-Text but integrated into the video analysis pipeline.
# Video Intelligence API — Label Detection
from google.cloud import videointelligence
client = videointelligence.VideoIntelligenceServiceClient()
features = [videointelligence.Feature.LABEL_DETECTION]
operation = client.annotate_video(
request={
"input_uri": "gs://my-bucket/video.mp4",
"features": features,
}
)
result = operation.result(timeout=300)
for label in result.annotation_results[0].segment_label_annotations:
print(f"Label: {label.entity.description}")
for segment in label.segments:
start = segment.segment.start_time_offset.seconds
end = segment.segment.end_time_offset.seconds
conf = segment.confidence
print(f" {start}s - {end}s (confidence: {conf:.2%})")
"Analyze sentiment of customer reviews" → Cloud NLP API (sentiment analysis).
"Extract text from a scanned PDF" → Cloud Vision API (DOCUMENT_TEXT_DETECTION) or Document AI.
"Transcribe a meeting recording" → Cloud Speech-to-Text (async, with diarization).
"Detect inappropriate content in user-uploaded videos" → Video Intelligence API (explicit content detection).
"Translate product descriptions to multiple languages" → Cloud Translation API (Advanced if glossary needed).
API Authentication and Quotas
Authentication Methods
| Method | When to Use | How |
|---|---|---|
| API Key | Simple API calls, client-side apps | Pass key as query parameter or header. Least secure — no identity, only project-level access. |
| Service Account | Server-to-server, production workloads | JSON key file or Workload Identity. IAM roles control access. Recommended for production. |
| OAuth 2.0 | User-facing apps, accessing user data | User consent flow. Access token + refresh token. Scoped permissions. |
| Application Default Credentials (ADC) | Development, Compute Engine, GKE | gcloud auth application-default login or automatic on GCE/GKE. Client libraries use ADC automatically. |
# Set ADC for local development
gcloud auth application-default login
# Set service account for production
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
# Enable an API
gcloud services enable vision.googleapis.com
gcloud services enable language.googleapis.com
gcloud services enable speech.googleapis.com
gcloud services enable translate.googleapis.com
gcloud services enable videointelligence.googleapis.com
Every API has rate limits (requests per minute) and usage quotas (requests per day/month). Default quotas are generous for development but may need to be increased for production. Check quotas at: Console > IAM & Admin > Quotas. You can request quota increases through the console.
The exam may test authentication best practices. Key points: never embed API keys in code, use service accounts with least-privilege IAM roles for production, and use ADC for development. When running on GCE/GKE, prefer Workload Identity over key files.
Exam Tips — Scenario-Based Guidance
Common Exam Scenarios
"A company needs to build a real-time data pipeline that reads from Pub/Sub, transforms events, and writes to BigQuery..."
Answer: Dataflow. Serverless, supports streaming from Pub/Sub, unified batch/streaming model. This is the textbook Dataflow use case.
"A company has existing PySpark ETL jobs running on an on-premises Hadoop cluster..."
Answer: Dataproc. Lift-and-shift existing Spark jobs with minimal code changes. Use ephemeral clusters for cost efficiency.
"A business analyst needs to clean and prepare a CSV dataset for a one-time analysis..."
Answer: Dataprep. Visual UI, no coding required, AI-assisted suggestions. Perfect for non-technical, ad-hoc tasks.
"A media company needs to automatically tag uploaded images with content categories..."
Answer: Cloud Vision API (label detection). No training data needed, immediate results. If they need custom labels specific to their domain, upgrade to AutoML Image.
"A call center wants to transcribe calls, analyze sentiment, and extract key entities..."
Answer: Pipeline of Speech-to-Text (transcription with diarization) → Cloud NLP API (sentiment analysis + entity extraction). These APIs compose naturally in a data pipeline.
"A social media platform needs to detect and flag inappropriate video content before publishing..."
Answer: Video Intelligence API with explicit content detection. Process uploaded videos asynchronously, flag those exceeding the threshold, and queue for human review.
Two key decision frameworks:
(1) Data tool selection: Default to Dataflow unless there is a reason for Dataproc (existing Spark) or Dataprep (non-technical, ad-hoc).
(2) API selection: Match the data type (image, text, audio, video) to the corresponding API. If the pre-trained API's general model is insufficient, escalate to AutoML for a custom model.
Interview Ready
How to Explain This in 2 Minutes
Google Cloud’s pre-trained ML APIs—Vision, Natural Language, Translation, Speech-to-Text, and Video Intelligence—let you add production-grade AI to applications with a single REST call, no training data or ML expertise required. For data preparation at scale, GCP offers three complementary tools: Dataflow (serverless Apache Beam for streaming and batch ETL), Dataprep (visual, no-code wrangling by Trifacta), and Dataproc (managed Spark/Hadoop for lift-and-shift workloads). The interview-ready insight is knowing when to use each: Dataflow for green-field pipelines needing autoscaling, Dataprep for quick exploratory cleaning, Dataproc when you have existing Spark code, and pre-trained APIs when the task is a solved problem (OCR, sentiment, translation) with no domain-specific labels.
Likely Interview Questions
| Question | What They're Really Asking |
|---|---|
| When would you use pre-trained APIs versus training your own model? | Can you identify when a generic model is sufficient vs. when domain-specific data justifies custom training? |
| Compare Dataflow, Dataprep, and Dataproc for data preparation. | Do you understand the trade-offs between serverless beam, visual wrangling, and managed Spark? |
| How would you handle data cleaning for an ML pipeline at scale on GCP? | Can you design a pipeline that handles missing values, normalization, and feature encoding in a distributed way? |
| What is the Vision API capable of, and what are its limitations? | Do you know the boundary between API capabilities (label detection, OCR, face detection) and custom Vision needs? |
| How do you authenticate and manage quotas for GCP ML APIs in production? | Do you understand service accounts, API keys, OAuth scopes, and rate-limit handling for production workloads? |
Model Answers
Pre-trained vs Custom: I default to pre-trained APIs when the task matches a general category—sentiment analysis on product reviews, OCR on scanned documents, or translating support tickets. The APIs are continuously improved by Google, require zero training data, and scale instantly. I switch to custom training (AutoML Vision, AutoML Natural Language, or Vertex AI custom) when accuracy on domain-specific classes is critical—for example, classifying medical images where generic labels are insufficient, or detecting industry-specific entities in legal text that the NL API doesn’t recognize.
Dataflow vs Dataproc vs Dataprep: Dataflow is my first choice for new pipelines because it’s fully serverless, autoscales to zero, handles both batch and streaming with the same Apache Beam code, and integrates natively with BigQuery and Pub/Sub. Dataproc is the right choice when the team has existing PySpark or Hadoop jobs they want to run on GCP with minimal refactoring—it provisions a cluster in 90 seconds and supports preemptible VMs. Dataprep is ideal for data analysts doing exploratory data quality checks before building a formal pipeline.
Production API Architecture: In production, I use service accounts (not API keys) with least-privilege IAM roles, enable the specific APIs in the project, and configure exponential backoff for 429 (quota exceeded) responses. I set up Cloud Monitoring alerts on API usage metrics, use VPC Service Controls to prevent data exfiltration, and batch requests where possible (e.g., Vision API batch annotation) to stay within rate limits and reduce cost.
System Design Scenario
Scenario: A global e-commerce company receives customer reviews in 15 languages. They want to extract sentiment, detect offensive content, translate all reviews to English, and store structured results for analytics. Design the data pipeline on GCP.
Approach: Reviews land in Pub/Sub from the application backend. A Dataflow streaming pipeline consumes messages, calls the Translation API to normalize text to English, then calls the Natural Language API for sentiment and content classification in parallel. Results are written to BigQuery with columns for original_language, translated_text, sentiment_score, and content_categories. A separate Dataflow branch calls the Vision API on any attached review images for label detection and SafeSearch. Cloud DLP scans for PII before storage. Monitoring dashboards in Looker visualize sentiment trends by product and region. The pipeline autoscales with Dataflow and requires no cluster management.
Common Mistakes
- Using Dataproc for everything — Many candidates default to Spark because it’s familiar, but Dataflow is preferred on GCP for new pipelines because it’s serverless and handles streaming natively. Only recommend Dataproc when there’s existing Hadoop/Spark code to migrate.
- Forgetting API limitations — Pre-trained APIs have file size limits, rate quotas, and language support boundaries. Failing to mention these constraints in a design answer signals you haven’t used them in production.
- Skipping data quality in the pipeline — Jumping straight from raw data to model training without addressing missing values, duplicates, encoding issues, and label quality is a red flag. Always describe the data validation and cleaning steps explicitly.