BigQuery ML Overview
BigQuery ML (BQML) allows you to create and execute machine learning models directly in BigQuery using standard SQL. The data stays in BigQuery — there is no need to export data to a separate training environment, dramatically reducing time-to-insight.
Why BigQuery ML?
No Data Movement
Train models where the data lives. No ETL to a separate training service. Reduces latency and data governance risk.
SQL Interface
Data analysts and SQL users can build ML models without Python or framework knowledge. Democratizes ML across the organization.
Automatic Preprocessing
BQML handles feature normalization, one-hot encoding of categoricals, and missing value imputation by default.
Scales with BigQuery
Leverages BigQuery's distributed infrastructure. Models train on petabyte-scale datasets without managing infrastructure.
When you run CREATE MODEL, BigQuery allocates training resources, reads data from
the specified query, trains the model, and stores it as a BigQuery model object. The model
metadata, weights, and training statistics are all accessible via SQL functions like
ML.EVALUATE, ML.PREDICT, and ML.TRAINING_INFO.
Supported Model Types
BQML supports a wide range of model types, from simple linear models to AutoML and imported TensorFlow models. The exam frequently tests your ability to choose the right model type for a given scenario.
Complete Model Type Reference
| Model Type | SQL Keyword | Task | When to Use |
|---|---|---|---|
| Linear Regression | LINEAR_REG |
Regression | Predict continuous values (price, revenue, duration). Fast training, interpretable. |
| Logistic Regression | LOGISTIC_REG |
Classification | Binary or multiclass classification. Set auto_class_weights=TRUE for imbalanced data. |
| K-Means Clustering | KMEANS |
Clustering | Customer segmentation, anomaly detection. Unsupervised — no label column needed. |
| Matrix Factorization | MATRIX_FACTORIZATION |
Recommendation | Build recommendation systems from user-item interaction data (explicit or implicit feedback). |
| ARIMA+ | ARIMA_PLUS |
Time Series | Forecasting with automatic seasonality detection, holiday effects, and trend decomposition. |
| Boosted Tree Classifier | BOOSTED_TREE_CLASSIFIER |
Classification | Higher accuracy than logistic regression for complex, non-linear classification tasks. |
| Boosted Tree Regressor | BOOSTED_TREE_REGRESSOR |
Regression | Non-linear regression with feature interactions. Good default for tabular data. |
| DNN Classifier | DNN_CLASSIFIER |
Classification | Deep neural network for complex classification. Requires more data and training time. |
| DNN Regressor | DNN_REGRESSOR |
Regression | Deep neural network for complex regression. Good when linear/tree models underperform. |
| AutoML Classifier | AUTOML_CLASSIFIER |
Classification | Automated model architecture search. Best accuracy with minimal configuration. Higher cost. |
| AutoML Regressor | AUTOML_REGRESSOR |
Regression | Automated regression model. Use when you want maximum accuracy and budget allows. |
| TensorFlow (imported) | TENSORFLOW |
Any | Import a pre-trained SavedModel for inference in BigQuery. No training in BQML. |
| XGBoost | BOOSTED_TREE_* |
Any tabular | BQML's boosted tree implementation uses XGBoost under the hood. Same SQL syntax. |
The exam often presents a scenario and asks which model type to use. Key signals: "forecast sales" = ARIMA_PLUS. "segment customers" = KMEANS. "recommend products" = MATRIX_FACTORIZATION. "predict churn (yes/no)" = LOGISTIC_REG or BOOSTED_TREE_CLASSIFIER. "predict price" = LINEAR_REG or BOOSTED_TREE_REGRESSOR.
CREATE MODEL Syntax Deep Dive
The CREATE MODEL statement is the core of BigQuery ML. It combines model
specification, hyperparameters, and training data in a single SQL statement.
Basic Syntax
-- General CREATE MODEL syntax
CREATE OR REPLACE MODEL `project.dataset.model_name`
OPTIONS (
model_type = 'LOGISTIC_REG', -- Required: model type
input_label_cols = ['target_column'], -- Label column (supervised)
auto_class_weights = TRUE, -- Handle imbalanced classes
data_split_method = 'AUTO_SPLIT', -- AUTO_SPLIT, RANDOM, CUSTOM, SEQ, NO_SPLIT
max_iterations = 20, -- Training iterations
learn_rate = 0.1, -- Learning rate
l1_reg = 0.001, -- L1 regularization
l2_reg = 0.001, -- L2 regularization
early_stop = TRUE, -- Stop when no improvement
min_rel_progress = 0.01 -- Minimum relative progress
) AS
SELECT
feature1,
feature2,
feature3,
target_column
FROM
`project.dataset.training_table`
WHERE
feature1 IS NOT NULL
Classification Example (Logistic Regression)
-- Predict penguin species using logistic regression
CREATE OR REPLACE MODEL `my_dataset.penguin_classifier`
OPTIONS (
model_type = 'LOGISTIC_REG',
input_label_cols = ['species'],
auto_class_weights = TRUE,
max_iterations = 20
) AS
SELECT
species,
island,
culmen_length_mm,
culmen_depth_mm,
flipper_length_mm,
body_mass_g,
sex
FROM
`bigquery-public-data.ml_datasets.penguins`
WHERE
body_mass_g IS NOT NULL
AND sex IS NOT NULL
Regression Example (Boosted Tree)
-- Predict taxi fare using boosted tree regression
CREATE OR REPLACE MODEL `my_dataset.taxi_fare_model`
OPTIONS (
model_type = 'BOOSTED_TREE_REGRESSOR',
input_label_cols = ['fare'],
num_parallel_tree = 5,
max_tree_depth = 6,
subsample = 0.8,
learn_rate = 0.1,
early_stop = TRUE,
data_split_method = 'AUTO_SPLIT'
) AS
SELECT
fare,
pickup_latitude, pickup_longitude,
dropoff_latitude, dropoff_longitude,
passenger_count,
EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour,
EXTRACT(DAYOFWEEK FROM pickup_datetime) AS pickup_day
FROM
`my_dataset.taxi_trips`
WHERE
fare > 0 AND fare < 200
Time Series Example (ARIMA_PLUS)
-- Forecast daily sales using ARIMA_PLUS
CREATE OR REPLACE MODEL `my_dataset.sales_forecast`
OPTIONS (
model_type = 'ARIMA_PLUS',
time_series_timestamp_col = 'date',
time_series_data_col = 'total_sales',
time_series_id_col = 'store_id', -- Multiple time series
auto_arima = TRUE, -- Auto-detect (p,d,q)
holiday_region = 'US', -- Holiday effects
data_frequency = 'DAILY' -- AUTO_FREQUENCY also works
) AS
SELECT
date,
store_id,
total_sales
FROM
`my_dataset.daily_sales`
The input_label_cols option is required for supervised models but must be
omitted for unsupervised models (KMEANS, MATRIX_FACTORIZATION). Including it for KMEANS
will cause an error. For ARIMA_PLUS, use time_series_data_col instead.
Feature Engineering in BQML
BQML provides built-in feature engineering functions that can be applied within the
TRANSFORM clause. This ensures transformations are consistently applied
during both training and prediction.
The TRANSFORM Clause
The TRANSFORM clause defines feature transformations that become part of the model.
When you call ML.PREDICT, these transformations are automatically applied —
you pass raw data and the model handles preprocessing.
-- Feature engineering with TRANSFORM clause
CREATE OR REPLACE MODEL `my_dataset.engineered_model`
TRANSFORM (
-- Pass through label and features
target_col,
-- Bucketize continuous features
ML.BUCKETIZE(age, [18, 25, 35, 50, 65]) AS age_bucket,
-- Quantile bucketize (data-driven boundaries)
ML.QUANTILE_BUCKETIZE(income, 10) AS income_decile,
-- Feature cross (interaction between categoricals)
ML.FEATURE_CROSS(STRUCT(city, product_category)) AS city_product_cross,
-- Polynomial expansion
ML.POLYNOMIAL_EXPAND(STRUCT(feature1, feature2), 2) AS poly_features,
-- Standard SQL transformations
LOG(amount + 1) AS log_amount,
EXTRACT(HOUR FROM timestamp_col) AS hour_of_day,
EXTRACT(DAYOFWEEK FROM timestamp_col) AS day_of_week,
IF(category IS NULL, 'unknown', category) AS category_clean
)
OPTIONS (
model_type = 'BOOSTED_TREE_CLASSIFIER',
input_label_cols = ['target_col']
) AS
SELECT * FROM `my_dataset.training_data`
Built-in Feature Engineering Functions
| Function | Purpose | Example |
|---|---|---|
ML.BUCKETIZE |
Bin continuous values into fixed buckets | ML.BUCKETIZE(age, [18, 30, 50]) |
ML.QUANTILE_BUCKETIZE |
Bin by quantiles (data-driven boundaries) | ML.QUANTILE_BUCKETIZE(income, 5) |
ML.FEATURE_CROSS |
Combine categoricals into interaction features | ML.FEATURE_CROSS(STRUCT(city, day)) |
ML.POLYNOMIAL_EXPAND |
Create polynomial feature combinations | ML.POLYNOMIAL_EXPAND(STRUCT(x, y), 2) |
ML.NGRAMS |
Generate n-grams from text | ML.NGRAMS(tokens, [1, 2]) |
ML.HASH_BUCKETIZE |
Hash high-cardinality features into fixed buckets | ML.HASH_BUCKETIZE(user_id, 1000) |
Always use the TRANSFORM clause instead of applying transformations in the
training query. The TRANSFORM clause ensures the same transformations are
applied during ML.PREDICT, preventing training-serving skew.
Model Evaluation
After training, use ML.EVALUATE to assess model performance. The metrics
returned depend on the model type.
-- Evaluate model performance
SELECT *
FROM ML.EVALUATE(
MODEL `my_dataset.penguin_classifier`
)
-- Evaluate on a specific holdout set
SELECT *
FROM ML.EVALUATE(
MODEL `my_dataset.penguin_classifier`,
(
SELECT * FROM `my_dataset.penguins_test`
WHERE body_mass_g IS NOT NULL
)
)
Key Metrics by Model Type
| Metric | Model Type | Interpretation |
|---|---|---|
| accuracy | Classification | Proportion of correct predictions. Misleading for imbalanced classes. |
| precision | Classification | Of predicted positives, how many are correct? High precision = few false positives. |
| recall | Classification | Of actual positives, how many did we find? High recall = few false negatives. |
| f1_score | Classification | Harmonic mean of precision and recall. Best single metric for imbalanced data. |
| roc_auc | Binary Classification | Area under ROC curve. 0.5 = random, 1.0 = perfect. Threshold-independent. |
| log_loss | Classification | Measures confidence of predictions. Lower is better. Penalizes confident wrong answers. |
| mean_squared_error | Regression | Average of squared errors. Sensitive to outliers. Lower is better. |
| mean_absolute_error | Regression | Average of absolute errors. More robust to outliers than MSE. |
| r2_score | Regression | Proportion of variance explained. 1.0 = perfect. Can be negative for poor models. |
| davies_bouldin_index | Clustering (KMEANS) | Cluster separation quality. Lower is better (tighter, more separated clusters). |
-- View training info (loss curve)
SELECT *
FROM ML.TRAINING_INFO(
MODEL `my_dataset.penguin_classifier`
)
ORDER BY iteration
-- View confusion matrix for classification
SELECT *
FROM ML.CONFUSION_MATRIX(
MODEL `my_dataset.penguin_classifier`
)
Know when to use each metric. Fraud detection (high cost of false negatives) → optimize for recall. Spam filtering (high cost of false positives) → optimize for precision. Balanced importance → use F1 score. Regression with outliers → prefer MAE over RMSE.
Prediction and Explainability
ML.PREDICT generates batch predictions, while ML.EXPLAIN_PREDICT
adds feature attributions to explain why the model made each prediction.
Batch Prediction with ML.PREDICT
-- Generate predictions on new data
SELECT
predicted_species,
predicted_species_probs,
island,
body_mass_g
FROM ML.PREDICT(
MODEL `my_dataset.penguin_classifier`,
(
SELECT
island, culmen_length_mm, culmen_depth_mm,
flipper_length_mm, body_mass_g, sex
FROM `my_dataset.new_penguins`
)
)
-- For regression models, output column is: predicted_<label>
SELECT predicted_fare, actual_fare
FROM ML.PREDICT(
MODEL `my_dataset.taxi_fare_model`,
(SELECT * FROM `my_dataset.test_trips`)
)
Explainability with ML.EXPLAIN_PREDICT
-- Get predictions with feature attributions
SELECT *
FROM ML.EXPLAIN_PREDICT(
MODEL `my_dataset.penguin_classifier`,
(
SELECT *
FROM `bigquery-public-data.ml_datasets.penguins`
WHERE body_mass_g IS NOT NULL
LIMIT 5
),
STRUCT(3 AS top_k_features)
)
-- Returns: predicted label + Shapley-based feature attributions
-- Each row includes the top_k most important features for that prediction
ARIMA_PLUS Forecasting
-- Generate 30-day forecast
SELECT *
FROM ML.FORECAST(
MODEL `my_dataset.sales_forecast`,
STRUCT(30 AS horizon, 0.95 AS confidence_level)
)
-- Returns: forecast_timestamp, forecast_value,
-- standard_error, prediction_interval_lower_bound,
-- prediction_interval_upper_bound
ML.PREDICT is used for all model types except ARIMA_PLUS. For time series
models, use ML.FORECAST instead. ML.EXPLAIN_PREDICT works
with classification and regression models and uses Shapley values for feature attribution.
Exporting Models to Vertex AI
BQML models trained in BigQuery can be exported and deployed to Vertex AI Endpoints for online (real-time) prediction. This is the bridge from BQML prototyping to production serving.
-- Step 1: Export model to Cloud Storage
EXPORT MODEL `my_dataset.penguin_classifier`
OPTIONS (
URI = 'gs://my-bucket/exported_models/penguin_classifier/'
)
-- Step 2: Register in Vertex AI Model Registry (via gcloud)
gcloud ai models upload \
--region=us-central1 \
--display-name=penguin-classifier \
--artifact-uri=gs://my-bucket/exported_models/penguin_classifier/ \
--container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest
-- Step 3: Deploy to an endpoint
gcloud ai endpoints deploy-model ENDPOINT_ID \
--region=us-central1 \
--model=MODEL_ID \
--display-name=penguin-prod \
--machine-type=n1-standard-4 \
--min-replica-count=1 \
--max-replica-count=3
Export BQML models to Vertex AI when you need online (real-time) prediction
with low latency (<100ms). Keep models in BigQuery if you only need
batch prediction via ML.PREDICT.
Not all BQML model types can be exported. ARIMA_PLUS and MATRIX_FACTORIZATION models cannot be exported to Vertex AI. AutoML models are exported differently (they are already Vertex AI models under the hood).
Hyperparameter Tuning in BQML
BQML supports automatic hyperparameter tuning using the num_trials option.
BigQuery uses a Bayesian optimization strategy to search the hyperparameter space.
-- Hyperparameter tuning with BQML
CREATE OR REPLACE MODEL `my_dataset.tuned_model`
OPTIONS (
model_type = 'BOOSTED_TREE_CLASSIFIER',
input_label_cols = ['species'],
-- Enable HP tuning
num_trials = 20, -- Number of trials to run
max_parallel_trials = 5, -- Parallel trials
hparam_tuning_objectives = ['roc_auc'], -- Metric to optimize
-- Hyperparameter search ranges (auto-selected if omitted)
max_tree_depth = hparam_range(3, 10),
learn_rate = hparam_range(0.01, 0.3),
subsample = hparam_range(0.5, 1.0),
l2_reg = hparam_candidates([0, 0.1, 1.0, 10.0])
) AS
SELECT * FROM `my_dataset.training_data`
-- View tuning results
SELECT *
FROM ML.TRIAL_INFO(
MODEL `my_dataset.tuned_model`
)
ORDER BY roc_auc DESC
hparam_range(min, max) defines a continuous range for search.
hparam_candidates([v1, v2, v3]) defines discrete values to try.
BQML automatically selects the best trial as the final model.
BQML vs AutoML vs Custom Training
Choosing the right ML approach on GCP depends on your team's skills, data complexity, performance requirements, and time constraints.
| Criteria | BigQuery ML | AutoML (Vertex AI) | Custom (TF/PyTorch) |
|---|---|---|---|
| Interface | SQL | GUI + API | Python (code) |
| Skill Required | SQL knowledge | Minimal ML knowledge | Deep ML expertise |
| Data Location | Must be in BigQuery | GCS or BigQuery | Anywhere |
| Model Types | Pre-defined (see table above) | Tabular, image, text, video | Any architecture |
| Training Time | Minutes to hours | Hours (architecture search) | Hours to days |
| Customization | Hyperparameters only | Limited (objectives, budget) | Full control |
| Best For | Quick prototyping, SQL-based teams, tabular data | High accuracy with minimal effort | Complex architectures, research, unique requirements |
| Cost Model | BigQuery compute (on-demand or flat-rate) | Node-hours | Compute Engine / Vertex AI Training |
Data already in BigQuery + SQL team + tabular data? → Start with BQML. Need highest accuracy with minimal tuning? → AutoML. Need custom architecture or unstructured data? → Custom training. BQML prototype worked but need online serving? → Export BQML model to Vertex AI.
Exam Focus
Model Type Selection Cheat Sheet
Predict a number (price, count, duration) → LINEAR_REG or BOOSTED_TREE_REGRESSOR
Predict a category (yes/no, A/B/C) → LOGISTIC_REG or BOOSTED_TREE_CLASSIFIER
Group similar items (customer segments) → KMEANS
Recommend items (products, movies) → MATRIX_FACTORIZATION
Forecast future values (sales, demand) → ARIMA_PLUS
Maximum accuracy, tabular data → AUTOML_CLASSIFIER / AUTOML_REGRESSOR
Use existing TF model in BQ → Import TENSORFLOW model
Common Exam Scenarios
"A retail company has sales data in BigQuery. They want to forecast next quarter's sales
by region. The data shows clear weekly and yearly seasonality." →
ARIMA_PLUS with time_series_id_col='region' and
auto_arima=TRUE
"A data analyst (SQL expert, no Python) needs to predict customer churn (binary outcome) using data already in BigQuery." → LOGISTIC_REG in BQML. If accuracy needs improvement, upgrade to BOOSTED_TREE_CLASSIFIER.
"The team trained a BQML model for fraud detection. It performs well in batch evaluation. Now they need real-time predictions for incoming transactions." → Export model to Cloud Storage, upload to Vertex AI Model Registry, deploy to a Vertex AI Endpoint for online prediction.
"An ML engineer wants to ensure consistent feature transformations between training and
serving in BQML." → Use the TRANSFORM clause in
CREATE MODEL. Transformations defined here are automatically applied during
ML.PREDICT.
Key Exam Takeaways
- BQML trains models using SQL with no data movement from BigQuery
- Know all model types and when to use each (especially ARIMA_PLUS, KMEANS, LOGISTIC_REG)
- TRANSFORM clause prevents training-serving skew by embedding preprocessing in the model
- ML.EVALUATE returns different metrics based on model type (classification vs regression)
- ML.EXPLAIN_PREDICT uses Shapley values for feature attribution
- Export BQML models to Vertex AI for online serving (not all types are exportable)
- Use ML.FORECAST (not ML.PREDICT) for ARIMA_PLUS time series models
- Hyperparameter tuning uses num_trials with Bayesian optimization
- Choose precision for "minimize false positives," recall for "minimize false negatives"
Interview Ready
How to Explain This in 2 Minutes
BigQuery ML lets you create, evaluate, and serve machine learning models using standard SQL directly inside BigQuery—no data export, no Python, no separate infrastructure. You write CREATE MODEL with a model type (linear regression, logistic regression, k-means, ARIMA_PLUS, boosted trees, DNN, or even imported TensorFlow models), point it at a training query, and BigQuery handles distributed training on its serverless compute. Predictions use ML.PREDICT, evaluation uses ML.EVALUATE, and you can explain predictions with ML.EXPLAIN_PREDICT. The key insight is that BQML is ideal when your data already lives in BigQuery, the team is SQL-fluent, and the model complexity fits supported types. For more complex architectures, you export the BQML model to Vertex AI or switch to custom training entirely.
Likely Interview Questions
| Question | What They're Really Asking |
|---|---|
| When would you use BigQuery ML instead of Vertex AI custom training? | Can you identify the sweet spot for SQL-based ML vs. full framework control? |
| Walk me through creating and deploying a forecasting model on GCP. | Do you know ARIMA_PLUS syntax, evaluation metrics, and the path from BQML to a serving endpoint? |
| How does feature engineering work in BigQuery ML? | Do you understand TRANSFORM clause, automatic preprocessing, and manual SQL-based feature construction? |
| What are the limitations of BigQuery ML? | Can you articulate where BQML stops being sufficient—custom loss functions, novel architectures, GPU training, real-time serving? |
| How would you compare BQML, AutoML, and Vertex AI custom training for a tabular classification problem? | Do you understand the trade-off triangle: speed-to-deploy vs. accuracy ceiling vs. engineering effort? |
Model Answers
BQML vs Custom Training: I choose BQML when the data is already in BigQuery (avoiding export costs and latency), the model type is supported (linear/logistic regression, boosted trees, ARIMA_PLUS, k-means, matrix factorization), and the team consists of analysts or data engineers who are SQL-proficient but not ML engineers. BQML delivers a trained, evaluated model in minutes with zero infrastructure. I switch to Vertex AI custom training when I need a transformer architecture, custom loss functions, distributed GPU training, or real-time sub-10ms inference that BQML batch predictions cannot satisfy.
ARIMA_PLUS Forecasting: I create a time-series model with CREATE MODEL ... OPTIONS(model_type='ARIMA_PLUS', time_series_timestamp_col='date', time_series_data_col='revenue'). ARIMA_PLUS automatically handles seasonality detection, holiday effects, and trend decomposition. I evaluate with ML.EVALUATE checking AIC, variance, and seasonal periods. For predictions, ML.FORECAST returns point estimates with confidence intervals. If the forecast needs to serve a dashboard, I materialize predictions into a BigQuery table and connect Looker. If it needs a REST endpoint, I export the model to Vertex AI and deploy to an endpoint.
Feature Engineering in BQML: BQML provides the TRANSFORM clause that embeds preprocessing into the model artifact itself, so the same transformations apply at prediction time automatically. Supported transforms include ML.BUCKETIZE, ML.QUANTILE_BUCKETIZE, ML.FEATURE_CROSS, ML.POLYNOMIAL_EXPAND, and standard SQL functions. For features BQML cannot express, I pre-compute them in a SQL view or scheduled query and feed that into CREATE MODEL. The TRANSFORM clause is critical because it prevents training-serving skew—the model carries its preprocessing logic.
System Design Scenario
Scenario: An online marketplace wants to predict customer churn (binary classification) using transaction history stored in BigQuery (500M rows, 40 features). They also need weekly revenue forecasts per product category. The analytics team knows SQL but not Python. Design the ML architecture.
Approach: For churn prediction, use BQML with model_type='BOOSTED_TREE_CLASSIFIER' since the data is already in BigQuery, the team is SQL-fluent, and boosted trees handle tabular classification well. Use the TRANSFORM clause for feature scaling and bucketing. Evaluate with ML.EVALUATE (precision, recall, AUC) and explain top features with ML.GLOBAL_EXPLAIN. For revenue forecasting, use model_type='ARIMA_PLUS' partitioned by product_category. Schedule weekly retraining with BigQuery Scheduled Queries. Materialize churn scores into a churn_predictions table that the CRM reads via API, and forecast results into a revenue_forecast table connected to a Looker dashboard. If churn model accuracy plateaus below target, export to Vertex AI and try AutoML Tabular for a higher accuracy ceiling with minimal extra effort.
Common Mistakes
- Using BQML for everything — BQML is powerful for supported model types on tabular data, but recommending it for image classification, NLP, or real-time serving shows you do not understand its boundaries. Always state what BQML cannot do.
- Forgetting the TRANSFORM clause — Manually preprocessing features in the training query but not replicating that logic at prediction time causes training-serving skew. The TRANSFORM clause solves this—mention it explicitly.
- Ignoring the export-to-Vertex path — Many candidates treat BQML and Vertex AI as completely separate. Knowing that you can export a BQML model to Vertex AI Model Registry for online serving or further tuning demonstrates full-stack understanding of the GCP ML ecosystem.