BigQuery is by far the most important storage and processing service in Google Cloud from an ML perspective.
It has many integrated ML capabilities and deserves an introduction of its own. Google also provides a great introduction to BigQuery's machine learning capabilities.
What is Google BigQuery?
At its core, BigQuery is a serverless columnar database for vast amounts of tabular data. It can also quickly train basic ML models with SQL.
Dataset
A dataset is a container that groups tables and ML models.
Google recommends storing ML-related structured and semi-structured data in BigQuery. Unstructured data should be kept in Cloud Storage buckets.
BigQuery vs Snowflake
BigQuery is a clear competitor to Snowflake. In BigQuery you can query datasets from other teams and projects, provided you have the required permissions.
Public datasets are also awesome. Just think how much time it would take to scrape Google Trends data manually!
BigQuery vs Spark and Databricks
BigQuery can replace Spark for massively parallel processing of structured data. Compared to Spark, BigQuery is more automated and, to some extent, serverless. Often it is enough to write SQL. And there is no need to wait for a cluster to start, unlike in Databricks!
The advantage of Spark is that you can call regular Python functions for special cases. In practice, I have only used the DataFrame API in PySpark to keep processing robust and performant.
BigQuery pricing
With on-demand pricing, BigQuery charges separately for analysis and storage. Analysis pricing is the more important component, as each query incurs costs. At the moment the cost is $6 per TB queried, while the first 1 TB per month is free.
On-demand queries have a minimum billed size of 10 MB, rounded up to the nearest 1 MB.
Flat-rate pricing is far more expensive and intended for heavy users. Commitments are available per second (flex slots), monthly, and annually.
For example, a matrix factorization model for product recommendations requires flat-rate pricing in BigQuery.
The minimum for flat-rate pricing is 100 slots, which costs $2,400 for a monthly reservation. The same capacity with flex slots costs $5 per hour.
Data pipelines and data lineage with BigQuery
BigQuery is awesome for data processing pipelines.
Create your tables with a statement like this:
CREATE TABLE `project.pipeline_step_2` AS (
SELECT *
FROM `project.pipeline_step_1`
WHERE category='x'
)
And BigQuery will create a visual data lineage automatically! The lineage is visible in the BigQuery console by simply clicking a table in the panel on the left-hand side.
Connecting to BigQuery
| Tool connecting to BigQuery | How to connect |
|---|---|
| GCP Dataflow | BigQuery I/O connector |
| TensorFlow or Keras | tf.data.Dataset |
| Python | BigQuery Python client library |
BigQuery Omni makes it possible to analyze data across cloud platforms (Azure, AWS) within BigQuery console.
Data ingestion and processing in BigQuery
BigQuery can ingest data from other Google Cloud storage products such as Cloud Storage, Cloud Spanner, Bigtable, and Cloud SQL. Data from other cloud providers and public datasets is not a problem either.
Data can be inserted as batch loads, streaming inserts, or SQL statements.
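As a minimal sketch of the SQL route, assuming a hypothetical `mydataset.events` table with these columns:

```sql
-- Plain SQL insert (table and column names are hypothetical).
-- Batch and streaming ingestion use load jobs and the streaming API instead.
INSERT INTO `mydataset.events` (event_id, category, created_at)
VALUES
  (1, 'x', TIMESTAMP '2023-01-01 00:00:00'),
  (2, 'y', TIMESTAMP '2023-01-02 00:00:00');
```

SQL inserts suit small or occasional writes; for large volumes, batch loads are the cheaper option.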
The key steps in BigQuery ML process are:
- Prepare training data in BigQuery
- Train the model in BigQuery
- Use predictions
Apparently BigQuery can even be used as a transactional application database, although it is primarily intended for analytical workloads.
BigQuery ML examples
Here is an example SQL to create an ML model in BigQuery:
--BigQuery specific model creation
CREATE OR REPLACE MODEL `dataset_name.model_name`
OPTIONS (
model_type='logistic_reg',
input_label_cols = ['binary_label']
)
AS
--Select features and label by standard SQL
SELECT feature_1, feature_2, binary_label
FROM `dataset_name.table_name`
After the model has been created it appears under the BigQuery dataset.
The documentation recommends splitting data into training and testing sets by applying the FARM_FINGERPRINT() hash function, for example to a date column.
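A common pattern, sketched here with a hypothetical `date` column, hashes each row deterministically and keeps roughly 80 % for training:

```sql
-- Deterministic ~80/20 split: the same row always lands in the same set.
SELECT *
FROM `dataset_name.table_name`
WHERE ABS(MOD(FARM_FINGERPRINT(CAST(date AS STRING)), 10)) < 8  -- training set
```

Flipping the condition to `>= 8` yields the corresponding test set.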
The models can also be exported in TensorFlow format for online prediction.
Hyperparameters can be optimized with SQL; under the hood this utilizes Vertex AI Vizier. The NUM_TRIALS option is the minimum requirement to run hyperparameter tuning while training a model with SQL.
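As a sketch of the tuning syntax, with the model type, search range, and column names as placeholder assumptions:

```sql
CREATE OR REPLACE MODEL `dataset_name.tuned_model`
OPTIONS (
  model_type = 'boosted_tree_classifier',
  num_trials = 10,                       -- enables Vizier-backed tuning
  max_tree_depth = HPARAM_RANGE(2, 10),  -- search range for one hyperparameter
  input_label_cols = ['binary_label']
)
AS
SELECT feature_1, feature_2, binary_label
FROM `dataset_name.table_name`
```

Each trial trains one candidate model; the results can then be inspected with ML.TRIAL_INFO.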
Feature preparation
On the feature preparation side BigQuery provides functions such as:
| Feature preparation SQL function | Description |
|---|---|
| ML.FEATURE_CROSS | Create a feature cross (combinations of categorical features). |
| ML.BUCKETIZE | Turn a numeric column into categories (bins). |
| ML.MIN_MAX_SCALER | Scale a numeric column between 0 and 1. |
| ML.NGRAMS | Create an array of adjacent items from the source array. |
| ML.POLYNOMIAL_EXPAND | Create combinations of the input array's elements and multiply them. |
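For instance, a sketch using two of these functions in a SELECT (the `age` and `income` columns are hypothetical):

```sql
SELECT
  ML.BUCKETIZE(age, [18, 35, 65]) AS age_bucket,       -- numeric column into bins
  ML.MIN_MAX_SCALER(income) OVER () AS income_scaled   -- scaled to [0, 1]
FROM `dataset_name.table_name`
```

Note that ML.MIN_MAX_SCALER is an analytic function and needs the OVER () clause, since the scaling depends on the minimum and maximum across rows.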
Retrieve information about trained model
Examples of functions / pseudo tables containing information about the trained model:
| Model evaluation SQL function | Description |
|---|---|
| ML.EVALUATE | Precision and recall for classification, mean squared error for regression. |
| ML.FEATURE_INFO | Statistics about the features used in training. |
| ML.EXPLAIN_PREDICT | Row-level feature importance. |
| ML.TRIAL_INFO | Hyperparameter tuning trial info. |
Use ML.PREDICT for model inference.
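Continuing the logistic regression example above, evaluation and inference might look like this (the evaluation and scoring tables are hypothetical):

```sql
-- Evaluation metrics for the trained model
SELECT *
FROM ML.EVALUATE(MODEL `dataset_name.model_name`,
                 (SELECT feature_1, feature_2, binary_label
                  FROM `dataset_name.eval_table`));

-- Batch predictions; the output adds predicted_* columns
SELECT *
FROM ML.PREDICT(MODEL `dataset_name.model_name`,
                (SELECT feature_1, feature_2
                 FROM `dataset_name.new_rows`));
```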
BigQuery SQL for data manipulation
Here are some BigQuery SQL functions that might be useful for data preparation:
| BigQuery SQL function | Description |
|---|---|
| UNNEST | Use in FROM to flatten an array into rows. |
| ROLLUP | Use in GROUP BY to calculate subtotals and grand totals. |
| EXCEPT | Choose all but the listed columns in SELECT. |
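Two short sketches of these in use, with hypothetical table and column names:

```sql
-- Flatten an array column into one row per element
SELECT order_id, item
FROM `dataset_name.orders`,
     UNNEST(items) AS item;

-- Select every column except the label
SELECT * EXCEPT (binary_label)
FROM `dataset_name.table_name`;
```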
Supported ML models in BigQuery
| ML model type | ML model name |
|---|---|
| Regression | Linear regression |
| Regression | Boosted tree |
| Regression | Random forest |
| Regression | Deep Neural Network (DNN) |
| Regression | Wide DNN |
| Regression | AutoML |
| Classification | Logistic regression |
| Classification | Boosted tree |
| Classification | Random forest |
| Classification | Deep Neural Network (DNN) |
| Classification | Wide DNN |
| Classification | AutoML |
| Clustering | K-means |
| Recommendation | Matrix factorization |
| Dimensionality reduction | PCA |
| Dimensionality reduction | Autoencoder |
| Time series | ARIMA |
| Custom import | TensorFlow models |
Note that NLP and image models are not included, as BigQuery is the preferred solution for tabular data.
BigQuery Jobs
Jobs load, export, query, or copy data automatically.
Import Tensorflow model to BigQuery
Create a TensorFlow model and save it to Cloud Storage. Use the CREATE MODEL statement with the MODEL_TYPE='TENSORFLOW' option in BigQuery to import it.
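A minimal sketch, assuming the SavedModel was exported to a hypothetical Cloud Storage path:

```sql
CREATE OR REPLACE MODEL `dataset_name.imported_tf_model`
OPTIONS (
  model_type = 'TENSORFLOW',
  model_path = 'gs://my-bucket/tf_model/*'  -- hypothetical SavedModel location
)
```

After this, the imported model can be used with ML.PREDICT like any native BigQuery ML model.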
Export model from BigQuery
Use the EXPORT MODEL statement to export a model from BigQuery in TensorFlow format.
Metadata for BigQuery
Use Google Cloud Data Catalog to search among a large number of datasets. The INFORMATION_SCHEMA metadata views contain technical details about BigQuery objects.
BigQuery vs Vertex AI
So what is missing from Google BigQuery that should be done in Vertex AI instead?
BigQuery does not offer automatic deployment and serving of models. Customization is also limited to specific model types and parameters.
You can save BigQuery models to the Vertex AI Model Registry.
Geographical functions in BigQuery
BigQuery has an interesting set of geographical SQL functions available.
They are prefixed with ST_. For example, ST_AREA calculates the area in square meters of a given GEOGRAPHY polygon.
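For example, a sketch computing the area of a small polygon defined in well-known text (WKT):

```sql
-- Area in square meters of a polygon given as WKT
SELECT ST_AREA(
  ST_GEOGFROMTEXT('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))')
) AS area_m2
```

Since GEOGRAPHY coordinates are longitude/latitude on a sphere, even this unit square in degrees corresponds to a very large area in square meters.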