BigQuery is by far the most important storage and processing service in Google Cloud from an ML perspective.

It has many integrated ML functionalities and deserves a separate introduction. Google also provides a great introduction to BigQuery's machine learning capabilities.

What is Google BigQuery?

At its core, BigQuery is a serverless columnar database for vast amounts of tabular data. It is capable of quickly training basic ML models with SQL.

A dataset is a container that groups tables and ML models.

Google recommends storing ML-related structured and semi-structured data in BigQuery. Unstructured data should be kept in Cloud Storage buckets.

BigQuery vs Snowflake

BigQuery is a direct competitor to Snowflake. In BigQuery you can query datasets from other teams and projects, provided you have the required permissions.

Public datasets are also awesome. Just think how much time it would take to scrape Google Trends data manually!

BigQuery vs Spark and Databricks

BigQuery can replace Spark in massively parallel processing of structured data. Compared to Spark, BigQuery is more automated and, to some extent, serverless. Often it is enough to write SQL. And there is no need to wait for a cluster to start, unlike in Databricks!

The advantage of Spark is that you can call regular Python functions for special cases. In practice I have only used the DataFrame API in PySpark to make processing robust and performant.

BigQuery pricing

With on-demand pricing, BigQuery charges separately for analysis and storage. Analysis pricing is the more important component, as each query incurs costs. At the moment the cost is 6 $ per queried TB, while the first 1 TB per month is free.

With on-demand pricing, each query is billed for a minimum of 10 MB, and the amount of data processed is rounded up to the nearest 1 MB.

Flat-rate pricing is far more expensive and intended for heavy users. Subscriptions exist per second (flex slots), monthly, and annually.

For example, the matrix factorization ML model for product recommendations requires flat-rate pricing in BigQuery.

100 slots is the minimum for flat-rate pricing. This would be 2,400 $ for a monthly reservation. The same capacity with flex slots would be 5 $ per hour, which adds up to roughly 5 $ × 720 h = 3,600 $ if kept running for a whole month, so the monthly reservation is cheaper for constant use.

Data pipelines and data lineage with BigQuery

BigQuery is awesome for data processing pipelines.

Create your tables with a statement like this:

--Each pipeline step materializes a new table from the previous one
CREATE TABLE `project.dataset.pipeline_step_2` AS (
	SELECT *
	FROM `project.dataset.pipeline_step_1`
	WHERE category = 'x'
)

And BigQuery will create a visual data lineage automatically! The lineage is visible in the BigQuery console by simply clicking the table in the panel on the left-hand side.

Connecting to BigQuery

| Tool connecting to BigQuery | How to connect |
| --- | --- |
| GCP Dataflow | BigQuery I/O connector |
| TensorFlow or Keras | tf.data.Dataset |
| Python | BigQuery Python client library |

BigQuery Omni makes it possible to analyze data across cloud platforms (Azure, AWS) within BigQuery console.

Data ingestion and processing in BigQuery

BigQuery can ingest data from other Google Cloud storage products such as Cloud Storage, Cloud Spanner, Bigtable and Cloud SQL. Data from other cloud providers and public datasets can be ingested as well.

Data can be inserted in batches, as a stream, or with SQL INSERT statements.

The key steps in BigQuery ML process are:

  1. Prepare training data in BigQuery
  2. Train the model in BigQuery
  3. Use predictions

Apparently BigQuery can be used as a transactional application database, even though it is primarily for analytical usage.

BigQuery ML examples

Here is an example SQL to create an ML model in BigQuery:

--BigQuery specific model creation
CREATE OR REPLACE MODEL `dataset_name.model_name`
OPTIONS (
	model_type = 'logistic_reg',
	input_label_cols = ['binary_label']
)
AS
--Select features and the label with standard SQL
SELECT feature_1, feature_2, binary_label
FROM `dataset_name.table_name`

After the model has been created it appears under the BigQuery dataset.

The documentation recommends splitting the data into training and test sets by applying the FARM_FINGERPRINT() hash function to, for example, a date column.
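A minimal sketch of such a hash-based split (the table and column names are hypothetical): hashing the date column assigns each row deterministically to one of ten buckets, and eight of them form the training set.

```sql
--Deterministic ~80/20 train split via FARM_FINGERPRINT
SELECT *
FROM `dataset_name.table_name`
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(date_column AS STRING))), 10) < 8
```

Because the hash is deterministic, the same rows always land in the same split, which keeps training and evaluation reproducible.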

The models can also be exported in TensorFlow format for online prediction.

Hyperparameters can be optimized with SQL, but under the hood this utilizes Vertex AI Vizier. The NUM_TRIALS option is the minimum requirement to run hyperparameter tuning while training a model with SQL.
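As a sketch, assuming a boosted tree classifier and hypothetical table and column names, tuning can be enabled by adding NUM_TRIALS together with a search range for a hyperparameter:

```sql
--Hyperparameter tuning: 10 trials over a learn_rate range
CREATE OR REPLACE MODEL `dataset_name.tuned_model`
OPTIONS (
	model_type = 'boosted_tree_classifier',
	input_label_cols = ['binary_label'],
	num_trials = 10,
	learn_rate = HPARAM_RANGE(0.01, 0.5)
)
AS
SELECT feature_1, feature_2, binary_label
FROM `dataset_name.table_name`
```

Afterwards ML.TRIAL_INFO shows the results of the individual trials.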

Feature preparation

On the feature preparation side BigQuery provides functions such as:

| Feature preparation SQL function | Description |
| --- | --- |
| ML.FEATURE_CROSS | Create a feature cross (combinations of the given categorical features). |
| ML.BUCKETIZE | Convert a numeric column into categorical buckets. |
| ML.MIN_MAX_SCALER | Scale a numeric column between 0 and 1. |
| ML.NGRAMS | Create an array of adjacent items (n-grams) from the source array. |
| ML.POLYNOMIAL_EXPAND | Create combinations from an array and multiply the items. |
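These functions are typically used inside a TRANSFORM clause, so that the same preprocessing is applied automatically at prediction time. A minimal sketch with hypothetical names:

```sql
--Preprocessing baked into the model via TRANSFORM
CREATE OR REPLACE MODEL `dataset_name.model_with_transform`
TRANSFORM (
	ML.MIN_MAX_SCALER(feature_1) OVER () AS feature_1_scaled,
	ML.BUCKETIZE(feature_2, [10, 20, 30]) AS feature_2_bucket,
	binary_label
)
OPTIONS (
	model_type = 'logistic_reg',
	input_label_cols = ['binary_label']
)
AS
SELECT feature_1, feature_2, binary_label
FROM `dataset_name.table_name`
```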

Retrieve information about trained model

Examples of functions / pseudo tables containing information about the trained model:

| Model evaluation SQL function | Description |
| --- | --- |
| ML.EVALUATE | Precision and recall for classification, MSE for regression. |
| ML.FEATURE_INFO | Statistics about the input features (min, max, mean, etc.). |
| ML.EXPLAIN_PREDICT | Row-level feature attributions for each prediction. |
| ML.TRIAL_INFO | Information about hyperparameter tuning trials. |

Use ML.PREDICT for model inference.
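Inference is an ordinary SELECT over the ML.PREDICT table function; the inner query supplies the feature columns (the table and column names below are hypothetical):

```sql
--Batch prediction with a trained BigQuery ML model
SELECT *
FROM ML.PREDICT(
	MODEL `dataset_name.model_name`,
	(SELECT feature_1, feature_2 FROM `dataset_name.new_rows`)
)
```

The result contains the input columns plus prediction columns such as the predicted label and class probabilities.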

BigQuery SQL for data manipulation

Here are some BigQuery functions that might be useful for data preparation:

| BigQuery SQL function | Description |
| --- | --- |
| UNNEST | Use in FROM to flatten an array into rows. |
| ROLLUP | Use in GROUP BY to calculate e.g. subtotals and grand totals. |
| EXCEPT | Choose all but the listed columns in SELECT. |
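Two quick sketches of the array and column helpers, assuming a hypothetical orders table with an array column:

```sql
--UNNEST: one output row per array item
SELECT order_id, item
FROM `dataset_name.orders`, UNNEST(items) AS item;

--EXCEPT: all columns except the listed one
SELECT * EXCEPT (internal_id)
FROM `dataset_name.orders`;
```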

Supported ML models in BigQuery

| ML model type | ML model name |
| --- | --- |
| Regression | Linear regression |
| Regression | Boosted tree |
| Regression | Random forest |
| Regression | Deep Neural Network (DNN) |
| Regression | Wide DNN |
| Regression | AutoML |
| Classification | Logistic regression |
| Classification | Boosted tree |
| Classification | Random forest |
| Classification | Deep Neural Network (DNN) |
| Classification | Wide DNN |
| Classification | AutoML |
| Clustering | K-means |
| Recommendation | Matrix factorization |
| Dimensionality reduction | PCA |
| Dimensionality reduction | Autoencoder |
| Time series | ARIMA |
| Custom import | TensorFlow models |

Note that NLP and image models are not included, as BigQuery is the preferred solution for tabular data.

BigQuery Jobs

Jobs are actions that BigQuery runs on your behalf to load, export, query, or copy data.

Import Tensorflow model to BigQuery

Create a TensorFlow model and save it to Cloud Storage. Use the CREATE MODEL statement with the model_type='TENSORFLOW' option and a model_path pointing to the saved model to import it into BigQuery.
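A minimal sketch of the import statement (the dataset name and bucket path are placeholders):

```sql
--Import a TensorFlow SavedModel from Cloud Storage
CREATE OR REPLACE MODEL `dataset_name.imported_model`
OPTIONS (
	model_type = 'TENSORFLOW',
	model_path = 'gs://bucket_name/model_dir/*'
)
```

The imported model can then be used with ML.PREDICT like any native BigQuery ML model.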

Export model from BigQuery

Use the EXPORT MODEL statement to export a model from BigQuery in TensorFlow format.
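The export is a one-liner pointing at a Cloud Storage destination (the names below are placeholders):

```sql
--Export a trained model to Cloud Storage in TensorFlow SavedModel format
EXPORT MODEL `dataset_name.model_name`
OPTIONS (URI = 'gs://bucket_name/exported_model/')
```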

Metadata for BigQuery

Use Google Cloud Data Catalog to search among a large number of datasets.

The INFORMATION_SCHEMA metadata views contain technical details about BigQuery objects.

BigQuery vs Vertex AI

So what is missing from Google BigQuery that should be done in Vertex AI?

BigQuery does not offer automatic deployment and serving of models. Customization is also limited to specific model types and parameters.

You can save BigQuery models to Vertex AI model registry.

Geographical functions in BigQuery

BigQuery has an interesting set of geographical SQL functions available.

They are usually prefixed with ST_. For example, ST_AREA calculates the area in square meters of a given GEOGRAPHY polygon.
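A small sketch: construct a polygon from WKT text and measure its area.

```sql
--Area of a 1x1 degree polygon at the equator, in square meters
SELECT ST_AREA(
	ST_GEOGFROMTEXT('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))')
) AS area_m2
```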