BigQuery is by far the most important storage and processing service in Google Cloud from ML perspective.
It has many integrated functionalities for ML. It deserves a separate introduction. Here is a great introduction for BigQuery machine learning capabilities by Google.
What is Google BigQuery?
At its core, BigQuery is a serverless columnar database for vast amounts of tabular data. It is capable of quickly training basic ML models by SQL.
Dataset is a container to group tables and ML models.
Google recommends to store ML related structured and semi-structured data to BigQuery. Unstructured data should be kept in Cloud Storage buckets.
BigQuery vs Snowflake
BigQuery is definitely a competitor with Snowflake. In BigQuery you can query datasets from other teams and projects considering you have required permissions.
Public datasets are also awesome. Just think how much time it would take to scrape Google Trends data manually!
BigQuery vs Spark and Databricks
BigQuery can replace Spark in massive parallel processing of structured data. Compared to Spark, BigQuery is more automated and serverless at some extent. Often it is enough to write SQL. And no need to wait cluster to start unlike in Databricks!
The advantage of Spark is that you can call regular Python functions for special cases. In practice I have only used the
DataFrame API in
PySpark to make processing robust and performant.
BigQuery is priced by analysis and storage in the on-demand pricing. Analysis pricing is the more important component as each query incurs costs. At the moment cost is 6 $ per queried TB while the first 1 TB is free.
On-demand pricing minimum query cost is for 10 MB with rounding to nearest 1 MB.
Flat rate pricing is far more expensive intended for heavy users. Subscriptions exist for per-second (flex), monthly and annually.
For example matrix factorization ML model for product recommendations require flat rate pricing in BigQuery.
100 slots is minimum for flat rate pricing. This would be 2400 $ for monthly reservation. The same capacity with flex slots would be 5 $ per hour.
Data pipelines and data lineage with BigQuery
BigQuery is awesome for data processing pipelines.
Create your tables by statement like this:
CREATE TABLE `project.pipeline_step_2` AS ( SELECT * FROM `project.pipeline_step_1` WHERE category='x' )
And BigQuery will create visual data lineage automatically! The lineage is visible in BigQuery console by simply clicking the table from the panel on the left hand side.
Connecting to BigQuery
|Tool connecting to BigQuery||How to connect|
|GCP Dataflow||BigQuery I/O connector|
|Tensorflow or Keras|
|Python||BigQuery Python client library|
BigQuery Omni makes it possible to analyze data across cloud platforms (Azure, AWS) within BigQuery console.
Data ingestion and processing in BigQuery
BigQuery can ingest data from other Google Cloud storage products such as Cloud Storage, Cloud Spanner, Bigtable and Cloud SQL. Also other cloud providers and public datasets are not a problem.
Data can be inserted as batch, stream or SQL inserts.
The key steps in BigQuery ML process are:
- Prepare training data in BigQuery
- Train the model in BigQuery
- Use predictions
Apparently BigQuery can be used as a transactional application database, even though it is primarily for analytical usage.
BigQuery ML examples
Here is an example SQL to create an ML model in BigQuery:
--BigQuery specific model creation CREATE OR REPLACE MODEL `dataset_name.model_name` OPTIONS ( model_type='logistic_reg', labels = ['binary_label'] ) AS --Select features and label by standard SQL SELECT feature_1, feature_2, binary_label FROM `dataset_name.table_name`
After the model has been created it appears under the BigQuery dataset.
Documentation recommends to split data for training and testing sets by applying a FARM_FINGERPRINT() hash function for example to a date column.
The models can also be exported in Tensorflow format for online prediction.
Hyper parameters can be optimized by SQL but it utilizes the Vertex AI Vizier. The
NUM_TRIALS option is the minimum requirement to run the hyperparameter tuning while training the model by SQL.
On the feature preparation side BigQuery provides functions such as:
|Feature preparation SQL function||Description|
|Create a feature cross (dot product of two arrays).|
|Make numeric column to category.|
|Scale between 0 and 1.|
|Create an array of adjacent items from the source array.|
|Create combinations from array and multiple the items.|
Retrieve information about trained model
Examples of functions / pseudo tables containing information about the trained model:
|Model evaluation SQL function||Description|
|Precision and recall for classification, MSE for regression.|
|Aggregated feature importances.|
|Row level feature importance.|
ML.PREDICT for model inference.
BigQuery SQL for data manipulation
Here are some BigQuery functions that might be useful for data preparation
|BigQuery SQL function||Description|
|UNNEST||Use in |
|ROLLUP||Use in |
|EXCEPT||Choose all but one column in |
Supported ML models in BigQuery
|ML model type||ML model name|
|Regression||Deep Neural Network (DNN)|
|Classification||Deep Neural Network (DNN)|
|Custom import||Tensorflow models|
Not that NLP or image models are not included as BigQuery is preferred solution for tabular data.
Load, export, query data or copy data automatically.
Import Tensorflow model to BigQuery
Create a Tensorflow model and save to Cloud Storage. Use
CREATE MODEL statement with
MODEL_TYPE='TENSORFLOW argument in BigQuery to import it.
Export model from BigQuery
EXPORT MODEL statement to export as Tensorflow model from BigQuery.
Metadata for BigQuery
Use Google Cloud Data Catalog to search among large number of datasets.
INFORMATION_SCHEMA metadata tables are technical details about BigQuery objects.
BigQuery vs Vertex AI
So what is missing from Google BigQuery that should be done in Vertex AI ?
BigQuery does not have automatic deployment and serving of the models. Also customization is limited to specific models and parameters.
You can save BigQuery models to Vertex AI model registry.
Geographical functions in BigQuery
BigQuery has an interesting set of geographical SQL functions available.
They are usually prefixed by
ST_. For example,
ST_AREA calculates square meters in given