Some Google materials refer to it as "fully managed Tensorflow".

Vertex AI assumes that data is prepared elsewhere before training the model. The reason is, not surprisingly, related to Tensorflow: best practice is to avoid arbitrary data wrangling after the dataset has been read into the framework.

Vertex AI component overview

This is why managed datasets and the Feature Store are important components in the Vertex AI process.

Vertex AI component | Description
Datasets | Create a managed dataset from local files, Cloud Storage or BigQuery.
Feature Store | Pre-processed features available for all models.
Labelling tasks | Paid humans do the data labeling for you.
Workbench | Run Managed or User-Managed Jupyter notebooks. Schedule executions.
Pipelines | Run a configurable set of scripts in a specific order.
Training | Train an AutoML model or a custom training job.
Experiments | Log performance for different model versions. Vizier hyperparameter tuning.
Model registry | Store model versions.
Endpoints | Create a prediction API from a model and set up monitoring.
Batch prediction | Save a larger set of predictions to a Storage bucket.
Metadata | Track and analyze metadata of a machine learning process.
Matching engine | A vector database.

Vertex AI Datasets

Creating a managed dataset in Vertex AI is simple. The source should be either a Cloud Storage bucket or a BigQuery table.

Datasets are straightforward to use in Vertex AI models. The documentation does not state in detail how they can be consumed elsewhere, for example directly from Python code.

Vertex AI Datasets can be exported as JSON Lines files to Google Cloud Storage. Those files can then be shared with others via signed URLs if needed.
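
Creating a dataset can also be scripted. Here is a minimal sketch with the Vertex AI Python SDK; the project, bucket path and display name below are made-up placeholders:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Create a managed tabular dataset from a CSV file in Cloud Storage
dataset = aiplatform.TabularDataset.create(
    display_name="my-tabular-dataset",
    gcs_source=["gs://my-bucket/data/train.csv"],
)

# Alternatively, point to a BigQuery table:
# dataset = aiplatform.TabularDataset.create(
#     display_name="my-tabular-dataset",
#     bq_source="bq://my-project.my_dataset.my_table",
# )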

Vertex AI Feature store

Feature Store standardizes the way ML models access data. It has this hierarchy:

  1. Featurestore
  2. EntityType (e.g. Users and Movies)
  3. Feature (e.g. user age and movie rating)

Feature store entity types

A featurestore is the top-level container for a set of features. The ingested feature data format has a few additional requirements. The data must have an entity ID column that identifies the entity (row). The entity type groups similar features together. A timestamp is also required to indicate the creation time of each feature value.

An entity type can also be a combination, such as ProductUser.
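
As a minimal sketch, the same hierarchy could be created with the Vertex AI Python SDK roughly like this; the IDs, node count and value types are placeholder assumptions:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Featurestore -> EntityType -> Feature
fs = aiplatform.Featurestore.create(
    featurestore_id="movie_prediction",
    online_store_fixed_node_count=1,
)

users = fs.create_entity_type(entity_type_id="users", description="User entities")
users.create_feature(feature_id="age", value_type="INT64")
users.create_feature(feature_id="country", value_type="STRING")

movies = fs.create_entity_type(entity_type_id="movies", description="Movie entities")
movies.create_feature(feature_id="average_rating", value_type="DOUBLE")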

Feature store monitoring

Monitoring can be configured at the entity type or feature level.

Feature Store can profile the content of the data and monitor drift over time.

Feature store data ingestion

Features can be ingested in batches or by streaming:

Featurestore source | Batch or stream | Notes
Featurestore HTTP API | Stream | The records are sent to the endpoint.
BigQuery table | Batch |
Cloud Storage, Avro | Batch |
Cloud Storage, CSV | Batch | No arrays
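
Continuing the sketch above, a batch ingestion from BigQuery could look roughly like this with the SDK; the table and column names are assumptions:

# Ingest feature values for the "users" entity type from a BigQuery table.
# The source table must contain the entity ID column and a feature timestamp column.
users.ingest_from_bq(
    feature_ids=["age", "country"],
    feature_time="update_time",                      # timestamp column in the source table
    bq_source_uri="bq://my-project.my_dataset.users",
    entity_id_field="user_id",                       # entity ID column in the source table
)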

Feature store serving

All feature data is saved to offline storage. Only the most recent versions of the records are available for online serving.

Feature store data is not read directly by libraries like Tensorflow or Pandas. Online serving works by HTTP request. Batch output can be exported to these formats:

  • BigQuery Table
  • CSV (no arrays)
  • TFRecord

Some Google materials suggest Memorystore for solutions with strict latency requirements. Memorystore is basically a managed Redis or Memcached.
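
Continuing the same sketch, reading the latest online values for a couple of entities with the SDK could look like this; the entity IDs are placeholders:

# Online serving: returns the latest feature values as a Pandas DataFrame
df = users.read(
    entity_ids=["user_123", "user_456"],
    feature_ids=["age", "country"],
)
print(df)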

Feature store access

Feature store is likely consumed by multiple teams. Prefer IAM policies to control access.

Labelling tasks

Request humans to label your training data. You need to send:

  • The dataset to label
  • List of possible labels
  • Instructions as PDF
    • Specific
    • Max 20 minutes to read

100 distinct labels is the maximum, but 20 is the recommended cap to maintain labeler efficiency. Use descriptive label names such as cat and dog instead of label1 and label2.

Do not use overlapping categories. It is possible, though, to include the labels both and none.

Vertex AI Workbench

Notebooks in Vertex AI Workbench can be conveniently opened in a browser tab without significant configuration. Git is supported and, for example, Docker is pre-installed.

Workbench has two ways of creating Jupyter notebook environments. Here is the difference between Managed notebooks and User-Managed notebooks.

Managed notebooks are designed to run notebooks as part of production pipelines. Google Cloud Storage and BigQuery integrations are readily available. The notebooks can be scheduled. Instances shut down automatically after a specified idle time. Managed notebooks come with popular frameworks like Tensorflow and PyTorch out of the box. This option has a higher hourly price than User-Managed notebooks.

User-Managed notebooks enable full configuration for experimental work. You need to choose an environment that has the required frameworks installed. The command line can be used as a sudo user. Google recommends one User-Managed instance per person during development as a virtual workspace.

This convenience syntax can be used in notebooks to read BigQuery data to a Pandas DataFrame in Python:

%%bigquery df
SELECT *
FROM table_name

The notebook instances have the What-If Tool (WIT) installed. Nowadays WIT is part of Tensorboard. WIT adds functionality to explore deployed models, whereas Tensorboard typically monitors the training process.

Language Interpretability Tool (LIT) is another tool that visualizes NLP model predictions.

Vertex AI Pipelines

Pipelines introduces orchestration of the machine learning process from modular components. A pipeline run uses a GKE instance in the background. Apparently Vertex AI Pipelines were previously known as Kubeflow Pipelines.

Pipelines can be grouped and versioned in the UI. They are able to store metadata as artifacts.

Custom pipelines can be built using either Kubeflow or Tensorflow Extended (TFX). Readily available TFX components can be used, e.g. Vertex AI jobs.

Vertex AI provides Tensorboard to monitor training metrics in Pipelines.

Thanks to lineage tracking of the pipeline artifacts, metadata about the datasets and models is saved for each run.

A component in a pipeline is a container image. It simply takes an input and produces an output.

An example Vertex AI pipeline:

  1. Read data
  2. Pre-process data
  3. Train the model
  4. Predict
  5. Output
    • Confusion matrix
    • ROC

Code example of a KFP pipeline:

from kfp import dsl
from kfp.v2 import compiler
from kfp.v2.dsl import component

# Root path for pipeline artifacts (placeholder, use your own bucket)
PIPELINE_ROOT = "gs://your-bucket/pipeline_root"

@component(base_image="python:3.9", output_component_file="first-component.yaml")
def step_1(text: str) -> str:
    return text

@dsl.pipeline(
    name="hello-world",
    description="An intro pipeline",
    pipeline_root=PIPELINE_ROOT,
)
def my_pipeline(text: str = "My input text"):
    # Component tasks are called with keyword arguments
    product_task = step_1(text=text)

compiler.Compiler().compile(
    pipeline_func=my_pipeline, package_path="pipeline_job.json"
)

And then running the pipeline:

# api_client is e.g. a kfp.v2.google.client.AIPlatformClient created with your project and region
response = api_client.create_run_from_job_spec(
    job_spec_path="pipeline_job.json",
)

The interesting thing seems to be that the Kubeflow Pipelines JSON spec contains the needed Python code (but not the Python dependencies).

Vertex AI Hyperparameter tuning

The documentation is a bit confusing. Apparently there are two alternatives for hyperparameter tuning:

  • Vizier
  • Vertex AI hyperparameter tuning

Vizier

Vizier studies are found in the Experiments section of Vertex AI.

Vizier is a black-box hyperparameter tuning service inside Vertex AI. Black box means that the system does not have a known objective function, or that it is too costly to evaluate.

According to Google materials this holds true:

Black box optimization algorithms find the best operating parameters for any system whose performance can be measured as a function of adjustable parameters.

Types of supported hyperparameter optimization methods:

  • Grid search
  • Random search
  • Bayesian optimization

Vizier does not optimize cost or tuning time. It runs optimization sequentially.

Vertex AI hyperparameter optimization

Basically it is a regular training job in Vertex AI, but it finds the optimal hyperparameters at the same time. Hyperparameter tuning can be chosen only for custom training. It is not available in AutoML.

The code should only report the results for the given parameters. The hyperparameter tuning job takes care of running the model multiple times to find the optimal set of hyperparameters.

Here is how the process works:

  1. Create a Docker container in a Vertex AI notebook
    1. Use deeplearning-platform-release as the base image
    2. Install the cloudml-hypertune library
    3. The Docker entry file must be trainer.task
  2. The main .py file must read the hyperparameters as command line arguments
  3. The main .py file must report the hyperparameters and the model performance metric using the cloudml-hypertune library

With this setup, Vizier is able to run the container with different sets of hyperparameters and see which performs best. You need to manually define the min and max limits for each hyperparameter on the Vizier side.
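
The reporting side could be sketched like this with the cloudml-hypertune library; the argument name and metric value are illustrative only:

import argparse
import hypertune

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.01)
args = parser.parse_args()

# ... train the model with args.learning_rate and evaluate it ...
accuracy = 0.93  # placeholder for the real evaluation result

# Report the metric back to the hyperparameter tuning service
hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag="accuracy",
    metric_value=accuracy,
    global_step=1,
)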

Important parameters for hyperparameter tuning:

Hyperparameter tuning parameter | Impact
maxTrialCount | Number of trials before stopping. Fewer trials should be faster but can be non-optimal.
parallelTrialCount | More is faster but can reduce effectiveness.
enableTrialEarlyStopping | Stop a trial when it seems to become unpromising.
resumePreviousJobId | Use information from previous tuning jobs.
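
These parameters map to the tuning job configuration. A sketch with the Vertex AI SDK, where the worker pool spec and parameter ranges are assumptions, could look like this:

from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

custom_job = aiplatform.CustomJob(
    display_name="trainer",
    worker_pool_specs=worker_pool_specs,  # spec pointing at your training container
)

tuning_job = aiplatform.HyperparameterTuningJob(
    display_name="hp-tuning",
    custom_job=custom_job,
    metric_spec={"accuracy": "maximize"},
    parameter_spec={
        "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
        "batch_size": hpt.DiscreteParameterSpec(values=[32, 64, 128], scale=None),
    },
    max_trial_count=20,       # maxTrialCount
    parallel_trial_count=3,   # parallelTrialCount
)
tuning_job.run()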

Vertex AI Experiments

Find the best model for a specific problem. Experiment with different input datasets, model architectures, hyperparameters and training environments.

A managed Tensorboard is available. See the Experiments documentation for more details.
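
A sketch of logging experiment runs with the SDK; the experiment name, parameters and metrics are made up:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1", experiment="churn-models")

aiplatform.start_run("run-001")
aiplatform.log_params({"learning_rate": 0.01, "batch_size": 64})
# ... train and evaluate the model ...
aiplatform.log_metrics({"accuracy": 0.92, "auc": 0.88})
aiplatform.end_run()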

Vertex AI Model registry

Find all trained models and their versions.

Explore the model metrics or compare multiple models with each other. You can run an evaluation that generates statistics for a selected test set and a batch prediction.

Deploy a model endpoint or run a batch prediction.
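
With the SDK, a registered model can be loaded by its resource name and deployed to an endpoint, roughly like this; the resource name and machine type are placeholders:

from google.cloud import aiplatform

model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")

# Deploy the model version to a new endpoint for online predictions
endpoint = model.deploy(machine_type="n1-standard-4")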

Vertex AI Training (custom)

Use custom training for migrating on-prem models or when BigQuery and AutoML do not solve the case.

Preparing a training job

The training workflow goes as follows:

  1. Create the training functionality in a notebook
    • Read data
    • Pre-processing (only Tensorflow)
    • Train
  2. Generate the entrypoint file task.py (see the sketch after this list)
    • Model training parameters as input
    • Save the model in the end
    • Use %%writefile or nbconvert
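
A minimal sketch of what the task.py entrypoint could look like; the argument names and the saving logic are placeholders:

import argparse
import os

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args()

    # ... build and train the model using args.learning_rate and args.epochs ...

    # Vertex AI sets AIP_MODEL_DIR for custom training jobs; save the model there
    model_dir = os.environ.get("AIP_MODEL_DIR", "local_model_output")
    # model.save(model_dir)

if __name__ == "__main__":
    main()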

Package the code into a Docker container locally, in a Workbench notebook, or using Cloud Build. Use pre-built Docker images if the needed frameworks are supported.

Another approach is to create a full Python package to which the Vertex AI training job can point.

My understanding is that once a model is trained, it is saved to a Storage bucket and registered in the Models section of Vertex AI.
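
Launching the training job against that entrypoint could be sketched like this with the SDK; the container image URIs and arguments are assumptions:

from google.cloud import aiplatform

job = aiplatform.CustomTrainingJob(
    display_name="my-training-job",
    script_path="task.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-8:latest",
    requirements=["cloudml-hypertune"],
    model_serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest",
)

model = job.run(
    model_display_name="my-model",
    args=["--epochs=10", "--learning_rate=0.01"],
    replica_count=1,
)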

Running a training job

Training jobs can be run on distributed clusters managed by Vertex AI, configured by the environment variables CLUSTER_SPEC (other frameworks) and TF_CONFIG (Tensorflow).
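
On each replica the training code can inspect its role from that variable, roughly like this; the JSON shape follows the standard Tensorflow convention and the hostnames are placeholders:

import json
import os

# Vertex AI sets TF_CONFIG on each replica of a distributed Tensorflow job, e.g.
# {"cluster": {"chief": ["host-0:2222"], "worker": ["host-1:2222", "host-2:2222"]},
#  "task": {"type": "worker", "index": 0}}
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
task_type = tf_config.get("task", {}).get("type", "chief")
task_index = tf_config.get("task", {}).get("index", 0)
print(task_type, task_index)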

Setting the scale-tier=BASIC_TPU variable would set the training job to run on TPU processors.

Typical job states (can be asked in the certification exam):

  • JOB_STATE_QUEUED
  • JOB_STATE_RUNNING
  • JOB_STATE_SUCCEEDED

When running multiple workers in parallel, each of them starts running whenever it becomes available.

Save Tensorflow checkpoints (or their PyTorch equivalents) and model artifacts to Google Cloud Storage.

Explainable AI in Vertex AI

Use the Vertex AI SDK to understand model behavior via Explainable AI.

Explaining models by examples is a manual approach of reviewing some training samples. A k-nearest neighbors algorithm is used to identify the most similar observations. Tree models are not supported.

Feature-based explanations are a more traditional way to show the relative importance of each feature.

Vertex AI AutoML

AutoML is part of Vertex AI. Just create a new training job and you find the AutoML options.

AutoML is codeless and serverless. You define the source data from BigQuery or a CSV file, and everything else is clicking menus in the browser:

  • Source dataset
  • Target variable
  • Type of prediction task (regression, classification…)
  • Performance metric

As an end result, AutoML finds the best performing model for you.
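
The same flow can also be scripted instead of clicked. Here is a sketch with the SDK for a tabular classification task; the dataset resource name, target column and budget are made up:

from google.cloud import aiplatform

dataset = aiplatform.TabularDataset("projects/my-project/locations/us-central1/datasets/123")

job = aiplatform.AutoMLTabularTrainingJob(
    display_name="automl-churn",
    optimization_prediction_type="classification",
)

model = job.run(
    dataset=dataset,
    target_column="churned",
    budget_milli_node_hours=1000,  # one node hour
)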

Here are the different types of tasks AutoML can perform.

Type of data | Tasks
AutoML Tables | Classification, regression and forecasting
AutoML Natural Language | Single or multi-label classification, sentiment, entity extraction
AutoML Vision | Single or multi-label classification, object detection, image segmentation
AutoML Video Intelligence | Action recognition, classification, object tracking

After testing AutoML a few times, the execution times felt surprisingly long.

AutoML is suitable for use cases that can tolerate an inference latency above 100 ms.

Vertex AI Batch predictions

Here are some requirements for batch prediction data sources:

  • BigQuery source table max 100 GB
  • BigQuery source table must use multi region
  • CSV must have headers with alphanumeric column names (or underscore)
  • CSV delimiter must be comma
  • CSV max size for single file 10 GB, total 100 GB
  • 1k-1B rows
  • 2-1000 columns
  • 2-500 distinct labels for classification
  • Additional permissions required if Vertex AI and data sources are in different projects
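
The batch prediction itself can be launched from a registered model, for example roughly like this; the model resource name and table names are placeholders:

from google.cloud import aiplatform

model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")

batch_job = model.batch_predict(
    job_display_name="batch-scoring",
    bigquery_source="bq://my-project.my_dataset.input_table",
    bigquery_destination_prefix="bq://my-project.my_dataset",
    machine_type="n1-standard-4",
)
batch_job.wait()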

Batch prediction monitoring has these configuration options, among others:

  • How often monitoring metrics are evaluated
  • Alert email
  • Sampling rate in percentage

Vertex AI Endpoints

Create an online prediction API using Tensorflow Serving under the hood. Vertex AI private endpoints are useful for low latency peer-to-peer requests.

The recommended approach is to deploy a new model to the existing endpoint. Only a fraction of the traffic should be allocated to the new model at first. This way, the new model can be monitored before full replacement. Each new deployment creates new compute resources per model.
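
A sketch of this with the SDK, sending 10% of the traffic to the new model and then calling the endpoint; the resource names and the instance payload are made up:

from google.cloud import aiplatform

endpoint = aiplatform.Endpoint("projects/my-project/locations/us-central1/endpoints/987654321")
new_model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")

# Deploy the new model next to the existing one with a small traffic share
endpoint.deploy(
    model=new_model,
    traffic_percentage=10,
    machine_type="n1-standard-4",
)

# Online prediction request
prediction = endpoint.predict(instances=[{"feature_a": 1.0, "feature_b": "x"}])
print(prediction.predictions)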

Monitoring can be enabled for skew and drift when setting up predictions. Behind the scenes it uses Tensorflow Data Validation.

Online prediction logging has three options:

Online prediction logging | Logged information
Container logging | Stdout and stderr to Cloud Logging
Access logging | E.g. timestamp and latency for each request
Request-response logging | Request and response logged to a BigQuery table

Redeployment is required for logging to take effect.

Vertex AI Metadata

Track and analyze metadata about a machine learning process.

Answers to questions such as:

  • What was the model training data
  • What were the model hyperparameters
  • Info related to failed models

Based on the ML Metadata (MLMD) library from Tensorflow Extended (TFX).

Matching engine

In traditional databases, records are searched by matching exact criteria, for example all cars where the color is red.

Vector databases, as the name suggests, store records in vector format. This helps to find records that are similar to a given record, which is a requirement when comparing text snippets, images and other complex entities, maybe even cars or cities.

I am sure that vector databases will be a big thing in the immediate future.