Some Google materials refer to it as "fully managed Tensorflow".
Vertex AI assumes that data is prepared elsewhere before training the model. The reason is, not surprisingly, related to Tensorflow: you should follow Tensorflow best practices and not perform arbitrary data wrangling after reading the dataset into the framework.
Vertex AI component overview
This is why managed datasets and the Feature Store are important components in the Vertex AI process.
Vertex AI component | Description |
---|---|
Datasets | Create a managed dataset from local files, Cloud Storage or BigQuery. |
Feature Store | Pre-processed features available for all models. |
Labelling tasks | Paid humans do the data labeling for you. |
Workbench | Run Managed or User-Managed Jupyter notebooks. Schedule executions. |
Pipelines | Run a configurable set of scripts in specific order. |
Training | Train an AutoML model or a custom training job. |
Experiments | Log performance for different model versions. Vizier hyperparameter tuning. |
Model registry | Store model versions. |
Endpoints | Create a prediction API from a model and set up monitoring. |
Batch prediction | Save a larger set of predictions to a Cloud Storage bucket. |
Metadata | Track and analyze metadata of a machine learning process. |
Matching engine | A vector database. |
Vertex AI Datasets
Creating a managed dataset in Vertex AI is simple. The source should be either a Cloud Storage bucket or a BigQuery table.
Datasets are straightforward to use in Vertex AI models. The documentation does not state in detail whether they can be used, for example, directly in Python code.
Vertex AI Datasets can be exported as JSON Lines files to Google Cloud Storage. Those files can then be shared with others via signed URLs if needed.
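As a minimal sketch, a signed URL for an exported file could be generated with the google-cloud-storage client; the bucket and file names below are made-up placeholders:

```python
from datetime import timedelta
from google.cloud import storage

client = storage.Client()
# Hypothetical bucket and exported JSON Lines file
blob = client.bucket("my-exported-datasets").blob("dataset-export.jsonl")

# Signed URL valid for one hour
url = blob.generate_signed_url(expiration=timedelta(hours=1), version="v4")
print(url)
```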
Vertex AI Feature store
Feature store standardizes the way ML models access data. It has this hierarchy:
- Featurestore
  - EntityType (eg Users and Movies)
    - Feature (eg user age and movie rating)
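A minimal sketch of creating this hierarchy with the Vertex AI Python SDK; the ids, node count, and value type below are assumptions for illustration:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # assumed project and region

# Featurestore: the top-level container
fs = aiplatform.Featurestore.create(
    featurestore_id="movies", online_store_fixed_node_count=1
)

# EntityType: groups similar features together, eg users
users = fs.create_entity_type(entity_type_id="users")

# Feature: a single attribute of the entity type
users.create_feature(feature_id="age", value_type="INT64")
```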
Feature store entity types
A feature store is the top-level container for a set of features. The ingested feature data format additionally has a few requirements. The data must have an entity type column that indicates the id of the entity (row). The entity type groups similar features together. Also, a timestamp is required to indicate the creation time of the feature value.
An entity type can also be a combination, such as ProductUser.
Feature store monitoring
Monitoring can be configured at the entity type or feature level.
Feature store can profile the content of the data and monitor the drift over time.
Feature store data ingestion
Features can be ingested in batches or by streaming:
Featurestore source | Batch or stream | Notes |
---|---|---|
Featurestore HTTP API | Stream | The records are sent to the endpoint. |
BigQuery Table | Batch | |
Cloud Storage, Avro | Batch | |
Cloud Storage, CSV | Batch | No arrays |
Feature store serving
All feature data is saved to offline storage. Only the most recent versions of the records are available for online serving.
Feature store data is not read directly by libraries like Tensorflow or Pandas. Online serving works by HTTP request. Batch output can be exported to these formats:
- BigQuery Table
- CSV (no arrays)
- TFRecord
Some Google materials suggest Memorystore for solutions with strict latency requirements. Memorystore is basically a managed Redis or Memcached.
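For online serving, a rough sketch using the Vertex AI Python SDK (which wraps the HTTP API) could look like this; the ids are assumptions matching the earlier example:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # assumed project and region

# Read the latest feature values for one entity; the result is a Pandas DataFrame
users = aiplatform.featurestore.EntityType(
    entity_type_name="users", featurestore_id="movies"
)
df = users.read(entity_ids=["user_123"], feature_ids=["age"])
print(df)
```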
Feature store access
Feature store is likely consumed by multiple teams. Prefer IAM policies to control access.
Labelling tasks
Request humans to label your training data. You need to send:
- The dataset to label
- List of possible labels
- Instructions as PDF
  - Specific
  - Max 20 minutes to read
The maximum is 100 distinct labels, but 20 is the recommended cap to maintain labeler efficiency. Use descriptive label names such as cat and dog instead of label1 and label2.
Do not use overlapping categories. It is possible, though, to include the labels both and none.
Vertex AI Workbench
Notebooks in Vertex AI Workbench can be conveniently opened in a browser tab without significant configuration. Git is supported, and Docker, for example, is pre-installed.
Workbench has two ways of creating Jupyter notebook environments. Here is the difference between Managed notebooks and User-Managed notebooks.
Managed notebooks are designed to run notebooks as part of production pipelines. Google Cloud Storage and BigQuery integrations are readily available. The notebooks can be scheduled. Instances shut down automatically after a specified idle time. Managed notebooks come with popular frameworks like Tensorflow and PyTorch out of the box. This option has a higher hourly price than User-Managed notebooks.
User-Managed notebooks enable full configuration for experimental work. You need to choose an environment that has the required frameworks installed. The command line can be used as a sudo user. Google recommends one User-Managed instance per person during development as a virtual workspace.
This convenience syntax can be used in notebooks to read BigQuery data into a Pandas DataFrame in Python:
%%bigquery df
SELECT *
FROM table_name
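Outside notebooks, roughly the same result can be achieved with the BigQuery Python client; table_name is a placeholder here as well:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Run the query and load the result into a Pandas DataFrame
df = client.query("SELECT * FROM table_name").to_dataframe()
```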
The notebook instances have the What-If Tool (WIT) installed. WIT is nowadays part of Tensorboard. WIT adds functionality to explore deployed models, whereas Tensorboard typically monitors the training process.
Language Interpretability Tool (LIT) is another tool that visualizes NLP model predictions.
Vertex AI Pipelines
Pipelines introduce machine learning process orchestration from modular components. A pipeline instance runs a GKE instance in the background. Apparently, Vertex AI Pipelines were previously known as Kubeflow Pipelines.
Pipelines can be grouped and versioned in the UI. They are able to store metadata as artifacts.
Custom pipelines can be built using either Kubeflow or Tensorflow Extended (TFX). Readily available TFX components can be used, eg Vertex AI Jobs.
Vertex AI provides Tensorboard to monitor training metrics in Pipelines.
Thanks to lineage tracking of the pipeline artifacts, metadata about the datasets and models is saved for each run.
A component in a pipeline is a container image. It simply takes an input and produces an output.
An example Vertex AI pipeline:
- Read data
- Pre-process data
- Train the model
- Predict
- Output
  - Confusion matrix
  - ROC
Code example of a KFP pipeline:
from kfp import dsl
from kfp.v2 import compiler
from kfp.v2.dsl import component

# Cloud Storage path for pipeline artifacts; the bucket here is a placeholder
PIPELINE_ROOT = "gs://my-bucket/pipeline-root"


@component(base_image="python:3.9", output_component_file="first-component.yaml")
def step_1(text: str) -> str:
    return text


@dsl.pipeline(
    name="hello-world",
    description="An intro pipeline",
    pipeline_root=PIPELINE_ROOT,
)
def my_pipeline(text: str = "My input text"):
    product_task = step_1(text=text)


# Compile the pipeline definition into a job spec file
compiler.Compiler().compile(
    pipeline_func=my_pipeline, package_path="pipeline_job.json"
)
And then running the pipeline:
from kfp.v2.google.client import AIPlatformClient

# Instantiate the API client; project and region values are placeholders
api_client = AIPlatformClient(project_id="my-project", region="us-central1")

response = api_client.create_run_from_job_spec(
    job_spec_path="pipeline_job.json",
)
The interesting thing seems to be that the Kubeflow Pipeline JSON spec contains the needed Python code (but not the Python dependencies). Here is an example.
Vertex AI Hyperparameter tuning
The documentation is a bit confusing. Apparently there are two alternatives for hyperparameter tuning:
- Vizier
- Vertex AI hyperparameter tuning
Vizier
Vizier studies are found in the Experiments section of Vertex AI.
Vizier is a black-box hyperparameter tuning service inside Vertex AI. Black box means that the system does not have a known objective function, or that it is too costly to evaluate.
According to Google materials this holds true:
Black box optimization algorithms find the best operating parameters for any system whose performance can be measured as a function of adjustable parameters.
Types of supported hyperparameter optimization methods:
- Grid search
- Random search
- Bayesian optimization
Vizier does not optimize cost or tuning time. It runs optimization sequentially.
Vertex AI hyperparameter optimization
Basically, it is a regular training job in Vertex AI, but it finds the optimal hyperparameters at the same time. Hyperparameter tuning can be chosen only for custom training. It is not available in AutoML.
The code should only report the results of the given parameters. The hyperparameter tuning job takes care of running the model multiple times to find the optimal set of hyperparameters.
Here is how the process works:
- Create a Docker container in a Vertex AI notebook
- Use deeplearning-platform-release as the base image
- Install the cloudml-hypertune library
- The Docker entry file must be trainer.task
- The main .py file must read the hyperparameters as command line arguments
- The main .py file must report the hyperparameters and the model performance metric using the cloudml-hypertune library (a minimal sketch follows below)
With this setup, Vizier is able to run the container with different sets of hyperparameters and see which performs best. You need to define the min and max limits for each hyperparameter on the Vizier side manually.
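Here is a minimal sketch of the reporting side in the main .py file; the argument name, metric tag, and values are assumptions for illustration:

```python
import argparse

import hypertune  # provided by the cloudml-hypertune library

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.01)  # assumed hyperparameter
args = parser.parse_args()

# ... train the model with args.learning_rate and compute a validation metric ...
accuracy = 0.9  # placeholder value

# Report the metric so the tuning job can compare trials
hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag="accuracy",
    metric_value=accuracy,
    global_step=1,
)
```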
Important parameters for hyperparameter tuning:
Hyperparameter tuning parameter | Impact |
---|---|
maxTrialCount | Number of trials before stopping. Fewer trials are faster but can be non-optimal. |
parallelTrialCount | More is faster but can reduce effectiveness. |
enableTrialEarlyStopping | Stop trial when it seems to become unpromising. |
resumePreviousJobId | Use information from previous tuning jobs. |
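As a rough sketch, such a tuning job could be configured with the Vertex AI SDK as below; the machine type, image URI, parameter ranges, and trial counts are assumptions for illustration:

```python
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

aiplatform.init(project="my-project", location="us-central1")  # assumed project and region

# Custom job wrapping the training container built earlier; the image URI is a placeholder
custom_job = aiplatform.CustomJob(
    display_name="hp-training",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/my-project/trainer:latest"},
    }],
)

tuning_job = aiplatform.HyperparameterTuningJob(
    display_name="hp-tuning",
    custom_job=custom_job,
    metric_spec={"accuracy": "maximize"},
    parameter_spec={
        "learning_rate": hpt.DoubleParameterSpec(min=0.001, max=0.1, scale="log"),
    },
    max_trial_count=20,      # maxTrialCount
    parallel_trial_count=3,  # parallelTrialCount
)
tuning_job.run()
```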
Vertex AI Experiments
Find the best model for a specific problem. Experiment with different input datasets, model architectures, hyperparameters and training environments.
Managed Tensorboard available.
Vertex AI Model registry
Find all trained models and their versions.
Explore the model metrics or compare multiple models with each other. You can run an evaluation that generates statistics for a selected test set and a batch prediction.
Deploy a model endpoint or run a batch prediction.
Vertex AI Training (custom)
Use it for migrating on-prem models or when BigQuery and AutoML do not solve the case.
Preparing a training job
The training workflow goes as follows:
- Create training functionality in a notebook
  - Read data
  - Pre-processing (only Tensorflow)
  - Train
- Generate the entrypoint file task.py (a minimal sketch follows below)
  - Model training parameters as input
  - Save the model in the end
  - Use %%writefile or nbconvert
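A hypothetical task.py sketch with training parameters as command line arguments and model saving at the end; the argument names, the toy Keras model, and the AIP_MODEL_DIR usage are assumptions for illustration:

```python
# task.py
import argparse
import os

import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=5)       # assumed training parameter
parser.add_argument("--batch_size", type=int, default=32)  # assumed training parameter
args = parser.parse_args()

# Read and pre-process data (Tensorflow only)
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

# Train
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=args.epochs, batch_size=args.batch_size)

# Save the model in the end; Vertex AI custom training sets AIP_MODEL_DIR
model.save(os.environ.get("AIP_MODEL_DIR", "model_output"))
```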
Package the code to a Docker container locally, in a Workbench notebook, or using Cloud Build. Use pre-built Docker images if the needed frameworks are supported.
Another approach is to create a full Python package to which the Vertex AI training job can point.
My understanding is that once a model is trained, it is saved to a Cloud Storage bucket and registered in the Models section of Vertex AI.
Running a training job
Training jobs can be run on distributed clusters managed by Vertex AI, configured by the environment variables CLUSTER_SPEC (other frameworks) and TF_CONFIG (Tensorflow).
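Vertex AI sets TF_CONFIG as a JSON string, so training code can read its own role in the cluster roughly like this (the cluster and task fields are the standard Tensorflow ones):

```python
import json
import os

# TF_CONFIG is set on each worker of a distributed training job
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))

cluster = tf_config.get("cluster", {})  # addresses of all workers in the cluster
task = tf_config.get("task", {})        # this worker's type and index
print("Running as", task.get("type"), task.get("index"))
```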
Setting the scale-tier=BASIC_TPU variable would set the training job to run on TPU processors.
Typical job states (can be asked in the certification exam):
JOB_STATE_QUEUED
JOB_STATE_RUNNING
JOB_STATE_SUCCEEDED
When running multiple workers in parallel, each of them starts running whenever it becomes available.
Save Tensorflow checkpoints (save in PyTorch) and model artifacts to Google Cloud Storage.
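A small sketch of writing Keras checkpoints straight to a bucket during training; the bucket path is a placeholder:

```python
import tensorflow as tf

# Write checkpoints directly to Cloud Storage while training
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="gs://my-bucket/checkpoints/ckpt-{epoch:02d}",
    save_weights_only=True,
)

# Pass the callback to training, eg model.fit(..., callbacks=[checkpoint_cb])
```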
Explainable AI in Vertex AI
Use the Vertex AI SDK to understand model behavior with Explainable AI.
Explaining models by examples is a manual approach to review some training samples. A k-nearest neighbors algorithm is used to identify the most similar observations. Tree models are not supported.
Feature-based explanations are a more traditional way to show the relative importance of each feature.
Vertex AI AutoML
AutoML is part of Vertex AI. Just create a new training job and you find the AutoML options.
AutoML is codeless and serverless. You define the source data from BigQuery or a CSV file, and everything else is clicking menus in the browser:
- Source dataset
- Target variable
- Type of prediction task (regression, classification…)
- Performance metric
As an end result, AutoML finds the best performing model for you.
Here are the different types of tasks AutoML can perform.
Type of data | Tasks |
---|---|
AutoML Tables | Classification, regression and forecasting likelihood |
AutoML Natural Language | Single or multi label classification, sentiment, entity extraction |
AutoML Vision | Single or multi label classification, object detection, image segmentation |
AutoML Video Intelligence | Action recognition, classification, object tracking |
After testing AutoML a few times, execution times felt surprisingly long.
AutoML is suitable for models that can tolerate 100+ ms of inference latency.
Vertex AI Batch predictions
Here are some requirements for batch prediction data sources:
- BigQuery source table max 100 GB
- BigQuery source table must use a multi-region location
- CSV must have headers with alphanumeric column names (or underscore)
- CSV delimiter must be comma
- CSV max size for single file 10 GB, total 100 GB
- 1k-1B rows
- 2-1000 columns
- 2-500 distinct labels for classification
- Additional permissions required if Vertex AI and data sources are in different projects
Batch prediction monitoring has these configuration options among the others:
- How often monitoring metrics are evaluated
- Alert email
- Sampling rate in percentage
Vertex AI Endpoints
Create an online prediction API using Tensorflow Serving under the hood. Vertex AI private endpoints are useful for low latency peer-to-peer requests.
The recommended approach is to deploy a new model to the existing endpoint. Only a fraction of the traffic should be allocated to the new model first. This way, the new model can be monitored before full replacement. A new deployment creates new compute resources per model.
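A rough sketch of such a deployment with the Vertex AI SDK, assuming an existing endpoint and a newly registered model; the ids, machine type, and 10% split are illustration values:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # assumed project and region

endpoint = aiplatform.Endpoint(endpoint_name="ENDPOINT_ID")  # existing endpoint
model = aiplatform.Model(model_name="MODEL_ID")              # new model version

# Send 10% of traffic to the new model; the rest keeps hitting the current deployment
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-2",
    traffic_percentage=10,
)
```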
Monitoring can be enabled for skew and drift when setting up predictions. Behind the scenes it uses Tensorflow Data Validation.
Online prediction logging has three options:
Online prediction logging | Logged information |
---|---|
Container logging | Stdout and stderr to Cloud Logging |
Access logging | Eg Timestamp and latency for each request |
Request-response logging | Request and response logged to BigQuery table |
Redeployment is required for logging to take effect.
Vertex AI Metadata
Track and analyze metadata about a machine learning process.
Answers to questions such as:
- What was the model training data
- What were the model hyperparameters
- Info related to failed models
Based on the ML Metadata library from Tensorflow Extended (TFX).
Matching engine
In traditional databases, records are searched by matching exact criteria, for example all cars where the color is red.
Vector databases, as the name suggests, store records in vector format. This helps to find records that are similar to specified records. This is a requirement when comparing text snippets, images and other complex entities, maybe even cars or cities.
I am sure that vector databases will be a big thing in the immediate future.