This is a summary of Google Cloud Platform (GCP) products relevant to the Machine Learning Engineer role.

Google's philosophy seems to be that moving to its platform should require minimal changes to an existing solution. Most GCP products offer an easy-to-start option, pre-built templates and customization through containers.

These are my notes from preparing for the GCP certification exam.

SaaS, PaaS and IaaS in cloud

Here is a summary of SaaS, PaaS, IaaS and on-premise operating models.

| As-a-service model | Example users | Example service | Vendor responsibility |
| --- | --- | --- | --- |
| SaaS | Business users | Gmail | Applications, data |
| PaaS | Data scientists | Vertex AI | Operating system, runtime |
| IaaS | Cloud team | Virtual machine | Storage, networking, servers |
| On-premise | Infra team | Physical servers | Nothing (everything is managed in-house) |

Google Cloud database services

Here are options for Google Cloud databases.

| Database | SQL or NoSQL | Workload |
| --- | --- | --- |
| Firestore | NoSQL | Transactional |
| Cloud Bigtable | NoSQL | Analytical |
| Cloud SQL | SQL | Traditional |
| Cloud Spanner | SQL | Transactional |
| BigQuery | SQL | Analytical |
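
The table above can be read as a lookup from query model and workload to a service. The sketch below encodes only the rows of the table; the function name `pick_database` is made up for illustration, and a real choice would also weigh scale, cost and consistency needs.

```python
# Lookup built from the table above: (query model, workload) -> service.
DATABASES = {
    ("NoSQL", "Transactional"): "Firestore",
    ("NoSQL", "Analytical"): "Cloud Bigtable",
    ("SQL", "Traditional"): "Cloud SQL",
    ("SQL", "Transactional"): "Cloud Spanner",
    ("SQL", "Analytical"): "BigQuery",
}


def pick_database(query_model: str, workload: str) -> str:
    """Return the service the table lists for this combination."""
    try:
        return DATABASES[(query_model, workload)]
    except KeyError:
        # The table has no row for this combination (e.g. NoSQL / Traditional).
        raise ValueError(
            f"No entry in the table for {query_model!r} / {workload!r}"
        ) from None


print(pick_database("SQL", "Analytical"))  # BigQuery
```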

Find here a more comprehensive introduction to BigQuery for ML Engineers.

Google Cloud ingestion and processing services

| Service | Description | Coding required | Provisioning |
| --- | --- | --- | --- |
| Dataproc | Managed Spark or Hadoop for existing pipelines. | Spark | Configure cluster |
| Cloud Data Fusion | Drag and drop interface for data integration and ETL pipelines. Batch or stream. | No | Instances |
| Dataprep | Visual tool for ad-hoc data preparation for ML. Read from Cloud Storage or BigQuery. | Basic | SaaS |
| Pub/Sub | Message queue. No analytical capabilities. | Possibly | Serverless |
| Dataflow | Managed Apache Beam. Transform data as a stream or batch. Can execute custom scripts. | Yes | Serverless |

Dataflow

Read more about Dataflow.

Dataflow is modern and scalable, but requires coding skills. It is the most important advanced data processing service for ML pipelines, especially for non-tabular data.

Dataprep

When to use Dataprep instead of BigQuery?

Dataprep is a good choice for ingesting and transforming small-scale business datasets, possibly from outside Google Cloud, and storing them in BigQuery or Cloud Storage. The most typical sources are uploaded files, Google Sheets, apps, Cloud Storage and BigQuery.

Dataprep has data processing functionality similar to that of Microsoft Power BI. Some coding skills are required.

One way to automate a pipeline on file arrival is by using Google Cloud Functions.
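
As a minimal sketch of the file-arrival idea, the function below follows the `(event, context)` signature used by Cloud Storage-triggered background functions, where the event carries the `bucket` and `name` of the new object. The function name, the `.csv` filter and the bucket name are assumptions for illustration, and the step that would actually start the pipeline is left as a placeholder.

```python
def on_file_arrival(event, context):
    """Entry point for a Cloud Storage-triggered function (background style).

    `event` is a dict describing the uploaded object; `context` carries
    event metadata and is unused here.
    """
    bucket = event["bucket"]
    name = event["name"]
    uri = f"gs://{bucket}/{name}"

    # Assumption: only CSV uploads should start the pipeline.
    if not name.endswith(".csv"):
        print(f"Ignoring {uri}")
        return None

    # Placeholder: this is where the downstream job would be started.
    print(f"Would start the pipeline for {uri}")
    return uri


# Local dry run with a hand-written event (hypothetical bucket and file).
on_file_arrival({"bucket": "example-bucket", "name": "uploads/sales.csv"}, None)
```

Because the handler is a plain function, it can be exercised locally with a dictionary before it is deployed.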

Data Fusion

When to use Data Fusion instead of BigQuery?

Data Fusion has some similarities with Dataprep: It brings data to Google Cloud.

Compared to Dataprep, Data Fusion can be seen as more suitable for enterprise users. For example, the docs have clear instructions for replicating a SQL Server, Oracle or MySQL database to BigQuery.

Pipelines are created in a no-code GUI. They can also be scheduled and linked together.

Data management in Google Cloud

These products are mostly for enterprise usage.

| Data management service | Description |
| --- | --- |
| Data Catalog | Browse different data sources and schemas by storing metadata. |
| Dataplex | Govern and monitor data across different sources. |

Hubs and marketplaces in Google Cloud

| Google Cloud hub | Description |
| --- | --- |
| Analytics Hub | A platform to publish and subscribe to datasets. |
| AI Hub | Portal to search for data, pipelines, code and ML models. |
| Marketplace | Install applications such as Databricks. Mostly non-ML, such as WordPress. |

AI Hub

AI Hub is a portal to search for data, pipelines, code and ML models developed by others. As an analogy to the Android mobile ecosystem, Google compares it to the Google Play Store. You can also share your pipelines within the organization.

ML security in GCP

| Security product | Description |
| --- | --- |
| Data Loss Prevention | Discover sensitive data. Tools for masking etc. |
| VPC Service Controls | Make data accessible only from authorized networks. |

Data visualization and BI in Google Cloud

Apparently Looker and Data Studio will become one product at some point. Looker already shows datastudio.google.com in the address bar.

Previously Data Studio was a self-service BI tool while Looker served enterprises. Now these offerings have merged. This is comparable to Microsoft Power BI, which has both functionalities in one product.

Vertex AI

Vertex AI is an all-in-one tool for ML Engineers.

Read the full article about Vertex AI here.

Retail in Google Cloud

Retail has two important features.

Recommendations AI

Recommends products for the user based on the product catalog and the user’s event log.

Retail search

Product search results based on Google’s intelligent search engine.

It gives product hierarchies, pagination, filtering and ordering among many other features.

Google Cloud APIs for NLP and CV

These natural language processing and computer vision APIs provide a great alternative if you do not need a custom solution.

Most probably BigQuery and Vertex AI AutoML utilize these APIs in their backend.

| Google Cloud ML API | Capabilities |
| --- | --- |
| Speech to text | |
| Text to speech | |
| Natural Language | Parts of speech and sentiment |
| Translation | |
| Vision | Static photos. |
| Video Intelligence | Motion and action in videos. |

Dialogflow API

Platform to create conversational user interfaces. It comes in Essentials (ES) and Customer Experience (CX) editions.

Dialogflow execution order:

  1. Intent / Topic (Rule based and Machine learning)
  2. Entities (Who, What, When, Where)
  3. Conversation flow
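
The order above can be illustrated with a toy, rule-based pipeline in plain Python. This does not call the Dialogflow API; the intents, keywords and regular expressions are made up for the example, and real Dialogflow combines rules with machine learning for intent matching.

```python
import re

# Hypothetical keyword rules standing in for intent matching.
INTENT_KEYWORDS = {
    "book_table": ("book", "reserve"),
    "opening_hours": ("open", "hours"),
}


def detect_intent(text):
    """Step 1: map the utterance to an intent, or fall back."""
    words = re.findall(r"[a-z]+", text.lower())
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            return intent
    return "fallback"


def extract_entities(text):
    """Step 2: pull out the 'when' and 'how many' details, if present."""
    lowered = text.lower()
    entities = {}
    time_match = re.search(r"\b(\d{1,2})\s*(am|pm)\b", lowered)
    if time_match:
        entities["when"] = f"{time_match.group(1)} {time_match.group(2)}"
    size_match = re.search(r"\bfor (\d+)\b", lowered)
    if size_match:
        entities["party_size"] = int(size_match.group(1))
    return entities


def next_step(intent, entities):
    """Step 3: decide how the conversation continues."""
    if intent == "book_table":
        missing = [key for key in ("when", "party_size") if key not in entities]
        return f"ask for {missing[0]}" if missing else "confirm booking"
    if intent == "opening_hours":
        return "answer with opening hours"
    return "ask user to rephrase"


def handle(text):
    intent = detect_intent(text)
    entities = extract_entities(text)
    return intent, entities, next_step(intent, entities)


print(handle("Please book a table for 4 at 7 pm"))
```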

Healthcare Natural Language API

Parse unstructured medical documents such as insurance claims. Generate structured documents from these.

Document AI

Converts unstructured documents to a structured JSON file. It has general and custom processors.

  • Image to text
  • Classify documents
  • Analyze and extract entities

NLP AutoML

NLP AutoML has these objectives available:

| NLP problem | Example |
| --- | --- |
| Classification model | Which category the text belongs to. |
| Entity extraction | Inspect text for entities like names and addresses. |
| Sentiment analysis | Reveal emotional opinions. |

Computer vision API

Google Cloud offers out-of-the-box solutions for these computer vision problems:

| Computer vision problem | Example |
| --- | --- |
| Image classification | Does the image show a cat or a dog? |
| Semantic segmentation | Detect topics like grass and dog within the image. |
| Instance segmentation | Find detailed object boundaries in the image. |
| Image classification and localization | Classify the image and locate the object with a bounding box. |
| Object recognition | Detect objects and their probability of existing in the image. |
| Object detection | Object recognition with bounding boxes. |
| Pattern recognition | Human and text recognition. |
| Facial recognition | Pattern recognition for human faces. |
| Edge detection | Highlight the edges of shapes. |
| Feature matching | Detect attributes regardless of rotation, colors etc. |

Google Knowledge Graph Search API

Search entities such as phone numbers and landmarks.

Compute solutions in Google Cloud

| Compute option | Use case |
| --- | --- |
| Compute Engine | Generic virtual machines (IaaS). |
| GKE | Google Kubernetes Engine. |
| App Engine | Fully managed, code-first PaaS for websites and mobile apps. |
| Cloud Run | Run stateless containers. |
| Cloud Functions | Serverless, no containers. |

Compute Engine

Naming convention is n1-standard-2 where:

  • n1 is the machine series
  • standard is the machine type, which describes the balance between vCPU and memory
  • 2 is the number of vCPU
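
The naming convention can be checked with a small parser. This is a sketch that only handles the three-part pattern described above; shared-core names such as e2-medium and custom machine types do not follow it and are rejected. The names `MachineType` and `parse_machine_type` are made up for the example.

```python
from typing import NamedTuple


class MachineType(NamedTuple):
    series: str  # e.g. "n1"
    kind: str    # e.g. "standard"
    vcpus: int   # e.g. 2


def parse_machine_type(name: str) -> MachineType:
    """Split a name like 'n1-standard-2' into its three parts."""
    parts = name.split("-")
    if len(parts) != 3 or not parts[2].isdigit():
        raise ValueError(f"Expected <series>-<type>-<vCPUs>, got {name!r}")
    series, kind, vcpus = parts
    return MachineType(series, kind, int(vcpus))


print(parse_machine_type("n1-standard-2"))
# MachineType(series='n1', kind='standard', vcpus=2)
```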

GPUs can optionally be attached to some virtual machine series to speed up processing. This is why they are sometimes called accelerators.

Deep Learning VM images are available for CPU and GPU. TPUs can be used with different image types.

Local SSDs are available for fast I/O. Frameworks such as TensorFlow, PyTorch and scikit-learn come pre-installed on Deep Learning VMs. Compute Engine has per-second billing.

Compute Engine limits:

  • Max 160 vCPUs
  • Max 64 TB network storage.

Google Kubernetes Engine

GKE is part of the compute offering. It is a fully managed Kubernetes environment, which means you do not need to worry about setting up the resources. The service comes with a container-optimized operating system.

A single computer in the Kubernetes cluster is called a node.

Kubernetes does not create nodes by itself. Somebody needs to take care of that process. Google’s managed GKE takes this burden off the admin’s shoulders.

A node pool is a group of nodes with similar configurations. This is a GKE feature, not part of Kubernetes.
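
The relationship between a cluster, its node pools and its nodes can be sketched as a small data model. This is a plain-Python illustration, not the GKE API; the class names, cluster name and pool names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodePool:
    name: str
    machine_type: str  # every node in the pool shares this configuration
    node_count: int


@dataclass
class Cluster:
    name: str
    node_pools: List[NodePool] = field(default_factory=list)

    def total_nodes(self) -> int:
        """Count the nodes across all pools in the cluster."""
        return sum(pool.node_count for pool in self.node_pools)


cluster = Cluster(
    name="ml-cluster",
    node_pools=[
        NodePool("default-pool", "n1-standard-2", node_count=3),
        NodePool("training-pool", "n1-highmem-8", node_count=2),
    ],
)
print(cluster.total_nodes())  # 5
```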

In GKE the control plane is an abstracted service while the nodes run as virtual machines.

Integrated logging, monitoring and networking are worth mentioning in GKE.

MLOps products in Google Cloud

These are generic Google Cloud DevOps services that can be used for MLOps as well.

| MLOps product | Use case |
| --- | --- |
| GKE | Run Kubeflow pipelines. |
| Cloud Build | Git and deployment workflows. |
| Cloud Composer | Orchestrate code executions. |

Cloud Build

Cloud Build is configured by the cloudbuild.yaml file. Each step is executed by a Docker container.

Google Cloud provides pre-built Cloud Builders for CI/CD pipelines. They are Docker containers for specific actions such as wget, gsutil or npm.

Cloud Build can be triggered automatically when an action happens in a Git service such as GitHub. The trigger can be, for example, a push to a branch.

The dir parameter in a build step sets the working directory used when running that step's container. Apparently files written under the shared /workspace directory persist across the steps in the build process.
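
Cloud Build also accepts its configuration in JSON syntax, so a build can be sketched as a Python dictionary and dumped to a file. The example below uses two of the pre-built builder images with a `dir` setting on each step. The bucket name, file names and the `app` directory are assumptions for illustration only.

```python
import json

# A two-step build: each step runs in its own builder container.
build_config = {
    "steps": [
        {
            # Pre-built builder image that wraps gsutil.
            "name": "gcr.io/cloud-builders/gsutil",
            # Hypothetical bucket and file name.
            "args": ["cp", "gs://example-bucket/data.csv", "data.csv"],
            # Run the step inside the (assumed) app directory of the source.
            "dir": "app",
        },
        {
            # Pre-built builder image that wraps npm.
            "name": "gcr.io/cloud-builders/npm",
            "args": ["install"],
            "dir": "app",
        },
    ]
}

# Print the JSON equivalent of a cloudbuild.yaml file.
print(json.dumps(build_config, indent=2))
```

The printed JSON could be saved as a config file and passed to the build in place of cloudbuild.yaml.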

Cloud Composer

Cloud Composer is a managed Apache Airflow service.

Cloud Composer is not especially cost-efficient for small tasks. Minimum billing for an environment is 10 minutes, and in practice an environment is often treated as constantly running.