This is a summary of Google Cloud Platform (GCP) products relevant to the Machine Learning Engineer role.
Google’s philosophy seems to be that moving to their platform should require minimal changes to the existing solution. Most GCP products offer an easy-to-start option, pre-built templates and customization with containers.
These are my notes from preparing for the GCP certification exam.
SaaS, PaaS and IaaS in cloud
Here is a summary of SaaS, PaaS, IaaS and on-premise operating models.
|As a service model||Example users||Example service||Vendor responsibility|
|SaaS||Business users||Gmail||Applications, data|
|PaaS||Data Scientists||Vertex AI||Operating system, runtime|
|IaaS||Cloud team||Virtual machine||Storage, networking, servers|
|On-premise||Infra team||Physical servers||Everything|
Google Cloud database services
Here are options for Google Cloud databases.
Find a more comprehensive introduction to BigQuery for ML Engineers here.
Google Cloud ingestion and processing services
|Dataproc||Managed Spark or Hadoop for existing pipelines.||Spark||Configure cluster|
|Cloud Data Fusion||Drag and drop interface for data integration and ETL pipelines. Batch or stream.||No||Instances|
|Dataprep||Visual tool for ad-hoc data preparation for ML. Read from Cloud Storage or BigQuery.||Basic||SaaS|
|Pub/Sub||Message queue. No analytical capabilities.||Possibly||Serverless|
|Dataflow||Managed Apache Beam. Transform data as a stream or batch. Can execute custom scripts.||Yes||Serverless|
Read more about Dataflow.
Dataflow is modern and scalable, but requires coding skills. It is the most important advanced data processing service for ML pipelines, especially for non-tabular data.
When to use Dataprep instead of BigQuery?
Dataprep is a good choice to ingest and transform small scale business datasets possibly outside of Google Cloud and store them to BigQuery or Cloud Storage. Most typical sources are uploaded files, Google Sheets, apps, Cloud Storage and BigQuery.
Dataprep has data processing functionality similar to Microsoft Power BI. Some coding skills are required.
One way to automate a pipeline on file arrival is by using Google Cloud Functions.
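As a rough sketch of the Cloud Functions approach, a Python function triggered by a Cloud Storage event receives the bucket and object name of the arrived file in its event payload. The function name on_file_arrival, the CSV filter and the start_pipeline helper are hypothetical; a real function would launch the actual pipeline at that point.

```python
def start_pipeline(uri):
    """Hypothetical placeholder: a real function would launch the pipeline here."""
    return f"started pipeline for {uri}"


def on_file_arrival(event, context=None):
    """Entry point in the style of a 1st gen background function.

    `event` is the Cloud Storage object payload. It carries the bucket
    and the object name of the file that arrived.
    """
    bucket = event["bucket"]
    name = event["name"]
    # Assumption: only CSV files should start the pipeline.
    if not name.lower().endswith(".csv"):
        return None
    return start_pipeline(f"gs://{bucket}/{name}")
```

The handler can be exercised locally by passing a dictionary with bucket and name keys, which makes the filtering logic easy to test before deployment.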
When to use Data Fusion instead of BigQuery?
Data Fusion has some similarities with Dataprep: It brings data to Google Cloud.
Compared to Dataprep, Data Fusion is more suitable for enterprise users. For example, the docs have clear instructions for replicating a SQL Server, Oracle or MySQL database to BigQuery.
Pipelines are created with a no-code GUI. Pipelines can also be scheduled and linked together.
Data management in Google Cloud
These products are mostly for enterprise usage.
|Data management service||Description|
|Data Catalog||Browse different data sources and schemas by storing metadata.|
|Dataplex||Govern and monitor data across different sources.|
Hubs and marketplaces in Google Cloud
|Google Cloud hub||Description|
|Analytics Hub||A platform to publish and subscribe datasets.|
|AI Hub||Portal to search for data, pipelines, code and ML models.|
|Marketplace||Install applications such as Databricks. Mostly non-ML such as Wordpress.|
AI Hub is a portal to search for data, pipelines, code and ML models developed by others. As an analogy to the Android mobile ecosystem, Google compares it to the Google Play Store. You can also share your pipelines within the organization.
ML security in GCP
|Data Loss Prevention||Discover sensitive data. Tools for masking etc.|
|VPC service controls||Make data accessible only from authorized networks.|
Data visualization and BI in Google Cloud
Apparently Looker and Data Studio will become one product at some point. Looker already shows datastudio.google.com in the address bar.
Previously Data Studio has been a self-service BI tool while Looker has served enterprises. Now these offerings have merged together. This is comparable to Microsoft Power BI which has both functionalities in one product.
Vertex AI is an all-in-one tool for ML Engineers.
Read the full article about Vertex AI here.
Retail in Google Cloud
Retail has two important features:
- Recommend products for the user based on the product catalog and the user’s event log.
- Product search results based on Google’s intelligent search engine. It gives product hierarchies, pagination, filtering and ordering among many other features.
Google Cloud APIs for NLP and CV
These natural language processing and computer vision APIs provide a great alternative if you do not need a custom solution.
Most probably BigQuery and Vertex AI AutoML utilize these APIs in their backend.
|Google Cloud ML API||Capabilities|
|Speech to text|
|Text to speech|
|Natural Language||Parts of speech and sentiment|
|Video Intelligence||Motion and action in videos.|
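For illustration, the Natural Language API can be called over REST. The sketch below only builds a sentiment analysis request and does not send it. The endpoint and body layout follow the public documents:analyzeSentiment method as I understand it; the function name and the API key placeholder are hypothetical.

```python
import json

# Endpoint of the sentiment method; check the current API reference before use.
ENDPOINT = "https://language.googleapis.com/v1/documents:analyzeSentiment"


def build_sentiment_request(text, api_key="YOUR_API_KEY"):
    """Return (url, JSON body) for a sentiment analysis call. Nothing is sent."""
    body = {
        "document": {"type": "PLAIN_TEXT", "content": text},
        "encodingType": "UTF8",
    }
    return f"{ENDPOINT}?key={api_key}", json.dumps(body)


url, payload = build_sentiment_request("The new release is great.")
print(url)
print(payload)
```

The returned URL and body could then be posted with any HTTP client. In practice the official client libraries are usually more convenient than raw REST calls.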
Dialogflow is a platform to create conversational user interfaces. It has Essentials (ES) and Customer Experience (CX) editions.
Dialogflow execution order:
- Intent / Topic (Rule based and Machine learning)
- Entities (Who, What, When, Where)
- Conversation flow
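The order above can be mimicked with a toy example. This is plain Python and not the Dialogflow API; the intents, keywords, the date pattern and the replies are all made up for illustration.

```python
import re

# Hypothetical rule-based intents: intent name -> trigger keywords.
INTENT_KEYWORDS = {
    "book_table": ("book", "reserve"),
    "opening_hours": ("open", "hours"),
}


def detect_intent(text):
    """Step 1: decide what the user wants (rule based in this sketch)."""
    words = re.findall(r"[a-z]+", text.lower())
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            return intent
    return "fallback"


def extract_entities(text):
    """Step 2: pull out details, here only the 'when'."""
    match = re.search(r"\b(today|tomorrow)\b", text.lower())
    return {"when": match.group(1)} if match else {}


def next_reply(intent, entities):
    """Step 3: the conversation flow picks the next turn."""
    if intent == "book_table":
        if "when" not in entities:
            return "For which day?"
        return f"Booking a table for {entities['when']}."
    if intent == "opening_hours":
        return "We are open 9 to 17."
    return "Sorry, I did not understand."


def handle(text):
    """Run the three steps in order for one user utterance."""
    intent = detect_intent(text)
    return next_reply(intent, extract_entities(text))
```

A missing entity makes the flow ask a follow-up question, which is roughly what slot filling does in a real agent.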
Healthcare Natural Language API
Parse unstructured medical documents such as insurance claims. Generate structured documents from these.
Converts unstructured documents to a structured JSON file. It has general and custom processors.
- Image to text
- Classify documents
- Analyze and extract entities
NLP AutoML has these objectives available:
|Classification model||Which category the text belongs to|
|Entity extraction||Inspect text for entities like names and addresses.|
|Sentiment analysis||Reveal emotional opinions.|
Computer vision API
Google Cloud offers out-of-the-box solutions for these computer vision problems:
|Computer vision problem||Example|
|Image classification||Does the image show a cat or a dog?|
|Semantic segmentation||Detect topics like grass and dog within the image.|
|Instance segmentation||Find detailed object boundaries in the image.|
|Image classification and localization||Instance segmentation with bounding box.|
|Object recognition||Detect objects and their probability to exist in the image.|
|Object detection||Object recognition with bounding boxes.|
|Pattern recognition||Human and text recognition.|
|Facial recognition||Pattern recognition for human faces.|
|Edge detection||Highlight the edges of shapes.|
|Feature matching||Detect attributes regardless of rotation, colors etc.|
Google Knowledge Graph Search API
Search entities such as phone numbers and landmarks.
Compute solutions in Google Cloud
|Compute option||Use case|
|Compute Engine||Generic virtual machines (IaaS).|
|GKE||Google Kubernetes Engine.|
|App Engine||Fully managed code first PaaS for websites and mobile apps.|
|Cloud Run||Run stateless containers.|
|Cloud Functions||Serverless, no containers.|
The naming convention for a machine type such as n1-standard-2 is:
- n1 is the machine series
- standard is the machine type (the balance of vCPU and memory)
- 2 is the number of vCPUs
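The convention can be illustrated by splitting a machine type name such as n1-standard-2 into its parts. This is a minimal sketch that assumes the three-part series-type-vcpus form; names with other shapes, for example shared-core or custom machine types, are not handled. The class and function names are my own.

```python
from typing import NamedTuple


class MachineType(NamedTuple):
    series: str  # e.g. "n1"
    kind: str    # e.g. "standard"
    vcpus: int   # e.g. 2


def parse_machine_type(name: str) -> MachineType:
    """Split a name like 'n1-standard-2' into series, type and vCPU count."""
    parts = name.split("-")
    if len(parts) != 3 or not parts[2].isdigit():
        raise ValueError(f"unsupported machine type name: {name!r}")
    series, kind, vcpus = parts
    return MachineType(series, kind, int(vcpus))


print(parse_machine_type("n1-standard-2"))
```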
GPUs can optionally be added to some virtual machine series to speed up processing. This is why they are sometimes called accelerators.
Deep Learning VMs for CPU and GPU. TPUs can be used with different image types.
Local SSD is available for fast I/O. Frameworks such as TensorFlow, PyTorch and scikit-learn come pre-installed. Compute Engine has per-second billing.
Compute engine limits:
- Max 160 vCPUs
- Max 64 TB network storage.
Google Kubernetes Engine
GKE is part of the compute offering. It is a fully managed Kubernetes environment. This means you do not need to worry about setting up the resources. The service comes with a container-optimized operating system.
A single computer in the Kubernetes cluster is called a node.
Kubernetes does not create nodes by itself. Somebody needs to take care of that process. Google’s managed GKE takes this burden off the admin’s shoulders.
A node pool is a group of nodes with similar configurations. This is a GKE feature, not part of Kubernetes.
In GKE the control plane is an abstracted service while the nodes run as virtual machines.
Integrated logging, monitoring and networking are worth mentioning in GKE.
MLOps products in Google Cloud
These are generic Google Cloud DevOps services that can be used for MLOps as well.
|MLOps product||Use case|
|GKE||Run Kubeflow pipelines.|
|Cloud Build||Git and deployment workflows.|
|Cloud Composer||Orchestrate code executions.|
Cloud Build is configured by the cloudbuild.yaml file. Each step is executed by a Docker container.
Google Cloud provides pre-built Cloud Builders for CI/CD pipelines. They are Docker containers for specific actions such as
Cloud Build can be triggered automatically when an action in Git service such as GitHub happens. The trigger can be for example a push to a branch.
The dir parameter in a build step defines the working directory in the container where the step runs and where its artifacts are stored. Apparently this can be shared across the steps in the build process.
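To make the structure concrete, here is a sketch that assembles a two-step build config in Python and prints it as JSON, which Cloud Build accepts as an alternative to YAML. The builder image gcr.io/cloud-builders/docker, the image name and the app directory are illustrative assumptions and not taken from a real project.

```python
import json

# Hypothetical image name; $PROJECT_ID is substituted by Cloud Build.
IMAGE = "gcr.io/$PROJECT_ID/my-model-server"

build_config = {
    "steps": [
        {
            # Each step names the container image that executes it.
            "name": "gcr.io/cloud-builders/docker",
            "args": ["build", "-t", IMAGE, "."],
            # Run this step from the app/ subdirectory (assumed layout).
            "dir": "app",
        },
        {
            # Second step: push the image built in the first step.
            "name": "gcr.io/cloud-builders/docker",
            "args": ["push", IMAGE],
        },
    ],
}

# Saved as cloudbuild.json, this could be used as the build config file.
print(json.dumps(build_config, indent=2))
```

Generating the config from code is optional. Writing the same keys by hand in cloudbuild.yaml gives the same structure.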
Cloud Composer is a managed Apache Airflow.
Cloud Composer is not especially cost efficient for small tasks. Minimum billing for an environment is 10 minutes. Sometimes it is considered to be constantly running.