Dataflow is Google Cloud's product for building advanced data processing pipelines for machine learning solutions.
It handles typical data engineering tasks and lets the same code run in both batch and streaming mode. Pipeline definition and execution are separated: you first define the pipeline graph, and a runner then executes it.
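To make that concrete, here is a minimal Apache Beam sketch in Python (the bucket paths are hypothetical placeholders): the pipeline is defined as a graph first, the runner executes it afterwards, and swapping the batch source for a streaming one leaves the transform code untouched.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Definition: building the pipeline only constructs a graph; nothing runs yet.
options = PipelineOptions(runner="DirectRunner")  # or DataflowRunner on GCP
with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")  # batch source
     # For streaming, swap the source for beam.io.ReadFromPubSub(...) and enable
     # streaming in the options; the transforms below stay exactly the same.
     | "Transform" >> beam.Map(str.upper)
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/result"))
# Execution: the runner executes the graph when the context manager exits.
```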
Convert data to more efficient formats for ML
In a machine learning context, Google recommends Dataflow for converting unstructured data into TFRecords for TensorFlow.
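As a minimal sketch of what that conversion produces (the feature names here are hypothetical), each record is serialized as a tf.train.Example proto and stored in a TFRecord file:

```python
import tensorflow as tf

def to_tf_example(text, label):
    # Pack raw fields into the tf.train.Example proto that TFRecord files store.
    return tf.train.Example(features=tf.train.Features(feature={
        "text": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode("utf-8")])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }))

# Write one serialized Example into a local TFRecord file.
with tf.io.TFRecordWriter("sample.tfrecord") as writer:
    writer.write(to_tf_example("hello world", 1).SerializeToString())
```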
Dataflow is especially well suited to computationally expensive processing operations, and for streaming data you can also apply time-window functions.
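As a sketch of windowing (the topic name is a placeholder), the snippet below groups a Pub/Sub stream into fixed 60-second windows and counts the events in each window:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # unbounded source => streaming job

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | "Window" >> beam.WindowInto(window.FixedWindows(60))  # fixed 60-second windows
     | "KeyAll" >> beam.Map(lambda msg: ("events", 1))       # one key for all events
     | "CountPerWindow" >> beam.CombinePerKey(sum))          # count per window
```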
Creating a Dataflow pipeline
As an example, a Dataflow pipeline from a Pub/Sub topic to a BigQuery table can be created with a few clicks and no coding, using a Google-provided template.
Some Dataflow source locations:
- Cloud Storage
- Pub/Sub
And Dataflow output can be written to:
- BigQuery
- Pub/Sub
- Cloud Storage
For near real-time prediction, the recommended pattern is one Pub/Sub topic for input and another for output.
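A hedged sketch of that pattern, assuming hypothetical topic names and a placeholder predict() function standing in for real model-serving logic:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

def predict(instance):
    # Placeholder for real model-serving logic (e.g., a loaded SavedModel).
    return {"input": instance, "score": 0.5}

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "ReadRequests" >> beam.io.ReadFromPubSub(
           topic="projects/my-project/topics/predict-in")
     | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "Predict" >> beam.Map(predict)
     | "Encode" >> beam.Map(lambda pred: json.dumps(pred).encode("utf-8"))
     | "WriteResponses" >> beam.io.WriteToPubSub(
           topic="projects/my-project/topics/predict-out"))
```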
Dataflow sharding
Dataflow and Google Cloud practices favor stateless processing whenever possible.
Sharding is the key Dataflow concept for parallelizing data processing: the data is split into small, independent pieces that share nothing with each other. In contrast to Dataflow's shared-nothing model, the parallel processing framework Spark coordinates its operations through a driver (master) node.
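A small illustration of the shared-nothing idea (the output path is hypothetical): each element is transformed by a pure function, and the results are written as independent output shards.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create(["a", "b", "c", "d"])
     | "Stateless" >> beam.Map(str.upper)  # pure function, no shared state
     # num_shards splits the output into independent files:
     # part-00000-of-00004 ... part-00003-of-00004
     | beam.io.WriteToText("gs://my-bucket/out/part", num_shards=4))
```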
Creating a custom Dataflow pipeline
Here is an example workflow for creating a custom pipeline from BigQuery to Cloud Storage as TFRecords (a pipeline sketch follows the list):
- Create a notebook
- Read data from BigQuery into a Pandas DataFrame
- Use functions such as tft.scale_to_0_1() from TensorFlow Transform to transform the DataFrame
- Write the data as TFRecords
- Export the pipeline runner code
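A hedged sketch of this workflow as a Beam pipeline, with hypothetical table, bucket, and feature names:

```python
import apache_beam as beam
import tensorflow as tf

def row_to_example(row):
    # Pack one BigQuery row (a dict) into a serialized tf.train.Example proto.
    return tf.train.Example(features=tf.train.Features(feature={
        "feature_a": tf.train.Feature(
            float_list=tf.train.FloatList(value=[float(row["feature_a"])])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(row["label"])])),
    })).SerializeToString()

with beam.Pipeline() as p:
    (p
     | "ReadBQ" >> beam.io.ReadFromBigQuery(
           query="SELECT feature_a, label FROM `my-project.my_dataset.my_table`",
           use_standard_sql=True)
     | "ToExample" >> beam.Map(row_to_example)
     | "WriteTFRecords" >> beam.io.WriteToTFRecord(
           "gs://my-bucket/tfrecords/train", file_name_suffix=".tfrecord"))
```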
Use Dataflow Flex Templates to run pipelines packaged as custom Docker images.