Dataflow is Google Cloud's product for building advanced data processing pipelines for machine learning solutions.

It handles typical data engineering tasks and lets the same code execute as either a batch or a streaming pipeline. Pipeline definition and execution are separated.
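Below is a minimal Apache Beam sketch of that separation, assuming placeholder project, bucket, and file names: the pipeline is only defined in code, and the runner selected in the options decides whether the same definition executes locally or on Dataflow.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The runner chosen here decides where the pipeline executes:
# "DirectRunner" runs the same code locally, "DataflowRunner" on Google Cloud.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder project id
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",  # placeholder bucket
)

# Pipeline definition: nothing runs until the pipeline is submitted to the runner.
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "KeepValid" >> beam.Filter(lambda fields: len(fields) == 3)
        | "Format" >> beam.Map(",".join)
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/clean")
    )
```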

Convert data to more efficient formats for ML

In a machine learning context, Google recommends Dataflow for converting unstructured data to TFRecord files for TensorFlow.

Dataflow is especially suitable for computationally expensive processing operations, and it also supports time-based windowing functions.
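As an illustration, here is a small windowing sketch over a streaming source; the topic name is a placeholder and the aggregation is just a per-window mean:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # windowing is applied to a streaming source

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/events")       # placeholder topic
        | "Parse" >> beam.Map(lambda msg: ("all", float(msg.decode("utf-8"))))
        | "FixedWindows" >> beam.WindowInto(window.FixedWindows(60))  # 60-second windows
        | "MeanPerWindow" >> beam.combiners.Mean.PerKey()
    )
```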

Creating a Dataflow job

As an example, a Dataflow job from a Pub/Sub topic to a BigQuery table can be created with a few clicks, without writing any code, by using a Google-provided template.

Some Dataflow source locations:

  • Cloud Storage
  • Pub/Sub

And Dataflow output could be written to:

  • BigQuery
  • Pub/Sub
  • Cloud Storage

For near real-time prediction, the recommended pattern is one Pub/Sub topic for input and another for output.
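A minimal sketch of that pattern is below, assuming hypothetical topic and subscription names and a placeholder call_model() function instead of a real model endpoint:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def call_model(features):
    # Placeholder: call an online prediction endpoint or a loaded model here.
    return {"input": features, "prediction": 0.0}

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadRequests" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/prediction-requests")
        | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "Predict" >> beam.Map(call_model)
        | "Encode" >> beam.Map(lambda result: json.dumps(result).encode("utf-8"))
        | "WriteResponses" >> beam.io.WriteToPubSub(
            topic="projects/my-project/topics/prediction-responses")
    )
```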

Dataflow sharding

Dataflow and Google Cloud best practices favor stateless processing whenever possible.

Sharding is the key Dataflow concept for parallelizing data processing: the data is split into small, independent pieces that share nothing with each other. In contrast to Dataflow, the parallel processing framework Spark is not sharded in the same way, because its operations are coordinated by the master node.
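For example, a share-nothing transformation is simply a pure function applied to each element; because no element depends on another, the runner is free to spread the shards across workers. The record fields below are made up for illustration:

```python
import apache_beam as beam

def normalize(record):
    # A pure function of a single element: no state is shared between elements,
    # so the shards can be processed on any number of workers.
    return {**record, "amount": record["amount"] / 100.0}

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([{"id": 1, "amount": 250}, {"id": 2, "amount": 990}])
        | "Normalize" >> beam.Map(normalize)
        | "Print" >> beam.Map(print)
    )
```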

Creating a custom Dataflow pipeline

Here is an example workflow for creating a custom pipeline from BigQuery to Cloud Storage as TFRecords (a code sketch follows the list):

  1. Create a notebook
  2. Read data from BigQuery into a Pandas DataFrame
  3. Use functions such as tft.scale_to_0_1() from TensorFlow Transform to transform the data
  4. Write the data as TFRecords
  5. Export the pipeline runner code
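Under those assumptions, the exported pipeline could look roughly like the sketch below; the project, dataset, bucket, and field names are placeholders, and a plain Python scaling function stands in for tft.scale_to_0_1():

```python
import apache_beam as beam
import tensorflow as tf

KNOWN_MAX = 10_000.0  # assumed fixed maximum used for min-max scaling

def to_example(row):
    # Stand-in for tft.scale_to_0_1(): scale a numeric field into [0, 1],
    # then pack the row into a serialized tf.train.Example.
    scaled = float(row["amount"]) / KNOWN_MAX
    example = tf.train.Example(features=tf.train.Features(feature={
        "amount": tf.train.Feature(float_list=tf.train.FloatList(value=[scaled])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(row["label"])])),
    }))
    return example.SerializeToString()

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
            query="SELECT amount, label FROM `my-project.my_dataset.training_data`",
            use_standard_sql=True)
        | "ToTFExample" >> beam.Map(to_example)
        | "WriteTFRecords" >> beam.io.WriteToTFRecord(
            "gs://my-bucket/tfrecords/train", file_name_suffix=".tfrecord")
    )
```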

Use Dataflow Flex Templates to run custom Docker images.