Dataflow is Google Cloud's product for building advanced data processing pipelines for machine learning solutions.
It handles typical data engineering tasks and lets the same code run in both batch and streaming mode. Pipeline definition and execution are separated: you first define the pipeline graph, and a runner then executes it.
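To make that concrete, here is a minimal Apache Beam sketch in Python (the bucket paths are hypothetical placeholders): the pipeline is defined as a graph first, the runner executes it afterwards, and swapping the batch source for a streaming one leaves the transform code untouched.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Definition: building the pipeline only constructs a graph; nothing runs yet.
options = PipelineOptions(runner="DirectRunner")  # or DataflowRunner on GCP
with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")  # batch source
     # For streaming, swap the source for beam.io.ReadFromPubSub(...) and enable
     # streaming in the options; the transforms below stay exactly the same.
     | "Transform" >> beam.Map(str.upper)
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/result"))
# Execution: the runner executes the graph when the context manager exits.
```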
Convert data to more efficient formats for ML
In a machine learning context, Google recommends Dataflow for converting unstructured data into TFRecords for TensorFlow.
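As a minimal sketch of what that conversion produces (the feature names here are hypothetical), each record is serialized as a tf.train.Example proto and stored in a TFRecord file:

```python
import tensorflow as tf

def to_tf_example(text, label):
    # Pack raw fields into the tf.train.Example proto that TFRecord files store.
    return tf.train.Example(features=tf.train.Features(feature={
        "text": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode("utf-8")])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }))

# Write one serialized Example into a local TFRecord file.
with tf.io.TFRecordWriter("sample.tfrecord") as writer:
    writer.write(to_tf_example("hello world", 1).SerializeToString())
```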
Dataflow is especially well suited to computationally expensive processing operations, and for streaming data you can also apply time-window functions.
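As a sketch of windowing (the topic name is a placeholder), the snippet below groups a Pub/Sub stream into fixed 60-second windows and counts the events in each window:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # unbounded source => streaming job

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | "Window" >> beam.WindowInto(window.FixedWindows(60))  # fixed 60-second windows
     | "KeyAll" >> beam.Map(lambda msg: ("events", 1))       # one key for all events
     | "CountPerWindow" >> beam.CombinePerKey(sum))          # count per window
```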
Creating a Dataflow pipeline
As an example, a Dataflow pipeline from a Pub/Sub topic to a BigQuery table can be created with a few clicks and no coding, using a Google-provided template.
Some Dataflow source locations:
- Cloud Storage
- Pub/Sub
And Dataflow output can be written to:
- BigQuery
- Pub/Sub
- Cloud Storage
For near real-time prediction, the recommended pattern is one Pub/Sub topic for input and another for output.
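A hedged sketch of that pattern, assuming hypothetical topic names and a placeholder predict() function standing in for real model-serving logic:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

def predict(instance):
    # Placeholder for real model-serving logic (e.g., a loaded SavedModel).
    return {"input": instance, "score": 0.5}

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "ReadRequests" >> beam.io.ReadFromPubSub(
           topic="projects/my-project/topics/predict-in")
     | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "Predict" >> beam.Map(predict)
     | "Encode" >> beam.Map(lambda pred: json.dumps(pred).encode("utf-8"))
     | "WriteResponses" >> beam.io.WriteToPubSub(
           topic="projects/my-project/topics/predict-out"))
```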
Dataflow sharding
Dataflow and Google Cloud practices favor stateless processing whenever possible.
Sharding is the key Dataflow concept for parallelizing data processing: the data is split into small, independent pieces that share nothing with each other. In contrast to Dataflow's shared-nothing model, the parallel processing framework Spark coordinates its operations through a driver (master) node.
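A small illustration of the shared-nothing idea (the output path is hypothetical): each element is transformed by a pure function, and the results are written as independent output shards.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create(["a", "b", "c", "d"])
     | "Stateless" >> beam.Map(str.upper)  # pure function, no shared state
     # num_shards splits the output into independent files:
     # part-00000-of-00004 ... part-00003-of-00004
     | beam.io.WriteToText("gs://my-bucket/out/part", num_shards=4))
```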
Creating a custom Dataflow pipeline
Here is an example workflow for creating a custom pipeline from BigQuery to Cloud Storage as TFRecords (a pipeline sketch follows the list):
- Create a notebook
- Read data from BigQuery into a Pandas DataFrame
- Use functions such as tft.scale_to_0_1() from TensorFlow Transform to transform the DataFrame
- Write the data as TFRecords
- Export the pipeline runner code
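A hedged sketch of this workflow as a Beam pipeline, with hypothetical table, bucket, and feature names:

```python
import apache_beam as beam
import tensorflow as tf

def row_to_example(row):
    # Pack one BigQuery row (a dict) into a serialized tf.train.Example proto.
    return tf.train.Example(features=tf.train.Features(feature={
        "feature_a": tf.train.Feature(
            float_list=tf.train.FloatList(value=[float(row["feature_a"])])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(row["label"])])),
    })).SerializeToString()

with beam.Pipeline() as p:
    (p
     | "ReadBQ" >> beam.io.ReadFromBigQuery(
           query="SELECT feature_a, label FROM `my-project.my_dataset.my_table`",
           use_standard_sql=True)
     | "ToExample" >> beam.Map(row_to_example)
     | "WriteTFRecords" >> beam.io.WriteToTFRecord(
           "gs://my-bucket/tfrecords/train", file_name_suffix=".tfrecord"))
```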
Use Dataflow Flex Templates to run pipelines packaged as custom Docker images.