Bodo is a platform for data processing with Python and SQL. It is especially suitable for large datasets thanks to its unique parallel processing technology.
When to choose Bodo
Bodo is a tool for data engineers who want to speed up time-consuming ETL jobs.
In the company's own words:
Bodo’s linear scaling capability is most noticeable in efforts involving jobs of 100’s of GBs, hundreds of millions of Dataframe rows, and compute times approaching/exceeding 1 hour.
The Pay As You Go tier is within reach of small and medium businesses. Bodo charges 0.19 $ per core hour on top of the cloud resource costs. This means Bodo adds roughly 50-70% to plain virtual machine on-demand prices for its platform. Pay As You Go is available only on AWS at the moment.
Committed use starts at 2500 $ per year. The system can be provisioned as agreed, and support is included.
The pricing page has a couple of examples where the workload is run a few times a week. The total cost would be around 2000 $ a month in these cases.
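The Pay As You Go arithmetic can be sketched in a few lines. This is only a rough illustration: the 0.19 $ fee comes from the text above, while the function name `job_cost` and the virtual machine price of 0.30 $ per core hour are hypothetical values chosen for the example, not figures from Bodo or AWS.

```python
# Bodo platform fee per core hour in USD, as quoted in the article.
BODO_FEE_PER_CORE_HOUR = 0.19


def job_cost(cores, hours, vm_price_per_core_hour):
    """Estimate the cost of one job run (hypothetical helper).

    vm_price_per_core_hour is the on-demand cloud price per core hour;
    it is an input you must look up yourself, not a Bodo figure.
    """
    core_hours = cores * hours
    platform_fee = core_hours * BODO_FEE_PER_CORE_HOUR
    cloud_cost = core_hours * vm_price_per_core_hour
    return {
        "platform_fee": platform_fee,
        "cloud_cost": cloud_cost,
        "total": platform_fee + cloud_cost,
        # Platform fee relative to the plain VM cost.
        "markup": platform_fee / cloud_cost,
    }


# Assumed example: 64 cores for 2 hours at an assumed 0.30 $ per core hour.
estimate = job_cost(cores=64, hours=2, vm_price_per_core_hour=0.30)
print(f"total: {estimate['total']:.2f} $, markup: {estimate['markup']:.0%}")
```

With these assumed inputs the platform fee is about 63% of the virtual machine cost, which falls inside the 50-70% range mentioned above. A cheaper virtual machine pushes the percentage up, a more expensive one pushes it down.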
How to host Bodo
The Pay As You Go tier is deployed through AWS Marketplace. For Premium and Enterprise contracts, deployment can be done on AWS, Azure or on-premises. Google Cloud support is coming soon.
The computation can run in your existing cloud VPC. There is no need to move your data anywhere.
Here is a good summary of deployment options.
Instructions for Kubernetes installation are also available.
Fast parallel computation in Bodo beats Spark
Bodo publishes impressive benchmarks against Spark and Dask, claiming to be several times faster. You can read their performance benchmarks against Spark, Dask and Ray here.
Apparently the reason for the high speed is a different computation paradigm compared to the competing frameworks. For example, Spark follows a one-master, multiple-workers pattern. Everything is synchronized through the master, which can become a bottleneck. Bodo implements the so-called Single Program, Multiple Data (SPMD) pattern that is familiar from supercomputers.
The main difference is that in Bodo the workers execute the computation independently, each on its own partition of the data. The workers can also pass messages to each other without going through a master.
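The SPMD idea can be illustrated with a small toy example. This is a sketch of the general pattern only, not Bodo's implementation: threads stand in for workers on separate cores or machines, and the names `spmd_sum` and `program` are made up for the illustration. Every worker runs the same program, picks its own partition by rank, and exchanges partial results directly with its peers.

```python
import queue
import threading


def spmd_sum(data, n_workers=4):
    """Sum `data` SPMD-style and return the total as seen by each worker."""
    # One inbox per worker so that workers can message each other directly.
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = [None] * n_workers

    def program(rank):
        # Same program on every worker; only the rank differs.
        # Each worker selects its own partition, no master hands out tasks.
        partition = data[rank::n_workers]
        local_sum = sum(partition)

        # Peer-to-peer exchange: send the partial sum to every other worker.
        for peer in range(n_workers):
            if peer != rank:
                inboxes[peer].put(local_sum)

        # Combine the partial sums received from the other workers.
        total = local_sum
        for _ in range(n_workers - 1):
            total += inboxes[rank].get()
        results[rank] = total

    workers = [
        threading.Thread(target=program, args=(rank,))
        for rank in range(n_workers)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


# Every worker ends up with the same global total.
print(spmd_sum(list(range(100)), n_workers=4))
```

The main thread only launches the workers. Once they run, no central process schedules tasks or collects results, which is the contrast to the master and workers pattern described above.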
Bodo vs Google BigQuery
Cloud providers have integrated solutions for massively parallel processing. So where do we need Bodo?
Google BigQuery is a great example. It should be a replacement for Spark for those in Google Cloud. So why would you need an external framework?
The best answer is that Bodo supports Python, in case you are dependent on it. BigQuery runs only SQL statements.
Supported programming languages
Python and SQL. The platform should work with existing Python code with minimal changes.
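As a rough idea of what "minimal changes" can look like, the sketch below adds Bodo's `bodo.jit` decorator to an ordinary Python function. The function `mean_of_squares` is a made-up example, and the fallback decorator is an assumption added here so that the snippet also runs where Bodo is not installed. Real workloads would typically involve pandas or NumPy code, and whether a given function compiles unchanged depends on the features it uses.

```python
try:
    import bodo

    jit = bodo.jit
except ImportError:
    # Fallback for this illustration only: run the plain Python function
    # when Bodo is not installed.
    def jit(func):
        return func


@jit
def mean_of_squares(n):
    # Ordinary Python code; the decorator is the only Bodo-specific change.
    total = 0.0
    for i in range(n):
        total += i * i
    return total / n


print(mean_of_squares(4))
```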
Summary of Bodo
Bodo seems to be a promising platform in terms of processing performance. The hosting options are slightly confusing: what is managed, which parts are open source, and what exactly is running in the cloud?
About the company behind Bodo
Bodo was founded in 2020. Its headquarters are in California, USA.