I have written multiple blog posts about machine learning (ML) engineering and machine learning platforms. Those systems usually target productionizing ML solutions, are fairly big investments, and focus on managing the whole ML lifecycle.
Now let’s take a look at free data science tools that individuals, scientists, students and entrepreneurs can use for exploratory data analysis.
All posts from the blog series: Machine learning platforms in cloud
Google Colab
Google has a data science collaboration platform for researchers and students. The free computation engine provides a notebook editor for lightweight programming. This makes it ideal for learning Python without installing anything, as the UI runs in the browser and the computation runs in Google data centers.
At the time of writing, Google Colab does not support programming languages other than Python. You can mount Google Drive to conveniently read from and write to persistent file storage.
Google Colab also has a paid Pro version available for more computation resources.
It is also possible to connect to Google Cloud computation resources for more processing power, but this requires knowledge of cloud services and incurs costs.
Read more in the Google Colab frequently asked questions.
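As a minimal sketch of the Google Drive mount, the snippet below uses the `google.colab` package that is available inside a Colab runtime. The helper name `mount_drive`, the file name `colab_notes.txt`, and the fallback to a temporary folder when the code runs outside Colab are my own assumptions for illustration.

```python
import os
import tempfile


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive in Colab and return a folder for persisted files."""
    try:
        # This package is only importable inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        # Assumption: outside Colab, fall back to a local temporary folder.
        return tempfile.gettempdir()
    # Asks for authorization in the notebook the first time it runs.
    drive.mount(mount_point)
    return os.path.join(mount_point, "MyDrive")


base_dir = mount_drive()
notes_path = os.path.join(base_dir, "colab_notes.txt")  # hypothetical file name

# Write to the persisted storage and read the content back.
with open(notes_path, "w") as f:
    f.write("hello from the notebook\n")

with open(notes_path) as f:
    print(f.read())
```

In Colab the file ends up in your Drive, so it is still there after the runtime has been recycled.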
Databricks Community Edition
Databricks Community Edition is a free version of the full platform. You log in from a web browser. Computation happens in AWS and the costs are covered by Databricks.
Supported programming languages in Databricks Community Edition are Python, R, SQL and Scala.
When I tested the service, it sometimes had a bug that prevented the cluster from starting successfully. This behavior can be expected as no service level agreements are applied to this kind of test environment.
Notebooks in code editors
Microsoft used to have a free notebook service in the cloud. Azure Notebooks is now deprecated, and Microsoft recommends using Visual Studio Code to run notebooks on your laptop.
Visual Studio Code is primarily a code editor, but it has many more features bundled inside. You could even describe it as a programming environment.
Build custom data science workspace with Docker
If you want to create a fully customized data science environment for whatever reason, I would use Docker.
You can use a Python base image and install the needed libraries in the Dockerfile. You can install Jupyter notebooks to write code in your browser while your laptop is the computation engine.
It should be relatively easy to move the dockerized solution to cloud hosting to access the environment over the internet. Considering how many good ML platforms are available at low cost, I can’t come up with any good reason to build your data science workspace from scratch.
But there is one thing I like about Docker when working locally: managing programming language and library versions. It is really easy to build a named Docker image for different versions of Python or sets of Python libraries. In my experience, version management quickly becomes messy if it is not thought through carefully.
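The Docker setup described in this section can be sketched with a small Python helper that renders a Dockerfile and a matching image name for a given Python version. This is only an illustration: the function names, the image name `ds-workspace`, the `/workspace` folder and the pinned library versions are assumptions, not a recommended configuration.

```python
def render_dockerfile(python_version, libraries):
    """Return Dockerfile text for a Jupyter workspace on a Python base image."""
    packages = " ".join(["jupyterlab", *libraries])
    lines = [
        # Official Python base image, tagged by Python version.
        f"FROM python:{python_version}-slim",
        "WORKDIR /workspace",
        # Install Jupyter and the pinned libraries in one layer.
        f"RUN pip install --no-cache-dir {packages}",
        "EXPOSE 8888",
        # Start JupyterLab so it is reachable from the browser on the host.
        'CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", '
        '"--no-browser", "--allow-root"]',
    ]
    return "\n".join(lines) + "\n"


def image_tag(name, python_version):
    """Name the image after the Python version it contains."""
    return f"{name}:py{python_version}"


dockerfile = render_dockerfile("3.11", ["pandas==2.2.2", "scikit-learn==1.5.0"])
tag = image_tag("ds-workspace", "3.11")

print(dockerfile)
print(f"docker build -t {tag} .")
```

Saving the printed text as `Dockerfile` and running the printed build command gives you one named image per Python version. Starting it with `docker run -p 8888:8888 ds-workspace:py3.11` should make the notebook available in the browser on port 8888.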
Which free data science workspace to use?
I think Google Colab and Databricks Community Edition make it extremely simple to run a few lines of code to test your idea. They are always available, no matter which physical device you use. The Google Drive integration makes Colab perhaps the more appealing of the two.
Both Google Colab and Databricks Community Edition offer an excellent learning path towards enterprise-scale data science and machine learning platforms.
For a longer-term project, a local environment setup might make more sense. Data privacy, offline work and more control over the environment are the most significant advantages.
The biggest restriction with any of the free data science workspaces is computation capacity. Once you have more than a few gigabytes of data, you will run out of your laptop’s memory and free trial quotas.
For more advanced work, a paid workspace or hosting becomes inevitable.
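To get a feeling for when data stops fitting in memory, here is a rough back-of-the-envelope estimate. The table size (100 million rows, 20 numeric columns) is a made-up example, and the calculation assumes 8 bytes per value and ignores any overhead from the library that holds the data.

```python
def estimate_memory_gb(rows, columns, bytes_per_value=8):
    """Rough in-memory size of a numeric table in gigabytes (10**9 bytes)."""
    return rows * columns * bytes_per_value / 1e9


# Hypothetical table: 100 million rows with 20 columns of 64-bit numbers.
size_gb = estimate_memory_gb(rows=100_000_000, columns=20)
print(f"Estimated size: {size_gb:.1f} GB")
```

If the estimate is close to or above the memory of your laptop or free notebook runtime, it is time to sample the data or look at a paid option.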