I have written multiple blog posts about machine learning (ML) engineering and machine learning platforms. Those systems usually aim to productionize ML solutions, are fairly big investments, and focus on managing the whole ML lifecycle.
Now let’s take a look at free data science tools that individuals, scientists, students and entrepreneurs can use for exploratory data analysis.
Google Colab
Google Colab is Google's data science collaboration platform for researchers and students. The free computation engine provides a notebook editor for lightweight programming. This makes it ideal for learning Python without installing anything, as the UI runs in the browser and the computation runs in Google data centers.
At the moment, Google Colab does not support programming languages other than Python. You can mount Google Drive to conveniently read from and write to persistent file storage.
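As a rough illustration of the Drive mount, here is a minimal sketch. It assumes the `google.colab` module that is only available inside a Colab runtime. The `mount_drive` helper, the `drive_fallback` folder and the `notes.txt` file are hypothetical names for this example, and the local fallback is there only so the snippet also runs outside Colab.

```python
from pathlib import Path


def mount_drive(mount_point: str = "/content/drive") -> Path:
    """Return a folder for persisted files (hypothetical helper).

    Inside Colab this mounts Google Drive; elsewhere it falls back to a
    local folder so the same notebook code still runs.
    """
    try:
        # Only importable inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        local = Path("drive_fallback")  # assumed local stand-in folder
        local.mkdir(exist_ok=True)
        return local
    drive.mount(mount_point)  # Colab asks for authorization here
    return Path(mount_point) / "MyDrive"


data_dir = mount_drive()
notes = data_dir / "notes.txt"
notes.write_text("persisted between sessions\n")
print(notes.read_text())
```

In Colab, the files written this way end up in your Drive and are still there when the runtime is recycled.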
Google Colab also has a paid Pro version with more computation resources.
It is also possible to connect to Google Cloud computation resources for more processing power, but this requires cloud knowledge and incurs costs.
Read more in the Google Colab frequently asked questions.
Datalore Community Plan
Datalore lets individuals run notebooks for 120 hours per month for free. It is suitable for professional use and has a rich set of configuration options. The paid version of Datalore works well even for large teams.
For a beginner, Colab might be easier to start with. Once you realize something is not easily possible, you can move your project to Datalore. Such features include per-project library management, fine-grained collaboration, developer tools and reading data in a more managed way.
Read the full Datalore review here.
Databricks Community Edition
Databricks Community Edition is a free version of the full platform. You log in from a web browser. Computation happens in AWS, and the costs are covered by Databricks.
The supported programming languages in Databricks Community Edition are Python, R, SQL and Scala.
When I tested the service, it sometimes had a bug that prevented the cluster from starting successfully. This behavior can be expected as no service level agreements are applied to this kind of test environment.
Notebooks in a code editor
Microsoft used to have a free notebook service in the cloud. Azure Notebooks is now deprecated, and Microsoft recommends using Visual Studio Code to run notebooks on your laptop.
Visual Studio Code is primarily a code editor, but it has many more features bundled inside. You could even describe it as a programming environment.
Jupyter notebooks can be run in Visual Studio Code. Other options for running notebooks locally are the open source Atom editor or PyCharm Community Edition.
Local open source data science workspace with Docker and Jupyter
If you want to create a fully customized data science environment for whatever reason, use Docker.
You can use a Python base image and install the needed libraries in the Dockerfile. Alternatively, choose an Anaconda image to get the most popular packages pre-installed, or Miniconda for a minimal starting point.
Run Jupyter notebooks to render visualizations in your browser while your laptop is the computation engine.
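To make the setup concrete, here is a minimal sketch that writes such a Dockerfile from Python. The `python:3.11-slim` base image, the library list, the `/workspace` directory and the `render_dockerfile` helper are assumptions chosen for illustration, not a recommended configuration.

```python
from pathlib import Path


def render_dockerfile(libraries, base_image="python:3.11-slim"):
    """Build Dockerfile text for a Jupyter workspace (hypothetical helper)."""
    lines = [
        f"FROM {base_image}",
        "WORKDIR /workspace",
        # JupyterLab plus whatever libraries the project needs.
        "RUN pip install --no-cache-dir jupyterlab " + " ".join(libraries),
        "EXPOSE 8888",
        # Listen on all interfaces so the browser on the host can connect.
        'CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", '
        '"--no-browser", "--allow-root"]',
    ]
    return "\n".join(lines) + "\n"


dockerfile_text = render_dockerfile(["pandas", "matplotlib"])
Path("Dockerfile").write_text(dockerfile_text)
print(dockerfile_text)
```

With the file in place, something like `docker build -t ds-workspace .` followed by `docker run -p 8888:8888 ds-workspace` should start Jupyter on your laptop, where `ds-workspace` is just an example image name.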
It should be relatively easy to move the dockerized solution to cloud hosting so that you can access the environment over the internet. Considering how many good, low-cost ML platforms are available, I can’t come up with any good reason to build your data science workspace from scratch.
Which free data science workspace to use?
I think Google Colab and Databricks Community Edition make it extremely simple to run a few lines of code to test your idea. They are always available, no matter which physical device you use. Google Drive integration makes Colab perhaps the more appealing of the two.
Both Google Colab and Databricks Community Edition offer an excellent learning path towards enterprise scale data science and machine learning platforms.
For a longer-term project, a local environment setup might make more sense. Data privacy, offline working and more control over the environment are the most significant advantages.
The biggest restriction with any of the free data science workspaces is the computation capacity. Once you have more than a few gigabytes of data, you will run out of your laptop’s memory and free trial quotas.
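One way to stretch a free workspace a little further is to stream the data instead of loading it all at once. The sketch below uses only the Python standard library. The `streaming_mean` helper and the `price` column are made up for this example, and the small in-memory sample stands in for a large file.

```python
import csv
import io


def streaming_mean(rows, column):
    """Average one numeric column while holding only one row in memory."""
    total = 0.0
    count = 0
    for row in rows:
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


# Tiny in-memory stand-in for a large CSV file. With real data you would
# pass csv.DictReader(open(path, newline="")) for your own file path.
sample = io.StringIO("price\n10\n20\n30\n")
mean_price = streaming_mean(csv.DictReader(sample), "price")
print(mean_price)
```

This only helps with simple aggregations. Model training and joins on data of that size still need more memory than a free tier offers.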
For more advanced stuff it will be inevitable to have a paid workspace or hosting.