Clustering time series data with SQL
The purpose of this experiment was to show that doing data science doesn’t always require fancy tools. SQL is pretty basic, after all.
Because the experiment was a technical one, most of the documentation lives in the SQL clustering repository on GitHub.
Clustering is a method of dividing a set of data points into distinct groups according to some defined logic.
Visualizing clusters in a 3D plot
This video shows data from the engine of a car or an airplane. Well, actually I generated the data myself, but it could be from a vehicle. Each data point contains readings from two engine heat sensors and the time since the measurement started.
Clustering can reveal the total number of driving sessions during the measurement period. One cluster equals one driving session.
Twiddling around with the 3D chart reveals information that would be impossible to observe using only two dimensions. You can explore the 3D chart yourself here.
How did SQL clustering work?
It is definitely possible to identify clusters from data by using SQL only. But SQL really is not the top tool for that.
The downside of SQL clustering is that it’s messy to do anything more than basic reasoning. If the clusters can be identified with simple rules, then SQL is fine.
In my code repository the logic was this:
A new cluster begins if the engine heat decreases compared to the previous data point in the time series.
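The rule above can be sketched with two window functions: `LAG` to compare each row with the previous one, and a running `SUM` to turn the "new cluster" flags into cluster numbers. This is only an illustration, not the query from the repository. The table name `measurements`, the columns `t`, `heat1` and `heat2`, and the sample rows are all made up, and the sketch assumes the rule is applied to the first heat sensor only. The SQL is run through Python's built-in `sqlite3` module so it can be tried without a database server; it needs SQLite 3.25 or newer for window functions.

```python
import sqlite3

# Made-up sample data: (seconds since start, heat sensor 1, heat sensor 2).
rows = [
    (0, 20.0, 21.0),
    (1, 35.0, 33.0),
    (2, 50.0, 48.0),
    (3, 22.0, 23.0),  # heat1 drops: a new cluster starts here
    (4, 40.0, 38.0),
    (5, 60.0, 57.0),
    (6, 25.0, 24.0),  # heat1 drops again
    (7, 45.0, 44.0),
]

# Hypothetical table and column names; the original repository may differ.
QUERY = """
WITH flagged AS (
    SELECT
        t,
        heat1,
        heat2,
        -- 1 when heat1 is lower than on the previous row, else 0.
        -- The first row has no previous row, so it falls into ELSE.
        CASE
            WHEN heat1 < LAG(heat1) OVER (ORDER BY t) THEN 1
            ELSE 0
        END AS is_new_cluster
    FROM measurements
)
SELECT
    t,
    heat1,
    heat2,
    -- Running count of cluster starts gives each row its cluster number.
    1 + SUM(is_new_cluster) OVER (
        ORDER BY t
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cluster_id
FROM flagged
ORDER BY t
"""


def cluster_sessions(data):
    """Load the rows into an in-memory SQLite table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE measurements (t INTEGER, heat1 REAL, heat2 REAL)"
        )
        conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", data)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


clustered = cluster_sessions(rows)
cluster_ids = [row[3] for row in clustered]
session_count = max(cluster_ids)
```

With the sample rows, `cluster_ids` comes out as `[1, 1, 1, 2, 2, 2, 3, 3]`, so `session_count` is 3. Real sensor data is noisier than this, so a plain "lower than the previous row" check would likely need a threshold before it is useful.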
Is SQL clustering machine learning?
No. At least in this case.
With machine learning, the clustering algorithm would make smarter decisions as it receives more data.
No matter how much more data my SQL query receives, its reasoning logic does not change. Using SQL for machine learning is like running a marathon barefoot: you can do it, but it hurts a lot.
Other tools for clustering
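K-means, hierarchical clustering and DBSCAN are probably the best-known clustering algorithms. K-nearest neighbors is often mentioned alongside them, but it is a supervised classification method, not a clustering algorithm.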
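For contrast with the fixed SQL rule, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python on one-dimensional heat values. The function name `kmeans_1d`, the evenly spread starting centroids, and the sample values are choices made for this illustration only. Unlike the SQL rule, the centroids are computed from the data, so the grouping shifts when the data changes. Note that it ignores time order, so it groups readings by heat level rather than by driving session.

```python
def kmeans_1d(values, k, iterations=20):
    """Group one-dimensional values into k clusters with Lloyd's algorithm."""
    ordered = sorted(values)
    # Deterministic start: pick k values spread evenly across the sorted data.
    if k == 1:
        centroids = [ordered[len(ordered) // 2]]
    else:
        centroids = [
            ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)
        ]

    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                new_centroids.append(sum(members) / len(members))
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, labels


# Made-up heat readings: three cool ones and three hot ones.
heat = [20.0, 22.0, 21.0, 58.0, 60.0, 62.0]
centroids, labels = kmeans_1d(heat, k=2)
```

Here `labels` is `[0, 0, 0, 1, 1, 1]` and the centroids settle at 21.0 and 60.0. In practice a library implementation would be the usual choice over hand-rolled code like this.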