Having worked for the past few years on both data science and data engineering projects, I have gained a pretty good basis for answering that question.
There is a common misunderstanding that you can work efficiently as a data scientist and a data engineer at the same time. Even though some of the skills are shared, each role covers far too many topics for one person to master both thoroughly.
Remember the picture below when reading the text:
- The data platform is built by a data engineer
- Predictions and recommendations are developed by a data scientist
Data engineer creates value by tuning the system and making the data available
My data engineering experience is mostly from projects where data had to be moved from the source systems to a data platform in the cloud. The purpose has been to gather all the enterprise data into one place.
Data engineering (1) is all about getting data from point A to point B. We have used the term data pipeline for a set of cloud software components that move data from a single source system to the data platform.
In my opinion, data engineering is very much a back-end role. A data engineer creates value by making the data available across the organization.
Even though you cooperate with other stakeholder groups inside the company, the discussion is highly technical. Personally, I have felt that data engineering work is most enjoyable for people who proactively spot opportunities to improve the technical performance of the IT infrastructure.
The discussion is about the most suitable tool, the performance of a data pipeline and so on. You don’t necessarily have to meet the company's customers to be able to work as a data engineer.
Data scientist creates value by modeling the physical world and improving the business
In data science (2) the feedback loop to the physical world is one of the cornerstones. The first part of the loop is collecting data from customers. Then the data scientist creates a model that interacts back with the physical world. I’m not saying that data science is not technical, far from it, but data scientists also have to understand how people or products behave.
The data scientist often defines the business case with the management before the source of additional value is even known. Results from proofs of concept are uncertain, and it might take multiple attempts to come up with something useful.
In my experience, business management understands the value of data engineering better than the value of data science. They do understand that data science has high potential for value. Still, data scientists are much needed in defining the business cases.
Data engineering cases are better understood because those skills are needed when something breaks, becomes slow or is missing, for example when a legacy data warehouse has to be fixed or replaced. Data science, in contrast, is leveraged in novel areas where no solutions exist yet.
The work of a data scientist requires deep domain knowledge, whereas data engineers can reuse the same competencies across industries.
Data engineering top skills - Software development and architectural mindset
The data scientist and data engineer roles both require software development skills. I just think a data engineer has to be much more familiar with the core concepts of software development and architecture best practices.
It is not uncommon for data engineers to implement solutions that require network configuration. In those cases knowledge about IP addresses, VPN connections, firewalls, ports etc. becomes fundamental.
Another great example is information security. Exposing sensitive unencrypted data to the public internet would be catastrophic. Sometimes it takes a lot of detailed decisions to avoid these pitfalls.
Data architect might be its own role, but a data engineer should nevertheless know how different software components work together in a data pipeline. The first component might receive the data from a source system, the next one processes the data, and a third component is responsible for the storage. Several software components or services might be available for each stage, and the data engineer should decide on the best combination.
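To make the three stages concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the sample CSV, the function names receive, process and store, and the in-memory SQLite database standing in for the storage layer. A real pipeline would use cloud services for each stage.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a source-system export.
SOURCE_CSV = "order_id,amount\n1,10.50\n2,\n3,7.25\n"


def receive(raw_text):
    """Stage 1: receive rows from the source system (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def process(rows):
    """Stage 2: drop rows with a missing amount and convert types."""
    return [
        (int(row["order_id"]), float(row["amount"]))
        for row in rows
        if row["amount"]
    ]


def store(records, connection):
    """Stage 3: persist the processed records in the storage layer."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?)", records)
    connection.commit()


def run_pipeline(raw_text, connection):
    """Chain the three stages; each one could be swapped for another component."""
    store(process(receive(raw_text)), connection)


# In-memory SQLite is an assumption; it stands in for a data warehouse.
connection = sqlite3.connect(":memory:")
run_pipeline(SOURCE_CSV, connection)
stored = connection.execute(
    "SELECT order_id, amount FROM orders ORDER BY order_id"
).fetchall()
print(stored)
```

Because each stage only depends on the output of the previous one, replacing a single component (for example the storage) does not force changes to the others, which is the kind of decision the data engineer ends up making.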
To mention some data engineering tools:
- Cloud computing: AWS, Azure, Google Cloud
- Infrastructure setup: Terraform
- Containers: Docker
- Scripting: Python, Java
- ETL pipelines: Talend, Matillion
- Data warehouse: SQL Server, Redshift, Snowflake, PostgreSQL
- Big data processing: Spark
Data scientist top skills - Scripting, natural sciences and business
Where a data engineer builds platforms, data scientists often have their own limited sandbox to play around in. Quite often the data scientist works on top of what the data engineer has built. This might be a computation environment or a piece of software.
The emphasis in data science work is on creating scripts that produce knowledge and insights. A common workflow is to read the data in, transform it to another format, and produce an output. Sometimes the output is charts and tables, sometimes it is a predictive model.
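The read, transform and output workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the temperature and energy figures are made up, the column names are hypothetical, and a hand-written least squares line stands in for a real predictive model that would normally come from a library.

```python
import csv
import io
from statistics import mean

# Hypothetical measurements: machine temperature vs. energy use.
RAW = "temperature_c,energy_kwh\n10,25\n20,45\n30,65\n40,85\n"

# 1. Read the data in.
rows = list(csv.DictReader(io.StringIO(RAW)))

# 2. Transform it to another format: two numeric columns.
xs = [float(r["temperature_c"]) for r in rows]
ys = [float(r["energy_kwh"]) for r in rows]

# 3a. Output as a small summary table.
summary = {"rows": len(rows), "mean_energy_kwh": mean(ys)}

# 3b. Output as a simple predictive model (ordinary least squares line).
x_bar, y_bar = mean(xs), mean(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar


def predict(temperature_c):
    """Predict energy use for a given temperature with the fitted line."""
    return intercept + slope * temperature_c


print(summary)
print(predict(50))
```

The same script produces both kinds of output mentioned above: a table-like summary for reporting and a model that can be applied to new inputs.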
Data science is taking steps to adopt more professional development practices. It is a big help if a data scientist can use version control like Git and build deployment pipelines.
A data scientist builds models of real-world phenomena. Understanding subjects like math, physics and chemistry is essential - at least in the industrial companies where I have mostly worked. In the data science department of a marketing company, psychology or behavioral science studies might be required on top of coding and math skills.
General-level knowledge is not enough, because you need to dive deep into the mechanics of the problem. Working for the car industry might mean learning how a car engine works, and so on.
In my opinion, a data scientist has a higher chance of working with external customers than a data engineer. Obviously this depends on the team structure, and some data scientists might work purely in the back office.
New titles like machine learning engineer and deep learning architect are emerging. Jobs are getting more focused and there is a need to specialize. For example, focusing solely on building the layers of a neural network is already feasible in larger organizations.
Common tools for data scientists:
- Scripting and analysis: Python, R, Scala, Julia
- Data querying: SQL, Spark
- Reporting tools: Power BI, Tableau
- Dashboards: Databox
- Deep learning: Keras, TensorFlow, PyTorch
Difference between data scientist and data engineer roles in a nutshell
The motivation for this blog post was to share my view on the differences between data science and data engineering.
Data scientists turn data into insights by utilizing data infrastructure built by data engineers
My key argument is that the data scientist role is a step closer to the customer than the data engineer role. Organizations still have steps to take to fill these roles with suitable talent.
Also, the definitions and the toolsets of both roles are ballooning. New frameworks and skills are emerging constantly. No data scientist can master inside out even the modest list of tools I gave in the previous chapter.
This means that more specialization will be needed in the future. In an optimal situation each expert would pick only reporting tools or only deep learning frameworks to develop themselves in.
Due to the lack of experts, most will remain generalist data scientists or data engineers with overly broad roles.