Last fall I wrote about the PySpark framework on my previous employer's blog. As the name indicates, the topic is quite technical.
You can find the original post here: PySpark execution logic and code optimization.
Contents of the PySpark article
Here are the main bullet points from the article, which I wrote with my colleague at the time, Data Scientist Timo Voipio:
- DataFrames in pandas as a PySpark prerequisite
- PySpark DataFrames and their execution logic
- Consider caching to speed up PySpark
- Use small scripts and multiple environments in PySpark
- Favor DataFrame over RDD with structured data
- Avoid User Defined Functions in PySpark
- Number of partitions and partition size in PySpark
- Summary – PySpark basics and optimization
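The first bullet point notes that pandas DataFrames are a useful prerequisite for PySpark, since the two DataFrame APIs share many concepts. As a minimal sketch (with made-up toy data, not from the article), the same select-filter-aggregate pattern looks like this in pandas:

```python
import pandas as pd

# Toy data; the pandas DataFrame API mirrors many PySpark DataFrame operations.
df = pd.DataFrame({"city": ["Helsinki", "Tampere", "Helsinki"],
                   "sales": [100, 50, 200]})

# Filter rows, then aggregate per group. The equivalent PySpark DataFrame
# call chain would look roughly like:
#   df.filter(df.sales > 60).groupBy("city").sum("sales")
result = df[df["sales"] > 60].groupby("city", as_index=False)["sales"].sum()
print(result)
```

The similarity is why experience with pandas transfers well to PySpark, even though PySpark evaluates DataFrames lazily and distributes the work across a cluster.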
PySpark and Databricks also in use in the current role
My current employer is Unikie. Among other duties, I have run PySpark on top of the Databricks platform.
My full work history can be found on my introduction page.