Last fall I wrote about the PySpark framework on my previous employer’s blog. As the title suggests, the topic is highly technical.

You can find the original post here: PySpark execution logic and code optimization.

The content of the PySpark article

Here are the main bullet points from the article, which I wrote with my colleague at the time, Data Scientist Timo Voipio:

  • DataFrames in pandas as a PySpark prerequisite
  • PySpark DataFrames and their execution logic
  • Consider caching to speed up PySpark
  • Use small scripts and multiple environments in PySpark
  • Favor DataFrame over RDD with structured data
  • Avoid User Defined Functions in PySpark
  • Number of partitions and partition size in PySpark
  • Summary – PySpark basics and optimization
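As the first bullets note, pandas DataFrames are a natural stepping stone to PySpark, since the two APIs are deliberately similar. A minimal sketch of a pandas aggregation (the column names and data here are invented for illustration); in PySpark the equivalent would be roughly `df.groupBy("region").agg(F.sum("amount"))`:

```python
import pandas as pd

# Hypothetical sales data; columns and values are made up for illustration.
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [10.0, 20.0, 5.0, 15.0],
})

# Group-and-sum in pandas. The structured, column-oriented style of this
# operation carries over almost directly to the PySpark DataFrame API.
totals = df.groupby("region", as_index=False)["amount"].sum()
print(totals)
```

Sticking to such built-in, column-oriented operations (rather than row-by-row Python functions) is also the idea behind the "favor DataFrame over RDD" and "avoid User Defined Functions" bullets: it lets the engine optimize the query instead of calling back into Python for every row.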

PySpark and Databricks also in use in my current role

My current employer is Unikie. Among other duties, I have run PySpark on top of the Databricks platform.

My full work history can be found on my introduction page.