
PySpark execution logic and code optimization

The article walks through PySpark's execution logic and provides guidelines for optimizing speed and performance.

Last fall I wrote about the PySpark framework on my previous employer's blog. As the title indicates, the topic is highly technical.

You can find the original post here: PySpark execution logic and code optimization.

The contents of the PySpark article

Here are the main bullets from the article, which I wrote together with my colleague at the time, Data Scientist Timo Voipio:

  • DataFrames in pandas as a PySpark prerequisite
  • PySpark DataFrames and their execution logic
  • Consider caching to speed up PySpark
  • Use small scripts and multiple environments in PySpark
  • Favor DataFrame over RDD with structured data
  • Avoid User Defined Functions in PySpark
  • Number of partitions and partition size in PySpark
  • Summary – PySpark basics and optimization

PySpark and Databricks also in use in my current role

My current employer is Unikie. Among other duties, I have run PySpark on top of the Databricks platform.

My full work history can be found on my introduction page.
