Last fall I wrote about the PySpark framework on my previous employer’s blog. As the title suggests, the topic is highly technical.

You can find the original post here: PySpark execution logic and code optimization.

The content of the PySpark article

Here are the main bullet points from the article, which I wrote with my colleague at the time, Data Scientist Timo Voipio:

  • DataFrames in pandas as a PySpark prerequisite
  • PySpark DataFrames and their execution logic
  • Consider caching to speed up PySpark
  • Use small scripts and multiple environments in PySpark
  • Favor DataFrame over RDD with structured data
  • Avoid User Defined Functions in PySpark
  • Number of partitions and partition size in PySpark
  • Summary – PySpark basics and optimization
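As the first bullets note, pandas DataFrames are a natural stepping stone to PySpark, since the two APIs are deliberately similar. A minimal sketch of a pandas aggregation (the column names and data here are invented for illustration); in PySpark the equivalent would be roughly `df.groupBy("region").agg(F.sum("amount"))`:

```python
import pandas as pd

# Hypothetical sales data; columns and values are made up for illustration.
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [10.0, 20.0, 5.0, 15.0],
})

# Group-and-sum in pandas. The structured, column-oriented style of this
# operation carries over almost directly to the PySpark DataFrame API.
totals = df.groupby("region", as_index=False)["amount"].sum()
print(totals)
```

Sticking to such built-in, column-oriented operations (rather than row-by-row Python functions) is also the idea behind the "favor DataFrame over RDD" and "avoid User Defined Functions" bullets: it lets the engine optimize the query instead of calling back into Python for every row.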

PySpark and Databricks also in use in my current role

My current employer is Unikie. Among other duties, I have run PySpark on top of the Databricks platform.

My full work history can be found on my introduction page.