Go to the Spark + Python tutorial for AWS Glue on Solita’s data blog.

Spark and parallel computing

A shop cashier can only serve a limited number of customers at a time. Tight focus and experience can improve their performance to some extent, but a better way to increase throughput is to have more employees.

Parallel computation builds on the same core idea. Instead of relying on a single supercomputer, it is often smarter to have multiple ordinary computation units working on the same task simultaneously.

Spark is an open-source framework for parallel computation from the Apache Software Foundation.

Spark can be described as a computation environment that is operated through a programming language of your choice. Python is a common companion for Spark thanks to its versatility, wide adoption and easy syntax.
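To make that concrete, below is a minimal sketch of driving Spark from Python with PySpark. The bucket path and column names are made up for illustration; the point is that a few lines of ordinary Python describe work that Spark distributes across the cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Start (or reuse) a Spark session from Python
    spark = SparkSession.builder.appName("pyspark-sketch").getOrCreate()

    # Read a CSV file into a distributed DataFrame (the S3 path is hypothetical)
    orders = spark.read.csv("s3://example-bucket/orders.csv",
                            header=True, inferSchema=True)

    # The aggregation is written in Python but executed in parallel on the workers
    totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
    totals.show()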

Why run a Spark application on a cloud platform?

Spark is designed for big data. Even though you can install Spark on a laptop, memory, disk space and processing power would soon become the limiting factors. In real use cases Spark is usually run on a cloud platform, and Amazon’s AWS Glue service offers a low-threshold option for that.

Setting up a scalable parallel computation cluster is not a breeze. AWS Glue provides a turnkey, serverless environment, so the data developer can concentrate on writing the Spark code.
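As a rough illustration, a new Glue job starts from boilerplate along these lines (the exact imports and helpers can vary between Glue versions), and the developer only fills in the transformation logic between job.init and job.commit:

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    # Glue passes the job name (and any custom parameters) as command line arguments
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])

    # GlueContext wraps a regular SparkContext and adds Glue-specific helpers
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session

    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # ... the actual Spark transformations go here ...

    job.commit()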

You can find an introduction to AWS Glue here.

A summary of the Spark + Python tutorial

The tutorial covers these topics:

  • Create a data source for AWS Glue
  • Crawl the data source into the data catalog
  • The crawled metadata in Glue tables
  • AWS Glue jobs for data transformations
  • Editing the Glue script to transform the data with Python and Spark (a sketch follows after this list)
  • Speeding up Spark development with Glue dev endpoint
  • About Glue performance
  • Summary about the Glue tutorial with Python and Spark
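To give a flavour of the script-editing step, here is a hedged sketch of the kind of transformation the tutorial walks through: reading a crawled table from the Data Catalog, cleaning it with plain PySpark and writing the result back to S3. The database, table and bucket names are placeholders, and the snippet assumes the glueContext set up in the boilerplate above.

    from awsglue.dynamicframe import DynamicFrame

    # Read the table that the crawler registered in the Data Catalog
    source = glueContext.create_dynamic_frame.from_catalog(
        database="tutorial_db",       # hypothetical catalog database
        table_name="raw_orders")      # hypothetical crawled table

    # Convert to a Spark DataFrame for ordinary PySpark transformations
    df = source.toDF().dropDuplicates().filter("amount > 0")

    # Convert back to a DynamicFrame and write the result to S3 as Parquet
    result = DynamicFrame.fromDF(df, glueContext, "result")
    glueContext.write_dynamic_frame.from_options(
        frame=result,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/orders/"},
        format="parquet")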