The Hidden Life of Spark Jobs

Thursday, March 21
4:00 PM - 4:40 PM
Room 118-119

TL;DR: How do you make Apache Spark process data efficiently? Lessons learned from running a petabyte-scale Hadoop cluster and dozens of Spark job optimisations, including the most spectacular: from 2500 GB of RAM down to 240.

Apache Spark is extremely popular for processing data on Hadoop clusters. If your Spark executors go down, you increase their memory. If processing is too slow, you increase the number of executors. This works for a while, but sooner or later you end up with a fully utilized cluster being used inefficiently.

During the presentation, we will share our lessons learned and performance improvements on Spark jobs, including the most spectacular: from 2500 GB of RAM down to 240. We will also answer questions like:
- How do PySpark jobs differ from Scala jobs in terms of performance?
- How does caching affect dynamic resource allocation?
- Why is it worth using mapPartitions?

and many more.
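To illustrate the last question: Spark's mapPartitions lets you pay a per-record setup cost (a database connection, a loaded model) once per partition instead of once per record. The sketch below is plain Python with no Spark dependency, simulating partitions as lists; the name `expensive_setup` is a hypothetical stand-in, not anything from the talk.

```python
# Conceptual sketch (assumption: plain Python standing in for Spark) of why
# mapPartitions can beat a per-record map when each record needs costly setup.

def expensive_setup():
    """Stand-in for e.g. opening a DB connection; counts how often it runs."""
    expensive_setup.calls += 1
    return lambda record: record * 2  # the actual per-record transformation

expensive_setup.calls = 0

partitions = [[1, 2, 3], [4, 5, 6]]  # two partitions of records

# map-style: setup happens for every single record
expensive_setup.calls = 0
out_map = [expensive_setup()(r) for part in partitions for r in part]
setups_per_record = expensive_setup.calls      # 6 setups for 6 records

# mapPartitions-style: setup happens once per partition, records iterate inside
expensive_setup.calls = 0
out_partitions = []
for part in partitions:
    transform = expensive_setup()              # one setup per partition
    out_partitions.extend(transform(r) for r in part)
setups_per_partition = expensive_setup.calls   # 2 setups for 6 records
```

In real Spark the second pattern is `rdd.mapPartitions(lambda it: ...)`, where the setup runs once at the top of the function and the iterator of records is consumed inside it.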



Paweł Leszczyński
Technical Hadoop Owner
Data Processing Ninja with over 10 years of experience in the software engineering industry. PhD in distributed databases, working at a petabyte-scale e-commerce platform.