Tl;dr; How to make Apache Spark process data efficiently? Lessons learned from running petabyte scale Hadoop cluster and dozens of spark jobs’ optimisations including the most spectacular: from 2500 gigs of RAM to 240.
Apache Spark is extremely popular for processing data on Hadoop clusters. If Your Spark executors go down, an amount of memory is increased. If processing goes too slow, number of executors is increased. Well, this works for some time but sooner or later You end up with a whole cluster fully utilized in an inefficient way.
During the presentation, we will present our lessons learned and performance improvements on Spark jobs including the most spectacular: from 2500 gigs of RAM to 240. We will also answer the questions like:
- How does pySpark job differ from Scala jobs in terms of performance?
- How does caching affect dynamic resource allocation
- Why is it worth to use mapPartitions?
and many more.