What’s new in Apache Spark 2.3 and Spark 2.4

Thursday, October 11
4:00 PM - 4:40 PM

Apache Spark 2.0 laid the architectural foundations for structure in Spark: unified high-level APIs, Structured Streaming, and performant underlying components such as the Catalyst optimizer and the Tungsten engine. Since then, the Spark community has continued to build new features and fix numerous issues in the Spark 2.1 and 2.2 releases.

Apache Spark 2.3 made similar strides, introducing new features and resolving over 1300 JIRA issues. Likewise, Apache Spark 2.4 will resolve over 1100 JIRA issues. In this talk, I will walk through the notable features and changes in both releases.

Spark 2.3:

• New deployment mode: Kubernetes scheduler backend
• Pandas UDFs (a.k.a. Vectorized UDFs) and Pandas / Arrow optimization
• New Structured Streaming execution engine: continuous processing
• Data source v2 APIs for both Structured Streaming and Spark SQL
• Image Support in Spark
• Spark History Server v2
• Native Vectorized ORC support
• Stream-stream Join
• R UDF stability and Structured Streaming API
• Scala 2.10 support dropped
• Other notable features and improvements

Spark 2.4:

• Pandas UDF - Grouped Aggregate
• Eager Evaluation
• Barrier execution
• Kafka 2.0.0 upgrade
• Avro data source
• Image data source
• Higher-order functions
• Scala 2.12 support
• Native Spark App in K8S
• New JSON and CSV options
• Configurable Python execution daemon and worker
• Other notable features and improvements


Hyukjin Kwon
Software Engineer
Hortonworks, Inc.
Hyukjin is a software engineer at Hortonworks, working on many different areas of Spark, such as Spark SQL, PySpark, and SparkR. He is an Apache Spark committer and mainly focuses on the Apache Spark open source community, helping to discuss and review many features and changes.