Using LLVM to accelerate processing of data in Apache Arrow

Thursday, June 21
12:20 PM - 1:00 PM
Executive Ballroom 210D/H

Most query engines follow an interpreter-based approach in which a SQL query is translated into a tree of relational algebra operations and then executed through a conventional tuple-at-a-time iterator model. We will explore the overhead associated with this approach and show how the performance of query execution on columnar data can be improved using run-time code generation via LLVM.
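As a rough illustration (not from the talk itself), the tuple-at-a-time iterator model can be sketched in Python. The operator names and row layout below are hypothetical; the point is that every single row flows through a chain of per-row calls, which is where the interpretation overhead comes from.

```python
# Hypothetical sketch of the tuple-at-a-time ("Volcano") iterator model.
# Each operator pulls one row at a time from its child via next(), so every
# row pays the cost of several function calls and per-row dispatch.

class Scan:
    def __init__(self, rows):
        self._it = iter(rows)

    def next(self):
        return next(self._it, None)  # one row per call, None when exhausted

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def next(self):
        row = self.child.next()
        while row is not None and not self.predicate(row):
            row = self.child.next()
        return row

class Project:
    def __init__(self, child, expr):
        self.child, self.expr = child, expr

    def next(self):
        row = self.child.next()
        return None if row is None else self.expr(row)

# SELECT a + b FROM t WHERE a > 1, evaluated one tuple at a time
plan = Project(Filter(Scan([(1, 10), (2, 20), (3, 30)]),
                      lambda r: r[0] > 1),
               lambda r: r[0] + r[1])
out = []
row = plan.next()
while row is not None:
    out.append(row)
    row = plan.next()
# out == [22, 33]
```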

Generally speaking, the best query execution performance comes from a hand-written query plan that does exactly what the query needs for the exact data types and format at hand. Vectorized query processing models amortize the cost of function calls across batches of values. However, research has shown that hand-written code for a given query plan can still outperform a vectorized query processing model.
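The amortization idea can be sketched as follows (again a hypothetical illustration, not the talk's code): each operator call now processes a whole batch of rows, so the per-call overhead is spread over the batch instead of paid per row.

```python
# Hypothetical sketch of vectorized (batch-at-a-time) execution. One call
# per batch instead of one call per row amortizes interpretation overhead.

BATCH = 4  # illustrative batch size

def scan_batches(rows, batch_size=BATCH):
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def filter_batch(batch, predicate):
    # one call filters an entire batch in a tight inner loop
    return [row for row in batch if predicate(row)]

def project_batch(batch, expr):
    return [expr(row) for row in batch]

# SELECT a + b FROM t WHERE a > 1, evaluated batch at a time
rows = [(a, a * 10) for a in range(1, 9)]
out = []
for batch in scan_batches(rows):
    kept = filter_batch(batch, lambda r: r[0] > 1)
    out.extend(project_batch(kept, lambda r: r[0] + r[1]))
# out == [22, 33, 44, 55, 66, 77, 88]
```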

Over the last decade, the LLVM compiler framework has seen significant development. Furthermore, the database community has realized the potential of LLVM to boost query performance by implementing JIT query compilation frameworks. With LLVM, a SQL query is translated into a portable intermediate representation (IR), which is subsequently converted into machine code for the desired target architecture.

Dremio is built on top of Apache Arrow’s in-memory columnar vector format. These in-memory vectors map directly to LLVM's vector types, which makes it easier to write query processing algorithms in LLVM. We will talk about how Dremio implemented query processing logic in LLVM for operators such as FILTER and PROJECT. We will also discuss the performance benefits of LLVM-based vectorized query execution over other methods.
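To give a feel for what code generated for FILTER and PROJECT over columnar data ends up doing, here is a hypothetical Python equivalent of its shape: a single fused pass over contiguous column arrays with no per-row dispatch. (Dremio's actual generated code is LLVM IR produced at query time; nothing below is from that implementation.)

```python
# Hypothetical sketch of the shape of JIT-generated FILTER + PROJECT code:
# one fused loop over contiguous columnar arrays.

import array

def filter_project(col_a, col_b):
    # SELECT a + b FROM t WHERE a > 1, fused into a single pass
    out = array.array("q")  # contiguous 64-bit integer output column
    for i in range(len(col_a)):
        if col_a[i] > 1:                      # FILTER
            out.append(col_a[i] + col_b[i])   # PROJECT
    return out

col_a = array.array("q", [1, 2, 3])
col_b = array.array("q", [10, 20, 30])
result = filter_project(col_a, col_b)
# list(result) == [22, 33]
```

The columnar layout matters here: because each column is a contiguous typed array, a compiler can vectorize the inner loop with SIMD instructions, which is much harder with row-oriented tuples.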


Siddharth Teotia
Software Engineer
I am a software engineer at Dremio and a committer on the Apache Arrow project. Previously, I was part of the database kernel team at Oracle, where I worked on the storage, indexing, and in-memory columnar query processing layers of Oracle RDBMS. I hold an MS in software engineering from CMU and a BS in information systems from BITS Pilani, India. During my studies, I focused on distributed systems, databases, and software architecture. Apart from my job as a software engineer, I love writing technical content and giving technical presentations about the work I do.