Most query engines follow an interpreter-based approach: a SQL query is translated into a tree of relational algebra operators, which is then executed through a conventional tuple-based iterator model. We will explore the overhead associated with this approach and show how run-time code generation with LLVM can improve the performance of query execution on columnar data.
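To make the overhead concrete, here is a minimal sketch of a tuple-based iterator model in Python. The operator classes, the `CALLS` counter, and the `demo` helper are all hypothetical names chosen for illustration, and the plan roughly corresponds to `SELECT a + b FROM t WHERE a > 10`. It is not taken from any particular engine.

```python
from typing import Callable, Iterable, Optional, Tuple

Row = Tuple[int, ...]

# Hypothetical counter used only to make the per-tuple call overhead visible.
CALLS = {"next": 0}


class Operator:
    """Base class: every operator hands out one row per next() call."""

    def next(self) -> Optional[Row]:
        raise NotImplementedError


class Scan(Operator):
    def __init__(self, rows: Iterable[Row]) -> None:
        self._it = iter(rows)

    def next(self) -> Optional[Row]:
        CALLS["next"] += 1
        return next(self._it, None)  # None signals end of input


class Filter(Operator):
    def __init__(self, child: Operator, predicate: Callable[[Row], bool]) -> None:
        self._child = child
        self._predicate = predicate

    def next(self) -> Optional[Row]:
        CALLS["next"] += 1
        while True:
            row = self._child.next()
            if row is None or self._predicate(row):
                return row


class Project(Operator):
    def __init__(self, child: Operator, expr: Callable[[Row], Row]) -> None:
        self._child = child
        self._expr = expr

    def next(self) -> Optional[Row]:
        CALLS["next"] += 1
        row = self._child.next()
        return None if row is None else self._expr(row)


def demo():
    """Run Project(Filter(Scan)) over three rows; return (result, next() calls)."""
    CALLS["next"] = 0
    table = [(5, 1), (20, 2), (30, 3)]
    plan = Project(
        Filter(Scan(table), lambda r: r[0] > 10),
        lambda r: (r[0] + r[1],),
    )
    out = []
    while (row := plan.next()) is not None:
        out.append(row)
    return out, CALLS["next"]
```

In this sketch, every row travels through a chain of `next()` calls plus a predicate or expression callback, so the number of calls grows with both the row count and the depth of the plan.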
Generally speaking, the best case for query execution performance is hand-written code that does exactly what the query needs, for exactly the data types and format involved. Vectorized query processing models amortize the cost of function calls by processing a batch of values per call. However, research has shown that hand-written code for a given query plan can still outperform a vectorized query processing model.
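The structural difference between the two styles can be sketched as follows. The function names are hypothetical, and plain Python lists stand in for column batches, so this shows only the shape of the code and says nothing about real performance. Both functions evaluate `a + b` for rows where `a > 10`.

```python
from typing import List


def vec_select_gt(col: List[int], const: int) -> List[int]:
    """Vectorized primitive: one call per batch, returns a selection vector."""
    return [i for i, v in enumerate(col) if v > const]


def vec_add(a: List[int], b: List[int], sel: List[int]) -> List[int]:
    """Vectorized primitive: adds two columns at the selected positions."""
    return [a[i] + b[i] for i in sel]


def vectorized_query(a: List[int], b: List[int]) -> List[int]:
    # Two generic primitives; the selection vector is materialized between them.
    sel = vec_select_gt(a, 10)
    return vec_add(a, b, sel)


def handwritten_query(a: List[int], b: List[int]) -> List[int]:
    # One fused loop written for this exact query: no intermediate
    # selection vector and no generic primitive boundaries.
    out = []
    for i in range(len(a)):
        if a[i] > 10:
            out.append(a[i] + b[i])
    return out
```

The vectorized version pays one function call per primitive per batch but materializes an intermediate result, while the hand-written version keeps each value in a single loop body.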
Over the last decade, the LLVM compiler framework has seen significant development, and the database community has recognized its potential to boost query performance by building JIT query compilation frameworks on top of it. With LLVM, a SQL query is translated into a portable intermediate representation (IR), which is then compiled into machine code for the target architecture.
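The idea of generating specialized code at run time can be sketched without LLVM. The example below is only a stand-in: it emits Python source from a tiny expression tree and compiles it with the built-in `compile` and `exec`, where a real engine would emit LLVM IR and JIT-compile it to machine code. The `Col`, `Const`, `BinOp`, `emit`, and `compile_filter_project` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class Col:
    name: str


@dataclass
class Const:
    value: int


@dataclass
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Col, Const, BinOp]

_ALLOWED_OPS = {"+", "-", "*", ">", "<"}


def emit(expr: Expr) -> str:
    """Translate an expression tree into source text for one row index i."""
    if isinstance(expr, Col):
        if not expr.name.isidentifier():
            raise ValueError(f"bad column name: {expr.name!r}")
        return f"{expr.name}[i]"
    if isinstance(expr, Const):
        return repr(int(expr.value))
    if isinstance(expr, BinOp):
        if expr.op not in _ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {expr.op!r}")
        return f"({emit(expr.left)} {expr.op} {emit(expr.right)})"
    raise TypeError(f"unknown expression node: {expr!r}")


def compile_filter_project(
    predicate: Expr, projection: Expr, columns: List[str]
) -> Tuple[Callable, str]:
    """Generate and compile a fused filter + project loop for one query."""
    params = ", ".join(columns)
    src = (
        f"def kernel({params}, n):\n"
        f"    out = []\n"
        f"    for i in range(n):\n"
        f"        if {emit(predicate)}:\n"
        f"            out.append({emit(projection)})\n"
        f"    return out\n"
    )
    namespace: dict = {}
    # Stand-in for IR generation + JIT: compile the generated source at run time.
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["kernel"], src
```

Calling `compile_filter_project(BinOp(">", Col("a"), Const(10)), BinOp("+", Col("a"), Col("b")), ["a", "b"])` returns a `kernel` function whose loop body contains the query's expressions directly, with no expression-tree interpretation left at execution time.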
Dremio is built on top of Apache Arrow’s in-memory columnar vector format. These in-memory vectors map directly to the vector type in LLVM, which simplifies writing query processing algorithms in LLVM. We will talk about how Dremio implemented the query processing logic for operators such as FILTER and PROJECT in LLVM, and discuss the performance benefits of LLVM-based vectorized query execution over other methods.
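As a rough illustration of what a PROJECT kernel works with, the sketch below models an Arrow-style fixed-width column as a list of values plus a bit-packed validity bitmap (a set bit marks a non-null slot, least-significant bit first) and computes a null-aware `a + b`. The function names are hypothetical, and this is a plain Python sketch of the data layout, not Dremio's LLVM implementation.

```python
from typing import List, Tuple


def is_valid(bitmap: bytes, i: int) -> bool:
    """Return True if slot i is non-null (bit i set, LSB-first within a byte)."""
    return (bitmap[i >> 3] >> (i & 7)) & 1 == 1


def project_add(
    a_vals: List[int], a_valid: bytes,
    b_vals: List[int], b_valid: bytes,
    n: int,
) -> Tuple[List[int], bytes]:
    """Null-aware a + b: the result is null wherever either input is null."""
    out_vals = [0] * n                    # slots under a null keep a placeholder
    out_valid = bytearray((n + 7) // 8)   # one validity bit per output slot
    for i in range(n):
        if is_valid(a_valid, i) and is_valid(b_valid, i):
            out_vals[i] = a_vals[i] + b_vals[i]
            out_valid[i >> 3] |= 1 << (i & 7)
    return out_vals, bytes(out_valid)
```

For example, with `a_valid = b"\x05"` (slots 0 and 2 valid) and `b_valid = b"\x07"` (all three valid), only slots 0 and 2 of the output carry a value.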