The big data platforms of many organisations are underpinned by a technology that is soon to celebrate its 45th birthday: SQL. This industry stalwart is applied in a multitude of critical points in business data flows; the results that these processes generate may significantly influence business and financial decision making. However, the SQL ecosystem has been overlooked and ignored by more recent innovations in the field of software engineering best practices such as fine grained automated testing and code quality metrics. This exposes organisations to poor application maintainability, high bug rates, and ultimately corporate risk.
We present the work we’ve been doing at Hotels.com to address these issues by bringing some advanced software engineering practices and open source tools to the realm of Apache Hive SQL. We first define the relevance of such approaches and demonstrate how automated testing can be applied to Hive SQL using HiveRunner, a JUnit based testing framework. We next consider how best to structure Hive queries to yield meaningful test scenarios that are maintainable and performant. Finally, we demonstrate how test coverage reports can highlight areas of risk in SQL codebases and weaknesses in the testing process. We do this using Mutant Swarm, an open source mutation testing tool for SQL languages developed by Hotels.com that can deliver insights similar to those produced by Java focused tools such as Jacoco and PIT.