More and more companies are storing their datasets in native cloud storage solutions such as S3 and WASB. Running queries directly on those datasets has always been possible, yet there are many hurdles to clear before those queries are efficient and secure. There are also challenges around rapid cloud cluster deployment, scaling, security, and noisy neighbors.
In this talk, we'll cover the lessons we've learned along the way. We'll use a Tableau workbook to illustrate a typical BI scenario and show what's happening behind the scenes. We'll dive into S3 caching, partitioning strategy, query tuning, configuration, and Ranger integration, and point out pitfalls to avoid. Lastly, we'll discuss the internals of ACID merge, how it works against S3 buckets, and the key metrics to monitor during cloud operations.
The overall goal of this talk is to equip people in the community with the knowledge to operate Hive confidently in the cloud.