This presentation will provide technical design and development insights to run a secured Spark job in Kubernetes compute cluster that accesses job data from a Kerberized HDFS cluster. Joy will show how to run a long-running machine learning or ETL Spark job in Kubernetes and to access data from HDFS using Kerberos Principal and Delegation token.
The first part of this presentation will unleash the design and best practices to deploy and run Spark in Kubernetes integrated with HDFS that creates on-demand multi-node Spark cluster during job submission, installing/resolving software dependencies (packages), executing/monitoring the workload, and finally disposing the resources at the end of job completion. The second part of this presentation covers the design and development details to setup a Spark+Kubernetes cluster that supports long-running jobs accessing data from secured HDFS storage by creating and renewing Kerberos delegation tokens seamlessly from end-user's Kerberos Principal.
All the techniques covered in this presentation are essential in order to set up a Spark+Kubernetes compute cluster that accesses data securely from distributed storage cluster such as HDFS in a corporate environment. No prior knowledge of any of these technologies is required to attend this presentation.