Many compliance and regulatory frameworks, such as the HIPAA Security Rule for electronic protected health information (ePHI) or the rules governing controlled unclassified information (CUI) on defense projects, require logging and monitoring of all access to sensitive data such as patient records or CUI. The challenge for Hadoop clusters is that they are complex ecosystems running a variety of services across multiple nodes. Services such as Knox, Ranger, Atlas, Oozie, and Hive differ greatly from one another: each has its own format and schema for tracking data access, and each is likely running on different nodes within the HDP cluster. Collecting the right set of logs and events from each of these data and security services is therefore the first challenge. Furthermore, security monitoring, threat detection, and breach analysis typically rely on correlating these security events. This implies centralizing logs and events where they can be normalized, indexed, aggregated, and correlated for analysis in a security information and event management (SIEM) tool such as Sumo Logic, Splunk, or an equivalent.
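
To make the collection step concrete, the snippet below is a minimal sketch of a Splunk universal forwarder inputs.conf that tails a few well-known audit logs. The file paths reflect typical HDP defaults, and the sourcetype and index names are placeholders invented for illustration; all of them should be verified against the actual deployment.

    # Illustrative inputs.conf for a Splunk universal forwarder.
    # Log paths are typical HDP defaults; sourcetype and index
    # names are placeholders -- adjust everything to your environment.

    [monitor:///var/log/hadoop/hdfs/hdfs-audit.log]
    sourcetype = hdfs_audit
    index = hadoop_security

    [monitor:///var/log/knox/gateway-audit.log]
    sourcetype = knox_audit
    index = hadoop_security

    [monitor:///var/log/ranger/admin/access_log*]
    sourcetype = ranger_access
    index = hadoop_security

Funneling every service's audit trail into one index like this is what makes cross-service correlation in the SIEM possible in the first place.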
This presentation will share some of the techniques and lessons learned from a real-world Hadoop implementation at Johns Hopkins. Data will, as expected, be sanitized. The focus will be on the strategies and techniques used to collect audit and access log events from key Hadoop services and forward them to a central server for monitoring, analysis, and response to any suspected breaches or incidents. Automation techniques, such as Ansible playbooks that install agents or forwarders uniformly and efficiently across the cluster nodes, will also be highlighted where appropriate.
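
As a taste of that automation, the playbook below is an illustrative sketch of rolling a forwarder out to every node, not the exact scripts used in the deployment. The host group, package file name, deployment-server address, and paths are hypothetical placeholders.

    ---
    # Illustrative Ansible playbook: install a Splunk universal forwarder
    # on every Hadoop node and hand it the audit-log inputs.conf above.
    # Host group, package name, and server address are placeholders.
    - name: Deploy log forwarder to all Hadoop nodes
      hosts: hadoop_cluster
      become: true
      vars:
        forwarder_rpm: splunkforwarder-8.0.0-x86_64.rpm    # placeholder version
        deployment_server: splunk-deploy.example.org:8089  # placeholder address
      tasks:
        - name: Copy the forwarder package to the node
          copy:
            src: "files/{{ forwarder_rpm }}"
            dest: "/tmp/{{ forwarder_rpm }}"

        - name: Install the forwarder package
          yum:
            name: "/tmp/{{ forwarder_rpm }}"
            state: present

        - name: Push the audit-log inputs.conf
          copy:
            src: files/inputs.conf
            dest: /opt/splunkforwarder/etc/system/local/inputs.conf
            mode: "0644"
          notify: restart forwarder

        - name: Point the forwarder at the deployment server
          copy:
            dest: /opt/splunkforwarder/etc/system/local/deploymentclient.conf
            content: |
              [target-broker:deploymentServer]
              targetUri = {{ deployment_server }}
            mode: "0644"
          notify: restart forwarder

      handlers:
        # License flags handle the interactive prompt on the first run.
        - name: restart forwarder
          command: >
            /opt/splunkforwarder/bin/splunk restart
            --accept-license --answer-yes --no-prompt

Running one idempotent playbook against the whole inventory keeps the forwarder configuration identical on every node, which matters when auditors ask whether any host was left out of monitoring.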