Taming an Ungoverned Legacy Data Lake and Extending Tag-based Authorization in a Heterogeneous Data Discovery Environment

Wednesday, May 22
2:00 PM - 2:40 PM
Marquis Salon 9

Comcast’s Streaming Data platform comprises ingest, transformation, and storage services in the public cloud, along with on-prem RDBMSs, EDWs, and a large, ungoverned legacy data lake. We use Apache Atlas for data discovery and lineage, relying heavily on its extensibility, which is unique in the industry. First we tackled the public cloud, including Kafka topics, Avro schemas, and S3 datasets. Next we integrated metadata and lineage for the on-prem datasets. More recently we added data-based ML approaches to duplicate elimination and the discovery of semantic equivalences. These are aimed primarily at taming the chaos of the legacy data lake and finding connections between that data lake and the EDW. We use Atlas/Ranger for tag-based authorization not only in the Hadoop environment but also in AWS S3, Presto, and other public cloud-based applications. We have built APIs that make it very easy for other groups within Comcast to push metadata and lineage to Atlas, removing our group as the bottleneck. All of our extensions to Atlas type definitions have been contributed to the Apache open source community.

Speaker

Barbara Eckman
Senior Principal Software Architect
Comcast
Barbara Eckman is a Senior Principal Software Architect at Comcast and a recognized innovator in Big Data architecture and governance. She leads data discovery and lineage platform architecture for a division-wide initiative comprising streaming, transforming, storing, and analyzing Big Data. Barbara is also the Lead Metadata Architect for the Comcast Privacy Program, an initiative tackling the challenge of legislation like the California Consumer Privacy Act. Her prior experience includes scientific data and model integration at the Human Genome Project, Merck, GlaxoSmithKline, and IBM, where she served on the peer-elected IBM Academy of Technology.