Comcast's Streaming Data platform comprises a variety of ingest, transformation, and storage services in the public cloud. Peer-reviewed Apache Avro schemas support end-to-end data governance. We have previously reported (DataWorks Summit 2017) on how we extended Atlas with custom entity and process types for discovery and lineage in the AWS public cloud. Custom lambda functions notify Atlas of creation of new entities and new lineage links via asynchronous kafka messaging.
Recently we were presented the challenge of providing integrated data discovery and lineage across our public cloud datasources and on-prem datasources, both Hadoop-based and traditional data warehouses and RDBMSs. Can Apache Atlas meet this challenge? A resounding yes! This talk will present our federated architecture, with Atlas providing SQL-like, free-text, and graph search across select metadata from all on-prem and public cloud data sources in our purview. Lightweight, custom connectors/bridges identify metadata/lineage changes in underlying sources and publish them to Atlas via the asynchronous API. A portal layer provides Atlas query access and a federation of UIs. Once data of interest is identified via Atlas queries, interfaces specific to underlying sources may be used for special-purpose metadata mining.
While metadata repositories for data discovery and lineage abound, none of them have built-in connectors and listeners for the entire complement of data sources that Comcast and many other large enterprises use to support their business needs. In-house-built solutions typically underestimate the cost of development and maintenance and often suffer from architecture-by-accretion. Atlas' commitment to extensibility, built-in provision of typed, free-text, and graph search, and REST and asynchronous APIs, position it uniquely in the build-vs-buy sweet spot.