Data Privacy at Scale

Wednesday, June 20
11:50 AM - 12:30 PM
Grand Ballroom 220A

Member privacy is of paramount importance to LinkedIn. The company must protect the sensitive data users provide. On the other hand, our members join LinkedIn to find each other, necessitating the sharing of certain data. This privacy paradox can only be addressed by giving users control over where and how their data is used. While this approach is extremely important, it also presents scaling challenges.

In this talk, we will discuss the challenges behind enforcing compliance at scale as well as LinkedIn's solution. Our comprehensive record-level offline compliance framework includes schema metadata tracking, alternate read-time views of the same dataset, physical purging of data on HDFS, and features that let users define custom filtering rules in SQL and assign those rules to specific datasets, groups of datasets, or use cases. We achieve this using many open-source projects such as Hadoop, Hive, Gobblin, and WhereHows, as well as a homegrown data access layer called Dali. We also show how the same Hadoop-powered framework can be used to enforce compliance on other stores such as Pinot, Salesforce, and Espresso.

While there is no one-size-fits-all solution for guaranteeing user data privacy, this talk will provide a blueprint and a concrete example of how to enforce compliance at scale, which we hope will prove useful to organizations working to improve their privacy commitments.

Issac Buenrostro
Staff Software Engineer
Issac works at LinkedIn on the data management team, which is in charge of ingestion, lifecycle, and compliance for most HDFS data, as well as providing tools for the big data ecosystem at LinkedIn. He is a core developer and committer for Apache Gobblin, a distributed big data integration framework for batch and streaming systems. His previous work focused on analytics for video streaming.

Anthony Hsu
Staff Software Engineer
Anthony is a Staff Software Engineer on the Hadoop Dev team at LinkedIn. He currently works on machine learning infrastructure. Previously, he worked on LinkedIn’s data access layer (Dali) and workflow scheduler (Azkaban).