Improving Data Security of Big Data Applications

Data security is a much-neglected aspect of big data management, and that neglect has adverse consequences. There are two reasons for it:

• Platforms such as Hadoop are, by design, complex distributed systems with many moving parts. This complexity leaves them vulnerable to a number of security threats, as evidenced by the recent spate of ransomware attacks.
• Big data management itself, as an operational practice, is still in its infancy. Consequently, there’s a dearth of sophisticated tools and methodologies to simplify data governance and compliance.

Given these problems, organizations typically resort to simple rule-based solutions that require constant manual supervision and intervention to safeguard their data from security threats and unauthorized access. However, owing to the complexity of big data workflows, this approach soon proves cumbersome, unreliable and error-prone.

In this talk, we will share our experiences in addressing the inadequacies of big data security using state-of-the-art algorithms from machine learning and artificial intelligence. While the applications are numerous, we will focus on demonstrating how systems for early threat detection and for detecting the sprawl of personally identifiable information (PII) can be built with open source technologies such as Spark and TensorFlow to secure critical data.
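
To make the idea of PII-sprawl detection concrete, here is a minimal sketch in plain Python. It is not the approach presented in the talk: it substitutes two hand-written regular expressions for a learned model, and the names `PII_PATTERNS`, `find_pii_sprawl` and the sample datasets are all hypothetical. It only illustrates the output such a system would aim for, namely a map from each PII type to every place where it turns up.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration only; a real detector would need
# far broader coverage and, as the talk proposes, a learned model.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii_sprawl(datasets):
    """Map each PII type to the sorted (dataset, column) pairs that contain it."""
    hits = defaultdict(set)
    for name, rows in datasets.items():
        for row in rows:
            for column, value in row.items():
                for pii_type, pattern in PII_PATTERNS.items():
                    if pattern.search(str(value)):
                        hits[pii_type].add((name, column))
    return {pii_type: sorted(locations) for pii_type, locations in hits.items()}


# Made-up sample data: PII is expected in the CRM table but has also
# leaked into a debug log, which is the "sprawl" worth flagging.
datasets = {
    "crm.customers": [{"id": 1, "contact": "ann@example.com"}],
    "logs.debug": [{"line": "login failed for ann@example.com, ssn 123-45-6789"}],
    "sales.orders": [{"order_id": 17, "total": "42.50"}],
}

report = find_pii_sprawl(datasets)
for pii_type, locations in sorted(report.items()):
    print(pii_type, locations)
```

At cluster scale, the same per-record check could presumably run inside a Spark job over each dataset, with the regular expressions replaced by a trained classifier.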