Just the sketch: advanced streaming analytics in Apache Metron

Wednesday, June 20
11:50 AM - 12:30 PM
Meeting Room 230C

Doing advanced analytics in streaming architectures presents unique challenges around the tradeoff between context and performance. Typically, performance and scalability requirements mandate that each message in a stream be operated on without the context of the messages that came before it. In this talk, we will show how sketching algorithms can be used to engineer a compromise that lets us consider historical state without compromising scalability.
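To make the compromise concrete, here is a minimal sketch (illustrative only, not Metron's API) of keeping historical context in constant space: Welford's algorithm maintains a running mean and variance per message without storing earlier messages.

```python
# Illustrative only: an online (single-pass) mean/variance tracker using
# Welford's algorithm. Class and method names here are our own invention.

class StreamingStats:
    """Tracks mean and variance of a stream in O(1) space per message."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        return self.m2 / self.n if self.n > 0 else 0.0


stats = StreamingStats()
for value in [10.0, 12.0, 11.0, 13.0, 9.0]:
    stats.add(value)
print(stats.mean)        # 11.0
print(stats.variance())  # 2.0
```

Each message updates the summary and is then discarded, which is exactly the shape of state a streaming topology can afford to keep.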

What we found analyzing the capabilities of many similar SIEMs and cybersecurity platforms is that a good portion of the advanced analytics boil down to simple rules enriched with the ability to do statistical baselining, set-existence, and set-cardinality computations. These operations are difficult to perform in-stream, so they are often done after the fact. We look at ways to open up these analytics to stream computation without sacrificing scalability.
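Set existence is a good example of an operation a sketch makes stream-friendly. The snippet below (a hand-rolled illustration, not Metron code) implements a small Bloom filter: membership checks in fixed memory, with no false negatives and a tunable false-positive rate.

```python
# Illustrative sketch of set existence via a Bloom filter.
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k independent bit positions by salting a single hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen_ips = BloomFilter()
seen_ips.add("10.0.0.1")
print(seen_ips.might_contain("10.0.0.1"))  # True
```

A rule like "have we seen this IP before?" then becomes a constant-time, constant-space check against the sketch rather than a lookup over all prior messages.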

Specifically, we will introduce the infrastructure built for Apache Metron to perform these kinds of tasks. We will cover the novel integration between Apache Storm and Apache HBase, orchestrated by a custom domain-specific language called Stellar, which takes the sting out of constructing sketches and using them for both simple and more advanced analytics, such as statistical outlier analysis in-stream.
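As a flavor of what statistical outlier analysis over a baseline can look like, here is a small example using the median absolute deviation (MAD), a common robust technique; this is an illustration in Python, not necessarily the exact method or API covered in the talk.

```python
# Illustrative only: flag outliers with a modified z-score based on the
# median absolute deviation (MAD). A real streaming deployment would feed
# this from a maintained sketch of the baseline rather than a raw window.
import statistics


def mad_outliers(window, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    med = statistics.median(window)
    mad = statistics.median(abs(x - med) for x in window)
    if mad == 0:
        return []
    # 0.6745 scales MAD to be comparable to a standard deviation
    return [x for x in window if abs(0.6745 * (x - med) / mad) > threshold]


print(mad_outliers([10, 11, 10, 12, 11, 95]))  # [95]
```

The appeal in-stream is that the baseline summary, not the raw history, is what gets stored and queried per message.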

Speaker

Casey Stella
VP of Data
FiniteState.io
I am the VP of Data at a cybersecurity IoT startup, FiniteState.io. I am also the Vice President of and an active committer on the Apache Metron project. In the past, I've worked as an architect and senior engineer at a healthcare informatics startup spun out of the Cleveland Clinic, as a developer at Oracle, and as a Research Geophysicist in the Oil & Gas industry. Before that, I was a poor graduate student in Math at Texas A&M. I primarily work with the Apache Hadoop software stack. I specialize in writing software and solving problems where there are scalability concerns due to large amounts of traffic or large amounts of data. I have a particular passion for data science problems, or anything vaguely mathematical. As a Principal Architect focused on data science, I spent time with a variety of clients, large and small, mentoring them and helping them use Hadoop to solve their problems.