How do you go from a strictly typed object-based streaming pipeline with simple operations to a structured streaming pipeline with higher order complex relational operations? This is what the Data Engineering team did at GoPro to scale up the development of streaming pipelines for the rapidly growing number of devices and applications.
When big data frameworks such as Hadoop first came to exist, developers were happy because we could finally process large amounts of data without writing complex multi-threaded code or worse yet writing complicated distributed code. Unfortunately, only very simple operations were available such as map and reduce. Almost immediately, higher level operations were desired similar to relational operations. And so Hive and dozens (hundreds?) of SQL-based big data tools became available for more developer-efficient batch processing of massive amounts of data.
In recent years, big data has moved from batch processing to stream-based processing since no one wants to wait hours or days to gain insights. Dozens of stream processing frameworks exist today and the same trend that occurred in the batch-based big data processing realm has taken place in the streaming world, so that nearly every streaming framework now supports higher level relational operations.
In this talk, we will discuss in a very hands-on manner how the streaming data pipelines for GoPro devices and apps have moved from the original Spark streaming with its simple RDD-based operations in Spark 1.x to Spark's structured streaming with its higher level relational operations in Spark 2.x. We will talk about the differences, advantages, and necessary pain points that must be addressed in order to scale relational-based streaming pipelines for massive IoT streams. We will also talk about moving from “hand built” Hadoop/Spark clusters running in the cloud to using a Spark-based cloud service.