Stream processing is a popular paradigm that is becoming more relevant as many applications provide low-latency response time and new application domains emerge that naturally demand data to be processed in motion. One particularly attractive characteristic of the stream-processing paradigm is that it conceptually unifies batch processing (bounded/static historic data) and continuous near-real-time data processing (unbounded streaming event data).
Implementing a unified batch and streaming data architecture is in practice not seamless—near-real-time event data and bulk historic data use different storage systems (messages queues or logs vs. file systems or object stores). Consequently, running the same analysis now and at some arbitrary time in the future (e.g., months, possibly years ahead) means dealing with different data sources and APIs. Few systems are capable of handling both near-real-time streaming workloads and large batch workloads at the same time. And streaming workloads tend to be inherently dynamic, requiring both storage and compute to adjust continuously for maximum resource efficiency.
In this talk, we present an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The combination of these two systems offers an unprecedented way of handling “everything as a stream,” while dynamically accommodating workload variations in a novel way. Pravega enables the ingestion capacity of a stream to grow and shrink according to workload and sends signals downstream to enable Flink to scale accordingly.
Pravega offers a permanent streaming storage, exposing an API than enables applications to access data in either near-real time or at any arbitrary time in the future in a uniform fashion. Apache Flink’s SQL and streaming APIs provide a common interface for processing continuous near-real-time data, sets of historic data, or combinations of both. A deep integration between these two systems gives end-to-end exactly-once semantics for pipelines of streams and stream processing and lets both systems jointly scale and adjust automatically to changing data rates.