Stream processing consists of ingesting and processing continuously generated data, often from end users of web applications or from more demanding settings where devices such as servers and sensors emit events at a high rate. Such scenarios demand a software stack that can scale and accommodate changes in the characteristics of the application.
One of the major challenges in processing data streams is adapting to workload variations (e.g., due to daily cycles or growth in the population of sources). Systems that ingest stream data typically parallelize ingestion by sharding incoming messages and events according to a routing key. Parallelizing ingestion is very effective, but future changes to the workload, which are very often unknown beforehand, can make the initial degree of parallelism inadequate even for short-term spikes. Consequently, the ability to scale by adapting parallelism to the workload, while preserving important API properties such as per-key order, is highly desirable for mission-critical workloads.
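To make the sharding idea concrete, the following is a minimal, self-contained sketch (not Pravega's actual API; all names are illustrative) of routing-key sharding. Every event carrying the same routing key maps to the same shard, so per-key order is preserved as long as each shard processes its events sequentially; note also that changing the shard count reassigns keys, which is why scaling while keeping order requires care.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of routing-key sharding. Events with the same key
// always land on the same shard, which preserves per-key order when each
// shard consumes its events in arrival order.
public class KeyedSharding {
    static int shardFor(String routingKey, int parallelism) {
        // Map a (possibly negative) hash onto the current shard count.
        return Math.floorMod(routingKey.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        Map<Integer, StringBuilder> shards = new HashMap<>();
        String[][] events = {
            {"sensor-1", "t=0"}, {"sensor-2", "t=1"},
            {"sensor-1", "t=2"}, {"sensor-1", "t=3"}
        };
        for (String[] e : events) {
            int shard = shardFor(e[0], parallelism);
            shards.computeIfAbsent(shard, k -> new StringBuilder())
                  .append(e[0]).append(':').append(e[1]).append(' ');
        }
        // All sensor-1 events sit on one shard, in arrival order.
        System.out.println(shards);
    }
}
```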
In this presentation, we explain how to accommodate changes to workloads in and with Pravega, an open-source stream store built to ingest and serve stream data. Pravega manipulates and stores segments (append-only byte sequences) and forms streams by creating and composing segments; this composition is what enables the scaling of streams. Stream scaling in Pravega is automatic and transparent to the application, but a change in ingestion volume might also require the application to scale its downstream resources (e.g., the operators of an Apache Flink job) to accommodate the new volume. Pravega signals such changes to the application so that it can react accordingly. The cooperation between Pravega and the downstream application is crucial for building an effective stream data pipeline.
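The segment-composition idea above can be sketched as key-range splitting. This is an illustrative model, not Pravega's implementation: routing keys hash into the unit interval [0.0, 1.0), each segment owns a sub-range, and scaling up replaces one hot segment with two successors that split its range. Per-key order can be preserved across the split because a key's hash still falls into exactly one successor segment.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of stream scaling by key-range splitting.
// All class and method names here are hypothetical.
public class KeyRangeScaling {
    static class Segment {
        final double low, high; // owns keys hashing into [low, high)
        Segment(double low, double high) { this.low = low; this.high = high; }
    }

    static double hashToUnitInterval(String routingKey) {
        return Math.floorMod(routingKey.hashCode(), 1_000_000) / 1_000_000.0;
    }

    static Segment segmentFor(List<Segment> segments, String key) {
        double h = hashToUnitInterval(key);
        for (Segment s : segments) {
            if (h >= s.low && h < s.high) return s;
        }
        throw new IllegalStateException("segment ranges must cover [0, 1)");
    }

    // Scale up: replace one segment with two successors covering its range.
    static List<Segment> scaleUp(List<Segment> segments, Segment victim) {
        List<Segment> next = new ArrayList<>(segments);
        next.remove(victim);
        double mid = (victim.low + victim.high) / 2.0;
        next.add(new Segment(victim.low, mid));
        next.add(new Segment(mid, victim.high));
        return next;
    }

    public static void main(String[] args) {
        List<Segment> epoch0 = new ArrayList<>();
        epoch0.add(new Segment(0.0, 1.0));                     // one segment owns the whole key space
        List<Segment> epoch1 = scaleUp(epoch0, epoch0.get(0)); // a hot segment splits in two

        // The same key still resolves to exactly one segment after the split.
        Segment s = segmentFor(epoch1, "sensor-42");
        System.out.printf("sensor-42 -> [%.2f, %.2f)%n", s.low, s.high);
    }
}
```

A downstream consumer (such as a Flink job) that learns of such a split can add a reader for the new segment, which is the kind of signaling-and-reaction cooperation the abstract describes.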