Glossary
Handling Late-Arriving Data in Streaming
Learn how to handle late-arriving data in stream processing systems. Understand event time vs processing time, watermarks, and strategies for managing out-of-order events in Kafka, Flink, and other streaming platforms.
Handling Late-Arriving Data in Streaming
One of the most challenging aspects of stream processing is dealing with data that arrives later than expected. Unlike batch processing where you have a complete dataset before computation begins, streaming systems must make decisions about when to produce results even as new data continues to arrive. Understanding how to handle late-arriving data is essential for building accurate and reliable streaming applications.
Understanding Event Time vs Processing Time
The root of the late data problem lies in the difference between two fundamental time concepts in streaming systems.
Event time is when an event actually occurred in the real world. For example, when a sensor reading was taken, when a user clicked a button, or when a financial transaction was initiated. This timestamp is typically embedded in the event payload itself.
Processing time is when the streaming system processes the event. This is when your Kafka consumer reads the message or when your Flink job processes the record.
In an ideal world, events would arrive in perfect event time order, and processing time would closely match event time. In reality, network delays, system outages, mobile devices going offline, clock skew across distributed systems, and various other factors cause events to arrive out of order or significantly delayed.
Consider an IoT deployment where thousands of sensors monitor industrial equipment. Sensors in areas with poor connectivity might buffer readings locally and transmit them in batches when connectivity improves. A temperature reading from 10 minutes ago might arrive after readings from 5 minutes ago, creating challenges for time-based aggregations and alerts.
Common Causes of Late-Arriving Data
Late data occurs for many reasons in distributed systems:
Network issues are perhaps the most common cause. Temporary connectivity problems, packet loss, or routing issues can delay message delivery. Mobile applications are particularly susceptible, as devices move between networks or lose connectivity entirely.
System failures and retries also contribute to late data. When a producer crashes and restarts, it might resend events that were already processed by downstream consumers. Kafka's at-least-once delivery guarantees mean duplicate or delayed messages are possible.
Distributed clock skew can cause events to appear late even when they arrive promptly. If different systems have clocks that are out of sync, an event timestamped on one server might appear to be from the past or future when processed on another server.
Data corrections and updates represent intentional late data. Financial systems often receive transaction adjustments or corrections hours or days after the initial transaction. These late corrections must be incorporated into historical calculations.
Strategies for Handling Late Data
Modern stream processing frameworks provide several mechanisms for handling late-arriving data while maintaining result accuracy.
Watermarks
Watermarks are the primary mechanism for tracking event time progress in streaming systems. A watermark is a special signal that indicates "all events with timestamps less than T have been seen." More precisely, a watermark represents a heuristic estimate of event time progress.
For example, a watermark of 10:15:00 suggests that the system has likely received all events with timestamps before 10:15:00. When a window computation spans 10:00:00 to 10:15:00, the system can trigger the computation and produce results once the watermark advances past 10:15:00.
Watermarks are necessarily heuristic because stream processing systems cannot know with certainty whether more late data will arrive. The watermark strategy must balance completeness (waiting longer to include more late data) against latency (producing results quickly).
Apache Flink provides flexible watermark strategies including periodic watermarks generated at regular intervals and punctuated watermarks based on special markers in the stream. Kafka Streams automatically tracks stream time as the maximum timestamp seen so far.
Allowed Lateness
Even with watermarks, some data will inevitably arrive after the watermark has passed. The allowed lateness parameter defines a grace period during which late data is still incorporated into computations.
When a window closes based on the watermark, the system doesn't immediately discard the window state. Instead, it keeps the window open for the allowed lateness period. Late events arriving during this grace period trigger recomputation and emission of updated results.
For instance, with an allowed lateness of 5 minutes, an event with timestamp 10:12:00 arriving at 10:21:00 (after the 10:15:00 watermark but within the grace period) would still update the 10:00-10:15 window and emit a corrected result.
Side Outputs and Late Data Streams
Some data arrives so late that it falls outside even the allowed lateness window. Rather than silently dropping this data, modern frameworks allow routing it to side outputs or separate late data streams.
This pattern enables multiple strategies for handling extremely late data:
Log late data for auditing and debugging
Store late data in a separate system for batch reprocessing
Trigger alerts when late data volumes exceed thresholds
Merge late data results in downstream systems that can handle corrections
Implementation in Streaming Frameworks
Different streaming platforms provide various approaches to handling late data.
Apache Flink offers comprehensive event time support with customizable watermark generators, allowed lateness configuration, and side outputs for late data. Flink's DataStream API makes event time processing explicit, requiring developers to specify timestamp extraction and watermark generation strategies.
Kafka Streams provides built-in support for windowed operations with configurable grace periods. The framework automatically manages watermarks based on stream time and allows specifying how long windows remain open for late data. Kafka Streams also supports suppression operators that can wait for late data before emitting final results.
Apache Beam (and cloud services like Google Dataflow) pioneered many late data handling concepts, including the separation of event time and processing time, watermarks, and triggers. Beam's model allows precise control over when results are materialized and how late data updates are handled.
Monitoring and Observability
Effective late data handling requires good observability. Key metrics to track include:
Watermark lag measures the difference between current event time (watermark position) and current processing time. Growing watermark lag indicates the system is falling behind in processing events.
Late event count and latency tracks how many events arrive after watermarks and how late they are. Sudden increases might indicate upstream problems or changing data patterns.
Window trigger frequency for recomputation shows how often late data causes windows to re-emit results. Excessive recomputation might indicate watermark strategies need tuning.
Monitoring event timestamp distributions across topics can reveal producers with clock skew or connectivity problems.
Best Practices
When designing systems to handle late-arriving data, consider these guidelines:
Choose appropriate watermark strategies based on your data characteristics. If events arrive mostly in order with occasional stragglers, aggressive watermarks with moderate allowed lateness work well. For highly out-of-order data, more conservative watermarks or longer grace periods may be necessary.
Set allowed lateness based on business requirements, not arbitrary values. Analyze your actual late data patterns and balance accuracy needs against state management costs. Longer allowed lateness means more state to maintain.
Design downstream systems to handle updates. Since late data can trigger result recomputation, consumers of your streaming results must gracefully handle updated values for previously computed time windows.
Monitor and alert on late data patterns. Sudden changes in late data frequency or severity often indicate infrastructure problems, upstream failures, or data quality issues that require investigation.
Test with realistic late data scenarios. Many streaming application bugs only appear when processing out-of-order or late-arriving data. Include late data scenarios in your testing strategy.
Summary
Handling late-arriving data is a fundamental challenge in stream processing that distinguishes it from batch processing. The difference between event time and processing time, combined with the realities of distributed systems, means data will inevitably arrive out of order or delayed.
Modern streaming frameworks provide robust mechanisms for handling late data through watermarks, allowed lateness, and side outputs. Watermarks enable event time-based computations by providing heuristic estimates of stream progress. Allowed lateness offers a grace period for incorporating late events into computations. Side outputs ensure extremely late data isn't lost.
Successful late data handling requires understanding your data patterns, choosing appropriate watermark strategies, configuring reasonable allowed lateness based on business requirements, and maintaining good observability. By designing systems that explicitly account for late-arriving data rather than ignoring it, you can build streaming applications that are both timely and accurate.
Sources and References
Akidau, Tyler, et al. "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing." VLDB Endowment, 2015. https://research.google/pubs/pub43864/
Apache Flink Documentation. "Event Time and Watermarks." https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/time/
Akidau, Tyler. "Streaming 101: The world beyond batch." O'Reilly Radar, 2015. https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/
Apache Kafka Documentation. "Kafka Streams - Time." Confluent Documentation. https://docs.confluent.io/platform/current/streams/concepts.html#time
Kleppmann, Martin. "Designing Data-Intensive Applications." O'Reilly Media, 2017. Chapter 11: Stream Processing.