Event Time and Watermarks in Flink
Master event time processing and watermark strategies in Apache Flink. Learn how to handle late-arriving data, configure watermark generators, and build robust streaming applications that process events based on when they actually occurred.
In stream processing, understanding when events occurred versus when they are processed is fundamental to building accurate real-time applications. Apache Flink provides sophisticated mechanisms for handling event time semantics through watermarks, enabling developers to build temporal processing pipelines that handle out-of-order data and late arrivals.
Understanding Time Semantics in Stream Processing
Flink supports three distinct notions of time:
Processing Time refers to the wall-clock time of the machine executing the streaming operation. While simple to implement, processing time offers no guarantees about determinism or correctness when events arrive out of order.
Event Time represents the timestamp embedded in the event itself, indicating when the event actually occurred at the source. This semantic provides deterministic results regardless of when events arrive at the processing system.
Ingestion Time captures the timestamp when an event enters the Flink system. It sits between processing time and event time in terms of determinism and complexity.
For applications requiring accurate temporal reasoning—such as sessionization, fraud detection, or time-based aggregations—event time is essential. However, event time processing introduces complexity: how does the system know when all events for a given time window have arrived?
Watermarks: The Event Time Progress Indicator
Watermarks are Flink's mechanism for measuring progress in event time. A watermark carrying timestamp t is an assertion that no further events with timestamps less than or equal to t are expected; events that arrive in violation of this assertion are treated as late. Watermarks are what allow Flink to trigger time-based operations like window computations.
Watermarks flow through the streaming topology as special records. When an operator receives a watermark, it can:
Trigger computations for windows whose end time is before the watermark
Emit results for completed windows
Forward the watermark downstream
The key challenge is balancing two competing concerns:
Too aggressive watermarks (advancing time too quickly) may cause late data to be dropped or misclassified
Too conservative watermarks (advancing time too slowly) increase latency as the system waits longer before triggering computations
Watermark Strategies and Generators
Flink provides several built-in watermark strategies and allows custom implementations.
Bounded Out-of-Orderness
The most common strategy assumes events arrive within a maximum delay:
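A minimal sketch of this strategy, assuming a hypothetical ClickEvent POJO with a getTimestampMillis() accessor:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// ClickEvent and getTimestampMillis() are placeholder names.
WatermarkStrategy<ClickEvent> strategy = WatermarkStrategy
        // Tolerate events that arrive up to 5 seconds out of order.
        .<ClickEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        // Tell Flink where the event-time timestamp lives in each record.
        .withTimestampAssigner((event, recordTimestamp) -> event.getTimestampMillis());
```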
This strategy generates watermarks that lag behind the maximum observed timestamp by the specified duration (5 seconds here). Events arriving more than 5 seconds late will be considered late data.
Monotonous Timestamps
For sources that emit events in strictly increasing timestamp order:
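A sketch using the same hypothetical ClickEvent type:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// Watermarks track the latest timestamp exactly; no out-of-orderness allowance.
WatermarkStrategy<ClickEvent> strategy = WatermarkStrategy
        .<ClickEvent>forMonotonousTimestamps()
        .withTimestampAssigner((event, recordTimestamp) -> event.getTimestampMillis());
```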
This strategy generates watermarks equal to the latest observed timestamp, suitable for sources like database change streams or ordered log files.
Custom Watermark Generators
For complex scenarios, implement custom watermark logic:
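One possible sketch: a periodic generator that reimplements bounded out-of-orderness by hand (ClickEvent remains a placeholder type):

```java
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class BoundedLagGenerator implements WatermarkGenerator<ClickEvent> {

    private static final long MAX_LAG_MS = 5_000L;

    // Start low enough that the first emitted watermark cannot underflow.
    private long maxTimestamp = Long.MIN_VALUE + MAX_LAG_MS + 1;

    @Override
    public void onEvent(ClickEvent event, long eventTimestamp, WatermarkOutput output) {
        // Track the highest event timestamp observed so far.
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // Emit a watermark that lags the maximum timestamp by the bound.
        output.emitWatermark(new Watermark(maxTimestamp - MAX_LAG_MS - 1));
    }
}
```

The generator is wired into a strategy via forGenerator():

```java
WatermarkStrategy<ClickEvent> strategy = WatermarkStrategy
        .<ClickEvent>forGenerator(ctx -> new BoundedLagGenerator())
        .withTimestampAssigner((event, recordTimestamp) -> event.getTimestampMillis());
```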
Flink calls onPeriodicEmit() at regular intervals (configured via pipeline.auto-watermark-interval), allowing the generator to emit watermarks based on accumulated state.
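For example, assuming env is the job's StreamExecutionEnvironment, the interval can also be set programmatically (Flink's default is 200 ms):

```java
// Same setting as pipeline.auto-watermark-interval; 200 ms is the default.
env.getConfig().setAutoWatermarkInterval(200);
```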
Handling Late Data
Despite watermark strategies, some events will inevitably arrive after their corresponding windows have been triggered. Flink provides several mechanisms for late data handling:
Allowed Lateness
Windows can be configured to accept late data for a specified duration after the watermark passes:
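A sketch on a keyed tumbling window, with hypothetical getUserId() and getTimestampMillis() accessors:

```java
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// clicks: a DataStream<ClickEvent> with timestamps and watermarks assigned.
SingleOutputStreamOperator<ClickEvent> latestPerUser = clicks
        .keyBy(ClickEvent::getUserId)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        // Keep window state for 30 extra seconds after the watermark passes
        // the window end; a late event inside that bound re-fires the window.
        .allowedLateness(Time.seconds(30))
        .reduce((a, b) -> a.getTimestampMillis() >= b.getTimestampMillis() ? a : b);
```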
When late data arrives within the allowed lateness period, Flink re-triggers the window computation and emits updated results. This enables correction of previously emitted results, at the cost of extra complexity downstream: consumers must be prepared to receive more than one result for the same window.
Side Outputs
For auditing or alternative processing, late data can be redirected to a side output:
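A sketch extending the window above with a side output for events that miss even the allowed lateness:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

// Anonymous subclass so the element type survives erasure.
final OutputTag<ClickEvent> lateTag = new OutputTag<ClickEvent>("late-clicks") {};

SingleOutputStreamOperator<ClickEvent> results = clicks
        .keyBy(ClickEvent::getUserId)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        // Events too late even for allowedLateness are routed to the tag.
        .sideOutputLateData(lateTag)
        .reduce((a, b) -> a.getTimestampMillis() >= b.getTimestampMillis() ? a : b);

// Retrieve the late events for auditing or compensating logic.
DataStream<ClickEvent> lateClicks = results.getSideOutput(lateTag);
```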
This approach allows separate handling of late events—perhaps logging them for analysis or applying compensating logic.
Integration with Kafka and the Streaming Ecosystem
When consuming from Apache Kafka, watermark configuration becomes critical for end-to-end correctness. Kafka topics often contain out-of-order events due to:
Multiple producers writing concurrently
Network delays between producers and brokers
Partition-level ordering but not global ordering
Flink's Kafka connector supports per-partition watermark generation. Each partition maintains its own watermark, and Flink merges them by taking the minimum across all partitions:
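A sketch with Flink's KafkaSource; the broker address, topic, and group id are placeholders:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")      // placeholder address
        .setTopics("clicks")                     // placeholder topic
        .setGroupId("click-analytics")           // placeholder group id
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

// The strategy passed to fromSource() runs per Kafka partition (split);
// the source's output watermark is the minimum across its partitions.
// Without an explicit timestamp assigner, the Kafka record timestamp is used.
DataStream<String> clicksRaw = env.fromSource(
        source,
        WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(5)),
        "kafka-clicks");
```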
For visibility into watermark propagation and late data patterns, streaming governance platforms provide monitoring capabilities that surface metrics about event time lag, watermark advancement, and late event rates across your streaming topology. This observability is crucial for tuning watermark strategies in production environments.
Watermark Alignment and Parallel Streams
In complex topologies with multiple input streams, watermark alignment becomes important. Flink takes the minimum watermark across all inputs to an operator, ensuring no operator processes data beyond any input's event time progress.
This conservative approach prevents incorrect results but can cause pipeline stalls if one partition or source falls behind. Flink 1.15+ introduced watermark alignment features that can pause faster sources to prevent excessive skew:
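A sketch of the alignment API available on WatermarkStrategy since Flink 1.15; the group name and bounds are illustrative:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;

WatermarkStrategy<ClickEvent> aligned = WatermarkStrategy
        .<ClickEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        // Sources sharing the "clicks-group" label are aligned: a source whose
        // watermark runs more than 20 s ahead of the group minimum is paused.
        // Drift is re-evaluated once per second.
        .withWatermarkAlignment("clicks-group", Duration.ofSeconds(20), Duration.ofSeconds(1));
```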
This configuration groups sources into alignment groups and limits the maximum drift between aligned sources, improving throughput while maintaining correctness.
Best Practices and Edge Cases
Idle Sources
When a partition or source stops emitting events, its watermark stops advancing, potentially stalling downstream operators. Flink's withIdleness() configuration allows sources to be marked idle:
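A sketch combining idleness detection with the bounded out-of-orderness strategy from earlier:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;

WatermarkStrategy<ClickEvent> strategy = WatermarkStrategy
        .<ClickEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        // After 1 minute without events, mark the source/partition idle so it
        // no longer holds back the minimum watermark downstream.
        .withIdleness(Duration.ofMinutes(1));
```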
After one minute without events, the source is considered idle and its watermark is ignored when computing the minimum across inputs.
Debugging Watermark Issues
Common symptoms of watermark problems include:
Windows never triggering: Watermarks not advancing (check timestamp extraction and source idleness)
Excessive late data: Watermark advancing too aggressively (increase out-of-orderness bound)
High latency: Watermark advancing too conservatively (reduce out-of-orderness bound or check for straggling partitions)
Flink exposes watermark metrics per operator out of the box (such as currentInputWatermark, visible in the web UI); use these together with targeted logging to diagnose such issues in production.
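One lightweight diagnostic is a pass-through probe that compares each event's timestamp against the operator's current watermark. This is a sketch (reusing the hypothetical ClickEvent type), not a substitute for proper metrics:

```java
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Pass-through operator for diagnosing watermark behavior. Assumes
// timestamps have been assigned upstream (ctx.timestamp() is non-null).
public class WatermarkProbe extends ProcessFunction<ClickEvent, ClickEvent> {

    @Override
    public void processElement(ClickEvent event, Context ctx, Collector<ClickEvent> out) {
        long watermark = ctx.timerService().currentWatermark();
        long eventTs = ctx.timestamp();
        if (eventTs < watermark) {
            // The event arrived behind the watermark: it is late.
            // Replace with a real logger in production.
            System.err.printf("late event: ts=%d watermark=%d lag=%dms%n",
                    eventTs, watermark, watermark - eventTs);
        }
        out.collect(event);
    }
}
```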
Summary
Event time processing with watermarks enables Flink applications to handle out-of-order data and produce accurate temporal results. Key takeaways:
Event time semantics provide deterministic results based on when events actually occurred, essential for temporal reasoning
Watermarks signal progress in event time, allowing time-based operations to trigger while balancing latency and completeness
Watermark strategies must be tuned based on source characteristics: use bounded out-of-orderness for typical Kafka scenarios, monotonous timestamps for ordered sources, and custom generators for complex patterns
Late data handling through allowed lateness and side outputs provides flexibility for correcting results or auditing late arrivals
Kafka integration requires careful consideration of per-partition watermarks and alignment strategies
Observability through monitoring tools helps track watermark propagation and tune strategies in production
Mastering event time and watermarks is essential for building production-grade streaming applications that handle real-world data characteristics while maintaining correctness and performance.