How Data Streaming Actually Works

Streaming processes each event as it arrives, so a user action can show up on a dashboard within about 5 seconds instead of waiting for an hourly or nightly batch job. This guide covers the core concepts (windowing, watermarks, exactly-once) behind tools such as Kafka, Kinesis, and Flink.

Batch vs Streaming

Batch:
  - Process big chunks of data at once
  - "Sum today's sales at midnight" once a day
  - Tools: cron + Spark / Hive / dbt

Streaming:
  - Process each record as it arrives
  - "Active users in the last 5 min" refreshed every second
  - Tools: Kafka / Kinesis (transport) + Flink / Kafka Streams / Spark Streaming (compute)

Trade-off:
  - Batch: simple, accurate results, efficient on large datasets
  - Streaming: low latency, more operationally complex, watermark / late-event challenges

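→ The Lambda Architecture runs both side by side: a batch layer for accuracy, a streaming layer for latency.
   Modern trend: Kappa (one streaming pipeline handles both, reprocessing history by replaying the log) for simplicity.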
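
The contrast can be sketched in plain Python with no broker or framework involved. The `sales` data and both function names are made up for illustration; a real pipeline would read from a file or a topic instead of a list.

```python
from typing import Iterable, Iterator, List


def batch_total(records: List[float]) -> float:
    """Batch: wait until the whole dataset exists, then process it in one pass."""
    return sum(records)


def streaming_totals(records: Iterable[float]) -> Iterator[float]:
    """Streaming: update the result as each record arrives and emit it right away."""
    total = 0.0
    for amount in records:
        total += amount
        yield total


sales = [10.0, 25.0, 5.0]  # hypothetical sale amounts
print(batch_total(sales))             # 40.0, available only at the end
print(list(streaming_totals(sales)))  # [10.0, 35.0, 40.0], one update per record
```

The final streaming value equals the batch answer; the difference is how early partial answers become visible.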

Streaming's Core Components

Transport (message broker):
- Kafka — partitions + consumer groups + persistent log
- Kinesis (AWS) — managed Kafka-like
- Google Pub/Sub — managed, push and pull subscriptions
- RabbitMQ — queue-centric (AMQP)
- Pulsar — multi-tenant, geo-replication
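
To make "partitions + persistent log" concrete, here is a toy in-memory sketch. It is not Kafka's implementation: `PartitionedLog` is a hypothetical class, and `zlib.crc32` stands in for whatever hash the real broker client uses. The point it shows is that records with the same key land in the same partition, in order, and that readers address them by offset.

```python
import zlib
from typing import List, Tuple


class PartitionedLog:
    """Toy append-only log split into partitions (illustration only)."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[Tuple[str, str]]] = [[] for _ in range(num_partitions)]

    def partition_for(self, key: str) -> int:
        # Stable hash so the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key: str, value: str) -> Tuple[int, int]:
        """Append a record and return (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition: int, offset: int) -> List[Tuple[str, str]]:
        """Records are never removed by reading; a consumer just remembers its offset."""
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=3)
p, first_offset = log.append("user-1", "login")
log.append("user-1", "click")
print(log.read(p, first_offset))  # both user-1 records, in append order
```

In a consumer group, each partition would be assigned to one consumer, which is why per-key ordering is preserved but global ordering is not.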

Compute (stream processing):
- Kafka Streams — Java/Scala library that runs inside your application (no separate processing cluster)
- Flink — stateful event-time processing, exactly-once state, low latency
- Spark Structured Streaming — micro-batch on the Spark engine
- Beam — abstraction over Flink / Dataflow / Spark

Sinks (storage / external):
- Warehouse (BigQuery, Snowflake)
- OLAP store (ClickHouse, Druid)
- Another Kafka topic (chain)
- Operational DB (the reverse direction of CDC: stream → DB)

Windowing — Grouping by Time

How does a query like "sum over the last 5 minutes" work on a stream?

Three windowing patterns:

1. Tumbling (fixed-size, no overlap)
   |---1min---|---1min---|---1min---|
   Each window = 1 minute, no overlap.
   "Requests per minute" — simple.

2. Sliding (overlapping)
   |--5min--|
       |--5min--|
           |--5min--|
   window=5min, slide=1min → a new result every minute, each covering the last 5 minutes.
   Every event belongs to 5 overlapping windows.

3. Session (user-activity based)
   30s idle after a user's last event ends the session.
   Dynamic window size per user.

Choose:
- Tumbling: simple metrics (per-minute counter)
- Sliding: smooth metrics (rolling average)
- Session: user behavior (web session, game session)
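
The three patterns above come down to how an event timestamp is mapped to one or more windows. A minimal sketch in plain Python follows, with timestamps as integer seconds and windows as half-open `[start, end)` pairs. The function names are illustrative, not taken from any framework, and treating a gap of exactly `gap` seconds as a new session is a choice made here.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end) in seconds


def tumbling_window(ts: int, size: int) -> Window:
    """The single fixed-size window that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, slide: int) -> List[Window]:
    """Every window of length `size`, starting on a multiple of `slide`, that contains ts."""
    windows = []
    start = ts - ts % slide          # latest window start at or before ts
    while start > ts - size:         # earlier starts still cover ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def session_windows(timestamps: List[int], gap: int) -> List[Window]:
    """Group one user's events into sessions that end after `gap` idle seconds."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)  # still within the idle gap: same session
        else:
            sessions.append([ts])    # idle for too long: start a new session
    return [(s[0], s[-1] + gap) for s in sessions]


print(tumbling_window(75, size=60))                    # (60, 120)
print(len(sliding_windows(400, size=300, slide=60)))   # 5
print(session_windows([0, 10, 20, 100, 110], gap=30))  # [(0, 50), (100, 140)]
```

Tumbling is the special case of sliding where `slide == size`, which is why each event then falls into exactly one window.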

Watermarks — Handling Late Events

Problem: events can arrive long after they occurred (left column = arrival time).

09:00:00 — A
09:00:01 — B
09:00:02 — C
09:00:30 — Z (actually a 09:00:01 event, network lag)

Window "09:00:00 ~ 09:00:10":
- Close at 09:00:10? Z is dropped
- Wait forever? Can never close

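Watermark solution:
- "All events with an event time before this point are assumed to have arrived"
- watermark = max(event_time seen so far) - maxOutOfOrderness
- When the watermark passes the window end → close + emit
- allowedLateness (Flink) is a separate, optional setting: extra time after that point during which the window still accepts late events and re-emits

Example: maxOutOfOrderness = 30s
   After C arrives, watermark = 09:00:02 - 30s = 08:59:32
   → window 09:00:00 ~ 09:00:10 stays open (watermark < 09:00:10)
   Z arrives at 09:00:30 with event time 09:00:01 → window still open, Z is added
   Once an event with event time 09:00:40 arrives, watermark = 09:00:10
   → window closes; Z is included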
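
The mechanics can be simulated in a few lines. This is a sketch under stated assumptions, not a framework API: timestamps are seconds after 09:00:00, events are given in arrival order, `max_out_of_orderness` is the delay subtracted from the largest event time seen, and event `D` is a hypothetical later event added only to push the watermark forward.

```python
from typing import List, Tuple


def run_window(
    events: List[Tuple[int, str]],
    window: Tuple[int, int],
    max_out_of_orderness: int,
) -> Tuple[List[str], List[str]]:
    """Feed (event_time, name) pairs in arrival order through one window.

    Returns (names counted in the window, names that arrived after it closed).
    """
    start, end = window
    max_event_time = float("-inf")
    closed = False
    included: List[str] = []
    late: List[str] = []
    for event_time, name in events:
        if start <= event_time < end:
            (late if closed else included).append(name)
        # The watermark trails the largest event time seen so far.
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - max_out_of_orderness
        if watermark >= end:
            closed = True  # the window would emit its result here
    return included, late


# D (event time 12s) arrives before the delayed Z (event time 1s).
arrivals = [(0, "A"), (1, "B"), (2, "C"), (12, "D"), (1, "Z")]
print(run_window(arrivals, (0, 10), max_out_of_orderness=0))  # (['A', 'B', 'C'], ['Z'])
print(run_window(arrivals, (0, 10), max_out_of_orderness=5))  # (['A', 'B', 'C', 'Z'], [])
```

The same input gives different results depending on the delay: a larger delay tolerates more disorder but holds every window open longer, so results are emitted later.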

Late-event policies:
- discard (default)
- side output (separate sink, late-report)
- accumulate + retract (correct earlier result)
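
A small sketch of what each policy produces for one late event, assuming the window had already emitted a count. The policy names and the `("retract", n)` / `("emit", n)` message shapes are invented for illustration; real engines encode retractions in their own formats.

```python
from typing import List, Tuple


def handle_late_event(
    policy: str, previous_count: int, late_event: str
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Return (main output, side output) produced by one late event."""
    if policy == "discard":
        return [], []                      # the event is silently dropped
    if policy == "side_output":
        return [], [late_event]            # routed to a separate sink for auditing
    if policy == "retract":
        # Withdraw the earlier result, then emit the corrected one.
        return [("retract", previous_count), ("emit", previous_count + 1)], []
    raise ValueError(f"unknown policy: {policy}")


for policy in ("discard", "side_output", "retract"):
    print(policy, handle_late_event(policy, previous_count=3, late_event="Z"))
```

Retraction only helps if the downstream sink can apply corrections, for example by upserting on the window key.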

Exactly-Once Semantics — Why It's Hard

"Process exactly once" is one of the hardest guarantees to provide in a distributed system, because any component can fail between doing the work and acknowledging it.

Guarantees:
- at-most-once: send and forget. Possible loss.
- at-least-once: retry until ack. Possible duplicates.
- exactly-once: no loss + no duplicates.

Challenge: cover failures in 4 layers (producer / broker / consumer / sink).

Kafka's exactly-once (0.11+):
- Idempotent producer (producer ID + sequence number)
- Transactional API (multi-partition writes atomic)
- Consumers with isolation.level=read_committed see only records from committed transactions
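
The idempotent-producer idea (producer ID + sequence number) can be sketched as follows. This is a simplified model with a single partition and a hypothetical `Broker` class, not Kafka's actual protocol: the broker remembers the last sequence number per producer and ignores a retry it has already appended.

```python
from typing import Dict, List


class Broker:
    """Toy single-partition log that de-duplicates producer retries."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self.last_seq: Dict[str, int] = {}  # producer_id -> last appended sequence

    def append(self, producer_id: str, seq: int, value: str) -> bool:
        """Return True if the record was appended, False if it was a duplicate."""
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return False  # retry of a record already in the log: ack, do not append
        if seq != last + 1:
            raise ValueError("gap in sequence numbers: a record was lost")
        self.log.append(value)
        self.last_seq[producer_id] = seq
        return True


broker = Broker()
broker.append("producer-1", 0, "a")
broker.append("producer-1", 1, "b")
broker.append("producer-1", 1, "b")  # retry after a lost ack
print(broker.log)  # ['a', 'b']
```

This turns at-least-once sending into exactly-once appending for one producer session; atomic writes across partitions need the transactional API on top.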

Flink's exactly-once:
- Checkpoints (distributed snapshots) + state recovery
- Two-phase commit sinks (Kafka, file system, JDBC)
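
A rough sketch of how checkpoints and a two-phase commit sink fit together. The class below is a toy written for this article, not Flink's API: writes go into an open transaction, a checkpoint pre-commits them, and they become visible only once the checkpoint is confirmed. On failure, uncommitted writes are discarded and the input is replayed from the last checkpoint.

```python
from typing import List


class TwoPhaseCommitSink:
    """Toy transactional sink (illustration only)."""

    def __init__(self) -> None:
        self.committed: List[str] = []      # visible to downstream readers
        self.open_txn: List[str] = []       # written since the last checkpoint
        self.pre_committed: List[str] = []  # flushed, waiting for checkpoint confirmation

    def write(self, record: str) -> None:
        self.open_txn.append(record)

    def pre_commit(self) -> None:
        """Phase 1: the checkpoint barrier arrived, so flush the open transaction."""
        self.pre_committed.extend(self.open_txn)
        self.open_txn = []

    def commit(self) -> None:
        """Phase 2: the checkpoint completed everywhere, so make the data visible."""
        self.committed.extend(self.pre_committed)
        self.pre_committed = []

    def abort(self) -> None:
        """Failure before confirmation: drop everything that is not committed."""
        self.open_txn = []
        self.pre_committed = []


sink = TwoPhaseCommitSink()
sink.write("a"); sink.write("b")
sink.pre_commit(); sink.commit()   # checkpoint 1 succeeds
sink.write("c")
sink.abort()                       # crash before checkpoint 2
sink.write("c")                    # "c" is replayed from checkpoint 1
sink.pre_commit(); sink.commit()
print(sink.committed)              # ['a', 'b', 'c']
```

The replayed record does not appear twice because its first write was never committed.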

Traps:
- Side effects (external API calls) are hard to guarantee
  → use idempotent APIs or dedup keys
- Throughput cost — per-batch transaction overhead
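
For side effects, the dedup-key approach can be sketched like this. `PaymentApi` is a hypothetical external service invented for the example; the assumption is that it accepts an idempotency key and ignores repeats of a key it has already processed.

```python
from typing import Dict, List, Tuple


class PaymentApi:
    """Hypothetical external service that de-duplicates on an idempotency key."""

    def __init__(self) -> None:
        self.charges: Dict[str, int] = {}

    def charge(self, idempotency_key: str, amount: int) -> int:
        # A repeated key returns the original result instead of charging again.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount
        return self.charges[idempotency_key]


def process(deliveries: List[Tuple[str, int]], api: PaymentApi) -> None:
    """At-least-once consumer: the same event may be delivered more than once."""
    for event_id, amount in deliveries:
        api.charge(idempotency_key=event_id, amount=amount)


api = PaymentApi()
process([("evt-1", 50), ("evt-2", 20), ("evt-1", 50)], api)  # evt-1 redelivered
print(sum(api.charges.values()))  # 70
```

At-least-once delivery plus an idempotent receiver gives the same observable result as exactly-once, without a distributed transaction spanning the external system.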

Stream-Table Duality

Core insight of Kafka Streams / Flink:

Stream = sequence of immutable events
Table = current state (key → value)

Mutual transforms:
- Stream → Table: aggregate events into state
  (e.g. user_signed_up events → user_table)
- Table → Stream: emit changes as events
  (e.g. user_table UPDATE → user_changed events)

→ Same data via two views:
  Table view = "current state"
  Stream view = "history of changes"
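
Both directions fit in a few lines of plain Python. The names below are illustrative only: `materialize` folds a stream of `(key, value)` events into a table, and the `Table` class records every update in a changelog that is itself a stream.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Change = Tuple[str, Optional[str]]  # (key, new value); None means "deleted"


def materialize(stream: Iterable[Change]) -> Dict[str, str]:
    """Stream -> Table: replay the events and keep the latest value per key."""
    table: Dict[str, str] = {}
    for key, value in stream:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


class Table:
    """Table -> Stream: every update is also appended to a changelog."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}
        self.changelog: List[Change] = []

    def upsert(self, key: str, value: str) -> None:
        self.rows[key] = value
        self.changelog.append((key, value))


users = Table()
users.upsert("u1", "free")
users.upsert("u2", "free")
users.upsert("u1", "pro")
print(users.changelog)                             # the history of changes
print(materialize(users.changelog) == users.rows)  # True
```

Replaying the changelog always rebuilds the table, which is how a state store can be restored after a failure.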

Applications:
- KTable + KStream joins (last value with new event)
- Materialized view (table) + change log (stream)
- A natural extension of event sourcing
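
The stream-table join in the first item can be sketched as a loop over interleaved records. This is a simplified left-join model written for illustration, not the Kafka Streams API: table records update the latest value per key, and each stream event is enriched with whatever the table holds at that moment.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, str, str]  # (kind, key, value) where kind is "table" or "stream"


def stream_table_join(records: List[Record]) -> List[Tuple[str, str, Optional[str]]]:
    """Join each stream event with the latest table value for its key."""
    table: Dict[str, str] = {}
    joined: List[Tuple[str, str, Optional[str]]] = []
    for kind, key, value in records:
        if kind == "table":
            table[key] = value  # update current state; emits nothing
        else:
            joined.append((key, value, table.get(key)))  # None if no row yet
    return joined


records = [
    ("table", "u1", "plan=free"),
    ("stream", "u1", "click"),
    ("table", "u1", "plan=pro"),
    ("stream", "u1", "purchase"),
]
print(stream_table_join(records))
```

The first event is joined with `plan=free` and the second with `plan=pro`: the result depends on the table's state when the event is processed, so ordering between the two inputs matters.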

Real Use Cases

Real-time analytics — user actions → dashboards within seconds
Fraud detection — anomaly check the moment a transaction occurs
Recommendations — a click affects the next recommendation
IoT — real-time monitoring of tens of thousands of sensors
Event-driven microservices — service A's events trigger B, C, D
CDC downstream — DB change → auto-update search index / cache

Common Pitfalls

Windows without watermarks — event-time windows have no signal for when to close, so late events are dropped or miscounted. Declare a watermark strategy and, if needed, allowedLateness.
Infinite state — group-by user_id keeps every user's state → memory explosion. Set a state TTL to expire idle keys, and use a disk-backed state backend (RocksDB) when state exceeds memory.
Assumed exactly-once + side effects — external API sinks can break the guarantee. Use idempotency.
Partition skew — when 90% of events share one key, a single partition and its consumer take most of the load. Use finer-grained or salted keys.
Streaming where batch suffices — for simple nightly aggregates, batch is simpler. Streaming is for latency.
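
Two of the fixes above, state TTL and key salting, can be sketched in plain Python. Both are toy versions with invented names (`TtlCounts`, `salted_key`, `merge_salted`); a real engine handles expiry inside its state backend, and the linear eviction scan here is only acceptable for a demo.

```python
from collections import Counter
from typing import Dict


class TtlCounts:
    """Per-key counters dropped after `ttl` seconds without an update."""

    def __init__(self, ttl: int) -> None:
        self.ttl = ttl
        self.counts: Dict[str, int] = {}
        self.last_seen: Dict[str, int] = {}

    def increment(self, key: str, now: int) -> int:
        # Evict idle keys first so state is bounded by recently active keys.
        idle = [k for k, seen in self.last_seen.items() if now - seen >= self.ttl]
        for k in idle:
            del self.counts[k]
            del self.last_seen[k]
        self.counts[key] = self.counts.get(key, 0) + 1
        self.last_seen[key] = now
        return self.counts[key]


def salted_key(key: str, sequence: int, buckets: int) -> str:
    """Spread one hot key over `buckets` sub-keys, e.g. 'hot#0' .. 'hot#2'."""
    return f"{key}#{sequence % buckets}"


def merge_salted(partials: Dict[str, int]) -> Dict[str, int]:
    """Second aggregation stage: strip the salt and combine partial counts."""
    merged: Counter = Counter()
    for salted, count in partials.items():
        merged[salted.rsplit("#", 1)[0]] += count
    return dict(merged)


partials: Dict[str, int] = {}
for i in range(6):
    k = salted_key("hot", i, buckets=3)
    partials[k] = partials.get(k, 0) + 1
print(partials)                # three sub-keys with 2 events each
print(merge_salted(partials))  # {'hot': 6}
```

Salting trades one aggregation step for two: partial results per salted key, then a cheap merge per original key.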

Wrap-up

Streaming generalizes batch to record-by-record processing. Windowing handles time-based queries, watermarks handle late events, and exactly-once requires coordination across producer, broker, consumer, and sink. Kafka + Flink is a widely used combination.

In practice, use streaming only when real-time results are genuinely required; if a nightly batch suffices, it is simpler. When latency matters → Kafka (transport) + Kafka Streams (smaller workloads) or Flink (larger ones) → sink to an OLAP store such as ClickHouse or Druid.