Skip to content
yutils

How Change Data Capture Actually Works

Log-based (MySQL binlog, Postgres WAL) vs trigger-based vs polling, Debezium as the de-facto open-source CDC, why CDC beats dual writes (consistency), the outbox pattern relationship, and downstream patterns (search index sync, cache invalidate, analytics).

~9 min read

OLTP DB changes → sync to search index / cache / analytics warehouse. Doing "dual writes" in app code breaks consistency and adds complexity. Change Data Capture (CDC) is the answer — read the DB's transaction log directly and emit a change stream. This guide covers the 3 CDC approaches, how Debezium actually works, and its relation to the outbox pattern.

The Problem — The Dual-Write Trap

Naive code:
  async function updateOrder(id, status) {
    await db.orders.update({id}, {status});  // 1
    await searchIndex.update(id, {status});   // 2
    await cache.del("order:" + id);            // 3
  }

Problems:
- 1 succeeds + 2 fails → DB / search out of sync
- 1 + 2 succeed + 3 fails → stale cache
- Concurrent transactions → race conditions
- Distributed failure modes (network / timeout / crash)

→ "Dual-write atomicity broken" — consistency hard to guarantee.

CDC answer:
1. App only updates DB (single transaction)
2. A separate CDC process reads the DB transaction log → emits stream
3. Consumers (search / cache / warehouse) consume the stream → sync

→ Simpler app code + DB is source of truth + every change propagates automatically.

Three CDC Approaches

1. Log-based (most robust)

Read the DB's transaction log directly:
- MySQL: binlog (binary log, also used for replication)
- PostgreSQL: WAL (Write-Ahead Log) + logical decoding
- Oracle: redo log
- SQL Server: change tracking

Pros:
- Captures every INSERT/UPDATE/DELETE
- Zero app impact (read-only)
- Transactional consistency (per commit)
- Backfill possible (replay log from the start)

Cons:
- Uses a DB replication slot (PostgreSQL)
- Only valid within log retention (replay)
- Complex DDL handling for some ALTER TABLEs

Tools: Debezium, Maxwell, AWS DMS

2. Trigger-based

A DB trigger writes changes into a history table:

CREATE TRIGGER orders_audit
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION log_change();

A separate process polls the history table → emits events.

Pros:
- Works on any DB (even without log access)
- DDL changes update the trigger together

Cons:
- DB write overhead (trigger runs per change)
- Can miss in-transaction changes (rollback drops them)
- Trigger schema maintenance burden

3. Polling (simplest)

Periodically query changed rows:

SELECT * FROM orders WHERE updated_at > $last_sync;

Pros:
- No DB setup
- Simplest

Cons:
- Latency = polling interval (usually minutes)
- Can't catch DELETEs (row disappears)
- Depends on updated_at (must exist + be properly updated)
- DB load (repeated queries)

→ OK for MVP, log-based for production.

Debezium — The De-Facto OSS CDC

Architecture:
  Source DB (Postgres/MySQL/...) → Debezium Connector → Kafka

Debezium behavior (PostgreSQL example):
1. Registers a PG replication slot (wal2json or pgoutput plugin)
2. Logical decoding of WAL extracts change events
3. Publishes each event to a Kafka topic:
   {
     "op": "u",                  // c=create, u=update, d=delete, r=read (snapshot)
     "before": {...old row...},
     "after": {...new row...},
     "source": {
       "version": "...",
       "ts_ms": 1234567890000,
       "lsn": "...",
       "db": "mydb",
       "table": "orders"
     },
     "ts_ms": 1234567890123
   }
4. Kafka topics provide distribution + durability → consumers process freely

Schema evolution:
- Schema Registry integration (Avro/JSON Schema)
- ALTER TABLE → schema-change event also emitted

CDC vs Outbox Pattern

Outbox pattern (see the event-sourcing guide):
  // In the same transaction
  await db.orders.update({id}, {status});
  await db.outbox.insert({event: "order_updated", data: ...});
  → A separate process ships unsent outbox rows to Kafka

Difference:
  CDC: DB log itself is the source
  Outbox: app explicitly writes event rows

CDC is simpler (zero app code changes):
- Downside: change event is "raw DB row" — not a domain event
            (UPDATE orders SET status=4 meaning "order shipped" must
             be interpreted externally)

Outbox is more explicit:
- Emit domain events ("order_shipped") directly
- App's intent is clear
- Downside: outbox logic in code, must remember on every change path

Modern recommendation:
- Simple sync (search / cache) → CDC
- Complex domain events (other services react) → Outbox
- They can coexist

Downstream Patterns

Common use cases for CDC streams:

1. Search-index sync (Elasticsearch, OpenSearch):
   UPDATE on orders → update the same doc in ES
   Consumer receives CDC events and calls ES API

2. Cache invalidation (Redis):
   user UPDATE → automatically del user:42 cache in Redis
   Zero explicit-invalidate code in app

3. Analytics warehouse sync (Snowflake, BigQuery):
   Raw change stream → staging tables in warehouse
   dbt transforms downstream into dimensional model

4. Event-driven microservices:
   Service A's DB changes → service B reacts

5. Materialized views:
   Changes across tables → pre-computed views (for search/dashboards)
   Updated at the moment of change (vs nightly batch)

6. Audit log:
   Immutable history of every DB change
   Compliance + debugging

Common Pitfalls

  • Replication slot leak — if Debezium pauses, PG WAL grows unbounded → disk full. Alerts + monitoring.
  • Schema changes break downstream — consumers don't know new columns after ALTER TABLE. Schema Registry + phased migration.
  • Assuming ordering — Kafka order is per partition only. No cross-table ordering guarantee.
  • Big events from big transactions — DELETE 1M rows = 1M events → consumer overload. Batch size limits / throttling.
  • CDC alone for everything — DDL / schema changes / backfill need separate handling. CDC focuses on DML.

Wrap-up

CDC at its core — "DB log is the source of truth; everything downstream derives from it". Solves dual-write inconsistency. Debezium is the OSS standard.

Practical — for search / cache / warehouse sync, CDC alone suffices. When domain events matter, pair with Outbox. Much of the modern data stack is built on CDC (Fivetran has log-based CDC connectors too).

Back to guides