OLTP DB changes → sync to search index / cache / analytics warehouse. Doing "dual writes" in app code breaks consistency and adds complexity. Change Data Capture (CDC) is the answer — read the DB's transaction log directly and emit a change stream. This guide covers the 3 CDC approaches, how Debezium actually works, and its relation to the outbox pattern.
The Problem — The Dual-Write Trap
Naive code:
async function updateOrder(id, status) {
await db.orders.update({id}, {status}); // 1
await searchIndex.update(id, {status}); // 2
await cache.del("order:" + id); // 3
}
Problems:
- 1 succeeds + 2 fails → DB / search out of sync
- 1 + 2 succeed + 3 fails → stale cache
- Concurrent transactions → race conditions
- Distributed failure modes (network / timeout / crash)
→ "Dual-write atomicity broken" — consistency hard to guarantee.
CDC answer:
1. App only updates DB (single transaction)
2. A separate CDC process reads the DB transaction log → emits stream
3. Consumers (search / cache / warehouse) consume the stream → sync
→ Simpler app code + DB is source of truth + every change propagates automatically.Three CDC Approaches
1. Log-based (most robust)
Read the DB's transaction log directly:
- MySQL: binlog (binary log, also used for replication)
- PostgreSQL: WAL (Write-Ahead Log) + logical decoding
- Oracle: redo log
- SQL Server: change tracking
Pros:
- Captures every INSERT/UPDATE/DELETE
- Zero app impact (read-only)
- Transactional consistency (per commit)
- Backfill possible (replay log from the start)
Cons:
- Uses a DB replication slot (PostgreSQL)
- Only valid within log retention (replay)
- Complex DDL handling for some ALTER TABLEs
Tools: Debezium, Maxwell, AWS DMS2. Trigger-based
A DB trigger writes changes into a history table:
CREATE TRIGGER orders_audit
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION log_change();
A separate process polls the history table → emits events.
Pros:
- Works on any DB (even without log access)
- DDL changes update the trigger together
Cons:
- DB write overhead (trigger runs per change)
- Can miss in-transaction changes (rollback drops them)
- Trigger schema maintenance burden3. Polling (simplest)
Periodically query changed rows:
SELECT * FROM orders WHERE updated_at > $last_sync;
Pros:
- No DB setup
- Simplest
Cons:
- Latency = polling interval (usually minutes)
- Can't catch DELETEs (row disappears)
- Depends on updated_at (must exist + be properly updated)
- DB load (repeated queries)
→ OK for MVP, log-based for production.Debezium — The De-Facto OSS CDC
Architecture:
Source DB (Postgres/MySQL/...) → Debezium Connector → Kafka
Debezium behavior (PostgreSQL example):
1. Registers a PG replication slot (wal2json or pgoutput plugin)
2. Logical decoding of WAL extracts change events
3. Publishes each event to a Kafka topic:
{
"op": "u", // c=create, u=update, d=delete, r=read (snapshot)
"before": {...old row...},
"after": {...new row...},
"source": {
"version": "...",
"ts_ms": 1234567890000,
"lsn": "...",
"db": "mydb",
"table": "orders"
},
"ts_ms": 1234567890123
}
4. Kafka topics provide distribution + durability → consumers process freely
Schema evolution:
- Schema Registry integration (Avro/JSON Schema)
- ALTER TABLE → schema-change event also emittedCDC vs Outbox Pattern
Outbox pattern (see the event-sourcing guide):
// In the same transaction
await db.orders.update({id}, {status});
await db.outbox.insert({event: "order_updated", data: ...});
→ A separate process ships unsent outbox rows to Kafka
Difference:
CDC: DB log itself is the source
Outbox: app explicitly writes event rows
CDC is simpler (zero app code changes):
- Downside: change event is "raw DB row" — not a domain event
(UPDATE orders SET status=4 meaning "order shipped" must
be interpreted externally)
Outbox is more explicit:
- Emit domain events ("order_shipped") directly
- App's intent is clear
- Downside: outbox logic in code, must remember on every change path
Modern recommendation:
- Simple sync (search / cache) → CDC
- Complex domain events (other services react) → Outbox
- They can coexistDownstream Patterns
Common use cases for CDC streams:
1. Search-index sync (Elasticsearch, OpenSearch):
UPDATE on orders → update the same doc in ES
Consumer receives CDC events and calls ES API
2. Cache invalidation (Redis):
user UPDATE → automatically del user:42 cache in Redis
Zero explicit-invalidate code in app
3. Analytics warehouse sync (Snowflake, BigQuery):
Raw change stream → staging tables in warehouse
dbt transforms downstream into dimensional model
4. Event-driven microservices:
Service A's DB changes → service B reacts
5. Materialized views:
Changes across tables → pre-computed views (for search/dashboards)
Updated at the moment of change (vs nightly batch)
6. Audit log:
Immutable history of every DB change
Compliance + debuggingCommon Pitfalls
- Replication slot leak — if Debezium pauses, PG WAL grows unbounded → disk full. Alerts + monitoring.
- Schema changes break downstream — consumers don't know new columns after ALTER TABLE. Schema Registry + phased migration.
- Assuming ordering — Kafka order is per partition only. No cross-table ordering guarantee.
- Big events from big transactions — DELETE 1M rows = 1M events → consumer overload. Batch size limits / throttling.
- CDC alone for everything — DDL / schema changes / backfill need separate handling. CDC focuses on DML.
Wrap-up
CDC at its core — "DB log is the source of truth; everything downstream derives from it". Solves dual-write inconsistency. Debezium is the OSS standard.
Practical — for search / cache / warehouse sync, CDC alone suffices. When domain events matter, pair with Outbox. Much of the modern data stack is built on CDC (Fivetran has log-based CDC connectors too).