Sending an email synchronously from a signup endpoint means the user waits for the email provider. A message queue lets you respond immediately and process in the background. Simple in principle, but Kafka, RabbitMQ, and SQS work very differently — and "process each message exactly once" turns out to be hard. This guide covers pub-sub, partitioning, delivery guarantees, and which system fits which job.
Why queues exist
Synchronous HTTP:
Client → POST /signup → API → email service → response
(might be slow) ↑
User waits all of this
Message queue:
Client → POST /signup → API → 200 immediately
↓
Queue: "send email to alice@..."
↓
Worker (background)
↓
email service
Wins:
- Response latency drops
- Decoupling — API and email service are independent. Email outage doesn't block signups
- Load smoothing — queue absorbs peak traffic
- Retry — workers re-attempt on failure
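The handoff above can be sketched with Python's standard library. This is only an illustration: `queue.Queue` stands in for a real broker, and `signup`, `send_email`, and the address are hypothetical names chosen for the example.

```python
import queue
import threading

jobs = queue.Queue()  # in-memory stand-in for a real broker
sent = []             # records what the worker "sent"

def send_email(address):
    # Hypothetical slow provider call; here it only records the address.
    sent.append(address)

def signup(address):
    # The handler enqueues the job and returns without waiting for the email.
    jobs.put({"type": "send_email", "to": address})
    return 200

def worker():
    while True:
        job = jobs.get()
        if job is None:          # sentinel: stop the worker
            jobs.task_done()
            break
        send_email(job["to"])
        jobs.task_done()

thread = threading.Thread(target=worker, daemon=True)
thread.start()
status = signup("user@example.com")
jobs.join()      # wait until the worker has drained the queue
jobs.put(None)
thread.join()
```

The handler's return value does not depend on `send_email` at all, which is the decoupling the list describes.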
Two models — queue vs log
Queue (RabbitMQ, SQS)
Producer → [msg1, msg2, msg3, msg4, msg5] → Consumer
↓
Consumer fetches msg1
↓
ack → msg1 removed from queue
↓
msg1 gone forever
→ Each message goes to one consumer
→ Work distribution model: spread load
Log (Kafka, Kinesis)
Producer → [msg1, msg2, msg3, msg4, msg5] → persisted (e.g. days)
↑ ↑
ConsumerA's cursor ConsumerB's cursor
→ Messages stay. Each consumer tracks its own cursor (offset)
→ Add a new consumer and it can replay history
→ Pub-sub model: multiple consumers can see the same message
Fundamental difference: queues treat messages as one-off tasks; logs treat them as a persistent event stream.
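A toy comparison of the two models, assuming plain in-memory structures rather than any real broker API. The names `work_queue`, `log`, `offsets`, and `poll` are made up for this sketch.

```python
from collections import deque

# Queue model: an acknowledged message is removed for everyone.
work_queue = deque(["msg1", "msg2", "msg3"])
first = work_queue.popleft()       # a consumer fetches and acks msg1
remaining = list(work_queue)       # msg1 is gone

# Log model: messages stay; each consumer only advances its own offset.
log = ["msg1", "msg2", "msg3"]
offsets = {"consumer_a": 0, "consumer_b": 0}

def poll(consumer):
    """Return the next message for this consumer, or None at the end of the log."""
    position = offsets[consumer]
    if position >= len(log):
        return None
    offsets[consumer] = position + 1
    return log[position]

a_reads = [poll("consumer_a"), poll("consumer_a")]
b_reads = [poll("consumer_b")]     # B still starts from msg1
offsets["consumer_a"] = 0          # replay: rewind A's cursor
replayed = poll("consumer_a")
```

Reading never mutates `log`, so adding a consumer is just adding another entry to `offsets`.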
Kafka — partitioned log
Topic: orders
├── Partition 0: [msg, msg, msg, ...]
├── Partition 1: [msg, msg, msg, ...]
└── Partition 2: [msg, msg, msg, ...]
Producer publishing:
- With a partition key → hash(key) % N → specific partition
- Without → round-robin, or sticky batching in newer clients
Consumer Group: order-processor
├── Consumer A → Partition 0
├── Consumer B → Partition 1
└── Consumer C → Partition 2
→ One consumer per partition (within a group)
→ Partitions = maximum parallel consumers
→ Same partition key → same consumer → preserves ordering
Pros:
- Very high throughput (hundreds of thousands to millions of msg/sec, depending on cluster size)
- Ordering per partition key
- Long retention (days to years) — supports replay
Cons:
- Operationally complex (you run a ZooKeeper ensemble or KRaft controllers alongside the brokers)
- Changing partition count is disruptive (key-to-partition mapping shifts)
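The key-based routing in this section can be sketched as follows. One loud assumption: `zlib.crc32` is used as a stand-in stable hash (Kafka's default Java partitioner uses murmur2), and the topic is modelled as a dict of lists. `partition_for` and the sample events are hypothetical.

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Assumption: crc32 stands in for the broker client's real hash function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

partitions = {p: [] for p in range(NUM_PARTITIONS)}
events = [
    ("user-1", "created"),
    ("user-2", "created"),
    ("user-1", "paid"),
    ("user-1", "shipped"),
]
for key, event in events:
    partitions[partition_for(key)].append((key, event))

# Every event for user-1 sits in one partition, in publish order.
user1_partition = partition_for("user-1")
user1_events = [e for k, e in partitions[user1_partition] if k == "user-1"]
```

Ordering holds only within a partition, which is why the key choice matters more than the partition count.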
RabbitMQ — flexible routing
Producer → Exchange → route by binding
↓
├── Queue A (orders.payment)
├── Queue B (orders.shipping)
└── Queue C (orders.audit)
↓
Consumer per queue
Exchange types:
- direct: routing key exact match
- topic: pattern match (orders.* matches exactly one word after orders; orders.# matches zero or more)
- fanout: every bound queue
- headers: header-based match
Pros:
- Flexible routing through exchange types
- Per-message ack — precise retry control
- Priority queues
Cons:
- Throughput lower than Kafka (tens of thousands of msg/sec)
- Retention is short (consumed messages are removed)
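To make topic-exchange matching concrete, here is a small re-implementation of the pattern rules in plain Python. It is not RabbitMQ's code; `topic_matches`, `route`, and the binding patterns are assumptions for illustration, reusing the queue names from the diagram.

```python
def topic_matches(pattern, routing_key):
    """Match a dot-separated routing key against a topic pattern.

    '*' stands for exactly one word, '#' for zero or more words.
    """
    def match(p, k):
        if not p:
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' either consumes nothing, or consumes one word and stays active.
            return match(rest, k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))

# Hypothetical bindings: queue name -> binding pattern.
bindings = {
    "orders.payment": "orders.payment.*",
    "orders.shipping": "orders.shipping.*",
    "orders.audit": "orders.#",
}

def route(routing_key):
    """Return the queues whose binding pattern matches the routing key."""
    return sorted(q for q, pattern in bindings.items()
                  if topic_matches(pattern, routing_key))
```

With these bindings, `route("orders.payment.captured")` returns both the payment queue and the audit queue: one publish, two independent consumers.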
SQS (AWS) — managed queue
- Fully managed — zero ops
- At-least-once (Standard) or exactly-once processing (FIFO, which deduplicates within a 5-minute window)
- Up to 14-day retention
- Auto-scales
- Pay per request
Simple but feature-light. The AWS default for background work.
Delivery guarantees — three flavors
- at-most-once — messages may be lost, never duplicated. Fire-and-forget. Logs / analytics
- at-least-once — never lost, possibly duplicated. Consumer must be idempotent. Default in most systems
- exactly-once — never lost, never duplicated. Very hard in distributed systems
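The difference between the first two guarantees mostly comes down to when the consumer acks. The simulation below is a sketch with one message and one simulated crash; `run` and its parameters are hypothetical.

```python
def run(ack_before_processing, crash_on_first_attempt=True):
    """Simulate one message and at most one consumer crash.

    Returns how many times the message ended up being processed.
    """
    pending = ["msg1"]
    processed = []
    crashed = not crash_on_first_attempt
    while pending:
        msg = pending[0]
        if ack_before_processing:
            pending.pop(0)            # at-most-once: ack first
            if not crashed:
                crashed = True        # crash before processing: message is lost
                continue
            processed.append(msg)
        else:
            processed.append(msg)     # at-least-once: process first
            if not crashed:
                crashed = True        # crash before ack: message is redelivered
                continue
            pending.pop(0)
    return len(processed)

lost = run(ack_before_processing=True)         # processed 0 times
duplicated = run(ack_before_processing=False)  # processed 2 times
```

Without a crash both orderings process the message once; the guarantee only describes what happens when something fails in between.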
Why exactly-once is hard
Consumer:
1. Fetch msg from queue
2. Save result to DB
3. ack queue (= done)
Network failure timing:
- After step 2, before step 3: DB saved but no ack
→ queue redelivers to another consumer → duplicate work
- After step 1, before step 2: queue marks in-flight
→ consumer crash, timeout, another consumer takes over → fine
Fixes:
1. Idempotent consumer — safe to process the same msg twice
(e.g. "set balance to 100" vs "add 100")
2. Store message_id in DB and dedup
3. Transactions (e.g. Kafka transactions, which cover Kafka-to-Kafka reads and writes, not an external DB): complex and slower
Practical answer: at-least-once + idempotent consumers is the standard pattern. "Exactly-once" marketing usually means the same thing.
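Fix 2 (store the message_id and dedup) can be sketched with `sqlite3`. The schema, the `handle` function, and the account name are assumptions for the example; the point is that the dedup record and the side effect commit in the same transaction.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
db.execute("INSERT INTO accounts VALUES ('alice', 0)")
db.commit()

def handle(message_id, name, amount):
    """Apply 'add amount' once per message_id. Returns True if applied."""
    try:
        with db:  # one transaction: dedup record and balance change commit together
            db.execute("INSERT INTO processed VALUES (?)", (message_id,))
            db.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, name),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: already processed, safe to ack

first = handle("m-1", "alice", 100)
second = handle("m-1", "alice", 100)  # redelivery after a lost ack
balance = db.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
```

Even the non-idempotent "add 100" becomes safe here, because the primary key on `message_id` rejects the second attempt before the balance changes.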
Consumer groups — Kafka's key idea
Topic with 5 partitions, group "order-processor"
Scenario A — 2 consumers:
Consumer A → Partitions 0, 1
Consumer B → Partitions 2, 3, 4
→ Automatic rebalance
Scenario B — 5 consumers:
Each consumer → 1 partition
→ Max parallelism
Scenario C — 6 consumers:
5 active, 1 idle
→ Partitions cap your consumer count
Add another group:
group "analytics" → reads the same topic with its own cursor
→ No producer changes needed
Common patterns
Outbox pattern
Problem — atomicity between DB transaction and queue publish
1. DB INSERT (success)
2. queue publish (network failure) → message lost forever
Solution:
1. Inside the DB transaction, write to an "outbox" table
2. Background worker reads outbox and publishes to the queue
3. Mark/delete the outbox row after success
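The three steps can be sketched with `sqlite3`. This is a minimal illustration, not a production relay: the table layout, `signup`, `relay`, and the `published` list (standing in for the broker) are all assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (email TEXT PRIMARY KEY)")
db.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "payload TEXT, "
    "published INTEGER DEFAULT 0)"
)
db.commit()

published = []  # stand-in for the real broker

def signup(email):
    with db:  # the business row and the outbox row commit or roll back together
        db.execute("INSERT INTO users VALUES (?)", (email,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)", ("send_email:" + email,))

def relay(publish):
    """Publish pending outbox rows, marking each only after publish succeeds."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(payload)  # if this raises, the row stays pending for the next pass
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))

signup("user@example.com")
relay(published.append)
relay(published.append)  # second pass finds nothing new
pending = db.execute("SELECT COUNT(*) FROM outbox WHERE published = 0").fetchone()[0]
```

If the relay crashes between `publish` and the `UPDATE`, the row is published again on the next pass, so this gives at-least-once publishing rather than exactly-once.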
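→ The outbox row commits atomically with the DB change; the queue catches up afterward. The worker can publish a row twice if it crashes before marking it, so consumers still need to be idempotent.
Dead Letter Queue (DLQ)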
A message that keeps failing clogs the queue for everything behind it.
Main queue:
msg fails → retry 3 times → moved to DLQ
→ Main queue moves on
→ A human inspects the DLQ and decides retry vs drop
Backpressure
Producer outpaces consumer → queue grows unboundedly. Mitigate with:
- Queue-length caps → producers block or get rejected past the limit
- Auto-scaling consumers (Kubernetes HPA)
- Priority queues for the important traffic
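The first mitigation, a queue-length cap that rejects producers, can be sketched with a bounded `queue.Queue`. The cap of 3 and the `try_publish` helper are hypothetical.

```python
import queue

buffer = queue.Queue(maxsize=3)  # hypothetical cap

def try_publish(msg):
    """Return True if accepted, False if rejected because the queue is full."""
    try:
        buffer.put_nowait(msg)
        return True
    except queue.Full:
        return False

results = [try_publish(i) for i in range(5)]
buffer.get_nowait()                     # a consumer drains one message
accepted_after_drain = try_publish(99)  # there is room again
```

Swapping `put_nowait` for a blocking `put` gives the other behaviour from the list: the producer waits instead of being rejected.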
Choosing a system
| | Kafka | RabbitMQ | SQS |
|---|---|---|---|
| Model | Log (partitioned) | Queue + flexible routing | Queue (managed) |
| Throughput | Very high (100K+ msg/sec) | Moderate (~10K msg/sec) | Auto-scales |
| Retention | Days to years | Until ack | Up to 14 days |
| Replay | ✅ (rewind cursor) | ✗ | ✗ |
| Operations | Complex | Moderate | None (managed) |
| Routing | Partition key | Flexible exchanges | Simple queue |
| Cost | Self-host or MSK | Self-host or CloudAMQP | Pay per request |
Picks:
- Event streams / analytics / replay — Kafka
- Complex routing / moderate throughput — RabbitMQ
- AWS shop + simple background jobs — SQS
- Tiny systems — Redis Streams or a "queue table" in your DB
Common pitfalls
1. Assuming ordering
Different partitions = no global ordering. Spread a single user's events across partitions and ordering breaks. Use the user ID as the partition key.
2. Non-idempotent consumers
"Increment counter by 1" with a failed ack → processed twice → counter +2. Design for idempotency: "set counter to N" or dedup by message_id.
3. Forgetting DLQ alerts
Messages pile up in DLQ silently → data loss. Alert on DLQ depth via CloudWatch / Prometheus.
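A sketch of the retry-then-DLQ flow with a depth check at the end. It assumes "retry 3 times" means three attempts in total; `drain`, `handler`, the thresholds, and the `"poison"` message are made up, and a real setup would export the depth to CloudWatch or Prometheus instead of computing a boolean.

```python
from collections import deque

MAX_ATTEMPTS = 3         # assumption: three attempts in total before dead-lettering
DLQ_ALERT_THRESHOLD = 1  # hypothetical: alert as soon as anything lands in the DLQ

def drain(messages, handler):
    """Process messages, re-queueing failures and dead-lettering after MAX_ATTEMPTS."""
    main = deque((msg, 0) for msg in messages)
    done, dlq = [], []
    while main:
        msg, attempts = main.popleft()
        try:
            handler(msg)
            done.append(msg)
        except Exception:
            attempts += 1
            if attempts >= MAX_ATTEMPTS:
                dlq.append(msg)                # give up: park it for inspection
            else:
                main.append((msg, attempts))   # retry later; the main queue moves on
    return done, dlq

def handler(msg):
    if msg == "poison":
        raise ValueError("cannot process")

done, dlq = drain(["a", "poison", "b"], handler)
alert = len(dlq) >= DLQ_ALERT_THRESHOLD
```

The healthy messages finish while the poison message is retried and then parked, and `alert` is what turns a silent pile-up into a page.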
4. Mistaking queue publish for transactional
"DB write + queue publish" isn't atomic. Use the outbox pattern or change data capture (Debezium).
5. Increasing Kafka partition count
Changes the key-to-partition hashing. Past and future messages for the same user can end up on different partitions. Plan ahead.
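A quick way to see the effect, again assuming `zlib.crc32` as a stand-in for the real partitioner hash and hypothetical `user-N` keys:

```python
import zlib

def partition_for(key, num_partitions):
    # Assumption: crc32 stands in for the client's real hash function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

keys = [f"user-{i}" for i in range(1000)]
moved = [k for k in keys if partition_for(k, 3) != partition_for(k, 4)]
print(f"{len(moved)} of {len(keys)} keys change partition when going from 3 to 4")
```

For a uniform hash, a key keeps its partition only when `h % 3 == h % 4`, which holds for 3 of every 12 hash values, so roughly three quarters of keys should move in this case.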
Summary
- MQ decouples APIs from background work — lower latency, retries, load smoothing.
- Queue (RabbitMQ/SQS) vs log (Kafka) — one-and-done vs persistent event stream.
- Kafka partitions are the unit of parallelism. Same key → same partition → ordering preserved.
- Delivery — at-most-once / at-least-once / exactly-once. Exactly-once is hard in distributed systems. Use at-least-once + idempotent consumers.
- The outbox pattern commits the DB change and the pending message together; a background worker then publishes it at least once.
- DLQ isolates failed messages — monitor it.
- Pick — Kafka (stream/replay), RabbitMQ (routing), SQS (AWS managed).