How Message Queues Actually Work (Kafka, RabbitMQ, SQS)

Sending an email synchronously from a signup endpoint means the user waits for the email provider. A message queue lets you respond immediately and process in the background. Simple in principle, but Kafka, RabbitMQ, and SQS work very differently — and "process each message exactly once" turns out to be hard. This guide covers pub-sub, partitioning, delivery guarantees, and which system fits which job.

Why queues exist

Synchronous HTTP:
Client → POST /signup → API → email service → response
                              (might be slow)        ↑
                                     User waits all of this

Message queue:
Client → POST /signup → API → 200 immediately
                            ↓
                     Queue: "send email to alice@..."
                            ↓
                     Worker (background)
                            ↓
                     email service

Wins:

Response latency drops
Decoupling — API and email service are independent. Email outage doesn't block signups
Load smoothing — queue absorbs peak traffic
Retry — workers re-attempt on failure
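
The flow above can be sketched in-process with Python's standard library. This is only an illustration of the shape, not a production setup: `handle_signup` and `send_email` are hypothetical names, and a real system would use a broker rather than an in-memory `queue.Queue`.

```python
import queue
import threading

email_queue = queue.Queue()
sent = []  # records what the worker "sent", for the demo only

def send_email(address):
    # Hypothetical stand-in for the slow email provider call.
    sent.append(address)

def worker():
    while True:
        address = email_queue.get()
        if address is None:  # sentinel: shut the worker down
            email_queue.task_done()
            break
        try:
            send_email(address)
        finally:
            email_queue.task_done()

def handle_signup(address):
    # Enqueue and return at once; the handler never waits on the provider.
    email_queue.put(address)
    return 200

threading.Thread(target=worker, daemon=True).start()
status = handle_signup("alice@example.com")
email_queue.put(None)
email_queue.join()  # demo only: wait until the worker has drained the queue
```

The handler's latency now depends only on the enqueue, which is the decoupling the list above describes.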

Two models — queue vs log

Queue (RabbitMQ, SQS)

Producer → [msg1, msg2, msg3, msg4, msg5] → Consumer
                       ↓
                  Consumer fetches msg1
                       ↓
                  ack → msg1 removed from queue
                       ↓
                  msg1 gone forever

→ Each message goes to one consumer
→ Work distribution model — spread load

Log (Kafka, Kinesis)

Producer → [msg1, msg2, msg3, msg4, msg5] → persisted (e.g. days)
              ↑                          ↑
              ConsumerA's cursor          ConsumerB's cursor

→ Messages stay. Each consumer tracks its own cursor (offset)
→ Add a new consumer and it can replay history
→ Pub-sub model — multiple consumers can see the same message

Fundamental difference — queues treat messages as one-off tasks, logs treat them as a persistent event stream.
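
To make the contrast concrete, here is a toy version of both models. `TaskQueue` and `EventLog` are made-up classes for illustration; they ignore persistence, concurrency, and redelivery timeouts.

```python
from collections import deque

class TaskQueue:
    """Queue model: the head message goes to one consumer and is removed on ack."""
    def __init__(self):
        self._pending = deque()

    def publish(self, msg):
        self._pending.append(msg)

    def fetch(self):
        return self._pending[0] if self._pending else None

    def ack(self):
        self._pending.popleft()  # the message is gone for everyone

class EventLog:
    """Log model: messages stay; every consumer tracks its own offset."""
    def __init__(self):
        self._entries = []
        self._offsets = {}

    def publish(self, msg):
        self._entries.append(msg)

    def poll(self, consumer):
        offset = self._offsets.get(consumer, 0)
        if offset >= len(self._entries):
            return None
        self._offsets[consumer] = offset + 1
        return self._entries[offset]

    def seek(self, consumer, offset):
        self._offsets[consumer] = offset  # rewind to replay history
```

With the log, `poll("A")` and `poll("B")` both return the first message, and `seek("A", 0)` replays it. With the queue, an acked message cannot be read again.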

Kafka — partitioned log

Topic: orders
  ├── Partition 0: [msg, msg, msg, ...]
  ├── Partition 1: [msg, msg, msg, ...]
  └── Partition 2: [msg, msg, msg, ...]

Producer publishing:
  - With a partition key → hash(key) % N → specific partition
  - Without → the client spreads messages across partitions (round-robin, or sticky batching in newer clients)

Consumer Group: order-processor
  ├── Consumer A → Partition 0
  ├── Consumer B → Partition 1
  └── Consumer C → Partition 2

→ Within a group, each partition is read by exactly one consumer
→ Partition count = maximum number of active consumers per group
→ Same partition key → same partition → same consumer → ordering preserved for that key
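
The key-to-partition step can be sketched as follows. Note the assumption: `zlib.crc32` is used here only as a stable stand-in hash. Kafka's default partitioner for keyed records uses murmur2, so the partition numbers will not match a real cluster.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # hash(key) % N, with crc32 standing in for Kafka's murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

events = [("user-1", "created"), ("user-2", "created"), ("user-1", "paid")]

partitions = {p: [] for p in range(3)}
for key, event in events:
    partitions[partition_for(key, 3)].append((key, event))
```

Both `user-1` events land in the same partition, in publish order, which is what lets a single consumer process them in sequence.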

Pros:

Very high throughput (hundreds of thousands to millions of msg/sec on a well-sized cluster)
Ordering within a partition, and therefore per partition key
Long retention (days to years) — supports replay

Cons:

Operationally complex (a broker cluster plus ZooKeeper, or KRaft in newer versions)
Partition count can only be increased, and doing so is disruptive (key-to-partition mapping shifts)

RabbitMQ — flexible routing

Producer → Exchange → route by binding
                       ↓
                       ├── Queue A (orders.payment)
                       ├── Queue B (orders.shipping)
                       └── Queue C (orders.audit)
                         ↓
                       Consumer per queue

Exchange types:
- direct: routing key exact match
- topic: pattern match (orders.*, orders.#)
- fanout: every bound queue
- headers: header-based match
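
Topic patterns are the least obvious of the four. In AMQP, `*` stands for exactly one dot-separated word and `#` for zero or more words. The matcher below is a small sketch of that rule written for this article, not RabbitMQ's implementation; the `bindings` table and queue names are invented for the example.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    def match(p, k):
        if not p:
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb zero or more words of the routing key.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))

# Hypothetical bindings: pattern -> queue name.
bindings = {"orders.*": "one-level", "orders.#": "all-orders"}

def route(routing_key: str) -> list[str]:
    return sorted(q for pat, q in bindings.items() if topic_matches(pat, routing_key))
```

Here `route("orders.payment")` reaches both queues, while `route("orders.payment.failed")` reaches only `all-orders`, because `*` cannot span two words.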

Pros:

Flexible routing through exchange types
Per-message ack — precise retry control
Priority queues

Cons:

Throughput lower than Kafka (typically tens of thousands of msg/sec)
Retention is short: classic queues delete a message once it is acked

SQS (AWS) — managed queue

Fully managed — zero ops
At-least-once delivery (Standard) or exactly-once processing within a 5-minute deduplication window (FIFO)
Up to 14-day retention (default 4 days)
Auto-scales
Pay per request

Simple but feature-light. The AWS default for background work.

Delivery guarantees — three flavors

at-most-once: messages may be lost but are never duplicated. Fire-and-forget; acceptable for logs and analytics where a gap is tolerable
at-least-once: never lost, possibly duplicated, so the consumer must be idempotent. The default in most systems
exactly-once: never lost, never duplicated. Very hard in distributed systems
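
In consumer code, the first two guarantees often come down to whether the ack happens before or after processing. The sketch below assumes a broker that redelivers unacked messages; all function names are hypothetical.

```python
def consume_at_most_once(msg, process, ack):
    ack(msg)       # acked first: a crash inside process() loses the message
    process(msg)

def consume_at_least_once(msg, process, ack):
    process(msg)   # processed first: a crash before ack() triggers redelivery
    ack(msg)

def crashing_process(msg):
    raise RuntimeError("worker died mid-processing")

def outcome(consume):
    """Run one delivery with a crashing worker and report the broker's view."""
    acked = []
    try:
        consume("msg1", crashing_process, acked.append)
    except RuntimeError:
        pass
    return "lost" if acked else "redelivered later"
```

`outcome(consume_at_most_once)` reports the message as lost, while `outcome(consume_at_least_once)` leaves it unacked so the broker can hand it out again.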

Why exactly-once is hard

Consumer:
1. Fetch msg from queue
2. Save result to DB
3. ack queue (= done)

Failure timing:
- Crash after step 2, before step 3: the result is in the DB but the queue never got the ack
  → queue redelivers to another consumer → duplicate work

- Crash after step 1, before step 2: the message is still in flight and nothing was saved
  → in-flight timeout expires, another consumer takes over → fine

Fixes:
1. Idempotent consumer — safe to process the same msg twice
   (e.g. "set balance to 100" vs "add 100")
2. Store message_id in DB and dedup
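3. Transactions (e.g. Kafka transactions, a two-phase-commit-style protocol): complex, slower, and they only cover reads and writes inside Kafka, not an external DB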

Practical answer: at-least-once delivery plus idempotent consumers is the standard pattern. Systems advertised as "exactly-once" usually deliver at-least-once and deduplicate, which has the same effect.
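
Fix 2 (dedup by message_id) can be sketched with SQLite. The table names, the `handle` function, and the deposit example are all assumptions made for this illustration. The point is that the dedup record and the side effect commit in one transaction.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
db.execute("INSERT INTO accounts VALUES ('alice', 0)")
db.commit()

def handle(message_id: str, account: str, amount: int) -> bool:
    """Apply a deposit once; return False if this message_id was already seen."""
    try:
        with db:  # one transaction: dedup row and balance update commit together
            db.execute("INSERT INTO processed VALUES (?)", (message_id,))
            db.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, account),
            )
        return True
    except sqlite3.IntegrityError:
        # Primary-key clash means a duplicate delivery: skip it, then ack as usual.
        return False
```

Calling `handle("m1", "alice", 100)` twice changes the balance only once, so a redelivered message is harmless even though "add 100" is not idempotent on its own.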

Consumer groups — Kafka's key idea

Topic with 5 partitions, group "order-processor"

Scenario A — 2 consumers:
  Consumer A → Partitions 0, 1
  Consumer B → Partitions 2, 3, 4
  → Automatic rebalance

Scenario B — 5 consumers:
  Each consumer → 1 partition
  → Max parallelism

Scenario C — 6 consumers:
  5 active, 1 idle
  → Partitions cap your consumer count

Add another group:
  group "analytics" → reads the same topic with its own cursor
  → No producer changes needed
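
The three scenarios can be reproduced with a small assignment function. This is a simplified sketch that hands out contiguous ranges and gives any leftover partitions to the last consumers, which happens to match Scenario A. Kafka's real assignors (range, round-robin, sticky, cooperative) may split partitions differently.

```python
def assign(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions over consumers as evenly as possible."""
    base, extra = divmod(num_partitions, len(consumers))
    assignment = {}
    start = 0
    for i, consumer in enumerate(consumers):
        # The last `extra` consumers each take one additional partition.
        size = base + (1 if i >= len(consumers) - extra else 0)
        assignment[consumer] = list(range(start, start + size))
        start += size
    return assignment

scenario_a = assign(5, ["A", "B"])
scenario_c = assign(5, ["A", "B", "C", "D", "E", "F"])
idle = [c for c, parts in scenario_c.items() if not parts]
```

With six consumers and five partitions, one consumer ends up with an empty list: it stays connected but receives nothing until a rebalance frees a partition.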

Common patterns

Outbox pattern

Problem — atomicity between DB transaction and queue publish
  1. DB INSERT (success)
  2. queue publish (network failure) → message lost forever

Solution:
  1. Inside the DB transaction, write to an "outbox" table
  2. Background worker reads outbox and publishes to the queue
  3. Mark/delete the outbox row after success

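→ The event is committed atomically with the DB change; the queue catches up afterward. If the worker crashes between publishing and marking the row, it publishes that row again, so consumers still need to be idempotent.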
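
A minimal sketch of the three steps, using SQLite as the database. The schema, the `signup` and `relay` functions, and the `publish` callback are assumptions for illustration; a real relay would run on a schedule or be replaced by change data capture.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (email TEXT PRIMARY KEY)")
db.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "payload TEXT NOT NULL, "
    "published INTEGER NOT NULL DEFAULT 0)"
)
db.commit()

def signup(email: str) -> None:
    with db:  # step 1: business row and outbox row commit together or not at all
        db.execute("INSERT INTO users VALUES (?)", (email,))
        db.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "send_email", "to": email}),),
        )

def relay(publish) -> int:
    """Steps 2 and 3: publish pending rows in order, marking each after success."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(payload)  # may raise; the row stays pending for the next run
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If `publish` fails, nothing is marked and the next `relay` run retries, which is why the pattern gives at-least-once publishing rather than exactly-once.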

Dead Letter Queue (DLQ)

A message that fails on every attempt clogs the queue if it is retried forever.

Main queue:
  msg fails → retry 3 times → moved to DLQ
  → Main queue moves on
  → A human inspects the DLQ and decides retry vs drop
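
The retry-then-park logic might look like the sketch below. It treats "retry 3 times" as three attempts in total, which is an assumption, and it uses in-memory structures where a real broker would apply its own redelivery count and DLQ configuration.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumption: three attempts in total before parking

def drain(main: deque, dlq: list, handler) -> None:
    """Process every message; move one to the DLQ after MAX_ATTEMPTS failures."""
    while main:
        msg = main.popleft()
        try:
            handler(msg["body"])
        except Exception as exc:
            msg["attempts"] += 1
            msg["last_error"] = repr(exc)
            if msg["attempts"] >= MAX_ATTEMPTS:
                dlq.append(msg)   # stop retrying; a human inspects it later
            else:
                main.append(msg)  # requeue at the back so other messages proceed
```

Keeping `last_error` on the parked message gives whoever inspects the DLQ a starting point for deciding between retry and drop.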

Backpressure

When producers outpace consumers, the queue grows without bound. Mitigate with:

Queue-length caps → producers block or get rejected past the limit
Auto-scaling consumers (Kubernetes HPA)
Priority queues for the important traffic
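
The first mitigation, a queue-length cap that rejects producers, can be shown with a bounded `queue.Queue`. The cap of 2 and the `try_publish` helper are chosen only for the demo.

```python
import queue

jobs = queue.Queue(maxsize=2)  # tiny cap, for demonstration only

def try_publish(job) -> bool:
    try:
        jobs.put_nowait(job)
        return True
    except queue.Full:
        # Past the cap: the caller can shed load, for example by
        # returning HTTP 429 or retrying later with backoff.
        return False

results = [try_publish(n) for n in range(3)]
```

Using `jobs.put(job)` instead of `put_nowait` would give the blocking variant: the producer waits until a consumer frees a slot.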

Choosing a system

Feature	Kafka	RabbitMQ	SQS
Model	Log (partitioned)	Queue + flexible routing	Queue (managed)
Throughput	Very high (hundreds of thousands to millions of msg/sec)	Moderate (tens of thousands of msg/sec)	Auto-scales (FIFO queues have lower limits)
Retention	Days to years	Until ack	Up to 14 days
Replay	✅ (rewind cursor)	✗ (classic queues)	✗
Operations	Complex	Moderate	None (managed)
Routing	Partition key	Flexible exchanges	Simple queue
Cost	Self-host or MSK	Self-host or CloudAMQP	Pay per request

Picks:

Event streams / analytics / replay — Kafka
Complex routing / moderate throughput — RabbitMQ
AWS shop + simple background jobs — SQS
Tiny systems — Redis Streams or a "queue table" in your DB

Common pitfalls

1. Assuming ordering

Different partitions = no global ordering. Spread a single user's events across partitions and ordering breaks. Use the user ID as the partition key.

2. Non-idempotent consumers

"Increment counter by 1" with a failed ack → processed twice → counter +2. Design for idempotency: "set counter to N" or dedup by message_id.

3. Forgetting DLQ alerts

Messages pile up in the DLQ unnoticed, and once its retention period expires they are gone for good. Alert on DLQ depth via CloudWatch / Prometheus.

4. Mistaking queue publish for transactional

"DB write + queue publish" isn't atomic. Use the outbox pattern or change data capture (Debezium).

5. Increasing Kafka partition count

Adding partitions changes the result of hash(key) % N. Past and future messages for the same user can land on different partitions, which breaks per-key ordering. Size the partition count up front.
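
A quick way to see the effect is to count how many keys change partition when N goes from 3 to 4. As an assumption, `zlib.crc32` stands in for Kafka's murmur2 hash, so the exact count will differ on a real cluster; the keys are synthetic.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # crc32 is a stand-in for the real partitioner hash.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

keys = [f"user-{i}" for i in range(1000)]

# Keys whose partition differs between a 3-partition and a 4-partition topic.
moved = sum(partition_for(k, 3) != partition_for(k, 4) for k in keys)
```

Any key counted in `moved` would have its older messages on one partition and its newer ones on another, so no single consumer sees them in order.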

Summary

MQ decouples APIs from background work — lower latency, retries, load smoothing.
Queue (RabbitMQ/SQS) vs log (Kafka) — one-and-done vs persistent event stream.
Kafka partitions are the unit of parallelism. Same key → same partition → ordering preserved.
Delivery — at-most-once / at-least-once / exactly-once. Exactly-once is hard in distributed systems. Use at-least-once + idempotent consumers.
The outbox pattern commits the DB change and the event together; publishing to the queue follows, at-least-once.
DLQ isolates failed messages — monitor it.
Pick — Kafka (stream/replay), RabbitMQ (routing), SQS (AWS managed).