Multithreaded code is hard. Even a one-liner like counter++ is a race condition. But understanding what mutex / atomic / condvar actually do at the CPU level demystifies them. This guide covers each primitive's mechanism, what memory ordering really means, and the actor model as a different way out.
Race Condition — Why counter++ Is Dangerous
int counter = 0;
// thread A: // thread B:
counter++; counter++;
// Looks like one instruction but really:
// 1. load counter into register
// 2. register += 1
// 3. store register back to counter
// Scenario:
// A: load counter (= 0)
// B: load counter (= 0)
// A: +1 (register = 1)
// B: +1 (register = 1)
// A: store (counter = 1)
// B: store (counter = 1)
// → two increments, result is 1Solution: atomic or lock.
Primitive 1 — Mutex (Mutual Exclusion)
// C++
std::mutex m;
m.lock();
counter++;
m.unlock();
// Java
synchronized (lock) {
counter++;
}
// Rust
let mut guard = m.lock().unwrap();
*guard += 1;
// scope ends → unlockInside the Mutex — futex (Linux)
Two paths:
Fast path (uncontended):
- one CAS (compare-and-swap) sets lock = 1 → success
- no syscall, ~10–20 ns
Slow path (contended, someone else holds it):
- futex(WAIT) syscall → kernel parks thread in a wait queue
- unlock issues futex(WAKE)
- ~1–10 μsfutex = "fast userspace mutex". User-space CAS when uncontended, kernel involvement only on contention. Elegant design.
Primitive 2 — Atomic Operations
// C++
std::atomic<int> counter{0};
counter.fetch_add(1); // single-instruction ++
// Assembly (x86):
// lock incl (%rax) ← lock prefix locks the cache line
// CAS (compare-and-swap):
int expected = 0;
counter.compare_exchange_strong(expected, 1);
// if counter == expected, swap to 1; else update expectedAtomic = single CPU instruction. Much faster than locks. But only simple read/modify/write or CAS — not arbitrary critical sections.
Lock-Free Counter via CAS
do {
int old = counter.load();
int new_val = old + 1;
} while (!counter.compare_exchange_weak(old, new_val));
// "Retry until CAS succeeds" pattern.
// No thread blocks the system (lock-free, though not wait-free).Primitive 3 — Semaphore
// Generalization of mutex with a counter.
// "Allow up to N concurrent accesses."
sem_t sem;
sem_init(&sem, 0, /* initial count */ 5);
// thread:
sem_wait(&sem); // count-- (wait if 0)
// ... critical section
sem_post(&sem); // count++
// Use case: connection pool max=5, only 5 concurrent users.Primitive 4 — Condition Variable
// "Wait until some condition becomes true."
std::mutex m;
std::condition_variable cv;
bool ready = false;
// producer:
{
std::lock_guard<std::mutex> lock(m);
ready = true;
}
cv.notify_one();
// consumer:
std::unique_lock<std::mutex> lock(m);
cv.wait(lock, [] { return ready; }); // wait until ready=true
// on wake, lock is reacquiredCondvars are always paired with a mutex. On wait, the mutex is atomically unlocked + sleep; on wake, mutex is re-acquired. Spurious wakeups happen, so always re-check the predicate in a while loop (Java's wait() is the same).
Memory Ordering — The CPU Reorders Instructions
// Thread A: // Thread B:
data = 42; if (ready) {
ready = true; print(data); // 42?
}
// If the CPU reorders:
// A: ready = true; data = 42; ← store order swapped!
// When B sees ready=true, data may still be 0.Modern CPUs use instruction reordering + store buffers. Results look correct on a single thread (out-of-order execution), but other threads can observe a different order.
memory_order — Knobs in C++ / Java / Go
| Order | Meaning | Cost |
|---|---|---|
| relaxed | Atomicity only, no ordering | Lowest |
| acquire (load) | Subsequent reads/writes after this load | Medium |
| release (store) | Prior reads/writes before this store | Medium |
| seq_cst (default) | Global ordering guaranteed | Highest (safest) |
Most app code uses default (seq_cst). Lock-free algorithms tune with acquire/release.
volatile Is Not a Lock
// Java
volatile int counter = 0;
counter++; // still a race condition!
// volatile means:
// - "always read the latest value from memory (don't cache)"
// - prevents some reorderings
// - but does NOT make "read-modify-write" atomicvolatile addresses visibility only; atomicity requires lock or atomic. C's volatile is even weaker (compiler hint, no hardware memory-model guarantee).
Deadlock — Two Threads Waiting on Each Other
// thread A: // thread B:
lock(m1); lock(m2);
lock(m2); // wait... lock(m1); // wait...
// → wait forever- Fix 1: consistent lock acquisition order (always m1 → m2)
- Fix 2: try_lock + timeout, release all on failure and retry
- Fix 3: protect with a single mutex (coarser granularity, throughput cost)
Database deadlock detectors solve the same problem — check the wait-for graph for cycles.
Lock-Free vs Wait-Free
- Lock-free: even if a thread stalls, system as a whole progresses. Usually CAS-based retry.
- Wait-free: every thread completes in bounded time. Very strong; complex to implement; rare in practice.
- Lock-based: if the lock holder dies, others wait forever.
// Common lock-free data structures:
- atomic counters
- Treiber stack (CAS-based stack)
- Michael-Scott queue (CAS-based MPMC queue)
- LMAX Disruptor (ring buffer + cache-line alignment)The Actor Model — A Way Out without Locks
// Each actor owns its state + mailbox.
// Other actors don't touch the state directly — only messages.
class Counter extends Actor {
private int count = 0; // not externally accessible
onMessage(msg) {
if (msg.type == "inc") count++;
if (msg.type == "get") sender.send(count);
}
}
// All modifications run on a single thread (mailbox loop) → no racesErlang, Akka, Elixir, Pony. Concurrency via isolation. No locks → no deadlocks (but message latency is higher than locks).
Go's Goroutines + Channels — Same Idea
// "Don't share memory. Share by communicating." (Go proverb)
ch := make(chan int, 100)
go func() { ch <- 42 }()
val := <-ch // 42
// Internally there are locks, but app code doesn't see them.Common Pitfalls
- Double-checked locking — broken before Java 5. 5+ requires volatile.
- ABA problem — CAS can't detect A → B → A. Use version counters or hazard pointers.
- Priority inversion — low-priority thread holds a lock, blocks high-priority thread. Fix with priority inheritance.
- False sharing — two atomics on the same cache line (64 B) cause invalidations on every update. Use
alignas(64)/ padding. - Spin lock overuse — busy-waits burn CPU. Only for very short critical sections; mutex (futex) is usually better.
Wrap-up
Concurrency primitives reduce to two ideas: mutual exclusion (lock, atomic) and synchronization (condvar, semaphore). At the CPU level, it's a game of cache lines and memory ordering. Above that sit lock-free algorithms and the actor model.
Practical advice: avoid locks when possible (channels / actors / immutability), and when you can't, minimize lock granularity and hold time. Profile contention.