Skip to content
yutils

How Concurrency Primitives Actually Work

Mutex, semaphore, condvar, atomic CAS, lock-free vs wait-free, why volatile isn't a lock, memory ordering and the happens-before relation, and the actor model as a different way out — explained at the level of what the CPU actually does.

~11 min read

Multithreaded code is hard. Even a one-liner like counter++ is a race condition. But understanding what mutex / atomic / condvar actually do at the CPU level demystifies them. This guide covers each primitive's mechanism, what memory ordering really means, and the actor model as a different way out.

Race Condition — Why counter++ Is Dangerous

int counter = 0;

// thread A:               // thread B:
counter++;                  counter++;

// Looks like one instruction but really:
// 1. load counter into register
// 2. register += 1
// 3. store register back to counter

// Scenario:
// A: load counter (= 0)
// B: load counter (= 0)
// A: +1 (register = 1)
// B: +1 (register = 1)
// A: store (counter = 1)
// B: store (counter = 1)
// → two increments, result is 1

Solution: atomic or lock.

Primitive 1 — Mutex (Mutual Exclusion)

// C++
std::mutex m;
m.lock();
counter++;
m.unlock();

// Java
synchronized (lock) {
    counter++;
}

// Rust
let mut guard = m.lock().unwrap();
*guard += 1;
// scope ends → unlock

Inside the Mutex — futex (Linux)

Two paths:

Fast path (uncontended):
- one CAS (compare-and-swap) sets lock = 1 → success
- no syscall, ~10–20 ns

Slow path (contended, someone else holds it):
- futex(WAIT) syscall → kernel parks thread in a wait queue
- unlock issues futex(WAKE)
- ~1–10 μs

futex = "fast userspace mutex". User-space CAS when uncontended, kernel involvement only on contention. Elegant design.

Primitive 2 — Atomic Operations

// C++
std::atomic<int> counter{0};
counter.fetch_add(1);  // single-instruction ++

// Assembly (x86):
// lock incl (%rax)   ← lock prefix locks the cache line

// CAS (compare-and-swap):
int expected = 0;
counter.compare_exchange_strong(expected, 1);
// if counter == expected, swap to 1; else update expected

Atomic = single CPU instruction. Much faster than locks. But only simple read/modify/write or CAS — not arbitrary critical sections.

Lock-Free Counter via CAS

do {
    int old = counter.load();
    int new_val = old + 1;
} while (!counter.compare_exchange_weak(old, new_val));

// "Retry until CAS succeeds" pattern.
// No thread blocks the system (lock-free, though not wait-free).

Primitive 3 — Semaphore

// Generalization of mutex with a counter.
// "Allow up to N concurrent accesses."

sem_t sem;
sem_init(&sem, 0, /* initial count */ 5);

// thread:
sem_wait(&sem);    // count-- (wait if 0)
// ... critical section
sem_post(&sem);    // count++

// Use case: connection pool max=5, only 5 concurrent users.

Primitive 4 — Condition Variable

// "Wait until some condition becomes true."

std::mutex m;
std::condition_variable cv;
bool ready = false;

// producer:
{
    std::lock_guard<std::mutex> lock(m);
    ready = true;
}
cv.notify_one();

// consumer:
std::unique_lock<std::mutex> lock(m);
cv.wait(lock, [] { return ready; });  // wait until ready=true
// on wake, lock is reacquired

Condvars are always paired with a mutex. On wait, the mutex is atomically unlocked + sleep; on wake, mutex is re-acquired. Spurious wakeups happen, so always re-check the predicate in a while loop (Java's wait() is the same).

Memory Ordering — The CPU Reorders Instructions

// Thread A:                 // Thread B:
data = 42;                    if (ready) {
ready = true;                   print(data);  // 42?
                              }

// If the CPU reorders:
// A: ready = true; data = 42;  ← store order swapped!
// When B sees ready=true, data may still be 0.

Modern CPUs use instruction reordering + store buffers. Results look correct on a single thread (out-of-order execution), but other threads can observe a different order.

memory_order — Knobs in C++ / Java / Go

OrderMeaningCost
relaxedAtomicity only, no orderingLowest
acquire (load)Subsequent reads/writes after this loadMedium
release (store)Prior reads/writes before this storeMedium
seq_cst (default)Global ordering guaranteedHighest (safest)

Most app code uses default (seq_cst). Lock-free algorithms tune with acquire/release.

volatile Is Not a Lock

// Java
volatile int counter = 0;
counter++;  // still a race condition!

// volatile means:
// - "always read the latest value from memory (don't cache)"
// - prevents some reorderings
// - but does NOT make "read-modify-write" atomic

volatile addresses visibility only; atomicity requires lock or atomic. C's volatile is even weaker (compiler hint, no hardware memory-model guarantee).

Deadlock — Two Threads Waiting on Each Other

// thread A:                 // thread B:
lock(m1);                     lock(m2);
lock(m2);  // wait...          lock(m1);  // wait...
// → wait forever
  • Fix 1: consistent lock acquisition order (always m1 → m2)
  • Fix 2: try_lock + timeout, release all on failure and retry
  • Fix 3: protect with a single mutex (coarser granularity, throughput cost)

Database deadlock detectors solve the same problem — check the wait-for graph for cycles.

Lock-Free vs Wait-Free

  • Lock-free: even if a thread stalls, system as a whole progresses. Usually CAS-based retry.
  • Wait-free: every thread completes in bounded time. Very strong; complex to implement; rare in practice.
  • Lock-based: if the lock holder dies, others wait forever.
// Common lock-free data structures:
- atomic counters
- Treiber stack (CAS-based stack)
- Michael-Scott queue (CAS-based MPMC queue)
- LMAX Disruptor (ring buffer + cache-line alignment)

The Actor Model — A Way Out without Locks

// Each actor owns its state + mailbox.
// Other actors don't touch the state directly — only messages.

class Counter extends Actor {
    private int count = 0;  // not externally accessible

    onMessage(msg) {
        if (msg.type == "inc") count++;
        if (msg.type == "get") sender.send(count);
    }
}

// All modifications run on a single thread (mailbox loop) → no races

Erlang, Akka, Elixir, Pony. Concurrency via isolation. No locks → no deadlocks (but message latency is higher than locks).

Go's Goroutines + Channels — Same Idea

// "Don't share memory. Share by communicating." (Go proverb)

ch := make(chan int, 100)
go func() { ch <- 42 }()
val := <-ch  // 42

// Internally there are locks, but app code doesn't see them.

Common Pitfalls

  • Double-checked locking — broken before Java 5. 5+ requires volatile.
  • ABA problem — CAS can't detect A → B → A. Use version counters or hazard pointers.
  • Priority inversion — low-priority thread holds a lock, blocks high-priority thread. Fix with priority inheritance.
  • False sharing — two atomics on the same cache line (64 B) cause invalidations on every update. Use alignas(64) / padding.
  • Spin lock overuse — busy-waits burn CPU. Only for very short critical sections; mutex (futex) is usually better.

Wrap-up

Concurrency primitives reduce to two ideas: mutual exclusion (lock, atomic) and synchronization (condvar, semaphore). At the CPU level, it's a game of cache lines and memory ordering. Above that sit lock-free algorithms and the actor model.

Practical advice: avoid locks when possible (channels / actors / immutability), and when you can't, minimize lock granularity and hold time. Profile contention.

Back to guides