How Do Concurrency Synchronization Primitives Work?

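Multithreaded code is hard. Even a single line like counter++ can cause a race condition. But once you understand how tools like mutexes, atomics, and condition variables actually work, the fear shrinks. This guide covers the CPU-level mechanism behind each primitive, what memory ordering means, and the actor model, which avoids locks altogether.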

Race Condition: Why counter++ Is Dangerous

int counter = 0;

// thread A:               // thread B:
counter++;                  counter++;

// Looks like a single instruction, but it is actually three steps:
// 1. load counter into a register
// 2. register += 1
// 3. store the register back to counter

// One possible interleaving:
// A: load counter (= 0)
// B: load counter (= 0)
// A: +1 (register = 1)
// B: +1 (register = 1)
// A: store (counter = 1)
// B: store (counter = 1)
// → two increments, but the result is 1

The fix: use an atomic or a lock.
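To see both fixes side by side, here is a minimal C++ sketch. The helper names (run_workers, count_with_atomic, count_with_mutex) and the thread and iteration counts are illustrative assumptions, not part of the original text.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: start `threads` workers that each call `work` `iters` times.
template <typename F>
void run_workers(int threads, int iters, F work) {
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([=] {
            for (int i = 0; i < iters; ++i) work();
        });
    }
    for (auto& th : pool) th.join();
}

// Fix 1: make the increment itself atomic.
int count_with_atomic(int threads, int iters) {
    std::atomic<int> counter{0};
    run_workers(threads, iters, [&counter] { counter.fetch_add(1); });
    return counter.load();
}

// Fix 2: serialize the load / add / store sequence with a lock.
int count_with_mutex(int threads, int iters) {
    int counter = 0;
    std::mutex m;
    run_workers(threads, iters, [&counter, &m] {
        std::lock_guard<std::mutex> guard(m);
        ++counter;
    });
    return counter;
}
```

With either fix, 4 threads doing 10000 increments each should always end at 40000; with a plain int and no lock the result would be unpredictable.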

Primitive 1: Mutex (Mutual Exclusion)

// C++ (lock_guard unlocks on scope exit, even if the critical section throws)
std::mutex m;
{
    std::lock_guard<std::mutex> guard(m);
    counter++;
}

// Java
synchronized (lock) {
    counter++;
}

// Rust
let mut guard = m.lock().unwrap();
*guard += 1;
// unlocked when guard goes out of scope

Inside a Mutex: futex (Linux)

A mutex has two paths:

Fast path (uncontended):
- A single CAS (compare-and-swap) sets lock = 1 and succeeds
- No system call, roughly 10-20 ns

Slow path (contended, another thread holds the lock):
- futex(WAIT) system call → the kernel parks the thread on a wait queue
- On unlock, futex(WAKE) wakes a waiter
- Roughly 1-10 μs

futex stands for "fast userspace mutex". Without contention only a user-space CAS runs; the kernel gets involved only under contention. A clever design.
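The two paths can be sketched as a toy mutex. This is a Linux-only illustration that follows the well-known three-state design (0 = unlocked, 1 = locked, 2 = locked with possible waiters); it is not how any particular libc implements its mutex. The class name FutexMutex and the demo function are hypothetical, and the code assumes std::atomic<int> has the same layout as int.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Toy futex-based mutex. State: 0 = unlocked, 1 = locked, 2 = locked, waiters possible.
class FutexMutex {
    std::atomic<int> state_{0};
    // Assumption: std::atomic<int> is layout-compatible with int on this platform.
    static_assert(sizeof(std::atomic<int>) == sizeof(int), "layout assumption");

    int* addr() { return reinterpret_cast<int*>(&state_); }
    void futex_wait(int expected) {   // sleeps only if state_ still equals expected
        syscall(SYS_futex, addr(), FUTEX_WAIT_PRIVATE, expected, nullptr, nullptr, 0);
    }
    void futex_wake_one() {
        syscall(SYS_futex, addr(), FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0);
    }

public:
    void lock() {
        int c = 0;
        if (state_.compare_exchange_strong(c, 1)) return;  // fast path: one CAS, no syscall
        // Slow path: mark the lock as contended, then sleep in the kernel.
        if (c != 2) c = state_.exchange(2);
        while (c != 0) {
            futex_wait(2);
            c = state_.exchange(2);
        }
    }
    void unlock() {
        if (state_.fetch_sub(1) != 1) {  // old value was 2: someone may be sleeping
            state_.store(0);
            futex_wake_one();
        }
    }
};

// Hypothetical demo: protect a plain int with the toy mutex.
int count_with_futex_mutex(int threads, int iters) {
    FutexMutex m;
    int counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                m.lock();
                ++counter;
                m.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

In real code, use std::mutex; the sketch only makes the split between the user-space fast path and the kernel slow path visible.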

Primitive 2: Atomic Operations

// C++
std::atomic<int> counter{0};
counter.fetch_add(1);  // atomic increment (a single instruction on x86)

// Assembly (x86), for example:
// lock incl (%rax)   ← the lock prefix gives this core exclusive ownership of the cache line

// CAS (compare-and-swap):
int expected = 0;
counter.compare_exchange_strong(expected, 1);
// if counter == expected, store 1; otherwise write the current value into expected

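An atomic operation is typically a single CPU instruction, so when uncontended it is much cheaper than taking a lock. But it only covers simple operations on one variable: read, write, read-modify-write, and CAS. It cannot protect a complex critical section.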

A Lock-Free Counter Built on CAS

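int old = counter.load();
// On failure, compare_exchange_weak writes the current value into old,
// so each retry works with fresh data.
while (!counter.compare_exchange_weak(old, old + 1)) {
}

// The "retry until the CAS succeeds" pattern
// → a failed CAS means some other thread's CAS succeeded, so the system as a
//   whole always makes progress (lock-free, though not wait-free)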
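The same retry pattern handles updates that have no dedicated atomic instruction. As a sketch, here is an "atomic maximum" built from a CAS loop; the names atomic_store_max and max_of_concurrent_updates are illustrative assumptions.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: atomically raise `target` to at least `value`.
void atomic_store_max(std::atomic<int>& target, int value) {
    int seen = target.load();
    // Retry only while our value would still raise the maximum.
    // A failed CAS refreshes `seen` with the value another thread stored.
    while (seen < value && !target.compare_exchange_weak(seen, value)) {
    }
}

// Each thread t proposes the values t * per_thread + i; returns the final maximum.
int max_of_concurrent_updates(int threads, int per_thread) {
    std::atomic<int> max_seen{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&max_seen, t, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                atomic_store_max(max_seen, t * per_thread + i);
            }
        });
    }
    for (auto& th : pool) th.join();
    return max_seen.load();
}
```

The loop exits early when another thread has already stored a larger value, so no thread ever blocks waiting for another.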

Primitive 3: Semaphore

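// A generalization of a mutex with a counter.
// Means something like "allow up to N concurrent accesses".

#include <semaphore.h>  // POSIX

sem_t sem;
sem_init(&sem, 0, /* initial count */ 5);

// thread:
sem_wait(&sem);    // count-- (blocks while count is 0)
// ... critical section
sem_post(&sem);    // count++

// Example use: a connection pool with max=5, so at most 5 concurrent users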
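To make the "at most N at once" behavior observable, here is a sketch of a counting semaphore built from a mutex and a condition variable, plus a demo that records the peak number of threads inside the limited section. CountingSemaphore and peak_concurrency are hypothetical names; C++20 provides std::counting_semaphore, which you would normally prefer.

```cpp
#include <algorithm>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore (illustrative; not the POSIX sem_t implementation).
class CountingSemaphore {
    std::mutex m_;
    std::condition_variable cv_;
    int count_;

public:
    explicit CountingSemaphore(int initial) : count_(initial) {}

    void acquire() {  // plays the role of sem_wait
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
    void release() {  // plays the role of sem_post
        {
            std::lock_guard<std::mutex> lock(m_);
            ++count_;
        }
        cv_.notify_one();
    }
};

// Runs `workers` threads through a section limited to `limit` concurrent users
// and returns the highest number of threads observed inside at once.
int peak_concurrency(int workers, int limit) {
    CountingSemaphore sem(limit);
    std::mutex stats_m;
    int inside = 0;
    int peak = 0;
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            sem.acquire();
            {
                std::lock_guard<std::mutex> g(stats_m);
                peak = std::max(peak, ++inside);
            }
            std::this_thread::yield();  // stand-in for using a pooled connection
            {
                std::lock_guard<std::mutex> g(stats_m);
                --inside;
            }
            sem.release();
        });
    }
    for (auto& th : pool) th.join();
    return peak;
}
```

The exact peak depends on scheduling, but it can never exceed the limit passed to the semaphore.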

Primitive 4: Condition Variable

// The "wait until some condition holds" pattern.

std::mutex m;
std::condition_variable cv;
bool ready = false;

// producer:
{
    std::lock_guard<std::mutex> lock(m);
    ready = true;
}
cv.notify_one();

// consumer:
std::unique_lock<std::mutex> lock(m);
cv.wait(lock, [] { return ready; });  // blocks until ready == true
// the lock is held again once wait returns

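A condvar is always paired with a mutex. On entering wait, unlocking the mutex and going to sleep happen atomically; on wakeup the mutex is re-acquired. Spurious wakeups are possible, so the predicate must be rechecked in a loop. The predicate overload of cv.wait does that loop for you; with the plain wait(lock) you must write the while yourself (the same applies to Java's wait()).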
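The explicit while form looks like this in a small blocking queue. BlockingQueue and sum_through_queue are hypothetical names used only for this sketch.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Illustrative blocking queue of ints guarded by one mutex and one condvar.
class BlockingQueue {
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> items_;

public:
    void push(int v) {
        {
            std::lock_guard<std::mutex> lock(m_);
            items_.push(v);
        }
        cv_.notify_one();  // notify after releasing the lock
    }

    int pop() {
        std::unique_lock<std::mutex> lock(m_);
        while (items_.empty()) {  // recheck after every wakeup, spurious or not
            cv_.wait(lock);       // atomically unlocks m_ and sleeps
        }
        int v = items_.front();
        items_.pop();
        return v;
    }
};

// One producer pushes 1..n, one consumer pops n items and sums them.
int sum_through_queue(int n) {
    BlockingQueue q;
    int sum = 0;
    std::thread consumer([&] {
        for (int i = 0; i < n; ++i) sum += q.pop();
    });
    std::thread producer([&] {
        for (int i = 1; i <= n; ++i) q.push(i);
    });
    producer.join();
    consumer.join();
    return sum;
}
```

The consumer may start before anything has been pushed; the loop around wait is what makes that ordering safe.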

Memory Ordering: The CPU Reorders Instructions

// Thread A:                 // Thread B:
data = 42;                    if (ready) {
ready = true;                   print(data);  // 42?
                              }

// If the CPU (or the compiler) reorders:
// A: ready = true; data = 42;  ← the stores are swapped!
// B may see ready == true while data is still 0

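Modern CPUs reorder instructions (out-of-order execution) and use store buffers, and compilers reorder code as well. Within a single thread the result is unchanged. Across threads, another thread may observe the writes in a different order.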

memory_order: the C++ Ordering Options

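Order	Meaning	Cost
relaxed	atomicity only, no ordering	lowest
acquire (load)	later reads/writes cannot move before this load	medium
release (store)	earlier reads/writes cannot move after this store	medium
seq_cst (default)	single global order over all seq_cst operations	highest (safest)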

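Most application code uses the default (seq_cst). Acquire/release is an optimization for when you are building lock-free algorithms. Java exposes comparable access modes through VarHandle, while Go's sync/atomic only offers sequentially consistent operations.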
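As a sketch of where acquire/release fits, here is the data/ready handoff written with an atomic flag. The function name publish_and_read is an illustrative assumption.

```cpp
#include <atomic>
#include <thread>

// The producer publishes a plain int through an atomic flag;
// the consumer reads it only after observing the flag.
int publish_and_read() {
    int data = 0;                     // ordinary, non-atomic payload
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread producer([&] {
        data = 42;
        // release: the write to data cannot be reordered after this store
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        // acquire: the read of data cannot be reordered before this load
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = data;
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Because the acquire load that sees true synchronizes with the release store, the consumer is guaranteed to read 42. With relaxed ordering on both sides, that guarantee would be gone.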

volatile Is Not a Lock

// Java
volatile int counter = 0;
counter++;  // still a race condition!

// What volatile means:
// - every read sees the most recent write (no stale value kept in a register or cache)
// - blocks some instruction reordering around the access
// - but it does NOT make read-modify-write atomic

volatile only solves visibility. Atomicity needs a lock or an atomic (in Java, AtomicInteger for example). C's volatile is weaker still: it only constrains the compiler and gives no guarantees under the hardware memory model.

Deadlock: Two Threads Waiting on Each Other

// thread A:                 // thread B:
lock(m1);                     lock(m2);
lock(m2);  // wait...          lock(m1);  // wait...
// → both wait forever

Fix 1: acquire locks in a consistent order (always m1 → m2)
Fix 2: try_lock with a timeout; on failure release everything and retry
Fix 3: protect everything with a single mutex (coarser locking granularity, at the cost of throughput)

A database deadlock detector solves the same problem: it checks the wait-for graph for a cycle.
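In C++ you rarely need to hand-roll these fixes for a pair of mutexes: std::lock acquires several lockables using a deadlock-avoidance algorithm, so callers do not have to agree on an order. The sketch below uses a hypothetical Account type and transfer function; C++17's std::scoped_lock wraps the same idea in one line.

```cpp
#include <mutex>
#include <thread>

// Hypothetical account type, used only for this illustration.
struct Account {
    std::mutex m;
    int balance;
    explicit Account(int b) : balance(b) {}
};

// Locks both accounts without risking the m1/m2 vs m2/m1 deadlock.
// Assumes `from` and `to` are different accounts.
void transfer(Account& from, Account& to, int amount) {
    std::unique_lock<std::mutex> a(from.m, std::defer_lock);
    std::unique_lock<std::mutex> b(to.m, std::defer_lock);
    std::lock(a, b);  // acquires both, whatever order other threads use
    from.balance -= amount;
    to.balance += amount;
}

// Two threads transfer in opposite directions; the total must be conserved.
int total_after_opposing_transfers(int rounds) {
    Account x(1000);
    Account y(1000);
    std::thread t1([&] {
        for (int i = 0; i < rounds; ++i) transfer(x, y, 1);
    });
    std::thread t2([&] {
        for (int i = 0; i < rounds; ++i) transfer(y, x, 1);
    });
    t1.join();
    t2.join();
    return x.balance + y.balance;
}
```

If transfer instead called from.m.lock() followed by to.m.lock(), the two threads would take the locks in opposite orders and could hang exactly as in the m1/m2 example.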

Lock-free vs Wait-free

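Lock-free: the system as a whole makes progress even if any one thread stalls. Usually a CAS-based retry loop.
Wait-free: every thread completes within a bounded number of steps. A very strong guarantee, but complex to implement and often with high overhead.
Lock-based: if a thread dies while holding a lock, the other threads wait forever.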

Commonly used lock-free data structures:
- atomic counter
- Treiber stack (CAS-based stack)
- Michael-Scott queue (CAS-based MPMC queue)
- LMAX Disruptor (ring buffer + cache-line alignment)
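As an illustration of the CAS-based style, here is a simplified Treiber stack of ints. One loud assumption: popped nodes are parked on a retired list and freed only in the destructor, which sidesteps the ABA and use-after-free problems that a production version would handle with hazard pointers or epoch-based reclamation. The class layout and the demo function are hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <vector>

class TreiberStack {
    struct Node {
        int value;
        Node* next;          // link in the live stack; never changes after publication
        Node* retired_next;  // link in the retired list; owned by the popping thread
    };
    std::atomic<Node*> head_{nullptr};
    std::atomic<Node*> retired_{nullptr};

public:
    // Not thread-safe: destroy only after all worker threads have finished.
    ~TreiberStack() {
        for (Node* n = head_.load(); n != nullptr;) {
            Node* dead = n;
            n = n->next;
            delete dead;
        }
        for (Node* n = retired_.load(); n != nullptr;) {
            Node* dead = n;
            n = n->retired_next;
            delete dead;
        }
    }

    void push(int v) {
        Node* n = new Node{v, head_.load(), nullptr};
        // A failed CAS writes the current head into n->next, then we retry.
        while (!head_.compare_exchange_weak(n->next, n)) {
        }
    }

    bool pop(int& out) {
        Node* top = head_.load();
        // A failed CAS refreshes top; stop when the stack is empty.
        while (top != nullptr && !head_.compare_exchange_weak(top, top->next)) {
        }
        if (top == nullptr) return false;
        out = top->value;
        // Park the node instead of deleting it (simplifying assumption).
        top->retired_next = retired_.load();
        while (!retired_.compare_exchange_weak(top->retired_next, top)) {
        }
        return true;
    }
};

// Each thread pushes 1..per_thread and pops after every push; leftovers are
// drained at the end, so the sum covers every pushed value exactly once.
long long mixed_push_pop_sum(int threads, int per_thread) {
    TreiberStack stack;
    std::atomic<long long> sum{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&stack, &sum, per_thread] {
            for (int i = 1; i <= per_thread; ++i) {
                stack.push(i);
                int v = 0;
                if (stack.pop(v)) sum += v;
            }
        });
    }
    for (auto& th : pool) th.join();
    int v = 0;
    while (stack.pop(v)) sum += v;
    return sum.load();
}
```

The memory cost of never reusing nodes is what buys the simplicity here; removing that assumption is where the hard part of lock-free programming starts.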

Actor Model: Another Way, Without Locks

// Each actor owns its state and a mailbox.
// Other actors cannot touch that state directly; they communicate only by messages.

// pseudocode
class Counter extends Actor {
    private int count = 0;  // not accessible from outside

    onMessage(msg) {
        if (msg.type == "inc") count++;
        if (msg.type == "get") sender.send(count);
    }
}

// Every modification happens on a single thread (the mailbox loop) → no races on count

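Erlang / Elixir / Akka / Pony. Concurrency is solved by isolation. With no locks in application code there is no lock-ordering deadlock (actors can still stall if they wait on each other's replies in a cycle), though message passing has higher latency than taking a lock.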
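The pseudocode maps onto a small runnable sketch: one worker thread drains a mailbox, and the counter is touched only by that thread. CounterActor and actor_count_after are hypothetical names, and the mailbox itself uses a mutex internally, so the "no locks" property holds for the actor's state, not for the plumbing.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class CounterActor {
    int count_ = 0;  // actor state: read and written only by worker_
    std::mutex m_;   // protects the mailbox, not count_
    std::condition_variable cv_;
    std::queue<std::function<void()>> mailbox_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so it starts after the other members exist

    void send(std::function<void()> msg) {
        {
            std::lock_guard<std::mutex> lock(m_);
            mailbox_.push(std::move(msg));
        }
        cv_.notify_one();
    }

    void run() {  // mailbox loop: one message at a time, in arrival order
        for (;;) {
            std::function<void()> msg;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stopping_ || !mailbox_.empty(); });
                if (mailbox_.empty()) return;  // stopping and fully drained
                msg = std::move(mailbox_.front());
                mailbox_.pop();
            }
            msg();
        }
    }

public:
    CounterActor() : worker_([this] { run(); }) {}
    ~CounterActor() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    void inc() { send([this] { ++count_; }); }  // "inc" message

    int get() {  // "get" message: the reply comes back through a future
        auto reply = std::make_shared<std::promise<int>>();
        std::future<int> result = reply->get_future();
        send([this, reply] { reply->set_value(count_); });
        return result.get();
    }
};

// Several sender threads post "inc" messages; a final "get" reads the total.
int actor_count_after(int senders, int per_sender) {
    CounterActor counter;
    std::vector<std::thread> pool;
    for (int s = 0; s < senders; ++s) {
        pool.emplace_back([&counter, per_sender] {
            for (int i = 0; i < per_sender; ++i) counter.inc();
        });
    }
    for (auto& th : pool) th.join();
    return counter.get();
}
```

Because the mailbox is FIFO and the get message is sent after every sender has finished, the reply reflects all increments.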

Go's goroutines + channels: A Similar Idea

// "Do not communicate by sharing memory; instead, share memory by communicating." (Go proverb)

ch := make(chan int, 100)
go func() { ch <- 42 }()
val := <-ch  // 42

// Channels use a lock internally, but application code never sees it

Common Pitfalls

Double-checked locking: broken before Java 5. From Java 5 on, the field must be volatile.
ABA problem: CAS cannot detect an A → B → A change. Use a version counter or hazard pointers.
Priority inversion: a low-priority thread holds a lock and blocks a high-priority thread. Solved with priority inheritance.
False sharing: when two variables written by different threads share a cache line (typically 64 bytes), one thread's update invalidates the line in the other thread's cache. Use alignas(64) or padding.
Spin lock abuse: busy-waiting burns CPU. Use it only for very short critical sections; in most cases a mutex (futex) is better.
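For the false-sharing item, the alignas fix can be sketched as follows. The 64-byte line size is an assumption that holds on most x86-64 parts but not everywhere, and the type and function names are illustrative.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// alignas pads the struct out to a full line, so two adjacent
// PaddedCounter objects never share a cache line.
struct PaddedCounter {
    alignas(kCacheLine) std::atomic<long> value{0};
};

struct Counters {
    PaddedCounter a;  // written by thread 1
    PaddedCounter b;  // written by thread 2
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");
static_assert(alignof(PaddedCounter) == kCacheLine, "counter starts on a line boundary");

// Two threads each hammer their own counter; returns the combined total.
long bump_both(long iters) {
    Counters c;
    std::thread t1([&] {
        for (long i = 0; i < iters; ++i) c.a.value.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (long i = 0; i < iters; ++i) c.b.value.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return c.a.value.load() + c.b.value.load();
}
```

The result is the same with or without padding; only the speed differs, which is why false sharing has to be found with a profiler rather than a correctness test.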

Wrapping Up

Concurrency primitives come down to two concepts: mutual exclusion (locks, atomics) and synchronization (condvars, semaphores). At the CPU level it is a game of cache lines and memory ordering. Abstractions such as lock-free algorithms and the actor model are built on top.

Practical advice: avoid locks where you can (channels, actors, immutable data). Where you cannot, minimize lock granularity and hold time. Measure contention with a profiler.