How does an ML training pipeline work?

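Training a model in a Jupyter notebook is only the starting point on the way to production. Roughly 95% of real production ML work (a rule-of-thumb figure) is the pipeline and operations in between: a cycle of data ingestion, feature engineering, training, evaluation, deployment, and monitoring. This guide walks through those stages and the roles of tools such as MLflow, Kubeflow, and SageMaker.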

The 7 pipeline stages

1. Data Ingestion
   - extract raw data from the sources (DB, S3, Kafka)
   - state the time range explicitly (the cutoff of the training data)

2. Feature Engineering
   - raw → numeric features (one-hot, embedding, scaling)
   - train / val / test split (random, time-based, group-based)
   - store the features in a feature store (for reuse)

3. Training
   - hyperparameter search (grid / random / Bayesian)
   - distributed training (for large models)
   - record the resulting metrics (accuracy, AUC, RMSE)

4. Evaluation
   - metrics on the holdout test set
   - bias / fairness check (performance per group)
   - comparison with the previous production model

5. Registration
   - if the model passes, register it in the model registry (version, artifact, metadata)
   - approval workflow (experiment → staging → production)

6. Deployment
   - deploy to the serving environment (covered in the next guide)
   - canary / shadow / blue-green
   - rollback plan

7. Monitoring (continuous):
   - prediction distribution drift
   - feature drift
   - performance metrics (if ground truth is available in real time)
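
The cutoff from step 1 and the time-based split from step 2 can be sketched as below. This is a minimal, stdlib-only illustration: the record layout (`ts`, `x`, `y`), the dates, and the function name `time_based_split` are assumptions made for the example, not part of any particular tool.

```python
from datetime import datetime


def time_based_split(rows, train_end, val_end):
    """Split rows by timestamp: train < train_end <= val < val_end <= test."""
    train, val, test = [], [], []
    for row in rows:
        if row["ts"] < train_end:
            train.append(row)
        elif row["ts"] < val_end:
            val.append(row)
        else:
            test.append(row)
    return train, val, test


# Hypothetical records: one per day, 1 to 30 January 2026.
rows = [{"ts": datetime(2026, 1, day), "x": day, "y": day % 2} for day in range(1, 31)]

train, val, test = time_based_split(
    rows,
    train_end=datetime(2026, 1, 21),  # training cutoff
    val_end=datetime(2026, 1, 26),
)
print(len(train), len(val), len(test))
```

Splitting on time rather than at random keeps every validation and test record strictly later than the training data, which is usually what a model faces after deployment.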

Limits of notebooks

Strengths of Jupyter notebooks:
- interactive (run a cell, inspect the result)
- visualization (matplotlib, seaborn)
- fast prototyping

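Limits in production:
1. Execution-order dependence: the order in which cells were run is not recorded.
   This is the most common cause of "cannot reproduce".
2. Environment dependence: whose conda env, which pip versions, which GPU model
3. Hard to version-control: .ipynb files are JSON, so diffs are hard to read
4. Hard to test: unit tests are usually missing
5. Weak error handling: no handling of production failures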

Remedies:
- keep notebooks for exploration → extract verified code into .py modules
- define the pipeline separately (Python script + orchestrator)
- jupytext: keeps a notebook and a .py file in sync
- papermill: runs notebooks with parameters (batch execution)
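
As one reading of "extract verified code into .py modules", the sketch below shows logic that might start life in a notebook cell, rewritten as a pure function with pytest-style tests. The function `clip_outliers` and its bounds are hypothetical and only serve the illustration.

```python
def clip_outliers(values, lower, upper):
    """Clamp every value into [lower, upper]; reject an inverted range."""
    if lower > upper:
        raise ValueError(f"lower ({lower}) must not exceed upper ({upper})")
    return [min(max(v, lower), upper) for v in values]


def test_values_are_clamped():
    assert clip_outliers([-5, 0, 3, 99], 0, 10) == [0, 0, 3, 10]


def test_invalid_range_is_rejected():
    try:
        clip_outliers([1], 10, 0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for an inverted range")
```

Saved in a module, the two test functions can be collected by a test runner such as pytest, which gives the extracted code the unit tests and error handling that a notebook typically lacks.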

Reproducibility: the hardest problem

Common causes of "same code + same data → different results":

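1. Random seeds not fixed
   import os, random
   import numpy, torch
   random.seed(42)
   numpy.random.seed(42)
   torch.manual_seed(42)
   torch.cuda.manual_seed_all(42)
   # PYTHONHASHSEED is read when the interpreter starts, so this assignment only
   # affects child processes; for the current process, set it in the shell instead
   # (e.g. PYTHONHASHSEED=42 python train.py).
   os.environ["PYTHONHASHSEED"] = "42"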

2. GPU non-determinism
   Some cuDNN ops are non-deterministic.
   torch.backends.cudnn.deterministic = True
   torch.backends.cudnn.benchmark = False
   → costs a little throughput

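3. Ordering in distributed training
   The gradient all-reduce can sum in a different order depending on the number of GPUs and the network.
   → numerical drift (floating-point addition is not associative, so the order changes the result)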

4. Data order
   the shuffle seed
   randomness in data augmentation

5. Environment (library versions)
   pytorch 1.12 vs 1.13 → the same code can give different results
   → pin versions in requirements.txt and fix the environment in a Docker image
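
Item 3 comes down to floating-point addition depending on the order of the operands. A tiny illustration with plain Python floats (this is not an actual all-reduce, only the arithmetic effect behind it):

```python
# The same three "gradient" values summed in two different orders.
grads = [0.1, 0.2, 0.3]

left_to_right = (grads[0] + grads[1]) + grads[2]
right_to_left = grads[0] + (grads[1] + grads[2])

# The two sums differ in the last bits even though the inputs are identical.
print(left_to_right == right_to_left)
print(abs(left_to_right - right_to_left))
```

The difference is tiny for one sum, but over many training steps such discrepancies can accumulate into visibly different weights.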

→ Reproducibility has to be worked for; it does not come automatically.
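
The seeding and cuDNN settings from items 1 and 2 are often gathered into a single helper that is called once at the start of a run. The sketch below is one possible shape; the name `seed_everything` is an assumption, and NumPy and PyTorch are treated as optional so the helper still runs where they are not installed.

```python
import os
import random


def seed_everything(seed: int = 42) -> None:
    """Seed every RNG that is importable in this environment."""
    random.seed(seed)
    # Inherited by child processes only; the current interpreter fixed its
    # hash randomization at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # no effect when CUDA is unavailable
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)
```

This covers only the sources of randomness listed in the helper; distributed ordering, data loader workers, and library versions still need to be handled separately.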

Experiment Tracking

A common anti-pattern:
  result_v1.csv, result_v2.csv, ..., result_v47_final_REAL_NOW.csv
  experiment_2026_05_27_again.ipynb

The problem: there is no way to trace which hyperparameters produced which result.

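What MLflow / Weights & Biases / Neptune offer (MLflow shown):
  import mlflow
  with mlflow.start_run():  # groups the calls below into one run
      mlflow.log_param("lr", 0.001)
      mlflow.log_param("batch_size", 32)
      mlflow.log_metric("accuracy", 0.92)
      mlflow.log_artifact("model.pt")  # path to an existing local file

  → runs are stored automatically, compared in a UI, and tracked by version
  → runs can be sorted by metric to find the best hyperparameter combination
  → reproducibility metadata (git commit, env)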
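
To make concrete what a tracker stores per run and how "best run" lookups work, here is a toy, stdlib-only stand-in. It is not how MLflow or any of the other tools is implemented; `RunLog`, `log_run`, and `best` are hypothetical names, and the second and third runs use made-up values.

```python
from dataclasses import dataclass, field


@dataclass
class RunLog:
    """In-memory list of runs, each holding its params and metrics."""
    runs: list = field(default_factory=list)

    def log_run(self, params: dict, metrics: dict) -> None:
        # Copy the dicts so later mutation by the caller cannot alter history.
        self.runs.append({"params": dict(params), "metrics": dict(metrics)})

    def best(self, metric: str) -> dict:
        """Return the run with the highest value of `metric`."""
        return max(self.runs, key=lambda run: run["metrics"][metric])


log = RunLog()
log.log_run({"lr": 0.001, "batch_size": 32}, {"accuracy": 0.92})
log.log_run({"lr": 0.01, "batch_size": 32}, {"accuracy": 0.88})   # made-up run
log.log_run({"lr": 0.001, "batch_size": 64}, {"accuracy": 0.90})  # made-up run

print(log.best("accuracy")["params"])
```

The point is the pairing: every metric is stored next to the parameters that produced it, which is exactly what the file-naming anti-pattern loses.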

Current options:
- MLflow (OSS, self-hosted): widely adopted
- Weights & Biases (SaaS): strong UI and collaboration
- Neptune (SaaS): rich metadata
- Vertex AI Experiments (GCP): cloud native
- SageMaker Experiments (AWS): cloud native

Orchestration: defining the pipeline

The pipeline steps are expressed as a DAG (directed acyclic graph):

  ingest → preprocess → train → evaluate → register → deploy
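
The chain above can be written down as a dependency graph and executed in topological order. The sketch below uses `graphlib` from the Python standard library (Python 3.9+) purely as an illustration; real orchestrators add scheduling, retries, and isolation on top. The `run_pipeline` helper and the no-op step functions are assumptions.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
dag = {
    "ingest": set(),
    "preprocess": {"ingest"},
    "train": {"preprocess"},
    "evaluate": {"train"},
    "register": {"evaluate"},
    "deploy": {"register"},
}


def run_pipeline(dag, steps):
    """Call each step's function in dependency order; return the order used."""
    order = list(TopologicalSorter(dag).static_order())
    for name in order:
        steps[name]()
    return order


# Placeholder steps that only record that they ran.
executed = []
steps = {name: (lambda n=name: executed.append(n)) for name in dag}

order = run_pipeline(dag, steps)
print(order)
```

Because the graph is acyclic, a valid execution order always exists; `TopologicalSorter` raises `CycleError` if a cycle is introduced by mistake.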

Orchestrator tools:

Kubeflow Pipelines (KFP):
- Kubernetes native
- each step = a container
- well suited to large distributed training
- steep learning curve

Airflow:
- traditional data orchestration
- usable for ML, but not ML-native
- built from Python operators and tasks

Prefect / Dagster:
- modern alternatives to Airflow
- friendly to dataframes and to debugging
- ML and data workflows in one tool

Vertex AI Pipelines (GCP):
- managed service based on Kubeflow
- GCP integration

SageMaker Pipelines (AWS):
- AWS integration
- Studio UI

→ Large enterprises: Kubeflow (vendor-neutral) or a cloud-managed service.
   Small teams: Prefect / Dagster are simpler.

Model Registry

The "official" store for models headed to production:

  model_name: fraud_detector
  versions:
    v1: 2026-01-15, accuracy 0.91, status: archived
    v2: 2026-03-22, accuracy 0.93, status: production
    v3: 2026-05-27, accuracy 0.95, status: staging (A/B test in progress)

State transitions:
  staging → production (if the A/B results are good)
  production → archived (replaced by a new model)

Features:
- storage of the model artifact (binary)
- metadata (accuracy, training data, code commit)
- approval workflow
- rollback (return to a previous version)

Implementations include MLflow Model Registry, Vertex AI Model Registry, and SageMaker Model Registry.
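
The state transitions above can be modeled as a small state machine. The sketch below is a toy in-memory version, not the API of any of the registries just named; `ModelRegistry`, `register`, `transition`, and `promote` are hypothetical names, and rollback is left out for brevity.

```python
from dataclasses import dataclass, field

# Allowed stage changes, taken from the two transitions listed above.
ALLOWED = {("staging", "production"), ("production", "archived")}


@dataclass
class ModelRegistry:
    stages: dict = field(default_factory=dict)  # version -> current stage

    def register(self, version: str) -> None:
        """New versions always start in staging."""
        self.stages[version] = "staging"

    def transition(self, version: str, new_stage: str) -> None:
        current = self.stages[version]
        if (current, new_stage) not in ALLOWED:
            raise ValueError(f"{current} -> {new_stage} is not allowed")
        self.stages[version] = new_stage

    def promote(self, version: str) -> None:
        """Archive the current production version, then promote `version`."""
        for other, stage in list(self.stages.items()):
            if stage == "production":
                self.transition(other, "archived")
        self.transition(version, "production")


registry = ModelRegistry()
registry.register("v2")
registry.promote("v2")
registry.register("v3")
registry.promote("v3")
print(registry.stages)
```

Routing every change through `transition` means a disallowed move, such as pushing an archived version straight back to production, fails loudly instead of silently changing what is served.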

Common pitfalls

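- Going to production with notebooks alone: it works once, then breaks on retraining or an environment change. Extract the code to .py and write tests.
- Not fixing random seeds: the same code gives different results, leading to "why was it good yesterday but not today?". Fix every source of randomness.
- Train / test data leakage: features contain information from the test set (a transformation of the target, a future value). A very common bug.
- Ignoring the gap between online and offline metrics: a model that looks good offline may have almost no effect on user behavior. An A/B test is essential.
- No monitoring: nobody watches model drift or performance changes after deployment, and a production incident follows six months later.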
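
One frequent form of the leakage pitfall is computing preprocessing statistics over the full dataset before splitting. A minimal stdlib sketch of the safe pattern follows; the helper names `fit_scaler` and `transform` and the numbers are made up for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Compute scaling statistics from the training split only."""
    return mean(train_values), pstdev(train_values)


def transform(values, mu, sigma):
    """Standardize values with statistics that were fitted elsewhere."""
    return [(v - mu) / sigma for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [10.0, 11.0]

mu, sigma = fit_scaler(train)            # the statistics never see the test split
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)  # reuse the train statistics as-is

leaky_mu = mean(train + test)            # what a leaky pipeline would compute
print(mu, leaky_mu)
```

The leaky mean is pulled toward the test values, so a model trained on features scaled with it has indirectly seen the test set, and its offline score is optimistic.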

Wrap-up

The essence of an ML training pipeline is turning the results of notebook exploration into a reproducible, monitored, and governed system. Tools such as MLflow and Kubeflow support that transition.

In practice, a reasonable default for a new ML project is: notebook (exploration) → Python script (production) + MLflow (tracking) + Kubeflow/Vertex (orchestration) + Model Registry (versioning). On a cloud platform, the end-to-end integration of SageMaker or Vertex AI is also a good choice.