How ML Training Pipelines Actually Work

A model rarely goes straight from a Jupyter notebook to production. In practice, roughly 95% of production ML work is the pipeline and operations in between: data ingestion, feature engineering, training, evaluation, registration, deployment, and monitoring. This guide covers those stages and the roles of MLflow, Kubeflow, and SageMaker.

The Seven Pipeline Stages

1. Data Ingestion
   - Extract raw data from sources (DB, S3, Kafka)
   - Declare time range (training data cutoff)

2. Feature Engineering
   - Raw → numeric features (one-hot, embeddings, scaling)
   - Train / val / test split (random, time-based, group-based)
   - Persist to a feature store (for reuse)

3. Training
   - Hyperparameter search (grid / random / Bayesian)
   - Distributed training (large models)
   - Record metrics (accuracy, AUC, RMSE)

4. Evaluation
   - Metrics on holdout test set
   - Bias / fairness checks (per-group performance)
   - Compare against previous production model

5. Registration
   - On pass, register in a model registry (version, artifact, metadata)
   - Approval workflow (experiment → staging → production)

6. Deployment
   - Deploy to serving infra (next guide)
   - Canary / shadow / blue-green rollout
   - Rollback plan

7. Monitoring (continuous)
   - Prediction distribution drift
   - Feature drift
   - Performance metrics (when real-time ground truth exists)
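
A feature-drift check for the monitoring stage can be sketched with the Population Stability Index (PSI). This is one common drift score, not the only option. The function name `psi`, the bin count, and the clamping of out-of-range values into the edge buckets are assumptions made for this illustration.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def bucket_fractions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the training range land in the edge buckets.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty buckets from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = bucket_fractions(expected)
    a = bucket_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train_feature = [i / 100 for i in range(1000)]   # hypothetical training values
live_feature = [x + 5 for x in train_feature]    # hypothetical shifted live values
drift_score = psi(train_feature, live_feature)
```

A frequently quoted rule of thumb treats PSI above about 0.25 as a significant shift, but that threshold is a convention and should be tuned per feature.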

Notebook Limits

Jupyter notebook pros:
- Interactive (run cells, inspect)
- Visualization (matplotlib, seaborn)
- Fast prototyping

Production limits:
1. Cell-order dependency — no record of which order you ran cells
                          → #1 cause of "not reproducible"
2. Environment dependence — whose conda env, pip versions, GPU model
3. Hard to version-control — .ipynb is JSON; diffs are unreadable
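4. Hard to test: logic lives in cells, not importable functions, so unit tests rarely get written
5. Poor error handling: cells assume the happy path, so production failure modes (bad input, timeouts, partial writes) go unhandled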

Mitigations:
- Notebook for exploration; promote verified code to .py modules
- Define pipelines separately (Python script + orchestrator)
- jupytext: sync notebook ↔ .py
- papermill: run notebooks parameterized (batch mode)
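
Promoting verified notebook code to a module could look like the sketch below. The file names and the `clip_outliers` function are hypothetical; the point is only that a plain function can be imported and tested in a way a notebook cell cannot.

```python
# features.py (hypothetical module extracted from a notebook cell)
def clip_outliers(values, lower, upper):
    """Clamp each value into [lower, upper]; raise on an invalid range."""
    if lower > upper:
        raise ValueError(f"lower ({lower}) must not exceed upper ({upper})")
    return [min(max(v, lower), upper) for v in values]


# test_features.py (runs under pytest, or by calling the function directly)
def test_clip_outliers():
    assert clip_outliers([-5, 0, 5, 50], lower=0, upper=10) == [0, 0, 5, 10]
    assert clip_outliers([], lower=0, upper=10) == []


test_clip_outliers()
```

The notebook then imports `clip_outliers` from the module, so exploration and the pipeline share one tested implementation.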

Reproducibility — The Hardest Problem

Common causes of "same code + same data → different result":

1. No random seeds
   numpy.random.seed(42)
   torch.manual_seed(42)
   torch.cuda.manual_seed_all(42)
   random.seed(42)
   os.environ["PYTHONHASHSEED"] = "42"  # affects child processes only;
                                        # export it before launching Python to cover this process

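2. GPU non-determinism
   Some cuDNN ops are non-deterministic
   torch.backends.cudnn.deterministic = True
   torch.backends.cudnn.benchmark = False
   torch.use_deterministic_algorithms(True)  # optional: errors on ops with no deterministic version
   → throughput cost (often slight, but workload-dependent)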

3. Distributed-training ordering
   gradient all-reduce order varies with GPU count / network
   → numerical drift (floating-point non-determinism)

4. Data ordering
   Shuffle seeds
   Data augmentation randomness

5. Environment (library versions)
   PyTorch 1.12 vs 1.13 → the same code can produce different outputs
   → requirements.txt + pinned versions + Docker image

→ Reproducibility takes effort. It is not automatic.
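
The seeding calls listed above can be gathered into one helper, as in this minimal sketch. The name `seed_everything` is an assumption, and NumPy and PyTorch are treated as optional so the function also runs where they are not installed.

```python
import os
import random


def seed_everything(seed: int = 42) -> None:
    """Seed the random sources listed above; skip libraries that are missing."""
    random.seed(seed)
    # Only affects child processes started later: the current interpreter
    # fixed its hash seed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
assert first == second  # same seed, same sequence
```

Calling the helper once at the top of every entry point (training script, evaluation script, tests) is simpler to audit than seeding calls scattered across modules. It does not address distributed all-reduce ordering, which needs separate handling.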

Experiment Tracking

Common anti-pattern:
  result_v1.csv, result_v2.csv, ..., result_v47_final_REAL_NOW.csv
  experiment_2026_05_27_again.ipynb

Problem: no way to trace which hyperparameters produced which result.

What trackers such as MLflow, Weights & Biases, and Neptune offer (MLflow API shown):
  mlflow.log_param("lr", 0.001)
  mlflow.log_param("batch_size", 32)
  mlflow.log_metric("accuracy", 0.92)
  mlflow.log_artifact("model.pt")

  → Runs are stored automatically and can be compared side by side in the UI
  → Sort and filter runs by metric to find the best hyperparameter combinations
  → Reproducibility metadata (git commit, env)
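
To make concrete what a tracker records, here is a toy stand-in built on a JSON-lines file. It is not MLflow's API or storage format; `log_run`, `best_run`, the commit string, and the accuracy values are all invented for illustration.

```python
import json
import tempfile
from pathlib import Path


def log_run(store: Path, params: dict, metrics: dict, commit: str) -> None:
    """Append one run as a JSON line: hyperparameters, metrics, code version."""
    record = {"params": params, "metrics": metrics, "commit": commit}
    with store.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def best_run(store: Path, metric: str) -> dict:
    """Return the recorded run with the highest value of `metric`."""
    with store.open(encoding="utf-8") as f:
        runs = [json.loads(line) for line in f]
    return max(runs, key=lambda r: r["metrics"][metric])


store = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(store, {"lr": 0.01, "batch_size": 32}, {"accuracy": 0.89}, commit="abc123")
log_run(store, {"lr": 0.001, "batch_size": 32}, {"accuracy": 0.92}, commit="abc123")
best = best_run(store, "accuracy")
print(best["params"])
```

Because every result is stored next to the parameters and commit that produced it, "which hyperparameters gave this number" becomes a lookup instead of a guess. A real tracker adds a UI, artifact storage, and concurrent access on top of the same idea.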

Modern stack:
- MLflow (OSS, self-host): widely adopted
- Weights & Biases (SaaS) — strong UI / collaboration
- Neptune (SaaS) — rich metadata
- Vertex AI Experiments (GCP) — cloud native
- SageMaker Experiments (AWS) — cloud native

Orchestration — Defining the Pipeline

The pipeline as a DAG (Directed Acyclic Graph):

  ingest → preprocess → train → evaluate → register → deploy
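
The core job of an orchestrator, running each step only after its dependencies, can be sketched with the standard library's `graphlib` (Python 3.9+). Real orchestrators add scheduling, retries, and caching; the `dag` dictionary below simply mirrors the chain above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each step to the set of steps it depends on.
dag = {
    "ingest": set(),
    "preprocess": {"ingest"},
    "train": {"preprocess"},
    "evaluate": {"train"},
    "register": {"evaluate"},
    "deploy": {"register"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(" -> ".join(order))
```

If a dependency cycle is introduced by mistake, `static_order()` raises `graphlib.CycleError`, which is the "acyclic" part of DAG being enforced.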

Orchestrators:

Kubeflow Pipelines (KFP):
- Kubernetes-native
- Each step = container
- Great for large distributed training
- Steep learning curve

Airflow:
- Traditional data orchestration
- ML is possible but not native
- Steps are Python callables wrapped as tasks (PythonOperator or the @task decorator)

Prefect / Dagster:
- Modern Airflow alternatives
- Plain Python functions pass data (DataFrames included) between steps; easy to run and debug locally
- Unified ML/data

Vertex AI Pipelines (GCP):
- Kubeflow-based, managed
- GCP integration

SageMaker Pipelines (AWS):
- AWS integration
- Studio UI

→ Large enterprise = Kubeflow (vendor-neutral) or cloud-managed.
   Small teams = Prefect / Dagster for simplicity.

Model Registry

The single source of truth for production-bound models:

  model_name: fraud_detector
  versions:
    v1: 2026-01-15, accuracy 0.91, status: archived
    v2: 2026-03-22, accuracy 0.93, status: production
    v3: 2026-05-27, accuracy 0.95, status: staging (A/B in progress)

State transitions:
  staging → production (when A/B results look good)
  production → archived (replaced by newer model)

Features:
- Stores model artifacts (binaries)
- Metadata (accuracy, training data, code commit)
- Approval workflow
- Rollback (revert to previous version)

Implementations: MLflow Model Registry, Vertex AI Model Registry, SageMaker Model Registry.
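
The state transitions and rollback described above can be modeled in a few lines. This is a toy in-memory sketch, not the API of any of the registries named here. The `Registry` class, the stage names, and the rule that promoting a version archives the current production version are assumptions for illustration.

```python
from dataclasses import dataclass, field

# Allowed stage transitions: the ones listed above, plus the initial
# promotion to staging and a rollback path out of the archive.
ALLOWED = {
    ("none", "staging"),
    ("staging", "production"),
    ("production", "archived"),
    ("archived", "production"),  # rollback
}


@dataclass
class ModelVersion:
    version: int
    accuracy: float
    stage: str = "none"


@dataclass
class Registry:
    versions: dict = field(default_factory=dict)

    def register(self, version: int, accuracy: float) -> None:
        self.versions[version] = ModelVersion(version, accuracy)

    def transition(self, version: int, new_stage: str) -> None:
        mv = self.versions[version]
        if (mv.stage, new_stage) not in ALLOWED:
            raise ValueError(f"illegal transition {mv.stage} -> {new_stage}")
        if new_stage == "production":
            # Only one version serves production; archive the current one.
            for other in self.versions.values():
                if other.stage == "production":
                    other.stage = "archived"
        mv.stage = new_stage


reg = Registry()
reg.register(2, 0.93)
reg.register(3, 0.95)
for v in (2, 3):
    reg.transition(v, "staging")
reg.transition(2, "production")
reg.transition(3, "production")  # v2 is archived automatically
reg.transition(2, "production")  # rollback: v2 returns, v3 is archived
```

Keeping the allowed transitions in one table makes the approval workflow explicit: a version cannot jump from unreviewed straight to production without an error.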

Common Pitfalls

Notebooks in production: it works once, then breaks on retraining or an environment change. Extract the code to .py modules and add tests.
Forgetting random seeds: the same code gives a different result ("it worked yesterday"). Seed every source of randomness.
Train/test data leakage: features carry information from the test set or from the future (statistics or target encodings computed over all rows, values not yet known at prediction time). An extremely common bug.
Ignoring the online vs. offline metric gap: a model with great offline metrics can barely move user behavior. An A/B test is mandatory.
No monitoring: post-deploy drift and performance changes stay invisible until they surface as an incident 6 months later.
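
The leakage pitfall is easiest to see in a small example. The sketch below uses made-up rows and a hypothetical cutoff: it splits by time and fits the scaling statistics on the training split only, then shows the leaky alternative for contrast.

```python
from statistics import mean, pstdev

# Hypothetical rows as (timestamp, feature value), already sorted by time.
rows = [(t, float(t)) for t in range(10)]

cutoff = 8  # everything at or after the cutoff is held out
train = [x for t, x in rows if t < cutoff]
test = [x for t, x in rows if t >= cutoff]

# Fit scaling statistics on the training split only, then reuse them.
mu, sigma = mean(train), pstdev(train)
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

# Leaky variant for contrast: the mean is computed over all rows,
# so it already contains information from the held-out period.
leaky_mu = mean(x for _, x in rows)
print(mu, leaky_mu)
```

The two means differ because the leaky one has seen the test period. The same rule applies to any fitted transform (encoders, imputers, feature selection): fit on train, apply to validation and test.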

Wrap-up

An ML training pipeline turns the output of notebook exploration into a reproducible, monitored, and governed system. Tools like MLflow and Kubeflow support that transition.

In practice, a reasonable default for a new ML project: notebook (exploration) → Python scripts (production) + MLflow (tracking) + Kubeflow/Vertex (orchestration) + Model Registry (versioning). If you are committed to a single cloud, the end-to-end integration of SageMaker or Vertex AI is attractive.