Getting from a Jupyter notebook to a production model is mostly not about the model. The bulk of production ML is the pipeline and operations in between: data ingestion, feature engineering, training, evaluation, registration, deployment, and monitoring. This guide covers those stages and the roles of MLflow, Kubeflow, and SageMaker.
The Seven Pipeline Stages
1. Data Ingestion
- Extract raw data from sources (DB, S3, Kafka)
- Declare time range (training data cutoff)
2. Feature Engineering
- Raw → numeric features (one-hot, embeddings, scaling)
- Train / val / test split (random, time-based, group-based)
- Persist to a feature store (for reuse)
3. Training
- Hyperparameter search (grid / random / Bayesian)
- Distributed training (large models)
- Record metrics (accuracy, AUC, RMSE)
4. Evaluation
- Metrics on holdout test set
- Bias / fairness checks (per-group performance)
- Compare against previous production model
5. Registration
- On pass, register in a model registry (version, artifact, metadata)
- Approval workflow (experiment → staging → production)
6. Deployment
- Deploy to serving infra (next guide)
- canary / shadow / blue-green
- Rollback plan
7. Monitoring (continuous)
- Prediction distribution drift
- Feature drift
- Performance metrics (when real-time ground truth exists)
Notebook Limits
Jupyter notebook pros:
- Interactive (run cells, inspect)
- Visualization (matplotlib, seaborn)
- Fast prototyping
Production limits:
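1. Cell-order dependency: cells can be run in any order, and the saved notebook does not reliably show the order that produced its outputs
→ a leading cause of "not reproducible"
2. Environment dependence: whose conda env, which pip versions, which GPU model
3. Hard to version-control: .ipynb is JSON, so diffs are unreadable
4. Hard to test: code that lives in cells cannot be imported by a unit-test runner
5. Poor error handling: exploratory code rarely covers production failure modes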
Mitigations:
- Notebook for exploration; promote verified code to .py modules
- Define pipelines separately (Python script + orchestrator)
- jupytext: sync notebook ↔ .py
- papermill: run notebooks parameterized (batch mode)
Reproducibility: The Hardest Problem
Common causes of "same code + same data → different result":
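1. No random seeds
import random
import numpy
import torch
random.seed(42)
numpy.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
# PYTHONHASHSEED must be set before the interpreter starts (e.g. PYTHONHASHSEED=42 python train.py);
# assigning os.environ["PYTHONHASHSEED"] inside a running script does not change that process's hashing.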
2. GPU non-determinism
Some cudnn ops are non-deterministic
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
→ slight throughput cost
3. Distributed-training ordering
gradient all-reduce order varies with GPU count / network
→ numerical drift (floating-point non-determinism)
4. Data ordering
Shuffle seeds
Data augmentation randomness
5. Environment (library versions)
PyTorch 1.12 vs 1.13 → the same code can produce different outputs
→ requirements.txt + pinned versions + Docker image
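Cause 4 (data ordering) is the easiest one to pin down in code. The following is a minimal sketch, assuming a plain Python list as the dataset; `seeded_split` and its parameters are hypothetical names chosen for illustration. It uses a private `random.Random` instance so that the split does not depend on whatever else has consumed the global random state.

```python
import random


def seeded_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of `items` with a private RNG and split it into train/test."""
    rng = random.Random(seed)   # private generator: unaffected by other random.* calls
    shuffled = list(items)      # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))
train_a, test_a = seeded_split(rows, seed=42)
train_b, test_b = seeded_split(rows, seed=42)
train_c, test_c = seeded_split(rows, seed=7)

# Same seed -> identical split; a different seed -> a different split.
same_seed_matches = (train_a, test_a) == (train_b, test_b)
different_seed_differs = test_a != test_c
```

Passing the seed as an argument also means it can be logged alongside the other hyperparameters of a run.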
→ Reproducibility takes effort. It is not automatic.
Experiment Tracking
Common anti-pattern:
result_v1.csv, result_v2.csv, ..., result_v47_final_REAL_NOW.csv
experiment_2026_05_27_again.ipynb
Problem: no way to trace which hyperparameters produced which result.
MLflow / Weights & Biases / Neptune answer:
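import mlflow
with mlflow.start_run():
    mlflow.log_param("lr", 0.001)
    mlflow.log_param("batch_size", 32)
    mlflow.log_metric("accuracy", 0.92)
    mlflow.log_artifact("model.pt")  # the file must already exist on disk
→ Automatic storage + side-by-side comparison in the UI + version tracking
→ Sort and filter runs by metric to find the best hyperparameter combinations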
→ Reproducibility metadata (git commit, env)
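To make concrete what a tracker stores, here is a minimal stdlib-only sketch. It is not MLflow's API: `log_run` and `best_run` are hypothetical helpers, the JSON-lines file is an assumed storage format, and the accuracy values are illustrative. The point is only that every result is written next to the hyperparameters that produced it, so the "which settings gave this number" question has an answer.

```python
import json
import tempfile
from pathlib import Path


def log_run(log_path, params, metrics):
    """Append one run (hyperparameters + resulting metrics) as a JSON line."""
    record = {"params": params, "metrics": metrics}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def best_run(log_path, metric):
    """Return the logged run with the highest value of `metric`."""
    with open(log_path, encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    return max(runs, key=lambda r: r["metrics"][metric])


# Illustrative runs; the numbers are made up for the example.
log_path = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(log_path, {"lr": 0.001, "batch_size": 32}, {"accuracy": 0.92})
log_run(log_path, {"lr": 0.01, "batch_size": 32}, {"accuracy": 0.88})
log_run(log_path, {"lr": 0.001, "batch_size": 64}, {"accuracy": 0.90})

best = best_run(log_path, "accuracy")
print(best["params"])
```

A real tracker adds what this sketch leaves out: artifact storage, a comparison UI, and automatic capture of the git commit and environment.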
Modern stack:
- MLflow (OSS, self-host): widely adopted
- Weights & Biases (SaaS): strong UI / collaboration
- Neptune (SaaS): rich metadata
- Vertex AI Experiments (GCP): cloud native
- SageMaker Experiments (AWS): cloud native
Orchestration: Defining the Pipeline
The pipeline as a DAG (Directed Acyclic Graph):
ingest → preprocess → train → evaluate → register → deploy
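The DAG above can be sketched with Python's standard-library `graphlib` module (Python 3.9+). This is a toy illustration of what an orchestrator does at its core, namely running each step after its dependencies; `run_pipeline` and the placeholder steps are hypothetical, and real orchestrators add scheduling, retries, caching, and isolation on top.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages that must finish before it can start.
dependencies = {
    "ingest": set(),
    "preprocess": {"ingest"},
    "train": {"preprocess"},
    "evaluate": {"train"},
    "register": {"evaluate"},
    "deploy": {"register"},
}


def run_pipeline(dependencies, steps):
    """Run each step once, in an order that respects the dependencies."""
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        steps[name]()
        executed.append(name)
    return executed


# Placeholder steps; a real step would read and write artifacts.
steps = {name: (lambda: None) for name in dependencies}
order = run_pipeline(dependencies, steps)
print(order)
```

If the dependency graph contains a cycle, `TopologicalSorter` raises `graphlib.CycleError`, which is the "acyclic" requirement enforced in code.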
Orchestrators:
Kubeflow Pipelines (KFP):
- Kubernetes-native
- Each step = container
- Great for large distributed training
- Steep learning curve
Airflow:
- Traditional data orchestration
- ML is possible but not native
- Steps are written as PythonOperator tasks or @task-decorated functions
Prefect / Dagster:
- Modern Airflow alternatives
- Python-native, DataFrame-friendly, easy to debug locally
- One orchestrator for both data and ML workflows
Vertex AI Pipelines (GCP):
- Kubeflow-based, managed
- GCP integration
SageMaker Pipelines (AWS):
- AWS integration
- Studio UI
→ Large enterprise = Kubeflow (vendor-neutral) or cloud-managed.
Small teams = Prefect / Dagster for simplicity.
Model Registry
The "official" registry for production-bound models:
model_name: fraud_detector
versions:
v1: 2026-01-15, accuracy 0.91, status: archived
v2: 2026-03-22, accuracy 0.93, status: production
v3: 2026-05-27, accuracy 0.95, status: staging (A/B in progress)
State transitions:
staging → production (when A/B results look good)
production → archived (replaced by newer model)
Features:
- Stores model artifacts (binaries)
- Metadata (accuracy, training data, code commit)
- Approval workflow
- Rollback (revert to previous version)
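The state transitions and rollback described above can be sketched as a small in-memory class. This is a hypothetical illustration, not the API of any registry product: `Registry`, `ModelVersion`, and the method names are assumptions, and artifact storage and approval steps are left out. The accuracy values mirror the fraud_detector example.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    accuracy: float
    status: str = "staging"   # new versions start in staging


@dataclass
class Registry:
    versions: dict = field(default_factory=dict)

    def register(self, version, accuracy):
        self.versions[version] = ModelVersion(version, accuracy)

    def production(self):
        """Return the current production version, or None if there is none."""
        for mv in self.versions.values():
            if mv.status == "production":
                return mv
        return None

    def promote(self, version):
        """Make `version` production; the previous production version is archived."""
        current = self.production()
        if current is not None:
            current.status = "archived"
        self.versions[version].status = "production"

    def rollback(self, to_version):
        """Restore an archived version; the current production version is archived."""
        if self.versions[to_version].status != "archived":
            raise ValueError("can only roll back to an archived version")
        self.promote(to_version)


registry = Registry()
registry.register(1, 0.91)
registry.promote(1)
registry.register(2, 0.93)
registry.promote(2)          # v1 becomes archived
registry.register(3, 0.95)   # v3 waits in staging

statuses = {v: mv.status for v, mv in registry.versions.items()}
print(statuses)
```

Calling `registry.rollback(1)` at this point would archive v2 and put v1 back in production, which is the "revert to previous version" feature in miniature.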
Implementations: MLflow Model Registry, Vertex AI Model Registry, SageMaker Model Registry
Common Pitfalls
- Notebooks in production: works once, then breaks on retraining or an environment change. Extract the code to .py modules and add tests.
- Forgetting random seeds: same code, different result ("it worked yesterday"). Seed every source of randomness.
- Train / test data leakage: features carry information from the test set (target-derived transforms, future values). An extremely common bug.
- Ignoring the online vs offline metric gap: a model with great offline metrics may barely move user behavior. Validate with an A/B test before full rollout.
- No monitoring: post-deploy drift and performance changes stay invisible until they surface as an incident 6 months later.
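The leakage pitfall is easiest to see with feature scaling. The sketch below is a toy illustration under stated assumptions: `series` is made-up data sorted by time, the split is time-based, and `fit_scaler` / `scale` are hypothetical helpers. Computing the scaling statistics on the full series lets the test period influence the features, while computing them on the training slice does not.

```python
from statistics import mean, pstdev

# Hypothetical daily values, already sorted by time (oldest first).
series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 30.0, 32.0, 31.0, 33.0]

# Time-based split: the last 4 points are the "future" test period.
cutoff = 6
train, test = series[:cutoff], series[cutoff:]


def fit_scaler(values):
    """Return (mean, std) computed from the given values only."""
    return mean(values), pstdev(values)


def scale(values, stats):
    """Standardize values with previously fitted (mean, std)."""
    mu, sigma = stats
    return [(v - mu) / sigma for v in values]


leaky_stats = fit_scaler(series)   # WRONG: statistics include the test period
clean_stats = fit_scaler(train)    # right: statistics come from training data only

leaky_test = scale(test, leaky_stats)
clean_test = scale(test, clean_stats)
```

With the leaky statistics, the test values land close to the scaled training range and the shift in the data is partly hidden; with the clean statistics they fall far outside it, which is what the model would actually face after deployment.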
Wrap-up
An ML training pipeline turns the output of notebook exploration into a system that is reproducible, monitored, and governed. Tools such as MLflow and Kubeflow support that transition.
In practice, a sensible default for a new ML project: notebooks for exploration, Python scripts for production code, MLflow for tracking, Kubeflow or Vertex AI for orchestration, and a model registry for versioning. If everything already runs in a single cloud, the end-to-end integration of SageMaker or Vertex AI is attractive.