Training-serving skew is one of the biggest traps in production ML: subtle differences between how a feature is computed at training time and at serving time can crater accuracy. Feature stores are the systematic fix. This guide covers online vs offline stores, point-in-time correctness, and how Feast and Tecton actually work.
What Is a Feature
Raw data → numeric features:
raw: user had 5 clicks yesterday, logged in 30 min ago, age 28
features:
user_clicks_24h = 5
minutes_since_login = 30
user_age = 28
user_age_bucket = "26-35" (one-hot)
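As a minimal sketch of the mapping above, the function below turns one raw record into those four features. The field names (`clicks_yesterday`, `last_login`, `age`), the function name, and every bucket boundary other than "26-35" are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Age buckets; only "26-35" comes from the example above, the rest are assumed.
AGE_BUCKETS = [(0, 25, "0-25"), (26, 35, "26-35"), (36, 200, "36+")]


def build_features(raw: dict, now: datetime) -> dict:
    """Turn one raw user record into model-ready features."""
    minutes_since_login = int((now - raw["last_login"]).total_seconds() // 60)
    bucket = next(label for lo, hi, label in AGE_BUCKETS if lo <= raw["age"] <= hi)
    features = {
        "user_clicks_24h": raw["clicks_yesterday"],
        "minutes_since_login": minutes_since_login,
        "user_age": raw["age"],
    }
    # One-hot encode the age bucket: exactly one indicator is 1.
    for _, _, label in AGE_BUCKETS:
        features[f"user_age_bucket={label}"] = int(label == bucket)
    return features


now = datetime(2026, 5, 27, 12, 0, tzinfo=timezone.utc)
raw = {"clicks_yesterday": 5, "last_login": now - timedelta(minutes=30), "age": 28}
features = build_features(raw, now)
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the computation reproducible when the same function is later replayed over historical data.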
Training:
data → compute features (batch) → model trains
Serving (production):
user request → compute the same features → model predicts
↑
any drift here from training = skew
Common skew causes:
- training: Spark sliding-window compute
- serving: Python query elsewhere → different result
- training: timezone UTC
- serving: forgot timezone conversion
- training: NULL → 0
- serving: NULL passes through → exception or different value
The Feature Store Answer
Define feature computation in one place; guarantee training and
serving share the same computation.
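Before looking at a real feature store, the idea can be sketched in plain Python: the NULL handling and the timezone conversion each live in exactly one function, and both the batch path and the online path call it. All function and field names here are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def clicks_feature(raw_clicks: Optional[int]) -> int:
    """Single definition of the feature: NULL (None) always becomes 0."""
    return 0 if raw_clicks is None else int(raw_clicks)


def utc_hour_feature(ts: datetime) -> int:
    """Single definition of the hour feature: always converted to UTC first."""
    return ts.astimezone(timezone.utc).hour


def build_training_rows(history: list) -> list:
    # Batch path: applies the shared functions to every historical record.
    return [
        {"clicks": clicks_feature(r["clicks"]), "hour": utc_hour_feature(r["ts"])}
        for r in history
    ]


def build_serving_row(request: dict) -> dict:
    # Online path: calls the same functions, so the two paths cannot diverge.
    return {
        "clicks": clicks_feature(request["clicks"]),
        "hour": utc_hour_feature(request["ts"]),
    }


# A record with a NULL click count and a non-UTC timestamp (UTC+9).
record = {
    "clicks": None,
    "ts": datetime(2026, 3, 15, 18, 0, tzinfo=timezone(timedelta(hours=9))),
}
train_row = build_training_rows([record])[0]
serve_row = build_serving_row(record)
```

Both paths produce the same row for the same input, which is the property a feature store enforces at larger scale.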
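Definition (Feast example; transactions_source is assumed to be defined elsewhere in the feature repo):
import pandas as pd
from feast import Field
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Int64
@on_demand_feature_view(
    sources=[transactions_source],
    schema=[Field(name="user_clicks_24h", dtype=Int64)],
)
def user_clicks_24h_feat(features: pd.DataFrame) -> pd.DataFrame:
    # The same logic runs during training and at serving.
    # Feast expects a DataFrame back, with one column per schema field.
    out = pd.DataFrame()
    out["user_clicks_24h"] = features.groupby("user_id")["clicks"].transform("sum")
    return out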
→ Generate training data with this function and call the same function at serving
→ Skew from divergent implementations = 0
Online vs Offline Store
A feature store has two storage backends:
Offline Store (batch):
- Large historical data (years)
- For training-data generation
- Usually a warehouse (BigQuery / Snowflake) or lake (Parquet on S3)
- Bulk reads, low latency not needed
Online Store (low-latency):
- Only current features (latest per user)
- For real-time serving predictions
- Redis / DynamoDB / Cassandra
- < 10ms read latency required
A batch ETL job syncs offline → online:
- Every minute / hour / day
- Some streaming (CDC + Kafka → online)
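The sync step can be sketched with in-memory stand-ins for both stores (Feast calls this step materialization). The row layout and the `materialize` function below are assumptions for illustration, not any tool's actual API.

```python
from datetime import datetime

# Offline store stand-in: full history, one row per (user, timestamp).
offline_rows = [
    {"user_id": "u1", "ts": datetime(2026, 3, 14), "user_clicks_24h": 3},
    {"user_id": "u1", "ts": datetime(2026, 3, 15), "user_clicks_24h": 5},
    {"user_id": "u2", "ts": datetime(2026, 3, 15), "user_clicks_24h": 1},
]


def materialize(rows: list, online: dict) -> dict:
    """Copy the latest row per user from the offline history into the online store."""
    for row in rows:
        current = online.get(row["user_id"])
        # Only overwrite when the incoming row is newer than what is stored.
        if current is None or row["ts"] > current["ts"]:
            online[row["user_id"]] = row
    return online


# Online store stand-in: a dict keyed by entity id (Redis or DynamoDB in practice).
online_store = materialize(offline_rows, {})
```

Comparing timestamps before overwriting makes the job safe to re-run or to run over overlapping time ranges, since an older row never replaces a newer one.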
→ The same feature lives in both stores; sync is the key.
Point-in-Time Correctness
The biggest training trap: using future information as a feature.
Example: building a "clicks in the last 7 days" feature for 2026-03-15
- WRONG: compute with current (2026-05-27) data → also includes
post-2026-03-15 clicks → "data leakage"
- RIGHT: only data knowable as of 2026-03-15
→ Features need "as-of" timestamps.
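An as-of lookup can be sketched with the standard library alone: for each training event, take the latest feature value whose timestamp is not after the event. The history values and the `feature_as_of` helper are made up for illustration; a real feature store does this per entity and at scale.

```python
from bisect import bisect_right
from datetime import datetime

# Feature history for one user: (time the value became known, value), sorted by time.
clicks_7d_history = [
    (datetime(2026, 3, 10), 12),
    (datetime(2026, 3, 15), 20),
    (datetime(2026, 5, 27), 95),  # computed later; must not leak into March rows
]


def feature_as_of(history: list, event_ts: datetime):
    """Return the latest value whose timestamp is <= event_ts, or None if none exists."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_ts)
    return history[idx - 1][1] if idx > 0 else None


training_events = [datetime(2026, 3, 15), datetime(2026, 3, 9), datetime(2026, 4, 1)]
joined = [feature_as_of(clicks_7d_history, ts) for ts in training_events]
```

The event on 2026-03-09 gets `None` because no value was known yet, and the event on 2026-04-01 gets the March value rather than the one computed on 2026-05-27.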
Feast's point-in-time join:
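from feast import FeatureStore
feature_store = FeatureStore(repo_path=".")  # repo path is an assumption
# entity_df: pandas DataFrame with columns user_id and event_timestamp, one row per training example
training_df = feature_store.get_historical_features(
    entity_df=entity_df,
    features=["user_clicks_24h_feat:user_clicks_24h"],  # "<feature_view>:<feature_name>"
).to_df()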
→ Joins features valid as of each row's event_timestamp
→ Blocks future info automatically
This "time-travel join" is the feature store's defining capability.
Writing it directly in SQL is possible but enormous (dozens to hundreds of lines).
Feature Store Tools
| Tool | Characteristics |
|---|---|
| Feast (OSS) | Most popular OSS; BigQuery/Snowflake/Redshift offline + Redis/DynamoDB online |
| Tecton (SaaS) | Commercial platform from a company that became a core Feast contributor; strong on streaming features |
| Databricks Feature Store | Databricks-integrated, Delta Lake-based |
| Vertex AI Feature Store | GCP managed, BigQuery-integrated |
| SageMaker Feature Store | AWS managed, online (DynamoDB) + offline (S3) |
| Hopsworks | OSS + commercial, strong Spark/Flink support |
Why Feature Stores Are Infrastructure
Why a library alone won't do:
1. Online store HA (24/7, low latency)
2. Offline store historical scan (TBs)
3. Sync between both stores (consistency)
4. Sharing across models / teams (feature reuse)
5. Governance — which feature is used by which model
6. Permissions — access control for sensitive features (PII)
7. Discovery — "has someone already made a similar feature?"
→ "feature-definition library + two DBs + sync jobs + UI + access
control" = company infrastructure.
For small teams (1–2 models), a feature store is likely over-engineering. Consider adoption at 5+ models or 100+ features.
Streaming Features
When real-time features (like "last 1 minute clicks") are needed:
Flow:
user click event → Kafka → Flink (window aggregation) → online store
Complex example (fraud detection):
"count of transactions on this card in the last 10 minutes"
→ updated on each transaction
→ fresh value used at serving
→ Tecton, Hopsworks etc. support streaming features as first-class.
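The fraud-detection count above can be sketched as a per-card sliding window. This is a single-process stand-in for what Flink would do, and it assumes events arrive in timestamp order (no late events); the class and method names are hypothetical.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)


class CardTxnCounter:
    """Counts transactions per card within the trailing 10-minute window."""

    def __init__(self) -> None:
        self._events = defaultdict(deque)  # card_id -> timestamps, oldest first

    def record(self, card_id: str, ts: datetime) -> int:
        """Register a transaction and return the fresh windowed count."""
        q = self._events[card_id]
        q.append(ts)
        # Evict timestamps that have fallen out of the window.
        while q and q[0] <= ts - WINDOW:
            q.popleft()
        return len(q)


counter = CardTxnCounter()
t0 = datetime(2026, 3, 15, 12, 0)
counts = [
    counter.record("card-1", t0),
    counter.record("card-1", t0 + timedelta(minutes=4)),
    counter.record("card-1", t0 + timedelta(minutes=11)),
]
```

By the third transaction the first one is more than 10 minutes old and has been evicted, so the count stays at 2. In production the returned count would be written to the online store on every event.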
Traps:
- Streaming aggregation watermarks / late events (separate guide)
- Hard to guarantee exact match between historical training and streaming
- Streaming infra cost
Common Pitfalls
- Features computed inside the training script: the logic gets rewritten at serving, which causes skew. From day one, put feature logic in separate shared functions or a feature store.
- No point-in-time joins: future information leaks in, giving great offline accuracy and broken production behavior.
- Stale data in the online store: if the sync interval is too long, decisions use old features.
- Forced "company-wide feature store": over-engineering for small teams. Consider one at 5+ models or 100+ features.
- Unclear feature ownership: who defines, maintains, and has change rights? Define governance up front.
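One way to guard against the stale-data pitfall is a freshness check at read time. The sketch below is an assumption-laden illustration: `MAX_AGE`, the `updated_at` field, and the fallback default are all hypothetical and would need to be chosen per feature.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=1)  # hypothetical freshness budget for this feature


def read_fresh(online: dict, user_id: str, now: datetime, default: int = 0) -> int:
    """Return the stored feature value, or a default when it is missing or too old."""
    row = online.get(user_id)
    if row is None or now - row["updated_at"] > MAX_AGE:
        return default
    return row["user_clicks_24h"]


now = datetime(2026, 3, 15, 12, 0)
online_store = {
    "u1": {"user_clicks_24h": 5, "updated_at": now - timedelta(minutes=10)},
    "u2": {"user_clicks_24h": 9, "updated_at": now - timedelta(hours=6)},
}
fresh = read_fresh(online_store, "u1", now)
stale = read_fresh(online_store, "u2", now)
```

The fallback value should match whatever the training pipeline used for missing data; otherwise the guard itself becomes a new source of skew.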
Wrap-up
At their core, feature stores do three things: guarantee identical feature computation at training and serving, provide point-in-time joins to prevent leakage, and separate online from offline storage. Together these are the systematic fix for skew, ML production's biggest trap.
In practice, small ML projects (1–2 models) can stick to plain shared functions. At 5+ models or 100+ features, consider Feast (OSS) or a cloud-managed store (Vertex AI / SageMaker). For streaming-heavy workloads, consider Tecton or Hopsworks.