
How Feature Stores Actually Work

Online (low-latency) vs offline (batch) feature serving, point-in-time correctness (the time-travel problem at training time), Feast / Tecton / Databricks Feature Store, training-serving skew, and why feature stores are infrastructure, not a library.


Training-serving skew is one of ML production's biggest traps: subtle differences between how a feature is computed at training time and at serving time can crater accuracy. Feature stores are the systematic fix. This guide covers online vs offline stores, point-in-time correctness, and how Feast and Tecton actually work.

What Is a Feature

Raw data → numeric features:

raw: user had 5 clicks yesterday, logged in 30 min ago, age 28
features:
  user_clicks_24h = 5
  minutes_since_login = 30
  user_age = 28
  user_age_bucket = "26-35"  (categorical; one-hot encoded for the model)
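
As a rough sketch, the mapping above could be written as one plain Python function. The raw-record field names (`click_times`, `last_login`, `age`) and every bucket boundary other than 26-35 are assumptions made for illustration, not part of any particular feature store.

```python
from datetime import datetime, timedelta, timezone

def age_bucket(age: int) -> str:
    # Hypothetical bucket edges; only "26-35" appears in the example above.
    for low, high in [(0, 17), (18, 25), (26, 35), (36, 50)]:
        if low <= age <= high:
            return f"{low}-{high}"
    return "51+"

def compute_features(raw: dict, now: datetime) -> dict:
    """Turn one raw user record into model-ready features as of `now`."""
    day_ago = now - timedelta(hours=24)
    return {
        "user_clicks_24h": sum(1 for t in raw["click_times"] if day_ago < t <= now),
        "minutes_since_login": int((now - raw["last_login"]).total_seconds() // 60),
        "user_age": raw["age"],
        "user_age_bucket": age_bucket(raw["age"]),
    }

now = datetime(2026, 5, 27, 12, 0, tzinfo=timezone.utc)
raw = {
    # Six clicks, one of them older than 24 hours
    "click_times": [now - timedelta(hours=h) for h in (1, 3, 5, 8, 20, 30)],
    "last_login": now - timedelta(minutes=30),
    "age": 28,
}
features = compute_features(raw, now)
print(features)
```

Passing `now` in explicitly, rather than reading the clock inside the function, is what later lets the same function be replayed for historical timestamps.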

Training:
  data → compute features (batch) → model trains

Serving (production):
  user request → compute the same features → model predicts
                            ↑
                  any drift here from training = skew

Common skew causes (each is a training/serving pair):
- Different engines
    training: sliding-window aggregate computed in Spark
    serving:  re-implemented as a Python query elsewhere → different result
- Timezones
    training: timestamps in UTC
    serving:  timezone conversion forgotten
- NULL handling
    training: NULL → 0
    serving:  NULL passes through → exception or a different value
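
To make two of these causes concrete, here is a small sketch with deliberately mismatched training and serving code. All four function names are hypothetical; they stand in for a batch pipeline and a separately written serving path.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def clicks_training(value: Optional[int]) -> int:
    # Batch pipeline: NULL is imputed as 0.
    return 0 if value is None else value

def clicks_serving(value: Optional[int]) -> Optional[int]:
    # Re-implemented serving path: the NULL rule was forgotten.
    return value

def hour_training(ts: datetime) -> int:
    # Batch pipeline: hour of day after converting to UTC.
    return ts.astimezone(timezone.utc).hour

def hour_serving(ts: datetime) -> int:
    # Serving path: uses the event's local hour, no conversion.
    return ts.hour

# An event at 23:30 in a UTC+9 timezone
event = datetime(2026, 3, 15, 23, 30, tzinfo=timezone(timedelta(hours=9)))

print(clicks_training(None), clicks_serving(None))
print(hour_training(event), hour_serving(event))
```

The model was trained on one pair of values and is served the other. Neither path raises an error, which is why this kind of skew tends to go unnoticed.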

The Feature Store Answer

Define feature computation in one place; guarantee training and
serving share the same computation.

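Definition (Feast example; on-demand views are row-level transforms,
so the 24h aggregate itself is assumed to be precomputed upstream):
  import pandas as pd
  from feast import Field
  from feast.on_demand_feature_view import on_demand_feature_view
  from feast.types import Int64

  @on_demand_feature_view(
    sources=[transactions_source],  # defined elsewhere; assumed to provide clicks_24h
    schema=[Field(name="user_clicks_24h", dtype=Int64)],
  )
  def user_clicks_24h_feat(inputs: pd.DataFrame) -> pd.DataFrame:
    # The same logic runs during training and at serving
    out = pd.DataFrame()
    out["user_clicks_24h"] = inputs["clicks_24h"].fillna(0).astype("int64")
    return out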

→ Generate training data with this function and call the same function at serving
→ Skew from divergent feature logic disappears (stale or late data can still cause skew)
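
The idea of one definition feeding both paths can be sketched without any feature store library. The registry, decorator, and helper names below are hypothetical; this is a minimal illustration of the principle, not how Feast is implemented.

```python
from typing import Callable, Dict, List

# Hypothetical mini-registry: feature name -> function of one input row
FEATURES: Dict[str, Callable[[dict], int]] = {}

def feature(name: str):
    """Register a feature function under `name`."""
    def wrap(fn):
        FEATURES[name] = fn
        return fn
    return wrap

@feature("user_clicks_24h")
def user_clicks_24h(row: dict) -> int:
    # The NULL rule lives here and nowhere else.
    clicks = row.get("clicks_24h")
    return 0 if clicks is None else int(clicks)

def build_training_rows(history: List[dict]) -> List[dict]:
    # Batch path: apply every registered feature to each historical row.
    return [{name: fn(row) for name, fn in FEATURES.items()} for row in history]

def serve_features(request_row: dict) -> dict:
    # Serving path: apply the very same functions to one request.
    return {name: fn(request_row) for name, fn in FEATURES.items()}

train = build_training_rows([{"clicks_24h": 5}, {"clicks_24h": None}])
online = serve_features({"clicks_24h": None})
print(train, online)
```

Because both paths look the function up in the same registry, a change to the NULL rule reaches training and serving together.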

Online vs Offline Store

A feature store has two storage backends:

Offline Store (batch):
- Large historical data (years)
- For training-data generation
- Usually a warehouse (BigQuery / Snowflake) or lake (Parquet on S3)
- Bulk reads, low latency not needed

Online Store (low-latency):
- Only current features (latest per user)
- For real-time serving predictions
- Redis / DynamoDB / Cassandra
- Read latency under 10 ms is the usual target

A sync job copies the latest values offline → online (Feast calls this materialization):
- Batch: every minute / hour / day
- Streaming, where fresher values are needed: CDC + Kafka → online

→ The same feature lives in both stores; sync is the key.
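
A toy version of the two stores and the sync between them, assuming a list of timestamped rows as the offline store and a plain dict as the online store. The `materialize` helper is a hypothetical stand-in for the real sync job.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Offline store: full history, one row per (user_id, timestamp, user_clicks_24h).
offline_rows: List[Tuple[str, datetime, int]] = [
    ("u1", datetime(2026, 3, 14), 3),
    ("u1", datetime(2026, 3, 15), 5),
    ("u2", datetime(2026, 3, 15), 1),
]

def materialize(rows, as_of: datetime) -> Dict[str, Tuple[datetime, int]]:
    """Copy the latest value per user (at or before `as_of`) into a dict
    that stands in for the online key-value store."""
    online: Dict[str, Tuple[datetime, int]] = {}
    for user_id, ts, value in rows:
        if ts <= as_of and (user_id not in online or ts > online[user_id][0]):
            online[user_id] = (ts, value)
    return online

online_store = materialize(offline_rows, as_of=datetime(2026, 3, 15, 12))
print(online_store["u1"][1])  # latest value for u1
```

The online side keeps one row per user, which is what makes a key lookup fast; the price is that it is only as fresh as the last sync.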

Point-in-Time Correctness

The biggest trap at training time is using future information as a feature.

Example: building a "clicks in the last 7 days" feature for a training row dated 2026-03-15
   - WRONG: compute from all data available today (2026-05-27) → clicks
            after 2026-03-15 leak in → "data leakage"
   - RIGHT: use only data that was knowable as of 2026-03-15

→ Features need "as-of" timestamps.
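
A small sketch of the difference, using made-up click timestamps. The leaky version forgets the upper bound on the window; the `clicks_7d` helper is a hypothetical name.

```python
from datetime import datetime, timedelta

clicks = [
    datetime(2026, 3, 10),
    datetime(2026, 3, 14),
    datetime(2026, 3, 20),  # after the label date
    datetime(2026, 5, 1),   # after the label date
]

def clicks_7d(click_times, as_of: datetime) -> int:
    """Count clicks in the 7 days up to and including `as_of`."""
    start = as_of - timedelta(days=7)
    return sum(1 for t in click_times if start < t <= as_of)

label_time = datetime(2026, 3, 15)

# WRONG: lower bound only, so later clicks are counted too
leaky = len([t for t in clicks if t > label_time - timedelta(days=7)])
# RIGHT: window is closed at the as-of timestamp
correct = clicks_7d(clicks, label_time)

print(leaky, correct)
```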

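Feast's point-in-time join:
  from feast import FeatureStore

  store = FeatureStore(repo_path=".")
  # entity_df: pandas DataFrame with columns user_id, event_timestamp
  training_df = store.get_historical_features(
    entity_df=entity_df,
    # "<feature_view>:<feature>"; the view name user_stats is illustrative
    features=["user_stats:user_clicks_24h"],
  ).to_df()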

  → Joins features valid as of each row's event_timestamp
  → Blocks future info automatically

This "time-travel join" is the feature store's defining capability.
Writing it by hand in SQL is possible, but the query grows long (dozens to hundreds of lines) and is easy to get subtly wrong.
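
The core of a point-in-time join is an as-of lookup: for each training row, take the most recent feature value at or before that row's timestamp. Below is a minimal in-memory sketch using binary search. The data and the `as_of_lookup` name are assumptions for illustration, and production implementations typically also bound how far back the lookup may reach.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# Feature history per user, sorted by timestamp: (valid_from, user_clicks_24h)
feature_history: Dict[str, List[Tuple[datetime, int]]] = {
    "u1": [
        (datetime(2026, 3, 1), 2),
        (datetime(2026, 3, 10), 7),
        (datetime(2026, 4, 1), 9),  # must never reach a March training row
    ],
}

def as_of_lookup(user_id: str, event_ts: datetime) -> Optional[int]:
    """Return the most recent feature value with valid_from <= event_ts."""
    history = feature_history.get(user_id, [])
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_ts)
    return history[idx - 1][1] if idx > 0 else None

# Entity rows: (user_id, event_timestamp), one per training example
entity_rows = [
    ("u1", datetime(2026, 3, 15)),
    ("u1", datetime(2026, 2, 1)),   # before any feature value existed
    ("u2", datetime(2026, 3, 15)),  # unknown user
]
training_rows = [(u, ts, as_of_lookup(u, ts)) for u, ts in entity_rows]
print(training_rows)
```

Rows with no value knowable at their timestamp get `None` rather than a later value, which is the behaviour that blocks leakage.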

Feature Store Tools

Tool                       Characteristics
Feast (OSS)                Most popular OSS; BigQuery/Snowflake/Redshift offline + Redis/DynamoDB online
Tecton (SaaS)              Commercial platform from the team behind Uber's Michelangelo and a major Feast contributor; strong on streaming features
Databricks Feature Store   Databricks-integrated, Delta Lake-based
Vertex AI Feature Store    GCP managed, BigQuery-integrated
SageMaker Feature Store    AWS managed, online store + offline store (S3)
Hopsworks                  OSS + commercial, strong Spark/Flink support

Why Feature Stores Are Infrastructure

Why a library alone won't do:

1. Online store HA (24/7, low latency)
2. Offline store historical scan (TBs)
3. Sync between both stores (consistency)
4. Sharing across models / teams (feature reuse)
5. Governance — which feature is used by which model
6. Permissions — access control for sensitive features (PII)
7. Discovery — "has someone already made a similar feature?"

→ "feature-definition library + two DBs + sync jobs + UI + access
   control" = company infrastructure.

For small teams (1–2 models) a feature store is likely over-engineering.
Consider adopting one at 5+ models or 100+ features.

Streaming Features

When real-time features (like "last 1 minute clicks") are needed:

Flow:
  user click event → Kafka → Flink (window aggregation) → online store

Complex example (fraud detection):
  "count of transactions on this card in the last 10 minutes"
  → updated on each transaction
  → fresh value used at serving

→ Tecton, Hopsworks etc. support streaming features as first-class.

Traps:
- Streaming aggregation watermarks / late events (separate guide)
- Hard to guarantee exact match between historical training and streaming
- Streaming infra cost
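
The fraud-detection aggregate above can be sketched as a per-card sliding window. This is a single-process illustration, assuming events arrive in timestamp order; it does not handle the late events mentioned in the traps, and the class name is hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict

WINDOW_SECONDS = 600  # 10 minutes

class CardTxnCounter:
    """Per-card count of transactions in the trailing window.
    Timestamps are event times in seconds."""

    def __init__(self) -> None:
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def add(self, card_id: str, ts: float) -> int:
        q = self._events[card_id]
        q.append(ts)
        # Evict events that have fallen out of the 10-minute window
        while q and q[0] <= ts - WINDOW_SECONDS:
            q.popleft()
        return len(q)  # the fresh feature value to write to the online store

counter = CardTxnCounter()
counts = [counter.add("card-1", t) for t in (0, 120, 500, 650, 1300)]
print(counts)
```

In a real pipeline the stream processor keeps this state and writes each returned count to the online store, so that serving reads the value updated by the latest transaction.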

Common Pitfalls

  • Features inside the training script: the logic gets rewritten at serving → skew. From day one, keep feature logic in shared functions or a feature store.
  • No point-in-time joins: future leakage → great offline accuracy, broken production.
  • Stale data in the online store: sync interval too long → decisions use old features.
  • Forced "company-wide feature store": over-engineering for small teams. Consider one at 5+ models / 100+ features.
  • Unclear feature ownership: who defines, maintains, and may change each feature? Define governance.
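
For the stale-data pitfall, one possible guard is to store a write timestamp next to each online value and check it at read time. The freshness budget and the `read_fresh` helper below are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

MAX_AGE = timedelta(hours=1)  # hypothetical freshness budget

def read_fresh(
    entry: Optional[Tuple[datetime, int]], now: datetime, default: int = 0
) -> Tuple[int, bool]:
    """Return (value, is_fresh). Falls back to `default` when the entry
    is missing or older than MAX_AGE."""
    if entry is None:
        return default, False
    written_at, value = entry
    if now - written_at > MAX_AGE:
        return default, False
    return value, True

now = datetime(2026, 5, 27, 12, 0)
fresh = read_fresh((datetime(2026, 5, 27, 11, 30), 5), now)
stale = read_fresh((datetime(2026, 5, 27, 9, 0), 5), now)
print(fresh, stale)
```

The fallback value should match whatever training used for missing data, otherwise the guard introduces a skew of its own. Counting how often `is_fresh` is false gives a simple signal that the sync interval is too long.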

Wrap-up

At their core, feature stores do three things: guarantee identical feature computation at training and serving, use point-in-time joins to prevent leakage, and separate online from offline storage. Together these are the systematic fix for ML production's biggest trap (skew).

In practice, small ML projects (1–2 models) can stick to plain functions. At 5+ models or 100+ features, consider Feast (OSS) or a cloud-managed store (Vertex AI / SageMaker). For streaming-heavy workloads, look at Tecton or Hopsworks.
