How does a feature store work?

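One of the biggest traps in production ML is training-serving skew: the way a feature is computed at training time differs subtly from the way it is computed at serving time, and accuracy can drop sharply. A feature store is the standard answer. This guide covers online vs. offline stores, point-in-time correctness, and how Feast and Tecton work.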

What a feature is

raw data → model-ready feature:

raw: the user clicked 5 times in the last 24 hours, logged in 30 minutes ago, age 28
feature:
  user_clicks_24h = 5
  minutes_since_login = 30
  user_age = 28
  user_age_bucket = "26-35"  (categorical, one-hot encoded)

training:
  data → feature computation (batch) → model train

serving (production):
  user request → same feature computation → model predict
                            ↑
          skew appears if this step differs from training

Common causes of skew:
- training: sliding-window aggregation in Spark
- serving: a different query written in Python → different result
- training: timestamps in UTC
- serving: timezone conversion omitted
- training: NULL replaced with 0
- serving: NULL left as is → exception or a different result
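
To make two of these causes concrete, here is a minimal sketch in plain Python. All function names are hypothetical and stand in for a batch path and a hand-rewritten serving path; the point is only that the same input yields different feature values.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def clicks_feature_training(raw_clicks: Optional[int]) -> int:
    """Batch path: a missing click count is treated as 0."""
    return 0 if raw_clicks is None else raw_clicks


def clicks_feature_serving_buggy(raw_clicks: Optional[int]) -> Optional[int]:
    """Serving path rewritten by hand: the NULL rule was forgotten."""
    return raw_clicks


def day_bucket_utc(ts: datetime) -> str:
    """Training path: bucket events by UTC calendar day."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def day_bucket_local_buggy(ts: datetime) -> str:
    """Serving path: formats the timestamp in its own offset, skipping the UTC conversion."""
    return ts.strftime("%Y-%m-%d")


KST = timezone(timedelta(hours=9))
event = datetime(2026, 3, 15, 3, 0, tzinfo=KST)  # 03:00 KST is 18:00 UTC on the previous day

null_skew = clicks_feature_training(None) != clicks_feature_serving_buggy(None)
tz_skew = day_bucket_utc(event) != day_bucket_local_buggy(event)
print(null_skew, tz_skew)  # True True
```

Neither buggy function raises an error, which is why this kind of skew tends to go unnoticed until model quality degrades.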

The feature store's answer

Define each feature computation in one place, and run that same
computation for both training and serving.

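Definition (Feast example; exact API details vary by Feast version):
  import pandas as pd
  from feast import Field
  from feast.on_demand_feature_view import on_demand_feature_view
  from feast.types import Int64

  @on_demand_feature_view(
    sources=[transactions_source],  # defined elsewhere; assumed to expose a "clicks" column
    schema=[Field(name="user_clicks_24h", dtype=Int64)],
  )
  def user_clicks_24h_feat(features: pd.DataFrame) -> pd.DataFrame:
    # The same logic runs for both training and serving.
    # On-demand views transform rows; the 24h aggregation is assumed to be precomputed into "clicks".
    return pd.DataFrame({"user_clicks_24h": features["clicks"].fillna(0).astype("int64")})

→ called when generating training data, and the same function is called at serving time
→ duplicated feature logic, a major source of skew, is removed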

Online vs Offline Store

A feature store has two storage layers:

Offline Store (batch):
- large historical data (years' worth)
- used to generate training data
- usually a warehouse (BigQuery / Snowflake) or a lake (Parquet on S3)
- must support high-volume reads; low latency is not required

Online Store (low-latency):
- current feature values only (latest per user)
- backs real-time prediction at serving time
- Redis / DynamoDB / Cassandra
- read latency under 10 ms is typically required

A batch ETL job syncs offline → online:
- every minute / every hour / every day
- some features are streamed instead (CDC + Kafka → online)

→ The same feature lives in both stores, so keeping them in sync is the key.
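
The offline → online sync can be sketched as follows. This is a toy model under stated assumptions: a list of tuples stands in for the offline store, a dict stands in for the online store, and `materialize` is a hypothetical name rather than any specific product's API.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Offline store stand-in: full history as (user_id, event_time, user_clicks_24h) rows.
OfflineRow = Tuple[str, datetime, int]
# Online store stand-in: only the latest (event_time, value) per user.
OnlineStore = Dict[str, Tuple[datetime, int]]


def materialize(offline_rows: List[OfflineRow], online: OnlineStore) -> None:
    """Copy the newest value per user from offline history into the online store."""
    for user_id, event_time, value in offline_rows:
        current = online.get(user_id)
        # Only overwrite when the offline row is newer than what is already online.
        if current is None or event_time > current[0]:
            online[user_id] = (event_time, value)


offline_store: List[OfflineRow] = [
    ("u1", datetime(2026, 3, 14), 3),
    ("u1", datetime(2026, 3, 15), 5),
    ("u2", datetime(2026, 3, 15), 1),
]
online_store: OnlineStore = {}
materialize(offline_store, online_store)
print(online_store["u1"][1])  # 5, the latest value for u1
```

Keeping the event time next to the value makes the job safe to re-run and lets a reader of the online store judge how old a value is.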

Point-in-Time Correctness

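The biggest trap at training time is using future information as a feature.

Example: building a user's "clicks in the last 7 days" feature as of 2026-03-15
   - WRONG: compute from the data as it stands today (2026-05-27) → clicks after
            2026-03-15 are included → "data leakage"
   - RIGHT: use only the data that was knowable as of 2026-03-15

→ Every feature value needs an "as-of" timestamp.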

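Feast's point-in-time join:
  import pandas as pd
  # feature_store: a feast.FeatureStore instance created elsewhere
  entity_df = pd.DataFrame({  # one row per training example (example values)
    "user_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2026-03-15", "2026-03-16"], utc=True),
  })
  training_df = feature_store.get_historical_features(
    entity_df=entity_df,
    features=["user_clicks_24h_feat:user_clicks_24h"],  # "<feature view>:<feature>"
  ).to_df()

  → each row is joined only with feature values valid as of its event_timestamp
  → future information is excluded automatically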

This "time-travel join" is a core capability of a feature store.
You can write it by hand in SQL, but it gets complicated quickly (tens to hundreds of lines).
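
The core of the join is an "as-of" lookup: for each training row, take the latest feature value whose timestamp is not after the row's timestamp. A minimal standard-library sketch follows; the `history` data and the `feature_as_of` helper are hypothetical and are not how any particular feature store implements it.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# Hypothetical feature history per user, sorted by the time each value became known.
history: Dict[str, List[Tuple[datetime, int]]] = {
    "u1": [
        (datetime(2026, 3, 10), 2),
        (datetime(2026, 3, 14), 4),
        (datetime(2026, 3, 20), 9),  # not yet known on 2026-03-15
    ],
}


def feature_as_of(user_id: str, as_of: datetime) -> Optional[int]:
    """Return the latest value with timestamp <= as_of, or None if nothing was known yet."""
    rows = history.get(user_id, [])
    times = [t for t, _ in rows]
    idx = bisect_right(times, as_of)  # number of rows with timestamp <= as_of
    return rows[idx - 1][1] if idx > 0 else None


# Entity rows play the role of entity_df: (user_id, event_timestamp).
entity_rows = [("u1", datetime(2026, 3, 15)), ("u1", datetime(2026, 3, 1))]
training_rows = [(u, ts, feature_as_of(u, ts)) for u, ts in entity_rows]
print(training_rows[0][2], training_rows[1][2])  # 4 None
```

The value 9 from 2026-03-20 never reaches the 2026-03-15 row, which is exactly the leak the join prevents. Production systems usually also bound how far back a value may be taken from (a TTL), which this sketch omits.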

Representative feature store tools

Tool	Characteristics
Feast (OSS)	Widely used OSS option; BigQuery/Snowflake/Redshift offline + Redis/DynamoDB online
Tecton (SaaS)	Commercial platform; the company was a major Feast contributor; strong streaming features
Databricks Feature Store	Integrated with Databricks; built on Delta Lake
Vertex AI Feature Store	GCP managed; integrated with BigQuery
SageMaker Feature Store	AWS managed; low-latency online store + offline store on S3
Hopsworks	OSS + commercial; strong Spark/Flink support

Why a feature store is infrastructure

Why a library alone is not enough:

1. High availability of the online store (24/7, low latency)
2. Historical scans over the offline store (several TB)
3. Sync between the two stores (consistency)
4. Sharing across multiple models and teams (feature reuse)
5. Governance: which feature is used by which model
6. Permissions: access control for sensitive features (PII)
7. Discovery: "has someone already built a similar feature?"

→ "feature definition library + two databases + sync job + UI + access control"
   = company-wide infrastructure.

For a small team (1-2 models) this can be over-engineering. Consider
adopting one once you reach 5+ models or 100+ features.

Streaming Feature

When you need real-time features (such as clicks in the last minute):

Flow:
  user click event → Kafka → Flink (window aggregation) → online store

complex example (fraud detection):
  "number of transactions this user made with this card in the last 10 minutes"
  → updated on every transaction
  → the fresh value is used at serving time

→ Tecton, Hopsworks, and similar tools support streaming features as first-class.
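
The fraud-detection feature above can be sketched as an in-memory sliding window. This is a simplification, not what Flink or any feature platform does internally: `CardTxnCounter` is a hypothetical class, timestamps are epoch seconds, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque
from typing import Deque, Dict

WINDOW_SECONDS = 600  # 10 minutes


class CardTxnCounter:
    """Per-card count of transactions in the trailing 10-minute window."""

    def __init__(self) -> None:
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, card_id: str, ts: float) -> int:
        """Record a transaction at ts and return the count within (ts - 600, ts]."""
        window = self._events[card_id]
        window.append(ts)
        # Evict timestamps that fell out of the window; assumes in-order arrival.
        while window and window[0] <= ts - WINDOW_SECONDS:
            window.popleft()
        return len(window)


counter = CardTxnCounter()
counts = [counter.record("card-1", t) for t in (0, 100, 500, 650, 1300)]
print(counts)  # [1, 2, 3, 3, 1]
```

In a real pipeline the returned count would be written to the online store on each event, so that serving reads the fresh value.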

Pitfalls:
- watermarks and late events in streaming aggregation (see the separate guide)
- it is hard to guarantee that historical training values exactly match streaming values
- streaming infrastructure cost

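Common pitfalls

- Feature logic written directly inside the training script: it gets rewritten for serving → skew. Use a separate function or a feature store from the start.
- No point-in-time join: future data leaks in → offline accuracy looks good but production fails.
- Stale data in the online store: the sync interval is too long → decisions are made on old feature values.
- Forcing a "company-wide feature store": over-engineering for a small team. Wait until 5+ models / 100+ features.
- Unclear feature ownership: who defines it, who maintains it, who may change it. Define governance.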
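
One way to guard against the stale-data pitfall is to store a write timestamp with each online value and check its age at read time. The sketch below assumes a one-hour freshness budget and a fall-back-to-default policy; both are illustrative choices, and `read_fresh` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

MAX_AGE = timedelta(hours=1)  # assumed freshness budget for this feature


def read_fresh(
    entry: Optional[Tuple[datetime, int]], now: datetime, default: int = 0
) -> Tuple[int, bool]:
    """Return (value, is_fresh); use the default when the entry is missing or too old."""
    if entry is None:
        return default, False
    written_at, value = entry
    if now - written_at > MAX_AGE:
        return default, False
    return value, True


now = datetime(2026, 5, 27, 12, 0, tzinfo=timezone.utc)
fresh = read_fresh((now - timedelta(minutes=10), 5), now)
stale = read_fresh((now - timedelta(hours=6), 5), now)
print(fresh, stale)  # (5, True) (0, False)
```

Whether to fall back to a default, skip the prediction, or only emit a metric depends on the model; returning the `is_fresh` flag at least makes staleness observable.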

Wrap-up

The essence of a feature store: the same feature computation for training and serving, point-in-time joins to prevent leakage, and separate online/offline storage. It is a systematic fix for the biggest trap in production ML (skew).

In practice: for a small ML project (1-2 models), plain shared functions are fine. At 5+ models or 100+ features, consider Feast (OSS) or a managed cloud option (Vertex / SageMaker). If streaming features are central, consider Tecton / Hopsworks.