dimensional modeling 은 어떻게 동작할까?

Ralph Kimball 의 1996 책 "The Data Warehouse Toolkit" 의 framework. OLTP 의 3NF normalized schema 와 다른 path — fact + dimension 의 star schema. 2010 년대 NoSQL / big data 트렌드로 "old-fashioned" 취급 받다가 2020+ dbt + ELT 로 다시 mainstream. 이 가이드는 dimensional modeling 의 핵심을 정리한다.

OLTP 3NF vs Dimensional

OLTP (transactional, 3NF):
- 중복 최소화 (normalization)
- 많은 table + 많은 JOIN
- INSERT / UPDATE 빠름
- 예: e-commerce 의 orders / users / products / categories table 분리

Dimensional (analytics):
- 의도적 denormalization
- 작은 수의 큰 table (fact + dimension)
- read 빠름 (JOIN 최소)
- 예: fact_orders 가 product_name, category_name 등 dimension column 포함

→ OLTP 와 analytics 의 목적 차이.
   transactional 에서 좋은 schema = analytics 에서 나쁨 (수십 table JOIN).
   → 별도 dimensional model 로 ETL.

Star Schema — 중심 fact + 주변 dimension

         dim_date
            |
dim_product — fact_sales — dim_customer
            |
         dim_store

fact_sales (중심, 큰 table — 수억 row):
- date_id (FK)
- product_id (FK)
- customer_id (FK)
- store_id (FK)
- quantity
- revenue
- discount

dim_date (작은 table):
- date_id (PK)
- date, year, month, week, day_of_week, is_holiday, ...

dim_product:
- product_id (PK)
- name, category, brand, price_tier, color, size, ...

dim_customer:
- customer_id (PK)
- name, age_bracket, gender, region, lifetime_value, ...

query:
  SELECT dim_product.category, SUM(fact_sales.revenue)
  FROM fact_sales
  JOIN dim_product USING (product_id)
  WHERE dim_date.year = 2026
  GROUP BY dim_product.category

→ 단순 JOIN 4 개. BI 도구가 자동 생성하기 쉬움.

Fact vs Dimension — 구분

Fact:
- 측정값 (numeric metric)
- 시간에 따라 누적 (transactional event)
- 큰 table (수억~수십억 row)
- 예: sales, page_view, click, order, payment

Dimension:
- context (categorical, descriptive)
- 비교적 안정 (천천히 변경)
- 작은 table (수만~수백만 row)
- 예: customer, product, store, date, region

→ "WHO did WHAT, WHEN, WHERE, HOW MUCH"
  - WHO = dim_customer
  - WHAT = dim_product
  - WHEN = dim_date
  - WHERE = dim_store
  - HOW MUCH = fact_sales.revenue (measure)

Grain — 가장 중요한 결정

Grain = fact table 의 한 row 가 표현하는 것.

예 1 — Transaction grain (가장 fine):
  한 row = 한 sale transaction
  fact_sales (transaction_id, product_id, qty, revenue, ts)
  → 가장 유연 (모든 aggregate 가능)

예 2 — Daily grain:
  한 row = product × store × date 별 일 합계
  fact_daily_sales (date, product_id, store_id, total_qty, total_revenue)
  → 빠른 query, but 시간대별 분석 불가

예 3 — Monthly grain:
  더 거친 — 월별 product × store
  → 매우 빠르지만 일별 분석 불가

규칙:
- 가능한 fine grain (transaction) 으로 시작
- 성능 요구 시 aggregate table 추가 (rollup)
- grain 의 명시는 모든 의사결정의 기반 → 문서화 필수.

Slowly Changing Dimensions (SCD)

dimension 의 값이 시간에 따라 변하면?
예: customer 의 city 가 Seoul → Busan 으로 이사.

SCD Type 1 (overwrite):
  dim_customer.city = "Busan"  ← 옛 값 덮어쓰기
  → 과거 sale 도 새 city 와 연관
  → 단순, 그러나 history 잃음

SCD Type 2 (new row, history 보존):
  customer_id | name  | city  | valid_from | valid_to   | is_current
  1           | Alice | Seoul | 2024-01-01 | 2026-05-01 | false
  1           | Alice | Busan | 2026-05-01 | 9999-12-31 | true

  fact_sales 의 customer_id 가 SCD type 2 의 surrogate key 가리킴
  → 옛 sale 은 Seoul, 새 sale 은 Busan 으로 자동
  → 가장 자주 사용

SCD Type 3 (column 추가):
  customer | name | current_city | previous_city
  → 마지막 1 회 변경만 추적. 흔하지 않음.

SCD Type 4 (history 별도 table):
  dim_customer (current 만)
  dim_customer_history (모든 변경 row)

SCD Type 6 (combined 1+2+3):
  현재값 column + 과거 row + 둘 다.

→ Type 2 가 modern 추세. dbt 의 snapshot 기능이 자동 type 2.

Snowflake Schema (다른 것)

Star 의 dimension 을 다시 normalize:

  fact_sales
     |
  dim_product → dim_brand → dim_company
     |       → dim_category → dim_dept

장점:
- storage 절약 (brand 정보 한 번만)
- update 일관성

단점:
- JOIN 폭증 (각 dimension 의 sub-table 까지)
- query 작성 복잡

→ modern 의 storage 저렴함 + columnar 압축 → star schema 가 더 일반적.
   snowflake 는 dimension 이 매우 커서 정말 중복 비싼 경우만.

dbt 가 부활시킨 이유

2010s 의 "NoSQL / data lake" 트렌드 동안 dimensional modeling 은
"old-fashioned" 분위기.

2020+ 의 부활:
1. ELT + 클라우드 warehouse → dimensional model 의 SQL 표현이 자연
2. dbt 가 transformation 의 standard → model + lineage + test
3. BI 도구 (Looker, Tableau) 가 star schema 친화
4. data quality / governance 의 importance ↑ → 명시적 dimension

modern stack:
  raw (Fivetran 적재) → staging (dbt: clean) → intermediate (dbt: join)
  → fact_X / dim_Y (dbt: dimensional) → mart (dbt: BI-ready)

→ dbt 의 model 이름 자체가 fct_orders, dim_customers 컨벤션.

흔한 함정

grain 명시 안 함 — 결국 어떤 aggregate 가능한지 불명확. 문서화 + 이름에 grain 박기.
fact 에 너무 많은 column — measure 만 두고 context 는 dimension 으로. row size 작게.
dimension 너무 좁게 — date 만 박지 말고 year, month, week, quarter, is_holiday, fiscal_year 등 풍부 하게. analyst 가 매번 derive 하지 않게.
SCD 무시 — Type 1 (overwrite) 만 사용 → historical analysis 못 함. Type 2 가 default 권장.
OLTP table 그대로 dimensional 로 — 3NF 의 깊은 JOIN 이 analytics 에 부담. denormalize 필요.

마무리

Dimensional modeling 의 본질 — fact (측정) + dimension (context) 의 star schema. grain 명시 + Type 2 SCD 가 핵심 결정. analytics query 의 단순함 + BI 도구 친화.

실용 — modern stack 에서: raw (Fivetran) → dbt staging → dbt marts (fact + dim). Kimball 책의 14 가지 패턴이 여전히 유효. snowflake 는 거의 안 씀 (modern columnar storage 의 효율 덕).