Skip to content
yutils

How Dimensional Modeling Actually Works

Kimball star schema, fact vs dimension tables, grain (the most important decision), slowly changing dimensions (SCD types 1-6), snowflake vs star, and why dbt + ELT made dimensional modeling popular again in 2020+.

~9 min read

The framework from Ralph Kimball's 1996 book "The Data Warehouse Toolkit". A different path from OLTP's 3NF — fact + dimension in a star schema. Dismissed as "old-fashioned" during the 2010s NoSQL / big-data wave, but back in vogue post-2020 with dbt + ELT. This guide covers the core ideas.

OLTP 3NF vs Dimensional

OLTP (transactional, 3NF):
- Minimal redundancy (normalized)
- Many tables + many JOINs
- Fast INSERT / UPDATE
- Example: e-commerce split into orders / users / products / categories

Dimensional (analytics):
- Deliberate denormalization
- Fewer, larger tables (fact + dimension)
- Fast reads (minimal JOINs)
- Example: fact_orders carries product_name, category_name, etc.

→ OLTP and analytics serve different purposes.
   What's good for transactions is bad for analytics (dozens of JOINs).
   → ETL into a separate dimensional model.

Star Schema — Central Fact + Surrounding Dimensions

         dim_date
            |
dim_product — fact_sales — dim_customer
            |
         dim_store

fact_sales (center, large — hundreds of millions of rows):
- date_id (FK)
- product_id (FK)
- customer_id (FK)
- store_id (FK)
- quantity
- revenue
- discount

dim_date (small):
- date_id (PK)
- date, year, month, week, day_of_week, is_holiday, ...

dim_product:
- product_id (PK)
- name, category, brand, price_tier, color, size, ...

dim_customer:
- customer_id (PK)
- name, age_bracket, gender, region, lifetime_value, ...

Query:
  SELECT dim_product.category, SUM(fact_sales.revenue)
  FROM fact_sales
  JOIN dim_product USING (product_id)
  WHERE dim_date.year = 2026
  GROUP BY dim_product.category

→ 4 simple JOINs. Easy for BI tools to generate automatically.

Fact vs Dimension — Distinction

Fact:
- Measurements (numeric metrics)
- Accumulate over time (transactional events)
- Big tables (hundreds of millions to billions of rows)
- Examples: sales, page_view, click, order, payment

Dimension:
- Context (categorical, descriptive)
- Relatively stable (changes slowly)
- Small tables (tens of thousands to millions)
- Examples: customer, product, store, date, region

→ "WHO did WHAT, WHEN, WHERE, HOW MUCH"
  - WHO = dim_customer
  - WHAT = dim_product
  - WHEN = dim_date
  - WHERE = dim_store
  - HOW MUCH = fact_sales.revenue (measure)

Grain — The Most Important Decision

Grain = what one row of the fact table represents.

Example 1 — Transaction grain (finest):
  One row = one sale transaction
  fact_sales (transaction_id, product_id, qty, revenue, ts)
  → Most flexible (all aggregations possible)

Example 2 — Daily grain:
  One row = daily total per product × store × date
  fact_daily_sales (date, product_id, store_id, total_qty, total_revenue)
  → Fast queries, but no intra-day analysis

Example 3 — Monthly grain:
  Coarser — monthly per product × store
  → Very fast, but no per-day analysis

Rule:
- Start at the finest grain possible (transaction)
- Add rollup tables for performance needs
- Document the grain — it underpins every modeling decision.

Slowly Changing Dimensions (SCD)

When dimension values change over time?
Example: a customer's city changes Seoul → Busan.

SCD Type 1 (overwrite):
  dim_customer.city = "Busan"  ← overwrite old value
  → Past sales now associate with the new city
  → Simple, but loses history

SCD Type 2 (new row, keep history):
  customer_id | name  | city  | valid_from | valid_to   | is_current
  1           | Alice | Seoul | 2024-01-01 | 2026-05-01 | false
  1           | Alice | Busan | 2026-05-01 | 9999-12-31 | true

  fact_sales' customer_id points to the SCD Type 2 surrogate key
  → Old sales remain Seoul, new ones become Busan automatically
  → Most commonly used

SCD Type 3 (extra column):
  customer | name | current_city | previous_city
  → Only the last change tracked. Uncommon.

SCD Type 4 (history in a separate table):
  dim_customer (current only)
  dim_customer_history (every changed row)

SCD Type 6 (combined 1+2+3):
  Current-value column + historical rows + both.

→ Type 2 is the modern norm. dbt's snapshot feature automates Type 2.

Snowflake Schema (a Different Thing)

Re-normalize the star's dimensions:

  fact_sales
     |
  dim_product → dim_brand → dim_company
     |       → dim_category → dim_dept

Pros:
- Save storage (brand info stored once)
- Update consistency

Cons:
- More JOINs (incl. each dimension's sub-tables)
- Complex query authoring

→ Modern cheap storage + columnar compression → star is generally preferred.
   Use snowflake only when dimensions are truly huge and duplication is costly.

Why dbt Revived Dimensional Modeling

During the 2010s NoSQL / data-lake wave, dimensional modeling felt
"old-fashioned".

Post-2020 revival:
1. ELT + cloud warehouses → dimensional models naturally fit SQL
2. dbt became the transformation standard — models + lineage + tests
3. BI tools (Looker, Tableau) love star schemas
4. Rising importance of data quality / governance → explicit dimensions

Modern stack:
  raw (Fivetran loaded) → staging (dbt: clean) → intermediate (dbt: joins)
  → fact_X / dim_Y (dbt: dimensional) → marts (dbt: BI-ready)

→ dbt's own naming convention is fct_orders, dim_customers.

Common Pitfalls

  • Undeclared grain — unclear which aggregations are possible. Document + put grain in the name.
  • Too many columns in fact — keep measures only; push context to dimensions. Small row sizes.
  • Thin dimensions — don't store just a date — enrich with year, month, week, quarter, is_holiday, fiscal_year. Don't make analysts derive them.
  • Skipping SCD — Type 1 only (overwrite) → can't do historical analysis. Type 2 is the default recommend.
  • Lifting OLTP tables as-is — 3NF's deep JOINs hurt analytics. Denormalize.

Wrap-up

Dimensional modeling at its core — star schema of fact (measurement) + dimension (context). Grain declaration + SCD Type 2 are the key decisions. Simple analytics queries + BI-friendly.

Practical — in the modern stack: raw (Fivetran) → dbt staging → dbt marts (fact + dim). Kimball's 14 patterns still hold. Snowflake schemas are rare today (thanks to columnar storage's efficiency).

Back to guides