The framework from Ralph Kimball's 1996 book "The Data Warehouse Toolkit". A different path from OLTP's 3NF — fact + dimension in a star schema. Dismissed as "old-fashioned" during the 2010s NoSQL / big-data wave, but back in vogue post-2020 with dbt + ELT. This guide covers the core ideas.
OLTP 3NF vs Dimensional
OLTP (transactional, 3NF):
- Minimal redundancy (normalized)
- Many tables + many JOINs
- Fast INSERT / UPDATE
- Example: e-commerce split into orders / users / products / categories
Dimensional (analytics):
- Deliberate denormalization
- Fewer, larger tables (fact + dimension)
- Fast reads (minimal JOINs)
- Example: fact_orders carries product_name, category_name, etc.
→ OLTP and analytics serve different purposes.
What's good for transactions is bad for analytics (dozens of JOINs).
→ ETL into a separate dimensional model.Star Schema — Central Fact + Surrounding Dimensions
dim_date
|
dim_product — fact_sales — dim_customer
|
dim_store
fact_sales (center, large — hundreds of millions of rows):
- date_id (FK)
- product_id (FK)
- customer_id (FK)
- store_id (FK)
- quantity
- revenue
- discount
dim_date (small):
- date_id (PK)
- date, year, month, week, day_of_week, is_holiday, ...
dim_product:
- product_id (PK)
- name, category, brand, price_tier, color, size, ...
dim_customer:
- customer_id (PK)
- name, age_bracket, gender, region, lifetime_value, ...
Query:
SELECT dim_product.category, SUM(fact_sales.revenue)
FROM fact_sales
JOIN dim_product USING (product_id)
WHERE dim_date.year = 2026
GROUP BY dim_product.category
→ 4 simple JOINs. Easy for BI tools to generate automatically.Fact vs Dimension — Distinction
Fact:
- Measurements (numeric metrics)
- Accumulate over time (transactional events)
- Big tables (hundreds of millions to billions of rows)
- Examples: sales, page_view, click, order, payment
Dimension:
- Context (categorical, descriptive)
- Relatively stable (changes slowly)
- Small tables (tens of thousands to millions)
- Examples: customer, product, store, date, region
→ "WHO did WHAT, WHEN, WHERE, HOW MUCH"
- WHO = dim_customer
- WHAT = dim_product
- WHEN = dim_date
- WHERE = dim_store
- HOW MUCH = fact_sales.revenue (measure)Grain — The Most Important Decision
Grain = what one row of the fact table represents.
Example 1 — Transaction grain (finest):
One row = one sale transaction
fact_sales (transaction_id, product_id, qty, revenue, ts)
→ Most flexible (all aggregations possible)
Example 2 — Daily grain:
One row = daily total per product × store × date
fact_daily_sales (date, product_id, store_id, total_qty, total_revenue)
→ Fast queries, but no intra-day analysis
Example 3 — Monthly grain:
Coarser — monthly per product × store
→ Very fast, but no per-day analysis
Rule:
- Start at the finest grain possible (transaction)
- Add rollup tables for performance needs
- Document the grain — it underpins every modeling decision.Slowly Changing Dimensions (SCD)
When dimension values change over time?
Example: a customer's city changes Seoul → Busan.
SCD Type 1 (overwrite):
dim_customer.city = "Busan" ← overwrite old value
→ Past sales now associate with the new city
→ Simple, but loses history
SCD Type 2 (new row, keep history):
customer_id | name | city | valid_from | valid_to | is_current
1 | Alice | Seoul | 2024-01-01 | 2026-05-01 | false
1 | Alice | Busan | 2026-05-01 | 9999-12-31 | true
fact_sales' customer_id points to the SCD Type 2 surrogate key
→ Old sales remain Seoul, new ones become Busan automatically
→ Most commonly used
SCD Type 3 (extra column):
customer | name | current_city | previous_city
→ Only the last change tracked. Uncommon.
SCD Type 4 (history in a separate table):
dim_customer (current only)
dim_customer_history (every changed row)
SCD Type 6 (combined 1+2+3):
Current-value column + historical rows + both.
→ Type 2 is the modern norm. dbt's snapshot feature automates Type 2.Snowflake Schema (a Different Thing)
Re-normalize the star's dimensions:
fact_sales
|
dim_product → dim_brand → dim_company
| → dim_category → dim_dept
Pros:
- Save storage (brand info stored once)
- Update consistency
Cons:
- More JOINs (incl. each dimension's sub-tables)
- Complex query authoring
→ Modern cheap storage + columnar compression → star is generally preferred.
Use snowflake only when dimensions are truly huge and duplication is costly.Why dbt Revived Dimensional Modeling
During the 2010s NoSQL / data-lake wave, dimensional modeling felt
"old-fashioned".
Post-2020 revival:
1. ELT + cloud warehouses → dimensional models naturally fit SQL
2. dbt became the transformation standard — models + lineage + tests
3. BI tools (Looker, Tableau) love star schemas
4. Rising importance of data quality / governance → explicit dimensions
Modern stack:
raw (Fivetran loaded) → staging (dbt: clean) → intermediate (dbt: joins)
→ fact_X / dim_Y (dbt: dimensional) → marts (dbt: BI-ready)
→ dbt's own naming convention is fct_orders, dim_customers.Common Pitfalls
- Undeclared grain — unclear which aggregations are possible. Document + put grain in the name.
- Too many columns in fact — keep measures only; push context to dimensions. Small row sizes.
- Thin dimensions — don't store just a date — enrich with year, month, week, quarter, is_holiday, fiscal_year. Don't make analysts derive them.
- Skipping SCD — Type 1 only (overwrite) → can't do historical analysis. Type 2 is the default recommend.
- Lifting OLTP tables as-is — 3NF's deep JOINs hurt analytics. Denormalize.
Wrap-up
Dimensional modeling at its core — star schema of fact (measurement) + dimension (context). Grain declaration + SCD Type 2 are the key decisions. Simple analytics queries + BI-friendly.
Practical — in the modern stack: raw (Fivetran) → dbt staging → dbt marts (fact + dim). Kimball's 14 patterns still hold. Snowflake schemas are rare today (thanks to columnar storage's efficiency).