"ETL" — the data-warehouse vocabulary since the 1980s. Around 2015, cloud warehouses (BigQuery / Snowflake / Redshift) hit mainstream and the three letters flipped order → "ELT". Not just letter shuffling: a paradigm shift in compute vs storage cost. This guide covers the precise difference and when ETL still wins.
ETL — Traditional (1980s–2010s)
Extract → Transform → Load
1. Extract: pull from source systems (OLTP DB, SaaS APIs, CSVs)
2. Transform: a separate ETL server (Informatica, Talend) processes
- clean / join / aggregate / type convert / business logic
3. Load: write final shape to a warehouse (Teradata, Oracle DW)
Characteristics:
- Transform happens outside the warehouse (on expensive ETL servers)
- Only the final shape lives in the warehouse (storage was costly)
- Schema-on-write — schema decided before load
- Mostly batch (nightly jobs)
Problems:
- If transform logic changes → re-extract (you only stored final)
- ETL-server bottleneck (single-server capacity)
- "I needed that column too, but transform dropped it" → rebuild
- Vendor lock-in to ETL tools (Informatica etc.)ELT — Modern (2015+)
Extract → Load → Transform
1. Extract: pull raw data from sources
2. Load: write nearly raw into the warehouse (raw / staging schema)
3. Transform: process inside the warehouse via SQL
(BigQuery / Snowflake / Redshift)
Characteristics:
- Transform happens inside the warehouse (cloud elastic compute)
- Raw data preserved — different transforms later are possible
- Schema-on-read friendly (raw can be loose)
- Streaming / micro-batch possible
What made this possible:
- Cloud warehouse storage is cheap (~$23 / TB / month on S3)
- Compute is decoupled from storage and scales separately
- Warehouse SQL is powerful enough to replace ETL tools
- ELT-native tooling (dbt, Dataform) ecosystemWhy ELT Took Over
Old assumptions:
- Storage is expensive → only keep processed data
- Compute is expensive → do everything on the ETL server
- Warehouse compute = SQL queries only (no heavy transforms)
Cloud changes:
- Storage extremely cheap (especially compressed Parquet)
- Compute elastic (huge transform jobs finish in 5 min)
- Warehouse SQL expressiveness ↑ (window functions, JSON, UDFs)
- Compute / storage separated → independent scaling
→ "Load everything raw, transform later with SQL" is cost-effective + flexible.
dbt (data build tool, 2016+):
- SQL + Jinja templates + dependency graph
- A "transform-only" tool — Extract/Load is Fivetran/Airbyte's job
- Standard in the modern data stackSchema-on-Write vs Schema-on-Read
Schema-on-Write (ETL era):
- Schema fixed before load
- Type mismatch → reject or transform
- Pro: guaranteed data quality + fast queries
- Con: schema change needs backfills; new fields hard to add
Schema-on-Read (data lake / ELT):
- Store raw (JSON, CSV, Parquet)
- Interpret schema at query time
- Pro: flexibility ↑, new fields usable immediately
- Con: queries can be slower; no data-quality net
→ ELT often pairs with schema-on-read.
But dbt models declare schemas + tests → some of both worlds.Per-Stage Cost
| Stage | ETL | ELT |
|---|---|---|
| Extract | Same (source load) | Same |
| Transform | ETL server compute (heavy annual cost) | Warehouse compute (usage-based, cheaper) |
| Load | Final only (storage ↓) | Raw + final (storage ↑ but cheap $) |
| Re-transform | Re-extract (expensive) | Re-run SQL in warehouse (cheap) |
When ETL Still Wins
- Compliance / PII — keeping sensitive data raw in the warehouse is risky. Mask at extract time.
- Source data can't reach the warehouse — on-prem / air-gapped / "no external SaaS warehouse" policies.
- Real-time streaming — Kafka Streams / Flink transform inside the stream. A different path from ELT batch.
- Existing ETL ecosystem — enterprises with Informatica / Talend + ops staff already in place.
- Raw data too large — selective extract required — loading everything blows up warehouse cost. ETL pre-selects columns / rows.
Modern Data Stack
Standard stack (2020+):
Source (Postgres, MongoDB, Salesforce, Stripe, ...)
↓
Extract + Load (Fivetran, Airbyte, Stitch) ← auto schemas + incremental
↓
Warehouse (Snowflake / BigQuery / Redshift / Databricks)
↓
Transform (dbt) ← SQL + lineage + tests + docs
↓
Reverse ETL (Hightouch, Census) — warehouse → operational tools
BI (Looker, Tableau, Metabase) — analysts / business
ML (Snowpark, BigQuery ML) — train near the data
→ "ELT" is the center of the stack. ETL/ELT tooling is commoditized.Common Pitfalls
- Loading every source as-is — warehouse cost explodes for huge raw. Selective extract.
- All logic in dbt — fine for reporting; for complex ML features use Spark or dbt-fal.
- Schema-on-read swamp — unlimited raw + no stewardship → "data swamp". Metadata / catalog required.
- Ignoring ELT cost — warehouse compute is usage-priced. Monitor monthly transform-job costs.
- Raw PII landed unprotected — GDPR risk. Mask at extract or separate audit boundary.
Wrap-up
ETL → ELT isn't a letter swap — it's the new approach enabled by cloud's cost paradigm (cheap storage + elastic compute). The modern data stack's standard.
Practical — new projects default: Fivetran/Airbyte (Extract+Load) + Snowflake/BigQuery (Warehouse) + dbt (Transform). Keep ETL only for PII / compliance edges in large enterprises.