Skip to content
yutils

How ETL and ELT Actually Differ

Extract-Transform-Load (the legacy) vs Extract-Load-Transform (the modern default), why cloud warehouses (BigQuery / Snowflake / Redshift) flipped the order, the cost shift from compute to storage, schema-on-write vs schema-on-read, and when ETL still wins.

~9 min read

"ETL" — the data-warehouse vocabulary since the 1980s. Around 2015, cloud warehouses (BigQuery / Snowflake / Redshift) hit mainstream and the three letters flipped order → "ELT". Not just letter shuffling: a paradigm shift in compute vs storage cost. This guide covers the precise difference and when ETL still wins.

ETL — Traditional (1980s–2010s)

Extract → Transform → Load

1. Extract: pull from source systems (OLTP DB, SaaS APIs, CSVs)
2. Transform: a separate ETL server (Informatica, Talend) processes
   - clean / join / aggregate / type convert / business logic
3. Load: write final shape to a warehouse (Teradata, Oracle DW)

Characteristics:
- Transform happens outside the warehouse (on expensive ETL servers)
- Only the final shape lives in the warehouse (storage was costly)
- Schema-on-write — schema decided before load
- Mostly batch (nightly jobs)

Problems:
- If transform logic changes → re-extract (you only stored final)
- ETL-server bottleneck (single-server capacity)
- "I needed that column too, but transform dropped it" → rebuild
- Vendor lock-in to ETL tools (Informatica etc.)

ELT — Modern (2015+)

Extract → Load → Transform

1. Extract: pull raw data from sources
2. Load: write nearly raw into the warehouse (raw / staging schema)
3. Transform: process inside the warehouse via SQL
   (BigQuery / Snowflake / Redshift)

Characteristics:
- Transform happens inside the warehouse (cloud elastic compute)
- Raw data preserved — different transforms later are possible
- Schema-on-read friendly (raw can be loose)
- Streaming / micro-batch possible

What made this possible:
- Cloud warehouse storage is cheap (~$23 / TB / month on S3)
- Compute is decoupled from storage and scales separately
- Warehouse SQL is powerful enough to replace ETL tools
- ELT-native tooling (dbt, Dataform) ecosystem

Why ELT Took Over

Old assumptions:
- Storage is expensive → only keep processed data
- Compute is expensive → do everything on the ETL server
- Warehouse compute = SQL queries only (no heavy transforms)

Cloud changes:
- Storage extremely cheap (especially compressed Parquet)
- Compute elastic (huge transform jobs finish in 5 min)
- Warehouse SQL expressiveness ↑ (window functions, JSON, UDFs)
- Compute / storage separated → independent scaling

→ "Load everything raw, transform later with SQL" is cost-effective + flexible.

dbt (data build tool, 2016+):
- SQL + Jinja templates + dependency graph
- A "transform-only" tool — Extract/Load is Fivetran/Airbyte's job
- Standard in the modern data stack

Schema-on-Write vs Schema-on-Read

Schema-on-Write (ETL era):
- Schema fixed before load
- Type mismatch → reject or transform
- Pro: guaranteed data quality + fast queries
- Con: schema change needs backfills; new fields hard to add

Schema-on-Read (data lake / ELT):
- Store raw (JSON, CSV, Parquet)
- Interpret schema at query time
- Pro: flexibility ↑, new fields usable immediately
- Con: queries can be slower; no data-quality net

→ ELT often pairs with schema-on-read.
   But dbt models declare schemas + tests → some of both worlds.

Per-Stage Cost

StageETLELT
ExtractSame (source load)Same
TransformETL server compute (heavy annual cost)Warehouse compute (usage-based, cheaper)
LoadFinal only (storage ↓)Raw + final (storage ↑ but cheap $)
Re-transformRe-extract (expensive)Re-run SQL in warehouse (cheap)

When ETL Still Wins

  • Compliance / PII — keeping sensitive data raw in the warehouse is risky. Mask at extract time.
  • Source data can't reach the warehouse — on-prem / air-gapped / "no external SaaS warehouse" policies.
  • Real-time streaming — Kafka Streams / Flink transform inside the stream. A different path from ELT batch.
  • Existing ETL ecosystem — enterprises with Informatica / Talend + ops staff already in place.
  • Raw data too large — selective extract required — loading everything blows up warehouse cost. ETL pre-selects columns / rows.

Modern Data Stack

Standard stack (2020+):

  Source (Postgres, MongoDB, Salesforce, Stripe, ...)
       ↓
  Extract + Load (Fivetran, Airbyte, Stitch)  ← auto schemas + incremental
       ↓
  Warehouse (Snowflake / BigQuery / Redshift / Databricks)
       ↓
  Transform (dbt)  ← SQL + lineage + tests + docs
       ↓
  Reverse ETL (Hightouch, Census) — warehouse → operational tools
  BI (Looker, Tableau, Metabase) — analysts / business
  ML (Snowpark, BigQuery ML) — train near the data

→ "ELT" is the center of the stack. ETL/ELT tooling is commoditized.

Common Pitfalls

  • Loading every source as-is — warehouse cost explodes for huge raw. Selective extract.
  • All logic in dbt — fine for reporting; for complex ML features use Spark or dbt-fal.
  • Schema-on-read swamp — unlimited raw + no stewardship → "data swamp". Metadata / catalog required.
  • Ignoring ELT cost — warehouse compute is usage-priced. Monitor monthly transform-job costs.
  • Raw PII landed unprotected — GDPR risk. Mask at extract or separate audit boundary.

Wrap-up

ETL → ELT isn't a letter swap — it's the new approach enabled by cloud's cost paradigm (cheap storage + elastic compute). The modern data stack's standard.

Practical — new projects default: Fivetran/Airbyte (Extract+Load) + Snowflake/BigQuery (Warehouse) + dbt (Transform). Keep ETL only for PII / compliance edges in large enterprises.

Back to guides