Evolution from Data Warehouse → Data Lake → Lakehouse. "Just a pile of Parquet files on S3" grows into a modern platform with ACID and analytics SQL — thanks to Iceberg / Delta / Hudi plus the maturation of Trino / DuckDB. This guide unpacks the layers.
Warehouse vs Lake vs Lakehouse
Data Warehouse (traditional):
- Structured data (tables, schemas)
- Proprietary storage (Teradata, Oracle DW)
- Schema-on-write
- Examples: Snowflake, BigQuery, Redshift
- Pros: fast SQL, ACID, BI-friendly
- Cons: hard with unstructured (JSON, logs, images), expensive
Data Lake (2010s):
- Everything raw (CSV, JSON, Parquet, images, video)
- Cheap object storage (S3, GCS, ADLS)
- Schema-on-read
- Pros: very cheap, flexible
- Cons: no ACID, slow SQL, "data swamp" risk
Lakehouse (2020+):
- Lake's cheap storage + warehouse's ACID + SQL
- Key = open table formats (Iceberg / Delta / Hudi)
- Examples: Databricks, AWS Athena + Iceberg
- "Best of both" — emerging modern standardS3 + Parquet — The Lake Basics
Storage layer:
- S3 / GCS / Azure Blob — object storage, $20-25 / TB / month
- Near-infinite scale, durable (11-9s)
File formats:
- CSV — simple, human-readable. Inefficient for big data.
- JSON — flexible. Expensive parsing, poor compression.
- Avro — schema-embedded + row-oriented. Common for streaming.
- Parquet — columnar + heavy compression. Default for analytics.
- ORC — Parquet-like, dominant in Hive land.
Parquet's columnar advantage:
user_id | name | email | created_at
--------|--------|-----------|------------
1 | Alice | a@x.com | 2026-01-01
2 | Bob | b@y.com | 2026-01-02
...
→ Each column in its own chunk with column-specific compression
→ "SELECT name FROM users" reads only the name column, skips others
→ 10-100× faster than row-oriented (CSV)
→ Parquet is the de-facto modern lake standard.Open Table Formats — The Lakehouse Key
Plain Parquet problems:
- No ACID — concurrent writes can corrupt
- DELETE / UPDATE hard (Parquet is immutable)
- No snapshots / time travel
- File-count explosion (millions of small files)
Open-table-format answer:
- A metadata layer tracks "which Parquet files are the current table"
- Per-snapshot metadata → time travel
- ACID transactions
- Compaction (combine small files into big files)
- Schema evolution
Three main formats:
Iceberg (Netflix, 2017):
- Apache standard, vendor-neutral
- Snapshot + manifest + Avro metadata
- Supported by nearly all engines (Trino, Spark, Flink, Snowflake, BigQuery)
Delta Lake (Databricks, 2017):
- Driven by Databricks, OSS
- _delta_log/ directory with JSON metadata
- Best with Spark / Databricks
Hudi (Uber, 2017):
- Strong for streaming workloads (upsert)
- Copy-on-Write + Merge-on-Read modes
→ Iceberg leads adoption thanks to vendor neutrality.
Delta on Databricks, Hudi for streaming-heavy.Schema-on-Read
Lake's defining trait — no schema enforced at write.
Storage:
s3://lake/events/year=2026/month=05/day=27/file_001.parquet
s3://lake/logs/app=api/...
s3://lake/raw/api_response_dump.json
s3://lake/users/users.csv
Declare schema at read time:
CREATE EXTERNAL TABLE events (
user_id INT, event_type STRING, ts TIMESTAMP
)
LOCATION 's3://lake/events/'
STORED AS PARQUET;
Pros:
- Land it first — apply any schema later
- New columns become usable automatically
- Source changes don't disturb the lake
Cons:
- Garbage accumulation — endless raw data without schema → swamp
- Queries can be slower (convert at parse time)
- Data lineage / quality not guaranteed
→ Lakehouses (open table formats) recover some schema-on-write benefits.Partitioning — Query Acceleration
Directory structure like s3://lake/events/year=2026/month=05/day=27/.
Query:
SELECT * FROM events
WHERE event_date BETWEEN '2026-05-01' AND '2026-05-31'
→ Engine scans only month=05 partitions, ignores other months
→ Scan cost reduced 30× (year → month)
Traps:
- Too fine-grained (e.g. per hour) → millions of tiny files
- Too coarse → no pruning benefit
- Typical: day-grain on a date column
Modern (open table formats):
- Partitioning info lives in metadata (no S3 prefix dependency)
- Hidden partitioning (Iceberg) — partition transforms are automaticQuery Engines
| Engine | Characteristics |
|---|---|
| Trino (Presto) | Distributed, JOIN across multiple sources (lake + DB + Kafka), standard SQL |
| Spark SQL | Databricks foundation, ETL + ML unified |
| DuckDB | Single-node embedded, very fast on small datasets |
| Athena (AWS) | Managed Trino, queries S3 directly, pay-per-query |
| BigQuery | Warehouse + lake (external tables on GCS) |
| ClickHouse | OLAP DB, S3 external tables, sub-second queries |
Common Pitfalls
- "Data lake = files on S3" — without metadata / catalog, you get a "data swamp". Use AWS Glue / Hive Metastore / Iceberg catalog.
- Small-file problem — streaming writes a tiny Parquet every 5 min → millions. Compaction required.
- Schema evolution breaking old files — column drop or type change clashes with old Parquet's schema. Follow the open-table-format evolution rules.
- No partitioning — full table scan every time. Big datasets must partition.
- Concurrent-write races — multiple processes on plain Parquet corrupt manifests. Use open-table-format ACID.
Wrap-up
The lake's evolution — S3 + Parquet's cheapness + Iceberg / Delta's ACID = lakehouse. Rising cost of proprietary stacks (Snowflake / BigQuery) accelerates lakehouse adoption.
Practical — modern start: S3 + Parquet + Iceberg + Trino. On Databricks → Delta + Spark. If analytics queries only, Snowflake / BigQuery still offer convenience — a real trade-off (convenience vs cost).