How Data Lakes Actually Work

Evolution from Data Warehouse → Data Lake → Lakehouse. "Just a pile of Parquet files on S3" grows into a modern platform with ACID and analytics SQL — thanks to Iceberg / Delta / Hudi plus the maturation of Trino / DuckDB. This guide unpacks the layers.

Warehouse vs Lake vs Lakehouse

Data Warehouse (traditional):
- Structured data (tables, schemas)
- Proprietary storage (Teradata, Oracle DW)
- Schema-on-write
- Examples: Snowflake, BigQuery, Redshift
- Pros: fast SQL, ACID, BI-friendly
- Cons: hard with unstructured (JSON, logs, images), expensive

Data Lake (2010s):
- Everything raw (CSV, JSON, Parquet, images, video)
- Cheap object storage (S3, GCS, ADLS)
- Schema-on-read
- Pros: very cheap, flexible
- Cons: no ACID, slow SQL, "data swamp" risk

Lakehouse (2020+):
- Lake's cheap storage + warehouse's ACID + SQL
- Key = open table formats (Iceberg / Delta / Hudi)
- Examples: Databricks, AWS Athena + Iceberg
- "Best of both" — emerging modern standard

S3 + Parquet — The Lake Basics

Storage layer:
- S3 / GCS / Azure Blob — object storage, $20-25 / TB / month
- Near-infinite scale, durable (11-9s)

File formats:
- CSV — simple, human-readable. Inefficient for big data.
- JSON — flexible. Expensive parsing, poor compression.
- Avro — schema-embedded + row-oriented. Common for streaming.
- Parquet — columnar + heavy compression. Default for analytics.
- ORC — Parquet-like, dominant in Hive land.

Parquet's columnar advantage:
  user_id | name   | email     | created_at
  --------|--------|-----------|------------
  1       | Alice  | a@x.com   | 2026-01-01
  2       | Bob    | b@y.com   | 2026-01-02
  ...

  → Each column in its own chunk with column-specific compression
  → "SELECT name FROM users" reads only the name column, skips others
  → 10-100× faster than row-oriented (CSV)

→ Parquet is the de-facto modern lake standard.

Open Table Formats — The Lakehouse Key

Plain Parquet problems:
- No ACID — concurrent writes can corrupt
- DELETE / UPDATE hard (Parquet is immutable)
- No snapshots / time travel
- File-count explosion (millions of small files)

Open-table-format answer:
- A metadata layer tracks "which Parquet files are the current table"
- Per-snapshot metadata → time travel
- ACID transactions
- Compaction (combine small files into big files)
- Schema evolution

Three main formats:

Iceberg (Netflix, 2017):
- Apache standard, vendor-neutral
- Snapshot + manifest + Avro metadata
- Supported by nearly all engines (Trino, Spark, Flink, Snowflake, BigQuery)

Delta Lake (Databricks, 2017):
- Driven by Databricks, OSS
- _delta_log/ directory with JSON metadata
- Best with Spark / Databricks

Hudi (Uber, 2017):
- Strong for streaming workloads (upsert)
- Copy-on-Write + Merge-on-Read modes

→ Iceberg leads adoption thanks to vendor neutrality.
   Delta on Databricks, Hudi for streaming-heavy.

Schema-on-Read

Lake's defining trait — no schema enforced at write.

Storage:
  s3://lake/events/year=2026/month=05/day=27/file_001.parquet
  s3://lake/logs/app=api/...
  s3://lake/raw/api_response_dump.json
  s3://lake/users/users.csv

Declare schema at read time:
  CREATE EXTERNAL TABLE events (
    user_id INT, event_type STRING, ts TIMESTAMP
  )
  LOCATION 's3://lake/events/'
  STORED AS PARQUET;

Pros:
- Land it first — apply any schema later
- New columns become usable automatically
- Source changes don't disturb the lake

Cons:
- Garbage accumulation — endless raw data without schema → swamp
- Queries can be slower (convert at parse time)
- Data lineage / quality not guaranteed

→ Lakehouses (open table formats) recover some schema-on-write benefits.

Partitioning — Query Acceleration

Directory structure like s3://lake/events/year=2026/month=05/day=27/.

Query:
  SELECT * FROM events
  WHERE event_date BETWEEN '2026-05-01' AND '2026-05-31'

→ Engine scans only month=05 partitions, ignores other months
→ Scan cost reduced 30× (year → month)

Traps:
- Too fine-grained (e.g. per hour) → millions of tiny files
- Too coarse → no pruning benefit
- Typical: day-grain on a date column

Modern (open table formats):
- Partitioning info lives in metadata (no S3 prefix dependency)
- Hidden partitioning (Iceberg) — partition transforms are automatic

Query Engines

Engine	Characteristics
Trino (Presto)	Distributed, JOIN across multiple sources (lake + DB + Kafka), standard SQL
Spark SQL	Databricks foundation, ETL + ML unified
DuckDB	Single-node embedded, very fast on small datasets
Athena (AWS)	Managed Trino, queries S3 directly, pay-per-query
BigQuery	Warehouse + lake (external tables on GCS)
ClickHouse	OLAP DB, S3 external tables, sub-second queries

Common Pitfalls

"Data lake = files on S3" — without metadata / catalog, you get a "data swamp". Use AWS Glue / Hive Metastore / Iceberg catalog.
Small-file problem — streaming writes a tiny Parquet every 5 min → millions. Compaction required.
Schema evolution breaking old files — column drop or type change clashes with old Parquet's schema. Follow the open-table-format evolution rules.
No partitioning — full table scan every time. Big datasets must partition.
Concurrent-write races — multiple processes on plain Parquet corrupt manifests. Use open-table-format ACID.

Wrap-up

The lake's evolution — S3 + Parquet's cheapness + Iceberg / Delta's ACID = lakehouse. Rising cost of proprietary stacks (Snowflake / BigQuery) accelerates lakehouse adoption.

Practical — modern start: S3 + Parquet + Iceberg + Trino. On Databricks → Delta + Spark. If analytics queries only, Snowflake / BigQuery still offer convenience — a real trade-off (convenience vs cost).