Data Warehouse vs Data Lake Architecture
How do I design a data warehouse vs data lake architecture?
TL;DR
- Bottom line: Choose a data warehouse for structured BI/SQL workloads, a data lake for unstructured/ML workloads, or a lakehouse for both — most greenfield projects in 2024-2026 should default to lakehouse architecture.
- Key tool/command:
dbt build --select tag:gold(transform raw lake data into warehouse-grade models via medallion architecture) - Watch out for: Building a data lake without governance — within 12 months it becomes a "data swamp" that nobody trusts or uses.
- Works with: Snowflake, Databricks, BigQuery, AWS Redshift + S3, Azure Synapse, Apache Iceberg, Delta Lake, Apache Hudi, dbt, Apache Spark, Apache Flink.
Constraints
- Never store sensitive data (PII, PHI, PCI) in a raw data lake zone without encryption-at-rest and access controls — compliance violations carry legal penalties
- Data warehouse schema-on-write is mandatory for financial reporting and regulatory workloads that require auditability and ACID compliance
- Do not build a data lake without a governance layer (catalog, lineage, access policies) — ungoverned lakes become data swamps within 6-12 months
- Separate storage and compute in any new architecture — tightly coupled systems cannot scale cost-effectively
- Budget 30-40% of project effort for data quality and governance tooling — technology selection is only 20% of the problem
Quick Reference
| Component | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Schema approach | Schema-on-write (predefined) | Schema-on-read (flexible) | Schema-on-write + schema-on-read |
| Data types | Structured (SQL tables) | Structured, semi-structured, unstructured | All types with ACID transactions |
| Storage format | Proprietary columnar (internal) | Parquet, ORC, Avro, JSON, CSV | Open table formats (Iceberg, Delta, Hudi) |
| Storage cost | ~$23/TB/month (cloud DW) | ~$2-5/TB/month (object storage) | ~$2-5/TB/month (object storage) |
| Compute model | Integrated or separated | Separated (Spark, Presto, etc.) | Separated (multi-engine) |
| Query performance | Sub-second to seconds (optimized) | Seconds to minutes (scan-heavy) | Seconds (with caching/indexing) |
| ACID transactions | Full ACID | None (without table format) | Full ACID via table format |
| Governance | Built-in (roles, masking, audit) | Manual (must add catalog + ACLs) | Built-in (Unity Catalog, Polaris) |
| Primary users | Business analysts, BI developers | Data engineers, data scientists | All data roles |
| Best workload | BI dashboards, ad-hoc SQL, reporting | ML training, log analytics, raw storage | Mixed: BI + ML + streaming |
| Vendor examples | Snowflake, BigQuery, Redshift | S3 + Athena, ADLS + Synapse, GCS | Databricks Lakehouse, BigQuery Omni |
| Time-to-insight | Fast (pre-modeled data) | Slow (requires transformation) | Medium (medallion layers) |
| Data freshness | Batch (ELT) or micro-batch | Real-time streaming possible | Real-time + batch unified |
| Scaling strategy | Auto-scale compute (credits) | Scale storage infinitely, compute on-demand | Storage infinite, compute elastic |
Decision Tree
START
├── Is your primary workload BI/SQL reporting with structured data only?
│ ├── YES → Data Warehouse (Snowflake, BigQuery, Redshift)
│ │ Best for: dashboards, ad-hoc queries, regulatory reporting
│ └── NO ↓
├── Is your primary workload ML/data science with unstructured data?
│ ├── YES → Data Lake (S3/ADLS + Spark/Athena)
│ │ Best for: training pipelines, log analytics, raw archival
│ └── NO ↓
├── Do you need BOTH SQL analytics AND ML on the same data?
│ ├── YES ↓
│ │ ├── Data volume > 10 TB or > 5 data teams?
│ │ │ ├── YES → Data Lakehouse (Databricks, BigQuery with Iceberg)
│ │ │ │ Use medallion architecture (Bronze → Silver → Gold)
│ │ │ └── NO → Managed Warehouse + object storage export
│ └── NO ↓
├── Do you need real-time streaming analytics?
│ ├── YES → Data Lakehouse with streaming ingestion
│ │ (Delta Live Tables, Apache Flink + Iceberg)
│ └── NO ↓
└── DEFAULT → Start with a managed Data Warehouse
Migrate to Lakehouse when lake requirements emerge
Step-by-Step Guide
1. Audit data sources and classify workloads
Inventory all data sources and classify each by structure level (structured, semi-structured, unstructured), volume, velocity, and primary consumers. This classification drives the architecture decision. [src2]
# data_source_inventory.yaml
sources:
- name: transactional_db
type: structured
format: PostgreSQL
volume_gb_per_day: 5
velocity: batch_daily
consumers: [bi_analysts, finance]
sensitivity: pii
- name: clickstream_logs
type: semi_structured
format: JSON
volume_gb_per_day: 200
velocity: real_time
consumers: [data_scientists, product]
sensitivity: none
Verify: Count sources by type — if >60% structured + SQL consumers, lean warehouse. If >40% unstructured/ML, lean lake. Mixed = lakehouse.
2. Select storage layer and table format
Choose your storage tier based on workload classification. For lakehouse, select an open table format that provides ACID transactions on top of object storage. [src6]
-- Lakehouse: Delta Lake setup (Databricks)
-- Creates a managed Delta table with schema enforcement
CREATE TABLE IF NOT EXISTS bronze.raw_events (
event_id STRING,
event_type STRING,
payload STRING,
ingested_at TIMESTAMP
)
USING DELTA
PARTITIONED BY (date_trunc('day', ingested_at))
TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
);
Verify: DESCRIBE EXTENDED bronze.raw_events → should show Provider: delta, partitioning columns, and table properties.
3. Implement medallion architecture layers
Structure data flow through Bronze (raw), Silver (cleansed), and Gold (business-ready) layers. Each layer has a specific purpose and quality contract. [src4]
-- Silver layer: cleansed and conformed (dbt model)
{{ config(
materialized='incremental',
unique_key='event_id',
partition_by={'field': 'event_date', 'data_type': 'date'}
) }}
SELECT
event_id, event_type,
JSON_EXTRACT_SCALAR(payload, '$.user_id') AS user_id,
CAST(ingested_at AS DATE) AS event_date
FROM {{ source('bronze', 'raw_events') }}
WHERE event_type IS NOT NULL
{% if is_incremental() %}
AND ingested_at > (SELECT MAX(ingested_at) FROM {{ this }})
{% endif %}
Verify: dbt test --select stg_events → all tests pass (not_null, unique on event_id).
4. Configure governance and access controls
Set up a data catalog, column-level security, and row-level access policies. This is the step most teams skip — and the #1 cause of data swamp failure. [src1]
-- Snowflake: row-level security policy
CREATE OR REPLACE ROW ACCESS POLICY pii_access_policy
AS (val VARCHAR) RETURNS BOOLEAN ->
CASE
WHEN CURRENT_ROLE() IN ('DATA_ENGINEER', 'COMPLIANCE') THEN TRUE
ELSE FALSE
END;
ALTER TABLE gold.dim_customers
ADD ROW ACCESS POLICY pii_access_policy ON (email);
Verify: SELECT email FROM gold.dim_customers LIMIT 5 as analyst role → should show ***@domain.com.
5. Set up transformation pipeline with dbt
Configure dbt to transform data through medallion layers with testing, documentation, and lineage tracking. [src5]
# dbt_project.yml
name: 'analytics_lakehouse'
version: '1.0.0'
models:
analytics_lakehouse:
bronze:
+materialized: view
+schema: bronze
silver:
+materialized: incremental
+schema: silver
gold:
+materialized: table
+schema: gold
Verify: dbt debug → connection OK. dbt build --select tag:gold → all models and tests pass.
6. Implement monitoring and cost controls
Set up query monitoring, storage lifecycle policies, and cost alerting. Cloud data platforms can incur runaway costs without proper guardrails. [src3]
-- BigQuery: cost monitoring query
SELECT
user_email,
SUM(total_bytes_processed) / POW(2, 40) AS tb_processed,
SUM(total_bytes_processed) / POW(2, 40) * 5 AS est_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_email ORDER BY tb_processed DESC;
Verify: Check reservation utilization in Cloud Console. Set billing alerts at 80% of monthly budget.
Code Examples
SQL (dbt): Gold layer business model
-- models/gold/fct_daily_revenue.sql
-- Input: Silver-layer orders and products tables
-- Output: Daily revenue fact table for BI dashboards
{{ config(
materialized='table',
partition_by={'field': 'order_date', 'data_type': 'date'},
cluster_by=['product_category', 'region']
) }}
SELECT
o.order_date,
p.product_category,
o.region,
COUNT(DISTINCT o.order_id) AS total_orders,
SUM(o.quantity * p.unit_price) AS gross_revenue,
SUM(o.discount_amount) AS total_discounts,
SUM(o.quantity * p.unit_price) - SUM(o.discount_amount) AS net_revenue
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_products') }} p ON o.product_id = p.product_id
GROUP BY 1, 2, 3
Python (PySpark): Bronze layer ingestion
# Input: Raw JSON files from S3/ADLS landing zone
# Output: Bronze Delta table with metadata columns
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name
spark = SparkSession.builder.appName("bronze_ingestion") \
.config("spark.sql.extensions",
"io.delta.sql.DeltaSparkSessionExtension").getOrCreate()
bronze_df = (spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "s3://lakehouse/schemas/events")
.load("s3://landing-zone/events/"))
(bronze_df.withColumn("_ingested_at", current_timestamp())
.withColumn("_source_file", input_file_name())
.writeStream.format("delta").outputMode("append")
.option("checkpointLocation", "s3://lakehouse/checkpoints/bronze_events")
.trigger(availableNow=True).toTable("bronze.raw_events"))
Python (PySpark): Silver layer transformation
# Input: Bronze Delta table (raw events)
# Output: Silver Delta table (cleansed, deduplicated)
from pyspark.sql.functions import from_json, col, to_date, row_number
from pyspark.sql.window import Window
from delta.tables import DeltaTable
dedup_window = Window.partitionBy("event_id").orderBy(col("_ingested_at").desc())
silver_df = (spark.read.table("bronze.raw_events")
.withColumn("parsed", from_json(col("payload"), payload_schema))
.select("event_id", "event_type",
col("parsed.user_id").alias("user_id"),
col("parsed.amount").alias("amount"),
to_date("parsed.timestamp").alias("event_date"))
.filter(col("event_id").isNotNull())
.withColumn("rn", row_number().over(dedup_window))
.filter(col("rn") == 1).drop("rn"))
# Merge into Silver (upsert pattern)
silver_table = DeltaTable.forPath(spark, "s3://lakehouse/silver/events")
(silver_table.alias("t").merge(silver_df.alias("s"), "t.event_id = s.event_id")
.whenMatchedUpdateAll().whenNotMatchedInsertAll().execute())
Anti-Patterns
Wrong: Dumping everything into a data lake with no governance
# BAD — raw dump with no catalog, no schema, no access controls
df.write.parquet("s3://company-lake/data/")
# Result: "data swamp" — nobody knows what's there
Correct: Ingesting with schema registry, metadata, and catalog
# GOOD — structured ingestion with governance
(df.withColumn("_ingested_at", current_timestamp())
.withColumn("_source", lit("crm_export"))
.withColumn("_schema_version", lit("v2.1"))
.write.format("delta").option("mergeSchema", "true")
.mode("append").saveAsTable("bronze.crm_contacts"))
# Registered in Unity Catalog with lineage + ACLs
Wrong: One massive Gold table for all consumers
-- BAD — single denormalized table serving every team
CREATE TABLE gold.everything AS
SELECT * FROM silver.orders
JOIN silver.customers ON ...
JOIN silver.products ON ...
JOIN silver.inventory ON ...;
-- 500+ columns, slow queries, impossible to maintain
Correct: Purpose-built Gold models per domain
-- GOOD — domain-specific Gold models
-- gold/finance/fct_daily_revenue.sql (10 columns)
-- gold/marketing/fct_campaign_attribution.sql (15 columns)
-- gold/product/fct_user_engagement.sql (12 columns)
-- Each model is tested, documented, and owned by a team
Wrong: Running ML training directly on the warehouse
# BAD — warehouse compute for ML training (5-10x cost)
results = snowflake_cursor.execute(
"SELECT * FROM gold.features WHERE training_set = TRUE"
) # Pulling 500GB through warehouse = $$$
model.fit(pd.DataFrame(results.fetchall()))
Correct: Using lake storage for ML, warehouse for BI
# GOOD — read from lake storage for ML
training_df = spark.read.format("delta").load(
"s3://lakehouse/silver/features"
) # Spark cluster reads from object storage directly
model.fit(training_df.toPandas())
Wrong: Tightly coupling storage and compute
BAD — legacy on-prem warehouse with fixed capacity
Cannot scale storage independently from compute.
Paying for idle compute during off-peak hours.
Over-provisioned or under-performing.
Correct: Decoupled storage and compute
-- GOOD — Snowflake virtual warehouses scale independently
ALTER WAREHOUSE prod_wh SET
WAREHOUSE_SIZE = 'X-LARGE'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 6;
-- Storage infinite, compute on demand
Common Pitfalls
- Data swamp: Building a lake without metadata catalog, data quality checks, or access policies. Within 6-12 months nobody trusts the data. Fix: Deploy a governance layer (Unity Catalog, AWS Glue Catalog, or Apache Atlas) from day one. [src4]
- Warehouse cost explosion: Running exploratory queries on full datasets without partition pruning or cost limits. Fix: Enable
STATEMENT_TIMEOUT_IN_SECONDS(Snowflake),maximumBytesBilled(BigQuery), or slot reservations. [src3] - Schema drift in lake: Source schemas change silently, breaking downstream pipelines. Fix: Use schema evolution in Delta Lake (
mergeSchema) or Iceberg and adddbt testfor schema assertions. [src6] - Medallion layer skipping: Writing directly from ingestion to Gold, bypassing Bronze and Silver. Fix: Always land in Bronze first, transform in Silver, aggregate in Gold. [src4]
- Vendor lock-in: Building on proprietary formats that prevent multi-engine access. Fix: Use open table formats (Apache Iceberg for multi-engine, Delta Lake for Databricks-native). [src6]
- Ignoring data lifecycle: Storing all data at the hot tier indefinitely. Fix: Implement tiered storage with lifecycle policies — move data older than 90 days to cold storage. [src7]
- Under-estimating ETL complexity: Assuming ingestion is simple. Fix: Budget 50-60% of project time for pipeline development, testing, monitoring, and error handling. [src5]
- No cost observability: Running platforms without per-team cost attribution. Fix: Tag queries by team/project, set up daily cost reports, alert on anomalies exceeding 20% of baseline. [src7]
Diagnostic Commands
# Check Delta Lake table health (Databricks)
DESCRIBE DETAIL delta.`s3://lakehouse/silver/events`
# Check table format version and properties
SHOW TBLPROPERTIES bronze.raw_events
# Verify dbt model freshness
dbt source freshness --select source:bronze
# Check Snowflake warehouse utilization
SELECT WAREHOUSE_NAME, AVG(AVG_RUNNING), AVG(QUEUED_LOAD)
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_LOAD_HISTORY
WHERE START_TIME > DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY WAREHOUSE_NAME;
# Check BigQuery slot utilization
SELECT period_start,
SUM(period_slot_ms) / (1000 * 60) AS slot_minutes
FROM `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE period_start > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY period_start ORDER BY period_start;
# Check data lake storage size (AWS)
aws s3 ls s3://lakehouse/ --recursive --summarize | tail -2
Version History & Compatibility
| Technology | Current Version | Key Feature | Notes |
|---|---|---|---|
| Apache Iceberg | 1.7.x (2025) | Universal compatibility, partition evolution | Preferred for multi-engine lakehouse |
| Delta Lake | 4.0 (2025) | UniForm (Iceberg/Hudi interop), liquid clustering | Databricks-native, open-source |
| Apache Hudi | 1.0 (2025) | Record-level upserts, incremental queries | Best for CDC-heavy workloads |
| dbt Core | 1.9.x (2025) | Microbatch incremental, unit testing | Standard transformation tool |
| Apache Spark | 3.5.x (2025) | Spark Connect, improved Iceberg support | De facto lake compute engine |
| Snowflake | Current | Iceberg Tables, Polaris Catalog (open-source) | Warehouse + lake convergence |
| BigQuery | Current | BigLake, Omni (multi-cloud), Iceberg support | Serverless warehouse + lake |
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Structured BI/reporting is the only workload | You need ML training on unstructured data | Data Lake or Lakehouse |
| Need sub-second query latency for dashboards | Data volume is < 100 GB total | PostgreSQL + materialized views |
| Regulatory compliance requires full audit trail | Budget is < $500/month for data infrastructure | Managed PostgreSQL + dbt |
| Mixed workloads: BI + ML + streaming | Team has < 3 data engineers | Start with a managed warehouse |
| Data volume exceeds 10 TB with diverse formats | All data fits in a single relational database | OLTP database with read replicas |
| Multiple teams need independent compute scaling | Only one data consumer team | Single managed warehouse |
| Real-time streaming + batch analytics required | Latency requirements are > 1 hour acceptable | Batch warehouse with scheduled ELT |
Important Caveats
- Cloud data warehouse pricing models differ significantly: Snowflake charges per-second of compute (credits), BigQuery charges per-byte scanned (on-demand) or per-slot (reserved), Redshift charges per-node-hour. Direct cost comparisons require normalizing to the same workload.
- The "lakehouse" term is used differently by each vendor: Databricks means Delta Lake + Unity Catalog, Snowflake means Iceberg Tables + Polaris Catalog, Google means BigQuery + BigLake. Evaluate specific capabilities, not marketing labels.
- Open table formats (Iceberg, Delta, Hudi) are converging — Delta Lake UniForm can expose Delta tables as Iceberg or Hudi, and Iceberg REST Catalog is becoming a standard. Choose based on your primary compute engine.
- Data lake storage costs are 77-95% lower than warehouse storage, but total cost of ownership includes compute, governance tooling, and engineering time — a well-run warehouse can be cheaper than a poorly-governed lake.
- Medallion architecture (Bronze/Silver/Gold) is the dominant pattern but not the only one — some teams use a two-layer approach (raw + curated) or a data mesh pattern with domain-owned data products.