Data Warehouse vs Data Lake Architecture

Type: Software Reference Confidence: 0.91 Sources: 7 Verified: 2026-02-23 Freshness: quarterly

TL;DR

Warehouses deliver fast, governed SQL analytics on structured data; lakes offer cheap, flexible object storage for raw and unstructured data; lakehouses add ACID table formats on top of object storage to serve both. Default to a managed warehouse for small teams and simple BI, and move to a lakehouse when data volume, team count, or workload diversity demands it.

Constraints

Quick Reference

| Component | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Schema approach | Schema-on-write (predefined) | Schema-on-read (flexible) | Schema-on-write + schema-on-read |
| Data types | Structured (SQL tables) | Structured, semi-structured, unstructured | All types with ACID transactions |
| Storage format | Proprietary columnar (internal) | Parquet, ORC, Avro, JSON, CSV | Open table formats (Iceberg, Delta, Hudi) |
| Storage cost | ~$23/TB/month (cloud DW) | ~$2-5/TB/month (object storage) | ~$2-5/TB/month (object storage) |
| Compute model | Integrated or separated | Separated (Spark, Presto, etc.) | Separated (multi-engine) |
| Query performance | Sub-second to seconds (optimized) | Seconds to minutes (scan-heavy) | Seconds (with caching/indexing) |
| ACID transactions | Full ACID | None (without table format) | Full ACID via table format |
| Governance | Built-in (roles, masking, audit) | Manual (must add catalog + ACLs) | Built-in (Unity Catalog, Polaris) |
| Primary users | Business analysts, BI developers | Data engineers, data scientists | All data roles |
| Best workload | BI dashboards, ad-hoc SQL, reporting | ML training, log analytics, raw storage | Mixed: BI + ML + streaming |
| Vendor examples | Snowflake, BigQuery, Redshift | S3 + Athena, ADLS + Synapse, GCS | Databricks Lakehouse, BigQuery Omni |
| Time-to-insight | Fast (pre-modeled data) | Slow (requires transformation) | Medium (medallion layers) |
| Data freshness | Batch (ELT) or micro-batch | Real-time streaming possible | Real-time + batch unified |
| Scaling strategy | Auto-scale compute (credits) | Scale storage infinitely, compute on-demand | Storage infinite, compute elastic |
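To make the storage-cost gap in the table concrete, here is a minimal sketch comparing monthly storage cost at the table's rough rates (~$23/TB for a cloud warehouse vs. a ~$4/TB midpoint of the $2-5 object-storage range). The rates are approximations from the table, not quotes; check current vendor pricing.

```python
# Rough storage-cost comparison using the approximate rates from the table.
WAREHOUSE_RATE = 23.0    # USD per TB per month (managed cloud DW, approximate)
OBJECT_STORE_RATE = 4.0  # USD per TB per month (midpoint of the $2-5 range)

def monthly_storage_cost(tb: float, rate_per_tb: float) -> float:
    """Monthly storage cost in USD for `tb` terabytes at a flat per-TB rate."""
    return tb * rate_per_tb

for tb in (1, 10, 100):
    dw = monthly_storage_cost(tb, WAREHOUSE_RATE)
    lake = monthly_storage_cost(tb, OBJECT_STORE_RATE)
    print(f"{tb:>4} TB: warehouse ~${dw:,.0f}/mo, object storage ~${lake:,.0f}/mo")
```

At 100 TB the gap is roughly $1,900/month on storage alone, which is why lake and lakehouse designs keep bulk data on object storage.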

Decision Tree

START
├── Is your primary workload BI/SQL reporting with structured data only?
│   ├── YES → Data Warehouse (Snowflake, BigQuery, Redshift)
│   │         Best for: dashboards, ad-hoc queries, regulatory reporting
│   └── NO ↓
├── Is your primary workload ML/data science with unstructured data?
│   ├── YES → Data Lake (S3/ADLS + Spark/Athena)
│   │         Best for: training pipelines, log analytics, raw archival
│   └── NO ↓
├── Do you need BOTH SQL analytics AND ML on the same data?
│   ├── YES ↓
│   │   ├── Data volume > 10 TB or > 5 data teams?
│   │   │   ├── YES → Data Lakehouse (Databricks, BigQuery with Iceberg)
│   │   │   │         Use medallion architecture (Bronze → Silver → Gold)
│   │   │   └── NO → Managed Warehouse + object storage export
│   └── NO ↓
├── Do you need real-time streaming analytics?
│   ├── YES → Data Lakehouse with streaming ingestion
│   │         (Delta Live Tables, Apache Flink + Iceberg)
│   └── NO ↓
└── DEFAULT → Start with a managed Data Warehouse
              Migrate to Lakehouse when lake requirements emerge
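The tree above can be sketched as a small Python function. The boolean flag names are hypothetical, chosen to mirror the branch questions; the return strings are labels, not product names.

```python
# A minimal encoding of the decision tree above.
def choose_architecture(
    bi_sql_only: bool,       # BI/SQL reporting on structured data only?
    ml_unstructured: bool,   # ML/data science on unstructured data?
    needs_both: bool,        # SQL analytics AND ML on the same data?
    large_scale: bool,       # data volume > 10 TB or > 5 data teams?
    needs_streaming: bool,   # real-time streaming analytics?
) -> str:
    if bi_sql_only:
        return "data_warehouse"
    if ml_unstructured:
        return "data_lake"
    if needs_both:
        return "lakehouse" if large_scale else "warehouse_plus_object_storage"
    if needs_streaming:
        return "lakehouse_streaming"
    # DEFAULT branch: start managed, migrate when lake requirements emerge
    return "managed_warehouse"

print(choose_architecture(False, False, True, True, False))  # lakehouse
```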

Step-by-Step Guide

1. Audit data sources and classify workloads

Inventory all data sources and classify each by structure level (structured, semi-structured, unstructured), volume, velocity, and primary consumers. This classification drives the architecture decision. [src2]

# data_source_inventory.yaml
sources:
  - name: transactional_db
    type: structured
    format: PostgreSQL
    volume_gb_per_day: 5
    velocity: batch_daily
    consumers: [bi_analysts, finance]
    sensitivity: pii

  - name: clickstream_logs
    type: semi_structured
    format: JSON
    volume_gb_per_day: 200
    velocity: real_time
    consumers: [data_scientists, product]
    sensitivity: none

Verify: Count sources by type — if >60% structured + SQL consumers, lean warehouse. If >40% unstructured/ML, lean lake. Mixed = lakehouse.
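The verify heuristic above can be sketched as a short function over a parsed inventory (same shape as data_source_inventory.yaml). This is a simplification that looks at structure type only, ignoring consumers and volume; the thresholds come from the heuristic in the text.

```python
# Apply the classification heuristic: >60% structured -> warehouse,
# >40% semi-/unstructured -> lake, otherwise mixed -> lakehouse.
def recommend(sources: list[dict]) -> str:
    n = len(sources)
    structured = sum(1 for s in sources if s["type"] == "structured")
    unstructured = sum(
        1 for s in sources if s["type"] in ("semi_structured", "unstructured")
    )
    if structured / n > 0.6:
        return "warehouse"
    if unstructured / n > 0.4:
        return "lake"
    return "lakehouse"

inventory = [
    {"name": "transactional_db", "type": "structured"},
    {"name": "clickstream_logs", "type": "semi_structured"},
]
print(recommend(inventory))  # lake
```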

2. Select storage layer and table format

Choose your storage tier based on workload classification. For lakehouse, select an open table format that provides ACID transactions on top of object storage. [src6]

-- Lakehouse: Delta Lake setup (Databricks)
-- Creates a managed Delta table with schema enforcement
CREATE TABLE IF NOT EXISTS bronze.raw_events (
  event_id STRING,
  event_type STRING,
  payload STRING,
  ingested_at TIMESTAMP,
  -- Delta cannot partition by an expression directly; derive a date
  -- column with GENERATED ALWAYS AS and partition on that instead
  ingest_date DATE GENERATED ALWAYS AS (CAST(ingested_at AS DATE))
)
USING DELTA
PARTITIONED BY (ingest_date)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);

Verify: DESCRIBE EXTENDED bronze.raw_events → should show Provider: delta, partitioning columns, and table properties.

3. Implement medallion architecture layers

Structure data flow through Bronze (raw), Silver (cleansed), and Gold (business-ready) layers. Each layer has a specific purpose and quality contract. [src4]

-- Silver layer: cleansed and conformed (dbt model)
{{ config(
    materialized='incremental',
    unique_key='event_id',
    partition_by={'field': 'event_date', 'data_type': 'date'}
) }}

SELECT
    event_id, event_type,
    JSON_EXTRACT_SCALAR(payload, '$.user_id') AS user_id,
    CAST(ingested_at AS DATE) AS event_date
FROM {{ source('bronze', 'raw_events') }}
WHERE event_type IS NOT NULL
{% if is_incremental() %}
  AND ingested_at > (SELECT MAX(ingested_at) FROM {{ this }})
{% endif %}

Verify: dbt test --select stg_events → all tests pass (not_null, unique on event_id).

4. Configure governance and access controls

Set up a data catalog, column-level security, and row-level access policies. This is the step teams most often skip, and skipping it is a leading cause of data lakes degrading into ungoverned data swamps. [src1]

-- Snowflake: column-level security via dynamic data masking
CREATE OR REPLACE MASKING POLICY email_mask
  AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_ENGINEER', 'COMPLIANCE') THEN val
    ELSE REGEXP_REPLACE(val, '^[^@]+', '***')
  END;

ALTER TABLE gold.dim_customers
  MODIFY COLUMN email SET MASKING POLICY email_mask;

Verify: SELECT email FROM gold.dim_customers LIMIT 5 as analyst role → should show ***@domain.com.

5. Set up transformation pipeline with dbt

Configure dbt to transform data through medallion layers with testing, documentation, and lineage tracking. [src5]

# dbt_project.yml
name: 'analytics_lakehouse'
version: '1.0.0'
models:
  analytics_lakehouse:
    bronze:
      +materialized: view
      +schema: bronze
    silver:
      +materialized: incremental
      +schema: silver
    gold:
      +materialized: table
      +schema: gold

Verify: dbt debug → connection OK. dbt build --select tag:gold → all models and tests pass.

6. Implement monitoring and cost controls

Set up query monitoring, storage lifecycle policies, and cost alerting. Cloud data platforms can incur runaway costs without proper guardrails. [src3]

-- BigQuery: cost monitoring query
SELECT
  user_email,
  SUM(total_bytes_processed) / POW(2, 40) AS tib_processed,
  -- estimated on-demand cost; substitute your region's per-TiB rate
  SUM(total_bytes_processed) / POW(2, 40) * 6.25 AS est_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_email ORDER BY tib_processed DESC;

Verify: Check reservation utilization in Cloud Console. Set billing alerts at 80% of monthly budget.
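The "storage lifecycle policies" mentioned above can be sketched as an S3 lifecycle configuration that tiers cold Bronze data down to cheaper storage classes. The bucket name, prefix, and day thresholds are illustrative; the config would be applied with boto3's `put_bucket_lifecycle_configuration`.

```python
# Illustrative S3 lifecycle rule: move Bronze objects to Infrequent Access
# after 30 days and to Glacier after 90 days.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-down-bronze",
            "Status": "Enabled",
            "Filter": {"Prefix": "bronze/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# To apply (requires AWS credentials; bucket name is a placeholder):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="lakehouse", LifecycleConfiguration=lifecycle_config)
```

Tune the thresholds to your query patterns: data still read by Silver backfills should stay in Standard longer than pure archival data.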

Code Examples

SQL (dbt): Gold layer business model

-- models/gold/fct_daily_revenue.sql
-- Input:  Silver-layer orders and products tables
-- Output: Daily revenue fact table for BI dashboards
{{ config(
    materialized='table',
    partition_by={'field': 'order_date', 'data_type': 'date'},
    cluster_by=['product_category', 'region']
) }}

SELECT
    o.order_date,
    p.product_category,
    o.region,
    COUNT(DISTINCT o.order_id) AS total_orders,
    SUM(o.quantity * p.unit_price) AS gross_revenue,
    SUM(o.discount_amount) AS total_discounts,
    SUM(o.quantity * p.unit_price) - SUM(o.discount_amount) AS net_revenue
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_products') }} p ON o.product_id = p.product_id
GROUP BY 1, 2, 3

Python (PySpark): Bronze layer ingestion

# Input:  Raw JSON files from S3/ADLS landing zone
# Output: Bronze Delta table with metadata columns
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name

spark = (SparkSession.builder.appName("bronze_ingestion")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

# Note: cloudFiles (Auto Loader) is Databricks-only; on Databricks the
# `spark` session already exists, so the builder above is only needed
# when running Delta outside that platform
bronze_df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://lakehouse/schemas/events")
    .load("s3://landing-zone/events/"))

(bronze_df.withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", input_file_name())
    .writeStream.format("delta").outputMode("append")
    .option("checkpointLocation", "s3://lakehouse/checkpoints/bronze_events")
    .trigger(availableNow=True).toTable("bronze.raw_events"))

Python (PySpark): Silver layer transformation

# Input:  Bronze Delta table (raw events)
# Output: Silver Delta table (cleansed, deduplicated)
from pyspark.sql.functions import from_json, col, to_date, row_number
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from delta.tables import DeltaTable

# Expected shape of the raw JSON payload (was previously undefined)
payload_schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("timestamp", StringType()),
])

# Keep only the most recently ingested copy of each event_id
dedup_window = Window.partitionBy("event_id").orderBy(col("_ingested_at").desc())

silver_df = (spark.read.table("bronze.raw_events")
    .withColumn("parsed", from_json(col("payload"), payload_schema))
    .select("event_id", "event_type",
            col("parsed.user_id").alias("user_id"),
            col("parsed.amount").alias("amount"),
            to_date("parsed.timestamp").alias("event_date"))
    .filter(col("event_id").isNotNull())
    .withColumn("rn", row_number().over(dedup_window))
    .filter(col("rn") == 1).drop("rn"))

# Merge into Silver (upsert pattern)
silver_table = DeltaTable.forPath(spark, "s3://lakehouse/silver/events")
(silver_table.alias("t").merge(silver_df.alias("s"), "t.event_id = s.event_id")
    .whenMatchedUpdateAll().whenNotMatchedInsertAll().execute())

Anti-Patterns

Wrong: Dumping everything into a data lake with no governance

# BAD — raw dump with no catalog, no schema, no access controls
df.write.parquet("s3://company-lake/data/")
# Result: "data swamp" — nobody knows what's there

Correct: Ingesting with schema registry, metadata, and catalog

# GOOD — structured ingestion with governance
(df.withColumn("_ingested_at", current_timestamp())
   .withColumn("_source", lit("crm_export"))
   .withColumn("_schema_version", lit("v2.1"))
   .write.format("delta").option("mergeSchema", "true")
   .mode("append").saveAsTable("bronze.crm_contacts"))
# Registered in Unity Catalog with lineage + ACLs

Wrong: One massive Gold table for all consumers

-- BAD — single denormalized table serving every team
CREATE TABLE gold.everything AS
SELECT * FROM silver.orders
JOIN silver.customers ON ...
JOIN silver.products ON ...
JOIN silver.inventory ON ...;
-- 500+ columns, slow queries, impossible to maintain

Correct: Purpose-built Gold models per domain

-- GOOD — domain-specific Gold models
-- gold/finance/fct_daily_revenue.sql (10 columns)
-- gold/marketing/fct_campaign_attribution.sql (15 columns)
-- gold/product/fct_user_engagement.sql (12 columns)
-- Each model is tested, documented, and owned by a team

Wrong: Running ML training directly on the warehouse

# BAD — warehouse compute for ML training (5-10x cost)
results = snowflake_cursor.execute(
    "SELECT * FROM gold.features WHERE training_set = TRUE"
)  # Pulling 500GB through warehouse = $$$
model.fit(pd.DataFrame(results.fetchall()))

Correct: Using lake storage for ML, warehouse for BI

# GOOD — read from lake storage for ML
training_df = spark.read.format("delta").load(
    "s3://lakehouse/silver/features"
)  # Spark cluster reads from object storage directly
model.fit(training_df.toPandas())

Wrong: Tightly coupling storage and compute

BAD — legacy on-prem warehouse with fixed capacity
Cannot scale storage independently from compute.
Paying for idle compute during off-peak hours.
Over-provisioned or under-performing.

Correct: Decoupled storage and compute

-- GOOD — Snowflake virtual warehouses scale independently
ALTER WAREHOUSE prod_wh SET
  WAREHOUSE_SIZE = 'X-LARGE'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 6;
-- Storage infinite, compute on demand

Common Pitfalls

Diagnostic Commands

# Check Delta Lake table health (Databricks)
DESCRIBE DETAIL delta.`s3://lakehouse/silver/events`

# Check table format version and properties
SHOW TBLPROPERTIES bronze.raw_events

# Verify dbt model freshness
dbt source freshness --select source:bronze

# Check Snowflake warehouse utilization
SELECT WAREHOUSE_NAME, AVG(AVG_RUNNING), AVG(QUEUED_LOAD)
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_LOAD_HISTORY
WHERE START_TIME > DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY WAREHOUSE_NAME;

# Check BigQuery slot utilization
SELECT period_start,
  SUM(period_slot_ms) / (1000 * 60) AS slot_minutes
FROM `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE period_start > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY period_start ORDER BY period_start;

# Check data lake storage size (AWS)
aws s3 ls s3://lakehouse/ --recursive --summarize | tail -2

Version History & Compatibility

| Technology | Current Version | Key Feature | Notes |
|---|---|---|---|
| Apache Iceberg | 1.7.x (2025) | Universal compatibility, partition evolution | Preferred for multi-engine lakehouse |
| Delta Lake | 4.0 (2025) | UniForm (Iceberg/Hudi interop), liquid clustering | Databricks-native, open-source |
| Apache Hudi | 1.0 (2025) | Record-level upserts, incremental queries | Best for CDC-heavy workloads |
| dbt Core | 1.9.x (2025) | Microbatch incremental, unit testing | Standard transformation tool |
| Apache Spark | 3.5.x (2025) | Spark Connect, improved Iceberg support | De facto lake compute engine |
| Snowflake | Current | Iceberg Tables, Polaris Catalog (open-source) | Warehouse + lake convergence |
| BigQuery | Current | BigLake, Omni (multi-cloud), Iceberg support | Serverless warehouse + lake |

When to Use / When Not to Use

| Use When | Don't Use When | Use Instead |
|---|---|---|
| Structured BI/reporting is the only workload | You need ML training on unstructured data | Data Lake or Lakehouse |
| Need sub-second query latency for dashboards | Data volume is < 100 GB total | PostgreSQL + materialized views |
| Regulatory compliance requires full audit trail | Budget is < $500/month for data infrastructure | Managed PostgreSQL + dbt |
| Mixed workloads: BI + ML + streaming | Team has < 3 data engineers | Start with a managed warehouse |
| Data volume exceeds 10 TB with diverse formats | All data fits in a single relational database | OLTP database with read replicas |
| Multiple teams need independent compute scaling | Only one data consumer team | Single managed warehouse |
| Real-time streaming + batch analytics required | Latency of > 1 hour is acceptable | Batch warehouse with scheduled ELT |

Important Caveats

Related Units