| Component | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Schema approach | Schema-on-write (predefined) | Schema-on-read (flexible) | Schema-on-write + schema-on-read |
| Data types | Structured (SQL tables) | Structured, semi-structured, unstructured | All types with ACID transactions |
| Storage format | Proprietary columnar (internal) | Parquet, ORC, Avro, JSON, CSV | Open table formats (Iceberg, Delta, Hudi) |
| Storage cost | ~$23/TB/month (cloud DW) | ~$2-5/TB/month (object storage) | ~$2-5/TB/month (object storage) |
| Compute model | Integrated or separated | Separated (Spark, Presto, etc.) | Separated (multi-engine) |
| Query performance | Sub-second to seconds (optimized) | Seconds to minutes (scan-heavy) | Seconds (with caching/indexing) |
| ACID transactions | Full ACID | None (without table format) | Full ACID via table format |
| Governance | Built-in (roles, masking, audit) | Manual (must add catalog + ACLs) | Built-in (Unity Catalog, Polaris) |
| Primary users | Business analysts, BI developers | Data engineers, data scientists | All data roles |
| Best workload | BI dashboards, ad-hoc SQL, reporting | ML training, log analytics, raw storage | Mixed: BI + ML + streaming |
| Vendor examples | Snowflake, BigQuery, Redshift | S3 + Athena, ADLS + Synapse, GCS | Databricks Lakehouse, BigQuery Omni |
| Time-to-insight | Fast (pre-modeled data) | Slow (requires transformation) | Medium (medallion layers) |
| Data freshness | Batch (ELT) or micro-batch | Real-time streaming possible | Real-time + batch unified |
| Scaling strategy | Auto-scale compute (credits) | Scale storage infinitely, compute on-demand | Storage infinite, compute elastic |
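The storage-cost gap in the table compounds quickly at scale. A back-of-envelope sketch using the per-TB rates from the table (illustrative only; check your vendor's current pricing):

```python
# Rough monthly storage cost comparison using the per-TB rates
# from the table above (illustrative; verify against current vendor pricing).
WAREHOUSE_RATE = 23.0   # $/TB/month, cloud warehouse storage
OBJECT_RATE = 3.5       # $/TB/month, midpoint of the $2-5 object-storage range

def monthly_storage_cost(tb: float, rate_per_tb: float) -> float:
    """Monthly storage cost in USD for `tb` terabytes at `rate_per_tb`."""
    return tb * rate_per_tb

tb = 50
warehouse = monthly_storage_cost(tb, WAREHOUSE_RATE)   # 1150.0
lake = monthly_storage_cost(tb, OBJECT_RATE)           # 175.0
print(f"{tb} TB: warehouse ${warehouse:,.0f}/mo vs object storage ${lake:,.0f}/mo")
```

At 50 TB the difference is already roughly $1,000/month for storage alone, which is why lake and lakehouse architectures keep cold data on object storage.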
START
├── Is your primary workload BI/SQL reporting with structured data only?
│ ├── YES → Data Warehouse (Snowflake, BigQuery, Redshift)
│ │ Best for: dashboards, ad-hoc queries, regulatory reporting
│ └── NO ↓
├── Is your primary workload ML/data science with unstructured data?
│ ├── YES → Data Lake (S3/ADLS + Spark/Athena)
│ │ Best for: training pipelines, log analytics, raw archival
│ └── NO ↓
├── Do you need BOTH SQL analytics AND ML on the same data?
│ ├── YES ↓
│ │ ├── Data volume > 10 TB or > 5 data teams?
│ │ │ ├── YES → Data Lakehouse (Databricks, BigQuery with Iceberg)
│ │ │ │ Use medallion architecture (Bronze → Silver → Gold)
│ │ │ └── NO → Managed Warehouse + object storage export
│ └── NO ↓
├── Do you need real-time streaming analytics?
│ ├── YES → Data Lakehouse with streaming ingestion
│ │ (Delta Live Tables, Apache Flink + Iceberg)
│ └── NO ↓
└── DEFAULT → Start with a managed Data Warehouse
Migrate to Lakehouse when lake requirements emerge
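The decision tree above can be encoded as a function for reuse in planning docs or intake forms. A minimal sketch; the parameter names (`workload`, `needs_streaming`, etc.) are illustrative, and the thresholds mirror the tree:

```python
# Sketch of the decision tree above as a function (assumed field names).
def choose_architecture(workload: str, data_tb: float = 0,
                        teams: int = 1, needs_streaming: bool = False) -> str:
    """workload: 'bi_sql', 'ml_unstructured', 'mixed', or anything else."""
    if workload == "bi_sql":
        return "data_warehouse"              # Snowflake, BigQuery, Redshift
    if workload == "ml_unstructured":
        return "data_lake"                   # S3/ADLS + Spark/Athena
    if workload == "mixed":
        if data_tb > 10 or teams > 5:
            return "data_lakehouse"          # medallion architecture
        return "warehouse_plus_object_storage"
    if needs_streaming:
        return "lakehouse_streaming"         # DLT, Flink + Iceberg
    return "data_warehouse"                  # default: start managed, migrate later
```

For example, `choose_architecture("mixed", data_tb=20)` returns `"data_lakehouse"`, matching the `> 10 TB` branch of the tree.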
Inventory all data sources and classify each by structure level (structured, semi-structured, unstructured), volume, velocity, and primary consumers. This classification drives the architecture decision. [src2]
# data_source_inventory.yaml
sources:
- name: transactional_db
type: structured
format: PostgreSQL
volume_gb_per_day: 5
velocity: batch_daily
consumers: [bi_analysts, finance]
sensitivity: pii
- name: clickstream_logs
type: semi_structured
format: JSON
volume_gb_per_day: 200
velocity: real_time
consumers: [data_scientists, product]
sensitivity: none
Verify: Count sources by type — if >60% structured + SQL consumers, lean warehouse. If >40% unstructured/ML, lean lake. Mixed = lakehouse.
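The 60%/40% heuristic can be scripted against the inventory. A simplified sketch that counts by `type` only (the sample data is inlined as dicts to stay dependency-free; in practice you would load the YAML file, e.g. with PyYAML):

```python
# Apply the 60%/40% rule from the verification step above.
# Sample data mirrors the inventory; thresholds come from the heuristic.
sources = [
    {"name": "transactional_db", "type": "structured"},
    {"name": "clickstream_logs", "type": "semi_structured"},
]

def recommend(sources: list) -> str:
    """Lean warehouse if >60% structured, lake if >40% non-structured, else lakehouse."""
    n = len(sources)
    structured = sum(s["type"] == "structured" for s in sources) / n
    non_structured = sum(s["type"] != "structured" for s in sources) / n
    if structured > 0.6:
        return "warehouse"
    if non_structured > 0.4:
        return "lake"
    return "lakehouse"

print(recommend(sources))  # → lake (50% non-structured exceeds the 40% bar)
```

A real scorer would also weight by volume and consumer roles, since a single 200 GB/day clickstream source outweighs several small relational ones.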
Choose your storage tier based on workload classification. For lakehouse, select an open table format that provides ACID transactions on top of object storage. [src6]
-- Lakehouse: Delta Lake setup (Databricks)
-- Creates a managed Delta table with schema enforcement
CREATE TABLE IF NOT EXISTS bronze.raw_events (
  event_id STRING,
  event_type STRING,
  payload STRING,
  ingested_at TIMESTAMP,
  -- Delta cannot partition on an expression directly; use a generated column
  ingest_date DATE GENERATED ALWAYS AS (CAST(ingested_at AS DATE))
)
USING DELTA
PARTITIONED BY (ingest_date)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);
Verify: DESCRIBE EXTENDED bronze.raw_events → should show Provider: delta, partitioning columns, and table properties.
Structure data flow through Bronze (raw), Silver (cleansed), and Gold (business-ready) layers. Each layer has a specific purpose and quality contract. [src4]
-- Silver layer: cleansed and conformed (dbt model)
{{ config(
materialized='incremental',
unique_key='event_id',
partition_by={'field': 'event_date', 'data_type': 'date'}
) }}
SELECT
event_id, event_type,
JSON_EXTRACT_SCALAR(payload, '$.user_id') AS user_id,
ingested_at,  -- materialized so the incremental filter below can reference it
CAST(ingested_at AS DATE) AS event_date
FROM {{ source('bronze', 'raw_events') }}
WHERE event_type IS NOT NULL
{% if is_incremental() %}
AND ingested_at > (SELECT MAX(ingested_at) FROM {{ this }})
{% endif %}
Verify: dbt test --select stg_events → all tests pass (not_null, unique on event_id).
Set up a data catalog, column-level security, and row-level access policies. This is the step most teams skip — and the #1 cause of data swamp failure. [src1]
-- Snowflake: row-level security policy
CREATE OR REPLACE ROW ACCESS POLICY pii_access_policy
AS (val VARCHAR) RETURNS BOOLEAN ->
CASE
WHEN CURRENT_ROLE() IN ('DATA_ENGINEER', 'COMPLIANCE') THEN TRUE
ELSE FALSE
END;
ALTER TABLE gold.dim_customers
ADD ROW ACCESS POLICY pii_access_policy ON (email);
Verify: SELECT email FROM gold.dim_customers LIMIT 5 as an analyst role → should return 0 rows (a row access policy filters rows out; to show ***@domain.com instead, attach a column masking policy).
Configure dbt to transform data through medallion layers with testing, documentation, and lineage tracking. [src5]
# dbt_project.yml
name: 'analytics_lakehouse'
version: '1.0.0'
models:
analytics_lakehouse:
bronze:
+materialized: view
+schema: bronze
silver:
+materialized: incremental
+schema: silver
gold:
+materialized: table
+schema: gold
Verify: dbt debug → connection OK. dbt build --select tag:gold → all models and tests pass.
Set up query monitoring, storage lifecycle policies, and cost alerting. Cloud data platforms can incur runaway costs without proper guardrails. [src3]
-- BigQuery: cost monitoring query
SELECT
user_email,
SUM(total_bytes_processed) / POW(2, 40) AS tb_processed,
-- assumes the $5/TB on-demand rate; substitute your edition's pricing
SUM(total_bytes_processed) / POW(2, 40) * 5 AS est_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_email ORDER BY tb_processed DESC;
Verify: Check reservation utilization in Cloud Console. Set billing alerts at 80% of monthly budget.
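The per-user figures from the query above can feed a simple alert check. A sketch, assuming exported TB-processed totals; the $5/TB rate mirrors the query, and `MONTHLY_BUDGET` is an example value to replace with your own:

```python
# Flag when estimated on-demand spend crosses 80% of the monthly budget.
# Rates and budget are illustrative assumptions; substitute your own.
ON_DEMAND_RATE = 5.0      # $/TB, matches the cost query above
MONTHLY_BUDGET = 2000.0   # USD, example budget
ALERT_THRESHOLD = 0.8     # alert at 80%, per the verification step

def over_budget(usage_tb: dict) -> bool:
    """usage_tb maps user email -> TB processed this month."""
    est_cost = sum(usage_tb.values()) * ON_DEMAND_RATE
    return est_cost >= MONTHLY_BUDGET * ALERT_THRESHOLD

usage = {"analyst@example.com": 150.0, "etl@example.com": 190.0}
print(over_budget(usage))  # 340 TB * $5 = $1700 >= $1600 → True
```

In production this check would run on a schedule and page the platform team, but native billing alerts (Cloud Billing budgets, Snowflake resource monitors) should be the first line of defense.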
-- models/gold/fct_daily_revenue.sql
-- Input: Silver-layer orders and products tables
-- Output: Daily revenue fact table for BI dashboards
{{ config(
materialized='table',
partition_by={'field': 'order_date', 'data_type': 'date'},
cluster_by=['product_category', 'region']
) }}
SELECT
o.order_date,
p.product_category,
o.region,
COUNT(DISTINCT o.order_id) AS total_orders,
SUM(o.quantity * p.unit_price) AS gross_revenue,
SUM(o.discount_amount) AS total_discounts,
SUM(o.quantity * p.unit_price) - SUM(o.discount_amount) AS net_revenue
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_products') }} p ON o.product_id = p.product_id
GROUP BY 1, 2, 3
# Input: Raw JSON files from S3/ADLS landing zone
# Output: Bronze Delta table with metadata columns
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name
spark = SparkSession.builder.appName("bronze_ingestion") \
.config("spark.sql.extensions",
"io.delta.sql.DeltaSparkSessionExtension").getOrCreate()
bronze_df = (spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "s3://lakehouse/schemas/events")
.load("s3://landing-zone/events/"))
(bronze_df.withColumn("_ingested_at", current_timestamp())
.withColumn("_source_file", input_file_name())
.writeStream.format("delta").outputMode("append")
.option("checkpointLocation", "s3://lakehouse/checkpoints/bronze_events")
.trigger(availableNow=True).toTable("bronze.raw_events"))
# Input: Bronze Delta table (raw events)
# Output: Silver Delta table (cleansed, deduplicated)
from pyspark.sql.functions import from_json, col, to_date, row_number
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from delta.tables import DeltaTable
# Schema for the JSON payload column (adjust fields to your events)
payload_schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("timestamp", StringType()),
])
dedup_window = Window.partitionBy("event_id").orderBy(col("_ingested_at").desc())
silver_df = (spark.read.table("bronze.raw_events")
    .withColumn("parsed", from_json(col("payload"), payload_schema))
    .select("event_id", "event_type", "_ingested_at",  # keep _ingested_at for dedup ordering
            col("parsed.user_id").alias("user_id"),
            col("parsed.amount").alias("amount"),
            to_date("parsed.timestamp").alias("event_date"))
    .filter(col("event_id").isNotNull())
    .withColumn("rn", row_number().over(dedup_window))
    .filter(col("rn") == 1).drop("rn", "_ingested_at"))
# Merge into Silver (upsert pattern)
silver_table = DeltaTable.forPath(spark, "s3://lakehouse/silver/events")
(silver_table.alias("t").merge(silver_df.alias("s"), "t.event_id = s.event_id")
.whenMatchedUpdateAll().whenNotMatchedInsertAll().execute())
# BAD — raw dump with no catalog, no schema, no access controls
df.write.parquet("s3://company-lake/data/")
# Result: "data swamp" — nobody knows what's there
# GOOD — structured ingestion with governance
from pyspark.sql.functions import current_timestamp, lit
(df.withColumn("_ingested_at", current_timestamp())
.withColumn("_source", lit("crm_export"))
.withColumn("_schema_version", lit("v2.1"))
.write.format("delta").option("mergeSchema", "true")
.mode("append").saveAsTable("bronze.crm_contacts"))
# Registered in Unity Catalog with lineage + ACLs
-- BAD — single denormalized table serving every team
CREATE TABLE gold.everything AS
SELECT * FROM silver.orders
JOIN silver.customers ON ...
JOIN silver.products ON ...
JOIN silver.inventory ON ...;
-- 500+ columns, slow queries, impossible to maintain
-- GOOD — domain-specific Gold models
-- gold/finance/fct_daily_revenue.sql (10 columns)
-- gold/marketing/fct_campaign_attribution.sql (15 columns)
-- gold/product/fct_user_engagement.sql (12 columns)
-- Each model is tested, documented, and owned by a team
# BAD — warehouse compute for ML training (5-10x cost)
results = snowflake_cursor.execute(
"SELECT * FROM gold.features WHERE training_set = TRUE"
) # Pulling 500GB through warehouse = $$$
model.fit(pd.DataFrame(results.fetchall()))
# GOOD — read from lake storage for ML
training_df = spark.read.format("delta").load(
"s3://lakehouse/silver/features"
) # Spark cluster reads from object storage directly
model.fit(training_df.toPandas())  # for large sets, sample first or train distributed
-- BAD — legacy on-prem warehouse with fixed capacity:
-- cannot scale storage independently from compute,
-- pays for idle compute during off-peak hours,
-- and is either over-provisioned or under-performing.
-- GOOD — Snowflake virtual warehouses scale independently
ALTER WAREHOUSE prod_wh SET
WAREHOUSE_SIZE = 'X-LARGE'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 6;
-- Storage infinite, compute on demand
To cap runaway query costs, set STATEMENT_TIMEOUT_IN_SECONDS (Snowflake), maximumBytesBilled (BigQuery), or slot reservations. [src3]
To absorb schema drift, enable schema evolution (e.g., Delta's mergeSchema) or Iceberg, and add dbt test assertions on schema. [src6]
# Check Delta Lake table health (Databricks)
DESCRIBE DETAIL delta.`s3://lakehouse/silver/events`
# Check table format version and properties
SHOW TBLPROPERTIES bronze.raw_events
# Verify dbt model freshness
dbt source freshness --select source:bronze
# Check Snowflake warehouse utilization
SELECT WAREHOUSE_NAME, AVG(AVG_RUNNING), AVG(AVG_QUEUED_LOAD)
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_LOAD_HISTORY
WHERE START_TIME > DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY WAREHOUSE_NAME;
# Check BigQuery slot utilization
SELECT period_start,
SUM(period_slot_ms) / (1000 * 60) AS slot_minutes
FROM `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE period_start > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY period_start ORDER BY period_start;
# Check data lake storage size (AWS)
aws s3 ls s3://lakehouse/ --recursive --summarize | tail -2
| Technology | Current Version | Key Feature | Notes |
|---|---|---|---|
| Apache Iceberg | 1.7.x (2025) | Universal compatibility, partition evolution | Preferred for multi-engine lakehouse |
| Delta Lake | 4.0 (2025) | UniForm (Iceberg/Hudi interop), liquid clustering | Databricks-native, open-source |
| Apache Hudi | 1.0 (2025) | Record-level upserts, incremental queries | Best for CDC-heavy workloads |
| dbt Core | 1.9.x (2025) | Microbatch incremental, unit testing | Standard transformation tool |
| Apache Spark | 3.5.x (2025) | Spark Connect, improved Iceberg support | De facto lake compute engine |
| Snowflake | Current | Iceberg Tables, Polaris Catalog (open-source) | Warehouse + lake convergence |
| BigQuery | Current | BigLake, Omni (multi-cloud), Iceberg support | Serverless warehouse + lake |
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Structured BI/reporting is the only workload | You need ML training on unstructured data | Data Lake or Lakehouse |
| Need sub-second query latency for dashboards | Data volume is < 100 GB total | PostgreSQL + materialized views |
| Regulatory compliance requires full audit trail | Budget is < $500/month for data infrastructure | Managed PostgreSQL + dbt |
| Mixed workloads: BI + ML + streaming | Team has < 3 data engineers | Start with a managed warehouse |
| Data volume exceeds 10 TB with diverse formats | All data fits in a single relational database | OLTP database with read replicas |
| Multiple teams need independent compute scaling | Only one data consumer team | Single managed warehouse |
| Real-time streaming + batch analytics required | Latency requirements are > 1 hour acceptable | Batch warehouse with scheduled ELT |