Data Integration Architecture for Startup Dashboards
Purpose
This recipe builds a unified data pipeline connecting CRM, payment processing, analytics, advertising, banking, and email platforms into a single data warehouse, with a transformation layer producing business-ready metrics. The output eliminates manual spreadsheet consolidation by automatically computing MRR, CAC, LTV, retention, and channel attribution from raw source data. [src1]
Prerequisites
- API credentials for each data source — Stripe API key, GA4 service account, HubSpot token, ad platform credentials
- KPI definitions from metrics framework — Startup KPI Framework by Stage
- CRM configured — CRM Setup Guide
- Data warehouse account — Supabase, BigQuery, or Snowflake
- Docker installed (if self-hosting Airbyte) — Docker Desktop
Constraints
- Airbyte Cloud free tier: 10 credits/month (~100K rows). Self-hosted is unlimited but needs Docker + 4GB RAM [src2]
- Fivetran costs ~$1/credit, scaling with volume — budget $100-500/mo at Series A scale
- Stripe API: 100 read requests/second — pace historical backfills [src4]
- Google Ads API needs developer token — 3-5 business days for approval
- Plaid: $0.30-3/connection/month in production — disconnect dormant accounts
- Batch syncs introduce 1-24 hour latency — real-time costs 3-5x more
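For the Stripe backfill constraint, a minimal client-side throttle can keep a historical sync under the limit. This is a sketch, not Stripe's SDK: the 80 req/s target is an arbitrary choice that leaves headroom under the 100 read/s cap, and `fetch_page` stands in for a real API call.

```python
import time

class RateLimiter:
    """Simple pacing helper: caps calls at max_per_second by sleeping
    between requests. Suitable for a single-threaded historical backfill."""

    def __init__(self, max_per_second: float):
        self.min_interval = 1.0 / max_per_second
        self.last_call = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()

# Pace at 80 reads/sec to stay safely under Stripe's 100/sec read limit.
limiter = RateLimiter(max_per_second=80)

def backfill(pages):
    """Iterate paginated results, pacing each request."""
    for page in pages:
        limiter.wait()
        yield page  # in real use: fetch_page(page) against the Stripe API
```

The same pattern applies to any connector that hits 429s during a first full sync.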
Tool Selection Decision
Which path?
├── Non-technical AND budget = free
│ └── PATH A: Zapier + Google Sheets — basic integration, limited scale
├── Semi-technical AND budget = free
│ └── PATH B: Airbyte (self-hosted) + Supabase + Metabase — full pipeline, $0
├── Developer AND budget = $0-100/mo
│ └── PATH C: Airbyte Cloud + Supabase + dbt + Metabase — managed pipeline
└── Developer AND budget = $100-500/mo
└── PATH D: Fivetran + BigQuery/Snowflake + dbt Cloud — enterprise-grade
| Path | Tools | Cost | Setup Time | Scalability |
|---|---|---|---|---|
| A: No-Code | Zapier + Google Sheets | $0-20/mo | 2-4 hours | Low |
| B: Self-Hosted | Airbyte OSS + Supabase + Metabase | $0 | 8-12 hours | Medium |
| C: Managed Free | Airbyte Cloud + Supabase + dbt | $0-25/mo | 4-8 hours | Medium-High |
| D: Enterprise | Fivetran + BigQuery + dbt Cloud | $200-500/mo | 6-12 hours | Very High |
Execution Flow
Step 1: Set Up the Data Warehouse
Duration: 30-60 minutes · Tool: Supabase or BigQuery
Create a raw_data schema for ingestion and an analytics schema for transformed business metrics. Create a dedicated airbyte_user role and grant it write access to raw_data.
Verify: Connect from Airbyte using airbyte_user credentials · If failed: Check connection pooler settings, use port 5432 for direct connections
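The bootstrap DDL can be sketched as below (Postgres/Supabase dialect). Schema and role names follow this recipe (`raw_data`, `analytics`, `airbyte_user`); the password and the exact set of grants are placeholders to adapt to your warehouse.

```python
# Sketch: generate the warehouse bootstrap statements, then run them
# via psycopg2 or the Supabase SQL editor.
def bootstrap_statements(airbyte_password: str) -> list[str]:
    return [
        "CREATE SCHEMA IF NOT EXISTS raw_data;",
        "CREATE SCHEMA IF NOT EXISTS analytics;",
        f"CREATE USER airbyte_user WITH PASSWORD '{airbyte_password}';",
        # Airbyte needs to create and write tables in the raw schema only;
        # it should not touch the transformed analytics schema.
        "GRANT USAGE, CREATE ON SCHEMA raw_data TO airbyte_user;",
        "GRANT ALL ON ALL TABLES IN SCHEMA raw_data TO airbyte_user;",
    ]

for stmt in bootstrap_statements("replace-me"):
    print(stmt)
```

Keeping ingestion and analytics in separate schemas means the transformation layer can be rebuilt from scratch without risking raw data.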
Step 2: Configure Data Source Connections
Duration: 2-4 hours · Tool: Airbyte or Fivetran
Connect sources in priority order: Stripe (revenue), CRM (pipeline), GA4 (traffic), Ad platforms (spend), Email (engagement), Banking (cash). Use incremental sync mode where available.
Verify: Manual Stripe sync completes with rows visible in raw_data schema · If failed: Check API key permissions and scopes
Step 3: Build the Transformation Layer
Duration: 4-8 hours · Tool: dbt or SQL views
Create business-ready views: daily_mrr, customer_ltv, channel_cac, cohort_retention. Map raw Stripe/CRM/ad data into standardized metrics tables.
Verify: MRR query matches Stripe dashboard within 2% · If failed: Check column name mappings and currency handling
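A minimal sketch of the daily MRR logic, assuming raw subscription rows have already been normalized into a simple shape (the field names here are illustrative, not Stripe's actual schema). It keeps amounts in cents until the final display step and prorates annual plans to a monthly share — the two most common sources of the 2% mismatch.

```python
from collections import defaultdict

def daily_mrr(subscriptions):
    """Compute MRR per day from normalized subscription rows.

    Each row: {"date", "amount_cents", "interval", "status"}.
    Amounts are integer cents (Stripe convention) and must exclude tax;
    annual plans contribute one-twelfth of their amount per month.
    """
    mrr = defaultdict(int)
    for sub in subscriptions:
        if sub["status"] != "active":
            continue  # canceled/trialing rows don't count toward MRR
        divisor = 12 if sub["interval"] == "year" else 1
        mrr[sub["date"]] += sub["amount_cents"] // divisor
    # Convert to dollars only at the end, matching the Stripe dashboard.
    return {day: cents / 100 for day, cents in sorted(mrr.items())}
```

In the actual pipeline this logic would live in a dbt model or SQL view over the raw_data tables; the Python form just makes the cents/proration rules explicit.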
Step 4: Set Up Incremental Sync Schedules
Duration: 30-60 minutes · Tool: Airbyte/Fivetran scheduler
Stripe: every 6 hours. CRM: every 6 hours. GA4: daily. Ad platforms: daily. Banking: daily. Email: daily.
Verify: Sync history shows successful runs with row counts · If failed: Check connector logs for the failing source; enable Slack alerts so future failures surface immediately
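The cadences above can be captured as cron expressions (Airbyte accepts cron-style schedules). The specific times here are arbitrary — staggering the daily syncs avoids all connectors hitting the warehouse at once.

```python
# Sketch: sync cadences from Step 4 as cron schedules.
SYNC_SCHEDULES = {
    "stripe":       "0 */6 * * *",  # every 6 hours
    "crm":          "0 */6 * * *",  # every 6 hours
    "ga4":          "0 2 * * *",    # daily, staggered start times below
    "ad_platforms": "0 3 * * *",
    "banking":      "0 4 * * *",
    "email":        "0 5 * * *",
}

for source, cron in SYNC_SCHEDULES.items():
    print(f"{source}: {cron}")
```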
Step 5: Connect Visualization Layer
Duration: 2-4 hours · Tool: Metabase or Looker Studio
Connect dashboard tool to analytics schema. Create initial panels: MRR trend, LTV distribution, CAC by channel, active subscriptions, pipeline value.
Verify: Dashboard loads in under 5 seconds with accurate data · If failed: Confirm the database connection points to the analytics schema
Step 6: Data Quality Monitoring
Duration: 1-2 hours · Tool: SQL alerts or Metabase alerts
Set up freshness checks (flag data staler than 12 hours), completeness checks (flag MRR of zero), and consistency checks (flag a > 20% day-over-day MRR drop).
Verify: Quality checks return no alerts when data is syncing correctly · If failed: Investigate specific source connector sync status
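The three checks can be sketched as pure functions; in practice they would run as scheduled SQL or Metabase alerts against the analytics schema. Thresholds mirror the ones above and are tunable.

```python
from datetime import datetime, timedelta

def check_freshness(last_synced_at: datetime, now: datetime,
                    max_age_hours: int = 12) -> bool:
    """Freshness: pass only if the newest row is within the age threshold."""
    return now - last_synced_at <= timedelta(hours=max_age_hours)

def check_completeness(todays_mrr: float) -> bool:
    """Completeness: an MRR of zero almost always means a broken sync."""
    return todays_mrr > 0

def check_consistency(today_mrr: float, yesterday_mrr: float,
                      max_drop: float = 0.20) -> bool:
    """Consistency: flag a day-over-day MRR drop larger than max_drop."""
    if yesterday_mrr == 0:
        return check_completeness(today_mrr)
    return (yesterday_mrr - today_mrr) / yesterday_mrr <= max_drop
```

Each function returns True when the check passes, so a scheduler only has to alert on False.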
Output Schema
{
"output_type": "data_pipeline",
"format": "configured pipeline + warehouse + transformation layer",
"components": [
{"name": "source_connections", "type": "Airbyte/Fivetran configs", "required": true},
{"name": "raw_data_warehouse", "type": "PostgreSQL/BigQuery schema", "required": true},
{"name": "transformation_layer", "type": "SQL views or dbt models", "required": true},
{"name": "data_quality_monitors", "type": "SQL alerts", "required": true},
{"name": "visualization_connection", "type": "Metabase/Looker config", "required": true}
],
"expected_table_count": "5-8 raw tables, 4-6 analytics views"
}
Quality Benchmarks
| Quality Metric | Minimum Acceptable | Good | Excellent |
|---|---|---|---|
| Sync success rate | > 90% | > 95% | > 99% |
| Data freshness (Stripe) | < 24 hours | < 12 hours | < 6 hours |
| MRR accuracy vs Stripe | Within 5% | Within 2% | Within 0.5% |
| Source coverage | 3+ sources | 5+ sources | 7+ sources |
| Data quality alerts | Weekly | Monthly | Quarterly |
If below minimum: Check sync logs for failures, verify API credentials, confirm rate limits not exceeded.
Error Handling
| Error | Likely Cause | Recovery Action |
|---|---|---|
| 401 Unauthorized | API key expired or revoked | Regenerate key, update connector settings |
| 429 Rate Limited | Too many concurrent requests | Reduce sync frequency, add throttling |
| Schema drift: column missing | Source API updated schema | Update connector version; remap columns |
| MRR calculation mismatch | Currency/tax inclusion differences | Verify amounts in cents, exclude taxes |
| Warehouse storage full | Unchecked raw data growth | Implement retention policy; archive data older than 2 years |
| Slow dashboard queries | Missing indexes | Add indexes on date and customer_id columns |
Cost Breakdown
| Component | Free Tier | Growth | Scale |
|---|---|---|---|
| Data pipeline (Airbyte) | $0 | $50-100/mo | $200-400/mo |
| Data warehouse | $0 | $25-50/mo | $50-200/mo |
| Transformation (dbt) | $0 | $0 | $100/mo |
| Visualization | $0 | $85/mo | $85-200/mo |
| Banking API (Plaid) | $0 | $10-30/mo | $30-100/mo |
| Total | $0 | $170-265/mo | $465-1,000/mo |
Anti-Patterns
Wrong: Building Custom ETL Scripts
Writing custom Python scripts for each API. These break when source APIs change versions and require 2-4 hours of maintenance per source per month. [src3]
Correct: Use Managed Connectors
Use Airbyte/Fivetran for extraction (600+ pre-built connectors). Custom code only for the transformation layer.
Wrong: Syncing Everything in Real-Time
Real-time syncs cost 3-5x more with no benefit if dashboards are checked daily. [src1]
Correct: Match Sync Frequency to Decision Cadence
Revenue: every 6 hours. Ad spend: daily. Analytics: daily. Only real-time for real-time decisions.
When This Matters
Use when a startup needs to combine data from 3+ sources into a single dashboard. Typically becomes necessary between seed and Series A, when investor reporting demands consistent, auditable metrics from consolidated data sources.