| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Event Ingestion | Receives raw events from producers | Apache Kafka, Amazon Kinesis, Apache Pulsar | Partition by event key; add partitions for throughput |
| Stream Processor | Transforms, filters, and windows events | Apache Flink, ksqlDB, Kafka Streams, Spark Structured Streaming | Parallel task slots per partition; checkpoint to durable storage |
| Pre-Aggregation Layer | Computes rollups at ingest time | ClickHouse Materialized Views, Druid rollup, Flink windowed aggregations | Write to AggregatingMergeTree or SummingMergeTree target tables |
| OLAP Storage | Serves low-latency analytical queries | ClickHouse, Apache Druid, TimescaleDB, Apache Pinot | Column-oriented storage; horizontal sharding by time range |
| Query Cache | Caches frequently requested dashboard queries | Redis, Memcached | TTL = dashboard refresh interval; invalidate on new data window |
| Push Delivery | Streams updates to connected clients | WebSocket (Socket.IO, ws), Server-Sent Events (SSE) | Horizontal via sticky sessions + Redis PubSub fan-out |
| API Gateway | Rate limiting, auth, request routing | Kong, AWS API Gateway, Nginx | Per-client rate limits; separate read/write paths |
| Dashboard Frontend | Renders charts, tables, and metrics | Grafana, Apache Superset, custom React/D3.js | CDN for static assets; lazy-load panels not in viewport |
| Time-Series Index | Partitions data by time for fast range queries | Native time partitioning (ClickHouse, TimescaleDB) | Auto-drop or archive partitions older than retention window |
| Alerting Engine | Fires alerts when metrics cross thresholds | Grafana Alerting, PagerDuty, custom Flink CEP | Decouple from dashboard; process alert rules on stream processor |
| Backfill Pipeline | Replays historical data for corrections | Kafka replay from offset, batch re-ingestion | Run during off-peak; separate compute from real-time pipeline |
| Monitoring & Observability | Tracks pipeline health: lag, throughput, errors | Prometheus + Grafana, Datadog | Alert on consumer lag >30s and ingestion error rate >0.1% |
START
|-- Dashboard update frequency?
| |-- Sub-second (<1s latency)?
| | |-- Use WebSocket push + in-memory pre-aggregation (Flink stateful operators)
| | |-- Storage: ClickHouse or Druid with real-time ingestion
| | +-- Avoid: materialized views with sync lag; use Flink direct-to-sink
| |-- Near-real-time (1-30s)?
| | |-- Use SSE or WebSocket + ClickHouse materialized views
| | |-- Batch micro-aggregations in 5-10s tumbling windows
| | +-- This covers 80% of dashboard use cases
| +-- Periodic (1-5 min)?
| |-- Use polling API + query cache (Redis, 60s TTL)
| |-- Pre-aggregate with scheduled ClickHouse materialized views
| +-- Simplest architecture; start here unless requirements demand faster
|
|-- Concurrent viewers?
| |-- <100 viewers?
| | +-- Direct DB queries OK; single OLAP node sufficient
| |-- 100-10K viewers?
| | |-- Pre-aggregate + cache; fan out via WebSocket with Redis PubSub
| | +-- 2-4 WebSocket server nodes behind load balancer
| +-- >10K viewers?
| |-- CDN-cached SSE streams or broadcast via pub/sub (Redis/NATS)
| |-- Never let viewers trigger individual DB queries
| +-- Compute once, broadcast N times
|
+-- Data volume?
|-- <10K events/sec?
| +-- Single ClickHouse node or TimescaleDB; Kafka optional
|-- 10K-1M events/sec?
| +-- Kafka (3+ brokers) -> Flink -> ClickHouse cluster (3+ nodes)
+-- >1M events/sec?
|-- Kafka cluster (6+ brokers, partitioned by key)
|-- Flink cluster with RocksDB state backend
|-- ClickHouse cluster with distributed tables
+-- Druid with deep storage on S3 for historical queries
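The periodic branch above leans on a short-TTL query cache (Redis, 60s TTL) so that N dashboard refreshes cost one OLAP query. A minimal in-memory sketch of that pattern; in production the store would be Redis (`SETEX`/`GET`), and the key and TTL here are illustrative:

```python
import time

class TTLCache:
    """Minimal TTL cache illustrating the query-cache pattern.
    An in-memory dict stands in for Redis to keep the sketch self-contained."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, fetch_fn):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry and entry[0] > now:
            return entry[1]  # cache hit: no DB query
        value = fetch_fn()   # cache miss: one DB query serves all viewers
        self._store[key] = (now + self.ttl, value)
        return value

calls = []
def fake_query():
    calls.append(1)
    return {"requests_per_second": 1234}

cache = TTLCache(ttl_seconds=60)
a = cache.get_or_compute("rps_1h", fake_query)
b = cache.get_or_compute("rps_1h", fake_query)
# Two dashboard refreshes, one underlying query
```

The same shape works behind a polling API endpoint: every viewer hits the cache, and only the first request per window touches the database.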
Identify which metrics the dashboard must display and their latency requirements. Separate metrics into tiers: Tier 1 (sub-second), Tier 2 (near-real-time), Tier 3 (periodic). [src7]
Metric inventory example:
- requests_per_second -> Tier 1 (sub-second) -> WebSocket push
- error_rate_5min -> Tier 2 (near-real-time) -> SSE, 5s refresh
- revenue_today -> Tier 2 (near-real-time) -> SSE, 10s refresh
- daily_active_users -> Tier 3 (periodic) -> API poll, 5min cache
- p99_latency_1h -> Tier 2 (near-real-time) -> SSE, 30s refresh
Verify: Every metric has a defined SLA (latency target) and delivery mechanism.
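The inventory above can be encoded as a small registry that the API layer consults when wiring each metric to its delivery mechanism, which also makes the "every metric has an SLA" check enforceable. The names and structure below are illustrative, not from the source:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricSpec:
    tier: int               # 1 = sub-second, 2 = near-real-time, 3 = periodic
    transport: str          # "websocket", "sse", or "poll"
    refresh_seconds: float  # latency target / refresh interval

METRICS = {
    "requests_per_second": MetricSpec(tier=1, transport="websocket", refresh_seconds=0.5),
    "error_rate_5min":     MetricSpec(tier=2, transport="sse", refresh_seconds=5),
    "revenue_today":       MetricSpec(tier=2, transport="sse", refresh_seconds=10),
    "daily_active_users":  MetricSpec(tier=3, transport="poll", refresh_seconds=300),
    "p99_latency_1h":      MetricSpec(tier=2, transport="sse", refresh_seconds=30),
}

def delivery_for(metric: str) -> MetricSpec:
    """Fail fast if a dashboard requests a metric with no defined tier/SLA."""
    spec = METRICS.get(metric)
    if spec is None:
        raise KeyError(f"metric {metric!r} has no tier/SLA defined")
    return spec
```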
Deploy Kafka as the central event bus. Producers emit structured events with timestamps. [src4]
# Create Kafka topic with sufficient partitions for throughput
kafka-topics.sh --create \
  --bootstrap-server kafka:9092 \
  --topic raw-events \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=604800000   # 7 days retention
Verify: kafka-consumer-groups.sh --describe --group dashboard-consumer shows partitions assigned and lag near zero.
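On the producer side, events should carry an explicit timestamp and a stable key so partition assignment is deterministic. A sketch of the event construction; the commented send at the bottom uses `kafka-python` and assumes a reachable broker, and all field names are illustrative:

```python
import json, time, uuid

def make_event(event_type: str, user_id: int, value: float, region: str) -> dict:
    """Structured event with an explicit producer-side timestamp (UTC ms)."""
    return {
        "event_id": uuid.uuid4().int >> 64,  # random 64-bit id
        "event_type": event_type,
        "user_id": user_id,
        "value": value,
        "ts_ms": int(time.time() * 1000),    # store UTC; apply timezone at display
        "region": region,
    }

def encode(event: dict) -> tuple[bytes, bytes]:
    """Key by event_type so the same metric always lands on the same partition."""
    return event["event_type"].encode(), json.dumps(event).encode()

# Illustrative send (requires a broker; kafka-python):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="kafka:9092")
# key, value = encode(make_event("page_view", 42, 1.0, "us-east"))
# producer.send("raw-events", key=key, value=value)
```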
Create materialized views in ClickHouse that compute rollups at insert time, not query time. [src2]
-- Source table: raw events
CREATE TABLE raw_events (
    event_id UInt64,
    event_type LowCardinality(String),
    user_id UInt64,
    value Float64,
    ts DateTime64(3),
    region LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(ts)
ORDER BY (event_type, ts);
-- Materialized view: 1-minute aggregations
CREATE MATERIALIZED VIEW mv_metrics_1min
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMMDD(window_start)
ORDER BY (event_type, region, window_start)
AS SELECT
event_type, region,
toStartOfMinute(ts) AS window_start,
countState() AS event_count,
sumState(value) AS value_sum,
avgState(value) AS value_avg
FROM raw_events
GROUP BY event_type, region, window_start;
Verify: SELECT count() FROM mv_metrics_1min grows as data is inserted. Query latency should be <100ms.
Build a server that subscribes to aggregated results and pushes updates to connected dashboard clients. [src3] [src5]
const WebSocket = require('ws');   // [email protected]
const Redis = require('ioredis');  // [email protected]

const wss = new WebSocket.Server({ port: 8080 });
const redisSub = new Redis({ host: 'redis', port: 6379 });
const subscriptions = new Map();   // metric -> Set of subscribed sockets

wss.on('connection', (ws) => {
  ws.on('message', (msg) => {
    const { action, metric } = JSON.parse(msg);
    if (action === 'subscribe') {
      if (!subscriptions.has(metric)) subscriptions.set(metric, new Set());
      subscriptions.get(metric).add(ws);
    }
  });
  ws.on('close', () => {
    for (const subs of subscriptions.values()) subs.delete(ws);
  });
});

redisSub.subscribe('metric-updates');
redisSub.on('message', (channel, message) => {
  const update = JSON.parse(message);
  const clients = subscriptions.get(update.metric) || new Set();
  const payload = JSON.stringify(update);
  for (const ws of clients) {
    if (ws.readyState === WebSocket.OPEN) ws.send(payload);
  }
});
Verify: Connect a WebSocket client, subscribe to a metric, and publish a test message to Redis. The client should receive it in under 50ms.
Connect the frontend to the WebSocket server and render updates incrementally. [src3]
function useRealtimeMetric(metricKey) {
  const [data, setData] = React.useState([]);
  React.useEffect(() => {
    const ws = new WebSocket('wss://dashboard-api.example.com/ws');
    ws.onopen = () => ws.send(JSON.stringify({ action: 'subscribe', metric: metricKey }));
    ws.onmessage = (e) => {
      const update = JSON.parse(e.data);
      setData(prev => [...prev, update].slice(-300));  // keep the last 300 points
    };
    ws.onclose = () => setTimeout(() => {/* reconnect with backoff */}, 2000);
    return () => ws.close();
  }, [metricKey]);
  return data;
}
Verify: Open the dashboard in two browsers; both should display the same live data within 1 second of each other.
Automate old partition removal to keep query performance stable. [src1]
-- ClickHouse: TTL-based automatic retention
ALTER TABLE raw_events MODIFY TTL ts + INTERVAL 90 DAY DELETE;
ALTER TABLE mv_metrics_1min MODIFY TTL window_start + INTERVAL 365 DAY DELETE;
Verify: SELECT partition, rows FROM system.parts WHERE table = 'raw_events' shows no partitions older than the retention window.
# Input: WebSocket URL and metric key to subscribe to
# Output: Prints real-time metric updates to stdout
import asyncio, json
import websockets  # [email protected]

async def subscribe_metric(ws_url: str, metric: str):
    while True:
        try:
            async with websockets.connect(ws_url) as ws:
                await ws.send(json.dumps({"action": "subscribe", "metric": metric}))
                async for message in ws:
                    update = json.loads(message)
                    print(f"[{update['timestamp']}] {metric}: {update['value']}")
        except (websockets.ConnectionClosed, OSError):
            # ConnectionClosed covers mid-stream drops; OSError covers failed connects
            print("Connection lost, reconnecting in 2s...")
            await asyncio.sleep(2)

asyncio.run(subscribe_metric("wss://dashboard-api.example.com/ws", "requests_per_second"))
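The fixed 2-second retry above is the simplest option, but after a server outage every client reconnects in lockstep. The reconnect formula this guide recommends, delay = min(base * 2^attempt, max_delay) + random(0, jitter), spreads those reconnects out; a small sketch (default values are illustrative):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0,
                  max_delay: float = 30.0, jitter: float = 1.0) -> float:
    """delay = min(base * 2^attempt, max_delay) + random(0, jitter)."""
    return min(base * (2 ** attempt), max_delay) + random.uniform(0, jitter)

# In the client loop, replace the fixed sleep with:
#   await asyncio.sleep(backoff_delay(attempt))
# and reset attempt to 0 after a successful connect.
```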
-- Input: Continuous inserts into raw_events table
-- Output: Pre-aggregated 1-minute windows, queryable in <50ms
INSERT INTO raw_events VALUES
(1, 'page_view', 42, 1.0, now(), 'us-east'),
(2, 'purchase', 42, 99.99, now(), 'us-east');
-- Query the materialized view (auto-updated on insert)
SELECT event_type, countMerge(event_count) AS cnt,
avgMerge(value_avg) AS avg_val
FROM mv_metrics_1min
WHERE window_start >= now() - INTERVAL 5 MINUTE
GROUP BY event_type;
// Input: HTTP request to /events?metric=requests_per_second
// Output: Server-Sent Events stream with real-time metric updates
func sseHandler(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")
    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "streaming unsupported", http.StatusInternalServerError)
        return
    }
    metric := r.URL.Query().Get("metric")
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            value := queryLatestMetric(metric)
            fmt.Fprintf(w, "data: {\"metric\":\"%s\",\"value\":%.2f}\n\n", metric, value)
            flusher.Flush()
        case <-r.Context().Done():
            return
        }
    }
}
-- BAD -- every dashboard refresh runs a full scan on billions of rows
SELECT event_type, COUNT(*) AS cnt, AVG(value) AS avg_val
FROM raw_events WHERE ts >= now() - INTERVAL 1 HOUR
GROUP BY event_type;
-- At 1M events/sec, this table has 3.6B rows/hour
-- 100 concurrent viewers = 100 parallel full scans = cluster death
-- GOOD -- queries a compact rollup table, not raw events
SELECT event_type, countMerge(event_count) AS cnt,
avgMerge(value_avg) AS avg_val
FROM mv_metrics_1min
WHERE window_start >= now() - INTERVAL 1 HOUR GROUP BY event_type;
-- Reads ~60 rows instead of 3.6B. Sub-100ms response.
// BAD -- every dashboard client polls the API, which queries the DB
setInterval(async () => {
const res = await fetch('/api/metrics?range=1h');
updateChart(await res.json());
}, 1000);
// 1000 clients * 1 req/sec = 1000 DB queries/sec
// GOOD -- server computes once, broadcasts to all subscribers
const ws = new WebSocket('wss://dashboard-api.example.com/ws');
ws.onopen = () => ws.send(JSON.stringify({ action: 'subscribe', metric: 'rps' }));
ws.onmessage = (e) => updateChart(JSON.parse(e.data));
// DB load: 1 query/5s instead of 1000 queries/sec
-- BAD -- no partitioning means full table scan for time-range queries
CREATE TABLE metrics (
id BIGINT PRIMARY KEY, metric_name VARCHAR(255),
value DOUBLE, recorded_at TIMESTAMP
);
-- After 6 months: billions of rows, no partition pruning
-- GOOD -- ClickHouse: automatic time-based partitioning
CREATE TABLE metrics (
metric_name LowCardinality(String),
value Float64, recorded_at DateTime64(3)
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(recorded_at)
ORDER BY (metric_name, recorded_at);
-- Queries only touch relevant partitions; old ones drop instantly
# BAD -- blocking processing delays event ingestion
def handle_event(event):
    save_to_database(event)       # 5-10ms
    compute_aggregation(event)    # 20-50ms
    notify_all_dashboards(event)  # 10-100ms
# Total: 40-180ms per event = massive backlog at scale

# GOOD -- each stage runs independently via message queues
def handle_event(event):
    kafka_producer.send('raw-events', event)  # <1ms, non-blocking
# Flink reads raw-events -> aggregations -> ClickHouse
# Separate service pushes to WebSocket clients
# Each stage scales independently
- Insert batching: tune max_insert_block_size; use smaller, more frequent batches (10K rows/1s vs 100K/10s). [src2]
- Reconnect storms: back off exponentially with jitter (delay = min(base * 2^attempt, max_delay) + random(0, jitter)). [src5]
- Backpressure: rely on Kafka max.poll.records and Flink checkpointing for natural backpressure. [src4]
- Timezones: toStartOfDay() without timezone produces wrong results globally. Fix: store UTC; apply timezone at the display layer. [src2]
# Check Kafka consumer lag for the dashboard pipeline
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
--describe --group dashboard-aggregator
# Monitor ClickHouse query performance
clickhouse-client --query "SELECT query_duration_ms, read_rows, result_rows \
FROM system.query_log WHERE type = 'QueryFinish' \
ORDER BY event_time DESC LIMIT 20"
# Check materialized view status
clickhouse-client --query "SELECT database, table, engine, total_rows \
FROM system.tables WHERE engine LIKE '%MergeTree%'"
# Monitor WebSocket connection count
curl -s http://dashboard-ws:8080/healthz | jq '.active_connections'
# Check partition sizes and retention compliance
clickhouse-client --query "SELECT partition, table, rows, \
formatReadableSize(bytes_on_disk) AS size \
FROM system.parts WHERE table = 'raw_events' ORDER BY partition DESC LIMIT 20"
# Verify Redis PubSub fan-out is working
redis-cli PUBSUB NUMSUB metric-updates
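The thresholds from the component table (consumer lag over 30s, ingestion error rate over 0.1%) can be wired into whatever monitoring stack is in place. A minimal check; the function name and its inputs are hypothetical, assuming some collector already reports lag and error rate:

```python
def pipeline_alerts(consumer_lag_seconds: float, error_rate: float) -> list[str]:
    """Return alert messages per the thresholds in the component table."""
    alerts = []
    if consumer_lag_seconds > 30:
        alerts.append(f"consumer lag {consumer_lag_seconds:.0f}s exceeds 30s")
    if error_rate > 0.001:  # 0.1% expressed as a fraction
        alerts.append(f"ingestion error rate {error_rate:.2%} exceeds 0.1%")
    return alerts
```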
| Technology | Version | Status | Notes |
|---|---|---|---|
| Apache Kafka | 3.6-3.8 | Current | KRaft mode (no Zookeeper) production-ready since 3.6 |
| ClickHouse | 24.x-25.x | Current | Materialized views with AggregatingMergeTree stable since 22.x |
| Apache Druid | 30.x | Current | Multi-stage query engine since 28.0 |
| Apache Flink | 1.19-1.20 | Current | Unified batch/stream; state backend improvements in 1.19 |
| Grafana | 11.x | Current | Grafana Live (WebSocket push) stable since 8.0; HA requires Redis |
| TimescaleDB | 2.15+ | Current | Continuous aggregates stable since 2.0 |
| WebSocket API | RFC 6455 | Stable | Universal browser support; HTTP/2 compatible |
| Server-Sent Events | W3C spec | Stable | Native browser support; auto-reconnect built-in |
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Metrics must update in <30 seconds for operational decisions | Reports are generated once per day or week | Traditional data warehouse (Snowflake, BigQuery) |
| Monitoring live systems (error rates, throughput, latency) | Historical analysis over months/years of data | Batch analytics with scheduled queries |
| Thousands of concurrent viewers need the same live data | Each user needs a unique, complex ad-hoc query | OLAP query engine with per-user caching |
| Event-driven architecture already produces a Kafka/Kinesis stream | Data arrives in batch files (CSV uploads, ETL dumps) | Batch ingestion pipeline + scheduled dashboard refresh |
| Alerting on metric thresholds must fire within seconds | Accuracy matters more than speed (financial reconciliation) | Double-entry batch pipeline with reconciliation step |