Real-Time Analytics Dashboard System Design

Type: Software Reference · Confidence: 0.90 · Sources: 7 · Verified: 2026-02-23 · Freshness: 2026-02-23

TL;DR

Constraints

Quick Reference

| Component | Role | Technology Options | Scaling Strategy |
| --- | --- | --- | --- |
| Event Ingestion | Receives raw events from producers | Apache Kafka, Amazon Kinesis, Apache Pulsar | Partition by event key; add partitions for throughput |
| Stream Processor | Transforms, filters, and windows events | Apache Flink, ksqlDB, Kafka Streams, Spark Structured Streaming | Parallel task slots per partition; checkpoint to durable storage |
| Pre-Aggregation Layer | Computes rollups at ingest time | ClickHouse Materialized Views, Druid rollup, Flink windowed aggregations | Write to AggregatingMergeTree or SummingMergeTree target tables |
| OLAP Storage | Serves low-latency analytical queries | ClickHouse, Apache Druid, TimescaleDB, Apache Pinot | Column-oriented storage; horizontal sharding by time range |
| Query Cache | Caches frequently requested dashboard queries | Redis, Memcached | TTL = dashboard refresh interval; invalidate on new data window |
| Push Delivery | Streams updates to connected clients | WebSocket (Socket.IO, ws), Server-Sent Events (SSE) | Horizontal via sticky sessions + Redis PubSub fan-out |
| API Gateway | Rate limiting, auth, request routing | Kong, AWS API Gateway, Nginx | Per-client rate limits; separate read/write paths |
| Dashboard Frontend | Renders charts, tables, and metrics | Grafana, Apache Superset, custom React/D3.js | CDN for static assets; lazy-load panels not in viewport |
| Time-Series Index | Partitions data by time for fast range queries | Native time partitioning (ClickHouse, TimescaleDB) | Auto-drop or archive partitions older than retention window |
| Alerting Engine | Fires alerts when metrics cross thresholds | Grafana Alerting, PagerDuty, custom Flink CEP | Decouple from dashboard; process alert rules on stream processor |
| Backfill Pipeline | Replays historical data for corrections | Kafka replay from offset, batch re-ingestion | Run during off-peak; separate compute from real-time pipeline |
| Monitoring & Observability | Tracks pipeline health: lag, throughput, errors | Prometheus + Grafana, Datadog | Alert on consumer lag >30s and ingestion error rate >0.1% |
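The Query Cache row above (TTL = dashboard refresh interval, invalidate on new data window) can be sketched as a cache-aside wrapper. This is an illustrative in-process version so the pattern is self-contained; in production the store would be Redis or Memcached, and the class and method names here are assumptions, not a published API.

```python
import time

class QueryCache:
    """Cache-aside wrapper: at most one backing query per TTL window per key."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds        # TTL = dashboard refresh interval
        self._store = {}              # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]             # fresh cached result, no DB hit
        value = compute()             # one OLAP query per TTL window
        self._store[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key):
        # Called when a new aggregation window lands, per the table above
        self._store.pop(key, None)
```

With the TTL set to the dashboard refresh interval, N concurrent viewers of the same panel cost at most one OLAP query per interval instead of N.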

Decision Tree

START
|-- Dashboard update frequency?
|   |-- Sub-second (<1s latency)?
|   |   |-- Use WebSocket push + in-memory pre-aggregation (Flink stateful operators)
|   |   |-- Storage: ClickHouse or Druid with real-time ingestion
|   |   +-- Avoid: materialized views with sync lag; use Flink direct-to-sink
|   |-- Near-real-time (1-30s)?
|   |   |-- Use SSE or WebSocket + ClickHouse materialized views
|   |   |-- Batch micro-aggregations in 5-10s tumbling windows
|   |   +-- This covers 80% of dashboard use cases
|   +-- Periodic (1-5 min)?
|       |-- Use polling API + query cache (Redis, 60s TTL)
|       |-- Pre-aggregate with scheduled ClickHouse materialized views
|       +-- Simplest architecture; start here unless requirements demand faster
|
|-- Concurrent viewers?
|   |-- <100 viewers?
|   |   +-- Direct DB queries OK; single OLAP node sufficient
|   |-- 100-10K viewers?
|   |   |-- Pre-aggregate + cache; fan out via WebSocket with Redis PubSub
|   |   +-- 2-4 WebSocket server nodes behind load balancer
|   +-- >10K viewers?
|       |-- CDN-cached SSE streams or broadcast via pub/sub (Redis/NATS)
|       |-- Never let viewers trigger individual DB queries
|       +-- Compute once, broadcast N times
|
+-- Data volume?
    |-- <10K events/sec?
    |   +-- Single ClickHouse node or TimescaleDB; Kafka optional
    |-- 10K-1M events/sec?
    |   +-- Kafka (3+ brokers) -> Flink -> ClickHouse cluster (3+ nodes)
    +-- >1M events/sec?
        |-- Kafka cluster (6+ brokers, partitioned by key)
        |-- Flink cluster with RocksDB state backend
        |-- ClickHouse cluster with distributed tables
        +-- Druid with deep storage on S3 for historical queries
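The branches above can be condensed into a pair of helper functions for quick triage. The thresholds mirror the tree; the function names and return strings are illustrative, not part of any library.

```python
def recommend_delivery(latency_budget_s: float) -> str:
    """Map a dashboard latency budget to a delivery architecture (per the tree)."""
    if latency_budget_s < 1:
        return "WebSocket push + in-memory pre-aggregation (Flink stateful operators)"
    if latency_budget_s <= 30:
        return "SSE/WebSocket + ClickHouse materialized views (5-10s tumbling windows)"
    return "Polling API + Redis query cache (60s TTL)"

def recommend_storage(events_per_sec: int) -> str:
    """Map ingest volume to a storage/processing topology (per the tree)."""
    if events_per_sec < 10_000:
        return "Single ClickHouse node or TimescaleDB; Kafka optional"
    if events_per_sec <= 1_000_000:
        return "Kafka (3+ brokers) -> Flink -> ClickHouse cluster (3+ nodes)"
    return "Kafka (6+ brokers) -> Flink (RocksDB state) -> ClickHouse distributed tables"
```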

Step-by-Step Guide

1. Define metrics and SLAs

Identify which metrics the dashboard must display and their latency requirements. Separate metrics into tiers: Tier 1 (sub-second), Tier 2 (near-real-time), Tier 3 (periodic). [src7]

Metric inventory example:
- requests_per_second    -> Tier 1 (sub-second)    -> WebSocket push
- error_rate_5min        -> Tier 2 (near-real-time) -> SSE, 5s refresh
- revenue_today          -> Tier 2 (near-real-time) -> SSE, 10s refresh
- daily_active_users     -> Tier 3 (periodic)       -> API poll, 5min cache
- p99_latency_1h         -> Tier 2 (near-real-time) -> SSE, 30s refresh

Verify: Every metric has a defined SLA (latency target) and delivery mechanism.
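The verify step can be automated with a small check over the inventory: every metric must carry an SLA and a delivery mechanism. A minimal sketch, using metric names from the example above; the field names are assumptions.

```python
# Metric inventory as structured data (subset of the example above)
INVENTORY = [
    {"metric": "requests_per_second", "tier": 1, "sla_s": 1,   "delivery": "websocket"},
    {"metric": "error_rate_5min",     "tier": 2, "sla_s": 5,   "delivery": "sse"},
    {"metric": "daily_active_users",  "tier": 3, "sla_s": 300, "delivery": "poll"},
]

def validate_inventory(inventory):
    """Return the names of metrics missing an SLA or delivery mechanism."""
    problems = []
    for m in inventory:
        if not m.get("sla_s") or not m.get("delivery"):
            problems.append(m.get("metric", "<unnamed>"))
    return problems
```

Running this in CI keeps the inventory honest as metrics are added.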

2. Set up the event ingestion pipeline

Deploy Kafka as the central event bus. Producers emit structured events with timestamps. [src4]

# Create Kafka topic with sufficient partitions for throughput
kafka-topics.sh --create \
  --bootstrap-server kafka:9092 \
  --topic raw-events \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=604800000  # 7 days retention

Verify: kafka-consumer-groups.sh --describe --group dashboard-consumer shows partitions assigned and lag near zero.
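A producer emitting structured, timestamped events to the topic above might look like the following sketch. It assumes the kafka-python package; the event schema is illustrative, chosen to match the raw_events columns used in step 3.

```python
import json
import time

def build_event(event_id, event_type, user_id, value, region):
    """Serialize one event; ts is epoch milliseconds to match DateTime64(3)."""
    return json.dumps({
        "event_id": event_id,
        "event_type": event_type,
        "user_id": user_id,
        "value": value,
        "ts": int(time.time() * 1000),
        "region": region,
    }).encode("utf-8")

def produce_sample(bootstrap="kafka:9092"):
    """Send one sample event (requires a running broker and kafka-python)."""
    from kafka import KafkaProducer  # pip install kafka-python
    producer = KafkaProducer(bootstrap_servers=bootstrap)
    # Key by event_type so each type consistently lands on one partition
    producer.send("raw-events", key=b"page_view",
                  value=build_event(1, "page_view", 42, 1.0, "us-east"))
    producer.flush()
```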

3. Build pre-aggregation with materialized views

Create materialized views in ClickHouse that compute rollups at insert time, not query time. [src2]

-- Source table: raw events
CREATE TABLE raw_events (
    event_id     UInt64,
    event_type   LowCardinality(String),
    user_id      UInt64,
    value        Float64,
    ts           DateTime64(3),
    region       LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(ts)
ORDER BY (event_type, ts);

-- Materialized view: 1-minute aggregations
CREATE MATERIALIZED VIEW mv_metrics_1min
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMMDD(window_start)
ORDER BY (event_type, region, window_start)
AS SELECT
    event_type, region,
    toStartOfMinute(ts) AS window_start,
    countState()        AS event_count,
    sumState(value)     AS value_sum,
    avgState(value)     AS value_avg
FROM raw_events
GROUP BY event_type, region, window_start;

Verify: SELECT count() FROM mv_metrics_1min grows as data is inserted. Query latency should be <100ms.

4. Implement the WebSocket push service

Build a server that subscribes to aggregated results and pushes updates to connected dashboard clients. [src3] [src5]

const WebSocket = require('ws');         // [email protected]
const Redis = require('ioredis');        // [email protected]
const wss = new WebSocket.Server({ port: 8080 });
const redisSub = new Redis({ host: 'redis', port: 6379 });
const subscriptions = new Map();

wss.on('connection', (ws) => {
    ws.on('message', (msg) => {
        const { action, metric } = JSON.parse(msg);
        if (action === 'subscribe') {
            if (!subscriptions.has(metric)) subscriptions.set(metric, new Set());
            subscriptions.get(metric).add(ws);
        }
    });
    ws.on('close', () => {
        for (const subs of subscriptions.values()) subs.delete(ws);
    });
});

redisSub.subscribe('metric-updates');
redisSub.on('message', (channel, message) => {
    const update = JSON.parse(message);
    const clients = subscriptions.get(update.metric) || new Set();
    const payload = JSON.stringify(update);
    for (const ws of clients) {
        if (ws.readyState === WebSocket.OPEN) ws.send(payload);
    }
});

Verify: Connect a WebSocket client, subscribe to a metric, and publish a test message to Redis. The client should receive it within <50ms.
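The publisher side of this fan-out is whatever computes the aggregates (a Flink sink, or a small poller against mv_metrics_1min): it publishes each update once to the metric-updates channel the WebSocket server subscribes to. A sketch, assuming the redis-py package; the payload shape matches what the Node.js handler above forwards to clients.

```python
import json
import time

def make_update(metric: str, value: float) -> str:
    """Build the JSON payload broadcast on the metric-updates channel."""
    return json.dumps({
        "metric": metric,
        "timestamp": int(time.time() * 1000),
        "value": value,
    })

def publish_update(metric: str, value: float, host="redis", port=6379):
    """One PUBLISH reaches every WebSocket node via Redis PubSub."""
    import redis  # pip install redis
    redis.Redis(host=host, port=port).publish(
        "metric-updates", make_update(metric, value))
```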

5. Configure the dashboard frontend

Connect the frontend to the WebSocket server and render updates incrementally. [src3]

function useRealtimeMetric(metricKey) {
    const [data, setData] = React.useState([]);
    React.useEffect(() => {
        const ws = new WebSocket('wss://dashboard-api.example.com/ws');
        ws.onopen = () => ws.send(JSON.stringify({ action: 'subscribe', metric: metricKey }));
        ws.onmessage = (e) => {
            const update = JSON.parse(e.data);
            setData(prev => [...prev, update].slice(-300));
        };
        ws.onclose = () => setTimeout(() => {/* reconnect with backoff */}, 2000);
        return () => ws.close();
    }, [metricKey]);
    return data;
}

Verify: Open the dashboard in two browsers; both should display the same live data within 1 second of each other.

6. Set up retention and partition management

Automate old partition removal to keep query performance stable. [src1]

-- ClickHouse: TTL-based automatic retention
ALTER TABLE raw_events MODIFY TTL ts + INTERVAL 90 DAY DELETE;
ALTER TABLE mv_metrics_1min MODIFY TTL window_start + INTERVAL 365 DAY DELETE;

Verify: SELECT partition, rows FROM system.parts WHERE table = 'raw_events' shows no partitions older than the retention window.
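The retention check can also run as a script: given partition IDs in the toYYYYMMDD format used above, flag any older than the retention window. This helper is illustrative; in practice the IDs would come from a query against system.parts.

```python
from datetime import date, timedelta

def expired_partitions(partition_ids, retention_days, today=None):
    """Return partition IDs (toYYYYMMDD format) older than the retention window."""
    today = today or date.today()
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for pid in partition_ids:
        part_date = date(int(pid[:4]), int(pid[4:6]), int(pid[6:8]))
        if part_date < cutoff:
            expired.append(pid)
    return expired
```

A non-empty result means the TTL merge has not caught up or the TTL clause is misconfigured.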

Code Examples

Python: WebSocket Dashboard Client with Auto-Reconnect

# Input:  WebSocket URL and metric key to subscribe to
# Output: Prints real-time metric updates to stdout

import asyncio, json
import websockets  # [email protected]

async def subscribe_metric(ws_url: str, metric: str):
    while True:
        try:
            async with websockets.connect(ws_url) as ws:
                await ws.send(json.dumps({"action": "subscribe", "metric": metric}))
                async for message in ws:
                    update = json.loads(message)
                    print(f"[{update['timestamp']}] {metric}: {update['value']}")
        except (websockets.ConnectionClosed, OSError):
            # OSError covers refused/failed initial connections, not just drops
            print("Connection lost, reconnecting in 2s...")
            await asyncio.sleep(2)

asyncio.run(subscribe_metric("wss://dashboard-api.example.com/ws", "requests_per_second"))

SQL: ClickHouse Materialized View Refresh Pattern

-- Input:  Continuous inserts into raw_events table
-- Output: Pre-aggregated 1-minute windows, queryable in <50ms

INSERT INTO raw_events VALUES
    (1, 'page_view', 42, 1.0, now(), 'us-east'),
    (2, 'purchase',  42, 99.99, now(), 'us-east');

-- Query the materialized view (auto-updated on insert)
SELECT event_type, countMerge(event_count) AS cnt,
       avgMerge(value_avg) AS avg_val
FROM mv_metrics_1min
WHERE window_start >= now() - INTERVAL 5 MINUTE
GROUP BY event_type;

Go: SSE Server for Dashboard Updates

// Input:  HTTP request to /events?metric=requests_per_second
// Output: Server-Sent Events stream with real-time metric updates

// Assumes imports: "fmt", "net/http", "time"; queryLatestMetric is the
// application's own lookup against the pre-aggregated store.
func sseHandler(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")
    flusher, ok := w.(http.Flusher) // not every ResponseWriter can stream
    if !ok {
        http.Error(w, "streaming unsupported", http.StatusInternalServerError)
        return
    }
    metric := r.URL.Query().Get("metric")
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            value := queryLatestMetric(metric)
            fmt.Fprintf(w, "data: {\"metric\":\"%s\",\"value\":%.2f}\n\n", metric, value)
            flusher.Flush()
        case <-r.Context().Done():
            return
        }
    }
}

Anti-Patterns

Wrong: Querying raw events directly from the dashboard

-- BAD -- every dashboard refresh runs a full scan on billions of rows
SELECT event_type, COUNT(*) AS cnt, AVG(value) AS avg_val
FROM raw_events WHERE ts >= now() - INTERVAL 1 HOUR
GROUP BY event_type;
-- At 1M events/sec, this table has 3.6B rows/hour
-- 100 concurrent viewers = 100 parallel full scans = cluster death

Correct: Query pre-aggregated materialized views

-- GOOD -- queries a compact rollup table, not raw events
SELECT event_type, countMerge(event_count) AS cnt,
       avgMerge(value_avg) AS avg_val
FROM mv_metrics_1min
WHERE window_start >= now() - INTERVAL 1 HOUR GROUP BY event_type;
-- Reads ~60 rows instead of 3.6B. Sub-100ms response.

Wrong: Polling the database on every client refresh

// BAD -- every dashboard client polls the API, which queries the DB
setInterval(async () => {
    const res = await fetch('/api/metrics?range=1h');
    updateChart(await res.json());
}, 1000);
// 1000 clients * 1 req/sec = 1000 DB queries/sec

Correct: Compute once, push to all clients via WebSocket/SSE

// GOOD -- server computes once, broadcasts to all subscribers
const ws = new WebSocket('wss://dashboard-api.example.com/ws');
ws.onopen = () => ws.send(JSON.stringify({ action: 'subscribe', metric: 'rps' }));
ws.onmessage = (e) => updateChart(JSON.parse(e.data));
// DB load: 1 query/5s instead of 1000 queries/sec

Wrong: Storing all metrics in a single unpartitioned table

-- BAD -- no partitioning means full table scan for time-range queries
CREATE TABLE metrics (
    id BIGINT PRIMARY KEY, metric_name VARCHAR(255),
    value DOUBLE, recorded_at TIMESTAMP
);
-- After 6 months: billions of rows, no partition pruning

Correct: Partition by time range

-- GOOD -- ClickHouse: automatic time-based partitioning
CREATE TABLE metrics (
    metric_name LowCardinality(String),
    value Float64, recorded_at DateTime64(3)
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(recorded_at)
ORDER BY (metric_name, recorded_at);
-- Queries only touch relevant partitions; old ones drop instantly

Wrong: Synchronous processing in the ingestion path

# BAD -- blocking processing delays event ingestion
def handle_event(event):
    save_to_database(event)           # 5-10ms
    compute_aggregation(event)         # 20-50ms
    notify_all_dashboards(event)       # 10-100ms
    # Total: 35-160ms per event = massive backlog at scale

Correct: Async pipeline with decoupled stages

# GOOD -- each stage runs independently via message queues
def handle_event(event):
    kafka_producer.send('raw-events', event)  # <1ms, non-blocking
    # Flink reads raw-events -> aggregations -> ClickHouse
    # Separate service pushes to WebSocket clients
    # Each stage scales independently

Common Pitfalls

Diagnostic Commands

# Check Kafka consumer lag for the dashboard pipeline
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group dashboard-aggregator

# Monitor ClickHouse query performance
clickhouse-client --query "SELECT query_duration_ms, read_rows, result_rows \
  FROM system.query_log WHERE type = 'QueryFinish' \
  ORDER BY event_time DESC LIMIT 20"

# Check materialized view status
clickhouse-client --query "SELECT database, table, engine, total_rows \
  FROM system.tables WHERE engine LIKE '%MergeTree%'"

# Monitor WebSocket connection count
curl -s http://dashboard-ws:8080/healthz | jq '.active_connections'

# Check partition sizes and retention compliance
clickhouse-client --query "SELECT partition, table, rows, \
  formatReadableSize(bytes_on_disk) AS size \
  FROM system.parts WHERE table = 'raw_events' ORDER BY partition DESC LIMIT 20"

# Verify Redis PubSub fan-out is working
redis-cli PUBSUB NUMSUB metric-updates

Version History & Compatibility

| Technology | Version | Status | Notes |
| --- | --- | --- | --- |
| Apache Kafka | 3.6-3.8 | Current | KRaft mode (no Zookeeper) production-ready since 3.6 |
| ClickHouse | 24.x-25.x | Current | Materialized views with AggregatingMergeTree stable since 22.x |
| Apache Druid | 30.x | Current | Multi-stage query engine since 28.0 |
| Apache Flink | 1.19-1.20 | Current | Unified batch/stream; state backend improvements in 1.19 |
| Grafana | 11.x | Current | Grafana Live (WebSocket push) stable since 8.0; HA requires Redis |
| TimescaleDB | 2.15+ | Current | Continuous aggregates stable since 2.0 |
| WebSocket API | RFC 6455 | Stable | Universal browser support; HTTP/2 compatible |
| Server-Sent Events | W3C spec | Stable | Native browser support; auto-reconnect built-in |

When to Use / When Not to Use

| Use When | Don't Use When | Use Instead |
| --- | --- | --- |
| Metrics must update in <30 seconds for operational decisions | Reports are generated once per day or week | Traditional data warehouse (Snowflake, BigQuery) |
| Monitoring live systems (error rates, throughput, latency) | Historical analysis over months/years of data | Batch analytics with scheduled queries |
| Thousands of concurrent viewers need the same live data | Each user needs a unique, complex ad-hoc query | OLAP query engine with per-user caching |
| Event-driven architecture already produces a Kafka/Kinesis stream | Data arrives in batch files (CSV uploads, ETL dumps) | Batch ingestion pipeline + scheduled dashboard refresh |
| Alerting on metric thresholds must fire within seconds | Accuracy matters more than speed (financial reconciliation) | Double-entry batch pipeline with reconciliation step |

Important Caveats

Related Units