Scalable Notification System Design (Push, In-App, Email)

Type: Software Reference Confidence: 0.92 Sources: 7 Verified: 2026-02-23 Freshness: 2026-02-23

TL;DR

Constraints

Quick Reference

ComponentRoleTechnology OptionsScaling Strategy
API GatewayRate-limit, authenticate, route notification requestsKong, AWS API Gateway, NginxHorizontal + auto-scale on request rate
Notification ServiceValidate payload, enrich with user prefs, fan-out to queuesNode.js, Go, Java Spring BootStateless; scale by CPU/request count
User Preference ServiceStore per-user channel opt-ins, quiet hours, frequency capsPostgreSQL + Redis cacheRead-heavy; cache in Redis with 5-min TTL
Template EngineRender channel-specific content from templatesHandlebars, Jinja2, MJML (email)Stateless; co-locate with notification service
Message QueueDecouple submission from delivery, buffer spikesKafka, RabbitMQ, AWS SQS+SNSPartition by channel; scale consumers independently
Priority RouterClassify notifications (critical/high/normal/low) and route to priority queuesCustom logic + separate queue topicsSeparate queues per priority tier
Push ProcessorDeliver to mobile/web via FCM and APNsFCM HTTP v1 API, APNs HTTP/2Batch sends (FCM: 500 tokens/multicast); scale by queue depth
Email ProcessorRender and send email via ESPSendGrid, Amazon SES, MailgunWarm up IPs; separate transactional vs marketing subdomains
SMS ProcessorSend SMS via telecom APITwilio, Nexmo/Vonage, AWS SNSRate-limit per country; scale by queue depth
In-App ProcessorDeliver real-time to connected clientsWebSocket (Socket.io), SSE, long pollingSticky sessions or Redis Pub/Sub for horizontal scale
Notification StorePersist notification history and read/unread statusPostgreSQL (recent), S3/cold storage (archive)Time-partition; archive after 90 days
Retry / DLQHandle failed deliveries with exponential backoffBuilt-in queue retry + Dead Letter QueueAlert on DLQ depth; manual inspection workflow
Analytics & MonitoringTrack delivery rates, open rates, latencyPrometheus + Grafana, ELK Stack, DatadogAggregate metrics; alert on delivery rate drops
Device RegistryStore and validate device tokens per userPostgreSQL or DynamoDBPrune invalid tokens on provider feedback

Decision Tree

START: What notification channels do you need?
|
+-- Push notifications only?
|   +-- YES --> Use FCM (Android/web) + APNs (iOS) directly
|   |           with a simple queue (SQS/Redis) in front
|   +-- NO  |
|           v
+-- Need email + push?
|   +-- YES --> Add message queue fan-out (SNS->SQS pattern or Kafka topics)
|   |           with separate push and email processors
|   +-- NO  |
|           v
+-- Need in-app real-time notifications?
|   +-- YES --> Add WebSocket/SSE gateway with Redis Pub/Sub
|   |           for cross-instance message delivery
|   +-- NO  |
|           v
+-- Need SMS?
|   +-- YES --> Add SMS processor + Twilio/Vonage integration
|               with per-country rate limiting
|   +-- NO  --> Reassess requirements
|
SCALE CHECK:
+-- < 1K notifications/day?
|   +-- YES --> Single-server with in-process queue (Bull/Celery) is sufficient
|   +-- NO  |
|           v
+-- 1K-100K notifications/day?
|   +-- YES --> Managed queue service (SQS+SNS or Cloud Tasks) + 2-4 workers
|   +-- NO  |
|           v
+-- 100K-10M notifications/day?
|   +-- YES --> Kafka with consumer groups, partitioned by channel + user-id sharding
|   +-- NO  |
|           v
+-- > 10M notifications/day?
    +-- YES --> Multi-region Kafka clusters, dedicated worker pools,
                Redis cluster for preference caching, sharded notification store

Step-by-Step Guide

1. Define notification events and payload schema

Establish a canonical event format that all services publish. Include event type, recipient, channel hints, priority, and idempotency key. [src3]

{
  "event_id": "evt_abc123",
  "event_type": "order.shipped",
  "recipient_id": "user_789",
  "channels": ["push", "email", "in_app"],
  "priority": "high",
  "data": {
    "order_id": "ORD-456",
    "tracking_url": "https://example.com/track/ORD-456"
  },
  "idempotency_key": "order_shipped_ORD-456",
  "created_at": "2026-02-23T10:00:00Z"
}

Verify: Validate schema with JSON Schema or Zod. All downstream processors must accept this format without transformation.

2. Build the user preference service

Store per-user, per-channel, per-notification-type preferences. Cache aggressively since this is read on every notification. [src4]

CREATE TABLE user_notification_preferences (
  user_id       UUID NOT NULL,
  channel       TEXT NOT NULL CHECK (channel IN ('push','email','sms','in_app')),
  notif_type    TEXT NOT NULL,
  enabled       BOOLEAN DEFAULT true,
  quiet_start   TIME,
  quiet_end     TIME,
  timezone      TEXT DEFAULT 'UTC',
  updated_at    TIMESTAMPTZ DEFAULT now(),
  PRIMARY KEY (user_id, channel, notif_type)
);

Verify: SELECT * FROM user_notification_preferences WHERE user_id = 'test-user'; → should return one row per channel/type combination.

3. Implement fan-out with message queues

Publish the notification event to a central topic. Subscribers (one per channel) receive a copy and enqueue it for processing. [src5]

import boto3
import json

sns = boto3.client('sns', region_name='us-east-1')
TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789:notifications'

def publish_notification(event: dict):
    """Publish notification event to SNS for fan-out."""
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps(event),
        MessageAttributes={
            'priority': {
                'DataType': 'String',
                'StringValue': event.get('priority', 'normal')
            }
        }
    )

Verify: Check SQS queue depth after publishing: aws sqs get-queue-attributes --queue-url $QUEUE_URL --attribute-names ApproximateNumberOfMessages → should increment by 1 per subscribed queue.

4. Build channel-specific processors

Each processor pulls from its queue, checks user preferences, renders the message, and calls the external provider. Implement exponential backoff for retries. [src3]

import firebase_admin
from firebase_admin import credentials, messaging

cred = credentials.Certificate('service-account.json')
firebase_admin.initialize_app(cred)

def process_push(event: dict):
    tokens = get_device_tokens(event['recipient_id'])
    if not tokens:
        return {'status': 'skipped', 'reason': 'no_tokens'}
    message = messaging.MulticastMessage(
        tokens=tokens[:500],
        notification=messaging.Notification(
            title=event['data'].get('title', 'Notification'),
            body=event['data'].get('body', ''),
        ),
        data={k: str(v) for k, v in event['data'].items()},
    )
    response = messaging.send_each_for_multicast(message)
    return {'success': response.success_count}

Verify: Check FCM response — response.success_count should be > 0 for valid tokens.

5. Add in-app notification delivery via WebSocket

For real-time in-app notifications, maintain persistent connections and use Redis Pub/Sub for cross-instance routing. [src4]

const { Server } = require('socket.io');
const { createClient } = require('redis');

const io = new Server(3001, { cors: { origin: '*' } });
const redisSub = createClient({ url: process.env.REDIS_URL });

const userSockets = new Map();

io.on('connection', (socket) => {
  const uid = socket.handshake.auth.userId;
  if (!userSockets.has(uid)) userSockets.set(uid, new Set());
  userSockets.get(uid).add(socket.id);
  socket.on('disconnect', () => {
    userSockets.get(uid)?.delete(socket.id);
  });
});

redisSub.subscribe('notifications:in_app', (message) => {
  const event = JSON.parse(message);
  const sockets = userSockets.get(event.recipient_id);
  if (sockets) sockets.forEach(id => io.to(id).emit('notification', event.data));
});

Verify: Connect a test WebSocket client and publish to Redis channel notifications:in_app → client should receive the event within 100ms.

6. Implement retry logic and dead letter queues

Configure exponential backoff with jitter. After max retries, move to DLQ for manual inspection. [src3]

import time, random

MAX_RETRIES = 5
BASE_DELAY = 1

def process_with_retry(event, processor_fn):
    for attempt in range(MAX_RETRIES):
        try:
            return processor_fn(event)
        except TransientError as e:
            if attempt == MAX_RETRIES - 1:
                send_to_dlq(event, str(e))
                raise
            delay = BASE_DELAY * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
        except PermanentError as e:
            send_to_dlq(event, str(e))
            raise

Verify: Check DLQ depth after intentionally sending a malformed payload.

7. Add monitoring and alerting

Track delivery success rates per channel, latency percentiles, and queue depths. Alert when delivery rate drops below threshold. [src3]

groups:
  - name: notification_alerts
    rules:
      - alert: NotificationDeliveryRateLow
        expr: >
          rate(notifications_delivered_total{status="success"}[5m])
          / rate(notifications_dispatched_total[5m]) < 0.95
        for: 5m
        labels:
          severity: warning

Verify: Trigger a test alert by temporarily lowering the threshold. Check Grafana dashboard shows the alert firing.

Code Examples

Python: Send Push Notification via FCM HTTP v1

# Input:  device_token (str), title (str), body (str), data (dict)
# Output: message_id (str) on success

import firebase_admin
from firebase_admin import credentials, messaging  # firebase-admin>=6.4.0

cred = credentials.Certificate('service-account.json')
firebase_admin.initialize_app(cred)

def send_push(token: str, title: str, body: str, data: dict = None) -> str:
    message = messaging.Message(
        token=token,
        notification=messaging.Notification(title=title, body=body),
        data={k: str(v) for k, v in (data or {}).items()},
        android=messaging.AndroidConfig(priority='high'),
        apns=messaging.APNSConfig(
            headers={'apns-priority': '10'},
            payload=messaging.APNSPayload(aps=messaging.Aps(sound='default'))
        )
    )
    return messaging.send(message)

Node.js: In-App Notification Service with Socket.io

// Input:  userId (string), notification payload (object)
// Output: Emits 'notification' event to connected sockets for the user

const { Server } = require('socket.io');    // [email protected]
const { createClient } = require('redis');  // [email protected]

const io = new Server(3001, { cors: { origin: '*' } });
const redis = createClient({ url: process.env.REDIS_URL });

const userSockets = new Map();

io.on('connection', (socket) => {
  const uid = socket.handshake.auth.userId;
  if (!userSockets.has(uid)) userSockets.set(uid, new Set());
  userSockets.get(uid).add(socket.id);
  socket.on('disconnect', () => {
    userSockets.get(uid)?.delete(socket.id);
    if (!userSockets.get(uid)?.size) userSockets.delete(uid);
  });
});

async function notifyUser(userId, payload) {
  const sockets = userSockets.get(userId);
  if (sockets) sockets.forEach(id => io.to(id).emit('notification', payload));
  await redis.publish('notif:in_app', JSON.stringify({ userId, payload }));
}

Python: Email Notification via SendGrid

# Input:  to_email (str), subject (str), html_content (str)
# Output: HTTP status code (202 on success)

from sendgrid import SendGridAPIClient          # sendgrid>=6.11.0
from sendgrid.helpers.mail import Mail, Email, Content

def send_email(to_email: str, subject: str, html_content: str) -> int:
    sg = SendGridAPIClient(api_key='SG.your_api_key')
    mail = Mail(
        from_email=Email('[email protected]', 'My App'),
        subject=subject,
        to_emails=to_email,
        html_content=Content('text/html', html_content)
    )
    mail.add_header('List-Unsubscribe', '<https://example.com/unsubscribe>')
    response = sg.send(mail)
    return response.status_code

Anti-Patterns

Wrong: Synchronous notification delivery in the request path

# BAD -- blocks the API response until all notifications are sent
@app.post('/api/orders/{order_id}/ship')
def ship_order(order_id: str):
    order = update_order_status(order_id, 'shipped')
    send_push_notification(order.user_id, 'Shipped!')    # blocks 200-500ms
    send_email(order.user_email, 'Shipped', render())     # blocks 1-3s
    send_sms(order.user_phone, 'Shipped!')                # blocks 500ms-2s
    return {'status': 'shipped'}  # user waits 2-5 seconds total

Correct: Asynchronous fan-out via message queue

# GOOD -- publishes event and returns immediately
@app.post('/api/orders/{order_id}/ship')
def ship_order(order_id: str):
    order = update_order_status(order_id, 'shipped')
    publish_notification_event({
        'event_type': 'order.shipped',
        'recipient_id': order.user_id,
        'channels': ['push', 'email', 'sms'],
        'idempotency_key': f'order_shipped_{order_id}'
    })
    return {'status': 'shipped'}  # returns in <50ms

Wrong: Single shared queue for all channels

# BAD -- email provider outage blocks push and SMS delivery
notifications_queue = Queue('all_notifications')

def process_all(event):
    if event['channel'] == 'email':
        send_email(event)       # if SendGrid is down, entire queue backs up
    elif event['channel'] == 'push':
        send_push(event)

Correct: Separate queues per channel with independent consumers

# GOOD -- channel isolation; email outage does not affect push/SMS
push_queue = Queue('notifications:push')
email_queue = Queue('notifications:email')
sms_queue = Queue('notifications:sms')

@push_queue.consumer
def process_push(event):
    send_push(event)

@email_queue.consumer
def process_email(event):
    send_email(event)

Wrong: No idempotency protection on notification delivery

# BAD -- worker crash after send but before ack = duplicate notification
def process_notification(event):
    send_push(event['token'], event['message'])
    queue.ack(event['receipt_handle'])  # crash here = duplicate

Correct: Idempotency check before sending

# GOOD -- check if notification was already sent using idempotency key
def process_notification(event):
    idem_key = event['idempotency_key']
    if redis.setnx(f'notif:sent:{idem_key}', '1'):
        redis.expire(f'notif:sent:{idem_key}', 86400)
        send_push(event['token'], event['message'])
    queue.ack(event['receipt_handle'])

Wrong: Ignoring device token lifecycle

# BAD -- stale tokens waste API calls and trigger throttling
def send_to_all_devices(user_id, message):
    tokens = db.query('SELECT token FROM devices WHERE user_id = ?', user_id)
    for token in tokens:
        fcm.send(token, message)  # never checks if token is valid

Correct: Prune invalid tokens on provider feedback

# GOOD -- remove tokens FCM/APNs report as invalid
def send_to_all_devices(user_id, message):
    tokens = db.query('SELECT token FROM devices WHERE user_id = ?', user_id)
    response = fcm.send_multicast(tokens, message)
    for i, result in enumerate(response.responses):
        if result.exception:
            if result.exception.code in ('UNREGISTERED', 'INVALID_ARGUMENT'):
                db.execute('DELETE FROM devices WHERE token = ?', tokens[i])

Common Pitfalls

Diagnostic Commands

# Check SQS queue depth (AWS)
aws sqs get-queue-attributes --queue-url $QUEUE_URL \
  --attribute-names ApproximateNumberOfMessages

# Check DLQ depth for failed notifications
aws sqs get-queue-attributes --queue-url $DLQ_URL \
  --attribute-names ApproximateNumberOfMessages

# Verify FCM service account is valid
firebase projects:list

# Test push notification delivery (FCM HTTP v1)
curl -X POST "https://fcm.googleapis.com/v1/projects/YOUR_PROJECT/messages:send" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"message":{"token":"DEVICE_TOKEN","notification":{"title":"Test","body":"Hello"}}}'

# Check Redis Pub/Sub channel subscribers (in-app)
redis-cli PUBSUB NUMSUB notifications:in_app

# Check Kafka consumer lag (notification processor)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group notification-push-processor

Version History & Compatibility

Version / ServiceStatusBreaking ChangesMigration Notes
FCM HTTP v1 APICurrent (2024+)Legacy HTTP/XMPP removed June 2024Migrate to fcm.googleapis.com/v1/projects/{project}/messages:send with OAuth2
APNs HTTP/2Current (2016+)Binary protocol removed 2021Use HTTP/2 port 443; prefer token-based auth
SendGrid v3 APICurrentv2 API deprecatedUse /v3/mail/send; API keys replace username/password
Socket.io 4.xCurrentv2→v4: namespace middleware changedUpgrade client and server together
AWS SNS/SQSStableFIFO queues added 2020Use FIFO for ordered delivery; standard queues for throughput

When to Use / When Not to Use

Use WhenDon't Use WhenUse Instead
Multi-channel delivery (push + email + SMS + in-app) is requiredOnly need simple webhook callbacks to a single endpointWebhook delivery system with retry
Scale exceeds 10K notifications/day and growingFewer than 100 notifications/day with a single channelDirect API calls to FCM/SendGrid from your app
Need user preference management and quiet hoursSending only system-to-system alerts (no human recipients)Event bus (Kafka/RabbitMQ) without notification layer
Regulatory compliance (CAN-SPAM, GDPR) mattersInternal dev notifications only (Slack/Discord bots)Slack Incoming Webhooks or Discord Bot API
Need delivery tracking, analytics, and audit trailsFire-and-forget logging with no delivery guarantees neededSimple log aggregation (ELK/Loki)

Important Caveats

Related Units