SNS -> SQS -> Lambda/Worker fan-out pattern (or Kafka topics with consumer groups for higher throughput).

| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| API Gateway | Rate-limit, authenticate, route notification requests | Kong, AWS API Gateway, Nginx | Horizontal + auto-scale on request rate |
| Notification Service | Validate payload, enrich with user prefs, fan-out to queues | Node.js, Go, Java Spring Boot | Stateless; scale by CPU/request count |
| User Preference Service | Store per-user channel opt-ins, quiet hours, frequency caps | PostgreSQL + Redis cache | Read-heavy; cache in Redis with 5-min TTL |
| Template Engine | Render channel-specific content from templates | Handlebars, Jinja2, MJML (email) | Stateless; co-locate with notification service |
| Message Queue | Decouple submission from delivery, buffer spikes | Kafka, RabbitMQ, AWS SQS+SNS | Partition by channel; scale consumers independently |
| Priority Router | Classify notifications (critical/high/normal/low) and route to priority queues | Custom logic + separate queue topics | Separate queues per priority tier |
| Push Processor | Deliver to mobile/web via FCM and APNs | FCM HTTP v1 API, APNs HTTP/2 | Batch sends (FCM: 500 tokens/multicast); scale by queue depth |
| Email Processor | Render and send email via ESP | SendGrid, Amazon SES, Mailgun | Warm up IPs; separate transactional vs marketing subdomains |
| SMS Processor | Send SMS via telecom API | Twilio, Nexmo/Vonage, AWS SNS | Rate-limit per country; scale by queue depth |
| In-App Processor | Deliver real-time to connected clients | WebSocket (Socket.io), SSE, long polling | Sticky sessions or Redis Pub/Sub for horizontal scale |
| Notification Store | Persist notification history and read/unread status | PostgreSQL (recent), S3/cold storage (archive) | Time-partition; archive after 90 days |
| Retry / DLQ | Handle failed deliveries with exponential backoff | Built-in queue retry + Dead Letter Queue | Alert on DLQ depth; manual inspection workflow |
| Analytics & Monitoring | Track delivery rates, open rates, latency | Prometheus + Grafana, ELK Stack, Datadog | Aggregate metrics; alert on delivery rate drops |
| Device Registry | Store and validate device tokens per user | PostgreSQL or DynamoDB | Prune invalid tokens on provider feedback |
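The Priority Router in the table above can start as a small lookup that maps an event's priority to a queue name. A minimal sketch; the `PRIORITY_QUEUES` mapping, queue names, and the `route` function are illustrative assumptions, not a fixed API:

```python
# Map a notification event's priority to a dedicated queue (names are examples).
PRIORITY_QUEUES = {
    'critical': 'notifications-critical',
    'high': 'notifications-high',
    'normal': 'notifications-normal',
    'low': 'notifications-low',
}

def route(event: dict) -> str:
    """Return the queue name for an event; unknown priorities fall back to 'normal'."""
    priority = event.get('priority', 'normal')
    if priority not in PRIORITY_QUEUES:
        priority = 'normal'
    return PRIORITY_QUEUES[priority]
```

Keeping the fallback explicit means a malformed or missing `priority` field never stalls routing.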
START: What notification channels do you need?
|
+-- Push notifications only?
| +-- YES --> Use FCM (Android/web) + APNs (iOS) directly
| | with a simple queue (SQS/Redis) in front
| +-- NO |
| v
+-- Need email + push?
| +-- YES --> Add message queue fan-out (SNS->SQS pattern or Kafka topics)
| | with separate push and email processors
| +-- NO |
| v
+-- Need in-app real-time notifications?
| +-- YES --> Add WebSocket/SSE gateway with Redis Pub/Sub
| | for cross-instance message delivery
| +-- NO |
| v
+-- Need SMS?
| +-- YES --> Add SMS processor + Twilio/Vonage integration
| with per-country rate limiting
| +-- NO --> Reassess requirements
|
SCALE CHECK:
+-- < 1K notifications/day?
| +-- YES --> Single-server with in-process queue (Bull/Celery) is sufficient
| +-- NO |
| v
+-- 1K-100K notifications/day?
| +-- YES --> Managed queue service (SQS+SNS or Cloud Tasks) + 2-4 workers
| +-- NO |
| v
+-- 100K-10M notifications/day?
| +-- YES --> Kafka with consumer groups, partitioned by channel + user-id sharding
| +-- NO |
| v
+-- > 10M notifications/day?
+-- YES --> Multi-region Kafka clusters, dedicated worker pools,
Redis cluster for preference caching, sharded notification store
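At the smallest tier in the scale check, "single-server with in-process queue" can be as simple as a worker thread draining a standard-library queue, standing in for Bull (Node) or Celery (Python). A sketch under that assumption; the `delivered` list stands in for real channel processors:

```python
# Single-server tier (< 1K/day): an in-process queue plus one worker thread.
import queue
import threading

notification_queue: "queue.Queue[dict]" = queue.Queue()
delivered = []  # stand-in for calling send_push / send_email

def worker():
    while True:
        event = notification_queue.get()
        if event is None:  # sentinel tells the worker to stop
            break
        delivered.append(event)  # real code would dispatch to a channel processor
        notification_queue.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()
notification_queue.put({'event_type': 'order.shipped', 'recipient_id': 'user_789'})
notification_queue.put(None)
t.join()
```

This buys the async decoupling of the larger tiers without any infrastructure, at the cost of losing queued events on process restart.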
Establish a canonical event format that all services publish. Include event type, recipient, channel hints, priority, and idempotency key. [src3]
{
  "event_id": "evt_abc123",
  "event_type": "order.shipped",
  "recipient_id": "user_789",
  "channels": ["push", "email", "in_app"],
  "priority": "high",
  "data": {
    "order_id": "ORD-456",
    "tracking_url": "https://example.com/track/ORD-456"
  },
  "idempotency_key": "order_shipped_ORD-456",
  "created_at": "2026-02-23T10:00:00Z"
}
Verify: Validate schema with JSON Schema or Zod. All downstream processors must accept this format without transformation.
Store per-user, per-channel, per-notification-type preferences. Cache aggressively since this is read on every notification. [src4]
CREATE TABLE user_notification_preferences (
    user_id    UUID NOT NULL,
    channel    TEXT NOT NULL CHECK (channel IN ('push','email','sms','in_app')),
    notif_type TEXT NOT NULL,
    enabled    BOOLEAN DEFAULT true,
    quiet_start TIME,
    quiet_end   TIME,
    timezone    TEXT DEFAULT 'UTC',
    updated_at  TIMESTAMPTZ DEFAULT now(),
    PRIMARY KEY (user_id, channel, notif_type)
);
Verify: SELECT * FROM user_notification_preferences WHERE user_id = 'test-user'; → should return one row per channel/type combination.
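The quiet-hours columns above imply a check that runs on every send, including windows that cross midnight (e.g. 22:00-07:00). A minimal sketch; `in_quiet_hours` and `should_send` are illustrative helper names, and timezone conversion (the `timezone` column) is assumed to happen before the comparison:

```python
# Evaluate a preference row (as a dict) against the current local time.
from datetime import time

def in_quiet_hours(now: time, quiet_start: time, quiet_end: time) -> bool:
    """True if `now` falls inside the quiet window, including windows
    that wrap past midnight (e.g. 22:00-07:00)."""
    if quiet_start <= quiet_end:
        return quiet_start <= now < quiet_end
    return now >= quiet_start or now < quiet_end

def should_send(pref: dict, now: time) -> bool:
    if not pref.get('enabled', True):
        return False
    qs, qe = pref.get('quiet_start'), pref.get('quiet_end')
    if qs and qe and in_quiet_hours(now, qs, qe):
        return False
    return True
```

Since this runs on every notification, the preference row itself should come from the Redis cache (5-min TTL) rather than Postgres.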
Publish the notification event to a central topic. Subscribers (one per channel) receive a copy and enqueue it for processing. [src5]
import boto3
import json

sns = boto3.client('sns', region_name='us-east-1')
TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789:notifications'

def publish_notification(event: dict):
    """Publish notification event to SNS for fan-out."""
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps(event),
        MessageAttributes={
            'priority': {
                'DataType': 'String',
                'StringValue': event.get('priority', 'normal')
            }
        }
    )
Verify: Check SQS queue depth after publishing: aws sqs get-queue-attributes --queue-url $QUEUE_URL --attribute-names ApproximateNumberOfMessages → should increment by 1 per subscribed queue.
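One gotcha on the consumer side: unless raw message delivery is enabled on the subscription, each SQS message body is an SNS envelope, with the original event serialized as a JSON string in its `Message` field. A sketch of unwrapping it (`unwrap_sns_envelope` is an illustrative helper name):

```python
# SNS -> SQS fan-out wraps the published event in an SNS envelope;
# the original payload is a JSON string under the 'Message' key.
import json

def unwrap_sns_envelope(sqs_body: str) -> dict:
    envelope = json.loads(sqs_body)
    return json.loads(envelope['Message'])
```

Enabling raw message delivery on the SNS subscription skips the envelope entirely and delivers the event JSON as-is.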
Each processor pulls from its queue, checks user preferences, renders the message, and calls the external provider. Implement exponential backoff for retries. [src3]
import firebase_admin
from firebase_admin import credentials, messaging

cred = credentials.Certificate('service-account.json')
firebase_admin.initialize_app(cred)

def process_push(event: dict):
    tokens = get_device_tokens(event['recipient_id'])
    if not tokens:
        return {'status': 'skipped', 'reason': 'no_tokens'}
    message = messaging.MulticastMessage(
        tokens=tokens[:500],  # FCM caps multicast sends at 500 tokens per call
        notification=messaging.Notification(
            title=event['data'].get('title', 'Notification'),
            body=event['data'].get('body', ''),
        ),
        data={k: str(v) for k, v in event['data'].items()},
    )
    response = messaging.send_each_for_multicast(message)
    return {'success': response.success_count}
Verify: Check FCM response — response.success_count should be > 0 for valid tokens.
For real-time in-app notifications, maintain persistent connections and use Redis Pub/Sub for cross-instance routing. [src4]
const { Server } = require('socket.io');
const { createClient } = require('redis');

const io = new Server(3001, { cors: { origin: '*' } });
const redisSub = createClient({ url: process.env.REDIS_URL });
const userSockets = new Map();

io.on('connection', (socket) => {
  const uid = socket.handshake.auth.userId;
  if (!userSockets.has(uid)) userSockets.set(uid, new Set());
  userSockets.get(uid).add(socket.id);
  socket.on('disconnect', () => {
    userSockets.get(uid)?.delete(socket.id);
  });
});

// node-redis v4 clients must connect before subscribing
redisSub.connect().then(() => {
  redisSub.subscribe('notifications:in_app', (message) => {
    const event = JSON.parse(message);
    const sockets = userSockets.get(event.recipient_id);
    if (sockets) sockets.forEach(id => io.to(id).emit('notification', event.data));
  });
});
Verify: Connect a test WebSocket client and publish to Redis channel notifications:in_app → client should receive the event within 100ms.
Configure exponential backoff with jitter. After max retries, move to DLQ for manual inspection. [src3]
import time, random

MAX_RETRIES = 5
BASE_DELAY = 1  # seconds

def process_with_retry(event, processor_fn):
    # TransientError / PermanentError are application-defined exception classes
    for attempt in range(MAX_RETRIES):
        try:
            return processor_fn(event)
        except TransientError as e:
            if attempt == MAX_RETRIES - 1:
                send_to_dlq(event, str(e))
                raise
            # exponential backoff with jitter: 1s, 2s, 4s, 8s (+0-1s each)
            delay = BASE_DELAY * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
        except PermanentError as e:
            send_to_dlq(event, str(e))
            raise
Verify: Check DLQ depth after intentionally sending a malformed payload.
Track delivery success rates per channel, latency percentiles, and queue depths. Alert when delivery rate drops below threshold. [src3]
groups:
  - name: notification_alerts
    rules:
      - alert: NotificationDeliveryRateLow
        expr: >
          rate(notifications_delivered_total{status="success"}[5m])
            / rate(notifications_dispatched_total[5m]) < 0.95
        for: 5m
        labels:
          severity: warning
Verify: Trigger a test alert by temporarily lowering the threshold. Check Grafana dashboard shows the alert firing.
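The alert above divides delivered-success by dispatched over a window. A stdlib sketch of the same computation, to make the ratio concrete; the `record` and `delivery_rate` names are illustrative, and in production the counters would instead be exported as the Prometheus series `notifications_dispatched_total` / `notifications_delivered_total` via a client library such as prometheus_client:

```python
# Per-channel delivery-rate bookkeeping, mirroring the PromQL expression.
from collections import Counter

counters = Counter()

def record(channel: str, ok: bool):
    counters[('dispatched', channel)] += 1
    counters[('delivered', channel)] += 1 if ok else 0

def delivery_rate(channel: str) -> float:
    """Fraction of dispatched notifications that were delivered successfully."""
    dispatched = counters[('dispatched', channel)]
    return counters[('delivered', channel)] / dispatched if dispatched else 1.0
```

Guarding the zero-dispatch case matters in PromQL too: a `rate()` denominator of zero makes the expression NaN rather than firing the alert.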
# Input: device_token (str), title (str), body (str), data (dict)
# Output: message_id (str) on success
import firebase_admin
from firebase_admin import credentials, messaging  # firebase-admin>=6.4.0

cred = credentials.Certificate('service-account.json')
firebase_admin.initialize_app(cred)

def send_push(token: str, title: str, body: str, data: dict = None) -> str:
    message = messaging.Message(
        token=token,
        notification=messaging.Notification(title=title, body=body),
        data={k: str(v) for k, v in (data or {}).items()},
        android=messaging.AndroidConfig(priority='high'),
        apns=messaging.APNSConfig(
            headers={'apns-priority': '10'},
            payload=messaging.APNSPayload(aps=messaging.Aps(sound='default'))
        )
    )
    return messaging.send(message)
// Input: userId (string), notification payload (object)
// Output: Emits 'notification' event to connected sockets for the user
const { Server } = require('socket.io'); // [email protected]
const { createClient } = require('redis'); // [email protected]

const io = new Server(3001, { cors: { origin: '*' } });
const redis = createClient({ url: process.env.REDIS_URL });
redis.connect(); // node-redis v4 clients must connect before use
const userSockets = new Map();

io.on('connection', (socket) => {
  const uid = socket.handshake.auth.userId;
  if (!userSockets.has(uid)) userSockets.set(uid, new Set());
  userSockets.get(uid).add(socket.id);
  socket.on('disconnect', () => {
    userSockets.get(uid)?.delete(socket.id);
    if (!userSockets.get(uid)?.size) userSockets.delete(uid);
  });
});

async function notifyUser(userId, payload) {
  // deliver to locally connected sockets, then publish so other instances deliver too
  const sockets = userSockets.get(userId);
  if (sockets) sockets.forEach(id => io.to(id).emit('notification', payload));
  await redis.publish('notif:in_app', JSON.stringify({ userId, payload }));
}
# Input: to_email (str), subject (str), html_content (str)
# Output: HTTP status code (202 on success)
import os
from sendgrid import SendGridAPIClient  # sendgrid>=6.11.0
from sendgrid.helpers.mail import Mail, Email, Content, Header

def send_email(to_email: str, subject: str, html_content: str) -> int:
    sg = SendGridAPIClient(api_key=os.environ['SENDGRID_API_KEY'])
    mail = Mail(
        from_email=Email('[email protected]', 'My App'),
        to_emails=to_email,
        subject=subject,
        html_content=Content('text/html', html_content)
    )
    # List-Unsubscribe header aids deliverability and CAN-SPAM compliance
    mail.add_header(Header('List-Unsubscribe', '<https://example.com/unsubscribe>'))
    response = sg.send(mail)
    return response.status_code
# BAD -- blocks the API response until all notifications are sent
@app.post('/api/orders/{order_id}/ship')
def ship_order(order_id: str):
    order = update_order_status(order_id, 'shipped')
    send_push_notification(order.user_id, 'Shipped!')  # blocks 200-500ms
    send_email(order.user_email, 'Shipped', render())  # blocks 1-3s
    send_sms(order.user_phone, 'Shipped!')             # blocks 500ms-2s
    return {'status': 'shipped'}  # user waits 2-5 seconds total
# GOOD -- publishes event and returns immediately
@app.post('/api/orders/{order_id}/ship')
def ship_order(order_id: str):
    order = update_order_status(order_id, 'shipped')
    publish_notification_event({
        'event_type': 'order.shipped',
        'recipient_id': order.user_id,
        'channels': ['push', 'email', 'sms'],
        'idempotency_key': f'order_shipped_{order_id}'
    })
    return {'status': 'shipped'}  # returns in <50ms
# BAD -- email provider outage blocks push and SMS delivery
notifications_queue = Queue('all_notifications')

def process_all(event):
    if event['channel'] == 'email':
        send_email(event)  # if SendGrid is down, entire queue backs up
    elif event['channel'] == 'push':
        send_push(event)
# GOOD -- channel isolation; email outage does not affect push/SMS
push_queue = Queue('notifications:push')
email_queue = Queue('notifications:email')
sms_queue = Queue('notifications:sms')

@push_queue.consumer
def process_push(event):
    send_push(event)

@email_queue.consumer
def process_email(event):
    send_email(event)
# BAD -- worker crash after send but before ack = duplicate notification
def process_notification(event):
    send_push(event['token'], event['message'])
    queue.ack(event['receipt_handle'])  # crash here = duplicate
# GOOD -- check if notification was already sent using idempotency key
def process_notification(event):
    idem_key = event['idempotency_key']
    # SET with NX+EX is atomic; separate SETNX + EXPIRE calls can leave a
    # key without a TTL if the worker dies between them
    if redis.set(f'notif:sent:{idem_key}', '1', nx=True, ex=86400):
        send_push(event['token'], event['message'])
    queue.ack(event['receipt_handle'])  # ack duplicates too, or they redeliver forever
# BAD -- stale tokens waste API calls and trigger throttling
def send_to_all_devices(user_id, message):
    tokens = db.query('SELECT token FROM devices WHERE user_id = ?', user_id)
    for token in tokens:
        fcm.send(token, message)  # never checks if token is valid

# GOOD -- remove tokens FCM/APNs report as invalid
def send_to_all_devices(user_id, message):
    tokens = db.query('SELECT token FROM devices WHERE user_id = ?', user_id)
    response = fcm.send_multicast(tokens, message)
    for i, result in enumerate(response.responses):
        if result.exception:
            if result.exception.code in ('UNREGISTERED', 'INVALID_ARGUMENT'):
                db.execute('DELETE FROM devices WHERE token = ?', tokens[i])
Operational guidelines:
- Rate-limit alerts per recipient (e.g. max 1 alert per user per event type per 5 minutes). [src3]
- Prune device tokens on UNREGISTERED/410 errors; run weekly token validation. [src1]
- Set the queue's maxReceiveCount to move poison messages to the DLQ after 3-5 retries. [src3]
- Partition the notification store by created_at; archive older than 90 days to cold storage. [src3]

# Check SQS queue depth (AWS)
aws sqs get-queue-attributes --queue-url $QUEUE_URL \
--attribute-names ApproximateNumberOfMessages
# Check DLQ depth for failed notifications
aws sqs get-queue-attributes --queue-url $DLQ_URL \
--attribute-names ApproximateNumberOfMessages
# Verify FCM service account is valid
firebase projects:list
# Test push notification delivery (FCM HTTP v1)
curl -X POST "https://fcm.googleapis.com/v1/projects/YOUR_PROJECT/messages:send" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-d '{"message":{"token":"DEVICE_TOKEN","notification":{"title":"Test","body":"Hello"}}}'
# Check Redis Pub/Sub channel subscribers (in-app)
redis-cli PUBSUB NUMSUB notifications:in_app
# Check Kafka consumer lag (notification processor)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group notification-push-processor
| Version / Service | Status | Breaking Changes | Migration Notes |
|---|---|---|---|
| FCM HTTP v1 API | Current (2024+) | Legacy HTTP/XMPP removed June 2024 | Migrate to fcm.googleapis.com/v1/projects/{project}/messages:send with OAuth2 |
| APNs HTTP/2 | Current (2016+) | Binary protocol removed 2021 | Use HTTP/2 port 443; prefer token-based auth |
| SendGrid v3 API | Current | v2 API deprecated | Use /v3/mail/send; API keys replace username/password |
| Socket.io 4.x | Current | v2→v4: namespace middleware changed | Upgrade client and server together |
| AWS SNS/SQS | Stable | FIFO queues added 2020 | Use FIFO for ordered delivery; standard queues for throughput |
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Multi-channel delivery (push + email + SMS + in-app) is required | Only need simple webhook callbacks to a single endpoint | Webhook delivery system with retry |
| Scale exceeds 10K notifications/day and growing | Fewer than 100 notifications/day with a single channel | Direct API calls to FCM/SendGrid from your app |
| Need user preference management and quiet hours | Sending only system-to-system alerts (no human recipients) | Event bus (Kafka/RabbitMQ) without notification layer |
| Regulatory compliance (CAN-SPAM, GDPR) matters | Internal dev notifications only (Slack/Discord bots) | Slack Incoming Webhooks or Discord Bot API |
| Need delivery tracking, analytics, and audit trails | Fire-and-forget logging with no delivery guarantees needed | Simple log aggregation (ELK/Loki) |