SNS -> SQS -> Lambda/Worker fan-out pattern (or Kafka topics with consumer groups for higher throughput).

| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| API Gateway | Rate-limit, authenticate, route notification requests | Kong, AWS API Gateway, Nginx | Horizontal + auto-scale on request rate |
| Notification Service | Validate payload, enrich with user prefs, fan-out to queues | Node.js, Go, Java Spring Boot | Stateless; scale by CPU/request count |
| User Preference Service | Store per-user channel opt-ins, quiet hours, frequency caps | PostgreSQL + Redis cache | Read-heavy; cache in Redis with 5-min TTL |
| Template Engine | Render channel-specific content from templates | Handlebars, Jinja2, MJML (email) | Stateless; co-locate with notification service |
| Message Queue | Decouple submission from delivery, buffer spikes | Kafka, RabbitMQ, AWS SQS+SNS | Partition by channel; scale consumers independently |
| Priority Router | Classify notifications (critical/high/normal/low) and route to priority queues | Custom logic + separate queue topics | Separate queues per priority tier |
| Push Processor | Deliver to mobile/web via FCM and APNs | FCM HTTP v1 API, APNs HTTP/2 | Batch sends (FCM: 500 tokens/multicast); scale by queue depth |
| Email Processor | Render and send email via ESP | SendGrid, Amazon SES, Mailgun | Warm up IPs; separate transactional vs marketing subdomains |
| SMS Processor | Send SMS via telecom API | Twilio, Nexmo/Vonage, AWS SNS | Rate-limit per country; scale by queue depth |
| In-App Processor | Deliver real-time to connected clients | WebSocket (Socket.io), SSE, long polling | Sticky sessions or Redis Pub/Sub for horizontal scale |
| Notification Store | Persist notification history and read/unread status | PostgreSQL (recent), S3/cold storage (archive) | Time-partition; archive after 90 days |
| Retry / DLQ | Handle failed deliveries with exponential backoff | Built-in queue retry + Dead Letter Queue | Alert on DLQ depth; manual inspection workflow |
| Analytics & Monitoring | Track delivery rates, open rates, latency | Prometheus + Grafana, ELK Stack, Datadog | Aggregate metrics; alert on delivery rate drops |
| Device Registry | Store and validate device tokens per user | PostgreSQL or DynamoDB | Prune invalid tokens on provider feedback |
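The Priority Router in the table above can start as a small lookup that maps an event's priority to a queue name. A minimal sketch; the `PRIORITY_QUEUES` mapping, queue names, and the `route` function are illustrative assumptions, not a fixed API:

```python
# Map a notification event's priority to a dedicated queue (names are examples).
PRIORITY_QUEUES = {
    'critical': 'notifications-critical',
    'high': 'notifications-high',
    'normal': 'notifications-normal',
    'low': 'notifications-low',
}

def route(event: dict) -> str:
    """Return the queue name for an event; unknown priorities fall back to 'normal'."""
    priority = event.get('priority', 'normal')
    if priority not in PRIORITY_QUEUES:
        priority = 'normal'
    return PRIORITY_QUEUES[priority]
```

Keeping the fallback explicit means a malformed or missing `priority` field never stalls routing.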
START: What notification channels do you need?
|
+-- Push notifications only?
| +-- YES --> Use FCM (Android/web) + APNs (iOS) directly
| | with a simple queue (SQS/Redis) in front
| +-- NO |
| v
+-- Need email + push?
| +-- YES --> Add message queue fan-out (SNS->SQS pattern or Kafka topics)
| | with separate push and email processors
| +-- NO |
| v
+-- Need in-app real-time notifications?
| +-- YES --> Add WebSocket/SSE gateway with Redis Pub/Sub
| | for cross-instance message delivery
| +-- NO |
| v
+-- Need SMS?
| +-- YES --> Add SMS processor + Twilio/Vonage integration
| with per-country rate limiting
| +-- NO --> Reassess requirements
|
SCALE CHECK:
+-- < 1K notifications/day?
| +-- YES --> Single-server with in-process queue (Bull/Celery) is sufficient
| +-- NO |
| v
+-- 1K-100K notifications/day?
| +-- YES --> Managed queue service (SQS+SNS or Cloud Tasks) + 2-4 workers
| +-- NO |
| v
+-- 100K-10M notifications/day?
| +-- YES --> Kafka with consumer groups, partitioned by channel + user-id sharding
| +-- NO |
| v
+-- > 10M notifications/day?
+-- YES --> Multi-region Kafka clusters, dedicated worker pools,
Redis cluster for preference caching, sharded notification store
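At the smallest tier in the scale check, "single-server with in-process queue" can be as simple as a worker thread draining a standard-library queue, standing in for Bull (Node) or Celery (Python). A sketch under that assumption; the `delivered` list stands in for real channel processors:

```python
# Single-server tier (< 1K/day): an in-process queue plus one worker thread.
import queue
import threading

notification_queue: "queue.Queue[dict]" = queue.Queue()
delivered = []  # stand-in for calling send_push / send_email

def worker():
    while True:
        event = notification_queue.get()
        if event is None:  # sentinel tells the worker to stop
            break
        delivered.append(event)  # real code would dispatch to a channel processor
        notification_queue.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()
notification_queue.put({'event_type': 'order.shipped', 'recipient_id': 'user_789'})
notification_queue.put(None)
t.join()
```

This buys the async decoupling of the larger tiers without any infrastructure, at the cost of losing queued events on process restart.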
Establish a canonical event format that all services publish. Include event type, recipient, channel hints, priority, and idempotency key. [src3]
{
  "event_id": "evt_abc123",
  "event_type": "order.shipped",
  "recipient_id": "user_789",
  "channels": ["push", "email", "in_app"],
  "priority": "high",
  "data": {
    "order_id": "ORD-456",
    "tracking_url": "https://example.com/track/ORD-456"
  },
  "idempotency_key": "order_shipped_ORD-456",
  "created_at": "2026-02-23T10:00:00Z"
}
Verify: Validate schema with JSON Schema or Zod. All downstream processors must accept this format without transformation.
Store per-user, per-channel, per-notification-type preferences. Cache aggressively since this is read on every notification. [src4]
CREATE TABLE user_notification_preferences (
    user_id    UUID NOT NULL,
    channel    TEXT NOT NULL CHECK (channel IN ('push','email','sms','in_app')),
    notif_type TEXT NOT NULL,
    enabled    BOOLEAN DEFAULT true,
    quiet_start TIME,
    quiet_end   TIME,
    timezone    TEXT DEFAULT 'UTC',
    updated_at  TIMESTAMPTZ DEFAULT now(),
    PRIMARY KEY (user_id, channel, notif_type)
);
Verify: SELECT * FROM user_notification_preferences WHERE user_id = 'test-user'; → should return one row per channel/type combination.
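The quiet-hours columns above imply a check that runs on every send, including windows that cross midnight (e.g. 22:00-07:00). A minimal sketch; `in_quiet_hours` and `should_send` are illustrative helper names, and timezone conversion (the `timezone` column) is assumed to happen before the comparison:

```python
# Evaluate a preference row (as a dict) against the current local time.
from datetime import time

def in_quiet_hours(now: time, quiet_start: time, quiet_end: time) -> bool:
    """True if `now` falls inside the quiet window, including windows
    that wrap past midnight (e.g. 22:00-07:00)."""
    if quiet_start <= quiet_end:
        return quiet_start <= now < quiet_end
    return now >= quiet_start or now < quiet_end

def should_send(pref: dict, now: time) -> bool:
    if not pref.get('enabled', True):
        return False
    qs, qe = pref.get('quiet_start'), pref.get('quiet_end')
    if qs and qe and in_quiet_hours(now, qs, qe):
        return False
    return True
```

Since this runs on every notification, the preference row itself should come from the Redis cache (5-min TTL) rather than Postgres.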
Publish the notification event to a central topic. Subscribers (one per channel) receive a copy and enqueue it for processing. [src5]
import boto3
import json

sns = boto3.client('sns', region_name='us-east-1')
TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789:notifications'

def publish_notification(event: dict):
    """Publish notification event to SNS for fan-out."""
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps(event),
        MessageAttributes={
            'priority': {
                'DataType': 'String',
                'StringValue': event.get('priority', 'normal')
            }
        }
    )
Verify: Check SQS queue depth after publishing: aws sqs get-queue-attributes --queue-url $QUEUE_URL --attribute-names ApproximateNumberOfMessages → should increment by 1 per subscribed queue.
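One gotcha on the consumer side: unless raw message delivery is enabled on the subscription, each SQS message body is an SNS envelope, with the original event serialized as a JSON string in its `Message` field. A sketch of unwrapping it (`unwrap_sns_envelope` is an illustrative helper name):

```python
# SNS -> SQS fan-out wraps the published event in an SNS envelope;
# the original payload is a JSON string under the 'Message' key.
import json

def unwrap_sns_envelope(sqs_body: str) -> dict:
    envelope = json.loads(sqs_body)
    return json.loads(envelope['Message'])
```

Enabling raw message delivery on the SNS subscription skips the envelope entirely and delivers the event JSON as-is.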
Each processor pulls from its queue, checks user preferences, renders the message, and calls the external provider. Implement exponential backoff for retries. [src3]
import firebase_admin
from firebase_admin import credentials, messaging

cred = credentials.Certificate('service-account.json')
firebase_admin.initialize_app(cred)

def process_push(event: dict):
    tokens = get_device_tokens(event['recipient_id'])
    if not tokens:
        return {'status': 'skipped', 'reason': 'no_tokens'}
    message = messaging.MulticastMessage(
        tokens=tokens[:500],  # FCM caps multicast sends at 500 tokens per call
        notification=messaging.Notification(
            title=event['data'].get('title', 'Notification'),
            body=event['data'].get('body', ''),
        ),
        data={k: str(v) for k, v in event['data'].items()},
    )
    response = messaging.send_each_for_multicast(message)
    return {'success': response.success_count}
Verify: Check FCM response — response.success_count should be > 0 for valid tokens.
For real-time in-app notifications, maintain persistent connections and use Redis Pub/Sub for cross-instance routing. [src4]
const { Server } = require('socket.io');
const { createClient } = require('redis');

const io = new Server(3001, { cors: { origin: '*' } });
const redisSub = createClient({ url: process.env.REDIS_URL });
const userSockets = new Map();

io.on('connection', (socket) => {
  const uid = socket.handshake.auth.userId;
  if (!userSockets.has(uid)) userSockets.set(uid, new Set());
  userSockets.get(uid).add(socket.id);
  socket.on('disconnect', () => {
    userSockets.get(uid)?.delete(socket.id);
  });
});

// node-redis v4 clients must connect before subscribing
redisSub.connect().then(() => {
  redisSub.subscribe('notifications:in_app', (message) => {
    const event = JSON.parse(message);
    const sockets = userSockets.get(event.recipient_id);
    if (sockets) sockets.forEach(id => io.to(id).emit('notification', event.data));
  });
});
Verify: Connect a test WebSocket client and publish to Redis channel notifications:in_app → client should receive the event within 100ms.
Configure exponential backoff with jitter. After max retries, move to DLQ for manual inspection. [src3]
import time, random

MAX_RETRIES = 5
BASE_DELAY = 1  # seconds

def process_with_retry(event, processor_fn):
    # TransientError / PermanentError are application-defined exception classes
    for attempt in range(MAX_RETRIES):
        try:
            return processor_fn(event)
        except TransientError as e:
            if attempt == MAX_RETRIES - 1:
                send_to_dlq(event, str(e))
                raise
            # exponential backoff with jitter: 1s, 2s, 4s, 8s (+0-1s each)
            delay = BASE_DELAY * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
        except PermanentError as e:
            send_to_dlq(event, str(e))
            raise
Verify: Check DLQ depth after intentionally sending a malformed payload.
Track delivery success rates per channel, latency percentiles, and queue depths. Alert when delivery rate drops below threshold. [src3]
groups:
  - name: notification_alerts
    rules:
      - alert: NotificationDeliveryRateLow
        expr: >
          rate(notifications_delivered_total{status="success"}[5m])
            / rate(notifications_dispatched_total[5m]) < 0.95
        for: 5m
        labels:
          severity: warning
Verify: Trigger a test alert by temporarily lowering the threshold. Check Grafana dashboard shows the alert firing.
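The alert above divides delivered-success by dispatched over a window. A stdlib sketch of the same computation, to make the ratio concrete; the `record` and `delivery_rate` names are illustrative, and in production the counters would instead be exported as the Prometheus series `notifications_dispatched_total` / `notifications_delivered_total` via a client library such as prometheus_client:

```python
# Per-channel delivery-rate bookkeeping, mirroring the PromQL expression.
from collections import Counter

counters = Counter()

def record(channel: str, ok: bool):
    counters[('dispatched', channel)] += 1
    counters[('delivered', channel)] += 1 if ok else 0

def delivery_rate(channel: str) -> float:
    """Fraction of dispatched notifications that were delivered successfully."""
    dispatched = counters[('dispatched', channel)]
    return counters[('delivered', channel)] / dispatched if dispatched else 1.0
```

Guarding the zero-dispatch case matters in PromQL too: a `rate()` denominator of zero makes the expression NaN rather than firing the alert.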
# Input: device_token (str), title (str), body (str), data (dict)
# Output: message_id (str) on success
import firebase_admin
from firebase_admin import credentials, messaging  # firebase-admin>=6.4.0

cred = credentials.Certificate('service-account.json')
firebase_admin.initialize_app(cred)

def send_push(token: str, title: str, body: str, data: dict = None) -> str:
    message = messaging.Message(
        token=token,
        notification=messaging.Notification(title=title, body=body),
        data={k: str(v) for k, v in (data or {}).items()},
        android=messaging.AndroidConfig(priority='high'),
        apns=messaging.APNSConfig(
            headers={'apns-priority': '10'},
            payload=messaging.APNSPayload(aps=messaging.Aps(sound='default'))
        )
    )
    return messaging.send(message)
// Input: userId (string), notification payload (object)
// Output: Emits 'notification' event to connected sockets for the user
const { Server } = require('socket.io'); // [email protected]
const { createClient } = require('redis'); // [email protected]

const io = new Server(3001, { cors: { origin: '*' } });
const redis = createClient({ url: process.env.REDIS_URL });
redis.connect(); // node-redis v4 clients must connect before use
const userSockets = new Map();

io.on('connection', (socket) => {
  const uid = socket.handshake.auth.userId;
  if (!userSockets.has(uid)) userSockets.set(uid, new Set());
  userSockets.get(uid).add(socket.id);
  socket.on('disconnect', () => {
    userSockets.get(uid)?.delete(socket.id);
    if (!userSockets.get(uid)?.size) userSockets.delete(uid);
  });
});

async function notifyUser(userId, payload) {
  // deliver to locally connected sockets, then publish so other instances deliver too
  const sockets = userSockets.get(userId);
  if (sockets) sockets.forEach(id => io.to(id).emit('notification', payload));
  await redis.publish('notif:in_app', JSON.stringify({ userId, payload }));
}
# Input: to_email (str), subject (str), html_content (str)
# Output: HTTP status code (202 on success)
import os
from sendgrid import SendGridAPIClient  # sendgrid>=6.11.0
from sendgrid.helpers.mail import Mail, Email, Content, Header

def send_email(to_email: str, subject: str, html_content: str) -> int:
    sg = SendGridAPIClient(api_key=os.environ['SENDGRID_API_KEY'])
    mail = Mail(
        from_email=Email('[email protected]', 'My App'),
        to_emails=to_email,
        subject=subject,
        html_content=Content('text/html', html_content)
    )
    # List-Unsubscribe header aids deliverability and CAN-SPAM compliance
    mail.add_header(Header('List-Unsubscribe', '<https://example.com/unsubscribe>'))
    response = sg.send(mail)
    return response.status_code
# BAD -- blocks the API response until all notifications are sent
@app.post('/api/orders/{order_id}/ship')
def ship_order(order_id: str):
    order = update_order_status(order_id, 'shipped')
    send_push_notification(order.user_id, 'Shipped!')  # blocks 200-500ms
    send_email(order.user_email, 'Shipped', render())  # blocks 1-3s
    send_sms(order.user_phone, 'Shipped!')             # blocks 500ms-2s
    return {'status': 'shipped'}  # user waits 2-5 seconds total
# GOOD -- publishes event and returns immediately
@app.post('/api/orders/{order_id}/ship')
def ship_order(order_id: str):
    order = update_order_status(order_id, 'shipped')
    publish_notification_event({
        'event_type': 'order.shipped',
        'recipient_id': order.user_id,
        'channels': ['push', 'email', 'sms'],
        'idempotency_key': f'order_shipped_{order_id}'
    })
    return {'status': 'shipped'}  # returns in <50ms
# BAD -- email provider outage blocks push and SMS delivery
notifications_queue = Queue('all_notifications')

def process_all(event):
    if event['channel'] == 'email':
        send_email(event)  # if SendGrid is down, entire queue backs up
    elif event['channel'] == 'push':
        send_push(event)
# GOOD -- channel isolation; email outage does not affect push/SMS
push_queue = Queue('notifications:push')
email_queue = Queue('notifications:email')
sms_queue = Queue('notifications:sms')

@push_queue.consumer
def process_push(event):
    send_push(event)

@email_queue.consumer
def process_email(event):
    send_email(event)
# BAD -- worker crash after send but before ack = duplicate notification
def process_notification(event):
    send_push(event['token'], event['message'])
    queue.ack(event['receipt_handle'])  # crash here = duplicate
# GOOD -- check if notification was already sent using idempotency key
def process_notification(event):
    idem_key = event['idempotency_key']
    # SET with NX+EX is atomic; separate SETNX + EXPIRE calls can leave a
    # key without a TTL if the worker dies between them
    if redis.set(f'notif:sent:{idem_key}', '1', nx=True, ex=86400):
        send_push(event['token'], event['message'])
    queue.ack(event['receipt_handle'])  # ack duplicates too, or they redeliver forever
# BAD -- stale tokens waste API calls and trigger throttling
def send_to_all_devices(user_id, message):
    tokens = db.query('SELECT token FROM devices WHERE user_id = ?', user_id)
    for token in tokens:
        fcm.send(token, message)  # never checks if token is valid

# GOOD -- remove tokens FCM/APNs report as invalid
def send_to_all_devices(user_id, message):
    tokens = db.query('SELECT token FROM devices WHERE user_id = ?', user_id)
    response = fcm.send_multicast(tokens, message)
    for i, result in enumerate(response.responses):
        if result.exception:
            if result.exception.code in ('UNREGISTERED', 'INVALID_ARGUMENT'):
                db.execute('DELETE FROM devices WHERE token = ?', tokens[i])
Operational guidelines:
- Rate-limit alerts per recipient (e.g. max 1 alert per user per event type per 5 minutes). [src3]
- Prune device tokens on UNREGISTERED/410 errors; run weekly token validation. [src1]
- Set the queue's maxReceiveCount to move poison messages to the DLQ after 3-5 retries. [src3]
- Partition the notification store by created_at; archive older than 90 days to cold storage. [src3]

# Check SQS queue depth (AWS)
aws sqs get-queue-attributes --queue-url $QUEUE_URL \
--attribute-names ApproximateNumberOfMessages
# Check DLQ depth for failed notifications
aws sqs get-queue-attributes --queue-url $DLQ_URL \
--attribute-names ApproximateNumberOfMessages
# Verify FCM service account is valid
firebase projects:list
# Test push notification delivery (FCM HTTP v1)
curl -X POST "https://fcm.googleapis.com/v1/projects/YOUR_PROJECT/messages:send" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-d '{"message":{"token":"DEVICE_TOKEN","notification":{"title":"Test","body":"Hello"}}}'
# Check Redis Pub/Sub channel subscribers (in-app)
redis-cli PUBSUB NUMSUB notifications:in_app
# Check Kafka consumer lag (notification processor)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group notification-push-processor
| Version / Service | Status | Breaking Changes | Migration Notes |
|---|---|---|---|
| FCM HTTP v1 API | Current (2024+) | Legacy HTTP/XMPP removed June 2024 | Migrate to fcm.googleapis.com/v1/projects/{project}/messages:send with OAuth2 |
| APNs HTTP/2 | Current (2016+) | Binary protocol removed 2021 | Use HTTP/2 port 443; prefer token-based auth |
| SendGrid v3 API | Current | v2 API deprecated | Use /v3/mail/send; API keys replace username/password |
| Socket.io 4.x | Current | v2→v4: namespace middleware changed | Upgrade client and server together |
| AWS SNS/SQS | Stable | FIFO queues added 2020 | Use FIFO for ordered delivery; standard queues for throughput |
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Multi-channel delivery (push + email + SMS + in-app) is required | Only need simple webhook callbacks to a single endpoint | Webhook delivery system with retry |
| Scale exceeds 10K notifications/day and growing | Fewer than 100 notifications/day with a single channel | Direct API calls to FCM/SendGrid from your app |
| Need user preference management and quiet hours | Sending only system-to-system alerts (no human recipients) | Event bus (Kafka/RabbitMQ) without notification layer |
| Regulatory compliance (CAN-SPAM, GDPR) matters | Internal dev notifications only (Slack/Discord bots) | Slack Incoming Webhooks or Discord Bot API |
| Need delivery tracking, analytics, and audit trails | Fire-and-forget logging with no delivery guarantees needed | Simple log aggregation (ELK/Loki) |