| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Event Ingestion API | Accepts event triggers from internal services | REST API, gRPC, internal SDK | Horizontal load balancers, rate limiting |
| Event Store | Persists events durably before dispatch | PostgreSQL, MySQL, DynamoDB | Partitioned by tenant, time-based archival |
| Subscription Registry | Maps events to endpoint URLs + secrets | PostgreSQL, Redis (cache layer) | Read replicas, in-memory cache with TTL |
| Dispatcher / Scheduler | Reads pending events, fans out to delivery workers | Custom service, Temporal, cron | Partition by tenant/event-type |
| Delivery Workers | Execute HTTP POST to subscriber endpoints | Worker pool, Lambda, Cloud Run | Auto-scaling worker pool, concurrency limits per endpoint |
| Retry Queue | Holds failed deliveries for exponential backoff | SQS, RabbitMQ, Redis sorted sets, PostgreSQL | Visibility timeout, DLQ after max retries |
| Signature Module | Signs payloads with HMAC-SHA256 per endpoint secret | crypto stdlib, Standard Webhooks SDK | Stateless, scales with workers |
| Idempotency Store | Tracks processed webhook IDs on receiver side | Redis (SETNX + TTL), PostgreSQL unique index | TTL-based cleanup (7-30 days) |
| Dead Letter Queue (DLQ) | Captures permanently failed deliveries for inspection | SQS DLQ, database table, S3 archive | Manual review dashboard, bulk replay |
| Circuit Breaker | Disables delivery to consistently failing endpoints | In-memory state machine per endpoint | Threshold: 5+ consecutive failures, half-open probe every 5 min |
| Monitoring & Alerting | Tracks delivery rates, latency, failure patterns | Prometheus + Grafana, Datadog, CloudWatch | Per-endpoint dashboards, anomaly alerts |
| Rate Limiter | Prevents overwhelming subscriber endpoints | Token bucket per endpoint | Configurable per subscriber (default: 1000/min) |
| Event Log / Audit Trail | Records all delivery attempts for debugging | Append-only table, ClickHouse, Elasticsearch | Time-partitioned, 30-90 day retention |
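The Rate Limiter row above (token bucket per endpoint) can be sketched in a few lines. This is a minimal in-memory illustration, not a specific library's API; the class name and the 60/min example rate are assumptions for demonstration:

```python
import time

class TokenBucket:
    """Per-endpoint token bucket: refills continuously up to a burst cap."""
    def __init__(self, rate_per_min=1000, burst=None):
        self.rate = rate_per_min / 60.0        # tokens added per second
        self.capacity = burst or rate_per_min  # maximum tokens held
        self.tokens = self.capacity
        self.last_refill = time.monotonic()

    def allow(self):
        # Refill proportionally to elapsed time, then spend one token if available
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_min=60)
print(bucket.allow())  # True: the bucket starts full
```

In a real dispatcher you would keep one bucket per endpoint (e.g. in a dict keyed by endpoint ID) and requeue deliveries that `allow()` rejects.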
START: Building a webhooks system
|
+-- Are you the SENDER (provider) or RECEIVER (consumer)?
| |
| +-- SENDER -->
| | |
| | +-- Expected volume?
| | | +-- <100 events/sec --> Simple: PostgreSQL + worker thread
| | | +-- 100-10K events/sec --> Standard: Message queue + worker pool
| | | +-- >10K events/sec --> Scale: Partitioned queue + auto-scaling workers + circuit breakers
| | |
| | +-- Need strict ordering?
| | +-- YES, per endpoint --> FIFO queue partitioned by endpoint_id
| | +-- NO --> Parallel delivery workers (higher throughput)
| |
| +-- RECEIVER -->
| |
| +-- Verify signature (HMAC-SHA256) --> reject if invalid
| +-- Check idempotency key --> skip if already processed
| +-- Enqueue for async processing --> return 200 immediately
| +-- Process from queue --> update state idempotently
|
+-- Need a managed solution?
+-- YES --> Svix, Hookdeck, or cloud provider (AWS EventBridge, Google Eventarc)
+-- NO --> Build with the architecture in Quick Reference above
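The "FIFO queue partitioned by endpoint_id" branch in the tree above can be sketched as a stable partition function: every event for one endpoint lands on the same partition (preserving per-endpoint order), while other partitions process in parallel. The function name and partition count here are illustrative assumptions:

```python
import hashlib

def partition_for(endpoint_id: str, num_partitions: int = 16) -> int:
    """Stable hash so all events for an endpoint map to the same partition,
    preserving per-endpoint FIFO order without sacrificing overall parallelism."""
    digest = hashlib.sha256(endpoint_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same endpoint always hashes to the same partition.
assert partition_for("ep_123") == partition_for("ep_123")
```

Using a cryptographic hash (rather than Python's built-in `hash()`) keeps the mapping stable across process restarts and machines.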
Adopt the Standard Webhooks specification for interoperability. Every webhook payload needs a unique event ID, a timestamp, an event type, and the event data. [src2]
{
"webhook_id": "msg_2Yh8KlR4vQHPMnZx",
"type": "invoice.paid",
"timestamp": "2026-02-23T12:00:00Z",
"data": {
"invoice_id": "inv_abc123",
"amount": 9900,
"currency": "usd"
}
}
Verify: Payload is valid JSON, under 20KB, contains webhook_id + type + timestamp + data fields.
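The checks in that Verify step can be expressed as a small validator. This is a sketch following the example payload's field names and the 20KB guideline above; the function name is an assumption:

```python
import json

REQUIRED_FIELDS = {"webhook_id", "type", "timestamp", "data"}
MAX_BYTES = 20 * 1024  # keep payloads under 20KB

def validate_payload(raw: bytes) -> dict:
    if len(raw) > MAX_BYTES:
        raise ValueError("payload exceeds 20KB")
    event = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return event

event = validate_payload(b'{"webhook_id":"msg_1","type":"invoice.paid",'
                         b'"timestamp":"2026-02-23T12:00:00Z","data":{}}')
```

Running this check before signing and dispatch catches malformed events at the cheapest point in the pipeline.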
Sign every outbound webhook so receivers can verify authenticity. The signature covers the message ID, timestamp, and body to prevent tampering and replay attacks. [src1] [src2]
import hmac, hashlib, time, base64
def sign_webhook(payload_bytes, secret, msg_id):
timestamp = str(int(time.time()))
to_sign = f"{msg_id}.{timestamp}.{payload_bytes.decode('utf-8')}"
signature = hmac.new(
base64.b64decode(secret), to_sign.encode("utf-8"), hashlib.sha256
).digest()
return {
"webhook-id": msg_id,
"webhook-timestamp": timestamp,
"webhook-signature": f"v1,{base64.b64encode(signature).decode()}"
}
Verify: echo -n "msg_id.timestamp.body" | openssl dgst -sha256 -hmac "decoded_secret" -binary | base64 should match the value after the v1, prefix in the webhook-signature header.
The delivery worker sends the HTTP POST with signed headers and retries on failure using exponential backoff with jitter. Cap retries at a total window of ~3 days. [src3] [src7]
import asyncio, random
import httpx

RETRY_SCHEDULE = [0, 5, 300, 1800, 7200, 18000, 36000, 50400, 72000, 86400]

async def deliver_webhook(url, payload, headers, max_retries=10):
    async with httpx.AsyncClient(timeout=30.0) as client:
        for attempt in range(max_retries):
            if attempt > 0:
                # Back off per the schedule, plus up to 10% jitter
                delay = RETRY_SCHEDULE[min(attempt, len(RETRY_SCHEDULE) - 1)]
                await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
            try:
                resp = await client.post(url, content=payload, headers=headers)
                if 200 <= resp.status_code < 300:
                    return True
            except httpx.RequestError:
                pass  # Network error: fall through to the next retry
    return False  # Exhausted retries: move to DLQ
Verify: Simulate a failing endpoint and confirm retries follow the schedule. After max retries, the event should appear in the dead letter queue.
Receivers must verify the HMAC signature before processing any webhook. Reject requests with invalid signatures or stale timestamps. [src1] [src6]
const crypto = require('crypto');

function verifyWebhook(req, secret, toleranceSec = 300) {
  const msgId = req.headers['webhook-id'];
  const timestamp = req.headers['webhook-timestamp'];
  const now = Math.floor(Date.now() / 1000);
  if (Math.abs(now - parseInt(timestamp, 10)) > toleranceSec)
    throw new Error('Stale timestamp');
  const toSign = `${msgId}.${timestamp}.${req.rawBody}`;
  const expected = crypto
    .createHmac('sha256', Buffer.from(secret, 'base64'))
    .update(toSign).digest('base64');
  // Header may hold several space-separated signatures (e.g. during key rotation)
  const isValid = req.headers['webhook-signature'].split(' ').some(sig => {
    try {
      return crypto.timingSafeEqual(
        Buffer.from(expected), Buffer.from(sig.replace('v1,', ''))
      );
    } catch { return false; } // timingSafeEqual throws on length mismatch
  });
  if (!isValid) throw new Error('Invalid signature');
  return JSON.parse(req.rawBody);
}
Verify: Send a webhook with a tampered body and confirm it returns 401. Send a valid webhook and confirm it returns 200.
Store processed webhook IDs to prevent duplicate processing. Use Redis SETNX with a 7-30 day TTL or a database unique constraint. [src5]
import redis
r = redis.Redis()
def process_webhook_idempotently(webhook_id, payload):
lock_key = f"webhook:processed:{webhook_id}"
if not r.set(lock_key, "1", nx=True, ex=86400 * 14):
return False # Already processed
try:
handle_event(payload)
return True
except Exception:
r.delete(lock_key)
raise
Verify: Send the same webhook ID twice. First call processes; second call returns immediately without re-processing.
When an endpoint consistently fails, stop sending to avoid wasting resources and triggering alert fatigue. Probe periodically to detect recovery. [src4] [src5]
import time

class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, recovery_timeout=300):
        self.state = self.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = 0

    def can_send(self):
        if self.state == self.CLOSED: return True
        if self.state == self.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = self.HALF_OPEN
                return True
            return False
        return True  # HALF_OPEN: allow a single probe

    def record_success(self):  # call after a 2xx response
        self.state = self.CLOSED
        self.failure_count = 0

    def record_failure(self):  # call after a failed delivery attempt
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = self.OPEN
Verify: Trigger 5 consecutive failures for an endpoint. Confirm the circuit opens and no further deliveries are attempted until the recovery timeout.
# Input: event dict, subscriber endpoint URL, signing secret
# Output: delivery result (success/failure/dlq)
import hmac, hashlib, base64, time, json, uuid
import httpx, asyncio, random
RETRY_DELAYS = [0, 5, 300, 1800, 7200, 18000, 36000, 50400, 72000, 86400]
async def send_webhook(event: dict, endpoint_url: str, secret: str) -> str:
msg_id = f"msg_{uuid.uuid4().hex[:16]}"
payload = json.dumps(event, separators=(",", ":")).encode("utf-8")
async with httpx.AsyncClient(timeout=30.0) as client:
for attempt in range(len(RETRY_DELAYS)):
if attempt > 0:
delay = RETRY_DELAYS[attempt]
await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
timestamp = str(int(time.time()))
to_sign = f"{msg_id}.{timestamp}.{payload.decode()}"
sig = hmac.new(
base64.b64decode(secret), to_sign.encode(), hashlib.sha256
).digest()
headers = {
"Content-Type": "application/json",
"webhook-id": msg_id,
"webhook-timestamp": timestamp,
"webhook-signature": f"v1,{base64.b64encode(sig).decode()}",
}
try:
resp = await client.post(endpoint_url, content=payload, headers=headers)
if 200 <= resp.status_code < 300: return "delivered"
if 400 <= resp.status_code < 500 and resp.status_code not in (408, 429):
return "rejected"
except httpx.RequestError:
pass
return "dlq"
// Input: incoming HTTP POST with webhook-id, webhook-timestamp, webhook-signature headers
// Output: 200 OK (accepted) or 401 Unauthorized (invalid signature)
const express = require('express');
const crypto = require('crypto');
const { Queue } = require('bullmq');
const app = express();
const webhookQueue = new Queue('webhooks');
const WEBHOOK_SECRET = process.env.WEBHOOK_SECRET;
const TOLERANCE_SEC = 300;
app.use('/webhooks', express.raw({ type: 'application/json' }));
app.post('/webhooks', async (req, res) => {
const msgId = req.headers['webhook-id'];
const timestamp = req.headers['webhook-timestamp'];
const sigHeader = req.headers['webhook-signature'];
const age = Math.abs(Math.floor(Date.now() / 1000) - parseInt(timestamp));
if (age > TOLERANCE_SEC) return res.status(401).json({ error: 'Stale timestamp' });
const toSign = `${msgId}.${timestamp}.${req.body.toString()}`;
const expected = crypto
.createHmac('sha256', Buffer.from(WEBHOOK_SECRET, 'base64'))
.update(toSign).digest('base64');
const isValid = sigHeader.split(' ').some(sig => {
try {
return crypto.timingSafeEqual(
Buffer.from(expected), Buffer.from(sig.replace('v1,', ''))
);
} catch { return false; }
});
if (!isValid) return res.status(401).json({ error: 'Invalid signature' });
await webhookQueue.add('process', {
webhookId: msgId,
payload: JSON.parse(req.body.toString()),
}, { jobId: msgId });
res.status(200).json({ received: true });
});
app.listen(3000);
# BAD -- processing in the request handler blocks the response,
# causing timeouts that trigger retries and duplicate processing
@app.post("/webhooks")
def handle_webhook(request):
payload = request.json()
update_database(payload) # Slow DB write
send_email_notification(payload) # External API call
recalculate_billing(payload) # CPU-intensive work
return {"ok": True} # Response after 45 seconds -> provider retries
# GOOD -- return 200 immediately, process via background queue
@app.post("/webhooks")
async def handle_webhook(request):
payload = request.json()
await task_queue.enqueue("process_webhook", payload) # Sub-millisecond
return Response(status_code=200) # Instant response
// BAD -- JSON.parse then JSON.stringify changes whitespace/key order
app.post('/webhooks', express.json(), (req, res) => {
const body = JSON.stringify(req.body); // Re-serialized, not original bytes!
const sig = crypto.createHmac('sha256', secret).update(body).digest('hex');
// This comparison will ALWAYS fail
});
// GOOD -- capture raw bytes before any parsing
app.post('/webhooks', express.raw({ type: 'application/json' }), (req, res) => {
const rawBody = req.body.toString(); // Original bytes from sender
const sig = crypto.createHmac('sha256', Buffer.from(secret, 'base64'))
.update(rawBody).digest('base64');
// Compare against header using crypto.timingSafeEqual
});
# BAD -- string equality is not constant-time; enables timing attacks
if computed_signature == received_signature:
process_webhook(payload)
# GOOD -- hmac.compare_digest is constant-time
import hmac
if hmac.compare_digest(computed_signature, received_signature):
process_webhook(payload)
# BAD -- duplicate deliveries charge the customer multiple times
@app.post("/webhooks")
def handle(request):
event = request.json()
charge_customer(event["data"]["customer_id"], event["data"]["amount"])
return {"ok": True} # Processes EVERY delivery, including retries
# GOOD -- check if webhook already processed using the unique event ID
@app.post("/webhooks")
def handle(request):
event = request.json()
webhook_id = request.headers["webhook-id"]
if redis.set(f"wh:{webhook_id}", "1", nx=True, ex=86400 * 14):
charge_customer(event["data"]["customer_id"], event["data"]["amount"])
return {"ok": True} # Always return 200 to stop retries
- Return 200 immediately and enqueue to a background worker. [src1]
- Capture and verify the raw request body bytes; never re-serialize. [src6]
- Store webhook_id in Redis (SETNX + TTL) or a database unique constraint before processing. [src5]
- Verify webhook-timestamp is within a 5-minute window of the current time. [src2]
- Use environment variables or secret managers (Vault, AWS Secrets Manager) and rotate secrets periodically. [src3]
- Support multiple signatures in the header during rotation; accept both old and new keys for a transition period. [src2]
- Treat 410 as a permanent failure, disable the subscription, and notify the subscriber. [src3]
- Keep payloads under 20KB; send a reference ID and let the receiver fetch full data via the API. [src2]
# Test webhook endpoint connectivity
curl -X POST https://your-endpoint.com/webhooks \
-H "Content-Type: application/json" \
-H "webhook-id: msg_test123" \
-H "webhook-timestamp: $(date +%s)" \
-d '{"type":"test","data":{}}' -v
# Generate HMAC-SHA256 signature for testing
echo -n "msg_test123.$(date +%s).{\"type\":\"test\"}" | \
openssl dgst -sha256 -hmac "$(echo 'your_base64_secret' | base64 -d)" -binary | base64
# Monitor webhook delivery queue depth (Redis)
redis-cli LLEN webhook:delivery:queue
# Check dead letter queue size
redis-cli LLEN webhook:dlq
# Verify webhook receiver is responding quickly
curl -o /dev/null -s -w "Response time: %{time_total}s\n" \
-X POST https://your-endpoint.com/webhooks \
-H "Content-Type: application/json" -d '{}'
| Standard / Provider | Current Version | Key Changes | Notes |
|---|---|---|---|
| Standard Webhooks | v1.0 (2023) | First formal spec: webhook-id, webhook-timestamp, webhook-signature headers | Backed by Svix, adopted by Livekit, Clerk, Brex |
| Stripe Webhooks | v2 (2024) | Thin events (fetch full object via API), reduced payload sizes | v1 signatures still supported |
| GitHub Webhooks | v3 (2023) | Repository, organization, and app-level webhooks | Uses X-Hub-Signature-256 header |
| AWS EventBridge | 2023 | Partner event sources, API destinations with OAuth | Managed alternative to custom webhook infrastructure |
| Use When | Don't Use When | Use Instead |
|---|---|---|
| You need to notify external systems of events in near-real-time | You need sub-100ms latency guaranteed | WebSockets or gRPC streaming |
| Subscribers are third-party systems you don't control | Both systems are within your infrastructure | Internal message queue (SQS, RabbitMQ, Kafka) |
| Event volume is moderate (<50K events/sec) | Volume exceeds 100K events/sec sustained | Event streaming (Kafka, Pulsar) with consumer pull |
| You want a simple HTTP-based integration pattern | You need guaranteed exactly-once processing | Transactional outbox + consumer with two-phase commit |
| Subscribers need to process events at their own pace | You need strict global ordering of all events | Partitioned log (Kafka) with ordered consumers |
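The "transactional outbox" alternative named in the table above can be sketched as writing the event row in the same database transaction as the state change, so an event is recorded if and only if the change commits. This sketch uses SQLite for self-containment; the table and column names are illustrative assumptions:

```python
import json, sqlite3, uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE outbox (id TEXT PRIMARY KEY, type TEXT, "
             "payload TEXT, dispatched INTEGER DEFAULT 0)")

def mark_paid(invoice_id: str):
    # State change and event insert commit atomically:
    # no lost events (change without event) and no phantom events (event without change).
    with conn:
        conn.execute("INSERT OR REPLACE INTO invoices VALUES (?, 'paid')",
                     (invoice_id,))
        conn.execute(
            "INSERT INTO outbox (id, type, payload) VALUES (?, ?, ?)",
            (f"msg_{uuid.uuid4().hex[:16]}", "invoice.paid",
             json.dumps({"invoice_id": invoice_id})),
        )

mark_paid("inv_abc123")
# A separate relay process polls undispatched rows and hands them to the dispatcher.
pending = conn.execute("SELECT type FROM outbox WHERE dispatched = 0").fetchall()
```

A relay worker then marks each row dispatched after successful handoff, giving at-least-once delivery into the webhook pipeline.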