How to Design a Reliable Webhooks System
How do I design a reliable webhooks system?
TL;DR
- Bottom line: A reliable webhooks system requires persistent event storage, HMAC-signed payloads, exponential-backoff retries with jitter, idempotency keys for deduplication, and circuit breakers per endpoint.
- Key tool/command: HMAC-SHA256(webhook_secret, msg_id.timestamp.payload) per the Standard Webhooks spec.
- Watch out for: Processing webhooks synchronously in the HTTP handler -- this causes timeouts, triggers retries, and creates duplicate processing cascades.
- Works with: Any HTTP stack; language-agnostic pattern. Standard Webhooks libraries available for Python, Node.js, Go, Java, Ruby, PHP, Rust, Kotlin.
Constraints
- Always sign payloads with HMAC-SHA256 (or ed25519) using a per-endpoint secret -- unsigned webhooks are spoofable by anyone who knows the URL
- Receivers must respond with 2xx within 5-30 seconds -- defer all business logic to a background queue
- Design for at-least-once delivery -- exactly-once is impossible over HTTP; receivers must be idempotent
- Persist events to durable storage (database or durable queue) before attempting delivery -- in-memory queues lose events on process crash
- Use constant-time string comparison for HMAC verification -- the == operator enables timing side-channel attacks
- Enforce a timestamp window (5-10 minutes) on signature verification to prevent replay attacks
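The signing, constant-time comparison, and timestamp-window constraints above can be combined into one receiver-side check. The following is a minimal Python sketch, not the Standard Webhooks SDK: the function name verify_signature and its parameters are illustrative assumptions, and the secret is assumed to be base64-encoded as in the examples later in this article.

```python
import base64
import hashlib
import hmac
import time


def verify_signature(secret_b64, msg_id, timestamp, raw_body, signature_header,
                     tolerance_sec=300, now=None):
    """Return True only if the timestamp is fresh and a v1 signature matches.

    Hypothetical helper: names and defaults are assumptions for illustration.
    raw_body must be the original request bytes, not re-serialized JSON.
    """
    if not (msg_id and timestamp and signature_header):
        return False  # a required header is missing
    now = time.time() if now is None else now
    try:
        ts = int(timestamp)
    except (TypeError, ValueError):
        return False  # non-numeric timestamp
    if abs(now - ts) > tolerance_sec:
        return False  # outside the replay window

    signed = f"{msg_id}.{timestamp}.".encode("utf-8") + raw_body
    expected = base64.b64encode(
        hmac.new(base64.b64decode(secret_b64), signed, hashlib.sha256).digest()
    )
    # The header may carry several space-separated signatures (key rotation).
    for candidate in signature_header.split(" "):
        version, _, value = candidate.partition(",")
        # hmac.compare_digest runs in constant time, unlike ==
        if version == "v1" and hmac.compare_digest(expected, value.encode("utf-8")):
            return True
    return False
```

The optional now parameter exists only so the timestamp window can be tested deterministically; production callers would leave it unset.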
Quick Reference
| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Event Ingestion API | Accepts event triggers from internal services | REST API, gRPC, internal SDK | Horizontal load balancers, rate limiting |
| Event Store | Persists events durably before dispatch | PostgreSQL, MySQL, DynamoDB | Partitioned by tenant, time-based archival |
| Subscription Registry | Maps events to endpoint URLs + secrets | PostgreSQL, Redis (cache layer) | Read replicas, in-memory cache with TTL |
| Dispatcher / Scheduler | Reads pending events, fans out to delivery workers | Custom service, Temporal, cron | Partition by tenant/event-type |
| Delivery Workers | Execute HTTP POST to subscriber endpoints | Worker pool, Lambda, Cloud Run | Auto-scaling worker pool, concurrency limits per endpoint |
| Retry Queue | Holds failed deliveries for exponential backoff | SQS, RabbitMQ, Redis sorted sets, PostgreSQL | Visibility timeout, DLQ after max retries |
| Signature Module | Signs payloads with HMAC-SHA256 per endpoint secret | crypto stdlib, Standard Webhooks SDK | Stateless, scales with workers |
| Idempotency Store | Tracks processed webhook IDs on receiver side | Redis (SETNX + TTL), PostgreSQL unique index | TTL-based cleanup (7-30 days) |
| Dead Letter Queue (DLQ) | Captures permanently failed deliveries for inspection | SQS DLQ, database table, S3 archive | Manual review dashboard, bulk replay |
| Circuit Breaker | Disables delivery to consistently failing endpoints | In-memory state machine per endpoint | Threshold: 5+ consecutive failures, half-open probe every 5 min |
| Monitoring & Alerting | Tracks delivery rates, latency, failure patterns | Prometheus + Grafana, Datadog, CloudWatch | Per-endpoint dashboards, anomaly alerts |
| Rate Limiter | Prevents overwhelming subscriber endpoints | Token bucket per endpoint | Configurable per subscriber (default: 1000/min) |
| Event Log / Audit Trail | Records all delivery attempts for debugging | Append-only table, ClickHouse, Elasticsearch | Time-partitioned, 30-90 day retention |
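The Rate Limiter row names a token bucket per endpoint. As a rough illustration of that idea, here is an in-memory sketch; the class and function names are assumptions, and a multi-worker deployment would need shared state (for example in Redis) rather than a process-local dict.

```python
import time


class TokenBucket:
    """Token bucket: refills continuously, each delivery spends one token."""

    def __init__(self, rate_per_min=1000, capacity=None, clock=time.monotonic):
        self.rate = rate_per_min / 60.0  # tokens added per second
        self.capacity = capacity if capacity is not None else rate_per_min
        self.tokens = float(self.capacity)  # start full
        self.clock = clock
        self.updated = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per subscriber endpoint (hypothetical in-process registry)
buckets = {}


def allow_delivery(endpoint_id, rate_per_min=1000):
    if endpoint_id not in buckets:
        buckets[endpoint_id] = TokenBucket(rate_per_min)
    return buckets[endpoint_id].try_acquire()
```

When allow_delivery returns False, the delivery would typically be rescheduled rather than dropped, so rate limiting delays events without losing them.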
Decision Tree
START: Building a webhooks system
|
+-- Are you the SENDER (provider) or RECEIVER (consumer)?
| |
| +-- SENDER -->
| | |
| | +-- Expected volume?
| | | +-- <100 events/sec --> Simple: PostgreSQL + worker thread
| | | +-- 100-10K events/sec --> Standard: Message queue + worker pool
| | | +-- >10K events/sec --> Scale: Partitioned queue + auto-scaling workers + circuit breakers
| | |
| | +-- Need strict ordering?
| | +-- YES, per endpoint --> FIFO queue partitioned by endpoint_id
| | +-- NO --> Parallel delivery workers (higher throughput)
| |
| +-- RECEIVER -->
| |
| +-- Verify signature (HMAC-SHA256) --> reject if invalid
| +-- Check idempotency key --> skip if already processed
| +-- Enqueue for async processing --> return 200 immediately
| +-- Process from queue --> update state idempotently
|
+-- Need a managed solution?
+-- YES --> Svix, Hookdeck, or cloud provider (AWS EventBridge, Google Eventarc)
+-- NO --> Build with the architecture in Quick Reference above
Step-by-Step Guide
1. Define event schema and webhook payload format
Adopt the Standard Webhooks specification for interoperability. Every webhook payload needs a unique event ID, a timestamp, an event type, and the event data. [src2]
{
"webhook_id": "msg_2Yh8KlR4vQHPMnZx",
"type": "invoice.paid",
"timestamp": "2026-02-23T12:00:00Z",
"data": {
"invoice_id": "inv_abc123",
"amount": 9900,
"currency": "usd"
}
}
Verify: Payload is valid JSON, under 20KB, contains webhook_id + type + timestamp + data fields.
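The verification step above can be automated before dispatch. A small sketch, assuming the four field names from the example payload and the 20KB limit stated in this step; validate_payload is a hypothetical helper, not part of any SDK.

```python
import json

REQUIRED_FIELDS = ("webhook_id", "type", "timestamp", "data")
MAX_PAYLOAD_BYTES = 20 * 1024  # the 20KB guideline from this step


def validate_payload(raw: bytes) -> list:
    """Return a list of problems; an empty list means the payload passes."""
    errors = []
    if len(raw) > MAX_PAYLOAD_BYTES:
        errors.append(f"payload is {len(raw)} bytes, limit is {MAX_PAYLOAD_BYTES}")
    try:
        doc = json.loads(raw)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return errors + ["body is not valid JSON"]
    if not isinstance(doc, dict):
        return errors + ["top-level value must be a JSON object"]
    for field in REQUIRED_FIELDS:
        if field not in doc:
            errors.append(f"missing field: {field}")
    return errors
```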
2. Implement HMAC-SHA256 payload signing
Sign every outbound webhook so receivers can verify authenticity. The signature covers the message ID, timestamp, and body to prevent tampering and replay attacks. [src1] [src2]
import hmac, hashlib, time, base64

def sign_webhook(payload_bytes, secret, msg_id):
    timestamp = str(int(time.time()))
    # Signed content: "<msg_id>.<timestamp>.<body>"
    to_sign = f"{msg_id}.{timestamp}.{payload_bytes.decode('utf-8')}"
    signature = hmac.new(
        base64.b64decode(secret), to_sign.encode("utf-8"), hashlib.sha256
    ).digest()
    return {
        "webhook-id": msg_id,
        "webhook-timestamp": timestamp,
        "webhook-signature": f"v1,{base64.b64encode(signature).decode()}"
    }
Verify: echo -n "msg_id.timestamp.body" | openssl dgst -sha256 -hmac "decoded_secret" -binary | base64 matches the signature header value.
3. Build the delivery worker with retry logic
The delivery worker sends the HTTP POST with signed headers and retries on failure using exponential backoff with jitter. Cap retries at a total window of ~3 days. [src3] [src7]
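import asyncio
import random
import httpx

# Delay in seconds before each attempt (index = attempt number); totals ~3 days
RETRY_SCHEDULE = [0, 5, 300, 1800, 7200, 18000, 36000, 50400, 72000, 86400]

# NOTE: headers signed once carry a fixed webhook-timestamp. Receivers that
# enforce a 5-minute window will reject later retries, so re-sign before each
# attempt (as the full sender under Code Examples does).
async def deliver_webhook(url, payload, headers, max_retries=10):
    async with httpx.AsyncClient(timeout=30.0) as client:
        for attempt in range(max_retries):
            if attempt > 0:
                delay = RETRY_SCHEDULE[min(attempt, len(RETRY_SCHEDULE) - 1)]
                # Up to 10% jitter so retries for many events do not align
                await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
            try:
                resp = await client.post(url, content=payload, headers=headers)
                if 200 <= resp.status_code < 300:
                    return True
            except httpx.RequestError:
                pass  # Network error or timeout: fall through to the next attempt
    return False  # Move to DLQ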
Verify: Simulate a failing endpoint and confirm retries follow the schedule. After max retries, the event should appear in the dead letter queue.
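Sleeping inside the worker keeps retry state in memory, which conflicts with the constraint that events be persisted before delivery. One way to reconcile the two is to store the next attempt time and let a poller pick up due rows. The sketch below uses sqlite3 purely as a stand-in for the event store; the table layout and function names are assumptions, and the actual HTTP POST is left to the caller.

```python
import random
import sqlite3
import time

RETRY_SCHEDULE = [0, 5, 300, 1800, 7200, 18000, 36000, 50400, 72000, 86400]


def init_store(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS deliveries ("
        " msg_id TEXT PRIMARY KEY, url TEXT NOT NULL, payload BLOB NOT NULL,"
        " attempt INTEGER NOT NULL DEFAULT 0, next_attempt_at REAL NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )


def enqueue(conn, msg_id, url, payload, now=None):
    """Persist the event before any delivery attempt is made."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO deliveries (msg_id, url, payload, next_attempt_at)"
            " VALUES (?, ?, ?, ?)", (msg_id, url, payload, now))


def due(conn, now=None):
    """Rows whose next attempt time has arrived, oldest first."""
    now = time.time() if now is None else now
    return conn.execute(
        "SELECT msg_id, url, payload, attempt FROM deliveries"
        " WHERE status = 'pending' AND next_attempt_at <= ?"
        " ORDER BY next_attempt_at", (now,)).fetchall()


def record_result(conn, msg_id, attempt, ok, now=None, rng=random.uniform):
    """Record the outcome of attempt number `attempt` (0-based)."""
    now = time.time() if now is None else now
    nxt = attempt + 1
    with conn:
        if ok:
            conn.execute("UPDATE deliveries SET status = 'delivered', attempt = ?"
                         " WHERE msg_id = ?", (nxt, msg_id))
        elif nxt >= len(RETRY_SCHEDULE):
            # Schedule exhausted: park the event for manual review or replay
            conn.execute("UPDATE deliveries SET status = 'dlq', attempt = ?"
                         " WHERE msg_id = ?", (nxt, msg_id))
        else:
            delay = RETRY_SCHEDULE[nxt]
            when = now + delay + rng(0, delay * 0.1)  # backoff plus up to 10% jitter
            conn.execute("UPDATE deliveries SET attempt = ?, next_attempt_at = ?"
                         " WHERE msg_id = ?", (nxt, when, msg_id))
```

A worker loop would call due(), attempt each delivery, and pass the outcome to record_result(). Because the schedule lives in the table, a process crash loses at most the in-flight attempt.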
4. Implement receiver-side signature verification
Receivers must verify the HMAC signature before processing any webhook. Reject requests with invalid signatures or stale timestamps. [src1] [src6]
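const crypto = require('crypto');

function verifyWebhook(req, secret, toleranceSec = 300) {
  const msgId = req.headers['webhook-id'];
  const timestamp = req.headers['webhook-timestamp'];
  const sigHeader = req.headers['webhook-signature'];
  if (!msgId || !timestamp || !sigHeader)
    throw new Error('Missing webhook headers');
  // Reject non-numeric timestamps explicitly: NaN never compares greater than the tolerance
  const ts = Number.parseInt(timestamp, 10);
  const now = Math.floor(Date.now() / 1000);
  if (Number.isNaN(ts) || Math.abs(now - ts) > toleranceSec)
    throw new Error('Stale timestamp');
  // req.rawBody must hold the unparsed request body
  const toSign = `${msgId}.${timestamp}.${req.rawBody}`;
  const expected = Buffer.from(crypto
    .createHmac('sha256', Buffer.from(secret, 'base64'))
    .update(toSign).digest('base64'));
  const isValid = sigHeader.split(' ').some(sig => {
    const candidate = Buffer.from(sig.replace('v1,', ''));
    // timingSafeEqual throws on length mismatch, so compare lengths first
    return candidate.length === expected.length &&
      crypto.timingSafeEqual(expected, candidate);
  });
  if (!isValid) throw new Error('Invalid signature');
  return JSON.parse(req.rawBody);
}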
Verify: Send a webhook with a tampered body and confirm it returns 401. Send a valid webhook and confirm it returns 200.
5. Add idempotency on the receiver side
Store processed webhook IDs to prevent duplicate processing. Use Redis SETNX with a 7-30 day TTL or a database unique constraint. [src5]
import redis

r = redis.Redis()

def process_webhook_idempotently(webhook_id, payload):
    lock_key = f"webhook:processed:{webhook_id}"
    # SET NX with a 14-day TTL: only the first delivery of this ID gets through
    if not r.set(lock_key, "1", nx=True, ex=86400 * 14):
        return False  # Already processed
    try:
        handle_event(payload)
        return True
    except Exception:
        r.delete(lock_key)  # Release the key so a retry can process the event
        raise
Verify: Send the same webhook ID twice. First call processes; second call returns immediately without re-processing.
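The database unique-constraint alternative mentioned in this step has one advantage over a Redis key: the deduplication marker and the business-state change can commit in the same transaction. A sketch using sqlite3 as a stand-in for the receiver's database; the table name and the process_once helper are assumptions.

```python
import sqlite3


def init(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_webhooks ("
        " webhook_id TEXT PRIMARY KEY,"
        " processed_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )


def process_once(conn, webhook_id, payload, handler):
    """Run handler(conn, payload) at most once per webhook_id.

    Returns True if the handler ran, False if the ID was seen before.
    """
    with conn:  # commits on success, rolls back if the handler raises
        try:
            conn.execute(
                "INSERT INTO processed_webhooks (webhook_id) VALUES (?)",
                (webhook_id,))
        except sqlite3.IntegrityError:
            return False  # primary key already present: duplicate delivery
        handler(conn, payload)
    return True
```

If the handler raises, the marker row is rolled back along with any writes the handler made through conn, so the sender's next retry is processed normally.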
6. Implement circuit breakers per endpoint
When an endpoint consistently fails, stop sending to avoid wasting resources and triggering alert fatigue. Probe periodically to detect recovery. [src4] [src5]
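import time

class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, recovery_timeout=300):
        self.state = self.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = 0

    def can_send(self):
        if self.state == self.CLOSED: return True
        if self.state == self.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = self.HALF_OPEN
                return True
            return False
        return True  # HALF_OPEN: allow probe

    def record_success(self):
        # Any success (including a half-open probe) closes the circuit
        self.state = self.CLOSED
        self.failure_count = 0

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        # A failed probe, or reaching the threshold, (re)opens the circuit
        if self.state == self.HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = self.OPEN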
Verify: Trigger 5 consecutive failures for an endpoint. Confirm the circuit opens and no further deliveries are attempted until the recovery timeout.
Code Examples
Python: Webhook Sender with Retry and Signing
# Input: event dict, subscriber endpoint URL, signing secret
# Output: delivery result (success/failure/dlq)
import hmac, hashlib, base64, time, json, uuid
import httpx, asyncio, random

RETRY_DELAYS = [0, 5, 300, 1800, 7200, 18000, 36000, 50400, 72000, 86400]

async def send_webhook(event: dict, endpoint_url: str, secret: str) -> str:
    msg_id = f"msg_{uuid.uuid4().hex[:16]}"
    payload = json.dumps(event, separators=(",", ":")).encode("utf-8")
    async with httpx.AsyncClient(timeout=30.0) as client:
        for attempt in range(len(RETRY_DELAYS)):
            if attempt > 0:
                delay = RETRY_DELAYS[attempt]
                await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
            # Re-sign on every attempt so the timestamp stays inside the receiver's window
            timestamp = str(int(time.time()))
            to_sign = f"{msg_id}.{timestamp}.{payload.decode()}"
            sig = hmac.new(
                base64.b64decode(secret), to_sign.encode(), hashlib.sha256
            ).digest()
            headers = {
                "Content-Type": "application/json",
                "webhook-id": msg_id,
                "webhook-timestamp": timestamp,
                "webhook-signature": f"v1,{base64.b64encode(sig).decode()}",
            }
            try:
                resp = await client.post(endpoint_url, content=payload, headers=headers)
                if 200 <= resp.status_code < 300: return "delivered"
                # 4xx other than 408/429 will not succeed on retry
                if 400 <= resp.status_code < 500 and resp.status_code not in (408, 429):
                    return "rejected"
            except httpx.RequestError:
                pass
    return "dlq"
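The status handling in the sender above can be factored into a small pure function, which also makes room for the 410 Gone case listed under Common Pitfalls. This is one possible mapping, offered as an assumption rather than a rule from any specification; the outcome labels are illustrative.

```python
def classify_response(status_code: int) -> str:
    """Map an HTTP status code to a delivery outcome (hypothetical labels)."""
    if 200 <= status_code < 300:
        return "delivered"
    if status_code == 410:
        return "disable"   # endpoint is gone: stop retrying, disable the subscription
    if status_code in (408, 429):
        return "retry"     # request timeout or rate limited: back off and retry
    if 400 <= status_code < 500:
        return "rejected"  # other client errors will not succeed on retry
    return "retry"         # 5xx and anything unexpected: retry with backoff
```

Keeping the decision separate from the HTTP call makes it easy to unit-test and to adjust per subscriber.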
Node.js: Webhook Receiver with Signature Verification
// Input: incoming HTTP POST with webhook-id, webhook-timestamp, webhook-signature headers
// Output: 200 OK (accepted) or 401 Unauthorized (invalid signature)
const express = require('express');
const crypto = require('crypto');
const { Queue } = require('bullmq');

const app = express();
const webhookQueue = new Queue('webhooks');
const WEBHOOK_SECRET = process.env.WEBHOOK_SECRET;
const TOLERANCE_SEC = 300;

// express.raw keeps req.body as a Buffer holding the original bytes
app.use('/webhooks', express.raw({ type: 'application/json' }));

app.post('/webhooks', async (req, res) => {
  const msgId = req.headers['webhook-id'];
  const timestamp = req.headers['webhook-timestamp'];
  const sigHeader = req.headers['webhook-signature'];
  if (!msgId || !timestamp || !sigHeader)
    return res.status(401).json({ error: 'Missing webhook headers' });

  // A non-numeric timestamp yields NaN, which must be rejected explicitly
  const age = Math.abs(Math.floor(Date.now() / 1000) - parseInt(timestamp, 10));
  if (Number.isNaN(age) || age > TOLERANCE_SEC)
    return res.status(401).json({ error: 'Stale timestamp' });

  const toSign = `${msgId}.${timestamp}.${req.body.toString()}`;
  const expected = crypto
    .createHmac('sha256', Buffer.from(WEBHOOK_SECRET, 'base64'))
    .update(toSign).digest('base64');
  const isValid = sigHeader.split(' ').some(sig => {
    try {
      return crypto.timingSafeEqual(
        Buffer.from(expected), Buffer.from(sig.replace('v1,', ''))
      );
    } catch { return false; } // timingSafeEqual throws on length mismatch
  });
  if (!isValid) return res.status(401).json({ error: 'Invalid signature' });

  // jobId = msgId lets the queue drop duplicate deliveries of the same webhook
  await webhookQueue.add('process', {
    webhookId: msgId,
    payload: JSON.parse(req.body.toString()),
  }, { jobId: msgId });
  res.status(200).json({ received: true });
});

app.listen(3000);
Anti-Patterns
Wrong: Synchronous processing in the webhook handler
# BAD -- processing in the request handler blocks the response,
# causing timeouts that trigger retries and duplicate processing
@app.post("/webhooks")
def handle_webhook(request):
    payload = request.json()
    update_database(payload)          # Slow DB write
    send_email_notification(payload)  # External API call
    recalculate_billing(payload)      # CPU-intensive work
    return {"ok": True}  # Response after 45 seconds -> provider retries
Correct: Acknowledge immediately, process asynchronously
# GOOD -- return 200 immediately, process via background queue
# (signature verification omitted here for brevity)
@app.post("/webhooks")
async def handle_webhook(request):
    payload = await request.json()
    await task_queue.enqueue("process_webhook", payload)  # Fast: only enqueues
    return Response(status_code=200)  # Respond before any business logic runs
Wrong: Verifying signature over parsed JSON
// BAD -- JSON.parse then JSON.stringify changes whitespace/key order
app.post('/webhooks', express.json(), (req, res) => {
  const body = JSON.stringify(req.body); // Re-serialized, not original bytes!
  const sig = crypto.createHmac('sha256', secret).update(body).digest('hex');
  // The comparison fails whenever the re-serialized bytes differ from what the sender signed
});
Correct: Verify signature over raw request body
// GOOD -- capture raw bytes before any parsing
app.post('/webhooks', express.raw({ type: 'application/json' }), (req, res) => {
  const rawBody = req.body.toString(); // Original bytes from sender
  const sig = crypto.createHmac('sha256', Buffer.from(secret, 'base64'))
    .update(rawBody).digest('base64');
  // Compare against header using crypto.timingSafeEqual
});
Wrong: Using equality operator for signature comparison
# BAD -- string equality is not constant-time; enables timing attacks
if computed_signature == received_signature:
    process_webhook(payload)
Correct: Use constant-time comparison
# GOOD -- hmac.compare_digest is constant-time
import hmac

if hmac.compare_digest(computed_signature, received_signature):
    process_webhook(payload)
Wrong: No idempotency handling on the receiver
# BAD -- duplicate deliveries charge the customer multiple times
@app.post("/webhooks")
def handle(request):
    event = request.json()
    charge_customer(event["data"]["customer_id"], event["data"]["amount"])
    return {"ok": True}  # Processes EVERY delivery, including retries
Correct: Deduplicate using webhook ID before processing
# GOOD -- check if webhook already processed using the unique event ID
# r is a redis.Redis() client, as in step 5 of the Step-by-Step Guide
@app.post("/webhooks")
def handle(request):
    event = request.json()
    webhook_id = request.headers["webhook-id"]
    # SET NX succeeds only for the first delivery of this webhook ID
    if r.set(f"wh:{webhook_id}", "1", nx=True, ex=86400 * 14):
        charge_customer(event["data"]["customer_id"], event["data"]["amount"])
    return {"ok": True}  # Always return 200 to stop retries
Common Pitfalls
- Timeout-triggered retry storms: Webhook receivers that do heavy processing synchronously cause timeouts (>30s), triggering provider retries. Each retry also times out, creating an exponential cascade. Fix: return 200 immediately, enqueue to background worker. [src1]
- Re-serialized body breaks signatures: Parsing JSON and re-serializing it changes whitespace, key order, or encoding, producing bytes that don't match the original signed payload. Fix: capture and verify the raw request body bytes, never re-serialize. [src6]
- Missing idempotency on state-changing operations: At-least-once delivery means duplicates are guaranteed over time. Without deduplication, duplicate payments, notifications, or inventory changes occur. Fix: store webhook_id in Redis (SETNX + TTL) or database unique constraint before processing. [src5]
- No replay attack protection: Without timestamp verification, an attacker who captures a valid webhook can replay it indefinitely. Fix: verify webhook-timestamp is within a 5-minute window of current time. [src2]
- Hardcoded secrets in source code: Webhook signing secrets committed to version control are exposed to anyone with repo access. Fix: use environment variables or secret managers (Vault, AWS Secrets Manager) and rotate secrets periodically. [src3]
- Not handling key rotation gracefully: Changing the signing secret instantly invalidates all in-flight webhooks. Fix: support multiple signatures in the header during rotation; accept old and new key for a transition period. [src2]
- Ignoring HTTP 410 Gone responses: Continuing to retry against endpoints that return 410 wastes resources. Fix: treat 410 as a permanent failure, disable the subscription, and notify the subscriber. [src3]
- Unbounded payload sizes: Sending multi-megabyte webhook payloads causes receiver OOM errors and network timeouts. Fix: keep payloads under 20KB; send a reference ID and let the receiver fetch full data via API. [src2]
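The key-rotation fix above (multiple signatures in one header) might look like the following on the sender side. This is a sketch assuming base64-encoded secrets and the space-delimited "v1,<signature>" header format used elsewhere in this article; sign_with_all is a hypothetical helper name.

```python
import base64
import hashlib
import hmac


def sign_with_all(secrets_b64, msg_id, timestamp, payload: bytes) -> str:
    """Build a webhook-signature header with one v1 signature per active secret.

    During rotation, pass both the old and the new secret; receivers holding
    either one can verify the message.
    """
    signed = f"{msg_id}.{timestamp}.".encode("utf-8") + payload
    signatures = []
    for secret in secrets_b64:
        digest = hmac.new(base64.b64decode(secret), signed, hashlib.sha256).digest()
        signatures.append("v1," + base64.b64encode(digest).decode())
    return " ".join(signatures)
```

Once the transition period ends, the old secret is dropped from the list and the header returns to a single signature.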
Diagnostic Commands
# Test webhook endpoint connectivity
curl -X POST https://your-endpoint.com/webhooks \
-H "Content-Type: application/json" \
-H "webhook-id: msg_test123" \
-H "webhook-timestamp: $(date +%s)" \
-d '{"type":"test","data":{}}' -v
# Generate HMAC-SHA256 signature for testing
echo -n "msg_test123.$(date +%s).{\"type\":\"test\"}" | \
openssl dgst -sha256 -hmac "$(echo 'your_base64_secret' | base64 -d)" -binary | base64
# Monitor webhook delivery queue depth (Redis)
redis-cli LLEN webhook:delivery:queue
# Check dead letter queue size
redis-cli LLEN webhook:dlq
# Verify webhook receiver is responding quickly
curl -o /dev/null -s -w "Response time: %{time_total}s\n" \
-X POST https://your-endpoint.com/webhooks \
-H "Content-Type: application/json" -d '{}'
Version History & Compatibility
| Standard / Provider | Current Version | Key Changes | Notes |
|---|---|---|---|
| Standard Webhooks | v1.0 (2023) | First formal spec: webhook-id, webhook-timestamp, webhook-signature headers | Backed by Svix, adopted by Livekit, Clerk, Brex |
| Stripe Webhooks | v2 (2024) | Thin events (fetch full object via API), reduced payload sizes | v1 signatures still supported |
| GitHub Webhooks | v3 (2023) | Repository, organization, and app-level webhooks | Uses X-Hub-Signature-256 header |
| AWS EventBridge | 2023 | Partner event sources, API destinations with OAuth | Managed alternative to custom webhook infrastructure |
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| You need to notify external systems of events in near-real-time | You need sub-100ms latency guaranteed | WebSockets or gRPC streaming |
| Subscribers are third-party systems you don't control | Both systems are within your infrastructure | Internal message queue (SQS, RabbitMQ, Kafka) |
| Event volume is moderate (<50K events/sec) | Volume exceeds 100K events/sec sustained | Event streaming (Kafka, Pulsar) with consumer pull |
| You want a simple HTTP-based integration pattern | You need guaranteed exactly-once processing | Transactional outbox + consumer with two-phase commit |
| Subscribers need to process events at their own pace | You need strict global ordering of all events | Partitioned log (Kafka) with ordered consumers |
Important Caveats
- Webhooks provide at-least-once delivery, never exactly-once. Receivers must always implement idempotency regardless of how reliable the sender is.
- Network partitions, DNS failures, and TLS certificate issues can cause silent delivery failures. Monitor delivery success rates per endpoint and alert on anomalies.
- The Standard Webhooks specification is gaining adoption but is not yet an IETF RFC. Some providers (Stripe, GitHub) use proprietary header formats. Check your provider's specific documentation.
- Circuit breakers must be per-endpoint, not global. One failing subscriber should not block delivery to healthy subscribers.
- Webhook secrets must be unique per subscriber endpoint. Sharing a signing secret across multiple subscribers means compromising one subscriber compromises all of them.