How to Design a Reliable Webhooks System

Type: Software Reference | Confidence: 0.93 | Sources: 7 | Verified: 2026-02-23 | Freshness: 2026-02-23

Quick Reference

| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Event Ingestion API | Accepts event triggers from internal services | REST API, gRPC, internal SDK | Horizontal load balancers, rate limiting |
| Event Store | Persists events durably before dispatch | PostgreSQL, MySQL, DynamoDB | Partitioned by tenant, time-based archival |
| Subscription Registry | Maps events to endpoint URLs + secrets | PostgreSQL, Redis (cache layer) | Read replicas, in-memory cache with TTL |
| Dispatcher / Scheduler | Reads pending events, fans out to delivery workers | Custom service, Temporal, cron | Partition by tenant/event-type |
| Delivery Workers | Execute HTTP POST to subscriber endpoints | Worker pool, Lambda, Cloud Run | Auto-scaling worker pool, concurrency limits per endpoint |
| Retry Queue | Holds failed deliveries for exponential backoff | SQS, RabbitMQ, Redis sorted sets, PostgreSQL | Visibility timeout, DLQ after max retries |
| Signature Module | Signs payloads with HMAC-SHA256 per endpoint secret | crypto stdlib, Standard Webhooks SDK | Stateless, scales with workers |
| Idempotency Store | Tracks processed webhook IDs on receiver side | Redis (SETNX + TTL), PostgreSQL unique index | TTL-based cleanup (7-30 days) |
| Dead Letter Queue (DLQ) | Captures permanently failed deliveries for inspection | SQS DLQ, database table, S3 archive | Manual review dashboard, bulk replay |
| Circuit Breaker | Disables delivery to consistently failing endpoints | In-memory state machine per endpoint | Threshold: 5+ consecutive failures, half-open probe every 5 min |
| Monitoring & Alerting | Tracks delivery rates, latency, failure patterns | Prometheus + Grafana, Datadog, CloudWatch | Per-endpoint dashboards, anomaly alerts |
| Rate Limiter | Prevents overwhelming subscriber endpoints | Token bucket per endpoint | Configurable per subscriber (default: 1000/min) |
| Event Log / Audit Trail | Records all delivery attempts for debugging | Append-only table, ClickHouse, Elasticsearch | Time-partitioned, 30-90 day retention |
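The Rate Limiter row above names a token bucket per endpoint. A minimal in-process sketch of that idea (class and parameter names are illustrative, not from any specific library):

```python
import time

class TokenBucket:
    """Per-endpoint token bucket: refills `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

For the table's default of 1000/min, a dispatcher would hold one `TokenBucket(rate=1000/60, capacity=...)` per endpoint and skip (or delay) deliveries when `allow()` returns False. A shared Redis-backed bucket would be needed once workers span multiple hosts.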

Decision Tree

START: Building a webhooks system
|
+-- Are you the SENDER (provider) or RECEIVER (consumer)?
|   |
|   +-- SENDER -->
|   |   |
|   |   +-- Expected volume?
|   |   |   +-- <100 events/sec --> Simple: PostgreSQL + worker thread
|   |   |   +-- 100-10K events/sec --> Standard: Message queue + worker pool
|   |   |   +-- >10K events/sec --> Scale: Partitioned queue + auto-scaling workers + circuit breakers
|   |   |
|   |   +-- Need strict ordering?
|   |       +-- YES, per endpoint --> FIFO queue partitioned by endpoint_id
|   |       +-- NO --> Parallel delivery workers (higher throughput)
|   |
|   +-- RECEIVER -->
|       |
|       +-- Verify signature (HMAC-SHA256) --> reject if invalid
|       +-- Check idempotency key --> skip if already processed
|       +-- Enqueue for async processing --> return 200 immediately
|       +-- Process from queue --> update state idempotently
|
+-- Need a managed solution?
    +-- YES --> Svix, Hookdeck, or cloud provider (AWS EventBridge, Google Eventarc)
    +-- NO --> Build with the architecture in Quick Reference above
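The per-endpoint FIFO branch in the tree depends on routing every event for a given endpoint to the same queue partition. One common way to do that is stable hash partitioning; a small sketch (the partition count is an arbitrary example):

```python
import hashlib

NUM_PARTITIONS = 16  # example value; match your queue's partition count

def partition_for(endpoint_id: str) -> int:
    """Stable hash so all events for one endpoint land on the same FIFO partition."""
    digest = hashlib.sha256(endpoint_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
```

Because the mapping is deterministic, ordering holds per endpoint while different endpoints still fan out across partitions for throughput.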

Step-by-Step Guide

1. Define event schema and webhook payload format

Adopt the Standard Webhooks specification for interoperability. Every webhook payload needs a unique event ID, a timestamp, an event type, and the event data. [src2]

{
  "webhook_id": "msg_2Yh8KlR4vQHPMnZx",
  "type": "invoice.paid",
  "timestamp": "2026-02-23T12:00:00Z",
  "data": {
    "invoice_id": "inv_abc123",
    "amount": 9900,
    "currency": "usd"
  }
}

Verify: Payload is valid JSON, under 20KB, contains webhook_id + type + timestamp + data fields.
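The verification step above can be automated with a small gate at ingestion time. A sketch, assuming the 20KB cap and the four required fields from this section (the function name is illustrative):

```python
import json

REQUIRED_FIELDS = ("webhook_id", "type", "timestamp", "data")
MAX_PAYLOAD_BYTES = 20 * 1024

def validate_payload(raw: bytes) -> dict:
    """Reject payloads that are too large, not JSON, or missing required fields."""
    if len(raw) > MAX_PAYLOAD_BYTES:
        raise ValueError("payload exceeds 20KB")
    event = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    missing = [f for f in REQUIRED_FIELDS if f not in event]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return event
```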

2. Implement HMAC-SHA256 payload signing

Sign every outbound webhook so receivers can verify authenticity. The signature covers the message ID, timestamp, and body to prevent tampering and replay attacks. [src1] [src2]

import hmac, hashlib, time, base64

def sign_webhook(payload_bytes, secret, msg_id):
    timestamp = str(int(time.time()))
    to_sign = f"{msg_id}.{timestamp}.{payload_bytes.decode('utf-8')}"
    signature = hmac.new(
        base64.b64decode(secret), to_sign.encode("utf-8"), hashlib.sha256
    ).digest()
    return {
        "webhook-id": msg_id,
        "webhook-timestamp": timestamp,
        "webhook-signature": f"v1,{base64.b64encode(signature).decode()}"
    }

Verify: echo -n "msg_id.timestamp.body" | openssl dgst -sha256 -hmac "decoded_secret" -binary | base64 matches the signature header value.

3. Build the delivery worker with retry logic

The delivery worker sends the HTTP POST with signed headers and retries on failure using exponential backoff with jitter. Cap retries at a total window of ~3 days. [src3] [src7]

import asyncio, random
import httpx

RETRY_SCHEDULE = [0, 5, 300, 1800, 7200, 18000, 36000, 50400, 72000, 86400]

async def deliver_webhook(url, payload, headers, max_retries=10):
    async with httpx.AsyncClient(timeout=30.0) as client:
        for attempt in range(max_retries):
            if attempt > 0:
                delay = RETRY_SCHEDULE[min(attempt, len(RETRY_SCHEDULE) - 1)]
                await asyncio.sleep(delay + random.uniform(0, delay * 0.1))  # jitter
            try:
                resp = await client.post(url, content=payload, headers=headers)
                if 200 <= resp.status_code < 300:
                    return True
                # Any other status falls through to the next retry
            except httpx.RequestError:
                pass
    return False  # Move to DLQ

Verify: Simulate a failing endpoint and confirm retries follow the schedule. After max retries, the event should appear in the dead letter queue.
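As a sanity check on the ~3-day cap, the delays in the schedule above sum to just over three days:

```python
RETRY_SCHEDULE = [0, 5, 300, 1800, 7200, 18000, 36000, 50400, 72000, 86400]

total_seconds = sum(RETRY_SCHEDULE)   # 272105 seconds ≈ 3.1 days
total_days = round(total_seconds / 86400, 1)
```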

4. Implement receiver-side signature verification

Receivers must verify the HMAC signature before processing any webhook. Reject requests with invalid signatures or stale timestamps. [src1] [src6]

const crypto = require('crypto');

function verifyWebhook(req, secret, toleranceSec = 300) {
  const msgId = req.headers['webhook-id'];
  const timestamp = req.headers['webhook-timestamp'];
  const now = Math.floor(Date.now() / 1000);
  if (Math.abs(now - parseInt(timestamp, 10)) > toleranceSec)
    throw new Error('Stale timestamp');

  const toSign = `${msgId}.${timestamp}.${req.rawBody}`;
  const expected = crypto
    .createHmac('sha256', Buffer.from(secret, 'base64'))
    .update(toSign).digest('base64');

  // timingSafeEqual throws if buffer lengths differ, so guard each candidate
  const isValid = req.headers['webhook-signature'].split(' ').some(sig => {
    try {
      return crypto.timingSafeEqual(
        Buffer.from(expected), Buffer.from(sig.replace('v1,', ''))
      );
    } catch { return false; }
  });
  if (!isValid) throw new Error('Invalid signature');
  return JSON.parse(req.rawBody);
}

Verify: Send a webhook with a tampered body and confirm it returns 401. Send a valid webhook and confirm it returns 200.

5. Add idempotency on the receiver side

Store processed webhook IDs to prevent duplicate processing. Use Redis SETNX with a 7-30 day TTL or a database unique constraint. [src5]

import redis
r = redis.Redis()

def process_webhook_idempotently(webhook_id, payload):
    lock_key = f"webhook:processed:{webhook_id}"
    if not r.set(lock_key, "1", nx=True, ex=86400 * 14):
        return False  # Already processed
    try:
        handle_event(payload)
        return True
    except Exception:
        r.delete(lock_key)
        raise

Verify: Send the same webhook ID twice. First call processes; second call returns immediately without re-processing.

6. Implement circuit breakers per endpoint

When an endpoint consistently fails, stop sending to avoid wasting resources and triggering alert fatigue. Probe periodically to detect recovery. [src4] [src5]

import time

class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, recovery_timeout=300):
        self.state = self.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = 0

    def can_send(self):
        if self.state == self.CLOSED: return True
        if self.state == self.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = self.HALF_OPEN
                return True
            return False
        return True  # HALF_OPEN: allow probe

    def record_success(self):
        self.state = self.CLOSED
        self.failure_count = 0

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        # A failed half-open probe re-opens immediately
        if self.failure_count >= self.failure_threshold or self.state == self.HALF_OPEN:
            self.state = self.OPEN
Verify: Trigger 5 consecutive failures for an endpoint. Confirm the circuit opens and no further deliveries are attempted until the recovery timeout.

Code Examples

Python: Webhook Sender with Retry and Signing

# Input:  event dict, subscriber endpoint URL, signing secret
# Output: delivery result (success/failure/dlq)

import hmac, hashlib, base64, time, json, uuid
import httpx, asyncio, random

RETRY_DELAYS = [0, 5, 300, 1800, 7200, 18000, 36000, 50400, 72000, 86400]

async def send_webhook(event: dict, endpoint_url: str, secret: str) -> str:
    msg_id = f"msg_{uuid.uuid4().hex[:16]}"
    payload = json.dumps(event, separators=(",", ":")).encode("utf-8")
    async with httpx.AsyncClient(timeout=30.0) as client:
        for attempt in range(len(RETRY_DELAYS)):
            if attempt > 0:
                delay = RETRY_DELAYS[attempt]
                await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
            timestamp = str(int(time.time()))
            to_sign = f"{msg_id}.{timestamp}.{payload.decode()}"
            sig = hmac.new(
                base64.b64decode(secret), to_sign.encode(), hashlib.sha256
            ).digest()
            headers = {
                "Content-Type": "application/json",
                "webhook-id": msg_id,
                "webhook-timestamp": timestamp,
                "webhook-signature": f"v1,{base64.b64encode(sig).decode()}",
            }
            try:
                resp = await client.post(endpoint_url, content=payload, headers=headers)
                if 200 <= resp.status_code < 300: return "delivered"
                if 400 <= resp.status_code < 500 and resp.status_code not in (408, 429):
                    return "rejected"
            except httpx.RequestError:
                pass
    return "dlq"

Node.js: Webhook Receiver with Signature Verification

// Input:  incoming HTTP POST with webhook-id, webhook-timestamp, webhook-signature headers
// Output: 200 OK (accepted) or 401 Unauthorized (invalid signature)

const express = require('express');       // npm: express
const crypto = require('crypto');
const { Queue } = require('bullmq');      // npm: bullmq

const app = express();
const webhookQueue = new Queue('webhooks');
const WEBHOOK_SECRET = process.env.WEBHOOK_SECRET;
const TOLERANCE_SEC = 300;

app.use('/webhooks', express.raw({ type: 'application/json' }));

app.post('/webhooks', async (req, res) => {
  const msgId = req.headers['webhook-id'];
  const timestamp = req.headers['webhook-timestamp'];
  const sigHeader = req.headers['webhook-signature'];

  const age = Math.abs(Math.floor(Date.now() / 1000) - parseInt(timestamp));
  if (age > TOLERANCE_SEC) return res.status(401).json({ error: 'Stale timestamp' });

  const toSign = `${msgId}.${timestamp}.${req.body.toString()}`;
  const expected = crypto
    .createHmac('sha256', Buffer.from(WEBHOOK_SECRET, 'base64'))
    .update(toSign).digest('base64');

  const isValid = sigHeader.split(' ').some(sig => {
    try {
      return crypto.timingSafeEqual(
        Buffer.from(expected), Buffer.from(sig.replace('v1,', ''))
      );
    } catch { return false; }
  });
  if (!isValid) return res.status(401).json({ error: 'Invalid signature' });

  await webhookQueue.add('process', {
    webhookId: msgId,
    payload: JSON.parse(req.body.toString()),
  }, { jobId: msgId });

  res.status(200).json({ received: true });
});

app.listen(3000);

Anti-Patterns

Wrong: Synchronous processing in the webhook handler

# BAD -- processing in the request handler blocks the response,
# causing timeouts that trigger retries and duplicate processing
@app.post("/webhooks")
def handle_webhook(request):
    payload = request.json()
    update_database(payload)          # Slow DB write
    send_email_notification(payload)  # External API call
    recalculate_billing(payload)      # CPU-intensive work
    return {"ok": True}               # Response after 45 seconds -> provider retries

Correct: Acknowledge immediately, process asynchronously

# GOOD -- return 200 immediately, process via background queue
@app.post("/webhooks")
async def handle_webhook(request):
    payload = request.json()
    await task_queue.enqueue("process_webhook", payload)  # Sub-millisecond
    return Response(status_code=200)                       # Instant response

Wrong: Verifying signature over parsed JSON

// BAD -- JSON.parse then JSON.stringify changes whitespace/key order
app.post('/webhooks', express.json(), (req, res) => {
  const body = JSON.stringify(req.body);  // Re-serialized, not original bytes!
  const sig = crypto.createHmac('sha256', secret).update(body).digest('hex');
  // This comparison will ALWAYS fail
});

Correct: Verify signature over raw request body

// GOOD -- capture raw bytes before any parsing
app.post('/webhooks', express.raw({ type: 'application/json' }), (req, res) => {
  const rawBody = req.body.toString();  // Original bytes from sender
  const sig = crypto.createHmac('sha256', Buffer.from(secret, 'base64'))
    .update(rawBody).digest('base64');
  // Compare against header using crypto.timingSafeEqual
});

Wrong: Using equality operator for signature comparison

# BAD -- string equality is not constant-time; enables timing attacks
if computed_signature == received_signature:
    process_webhook(payload)

Correct: Use constant-time comparison

# GOOD -- hmac.compare_digest is constant-time
import hmac
if hmac.compare_digest(computed_signature, received_signature):
    process_webhook(payload)

Wrong: No idempotency handling on the receiver

# BAD -- duplicate deliveries charge the customer multiple times
@app.post("/webhooks")
def handle(request):
    event = request.json()
    charge_customer(event["data"]["customer_id"], event["data"]["amount"])
    return {"ok": True}  # Processes EVERY delivery, including retries

Correct: Deduplicate using webhook ID before processing

# GOOD -- check if webhook already processed using the unique event ID
@app.post("/webhooks")
def handle(request):
    event = request.json()
    webhook_id = request.headers["webhook-id"]
    if redis.set(f"wh:{webhook_id}", "1", nx=True, ex=86400 * 14):
        charge_customer(event["data"]["customer_id"], event["data"]["amount"])
    return {"ok": True}  # Always return 200 to stop retries

Diagnostic Commands

# Test webhook endpoint connectivity
curl -X POST https://your-endpoint.com/webhooks \
  -H "Content-Type: application/json" \
  -H "webhook-id: msg_test123" \
  -H "webhook-timestamp: $(date +%s)" \
  -d '{"type":"test","data":{}}' -v

# Generate HMAC-SHA256 signature for testing
echo -n "msg_test123.$(date +%s).{\"type\":\"test\"}" | \
  openssl dgst -sha256 -hmac "$(echo 'your_base64_secret' | base64 -d)" -binary | base64

# Monitor webhook delivery queue depth (Redis)
redis-cli LLEN webhook:delivery:queue

# Check dead letter queue size
redis-cli LLEN webhook:dlq

# Verify webhook receiver is responding quickly
curl -o /dev/null -s -w "Response time: %{time_total}s\n" \
  -X POST https://your-endpoint.com/webhooks \
  -H "Content-Type: application/json" -d '{}'

Version History & Compatibility

| Standard / Provider | Current Version | Key Changes | Notes |
|---|---|---|---|
| Standard Webhooks | v1.0 (2023) | First formal spec: webhook-id, webhook-timestamp, webhook-signature headers | Backed by Svix; adopted by LiveKit, Clerk, Brex |
| Stripe Webhooks | v2 (2024) | Thin events (fetch full object via API), reduced payload sizes | v1 signatures still supported |
| GitHub Webhooks | v3 (2023) | Repository, organization, and app-level webhooks | Uses X-Hub-Signature-256 header |
| AWS EventBridge | 2023 | Partner event sources, API destinations with OAuth | Managed alternative to custom webhook infrastructure |

When to Use / When Not to Use

| Use When | Don't Use When | Use Instead |
|---|---|---|
| You need to notify external systems of events in near-real-time | You need sub-100ms latency guaranteed | WebSockets or gRPC streaming |
| Subscribers are third-party systems you don't control | Both systems are within your infrastructure | Internal message queue (SQS, RabbitMQ, Kafka) |
| Event volume is moderate (<50K events/sec) | Volume exceeds 100K events/sec sustained | Event streaming (Kafka, Pulsar) with consumer pull |
| You want a simple HTTP-based integration pattern | You need guaranteed exactly-once processing | Transactional outbox + consumer with two-phase commit |
| Subscribers need to process events at their own pace | You need strict global ordering of all events | Partitioned log (Kafka) with ordered consumers |
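The transactional outbox mentioned above pairs the business write and the event record in one database transaction, so an event exists exactly when the state change committed. A minimal sketch using sqlite3 (table and function names are illustrative; a real system would use its primary database and a polling dispatcher):

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (
        event_id TEXT PRIMARY KEY,
        type     TEXT,
        payload  TEXT,
        sent     INTEGER DEFAULT 0
    );
""")

def mark_invoice_paid(invoice_id: str) -> None:
    # One transaction: both rows commit together, or neither does
    with conn:
        conn.execute("INSERT OR REPLACE INTO invoices VALUES (?, 'paid')", (invoice_id,))
        conn.execute(
            "INSERT INTO outbox (event_id, type, payload) VALUES (?, ?, ?)",
            (f"msg_{uuid.uuid4().hex[:16]}", "invoice.paid",
             json.dumps({"invoice_id": invoice_id})),
        )

mark_invoice_paid("inv_abc123")
# A separate dispatcher polls `sent = 0` rows and delivers them as webhooks
pending = conn.execute("SELECT type FROM outbox WHERE sent = 0").fetchall()
```

Delivery itself is still at-least-once; the outbox only guarantees that no committed state change is silently dropped before dispatch.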
