Error Handling and Dead Letter Queues for ERP Integrations: Retry Strategies

Type: ERP Integration | System: Cross-Platform (AWS SQS, Azure Service Bus, Kafka, RabbitMQ) | Confidence: 0.92 | Sources: 7 | Verified: 2026-03-02 | Freshness: evolving

TL;DR

Classify every failed ERP message before acting: retry transient errors (429, 5xx, timeouts) with exponential backoff plus jitter behind a circuit breaker, and route poison messages (400, 422, schema mismatches) straight to a dead letter queue wrapped in a diagnostic envelope. Monitor DLQ depth and ingestion rate so dead-lettered messages are triaged and replayed rather than silently lost.

System Profile

This card covers cross-platform error handling patterns applicable to all major ERP integrations, regardless of the specific ERP system (Salesforce, SAP, Oracle, NetSuite, Dynamics 365, Workday) or middleware platform. The patterns are implemented at the integration middleware layer — the message broker or iPaaS platform that sits between systems.

| System | Role | DLQ Support | Retry Support | Circuit Breaker |
|---|---|---|---|---|
| AWS SQS/SNS | Message broker | Native (redrive policy) | maxReceiveCount (1-1000) | Manual (Step Functions / custom) |
| Azure Service Bus | Message broker | Native (subqueue per entity) | MaxDeliveryCount (default 10) | Manual (custom implementation) |
| Apache Kafka | Event streaming | Via error topic convention | Consumer-side retry logic | Manual (custom implementation) |
| RabbitMQ | Message broker | Native (dead-letter exchange) | x-delivery-limit header | Manual (custom implementation) |
| MuleSoft Anypoint | iPaaS | Built-in DLQ connector | Configurable retry policies | Built-in circuit breaker scope |
| Boomi | iPaaS | Error handling shapes | Configurable retry | Custom via process routes |
| Workato | iPaaS | Error monitoring recipes | Auto-retry with configurable count | Custom via error handlers |

API Surfaces & Capabilities

| Pattern | Type | Best For | Complexity | Latency Impact | Data Loss Risk |
|---|---|---|---|---|---|
| Immediate retry | Retry | Transient network blips | Low | Minimal (ms) | Medium (no backoff) |
| Exponential backoff | Retry | Rate limits, temporary overload | Medium | Increasing (1s-60s) | Low |
| Exponential backoff + jitter | Retry | High-concurrency retries | Medium | Increasing (randomized) | Low |
| Circuit breaker | Protection | Prolonged service outages | Medium | Fast-fail when open | None (preserves messages) |
| Dead letter queue | Error isolation | Poison messages, schema errors | Medium | None (async) | Very low |
| Saga with compensation | Transaction | Multi-system writes | High | Variable | Very low |
| Outbox pattern | Delivery guarantee | Exactly-once publish | High | Minimal | Near zero |

Rate Limits & Quotas

Per-Platform DLQ Limits

| Platform | Max Retention | Max Message Size | Max DLQ Depth | Reprocessing Method |
|---|---|---|---|---|
| AWS SQS | 14 days | 256 KB (2 GB with S3) | Unlimited | Redrive to source queue |
| Azure Service Bus (Standard) | Unlimited | 256 KB | 5 GB per entity | Receive + resubmit |
| Azure Service Bus (Premium) | Unlimited | 100 MB | 80 GB per entity | Receive + resubmit |
| Apache Kafka (error topic) | Topic retention config | 1 MB default (configurable) | Partition-based | Consume from error topic |
| RabbitMQ | TTL-based (configurable) | 128 MB default | Memory/disk-based | Consume from DLX queue |

Retry Budget Guidelines

| Integration Type | Max Retries | Initial Delay | Max Delay | Backoff Factor | Jitter |
|---|---|---|---|---|---|
| Real-time API (user-facing) | 3-5 | 500ms | 30s | 2x | Full jitter |
| Batch/bulk processing | 5-10 | 1s | 5min | 2x | Equal jitter |
| Event-driven (CDC, webhooks) | 5-8 | 1s | 2min | 2x | Full jitter |
| File-based import (FBDI, EIB) | 3 | 30s | 10min | 3x | None |
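These budgets translate directly into configuration. A minimal sketch, using the upper end of each retry range; the dict keys and function name are illustrative, not any platform's API:

```python
# Retry budgets from the table above, expressed as configuration.
# Delays are in seconds; "jitter" names the randomization strategy.
RETRY_BUDGETS = {
    "real_time":    {"max_retries": 5,  "initial_delay": 0.5,  "max_delay": 30.0,  "factor": 2.0, "jitter": "full"},
    "batch":        {"max_retries": 10, "initial_delay": 1.0,  "max_delay": 300.0, "factor": 2.0, "jitter": "equal"},
    "event_driven": {"max_retries": 8,  "initial_delay": 1.0,  "max_delay": 120.0, "factor": 2.0, "jitter": "full"},
    "file_import":  {"max_retries": 3,  "initial_delay": 30.0, "max_delay": 600.0, "factor": 3.0, "jitter": "none"},
}

def get_retry_budget(integration_type):
    """Look up a retry budget, falling back to the conservative batch profile."""
    return RETRY_BUDGETS.get(integration_type, RETRY_BUDGETS["batch"])
```

Keeping the budget in one lookup table means a pipeline can tune retries per integration type without code changes.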

Circuit Breaker Thresholds

| Parameter | Real-time Integration | Batch Integration | Event-driven |
|---|---|---|---|
| Failure threshold | 5 failures in 60s | 3 failures in 5min | 5 failures in 60s |
| Open state duration | 30s | 5min | 60s |
| Half-open probe count | 1 | 1-3 | 1 |
| Success threshold to close | 2 consecutive | 3 consecutive | 2 consecutive |
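These thresholds can likewise be held as per-profile configuration and fed into a breaker constructor. A sketch with illustrative key names (durations in seconds):

```python
# Circuit breaker thresholds from the table above, per integration profile.
BREAKER_PROFILES = {
    "real_time":    {"failure_threshold": 5, "failure_window": 60,
                     "open_duration": 30,  "half_open_probes": 1,
                     "close_after_successes": 2},
    "batch":        {"failure_threshold": 3, "failure_window": 300,
                     "open_duration": 300, "half_open_probes": 3,
                     "close_after_successes": 3},
    "event_driven": {"failure_threshold": 5, "failure_window": 60,
                     "open_duration": 60,  "half_open_probes": 1,
                     "close_after_successes": 2},
}
```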

Authentication

N/A — this is a pattern-level card. Authentication is handled at the ERP API layer. See system-specific cards for auth flows per ERP vendor.

Constraints

Integration Pattern Decision Tree

```
START — ERP integration message fails processing
├── What type of error?
│   ├── Transient (429 rate limit, 503 unavailable, network timeout)
│   │   ├── Is circuit breaker OPEN?
│   │   │   ├── YES → Fast-fail, queue for retry after circuit reset
│   │   │   └── NO ↓
│   │   ├── Retry count < max retries?
│   │   │   ├── YES → Retry with exponential backoff + jitter
│   │   │   │   ├── delay = min(initialDelay * 2^attempt, maxDelay)
│   │   │   │   └── actualDelay = random(0, delay) [full jitter]
│   │   │   └── NO → Move to Dead Letter Queue with full context
│   │   └── Did retry succeed?
│   │       ├── YES → Mark success, reset failure counter
│   │       └── NO → Increment failure counter, check circuit breaker
│   ├── Non-transient / Poison (400, 422, schema mismatch)
│   │   └── Route IMMEDIATELY to DLQ — retries will never succeed
│   ├── Partial success (bulk: some records succeed, some fail)
│   │   ├── Extract failed records from response
│   │   └── Route failed-record message through retry pipeline
│   └── Authentication error (401, 403)
│       ├── Token expired? → Refresh token, retry once
│       └── Permissions changed? → Alert + DLQ (do not retry)
├── DLQ message processing
│   ├── Automated triage: classify, group, auto-resolve known patterns
│   └── Manual review: fix data, replay, or purge
└── Monitoring
    ├── DLQ depth: alert at > 100, page at > 1000
    ├── DLQ ingestion rate: alert if > 1% of total throughput
    └── Circuit breaker state changes: log every transition
```
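The per-message branches of this tree condense into a single pure function. A sketch, assuming the three error classes used elsewhere in this card ("transient", "poison", "auth_expired"); the function name and signature are illustrative:

```python
import random

def next_action(error_class, attempt, max_retries, circuit_open,
                initial_delay=1.0, max_delay=60.0):
    """Return (action, delay_seconds) for a failed message, per the tree above."""
    if error_class == "poison":
        return ("dlq", 0.0)                    # retries will never succeed
    if error_class == "auth_expired":
        # Refresh the token and retry exactly once; then dead-letter.
        return ("refresh_and_retry", 0.0) if attempt == 0 else ("dlq", 0.0)
    if circuit_open:
        return ("fast_fail", 0.0)              # queue for retry after circuit reset
    if attempt >= max_retries:
        return ("dlq", 0.0)
    delay = min(initial_delay * 2 ** attempt, max_delay)
    return ("retry", random.uniform(0, delay))  # full jitter
```

A worker loop calls this after every failure and only needs to know how to execute the five actions.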

Quick Reference

| Error Type | Retryable? | Strategy | Max Retries | DLQ Action |
|---|---|---|---|---|
| HTTP 429 (Rate Limit) | Yes | Backoff, respect Retry-After | 5-10 | Reprocess after cooldown |
| HTTP 500 (Server Error) | Yes | Exponential backoff + jitter | 5 | Investigate server-side |
| HTTP 502/503/504 (Gateway) | Yes | Exponential backoff | 5 | Reprocess after recovery |
| HTTP 400 (Bad Request) | No | Immediate DLQ | 0 | Fix payload, resubmit |
| HTTP 401 (Unauthorized) | Once | Refresh token, retry once | 1 | Rotate credentials |
| HTTP 403 (Forbidden) | No | Immediate DLQ | 0 | Fix permissions |
| HTTP 404 (Not Found) | No | Immediate DLQ | 0 | Fix resource reference |
| HTTP 409 (Conflict) | Conditional | Retry with conflict resolution | 3 | Manual merge |
| HTTP 422 (Validation) | No | Immediate DLQ | 0 | Fix data, resubmit |
| Connection timeout | Yes | Exponential backoff | 5 | Check network/firewall |
| SSL/TLS failure | No | Immediate DLQ | 0 | Fix certificates |
| Schema mismatch | No | Immediate DLQ | 0 | Update schema, resubmit |

Step-by-Step Integration Guide

1. Classify errors into retryable vs non-retryable

Before implementing retry logic, establish a clear error classification. Retrying non-retryable errors wastes resources and delays DLQ processing. [src2]

```python
RETRYABLE_ERRORS = {429, 500, 502, 503, 504}
NON_RETRYABLE_ERRORS = {400, 401, 403, 404, 405, 409, 422}

def classify_error(status_code, error_body):
    # error_body is reserved for message-level checks (e.g. ERP error codes)
    if status_code in RETRYABLE_ERRORS:
        return "transient"
    if status_code == 401:
        return "auth_expired"  # Retry once after token refresh
    if status_code in NON_RETRYABLE_ERRORS:
        return "poison"  # Route to DLQ immediately
    if status_code >= 500:
        return "transient"
    return "poison"
```

Verify: Run classifier against last 30 days of integration logs → expected: <5% misclassification rate.

2. Implement exponential backoff with jitter

The backoff formula prevents retry storms. Full jitter distributes retries evenly across the delay window. [src3]

```python
import random

def exponential_backoff_with_jitter(attempt, initial_delay=1.0,
                                     max_delay=60.0, factor=2.0):
    exponential_delay = initial_delay * (factor ** attempt)
    capped_delay = min(exponential_delay, max_delay)
    return random.uniform(0, capped_delay)  # Full jitter
```

Verify: exponential_backoff_with_jitter(0) returns 0-1.0; exponential_backoff_with_jitter(5) returns 0-32.0.

3. Implement the circuit breaker

Prevents hammering an ERP API that is already down — avoids wasting API quota and prevents cascading failures. [src4]

```python
import time

class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, reset_timeout=30,
                 success_threshold=2):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.success_threshold = success_threshold
        self.state = self.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None

    def can_execute(self):
        if self.state == self.CLOSED:
            return True
        if self.state == self.OPEN:
            if time.time() - self.last_failure_time >= self.reset_timeout:
                self.state = self.HALF_OPEN
                return True
            return False
        return True  # half_open: allow probe requests

    def record_success(self):
        if self.state == self.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = self.CLOSED
                self.failure_count = self.success_count = 0
        else:
            self.failure_count = 0  # healthy traffic resets the counter

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == self.HALF_OPEN or \
                self.failure_count >= self.failure_threshold:
            self.state = self.OPEN
            self.success_count = 0
```

Verify: Trigger 5 failures → can_execute() returns False. Wait reset_timeout → returns True (half-open).

4. Build the retry-with-DLQ pipeline

Combine error classification, backoff, and circuit breaker into a unified pipeline. [src1, src2]

```python
import time
import uuid

class ERPRetryPipeline:
    def __init__(self, erp_client, circuit_breaker, dlq_client, max_retries=5):
        self.erp = erp_client      # ERP API client exposing send()
        self.cb = circuit_breaker  # CircuitBreaker from step 3
        self.dlq = dlq_client      # DLQ client exposing send()
        self.max_retries = max_retries

    def process_message(self, message):
        idempotency_key = message.get("idempotency_key") or str(uuid.uuid4())
        for attempt in range(self.max_retries + 1):
            if not self.cb.can_execute():
                return self._send_to_dlq(message, attempt,
                    "circuit_breaker_open", "ERP unavailable")
            try:
                self.erp.send(message)
                self.cb.record_success()
                return {"status": "success", "attempt": attempt}
            except ERPAPIError as e:  # raised by the ERP client library
                error_type = classify_error(e.status_code, e.body)
                if error_type == "poison":
                    return self._send_to_dlq(message, attempt,
                        f"http_{e.status_code}", str(e.body))
                self.cb.record_failure()
                if attempt < self.max_retries:
                    time.sleep(exponential_backoff_with_jitter(attempt))
        return self._send_to_dlq(message, self.max_retries,
            "max_retries_exceeded", "All retries exhausted")

    def _send_to_dlq(self, message, attempts, error_code, error_detail):
        self.dlq.send({"original_message": message, "error_code": error_code,
                       "error_detail": error_detail, "attempt_count": attempts})
        return {"status": "dead_lettered", "error_code": error_code}
```

Verify: Mock ERP returns 503 three times then 200 → pipeline returns {"status": "success", "attempt": 3}.

5. Configure platform-specific DLQ

Set up the dead letter queue on your chosen message broker. [src5, src7]

```shell
# AWS SQS: create the DLQ (retention in seconds; 1209600 = 14 days)
aws sqs create-queue --queue-name erp-integration-dlq \
  --attributes '{"MessageRetentionPeriod":"1209600"}'

# AWS SQS: attach the redrive policy to the SOURCE queue so messages
# move to the DLQ after 5 failed receives (substitute your DLQ ARN)
aws sqs set-queue-attributes --queue-url "$SOURCE_QUEUE_URL" \
  --attributes '{"RedrivePolicy":"{\"deadLetterTargetArn\":\"<dlq-arn>\",\"maxReceiveCount\":\"5\"}"}'

# Azure Service Bus: DLQ is automatic (subqueue per entity)
az servicebus queue update --name erp-integration \
  --namespace-name mybus --resource-group myrg \
  --max-delivery-count 5
```

Verify: aws sqs get-queue-attributes on the source queue shows the RedrivePolicy with the DLQ ARN and a maxReceiveCount of 5.

6. Set up DLQ monitoring and alerting

A DLQ without monitoring is a silent data loss risk: dead-lettered messages accumulate unseen until retention expires. [src1]

```python
from prometheus_client import Counter, Gauge

dlq_messages_total = Counter('erp_dlq_messages_total',
    'Total messages routed to DLQ', ['integration_id', 'error_type'])
dlq_depth = Gauge('erp_dlq_depth',
    'Current DLQ message count', ['queue_name'])
circuit_breaker_state = Gauge('erp_circuit_breaker_state',
    'Circuit breaker state (0=closed, 1=open, 2=half_open)',
    ['service_name'])
```

Verify: Query /metrics endpoint → all metric families visible. Trigger DLQ message → counter increments.
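The alert thresholds from the monitoring branch of the decision tree (depth > 100 alerts, depth > 1000 pages, DLQ ingestion above 1% of throughput alerts) can be evaluated in one place. An illustrative helper, not part of any monitoring library:

```python
def dlq_alert_level(dlq_depth, dlq_rate, total_rate):
    """Map DLQ metrics to a severity: 'ok', 'alert', or 'page'.

    dlq_rate and total_rate are messages/sec over the same window.
    """
    if dlq_depth > 1000:
        return "page"
    if dlq_depth > 100:
        return "alert"
    if total_rate > 0 and dlq_rate / total_rate > 0.01:
        return "alert"  # more than 1% of throughput is dead-lettering
    return "ok"
```

In practice this logic usually lives in alerting rules (e.g. Prometheus), but having one reference implementation keeps thresholds consistent across dashboards and pagers.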

Code Examples

Python: Retry handler with DLQ for Salesforce Bulk API

```python
# Input:  Records to upsert via Salesforce Bulk API 2.0
# Output: Success/failure counts, DLQ message IDs for failed batches
# (CSV batch upload and job close are omitted for brevity)

import random
import time
import uuid

import requests

class SalesforceBulkRetryHandler:
    def __init__(self, instance_url, access_token, dlq_client,
                 max_retries=5, initial_delay=1.0, max_delay=60.0):
        self.base_url = f"{instance_url}/services/data/v62.0"
        self.headers = {"Authorization": f"Bearer {access_token}",
                        "Content-Type": "application/json"}
        self.dlq = dlq_client
        self.max_retries = max_retries
        self.initial_delay = initial_delay
        self.max_delay = max_delay

    def upsert_with_retry(self, object_name, external_id_field, records):
        idem_key = str(uuid.uuid4())
        for attempt in range(self.max_retries + 1):
            try:
                job = requests.post(f"{self.base_url}/jobs/ingest",
                    headers=self.headers,
                    json={"object": object_name,
                          "externalIdFieldName": external_id_field,
                          "operation": "upsert", "contentType": "CSV"})
                if job.status_code == 429:
                    retry_after = int(job.headers.get("Retry-After", 0))
                    time.sleep(max(self._backoff(attempt), retry_after))
                    continue
                if job.status_code >= 500:
                    time.sleep(self._backoff(attempt))
                    continue
                if job.status_code >= 400:  # poison: 4xx will not self-heal
                    return self._dead_letter(records, idem_key, attempt,
                        f"HTTP {job.status_code}", job.text)
                return {"status": "success", "job_id": job.json()["id"]}
            except requests.exceptions.ConnectionError:
                if attempt < self.max_retries:
                    time.sleep(self._backoff(attempt))
        return self._dead_letter(records, idem_key, self.max_retries,
            "max_retries", "All retries exhausted")

    def _backoff(self, attempt):
        return random.uniform(
            0, min(self.initial_delay * (2 ** attempt), self.max_delay))

    def _dead_letter(self, records, idem_key, attempts, error_code, detail):
        msg_id = self.dlq.send({"original_message": records,
                                "idempotency_key": idem_key,
                                "error_code": error_code,
                                "error_detail": detail,
                                "attempt_count": attempts + 1})
        return {"status": "dead_lettered", "dlq_message_id": msg_id}
```

JavaScript/Node.js: Generic ERP retry middleware with circuit breaker

```javascript
// Input:  ERP API call function + message payload
// Output: Success response or DLQ envelope

const crypto = require('node:crypto'); // randomUUID (also a global in Node 19+)

async function retryWithDLQ(apiCall, message, {
  maxRetries = 5, initialDelay = 1000, maxDelay = 60000,
  factor = 2, circuitBreaker, dlqSend
} = {}) {
  const idempotencyKey = message.idempotencyKey || crypto.randomUUID();
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (!circuitBreaker.canExecute()) {
      return dlqSend({ originalMessage: message,
        errorCode: 'circuit_breaker_open', attempts: attempt,
        idempotencyKey, deadLetteredAt: new Date().toISOString() });
    }
    try {
      const result = await apiCall(message, idempotencyKey);
      circuitBreaker.recordSuccess();
      return { status: 'success', attempt, result };
    } catch (err) {
      if ([400, 403, 404, 422].includes(err.statusCode)) {
        return dlqSend({ originalMessage: message,
          errorCode: `http_${err.statusCode}`, attempts: attempt + 1,
          idempotencyKey, deadLetteredAt: new Date().toISOString() });
      }
      circuitBreaker.recordFailure();
      if (attempt < maxRetries) {
        const delay = Math.random() * Math.min(
          initialDelay * Math.pow(factor, attempt), maxDelay);
        await new Promise(r => setTimeout(r, delay));
      }
    }
  }
  return dlqSend({ originalMessage: message,
    errorCode: 'max_retries_exceeded', attempts: maxRetries + 1,
    idempotencyKey, deadLetteredAt: new Date().toISOString() });
}
```

CLI: DLQ monitoring and replay (AWS CLI / Azure CLI)

```shell
# Check DLQ depth (AWS SQS)
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789/erp-dlq \
  --attribute-names ApproximateNumberOfMessages

# Replay DLQ messages back to source queue
aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:us-east-1:123456789:erp-dlq \
  --destination-arn arn:aws:sqs:us-east-1:123456789:erp-integration

# Check DLQ count (Azure Service Bus)
az servicebus queue show --name erp-integration \
  --namespace-name mybus --resource-group myrg \
  --query 'countDetails.deadLetterMessageCount'
```

Data Mapping

DLQ Envelope Schema Reference

| Field | Type | Required | Description | Gotcha |
|---|---|---|---|---|
| original_message | Object | Yes | Complete original message payload | Must preserve all fields — partial messages break replay |
| error_code | String | Yes | Machine-readable error code | Standardize across all integrations |
| error_detail | String | Yes | Human-readable error description | Truncate to 4KB max |
| attempt_count | Integer | Yes | Number of attempts before dead-lettering | Starts at 1, not 0 |
| idempotency_key | String | Yes | Unique key for replay deduplication | UUID or composite business key |
| dead_lettered_at | ISO 8601 | Yes | When message was dead-lettered | Always UTC |
| source_queue | String | Yes | Origin queue/topic name | Required for routing replays |
| integration_id | String | Yes | Integration flow identifier | Maps to monitoring dashboards |
| correlation_id | String | Recommended | End-to-end trace ID | Enables cross-system debugging |
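A small builder keeps envelopes consistent with this schema. A sketch that enforces two of the gotchas (4 KB error_detail truncation, attempt_count starting at 1); the function name is illustrative:

```python
import uuid
from datetime import datetime, timezone

def build_dlq_envelope(original_message, error_code, error_detail,
                       attempt_count, source_queue, integration_id,
                       idempotency_key=None, correlation_id=None):
    """Assemble a DLQ envelope matching the schema above."""
    if attempt_count < 1:
        raise ValueError("attempt_count starts at 1, not 0")
    return {
        "original_message": original_message,   # full payload, never truncated
        "error_code": error_code,
        "error_detail": error_detail[:4096],    # truncate to 4 KB
        "attempt_count": attempt_count,
        "idempotency_key": idempotency_key or str(uuid.uuid4()),
        "dead_lettered_at": datetime.now(timezone.utc).isoformat(),  # always UTC
        "source_queue": source_queue,
        "integration_id": integration_id,
        "correlation_id": correlation_id,
    }
```

Routing every dead-letter through one builder is the simplest way to guarantee that no producer ships a partial envelope.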

Data Type Gotchas

Error Handling & Failure Points

Common Error Codes

| Code | Meaning | Cause | Resolution |
|---|---|---|---|
| HTTP 429 | Rate limit exceeded | Too many API calls in window | Exponential backoff, respect Retry-After header |
| HTTP 503 | Service unavailable | ERP maintenance or overload | Exponential backoff, check ERP status page |
| ETIMEDOUT | Connection timeout | Network issue or slow response | Increase timeout, check firewall rules |
| ECONNREFUSED | Connection refused | Service down or port blocked | Open circuit breaker, alert operations |
| MaxDeliveryCountExceeded | DLQ threshold reached | Message failed N times | Review error, fix root cause, replay |
| UNABLE_TO_LOCK_ROW | Record lock conflict | Concurrent update to same record | Retry with jitter, implement locking |
| INVALID_SESSION_ID | Token expired/invalid | OAuth token expired | Refresh token, retry once |

Failure Points in Production

Anti-Patterns

Wrong: Retrying all errors indiscriminately

```python
# BAD — retries non-retryable errors, wasting time and API quota
def send_to_erp(message, max_retries=5):
    for attempt in range(max_retries):
        try:
            return erp_api.post(message)
        except Exception as e:
            time.sleep(2 ** attempt)  # Retries 400, 403, 422 too
    raise Exception("All retries failed")
```

Correct: Classify errors before retrying

```python
# GOOD — only retries transient errors, DLQs poison messages
def send_to_erp(message, max_retries=5):
    for attempt in range(max_retries):
        try:
            return erp_api.post(message)
        except ERPError as e:
            if e.status_code in (400, 403, 404, 422):
                dlq.send(message, error=str(e))  # Don't retry
                return None
            if attempt < max_retries - 1:
                time.sleep(exponential_backoff_with_jitter(attempt))
    dlq.send(message, error="Retries exhausted")
```

Wrong: Fixed-interval retry (no backoff)

```python
# BAD — hammers ERP API at constant rate during outage
for attempt in range(5):
    try:
        response = erp_api.post(message); break
    except Exception:
        time.sleep(5)  # Same 5s delay every time
```

Correct: Exponential backoff with jitter

```python
# GOOD — spreads retry load, respects rate limits
for attempt in range(5):
    try:
        response = erp_api.post(message); break
    except TransientError:
        delay = min(1.0 * (2 ** attempt), 60.0)
        time.sleep(random.uniform(0, delay))  # Full jitter
```

Wrong: DLQ without envelope metadata

```python
# BAD — raw message in DLQ, no context for debugging
dlq.send(original_message)
# Later: "Why is this in the DLQ? When did it fail?"
```

Correct: DLQ with rich diagnostic envelope

```python
# GOOD — complete context for debugging and safe replay
dlq.send({
    "original_message": original_message,
    "error_code": "http_429",
    "error_detail": "Rate limit exceeded on Salesforce Bulk API",
    "attempt_count": 6,
    "idempotency_key": "order-12345-upsert-v2",
    "dead_lettered_at": "2026-03-01T14:30:00Z",
    "source_queue": "salesforce-bulk-ingest",
    "correlation_id": "trace-abc-123"
})
```

Common Pitfalls

Diagnostic Commands

```shell
# Check DLQ depth (AWS SQS)
aws sqs get-queue-attributes --queue-url $DLQ_URL \
  --attribute-names ApproximateNumberOfMessages

# Check DLQ depth (Azure Service Bus)
az servicebus queue show --name erp-integration \
  --namespace-name $NAMESPACE --resource-group $RG \
  --query 'countDetails.deadLetterMessageCount'

# Check circuit breaker state (Redis)
redis-cli GET "circuit_breaker:salesforce_api:state"

# Monitor retry rate (Prometheus)
# rate(erp_retry_attempts_total[5m]) / rate(erp_messages_processed_total[5m])

# Replay DLQ messages (AWS SQS)
aws sqs start-message-move-task \
  --source-arn $DLQ_ARN \
  --destination-arn $SOURCE_QUEUE_ARN \
  --max-number-of-messages-per-second 10
```
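Before replaying, deduplicate on idempotency_key so a message that was dead-lettered several times is not resubmitted several times. A platform-neutral sketch of that consumer-side step (fetching and acknowledging messages is broker-specific and omitted):

```python
def dedupe_for_replay(envelopes):
    """Collapse DLQ envelopes to one per idempotency_key,
    keeping the most recently dead-lettered copy of each."""
    latest = {}
    for env in envelopes:
        key = env["idempotency_key"]
        # ISO 8601 UTC timestamps compare correctly as strings
        if key not in latest or \
                env["dead_lettered_at"] > latest[key]["dead_lettered_at"]:
            latest[key] = env
    return list(latest.values())
```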

Version History & Compatibility

| Pattern/Platform | Version | Date | Status | Key Changes |
|---|---|---|---|---|
| AWS SQS DLQ Redrive | GA | 2023-07 | Current | Native message-move-task for DLQ replay |
| Azure Service Bus DLQ | GA | 2025-05 | Current | Enhanced dead-letter reason headers |
| Apache Kafka Error Topics | Convention | Ongoing | Current | No native DLQ — error topic pattern |
| RabbitMQ Dead Letter Exchange | GA (3.x+) | 2024-09 | Current | x-delivery-limit in 3.12+ (quorum queues) |
| MuleSoft Error Handling | 4.x | 2025-01 | Current | Enhanced DLQ connector with retry policies |

When to Use / When Not to Use

| Use When | Don't Use When | Use Instead |
|---|---|---|
| Zero-loss requirement (financial transactions, orders) | Fire-and-forget analytics events | Simple logging + batch reconciliation |
| ERP API has rate limits or intermittent outages | Target has 99.99% availability SLA | Direct API call with try/catch |
| Multi-system integration (multiple failure points) | Single-system CRUD operations | Database transaction with rollback |
| Asynchronous processing (message queues, events) | Synchronous user-facing API calls | HTTP retry middleware (Polly, resilience4j) |
| Batch operations with partial failures | All-or-nothing transactional requirements | Saga pattern with compensation |
| Long-running integration jobs (>30s per operation) | Sub-second API calls | Inline retry with timeout |

Cross-System Comparison

| Capability | AWS SQS | Azure Service Bus | Apache Kafka | RabbitMQ | MuleSoft |
|---|---|---|---|---|---|
| Native DLQ | Yes (redrive policy) | Yes (subqueue) | No (error topic) | Yes (DLX) | Yes (connector) |
| Max retention | 14 days | Unlimited (Premium) | Topic config | TTL-based | Platform storage |
| DLQ replay | Native (move-task) | Manual | Consume from error topic | Manual | UI-based |
| Max delivery count | 1-1000 | 1-2000 (default 10) | Consumer-side | x-delivery-limit | Configurable |
| Circuit breaker | Manual | Manual | Manual | Manual | Built-in |
| Message ordering | FIFO (optional) | Sessions (optional) | Partition ordering | Per-queue | Flow ordering |
| Dead-letter reason | Application-set | System headers + custom | Application-set | Application-set | Error type metadata |
| Max message size | 256KB (2GB w/ S3) | 256KB / 100MB | 1MB default | 128MB default | Platform limit |
| Cost model | Per-message | Per-operation + storage | Self-hosted infra | Self-hosted infra | License-based |

Important Caveats

Related Units