Error Handling and Dead Letter Queues for ERP Integrations: Retry Strategies
How do you implement error handling, retry strategies, and dead letter queues for ERP integrations?
TL;DR
- Bottom line: Implement a three-layer resilience stack: exponential backoff with jitter for transient errors, circuit breakers for prolonged outages, and dead letter queues for poison messages that exhaust all retries. Every ERP integration needs all three.
- Key limit: DLQ retention varies by platform: AWS SQS caps retention at 14 days, Azure Service Bus keeps dead-lettered messages until they are explicitly received, and Kafka depends on topic retention config. Plan your reprocessing SLA around these limits.
- Watch out for: Retrying without idempotency keys causes duplicate records in the target ERP. Every retry-eligible operation must be idempotent — use unique transaction IDs, not auto-increment.
- Best for: Any ERP integration with zero-loss or at-least-once delivery requirements — order-to-cash, AP automation, inventory sync, and payroll feeds.
- Authentication: N/A (pattern-level card). See system-specific cards for authentication details per ERP vendor.
System Profile
This card covers cross-platform error handling patterns applicable to all major ERP integrations, regardless of the specific ERP system (Salesforce, SAP, Oracle, NetSuite, Dynamics 365, Workday) or middleware platform. The patterns are implemented at the integration middleware layer — the message broker or iPaaS platform that sits between systems.
| System | Role | DLQ Support | Retry Support | Circuit Breaker |
|---|---|---|---|---|
| AWS SQS/SNS | Message broker | Native (redrive policy) | maxReceiveCount (1-1000) | Manual (Step Functions / custom) |
| Azure Service Bus | Message broker | Native (subqueue per entity) | MaxDeliveryCount (default 10) | Manual (custom implementation) |
| Apache Kafka | Event streaming | Via error topic convention | Consumer-side retry logic | Manual (custom implementation) |
| RabbitMQ | Message broker | Native (dead-letter exchange) | x-delivery-limit queue argument (quorum queues) | Manual (custom implementation) |
| MuleSoft Anypoint | iPaaS | Built-in DLQ connector | Configurable retry policies | Built-in circuit breaker scope |
| Boomi | iPaaS | Error handling shapes | Configurable retry | Custom via process routes |
| Workato | iPaaS | Error monitoring recipes | Auto-retry with configurable count | Custom via error handlers |
API Surfaces & Capabilities
| Pattern | Type | Best For | Complexity | Latency Impact | Data Loss Risk |
|---|---|---|---|---|---|
| Immediate retry | Retry | Transient network blips | Low | Minimal (ms) | Medium (no backoff) |
| Exponential backoff | Retry | Rate limits, temporary overload | Medium | Increasing (1s-60s) | Low |
| Exponential backoff + jitter | Retry | High-concurrency retries | Medium | Increasing (randomized) | Low |
| Circuit breaker | Protection | Prolonged service outages | Medium | Fast-fail when open | None (preserves messages) |
| Dead letter queue | Error isolation | Poison messages, schema errors | Medium | None (async) | Very low |
| Saga with compensation | Transaction | Multi-system writes | High | Variable | Very low |
| Outbox pattern | Delivery guarantee | At-least-once publish without dual writes | High | Minimal | Near zero |
Rate Limits & Quotas
Per-Platform DLQ Limits
| Platform | Max Retention | Max Message Size | Max DLQ Depth | Reprocessing Method |
|---|---|---|---|---|
| AWS SQS | 14 days | 256 KB (2 GB with S3) | Unlimited | Redrive to source queue |
| Azure Service Bus (Standard) | Unlimited | 256 KB | 5 GB per entity | Receive + resubmit |
| Azure Service Bus (Premium) | Unlimited | 100 MB | 80 GB per entity | Receive + resubmit |
| Apache Kafka (error topic) | Topic retention config | 1 MB default (configurable) | Partition-based | Consumer from error topic |
| RabbitMQ | TTL-based (configurable) | 128 MB default | Memory/disk-based | Consume from DLX queue |
Retry Budget Guidelines
| Integration Type | Max Retries | Initial Delay | Max Delay | Backoff Factor | Jitter |
|---|---|---|---|---|---|
| Real-time API (user-facing) | 3-5 | 500ms | 30s | 2x | Full jitter |
| Batch/bulk processing | 5-10 | 1s | 5min | 2x | Equal jitter |
| Event-driven (CDC, webhooks) | 5-8 | 1s | 2min | 2x | Full jitter |
| File-based import (FBDI, EIB) | 3 | 30s | 10min | 3x | None |
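The retry budgets above can be turned into small delay calculators. The following is a minimal sketch, assuming the common definitions of "full jitter" (uniform over the whole capped delay) and "equal jitter" (half fixed, half random). The function names and the `rng` parameter are illustrative and not taken from any specific library.

```python
import random


def capped_delay(attempt, initial_delay, max_delay, factor=2.0):
    """Exponential delay for a zero-based attempt number, capped at max_delay."""
    return min(initial_delay * (factor ** attempt), max_delay)


def full_jitter(attempt, initial_delay, max_delay, factor=2.0, rng=random):
    """Uniform random delay between 0 and the capped exponential delay."""
    return rng.uniform(0, capped_delay(attempt, initial_delay, max_delay, factor))


def equal_jitter(attempt, initial_delay, max_delay, factor=2.0, rng=random):
    """Half of the capped delay is fixed, the other half is random."""
    half = capped_delay(attempt, initial_delay, max_delay, factor) / 2
    return half + rng.uniform(0, half)


def worst_case_wait(max_retries, initial_delay, max_delay, factor=2.0):
    """Upper bound on total sleep time if every retry draws the maximum delay."""
    return sum(capped_delay(a, initial_delay, max_delay, factor)
               for a in range(max_retries))
```

For the file-based import row (3 retries, 30s initial delay, 3x factor, 10min cap), `worst_case_wait(3, 30, 600, 3.0)` evaluates to 390 seconds, which is a useful number to compare against the integration's processing SLA.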
Circuit Breaker Thresholds
| Parameter | Real-time Integration | Batch Integration | Event-driven |
|---|---|---|---|
| Failure threshold | 5 failures in 60s | 3 failures in 5min | 5 failures in 60s |
| Open state duration | 30s | 5min | 60s |
| Half-open probe count | 1 | 1-3 | 1 |
| Success threshold to close | 2 consecutive | 3 consecutive | 2 consecutive |
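A threshold such as "5 failures in 60s" implies counting failures inside a rolling window rather than since process start. One way to sketch that, assuming an in-memory deque and an injectable clock (both are illustrative choices, and a multi-instance deployment would need a shared store instead):

```python
import time
from collections import deque


class SlidingWindowFailureCounter:
    """Counts failures inside a rolling time window."""

    def __init__(self, threshold=5, window_seconds=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._failures = deque()

    def _evict_expired(self, now):
        # Drop failure timestamps that have aged out of the window.
        while self._failures and now - self._failures[0] >= self.window_seconds:
            self._failures.popleft()

    def record_failure(self):
        now = self.clock()
        self._failures.append(now)
        self._evict_expired(now)

    def should_open(self):
        """True when the number of recent failures reaches the threshold."""
        self._evict_expired(self.clock())
        return len(self._failures) >= self.threshold
```

The counter only answers "should the circuit open?"; the open, half-open, and closed transitions still belong to the circuit breaker itself.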
Authentication
N/A — this is a pattern-level card. Authentication is handled at the ERP API layer. See system-specific cards for auth flows per ERP vendor.
Constraints
- Idempotency is mandatory: Every retryable operation must include an idempotency key (UUID, composite business key, or hash). Without it, retries create duplicate records in the target ERP. [src1]
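- DLQ messages are never triaged automatically: AWS SQS silently deletes DLQ messages once the queue's retention period (max 14 days) elapses, Kafka error topics follow topic retention, and Azure Service Bus keeps dead-lettered messages until they are explicitly received. None of these platforms reprocess or resolve them for you, so implement explicit retention policies and manual or automated reprocessing workflows. [src5]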
- Circuit breaker state must be shared: If your integration runs on multiple worker instances, circuit breaker state must be stored in a shared data store (Redis, DynamoDB, database) — local in-memory state causes split-brain behavior. [src4]
- Retry amplification risk: Retrying at multiple layers (application + middleware + infrastructure) multiplies the retry counts of each layer, so a single failure can fan out into dozens of API calls. Pick ONE retry layer and disable retries at other layers. [src6]
- DLQ depth = integration health: A growing DLQ backlog indicates a systemic issue, not a transient failure. Alert on DLQ depth > 100 messages and investigate root cause before replaying. [src1]
- Poison messages must be identified early: Messages that will never succeed (schema validation failures, missing required fields, invalid data types) should be routed to DLQ immediately without retries. Only transient errors deserve retries. [src2]
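The idempotency constraint mentions a composite business key or hash as an alternative to a random UUID. A minimal sketch of that option, assuming the caller knows which fields identify the business event (the function name and field names are hypothetical):

```python
import hashlib
import json


def idempotency_key(integration_id, operation, business_key_fields):
    """Derive a stable key from the fields that identify the business event."""
    canonical = json.dumps(
        {"integration": integration_id, "op": operation, "key": business_key_fields},
        sort_keys=True,            # field order must not change the key
        separators=(",", ":"),
        default=str,               # stringify dates, decimals, and similar values
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Unlike a random UUID generated at send time, a derived key stays the same when the source system re-emits the same event, so a replay from the DLQ and a re-send from the source deduplicate against each other.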
Integration Pattern Decision Tree
START: ERP integration message fails processing
├── What type of error?
│   ├── Transient (429 rate limit, 503 unavailable, network timeout)
│   │   ├── Is circuit breaker OPEN?
│   │   │   ├── YES → Fast-fail, queue for retry after circuit reset
│   │   │   └── NO ↓
│   │   ├── Retry count < max retries?
│   │   │   ├── YES → Retry with exponential backoff + jitter
│   │   │   │   ├── delay = min(initialDelay * 2^attempt, maxDelay)
│   │   │   │   └── actualDelay = random(0, delay) [full jitter]
│   │   │   └── NO → Move to Dead Letter Queue with full context
│   │   └── Did retry succeed?
│   │       ├── YES → Mark success, reset failure counter
│   │       └── NO → Increment failure counter, check circuit breaker
│   ├── Non-transient / Poison (400, 422, schema mismatch)
│   │   └── Route IMMEDIATELY to DLQ (retries will never succeed)
│   ├── Partial success (bulk: some records succeed, some fail)
│   │   ├── Extract failed records from response
│   │   └── Route failed-record message through retry pipeline
│   └── Authentication error (401, 403)
│       ├── Token expired? → Refresh token, retry once
│       └── Permissions changed? → Alert + DLQ (do not retry)
├── DLQ message processing
│   ├── Automated triage: classify, group, auto-resolve known patterns
│   └── Manual review: fix data, replay, or purge
└── Monitoring
    ├── DLQ depth: alert at > 100, page at > 1000
    ├── DLQ ingestion rate: alert if > 1% of total throughput
    └── Circuit breaker state changes: log every transition
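The monitoring branch of the tree reduces to a small threshold check. A sketch under the thresholds stated above (alert at depth > 100, page at depth > 1000, alert when DLQ ingestion exceeds 1% of throughput); the function name and return labels are illustrative:

```python
def dlq_alert_level(dlq_depth, dlq_ingested, total_processed):
    """Map DLQ metrics to 'page', 'alert', or 'ok' using the thresholds above."""
    if dlq_depth > 1000:
        return "page"
    # Guard against division by zero when nothing has been processed yet.
    ingestion_rate = dlq_ingested / total_processed if total_processed else 0.0
    if dlq_depth > 100 or ingestion_rate > 0.01:
        return "alert"
    return "ok"
```

In practice the same rules would usually live in the alerting system rather than in application code; the function is mainly useful for unit-testing the thresholds before encoding them there.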
Quick Reference
| Error Type | Retryable? | Strategy | Max Retries | DLQ Action |
|---|---|---|---|---|
| HTTP 429 (Rate Limit) | Yes | Backoff, respect Retry-After | 5-10 | Reprocess after cooldown |
| HTTP 500 (Server Error) | Yes | Exponential backoff + jitter | 5 | Investigate server-side |
| HTTP 502/503/504 (Gateway) | Yes | Exponential backoff | 5 | Reprocess after recovery |
| HTTP 400 (Bad Request) | No | Immediate DLQ | 0 | Fix payload, resubmit |
| HTTP 401 (Unauthorized) | Once | Refresh token, retry once | 1 | Rotate credentials |
| HTTP 403 (Forbidden) | No | Immediate DLQ | 0 | Fix permissions |
| HTTP 404 (Not Found) | No | Immediate DLQ | 0 | Fix resource reference |
| HTTP 409 (Conflict) | Conditional | Retry with conflict resolution | 3 | Manual merge |
| HTTP 422 (Validation) | No | Immediate DLQ | 0 | Fix data, resubmit |
| Connection timeout | Yes | Exponential backoff | 5 | Check network/firewall |
| SSL/TLS failure | No | Immediate DLQ | 0 | Fix certificates |
| Schema mismatch | No | Immediate DLQ | 0 | Update schema, resubmit |
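For the HTTP 429 row, "respect Retry-After" means the server-supplied wait is a floor for the retry delay. The header can carry either a number of seconds or an HTTP date, so a sketch of the parsing step might look like this (function names are illustrative; falling back to 0 on an unparseable header is an assumption, not a standard requirement):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After header given as delta-seconds or an HTTP date.

    Returns 0.0 when the header is missing or cannot be parsed.
    """
    if not header_value:
        return 0.0
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return 0.0
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (target - now).total_seconds())


def delay_for_429(backoff_delay, header_value, now=None):
    """Wait at least as long as the server asks, never less than our backoff."""
    return max(backoff_delay, retry_after_seconds(header_value, now))
```

Taking the maximum of the two values keeps jittered backoff in play when the server gives a short hint, and defers to the server when it asks for a longer pause.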
Step-by-Step Integration Guide
1. Classify errors into retryable vs non-retryable
Before implementing retry logic, establish a clear error classification. Retrying non-retryable errors wastes resources and delays DLQ processing. [src2]
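RETRYABLE_ERRORS = {429, 500, 502, 503, 504}
# 401 is handled separately below (one retry after token refresh).
# 409 is treated as non-retryable here; add conflict resolution if your ERP needs it.
NON_RETRYABLE_ERRORS = {400, 403, 404, 405, 409, 422}

def classify_error(status_code, error_body):
    """Map an HTTP status code to 'transient', 'auth_expired', or 'poison'."""
    if status_code in RETRYABLE_ERRORS:
        return "transient"
    if status_code == 401:
        return "auth_expired"  # Retry once after token refresh
    if status_code in NON_RETRYABLE_ERRORS:
        return "poison"  # Route to DLQ immediately
    if status_code >= 500:
        return "transient"  # Any other 5xx
    return "poison"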
Verify: Run classifier against last 30 days of integration logs → expected: <5% misclassification rate.
2. Implement exponential backoff with jitter
The backoff formula prevents retry storms. Full jitter distributes retries evenly across the delay window. [src3]
import random

def exponential_backoff_with_jitter(attempt, initial_delay=1.0,
                                    max_delay=60.0, factor=2.0):
    exponential_delay = initial_delay * (factor ** attempt)
    capped_delay = min(exponential_delay, max_delay)
    return random.uniform(0, capped_delay)  # Full jitter
Verify: exponential_backoff_with_jitter(0) returns 0-1.0; exponential_backoff_with_jitter(5) returns 0-32.0.
3. Implement the circuit breaker
Prevents hammering an ERP API that is already down — avoids wasting API quota and prevents cascading failures. [src4]
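import time

class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, reset_timeout=30,
                 success_threshold=2):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.success_threshold = success_threshold
        self.state = self.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None

    def can_execute(self):
        if self.state == self.CLOSED:
            return True
        if self.state == self.OPEN:
            if time.time() - self.last_failure_time >= self.reset_timeout:
                self.state = self.HALF_OPEN
                self.success_count = 0
                return True
            return False
        return True  # half_open

    def record_failure(self):
        # Minimal in-memory sketch of the hook called by the step 4 pipeline.
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == self.HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = self.OPEN

    def record_success(self):
        # Close after success_threshold consecutive successes while half-open.
        if self.state == self.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = self.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0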
Verify: Trigger 5 failures → can_execute() returns False. Wait reset_timeout → returns True (half-open).
4. Build the retry-with-DLQ pipeline
Combine error classification, backoff, and circuit breaker into a unified pipeline. [src1, src2]
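import time
import uuid

class ERPRetryPipeline:
    # Assumed collaborators: erp_client.send(message) raises ERPAPIError
    # (with status_code and body); dlq_client.send(envelope) stores the envelope.
    def __init__(self, erp_client, dlq_client, circuit_breaker, max_retries=5):
        self.erp = erp_client
        self.dlq = dlq_client
        self.cb = circuit_breaker
        self.max_retries = max_retries

    def process_message(self, message):
        # Assign the key once so every retry and any later replay reuses it.
        if not message.get("idempotency_key"):
            message["idempotency_key"] = str(uuid.uuid4())
        for attempt in range(self.max_retries + 1):
            if not self.cb.can_execute():
                return self._send_to_dlq(message, attempt,
                                         "circuit_breaker_open", "ERP unavailable")
            try:
                self.erp.send(message)
                self.cb.record_success()
                return {"status": "success", "attempt": attempt}
            except ERPAPIError as e:
                error_type = classify_error(e.status_code, e.body)
                if error_type == "poison":
                    return self._send_to_dlq(message, attempt + 1,
                                             f"http_{e.status_code}", str(e.body))
                self.cb.record_failure()
                if attempt < self.max_retries:
                    delay = exponential_backoff_with_jitter(attempt)
                    time.sleep(delay)
        return self._send_to_dlq(message, self.max_retries + 1,
                                 "max_retries_exceeded", "All retries exhausted")

    def _send_to_dlq(self, message, attempts, error_code, error_detail):
        envelope = {"original_message": message, "error_code": error_code,
                    "error_detail": error_detail, "attempt_count": attempts}
        self.dlq.send(envelope)
        return {"status": "dead_lettered", "error_code": error_code}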
Verify: Mock ERP returns 503 three times then 200 → pipeline returns {"status": "success", "attempt": 3}.
5. Configure platform-specific DLQ
Set up the dead letter queue on your chosen message broker. [src5, src7]
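# AWS SQS: Create DLQ and configure redrive policy
aws sqs create-queue --queue-name erp-integration-dlq \
  --attributes '{"MessageRetentionPeriod":"1209600"}'  # 14 days
# Attach the DLQ to the source queue (queue URL and ARN below are placeholders)
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789/erp-integration \
  --attributes '{"RedrivePolicy":"{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789:erp-integration-dlq\",\"maxReceiveCount\":\"5\"}"}'
# Azure Service Bus: DLQ is automatic (subqueue per entity)
az servicebus queue update --name erp-integration \
  --namespace-name mybus --resource-group myrg \
  --max-delivery-count 5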
Verify: aws sqs get-queue-attributes shows DLQ ARN and maxReceiveCount of 5.
6. Set up DLQ monitoring and alerting
DLQ without monitoring is a silent data loss risk. [src1]
from prometheus_client import Counter, Gauge, Histogram
dlq_messages_total = Counter('erp_dlq_messages_total',
'Total messages routed to DLQ', ['integration_id', 'error_type'])
dlq_depth = Gauge('erp_dlq_depth',
'Current DLQ message count', ['queue_name'])
circuit_breaker_state = Gauge('erp_circuit_breaker_state',
'Circuit breaker state (0=closed, 1=open, 2=half_open)',
['service_name'])
Verify: Query /metrics endpoint → all metric families visible. Trigger DLQ message → counter increments.
Code Examples
Python: Retry handler with DLQ for Salesforce Bulk API
# Input: Records to upsert via Salesforce Bulk API 2.0
# Output: Job ID on success, or a dead-letter result for failed batches
# Note: this creates the ingest job only; uploading CSV data and closing the job are omitted.
import requests, time, random, json, uuid
from datetime import datetime, timezone

class SalesforceBulkRetryHandler:
    def __init__(self, instance_url, access_token, dlq_client,
                 max_retries=5, initial_delay=1.0, max_delay=60.0):
        self.base_url = f"{instance_url}/services/data/v62.0"
        self.headers = {"Authorization": f"Bearer {access_token}",
                        "Content-Type": "application/json"}
        self.dlq = dlq_client
        self.max_retries = max_retries
        self.initial_delay = initial_delay
        self.max_delay = max_delay

    def upsert_with_retry(self, object_name, external_id_field, records):
        idem_key = str(uuid.uuid4())
        for attempt in range(self.max_retries + 1):
            try:
                job = requests.post(f"{self.base_url}/jobs/ingest",
                                    headers=self.headers,
                                    json={"object": object_name,
                                          "externalIdFieldName": external_id_field,
                                          "operation": "upsert", "contentType": "CSV"})
                if job.status_code == 429:
                    # Wait at least as long as the server requests.
                    retry_after = int(job.headers.get("Retry-After", 0))
                    time.sleep(max(self._backoff(attempt), retry_after))
                    continue
                if job.status_code >= 500:
                    time.sleep(self._backoff(attempt))
                    continue
                if job.status_code >= 400:
                    return self._dead_letter(records, idem_key, attempt + 1,
                                             f"HTTP {job.status_code}", job.text)
                return {"status": "success", "job_id": job.json()["id"]}
            except requests.exceptions.ConnectionError:
                if attempt < self.max_retries:
                    time.sleep(self._backoff(attempt))
        return self._dead_letter(records, idem_key, self.max_retries + 1,
                                 "max_retries", "All retries exhausted")

    def _backoff(self, attempt):
        # Full jitter over the capped exponential delay.
        return random.uniform(0, min(self.initial_delay * (2 ** attempt), self.max_delay))

    def _dead_letter(self, records, idem_key, attempts, error_code, detail):
        # Assumes dlq_client exposes send(envelope); adapt to your broker SDK.
        self.dlq.send({"original_message": records, "idempotency_key": idem_key,
                       "attempt_count": attempts, "error_code": error_code,
                       "error_detail": detail,
                       "dead_lettered_at": datetime.now(timezone.utc).isoformat()})
        return {"status": "dead_lettered", "error_code": error_code}
JavaScript/Node.js: Generic ERP retry middleware with circuit breaker
// Input: ERP API call function + message payload
// Output: Success response or DLQ envelope
async function retryWithDLQ(apiCall, message, {
  maxRetries = 5, initialDelay = 1000, maxDelay = 60000,
  factor = 2, circuitBreaker, dlqSend
} = {}) {
  const idempotencyKey = message.idempotencyKey || crypto.randomUUID();
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (!circuitBreaker.canExecute()) {
      return dlqSend({ originalMessage: message,
        errorCode: 'circuit_breaker_open', attempts: attempt,
        idempotencyKey, deadLetteredAt: new Date().toISOString() });
    }
    try {
      const result = await apiCall(message, idempotencyKey);
      circuitBreaker.recordSuccess();
      return { status: 'success', attempt, result };
    } catch (err) {
      if ([400, 403, 404, 422].includes(err.statusCode)) {
        return dlqSend({ originalMessage: message,
          errorCode: `http_${err.statusCode}`, attempts: attempt + 1,
          idempotencyKey, deadLetteredAt: new Date().toISOString() });
      }
      circuitBreaker.recordFailure();
      if (attempt < maxRetries) {
        const delay = Math.random() * Math.min(
          initialDelay * Math.pow(factor, attempt), maxDelay);
        await new Promise(r => setTimeout(r, delay));
      }
    }
  }
  return dlqSend({ originalMessage: message,
    errorCode: 'max_retries_exceeded', attempts: maxRetries + 1,
    idempotencyKey, deadLetteredAt: new Date().toISOString() });
}
CLI (AWS and Azure): DLQ monitoring and replay
# Check DLQ depth (AWS SQS)
aws sqs get-queue-attributes \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789/erp-dlq \
--attribute-names ApproximateNumberOfMessages
# Replay DLQ messages back to source queue
aws sqs start-message-move-task \
--source-arn arn:aws:sqs:us-east-1:123456789:erp-dlq \
--destination-arn arn:aws:sqs:us-east-1:123456789:erp-integration
# Check DLQ count (Azure Service Bus)
az servicebus queue show --name erp-integration \
--namespace-name mybus --resource-group myrg \
--query 'countDetails.deadLetterMessageCount'
Data Mapping
DLQ Envelope Schema Reference
| Field | Type | Required | Description | Gotcha |
|---|---|---|---|---|
| original_message | Object | Yes | Complete original message payload | Must preserve all fields — partial messages break replay |
| error_code | String | Yes | Machine-readable error code | Standardize across all integrations |
| error_detail | String | Yes | Human-readable error description | Truncate to 4KB max |
| attempt_count | Integer | Yes | Number of attempts before dead-lettering | Starts at 1, not 0 |
| idempotency_key | String | Yes | Unique key for replay deduplication | UUID or composite business key |
| dead_lettered_at | ISO 8601 | Yes | When message was dead-lettered | Always UTC |
| source_queue | String | Yes | Origin queue/topic name | Required for routing replays |
| integration_id | String | Yes | Integration flow identifier | Maps to monitoring dashboards |
| correlation_id | String | Recommended | End-to-end trace ID | Enables cross-system debugging |
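The schema above can be enforced with a small builder so that no integration hand-assembles envelopes. This is a sketch that assumes a plain dict envelope and applies the gotchas from the table (4KB detail limit, attempt count starting at 1, UTC timestamp); the function name is hypothetical.

```python
from datetime import datetime, timezone

MAX_ERROR_DETAIL_BYTES = 4096  # "Truncate to 4KB max" from the table


def build_dlq_envelope(original_message, error_code, error_detail, attempt_count,
                       idempotency_key, source_queue, integration_id,
                       correlation_id=None):
    """Assemble a DLQ envelope with every required field populated."""
    if attempt_count < 1:
        raise ValueError("attempt_count starts at 1, not 0")
    # Truncate on bytes, then drop any multi-byte character cut in half.
    detail = error_detail.encode("utf-8")[:MAX_ERROR_DETAIL_BYTES]
    envelope = {
        "original_message": original_message,
        "error_code": error_code,
        "error_detail": detail.decode("utf-8", errors="ignore"),
        "attempt_count": attempt_count,
        "idempotency_key": idempotency_key,
        "dead_lettered_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "source_queue": source_queue,
        "integration_id": integration_id,
    }
    if correlation_id is not None:
        envelope["correlation_id"] = correlation_id
    return envelope
```

Requiring every mandatory field as a positional argument means a missing `source_queue` or `integration_id` fails at call time instead of surfacing later during replay.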
Data Type Gotchas
- Timestamp formats vary by ERP: Salesforce uses ISO 8601 UTC, SAP uses YYYYMMDD + HHMMSS in separate fields, NetSuite depends on user timezone. Always normalize to UTC ISO 8601 in DLQ envelopes. [src1]
- Partial success responses differ: Salesforce Bulk API returns per-record CSV, SAP returns BAPI messages, Oracle FBDI returns error report file. Parse all formats before determining which records to DLQ. [src2]
- Message size limits: AWS SQS max 256KB. If DLQ envelope exceeds this, store payload in S3 and include only the reference. Azure Service Bus Premium supports up to 100MB. [src5, src7]
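The message size gotcha describes storing an oversized payload externally and keeping only a reference in the envelope. A minimal sketch of that step, assuming a caller-supplied `store_payload(key, data)` function (hypothetical; in practice it might wrap an S3 upload) and the 256KB limit cited above:

```python
import json
import uuid

SQS_MAX_BYTES = 256 * 1024  # limit cited above


def fit_envelope(envelope, store_payload, max_bytes=SQS_MAX_BYTES):
    """Return the envelope unchanged if it fits, else swap the payload for a reference.

    store_payload is a caller-supplied function (hypothetical) that persists
    bytes under a key, for example by uploading to an object store.
    """
    body = json.dumps(envelope).encode("utf-8")
    if len(body) <= max_bytes:
        return envelope
    key = f"dlq-payloads/{uuid.uuid4()}.json"
    store_payload(key, json.dumps(envelope["original_message"]).encode("utf-8"))
    slim = dict(envelope)  # shallow copy so the caller's envelope is untouched
    slim["original_message"] = {"payload_ref": key}
    return slim
```

The replay tooling then has to resolve `payload_ref` before resubmitting, so the external store's retention must be at least as long as the DLQ's.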
Error Handling & Failure Points
Common Error Codes
| Code | Meaning | Cause | Resolution |
|---|---|---|---|
| HTTP 429 | Rate limit exceeded | Too many API calls in window | Exponential backoff, respect Retry-After header |
| HTTP 503 | Service unavailable | ERP maintenance or overload | Exponential backoff, check ERP status page |
| ETIMEDOUT | Connection timeout | Network issue or slow response | Increase timeout, check firewall rules |
| ECONNREFUSED | Connection refused | Service down or port blocked | Open circuit breaker, alert operations |
| MaxDeliveryCountExceeded | DLQ threshold reached | Message failed N times | Review error, fix root cause, replay |
| UNABLE_TO_LOCK_ROW | Record lock conflict | Concurrent update to same record | Retry with jitter, implement locking |
| INVALID_SESSION_ID | Token expired/invalid | OAuth token expired | Refresh token, retry once |
Failure Points in Production
- Retry amplification across layers: Application retries 5x, middleware retries 3x, infrastructure retries 2x = 30 actual API calls. Fix: Disable retries at all layers except one. [src6]
- DLQ messages accumulate silently: Teams set up DLQ but no monitoring. Thousands of unprocessed messages represent lost orders. Fix: Alert on DLQ depth > 100 within 24 hours of go-live. [src1]
- Circuit breaker split-brain: Each instance has its own circuit breaker state. Fix: Store circuit breaker state in Redis or DynamoDB. [src4]
- Replay without idempotency causes duplicates: 200 of 500 replayed messages had actually succeeded before timeout and are now duplicated. Fix: Every operation must use an idempotency key. Use upsert, not insert. [src2]
- Poison message loops: Schema error routes to DLQ, automated replay pushes it back, fails again. Fix: Track replay_count in DLQ envelope. Quarantine messages with replay_count > 3. [src1]
- Rate limit backoff ignored during batch replay: 1,000 DLQ messages replayed simultaneously hit rate limits again. Fix: Rate-limit DLQ replay to 10-20% of normal throughput. [src3]
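Two of the fixes above, quarantining on `replay_count` and throttling replay, combine naturally in one replay loop. The following sketch assumes caller-supplied `resend` and `quarantine` callables (both hypothetical) and paces replays with a fixed interval derived from normal throughput:

```python
import time


def replay_dlq(envelopes, resend, quarantine, normal_rate_per_sec,
               replay_fraction=0.1, max_replays=3, sleep=time.sleep):
    """Replay envelopes at a fraction of normal throughput.

    resend and quarantine are caller-supplied callables (hypothetical).
    Returns (replayed, quarantined) counts.
    """
    interval = 1.0 / (normal_rate_per_sec * replay_fraction)
    replayed = quarantined = 0
    for envelope in envelopes:
        # Messages replayed more than max_replays times are set aside for review.
        if envelope.get("replay_count", 0) > max_replays:
            quarantine(envelope)
            quarantined += 1
            continue
        envelope["replay_count"] = envelope.get("replay_count", 0) + 1
        resend(envelope)
        replayed += 1
        sleep(interval)  # fixed pacing keeps replay below the chosen fraction
    return replayed, quarantined
```

With a normal throughput of 100 messages per second and the default 10% fraction, the loop sends roughly one replay every 0.1 seconds.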
Anti-Patterns
Wrong: Retrying all errors indiscriminately
# BAD: retries non-retryable errors, wasting time and API quota
def send_to_erp(message, max_retries=5):
    for attempt in range(max_retries):
        try:
            return erp_api.post(message)
        except Exception as e:
            time.sleep(2 ** attempt)  # Retries 400, 403, 422 too
    raise Exception("All retries failed")
Correct: Classify errors before retrying
# GOOD: only retries transient errors, DLQs poison messages
def send_to_erp(message, max_retries=5):
    for attempt in range(max_retries):
        try:
            return erp_api.post(message)
        except ERPError as e:
            if e.status_code in (400, 403, 404, 422):
                dlq.send(message, error=str(e))  # Don't retry
                return None
            if attempt < max_retries - 1:
                time.sleep(exponential_backoff_with_jitter(attempt))
    dlq.send(message, error="Retries exhausted")
Wrong: Fixed-interval retry (no backoff)
# BAD: hammers ERP API at constant rate during outage
for attempt in range(5):
    try:
        response = erp_api.post(message)
        break
    except Exception:
        time.sleep(5)  # Same 5s delay every time
Correct: Exponential backoff with jitter
# GOOD: spreads retry load, respects rate limits
for attempt in range(5):
    try:
        response = erp_api.post(message)
        break
    except TransientError:
        delay = min(1.0 * (2 ** attempt), 60.0)
        time.sleep(random.uniform(0, delay))  # Full jitter
Wrong: DLQ without envelope metadata
# BAD — raw message in DLQ, no context for debugging
dlq.send(original_message)
# Later: "Why is this in the DLQ? When did it fail?"
Correct: DLQ with rich diagnostic envelope
# GOOD — complete context for debugging and safe replay
dlq.send({
"original_message": original_message,
"error_code": "http_429",
"error_detail": "Rate limit exceeded on Salesforce Bulk API",
"attempt_count": 6,
"idempotency_key": "order-12345-upsert-v2",
"dead_lettered_at": "2026-03-01T14:30:00Z",
"source_queue": "salesforce-bulk-ingest",
"correlation_id": "trace-abc-123"
})
Common Pitfalls
- No DLQ retention policy: Messages accumulate forever. After 6 months, 50,000 stale messages have piled up, and replaying them creates chaos. Fix: Set TTL (14 days transient, 90 days data issues). Auto-archive to cold storage. [src5]
- Retrying at multiple layers: Application + middleware + infrastructure retries multiply each other. Fix: Choose one retry layer (application for ERP). Disable retries elsewhere. [src6]
- Missing idempotency on replay: Replaying DLQ without idempotency keys creates duplicates. Fix: Include idempotency_key in every message. Use upsert, not insert. [src2]
- Circuit breaker too sensitive: Opens on a single 503, stopping all traffic. Fix: Require 5+ failures in 60s. Use percentage-based thresholds for high volume. [src4]
- Circuit breaker too lenient: 50 failures needed to open = 50 wasted API calls. Fix: Tune based on normal error rate. For <1% natural error rate, 5 failures in 60s is appropriate. [src4]
- DLQ replay storms: Replaying all DLQ messages at once overwhelms the API. Fix: Rate-limit replays to 10-20% of normal throughput. [src3]
- Ignoring partial success in bulk operations: Treating entire batch as failed creates duplicates. Fix: Parse per-record results. Only retry failed records. [src1]
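The partial-success pitfall comes down to separating per-record outcomes before anything is retried. A sketch of that split, assuming the ERP response has already been parsed into a list of dicts with a boolean `success` key and an optional `error` string, in the same order as the submitted records (that shape is an assumption; each ERP returns its own format):

```python
def split_bulk_results(records, results):
    """Pair each record with its per-record result and separate the outcomes.

    results is assumed to be a list of dicts with a boolean "success" key
    and an optional "error" string, in the same order as records.
    """
    if len(records) != len(results):
        raise ValueError("records and results must be the same length")
    succeeded, failed = [], []
    for record, result in zip(records, results):
        if result.get("success"):
            succeeded.append(record)
        else:
            failed.append({"record": record, "error": result.get("error", "unknown")})
    return succeeded, failed
```

Only the `failed` list goes back through the retry pipeline, so records that already landed in the ERP are never submitted twice.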
Diagnostic Commands
# Check DLQ depth (AWS SQS)
aws sqs get-queue-attributes --queue-url $DLQ_URL \
--attribute-names ApproximateNumberOfMessages
# Check DLQ depth (Azure Service Bus)
az servicebus queue show --name erp-integration \
--namespace-name $NAMESPACE --resource-group $RG \
--query 'countDetails.deadLetterMessageCount'
# Check circuit breaker state (Redis)
redis-cli GET "circuit_breaker:salesforce_api:state"
# Monitor retry rate (Prometheus)
# rate(erp_retry_attempts_total[5m]) / rate(erp_messages_processed_total[5m])
# Replay DLQ messages (AWS SQS)
aws sqs start-message-move-task \
--source-arn $DLQ_ARN \
--destination-arn $SOURCE_QUEUE_ARN \
--max-number-of-messages-per-second 10
Version History & Compatibility
| Pattern/Platform | Version | Date | Status | Key Changes |
|---|---|---|---|---|
| AWS SQS DLQ Redrive | GA | 2023-07 | Current | Native message-move-task for DLQ replay |
| Azure Service Bus DLQ | GA | 2025-05 | Current | Enhanced dead-letter reason headers |
| Apache Kafka Error Topics | Convention | Ongoing | Current | No native DLQ — error topic pattern |
| RabbitMQ Dead Letter Exchange | GA (3.x+) | 2024-09 | Current | x-delivery-limit in 3.12+ (quorum queues) |
| MuleSoft Error Handling | 4.x | 2025-01 | Current | Enhanced DLQ connector with retry policies |
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Zero-loss requirement (financial transactions, orders) | Fire-and-forget analytics events | Simple logging + batch reconciliation |
| ERP API has rate limits or intermittent outages | Target has 99.99% availability SLA | Direct API call with try/catch |
| Multi-system integration (multiple failure points) | Single-system CRUD operations | Database transaction with rollback |
| Asynchronous processing (message queues, events) | Synchronous user-facing API calls | HTTP retry middleware (Polly, resilience4j) |
| Batch operations with partial failures | All-or-nothing transactional requirements | Saga pattern with compensation |
| Long-running integration jobs (>30s per operation) | Sub-second API calls | Inline retry with timeout |
Cross-System Comparison
| Capability | AWS SQS | Azure Service Bus | Apache Kafka | RabbitMQ | MuleSoft |
|---|---|---|---|---|---|
| Native DLQ | Yes (redrive policy) | Yes (subqueue) | No (error topic) | Yes (DLX) | Yes (connector) |
| Max retention | 14 days | Unlimited (Premium) | Topic config | TTL-based | Platform storage |
| DLQ replay | Native (move-task) | Manual | Consumer from error topic | Manual | UI-based |
| Max delivery count | 1-1000 | 1-2000 (default 10) | Consumer-side | x-delivery-limit | Configurable |
| Circuit breaker | Manual | Manual | Manual | Manual | Built-in |
| Message ordering | FIFO (optional) | Sessions (optional) | Partition ordering | Per-queue | Flow ordering |
| Dead-letter reason | Application-set | System headers + custom | Application-set | Application-set | Error type metadata |
| Max message size | 256KB (2GB w/ S3) | 256KB / 100MB | 1MB default | 128MB default | Platform limit |
| Cost model | Per-message | Per-operation + storage | Self-hosted infra | Self-hosted infra | License-based |
Important Caveats
- DLQ is not error handling — it is error isolation: A DLQ stores messages that have already failed all retry attempts. It does not fix errors. You still need triage, alerting, and review processes.
- Idempotency is not optional: Any retry strategy without idempotency keys will create duplicate records in the target ERP. This is the single most common and expensive mistake.
- Platform DLQ features vary significantly: AWS SQS, Azure Service Bus, and RabbitMQ have native DLQ support with different capabilities. Kafka requires custom error-topic implementation.
- Circuit breaker thresholds need tuning: Default thresholds rarely match production traffic patterns. Start conservative (5 failures in 60s) and adjust based on observed error rates.
- Retry strategies must account for rate limits: ERP APIs like Salesforce (for example, 100K calls/24h; the exact allocation depends on edition and licenses) and SAP have hard quotas. Retries consume from the same quota.
- Information currency: Message broker DLQ features evolve rapidly. AWS, Azure, and RabbitMQ added significant DLQ improvements in 2024-2025. Verify against current documentation.