This card covers cross-platform error handling patterns applicable to all major ERP integrations, regardless of the specific ERP system (Salesforce, SAP, Oracle, NetSuite, Dynamics 365, Workday) or middleware platform. The patterns are implemented at the integration middleware layer — the message broker or iPaaS platform that sits between systems.
| System | Role | DLQ Support | Retry Support | Circuit Breaker |
|---|---|---|---|---|
| AWS SQS/SNS | Message broker | Native (redrive policy) | maxReceiveCount (1-1000) | Manual (Step Functions / custom) |
| Azure Service Bus | Message broker | Native (subqueue per entity) | MaxDeliveryCount (default 10) | Manual (custom implementation) |
| Apache Kafka | Event streaming | Via error topic convention | Consumer-side retry logic | Manual (custom implementation) |
| RabbitMQ | Message broker | Native (dead-letter exchange) | x-delivery-limit header | Manual (custom implementation) |
| MuleSoft Anypoint | iPaaS | Built-in DLQ connector | Configurable retry policies | Built-in circuit breaker scope |
| Boomi | iPaaS | Error handling shapes | Configurable retry | Custom via process routes |
| Workato | iPaaS | Error monitoring recipes | Auto-retry with configurable count | Custom via error handlers |
| Pattern | Type | Best For | Complexity | Latency Impact | Data Loss Risk |
|---|---|---|---|---|---|
| Immediate retry | Retry | Transient network blips | Low | Minimal (ms) | Medium (no backoff) |
| Exponential backoff | Retry | Rate limits, temporary overload | Medium | Increasing (1s-60s) | Low |
| Exponential backoff + jitter | Retry | High-concurrency retries | Medium | Increasing (randomized) | Low |
| Circuit breaker | Protection | Prolonged service outages | Medium | Fast-fail when open | None (preserves messages) |
| Dead letter queue | Error isolation | Poison messages, schema errors | Medium | None (async) | Very low |
| Saga with compensation | Transaction | Multi-system writes | High | Variable | Very low |
| Outbox pattern | Delivery guarantee | Exactly-once publish | High | Minimal | Near zero |
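The outbox row above achieves exactly-once publish by making the event write part of the business transaction. A minimal in-memory sketch — class and method names are ours; a real implementation inserts the outbox row in the same database transaction as the business write, with a separate relay process polling unpublished rows:

```python
import uuid

class Outbox:
    """Illustrative in-memory outbox. In production, `rows` is a database
    table written in the same transaction as the business change."""

    def __init__(self):
        self.rows = []  # stand-in for an outbox table

    def save_with_event(self, business_write, event):
        # Business write and outbox row succeed or fail together.
        business_write()
        self.rows.append({"id": str(uuid.uuid4()), "event": event,
                          "published": False})

    def relay(self, publish):
        # Marking published only after a successful publish gives
        # at-least-once delivery; the event id lets consumers
        # deduplicate, yielding effectively-once processing.
        for row in self.rows:
            if not row["published"]:
                publish(row["id"], row["event"])
                row["published"] = True
```

The event id doubles as the idempotency key on the consumer side.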
| Platform | Max Retention | Max Message Size | Max DLQ Depth | Reprocessing Method |
|---|---|---|---|---|
| AWS SQS | 14 days | 256 KB (2 GB with S3) | Unlimited | Redrive to source queue |
| Azure Service Bus (Standard) | Unlimited | 256 KB | 5 GB per entity | Receive + resubmit |
| Azure Service Bus (Premium) | Unlimited | 100 MB | 80 GB per entity | Receive + resubmit |
| Apache Kafka (error topic) | Topic retention config | 1 MB default (configurable) | Partition-based | Consumer from error topic |
| RabbitMQ | TTL-based (configurable) | 128 MB default | Memory/disk-based | Consume from DLX queue |
| Integration Type | Max Retries | Initial Delay | Max Delay | Backoff Factor | Jitter |
|---|---|---|---|---|---|
| Real-time API (user-facing) | 3-5 | 500ms | 30s | 2x | Full jitter |
| Batch/bulk processing | 5-10 | 1s | 5min | 2x | Equal jitter |
| Event-driven (CDC, webhooks) | 5-8 | 1s | 2min | 2x | Full jitter |
| File-based import (FBDI, EIB) | 3 | 30s | 10min | 3x | None |
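The profiles above determine a worst-case (pre-jitter) delay schedule, which is worth eyeballing before tuning. A small helper — function name is ours:

```python
def delay_schedule(max_retries, initial_delay, max_delay, factor):
    """Worst-case (pre-jitter) delay in seconds before each retry."""
    return [min(initial_delay * factor ** a, max_delay)
            for a in range(max_retries)]

# Real-time API profile from the table: 5 retries, 500ms initial, 30s cap, 2x
realtime = delay_schedule(5, 0.5, 30.0, 2.0)
# → [0.5, 1.0, 2.0, 4.0, 8.0]
```

Jitter then randomizes within each window, so actual delays are at or below these values.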
| Parameter | Real-time Integration | Batch Integration | Event-driven |
|---|---|---|---|
| Failure threshold | 5 failures in 60s | 3 failures in 5min | 5 failures in 60s |
| Open state duration | 30s | 5min | 60s |
| Half-open probe count | 1 | 1-3 | 1 |
| Success threshold to close | 2 consecutive | 3 consecutive | 2 consecutive |
N/A — this is a pattern-level card. Authentication is handled at the ERP API layer. See system-specific cards for auth flows per ERP vendor.
START — ERP integration message fails processing
├── What type of error?
│ ├── Transient (429 rate limit, 503 unavailable, network timeout)
│ │ ├── Is circuit breaker OPEN?
│ │ │ ├── YES → Fast-fail, queue for retry after circuit reset
│ │ │ └── NO ↓
│ │ ├── Retry count < max retries?
│ │ │ ├── YES → Retry with exponential backoff + jitter
│ │ │ │ ├── delay = min(initialDelay * 2^attempt, maxDelay)
│ │ │ │ └── actualDelay = random(0, delay) [full jitter]
│ │ │ └── NO → Move to Dead Letter Queue with full context
│ │ └── Did retry succeed?
│ │ ├── YES → Mark success, reset failure counter
│ │ └── NO → Increment failure counter, check circuit breaker
│ ├── Non-transient / Poison (400, 422, schema mismatch)
│ │ └── Route IMMEDIATELY to DLQ — retries will never succeed
│ ├── Partial success (bulk: some records succeed, some fail)
│ │ ├── Extract failed records from response
│ │ └── Route failed-record message through retry pipeline
│ └── Authentication error (401, 403)
│ ├── Token expired? → Refresh token, retry once
│ └── Permissions changed? → Alert + DLQ (do not retry)
├── DLQ message processing
│ ├── Automated triage: classify, group, auto-resolve known patterns
│ └── Manual review: fix data, replay, or purge
└── Monitoring
├── DLQ depth: alert at > 100, page at > 1000
├── DLQ ingestion rate: alert if > 1% of total throughput
└── Circuit breaker state changes: log every transition
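The partial-success branch of the tree depends on splitting a bulk response into succeeded and failed records. A sketch, assuming one per-record result object with a boolean `success` flag — adjust to the actual per-record result shape your ERP's bulk API returns:

```python
def split_bulk_response(records, results):
    """Split a bulk upsert response into succeeded and failed records.

    Assumes results[i] corresponds to records[i] and carries a boolean
    "success" flag plus an optional "error" string (an assumption —
    real bulk APIs vary in shape).
    """
    succeeded, failed = [], []
    for record, result in zip(records, results):
        if result.get("success"):
            succeeded.append(record)
        else:
            # Only failed records re-enter the retry pipeline.
            failed.append({"record": record,
                           "error": result.get("error", "unknown")})
    return succeeded, failed
```

Re-submitting only the failed slice avoids duplicate writes for records that already succeeded.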
| Error Type | Retryable? | Strategy | Max Retries | DLQ Action |
|---|---|---|---|---|
| HTTP 429 (Rate Limit) | Yes | Backoff, respect Retry-After | 5-10 | Reprocess after cooldown |
| HTTP 500 (Server Error) | Yes | Exponential backoff + jitter | 5 | Investigate server-side |
| HTTP 502/503/504 (Gateway) | Yes | Exponential backoff | 5 | Reprocess after recovery |
| HTTP 400 (Bad Request) | No | Immediate DLQ | 0 | Fix payload, resubmit |
| HTTP 401 (Unauthorized) | Once | Refresh token, retry once | 1 | Rotate credentials |
| HTTP 403 (Forbidden) | No | Immediate DLQ | 0 | Fix permissions |
| HTTP 404 (Not Found) | No | Immediate DLQ | 0 | Fix resource reference |
| HTTP 409 (Conflict) | Conditional | Retry with conflict resolution | 3 | Manual merge |
| HTTP 422 (Validation) | No | Immediate DLQ | 0 | Fix data, resubmit |
| Connection timeout | Yes | Exponential backoff | 5 | Check network/firewall |
| SSL/TLS failure | No | Immediate DLQ | 0 | Fix certificates |
| Schema mismatch | No | Immediate DLQ | 0 | Update schema, resubmit |
Before implementing retry logic, establish a clear error classification. Retrying non-retryable errors wastes resources and delays DLQ processing. [src2]
RETRYABLE_ERRORS = {429, 500, 502, 503, 504}
NON_RETRYABLE_ERRORS = {400, 401, 403, 404, 405, 409, 422}
def classify_error(status_code, error_body):
if status_code in RETRYABLE_ERRORS:
return "transient"
if status_code == 401:
return "auth_expired" # Retry once after token refresh
if status_code in NON_RETRYABLE_ERRORS:
return "poison" # Route to DLQ immediately
if status_code >= 500:
return "transient"
return "poison"
Verify: Run classifier against last 30 days of integration logs → expected: <5% misclassification rate.
The backoff formula prevents retry storms. Full jitter distributes retries evenly across the delay window. [src3]
import random
def exponential_backoff_with_jitter(attempt, initial_delay=1.0,
max_delay=60.0, factor=2.0):
exponential_delay = initial_delay * (factor ** attempt)
capped_delay = min(exponential_delay, max_delay)
return random.uniform(0, capped_delay) # Full jitter
Verify: exponential_backoff_with_jitter(0) returns 0-1.0; exponential_backoff_with_jitter(5) returns 0-32.0.
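The retry-configuration table prescribes equal jitter for batch workloads. A variant of the same formula that keeps half the capped delay as a guaranteed floor — a sketch under the same parameter conventions:

```python
import random

def exponential_backoff_equal_jitter(attempt, initial_delay=1.0,
                                     max_delay=60.0, factor=2.0):
    """Equal jitter: half the capped delay is guaranteed, half randomized.

    The floor stops batch jobs from retrying too eagerly while still
    spreading retries across a window to avoid synchronized storms.
    """
    capped = min(initial_delay * factor ** attempt, max_delay)
    return capped / 2 + random.uniform(0, capped / 2)
```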
Prevents hammering an ERP API that is already down — avoids wasting API quota and prevents cascading failures. [src4]
import time

class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"
    def __init__(self, failure_threshold=5, reset_timeout=30,
                 success_threshold=2):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.success_threshold = success_threshold
        self.state = self.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
    def can_execute(self):
        if self.state == self.CLOSED: return True
        if self.state == self.OPEN:
            if time.time() - self.last_failure_time >= self.reset_timeout:
                self.state = self.HALF_OPEN
                self.success_count = 0
                return True
            return False
        return True  # half_open: allow probe requests through
    def record_success(self):
        if self.state == self.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = self.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0
    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        # Any failure during half-open, or crossing the threshold, opens it
        if self.state == self.HALF_OPEN or \
                self.failure_count >= self.failure_threshold:
            self.state = self.OPEN
            self.success_count = 0
Verify: Trigger 5 failures → can_execute() returns False. Wait reset_timeout → returns True (half-open).
Combine error classification, backoff, and circuit breaker into a unified pipeline. [src1, src2]
import time, uuid

class ERPRetryPipeline:
    def __init__(self, erp_client, circuit_breaker, dlq_client, max_retries=5):
        self.erp = erp_client
        self.cb = circuit_breaker
        self.dlq = dlq_client
        self.max_retries = max_retries
    def process_message(self, message):
        idempotency_key = message.get("idempotency_key") or str(uuid.uuid4())
        for attempt in range(self.max_retries + 1):
            if not self.cb.can_execute():
                return self._send_to_dlq(message, attempt,
                    "circuit_breaker_open", "ERP unavailable")
            try:
                # Pass the idempotency key so the ERP can deduplicate a retried send
                response = self.erp.send(message, idempotency_key)
                self.cb.record_success()
                return {"status": "success", "attempt": attempt}
            except ERPAPIError as e:
                error_type = classify_error(e.status_code, e.body)
                if error_type == "poison":
                    return self._send_to_dlq(message, attempt,
                        f"http_{e.status_code}", str(e.body))
                self.cb.record_failure()
                if attempt < self.max_retries:
                    delay = exponential_backoff_with_jitter(attempt)
                    time.sleep(delay)
        return self._send_to_dlq(message, self.max_retries,
            "max_retries_exceeded", "All retries exhausted")
    def _send_to_dlq(self, message, attempts, error_code, error_detail):
        self.dlq.send({"original_message": message, "error_code": error_code,
                       "error_detail": error_detail,
                       "attempt_count": attempts + 1})
        return {"status": "dead_lettered", "error_code": error_code}
Verify: Mock ERP returns 503 three times then 200 → pipeline returns {"status": "success", "attempt": 3}.
Set up the dead letter queue on your chosen message broker. [src5, src7]
# AWS SQS: Create DLQ, then attach a redrive policy to the source queue
aws sqs create-queue --queue-name erp-integration-dlq \
    --attributes '{"MessageRetentionPeriod":"1209600"}' # 14 days
aws sqs set-queue-attributes \
    --queue-url https://sqs.us-east-1.amazonaws.com/123456789/erp-integration \
    --attributes '{"RedrivePolicy":"{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789:erp-integration-dlq\",\"maxReceiveCount\":\"5\"}"}'
# Azure Service Bus: DLQ is automatic (subqueue per entity)
az servicebus queue update --name erp-integration \
--namespace-name mybus --resource-group myrg \
--max-delivery-count 5
Verify: aws sqs get-queue-attributes shows DLQ ARN and maxReceiveCount of 5.
DLQ without monitoring is a silent data loss risk. [src1]
from prometheus_client import Counter, Gauge
dlq_messages_total = Counter('erp_dlq_messages_total',
'Total messages routed to DLQ', ['integration_id', 'error_type'])
dlq_depth = Gauge('erp_dlq_depth',
'Current DLQ message count', ['queue_name'])
circuit_breaker_state = Gauge('erp_circuit_breaker_state',
'Circuit breaker state (0=closed, 1=open, 2=half_open)',
['service_name'])
Verify: Query /metrics endpoint → all metric families visible. Trigger DLQ message → counter increments.
# Input: Records to upsert via Salesforce Bulk API 2.0
# Output: Success/failure counts, DLQ message IDs for failed batches
import requests, time, random, uuid

class SalesforceBulkRetryHandler:
    def __init__(self, instance_url, access_token, dlq_client,
                 max_retries=5, initial_delay=1.0, max_delay=60.0):
        self.base_url = f"{instance_url}/services/data/v62.0"
        self.headers = {"Authorization": f"Bearer {access_token}",
                        "Content-Type": "application/json"}
        self.dlq = dlq_client
        self.max_retries = max_retries
        self.initial_delay = initial_delay
        self.max_delay = max_delay
    def upsert_with_retry(self, object_name, external_id_field, records):
        idem_key = str(uuid.uuid4())
        for attempt in range(self.max_retries + 1):
            try:
                job = requests.post(f"{self.base_url}/jobs/ingest",
                    headers=self.headers,
                    json={"object": object_name,
                          "externalIdFieldName": external_id_field,
                          "operation": "upsert", "contentType": "CSV"})
                if job.status_code == 429:
                    retry_after = int(job.headers.get("Retry-After", 0))
                    time.sleep(max(self._backoff(attempt), retry_after))
                    continue
                if job.status_code >= 500:
                    time.sleep(self._backoff(attempt)); continue
                if job.status_code >= 400:
                    return self._dead_letter(records, idem_key, attempt,
                        f"HTTP {job.status_code}", job.text)
                # CSV batch upload and job-close steps omitted for brevity
                return {"status": "success", "job_id": job.json()["id"]}
            except requests.exceptions.ConnectionError:
                if attempt < self.max_retries:
                    time.sleep(self._backoff(attempt))
        return self._dead_letter(records, idem_key, self.max_retries,
            "max_retries", "All retries exhausted")
    def _backoff(self, attempt):
        return random.uniform(0, min(self.initial_delay * (2 ** attempt),
                                     self.max_delay))
    def _dead_letter(self, records, idem_key, attempts, error_code, detail):
        self.dlq.send({"original_message": records, "error_code": error_code,
                       "error_detail": detail[:4096],
                       "attempt_count": attempts + 1,
                       "idempotency_key": idem_key})
        return {"status": "dead_lettered", "error_code": error_code}
// Input: ERP API call function + message payload
// Output: Success response or DLQ envelope
// Assumes Node 19+ (global crypto); on older Node: const crypto = require('node:crypto');
async function retryWithDLQ(apiCall, message, {
maxRetries = 5, initialDelay = 1000, maxDelay = 60000,
factor = 2, circuitBreaker, dlqSend
} = {}) {
const idempotencyKey = message.idempotencyKey || crypto.randomUUID();
for (let attempt = 0; attempt <= maxRetries; attempt++) {
if (!circuitBreaker.canExecute()) {
return dlqSend({ originalMessage: message,
errorCode: 'circuit_breaker_open', attempts: attempt,
idempotencyKey, deadLetteredAt: new Date().toISOString() });
}
try {
const result = await apiCall(message, idempotencyKey);
circuitBreaker.recordSuccess();
return { status: 'success', attempt, result };
} catch (err) {
if ([400, 403, 404, 422].includes(err.statusCode)) {
return dlqSend({ originalMessage: message,
errorCode: `http_${err.statusCode}`, attempts: attempt + 1,
idempotencyKey, deadLetteredAt: new Date().toISOString() });
}
circuitBreaker.recordFailure();
if (attempt < maxRetries) {
const delay = Math.random() * Math.min(
initialDelay * Math.pow(factor, attempt), maxDelay);
await new Promise(r => setTimeout(r, delay));
}
}
}
return dlqSend({ originalMessage: message,
errorCode: 'max_retries_exceeded', attempts: maxRetries + 1,
idempotencyKey, deadLetteredAt: new Date().toISOString() });
}
# Check DLQ depth (AWS SQS)
aws sqs get-queue-attributes \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789/erp-dlq \
--attribute-names ApproximateNumberOfMessages
# Replay DLQ messages back to source queue
aws sqs start-message-move-task \
--source-arn arn:aws:sqs:us-east-1:123456789:erp-dlq \
--destination-arn arn:aws:sqs:us-east-1:123456789:erp-integration
# Check DLQ count (Azure Service Bus)
az servicebus queue show --name erp-integration \
--namespace-name mybus --resource-group myrg \
--query 'countDetails.deadLetterMessageCount'
| Field | Type | Required | Description | Gotcha |
|---|---|---|---|---|
| original_message | Object | Yes | Complete original message payload | Must preserve all fields — partial messages break replay |
| error_code | String | Yes | Machine-readable error code | Standardize across all integrations |
| error_detail | String | Yes | Human-readable error description | Truncate to 4KB max |
| attempt_count | Integer | Yes | Number of attempts before dead-lettering | Starts at 1, not 0 |
| idempotency_key | String | Yes | Unique key for replay deduplication | UUID or composite business key |
| dead_lettered_at | ISO 8601 | Yes | When message was dead-lettered | Always UTC |
| source_queue | String | Yes | Origin queue/topic name | Required for routing replays |
| integration_id | String | Yes | Integration flow identifier | Maps to monitoring dashboards |
| correlation_id | String | Recommended | End-to-end trace ID | Enables cross-system debugging |
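A small builder can enforce the required fields and the 4 KB truncation rule from the schema above. The helper name is ours; truncation here is by characters — switch to bytes if your broker enforces a byte limit:

```python
import uuid
from datetime import datetime, timezone

MAX_ERROR_DETAIL_CHARS = 4096  # per the schema's truncation rule

def build_dlq_envelope(original_message, error_code, error_detail,
                       attempt_count, source_queue, integration_id,
                       idempotency_key=None, correlation_id=None):
    """Build a DLQ envelope with every required field populated."""
    if attempt_count < 1:
        raise ValueError("attempt_count starts at 1, not 0")
    return {
        "original_message": original_message,  # full payload — never truncate
        "error_code": error_code,
        "error_detail": error_detail[:MAX_ERROR_DETAIL_CHARS],
        "attempt_count": attempt_count,
        "idempotency_key": idempotency_key or str(uuid.uuid4()),
        "dead_lettered_at": datetime.now(timezone.utc).isoformat(),
        "source_queue": source_queue,
        "integration_id": integration_id,
        "correlation_id": correlation_id,
    }
```

Centralizing envelope construction keeps error codes and timestamps consistent across integrations.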
| Code | Meaning | Cause | Resolution |
|---|---|---|---|
| HTTP 429 | Rate limit exceeded | Too many API calls in window | Exponential backoff, respect Retry-After header |
| HTTP 503 | Service unavailable | ERP maintenance or overload | Exponential backoff, check ERP status page |
| ETIMEDOUT | Connection timeout | Network issue or slow response | Increase timeout, check firewall rules |
| ECONNREFUSED | Connection refused | Service down or port blocked | Open circuit breaker, alert operations |
| MaxDeliveryCountExceeded | DLQ threshold reached | Message failed N times | Review error, fix root cause, replay |
| UNABLE_TO_LOCK_ROW | Record lock conflict | Concurrent update to same record | Retry with jitter, implement locking |
| INVALID_SESSION_ID | Token expired/invalid | OAuth token expired | Refresh token, retry once |
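The 401/INVALID_SESSION_ID rows above call for exactly one retry after a token refresh. A hedged sketch — the exception and function names are ours, not any vendor SDK's:

```python
class AuthError(Exception):
    """Raised on HTTP 401 / INVALID_SESSION_ID (illustrative)."""

def call_with_token_refresh(api_call, refresh_token_fn):
    """Retry exactly once after refreshing an expired token.

    A second AuthError means permissions likely changed, so it
    propagates to the caller's DLQ handling rather than looping.
    """
    try:
        return api_call()
    except AuthError:
        refresh_token_fn()   # fetch a fresh OAuth token
        return api_call()    # second failure propagates — do not retry again
```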
Disable retries at all layers except one. [src6]
Alert on DLQ depth > 100 within 24 hours of go-live. [src1]
Store circuit breaker state in Redis or DynamoDB. [src4]
Every operation must use an idempotency key. Use upsert, not insert. [src2]
Track replay_count in DLQ envelope. Quarantine messages with replay_count > 3. [src1]
Rate-limit DLQ replay to 10-20% of normal throughput. [src3]
# BAD — retries non-retryable errors, wasting time and API quota
def send_to_erp(message, max_retries=5):
for attempt in range(max_retries):
try:
return erp_api.post(message)
except Exception as e:
time.sleep(2 ** attempt) # Retries 400, 403, 422 too
raise Exception("All retries failed")
# GOOD — only retries transient errors, DLQs poison messages
def send_to_erp(message, max_retries=5):
for attempt in range(max_retries):
try:
return erp_api.post(message)
except ERPError as e:
if e.status_code in (400, 403, 404, 422):
dlq.send(message, error=str(e)) # Don't retry
return None
if attempt < max_retries - 1:
time.sleep(exponential_backoff_with_jitter(attempt))
dlq.send(message, error="Retries exhausted")
# BAD — hammers ERP API at constant rate during outage
for attempt in range(5):
try:
response = erp_api.post(message); break
except Exception:
time.sleep(5) # Same 5s delay every time
# GOOD — spreads retry load, respects rate limits
for attempt in range(5):
try:
response = erp_api.post(message); break
except TransientError:
delay = min(1.0 * (2 ** attempt), 60.0)
time.sleep(random.uniform(0, delay)) # Full jitter
# BAD — raw message in DLQ, no context for debugging
dlq.send(original_message)
# Later: "Why is this in the DLQ? When did it fail?"
# GOOD — complete context for debugging and safe replay
dlq.send({
"original_message": original_message,
"error_code": "http_429",
"error_detail": "Rate limit exceeded on Salesforce Bulk API",
"attempt_count": 6,
"idempotency_key": "order-12345-upsert-v2",
"dead_lettered_at": "2026-03-01T14:30:00Z",
"source_queue": "salesforce-bulk-ingest",
"correlation_id": "trace-abc-123"
})
Set TTL (14 days transient, 90 days data issues). Auto-archive to cold storage. [src5]
Choose one retry layer (application for ERP). Disable retries elsewhere. [src6]
Include idempotency_key in every message. Use upsert, not insert. [src2]
Require 5+ failures in 60s. Use percentage-based thresholds for high volume. [src4]
Tune based on normal error rate. For <1% natural error rate, 5 failures in 60s is appropriate. [src4]
Rate-limit replays to 10-20% of normal throughput. [src3]
Parse per-record results. Only retry failed records. [src1]
# Check DLQ depth (AWS SQS)
aws sqs get-queue-attributes --queue-url $DLQ_URL \
--attribute-names ApproximateNumberOfMessages
# Check DLQ depth (Azure Service Bus)
az servicebus queue show --name erp-integration \
--namespace-name $NAMESPACE --resource-group $RG \
--query 'countDetails.deadLetterMessageCount'
# Check circuit breaker state (Redis)
redis-cli GET "circuit_breaker:salesforce_api:state"
# Monitor retry rate (Prometheus)
# rate(erp_retry_attempts_total[5m]) / rate(erp_messages_processed_total[5m])
# Replay DLQ messages (AWS SQS)
aws sqs start-message-move-task \
--source-arn $DLQ_ARN \
--destination-arn $SOURCE_QUEUE_ARN \
--max-number-of-messages-per-second 10
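Replays should be paced, not dumped: the operational checklists cap replay at 10-20% of normal throughput. A pure-Python pacing sketch — the `send` callable and rate figures are assumptions, and `sleep` is injectable so the pacing logic can be tested without waiting:

```python
import time

def replay_messages(messages, send, normal_rate_per_s, fraction=0.1,
                    sleep=time.sleep):
    """Replay DLQ messages at a fraction of normal throughput.

    Paces sends so replay traffic stays at `fraction` (10-20%) of the
    integration's normal message rate, leaving headroom for live load.
    """
    replay_rate = max(normal_rate_per_s * fraction, 0.1)  # floor: 1 msg / 10s
    interval = 1.0 / replay_rate
    for msg in messages:
        send(msg)
        sleep(interval)
```

Brokers with native throttles (e.g. SQS's `--max-number-of-messages-per-second` above) make this unnecessary; the sketch covers platforms without one.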
| Pattern/Platform | Version | Date | Status | Key Changes |
|---|---|---|---|---|
| AWS SQS DLQ Redrive | GA | 2023-07 | Current | Native message-move-task for DLQ replay |
| Azure Service Bus DLQ | GA | 2025-05 | Current | Enhanced dead-letter reason headers |
| Apache Kafka Error Topics | Convention | Ongoing | Current | No native DLQ — error topic pattern |
| RabbitMQ Dead Letter Exchange | GA (3.x+) | 2024-09 | Current | x-delivery-limit in 3.12+ (quorum queues) |
| MuleSoft Error Handling | 4.x | 2025-01 | Current | Enhanced DLQ connector with retry policies |
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Zero-loss requirement (financial transactions, orders) | Fire-and-forget analytics events | Simple logging + batch reconciliation |
| ERP API has rate limits or intermittent outages | Target has 99.99% availability SLA | Direct API call with try/catch |
| Multi-system integration (multiple failure points) | Single-system CRUD operations | Database transaction with rollback |
| Asynchronous processing (message queues, events) | Synchronous user-facing API calls | HTTP retry middleware (Polly, resilience4j) |
| Batch operations with partial failures | All-or-nothing transactional requirements | Saga pattern with compensation |
| Long-running integration jobs (>30s per operation) | Sub-second API calls | Inline retry with timeout |
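The "Saga pattern with compensation" alternative named in the table can be sketched minimally: run each step, and on failure undo the completed steps in reverse order. Compensations must be idempotent, since a crash mid-rollback means they may run again on recovery:

```python
def run_saga(steps):
    """Run (action, compensate) pairs; on failure, compensate completed
    steps in reverse order, then re-raise for DLQ handling upstream."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()
        raise
```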
| Capability | AWS SQS | Azure Service Bus | Apache Kafka | RabbitMQ | MuleSoft |
|---|---|---|---|---|---|
| Native DLQ | Yes (redrive policy) | Yes (subqueue) | No (error topic) | Yes (DLX) | Yes (connector) |
| Max retention | 14 days | Unlimited (Premium) | Topic config | TTL-based | Platform storage |
| DLQ replay | Native (move-task) | Manual | Consumer from error topic | Manual | UI-based |
| Max delivery count | 1-1000 | 1-2000 (default 10) | Consumer-side | x-delivery-limit | Configurable |
| Circuit breaker | Manual | Manual | Manual | Manual | Built-in |
| Message ordering | FIFO (optional) | Sessions (optional) | Partition ordering | Per-queue | Flow ordering |
| Dead-letter reason | Application-set | System headers + custom | Application-set | Application-set | Error type metadata |
| Max message size | 256KB (2GB w/ S3) | 256KB / 100MB | 1MB default | 128MB default | Platform limit |
| Cost model | Per-message | Per-operation + storage | Self-hosted infra | Self-hosted infra | License-based |