Error Handling and Dead Letter Queues for ERP Integrations: Retry Strategies

Type: ERP Integration | System: Cross-Platform (AWS SQS, Azure Service Bus, Kafka, RabbitMQ) | Confidence: 0.92 | Sources: 7 | Verified: 2026-03-02 | Freshness: evolving

TL;DR

Classify every failed ERP message before acting: retry transient errors (429, 5xx, timeouts) with exponential backoff plus jitter behind a circuit breaker, and route poison messages (400, 422, schema mismatches) straight to a dead letter queue wrapped in a diagnostic envelope. Monitor DLQ depth and ingestion rate so dead-lettered messages are triaged and replayed rather than silently lost.

System Profile

This card covers cross-platform error handling patterns applicable to all major ERP integrations, regardless of the specific ERP system (Salesforce, SAP, Oracle, NetSuite, Dynamics 365, Workday) or middleware platform. The patterns are implemented at the integration middleware layer — the message broker or iPaaS platform that sits between systems.

| System | Role | DLQ Support | Retry Support | Circuit Breaker |
|---|---|---|---|---|
| AWS SQS/SNS | Message broker | Native (redrive policy) | maxReceiveCount (1-1000) | Manual (Step Functions / custom) |
| Azure Service Bus | Message broker | Native (subqueue per entity) | MaxDeliveryCount (default 10) | Manual (custom implementation) |
| Apache Kafka | Event streaming | Via error topic convention | Consumer-side retry logic | Manual (custom implementation) |
| RabbitMQ | Message broker | Native (dead-letter exchange) | x-delivery-limit header | Manual (custom implementation) |
| MuleSoft Anypoint | iPaaS | Built-in DLQ connector | Configurable retry policies | Built-in circuit breaker scope |
| Boomi | iPaaS | Error handling shapes | Configurable retry | Custom via process routes |
| Workato | iPaaS | Error monitoring recipes | Auto-retry with configurable count | Custom via error handlers |

API Surfaces & Capabilities

| Pattern | Type | Best For | Complexity | Latency Impact | Data Loss Risk |
|---|---|---|---|---|---|
| Immediate retry | Retry | Transient network blips | Low | Minimal (ms) | Medium (no backoff) |
| Exponential backoff | Retry | Rate limits, temporary overload | Medium | Increasing (1s-60s) | Low |
| Exponential backoff + jitter | Retry | High-concurrency retries | Medium | Increasing (randomized) | Low |
| Circuit breaker | Protection | Prolonged service outages | Medium | Fast-fail when open | None (preserves messages) |
| Dead letter queue | Error isolation | Poison messages, schema errors | Medium | None (async) | Very low |
| Saga with compensation | Transaction | Multi-system writes | High | Variable | Very low |
| Outbox pattern | Delivery guarantee | Exactly-once publish | High | Minimal | Near zero |

Rate Limits & Quotas

Per-Platform DLQ Limits

| Platform | Max Retention | Max Message Size | Max DLQ Depth | Reprocessing Method |
|---|---|---|---|---|
| AWS SQS | 14 days | 256 KB (2 GB with S3) | Unlimited | Redrive to source queue |
| Azure Service Bus (Standard) | Unlimited | 256 KB | 5 GB per entity | Receive + resubmit |
| Azure Service Bus (Premium) | Unlimited | 100 MB | 80 GB per entity | Receive + resubmit |
| Apache Kafka (error topic) | Topic retention config | 1 MB default (configurable) | Partition-based | Consume from error topic |
| RabbitMQ | TTL-based (configurable) | 128 MB default | Memory/disk-based | Consume from DLX queue |

Retry Budget Guidelines

| Integration Type | Max Retries | Initial Delay | Max Delay | Backoff Factor | Jitter |
|---|---|---|---|---|---|
| Real-time API (user-facing) | 3-5 | 500ms | 30s | 2x | Full jitter |
| Batch/bulk processing | 5-10 | 1s | 5min | 2x | Equal jitter |
| Event-driven (CDC, webhooks) | 5-8 | 1s | 2min | 2x | Full jitter |
| File-based import (FBDI, EIB) | 3 | 30s | 10min | 3x | None |
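These budgets translate directly into configuration. A minimal sketch, using the upper end of each retry range; the dict keys and function name are illustrative, not any platform's API:

```python
# Retry budgets from the table above, expressed as configuration.
# Delays are in seconds; "jitter" names the randomization strategy.
RETRY_BUDGETS = {
    "real_time":    {"max_retries": 5,  "initial_delay": 0.5,  "max_delay": 30.0,  "factor": 2.0, "jitter": "full"},
    "batch":        {"max_retries": 10, "initial_delay": 1.0,  "max_delay": 300.0, "factor": 2.0, "jitter": "equal"},
    "event_driven": {"max_retries": 8,  "initial_delay": 1.0,  "max_delay": 120.0, "factor": 2.0, "jitter": "full"},
    "file_import":  {"max_retries": 3,  "initial_delay": 30.0, "max_delay": 600.0, "factor": 3.0, "jitter": "none"},
}

def get_retry_budget(integration_type):
    """Look up a retry budget, falling back to the conservative batch profile."""
    return RETRY_BUDGETS.get(integration_type, RETRY_BUDGETS["batch"])
```

Keeping the budget in one lookup table means a pipeline can tune retries per integration type without code changes.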

Circuit Breaker Thresholds

| Parameter | Real-time Integration | Batch Integration | Event-driven |
|---|---|---|---|
| Failure threshold | 5 failures in 60s | 3 failures in 5min | 5 failures in 60s |
| Open state duration | 30s | 5min | 60s |
| Half-open probe count | 1 | 1-3 | 1 |
| Success threshold to close | 2 consecutive | 3 consecutive | 2 consecutive |
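These thresholds can likewise be held as per-profile configuration and fed into a breaker constructor. A sketch with illustrative key names (durations in seconds):

```python
# Circuit breaker thresholds from the table above, per integration profile.
BREAKER_PROFILES = {
    "real_time":    {"failure_threshold": 5, "failure_window": 60,
                     "open_duration": 30,  "half_open_probes": 1,
                     "close_after_successes": 2},
    "batch":        {"failure_threshold": 3, "failure_window": 300,
                     "open_duration": 300, "half_open_probes": 3,
                     "close_after_successes": 3},
    "event_driven": {"failure_threshold": 5, "failure_window": 60,
                     "open_duration": 60,  "half_open_probes": 1,
                     "close_after_successes": 2},
}
```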

Authentication

N/A — this is a pattern-level card. Authentication is handled at the ERP API layer. See system-specific cards for auth flows per ERP vendor.

Constraints

Integration Pattern Decision Tree

```
START — ERP integration message fails processing
├── What type of error?
│   ├── Transient (429 rate limit, 503 unavailable, network timeout)
│   │   ├── Is circuit breaker OPEN?
│   │   │   ├── YES → Fast-fail, queue for retry after circuit reset
│   │   │   └── NO ↓
│   │   ├── Retry count < max retries?
│   │   │   ├── YES → Retry with exponential backoff + jitter
│   │   │   │   ├── delay = min(initialDelay * 2^attempt, maxDelay)
│   │   │   │   └── actualDelay = random(0, delay) [full jitter]
│   │   │   └── NO → Move to Dead Letter Queue with full context
│   │   └── Did retry succeed?
│   │       ├── YES → Mark success, reset failure counter
│   │       └── NO → Increment failure counter, check circuit breaker
│   ├── Non-transient / Poison (400, 422, schema mismatch)
│   │   └── Route IMMEDIATELY to DLQ — retries will never succeed
│   ├── Partial success (bulk: some records succeed, some fail)
│   │   ├── Extract failed records from response
│   │   └── Route failed-record message through retry pipeline
│   └── Authentication error (401, 403)
│       ├── Token expired? → Refresh token, retry once
│       └── Permissions changed? → Alert + DLQ (do not retry)
├── DLQ message processing
│   ├── Automated triage: classify, group, auto-resolve known patterns
│   └── Manual review: fix data, replay, or purge
└── Monitoring
    ├── DLQ depth: alert at > 100, page at > 1000
    ├── DLQ ingestion rate: alert if > 1% of total throughput
    └── Circuit breaker state changes: log every transition
```
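The per-message branches of this tree condense into a single pure function. A sketch, assuming the three error classes used elsewhere in this card ("transient", "poison", "auth_expired"); the function name and signature are illustrative:

```python
import random

def next_action(error_class, attempt, max_retries, circuit_open,
                initial_delay=1.0, max_delay=60.0):
    """Return (action, delay_seconds) for a failed message, per the tree above."""
    if error_class == "poison":
        return ("dlq", 0.0)                    # retries will never succeed
    if error_class == "auth_expired":
        # Refresh the token and retry exactly once; then dead-letter.
        return ("refresh_and_retry", 0.0) if attempt == 0 else ("dlq", 0.0)
    if circuit_open:
        return ("fast_fail", 0.0)              # queue for retry after circuit reset
    if attempt >= max_retries:
        return ("dlq", 0.0)
    delay = min(initial_delay * 2 ** attempt, max_delay)
    return ("retry", random.uniform(0, delay))  # full jitter
```

A worker loop calls this after every failure and only needs to know how to execute the five actions.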

Quick Reference

| Error Type | Retryable? | Strategy | Max Retries | DLQ Action |
|---|---|---|---|---|
| HTTP 429 (Rate Limit) | Yes | Backoff, respect Retry-After | 5-10 | Reprocess after cooldown |
| HTTP 500 (Server Error) | Yes | Exponential backoff + jitter | 5 | Investigate server-side |
| HTTP 502/503/504 (Gateway) | Yes | Exponential backoff | 5 | Reprocess after recovery |
| HTTP 400 (Bad Request) | No | Immediate DLQ | 0 | Fix payload, resubmit |
| HTTP 401 (Unauthorized) | Once | Refresh token, retry once | 1 | Rotate credentials |
| HTTP 403 (Forbidden) | No | Immediate DLQ | 0 | Fix permissions |
| HTTP 404 (Not Found) | No | Immediate DLQ | 0 | Fix resource reference |
| HTTP 409 (Conflict) | Conditional | Retry with conflict resolution | 3 | Manual merge |
| HTTP 422 (Validation) | No | Immediate DLQ | 0 | Fix data, resubmit |
| Connection timeout | Yes | Exponential backoff | 5 | Check network/firewall |
| SSL/TLS failure | No | Immediate DLQ | 0 | Fix certificates |
| Schema mismatch | No | Immediate DLQ | 0 | Update schema, resubmit |

Step-by-Step Integration Guide

1. Classify errors into retryable vs non-retryable

Before implementing retry logic, establish a clear error classification. Retrying non-retryable errors wastes resources and delays DLQ processing. [src2]

```python
RETRYABLE_ERRORS = {429, 500, 502, 503, 504}
NON_RETRYABLE_ERRORS = {400, 401, 403, 404, 405, 409, 422}

def classify_error(status_code, error_body):
    # error_body is reserved for message-level checks (e.g. ERP error codes)
    if status_code in RETRYABLE_ERRORS:
        return "transient"
    if status_code == 401:
        return "auth_expired"  # Retry once after token refresh
    if status_code in NON_RETRYABLE_ERRORS:
        return "poison"  # Route to DLQ immediately
    if status_code >= 500:
        return "transient"
    return "poison"
```

Verify: Run classifier against last 30 days of integration logs → expected: <5% misclassification rate.

2. Implement exponential backoff with jitter

The backoff formula prevents retry storms. Full jitter distributes retries evenly across the delay window. [src3]

```python
import random

def exponential_backoff_with_jitter(attempt, initial_delay=1.0,
                                     max_delay=60.0, factor=2.0):
    exponential_delay = initial_delay * (factor ** attempt)
    capped_delay = min(exponential_delay, max_delay)
    return random.uniform(0, capped_delay)  # Full jitter
```

Verify: exponential_backoff_with_jitter(0) returns 0-1.0; exponential_backoff_with_jitter(5) returns 0-32.0.

3. Implement the circuit breaker

Prevents hammering an ERP API that is already down — avoids wasting API quota and prevents cascading failures. [src4]

```python
import time

class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, reset_timeout=30,
                 success_threshold=2):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.success_threshold = success_threshold
        self.state = self.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None

    def can_execute(self):
        if self.state == self.CLOSED:
            return True
        if self.state == self.OPEN:
            if time.time() - self.last_failure_time >= self.reset_timeout:
                self.state = self.HALF_OPEN
                return True
            return False
        return True  # half_open: allow probe requests

    def record_success(self):
        if self.state == self.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = self.CLOSED
                self.failure_count = self.success_count = 0
        else:
            self.failure_count = 0  # healthy traffic resets the counter

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == self.HALF_OPEN or \
                self.failure_count >= self.failure_threshold:
            self.state = self.OPEN
            self.success_count = 0
```

Verify: Trigger 5 failures → can_execute() returns False. Wait reset_timeout → returns True (half-open).

4. Build the retry-with-DLQ pipeline

Combine error classification, backoff, and circuit breaker into a unified pipeline. [src1, src2]

```python
import time
import uuid

class ERPRetryPipeline:
    def __init__(self, erp_client, circuit_breaker, dlq_client, max_retries=5):
        self.erp = erp_client      # ERP API client exposing send()
        self.cb = circuit_breaker  # CircuitBreaker from step 3
        self.dlq = dlq_client      # DLQ client exposing send()
        self.max_retries = max_retries

    def process_message(self, message):
        idempotency_key = message.get("idempotency_key") or str(uuid.uuid4())
        for attempt in range(self.max_retries + 1):
            if not self.cb.can_execute():
                return self._send_to_dlq(message, attempt,
                    "circuit_breaker_open", "ERP unavailable")
            try:
                self.erp.send(message)
                self.cb.record_success()
                return {"status": "success", "attempt": attempt}
            except ERPAPIError as e:  # raised by the ERP client library
                error_type = classify_error(e.status_code, e.body)
                if error_type == "poison":
                    return self._send_to_dlq(message, attempt,
                        f"http_{e.status_code}", str(e.body))
                self.cb.record_failure()
                if attempt < self.max_retries:
                    time.sleep(exponential_backoff_with_jitter(attempt))
        return self._send_to_dlq(message, self.max_retries,
            "max_retries_exceeded", "All retries exhausted")

    def _send_to_dlq(self, message, attempts, error_code, error_detail):
        self.dlq.send({"original_message": message, "error_code": error_code,
                       "error_detail": error_detail, "attempt_count": attempts})
        return {"status": "dead_lettered", "error_code": error_code}
```

Verify: Mock ERP returns 503 three times then 200 → pipeline returns {"status": "success", "attempt": 3}.

5. Configure platform-specific DLQ

Set up the dead letter queue on your chosen message broker. [src5, src7]

```shell
# AWS SQS: create the DLQ (retention in seconds; 1209600 = 14 days)
aws sqs create-queue --queue-name erp-integration-dlq \
  --attributes '{"MessageRetentionPeriod":"1209600"}'

# AWS SQS: attach the redrive policy to the SOURCE queue so messages
# move to the DLQ after 5 failed receives (substitute your DLQ ARN)
aws sqs set-queue-attributes --queue-url "$SOURCE_QUEUE_URL" \
  --attributes '{"RedrivePolicy":"{\"deadLetterTargetArn\":\"<dlq-arn>\",\"maxReceiveCount\":\"5\"}"}'

# Azure Service Bus: DLQ is automatic (subqueue per entity)
az servicebus queue update --name erp-integration \
  --namespace-name mybus --resource-group myrg \
  --max-delivery-count 5
```

Verify: aws sqs get-queue-attributes on the source queue shows the RedrivePolicy with the DLQ ARN and a maxReceiveCount of 5.

6. Set up DLQ monitoring and alerting

A DLQ without monitoring is a silent data loss risk: dead-lettered messages accumulate unseen until retention expires. [src1]

```python
from prometheus_client import Counter, Gauge

dlq_messages_total = Counter('erp_dlq_messages_total',
    'Total messages routed to DLQ', ['integration_id', 'error_type'])
dlq_depth = Gauge('erp_dlq_depth',
    'Current DLQ message count', ['queue_name'])
circuit_breaker_state = Gauge('erp_circuit_breaker_state',
    'Circuit breaker state (0=closed, 1=open, 2=half_open)',
    ['service_name'])
```

Verify: Query /metrics endpoint → all metric families visible. Trigger DLQ message → counter increments.
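The alert thresholds from the monitoring branch of the decision tree (depth > 100 alerts, depth > 1000 pages, DLQ ingestion above 1% of throughput alerts) can be evaluated in one place. An illustrative helper, not part of any monitoring library:

```python
def dlq_alert_level(dlq_depth, dlq_rate, total_rate):
    """Map DLQ metrics to a severity: 'ok', 'alert', or 'page'.

    dlq_rate and total_rate are messages/sec over the same window.
    """
    if dlq_depth > 1000:
        return "page"
    if dlq_depth > 100:
        return "alert"
    if total_rate > 0 and dlq_rate / total_rate > 0.01:
        return "alert"  # more than 1% of throughput is dead-lettering
    return "ok"
```

In practice this logic usually lives in alerting rules (e.g. Prometheus), but having one reference implementation keeps thresholds consistent across dashboards and pagers.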

Code Examples

Python: Retry handler with DLQ for Salesforce Bulk API

```python
# Input:  Records to upsert via Salesforce Bulk API 2.0
# Output: Success/failure counts, DLQ message IDs for failed batches
# (CSV batch upload and job close are omitted for brevity)

import random
import time
import uuid

import requests

class SalesforceBulkRetryHandler:
    def __init__(self, instance_url, access_token, dlq_client,
                 max_retries=5, initial_delay=1.0, max_delay=60.0):
        self.base_url = f"{instance_url}/services/data/v62.0"
        self.headers = {"Authorization": f"Bearer {access_token}",
                        "Content-Type": "application/json"}
        self.dlq = dlq_client
        self.max_retries = max_retries
        self.initial_delay = initial_delay
        self.max_delay = max_delay

    def upsert_with_retry(self, object_name, external_id_field, records):
        idem_key = str(uuid.uuid4())
        for attempt in range(self.max_retries + 1):
            try:
                job = requests.post(f"{self.base_url}/jobs/ingest",
                    headers=self.headers,
                    json={"object": object_name,
                          "externalIdFieldName": external_id_field,
                          "operation": "upsert", "contentType": "CSV"})
                if job.status_code == 429:
                    retry_after = int(job.headers.get("Retry-After", 0))
                    time.sleep(max(self._backoff(attempt), retry_after))
                    continue
                if job.status_code >= 500:
                    time.sleep(self._backoff(attempt))
                    continue
                if job.status_code >= 400:  # poison: 4xx will not self-heal
                    return self._dead_letter(records, idem_key, attempt,
                        f"HTTP {job.status_code}", job.text)
                return {"status": "success", "job_id": job.json()["id"]}
            except requests.exceptions.ConnectionError:
                if attempt < self.max_retries:
                    time.sleep(self._backoff(attempt))
        return self._dead_letter(records, idem_key, self.max_retries,
            "max_retries", "All retries exhausted")

    def _backoff(self, attempt):
        return random.uniform(
            0, min(self.initial_delay * (2 ** attempt), self.max_delay))

    def _dead_letter(self, records, idem_key, attempts, error_code, detail):
        msg_id = self.dlq.send({"original_message": records,
                                "idempotency_key": idem_key,
                                "error_code": error_code,
                                "error_detail": detail,
                                "attempt_count": attempts + 1})
        return {"status": "dead_lettered", "dlq_message_id": msg_id}
```

JavaScript/Node.js: Generic ERP retry middleware with circuit breaker

```javascript
// Input:  ERP API call function + message payload
// Output: Success response or DLQ envelope

const crypto = require('node:crypto'); // randomUUID (also a global in Node 19+)

async function retryWithDLQ(apiCall, message, {
  maxRetries = 5, initialDelay = 1000, maxDelay = 60000,
  factor = 2, circuitBreaker, dlqSend
} = {}) {
  const idempotencyKey = message.idempotencyKey || crypto.randomUUID();
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (!circuitBreaker.canExecute()) {
      return dlqSend({ originalMessage: message,
        errorCode: 'circuit_breaker_open', attempts: attempt,
        idempotencyKey, deadLetteredAt: new Date().toISOString() });
    }
    try {
      const result = await apiCall(message, idempotencyKey);
      circuitBreaker.recordSuccess();
      return { status: 'success', attempt, result };
    } catch (err) {
      if ([400, 403, 404, 422].includes(err.statusCode)) {
        return dlqSend({ originalMessage: message,
          errorCode: `http_${err.statusCode}`, attempts: attempt + 1,
          idempotencyKey, deadLetteredAt: new Date().toISOString() });
      }
      circuitBreaker.recordFailure();
      if (attempt < maxRetries) {
        const delay = Math.random() * Math.min(
          initialDelay * Math.pow(factor, attempt), maxDelay);
        await new Promise(r => setTimeout(r, delay));
      }
    }
  }
  return dlqSend({ originalMessage: message,
    errorCode: 'max_retries_exceeded', attempts: maxRetries + 1,
    idempotencyKey, deadLetteredAt: new Date().toISOString() });
}
```

CLI: DLQ monitoring and replay (AWS CLI / Azure CLI)

```shell
# Check DLQ depth (AWS SQS)
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789/erp-dlq \
  --attribute-names ApproximateNumberOfMessages

# Replay DLQ messages back to source queue
aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:us-east-1:123456789:erp-dlq \
  --destination-arn arn:aws:sqs:us-east-1:123456789:erp-integration

# Check DLQ count (Azure Service Bus)
az servicebus queue show --name erp-integration \
  --namespace-name mybus --resource-group myrg \
  --query 'countDetails.deadLetterMessageCount'
```

Data Mapping

DLQ Envelope Schema Reference

| Field | Type | Required | Description | Gotcha |
|---|---|---|---|---|
| original_message | Object | Yes | Complete original message payload | Must preserve all fields — partial messages break replay |
| error_code | String | Yes | Machine-readable error code | Standardize across all integrations |
| error_detail | String | Yes | Human-readable error description | Truncate to 4KB max |
| attempt_count | Integer | Yes | Number of attempts before dead-lettering | Starts at 1, not 0 |
| idempotency_key | String | Yes | Unique key for replay deduplication | UUID or composite business key |
| dead_lettered_at | ISO 8601 | Yes | When message was dead-lettered | Always UTC |
| source_queue | String | Yes | Origin queue/topic name | Required for routing replays |
| integration_id | String | Yes | Integration flow identifier | Maps to monitoring dashboards |
| correlation_id | String | Recommended | End-to-end trace ID | Enables cross-system debugging |
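A small builder keeps envelopes consistent with this schema. A sketch that enforces two of the gotchas (4 KB error_detail truncation, attempt_count starting at 1); the function name is illustrative:

```python
import uuid
from datetime import datetime, timezone

def build_dlq_envelope(original_message, error_code, error_detail,
                       attempt_count, source_queue, integration_id,
                       idempotency_key=None, correlation_id=None):
    """Assemble a DLQ envelope matching the schema above."""
    if attempt_count < 1:
        raise ValueError("attempt_count starts at 1, not 0")
    return {
        "original_message": original_message,   # full payload, never truncated
        "error_code": error_code,
        "error_detail": error_detail[:4096],    # truncate to 4 KB
        "attempt_count": attempt_count,
        "idempotency_key": idempotency_key or str(uuid.uuid4()),
        "dead_lettered_at": datetime.now(timezone.utc).isoformat(),  # always UTC
        "source_queue": source_queue,
        "integration_id": integration_id,
        "correlation_id": correlation_id,
    }
```

Routing every dead-letter through one builder is the simplest way to guarantee that no producer ships a partial envelope.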

Data Type Gotchas

Error Handling & Failure Points

Common Error Codes

| Code | Meaning | Cause | Resolution |
|---|---|---|---|
| HTTP 429 | Rate limit exceeded | Too many API calls in window | Exponential backoff, respect Retry-After header |
| HTTP 503 | Service unavailable | ERP maintenance or overload | Exponential backoff, check ERP status page |
| ETIMEDOUT | Connection timeout | Network issue or slow response | Increase timeout, check firewall rules |
| ECONNREFUSED | Connection refused | Service down or port blocked | Open circuit breaker, alert operations |
| MaxDeliveryCountExceeded | DLQ threshold reached | Message failed N times | Review error, fix root cause, replay |
| UNABLE_TO_LOCK_ROW | Record lock conflict | Concurrent update to same record | Retry with jitter, implement locking |
| INVALID_SESSION_ID | Token expired/invalid | OAuth token expired | Refresh token, retry once |

Failure Points in Production

Anti-Patterns

Wrong: Retrying all errors indiscriminately

```python
# BAD — retries non-retryable errors, wasting time and API quota
def send_to_erp(message, max_retries=5):
    for attempt in range(max_retries):
        try:
            return erp_api.post(message)
        except Exception as e:
            time.sleep(2 ** attempt)  # Retries 400, 403, 422 too
    raise Exception("All retries failed")
```

Correct: Classify errors before retrying

```python
# GOOD — only retries transient errors, DLQs poison messages
def send_to_erp(message, max_retries=5):
    for attempt in range(max_retries):
        try:
            return erp_api.post(message)
        except ERPError as e:
            if e.status_code in (400, 403, 404, 422):
                dlq.send(message, error=str(e))  # Don't retry
                return None
            if attempt < max_retries - 1:
                time.sleep(exponential_backoff_with_jitter(attempt))
    dlq.send(message, error="Retries exhausted")
```

Wrong: Fixed-interval retry (no backoff)

```python
# BAD — hammers ERP API at constant rate during outage
for attempt in range(5):
    try:
        response = erp_api.post(message); break
    except Exception:
        time.sleep(5)  # Same 5s delay every time
```

Correct: Exponential backoff with jitter

```python
# GOOD — spreads retry load, respects rate limits
for attempt in range(5):
    try:
        response = erp_api.post(message); break
    except TransientError:
        delay = min(1.0 * (2 ** attempt), 60.0)
        time.sleep(random.uniform(0, delay))  # Full jitter
```

Wrong: DLQ without envelope metadata

```python
# BAD — raw message in DLQ, no context for debugging
dlq.send(original_message)
# Later: "Why is this in the DLQ? When did it fail?"
```

Correct: DLQ with rich diagnostic envelope

```python
# GOOD — complete context for debugging and safe replay
dlq.send({
    "original_message": original_message,
    "error_code": "http_429",
    "error_detail": "Rate limit exceeded on Salesforce Bulk API",
    "attempt_count": 6,
    "idempotency_key": "order-12345-upsert-v2",
    "dead_lettered_at": "2026-03-01T14:30:00Z",
    "source_queue": "salesforce-bulk-ingest",
    "correlation_id": "trace-abc-123"
})
```

Common Pitfalls

Diagnostic Commands

```shell
# Check DLQ depth (AWS SQS)
aws sqs get-queue-attributes --queue-url $DLQ_URL \
  --attribute-names ApproximateNumberOfMessages

# Check DLQ depth (Azure Service Bus)
az servicebus queue show --name erp-integration \
  --namespace-name $NAMESPACE --resource-group $RG \
  --query 'countDetails.deadLetterMessageCount'

# Check circuit breaker state (Redis)
redis-cli GET "circuit_breaker:salesforce_api:state"

# Monitor retry rate (Prometheus)
# rate(erp_retry_attempts_total[5m]) / rate(erp_messages_processed_total[5m])

# Replay DLQ messages (AWS SQS)
aws sqs start-message-move-task \
  --source-arn $DLQ_ARN \
  --destination-arn $SOURCE_QUEUE_ARN \
  --max-number-of-messages-per-second 10
```
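Before replaying, deduplicate on idempotency_key so a message that was dead-lettered several times is not resubmitted several times. A platform-neutral sketch of that consumer-side step (fetching and acknowledging messages is broker-specific and omitted):

```python
def dedupe_for_replay(envelopes):
    """Collapse DLQ envelopes to one per idempotency_key,
    keeping the most recently dead-lettered copy of each."""
    latest = {}
    for env in envelopes:
        key = env["idempotency_key"]
        # ISO 8601 UTC timestamps compare correctly as strings
        if key not in latest or \
                env["dead_lettered_at"] > latest[key]["dead_lettered_at"]:
            latest[key] = env
    return list(latest.values())
```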

Version History & Compatibility

| Pattern/Platform | Version | Date | Status | Key Changes |
|---|---|---|---|---|
| AWS SQS DLQ Redrive | GA | 2023-07 | Current | Native message-move-task for DLQ replay |
| Azure Service Bus DLQ | GA | 2025-05 | Current | Enhanced dead-letter reason headers |
| Apache Kafka Error Topics | Convention | Ongoing | Current | No native DLQ — error topic pattern |
| RabbitMQ Dead Letter Exchange | GA (3.x+) | 2024-09 | Current | x-delivery-limit in 3.12+ (quorum queues) |
| MuleSoft Error Handling | 4.x | 2025-01 | Current | Enhanced DLQ connector with retry policies |

When to Use / When Not to Use

| Use When | Don't Use When | Use Instead |
|---|---|---|
| Zero-loss requirement (financial transactions, orders) | Fire-and-forget analytics events | Simple logging + batch reconciliation |
| ERP API has rate limits or intermittent outages | Target has 99.99% availability SLA | Direct API call with try/catch |
| Multi-system integration (multiple failure points) | Single-system CRUD operations | Database transaction with rollback |
| Asynchronous processing (message queues, events) | Synchronous user-facing API calls | HTTP retry middleware (Polly, resilience4j) |
| Batch operations with partial failures | All-or-nothing transactional requirements | Saga pattern with compensation |
| Long-running integration jobs (>30s per operation) | Sub-second API calls | Inline retry with timeout |

Cross-System Comparison

| Capability | AWS SQS | Azure Service Bus | Apache Kafka | RabbitMQ | MuleSoft |
|---|---|---|---|---|---|
| Native DLQ | Yes (redrive policy) | Yes (subqueue) | No (error topic) | Yes (DLX) | Yes (connector) |
| Max retention | 14 days | Unlimited (Premium) | Topic config | TTL-based | Platform storage |
| DLQ replay | Native (move-task) | Manual | Consume from error topic | Manual | UI-based |
| Max delivery count | 1-1000 | 1-2000 (default 10) | Consumer-side | x-delivery-limit | Configurable |
| Circuit breaker | Manual | Manual | Manual | Manual | Built-in |
| Message ordering | FIFO (optional) | Sessions (optional) | Partition ordering | Per-queue | Flow ordering |
| Dead-letter reason | Application-set | System headers + custom | Application-set | Application-set | Error type metadata |
| Max message size | 256KB (2GB w/ S3) | 256KB / 100MB | 1MB default | 128MB default | Platform limit |
| Cost model | Per-message | Per-operation + storage | Self-hosted infra | Self-hosted infra | License-based |

Important Caveats

Related Units