Poison Message Handling: Triage and Replay of Failed ERP Integration Messages

How do you handle poison messages - triage and replay of failed ERP integration messages?

TL;DR

Bottom line: A poison message is any message that repeatedly fails processing and blocks the queue. Detect within 3-5 delivery attempts, classify the error (transient vs permanent vs data quality), route to a typed DLQ, then triage via automated classification before manual review. Never replay without fixing the root cause and verifying idempotency. [src1, src2]
Key limit: DLQ retention is finite on most platforms — AWS SQS max 14 days (original enqueue timestamp preserved), Azure Service Bus unlimited on Premium, Kafka depends on topic retention config. Unprocessed poison messages silently expire. [src3, src4]
Watch out for: Infinite retry loops are the #1 anti-pattern — a message that fails due to a schema violation will fail identically on every retry, consuming processing capacity and blocking healthy messages. Set maxDeliveryCount/maxReceiveCount to 3-5, not 100. [src1, src7]
Best for: Any ERP integration where failed messages must be triaged, diagnosed, fixed, and replayed — order-to-cash, AP automation, inventory sync, payroll feeds, intercompany settlement.
Authentication: N/A (pattern-level card). See system-specific cards for broker/iPaaS authentication details.

System Profile

This card covers poison message handling as a cross-platform architecture pattern for ERP integrations. It focuses specifically on what happens after a message exhausts its retry budget and lands in a dead letter queue — detection, classification, triage, remediation, and replay. For retry strategies that determine when a message becomes a poison message (exponential backoff, circuit breakers), see the companion card on error handling and DLQ fundamentals.

The patterns apply across all major message brokers (AWS SQS, Azure Service Bus, Apache Kafka, RabbitMQ) and iPaaS platforms (MuleSoft Anypoint MQ, Boomi Atom Queue, Workato, Celigo). The specific ERP system at either end (Salesforce, SAP, Oracle, NetSuite, Dynamics 365, Workday) does not change the poison message handling approach — it changes the error codes and data mapping fixes needed during remediation.

System	Role	API Surface	Direction
Source ERP (e.g., Salesforce)	Event producer — generates change events or outbound messages	REST, Platform Events, CDC	Outbound
Message Broker (e.g., AWS SQS, Kafka)	Message transport + DLQ infrastructure	SQS API, Kafka Protocol	Transport
iPaaS (e.g., MuleSoft, Boomi)	Integration orchestrator — message transformation and routing	Anypoint MQ, Atom Queue	Orchestrator
Target ERP (e.g., SAP S/4HANA)	Message consumer — processes inbound records	OData, BAPI, IDoc	Inbound

API Surfaces & Capabilities

Poison message handling capabilities vary significantly across platforms. The key differentiators are automatic DLQ routing, DLQ inspection APIs, and native replay/redrive support: [src3, src4, src5]

Platform	DLQ Type	Auto-Route	Max Delivery Count	Inspection API	Native Replay	DLQ Retention
AWS SQS	Separate queue	Yes (redrive policy)	Configurable (1-1000)	ReceiveMessage on DLQ	Yes (DLQ Redrive API)	Same as source (max 14 days)
Azure Service Bus	Sub-queue ($deadletterqueue)	Yes (MaxDeliveryCount)	Default 10, configurable	Peek/receive on sub-queue	Manual (receive + re-send)	Unlimited (Premium)
Apache Kafka	Separate topic (DLT)	Application-level	Application-level	Consumer on DLT topic	Application-level	Topic retention config
RabbitMQ	Separate queue (x-dead-letter-exchange)	Yes (x-delivery-limit)	Configurable via quorum queues	AMQP consume on DLQ	Manual (consume + re-publish)	Queue TTL config
MuleSoft Anypoint MQ	Separate queue	Yes (max delivery attempts)	Configurable	Anypoint MQ API	Yes (REM)	7 days default
Boomi Atom Queue	Built-in DLQ	Yes (after 7 attempts)	7 (6 retries + original)	Queue Management panel	Yes (resend dead letters)	Atom storage lifecycle

Rate Limits & Quotas

DLQ Throughput Limits

Platform	Replay Rate Limit	Concurrent Replays	Max DLQ Size	Notes
AWS SQS	System-optimized or custom max velocity	1 active redrive task per source queue	No hard limit (cost-based)	Redrive task max duration: 36 hours; max 100 active tasks per account [src4]
Azure Service Bus	No built-in rate limit on replay	N/A (manual process)	Entity size limit (Premium: 80 GB)	No automatic cleanup — messages persist until explicitly completed [src3]
Apache Kafka	Consumer throughput	Consumer group parallelism	Topic retention (size or time)	No native redrive — must implement consumer that reads DLT and produces to main topic [src5]
MuleSoft Anypoint MQ	API rate limits apply	Per-queue basis	120,000 in-flight messages	REM feature provides managed replay with visibility [src6]
Boomi	Queue throughput	Per-atom basis	Atom storage capacity	Dead letters visible in Queue Management panel; batch resend available

Monitoring Thresholds

Metric	Target	Alert When
DLQ ingestion rate	< 1% of incoming throughput	Sustained > 1% for 15 minutes [src1]
DLQ backlog (depth)	< 1,000 messages	Growing for > 1 hour without triage [src1]
Oldest message age in DLQ	< 24 hours for critical streams	Any message > 24 hours untriaged [src1]
Replay success rate	> 95%	Below 90% on any replay batch [src1]
Poison ratio (DLQ / total)	< 5%	Above 5% sustained [src1]
Time to first triage	< 4h (critical), < 24h (standard)	Exceeding SLA threshold [src1]
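
The thresholds above can be checked mechanically. The following is a minimal sketch, not a definitive implementation: the `DlqMetrics` fields and the `evaluate_dlq_health` function are hypothetical names, and the growth-based backlog alert is left out because it needs a time series rather than a single snapshot.

```python
from dataclasses import dataclass


@dataclass
class DlqMetrics:
    """Snapshot of DLQ metrics (hypothetical field names)."""
    dlq_messages_15m: int        # messages dead-lettered in the last 15 minutes
    incoming_messages_15m: int   # messages received by the main queue in that window
    oldest_message_age_s: float  # age of the oldest DLQ message, in seconds
    replay_success_rate: float   # fraction of the last replay batch that succeeded


def evaluate_dlq_health(m: DlqMetrics) -> list:
    """Return one alert string per threshold from the table that is breached."""
    alerts = []
    if m.incoming_messages_15m > 0:
        rate = m.dlq_messages_15m / m.incoming_messages_15m
        if rate > 0.01:  # DLQ ingestion rate above 1% over 15 minutes
            alerts.append(f"dlq_ingestion_rate={rate:.2%}")
    if m.oldest_message_age_s > 24 * 3600:  # untriaged for more than 24 hours
        alerts.append(f"oldest_message_age_s={m.oldest_message_age_s:.0f}")
    if m.replay_success_rate < 0.90:  # replay batch below 90% success
        alerts.append(f"replay_success_rate={m.replay_success_rate:.2%}")
    return alerts
```

Returning a list of breached thresholds, rather than a single boolean, lets the caller map each alert to its own severity.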

Authentication

N/A — pattern-level card. Authentication is handled at the broker/iPaaS layer:

Platform	Auth Method	Notes
AWS SQS	IAM roles / policies	DLQ access requires sqs:ReceiveMessage + sqs:DeleteMessage + sqs:SendMessage on both source and DLQ
Azure Service Bus	SAS or Azure AD (RBAC)	DLQ is a sub-queue — same connection string, append /$deadletterqueue [src3]
Apache Kafka	SASL/SCRAM or mTLS, with ACLs for authorization	DLT is a regular topic; the triage consumer group needs its own read ACL [src5]
MuleSoft	Anypoint Platform credentials	DLQ management requires Manage Queues permission [src6]

Constraints

Detection threshold: Set maxDeliveryCount / maxReceiveCount to 3-5. Below 3 sends transient failures to DLQ prematurely; above 5 wastes processing capacity on truly unrecoverable messages. [src1, src7]
Retention is finite: On standard queues, AWS SQS DLQ messages retain their original enqueue timestamp, so a message with 14-day retention that spent 10 days in the source queue has only 4 days left in the DLQ. FIFO queues reset the timestamp when the message moves to the DLQ. [src4]
Replay ordering: Replaying messages out of order creates referential integrity violations — parent records must be replayed before child records.
Idempotency is mandatory for replay: Every replayed message must carry an idempotency key. Without it, replay creates duplicate records in the target ERP. [src1]
DLQ-of-DLQ is an anti-pattern: If your DLQ consumer fails, do NOT route to a second DLQ. Log, alert, and stop processing. [src1]
No cross-region DLQ on MuleSoft: Anypoint MQ requires DLQ and source queue in the same region. [src6]
Azure DLQ has no TTL: Messages in Azure Service Bus DLQ persist indefinitely. Without a purge process, DLQ grows unbounded. [src3]
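
The retention arithmetic from the SQS constraint above can be made explicit. This is a sketch under stated assumptions: `remaining_dlq_time` is a hypothetical helper, it assumes the standard-queue behavior where expiry is measured from the original enqueue time, and it takes the `SentTimestamp` attribute (epoch milliseconds) as input.

```python
from datetime import datetime, timedelta, timezone


def remaining_dlq_time(sent_timestamp_ms, dlq_retention_s, now=None):
    """Time left before a dead-lettered SQS message expires.

    sent_timestamp_ms: the message's SentTimestamp attribute (epoch milliseconds).
    dlq_retention_s:   the DLQ's MessageRetentionPeriod in seconds.
    """
    now = now or datetime.now(timezone.utc)
    enqueued = datetime.fromtimestamp(sent_timestamp_ms / 1000, tz=timezone.utc)
    expires_at = enqueued + timedelta(seconds=dlq_retention_s)
    # Never return a negative duration: zero means "already expired".
    return max(expires_at - now, timedelta(0))
```

With a 14-day retention (1209600 seconds) and a message enqueued 10 days ago, the function returns 4 days, matching the example in the constraint.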

Integration Pattern Decision Tree

START — Message has failed processing and landed in DLQ
├── Step 1: Classify the failure
│   ├── Transient error? (timeout, 429, 503, network error)
│   │   ├── YES → Should NOT be in DLQ — investigate why retries exhausted
│   │   │   ├── maxDeliveryCount too low? → Increase to 3-5
│   │   │   ├── Backoff delay too short? → Increase max backoff
│   │   │   └── Upstream system down for extended period? → Expected; replay now
│   │   └── Action: REPLAY IMMEDIATELY (system has recovered)
│   ├── Data quality error? (schema violation, missing field, invalid reference)
│   │   ├── Can the message be fixed automatically?
│   │   │   ├── YES → Auto-remediate → REPLAY WITH IDEMPOTENCY CHECK
│   │   │   └── NO → Route to manual review queue
│   │   └── Action: FIX DATA → REPLAY WITH IDEMPOTENCY CHECK
│   ├── Permanent error? (invalid endpoint, auth failure, business rule violation)
│   │   ├── Code/config bug? → Fix, deploy → REPLAY ENTIRE BATCH
│   │   └── Business rule rejection? → Fix target state or DISCARD + ALERT
│   └── Unknown error? → QUARANTINE → MANUAL TRIAGE
├── Step 2: Remediate
│   ├── Automated fix possible? → Apply transform → validate → replay
│   └── Manual fix needed? → Alert ops → ticket → SLA clock starts
├── Step 3: Replay
│   ├── Verify idempotency key present
│   ├── Verify ordering (parent before child)
│   ├── Replay to original queue (NOT directly to consumer)
│   ├── Monitor replay success rate
│   └── If fails again → QUARANTINE (no infinite loop)
└── Step 4: Post-mortem
    ├── New failure category? → Add classifier rule
    ├── Recurring pattern? → Fix upstream validation
    └── Update monitoring thresholds
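
Step 1 and the replay guard from Step 3 of the tree can be expressed as a small pure function. The sketch below is illustrative only: the `Action` names, the `decide_action` signature, and the default of three replay attempts are assumptions chosen to mirror the tree, not part of any platform API.

```python
from enum import Enum


class Action(Enum):
    REPLAY = "replay"
    AUTO_FIX_THEN_REPLAY = "auto_fix_then_replay"
    MANUAL_REVIEW = "manual_review"
    FIX_AND_REPLAY_BATCH = "fix_and_replay_batch"
    FIX_TARGET_OR_DISCARD = "fix_target_or_discard"
    QUARANTINE = "quarantine"


def decide_action(category, *, auto_fixable=False, is_code_or_config_bug=False,
                  replay_attempts=0, max_replay_attempts=3):
    """Map a classified DLQ message to the action the decision tree prescribes."""
    # Step 3 guard: a message that keeps failing on replay is quarantined.
    if replay_attempts >= max_replay_attempts:
        return Action.QUARANTINE
    if category == "transient":
        return Action.REPLAY
    if category == "data_quality":
        return Action.AUTO_FIX_THEN_REPLAY if auto_fixable else Action.MANUAL_REVIEW
    if category == "permanent":
        # Code/config bug: fix, deploy, replay the batch.
        # Business rule rejection: fix target state or discard and alert.
        return (Action.FIX_AND_REPLAY_BATCH if is_code_or_config_bug
                else Action.FIX_TARGET_OR_DISCARD)
    return Action.QUARANTINE  # unknown category: manual triage
```

Keeping the decision free of I/O makes each branch of the tree unit-testable before any broker is involved.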

Quick Reference

Scenario	Action	Replay?	Idempotency?	Alert Level
Schema violation (missing field)	Fix data, validate, replay	Yes	Yes	Warning
Invalid foreign key reference	Create parent first, then replay	Yes (ordered)	Yes	Warning
Rate limit exhaustion (429)	Should not be in DLQ — increase retry budget	Yes (immediate)	Yes	Info
Authentication failure (401/403)	Fix credentials, replay batch	Yes	Yes	Critical
Business rule violation	Fix target ERP state or discard	Conditional	Yes	Warning
Malformed payload (unparseable)	Discard — cannot be fixed	No	N/A	Error
Target system decommissioned	Discard + archive for audit	No	N/A	Critical
Duplicate record conflict (409)	Already processed — safe to discard	No	N/A	Info
Cascading failure (parent failed)	Fix parent first, replay children in order	Yes (ordered)	Yes	Warning
Unknown/unclassified error	Quarantine for investigation	Pending triage	Yes	Error

Step-by-Step Integration Guide

1. Classify errors at the consumer level

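Before a message reaches the DLQ, classify the error type in your consumer. When the consumer publishes the failed message to the DLQ itself, attach this metadata so it travels with the message and determines the triage path. Broker-driven dead-lettering (for example an SQS redrive policy) moves the message unchanged, so in that case store the classification separately, keyed by message ID. [src1, src7]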

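from datetime import datetime, timezone

# Placeholder exception types: substitute the ones raised by your
# validation layer and ERP client library.
class ValidationError(Exception): ...
class SchemaError(Exception): ...
class AuthenticationError(Exception): ...

def classify_error(exception, message):
    """Classify processing errors to determine DLQ triage path."""
    error_info = {
        "error_class": type(exception).__name__,
        "error_message": str(exception)[:500],  # truncate to keep metadata small
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message_id": message.get("message_id"),
        "attempt_count": message.get("approximate_receive_count", 0),
    }
    if isinstance(exception, (TimeoutError, ConnectionError)):
        # Likely to succeed once the dependency recovers
        error_info["category"] = "transient"
        error_info["retry_eligible"] = True
    elif isinstance(exception, (ValidationError, SchemaError)):
        # Fails identically on every retry until the data is fixed
        error_info["category"] = "data_quality"
        error_info["retry_eligible"] = False
    elif isinstance(exception, (AuthenticationError, PermissionError)):
        error_info["category"] = "permanent"
        error_info["retry_eligible"] = False
    else:
        # Unrecognized failures go to manual triage rather than retry
        error_info["category"] = "unknown"
        error_info["retry_eligible"] = False
    return error_info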

Verify: Check that DLQ messages carry the error_category attribute → confirms classification is running.
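
For SQS, the classification result has to be converted into message attributes before the consumer sends the message to the DLQ. The helper below is a sketch: `to_sqs_message_attributes` is a hypothetical name, and the attribute keys other than `error_category` (which the triage consumer in step 3 reads) are illustrative choices.

```python
def to_sqs_message_attributes(error_info):
    """Convert a classify_error() result into SQS MessageAttributes.

    SQS attributes are typed; every value is passed as a string in
    StringValue, including those declared with DataType "Number".
    """
    return {
        "error_category": {
            "DataType": "String",
            "StringValue": error_info["category"],
        },
        "error_class": {
            "DataType": "String",
            "StringValue": error_info["error_class"],
        },
        "retry_eligible": {
            "DataType": "String",
            "StringValue": str(error_info["retry_eligible"]).lower(),
        },
        "attempt_count": {
            "DataType": "Number",
            "StringValue": str(error_info.get("attempt_count", 0)),
        },
    }
```

The returned dictionary can be passed as the `MessageAttributes` argument of boto3's `send_message` when the consumer routes a failed message to the DLQ explicitly.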

2. Configure platform-specific DLQ routing

Set up automatic dead-letter routing with appropriate delivery count thresholds. [src3, src4]

# AWS SQS — Create DLQ and attach redrive policy
aws sqs create-queue --queue-name erp-orders-dlq \
  --attributes '{"MessageRetentionPeriod":"1209600"}'

aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789/erp-orders \
  --attributes '{
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789:erp-orders-dlq\",\"maxReceiveCount\":\"5\"}"
  }'

# Azure Service Bus — Set MaxDeliveryCount (recommend 5 for ERP)
az servicebus queue update \
  --resource-group erp-integration \
  --namespace-name erp-bus \
  --name erp-orders \
  --max-delivery-count 5

Verify: Send a message that always fails → confirm it appears in DLQ after 5 attempts.
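
The nested, escaped JSON in the SQS redrive policy is easy to get wrong by hand. As a sketch, the policy can be built programmatically instead; `build_redrive_attributes` is a hypothetical helper and the ARN in the usage note is the example value from the CLI command above.

```python
import json


def build_redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the Attributes dict for attaching a DLQ to an SQS source queue."""
    if not 1 <= max_receive_count <= 1000:
        raise ValueError("maxReceiveCount must be between 1 and 1000")
    # RedrivePolicy is itself a JSON document stored as a string attribute.
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    return {"RedrivePolicy": json.dumps(policy)}
```

The result can be passed as `Attributes` to boto3's `sqs.set_queue_attributes(QueueUrl=..., Attributes=...)`, for example with `build_redrive_attributes("arn:aws:sqs:us-east-1:123456789:erp-orders-dlq")`.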

3. Build the DLQ triage consumer

Create a dedicated consumer that reads from the DLQ, classifies messages, and routes them through the triage workflow. [src1, src7]

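import json
from datetime import datetime

import boto3

# quarantine_message, attempt_auto_fix and route_to_manual_review are
# application-specific helpers that are not defined in this card.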

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789/erp-orders-dlq"
SOURCE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/erp-orders"

def triage_dlq_messages(max_messages=10):
    """Read DLQ, classify, and route for remediation or replay."""
    response = sqs.receive_message(
        QueueUrl=DLQ_URL,
        MaxNumberOfMessages=max_messages,
        MessageAttributeNames=["All"],
        AttributeNames=["All"],
    )
    for msg in response.get("Messages", []):
        error_category = msg.get("MessageAttributes", {}).get(
            "error_category", {}).get("StringValue", "unknown")
        receive_count = int(msg["Attributes"].get("ApproximateReceiveCount", 0))

        if receive_count > 3:  # Prevent infinite triage loops
            quarantine_message(msg, reason="triage_loop_detected")
            continue

        if error_category == "transient":
            replay_message(msg, json.loads(msg["Body"]), SOURCE_URL)
        elif error_category == "data_quality":
            attempt_auto_fix(msg, json.loads(msg["Body"]))
        elif error_category == "permanent":
            route_to_manual_review(msg, json.loads(msg["Body"]))
        else:
            quarantine_message(msg, reason="unclassified")

Verify: aws sqs get-queue-attributes --queue-url $DLQ_URL --attribute-names ApproximateNumberOfMessages → count decreasing as triage runs.

4. Implement safe replay with idempotency check

Replay messages back to the source queue with idempotency verification. [src1, src4]

def replay_message(dlq_msg, body, target_queue_url):
    """Replay a DLQ message with idempotency safety."""
    idempotency_key = body.get("idempotency_key")
    if not idempotency_key:
        quarantine_message(dlq_msg, reason="missing_idempotency_key")
        return

    if is_already_processed(idempotency_key):
        sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=dlq_msg["ReceiptHandle"])
        return  # already handled

    body["_replay"] = {
        "replayed_at": datetime.utcnow().isoformat(),
        "replay_attempt": body.get("_replay", {}).get("replay_attempt", 0) + 1,
    }
    if body["_replay"]["replay_attempt"] > 3:
        quarantine_message(dlq_msg, reason="max_replay_attempts_exceeded")
        return

    sqs.send_message(
        QueueUrl=target_queue_url,
        MessageBody=json.dumps(body),
        MessageAttributes={
            "idempotency_key": {"DataType": "String", "StringValue": idempotency_key},
            "is_replay": {"DataType": "String", "StringValue": "true"},
        },
    )
    sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=dlq_msg["ReceiptHandle"])

Verify: Replay a known-good message → confirm no duplicate in target ERP.
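
The replay code relies on an `is_already_processed` lookup that the card does not define. One possible shape, shown here as a sketch using SQLite from the standard library, is a table keyed by idempotency key; the `ProcessedKeyStore` class and its schema are assumptions, and a production system would more likely use the integration's shared database or a key-value store.

```python
import sqlite3


class ProcessedKeyStore:
    """Minimal idempotency-key store (illustrative, single-process only)."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS processed "
            "(idempotency_key TEXT PRIMARY KEY)"
        )

    def is_already_processed(self, key):
        row = self._db.execute(
            "SELECT 1 FROM processed WHERE idempotency_key = ?", (key,)
        ).fetchone()
        return row is not None

    def mark_as_processed(self, key):
        """Record the key. Returns True if it was new, False if already present."""
        with self._db:  # commits on success, rolls back on error
            cur = self._db.execute(
                "INSERT OR IGNORE INTO processed (idempotency_key) VALUES (?)",
                (key,),
            )
        return cur.rowcount == 1
```

The primary-key constraint makes `mark_as_processed` safe to call twice for the same key, which matters when a replay races with a late acknowledgement of the original delivery.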

Code Examples

Python: DLQ Depth Monitoring with CloudWatch Alerting

# Input:  DLQ queue name, alert threshold, SNS topic ARN
# Output: CloudWatch alarms for DLQ depth and message age

import boto3
cloudwatch = boto3.client("cloudwatch")

def create_dlq_depth_alarm(queue_name, threshold=100, sns_topic_arn=None):
    cloudwatch.put_metric_alarm(
        AlarmName=f"dlq-depth-{queue_name}",
        AlarmDescription=f"DLQ {queue_name} has > {threshold} messages",
        Namespace="AWS/SQS",
        MetricName="ApproximateNumberOfMessagesVisible",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        Statistic="Maximum",
        Period=300, EvaluationPeriods=2,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn] if sns_topic_arn else [],
    )

def create_dlq_age_alarm(queue_name, max_age_seconds=86400, sns_topic_arn=None):
    cloudwatch.put_metric_alarm(
        AlarmName=f"dlq-age-{queue_name}",
        AlarmDescription=f"DLQ {queue_name} has messages older than {max_age_seconds}s",
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        Statistic="Maximum",
        Period=300, EvaluationPeriods=1,
        Threshold=max_age_seconds,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn] if sns_topic_arn else [],
    )

JavaScript/Node.js: Kafka DLT Consumer with Triage Logic

// Input:  Kafka connection config, DLT topic name
// Output: Triage consumer that classifies and routes failed messages

const { Kafka } = require("kafkajs");
const kafka = new Kafka({ brokers: ["broker:9092"] });
const consumer = kafka.consumer({ groupId: "dlq-triage" });
const producer = kafka.producer({ idempotent: true });

async function runDLTTriageConsumer(dltTopic, mainTopic) {
  await consumer.connect();
  await producer.connect();
  await consumer.subscribe({ topic: dltTopic, fromBeginning: false });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const headers = message.headers || {};
      const errorType = headers["error-type"]?.toString() || "unknown";
      const retryCount = parseInt(headers["retry-count"]?.toString() || "0", 10);
      const idempotencyKey = headers["idempotency-key"]?.toString();

      if (!idempotencyKey) {
        await logToQuarantine(message, "missing_idempotency_key");
        return;
      }
      if (retryCount > 3) {
        await logToQuarantine(message, "max_retries_exceeded");
        return;
      }
      switch (errorType) {
        case "transient":
          await producer.send({ topic: mainTopic, messages: [{
            key: message.key, value: message.value,
            headers: { ...headers, "is-replay": "true",
              "retry-count": String(retryCount + 1) },
          }] });
          break;
        case "data_quality":
          await routeToRemediationTopic(message);
          break;
        default:
          await logToQuarantine(message, errorType);
          await alertOpsTeam(message, errorType);
      }
    },
  });
}

cURL: Azure Service Bus DLQ Inspection

# Input:  Service Bus namespace, queue name, SAS token
# Output: Peek at dead-lettered messages for triage

SAS_TOKEN="SharedAccessSignature sr=..."

# Peek messages in DLQ (non-destructive)
curl -X POST \
  "https://erp-bus.servicebus.windows.net/erp-orders/\$deadletterqueue/messages/head?timeout=30" \
  -H "Authorization: $SAS_TOKEN"

# Complete (delete) a DLQ message after successful triage
curl -X DELETE \
  "https://erp-bus.servicebus.windows.net/erp-orders/\$deadletterqueue/messages/{messageId}/{lockToken}" \
  -H "Authorization: $SAS_TOKEN"

Data Mapping

Poison Message Context Preservation

When a message moves to the DLQ, critical context must be preserved for effective triage and replay:

Field	Purpose	Required for Replay?	Notes
original_message_id	Trace back to original message	Yes	Idempotency dedup and audit trail
idempotency_key	Prevent duplicate processing	Yes	Without this, replay creates duplicates
error_category	Triage classification	Yes	Determines triage path
error_message	Root cause description	No (helpful)	Truncate to 500 chars
source_queue	Original queue/topic	Yes	Required for replay routing
original_timestamp	When first produced	Yes	Detect aging and retention deadline
attempt_count	Delivery attempt count	Yes	Helps tune retry budget
correlation_id	Links related messages	Conditional	Required for ordered replay
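
The context fields in the table can be enforced before a replay is attempted. The following sketch assumes the DLQ payload is available as a dictionary with the field names from the table; `DlqEnvelope`, `REQUIRED_FOR_REPLAY` and `missing_replay_fields` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# Fields the table marks as required for replay.
REQUIRED_FOR_REPLAY = (
    "original_message_id", "idempotency_key", "error_category",
    "source_queue", "original_timestamp", "attempt_count",
)


@dataclass
class DlqEnvelope:
    original_message_id: str
    idempotency_key: str
    error_category: str
    source_queue: str
    original_timestamp: str
    attempt_count: int
    error_message: str = ""
    correlation_id: Optional[str] = None  # needed only for ordered replay

    def __post_init__(self):
        # The table recommends truncating the error text to 500 characters.
        self.error_message = self.error_message[:500]


def missing_replay_fields(raw):
    """Return the required field names that are absent or empty in raw."""
    return [name for name in REQUIRED_FOR_REPLAY if raw.get(name) in (None, "")]
```

A message for which `missing_replay_fields` returns a non-empty list is a candidate for quarantine rather than replay.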

Platform-Specific DLQ Metadata

Platform	Auto-Captured Metadata	Custom Metadata	Access Pattern
AWS SQS	ApproximateReceiveCount, SentTimestamp	MessageAttributes (up to 10)	ReceiveMessage with AttributeNames=All [src4]
Azure Service Bus	DeliveryCount, EnqueuedTimeUtc, DeadLetterReason	Custom properties (unlimited)	Peek/receive on $deadletterqueue [src3]
Apache Kafka	Offset, partition, timestamp	Headers (key-value byte arrays)	Consumer on DLT topic [src5]
MuleSoft Anypoint MQ	deliveryCount, destination	Custom properties	Anypoint MQ API or REM console [src6]

Error Handling & Failure Points

Common Error Codes That Create Poison Messages

Code	Meaning	Source System	Triage Action
400	Payload validation failure	Target ERP API	Data quality fix → replay
404	Referenced record does not exist	Target ERP API	Create parent → replay children
409	Duplicate record — already exists	Target ERP API	Safe to discard
422	Business rule violation	Target ERP API	Fix target state → replay
INVALID_FIELD	Field not writable	Salesforce API	Update field mapping → replay
UNABLE_TO_LOCK_ROW	Record locked	Salesforce API	Transient — increase retry budget
GOVERNANCE_LIMIT	SuiteScript governance exhausted	NetSuite	Reduce batch size, replay
-ERR_PARSE	Malformed XML/JSON	Any consumer	Permanent — discard + log
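
The error-code table lends itself to a lookup that the triage consumer can consult. In the sketch below the triage actions come from the table, while the category assigned to each code and the names `TRIAGE_BY_CODE` and `triage_for` are assumptions made for illustration.

```python
# code -> (category, triage action). Categories are illustrative assumptions;
# actions follow the error-code table.
TRIAGE_BY_CODE = {
    "400": ("data_quality", "fix data, then replay"),
    "404": ("data_quality", "create parent, then replay children"),
    "409": ("duplicate", "safe to discard"),
    "422": ("permanent", "fix target state, then replay"),
    "INVALID_FIELD": ("permanent", "update field mapping, then replay"),
    "UNABLE_TO_LOCK_ROW": ("transient", "increase retry budget"),
    "GOVERNANCE_LIMIT": ("transient", "reduce batch size, then replay"),
    "-ERR_PARSE": ("permanent", "discard and log"),
}


def triage_for(code):
    """Look up the triage path for an error code; unknown codes are quarantined."""
    return TRIAGE_BY_CODE.get(str(code), ("unknown", "quarantine for manual triage"))
```

Accepting both integers and strings (`triage_for(409)` and `triage_for("409")`) avoids a mismatch between HTTP status codes and vendor error names.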

Failure Points in Production

Silent DLQ message expiration: AWS SQS DLQ messages retain original enqueue timestamp. A message that spent 10 days in the source queue has only 4 days left in the DLQ. Fix: Set DLQ retention to maximum (14 days); monitor ApproximateAgeOfOldestMessage. [src4]
Replay creates duplicates: Consumer crashes after processing but before acknowledging. Fix: Implement idempotency check using message_id or business key — upsert, not insert. [src1]
Ordered replay violation: Parent invoice replayed after child line items. Fix: Sort replay batch by correlation_id + sequence_number; replay parents first. [src1]
DLQ consumer infinite loop: DLQ triage consumer crashes and creates a "DLQ of DLQ." Fix: Never assign a DLQ to a DLQ consumer. Log errors, alert, halt processing. [src1, src7]
Replay storm overwhelms target ERP: Replaying 50K accumulated messages triggers rate limiting. Fix: Use velocity-controlled replay — AWS SQS custom redrive velocity or app-level throttle at 50-100 msg/s. [src4]
Stale messages replay into changed schema: Target ERP schema changed since message was produced. Fix: Validate schema compatibility before replay; transform stale messages to current schema. [src7]
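
Where the platform offers no managed redrive velocity, the application-level throttle mentioned for replay storms can be approximated with simple pacing. This is a sketch, not a tuned rate limiter: `replay_throttled` is a hypothetical function, `send` stands in for whatever publishes to the source queue, and the `sleep` and `clock` parameters exist only so the pacing can be tested without waiting.

```python
import time


def replay_throttled(messages, send, max_per_second=50.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Send messages one at a time, never faster than max_per_second."""
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    interval = 1.0 / max_per_second
    next_slot = clock()  # earliest time the next message may be sent
    sent = 0
    for msg in messages:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)
        send(msg)
        sent += 1
        # Schedule from whichever is later, so a slow send does not
        # cause a catch-up burst afterwards.
        next_slot = max(next_slot, clock()) + interval
    return sent
```

Starting with a low `max_per_second` and raising it while watching the target ERP's error rate is one way to apply the "start at 10% of normal throughput" advice.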

Anti-Patterns

Wrong: Infinite retry loop with no DLQ

# BAD — schema violation retries forever, blocks queue, burns compute
def process_message(message):
    while True:
        try:
            call_erp_api(message)
            return
        except Exception:
            time.sleep(5)  # never gives up

Correct: Bounded retry with classification and DLQ routing

# GOOD: classify error, retry transient only, DLQ permanent failures
import random
import time

def process_message(message, max_retries=5):
    for attempt in range(max_retries):
        try:
            call_erp_api(message)
            return
        except TransientError:
            delay = min(2 ** attempt + random.uniform(0, 1), 60)
            time.sleep(delay)
        except (ValidationError, SchemaError) as e:
            route_to_dlq(message, category="data_quality", error=str(e))
            return
        except Exception as e:
            route_to_dlq(message, category="permanent", error=str(e))
            return
    route_to_dlq(message, category="transient_exhausted", error="max retries")

Wrong: Silent message discard on failure

# BAD — failed messages logged and forgotten. Data is lost forever.
def process_message(message):
    try:
        call_erp_api(message)
    except Exception as e:
        logger.error(f"Failed: {e}")
        acknowledge(message)  # message deleted, data lost

Correct: Route to DLQ with full context for later triage

# GOOD — failed messages preserved with diagnostic context
def process_message(message):
    try:
        call_erp_api(message)
    except Exception as e:
        error_context = classify_error(e, message)
        route_to_dlq(message, category=error_context["category"],
            error=str(e), correlation_id=message.get("correlation_id"))
        acknowledge(message)  # now safely in DLQ

Wrong: Replaying without idempotency check

# BAD — replay sends to ERP without checking if already processed
def replay_from_dlq(dlq_messages):
    for msg in dlq_messages:
        call_erp_api(msg)  # may create duplicate invoice/order
        delete_from_dlq(msg)

Correct: Replay with idempotency verification

# GOOD — check if already processed before replay
def replay_from_dlq(dlq_messages):
    for msg in dlq_messages:
        idempotency_key = msg.get("idempotency_key")
        if is_already_processed(idempotency_key):
            delete_from_dlq(msg)
            continue
        try:
            call_erp_api_with_upsert(msg)  # upsert, not insert
            mark_as_processed(idempotency_key)
            delete_from_dlq(msg)
        except Exception as e:
            quarantine(msg, reason=str(e))  # no infinite loop

Common Pitfalls

DLQ as permanent storage: Teams never process DLQ. Depth grows to thousands. Messages expire. Fix: SLA-based triage — critical: <4h, standard: <24h. Monitor DLQ depth and age as production metrics. [src1]
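No visibility into DLQ contents: Ops knows DLQ has 500 messages but cannot inspect without consuming. Fix: Inspect non-destructively: peek or peek-lock on Azure Service Bus, ReceiveMessage without DeleteMessage on SQS (the message reappears after the visibility timeout, but its receive count increments), or a Kafka consumer that does not commit offsets. [src3, src4]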
Replaying at full speed after outage: 50K messages replayed at once overwhelms target ERP. Fix: Velocity-controlled replay — start at 10% normal throughput. AWS SQS Redrive supports custom max velocity. [src4]
Missing parent-child correlation: Order header fails, 15 line items also fail. Replaying items before header creates orphans. Fix: Tag with correlation_id + sequence_number. Sort replay by correlation_id, then sequence ascending. [src1]
Treating all DLQ messages identically: Single process handles 400, 401, 503 the same way. Fix: Automated classification at DLQ ingestion. Auto-replay transient errors. Route data quality to auto-fix. Escalate novel errors only. [src1, src7]
No DLQ in dev/staging: DLQ configured in production only. Developers never see poison behavior. Fix: Mirror DLQ config in all environments. Include poison message scenarios in integration tests. [src7]
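
The parent-before-child ordering recommended for correlated messages reduces to a two-key sort. The sketch below assumes each message is a dictionary carrying `correlation_id` and `sequence_number`, and that the parent record has the lowest sequence number in its group; `order_for_replay` is a hypothetical name.

```python
def order_for_replay(messages):
    """Sort a replay batch so each group's parent precedes its children.

    Groups are kept together by correlation_id; within a group, messages
    are replayed in ascending sequence_number.
    """
    return sorted(
        messages,
        key=lambda m: (m["correlation_id"], m["sequence_number"]),
    )
```

For example, a batch containing line items 2 and 1 followed by header 0 for the same order comes back as header, item 1, item 2.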

Diagnostic Commands

# === AWS SQS DLQ Diagnostics ===
# Check DLQ message count
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789/erp-orders-dlq \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible

# Check oldest message age (seconds)
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789/erp-orders-dlq \
  --attribute-names ApproximateAgeOfOldestMessage

# Initiate DLQ redrive to source queue
aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:us-east-1:123456789:erp-orders-dlq \
  --destination-arn arn:aws:sqs:us-east-1:123456789:erp-orders \
  --max-number-of-messages-per-second 50

# Check redrive task status
aws sqs list-message-move-tasks \
  --source-arn arn:aws:sqs:us-east-1:123456789:erp-orders-dlq

# === Azure Service Bus DLQ Diagnostics ===
# Check DLQ message count
az servicebus queue show \
  --resource-group erp-integration \
  --namespace-name erp-bus \
  --name erp-orders \
  --query "countDetails.deadLetterMessageCount"

# === Apache Kafka DLT Diagnostics ===
# Check DLT topic consumer lag
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --describe --group dlq-triage

# === MuleSoft Anypoint MQ Diagnostics ===
curl -X GET "https://anypoint.mulesoft.com/mq/admin/api/v1/organizations/{orgId}/environments/{envId}/regions/{region}/destinations/erp-orders-dlq/stats" \
  -H "Authorization: Bearer $ANYPOINT_TOKEN"

Version History & Compatibility

Feature	Release Date	Platform	Breaking Changes	Migration Notes
SQS DLQ Redrive API	2023-06	AWS SQS	N/A (new feature)	Replaces custom redrive consumers; velocity control
Anypoint MQ REM	2025-01	MuleSoft	N/A (new feature)	Managed replay — replaces manual consume + re-publish
MaxDeliveryCount	GA	Azure Service Bus	N/A	Default 10; recommend 5 for ERP integrations
Spring @RetryableTopic + DLT	2021	Kafka/Spring	N/A	Auto-creates retry-N and -dlt topics
Quorum Queue delivery-limit	2020	RabbitMQ 3.8	Classic queues unsupported	Must migrate to quorum queues
Boomi Event Streams DLQ	2024	Boomi	N/A	Configurable max retries with exponential backoff

When to Use / When Not to Use

Use When	Don't Use When	Use Instead
Messages repeatedly fail and block queue processing	Simple transient failures that resolve with retry + backoff	Error handling & DLQ fundamentals
Failed messages must be diagnosed, fixed, and replayed	Fire-and-forget integrations (message loss acceptable)	Simple error logging + monitoring
Multi-step flows with parent/child message dependencies	Single API call with synchronous response	Direct API error handling with retry
Compliance requires no data loss in integration pipeline	High-throughput streaming where per-message triage is cost-prohibitive	Batch error aggregation + statistical monitoring
Multiple failure categories need different remediation	All failures have the same root cause	Single-path retry strategy

Cross-System Comparison

Capability	AWS SQS	Azure Service Bus	Apache Kafka	MuleSoft Anypoint MQ	Boomi
DLQ Architecture	Separate queue	Sub-queue ($deadletterqueue)	Separate topic (DLT)	Separate queue	Built-in DLQ
Auto Dead-Letter	Yes (redrive policy)	Yes (MaxDeliveryCount)	No (application-level)	Yes	Yes (after 7 attempts)
Max Delivery Config	1-1000	1-2000 (default 10)	Application-defined	Configurable	Fixed at 7
Native Replay	Yes (Redrive API)	No (manual)	No (application-level)	Yes (REM)	Yes (resend)
DLQ Retention	Max 14 days	Unlimited (Premium)	Topic config	7 days default	Atom lifecycle
DLQ Reason Metadata	Custom attributes	DeadLetterReason header	Custom headers	Custom properties	Limited
Non-Destructive Peek	Visibility timeout	Peek-lock	Consumer offset mgmt	API browse	Panel view
FIFO Support	FIFO DLQ for FIFO queue	FIFO within sessions	Partition-ordered	FIFO queue	No
Monitoring	CloudWatch metrics	Azure Monitor	Consumer group lag	Anypoint Monitoring	Dashboard

Important Caveats

Poison message handling is downstream of retry strategy — if your retry/backoff config is wrong, messages reach the DLQ that should not be there. Review the error-handling-dead-letter-queues card first. [src1]
AWS SQS does NOT reset the message retention timer when a message moves to DLQ — messages can silently expire before triage. Always set DLQ retention to 14 days and monitor ApproximateAgeOfOldestMessage. [src4]
Azure Service Bus DLQ messages persist indefinitely with no automatic cleanup — without triage or purge, DLQ grows unbounded. [src3]
Kafka has no native DLQ mechanism — dead letter topics are application-level. Spring @RetryableTopic automates this, but non-Spring consumers must implement DLT routing manually. [src5]
Replaying large DLQ backlogs can overwhelm target ERP with burst traffic far exceeding normal volume. Always implement velocity-controlled replay. [src4]
This card covers message-level poison handling. For distributed transaction coordination (compensating transactions, saga rollback), see the saga pattern card.