Poison Message Handling: Triage and Replay of Failed ERP Integration Messages
How do you handle poison messages - triage and replay of failed ERP integration messages?
TL;DR
- Bottom line: A poison message is any message that repeatedly fails processing and blocks the queue. Detect within 3-5 delivery attempts, classify the error (transient vs permanent vs data quality), route to a typed DLQ, then triage via automated classification before manual review. Never replay without fixing the root cause and verifying idempotency. [src1, src2]
- Key limit: DLQ retention is finite on most platforms — AWS SQS max 14 days (original enqueue timestamp preserved), Azure Service Bus unlimited on Premium, Kafka depends on topic retention config. Unprocessed poison messages silently expire. [src3, src4]
- Watch out for: Infinite retry loops are the #1 anti-pattern — a message that fails due to a schema violation will fail identically on every retry, consuming processing capacity and blocking healthy messages. Set maxDeliveryCount/maxReceiveCount to 3-5, not 100. [src1, src7]
- Best for: Any ERP integration where failed messages must be triaged, diagnosed, fixed, and replayed — order-to-cash, AP automation, inventory sync, payroll feeds, intercompany settlement.
- Authentication: N/A (pattern-level card). See system-specific cards for broker/iPaaS authentication details.
System Profile
This card covers poison message handling as a cross-platform architecture pattern for ERP integrations. It focuses specifically on what happens after a message exhausts its retry budget and lands in a dead letter queue — detection, classification, triage, remediation, and replay. For retry strategies that determine when a message becomes a poison message (exponential backoff, circuit breakers), see the companion card on error handling and DLQ fundamentals.
The patterns apply across all major message brokers (AWS SQS, Azure Service Bus, Apache Kafka, RabbitMQ) and iPaaS platforms (MuleSoft Anypoint MQ, Boomi Atom Queue, Workato, Celigo). The specific ERP system at either end (Salesforce, SAP, Oracle, NetSuite, Dynamics 365, Workday) does not change the poison message handling approach — it changes the error codes and data mapping fixes needed during remediation.
| System | Role | API Surface | Direction |
|---|---|---|---|
| Source ERP (e.g., Salesforce) | Event producer — generates change events or outbound messages | REST, Platform Events, CDC | Outbound |
| Message Broker (e.g., AWS SQS, Kafka) | Message transport + DLQ infrastructure | SQS API, Kafka Protocol | Transport |
| iPaaS (e.g., MuleSoft, Boomi) | Integration orchestrator — message transformation and routing | Anypoint MQ, Atom Queue | Orchestrator |
| Target ERP (e.g., SAP S/4HANA) | Message consumer — processes inbound records | OData, BAPI, IDoc | Inbound |
API Surfaces & Capabilities
Poison message handling capabilities vary significantly across platforms. The key differentiators are automatic DLQ routing, DLQ inspection APIs, and native replay/redrive support: [src3, src4, src5]
| Platform | DLQ Type | Auto-Route | Max Delivery Count | Inspection API | Native Replay | DLQ Retention |
|---|---|---|---|---|---|---|
| AWS SQS | Separate queue | Yes (redrive policy) | Configurable (1-1000) | ReceiveMessage on DLQ | Yes (DLQ Redrive API) | Same as source (max 14 days) |
| Azure Service Bus | Sub-queue ($deadletterqueue) | Yes (MaxDeliveryCount) | Default 10, configurable | Peek/receive on sub-queue | Manual (receive + re-send) | Unlimited (Premium) |
| Apache Kafka | Separate topic (DLT) | Application-level | Application-level | Consumer on DLT topic | Application-level | Topic retention config |
| RabbitMQ | Separate queue (x-dead-letter-exchange) | Yes (x-delivery-limit) | Configurable via quorum queues | AMQP consume on DLQ | Manual (consume + re-publish) | Queue TTL config |
| MuleSoft Anypoint MQ | Separate queue | Yes (max delivery attempts) | Configurable | Anypoint MQ API | Yes (REM) | 7 days default |
| Boomi Atom Queue | Built-in DLQ | Yes (after 7 attempts) | 7 (6 retries + original) | Queue Management panel | Yes (resend dead letters) | Atom storage lifecycle |
Rate Limits & Quotas
DLQ Throughput Limits
| Platform | Replay Rate Limit | Concurrent Replays | Max DLQ Size | Notes |
|---|---|---|---|---|
| AWS SQS | System-optimized or custom max velocity | 1 active redrive task per source queue | No hard limit (cost-based) | Redrive task max duration: 36 hours; max 100 active tasks per account [src4] |
| Azure Service Bus | No built-in rate limit on replay | N/A (manual process) | Entity size limit (Premium: 80 GB) | No automatic cleanup — messages persist until explicitly completed [src3] |
| Apache Kafka | Consumer throughput | Consumer group parallelism | Topic retention (size or time) | No native redrive — must implement consumer that reads DLT and produces to main topic [src5] |
| MuleSoft Anypoint MQ | API rate limits apply | Per-queue basis | 120,000 in-flight messages | REM feature provides managed replay with visibility [src6] |
| Boomi | Queue throughput | Per-atom basis | Atom storage capacity | Dead letters visible in Queue Management panel; batch resend available |
Monitoring Thresholds
| Metric | Target | Alert When |
|---|---|---|
| DLQ ingestion rate | < 1% of incoming throughput | Sustained > 1% for 15 minutes [src1] |
| DLQ backlog (depth) | < 1,000 messages | Growing for > 1 hour without triage [src1] |
| Oldest message age in DLQ | < 24 hours for critical streams | Any message > 24 hours untriaged [src1] |
| Replay success rate | > 95% | Below 90% on any replay batch [src1] |
| Poison ratio (DLQ / total) | < 5% | Above 5% sustained [src1] |
| Time to first triage | < 4h (critical), < 24h (standard) | Exceeding SLA threshold [src1] |
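The point-in-time thresholds in the table can be checked with a small function. This is a minimal sketch, not a monitoring implementation: `DlqSnapshot`, its field names, and the alert labels are hypothetical, and the "sustained for 15 minutes" and "growing for 1 hour" conditions are assumed to be handled by the alerting system rather than by this code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DlqSnapshot:
    """Hypothetical counters collected over one evaluation window."""
    incoming_count: int       # messages received on the main queue
    dlq_ingested_count: int   # messages dead-lettered in the same window
    oldest_age_seconds: int   # age of the oldest message in the DLQ
    replay_attempted: int     # messages replayed in the last batch
    replay_succeeded: int     # replayed messages that processed cleanly


def evaluate_thresholds(s: DlqSnapshot) -> List[str]:
    """Return the names of thresholds from the table that are breached."""
    alerts = []
    # DLQ ingestion rate: alert above 1% of incoming throughput.
    if s.incoming_count and s.dlq_ingested_count / s.incoming_count > 0.01:
        alerts.append("dlq_ingestion_rate")
    # Oldest message age: alert on anything older than 24 hours.
    if s.oldest_age_seconds > 24 * 3600:
        alerts.append("oldest_message_age")
    # Replay success rate: alert below 90% on a replay batch.
    if s.replay_attempted and s.replay_succeeded / s.replay_attempted < 0.90:
        alerts.append("replay_success_rate")
    return alerts
```

Returning labels instead of raising keeps the check easy to wire into whichever alerting channel is already in place.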
Authentication
N/A — pattern-level card. Authentication is handled at the broker/iPaaS layer:
| Platform | Auth Method | Notes |
|---|---|---|
| AWS SQS | IAM roles / policies | DLQ access requires sqs:ReceiveMessage + sqs:DeleteMessage + sqs:SendMessage on both source and DLQ |
| Azure Service Bus | SAS or Azure AD (RBAC) | DLQ is a sub-queue — same connection string, append /$deadletterqueue [src3] |
| Apache Kafka | SASL/SCRAM or mTLS (authentication), ACLs (authorization) | DLT is a regular topic — requires separate ACL for consumer group [src5] |
| MuleSoft | Anypoint Platform credentials | DLQ management requires Manage Queues permission [src6] |
Constraints
- Detection threshold: Set maxDeliveryCount / maxReceiveCount to 3-5. Below 3 sends transient failures to DLQ prematurely; above 5 wastes processing capacity on truly unrecoverable messages. [src1, src7]
- Retention is finite: AWS SQS DLQ messages retain their original enqueue timestamp — a message with 14-day retention that spent 10 days in the source queue has only 4 days left in the DLQ. [src4]
- Replay ordering: Replaying messages out of order creates referential integrity violations — parent records must be replayed before child records.
- Idempotency is mandatory for replay: Every replayed message must carry an idempotency key. Without it, replay creates duplicate records in the target ERP. [src1]
- DLQ-of-DLQ is an anti-pattern: If your DLQ consumer fails, do NOT route to a second DLQ. Log, alert, and stop processing. [src1]
- No cross-region DLQ on MuleSoft: Anypoint MQ requires DLQ and source queue in the same region. [src6]
- Azure DLQ has no TTL: Messages in Azure Service Bus DLQ persist indefinitely. Without a purge process, DLQ grows unbounded. [src3]
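The retention constraint for AWS SQS is simple arithmetic on the original enqueue time. The sketch below assumes the SQS `SentTimestamp` attribute (epoch milliseconds) is available on the dead-lettered message; the function name `remaining_dlq_lifetime` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def remaining_dlq_lifetime(
    sent_timestamp_ms: int,
    dlq_retention_seconds: int,
    now: Optional[datetime] = None,
) -> timedelta:
    """Time left before a dead-lettered message expires.

    Expiry is measured from the original enqueue time, not from the
    moment the message arrived in the DLQ.
    """
    now = now or datetime.now(timezone.utc)
    enqueued = datetime.fromtimestamp(sent_timestamp_ms / 1000, tz=timezone.utc)
    expires_at = enqueued + timedelta(seconds=dlq_retention_seconds)
    # Never return a negative duration; zero means already expired.
    return max(expires_at - now, timedelta(0))
```

For the example in the list above, a message enqueued 10 days ago with a 14-day DLQ retention yields a remaining lifetime of 4 days.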
Integration Pattern Decision Tree
START — Message has failed processing and landed in DLQ
├── Step 1: Classify the failure
│   ├── Transient error? (timeout, 429, 503, network error)
│   │   ├── YES → Should NOT be in DLQ — investigate why retries exhausted
│   │   │   ├── maxDeliveryCount too low? → Increase to 3-5
│   │   │   ├── Backoff delay too short? → Increase max backoff
│   │   │   └── Upstream system down for extended period? → Expected; replay now
│   │   └── Action: REPLAY IMMEDIATELY (system has recovered)
│   ├── Data quality error? (schema violation, missing field, invalid reference)
│   │   ├── Can the message be fixed automatically?
│   │   │   ├── YES → Auto-remediate → REPLAY WITH IDEMPOTENCY CHECK
│   │   │   └── NO → Route to manual review queue
│   │   └── Action: FIX DATA → REPLAY WITH IDEMPOTENCY CHECK
│   ├── Permanent error? (invalid endpoint, auth failure, business rule violation)
│   │   ├── Code/config bug? → Fix, deploy → REPLAY ENTIRE BATCH
│   │   └── Business rule rejection? → Fix target state or DISCARD + ALERT
│   └── Unknown error? → QUARANTINE → MANUAL TRIAGE
├── Step 2: Remediate
│   ├── Automated fix possible? → Apply transform → validate → replay
│   └── Manual fix needed? → Alert ops → ticket → SLA clock starts
├── Step 3: Replay
│   ├── Verify idempotency key present
│   ├── Verify ordering (parent before child)
│   ├── Replay to original queue (NOT directly to consumer)
│   ├── Monitor replay success rate
│   └── If fails again → QUARANTINE (no infinite loop)
└── Step 4: Post-mortem
    ├── New failure category? → Add classifier rule
    ├── Recurring pattern? → Fix upstream validation
    └── Update monitoring thresholds
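Steps 1 to 3 of the tree reduce to a small routing function. The sketch below is illustrative only: the function name, the action labels, and the replay cap of 3 are assumptions chosen to mirror the tree, not a prescribed interface.

```python
def next_action(
    category: str,
    auto_fixable: bool = False,
    replay_attempt: int = 0,
    max_replays: int = 3,
) -> str:
    """Map a classified DLQ message to its next triage action."""
    # A message that keeps failing after replay is quarantined, never looped.
    if replay_attempt > max_replays:
        return "quarantine"
    if category == "transient":
        # The upstream system is assumed to have recovered.
        return "replay"
    if category == "data_quality":
        return "auto_fix_then_replay" if auto_fixable else "manual_review"
    if category == "permanent":
        # Code/config bugs and business rule rejections both need a human decision.
        return "manual_review"
    # Unknown or unclassified errors.
    return "quarantine"
```

Checking the replay cap first means every branch inherits the "no infinite loop" guarantee from Step 3.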
Quick Reference
| Scenario | Action | Replay? | Idempotency? | Alert Level |
|---|---|---|---|---|
| Schema violation (missing field) | Fix data, validate, replay | Yes | Yes | Warning |
| Invalid foreign key reference | Create parent first, then replay | Yes (ordered) | Yes | Warning |
| Rate limit exhaustion (429) | Should not be in DLQ — increase retry budget | Yes (immediate) | Yes | Info |
| Authentication failure (401/403) | Fix credentials, replay batch | Yes | Yes | Critical |
| Business rule violation | Fix target ERP state or discard | Conditional | Yes | Warning |
| Malformed payload (unparseable) | Discard — cannot be fixed | No | N/A | Error |
| Target system decommissioned | Discard + archive for audit | No | N/A | Critical |
| Duplicate record conflict (409) | Already processed — safe to discard | No | N/A | Info |
| Cascading failure (parent failed) | Fix parent first, replay children in order | Yes (ordered) | Yes | Warning |
| Unknown/unclassified error | Quarantine for investigation | Pending triage | Yes | Error |
Step-by-Step Integration Guide
1. Classify errors at the consumer level
Before a message reaches the DLQ, classify the error type in your consumer. This metadata travels with the message and determines the triage path. [src1, src7]
from datetime import datetime

# ValidationError, SchemaError, and AuthenticationError are assumed to be
# exception classes defined by your integration layer.
def classify_error(exception, message):
    """Classify processing errors to determine DLQ triage path."""
    error_info = {
        "error_class": type(exception).__name__,
        "error_message": str(exception)[:500],  # truncate long traces
        "timestamp": datetime.utcnow().isoformat(),
        "message_id": message.get("message_id"),
        "attempt_count": message.get("approximate_receive_count", 0),
    }
    if isinstance(exception, (TimeoutError, ConnectionError)):
        error_info["category"] = "transient"
        error_info["retry_eligible"] = True
    elif isinstance(exception, (ValidationError, SchemaError)):
        error_info["category"] = "data_quality"
        error_info["retry_eligible"] = False
    elif isinstance(exception, (AuthenticationError, PermissionError)):
        error_info["category"] = "permanent"
        error_info["retry_eligible"] = False
    else:
        error_info["category"] = "unknown"
        error_info["retry_eligible"] = False
    return error_info
Verify: Check that DLQ messages carry the error_category attribute (populated from the classifier's category value) → confirms classification is running.
2. Configure platform-specific DLQ routing
Set up automatic dead-letter routing with appropriate delivery count thresholds. [src3, src4]
# AWS SQS — Create DLQ and attach redrive policy
aws sqs create-queue --queue-name erp-orders-dlq \
  --attributes '{"MessageRetentionPeriod":"1209600"}'
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789/erp-orders \
  --attributes '{
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789:erp-orders-dlq\",\"maxReceiveCount\":\"5\"}"
  }'
# Azure Service Bus — Set MaxDeliveryCount (recommend 5 for ERP)
az servicebus queue update \
  --resource-group erp-integration \
  --namespace-name erp-bus \
  --name erp-orders \
  --max-delivery-count 5
Verify: Send a message that always fails → confirm it appears in DLQ after 5 attempts.
3. Build the DLQ triage consumer
Create a dedicated consumer that reads from the DLQ, classifies messages, and routes them through the triage workflow. [src1, src7]
import json
import boto3
from datetime import datetime

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789/erp-orders-dlq"
SOURCE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/erp-orders"

# quarantine_message, attempt_auto_fix, and route_to_manual_review are
# helpers assumed to be defined elsewhere in the triage service.
def triage_dlq_messages(max_messages=10):
    """Read DLQ, classify, and route for remediation or replay."""
    response = sqs.receive_message(
        QueueUrl=DLQ_URL,
        MaxNumberOfMessages=max_messages,  # SQS returns at most 10 per call
        MessageAttributeNames=["All"],
        AttributeNames=["All"],
    )
    for msg in response.get("Messages", []):
        error_category = msg.get("MessageAttributes", {}).get(
            "error_category", {}).get("StringValue", "unknown")
        receive_count = int(msg["Attributes"].get("ApproximateReceiveCount", 0))
        if receive_count > 3:  # Prevent infinite triage loops
            quarantine_message(msg, reason="triage_loop_detected")
            continue
        try:
            body = json.loads(msg["Body"])
        except json.JSONDecodeError:
            # An unparseable body must not crash the triage consumer itself
            quarantine_message(msg, reason="unparseable_body")
            continue
        if error_category == "transient":
            replay_message(msg, body, SOURCE_URL)
        elif error_category == "data_quality":
            attempt_auto_fix(msg, body)
        elif error_category == "permanent":
            route_to_manual_review(msg, body)
        else:
            quarantine_message(msg, reason="unclassified")
Verify: aws sqs get-queue-attributes --queue-url $DLQ_URL --attribute-names ApproximateNumberOfMessages → count decreasing as triage runs.
4. Implement safe replay with idempotency check
Replay messages back to the source queue with idempotency verification. [src1, src4]
def replay_message(dlq_msg, body, target_queue_url):
    """Replay a DLQ message with idempotency safety."""
    idempotency_key = body.get("idempotency_key")
    if not idempotency_key:
        quarantine_message(dlq_msg, reason="missing_idempotency_key")
        return
    if is_already_processed(idempotency_key):
        sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=dlq_msg["ReceiptHandle"])
        return  # already handled
    body["_replay"] = {
        "replayed_at": datetime.utcnow().isoformat(),
        "replay_attempt": body.get("_replay", {}).get("replay_attempt", 0) + 1,
    }
    if body["_replay"]["replay_attempt"] > 3:
        quarantine_message(dlq_msg, reason="max_replay_attempts_exceeded")
        return
    sqs.send_message(
        QueueUrl=target_queue_url,
        MessageBody=json.dumps(body),
        MessageAttributes={
            "idempotency_key": {"DataType": "String", "StringValue": idempotency_key},
            "is_replay": {"DataType": "String", "StringValue": "true"},
        },
    )
    sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=dlq_msg["ReceiptHandle"])
Verify: Replay a known-good message → confirm no duplicate in target ERP.
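The replay step relies on an idempotency lookup that the card leaves undefined. One possible shape for it is sketched below using an in-memory SQLite table from the standard library. The class name, table name, and schema are assumptions; a production deployment would more likely use a shared store such as the target ERP's own external-ID field or a central database.

```python
import sqlite3


class IdempotencyStore:
    """Hypothetical record of idempotency keys that were already processed."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS processed ("
            "idempotency_key TEXT PRIMARY KEY, "
            "processed_at TEXT DEFAULT CURRENT_TIMESTAMP)"
        )

    def is_already_processed(self, key: str) -> bool:
        row = self.conn.execute(
            "SELECT 1 FROM processed WHERE idempotency_key = ?", (key,)
        ).fetchone()
        return row is not None

    def mark_as_processed(self, key: str) -> bool:
        """Record the key; returns False if it was already recorded."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO processed (idempotency_key) VALUES (?)", (key,)
        )
        self.conn.commit()
        return cur.rowcount == 1
```

The primary key constraint makes `mark_as_processed` safe to call twice for the same key, which is the property replay depends on.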
Code Examples
Python: DLQ Depth Monitoring with CloudWatch Alerting
# Input: DLQ queue name, alert threshold, SNS topic ARN
# Output: CloudWatch alarms for DLQ depth and message age
import boto3

cloudwatch = boto3.client("cloudwatch")

def create_dlq_depth_alarm(queue_name, threshold=100, sns_topic_arn=None):
    cloudwatch.put_metric_alarm(
        AlarmName=f"dlq-depth-{queue_name}",
        AlarmDescription=f"DLQ {queue_name} has > {threshold} messages",
        Namespace="AWS/SQS",
        MetricName="ApproximateNumberOfMessagesVisible",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        Statistic="Maximum",
        Period=300, EvaluationPeriods=2,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn] if sns_topic_arn else [],
    )

def create_dlq_age_alarm(queue_name, max_age_seconds=86400, sns_topic_arn=None):
    cloudwatch.put_metric_alarm(
        AlarmName=f"dlq-age-{queue_name}",
        AlarmDescription=f"DLQ {queue_name} has messages older than {max_age_seconds}s",
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        Statistic="Maximum",
        Period=300, EvaluationPeriods=1,
        Threshold=max_age_seconds,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn] if sns_topic_arn else [],
    )
JavaScript/Node.js: Kafka DLT Consumer with Triage Logic
// Input: Kafka connection config, DLT topic name
// Output: Triage consumer that classifies and routes failed messages
// logToQuarantine, routeToRemediationTopic, and alertOpsTeam are
// application helpers assumed to be defined elsewhere.
const { Kafka } = require("kafkajs");

const kafka = new Kafka({ brokers: ["broker:9092"] });
const consumer = kafka.consumer({ groupId: "dlq-triage" });
const producer = kafka.producer({ idempotent: true });

async function runDLTTriageConsumer(dltTopic, mainTopic) {
  await consumer.connect();
  await producer.connect();
  await consumer.subscribe({ topic: dltTopic, fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const headers = message.headers || {};
      const errorType = headers["error-type"]?.toString() || "unknown";
      const retryCount = parseInt(headers["retry-count"]?.toString() || "0", 10);
      const idempotencyKey = headers["idempotency-key"]?.toString();
      if (!idempotencyKey) {
        await logToQuarantine(message, "missing_idempotency_key");
        return;
      }
      if (retryCount > 3) {
        await logToQuarantine(message, "max_retries_exceeded");
        return;
      }
      switch (errorType) {
        case "transient":
          // Replay to the main topic, carrying the incremented retry count
          await producer.send({ topic: mainTopic, messages: [{
            key: message.key, value: message.value,
            headers: { ...headers, "is-replay": "true",
              "retry-count": String(retryCount + 1) },
          }] });
          break;
        case "data_quality":
          await routeToRemediationTopic(message);
          break;
        default:
          await logToQuarantine(message, errorType);
          await alertOpsTeam(message, errorType);
      }
    },
  });
}
cURL: Azure Service Bus DLQ Inspection
# Input: Service Bus namespace, queue name, SAS token
# Output: Peek at dead-lettered messages for triage
SAS_TOKEN="SharedAccessSignature sr=..."
# Peek messages in DLQ (non-destructive)
curl -X POST \
  "https://erp-bus.servicebus.windows.net/erp-orders/\$deadletterqueue/messages/head?timeout=30" \
  -H "Authorization: $SAS_TOKEN"
# Complete (delete) a DLQ message after successful triage
curl -X DELETE \
  "https://erp-bus.servicebus.windows.net/erp-orders/\$deadletterqueue/messages/{messageId}/{lockToken}" \
  -H "Authorization: $SAS_TOKEN"
Data Mapping
Poison Message Context Preservation
When a message moves to the DLQ, critical context must be preserved for effective triage and replay:
| Field | Purpose | Required for Replay? | Notes |
|---|---|---|---|
| original_message_id | Trace back to original message | Yes | Idempotency dedup and audit trail |
| idempotency_key | Prevent duplicate processing | Yes | Without this, replay creates duplicates |
| error_category | Triage classification | Yes | Determines triage path |
| error_message | Root cause description | No (helpful) | Truncate to 500 chars |
| source_queue | Original queue/topic | Yes | Required for replay routing |
| original_timestamp | When first produced | Yes | Detect aging and retention deadline |
| attempt_count | Delivery attempt count | Yes | Helps tune retry budget |
| correlation_id | Links related messages | Conditional | Required for ordered replay |
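The "Required for Replay?" column can be enforced before any replay is attempted. The sketch below assumes the DLQ context is available as a plain dictionary keyed by the field names in the table; the function name is hypothetical.

```python
from typing import List

# Fields the table marks as required for replay.
REQUIRED_FOR_REPLAY = (
    "original_message_id",
    "idempotency_key",
    "error_category",
    "source_queue",
    "original_timestamp",
    "attempt_count",
)


def missing_replay_fields(envelope: dict, ordered_replay: bool = False) -> List[str]:
    """List required context fields that are absent or empty.

    correlation_id is conditional: it is only required when the
    message takes part in an ordered (parent/child) replay.
    """
    required = REQUIRED_FOR_REPLAY
    if ordered_replay:
        required = required + ("correlation_id",)
    # Only None and empty string count as missing; attempt_count == 0 is valid.
    return [name for name in required if envelope.get(name) in (None, "")]
```

A non-empty result is a reasonable trigger for quarantine, since replaying without these fields risks duplicates or misrouting.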
Platform-Specific DLQ Metadata
| Platform | Auto-Captured Metadata | Custom Metadata | Access Pattern |
|---|---|---|---|
| AWS SQS | ApproximateReceiveCount, SentTimestamp | MessageAttributes (up to 10) | ReceiveMessage with AttributeNames=All [src4] |
| Azure Service Bus | DeliveryCount, EnqueuedTimeUtc, DeadLetterReason | Custom properties (unlimited) | Peek/receive on $deadletterqueue [src3] |
| Apache Kafka | Offset, partition, timestamp | Headers (key-value byte arrays) | Consumer on DLT topic [src5] |
| MuleSoft Anypoint MQ | deliveryCount, destination | Custom properties | Anypoint MQ API or REM console [src6] |
Error Handling & Failure Points
Common Error Codes That Create Poison Messages
| Code | Meaning | Source System | Triage Action |
|---|---|---|---|
| 400 | Payload validation failure | Target ERP API | Data quality fix → replay |
| 404 | Referenced record does not exist | Target ERP API | Create parent → replay children |
| 409 | Duplicate record — already exists | Target ERP API | Safe to discard |
| 422 | Business rule violation | Target ERP API | Fix target state → replay |
| INVALID_FIELD | Field not writable | Salesforce API | Update field mapping → replay |
| UNABLE_TO_LOCK_ROW | Record locked | Salesforce API | Transient — increase retry budget |
| GOVERNANCE_LIMIT | SuiteScript governance exhausted | NetSuite | Reduce batch size, replay |
| -ERR_PARSE | Malformed XML/JSON | Any consumer | Permanent — discard + log |
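The table above can be encoded as a lookup so that the triage consumer does not re-derive the action for each message. This is a sketch under assumptions: the category labels and action names are illustrative and not part of any platform API, and unlisted codes fall back to quarantine.

```python
from typing import Tuple

# (category, triage action) per error code, following the table above.
TRIAGE_BY_CODE = {
    "400": ("data_quality", "fix_data_then_replay"),
    "404": ("data_quality", "create_parent_then_replay"),
    "409": ("duplicate", "discard"),
    "422": ("business_rule", "fix_target_state_then_replay"),
    "INVALID_FIELD": ("data_quality", "update_field_mapping_then_replay"),
    "UNABLE_TO_LOCK_ROW": ("transient", "increase_retry_budget"),
    "GOVERNANCE_LIMIT": ("transient", "reduce_batch_size_then_replay"),
    "-ERR_PARSE": ("permanent", "discard_and_log"),
}


def triage_for(code) -> Tuple[str, str]:
    """Return (category, action) for an error code; unknown codes are quarantined."""
    return TRIAGE_BY_CODE.get(str(code), ("unknown", "quarantine"))
```

Accepting both integers and strings lets HTTP status codes and vendor error names share one table.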
Failure Points in Production
- Silent DLQ message expiration: AWS SQS DLQ messages retain original enqueue timestamp. A message that spent 10 days in the source queue has only 4 days left in the DLQ. Fix: Set DLQ retention to maximum (14 days); monitor ApproximateAgeOfOldestMessage. [src4]
- Replay creates duplicates: Consumer crashes after processing but before acknowledging. Fix: Implement idempotency check using message_id or business key — upsert, not insert. [src1]
- Ordered replay violation: Parent invoice replayed after child line items. Fix: Sort replay batch by correlation_id + sequence_number; replay parents first. [src1]
- DLQ consumer infinite loop: DLQ triage consumer crashes and creates a "DLQ of DLQ." Fix: Never assign a DLQ to a DLQ consumer. Log errors, alert, halt processing. [src1, src7]
- Replay storm overwhelms target ERP: Replaying 50K accumulated messages triggers rate limiting. Fix: Use velocity-controlled replay — AWS SQS custom redrive velocity or app-level throttle at 50-100 msg/s. [src4]
- Stale messages replay into changed schema: Target ERP schema changed since message was produced. Fix: Validate schema compatibility before replay; transform stale messages to current schema. [src7]
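An application-level throttle for the replay-storm case can be as simple as pacing sends against a monotonic clock. The sketch below is one way to do it, not a reference implementation: `replay_throttled` and its parameters are hypothetical, and `send` stands in for whatever publishes a message back to the source queue. The `sleep` and `clock` arguments are injectable only so the pacing can be tested without waiting.

```python
import time
from typing import Callable, Iterable


def replay_throttled(
    messages: Iterable,
    send: Callable[[object], None],
    max_per_second: float = 50.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Send messages no faster than max_per_second; returns the count sent."""
    interval = 1.0 / max_per_second
    next_slot = clock()
    sent = 0
    for msg in messages:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)  # wait for the next send slot
        send(msg)
        sent += 1
        # If send() was slow, do not burst afterwards to catch up.
        next_slot = max(next_slot + interval, clock())
    return sent
```

Starting with a low `max_per_second` and raising it while watching the target ERP's error rate keeps the replay inside the rate limits that caused the backlog in the first place.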
Anti-Patterns
Wrong: Infinite retry loop with no DLQ
# BAD — schema violation retries forever, blocks queue, burns compute
def process_message(message):
    while True:
        try:
            call_erp_api(message)
            return
        except Exception:
            time.sleep(5)  # never gives up
Correct: Bounded retry with classification and DLQ routing
# GOOD — classify error, retry transient only, DLQ permanent failures
import random
import time

def process_message(message, max_retries=5):
    for attempt in range(max_retries):
        try:
            call_erp_api(message)
            return
        except TransientError:
            # Exponential backoff with jitter, capped at 60 seconds
            delay = min(2 ** attempt + random.uniform(0, 1), 60)
            time.sleep(delay)
        except (ValidationError, SchemaError) as e:
            route_to_dlq(message, category="data_quality", error=str(e))
            return
        except Exception as e:
            route_to_dlq(message, category="permanent", error=str(e))
            return
    route_to_dlq(message, category="transient_exhausted", error="max retries")
Wrong: Silent message discard on failure
# BAD — failed messages logged and forgotten. Data is lost forever.
def process_message(message):
    try:
        call_erp_api(message)
    except Exception as e:
        logger.error(f"Failed: {e}")
    acknowledge(message)  # message deleted, data lost
Correct: Route to DLQ with full context for later triage
# GOOD — failed messages preserved with diagnostic context
def process_message(message):
    try:
        call_erp_api(message)
    except Exception as e:
        error_context = classify_error(e, message)
        route_to_dlq(message, category=error_context["category"],
                     error=str(e), correlation_id=message.get("correlation_id"))
    acknowledge(message)  # now safely in DLQ
Wrong: Replaying without idempotency check
# BAD — replay sends to ERP without checking if already processed
def replay_from_dlq(dlq_messages):
    for msg in dlq_messages:
        call_erp_api(msg)  # may create duplicate invoice/order
        delete_from_dlq(msg)
Correct: Replay with idempotency verification
# GOOD — check if already processed before replay
def replay_from_dlq(dlq_messages):
    for msg in dlq_messages:
        idempotency_key = msg.get("idempotency_key")
        if not idempotency_key:
            quarantine(msg, reason="missing_idempotency_key")
            continue
        if is_already_processed(idempotency_key):
            delete_from_dlq(msg)
            continue
        try:
            call_erp_api_with_upsert(msg)  # upsert, not insert
            mark_as_processed(idempotency_key)
            delete_from_dlq(msg)
        except Exception as e:
            quarantine(msg, reason=str(e))  # no infinite loop
Common Pitfalls
- DLQ as permanent storage: Teams never process DLQ. Depth grows to thousands. Messages expire. Fix: SLA-based triage — critical: <4h, standard: <24h. Monitor DLQ depth and age as production metrics. [src1]
- No visibility into DLQ contents: Ops knows DLQ has 500 messages but cannot inspect without consuming. Fix: Use peek/browse operations (Azure peek-lock, SQS visibility timeout, Kafka manual offset). [src3, src4]
- Replaying at full speed after outage: 50K messages replayed at once overwhelms target ERP. Fix: Velocity-controlled replay — start at 10% normal throughput. AWS SQS Redrive supports custom max velocity. [src4]
- Missing parent-child correlation: Order header fails, 15 line items also fail. Replaying items before header creates orphans. Fix: Tag with correlation_id + sequence_number. Sort replay by correlation_id, then sequence ascending. [src1]
- Treating all DLQ messages identically: Single process handles 400, 401, 503 the same way. Fix: Automated classification at DLQ ingestion. Auto-replay transient errors. Route data quality to auto-fix. Escalate novel errors only. [src1, src7]
- No DLQ in dev/staging: DLQ configured in production only. Developers never see poison behavior. Fix: Mirror DLQ config in all environments. Include poison message scenarios in integration tests. [src7]
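The parent-child ordering fix can be sketched as a sort followed by grouping. This assumes each message is a dictionary carrying `correlation_id` and `sequence_number`, and that the parent record has the lowest sequence number in its group; `replay_batches` is a hypothetical helper name.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, List


def replay_batches(messages: List[Dict]) -> List[List[Dict]]:
    """Group DLQ messages by correlation_id, each group in sequence order.

    Replaying one group at a time keeps a parent (for example an order
    header) ahead of its children (the line items).
    """
    ordered = sorted(messages, key=itemgetter("correlation_id", "sequence_number"))
    return [
        list(group)
        for _, group in groupby(ordered, key=itemgetter("correlation_id"))
    ]
```

Grouping also gives a natural stopping point: if the parent in a batch fails again, the remaining children in that batch can be held back instead of replayed into orphaned state.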
Diagnostic Commands
# === AWS SQS DLQ Diagnostics ===
# Check DLQ message count
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789/erp-orders-dlq \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible
# Check oldest message age (seconds)
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789/erp-orders-dlq \
  --attribute-names ApproximateAgeOfOldestMessage
# Initiate DLQ redrive to source queue
aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:us-east-1:123456789:erp-orders-dlq \
  --destination-arn arn:aws:sqs:us-east-1:123456789:erp-orders \
  --max-number-of-messages-per-second 50
# Check redrive task status
aws sqs list-message-move-tasks \
  --source-arn arn:aws:sqs:us-east-1:123456789:erp-orders-dlq
# === Azure Service Bus DLQ Diagnostics ===
# Check DLQ message count
az servicebus queue show \
  --resource-group erp-integration \
  --namespace-name erp-bus \
  --name erp-orders \
  --query "countDetails.deadLetterMessageCount"
# === Apache Kafka DLT Diagnostics ===
# Check DLT topic consumer lag
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --describe --group dlq-triage
# === MuleSoft Anypoint MQ Diagnostics ===
curl -X GET "https://anypoint.mulesoft.com/mq/admin/api/v1/organizations/{orgId}/environments/{envId}/regions/{region}/destinations/erp-orders-dlq/stats" \
  -H "Authorization: Bearer $ANYPOINT_TOKEN"
Version History & Compatibility
| Feature | Release Date | Platform | Breaking Changes | Migration Notes |
|---|---|---|---|---|
| SQS DLQ Redrive API | 2023-06 | AWS SQS | N/A (new feature) | Replaces custom redrive consumers; velocity control |
| Anypoint MQ REM | 2025-01 | MuleSoft | N/A (new feature) | Managed replay — replaces manual consume + re-publish |
| MaxDeliveryCount | GA | Azure Service Bus | N/A | Default 10; recommend 5 for ERP integrations |
| Spring @RetryableTopic + DLT | 2021 | Kafka/Spring | N/A | Auto-creates retry-N and -dlt topics |
| Quorum Queue delivery-limit | 2020 | RabbitMQ 3.8 | Classic queues unsupported | Must migrate to quorum queues |
| Boomi Event Streams DLQ | 2024 | Boomi | N/A | Configurable max retries with exponential backoff |
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Messages repeatedly fail and block queue processing | Simple transient failures that resolve with retry + backoff | Error handling & DLQ fundamentals |
| Failed messages must be diagnosed, fixed, and replayed | Fire-and-forget integrations (message loss acceptable) | Simple error logging + monitoring |
| Multi-step flows with parent/child message dependencies | Single API call with synchronous response | Direct API error handling with retry |
| Compliance requires no data loss in integration pipeline | High-throughput streaming where per-message triage is cost-prohibitive | Batch error aggregation + statistical monitoring |
| Multiple failure categories need different remediation | All failures have the same root cause | Single-path retry strategy |
Cross-System Comparison
| Capability | AWS SQS | Azure Service Bus | Apache Kafka | MuleSoft Anypoint MQ | Boomi |
|---|---|---|---|---|---|
| DLQ Architecture | Separate queue | Sub-queue ($deadletterqueue) | Separate topic (DLT) | Separate queue | Built-in DLQ |
| Auto Dead-Letter | Yes (redrive policy) | Yes (MaxDeliveryCount) | No (application-level) | Yes | Yes (after 7 attempts) |
| Max Delivery Config | 1-1000 | 1-2000 (default 10) | Application-defined | Configurable | Fixed at 7 |
| Native Replay | Yes (Redrive API) | No (manual) | No (application-level) | Yes (REM) | Yes (resend) |
| DLQ Retention | Max 14 days | Unlimited (Premium) | Topic config | 7 days default | Atom lifecycle |
| DLQ Reason Metadata | Custom attributes | DeadLetterReason header | Custom headers | Custom properties | Limited |
| Non-Destructive Peek | Visibility timeout | Peek-lock | Consumer offset mgmt | API browse | Panel view |
| FIFO Support | FIFO DLQ for FIFO queue | FIFO within sessions | Partition-ordered | FIFO queue | No |
| Monitoring | CloudWatch metrics | Azure Monitor | Consumer group lag | Anypoint Monitoring | Dashboard |
Important Caveats
- Poison message handling is downstream of retry strategy — if your retry/backoff config is wrong, messages reach the DLQ that should not be there. Review the error-handling-dead-letter-queues card first. [src1]
- AWS SQS does NOT reset the message retention timer when a message moves to DLQ — messages can silently expire before triage. Always set DLQ retention to 14 days and monitor ApproximateAgeOfOldestMessage. [src4]
- Azure Service Bus DLQ messages persist indefinitely with no automatic cleanup — without triage or purge, DLQ grows unbounded. [src3]
- Kafka has no native DLQ mechanism — dead letter topics are application-level. Spring @RetryableTopic automates this, but non-Spring consumers must implement DLT routing manually. [src5]
- Replaying large DLQ backlogs can overwhelm target ERP with burst traffic far exceeding normal volume. Always implement velocity-controlled replay. [src4]
- This card covers message-level poison handling. For distributed transaction coordination (compensating transactions, saga rollback), see the saga pattern card.