How do you implement the circuit breaker pattern for ERP API integrations?
TL;DR
Bottom line: Wrap every outbound ERP API call in a circuit breaker that trips after a configurable failure threshold, fails fast while the ERP recovers, and probes with limited half-open requests before restoring full traffic.
Key limit: Circuit breaker state is per-process by default; horizontal scaling requires shared state (Redis) or per-instance breakers with coordinated thresholds.
Watch out for: Thresholds that are too sensitive (tripping on 2 failures) cause false opens during normal ERP latency spikes; thresholds that are too tolerant (50 failures) defeat the purpose. Start at a 50% failure rate over 10-second windows with a minimum of 8 requests sampled.
Best for: Real-time ERP API integrations where downstream unavailability (SAP maintenance, Salesforce governor limits, Oracle Cloud outages) would cascade into thread exhaustion, connection pool starvation, or saga timeout failures.
This card covers the circuit breaker pattern as applied to ERP API integrations across all major ERP systems. It is platform-agnostic but provides concrete implementations for the four major integration languages (Python, Java, C#, Node.js) and two leading iPaaS platforms (MuleSoft, Boomi). The pattern applies identically whether calling SAP S/4HANA OData, Salesforce REST, Oracle ERP Cloud REST, NetSuite SuiteTalk, or Dynamics 365 Web API.
| Property | Value |
| --- | --- |
| Pattern | Circuit Breaker (client-side resilience) |
| Applies To | All ERP REST/SOAP/OData API calls |
| Granularity | One breaker per ERP endpoint or API surface |
| States | Closed (normal) → Open (failing fast) → Half-Open (probing recovery) |
Circuit breakers protect calls to ERP API surfaces. Different API surfaces exhibit different failure modes and require different breaker configurations. [src1]
| ERP API Surface | Typical Failure Mode | Recommended Breaker Config | Recovery Time |
| --- | --- | --- | --- |
| SAP S/4HANA OData | 503 during planned downtime, timeouts on complex queries | 5 failures / 30s window, 120s break | 5-30 min (planned), 1-4h (incident) |
| Salesforce REST API | 429 rate limit, REQUEST_LIMIT_EXCEEDED, 503 | 3 consecutive 429s, 60s break | 60s (rate limit), 5-15 min (incident) |
| Oracle ERP Cloud REST | 500/503 during patching, FBDI timeouts | 5 failures / 60s window, 180s break | 15-60 min (patching), 1-2h (incident) |
| NetSuite SuiteTalk/REST | SSS_REQUEST_LIMIT_EXCEEDED, concurrency cap | 3 failures / 20s, 30s break | 30-60s (concurrency), 10-30 min (incident) |
| Dynamics 365 OData | 429 with Retry-After header, 503 during updates | Honor Retry-After header, 5 failures / 30s | Per Retry-After value, 5-30 min (updates) |
| Workday REST/SOAP | 503 during tenant maintenance, auth token expiry | 5 failures / 60s, 120s break | 30-120 min (maintenance) |
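For the Dynamics 365 row, honoring Retry-After means deriving the break duration from the response header rather than from a fixed value. The following is a minimal sketch, not a library API: the function name, the 60-second default, and the 600-second cap are assumptions chosen for illustration. It relies on the HTTP rule that the header carries either a number of seconds or an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def break_duration_from_retry_after(
    header_value: Optional[str],
    default: float = 60.0,   # assumed fallback when the header is missing or unparsable
    maximum: float = 600.0,  # assumed cap so a bad header cannot park the breaker for hours
    now: Optional[datetime] = None,
) -> float:
    """Translate a Retry-After header value into a break duration in seconds."""
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        # Form 1: delay in whole seconds, e.g. "120".
        seconds = float(value)
    else:
        # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        try:
            retry_at = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default
        if retry_at.tzinfo is None:
            retry_at = retry_at.replace(tzinfo=timezone.utc)
        current = now or datetime.now(timezone.utc)
        seconds = (retry_at - current).total_seconds()
    # Never return a negative wait, never exceed the cap.
    return min(max(seconds, 0.0), maximum)
```

The `now` parameter exists only so the date branch can be tested deterministically.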
Rate Limits & Quotas
Circuit Breaker Configuration Parameters
| Parameter | Description | Recommended Default | Notes |
| --- | --- | --- | --- |
| Failure threshold | Percentage or count of failures that trips the breaker | 50% failure rate OR 5 consecutive failures | Percentage-based is more robust than count-based |
| Sampling window | Time period over which failures are counted | 10-30 seconds | Too short = noise triggers opens; too long = slow detection |
| Minimum throughput | Minimum requests in window before threshold is evaluated | 8-10 requests | Prevents tripping on 1 failure out of 2 requests |
| Break duration | How long the circuit stays open before half-open probe | 30-120 seconds | Match to ERP typical recovery time |
| Half-open probe count | Number of test requests allowed in half-open state | 1-3 requests | Too many probes can re-overload a recovering service |
| Success threshold | Consecutive successes in half-open to close circuit | 3-5 successes | Ensures recovery is stable |
| Timeout | Per-request timeout that counts as failure | 30-60 seconds for ERP APIs | ERP APIs are slower than typical microservices |
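The interaction between failure threshold, sampling window, and minimum throughput is easier to see in code. The sketch below is one possible way to evaluate a percentage-based threshold over a time window; the class name `FailureRateWindow` and the injectable `clock` are assumptions made for illustration, and the defaults mirror the 50% / 10 seconds / 8 requests starting point.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class FailureRateWindow:
    """Time-based sliding window that reports when the failure rate crosses a threshold."""

    def __init__(
        self,
        window_seconds: float = 10.0,
        failure_rate_threshold: float = 0.5,  # ratio, 0.0-1.0
        minimum_throughput: int = 8,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_seconds = window_seconds
        self.failure_rate_threshold = failure_rate_threshold
        self.minimum_throughput = minimum_throughput
        self._clock = clock
        self._outcomes: Deque[Tuple[float, bool]] = deque()  # (timestamp, succeeded)

    def record(self, succeeded: bool) -> None:
        """Store the outcome of one ERP call."""
        self._outcomes.append((self._clock(), succeeded))
        self._evict()

    def _evict(self) -> None:
        # Drop outcomes that have fallen out of the sampling window.
        cutoff = self._clock() - self.window_seconds
        while self._outcomes and self._outcomes[0][0] < cutoff:
            self._outcomes.popleft()

    def should_trip(self) -> bool:
        self._evict()
        total = len(self._outcomes)
        if total < self.minimum_throughput:
            return False  # too few samples to judge, e.g. 1 failure out of 2 requests
        failures = sum(1 for _, ok in self._outcomes if not ok)
        return failures / total >= self.failure_rate_threshold
```

With these defaults, two failures out of two requests do not trip the breaker, while five failures out of eight do.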
Per-ERP Error Codes That Should Trip the Breaker
| ERP System | Trip On (Open Circuit) | Do NOT Trip On (Retry Instead) | Notes |
| --- | --- | --- | --- |
| Salesforce | 503, REQUEST_LIMIT_EXCEEDED, SERVER_UNAVAILABLE | 400, INVALID_FIELD, DUPLICATE_VALUE | 429: trip after 3 consecutive, not on first |
| SAP S/4HANA | 503, 504, CX_SY_RESOURCE_EXHAUSTION | 400, /IWBEP/CM_MGW_RT (OData validation) | 504 indicates SAP app server overload |
| Oracle ERP Cloud | 503, 500 (repeated), FBDI import timeout | 400, ORA-00001, validation errors | Distinguish transient 500 from persistent logic errors |
| | | | Governance errors are transient; validation permanent |
| Dynamics 365 | 429 (with Retry-After), 503, 502 | 400, 403, 404, -2147204784 | Always honor Retry-After header |
Authentication
Authentication failures interact with circuit breakers in specific ways. Token expiry should NOT trip the circuit. [src1]
| Scenario | Should Trip Breaker? | Correct Handling |
| --- | --- | --- |
| OAuth token expired (401) | No | Refresh token, retry once, then trip if refresh fails |
| API key invalid (403) | No | Alert immediately: config error, not transient |
| Auth server unreachable | Yes | Trip breaker on auth endpoint separately |
| MFA challenge required | No | Alert: cannot be automated; wrong auth flow |
| Rate limit on auth endpoint | Yes | Trip breaker; queue data requests until auth recovers |
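The first row (refresh, retry once, and only count a failure if the refresh itself fails) can be sketched as a thin wrapper that sits in front of the data breaker. Everything named here is hypothetical: `send`, `get_token`, and `refresh_token` stand in for whatever HTTP client and OAuth helper the integration already uses, and a response is reduced to a `(status, body)` tuple.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status, body); a stand-in for a real response object


class AuthRefreshFailed(Exception):
    """Raised when the token could not be refreshed; the auth breaker should count this."""


def call_with_token_refresh(
    send: Callable[[str], Response],
    get_token: Callable[[], str],
    refresh_token: Callable[[], str],
) -> Response:
    """Send once; on 401 refresh the token and retry exactly once."""
    status, body = send(get_token())
    if status != 401:
        return status, body
    try:
        new_token = refresh_token()
    except Exception as exc:
        # The auth endpoint is failing: surface it as its own failure type so it
        # is tracked by the auth breaker, not the data breaker.
        raise AuthRefreshFailed("token refresh failed") from exc
    # A second 401 is returned as-is; it points to a configuration problem to alert on.
    return send(new_token)
```

Because a plain 401 never raises, token expiry on its own does not register as a failure on the data breaker.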
Constraints
Circuit breaker does NOT replace retry — Use retry for transient errors (first 2-3 attempts), then circuit breaker trips to prevent retry storms. Retry inside breaker, not breaker inside retry.
Per-endpoint granularity — One breaker per ERP API surface. A single breaker for all Salesforce APIs means a Bulk API timeout opens the circuit for REST operations.
State isolation — Circuit breaker state is in-memory by default. In Kubernetes with 10 pods, each pod has its own breaker. Externalize to Redis for shared state if needed.
Idempotency required for half-open probes — Half-open probe requests may be duplicates of previously failed requests. Without idempotency keys, you risk duplicate records.
Break duration must match ERP recovery — SAP maintenance: 30-120 min. Salesforce rate limit: 60s. A 5-second break is useless for a 30-minute maintenance window.
Cannot circuit-break fire-and-forget — Async message queues act as natural buffers. Use circuit breaker between queue consumer and ERP API, not on the queue itself.
Integration Pattern Decision Tree
START: Should I use a circuit breaker for this ERP integration?
|
+-- Is the integration synchronous (real-time API call)?
|   +-- YES --> Circuit breaker is strongly recommended
|   |   +-- Is the ERP API call idempotent?
|   |   |   +-- YES --> Standard circuit breaker + retry
|   |   |   +-- NO --> Circuit breaker + idempotency key + DLQ
|   |   +-- Are you calling multiple ERP endpoints?
|   |       +-- YES --> Separate breaker per endpoint
|   |       +-- NO --> Single breaker sufficient
|   +-- NO (async / message-based)
|
+-- Is there a synchronous ERP API call within the async flow?
|   +-- YES --> Circuit breaker on the API call, not the queue consumer
|   +-- NO --> Circuit breaker adds no value; use DLQ + retry instead
|
+-- Which resilience pattern do I need?
    +-- Transient errors (network blip, 1-2 failures) --> Retry with backoff
    +-- Sustained outage (ERP down for minutes) --> Circuit breaker
    +-- Protecting shared resources (thread pools) --> Bulkhead
    +-- Server-side request throttling --> Rate limiter
    +-- Combining all four --> Retry -> Circuit Breaker -> Bulkhead -> Timeout
Verify: Each breaker should have independent state. Tripping sf-bulk should NOT affect sf-rest.
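The independence check above can be exercised with a small registry that hands out one state object per named API surface. This is an illustrative sketch only: `BreakerState` is deliberately minimal (consecutive failures, no timing or half-open logic), and the names `BreakerRegistry`, `sf-bulk`, and `sf-rest` are used purely as examples.

```python
import threading
from dataclasses import dataclass
from typing import Dict


@dataclass
class BreakerState:
    """Minimal per-surface state; a real breaker also tracks timing and half-open probes."""
    failure_threshold: int = 5
    consecutive_failures: int = 0
    is_open: bool = False

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.is_open = True

    def record_success(self) -> None:
        self.consecutive_failures = 0


class BreakerRegistry:
    """Hands out exactly one BreakerState per named API surface."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._breakers: Dict[str, BreakerState] = {}

    def get(self, name: str, failure_threshold: int = 5) -> BreakerState:
        # The threshold only applies the first time a name is requested.
        with self._lock:
            if name not in self._breakers:
                self._breakers[name] = BreakerState(failure_threshold=failure_threshold)
            return self._breakers[name]


registry = BreakerRegistry()
bulk = registry.get("sf-bulk", failure_threshold=3)
rest = registry.get("sf-rest", failure_threshold=3)
for _ in range(3):
    bulk.record_failure()
print(bulk.is_open, rest.is_open)
```

The same keying approach extends to tenant isolation by using a key such as tenant plus surface name.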
2. Configure thresholds based on ERP behavior
Start with conservative defaults, tune based on production telemetry. [src1, src2]
| Parameter | Conservative Start | Tuned After 30 Days |
| --- | --- | --- |
| Failure rate threshold | 50% | Adjust based on baseline error rate |
| Sampling window | 30 seconds | Match to ERP API response time P99 |
| Minimum throughput | 10 | Match to actual request volume |
| Break duration | 60 seconds | Match to ERP typical recovery time |
| Half-open probes | 3 | Increase if recovery is gradual |
Verify: Monitor circuit state transitions for 7 days. If breaker trips >5x/day on a healthy ERP, thresholds are too sensitive.
3. Implement the circuit breaker in your language
See Code Examples section below for complete implementations in Python, Java, C#, and Node.js. [src2, src4, src5, src8]
4. Wire the fallback strategy for open circuit
When the circuit is open, the application must do something useful instead of throwing an exception. [src1]
Circuit open: what to do with the request
+-- Is the operation a READ?
|   +-- Return cached data (if fresh enough)
|   +-- Return degraded response ("ERP data temporarily unavailable")
|   +-- Route to secondary ERP instance (if available)
+-- Is the operation a WRITE?
|   +-- Queue to dead letter queue for later replay
|   +-- Write to local staging table + reconcile later
|   +-- Return 503 + Retry-After header to upstream caller
+-- ALWAYS:
    +-- Log circuit state change
    +-- Fire alert if circuit open >5 minutes
    +-- Increment circuit-open counter in metrics
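One way to wire the READ and WRITE branches is a small fallback object that the caller consults whenever the breaker rejects a request. The sketch below picks a single option from each branch (cached or degraded response for reads, queue for writes). The class name, the 300-second freshness limit, and the in-memory deque standing in for a durable queue are all assumptions for illustration.

```python
import time
from collections import deque
from typing import Any, Callable, Deque, Dict, Tuple


class OpenCircuitFallback:
    """Decides what to return while the breaker is open (illustrative names)."""

    def __init__(
        self,
        cache_ttl_seconds: float = 300.0,  # assumed meaning of "fresh enough"
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.cache_ttl_seconds = cache_ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, Any]] = {}
        # In-memory stand-in for a durable dead letter queue or staging table.
        self.replay_queue: Deque[Dict[str, Any]] = deque()

    def remember(self, key: str, value: Any) -> None:
        """Call after every successful read so there is something to serve later."""
        self._cache[key] = (self._clock(), value)

    def on_read(self, key: str) -> Dict[str, Any]:
        entry = self._cache.get(key)
        if entry is not None and self._clock() - entry[0] <= self.cache_ttl_seconds:
            return {"source": "cache", "data": entry[1]}
        return {"source": "degraded", "data": None,
                "message": "ERP data temporarily unavailable"}

    def on_write(self, endpoint: str, payload: Dict[str, Any],
                 idempotency_key: str) -> Dict[str, Any]:
        # Keep the idempotency key with the payload so replay cannot create duplicates.
        self.replay_queue.append({"endpoint": endpoint, "payload": payload,
                                  "idempotency_key": idempotency_key})
        return {"source": "queued", "idempotency_key": idempotency_key}
```

Logging, alerting, and metrics from the ALWAYS branch are left out here; they belong on the breaker's state-transition hook rather than in the fallback.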
Code Examples
Python: Circuit Breaker for ERP API Calls
# Input: ERP API endpoint URL, authentication headers
# Output: API response or fallback value when circuit is open
# Requires: requests>=2.31.0
import time, threading, requests
from enum import Enum
from dataclasses import dataclass, field


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    recovery_timeout: float = 60.0
    half_open_max_calls: int = 3
    success_threshold: int = 3
    erp_timeout: float = 30.0
    TRIP_STATUS_CODES = {429, 500, 502, 503, 504}
    # ... (see .md for full implementation)


# Usage: one breaker per ERP endpoint
sf_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=60)
sap_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=120)

response = sf_breaker.call(
    "GET", "https://myorg.my.salesforce.com/services/data/v62.0/query",
    params={"q": "SELECT Id, Name FROM Account LIMIT 10"},
    headers={"Authorization": "Bearer <token>"},
    fallback=lambda: {"records": [], "note": "Salesforce unavailable"}
)
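Because the excerpt above elides the state machine, here is a self-contained sketch of how the three states could be driven. It is not the card's full implementation: it wraps an arbitrary callable instead of issuing HTTP requests, treats any raised exception as a failure (so the caller is assumed to raise on trip-worthy status codes), and the names `SimpleCircuitBreaker`, `CircuitOpenError`, and `clock` are illustrative.

```python
import threading
import time
from enum import Enum
from typing import Any, Callable, Optional


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the ERP."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0,
                 half_open_max_calls: int = 3, success_threshold: int = 3,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls  # max concurrent probes
        self.success_threshold = success_threshold
        self._clock = clock
        self._lock = threading.Lock()
        self.state = CircuitState.CLOSED
        self._failures = 0
        self._successes = 0
        self._probes_in_flight = 0
        self._opened_at = 0.0

    def _before_call(self) -> None:
        with self._lock:
            if self.state is CircuitState.OPEN:
                if self._clock() - self._opened_at < self.recovery_timeout:
                    raise CircuitOpenError("circuit is open")
                # Break duration elapsed: start probing.
                self.state = CircuitState.HALF_OPEN
                self._successes = 0
                self._probes_in_flight = 0
            if self.state is CircuitState.HALF_OPEN:
                if self._probes_in_flight >= self.half_open_max_calls:
                    raise CircuitOpenError("half-open probe limit reached")
                self._probes_in_flight += 1

    def _on_success(self) -> None:
        with self._lock:
            if self.state is CircuitState.HALF_OPEN:
                self._probes_in_flight = max(0, self._probes_in_flight - 1)
                self._successes += 1
                if self._successes >= self.success_threshold:
                    self.state = CircuitState.CLOSED
                    self._failures = 0
            elif self.state is CircuitState.CLOSED:
                self._failures = 0  # consecutive-failure counting

    def _on_failure(self) -> None:
        with self._lock:
            if self.state is CircuitState.OPEN:
                return  # late result from a call that started before the trip
            self._failures += 1
            if (self.state is CircuitState.HALF_OPEN
                    or self._failures >= self.failure_threshold):
                self.state = CircuitState.OPEN
                self._opened_at = self._clock()

    def call(self, func: Callable[..., Any], *args: Any,
             fallback: Optional[Callable[[], Any]] = None, **kwargs: Any) -> Any:
        try:
            self._before_call()
        except CircuitOpenError:
            if fallback is not None:
                return fallback()  # fail fast without touching the ERP
            raise
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result
```

A failed probe in the half-open state reopens the circuit immediately, which is the conservative choice for an ERP that is still recovering.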
Java: Resilience4j Circuit Breaker for ERP APIs
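// Input: ERP API endpoint, authentication config
// Output: API response or fallback when circuit is open
// Requires: io.github.resilience4j:resilience4j-circuitbreaker:2.2.0
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.net.ConnectException;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

CircuitBreakerConfig sapConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                          // percentage, 0-100
    .slidingWindowType(SlidingWindowType.TIME_BASED)
    .slidingWindowSize(30)                             // seconds, because the window is TIME_BASED
    .minimumNumberOfCalls(8)
    .waitDurationInOpenState(Duration.ofSeconds(120))
    .permittedNumberOfCallsInHalfOpenState(3)
    .recordExceptions(ConnectException.class, HttpTimeoutException.class)
    .build();

CircuitBreaker sapBreaker = CircuitBreakerRegistry.of(sapConfig)
    .circuitBreaker("sap-odata", sapConfig);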
Polly uses a ratio (0.0-1.0) while Resilience4j uses a percentage (0-100): a FailureRatio of 0.5 in Polly equals failureRateThreshold(50) in Resilience4j. Copying the raw number from one to the other puts the threshold off by a factor of 100. [src2, src8]
Opossum rollingCountTimeout is in milliseconds, Resilience4j slidingWindowSize is in seconds — 10000 in Opossum equals 10 in Resilience4j. Off by 1000x if you port config. [src5, src8]
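Polly 8.x configures the breaker through CircuitBreakerStrategyOptions (FailureRatio, SamplingDuration, MinimumThroughput, BreakDuration), where BreakDuration takes over the role of Polly 7.x durationOfBreak; the entire config model changed in the migration. [src2]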
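When porting settings between libraries, converting the units explicitly is safer than copying numbers. The helpers below are a hypothetical sketch (the function names are not part of any library); they cover the two mismatches described above, ratio versus percentage and milliseconds versus seconds, and reject values that are already in the wrong unit.

```python
def polly_ratio_to_resilience4j_percent(failure_ratio: float) -> float:
    """Convert a Polly FailureRatio (0.0-1.0) to a Resilience4j failureRateThreshold (0-100)."""
    if not 0.0 <= failure_ratio <= 1.0:
        # A value such as 50 here almost certainly means a percentage was passed in.
        raise ValueError(f"expected a ratio between 0.0 and 1.0, got {failure_ratio}")
    return failure_ratio * 100.0


def opossum_ms_to_resilience4j_seconds(rolling_count_timeout_ms: int) -> int:
    """Convert Opossum rollingCountTimeout (ms) to a TIME_BASED slidingWindowSize (s)."""
    seconds, remainder = divmod(rolling_count_timeout_ms, 1000)
    if remainder or seconds < 1:
        # slidingWindowSize is a whole number of seconds.
        raise ValueError(f"{rolling_count_timeout_ms} ms is not a whole number of seconds")
    return seconds


print(polly_ratio_to_resilience4j_percent(0.5))
print(opossum_ms_to_resilience4j_seconds(10000))
```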
Error Handling & Failure Points
Common Error Codes
| Code | Meaning | Should Trip Breaker? | Resolution |
| --- | --- | --- | --- |
| 429 | Rate limit exceeded | Yes (after 3 consecutive) | Backoff, respect Retry-After, align break to rate limit window |
| 500 | Internal Server Error | Yes (if repeated) | Trip after 3-5 in window; single 500 could be transient |
| 502 | Bad Gateway | Yes | Proxy/LB failure; ERP app server likely down |
| 503 | Service Unavailable | Yes (immediately) | ERP explicitly saying stop; trip immediately |
| 504 | Gateway Timeout | Yes | ERP app server overloaded |
| 400 | Bad Request | No | Fix the request payload: code bug, not outage |
| 401 | Unauthorized | No | Refresh auth token; trip only if refresh also fails |
| 403 | Forbidden | No | Permission issue; alert, don't trip |
| 404 | Not Found | No | Wrong endpoint; fix code |
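The table above translates naturally into a small classifier that the breaker consults for every response. This is a sketch under the assumption that three outcomes are enough (trip immediately, count toward the threshold, ignore); the enum and function names are illustrative, and how many counted failures actually trip the circuit is left to the breaker's own threshold.

```python
from enum import Enum


class BreakerAction(Enum):
    TRIP_NOW = "trip_now"        # open the circuit on the first occurrence
    COUNT_FAILURE = "count"      # feed the failure counter or sliding window
    IGNORE = "ignore"            # caller or auth problem; handle outside the breaker


def classify_status(status_code: int) -> BreakerAction:
    """Map an HTTP status to the breaker action suggested by the error-code table."""
    if status_code == 503:
        return BreakerAction.TRIP_NOW
    if status_code in {429, 500, 502, 504}:
        return BreakerAction.COUNT_FAILURE
    # 400, 401, 403, 404 and all success codes do not indicate an ERP outage.
    return BreakerAction.IGNORE


for code in (503, 429, 401, 200):
    print(code, classify_status(code).value)
```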
Failure Points in Production
False opens during ERP maintenance windows: SAP has scheduled downtime (02:00-04:00 UTC). Circuit trips, alerts fire. Fix: Implement maintenance window suppression — skip alerting during scheduled windows. [src1]
Token expiry cascade: OAuth token expires, all requests get 401, breaker trips. Fix: Exclude 401 from circuit breaker; implement separate token refresh circuit. [src1, src3]
Half-open probe creates duplicate record: Probe retries write without idempotency key. Fix: Every ERP write must include idempotency key (externalId in NetSuite, External_ID__c in Salesforce). [src3]
Break duration too short for Oracle patching: 15-60 min patching, 30s break cycles 120 times. Fix: Implement exponential break duration — 30s, 60s, 120s, 240s, max 600s. [src1]
Thread exhaustion before circuit trips: 30s timeout x 5 failures = 150s of blocked threads. Fix: Set aggressive per-request timeout (10-15s) + bulkhead to limit concurrent ERP calls. [src7]
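The exponential break duration mentioned above (30s, 60s, 120s, 240s, capped at 600s) can be sketched as a pure function of how many times the circuit has reopened in a row. The function names are assumptions; `probes_during_outage` is only a back-of-the-envelope helper that counts how many break periods end while the ERP is still down.

```python
def break_duration(consecutive_opens: int, base: float = 30.0, cap: float = 600.0) -> float:
    """Break duration for the Nth consecutive open: doubles each time, up to the cap."""
    if consecutive_opens < 1:
        raise ValueError("consecutive_opens must be >= 1")
    # Clamp the exponent so very long outages cannot overflow the float conversion.
    exponent = min(consecutive_opens - 1, 62)
    return min(base * 2 ** exponent, cap)


def probes_during_outage(outage_seconds: float, base: float = 30.0, cap: float = 600.0) -> int:
    """Count half-open probes that fire while the ERP is still down."""
    elapsed = 0.0
    probes = 0
    while True:
        elapsed += break_duration(probes + 1, base, cap)
        if elapsed > outage_seconds:
            return probes
        probes += 1


print([break_duration(n) for n in range(1, 7)])
print(probes_during_outage(2 * 3600, base=30.0, cap=30.0))  # fixed 30s break
print(probes_during_outage(2 * 3600))                       # exponential, capped at 600s
```

Under these assumptions a two-hour outage produces 240 probes with a fixed 30-second break and 15 with the exponential schedule. The counter should reset to 1 once the circuit closes successfully.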
Anti-Patterns
Wrong: Circuit breaker on writes without DLQ
# BAD: writes are silently dropped when circuit opens
try:
    breaker.call("POST", f"{erp_url}/invoices", json=invoice_data)
except CircuitOpenError:
    logger.warning("Circuit open: invoice not created")
    # Invoice is LOST. No retry. No queue. Gone forever.
Correct: Circuit breaker with dead letter queue for writes
# GOOD: failed writes are queued for later replay
try:
    breaker.call("POST", f"{erp_url}/invoices", json=invoice_data)
except CircuitOpenError:
    dlq.send({
        "endpoint": f"{erp_url}/invoices",
        "payload": invoice_data,
        "idempotency_key": invoice_data["externalId"],
    })
Wrong: Single circuit breaker for all ERP endpoints
// BAD — Bulk API timeout trips the breaker for REST API too
const erpBreaker = new CircuitBreaker(callAnyErpApi, { timeout: 30000 });
await erpBreaker.fire("bulk/import", bulkPayload); // hangs 60s, trips
await erpBreaker.fire("query/accounts", {}); // BLOCKED!
Correct: Separate circuit breaker per API surface
// GOOD — each API surface has its own breaker
const sfRestBreaker = new CircuitBreaker(callSfRest, { timeout: 15000 });
const sfBulkBreaker = new CircuitBreaker(callSfBulk, { timeout: 300000 });
// Bulk timeout does NOT affect REST operations
Wrong: Too-sensitive threshold on high-latency ERP APIs
# BAD: ERP APIs are NOT microservices
resilience4j.circuitbreaker.instances.sap-odata:
  failureRateThreshold: 20       # too sensitive
  slidingWindowSize: 5           # too short
  minimumNumberOfCalls: 2        # too few
  waitDurationInOpenState: 5s    # too short
Correct: Thresholds calibrated for ERP latency profiles
# GOOD: tuned for real ERP behavior
resilience4j.circuitbreaker.instances.sap-odata:
  failureRateThreshold: 50
  slidingWindowType: TIME_BASED  # so slidingWindowSize is in seconds, as in the Java example
  slidingWindowSize: 30
  minimumNumberOfCalls: 8
  waitDurationInOpenState: 120s
  permittedNumberOfCallsInHalfOpenState: 3
Common Pitfalls
Treating 401 as a circuit-tripping failure: Token expiry causes 401, breaker trips, all ERP calls blocked. Fix: Exclude 401 from breaker; handle auth separately with refresh logic. [src1]
Using microservice-scale timeouts for ERP APIs: A 3-second timeout causes false timeouts on ERP APIs that routinely take 5-15s. Fix: Set ERP-specific timeouts: 30s for REST, 60s for bulk, 120s for file imports. [src1, src7]
Not implementing exponential break duration: Fixed 30s break during 2h SAP outage = 240 unnecessary probes. Fix: Exponential backoff on break duration: 30s, 60s, 120s, 240s, capped at 600s. [src1]
Circuit breaker without monitoring: Breaker trips and nobody knows. Fix: Log every state transition, expose as health check, alert for circuits open >5 min. [src1, src7]
Applying circuit breaker to queue consumers: Consumer stops but messages keep arriving, filling the partition. Fix: Apply breaker between consumer and ERP API, not on consumer itself. [src3]
Sharing state across instances without coordination: One instance's network blip trips shared Redis-backed circuit for all. Fix: Use local breakers with shared metrics; trip global circuit via feature flag if >50% report failures. [src1]
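The last pitfall's fix (local breakers, shared metrics, global trip only when more than 50% of instances report failures) comes down to a quorum check. In this sketch the dictionary stands in for whatever shared metrics store is in use, and the function name and instance IDs are made up for illustration.

```python
from typing import Dict


def should_trip_globally(instance_reports: Dict[str, bool], quorum: float = 0.5) -> bool:
    """Return True when more than `quorum` of instances report their local breaker as open.

    instance_reports maps an instance ID to True if that instance's local breaker
    is currently open. In production this would be read from shared metrics.
    """
    if not instance_reports:
        return False
    failing = sum(1 for is_open in instance_reports.values() if is_open)
    return failing / len(instance_reports) > quorum


# One pod with a network blip does not affect the other nine.
reports = {f"pod-{i}": False for i in range(10)}
reports["pod-3"] = True
print(should_trip_globally(reports))
```

Using a strict "greater than" comparison means an even split does not trip the global circuit; whether that is the right tie-break depends on how costly a false open is for the integration.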
Diagnostic Commands
# Check Resilience4j circuit breaker state (Spring Boot Actuator)
curl -s http://localhost:8080/actuator/circuitbreakers | jq '.circuitBreakers'
# Expected: {"sap-odata":{"state":"CLOSED","failureRate":-1.0}}
# Check Resilience4j circuit breaker events
curl -s http://localhost:8080/actuator/circuitbreakerevents | jq '.circuitBreakerEvents[-5:]'
# Check Polly circuit state via custom health endpoint
curl -s http://localhost:5000/health/circuits | jq
# Expected: {"sap-odata":"Closed","sf-rest":"Closed","oracle-rest":"HalfOpen"}
# Test ERP API health directly (bypassing circuit breaker)
# Salesforce
curl -s -o /dev/null -w "%{http_code}" \
-H "Authorization: Bearer $SF_TOKEN" \
"https://myorg.my.salesforce.com/services/data/v62.0/limits"
# SAP S/4HANA
curl -s -o /dev/null -w "%{http_code}" \
-H "Authorization: Bearer $SAP_TOKEN" \
"https://my-sap.s4hana.cloud/sap/opu/odata4/sap/api_business_partner/\$metadata"
# Check if ERP is in maintenance (manual probe)
curl -s -w "\n%{http_code} %{time_total}s" \
-H "Authorization: Bearer $TOKEN" \
"$ERP_API_URL/health" 2>&1
# 503 = maintenance; 200 with >10s = degraded; 200 with <2s = healthy
Version History & Compatibility
| Library | Version | Release | Breaking Changes | Notes |
| --- | --- | --- | --- | --- |
| Polly | 8.x | 2023-07 | Complete API rewrite: Policy replaced by ResiliencePipeline | Cannot mix v7 and v8 |
| Polly | 7.x | 2019-06 | Legacy, maintenance only | Still widely used; plan migration |
| Resilience4j | 2.2.0 | 2024-03 | Minor: TIME_BASED sliding window improvements | Recommended for new Java projects |
| Resilience4j | 1.x | 2020-01 | EOL | Migrate to 2.x |
| Opossum | 8.1.3 | 2025-01 | Minor: improved TypeScript types | Stable; primary Node.js option |
iPaaS Circuit Breaker Support
| Platform | Built-in? | Config Level | Notes |
| --- | --- | --- | --- |
| MuleSoft | Yes (gateway policy) | API Gateway (Envoy) | maxConnections, maxPendingRequests, maxRequests |
| Boomi | No (custom scripting) | Process level | Implement in Data Process shape with Groovy |
| Workato | No (custom connector) | Connector SDK | Build in custom connector actions |
| SAP Integration Suite | Partial (retry + timeout) | iFlow level | No native breaker; use Groovy + JCache |
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
| --- | --- | --- |
| Synchronous ERP API calls that can cascade failures | Async message-based integration (Kafka, SQS) | Dead letter queue + retry policy |
| ERP has known maintenance windows causing extended downtime | Single transient error that resolves on retry | Simple retry with exponential backoff |
| Multiple downstream ERP endpoints with independent failure modes | All calls go through a single gateway handling resilience | Gateway-level circuit breaker (MuleSoft policy) |
| Long-lived service processing continuous requests | One-time batch job that runs once and exits | Retry + error log |
| Saga pattern with multiple ERP steps needing protection | Simple two-system point-to-point integration | Retry + DLQ (circuit breaker is overkill) |
Cross-System Comparison
Library Comparison for Custom Implementations
| Capability | Polly 8.x | Resilience4j 2.x | Opossum 8.x | pybreaker 1.x |
| --- | --- | --- | --- | --- |
| Language | C# / .NET | Java / Kotlin | Node.js / TypeScript | Python |
| Circuit states | 4 (incl. Isolated) | 3 | 3 | 3 |
| Sliding window | Time-based | Time or Count | Time-based (rolling) | Count-based |
| Dynamic break | BreakDurationGenerator | Custom (extend) | Not built-in | Custom (extend) |
| HttpClient integration | Native (IHttpClientFactory) | Spring WebClient | Manual wrap | Manual wrap |
| Monitoring | Built-in telemetry | Actuator + Micrometer | Events + Prometheus | Custom events |
| Bulkhead | Yes (same pipeline) | Yes (separate) | No | No |
| Rate limiter | Yes (same pipeline) | Yes (separate) | No | No |
| Maturity | Very high | Very high | High | Moderate |
Important Caveats
ERP APIs have fundamentally different latency profiles than microservices (5-15s P95 vs 100ms P95) — do not use microservice-default configurations
Circuit breaker state is lost on process restart — if your service restarts while the ERP is still down, the circuit starts Closed and must re-learn the failure state
Different ERP editions have different rate limits (Salesforce Enterprise: 100K/24h vs Developer: 15K/24h) — breaker thresholds should account for edition-specific limits
In multi-tenant iPaaS deployments, circuit breaker state for one tenant's ERP should not affect another tenant's circuit — tenant isolation is critical
Library versions change configuration APIs significantly (Polly 7 vs 8 is a complete rewrite) — pin library versions and test after upgrades
This card covers implementations as of March 2026. ERP API error codes, rate limits, and maintenance schedules are subject to change with each ERP release