Circuit Breaker Pattern: Implementation Guide
How do I implement the circuit breaker pattern?
TL;DR
- Bottom line: Prevent cascade failures in distributed systems by wrapping remote calls in a 3-state machine (closed, open, half-open) that fails fast when a downstream service is unhealthy.
- Key tool/command: new CircuitBreaker(asyncFn, { timeout: 3000, errorThresholdPercentage: 50, resetTimeout: 30000 })
- Watch out for: Using the same circuit breaker instance for multiple unrelated services -- each service needs its own breaker with tuned thresholds.
- Works with: Any language/platform. Libraries: opossum (Node.js), pybreaker (Python), Resilience4j (Java), Polly (.NET), gobreaker (Go).
Constraints
- MUST create one circuit breaker instance per remote service/endpoint -- sharing a breaker across unrelated services masks failures
- NEVER set failure threshold to 1 -- a single failure is not a pattern; minimum recommended is 3-5 failures
- Open-state timeout (resetTimeout) MUST be long enough for the downstream service to recover -- too short causes repeated half-open probes that worsen load
- ALWAYS implement a fallback strategy (cached response, default value, or graceful degradation) -- circuit breaker without fallback just returns errors faster
- Circuit breaker MUST be combined with timeouts on the underlying call -- without timeouts, slow calls hang indefinitely and never trip the breaker
- NEVER apply circuit breakers to local/in-process calls -- the overhead outweighs any benefit; breakers are for network boundaries only
Quick Reference
State Machine
| State | Behavior | Transition Trigger | Action | Monitoring |
|---|---|---|---|---|
| Closed | All requests pass through to downstream service | Failure count/rate exceeds threshold | Trips to Open | Track success/failure counts |
| Open | All requests fail immediately (no downstream call) | Reset timeout expires | Transitions to Half-Open | Alert on state change |
| Half-Open | Limited probe requests pass through | Probe succeeds | Closes circuit (resets counters) | Track probe results |
| Half-Open | Limited probe requests pass through | Probe fails | Re-opens circuit | Increment open duration |
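The transitions in the table can be sketched in a few lines of Python. This is a minimal, single-threaded illustration and not the implementation of any library listed in this guide. The names SimpleBreaker, CircuitOpenError, and the injectable clock parameter are assumptions made for the example.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleBreaker:
    """Hypothetical minimal breaker; not thread-safe."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open")  # fail fast, no downstream call
            self.state = State.HALF_OPEN  # reset timeout expired: allow a probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe re-opens immediately; while Closed, wait for the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()

    def _on_success(self):
        # A successful probe (or any success while Closed) resets the counters.
        self.state = State.CLOSED
        self.failures = 0
```

Passing a fake clock, for example clock=lambda: now[0], lets a test move past the reset timeout without sleeping.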
Configuration Parameters
| Parameter | Purpose | Typical Default | Aggressive | Conservative |
|---|---|---|---|---|
| failureThreshold | Failures before tripping | 5 | 3 | 10 |
| failureRateThreshold | Failure % before tripping | 50% | 25% | 75% |
| resetTimeout | Seconds before half-open probe | 30s | 10s | 60s |
| successThreshold | Successful probes to close | 1 | 1 | 3 |
| timeout | Max wait per request | 3000ms | 1000ms | 10000ms |
| slidingWindowSize | Calls evaluated for failure rate | 10 | 5 | 100 |
| slidingWindowType | Count-based vs time-based | count | count | time (60s) |
| halfOpenRequests | Concurrent probes in half-open | 1 | 1 | 3-5 |
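To make failureRateThreshold and slidingWindowSize concrete, here is a rough sketch of a count-based sliding window. The class name CountWindow and the minimum_calls parameter are assumptions for illustration; minimum_calls plays the role that minimumNumberOfCalls plays in the Resilience4j example later in this guide.

```python
from collections import deque


class CountWindow:
    """Count-based sliding window over the most recent call outcomes."""

    def __init__(self, size=10, failure_rate_threshold=50.0, minimum_calls=5):
        self.outcomes = deque(maxlen=size)  # True = failure, False = success
        self.failure_rate_threshold = failure_rate_threshold
        self.minimum_calls = minimum_calls

    def record(self, failed):
        # The deque drops the oldest outcome once `size` is reached.
        self.outcomes.append(bool(failed))

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return 100.0 * sum(self.outcomes) / len(self.outcomes)

    def should_trip(self):
        # Too few samples: a percentage would be statistically meaningless.
        if len(self.outcomes) < self.minimum_calls:
            return False
        return self.failure_rate() >= self.failure_rate_threshold
```

With the defaults shown, one failure out of two calls yields a 50% rate but does not trip, because fewer than five calls have been observed.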
Decision Tree
START: Should I add a circuit breaker?
├── Is the call to a remote/network service?
│ ├── NO -> Do NOT use circuit breaker (local calls don't need it)
│ └── YES ↓
├── Can the downstream service experience sustained failures?
│ ├── NO (transient only) -> Use retry with backoff instead
│ └── YES ↓
├── Would repeated calls to a failing service cause harm?
│ ├── NO -> Simple timeout + retry may suffice
│ └── YES ↓
├── Do you need a fallback response?
│ ├── YES -> Circuit breaker + fallback function
│ └── NO -> Circuit breaker + fail-fast error
│
THRESHOLD TUNING:
├── Is the service critical (payments, auth)?
│ ├── YES -> Conservative: failureThreshold=10, resetTimeout=60s
│ └── NO ↓
├── Is the service high-throughput (>100 req/s)?
│ ├── YES -> Use percentage-based: failureRateThreshold=50%, slidingWindow=100
│ └── NO -> Count-based: failureThreshold=5, resetTimeout=30s
│
RECOVERY STRATEGY:
├── Does downstream have gradual warm-up needs?
│ ├── YES -> Gradual recovery: increase halfOpenRequests over time
│ └── NO -> Standard half-open: single probe request
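The THRESHOLD TUNING branch of the tree can also be written as a small function, which is convenient when breaker configs are generated per service. This is only a sketch: the function name suggest_config and the dictionary keys are hypothetical, and the values are the starting points from the tree, not measured recommendations.

```python
def suggest_config(critical, requests_per_second):
    """Return starting-point breaker settings following the tuning branch."""
    if critical:
        # Critical service (payments, auth): conservative preset.
        return {"failureThreshold": 10, "resetTimeout_s": 60}
    if requests_per_second > 100:
        # High throughput: a percentage over a larger window is more stable.
        return {"failureRateThreshold_pct": 50, "slidingWindowSize": 100}
    # Everything else: count-based defaults.
    return {"failureThreshold": 5, "resetTimeout_s": 30}
```

Treat the output as an initial value to be tuned against production metrics.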
Step-by-Step Guide
1. Identify the remote call to protect
Determine which downstream service calls are susceptible to sustained failures. Focus on calls where the downstream can become completely unavailable, slow responses cascade to callers, and there is a meaningful fallback. [src1]
Candidate calls:
- External API calls (payment gateways, third-party APIs)
- Database queries to remote clusters
- Inter-service calls in microservices
- Message broker publish operations
Verify: Review your service dependency graph and identify calls with timeout > 1s or historical failure rates > 1%.
2. Choose a circuit breaker library
Select the library for your language/framework. [src3] [src4] [src5]
| Language | Library | Install |
|---|---|---|
| Node.js | opossum | npm install opossum@^8.1 |
| Python | pybreaker | pip install pybreaker>=1.2 |
| Java | Resilience4j | io.github.resilience4j:resilience4j-circuitbreaker:2.2.0 |
| .NET | Polly | dotnet add package Polly |
| Go | gobreaker | go get github.com/sony/gobreaker |
Verify: npm list opossum / pip show pybreaker / check build.gradle
3. Configure thresholds based on service characteristics
Start with sensible defaults, then tune based on production metrics. [src2]
General starting point:
- failureThreshold: 5 (or 50% failure rate)
- resetTimeout: 30 seconds
- timeout per call: 3 seconds
- successThreshold: 1 (probes to close)
Adjust based on:
- SLA of downstream service
- Recovery time of downstream (must be < resetTimeout)
- Acceptable error rate for your users
Verify: Run load test with downstream killed -- circuit should open within failureThreshold failed calls.
4. Implement the fallback strategy
Every circuit breaker must have a fallback. The fallback fires when the circuit is open or when the call fails. [src2] [src6]
Fallback strategies (pick one):
1. Cached response - Return last known good value
2. Default value - Return safe static default
3. Degraded service - Call a simpler backup endpoint
4. Queue for retry - Accept request, process later
5. Graceful error - Return meaningful error with ETA
Verify: Force circuit open and confirm fallback returns expected response.
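As an illustration of strategy 1 (cached response), the sketch below wraps a protected call so that a failure returns the last known good value. The helper name with_cached_fallback, the dict-based cache, and the shape of the returned record are assumptions for the example. A real service would likely use a bounded or expiring cache and catch narrower exception types.

```python
def with_cached_fallback(call, cache, default=None):
    """Wrap `call` so that failures return the last known good value per key."""

    def wrapper(key):
        try:
            value = call(key)
        except Exception:
            # Circuit open or downstream failure: serve stale data if present,
            # otherwise fall back to the static default.
            return {"value": cache.get(key, default), "stale": True}
        cache[key] = value  # remember the last good response
        return {"value": value, "stale": False}

    return wrapper
```

Here `call` would typically be the breaker-wrapped function, so an open circuit and a failed request take the same fallback path.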
5. Add monitoring and alerting
Circuit breaker state changes are critical operational events. [src3]
Must-have metrics:
- circuit.state (gauge: 0=closed, 1=half-open, 2=open)
- circuit.calls.total (counter, by outcome: success/failure/rejected)
- circuit.calls.duration (histogram)
- circuit.state_change (event, with from/to states)
Alert on:
- Circuit opens (immediate notification)
- Circuit stays open > 5 minutes (escalation)
- High rejection rate in half-open state
Verify: Trip the circuit intentionally and verify alerts fire within expected timeframe.
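The metrics and alert rules above could be recorded along the following lines. BreakerMetrics is a hypothetical in-memory stand-in for a real metrics client, and the 300-second escalation default simply mirrors the "open > 5 minutes" rule. In this sketch the escalation timer runs from the moment the circuit first leaves Closed until it closes again, which is one possible reading of "stays open".

```python
from collections import Counter

# Gauge encoding taken from the metric list above.
STATE_GAUGE = {"closed": 0, "half-open": 1, "open": 2}


class BreakerMetrics:
    """In-memory stand-in for a metrics client (hypothetical helper)."""

    def __init__(self, name, escalate_after_s=300.0):
        self.name = name
        self.escalate_after_s = escalate_after_s
        self.state_gauge = STATE_GAUGE["closed"]
        self.calls_total = Counter()  # keyed by outcome
        self.state_changes = []       # one event per transition
        self.unhealthy_since = None   # time the circuit first left Closed

    def record_call(self, outcome):
        if outcome not in ("success", "failure", "rejected"):
            raise ValueError(f"unknown outcome: {outcome}")
        self.calls_total[outcome] += 1

    def record_state_change(self, old, new, now):
        self.state_gauge = STATE_GAUGE[new]
        self.state_changes.append(
            {"breaker": self.name, "from": old, "to": new, "at": now}
        )
        if new == "closed":
            self.unhealthy_since = None
        elif self.unhealthy_since is None:
            self.unhealthy_since = now

    def needs_escalation(self, now):
        return (
            self.unhealthy_since is not None
            and now - self.unhealthy_since > self.escalate_after_s
        )
```

A listener hook such as the pybreaker state_change callback shown later in this guide is a natural place to call record_state_change.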
Code Examples
Node.js (opossum): HTTP Service Call Protection
// Input: HTTP endpoint that may fail
// Output: Response data or fallback value
const CircuitBreaker = require('opossum'); // ^8.1.3
const axios = require('axios'); // ^1.7.0
async function fetchUserProfile(userId) {
const res = await axios.get(
`https://api.users.example.com/v1/users/${userId}`,
{ timeout: 3000 }
);
return res.data;
}
const breaker = new CircuitBreaker(fetchUserProfile, {
timeout: 5000,
errorThresholdPercentage: 50,
resetTimeout: 30000,
volumeThreshold: 5,
rollingCountTimeout: 10000,
});
breaker.fallback((userId) => ({
id: userId, name: 'Unknown', cached: true
}));
breaker.on('open', () => console.warn('[CB] Circuit OPENED'));
breaker.on('close', () => console.info('[CB] Circuit CLOSED'));
breaker.on('halfOpen',() => console.info('[CB] Circuit HALF-OPEN'));
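// fire() returns a promise; top-level await is not available in CommonJS modules
breaker.fire('user-123').then((profile) => console.log(profile));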
Python (pybreaker): Database Call Protection
# Input: Database query that may fail
# Output: Query result or fallback
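import psycopg2  # PostgreSQL driver used by get_user
import pybreaker  # >=1.2.0
class MonitorListener(pybreaker.CircuitBreakerListener):
    def state_change(self, cb, old_state, new_state):
        print(f"[CB] {cb.name}: {old_state.name} -> {new_state.name}")
db_breaker = pybreaker.CircuitBreaker(
    fail_max=5, reset_timeout=30,
    exclude=[ValueError],  # business errors do not count as failures
    listeners=[MonitorListener()],
    name="postgres-primary",
)
@db_breaker
def get_user(user_id: str):
    # connect_timeout keeps an unreachable host from hanging the call
    conn = psycopg2.connect("host=db.example.com dbname=app connect_timeout=3")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM users WHERE id = %s", (user_id,))
            return cur.fetchone()  # row tuple, or None if no match
    finally:
        conn.close()  # always release the connection
try:
    user = get_user("user-123")
except pybreaker.CircuitBreakerError:
    # Raised while the circuit is open; serve the fallback instead
    user = {"id": "user-123", "cached": True}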
Java (Resilience4j): REST Client Protection
// Input: REST API call that may fail
// Output: API response or fallback
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.failureRateThreshold(50)
.slowCallRateThreshold(80)
.slowCallDurationThreshold(Duration.ofSeconds(3))
.waitDurationInOpenState(Duration.ofSeconds(30))
.permittedNumberOfCallsInHalfOpenState(3)
.slidingWindowType(SlidingWindowType.COUNT_BASED)
.slidingWindowSize(10)
.minimumNumberOfCalls(5)
.recordExceptions(IOException.class, TimeoutException.class)
.ignoreExceptions(IllegalArgumentException.class)
.build();
CircuitBreaker breaker = CircuitBreakerRegistry.of(config)
.circuitBreaker("userService");
Supplier<UserProfile> decorated = CircuitBreaker
.decorateSupplier(breaker, () -> client.getProfile(userId));
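// Try comes from Vavr (io.vavr.control.Try); Resilience4j 2.x does not bundle it, so add Vavr as a dependency
Try<UserProfile> result = Try.ofSupplier(decorated)
    .recover(t -> new UserProfile(userId, "Unknown", true));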
Go: Manual Circuit Breaker Implementation
// Input: Any function that calls a remote service
// Output: Result or error (ErrCircuitOpen when tripped)
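import (
    "errors"
    "sync"
    "time"
)
// State is the breaker's position in the closed/open/half-open state machine.
type State int
const (
    Closed State = iota
    Open
    HalfOpen
)
var ErrCircuitOpen = errors.New("circuit breaker is open")
type CircuitBreaker struct {
    mu               sync.Mutex
    state            State // Closed, Open, HalfOpen
    failureCount     int
    successCount     int // successful probes seen while half-open
    failureThreshold int
    successThreshold int
    resetTimeout     time.Duration
    lastFailure      time.Time
}
func (cb *CircuitBreaker) Execute(fn func() (interface{}, error)) (interface{}, error) {
    cb.mu.Lock()
    // Once resetTimeout has elapsed, let probe requests through
    if cb.state == Open && time.Since(cb.lastFailure) > cb.resetTimeout {
        cb.state = HalfOpen
        cb.successCount = 0
    }
    if cb.state == Open {
        cb.mu.Unlock()
        return nil, ErrCircuitOpen
    }
    cb.mu.Unlock()
    result, err := fn() // run the call without holding the lock
    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err != nil {
        cb.failureCount++
        cb.lastFailure = time.Now()
        // A failed probe re-opens immediately; otherwise wait for the threshold
        if cb.state == HalfOpen || cb.failureCount >= cb.failureThreshold {
            cb.state = Open
        }
        return result, err
    }
    // Success handling: close circuit if enough probes succeed
    if cb.state == HalfOpen {
        cb.successCount++
        if cb.successCount >= cb.successThreshold {
            cb.state = Closed
            cb.failureCount = 0
        }
    } else {
        cb.failureCount = 0
    }
    return result, nil
}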
Anti-Patterns
Wrong: Sharing one circuit breaker across multiple services
// BAD -- one breaker for all services masks which one is failing
const breaker = new CircuitBreaker(makeRequest);
await breaker.fire('https://api-a.example.com/data');
await breaker.fire('https://api-b.example.com/data');
// If API-A fails, API-B also gets blocked!
Correct: One breaker per downstream service
// GOOD -- isolated failure domains
const breakerA = new CircuitBreaker(
(url) => axios.get(url), { name: 'api-a', resetTimeout: 30000 }
);
const breakerB = new CircuitBreaker(
(url) => axios.get(url), { name: 'api-b', resetTimeout: 30000 }
);
await breakerA.fire('https://api-a.example.com/data');
await breakerB.fire('https://api-b.example.com/data');
Wrong: Circuit breaker without timeout on the underlying call
# BAD -- slow calls hang forever, breaker never trips on timeouts
@db_breaker
def fetch_data():
    return requests.get("https://slow-api.example.com/data")
# No timeout! A 5-minute hang won't trigger the breaker
Correct: Always set a timeout on the protected call
# GOOD -- timeout ensures slow calls are counted as failures
@db_breaker
def fetch_data():
    return requests.get(
        "https://slow-api.example.com/data",
        timeout=3  # 3 second timeout
    )
Wrong: Same threshold for all services
// BAD -- payment service and logging service get identical config
CircuitBreakerConfig oneConfigForAll = CircuitBreakerConfig.custom()
.failureRateThreshold(50)
.waitDurationInOpenState(Duration.ofSeconds(30))
.build();
// Payment failures at 50% is WAY too late; logging at 50% is too early
Correct: Tune thresholds per service criticality
// GOOD -- critical services get tighter thresholds
CircuitBreakerConfig paymentConfig = CircuitBreakerConfig.custom()
.failureRateThreshold(20) // trip early for payments
.waitDurationInOpenState(Duration.ofSeconds(60))
.build();
CircuitBreakerConfig loggingConfig = CircuitBreakerConfig.custom()
.failureRateThreshold(80) // logging can tolerate more failures
.waitDurationInOpenState(Duration.ofSeconds(10))
.build();
Wrong: No fallback when circuit opens
// BAD -- user gets a raw error
try {
const data = await breaker.fire(userId);
} catch (err) {
throw err; // opossum's "Breaker is open" error (code EOPENBREAKER) reaches the user as a 500
}
Correct: Always provide a meaningful fallback
// GOOD -- graceful degradation
breaker.fallback((userId) => ({
id: userId, name: 'User', source: 'cache', stale: true,
message: 'Profile temporarily unavailable, showing cached data'
}));
const data = await breaker.fire(userId);
Common Pitfalls
- Threshold set too low (1-2 failures): Single transient errors trip the circuit needlessly, causing unnecessary service disruption. Fix: Set failureThreshold to at least 3 (5 is the typical default) or use percentage-based (50%) with a minimum call volume. [src1]
- resetTimeout shorter than downstream recovery time: If the downstream needs 60s to restart but your resetTimeout is 10s, repeated half-open probes add load to a recovering service. Fix: Set resetTimeout to at least 2x the expected downstream recovery time. [src2]
- Not excluding expected exceptions: Business logic exceptions (validation errors, not-found) should not count as circuit breaker failures. Fix: Use exclude or ignoreExceptions config to filter non-infrastructure errors. [src3]
- Testing only the happy path: Circuit breaker logic is only exercised during failures -- if you never test failure scenarios, you discover bugs in production. Fix: Write integration tests that simulate downstream failure and verify circuit state transitions. [src6]
- Forgetting to monitor state changes: Without observability, you cannot tell if a circuit is open, how long it stayed open, or how often it trips. Fix: Emit metrics on every state change and set alerts for circuit open events. [src3]
- Circuit breaker on synchronous in-process calls: Adding circuit breakers around local function calls adds latency and complexity with no benefit. Fix: Reserve circuit breakers for network boundaries (HTTP, gRPC, database, message broker). [src1]
- Not handling half-open correctly: Allowing full traffic in half-open state defeats the purpose. Fix: Limit half-open to 1-3 probe requests before deciding. [src2]
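For the last pitfall, a small gate that caps concurrent probes is one way to keep half-open from letting full traffic through. The sketch below is illustrative only. ProbeGate and its method names are hypothetical; libraries such as Resilience4j expose this as configuration (permittedNumberOfCallsInHalfOpenState in the Java example above) rather than as a class you write yourself.

```python
import threading


class ProbeGate:
    """Allows at most `max_probes` concurrent calls while half-open."""

    def __init__(self, max_probes=1):
        self.max_probes = max_probes
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Non-blocking: callers that lose the race should fail fast or fall back.
        with self._lock:
            if self._in_flight >= self.max_probes:
                return False
            self._in_flight += 1
            return True

    def release(self):
        # Call when a probe finishes, whether it succeeded or failed.
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1
```

A breaker would call try_acquire() before forwarding a request in half-open state and reject the request when it returns False.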
Diagnostic Commands
# Check circuit breaker state via Spring Boot Actuator
curl -s http://localhost:8080/actuator/circuitbreakers | jq '.circuitBreakers'
# Check Resilience4j metrics in Prometheus format
curl -s http://localhost:8080/actuator/prometheus | grep resilience4j_circuitbreaker
# Check circuit breaker events (last 5)
curl -s http://localhost:8080/actuator/circuitbreakerevents | jq '.circuitBreakerEvents[-5:]'
# Load test to verify circuit trips under failure
hey -n 100 -c 10 http://localhost:3000/api/protected-endpoint
Version History & Compatibility
| Library | Current Version | Breaking Changes | Notes |
|---|---|---|---|
| opossum (Node.js) | 8.x | v8: ESM support, dropped Node <18 | v6->v7: callback API removed |
| pybreaker (Python) | 1.2.x | None recent | Supports Python 3.8+ |
| Resilience4j (Java) | 2.2.x | v2.0: requires Java 17; Vavr dependency removed | Spring Boot 3.x compatible |
| Polly (.NET) | 8.x | v8: new pipeline-based API | v7->v8: policy syntax changed |
| gobreaker (Go) | 0.7.x | None | Stable API since v0.5 |
| Hystrix (Java) | 1.5.18 | EOL -- maintenance mode since 2018 | Migrate to Resilience4j |
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Remote service calls that can experience sustained outages | Transient single-request failures | Retry with exponential backoff |
| Preventing cascade failures across microservices | Protecting local/in-process function calls | Direct error handling |
| Downstream has known recovery time (restarts, scaling events) | Rate limiting your own outbound requests | Token bucket / leaky bucket rate limiter |
| You need fast failure when a dependency is down | Isolating concurrent request pools | Bulkhead pattern |
| Protecting against slow responses that consume thread/connection pools | Simple request timeout | Timeout pattern (often combined with CB) |
| Multiple callers hit the same failing service | The operation is idempotent and cheap to retry | Simple retry |
Important Caveats
- Circuit breaker and retry are complementary, not alternatives -- use retry for transient failures (inside the circuit breaker), and the circuit breaker for sustained failures. The retry should respect circuit breaker state and stop retrying when the circuit is open.
- In distributed systems with multiple instances, each instance maintains its own circuit breaker state by default. For coordinated tripping, you need a shared state store (Redis, etc.), which adds complexity.
- Percentage-based thresholds require a minimum volume of calls to be meaningful. A single failed call out of 2 total is 50% but should not trip the circuit. Always configure minimumNumberOfCalls alongside percentage thresholds.
- Half-open state is the most dangerous phase -- if your probe request is expensive or has side effects, the half-open probe itself can cause problems. Use lightweight health check endpoints for probes when possible.
- Circuit breakers add latency to every call (state check, metrics recording). For ultra-low-latency paths (<1ms), measure the overhead. Modern libraries add <0.1ms but this can matter at extreme scale.
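The first caveat, retry that respects breaker state, might look like the sketch below. It takes one possible reading: every attempt goes through the breaker, and the loop stops as soon as the breaker rejects a call. The names retry_through_breaker and CircuitOpenError, and the breaker_call parameter (any callable that runs fn through a breaker), are assumptions for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised by the breaker when it rejects a call (hypothetical name)."""


def retry_through_breaker(breaker_call, fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry transient failures with exponential backoff, one breaker call per attempt."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return breaker_call(fn)
        except CircuitOpenError:
            raise  # circuit is open: retrying would only add load
        except Exception as exc:
            last_exc = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x, ...
    raise last_exc
```

Because each attempt is counted by the breaker, a burst of retries can trip the circuit sooner than single calls would, so the retry count and the failure threshold should be chosen together.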