new CircuitBreaker(asyncFn, { timeout: 3000, errorThresholdPercentage: 50, resetTimeout: 30000 })

| State | Behavior | Transition Trigger | Action | Monitoring |
|---|---|---|---|---|
| Closed | All requests pass through to downstream service | Failure count/rate exceeds threshold | Trips to Open | Track success/failure counts |
| Open | All requests fail immediately (no downstream call) | Reset timeout expires | Transitions to Half-Open | Alert on state change |
| Half-Open | Limited probe requests pass through | Probe succeeds | Closes circuit (resets counters) | Track probe results |
| Half-Open | Limited probe requests pass through | Probe fails | Re-opens circuit | Increment open duration |
| Parameter | Purpose | Typical Default | Aggressive | Conservative |
|---|---|---|---|---|
| failureThreshold | Failures before tripping | 5 | 3 | 10 |
| failureRateThreshold | Failure % before tripping | 50% | 25% | 75% |
| resetTimeout | Seconds before half-open probe | 30s | 10s | 60s |
| successThreshold | Successful probes to close | 1 | 1 | 3 |
| timeout | Max wait per request | 3000ms | 1000ms | 10000ms |
| slidingWindowSize | Calls evaluated for failure rate | 10 | 5 | 100 |
| slidingWindowType | Count-based vs time-based | count | count | time (60s) |
| halfOpenRequests | Concurrent probes in half-open | 1 | 1 | 3-5 |
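To make the window parameters concrete, here is a minimal sketch (plain Python, no library; the names `window_size`, `rate_threshold`, and `min_calls` are illustrative, not any library's API) of how a count-based sliding window decides whether to trip:

```python
from collections import deque

class SlidingWindow:
    """Count-based window: keeps the outcome of the last N calls."""
    def __init__(self, window_size=10, rate_threshold=0.5, min_calls=5):
        self.outcomes = deque(maxlen=window_size)  # True = success
        self.rate_threshold = rate_threshold
        self.min_calls = min_calls  # avoid tripping on tiny samples

    def record(self, success: bool):
        self.outcomes.append(success)

    def should_trip(self) -> bool:
        # Never trip before the window holds a meaningful sample.
        if len(self.outcomes) < self.min_calls:
            return False
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes) >= self.rate_threshold

w = SlidingWindow(window_size=10, rate_threshold=0.5, min_calls=5)
for ok in [True, False, False, True]:
    w.record(ok)
print(w.should_trip())  # False: only 4 calls, below min_calls
for ok in [False, False, False]:
    w.record(ok)
print(w.should_trip())  # True: 5 of 7 calls failed (71% >= 50%)
```

The `maxlen` deque evicts the oldest outcome automatically, which is exactly the "last N calls" semantics of a count-based window.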
START: Should I add a circuit breaker?
├── Is the call to a remote/network service?
│ ├── NO -> Do NOT use circuit breaker (local calls don't need it)
│ └── YES ↓
├── Can the downstream service experience sustained failures?
│ ├── NO (transient only) -> Use retry with backoff instead
│ └── YES ↓
├── Would repeated calls to a failing service cause harm?
│ ├── NO -> Simple timeout + retry may suffice
│ └── YES ↓
├── Do you need a fallback response?
│ ├── YES -> Circuit breaker + fallback function
│ └── NO -> Circuit breaker + fail-fast error
│
THRESHOLD TUNING:
├── Is the service critical (payments, auth)?
│ ├── YES -> Conservative: failureThreshold=10, resetTimeout=60s
│ └── NO ↓
├── Is the service high-throughput (>100 req/s)?
│ ├── YES -> Use percentage-based: failureRateThreshold=50%, slidingWindow=100
│ └── NO -> Count-based: failureThreshold=5, resetTimeout=30s
│
RECOVERY STRATEGY:
├── Does downstream have gradual warm-up needs?
│ ├── YES -> Gradual recovery: increase halfOpenRequests over time
│ └── NO -> Standard half-open: single probe request
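The gradual-recovery branch above can be sketched as a probe budget that grows after each clean half-open round (a hand-rolled illustration; real libraries expose this differently, e.g. Resilience4j fixes permittedNumberOfCallsInHalfOpenState per config):

```python
class GradualRecovery:
    """Half-open probe budget that ramps 1 -> 2 -> 4 -> ... per clean round."""
    def __init__(self, max_probes=8):
        self.allowed = 1              # probes permitted this round
        self.max_probes = max_probes  # cap so recovery cannot stampede

    def round_succeeded(self):
        # Double the budget after a fully successful probe round.
        self.allowed = min(self.allowed * 2, self.max_probes)

    def round_failed(self):
        # Any probe failure re-opens the circuit; restart from one probe.
        self.allowed = 1

g = GradualRecovery()
g.round_succeeded()
g.round_succeeded()
print(g.allowed)  # 4
```

The cap matters: without it, a long outage followed by recovery would let an exponentially growing probe wave hit a still-warming downstream.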
Determine which downstream service calls are susceptible to sustained failures. Focus on calls where the downstream can become completely unavailable, slow responses cascade to callers, and there is a meaningful fallback. [src1]
Candidate calls:
- External API calls (payment gateways, third-party APIs)
- Database queries to remote clusters
- Inter-service calls in microservices
- Message broker publish operations
Verify: Review your service dependency graph and identify calls with timeout > 1s or historical failure rates > 1%.
Select the library for your language/framework. [src3] [src4] [src5]
| Language | Library | Install |
|---|---|---|
| Node.js | opossum | npm install opossum@^8.1 |
| Python | pybreaker | pip install pybreaker>=1.2 |
| Java | Resilience4j | io.github.resilience4j:resilience4j-circuitbreaker:2.2.0 |
| .NET | Polly | dotnet add package Polly |
| Go | gobreaker | go get github.com/sony/gobreaker |
Verify: npm list opossum / pip show pybreaker / check build.gradle
Start with sensible defaults, then tune based on production metrics. [src2]
General starting point:
- failureThreshold: 5 (or 50% failure rate)
- resetTimeout: 30 seconds
- timeout per call: 3 seconds
- successThreshold: 1 (probes to close)
Adjust based on:
- SLA of downstream service
- Recovery time of downstream (must be < resetTimeout)
- Acceptable error rate for your users
Verify: Run load test with downstream killed -- circuit should open within failureThreshold failed calls.
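The verify step can be scripted. Below is a library-agnostic sketch in which `Breaker` is a stand-in for whichever implementation you deploy; the point is the assertion that the circuit opens within failureThreshold failed calls:

```python
class Breaker:
    """Minimal count-based trip logic, for test illustration only."""
    def __init__(self, failure_threshold=5):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"

def downstream_is_dead():
    # Stand-in for a call to a killed downstream service.
    raise ConnectionError("downstream killed")

b = Breaker(failure_threshold=5)
calls_until_open = 0
while b.state == "closed":
    try:
        downstream_is_dead()
    except ConnectionError:
        b.record_failure()
    calls_until_open += 1
print(calls_until_open)  # 5: opened exactly at the threshold
assert calls_until_open <= 5
```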
Every circuit breaker must have a fallback. The fallback fires when the circuit is open or when the call fails. [src2] [src6]
Fallback strategies (pick one):
1. Cached response - Return last known good value
2. Default value - Return safe static default
3. Degraded service - Call a simpler backup endpoint
4. Queue for retry - Accept request, process later
5. Graceful error - Return meaningful error with ETA
Verify: Force circuit open and confirm fallback returns expected response.
Circuit breaker state changes are critical operational events. [src3]
Must-have metrics:
- circuit.state (gauge: 0=closed, 1=half-open, 2=open)
- circuit.calls.total (counter, by outcome: success/failure/rejected)
- circuit.calls.duration (histogram)
- circuit.state_change (event, with from/to states)
Alert on:
- Circuit opens (immediate notification)
- Circuit stays open > 5 minutes (escalation)
- High rejection rate in half-open state
Verify: Trip the circuit intentionally and verify alerts fire within expected timeframe.
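A minimal state-change recorder matching the metrics above (plain Python; the 0/1/2 gauge encoding follows the list above, and the escalation check implements the 5-minute rule against wall-clock time):

```python
import time

STATE_GAUGE = {"closed": 0, "half-open": 1, "open": 2}

class StateMonitor:
    def __init__(self):
        self.events = []     # (timestamp, from_state, to_state)
        self.opened_at = None

    def on_state_change(self, old, new, now=None):
        now = now if now is not None else time.time()
        self.events.append((now, old, new))
        # Track when the circuit opened; clear on any other transition.
        self.opened_at = now if new == "open" else None

    def gauge(self):
        state = self.events[-1][2] if self.events else "closed"
        return STATE_GAUGE[state]

    def needs_escalation(self, now=None, limit=300):
        # Escalate when the circuit has been open for more than `limit` seconds.
        now = now if now is not None else time.time()
        return self.opened_at is not None and now - self.opened_at > limit

m = StateMonitor()
m.on_state_change("closed", "open", now=1000.0)
print(m.gauge())                       # 2
print(m.needs_escalation(now=1400.0))  # True: open for 400s > 300s
```

In production the `now=` parameter would be dropped; it is injected here so the escalation window can be tested deterministically.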
// Input: HTTP endpoint that may fail
// Output: Response data or fallback value
const CircuitBreaker = require('opossum'); // ^8.1.3
const axios = require('axios'); // ^1.7.0
async function fetchUserProfile(userId) {
const res = await axios.get(
`https://api.users.example.com/v1/users/${userId}`,
{ timeout: 3000 }
);
return res.data;
}
const breaker = new CircuitBreaker(fetchUserProfile, {
timeout: 5000,
errorThresholdPercentage: 50,
resetTimeout: 30000,
volumeThreshold: 5,
rollingCountTimeout: 10000,
});
breaker.fallback((userId) => ({
id: userId, name: 'Unknown', cached: true
}));
breaker.on('open', () => console.warn('[CB] Circuit OPENED'));
breaker.on('close', () => console.info('[CB] Circuit CLOSED'));
breaker.on('halfOpen',() => console.info('[CB] Circuit HALF-OPEN'));
const profile = await breaker.fire('user-123'); // call from an async context (or ESM top-level await)
# Input: Database query that may fail
# Output: Query result or fallback
import pybreaker  # >=1.2.0
import psycopg2
class MonitorListener(pybreaker.CircuitBreakerListener):
def state_change(self, cb, old_state, new_state):
print(f"[CB] {cb.name}: {old_state.name} -> {new_state.name}")
db_breaker = pybreaker.CircuitBreaker(
fail_max=5, reset_timeout=30,
exclude=[ValueError],
listeners=[MonitorListener()],
name="postgres-primary",
)
@db_breaker
def get_user(user_id: str):
    conn = psycopg2.connect("host=db.example.com dbname=app")
    try:
        cur = conn.cursor()
        cur.execute("SELECT * FROM users WHERE id = %s", (user_id,))
        return cur.fetchone()  # tuple of columns, or None if no match
    finally:
        conn.close()
try:
user = get_user("user-123")
except pybreaker.CircuitBreakerError:
user = {"id": "user-123", "cached": True}
// Input: REST API call that may fail
// Output: API response or fallback
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.failureRateThreshold(50)
.slowCallRateThreshold(80)
.slowCallDurationThreshold(Duration.ofSeconds(3))
.waitDurationInOpenState(Duration.ofSeconds(30))
.permittedNumberOfCallsInHalfOpenState(3)
.slidingWindowType(SlidingWindowType.COUNT_BASED)
.slidingWindowSize(10)
.minimumNumberOfCalls(5)
.recordExceptions(IOException.class, TimeoutException.class)
.ignoreExceptions(IllegalArgumentException.class)
.build();
CircuitBreaker breaker = CircuitBreakerRegistry.of(config)
.circuitBreaker("userService");
Supplier<UserProfile> decorated = CircuitBreaker
.decorateSupplier(breaker, () -> client.getProfile(userId));
Try<UserProfile> result = Try.ofSupplier(decorated)
.recover(t -> new UserProfile(userId, "Unknown", true));
// Input: Any function that calls a remote service
// Output: Result or error (ErrCircuitOpen when tripped)
type CircuitBreaker struct {
    mu               sync.Mutex
    state            State // Closed, Open, HalfOpen
    failureCount     int
    successCount     int // consecutive successful probes while half-open
    failureThreshold int
    successThreshold int
    resetTimeout     time.Duration
    lastFailure      time.Time
}
func (cb *CircuitBreaker) Execute(fn func() (interface{}, error)) (interface{}, error) {
cb.mu.Lock()
if cb.state == Open && time.Since(cb.lastFailure) > cb.resetTimeout {
cb.state = HalfOpen
cb.successCount = 0
}
if cb.state == Open {
cb.mu.Unlock()
return nil, ErrCircuitOpen
}
cb.mu.Unlock()
result, err := fn()
cb.mu.Lock()
defer cb.mu.Unlock()
if err != nil {
cb.failureCount++
cb.lastFailure = time.Now()
if cb.failureCount >= cb.failureThreshold { cb.state = Open }
return result, err
}
// Success handling: close circuit if enough probes succeed
if cb.state == HalfOpen {
cb.successCount++
if cb.successCount >= cb.successThreshold {
cb.state = Closed; cb.failureCount = 0
}
} else { cb.failureCount = 0 }
return result, nil
}
// BAD -- one breaker for all services masks which one is failing
const breaker = new CircuitBreaker(makeRequest);
await breaker.fire('https://api-a.example.com/data');
await breaker.fire('https://api-b.example.com/data');
// If API-A fails, API-B also gets blocked!
// GOOD -- isolated failure domains
const breakerA = new CircuitBreaker(
(url) => axios.get(url), { name: 'api-a', resetTimeout: 30000 }
);
const breakerB = new CircuitBreaker(
(url) => axios.get(url), { name: 'api-b', resetTimeout: 30000 }
);
await breakerA.fire('https://api-a.example.com/data');
await breakerB.fire('https://api-b.example.com/data');
# BAD -- slow calls hang forever, breaker never trips on timeouts
@db_breaker
def fetch_data():
return requests.get("https://slow-api.example.com/data")
# No timeout! A 5-minute hang won't trigger the breaker
# GOOD -- timeout ensures slow calls are counted as failures
@db_breaker
def fetch_data():
return requests.get(
"https://slow-api.example.com/data",
timeout=3 # 3 second timeout
)
// BAD -- payment service and logging service get identical config
CircuitBreakerConfig oneConfigForAll = CircuitBreakerConfig.custom()
.failureRateThreshold(50)
.waitDurationInOpenState(Duration.ofSeconds(30))
.build();
// Tripping payments at 50% is WAY too late; tripping logging at 50% is too early
// GOOD -- critical services get tighter thresholds
CircuitBreakerConfig paymentConfig = CircuitBreakerConfig.custom()
.failureRateThreshold(20) // trip early for payments
.waitDurationInOpenState(Duration.ofSeconds(60))
.build();
CircuitBreakerConfig loggingConfig = CircuitBreakerConfig.custom()
.failureRateThreshold(80) // logging can tolerate more failures
.waitDurationInOpenState(Duration.ofSeconds(10))
.build();
// BAD -- user gets a raw error
try {
const data = await breaker.fire(userId);
} catch (err) {
throw err; // CircuitBreakerError reaches the user as a 500
}
// GOOD -- graceful degradation
breaker.fallback((userId) => ({
id: userId, name: 'User', source: 'cache', stale: true,
message: 'Profile temporarily unavailable, showing cached data'
}));
const data = await breaker.fire(userId);
- Set failureThreshold >= 5, or use percentage-based (50%) with a minimum call volume. [src1]
- Set resetTimeout to at least 2x the expected downstream recovery time. [src2]
- Use exclude or ignoreExceptions config to filter non-infrastructure errors. [src3]

# Check circuit breaker state via Spring Boot Actuator
curl -s http://localhost:8080/actuator/circuitbreakers | jq '.circuitBreakers'
# Check Resilience4j metrics in Prometheus format
curl -s http://localhost:8080/actuator/prometheus | grep resilience4j_circuitbreaker
# Check circuit breaker events (last 5)
curl -s http://localhost:8080/actuator/circuitbreakerevents | jq '.circuitBreakerEvents[-5:]'
# Load test to verify circuit trips under failure
hey -n 100 -c 10 http://localhost:3000/api/protected-endpoint
| Library | Current Version | Breaking Changes | Notes |
|---|---|---|---|
| opossum (Node.js) | 8.x | v8: ESM support, dropped Node <18 | v6->v7: callback API removed |
| pybreaker (Python) | 1.2.x | None recent | Supports Python 3.8+ |
| Resilience4j (Java) | 2.2.x | v2.0: replaced deprecated Hystrix patterns | Spring Boot 3.x compatible |
| Polly (.NET) | 8.x | v8: new pipeline-based API | v7->v8: policy syntax changed |
| gobreaker (Go) | 0.7.x | None | Stable API since v0.5 |
| Hystrix (Java) | 1.5.18 | EOL -- maintenance mode since 2018 | Migrate to Resilience4j |
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Remote service calls that can experience sustained outages | Transient single-request failures | Retry with exponential backoff |
| Preventing cascade failures across microservices | Protecting local/in-process function calls | Direct error handling |
| Downstream has known recovery time (restarts, scaling events) | Rate limiting your own outbound requests | Token bucket / leaky bucket rate limiter |
| You need fast failure when a dependency is down | Isolating concurrent request pools | Bulkhead pattern |
| Protecting against slow responses that consume thread/connection pools | Simple request timeout | Timeout pattern (often combined with CB) |
| Multiple callers hit the same failing service | The operation is idempotent and cheap to retry | Simple retry |
Set minimumNumberOfCalls alongside percentage thresholds so that a single early failure does not register as a 100% failure rate and trip the breaker.
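To see why a minimum call volume matters with percentage thresholds, compare the first-call failure rate with and without one (plain arithmetic sketch; the function name is illustrative):

```python
def failure_rate_trips(failures, total, rate_threshold=0.5, min_calls=5):
    """Trip only when the sample is large enough AND the rate is exceeded."""
    if total < min_calls:
        return False
    return failures / total >= rate_threshold

# Without a minimum, one failed call out of one is a 100% failure rate.
print(failure_rate_trips(1, 1, min_calls=1))  # True: trips on a single blip
# With minimumNumberOfCalls=5, the same single failure is ignored.
print(failure_rate_trips(1, 1, min_calls=5))  # False: waits for more data
```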