Circuit Breaker Pattern: Implementation Guide

Type: Software Reference Confidence: 0.92 Sources: 7 Verified: 2026-02-24 Freshness: 2026-02-24

TL;DR

Constraints

Quick Reference

State Machine

| State | Behavior | Transition Trigger | Action | Monitoring |
|---|---|---|---|---|
| Closed | All requests pass through to downstream service | Failure count/rate exceeds threshold | Trips to Open | Track success/failure counts |
| Open | All requests fail immediately (no downstream call) | Reset timeout expires | Transitions to Half-Open | Alert on state change |
| Half-Open | Limited probe requests pass through | Probe succeeds | Closes circuit (resets counters) | Track probe results |
| Half-Open | Limited probe requests pass through | Probe fails | Re-opens circuit | Increment open duration |
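The transitions above can be sketched as a minimal state machine. This is a from-scratch illustration, not any library's implementation; the class name, field names, and defaults simply mirror the tables in this guide:

```python
import time

class Breaker:
    """Minimal sketch of the Closed / Open / Half-Open state machine."""
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, failure_threshold=5, reset_timeout=30.0, success_threshold=1):
        self.state = self.CLOSED
        self.failures = 0
        self.successes = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.success_threshold = success_threshold
        self.opened_at = 0.0

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state == self.OPEN and now - self.opened_at >= self.reset_timeout:
            self.state = self.HALF_OPEN       # reset timeout expired: allow a probe
            self.successes = 0
        return self.state != self.OPEN        # Open rejects immediately

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.state == self.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = self.OPEN            # trip, or re-open after a failed probe
            self.opened_at = now

    def record_success(self):
        if self.state == self.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = self.CLOSED      # enough probes succeeded: close, reset
                self.failures = 0
        else:
            self.failures = 0
```

Passing `now` explicitly makes the time-based Open -> Half-Open transition testable without real sleeps.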

Configuration Parameters

| Parameter | Purpose | Typical Default | Aggressive | Conservative |
|---|---|---|---|---|
| failureThreshold | Failures before tripping | 5 | 3 | 10 |
| failureRateThreshold | Failure % before tripping | 50% | 25% | 75% |
| resetTimeout | Seconds before half-open probe | 30s | 10s | 60s |
| successThreshold | Successful probes to close | 1 | 1 | 3 |
| timeout | Max wait per request | 3000ms | 1000ms | 10000ms |
| slidingWindowSize | Calls evaluated for failure rate | 10 | 5 | 100 |
| slidingWindowType | Count-based vs time-based | count | count | time (60s) |
| halfOpenRequests | Concurrent probes in half-open | 1 | 1 | 3-5 |
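How slidingWindowSize and failureRateThreshold interact can be shown with a small count-based helper. This is illustrative only; the `minimum_calls` guard mirrors the idea behind Resilience4j's minimumNumberOfCalls, and the function name is not any library's API:

```python
from collections import deque

def should_trip(outcomes, window_size=10, failure_rate_threshold=0.5, minimum_calls=5):
    """Count-based sliding window: judge only the last `window_size` outcomes.

    `outcomes` is a sequence of booleans, True meaning the call succeeded.
    Returns True when the failure rate over the window reaches the threshold.
    """
    window = deque(outcomes, maxlen=window_size)   # keeps only the newest entries
    if len(window) < minimum_calls:
        return False                               # too little data to judge fairly
    failures = sum(1 for ok in window if not ok)
    return failures / len(window) >= failure_rate_threshold
```

Note the `minimum_calls` guard: without it, the very first failure would read as a 100% failure rate and trip the circuit immediately.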

Decision Tree

START: Should I add a circuit breaker?
├── Is the call to a remote/network service?
│   ├── NO -> Do NOT use circuit breaker (local calls don't need it)
│   └── YES ↓
├── Can the downstream service experience sustained failures?
│   ├── NO (transient only) -> Use retry with backoff instead
│   └── YES ↓
├── Would repeated calls to a failing service cause harm?
│   ├── NO -> Simple timeout + retry may suffice
│   └── YES ↓
├── Do you need a fallback response?
│   ├── YES -> Circuit breaker + fallback function
│   └── NO -> Circuit breaker + fail-fast error
│
THRESHOLD TUNING:
├── Is the service critical (payments, auth)?
│   ├── YES -> Conservative: failureThreshold=10, resetTimeout=60s
│   └── NO ↓
├── Is the service high-throughput (>100 req/s)?
│   ├── YES -> Use percentage-based: failureRateThreshold=50%, slidingWindow=100
│   └── NO -> Count-based: failureThreshold=5, resetTimeout=30s
│
RECOVERY STRATEGY:
├── Does downstream have gradual warm-up needs?
│   ├── YES -> Gradual recovery: increase halfOpenRequests over time
│   └── NO -> Standard half-open: single probe request

Step-by-Step Guide

1. Identify the remote call to protect

Determine which downstream service calls are susceptible to sustained failures. Prioritize calls where the downstream can become completely unavailable, where slow responses cascade to callers, or where a meaningful fallback exists. [src1]

Candidate calls:
- External API calls (payment gateways, third-party APIs)
- Database queries to remote clusters
- Inter-service calls in microservices
- Message broker publish operations

Verify: Review your service dependency graph and identify calls with timeout > 1s or historical failure rates > 1%.
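The verify rule can be applied mechanically to a dependency inventory. The data, field names, and service names below are hypothetical; only the thresholds (timeout > 1s, failure rate > 1%) come from this guide:

```python
# Hypothetical dependency inventory; shape and names are illustrative.
deps = [
    {"name": "payment-gateway", "timeout_s": 3.0,  "failure_rate": 0.004},
    {"name": "user-profile",    "timeout_s": 0.2,  "failure_rate": 0.02},
    {"name": "local-cache",     "timeout_s": 0.01, "failure_rate": 0.0001},
]

def circuit_breaker_candidates(deps, timeout_s=1.0, failure_rate=0.01):
    """Flag calls matching the verify rule: timeout > 1s OR failure rate > 1%."""
    return [d["name"] for d in deps
            if d["timeout_s"] > timeout_s or d["failure_rate"] > failure_rate]
```

Either condition alone qualifies a call: a long timeout means slow failures can pile up, and a nontrivial failure rate means sustained outages are plausible.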

2. Choose a circuit breaker library

Select the library for your language/framework. [src3] [src4] [src5]

| Language | Library | Install |
|---|---|---|
| Node.js | opossum | npm install opossum@^8.1 |
| Python | pybreaker | pip install "pybreaker>=1.2" |
| Java | Resilience4j | io.github.resilience4j:resilience4j-circuitbreaker:2.2.0 |
| .NET | Polly | dotnet add package Polly |
| Go | gobreaker | go get github.com/sony/gobreaker |

Verify: npm list opossum / pip show pybreaker / check build.gradle

3. Configure thresholds based on service characteristics

Start with sensible defaults, then tune based on production metrics. [src2]

General starting point:
- failureThreshold: 5 (or 50% failure rate)
- resetTimeout: 30 seconds
- timeout per call: 3 seconds
- successThreshold: 1 (probes to close)

Adjust based on:
- SLA of downstream service
- Recovery time of downstream (must be < resetTimeout)
- Acceptable error rate for your users

Verify: Run load test with downstream killed -- circuit should open within failureThreshold failed calls.
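The tuning branches above (and in the decision tree) can be captured as a small helper. The values come from this guide; the dictionary keys are illustrative, not any specific library's option names:

```python
def breaker_config(critical=False, high_throughput=False):
    """Starting-point config per this guide's tuning guidance.

    critical:        payments/auth -> conservative thresholds, longer reset.
    high_throughput: >100 req/s    -> percentage-based with a larger window.
    """
    if critical:
        return {"failureThreshold": 10, "resetTimeout": 60, "timeout": 3,
                "successThreshold": 3}
    if high_throughput:
        return {"failureRateThreshold": 0.5, "slidingWindowSize": 100,
                "resetTimeout": 30, "timeout": 3}
    return {"failureThreshold": 5, "resetTimeout": 30, "timeout": 3,
            "successThreshold": 1}
```

Treat these as starting points to be overridden per service once production metrics exist.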

4. Implement the fallback strategy

Pair the circuit breaker with an explicit fallback wherever one exists; the fallback fires when the circuit is open or when the protected call fails. If the decision tree above ended at fail-fast, treat the breaker's error as the deliberate fallback and make sure callers handle it. [src2] [src6]

Fallback strategies (pick one):
1. Cached response    - Return last known good value
2. Default value      - Return safe static default
3. Degraded service   - Call a simpler backup endpoint
4. Queue for retry    - Accept request, process later
5. Graceful error     - Return meaningful error with ETA

Verify: Force circuit open and confirm fallback returns expected response.
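Strategy 1 (cached response) can be sketched in a few lines. This is a minimal illustration, assuming a `fetch` callable you supply and a staleness window you choose; it is not tied to any breaker library:

```python
import time

class CachedFallback:
    """Serve the last known good value when the protected call fails."""

    def __init__(self, fetch, max_stale_s=300.0):
        self.fetch = fetch                  # the protected call (assumption: raises on failure)
        self.max_stale_s = max_stale_s      # how stale a cached value may be
        self._value = None
        self._stored_at = None

    def get(self, *args):
        try:
            self._value = self.fetch(*args)          # happy path: refresh the cache
            self._stored_at = time.monotonic()
            return self._value
        except Exception:
            if (self._stored_at is not None
                    and time.monotonic() - self._stored_at <= self.max_stale_s):
                return self._value                   # degrade to stale-but-recent data
            raise                                    # nothing usable cached: surface the error
```

Re-raising when the cache is empty or too stale matters: silently returning `None` would hide outages from callers and monitoring alike.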

5. Add monitoring and alerting

Circuit breaker state changes are critical operational events. [src3]

Must-have metrics:
- circuit.state (gauge: 0=closed, 1=half-open, 2=open)
- circuit.calls.total (counter, by outcome: success/failure/rejected)
- circuit.calls.duration (histogram)
- circuit.state_change (event, with from/to states)

Alert on:
- Circuit opens (immediate notification)
- Circuit stays open > 5 minutes (escalation)
- High rejection rate in half-open state

Verify: Trip the circuit intentionally and verify alerts fire within expected timeframe.
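A minimal recorder for the state gauge and state_change event listed above. The hook name mirrors pybreaker's listener convention, but the signature is simplified and the sink (`events`, `gauge`) is an illustrative stand-in for your metrics client:

```python
# Gauge encoding from the metrics list above: 0=closed, 1=half-open, 2=open.
CIRCUIT_STATE_GAUGE = {"closed": 0, "half-open": 1, "open": 2}

class StateChangeRecorder:
    """Record circuit.state_change events and keep circuit.state up to date."""

    def __init__(self):
        self.events = []                              # stand-in for an event pipeline
        self.gauge = CIRCUIT_STATE_GAUGE["closed"]    # stand-in for a metrics gauge

    def state_change(self, name, old_state, new_state):
        self.events.append({"circuit": name, "from": old_state, "to": new_state})
        self.gauge = CIRCUIT_STATE_GAUGE[new_state]
```

Emitting the from/to pair (not just the new state) lets alerting distinguish a fresh trip (closed -> open) from a failed recovery probe (half-open -> open).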

Code Examples

Node.js (opossum): HTTP Service Call Protection

// Input:  HTTP endpoint that may fail
// Output: Response data or fallback value
const CircuitBreaker = require('opossum');  // ^8.1.3
const axios = require('axios');             // ^1.7.0

async function fetchUserProfile(userId) {
  const res = await axios.get(
    `https://api.users.example.com/v1/users/${userId}`,
    { timeout: 3000 }
  );
  return res.data;
}

const breaker = new CircuitBreaker(fetchUserProfile, {
  timeout: 5000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000,
  volumeThreshold: 5,
  rollingCountTimeout: 10000,
});

breaker.fallback((userId) => ({
  id: userId, name: 'Unknown', cached: true
}));

breaker.on('open',    () => console.warn('[CB] Circuit OPENED'));
breaker.on('close',   () => console.info('[CB] Circuit CLOSED'));
breaker.on('halfOpen',() => console.info('[CB] Circuit HALF-OPEN'));

const profile = await breaker.fire('user-123');  // top-level await needs ESM; otherwise call from an async function

Python (pybreaker): Database Call Protection

# Input:  Database query that may fail
# Output: Query result or fallback
import psycopg2   # required by get_user below
import pybreaker  # >=1.2.0

class MonitorListener(pybreaker.CircuitBreakerListener):
    def state_change(self, cb, old_state, new_state):
        print(f"[CB] {cb.name}: {old_state.name} -> {new_state.name}")

db_breaker = pybreaker.CircuitBreaker(
    fail_max=5, reset_timeout=30,
    exclude=[ValueError],
    listeners=[MonitorListener()],
    name="postgres-primary",
)

@db_breaker
def get_user(user_id: str) -> tuple:
    with psycopg2.connect("host=db.example.com dbname=app") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM users WHERE id = %s", (user_id,))
            return cur.fetchone()

try:
    user = get_user("user-123")
except pybreaker.CircuitBreakerError:
    user = {"id": "user-123", "cached": True}

Java (Resilience4j): REST Client Protection

// Input:  REST API call that may fail
// Output: API response or fallback
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .slowCallRateThreshold(80)
    .slowCallDurationThreshold(Duration.ofSeconds(3))
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .permittedNumberOfCallsInHalfOpenState(3)
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(10)
    .minimumNumberOfCalls(5)
    .recordExceptions(IOException.class, TimeoutException.class)
    .ignoreExceptions(IllegalArgumentException.class)
    .build();

CircuitBreaker breaker = CircuitBreakerRegistry.of(config)
    .circuitBreaker("userService");

Supplier<UserProfile> decorated = CircuitBreaker
    .decorateSupplier(breaker, () -> client.getProfile(userId));

Try<UserProfile> result = Try.ofSupplier(decorated)
    .recover(t -> new UserProfile(userId, "Unknown", true));

Go: Manual Circuit Breaker Implementation

// Input:  Any function that calls a remote service
// Output: Result or error (ErrCircuitOpen when tripped)
type CircuitBreaker struct {
    mu               sync.Mutex
    state            State  // Closed, Open, HalfOpen
    failureCount     int
    successCount     int
    failureThreshold int
    successThreshold int
    resetTimeout     time.Duration
    lastFailure      time.Time
}

func (cb *CircuitBreaker) Execute(fn func() (interface{}, error)) (interface{}, error) {
    cb.mu.Lock()
    if cb.state == Open && time.Since(cb.lastFailure) > cb.resetTimeout {
        cb.state = HalfOpen
        cb.successCount = 0
    }
    if cb.state == Open {
        cb.mu.Unlock()
        return nil, ErrCircuitOpen
    }
    cb.mu.Unlock()

    result, err := fn()
    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err != nil {
        cb.failureCount++
        cb.lastFailure = time.Now()
        if cb.failureCount >= cb.failureThreshold { cb.state = Open }
        return result, err
    }
    // Success handling: close circuit if enough probes succeed
    if cb.state == HalfOpen {
        cb.successCount++
        if cb.successCount >= cb.successThreshold {
            cb.state = Closed; cb.failureCount = 0
        }
    } else { cb.failureCount = 0 }
    return result, nil
}

Anti-Patterns

Wrong: Sharing one circuit breaker across multiple services

// BAD -- one breaker for all services masks which one is failing
const breaker = new CircuitBreaker(makeRequest);
await breaker.fire('https://api-a.example.com/data');
await breaker.fire('https://api-b.example.com/data');
// If API-A fails, API-B also gets blocked!

Correct: One breaker per downstream service

// GOOD -- isolated failure domains
const breakerA = new CircuitBreaker(
  (url) => axios.get(url), { name: 'api-a', resetTimeout: 30000 }
);
const breakerB = new CircuitBreaker(
  (url) => axios.get(url), { name: 'api-b', resetTimeout: 30000 }
);
await breakerA.fire('https://api-a.example.com/data');
await breakerB.fire('https://api-b.example.com/data');

Wrong: Circuit breaker without timeout on the underlying call

# BAD -- slow calls hang forever, breaker never trips on timeouts
@db_breaker
def fetch_data():
    return requests.get("https://slow-api.example.com/data")
    # No timeout! A 5-minute hang won't trigger the breaker

Correct: Always set a timeout on the protected call

# GOOD -- timeout ensures slow calls are counted as failures
@db_breaker
def fetch_data():
    return requests.get(
        "https://slow-api.example.com/data",
        timeout=3  # 3 second timeout
    )

Wrong: Same threshold for all services

// BAD -- payment service and logging service get identical config
CircuitBreakerConfig oneConfigForAll = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .build();
// Tripping payments only at 50% failures is far too late; tripping logging at 50% is too eager

Correct: Tune thresholds per service criticality

// GOOD -- critical services get tighter thresholds
CircuitBreakerConfig paymentConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(20)   // trip early for payments
    .waitDurationInOpenState(Duration.ofSeconds(60))
    .build();

CircuitBreakerConfig loggingConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(80)   // logging can tolerate more failures
    .waitDurationInOpenState(Duration.ofSeconds(10))
    .build();

Wrong: No fallback when circuit opens

// BAD -- user gets a raw error
try {
  const data = await breaker.fire(userId);
} catch (err) {
  throw err;  // CircuitBreakerError reaches the user as a 500
}

Correct: Always provide a meaningful fallback

// GOOD -- graceful degradation
breaker.fallback((userId) => ({
  id: userId, name: 'User', source: 'cache', stale: true,
  message: 'Profile temporarily unavailable, showing cached data'
}));
const data = await breaker.fire(userId);

Common Pitfalls

Diagnostic Commands

# Check circuit breaker state via Spring Boot Actuator
curl -s http://localhost:8080/actuator/circuitbreakers | jq '.circuitBreakers'

# Check Resilience4j metrics in Prometheus format
curl -s http://localhost:8080/actuator/prometheus | grep resilience4j_circuitbreaker

# Check circuit breaker events (last 5)
curl -s http://localhost:8080/actuator/circuitbreakerevents | jq '.circuitBreakerEvents[-5:]'

# Load test to verify circuit trips under failure
hey -n 100 -c 10 http://localhost:3000/api/protected-endpoint

Version History & Compatibility

| Library | Current Version | Breaking Changes | Notes |
|---|---|---|---|
| opossum (Node.js) | 8.x | v8: ESM support, dropped Node <18 | v6->v7: callback API removed |
| pybreaker (Python) | 1.2.x | None recent | Supports Python 3.8+ |
| Resilience4j (Java) | 2.2.x | v2.0: replaced deprecated Hystrix patterns | Spring Boot 3.x compatible |
| Polly (.NET) | 8.x | v8: new pipeline-based API | v7->v8: policy syntax changed |
| gobreaker (Go) | 0.7.x | None | Stable API since v0.5 |
| Hystrix (Java) | 1.5.18 | EOL -- maintenance mode since 2018 | Migrate to Resilience4j |

When to Use / When Not to Use

| Use When | Don't Use When | Use Instead |
|---|---|---|
| Remote service calls that can experience sustained outages | Transient single-request failures | Retry with exponential backoff |
| Preventing cascade failures across microservices | Protecting local/in-process function calls | Direct error handling |
| Downstream has known recovery time (restarts, scaling events) | Rate limiting your own outbound requests | Token bucket / leaky bucket rate limiter |
| You need fast failure when a dependency is down | Isolating concurrent request pools | Bulkhead pattern |
| Protecting against slow responses that consume thread/connection pools | Simple request timeout | Timeout pattern (often combined with CB) |
| Multiple callers hit the same failing service | The operation is idempotent and cheap to retry | Simple retry |
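For the "retry with exponential backoff" alternative named above, a minimal sketch for contrast with the circuit breaker (delays, attempt count, and jitter policy are illustrative):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=4, base_delay_s=0.1, sleep=time.sleep):
    """Retry a transiently failing call with exponential backoff plus jitter.

    Suits isolated transient failures; a circuit breaker suits sustained
    outages, where retrying every request would hammer a struggling service.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                    # out of attempts: surface the error
            delay = base_delay_s * (2 ** attempt)        # 0.1s, 0.2s, 0.4s, ...
            sleep(delay + random.uniform(0, delay))      # jitter avoids thundering herds
```

The injectable `sleep` parameter exists so the backoff path can be unit-tested without real delays.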

Important Caveats

Related Units