new CircuitBreaker(asyncFn, { timeout: 3000, errorThresholdPercentage: 50, resetTimeout: 30000 })

| State | Behavior | Transition Trigger | Action | Monitoring |
|---|---|---|---|---|
| Closed | All requests pass through to downstream service | Failure count/rate exceeds threshold | Trips to Open | Track success/failure counts |
| Open | All requests fail immediately (no downstream call) | Reset timeout expires | Transitions to Half-Open | Alert on state change |
| Half-Open | Limited probe requests pass through | Probe succeeds | Closes circuit (resets counters) | Track probe results |
| Half-Open | Limited probe requests pass through | Probe fails | Re-opens circuit | Increment open duration |
| Parameter | Purpose | Typical Default | Aggressive | Conservative |
|---|---|---|---|---|
| failureThreshold | Failures before tripping | 5 | 3 | 10 |
| failureRateThreshold | Failure % before tripping | 50% | 25% | 75% |
| resetTimeout | Seconds before half-open probe | 30s | 10s | 60s |
| successThreshold | Successful probes to close | 1 | 1 | 3 |
| timeout | Max wait per request | 3000ms | 1000ms | 10000ms |
| slidingWindowSize | Calls evaluated for failure rate | 10 | 5 | 100 |
| slidingWindowType | Count-based vs time-based | count | count | time (60s) |
| halfOpenRequests | Concurrent probes in half-open | 1 | 1 | 3-5 |
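To make the window parameters concrete, here is a minimal sketch (plain Python, no library; the names `window_size`, `rate_threshold`, and `min_calls` are illustrative, not any library's API) of how a count-based sliding window decides whether to trip:

```python
from collections import deque

class SlidingWindow:
    """Count-based window: keeps the outcome of the last N calls."""
    def __init__(self, window_size=10, rate_threshold=0.5, min_calls=5):
        self.outcomes = deque(maxlen=window_size)  # True = success
        self.rate_threshold = rate_threshold
        self.min_calls = min_calls  # avoid tripping on tiny samples

    def record(self, success: bool):
        self.outcomes.append(success)

    def should_trip(self) -> bool:
        # Never trip before the window holds a meaningful sample.
        if len(self.outcomes) < self.min_calls:
            return False
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes) >= self.rate_threshold

w = SlidingWindow(window_size=10, rate_threshold=0.5, min_calls=5)
for ok in [True, False, False, True]:
    w.record(ok)
print(w.should_trip())  # False: only 4 calls, below min_calls
for ok in [False, False, False]:
    w.record(ok)
print(w.should_trip())  # True: 5 of 7 calls failed (71% >= 50%)
```

The `maxlen` deque evicts the oldest outcome automatically, which is exactly the "last N calls" semantics of a count-based window.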
START: Should I add a circuit breaker?
├── Is the call to a remote/network service?
│ ├── NO -> Do NOT use circuit breaker (local calls don't need it)
│ └── YES ↓
├── Can the downstream service experience sustained failures?
│ ├── NO (transient only) -> Use retry with backoff instead
│ └── YES ↓
├── Would repeated calls to a failing service cause harm?
│ ├── NO -> Simple timeout + retry may suffice
│ └── YES ↓
├── Do you need a fallback response?
│ ├── YES -> Circuit breaker + fallback function
│ └── NO -> Circuit breaker + fail-fast error
│
THRESHOLD TUNING:
├── Is the service critical (payments, auth)?
│ ├── YES -> Conservative: failureThreshold=10, resetTimeout=60s
│ └── NO ↓
├── Is the service high-throughput (>100 req/s)?
│ ├── YES -> Use percentage-based: failureRateThreshold=50%, slidingWindow=100
│ └── NO -> Count-based: failureThreshold=5, resetTimeout=30s
│
RECOVERY STRATEGY:
├── Does downstream have gradual warm-up needs?
│ ├── YES -> Gradual recovery: increase halfOpenRequests over time
│ └── NO -> Standard half-open: single probe request
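The gradual-recovery branch above can be sketched as a probe budget that grows after each clean half-open round (a hand-rolled illustration; real libraries expose this differently, e.g. Resilience4j fixes permittedNumberOfCallsInHalfOpenState per config):

```python
class GradualRecovery:
    """Half-open probe budget that ramps 1 -> 2 -> 4 -> ... per clean round."""
    def __init__(self, max_probes=8):
        self.allowed = 1              # probes permitted this round
        self.max_probes = max_probes  # cap so recovery cannot stampede

    def round_succeeded(self):
        # Double the budget after a fully successful probe round.
        self.allowed = min(self.allowed * 2, self.max_probes)

    def round_failed(self):
        # Any probe failure re-opens the circuit; restart from one probe.
        self.allowed = 1

g = GradualRecovery()
g.round_succeeded()
g.round_succeeded()
print(g.allowed)  # 4
```

The cap matters: without it, a long outage followed by recovery would let an exponentially growing probe wave hit a still-warming downstream.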
Determine which downstream service calls are susceptible to sustained failures. Focus on calls where the downstream can become completely unavailable, slow responses cascade to callers, and there is a meaningful fallback. [src1]
Candidate calls:
- External API calls (payment gateways, third-party APIs)
- Database queries to remote clusters
- Inter-service calls in microservices
- Message broker publish operations
Verify: Review your service dependency graph and identify calls with timeout > 1s or historical failure rates > 1%.
Select the library for your language/framework. [src3] [src4] [src5]
| Language | Library | Install |
|---|---|---|
| Node.js | opossum | npm install opossum@^8.1 |
| Python | pybreaker | pip install pybreaker>=1.2 |
| Java | Resilience4j | io.github.resilience4j:resilience4j-circuitbreaker:2.2.0 |
| .NET | Polly | dotnet add package Polly |
| Go | gobreaker | go get github.com/sony/gobreaker |
Verify: npm list opossum / pip show pybreaker / check build.gradle
Start with sensible defaults, then tune based on production metrics. [src2]
General starting point:
- failureThreshold: 5 (or 50% failure rate)
- resetTimeout: 30 seconds
- timeout per call: 3 seconds
- successThreshold: 1 (probes to close)
Adjust based on:
- SLA of downstream service
- Recovery time of downstream (must be < resetTimeout)
- Acceptable error rate for your users
Verify: Run load test with downstream killed -- circuit should open within failureThreshold failed calls.
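The verify step can be scripted. Below is a library-agnostic sketch in which `Breaker` is a stand-in for whichever implementation you deploy; the point is the assertion that the circuit opens within failureThreshold failed calls:

```python
class Breaker:
    """Minimal count-based trip logic, for test illustration only."""
    def __init__(self, failure_threshold=5):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"

def downstream_is_dead():
    # Stand-in for a call to a killed downstream service.
    raise ConnectionError("downstream killed")

b = Breaker(failure_threshold=5)
calls_until_open = 0
while b.state == "closed":
    try:
        downstream_is_dead()
    except ConnectionError:
        b.record_failure()
    calls_until_open += 1
print(calls_until_open)  # 5: opened exactly at the threshold
assert calls_until_open <= 5
```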
Every circuit breaker must have a fallback. The fallback fires when the circuit is open or when the call fails. [src2] [src6]
Fallback strategies (pick one):
1. Cached response - Return last known good value
2. Default value - Return safe static default
3. Degraded service - Call a simpler backup endpoint
4. Queue for retry - Accept request, process later
5. Graceful error - Return meaningful error with ETA
Verify: Force circuit open and confirm fallback returns expected response.
Circuit breaker state changes are critical operational events. [src3]
Must-have metrics:
- circuit.state (gauge: 0=closed, 1=half-open, 2=open)
- circuit.calls.total (counter, by outcome: success/failure/rejected)
- circuit.calls.duration (histogram)
- circuit.state_change (event, with from/to states)
Alert on:
- Circuit opens (immediate notification)
- Circuit stays open > 5 minutes (escalation)
- High rejection rate in half-open state
Verify: Trip the circuit intentionally and verify alerts fire within expected timeframe.
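A minimal state-change recorder matching the metrics above (plain Python; the 0/1/2 gauge encoding follows the list above, and the escalation check implements the 5-minute rule against wall-clock time):

```python
import time

STATE_GAUGE = {"closed": 0, "half-open": 1, "open": 2}

class StateMonitor:
    def __init__(self):
        self.events = []     # (timestamp, from_state, to_state)
        self.opened_at = None

    def on_state_change(self, old, new, now=None):
        now = now if now is not None else time.time()
        self.events.append((now, old, new))
        # Track when the circuit opened; clear on any other transition.
        self.opened_at = now if new == "open" else None

    def gauge(self):
        state = self.events[-1][2] if self.events else "closed"
        return STATE_GAUGE[state]

    def needs_escalation(self, now=None, limit=300):
        # Escalate when the circuit has been open for more than `limit` seconds.
        now = now if now is not None else time.time()
        return self.opened_at is not None and now - self.opened_at > limit

m = StateMonitor()
m.on_state_change("closed", "open", now=1000.0)
print(m.gauge())                       # 2
print(m.needs_escalation(now=1400.0))  # True: open for 400s > 300s
```

In production the `now=` parameter would be dropped; it is injected here so the escalation window can be tested deterministically.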
// Input: HTTP endpoint that may fail
// Output: Response data or fallback value
const CircuitBreaker = require('opossum'); // ^8.1.3
const axios = require('axios'); // ^1.7.0
async function fetchUserProfile(userId) {
const res = await axios.get(
`https://api.users.example.com/v1/users/${userId}`,
{ timeout: 3000 }
);
return res.data;
}
const breaker = new CircuitBreaker(fetchUserProfile, {
timeout: 5000,
errorThresholdPercentage: 50,
resetTimeout: 30000,
volumeThreshold: 5,
rollingCountTimeout: 10000,
});
breaker.fallback((userId) => ({
id: userId, name: 'Unknown', cached: true
}));
breaker.on('open', () => console.warn('[CB] Circuit OPENED'));
breaker.on('close', () => console.info('[CB] Circuit CLOSED'));
breaker.on('halfOpen',() => console.info('[CB] Circuit HALF-OPEN'));
const profile = await breaker.fire('user-123'); // call from an async context (or ESM top-level await)
# Input: Database query that may fail
# Output: Query result or fallback
import pybreaker  # >=1.2.0
import psycopg2
class MonitorListener(pybreaker.CircuitBreakerListener):
def state_change(self, cb, old_state, new_state):
print(f"[CB] {cb.name}: {old_state.name} -> {new_state.name}")
db_breaker = pybreaker.CircuitBreaker(
fail_max=5, reset_timeout=30,
exclude=[ValueError],
listeners=[MonitorListener()],
name="postgres-primary",
)
@db_breaker
def get_user(user_id: str):
    conn = psycopg2.connect("host=db.example.com dbname=app")
    try:
        cur = conn.cursor()
        cur.execute("SELECT * FROM users WHERE id = %s", (user_id,))
        return cur.fetchone()  # tuple of columns, or None if no match
    finally:
        conn.close()
try:
user = get_user("user-123")
except pybreaker.CircuitBreakerError:
user = {"id": "user-123", "cached": True}
// Input: REST API call that may fail
// Output: API response or fallback
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.failureRateThreshold(50)
.slowCallRateThreshold(80)
.slowCallDurationThreshold(Duration.ofSeconds(3))
.waitDurationInOpenState(Duration.ofSeconds(30))
.permittedNumberOfCallsInHalfOpenState(3)
.slidingWindowType(SlidingWindowType.COUNT_BASED)
.slidingWindowSize(10)
.minimumNumberOfCalls(5)
.recordExceptions(IOException.class, TimeoutException.class)
.ignoreExceptions(IllegalArgumentException.class)
.build();
CircuitBreaker breaker = CircuitBreakerRegistry.of(config)
.circuitBreaker("userService");
Supplier<UserProfile> decorated = CircuitBreaker
.decorateSupplier(breaker, () -> client.getProfile(userId));
Try<UserProfile> result = Try.ofSupplier(decorated)
.recover(t -> new UserProfile(userId, "Unknown", true));
// Input: Any function that calls a remote service
// Output: Result or error (ErrCircuitOpen when tripped)
type CircuitBreaker struct {
    mu               sync.Mutex
    state            State // Closed, Open, HalfOpen
    failureCount     int
    successCount     int // consecutive successful probes while half-open
    failureThreshold int
    successThreshold int
    resetTimeout     time.Duration
    lastFailure      time.Time
}
func (cb *CircuitBreaker) Execute(fn func() (interface{}, error)) (interface{}, error) {
cb.mu.Lock()
if cb.state == Open && time.Since(cb.lastFailure) > cb.resetTimeout {
cb.state = HalfOpen
cb.successCount = 0
}
if cb.state == Open {
cb.mu.Unlock()
return nil, ErrCircuitOpen
}
cb.mu.Unlock()
result, err := fn()
cb.mu.Lock()
defer cb.mu.Unlock()
if err != nil {
cb.failureCount++
cb.lastFailure = time.Now()
if cb.failureCount >= cb.failureThreshold { cb.state = Open }
return result, err
}
// Success handling: close circuit if enough probes succeed
if cb.state == HalfOpen {
cb.successCount++
if cb.successCount >= cb.successThreshold {
cb.state = Closed; cb.failureCount = 0
}
} else { cb.failureCount = 0 }
return result, nil
}
// BAD -- one breaker for all services masks which one is failing
const breaker = new CircuitBreaker(makeRequest);
await breaker.fire('https://api-a.example.com/data');
await breaker.fire('https://api-b.example.com/data');
// If API-A fails, API-B also gets blocked!
// GOOD -- isolated failure domains
const breakerA = new CircuitBreaker(
(url) => axios.get(url), { name: 'api-a', resetTimeout: 30000 }
);
const breakerB = new CircuitBreaker(
(url) => axios.get(url), { name: 'api-b', resetTimeout: 30000 }
);
await breakerA.fire('https://api-a.example.com/data');
await breakerB.fire('https://api-b.example.com/data');
# BAD -- slow calls hang forever, breaker never trips on timeouts
@db_breaker
def fetch_data():
return requests.get("https://slow-api.example.com/data")
# No timeout! A 5-minute hang won't trigger the breaker
# GOOD -- timeout ensures slow calls are counted as failures
@db_breaker
def fetch_data():
return requests.get(
"https://slow-api.example.com/data",
timeout=3 # 3 second timeout
)
// BAD -- payment service and logging service get identical config
CircuitBreakerConfig oneConfigForAll = CircuitBreakerConfig.custom()
.failureRateThreshold(50)
.waitDurationInOpenState(Duration.ofSeconds(30))
.build();
// Tripping payments at 50% is WAY too late; tripping logging at 50% is too early
// GOOD -- critical services get tighter thresholds
CircuitBreakerConfig paymentConfig = CircuitBreakerConfig.custom()
.failureRateThreshold(20) // trip early for payments
.waitDurationInOpenState(Duration.ofSeconds(60))
.build();
CircuitBreakerConfig loggingConfig = CircuitBreakerConfig.custom()
.failureRateThreshold(80) // logging can tolerate more failures
.waitDurationInOpenState(Duration.ofSeconds(10))
.build();
// BAD -- user gets a raw error
try {
const data = await breaker.fire(userId);
} catch (err) {
throw err; // CircuitBreakerError reaches the user as a 500
}
// GOOD -- graceful degradation
breaker.fallback((userId) => ({
id: userId, name: 'User', source: 'cache', stale: true,
message: 'Profile temporarily unavailable, showing cached data'
}));
const data = await breaker.fire(userId);
- Set failureThreshold >= 5, or use percentage-based (50%) with a minimum call volume. [src1]
- Set resetTimeout to at least 2x the expected downstream recovery time. [src2]
- Use exclude or ignoreExceptions config to filter non-infrastructure errors. [src3]

# Check circuit breaker state via Spring Boot Actuator
curl -s http://localhost:8080/actuator/circuitbreakers | jq '.circuitBreakers'
# Check Resilience4j metrics in Prometheus format
curl -s http://localhost:8080/actuator/prometheus | grep resilience4j_circuitbreaker
# Check circuit breaker events (last 5)
curl -s http://localhost:8080/actuator/circuitbreakerevents | jq '.circuitBreakerEvents[-5:]'
# Load test to verify circuit trips under failure
hey -n 100 -c 10 http://localhost:3000/api/protected-endpoint
| Library | Current Version | Breaking Changes | Notes |
|---|---|---|---|
| opossum (Node.js) | 8.x | v8: ESM support, dropped Node <18 | v6->v7: callback API removed |
| pybreaker (Python) | 1.2.x | None recent | Supports Python 3.8+ |
| Resilience4j (Java) | 2.2.x | v2.0: replaced deprecated Hystrix patterns | Spring Boot 3.x compatible |
| Polly (.NET) | 8.x | v8: new pipeline-based API | v7->v8: policy syntax changed |
| gobreaker (Go) | 0.7.x | None | Stable API since v0.5 |
| Hystrix (Java) | 1.5.18 | EOL -- maintenance mode since 2018 | Migrate to Resilience4j |
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Remote service calls that can experience sustained outages | Transient single-request failures | Retry with exponential backoff |
| Preventing cascade failures across microservices | Protecting local/in-process function calls | Direct error handling |
| Downstream has known recovery time (restarts, scaling events) | Rate limiting your own outbound requests | Token bucket / leaky bucket rate limiter |
| You need fast failure when a dependency is down | Isolating concurrent request pools | Bulkhead pattern |
| Protecting against slow responses that consume thread/connection pools | Simple request timeout | Timeout pattern (often combined with CB) |
| Multiple callers hit the same failing service | The operation is idempotent and cheap to retry | Simple retry |
Set minimumNumberOfCalls alongside percentage thresholds so that a single early failure does not register as a 100% failure rate and trip the breaker.
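To see why a minimum call volume matters with percentage thresholds, compare the first-call failure rate with and without one (plain arithmetic sketch; the function name is illustrative):

```python
def failure_rate_trips(failures, total, rate_threshold=0.5, min_calls=5):
    """Trip only when the sample is large enough AND the rate is exceeded."""
    if total < min_calls:
        return False
    return failures / total >= rate_threshold

# Without a minimum, one failed call out of one is a 100% failure rate.
print(failure_rate_trips(1, 1, min_calls=1))  # True: trips on a single blip
# With minimumNumberOfCalls=5, the same single failure is ignored.
print(failure_rate_trips(1, 1, min_calls=5))  # False: waits for more data
```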