Exponential backoff with jitter — delay = min(cap, base * 2^attempt) * random(0, 1) — gives optimal load distribution on failing services.

| Strategy | Formula | Thundering Herd Risk | Fairness | Complexity | Best For |
|---|---|---|---|---|---|
| No backoff (fixed) | delay = constant | Very High | Equal | Trivial | Never use for retries |
| Linear backoff | delay = base * attempt | High | Equal | Low | Simple rate limiting |
| Exponential (no jitter) | delay = min(cap, base * 2^attempt) | High | Equal | Low | Prototype only |
| Full jitter | delay = random(0, min(cap, base * 2^attempt)) | Very Low | High | Low | Default recommendation |
| Equal jitter | delay = exp/2 + random(0, exp/2) | Low | Medium | Low | Predictable minimum wait |
| Decorrelated jitter | delay = min(cap, random(base, prev * 3)) | Low | Medium | Medium | Stateful clients |
| Exponential + token bucket | Full jitter + token bucket rate limit | Very Low | High | Medium | AWS SDK default |
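The full-jitter strategy is implemented later in this section, but the equal-jitter and decorrelated-jitter rows are not. A minimal sketch of both, straight from the table's formulas (function and parameter names are illustrative):

```python
import random

def equal_jitter_delay(attempt, base=1.0, cap=30.0):
    # Half of the exponential delay is fixed, half is random,
    # guaranteeing a minimum wait of exp_delay / 2.
    exp_delay = min(cap, base * (2 ** attempt))
    return exp_delay / 2 + random.uniform(0, exp_delay / 2)

def decorrelated_jitter_delay(prev_delay, base=1.0, cap=30.0):
    # Each delay depends on the previous delay, not the attempt
    # count, which suits stateful clients and long-running jobs.
    return min(cap, random.uniform(base, prev_delay * 3))
```

Note that decorrelated jitter requires the caller to carry the previous delay between attempts, which is why the table marks it Medium complexity.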
START: Is the failed operation retryable?
|
+-- Is the error transient (429, 408, 500, 502, 503, 504, network timeout)?
| +-- NO (400, 401, 403, 404, 422) --> Do NOT retry. Return error immediately.
| +-- YES ↓
|
+-- Is the operation idempotent (safe to repeat)?
| +-- NO --> Add idempotency key or do NOT retry.
| +-- YES ↓
|
+-- How many concurrent clients may retry simultaneously?
| +-- Few (<10) --> Exponential backoff (jitter optional)
| +-- Many (10-1000) --> Full jitter (recommended default)
| +-- Very many (>1000) --> Full jitter + token bucket + circuit breaker
|
+-- Do you need a guaranteed minimum wait time?
| +-- YES --> Equal jitter (half fixed, half random)
| +-- NO --> Full jitter (lowest total load)
|
+-- Is this a long-running background job?
+-- YES --> Decorrelated jitter (independent of attempt count)
+-- NO --> Full jitter with max 3-5 attempts
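The idempotency branch above can be satisfied by attaching an Idempotency-Key header so the server can deduplicate repeated requests. A sketch of the client side (the request shape and helper name are illustrative; the server must implement the deduplication):

```python
import uuid

def build_retry_safe_request(method, url, body):
    """Generate the key once per logical operation and reuse it
    across every retry of that operation, so the server can detect
    and deduplicate repeats."""
    return {
        "method": method,
        "url": url,
        "body": body,
        "headers": {"Idempotency-Key": str(uuid.uuid4())},
    }
```

The key must be created before the first attempt and held constant across retries; generating a fresh key per attempt defeats the deduplication.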
Only retry transient failures. Server errors (5xx) and rate limits (429) are retryable. Client errors (4xx except 429, 408) are permanent and must not be retried. [src3]
RETRYABLE_STATUS_CODES = {408, 429, 500, 502, 503, 504}
RETRYABLE_EXCEPTIONS = (ConnectionError, TimeoutError, OSError)
def is_retryable(error):
    if isinstance(error, RETRYABLE_EXCEPTIONS):
        return True
    if hasattr(error, 'status_code'):
        return error.status_code in RETRYABLE_STATUS_CODES
    return False
Verify: is_retryable(HTTPError(status_code=503)) returns True; is_retryable(HTTPError(status_code=400)) returns False.
Full jitter provides the best load distribution across retrying clients. The formula randomizes the delay between 0 and the exponential ceiling. [src1]
import random
def full_jitter_delay(attempt, base=1.0, cap=30.0):
    exp_delay = min(cap, base * (2 ** attempt))
    return random.uniform(0, exp_delay)
Verify: full_jitter_delay(0) returns value in [0, 1.0]; full_jitter_delay(5) returns value in [0, 30.0].
Wrap the retryable operation in a loop with configurable max attempts, applying the jitter delay between each attempt. [src2]
import time, logging
def retry_with_backoff(fn, max_attempts=4, base=1.0, cap=30.0):
    last_exception = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            last_exception = e
            if not is_retryable(e) or attempt == max_attempts - 1:
                raise
            delay = full_jitter_delay(attempt, base, cap)
            logging.warning(f"Attempt {attempt+1}/{max_attempts} failed. Retrying in {delay:.2f}s")
            time.sleep(delay)
    raise last_exception
Verify: Function retries on 503, gives up on 400, raises after max_attempts exhausted.
Prevent retry amplification by limiting the total retry rate across all requests. AWS SDKs use a token bucket: 500 initial tokens, 5 tokens per successful call refunded, 5 tokens consumed per retry. [src7]
import threading
class RetryBudget:
    def __init__(self, max_tokens=500, refill_per_success=5, cost_per_retry=5):
        self.tokens = max_tokens
        self.max_tokens = max_tokens
        self.refill = refill_per_success
        self.cost = cost_per_retry
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            if self.tokens >= self.cost:
                self.tokens -= self.cost
                return True
            return False

    def success(self):
        with self._lock:
            self.tokens = min(self.max_tokens, self.tokens + self.refill)
Verify: After 100 consecutive failures (500 tokens consumed), acquire() returns False.
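One way to wire such a budget into the retry loop: check the budget before every retry, refund on every success. A sketch with a simplified inline copy of the budget class (names and defaults are illustrative):

```python
import random
import threading
import time

class RetryBudget:
    """Simplified token-bucket retry budget (mirrors the class above)."""
    def __init__(self, max_tokens=500, refill=5, cost=5):
        self.tokens, self.max_tokens = max_tokens, max_tokens
        self.refill, self.cost = refill, cost
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            if self.tokens >= self.cost:
                self.tokens -= self.cost
                return True
            return False

    def success(self):
        with self._lock:
            self.tokens = min(self.max_tokens, self.tokens + self.refill)

def retry_with_budget(fn, budget, max_attempts=4, base=1.0, cap=30.0):
    for attempt in range(max_attempts):
        try:
            result = fn()
            budget.success()  # refund tokens on success
            return result
        except Exception:
            # Give up when attempts or the shared budget run out.
            if attempt == max_attempts - 1 or not budget.acquire():
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Because the budget is shared across all requests in the process, a sustained outage drains it quickly and new failures stop retrying, which is the amplification limit the token bucket exists to provide.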
When the server sends a Retry-After header (common with 429 and 503), use the server-specified delay instead of your calculated backoff. [src3]
def get_retry_delay(response, attempt, base=1.0, cap=30.0):
    retry_after = response.headers.get('Retry-After')
    if retry_after:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass
    return full_jitter_delay(attempt, base, cap)
Verify: Response with Retry-After: 5 returns 5.0; without header falls back to jitter calculation.
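The function above handles only the delta-seconds form, but RFC 9110 also allows Retry-After to be an HTTP-date. A sketch of a parser covering both forms (the helper name is illustrative):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, cap=30.0):
    """Return a delay in seconds from a Retry-After header value,
    or None if the value is missing or unparseable."""
    if value is None:
        return None
    try:
        return min(float(value), cap)            # delta-seconds: "5"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)      # HTTP-date form
        delta = (when - datetime.now(timezone.utc)).total_seconds()
        return min(max(delta, 0.0), cap)
    except (TypeError, ValueError):
        return None
```

On a None result the caller should fall back to the jitter calculation, mirroring the behavior of get_retry_delay above.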
# Input: Any function that may raise transient errors
# Output: Automatic retry with exponential backoff + jitter
from tenacity import (
    retry, stop_after_attempt, wait_exponential_jitter,
    retry_if_exception_type, before_sleep_log
)
import logging
import httpx  # pip install httpx>=0.27

logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=30, jitter=5),
    retry=retry_if_exception_type((httpx.TransportError, httpx.HTTPStatusError)),
    before_sleep=before_sleep_log(logger, logging.WARNING),
    reraise=True,
)
def fetch_with_retry(url: str) -> dict:
    response = httpx.get(url, timeout=10)
    if response.status_code in (429, 500, 502, 503, 504):
        response.raise_for_status()
    return response.json()
// Input: Async function that may throw retryable errors
// Output: Result of successful call, or throws after max attempts
async function retryWithBackoff(fn, {
  maxAttempts = 4,
  baseDelay = 1000,
  maxDelay = 30000,
} = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const isRetryable = error.status >= 500 || error.status === 429
        || error.code === 'ECONNRESET' || error.code === 'ETIMEDOUT';
      if (!isRetryable || attempt === maxAttempts - 1) throw error;
      const expDelay = Math.min(maxDelay, baseDelay * 2 ** attempt);
      const delay = Math.random() * expDelay;
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
// Input: Context, retryable function
// Output: Result of successful call, or error after max attempts
package retry

import (
	"context"
	"fmt"
	"math"
	"math/rand"
	"time"
)

type Config struct {
	MaxAttempts int
	BaseDelay   time.Duration
	MaxDelay    time.Duration
}

func Do(ctx context.Context, cfg Config, fn func() error) error {
	var lastErr error
	for attempt := 0; attempt < cfg.MaxAttempts; attempt++ {
		lastErr = fn()
		if lastErr == nil {
			return nil
		}
		if attempt == cfg.MaxAttempts-1 {
			break
		}
		expDelay := math.Min(
			float64(cfg.MaxDelay),
			float64(cfg.BaseDelay)*math.Pow(2, float64(attempt)),
		)
		delay := time.Duration(rand.Float64() * expDelay)
		select {
		case <-ctx.Done():
			return fmt.Errorf("retry cancelled: %w", ctx.Err())
		case <-time.After(delay):
		}
	}
	return fmt.Errorf("all %d attempts failed: %w", cfg.MaxAttempts, lastErr)
}
// Input: Callable<T> that may throw retryable exceptions
// Output: Result of successful call, or throws after max attempts
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

public class RetryWithBackoff {
    private static final Set<Integer> RETRYABLE = Set.of(429, 500, 502, 503, 504);

    public static <T> T execute(
            RetryableCall<T> fn, int maxAttempts, long baseMs, long capMs
    ) throws Exception {
        Exception lastException = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return fn.call();
            } catch (RetryableException e) {
                lastException = e;
                if (!RETRYABLE.contains(e.getStatusCode())
                        || attempt == maxAttempts - 1) throw e;
                long expDelay = Math.min(capMs, baseMs * (1L << attempt));
                long delay = ThreadLocalRandom.current().nextLong(0, expDelay + 1);
                Thread.sleep(delay);
            }
        }
        throw lastException;
    }

    @FunctionalInterface
    public interface RetryableCall<T> { T call() throws Exception; }
}
# BAD -- hammering a failing service makes the outage worse
for attempt in range(5):
    try:
        return call_api()
    except Exception:
        pass  # Retry immediately with zero delay

# GOOD -- spreading retries over time lets the service recover
for attempt in range(5):
    try:
        return call_api()
    except TransientError:
        delay = min(30, 1.0 * 2 ** attempt) * random.random()
        time.sleep(delay)
# BAD -- 400 Bad Request will fail every time, wasting retry attempts
@retry(stop=stop_after_attempt(4), wait=wait_exponential())
def create_user(data):
    response = httpx.post('/users', json=data)
    response.raise_for_status()  # Retries even on 400/401/404!

# GOOD -- only retry server errors and rate limits
@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=30),
    retry=retry_if_exception(lambda e: getattr(e, 'response', None)
        and e.response.status_code in (429, 500, 502, 503, 504)),
)
def create_user(data):
    response = httpx.post('/users', json=data)
    response.raise_for_status()
# BAD -- all 1000 clients retry at exactly 1s, 2s, 4s, 8s -> thundering herd
delay = base * (2 ** attempt)
time.sleep(delay)
# GOOD -- clients spread retries uniformly, preventing synchronized storms
delay = random.uniform(0, min(cap, base * (2 ** attempt)))
time.sleep(delay)
# BAD -- retries forever, consuming threads/connections/memory
while True:
    try:
        return call_api()
    except Exception:
        time.sleep(2 ** attempt)
        attempt += 1

# GOOD -- give up after max_attempts and let the caller handle the failure
MAX_ATTEMPTS = 4
for attempt in range(MAX_ATTEMPTS):
    try:
        return call_api()
    except TransientError:
        if attempt == MAX_ATTEMPTS - 1:
            raise
        delay = random.uniform(0, min(30.0, 1.0 * 2 ** attempt))
        time.sleep(delay)
Full jitter formula: delay = random.uniform(0, min(cap, base * 2^attempt)). [src1]

For non-idempotent operations, send an Idempotency-Key header; the server deduplicates based on the key. [src2]

If the server responds with Retry-After: 60, ignoring it and retrying in 2s triggers rate limiting or bans. Fix: parse the Retry-After header and use the server-specified delay as a minimum. [src3]

Uncapped exponential backoff grows absurdly: at attempt 20, 2^20 = 1,048,576 seconds (~12 days). Fix: min(cap, base * 2^attempt) with a cap of 30-60 seconds. [src7]

| Use When | Don't Use When | Use Instead |
|---|---|---|
| Transient network errors (timeouts, DNS failures) | Client-side validation errors (4xx) | Return error immediately |
| Rate-limited APIs returning 429 | Non-idempotent operations without idempotency keys | Idempotency pattern first, then retry |
| Cloud service temporary unavailability (503) | Real-time user-facing requests with tight latency SLAs | Hedged requests / speculative execution |
| Batch processing / background jobs | Service is consistently down (not transient) | Circuit breaker to fail fast |
| Database connection pool exhaustion | Retries at multiple layers (client + LB + gateway) | Single retry point with retry budget |
| Message queue consumer failures | Authentication/authorization errors (401, 403) | Re-authenticate, do not retry |
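The "Use Instead" column mentions a circuit breaker for consistently failing services. A minimal sketch of the idea (thresholds and names are illustrative; production code should use a library):

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures; probe again after a cooldown."""
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: allow a probe request once the cooldown elapses.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Callers check allow() before each attempt: while the circuit is open, requests fail immediately instead of burning retry budget against a dead dependency.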
In .NET, Polly's retry strategy offers the equivalent: BackoffType.Exponential with UseJitter = true. Prefer library implementations (tenacity, Polly, resilience4j, the AWS SDK retry modes) over hand-rolled retries.