Retry with Exponential Backoff and Jitter

Type: Software Reference | Confidence: 0.92 | Sources: 7 | Verified: 2026-02-24 | Freshness: 2026-02-24

TL;DR

Constraints

Quick Reference

Strategy | Formula | Thundering Herd Risk | Fairness | Complexity | Best For
No backoff (fixed) | delay = constant | Very High | Equal | Trivial | Never use for retries
Linear backoff | delay = base * attempt | High | Equal | Low | Simple rate limiting
Exponential (no jitter) | delay = min(cap, base * 2^attempt) | High | Equal | Low | Prototype only
Full jitter | delay = random(0, min(cap, base * 2^attempt)) | Very Low | High | Low | Default recommendation
Equal jitter | delay = exp/2 + random(0, exp/2) | Low | Medium | Low | Predictable minimum wait
Decorrelated jitter | delay = min(cap, random(base, prev * 3)) | Low | Medium | Medium | Stateful clients
Fixed delay | delay = constant | Very High | Equal | Trivial | Polling, not retries
Exponential + token bucket | Full jitter + token bucket rate limit | Very Low | High | Medium | AWS SDK default
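
The equal-jitter and decorrelated-jitter rows translate directly into code; neither is implemented elsewhere in this unit, so a minimal sketch of both follows (function names are illustrative):

```python
import random

def equal_jitter_delay(attempt, base=1.0, cap=30.0):
    # Half the exponential delay is fixed, half is random:
    # guarantees a minimum wait of exp_delay / 2.
    exp_delay = min(cap, base * (2 ** attempt))
    return exp_delay / 2 + random.uniform(0, exp_delay / 2)

def decorrelated_jitter_delay(prev_delay, base=1.0, cap=30.0):
    # Next delay depends on the previous delay, not the attempt
    # count, so the client carries state between retries.
    return min(cap, random.uniform(base, prev_delay * 3))
```

For decorrelated jitter, seed the first call with prev_delay = base and feed each result back in as the next prev_delay.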

Decision Tree

START: Is the failed operation retryable?
|
+-- Is the error transient (429, 408, 500, 502, 503, 504, network timeout)?
|   +-- NO (400, 401, 403, 404, 422) --> Do NOT retry. Return error immediately.
|   +-- YES ↓
|
+-- Is the operation idempotent (safe to repeat)?
|   +-- NO --> Add idempotency key or do NOT retry.
|   +-- YES ↓
|
+-- How many concurrent clients may retry simultaneously?
|   +-- Few (<10) --> Exponential backoff (jitter optional)
|   +-- Many (10-1000) --> Full jitter (recommended default)
|   +-- Very many (>1000) --> Full jitter + token bucket + circuit breaker
|
+-- Do you need a guaranteed minimum wait time?
|   +-- YES --> Equal jitter (half fixed, half random)
|   +-- NO --> Full jitter (lowest total load)
|
+-- Is this a long-running background job?
    +-- YES --> Decorrelated jitter (independent of attempt count)
    +-- NO --> Full jitter with max 3-5 attempts
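
For the non-idempotent branch above, an idempotency key lets the server deduplicate repeated requests so they become safe to retry. A minimal sketch, assuming the target API honors the common Idempotency-Key header convention (the header name varies by provider):

```python
import uuid

def idempotency_headers(existing_key=None):
    # Generate the key once per logical operation and reuse the SAME
    # key on every retry -- a fresh key per attempt defeats the purpose.
    key = existing_key or str(uuid.uuid4())
    return {"Idempotency-Key": key}
```

Attach these headers to the first attempt and every retry of the same POST; the server treats requests sharing a key as one operation.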

Step-by-Step Guide

1. Identify retryable errors

Only retry transient failures. Transient server errors (500, 502, 503, 504), rate limits (429), and request timeouts (408) are retryable. Other client errors (4xx) are permanent and must not be retried. [src3]

RETRYABLE_STATUS_CODES = {408, 429, 500, 502, 503, 504}
RETRYABLE_EXCEPTIONS = (ConnectionError, TimeoutError, OSError)

def is_retryable(error):
    if isinstance(error, RETRYABLE_EXCEPTIONS):
        return True
    if hasattr(error, 'status_code'):
        return error.status_code in RETRYABLE_STATUS_CODES
    return False

Verify: is_retryable(HTTPError(status_code=503)) returns True; is_retryable(HTTPError(status_code=400)) returns False.

2. Implement the full jitter formula

Full jitter provides the best load distribution across retrying clients. The formula randomizes the delay between 0 and the exponential ceiling. [src1]

import random

def full_jitter_delay(attempt, base=1.0, cap=30.0):
    exp_delay = min(cap, base * (2 ** attempt))
    return random.uniform(0, exp_delay)

Verify: full_jitter_delay(0) returns value in [0, 1.0]; full_jitter_delay(5) returns value in [0, 30.0].

3. Build the retry loop with maximum attempts

Wrap the retryable operation in a loop with configurable max attempts, applying the jitter delay between each attempt. [src2]

import time, logging

def retry_with_backoff(fn, max_attempts=4, base=1.0, cap=30.0):
    last_exception = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            last_exception = e
            if not is_retryable(e) or attempt == max_attempts - 1:
                raise
            delay = full_jitter_delay(attempt, base, cap)
            logging.warning(f"Attempt {attempt+1}/{max_attempts} failed. Retrying in {delay:.2f}s")
            time.sleep(delay)
    raise last_exception

Verify: Function retries on 503, gives up on 400, raises after max_attempts exhausted.

4. Add retry budget / token bucket (for high-scale systems)

Prevent retry amplification by limiting the total retry rate across all requests. AWS SDKs use a token bucket: 500 initial tokens, 5 tokens per successful call refunded, 5 tokens consumed per retry. [src7]

import threading

class RetryBudget:
    def __init__(self, max_tokens=500, refill_per_success=5, cost_per_retry=5):
        self.tokens = max_tokens
        self.max_tokens = max_tokens
        self.refill = refill_per_success
        self.cost = cost_per_retry
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            if self.tokens >= self.cost:
                self.tokens -= self.cost
                return True
            return False

    def success(self):
        with self._lock:
            self.tokens = min(self.max_tokens, self.tokens + self.refill)

Verify: After 100 consecutive failures (500 tokens consumed), acquire() returns False.
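
One way to wire the budget into the retry loop from step 3 is to treat an exhausted budget like an exhausted attempt count. A sketch (the RetryBudget class is repeated compactly here so the snippet runs standalone; retry_with_budget is an illustrative name):

```python
import random
import threading
import time

class RetryBudget:
    # Compact copy of the step 4 bucket.
    def __init__(self, max_tokens=500, refill_per_success=5, cost_per_retry=5):
        self.tokens = max_tokens
        self.max_tokens = max_tokens
        self.refill = refill_per_success
        self.cost = cost_per_retry
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            if self.tokens >= self.cost:
                self.tokens -= self.cost
                return True
            return False

    def success(self):
        with self._lock:
            self.tokens = min(self.max_tokens, self.tokens + self.refill)

def retry_with_budget(fn, budget, max_attempts=4, base=1.0, cap=30.0):
    for attempt in range(max_attempts):
        try:
            result = fn()
            budget.success()  # refund tokens on success
            return result
        except Exception:
            # Give up when out of attempts OR when the shared budget is
            # empty -- an empty budget means the fleet as a whole is
            # already retrying too much.
            if attempt == max_attempts - 1 or not budget.acquire():
                raise
            time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))
```

The budget is shared across all requests in the process, so a widespread outage drains it quickly and most callers fail fast instead of piling on retries.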

5. Respect Retry-After headers

When the server sends a Retry-After header (common with 429 and 503), use the server-specified delay instead of your calculated backoff. [src3]

def get_retry_delay(response, attempt, base=1.0, cap=30.0):
    retry_after = response.headers.get('Retry-After')
    if retry_after:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass
    return full_jitter_delay(attempt, base, cap)

Verify: Response with Retry-After: 5 returns 5.0; without header falls back to jitter calculation.
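
Retry-After may also carry an HTTP-date instead of delta-seconds (per RFC 9110). A sketch extending the parsing to handle both forms (parse_retry_after is an illustrative name; it returns None when the value is absent or unparseable, signalling a fall back to the jitter calculation):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, cap=30.0):
    # Returns a delay in seconds, or None if the header is missing/invalid.
    if value is None:
        return None
    try:
        return min(float(value), cap)        # delta-seconds form, e.g. "5"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
        delta = (when - datetime.now(timezone.utc)).total_seconds()
        return min(max(delta, 0.0), cap)     # clamp past dates to 0
    except (TypeError, ValueError):
        return None
```

The TypeError catch covers Python versions before 3.10, where parsedate_to_datetime raised TypeError on malformed input instead of ValueError.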

Code Examples

Python (tenacity): Decorator-Based Retry

# Input:  Any function that may raise transient errors
# Output: Automatic retry with exponential backoff + jitter

from tenacity import (
    retry, stop_after_attempt, wait_exponential_jitter,
    retry_if_exception_type, before_sleep_log
)
import logging
import httpx  # pip install httpx>=0.27

logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=30, jitter=5),
    retry=retry_if_exception_type((httpx.TransportError, httpx.HTTPStatusError)),
    before_sleep=before_sleep_log(logger, logging.WARNING),
    reraise=True,
)
def fetch_with_retry(url: str) -> dict:
    response = httpx.get(url, timeout=10)
    if response.status_code in (429, 500, 502, 503, 504):
        response.raise_for_status()
    return response.json()

Node.js: Async Retry with Full Jitter

// Input:  Async function that may throw retryable errors
// Output: Result of successful call, or throws after max attempts

async function retryWithBackoff(fn, {
  maxAttempts = 4,
  baseDelay = 1000,
  maxDelay = 30000,
} = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const isRetryable = error.status >= 500 || error.status === 429
        || error.code === 'ECONNRESET' || error.code === 'ETIMEDOUT';
      if (!isRetryable || attempt === maxAttempts - 1) throw error;
      const expDelay = Math.min(maxDelay, baseDelay * 2 ** attempt);
      const delay = Math.random() * expDelay;
      await new Promise(r => setTimeout(r, delay));
    }
  }
}

Go: Retry with Context and Full Jitter

// Input:  Context, retryable function
// Output: Result of successful call, or error after max attempts

package retry

import (
    "context"
    "fmt"
    "math"
    "math/rand"
    "time"
)

type Config struct {
    MaxAttempts int
    BaseDelay   time.Duration
    MaxDelay    time.Duration
}

func Do(ctx context.Context, cfg Config, fn func() error) error {
    var lastErr error
    for attempt := 0; attempt < cfg.MaxAttempts; attempt++ {
        lastErr = fn()
        if lastErr == nil {
            return nil
        }
        if attempt == cfg.MaxAttempts-1 {
            break
        }
        expDelay := math.Min(
            float64(cfg.MaxDelay),
            float64(cfg.BaseDelay)*math.Pow(2, float64(attempt)),
        )
        delay := time.Duration(rand.Float64() * expDelay)
        select {
        case <-ctx.Done():
            return fmt.Errorf("retry cancelled: %w", ctx.Err())
        case <-time.After(delay):
        }
    }
    return fmt.Errorf("all %d attempts failed: %w", cfg.MaxAttempts, lastErr)
}

Java: Retry with Exponential Backoff

// Input:  Callable<T> that may throw retryable exceptions
// Output: Result of successful call, or throws after max attempts

import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

public class RetryWithBackoff {
    private static final Set<Integer> RETRYABLE = Set.of(429, 500, 502, 503, 504);

    public static <T> T execute(
            RetryableCall<T> fn, int maxAttempts, long baseMs, long capMs
    ) throws Exception {
        Exception lastException = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return fn.call();
            } catch (RetryableException e) {
                lastException = e;
                if (!RETRYABLE.contains(e.getStatusCode())
                        || attempt == maxAttempts - 1) throw e;
                long expDelay = Math.min(capMs, baseMs * (1L << attempt));
                long delay = ThreadLocalRandom.current().nextLong(0, expDelay + 1);
                Thread.sleep(delay);
            }
        }
        throw lastException;
    }

    @FunctionalInterface
    public interface RetryableCall<T> { T call() throws Exception; }
}

Anti-Patterns

Wrong: Retry without any backoff

# BAD -- hammering a failing service makes the outage worse
for attempt in range(5):
    try:
        return call_api()
    except Exception:
        pass  # Retry immediately with zero delay

Correct: Retry with exponential backoff and jitter

# GOOD -- spreading retries over time lets the service recover
for attempt in range(5):
    try:
        return call_api()
    except TransientError:
        delay = min(30, 1.0 * 2 ** attempt) * random.random()
        time.sleep(delay)

Wrong: Retrying non-retryable errors (400, 401, 404)

# BAD -- 400 Bad Request will fail every time, wasting retry attempts
@retry(stop=stop_after_attempt(4), wait=wait_exponential())
def create_user(data):
    response = httpx.post('/users', json=data)
    response.raise_for_status()  # Retries even on 400/401/404!

Correct: Only retry transient errors

# GOOD -- only retry server errors and rate limits
@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=30),
    retry=retry_if_exception(lambda e: getattr(e, 'response', None)
          and e.response.status_code in (429, 500, 502, 503, 504)),
)
def create_user(data):
    response = httpx.post('/users', json=data)
    response.raise_for_status()

Wrong: Exponential backoff without jitter

# BAD -- all 1000 clients retry at exactly 1s, 2s, 4s, 8s -> thundering herd
delay = base * (2 ** attempt)
time.sleep(delay)

Correct: Always add full jitter

# GOOD -- clients spread retries uniformly, preventing synchronized storms
delay = random.uniform(0, min(cap, base * (2 ** attempt)))
time.sleep(delay)

Wrong: Infinite retries with no maximum

# BAD -- retries forever, consuming threads/connections/memory
attempt = 0
while True:
    try:
        return call_api()
    except Exception:
        time.sleep(2 ** attempt)
        attempt += 1

Correct: Bounded retries with a cap

# GOOD -- give up after max_attempts and let the caller handle the failure
MAX_ATTEMPTS = 4
for attempt in range(MAX_ATTEMPTS):
    try:
        return call_api()
    except TransientError:
        if attempt == MAX_ATTEMPTS - 1:
            raise
        delay = random.uniform(0, min(30.0, 1.0 * 2 ** attempt))
        time.sleep(delay)

Common Pitfalls

When to Use / When Not to Use

Use When | Don't Use When | Use Instead
Transient network errors (timeouts, DNS failures) | Client-side validation errors (4xx) | Return error immediately
Rate-limited APIs returning 429 | Non-idempotent operations without idempotency keys | Idempotency pattern first, then retry
Cloud service temporary unavailability (503) | Real-time user-facing requests with tight latency SLAs | Hedged requests / speculative execution
Batch processing / background jobs | Service is consistently down (not transient) | Circuit breaker to fail fast
Database connection pool exhaustion | Retries at multiple layers (client + LB + gateway) | Single retry point with retry budget
Message queue consumer failures | Authentication/authorization errors (401, 403) | Re-authenticate, do not retry
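
The "circuit breaker to fail fast" alternative means tripping after repeated failures so callers stop retrying a service that is down. A minimal, illustrative sketch of the open/half-open/closed states (not a production implementation; real breakers also track failure rates and concurrent probes):

```python
import time

class CircuitBreaker:
    # Opens after `failure_threshold` consecutive failures;
    # allows a probe request again after `reset_timeout` seconds.
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Half-open: let a probe through once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the breaker

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # open the breaker
```

Callers check allow() before each attempt; when it returns False they fail immediately instead of consuming a retry.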

Important Caveats

Related Units