Designing a Distributed Rate Limiter
How do I design a distributed rate limiter?
TL;DR
- Bottom line: Use a Redis-backed token bucket or sliding window counter at the API gateway layer; wrap all counter operations in Lua scripts for atomicity; deploy Redis with replication for availability.
- Key tool/command:
redis-cli EVAL "lua_script" 1 key limit window(atomic rate-limit check via Lua) - Watch out for: Race conditions from non-atomic read-then-write operations — two concurrent requests can both pass the check before either increments the counter.
- Works with: Any language with a Redis client; deploy at API gateway (Kong, NGINX, Envoy), application middleware, or edge (Cloudflare, AWS WAF).
Constraints
- All counter operations (read-check-increment) must be atomic — use Redis Lua scripts or CAS operations to prevent race conditions
- Clock skew across distributed nodes causes inconsistent windowing — use a centralized time source or wall-clock-aligned windows
- Redis single-instance is a SPOF — always deploy with master-replica replication and Sentinel or Cluster for production
- Never fail-open by default in security-critical contexts (DDoS protection, auth brute-force) — design explicit fail-open vs fail-closed policy per limiter
- Memory grows linearly with unique keys — set TTL on all rate limit keys and monitor Redis memory usage
Quick Reference
| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Rate Limit Algorithm | Core counting logic | Token bucket, sliding window counter, fixed window, leaky bucket | Algorithm choice depends on burst tolerance vs accuracy requirements |
| Centralized Counter Store | Atomic counters shared across nodes | Redis (primary), Memcached, DynamoDB | Redis Cluster with hash slots; read replicas for high-read workloads |
| API Gateway / Proxy | Enforcement point before application | NGINX, Kong, Envoy, AWS API Gateway, Cloudflare | Horizontal scaling; each instance queries shared counter store |
| Lua Script Engine | Atomic multi-step operations | Redis EVAL/EVALSHA | Scripts cached server-side via SHA; no additional scaling needed |
| Rule Configuration Store | Per-client/tier rate limits | Config file, database, feature flags | Hot-reload without restart; hierarchical rules (global > tenant > endpoint) |
| Client Identity Resolver | Extracts rate-limit key from request | API key, JWT claims, IP address, combination | Consistent hashing to same Redis shard per client key |
| Response Header Formatter | Communicates limit status to clients | X-RateLimit-Limit, X-RateLimit-Remaining, Retry-After, RateLimit (draft RFC) | Standard headers; no scaling concern |
| Async Counter Sync | Reduces latency for non-critical limits | Local counters + periodic sync to Redis | Eventual consistency; configurable sync interval |
| Monitoring & Alerting | Tracks rejection rates, Redis latency | Prometheus + Grafana, Datadog, CloudWatch | Alert on rejection spike, Redis memory, counter key cardinality |
| Fallback / Circuit Breaker | Handles Redis outage gracefully | In-memory local limiter, fail-open, fail-closed | Degrade to local approximate limiting; log for reconciliation |
| Load Shedder | Protects system under extreme load | Priority-based traffic classification | Stripe model: critical > POSTs > GETs > test traffic [src1] |
| Clock Synchronization | Consistent time across nodes | NTP, wall-clock-aligned windows | Align windows to Unix epoch seconds; tolerate small skew with weighted windows |
Decision Tree
START
├── Request volume <1K/sec and single datacenter?
│ ├── YES → In-process rate limiter (Guava RateLimiter, Go rate.Limiter)
│ └── NO ↓
├── Need to allow short traffic bursts above steady rate?
│ ├── YES → Token bucket algorithm (Redis + Lua)
│ │ ├── Stripe-style API? → Add concurrent request limiter + load shedder
│ │ └── Standard API? → Token bucket with per-key buckets
│ └── NO ↓
├── Need precise per-second accuracy (billing, compliance)?
│ ├── YES → Sliding window log (Redis sorted sets) — higher memory cost
│ └── NO ↓
├── Need smooth output rate (queue processing, webhooks)?
│ ├── YES → Leaky bucket (process at fixed rate, queue excess)
│ └── NO ↓
├── Simple implementation, tolerant of boundary burst?
│ ├── YES → Fixed window counter (simplest Redis INCR)
│ └── NO ↓
└── DEFAULT → Sliding window counter (Cloudflare approach) — best accuracy/memory trade-off
Step-by-Step Guide
1. Choose your rate-limiting key
Define what identifies a unique client. Common keys: API key, user ID, IP address, or composite (user + endpoint). [src1]
Rate-limit key examples:
Per user: ratelimit:{user_id}:{endpoint}
Per API key: ratelimit:{api_key}
Per IP: ratelimit:{client_ip}:{endpoint}
Per tenant: ratelimit:{tenant_id}:{tier}
Verify: Ensure keys are unique per client scope — duplicate keys cause shared limits across unrelated clients.
2. Implement atomic counter operations with Redis Lua
All rate-limit checks must be atomic. A Lua script ensures the read-check-increment cycle cannot be interrupted by concurrent requests. [src3] [src4]
-- Token bucket rate limiter (Redis Lua script)
-- KEYS[1] = rate limit key
-- ARGV[1] = max tokens (bucket capacity)
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = current timestamp (seconds)
-- ARGV[4] = tokens to consume (usually 1)
-- Returns: {allowed (0/1), remaining_tokens}
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])
if tokens == nil then
tokens = capacity
last_refill = now
end
local elapsed = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + (elapsed * refill_rate))
local allowed = 0
local remaining = tokens
if tokens >= requested then
tokens = tokens - requested
allowed = 1
remaining = tokens
end
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
return {allowed, remaining}
Verify: redis-cli EVAL "$(cat token_bucket.lua)" 1 "ratelimit:user123" 10 1 $(date +%s) 1 → {1, 9}
3. Add response headers
Communicate rate-limit state back to clients using IETF draft standard headers. [src7]
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1708732800
Retry-After: 30
Content-Type: application/json
{"error": "rate_limit_exceeded", "message": "Rate limit exceeded. Retry after 30s."}
Verify: curl -v https://api.example.com/endpoint 2>&1 | grep X-RateLimit
4. Deploy Redis with high availability
Use Redis Sentinel or Cluster for production to avoid single point of failure. [src3]
# Redis Sentinel configuration (sentinel.conf)
sentinel monitor mymaster 10.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 10000
sentinel parallel-syncs mymaster 1
Verify: redis-cli -p 26379 SENTINEL master mymaster → confirms master is up and replicas connected.
5. Implement fallback for Redis outages
Rate limiters must handle Redis failures gracefully. Decide on fail-open or fail-closed based on context. [src1]
Fallback strategy:
1. Try Redis rate-limit check (Lua script)
2. If Redis unavailable:
a. Security-critical (auth, payment): FAIL CLOSED → reject with 503
b. General API: FAIL OPEN → allow request, log the bypass
c. Optional: fall back to in-memory approximate counter per node
3. Circuit breaker: stop retrying Redis for 30s after 3 consecutive failures
4. Alert on fallback activation
Verify: Kill Redis replica, confirm application logs fallback activation and requests handled per policy.
6. Configure tiered rate limits
Implement hierarchical rules for different client tiers and priority-based load shedding. [src1] [src6]
Rate-limit tiers:
free_tier: 100 req/min, 1000 req/hour
pro_tier: 1000 req/min, 50000 req/hour
enterprise: 10000 req/min, unlimited/hour
Priority classification (Stripe model):
P0 — Critical (payments): never shed
P1 — POST operations: shed at 90% capacity
P2 — GET operations: shed at 80% capacity
P3 — Test mode/analytics: shed at 70% capacity
Verify: Send requests with different API keys → confirm each tier receives correct X-RateLimit-Limit.
Code Examples
Python: Redis-Based Sliding Window Counter
Full script: sliding_window_limiter.py (46 lines)
# Input: client_key (str), max_requests (int), window_seconds (int)
# Output: (allowed: bool, remaining: int, retry_after: int)
# Requires: redis>=5.0.0
import time, redis
SLIDING_WINDOW_LUA = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local curr_window = tostring(math.floor(now / window) * window)
local prev_window = tostring(tonumber(curr_window) - window)
local prev_count = tonumber(redis.call('GET', key..':'..prev_window) or '0')
local curr_count = tonumber(redis.call('GET', key..':'..curr_window) or '0')
local elapsed = now - tonumber(curr_window)
local weight = (window - elapsed) / window
local estimated = math.floor(prev_count * weight) + curr_count
if estimated >= limit then return {0, 0, window - elapsed} end
redis.call('INCR', key..':'..curr_window)
redis.call('EXPIRE', key..':'..curr_window, window * 2)
return {1, limit - estimated - 1, 0}
"""
class SlidingWindowRateLimiter:
def __init__(self, redis_url="redis://localhost:6379"):
self.r = redis.from_url(redis_url, decode_responses=True)
self._sha = self.r.script_load(SLIDING_WINDOW_LUA)
def is_allowed(self, client_key, max_requests=100, window_seconds=60):
try:
result = self.r.evalsha(self._sha, 1,
f"rl:{client_key}", max_requests, window_seconds, time.time())
return bool(result[0]), int(result[1]), int(result[2])
except redis.ConnectionError:
return True, -1, 0 # Fail-open
Node.js: Express Middleware Rate Limiter
Full script: express_rate_limiter.js (65 lines)
// Input: Express request object
// Output: next() if allowed, 429 response if rate-limited
// Requires: ioredis@^5.0.0, express@^4.18.0
const Redis = require('ioredis');
const TOKEN_BUCKET_LUA = `
local key = KEYS[1]
local cap = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local d = redis.call('HMGET', key, 'tokens', 'ts')
local t = tonumber(d[1]) or cap
local ts = tonumber(d[2]) or now
t = math.min(cap, t + (now - ts) * rate)
if t < 1 then return {0, 0, math.ceil((1-t)/rate)} end
t = t - 1
redis.call('HMSET', key, 'tokens', t, 'ts', now)
redis.call('EXPIRE', key, math.ceil(cap/rate)*2)
return {1, math.floor(t), 0}`;
function createRateLimiter(opts = {}) {
const { redisUrl='redis://localhost:6379', capacity=100, refillRate=1.67 } = opts;
const r = new Redis(redisUrl, { maxRetriesPerRequest: 1 });
return async (req, res, next) => {
const key = `rl:${req.headers['x-api-key'] || req.ip}`;
try {
const [ok, rem, retry] = await r.eval(TOKEN_BUCKET_LUA, 1, key,
capacity, refillRate, Date.now()/1000);
res.set({ 'X-RateLimit-Limit': String(capacity),
'X-RateLimit-Remaining': String(rem) });
if (!ok) { res.set('Retry-After', String(retry));
return res.status(429).json({ error: 'rate_limit_exceeded' }); }
next();
} catch(e) { console.error('RL error:', e.message); next(); }
};
}
Anti-Patterns
Wrong: Non-atomic read-then-write counter
# BAD — race condition: two requests both read count=99, both pass, actual=101
count = redis_client.get(f"rl:{client_id}")
if int(count or 0) < rate_limit:
redis_client.incr(f"rl:{client_id}")
allow_request()
Correct: Atomic Lua script
# GOOD — entire check runs atomically in Redis
result = redis_client.eval("""
local count = redis.call('INCR', KEYS[1])
if count == 1 then redis.call('EXPIRE', KEYS[1], ARGV[2]) end
if count > tonumber(ARGV[1]) then return 0 else return 1 end
""", 1, f"rl:{client_id}", rate_limit, window_seconds)
Wrong: Fixed window without boundary burst protection
# BAD — 100 req at 0:59 + 100 at 1:00 = 200 in 2 seconds
window = int(time.time() / 60)
key = f"rl:{client_id}:{window}"
count = redis_client.incr(key)
Correct: Sliding window counter
# GOOD — weighted average across adjacent windows prevents boundary burst
# Cloudflare: estimated = prev_count * ((window - elapsed) / window) + curr_count
# At t=0:45 with 60s window: weight=0.25, only 25% of previous window counts
Wrong: In-memory state only (not shared)
# BAD — 4 servers x 100/min = client can send 400/min total
request_counts = {} # Not shared across instances
def check_rate(client_id):
request_counts[client_id] = request_counts.get(client_id, 0) + 1
return request_counts[client_id] <= 100
Correct: Centralized counter store
# GOOD — all instances share the same counter via Redis
result = redis_client.evalsha(script_sha, 1, f"rl:{client_id}",
max_requests, window_seconds, time.time())
Wrong: No TTL on rate-limit keys
# BAD — keys accumulate forever, Redis memory grows unbounded
redis_client.hmset(f"rl:{client_id}", {"tokens": tokens, "ts": now})
# No EXPIRE — if client disappears, key persists forever
Correct: Always set TTL
# GOOD — TTL = 2x window ensures cleanup even with clock drift
redis_client.hmset(f"rl:{client_id}", {"tokens": tokens, "ts": now})
redis_client.expire(f"rl:{client_id}", window_seconds * 2)
Common Pitfalls
- Race conditions from non-atomic operations: Two concurrent requests both read the same counter, both pass, both increment — actual count exceeds limit. Fix:
redis.eval(lua_script)for atomic read-check-increment. [src3] - Fixed-window boundary burst: Client sends max requests at end of window N and start of N+1, doubling throughput in a short period. Fix: use sliding window counter with weighted average. [src2]
- Redis SPOF causing total outage: Single Redis instance down = all requests unthrottled or blocked. Fix: Redis Sentinel or Cluster + in-memory fallback with circuit breaker. [src1]
- Clock skew in distributed nodes: Servers with offset clocks produce inconsistent window boundaries. Fix: wall-clock-aligned windows (
floor(ts/window)*window) + NTP sync. [src5] - Missing Retry-After header: Clients without backoff guidance retry immediately, causing thundering herd. Fix: always include
Retry-Afteron 429 responses. [src7] - Over-granular rate-limit keys: Using user+endpoint+method+IP creates millions of unique keys. Fix: use
user_idorapi_key+ optional endpoint grouping. [src1] - No request priority differentiation: Analytics calls consume the same budget as payment operations. Fix: tiered load shedding with priority classification. [src1]
- Testing only at low traffic: Rate limiters working at 10 req/sec may fail at 10K req/sec. Fix: load test at production scale; benchmark Redis
EVALSHAlatency under concurrency. [src4]
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Protecting APIs from abuse (brute force, scraping) | Single-server app with no distributed requirements | In-process rate limiter (Guava, Go rate, node-rate-limiter-flexible) |
| Enforcing SLA/tier-based API quotas | Need to block L3/L4 DDoS attacks | Network-level DDoS mitigation (Cloudflare, AWS Shield, Cloud Armor) |
| Preventing resource exhaustion in microservices | Service is failing (unhealthy, not overloaded) | Circuit breaker pattern (Resilience4j, Polly, Hystrix) |
| Multi-datacenter API with shared rate limits | Rate limiting per-user with small user base and single server | Simple in-memory counter with mutex |
| Cost control for expensive downstream calls (LLM APIs, payment processors) | Need request queuing with guaranteed delivery | Message queue (RabbitMQ, SQS) with consumer rate control |
Important Caveats
- Sliding window counter (Cloudflare approach) is an approximation — 0.003% error rate across 400M requests, acceptable for most use cases but not billing-precise metering
- Redis Cluster splits keyspace into 16384 hash slots — ensure rate-limit key hashing distributes evenly; use hash tags
{user_id}for related keys on same shard - Token bucket
capacityandrefill_rateare independent — capacity controls burst size, refill_rate controls sustained throughput; misconfiguring either produces unexpected behavior - Lua scripts block the Redis event loop during execution — keep scripts under 100 microseconds; complex scripts at high concurrency increase p99 latency
- Distributed rate limiting with strict consistency adds 1-5ms latency per request; for sub-millisecond latency, use local approximate counting with periodic sync