aws route53 create-health-check + latency/geolocation routing policies (or Cloudflare Load Balancing with steering policies)

| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Authoritative DNS | Responds to client DNS queries with region-optimal IPs | Route 53, Cloudflare DNS, Google Cloud DNS, NS1, PowerDNS | Anycast nameservers, horizontal replication |
| GeoDNS Routing | Maps client IP geolocation to nearest datacenter | Route 53 Geolocation, Cloudflare Geo Steering, GCP routing policies | GeoIP database updates, ECS support |
| Latency-Based Routing | Returns IPs of lowest-latency region for each client | Route 53 Latency, Cloudflare Dynamic Steering, NS1 Filter Chains | Continuous latency probing from edge PoPs |
| Weighted Routing | Distributes traffic by percentage across endpoints | Route 53 Weighted, Cloudflare Weighted steering, HAProxy GSLB | Adjust weights per capacity; canary deployments |
| Anycast Network | Single IP advertised from multiple PoPs via BGP | Cloudflare (built-in), GCP Global LB, custom BGP (BIRD/FRR) | Add PoPs, BGP community-based traffic engineering |
| Health Checks | Detect unhealthy backends and remove from DNS | Route 53 Health Checks, Cloudflare Monitors, HAProxy agent checks | Multi-region probes, configurable thresholds |
| L4/L7 Load Balancer | Local traffic distribution within each datacenter | NGINX, HAProxy, Envoy, AWS ALB/NLB, GCP Backend Services | Horizontal scaling, connection draining |
| Failover Controller | Automatic rerouting on region-level outage | Route 53 Failover, Cloudflare Failover steering, GCP auto-failover | Primary/secondary/tertiary hierarchy |
| SSL/TLS Termination | Terminates HTTPS at the edge before backend routing | Cloudflare SSL, AWS ACM + ALB, GCP Managed Certificates | Edge termination, certificate auto-renewal |
| Monitoring & Observability | Track DNS resolution, latency, failover events | Cloudflare Analytics, Route 53 query logs, Prometheus + Grafana | Alerting on failover events, TTL violations |
| DNS Caching Layer | Reduce query load on authoritative servers | Resolver TTL caching, local BIND/Unbound caches | Tune TTL 60-300s; cache warming |
| Edge Compute | Run logic at DNS/HTTP edge for routing decisions | Cloudflare Workers, Lambda@Edge, Fastly Compute | Latency-based A/B testing, custom routing rules |
START
├── Need data sovereignty / compliance routing?
│ ├── YES → Use GeoDNS with jurisdiction-locked regions
│ │ (Route 53 Geolocation or Cloudflare Geo Steering)
│ └── NO ↓
├── Need sub-100ms failover with zero DNS propagation delay?
│ ├── YES → Use Anycast IP with L7 health checks
│ │ (GCP Global LB, Cloudflare proxy, custom BGP)
│ └── NO ↓
├── Need lowest-latency routing across 3+ regions?
│ ├── YES → Use latency-based routing with health checks
│ │ (Route 53 Latency + failover, Cloudflare Dynamic Steering)
│ └── NO ↓
├── Need canary/blue-green deployments?
│ ├── YES → Use weighted routing (e.g., 90/10 split)
│ │ (Route 53 Weighted, Cloudflare Weighted Steering)
│ └── NO ↓
├── <10K req/s with 2-3 regions?
│ ├── YES → Simple DNS round-robin with health checks
│ │ (Any managed DNS provider)
│ └── NO ↓
└── DEFAULT → Combine Anycast (edge) + GeoDNS (regional) + L7 LB (local)
This is the standard architecture for >100K req/s global services
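The decision tree above can be sketched as a small selection function. This is an illustrative reduction of the tree to code (the flag and strategy names are assumptions, not an API); the first matching requirement wins, top to bottom:

```python
def pick_routing_strategy(
    needs_sovereignty: bool = False,
    needs_instant_failover: bool = False,
    needs_lowest_latency: bool = False,
    needs_canary: bool = False,
    req_per_sec: int = 0,
    regions: int = 2,
) -> str:
    """Mirror of the decision tree: earlier requirements take priority."""
    if needs_sovereignty:
        return "geodns (jurisdiction-locked)"
    if needs_instant_failover:
        return "anycast + l7 health checks"
    if needs_lowest_latency and regions >= 3:
        return "latency-based routing + health checks"
    if needs_canary:
        return "weighted routing"
    if req_per_sec < 10_000 and regions <= 3:
        return "dns round-robin + health checks"
    return "anycast (edge) + geodns (regional) + l7 lb (local)"
```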
Identify 2-6 geographic regions based on your user distribution. Deploy identical application stacks in each region with independent databases or read replicas. [src1]
# Example: List your target regions and endpoints
# Region: us-east-1 → alb-us-east.example.com
# Region: eu-west-1 → alb-eu-west.example.com
# Region: ap-south-1 → alb-ap-south.example.com
Verify: curl -s -o /dev/null -w "%{http_code}" https://alb-us-east.example.com/health → expected: 200
Set up health checks that probe each backend from at least 3 geographic locations. Configure failure thresholds (typically 3 consecutive failures) and check intervals (10-30s). [src2]
# AWS Route 53: Create health check
aws route53 create-health-check --caller-reference "us-east-$(date +%s)" \
--health-check-config '{
"FullyQualifiedDomainName": "alb-us-east.example.com",
"Port": 443,
"Type": "HTTPS",
"ResourcePath": "/health",
"RequestInterval": 10,
"FailureThreshold": 3,
"Regions": ["us-east-1", "eu-west-1", "ap-southeast-1"]
}'
Verify: aws route53 get-health-check-status --health-check-id <ID> → all reporters show Success
Choose and configure your primary routing strategy (latency, geolocation, or weighted). Associate health checks with each record set. [src2]
# AWS Route 53: Latency-based routing with health check
aws route53 change-resource-record-sets --hosted-zone-id Z1234567890 \
--change-batch '{
"Changes": [{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "api.example.com",
"Type": "A",
"SetIdentifier": "us-east-1",
"Region": "us-east-1",
"TTL": 60,
"ResourceRecords": [{"Value": "203.0.113.10"}],
"HealthCheckId": "hc-us-east-id"
}
}]
}'
Verify: dig api.example.com +short from different regions returns region-appropriate IPs
In each region, deploy L4/L7 load balancers (NGINX, HAProxy, or cloud ALBs) to distribute traffic across local backend instances. [src4]
frontend http_front
bind *:443 ssl crt /etc/ssl/certs/example.pem
default_backend app_servers
backend app_servers
balance roundrobin
option httpchk GET /health
http-check expect status 200
server app1 10.0.1.10:8080 check inter 5s fall 3 rise 2
server app2 10.0.1.11:8080 check inter 5s fall 3 rise 2
server app3 10.0.1.12:8080 check inter 5s fall 3 rise 2
Verify: echo "show stat" | socat /var/run/haproxy.sock stdio | grep app_servers → all servers show UP
Configure primary, secondary, and tertiary failover targets so that regional failures cascade to the next-best region automatically. [src1]
# Cloudflare: Configure failover pool priority via API
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/{zone_id}/load_balancers/{lb_id}" \
-H "Authorization: Bearer $CF_TOKEN" \
-H "Content-Type: application/json" \
--data '{
"default_pools": ["pool-us-east", "pool-eu-west", "pool-ap-south"],
"fallback_pool": "pool-us-west",
"steering_policy": "dynamic_latency",
"session_affinity": "cookie"
}'
Verify: Disable primary pool health endpoint, then dig example.com → should return secondary pool IPs within TTL window
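The verification step above can be automated with a small watcher that polls DNS and flags when the answer set changes (a sketch; it uses the OS resolver via the standard library, so local caching may delay what it sees relative to an authoritative query):

```python
import socket
import time

def answer_changed(prev: set[str], cur: set[str]) -> bool:
    """True when the resolved IP set differs from the previous poll
    (ignores the very first poll, when there is no baseline yet)."""
    return bool(prev) and prev != cur

def watch(hostname: str, interval_s: int = 10, polls: int = 30) -> None:
    """Poll DNS and print a line whenever the answer set changes."""
    prev: set[str] = set()
    for _ in range(polls):
        try:
            _, _, ips = socket.gethostbyname_ex(hostname)
            cur = set(ips)
        except socket.gaierror as exc:
            print(f"{time.strftime('%T')} resolution failed: {exc}")
            time.sleep(interval_s)
            continue
        if answer_changed(prev, cur):
            print(f"{time.strftime('%T')} answers changed: "
                  f"{sorted(prev)} -> {sorted(cur)} (possible failover)")
        prev = cur
        time.sleep(interval_s)
```

Run `watch("api.example.com")` while disabling the primary pool's health endpoint; the change should appear within one TTL window plus the detection time.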
Monitor DNS resolution times, failover events, health check flaps, and regional traffic distribution. Alert on anomalies. [src6]
# Prometheus alert rule (blackbox_exporter metric): slow DNS resolution,
# often the first visible symptom of a failover or an overloaded authoritative server
alert: SlowDNSResolution
expr: probe_dns_lookup_time_seconds > 1
for: 2m
labels:
  severity: warning
annotations:
  summary: "DNS lookups exceeding 1s for {{ $labels.instance }}"
Verify: Trigger a test failover and confirm alerts fire within expected thresholds
# Input: Multi-region ALB endpoints
# Output: Latency-routed DNS with automatic failover
resource "aws_route53_health_check" "us_east" {
fqdn = "alb-us-east.example.com"
port = 443
type = "HTTPS"
resource_path = "/health"
failure_threshold = 3
request_interval = 10
regions = ["us-east-1", "eu-west-1", "ap-southeast-1"]
tags = { Name = "us-east-health-check" }
}
resource "aws_route53_record" "api_us_east" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
set_identifier = "us-east-1"
ttl = 60
latency_routing_policy {
region = "us-east-1"
}
records = ["203.0.113.10"]
health_check_id = aws_route53_health_check.us_east.id
}
# Input: DNS-discovered backend servers across regions
# Output: Load-balanced traffic with health-aware routing
resolvers cloudns
nameserver dns1 8.8.8.8:53
nameserver dns2 1.1.1.1:53
resolve_retries 3
timeout resolve 1s
timeout retry 1s
hold valid 60s
accepted_payload_size 8192
frontend gslb_front
bind *:443 ssl crt /etc/ssl/certs/example.pem
default_backend regional_backends
backend regional_backends
balance leastconn
option httpchk GET /health
http-check expect status 200
server-template us-east- 3 alb-us-east.example.com:443 \
check inter 10s fall 3 rise 2 ssl verify required \
ca-file /etc/ssl/certs/ca-certificates.crt \
resolvers cloudns resolve-prefer ipv4
server-template eu-west- 3 alb-eu-west.example.com:443 \
check inter 10s fall 3 rise 2 ssl verify required \
ca-file /etc/ssl/certs/ca-certificates.crt \
resolvers cloudns resolve-prefer ipv4
# Input: List of endpoints and regions
# Output: Health status map with latency metrics
import dns.resolver # dnspython==2.6.1
import requests # requests==2.31.0
import time
from concurrent.futures import ThreadPoolExecutor
ENDPOINTS = {
"us-east": "alb-us-east.example.com",
"eu-west": "alb-eu-west.example.com",
"ap-south": "alb-ap-south.example.com",
}
def check_endpoint(region, host):
try:
start = time.monotonic()
answers = dns.resolver.resolve(host, "A")
dns_ms = (time.monotonic() - start) * 1000
start = time.monotonic()
resp = requests.get(f"https://{host}/health", timeout=5)
http_ms = (time.monotonic() - start) * 1000
return {"region": region, "status": "healthy" if resp.status_code == 200 else "degraded",
"dns_ms": round(dns_ms, 2), "http_ms": round(http_ms, 2)}
except Exception as e:
return {"region": region, "status": "unhealthy", "error": str(e)}
with ThreadPoolExecutor(max_workers=len(ENDPOINTS)) as pool:
results = list(pool.map(lambda kv: check_endpoint(*kv), ENDPOINTS.items()))
for r in results:
print(f"[{r['region']}] {r['status']} dns={r.get('dns_ms','N/A')}ms http={r.get('http_ms','N/A')}ms")
# BAD — health check from one location only
# A network partition between probe and backend triggers false failover
health_check:
type: HTTPS
endpoint: alb-us-east.example.com
probe_region: us-east-1 # Single point of failure!
failure_threshold: 1 # Too aggressive!
# GOOD — probes from 3+ regions, requires consensus
health_check:
type: HTTPS
endpoint: alb-us-east.example.com
probe_regions: [us-east-1, eu-west-1, ap-southeast-1]
failure_threshold: 3
request_interval: 10
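The idea behind multi-region probes is consensus: one partitioned prober must not be able to evict a healthy backend. A minimal majority-vote sketch (illustrative only; each provider implements its own quorum rules, e.g. Route 53 uses its own healthy-checker percentage):

```python
def backend_healthy(probe_results: dict[str, bool], quorum: float = 0.5) -> bool:
    """Keep a backend in rotation while strictly more than `quorum` of
    probe regions report success. With no probe data, fail closed."""
    if not probe_results:
        return False
    healthy = sum(1 for ok in probe_results.values() if ok)
    return healthy / len(probe_results) > quorum
```

With three probe regions, a single failing prober (1/3 unhealthy) leaves the backend up; two failing probers (2/3) take it out.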
# BAD — TTL=5 causes massive query amplification
# Many resolvers clamp very low TTLs as local policy, and RFC 8767 even permits serving stale answers
api.example.com. 5 IN A 203.0.113.10
# Result: 100x query volume, no real failover improvement
# GOOD — TTL=60 provides fast-enough failover with manageable query volume
api.example.com. 60 IN A 203.0.113.10
# Combined with Route 53 health checks: actual failover time = ~70-90s
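The failover-time estimate above is simple arithmetic: detection time (probe interval x failure threshold) plus however much of the TTL a client resolver may still be caching. A back-of-envelope helper, using values from the examples in this guide:

```python
def worst_case_failover_s(interval_s: int, failure_threshold: int, ttl_s: int) -> int:
    """Detection (consecutive failed probes) plus residual client-side TTL."""
    return interval_s * failure_threshold + ttl_s

print(worst_case_failover_s(10, 3, 60))  # 30s detection + 60s TTL = 90
print(worst_case_failover_s(10, 3, 5))   # shaving the TTL saves little vs. detection time
```

This is why dropping TTL from 60s to 5s mostly amplifies query volume: detection time dominates, and resolvers may clamp the low TTL anyway.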
# BAD — users from unmapped regions get NXDOMAIN or random routing
geo_routing = {
"US": "203.0.113.10",
"EU": "198.51.100.10",
# No default! Users from Africa, South America get nothing
}
# GOOD — every region has a path, unmapped locations get best-effort routing
geo_routing = {
"US": "203.0.113.10",
"EU": "198.51.100.10",
"AP": "192.0.2.10",
"DEFAULT": "203.0.113.10", # Best-effort default for unmapped locations
}
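A lookup helper for the mapping above might look like this (a sketch; the continent codes and example IPs are carried over from the dictionary, not a real GeoIP API):

```python
GEO_ROUTING = {
    "US": "203.0.113.10",
    "EU": "198.51.100.10",
    "AP": "192.0.2.10",
    "DEFAULT": "203.0.113.10",
}

def resolve_for(region_code: str) -> str:
    """Every client gets an answer: unmapped regions fall through to DEFAULT."""
    return GEO_ROUTING.get(region_code, GEO_ROUTING["DEFAULT"])
```

A client geolocated to an unmapped region (say, "AF") now gets the default endpoint instead of NXDOMAIN.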
# BAD — BGP route changes mid-connection cause TCP resets
# Anycast works for DNS (UDP) and short HTTP, but long-lived
# WebSocket/gRPC streams break on route flaps
anycast_ip: 198.51.100.1 → backend-pool (no session tracking)
# GOOD — Anycast handles initial routing, L7 LB pins the session
Client → Anycast IP (198.51.100.1)
→ Nearest edge PoP (BGP)
→ L7 Load Balancer (session cookie / consistent hashing)
→ Specific backend instance (pinned for connection lifetime)
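The "pinned for connection lifetime" step can be sketched with consistent hashing: whichever edge PoP receives the connection, the same session key maps to the same backend, so a BGP route flap does not reshuffle sessions. An illustrative ring (the backend IPs match the HAProxy example above; not any particular LB's implementation):

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Maps a session key to a stable backend, independent of entry PoP."""

    def __init__(self, backends: list[str], vnodes: int = 100):
        # Place each backend at `vnodes` points on the ring to even out load.
        self.ring = sorted(
            (self._hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def backend_for(self, session_key: str) -> str:
        """First ring point at or after the key's hash, wrapping around."""
        idx = bisect(self.keys, self._hash(session_key)) % len(self.ring)
        return self.ring[idx][1]
```

Any PoP computing `backend_for("session-cookie-value")` reaches the same instance; in practice HAProxy's `hash-type consistent` or a session cookie plays this role.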
Prevent health-check flapping: require a rise threshold (e.g., 3 consecutive successes) before re-adding a recovered backend, and implement circuit breakers. [src4]
# Check DNS resolution from multiple resolvers
dig api.example.com +short @8.8.8.8
dig api.example.com +short @1.1.1.1
# Trace full DNS resolution path
dig api.example.com +trace
# Check DNS response with ECS (EDNS Client Subnet)
dig api.example.com +subnet=198.51.100.0/24 @8.8.8.8
# Verify TTL values
dig api.example.com +noall +answer
# Test health check endpoint
curl -s -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" https://api.example.com/health
# Check Route 53 health check status
aws route53 get-health-check-status --health-check-id <HC_ID>
# List Cloudflare load balancer pool health
curl -s "https://api.cloudflare.com/client/v4/zones/{zone}/load_balancers/pools/{pool}/health" \
-H "Authorization: Bearer $CF_TOKEN" | jq '.result'
# Monitor DNS resolution latency over time
while true; do echo "$(date +%T) $(dig api.example.com +noall +stats | grep 'Query time')"; sleep 10; done
# Check BGP route for Anycast IP
whois -h whois.radb.net 198.51.100.0/24
| Technology | Current Version | Key Changes | Notes |
|---|---|---|---|
| AWS Route 53 | 2013-04-01 API | Application Recovery Controller (2023), IP-based routing (2022) | Most routing policies backwards-compatible since launch |
| Cloudflare LB | Adaptive (2024) | Adaptive LB GA, DNS-only mode, custom rules | Steering: off, geo, random, dynamic_latency, proximity, least_outstanding_requests |
| GCP Cloud DNS | v1 | Global anycast, routing policies (WRR, geolocation) (2022) | Integrated with Global External HTTP(S) LB |
| HAProxy GSLB | 2.9+ / ALOHA 15+ | DNS resolution in backends, server-templates | GSLB module requires ALOHA or Enterprise |
| NGINX Plus | R30+ | DNS SRV discovery, upstream resolve, geo module | GSLB requires NGINX Plus (not open-source) |
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Backends in 2+ geographic regions, need to route users to nearest | All backends in a single datacenter or region | Local L4/L7 load balancer (HAProxy, NGINX, ALB) |
| Need automatic failover with <2 min recovery time | Need sub-second failover for real-time systems | Anycast with BGP failover or service mesh |
| Data sovereignty compliance (EU data stays in EU) | No regulatory requirements for data locality | Latency-based routing (simpler) |
| Multi-cloud deployment needing cloud-agnostic routing | Fully committed to single cloud provider | Cloud-native load balancing (ALB, Cloud LB) |
| Canary/blue-green deployments at region level | Canary within a single cluster | Kubernetes Ingress with traffic splitting |
| Distribute traffic based on capacity (weighted) | All regions have identical capacity | Simple GeoDNS or latency routing |