Global DNS and Load Balancing Architecture

Type: Software Reference | Confidence: 0.92 | Sources: 7 | Verified: 2026-02-23 | Freshness: 2026-02-23

TL;DR

Constraints

Quick Reference

| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Authoritative DNS | Responds to client DNS queries with region-optimal IPs | Route 53, Cloudflare DNS, Google Cloud DNS, NS1, PowerDNS | Anycast nameservers, horizontal replication |
| GeoDNS Routing | Maps client IP geolocation to nearest datacenter | Route 53 Geolocation, Cloudflare Geo Steering, GCP routing policies | GeoIP database updates, ECS support |
| Latency-Based Routing | Returns IPs of lowest-latency region for each client | Route 53 Latency, Cloudflare Dynamic Steering, NS1 Filter Chains | Continuous latency probing from edge PoPs |
| Weighted Routing | Distributes traffic by percentage across endpoints | Route 53 Weighted, Cloudflare Weighted Steering, HAProxy GSLB | Adjust weights per capacity; canary deployments |
| Anycast Network | Single IP advertised from multiple PoPs via BGP | Cloudflare (built-in), GCP Global LB, custom BGP (BIRD/FRR) | Add PoPs, BGP community-based traffic engineering |
| Health Checks | Detect unhealthy backends and remove from DNS | Route 53 Health Checks, Cloudflare Monitors, HAProxy agent checks | Multi-region probes, configurable thresholds |
| L4/L7 Load Balancer | Local traffic distribution within each datacenter | NGINX, HAProxy, Envoy, AWS ALB/NLB, GCP Backend Services | Horizontal scaling, connection draining |
| Failover Controller | Automatic rerouting on region-level outage | Route 53 Failover, Cloudflare Failover steering, GCP auto-failover | Primary/secondary/tertiary hierarchy |
| SSL/TLS Termination | Terminates HTTPS at the edge before backend routing | Cloudflare SSL, AWS ACM + ALB, GCP Managed Certificates | Edge termination, certificate auto-renewal |
| Monitoring & Observability | Track DNS resolution, latency, failover events | Cloudflare Analytics, Route 53 query logs, Prometheus + Grafana | Alerting on failover events, TTL violations |
| DNS Caching Layer | Reduce query load on authoritative servers | Resolver TTL caching, local BIND/Unbound caches | Tune TTL 60-300s; cache warming |
| Edge Compute | Run logic at DNS/HTTP edge for routing decisions | Cloudflare Workers, Lambda@Edge, Fastly Compute | Latency-based A/B testing, custom routing rules |

Decision Tree

START
├── Need data sovereignty / compliance routing?
│   ├── YES → Use GeoDNS with jurisdiction-locked regions
│   │         (Route 53 Geolocation or Cloudflare Geo Steering)
│   └── NO ↓
├── Need sub-100ms failover with zero DNS propagation delay?
│   ├── YES → Use Anycast IP with L7 health checks
│   │         (GCP Global LB, Cloudflare proxy, custom BGP)
│   └── NO ↓
├── Need lowest-latency routing across 3+ regions?
│   ├── YES → Use latency-based routing with health checks
│   │         (Route 53 Latency + failover, Cloudflare Dynamic Steering)
│   └── NO ↓
├── Need canary/blue-green deployments?
│   ├── YES → Use weighted routing (e.g., 90/10 split)
│   │         (Route 53 Weighted, Cloudflare Weighted Steering)
│   └── NO ↓
├── <10K req/s with 2-3 regions?
│   ├── YES → Simple DNS round-robin with health checks
│   │         (Any managed DNS provider)
│   └── NO ↓
└── DEFAULT → Combine Anycast (edge) + GeoDNS (regional) + L7 LB (local)
              This is the standard architecture for >100K req/s global services
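The tree above can be flattened into a small selector function. This is a hypothetical sketch — the requirement fields and strategy labels are invented for illustration, not a real API:

```python
# Hypothetical sketch: the decision tree above as a pure function.
from dataclasses import dataclass

@dataclass
class Requirements:
    data_sovereignty: bool = False   # compliance-locked routing needed?
    sub_100ms_failover: bool = False # zero DNS-propagation failover needed?
    regions: int = 1
    latency_sensitive: bool = False
    canary: bool = False
    req_per_sec: int = 0

def choose_routing_strategy(r: Requirements) -> str:
    if r.data_sovereignty:
        return "geodns-jurisdiction-locked"
    if r.sub_100ms_failover:
        return "anycast-l7-health-checks"
    if r.regions >= 3 and r.latency_sensitive:
        return "latency-based-with-health-checks"
    if r.canary:
        return "weighted-routing"
    if r.req_per_sec < 10_000 and r.regions <= 3:
        return "round-robin-with-health-checks"
    return "anycast-geodns-l7"  # default for >100K req/s global services

print(choose_routing_strategy(Requirements(canary=True)))  # weighted-routing
print(choose_routing_strategy(Requirements(regions=4, latency_sensitive=True)))  # latency-based-with-health-checks
```

Encoding the tree this way also makes the branch order explicit: compliance constraints override latency optimization, which overrides deployment-shape concerns.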

Step-by-Step Guide

1. Define regions and deploy backends

Identify 2-6 geographic regions based on your user distribution. Deploy identical application stacks in each region with independent databases or read replicas. [src1]

# Example: List your target regions and endpoints
# Region: us-east-1  → alb-us-east.example.com
# Region: eu-west-1  → alb-eu-west.example.com
# Region: ap-south-1 → alb-ap-south.example.com

Verify: curl -s -o /dev/null -w "%{http_code}" https://alb-us-east.example.com/health → expected: 200

2. Configure health checks from multiple vantage points

Set up health checks that probe each backend from at least 3 geographic locations. Configure failure thresholds (typically 3 consecutive failures) and check intervals (10-30s). [src2]

# AWS Route 53: Create health check
aws route53 create-health-check --caller-reference "us-east-$(date +%s)" \
  --health-check-config '{
    "FullyQualifiedDomainName": "alb-us-east.example.com",
    "Port": 443,
    "Type": "HTTPS",
    "ResourcePath": "/health",
    "RequestInterval": 10,
    "FailureThreshold": 3,
    "Regions": ["us-east-1", "eu-west-1", "ap-southeast-1"]
  }'

Verify: aws route53 get-health-check-status --health-check-id <ID> → all reporters show Success
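The failure threshold works as hysteresis: an endpoint is marked down only after N consecutive failures and restored only after M consecutive successes, which prevents a single lost probe from triggering failover. A minimal sketch of that state machine (class name and API are illustrative, not any provider's implementation):

```python
# Hypothetical sketch: fall/rise hysteresis as used by most health checkers.
class HealthState:
    def __init__(self, fall: int = 3, rise: int = 2):
        self.fall, self.rise = fall, rise
        self.healthy = True
        self._fails = 0
        self._oks = 0

    def observe(self, probe_ok: bool) -> bool:
        """Feed one probe result; return current health state."""
        if probe_ok:
            self._oks += 1
            self._fails = 0
            if not self.healthy and self._oks >= self.rise:
                self.healthy = True   # recovered after `rise` successes
        else:
            self._fails += 1
            self._oks = 0
            if self.healthy and self._fails >= self.fall:
                self.healthy = False  # down after `fall` consecutive failures
        return self.healthy

state = HealthState(fall=3, rise=2)
for probe in [True, False, False, False, True, True]:
    state.observe(probe)
print(state.healthy)  # True — 3 failures marked it down, 2 successes restored it
```

In production, each probing region runs this logic independently and the DNS provider aggregates the per-region verdicts, which is why multi-region probes are essential.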

3. Set up DNS routing policy

Choose and configure your primary routing strategy (latency, geolocation, or weighted). Associate health checks with each record set. [src2]

# AWS Route 53: Latency-based routing with health check
aws route53 change-resource-record-sets --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "us-east-1",
        "Region": "us-east-1",
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.10"}],
        "HealthCheckId": "hc-us-east-id"
      }
    }]
  }'

Verify: dig api.example.com +short from different regions returns region-appropriate IPs

4. Configure local load balancers per region

In each region, deploy L4/L7 load balancers (NGINX, HAProxy, or cloud ALBs) to distribute traffic across local backend instances. [src4]

frontend http_front
    bind *:443 ssl crt /etc/ssl/certs/example.pem
    default_backend app_servers

backend app_servers
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200
    server app1 10.0.1.10:8080 check inter 5s fall 3 rise 2
    server app2 10.0.1.11:8080 check inter 5s fall 3 rise 2
    server app3 10.0.1.12:8080 check inter 5s fall 3 rise 2

Verify: echo "show stat" | socat /var/run/haproxy.sock stdio | grep app_servers → all servers show UP

5. Implement failover hierarchy

Configure primary, secondary, and tertiary failover targets so that regional failures cascade to the next-best region automatically. [src1]

# Cloudflare: Configure failover pool priority via API
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/{zone_id}/load_balancers/{lb_id}" \
  -H "Authorization: Bearer $CF_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{
    "default_pools": ["pool-us-east", "pool-eu-west", "pool-ap-south"],
    "fallback_pool": "pool-us-west",
    "steering_policy": "dynamic_latency",
    "session_affinity": "cookie"
  }'

Verify: Disable primary pool health endpoint, then dig example.com → should return secondary pool IPs within TTL window
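The cascade above reduces to "first healthy pool in priority order, else the fallback". A minimal sketch, with pool names taken from the example and the function name invented for illustration:

```python
# Hypothetical sketch of the primary/secondary/tertiary failover cascade.
def select_pool(priority: list[str], health: dict[str, bool],
                fallback: str) -> str:
    """Return the first healthy pool in priority order, else the fallback."""
    for pool in priority:
        if health.get(pool, False):
            return pool
    return fallback  # last resort, served even if its health is unknown

pools = ["pool-us-east", "pool-eu-west", "pool-ap-south"]
health = {"pool-us-east": False, "pool-eu-west": True, "pool-ap-south": True}
print(select_pool(pools, health, "pool-us-west"))  # pool-eu-west
```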

6. Set up monitoring and alerting

Monitor DNS resolution times, failover events, health check flaps, and regional traffic distribution. Alert on anomalies. [src6]

# Prometheus rule file (blackbox_exporter metrics): slow DNS resolution
# is often the first observable symptom of a failover in progress
groups:
  - name: dns-failover
    rules:
      - alert: DNSFailoverTriggered
        expr: probe_dns_lookup_time_seconds > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "DNS resolution degraded for {{ $labels.instance }} (possible failover)"

Verify: Trigger a test failover and confirm alerts fire within expected thresholds

Code Examples

Terraform: Route 53 Latency-Based Routing with Health Checks

# Input:  Multi-region ALB endpoints
# Output: Latency-routed DNS with automatic failover

resource "aws_route53_health_check" "us_east" {
  fqdn              = "alb-us-east.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10
  regions           = ["us-east-1", "eu-west-1", "ap-southeast-1"]

  tags = { Name = "us-east-health-check" }
}

resource "aws_route53_record" "api_us_east" {
  zone_id         = aws_route53_zone.main.zone_id
  name            = "api.example.com"
  type            = "A"
  set_identifier  = "us-east-1"
  ttl             = 60

  latency_routing_policy {
    region = "us-east-1"
  }

  records          = ["203.0.113.10"]
  health_check_id  = aws_route53_health_check.us_east.id
}

HAProxy: GSLB with DNS-Based Backend Resolution

# Input:  DNS-discovered backend servers across regions
# Output: Load-balanced traffic with health-aware routing

resolvers cloudns
    nameserver dns1 8.8.8.8:53
    nameserver dns2 1.1.1.1:53
    resolve_retries 3
    timeout resolve 1s
    timeout retry   1s
    hold valid      60s
    accepted_payload_size 8192

frontend gslb_front
    bind *:443 ssl crt /etc/ssl/certs/example.pem
    default_backend regional_backends

backend regional_backends
    balance leastconn
    option httpchk GET /health
    http-check expect status 200
    server-template us-east- 3 us-east.example.com:443 \
        check inter 10s fall 3 rise 2 ssl verify required \
        ca-file /etc/ssl/certs/ca-certificates.crt \
        resolvers cloudns resolve-prefer ipv4
    server-template eu-west- 3 eu-west.example.com:443 \
        check inter 10s fall 3 rise 2 ssl verify required \
        ca-file /etc/ssl/certs/ca-certificates.crt \
        resolvers cloudns resolve-prefer ipv4

Python: Programmatic DNS Health Check Monitor

# Input:  List of endpoints and regions
# Output: Health status map with latency metrics

import dns.resolver  # dnspython==2.6.1
import requests      # requests==2.31.0
import time
from concurrent.futures import ThreadPoolExecutor

ENDPOINTS = {
    "us-east": "alb-us-east.example.com",
    "eu-west": "alb-eu-west.example.com",
    "ap-south": "alb-ap-south.example.com",
}

def check_endpoint(region, host):
    try:
        start = time.monotonic()
        answers = dns.resolver.resolve(host, "A")
        dns_ms = (time.monotonic() - start) * 1000
        start = time.monotonic()
        resp = requests.get(f"https://{host}/health", timeout=5)
        http_ms = (time.monotonic() - start) * 1000
        return {"region": region, "status": "healthy" if resp.status_code == 200 else "degraded",
                "dns_ms": round(dns_ms, 2), "http_ms": round(http_ms, 2)}
    except Exception as e:
        return {"region": region, "status": "unhealthy", "error": str(e)}

with ThreadPoolExecutor(max_workers=len(ENDPOINTS)) as pool:
    results = list(pool.map(lambda kv: check_endpoint(*kv), ENDPOINTS.items()))
for r in results:
    print(f"[{r['region']}] {r['status']} dns={r.get('dns_ms','N/A')}ms http={r.get('http_ms','N/A')}ms")

Anti-Patterns

Wrong: Single-region health check for global DNS

# BAD — health check from one location only
# A network partition between probe and backend triggers false failover
health_check:
  type: HTTPS
  endpoint: alb-us-east.example.com
  probe_region: us-east-1  # Single point of failure!
  failure_threshold: 1     # Too aggressive!

Correct: Multi-region health checks with sensible thresholds

# GOOD — probes from 3+ regions, requires consensus
health_check:
  type: HTTPS
  endpoint: alb-us-east.example.com
  probe_regions: [us-east-1, eu-west-1, ap-southeast-1]
  failure_threshold: 3
  request_interval: 10

Wrong: DNS TTL of 0 or 5 seconds for "instant" failover

# BAD — TTL=5 causes massive query amplification
# Many resolvers enforce a minimum TTL, and may serve stale records past expiry (RFC 8767)
api.example.com.  5  IN  A  203.0.113.10
# Result: 100x query volume, no real failover improvement

Correct: TTL 60-300s balanced with active health checks

# GOOD — TTL=60 provides fast-enough failover with manageable query volume
api.example.com.  60  IN  A  203.0.113.10
# Combined with Route 53 health checks: actual failover time = ~70-90s
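The ~70-90s figure decomposes into detection time (probe interval x failure threshold) plus the time for resolver caches to drain (at most one TTL). A back-of-envelope check, assuming the Route 53 settings used earlier in this guide:

```python
# Back-of-envelope failover time. Assumptions: 10s probe interval,
# 3-failure threshold, 60s record TTL, negligible DNS update propagation.
request_interval = 10   # seconds between health probes
failure_threshold = 3   # consecutive failures before marking DOWN
ttl = 60                # record TTL cached by resolvers

detection = request_interval * failure_threshold  # up to 30s to notice the outage
worst_case = detection + ttl                      # caches drain over one TTL
print(f"detection up to {detection}s + cache drain up to {ttl}s = ~{worst_case}s worst case")
```

Halving the TTL shaves at most 30s off the worst case while doubling steady-state query volume, which is why 60s is a common compromise.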

Wrong: GeoDNS without fallback for unmapped regions

# BAD — users from unmapped regions get NXDOMAIN or random routing
geo_routing = {
    "US": "203.0.113.10",
    "EU": "198.51.100.10",
    # No default! Users from Africa, South America get nothing
}

Correct: GeoDNS with explicit default fallback

# GOOD — every region has a path, unmapped locations get best-effort routing
geo_routing = {
    "US": "203.0.113.10",
    "EU": "198.51.100.10",
    "AP": "192.0.2.10",
    "DEFAULT": "203.0.113.10",  # Fallback to lowest-latency PoP
}
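The fallback pattern is just a guarded dictionary lookup. A minimal sketch around the mapping above (the function name is illustrative):

```python
# Hypothetical sketch: every client gets an answer, mapped or not.
geo_routing = {
    "US": "203.0.113.10",
    "EU": "198.51.100.10",
    "AP": "192.0.2.10",
    "DEFAULT": "203.0.113.10",
}

def route_for(region_code: str) -> str:
    # dict.get with the DEFAULT entry guarantees unmapped regions resolve
    return geo_routing.get(region_code, geo_routing["DEFAULT"])

print(route_for("EU"))  # 198.51.100.10
print(route_for("ZA"))  # 203.0.113.10 (unmapped -> DEFAULT)
```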

Wrong: Anycast with long-lived TCP without session persistence

# BAD — BGP route changes mid-connection cause TCP resets
# Anycast works for DNS (UDP) and short HTTP, but long-lived
# WebSocket/gRPC streams break on route flaps
anycast_ip: 198.51.100.1 → backend-pool (no session tracking)

Correct: Anycast for initial connection + session affinity for stateful protocols

# GOOD — Anycast handles initial routing, L7 LB pins the session
Client → Anycast IP (198.51.100.1)
       → Nearest edge PoP (BGP)
       → L7 Load Balancer (session cookie / consistent hashing)
       → Specific backend instance (pinned for connection lifetime)
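One common way the L7 layer pins a session is consistent hashing: the same session key (cookie value, client IP) always maps to the same backend, even when the Anycast layer moves the connection between PoPs. A minimal stdlib sketch (class and parameter names are illustrative, not a specific load balancer's API):

```python
# Hypothetical sketch: consistent-hash ring for session-to-backend pinning.
import bisect
import hashlib

def _h(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, backends, vnodes: int = 100):
        # vnodes spreads each backend over many ring positions for balance
        self._ring = sorted((_h(f"{b}#{i}"), b)
                            for b in backends for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def backend_for(self, session_key: str) -> str:
        # First ring position clockwise of the key's hash, wrapping around
        idx = bisect.bisect(self._points, _h(session_key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["10.0.1.10", "10.0.1.11", "10.0.1.12"])
assert ring.backend_for("session-abc") == ring.backend_for("session-abc")
```

Because only the keys adjacent to a removed backend's ring positions move, most pinned sessions survive a backend failure — unlike a plain modulo hash, which reshuffles nearly everything.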

Common Pitfalls

Diagnostic Commands

# Check DNS resolution from multiple resolvers
dig api.example.com +short @8.8.8.8
dig api.example.com +short @1.1.1.1

# Trace full DNS resolution path
dig api.example.com +trace

# Check DNS response with ECS (EDNS Client Subnet)
dig api.example.com +subnet=198.51.100.0/24 @8.8.8.8

# Verify TTL values
dig api.example.com +noall +answer

# Test health check endpoint
curl -s -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" https://api.example.com/health

# Check Route 53 health check status
aws route53 get-health-check-status --health-check-id <HC_ID>

# List Cloudflare load balancer pool health
curl -s "https://api.cloudflare.com/client/v4/zones/{zone}/load_balancers/pools/{pool}/health" \
  -H "Authorization: Bearer $CF_TOKEN" | jq '.result'

# Monitor DNS resolution latency over time
while true; do echo "$(date +%T) $(dig api.example.com +noall +stats | grep 'Query time')"; sleep 10; done

# Check IRR route-object registration for the Anycast prefix (expected origin ASN)
whois -h whois.radb.net 198.51.100.0/24

Version History & Compatibility

| Technology | Current Version | Key Changes | Notes |
|---|---|---|---|
| AWS Route 53 | 2013-04-01 API | Application Recovery Controller (2023), IP-based routing (2022) | Most routing policies backwards-compatible since launch |
| Cloudflare LB | Adaptive (2024) | Adaptive LB GA, DNS-only mode, custom rules | Steering: off, geo, random, dynamic_latency, proximity, least_outstanding_requests |
| GCP Cloud DNS | v1 | Global anycast, routing policies (WRR, geolocation) (2022) | Integrated with Global External HTTP(S) LB |
| HAProxy GSLB | 2.9+ / ALOHA 15+ | DNS resolution in backends, server-templates | GSLB module requires ALOHA or Enterprise |
| NGINX Plus | R30+ | DNS SRV discovery, upstream resolve, geo module | GSLB requires NGINX Plus (not open-source) |

When to Use / When Not to Use

| Use When | Don't Use When | Use Instead |
|---|---|---|
| Backends in 2+ geographic regions, need to route users to nearest | All backends in a single datacenter or region | Local L4/L7 load balancer (HAProxy, NGINX, ALB) |
| Need automatic failover with <2 min recovery time | Need sub-second failover for real-time systems | Anycast with BGP failover or service mesh |
| Data sovereignty compliance (EU data stays in EU) | No regulatory requirements for data locality | Latency-based routing (simpler) |
| Multi-cloud deployment needing cloud-agnostic routing | Fully committed to single cloud provider | Cloud-native load balancing (ALB, Cloud LB) |
| Canary/blue-green deployments at region level | Canary within a single cluster | Kubernetes Ingress with traffic splitting |
| Distribute traffic based on capacity (weighted) | All regions have identical capacity | Simple GeoDNS or latency routing |

Important Caveats

Related Units