| Pattern | Type | Protocol | Latency | Throughput | Coupling | Best For | Technology Options |
|---|---|---|---|---|---|---|---|
| REST (JSON/HTTP) | Sync request-response | HTTP/1.1+ | 10-100ms | Moderate (~2-4K rps/instance) | Temporal + spatial | Public APIs, CRUD operations | Express, FastAPI, Spring Boot, Go net/http |
| gRPC (Protobuf) | Sync request-response | HTTP/2 | 1-20ms | High (~7-9K rps/instance) | Temporal + spatial | Internal service calls, polyglot | grpc-go, grpc-java, grpc-node, grpc-python |
| gRPC Streaming | Sync bidirectional | HTTP/2 | Sub-ms per msg | Very high | Temporal + spatial | Real-time data feeds, chat | Same gRPC libs + streaming APIs |
| GraphQL | Sync request-response | HTTP/1.1+ | 10-200ms | Moderate | Temporal + spatial | API aggregation, BFF pattern | Apollo Server, Hasura, graphql-go |
| Message Queue (P2P) | Async point-to-point | AMQP/STOMP | 5-50ms | High | Loose | Task distribution, work queues | RabbitMQ, Amazon SQS, Azure Service Bus |
| Pub/Sub Event Bus | Async publish-subscribe | Kafka/NATS | 2-20ms | Very high (1M+ msgs/s) | Very loose | Event-driven, domain events | Apache Kafka, NATS, Google Pub/Sub |
| Event Sourcing | Async event log | Kafka/EventStore | 10-100ms | High | Very loose | Audit trails, CQRS | EventStoreDB, Kafka + custom projections |
| Saga (Choreography) | Async distributed tx | Events via broker | 100ms-10s | Moderate | Loose | Multi-service transactions | Kafka, RabbitMQ + saga state tracking |
| Saga (Orchestration) | Mixed sync+async | HTTP/gRPC + broker | 50ms-5s | Moderate | Moderate | Complex workflows, compensations | Temporal, Camunda, Step Functions |
| Service Mesh Sidecar | Sync (transparent) | HTTP/2, mTLS | +1-3ms overhead | Proxied | Loose (infra) | mTLS, retries, observability | Istio, Linkerd, Consul Connect |
| Webhooks | Async callback | HTTP/1.1+ | 100ms-30s | Low-moderate | Loose | External integrations, notifications | Custom HTTP endpoints |
| Shared Database | Sync shared state | SQL/NoSQL | 1-10ms | High | Very tight | Legacy migration only (anti-pattern) | PostgreSQL, MongoDB (avoid in greenfield) |
START
├── Need immediate response (request-response)?
│ ├── YES → Is this a public/external API?
│ │ ├── YES → REST (JSON over HTTP) -- universal client support
│ │ └── NO → Internal service-to-service?
│ │ ├── YES → Need streaming or bidirectional?
│ │ │ ├── YES → gRPC Streaming
│ │ │ └── NO → Is latency critical (<10ms)?
│ │ │ ├── YES → gRPC (Protobuf) -- 2-7x faster than REST
│ │ │ └── NO → gRPC preferred, REST acceptable
│ │ └── NO → API aggregation (BFF)?
│ │ ├── YES → GraphQL or API Gateway composition
│ │ └── NO → REST
│ └── NO → Fire-and-forget or event notification?
│ ├── YES → Single consumer (task queue)?
│ │ ├── YES → Message Queue (RabbitMQ, SQS)
│ │ └── NO → Multiple consumers need same event?
│ │ ├── YES → Pub/Sub Event Bus (Kafka, NATS)
│ │ └── NO → Point-to-point queue
│ └── NO → Multi-service transaction (saga)?
│ ├── YES → Simple flow (3-4 steps)?
│ │ ├── YES → Choreography-based Saga (events)
│ │ └── NO → Orchestration-based Saga (Temporal, Step Functions)
│ └── NO → Need audit trail / replay?
│ ├── YES → Event Sourcing (EventStoreDB, Kafka)
│ └── NO → Standard Pub/Sub
Map each service interaction as synchronous (needs response) or asynchronous (fire-and-forget / eventual). Draw a service dependency graph. Any cycle indicates incorrect boundaries. [src3]
Service A --[sync query]--> Service B
Service A --[async event]--> Event Bus --[subscribe]--> Service C
Service A --[async event]--> Event Bus --[subscribe]--> Service D
Verify: No service should have more than 2 synchronous downstream dependencies. Count sync edges per node.
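Both checks can be automated. A minimal sketch in Python (the service names and adjacency list are hypothetical; adapt to your own dependency inventory):

```python
# Dependency graph: service -> list of (downstream, "sync" | "async") edges
graph = {
    "order":     [("inventory", "sync"), ("bus", "async")],
    "inventory": [],
    "bus":       [("payment", "async"), ("shipping", "async")],
    "payment":   [],
    "shipping":  [],
}

# Rule 1: no more than 2 synchronous downstream dependencies per service
for svc, edges in graph.items():
    sync_count = sum(1 for _, kind in edges if kind == "sync")
    assert sync_count <= 2, f"{svc} has {sync_count} sync dependencies"

# Rule 2: no cycles (DFS with a recursion stack)
def has_cycle(graph):
    visiting, done = set(), set()
    def dfs(node):
        if node in visiting:
            return True          # back edge -> cycle
        if node in done:
            return False
        visiting.add(node)
        if any(dfs(nbr) for nbr, _ in graph.get(node, [])):
            return True
        visiting.discard(node)
        done.add(node)
        return False
    return any(dfs(n) for n in graph)

print(has_cycle(graph))  # False -> boundaries are acyclic
```

Run this in CI against a machine-readable service map so a new sync edge or cycle fails the build before it ships.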
Define service contracts using Protocol Buffers. gRPC generates client/server stubs in all major languages from a single .proto file. [src4]
// order_service.proto
syntax = "proto3";

package order;

service OrderService {
  rpc GetOrder (GetOrderRequest) returns (OrderResponse);
  rpc CreateOrder (CreateOrderRequest) returns (OrderResponse);
  rpc StreamOrderUpdates (GetOrderRequest) returns (stream OrderEvent);
}

message GetOrderRequest {
  string order_id = 1;
}

message CreateOrderRequest {
  string customer_id = 1;
  repeated OrderItem items = 2;
}

message OrderItem {
  string product_id = 1;
  int32 quantity = 2;
  int64 price_cents = 3;
}

message OrderResponse {
  string order_id = 1;
  string status = 2;
  int64 total_cents = 3;
  string created_at = 4;
}

message OrderEvent {
  string order_id = 1;
  string event_type = 2;
  string timestamp = 3;
}
Verify: buf lint order_service.proto reports no issues (plain protoc has no built-in linter). Generate stubs: protoc --go_out=. --go-grpc_out=. order_service.proto
Choose a message broker based on your throughput and ordering needs. Kafka for high-throughput ordered event streams; RabbitMQ for flexible routing and work queues. [src5]
# docker-compose.yml -- local development Kafka setup (KRaft mode, no ZooKeeper)
services:
  kafka:
    image: apache/kafka:3.7.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@localhost:9093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_LOG_DIRS: /tmp/kraft-logs
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
Verify: docker exec kafka /opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 --list exits cleanly (an empty topic list means the broker is healthy).
Wrap every synchronous outbound call with a circuit breaker. This prevents cascading failures when a downstream service is slow or unavailable. [src1]
# Python with pybreaker (circuit breaker) + tenacity (retries)
# Note: tenacity has no CircuitBreaker class; pybreaker provides one.
import httpx
import pybreaker
from tenacity import retry, stop_after_attempt, wait_exponential

breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=30)

@breaker
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=0.5, max=5))
def call_inventory_service(product_id: str) -> dict:
    resp = httpx.get(f"http://inventory-service/api/v1/stock/{product_id}", timeout=5.0)
    resp.raise_for_status()
    return resp.json()
Verify: Kill inventory service, call 6 times -> circuit opens. Wait 30s -> circuit half-opens.
Every async message consumer must handle duplicate deliveries gracefully. Use an idempotency key (event ID) stored in a deduplication table. [src6]
# Idempotent Kafka consumer with deduplication
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'order-events',
    bootstrap_servers='localhost:9092',
    group_id='payment-service',
    enable_auto_commit=False,  # commit manually, only after processing
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

processed_ids = set()  # In production: use Redis or a DB table

for message in consumer:
    event = message.value
    event_id = event['event_id']
    if event_id in processed_ids:
        consumer.commit()  # Skip duplicate, commit offset
        continue
    process_payment(event)  # Business logic, defined elsewhere
    processed_ids.add(event_id)
    consumer.commit()
Verify: Publish same event twice with identical event_id -> consumer processes it only once.
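The in-memory set above loses state on every restart. A restart-safe sketch, assuming a Redis-like store with atomic set-if-absent (SET NX) semantics; DictStore is a hypothetical in-memory stand-in for the real client:

```python
class DictStore:
    """Stand-in for a Redis client; real code would call redis.set(key, 1, nx=True)."""
    def __init__(self):
        self._data = {}

    def set_if_absent(self, key: str) -> bool:
        if key in self._data:
            return False
        self._data[key] = 1
        return True

def handle_event(store, event, process) -> bool:
    """Process an event at most once per event_id; returns True if processed."""
    if not store.set_if_absent(f"dedup:{event['event_id']}"):
        return False  # duplicate delivery -> skip
    process(event)
    return True

store = DictStore()
seen = []
event = {"event_id": "evt-1", "amount": 500}
handle_event(store, event, seen.append)  # processed
handle_event(store, event, seen.append)  # duplicate, skipped
print(len(seen))  # 1
```

Caveat: marking before processing risks dropping an event if the process crashes between the mark and the work; pairing the mark and the business write inside one DB transaction closes that gap.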
Propagate trace context (W3C Trace Context or B3) across all communication boundaries -- sync and async. Without this, debugging cross-service issues is nearly impossible. [src2]
# OpenTelemetry instrumentation for gRPC + Kafka
from opentelemetry import trace
from opentelemetry.instrumentation.grpc import GrpcInstrumentorClient
from opentelemetry.instrumentation.kafka import KafkaInstrumentor

# Auto-instrument gRPC client calls
GrpcInstrumentorClient().instrument()
# Auto-instrument Kafka producer/consumer (kafka-python)
KafkaInstrumentor().instrument()

# Manual span for business logic
tracer = trace.get_tracer("order-service")
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    # gRPC call -- trace context propagated automatically
    inventory = inventory_stub.CheckStock(request)
    # Kafka publish -- trace context injected into headers
    producer.send('order-events', value=event)
Verify: curl http://jaeger:16686/api/traces?service=order-service -> traces span across services.
# Input: Kafka messages on 'order-created' topic
# Output: Payment processing + 'payment-completed' event
import asyncio
import json
from aiokafka import AIOKafkaConsumer, AIOKafkaProducer

async def payment_event_handler():
    consumer = AIOKafkaConsumer(
        'order-created',
        bootstrap_servers='kafka:9092',
        group_id='payment-service',
        value_deserializer=lambda m: json.loads(m.decode())
    )
    producer = AIOKafkaProducer(
        bootstrap_servers='kafka:9092',
        value_serializer=lambda v: json.dumps(v).encode()
    )
    await consumer.start()
    await producer.start()
    try:
        async for msg in consumer:
            order = msg.value
            result = await process_payment(order['order_id'], order['total_cents'])
            # send_and_wait blocks until the broker acks delivery
            await producer.send_and_wait('payment-completed', value={
                'order_id': order['order_id'],
                'payment_id': result['payment_id'],
                'status': 'completed'
            })
    finally:
        await consumer.stop()
        await producer.stop()

asyncio.run(payment_event_handler())
// Input: gRPC requests to OrderService
// Output: Order responses with per-request logging via a unary interceptor
package main

import (
    "context"
    "log"
    "net"
    "time"

    "google.golang.org/grpc"
    pb "myapp/proto/order"
)

func unaryInterceptor(ctx context.Context, req interface{},
    info *grpc.UnaryServerInfo, handler grpc.UnaryHandler,
) (interface{}, error) {
    start := time.Now()
    resp, err := handler(ctx, req)
    log.Printf("method=%s duration=%s error=%v",
        info.FullMethod, time.Since(start), err)
    return resp, err
}

func main() {
    lis, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatalf("listen: %v", err)
    }
    srv := grpc.NewServer(grpc.UnaryInterceptor(unaryInterceptor))
    pb.RegisterOrderServiceServer(srv, &orderServer{}) // orderServer implements the generated interface
    log.Fatal(srv.Serve(lis))
}
// Input: HTTP requests to downstream services
// Output: Resilient responses with fallback on failure
import CircuitBreaker from 'opossum';

const breakerOptions = {
  timeout: 3000,                 // 3s timeout per request
  errorThresholdPercentage: 50,  // open after 50% of requests fail
  resetTimeout: 30000            // 30s before half-open
};

async function fetchInventory(productId: string): Promise<InventoryResponse> {
  const res = await fetch(`http://inventory-svc/api/v1/stock/${productId}`);
  if (!res.ok) throw new Error(`Inventory service error: ${res.status}`);
  return res.json();
}

const breaker = new CircuitBreaker(fetchInventory, breakerOptions);
breaker.fallback((productId: string) => ({ productId, inStock: null, cached: true }));
breaker.on('open', () => console.warn('Circuit OPEN: inventory-svc'));

// Usage: const stock = await breaker.fire('product-123');
// BAD -- 5-hop synchronous chain: one slow service kills everything
// Order -> Inventory -> Pricing -> Tax -> Shipping -> Notification
// Total latency = sum of all latencies; one failure = total failure
POST /orders
-> GET inventory-svc/stock/{id} // 50ms
-> GET pricing-svc/price/{id} // 30ms
-> GET tax-svc/calculate // 40ms
-> POST shipping-svc/estimate // 100ms
-> POST notification-svc/send // 200ms
// Total: 420ms best case. Any timeout cascades upward.
// GOOD -- Max 1 sync hop; rest is async via events
POST /orders
-> GET inventory-svc/stock/{id} // 1 sync hop (needs real-time answer)
<- 201 Created (return to client)
-> publish 'order-created' event // Async from here
-> payment-svc consumes // Independent
-> shipping-svc consumes // Independent
-> notification-svc consumes // Independent
// Total sync latency: ~50ms. Async services process in parallel.
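The GOOD flow above can be sketched as a plain handler. check_stock, publish, and the in-process event list are hypothetical stand-ins for the real inventory client and Kafka producer:

```python
import uuid

def check_stock(product_id: str) -> bool:
    """Stand-in for the single synchronous call to inventory-svc."""
    return True

published = []  # stand-in for the event bus producer

def publish(topic: str, event: dict) -> None:
    published.append((topic, event))

def create_order(customer_id: str, product_id: str) -> dict:
    # 1 sync hop: only the check that needs a real-time answer
    if not check_stock(product_id):
        return {"status": 409, "error": "out of stock"}
    order_id = str(uuid.uuid4())
    # Return to the client immediately; payment, shipping, and
    # notification all consume the event independently
    publish("order-created", {"order_id": order_id, "customer_id": customer_id})
    return {"status": 201, "order_id": order_id}

resp = create_order("cust-1", "prod-9")
print(resp["status"], published[0][0])  # 201 order-created
```

The point of the sketch: the client-facing latency is bounded by the one sync call, and adding a new consumer of 'order-created' requires no change to create_order.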
# BAD -- processing duplicate messages charges customer twice
def handle_payment(event):
    charge_customer(event['customer_id'], event['amount'])  # No dedup!
    db.insert('payments', event)  # Duplicate row on retry

# GOOD -- idempotency key prevents double-processing
def handle_payment(event):
    if db.exists('processed_events', event['event_id']):
        return  # Already processed, skip
    charge_customer(event['customer_id'], event['amount'])
    db.insert('payments', {**event, 'processed_at': now()})
    db.insert('processed_events', {'event_id': event['event_id']})
# Use a DB transaction to make both inserts atomic
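That transactional variant can be sketched with stdlib sqlite3; assume Postgres and a real payment gateway in production, with charge_customer stubbed here:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE payments (event_id TEXT, customer_id TEXT, amount INTEGER);
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
""")

def charge_customer(customer_id, amount):
    pass  # stand-in for the payment gateway call

def handle_payment(event):
    try:
        with db:  # one transaction: both inserts commit or neither does
            db.execute("INSERT INTO processed_events VALUES (?)",
                       (event["event_id"],))
            charge_customer(event["customer_id"], event["amount"])
            db.execute("INSERT INTO payments VALUES (?, ?, ?)",
                       (event["event_id"], event["customer_id"], event["amount"]))
    except sqlite3.IntegrityError:
        pass  # duplicate event_id -> PRIMARY KEY rejects it, nothing is charged

event = {"event_id": "evt-42", "customer_id": "cust-1", "amount": 500}
handle_payment(event)
handle_payment(event)  # duplicate: rejected atomically
print(db.execute("SELECT COUNT(*) FROM payments").fetchone()[0])  # 1
```

The PRIMARY KEY on processed_events does the dedup check and the insert in one step, so there is no check-then-act race between concurrent consumers.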
// BAD -- browsers don't support HTTP/2 trailers (gRPC requirement)
// Frontend JS cannot call gRPC endpoints directly
const client = new OrderServiceClient('https://api.example.com:50051');
// This fails: browsers use HTTP/1.1 or HTTP/2 without trailer support
# GOOD -- Envoy proxy transcodes gRPC to gRPC-Web for browsers
# envoy.yaml (fragment)
listeners:
  - filter_chains:
      - filters:
          - name: envoy.filters.network.http_connection_manager
            typed_config:
              http_filters:
                - name: envoy.filters.http.grpc_web  # Transcodes for browsers
                - name: envoy.filters.http.router
# OR: Use an API gateway that exposes REST -> gRPC internally
# Browser -> REST (API Gateway) -> gRPC (internal services)
-- BAD -- two services reading/writing the same 'orders' table.
-- Order Service and Shipping Service both do:
SELECT * FROM orders WHERE status = 'pending';
UPDATE orders SET status = 'shipped' WHERE id = ?;
-- Schema changes in one service break the other. Tight coupling.

-- GOOD -- Order Service owns 'orders'; Shipping Service owns 'shipments'.
-- Communication via events. Order Service publishes:
--   { "event": "order_placed", "order_id": "123", "items": [...] }
-- Shipping Service consumes the event and writes to its own table:
INSERT INTO shipments (order_id, status) VALUES ('123', 'pending');
-- No shared database. Schema changes are independent.
- Set connect_timeout (1-3s) and read_timeout (3-10s) on every HTTP client instance. [src1]
- Set max.poll.records (Kafka) or prefetch_count (RabbitMQ) to limit in-flight messages. [src5]
- Propagate an X-Correlation-ID or W3C traceparent header in every outbound call and message. [src2]
# Check if gRPC service is healthy
grpcurl -plaintext localhost:50051 grpc.health.v1.Health/Check
# List available gRPC services
grpcurl -plaintext localhost:50051 list
# Check Kafka topic lag (consumer behind producer)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group payment-service
# Check RabbitMQ queue depth
rabbitmqctl list_queues name messages_ready messages_unacknowledged
# Test REST endpoint with timing
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
http://order-service:8080/api/v1/orders/123
# Check service mesh proxy status (Istio)
istioctl proxy-status
# Verify mTLS between services (Istio; 'istioctl authn tls-check' was removed in 1.5)
istioctl x describe svc order-service
| Technology | Current Version | Key Change | Notes |
|---|---|---|---|
| gRPC | v1.62+ (2024) | Mature xDS load-balancing support | Opt in via GRPC_XDS_BOOTSTRAP env var |
| Apache Kafka | 3.7+ (2024) | KRaft mode GA (no ZooKeeper) | Migrate from ZooKeeper before Kafka 4.0 removes support |
| RabbitMQ | 3.13+ (2024) | Khepri metadata store (replaces Mnesia) | Optional; improves cluster stability |
| Istio | 1.21+ (2024) | Ambient mesh (sidecar-less option) | Reduces per-pod overhead by ~50% |
| gRPC-Web | 1.5+ (2023) | Stable for production | Use with Envoy or grpc-web npm package |
| NATS | 2.10+ (2024) | JetStream improvements | Competing alternative to Kafka for lighter workloads |
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Services need independent deployment and scaling | Team is <5 engineers or domain is not well understood | Modular monolith with clear module boundaries |
| Different services need different languages/frameworks | All services share the same database anyway | Monolith or modular monolith |
| You need fault isolation (one service failure != total failure) | Latency budget is <5ms for the full request path | In-process function calls (monolith) |
| Event-driven workflows with multiple independent consumers | You need strict ACID transactions across services | Shared database or distributed transaction coordinator |
| High-throughput async processing (>10K events/s) | Simple CRUD app with <1K users | REST monolith or serverless functions |