GraphQL API Architecture at Scale

How do I architect a GraphQL API at scale?

TL;DR

Bottom line: At scale, use Apollo Federation v2 to compose a supergraph from domain-owned subgraphs behind a single gateway -- this gives teams independent deployability while clients see one unified API.
Key tool/command: rover supergraph compose --config supergraph.yaml
Watch out for: N+1 queries in resolvers -- without DataLoader batching, a single nested GraphQL query can trigger hundreds of database calls.
Works with: Apollo Federation v2 + Apollo Router, Netflix DGS (Java/Kotlin), GraphQL Mesh, any language with a GraphQL server library (graphql-js, graphql-java, gqlgen, Strawberry).

Constraints

Always implement query depth limiting and complexity analysis in production -- unbounded GraphQL queries can cause exponential database load and denial-of-service
Never expose GraphQL introspection in production environments -- it reveals your entire schema to attackers and enables automated exploitation
DataLoader (or equivalent batching) is mandatory for any resolver that fetches from a database or external service -- without it, N+1 queries will destroy performance
Persisted queries or an allowlist of approved operations must be enforced in production to prevent arbitrary query injection and abuse
Federation gateway (Apollo Router / GraphQL Mesh) must be the only public entry point -- never expose subgraph endpoints directly to clients

Quick Reference

Component	Role	Technology Options	Scaling Strategy
Gateway / Router	Schema composition, query planning, routing to subgraphs	Apollo Router (Rust), GraphQL Mesh, Cosmo Router	Horizontal -- stateless, deploy behind LB; cache query plans
Subgraph Services	Domain-owned partial schema + resolvers	Apollo Server, Netflix DGS, gqlgen (Go), Strawberry (Python)	Horizontal -- independent scaling per domain team
Schema Registry	Version control, composition validation, breaking change detection	Apollo GraphOS, Hive (open-source), Cosmo	Central -- single registry, CI/CD integration
DataLoader / Batching	Batch + cache data fetches within a single request	graphql/dataloader (JS), Spring BatchLoader (DGS), dataloaden (Go)	Per-request instance -- no cross-request caching
Query Complexity Analyzer	Reject queries exceeding cost threshold before execution	graphql-query-complexity, Apollo cost analysis plugin	Configured at gateway -- cost limits per client tier
Persisted Query Store	Map query hashes to approved operations	Redis, CDN edge, Apollo APQ, Relay Compiler	Cache at edge -- hash lookup is O(1)
Caching Layer	Response + entity caching to reduce resolver execution	CDN (Cloudflare, Fastly), Redis, Apollo cache hints	Cache-Control headers + entity-level cache invalidation
Observability	Distributed tracing across gateway + subgraphs	OpenTelemetry, Apollo Studio, Datadog, Jaeger	Trace context propagation via HTTP headers
Rate Limiter	Per-client query budget based on complexity cost	Apollo Router plugins, Cloudflare WAF, custom middleware	Token bucket per API key; cost-based budgets
Auth / AuthZ	Authentication at gateway, authorization at resolver level	JWT validation at gateway, directive-based @auth in subgraphs	Gateway validates tokens; subgraphs enforce field-level access
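
The Rate Limiter row's "token bucket per API key; cost-based budgets" can be sketched as follows. This is a minimal in-memory illustration, not an Apollo Router plugin: the class name CostBudget, the capacity and refill numbers, and the injectable clock are all assumptions made for the example.

```javascript
// Token bucket whose tokens are query cost units, one bucket per API key.
class CostBudget {
  constructor({ capacity, refillPerSecond }) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.buckets = new Map(); // apiKey -> { tokens, updatedAt }
  }

  // Returns true and deducts `cost` if the key has enough budget left.
  // `nowMs` is injectable so the behaviour is deterministic in tests.
  tryConsume(apiKey, cost, nowMs = Date.now()) {
    const bucket =
      this.buckets.get(apiKey) || { tokens: this.capacity, updatedAt: nowMs };
    const elapsedSeconds = Math.max(0, nowMs - bucket.updatedAt) / 1000;
    bucket.tokens = Math.min(
      this.capacity,
      bucket.tokens + elapsedSeconds * this.refillPerSecond
    );
    bucket.updatedAt = nowMs;
    this.buckets.set(apiKey, bucket);
    if (cost > bucket.tokens) return false;
    bucket.tokens -= cost;
    return true;
  }
}

const budget = new CostBudget({ capacity: 1000, refillPerSecond: 100 });
console.log(budget.tryConsume("key-a", 630, 0));    // true: 370 tokens left
console.log(budget.tryConsume("key-a", 630, 1000)); // false: only 470 available
console.log(budget.tryConsume("key-a", 630, 4000)); // true: refilled to 770
```

In a multi-instance gateway the bucket state would need to live in a shared store such as Redis; the Map here only works for a single process.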

Decision Tree

START
├── Single team, <5 entity types, <1K QPS?
│   ├── YES → Monolith GraphQL server (Apollo Server / graphql-yoga / gqlgen)
│   └── NO ↓
├── 2-5 teams, shared schema ownership?
│   ├── YES → Schema stitching with GraphQL Mesh or modular monolith with schema modules
│   └── NO ↓
├── 6+ teams, each owns a domain (users, products, orders)?
│   ├── YES → Apollo Federation v2 with subgraph-per-team
│   └── NO ↓
├── Java/Kotlin ecosystem, Spring Boot stack?
│   ├── YES → Netflix DGS Framework with Federation support
│   └── NO ↓
├── >50K QPS, need edge caching and query plan optimization?
│   ├── YES → Apollo Router (Rust) + persisted queries + CDN caching + entity cache
│   └── NO ↓
├── Need to combine GraphQL + REST + gRPC backends?
│   ├── YES → GraphQL Mesh as a unifying gateway layer
│   └── NO ↓
└── DEFAULT → Start with monolith GraphQL server, extract to federation when team count exceeds 3

Step-by-Step Guide

1. Define your supergraph schema with domain boundaries

Map your domain into bounded contexts. Each domain team owns a subgraph with its core types. Use the @key directive to declare entity identity, allowing other subgraphs to extend types across boundaries. [src1]

# products subgraph -- owns Product type
type Product @key(fields: "id") {
  id: ID!
  name: String!
  price: Float!
  category: Category!
}

type Category {
  id: ID!
  name: String!
}

type Query {
  product(id: ID!): Product
  products(first: Int = 10, after: String): ProductConnection!
}

Verify: rover subgraph check <graph>@<variant> --schema products.graphql --name products → composition succeeds with no errors.

2. Implement entity references across subgraphs

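When one subgraph needs to reference or extend an entity owned by another, declare the same type with a matching @key and only the fields this subgraph contributes. The gateway fetches those fields by sending the entity's key to the subgraph, which turns it into an object through the type's __resolveReference function. [src1]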

# reviews subgraph -- references Product from products subgraph
type Product @key(fields: "id") {
  id: ID!
  reviews: [Review!]!
  averageRating: Float
}

type Review {
  id: ID!
  author: User!
  body: String!
  rating: Int!
  createdAt: DateTime!
}

// reviews subgraph -- resolve Product references
const resolvers = {
  Product: {
    __resolveReference(product) {
      return { id: product.id };
    },
    reviews(product) {
      return reviewsLoader.load(product.id);
    },
    averageRating(product) {
      return ratingsLoader.load(product.id);
    },
  },
};

Verify: Query through gateway: { product(id: "1") { name reviews { body rating } } } → returns product name from products subgraph and reviews from reviews subgraph in a single response.
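
To make the reference-resolution flow concrete, here is a simplified sketch of what a subgraph does when the gateway sends it entity representations. It is not Apollo's implementation: resolveEntities and the in-memory reviewsByProduct data are hypothetical, and the only detail taken from the federation subgraph specification is that each representation carries __typename plus the @key fields.

```javascript
// Hypothetical in-memory data standing in for the reviews database.
const reviewsByProduct = {
  "1": [{ id: "r1", body: "Great", rating: 5 }],
  "2": [],
};

// Resolver map shaped like the one in the reviews subgraph.
const resolvers = {
  Product: {
    // Turns a representation ({ __typename, id }) into a parent object.
    __resolveReference(ref) {
      return { id: ref.id };
    },
    reviews(product) {
      return reviewsByProduct[product.id] || [];
    },
  },
};

// Simplified stand-in for the subgraph's `_entities` field: for each
// representation, call the type's __resolveReference, then resolve the
// requested fields on the object it returns.
function resolveEntities(representations, fields) {
  return representations.map((ref) => {
    const typeResolvers = resolvers[ref.__typename];
    if (!typeResolvers || !typeResolvers.__resolveReference) return null;
    const entity = typeResolvers.__resolveReference(ref);
    if (entity == null) return null;
    const out = { __typename: ref.__typename };
    for (const field of fields) {
      out[field] = typeResolvers[field]
        ? typeResolvers[field](entity)
        : entity[field];
    }
    return out;
  });
}

const entities = resolveEntities(
  [{ __typename: "Product", id: "1" }, { __typename: "Product", id: "2" }],
  ["id", "reviews"]
);
console.log(JSON.stringify(entities));
```

The gateway then merges these partial objects with the fields it fetched from the products subgraph, which is why the key fields must match across subgraphs.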

3. Set up the federation gateway with Apollo Router

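Deploy Apollo Router as the single client-facing endpoint. The router does not compose schemas itself: it loads a supergraph schema that was composed beforehand (with rover supergraph compose or a schema registry), builds query plans from it, and routes operations to the appropriate subgraphs. Composition and runtime settings live in two separate files. [src1]

# supergraph.yaml -- Rover composition config (input to rover supergraph compose)
# Schema file paths are assumptions; point them at each subgraph's SDL file.
federation_version: =2.5.0
subgraphs:
  products:
    routing_url: http://products-service:4001/graphql
    schema:
      file: ./products.graphql
  reviews:
    routing_url: http://reviews-service:4002/graphql
    schema:
      file: ./reviews.graphql
  users:
    routing_url: http://users-service:4003/graphql
    schema:
      file: ./users.graphql

# router.yaml -- Apollo Router runtime config
supergraph:
  listen: 0.0.0.0:4000
  introspection: false  # disabled in production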

# Compose and validate the supergraph
rover supergraph compose --config supergraph.yaml > supergraph.graphql

# Start the router
./router --supergraph supergraph.graphql --config router.yaml

Verify: curl -X POST http://localhost:4000/ -H "Content-Type: application/json" -d '{"query":"{ __typename }"}' → returns {"data":{"__typename":"Query"}}.

4. Implement DataLoader for N+1 prevention

Create request-scoped DataLoader instances for every data source a resolver calls. DataLoader batches all .load(key) calls within a single tick into one batch function call. [src2]

// dataloaders.js -- Request-scoped DataLoader factory
const DataLoader = require("dataloader");
const db = require("./db");

function createLoaders() {
  return {
    productLoader: new DataLoader(async (ids) => {
      const products = await db.query(
        "SELECT * FROM products WHERE id = ANY($1)", [ids]
      );
      const map = new Map(products.map(p => [p.id, p]));
      return ids.map(id => map.get(id) || null);
    }),
  };
}

Verify: Enable query logging on database → a query for 10 products with reviews produces exactly 2 SQL queries, not 11.
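
The batching behaviour described above can be illustrated without the dataloader package. The sketch below is a deliberately tiny stand-in (TinyLoader and the recording data source inside demo are hypothetical, and real DataLoader has more options and a different scheduling implementation), but it shows the core idea: loads issued in the same tick are queued, deduplicated, and flushed as one batch call.

```javascript
// Minimal batching loader: NOT the dataloader package, just the core idea.
class TinyLoader {
  constructor(batchFn) {
    this.batchFn = batchFn; // (keys) => Promise of values in the same order
    this.queue = [];        // pending { key, resolve, reject } entries
    this.cache = new Map(); // per-instance (per-request) memoization
  }

  load(key) {
    if (this.cache.has(key)) return this.cache.get(key);
    const promise = new Promise((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      // Schedule a single flush when the first key of a batch is queued.
      if (this.queue.length === 1) process.nextTick(() => this.flush());
    });
    this.cache.set(key, promise);
    return promise;
  }

  async flush() {
    const batch = this.queue;
    this.queue = [];
    try {
      const values = await this.batchFn(batch.map((entry) => entry.key));
      batch.forEach((entry, i) => entry.resolve(values[i]));
    } catch (err) {
      batch.forEach((entry) => entry.reject(err));
    }
  }
}

async function demo() {
  // Hypothetical data source that records every call it receives.
  const calls = [];
  const loader = new TinyLoader(async (ids) => {
    calls.push(ids);
    return ids.map((id) => ({ id, name: `Product ${id}` }));
  });
  // Three loads in the same tick, one of them a duplicate key.
  const products = await Promise.all([
    loader.load(1),
    loader.load(2),
    loader.load(1),
  ]);
  return { products, calls };
}

demo().then(({ products, calls }) => {
  console.log(products.length, calls.length); // logs: 3 1
});
```

Because the cache lives on the instance, creating one loader per request keeps one user's data from leaking into another user's response.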

5. Add query complexity analysis and depth limiting

Configure the gateway to reject queries that exceed a maximum depth or complexity cost before execution. Assign cost weights to fields based on their resolver expense. [src4]

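// Apollo Server with depth and complexity validation rules
// (typeDefs and resolvers are assumed to be defined elsewhere)
const { ApolloServer } = require("@apollo/server");
const depthLimit = require("graphql-depth-limit");
const { createComplexityLimitRule } = require("graphql-validation-complexity");

const server = new ApolloServer({
  typeDefs,
  resolvers,
  validationRules: [
    depthLimit(10),  // reject queries nested deeper than 10 levels
    createComplexityLimitRule(1000, {
      scalarCost: 1,
      objectCost: 2,
      listFactor: 10,
    }),
  ],
});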

Verify: Send a deeply nested query (depth 15) → receive error. Send a wide query with many list fields → receive cost exceeded error.
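
To show how list nesting multiplies cost, here is a simplified model of a validation-time walk over a query's selection tree. It is an assumption-laden sketch, not the algorithm used by graphql-validation-complexity: the query is a hand-built object tree rather than a GraphQL AST, and the arithmetic only borrows the scalarCost, objectCost, and listFactor names from the configuration above.

```javascript
// Hypothetical selection tree: each node is a field with optional children.
// `list: true` marks a field that returns a list.
const query = {
  name: "products", list: true, children: [
    { name: "name" },
    { name: "reviews", list: true, children: [
      { name: "body" },
      { name: "author", children: [{ name: "name" }] },
    ] },
  ],
};

const weights = { scalarCost: 1, objectCost: 2, listFactor: 10 };

// Depth = number of nested field levels, counting this node.
function depthOf(node) {
  const children = node.children || [];
  if (children.length === 0) return 1;
  return 1 + Math.max(...children.map(depthOf));
}

// Cost model (an assumption): a leaf costs scalarCost; an object field costs
// objectCost plus its children; a list field multiplies that by listFactor.
function costOf(node, w) {
  const children = node.children || [];
  if (children.length === 0) return w.scalarCost;
  const own = w.objectCost + children.reduce((sum, c) => sum + costOf(c, w), 0);
  return node.list ? own * w.listFactor : own;
}

function check(node, { maxDepth, maxCost }, w = weights) {
  const depth = depthOf(node);
  const cost = costOf(node, w);
  const errors = [];
  if (depth > maxDepth) errors.push(`depth ${depth} exceeds ${maxDepth}`);
  if (cost > maxCost) errors.push(`cost ${cost} exceeds ${maxCost}`);
  return { depth, cost, errors };
}

const result = check(query, { maxDepth: 10, maxCost: 1000 });
console.log(result.depth, result.cost, result.errors.length); // logs: 4 630 0
```

Under this model a query only four levels deep already costs 630, because each list level multiplies everything beneath it by 10; one more nested list would push it past the limit of 1000.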

6. Implement persisted queries and client allowlisting

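Automatic Persisted Queries (APQ) let clients send a short hash instead of the full query text, which reduces request size and enables caching. APQ is not a security control on its own, because any client can register any query. To restrict which operations may run, pair it with an allowlist of operations compiled at build time from trusted clients. [src7]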

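import { ApolloServer } from "@apollo/server";
import Keyv from "keyv";
import { KeyvAdapter } from "@apollo/utils.keyvadapter";
// The redis:// URI also requires the @keyv/redis package to be installed.
// typeDefs and resolvers are assumed to be defined elsewhere.

const server = new ApolloServer({
  typeDefs,
  resolvers,
  persistedQueries: {
    cache: new KeyvAdapter(new Keyv("redis://redis:6379")),
    ttl: 86400,  // 24 hours, in seconds
  },
});
// Client sends hash first: { "extensions": { "persistedQuery": { "sha256Hash": "abc..." } } }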

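Verify: Send a request containing only the query's SHA-256 hash → the server returns PersistedQueryNotFound. Resend with the full query plus the hash → the query is registered and executed. Send the hash alone again → the server looks up the stored query text and executes it.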
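
The three-request handshake can be sketched with an in-memory store. This is an illustration of the protocol flow only, not Apollo Server's code: handleRequest, the Map-based store, and the EXECUTE and HashMismatch status strings are hypothetical names.

```javascript
const crypto = require("crypto");

const sha256 = (text) => crypto.createHash("sha256").update(text).digest("hex");

// In-memory stand-in for the Redis-backed cache in the configuration above.
const store = new Map();

// Simplified APQ handling for one request: { query?, hash }.
function handleRequest({ query, hash }) {
  if (!query) {
    // Hash-only request: serve from the store or ask the client to retry.
    const known = store.get(hash);
    return known
      ? { status: "EXECUTE", query: known }
      : { status: "PersistedQueryNotFound" };
  }
  // Query plus hash: verify the hash before registering the query.
  if (sha256(query) !== hash) return { status: "HashMismatch" };
  store.set(hash, query);
  return { status: "EXECUTE", query };
}

const query = '{ product(id: "1") { name } }';
const hash = sha256(query);

const first = handleRequest({ hash });          // not registered yet
const second = handleRequest({ query, hash });  // registers the query
const third = handleRequest({ hash });          // served by hash alone
console.log(first.status, second.status, third.status);
```

Nothing in this flow checks whether the query is one the API owner approved; an allowlist would reject the second request unless the hash was already known from a build-time manifest.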

Code Examples

TypeScript: Apollo Federation Subgraph with Authentication

// products-subgraph/index.ts -- Federated subgraph sketch
// Input:  GraphQL queries routed from Apollo Router
// Output: Product data with field-level authorization

import { ApolloServer } from "@apollo/server";
import { buildSubgraphSchema } from "@apollo/subgraph";
import { gql } from "graphql-tag";
// Assumed to exist in this subgraph: a resolver map that also enforces @auth.
import { resolvers } from "./resolvers";

const typeDefs = gql`
  extend schema @link(url: "https://specs.apollo.dev/federation/v2.5",
    import: ["@key", "@shareable", "@requires", "@external"])

  # @auth is a custom directive, so it must be declared here; the
  # declaration alone enforces nothing.
  directive @auth(requires: Role!) on FIELD_DEFINITION
  enum Role { ADMIN }

  type Product @key(fields: "id") {
    id: ID!
    name: String!
    price: Float!
    internalCost: Float @auth(requires: ADMIN)
  }

  # ProductConnection is assumed to be defined elsewhere in this schema.
  type Query {
    product(id: ID!): Product
    products(first: Int = 20, after: String): ProductConnection!
  }
`;

const server = new ApolloServer({
  schema: buildSubgraphSchema({ typeDefs, resolvers }),
});

Go: gqlgen Subgraph with DataLoader

// resolvers/product.go -- gqlgen resolver using graph-gophers/dataloader v7
// Input:  GraphQL product queries
// Output: Batched database responses

package resolvers

import (
    "context"

    "github.com/graph-gophers/dataloader/v7"
)

// Loaders holds request-scoped loaders. v7 loaders are generic, so keys and
// values are typed (this sketch assumes product IDs are strings).
type Loaders struct {
    ProductLoader *dataloader.Loader[string, *Product]
}

func (r *queryResolver) Product(ctx context.Context, id string) (*Product, error) {
    // Load returns a thunk; calling it blocks until the batch has run.
    thunk := r.Loaders.ProductLoader.Load(ctx, id)
    return thunk()
}

Anti-Patterns

Wrong: Exposing database structure as GraphQL schema

# BAD -- Schema mirrors database tables, not business domain
type products_table {
  product_id: Int!
  product_name: String
  category_fk: Int
  created_at: String
  is_deleted: Boolean
}
# Leaks implementation details, exposes internal IDs.

Correct: Design schema around business domain

# GOOD -- Schema represents business concepts [src3]
type Product @key(fields: "id") {
  id: ID!
  name: String!
  price: Money!
  category: Category!
  availability: Availability!
}
# Clean business types, strong typing, no leaked internals.

Wrong: Resolvers without DataLoader (N+1 problem)

// BAD -- Each product triggers a separate DB query
const resolvers = {
  Product: {
    reviews(product) {
      return db.query("SELECT * FROM reviews WHERE product_id = $1", [product.id]);
    },
  },
};
// 100 products = 101 database queries.

Correct: Batched resolvers with DataLoader

// GOOD -- DataLoader batches all review fetches into 1 query [src2]
const resolvers = {
  Product: {
    reviews(product, _, { loaders }) {
      return loaders.reviewsByProductLoader.load(product.id);
      // 100 products = 2 database queries total.
    },
  },
};

Wrong: No query depth or complexity limits

// BAD -- Accepts any query, no matter how expensive
const server = new ApolloServer({ typeDefs, resolvers });
// Attacker sends deeply nested query -> server runs out of memory.

Correct: Enforce depth + complexity limits at gateway

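// GOOD -- Reject expensive queries before execution [src4]
// depthLimit comes from graphql-depth-limit;
// createComplexityLimitRule comes from graphql-validation-complexity.
const server = new ApolloServer({
  typeDefs,
  resolvers,
  validationRules: [depthLimit(10), createComplexityLimitRule(1000)],
  introspection: false,
});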

Wrong: Offset-based pagination

# BAD -- Offset pagination degrades at scale
type Query {
  products(limit: Int, offset: Int): [Product]
}
# offset: 100000 -> database scans and skips 100K rows.

Correct: Cursor-based (Relay-style) pagination

# GOOD -- Cursor pagination is stable and performant [src3]
type Query {
  products(first: Int!, after: String): ProductConnection!
}
type ProductConnection {
  edges: [ProductEdge!]!
  pageInfo: PageInfo!
}
# Seeks via the index on the sort key, so cost does not grow with page position.
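
A resolver behind the connection schema above might look like the following sketch. It runs against an in-memory array instead of a database, and the cursor format (base64 of "id:<n>"), the MAX_PAGE_SIZE cap, and the function names are assumptions for illustration.

```javascript
// Hypothetical in-memory "table", already ordered by id (the sort key).
const products = Array.from({ length: 25 }, (_, i) => ({
  id: i + 1,
  name: `Product ${i + 1}`,
}));

// Opaque cursors: base64 of the sort key, so clients do not depend on its shape.
const encodeCursor = (id) => Buffer.from(`id:${id}`).toString("base64");
const decodeCursor = (cursor) =>
  Number(Buffer.from(cursor, "base64").toString("utf8").slice(3));

const MAX_PAGE_SIZE = 100; // assumed cap on `first`

function productsConnection({ first, after }) {
  const limit = Math.min(Math.max(first, 0), MAX_PAGE_SIZE);
  const afterId = after ? decodeCursor(after) : 0;
  // Equivalent of: SELECT ... WHERE id > $afterId ORDER BY id LIMIT $limit + 1
  // The extra row tells us whether another page exists.
  const rows = products.filter((p) => p.id > afterId).slice(0, limit + 1);
  const page = rows.slice(0, limit);
  return {
    edges: page.map((node) => ({ cursor: encodeCursor(node.id), node })),
    pageInfo: {
      hasNextPage: rows.length > limit,
      endCursor: page.length ? encodeCursor(page[page.length - 1].id) : null,
    },
  };
}

const page1 = productsConnection({ first: 10 });
const page2 = productsConnection({ first: 10, after: page1.pageInfo.endCursor });
console.log(page2.edges[0].node.id, page2.pageInfo.hasNextPage); // logs: 11 true
```

Fetching one row more than requested avoids a separate COUNT query for hasNextPage, and the WHERE id > ... predicate is what lets the database seek through an index instead of skipping rows.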

Common Pitfalls

N+1 queries in nested resolvers: The most common GraphQL performance issue. Every resolver that accesses a data source must use DataLoader. Fix: Create per-request DataLoader instances for every data source. [src2]
Federation composition failures in CI/CD: Subgraph schema changes that break composition are only caught when all subgraphs are composed together. Fix: Run rover subgraph check in CI for every PR. [src1]
Over-fetching via SELECT * in resolvers: Resolvers fetch all columns even when the client only requests 2 fields. Fix: Inspect info.fieldNodes to determine requested fields. [src3]
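Missing error handling in __resolveReference: If a subgraph's reference resolver throws, the error propagates to the gateway and can null out the entity or, through non-null fields, much larger parts of the response. Fix: Return null for missing entities so the rest of the query still resolves, and reserve thrown errors for real failures. [src1]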
Client-side caching issues with mutations: After a mutation, the client cache holds stale data. Fix: Return the mutated object in the mutation response; use refetchQueries for complex invalidation. [src7]
Unbounded list fields without pagination: A field like reviews: [Review!]! with no limit can return millions of rows. Fix: Always use connection-style pagination with a maximum first value. [src4]
Gateway timeout on slow subgraphs: One slow subgraph blocks the entire federated query. Fix: Set per-subgraph timeouts in router config; return partial data with errors. [src2]
Schema drift between subgraph and gateway: Deploying a subgraph before updating the supergraph causes runtime errors. Fix: Use a CI/CD pipeline that composes, validates, and deploys atomically. [src1]
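
For the SELECT * pitfall above, the following sketch shows one way to derive a column list from info.fieldNodes. The AST shape (Field nodes with name.value and selectionSet.selections) follows graphql-js, but the info object here is hand-built, fragments are ignored, and requestedFields, buildSelect, and PRODUCT_COLUMNS are hypothetical names.

```javascript
// Collects the names of fields selected directly under the resolved field.
// Simplification: fragment spreads and inline fragments are ignored.
function requestedFields(info) {
  const names = new Set();
  for (const fieldNode of info.fieldNodes) {
    const selections = fieldNode.selectionSet
      ? fieldNode.selectionSet.selections
      : [];
    for (const selection of selections) {
      if (selection.kind === "Field") names.add(selection.name.value);
    }
  }
  return [...names];
}

// Hypothetical allowlist mapping GraphQL fields to column names, so that
// client-supplied names are never interpolated into SQL directly.
const PRODUCT_COLUMNS = new Map([
  ["id", "id"],
  ["name", "name"],
  ["price", "price"],
]);

function buildSelect(info) {
  const columns = requestedFields(info)
    .map((field) => PRODUCT_COLUMNS.get(field))
    .filter(Boolean);
  if (!columns.includes("id")) columns.unshift("id"); // the key is always needed
  return `SELECT ${columns.join(", ")} FROM products WHERE id = ANY($1)`;
}

// Hand-built stand-in for `info` on a `product { name price __typename }` query.
const info = {
  fieldNodes: [{
    kind: "Field",
    name: { value: "product" },
    selectionSet: { selections: [
      { kind: "Field", name: { value: "name" } },
      { kind: "Field", name: { value: "price" } },
      { kind: "Field", name: { value: "__typename" } },
    ] },
  }],
};

console.log(buildSelect(info));
```

This trades some batching efficiency for narrower reads: a DataLoader batch may mix requests that need different columns, so the column set usually has to be the union across the batch.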

Diagnostic Commands

# Validate supergraph composition
rover supergraph compose --config supergraph.yaml

# Check a subgraph against the deployed supergraph
rover subgraph check <graph>@<variant> --schema products.graphql --name products

# Introspect a running subgraph (development only)
rover subgraph introspect http://localhost:4001/graphql

# Test query execution through the gateway
curl -X POST http://localhost:4000/ \
  -H "Content-Type: application/json" \
  -d '{"query":"{ product(id: \"1\") { name price reviews { rating } } }"}'

# Monitor Apollo Router metrics (Prometheus)
curl http://localhost:9090/metrics | grep apollo_router

# Trace query execution plan in Apollo Router
APOLLO_ROUTER_LOG=apollo_router::query_planner=debug ./router --supergraph supergraph.graphql

Version History & Compatibility

Version	Status	Breaking Changes	Migration Notes
GraphQL Spec Sep 2025	Current	Schema Coordinates, OneOf inputs	First spec update since Oct 2021
Apollo Federation v2.5+	Current	None since v2.0	Requires @link directive
Apollo Federation v2.0	Stable	@key syntax changed from v1	Add @link import, replace @requires syntax
Apollo Federation v1	Deprecated	---	Upgrade to v2 for @shareable, @override
Netflix DGS 9.x	Current (Spring Boot 3.x)	Requires Java 17+	Aligns with Spring GraphQL
Netflix DGS 7.x	Maintenance	---	Upgrade to 9.x for Spring Boot 3
GraphQL Mesh v1.x	Current	---	Replaces 0.x with stable API

When to Use / When Not to Use

Use When	Don't Use When	Use Instead
Multiple teams need to contribute to a unified API independently	Single team owns the entire API surface	Monolith GraphQL server (Apollo Server, graphql-yoga)
Clients need flexible, nested data fetching in a single request	Simple CRUD with flat resources and no nesting	REST API with OpenAPI spec
Mobile + web clients have very different data needs for the same screen	All clients need identical data shapes	REST with versioned endpoints
Need to compose multiple backend services into one API	Single backend database with no microservices	Direct GraphQL-to-database (Hasura, PostGraphile)
Schema evolution without breaking clients is critical	Strict API versioning is acceptable	REST with content negotiation
Need real-time subscriptions alongside queries	Only request-response needed, no real-time	REST API or gRPC for service-to-service

Important Caveats

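Apollo Federation v2 features are opt-in per subgraph through the @link directive and its import list. Subgraphs without it are composed as Federation v1 schemas, so migration can be gradual, but each subgraph must be updated before it can use v2 directives such as @shareable or @override.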
Netflix runs 800+ DGS subgraphs in production with federation -- but they built extensive internal tooling. Do not assume the open-source DGS framework alone gives you Netflix-scale capabilities.
Shopify's GraphQL design tutorial recommends designing around business domain objects, not database tables -- this is the single most impactful architecture decision.
GraphQL over HTTP is not yet fully standardized (working draft as of 2025). Most implementations use POST with JSON body, but GET for persisted queries varies.
Federation adds latency per hop: a query touching 3 subgraphs needs at least 3 subgraph requests from the gateway, and fetches that depend on another subgraph's result must run sequentially. For latency-sensitive paths, denormalize data into a single subgraph or use entity caching.