How to Design a CI/CD Pipeline Architecture
How do I design a CI/CD pipeline architecture?
TL;DR
- Bottom line: A well-designed CI/CD pipeline connects source control triggers to isolated runners, executes build/test/security stages with fast feedback, produces signed immutable artifacts, and promotes them through environments with explicit gates -- targeting < 10 min for the fast-feedback loop and < 1 day total lead time.
- Key tool/command: .github/workflows/ci.yml (GitHub Actions) or .gitlab-ci.yml (GitLab CI) -- declarative YAML defining stages, jobs, and deployment gates.
- Watch out for: Building artifacts multiple times (once per environment) instead of building once and promoting -- this is the #1 source of "works in staging, fails in prod" bugs.
- Works with: GitHub Actions, GitLab CI, Jenkins 2.x+, CircleCI, Azure DevOps, AWS CodePipeline, Tekton. Concepts are platform-agnostic.
Constraints
- Never store secrets in pipeline YAML or version control -- use platform secret stores with short-lived credentials
- Build artifacts exactly once and promote the same binary/image through all environments
- Fast-feedback stages (lint, unit tests, SAST) must complete in under 10 minutes
- Never deploy to production without at least one automated or manual gate
- Pin all CI tool versions, runner images, and action references to SHA or exact version tags
- Quarantine flaky tests immediately -- >2% flake rate erodes CI trust and leads to ignored failures
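The 2% flake-rate constraint can be checked mechanically. The sketch below is one possible way to do it, assuming you already count how many runs of a test gave inconsistent results; the function names `flake_rate` and `should_quarantine` are hypothetical helpers, not part of any CI platform.

```shell
#!/bin/sh
# flake_rate FLAKY_RUNS TOTAL_RUNS -> prints the flake rate as a percentage.
flake_rate() {
  awk -v f="$1" -v t="$2" 'BEGIN {
    if (t == 0) { print "0.00"; exit }
    printf "%.2f\n", (f * 100) / t
  }'
}

# should_quarantine FLAKY_RUNS TOTAL_RUNS [THRESHOLD_PERCENT]
# Succeeds (exit 0) when the flake rate is strictly above the threshold.
# The default threshold of 2 comes from the constraint above.
should_quarantine() {
  awk -v f="$1" -v t="$2" -v max="${3:-2}" 'BEGIN {
    exit !((t > 0) && (f * 100 > max * t))
  }'
}

# Example: 3 flaky runs out of 100 is above the 2% line.
if should_quarantine 3 100; then
  echo "quarantine: flake rate $(flake_rate 3 100)%"
fi
```

The comparison uses integer-friendly arithmetic (`f * 100 > max * t`) so that a rate of exactly 2% does not trip the check through rounding.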
Quick Reference
| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Source Control Trigger | Detects code changes, initiates pipeline | GitHub webhooks, GitLab push events, Jenkins SCM polling | Event-driven (no polling); branch filters to limit triggers |
| Pipeline Orchestrator | Defines stages, manages job dependencies, gates | GitHub Actions, GitLab CI, Jenkins Pipeline, CircleCI, Tekton | Declarative YAML; parallel job execution; DAG-based scheduling |
| Build Runner / Executor | Executes pipeline jobs in isolated environments | GitHub-hosted runners, GitLab shared runners, Jenkins agents, self-hosted runners | Auto-scaling runner pools; ephemeral containers; spot/preemptible instances |
| Artifact Registry | Stores immutable build outputs (images, packages) | Docker Hub, GitHub Packages, GitLab Container Registry, AWS ECR, Artifactory | Content-addressable storage; geo-replicated registries; retention policies |
| Test Framework | Validates correctness at unit, integration, E2E levels | Jest, pytest, JUnit, Cypress, Playwright | Parallel test sharding; intelligent test selection; flaky test quarantine |
| Security Scanner (SAST/DAST) | Shift-left vulnerability detection | Snyk, Semgrep, Trivy, CodeQL, SonarQube | Run SAST in parallel with unit tests; DAST on staging only |
| Secret Manager | Provides credentials to pipeline without exposure | GitHub Secrets, GitLab CI Variables, HashiCorp Vault, AWS Secrets Manager | OIDC federation for cloud providers; short-lived tokens; no static keys |
| Deployment Controller | Manages rollout strategy to target environments | Kubernetes (kubectl/Helm), ArgoCD, AWS CodeDeploy, Terraform | Canary/blue-green via progressive delivery; automated rollback on metric degradation |
| Artifact Signer / SBOM | Supply chain integrity and provenance | Sigstore/Cosign, SLSA provenance, Syft (SBOM), in-toto attestations | Keyless signing via OIDC; SLSA Level 3 with isolated builders |
| Notification / Observability | Pipeline status feedback and metrics | Slack/Teams webhooks, Datadog CI Visibility, Grafana, DORA dashboards | Track 4 DORA metrics; alert on failure rate spikes; pipeline duration trends |
| Environment Manager | Manages staging, preview, production targets | Kubernetes namespaces, Terraform workspaces, Vercel/Netlify preview deploys | Ephemeral preview environments per PR; auto-cleanup on merge |
| Cache Layer | Speeds up repeated builds by reusing dependencies | GitHub Actions cache, GitLab CI cache, S3-backed caches, Turborepo | Cache by lockfile hash; layer caching for Docker; distributed remote caches |
Decision Tree
Platform Selection
START: Choose your CI/CD platform
|
+-- Already using GitHub for source control?
| +-- YES --> GitHub Actions (native integration, largest marketplace)
| +-- NO |
|
+-- Need built-in container registry + security scanning + GitOps?
| +-- YES --> GitLab CI (all-in-one DevOps platform)
| +-- NO |
|
+-- Require maximum customization + plugin ecosystem?
| +-- YES --> Is your team willing to manage infrastructure?
| | +-- YES --> Jenkins (self-hosted, 1800+ plugins)
| | +-- NO --> CircleCI or GitHub Actions (managed)
| +-- NO |
|
+-- Enterprise with Azure/Microsoft ecosystem?
| +-- YES --> Azure DevOps Pipelines
| +-- NO |
|
+-- AWS-native infrastructure with CodeCommit/CodeBuild?
| +-- YES --> AWS CodePipeline + CodeBuild + CodeDeploy
| +-- NO |
|
+-- Kubernetes-native, want pipelines-as-code in K8s?
| +-- YES --> Tekton Pipelines
| +-- NO --> GitHub Actions (safest default for most teams)
Scaling Decision
+-- Team size < 5, deploys weekly?
| +-- Single workflow file, manual deploy gate
|
+-- Team 5-20, deploys daily?
| +-- Multi-stage pipeline, automated staging deploy, manual prod approval
|
+-- Team 20+, multiple deploys/day?
| +-- Monorepo: path-based triggers + parallel jobs
| +-- Microservices: per-service pipelines + shared reusable workflows
|
+-- Enterprise 100+, continuous deployment?
| +-- GitOps (ArgoCD/Flux) + progressive delivery (Argo Rollouts/Flagger)
| +-- DORA metrics dashboard + deployment frequency tracking
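As a rough illustration, the scaling tree can be written as a lookup on team size. This is only a sketch: it ignores deploy frequency, and it assumes a team of exactly 20 falls into the 5-20 branch, which the tree leaves ambiguous. `recommend_pipeline` is a hypothetical name.

```shell
#!/bin/sh
# recommend_pipeline TEAM_SIZE -> prints the pipeline shape from the scaling tree.
recommend_pipeline() {
  size="$1"
  if [ "$size" -lt 5 ]; then
    echo "single workflow file, manual deploy gate"
  elif [ "$size" -le 20 ]; then
    echo "multi-stage pipeline, automated staging deploy, manual prod approval"
  elif [ "$size" -lt 100 ]; then
    echo "path-based triggers or per-service pipelines with shared reusable workflows"
  else
    echo "GitOps with progressive delivery and DORA metrics dashboard"
  fi
}

recommend_pipeline 12
```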
Step-by-Step Guide
1. Define pipeline stages and fast-feedback loop
Structure your pipeline into discrete stages that run in order, with fast checks first. The goal is to catch 80% of issues in the first 5 minutes. [src1] [src2]
# Canonical stage ordering (platform-agnostic concept)
stages:
- lint # < 1 min: code style, formatting
- security # < 2 min: SAST, secret scanning (parallel with lint)
- build # < 3 min: compile, bundle, create artifact
- unit-test # < 5 min: fast unit tests (parallel shards)
- integration # < 10 min: API tests, DB tests
- staging # deploy to staging, run smoke tests
- approval # manual gate or automated canary check
- production # deploy to production
- post-deploy # smoke tests, DAST, notification
Verify: Pipeline stages are sequential; jobs within a stage can run in parallel. Lint + security should complete before build starts.
2. Configure source control triggers
Set up event-based triggers that only run relevant pipeline stages. Avoid running full pipelines on every push to every branch. [src1]
# GitHub Actions trigger configuration
on:
  pull_request:
    branches: [main, develop]
    paths-ignore:
      - '**.md'
      - 'docs/**'
  push:
    branches: [main]
  release:
    types: [published]
Verify: A pull request that changes only Markdown files or files under docs/ should NOT trigger the pipeline. A push to main should trigger the full pipeline.
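On platforms without a built-in path filter, the same docs-only rule can be approximated in a script step. The sketch below assumes the list of changed files is supplied on stdin (for example from `git diff --name-only`); `docs_only` is a hypothetical helper that mirrors the two `paths-ignore` patterns above.

```shell
#!/bin/sh
# docs_only: reads changed file paths on stdin, one per line.
# Succeeds when every path is a Markdown file or lives under docs/.
docs_only() {
  while IFS= read -r path; do
    case "$path" in
      '') ;;                # ignore blank lines
      *.md | docs/*) ;;     # documentation change, does not need the pipeline
      *) return 1 ;;        # at least one change needs the pipeline
    esac
  done
  return 0
}

# Example with a hypothetical list of changed files.
if printf '%s\n' "README.md" "docs/setup/install.txt" | docs_only; then
  echo "skip pipeline"
else
  echo "run pipeline"
fi
```

Note that an empty change list also counts as docs-only in this sketch, so a caller may want to handle that case separately.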
3. Implement build-once, promote-everywhere
Build your artifact exactly once, tag it with the commit SHA, and promote that exact artifact through environments. [src7]
# Build and tag with commit SHA
docker build -t myapp:${GITHUB_SHA} .
docker tag myapp:${GITHUB_SHA} registry.example.com/myapp:${GITHUB_SHA}
docker push registry.example.com/myapp:${GITHUB_SHA}
# In staging: deploy the exact SHA
kubectl set image deployment/myapp app=registry.example.com/myapp:${GITHUB_SHA}
# In production: promote the SAME image (no rebuild)
kubectl set image deployment/myapp app=registry.example.com/myapp:${GITHUB_SHA}
Verify: docker inspect registry.example.com/myapp:${SHA} returns identical image ID in both staging and production.
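One way to keep mutable tags out of the promotion path is to build the image reference through a helper that rejects anything other than a full commit SHA. This is a minimal sketch; `image_ref` is a hypothetical function, and the registry and image names are the placeholders used in this step.

```shell
#!/bin/sh
# image_ref REGISTRY NAME COMMIT_SHA -> prints an immutable image reference.
# Fails when the third argument is not a full 40-character hex commit ID,
# so tags such as "latest" or "staging" cannot be promoted by accident.
image_ref() {
  registry="$1"; name="$2"; sha="$3"
  if ! printf '%s\n' "$sha" | grep -Eq '^[0-9a-f]{40}$'; then
    echo "error: '$sha' is not a full commit SHA" >&2
    return 1
  fi
  printf '%s/%s:%s\n' "$registry" "$name" "$sha"
}

image_ref registry.example.com myapp b4ffde65f46336ab88eb53be808477a3936bae11
```

Both the staging and production deploy steps can then call the helper with the same SHA and are guaranteed to receive the same reference.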
4. Add automated quality gates
Each environment promotion requires passing a quality gate. Combine automated checks with optional manual approval for production. [src4]
# GitHub Actions: require status checks before merge
# Settings > Branches > Branch protection rules > Require status checks:
# - lint
# - test
# - security-scan
# Settings > Environments > production > Required reviewers: 1
Verify: A PR with failing tests cannot be merged. Production deployment requires explicit approval.
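The gate logic itself is simple: every required check must be present and successful. The sketch below shows that rule in isolation, assuming check results arrive on stdin as "name conclusion" pairs; that input format and the `gate_passes` name are assumptions for illustration, not a platform API.

```shell
#!/bin/sh
# gate_passes: reads "check-name conclusion" pairs on stdin.
# Succeeds only if lint, test, and security-scan all concluded "success".
# A check that is missing from the input counts as a failure.
gate_passes() {
  awk -v required="lint test security-scan" '
    BEGIN { n = split(required, names, " ") }
    { result[$1] = $2 }
    END {
      for (i = 1; i <= n; i++)
        if (result[names[i]] != "success") exit 1
      exit 0
    }'
}

if printf '%s\n' "lint success" "test failure" "security-scan success" | gate_passes; then
  echo "gate open"
else
  echo "gate closed"
fi
```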
5. Integrate security scanning (shift-left)
Add SAST, dependency scanning, and secret detection as parallel stages that run alongside unit tests. [src6]
# Run security checks in parallel with tests (platform-agnostic pseudo-config)
security:
  parallel:
    - sast: semgrep --config=auto .
    - deps: trivy fs --severity HIGH,CRITICAL --exit-code 1 .
    - secrets: gitleaks detect --source .
    - sbom: syft . -o spdx-json > sbom.json
Verify: trivy fs --severity HIGH,CRITICAL --exit-code 1 . returns exit code 0 (no HIGH/CRITICAL vulnerabilities; without --exit-code 1, Trivy exits 0 even when it finds vulnerabilities). gitleaks detect returns exit code 0 (no leaked secrets).
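When the platform offers no native parallel keyword, the same effect can be sketched in a single script step with background jobs. `run_parallel` is a hypothetical helper; the example call uses harmless stand-in commands so the sketch runs without the scanners installed.

```shell
#!/bin/sh
# run_parallel CMD... : starts each argument as a shell command in the
# background, waits for all of them, and fails if any command failed.
run_parallel() {
  pids=""
  for cmd in "$@"; do
    sh -c "$cmd" &
    pids="$pids $!"
  done
  failed=0
  for pid in $pids; do
    wait "$pid" || failed=1
  done
  return "$failed"
}

# The scanner commands from this step would be passed as arguments, e.g.:
#   run_parallel "semgrep --config=auto ." "gitleaks detect --source ."
# Stand-in commands are used here instead.
run_parallel "sleep 1; exit 0" "exit 0" && echo "all checks passed"
```

Every command is allowed to finish before the result is reported, so one failing scanner does not hide the findings of the others.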
6. Set up DORA metrics tracking
Track the four DORA metrics to measure pipeline health. Elite teams target: deployment frequency multiple times/day, lead time < 1 day, change failure rate < 15%, recovery time < 1 hour. [src4]
# Key metrics to instrument:
# 1. Deployment Frequency: count deployments per day/week
# 2. Lead Time for Changes: time from commit to production deploy
# 3. Change Failure Rate: % of deployments causing incidents
# 4. Mean Time to Recovery: time from incident to resolution
Verify: Dashboard shows deployment frequency trend. Lead time from commit to production is tracked.
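The arithmetic behind the four metrics is small enough to sketch directly. The helpers below assume you can export Unix timestamps and deployment counts from your platform; all function names are hypothetical.

```shell
#!/bin/sh
# deploys_per_day TOTAL_DEPLOYS WINDOW_DAYS -> average deployments per day.
deploys_per_day() {
  awk -v n="$1" -v days="$2" 'BEGIN { printf "%.2f\n", n / days }'
}

# lead_time_hours COMMIT_EPOCH DEPLOY_EPOCH -> hours from commit to production.
lead_time_hours() {
  awk -v c="$1" -v d="$2" 'BEGIN { printf "%.1f\n", (d - c) / 3600 }'
}

# change_failure_rate FAILED_DEPLOYS TOTAL_DEPLOYS -> percentage of failed deploys.
change_failure_rate() {
  awk -v f="$1" -v t="$2" 'BEGIN {
    if (t == 0) { print "0.0"; exit }
    printf "%.1f\n", f * 100 / t
  }'
}

# recovery_minutes INCIDENT_EPOCH RESOLVED_EPOCH -> minutes to restore service.
recovery_minutes() {
  awk -v i="$1" -v r="$2" 'BEGIN { printf "%.0f\n", (r - i) / 60 }'
}

echo "frequency: $(deploys_per_day 45 30)/day"
echo "lead time: $(lead_time_hours 1700000000 1700021600) h"
echo "failure rate: $(change_failure_rate 3 40)%"
echo "recovery: $(recovery_minutes 1700000000 1700002700) min"
```

In practice these would be averaged or taken as medians over a reporting window rather than computed for a single change.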
Code Examples
GitHub Actions: Complete CI/CD Pipeline
# .github/workflows/ci-cd.yml
# Input: Push to main or PR to main
# Output: Tested, scanned, built artifact deployed to staging/production
# <pinned-version> stands for an exact release tag or commit SHA
name: CI/CD Pipeline
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
jobs:
  lint:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@<pinned-version>
      - uses: actions/setup-node@<pinned-version>
        with: { node-version: '22' }
      - run: npm ci
      - run: npm run lint
  test:
    runs-on: ubuntu-24.04
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@<pinned-version>
      - uses: actions/setup-node@<pinned-version>
        with: { node-version: '22' }
      - run: npm ci
      - run: npm test -- --shard=${{ matrix.shard }}/4
  security:
    runs-on: ubuntu-24.04
    permissions:
      contents: read
      security-events: write  # lets CodeQL upload its results
    steps:
      - uses: actions/checkout@<pinned-version>
      - uses: github/codeql-action/init@v3
        with: { languages: javascript }
      - uses: github/codeql-action/analyze@v3
  build:
    needs: [lint, test, security]
    runs-on: ubuntu-24.04
    permissions:
      contents: read
      packages: write  # lets GITHUB_TOKEN push to ghcr.io
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@<pinned-version>
      - uses: docker/login-action@<pinned-version>
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: meta
        uses: docker/metadata-action@<pinned-version>
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: type=sha,prefix=
      - uses: docker/build-push-action@<pinned-version>
        with:
          push: true
          tags: ${{ steps.meta.outputs.tags }}
  deploy-staging:
    if: github.ref == 'refs/heads/main'
    needs: [build]
    runs-on: ubuntu-24.04
    environment: staging
    steps:
      - run: |
          kubectl set image deployment/myapp \
            app=${{ needs.build.outputs.image-tag }}
  deploy-production:
    needs: [build, deploy-staging]  # build must be listed to read its outputs
    runs-on: ubuntu-24.04
    environment:
      name: production
      url: https://myapp.example.com
    steps:
      - run: |
          kubectl set image deployment/myapp \
            app=${{ needs.build.outputs.image-tag }}
GitLab CI: Complete CI/CD Pipeline
# .gitlab-ci.yml
# Input: Merge request or push to main
# Output: Tested, scanned, built artifact deployed through environments
stages:
  - validate
  - build
  - test
  - staging
  - production
variables:
  IMAGE_TAG: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA
# include is a top-level keyword; the template's SAST jobs run in the test stage by default
include:
  - template: Security/SAST.gitlab-ci.yml
lint:
  stage: validate
  image: node:22-alpine
  script:
    - npm ci --cache .npm
    - npm run lint
  cache:
    key: ${CI_COMMIT_REF_SLUG}
    paths: [.npm]
build:
  stage: build
  image: docker:27
  services:
    - docker:27-dind
  script:
    # Authenticate with the predefined GitLab registry credentials before pushing
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker build -t $IMAGE_TAG .
    - docker push $IMAGE_TAG
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
unit-test:
  stage: test
  image: node:22-alpine
  parallel: 4
  script:
    - npm ci
    - npm test -- --shard=$CI_NODE_INDEX/$CI_NODE_TOTAL
  artifacts:
    reports:
      junit: junit.xml
deploy-staging:
  stage: staging
  environment:
    name: staging
    url: https://staging.myapp.example.com
  script:
    - kubectl set image deployment/myapp app=$IMAGE_TAG
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
deploy-production:
  stage: production
  environment:
    name: production
    url: https://myapp.example.com
  script:
    - kubectl set image deployment/myapp app=$IMAGE_TAG
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: manual
  needs:
    - deploy-staging
Anti-Patterns
Wrong: Rebuilding artifacts per environment
# BAD -- rebuilds for each environment; staging and production binaries may differ
deploy-staging:
  script:
    - docker build -t myapp:staging . # build #1
    - docker push myapp:staging
deploy-production:
  script:
    - docker build -t myapp:production . # build #2 -- NOT the same binary!
    - docker push myapp:production
Correct: Build once, promote the artifact
# GOOD -- single build, same image promoted through environments
build:
  script:
    - docker build -t myapp:${CI_COMMIT_SHA} . # build once
    - docker push myapp:${CI_COMMIT_SHA}
deploy-staging:
  script:
    - kubectl set image deployment/myapp app=myapp:${CI_COMMIT_SHA} # same image
deploy-production:
  script:
    - kubectl set image deployment/myapp app=myapp:${CI_COMMIT_SHA} # same image
Wrong: Storing secrets in pipeline YAML
# BAD -- secrets in plaintext, committed to version control
env:
  DATABASE_URL: "postgres://admin:<password>@<db-host>:5432/prod"
  AWS_SECRET_ACCESS_KEY: "AKIA..."
Correct: Using platform secret stores with OIDC
# GOOD -- secrets injected at runtime, never in source control
jobs:
  deploy:
    permissions:
      id-token: write # enables OIDC
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/deploy
          aws-region: us-east-1
      # No static credentials -- uses short-lived OIDC token
Wrong: Monolithic pipeline that runs everything sequentially
# BAD -- 45-minute sequential pipeline; lint failure blocks everything
steps:
- run: npm run lint
- run: npm test
- run: npm run e2e
- run: docker build .
- run: trivy image myapp
- run: kubectl apply -f k8s/
# Total: 45 minutes, sequential, no parallelism
Correct: Parallel stages with fast-feedback first
# GOOD -- parallel execution, fast feedback in < 5 minutes
jobs:
  lint: # 1 min, runs immediately
    ...
  security-scan: # 2 min, runs in parallel with lint
    ...
  unit-test: # 4 min, runs in parallel with lint
    ...
  build: # 3 min, waits for lint + test + security
    needs: [lint, unit-test, security-scan]
  e2e-test: # 10 min, waits for build
    needs: [build]
  deploy: # 2 min, waits for e2e
    needs: [e2e-test]
# Total: ~20 min with parallelism (vs 45 min sequential)
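The "~20 min" figure follows from the critical path: a group of parallel jobs takes as long as its slowest member, and dependent jobs add up. A small worked calculation using the durations quoted in the comments above (the helper names are hypothetical):

```shell
#!/bin/sh
# max_of N... -> prints the largest integer argument.
# A group of parallel jobs takes as long as its slowest job.
max_of() {
  largest="$1"
  for n in "$@"; do
    if [ "$n" -gt "$largest" ]; then largest="$n"; fi
  done
  echo "$largest"
}

# pipeline_minutes -> critical path for the job durations quoted above.
pipeline_minutes() {
  first_wave=$(max_of 1 2 4)          # lint, security-scan, unit-test in parallel
  echo $((first_wave + 3 + 10 + 2))   # then build, e2e-test, deploy in sequence
}

echo "critical path: $(pipeline_minutes) min"
```

This prints 19 minutes, which the comment above rounds to roughly 20. The fast-feedback portion is the first wave alone, 4 minutes.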
Wrong: Using floating version tags for CI tools
# BAD -- @v3 could change without warning, breaking your pipeline
- uses: actions/checkout@v3 # could be v3.0.0 today, v3.9.9 tomorrow
- uses: docker/build-push-action@latest # completely unpinned
Correct: Pinning to exact SHA or version
# GOOD -- deterministic, reproducible builds
- uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
- uses: docker/build-push-action@<exact-version-tag> # exact version
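Pinning can be enforced with a small lint step. The sketch below flags any `uses:` reference that is neither a 40-character commit SHA nor an exact three-part version tag, matching the rule in this section. `unpinned_actions` is a hypothetical helper, and it will also flag local (`./path`) and `docker://` references, which would need an allowlist in real use.

```shell
#!/bin/sh
# unpinned_actions: reads workflow YAML on stdin and prints every `uses:`
# reference that is not pinned to a full commit SHA or an exact x.y.z version.
unpinned_actions() {
  grep -E '^[[:space:]]*(-[[:space:]]*)?uses:' \
    | sed -E 's/^[[:space:]]*(-[[:space:]]*)?uses:[[:space:]]*//; s/[[:space:]]*#.*$//' \
    | grep -Ev '@([0-9a-f]{40}|v?[0-9]+\.[0-9]+\.[0-9]+)$' || true
}

# Example input taken from the two snippets above.
unpinned_actions <<'EOF'
- uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
- uses: actions/checkout@v3
- uses: docker/build-push-action@latest
EOF
```

A CI step could run it as `unpinned_actions < .github/workflows/ci.yml` and fail the job when the output is non-empty.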
Common Pitfalls
- No caching strategy: Every build downloads dependencies from scratch, adding 2-5 minutes per run. Fix: cache node_modules, .m2, pip cache keyed by lockfile hash. [src1]
- Flaky tests ignored: Tests intermittently fail but nobody fixes them, eroding CI trust until developers routinely re-run pipelines. Fix: quarantine flaky tests immediately; track flake rate as a team metric. [src7]
- No branch protection: Developers push directly to main, bypassing CI entirely. Fix: require status checks (lint, test, security) to pass before merge. [src1]
- Secrets in logs: Pipeline logs expose environment variables or command output containing credentials. Fix: mark secrets as masked in CI settings; never echo $SECRET; audit log output. [src3]
- No rollback plan: Production deployment fails with no automated way to revert. Fix: keep the previous artifact tag; automate rollback on health check failure (kubectl rollout undo). [src4]
- Over-triggering: Every push to every branch triggers the full pipeline, wasting runner minutes. Fix: use path filters, branch filters, and [skip ci] conventions. [src2]
- Shared mutable state in tests: Integration tests share a database or filesystem, causing ordering-dependent failures. Fix: use per-test database schemas or containers; clean state before each test. [src7]
- No pipeline observability: Teams have no visibility into pipeline duration trends, failure rates, or bottlenecks. Fix: track the 4 DORA metrics; set up CI dashboards (Datadog CI Visibility, GitHub Actions insights). [src4]
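For the caching fix, "keyed by lockfile hash" means the cache key changes only when the lockfile does. A minimal sketch of deriving such a key in a script step follows; `cache_key` is a hypothetical helper, and it assumes either `sha256sum` or `shasum` is available on the runner.

```shell
#!/bin/sh
# cache_key PREFIX LOCKFILE -> prints "<prefix>-<sha256 of the lockfile>".
# The key stays stable until the lockfile content changes, so dependency
# caches are reused across runs and invalidated on dependency updates.
cache_key() {
  prefix="$1"; lockfile="$2"
  if command -v sha256sum >/dev/null 2>&1; then
    hash=$(sha256sum "$lockfile" | cut -d ' ' -f 1)
  else
    hash=$(shasum -a 256 "$lockfile" | cut -d ' ' -f 1)
  fi
  printf '%s-%s\n' "$prefix" "$hash"
}

# Usage (assuming a package-lock.json in the working directory):
#   cache_key npm package-lock.json
```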
Diagnostic Commands
# List workflows and the five most recent runs
gh workflow list
gh run list --limit 5
# Show the logs of failed steps for one run
gh run view <run-id> --log-failed
# Verify Docker image exists in registry
docker manifest inspect registry.example.com/myapp:${SHA}
# Check Kubernetes deployment rollout status
kubectl rollout status deployment/myapp --timeout=300s
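# Measure DORA: deployment frequency (last 30 days; adjust the cutoff date)
# jq -s merges the per-page arrays that --paginate emits into one count
gh api repos/{owner}/{repo}/deployments --paginate | jq -s '[.[][] | select(.created_at > "2026-01-23")] | length'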
# Check for secrets accidentally committed
gitleaks detect --source . --verbose
Version History & Compatibility
| Platform | Current Version | Key Feature | Notes |
|---|---|---|---|
| GitHub Actions | v2 (2024+) | Reusable workflows, OIDC, larger runners | Largest marketplace; 2,000 free minutes/month for private repos on the Free plan (public repos unlimited) |
| GitLab CI | 17.x (2025) | CI Components catalog, SLSA provenance | Built-in container registry, SAST, DAST; all-in-one platform |
| Jenkins | 2.479+ (2025) | Declarative Pipeline, Pipeline as Code | Requires self-hosting; 1800+ plugins; highest customization |
| CircleCI | Cloud (2025) | Orbs, intelligent test splitting, Docker layer caching | Managed; strong Docker support; credit-based pricing |
| Azure DevOps | 2025 | YAML pipelines, template expressions | Deep Azure/Microsoft integration; hybrid self-hosted agents |
| Tekton | 0.60+ (2025) | Kubernetes-native, CRD-based pipelines | Cloud-native; steep learning curve; ideal for K8s-first teams |
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Building any software project with >1 developer | Solo developer with manual deploys to a single server | Simple shell script or manual deploy |
| You need reproducible, auditable builds | Prototyping or hackathon with no production target | Direct git push to hosting (Vercel, Netlify auto-deploy) |
| Compliance requires build provenance (SOC2, SLSA) | The project has no tests and no build step | Add tests first, then add CI/CD |
| Team targets DORA elite metrics (daily deploys, <1h recovery) | Deploying static files with no build process | Static site hosts with git-triggered deploys |
| Microservices with independent release cycles | Tightly coupled monolith deploying everything together | Single pipeline with all-or-nothing deploy |
| Multiple environments (dev, staging, production) | Single environment with no promotion path | Direct deployment script |
Important Caveats
- Pipeline YAML syntax and features vary significantly between platforms -- a GitHub Actions workflow is not portable to GitLab CI without rewriting. The architecture concepts (stages, gates, artifact promotion) are portable; the implementation is not.
- Self-hosted runners (Jenkins agents, GitHub self-hosted runners) require patching, monitoring, and security hardening -- they become attack vectors if compromised, as they have access to secrets and deployment credentials.
- DORA metrics are team-level indicators, not individual developer metrics. Using deployment frequency to evaluate individual performance leads to gaming rather than genuine improvement.
- "CI/CD" is often used loosely to mean just CI (automated testing). True CD (continuous deployment to production) requires significant investment in automated testing, monitoring, and rollback capabilities. Most teams practice CI + continuous delivery (manual production gate), not continuous deployment.
- Cost can escalate quickly with managed CI/CD: GitHub Actions charges per-minute for private repos, CircleCI uses credits, and GitLab CI shared runners have quotas. Self-hosted runners trade money for operational overhead.