Engineering Productivity Benchmarks (DORA + Delivery Metrics)
Summary
Comprehensive engineering productivity benchmarks covering the five DORA metrics (deployment frequency, lead time for changes, change failure rate, mean time to recovery, rework rate) plus cycle time, PR metrics, and throughput data. Sourced from the 2025 DORA Report (~5,000 respondents) and LinearB's analysis of 8.1M+ pull requests across 4,800 teams. The most significant finding: AI coding assistants boost individual output but organizational delivery metrics remain flat. [src1]
Data vintage: Based on 2025 DORA survey data and LinearB's 2024-2025 PR analysis from 4,800+ engineering teams across 42 countries.
Key shift: DORA expanded from 4 to 5 metrics in 2025 by adding rework rate. The framework reorganized into throughput metrics (deployment frequency, lead time, recovery time) and instability metrics (change failure rate, rework rate). The traditional elite/high/medium/low classification was replaced with archetype-based clusters. [src1][src4]
Constraints
- These benchmarks represent primarily US/Western tech industry software teams. Do not apply to hardware engineering, manufacturing, or non-software R&D organizations.
- DORA figures are self-reported survey data; measured platform data (LinearB, Plandek) shows materially different distributions. Use DORA for directional comparison, platform data for precision.
- Figures are medians unless otherwise noted. Means are skewed by outlier elite performers; use median for realistic target-setting. Percentile columns rank performance, not raw values: the 25th Pct column shows bottom-quartile performers (which for lower-is-better metrics like lead time means larger numbers), and the 75th Pct column shows top-quartile performers.
- Data collected H1-H2 2025. If more than 6 months old, search for updated DORA or LinearB reports before citing.
- Only compare teams within the same segment (team size, company stage). A 5-person startup deploying 10x/day is not comparable to a 200-person enterprise deploying daily.
Metrics
Velocity
Deployment Frequency
Definition: Number of production deployments per unit of time per team/service. Measures how often code reaches production. Counted at the service or application level, not per developer.
| Segment | Median | 25th Pct | 75th Pct | Top Decile |
|---|---|---|---|---|
| Small team (2-10) | 2-3x/week | 1x/week | 1x/day | Multiple/day |
| Mid-size (11-50) | 1-2x/week | 2x/month | 3-5x/week | 1x/day |
| Large (51-200) | 1x/week | 2x/month | 2-3x/week | Daily |
| Enterprise (200+) | 2-4x/month | 1x/month | 1x/week | 2-3x/week |
Trend: Only 16.2% of organizations achieve on-demand deployment. 23.9% deploy less than once per month. Distribution is bimodal. [src1][src3]
Red flag threshold: Deploying less than once per month indicates batch-oriented delivery with high risk per deployment.
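As a minimal sketch, deployment frequency can be derived from a list of production deploy timestamps. The window normalization here is illustrative (it floors the observation window at one week), not part of any DORA specification:

```python
from datetime import datetime, timedelta

def deploys_per_week(deploy_times: list[datetime]) -> float:
    """Average production deployments per week over the observed window."""
    if len(deploy_times) < 2:
        return float(len(deploy_times))
    span = max(deploy_times) - min(deploy_times)
    weeks = max(span / timedelta(weeks=1), 1.0)  # floor the window at one week
    return len(deploy_times) / weeks
```

Count deploys at the service or application level, per the definition above, before passing timestamps in.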
Lead Time for Changes
Definition: Time from code commit to code successfully running in production. Includes code review, CI/CD pipeline execution, and any manual approval gates.
| Segment | Median | 25th Pct | 75th Pct | Top Decile |
|---|---|---|---|---|
| Small team (2-10) | 1-2 days | 2-5 days | 2-6 hours | < 1 hour |
| Mid-size (11-50) | 2-5 days | 1-2 weeks | 1-2 days | < 1 day |
| Large (51-200) | 3-7 days | 1-4 weeks | 2-3 days | 1-2 days |
| Enterprise (200+) | 1-2 weeks | 1-6 months | 3-7 days | 1-3 days |
Trend: Only 9.4% of teams achieve lead times under one hour. 31.9% fall in the one-day-to-one-week range. [src1][src3]
Red flag threshold: Lead time exceeding 1 month signals severe process bottlenecks or manual gates.
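Lead time is computed per change as commit timestamp to production timestamp, then aggregated with the median (per the constraint above on medians vs. means). A minimal sketch; the tuple shape is illustrative:

```python
from datetime import datetime, timedelta
from statistics import median

def median_lead_time_hours(changes: list[tuple[datetime, datetime]]) -> float:
    """Median hours from code commit to that code running in production.

    Each tuple is (committed_at, deployed_at).
    """
    durations = [(deployed - committed).total_seconds() / 3600
                 for committed, deployed in changes]
    return median(durations)
```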
Stability
Change Failure Rate (CFR)
Definition: Percentage of deployments that cause a failure in production requiring remediation (rollback, hotfix, patch, or emergency fix).
| Segment | Median | 25th Pct | 75th Pct | Top Decile |
|---|---|---|---|---|
| Small team (2-10) | 10% | 15-20% | 5% | < 2% |
| Mid-size (11-50) | 12% | 20-25% | 5-8% | < 3% |
| Large (51-200) | 15% | 25-30% | 8-10% | < 5% |
| Enterprise (200+) | 18% | 30%+ | 10-15% | < 5% |
Trend: Only 8.5% of teams achieve ideal CFR of 0-2%. AI-assisted code changes show higher initial failure rates. [src1][src5]
Red flag threshold: CFR above 25% indicates systemic quality issues.
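CFR is a simple ratio over deployment records, provided each record flags whether remediation (rollback, hotfix, patch, or emergency fix) was required. A sketch with an illustrative field name:

```python
def change_failure_rate(deploys: list[dict]) -> float:
    """Share of deployments that required remediation in production.

    Assumes each record carries a "required_remediation" flag (illustrative name).
    """
    if not deploys:
        return 0.0
    failed = sum(1 for d in deploys if d.get("required_remediation"))
    return failed / len(deploys)
```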
Mean Time to Recovery (MTTR)
Definition: Time from detection of a production failure to full service restoration. Also called "failed deployment recovery time" in the 2025 DORA framework.
| Segment | Median | 25th Pct | 75th Pct | Top Decile |
|---|---|---|---|---|
| Small team (2-10) | 1-4 hours | 4-12 hours | 30-60 min | < 15 min |
| Mid-size (11-50) | 2-8 hours | 8-24 hours | 1-2 hours | < 30 min |
| Large (51-200) | 4-12 hours | 12-48 hours | 2-4 hours | < 1 hour |
| Enterprise (200+) | 12-24 hours | 24-72 hours | 4-12 hours | < 2 hours |
Trend: Elite teams achieve MTTR under 1 hour across all segments. Teams with automated rollback recover 5-10x faster. [src1][src3]
Red flag threshold: MTTR exceeding 24 hours for non-enterprise teams indicates inadequate incident response.
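MTTR is the mean over incident durations, each measured from detection to full restoration. A minimal sketch; the tuple shape is illustrative:

```python
from datetime import datetime, timedelta
from statistics import mean

def mttr_hours(incidents: list[tuple[datetime, datetime]]) -> float:
    """Mean hours from failure detection to full service restoration.

    Each tuple is (detected_at, restored_at).
    """
    return mean((restored - detected).total_seconds() / 3600
                for detected, restored in incidents)
```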
Rework Rate (5th DORA Metric — New in 2025)
Definition: Percentage of deployments that are unplanned fixes or patches to correct user-facing defects from prior deployments.
| Segment | Median | 25th Pct | 75th Pct | Top Decile |
|---|---|---|---|---|
| All segments | 8-12% | 15-20% | 4-6% | < 3% |
Trend: Increased AI adoption correlates with increased rework rate — AI-generated code ships faster but requires more post-deployment corrections. [src1][src4]
Red flag threshold: Rework rate above 15% means more time fixing than shipping planned work.
Efficiency
Cycle Time (PR Open to Merged)
Definition: Total elapsed time from pull request creation to merge into main branch. Includes pickup time, review time, revision cycles, and final approval.
| Segment | Median | 25th Pct | 75th Pct | Top Decile |
|---|---|---|---|---|
| Small team (2-10) | 3-4 days | 5-7 days | 1-2 days | < 26 hours |
| Mid-size (11-50) | 5-7 days | 7-14 days | 2-4 days | < 2 days |
| Large (51-200) | 7-10 days | 10-21 days | 4-6 days | < 3 days |
| Enterprise (200+) | 10-14 days | 14-30 days | 5-8 days | < 5 days |
Trend: Average cycle time is ~7 days, with PRs sitting in review for 4 of those 7 days. Code review is the single largest bottleneck. [src2]
Red flag threshold: Cycle time exceeding 14 days for non-enterprise teams signals review process breakdown.
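Cycle time is measured per PR from creation to merge and aggregated with the median. A minimal sketch; the tuple shape is illustrative:

```python
from datetime import datetime, timedelta
from statistics import median

def median_cycle_time_days(prs: list[tuple[datetime, datetime]]) -> float:
    """Median days from pull request creation to merge.

    Each tuple is (opened_at, merged_at).
    """
    return median((merged - opened).total_seconds() / 86400
                  for opened, merged in prs)
```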
Throughput (PRs Merged per Developer per Week)
Definition: Number of pull requests merged per developer per week. Measures individual developer output normalized across team sizes.
| Segment | Median | 25th Pct | 75th Pct | Top Decile |
|---|---|---|---|---|
| All segments | 2-3 PRs/week | 1-2 PRs/week | 4-5 PRs/week | 6+ PRs/week |
Trend: Teams using AI coding assistants show 15-25% improvement in PR throughput, but with higher rework rates. [src1][src2]
Red flag threshold: Sustained throughput below 1 PR/developer/week indicates blockers or oversized PRs.
Quality
PR Size
Definition: Number of code changes (additions + modifications + deletions) per pull request. The single most impactful metric for engineering velocity.
| Segment | Median | 25th Pct | 75th Pct | Top Decile |
|---|---|---|---|---|
| All segments | 200-300 lines | 400-661 lines | 100-194 lines | < 100 lines |
Trend: Elite teams maintain PR sizes under 194 changes. Teams keeping PRs under 194 lines merge roughly 5x more frequently. [src2]
Red flag threshold: PRs above 500 lines correlate with 3-5x longer cycle times and higher CFR.
Merge Time
Definition: Time from final code review approval to merge into main branch.
| Segment | Median | 25th Pct | 75th Pct | Top Decile |
|---|---|---|---|---|
| All segments | 4-8 hours | 12-24 hours | 1-2 hours | < 2 hours |
Trend: Elite teams maintain merge times under 2 hours. Automated merge queues are the primary improvement driver. [src2]
Red flag threshold: Merge time exceeding 24 hours after approval indicates CI/CD bottlenecks.
Composite Metrics & Rules of Thumb
| Rule | Formula / Threshold | Interpretation |
|---|---|---|
| DORA Throughput Score | High deployment frequency + Low lead time | Both must be strong — frequent deployments paired with long lead time mean changes queue in review or approval before release |
| DORA Stability Score | Low CFR + Low MTTR + Low rework rate | All three must be healthy — low CFR with high MTTR means failures are rare but catastrophic |
| Cycle Time Ratio | Review time / Total cycle time < 50% | If review exceeds 50% of cycle time, review process is the bottleneck |
| PR Size Rule | Median PR < 200 lines | Highest-leverage metric — drives cycle time, CFR, and review quality simultaneously |
| Deploy:Rework Ratio | Planned deploys / Rework deploys > 8:1 | At 8:1, rework is one deploy in nine (~11% of all deployments); falling below the ratio means too much capacity goes to unplanned fixes |
| AI Productivity Paradox | Individual output up + Team metrics flat | AI boosts individual velocity but does not automatically improve organizational throughput |
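The two ratio rules above reduce to one-line checks. A sketch (function names are mine, not from either report):

```python
def deploy_rework_ratio(planned: int, rework: int) -> float:
    """Planned-to-rework deploy ratio; the rule of thumb is to stay above 8:1."""
    return planned / rework if rework else float("inf")

def review_is_bottleneck(review_hours: float, cycle_hours: float) -> bool:
    """Cycle Time Ratio rule: review consuming >50% of total cycle time."""
    return review_hours / cycle_hours > 0.5
```

With the industry averages cited earlier (4 of 7 cycle-time days spent in review), `review_is_bottleneck(4 * 24, 7 * 24)` returns True — the typical team's review process is the bottleneck.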
Segment Definitions
| Segment | Definition | Typical Characteristics |
|---|---|---|
| Small team (2-10 engineers) | Startup or small product team, single-service | Direct communication, minimal process overhead, trunk-based development |
| Mid-size (11-50 engineers) | Growth-stage company or business unit | Multiple squads, code ownership emerging, PR reviews required |
| Large (51-200 engineers) | Scale-up or division within enterprise | Platform teams, shared services, architecture governance |
| Enterprise (200+ engineers) | Large organization or multi-BU company | Complex CI/CD, compliance gates, change advisory boards |
Year-over-Year Trend Summary
| Metric | 2023 | 2024 | 2025 | Direction |
|---|---|---|---|---|
| Deployment frequency (% daily+) | 30% | 32% | 38% | Up 8pp over 2 years |
| Lead time (% under 1 day) | 35% | 38% | 41% | Up 6pp, steady |
| Change failure rate (median) | 12% | 14% | 15% | Up 3pp — AI contributing |
| MTTR (% under 1 hour) | 20% | 22% | 25% | Up 5pp — automation gains |
| Cycle time (average) | 8 days | 7.5 days | 7 days | Down 12.5% over 2 years |
| PR size (elite threshold) | 250 lines | 220 lines | 194 lines | Down 22% — smaller PRs |
Common Misinterpretations
- Treating deployment frequency as the primary metric: High deployment frequency without stability is counterproductive. A team deploying 10x/day with 30% CFR is worse off than one deploying daily with 3% CFR. Always evaluate throughput and stability together. [src1]
- Applying enterprise benchmarks to startups (or vice versa): A 5-person team should deploy multiple times daily; expecting that from a 300-person org with compliance requirements sets unrealistic targets.
- Equating individual AI productivity gains with team improvement: AI tools boost individual output (21% more tasks, 98% more PRs) but organizational delivery metrics remain flat. The bottleneck shifts from coding to review, testing, and deployment. [src1]
- Using DORA metrics as targets rather than diagnostics: Goodhart's Law applies — when deployment frequency becomes a target, teams game it. Use metrics for diagnosis, not incentive compensation.
- Ignoring PR size while optimizing cycle time: PR size is the single strongest predictor of cycle time. Teams that focus on review improvements without addressing oversized PRs see minimal reduction. [src2]
When This Matters
Fetch when a user asks about engineering team performance benchmarks, wants to evaluate their DORA metrics against industry peers, is setting engineering KPIs or OKRs, needs to diagnose delivery bottlenecks, or is evaluating the impact of AI coding tools on team productivity.