Evaluates the soundness of a software system's technical architecture across six dimensions: scalability, SPOF resilience, CI/CD maturity, incident response, observability, and code quality. Produces a structured diagnostic identifying where architecture enables or constrains the business. [src1]
What this measures: System's ability to handle increasing load through horizontal or vertical scaling while maintaining acceptable latency.
| Score | Level | Description | Evidence |
|---|---|---|---|
| 1 | Ad hoc | No capacity planning; system fails under moderate spikes | No load testing; single server; p95 unknown |
| 2 | Emerging | Basic vertical scaling; known bottlenecks unaddressed | Manual scaling; shared DB; p95 > 2s |
| 3 | Defined | Horizontal scaling for stateless services; load testing in release cycle | Auto-scaling configured; DB read replicas; p95 < 500ms |
| 4 | Managed | Architecture designed for 10x load; data-driven capacity planning | Proven 5x spike handling; multi-region; p95 < 200ms |
| 5 | Optimized | Elastic architecture handles 100x spikes; cost-optimized scaling | Real-time auto-scaling; sub-linear cost growth |
Red flags: No one knows the current p95; outages during traffic spikes; single-instance database. [src3]
Quick diagnostic question: "What happens if traffic triples tomorrow — when did you last test that?"
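The rubric above keys several levels to p95 latency. As a minimal sketch of what "knowing your p95" means in practice, the nearest-rank percentile can be computed directly from raw request samples (the sample values below are purely illustrative):

```python
import math

def p95(latencies_ms: list[float]) -> float:
    """Return the 95th-percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Nearest-rank: ceil(0.95 * n) gives the 1-based rank of the p95 sample.
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds; one slow outlier.
samples = [120, 140, 95, 180, 2200, 150, 130, 160, 110, 175]
print(p95(samples))  # 2200 — the outlier dominates the tail
```

Note how a single slow request dominates the tail: averages hide exactly the behavior this dimension scores.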
What this measures: Whether single points of failure have been identified and mitigated across infrastructure, data, dependencies, and people.
| Score | Level | Description | Evidence |
|---|---|---|---|
| 1 | Ad hoc | No SPOF analysis; critical systems on single instances; bus factor of 1 | Single DB server; no failover; no dependency mapping |
| 2 | Emerging | Some SPOFs identified but not mitigated; basic backups | Backups exist but untested; partial redundancy |
| 3 | Defined | Formal SPOF audit; critical path redundancy; DR plan tested annually | Failover for critical services; bus factor >= 2; backup restoration tested quarterly |
| 4 | Managed | Chaos engineering active; automated failover; quarterly DR tests | Chaos experiments monthly; failover < 60s; RTO < 4h |
| 5 | Optimized | Self-healing infrastructure; multi-region active-active; zero customer impact from failures | 80%+ automated remediation; active-active 2+ regions |
Red flags: No SPOF analysis done; single DB with no replication; single engineer owns critical subsystem. [src4]
Quick diagnostic question: "If your primary database dies right now, what happens and how long until recovery?"
What this measures: Speed, safety, and automation of the delivery pipeline — code commit to production, benchmarked against DORA metrics.
| Score | Level | Description | Evidence |
|---|---|---|---|
| 1 | Ad hoc | Manual, risky, all-day deployment events; releases less than monthly | No automated testing; change failure > 45%; lead time > 6 months |
| 2 | Emerging | Basic CI; semi-automated deploys; monthly-ish releases | Automated build only; change failure 30-45%; lead time 1-6 months |
| 3 | Defined | Full CI/CD; automated testing; weekly-daily deploys; feature flags | 60%+ coverage; change failure 15-30%; lead time 1w-1m; rollback documented |
| 4 | Managed | Continuous deployment with canary/blue-green; multiple deploys/day | < 15% change failure; lead time 1d-1w; MTTR < 1h; 80%+ coverage |
| 5 | Optimized | On-demand zero-downtime deployment; progressive delivery; automated quality gates | < 5% change failure; lead time < 1 day; deploy confidence > 99% |
Red flags: Deployments only when specific people available; no automated tests; deployment freezes beyond holidays; rollback = restore from backup. [src1]
Quick diagnostic question: "How often do you deploy, and what percentage requires a hotfix or rollback?"
What this measures: Maturity of processes for detecting, responding to, and learning from production incidents.
| Score | Level | Description | Evidence |
|---|---|---|---|
| 1 | Ad hoc | Customers find issues first; no process; no postmortems | No on-call; no alerting; MTTD in hours/days |
| 2 | Emerging | Basic on-call; some alerting; reactive handling | Informal on-call; postmortems for major incidents only |
| 3 | Defined | Structured process; severity levels; blameless postmortems standard | Runbooks for top 10 failures; MTTD < 15min for P1; action items tracked |
| 4 | Managed | Distributed incident response; automated detection; SLOs with error budgets | Automated incident creation; quarterly trend review; 90%+ action items completed |
| 5 | Optimized | Proactive prevention; automated remediation; predictive alerting | 60%+ alerts auto-remediated; incident rate declining QoQ |
Red flags: No on-call; customers report outages first; same incident type recurs monthly. [src5]
Quick diagnostic question: "When was your last production incident, how did you find out, and what did you change?"
What this measures: Ability to understand system behavior through logs, metrics, traces, and dashboards.
| Score | Level | Description | Evidence |
|---|---|---|---|
| 1 | Ad hoc | No centralized logging; basic CPU/memory metrics only | `console.log` debugging; SSH into production; no dashboards |
| 2 | Emerging | Centralized logging; basic monitoring; some app metrics | Log aggregation; 1-2 dashboards; no distributed tracing |
| 3 | Defined | Structured logging; APM deployed; SLI-based alerting | Three pillars (logs, metrics, traces); custom dashboards per team |
| 4 | Managed | Full distributed tracing; anomaly detection; observability as code | End-to-end tracing; < 5% false positive rate; Terraform-managed |
| 5 | Optimized | AI-assisted root cause analysis; predictive observability | AIOps correlation; sub-minute root cause ID; cost-optimized telemetry |
Red flags: Engineers SSH into production to debug; no one can state the current p95; monitoring is infrastructure-only. [src3]
Quick diagnostic question: "If a user reports slowness, what tool does your engineer open first, and how quickly can they find the root cause?"
What this measures: Structural health of the codebase — technical debt, test coverage, documentation, and onboarding speed.
| Score | Level | Description | Evidence |
|---|---|---|---|
| 1 | Ad hoc | No standards; no review; < 20% test coverage; high coupling | No linting; no PRs; no docs; every change risks regressions |
| 2 | Emerging | Basic standards; inconsistent review; 20-40% coverage | Linting configured; some unit tests; onboarding 2-3 months |
| 3 | Defined | Enforced standards; mandatory review; 40-70% coverage; ADRs | 2-reviewer minimum; integration tests; onboarding < 30 days |
| 4 | Managed | Automated quality gates; > 70% coverage; tech debt budgeted | SonarQube gates in CI; quarterly debt reduction; clear module boundaries |
| 5 | Optimized | > 85% coverage; architecture enforces modularity; continuous refactoring | Mutation testing; onboarding < 2 weeks; health metrics trending positive |
Red flags: No code review; "only one person understands this"; test suite hours long or skipped; engineers afraid to refactor. [src6]
Quick diagnostic question: "How confident is your team in making a significant change to a core module without causing a regression?"
Formula: Overall Score = (Scalability + SPOF Resilience + CI/CD + Incident Response + Observability + Code Quality) / 6
| Overall Score | Level | Interpretation | Next Step |
|---|---|---|---|
| 1.0 - 1.9 | Critical | Architecture is a business risk — outages and slow delivery constrain growth | Address lowest dimension; invest in CI/CD and observability first |
| 2.0 - 2.9 | Developing | Basic systems in place but significant gaps; breaks at 3-5x scale | Close biggest gap; priority: observability > CI/CD > incident response |
| 3.0 - 3.9 | Competent | Solid foundation with room for optimization | Invest in chaos engineering, SLOs, progressive delivery |
| 4.0 - 4.5 | Advanced | Engineering enables the business rather than bottlenecking it; mature DevOps | Fine-tune cost efficiency; advanced observability; platform capabilities |
| 4.6 - 5.0 | Best-in-class | Architecture is a competitive advantage | Maintain; evaluate emerging paradigms |
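The formula and interpretation bands above can be sketched directly in code. The dimension scores below are hypothetical inputs for illustration:

```python
# Interpretation bands from the table: (lower bound, upper bound, level).
BANDS = [
    (1.0, 1.9, "Critical"),
    (2.0, 2.9, "Developing"),
    (3.0, 3.9, "Competent"),
    (4.0, 4.5, "Advanced"),
    (4.6, 5.0, "Best-in-class"),
]

def overall_score(scores: dict[str, int]) -> float:
    """Simple average of the six dimension scores (each 1-5), per the formula."""
    assert len(scores) == 6, "expected exactly six dimensions"
    return round(sum(scores.values()) / 6, 1)

def interpret(score: float) -> str:
    """Map a rounded overall score onto its interpretation band."""
    for low, high, level in BANDS:
        if low <= score <= high:
            return level
    raise ValueError(f"score out of range: {score}")

# Hypothetical assessment of a Series B-stage team:
scores = {
    "scalability": 3, "spof_resilience": 2, "cicd": 3,
    "incident_response": 2, "observability": 3, "code_quality": 3,
}
avg = overall_score(scores)        # 16 / 6 -> 2.7
print(avg, interpret(avg))         # 2.7 Developing
```

A simple average weights all six dimensions equally, which matches the formula; a real assessment might weight the weakest dimension more heavily, since the interpretation table's "Next Step" guidance always targets the lowest score first.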
| Weak Dimension (Score < 3) | Fetch This Card |
|---|---|
| Scalability & Performance | Cloud Migration Playbook |
| SPOF Resilience | Business Continuity Planning |
| CI/CD & Deployment | Technology Stack Decision Framework |
| Incident Response | Cyber Risk Quantification |
| Observability | Technology Stack Decision Framework |
| Code Quality | Technology Stack Decision Framework |
| Segment | Expected Average | "Good" Threshold | "Alarm" Threshold |
|---|---|---|---|
| Seed/Series A (1-5 eng) | 1.8 | 2.5 | 1.2 |
| Series B (6-20 eng) | 2.8 | 3.3 | 2.0 |
| Growth (21-50 eng) | 3.5 | 4.0 | 2.8 |
| Scale/Public (50+ eng) | 4.2 | 4.5 | 3.5 |
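Applying the segment benchmarks above is a straightforward threshold lookup. This is a sketch only; the segment boundaries follow the table, and the function name and return strings are illustrative:

```python
# Segment benchmarks from the table:
# (max engineers in segment, expected average, "good" threshold, "alarm" threshold)
SEGMENTS = [
    (5, 1.8, 2.5, 1.2),                # Seed/Series A
    (20, 2.8, 3.3, 2.0),               # Series B
    (50, 3.5, 4.0, 2.8),               # Growth
    (float("inf"), 4.2, 4.5, 3.5),     # Scale/Public
]

def benchmark(engineers: int, score: float) -> str:
    """Flag an overall score against its team-size segment's thresholds."""
    for max_eng, expected, good, alarm in SEGMENTS:
        if engineers <= max_eng:
            if score >= good:
                return "above 'good' threshold"
            if score <= alarm:
                return "below 'alarm' threshold"
            return f"within expected range (segment average {expected})"
    raise AssertionError("unreachable: last segment is unbounded")

# A 12-engineer (Series B) team scoring 2.4 overall:
print(benchmark(12, 2.4))  # within expected range (segment average 2.8)
```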
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | On-demand (multiple/day) | Daily to weekly | Weekly to monthly | < monthly |
| Lead Time for Changes | < 1 day | 1 day - 1 week | 1 week - 1 month | 1 - 6 months |
| Change Failure Rate | < 5% | 5-15% | 15-30% | 30-45% |
| Mean Time to Recovery | < 1 hour | < 1 day | 1 day - 1 week | > 1 week |
Only 16.2% of teams achieve on-demand deployment; 23.9% deploy less frequently than monthly. [src2]
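As a rough sketch, the DORA table above can be turned into a classifier. One simplifying assumption here (not part of the benchmark itself): the team's tier is taken as the worst tier across the four metrics, whereas DORA derives tiers by clustering respondents. All input values are hypothetical:

```python
def dora_tier(deploys_per_week: float, lead_time_days: float,
              change_failure_pct: float, mttr_hours: float) -> str:
    """Classify per the DORA table, taking the worst tier of the four metrics
    (a simplification; DORA itself clusters survey respondents)."""
    def freq_tier(d: float) -> int:       # deployment frequency
        if d >= 7: return 0               # multiple per day -> Elite
        if d >= 1: return 1               # daily to weekly -> High
        if d >= 0.25: return 2            # weekly to monthly -> Medium
        return 3                          # less than monthly -> Low
    def lead_tier(days: float) -> int:    # lead time for changes
        if days < 1: return 0
        if days <= 7: return 1
        if days <= 30: return 2
        return 3
    def cfr_tier(pct: float) -> int:      # change failure rate
        if pct < 5: return 0
        if pct <= 15: return 1
        if pct <= 30: return 2
        return 3
    def mttr_tier(hours: float) -> int:   # mean time to recovery
        if hours < 1: return 0
        if hours < 24: return 1
        if hours <= 168: return 2         # up to one week
        return 3
    worst = max(freq_tier(deploys_per_week), lead_tier(lead_time_days),
                cfr_tier(change_failure_pct), mttr_tier(mttr_hours))
    return ["Elite", "High", "Medium", "Low"][worst]

print(dora_tier(10, 0.5, 4, 0.5))   # Elite on all four metrics
print(dora_tier(2, 3, 12, 6))       # High
```

Taking the worst metric mirrors how the rubric is used in practice: one lagging metric (for example a 40% change failure rate) negates elite deployment frequency.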
Fetch when a user asks to evaluate engineering architecture, prepare for technical due diligence (fundraising, acquisition), diagnose declining delivery velocity, assess readiness for a scaling phase, or onboard a new CTO/VP Engineering needing a baseline.