Experiment Tracking Framework

Type: Execution Recipe · Confidence: 0.88 · Sources: 6 · Verified: 2026-03-12

Purpose

This recipe produces a complete experiment tracking system — from structured hypothesis documentation through sample size calculation, test execution, statistical analysis, and decision logging. The output is a reusable framework that ensures every product experiment follows rigorous methodology: declared hypotheses, pre-calculated sample sizes, significance thresholds, and documented decisions with learnings. It prevents the most common startup failure mode, where teams "run experiments" by shipping changes and checking metrics a few days later without statistical rigor. [src1]

Prerequisites

Constraints

Tool Selection Decision

Which path?
├── User is non-technical AND budget = free
│   └── PATH A: No-Code Free — Google Sheets + Evan Miller calculator + free PostHog
├── User is non-technical AND budget > $0
│   └── PATH B: No-Code Paid — VWO or Optimizely (built-in hypothesis management)
├── User is semi-technical or developer AND budget = free
│   └── PATH C: Code + Free — Statsig/PostHog SDK + spreadsheet + Evan Miller
└── User is developer AND budget > $0
    └── PATH D: Code + Paid — Statsig/Eppo + warehouse-native analysis
| Path | Tools | Cost | Speed | Output Quality |
|---|---|---|---|---|
| A: No-Code Free | Google Sheets + Evan Miller + PostHog | $0 | 30 min/experiment | Good — rigorous if discipline maintained |
| B: No-Code Paid | VWO Plan or Optimizely | $200–500/mo | 10–15 min/experiment | Excellent — enforced workflow |
| C: Code + Free | Statsig/PostHog SDK + Sheets | $0 | 20 min/experiment | Very Good — SDK automates assignment |
| D: Code + Paid | Statsig/Eppo + warehouse | $200–1000/mo | 10 min/experiment | Excellent — automated power analysis + CUPED |

Execution Flow

Step 1: Write the Hypothesis

Duration: 10–20 minutes · Tool: Experiment tracker

Write a structured hypothesis: “If [specific change], then [measurable outcome] will [direction] by [estimated magnitude], because [reasoning based on data].” Include experiment ID, primary metric (OEC), guardrail metrics, owner, and status. [src4]

Verify: Hypothesis includes a specific, falsifiable prediction with a named metric and estimated effect size. · If failed: If the team cannot state a specific expected outcome, document as exploratory and define what “interesting” means before launching.
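
The fields above can be captured in a minimal record. This is a sketch assuming Python; the class and field names are illustrative, not part of any tracker tool:

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """Structured hypothesis: if/then/because plus tracking fields from Step 1."""
    experiment_id: str            # e.g. "EXP-2026-001"
    change: str                   # the specific change being made
    outcome_metric: str           # primary metric (OEC)
    direction: str                # "increase" or "decrease"
    magnitude: str                # estimated effect, e.g. "+10% relative"
    reasoning: str                # data-backed reasoning
    guardrails: list = field(default_factory=list)
    owner: str = ""
    status: str = "Proposed"      # Proposed|Designed|Running|Analyzing|Decided

    def statement(self) -> str:
        """Render the canonical if/then/because sentence."""
        return (f"If {self.change}, then {self.outcome_metric} will "
                f"{self.direction} by {self.magnitude}, because {self.reasoning}.")
```

A record that cannot fill every field is a signal to document the work as exploratory instead, per the verification note above.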

Step 2: Calculate Sample Size

Duration: 5–10 minutes · Tool: Evan Miller Calculator or Statsig Power Analysis [src2] [src3]

Calculate minimum sample size per variation BEFORE launching. Inputs: baseline conversion rate, minimum detectable effect (MDE), alpha (0.05), power (0.80). Then estimate runtime: total sample size / weekly traffic × 7 days.
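
The calculation can be sketched with the standard two-proportion normal-approximation formula. This is a generic implementation, not Evan Miller's or Statsig's exact code, and it assumes the primary metric is a conversion rate:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variation(baseline, mde_relative, alpha=0.05, power=0.80):
    """Minimum n per arm for a two-proportion z-test (normal approximation),
    using the Step 2 inputs: baseline rate, relative MDE, alpha, power."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

def runtime_days(n_per_arm, n_arms, weekly_traffic):
    """Runtime estimate from the recipe: total sample / weekly traffic x 7 days."""
    return ceil(n_per_arm * n_arms / weekly_traffic * 7)
```

For example, a 5% baseline with a +10% relative MDE needs roughly 31,000 users per variation, which is why the runtime check in the verification step matters.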

Verify: Sample size documented. Runtime fits within 2–8 weeks. · If failed: Increase MDE, increase traffic allocation, or use CUPED variance reduction (20–50% sample reduction). [src3]

Step 3: Design the Experiment

Duration: 15–30 minutes · Tool: Experimentation platform + tracker

Document full design: hypothesis, primary metric, guardrails, traffic allocation, sample size target, runtime estimate, randomization unit, targeting rules, and exclusions. Get design reviewed by at least one other person.

Verify: Design document complete. No overlap with running experiments. · If failed: If overlapping experiments exist, pause one or implement mutual exclusion groups.

Step 4: Implement Tracking

Duration: 15–60 minutes · Tool: Platform SDK or visual editor

Code path: integrate experimentation SDK, implement random assignment, and fire conversion events. No-code path: use VWO/Optimizely visual editor to create variations, set targeting, configure goals, and QA in preview mode.
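
On the code path, random assignment is commonly implemented as deterministic hash bucketing, so a user always sees the same variant. This sketch is illustrative and is not any particular SDK's API:

```python
import hashlib

def assign_variant(experiment_id: str, user_id: str,
                   variants=("control", "treatment")):
    """Deterministic assignment: hash (experiment_id, user_id) into a bucket.
    Same user + same experiment -> same variant; a new experiment_id
    reshuffles everyone (needed when restarting after an SRM failure)."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000          # uniform bucket in [0, 10000)
    return variants[bucket * len(variants) // 10000]
```

In production the platform SDK handles this; the point of the sketch is that assignment must be a pure function of user and experiment, never of request timing.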

Verify: Both variations render correctly. Events fire on expected actions. Platform shows incoming data within 30 minutes. · If failed: Check SDK initialization, event naming, and ad-blocker interference.

Step 5: Run Test to Significance

Duration: 1–8 weeks · Tool: Platform dashboard

Monitor weekly: check for sample ratio mismatch (SRM), data quality, and guardrail violations. Do NOT stop early, extend, change allocation, or add metrics mid-experiment. If using sequential testing (Statsig, Eppo), valid decisions can be made when the platform declares significance. [src3]

Verify: Sample size target reached. No SRM detected. Guardrails intact. · If failed: If SRM > 1% — stop, investigate randomization, restart with new ID.
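
The SRM check can be sketched as a chi-square goodness-of-fit test on observed versus planned allocation. The p < 0.001 alarm threshold is a common convention, assumed here rather than taken from the sources:

```python
from math import erf, sqrt

def srm_check(n_control, n_treatment, expected_ratio=0.5, threshold=1e-3):
    """Chi-square goodness-of-fit test (1 df) for sample ratio mismatch.
    Returns (p_value, flagged); flagged means randomization is suspect."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    # Survival function of chi-square with 1 df: p = erfc(sqrt(chi2 / 2))
    p_value = 1 - erf(sqrt(chi2 / 2))
    return p_value, p_value < threshold
```

A 53/47 split on 10,000 users is flagged, while a clean 50/50 split is not; a flagged result means stop and restart with a new experiment ID, as the recovery note says.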

Step 6: Analyze Results

Duration: 30–60 minutes · Tool: Platform + spreadsheet

Analyze: calculate lift, 95% confidence interval, p-value for primary metric. Check all guardrails against pre-set thresholds. Segment analysis is exploratory only — not primary decision criteria.

Verify: CI does not cross zero for significant results. Guardrails within bounds. · If failed: If inconclusive (CI crosses zero), document as inconclusive. Do NOT extend experiment.
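
For a conversion-rate metric, the analysis in this step can be sketched as a two-proportion z-test. This is a generic frequentist implementation; a platform's auto-analysis may differ in detail:

```python
from math import erf, sqrt

def analyze(conv_c, n_c, conv_t, n_t):
    """Relative lift, 95% CI on the absolute difference, and two-sided
    p-value for a two-proportion z-test (normal approximation)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Pooled SE for the hypothesis test, unpooled SE for the CI
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = diff / se_pool
    p_value = 1 - erf(abs(z) / sqrt(2))      # two-sided
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return {"lift": diff / p_c, "ci": ci, "p_value": p_value}
```

Example: 500/10,000 control conversions vs 600/10,000 treatment gives a 20% relative lift with a CI that stays above zero, so the result would count as significant.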

Step 7: Make Decision

Duration: 15–30 minutes · Tool: Experiment tracker

Apply decision matrix: significant + positive + guardrails intact = Ship. Significant + guardrail violated = Iterate. Not significant + within MDE = Inconclusive. Significant + negative = Kill. [src1]

Verify: Decision recorded with rationale. Stakeholders notified. · If failed: If team disagrees despite clear data, escalate to pre-agreed decision-maker.
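
The decision matrix can be encoded directly so the call is mechanical rather than debated. Function and argument names here are illustrative:

```python
def decide(significant: bool, lift_positive: bool, guardrails_intact: bool) -> str:
    """Decision matrix from Step 7: Ship / Iterate / Kill / Inconclusive."""
    if not significant:
        return "Inconclusive"
    if not lift_positive:
        return "Kill"
    return "Ship" if guardrails_intact else "Iterate"
```

Pre-agreeing on this mapping is what makes the escalation rule above workable: disagreement can only be about inputs, not about the verdict.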

Step 8: Document Learnings

Duration: 15–20 minutes · Tool: Tracker + learnings repository

Record experiment conclusion: decision, observed lift, projected impact, key learnings, follow-up experiments, and searchable tags. Ensure learnings are indexed by feature area, metric, and outcome. [src6]

Verify: Learnings searchable by tag. Follow-up experiments added to backlog. · If failed: If learnings are not consulted, institute mandatory prior-art search before new experiments.

Output Schema

{
  "output_type": "experiment_tracker",
  "format": "spreadsheet or database",
  "columns": [
    {"name": "experiment_id", "type": "string", "description": "Unique ID (EXP-YYYY-NNN)", "required": true},
    {"name": "hypothesis", "type": "string", "description": "If/then/because statement", "required": true},
    {"name": "primary_metric", "type": "string", "description": "Overall Evaluation Criterion", "required": true},
    {"name": "baseline_rate", "type": "number", "description": "Current metric value", "required": true},
    {"name": "mde", "type": "number", "description": "Minimum detectable effect", "required": true},
    {"name": "sample_size_per_variation", "type": "number", "description": "Required n per variation", "required": true},
    {"name": "status", "type": "string", "description": "Proposed|Designed|Running|Analyzing|Decided", "required": true},
    {"name": "decision", "type": "string", "description": "Ship|Iterate|Kill|Inconclusive", "required": false},
    {"name": "learnings", "type": "string", "description": "Key takeaways", "required": false}
  ],
  "expected_row_count": "10-200 per quarter",
  "sort_order": "date_proposed descending",
  "deduplication_key": "experiment_id"
}

Quality Benchmarks

| Quality Metric | Minimum Acceptable | Good | Excellent |
|---|---|---|---|
| Hypothesis completion rate | > 70% | > 85% | > 95% |
| Pre-calculated sample size rate | > 80% | > 95% | 100% |
| Experiment win rate | > 15% | > 25% | > 35% |
| SRM detection rate | > 90% | > 95% | 100% |
| Learnings documented rate | > 60% | > 80% | > 95% |
| Experiments concluded/quarter | > 3 | > 8 | > 15 |

If below minimum: If hypothesis completion is low, enforce the structured template — reject experiments without full documentation. If win rate is below 15%, improve hypothesis quality by requiring data-backed observations.

Error Handling

| Error | Likely Cause | Recovery Action |
|---|---|---|
| Sample ratio mismatch (SRM) | Randomization bug, bot traffic, or redirects | Stop experiment. Investigate randomization. Fix root cause and restart with new ID. |
| Experiment never reaches significance | Underpowered test — MDE too small for traffic | Document as inconclusive. Use larger MDE or CUPED variance reduction next time. [src3] |
| Guardrail metric violated | Treatment has unintended side effect | If critical guardrail: stop. If non-critical: continue to full sample, then decide. |
| Conflicting overlapping experiments | No experiment coordination | Implement experiment layers/namespaces. Create shared experiment calendar. |
| Results contradict strong priors | Novelty effect, Hawthorne effect | Segment by exposure time. Consider holdback test (95/5 split for 2 weeks). [src1] |
| Premature stopping from peeking | Dashboard checked before sample size reached | Switch to sequential testing or restrict dashboard access until target met. |

Cost Breakdown

| Component | Free Tier | Paid Tier | At Scale |
|---|---|---|---|
| Experimentation platform | PostHog: 1M events/mo; Statsig: 10M events/mo | $150–500/mo (VWO, Optimizely) | $1,000–5,000/mo |
| Sample size calculator | Evan Miller: unlimited | Built-in (Statsig, Eppo) | Automated power analysis |
| Experiment tracker | Google Sheets: $0 | Notion: $10–20/mo | Platform-native |
| Statistical analysis | Manual spreadsheet | Platform auto-analysis | Warehouse-native (Eppo) |
| Total per quarter | $0 | $500–1,600 | $3,000–15,000+ |

Anti-Patterns

Wrong: Peeking at results and stopping early

Checking the dashboard daily and shipping the first time p-value dips below 0.05. With daily peeking over 30 days, actual false positive rate inflates from 5% to approximately 25–30%. [src1]

Correct: Pre-commit to sample size or use sequential testing

Calculate sample size before launch and only analyze at completion. Alternatively, use sequential testing (Statsig, Eppo) that maintains valid confidence intervals at every point. [src3]
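
The inflation from peeking can be demonstrated with a seeded A/A simulation: no real effect exists, yet "stop the first time p < 0.05" triggers far more often than 5%. Parameters below are illustrative, not taken from the sources:

```python
import random
from math import erf, sqrt

def peeking_false_positive_rate(days=30, users_per_day=100, p=0.05,
                                sims=400, seed=7):
    """Simulate A/A tests with daily peeking. Returns the fraction of
    simulations where ANY daily look shows p < 0.05 (a false positive,
    since control and treatment are identical)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        cc = ct = 0                      # cumulative conversions per arm
        for day in range(1, days + 1):
            cc += sum(rng.random() < p for _ in range(users_per_day))
            ct += sum(rng.random() < p for _ in range(users_per_day))
            n = day * users_per_day      # cumulative users per arm
            pool = (cc + ct) / (2 * n)
            se = sqrt(pool * (1 - pool) * 2 / n) or 1e-12
            z = abs(ct / n - cc / n) / se
            if 1 - erf(z / sqrt(2)) < 0.05:   # "ship it!" on first dip
                hits += 1
                break
        # no early stop -> no false positive for this simulation
    return hits / sims
```

With these settings the observed rate lands well above the nominal 5%, consistent with the 25–30% inflation cited above.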

Wrong: Testing too many metrics without correction

Running an experiment with 10 success metrics, then celebrating when any one shows significance. With 10 tests at alpha 0.05, probability of at least one false positive is ~40%.

Correct: One primary metric with multiple testing correction

Choose a single OEC for the go/no-go decision. For secondary metrics, apply Bonferroni correction or false discovery rate control. [src1]
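
The ~40% figure follows from the family-wise error formula for independent tests, and the Bonferroni fix is just a divided threshold. A minimal sketch:

```python
def family_wise_error(alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(alpha: float, m: int) -> float:
    """Bonferroni-corrected per-test significance threshold."""
    return alpha / m
```

With 10 metrics at alpha 0.05, family_wise_error gives ~0.40, and Bonferroni tightens each per-metric test to 0.005.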

Wrong: Running experiments on tiny audiences

A/B testing a feature used by 200 visitors/week — even detecting a 50% relative lift requires ~4,700 per variation, meaning 47 weeks of data.

Correct: Match experiment ambition to traffic

Use the sample size calculator BEFORE committing. If runtime exceeds 8 weeks, target a higher-traffic page, test a bolder change, or use qualitative methods. [src2]

When This Matters

Use when a startup or product team needs a rigorous, repeatable system for running experiments. Essential once a product has sufficient traffic (100+ weekly conversions) and the team makes data-informed product decisions. Prevents the two most expensive experiment failures: shipping losers that look like winners (false positives from peeking) and killing winners that looked inconclusive (underpowered tests).

Related Units