Experiment Tracking Framework
Purpose
This recipe produces a complete experiment tracking system — from structured hypothesis documentation through sample size calculation, test execution, statistical analysis, and decision logging. The output is a reusable framework that ensures every product experiment follows rigorous methodology: declared hypotheses, pre-calculated sample sizes, significance thresholds, and documented decisions with learnings. It prevents the most common startup failure mode, where teams “run experiments” by shipping changes and checking metrics a few days later without statistical rigor. [src1]
Prerequisites
- Baseline metrics — current conversion rates (or relevant KPIs) with at least 2–4 weeks of historical data
- Traffic estimate — weekly unique visitors or events for the metric being tested
- Experimentation platform account — Statsig (free: 10M events/mo), PostHog (free: 1M events/mo), or VWO (free trial)
- Experiment tracker — Google Sheets, Notion, or platform-native dashboard (Optimizely Collaboration, VWO Plan)
- Stakeholder alignment — agreement on primary metric (OEC) and guardrail metrics
Constraints
- Never peek at results and stop early without sequential testing correction — inflates false positive rates to 25–30%. [src1]
- Declare statistical significance threshold (alpha = 0.05) and MDE BEFORE launching — post-hoc adjustment is p-hacking. [src1]
- Run experiments for at least 1–2 full business cycles (7–14 days minimum) to capture weekly seasonality. [src1]
- Multiple metric testing requires Bonferroni or FDR correction — 10 metrics at alpha 0.05 gives a ~40% chance of at least one false positive.
- Never run overlapping experiments on the same population affecting the same metric without interaction analysis.
Tool Selection Decision
Which path?
├── User is non-technical AND budget = free
│ └── PATH A: No-Code Free — Google Sheets + Evan Miller calculator + free PostHog
├── User is non-technical AND budget > $0
│ └── PATH B: No-Code Paid — VWO or Optimizely (built-in hypothesis management)
├── User is semi-technical or developer AND budget = free
│ └── PATH C: Code + Free — Statsig/PostHog SDK + spreadsheet + Evan Miller
└── User is developer AND budget > $0
└── PATH D: Code + Paid — Statsig/Eppo + warehouse-native analysis
| Path | Tools | Cost | Speed | Output Quality |
|---|---|---|---|---|
| A: No-Code Free | Google Sheets + Evan Miller + PostHog | $0 | 30 min/experiment | Good — rigorous if discipline maintained |
| B: No-Code Paid | VWO Plan or Optimizely | $200–500/mo | 10–15 min/experiment | Excellent — enforced workflow |
| C: Code + Free | Statsig/PostHog SDK + Sheets | $0 | 20 min/experiment | Very Good — SDK automates assignment |
| D: Code + Paid | Statsig/Eppo + warehouse | $200–1000/mo | 10 min/experiment | Excellent — automated power analysis + CUPED |
Execution Flow
Step 1: Write the Hypothesis
Duration: 10–20 minutes · Tool: Experiment tracker
Write a structured hypothesis: “If [specific change], then [measurable outcome] will [direction] by [estimated magnitude], because [reasoning based on data].” Include experiment ID, primary metric (OEC), guardrail metrics, owner, and status. [src4]
Verify: Hypothesis includes a specific, falsifiable prediction with a named metric and estimated effect size. · If failed: If the team cannot state a specific expected outcome, document as exploratory and define what “interesting” means before launching.
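Illustrative example (names and numbers hypothetical): “If we shorten the signup form from 8 fields to 4, then signup conversion will increase by roughly 10% relative, because 60% of drop-offs in session recordings happen on the optional fields.” — EXP-2025-014, OEC: signup conversion, guardrails: activation rate and support tickets, owner: growth PM, status: Proposed.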
Step 2: Calculate Sample Size
Duration: 5–10 minutes · Tool: Evan Miller Calculator or Statsig Power Analysis [src2] [src3]
Calculate minimum sample size per variation BEFORE launching. Inputs: baseline conversion rate, minimum detectable effect (MDE), alpha (0.05), power (0.80). Then estimate runtime in days: (total sample size across all variations ÷ weekly eligible traffic) × 7.
Verify: Sample size documented. Runtime fits within 2–8 weeks. · If failed: Increase MDE, increase traffic allocation, or use CUPED variance reduction (20–50% sample reduction). [src3]
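For the code paths, roughly the calculation such calculators perform can be reproduced directly. A minimal sketch using the standard normal-approximation formula for two proportions, with hypothetical baseline, MDE, and traffic values — treat it as a cross-check on the calculator, not a replacement:
```python
# Sample size per variation for a two-proportion test (normal approximation),
# plus the runtime estimate from Step 2. All inputs here are hypothetical.
from scipy.stats import norm

def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.80):
    p1 = baseline
    p2 = baseline * (1 + relative_mde)        # expected treatment rate
    z_alpha = norm.ppf(1 - alpha / 2)         # two-sided significance threshold
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(((z_alpha + z_beta) ** 2) * variance / (p2 - p1) ** 2) + 1

def runtime_days(n_per_variation, weekly_traffic, variations=2):
    # (total sample size across variations / weekly eligible traffic) * 7
    return n_per_variation * variations / weekly_traffic * 7

n = sample_size_per_variation(baseline=0.04, relative_mde=0.10)
print(n, runtime_days(n, weekly_traffic=20_000))   # ~39,500 per variation, ~28 days
```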
Step 3: Design the Experiment
Duration: 15–30 minutes · Tool: Experimentation platform + tracker
Document full design: hypothesis, primary metric, guardrails, traffic allocation, sample size target, runtime estimate, randomization unit, targeting rules, and exclusions. Get design reviewed by at least one other person.
Verify: Design document complete. No overlap with running experiments. · If failed: If overlapping experiments exist, pause one or implement mutual exclusion groups.
Step 4: Implement Tracking
Duration: 15–60 minutes · Tool: Platform SDK or visual editor
Code path: integrate the experimentation SDK, implement variant assignment, and fire exposure and conversion events. No-code path: use VWO/Optimizely visual editor to create variations, set targeting, configure goals, and QA in preview mode.
Verify: Both variations render correctly. Events fire on expected actions. Platform shows incoming data within 30 minutes. · If failed: Check SDK initialization, event naming, and ad-blocker interference.
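The platform SDKs (Statsig, PostHog) handle assignment and event capture for you; the sketch below only illustrates the underlying code path — deterministic hash-based assignment plus exposure and conversion events — with hypothetical user IDs, experiment IDs, and event names, and a print stub standing in for the real capture call.
```python
# Deterministic hash bucketing: the same user always sees the same variation.
import hashlib

def assign_variation(user_id: str, experiment_id: str,
                     variations=("control", "treatment")) -> str:
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return variations[int(digest, 16) % len(variations)]

def track(event: str, user_id: str, properties: dict) -> None:
    print(event, user_id, properties)   # stand-in for the SDK's capture call

user, exp = "user_123", "EXP-2025-014"
variation = assign_variation(user, exp)
track("experiment_exposure", user, {"experiment": exp, "variation": variation})
# ... render the assigned variation, then log the conversion:
track("signup_completed", user, {"experiment": exp, "variation": variation})
```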
Step 5: Run Test to Significance
Duration: 1–8 weeks · Tool: Platform dashboard
Monitor weekly: check for sample ratio mismatch (SRM), data quality, and guardrail violations. Do NOT stop early, extend, change allocation, or add metrics mid-experiment. If using sequential testing (Statsig, Eppo), valid decisions can be made when the platform declares significance. [src3]
Verify: Sample size target reached. No SRM detected. Guardrails intact. · If failed: If the observed traffic split deviates from the planned allocation by more than ~1%, stop, investigate randomization, and restart with a new experiment ID.
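Most paid platforms flag SRM automatically; for spreadsheet-based paths, a chi-square goodness-of-fit test over the assignment counts is enough. A sketch assuming a planned 50/50 split and hypothetical counts; p < 0.001 is a commonly used alarm threshold:
```python
# SRM check: are observed assignment counts consistent with the planned split?
from scipy.stats import chisquare

observed = [50_912, 49_088]                  # hypothetical assignment counts
expected = [sum(observed) * 0.5] * 2         # planned 50/50 allocation
stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:                          # common alarm threshold for SRM
    print(f"SRM detected (p={p_value:.2e}) - stop and investigate randomization")
```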
Step 6: Analyze Results
Duration: 30–60 minutes · Tool: Platform + spreadsheet
Analyze: calculate lift, 95% confidence interval, and p-value for the primary metric. Check all guardrails against pre-set thresholds. Segment analysis is exploratory only — not a primary decision criterion.
Verify: CI does not cross zero for significant results. Guardrails within bounds. · If failed: If inconclusive (CI crosses zero), document as inconclusive. Do NOT extend experiment.
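Platform dashboards report these numbers, but recomputing them is a cheap sanity check. A minimal sketch for a binary conversion metric with hypothetical counts: relative lift, a 95% confidence interval on the absolute difference, and a two-sided z-test p-value.
```python
# Lift, 95% CI on the absolute difference, and two-sided p-value for two proportions.
from math import sqrt
from scipy.stats import norm

def analyze(conv_c, n_c, conv_t, n_t, alpha=0.05):
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    pooled = (conv_c + conv_t) / (n_c + n_t)                      # pooled rate for the z-test
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)      # unpooled SE for the CI
    margin = norm.ppf(1 - alpha / 2) * se
    return {
        "relative_lift": diff / p_c,
        "ci_95": (diff - margin, diff + margin),
        "p_value": 2 * (1 - norm.cdf(abs(diff / se_pooled))),
    }

print(analyze(conv_c=1_980, n_c=39_500, conv_t=2_210, n_t=39_500))
```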
Step 7: Make Decision
Duration: 15–30 minutes · Tool: Experiment tracker
Apply decision matrix: significant + positive + guardrails intact = Ship. Significant + guardrail violated = Iterate. Not significant + within MDE = Inconclusive. Significant + negative = Kill. [src1]
Verify: Decision recorded with rationale. Stakeholders notified. · If failed: If team disagrees despite clear data, escalate to pre-agreed decision-maker.
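The matrix can be encoded as a pure function so the recorded decision is mechanical rather than negotiated. This simplified sketch folds the “within MDE” nuance into the significance flag and assumes guardrail checks have already been reduced to a boolean:
```python
# Step 7 decision matrix as a function of three pre-computed booleans.
def decide(significant: bool, lift_positive: bool, guardrails_ok: bool) -> str:
    if not significant:
        return "Inconclusive"
    if not lift_positive:
        return "Kill"
    if not guardrails_ok:
        return "Iterate"
    return "Ship"

print(decide(significant=True, lift_positive=True, guardrails_ok=True))   # Ship
```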
Step 8: Document Learnings
Duration: 15–20 minutes · Tool: Tracker + learnings repository
Record experiment conclusion: decision, observed lift, projected impact, key learnings, follow-up experiments, and searchable tags. Ensure learnings are indexed by feature area, metric, and outcome. [src6]
Verify: Learnings searchable by tag. Follow-up experiments added to backlog. · If failed: If learnings are not consulted, institute mandatory prior-art search before new experiments.
Output Schema
{
"output_type": "experiment_tracker",
"format": "spreadsheet or database",
"columns": [
{"name": "experiment_id", "type": "string", "description": "Unique ID (EXP-YYYY-NNN)", "required": true},
{"name": "hypothesis", "type": "string", "description": "If/then/because statement", "required": true},
{"name": "primary_metric", "type": "string", "description": "Overall Evaluation Criterion", "required": true},
{"name": "baseline_rate", "type": "number", "description": "Current metric value", "required": true},
{"name": "mde", "type": "number", "description": "Minimum detectable effect", "required": true},
{"name": "sample_size_per_variation", "type": "number", "description": "Required n per variation", "required": true},
{"name": "status", "type": "string", "description": "Proposed|Designed|Running|Analyzing|Decided", "required": true},
{"name": "decision", "type": "string", "description": "Ship|Iterate|Kill|Inconclusive", "required": false},
{"name": "learnings", "type": "string", "description": "Key takeaways", "required": false}
],
"expected_row_count": "10-200 per quarter",
"sort_order": "date_proposed descending",
"deduplication_key": "experiment_id"
}
Quality Benchmarks
| Quality Metric | Minimum Acceptable | Good | Excellent |
|---|---|---|---|
| Hypothesis completion rate | > 70% | > 85% | > 95% |
| Pre-calculated sample size rate | > 80% | > 95% | 100% |
| Experiment win rate | > 15% | > 25% | > 35% |
| SRM detection rate | > 90% | > 95% | 100% |
| Learnings documented rate | > 60% | > 80% | > 95% |
| Experiments concluded/quarter | > 3 | > 8 | > 15 |
If below minimum: If hypothesis completion is low, enforce the structured template — reject experiments without full documentation. If win rate is below 15%, improve hypothesis quality by requiring data-backed observations.
Error Handling
| Error | Likely Cause | Recovery Action |
|---|---|---|
| Sample ratio mismatch (SRM) | Randomization bug, bot traffic, or redirects | Stop experiment. Investigate randomization. Fix root cause and restart with new ID. |
| Experiment never reaches significance | Underpowered test — MDE too small for traffic | Document as inconclusive. Use larger MDE or CUPED variance reduction next time. [src3] |
| Guardrail metric violated | Treatment has unintended side effect | If critical guardrail: stop. If non-critical: continue to full sample, then decide. |
| Conflicting overlapping experiments | No experiment coordination | Implement experiment layers/namespaces. Create shared experiment calendar. |
| Results contradict strong priors | Novelty effect, Hawthorne effect | Segment by exposure time. Consider holdback test (95/5 split for 2 weeks). [src1] |
| Premature stopping from peeking | Dashboard checked before sample size reached | Switch to sequential testing or restrict dashboard access until target met. |
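CUPED is named in Step 2 and in the table above without being spelled out. A minimal sketch, assuming per-user arrays of the in-experiment metric y and a pre-experiment covariate x_pre (typically the same metric measured before exposure): the adjusted values keep the same mean but have lower variance, which is where the quoted 20–50% sample size reduction comes from.
```python
# CUPED adjustment: remove the variance in y that is explained by x_pre.
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    cov = np.cov(y, x_pre)                      # 2x2 covariance matrix
    theta = cov[0, 1] / cov[1, 1]               # slope of y on x_pre
    return y - theta * (x_pre - x_pre.mean())   # same mean, lower variance
```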
Cost Breakdown
| Component | Free Tier | Paid Tier | At Scale |
|---|---|---|---|
| Experimentation platform | PostHog: 1M events/mo; Statsig: 10M events/mo | $150–500/mo (VWO, Optimizely) | $1,000–5,000/mo |
| Sample size calculator | Evan Miller: unlimited | Built-in (Statsig, Eppo) | Automated power analysis |
| Experiment tracker | Google Sheets: $0 | Notion: $10–20/mo | Platform-native |
| Statistical analysis | Manual spreadsheet | Platform auto-analysis | Warehouse-native (Eppo) |
| Total per quarter | $0 | $500–1,600 | $3,000–15,000+ |
Anti-Patterns
Wrong: Peeking at results and stopping early
Checking the dashboard daily and shipping the first time p-value dips below 0.05. With daily peeking over 30 days, actual false positive rate inflates from 5% to approximately 25–30%. [src1]
Correct: Pre-commit to sample size or use sequential testing
Calculate sample size before launch and only analyze at completion. Alternatively, use sequential testing (Statsig, Eppo) that maintains valid confidence intervals at every point. [src3]
Wrong: Testing too many metrics without correction
Running an experiment with 10 success metrics, then celebrating when any one shows significance. With 10 tests at alpha 0.05, probability of at least one false positive is ~40%.
Correct: One primary metric with multiple testing correction
Choose a single OEC for the go/no-go decision. For secondary metrics, apply Bonferroni correction or false discovery rate control. [src1]
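A quick worked check of the figures above, using the stated alpha and metric count:
```python
# Family-wise error rate for 10 independent metrics vs. the Bonferroni fix.
alpha, k = 0.05, 10
fwer = 1 - (1 - alpha) ** k     # ~0.401, the "~40%" quoted above
bonferroni_alpha = alpha / k    # 0.005 per secondary metric
print(round(fwer, 3), bonferroni_alpha)
```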
Wrong: Running experiments on tiny audiences
A/B testing a feature used by 200 visitors/week — even detecting a 50% relative lift can require ~4,700 users per variation (~9,400 total across two variations), meaning roughly 47 weeks of data.
Correct: Match experiment ambition to traffic
Use the sample size calculator BEFORE committing. If runtime exceeds 8 weeks, target a higher-traffic page, test a bolder change, or use qualitative methods. [src2]
When This Matters
Use when a startup or product team needs a rigorous, repeatable system for running experiments. Essential once a product has sufficient traffic (100+ weekly conversions) and the team makes data-informed product decisions. Prevents the two most expensive experiment failures: shipping losers that look like winners (false positives from peeking) and killing winners that looked inconclusive (underpowered tests).