This recipe produces a complete experiment tracking system — from structured hypothesis documentation through sample size calculation, test execution, statistical analysis, and decision logging. The output is a reusable framework that ensures every product experiment follows rigorous methodology: declared hypotheses, pre-calculated sample sizes, significance thresholds, and documented decisions with learnings. It prevents the most common startup failure mode: teams that “run experiments” by shipping changes and checking metrics a few days later, without statistical rigor. [src1]
Which path?
├── User is non-technical AND budget = free
│   └── PATH A: No-Code Free — Google Sheets + Evan Miller calculator + free PostHog
├── User is non-technical AND budget > $0
│   └── PATH B: No-Code Paid — VWO or Optimizely (built-in hypothesis management)
├── User is semi-technical or developer AND budget = free
│   └── PATH C: Code + Free — Statsig/PostHog SDK + spreadsheet + Evan Miller
└── User is developer AND budget > $0
    └── PATH D: Code + Paid — Statsig/Eppo + warehouse-native analysis
| Path | Tools | Cost | Speed | Output Quality |
|---|---|---|---|---|
| A: No-Code Free | Google Sheets + Evan Miller + PostHog | $0 | 30 min/experiment | Good — rigorous if discipline maintained |
| B: No-Code Paid | VWO or Optimizely | $200–500/mo | 10–15 min/experiment | Excellent — enforced workflow |
| C: Code + Free | Statsig/PostHog SDK + Sheets | $0 | 20 min/experiment | Very Good — SDK automates assignment |
| D: Code + Paid | Statsig/Eppo + warehouse | $200–1000/mo | 10 min/experiment | Excellent — automated power analysis + CUPED |
Duration: 10–20 minutes · Tool: Experiment tracker
Write a structured hypothesis: “If [specific change], then [measurable outcome] will [direction] by [estimated magnitude], because [reasoning based on data].” Include experiment ID, primary metric (OEC), guardrail metrics, owner, and status. [src4]
Verify: Hypothesis includes a specific, falsifiable prediction with a named metric and estimated effect size. · If failed: If the team cannot state a specific expected outcome, document as exploratory and define what “interesting” means before launching.
Duration: 5–10 minutes · Tool: Evan Miller Calculator or Statsig Power Analysis [src2] [src3]
Calculate minimum sample size per variation BEFORE launching. Inputs: baseline conversion rate, minimum detectable effect (MDE), alpha (0.05), power (0.80). Then estimate runtime: total sample size / weekly traffic × 7 days.
Verify: Sample size documented. Runtime fits within 2–8 weeks. · If failed: Increase MDE, increase traffic allocation, or use CUPED variance reduction (20–50% sample reduction). [src3]
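The sample size and runtime arithmetic above can be sketched with the standard two-proportion formula. This is a stdlib-only approximation of what calculators like Evan Miller's compute; the function names and the relative-MDE convention are illustrative choices, not part of any specific tool:

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, mde_relative, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided two-proportion test.

    baseline: current conversion rate, e.g. 0.05
    mde_relative: minimum detectable effect as a relative lift, e.g. 0.10 for +10%
    """
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # e.g. 0.84 for power = 0.80
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * pooled * (1 - pooled))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

def runtime_days(n_per_arm, variations, weekly_traffic):
    """Runtime estimate from the recipe: total sample / weekly traffic x 7 days."""
    return math.ceil(n_per_arm * variations / weekly_traffic * 7)
```

For example, a 5% baseline with a 10% relative MDE needs roughly 31,000 users per variation, which is why the runtime check in this step matters before launch.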
Duration: 15–30 minutes · Tool: Experimentation platform + tracker
Document full design: hypothesis, primary metric, guardrails, traffic allocation, sample size target, runtime estimate, randomization unit, targeting rules, and exclusions. Get design reviewed by at least one other person.
Verify: Design document complete. No overlap with running experiments. · If failed: If overlapping experiments exist, pause one or implement mutual exclusion groups.
Duration: 15–60 minutes · Tool: Platform SDK or visual editor
Code path: integrate experimentation SDK, implement random assignment, and fire conversion events. No-code path: use VWO/Optimizely visual editor to create variations, set targeting, configure goals, and QA in preview mode.
Verify: Both variations render correctly. Events fire on expected actions. Platform shows incoming data within 30 minutes. · If failed: Check SDK initialization, event naming, and ad-blocker interference.
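On the code path, experimentation SDKs typically make assignment deterministic by hashing the experiment and user IDs, so a returning user always sees the same variation. A minimal sketch of the idea (not any vendor's actual algorithm):

```python
import hashlib

def assign_variation(experiment_id: str, user_id: str,
                     variations=("control", "treatment")):
    """Deterministic bucketing: hash (experiment, user) to a variation.

    SHA-256 output is effectively uniform, so buckets split evenly;
    a new experiment_id reshuffles all users (useful after an SRM restart).
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % len(variations)
    return variations[bucket]
```

Because assignment is a pure function of the two IDs, it needs no server-side state and is reproducible when debugging event data.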
Duration: 1–8 weeks · Tool: Platform dashboard
Monitor weekly: check for sample ratio mismatch (SRM), data quality, and guardrail violations. Do NOT stop early, extend, change allocation, or add metrics mid-experiment. If using sequential testing (Statsig, Eppo), valid decisions can be made when the platform declares significance. [src3]
Verify: Sample size target reached. No SRM detected. Guardrails intact. · If failed: If SRM is detected (observed split deviates from the planned ratio, e.g. chi-square p < 0.01) — stop, investigate randomization, and restart with a new experiment ID.
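The SRM check is a chi-square goodness-of-fit test on observed arm counts versus the planned split. A stdlib-only sketch (the 0.01 threshold follows this recipe; some teams use a stricter 0.001 to limit false alarms from routine monitoring):

```python
import math
from statistics import NormalDist

def srm_pvalue(n_control, n_treatment, expected_ratio=0.5):
    """Chi-square goodness-of-fit test (df = 1) for sample ratio mismatch.

    expected_ratio is the planned share of traffic in control.
    """
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    # With df = 1, chi2 is the square of a standard normal variable.
    return 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))

def has_srm(n_control, n_treatment, expected_ratio=0.5, threshold=0.01):
    """Flag SRM when the p-value falls below the chosen threshold."""
    return srm_pvalue(n_control, n_treatment, expected_ratio) < threshold
```

A 50/50 test that lands at 5,000 vs 5,400 users fails this check decisively, while ordinary noise like 5,000 vs 5,050 passes.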
Duration: 30–60 minutes · Tool: Platform + spreadsheet
Analyze: calculate lift, 95% confidence interval, p-value for primary metric. Check all guardrails against pre-set thresholds. Segment analysis is exploratory only — not primary decision criteria.
Verify: CI does not cross zero for significant results. Guardrails within bounds. · If failed: If inconclusive (CI crosses zero), document as inconclusive. Do NOT extend experiment.
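The lift, confidence interval, and p-value for a conversion metric come from a two-proportion z-test. A stdlib-only sketch (field names in the returned dict are illustrative):

```python
import math
from statistics import NormalDist

def analyze(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Two-proportion z-test for the primary metric.

    conv_*: conversion counts; n_*: sample sizes per arm.
    Returns relative lift, a CI on the absolute difference, and a
    two-sided p-value.
    """
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Unpooled standard error for the confidence interval
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z * se, diff + z * se)
    # Pooled standard error for the null-hypothesis test
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pool)))
    return {"relative_lift": diff / p_c, "ci_absolute": ci, "p_value": p_value}
```

The "CI does not cross zero" verification above corresponds to `ci_absolute[0] > 0` (or `ci_absolute[1] < 0` for a negative effect).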
Duration: 15–30 minutes · Tool: Experiment tracker
Apply decision matrix: significant + positive + guardrails intact = Ship. Significant + guardrail violated = Iterate. Not significant + within MDE = Inconclusive. Significant + negative = Kill. [src1]
Verify: Decision recorded with rationale. Stakeholders notified. · If failed: If team disagrees despite clear data, escalate to pre-agreed decision-maker.
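The decision matrix above is mechanical enough to encode directly, which helps when teams disagree after the fact. A minimal sketch following the recipe's four outcomes:

```python
def decide(p_value, lift, guardrails_ok, alpha=0.05):
    """Decision matrix: Ship / Iterate / Kill / Inconclusive.

    lift: observed effect on the primary metric (sign matters).
    guardrails_ok: True when all guardrails stayed within bounds.
    """
    significant = p_value < alpha
    if significant and lift > 0 and guardrails_ok:
        return "Ship"
    if significant and lift > 0:
        return "Iterate"      # positive but a guardrail was violated
    if significant and lift < 0:
        return "Kill"
    return "Inconclusive"     # not significant at the target sample size
```

Recording the inputs alongside the output gives each decision the rationale the verification step asks for.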
Duration: 15–20 minutes · Tool: Tracker + learnings repository
Record experiment conclusion: decision, observed lift, projected impact, key learnings, follow-up experiments, and searchable tags. Ensure learnings are indexed by feature area, metric, and outcome. [src6]
Verify: Learnings searchable by tag. Follow-up experiments added to backlog. · If failed: If learnings are not consulted, institute mandatory prior-art search before new experiments.
{
  "output_type": "experiment_tracker",
  "format": "spreadsheet or database",
  "columns": [
    {"name": "experiment_id", "type": "string", "description": "Unique ID (EXP-YYYY-NNN)", "required": true},
    {"name": "hypothesis", "type": "string", "description": "If/then/because statement", "required": true},
    {"name": "primary_metric", "type": "string", "description": "Overall Evaluation Criterion", "required": true},
    {"name": "baseline_rate", "type": "number", "description": "Current metric value", "required": true},
    {"name": "mde", "type": "number", "description": "Minimum detectable effect", "required": true},
    {"name": "sample_size_per_variation", "type": "number", "description": "Required n per variation", "required": true},
    {"name": "status", "type": "string", "description": "Proposed|Designed|Running|Analyzing|Decided", "required": true},
    {"name": "decision", "type": "string", "description": "Ship|Iterate|Kill|Inconclusive", "required": false},
    {"name": "learnings", "type": "string", "description": "Key takeaways", "required": false}
  ],
  "expected_row_count": "10-200 per quarter",
  "sort_order": "date_proposed descending",
  "deduplication_key": "experiment_id"
}
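If the tracker lives in a spreadsheet or database, rows can be validated against the required columns in the schema above with a few lines (a sketch; the problem-message format is an illustrative choice):

```python
# Required columns and status values, taken from the tracker schema.
REQUIRED_COLUMNS = {
    "experiment_id", "hypothesis", "primary_metric",
    "baseline_rate", "mde", "sample_size_per_variation", "status",
}
VALID_STATUSES = {"Proposed", "Designed", "Running", "Analyzing", "Decided"}

def validate_row(row: dict) -> list:
    """Return a list of problems with a tracker row; an empty list means valid."""
    problems = [f"missing: {col}" for col in sorted(REQUIRED_COLUMNS)
                if row.get(col) in ("", None)]
    if row.get("status") not in VALID_STATUSES:
        problems.append(f"bad status: {row.get('status')!r}")
    return problems
```

Running this check before an experiment moves to Running is one way to enforce the "pre-calculated sample size" quality metric below.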
| Quality Metric | Minimum Acceptable | Good | Excellent |
|---|---|---|---|
| Hypothesis completion rate | > 70% | > 85% | > 95% |
| Pre-calculated sample size rate | > 80% | > 95% | 100% |
| Experiment win rate | > 15% | > 25% | > 35% |
| SRM detection rate | > 90% | > 95% | 100% |
| Learnings documented rate | > 60% | > 80% | > 95% |
| Experiments concluded/quarter | > 3 | > 8 | > 15 |
If below minimum: If hypothesis completion is low, enforce the structured template — reject experiments without full documentation. If win rate is below 15%, improve hypothesis quality by requiring data-backed observations.
| Error | Likely Cause | Recovery Action |
|---|---|---|
| Sample ratio mismatch (SRM) | Randomization bug, bot traffic, or redirects | Stop experiment. Investigate randomization. Fix root cause and restart with new ID. |
| Experiment never reaches significance | Underpowered test — MDE too small for traffic | Document as inconclusive. Use larger MDE or CUPED variance reduction next time. [src3] |
| Guardrail metric violated | Treatment has unintended side effect | If critical guardrail: stop. If non-critical: continue to full sample, then decide. |
| Conflicting overlapping experiments | No experiment coordination | Implement experiment layers/namespaces. Create shared experiment calendar. |
| Results contradict strong priors | Novelty effect, Hawthorne effect | Segment by exposure time. Consider holdback test (95/5 split for 2 weeks). [src1] |
| Premature stopping from peeking | Dashboard checked before sample size reached | Switch to sequential testing or restrict dashboard access until target met. |
| Component | Free Tier | Paid Tier | At Scale |
|---|---|---|---|
| Experimentation platform | PostHog: 1M events/mo; Statsig: 10M events/mo | $150–500/mo (VWO, Optimizely) | $1,000–5,000/mo |
| Sample size calculator | Evan Miller: unlimited | Built-in (Statsig, Eppo) | Automated power analysis |
| Experiment tracker | Google Sheets: $0 | Notion: $10–20/mo | Platform-native |
| Statistical analysis | Manual spreadsheet | Platform auto-analysis | Warehouse-native (Eppo) |
| Total per quarter | $0 | $500–1,600 | $3,000–15,000+ |
Checking the dashboard daily and shipping the first time p-value dips below 0.05. With daily peeking over 30 days, actual false positive rate inflates from 5% to approximately 25–30%. [src1]
Calculate sample size before launch and only analyze at completion. Alternatively, use sequential testing (Statsig, Eppo) that maintains valid confidence intervals at every point. [src3]
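The inflation from peeking can be demonstrated with an A/A simulation: both arms have the same true conversion rate, yet "shipping" on the first sub-0.05 p-value produces far more than 5% false winners. All parameters below are illustrative, and the exact inflated rate depends on traffic and peek frequency:

```python
import math
import random
from statistics import NormalDist

def peeking_false_positive_rate(n_sims=300, days=30, daily_n=150,
                                rate=0.05, alpha=0.05, seed=7):
    """A/A simulation of daily peeking.

    Both arms draw from the same true rate; we 'ship' the first day the
    two-proportion p-value dips below alpha. Returns the fraction of
    null experiments wrongly declared winners.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        conv_c = conv_t = n = 0
        for _ in range(days):
            n += daily_n
            conv_c += sum(rng.random() < rate for _ in range(daily_n))
            conv_t += sum(rng.random() < rate for _ in range(daily_n))
            p_pool = (conv_c + conv_t) / (2 * n)
            if p_pool in (0.0, 1.0):
                continue  # no conversions yet; z-test undefined
            se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
            if abs(conv_t / n - conv_c / n) / se > z_crit:
                false_positives += 1  # a peek "won": ship and stop
                break
    return false_positives / n_sims
```

Running this yields a false positive rate several times the nominal 5%, which is exactly why fixed-horizon analysis or sequential testing is required.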
Running an experiment with 10 success metrics, then celebrating when any one shows significance. With 10 tests at alpha 0.05, probability of at least one false positive is ~40%.
Choose a single OEC for the go/no-go decision. For secondary metrics, apply Bonferroni correction or false discovery rate control. [src1]
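The Bonferroni correction mentioned above is one line of arithmetic: with m secondary metrics, test each at alpha/m so the family-wise error rate stays at alpha. A minimal sketch:

```python
def bonferroni(p_values, alpha=0.05):
    """Test each of m metrics at alpha/m; keeps family-wise error at alpha.

    Returns a True/False significance verdict per metric. Conservative:
    it controls false positives at the cost of some power.
    """
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# Why correction is needed: with 10 independent tests at alpha = 0.05,
# the chance of at least one false positive is 1 - 0.95**10, about 0.40.
family_fpr = 1 - (1 - 0.05) ** 10
```

With three secondary metrics, the per-metric threshold drops to 0.0167, so a nominal p = 0.03 no longer counts as significant.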
A/B testing a feature used by 200 visitors/week — even detecting a 50% relative lift requires ~4,700 per variation, meaning 47 weeks of data.
Use the sample size calculator BEFORE committing. If runtime exceeds 8 weeks, target a higher-traffic page, test a bolder change, or use qualitative methods. [src2]
Use when a startup or product team needs a rigorous, repeatable system for running experiments. Essential once a product has sufficient traffic (100+ weekly conversions) and the team makes data-informed product decisions. Prevents the two most expensive experiment failures: shipping losers that look like winners (false positives from peeking) and killing winners that looked inconclusive (underpowered tests).