Experiment Tracking Framework

Type: Execution Recipe · Confidence: 0.88 · Sources: 6 · Verified: 2026-03-12

Purpose

This recipe produces a complete experiment tracking system — from structured hypothesis documentation through sample size calculation, test execution, statistical analysis, and decision logging. The output is a reusable framework that ensures every product experiment follows rigorous methodology: declared hypotheses, pre-calculated sample sizes, significance thresholds, and documented decisions with learnings. It prevents the most common startup failure mode, where teams "run experiments" by shipping changes and checking metrics a few days later without statistical rigor. [src1]

Prerequisites

Constraints

Tool Selection Decision

Which path?
├── User is non-technical AND budget = free
│   └── PATH A: No-Code Free — Google Sheets + Evan Miller calculator + free PostHog
├── User is non-technical AND budget > $0
│   └── PATH B: No-Code Paid — VWO or Optimizely (built-in hypothesis management)
├── User is semi-technical or developer AND budget = free
│   └── PATH C: Code + Free — Statsig/PostHog SDK + spreadsheet + Evan Miller
└── User is developer AND budget > $0
    └── PATH D: Code + Paid — Statsig/Eppo + warehouse-native analysis
| Path | Tools | Cost | Speed | Output Quality |
|---|---|---|---|---|
| A: No-Code Free | Google Sheets + Evan Miller + PostHog | $0 | 30 min/experiment | Good — rigorous if discipline maintained |
| B: No-Code Paid | VWO Plan or Optimizely | $200–500/mo | 10–15 min/experiment | Excellent — enforced workflow |
| C: Code + Free | Statsig/PostHog SDK + Sheets | $0 | 20 min/experiment | Very Good — SDK automates assignment |
| D: Code + Paid | Statsig/Eppo + warehouse | $200–1000/mo | 10 min/experiment | Excellent — automated power analysis + CUPED |

Execution Flow

Step 1: Write the Hypothesis

Duration: 10–20 minutes · Tool: Experiment tracker

Write a structured hypothesis: “If [specific change], then [measurable outcome] will [direction] by [estimated magnitude], because [reasoning based on data].” Include experiment ID, primary metric (OEC), guardrail metrics, owner, and status. [src4]

Verify: Hypothesis includes a specific, falsifiable prediction with a named metric and estimated effect size. · If failed: If the team cannot state a specific expected outcome, document as exploratory and define what “interesting” means before launching.
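
The fields above can be captured in a minimal record. This is a sketch assuming Python; the class and field names are illustrative, not part of any tracker tool:

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """Structured hypothesis: if/then/because plus tracking fields from Step 1."""
    experiment_id: str            # e.g. "EXP-2026-001"
    change: str                   # the specific change being made
    outcome_metric: str           # primary metric (OEC)
    direction: str                # "increase" or "decrease"
    magnitude: str                # estimated effect, e.g. "+10% relative"
    reasoning: str                # data-backed reasoning
    guardrails: list = field(default_factory=list)
    owner: str = ""
    status: str = "Proposed"      # Proposed|Designed|Running|Analyzing|Decided

    def statement(self) -> str:
        """Render the canonical if/then/because sentence."""
        return (f"If {self.change}, then {self.outcome_metric} will "
                f"{self.direction} by {self.magnitude}, because {self.reasoning}.")
```

A record that cannot fill every field is a signal to document the work as exploratory instead, per the verification note above.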

Step 2: Calculate Sample Size

Duration: 5–10 minutes · Tool: Evan Miller Calculator or Statsig Power Analysis [src2] [src3]

Calculate minimum sample size per variation BEFORE launching. Inputs: baseline conversion rate, minimum detectable effect (MDE), alpha (0.05), power (0.80). Then estimate runtime: total sample size / weekly traffic × 7 days.
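
The calculation can be sketched with the standard two-proportion normal-approximation formula. This is a generic implementation, not Evan Miller's or Statsig's exact code, and it assumes the primary metric is a conversion rate:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variation(baseline, mde_relative, alpha=0.05, power=0.80):
    """Minimum n per arm for a two-proportion z-test (normal approximation),
    using the Step 2 inputs: baseline rate, relative MDE, alpha, power."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

def runtime_days(n_per_arm, n_arms, weekly_traffic):
    """Runtime estimate from the recipe: total sample / weekly traffic x 7 days."""
    return ceil(n_per_arm * n_arms / weekly_traffic * 7)
```

For example, a 5% baseline with a +10% relative MDE needs roughly 31,000 users per variation, which is why the runtime check in the verification step matters.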

Verify: Sample size documented. Runtime fits within 2–8 weeks. · If failed: Increase MDE, increase traffic allocation, or use CUPED variance reduction (20–50% sample reduction). [src3]

Step 3: Design the Experiment

Duration: 15–30 minutes · Tool: Experimentation platform + tracker

Document full design: hypothesis, primary metric, guardrails, traffic allocation, sample size target, runtime estimate, randomization unit, targeting rules, and exclusions. Get design reviewed by at least one other person.

Verify: Design document complete. No overlap with running experiments. · If failed: If overlapping experiments exist, pause one or implement mutual exclusion groups.

Step 4: Implement Tracking

Duration: 15–60 minutes · Tool: Platform SDK or visual editor

Code path: integrate experimentation SDK, implement random assignment, and fire conversion events. No-code path: use VWO/Optimizely visual editor to create variations, set targeting, configure goals, and QA in preview mode.
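
On the code path, random assignment is commonly implemented as deterministic hash bucketing, so a user always sees the same variant. This sketch is illustrative and is not any particular SDK's API:

```python
import hashlib

def assign_variant(experiment_id: str, user_id: str,
                   variants=("control", "treatment")):
    """Deterministic assignment: hash (experiment_id, user_id) into a bucket.
    Same user + same experiment -> same variant; a new experiment_id
    reshuffles everyone (needed when restarting after an SRM failure)."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000          # uniform bucket in [0, 10000)
    return variants[bucket * len(variants) // 10000]
```

In production the platform SDK handles this; the point of the sketch is that assignment must be a pure function of user and experiment, never of request timing.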

Verify: Both variations render correctly. Events fire on expected actions. Platform shows incoming data within 30 minutes. · If failed: Check SDK initialization, event naming, and ad-blocker interference.

Step 5: Run Test to Significance

Duration: 1–8 weeks · Tool: Platform dashboard

Monitor weekly: check for sample ratio mismatch (SRM), data quality, and guardrail violations. Do NOT stop early, extend, change allocation, or add metrics mid-experiment. If using sequential testing (Statsig, Eppo), valid decisions can be made when the platform declares significance. [src3]

Verify: Sample size target reached. No SRM detected. Guardrails intact. · If failed: If SRM > 1% — stop, investigate randomization, restart with new ID.
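
The SRM check can be sketched as a chi-square goodness-of-fit test on observed versus planned allocation. The p < 0.001 alarm threshold is a common convention, assumed here rather than taken from the sources:

```python
from math import erf, sqrt

def srm_check(n_control, n_treatment, expected_ratio=0.5, threshold=1e-3):
    """Chi-square goodness-of-fit test (1 df) for sample ratio mismatch.
    Returns (p_value, flagged); flagged means randomization is suspect."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    # Survival function of chi-square with 1 df: p = erfc(sqrt(chi2 / 2))
    p_value = 1 - erf(sqrt(chi2 / 2))
    return p_value, p_value < threshold
```

A 53/47 split on 10,000 users is flagged, while a clean 50/50 split is not; a flagged result means stop and restart with a new experiment ID, as the recovery note says.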

Step 6: Analyze Results

Duration: 30–60 minutes · Tool: Platform + spreadsheet

Analyze: calculate lift, 95% confidence interval, p-value for primary metric. Check all guardrails against pre-set thresholds. Segment analysis is exploratory only — not primary decision criteria.

Verify: CI does not cross zero for significant results. Guardrails within bounds. · If failed: If inconclusive (CI crosses zero), document as inconclusive. Do NOT extend experiment.
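
For a conversion-rate metric, the analysis in this step can be sketched as a two-proportion z-test. This is a generic frequentist implementation; a platform's auto-analysis may differ in detail:

```python
from math import erf, sqrt

def analyze(conv_c, n_c, conv_t, n_t):
    """Relative lift, 95% CI on the absolute difference, and two-sided
    p-value for a two-proportion z-test (normal approximation)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Pooled SE for the hypothesis test, unpooled SE for the CI
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = diff / se_pool
    p_value = 1 - erf(abs(z) / sqrt(2))      # two-sided
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return {"lift": diff / p_c, "ci": ci, "p_value": p_value}
```

Example: 500/10,000 control conversions vs 600/10,000 treatment gives a 20% relative lift with a CI that stays above zero, so the result would count as significant.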

Step 7: Make Decision

Duration: 15–30 minutes · Tool: Experiment tracker

Apply decision matrix: significant + positive + guardrails intact = Ship. Significant + guardrail violated = Iterate. Not significant + within MDE = Inconclusive. Significant + negative = Kill. [src1]

Verify: Decision recorded with rationale. Stakeholders notified. · If failed: If team disagrees despite clear data, escalate to pre-agreed decision-maker.
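
The decision matrix can be encoded directly so the call is mechanical rather than debated. Function and argument names here are illustrative:

```python
def decide(significant: bool, lift_positive: bool, guardrails_intact: bool) -> str:
    """Decision matrix from Step 7: Ship / Iterate / Kill / Inconclusive."""
    if not significant:
        return "Inconclusive"
    if not lift_positive:
        return "Kill"
    return "Ship" if guardrails_intact else "Iterate"
```

Pre-agreeing on this mapping is what makes the escalation rule above workable: disagreement can only be about inputs, not about the verdict.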

Step 8: Document Learnings

Duration: 15–20 minutes · Tool: Tracker + learnings repository

Record experiment conclusion: decision, observed lift, projected impact, key learnings, follow-up experiments, and searchable tags. Ensure learnings are indexed by feature area, metric, and outcome. [src6]

Verify: Learnings searchable by tag. Follow-up experiments added to backlog. · If failed: If learnings are not consulted, institute mandatory prior-art search before new experiments.

Output Schema

{
  "output_type": "experiment_tracker",
  "format": "spreadsheet or database",
  "columns": [
    {"name": "experiment_id", "type": "string", "description": "Unique ID (EXP-YYYY-NNN)", "required": true},
    {"name": "hypothesis", "type": "string", "description": "If/then/because statement", "required": true},
    {"name": "primary_metric", "type": "string", "description": "Overall Evaluation Criterion", "required": true},
    {"name": "baseline_rate", "type": "number", "description": "Current metric value", "required": true},
    {"name": "mde", "type": "number", "description": "Minimum detectable effect", "required": true},
    {"name": "sample_size_per_variation", "type": "number", "description": "Required n per variation", "required": true},
    {"name": "status", "type": "string", "description": "Proposed|Designed|Running|Analyzing|Decided", "required": true},
    {"name": "decision", "type": "string", "description": "Ship|Iterate|Kill|Inconclusive", "required": false},
    {"name": "learnings", "type": "string", "description": "Key takeaways", "required": false}
  ],
  "expected_row_count": "10-200 per quarter",
  "sort_order": "date_proposed descending",
  "deduplication_key": "experiment_id"
}

Quality Benchmarks

| Quality Metric | Minimum Acceptable | Good | Excellent |
|---|---|---|---|
| Hypothesis completion rate | > 70% | > 85% | > 95% |
| Pre-calculated sample size rate | > 80% | > 95% | 100% |
| Experiment win rate | > 15% | > 25% | > 35% |
| SRM detection rate | > 90% | > 95% | 100% |
| Learnings documented rate | > 60% | > 80% | > 95% |
| Experiments concluded/quarter | > 3 | > 8 | > 15 |

If below minimum: If hypothesis completion is low, enforce the structured template — reject experiments without full documentation. If win rate is below 15%, improve hypothesis quality by requiring data-backed observations.

Error Handling

| Error | Likely Cause | Recovery Action |
|---|---|---|
| Sample ratio mismatch (SRM) | Randomization bug, bot traffic, or redirects | Stop experiment. Investigate randomization. Fix root cause and restart with new ID. |
| Experiment never reaches significance | Underpowered test — MDE too small for traffic | Document as inconclusive. Use larger MDE or CUPED variance reduction next time. [src3] |
| Guardrail metric violated | Treatment has unintended side effect | If critical guardrail: stop. If non-critical: continue to full sample, then decide. |
| Conflicting overlapping experiments | No experiment coordination | Implement experiment layers/namespaces. Create shared experiment calendar. |
| Results contradict strong priors | Novelty effect, Hawthorne effect | Segment by exposure time. Consider holdback test (95/5 split for 2 weeks). [src1] |
| Premature stopping from peeking | Dashboard checked before sample size reached | Switch to sequential testing or restrict dashboard access until target met. |

Cost Breakdown

| Component | Free Tier | Paid Tier | At Scale |
|---|---|---|---|
| Experimentation platform | PostHog: 1M events/mo; Statsig: 10M events/mo | $150–500/mo (VWO, Optimizely) | $1,000–5,000/mo |
| Sample size calculator | Evan Miller: unlimited | Built-in (Statsig, Eppo) | Automated power analysis |
| Experiment tracker | Google Sheets: $0 | Notion: $10–20/mo | Platform-native |
| Statistical analysis | Manual spreadsheet | Platform auto-analysis | Warehouse-native (Eppo) |
| Total per quarter | $0 | $500–1,600 | $3,000–15,000+ |

Anti-Patterns

Wrong: Peeking at results and stopping early

Checking the dashboard daily and shipping the first time p-value dips below 0.05. With daily peeking over 30 days, actual false positive rate inflates from 5% to approximately 25–30%. [src1]

Correct: Pre-commit to sample size or use sequential testing

Calculate sample size before launch and only analyze at completion. Alternatively, use sequential testing (Statsig, Eppo) that maintains valid confidence intervals at every point. [src3]
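
The inflation from peeking can be demonstrated with a seeded A/A simulation: no real effect exists, yet "stop the first time p < 0.05" triggers far more often than 5%. Parameters below are illustrative, not taken from the sources:

```python
import random
from math import erf, sqrt

def peeking_false_positive_rate(days=30, users_per_day=100, p=0.05,
                                sims=400, seed=7):
    """Simulate A/A tests with daily peeking. Returns the fraction of
    simulations where ANY daily look shows p < 0.05 (a false positive,
    since control and treatment are identical)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        cc = ct = 0                      # cumulative conversions per arm
        for day in range(1, days + 1):
            cc += sum(rng.random() < p for _ in range(users_per_day))
            ct += sum(rng.random() < p for _ in range(users_per_day))
            n = day * users_per_day      # cumulative users per arm
            pool = (cc + ct) / (2 * n)
            se = sqrt(pool * (1 - pool) * 2 / n) or 1e-12
            z = abs(ct / n - cc / n) / se
            if 1 - erf(z / sqrt(2)) < 0.05:   # "ship it!" on first dip
                hits += 1
                break
        # no early stop -> no false positive for this simulation
    return hits / sims
```

With these settings the observed rate lands well above the nominal 5%, consistent with the 25–30% inflation cited above.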

Wrong: Testing too many metrics without correction

Running an experiment with 10 success metrics, then celebrating when any one shows significance. With 10 tests at alpha 0.05, probability of at least one false positive is ~40%.

Correct: One primary metric with multiple testing correction

Choose a single OEC for the go/no-go decision. For secondary metrics, apply Bonferroni correction or false discovery rate control. [src1]
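
The ~40% figure follows from the family-wise error formula for independent tests, and the Bonferroni fix is just a divided threshold. A minimal sketch:

```python
def family_wise_error(alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(alpha: float, m: int) -> float:
    """Bonferroni-corrected per-test significance threshold."""
    return alpha / m
```

With 10 metrics at alpha 0.05, family_wise_error gives ~0.40, and Bonferroni tightens each per-metric test to 0.005.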

Wrong: Running experiments on tiny audiences

A/B testing a feature used by 200 visitors/week — even detecting a 50% relative lift requires ~4,700 per variation, meaning 47 weeks of data.

Correct: Match experiment ambition to traffic

Use the sample size calculator BEFORE committing. If runtime exceeds 8 weeks, target a higher-traffic page, test a bolder change, or use qualitative methods. [src2]

When This Matters

Use when a startup or product team needs a rigorous, repeatable system for running experiments. Essential once a product has sufficient traffic (100+ weekly conversions) and the team makes data-informed product decisions. Prevents the two most expensive experiment failures: shipping losers that look like winners (false positives from peeking) and killing winners that looked inconclusive (underpowered tests).

Related Units