A/B Testing Infrastructure

Type: Execution Recipe · Confidence: 0.88 · Sources: 7 · Verified: 2026-03-12

Purpose

This recipe sets up a working A/B testing system on a landing page — from sample size calculation through test creation, variant deployment, and statistical analysis. The output is a running experiment that splits traffic between control and variant(s), tracks conversions per variant, and produces statistically valid results. [src1]

Prerequisites

Constraints

Tool Selection Decision

Which path?
├── Developer AND budget = free
│   └── PATH A: PostHog Experiments — free, feature flags + analytics
├── Non-technical AND budget > $0
│   └── PATH B: VWO — $49/mo, visual drag-and-drop editor
├── Developer AND wants zero dependencies
│   └── PATH C: Custom JavaScript — $0, lightweight split with GA4
└── High traffic AND enterprise needs
    └── PATH D: Statsig/Optimizely — $0-50K/yr, advanced stats
| Path | Tools | Cost | Setup Time | Best For |
|---|---|---|---|---|
| A: PostHog | PostHog Cloud | $0 | 30 min | Developers, product teams |
| B: VWO | VWO Starter | $49/mo | 20 min | Non-technical, visual tests |
| C: Custom JS | Vanilla JS + GA4 | $0 | 45 min | Minimalists, privacy-focused |
| D: Enterprise | Statsig/Optimizely | $0-50K+/yr | 60 min | High traffic, advanced stats |

Execution Flow

Step 1: Calculate Required Sample Size

Duration: 5 minutes · Tool: Online calculator

Use Evan Miller's calculator or Optimizely's calculator. At a 5% baseline conversion rate (CVR) and a 20% relative minimum detectable effect (MDE): ~3,800 visitors per variant. Test duration (months) = (visitors per variant × number of variants) / monthly traffic. [src4] [src5]

Quick Reference (95% significance, 80% power):
───────────────────────────────────────
Baseline 2%, MDE 50%:  ~3,600 per variant
Baseline 5%, MDE 20%:  ~3,800 per variant
Baseline 5%, MDE 50%:  ~700 per variant
Baseline 10%, MDE 20%: ~1,900 per variant
Baseline 10%, MDE 50%: ~350 per variant

Verify: Projected test duration is under 8 weeks. · If failed: if the projection exceeds 3 months, use qualitative testing instead.
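The sample-size arithmetic above can be sketched in code. This is a minimal implementation of the standard two-proportion formula (normal approximation, two-sided α = 0.05, 80% power); dedicated calculators make slightly different assumptions, so treat the quick-reference table as authoritative and this as a sanity check.

```javascript
// Two-proportion sample-size estimate (normal approximation).
// zAlpha = 1.96 (two-sided 95%), zBeta = 0.84 (80% power).
// A sketch only — calculators like Evan Miller's may differ.
function sampleSizePerVariant(baselineCvr, relativeMde, zAlpha = 1.96, zBeta = 0.84) {
  const p1 = baselineCvr;
  const p2 = baselineCvr * (1 + relativeMde);
  const pBar = (p1 + p2) / 2;        // pooled conversion rate
  const delta = Math.abs(p2 - p1);   // absolute lift to detect
  const n = 2 * (zAlpha + zBeta) ** 2 * pBar * (1 - pBar) / (delta * delta);
  return Math.ceil(n);
}

// Duration estimate in months, per the formula in Step 1.
function testDurationMonths(nPerVariant, variantCount, monthlyTraffic) {
  return (nPerVariant * variantCount) / monthlyTraffic;
}
```

Note that a larger MDE shrinks the required sample roughly quadratically, which is why low-traffic pages should test bigger changes.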

Step 2: Create the Experiment

Duration: 10-20 minutes · Tool: PostHog, VWO, or code editor

Create the experiment with control and variant(s), set goal metric, configure traffic allocation (50/50). For PostHog, use feature flags with experiment code. For VWO, use the visual editor. For custom JS, use localStorage-based variant assignment with GA4 event tracking. [src1] [src7]

Verify: Open page in two incognito windows — one shows control, one shows variant. · If failed: Check feature flag status, clear localStorage between tests.
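For the custom-JS path (Path C), the localStorage-based assignment mentioned above can be sketched as follows. The storage handle is injected so the logic is testable outside a browser; the test name, variant names, and the `gtag` call in the usage comment are placeholders.

```javascript
// Assign a visitor to a variant once and persist the choice, so the
// same visitor always sees the same variant on return visits.
// `storage` is window.localStorage in the browser.
function getVariant(storage, testName, variants = ['control', 'variant_b']) {
  const key = 'ab_' + testName;
  let variant = storage.getItem(key);
  if (!variant || !variants.includes(variant)) {
    variant = variants[Math.floor(Math.random() * variants.length)];
    storage.setItem(key, variant);
  }
  return variant;
}

// Browser usage (placeholder test and event names):
// const v = getVariant(window.localStorage, 'hero_headline');
// document.body.dataset.abVariant = v;
// gtag('event', 'experiment_impression', { ab_test: 'hero_headline', ab_variant: v });
```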

Step 3: Implement Anti-Flicker Protection

Duration: 5 minutes · Tool: Code editor

Add CSS to hide the page until variant is applied, with a 2-second timeout fallback. For PostHog, use server-side bootstrapping. VWO has built-in anti-flicker.

Verify: On slow 3G throttling, no original content flashes. · If failed: Move anti-flicker to first element in <head>.
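A minimal sketch of the anti-flicker pattern, assuming a hiding class of `ab-hide` and the 2-second failsafe from Step 3 (both names and values are placeholders). Run this as the first script in `<head>`, then call the returned function the moment the variant is applied.

```javascript
// Hide the page until the variant is applied; reveal unconditionally
// after `timeoutMs` so a slow or blocked testing script never leaves
// the page blank.
function installAntiFlicker(doc, timeoutMs = 2000) {
  const style = doc.createElement('style');
  style.id = 'ab-anti-flicker';
  style.textContent = '.ab-hide { opacity: 0 !important; }';
  doc.head.appendChild(style);
  doc.documentElement.classList.add('ab-hide');

  const reveal = () => {
    doc.documentElement.classList.remove('ab-hide');
    const s = doc.getElementById('ab-anti-flicker');
    if (s) s.remove();
  };
  const timer = setTimeout(reveal, timeoutMs);      // failsafe
  return () => { clearTimeout(timer); reveal(); };  // call once variant is live
}
```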

Step 4: Set Up Conversion Goal Tracking

Duration: 5-10 minutes · Tool: PostHog, VWO, or GA4

Ensure conversion events include variant identifier. PostHog tracks automatically. VWO uses goal settings. Custom JS sends variant ID with GA4 events.

Verify: Test conversion in each variant, verify attribution in dashboard. · If failed: Check experiment impression fires before conversion event.
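For the custom path, attaching the variant identifier to conversion events can be sketched like this (the test name, event name, and the `send` callback wrapping `gtag` are assumptions):

```javascript
// Attach the stored variant to a conversion event so the dashboard can
// attribute it. Returns the payload, or null if the visitor was never
// assigned (i.e. never entered the experiment).
function trackConversion(storage, testName, eventName, send) {
  const variant = storage.getItem('ab_' + testName);
  if (!variant) return null;
  const payload = { ab_test: testName, ab_variant: variant };
  send(eventName, payload); // e.g. (name, p) => gtag('event', name, p)
  return payload;
}
```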

Step 5: Monitor and Wait for Significance

Duration: 2-8 weeks · Tool: PostHog, VWO, or calculator

Check results weekly but do not stop early. Verify traffic split, check for errors, confirm sample accumulation. Use VWO's significance calculator for custom JS tests. [src3]

Verify: At the calculated sample size, one variant shows > 95% significance. · If failed: if neither variant reaches significance at the full sample size, the true difference is smaller than your MDE; keep the control and declare no winner.
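For custom-JS tests, significance can be sanity-checked with a pooled two-proportion z-test (a sketch; |z| > 1.96 corresponds to ~95% two-sided confidence — still defer to a full calculator for the final call):

```javascript
// Pooled two-proportion z-test. convX = conversions, nX = visitors.
function zScore(convA, nA, convB, nB) {
  const pA = convA / nA;
  const pB = convB / nB;
  const pPool = (convA + convB) / (nA + nB);
  const se = Math.sqrt(pPool * (1 - pPool) * (1 / nA + 1 / nB));
  return (pB - pA) / se;
}

function isSignificant95(convA, nA, convB, nB) {
  return Math.abs(zScore(convA, nA, convB, nB)) > 1.96;
}
```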

Step 6: Implement Winner and Clean Up

Duration: 10 minutes · Tool: Code editor + testing tool

Make winning variant the permanent default. Remove testing code, anti-flicker snippet, and archive the experiment. Document results for future reference.

Verify: Winning variant is default. All test code removed. CVR stable for 1 week. · If failed: Verify permanent implementation matches test variant exactly.

Output Schema

{
  "output_type": "ab_test_configuration",
  "format": "running experiment + documentation",
  "columns": [
    {"name": "test_name", "type": "string", "description": "Unique test identifier", "required": true},
    {"name": "tool", "type": "string", "description": "Testing platform used", "required": true},
    {"name": "variants", "type": "array", "description": "Variant descriptions", "required": true},
    {"name": "goal_metric", "type": "string", "description": "Primary conversion event", "required": true},
    {"name": "sample_size_required", "type": "number", "description": "Minimum visitors per variant", "required": true},
    {"name": "estimated_duration", "type": "string", "description": "Weeks to significance", "required": true},
    {"name": "status", "type": "string", "description": "running/completed/stopped", "required": true}
  ],
  "expected_row_count": "1",
  "sort_order": "N/A",
  "deduplication_key": "test_name"
}

Quality Benchmarks

| Quality Metric | Minimum Acceptable | Good | Excellent |
|---|---|---|---|
| Traffic split accuracy | Within ±10% | Within ±5% | Within ±2% |
| Sample size reached | > 80% of minimum | 100% of minimum | 120% of minimum |
| Test duration | < 12 weeks | < 6 weeks | < 3 weeks |
| Flicker-free experience | < 2s delay | < 500ms | Zero flicker (server-side) |
| Result documentation | Winner declared | Full stats documented | Learnings shared |

If below minimum: Check feature flag configuration for split accuracy. Increase MDE or traffic if sample size unreachable.
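The first benchmark, traffic split accuracy, can be checked directly from visitor counts. A sketch for a 50/50 test:

```javascript
// Deviation of an observed split from an even 50/50, as a fraction.
// e.g. 0.02 means the split is within ±2% (the "Excellent" benchmark).
function splitDeviation(nControl, nVariant) {
  const total = nControl + nVariant;
  return Math.abs(nControl / total - 0.5);
}
```

A deviation well outside the minimum ±10% band usually points at the feature flag configuration or CDN caching issues listed under Error Handling.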

Error Handling

| Error | Likely Cause | Recovery Action |
|---|---|---|
| Both variants show same content | Feature flag not evaluating | Check dashboard for flag status; verify JS loads |
| Conversion rates at 0% | Events not firing or not attributed | Verify events fire with variant ID; check goal metric name |
| Page flicker visible | Anti-flicker snippet missing | Move anti-flicker CSS to first element in <head> |
| Traffic not splitting | CDN caching same variant | Check CDN does not cache variant-specific content |
| No data in PostHog | Script not loading or blocked | Check console for errors; test in incognito |
| Significance fluctuating | Normal before reaching sample size | Do not stop early; wait for full sample |

Cost Breakdown

| Component | Free Tier | Paid Tier | At Scale |
|---|---|---|---|
| PostHog | $0 (1M events/mo) | Pay-per-use | $0.00005/event >1M |
| VWO | $0 (30-day trial) | $49/mo Starter | $972/mo Enterprise |
| Statsig | $0 (50M events/mo) | $150/mo Pro | Custom |
| Custom JS + GA4 | $0 | $0 | $0 |
| Total | $0 | $0-49/mo | $49-972/mo |

Anti-Patterns

Wrong: Stopping the test when one variant "looks like it's winning"

Early stopping inflates false positive rates from 5% to over 25%. A variant that looks better at day 3 may be identical at day 23. [src3]

Correct: Pre-commit to the calculated sample size

Calculate required sample size before starting and commit to running until that number is reached.

Wrong: Testing tiny changes on low-traffic pages

Testing button color on 500 monthly visitors requires 15+ months for significance. [src5]

Correct: Match test ambition to traffic volume

Low-traffic: test big changes (headlines, layouts). High-traffic: can test subtle changes (copy, placement).

When This Matters

Use this recipe when the agent needs to set up a quantitative A/B test on a landing page. Requires existing analytics and conversion events. Handles experiment design, tool setup, and statistical analysis.

Related Units