A/B Testing Infrastructure

Type: Execution Recipe · Confidence: 0.88 · Sources: 7 · Verified: 2026-03-12

Purpose

This recipe sets up a working A/B testing system on a landing page — from sample size calculation through test creation, variant deployment, and statistical analysis. The output is a running experiment that splits traffic between control and variant(s), tracks conversions per variant, and produces statistically valid results. [src1]

Prerequisites

Constraints

Tool Selection Decision

Which path?
├── Developer AND budget = free
│   └── PATH A: PostHog Experiments — free, feature flags + analytics
├── Non-technical AND budget > $0
│   └── PATH B: VWO — $49/mo, visual drag-and-drop editor
├── Developer AND wants zero dependencies
│   └── PATH C: Custom JavaScript — $0, lightweight split with GA4
└── High traffic AND enterprise needs
    └── PATH D: Statsig/Optimizely — $0-50K/yr, advanced stats
| Path | Tools | Cost | Setup Time | Best For |
|---|---|---|---|---|
| A: PostHog | PostHog Cloud | $0 | 30 min | Developers, product teams |
| B: VWO | VWO Starter | $49/mo | 20 min | Non-technical, visual tests |
| C: Custom JS | Vanilla JS + GA4 | $0 | 45 min | Minimalists, privacy-focused |
| D: Enterprise | Statsig/Optimizely | $0-50K+/yr | 60 min | High traffic, advanced stats |

Execution Flow

Step 1: Calculate Required Sample Size

Duration: 5 minutes · Tool: Online calculator

Use Evan Miller's calculator or Optimizely's calculator. At a 5% baseline conversion rate (CVR) and a 20% relative minimum detectable effect (MDE): ~3,800 visitors per variant. Test duration (months) = (visitors per variant × number of variants) / monthly traffic. [src4] [src5]

Quick Reference (95% significance, 80% power):
───────────────────────────────────────
Baseline 2%, MDE 50%:  ~3,600 per variant
Baseline 5%, MDE 20%:  ~3,800 per variant
Baseline 5%, MDE 50%:  ~700 per variant
Baseline 10%, MDE 20%: ~1,900 per variant
Baseline 10%, MDE 50%: ~350 per variant

Verify: Projected test duration is under 8 weeks. · If failed: if the projection exceeds 3 months, use qualitative testing instead.
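The sample-size arithmetic above can be sketched in code. This is a minimal implementation of the standard two-proportion formula (normal approximation, two-sided α = 0.05, 80% power); dedicated calculators make slightly different assumptions, so treat the quick-reference table as authoritative and this as a sanity check.

```javascript
// Two-proportion sample-size estimate (normal approximation).
// zAlpha = 1.96 (two-sided 95%), zBeta = 0.84 (80% power).
// A sketch only — calculators like Evan Miller's may differ.
function sampleSizePerVariant(baselineCvr, relativeMde, zAlpha = 1.96, zBeta = 0.84) {
  const p1 = baselineCvr;
  const p2 = baselineCvr * (1 + relativeMde);
  const pBar = (p1 + p2) / 2;        // pooled conversion rate
  const delta = Math.abs(p2 - p1);   // absolute lift to detect
  const n = 2 * (zAlpha + zBeta) ** 2 * pBar * (1 - pBar) / (delta * delta);
  return Math.ceil(n);
}

// Duration estimate in months, per the formula in Step 1.
function testDurationMonths(nPerVariant, variantCount, monthlyTraffic) {
  return (nPerVariant * variantCount) / monthlyTraffic;
}
```

Note that a larger MDE shrinks the required sample roughly quadratically, which is why low-traffic pages should test bigger changes.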

Step 2: Create the Experiment

Duration: 10-20 minutes · Tool: PostHog, VWO, or code editor

Create the experiment with control and variant(s), set goal metric, configure traffic allocation (50/50). For PostHog, use feature flags with experiment code. For VWO, use the visual editor. For custom JS, use localStorage-based variant assignment with GA4 event tracking. [src1] [src7]

Verify: Open page in two incognito windows — one shows control, one shows variant. · If failed: Check feature flag status, clear localStorage between tests.
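For the custom-JS path (Path C), the localStorage-based assignment mentioned above can be sketched as follows. The storage handle is injected so the logic is testable outside a browser; the test name, variant names, and the `gtag` call in the usage comment are placeholders.

```javascript
// Assign a visitor to a variant once and persist the choice, so the
// same visitor always sees the same variant on return visits.
// `storage` is window.localStorage in the browser.
function getVariant(storage, testName, variants = ['control', 'variant_b']) {
  const key = 'ab_' + testName;
  let variant = storage.getItem(key);
  if (!variant || !variants.includes(variant)) {
    variant = variants[Math.floor(Math.random() * variants.length)];
    storage.setItem(key, variant);
  }
  return variant;
}

// Browser usage (placeholder test and event names):
// const v = getVariant(window.localStorage, 'hero_headline');
// document.body.dataset.abVariant = v;
// gtag('event', 'experiment_impression', { ab_test: 'hero_headline', ab_variant: v });
```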

Step 3: Implement Anti-Flicker Protection

Duration: 5 minutes · Tool: Code editor

Add CSS to hide the page until variant is applied, with a 2-second timeout fallback. For PostHog, use server-side bootstrapping. VWO has built-in anti-flicker.

Verify: On slow 3G throttling, no original content flashes. · If failed: Move anti-flicker to first element in <head>.
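A minimal sketch of the anti-flicker pattern, assuming a hiding class of `ab-hide` and the 2-second failsafe from Step 3 (both names and values are placeholders). Run this as the first script in `<head>`, then call the returned function the moment the variant is applied.

```javascript
// Hide the page until the variant is applied; reveal unconditionally
// after `timeoutMs` so a slow or blocked testing script never leaves
// the page blank.
function installAntiFlicker(doc, timeoutMs = 2000) {
  const style = doc.createElement('style');
  style.id = 'ab-anti-flicker';
  style.textContent = '.ab-hide { opacity: 0 !important; }';
  doc.head.appendChild(style);
  doc.documentElement.classList.add('ab-hide');

  const reveal = () => {
    doc.documentElement.classList.remove('ab-hide');
    const s = doc.getElementById('ab-anti-flicker');
    if (s) s.remove();
  };
  const timer = setTimeout(reveal, timeoutMs);      // failsafe
  return () => { clearTimeout(timer); reveal(); };  // call once variant is live
}
```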

Step 4: Set Up Conversion Goal Tracking

Duration: 5-10 minutes · Tool: PostHog, VWO, or GA4

Ensure conversion events include variant identifier. PostHog tracks automatically. VWO uses goal settings. Custom JS sends variant ID with GA4 events.

Verify: Test conversion in each variant, verify attribution in dashboard. · If failed: Check experiment impression fires before conversion event.
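For the custom path, attaching the variant identifier to conversion events can be sketched like this (the test name, event name, and the `send` callback wrapping `gtag` are assumptions):

```javascript
// Attach the stored variant to a conversion event so the dashboard can
// attribute it. Returns the payload, or null if the visitor was never
// assigned (i.e. never entered the experiment).
function trackConversion(storage, testName, eventName, send) {
  const variant = storage.getItem('ab_' + testName);
  if (!variant) return null;
  const payload = { ab_test: testName, ab_variant: variant };
  send(eventName, payload); // e.g. (name, p) => gtag('event', name, p)
  return payload;
}
```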

Step 5: Monitor and Wait for Significance

Duration: 2-8 weeks · Tool: PostHog, VWO, or calculator

Check results weekly but do not stop early. Verify traffic split, check for errors, confirm sample accumulation. Use VWO's significance calculator for custom JS tests. [src3]

Verify: At the calculated sample size, one variant shows > 95% significance. · If failed: if neither variant reaches significance at the full sample size, the true difference is smaller than your MDE; keep the control and declare no winner.
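For custom-JS tests, significance can be sanity-checked with a pooled two-proportion z-test (a sketch; |z| > 1.96 corresponds to ~95% two-sided confidence — still defer to a full calculator for the final call):

```javascript
// Pooled two-proportion z-test. convX = conversions, nX = visitors.
function zScore(convA, nA, convB, nB) {
  const pA = convA / nA;
  const pB = convB / nB;
  const pPool = (convA + convB) / (nA + nB);
  const se = Math.sqrt(pPool * (1 - pPool) * (1 / nA + 1 / nB));
  return (pB - pA) / se;
}

function isSignificant95(convA, nA, convB, nB) {
  return Math.abs(zScore(convA, nA, convB, nB)) > 1.96;
}
```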

Step 6: Implement Winner and Clean Up

Duration: 10 minutes · Tool: Code editor + testing tool

Make winning variant the permanent default. Remove testing code, anti-flicker snippet, and archive the experiment. Document results for future reference.

Verify: Winning variant is default. All test code removed. CVR stable for 1 week. · If failed: Verify permanent implementation matches test variant exactly.

Output Schema

{
  "output_type": "ab_test_configuration",
  "format": "running experiment + documentation",
  "columns": [
    {"name": "test_name", "type": "string", "description": "Unique test identifier", "required": true},
    {"name": "tool", "type": "string", "description": "Testing platform used", "required": true},
    {"name": "variants", "type": "array", "description": "Variant descriptions", "required": true},
    {"name": "goal_metric", "type": "string", "description": "Primary conversion event", "required": true},
    {"name": "sample_size_required", "type": "number", "description": "Minimum visitors per variant", "required": true},
    {"name": "estimated_duration", "type": "string", "description": "Weeks to significance", "required": true},
    {"name": "status", "type": "string", "description": "running/completed/stopped", "required": true}
  ],
  "expected_row_count": "1",
  "sort_order": "N/A",
  "deduplication_key": "test_name"
}

Quality Benchmarks

| Quality Metric | Minimum Acceptable | Good | Excellent |
|---|---|---|---|
| Traffic split accuracy | Within ±10% | Within ±5% | Within ±2% |
| Sample size reached | > 80% of minimum | 100% of minimum | 120% of minimum |
| Test duration | < 12 weeks | < 6 weeks | < 3 weeks |
| Flicker-free experience | < 2s delay | < 500ms | Zero flicker (server-side) |
| Result documentation | Winner declared | Full stats documented | Learnings shared |

If below minimum: Check feature flag configuration for split accuracy. Increase MDE or traffic if sample size unreachable.
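The first benchmark, traffic split accuracy, can be checked directly from visitor counts. A sketch for a 50/50 test:

```javascript
// Deviation of an observed split from an even 50/50, as a fraction.
// e.g. 0.02 means the split is within ±2% (the "Excellent" benchmark).
function splitDeviation(nControl, nVariant) {
  const total = nControl + nVariant;
  return Math.abs(nControl / total - 0.5);
}
```

A deviation well outside the minimum ±10% band usually points at the feature flag configuration or CDN caching issues listed under Error Handling.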

Error Handling

| Error | Likely Cause | Recovery Action |
|---|---|---|
| Both variants show same content | Feature flag not evaluating | Check dashboard for flag status; verify JS loads |
| Conversion rates at 0% | Events not firing or not attributed | Verify events fire with variant ID; check goal metric name |
| Page flicker visible | Anti-flicker snippet missing | Move anti-flicker CSS to first element in <head> |
| Traffic not splitting | CDN caching same variant | Check CDN does not cache variant-specific content |
| No data in PostHog | Script not loading or blocked | Check console for errors; test in incognito |
| Significance fluctuating | Normal before reaching sample size | Do not stop early; wait for full sample |

Cost Breakdown

| Component | Free Tier | Paid Tier | At Scale |
|---|---|---|---|
| PostHog | $0 (1M events/mo) | Pay-per-use | $0.00005/event >1M |
| VWO | $0 (30-day trial) | $49/mo Starter | $972/mo Enterprise |
| Statsig | $0 (50M events/mo) | $150/mo Pro | Custom |
| Custom JS + GA4 | $0 | $0 | $0 |
| Total | $0 | $0-49/mo | $49-972/mo |

Anti-Patterns

Wrong: Stopping the test when one variant "looks like it's winning"

Early stopping inflates false positive rates from 5% to over 25%. A variant that looks better at day 3 may be identical at day 23. [src3]

Correct: Pre-commit to the calculated sample size

Calculate required sample size before starting and commit to running until that number is reached.

Wrong: Testing tiny changes on low-traffic pages

Testing button color on 500 monthly visitors requires 15+ months for significance. [src5]

Correct: Match test ambition to traffic volume

Low-traffic: test big changes (headlines, layouts). High-traffic: can test subtle changes (copy, placement).

When This Matters

Use this recipe when the agent needs to set up a quantitative A/B test on a landing page. Requires existing analytics and conversion events. Handles experiment design, tool setup, and statistical analysis.

Related Units