TL;DR
- Feature flags assign users to A/B variants via MurmurHash3 bucketing — consistent and deterministic
- Exposure events fire automatically on every decide() call — no manual tracking needed
- The 15-second dedup window prevents a single page load from creating multiple exposure events
- Track conversions with trackEvent('event_name', { value }) — include revenue for revenue-based tests
- Call a winner at 95% confidence with sufficient sample size — never stop early due to a promising early trend
- Clean up the experiment flag within 30 days of calling a winner
What Is A/B Testing with Feature Flags?
A/B testing (also called split testing or controlled experimentation) is the practice of exposing different groups of users to different versions of a feature and measuring which performs better against a defined metric. Feature flags are the mechanism that controls which version each user sees.
The alternative — hard-coding variants and splitting traffic at the load balancer or CDN level — works but has significant drawbacks: you can't target by user attributes, you can't run multiple experiments simultaneously without complex routing rules, and you can't call a winner and clean up without a deploy. Feature flags solve all three problems.
Kohavi, Tang, and Xu's work on trustworthy online controlled experiments at Microsoft, Google, and LinkedIn established that high-velocity experimentation programs — running hundreds of tests simultaneously — universally use feature flags as their core infrastructure. The pattern scales from 2-person startups to organizations running thousands of concurrent experiments.
How It Works Under the Hood
Variant Assignment
When you call userContext.decide('exp_checkout_cta'), the SDK hashes the combination of user ID + flag key using MurmurHash3, producing a deterministic integer in the 0–99 range. If your experiment is a 50/50 control vs. treatment split, users whose hash lands in 0–49 get control; 50–99 get treatment.
This bucketing is stable: the same user ID + flag key always produces the same hash, so users see the same variant across every session, page load, and device. This is essential for experiment validity — users who see control on visit 1 and treatment on visit 2 contaminate the data.
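To make the bucketing concrete, here is a minimal sketch of deterministic assignment. It uses Node's built-in SHA-256 as a stand-in for MurmurHash3 (which SignaKit uses internally), but the property it demonstrates is the same: the same user ID + flag key always lands in the same bucket.
// bucketing-sketch.ts — illustrative only; SignaKit uses MurmurHash3 internally
import { createHash } from 'node:crypto'

function bucket(userId: string, flagKey: string): number {
  // Hash userId + flagKey, then map the digest to a deterministic integer in 0–99
  const digest = createHash('sha256').update(`${userId}:${flagKey}`).digest()
  return digest.readUInt32BE(0) % 100
}

function assignVariant(userId: string): 'control' | 'treatment' {
  // 50/50 split: buckets 0–49 get control, 50–99 get treatment
  return bucket(userId, 'exp_checkout_cta') < 50 ? 'control' : 'treatment'
}

assignVariant('user_42') // same result on every call, session, and device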
Automatic Exposure Tracking
Every decide() call fires an $exposure event in the background — fire-and-forget, non-blocking. The event includes the user ID, the flag key, and the variation they received. This becomes the denominator for your experiment: “how many users in each variant were actually exposed?”
The backend deduplicates exposure events within a 15-second window per user/flag/page combination. This means rapid re-renders or multiple calls to decide() in the same component tree won't inflate your exposure count. A user navigating to a new page generates a fresh exposure event.
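The dedup rule is easier to see as code. The sketch below only illustrates the behavior described above; it is not SignaKit's backend implementation.
// dedup-sketch.ts — illustrative model of the 15-second exposure dedup window
const DEDUP_WINDOW_MS = 15_000
const lastExposure = new Map<string, number>()

function shouldRecordExposure(userId: string, flagKey: string, page: string, now = Date.now()): boolean {
  const key = `${userId}:${flagKey}:${page}`
  const last = lastExposure.get(key)
  if (last !== undefined && now - last < DEDUP_WINDOW_MS) return false // duplicate within 15s, drop it
  lastExposure.set(key, now) // new page or expired window: record a fresh exposure
  return true
}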
Bot Exclusion
Pass the $userAgent attribute when creating the user context. The SDK automatically detects bot traffic and returns { variationKey: 'off', enabled: false } for all flags, skipping exposure and conversion events entirely. Bots don't skew your experiment results.
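As a minimal sketch (assuming the initialized client from the quick start below), a page can treat the bot decision like any other disabled flag and fall back to control:
const ctx = client.createUserContext(userId, { $userAgent: userAgent })
const decision = ctx.decide('exp_checkout_cta_text')
// Bot traffic gets { variationKey: 'off', enabled: false } and sends no events,
// so it falls through to the control experience automatically
const ctaText = decision?.enabled && decision.variationKey === 'treatment'
  ? 'Complete purchase'
  : 'Continue'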
Step-by-Step: Running Your First A/B Test
Step 1: Define a Hypothesis
Before touching code, write down:
- What you're changing (the independent variable)
- Who is in the experiment (all users, logged-in users, specific segments)
- What metric you're optimizing (primary metric + guardrail metrics)
- Minimum detectable effect — the smallest improvement worth shipping
- Required sample size — calculate upfront to avoid underpowered tests (see the sketch after this list)
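For the last two items, a back-of-the-envelope calculation is usually enough. The sketch below assumes a binary conversion metric, the conventional 95% confidence and 80% power, and a relative minimum detectable effect; it is a planning aid, not part of the SignaKit SDK.
// sample-size-sketch.ts — per-variant sample size for a two-proportion test
function sampleSizePerVariant(baselineRate: number, relativeMde: number): number {
  const p1 = baselineRate
  const p2 = baselineRate * (1 + relativeMde) // e.g. 0.10 = a +10% relative lift
  const pBar = (p1 + p2) / 2
  const zAlpha = 1.96 // two-sided 95% confidence
  const zBeta = 0.84  // 80% power
  const numerator =
    (zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
      zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
  return Math.ceil(numerator / (p2 - p1) ** 2)
}

// Example: 4% baseline checkout rate, smallest lift worth shipping is +10% relative
sampleSizePerVariant(0.04, 0.10) // ≈ 39,000 users per variant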
Step 2: Create the Flag in SignaKit
In the SignaKit dashboard, create an experiment flag whose name follows the exp_ convention — for example, exp_checkout_cta_text. Set up two variations: control (existing CTA) and treatment (new CTA). Set the traffic split to 50/50 and apply any targeting rules.
Step 3: Integrate the SDK
// lib/signakit/getFlags.ts
import { createInstance } from '@signakit/flags-node'
const client = createInstance({ sdkKey: process.env.SIGNAKIT_SDK_KEY! })
await client.onReady()
export async function getCheckoutFlags(userId: string, userAgent: string) {
  const userContext = client.createUserContext(userId, {
    $userAgent: userAgent, // enables bot detection
  })
  return userContext
}

// app/checkout/page.tsx
import { headers } from 'next/headers'
import { getCheckoutFlags } from '@/lib/signakit/getFlags'

export default async function CheckoutPage() {
  // `session` comes from your auth layer; resolve the current user however your app does
  const ctx = await getCheckoutFlags(session.userId, headers().get('user-agent') ?? '')
  const { variationKey } = ctx.decide('exp_checkout_cta_text') ?? { variationKey: 'control' }
  const ctaText = variationKey === 'treatment' ? 'Complete purchase' : 'Continue'
  // CheckoutForm is a client component; it fires the conversion event on completion (Step 4)
  return <CheckoutForm ctaText={ctaText} />
}

Step 4: Track the Conversion Event
// In your checkout completion handler
async function handleComplete(orderTotal: number) {
  // ctx: the SignaKit user context for the current user; router: from next/navigation
  await ctx.trackEvent('checkout_complete', { value: orderTotal })
  router.push('/confirmation')
}

SignaKit automatically attributes the conversion to the correct experiment variant because the active flag decisions are included with every event. You don't need to pass the flag key or variant manually.
Step 5: Monitor and Wait
Resist checking your results every hour. Frequent peeking increases the chance of a false positive — a result that looks significant but will regress as more data arrives. Check results on a predefined schedule: once you've collected at least the sample size you calculated in Step 1, and not before.
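If you want intuition for why peeking is a problem, the toy A/A simulation below gives both variants the same true conversion rate and stops the first time a naive z-test crosses 1.96. It is illustrative only; SignaKit's dashboard handles the statistics for you.
// peeking-sketch.ts — an A/A simulation: any "significant" result is a false positive
function zStat(convA: number, convB: number, n: number): number {
  const pA = convA / n
  const pB = convB / n
  const pPool = (convA + convB) / (2 * n)
  const se = Math.sqrt(pPool * (1 - pPool) * (2 / n))
  return se === 0 ? 0 : Math.abs(pA - pB) / se
}

function runAaTest(peekEvery: number, maxN: number, rate = 0.05): boolean {
  let convA = 0
  let convB = 0
  for (let n = 1; n <= maxN; n++) {
    if (Math.random() < rate) convA++
    if (Math.random() < rate) convB++
    // Peek periodically and stop as soon as the test "looks significant"
    if (n % peekEvery === 0 && zStat(convA, convB, n) > 1.96) return true
  }
  return false
}

const runs = 2000
let withPeeking = 0
let singleLook = 0
for (let i = 0; i < runs; i++) {
  if (runAaTest(200, 10_000)) withPeeking++   // check every 200 users
  if (runAaTest(10_000, 10_000)) singleLook++ // check once, at the end
}
console.log(withPeeking / runs)  // typically well above the nominal 5%
console.log(singleLook / runs)   // close to 0.05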
Step 6: Call a Winner and Clean Up
When you reach statistical significance and your pre-defined sample size, call the winner. Ship the winning variant to 100% of users by removing the flag branch, then delete the exp_ flag from the dashboard. Flag cleanup is not optional — stale experiment flags accumulate into technical debt that slows down future development.
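Assuming treatment won, cleanup is usually a small diff: delete the decide() branch, keep only the winning behavior, and then delete the flag in the dashboard.
// Before cleanup (the experiment branch):
// const { variationKey } = ctx.decide('exp_checkout_cta_text') ?? { variationKey: 'control' }
// const ctaText = variationKey === 'treatment' ? 'Complete purchase' : 'Continue'

// After calling the winner: hard-code the winning copy, then delete the exp_ flag in the dashboard
const ctaText = 'Complete purchase'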

Tracking Conversions
The trackEvent() method sends a conversion event that SignaKit attributes to the active experiment variants for that user. The three forms:
// Simple conversion — binary (did it happen?)
await userContext.trackEvent('signup')
// Valued conversion — revenue, score, or any numeric metric
await userContext.trackEvent('purchase', { value: 129.99 })
// Conversion with metadata — for debugging and segmentation
await userContext.trackEvent('form_submit', {
  metadata: { formId: 'checkout', step: 'payment' },
})

A few rules for clean conversion data:
- Track the metric your hypothesis mentions, not a proxy. If you're optimizing revenue, track purchase with value, not just add_to_cart.
- Track only once per eligible conversion. If a user can purchase multiple times, decide upfront whether you're measuring first-purchase rate or total revenue.
- Define guardrail metrics. Track at least one metric you want to protect — page load time, error rate, support tickets — to catch regressions that the primary metric might miss.

Statistical Significance and When to Call a Winner
SignaKit uses a two-proportion z-test to calculate statistical significance for binary conversion metrics, and Welch's t-test for continuous metrics (like revenue). The default confidence threshold is 95% — meaning the observed difference would occur less than 5% of the time by chance alone if there were no real difference between the variants.
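For intuition, here is a self-contained sketch of the two-proportion z-test applied to exposure and conversion counts. SignaKit computes this for you in the dashboard; the normal-CDF polynomial below is only for illustration.
// significance-sketch.ts — two-proportion z-test on exposure and conversion counts
function twoProportionZTest(convControl: number, nControl: number, convTreatment: number, nTreatment: number) {
  const pC = convControl / nControl
  const pT = convTreatment / nTreatment
  const pPool = (convControl + convTreatment) / (nControl + nTreatment)
  const se = Math.sqrt(pPool * (1 - pPool) * (1 / nControl + 1 / nTreatment))
  const z = (pT - pC) / se
  const pValue = 2 * (1 - normalCdf(Math.abs(z))) // two-sided
  return { lift: (pT - pC) / pC, z, pValue, significant: pValue < 0.05 }
}

// Standard normal CDF via a polynomial approximation (good enough for a sanity check)
function normalCdf(x: number): number {
  const t = 1 / (1 + 0.2316419 * x)
  const d = 0.3989423 * Math.exp((-x * x) / 2)
  return 1 - d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))))
}

// Example: 4.0% vs 4.5% conversion with 20,000 exposures per variant
twoProportionZTest(800, 20_000, 900, 20_000) // z ≈ 2.48, p ≈ 0.013, significant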
Three rules for calling a winner cleanly:
- Reach your pre-calculated sample size first. Running a test until it looks good is p-hacking. A test that reaches 95% significance with 200 users but needed 2,000 is likely a false positive.
- Check both primary and guardrail metrics. A 10% lift in conversion with a 20% increase in page load time is not a win.
- Ship the control if there's no significant difference. A null result is still a result — it tells you the change didn't matter enough to be worth the ongoing maintenance cost.
Common A/B Testing Mistakes
1. Not defining the primary metric before starting
If you pick the metric after seeing results, you're choosing the metric that makes your preferred variant look good. Define it before you look at any data.
2. Running the test too short
A 48-hour test that captures a weekend but not a weekday mix has a biased sample. Run experiments for at least one full business cycle — typically 1–2 weeks minimum — regardless of how quickly significance is reached.
3. Changing the flag configuration mid-experiment
Adjusting traffic splits, adding targeting rules, or changing variation assignments after the experiment starts invalidates the data collected so far. If you need to change the setup, restart the experiment with a new flag key.
4. Forgetting to exclude bots
Bot traffic inflates exposure counts and dilutes conversion rates. Always pass $userAgent to createUserContext() to enable automatic bot exclusion.
5. Not cleaning up the flag
After calling a winner, teams often leave the flag in place “just in case.” Stale experiment flags accumulate into dead code paths, confuse new engineers, and create subtle bugs when targeting rules interact unexpectedly. Schedule cleanup as part of calling the winner.
Frequently Asked Questions
Can I run multiple A/B tests at the same time?
Yes. Each experiment flag is evaluated independently. As long as your experiments don't interact in ways that would contaminate results (e.g., testing the same UI element in two experiments simultaneously), running multiple experiments in parallel is safe and common.
What's the difference between an A/B test and a multi-armed bandit?
An A/B test runs a fixed traffic split until statistical significance is reached, then you manually call a winner. A multi-armed bandit (MAB) automatically shifts traffic toward better-performing variants during the experiment, maximizing conversions during the test itself rather than just after. SignaKit's multi-armed bandit (Autotune) is available on all plans including Free.
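For intuition about the difference, here is a toy epsilon-greedy allocator. It illustrates the general bandit idea of shifting traffic toward the better performer; it is not how Autotune works internally.
// bandit-sketch.ts — epsilon-greedy allocation, for illustration only
type Arm = { name: string; exposures: number; conversions: number }

// Explore a random arm 10% of the time, otherwise exploit the current best performer
function chooseArm(arms: Arm[], epsilon = 0.1): Arm {
  if (Math.random() < epsilon) return arms[Math.floor(Math.random() * arms.length)]
  return arms.reduce((best, arm) =>
    arm.conversions / Math.max(arm.exposures, 1) >
    best.conversions / Math.max(best.exposures, 1) ? arm : best)
}

const arms: Arm[] = [
  { name: 'control', exposures: 0, conversions: 0 },
  { name: 'treatment', exposures: 0, conversions: 0 },
]
chooseArm(arms) // serve this variant, then record its exposure and any conversion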
How do I test server-rendered pages?
Use createUserContext() server-side (in a Server Component or route handler), evaluate the flag, and pass the result as a prop. Because evaluation is synchronous after onReady() completes, there's no waterfall — the flag decision is available before the first render. See the Next.js quick start in our feature flags guide.
What if a user is exposed to a variant but never converts?
That's the expected case for most users. The experiment counts the number of exposures per variant as the denominator and conversions as the numerator. Users who were exposed but didn't convert are part of the valid sample — they contribute to calculating the baseline conversion rate for each variant.