Peter Chng

Don’t dilute your A/B tests

A common practice when you want to introduce some sort of change to your website or app is to A/B test that change: Expose some percentage of users to the new experience and measure their engagement relative to the users who didn’t get the new experience. While this process sounds straightforward, there are many potential pitfalls.

One of them is treatment dilution, which can reduce the power of your experiment and make it less informative.

Basic Online Experiment Setup

An online A/B test is typically instrumented at the request level. That is, you want to vary the response to a user’s request somehow based on whether they are in the control or treatment group, and then measure how their behaviour or interaction with your website changes. A group is also known as a condition or bucket here.

You’ll first need a way to randomly assign each user to one of the two conditions: Either control or treatment. (In general, there may be more than two conditions, but we’ll focus on the simple case here.) This is usually done by applying a hash function to a unique identifier associated with the user. The hash function generates values that are (or should be) uniformly distributed over some range, making it straightforward to randomly assign some percentage of users to each condition or bucket.

For example, if we had a hash function that produced a uniform distribution over all 32-bit signed integer values, we could achieve a random 50/50 split between control and treatment by assigning all users whose hash value was in [Integer.MIN_VALUE, -1] to control, and all users whose hash value was in [0, Integer.MAX_VALUE] to treatment. Different percentage splits could be achieved by specifying different ranges. Note that although the assignment of a user to a condition is random, once it is done, it does not change for the duration of the experiment.
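
As a concrete illustration, here is a minimal sketch of hash-based assignment in Python. The hash choice, the `assign_condition` helper, and the split fraction are my own illustrative assumptions; real A/B testing tools implement this for you.

import hashlib

def assign_condition(user_id, experiment_id, treatment_fraction=0.5):
  # Hash the (experiment_id, user_id) pair so a user's bucket is stable within an
  # experiment but independent across experiments.
  digest = hashlib.md5(f"{experiment_id}:{user_id}".encode()).hexdigest()
  # Map the first 8 hex digits to a value uniformly distributed in [0, 1).
  bucket_value = int(digest[:8], 16) / 2**32
  return 'treatment' if bucket_value < treatment_fraction else 'control'

# Random across users, but deterministic for a given user and experiment.
assert assign_condition('user-123', 'exp-42') == assign_condition('user-123', 'exp-42')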

The key points are:

  1. The initial assignment of a user to a condition is random.
  2. Once a user is assigned to a condition, determining which condition they are in is straightforward and deterministic.
  3. The assignment of a user to a condition is separate from whether the user is included in the experiment analysis. We only want to include users that actually experienced the part of the website where the A/B test ran.

This initial assignment is typically done by your A/B testing tool in the configuration or UI, so you don’t need to worry about these low-level details. For example, in Optimizely, this is done by setting Variations Keys and Traffic Distribution.
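
Conceptually, that configuration captures something like the following (a hypothetical illustration only, not Optimizely’s actual configuration format):

experiment_config = {
  'experiment_id': 'new_signup_flow',      # hypothetical experiment key
  'variations': ['control', 'treatment'],  # the variation (condition) keys
  'traffic_distribution': {                # percentage of triggered users per variation
    'control': 50,
    'treatment': 50,
  },
}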

Next, you’ll need to instrument your code with the A/B testing library you’re using. This means making a call to some library function with, at a minimum, a user identifier and the experiment identifier. It may look something like this:

condition = ab.trigger_experiment(experiment_id, user_id)
if condition == 'control':
  # Code for the baseline or existing behaviour
  show_baseline_experience()
elif condition == 'treatment':
  # Code for the new behaviour
  show_new_experience()
else:
  # This may indicate an unexpected/error condition, in which case we show the baseline experience.
  show_baseline_experience()

This API call returns the condition or bucket the user should be in for the experiment. In our case, we only have two conditions, so our code displays the baseline experience for control and a new experience for treatment.

Making this call also triggers the experiment for the user, recording that they were active in this experiment and thus increasing the sample size of the respective condition for this experiment. The importance of triggering the experiment for a user is that only users who triggered the experiment will be included during the analysis. It’s generally not helpful to include users in the experiment who never triggered the code.

These two buckets of users can be represented with a Venn diagram:

Control and Treatment Distribution assuming a 50/50 split

The rest of your website must be instrumented to track events for users, so that you can measure the metrics you hope will be influenced by your A/B test. For example, if your change is intended to increase sign-ups, you would need to track the event of a successful sign-up. Having a robust tracking pipeline is an essential part of A/B testing your changes: It is the foundation upon which your A/B testing depends. The Optimizely API documentation has an example of this concept of tracking events.
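
Continuing with the hypothetical `ab` library from the earlier snippet, the tracking call might look something like this sketch (the `track_event` name and signature are assumptions for illustration):

def handle_signup_success(user_id):
  # ... existing sign-up logic ...

  # Record the conversion so it can be attributed to whichever condition the user
  # was triggered into. The event key must match what the experiment analysis measures.
  ab.track_event('sign_up_completed', user_id)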

So far, everything’s good. But what if we want to include only certain types of users?

Online Experiment Setup with Inclusion Criteria

What if our website had both a mobile version and a desktop version, but we only wanted to test the change out on the mobile version - that is, we only wanted the A/B test to apply to mobile users? This means that only mobile users should be eligible for the new experience. A straightforward modification of the code like this would seem to do the trick:

condition = ab.trigger_experiment(experiment_id, user_id)
if condition == 'control':
  # Code for the baseline or existing behaviour
  show_baseline_experience()
elif condition == 'treatment':
  if is_mobile(user_id):
    # Code for the new behaviour
    show_new_experience()
  else:
    show_baseline_experience()
else:
  # This may indicate an unexpected/error condition, in which case we show the baseline experience.
  show_baseline_experience()

The problem with this approach is that you are triggering the experiment for a bunch of users (non-mobile users) who will never be able to see the new experience. These users will still have been assigned to either the control or treatment condition. However, because they were never truly exposed to the experiment, it’s no longer reasonable to expect any difference in behaviour between control and treatment for these users.

Our Venn diagram with this problematic experiment setup looks like this:

Treatment Dilution: Only Mobile Users can ever see the new experience, but non-mobile users being included dilutes the difference between treatment and control. If we compare treatment and control, the difference will be reduced because only the subset of mobile users could have shown a difference.

The net effect of this problematic setup is that the experiment results are diluted, and you may miss a real effect if there was one to detect. Because you are triggering the experiment for non-mobile users who will never see the new experience, you are including these extra users in both your control and treatment conditions. These non-mobile users, in aggregate, will behave similarly regardless of whether they were assigned to control or treatment. Because they get included in the analysis, these extra users will tend to push the metric difference (or effect size) between control and treatment toward zero, decreasing the statistical power of your experiment.
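
A back-of-the-envelope calculation shows the dilution. Suppose (hypothetically) that 40% of triggered users are on mobile, that the new experience lifts the mobile sign-up rate from 10% to 12%, and that non-mobile users sign up at 10% regardless:

# Hypothetical numbers, purely to illustrate how ineligible users shrink the measured effect.
mobile_share = 0.40          # fraction of triggered users who are on mobile
baseline_rate = 0.10         # sign-up rate without the new experience
treated_mobile_rate = 0.12   # sign-up rate for mobile users who see the new experience

# Control: everyone sees the baseline experience.
control_rate = baseline_rate

# Diluted treatment: only the mobile subset actually experiences the change.
diluted_treatment_rate = (mobile_share * treated_mobile_rate
                          + (1 - mobile_share) * baseline_rate)

true_effect = treated_mobile_rate - baseline_rate        # 2.0 percentage points
measured_effect = diluted_treatment_rate - control_rate  # only 0.8 percentage points

The measured lift is the true lift scaled down by the share of eligible users, so the experiment needs a much larger sample size to detect it.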

The fix is straightforward: Always put the check for whether a user is eligible for an experiment before you trigger the experiment. This results in code like this:

if is_mobile(user_id):
  condition = ab.trigger_experiment(experiment_id, user_id)
  if condition == 'control':
    # Code for the baseline or existing behaviour
    show_baseline_experience()
  elif condition == 'treatment':
    # Code for the new behaviour
    show_new_experience()
  else:
    # This may indicate an unexpected/error condition, in which case we show the baseline experience.
    show_baseline_experience()
else:
  # Experiment is not run for non-mobile users.
  show_baseline_experience()

This ensures that the experiment is only triggered for mobile users, and now we no longer have users who were triggered into treatment without having actually seen the new experience.

Note that the Optimizely API for triggering an experiment embeds the eligibility check into the API call itself. You can create an audience of users who qualify for the experiment, and only those users will receive a valid result from the activate() call. Additionally, only those qualifying users will be included in the analysis.
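
A rough sketch of what that looks like with the Optimizely Full Stack Python SDK; the experiment key, attribute names, and audience are hypothetical, and the exact client setup follows Optimizely’s documentation:

from optimizely import optimizely

optimizely_client = optimizely.Optimizely(datafile)  # datafile fetched from Optimizely

# Attributes are evaluated against the experiment's audience (e.g. "mobile users only").
attributes = {'device_type': 'mobile'}

# activate() returns a variation key only if the user meets the audience conditions;
# otherwise it returns None and the user is not counted in the experiment.
variation = optimizely_client.activate('new_signup_flow', user_id, attributes)

if variation == 'treatment':
  show_new_experience()
else:
  show_baseline_experience()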

Closing Remarks

The general rule to avoid treatment dilution is to always put the check for whether a user is eligible for an experiment before you trigger the experiment. This avoids unnecessarily including users who will experience no difference between treatment and control, which is what causes the dilution.

This seems straightforward, and it is in the simple example I’ve provided above. However, the real world may not be so straightforward, and sources of treatment dilution can be subtle. For example, if the new experience requires you to call a new external service for the user, you must make sure that external service can handle the given user’s request. If the new service cannot handle the user (for whatever reason), and you decide to fall back to the old behaviour to gracefully degrade, this may unintentionally introduce treatment dilution. You should be sure that when you trigger an experiment for a user and they are assigned to treatment, they will always receive the new experience.
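
A sketch of that failure mode, using the hypothetical `ab` library and a hypothetical `new_service` client, where the availability check happens only after the experiment has been triggered:

condition = ab.trigger_experiment(experiment_id, user_id)  # the user is counted from this point
if condition == 'treatment':
  if new_service.can_handle(user_id):
    show_new_experience()
  else:
    # Graceful degradation, but this user was counted in treatment without ever seeing
    # the new experience: treatment dilution again.
    show_baseline_experience()
else:
  show_baseline_experience()

As with the mobile-only example, the fix is to perform the new_service.can_handle() check (for all users, not just treatment) before triggering the experiment.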

Lastly, this procedure is not valid if you gave your treatment users the choice of whether they wanted to see the new experience or not. In that case, to avoid selection bias, you’d probably want to use this approach to estimate the Local Average Treatment Effect (LATE) instead of taking a straight difference between treatment and control (with some caveats that are nicely covered at the end of that article).
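
For reference, under one-sided noncompliance (control users cannot opt in to the new experience), the LATE is commonly estimated with the Wald / instrumental-variables ratio: the intention-to-treat effect divided by the fraction of treatment-assigned users who actually adopted the new experience. A rough sketch with hypothetical per-user records:

# Hypothetical records: (assigned_to_treatment, saw_new_experience, converted)
users = [
  (True,  True,  1), (True,  False, 0), (True,  True,  0), (True,  True,  1),
  (False, False, 0), (False, False, 1), (False, False, 0), (False, False, 0),
]

treated = [u for u in users if u[0]]
control = [u for u in users if not u[0]]

# Intention-to-treat effect: compare by assignment, regardless of uptake.
itt = (sum(u[2] for u in treated) / len(treated)
       - sum(u[2] for u in control) / len(control))

# Compliance rate: fraction of treatment-assigned users who actually saw the new experience.
compliance = sum(u[1] for u in treated) / len(treated)

late = itt / compliance  # Wald estimator of the Local Average Treatment Effect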

Aside: Treatment and Control etymology

Typically, the group of users who receive the new experience is labelled as the treatment group, and the group of users who retain the existing or baseline experience is known as the control group.

These terms are borrowed from the field of clinical research, where randomized controlled trials (RCTs) are frequently used to assess the efficacy of some intervention. An intervention might be a new drug or a new treatment for a disease, and in order to properly measure its effectiveness, we must randomly assign some subjects to receive the intervention (the treatment group) and the others to not receive it (the control group).

An A/B test can be seen as an RCT: a group of users is randomly selected to receive the “treatment” (the new experience), while the remaining users continue to receive the existing or baseline experience, acting as a “control” to measure the difference against.