An Introduction to Bayesian A/B Testing in Exploratory

A/B Testing Framework that Business People Can Understand

I’m sure many of you have heard about A/B Testing. You create two groups — A and B — and measure the performance of each group and see which one has performed better.

Let’s say you deploy two versions of your website’s landing page to see which one helps more visitors sign up for your service.

Your test result comes back after a week or so, and it looks like this.

This is clear.

A (blue color) is consistently performing much better than B (orange color)! It’s obvious, and why didn’t we do that earlier?!

Except, it is not that simple in the real world. The difference between A and B can be very subtle and it can look something like below.

Just by looking at this, you might think that A seems to be better than B. But you might not be confident, because B is actually better than A on one day, and even on the days when A is better than B, the difference is very small.

Now, would you be comfortable deciding to go with A based on this result? Can you say for sure that B won’t perform better than A tomorrow? Deciding to go with A will cost you additional time and money in development, design, deployment, and so on.

So you want to be certain that A is indeed better than B.

This is where the power of Statistics comes in.

There are two popular ways to do this. One is a frequentist method called ‘Chi-Squared Test’ and the other is a Bayesian method called ‘Bayesian A/B Test’.

In this post, I’m going to talk about how the Chi-Square Test works in the context of A/B testing and the challenges you would face with this approach. Then, I’ll introduce the Bayesian A/B Test as another way to evaluate the result of an A/B test.

But before that, we need to prepare the data, regardless of which approach you choose. If you are just interested in how the Bayesian A/B Test works, then skip the next section.

Preparing Data

Let’s say we are testing two versions of our landing page and monitoring how many sign-ups each page brings in every day.

I have uploaded sample data here, which you can download as CSV.

It is aggregated at the date level with the following columns.

  • date
  • landingPagePath: the path of the landing page. There are two pages, and this tells us whether a row belongs to A or B.
  • uniquePageView: the number of unique page views for each landing page
  • signUpCount: the number of those views that ended up signing up. This is the converted count.

To run either Chi-Square or Bayesian A/B, we have two pre-requisites for the data.

First, we need the counts not just for how many visitors signed up but also for how many did NOT sign up.

Second, we need the data in a tidy (long) format, with the signed-up or not-signed-up information in a single column rather than in separate columns, like the below.

Take a look at the ‘date’ column. Now we have 4 rows for ‘2017-05-23’ where we used to have only 2. This is because we have a count value for each landing page and for each status (signed up or not signed up).

Once we get the data in this format we can move on to run either Chi-Square or Bayesian A/B. If you already have the data in this format then skip the following data wrangling section.

But most of the time, the data is not presented in this format, especially when you are pulling data from services like Google Analytics. In such cases, you want to follow the next data wrangling section.

To get this data, we need to take the following three steps.

  1. Create a column that has the number of ‘Non-Signup Counts’.
  2. Gather ‘Sign Up Counts’ and ‘Non-Signup Counts’ so that they are presented under an ‘is_signup’ column as two categories, ‘Sign Up’ or ‘Non-Sign Up’.
  3. Convert this ‘is_signup’ column to the Logical data type rather than the Character type.

1. Calculate the Counts for Non-Signup.

So, again here is the original data we start with.

First, we want to get the counts for Non-Signed Up.

We can easily calculate this by subtracting the sign up counts from the total counts (unique page views).

uniquePageView - signUpCount

Select ‘Create Calculation (Mutate)’ from the column header menu.

And type the calculation formula shown above.

This will give us the count for Non Sign Up.
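If you prefer to see this step as code, here is a minimal Python sketch of the same subtraction. It is not what Exploratory runs internally; the rows, the page paths ‘/a’ and ‘/b’, and the counts are made up for illustration, and the new column is named Not_signUpCount to match the column used in the next step.

```python
# Hypothetical daily rows; the page paths and counts are made up for illustration.
rows = [
    {"date": "2017-05-23", "landingPagePath": "/a", "uniquePageView": 120, "signUpCount": 12},
    {"date": "2017-05-23", "landingPagePath": "/b", "uniquePageView": 115, "signUpCount": 15},
]


def add_non_signup_count(rows):
    """Return copies of the rows with a Not_signUpCount column added."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row["Not_signUpCount"] = row["uniquePageView"] - row["signUpCount"]
        result.append(new_row)
    return result


with_counts = add_non_signup_count(rows)
```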

2. Gather ‘Sign Up Counts’ and ‘Non-Signup Counts’

Now, we want to have a column that indicates whether it is Sign Up or Non-Sign Up, rather than have them presented separately as two different columns.

For this, we can use the Gather command from the ‘tidyr’ package, which converts wide format data to long format data, or un-pivots the data if you will. By the way, here we have only two columns to ‘gather’, but the ‘Gather’ command can ‘gather’ many columns as well, like the below.

In Exploratory, select the ‘signUpCount’ and ‘Not_signUpCount’ columns with the Command (or Control) key, and select Gather (Wide to Long) -> Selected Columns from the column header menu.

In the Gather dialog, we can set the names for the newly created columns.

In the result, we can see that the original column names are now presented as the values of the is_signup column, alongside the counts under the value column.

If you happen to be following step by step with the sample data in Exploratory, your output might look different. That’s because I have sorted the data by the date column, which makes it a bit easier to see what has just happened. But this step is not necessary.
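To make the wide-to-long idea concrete, here is a small Python sketch of an un-pivot. The gather function below is a hypothetical helper written for this example, not the tidyr implementation, and the input rows are made up. It emits the rows for each date and page together, so the row order may differ from what Exploratory shows before sorting.

```python
def gather(rows, columns, key="is_signup", value="value"):
    """Un-pivot `columns`: emit one output row per input row and gathered column."""
    long_rows = []
    for row in rows:
        # Columns that are not gathered are repeated on every output row.
        kept = {k: v for k, v in row.items() if k not in columns}
        for column in columns:
            long_rows.append({**kept, key: column, value: row[column]})
    return long_rows


# Hypothetical wide data: one row per date and landing page.
wide = [
    {"date": "2017-05-23", "landingPagePath": "/a", "signUpCount": 12, "Not_signUpCount": 108},
    {"date": "2017-05-23", "landingPagePath": "/b", "signUpCount": 15, "Not_signUpCount": 100},
]

long_rows = gather(wide, ["signUpCount", "Not_signUpCount"])
```

Two wide rows become four long rows, which is the same doubling we saw for a single date in the target format.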

3. Convert the ‘is_signup’ column to the Logical data type rather than the Character type.

This step is optional. You can run Chi-Square or Bayesian A/B without converting this column to Logical (TRUE or FALSE). But it makes the result easier to interpret later, and it also guarantees that we’ll have only two possible values (TRUE or FALSE).

To convert a column from Character type to Logical type, select ‘Create Calculation (Mutate)’ from the column header menu to open the Mutate dialog, where you can write an expression to do the conversion.

And type the following condition.

is_signup == "signUpCount"

This will evaluate each row to see whether the value is ‘signUpCount’ or not. If it matches, it returns TRUE; otherwise FALSE.

I’m overwriting the original column with these newly calculated values.
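The same comparison, written as a short Python sketch for illustration. The two rows are made up, and Python spells the logical values True and False rather than TRUE and FALSE.

```python
# Hypothetical long-format rows, as produced by the gather step.
long_rows = [
    {"landingPagePath": "/a", "is_signup": "signUpCount", "value": 12},
    {"landingPagePath": "/a", "is_signup": "Not_signUpCount", "value": 108},
]

for row in long_rows:
    # Overwrite the text label with a logical value: True only for sign-ups.
    row["is_signup"] = row["is_signup"] == "signUpCount"
```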

Now that the data is ready, let’s take a look at the Chi-Square Test first.

Chi-Squared Test

To perform Chi-Square in Exploratory, go to Analytics view and select Chi-Square Test from Type.

Then, we want to assign the columns.

  • Select is_signup column for the Target Variable
  • Select landingPagePath column for the Explanatory Variable
  • Select value column for the Value

And, run!

The most important information here is the P-Value.

Here, it is 0.16. The P-value shows how often a difference in conversion at least as large as the one we observed between the two landing pages would show up by random chance alone, if the two pages actually performed the same. In this case, such a difference would happen by chance about 16% of the time.

Now, is this 16% too high or too low? It depends. Different businesses and industries have different thresholds. But let’s say we take the commonly adopted threshold of 5% to decide whether the result is statistically significant or not. Then, this 16% is too high to conclude that the difference between these two landing pages is statistically significant.
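For readers curious about what sits behind the P-value, here is a rough Python sketch of a chi-square test on a 2x2 table of sign-ups versus non-sign-ups for A and B. The counts are made up, so it does not reproduce the 0.16 above. It also applies no continuity correction, which is my assumption; Exploratory’s calculation may differ in that detail.

```python
import math


def chi_square_2x2(a_yes, a_no, b_yes, b_no):
    """Pearson chi-square test for a 2x2 table, without continuity correction."""
    observed = [[a_yes, a_no], [b_yes, b_no]]
    row_totals = [a_yes + a_no, b_yes + b_no]
    col_totals = [a_yes + b_yes, a_no + b_no]
    total = sum(row_totals)

    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect if the page had no effect on signing up.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the upper tail of the chi-square
    # distribution can be written with the complementary error function.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical totals: A converts 90 of 1000 views, B converts 110 of 1000.
stat, p_value = chi_square_2x2(a_yes=90, a_no=910, b_yes=110, b_no=890)
significant = p_value < 0.05  # the commonly adopted 5% threshold
```

With these made-up counts, B converts about two points better than A, yet the test still does not call the difference significant at the 5% threshold.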

Then, what should we do with this?

Do nothing? Or should we still go with one of the landing pages anyway? Or should we test it again?

This is one of the challenges we face with Chi-Square Test for A/B Testing. But this is not the only challenge. Here is a list of the challenges for using Chi-Square Test.

  1. We need to know how much of the data we need to collect for the test before starting the test.
  2. We can’t evaluate the result in real time as we go. Instead, we need to wait to make any decision until we have collected the full planned amount of data.
  3. The test result is not intuitively understandable, especially for those without a statistical background. (What is P-value again?)
  4. The test result tends to be read as black and white: either it is statistically significant or it is not. This makes it hard to figure out what to do, especially when the result is not statistically significant.

If you are concerned with these challenges, you might want to give the Bayesian approach a shot, which I’m going to introduce in the next section.

Bayesian A/B Test

Bayesian A/B Testing employs Bayesian inference methods to give you the ‘probability’ that A is better (or worse) than B, and by how much.

The immediate advantage of this method is that we can understand the result intuitively even without formal statistical training. This makes it easier to communicate with business stakeholders.

Another advantage is that you don’t have to worry too much about the test size when you evaluate the result. You can start evaluating the result from the first day (or maybe even the first hour) by reading the probability of which one between A and B is better than the other.

Of course, it would be better to have enough data, but it’s much better to be able to say, for example, “A is better than B with 60% probability” than “We don’t have enough data yet.” And you can decide at any time whether you want to wait longer or not.

Why hasn’t it always been Bayesian?

So the Bayesian approach sounds great for businesses. But this approach is still not so popular compared to the other approaches including Chi-Square Test.

One big reason is that the Bayesian approach takes a lot of computation, because it simulates many variations. This was hard in the old days with low-spec computers, but with the moderate computing power of today’s PCs, this is no longer a problem.

What do we need to know before using Bayesian?

There are two things you need to know about Bayesian. One is the Prior and the other is the Posterior. The prior is basically the knowledge you have before you see the data. For example, you most likely know what your website’s typical conversion rate is before you even start the test. You might say something like between 15% and 20%.

The posterior is the updated knowledge after the real data start coming in. So it’s like the below.

Posterior = Data + Prior
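For conversion data, this ‘addition’ can be made quite literal. A common choice, which I am assuming here for illustration, is to describe the prior as a Beta distribution with two parameters, alpha and beta. Updating it with data then just means adding the sign-ups to alpha and the non-sign-ups to beta. The prior Beta(35, 165) and the counts below are made up; the prior is centred on 17.5% to echo the ‘between 15% and 20%’ example.

```python
def update_beta(prior_alpha, prior_beta, signups, non_signups):
    """Posterior = Data + Prior for a Beta prior on a conversion rate."""
    return prior_alpha + signups, prior_beta + non_signups


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Illustrative prior: Beta(35, 165) has a mean of 35 / 200 = 17.5%.
prior = (35, 165)

# Hypothetical data: 12 sign-ups and 108 non-sign-ups (an observed rate of 10%).
posterior = update_beta(*prior, signups=12, non_signups=108)

prior_rate = beta_mean(*prior)
posterior_rate = beta_mean(*posterior)
```

The posterior mean lands between the prior’s 17.5% and the observed 10%. As more data comes in, it moves closer to the observed rate.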

You might not be familiar with these key terms of Bayesian, but the concept is pretty straightforward. If you want to know more about priors and posteriors you should take a look at this post by Frank Portman.

How to use Bayesian A/B Testing framework in Exploratory

The cool thing is, there is already an R package called “bayesAB” built and maintained by Frank Portman. It provides a simple way to employ Bayesian inference methods for evaluating the A/B test results.

Let’s try it inside Exploratory.

Just to refresh our memory about the data, here is the user conversion data we have prepared before.

Go to Analytics view and select ‘A/B Test — Bayesian’ from Type.

We need to assign columns to the following boxes.

  • Target Variable
  • Explanatory Variable
  • Value

Target Variable indicates the outcome that we want to see. In this case, that is whether users signed up or not.

Explanatory Variable indicates the two versions you are testing, basically it is either A or B.

Value indicates how many visitors there are for each outcome (signed up or not) in each version (A or B).

Assign ‘is_signup’ column to Target Variable, ‘landingPagePath’ column to Explanatory Variable, and ‘value’ column to Value.

Then, Run!

This will produce summary information like the below.

The most important part of this information is the ‘Chance of Being Better’ column. In this case, we can read that as the probability that A is better than B is 8% (0.08) and the probability that B is better than A is 92% (0.92).

The ‘Expected Improvement Rate’ column shows how much better A is than B. In this case, the number is negative, so we can interpret it as: the conversion would be about 2% worse if we go with page A. In other words, B would perform roughly 2% better.

You can go to ‘Improvement Rate’ tab where you can see the improvement rate’s probability distribution.

The X-axis represents how much A is better than B with a calculation like below.

(A-B) / B * 100

And you can read each bar as the probability of the performance improvement rate.

For example, to interpret an orange bar that the pink arrow is pointing to, we can say “A is 1.75% (X-axis) worse than B and the probability of that is 10.9%.”

And the share of the entire orange area in the whole chart (and likewise the share of the blue area) is the number presented under the Chance of Being Better column in the summary view above.
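To show where numbers like these can come from, here is a simulation sketch in Python. It assumes a Beta posterior for each page’s conversion rate and a uniform Beta(1, 1) prior. It draws many pairs of plausible rates and applies the (A-B) / B * 100 formula to each pair. The function name simulate_ab and the counts are hypothetical, and this is an illustration of the idea rather than the exact code behind Exploratory or bayesAB.

```python
import random


def simulate_ab(a_yes, a_no, b_yes, b_no, prior=(1.0, 1.0), draws=20000, seed=0):
    """Sample both posteriors and summarise how A compares with B."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    alpha, beta = prior
    improvements = []
    for _ in range(draws):
        # One plausible conversion rate for each page, given its data.
        rate_a = rng.betavariate(alpha + a_yes, beta + a_no)
        rate_b = rng.betavariate(alpha + b_yes, beta + b_no)
        improvements.append((rate_a - rate_b) / rate_b * 100)

    # Share of draws where A beats B: the 'Chance of Being Better' for A.
    chance_a_better = sum(1 for x in improvements if x > 0) / draws
    # Average improvement of A over B, in percent.
    expected_improvement = sum(improvements) / draws
    return chance_a_better, expected_improvement, improvements


# Hypothetical totals: A converts 90 of 1000 views, B converts 110 of 1000.
chance, expected, improvements = simulate_ab(90, 910, 110, 890)
```

A histogram of improvements gives a chart like the one on the Improvement Rate tab. The share of draws above zero corresponds to A being better, and the share below zero to B being better.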

Can we add the Prior?

The above evaluation was done without setting any prior information explicitly. If you don’t give the prior information, it assumes no prior knowledge of the distribution and uses the uniform distribution as the prior. This would be OK when you have enough data. But that might not be the case if you are still in the first few days, where the result does not necessarily represent your general trend.

To give the Prior, you want to provide the average and the standard deviation of the past conversion rates so that Exploratory will calculate the prior internally for you.
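I don’t know exactly how Exploratory does this conversion internally, but one standard way to turn an average and a standard deviation into a Beta prior is the method of moments. Here is a Python sketch under that assumption. The function name beta_from_mean_sd and the inputs (an average of 10% and an SD of 5%) are illustrative.

```python
def beta_from_mean_sd(mean, sd):
    """Method-of-moments Beta(alpha, beta) matching a given mean and SD."""
    variance = sd ** 2
    if not 0 < mean < 1:
        raise ValueError("mean must be strictly between 0 and 1")
    if variance >= mean * (1 - mean):
        raise ValueError("sd is too large for a Beta distribution with this mean")
    common = mean * (1 - mean) / variance - 1
    return mean * common, (1 - mean) * common


# Illustrative inputs: past conversion rate averaging 10% with an SD of 5%.
alpha, beta = beta_from_mean_sd(0.10, 0.05)
```

Here the result is roughly Beta(3.5, 31.5). A larger SD gives smaller parameters, which means a weaker prior that the incoming data overrides more quickly.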

How to get the average and the standard deviation (SD)?

You can import the past data and have Exploratory calculate the average and the standard deviation of the conversion rate.

Let’s pretend that this is the past data of the user conversion. (Yes, I’m using the same data as an example here.)

We need to calculate the conversion rate first.

Select ‘Create Calculation (Mutate)’ from the column header menu.

And type something like the below to calculate the rate.

signUpCount / uniquePageView

Here’s the conversion rate for each day and for each page.

Once you get this column created, you can simply go to Summary view and find out the average and the standard deviation (Std Dev) of the conversion rate.
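The same two steps can be sketched in Python: compute the rate per row, then summarise it. The rows are made up. I am assuming the Summary view reports the sample standard deviation (dividing by n - 1), which is what statistics.stdev computes.

```python
import statistics

# Hypothetical past data: one row per day and landing page.
rows = [
    {"uniquePageView": 120, "signUpCount": 12},
    {"uniquePageView": 100, "signUpCount": 15},
    {"uniquePageView": 80, "signUpCount": 4},
    {"uniquePageView": 125, "signUpCount": 10},
]

# Conversion rate for each row: signUpCount / uniquePageView.
rates = [row["signUpCount"] / row["uniquePageView"] for row in rows]

average = statistics.mean(rates)
std_dev = statistics.stdev(rates)  # sample standard deviation (n - 1)
```

Note that this is the plain average of the daily rates, so a low-traffic day counts as much as a busy one.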

We can see the average conversion rate as 0.098 (9.8%) and the standard deviation as 0.1154 (11.54%). Then, you want to give these numbers to A/B Test — Bayesian Analytics like the below.

That’s it!

As I mentioned above, there are a few ways to evaluate the A/B Test result. Which one to pick depends on your needs. It’s not like one is better than the other.

But if you want to monitor and evaluate the result in real time, and need to communicate the result clearly to those without a statistical background, you should give the Bayesian A/B Test method a shot!