Any experienced A/B or multivariate tester has probably encountered a situation similar to the following:
You are put in charge of testing a new design on the company website. You set up the test with your testing platform, making sure that visitors are randomly assigned to the old and new experiences.
A few days later, you check back on the results and are thrilled to see that the new experience is a clear winner, with statistically significant improvement over the old. You excitedly report the results to your boss, and the web team rolls out the new design the next day.
But then, things start to go awry. Click-through rates tank, bounce rates skyrocket, and performance soon dips below what the old experience had maintained.
What went wrong?
A Decision Made Too Fast
When this happens, it is usually for one of two reasons. One possible cause is a difference in experience or ad fatigue rates, which we will not cover here. The other, simpler reason is that you may have been a little too quick to make your choice.
How is that possible? you may ask. The result of the test was statistically significant with 99% confidence — doesn’t that mean that the new experience is definitely better than the old?
The problem here is known as sampling bias in statistics — the subset of audience upon which you performed the A/B test was not representative of your entire customer base.
Put more simply, your audience can change over time.
Audience Shifts Over Time
The people who visit your website on weekdays may be different from those who visit on weekends. The audience browsing your site during the daytime may not be the same as the one browsing in the evening. Those attracted to your site during promotions and sales may deviate from your core customer profile.
If these different audiences prefer different experiences, a result that was statistically significant on weekdays may easily flip when the weekend rolls around.
This problem exists no matter what is being compared (website components, landing pages, email formats or ad creatives) or the testing method you use (A/B, multivariate, multi-armed bandit, etc.).
How do you know, then, how long to run a test? If the statistical significance, the confidence level or the p-value cannot be trusted, how can you tell when your results are truly reliable?
Patience You Must Have
Many experts advise waiting it out: run every test for some minimum duration, until the sample audience becomes representative of the total audience. If a test runs for 2 weeks, the sample should capture most, if not all, of the audience variance by day of week and hour of day.
But this solution is not perfect, either. If there is a promotion going on during the entire testing period, there is no guarantee that the results you got are valid for non-promotional periods, regardless of how long you wait.
Thus, the reason for the somewhat tongue-in-cheek title for this article: the busier things are (the more things that are going on outside the norm), the less reliable your test results become.
Given that many businesses are constantly running different promotions on and off throughout the year, this can cause problems with the reliability of test results. Is there a better, more flexible way to approach this problem?
Audience Segmentation On Experience Preference
For a business where most purchases come from repeat customers (that is to say, most businesses), the solution lies in a concept familiar to all marketers: segmentation.
The practice of audience segmentation has existed as long as marketing itself, but the key difference here is that segments, for testing purposes, should be determined according to how the individual reacts to different experiences. This can be done using techniques such as cluster analysis.
If we measure the effect of the new experience separately for each of these segments, we can better estimate the potential long-term impact of the experience regardless of when the test is run, how long it was run for and what kind of events or promotions are going on.
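As a rough illustration of measuring the effect separately per segment, the sketch below computes the relative CTR lift of the new experience within each segment from visitor-level records. Everything in it is an assumption made for illustration: the record layout `(segment, variant, clicked)`, the segment names `seg_1` and `seg_2`, the variant labels `old` and `new`, and the visit and click counts. It also assumes segment membership has already been assigned (for example by cluster analysis), which is not shown here.

```python
from collections import defaultdict


def ctr_by_segment(records):
    """Return {segment: {variant: CTR}} from (segment, variant, clicked) records."""
    visits = defaultdict(int)
    clicks = defaultdict(int)
    for segment, variant, clicked in records:
        visits[(segment, variant)] += 1
        clicks[(segment, variant)] += int(clicked)

    ctrs = defaultdict(dict)
    for (segment, variant), n in visits.items():
        ctrs[segment][variant] = clicks[(segment, variant)] / n
    return dict(ctrs)


def relative_lift(ctrs, control="old", treatment="new"):
    """Relative change in CTR of the treatment vs. the control, per segment."""
    return {seg: v[treatment] / v[control] - 1 for seg, v in ctrs.items()}


def make_records(segment, variant, visits, clicks):
    """Build hypothetical visit records: the first `clicks` visits clicked."""
    return [(segment, variant, i < clicks) for i in range(visits)]


# Hypothetical data, invented purely to exercise the functions.
records = (
    make_records("seg_1", "old", 200, 25)
    + make_records("seg_1", "new", 200, 20)
    + make_records("seg_2", "old", 200, 20)
    + make_records("seg_2", "new", 200, 26)
)

lift = relative_lift(ctr_by_segment(records))
for segment, value in sorted(lift.items()):
    print(f"{segment}: {value:+.0%}")
```

With real data, each per-segment figure would also need its own significance check, since splitting the audience shrinks the sample behind every estimate.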
An Example Of Weighted Estimation
Let us say we have two segments, A and B. We know that on average over the course of the year, 60% of the total audience fall into segment A, and 40% into segment B.
During the test period, there is a promotion going on that heavily targets segment B, causing 80% of all visitors to come from that segment, vs. only 20% for A.
The new experience has an affinity for segment B, increasing its click-through rate (CTR) by 10%, but it decreases CTR by 10% for segment A.
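| Segment | % of Visits, Yearly Average | % of Visits, Test Period | Experience Impact on CTR |
|---------|-----------------------------|--------------------------|--------------------------|
| A       | 60%                         | 20%                      | -10%                     |
| B       | 40%                         | 80%                      | +10%                     |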
During the test period, the average CTR increases by (+10% * 80%) + (-10% * 20%) = +6%. This average is the only number seen in a simple A/B test, and the conclusion would be that the new experience is better.
However, since we know that the audience ratio is heavily skewed during the promotional period, we can estimate what the average impact will be for the year: (+10% * 40%) + (-10% * 60%) = -2%. The new experience is clearly not a good permanent change to make.
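The two weighted averages above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `weighted_impact` and the dictionary layout are illustrative choices, while the percentages are the ones given in the example.

```python
def weighted_impact(segment_share, segment_impact):
    """Audience-weighted average CTR impact across segments."""
    return sum(segment_share[s] * segment_impact[s] for s in segment_share)


# Figures from the example: impact of the new experience on CTR per segment.
impact = {"A": -0.10, "B": +0.10}

# Share of visits per segment during the test and on average over the year.
test_mix = {"A": 0.20, "B": 0.80}
yearly_mix = {"A": 0.60, "B": 0.40}

test_period = weighted_impact(test_mix, impact)
yearly = weighted_impact(yearly_mix, impact)

print(f"Test period:     {test_period:+.0%}")  # +6%
print(f"Yearly estimate: {yearly:+.0%}")       # -2%
```

The per-segment impacts are the same in both calculations. Only the audience mix changes, and that alone is enough to flip the sign of the overall result.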
These audience segments based on experience preference not only make A/B tests more reliable, but also become a long-term marketing asset to the business applicable to anything from retargeted advertising to personalized web experiences.
Even for businesses with high churn, the same strategy works, with techniques such as look-alike modeling helping to predict which segment a new customer or visitor may fall into.
Tomorrow’s Testing & Today’s Tips
The objective of comparative testing in digital marketing is evolving from trying to answer which experience is better, toward being able to answer which experience is better for what audience.
Eventually, perhaps all web experiences (owned, earned or paid) will become personalized at the individual level. That future may come sooner than we think, but in the meantime, here are some tips for marketers who need to run an A/B test tomorrow:
- Run the test for a minimum of 2 weeks (as a benchmark) in order to average out any audience shifts by hour of day and day of week.
- Avoid running tests during major seasonal or promotional periods, since the results you get may not be representative of performance across the year/quarter.
- If you have not started already, start collecting cookie-level data for all future tests you run, in order to build out audience segments that react similarly across experiences.
Special thanks to my colleague Jon Tehero for inspiration and insights.
Opinions expressed in the article are those of the guest author and not necessarily those of Marketing Land.