Any experienced A/B or multivariate tester has probably encountered a situation similar to the following:

You are put in charge of testing a new design on the company website. You set up the test with your testing platform, making sure that visitors are randomly assigned to the old and new experiences.

A few days later, you check back on the results and are thrilled to see that the new experience is a clear winner, with statistically significant improvement over the old. You excitedly report the results to your boss, and the web team rolls out the new design the next day.

But then, things start to go awry. Click-through rates tank, bounce rates skyrocket, and performance soon dips below what the old experience had maintained.

What went wrong?

A Decision Made Too Fast

When this happens, it is usually for one of two reasons. One possible cause is a difference in experience/ad fatigue rates, which we will not cover here. The other, simpler reason is that you may have been a little too quick to make your choice.

"How is that possible?" you may ask. The result of the test was statistically significant with 99% confidence — doesn't that mean the new experience is definitely better than the old?

The problem here is what statisticians call sampling bias: the subset of your audience on which you ran the A/B test was not representative of your entire customer base.

Put more simply: your audience can change over time.

Audience Shifts Over Time

The people that visit your website on weekdays may be different from those that visit on weekends. The audience browsing your site during daytime may not be the same as in the evening. Those attracted to your site during promotions and sales may deviate from your core customer profile.

If these different audiences prefer different experiences, a result that was statistically significant during the weekday may easily flip when the weekend rolls around.

This problem exists no matter what is being compared (website components, landing pages, email formats or ad creatives) or the testing method you use (A/B, multivariate, multi-armed bandit, etc.).

How do we know, then, how long to run a test for? If statistical significance, the confidence level or the p-value cannot be trusted, then how can we tell when our results are truly reliable?

Patience You Must Have

Many experts advise waiting it out: run every test for some minimum duration so that the sample audience becomes representative of the total audience. A test run for two weeks, for example, should capture most, if not all, of the audience variance across day of week and hour of day.

But this solution is not perfect, either. If there is a promotion going on during the entire testing period, there is no guarantee that the results you got are valid for non-promotional periods, regardless of how long you wait.

Hence the somewhat tongue-in-cheek title of this article: the busier things are (the more that is going on outside the norm), the less reliable your test results become.

Given that many businesses are constantly running different promotions on and off throughout the year, this can cause problems with the reliability of test results. Is there a better, more flexible way to approach this problem?

Audience Segmentation By Experience Preference

For a business where most purchases come from repeat customers (that is to say, most businesses), the solution lies in a concept familiar to all marketers: segmentation.

The practice of audience segmentation has existed as long as marketing itself, but the key difference here is that segments, for testing purposes, should be determined by how individuals react to different experiences. This can be done using techniques such as cluster analysis.
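As a deliberately simplified stand-in for full cluster analysis, the sketch below segments visitors by which experience they responded to better. The visitor records, field names and thresholds are all hypothetical illustration data, not a real dataset or API:

```python
# Hypothetical sketch: segment visitors by their reaction to each experience.
# In practice you would use proper cluster analysis (e.g. k-means) on richer
# behavioral features; this toy version just splits on the CTR difference.

def preference_segments(visitors, threshold=0.005):
    """Group visitors by which experience they respond to better.

    Each visitor is a dict with per-experience click-through rates, e.g.
    {"id": 1, "ctr_old": 0.05, "ctr_new": 0.08}. The threshold (arbitrary
    here) keeps near-identical responses out of either preference bucket.
    """
    segments = {"prefers_new": [], "prefers_old": [], "indifferent": []}
    for v in visitors:
        delta = v["ctr_new"] - v["ctr_old"]
        if delta > threshold:
            segments["prefers_new"].append(v["id"])
        elif delta < -threshold:
            segments["prefers_old"].append(v["id"])
        else:
            segments["indifferent"].append(v["id"])
    return segments

visitors = [
    {"id": 1, "ctr_old": 0.05, "ctr_new": 0.09},
    {"id": 2, "ctr_old": 0.07, "ctr_new": 0.03},
    {"id": 3, "ctr_old": 0.04, "ctr_new": 0.04},
]
print(preference_segments(visitors))
# {'prefers_new': [1], 'prefers_old': [2], 'indifferent': [3]}
```

Once visitors are bucketed this way, each segment's size during the test can be compared against its long-run share of traffic, which is exactly what the weighted estimation below relies on.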

If we measure the effect of the new experience separately for each of these segments, we can better estimate the potential long-term impact of the experience regardless of when the test is run, how long it runs for, and what events or promotions are going on.

An Example Of Weighted Estimation

Let us say we have two segments, A and B. We know that on average over the course of the year, 60% of the total audience fall into segment A, and 40% into segment B.

During the test period, there is a promotion going on that heavily targets segment B, causing 80% of all visitors to come from that segment, vs. only 20% for A.

The new experience has affinity for segment B, increasing click-through rates (CTR) by 10%, but decreasing CTR by 10% for segment A.

                               Segment A    Segment B
% of Visits, Yearly Average       60%          40%
% of Visits, Test Period          20%          80%
Experience Impact on CTR         -10%         +10%

During the test period, the average CTR increases by (+10% * 80%) + (-10% * 20%) = +6%. This average is the only number a simple A/B test reports, and the conclusion would be that the new experience is better.

However, since we know that the audience ratio is heavily skewed during the promotional period, we can estimate what the average impact will be for the year: (+10% * 40%) + (-10% * 60%) = -2%. The new experience is clearly not a good permanent change to make.
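The reweighting above can be written as a few lines of Python. This is just the worked example's arithmetic made explicit; the numbers come directly from the table:

```python
# Reweight a test-period result to the yearly audience mix.
# Per-segment numbers are taken from the worked example above.

def weighted_impact(impacts, weights):
    """Average the per-segment CTR impacts using the given segment weights."""
    return sum(i * w for i, w in zip(impacts, weights))

impacts = [-0.10, +0.10]     # CTR impact of new experience: segments A, B
test_mix = [0.20, 0.80]      # audience mix during the test period
yearly_mix = [0.60, 0.40]    # average audience mix over the year

observed = weighted_impact(impacts, test_mix)
projected = weighted_impact(impacts, yearly_mix)

print(f"Observed during test: {observed:+.0%}")   # +6%
print(f"Projected for year:   {projected:+.0%}")  # -2%
```

The same two lines of arithmetic work for any number of segments, as long as you know each segment's long-run share of traffic.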

Audience segments based on experience preference not only make A/B tests more reliable, but also become a long-term marketing asset for the business, applicable to anything from retargeted advertising to personalized web experiences.

Even for businesses with high churn, the same strategy works, with techniques such as look-alike modeling helping to predict which segment a new customer or visitor may fall into.

Tomorrow’s Testing & Today’s Tips

The objective of comparative testing in digital marketing is evolving from trying to answer which experience is better, toward being able to answer which experience is better for which audience.

Eventually, perhaps all web experiences — owned, earned or paid — will become personalized at the individual level. That future may come sooner than we think, but in the meantime, here are some tips for marketers who need to run an A/B test tomorrow:

  1. Run the test for a minimum of 2 weeks (as a benchmark) in order to average out any audience shifts by hour of day and day of week.
  2. Avoid running tests during major seasonal or promotional periods, since the results you get may not be representative of performance across the year/quarter.
  3. If you have not started already, start collecting cookie-level data for all future tests you run, in order to build out audience segments that react similarly across experiences.

Special thanks to my colleague Jon Tehero for inspiration and insights.

Opinions expressed in the article are those of the guest author and not necessarily Marketing Land.

About The Author: A Senior Business Analyst in the Digital Marketing team at Adobe, conducting market research across multiple advertising channels including search, display, and social media.
