What is a Holdback?
Some experiments require a long time to reach significant results. If one option is clearly preferable in the short term, you can keep an uneven split for a long period: expose 95% of traffic to the preferred option and 5% to the other one. Both that pattern and the small group kept on the other option are known as a holdback.
When is it Appropriate to Use That Approach?
Some changes to your web service or product, like making the purchase flow easier to navigate, are meant to raise business-critical metrics immediately. Others, like a new channel for customer service, might improve customer satisfaction rapidly but will only have a measurable, compounding effect on retention and other business-critical metrics in the long run. You can confirm that customers like the new option by looking at the Net Promoter Score (NPS). However, should you expose half of your users to a worse experience for months to measure the impact on churn?
What If the Experiment Can’t Last Long Enough?
There are many cases where an experiment should not last more than a few weeks, three months at most, to keep the product cycle manageable. However, some effects, like customer churn, can take longer to measure. Say you want to measure the impact of your change on churn, and your customers book a holiday or review their retirement plan only once a year. In either of those cases, a ten-week experiment is too short to see customers return and to gather the data needed to measure churn.
There are several options:
- You can end the split early, have all the traffic revert to your default option, wait several weeks or months after both variants have merged, and only then look at the impact of the split. In other words: did changing something in April and May have any measurable impact in October, or in April the following year? The identical experience that both groups share in between will likely contaminate your results and require non-trivial corrections. To limit that contamination, you can restrict your experiment to people who only access your service in April, for instance.
- You can infer, based on previous observations, that an improvement in a leading metric should lead to an improvement in a lagging metric. For instance, you may have noticed that an increase of 5 points of NPS leads to a decrease of 1% in churn. That would be speculative.
An approach that we could recommend is to run the experiment as planned but to set a short-term goal, like a customer satisfaction survey score, as your objective criterion, and roll the change out to all customers if the impact after a few weeks is significantly positive. Months later, you can check whether your overall retention has indeed improved compared to before your experiment. That comes with the limitations of a Before-and-After comparison.
- Rather than rely on a noisy approach like Before-and-After, the current best practice to confirm the effect on business metrics is to run an experiment that lasts longer than quarterly planning would allow. If that seems costly because one option has clear short-term benefits, we can recommend a small twist: instead of rolling out to everyone and losing information, or maintaining an expensive 50/50 split, you can hold back a small portion of your traffic, say 5% or 10%, for longer: several months, maybe two years. Then, while most of your customers see improved customer service, the held-back minority will continue to experience the previous version of the service. It’s not ideal for those users, and you have to maintain two options, but you will be able to compare the actual, compounded impact of a better service over months.
With that third approach, you can still measure what it’s like to have better customer service for a couple of purchase cycles; not only that, you can also measure the impact of expecting excellent service, time after time over extended periods. For example, it might increase entitlement, it could affect the brand positively, it could drive stories about exceptional situations where a better service was helpful.
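One way to keep the same small group held back for months is to derive each user's variant from a stable hash of their identifier. The snippet below is only a minimal sketch of that idea, not a description of any particular feature-flagging product: the function name, the experiment key, and the 5% threshold are assumptions chosen for illustration.

```python
import hashlib

# Assumed share of users kept on the previous experience (illustrative value).
HOLDBACK_PERCENT = 5


def assign_variant(user_id: str, experiment: str = "customer-service-holdback") -> str:
    """Return "holdback" for a stable ~5% of users and "rollout" for the rest."""
    # Hashing the experiment name together with the user id keeps the
    # assignment deterministic for a user and independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket between 0 and 99
    return "holdback" if bucket < HOLDBACK_PERCENT else "rollout"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(100_000)]
    share = sum(assign_variant(u) == "holdback" for u in users) / len(users)
    print(f"holdback share: {share:.3f}")
```

Because the assignment depends only on the user identifier and the experiment key, a returning customer sees the same experience on every visit, which is what lets effects accumulate over several purchase cycles.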
Can Keeping a Small Group Out Be Significant?
The first question from your statisticians or analysts will likely be “Would we be able to measure the impact over only 5% of the audience? Wouldn’t that mean three times less power?” It would, roughly (5% is ten times fewer units than 50% and, following the central limit theorem, √10 is about 3) but a longer test set-up would be more sensitive: more visitors can enroll in the experiment and some effects compound.
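The rough factor of three can be checked with a short calculation. The sketch below assumes a comparison of two group means with equal variance and the same total traffic in both designs. Under those assumptions, the standard error of the difference is proportional to the square root of 1/n_small + 1/n_large. Counting the small group alone gives the √10 figure; including the larger 95% group brings the factor somewhat below three.

```python
import math


def relative_standard_error(share_small: float) -> float:
    """Standard error of a difference in means, up to a constant factor.

    Assumes equal variance in both groups and a fixed total sample size,
    so the value is only meaningful as a ratio between two splits.
    """
    share_large = 1.0 - share_small
    return math.sqrt(1.0 / share_small + 1.0 / share_large)


balanced = relative_standard_error(0.50)  # 50/50 split
holdback = relative_standard_error(0.05)  # 95/5 split
ratio = holdback / balanced

print(f"standard error is about {ratio:.1f}x larger with a 5% holdback")   # 2.3x
print(f"traffic needed to match the 50/50 precision: about {ratio**2:.1f}x")  # 5.3x
```

In other words, under these simplifying assumptions a 95/5 holdback needs roughly five times the traffic of a 50/50 split to reach the same precision, which a run of several months instead of a few weeks can plausibly provide.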
More importantly, with customers being exposed to customer service multiple times, their retention should not just improve but compound. If your retention improves by 10% over one month, it’s 21% better after two, 77% better after six months. That’s several times larger. Those more consequential effects are easier to detect.
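The compounding figures above follow from multiplying the monthly gain by itself. As a small illustration, assuming the same 10% relative improvement applies every month (a simplification, since real retention gains are unlikely to be constant):

```python
def compounded_improvement(monthly_gain: float, months: int) -> float:
    """Total relative improvement after compounding a monthly gain."""
    return (1.0 + monthly_gain) ** months - 1.0


for months in (1, 2, 6):
    total = compounded_improvement(0.10, months)
    print(f"after {months} month(s): {total:.0%} better")
# after 1 month(s): 10% better
# after 2 month(s): 21% better
# after 6 month(s): 77% better
```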
Which Variant to Roll Out and Which One to Hold Back?
If you run a balanced 50/50 test first, you know which variant offers the most short-term value, or which one is the most promising overall. To minimize the negative impact of testing on the business, you want to roll out the variant with the most promising outcome to 90 or 95% of your user population, judging especially on leading indicators: best customer satisfaction, most items marked as favorites, etc.
You can decide to pick the option that will be easiest to deactivate, in case the holdback experiment gives surprising results. Introducing new interactions means that removing them will come at a cost. Keep in mind, however, that a holdback is there to confirm a previous result and possibly to measure its impact more accurately; it rarely flips the overall outcome.
Another way to decide which option to prioritize is to think about the possibilities that this opens. Allowing customers to identify their favorites (without buying) allows you to reactivate them with more purchase opportunities. It allows your machine learning team to train better recommendations. Those improvements can contribute to assigning more value to your preferred option.
Of course, if your users talk to each other, those left behind might resent that they don’t have a better experience. You might get bad press from the discrepancy. Exercise discretion and override the holdback when it is more expensive than interesting. Still, this effort will be beneficial in the long run to convince your executive stakeholders to invest in better service for long-term objectives.
Keep Technical Debt Under Control
When running this process, one of the most common issues is maintaining the old code or the previous operational processes for longer. That is a legitimate source of concern for the software engineers and the operational managers who will want to move on. Getting them engaged with the process is critical. You should explain the value of experimentation and why a holdback is useful when dealing with long-term effects. They will generally understand that this aligns with their objective of having a more streamlined experience and investing to resolve technical and operational debt in the long term too.
Learn More About the Holdback Pattern
When you run an experiment that proves beneficial over the short term, you will want to roll it out as soon as you have significant results. However, you may still want to investigate its long-term effect and check how reliable the initial result was. To make sure as many users as possible benefit from the improved experience, roll it out to the majority of users, say 95%. Keep a minority of users in a long-term control group. This is known as a holdback.
After several months, you should have a strong signal about the long-term impact on key metrics, notably those that compound. Remember to switch the holdback to the new experience when your experiment is over.