Many product and engineering teams are thrilled with the comfort of feature flags: they can release new code and revert it instantly if anything happens that they don’t like. Having a killswitch as an option significantly diminishes the risk of a failed release. In addition, engineering teams can worry about fixing the code or improving the idea later rather than having the stress of a hotfix.

When organizations implement feature flags, they often begin to think about monitoring, measurement, and testing. One common assumption is that every release will, by itself, enable simple measurements that provide clear insights: just compare metrics before and after the release. However, releasing a feature to all users won’t isolate the effects of one specific change. An all-at-once roll-out is faster than splitting the traffic in a 50/50 A/B test or waiting for confirmation through canary deployments, but it won’t yield insights that are as clear.

You might think: with twice as many people seeing the new version, we should know whether the change we released worked twice as fast, right? You will indeed know almost immediately whether metrics go up, but you might not be able to tell whether that was due to chance, the change you care about, another change that happened at almost the same time, or something outside of your control.

Victory Has Many Parents

This idea that you can test faster with more people seems intuitive on its surface. Still, it assumes that the metric that you care about (conversion, retention, average order value) should not change meaningfully during the course of the test—except for that one modification that the feature flag is controlling. Unfortunately, that’s rarely the case. It’s more common that things change for unrelated reasons. Ask travel agencies when holidays come up, financial services around tax season, e-commerce companies just before Black Friday, apparel stores during sales, or florists on February 13th: any seasonal activity will shake all your metrics beyond recognition. Competitors, software bugs, media coverage, even weather can affect your metrics.
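To make the seasonality problem concrete, here is a minimal simulation sketch. Every number in it is an assumption chosen for illustration (a 10% baseline conversion rate, a seasonal bump of 5 points, a true feature lift of 2 points), and `compare_designs` is a hypothetical helper, not part of any library. It contrasts a before/after comparison with a split run during the same week.

```python
import random

def compare_designs(true_lift=0.02, seasonal_lift=0.05, n_users=40000, seed=1):
    """Return (before/after estimate, A/B estimate) of the feature's lift."""
    rng = random.Random(seed)
    base = 0.10  # hypothetical baseline conversion rate

    def conversion_rate(rate, n):
        # Simulate n visitors who each convert with probability `rate`.
        return sum(rng.random() < rate for _ in range(n)) / n

    # Before/after: everyone gets the feature in week 2, which is also a busier season.
    before = conversion_rate(base, n_users)
    after = conversion_rate(base + seasonal_lift + true_lift, n_users)

    # A/B test during week 2: both halves share the season, only the flag differs.
    control = conversion_rate(base + seasonal_lift, n_users // 2)
    treatment = conversion_rate(base + seasonal_lift + true_lift, n_users // 2)

    return after - before, treatment - control

naive_estimate, ab_estimate = compare_designs()
print(f"before/after estimate: {naive_estimate:.3f}")
print(f"A/B estimate:          {ab_estimate:.3f}")
```

Under these assumptions, the before/after estimate absorbs the seasonal bump and credits it to the feature, while the A/B estimate stays close to the true lift because both groups lived through the same week.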

An Ancient Problem

But this is not just a problem for business: the area where randomized controlled trials (RCTs, the name that scientists use for A/B tests) had the most visible impact is medical science. You will see many experimentation specialists use the vocabulary of clinical trials (treatment, variant) for a good reason: a glaring lack of proper controls plagued the craft of curing people for far too long. Why would leeches and bloodletting based on “the theory of humors,” ingesting mercury, radium water, and other now-disturbing practices be so prevalent and valued for centuries? Because the patients, at least those who survived, felt a lot better afterward. The healers were likely sincere and keen to cure, happy to receive praise from survivors and quick to forget their failures. We can imagine now that there must not have been many survivors, and that their renewed health could have come despite the dangerous attempts.

It is all too natural to assign the credit to the most compelling (or expensive) effort made in the heat of the moment. Indeed, if the patient survived, each of the many contradictory attempts (elixirs, infusion, fumigation, dancing with scary masks to scare evil spirits, etc.) could be praised as “the cure.” Every practice had an easy way to explain any adverse event—typically, patients inconveniently not believing in the healer’s power enough.

Often, all the healer needed was good timing: the viral flu typically lasts three days. Just learn that your incantations are more effective on the third day, and they will appear miraculous.

Blindly attributing success to anything that preceded it is not a new phenomenon: Tacitus, in the first century of our era, lamented how Victory has many parents, but Defeat is an orphan. He was far from the first. If everyone could see the presumed solutions failed to work most of the time, why did so many of these practices survive for so long?

A Contemporary Problem

It’s easy, from our modern perspective, to laugh at the naiveté of “the patients, at least those who survived, felt a lot better afterward.” It’s easy to assume that fallacies like that remain in the past. Yet the habit of praising the most visible effort among survivors might not be gone entirely.

Take a look at the nearest shelf of business books; you might see a similar argument used a lot: very few writers interview people who started a company that failed miserably after following the favorite advice of the day. That was true for ideas popular fifty years ago. It is just as valid for contemporary suggestions like just-in-time logistics; the contradictory strategies of specialization, platformization, vertical or horizontal integration; digitalization too. Every day, a commentator on social media mocks survivor bias by responding with a certain line-drawing of a plane riddled with red dots, a trope referring to Abraham Wald’s work on bomber failure points during the Second World War.

Certain people might even try to convince you to use A/B testing because successful companies do it.

Do not trust an idea because a survivor told you that, for them, it worked. Instead, trust an idea because many people tried, and they fared overall better than similar people who could have tried but did not.

Enter Controlled Experiments

A Long Time Coming

The idea that you needed a controlled test was new in 18th-century medicine and wasn’t immediately accepted. The first experiment that proved that lemons could cure scurvy happened in 1747. Despite the widespread and horrifying symptoms on board, it took almost five decades for the British Navy to pay attention and mandate lemons in 1795. Who knows how history could have been different if, two decades earlier, British ships could have sailed the Atlantic without losing half of their crew and prevented the American Revolution?

Immunization against smallpox; the use of soap or rubbing alcohol when taking care of wounds; antibiotics; fertilizer to help plants grow: all those ideas are now part of science’s basic canon. To prove their efficacy, they had to go through the demanding protocol of controlled experiments. We are all healthier and better-fed for it. We can now rely on those conclusive results and build on top of them.

A Scientific Compounding

So many other techniques thankfully didn’t pass the same test, including bloodlettings. One that did work and might surprise you: willow bark. That’s where we used to get aspirin. The scary masks didn’t prove effective against the flu but were very popular with kids. I suspect that’s, at least partially, how we got Halloween.

As you can tell from these examples, the result didn’t just benefit the half that was lucky enough to chew on actual willow bark over the placebo. Everyone after the test could point at the measured effect on patients and explore further: which part of the bark was effective? Could it survive boiling or burning? Would larger quantities be more effective?

A reliable protocol establishes a solid basis of trust. Then, one can run more tests on top of that foundation. Those results compound, and the overall improvements lead to dramatic learnings. For example, to test the effect of willow on pain, hundreds of patients had to be left chewing another bark for days; that helped confirm that the plant did work. Fever management now saves millions of lives with effective biochemistry, perfected over hundreds more experiments. While it is running, a test can feel expensive or uncaring. However, the long-term benefits of the test and its refinements will often outweigh that cost by orders of magnitude.

Regression to the Mean

Why do we need to be so formal? There are many possible issues when not running a proper test, what specialists often call “uncontrolled observations”. Let’s use an example to illustrate a common one: an idea that saved many lives recently in Europe was replacing road intersections with roundabouts. The process was easy enough:

  1. Find the most dangerous intersections (with stop signs or red lights) based on the last year of traffic accidents
  2. Spend a lot on roadworks to replace those with a roundabout
  3. Count the casualties the following year

For all the changed intersections, the result was invariably fewer accidents. So naturally, everyone was thrilled and started planning more roundabouts.

Except, some intersections were not changed immediately. Maybe the work took longer to start because of budget constraints, or perhaps some local group didn’t like the idea of construction noise. Either way: things took longer than expected. The dangerous intersections were left as they were for a while. Transport authorities collected more accident data. 

Surely, those intersections would still be deadly during that unexpected wait for improvements, with a record-breaking year of accidents? Actually, no: those intersections suffered significantly fewer accidents—almost as few as if they were replaced by roundabouts. But those intersections didn’t change.

How Come Doing Nothing Helped?

What stunned traffic authorities didn’t surprise statisticians. Specialists are all too familiar with regression to the mean. The phenomenon is common for every metric prone to a bit of noise. How does it work? Imagine a race: running, swimming, drag racing… The type doesn’t matter much. What matters is that it is competitive:

  • pick the top three fastest racers;
  • the following week, have them run another race, same process, same participants;
  • check how the previous top three did this time.

They won’t all be in the top three this time, at least not in the same order. Your original top three will likely still be good, but not as fast as the first time around. The shift happens because the first top three contained both the fastest racers and the luckiest ones, those who felt great on that day. Wait another week and some of them might have an off day.
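The race thought experiment can be simulated in a few lines. This is only a sketch under an assumed model: each racer’s finish time is a stable skill plus some day-to-day luck, both drawn from normal distributions with made-up parameters. The function name `simulate_races` and all its defaults are hypothetical.

```python
import random

def simulate_races(n_racers=50, n_trials=2000, seed=42):
    """Average race-1 and race-2 times of the racers who were top three in race 1."""
    rng = random.Random(seed)
    first_total = 0.0
    second_total = 0.0
    for _ in range(n_trials):
        # Assumed model: finish time (seconds) = stable skill + luck of the day.
        skill = [rng.gauss(60.0, 2.0) for _ in range(n_racers)]
        race1 = [s + rng.gauss(0.0, 2.0) for s in skill]
        race2 = [s + rng.gauss(0.0, 2.0) for s in skill]
        # Lower time is better: keep the three fastest racers of the first race.
        top3 = sorted(range(n_racers), key=lambda i: race1[i])[:3]
        first_total += sum(race1[i] for i in top3) / 3
        second_total += sum(race2[i] for i in top3) / 3
    return first_total / n_trials, second_total / n_trials

first_avg, second_avg = simulate_races()
print(f"top three, race 1:  {first_avg:.2f}s")
print(f"same racers, race 2: {second_avg:.2f}s")
```

Nothing changed between the two races in this model, yet the first week’s winners come back slower on average, while still beating the field’s average of 60 seconds. Their skill carried over; their luck did not.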

Regression to the mean is pivotal to consider when thinking about tests and statistics. For example, if a region has a bad sales quarter, you can replace the team with a new one. The new team will most likely do better, but one cannot be sure whether the improvement is because they are better or because you chose to judge the previous team on their worst quarter so far, the one that prompted you to act.

How to Avoid Regression to the Mean

Transport authorities had to run proper tests to prove that roundabouts were effective:

  • pick hundreds of the most dangerous intersections; 
  • split them into two groups:
    • one where we would replace all crossings with roundabouts,
    • one where all would stay the same: no construction;
  • compare the accident statistics,
    • not just with the previous year,
    • but between the two groups.

Transport authorities realized that both groups had fewer accidents than the year before (as expected). They also noticed that the intersections with a certain amount of traffic, with a new roundabout (after a period of drivers getting used to the new way), had significantly fewer bad accidents than the other intersections in the same year. There might have been individual outliers, but the overall effect was consistent and significant. More importantly, it meant changing intersections saved lives beyond the statistical fluke.
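The protocol above can be sketched as a small simulation. Everything here is an assumption made for illustration: the accident rates, the Poisson noise, and especially the `effect=0.7` multiplier (a roundabout cutting accidents by 30%), which is not a figure from any real study. The helper names are hypothetical too.

```python
import math
import random
from statistics import mean

def poisson(rng, lam):
    """Draw a Poisson-distributed count (Knuth's method, fine for small rates)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def roundabout_trial(n_sites=5000, n_worst=400, effect=0.7, seed=7):
    rng = random.Random(seed)
    # Assumed model: each intersection has a stable yearly accident rate.
    rates = [rng.uniform(2.0, 6.0) for _ in range(n_sites)]
    last_year = [poisson(rng, r) for r in rates]
    # Pick the most dangerous intersections based on last year's counts.
    worst = sorted(range(n_sites), key=lambda i: last_year[i], reverse=True)[:n_worst]
    # Split them at random: roundabout (treated) versus no construction (control).
    rng.shuffle(worst)
    treated, control = worst[: n_worst // 2], worst[n_worst // 2 :]
    # Observe the following year; a roundabout multiplies the rate by `effect`.
    before = mean(last_year[i] for i in worst)
    after_treated = mean(poisson(rng, rates[i] * effect) for i in treated)
    after_control = mean(poisson(rng, rates[i]) for i in control)
    return before, after_treated, after_control

before, after_treated, after_control = roundabout_trial()
print(f"selected sites, last year:  {before:.2f} accidents per site")
print(f"unchanged sites, next year: {after_control:.2f}")
print(f"roundabouts, next year:     {after_treated:.2f}")
print(f"before/after ratio: {after_treated / before:.2f}")
print(f"treated/control ratio: {after_treated / after_control:.2f}")
```

In this sketch the unchanged intersections improve on their own, because they were selected on an unusually bad year. The before/after ratio therefore overstates the benefit, while the ratio between the two groups in the same year lands near the assumed effect.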

Transport authorities didn’t just prove the idea worked: they identified nuances about drivers’ initial confusion and the relevance of high traffic. That distinction would not have been possible when comparing accident statistics before and after the change. Detailed controlled studies and successive test iterations led to a well-established conclusion that roundabouts are effective when introduced properly on high-traffic intersections.

It’s now possible to point at those results and argue that we don’t need to let people in half of the intersections needlessly hurt themselves one more year. Once again: A/B tests come with a cost, but it is far smaller than the cost of not running them.

The Frustration of Releasing Through A/B Tests

No one who has tried such tests will tell you that the exercise was always fun. Results are often less compelling than the proponents were expecting. Edge cases like users with multiple accounts will muddy your results. A first code version with some bugs might hurt your initial results. Significance often comes later than we would like, and all that is if you count on a reliable platform like Split to handle all the implementation problems painlessly. Releasing new features with tests and expecting significant results is very much doing things the hard way.

It is more comfortable to ride the tailwinds of circumstances, regression to the mean, and the salience of a big project. With all those positive effects, you can claim, as you change things, that all the upward change is due to your efforts alone. But it is neither fair nor reliable. Doing it the hard way will prove more interesting, effective, and resilient to changing circumstances. As they say at my gym: “No pain, no gain.”

In the last two hundred years, science, medicine, and technology have seen durable, effective change. If you want the same for your organization, controlled tests are the best way to do it. They allow team members to prove their case, gradually. With tests, your organization can learn reliably and improve step by step.

Extraordinary Growth on Top of Experiments

Of course, results will be slow at first, but learning will be steady. Talk to product managers at Google, Facebook, Amazon, Microsoft, LinkedIn, Booking.com, Twitter, Etsy, Netflix, and Spotify. They will unanimously tell you how initial tests crushed their best ideas, or what they thought were their best ideas. They all have examples of disappointing results, leading them to rethink the way they change features. After that, they tried again and found ways to learn through that demanding grind. Now check the stock prices of all those companies.

Despite the apparent similarities, all these companies have very different management styles, internal cultures and product philosophies. However, all share the conviction that to be rolled out, any change needs to “bring data.” Here “data” means: an A/B tested, significantly positive impact on a critical metric. Such changes often require many iterations.

If your first reaction when reading this last paragraph is to have doubts, you have the right mindset to run A/B tests! Remember the earlier (cynical) comment about business books: don’t trust a list of ten remarkable survivors; don’t assume that whatever they seem to have in common is what must explain their success. It is likely unrelated. Even ten companies worth billions of dollars each are not an argument.

If you remain unconvinced that A/B tests might be the best solution for you, test the idea of A/B testing itself. Randomly split your teams into two groups and demand that:

  • Half of them run A/B tests before rolling out any new feature, publish their results and try to understand the many surprises that they will see
  • The other half just trust their instincts and roll out what they think are good ideas
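The random split matters: if teams volunteer, the keenest ones will pick A/B testing and you will measure their enthusiasm instead. A minimal sketch of such an assignment follows; the team names and the `split_teams` helper are hypothetical.

```python
import random

def split_teams(teams, seed=None):
    """Shuffle the teams and cut the list in two halves, at random."""
    rng = random.Random(seed)
    shuffled = list(teams)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

teams = ["checkout", "search", "onboarding", "billing", "mobile", "growth"]
ab_testing_teams, instinct_teams = split_teams(teams, seed=2024)
print("run A/B tests first:", ab_testing_teams)
print("trust their instincts:", instinct_teams)
```

Fixing the seed makes the assignment reproducible, so nobody can suspect the split was rearranged after the fact.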

Choose When to Start With A/B Testing

It might take a while to see a difference. It might work better for specific projects than others. Typically, new companies like start-ups face urgent, obvious improvements: there are a lot of basic features to implement. Larger companies that came before have already set customers’ expectations, and it’s not worth reinventing the wheel. On top of that, small companies have fewer customers to run tests on. If that sounds like you, waiting for A/B test results might slow you down initially.

The companies in the list above did not all start with A/B tests. Their early roadmap was clear. After a while, internal debates about prioritization became more common. Instead of arguing over assumptions, they started thinking about testing ideas. In a now-famous talk at the Web 2.0 conference in 2006, Marissa Mayer gave several inspiring examples of questions that were unclear to Google at the time. If you have already had those debates internally, A/B tests are the best way to resolve them. So you probably shouldn’t wait.

Try A/B testing. See if some results surprise you, and investigate why. After a few surprises and investigations, let us know if you would prefer to be a company trying to build a roadmap without the insights you gained from the surprises that A/B testing sprung on you.

Learn More About Experimentation

Randomized controlled trials (RCTs), or A/B tests, give you an unbiased control to compare against your variant. This technique lets you measure the impact of your change unambiguously. Tests comparing metrics before and after a change are too easily confounded by other time-based effects: noise, natural improvements, seasonality, audience shift.
