Evaluating the quality of CausalImpact predictions

CausalImpact is one of the most popular packages used in SEO experimentation. Its popularity is understandable.

SEO experimentation provides exciting insights and ways for SEOs to report on the value of their work.

Yet, the accuracy of any machine learning model is dependent on the input information it is given.

Simply put, poor input data leads to unreliable predictions.

In this post, we will show how reliable (and unreliable) CausalImpact can be. You’ll also learn how to build confidence in your experimental results.

We’ll start with a brief overview of how CausalImpact works. Then, we’ll examine how to evaluate its reliability. Finally, we’ll explore a reproducible methodology to test and improve the accuracy of your own SEO experiments.

What is CausalImpact and how does it work?

CausalImpact is a package that uses Bayesian statistics to estimate the effect of an event in the absence of an experiment. This estimation is called causal inference.

Causal inference estimates if an observed change was caused by a specific event.

It is often used to evaluate the performance of SEO experiments.

For instance, when given the date of an event, CausalImpact (CI) will use the data points before the intervention to predict the data points after the intervention. It will then compare the prediction to the observed data and estimate the difference with a certain confidence threshold.

Furthermore, control groups can be used to make the predictions more accurate.
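To make the pre/post logic concrete, here is a minimal sketch in plain Python. It is not CausalImpact's Bayesian model. It substitutes a much simpler estimator (a constant test-to-control ratio learned on the pre-period) purely to illustrate the idea of predicting the post-period and comparing the prediction to the observed data. The function name `naive_relative_effect` and the sample numbers are hypothetical.

```python
def naive_relative_effect(test, control, intervention_index):
    """Toy counterfactual: NOT CausalImpact's model, only the same idea.

    Learns the average test/control ratio before the intervention, uses it
    to predict the test series after the intervention, and returns the
    relative difference between the observed and predicted totals.
    """
    pre_test = test[:intervention_index]
    pre_control = control[:intervention_index]
    ratio = sum(pre_test) / sum(pre_control)

    # Predicted values: what the test group would have done without the event.
    predicted = [ratio * c for c in control[intervention_index:]]
    observed = test[intervention_index:]

    return (sum(observed) - sum(predicted)) / sum(predicted)


# Hypothetical daily clicks: before the intervention (index 4) the test group
# gets twice the control's traffic; afterwards it gets 10% more than that.
control = [100, 110, 90, 100, 105, 95, 100]
test = [200, 220, 180, 200, 231, 209, 220]

effect = naive_relative_effect(test, control, intervention_index=4)
print(f"Estimated relative effect: {effect:.1%}")
```

A real analysis would replace the constant ratio with the model fitted by CausalImpact, which also reports uncertainty around the estimate.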

Different parameters will also have an impact on the accuracy of the prediction:

All these parameters help to provide more context to the model and enhance its reliability.

Why is evaluating the accuracy of SEO experiments important?

Over the past few years, I have analyzed many SEO experiments, and something struck me.

Many times, using different control groups and timeframes on identical test sets and intervention dates yielded different results.

For example, below are two results from the exact same event.

The first analysis returned a statistically significant decline.
SEO experiment 1

The second was not statistically significant.
SEO experiment 2

Simply put, this example shows that for the same event, different results were returned based on the chosen parameters.

One has to wonder: Which prediction is actually accurate? Isn’t “statistically significant” supposed to increase confidence in our conclusions?

Definitions

To better understand the world of SEO experiments, the reader should be aware of the basic concepts of SEO experiments:

The example below will help illustrate these concepts:

Modifying the title (experiment) should increase the organic CTR of the product pages in five cities (test group) by 1% (hypothesis). The estimations will be improved by keeping the title unchanged in all the other cities (control group).

Pillars of accurate SEO experiment prediction

Not all test groups will return accurate estimations

Some test groups consistently yield inaccurate predictions. These should generally be avoided in experiments.

For instance, test groups with large abnormal traffic variations often return unreliable results.

Imagine a site that had:

Running an experiment on such unstable data is likely to produce unreliable or misleading results.

These insights are based on a series of extensive experiments, using the methodology described in the next section.

When not using control groups

When using control groups

Beware of the length of data prior to experiments

Interestingly, when experimenting with control groups, using 16 months of prior data can cause a very high error rate.

In fact, errors can be as large as estimating a 3x increase in traffic when there were no actual changes.

However, using 3 years of data removed that error. This contrasts with simple pre-post experiments, where the error rate did not increase when the length was extended from 16 to 36 months.

That doesn’t mean that using controls is bad. It’s quite the contrary.

It simply shows how adding a control affects the predictions, particularly when there are big variations in the control group.

This takeaway is especially important for websites that have had abnormal traffic variations in the past year (critical technical error, COVID pandemic, etc.).

How to evaluate the CausalImpact prediction

Now, there is no accuracy score built into the CausalImpact library, so accuracy has to be inferred by other means.

One can look at how other machine learning models estimate the accuracy of their predictions and realize that the Sum of Squares Errors (SSE) is a very common metric.

The sum of squares errors, or residual sum of squares, is the sum over all n data points of the squared differences between the actual results (y_i) and the predicted values (f(x_i)):

$$SSE = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2$$

The lower the SSE, the better the result.
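As a small illustration, the metric can be computed in a few lines of Python. This is a generic sketch; the function name `sum_of_squared_errors` and the sample values are hypothetical and not taken from any library.

```python
def sum_of_squared_errors(actual, predicted):
    """Sum of squared differences between paired actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((y - f) ** 2 for y, f in zip(actual, predicted))


# Hypothetical values, e.g. percentage changes.
actual = [5.0, -10.0, 20.0]
predicted = [4.0, -7.0, 20.0]

sse = sum_of_squared_errors(actual, predicted)
print(sse)  # (1)^2 + (-3)^2 + (0)^2 = 10.0
```

Because each difference is squared, a single large miss weighs far more than several small ones.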

The challenge is that with pre-post experiments on SEO traffic, there are no known actual results to compare the predictions against.

Although no changes were made on-site, some changes may have happened outside of your control (e.g., Google Algorithm update, new competitor, etc.). SEO traffic doesn’t vary by a fixed number either but varies progressively up and down.

SEO specialists may wonder how to overcome the challenge.

Introducing fake variations

To be certain of the size of the variation caused by an event, the experimenter can introduce fixed variations at different points in time, and see if CausalImpact successfully estimated the change.

Even better, the SEO expert can repeat the process for different test and control groups.

Using Python, fixed variations were introduced to the data at different intervention dates for the post-period.

The sum of squares errors was then estimated between the variation reported by CausalImpact and the introduced variation.
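The code used for this step is not shown here, so the following is only a hypothetical sketch of what introducing a fixed variation could look like. It assumes the traffic is a plain list of numbers and that the intervention date has already been converted to a list index; the name `inject_variation` is an assumption.

```python
def inject_variation(series, intervention_index, pct_change):
    """Return a copy of `series` with post-intervention values scaled.

    Values before `intervention_index` are left untouched; values from that
    index onward are multiplied by (1 + pct_change / 100).
    """
    factor = 1 + pct_change / 100
    return [
        value * factor if i >= intervention_index else value
        for i, value in enumerate(series)
    ]


# Hypothetical daily clicks with a fake 5% increase from index 3 onward.
clicks = [100, 120, 110, 130, 125, 115]
modified = inject_variation(clicks, intervention_index=3, pct_change=5)
print(modified)
```

Since the size of the injected change is known exactly, it can serve as the reference value when scoring the estimate.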

The idea goes like this:

  1. Choose test and control data.
  2. Introduce fake interventions in the real data at different dates (e.g., a 5% increase).
  3. Compare the CausalImpact estimations to each of the introduced variations.
  4. Compute the Sum of Squares Errors (SSE).
  5. Repeat from step 1 with multiple controls.
  6. Choose the control with the smallest SSE for real-world experiments.
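The steps above can be sketched end to end as follows. This is a hypothetical outline, not the code behind this article: `ratio_estimator` is a deliberately simple stand-in for the place where CausalImpact would be called, and all names and numbers are assumptions made for illustration.

```python
def inject(series, start, pct):
    """Scale every value from index `start` onward by pct percent."""
    factor = 1 + pct / 100
    return [v * factor if i >= start else v for i, v in enumerate(series)]


def ratio_estimator(test, control, start):
    """Stand-in for CausalImpact: estimated relative effect, in percent."""
    ratio = sum(test[:start]) / sum(control[:start])
    predicted = sum(ratio * c for c in control[start:])
    return 100 * (sum(test[start:]) - predicted) / predicted


def control_sse(test, control, dates, pct, estimator):
    """SSE between estimated and injected variation across fake dates."""
    total = 0.0
    for start in dates:
        fake_test = inject(test, start, pct)             # step 2
        estimated = estimator(fake_test, control, start)  # step 3
        total += (estimated - pct) ** 2                   # step 4
    return total


def best_control(test, controls, dates, pct, estimator=ratio_estimator):
    """Steps 5 and 6: score every control and keep the smallest SSE."""
    scores = {
        name: control_sse(test, series, dates, pct, estimator)
        for name, series in controls.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical data: one control tracks the test group, the other does not.
test = [200, 220, 180, 200, 210, 190, 200, 220]
controls = {
    "stable": [100, 110, 90, 100, 105, 95, 100, 110],
    "noisy": [100, 60, 150, 90, 200, 40, 120, 80],
}

best, scores = best_control(test, controls, dates=[3, 5], pct=5)
print(best, scores)
```

Passing the estimator as an argument keeps the scoring loop independent of the model, so the stand-in could be swapped for a function that runs CausalImpact and returns its reported relative effect.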

The methodology

With the methodology below, I created a table that I could use to identify which control had the best and worst error rates at different points in time.

First, choose test and control data and introduce variations from -50% to 50%.

Then, run CausalImpact (CI) and subtract the variations reported by CI from the variations you actually introduced.

After that, square these differences and sum all the values together.

Methodology: compute the squares of the differences

Next, repeat the same process at different dates to reduce the risk of a bias caused by a real variation at a specific date.

Methodology: compute the squares of the differences_2

Again, repeat with multiple control groups.

Methodology: compute the squares of the differences_3

Finally, the control with the smallest sum of squares errors is the best control group to use for your test data.

If you repeat these steps for each of your test datasets, the results will vary.

On the resulting table, each row represents a control group, each column represents a test group. The data inside is the SSE.

Methodology: SSE test groups and control groups

After sorting that table, I can confidently select the best control group for each test group.
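Selecting the best control per test group from such a table can be sketched as below. The table is represented as a nested dictionary (rows are control groups, columns are test groups), and every SSE value shown is made up for illustration only.

```python
# Hypothetical SSE table: rows are control groups, columns are test groups.
# These numbers are invented for the example, not measured results.
sse_table = {
    "control_a": {"test_1": 12.4, "test_2": 3.1},
    "control_b": {"test_1": 2.7, "test_2": 8.9},
    "control_c": {"test_1": 30.2, "test_2": 4.0},
}


def best_control_per_test(table):
    """For each test group (column), return the control with the lowest SSE."""
    test_groups = next(iter(table.values())).keys()
    return {
        test: min(table, key=lambda control: table[control][test])
        for test in test_groups
    }


best = best_control_per_test(sse_table)
print(best)
```

With these invented numbers, `control_b` would be picked for `test_1` and `control_a` for `test_2`, which illustrates why a single control is rarely the best choice for every test group.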

Should we use control groups or not?

Evidence shows that using control groups produces better estimations than a simple pre-post analysis.

However, this is true only if we choose the right control group.

How long should the period of estimation be?

The answer to that depends on the controls that we are selecting.

When not using a control, 16 months of data prior to the experiment seems to be enough.

When using a control, using only 16 months may lead to massive error rates. Using 3 years helps reduce the risk of misinterpretation.

Should we use one control or multiple controls?

The answer to that question depends on the test data.

Very stable test data can perform well when compared against multiple controls. In this case, using many controls is beneficial because it makes the model less sensitive to unexpected fluctuations in any one of the controls.

On other datasets, using multiple controls can make the model 10-20 times less precise than using a single one.

Interesting work in the SEO community

CausalImpact isn’t the only library that can be used for SEO testing, nor is the above methodology the only solution for evaluating prediction accuracy.

For alternative approaches and tools, check out recent contributions from the SEO and data science communities:

While Prophet is better suited for forecasting than experimentation, understanding multiple libraries strengthens your ability to test, forecast, and explain SEO performance.

Things to consider when doing SEO experiments

Reproducible results

In this tutorial, I wanted to focus on how one could improve the accuracy of SEO experiments without the burden of knowing how to code. Moreover, the source for the data can vary, and each site is different.

Hence, the Python code that I used to produce this content was not part of the scope of this article.

However, with the same logic, you can reproduce the above experiments.

Conclusion

If you take only one thing away from this article, it should be that CausalImpact analysis can be very accurate, but can also be way off.

It is very important for SEOs wishing to use this package to understand what they are dealing with. The result of my own journey is that I wouldn’t trust CausalImpact without testing out the accuracy of the model on the input data first.