Inferring The Effect Of An Event Using Causal Inference
What is Causal Inference
In a marketing campaign, how do we measure the improvements to success metrics, such as the number of ad clicks, webpage visits, and signups, that result from the campaign? How can we estimate the impact a new feature launch has on user or system metrics, such as the average time users spend in the product? How can we evaluate a policy change that may affect user engagement?
These questions may sound simple: we could just compare the metrics before and after the event (e.g. a marketing campaign), or compare the metrics of two groups, as in A/B testing. In practice, however, such impact is hard to measure because many other attributes can influence the outcome (e.g. page views). These additional attributes are called ‘noise’.
The theory of causal inference gives us better tools to answer these questions.
The problem of causal inference
Inferring the causal impact that a designed intervention has exerted on an outcome metric over time has long been of interest in econometrics and marketing. The same methodology can be applied to many other problems involving causal inference, including examples from economics, epidemiology, and the political and social sciences.
Common deficiencies in conventional approaches
Estimating the causal impact of an intervention relies on estimating the counterfactual, i.e., what would have happened in the absence of the intervention. This is why the best way of estimating a causal effect is to run a randomized experiment, where the control group provides an estimate of the counterfactual. In practice, however, it is difficult to determine counterfactual values because we cannot establish with certainty what would have happened had the supposed cause not occurred. Let us illustrate the problem, and our approach, with a very simple example.
Assume there is a games festival running in December that promotes a game title ABC. Suppose the game's monthly revenue was around $6k per month before the festival and reached $10k in December.
Question: What is the incremental impact resulting from the games festival campaign? There are two common methods we might use to answer this question.
Method 1: Simple pre- vs post- comparison
If we compare pre-campaign vs post-campaign revenue, can we conclude that the campaign alone led to the roughly $4k increase? Unfortunately no: since more people take vacations in December, the number of people playing games and their average time spent on games are likely higher anyway. Even without the campaign, December revenue might have reached $8k, in which case the campaign would only be responsible for the remaining $2k.
Conclusion: Simple pre- vs post- comparisons could be insufficient as they ignore seasonality and underlying trends.
Method 2: Direct control comparison
Say we pick a control title XYZ, which historically has a revenue level similar to title ABC's but is not part of the campaign promotions. Assume XYZ made $8k revenue in December. Can we conclude that the campaign made a $2k impact?
Again, no, because we don't know whether title XYZ is a suitable control. It may belong to a different game genre and could therefore respond differently to seasonality and other external influences.
Conclusion: Direct comparisons between treated and control are tricky as they may ignore other underlying differences between the two.
How does Causal Inference analysis work
In a Causal Inference analysis, a Bayesian framework is used to estimate the causal effect of a designed intervention on a time series. The inputs are a response time series (e.g., clicks) and, optionally, a set of control time series (e.g., clicks in non-affected markets, clicks on other sites, or Google Trends data). From these, a Bayesian structural time-series model is constructed, with a built-in spike-and-slab prior for automatic variable selection. This model is then used to predict how the response metric would have evolved after the intervention if the intervention had not occurred.
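To make this concrete, here is a minimal sketch of the model structure, assuming the default specification (a local level component plus a static regression on the controls, which is what CausalImpact fits unless additional state components are requested):

y_t = \mu_t + \beta^\top x_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma_y^2)
\mu_{t+1} = \mu_t + \eta_t, \qquad \eta_t \sim N(0, \sigma_\mu^2)

Here y_t is the response, x_t is the vector of control series, and \mu_t is a latent level that follows a random walk. The spike-and-slab prior is placed on the regression coefficients \beta, so that only the most predictive controls receive non-zero weight.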
Terminology
Before we continue with more details, let’s first recap some of the terminology:
Learning period/Pre-intervention period/Training period: Period of time before the campaign during which the model to estimate counterfactuals is built.
Campaign period/Post-intervention period/Testing period: Period of time during which the treated region is treated (typically, has a campaign running) but not the control group(s).
Control time series/Control group: Geographical region (country, DMA, GMA,...) or group of people, etc, that did not receive the treatment and that is used to build the model. Ideally, control regions should be as similar as possible to the treated regions.
Dependent time series/Test group (Treatment group): Geographical region (country, DMA, GMA,...) or group of people, etc, that received the treatment, i.e. that ran the campaign.
Counterfactual/Baseline: Values that would have been observed in the treated region, during the campaign period, had the campaign not been run.
Workflows
Now let's explore how Causal Inference works through an example; here we use the CausalImpact package in R (run from RStudio) for illustration.
Assume we have daily Google Play revenue data ($K) from country A. A marketing campaign promoting Google Play revenue runs from 2019-03-12 through 2019-04-10 (the testing period), and we treat 2019-01-01 through 2019-03-11 as the learning period. Marketers are interested in measuring the incremental boost (if any) the campaign brought to revenue. A minimal setup for this analysis is sketched below.
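The following sketch shows how such an analysis might be set up. Since the actual country A data are not reproduced in this post, the revenue series below is a simulated placeholder, and CausalImpact is run without control series here (relying only on the time-series structure of the response), mirroring the example that follows.

```r
# Minimal sketch: fit CausalImpact to a single daily revenue series (no controls).
library(CausalImpact)
library(zoo)

dates <- seq(as.Date("2019-01-01"), as.Date("2019-04-10"), by = "day")
set.seed(1)
revenue <- 100 + 0.05 * seq_along(dates) + rnorm(length(dates))  # placeholder daily revenue ($K)

data <- zoo(revenue, dates)

pre.period  <- as.Date(c("2019-01-01", "2019-03-11"))  # learning period
post.period <- as.Date(c("2019-03-12", "2019-04-10"))  # campaign (testing) period

# alpha = 0.2 requests 80% credible intervals (the package default is 95%)
impact <- CausalImpact(data, pre.period, post.period, alpha = 0.2)
```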
Step 1 (“original” chart)
With these time-series data, a Causal Impact tool begins by constructing a Bayesian state-space model for the historical data provided (i.e., before the campaign launch). The model infers (or 'learns') the relationship between the metric of interest and the control variables (e.g., the same metric in other geographical regions). For simplicity, we don't introduce any controls in this case, but focus first on the test group itself.
The workflow then uses this model to estimate the counterfactuals (the baseline, blue dotted line), i.e., what would have happened had the marketing campaign not occurred. To do this, it uses the observations of the control variables during the campaign period (when controls are available), and we check whether the predictions match the observed outcome in the pre-period (predictability).
The blue shaded area is the credible interval; we will elaborate on credible intervals later.
Step 2 (“pointwise” chart)
Once we have this baseline, we can calculate the difference between the two lines, the factual values (black solid line) and the counterfactual values (blue dotted line), and treat that difference as the estimated impact of the event. We expect the pointwise difference during the learning period to fluctuate around zero, which indicates a good model fit.
Step 3 (“cumulative” chart)
We can accumulate the pointwise impact values from the moment the event occurred and read off the total impact up to any given point. In this example, the total incremental impact from the campaign is about $300K over the whole campaign period. The three charts, and the cumulative effect, can be produced from the fitted object as sketched below.
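A sketch of producing the panels and inspecting the cumulative effect (the column names assume the standard output format of the CausalImpact package):

```r
# Produce the "original", "pointwise" and "cumulative" panels for the fitted model.
plot(impact)
plot(impact, c("original", "pointwise"))  # optionally, only a subset of panels

# impact$series is a zoo object containing, among other columns, the pointwise
# and cumulative effect estimates together with their credible-interval bounds.
tail(impact$series[, c("cum.effect", "cum.effect.lower", "cum.effect.upper")], 1)
```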
Output
The Causal Inference workflow generates the results charts illustrated above. It also provides two types of reports:
A summary report, which includes the most relevant summary statistics together with credible intervals. This is helpful for trained analysts who understand the metrics and can interpret them well.
A detailed, structured verbal report, which is useful for users without formal training in statistics. Both reports can be retrieved as sketched below.
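A sketch of how the two reports are obtained from the fitted object:

```r
summary(impact)            # summary table: actual vs. predicted values, absolute and relative effects
summary(impact, "report")  # structured verbal report in plain English
```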
From the report, we learn that the campaign brought a total revenue uplift of about $316K over the whole campaign period, which is roughly a 9.9% uplift compared to the forecasted baseline. The 80% credible interval is [$302K, $331K].
Assumptions
The Causal Impact model provides a fall-back option for situations in which a randomized experiment is unavailable. However, it is important to be aware of the model assumptions that come with this fall-back option: CausalImpact is not a replacement for a randomized experiment when these assumptions are not satisfied.
CausalImpact rests on three main assumptions:
Predictability. CausalImpact assumes that it is possible to model the outcome time series of interest as a linear combination of the set of control time series that were entered. For example, in the case of a marketing study, we'd assume that sales in one country can be predicted from sales in other countries. This assumption can be assessed by checking how well the predicted counterfactual (dotted line) matches the observed outcome time series in the pre-period, i.e., before the intervention started (an informal check is sketched after the three assumptions).
Unaffectedness. CausalImpact assumes that we have access to a set of control time series that were themselves not affected by the intervention. If they were, we might falsely under- or overestimate the true effect. Or we might falsely conclude that there was an effect even though in reality there wasn't. For example, in a marketing study, we'd assume that advertising in one country had no spill-over effect on our set of control countries.
Stability. CausalImpact assumes that the relationship between covariates and treated time series, as established during the pre-period, would have remained stable throughout the post-period if the intervention had not taken place.
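One informal way to check the predictability assumption, reusing the impact object fitted above, is to compare the model's baseline with the observed series over the pre-period. This is only a sketch; what counts as an acceptable fit is a judgment call, not a package feature.

```r
# Informal predictability check: how closely does the fitted baseline track the
# observed response before the intervention?
pre <- window(impact$series, end = as.Date("2019-03-11"))
cor(as.numeric(pre[, "response"]), as.numeric(pre[, "point.pred"]))         # should be reasonably high
mean(abs(as.numeric(pre[, "response"]) - as.numeric(pre[, "point.pred"])))  # average pre-period error
```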
How to interpret the outcomes from Causal Impact Analysis
The Causal Impact workflow returns estimates of the counterfactuals over the campaign period. From these, we can easily calculate estimates of the uplift over the campaign period (the difference between the actual and counterfactual values). Importantly, the model takes measurement and modeling uncertainty into account by providing a credible interval around the estimates (typically 80%-95%, depending on the campaign). With the credible interval, the model estimates the upper and lower bounds of the counterfactuals and, accordingly, the lower and upper bounds of the campaign uplift.
Assume we are using an 80% credible interval.
If the actuals are greater than the upper bound (UB) of the credible interval of the baseline (i.e. the cumulative LB of the difference is greater than zero), then we are at least 80% certain that the true impact is positive, allowing us to conclude that the campaign had a significant positive impact on the metric under study.
If the lower bound of the baseline is below the actuals and the upper bound above the actuals (i.e. zero lies between the cumulative LB and UB of the difference), then we cannot conclude what impact the campaign may have had. (Note that we cannot conclude that the campaign had no effect, either.) We usually state this as "there is no statistically significant evidence that the campaign had a positive impact on metric xxx". Further investigation is needed, and we can try a few adjustments to the model.
Occasionally, we may find that the lower bound (LB) of the baseline is greater than the actuals (i.e. the cumulative UB of the difference is less than zero). Technically, this means we are at least 80% certain that the true impact was negative. This is often difficult to interpret and may require additional scrutiny to ensure the model parameters are properly configured.
When interpreting results, it is just as important (if not more so) to look at the credible interval as at the estimates themselves. Both can be read directly from the summary table, as sketched below.
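A sketch of reading the interval bounds directly from the summary table (row and column names assume the standard CausalImpact output format):

```r
# Cumulative absolute and relative effects with their credible-interval bounds;
# the effect is "significant" in the sense above when the interval excludes zero.
impact$summary["Cumulative", c("AbsEffect", "AbsEffect.lower", "AbsEffect.upper")]
impact$summary["Cumulative", c("RelEffect", "RelEffect.lower", "RelEffect.upper")]
```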
Practical tips for good results
Here are some guidelines for obtaining reliable results.
The learning period should be reasonably long (the more data the model has to train on, the better). Good results are often obtained with training periods of between 1 and 12 months.
The campaign period should not be longer than the training period (it is very difficult to forecast over a period longer than the one used to train the model). A typical example is a campaign period of 3 months with a training period of 6 months.
When including or excluding regions as control groups, it is best to be conservative and remove a potentially problematic control rather than leave a bad one in. For example, Canada and the UK are often excluded as controls when analysing a campaign in the US, since the campaign may have spilled over to these other English-speaking countries.
Some general guidelines for picking controls:
The control groups were unaffected by the campaign itself (little or no spillover).
The control groups do not modify behavior between the learning and campaign periods.
The test and control groups are similarly affected by external events.
Once an analysis with control groups is done, it helps to look at the posterior inclusion probability of each control. Simply speaking, the inclusion probability is the probability that a control series is selected into the final model (via the spike-and-slab prior); if it is close to 1, the control behaves very similarly to the test group during the training period, so it is reasonable to use it for the baseline forecast in the testing period. This similarity can also be visualized directly in the time series, and the probabilities themselves can be inspected as sketched below.
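When the model has been fitted with one or more control series (unlike the single-series example above), one way to inspect the posterior inclusion probabilities is via the underlying bsts model stored in the CausalImpact output, for example:

```r
# The spike-and-slab regression lives in the bsts model inside the CausalImpact
# output; its coefficient plot shows the posterior inclusion probabilities.
library(bsts)
plot(impact$model$bsts.model, "coefficients")
```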
The Causal Impact tool is not perfect, and there is always a risk that it will fail to detect an existing uplift or report one that does not exist. Two red flags are particularly wide credible intervals and zero lying right at the border of the credible interval. If either situation arises, we recommend increasing the learning period (if feasible) and increasing the number of controls.
Please do not re-run the same analysis multiple times with slightly modified inputs in order to obtain the greatest uplift:
The model is robust and slightly changing parameters will barely modify results.