Why Causal Inference Works

Demystifying the causal paradigm


Why does causal inference work in practice?

We can find out using the Bias-Variance tradeoff, the Central Limit Theorem, and a nifty inequality from Cauchy and Schwarz.

Causality Definition

Let’s start with the definition of causality. If a change in variable X causes a corresponding, measurable change in Y, then we can say that X causally determines Y to some degree. This definition is the most common one I have seen and can be attributed to Judea Pearl. While many variables can take the mantle of X for any given Y, we estimate the causal link one variable at a time.

Holding everything else constant

Mathematically, we can say that X and Y correlate, given that everything else is held constant. Correlation does not necessarily equal causation, but there is no causation without correlation. For example, in dietary work, foods that cause a decrease in energy or an increase in inflammation can all be consumed simultaneously in one meal, but we want to identify them so they can be removed from the diet. To find the true causal culprits, we eliminate all candidates and then gradually reintroduce them, one at a time, measuring any adverse effects so we can pinpoint which one (or several) is responsible, as sketched below. Everything else stays constant while we change one thing, and this leads us to causality.
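Here is a minimal sketch of that elimination protocol in Python. The function names, the symptom scores, and the threshold are all made up for illustration; `measure_symptoms` is a stand-in for whatever outcome measurement you actually use.

```python
# Hypothetical sketch: remove all candidate foods, then reintroduce them
# one at a time while holding everything else constant.
def find_culprits(candidate_foods, measure_symptoms, threshold=1.0):
    baseline = measure_symptoms(diet=[])  # all candidates eliminated
    culprits = []
    for food in candidate_foods:
        effect = measure_symptoms(diet=[food])  # reintroduce ONE food
        if effect - baseline > threshold:  # everything else held constant
            culprits.append(food)
    return culprits

# Toy usage: pretend only "gluten" drives symptoms.
scores = {(): 0.0, ("gluten",): 3.0, ("dairy",): 0.2, ("soy",): 0.1}
measure = lambda diet: scores[tuple(diet)]
print(find_culprits(["gluten", "dairy", "soy"], measure))  # ['gluten']
```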

Bias-Variance Tradeoff

So how can we use the Bias-Variance tradeoff to our advantage here? Ultimately, we are looking for an accurate estimate of the effect of one variable on another: the causal effect. This is the same goal as in all of machine learning: we want to minimize the Mean Squared Error (MSE) of an estimator, which decomposes into bias, variance, and irreducible error. Y = f(X) is an estimator we can determine with statistical tools, but we have to rig the game in our favor or else we won’t have accurate CAUSAL estimates. Put another way, our MSE will be too large AND won’t measure anything truly causal.
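For reference, here is the standard decomposition being invoked, written for an estimator \(\hat{f}(X)\) of the outcome \(Y = f(X) + \epsilon\), where \(\epsilon\) has variance \(\sigma^2_\epsilon\):

\[E[(Y - \hat{f}(X))^2] = \underbrace{(E[\hat{f}(X)] - f(X))^2}_{\text{Bias}^2} + \underbrace{E[(\hat{f}(X) - E[\hat{f}(X)])^2]}_{\text{Variance}} + \underbrace{\sigma^2_\epsilon}_{\text{irreducible}}\]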

The way we rig the game is by arranging our system to be as close as possible to the very definition of causality, with an eye toward minimizing the MSE. Typically we have two groups, one in which the “treatment” has been applied and one where it has not. The mean difference in outcomes between those two groups gives us an accurate estimate of the causal effect only if we hold all other variables constant between the two groups. If group X has had the treatment applied and group Y has not, then we estimate the Average Treatment Effect (ATE) as \(E[X - Y]\). Since the sample means \(\bar{X}\) and \(\bar{Y}\) are unbiased estimators of the true population means \(E[X]\) and \(E[Y]\), their difference is also unbiased, assuming we don’t weight one variable differently than the other. Great, we minimized bias, but what about variance?
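Before tackling variance, here is a quick sanity check of that unbiasedness claim in Python. The effect size, group means, and noise levels are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up setup: two independent groups from the same population, with a
# true treatment effect of 2.0 added only to the treated group.
true_effect, n = 2.0, 10_000
treated = rng.normal(loc=10.0 + true_effect, scale=3.0, size=n)  # group X
control = rng.normal(loc=10.0, scale=3.0, size=n)                # group Y

# The difference in sample means is an unbiased estimator of E[X - Y], the ATE.
ate_hat = treated.mean() - control.mean()
print(f"estimated ATE: {ate_hat:.3f} (true effect: {true_effect})")
```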

Uncertainty Propagation

This is where we need to get clever, and to do that we are going to borrow from some fundamental laws of uncertainty propagation. Since estimating the causal effect amounts to estimating a new variable, we need a variance measure for that new variable. Can we pick a design that allows us to reduce uncertainty? After all, the bias has already been minimized, so if we can also minimize variance, then our total MSE is minimized. It turns out we can reduce variance if the two groups are NOT independent. If they are dependent, they can have non-zero covariance, and we can put that covariance to work. Using the rules of uncertainty propagation (how the uncertainties of multiple variables combine), we obtain the following for our ATE estimate f:

\[f = X - Y \implies \sigma^2_f = \sigma^2_X + \sigma^2_Y - 2\sigma_{XY}\]

The last term here is our covariance term, which corresponds to our data being dependent: if the data are dependent, then \(P(X|Y) \ne P(X)\). This case shows up clearly in paired difference tests, where the difference X - Y is averaged across the treatment and no-treatment measurements of the same unit, hence the dependence. If we maximize the covariance between the two groups, which is approached when the “treated” units are the exact same units, then we minimize the variance of our treatment effect. Amazing, right?
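The following simulation contrasts the two designs: paired measurements on the same units (large covariance) versus two independent groups (zero covariance). The noise levels are assumptions chosen for illustration. Both estimators are unbiased, but the paired one has far lower variance, exactly as the propagation formula predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect, n, n_sims = 2.0, 100, 5_000
paired_ates, independent_ates = [], []

for _ in range(n_sims):
    # Paired design: the SAME units are measured with and without treatment,
    # so X and Y share each unit's baseline and covary strongly.
    baseline = rng.normal(10.0, 3.0, size=n)
    y = baseline + rng.normal(0.0, 1.0, size=n)
    x = baseline + true_effect + rng.normal(0.0, 1.0, size=n)
    paired_ates.append((x - y).mean())

    # Independent design: separate units in each group, zero covariance.
    y_ind = rng.normal(10.0, 3.0, size=n) + rng.normal(0.0, 1.0, size=n)
    x_ind = rng.normal(10.0, 3.0, size=n) + true_effect + rng.normal(0.0, 1.0, size=n)
    independent_ates.append((x_ind - y_ind).mean())

print(f"variance of paired ATE estimates:      {np.var(paired_ates):.4f}")
print(f"variance of independent ATE estimates: {np.var(independent_ates):.4f}")
```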

Maximizing the covariance is often achieved through blocking and matching, but there is another way when we don’t want to rely on observational data alone.

Randomized Trials and the CLT

Randomized Controlled Trials (RCTs) are the gold standard for causal estimation because they achieve directly what blocking and matching only approximate. We do not have to worry about unobserved confounders because the randomization process probabilistically breaks any effect of these variables. How can we represent this mathematically? This is where the Central Limit Theorem (CLT) helps us understand why this is the case.

The CLT tells us that, under certain assumptions, the sampling distribution of the sample mean is centered on the true population mean, and its normalized version approaches the standard normal distribution as the sample size grows. This is not only the backbone of modern statistics but also has practical implications for estimating these parameters in a time- and money-constrained way, because statistics developed in an age when Big Data did not exist. Putting aside the convergence of these statistics, which can be a real issue in its own right, we can use the CLT to show why RCTs are superior to other methods.
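A quick empirical check of this, using an assumed exponential population purely for illustration: the sample means cluster around the true population mean, and their skewness shrinks toward that of a normal distribution even though the population itself is heavily skewed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavily skewed population (exponential), made up for illustration.
population_mean, n_samples, n = 1.0, 20_000, 200
sample_means = rng.exponential(scale=population_mean, size=(n_samples, n)).mean(axis=1)

print(f"mean of sampling distribution: {sample_means.mean():.4f} "
      f"(population mean: {population_mean})")
z = (sample_means - sample_means.mean()) / sample_means.std()
print(f"skewness of sample means: {(z**3).mean():.3f} (0 for a perfect normal)")
```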

The CLT tells us that a sufficiently large sample from the underlying population lets us estimate the sampling distribution of the parameter, and we can extend this to two samples from the same population. Since both samples estimate the SAME sampling distribution, their covariance is maximized: geometrically they share the same Gaussian shape, but more formally,

\[cov(X,Y) \le \sqrt{var(X)\,var(Y)} \le max(var(X), var(Y))\]

where the first inequality is the Cauchy-Schwarz Inequality and the second holds because a geometric mean never exceeds the larger of its terms. In our case, the two groups estimate the same variable, so the covariance is effectively the covariance of X with itself; that is the variance of X, which sits exactly at the bound we just established. This means the uncertainty around our new function \(f = X-Y\) is minimized, using covariance as a point of leverage. The only thing that differs between the two groups is the treatment, so if the treatment has a causal effect, it will show up in our estimate of f. Beyond increasing the covariance between X and Y, increasing the sample size of each group minimizes the individual variances, which further benefits our overall variance estimate. In the end, we can see how to minimize the error of an estimator subject to the Bias-Variance tradeoff, and that is precisely what the techniques of causal inference do.
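To fill in the algebra: when both groups share the same variance \(\sigma^2\) and the covariance is driven toward that bound, the propagated variance of the estimate collapses:

\[\sigma^2_f = \sigma^2 + \sigma^2 - 2\sigma_{XY} \xrightarrow{\ \sigma_{XY} \to \sigma^2\ } 0\]

In practice the covariance never quite reaches the bound, but the closer the two groups are to exchangeable, the smaller the variance of the ATE estimate.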

I hope this was a beneficial explanation of why a technique that many of us use, whether in academia or in industry, is something we can rely on mathematically. Given these guarantees and how they combine with each other, we can incorporate causal inference into our workflows to drive better reasoning.

The Truth will out!

Homework

Here is a homework question: how does what you’ve learned today bear on Simpson’s Paradox? Specifically, how does the causal framework resolve it?

#statistics #causality

SDG
