Omitted Variable Bias: A Quick Primer

The next potentially serious issue with the Brennan Center report that I want to consider is one that arises in pretty much every empirical social science paper, namely the always-present threat of omitted variable bias. I actually want to spend a few posts on this issue, so I thought it could be helpful to start with a brief, nontechnical overview of why and when this is a problem for the more non-statistical readers of this blog. That way I can refer back to this in future posts, rather than saying “see the middle of a longer, more substantive post.” And those already familiar with OVB can just skip this one.

Here’s a simple example to demonstrate how—and when, and to what extent—OVB throws off a model’s results. Let’s say we are trying to understand what causes an individual to engage in crime, and we think those with more education are less likely to commit crime. So we include education as an explanatory variable. However, due to a lack of data, we can’t include any information on whether someone is using drugs. Does this omitted variable matter, and to what extent?

It’s easy to show how it matters. I mean, how much clearer could this be?

[Image: screenshot of the formula for the magnitude of omitted variable bias]

I kid. I mean, that really is the formula for the magnitude of OVB (picture stolen from here), but it’s not exactly intuitive.

The concern with OVB is this: people using drugs are less likely to attend school, so they’ll generally have a lower level of education. And they are more likely to commit crime. So drugs are correlated with education, and drugs are correlated with criminal offending.

So when I run a regression of crime on education but omit drugs, what does the result for education that the computer spits back at me capture? Well, it picks up the real effect of education on crime, but it also picks up part of the effect of drugs: those on drugs have less education, so part of the reason that those with less education appear to commit more crimes is actually their generally higher levels of drug use.

In other words, within the pool of those classified as “low education” are high and low drug users, and similarly within the “high education” pool, although a greater fraction of the low education pool uses drugs at a high level. And it is likely that within the pool of lower education people, those with higher drug use offend more. If we had data on drug use, the model could separate these two effects out, but without it, it just returns some sort of average effect of education and drugs.
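To make the two-pools point concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the data are synthetic, the "true" effects (-0.5 for education, 0.7 for drugs) and the education-drug link (-0.3) are invented, and the function name `simulate` is hypothetical. It fits one regression that omits drug use and one that includes it.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Sample covariance (dividing by n, which cancels in the ratios below)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def simulate(n=20000, seed=0):
    rng = random.Random(seed)
    # Hypothetical "true" values, chosen only for illustration.
    true_educ_effect = -0.5   # education lowers offending
    true_drug_effect = 0.7    # drug use raises offending
    educ_drug_link = -0.3     # more education goes with less drug use

    educ = [rng.gauss(0, 1) for _ in range(n)]
    drugs = [educ_drug_link * e + rng.gauss(0, 1) for e in educ]
    crime = [true_educ_effect * e + true_drug_effect * d + rng.gauss(0, 1)
             for e, d in zip(educ, drugs)]

    # Regression that omits drugs: slope = cov(educ, crime) / var(educ).
    short = cov(educ, crime) / cov(educ, educ)

    # Regression that includes drugs: solve the 2x2 normal equations.
    see, sdd, sed = cov(educ, educ), cov(drugs, drugs), cov(educ, drugs)
    sec, sdc = cov(educ, crime), cov(drugs, crime)
    det = see * sdd - sed ** 2
    long_educ = (sec * sdd - sdc * sed) / det
    long_drugs = (sdc * see - sec * sed) / det
    return short, long_educ, long_drugs

short, long_educ, long_drugs = simulate()
print(f"education coefficient, drugs omitted:  {short:.2f}")
print(f"education coefficient, drugs included: {long_educ:.2f}")
print(f"drug coefficient, drugs included:      {long_drugs:.2f}")
```

With drug use in the model, the education coefficient should land close to the invented true value; with drug use left out, it should come back more negative, because it is absorbing part of the drug effect.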

We can actually be much more precise about this. There are three components to thinking about OVB (this really is the equation above now, but still: ignore it). There’s the true effect of education on crime, there is the correlation between education and drugs, and there is the true effect of drugs on crime. The coefficient that the regression returns is basically:

the true effect of education plus (the correlation between education and drugs times the true effect of drugs).
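Written as a formula, with symbols that are introduced here purely for illustration and do not come from the report or the picture above:

```latex
\tilde{\beta}_{\text{educ}} \;=\; \beta_{\text{educ}} \;+\; \delta \,\gamma_{\text{drugs}}
```

Here \tilde{\beta}_{\text{educ}} is what the regression without drugs returns (on average, in a large sample), \beta_{\text{educ}} is the true effect of education, \gamma_{\text{drugs}} is the true effect of drugs, and \delta is the link between education and drugs. Strictly speaking, \delta is the slope you would get from regressing drug use on education rather than the correlation itself; the two always share a sign, and they coincide when both variables are standardized to have the same variance.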

Thus if a 10% increase in education reduces the probability of offending by 5%, if a 10% increase in drug use increases the risk of offending by 7%, and the correlation between drug use and education is –0.3 (since education and drug use are negatively correlated), then the regression will tell you that a 10% increase in education changes offending by –5% + (–0.3 x 7%) = –7.1%, that is, a 7.1% reduction. In other words, it will overstate the effect of education. (For those of you expecting no math, I apologize: this is basically the last of it.)
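For readers who would rather check the arithmetic in code, the same calculation can be sketched in a few lines. The function name `reported_effect` and its parameter names are hypothetical; the numbers are the ones from the paragraph above.

```python
def reported_effect(true_effect, link, omitted_effect):
    """Coefficient the regression returns when one relevant variable is omitted:
    true effect plus (link between included and omitted variable times omitted effect)."""
    return true_effect + link * omitted_effect

# Numbers from the example: effects in percent, link is the -0.3 correlation.
estimate = reported_effect(true_effect=-5.0, link=-0.3, omitted_effect=7.0)
bias = estimate - (-5.0)

print(f"reported: {estimate:.1f}%, bias: {bias:.1f}%")
```

The gap between the reported number and the true –5% is the bias term, which here is entirely the drug effect leaking into the education coefficient.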

This makes sense: increased education is associated with less offending as well as less drug use, and less drug use is associated with less offending. But by omitting drug use from the model, the education term picks up some of both effects, making education look more effective than it should.

So, two big points:

First: we can see when OVB matters. If the omitted variable is uncorrelated with what we are looking at, then it is irrelevant. Perhaps area temperature influences crime rates—it is easier to commit crimes when it is warm and everyone is outside—but maybe (maybe!) climate is uncorrelated with educational outcomes. Then omitting climate has no effect on our estimate of education, since changes in education tell us nothing about changes in weather.

Similarly, if the omitted variable has no independent effect on crime we can ignore it, no matter how correlated it is with education.

Or, put more generally, the smaller the correlation between the included and omitted variable, and the smaller the direct effect of the omitted variable on whatever you are looking at, the less serious the bias is.
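The two "it doesn't matter" cases can be checked with a small simulation. This is only a sketch on synthetic data: the true education effect (-0.5), the scenario values, and the helper name `short_regression_slope` are all assumptions chosen for illustration. Each scenario omits a variable and reports the education coefficient that results.

```python
import random

def short_regression_slope(link, omitted_effect, true_effect=-0.5, n=20000, seed=1):
    """Education coefficient from a regression that leaves the other variable out."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]              # included variable
    z = [link * xi + rng.gauss(0, 1) for xi in x]        # omitted variable
    y = [true_effect * xi + omitted_effect * zi + rng.gauss(0, 1)
         for xi, zi in zip(x, z)]                         # outcome
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

# (link with education, direct effect on crime); all values hypothetical.
scenarios = {
    "correlated and matters (like drugs)": (-0.3, 0.7),
    "matters but uncorrelated (like temperature)": (0.0, 0.7),
    "correlated but no effect on crime": (-0.3, 0.0),
}

results = {name: short_regression_slope(*args) for name, args in scenarios.items()}
for name, slope in results.items():
    print(f"{name}: {slope:.2f}")
```

Only the first scenario should move the estimate noticeably away from the assumed true effect; in the other two, one of the two ingredients of the bias is zero, so their product is too.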

Second: We can (in simple cases) predict the direction of the bias, which can actually be quite useful.

Recall that what the regression reports is true effect + (correlation times omitted effect). So in our education case, the true effect is negative (education reduces crime), the correlation is negative (drug use and education are negatively correlated), and the omitted effect is positive (more drugs leads to more offending). So the “bias factor” will be negative (a negative times a positive), and a negative plus a negative is even more negative: the regression will overstate the true effect.

That’s useful to know. In our example above, then, we know that –7.1% is a ceiling on the size of the effect: the true value is something smaller in magnitude than that (i.e., closer to zero). We don’t know how much smaller, but we know it can’t be larger.

Of course, if the omitted variable were positively correlated with both education and crime (something that causes people to both offend more and achieve more in school, perhaps some sort of aggressive ambition that is hard to detect), then the regression would understate the true effect of education (a negative true effect plus a positive bias would push the number too close to zero). And so on and so on for positive and negative correlations and positive and negative omitted effects.
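The "and so on and so on" can be spelled out mechanically. The sketch below uses hypothetical function names and takes only signs as input (+1, -1, or 0 for the link and the omitted effect; +1 or -1 for the true effect). It covers the single-omitted-variable case only.

```python
def bias_sign(link_sign, omitted_effect_sign):
    """Sign of the bias term: (link) times (omitted effect). Returns -1, 0, or +1."""
    product = link_sign * omitted_effect_sign
    return (product > 0) - (product < 0)

def direction_of_error(true_effect_sign, link_sign, omitted_effect_sign):
    """Say whether the regression overstates or understates the true effect.

    Assumes true_effect_sign is +1 or -1. "understates" means the bias pulls
    the estimate toward zero; a large enough bias could even flip its sign.
    """
    bias = bias_sign(link_sign, omitted_effect_sign)
    if bias == 0:
        return "unbiased"
    if bias == true_effect_sign:
        return "overstates"
    return "understates"

# True effect of education on crime is negative in every row.
cases = {
    "drug use (link -, effect +)": (-1, -1, +1),
    "ambition (link +, effect +)": (-1, +1, +1),
    "link -, effect -": (-1, -1, -1),
    "link +, effect -": (-1, +1, -1),
}
for name, signs in cases.items():
    print(f"{name}: {direction_of_error(*signs)}")
```

The rule is simply that a bias pointing the same way as the true effect exaggerates it, and a bias pointing the other way shrinks it.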

Now, in practice, there is a limit to this. Often multiple variables will be omitted, and the effect of education would capture all of these: the more-negative bias of drug use, the less-negative bias of ambition, etc., etc. And in this case, it would be almost impossible to know how all the various biases net out. But where we think only one or two key variables are missing, then we can at least know if our estimate is a ceiling or a floor.
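A rough sketch of how several omitted variables net out, assuming education is the only included variable so that each omitted variable contributes its own (link times effect) term and the terms simply add. The drug-use numbers come from the worked example earlier; the ambition numbers are invented for illustration, and `net_bias` is a hypothetical helper.

```python
def net_bias(omitted):
    """Sum of (link with education) x (effect on crime) over all omitted variables."""
    return sum(link * effect for link, effect in omitted.values())

true_effect = -5.0  # percent, from the worked example

omitted = {
    "drug use": (-0.3, 7.0),  # from the worked example: pushes the estimate more negative
    "ambition": (0.2, 4.0),   # hypothetical values: pushes the estimate toward zero
}

total_bias = net_bias(omitted)
reported = true_effect + total_bias
print(f"net bias: {total_bias:.1f}%, reported effect: {reported:.1f}%")
```

In this made-up case the two biases partly cancel, but in practice we rarely know the individual terms, which is why the direction of the net bias is so hard to pin down once more than one or two variables are missing.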

So that’s a crash primer on OVB. The next post will start to look at how it plays out in the Brennan report.

Posted by John Pfaff on February 19, 2015 at 09:36 AM

Comments

This video provides an example of how omitted variable bias can arise in econometrics.

Posted by: du lich ha long | Aug 21, 2017 3:14:45 AM

where the “prime” notation means the transpose of a matrix and the -1 superscript is matrix inversion.

Posted by: sapa | Aug 21, 2017 3:12:00 AM

