# Binary Dependent Variable? … Just use OLS

Here is an excerpt from a recent paper by John Gibson, published in *Applied Economic Perspectives and Policy*, entitled “Are You Estimating the Right Thing? An Editor Reflects”:

An old issue, first discussed by Ravallion (1996) but still showing up in applied work, is researchers estimating latent variable, binary outcome models, such as probits, when they have continuous and fully observed data. For example, they may have data on household consumption but they estimate a probit model of the probability that households are below the poverty line. Specifically, for per capita consumption c, poverty line z, and covariate vector x, they model prob [c < z | x]. Such modeling involves ignoring information on the continuous distribution of consumption, pretending one knows only a binary variable for whether households are poor or not. The motivation for this may be a desire to see how covariates change the risk of a particular outcome (e.g., the risk of being poor), once marginal effects are calculated from the probits.

This procedure is wrong for at least three reasons. First, latent variable models make distributional assumptions about the unknown errors that are not needed for models like OLS (and Instrumental Variables) that can be estimated on the continuous data. The consequences of misspecification are worse with binary outcome models; for example, heteroscedasticity makes maximum likelihood estimates of probit coefficients inconsistent, while OLS stays consistent, with heteroscedasticity only affecting efficiency of OLS.

Second, using latent variable models when one wants to interpret results in terms of how covariates change the probability of an outcome is redundant, as these probabilities can be derived from OLS (and IV) results. […] No distributional assumptions need be made for the regression estimators that provide the values used to produce these probability estimates, unlike for probit or logit models. These estimating equations also are robust to misspecification that can make binary variable models inconsistent.

Third, probits are biased by random errors in dependent variables, unlike linear regression (e.g., OLS), where dependent variable errors only cause imprecision, not bias (Hausman 2001). Random errors cause misclassification in binary variable models, where a dependent variable has a value of one (e.g., a household is defined as poor) when it should have the value zero, or vice versa, and such errors can lead to large biases. For example, Hausman et al. (1998) show that with just a 5% risk of misclassification, the probit coefficient on a right-hand side (RHS) dummy variable (e.g., for a female-headed household) falls to 70% of its true value. It is even worse for a log-normally distributed RHS covariate (e.g., land holdings); with just a 2% (5%) risk of misclassification, the coefficient on this variable is biased down by more than 20% (45%).

The redundant and probably biased approach of using latent variable, binary outcome models when a researcher has continuous data should be distinguished from legitimate debate about whether OLS (strictly, the Linear Probability Model) or non-linear models like probits and logits are better when one has actual binary variables (e.g., whether a plant closed down or not). [See this World Bank blog post from 2012 for a more general discussion.] The school of thought associated with Mostly Harmless Econometrics (Angrist and Pischke 2009) argues for “just use OLS” because if the conditional expectation function is linear then a linear model like OLS (or IV) will provide this, and if it is non-linear then a linear model usually approximates it, especially if focus is on the marginal effects rather than on the coefficients for the latent index variable. Also, there are advantages with linear models for inference and in not having to choose if marginal effects are measured as finite differences or as derivatives. Abandoning the practice of using latent variable, binary outcome models when one has continuous data would be an easy way to improve the credibility of results that are framed in terms of the probability of certain outcomes, such as being poor.
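Gibson's second point — that the probability of an outcome like being poor can be recovered from a regression on the continuous variable — is easy to sketch. The toy example below is mine, not from the paper: the poverty-risk estimate at a given covariate value comes from the empirical distribution of the OLS residuals, so no normality (or other distributional) assumption is needed. All variable names and numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Fake continuous data: log consumption linear in one covariate.
x = rng.normal(size=n)
log_c = 1.0 + 0.5 * x + rng.normal(scale=0.8, size=n)
z = 1.0  # a hypothetical poverty line, in logs

# OLS on the continuous outcome -- no distributional assumption needed.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, log_c, rcond=None)
resid = log_c - X @ beta

def prob_poor(x0):
    """P(log_c < z | x = x0), from the empirical CDF of the OLS residuals."""
    fitted = beta[0] + beta[1] * x0
    return float(np.mean(resid < z - fitted))

print(round(prob_poor(0.0), 2))  # poverty risk at the mean of x
print(round(prob_poor(1.0), 2))  # lower risk at a higher covariate value
```

The same trick works with IV in place of OLS; the point is simply that the probability statement is a by-product of the model for the continuous variable, not something that requires a probit.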
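The misclassification result in the third point is also easy to see in a small Monte Carlo. The sketch below is mine, not Hausman et al.'s: it fits a probit by Fisher scoring (hand-rolled so only numpy is needed) on a simulated latent-variable outcome, then refits after randomly flipping 5% of the binary outcomes. The attenuation toward zero is clearly visible, though its exact magnitude depends on the design, so the paper's specific 70% figure should not be expected here.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Latent-variable data: y* = x + e, y = 1[y* > 0]; the true probit slope is 1.
x = rng.normal(size=n)
y = (x + rng.normal(size=n) > 0).astype(float)

# Contaminate: flip 5% of the observed binary outcomes at random.
flip = rng.random(n) < 0.05
y_mis = np.where(flip, 1.0 - y, y)

_erf = np.vectorize(math.erf)

def Phi(t):  # standard normal CDF
    return 0.5 * (1.0 + _erf(t / math.sqrt(2.0)))

def phi(t):  # standard normal density
    return np.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def probit_slope(outcome, iters=30):
    """Probit MLE by Fisher scoring; returns the slope coefficient on x."""
    X = np.column_stack([np.ones(n), x])
    b = np.zeros(2)
    for _ in range(iters):
        eta = X @ b
        p = np.clip(Phi(eta), 1e-9, 1.0 - 1e-9)
        d = phi(eta)
        score = X.T @ (d * (outcome - p) / (p * (1.0 - p)))
        info = X.T @ (X * (d * d / (p * (1.0 - p)))[:, None])
        b = b + np.linalg.solve(info, score)
    return b[1]

clean = probit_slope(y)      # close to the true slope of 1
noisy = probit_slope(y_mis)  # attenuated toward zero by misclassification
print(round(clean, 2), round(noisy, 2))
```

A linear regression of the continuous latent outcome on x would, by contrast, stay unbiased under classical measurement error in the dependent variable, which is exactly Gibson's contrast.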

The rest of the paper is worth a read. It covers four other common mistakes: “Confusing Quantity Response to Price with the Quality Response to Price,” “Zeros, Tobits, and Unconditional Expected Values,” “Ignoring Spillovers when Studying Inherently Spatial Phenomena,” and “Ignoring Complex Sample Designs.”

HT to David McKenzie in today’s weekly links on the Development Impact blog for flagging this paper.