The Trouble with Logit and Probit: Teaching and Presenting Nonlinear Probability Models

Convenor: Dr Henning Best (GESIS - Leibniz Institute for the Social Sciences)
Coordinator 1: Dr Klaus Pforr (GESIS - Leibniz Institute for the Social Sciences)
While researchers in the social sciences have used Logit and Probit routinely since the 1990s, some of the difficulties in using various types of nonlinear probability models have received increased attention only in recent years. At least three important methodological problems have been raised in the discussion:
- The general interpretation of the coefficients is not as straightforward as in OLS
- Coefficients cannot easily be compared between subgroups
- Coefficients cannot easily be compared between nested models
Some of the difficulties stem from what has come to be known as "neglected heterogeneity". There are interesting suggestions on how to cope with neglected heterogeneity mathematically, and on how to interpret the coefficients in a meaningful way. Yet, these suggestions still have to trickle down to the teaching of quantitative methods, especially in undergraduate courses on multivariate statistics. Additionally, standards on how to present nonlinear models in publications still have to be established. Is the tabular presentation of coefficients we are all used to from linear models equally appropriate for Logit and Probit?
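The first of the three problems can be illustrated with a short sketch (the coefficient values below are invented for illustration): in a logit model, the effect of x on the outcome probability is b1 * p * (1 - p), so one and the same coefficient implies very different probability changes depending on the baseline probability, which is exactly why a bare coefficient table can mislead.

```python
import math

def logit_prob(b0, b1, x):
    """Predicted probability from a logit model: P(y=1|x) = 1/(1+exp(-(b0+b1*x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def marginal_effect(b0, b1, x):
    """dP/dx = b1 * p * (1-p): the probability impact of a fixed coefficient
    varies with the baseline probability."""
    p = logit_prob(b0, b1, x)
    return b1 * p * (1.0 - p)

# Hypothetical coefficients, not taken from any paper in this session
b0, b1 = -2.0, 0.5
for x in (0.0, 4.0, 10.0):
    print(f"x={x:4.1f}  p={logit_prob(b0, b1, x):.3f}  dP/dx={marginal_effect(b0, b1, x):.4f}")
```

The marginal effect peaks where p = 0.5 and shrinks toward both tails, even though b1 is constant throughout.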
In this session we especially seek presentations on approaches to interpreting and presenting Logit and Probit results, as well as suggestions and experiences for teaching nonlinear models without neglecting these important problems.
Allison (1999), Williams (2009, 2010) and others have pointed out that comparisons of logit and probit coefficients across groups (e.g. men and women) are potentially problematic. Unlike in OLS regression, heteroskedasticity can bias the coefficients and distort comparisons. As Hoetker (2004, p. 17) notes, "in the presence of even fairly small differences in residual variation, naive comparisons of coefficients [across groups] can indicate differences where none exist, hide differences that do exist, and even show differences in the opposite direction of what actually exists." Several solutions have been proposed (e.g. Allison 1999, Williams 2009, Long 2009), but each of these solutions has its own limitations that researchers should be aware of. This paper reviews the problem, the proposed solutions, and the potential problems with the solutions. The pros and cons of each approach are discussed, and possible criteria for choosing between them are suggested.
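A minimal simulation, assuming a latent-variable data-generating process, makes Hoetker's warning concrete: both groups share exactly the same latent effect (1.0), but the group whose latent equation has twice the residual standard deviation yields a logit coefficient only about half as large. The setup and sample size are illustrative choices, not taken from any of the cited papers.

```python
import numpy as np

def fit_logit(X, y, iters=30):
    """Logit ML estimates via Newton-Raphson; X must include an intercept column."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        b = b + np.linalg.solve(hess, grad)
    return b

rng = np.random.default_rng(0)
n = 20000

def simulate_group(sigma):
    # Same latent slope (1.0) in every group; only the residual scale differs.
    x = rng.standard_normal(n)
    y_star = 1.0 * x + sigma * rng.logistic(size=n)
    y = (y_star > 0).astype(float)
    return np.column_stack([np.ones(n), x]), y

b_a = fit_logit(*simulate_group(sigma=1.0))[1]  # close to 1.0
b_b = fit_logit(*simulate_group(sigma=2.0))[1]  # close to 0.5: looks "weaker"
print(f"group A slope: {b_a:.2f}, group B slope: {b_b:.2f}")
```

Because logit coefficients are only identified up to the residual scale, the naive comparison suggests a group difference where, on the latent scale, none exists.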
Although the parameters of logit and probit and other non-linear probability models are often explained and interpreted in relation to the regression coefficients of an underlying linear latent variable model, we argue that they may also be usefully interpreted in terms of the correlations between the dependent variable of the latent variable model and its predictor variables. We show how this correlation can be derived from the parameters of non-linear probability models, develop tests for the statistical significance of the derived correlation, and illustrate its usefulness in two applications. Under certain circumstances, which we explain, the derived correlation provides a way of overcoming the problems inherent in cross-sample comparisons of the parameters of non-linear probability models.
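One way to see the idea in the probit case (the parameter values below are made up for illustration): with the latent equation y* = b0 + b1*x + e and e ~ N(0,1), the implied correlation between y* and x is b1*sd(x) / sqrt(b1^2*var(x) + 1). A small sketch checks this formula against a directly simulated latent variable:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200000
b0, b1, sd_x = 0.3, 0.8, 1.5   # invented probit parameters

x = rng.normal(0.0, sd_x, size=n)
y_star = b0 + b1 * x + rng.standard_normal(n)   # latent variable, e ~ N(0,1)

# Correlation implied by the model parameters ...
rho_model = b1 * sd_x / np.sqrt(b1**2 * sd_x**2 + 1.0)
# ... versus the empirical correlation of the simulated latent variable
rho_empirical = np.corrcoef(x, y_star)[0, 1]
print(f"model-implied: {rho_model:.3f}, empirical: {rho_empirical:.3f}")
```

Because the correlation is defined on the latent scale with the residual variance fixed at one, it does not suffer from the arbitrary rescaling that plagues raw coefficient comparisons.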
In the social sciences, logit and probit models are widely used procedures for the multivariate analysis of binary dependent variables. Both procedures can be thought of as resting on a linear model for an unobserved variable y* from which a nonlinear model for the probability of y = 1 is derived. Based on a series of Monte Carlo simulations, we first demonstrate that the b-coefficients from logit and probit models should not be compared between nested models. We then test different approaches for their suitability for comparing coefficients between nested models. Results of our simulation study show that y*-standardized coefficients are of limited utility, and coefficients from a linear probability model should only be used with normally distributed variables. However, average marginal effects and regression coefficients corrected by a method proposed by Karlson et al. (2012) lead to satisfactory results in many different scenarios.
Fixed effects models have become a prime tool for causal analysis, as they allow researchers to control for unobserved heterogeneity. For nonlinear models such as logistic regression, this advantage comes at the price that the interpretation of the effects is severely limited. Interpreting effects on the outcome probabilities is ruled out for nonlinear fixed effects models, because such effects depend on the complete linear combination in the logit term. This leaves the researcher to interpret either the odds ratio effect or the effect on the outcome probability conditional on the fixed effects; both alternatives are problematic.
This paper discusses two strategies suggested by Cameron and Trivedi (2005) and Schröder (2010) as possible ways out of this dilemma. Finally, the correlated random effects model following Mundlak (1978) and Chamberlain (1980) is proposed as a recourse in this situation, as it offers a better compromise between restrictive assumptions about heterogeneity and restrictions on interpretability.
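As a rough illustration of the Mundlak (1978) device (the simulation values are invented, and a simple pooled logit stands in for the full correlated random effects estimator): adding the person-specific means of the time-varying covariate absorbs much of the unobserved heterogeneity that is correlated with x, moving the slope estimate back toward its latent value.

```python
import numpy as np

def fit_logit(X, y, iters=30):
    """Logit ML estimates via Newton-Raphson; X includes an intercept column."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        b = b + np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (y - p))
    return b

rng = np.random.default_rng(3)
n_persons, t = 2000, 10
c = rng.standard_normal(n_persons)                     # unobserved heterogeneity
x = c[:, None] + rng.standard_normal((n_persons, t))   # x correlated with c
y_star = 1.0 * x + c[:, None] + rng.logistic(size=(n_persons, t))
y = (y_star > 0).astype(float).ravel()

x_flat = x.ravel()
x_mean = np.repeat(x.mean(axis=1), t)                  # Mundlak device: person means
ones = np.ones_like(x_flat)

b_pooled = fit_logit(np.column_stack([ones, x_flat]), y)[1]           # inflated
b_mundlak = fit_logit(np.column_stack([ones, x_flat, x_mean]), y)[1]  # near 1.0
print(f"pooled: {b_pooled:.2f}, with person means (Mundlak): {b_mundlak:.2f}")
```

In a full application the person means would be added to a random effects logit rather than a pooled one; the sketch only shows why the device separates within- from between-person variation.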
Applying standard maximum likelihood-based (ML) logistic regression to model "rare events" (e.g. sexual recidivism, voting for extreme parties, severe criminal offenses) might result in serious underestimation of the occurrence probabilities and biased standard errors of the logit coefficients (King & Zeng, 2001). The phenomenon is likely to occur if the proportion of 1's (indicating events in the binary dependent variable) is smaller than five percent. To date, two potential remedies are known for this particular problem: (i) a correction procedure proposed by King and Zeng (2001) and (ii) the application of "exact logistic regression". However, these approaches are not free from limitations. While the former is not applicable to rare events in small samples, the latter is--despite the development of more efficient estimation algorithms--still computationally intensive.
In order to explore the potential of these two alternatives, as well as to investigate to what extent the estimates of the ordinary ML-based logistic regression model are affected by "rare events bias", Monte Carlo simulations will be conducted. Three parameters will be subject to variation: (i) the proportion of 1's (event rareness), (ii) sample size, and (iii) the number of events per independent variable. The results shall allow the formulation of guidelines for applied researchers, indicating under which conditions ML-based logistic regression produces biased estimates and what appears to be the most appropriate alternative.
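One ingredient of King and Zeng's (2001) procedure, the prior correction, can be sketched as follows (parameter values are invented, and this covers only the intercept correction for case-control style sampling, not their additional small-sample bias correction): after oversampling events, the fitted intercept is shifted by a known amount that depends on the true event rate tau and the sample event rate, and subtracting that shift recovers the population intercept.

```python
import numpy as np

def fit_logit(X, y, iters=40):
    """Logit ML estimates via Newton-Raphson; X includes an intercept column."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        b = b + np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (y - p))
    return b

rng = np.random.default_rng(4)
N = 200000
b0_true, b1_true = -5.0, 1.0          # invented: events occur in roughly 1% of cases
x = rng.standard_normal(N)
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-(b0_true + b1_true * x)))).astype(float)
tau = y.mean()                        # population event rate

# Case-control sample: keep all events plus an equal number of non-events
events = np.flatnonzero(y == 1)
nonevents = rng.choice(np.flatnonzero(y == 0), size=events.size, replace=False)
idx = np.concatenate([events, nonevents])
Xs, ys = np.column_stack([np.ones(idx.size), x[idx]]), y[idx]

b = fit_logit(Xs, ys)
ybar = ys.mean()                      # = 0.5 by construction
# Prior correction for the intercept (King & Zeng 2001); the slope needs no correction
b0_corrected = b[0] - np.log(((1 - tau) / tau) * (ybar / (1 - ybar)))
print(f"raw intercept: {b[0]:.2f}, corrected: {b0_corrected:.2f} (true {b0_true})")
```

The uncorrected intercept would grossly overstate the event probability in the population, which is why naive logit estimates on balanced subsamples of rare-event data are misleading.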
Literature:
King, G. & Zeng, L. (2001): Logistic Regression in Rare Events Data. Political Analysis, 9(2), 137-163.
http://gking.harvard.edu/files/0s_0.