Binary Logit and Probit Regression
Many of the things that we seek to predict are qualitative, and have been measured at the nominal or ordinal level. A number of statistical models have been developed that allow "regression"-like modeling in these situations. The two that have been most popular in the sociological literature are the "logit" (or "logistic regression") and the "probit" models. These models can be applied fairly easily to cases where the dependent variable is either nominal or ordinal and has two or more levels, and where the independent variables are any mix of qualitative and quantitative predictors.
The logit and the probit are two examples of a larger class of "generalized linear models" (you may wish to see the materials for Sociology 271 Intermediate Quantitative Analysis, elsewhere in this web site). The logit and probit regression models regress a function of the probability that a case falls in a certain category of Y on a linear combination of X variables. The "right-hand side" of the logit and probit, then, is the same as it is in the classical normal linear regression model. The slope coefficients tell us the effect of a unit change in X on a function of the probability of Y.
The difference between the logit and probit lies on the left-hand side of the equation. The logit model is exactly that: the left-hand side is the logit of Y, or the log of the odds that a case falls in one category on Y versus another. For example, if Y were whether a child was born to a woman in a given year, the logit model would express the effects of X on the log of the odds of a birth versus a non-birth. The left-hand side of the probit model can be thought of as a Z score: the standard normal score whose cumulative probability equals the probability that Y falls in a particular category. In the probit model, a unit change in X produces a "b" unit change in that Z score. For example, the probit model would express the effect of a unit change in X on the Z score corresponding to the cumulative normal probability that a woman had a birth within a year.
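The two left-hand sides can be sketched in a few lines of Python. This is only an illustration of the transformations described above; the function names `logit` and `probit` are chosen here for convenience and are not part of any SAS output.

```python
import math
from statistics import NormalDist

def logit(p):
    """Log of the odds p / (1 - p): the left-hand side of the logit model."""
    return math.log(p / (1.0 - p))

def probit(p):
    """Z score whose cumulative standard normal probability equals p."""
    return NormalDist().inv_cdf(p)

# Both transforms stretch probabilities in (0, 1) onto the whole real line,
# which is what lets a linear combination of X predict them.
for p in (0.10, 0.25, 0.50, 0.75, 0.90):
    print(f"p = {p:.2f}   logit = {logit(p):+.3f}   probit = {probit(p):+.3f}")
```

Both functions are zero at p = .5 and symmetric around it; the logit simply runs on a wider scale than the probit, which is why logit coefficients are typically larger in absolute value than probit coefficients fitted to the same data.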
Both the logit and the probit regression models are estimated by maximum likelihood. Consequently, goodness of fit and inferential statistics are based on the log likelihood and chi-square test statistics. Interpreting these statistics may be somewhat unfamiliar, and is discussed in the examples below. The other, and main, challenge with the logit and probit is the interpretation of the descriptive statistics (the estimated regression function). A number of approaches are commonly used, and these will also be briefly examined below.
The logit and probit models have been extended in a number of ways that we will not examine here. The probability of Y may be seen as censored at bottom or top, or both. The model may be a two-stage "sample selection" one in which a secondary equation predicts which set of cases are at risk of an event, while the main probit or logit predicts the probability of an event among those at risk. Extensions such as these are often covered in advanced econometrics (and probably biometrics), and can be estimated with some packages (e.g. Limdep, Shazam) or by writing appropriate non-linear minimization algorithms in languages such as Gauss. All of these issues are well beyond the scope of the current exercise.
An even more introductory discussion and slide show on logit and probit models is included in the documentation of the course Sociology 110C.
We will use the logistic and probit models to re-analyze data that we looked at using cross-tabulations and the hierarchical log linear model. The reader may wish to compare and contrast the three approaches.
The dependent variable is the respondent's self-identified political liberalism (POLVIEWS), from the 1991 General Social Survey. Here, we will analyze the probability that a person self-identifies as liberal or very liberal, versus any other identification. Since POLVIEWS is measured on a 7-point scale of increasing conservatism, there are a number of ways in which its variance could be analyzed. It could be treated as a continuous variable (not a very good idea), as an ordinal variable, or as a nominal polytomy or nominal dichotomy. We will use the logit and probit on a nominal dichotomy; variations for multinomial outcomes and for ordered multinomial outcomes are (slightly) advanced topics. You may wish to see notes elsewhere in the Sociology 271: Intermediate Quantitative Analysis web site.
Two independent variables are used to predict the log-odds or cumulative normal probability that a person self-identifies as a political liberal. Education and age are each measured in years. The linear-by-linear interaction of the two is not significant, and is highly collinear with the main terms; hence, it is not included in the current models. A note of caution is due here. Our other analyses, using categorical versions of age and education, truncated the variation in education, but also expressed more "qualitative" than linear "quantitative" hypotheses about the effects of the two independent variables. So, the models being examined here are really quite different, in a key way, from our cross-tab and log linear examples.
Logit and probit models use the method of maximum likelihood to fit functions of proportions of cases at each level of the X vector that have a score of "1" on the dependent variable. As such, goodness-of-fit is evaluated using likelihood ratio chi-square statistics. The linear and additive vector of X is related to the probability of Y=1 by a "link" function. There are a number of such functions available in standard software packages. We examine two, from SAS PROC LOGISTIC -- the logit or log-odds link and the "normit" or cumulative normal, or probit link. This particular SAS procedure allows a couple of other approaches that we won't examine. PROC PROBIT offers the same menu of choices, but slightly different output; PROC GENMOD offers a still wider range and variety of probability models. In each case, then, the regression coefficients express the partial linear additive effects of unit increments in X on the linked function of Y. The major trick in logistic and probit regression is translating these coefficients into some more intelligible way of talking about effects.
Univariate Procedure
Variable=AGE AGE OF RESPONDENT
Moments
N 1514 Sum Wgts 1514
Mean 45.62616 Sum 69078
Std Dev 17.80842 Variance 317.1397
Skewness 0.524025 Kurtosis -0.78556
USS 3631596 CSS 479832.4
CV 39.03116 Std Mean 0.45768
T:Mean=0 99.68998 Pr>|T| 0.0001
Num ^= 0 1514 Num > 0 1514
M(Sign) 757 Pr>=|M| 0.0001
Sgn Rank 573427.5 Pr>=|S| 0.0001
Quantiles(Def=5)
100% Max 89 99% 85
75% Q3 60 95% 78
50% Med 41 90% 73
25% Q1 32 10% 24
0% Min 18 5% 22
1% 20
Range 71
Q3-Q1 28
Mode 35
Extremes
Lowest Obs Highest Obs
18( 1188) 89( 721)
18( 825) 89( 770)
18( 394) 89( 946)
19( 1134) 89( 1103)
19( 1117) 89( 1264)
Variable=EDUC HIGHEST YEAR OF SCHOOL COMPLETED
Moments
N 1510 Sum Wgts 1510
Mean 12.88411 Sum 19455
Std Dev 2.984022 Variance 8.904386
Skewness -0.16815 Kurtosis 0.709803
USS 264097 CSS 13436.72
CV 23.16049 Std Mean 0.076792
T:Mean=0 167.7802 Pr>|T| 0.0001
Num ^= 0 1508 Num > 0 1508
M(Sign) 754 Pr>=|M| 0.0001
Sgn Rank 568893 Pr>=|S| 0.0001
Quantiles(Def=5)
100% Max 20 99% 20
75% Q3 15 95% 18
50% Med 12 90% 16
25% Q1 12 10% 9
0% Min 0 5% 8
1% 5
Range 20
Q3-Q1 3
Mode 12
Extremes
Lowest Obs Highest Obs
0( 946) 20( 831)
0( 868) 20( 887)
3( 1487) 20( 987)
3( 951) 20( 1008)
3( 747) 20( 1334)
The LOGISTIC Procedure
Data Set: OUT1.TES2
Response Variable: P
Response Levels: 2
Number of Observations: 1508
The Logit or Logistic Regression Model
Link Function: Logit
Response Profile
Ordered
Value P Count
1 0 246
2 1 1262
Model Fitting Information and Testing Global Null Hypothesis BETA=0
                           Intercept
             Intercept     and
Criterion    Only          Covariates    Chi-Square for Covariates
AIC          1343.589      1333.898      .
SC           1348.907      1349.854      .
-2 LOG L     1341.589      1327.898      13.690 with 2 DF (p=0.0011)
Score        .             .             13.373 with 2 DF (p=0.0012)
Analysis of Maximum Likelihood Estimates
              Parameter  Standard  Wald        Pr >        Standardized  Odds
Variable  DF  Estimate   Error     Chi-Square  Chi-Square  Estimate      Ratio
INTERCPT  1   -0.4427    0.4075    1.1804      0.2773      .             .
AGE       1   -0.0150    0.00424   12.5602     0.0004      -0.147717     0.985
EDUC      1   -0.0411    0.0247    2.7599      0.0967      -0.067665     0.960
Association of Predicted Probabilities and Observed Responses
Concordant = 56.1% Somers' D = 0.137
Discordant = 42.4% Gamma = 0.140
Tied = 1.5% Tau-a = 0.038
(310452 pairs) c = 0.569
Classification Table
        Correct           Incorrect         Percentages
        ----------------  ----------------  -------------------------------------
Prob           Non-              Non-                Sensi-  Speci-  False  False
Level   Event  Event      Event  Event      Correct  tivity  ficity  POS    NEG
---------------------------------------------------------------------------------
0.164   137    647        615    109        52.0     55.7    51.3    81.8   14.4
The Probit (or Normit) Model
Link Function: Normit
Model Fitting Information and Testing Global Null Hypothesis BETA=0
                           Intercept
             Intercept     and
Criterion    Only          Covariates    Chi-Square for Covariates
AIC          1343.589      1333.955      .
SC           1348.907      1349.910      .
-2 LOG L     1341.589      1327.955      13.634 with 2 DF (p=0.0011)
Score        .             .             13.373 with 2 DF (p=0.0012)
Analysis of Maximum Likelihood Estimates
              Parameter  Standard  Wald        Pr >        Standardized
Variable  DF  Estimate   Error     Chi-Square  Chi-Square  Estimate
INTERCPT  1   -0.3349    0.2263    2.1907      0.1389      .
AGE       1   -0.00829   0.00232   12.8161     0.0003      -0.147731
EDUC      1   -0.0216    0.0136    2.5238      0.1121      -0.064604
Association of Predicted Probabilities and Observed Responses
Concordant = 56.1% Somers' D = 0.138
Discordant = 42.3% Gamma = 0.141
Tied = 1.5% Tau-a = 0.038
(310452 pairs) c = 0.569
Classification Table
        Correct           Incorrect         Percentages
        ----------------  ----------------  -------------------------------------
Prob           Non-              Non-                Sensi-  Speci-  False  False
Level   Event  Event      Event  Event      Correct  tivity  ficity  POS    NEG
---------------------------------------------------------------------------------
0.164   142    646        616    104        52.3     57.7    51.2    81.3   13.9
The first step in interpretation is to test and describe the overall goodness of fit of the model. In ML approaches, a common method is to compare the -2 log likelihood (-2ll) of the model under the constraint that all slope coefficients are zero with the -2ll of the model whose coefficients are estimated from the sample data. The reduction in the "badness of fit" as a result of freeing the parameter for each X can be tested as a chi-square statistic with as many degrees of freedom as freed parameters. In the logistic model, the -2ll of the "intercept only" (slopes constrained to zero) model is 1342. The model with the two independent variables has a -2ll of 1328. The difference, or "goodness of fit," is 13.69 with 2 d.f., significant at p < .01. The results for the probit are very similar (as is almost always the case, unless the probabilities being predicted are very small or very large).
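The test can be reproduced by hand from the printed -2 LOG L values. The sketch below assumes only the figures shown in the output above; it uses the fact that, for exactly 2 degrees of freedom, the upper-tail chi-square probability is exp(-x/2), so no statistical library is needed. The variable and function names are illustrative.

```python
import math

# -2 log likelihood values copied from the SAS output above.
NEG2LL_INTERCEPT_ONLY = 1341.589            # identical for both links
NEG2LL_FITTED = {"logit": 1327.898, "probit": 1327.955}
FREED_PARAMETERS = 2                        # slopes for AGE and EDUC

def chi2_upper_tail_2df(x):
    """P(chi-square with 2 d.f. > x); for exactly 2 d.f. this is exp(-x/2)."""
    return math.exp(-x / 2.0)

lr_tests = {}
for link, fitted in NEG2LL_FITTED.items():
    chi_square = NEG2LL_INTERCEPT_ONLY - fitted
    lr_tests[link] = (chi_square, chi2_upper_tail_2df(chi_square))
    print(f"{link:6s} chi-square = {chi_square:.3f}, "
          f"d.f. = {FREED_PARAMETERS}, p = {lr_tests[link][1]:.4f}")
```

Because the printed -2 LOG L values are rounded, the difference can disagree with the SAS chi-square in the last decimal place. With a different number of freed parameters the closed-form tail probability no longer applies, and a chi-square table or statistical library would be needed.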
There is no exact analog of the R2 of OLS regression for ML estimators. Several measures, however, can be quite useful. One very simple approach is to examine the size of the chi-square goodness of fit relative to the -2ll of the intercept-only model. For the logit model, the intercept-only model has a -2ll of 1342. The final model has a -2ll of 1328. The improvement, then, is 13.69 units. Expressed as a ratio to the original "badness of fit," this gives 13.69/1341.59 = .010, or about 1%. The same calculation for the probit gives a reduction in error of 13.63 units, again about 1%. With large sample sizes, this approach to understanding goodness of fit can be misleading, as the value of chi-square is sample size dependent. An alternative, normed, measure or "pseudo R-squared" can be calculated as the model chi-square divided by the sum of the sample size and the model chi-square. For the logit model, this calculation yields 13.69/(1508 + 13.69) = .009; for the probit, it likewise yields .009.
Two other approaches are also sometimes used. One can calculate measures of association between the predictions of the model and the actual outcomes at each level of the X vector. Where there are few "ties" (as often occurs when the predictors are continuous), gamma can be calculated as a measure of association. Here, gamma for the logit is .140, and for the probit it is .141. Where the predictors are categorical, and more ties occur, a more conservative measure like Tau might be preferred.
One can also use the model to make categorical predictions for each case, and compare the number of correct and incorrect predictions, false positives, false negatives, and the like. This approach may give the false impression that the model is making relatively accurate predictions when the outcome is not equiprobable, unless prior probabilities for the predictions are set (as we attempted, unsuccessfully, to do in the output above).
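Most of these summary measures can be recomputed from the printed logit output, which is a useful check on what each one means. The sketch below assumes the standard definitions (gamma ignores tied pairs, Somers' D keeps them in the denominator, and c counts ties as half); all names are illustrative, and the inputs are the rounded figures from the output above.

```python
# Figures copied from the logit output above.
NEG2LL_INTERCEPT_ONLY, NEG2LL_FITTED = 1341.589, 1327.898
CONCORDANT, DISCORDANT, TIED = 56.1, 42.4, 1.5      # percent of 310452 pairs
# Classification table row at probability level 0.164.
CORRECT_EVENT, CORRECT_NONEVENT = 137, 647
INCORRECT_EVENT, INCORRECT_NONEVENT = 615, 109

# Proportional reduction in "badness of fit".
reduction = (NEG2LL_INTERCEPT_ONLY - NEG2LL_FITTED) / NEG2LL_INTERCEPT_ONLY

# Rank-order association between predicted probabilities and outcomes.
gamma = (CONCORDANT - DISCORDANT) / (CONCORDANT + DISCORDANT)  # ties dropped
somers_d = (CONCORDANT - DISCORDANT) / 100.0                   # ties in the base
c_index = (CONCORDANT + 0.5 * TIED) / 100.0                    # ties count half

# Classification summaries, in percent.
n_cases = (CORRECT_EVENT + CORRECT_NONEVENT
           + INCORRECT_EVENT + INCORRECT_NONEVENT)
pct_correct = 100.0 * (CORRECT_EVENT + CORRECT_NONEVENT) / n_cases
sensitivity = 100.0 * CORRECT_EVENT / (CORRECT_EVENT + INCORRECT_NONEVENT)
specificity = 100.0 * CORRECT_NONEVENT / (CORRECT_NONEVENT + INCORRECT_EVENT)
false_pos = 100.0 * INCORRECT_EVENT / (CORRECT_EVENT + INCORRECT_EVENT)
false_neg = 100.0 * INCORRECT_NONEVENT / (CORRECT_NONEVENT + INCORRECT_NONEVENT)

print(f"reduction in -2 log L: {reduction:.3f}")
print(f"gamma = {gamma:.3f}, Somers' D = {somers_d:.3f}, c = {c_index:.4f}")
print(f"correct = {pct_correct:.1f}%, sensitivity = {sensitivity:.1f}%, "
      f"specificity = {specificity:.1f}%")
print(f"false positive = {false_pos:.1f}%, false negative = {false_neg:.1f}%")
```

Substituting the probit figures from the output above gives nearly identical values, which is consistent with the similarity of the two models noted earlier.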
The somewhat more challenging part of working with the logit and probit (and probability models generally), is finding a useful way to discuss the partial slope coefficients. There are several approaches to this problem. And, depending on the needs of the analysis, and the audience, each is perfectly acceptable.
The Logit Model:
The regression prediction equation is given as logit(Y) = -.4427 - .015(Age) - .0411(Educ), and may be interpreted directly. One would say: "the predicted log of the odds that a person is a liberal versus not a liberal is -.4427 when that person has an age of zero and an education of zero. Each year of age reduces the log odds that a person will be a liberal by .015 units; each year of education reduces the log odds that a person is a liberal by .0411." This interpretation is good in that it fairly clearly indicates the direction of the effects, and the relative magnitudes in raw form. However, statements about effects on the log of the odds of the dependent variable do not mean much to most audiences, who don't think in this metric.
An alternative is to speak about effects on the odds, rather than effects on the log odds. Most people do have an intuitive feel for statements about odds. To do so, we take the antilog (exp function) of the regression coefficients. In doing so, however, we have turned the regression function into a multiplicative model in the odds, rather than an additive model in the log odds. One would say: "The odds that a person who has zero years of age and zero years of education is a liberal are .642 to 1.00. For each year of age, these odds are multiplied by .985 (or, the odds are reduced by about 1.5%); for each year of education, the odds that a person is a liberal are multiplied by .960 (or reduced by about 4%)." This is, to many folks, a more intelligible way of talking about the direction and magnitude of effects.
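The conversion from log odds to odds is a single exponentiation per coefficient. A minimal sketch, assuming the rounded logit estimates printed in the output above (the dictionary name is illustrative):

```python
import math

# Logit coefficients copied from the output above.
COEFFICIENTS = {"INTERCPT": -0.4427, "AGE": -0.0150, "EDUC": -0.0411}

# exp() turns additive effects on the log odds into multipliers of the odds.
multipliers = {name: math.exp(b) for name, b in COEFFICIENTS.items()}

for name, m in multipliers.items():
    if name == "INTERCPT":
        print(f"baseline odds at age 0, education 0: {m:.3f} to 1")
    else:
        print(f"{name}: odds multiplied by {m:.3f} per year "
              f"({100.0 * (m - 1.0):+.1f}%)")
```

The multipliers for AGE and EDUC match the "Odds Ratio" column that SAS prints alongside the estimates.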
One can go a step further, and make statements about the effects of a unit change in each X (or a one standard deviation change in X) on the probability that a person is a liberal. This is probably the clearest and most obvious way of describing effects. However, since the model is non-linear in the probabilities, we must choose a reference point for calculating the effects of unit changes in X. The most common choice of a reference point is a person who has the sample mean values on all X variables. To calculate the probability that such a person is a liberal, one substitutes the sample mean values into the regression equation, and calculates the predicted log odds. The antilog of this predicted value, divided by 1 plus the antilog of this predicted value, gives the predicted probability for a specific vector of X values. In our model, this predicted value is .1609. That is, a person who is at the sample mean values on age (45.63 years) and education (12.88 years) has a predicted probability of being liberal of .1609, or about 16%.
This is a useful reference point, because we can now ask: how much different would the probability be if this person (the one at the sample means) increased in age by one year, or increased in education by one year? One simply substitutes the new values into the prediction equation and calculates the probabilities. In our case, we find that a one year increase in age for a person at the sample means on all X variables reduces the probability that the person will be a liberal by about .0020, or .20 percentage points. A one year increase in education for a person at the sample means reduces the probability that they will be a liberal by .0055, or .55 percentage points. We could also express the change in probabilities of increasing by one standard deviation of each variable, from the sample means. In this case, for a person at the sample means of age and education, an increase of one standard deviation in age reduces the probability that they will be a liberal by .0334, or 3.34 percentage points; an increase of one standard deviation in education reduces the probability that they will be a liberal by .0159, or 1.59 percentage points.
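The substitution described above is easy to script. The sketch below assumes the rounded coefficients, means, and standard deviations printed in the output; the function name `prob_liberal` is illustrative. Because the published coefficients are rounded, the results can differ from the figures quoted above in the third or fourth decimal place.

```python
import math

# Rounded logit coefficients and sample statistics copied from the output.
B0, B_AGE, B_EDUC = -0.4427, -0.0150, -0.0411
MEAN_AGE, MEAN_EDUC = 45.62616, 12.88411
SD_AGE, SD_EDUC = 17.80842, 2.984022

def prob_liberal(age, educ):
    """Predicted probability: antilog of the log odds over 1 plus the antilog."""
    odds = math.exp(B0 + B_AGE * age + B_EDUC * educ)
    return odds / (1.0 + odds)

# Reference point: a person at the sample means of both predictors.
base = prob_liberal(MEAN_AGE, MEAN_EDUC)

# Change in probability when one predictor moves and the other stays put.
changes = {
    "age +1 year": prob_liberal(MEAN_AGE + 1.0, MEAN_EDUC) - base,
    "education +1 year": prob_liberal(MEAN_AGE, MEAN_EDUC + 1.0) - base,
    "age +1 SD": prob_liberal(MEAN_AGE + SD_AGE, MEAN_EDUC) - base,
    "education +1 SD": prob_liberal(MEAN_AGE, MEAN_EDUC + SD_EDUC) - base,
}

print(f"probability at the sample means: {base:.4f}")
for label, delta in changes.items():
    print(f"{label:18s} change in probability: {delta:+.4f}")
```

Choosing a different reference point (for example, the median respondent) only requires changing the two values passed to `prob_liberal`; the changes will differ, because the model is non-linear in the probabilities.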
A final possibility for expressing effects is to draw a picture. One can easily calculate the predicted probabilities of the dependent variable across the full range of values of each X, while holding the other Xs constant at their sample means (or some other reference value, if one wishes). The results are probably best shown as a graph, which will show the familiar "S" shaped probability function of the dependent variable, across the range of the X of interest, while holding other X constant at their most "typical" values. Such graphical displays can be quite compelling, and are accessible to most audiences.
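The values behind such a graph can be generated with a short loop. The sketch below assumes the rounded logit coefficients from the output, holds education at its sample mean, and steps age across its observed range (18 to 89); it prints a crude text plot rather than relying on any plotting package. All names are illustrative.

```python
import math

# Rounded logit coefficients and sample statistics copied from the output.
B0, B_AGE, B_EDUC = -0.4427, -0.0150, -0.0411
MEAN_EDUC = 12.88411
AGE_MIN, AGE_MAX = 18, 89           # observed range of AGE

def prob_liberal(age, educ=MEAN_EDUC):
    """Predicted probability of being liberal under the fitted logit."""
    log_odds = B0 + B_AGE * age + B_EDUC * educ
    return 1.0 / (1.0 + math.exp(-log_odds))

# Every tenth year of age, plus the maximum observed age.
ages = list(range(AGE_MIN, AGE_MAX + 1, 10)) + [AGE_MAX]
curve = [(age, prob_liberal(age)) for age in ages]

for age, p in curve:
    bar = "#" * round(100 * p)      # one mark per percentage point
    print(f"age {age:2d}  p = {p:.3f}  {bar}")
```

With effects as small as these, the observed range of age covers only a short, nearly straight stretch of the "S" curve; the full shape would appear only if the linear predictor ranged much more widely.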
The Probit Model:
Interpretation of the probit coefficients is, in some senses, rather easier than the logit. The regression coefficients of the probit model are effects on a cumulative normal function of the probabilities that Y = 1 (i.e. the probability that a person is a liberal). As such, they are already in a metric that many statistically trained folks understand: the metric of "Z" or standard normal scores.
Using this, one can interpret the coefficients directly. The results of our probit model are:
Z = -.3349 - .00829(Age) - .0216(Education)
One could state that: "The Z score of a person of age zero and education zero is -.3349. For each year of age, that Z score is reduced by .00829; for each year of education, the Z score is reduced by .0216."
It is straightforward to translate predicted probits (Z scores) into probabilities, using any table of the standard normal distribution. The probability that a person with zero age and education is a liberal is the probability associated with the Z score of -.3349, or about .37. That is, if there were such a person, they would have about a 37% chance of being a liberal. Again, it might be useful to calculate the "elasticities," or effects on the probabilities, for a one unit change or a one standard deviation change in each X from the sample mean, holding the other X constant at the sample means. For age, a one year increase reduces the probability of liberalism by .0024; a standard deviation increase in age reduces it by .0340. For education, a one year increase reduces the probability of liberalism by .0049; a one standard deviation increase reduces the probability by .0165.
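The table lookup can be replaced by a direct evaluation of the cumulative normal function. The sketch below assumes the rounded probit coefficients and sample statistics printed in the output; the names are illustrative. The figures quoted above appear consistent with reading Z to two decimals from a printed table, so evaluating the function exactly gives slightly different values in the third or fourth decimal place.

```python
from statistics import NormalDist

# Rounded probit coefficients and sample statistics copied from the output.
B0, B_AGE, B_EDUC = -0.3349, -0.00829, -0.0216
MEAN_AGE, MEAN_EDUC = 45.62616, 12.88411
SD_AGE, SD_EDUC = 17.80842, 2.984022
STANDARD_NORMAL = NormalDist()      # mean 0, standard deviation 1

def prob_liberal(age, educ):
    """Cumulative normal probability of the predicted Z score."""
    z = B0 + B_AGE * age + B_EDUC * educ
    return STANDARD_NORMAL.cdf(z)

p_zero = prob_liberal(0.0, 0.0)               # age zero, education zero
base = prob_liberal(MEAN_AGE, MEAN_EDUC)      # person at the sample means

changes = {
    "age +1 year": prob_liberal(MEAN_AGE + 1.0, MEAN_EDUC) - base,
    "education +1 year": prob_liberal(MEAN_AGE, MEAN_EDUC + 1.0) - base,
    "age +1 SD": prob_liberal(MEAN_AGE + SD_AGE, MEAN_EDUC) - base,
    "education +1 SD": prob_liberal(MEAN_AGE, MEAN_EDUC + SD_EDUC) - base,
}

print(f"probability at age 0, education 0: {p_zero:.4f}")
print(f"probability at the sample means:   {base:.4f}")
for label, delta in changes.items():
    print(f"{label:18s} change in probability: {delta:+.4f}")
```

The probability at the sample means comes out very close to the logit's, which illustrates how similar the two models' predictions are in the middle range of probabilities.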
Again, of course, graphs could be used to present the predicted probabilities of Y for any set of X values that one might choose.
Logit and probit models estimated by ML are useful tools where the dependent variable is qualitative (we have examined the binary case, but multiple values and ordered values can also be modeled). By using ML, these models overcome the problems of violation of assumptions in OLS approaches. The models are non-linear in the probability of Y as a function of linear additive change in X. This solves the problem of predicted probabilities less than zero or greater than one in the linear probability model (which can also be estimated by ML). The non-linearity in the probabilities does result in some complexity in interpretation of the slope coefficients. As we have shown, however, this is not too troublesome.