Multivariate Analysis: Sociology 203A

Binary Logit and Probit Regression


This page is an annotated example of multivariate logit and probit analyses, supporting the introductory graduate course offered in the Department of Sociology at the University of California, Riverside. The course is taught by Robert A. Hanneman. Your comments and suggestions are very welcome.
Introduction:

Many of the things that we seek to predict are qualitative, and have been measured at the nominal or ordinal level. A number of statistical models have been developed that allow the application of "regression" like modeling in these situations. The two that have been most popular in the sociological literature are the "logit" or "logistic regression" and the "probit" models. These models can be fairly easily applied to cases where the dependent variable is either nominal or ordinal, and has two or more levels; and the independent variables are any mix of qualitative and quantitative predictors.

The logit and the probit are two examples of a larger class of "generalized linear models" (you may wish to see the materials for Sociology 271 Intermediate Quantitative Analysis, elsewhere in this web site). The logit and probit regression models regress a function of the probability that a case falls in a certain category of Y on a linear combination of X variables. The "right hand side" of the logit and probit, then, is the same as it is in the classical normal linear regression model. The slope coefficients tell us the effect of a unit change in X on a function of the probability of Y.

The difference between the logit and probit lies on the left-hand side of the equation. The logit model is exactly that: the left hand side is the logit of Y, or the log of the odds that a case falls in one category on Y versus another. For example, if Y were whether a child was born to a woman in a given year, the logit model would express the effects of X on the log of the odds of a birth versus a non-birth. The left hand side of the probit model can be thought of as a Z score: the standard normal score whose cumulative normal probability equals the probability that Y falls in a particular category. In the probit model, a unit change in X produces a "b" unit change in that Z score. For example, the probit model would express the effect of a unit change in X on the Z score corresponding to the probability that a woman had a birth within a year.
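To make the two left-hand sides concrete, the following minimal sketch converts a few probabilities into logits and probits. The function names `logit` and `probit` are illustrative choices, not part of the course materials, and the probabilities are arbitrary.

```python
import math
from statistics import NormalDist


def logit(p):
    """Log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def probit(p):
    """Z score whose cumulative standard normal probability equals p."""
    return NormalDist().inv_cdf(p)


# Arbitrary illustrative probabilities, not taken from the GSS data.
for p in (0.10, 0.25, 0.50, 0.75, 0.90):
    print(f"p = {p:.2f}   logit = {logit(p):+.3f}   probit = {probit(p):+.3f}")
```

Both functions are zero at a probability of .5 and grow without bound as the probability approaches 0 or 1; they differ mainly in scale and in how quickly the tails stretch.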

Both the logit and the probit regression models are estimated by maximum likelihood. Consequently, goodness of fit and inferential statistics are based on the log likelihood and chi-square test statistics. Interpreting these statistics may be somewhat unfamiliar, and is discussed in the examples below. The other, and main challenge with the logit and probit is the interpretation of the descriptive statistics (the estimated regression function). A number of approaches are commonly used, and these will also be briefly examined below.

The logit and probit models have been extended in a number of ways that we will not examine here. The probability of Y may be seen as censored at bottom or top, or both. The model may be a two stage "sample selection" one in which a secondary equation predicts which set of cases are at risk of an event, while the main probit or logit predicts the probability of an event among those at risk. Extensions such as these are often covered in advanced econometrics (and probably biometrics), and can be estimated with some packages (e.g. Limdep, Shazam) or by writing appropriate non-linear minimization algorithms in languages such as Gauss. All of these issues are well beyond the scope of the current exercise.

An even more introductory discussion and slide show on logit and probit models is included in the documentation of the course Sociology 110C.



The Problem

 We will use the logistic and probit models to re-analyze data that we looked at using cross-tabulations and the hierarchical log linear model. The reader may wish to compare and contrast the three approaches.

The dependent variable is the respondent's (General Social Survey, 1991) self-identified political liberalism (POLVIEWS). Here, we will analyze the probability that a person self-identifies as liberal or very liberal, versus any other identification. Since POLVIEWS is measured on a 7-point scale of increasing conservatism, there are a number of ways in which its variance could be analyzed. It could be treated as a continuous variable (not a very good idea), as an ordinal variable, or as a nominal polytomy or nominal dichotomy. We will use the logit and probit on a nominal dichotomy; variations for multinomial outcomes and for ordered multinomial outcomes are (slightly) advanced topics. You may wish to see notes elsewhere in the Sociology 271: Intermediate Quantitative Analysis web site.

Two independent variables are used to predict the log-odds or cumulative normal probability that a person self-identifies as a political liberal. Education and age are each measured in years. The linear by linear interaction of the two is not significant, and is highly collinear with the main terms; hence, it is not included in the current models. A note of caution is due here. Our other analyses, using categorical versions of age and education, truncated the variation in education, but also expressed more "qualitative" than linear "quantitative" hypotheses about the effects of the two independent variables. So, the models being examined here are really quite different, in a key way, from our cross-tab and log linear examples.

Logit and probit models use the method of maximum likelihood to fit functions of the proportions of cases at each level of the X vector that have a score of "1" on the dependent variable. As such, goodness-of-fit is evaluated using likelihood ratio chi-square statistics. The linear and additive vector of X is related to the probability of Y=1 by a "link" function. There are a number of such functions available in standard software packages. We examine two, from SAS PROC LOGISTIC: the logit or log-odds link and the "normit" or cumulative normal, or probit link. This particular SAS procedure allows a couple of other approaches that we won't examine. PROC PROBIT offers the same menu of choices, but slightly different output; PROC GENMOD offers a still wider range and variety of probability models. In each case, then, the regression coefficients express the partial linear additive effects of unit increments in X on the linked function of Y. The major trick in logistic and probit regression is translating these coefficients into some more intelligible way of talking about effects.



SAS Code: The Data Step

 



SAS Code: The Procedure Step

 



Output

Univariate Procedure

Variable=AGE           AGE OF RESPONDENT

                 Moments
 N              1514  Sum Wgts       1514
 Mean       45.62616  Sum           69078
 Std Dev    17.80842  Variance   317.1397
 Skewness   0.524025  Kurtosis   -0.78556
 USS         3631596  CSS        479832.4
 CV         39.03116  Std Mean    0.45768
 T:Mean=0   99.68998  Pr>|T|       0.0001
 Num ^= 0       1514  Num > 0        1514
 M(Sign)         757  Pr>=|M|      0.0001
 Sgn Rank   573427.5  Pr>=|S|      0.0001

            Quantiles(Def=5)
 100% Max        89       99%        85
  75% Q3         60       95%        78
  50% Med        41       90%        73
  25% Q1         32       10%        24
   0% Min        18        5%        22
                           1%        20
 Range           71
 Q3-Q1           28
 Mode            35

                 Extremes
    Lowest    Obs     Highest    Obs
        18(    1188)       89(     721)
        18(     825)       89(     770)
        18(     394)       89(     946)
        19(    1134)       89(    1103)
        19(    1117)       89(    1264)


Variable=EDUC          HIGHEST YEAR OF SCHOOL COMPLETED

                 Moments
 N              1510  Sum Wgts       1510
 Mean       12.88411  Sum           19455
 Std Dev    2.984022  Variance   8.904386
 Skewness   -0.16815  Kurtosis   0.709803
 USS          264097  CSS        13436.72
 CV         23.16049  Std Mean   0.076792
 T:Mean=0   167.7802  Pr>|T|       0.0001
 Num ^= 0       1508  Num > 0        1508
 M(Sign)         754  Pr>=|M|      0.0001
 Sgn Rank     568893  Pr>=|S|      0.0001

            Quantiles(Def=5)
 100% Max        20       99%        20
  75% Q3         15       95%        18
  50% Med        12       90%        16
  25% Q1         12       10%         9
   0% Min         0        5%         8
                           1%         5
 Range           20
 Q3-Q1            3
 Mode            12

                 Extremes
    Lowest    Obs     Highest    Obs
         0(     946)       20(     831)
         0(     868)       20(     887)
         3(    1487)       20(     987)
         3(     951)       20(    1008)
         3(     747)       20(    1334)

The LOGISTIC Procedure

Data Set: OUT1.TES2
Response Variable: P
Response Levels: 2
Number of Observations: 1508

The Logit or Logistic Regression Model

Link Function: Logit


     Response Profile
Ordered
  Value       P     Count

      1       0       246
      2       1      1262

   Model Fitting Information and Testing Global Null Hypothesis BETA=0

                            Intercept
              Intercept        and
Criterion       Only       Covariates    Chi-Square for Covariates
AIC            1343.589      1333.898         .
SC             1348.907      1349.854         .
-2 LOG L       1341.589      1327.898       13.690 with 2 DF (p=0.0011)
Score              .             .          13.373 with 2 DF (p=0.0012)

                 Analysis of Maximum Likelihood Estimates
            Parameter Standard    Wald       Pr >    Standardized     Odds
Variable DF  Estimate   Error  Chi-Square Chi-Square   Estimate      Ratio

INTERCPT 1    -0.4427   0.4075     1.1804     0.2773            .     .
AGE      1    -0.0150  0.00424    12.5602     0.0004    -0.147717    0.985
EDUC     1    -0.0411   0.0247     2.7599     0.0967    -0.067665    0.960

Association of Predicted Probabilities and Observed Responses
 Concordant = 56.1%          Somers' D = 0.137
 Discordant = 42.4%          Gamma     = 0.140
 Tied       =  1.5%          Tau-a     = 0.038
 (310452 pairs)              c         = 0.569

Classification Table
          Correct      Incorrect                Percentages
       ------------  ------------  -------------------------------------
 Prob          Non-          Non-           Sensi-  Speci-  False  False
Level  Event  Event  Event  Event  Correct  tivity  ficity   POS    NEG
------------------------------------------------------------------------
0.164    137    647    615    109     52.0    55.7    51.3   81.8   14.4

The Probit (or Normit) Model

Link Function: Normit

   Model Fitting Information and Testing Global Null Hypothesis BETA=0

                            Intercept
              Intercept        and
Criterion       Only       Covariates    Chi-Square for Covariates

AIC            1343.589      1333.955         .
SC             1348.907      1349.910         .
-2 LOG L       1341.589      1327.955       13.634 with 2 DF (p=0.0011)
Score              .             .          13.373 with 2 DF (p=0.0012)

                    Analysis of Maximum Likelihood Estimates
                Parameter   Standard      Wald         Pr >      Standardized
Variable   DF    Estimate     Error    Chi-Square   Chi-Square     Estimate

INTERCPT   1      -0.3349     0.2263       2.1907       0.1389              .
AGE        1     -0.00829    0.00232      12.8161       0.0003      -0.147731
EDUC       1      -0.0216     0.0136       2.5238       0.1121      -0.064604

Association of Predicted Probabilities and Observed Responses
 Concordant = 56.1%          Somers' D = 0.138
 Discordant = 42.3%          Gamma     = 0.141
 Tied       =  1.5%          Tau-a     = 0.038
 (310452 pairs)              c         = 0.569

                          Classification Table
          Correct      Incorrect                Percentages
       ------------  ------------  -------------------------------------
 Prob          Non-          Non-           Sensi-  Speci-  False  False
Level  Event  Event  Event  Event  Correct  tivity  ficity   POS    NEG
------------------------------------------------------------------------
0.164    142    646    616    104     52.3    57.7    51.2   81.3   13.9
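The percentage columns of the two classification tables can be reproduced from the four counts in each row. The sketch below is one way to do so; the function name and dictionary keys are illustrative, and the formulas are inferred from the printed values rather than taken from SAS documentation.

```python
def classification_percentages(correct_event, correct_nonevent,
                               incorrect_event, incorrect_nonevent):
    """Percentages for one row of a classification table.

    'Incorrect event' means a non-event predicted to be an event;
    'incorrect non-event' means an event predicted to be a non-event.
    """
    total = (correct_event + correct_nonevent
             + incorrect_event + incorrect_nonevent)
    return {
        "correct": 100.0 * (correct_event + correct_nonevent) / total,
        "sensitivity": 100.0 * correct_event / (correct_event + incorrect_nonevent),
        "specificity": 100.0 * correct_nonevent / (correct_nonevent + incorrect_event),
        "false_pos": 100.0 * incorrect_event / (correct_event + incorrect_event),
        "false_neg": 100.0 * incorrect_nonevent / (correct_nonevent + incorrect_nonevent),
    }


# Counts copied from the logit and probit classification tables above.
logit_row = classification_percentages(137, 647, 615, 109)
probit_row = classification_percentages(142, 646, 616, 104)

for label, row in (("logit", logit_row), ("probit", probit_row)):
    print(label, {name: round(value, 1) for name, value in row.items()})
```

Note that over 80% of the cases predicted to be events are false positives in both models, which is worth keeping in mind when reading the "Correct" column.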



Interpretation

The first step in interpretation is to test and describe the overall goodness of fit of the model. In ML approaches, a common method is to compare the -2 log likelihood (-2ll) of the model under the constraint that all slope coefficients are zero with the -2ll of the model in which the coefficients are estimated from the sample data. The reduction in "badness of fit" that results from freeing a parameter for each X can be tested as a chi-square statistic with as many degrees of freedom as freed parameters. In the logistic model, the -2ll of the "intercept only" (slopes constrained to zero) model is 1342. The model with the two independent variables has a -2ll of 1328. The difference, or "goodness of fit," is 13.69 with 2 d.f., significant at p < .01. The results for the probit are very similar (as is almost always the case, unless the probabilities being predicted are very small or very large).
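The likelihood ratio test can be checked by hand. The sketch below rebuilds the intercept-only -2 log likelihood from the two counts in the Response Profile, then forms the chi-square for the logit model and its p-value. The variable names are illustrative, and the p-value uses the closed form exp(-x/2), which holds only for 2 degrees of freedom.

```python
import math

# Counts from the Response Profile in the output.
n_first, n_second = 246, 1262
n = n_first + n_second

# With no predictors, the ML estimate of the probability is the sample proportion.
p0 = n_first / n
neg2ll_intercept_only = -2.0 * (n_first * math.log(p0)
                                + n_second * math.log(1.0 - p0))

# -2 LOG L for the intercept-only and fitted logit models, as printed by SAS.
neg2ll_null_printed = 1341.589
neg2ll_logit_printed = 1327.898

lr_chi_square = neg2ll_null_printed - neg2ll_logit_printed
df = 2  # one freed slope each for AGE and EDUC

# Upper tail probability of a chi-square with exactly 2 d.f.
p_value = math.exp(-lr_chi_square / 2.0)

print(f"intercept-only -2ll from counts: {neg2ll_intercept_only:.3f}")
print(f"LR chi-square: {lr_chi_square:.3f} with {df} d.f., p = {p_value:.4f}")
```

The reconstructed intercept-only value agrees with the printed 1341.589, and the p-value agrees with the 0.0011 reported in the output.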

There is no exact analog of the R2 of OLS regression for ML estimators. Several measures, however, can be quite useful. One very simple approach is to examine the size of the chi-square goodness of fit relative to the -2ll of the intercept only model. For the logit model, the intercept only model has a -2ll of 1342. The final model has a -2ll of 1328. The improvement, then, is 13.69 units. This might be expressed in ratio to the original "badness of fit," giving a ratio of .010, or about 1%. The same calculation for the probit gives a reduction in error of 13.63 units, or also about 1%. With large sample sizes, this approach to understanding goodness of fit can be misleading, as the value of chi-square is sample size dependent. An alternative, normed, measure or "pseudo R-squared" can be calculated as the -2ll divided by the sample size plus the -2ll. For the logit model, this calculation yields .468; for the probit, it also yields .468.

Two other approaches are also sometimes used. One can calculate measures of association between the predictions of the model and actual outcomes at each level of the X vector. Where there are few "ties" (as often occurs when the predictors are continuous), gamma can be calculated as a measure of association. Here, gamma for the logit is .14, and for the probit it is .141. Where the predictors are categorical, and more ties occur, a more conservative measure like Tau might be preferred.

One can also use the model to make categorical predictions for each case, and compare the number of correct and incorrect predictions, false positives, false negatives, and the like. This approach may give the false impression that the model is making relatively accurate predictions when the outcome is not equiprobable, unless prior probabilities for the predictions are set (as we attempted to do in the output, but failed to execute properly).
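The summary ratios discussed above are simple arithmetic on the printed -2 LOG L values. The sketch below computes them for both models. The function and key names are illustrative. The last quantity, which puts the model chi-square rather than the -2ll in the numerator, is not from the text; it is included only as an assumed variant for comparison.

```python
N_CASES = 1508
NEG2LL_NULL = 1341.589
NEG2LL_MODEL = {"logit": 1327.898, "probit": 1327.955}


def fit_summaries(neg2ll_null, neg2ll_model, n):
    """Return the fit ratios described in the text for one model."""
    improvement = neg2ll_null - neg2ll_model
    return {
        # Model chi-square: the reduction in "badness of fit".
        "improvement": improvement,
        # Improvement as a share of the intercept-only -2ll.
        "proportional_reduction": improvement / neg2ll_null,
        # The normed measure exactly as the text describes it.
        "normed_as_described": neg2ll_model / (n + neg2ll_model),
        # Assumed variant (not in the text): chi-square in the numerator.
        "normed_chi_square_variant": improvement / (n + improvement),
    }


summaries = {name: fit_summaries(NEG2LL_NULL, value, N_CASES)
             for name, value in NEG2LL_MODEL.items()}

for name, result in summaries.items():
    print(name, {key: round(value, 3) for key, value in result.items()})
```

The two normed quantities behave very differently: the one built from the fitted model's -2ll gets smaller as fit improves, while the chi-square variant gets larger, so it matters which is being reported.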

The somewhat more challenging part of working with the logit and probit (and probability models generally), is finding a useful way to discuss the partial slope coefficients. There are several approaches to this problem. And, depending on the needs of the analysis, and the audience, each is perfectly acceptable.

The Logit Model:

The regression prediction equation is given as Y = -.4427 -.015(Age) -.0411(Educ), where Y is the predicted log odds, and may be interpreted directly. One would say: "the predicted log of the odds that a person is a liberal, versus any other identification, is -.4427 when that person has an age of zero and an education of zero. Each year of age reduces the log odds that a person will be a liberal by .015 units; each year of education reduces the log odds that a person is a liberal by .0411." This interpretation is good in that it fairly clearly indicates the direction of the effects, and the relative magnitudes in raw form. However, statements about effects on the log of the odds of the dependent variable do not mean much to most audiences, who don't think in this metric.

An alternative is to speak about effects on the odds, rather than effects on the log odds. Most people do have an intuitive feel for statements about odds. To do so, we take the antilog (exp function) of the regression coefficients. In doing so, however, we have turned the regression function into a multiplicative model in the odds, rather than an additive model in the log odds. One would say: "The odds that a person who is zero years of age and has zero years of education is a liberal are .642 to 1.00. For each year of age, these odds are multiplied by .985 (or, the odds are reduced by about 1.5%); for each year of education, the odds that a person is a liberal are multiplied by .960 (or reduced by about 4%)." This is, to many folks, a more intelligible way of talking about the direction and magnitude of effects.
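The conversion from log odds to odds is a single exponentiation per coefficient. A minimal sketch, using the logit estimates as printed in the output (the dictionary names are illustrative):

```python
import math

# Logit estimates as printed in the Analysis of Maximum Likelihood Estimates.
logit_coefficients = {"INTERCPT": -0.4427, "AGE": -0.0150, "EDUC": -0.0411}

# Antilog of each coefficient: baseline odds for the intercept,
# multiplicative effects on the odds for the slopes.
odds_multipliers = {name: math.exp(b) for name, b in logit_coefficients.items()}

# Percent change in the odds per unit increase in X.
percent_change = {name: 100.0 * (odds_multipliers[name] - 1.0)
                  for name in ("AGE", "EDUC")}

for name, value in odds_multipliers.items():
    print(f"{name:8s} exp(b) = {value:.3f}")
for name, value in percent_change.items():
    print(f"{name:8s} change in odds per unit = {value:+.2f}%")
```

The slope multipliers match the Odds Ratio column of the SAS output (.985 and .960).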

One can go a step further, and make statements about the effects of a unit change in each X (or a one standard deviation change in X) on the probability that a person is a liberal. This is probably the clearest and most obvious way of describing effects. However, since the model is non-linear in the probabilities, we must choose a reference point for calculating the effects of unit changes in X. The most common choice of a reference point is a person who has the sample mean values on all X variables. To calculate the probability that such a person is a liberal, one substitutes the sample mean values into the regression equation, and calculates the predicted log odds. The antilog of this predicted value, divided by 1 plus the antilog of this predicted value, gives the predicted probability for a specific vector of X values. In our model, this predicted value is .1609. That is, a person who is at the sample mean values on age (45.63 years) and education (12.88 years) has a predicted probability of being liberal of .1609, or about 16%.
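The calculation just described can be sketched in a few lines. The coefficients are the rounded values printed in the output, and the means come from the univariate listing (which covers slightly more cases than the 1508 used to fit the model), so the result may differ slightly from the figure quoted in the text. The function and variable names are illustrative.

```python
import math

# Rounded logit estimates as printed in the output.
B0, B_AGE, B_EDUC = -0.4427, -0.0150, -0.0411

# Sample means from the univariate listing.
MEAN_AGE, MEAN_EDUC = 45.62616, 12.88411


def predicted_probability(age, educ):
    """Predicted probability from the logit equation."""
    log_odds = B0 + B_AGE * age + B_EDUC * educ
    odds = math.exp(log_odds)        # antilog of the predicted log odds
    return odds / (1.0 + odds)       # odds converted to a probability


p_at_means = predicted_probability(MEAN_AGE, MEAN_EDUC)
print(f"predicted probability at the sample means: {p_at_means:.4f}")
```

With these inputs the predicted probability is about .160, in line with the roughly 16% of respondents in the smaller response category.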

This is a useful reference point, because we can now ask: how much different would the probability be if this person (the one at the sample means) increased in age by one year, or increased in education by one year? One simply substitutes the new values into the prediction equation and calculates the probabilities. In our case, we find that a one year increase in age for a person at the sample means on all X variables, reduces the probability that the person will be a liberal by .0027, or .27%. A one year increase in education for a person at the sample means reduces the probability that they will be a liberal by .0055 or .55%. We could also express the change in probabilities of increasing by one standard deviation of each variable, from the sample means. In this case, for a person at the sample means of age and education, an increase of one standard deviation in age reduces the probability that they will be a liberal by .0334, or 3.34%; an increase of one standard deviation in education reduces the probability that they will be a liberal by .0159, or 1.59%.
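The substitution described above can be automated. The sketch below recomputes the change in predicted probability for a one year and a one standard deviation increase in each predictor, starting from the sample means. It uses the rounded printed coefficients and the univariate means and standard deviations, so its results can differ somewhat from the figures quoted in the text. All names are illustrative.

```python
import math

# Rounded logit estimates as printed in the output.
B0, B_AGE, B_EDUC = -0.4427, -0.0150, -0.0411

# Means and standard deviations from the univariate listing.
MEAN_AGE, SD_AGE = 45.62616, 17.80842
MEAN_EDUC, SD_EDUC = 12.88411, 2.984022


def predicted_probability(age, educ):
    """Predicted probability from the logit equation."""
    log_odds = B0 + B_AGE * age + B_EDUC * educ
    return 1.0 / (1.0 + math.exp(-log_odds))


baseline = predicted_probability(MEAN_AGE, MEAN_EDUC)

# Change in probability when one predictor moves and the other stays at its mean.
changes = {
    "age +1 year": predicted_probability(MEAN_AGE + 1, MEAN_EDUC) - baseline,
    "age +1 SD": predicted_probability(MEAN_AGE + SD_AGE, MEAN_EDUC) - baseline,
    "educ +1 year": predicted_probability(MEAN_AGE, MEAN_EDUC + 1) - baseline,
    "educ +1 SD": predicted_probability(MEAN_AGE, MEAN_EDUC + SD_EDUC) - baseline,
}

print(f"baseline probability: {baseline:.4f}")
for label, change in changes.items():
    print(f"{label:13s} change in probability = {change:+.4f}")
```

Because the model is non-linear in the probabilities, these changes apply only at the chosen reference point; a different starting vector of X values gives different numbers.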

A final possibility for expressing effects is to draw a picture. One can easily calculate the predicted probabilities of the dependent variable across the full range of values of each X, while holding the other Xs constant at their sample means (or some other reference value, if one wishes). The results are probably best shown as a graph, which will show the familiar "S" shaped probability function of the dependent variable, across the range of the X of interest, while holding other X constant at their most "typical" values. Such graphical displays can be quite compelling, and are accessible to most audiences.
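The points for such a graph are easy to generate. The sketch below tabulates the predicted probability across the observed range of age (18 to 89), holding education at its sample mean, and draws a crude text bar for each point instead of relying on a plotting library. The names and the ten-year step are illustrative choices.

```python
import math

# Rounded logit estimates as printed in the output.
B0, B_AGE, B_EDUC = -0.4427, -0.0150, -0.0411
MEAN_EDUC = 12.88411  # education held constant at its sample mean


def probability_curve(ages, educ=MEAN_EDUC):
    """Return (age, predicted probability) pairs for the given ages."""
    curve = []
    for age in ages:
        log_odds = B0 + B_AGE * age + B_EDUC * educ
        curve.append((age, 1.0 / (1.0 + math.exp(-log_odds))))
    return curve


# Ages 18, 28, ..., 88 span the observed range of the AGE variable.
curve = probability_curve(range(18, 90, 10))

for age, p in curve:
    bar = "#" * round(p * 100)   # one '#' per percentage point
    print(f"age {age:2d}  p = {p:.3f}  {bar}")
```

Over the observed ages only a gently declining segment of the "S" curve is visible, because the predicted probabilities stay well below .5.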

The Probit Model:

Interpretation of the probit coefficients is, in some senses, rather easier than the logit. The regression coefficients of the probit model are effects on the inverse cumulative normal function of the probability that Y = 1 (i.e. the probability that a person is a liberal). As such, they are already in a metric that many statistically trained folks understand: the metric of "Z" or standard normal scores.

Using this, one can interpret the coefficients directly. The results of our probit model are:

Y = -.3349 -.00829(Age) -.0216(Education)

One could state that: "The Z score of a person of age zero and education zero is -.3349. For each year of age, that Z score is reduced by .00829; for each year of education, the Z score is reduced by .0216."

It is straightforward to translate predicted probits (Z scores) into probabilities, using any table of the standard normal distribution. The probability that a person with zero age and education is a liberal is the probability associated with the Z score of -.3349, or about .369. That is, if there were such a person, they would have about a 37% chance of being a liberal. Again, it might be useful to calculate the "elasticities" or effects on the probabilities for a one unit change or a one standard deviation change in each X from the sample mean, holding the other X constant at the sample means. For age, a one year increase reduces the probability of liberalism by .0024; a standard deviation increase in age reduces it by .0340. For education, a one year increase reduces the probability of liberalism by .0049; a one standard deviation unit increase reduces the probability by .0165.
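In place of a printed table, the standard normal cumulative distribution function can be evaluated directly. The sketch below converts predicted probits into probabilities and recomputes the effects of unit and standard deviation changes from the sample means. It uses the rounded printed coefficients and the univariate means and standard deviations, so its results can differ somewhat from the figures quoted in the text. All names are illustrative.

```python
from statistics import NormalDist

# Rounded probit estimates as printed in the output.
B0, B_AGE, B_EDUC = -0.3349, -0.00829, -0.0216

# Means and standard deviations from the univariate listing.
MEAN_AGE, SD_AGE = 45.62616, 17.80842
MEAN_EDUC, SD_EDUC = 12.88411, 2.984022

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def probit_probability(age, educ):
    """Cumulative normal probability of the predicted Z score."""
    z = B0 + B_AGE * age + B_EDUC * educ
    return STANDARD_NORMAL.cdf(z)


p_zero = probit_probability(0, 0)
p_means = probit_probability(MEAN_AGE, MEAN_EDUC)

probit_changes = {
    "age +1 year": probit_probability(MEAN_AGE + 1, MEAN_EDUC) - p_means,
    "age +1 SD": probit_probability(MEAN_AGE + SD_AGE, MEAN_EDUC) - p_means,
    "educ +1 year": probit_probability(MEAN_AGE, MEAN_EDUC + 1) - p_means,
    "educ +1 SD": probit_probability(MEAN_AGE, MEAN_EDUC + SD_EDUC) - p_means,
}

print(f"P(liberal) at age 0, education 0: {p_zero:.4f}")
print(f"P(liberal) at the sample means:   {p_means:.4f}")
for label, change in probit_changes.items():
    print(f"{label:13s} change in probability = {change:+.4f}")
```

The probability at the sample means is close to the one implied by the logit equation, which is consistent with the two models fitting these data almost identically.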

Again, of course, graphs could be used to present the predicted probabilities of Y for any set of X values that one might choose.



Conclusion

The logit and probit models, estimated by ML, are useful tools where the dependent variable is qualitative (we have examined the binary case, but multiple values and ordered values can also be modeled). By using ML, these models overcome the problems of violation of assumptions in OLS approaches. The models are non-linear in the probability of Y as a function of linear additive change in X. This solves the problem of predicted probabilities less than zero or greater than one in the linear probability model (which can also be estimated by ML). The non-linearity in the probabilities does result in some complexity in interpretation of the slope coefficients. As we have shown, however, this is not too troublesome.


