Multivariate Analysis: Sociology 203A

Graphical Conventions for Causal Models

This page is an annotated example of causal ("path") diagraming conventions used in multivariate statistical analysis. It supports the introductory graduate course offered in the Department of Sociology at the University of California, Riverside. The course is taught by Robert A. Hanneman. Similar material is covered at a less advanced level in the pages for Sociology 110C. Your comments and suggestions are very welcome.
Introduction:

Most analyses of quantitative data are guided by theories (statements of causal relations among general concepts) that lead us to make specific predictions about what ought to be observed in patterns of association and partial association among particular variables in a particular sample (the multivariate hypothesis or model). It is usually a good idea for the researcher to use some formalism (like a propositional inventory or diagram) rather than "everyday" language to express the model.

For many multivariate hypotheses (we shall use the term "models" interchangeably, here) the "causal" or "path" diagram is a very useful way of displaying the model. The causal model is extremely useful because: 1) it leaves little ambiguity between the analyst and the consumer about what is being said and 2) it guides the analyst in determining which analyses to conduct. With as few as four variables, the number of logically possible associations and partial associations that could be analyzed becomes very large. Taking the time to formulate a clear model at the beginning of analysis can often save substantial amounts of work later.

Causal or path models are methods for asserting causal relations among variables (or more properly, among the concepts of which the variables observed are realizations in a sample). Causal models can state very complicated hypotheses, but they are built up out of very simple elements. The discussions below deal with the definitions and conventions used in constructing causal models. They are followed by several brief examples.

The Elements of a Causal Model:

Asserting a causal relation

• Exogenous and endogenous
• Independent and dependent
• Residual or error
• Observed and unobserved
• Direct
• Indirect
• Covariation
• Interaction
• Reciprocal effects

Example 1: Blau and Duncan on status attainment

Example 2: An extension of the model, adding interaction

Example 3: A model with latent (or unobserved or factor) variables

Asserting a causal relation

The notion of "causal" analysis is epistemologically controversial. In practice, however, it is commonplace for analysts to assert that changes in one variable produce (by some mechanism, often not observable) a consequent change in another variable. If we wish to assert that changes in X are causes of changes in Y, we must seek to satisfy three conditions.

X and Y covary: If changes in Y are random with respect to X, then X cannot be a cause of change in Y. It is not necessary that X be the only source of change in Y, or that the covariation be perfect. Indeed, most causal relations in social science theories are believed to be stochastic.

Change in X precedes change in Y. In the western way of thinking, causes come before effects in linear time. This may not be "true," but it is a fairly useful notion, pragmatically. In practice, this can get pretty tricky, because the assertion of causal order among variables in a study depends not only on the logic of the causal mechanism, but also on the timing of the measurement of the variables in the design.

The covariation between X and Y is not produced by some other variable Z. (Non-spuriousness). X and Y may covary, and X may precede Y in time; but if Z precedes them both, and causes both, the covariation between X and Y does not indicate a causal relation. This condition is also difficult to satisfy in practice. Causal models are always regarded as tentative, and subject to rejection if a new "Z" can be found.
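The non-spuriousness condition can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the data are simulated, the effect sizes are invented, and the helper names `pearson` and `partial_corr` are hypothetical. Z causes both X and Y, and X has no effect on Y at all; X and Y nevertheless covary, and the association largely disappears once Z is controlled.

```python
import random
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(va * vb)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]    # common cause
x = [zi + rng.gauss(0, 1) for zi in z]     # X depends on Z only
y = [zi + rng.gauss(0, 1) for zi in z]     # Y depends on Z only, not on X

r_xy = pearson(x, y)                  # sizeable, but spurious
r_xy_given_z = partial_corr(x, y, z)  # close to zero

print(f"r(X,Y) = {r_xy:.2f}; r(X,Y | Z) = {r_xy_given_z:.2f}")
```

A partial correlation near zero does not prove that Z is the cause; it only shows that the observed X-Y covariation is consistent with a model in which Z produces it.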

Variables

Models, and our diagrams to display them, are composed of variables and relations among variables. Variables are represented by their names or by letters. FAED, for example, might be used in a causal diagram to represent the variable "father's formal educational attainment, measured in years of schooling completed." Variables can play several different roles in a model, and there is a language for describing different types of variables.

### Exogenous and endogenous

Exogenous variables are variables that are not caused by any other variables in the model. That is, their causes are exogenous to the model. Endogenous variables, conversely, are caused by one or more of the other variables in the model. Many models contain two types of exogenous variables: 1) substantive variables which cause other variables, but whose causes are not specified in the model and 2) residuals. In diagrams, the former type of exogenous variables are usually placed at the left side of the diagram; the latter type of exogenous variable (residuals) are placed near the endogenous variable of which they are a cause. Every variable in a causal diagram may be classified as either exogenous or endogenous.

### Independent and dependent

Independent variables are variables that are causes of other variables in the model. Dependent variables are those that are caused by other variables in the model. Exogenous variables are therefore also independent variables. Endogenous variables are therefore dependent variables. But, not all independent variables are exogenous. If a variable is a cause of another in the model it is an independent variable with respect to that other; it may, however, also be caused by other variables. That is, confusingly, a variable can be an independent variable with respect to some other; and a dependent variable with respect to another.
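One way to make these definitions concrete is to store a diagram as a list of arrows and read each variable's roles off that list. The sketch below is a minimal illustration, not a standard tool: the three-variable model, the residual names `e_y` and `e_z`, and the function `classify` are all assumptions made for the example.

```python
# Each pair (source, target) stands for one arrow "source -> target".
# Hypothetical model: X -> Y -> Z, X -> Z, plus a residual on each
# endogenous variable.
edges = [("X", "Y"), ("X", "Z"), ("Y", "Z"), ("e_y", "Y"), ("e_z", "Z")]

def classify(edges):
    """Return the roles each variable plays, given the arrows in the model."""
    sources = {s for s, _ in edges}   # variables that send an arrow
    targets = {t for _, t in edges}   # variables that receive an arrow
    roles = {}
    for v in sorted(sources | targets):
        roles[v] = {
            "exogenous": v not in targets,   # no causes inside the model
            "endogenous": v in targets,      # at least one cause in the model
            "independent": v in sources,     # a cause of something
            "dependent": v in targets,       # caused by something
        }
    return roles

roles = classify(edges)
for name, r in roles.items():
    print(name, [k for k, flag in r.items() if flag])
```

In this example Y comes out as endogenous, dependent (with respect to X) and independent (with respect to Z), which is the dual role described above.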

### Residual or error

Residuals or errors are exogenous independent variables, but of a special type. Residuals reflect "other" or "unspecified" causes of variability in dependent variables. Residuals are not directly measured. Rather, they are inferred, as that which is "left over" after we have made our best effort with measured variables to account for the variability in dependent variables. In using statistical methods to estimate the parameters of models in samples, it is usually assumed that residuals have a mean of zero, are normally distributed and (as most causal diagrams indicate) uncorrelated with other variables in the model (sometimes, two residuals may be hypothesized to be correlated with each other).
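The idea of a residual as "what is left over" can be shown with a one-predictor least-squares fit. This is a sketch with made-up numbers, and `fit_line` is a hypothetical helper. Note one caveat: least-squares residuals have a mean of zero and are uncorrelated with the predictor in the sample by construction, so the code illustrates what the assumptions look like rather than testing whether they hold for the unobserved causes in the population.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# Made-up illustrative scores, not data from any study.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.3, 3.8, 6.4, 7.7, 10.2, 11.6]

intercept, slope = fit_line(x, y)

# Residual = observed score minus the score predicted from X.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

mean_x = sum(x) / len(x)
mean_resid = sum(residuals) / len(residuals)
cov_resid_x = sum(r * (xi - mean_x) for r, xi in zip(residuals, x)) / len(x)

print(f"mean residual = {mean_resid:.1e}; cov(residual, X) = {cov_resid_x:.1e}")
```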

### Observed and unobserved

The variables in a model may be either observed or unobserved. Observed variables are usually represented by placing the variable name in a box; unobserved variables are usually represented by placing the variable name in a circle or ellipse. Observed variables are constructed by applying some instrument to a phenomenon and recording the result; unobserved variables are created as constructs (or indexes or factors) from observed variables.

Effects

Variables are connected together by effects to form the full model. That is, the variables form the nodes, and effects specify the relations among the variables. Together, the variables and effects form the model.

There are only two types of relations among variables that are specified in causal diagraming: direct or "causal" and covariation or "correlational." Direct or causal effects are also sometimes combined into compound effects. Two of these compounded forms are used so commonly that they are named: "indirect" and "interaction."

### Direct

A direct effect is the basic building block of a causal model. A direct effect is specified by drawing an arrow from an independent variable (exogenous or endogenous, observed or unobserved) to a dependent variable. The direct effect represents the hypothesis that a change in the independent or source variable will result in a subsequent change in the dependent or receiving variable, regardless of the levels or actions of any other variables in the model. Whenever possible, the elements of a model are arrayed so that direct effects move from left to right, and from top to bottom of the diagram (it is not always possible to strictly observe this convention). In this diagram, Y is shown as being caused by the action of two variables: X and the residual e_y.

### Covariation

A relation of covariation or correlation between variables is represented by a curved double-headed arrow connecting the two variable labels. Relations of covariation are specified only among exogenous variables in causal models. The correlations that are observable between endogenous variables are to be accounted for, or produced by, the actions of other variables in the model. In this diagram, two exogenous variables are shown to covary, as do the residuals associated with two endogenous variables.

### Indirect

Indirect effects are "compound" effects -- that is, they involve more than one independent variable. X is said to have an indirect effect on Z when a change in X results in a change in Y, and any change in Y results in a change in Z. That is, X has an indirect effect on Z, via Y. The diagram shows a hypothesis in which X has two different (and independent) effects on Z: one is a direct effect ("a"); the other is an indirect effect (which is the effect of a change in X along the pathway "b" and "c").

In some methods of estimating parameters of causal models, the magnitude of indirect effects can be calculated by multiplying along the paths from the independent to the ultimate dependent variable (this is not true for all statistical estimates of effects). The notion is that a change of one unit in X will result in "b" units of change in Y; a change of one unit in Y results in "c" units of change in Z; therefore, a one-unit change in X results in b*c units of change in Z "indirectly."
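The multiplication rule can be written as a short routine that walks every directed path from one variable to another and multiplies the coefficients along the way. This is a sketch that assumes a linear model without feedback loops. The values for "a", "b" and "c" are invented for illustration, and `path_effects` is a hypothetical helper rather than a function from any statistical package.

```python
def path_effects(coefs, source, target):
    """Return (path, product of coefficients) for each directed path."""
    results = []

    def walk(node, path, product):
        if node == target:
            results.append((path, product))
            return
        for (s, t), weight in coefs.items():
            if s == node and t not in path:   # never revisit a variable
                walk(t, path + [t], product * weight)

    walk(source, [source], 1.0)
    return results

# Hypothetical coefficients: a = 0.3 (X -> Z), b = 0.5 (X -> Y), c = 0.4 (Y -> Z)
coefs = {("X", "Z"): 0.3, ("X", "Y"): 0.5, ("Y", "Z"): 0.4}

effects = path_effects(coefs, "X", "Z")
direct = sum(p for path, p in effects if len(path) == 2)     # one arrow: a
indirect = sum(p for path, p in effects if len(path) > 2)    # via Y: b * c
total = direct + indirect

print(f"direct = {direct:.2f}, indirect = {indirect:.2f}, total = {total:.2f}")
```

With these assumed values the direct effect is 0.30, the indirect effect is 0.5 * 0.4 = 0.20, and their sum, 0.50, is the total effect of X on Z.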

### Interaction

A statistical interaction is said to exist when the effect of an independent variable on a dependent variable differs across levels of a third (or control) variable. Suppose, for example, the variable Z has two levels. We calculate the association between X and Y for Z=1, and separately calculate the association between X and Y for Z=2. If the two "parts" of the association between X and Y, controlling Z, differ, then statistical interaction exists. There is no single standard way of representing interaction in causal diagrams. One method shows an arrow going from the "control" variable to the effect connecting an independent to a dependent variable. Here are two examples:

In the panel at the left, the hypothesis states that (in addition to direct and indirect effects) the effect of Y on Z differs across, or depends on the level of X. In the panel at the right, the hypothesis is that the effect of X on Z depends on the level of Y. Note that, in both examples, direct effects of each of the independent variables are shown in addition to their interaction. This is not strictly necessary, as effects may be hypothesized to exist only in the presence of other variables. It is more common for theories to propose that X and Y each have effects on Z independently of one another, and that, in addition there is an effect in common. This form of interaction is termed a "hierarchical" interaction.
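The check described at the start of this section (calculate the association between X and Y separately within each level of Z, then compare) can be sketched with simulated data. Everything here is assumed for illustration: the slopes of 2.0 and 0.5 are invented, and `slope` is a hypothetical helper that returns a bivariate least-squares slope.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(1)   # fixed seed so the run is repeatable
rows = []
for _ in range(2000):
    z = rng.choice([1, 2])
    x = rng.gauss(0, 1)
    effect = 2.0 if z == 1 else 0.5       # the effect of X depends on Z
    y = effect * x + rng.gauss(0, 1)
    rows.append((x, y, z))

# Estimate the X -> Y slope separately within each level of Z.
slopes = {}
for level in (1, 2):
    xs = [x for x, y, z in rows if z == level]
    ys = [y for x, y, z in rows if z == level]
    slopes[level] = slope(xs, ys)

print({level: round(s, 2) for level, s in slopes.items()})
```

A clear gap between the two conditional slopes is the signature of statistical interaction. In practice the same hypothesis is often tested by adding a product term (X times Z) to a single regression.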

### Reciprocal effects

All too often, the research designs we utilize do not allow us to disentangle the causal ordering of variables. In very many processes that we seek to model, there are "feedback" effects that operate over time, making the causal ordering of scores on two variables unclear, if they are measured simultaneously. For example, if we did a cross sectional survey and asked women to report how many years they have worked outside the home for pay, and also how many children they have, we would have a problem in building a simple causal model. The scores that we observe at the time that we did the study are the result (probably) of fertility and labor force decisions that have "fed back" on one another over the subject's history. In cases like this, we may wish to indicate reciprocal effects between the variables.

Suppose the variable F was the number of children to whom a woman had given birth; W was the number of years of full time work experience outside the home since completing education; S was the number of siblings the woman had in her family of origin; A was the woman's chronological age; E was the woman's years of completed schooling and M was the occupational prestige of the woman's mother. No correlations are shown between S and E or between A and M, just to keep the diagram readable. In this case, the model asserts "feedback" or reciprocal causation between F and W. The arguments might be that each additional child reduces work experience due to birthing and child care responsibilities; each unit of W may reduce F by providing material and status rewards to the woman outside of the "mother" role. The result of these two effects is a "positive feedback" cycle where work experience reduces fertility which increases work experience, etc.
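The claim that two negative effects combine into a "positive feedback" cycle can be checked with a little algebra. The sketch below is a deliberately simplified, static version of the F and W model: it uses two linear equations, F = f0 + a*W and W = w0 + b*F, and ignores S, A, E and M except as they are folded into the intercepts. The coefficient values and the function `equilibrium` are assumptions made purely for illustration.

```python
def equilibrium(f0, w0, a, b):
    """Solve F = f0 + a*W and W = w0 + b*F simultaneously."""
    denom = 1.0 - a * b
    if denom == 0:
        raise ValueError("a*b equals 1: the two equations have no unique solution")
    f = (f0 + a * w0) / denom
    w = (w0 + b * f0) / denom
    return f, w

a = -0.2   # assumed effect of each year of work experience on fertility
b = -0.5   # assumed effect of each child on years of work experience

f_before, w_before = equilibrium(f0=3.0, w0=10.0, a=a, b=b)
# An outside change adds one year of work experience "up front".
f_after, w_after = equilibrium(f0=3.0, w0=11.0, a=a, b=b)

change_in_w = w_after - w_before   # more than 1, because of the feedback
change_in_f = f_after - f_before   # negative

print(f"change in W = {change_in_w:.3f}; change in F = {change_in_f:.3f}")
```

Because a*b is positive, the loop amplifies the initial change: one extra year of work ends up as 1 / (1 - a*b), or about 1.11, years once the reduction in fertility has fed back on work experience.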

Example 1: Blau and Duncan on status attainment

The techniques of causal diagraming and "path analysis" had been used in other disciplines for some time before they were adopted by sociologists. Among the early adopters in sociology were Peter Blau and Otis Dudley Duncan, whose The American Occupational Structure used the technique extensively and made it very popular.

Blau and Duncan analyzed data collected from a sample of adult male children and their families. Among (many) other issues, Blau and Duncan proposed a very simple model of the occupational attainment process. One version is shown in the following diagram:

The model shows measures of two aspects of a (male) person's social class of origin: the occupational prestige and education of the father. These are fixed at birth or "ascriptive" status traits, and are treated as exogenous (they could be causally ordered, but we don't really care about the dynamics of the father, for current purposes). The son's education is treated as an intervening variable, and prior in time to the son's occupational prestige. The son's education is an "achieved" status characteristic -- not determined by birth. Blau and Duncan's major interest focused on whether ascriptive or achieved characteristics were more important in determining status outcomes, and whether the effects of ascriptive traits were primarily direct or indirect.

Each of the proposed direct effects in the model (other than residual effects) needs to have a theoretical statement about the mechanism that is supposed to be generating the observable association. Higher father's education may affect son's education by providing the child with higher expectations, and support for attainment. Higher father's occupational prestige may contribute to higher educational expectations for the child, as well as providing access to financial resources to support higher education. Father's occupational prestige is shown as directly affecting son's occupational prestige, independent of the son's education. It is supposed that fathers with higher occupational prestige may be able to pass property directly to children, support investment in the child's career development, provide connections, and perhaps provide higher expectations for the child's attainment that motivate the son's occupational attainment.

One of the key insights of the Blau-Duncan model is that the effects of ascriptive characteristics on the son's success may be both direct and indirect. Father's occupational attainment is hypothesized to confer a number of direct advantages. It also contributes to success indirectly by increasing the son's education -- which, in turn, furthers success. To answer the question of whether ascriptive or achieved characteristics are more important, both the direct and indirect effects of parental status needed to be taken into account. If only direct effects were considered, Blau and Duncan discovered, achieved characteristics appeared to be much more important. If both direct and indirect effects were considered, ascriptive and achieved status characteristics appeared to be about equal in importance. Blau and Duncan also noted that the residuals in their results were quite large -- suggesting that occupational prestige attainment and educational attainment were substantially affected by variables other than (and uncorrelated with) the causes they considered.

Example 2: An extension of the model, adding interaction

The original Blau and Duncan status attainment model has been expanded and refined in many ways. Among these is its application to testing theories about the lower average levels of attainment of Blacks. While the literature on this subject is large and complex, we can gain some insights about using causal models by "elaborating" our model to include a variable representing whether a respondent is Black, or not.

We must first decide where the new variable fits in the chain of causal order in our model. Being Black is an ascriptive status, so the new variable needs to be treated as exogenous. The race of the son does not determine the occupational prestige or education of the father (though the father's race probably had such effects). So, it is probably most useful to simply add the variable B (for the race of the son, Black or not) as a third correlated exogenous variable.

Next, we must ask about the effects of the new variable on others in the model (and, if the new variable were endogenous, the effects of other variables on it):

Might the son's race have an effect on their educational attainment, independent of the father's social class? Many of the educational disadvantages of Blacks may be class, rather than race effects. But, geographical segregation, and perhaps direct discrimination in the quality of schooling provided could generate a direct effect of Black on son's education.

Might the son's race have an effect on the effects of father's education and occupational prestige on son's education? The presence of such interactions would be consistent with arguments that the social class of origin has more positive (or less positive) effects on the son's education for Black sons than non-Black sons. I can't immediately think of a compelling theory that would suggest that class effects on education are different for Blacks than for non-Blacks, so I choose to not hypothesize such effects. One might argue, however, that African-American culture (independent of class culture) places greater importance or lesser importance on education than non-African American cultures. The important point, for illustration, is that the diagraming causes us to ask these questions.

Might there be a direct effect of being Black on occupational prestige attainment, independent of the father's social class, and son's education? Most would argue that there is such an effect, and that it is generated by direct discrimination, and structural factors in the operation of labor markets (the locational decisions of firms, etc.).

Might the effects of educational attainment on occupational attainment for sons be different for Black and non-Black sons? Again, one might suppose that such interactions may exist. One view is that the education achieved by Blacks is regarded by employers as being qualitatively inferior, and hence is given less value in making hiring and promotion decisions. If such a hypothesis were true, then the positive effects of education on occupational prestige would be less for Blacks than for non-Blacks. Alternatively, because high levels of educational attainment are, on the average, less common for Blacks, the interaction effect could be positive. In a segregated labor market, demand for highly educated Blacks may be greater -- relative to the restricted supply -- than demand for highly educated non-Blacks. Note that the process of justifying the inclusion or exclusion of a causal effect should cause the analyst to not only explain why a (for example) positive effect is hypothesized, but also why a negative effect should not be.

Lastly, might the direct effect of father's occupation on son's occupation operate differently for Black and non-Black sons? To the extent that these effects refer to the father's ability to pass on property, contacts, or to set aspirations for sons, there is no obvious reason to expect an interaction. So, we will leave this out.

The resulting diagram for the elaborated model, then looks like:

Example 3: A model with latent variables

Many concepts are difficult to measure with high reliability and validity using a single indicator. Often researchers develop several measures that are intended to measure the same underlying or "latent" variable. The causal diagraming language can be used to express multivariate hypotheses about measurement or auxiliary models (the relationships between latent and manifest variables) and substantive models (the relationships among latent variables). Here is an abstract example of such a diagram that illustrates a number of the kinds of effects that sometimes appear in "latent variable" models.

Models with latent variables are not always as "messy" as the illustration. We've picked a fairly complicated example to provide examples of some common specification issues.

First, note that latent variables are shown inside of ellipses, and that manifest variables are shown inside rectangles. One could argue about what to do with residual terms. The most common approach is to represent them with neither ellipses nor rectangles, to make their special status clear.

Second, note that one of the exogenous causal variables (T) is treated as manifest, while the other (X) is treated as latent. The variable T might be, for example, a treatment group in an experiment, that was administered between Y1 and Y2 (a before-after design). We might reasonably regard T as measured without error. X might be some "covariate" measured at the beginning of the study, and viewed as affecting scores on the outcome variable (Y) both before and after treatment. X is shown as having three manifest variables or indicators (X1, X2, X3), each of which is unreliable. That is, each of the manifest variables also has a random error term. These error terms are shown as uncorrelated.
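The measurement part of such a model can be imitated with simulated data. In the sketch below a latent X is generated first, and three indicators are created by adding independent random error to it. All values are assumptions for illustration, and the simple average used as a composite is a stand-in for, not an example of, the factor-analytic estimates that latent variable software produces.

```python
import random
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(va * vb)

rng = random.Random(2)   # fixed seed so the run is repeatable
n = 4000
latent = [rng.gauss(0, 1) for _ in range(n)]   # unobserved in a real study

# Three unreliable indicators: latent score plus independent random error.
x1 = [v + rng.gauss(0, 1) for v in latent]
x2 = [v + rng.gauss(0, 1) for v in latent]
x3 = [v + rng.gauss(0, 1) for v in latent]

composite = [(p + q + r) / 3 for p, q, r in zip(x1, x2, x3)]

r_single = pearson(latent, x1)            # one indicator alone
r_composite = pearson(latent, composite)  # three indicators combined

# The error terms of two indicators should be (nearly) uncorrelated.
err1 = [i - v for i, v in zip(x1, latent)]
err2 = [i - v for i, v in zip(x2, latent)]
r_errors = pearson(err1, err2)

print(f"single: {r_single:.2f}, composite: {r_composite:.2f}, "
      f"errors: {r_errors:.2f}")
```

The combined indicators track the latent variable more closely than any single indicator does, which is the usual motivation for using several measures of the same concept.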

The "before" (Y1) and "after" (Y2) scores are each measured with three instruments, and each is regarded as not perfectly reliable (i.e. there are residuals on the manifest variables). These residuals, however, are shown as covarying. This often occurs when the same instrument (like a questionnaire) is used to assess scores at two or more points in time. The errors of measurement at time one are likely to have something in common with the errors of measurement at time two, due to item uniqueness or methods effects.

The latent endogenous variables (Y1 and Y2) are also shown with residuals. These terms are sometimes called "errors in equations" rather than "errors in variables." This is because each endogenous variable is predicted by an equation made up of effects of independent variables, and the residuals here are not due to measurement error, but rather to mis-specification (left out variables). Again, however, these residuals are shown as being correlated. The argument is that the variables that were "left out" when we did the pre-test may also have been "left out" when we did the post-test -- causing correlated prediction errors.