Chapter Summary

Multivariate analysis investigates the relationships among more than two variables.

Controlling for the variation in other variables allows you to begin making causal inferences and to rule out spurious relationships.

  • A spurious relationship is one in which the association between two variables is caused by a third.
     

In a multivariate cross-tabulation, you can control for a third variable by holding it constant. This is control by grouping: group the observations according to their values on the third variable, then observe the original relationship within each of these groups.
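
As a minimal sketch of control by grouping, the snippet below (Python with pandas; the variable names region, income, and vote and all data are invented placeholders) tabulates a hypothetical income-vote relationship and then re-tabulates it within each category of a third variable:

```python
# A sketch of control by grouping; all variable names and data are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "region": rng.choice(["North", "South"], n),
    "income": rng.choice(["Low", "High"], n),
    "vote":   rng.choice(["Yes", "No"], n),
})

# Original bivariate relationship: vote by income (row proportions).
print(pd.crosstab(df["income"], df["vote"], normalize="index"))

# Control by grouping: re-examine the same relationship within each region.
for region, group in df.groupby("region"):
    print(f"\nregion = {region}")
    print(pd.crosstab(group["income"], group["vote"], normalize="index"))
```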

Multiple regression analysis extends the bivariate regression analysis presented in Chapter 13 to include additional independent variables.

  • Both types of regression involve finding an equation that best fits or approximates the data and describes the relationship between the independent and dependent variables.
     

In a multiple regression, a coefficient indicates how much, and in what direction, the dependent variable Y changes with a one-unit increase in the independent variable X, controlling for all other variables in the model.
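
A minimal sketch of such a fit, assuming simulated data with made-up true coefficients:

```python
# A sketch of a multiple regression with statsmodels on simulated data;
# the true coefficients (1.5 and -0.8) are assumptions for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

# Each slope estimates the change in Y for a one-unit increase in that X,
# holding the other independent variable constant.
print(res.params)  # intercept, slope on x1 (~1.5), slope on x2 (~-0.8)
```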

Statistical significance can be determined through a t-test by dividing a regression coefficient by its standard error and comparing the observed t to a critical value.
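
A sketch of that calculation done by hand on simulated data, checking the result against a critical value from scipy:

```python
# A sketch of the t-test: t = b / se(b), compared against a two-tailed
# critical value at alpha = .05. Data are simulated for illustration.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(x)).fit()
t_obs = res.params[1] / res.bse[1]            # coefficient / standard error
t_crit = stats.t.ppf(0.975, df=res.df_resid)  # critical value from the t table
print(t_obs, t_crit, abs(t_obs) > t_crit)     # True -> statistically significant
```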

A dummy variable has two categories, generally coded 1 for the presence of a characteristic and 0 otherwise. Recoding a nominal-level variable as a dummy variable (or a set of dummy variables) allows the variable to be used in numerical analysis.
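
A minimal sketch of dummy coding with pandas; the party variable and its categories are invented:

```python
# A sketch of recoding a nominal variable into 0/1 dummy variables.
import pandas as pd

df = pd.DataFrame({"party": ["Dem", "Rep", "Ind", "Dem", "Rep"]})

# One 0/1 column per category; drop_first=True omits a reference
# category so the dummies are not perfectly collinear in a regression.
dummies = pd.get_dummies(df["party"], prefix="party", drop_first=True, dtype=int)
print(pd.concat([df, dummies], axis=1))
```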

One can measure an interaction to determine whether the relationship between two variables changes depending on the value of a third.
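
A minimal sketch of an interaction term, using statsmodels' formula interface and an invented binary moderator z:

```python
# A sketch of an interaction; the moderator z and the true slopes are
# invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({"x": rng.normal(size=n), "z": rng.integers(0, 2, size=n)})
# The effect of x on y depends on z: slope 1.0 when z = 0, 2.5 when z = 1.
df["y"] = 1.0 * df["x"] + 1.5 * df["x"] * df["z"] + rng.normal(size=n)

# "x * z" expands to x + z + x:z; the x:z coefficient is the interaction.
res = smf.ols("y ~ x * z", data=df).fit()
print(res.params)
```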

A standardized coefficient shows the partial effect of an X on Y in standard-deviation units. The larger the absolute value, the greater the effect of a one-standard-deviation change in X on the mean of Y, controlling for or holding constant the other variables.
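
One common way to obtain standardized coefficients is to z-score every variable before fitting; a minimal sketch on invented data:

```python
# A sketch of standardized (beta) coefficients via z-scoring; scales
# and names are invented.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(scale=10.0, size=n)   # deliberately different scales
x2 = rng.normal(scale=0.1, size=n)
y = 0.3 * x1 + 20.0 * x2 + rng.normal(size=n)

def zscore(a):
    return (a - a.mean()) / a.std()

X = sm.add_constant(np.column_stack([zscore(x1), zscore(x2)]))
res = sm.OLS(zscore(y), X).fit()

# Each slope is now the SD change in Y per one-SD change in that X,
# holding the other variable constant, so the magnitudes are comparable.
print(res.params[1:])
```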

Multiple R-squared is a measure of the goodness of fit of the model to the data. It is the ratio of the explained variation in the dependent variable to the total variation in the dependent variable; hence, it equals the proportion of the variance in the dependent variable that is explained by the set of independent variables.
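
A sketch that computes R-squared from this definition on simulated data and checks it against the value statsmodels reports:

```python
# A sketch of R-squared as explained variation over total variation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=150)
y = 2.0 * x + rng.normal(size=150)

res = sm.OLS(y, sm.add_constant(x)).fit()
ss_total = np.sum((y - y.mean()) ** 2)        # total variation in Y
ss_resid = np.sum(res.resid ** 2)             # unexplained variation
print(1 - ss_resid / ss_total, res.rsquared)  # the two values match
```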

Multiple regression can be used to test hypotheses through a t-test—comparing a t statistic with a critical value from the t table.

When the dependent variable is in binary form (only two categories, such as voted or did not vote), you must use a slightly different form of regression called the linear probability model, which estimates the probability of an outcome on the dependent variable.

  • The linear probability model, however, cannot be used to test hypotheses because it violates necessary assumptions (see the sketch after this list).
     
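A minimal sketch of a linear probability model on a simulated turnout variable (names and data are invented):

```python
# A sketch of a linear probability model: ordinary OLS on a 0/1 outcome.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 400
x = rng.normal(size=n)
voted = (x + rng.normal(size=n) > 0).astype(int)  # hypothetical binary DV

lpm = sm.OLS(voted, sm.add_constant(x)).fit()

# Fitted values are read as probabilities of Y = 1, but nothing confines
# them to the 0-1 interval, one symptom of the violated assumptions.
print(lpm.fittedvalues.min(), lpm.fittedvalues.max())
```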

A (nonlinear) logistic regression is usually a better choice for a binary dependent variable.
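
A minimal sketch of a logistic regression fit on simulated data, with assumed true coefficients:

```python
# A sketch of a logistic regression on a simulated binary outcome; the
# true coefficients (0.5 and 1.2) are assumptions for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 400
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))

res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(res.params)  # coefficients are on the log-odds scale
```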

A logistic regression is interpreted differently than a multiple regression. Because the model is nonlinear, the estimated effect of an independent variable on the probability that Y = 1 changes depending on the values at which the independent variables are set (such as the mean, or one standard deviation above the mean).

To interpret logistic regression coefficients, you must specify a value for each variable and use the resulting coefficients to predict the probability of Y=1. The coefficients, therefore, do not, on their own, indicate the magnitude of the relationship between an independent variable and a dependent variable, only the direction of the relationship.

  • To assess the magnitude of a relationship, you must calculate the predicted probability or odds ratio, or examine a graphical representation of the relationship (see the sketch after this list).
     
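A sketch of both calculations on simulated data: predicted probabilities at chosen values of x, and the odds ratio exp(b):

```python
# A sketch of predicted probabilities and odds ratios from a logit fit;
# the data and true coefficients are invented for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.normal(size=400)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))
res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

# Predicted probability of Y = 1 with x at its mean and one SD above it.
for val in (x.mean(), x.mean() + x.std()):
    print(val, res.predict(np.array([[1.0, val]])))

# exp(b) is the odds ratio: the multiplicative change in the odds of
# Y = 1 for a one-unit increase in x.
print(np.exp(res.params[1]))
```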

Goodness of fit for a logistic regression can be measured by calculating a pseudo R-squared.
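
A sketch of one common version, McFadden's pseudo R-squared, checked against the value statsmodels reports:

```python
# A sketch of McFadden's pseudo R-squared on a simulated logit fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.normal(size=300)
y = rng.binomial(1, 1 / (1 + np.exp(-1.5 * x)))

res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
# 1 minus (log-likelihood of the model / log-likelihood, intercept only)
print(1 - res.llf / res.llnull, res.prsquared)  # the two values match
```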

Statistical significance for use in hypothesis testing can be assessed in a manner similar to multiple regression, by comparing a z statistic (the coefficient divided by its standard error) to a critical value.
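
A minimal sketch of that z-test for a logit coefficient on simulated data:

```python
# A sketch of the z-test: z = b / se(b), compared to a critical value
# from the standard normal distribution.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(10)
x = rng.normal(size=300)
y = rng.binomial(1, 1 / (1 + np.exp(-x)))

res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
z_obs = res.params[1] / res.bse[1]
z_crit = stats.norm.ppf(0.975)             # two-tailed, alpha = .05
print(z_obs, z_crit, abs(z_obs) > z_crit)  # True -> statistically significant
```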