Cramming Sam's top tips from chapter 9
Click on the topic to read Sam's tips from the book
- A linear model (regression) is a way of predicting values of one variable from another based on a model that describes a straight line.
- This line is the line that best summarizes the pattern of the data.
- To assess how well the model fits the data use:
- R2, which tells us how much variance is explained by the model compared to how much variance there is to explain in the first place. It is the proportion of variance in the outcome variable that is shared by the predictor variable.
- F, which tells us how much variability the model can explain relative to how much it can’t explain (i.e., it’s the ratio of how good the model is compared to how bad it is).
- the b-value, which tells us the gradient of the regression line and the strength of the relationship between a predictor and the outcome variable. If it is significant (Sig. < 0.05 in the SPSS output) then the predictor variable significantly predicts the outcome variable.
Use the descriptive statistics to check the correlation matrix for multicollinearity; that is, predictors that correlate too highly with each other, r > 0.9.
The model summary
- The fit of the linear model can be assessed using the Model Summary and ANOVA tables from SPSS.
- R2 tells you the proportion of variance explained by the model.
- If you have done a hierarchical regression, assess the improvement of the model at each stage by looking at the change in R2 and whether it is significant (values less than 0.05 in the column labelled Sig. F Change).
- The F-test tells us whether the model is a significant fit to the data overall (look for values less than 0.05 in the column labelled Sig.).
- The individual contribution of variables to the regression model can be found in the Coefficients table. If you have done a hierarchical regression then look at the values for the final model.
- You can see whether each predictor variable has made a significant contribution to predicting the outcome by looking at the column labelled Sig. (values less than 0.05 are significant).
- The standardized beta values tell you the importance of each predictor (bigger absolute value = more important).
- The tolerance and VIF values will also come in handy later, so make a note of them.
- To check for multicollinearity, use the VIF values from the table labelled Coefficients.
- If these values are less than 10 then that indicates there probably isn’t cause for concern.
- If you take the average of VIF values, and it is not substantially greater than 1, then there’s also no cause for concern.
- Look for cases that might be influencing the model.
- Look at standardized residuals and check that no more than 5% of cases have absolute values above 2, and that no more than about 1% have absolute values above 2.5. Any case with a value above about 3 could be an outlier.
- Look in the data editor for the values of Cook’s distance: any value above 1 indicates a case that might be influencing the model.
- Calculate the average leverage and look for values greater than twice or three times this average value.
- For Mahalanobis distance, a crude check is to look for values above 25 in large samples (500) and values above15 in smaller samples (100). However, Barnett and Lewis (1978) should be consulted for more refined guidelines.
- Look for absolute values of DFBeta greater than 1.
- Calculate the upper and lower limit of acceptable values for the covariance ratio, CVR. Cases that have a CVR that fall outside these limits may be problematic.
- Look at the graph of ZRESID* plotted against ZPRED*. If it looks like a random array of dots then this is good. If the dots get more or less spread out over the graph (look like a funnel) then the assumption of homogeneity of variance is probably unrealistic. If the dots have a pattern to them (i.e., a curved shape) then the assumption of linearity is probably not true. If the dots seem to have a pattern and are more spread out at some points on the plot than others then this could reflect violations of both homogeneity of variance and linearity. Any of these scenarios puts the validity of your model into question. Repeat the above for all partial plots too.
- Look at the histogram and P-P plot. If the histogram looks like a normal distribution (and the P-P plot looks like a diagonal line), then all is well. If the histogram looks non-normal and the P-P plot looks like a wiggly snake curving around a diagonal line then things are less good. Be warned, though: distributions can look very non-normal in small samples even when they are normal.