Cramming Sam's top tips

Variables

When doing and reading research you’re likely to encounter these terms:

  • Independent variable: A variable thought to be the cause of some effect. This term is usually used in experimental research to describe a variable that the experimenter has manipulated.
  • Dependent variable: A variable thought to be affected by changes in an independent variable. You can think of this variable as an outcome.
  • Predictor variable: A variable thought to predict an outcome variable. This term is basically another way of saying ‘independent variable’. (Although some people won’t like me saying that; I think life would be easier if we talked only about predictors and outcomes.)
  • Outcome variable: A variable thought to change as a function of changes in a predictor variable. For the sake of an easy life this term could be synonymous with ‘dependent variable’.

Levels of measurement

  • Variables can be split into categorical and continuous, and within these types there are different levels of measurement:
  • Categorical (entities are divided into distinct categories):
      • Binary variable: There are only two categories (e.g., dead or alive).
      • Nominal variable: There are more than two categories (e.g., whether someone is an omnivore, vegetarian, vegan, or fruitarian).
      • Ordinal variable: The same as a nominal variable but the categories have a logical order (e.g., whether people got a fail, a pass, a merit or a distinction in their exam).
  • Continuous (entities get a distinct score):
      • Interval variable: Equal intervals on the variable represent equal differences in the property being measured (e.g., the difference between 6 and 8 is equivalent to the difference between 13 and 15).
      • Ratio variable: The same as an interval variable, but the ratios of scores on the scale must also make sense (e.g., a score of 16 on an anxiety scale means that the person is, in reality, twice as anxious as someone scoring 8). For this to be true, the scale must have a meaningful zero point.

Central tendency

  • The mean is the sum of all scores divided by the number of scores. The value of the mean can be influenced quite heavily by extreme scores.
  • The median is the middle score when the scores are placed in ascending order. It is not as influenced by extreme scores as the mean.
  • The mode is the score that occurs most frequently.
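As a quick illustration of these three measures, here is a minimal Python sketch using the standard library's statistics module on some made-up scores that include one extreme value:

```python
import statistics

scores = [2, 3, 3, 4, 5, 6, 22]           # made-up data with one extreme score

print(statistics.mean(scores))            # about 6.43 - pulled upwards by the extreme score 22
print(statistics.median(scores))          # 4 - the middle score, less affected by the outlier
print(statistics.mode(scores))            # 3 - the most frequent score
```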

Dispersion

  • The deviance or error is the distance of each score from the mean.
  • The sum of squared errors is the total amount of error in the mean. The errors/deviances are squared before adding them up.
  • The variance is the average squared distance of scores from the mean: the sum of squares divided by the number of scores. It tells us how widely scores are dispersed around the mean.
  • The standard deviation is the square root of the variance. It is the variance converted back to the original units of measurement of the scores used to compute it. Large standard deviations relative to the mean suggest data are widely spread around the mean, whereas small standard deviations suggest data are closely packed around the mean.
  • The range is the distance between the highest and lowest score.
  • The interquartile range is the range of the middle 50% of the scores.
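The sketch below (NumPy, same kind of made-up scores) computes each of these quantities directly from the definitions above; note that most software divides the sum of squares by N − 1 rather than N to estimate the population variance.

```python
import numpy as np

scores = np.array([2.0, 3, 3, 4, 5, 6, 22])

deviances = scores - scores.mean()             # distance of each score from the mean
sum_sq = np.sum(deviances ** 2)                # sum of squared errors
variance = sum_sq / len(scores)                # sum of squares divided by the number of scores
sd = np.sqrt(variance)                         # standard deviation: back to the original units
score_range = scores.max() - scores.min()      # distance between highest and lowest score
iqr = np.percentile(scores, 75) - np.percentile(scores, 25)   # range of the middle 50%

print(variance, sd, score_range, iqr)
```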

Distributions and z-scores

  • A frequency distribution can be either a table or a chart that shows each possible score on a scale of measurement along with the number of times that score occurred in the data.
  • Scores are sometimes expressed in a standard form known as z-scores.
  • To transform a score into a z-score you subtract from it the mean of all scores and divide the result by the standard deviation of all scores.
  • The sign of the z-score tells us whether the original score was above or below the mean; the value of the z-score tells us how far the score was from the mean in standard deviation units.
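In other words, a z-score is just (score − mean) / standard deviation; a minimal NumPy sketch:

```python
import numpy as np

scores = np.array([8.0, 10, 12, 14, 16])
z = (scores - scores.mean()) / scores.std(ddof=1)   # subtract the mean, divide by the SD

print(z)   # negative values were below the mean, positive values above it
```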

The standard error

The standard error of the mean is the standard deviation of sample means. As such, it is a measure of how representative of the population a sample mean is likely to be. A large standard error (relative to the sample mean) means that there is a lot of variability between the means of different samples and so the sample mean we have might not be representative of the population mean. A small standard error indicates that most sample means are similar to the population mean (i.e., our sample mean is likely to accurately reflect the population mean).
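In practice the standard error of the mean is estimated as the sample standard deviation divided by the square root of the sample size; a one-line sketch on made-up scores:

```python
import numpy as np

scores = np.array([8.0, 10, 12, 14, 16])
se = scores.std(ddof=1) / np.sqrt(len(scores))   # estimated standard error of the mean
print(se)
```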

Confidence intervals

  • A confidence interval for the mean is a range of scores constructed such that the population mean will fall within this range in 95% of samples.
  • The confidence interval is not an interval within which we are 95% confident that the population mean will fall.
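For reasonably large samples a 95% confidence interval for the mean is roughly the sample mean ± 1.96 standard errors; for small samples a t critical value replaces 1.96, as in this sketch on made-up scores:

```python
import numpy as np
from scipy import stats

scores = np.array([8.0, 10, 12, 14, 16, 9, 11, 13])
mean = scores.mean()
se = scores.std(ddof=1) / np.sqrt(len(scores))

t_crit = stats.t.ppf(0.975, df=len(scores) - 1)   # exact critical value for a small sample
print(mean - t_crit * se, mean + t_crit * se)     # 95% confidence interval for the mean
```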

Null hypothesis significance testing

  • NHST is a widespread method for assessing scientific theories. The basic idea is that we have two competing hypotheses: one says that an effect exists (the alternative hypothesis) and the other says that an effect doesn’t exist (the null hypothesis). We compute a test statistic that represents the alternative hypothesis and calculate the probability that we would get a value as big as the one we have if the null hypothesis were true. If this probability is less than 0.05 we reject the idea that there is no effect, say that we have a statistically significant finding and throw a little party. If the probability is greater than 0.05 we do not reject the idea that there is no effect, we say that we have a non-significant finding and we look sad.
  • We can make two types of error: we can believe that there is an effect when, in reality, there isn’t (a Type I error); and we can believe that there is not an effect when, in reality, there is (a Type II error).
  • The power of a statistical test is the probability that it will find an effect when one exists.
  • The significance of a test statistic is directly linked to the sample size: the same effect will have different p-values in different-sized samples, small differences can be deemed ‘significant’ in large samples, and large effects might be deemed ‘non-significant’ in small samples.
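The last point is easy to demonstrate by simulation: the same small underlying effect typically produces a non-significant p in a small sample and a highly significant p in a large one. A sketch using SciPy's independent t-test on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

for n in (20, 2000):                               # small vs. large samples
    a = rng.normal(loc=0.0, scale=1.0, size=n)     # 'null' group
    b = rng.normal(loc=0.2, scale=1.0, size=n)     # same small true effect in both runs
    t, p = stats.ttest_ind(a, b)
    print(f"n per group = {n}: t = {t:.2f}, p = {p:.4f}")
```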

Problems with NHST

  • A lot of scientists misunderstand NHST. A few examples of poorly understood things related to significance testing are:
  • A significant effect is not necessarily an important one.
  • A non-significant result does not mean that the null hypothesis is true.
  • A significant result does not mean that the null hypothesis is false.
  • NHST encourages all-or-nothing thinking whereby an effect with a p-value just below 0.05 is perceived as important whereas one with a p-value just above 0.05 is perceived as unimportant.
  • NHST is biased by researchers deviating from their initial sampling frame (e.g., by stopping data collection earlier than planned).
  • There are lots of ways that scientists can influence the p-value. These are known as researcher degrees of freedom and include selective exclusion of data, fitting different statistical models but reporting only the one with the most favourable results, stopping data collection at a point other than that decided at the study’s conception, and including only control variables that influence the p-value in a favourable way.
  • Incentive structures in science that reward publication of significant results also reward the use of researcher degrees of freedom.
  • p-hacking refers to practices that lead to the selective reporting of significant p-values, most commonly trying multiple analyses and reporting only the one that yields significant results.
  • Hypothesizing after the results are known (HARKing) occurs when scientists present a hypothesis that was made after data analysis as though it were made at the study’s conception.

Effect sizes and meta-analysis

  • An effect size is a way of measuring the size of an observed effect, usually relative to the background error.
  • Cohen’s d is the difference between two means divided by the standard deviation of the control group, or by a pooled estimate based on the standard deviations of both groups.
  • Pearson’s correlation coefficient, r, is a versatile effect size measure that can quantify the strength (and direction) of relationship between two continuous variables, and can also quantify the difference between groups along a continuous variable. It ranges from −1 (a perfect negative relationship) through 0 (no relationship at all) to +1 (a perfect positive relationship).
  • The odds ratio is the ratio of the odds of an event occurring in one category compared to another. An odds ratio of 1 indicates that the odds of a particular outcome are equal in both categories.
  • Estimating the size of an effect in the population by combining effect sizes from different studies that test the same hypothesis is called meta-analysis.
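A sketch of two of these effect sizes computed by hand with NumPy, using made-up group scores and a hypothetical 2×2 table of counts:

```python
import numpy as np

control = np.array([10.0, 12, 9, 11, 13, 10])
treat = np.array([13.0, 15, 12, 14, 16, 14])

# Cohen's d using a pooled standard deviation
n1, n2 = len(control), len(treat)
s_pooled = np.sqrt(((n1 - 1) * control.var(ddof=1) + (n2 - 1) * treat.var(ddof=1))
                   / (n1 + n2 - 2))
d = (treat.mean() - control.mean()) / s_pooled

# Odds ratio from a 2x2 table: rows = category, columns = event occurred / did not occur
a, b, c, e = 30, 10, 15, 25
odds_ratio = (a / b) / (c / e)

print(d, odds_ratio)
```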

Summary of the Bayesian process

1. Define a prior that represents your subjective beliefs about a hypothesis (the prior is a single value) or a parameter (the prior is a distribution of possibilities). The prior can range from completely uninformative, which means that you are prepared to believe pretty much anything, to strongly informative, which means that your initial beliefs are quite narrow and specific.

2. Inspect the relevant data. In our frivolous example, this was observing the behaviour of your crush. In science, the process would be a bit more formal than that.

3. Bayes’ theorem is used to update the prior distribution with the data. The result is a posterior probability, which can be a single value representing your new belief in a hypothesis, or a distribution that represents your beliefs in plausible values of a parameter, after seeing the data.

4. A posterior distribution can be used to obtain a point estimate (perhaps the peak of the distribution) or an interval estimate (a boundary containing a certain percentage, for example 95%, of the posterior distribution) of the parameter in which you were originally interested.
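As a deliberately simple illustration of steps 1–4, here is a conjugate beta–binomial sketch: a Beta prior over a proportion is updated with some hypothetical data, and the posterior gives a point and interval estimate.

```python
from scipy import stats

# 1. Prior: Beta(2, 2) - a weakly informative belief about a proportion
a_prior, b_prior = 2, 2

# 2. Data: suppose we observe 7 'successes' out of 10 trials (hypothetical numbers)
successes, trials = 7, 10

# 3. Bayes' theorem (conjugate update): the posterior is Beta(a + successes, b + failures)
a_post = a_prior + successes
b_post = b_prior + (trials - successes)

# 4. Point and interval estimates from the posterior distribution
posterior_mean = a_post / (a_post + b_post)
credible_interval = stats.beta.ppf([0.025, 0.975], a_post, b_post)
print(posterior_mean, credible_interval)
```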

Bayes factors

  • Bayes’ theorem can be used to update your prior belief in a hypothesis based on the observed data.
  • The probability of the alternative hypothesis given the data relative to the probability of the null hypothesis given the data is quantified by the posterior odds.
  • The Bayes factor is the ratio of the probability of the data given the alternative hypothesis to that for the null hypothesis. A Bayes factor greater than 1 suggests that the observed data are more likely given the alternative hypothesis than given the null; values less than 1 suggest the opposite. Values between 1 and 3 reflect evidence for the alternative hypothesis that is ‘barely worth mentioning’, values between 3 and 10 reflect evidence that ‘has substance’, and values greater than 10 reflect ‘strong’ evidence (Jeffreys, 1961).
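The relationship between these quantities is just multiplication: posterior odds = Bayes factor × prior odds. A tiny sketch with hypothetical values:

```python
# Hypothetical values: prior odds of 1 (alternative and null equally plausible beforehand)
prior_odds = 1.0
bayes_factor = 5.0                      # data are 5 times more likely under the alternative

posterior_odds = bayes_factor * prior_odds
posterior_prob_alt = posterior_odds / (1 + posterior_odds)   # odds converted to a probability
print(posterior_odds, posterior_prob_alt)                    # 5.0, about 0.83
```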

Graphs

  • The vertical axis of a graph is known as the y-axis (or ordinate).
  • The horizontal axis of a graph is known as the x-axis (or abscissa).

If you want to draw a good graph follow the cult of Tufte:

  • Don’t create false impressions of what the data show (likewise, don’t hide effects) by scaling the y-axis in some weird way.
  • Avoid chartjunk: Don’t use patterns, 3-D effects, shadows, pictures of spleens, photos of your Uncle Fred, pink cats or anything else.

Skewness and kurtosis

  • To check that the distribution of scores is approximately normal, look at the values of skewness and kurtosis in the output.
  • Positive values of skewness indicate too many low scores in the distribution, whereas negative values indicate a build-up of high scores.
  • Positive values of kurtosis indicate a heavy-tailed distribution, whereas negative values indicate a light-tailed distribution.
  • The further the value is from zero, the more likely it is that the data are not normally distributed.
  • You can convert these scores to z-scores by dividing by their standard error. If the resulting score (when you ignore the minus sign) is greater than 1.96 then it is significant (p < 0.05).
  • Significance tests of skew and kurtosis should not be used in large samples (because they are likely to be significant even when skew and kurtosis are not too different from normal).
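A sketch of this check in SciPy; the standard errors used here are the common large-sample approximations (√(6/N) for skew and √(24/N) for kurtosis), so treat the resulting z-scores as rough.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=200)      # deliberately skewed made-up data

skew = stats.skew(x)
kurt = stats.kurtosis(x)                      # excess kurtosis (a normal distribution gives 0)

se_skew = np.sqrt(6 / len(x))                 # rough large-sample standard errors
se_kurt = np.sqrt(24 / len(x))

print(skew / se_skew, kurt / se_kurt)         # |z| > 1.96 suggests significance at p < 0.05
```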

Normality tests

  • The K-S test can be used (but shouldn’t be) to see if a distribution of scores significantly differs from a normal distribution.
  • If the K-S test is significant (Sig. in the SPSS table is less than 0.05) then the scores are significantly different from a normal distribution.
  • Otherwise, scores are approximately normally distributed.
  • The Shapiro–Wilk test does much the same thing, but it has more power to detect differences from normality (so this test might be significant when the K-S test is not).
  • Warning: In large samples these tests can be significant even when the scores are only slightly different from a normal distribution. Therefore, I don’t particularly recommend them and they should always be interpreted in conjunction with histograms, P-P or Q-Q plots, and the values of skew and kurtosis.
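Outside SPSS the same tests are available in SciPy; note that feeding the K-S test parameters estimated from the same data (as below) makes its p-value optimistic, which is one more reason to prefer plots.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(loc=50, scale=10, size=100)    # made-up scores

w, p_sw = stats.shapiro(x)                    # Shapiro-Wilk test
d, p_ks = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))   # K-S against a fitted normal

print(p_sw, p_ks)      # p < 0.05 would suggest a significant deviation from normality
```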

Homogeneity of variance

  • Homogeneity of variance/homoscedasticity is the assumption that the spread of outcome scores is roughly equal at different points on the predictor variable.
  • The assumption can be evaluated by looking at a plot of the standardized predicted values from your model against the standardized residuals (zpred vs. zresid).
  • When comparing groups, this assumption can be tested with Levene’s test and the variance ratio (Hartley’s Fmax).
  • If Levene’s test is significant (Sig. in the SPSS table is less than 0.05) then the variances are significantly different in different groups.
  • Otherwise, homogeneity of variance can be assumed.
  • The variance ratio is the largest group variance divided by the smallest. This value needs to be smaller than the critical values in the additional material.
  • Warning: There are good reasons not to use Levene’s test or the variance ratio. In large samples they can be significant when group variances are similar, and in small samples they can be non-significant when group variances are very different.
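A sketch of both checks with SciPy and NumPy on made-up groups; SciPy's levene defaults to the median-centred version, which is the more robust variant.

```python
import numpy as np
from scipy import stats

g1 = np.array([4.0, 5, 6, 5, 7, 6])
g2 = np.array([2.0, 9, 3, 10, 1, 8])

stat, p = stats.levene(g1, g2)                      # a significant p suggests unequal variances

variances = [g1.var(ddof=1), g2.var(ddof=1)]
variance_ratio = max(variances) / min(variances)    # Hartley's Fmax

print(p, variance_ratio)
```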

Mann–Whitney test

  • The Mann–Whitney test and Wilcoxon rank-sum test compare two conditions when different participants take part in each condition and the resulting data have unusual cases or violate any assumption in Chapter 6.
  • Look at the row labelled Asymptotic Sig. or Exact Sig. (if your sample is small). If the value is less than 0.05 then the two groups are significantly different.
  • The values of the mean ranks tell you how the groups differ (the group with the highest scores will have the highest mean rank).
  • Report the U-statistic (or Ws if you prefer), the corresponding z and the significance value. Also report the medians and their corresponding ranges (or draw a boxplot).
  • Calculate the effect size and report this too.
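If you want the same test outside SPSS, SciPy provides it; the z and effect size r below use the normal approximation to U (ignoring ties), so treat them as approximate.

```python
import numpy as np
from scipy import stats

g1 = np.array([12.0, 15, 11, 14, 13, 16, 10])
g2 = np.array([18.0, 20, 17, 19, 22, 16, 21])

u, p = stats.mannwhitneyu(g1, g2, alternative='two-sided')

n1, n2 = len(g1), len(g2)
z = (u - n1 * n2 / 2) / np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # normal approximation
r = z / np.sqrt(n1 + n2)                                        # effect size r = z / sqrt(N)

print(u, p, r)
```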

Wilcoxon signed-rank test 

  • The Wilcoxon signed-rank test compares two conditions when the scores are related (e.g., scores come from the same participants) and the resulting data have unusual cases or violate any assumption in Chapter 6.
  • Look at the row labelled Asymptotic Sig. (2-sided test). If the value is less than 0.05 then the two conditions are significantly different.
  • Look at the histogram and numbers of positive or negative differences to tell you how the groups differ (the greater number of differences in a particular direction tells you the direction of the result).
  • Report the T-statistic, the corresponding z, the exact significance value and an effect size. Also report the medians and their corresponding ranges (or draw a boxplot).
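A minimal SciPy sketch with made-up before/after scores; the effect size r is again based on a normal approximation to the test statistic.

```python
import numpy as np
from scipy import stats

before = np.array([20.0, 23, 21, 25, 22, 27, 24, 26])
after = np.array([18.0, 22, 19, 24, 20, 25, 21, 23])

t_stat, p = stats.wilcoxon(before, after)        # T is the smaller of the two rank sums

n = len(before)
mu_t = n * (n + 1) / 4
sigma_t = np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (t_stat - mu_t) / sigma_t                    # normal approximation (no tie correction)
r = z / np.sqrt(n)      # effect size (some texts divide by the total number of observations)

print(t_stat, p, r)
```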

Kruskal–Wallis test 

  • The Kruskal–Wallis test compares several conditions when different participants take part in each condition and the resulting data have unusual cases or violate any assumption in Chapter 6.
  • Look at the row labelled Asymptotic Sig. A value less than 0.05 is typically taken to mean that the groups are significantly different.
  • Pairwise comparisons compare all possible pairs of groups with a p-value that is corrected so that the error rate across all tests remains at 5%.
  • If you predict that the medians will increase or decrease across your groups in a specific order then test this with the Jonckheere–Terpstra test.
  • Report the H-statistic, the degrees of freedom and the significance value for the main analysis. For any follow-up tests, report an effect size, the corresponding z and the significance value. Also report the medians and their corresponding ranges (or draw a boxplot).
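The main test is a one-liner in SciPy (the Jonckheere–Terpstra trend test is not in SciPy, so that part stays in SPSS or another package); a minimal sketch:

```python
from scipy import stats

g1 = [2.0, 3, 5, 4, 6]
g2 = [6.0, 7, 8, 7, 9]
g3 = [10.0, 9, 11, 12, 10]

h, p = stats.kruskal(g1, g2, g3)    # H statistic and its p-value (df = number of groups - 1)
print(h, p)
```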

Friedman's ANOVA

  • Friedman’s ANOVA compares several conditions when the data are related (usually because the same participants take part in each condition) and the resulting data have unusual cases or violate any assumption in Chapter 6.
  • Look at the row labelled Asymptotic Sig. If the value is less than 0.05 then typically people conclude that the conditions are significantly different.
  • You can follow up the main analysis with pairwise comparisons. These tests compare all possible pairs of conditions using a p-value that is adjusted such that the overall Type I error rate remains at 5%.
  • Report the χ2 statistic, the degrees of freedom and the significance value for the main analysis. For any follow-up tests, report an effect size, the corresponding z and the significance value.
  • Report the medians and their ranges (or draw a boxplot).
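A minimal SciPy sketch with three related conditions (one list of scores per condition, same participants in the same order):

```python
from scipy import stats

cond1 = [5.0, 6, 4, 7, 6, 5]
cond2 = [7.0, 8, 6, 9, 8, 7]
cond3 = [6.0, 7, 5, 8, 7, 8]

chi2, p = stats.friedmanchisquare(cond1, cond2, cond3)   # chi-square statistic, df = conditions - 1
print(chi2, p)
```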

Correlation

  • A crude measure of the relationship between variables is the covariance.
  • If we standardize this value we get Pearson’s correlation coefficient, r.
  • The correlation coefficient has to lie between −1 and +1.
  • A coefficient of +1 indicates a perfect positive relationship, a coefficient of −1 indicates a perfect negative relationship, and a coefficient of 0 indicates no linear relationship.
  • The correlation coefficient is a commonly used measure of the size of an effect: values of ±0.1 represent a small effect, ±0.3 is a medium effect and ±0.5 is a large effect. However, interpret the size of correlation within the context of the research you’ve done rather than blindly following these benchmarks.
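Covariance and its standardized form can be computed directly; a minimal NumPy/SciPy sketch with made-up data:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2, 3, 4, 5, 6])
y = np.array([2.0, 3, 5, 4, 7, 8])

covariance = np.cov(x, y, ddof=1)[0, 1]     # crude, scale-dependent measure of the relationship
r, p = stats.pearsonr(x, y)                 # covariance standardized to lie between -1 and +1

print(covariance, r, p)
```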

Correlations

  • Spearman’s correlation coefficient, rs, is a non-parametric statistic and requires only ordinal data for both variables.
  • Kendall’s correlation coefficient, τ, is like Spearman’s rs but probably better for small samples.
  • The point-biserial correlation coefficient, rpb, quantifies the relationship between a continuous variable and a variable that is a discrete dichotomy (e.g., there is no continuum underlying the two categories, such as dead or alive).
  • The biserial correlation coefficient, rb, quantifies the relationship between a continuous variable and a variable that is a continuous dichotomy (e.g., there is a continuum underlying the two categories, such as passing or failing an exam).
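Three of these coefficients are available in SciPy in a line each (the biserial correlation is not, so it is omitted here):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.0, 1, 4, 3, 6, 5, 8, 7])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])      # a discrete dichotomy (e.g., dead/alive)

rs, p_s = stats.spearmanr(x, y)                 # Spearman's rho (ordinal data)
tau, p_t = stats.kendalltau(x, y)               # Kendall's tau (better for small samples/ties)
rpb, p_pb = stats.pointbiserialr(group, y)      # point-biserial (dichotomy vs. continuous)

print(rs, tau, rpb)
```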

Partial and semi-partial correlations

  • A partial correlation quantifies the relationship between two variables while accounting for the effects of a third variable on both variables in the original correlation.
  • A semi-partial correlation quantifies the relationship between two variables while accounting for the effects of a third variable on only one of the variables in the original correlation.
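A partial correlation can be computed from the three pairwise Pearson correlations with the standard formula r_xy·z = (r_xy − r_xz·r_yz) / √((1 − r_xz²)(1 − r_yz²)); a sketch on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
z = rng.normal(size=100)                 # the 'third variable'
x = z + rng.normal(size=100)             # both x and y are related to z
y = z + rng.normal(size=100)

r_xy = stats.pearsonr(x, y)[0]
r_xz = stats.pearsonr(x, z)[0]
r_yz = stats.pearsonr(y, z)[0]

# Partial correlation between x and y, controlling for z's effect on both
r_xy_z = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))
print(r_xy, r_xy_z)                      # the partial correlation should be much smaller
```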

Linear models

  • A linear model (regression) is a way of predicting values of one variable from another based on a model that describes a straight line.
  • This line is the line that best summarizes the pattern of the data.
  • To assess how well the model fits the data use:
  • R2, which tells us how much variance is explained by the model compared to how much variance there is to explain in the first place. It is the proportion of variance in the outcome variable that is shared by the predictor variable.
  • F, which tells us how much variability the model can explain relative to how much it can’t explain (i.e., it’s the ratio of how good the model is compared to how bad it is).
  • the b-value, which tells us the gradient of the regression line and the strength of the relationship between a predictor and the outcome variable. If it is significant (Sig. < 0.05 in the SPSS output) then the predictor variable significantly predicts the outcome variable.
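Outside SPSS the same three quantities come straight out of an ordinary least-squares fit; a sketch using statsmodels on simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)          # simulated outcome with a known slope

X = sm.add_constant(x)                            # add the intercept term
model = sm.OLS(y, X).fit()

print(model.rsquared)                             # R-squared: proportion of variance explained
print(model.fvalue, model.f_pvalue)               # F and its p-value: overall fit of the model
print(model.params[1], model.pvalues[1])          # b (the slope) and its p-value
```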

Descriptive statistics

Use the descriptive statistics to check the correlation matrix for multicollinearity; that is, predictors that correlate too highly with each other, > 0.9.

The model summary

  • The fit of the linear model can be assessed using the Model Summary and ANOVA tables from SPSS.
  • R2 tells you the proportion of variance explained by the model.
  • If you have done a hierarchical regression, assess the improvement of the model at each stage by looking at the change in R2 and whether it is significant (values less than 0.05 in the column labelled Sig. F Change).
  • The F-test tells us whether the model is a significant fit to the data overall (look for values less than 0.05 in the column labelled Sig.).

Coefficients

  • The individual contribution of variables to the regression model can be found in the Coefficients table. If you have done a hierarchical regression then look at the values for the final model.
  • You can see whether each predictor variable has made a significant contribution to predicting the outcome by looking at the column labelled Sig. (values less than 0.05 are significant).
  • The standardized beta values tell you the importance of each predictor (bigger absolute value = more important).
  • The tolerance and VIF values will also come in handy later, so make a note of them.

Multicollinearity

  • To check for multicollinearity, use the VIF values from the table labelled Coefficients.
  • If these values are less than 10 then that indicates there probably isn’t cause for concern.
  • If you take the average of VIF values, and it is not substantially greater than 1, then there’s also no cause for concern.
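If you want to compute VIFs yourself, statsmodels has a helper; note that the design matrix passed to it should include the constant.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.5, size=200)    # deliberately correlated with x1
x3 = rng.normal(size=200)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]   # skip the constant

print(vifs, np.mean(vifs))   # values below 10 and an average near 1 suggest no problem
```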

Residuals

  • Look for cases that might be influencing the model.
  • Look at standardized residuals and check that no more than 5% of cases have absolute values above 2, and that no more than about 1% have absolute values above 2.5. Any case with a value above about 3 could be an outlier.
  • Look in the data editor for the values of Cook’s distance: any value above 1 indicates a case that might be influencing the model.
  • Calculate the average leverage and look for values greater than twice or three times this average value.
  • For Mahalanobis distance, a crude check is to look for values above 25 in large samples (500 cases) and values above 15 in smaller samples (100 cases). However, Barnett and Lewis (1978) should be consulted for more refined guidelines.
  • Look for absolute values of DFBeta greater than 1.
  • Calculate the upper and lower limit of acceptable values for the covariance ratio, CVR. Cases that have a CVR that fall outside these limits may be problematic.
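With a statsmodels OLS fit, most of these case-level diagnostics are available from its influence object (the covariance ratio is also exposed there but is omitted here); a self-contained sketch on simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=100)

results = sm.OLS(y, X).fit()
influence = results.get_influence()

std_resid = influence.resid_studentized_internal        # standardized residuals
print(np.mean(np.abs(std_resid) > 2))                   # should be no more than about 0.05

cooks = influence.cooks_distance[0]                     # values above 1 are a worry
leverage = influence.hat_matrix_diag
avg_leverage = X.shape[1] / len(y)                      # (k + 1)/n; X already includes the constant
dfbetas = influence.dfbetas                             # standardized DFBeta values

print(cooks.max(), (leverage > 2 * avg_leverage).sum(), np.abs(dfbetas).max())
```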

Model assumptions

  • Look at the graph of *ZRESID plotted against *ZPRED. If it looks like a random array of dots then this is good. If the dots get more or less spread out over the graph (look like a funnel) then the assumption of homogeneity of variance is probably unrealistic. If the dots have a pattern to them (i.e., a curved shape) then the assumption of linearity is probably not true. If the dots seem to have a pattern and are more spread out at some points on the plot than others then this could reflect violations of both homogeneity of variance and linearity. Any of these scenarios puts the validity of your model into question. Repeat the above for all partial plots too.
  • Look at the histogram and P-P plot. If the histogram looks like a normal distribution (and the P-P plot looks like a diagonal line), then all is well. If the histogram looks non-normal and the P-P plot looks like a wiggly snake curving around a diagonal line then things are less good. Be warned, though: distributions can look very non-normal in small samples even when they are normal.

The independent t-test

  • The independent t-test compares two means, when those means have come from different groups of entities.
  • You should probably ignore the column labelled Levene’s Test for Equality of Variance and always look at the row in the table labelled Equal variances not assumed.
  • Look at the column labelled Sig. If the value is less than 0.05 then the means of the two groups are significantly different.
  • Look at the table labelled Bootstrap for Independent Samples Test to get a robust confidence interval for the difference between means.
  • Look at the values of the means to see how the groups differ.
  • A robust version of the test can be computed using syntax.
  • A Bayes factor can be computed that quantifies the ratio of how probable the data are under the alternative hypothesis compared to the null.
  • Calculate and report the effect size. Go on, you can do it!
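Outside SPSS, the Welch version of the test (the ‘equal variances not assumed’ row) and an effect size can be computed like this on made-up data:

```python
import numpy as np
from scipy import stats

g1 = np.array([22.0, 25, 19, 24, 23, 21, 26, 20])
g2 = np.array([28.0, 27, 30, 26, 29, 31, 27, 28])

t, p = stats.ttest_ind(g1, g2, equal_var=False)   # Welch's t-test (equal variances not assumed)

# Cohen's d using a pooled standard deviation
n1, n2 = len(g1), len(g2)
sp = np.sqrt(((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2))
d = (g2.mean() - g1.mean()) / sp

print(t, p, d)
```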

Paired-samples t-test

  • The paired-samples t-test compares two means, when those means have come from the same entities.
  • Look at the column labelled Sig. If the value is less than 0.05 then the means of the two conditions are significantly different.
  • Look at the values of the means to tell you how the conditions differ.
  • Look at the table labelled Bootstrap for Paired Samples Test to get a robust confidence interval for the difference between means.
  • A robust version of the test can be computed using syntax.
  • A Bayes factor can be computed that quantifies the ratio of how probable the data are under the alternative hypothesis compared to the null.
  • Calculate and report the effect size too.
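A SciPy sketch of the paired test with a simple effect size based on the difference scores (made-up data):

```python
import numpy as np
from scipy import stats

before = np.array([30.0, 28, 35, 32, 31, 29, 33, 34])
after = np.array([27.0, 26, 31, 30, 29, 27, 30, 31])

t, p = stats.ttest_rel(before, after)

diff = before - after
d_z = diff.mean() / diff.std(ddof=1)      # effect size based on the difference scores
print(t, p, d_z)
```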

Moderation

  • Moderation occurs when the relationship between two variables changes as a function of a third variable. For example, the relationship between watching horror films and feeling scared at bedtime might increase as a function of how vivid an imagination a person has.
  • Moderation is tested using a linear model in which the outcome (fear at bedtime) is predicted from a predictor (how many horror films are watched), the moderator (imagination) and the interaction of the predictor variables.
  • Predictors should be centred before the analysis.
  • The interaction of two variables is their scores multiplied together.
  • If the interaction is significant then the moderation effect is also significant.
  • If moderation is found, follow up the analysis with simple slopes analysis, which looks at the relationship between the predictor and outcome at low, mean and high levels of the moderator.
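Centring and the interaction term are one line each in statsmodels' formula interface (the * in the formula expands to both main effects plus their interaction); the variable names and data below are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'films': rng.poisson(5, n).astype(float),
                   'imagination': rng.normal(50, 10, n)})
df['fear'] = (2 + 0.3 * df['films'] + 0.1 * df['imagination']
              + 0.05 * df['films'] * df['imagination'] + rng.normal(0, 5, n))

# Centre the predictors, then fit outcome ~ predictor + moderator + interaction
df['films_c'] = df['films'] - df['films'].mean()
df['imagination_c'] = df['imagination'] - df['imagination'].mean()

model = smf.ols('fear ~ films_c * imagination_c', data=df).fit()
print(model.summary())       # a significant interaction term indicates moderation
```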

Mediation

  • Mediation is when the strength of the relationship between a predictor variable and outcome variable is reduced by including another variable as a predictor. Essentially, mediation equates to the relationship between two variables being ‘explained’ by a third. For example, the ­relationship between watching horror films and feeling scared at bedtime might be explained by scary ­images appearing in your head.
  • Mediation is tested by assessing the size of the indirect effect and its confidence interval. If the confidence interval contains zero then we tend to assume that a genuine mediation effect doesn’t exist. If the confidence interval doesn’t contain zero, then we tend to conclude that mediation has occurred.
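The indirect effect is the product of the a path (predictor → mediator) and the b path (mediator → outcome, controlling for the predictor), and its confidence interval is usually obtained by bootstrapping. A minimal sketch on simulated data (this is not the PROCESS tool described in the book):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)                       # predictor (e.g., horror films watched)
m = 0.5 * x + rng.normal(size=n)             # mediator (e.g., scary images)
y = 0.4 * m + 0.1 * x + rng.normal(size=n)   # outcome (e.g., fear at bedtime)

def indirect_effect(x, m, y):
    a = sm.OLS(m, sm.add_constant(x)).fit().params[1]                        # a path
    b = sm.OLS(y, sm.add_constant(np.column_stack([x, m]))).fit().params[2]  # b path
    return a * b

boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)              # resample cases with replacement
    boot.append(indirect_effect(x[idx], m[idx], y[idx]))

ci = np.percentile(boot, [2.5, 97.5])
print(indirect_effect(x, m, y), ci)          # mediation is claimed if the CI excludes zero
```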

Planned contrasts

  • If the F for the overall model is significant you need to find out which groups differ.
  • When you have generated specific hypotheses before the experiment, use planned contrasts.
  • Each contrast compares two ‘chunks’ of variance. (A chunk can contain one or more groups.)
  • The first contrast will usually be experimental groups against control groups.
  • The next contrast will be to take one of the chunks that contained more than one group (if there were any) and divide it into two chunks.
  • You repeat this process: if there are any chunks in previous contrasts that contained more than one group that haven’t already been broken down into smaller chunks, then create new contrasts that break them down into smaller chunks.
  • Carry on creating contrasts until each group has appeared in a chunk on its own in one of your contrasts.
  • The number of contrasts you end up with should be one less than the number of experimental conditions. If not, you’ve done it wrong.
  • In each contrast assign a ‘weight’ to each group that is the value of the number of groups in the opposite chunk in that contrast.
  • For a given contrast, randomly select one chunk, and for the groups in that chunk change their weights to be negative numbers.
  • Breathe a sigh of relief.
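As a quick worked example of the weighting rule, suppose there are two experimental groups (E1, E2) and one control (C). The sketch below just writes out the two contrasts and checks that each set of weights sums to zero (groups get a weight equal to the number of groups in the opposite chunk, with one chunk made negative):

```python
# Contrast 1: experimental groups (E1, E2) vs. control (C)
# Contrast 2: E1 vs. E2 (the control group gets a weight of zero)
contrast1 = {'E1': 1, 'E2': 1, 'C': -2}
contrast2 = {'E1': 1, 'E2': -1, 'C': 0}

for contrast in (contrast1, contrast2):
    assert sum(contrast.values()) == 0      # weights within a contrast must sum to zero
print("weights are valid")
```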

Post hoc tests

  • When you have no specific hypotheses before the experiment, follow up the model with post hoc tests.
  • When you have equal sample sizes and group variances are similar use REGWQ or Tukey.
  • If you want guaranteed control over the Type I error rate then use Bonferroni.
  • If sample sizes are slightly different then use Gabriel’s, but if sample sizes are very different use Hochberg’s GT2.
  • If there is any doubt that group variances are equal then use the Games–Howell procedure.

One-way independent ANOVA

  • One-way independent ANOVA compares several means, when those means have come from different groups of people; for example, if you have several experimental conditions and have used different participants in each condition. It is a special case of the linear model.
  • When you have generated specific hypotheses before the experiment use planned contrasts, but if you don’t have specific hypotheses use post hoc tests.
  • There are lots of different post hoc tests: when you have equal sample sizes and homogeneity of variance is met, use REGWQ or Tukey’s HSD. If sample sizes are slightly different then use Gabriel’s procedure, but if sample sizes are very different use Hochberg’s GT2. If there is any doubt about homogeneity of variance use the Games–Howell procedure.
  • You can test for homogeneity of variance using Levene’s test, but consider using a robust test in all situations (the Welch or Brown–Forsythe F) or Wilcox’s t1way() function.
  • Locate the p-value (usually in a column labelled Sig.). If the value is less than 0.05 then scientists typically interpret this as the group means being significantly different.
  • For contrasts and post hoc tests, again look to the columns labelled Sig. to discover if your comparisons are significant (they will be if the significance value is less than 0.05).
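For reference, the ordinary (non-robust) F for a one-way design is a one-liner in SciPy; the Welch and Brown–Forsythe versions are not in SciPy itself, so for those stick with SPSS or a dedicated package.

```python
from scipy import stats

g1 = [12.0, 14, 11, 13, 15]
g2 = [16.0, 18, 17, 15, 19]
g3 = [20.0, 22, 21, 19, 23]

f, p = stats.f_oneway(g1, g2, g3)    # one-way independent ANOVA (equal variances assumed)
print(f, p)
```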

Covariates

  • When the linear model is used to compare several means adjusted for the effect of one or more other variables (called covariates) it can be referred to as analysis of covariance (ANCOVA).
  • Before the analysis check that the covariate(s) are independent of any independent variables by seeing whether those independent variables predict the covariate (i.e., the covariate should not differ across groups).
  • In the table labelled Tests of Between-Subjects Effects, assuming you’re using an alpha of 0.05, look to see if the value in the column labelled Sig. is below 0.05 for both the covariate and the independent variable. If it is for the covariate then this variable has a significant relationship to the outcome variable; if it is for the independent variable then the means (adjusted for the effect of the covariate) are significantly different across categories of this variable.
  • If you have generated specific hypotheses before the experiment use planned contrasts; if not, use post hoc tests.
  • For parameters and post hoc tests, look at the columns labelled Sig. to discover if your comparisons are significant (they will be if the significance value is less than 0.05). Use bootstrapping to get robust versions of these tests.
  • In addition to the assumptions in Chapter 6, test for homogeneity of regression slopes by customizing the model to look at the independent variable × covariate interaction.

Factorial ANOVA

  • Two-way independent designs compare several means when there are two independent variables and different entities have been used in all experimental conditions. For example, if you wanted to know whether different teaching methods worked better for different topics, you could take students from four courses (Psychology, Geography, Management, and Statistics) and assign them to either lecture-based or book-based teaching. The two variables are topic and method of teaching. The outcome might be the end-of-year mark (as a percentage).
  • In the table labelled Tests of Between-Subjects Effects, look at the column labelled Sig. for all main effects and interactions; if the value is less than 0.05 then the effect is significant using the conventional criterion.
  • To interpret a significant interaction, plot an interaction graph and conduct simple effects analysis.
  • You don’t need to interpret main effects if an interaction effect involving that variable is significant.
  • If significant main effects are not qualified by an interaction then consult post hoc tests to see which groups differ: significance is shown by values smaller than 0.05 in the columns labelled Sig., and bootstrap confidence intervals that do not contain zero.
  • Test the same assumptions as for any linear model (see Chapter 6).

One-way repeated-measures designs

  • One-way repeated-measures designs compare several means, when those means come from the same entities; for example, if you measured people’s statistical ability each month over a year-long course.
  • When you have three or more repeated-measures conditions there is an additional assumption: sphericity.
  • You can test for sphericity using Mauchly’s test, but it is better to always adjust for the departure from sphericity in the data.
  • The table labelled Tests of Within-Subjects Effects shows the main F-statistic. Other things being equal, always read the row labelled Greenhouse–Geisser (or Huynh–Feldt, but you’ll have to read this chapter to find out the relative merits of the two procedures). If the value in the column labelled Sig. is less than 0.05 then the means of the conditions are significantly different.
  • For contrasts and post hoc tests, again look to the columns labelled Sig. to discover if your comparisons are significant (i.e., the value is less than 0.05).

Factorial repeated-measures designs

  • Two-way repeated-measures designs compare means when there are two predictor/independent variables, and the same entities have been used in all conditions.
  • You can test the assumption of sphericity when you have three or more repeated-measures conditions with Mauchly’s test, but a better approach is to routinely interpret F-statistics that have been corrected for the amount by which the data are not spherical.
  • The table labelled Tests of Within-Subjects Effects shows the F-statistics and their p-values. In a two-way design you will have a main effect of each variable and the interaction between them. For each effect, read the row labelled Greenhouse–Geisser (you can also look at Huynh–Feldt, but you’ll have to read this chapter to find out the relative merits of the two procedures). If the value in the column labelled Sig. is less than 0.05 then the effect is significant.
  • Break down the main effects and interactions using contrasts. These contrasts appear in the table labelled Tests of Within-Subjects Contrasts. If the values in the column labelled Sig. are less than 0.05 the contrast is significant.

Mixed designs

  • Mixed designs compare several means when there are two or more independent variables, and at least one of them has been measured using the same entities and at least one other has been measured using different entities.
  • Correct for deviations from sphericity for the repeated-measures variable(s) by routinely interpreting the Greenhouse–Geisser corrected effects. (Some people do this only if Mauchly’s test is significant, but this approach is problematic because the results of the test depend on the sample size.)
  • The table labelled Tests of Within-Subjects Effects shows the F-statistic(s) for any repeated-measures variables and all of the interaction effects. For each effect, read the row labelled Greenhouse–Geisser or Huynh–Feldt (read the previous chapter to find out the relative merits of the two procedures). If the value in the Sig. column is less than 0.05 then the means are significantly different.
  • The table labelled Tests of Between-Subjects Effects shows the F-statistic(s) for any between-group variables. If the value in the Sig. column is less than 0.05 then the means of the groups are significantly different.
  • Break down the main effects and interaction terms using contrasts. These contrasts appear in the table labelled Tests of Within-Subjects Contrasts; again look to the columns labelled Sig. to discover if your comparisons are significant (they are if the significance value is less than 0.05).
  • Look at the means – or, better still, draw graphs – to help you interpret the contrasts.

MANOVA

  • MANOVA is used to test the difference between groups across several outcome variables simultaneously.
  • Box’s test looks at the assumption of equal covariance matrices. This test can be ignored when sample sizes are equal because when they are, some MANOVA test statistics are robust to violations of this assumption. If group sizes differ this test should be inspected. If the value of Sig. is less than 0.001 then the results of the analysis should not be trusted (see Section 17.7.1).
  • The table labelled Multivariate Tests gives us four test statistics (Pillai’s trace, Wilks’s lambda, Hotelling’s trace and Roy’s largest root). I recommend using Pillai’s trace. If the value of Sig. for this statistic is less than 0.05 then the groups differ significantly with respect to a linear combination of the outcome variables.
  • Univariate F-statistics can be used to follow up the MANOVA (a different F-statistic for each outcome variable). The results of these are listed in the table entitled Tests of Between-Subjects Effects. These F-statistics can in turn be followed up using contrasts. Personally I recommend discriminant function analysis over this approach.

Discriminant function analysis

  • Discriminant function analysis can be used after MANOVA to see how the outcome variables discriminate the groups.
  • Discriminant function analysis identifies variates (combinations of the outcome variables). To find out how many variates are significant look at the tables labelled Wilks’s Lambda: if the value of Sig. is less than 0.05 then the variate is significantly discriminating the groups.
  • Once the significant variates have been identified, use the table labelled Canonical Discriminant Function Coefficients to find out how the outcome variables contribute to the variates. High scores indicate that an outcome variable is important for a variate, and variables with positive and negative coefficients are contributing to the variate in opposite ways.
  • Finally, to find out which groups are discriminated by a variate look at the table labelled Functions at Group Centroids: for a given variate, groups with values opposite in sign are being discriminated by that variate.

Preliminary analysis

  • Scan the correlation matrix for variables that have very small correlations with most other variables, or correlate very highly (r = 0.9) with one or more other variables.
  • In factor analysis, check that the determinant of this matrix is bigger than 0.00001; if it is then multicollinearity isn’t a problem. You don’t need to worry about this for principal component analysis.
  • In the table labelled KMO and Bartlett’s Test the KMO statistic should be greater than 0.5 as a bare minimum; if it isn’t, collect more data. You should check the KMO statistic for individual variables by looking at the diagonal of the anti-image matrix. These values should also be above 0.5 (this is useful for identifying problematic variables if the overall KMO is unsatisfactory).
  • Bartlett’s test of sphericity will usually be significant (the value of Sig. will be less than 0.05), if it’s not, you’ve got a disaster on your hands.
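The determinant check in the second bullet is easy to reproduce outside SPSS; a NumPy sketch, assuming your variables are the columns of a data matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 6))          # stand-in for your variables (columns)

R = np.corrcoef(data, rowvar=False)       # correlation matrix of the variables
print(np.linalg.det(R) > 0.00001)         # True: multicollinearity probably isn't a problem
```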

Factor extraction

  • To decide how many factors to extract, look at the table labelled Communalities and the column labelled Extraction. If these values are all 0.7 or above and you have fewer than 30 variables then the default (Kaiser’s criterion) for extracting factors is fine. Likewise, if your sample size exceeds 250 and the average of the communalities is 0.6 or greater. Alternatively, with 200 or more participants the scree plot can be used.
  • Check the bottom of the table labelled Reproduced Correlations for the percentage of ‘nonredundant residuals with absolute values greater than 0.05’. This percentage should be less than 50% and the smaller it is, the better.

Interpretation

  • If you’ve conducted orthogonal rotation then look at the table labelled Rotated Factor Matrix. For each variable, note the factor/component for which the variable has the highest loading (above about 0.3–0.4 when you ignore the plus or minus sign). Try to make sense of what the factors represent by looking for common themes in the items that load highly on the same factor.
  • If you’ve conducted oblique rotation then do the same as above but for the Pattern Matrix. Double-check what you find by doing the same for the Structure Matrix.

Associations between two categorical variables

  • To test the relationship between two categorical variables use Pearson’s chi-square test or the likelihood ratio statistic.
  • Look at the table labelled Chi-Square Tests; if the Exact Sig. value is less than 0.05 for the row labelled Pearson Chi-Square then there is a significant relationship between your two variables.
  • Check underneath this table to make sure that no expected frequencies are less than 5.
  • Look at the contingency table to work out what the relationship between the variables is: look out for significant standardized residuals (values outside of ±1.96), and columns that have different letters as subscripts (this indicates a significant difference).
  • Calculate the odds ratio.
  • The Bayes factor reported by SPSS Statistics tells you the probability of the data under the null hypothesis relative to the alternative. Divide 1 by this value to see the probability of the data under the alternative hypothesis relative to the null. Values greater than 1 indicate that your belief should change towards the alternative hypothesis, with values greater than 3 starting to indicate a change in beliefs that has substance.
  • Report the χ2 statistic, the degrees of freedom, the significance value and the odds ratio. Also report the contingency table.
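A sketch of the same analysis on a hypothetical 2×2 contingency table with SciPy, including the expected-frequency check and the odds ratio:

```python
import numpy as np
from scipy import stats

# Rows: category A / category B; columns: outcome occurred / did not occur (made-up counts)
table = np.array([[28, 12],
                  [14, 26]])

chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)

print(chi2, p, dof)
print(expected.min() >= 5)                   # check that no expected frequency is below 5
odds_ratio = (table[0, 0] / table[0, 1]) / (table[1, 0] / table[1, 1])
print(odds_ratio)
```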

Loglinear analysis

  • Test the relationship between more than two categorical variables with loglinear analysis.
  • Loglinear analysis is hierarchical: the initial model contains all main effects and interactions. Starting with the highest-order interaction, terms are removed to see whether their removal significantly affects the fit of the model. If it does then this term is not removed and all lower-order effects are ignored.
  • Look at the table labelled K-Way and Higher-Order Effects to see which effects have been retained in the final model. Then look at the table labelled Partial Associations to see the individual significance of the retained effects (look at the column labelled Sig. – values less than 0.05 indicate significance).
  • Look at the Goodness-of-Fit Tests for the final model: if this model is a good fit of the data then this statistic should be non-significant (Sig. should be bigger than 0.05).
  • Look at the contingency table to interpret any significant effects (percentage of total for cells is the best thing to look at).

Issues in logistic regression

  • In logistic regression, we assume the same things as the linear model.
  • The linearity assumption is that each predictor has a linear relationship with the log of the outcome variable.
  • If we created a table that combined all possible values of all variables then we should ideally have some data in every cell of this table. If you don’t then watch out for big standard errors.
  • If the outcome variable can be predicted perfectly from one predictor variable (or a combination of predictor variables) then we have complete separation. This problem creates large standard errors too.
  • Overdispersion is where the variance is larger than expected from the model. This can be caused by violating the assumption of independence. This problem makes the standard errors too small.

Model fit

  • Build your model systematically and choose the most parsimonious model as the final one.
  • The overall fit of the model is shown by −2LL and its associated chi-square statistic. If the significance of the chi-square statistic is less than 0.05, then the model is a significant fit to the data.
  • Check the table labelled Variables in the Equation to see the regression parameters for any predictors you have in the model.
  • For each variable in the model, look at the Wald statistic and its significance (which again should be below 0.05). Use the odds ratio, Exp(B), for interpretation. If the value is greater than 1 then as the predictor increases, the odds of the outcome occurring increase. Conversely, a value less than 1 indicates that as the predictor increases, the odds of the outcome occurring decrease. For the aforementioned interpretation to be reliable the confidence interval of Exp(B) should not cross 1.
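With statsmodels the same quantities (b, its significance, Exp(B) and its confidence interval) look like this; the variable names are hypothetical and the data simulated.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})
true_p = 1 / (1 + np.exp(-(-0.5 + 0.8 * df['x1'])))     # true model used to simulate outcomes
df['outcome'] = rng.binomial(1, true_p)

res = smf.logit('outcome ~ x1 + x2', data=df).fit()

print(res.summary())           # b-values and their significance tests
print(np.exp(res.params))      # Exp(B): the odds ratio for each predictor
print(np.exp(res.conf_int()))  # their confidence intervals (ideally not crossing 1)
```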

Diagnostic statistics

  • Look for cases that might be influencing the logistic regression model by checking residuals.
  • Look at standardized residuals and check that no more than 5% of cases have absolute values above 2, and that no more than about 1% have absolute values above 2.5. Any case with a value above about 3 could be an outlier.
  • Look in the data editor for the values of Cook’s distance: any value above 1 indicates a case that might be influencing the model.
  • Calculate the average leverage (the number of predictors plus 1, divided by the sample size) and then look for values greater than twice or three times this average value.
  • Look for absolute values of DFBeta greater than 1.

Multilevel models

  • Multilevel models should be used to analyse data that have a hierarchical structure. For example, you might measure depression after psychotherapy. In your sample, patients will see different therapists within different clinics. This is a three-level hierarchy with depression scores from patients (level 1), nested within therapists (level 2), who are themselves nested within clinics (level 3).
  • Hierarchical models are just linear models in which you can allow parameters to vary (this is called a random effect). In the standard linear model, parameters generally are a fixed value estimated from the sample (a fixed effect).
  • If we estimate a linear model within each context (the therapist or clinic, to use the example above) rather than the sample as a whole, then we can assume that the intercepts of these models vary (a random intercepts model), or that the slopes of these models differ (a random slopes model) or that both vary.
  • We can compare different models by looking at the difference in the value of −2LL. Usually we would do this when we have changed only one parameter (added one new thing to the model).
  • For any model we have to assume a covariance structure. For random intercepts models the default of variance components is fine, but when slopes are random an unstructured covariance structure is often assumed. When data are measured over time an autoregressive structure (AR(1)) is often assumed.
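A two-level sketch (patients within clinics, ignoring the therapist level for brevity) of a random intercepts and slopes model with statsmodels; the variable names and data are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_clinics, per_clinic = 20, 30
clinic = np.repeat(np.arange(n_clinics), per_clinic)
hours = rng.uniform(1, 20, n_clinics * per_clinic)
clinic_effect = rng.normal(0, 2, n_clinics)[clinic]        # clinics differ in baseline depression
depression = 50 - 0.8 * hours + clinic_effect + rng.normal(0, 3, len(hours))

df = pd.DataFrame({'depression': depression, 'hours': hours, 'clinic': clinic})

# Random intercepts and random slopes for hours across clinics
model = smf.mixedlm('depression ~ hours', data=df,
                    groups=df['clinic'], re_formula='~hours').fit()
print(model.summary())
print(-2 * model.llf)    # -2LL, for comparing nested models that differ by one parameter
```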

Multilevel models output

  • The Information Criteria table can be used to assess the overall fit of the model. The value of −2LL can be tested for significance with df equal to the number of parameters being estimated. It is mainly used, though, to compare models that are the same in all but one parameter by testing the difference in −2LL in the two models against df = 1 (if only one parameter has been changed). The AIC, AICC, CAIC and BIC can also be compared across models (but not tested for significance).
  • The table of Type III Tests of Fixed Effects tells you whether your predictors significantly predict the outcome: look in the column labelled Sig. If the value is less than 0.05 then the effect is significant.
  • The table of Estimates of Fixed Effects gives us the b-values for each effect and its confidence interval. The direction of these coefficients tells us whether the relationship between each predictor and the outcome is positive or negative.
  • The table labelled Estimates of Covariance Parameters tells us about random effects in the model. These values can tell us how much intercepts and slopes varied over our level 1 variable. The significance of these estimates should be treated cautiously. The exact labelling of these effects depends on which covariance structure you selected for the analysis.

Growth models

  • Growth models are multilevel models in which changes in an outcome over time are modelled using potential growth patterns.
  • These growth patterns can be linear, quadratic, cubic, logarithmic, exponential, or anything you like really.
  • The hierarchy in the data is that time points are nested within people (or other entities). As such, it’s a way of analysing repeated-measures data that have a hierarchical structure.
  • The Information Criteria table can be used to assess the overall fit of the model. The value of −2LL can be tested for significance with df equal to the number of parameters being estimated. It is mainly used, though, to compare models that are the same in all but one parameter by testing the difference in −2LL in the two models against df = 1 (if only one parameter has been changed). The AIC, AICC, CAIC and BIC can also be compared across models (but not tested for significance).
  • The table of Type III Tests of Fixed Effects tells you whether the growth functions in the model significantly predict the outcome: look in the column labelled Sig. If the value is less than 0.05 then the effect is significant.
  • The table labelled Estimates of Covariance Parameters tells us about random effects in the model. These values can tell us how much intercepts and slopes varied over our level 1 variable. The significance of these estimates should be treated cautiously. The exact labelling of these effects depends on which covariance structure you selected for the analysis.
  • An autoregressive covariance structure, AR(1), is often assumed in time course data such as that in growth models.