Learning Objectives

After completing your study of this chapter, you should be able to do the following:

  • Define reliability/precision, and describe three methods for estimating the reliability/precision of a psychological test and its scores.
  • Describe how an observed test score is made up of the true score and random error, and describe the difference between random error and systematic error.
  • Calculate and interpret a reliability coefficient, including adjusting a reliability coefficient obtained using the split-half method.
  • Differentiate between the KR-20 and coefficient alpha formulas, and understand how they are used to estimate internal consistency.
  • Calculate the standard error of measurement, and use it to construct a confidence interval around an observed score.
  • Identify four sources of test error and six factors related to these sources of error that are particularly important to consider.
  • Explain the premises of generalizability theory, and describe its contribution to estimating reliability.

Chapter Summary

Psychological tests are measurement instruments. An important attribute of a measurement instrument is its reliability/precision or consistency. We need evidence that the test yields the same score each time a person takes the test unless the test taker has actually changed. When we know a test is reliable, we can conclude that changes in a person’s score really are due to changes in that person. Also, we can compare the scores of two or more people on a reliable test.

Test developers use three methods for checking reliability. Each takes into account various conditions that could produce differences in test scores. Using the test–retest method, a test developer gives the same test to the same group of test takers on two different occasions. The scores from the first and second administrations are then correlated to obtain the reliability coefficient. The greatest danger in using the test–retest method of estimating reliability/precision is that the test takers will score differently (usually higher) on the test because of practice effects. To overcome practice effects and differences in individuals and the test administration from one time to the next, psychologists often give two forms of the same test, alike in every way, to the same people at the same time. This method is called alternate or parallel forms.
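
To make the computation concrete, here is a minimal sketch in Python. The scores are hypothetical values for five test takers; numpy.corrcoef is one common way to obtain the Pearson correlation the chapter describes.

```python
import numpy as np

# Hypothetical scores for the same five test takers on two occasions.
time1 = np.array([82, 75, 90, 68, 77])
time2 = np.array([85, 74, 92, 70, 75])

# The test-retest reliability coefficient is the Pearson correlation
# between the first and second administrations.
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability: {r_tt:.2f}")
```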

If a test taker can take the test only once, researchers divide the test into halves and correlate the scores on the first half with the scores on the second half. This method, called split-half reliability, includes using the Spearman–Brown formula to adjust the correlation coefficient for test length. A more precise way to measure internal consistency is to compare individuals’ scores on all possible ways of splitting the test into halves. The KR-20 and coefficient alpha formulas allow researchers to estimate the reliability of the test scores by correlating the answer to each test question with the answers to all of the other test questions.
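As an illustration, the sketch below uses a small hypothetical item matrix to compute a split-half coefficient (here an odd-even split, one common choice), adjust it with the Spearman–Brown formula, and compute coefficient alpha, which reduces to KR-20 when items are scored 0/1.

```python
import numpy as np

# Hypothetical item scores: rows are 6 test takers, columns are 8
# dichotomous (0/1) items.
items = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 0, 1, 1],
])

# Split-half: correlate totals on the two halves, then adjust with
# the Spearman-Brown formula: r_full = 2r / (1 + r).
half1 = items[:, ::2].sum(axis=1)   # odd-numbered items
half2 = items[:, 1::2].sum(axis=1)  # even-numbered items
r_half = np.corrcoef(half1, half2)[0, 1]
r_sb = (2 * r_half) / (1 + r_half)

# Coefficient alpha: (k / (k - 1)) * (1 - sum of item variances /
# variance of total scores). With 0/1 items this equals KR-20.
k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)
total_var = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

print(f"Split-half r = {r_half:.2f}, Spearman-Brown = {r_sb:.2f}")
print(f"Coefficient alpha (KR-20 here) = {alpha:.2f}")
```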

The reliability of scoring is also important. Tests that require the scorer to make judgments about the test takers’ answers, and tests that require the scorer to observe the test takers’ behavior, may have error contributed by the scorer. We estimate scorer reliability by having two or more persons score the same test and then correlating their scores to see whether their judgments are consistent, or by having a single person score the same test on two occasions.

No measurement instrument is perfectly reliable or consistent. We express this idea by saying that each observed test score (X) contains two parts: a true score (T) and error (E). Two types of error appear in test scores: random error and systematic error. The more random error present in a set of test scores, the lower the reliability coefficient will be. Put another way, the higher the proportion of the observed score variance that is true score variance, the higher the reliability coefficient will be.
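
A short simulation can illustrate this decomposition. The numbers below are arbitrary, chosen so that true score variance is 90% of observed score variance; correlating two parallel administrations should then recover a reliability near .90.

```python
import numpy as np

# Simulate the classical model X = T + E for two parallel administrations.
rng = np.random.default_rng(42)
n = 10_000
true_scores = rng.normal(100, 15, n)    # T: the stable trait
x1 = true_scores + rng.normal(0, 5, n)  # X = T + random error
x2 = true_scores + rng.normal(0, 5, n)  # same T, fresh random error

# The correlation between parallel forms estimates reliability, which
# should approximate var(T) / var(X) = 15^2 / (15^2 + 5^2) = 0.90.
r = np.corrcoef(x1, x2)[0, 1]
print(f"Simulated reliability:       {r:.3f}")
print(f"Theoretical var(T)/var(X):   {15**2 / (15**2 + 5**2):.3f}")
```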

To quantify a test’s reliability/precision estimate, we use a reliability coefficient, which is another name for the correlation coefficient when it estimates reliability/precision. This statistic quantifies the estimated relationship between two forms of the test. The statistical procedure we use most often to calculate the reliability coefficient is the Pearson product–moment correlation. All statistical software programs and many spreadsheet programs will calculate the Pearson product–moment correlation. Coefficient alpha and KR-20, both of which also use correlation, are generally available only in statistical packages.

To interpret the meaning of the reliability coefficient, we look at its sign and the number itself. Reliability coefficients range from 0.00 (a completely unreliable test) to +1.00 (a perfectly reliable test). Psychologists have not set a fixed value at which reliability can be interpreted as satisfactory or unsatisfactory.

Psychologists use the standard error of measurement (SEM) as an index of the amount of inconsistency or error expected in an individual’s test score. We can then use the SEM to construct a confidence interval, a range of scores that most likely includes the true score. Confidence intervals provide information about whether individuals’ observed test scores are statistically different from each other. Six factors that influence the reliability of test scores are test length, homogeneity of questions, the test–retest interval, test administration, scoring, and cooperation of the test takers.
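
A minimal sketch of the arithmetic, using the standard formula SEM = SD × √(1 − reliability) and hypothetical values for the test’s standard deviation, reliability, and one person’s observed score:

```python
import math

# Hypothetical test: SD = 15, reliability = .91, observed score = 110.
sd, r_xx, observed = 15, 0.91, 110

# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
sem = sd * math.sqrt(1 - r_xx)  # 15 * 0.30 = 4.5

# 95% confidence interval around the observed score: X +/- 1.96 * SEM.
lower, upper = observed - 1.96 * sem, observed + 1.96 * sem
print(f"SEM = {sem:.1f}; 95% CI = [{lower:.1f}, {upper:.1f}]")
```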

Another approach to estimating reliability is generalizability theory, which concerns how well and under what conditions we can generalize an estimation of reliability from one test to another or on the same test given under different circumstances. Generalizability theory seeks to identify sources of systematic error that classical test theory would simply label as random error. Using analysis of variance, researchers and test developers can identify systematic error and then take measures to eliminate it, thereby increasing the overall reliability of the test.
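
The sketch below illustrates the idea for the simplest case, a hypothetical persons-by-raters design: mean squares from a two-way analysis of variance are converted into variance components, separating the person (true) variance from systematic rater variance and residual random error. The data are invented for illustration, and the expected-mean-square formulas shown are the standard ones for this one-facet design, not a full G study.

```python
import numpy as np

# Hypothetical one-facet G study: 5 persons each scored by 3 raters.
scores = np.array([
    [8, 7, 9],
    [5, 5, 6],
    [9, 8, 9],
    [4, 3, 5],
    [7, 6, 8],
], dtype=float)
n_p, n_r = scores.shape
grand = scores.mean()

# Mean squares from a two-way ANOVA without replication.
ms_p = n_r * ((scores.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
ms_r = n_p * ((scores.mean(axis=0) - grand) ** 2).sum() / (n_r - 1)
resid = (scores - scores.mean(axis=1, keepdims=True)
         - scores.mean(axis=0, keepdims=True) + grand)
ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

# Variance components via the expected mean squares.
var_p = (ms_p - ms_res) / n_r  # persons (the "true" signal)
var_r = (ms_r - ms_res) / n_p  # raters (systematic error)
var_res = ms_res               # residual (random error)

# Generalizability coefficient for the mean of n_r raters.
g_coef = var_p / (var_p + var_res / n_r)
print(f"Person var = {var_p:.2f}, rater var = {var_r:.2f}, "
      f"residual = {var_res:.2f}, G = {g_coef:.2f}")
```

Here the rater variance component quantifies systematic scorer severity or leniency, exactly the kind of error source that classical test theory would fold into random error; once identified, it can be reduced through rater training or averaged out by using more raters.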