Reliability and Validity

Measurement experts (and many educators) believe that every measurement device should possess certain qualities. Perhaps the two most common technical concepts in measurement are reliability and validity. Any kind of assessment, whether traditional or "authentic," must be developed in a way that gives the assessor accurate information about the performance of the individual. At one extreme, we wouldn't have an individual paint a picture if we wanted to assess writing skills.

A. Reliability: Definition

• The degree of consistency between two measures of the same thing (Mehrens and Lehmann, 1987).

• The measure of how stable, dependable, trustworthy, and consistent a test is in measuring the same thing each time (Worthen et al., 1993).

For example, if we wish to measure a person's weight, we would hope that the scale would register the same measure each time the person stepped on the scale. Another approach to studying consistency would be to have a whole group of people weigh themselves twice (changing scales and/or times and/or the reader and recorder of the measure) and determine whether the relative standing of the persons remains about the same. This would give us an estimate of the reliability of the measure.

Or, if we wanted to measure the length of a piece of wood, the tape measure should yield the same measurement each time. Even if someone else remeasured the wood, the result should be consistent.
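
The group-weighing approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights are invented, and the use of a Pearson correlation between the two weighings as the reliability estimate is one common choice rather than something the definitions above prescribe. The helper name pearson_r is hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical weights (in pounds) for five people, each weighed twice.
first_weighing = [120.0, 150.0, 165.0, 180.0, 210.0]
second_weighing = [121.0, 149.0, 166.0, 181.0, 209.0]

# A value near 1 means the relative standing of the five people
# stayed about the same across the two weighings.
reliability_estimate = pearson_r(first_weighing, second_weighing)
print(round(reliability_estimate, 3))
```

A value close to 1 suggests a consistent measure; a value near 0 would suggest that people's relative standing changed a great deal between the two weighings.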

Assume that you gave a student a history test yesterday and then gave the test again today. You found that the student scored very high the first day and very low the second day. It could have been that the student had an off day or that the test is simply unreliable.

A student's test score may vary for many reasons. The amount of the characteristic we are measuring may change across time; the particular questions we ask in order to infer a person's knowledge could affect the score; any change in directions, timing, or amount of rapport with the test administrator could cause score variability; inaccuracies in scoring a test paper will affect the scores; and finally, such things as the person's health, motivation, and degree of fatigue, and good or bad luck in guessing, could cause score variability.

B. Validity

1. Definition:

• Truthfulness: Does the test measure what it purports to measure? The extent to which certain inferences can be made from test scores or other measurements (Mehrens and Lehmann, 1987).

• The degree to which tests accomplish the purpose for which they are being used (Worthen et al., 1993).

For a test to be valid, or truthful, it must first be reliable. If we cannot even get a bathroom scale to give us a consistent weight measure, we certainly cannot expect it to be accurate. Note, however, that a measure might be consistent (reliable) but not accurate (valid). A scale may record weights as two pounds too heavy each time. In other words, reliability is a necessary but insufficient condition for validity. (Neither validity nor reliability is an either/or dichotomy; there are degrees of each.) Since a single test may be used for many different purposes, there is no single validity index for a test. A test that has some validity for one purpose may be invalid for another.
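
The bathroom-scale example can be made concrete with a small Python sketch. All numbers here are hypothetical, and using the standard deviation of repeated readings as a stand-in for consistency and the mean error as a stand-in for accuracy is an illustrative simplification, not a formal definition of reliability or validity.

```python
from statistics import mean, pstdev

# Hypothetical true weight of one person, in pounds.
true_weight = 150.0

# Hypothetical readings from a scale that records two pounds too heavy each time.
readings = [152.0, 152.0, 152.0, 152.0]

# Spread of repeated readings: zero here, so the scale is perfectly consistent.
spread = pstdev(readings)

# Average error against the true weight: the scale is consistently wrong.
bias = mean(readings) - true_weight

print(spread, bias)
```

The readings never vary, yet every one of them is off by the same amount, which is the sense in which a measure can be reliable without being valid.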

2. Kinds of Validity Evidence

• Content Validity Evidence - Refers to the extent to which the content of a test's items represents the entire body of content to be measured. The basic issue in content validation is representativeness. That is, how adequately does the content of the test represent the entire body of content to which the test user intends to generalize? Since the responses to a test are only a sample of a student's behavior, the validity of any inferences about that student depends upon the representativeness of that sample. Two questions to be asked: 1. To what degree does the test include a representative sample of all important parts of the behavioral domain? 2. To what extent is the test free from the influence of irrelevant variables that would threaten the validity of inferences based on the observed scores?

• Criterion-Related Validity - Refers to the extent to which one can infer from an individual's score on a test how well she will perform some other external task or activity that is supposedly measured by the test in question. That is, is the test score useful in predicting some future performance (predictive validity), or can the test score be substituted for some less efficient way of gathering data (concurrent validity)? Examples of criteria are success in school, success in class, or success on the job.

• Construct Validity - Words like assertiveness, giftedness, and hyperactivity refer to abstract ideas that humans construct in their minds to help them explain observed patterns or differences in the behavior of themselves or other people. Intelligence, self-esteem, aggressiveness, achievement motivation, creativity, critical thinking ability, reading comprehension, mathematical reasoning ability, shyness, curiosity, hypocrisy, and procrastination are also examples of constructs. A construct is an unobservable, postulated attribute of individuals created in our minds to help explain or theorize about human behavior. Since constructs do not exist outside the human mind, they are not directly measurable. Construct validity, then, is the degree to which one can infer certain constructs in a psychological theory from the test scores.

For example, people who are interested in studying a construct such as creativity have probably hypothesized that creative people will perform differently from those who are not creative. It is possible to build a theory specifying how creative people (people who possess the construct creativity) behave differently from others. Once this is done, creative people can be identified by observing the behavior of individuals and classifying them according to the theory. Suppose one wishes to build a paper-and-pencil test to measure creativity. Once developed, the creativity test would be considered to have construct validity to the degree that the test scores are related to the judgments made from observing behavior identified by the psychological theory as creative. If the anticipated relationships are not found, then the construct validity of the inference that the test is measuring creativity is not supported.

• Face Validity - Refers to whether the test looks valid "on the face of it." That is, would untrained people who look at or take the test be likely to think the test is measuring what its author claims? Face validity often is a desirable feature of a test in the sense that it is useful from a public acceptance standpoint. If a test appears irrelevant, examinees may not take the test seriously, or potential users may not consider the results useful.

References

Mehrens, W. A., & Lehmann, I. J. (1987). Using standardized tests in education. New York: Longman.

Worthen, B. R., Borg, W. R., & White, K. R. (1993). Measurement and evaluation in the school. New York: Longman.