Measurement experts (and many educators)
believe that every measurement device should possess certain
qualities. Perhaps the two most common technical concepts in
measurement are reliability and validity. Any kind of assessment,
whether traditional or "authentic," must be developed in a way
that gives the assessor accurate information about the performance
of the individual. At one extreme, we wouldn't have an individual
paint a picture if we wanted to assess writing skills.
A. Reliability: Definition
The degree of consistency
between two measures of the same thing (Mehrens and Lehmann,
1987).
The measure of how stable,
dependable, trustworthy, and consistent a test is in measuring
the same thing each time (Worthen et al., 1993).
For example, if we wish to measure
a person's weight, we would hope that the scale would register
the same measure each time the person stepped on the scale. Another
approach to studying consistency would be to have a whole group
of people weigh themselves twice (changing scales and/or times
and/or the reader and recorder of the measure) and determine
whether the relative standing of the persons remains about the
same. This would give us an estimate of the reliability of the
measure.
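The group-weighing approach above can be sketched in a few lines
of Python. This is only an illustration: the weights are made up,
and the choice of the Pearson correlation as the index of how well
relative standing is preserved is an assumption, not something the
cited authors prescribe here. The function name pearson_r is
likewise hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical weights (in pounds) for five people, each weighed twice.
first_weighing = [120.0, 150.0, 165.0, 180.0, 210.0]
second_weighing = [121.0, 149.0, 166.0, 181.0, 209.0]

# A value close to 1 means people kept about the same relative standing.
reliability_estimate = pearson_r(first_weighing, second_weighing)
print(round(reliability_estimate, 3))
```

A value near 1 suggests the two weighings rank people almost
identically; a value near 0 would suggest the measure is too
inconsistent to trust.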
Or, if we wanted to measure the
length of a piece of wood, the tape measure should yield the same
result each time. Even if you had someone else remeasure the
wood, the result should be consistent.
Assume that you gave a student
a history test yesterday and then gave the test again today.
You found that the student scored very high the first day and
very low the second day. It could have been that the student
had an off day or that the test is simply unreliable.
A student's test score may vary
for many reasons. The amount of the characteristic we are measuring
may change across time; the particular questions we ask in order
to infer a person's knowledge could affect the score; any change
in directions, timing, or amount of rapport with the test administrator
could cause score variability; inaccuracies in scoring a test
paper will affect the scores; and finally, such things as health,
motivation, degree of fatigue of the person, and good or bad
luck in guessing could cause score variability.
B. Validity
1. Definition:
Truthfulness: Does the
test measure what it purports to measure? The extent to which
certain inferences can be made from test scores or other
measurement (Mehrens and Lehmann, 1987).
The degree to which
tests accomplish the purpose for which they are being used
(Worthen et al., 1993).
For a test to be valid, or truthful,
it must first be reliable. If we cannot even get a bathroom scale
to give us a consistent weight measure, we certainly cannot expect
it to be accurate. Note, however, that a measure might be consistent
(reliable) but not accurate (valid). A scale may record weights
as two pounds too heavy each time. In other words, reliability
is a necessary but insufficient condition for validity. (Neither
validity nor reliability is an either/or dichotomy; there are
degrees of each.) Since a single test may be used for many different
purposes, there is no single validity index for a test. A test
that has some validity for one purpose may be invalid for another.
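The bathroom-scale point can be made concrete with a small Python
sketch. All numbers are hypothetical, and using the spread of
repeated readings as a stand-in for consistency and the average
error as a stand-in for accuracy is a simplifying assumption made
for illustration only. The helper name describe is also made up.

```python
from statistics import mean, pstdev

true_weight = 150.0  # hypothetical true weight in pounds

# Hypothetical readings from a scale that always reads two pounds heavy.
biased_scale = [152.0, 152.0, 152.0, 152.0]
# Hypothetical readings from a scale that is right on average but erratic.
erratic_scale = [144.0, 156.0, 147.0, 153.0]

def describe(readings, true_value):
    """Return (spread, bias) for a list of repeated readings.

    spread: standard deviation of the readings (small = consistent).
    bias: average reading minus the true value (zero = accurate on average).
    """
    spread = pstdev(readings)
    bias = mean(readings) - true_value
    return spread, bias

biased_spread, biased_bias = describe(biased_scale, true_weight)
erratic_spread, erratic_bias = describe(erratic_scale, true_weight)

print("biased scale: ", biased_spread, biased_bias)
print("erratic scale:", round(erratic_spread, 2), erratic_bias)
```

The first scale is perfectly consistent yet always wrong by two
pounds, so it is reliable but not valid. The second scale averages
out to the true weight, but any single reading may be off by
several pounds, so no individual reading can be trusted.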
2. Kinds of Validity Evidence
Content Validity Evidence
- Refers to the extent to which the content of a test's items
represents the entire body of content to be measured. The
basic issue in content validation is representativeness.
That is, how adequately does the content of the test represent
the entire body of content to which the test user intends
to generalize? Since the responses to a test are only a sample
of a student's behavior, the validity of any inferences about
that student depends upon the representativeness of that
sample. Two questions to be asked: 1. To what degree does
the test include a representative sample of all important
parts of the behavioral domain? 2. To what extent is the
test free from the influence of irrelevant variables that
would threaten the validity of inferences based on the observed
scores?
Criterion-Related Validity
- Refers to the extent to which one can infer from an individual's
score on a test how well she will perform some other external
task or activity that is supposedly measured by the test
in question. That is, is the test score useful in predicting
some future performance (predictive validity) or can the
test score be substituted for some less efficient way of
gathering data (concurrent validity)? Examples of criteria
are success in school, success in class, or success on the job
as an employee.
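As a rough illustration of the predictive case, the sketch below
fits a straight line that predicts a later criterion from an
earlier test score. Everything in it is assumed for the example:
the scores, the use of first-year grade point average as the
criterion, the least-squares line as the prediction method, and
the function name fit_line. It is not a procedure taken from the
cited authors.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical admission test scores and the same students' later GPA.
test_scores = [400, 450, 500, 550, 600]
later_gpa = [2.0, 2.4, 2.9, 3.1, 3.6]

slope, intercept = fit_line(test_scores, later_gpa)

# Predicted criterion performance for a new student who scored 520.
predicted_gpa = slope * 520 + intercept
print(round(predicted_gpa, 2))
```

How closely such predictions match what students actually go on to
do is one piece of criterion-related evidence; a good fit in a
small sample like this one would not by itself establish validity.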
Construct Validity -
Words like assertiveness, giftedness, and hyperactivity refer
to abstract ideas that humans construct in their minds to
help them explain observed patterns or differences in the
behavior of themselves or other people. Intelligence, self-esteem,
aggressiveness, achievement motivation, creativity, critical
thinking ability, reading comprehension, mathematical reasoning
ability, shyness, curiosity, hypocrisy, and procrastination
are also examples of constructs. A construct is an unobservable,
postulated attribute of individuals created in our minds
to help explain or theorize about human behavior. Since constructs
do not exist outside the human mind, they are not directly
measurable. Construct validity, then, is the degree to which one
can infer certain constructs in a psychological theory from the
test scores.
For example, people who are interested
in studying a construct such as creativity have probably hypothesized
that creative people will perform differently from those who
are not creative. It is possible to build a theory specifying
how creative people (people who possess the construct creativity)
behave differently from others. Once this is done, creative people
can be identified by observing the behavior of individuals and
classifying them according to the theory. Suppose one wishes
to build a paper-and-pencil test to measure creativity. Once
developed, the creativity test would be considered to have construct
validity to the degree that the test scores are related to the
judgments made from observing behavior identified by the psychological
theory as creative. If the anticipated relationships are not
found, then the construct validity of the inference that the
test is measuring creativity is not supported.
Face Validity - Refers
to whether the test looks valid "on the face of it." That
is, would untrained people who look at or take the test be
likely to think the test is measuring what its author claims?
Face validity often is a desirable feature of a test in the
sense that it is useful from a public acceptance standpoint.
If a test appears irrelevant, examinees may not take the
test seriously, or potential users may not consider the results
useful.
References
Mehrens, W. A., & Lehmann,
I. J. (1987). Using standardized tests in education. New York:
Longman.
Worthen, B. R.,
Borg, W. R., & White, K. R. (1993). Measurement and evaluation
in the school. New York: Longman.