Koretz on What Educational Testing Tells Us


Daniel Koretz argues for a moderate, skeptical approach to achievement testing:

Perhaps testing seems so misleadingly simple because for those of us raised and educated in the United States, standardized testing has been ubiquitous, just a fact of life. We were administered achievement tests in elementary and secondary school. Most of us took tests for admission to postsecondary education, many of us repeatedly. We take pencil-and-paper or computerized tests to obtain our driver’s license. Many take licensure examinations for entrance to a trade or a profession. Testing has become a routine part of our vocabulary and our public discourse. …

Careful testing can in fact give us tremendously valuable information about student achievement that we would otherwise lack—otherwise, why would we do it?—and it does rest on several generations of accumulated scientific research and development. But that is no reason to be uncritical in using information from tests. One need not be a psychometrician to understand the key issues raised by achievement testing and to be an informed user of the information tests provide. …

Of the many complexities entailed by educational testing, the most fundamental, and the one that is ultimately the root of so many misunderstandings of test scores, is that test scores usually do not provide a direct and complete measure of educational achievement. Rather, they are incomplete measures, proxies for the more comprehensive measures that we would ideally use but that are generally unavailable to us. There are two reasons for the incompleteness of achievement tests. One, stressed by careful developers of standardized tests for more than half a century, is that these tests can measure only a subset of the goals of education. The second is that even in assessing the goals that can be measured well, tests are generally very small samples of behavior that we use to make estimates of students’ mastery of very large domains of knowledge and skill. …
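To make the sampling point concrete, here is a minimal Python sketch (my illustration, not Koretz's; all numbers are invented). A student's true mastery of a large domain is fixed, yet each small test form drawn from that domain yields a somewhat different score:

```python
import random

random.seed(1)

# Hypothetical domain of 1,000 distinct skills; this student has
# truly mastered 70% of them (700 skills).
DOMAIN_SIZE = 1_000
mastered = set(random.sample(range(DOMAIN_SIZE), 700))

def test_score(n_items):
    """Percent correct on one randomly drawn test form."""
    items = random.sample(range(DOMAIN_SIZE), n_items)
    return 100 * sum(i in mastered for i in items) / n_items

# Five 40-item forms drawn from the same domain, same student:
print([round(test_score(40), 1) for _ in range(5)])
```

Each printed score is an estimate of the same underlying 70 percent mastery; the spread across forms is sampling error, not a change in achievement.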

One sometimes disquieting consequence of the incompleteness of tests is that different tests often provide somewhat inconsistent results. For example, for more than three decades the federal government has funded a large-scale assessment of students nationwide called the National Assessment of Educational Progress, often simply labeled NAEP (pronounced “nape”), which is widely considered the best single barometer of the achievement of the nation’s youth. There are actually two NAEP assessments, one (the main NAEP) designed for detailed reporting in any given year, and a second designed to provide the most consistent estimates of long-term trends. Both show that mathematics achievement has been improving in both grade four and grade eight—particularly in the fourth grade, where the increase has been among the most rapid nationwide changes in performance, up or down, ever recorded. But the upward trend in the main NAEP has been markedly faster than the improvement in the long-term-trend NAEP. Why? Because the tests measure mathematics somewhat differently, taking somewhat different samples of behavior from the large domain of mathematics achievement, and the improvement in student performance has varied from one component of mathematics to another. Such discrepancies are commonplace. They need not indicate that anything is “wrong” with either test (although they may); they can arise simply because different tests measure somewhat different samples of knowledge and skills. And these disparities are not a reason to put test scores aside. Rather, they indicate the need to use scores cautiously—for example, by looking at multiple sources of information about performance and by paying little heed to modest differences in scores. …
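The NAEP discrepancy has the same structure. A hypothetical sketch (invented weights and gains, not actual NAEP figures): if two tests weight the components of a domain differently, and true gains differ across components, the tests will report different overall trends even though neither is “wrong”:

```python
# Hypothetical illustration (not NAEP data): a math domain with two
# components whose true gains differ. Two tests weight the components
# differently, so each reports a different overall trend.
gains = {"computation": 2.0, "problem_solving": 8.0}   # invented gains

def reported_trend(weights):
    """Overall gain a test reports, given its component weights."""
    return sum(weights[c] * gains[c] for c in gains)

test_a = {"computation": 0.7, "problem_solving": 0.3}  # emphasizes computation
test_b = {"computation": 0.3, "problem_solving": 0.7}  # emphasizes problem solving

print(reported_trend(test_a))  # 3.8 -- slower apparent improvement
print(reported_trend(test_b))  # 6.2 -- faster apparent improvement
```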

Then there is the problem of figuring out how to report performance on a test. Most of us grew up in a school system with some simple but arbitrary rules for grading tests, such as “90 percent correct gets you an A.” But replace a few hard questions with easier ones, or vice versa—and variations of this sort occur even when people try to avoid them—and “90 percent correct” no longer signifies the level of mastery it did before. And in any event, what is an “A”? We know that obtaining a grade of “A” can require much more in one class than in another. Psychometricians therefore have had to create scales for reporting performance on tests. These scales are of many different types. Most readers will have encountered arbitrary numerical scales (for example, the SAT scale, which runs from 200 to 800); norm-referenced scales that compare a student to a distribution of students, perhaps a national distribution (for example, grade equivalents and percentile ranks); and the currently dominant performance standards, which break the entire distribution of performance into just a few bins, based on judgments of what students should be able to do. These various scales have different relationships to raw performance on the test, and therefore they often provide differing views of performance.
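A toy example of how one raw score maps onto different reporting scales (the norm group and cut points below are invented for illustration):

```python
from bisect import bisect_left

# Invented norm group: raw scores (percent correct) for 20 students.
norm_group = sorted([42, 48, 51, 55, 58, 60, 62, 64, 66, 68,
                     70, 72, 74, 76, 78, 80, 83, 86, 90, 95])

def percentile_rank(raw):
    """Norm-referenced view: percent of the norm group scoring below raw."""
    return 100 * bisect_left(norm_group, raw) / len(norm_group)

def performance_level(raw):
    """Standards-based view: bin the score at judgment-based cut points."""
    for cut, label in [(80, "Advanced"), (65, "Proficient"), (50, "Basic")]:
        if raw >= cut:
            return label
    return "Below Basic"

print(percentile_rank(74))     # 60.0 -- relative to this norm group
print(performance_level(74))   # 'Proficient' -- relative to these cuts
```

The percentile rank depends entirely on the norm group, and the performance level depends entirely on where judges place the cuts; change either and the “same” raw score reads differently.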

Further, sometimes a test does not function as it should. A test may be biased, producing systematically incorrect estimates of the performance of a particular group of students. For example, a mathematics test that requires reading complex text and writing long answers may be biased against immigrant students who are competent in mathematics but have not yet achieved fluency in English. These cases of bias must be distinguished from simple differences in performance that accurately represent achievement. For instance, if poor students in a given city attend inferior schools, a completely unbiased test is likely to give them lower scores because the inferior teaching they received impeded their learning. …
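The distinction can be made concrete with a small simulation (hypothetical numbers): two groups with identical true mathematics achievement, where one pays a reading penalty on a text-heavy test:

```python
import random

random.seed(2)

def group_mean(reading_penalty, n=1000):
    """Mean observed score for a group whose true math skill averages 70."""
    return sum(random.gauss(70, 10) - reading_penalty for _ in range(n)) / n

# Both groups have the same true achievement; only the test's reading
# load affects them differently.
fluent = group_mean(reading_penalty=0)
english_learners = group_mean(reading_penalty=12)     # text-heavy test
print(round(fluent - english_learners, 1))            # ~12-point gap: bias

english_learners_math_only = group_mean(reading_penalty=0)
print(round(fluent - english_learners_math_only, 1))  # ~0: no systematic gap
```

The gap on the text-heavy test is bias, a systematic error for one group; a gap that tracked genuine differences in learning would persist even on the math-only test.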

Testing is by its nature a highly technical enterprise that rests on a foundation of complex mathematics, much of which is not generally understood even by quantitative social scientists in other fields. Many of the technical reports posted on the Web by state departments of education and other organizations that sponsor tests confront readers with bewildering technical terms and, in some cases, the even more intimidating mathematics that formalizes them. This creates the unfortunate misapprehension that the principles of testing are beyond the reach of most people. The mathematics is essential for the proper design and operation of testing programs, but one does not need it to understand the fundamental principles that underlie the sensible uses of tests and the reasonable interpretation of test scores.

But the core principles and concepts are truly essential. Without an understanding of validity, reliability, bias, scaling, and standard setting, for example, one cannot fully make sense of the information yielded by tests or find sensible resolutions to the currently bitter controversies about testing in American education. Many people simply dismiss these complexities, treating them as unimportant precisely because they seem technical and esoteric. I suspect this was part of the issue for the delegate from the business group I mentioned earlier. This proclivity to associate the arcane with the unimportant is both ludicrous and pernicious. When we are ill, most of us rely on medical treatments that reflect complex, esoteric knowledge of all manner of physiological and biochemical processes that very few of us understand well, if at all. Yet few of us would tell our doctors that their knowledge, or that of the biomedical researchers who designed the drugs we take, can’t possibly be important because, to our uninformed ears, it is arcane. Nor would we dismiss the arcane engineering that goes into modern aircraft control systems or, for that matter, the computers that control our cars. Ignoring the complexities of educational testing leads people to major misunderstandings about the performance of students, schools, and school systems. And while the consequences of these misunderstandings may not seem as dire as airplanes falling from the sky, they are serious enough for children, for their teachers, and for the nation, which relies for its welfare on a well-educated citizenry.


Koretz, Daniel. 2008. Measuring Up: What Educational Testing Really Tells Us. Cambridge, MA: Harvard University Press, pp. 7–14.