Gould on the Mismeasure of Man

Harvard biologist Stephen Jay Gould discusses the modern origins of standardised intelligence tests.

Lewis M. Terman, the twelfth child in an Indiana farm family of fourteen, traced his interest in the study of intelligence to an itinerant book peddler and phrenologist who visited his home when he was nine or ten and predicted good things after feeling the bumps on his skull. Terman pursued this early interest, never doubting that a measurable mental worth lay inside people’s heads. In his doctoral dissertation of 1906, Terman examined seven “bright” and seven “stupid” boys and defended each of his tests as a measure of intelligence by appealing to the standard catalogue of racial and national stereotypes. Of tests for invention, he wrote: “We have only to compare the negro with the Eskimo or Indian, and the Australian native with the Anglo-Saxon, to be struck by an apparent kinship between general intellectual and inventive ability” (1906, p. 14). Of mathematical ability, he proclaimed (1906, p. 29): “Ethnology shows that racial progress has been closely paralleled by development of the ability to deal with mathematical concepts and relations.”…

Goddard introduced Binet’s scale to America, but Terman was the primary architect of its popularity. Binet’s last version of 1911 included fifty-four tasks, graded from prenursery to mid-teen-age years. Terman’s first revision of 1916 extended the scale to “superior adults” and increased the number of tasks to ninety. Terman, by then a professor at Stanford University, gave his revision a name that has become part of our century’s vocabulary—the Stanford-Binet, the standard for virtually all “IQ” tests that followed. …

But Terman’s major influence did not reside in his sharpening or extension of the Binet scale. Binet’s tasks had to be administered by a trained tester working with one child at a time. They could not be used as instruments for general ranking. But Terman wished to test everybody, for he hoped to establish a gradation of innate ability that could sort all children into their proper stations in life:

What pupils shall be tested? The answer is, all. If only selected children are tested, many of the cases most in need of adjustment will be overlooked. The purpose of the tests is to tell us what we do not already know, and it would be a mistake to test only those pupils who are recognized as obviously below or above average. Some of the biggest surprises are encountered in testing those who have been looked upon as close to average in ability. Universal testing is fully warranted (1923, p. 22).

The Stanford-Binet, like its parent, remained a test for individuals, but it became the paradigm for virtually all the written versions that followed. By careful juggling and elimination, Terman standardized the scale so that “average” children would score 100 at each age (mental age equal to chronological age). Terman also screened out the variation among children by establishing a standard deviation of 15 or 16 points at each chronological age. With its mean of 100 and standard deviation of 15, the Stanford-Binet became (and in many respects remains to this day) the primary criterion for judging a plethora of mass-marketed written tests that followed. The invalid argument runs: we know that the Stanford-Binet measures intelligence; therefore, any written test that correlates strongly with Stanford-Binet also measures intelligence. Much of the elaborate statistical work performed by testers during the past fifty years provides no independent confirmation for the proposition that tests measure intelligence, but merely establishes correlation with a preconceived and unquestioned standard.

Testing soon became a multimillion-dollar industry; marketing companies dared not take a chance with tests not proven by their correlation with Terman’s standard. The Army Alpha (see pp. 192-222) initiated mass testing, but a flood of competitors greeted school administrators within a few years after the war’s end. A quick glance at the advertisements appended to Terman’s later book (1923) illustrates, dramatically and unintentionally, how all Terman’s cautious words about careful and lengthy assessment (1919, p. 299, for example) could evaporate before strictures of cost and time when his desire to test all children became a reality (Fig. 5.3). Thirty minutes and five tests might mark a child for life, if schools adopted the following examination, advertised in Terman 1923, and constructed by a committee that included Thorndike, Yerkes, and Terman himself. …

I believe that the conditions of testing, and the basic character of the examination, make it ludicrous to believe that Beta measured any internal state deserving the label intelligence. Despite the plea for geniality, the examination was conducted in an almost frantic rush. Most parts could not be finished in the time allotted, but recruits were not forewarned. My students compiled the following record of completions on the seven parts (see p. 212). For two of the tests, digit symbols and number checking (4 and 5), most students simply couldn’t write fast enough to complete the ninety and fifty items, even though the protocol was clear to all. The third test with a majority of incompletes, cube counting (number 2), was too difficult for the number of items included and the time allotted.

Gould, Stephen Jay. 1981. The Mismeasure of Man. New York: Norton. pp. 174-7, 210-1.