Saturday, August 19, 2006

Charles Murray: What's in a test?

Sunday, August 13, 2006

What's in a test? by Charles Murray, DALLAS MORNING NEWS

Deception and denial, when it comes to No Child Left Behind

Test scores are the last refuge of the No Child Left Behind Act. They have to be, because so little else about the act is attractive.

NCLB takes a giant step toward nationalizing elementary and secondary education, a disaster for federalism.

It pushes classrooms toward relentless drilling, not something that inspires able people to become teachers or makes children eager to learn. It holds good students hostage to the performance of the least talented at a time when the economic future of the country depends more than ever on the performance of the most talented.

The one aspect of the act that could have inspired enthusiasm from me, promoting school choice, has fallen far short of its hopes. The only way to justify NCLB is through compelling evidence that test scores are improving.

So let's talk about test scores.

The case that NCLB has failed to raise test scores has been made most comprehensively in a report from the Civil Rights Project at Harvard University, released just a few weeks ago. The Civil Rights Project has an openly liberal political agenda, but the author of the report, Jaekyung Lee, lays out the data in graphs that anyone can follow, subjects them to appropriate statistical analyses and arrives at conclusions that can stand on their scholarly merits: NCLB has not had a significant impact on overall test scores and has not narrowed the racial and socioeconomic achievement gap.

Is it too early to tell? As a parent who has had children in public schools since NCLB began, I don't think so. The Frederick County, Md., schools our children have attended have turned themselves inside out to try to produce the right test results, with dismaying effects on the content of classroom instruction and devastating effects on teacher morale. We actually lost our best English teacher to the effects of high-stakes testing. "I want to teach my students how to write," he said, "not teach them how to pass a test that says they can write." He quit.

So, yes, I think that if we parents have had to put up with these kinds of troubling effects on our children's schooling for four years, we are entitled to expect evidence of results. After all, "accountability" is NCLB's favorite word, and the Department of Education is holding school systems accountable for improvements in test scores with a vengeance. Sauce for the goose, sauce for the gander.

The Department of Education will undoubtedly produce numbers to dispute the findings of the Civil Rights Project, which brings me to the point of this essay. Those numbers will consist largely of pass percentages, not mean scores. A particular score is deemed to separate "proficient" from "not proficient." Reach that score, and you've passed the test. If 60 percent of one group – blondes, let's say – pass while only 50 percent of redheads pass, then the blonde-redhead gap is 10 percentage points.

A pass percentage is a bad standard for educational progress. Conceptually, "proficiency" has no objective meaning that lends itself to a cutoff. Administratively, the NCLB penalties for failure to make adequate progress give the states powerful incentives to make progress as easy to show as possible. A pass percentage throws away valuable information, telling you whether someone got over a bar, but not how high the bar was set or by how much the bar was cleared. Most important: If you are trying to measure progress in closing group differences, a comparison of changes in pass percentages is inherently misleading.

Take the case of Texas, from which George Bush acquired his faith in NCLB. As the president described it to the Urban League in 2003: "In my state, Texas, 73 percent of the white students passed the math test in 1994, while only 38 percent of African-American students passed it. So we made that the point of reference. We had people focused on the results for the first time – not process, but results. And because teachers rose to the challenge, because the problem became clear, that gap has now closed to 10 points." President Bush's numbers are accurately stated. They are also meaningless.

Any test that meets ordinary standards produces an approximation of what statisticians call a "normal distribution" of scores – a bell curve – because achievement in any open-ended skill such as reading comprehension or mathematics really is more or less normally distributed. The tests that produce anything except a bell curve are usually ones so simple that large proportions of students get every item correct. They hide the underlying normal distribution but don't change it.

Thus point No.1, that using easy tests and discussing results in terms of pass percentages obscure a reality that NCLB seems bent on denying: All the children cannot be above average. They cannot all even be proficient, if "proficient" is defined legitimately. Some children do not have the necessary skills.

Point No.2 goes to the inherent distortions introduced by the use of pass percentages: Because of the underlying normal distribution, a gain in a given number of points has varying effects on group differences depending on where the gain falls.

To illustrate point No.2, consider a test that has a hundred-point scale with a mean of 50 and a standard deviation of 15 (the standard deviation, a measure of the variability of the scores, tells you how tall and skinny or how short and broad the bell curve will be). How many students are involved when a range of, say, 10 points is at issue? The shaded areas in Figure 1 show two possibilities.

The total area under the bell curve includes all the students. The shaded area on the left includes all those with a score of 40 to 49 points – 24.8 percent of all students, if the distribution is perfectly normal. The shaded area on the right includes all those with a score of 80 to 89 points – just 1.9 percent of all students. Suppose we are still comparing redheads and blondes. If the mean score of redheads goes from 40 to 50, it has risen all the way from the 25th to the 50th percentile of all students. If the blonde mean goes from 80 to 90, it has moved merely from the 98th to the 99th percentile of all students.
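The figures in the paragraph above can be checked with a few lines of Python. This is a minimal sketch, assuming scores are perfectly normal with a mean of 50 and a standard deviation of 15, and treating each range as the continuous interval between its round endpoints (40 to 50, 80 to 90). The helper name `share_between` is illustrative and does not come from the essay.

```python
from statistics import NormalDist

# Assumed test scale from the essay: mean 50, standard deviation 15.
scores = NormalDist(mu=50, sigma=15)

def share_between(low, high):
    """Fraction of students scoring between low and high (illustrative helper)."""
    return scores.cdf(high) - scores.cdf(low)

left_band = share_between(40, 50)   # roughly a quarter of all students
right_band = share_between(80, 90)  # under 2 percent of all students

print(f"40-50 band: {left_band:.1%}")
print(f"80-90 band: {right_band:.1%}")

# Percentile rank of each score mentioned in the text.
for s in (40, 50, 80, 90):
    print(f"score {s}: percentile {scores.cdf(s):.1%}")
```

The same 10-point gain covers about a quarter of all students near the middle of the curve and fewer than one in fifty near the top, which is the asymmetry the argument relies on.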

You do not have to be a statistician to see that these built-in features of normally distributed scores – gains that are equal in points are not equal in the number of students they affect or in the percentile distances that students move – complicate the use of pass percentages when comparing groups.

If you want to get deeper into the math, you may visit a quirky and provocative Web site run by someone who calls himself La Griffe du Lion. I surmise that he is an established scholar – a quantitative discipline seems likely – who once published on the fraught topic of group differences, learned how unpleasant and even professionally perilous that can be and decided to remain anonymous henceforth. In any case, his technical skills are first rate. Click on the topic line entitled "Closing the Racial Learning Gap" for a much more detailed version of the argument and data that I am presenting here.

For our purposes, you need know only this: If the real difference between two groups, measured as it should be with means and standard deviations, remains constant, the size of the pass-percentage gap between two groups changes nonlinearly in a mathematically inevitable way. In other words, if there really is a constant, meaningful difference between groups, you can generate a curve that predicts how the point gap will change as tests are made easier or harder or as students become more or less competent. La Griffe has done this, and his curve fits the Texas data almost perfectly. In Figure 2, the white pass rate is used as the basis for predicting the size of the white-black gap. The circles represent the observed sizes of the test score gap from 1994 to 2002.
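A curve of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reproduction of La Griffe's calculation: both groups are assumed to be normally distributed with the same standard deviation, and the constant difference between group means is backed out from the 1994 pass rates quoted earlier (73 percent and 38 percent). The function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal: mean 0, standard deviation 1

def implied_difference(pass_a, pass_b):
    """Difference between group means, in SD units, implied by two pass
    rates on the same cutoff (assumes equal SDs; an illustrative helper)."""
    return STD.inv_cdf(pass_a) - STD.inv_cdf(pass_b)

def predicted_gap(pass_a, diff):
    """Pass-rate gap (group A minus group B) when A's pass rate is pass_a
    and the group means stay diff standard deviations apart."""
    cutoff = -STD.inv_cdf(pass_a)        # cutoff in z-units from A's mean
    pass_b = 1 - STD.cdf(cutoff + diff)  # B's mean sits diff SDs lower
    return pass_a - pass_b

# Difference implied by the quoted 1994 figures.
diff = implied_difference(0.73, 0.38)
print(f"implied difference: {diff:.2f} SD")

# Hold that difference fixed and vary how many in group A pass.
for pass_a in (0.40, 0.60, 0.73, 0.85, 0.95, 0.98):
    gap = predicted_gap(pass_a, diff)
    print(f"group A pass rate {pass_a:.0%}: gap {gap * 100:.1f} points")
```

Under these assumptions the gap in percentage points first widens and then shrinks as the pass rate climbs, even though the underlying difference between the groups never changes.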

Test scores in Texas went up for both blacks and whites. Maybe that's good news, representing real gains in learning for everyone, or maybe it's not so good, representing the effects of teaching to the test. The data Texas reports do not permit a judgment. But the black gains are almost exactly what would be predicted if the magnitude of the underlying black-white difference remained unchanged. If there really was closure of the gap, all that Texas has to do is release the group means, as well as information about the black and white distributions of scores, and it will be easy to measure it. Whatever the real closure may be, however, it cannot come close to the dramatic reduction that Mr. Bush found in the difference between black and white pass rates.

In this instance, the percentage-passed measure misleadingly showed a huge reduction in the black-white achievement gap. But look at the left-hand side of the curve. In a state that imposes tough standards – for example, one that establishes a threshold that only 40 percent of whites pass – across-the-board improvements in scores can misleadingly show an increase in the white-black achievement gap when none occurred.

Question: Doesn't this mean that the same set of scores could be made to show a rising or falling group difference just by changing the definition of a passing score? Answer: Yes.
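That answer can be illustrated with made-up numbers. The sketch below assumes two normally distributed groups with means of 55 and 45 and a shared standard deviation of 15, gives every student the same 5-point gain, and then measures the pass-rate gap under a tough cutoff (65) and an easy one (35). All of these values and function names are hypothetical choices for illustration, not figures from any state's data.

```python
from statistics import NormalDist

SD = 15  # assumed common standard deviation for both groups

def pass_rate(mean, cutoff):
    """Share of a normally distributed group scoring above the cutoff."""
    return 1 - NormalDist(mu=mean, sigma=SD).cdf(cutoff)

def gap(mean_a, mean_b, cutoff):
    """Pass-rate gap between group A and group B at a given cutoff."""
    return pass_rate(mean_a, cutoff) - pass_rate(mean_b, cutoff)

def gap_change(cutoff, gain=5, mean_a=55, mean_b=45):
    """Gap before and after both groups gain the same number of points."""
    before = gap(mean_a, mean_b, cutoff)
    after = gap(mean_a + gain, mean_b + gain, cutoff)
    return before, after

for label, cutoff in (("tough", 65), ("easy", 35)):
    before, after = gap_change(cutoff)
    print(f"{label} cutoff {cutoff}: gap {before * 100:.1f} -> "
          f"{after * 100:.1f} points")
```

The difference between the group means is 10 points before and after the gain, yet the pass-rate gap widens under the tough cutoff and narrows under the easy one.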

At stake is not some arcane statistical nuance. The federal government is doling out rewards and penalties to school systems across the country based on changes in pass percentages. It is an uninformative measure for many reasons, but when it comes to measuring one of the central outcomes sought by No Child Left Behind, the closure of the achievement gap that separates poor students from rich, Latino from white and black from white, the measure is beyond uninformative. It is deceptive.

Charles Murray, W.H. Brady Scholar at the American Enterprise Institute, is the author, most recently, of "In Our Hands: A Plan to Replace the Welfare State." This essay first appeared in The Wall Street Journal.
