May 2005 | Volume 62 | Number 8
Supporting New Educators | Pages 79-81
All About Accountability / NAEP: Gold Standard or Fool's Gold?
W. James Popham
Since 1969, the National Assessment of Educational Progress (NAEP) has periodically tested samples of U.S. students to determine their skills and knowledge in major subject areas. NAEP designers call their assessment program The Nation's Report Card, and, because of the substantial psychometric sophistication they lavish on its tests, regard it as the “gold standard” of education measurement.
Recently, state-by-state NAEP results have been reported alongside the results from the state accountability tests required by No Child Left Behind (NCLB).1 With rare exceptions, far fewer students score “at or above proficient” on the NAEP tests than on the state accountability tests. The national test's results are widely viewed as more credible: People believe that the NAEP is objective and rigorous, whereas state NCLB tests can be soft and self-serving.
I can best explain the flaws in this thinking through a modern-day parable. Imagine that two federal health-related programs have recently been installed in U.S. public schools: One focuses on promoting students' physical conditioning (let's call it “Fitness Forever”), and one focuses on students' auditory ability (let's go with “Hearing Health”). For the Fitness Forever initiative, each state measures every student's physical fitness using its own state-chosen assessments and arrives at its own definition of whether a student is normal, above normal, or below normal. Annually, every state must report the percentage of its public school students who display normal-or-above physical fitness.
In contrast, the Hearing Health program evaluates each state's success according to a nationally collected, sample-based assessment of students' hearing using identical, standardized audiometric tests. On the basis of a prestigious national panel's definitions of normal, above normal, and below normal, federal authorities annually report state-by-state percentages of students whose hearing is classified as normal-or-above.
If state-by-state results on the two assessments were published side by side, would it make sense to use the scores on the nationally administered hearing test to confirm the validity of a state's results on its state-determined fitness test? Of course not! The fitness and hearing tests were designed to satisfy different measurement functions.
The moral of this parable should be apparent: It is absurd to confirm the legitimacy of results derived from one test by using results based on a markedly different test. But, the reader may protest, don't state accountability tests and NAEP tests have the same measurement function: namely, to assess students' math and reading achievement levels? Actually, no.
State accountability tests are designed to detect instructional improvements in a state's public schools. Indeed, NCLB gave each state's educators 12 years to get their students to earn proficient-or-better test scores, stipulating that each public school must make adequate yearly progress in attaining that lofty goal. If a state's NCLB tests are not sensitive enough to detect instructional improvement, then the cornerstone of NCLB's accountability strategy crumbles.
NAEP, on the other hand, was originally designed to serve as an objective, nonpartisan mechanism for monitoring long-term student achievement trends in the United States. NAEP was never supposed to play a role in enhancing the learning of U.S. students. In fact, for more than three decades, those who govern NAEP have unrelentingly resisted any serious effort to make their “gold standard” assessments contribute to improving the instruction taking place in the nation's classrooms. NAEP officials are, above all, devoted to the accuracy of NAEP-based longitudinal analyses. Making NAEP truly sensitive to changes in instruction would be antithetical to the thinking of those who run NAEP.
Yes, NAEP provides educators with the curricular frameworks on which its assessments are based. And in my view, those frameworks are first-rate. But during the nuts-and-bolts creation of NAEP tests, developers take care not to create assessments apt to be meaningfully influenced by instructional interventions. Such instructional sensitivity would diminish the measurement bliss that longitudinal analyses can provide to NAEP psychometricians. Looking back at NAEP results over the years, you'll rarely see anything other than fairly minor fluctuations in performance. So, if you're seeking systematic evidence of long-term instructional progress, don't look at NAEP scores. A more accurate label for this costly federal program would be the National Assessment of Educational No-Progress.
Should the public and education policymakers regard NAEP results as “fool's gold”? Of course not. NAEP has carved out its own distinctive measurement mission. But a test that fulfills one measurement function should not be employed to validate a test that fulfills a different function. When the results of this “gold standard” test are invoked in an attempt to judge the accuracy and credibility of state-level NCLB test results, then the people doing the judging are the fools.
1 For example, see Skinner, R. A. (2005, Jan. 6). State of the states. Education Week, pp. 77–78, 80.
W. James Popham is Emeritus Professor in the UCLA Graduate School of Education and Information Studies; firstname.lastname@example.org.