November 2012 | Volume 70 | Number 3
Teacher Evaluation: What's Fair? What's Effective? Pages 38-42
How can districts that are required to use value-added measures ensure that they do so responsibly?
The debate is polarized. Both sides are entrenched in their views. And schools are caught in the middle, having to implement evaluations using value-added measures whose practical value is unclear.
Value-added models are a specific type of growth model, a diverse family of statistical techniques that attempt to isolate a teacher's impact on his or her students' testing progress while controlling for other measurable factors, such as student and school characteristics, that are outside the teacher's control. Opponents, including many teachers, argue that value-added models are unreliable and invalid and have no business in teacher evaluations, especially high-stakes evaluations that guide employment and compensation decisions. Supporters, in stark contrast, assert that teacher evaluations are meaningful only if these measures are a heavily weighted component.
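In stylized form, most value-added models are regressions that predict each student's current test score from prior scores and background characteristics and attribute the remaining systematic difference to the teacher. A simplified specification (an illustration only; actual models vary considerably from state to state) is

$$ y_{it} = \alpha + \beta\, y_{i,t-1} + X_{it}\gamma + \theta_{j(i,t)} + \varepsilon_{it}, $$

where $y_{it}$ is student $i$'s score in year $t$, $y_{i,t-1}$ is the prior-year score, $X_{it}$ captures student and school characteristics, $\theta_{j(i,t)}$ is the estimated value added of the student's teacher, and $\varepsilon_{it}$ is unexplained error.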
Supporters and opponents alike draw on a large and growing body of research that spans three decades (see Lipscomb, Teh, Gill, Chiang, & Owens, 2010, for a policy-oriented review). But despite the confidence on both sides, there is virtually no empirical evidence as to whether using value-added or other growth models—the types of models being used vary from state to state—in high-stakes evaluations can improve teacher performance or student outcomes. The reason is simple: It has never really been tried before.
It will probably be several years before there is solid initial evidence on whether and how the various new evaluation systems work in practice. In the meantime, the existing research can and must inform the design and implementation of these systems.
Reliability and Validity Apply to All Measures
Critics of value-added measures make a powerful case that value-added estimates are unreliable. Depending on how much data are available and where you set the bar, a teacher could be classified as a "top" or "bottom" teacher because of a random statistical error (Schochet & Chiang, 2010). That same teacher could receive a significantly different rating the next year (Goldhaber & Hansen, 2008; McCaffrey, Sass, Lockwood, & Mihaly, 2009). It makes little sense, critics argue, to base hiring, firing, and compensation decisions on such imprecise estimates. There are also strong objections to value-added measures in terms of validity—that is, the degree to which they actually measure teacher performance. (For an accessible discussion of validity and reliability in value-added measures, see Harris, 2011.)

Value-added estimates are based exclusively on scores from standardized tests, which are of notoriously varying quality and are not necessarily suitable for measuring teacher effectiveness (Koretz, 2002). Moreover, different models can produce different results for the same teacher (Harris, Sass, & Semykina, 2010; McCaffrey, Lockwood, Koretz, & Hamilton, 2004), as can different tests plugged into the same model (Papay, 2011).
These are all important points, but the unfortunate truth is that virtually all measures can be subject to such criticism, including the one that value-added opponents tend to support—classroom observations. Observation scores can be similarly imprecise and unstable over time (Measures of Effective Teaching Project, 2012). Different protocols yield different results for the same teacher, as do different observers using the same protocol (Rockoff & Speroni, 2010).
As states put together new observation systems, most are attempting to address these issues. For instance, many are requiring that each teacher be evaluated multiple times every year by different observers.
The same cannot, however, be said about value-added estimates. Too often, states fail to address the potential problems with using these measures.
Four Research-Based Recommendations
It is easy to sympathize with educators who balk at having their fates decided in part by complex, seemingly imprecise statistical models that few understand. But it is not convincing to argue that value-added scores provide absolutely no useful information about teacher performance. There is some evidence that value-added scores can predict the future performance of a teacher's students (Gordon, Kane, & Staiger, 2006; Rockoff & Speroni, 2010) and that high value-added scores are associated with modest improvements in long-term student outcomes, such as earnings (Chetty, Friedman, & Rockoff, 2011). It is, however, equally unconvincing to assert that value-added data must be the dominant component in any meaningful evaluation system or that the value-added estimates are essential no matter how they are used (Baker et al., 2010).

By themselves, value-added data are neither good nor bad. It is how we use them that matters. There are basic steps that states and districts can take to minimize mistakes while still preserving the information the estimates provide. None of these recommendations are sexy or even necessarily controversial. Yet they are not all being sufficiently addressed in new evaluation systems.
Avoid mandating universally high weights for value-added measures.
There is no "correct" weight to give value-added measures within a teacher's overall evaluation score. At least, there isn't one that is supported by research. Yet many states are mandating evaluations that require a specific and relatively high weight (usually 35–50 percent). Some states do not specify a weight but employ a matrix by which different combinations of value-added scores, observations, and other components generate final ratings; in these systems, value-added scores still tend to be a driving component. Because there will be minimal variation between districts, there will be little opportunity to test whether outcomes differ for different designs.A more logical approach would be to set a lower minimum weight—say, 10–20 percent—and let districts experiment with going higher. Such variation could be useful in assessing whether and why different configurations lead to divergent results, and this information could then be used to make informed decisions about increasing or decreasing weights in the future.
Pay attention to all components of the evaluation.
No matter what the weight of value-added measures may be on paper, their actual importance will depend in no small part on the other components chosen and how they are scored. Consider an extreme hypothetical example: If an evaluation is composed of value-added data and observations, with each counting for 50 percent, and a time-strapped principal gives all teachers the same observation score, then value-added measures will determine 100 percent of the variation in teachers' final scores.

System designers must pay close attention to how raw value-added scores are converted into evaluation ratings and how those ratings are distributed in relation to other components. This attention is particularly important given that value-added models, unlike many other measures (such as observations), are designed to produce a spread of results—some teachers at the top, some at the bottom, and some in the middle. This imposed variability will increase the impact of value-added scores if other components do not produce much of a spread.
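To make the hypothetical concrete, here is a minimal sketch, with invented numbers rather than any state's actual formula, of a 50/50 composite in which the observation component has no spread; despite the nominal weights, all of the variation in the final ratings comes from value added.

```python
# Hypothetical illustration with invented numbers: a 50/50 composite in which
# observation scores carry no spread, so value added drives all of the
# variation in the final ratings.
import statistics

teachers = {
    "A": {"value_added": 0.30, "observation": 3.0},
    "B": {"value_added": -0.10, "observation": 3.0},  # identical observation scores
    "C": {"value_added": 0.05, "observation": 3.0},
}

def composite(scores, w_va=0.5, w_obs=0.5):
    """Weighted final rating using the nominal 'on paper' weights."""
    return w_va * scores["value_added"] + w_obs * scores["observation"]

finals = {name: composite(s) for name, s in teachers.items()}
print(finals)

# The observation term is constant, so every bit of spread in the final
# ratings comes from the value-added term, despite its 50 percent weight.
print("variance of final ratings:", statistics.pvariance(finals.values()))
print("variance of weighted VA:  ",
      statistics.pvariance(0.5 * s["value_added"] for s in teachers.values()))
```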
Some states and districts that have already determined scoring formulas do not seem to be paying much attention to this issue. They are instead relying on the easy way out. For example, they are converting scores to simplistic, seemingly arbitrary four- or five-category sorting schemes (perhaps based on percentile ranks) with little flexibility or guidance on how districts might calibrate the scoring to suit the other components they choose.
Don't ignore error—address it.
Although the existence of error in value-added data is discussed continually, there is almost never any discussion, let alone action, about whether and how to address it. There are different types of error, although they are often conflated.

Some of the imprecision associated with value-added measures is systematic. For example, there may be differences between students in different classes that are not measurable, and these differences may cause some teachers to receive lower (or higher) scores for reasons they cannot control (Rothstein, 2009).
In practice, systematic error is arguably no less important than random error—statistical noise due largely to small samples. Even a perfect value-added model would generate estimates with random error.
Think about the political polls cited almost every day on television and in newspapers. A poll might show a politician's approval rating at 60 percent, but there is usually a margin of error accompanying that estimate. In this case, let's say it is plus or minus four percentage points. Given this margin of error, we can be confident that the "true" rating is somewhere between 56 and 64 percent (though more likely closer to 60 than to 56 or 64). This range is called a confidence interval.
In polls, this confidence interval is usually relatively narrow because polling companies use very large samples, which reduces the chance that anomalies will influence the results. Classes, on the other hand, tend to be small—a few dozen students at most. Thus, value-added estimates—especially those based on one year of data, small classes, or both—are often subject to huge margins of error; 20 to 40 percentage points is not unusual (see, for example, Corcoran, 2010).
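The arithmetic behind the analogy is worth seeing once. For a simple proportion, the margin of error shrinks with the square root of the sample size, which is why a 1,000-person poll is precise and a 25-student class is not. The sketch below is illustrative only; value-added estimates are regression coefficients rather than proportions, but their precision scales with sample size in the same basic way.

```python
# Illustrative only: approximate 95 percent margin of error for an estimated
# proportion of 0.6 at poll-sized versus class-sized samples.
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of an approximate 95 percent confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (1000, 600, 100, 25):
    print(f"n = {n:>4}: +/- {100 * margin_of_error(0.6, n):.1f} percentage points")
# Roughly +/- 3 points at n = 1000, but nearly +/- 19 points at n = 25.
```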
If you were told that a politician's approval rating was 60 percent, plus or minus 30 percentage points, you would laugh off the statistic. You would know that it is foolish to draw any strong conclusions from a rating so imprecise. Yet this is exactly what states and districts are doing with value-added estimates. It is at least defensible to argue that these estimates, used in this manner, have no business driving high-stakes decisions.
There are relatively simple ways that states and districts can increase accuracy. One basic step would be to require that at least two or three years of data be accumulated for teachers before counting their value-added scores toward their evaluation (or, alternatively, varying the weight of value-added measures by sample size). Larger samples make for more precise estimates and have also been shown to mitigate some forms of systematic error (Koedel & Betts, 2011). Value-added estimates can also be adjusted ("shrunken") according to sample size, which can reduce the noise from random error (Ballou, Sanders, & Wright, 2004).
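The "shrinkage" adjustment mentioned above can be sketched in a few lines. In an empirical Bayes adjustment of the general kind the literature describes, a noisy estimate from a single small class is pulled toward the overall average, while an estimate built from several years of data moves much less. The variance figures below are invented for illustration; real systems estimate them from the data.

```python
# Minimal empirical Bayes "shrinkage" sketch. The signal and noise variances
# are invented for illustration; real systems estimate them from the data.
def shrink(raw_estimate, n_students, signal_var=0.04, noise_var=0.60, grand_mean=0.0):
    """Pull a raw value-added estimate toward the mean in proportion to its noise.
    Reliability approaches 1 as the sample grows, so large samples barely move."""
    reliability = signal_var / (signal_var + noise_var / n_students)
    return grand_mean + reliability * (raw_estimate - grand_mean)

print(shrink(0.30, n_students=22))   # one small class: pulled strongly toward 0
print(shrink(0.30, n_students=180))  # several years of classes: adjusted far less
```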
Second, even when sample sizes are larger, states and districts should directly account for the aforementioned confidence intervals. One of the advantages of value-added models is that, unlike with observations, you can actually measure some of the error in practice. Accounting for it does not, of course, ensure that the estimates are valid—that the models are measuring unbiased causal effects—but it at least means you will be interpreting the information you have in the best possible manner. The majority of states and districts are ignoring this basic requirement.
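One concrete way to account for the interval, offered here as a sketch of a possible policy rather than a prescription, is to label a teacher above or below average only when the interval around the estimate excludes the average.

```python
# A sketch of one possible policy (an assumption, not a prescription): treat a
# teacher as above or below average only when the approximate 95 percent
# confidence interval around the estimate excludes zero (the average).
def classify(estimate, std_error, z=1.96):
    lower, upper = estimate - z * std_error, estimate + z * std_error
    if lower > 0:
        return "above average"
    if upper < 0:
        return "below average"
    return "not distinguishable from average"

print(classify(0.15, std_error=0.12))  # wide interval: indistinguishable from average
print(classify(0.15, std_error=0.05))  # tighter interval: above average
```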
Continually monitor results and evaluate the evaluations.
This final recommendation may sound like a platitude in the era of test-based accountability, but it is too important to omit. States and districts that implement new systems must thoroughly analyze the results every single year. They need to check whether value-added estimates (or evaluation scores in general) vary systematically by student, school, or teacher characteristics; how value-added scores match up with the other components (see Jacob & Lefgren, 2008); and how sensitive final ratings are to changes in the weighting and scoring of the components. States also need to monitor how stakeholders, most notably teachers and administrators, are responding to the new systems.

Another important detail is the accuracy of the large administrative data sets used to calculate value-added scores. These data sets must be continually checked for errors (for example, in the correct linking of students with teachers), and teachers must have an opportunity to review their class rosters every year to ensure they are being evaluated for the progress of students they actually teach.
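A sketch of what such annual checks might look like follows. The table and column names are hypothetical, invented here for illustration, and the specific checks would depend on each system's design.

```python
# Hypothetical monitoring sketch; the file and column names are invented.
import pandas as pd

df = pd.read_csv("evaluation_results.csv")  # one row per teacher (hypothetical)

# 1. Do value-added scores track school poverty? A strong correlation is a red flag.
print(df["value_added"].corr(df["school_pct_free_lunch"]))

# 2. How well do value-added scores agree with observation ratings?
print(df["value_added"].corr(df["observation_score"]))

# 3. How sensitive are final ratings to the weight placed on value added?
for w in (0.2, 0.35, 0.5):
    df[f"final_w{int(w * 100)}"] = (w * df["value_added_scaled"]
                                    + (1 - w) * df["observation_scaled"])
print(df.filter(like="final_w").corr())
```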
Finally, each state should arrange for a thorough, long-term, independent research evaluation of new systems, starting right at the outset. There are few prospects more disturbing than the idea of making drastic, sweeping changes in how teachers are evaluated but never knowing how these changes have worked out.
All these exercises should be accompanied by a clear path to making changes based on the results. It is difficult to assess the degree to which states and districts are fulfilling this recommendation. No doubt all of them are performing some of these analyses and would do more if they had the capacity.
If We Do This, Let's Do It Right
Test-based teacher evaluations are probably the most controversial issue in U.S. education policy today. In the public debate, both sides have focused almost exclusively on whether to include value-added measures in new evaluation systems. Supporters of value-added scoring say it should dominate evaluations, whereas opponents say it has no legitimate role at all. It is as much of a mistake to use value-added estimates carelessly as it is to refuse to consider them at all.

Error is inevitable, no matter which measures you use and how you use them. But responsible policymakers will do what they can to mitigate imprecision while preserving the information the measures transmit. It is not surprising that many states and districts have neglected some of these steps. They were already facing budget cuts and strained capacity before having to design and implement new teacher evaluations in a short time frame. This was an extremely difficult task.
Luckily, in many places, there is still time. Let's use that time wisely.
EL Online: For another perspective on the use of value-added data, see the online-only article "Value-Added: The Emperor with No Clothes" by Stephen J. Caldas.
References
Baker, E., Barton, P., Darling-Hammond, L., Haertel, E., Ladd, H., Linn, R., et al. (2010). Problems with the use of student test scores to evaluate teachers (Briefing paper 278). Washington, DC: Economic Policy Institute.
Ballou, D., Sanders, W., & Wright, P. (2004). Controlling for student background in value-added assessment of teachers. Journal of Educational and Behavioral Statistics, 29(1), 37–65.
Chetty, R., Friedman, J., & Rockoff, J. (2011). The long-term impacts of teachers: Teacher value-added and student outcomes in adulthood (NBER Working Paper 17699). Washington, DC: National Bureau of Economic Research.
Corcoran, S. (2010). Can teachers be evaluated by their students' test scores? Should they be? The use of value-added measures of teacher effectiveness in policy and practice. New York: Annenberg Institute.
Goldhaber, D., & Hansen, M. (2008). Is it just a bad class? Assessing the stability of measured teacher performance (Working Paper 2008-5). Denver, CO: Center for Reinventing Public Education.
Gordon, R., Kane, T., & Staiger, D. (2006). Identifying effective teachers using performance on the job. Washington, DC: Brookings Institution.
Harris, D. (2011). Value-added measures in education: What every educator needs to know. Cambridge, MA: Harvard Education Press.
Harris, D., Sass, T., & Semykina, A. (2010). Value-added models and the measurement of teacher productivity (CALDER Working Paper 54). Washington, DC: Center for Analysis of Longitudinal Data in Education Research.
Jacob, B. A., & Lefgren, L. (2008). Can
principals identify effective teachers? Evidence on subjective
performance evaluation in education. Journal of Labor Economics, 25(1), 101–136.
Koedel, C., & Betts, J. (2011). Does student
sorting invalidate value-added models of teacher effectiveness? An
extended analysis of the Rothstein critique. Education Finance and Policy, 6(1), 18–42.
Koretz, D. (2002). Limitations in the use of achievement tests as measures of educators' productivity. Journal of Human Resources, 37(4), 752–777.
Lipscomb, S., Teh, B., Gill, B., Chiang, H., & Owens, A. (2010). Teacher and principal value-added: Research findings and implementation practices. Washington, DC: Mathematica Policy Research.
McCaffrey, D., Lockwood, J. R., Koretz, D., & Hamilton, L. (2004). Evaluating value-added models for teacher accountability. Santa Monica, CA: RAND Corporation.
McCaffrey, D., Sass, T., Lockwood, J. R., & Mihaly, K. (2009). The intertemporal stability of teacher effects. Education Finance and Policy, 4(4), 572–606.
Measures of Effective Teaching Project. (2012). Gathering feedback for teaching: Combining high-quality observation with student surveys and achievement gains (MET Project Research Paper). Seattle, WA: Bill and Melinda Gates Foundation.
Papay, J. (2011). Different tests, different answers: The stability of teacher value-added estimates across outcome measures. American Educational Research Journal, 48(1), 163–193.
Rockoff, J., & Speroni, C. (2010). Subjective and objective evaluations of teacher effectiveness. American Economic Review, 100(2), 261–266.
Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables. Education Finance and Policy, 4(4), 537–571.
Schochet, P., & Chiang, H. (2010). Error rates in measuring teacher and school performance based on student test score gains
(NCEE 2010-4004). Washington, DC: National Center for Education
Evaluation and Regional Assistance, U.S. Department of Education.
Matthew Di Carlo is a senior fellow at the Albert Shanker Institute in Washington, DC.
Copyright © 2012 by ASCD