Sunday, November 21, 2010

Is D.C.'s teacher evaluation system rigged?

By Valerie Strauss | Washington Post
November 1, 2010

My guest is Aaron Pallas, professor of sociology and education at Teachers College, Columbia University. Pallas writes the Sociological Eye on Education blog for The Hechinger Report, a nonprofit, nonpartisan education-news outlet affiliated with the Hechinger Institute on Education and the Media.

By Aaron Pallas
Two old jokes about doctors and medical school:
Joke #1: 50 percent of all doctors finish in the bottom half of their medical school class.

Joke #2: Q: What do you call the person who finishes last in his or her medical school class? A: “Doctor.”

Why do we laugh at these jokes? (At least, the first time we hear them?) Because there’s an incongruity in the idea of very high achievers (as most medical students are) being portrayed as low achievers.

It’s all relative, of course. The 50 percent of the doctors who finish in the bottom half of the class at the Johns Hopkins University School of Medicine, arguably the top medical school in the country, are not low achievers. The school accepts only 5 percent of its applicants; the average undergraduate GPA of admitted students is 3.85; and the average MCAT composite score is 35, in the top 5 percent of test-takers nationally.

In contrast, the average GPA of admitted students at Lake Erie College of Osteopathic Medicine (LECOM) is about 3.4, and the average MCAT composite score is 26, around the 50th percentile of test-takers nationally. Not low achievers, to be sure, but not in the same league as those at Hopkins.

If we looked at the bottom 10 percent of Hopkins medical students, they might well exceed the achievements of 80 percent of the students at LECOM. Where do we draw the line to label a student as a “low achiever”?

That’s one of the problems with value-added measures of teacher performance, which are in the news again as New York City considers releasing its teachers’ value-added scores. (The teachers’ union has sued to stop their release.) Value-added measures are based on ranking teachers against one another. They are relative measures, in the sense that a teacher’s ranking depends entirely on the performance of other teachers.

We can contrast value-added measures with the other main thrust in the development of new teacher-evaluation systems, classroom observations by raters trained to evaluate teachers’ practices according to a clear set of criteria and standards. Such classroom observations are an absolute measure; the rating of a teacher is against a fixed yardstick of what constitutes good practice, not against other teachers.

In a value-added measure, there are winners and losers, since the technique demands that some teachers appear more effective than others. In contrast, rating teachers according to an observational protocol could result in all teachers being rated “effective.”

Although many new teacher-evaluation systems join value-added measures of teachers’ contributions to students’ test scores with classroom observations of teachers’ practices, the architects of these systems have not thought through the implications of combining relative measures of teacher performance (e.g., value-added measures) with absolute measures of teacher performance (e.g., classroom observations). Or perhaps they have thought this through; if so, I don’t like their thinking.

Many readers will be familiar with the IMPACT teacher evaluation system piloted by the Washington, D.C. public schools in the 2009-2010 school year. I say “piloted” because the system was brand-new, though a pilot usually implies a low-stakes trial run.

The first year of IMPACT had real consequences. In July, outgoing D.C. Schools Chancellor Michelle Rhee fired 165 teachers on the basis of their scores on the IMPACT evaluation, which generated a score for each teacher on a scale from 100 to 400 points.

(It’s really a scale from 1 to 4, but the components are multiplied by the weight they get in the overall IMPACT score calculation.)

Teachers whose overall score was between 350 points and 400 points were classified as “highly effective”; those whose scores were between 250 points and 350 points were labeled “effective”; teachers with scores between 175 and 250 points were rated “minimally effective”; and teachers who scored between 100 and 175 points were labeled “ineffective.” Teachers in the ineffective category were subject to immediate dismissal. Any teacher rated minimally effective two years in a row is also subject to immediate termination.
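The bands above amount to a simple lookup. Here is a minimal sketch in Python; the handling of scores that fall exactly on a boundary is my assumption, since the ranges as described overlap at their endpoints:

```python
def impact_rating(score):
    """Map a 100-400 IMPACT score to its rating band.

    Boundary handling (>= vs >) is a guess; the published ranges
    overlap at 175, 250, and 350 points.
    """
    if score >= 350:
        return "highly effective"
    if score >= 250:
        return "effective"
    if score >= 175:
        return "minimally effective"
    return "ineffective"
```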

The components of the IMPACT system varied for different teachers in D.C. For general-education teachers in grades four through eight, 50 percent of their IMPACT score was based on an individual value-added (IVA) score comparing the performance of a teacher’s students on the standardized 2010 D.C. Comprehensive Assessment System (CAS) to that of other teachers whose students were deemed similar at the start of the 2009-10 school year.

(Computationally, the teacher’s final value-added score ranging from 1 to 4 was multiplied by 50.)

An additional 40 percent of the score was derived from five classroom observations carried out by the school principal and “master educators” hired by the district, in which teachers were rated against a Teaching and Learning Framework with nine different dimensions (e.g., delivering content clearly or engaging all students in learning). Five percent of the overall score was based on the principal’s rating of the teacher’s commitment to the school community, and the final five percent on a school value-added score estimating the academic growth of all students in the school in reading and math from 2009 to 2010.
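Putting the Group 1 weights together, the overall score is just a weighted sum of 1-to-4 component ratings, scaled by 100. A minimal sketch (the component names are mine, and DCPS has not published its exact formula):

```python
# Group 1 weights as described in the article; the dictionary keys are
# illustrative labels, not DCPS's official component names.
WEIGHTS = {
    "individual_value_added": 0.50,
    "classroom_observations": 0.40,
    "commitment_to_school": 0.05,
    "school_value_added": 0.05,
}

def impact_score(ratings):
    """Combine component ratings (each on a 1.0-4.0 scale) into a
    100-400 overall score; rounded to one decimal for reporting."""
    return round(100 * sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS), 1)
```

A teacher rated 3.0 on every component would score 300, squarely “effective”; one rated 2.4 across the board would score 240 and land in the “minimally effective” band.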

We still don’t know anything about the properties of the individual value-added calculations that make up 50 percent of the evaluation for Group 1 teachers in grades four through eight in D.C. Although some teachers were fired in July on the basis of their value-added scores, DCPS has not released the technical report prepared by its contractor, Mathematica Policy Research, detailing the method. (An official told me recently that the technical report is expected to be released later this fall.) But the information that is available raises serious questions about the IMPACT framework.

The IMPACT reports that were provided to Group 1 teachers calculate what is described as a “raw” value-added score in reading and math that represents whether a teacher did better in raising students’ scores from 2009 to 2010 than other teachers did with similar students. (That’s not actually how the value-added calculations work, because the scores for different grades aren’t directly comparable, but it’s what D.C. reported to its teachers.)

For example, a teacher whose students scored 3 points higher, on average, on the 2010 D.C. CAS assessment than similar students taught by other teachers would have a raw value-added score of +3. A teacher whose students scored 4 points lower, on average, than similar students taught by other teachers would have a raw value-added score of -4. These raw value-added scores were then converted into a final value-added score.

Value-added measures create winners and losers. By definition, if there are teachers who are doing better than average, there will be other teachers who are doing worse than average. The average raw value-added score for teachers is 0, and a D.C. official confirmed that 50 percent of the teachers have scores greater than 0, and 50 percent have scores less than 0.

But the conversion of the raw value-added scores to a final value-added score involves more than just tinkering with numbers, because the final value-added score is what places teachers at risk of being labeled ineffective or minimally effective, and hence at risk of being fired. This is a value judgment, not a matter of statistics. How did the DC IMPACT system determine what value-added score represents effective or ineffective teaching?

The table converting raw value-added scores to the final value-added score tells the story. In reading, teachers whose raw value-added scores were between -.2 and .1 received a final score of 2.5. Raw value-added scores below -.2 mapped to final scores below 2.5, bottoming out at 1.0 for raw scores of -5.9 or below.

Conversely, raw value-added scores greater than .1 had final scores greater than 2.5, ranging up to a final score of 4.0 for teachers whose raw score was 5.8 or above.
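Only the anchor points of the conversion table are described here, but its overall shape can be sketched as a piecewise-linear map. The straight-line interpolation between anchors is my assumption; DCPS has not released the full table:

```python
def final_reading_score(raw):
    """Sketch of the raw-to-final conversion for reading, using only the
    anchor points reported in the article (-5.9 -> 1.0, the -.2 to .1
    band -> 2.5, 5.8 -> 4.0). Linear interpolation between anchors is
    hypothetical; the actual DCPS table may differ."""
    if raw <= -5.9:
        return 1.0
    if raw < -0.2:
        # interpolate between (-5.9, 1.0) and (-0.2, 2.5)
        return 1.0 + (raw + 5.9) * (2.5 - 1.0) / (5.9 - 0.2)
    if raw <= 0.1:
        return 2.5
    if raw < 5.8:
        # interpolate between (0.1, 2.5) and (5.8, 4.0)
        return 2.5 + (raw - 0.1) * (4.0 - 2.5) / (5.8 - 0.1)
    return 4.0
```

Whatever the exact interpolation, the key feature is fixed by the anchors: a raw score of 0, the average, maps to exactly 2.5, the boundary of the danger zone.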

What this means is that 50 percent of all teachers received a final value-added score below 2.5, and 50 percent received a final score greater than 2.5. But what’s the meaning of a 2.5? Recall that the overall IMPACT score defines a teacher who averages lower than 2.5 on the various IMPACT components as minimally effective or ineffective.

And a teacher who scores in this range two years in a row is subject to immediate termination.

So here it is: by definition, the value-added component of the D.C. IMPACT evaluation system defines 50 percent of all teachers in grades four through eight as ineffective or minimally effective in influencing their students’ learning. And given the imprecision of the value-added scores, just by chance some teachers will be categorized as ineffective or minimally effective two years in a row. The system is rigged to label teachers as ineffective or minimally effective as a precursor to firing them.
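The “two years in a row by chance” point is easy to quantify in a toy model. If a teacher's measured value-added were pure noise around the average (an illustrative assumption, not a claim about the actual score distribution), half of all teachers would fall below the 2.5 cutoff each year, and about a quarter would do so two years running:

```python
import random

random.seed(42)

def below_average_two_years():
    # Two independent yearly draws of pure noise; a draw below 0
    # corresponds to a final value-added score under 2.5.
    return all(random.gauss(0, 1) < 0 for _ in range(2))

trials = 100_000
rate = sum(below_average_two_years() for _ in range(trials)) / trials
```

Under these assumptions `rate` comes out near 0.25: even a perfectly average teacher would face roughly one-in-four odds of landing below the cutoff twice in a row on noise alone. Real scores contain signal as well as noise, so the figure would differ for genuinely strong or weak teachers, but two consecutive low rankings are weaker evidence than they appear.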

The pendulum has simply swung from one end of absurdity to the other; if many systems have historically rated 99 percent of teachers “satisfactory,” as documented in “The Widget Effect,” we now have in D.C. a system that declares exactly 50 percent of teachers ineffective or minimally effective.

Not all of these teachers will be terminated, although if their entire evaluation were based on value-added measures, they could be. Ironically, the classroom observation ratings assigned by D.C. principals and “master educators” are, on average, higher than the value-added scores that teachers in grades four through eight received. So it’s the fact that observers generally judge teachers’ practices to be effective that offsets the risk of dismissal on the basis of value-added scores. These observations are based on holding up a teacher’s performance to an absolute yardstick of what constitutes good practice, rather than comparing teachers to one another.

We still don’t know very much about the properties of these classroom observations, however, and it’s certainly worrisome when a school superintendent signals to principals that the evaluation system is to be used to dismiss teachers. Bill Turque of the Washington Post reported that former Chancellor Michelle Rhee told D.C. principals in August, “Unless you are comfortable with putting your own child in a classroom, that teacher does not have to be there .... So either go hard or go home.”

Rhee has now gone home, but the IMPACT evaluation system appears here to stay. Is it in the top half of teacher-evaluation systems? Only time will tell.
