Evaluate teachers with a single measure or multiple measures?


Back in July, I wrote a series of posts on teacher evaluations, outlining why the Massachusetts law, passed with much fanfare as a “bold, pioneering teacher-evaluation system,” was not likely to lead to much improvement for teachers or for students.

There are many other reasons to doubt the boldness or pioneering-ness of the new Massachusetts teacher evaluation system. There are the small-ball criticisms, such as:

  • The law required evaluations 18 years ago, and few school districts have fulfilled that requirement in all that time. So what makes this different?
  • By the time the evaluations go into effect (three years hence), the MCAS will be a thing of the past, with the state having promised to move to an unknown national test by then.  So, forget the apples-to-apples comparisons on data you were hoping for, at least for a number of years.
  • Why spend enormous gobs of time on implementing complex end-of-pipe regulations when we could get it right from the start by slowly but deliberately increasing the difficulty of the Massachusetts Tests for Educator Licensure?
  • Starting the focus on evaluations with teachers is misguided.  It is always best in any organizational setting to evaluate managers first: in this case, principals, central office personnel, and superintendents.

In one less-than-creatively titled post, “Will the new teacher evaluation system improve instruction?”, I noted that

The fact that the Massachusetts Teachers Association is supportive of the state’s new teacher evaluation system is not in and of itself a criticism—but when it is hard to get any grasp whatsoever as to the percentage of the evaluation to be based on improvements in students’ academic achievement, well, that should give us pause. And the fact that the National Education Association, the MTA’s parent organization, has maintained its opposition to the use of student standardized tests for the purpose of evaluating teachers should at least raise one’s suspicions.

The Lowell Sun picked up the same theme:

While the new rules are positive, they leave it up to individual school committees to decide just how much emphasis will be placed on the tests when it comes to judging a teacher’s performance. This serves to make teacher evaluations an open question from district to district. We suspect there will be some foot-dragging, especially where teachers’ unions remain a powerful force in electing school-board members and affecting policy…

As did the Worcester Telegram & Gazette:

[T]he regulations lay out 16 “indicators” for teacher standards in the areas of Curriculum and Planning, Teaching All Students, Family and Community Engagement, and Professional Culture. There are 20 such “indicators” for administrators, reaching into every conceivable area of day-to-day school management. From their performance on these many indicators, educators are to be classified and graded, assigned one of four overall ratings — Exemplary, Proficient, Needs Improvement, or Unsatisfactory — and then placed on an action plan, or, in the case of competent educators, largely left alone.

It isn’t clear to us how any of this will help districts rid themselves of bad teachers… or, on the positive side, facilitate the recruitment, promotion and rewarding of excellent teachers.

We were hoping for a far more succinct, specific and clear set of expectations that would promote accountability and excellence… [T]hese new educator evaluation regulations strike us as an excellent way to create more work and worry for administrators and teachers, while ensuring plenty of new grist for the wheels of bureaucracy that revolve at the state Department of Education.


And yet I hear repeatedly that evaluations using multiple measures are the way to go, because single-measure approaches (especially ones based on standardized tests) are not reliable, don’t give the whole picture, and so on.  The push for multiple measures comes from teachers, which I understand, but it is also coming from the Gates Foundation and others who consider themselves in the reform camp.  (That’s a discussion for another day.)

What does the data say about the use of multiple measures to evaluate teachers?

As Jay Greene notes at his blog,

The Gates Foundation has released the next installment of reports in their Measuring Effective Teachers Project. When the last report was released, I found myself in a tussle with the Gates folks and Sam Dillon at the New York Times because I noted that the study’s results didn’t actually support the finding attributed to it.  Vicki Phillips, the education chief at Gates, told the NYT and LA Times that the study showed that “drill and kill” and “teaching to the test” hurt student achievement when the study actually found no such thing.

With the latest round of reports, the Gates folks are back to their old game of spinning their results to push policy recommendations that are actually unsupported by the data.  The main message emphasized in the new round of reports is that we need multiple measures of teacher effectiveness, not just value-added measures derived from student test scores, to make reliable and valid predictions about how effective different teachers are at improving student learning.

This is the clear thrust of the newly released Policy and Practice Brief and Research Paper and is obviously what the reporters are being told by the Gates media people


given that sources like Education Week and Ed Sector insist that the “findings demonstrate the importance of multiple measures of teacher evaluation.”  Greene then looks at the research rather than just the Gates Foundation press releases.

But buried away on p. 51 of the Research Paper in Table 16 we see that value-added measures based on student test results — by themselves — are essentially as good or better than the much more expensive and cumbersome method of combining them with student surveys and classroom observations when it comes to predicting the effectiveness of teachers.  That is, the new Gates study actually finds that multiple measures are largely a waste of time and money when it comes to predicting the effectiveness of teachers at raising student scores in math and reading. (My emphasis)

For the wonks among you, Greene’s post has the details on the reliability of using criteria other than test scores to evaluate teachers.  He notes that the Gates study actually says quite the opposite of what the foundation’s press releases suggest:

Adding the student surveys and classroom observation measures to test scores yields almost no benefits, but it adds an enormous amount of cost and effort to a system for measuring teacher effectiveness…

Greene calls “this pattern of presentation” across the two Gates reports on teacher evaluations simply “spinning.”

So, why are the Gates folks saying that their research shows the benefits of multiple measures of teacher effectiveness when their research actually suggests virtually no benefits to combining other measures with test scores and when there are significant costs to adding those other measures?  The simple answer is politics.  Large numbers of educators and a segment of the population find relying solely on test scores for measuring teacher effectiveness to be unpalatable, but they might tolerate a system that combined test scores with classroom observations and other measures.  Rather than using their research to explain that these common preferences for multiple measures are inconsistent with the evidence, the Gates folks want to appease this constituency so that they can put a formal system of systematically measuring teacher effectiveness in place.  The research is being spun to serve a policy agenda…

Gates’ pattern of presentation

suggests the importance of multiple measures, since the classroom observations are strengthened when other measures are added.  The only place you find the reliability and validity of test scores by themselves is at the bottom of the Research Paper in Tables 16 and 20.  If both the lay-version and technical reports had always shown how little test scores are improved by adding student surveys and classroom observations, it would be plain that test scores alone are just about as good as multiple measures.

That’s a remarkable finding given that the usual take is that only complex, expensive systems of accountability will do good teachers justice.  The researchers who undertook this latest Measuring Effective Teachers Project report deserve credit for assembling and analyzing the data. The Gates Foundation deserves a Pinocchio Award for saying quite the opposite, and reporters parroting the press releases need to acquire some new skills.  Spitting out what you’ve been spoon-fed is stuff any toddler can do.

Also seen in Boston Globe Blogs.