Evaluating Teachers Using Student Test Scores: Value-added Measures (Part 1)
In most organizations, supervisors measure and evaluate employees’ performance. Consequences, both positive and negative, flow from the judgments they make. Not so in public schools, where supervisors have commonly judged over 95 percent of all teachers “satisfactory.” Such percentages clearly do not distinguish between effective and ineffective teaching. The reform-driven agenda of the past decade, which has included testing, accountability, expanded parental choice through charter schools, and a Common Core curriculum across the nation, now adds to its to-do list distinguishing between good and poor teaching.[i]
The current generation of reform-driven policymakers, donors, and educational entrepreneurs is determined to sort out “good” from mediocre and poor teaching, if for no other reason than to identify those high performers who have had sustained effects on student learning and reward them with recognition, bonuses, and higher salaries. They are equally determined to rid the teacher corps of persistently ineffective teachers.[ii]
How to identify the best and worst in the profession in ways that teachers perceive as fair, that improve the craft of teaching, and that retain their support for the process has, in most places, thwarted reformers. But not enough to stop policymakers and donors from launching a flurry of programs that seek to recognize high performers while firing time-servers.
Reform-minded policymakers, donors, and media have concentrated on annual test scores. Some big-city districts like Washington, D.C., Los Angeles, and New York have not only used student scores to determine individual teacher effectiveness but also permitted publication of each teacher’s “effectiveness” ranking (e.g., the Los Angeles Unified School District). Because teachers see serious flaws in using test scores to reward and punish teaching, they are far less enthusiastic about new systems to evaluate teaching and award bonuses.
Behind these new policies for judging teaching performance are models of teacher effectiveness built on complex algorithms called “value-added measures” (VAM), drawn from research studies done a quarter-century ago by William Sanders and others.
How do value-added measures work? Using an end-of-year standardized achievement test in math and English, VAM predicts how well a student should do based on the student’s attendance, past performance on tests, and other characteristics. The student’s actual growth in learning (as measured by the standardized test) is then compared with that prediction, yielding an estimate of how much value the teacher added to each student’s learning in a year. Teachers of students who take these end-of-year tests are held responsible for getting their students to reach the predicted level. If a teacher’s students reach or exceed their predicted test scores, the teacher is rated effective or highly effective; if the students miss the mark, the teacher is rated ineffective.
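To make that arithmetic concrete, here is a minimal sketch of the logic. It assumes a deliberately simplified and entirely hypothetical prediction model in which a student’s expected score is a linear function of last year’s score and attendance; the coefficients, field names, and roster below are invented for illustration. Real VAM systems, such as the models Sanders developed, are far more statistically elaborate.

```python
# Hypothetical, simplified sketch of a value-added calculation.
# Real VAM models are far more complex; coefficients and data here are illustrative only.

from dataclasses import dataclass
from statistics import mean

@dataclass
class Student:
    prior_score: float   # last year's standardized test score
    attendance: float    # fraction of school days attended (0.0 to 1.0)
    actual_score: float  # this year's standardized test score

def predicted_score(s: Student) -> float:
    """Toy prediction: a linear function of prior score and attendance."""
    return 0.9 * s.prior_score + 20.0 * s.attendance

def value_added(students: list[Student]) -> float:
    """Average of (actual - predicted) across a teacher's students."""
    return mean(s.actual_score - predicted_score(s) for s in students)

# Example roster: students who, on average, beat their predicted scores
roster = [
    Student(prior_score=200, attendance=0.95, actual_score=205),
    Student(prior_score=180, attendance=0.90, actual_score=178),
    Student(prior_score=220, attendance=0.98, actual_score=230),
]
print(f"Estimated value added: {value_added(roster):+.1f} points")
```

Under this kind of scheme, a teacher whose average is above zero appears to have “added value,” while one whose students fall short of their predictions appears ineffective, regardless of why the gap occurred.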
Most teachers perceive VAM as unfair. Less than half of all teachers (mostly in elementary schools, and an even smaller percentage in secondary schools) have usable data (e.g., multiple years of students’ math and reading scores) with which to be evaluated. For those teachers lacking student test scores, new tests will have to be developed and other metrics will be used. Teachers also know that other factors, such as student effort and family background, play a part in students’ academic performance. And they know that other data, drawn from peer and supervisor observations of lessons, the quality of instructional materials teachers use, and student and parent satisfaction with the teacher, are weighed much less or even ignored in judging teaching performance.
Moreover, student scores are unstable from year to year: different students are tested each year as cohorts move through the grades, yet teacher “effectiveness” ratings are based on these changing cohorts. What this means is that a substantial percentage of teachers ranked “highly effective” in one year may be ranked “ineffective” the next. False positives (e.g., tests that say you have cancer when you do not) are common in such situations. Furthermore, many teachers know that both measurement error and teaching experience (i.e., teachers improve over time and have both bad years and good years) account for instability in ratings of teacher effectiveness. Finally, many teachers see the process of using student scores to judge effectiveness as pitting teacher against teacher, increasing competition rather than collaboration across grades and specialties within a school; such systems, they believe, are aimed not at helping teachers improve daily lessons but at naming, blaming, and defaming teachers. [iii]
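A small simulation can illustrate the instability claim. The sketch below assumes a hypothetical teacher whose true effect on test scores is exactly average and adds random year-to-year noise of the kind produced by small class sizes, test measurement error, and changing cohorts; the noise level and rating thresholds are invented for illustration, not taken from any actual evaluation system.

```python
# Hypothetical simulation: measurement noise alone can flip a teacher's
# rating from year to year even when true effectiveness never changes.
# Effect sizes and thresholds are invented for illustration only.

import random

random.seed(1)

TRUE_EFFECT = 0.0   # an exactly average teacher
NOISE_SD = 0.15     # year-to-year noise from small samples, test error, new cohorts
YEARS = 10

for year in range(1, YEARS + 1):
    measured = TRUE_EFFECT + random.gauss(0, NOISE_SD)
    if measured > 0.10:
        rating = "highly effective"
    elif measured < -0.10:
        rating = "ineffective"
    else:
        rating = "effective"
    print(f"Year {year}: measured value-added {measured:+.2f} -> {rating}")
```

Run repeatedly, a simulation like this shows the same teacher drifting across rating categories purely by chance, which is the pattern the Goldhaber and Hansen and Di Carlo analyses cited below examine with real data.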
Yet with all of these negatives, there are also many teachers, principals, policymakers, and parents who are convinced that something has to be done to improve evaluation and distinguish between effective and ineffective teaching. In Washington, D.C., a new system of evaluation and pay-for-performance, inaugurated by former Chancellor Michelle Rhee, reveals both the strengths and the flaws in VAM.
[i] Daniel Weisberg et al., “The Widget Effect” (Washington, D.C.: The New Teacher Project, 2009).
[ii] There is a crucial distinction between “good” and “successful” teaching and an equally important one between “successful” teaching and “successful” learning that avid reformers ignore. See Gary Fenstermacher and Virginia Richardson, “On Making Determinations of Quality in Teaching,” Teachers College Record, 2005, 107, pp. 186-213.
[iii] Linda Darling-Hammond and colleagues summarize the negatives of VAM in “Evaluating Teacher Evaluation,” Education Week, February 12, 2012, at http://www.edweek.org/ew/articles/2012/03/01/kappan_hammond.html . For another view that argues, on balance, that VAM is worthwhile in evaluating teachers, see Steven Glazerman et al., “Evaluating Teachers: The Important Role of Value-Added,” Brookings Institution, November 17, 2010, at http://www.brookings.edu/reports/2010/1117_evaluating_teachers.aspx
For stability in teacher ratings over time, see Dan Goldhaber and Michael Hansen, “Is It Just a Bad Class? Assessing the Stability of Measured Teacher Performance,” Center for Education Data & Research, University of Washington, 2010, CEDR Working Paper #2010-3. On issues of reliability and validity in value-added measures, see Matthew Di Carlo’s posts of April 12 and 20, 2012, at http://shankerblog.org/?p=5621 and http://nepc.colorado.edu/blog/value-added-versus-observations-part-two-validity .