The Perils of Favoring Consistency over Validity: Are “bad” VAMs more “consistent” than better ones?
This is another stat-geeky researcher post, but I’ll try to tease out the practical implications. It comes about partly, though not directly, in response to a new Brown Center/Brookings report on evaluating teacher evaluation systems. From that report, by an impressive team of authors, one can tease out two apparent preferences for evaluation systems, or more specifically, for any statistical component of those systems that is based on student assessment scores.
- A preference to isolate, as precisely as statistically feasible, the influence of the teacher on student test score gains;
- A preference to have a statistical rating of teacher effectiveness that is relatively consistent from year to year (where the more consistent models still aren’t particularly consistent).
While there shouldn’t necessarily be a conflict between identifying the best model of teacher effects and having a model that is reliable over time, I would argue that the pressure to achieve the second objective above may lead researchers – especially those developing models for direct application in school districts – to make inappropriate decisions regarding the first objective. After all, one of the most common critiques leveled at those using value-added models to rate teacher effectiveness is the lack of consistency of the year-to-year ratings.
Further, even the Brown Center/Brookings report took a completely agnostic stance regarding the possibility that better and worse models exist, but played up the relative importance of consistency, or reliability, of the teacher’s persistent effect over time.
There are “better” and “worse” models
The reality is that there are better and worse value-added models (though even the better ones remain problematic). Specifically, there are better and worse ways to handle certain problems that emerge from using value-added modeling to determine teacher effectiveness. One of the biggest issues is how well the model corrects for the non-random assignment of students to teachers across classrooms and schools. It is incredibly difficult to untangle teacher effects from peer group effects and/or any other factor within schooling at the classroom level (mix of students, lighting, heating, noise, class size). We can only better isolate the teacher effect from these other effects if each teacher is given the opportunity to work across varied settings and with varied students over time.
A fine example of taking an insufficient model (the LA Times/Buddin model) and raising it to a higher level with the same data is the set of alternative modeling exercises prepared by Derek Briggs & Ben Domingue of the University of Colorado. Among other things, Briggs and Domingue show that including classroom-level peer characteristics, in addition to student-level dummy variables for economic status and race, significantly reduces the extent to which teacher effectiveness ratings remain influenced by the non-random sorting of students across classrooms.
In our first stage we looked for empirical evidence that students and teachers are sorted into classrooms non-randomly on the basis of variables that are not being controlled for in Buddin’s value-added model. To do this, we investigated whether a student’s teacher in the future could have an effect on a student’s test performance in the past—something that is logically impossible and a sign that the model is flawed (has been misspecified). We found strong evidence that this is the case, especially for reading outcomes. If students are non-randomly assigned to teachers in ways that systemically advantage some teachers and disadvantage others (e.g., stronger students tending to be in certain teachers’ classrooms), then these advantages and disadvantages will show up whether one looks at past teachers, present teachers, or future teachers. That is, the model’s outputs result, at least in part, from this bias, in addition to the teacher effectiveness the model is hoping to capture.
Later:
The second stage of the sensitivity analysis was designed to illustrate the magnitude of this bias. To do this, we specified an alternate value-added model that, in addition to the variables Buddin used in his approach, controlled for (1) a longer history of a student’s test performance, (2) peer influence, and (3) school-level factors.
Clearly, it is important to include classroom-level and peer-group covariates in order to identify the “teacher effect” more precisely and to remove the bias in teacher estimates that results from the non-random ways in which kids are sorted across schools and classrooms.
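To make the logic of that falsification test concrete, here is a minimal sketch in Python of the kind of check Briggs and Domingue describe, under my own assumptions: a hypothetical long-format file of student-by-year records with made-up column names (score, prior_score, grade, next_year_teacher), not anything from their actual analysis. The idea is simply that a student’s future teacher cannot cause the student’s current score, so if future-teacher dummies carry explanatory power after controlling for prior achievement, the sorting of students to teachers is biasing the model.

```python
# Sketch of a "future teacher" falsification test (hypothetical data/columns).
# If next year's teacher assignment helps explain THIS year's score after
# conditioning on prior achievement, students are being sorted to teachers
# on unmeasured characteristics, and simple VAM estimates will be biased.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("student_year_panel.csv")  # hypothetical student-by-year file

# Baseline: current score on prior score and grade level only
base = smf.ols("score ~ prior_score + C(grade)", data=df).fit()

# Add dummies for the teacher each student will have NEXT year
falsification = smf.ols(
    "score ~ prior_score + C(grade) + C(next_year_teacher)", data=df
).fit()

# Joint F-test: under random assignment, the future-teacher dummies add nothing
print(anova_lm(base, falsification))
```

The same logic extends to richer specifications: add classroom-level peer composition and a longer history of lagged scores to the baseline and see how much of the spurious “future teacher effect” goes away.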
Two levels of the non-random assignment problem
To clarify, there may be at least two levels to the non-random assignment problem, and both may be persistent problems over time for any given teacher or group of teachers under a single evaluation system. In other words: Persistent non-random assignment!
As I mentioned above, we can only untangle the classroom level effects, which include different mixes of students, class sizes and classroom settings, or even time of day a specific course is taught, if each teacher to be evaluated has the opportunity to teach different mixes of kids, in different classroom settings and at different times of day and so on. Otherwise, some teachers are subjected to persistently different teaching conditions.
Focusing specifically on the importance of student and peer effects, it is more likely than not that, rather than having totally different groups and types of kids year after year, some teachers:
- persistently work with children coming from the most disadvantaged family/household background environments;
- persistently take on the role of trying to serve the most disruptive children.
At the very least, statistical modeling efforts must attempt to correct for the first of these peer effects with comprehensive classroom-level measures of peer composition (and a longer trail of lagged test scores for each student). Briggs and Domingue showed that doing so significantly improved the LAT model, and that the original LAT model contained substantial biases and failed specific falsification tests designed to detect them. Specifically, a student’s future teacher assignment could be used to “predict” that student’s prior test performance, which is only possible if students are sorted to teachers non-randomly. Briggs/Domingue note:
These results provide strong evidence that students are being sorted into grade 4 and grade 5 classrooms on the basis of variables that have not been included in the LAVAM (p. 11)
That is, a persistent pattern of non-random sorting which affects teachers’ effectiveness ratings. And, a persistent pattern of bias in those ratings that was significantly reduced by Briggs’ improved models.
At this point, you’re probably wondering why I keep harping on this term “persistent.”
Persistent Teacher Effect vs Persistent Model Bias?
So, back to the original point, and the conflict between those two objectives, reframed:
- Getting a model consistent enough to shut up those VAM naysayers;
- Estimating a statistically more valid VAM by including appropriate levels of complexity (and accepting the reduced number of teachers who can be evaluated as data demands increase).
Put this way, it’s a battle between REFORMY and RESEARCHY. Obviously, I favor the RESEARCHY perspective, mainly because it favors a BETTER MODEL! And a BETTER MODEL IS A FAIRER MODEL! But sadly, I think that REFORMY will too often win this epic battle.
Now, about that word “persistent.” Ever since the Gates/Kane teaching effectiveness report, there has been new interest in identifying the “persistent effect of teachers” on student test score gains. That is, an obsession with focusing public attention on that tiny sapling of explained variation in test scores that persists from year to year, while making great effort to divert public attention away from the forest of variance explained by other factors. “Persistent” is also the term du jour for the Brown/Brookings report.
A huge leap in those reports is to expand the phrase “persistent effect” from the persistent classroom-level variance explained to the “persistent year to year contribution of teachers to student achievement” (p. 16, Brown/Brookings). It is assumed that any “persistent effect” estimated from any value-added model – regardless of the features of that model – represents a persistent “teacher effect.”
But the persistent effect likely contains two components – persistent teacher effect & persistent bias – and the relative weight of those components depends largely on how well the model deals with non-random assignment. The “persistent teacher effect” may easily be dwarfed by the “persistent non-random assignment bias” in an insufficiently specified model (or one dependent on crappy data).
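In notation of my own (not the report’s), and under the simplifying assumptions that the two persistent components are fixed over time and independent of each other and of the year-specific noise, the estimated “persistent effect” for teacher j in year t can be written as:

```latex
\hat{\theta}_{j,t}
  = \underbrace{\theta_j}_{\text{persistent teacher effect}}
  + \underbrace{b_j}_{\text{persistent sorting bias}}
  + \varepsilon_{j,t}
\qquad\Longrightarrow\qquad
\operatorname{corr}\!\left(\hat{\theta}_{j,t},\,\hat{\theta}_{j,t+1}\right)
  = \frac{\operatorname{Var}(\theta_j)+\operatorname{Var}(b_j)}
         {\operatorname{Var}(\theta_j)+\operatorname{Var}(b_j)+\operatorname{Var}(\varepsilon_{j,t})}.
```

Anything that inflates Var(b_j), that is, anything that leaves more sorting bias in the ratings, mechanically pushes the year-to-year correlation up.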
AND, the persistently crappy model – by failing to reduce the persistent bias – is actually quite likely to be much more stable over time. In other words, if the model fails miserably at correcting for non-random assignment, a teacher who gets stuck with the most difficult kids year after year is much more likely to get a consistently bad rating. More effectively correct for non-random sorting, and the teacher’s rating likely jumps around at least a bit more from year to year.
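A tiny simulation makes the same point numerically. Everything here is made up for illustration (the variance parameters and the assumption that bias and true effect are independent are mine, not estimates from any real VAM): a rating that keeps the persistent sorting bias is more stable from year to year, yet tracks the true teacher effect less well.

```python
# Illustrative simulation with arbitrary/hypothetical parameters:
# estimated rating = true persistent effect + persistent sorting bias (if
# uncorrected) + year-specific noise.
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 10_000

true_effect = rng.normal(0.0, 0.10, n_teachers)   # persistent teacher effect
sorting_bias = rng.normal(0.0, 0.15, n_teachers)  # persistent assignment bias

def noise():
    # Year-specific estimation error, drawn fresh each year
    return rng.normal(0.0, 0.20, n_teachers)

def ratings(include_bias):
    # One year's worth of estimated ratings for every teacher
    bias = sorting_bias if include_bias else 0.0
    return true_effect + bias + noise()

for label, biased in [("Uncorrected (biased) model", True),
                      ("Better-specified model    ", False)]:
    year1, year2 = ratings(biased), ratings(biased)
    stability = np.corrcoef(year1, year2)[0, 1]       # year-to-year consistency
    validity = np.corrcoef(year1, true_effect)[0, 1]  # agreement with true effect
    print(f"{label}: stability = {stability:.2f}, validity = {validity:.2f}")
```

With these arbitrary numbers, the uncorrected ratings come out roughly twice as stable from year to year while correlating less with the true effect, which is precisely the perverse incentive described above.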
And we all know that in the current conversations, model consistency trumps model validity. That must change! Above and beyond all of the MAJOR TECHNICAL AND PRACTICAL CONCERNS I’ve raised repeatedly in this blog, there exists little or no incentive, and little or no pressure from researchers (who should know better), for state policy makers or local public school districts to actually try to produce more valid measures of effectiveness. In fact, too many incentives and pressures exist to use bad measures rather than better ones.
NOTE:
The Brookings method for assessing the validity of comprehensive evaluations works best (and arguably only works) with a more stable VAM. This means that their system provides an incentive for using a more stable model at the expense of accuracy. As a result, they’ve sort of built into their system – which is supposed to measure the accuracy of evaluations – an incentive for less accurate VAMs. It’s kind of a vicious circle.