Standards Based Tests: Proceed with Caution
Value-Added Models and Its Discontents
As noted on the Home Page, K-12 teachers today face two challenges: (i) value-added models (VAM) for evaluating teacher effectiveness, and (ii) standards-based student tests (also called high-stakes tests) that determine a teacher’s value-added score. Empirical data reveal excessive year-to-year volatility in both the VAM and the standards-based testing regime, a lack of consistency that raises serious questions about how useful VAM, and the standards-based tests upon which it is based, can be in distinguishing effective from ineffective teachers. Two recent examples, discussed below, illustrate the volatility problem with standards-based student tests.
Value-added modeling (VAM) is a relatively new, and somewhat controversial, statistical method for evaluating teachers based on their students’ performance on year-end tests of reading and mathematics. The heart and soul of the VAM methodology is a collection of computationally intensive statistical techniques originally developed to analyze complex data sets arising in agriculture and industrial quality control. More precisely, VAM uses these arcane techniques to “predict,” for example, how well a 5th-grade student is expected to perform based on his or her performance on the year-end 4th-grade tests. If enough of a teacher’s students perform less well than predicted, the teacher is said to have added negative value. When this happens the teacher is rated as ineffective.
Two recent articles, published only two days apart, strongly suggest that the VAM methodology for deciding whether or not a teacher is effective is deeply flawed. The first, by Michael Winerip, Hard-Working Teachers, Sabotaged When Student Test Scores Slip, was published in The New York Times on March 4, 2012 (Michael Winerip). It describes how mindless, formal statistical techniques, applied to raw data detached from the population from which they were collected, damaged the reputations of three outstanding fifth-grade teachers at Public School 146, one of the highest-achieving elementary schools in Brooklyn, NY.
“Though 89 percent of P.S. 146 fifth graders were rated proficient in math in 2009,” writes Mr. Winerip, “the year before, as fourth graders, 97 percent were rated as proficient. This resulted in the worst thing that can happen to a teacher in America today: negative value was added. The difference between 89 percent and 97 percent proficiency at P.S. 146 is the result of three children scoring a 2 out of 4 instead of a 3 out of 4.”
The second article, by Bill Turque, ‘Creative ... motivating’ and fired, was published in the Washington Post on March 6, 2012 (Bill Turque). In this case, the consequences of a poor value-added score were more severe: the teacher was fired. “Ms. Wysocki, 31, was let go,” writes Mr. Turque, “because the reading and math scores of her students didn’t grow as predicted. Her undoing was “value-added,” a complex statistical tool used to measure a teacher’s direct contribution to test results. The District and at least 25 states, under prodding from the Obama administration, have adopted or are developing value-added systems to assess teachers.”
In this case, Ms. Wysocki offered a possible and intriguing explanation as to why the reading and math scores of her students did not improve as much as predicted: cheating. In detail: “Many students arrived at her class in August 2010 after receiving inflated test scores in fourth grade. Fourteen of her 25 students had attended Barnard Elementary. The school is one of 41 in which publishers of the D.C. Comprehensive Assessment System tests found unusually high numbers of answer sheet erasures in spring 2010, with wrong answers changed to right. Twenty-nine percent of Barnard’s 2010 fourth-graders scored at the advanced level in reading, about five times the District average.… But Wysocki was worried. Some students who had scored advanced in fourth grade, she said, could barely read.” [Turque, op. cit.]
Since VAM is based on a variety of complex and computationally intensive statistical techniques, any serious analysis of VAM must itself be statistically valid. In particular, the exact value of a teacher’s value-added score is uncertain, because it is subject to sampling error arising from year-to-year variation in socio-economic indicators, academic preparedness, and a host of other factors (e.g., parental unemployment, absence of a father, poverty, illness, and all the other vicissitudes of modern life) that can, and often do, affect a student’s performance on a high-stakes test. For this reason, instead of simply providing the predicted value, statisticians recommend reporting a confidence interval: an interval of values that contains the unknown parameter (the value-added score) with high probability, usually taken to be 95%. No reputable statistician will report a value-added score without also giving its margin of error, just as news organizations never, or hardly ever, publish a voter-preference poll without reporting the poll’s margin of error. (The margin of error of the predicted value is one-half the width of the confidence interval.) This raises the following question: What is the margin of error associated with a teacher’s value-added score, and how is it computed? In particular, if the confidence interval includes the minimum score for an effective-teacher rating, then the teacher cannot, and should not, be rated as ineffective.
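What reporting a margin of error would look like can be sketched with a few lines of Python. The point estimate, standard error, and 95% normal critical value below are hypothetical numbers chosen for illustration, not figures from any actual VAM:

```python
def value_added_ci(estimate, std_error, z=1.96):
    """Return (lower, upper, margin_of_error) for an approximate 95%
    confidence interval around a value-added point estimate, assuming
    a normal sampling distribution."""
    margin = z * std_error  # margin of error = half the width of the CI
    return estimate - margin, estimate + margin, margin

# Hypothetical teacher: point estimate -0.5, standard error 1.2
lo, hi, moe = value_added_ci(-0.5, 1.2)
print(f"95% CI: ({lo:.2f}, {hi:.2f}), margin of error: +/- {moe:.2f}")
# Here the interval straddles zero (and any plausible effectiveness
# cutoff near zero), so the data cannot support an "ineffective" rating.
```

With a standard error of this size the margin of error dwarfs the point estimate, which is exactly the situation the text warns about: the score alone looks decisive, while the interval shows it is not.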
Another problem with the value-added score is that it is used to make one of two decisions: (I) the teacher is effective, or (II) the teacher is ineffective; this implies a certitude that cannot be inferred from the data alone. In statistical parlance this is a test of a statistical hypothesis: the Principal must decide, on the basis of the data, whether or not the teacher is effective. In a criminal trial, for example, the defendant is always assumed to be innocent; this is an example of a null hypothesis. In the context of VAM, the null hypothesis is that the teacher is effective. Clearly, the Principal can make one of two types of error: a Type I error occurs when he mistakenly decides that an effective teacher is ineffective; a Type II error occurs when he decides that an ineffective teacher is effective. Of particular statistical interest are the answers to the following questions: What is the probability of a Type I error? And what is the power of the test? (Technical note: the power of a test is the probability that the test will “detect” that the teacher really is ineffective.)
Another, and equally serious, problem is that merely reporting the predicted value-added score does not address the crucial question: How much of the variation in the VAM statistic is due to the teacher, and how much is due to random variation in the student population of her class? Until the cheerleaders for VAM can satisfactorily answer these and other questions, it should not be approved for general use.
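That question can be made concrete with a small simulation. Under the hypothetical variance components assumed below (a teacher-effect standard deviation of 0.2, student-level noise of 1.0, classes of 25), true teacher effects account for only about half the variation in class-average scores; the rest is the luck of which students landed in the class:

```python
import random
import statistics

def teacher_share_of_variance(teacher_sd, student_sd, class_size,
                              n_teachers=2000, seed=1):
    """Simulate class-average gain scores and estimate the fraction of
    their variance attributable to true teacher effects, as opposed to
    random variation in class composition."""
    rng = random.Random(seed)
    class_means = []
    for _ in range(n_teachers):
        effect = rng.gauss(0, teacher_sd)  # this teacher's true effect
        gains = [effect + rng.gauss(0, student_sd) for _ in range(class_size)]
        class_means.append(statistics.fmean(gains))
    observed_var = statistics.variance(class_means)
    return teacher_sd**2 / observed_var

# Hypothetical scale: teacher sd 0.2, student noise sd 1.0, class of 25
share = teacher_share_of_variance(0.2, 1.0, 25)
print(f"teacher effects explain about {share:.0%} of class-average variance")
```

The analytic answer under these assumptions is teacher_sd² / (teacher_sd² + student_sd²/class_size) = 0.04 / 0.08 = 50%, and the simulation lands close to it; the point is that a large share of a class-average score is noise no teacher controls.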
Readers interested in a more technical discussion can download my (unpublished) paper, Racing To The Top: A Treadmill to Nowhere, which gives a statistician’s perspective on this problem.