“Use It or Lose It” Professional Judgment: Educational Evaluation and Bayesian Reasoning

Sherman Dorn
University of South Florida

Author correspondence: sdorn@coedu.usf.edu

Suggested citation: Dorn, S. (2009). “Use it or lose it” professional judgment. Educational evaluation and Bayesian reasoning. Working paper published through the Social Science Research Network (papers.ssrn.com). Accessed [date], from http://ssrn.com/abstract=1461508.

My thanks to David Figlio, Doug Harris, Jennifer Imazeki, and Jeffrey Kromrey for ideas and comments on earlier versions. Any errors of interpretation or facts remain the author’s.

Running Head: USE IT OR LOSE IT

Electronic copy available at: http://ssrn.com/abstract=1461508

Abstract

This paper presents a Bayesian framework for evaluative classification. Current education policy debates center on arguments about whether and how to use student test score data in school and personnel evaluation. Proponents of such use argue that refusing to use data violates both the public’s need to hold schools accountable when they use taxpayer dollars and students’ right to educational opportunities. Opponents of formulaic use of test-score data argue that most standardized test data is susceptible to fatal technical flaws, is a partial picture of student achievement, and leads to behavior that corrupts the measures. A Bayesian perspective on summative ordinal classification is a possible framework for combining quantitative outcome data for students with the qualitative types of evaluation that critics of high-stakes testing advocate. This paper describes the key characteristics of a Bayesian perspective on classification, describes a method to translate a naïve Bayesian classifier into a point-based system for evaluation, and draws conclusions from the comparison on the construction of algorithmic (including point-based) systems that could capture the political and practical benefits of a Bayesian approach.
The most important practical conclusion is that point-based systems with fixed components and weights cannot capture the dynamic and political benefits of a reciprocal relationship between professional judgment and quantitative student outcome data.

“Use It or Lose It” Professional Judgment: Educational Evaluation and Bayesian Reasoning

On July 24, 2009, President Barack Obama and Secretary of Education Arne Duncan announced draft regulations for state applications for “Race to the Top” funds appropriated by Congress in early 2009 (Branigin, 2009). Among the draft requirements for state applicants was the elimination of so-called legislative and regulatory “firewalls” between student-outcome data and teacher records. The assertive rhetoric from Obama administration appointees clearly implies that the current administration will push states to use student outcome data in teacher evaluation. While most of the public discussion of and controversy surrounding such data use has focused on performance-pay policies (e.g., Azordegan, Byrnett, Campbell, Greenman, & Coulter, 2005; Behrstock & Akerstrom, 2008; Max & Koppich, 2007), the most consequential potential use of student outcome data is for questions of employment: whether and to what extent test-score and other outcome data will influence the ordinary evaluation and continuation of teacher employment, tenure, and intervention efforts (Baratz-Snowden, 2009; Weisberg, Sexton, Mulhern, & Keeling, 2009). In the past decade, arguments about test-score use have focused on the No Child Left Behind Act’s cruder mechanisms for labeling schools, either arguments in favor of formulaic triggers as essential public-policy tools to equalize opportunity or arguments that such triggers are inherently corrupting (e.g., Dorn, 2007; Nichols & Berliner, 2007).
Similar arguments inevitably surround the use of test-score and other student outcome data for personnel evaluation purposes, but with additional issues: the attribution of outcomes to single teachers, the omission of many educators by default when assessments only exist for a small part of the curriculum, and a recent technical literature emphasizing the difficulty of identifying anything more than a small portion of teachers as outliers (as either effective or ineffective) through test-score data (e.g., Baratz-Snowden, 2009; Lockwood, Louis, & McCaffrey, 2002). While Duncan has made clear that his intent is not to force teacher evaluation decisions to revolve entirely around test scores, the current administration’s position is that student outcomes need to figure into evaluation. But how that is done is an open question. In Florida, the statutory authorization for the Merit Award Program option for school districts requires that student outcome data "shall be weighted at not less than 60 percent of the overall evaluation" (Florida Statutes 1012.225(3)(c)). The Florida legislature thus mandated one of several options for combining quantitative and qualitative judgments of teacher effectiveness: the point system. While there are variations that meet the statutory language, almost every real-world implementation meeting the spirit of the law would be a linear combination of different subscores (also see Max, 2007). This type of algorithm for combining qualitative professional judgments and student outcome data is not the only option for using student outcome data. In performance-pay, for example, there exist a small number of programs that include student performance as one pathway through which teachers may seek pay increases, as in Minneapolis or Denver (Azordegan et al., 2005; Potemski & Rowland, 2009).
[1] Also see the Denver ProComp website at http://denverprocomp.org.

Baratz-Snowden (2009) argued that student outcomes should be part of evaluation, but without specifying how:

Standardized test scores can play a role in presenting evidence of learning, but using standardized test scores as the sole or predominant measure of achievement is unwarranted and unwise given the inadequacy of such tests to capture the complexities and breadth of student learning and the limitations of current value-added methodologies. Nonetheless, it is absolutely essential that teachers present evidence of student learning—through test results and other material—as part of the tenure system if it is to be credible. Calling upon experienced teachers to help develop the multiple sources of such evidence is essential in redesigning the tenure system. (p. 28)

The evaluation systems she highlighted (in Toledo and Minneapolis’s local public school systems and in the Los Angeles Green Dot union contract) are different variations of a holistic or portfolio system of teacher evaluation, including requirements for documenting both the process and outcomes of professional development. Such a holistic system may well be the outcome of proactive collaboration between teacher union locals and local school boards, but the history of performance-pay plans suggests that some states will attempt to impose the type of algorithmic requirements for evaluations that Florida’s legislature has created in its performance-pay statute. In states without collective bargaining or with more legal or practical authority for legislatures, local collective bargaining may be less important than the political environment at the state level. Legislators who distrust school districts are going to be less likely to accept holistic evaluation reviews than district-level management with a history of collaborative relationships with unions.
The differences between holistic and algorithmic use of student outcome data include at least two dimensions of the continuing debate over high-stakes testing policies. One dimension is the technical adequacy of existing assessments. Advocates of an algorithmic approach are likely to argue that current assessments are not perfect but are a sufficient basis on which to make decisions. The same recognized flaws of existing assessments will probably be the focus of critics of an algorithmic approach (whether or not the critics would accept even a holistic teacher evaluation system that uses student outcome data). The critics will continue to argue that tests assess only a small portion of the formal curriculum and student performance, and that their use will corrupt the measures and teaching practices. Behind the debates about the technical flaws or adequacy of existing assessments, however, there is another dimension of the discussion, and that is around trust of professional judgment. The arguments of Weisberg et al. (2009) feed into the historical dynamic of accountability politics (Dorn, 2007): policymakers and many citizens distrust either the capacity or willingness of educators to make appropriate judgments about school practices and teacher performance. Reciprocating this lack of trust, many teachers believe that state legislators and advocates of high-stakes testing use testing as a tool with which to attack public schools and teachers. Unless addressing trust and mistrust is central to the design of teacher evaluation systems, the evaluation policies that develop in response to criticism of current practices are likely to be unsatisfactory to the two sides of the debate over test-score use.
Many school critics are wary of evaluation policies that leave open the possibility of evaluation systems that never identify weak teachers, and an algorithmic approach such as the one mandated for Florida’s Merit Award Program is likely to appeal to such critics. But an algorithmic approach will be unsatisfactory to those who distrust the use of tests to drive decisionmaking in schools.

Bayesian Perspectives on Classification

Resolving the trust problem in teacher evaluation requires stepping back to ask what we are seeking: sound decisions about whether teachers should continue without intervention, should be given additional professional assistance, or should leave the field. Making those decisions with confidence, decisions that are far more likely to be right than wrong, should be the goal of a teacher evaluation system. There is one algorithmic approach that could be promising, or at least a foundation for thinking about how to combine qualitative professional judgment and quantitative data in ways that focus on making critical personnel decisions, give significant weight to professional judgment when it is made, and leave a safety valve for decisionmaking when supervisors (and peers, in peer-review systems) are unwilling to make hard decisions about teachers. One can use Bayesian reasoning to understand evaluative classification as a process of judgments reshaped with data. This section describes Bayes’ Theorem on the calculation of conditional probabilities, a possible translation of that use in personnel and program evaluation, and some of the general political and technical issues involved in translating Bayes’ Theorem into evaluation use.
Bayes’ Theorem and Conditional Probability

The standard presentation of Bayes’ theorem centers on the conditional probability of A given data x, or P(A|x),

$$P(A \mid x) = \frac{P(x \mid A)\,P(A)}{P(x)}, \tag{1}$$

where P(A) is the general probability of A, P(x) is the general probability of x, and P(x|A) is the conditional probability of observing x given A (also called the likelihood of observing x given A).[2] Equation (1) is the most common form of Bayes’ Theorem, and it captures the relationship among four components of conditional probability. With complete information, one can see Bayes’ Theorem in action. For example, if one examines the 1785 public elementary schools in Florida which received a letter grade from the state in June 2009, what is the probability of receiving a letter grade of “A” if the percentage of tested students meeting the state’s standards on the reading test was exactly 50%?[3] Here, almost 71% of all public elementary schools in Florida received an “A” from the state, 0.08% of “A”-labeled schools had exactly 50% of all participating students meeting the state’s standards on reading tests, and 0.56% of all schools had 50% of students meeting state standards, or

$$P(A \mid x) = \frac{0.0008 \times 0.71}{0.0056} \approx 0.10.$$

This claim is validated by inspection of the records: 10 public elementary schools in Florida had 50% of all students meeting state standards in the spring 2009 tests, and one such school (or 10%) received an “A” from the state in summer 2009. (Out of all Florida elementary schools receiving an “A,” Liberty City Elementary School in Miami had the lowest proportion of students at or above the state reading test cut score in spring exams.)
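As a check on the arithmetic in this example, the following minimal sketch recomputes the posterior from the counts reported in the text (1785 graded schools, 1261 of them graded “A,” 10 with exactly 50% of students meeting reading standards, and one of those 10 graded “A”). The function and variable names are illustrative assumptions, not part of any state data system.

```python
from fractions import Fraction

# Counts reported in the text for Florida public elementary schools, 2009.
TOTAL_SCHOOLS = 1785     # schools receiving a letter grade
A_SCHOOLS = 1261         # schools graded "A"
X_SCHOOLS = 10           # schools with exactly 50% meeting reading standards
A_AND_X_SCHOOLS = 1      # "A" schools with exactly 50% meeting standards


def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|x) = P(x|A) * P(A) / P(x)."""
    return likelihood * prior / evidence


p_a = Fraction(A_SCHOOLS, TOTAL_SCHOOLS)            # P(A), about 0.71
p_x_given_a = Fraction(A_AND_X_SCHOOLS, A_SCHOOLS)  # P(x|A), about 0.0008
p_x = Fraction(X_SCHOOLS, TOTAL_SCHOOLS)            # P(x), about 0.0056

p_a_given_x = posterior(p_a, p_x_given_a, p_x)
print(float(p_a_given_x))  # 0.1
```

Exact fractions are used so that the result matches the direct count (1 school in 10) without rounding error.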
A similar exercise with the set of all public elementary schools having 50% or more students passing state standards in 2009 (1723 schools, a set that contained all 1261 elementary schools receiving an “A” from the state in 2009) will show that 73.2% of such schools received a letter grade of “A.”

[2] The appendix contains a more technical discussion of Bayesian reasoning, the naïve Bayesian classifier, and a translation to an additive point system.

[3] For letter grades assigned Florida’s public schools, see http://schoolgrades.fldoe.org.

With complete data, Bayes’ theorem is an accounting exercise. But an accounting exercise is not the value of Bayes’ theorem. The general value of conditional probability is the ability to reason consistently about incomplete information. The evaluation of medical test results is the most common example of this use (perhaps because test results are often evaluated with the wrong perspective). If someone tests positive for a rare condition (for example, one where the probability of having the condition is 1 in 10,000), a positive result from even a highly accurate test will usually be wrong, even where 95% of those with the condition test positive and only 5% of those without the condition test positive. Here, we break down P(x) into the sum of the probability of testing positive for those with the disease (A) and the probability of testing positive for those without the disease (~A):

$$P(x) = P(x \mid A)\,P(A) + P(x \mid {\sim}A)\,P({\sim}A).$$

The result is counterintuitive to many: a test with 95% accuracy in two dimensions (for those with and for those without the condition) is going to be wrong for the vast majority of positive results from a population where the risk is extremely low. Because the prevalence of a condition can dominate the value of a test result, repeating tests (or testing split samples) is important to provide confidence about the interpretation of positive test results for rare conditions.
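The medical-test example can be worked through numerically. The sketch below uses only the figures stated in the text (prevalence of 1 in 10,000, 95% of those with the condition testing positive, 5% of those without it testing positive). The function name is an illustrative assumption. The second call also assumes, purely for illustration, that a repeated test is statistically independent of the first.

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """Return P(condition | positive test) by Bayes' theorem."""
    true_positive = sensitivity * prevalence                   # P(x|A) * P(A)
    false_positive = false_positive_rate * (1.0 - prevalence)  # P(x|~A) * P(~A)
    return true_positive / (true_positive + false_positive)


# Figures from the text: prevalence 1 in 10,000; 95% of those with the
# condition test positive; 5% of those without it test positive.
after_one_test = posterior_given_positive(1 / 10_000, 0.95, 0.05)

# Assumption for illustration only: a second positive result is independent
# of the first, so the first posterior serves as the new prior.
after_two_tests = posterior_given_positive(after_one_test, 0.95, 0.05)

print(f"{after_one_test:.4f}")            # 0.0019
print(after_two_tests > after_one_test)   # True
```

Under these figures, fewer than 1 in 500 positive results would come from someone with the condition. A second positive result raises the posterior but, under the independence assumption, still leaves it well below even odds.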
Bayes and Inductive Reasoning

A minority of statisticians and a number of philosophers of science push a Bayesian approach in a different direction; in one Bayesian perspective, P(A) could be the general probability of A, but it can also be the judgment of the probability of A before gathering data, or the prior probability. In this view, P(A|x) is the posterior probability of A after gathering data x. This approach combines a prior judgment of A (which could be based on qualitative judgments) with data collection and analysis. In a Bayesian perspective, the data updates (or bumps) one’s prior judgment. Bayesian advocates argue that this is consistent with the scientific method (Howson & Urbach, 2005). Those skeptical of a Bayesian approach with a subjective prior often argue that the inclusion of subjective judgment in a prior probability is not objective; subjective Bayesians often respond that the priors always exist, and a subjective Bayesian approach merely reveals those choices in an explicit fashion. There is little literature attempting to apply a Bayesian approach to program or personnel evaluation in education, either the ordinary meaning of conditional probability or the subjectivist Bayesian approach.[4] The social-science field with experience in applying Bayesian reasoning to conditional probability is law, where there is a small literature on discussing statistics with juries (e.g., Kaye, 1999; Lindsey, Hertwig, & Gigerenzer, 2003). Some litigators have an incentive to keep juries from inappropriately applying conditional probability in a manner known in legal circles as the prosecutor’s fallacy: confusing the probability of testing positive given a hypothesis with the probability of the hypothesis being true given a positive test (e.g., Fenton & Neil, 2000).
[4] See Wood (1972) for a book review of Bayesian arguments at a Phi Delta Kappan symposium, with subjective Bayesian reasoning apparently considered more charming than practical by Wood.

Bayesian Reasoning and Evaluation Policy

The current policy debate over teacher evaluation provides an important reason to consider the use of a Bayesian approach, an approach with both practical and political benefits. For these purposes, the most important characteristic of the standard equation for a posterior probability is the relationship between the prior probability and the likelihood: a forceful statement of prior probability is bumped less by any given data than a weaker statement of prior probability.[5] For classification purposes, one would compare the probability of being in two different groups, or the odds. For example, consider an evaluation system that uses principal or peer judgment, and a teacher where both the school principal and peers think the teacher could use intervention but are not entirely certain. If the professional judgment before gathering additional data is that the teacher is somewhat more likely to need intervention than not (odds of 3:2, or a professional judgment that 60% of teachers in similar situations and with similar information available to administrators and teachers would need remediation), how could additional data update that professional judgment? The key term is a likelihood ratio: for a choice between one decision and another (for example, deciding whether a teacher needs help with instruction), the ratio of the likelihood of seeing a data pattern under one hypothesis (for example, intervening with a teacher) against the likelihood of seeing the same data pattern under the competing hypothesis (not intervening).
[5] In the long run, data will dominate both a frequentist and a Bayesian’s estimation of relevant quantities. But we generally do not live in an asymptotic world, especially with regard to personnel evaluation.

After gathering additional data (and again assuming that the data is professionally relevant and the relevant likelihood functions are salient), assume that the principal and peers discover that 6% of teachers judged as needing remediation produce the data gathered and that 1% of teachers judged as not needing remediation produce the data gathered. The likelihood ratio is 6:1, and the posterior (after-data-gathering) odds of needing intervention then become 9:1, or a 90% posterior probability of needing remediation. But the data can also bump the prior judgment in the other direction. If 6% of teachers judged as needing remediation produce the data gathered but 8% of teachers judged as not needing remediation also produce the data gathered, the likelihood ratio is 6:8 (or 3:4), and the posterior (after-data-gathering) odds of needing intervention then become 9:8, or only slightly better than even odds (approximately 53% probability) of needing intervention. In the hypothetical cases described above, data can help one update the prior judgment, and data can bump the prior judgment in different directions. But there is another important characteristic of posterior odds: the forcefulness of the prior odds also shapes the posterior judgment. A forceful statement of prior relative odds (e.g., a prior judgment that the teacher is twice as likely to need intervention as not) would be bumped less by any given data than a weaker statement of prior relative odds (e.g., a prior judgment that the teacher is equally likely to need intervention as not). If a school’s culture is one where the principal and peers err on the side of nonintervention in the case of a weak teacher, then the likelihood ratio would dominate the posterior odds.
If a principal and peers make forceful judgments about teachers, the data are going to be less influential.[6] The reciprocal relationship between the influence of prior judgments and the influence of data-generated likelihood ratios could well be a practical and political strength rather than a liability, including and perhaps especially for those skeptical of subjective judgments of teacher effectiveness. A system with a Bayesian rationale for combining professional judgment with quantitative data can encourage professional judgment by making the judgment more influential where administrators (and peers in peer-review systems) make stronger judgments. However, in such a system, institutional cultures that avoid forceful professional judgments would be more likely to produce weak prior odds that are overridden by likelihood ratios (which data would drive). Such systems could satisfy educators’ and teachers unions’ concerns that personnel evaluation not rely entirely on test scores, because professional judgment would take precedence where it is exercised forcefully. But the ability of data to dominate weak prior judgments could also satisfy the concerns of policymakers dissatisfied with the unwillingness of administrators to make forceful judgments about ineffective teachers.

[6] As Howson and Urbach (2005) and many others point out, a large amount of data will dominate strong prior probabilities. The practical issue here is balancing professional judgment and student outcome data, and in many cases the data from a single year will not dominate a prior supplied by a professional evaluative rating.
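The odds arithmetic in the hypothetical cases above can be sketched in a few lines. The 3:2 prior and the 6:1 and 6:8 likelihood ratios come from the text, as do the 2:1 and 1:1 priors used to contrast a forceful judgment with a noncommittal one. Pairing those two priors with the 3:4 ratio is an illustrative choice, and the function names are assumptions.

```python
from fractions import Fraction


def update_odds(prior_odds, likelihood_ratio):
    """Posterior odds = prior odds x likelihood ratio."""
    return prior_odds * likelihood_ratio


def odds_to_probability(odds):
    """Convert odds in favor (as a single ratio) to a probability."""
    return odds / (1 + odds)


prior = Fraction(3, 2)  # 3:2, i.e., a 60% prior probability

supporting = update_odds(prior, Fraction(6, 1))  # 6% vs. 1% -> ratio 6:1
opposing = update_odds(prior, Fraction(6, 8))    # 6% vs. 8% -> ratio 3:4

print(supporting, odds_to_probability(supporting))  # 9 9/10
print(opposing, odds_to_probability(opposing))      # 9/8 9/17

# "Use it or lose it": the same opposing data applied to a forceful prior
# (2:1) and to a noncommittal prior (1:1).
forceful = update_odds(Fraction(2, 1), Fraction(3, 4))      # 3/2
noncommittal = update_odds(Fraction(1, 1), Fraction(3, 4))  # 3/4
print(forceful > 1, noncommittal > 1)  # True False
```

With the forceful prior, the posterior odds still favor intervention. With the noncommittal prior, the same data tip the odds the other way, which is the reciprocal relationship the text describes.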
A Bayesian algorithm provides a way out from the trust dilemma surrounding professional judgment and teacher evaluation: a “use it or lose it” approach to professional judgment is an alternative to either a dominant use of (imperfect) test data or a dominant use of (sometimes-reluctant) professional judgment, creating a possible operationalization of Shulman’s (1988) marriage of insufficiencies. Bayesian classifiers exist in practice, if not in education evaluation, and most of us have benefited by experiencing at least one, or rather by not being aware of how we are benefiting. Most of the statistical research on the characteristics of Bayesian classifiers is in the field of machine learning, and many e-mail filters use Bayesian approaches to identifying spam (Graham, 2004; Sahami, Dumais, Heckerman, & Horvitz, 1998). Bayesian spam filters use the relative likelihood of several identifiable words or character strings from training sets to score a candidate e-mail as either likely spam or likely nonspam. As explained in some formal detail in the appendix, it is possible to use more than one data source in Bayesian classification (in spam filtering, multiple words). If one assumes that all data sources are independent of each other, then one calculates likelihood ratios and the classification comes from the product of a prior odds judgment (e.g., a professional evaluative judgment) with all of the likelihood ratios. Though an assumption of independence wreaks havoc with point estimates of most statistical inferences, there is considerable reason to believe that the yes/no decisions of a so-called naïve Bayesian classifier are not damaged much by an incorrect independence assumption (Domingos & Pazzani, 1997; Hand & Yu, 2001; Lewis, 1998; Rish, 2001; Zhang, 2001).
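The product form of the naïve Bayesian classifier described above can be sketched as follows. The prior odds, the three likelihood ratios, and the decision threshold of 1.0 are hypothetical values chosen for illustration, and the function names are assumptions rather than part of any existing evaluation system.

```python
from math import prod


def posterior_odds(prior_odds, likelihood_ratios):
    """Naive Bayes: multiply prior odds by every likelihood ratio,
    treating the data sources as independent."""
    return prior_odds * prod(likelihood_ratios)


def needs_intervention(prior_odds, likelihood_ratios, threshold=1.0):
    """Classify by whether the posterior odds exceed the threshold."""
    return posterior_odds(prior_odds, likelihood_ratios) > threshold


# Hypothetical values: a 3:2 professional judgment and three data sources.
prior = 1.5
ratios = [2.0, 0.5, 3.0]

print(posterior_odds(prior, ratios))      # 4.5
print(needs_intervention(prior, ratios))  # True
```

Only the yes/no output is of interest here. As the cited robustness literature suggests, the posterior odds themselves should not be read as a calibrated estimate when the independence assumption is wrong.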
Despite the research on the robustness of naïve Bayesian classifiers and the relative simplicity of calculation, they are not used in critical-decision frameworks where they could be of use, such as medical diagnoses. Recent research in medical diagnosis generally uses more complex Bayesian classifiers and suggests that such classifiers can be substantially superior to more traditional diagnosis scoring methods (e.g., Biagioli, Scolletta, Cevenini, Barbini, Giomarelli, & Barbini, 2006).

Salient Data and Likelihood Functions

While Bayesian reasoning allows one to create a “use it or lose it” approach to evaluation in theory, that potential does not guarantee a practical “use it or lose it” algorithm. The utility of any such Bayesian system of evaluation depends on characteristics assumed in the prior section: the existence both of data and likelihood functions salient to the judgment of professional effectiveness. The limited curriculum coverage inherent in any test or assessment system is well-known (e.g., Dorn, 2007; Koretz, 2008), but to some extent, the use of likelihood functions in such a system would make the technical requirement of data use a little looser than the debates over test score use might lead one to believe. First, the explicit inclusion of professional judgment and the relationship between strength of judgment and influence over posterior odds (explained in the prior section) reduce the reliance on circumscribed sources of data. Second, likelihood functions could be constructed that accommodate measurement error; one category of candidate for such accommodation is the set of kernel likelihood functions (e.g., the likelihood of observing a datum plus or minus the measurement error, or a distributional likelihood of being observed at a certain percentile rank plus or minus a decile).
[7] Domingos and Pazzani (1997) suggest dividing real-valued variables into five to ten bins, but the practical difference between a kernel and a quintile- or decile-based approach is beyond the scope of this paper.

It is thus the combination of data and likelihood function that needs to be salient to personnel evaluation, and the choice of a likelihood function for a specific source of data entails value judgments about our categorical judgments of effectiveness. Consider the type of judgment involved in the two choices about proficiency measures mentioned earlier: having an exact proportion of students judged proficient (or a kernel measure centered on an exact proportion) versus having that proportion of students or higher judged proficient. Identifying either an exact or a kernel measure as important implies that the precise place of student measures is more important to the judgment of effectiveness than a broad category such as “50% or more” and that distinguishing groups of students with 50% proficiency from all groups with 50% or greater proficiency is important in evaluation policy.[8] Either is a defensible position, but the consequences of the choices lead in different directions.

[8] I assume here that proficiency is a meaningful construct. While the construct of proficiency depends on the validity of cut scores, which are always arbitrary (Glass, 1978), one could make the same argument as in the text with any ordinal measure chosen for the task at hand.

In addition to the choices of data and likelihood functions, one would need to identify an appropriate source of distribution for any likelihood function and a defensible categorization of data into the relevant bins. The assumption in the Bayesian reasoning presented above is that there already is a classification of teachers into different categories and an existing and known distribution of the data for each category.
In reality, any chosen distribution is likely to depend on tentative proxy judgments: we tentatively divide a set of teachers into categories and use a sampled distribution of data on those teachers to create the likelihood functions. The political legitimacy of those proxy judgments and comparable samples would depend on the classification method. The tentative classification by administrators and teachers would be most acceptable in a political sense, but the consequences of such classification are also likely to make the task aversive for many who might be asked to participate.[9] While relatively simple in concept, the implementation of a Bayesian approach to evaluation involves both technical and political judgments.

[9] Waving away the comparable-population question, one could predict a low response rate for teachers asked to judge the effectiveness of current peers in their schools, even if they are promised that their judgments would not affect personnel evaluation of their current peers for that year: their judgments would set the classifications used in later years and thus they would be responsible for setting the likelihood functions by which they and peers would be judged, at least in part.

Bayesian Reasoning and Additive Point Systems

The political benefit of a Bayesian approach is the possible construction of an evaluation system where professional judgment has a “use it or lose it” trait. That benefit is transferable to an additive point system, with some restrictions on the structure of a point system. A point system with rigid weights or rigid maximums for the contribution of different components will not have the benefit of a Bayesian approach. A point system with more flexible relationships between different components can have the benefit of a Bayesian approach, and it is possible to construct a point system that is equivalent to a naïve Bayesian classifier.
While the Bayesian equivalent point system is theoretical rather than a likely practice, the existence of an equivalent suggests what is necessary for a point system to capture the political benefit of a Bayesian approach.

Bayesian Equivalents in Additive Point Systems

As explained in the appendix, the conversion of a naïve Bayesian classifier to an additive point system requires a logarithmic transformation of the classifier’s product of factors, a log transformation that creates a sum of log odds and log likelihood ratios. In theory, each contribution could be calculated based on the same likelihood ratios as described earlier, with a positive log likelihood ratio increasing the posterior odds of a target decision and a negative log likelihood ratio decreasing the posterior odds of a target decision. To preserve the equivalent reciprocal relationship among components, each component in the Bayesian equivalent point system must be unbounded on both the positive and negative ends. A strong prior statement in a Bayesian system is equivalent to a log odds further from 0 (either positive or negative) than any other component in a point system, and at least in theory, a point system can capture the political benefits of a Bayesian approach to evaluation.

Rigid Point Systems

In contrast to the point equivalent of a Bayesian approach, a point system with rigidly bounded or rigidly weighted components fails to capture the political benefits of a Bayesian approach to evaluation. In a point system with bounded components, there is no reciprocation among the components. The effective power of a set of judgments by professionals is entirely independent of the effective power of any other component. No matter how forceful or weak the judgment of an administrator or peer committee, the authority of all other components remains the same.
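The contrast between the log-odds point equivalent and a bounded point system can be sketched as follows. The 20:1 prior, the two 1:4 likelihood ratios, and the cap of one point per component are hypothetical values chosen to make the contrast visible, and the function names are assumptions.

```python
import math


def log_points(prior_odds, likelihood_ratios):
    """Return one additive point value per component (natural log)."""
    return [math.log(prior_odds)] + [math.log(r) for r in likelihood_ratios]


def bounded(points, cap):
    """Clamp each component to [-cap, +cap], as a rigid point system would."""
    return [max(-cap, min(cap, p)) for p in points]


# Hypothetical case: a forceful 20:1 professional judgment and two data
# sources that each point the other way with a likelihood ratio of 1:4.
points = log_points(20.0, [0.25, 0.25])
unbounded_total = sum(points)
bounded_total = sum(bounded(points, cap=1.0))

# The sum of log points reproduces the naive Bayesian product of factors.
print(math.isclose(math.exp(unbounded_total), 20.0 * 0.25 * 0.25))  # True
print(unbounded_total > 0)  # True
print(bounded_total > 0)    # False
```

In the unbounded system the forceful judgment outweighs the two data components. Once every component is capped, the same judgment counts no more than a mild one and the decision reverses.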
Or, in practical terms, if the judgment of a principal is worth 50% of the potential points and most teachers receive identical scores, the influence of data remains the same as when a principal gives a range of scores to teachers. If evaluators responsible for one component of a point system are hesitant to make forceful judgments about weak teachers, other components do not become more important in compensation. If one believes in the abstract value of a particular component at the precise weight contained in a system, the advantage of such an approach is precisely the rigid authority of its components. However, such a rigid system depends heavily on the value judgments made in its construction and is unable to provide either type of assurance that would address the trust/mistrust dynamics in the debate over teacher evaluation.

Intermediate Options

However, a point-based system does not need to have fixed weights for component scales. With the removal of rigid bounds and weights from a point system, it is possible to capture the political benefit of a Bayesian approach to evaluation. If the weight for a component scale can vary, then one could introduce a reciprocal relationship between professional (supervisory and peer) judgments, on the one hand, and other sources of evaluative information such as student outcomes, on the other. The most important potential benefit of a Bayesian approach to evaluation is the political consequences of combining professional judgment and data in a way that gives more authority to professionals who are willing to make forceful judgments with the possibility of reciprocal authority for data when professionals are not willing to make forceful judgments.
The practicality, advantages, and disadvantages of each approach described below will vary, and the purpose of describing some options is less to advocate for a particular approach than to illustrate a minimal range of approaches to point-based evaluation systems.

Weighting by component range. One such system would be a weighting of components by range, a literal “use it or lose it” formula. If a qualitative evaluation is worth up to half of the total points, but a set of evaluative ratings only spans half of the potential range, a “use it or lose it” policy could expand the weight of student-outcome data to fill the extra 25% of points not in the range of the qualitative evaluations. Such a system would be simple to explain and implement. It would also create odd incentives to game the system, whereby a principal could insulate highly rated teachers from the effect of test scores by giving extremely low ratings to a small number of teachers in a school.

Standardizing ratings by a central dispersion measure. A second approach would be an indirect reweighting by the transformation of both test-score data and professional evaluative ratings into standardized scores, in comparison to a central dispersion measure such as a standard deviation (within a relevant population). Suppose that the influence of professional evaluative ratings were set at twice the weight of ratings derived from test scores, after both are transformed into standard-deviation units. In a unit where administrators (or administrators and teachers, with peer review) provide a range of ratings to teachers, the extreme-valued ratings will be more influential than test scores, at both the high and low ends. On the other hand, if the professional evaluative ratings have no variation (where a standardized rating would be in the middle for all), then the test scores determine the end distribution.
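The first two options can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a proposed formula: the function names, the 50-of-100-point split, the use of a population standard deviation, and the treatment of ratings with no variation as all-zero standardized scores are choices made here for the example, and all ratings and test values are hypothetical.

```python
from statistics import mean, pstdev


def data_weight_by_range(ratings, rating_max=50.0, total=100.0):
    """Points assigned to student-outcome data under a 'use it or lose it' rule.

    Outcome data start with (total - rating_max) points and absorb whatever
    part of the rating scale the evaluators left unused.
    """
    used_span = max(ratings) - min(ratings)
    return (total - rating_max) + (rating_max - used_span)


def zscores(values):
    """Standard-deviation units; identical values all map to 0 (assumption)."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def standardized_composite(ratings, test_scores, rating_weight=2.0):
    """Ratings count rating_weight times as much as test scores, per SD unit."""
    return [rating_weight * r + t
            for r, t in zip(zscores(ratings), zscores(test_scores))]


# Hypothetical ratings (out of 50) that span only half of the scale:
# outcome data expand from 50 to 75 of the 100 points.
print(data_weight_by_range([25, 30, 40, 50]))

# Hypothetical test-based scores for four teachers.
tests = [0.8, 0.4, 0.6, 0.2]

# Uniform ratings: the composite is determined by the test scores alone.
print(standardized_composite([40, 40, 40, 40], tests))

# Varied ratings: the extreme ratings outweigh conflicting test scores,
# while the two teachers with identical ratings are ordered by test scores.
print(standardized_composite([10, 30, 30, 50], tests))
```

In the last call, the teacher with the lowest rating has the highest test score and still ends up with the lowest composite, which is the sense in which forceful ratings retain authority in this sketch.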
Administrators (and peers) could choose not to exercise their professional judgment in rating teachers, but such a choice would give compensatory authority to student outcome data.

Overdetermined total. A third approach could be a point system that theoretically overdetermines outcomes, with more than 100% of the potential range covered by the sum of components. For example, if professional judgment evaluation scores are worth 50 points, and data from student outcomes are worth 50 points, the range of the sum is 0 to 100. But if the range of sum scores is restricted to [20, 80], each component’s potential range of 50 points spans about 83% of the 60-point range for the total. The rationale for such a system would be that a system does not need to worry about extreme values that represent consistency between qualitative and quantitative sources of data: a teacher with high ratings in all categories is presumed to be highly performing, while a teacher with low ratings in all categories is presumed to be low-performing. It is in the middle of the range where the overdetermined sum has effect.

Conclusion

With the decision of the Obama administration to condition Race to the Top funds on the elimination of barriers to linking teacher and student test data, the weight of the U.S. political system is shifting towards linking teacher evaluations to test-outcome data. Some part of the policy discussion is focused on performance-pay policies and attendant choices, but the root importance of such a linkage is with regard to employment rather than pay: to what extent should teachers’ jobs depend on test-score and other student outcome data? This is not a new discussion, and the tensions involved in these policy debates will remain. Many teachers, administrators, and parents will oppose policies that place test scores in a dominant position, because they see tests as highly flawed and creating perverse incentives.
School critics (including many parents) will oppose policies that result in uniform satisfactory evaluations for almost all teachers and see test-score use as an imperfect but justifiable tool to change evaluation practices. Without intervention, the likely outcome of these debates is a dichotomy of policies, with some school systems and states experimenting with crude uses of test-score data and other systems and states refusing to change, pointing to the inevitable problems with crude evaluation mechanisms.

This paper points in a different direction, using a Bayesian inference mechanism as a starting point. The description of unconventional algorithmic options is less to advocate for any of these approaches than to illustrate potential: one may be able to translate the most politically valuable characteristic of a Bayesian approach to simpler algorithms, or at least algorithms that can be understood by a broad group of stakeholders. While one may not see formal Bayesian reasoning in personnel evaluation systems, there are some important lessons to take from the Bayesian approach and the parallel between the log transformation of conditional probability equations, on the one hand, and additive point systems, on the other.

Most importantly, the political benefits from a Bayesian approach require conscious construction in a point-based system of evaluation. While a Bayesian calculation of posterior odds explicitly creates a reciprocal relationship between prior odds and likelihood ratios, a point-based system with fixed weights or component contributions removes that reciprocal relationship. Adjusting weights by range, standardizing ratings in standard-deviation units, and overdetermining point totals are three methods to construct such reciprocal relationships, and it might be of significant benefit to explore such approaches.
While one probably could construct an evaluation system based entirely on a Bayesian approach, that is not necessary to gain the most important benefits: a range of technical solutions that provides reasonable incentives for all parties and a starting point for further development and local negotiations. Many teachers unions are unlikely to accept the use of test scores unless it is subservient to professional judgment, but other stakeholders are unlikely to accept the dominance of professional judgment without a backup method of evaluating teachers when the professional judgment is timid in judging weak teachers. A “use it or lose it” approach to professional judgment is a workable approach rooted in Bayesian inductive reasoning, with a few possible constructions within a point-based evaluation system.

Appendix

The relative-odds formulation of Bayes’ theorem is the standard beginning point for naïve Bayesian classification, if one looks at relative odds of two possibilities A and B and considers them to be possible decisions. Let A be the need to intervene to help a poor teacher and B be nonintervention.[10] Then the relative odds of needing intervention versus nonintervention are

$$\frac{P(A \mid x)}{P(B \mid x)} = \frac{P(A)}{P(B)} \cdot \frac{P(x \mid A)}{P(x \mid B)}, \tag{2}$$

where the first term on the right-hand side represents the prior relative odds of needing intervention and the second term is the relative likelihood of x given the two classifications under consideration (or likelihood ratio). One could reasonably interpret the first term as the judgment of relative need for intervention before gathering data {x} and the second term as the relative likelihood of seeing the data under those relative judgments.

[10] The categories need not be exclusive: One could create additional categories such as recognition for merit, or dismissal, though the translation into an additive point system has difficulties with more than two categories.
If one had distributional information about {x} for teachers needing and not needing supervisory intervention, where the data {x} and the framing of the likelihood function were professionally salient (a question discussed below), then equation (2) allows one to adjust one’s prior professional judgment by relevant data.[11] In the case of updating an administrator’s initial professional judgment with data, the likelihood ratio is the key datum. In the case where the likelihood of the data under condition A (intervention) is 6% and the likelihood of seeing the data under condition B (nonintervention) is 1%, the likelihood ratio is $0.06/0.01 = 6$. The posterior (after-data-gathering) odds of needing intervention then become $1.5 \times 6 = 9$ (prior odds of 3:2 multiplied by the likelihood ratio), or a 90% posterior probability of needing remediation. But the data can also bump the prior judgment in the other direction. If 6% of teachers judged as needing remediation produce the data gathered but 8% of teachers judged as not needing remediation also produce the data gathered, the likelihood ratio is $0.06/0.08 = 0.75$. The posterior (after-data-gathering) odds of needing intervention then become $1.5 \times 0.75 = 1.125$, or only slightly greater than even odds (approximately 53% probability) of needing intervention.

[11] The question of appropriate data is discussed below.

Multiple Data Sources and Naïve Bayesian Classifiers

Consider first a Bayesian mechanism for adjusting professional judgment by two data sources rather than one, {x} and {y}. Then the posterior odds become

$$\frac{P(A \mid x, y)}{P(B \mid x, y)} = \frac{P(A)}{P(B)} \cdot \frac{P(x \mid A)}{P(x \mid B)} \cdot \frac{P(y \mid A, x)}{P(y \mid B, x)}, \tag{3}$$

which would require computational estimation of the likelihood ratios for interdependent data.[12] However, if {x} and {y} are independent, the last term becomes $P(y \mid A)/P(y \mid B)$ and

$$\frac{P(A \mid x, y)}{P(B \mid x, y)} = \frac{P(A)}{P(B)} \cdot \frac{P(x \mid A)}{P(x \mid B)} \cdot \frac{P(y \mid A)}{P(y \mid B)}, \tag{4}$$

or, more generally, with $\{x_i\}$ for n independent variables,

$$\frac{P(A \mid x_1, \ldots, x_n)}{P(B \mid x_1, \ldots, x_n)} = \frac{P(A)}{P(B)} \prod_{i=1}^{n} \frac{P(x_i \mid A)}{P(x_i \mid B)}. \tag{5}$$

The concept here is that with a set of independent variables, or a series of data sources, one can repeatedly update the original professional judgment using the likelihood ratios of the different sources of data.
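The updating in equations (2) and (5) can be checked numerically with a short sketch. The function names are illustrative, and the prior odds of 1.5 are an assumption: that is the value implied by the 90% and roughly 53% posterior probabilities in the worked examples above. Combining both likelihood ratios in one update is a further hypothetical that treats the two observations as independent sources.

```python
from math import prod


def odds_to_probability(odds):
    """Convert relative odds A:B into the probability of A."""
    return odds / (1.0 + odds)


def posterior_odds(prior_odds, likelihoods):
    """Naive Bayesian update of relative odds, as in equation (5).

    likelihoods: list of (P(data | A), P(data | B)) pairs, one per source,
    with the sources treated as independent.
    """
    return prior_odds * prod(p_a / p_b for p_a, p_b in likelihoods)


PRIOR_ODDS = 1.5  # assumed prior of 3:2 in favor of intervention

strong = posterior_odds(PRIOR_ODDS, [(0.06, 0.01)])  # likelihood ratio 6
weak = posterior_odds(PRIOR_ODDS, [(0.06, 0.08)])    # likelihood ratio 0.75
both = posterior_odds(PRIOR_ODDS, [(0.06, 0.01), (0.06, 0.08)])

for label, odds in [("strong", strong), ("weak", weak), ("both", both)]:
    print(f"{label}: odds {odds:.3f}, probability {odds_to_probability(odds):.2f}")
```

With no data sources at all, the function returns the prior odds unchanged, which matches the reading of the first term as professional judgment before any data are gathered.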
The independence of the data sources is not an assumption likely to hold for most data sources in schools, but the simplified construct enables a direct comparison to point-based systems of evaluation, and there is some reason to believe that classifying algorithms are less vulnerable to inapt independence assumptions than real-valued estimators are (Domingos & Pazzani, 1997; Hand & Yu, 2001; Lewis, 1998; Rish, 2001; Zhang, 2001).

[12] Empirical Bayes estimation of the last term’s quantities (likelihood of observing one variable given a prior and another variable) commonly involves Monte Carlo simulation using a Gibbs sampler. This paper is designed to provide a simpler introduction to the issues involved and assumes the identification of independent variables.

Log Transformation and Additive Points

A log transformation of equation (5) leads directly to a point-like system,

$$\log \frac{P(A \mid x_1, \ldots, x_n)}{P(B \mid x_1, \ldots, x_n)} = \log \frac{P(A)}{P(B)} + \sum_{i=1}^{n} \log \frac{P(x_i \mid A)}{P(x_i \mid B)}, \tag{6}$$

and if $H = \log \frac{P(A \mid x_1, \ldots, x_n)}{P(B \mid x_1, \ldots, x_n)}$, $h_0 = \log \frac{P(A)}{P(B)}$, and $h_i = \log \frac{P(x_i \mid A)}{P(x_i \mid B)}$, then equation (6) becomes

$$H = h_0 + \sum_{i=1}^{n} h_i, \tag{7}$$

which corresponds to an additive point-based system where a classification cutoff score for H corresponds to log relative posterior odds, $h_0$ is the log odds of professional judgment for two categories, and each $h_i$ is the log of a likelihood ratio estimated from $\{x_i\}$. The transformation of a Bayesian updating system into a linear point-based system is not a statement that all point-based systems have an underlying Bayesian equivalent. The requirements here are steep: the correspondence of the first component to log prior odds (the qualitative professional judgment), the correspondence of additional components to a set of n independent sources $\{x_i\}$ (or a single data source {x}) and salient likelihood functions, and a cutoff score representing the relative odds at which a decision is appropriate.
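The correspondence between multiplying odds and adding points can be illustrated with a minimal sketch. The names `h0` (points for the log prior odds) and `meets_cutoff` are labels assumed here for illustration, and the prior odds, likelihood ratios, and 4:1 cutoff are hypothetical values rather than anything estimated from data.

```python
import math


def to_points(prior_odds, likelihood_ratios):
    """Return (h0, [h_i]): log prior odds and log likelihood ratios."""
    return math.log(prior_odds), [math.log(lr) for lr in likelihood_ratios]


def total_points(prior_odds, likelihood_ratios):
    """H: the sum of points, equal to the log posterior odds."""
    h0, hs = to_points(prior_odds, likelihood_ratios)
    return h0 + sum(hs)


def meets_cutoff(prior_odds, likelihood_ratios, cutoff_odds):
    """Compare H with the cutoff expressed on the same log-odds scale."""
    return total_points(prior_odds, likelihood_ratios) >= math.log(cutoff_odds)


# Hypothetical inputs: prior odds 1.5 and two sources with ratios 6 and 0.75.
H = total_points(1.5, [6.0, 0.75])
print(H, math.exp(H))                       # exp(H) recovers the posterior odds
print(meets_cutoff(1.5, [6.0, 0.75], 4.0))  # assumed cutoff of 4:1 odds
```

A likelihood ratio below 1 contributes negative points, so a component in this sketch can subtract from the total as well as add to it.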
While it is possible to construct a point system in this manner (or to infer hypothetical, latent variables that operate in the way described here), the point of this exercise is not to suggest the construction of an explicit Bayesian-justified point system. Instead, the parallel can be a tool to explore the characteristics of any point-based system.

Weights. First, consider a set of weights $\{w_i\}$ for $\{h_i\}$, such that $H = h_0 + \sum_{i=1}^{n} w_i h_i$. Transforming this weighted linear formula back into equation (5), one can see that

$$\frac{P(A \mid x_1, \ldots, x_n)}{P(B \mid x_1, \ldots, x_n)} = \frac{P(A)}{P(B)} \prod_{i=1}^{n} \left[ \frac{P(x_i \mid A)}{P(x_i \mid B)} \right]^{w_i}, \tag{8}$$

where the weights $\{w_i\}$ become exponents for each data source’s likelihood ratio. The consequence of exponential weights in equation (8) is partly dependent on the range of the likelihood ratio for each source $x_i$ and also partly dependent on the threshold value of the posterior odds that would trigger a decision. If either the threshold value of the posterior odds is close to 1 or the likelihood ratio for a source $x_i$ is close to 1, the exponential place of $w_i$ becomes less consequential. The parallel in a weighted point system is similar: the weights for different components will not act in a linear fashion, but the effect of weights will depend on the implicit sensitivity of the threshold value for H and the range for each $h_i$. A threshold value for H that easily triggers a decision and restricted ranges for $\{h_i\}$ are associated with minimal effects of weights, while broad ranges for $\{h_i\}$ and classifications of H imply a highly nonlinear effect of weighting.[13]

Multidimensional scales. The consequences of multidimensional components follow from the nonlinear consequence of weighting. The concern about implicitly multidimensional scales for many measurement researchers is construct validity, but that may be less important in a type of evaluation that its designers intended to combine different types of sources. On the other hand, a multidimensional scale (or component of a point system) effectively sets weights for each dimension in ways that are not deliberate.
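The weighting relationship described under "Weights" above can be sketched as follows. The sketch assumes that the prior term is left unweighted and that only the data-source components carry weights; the function names and all numbers are hypothetical.

```python
import math


def weighted_points(h0, hs, weights):
    """Weighted point total: h0 plus the sum of w_i * h_i."""
    return h0 + sum(w * h for w, h in zip(weights, hs))


def weighted_odds(prior_odds, likelihood_ratios, weights):
    """The same quantity on the odds scale: weights appear as exponents."""
    return prior_odds * math.prod(
        lr ** w for lr, w in zip(likelihood_ratios, weights)
    )


prior_odds = 1.5      # hypothetical prior odds
ratios = [6.0, 1.05]  # one informative source, one with a ratio near 1

for w in (1.0, 2.0, 3.0):
    # Apply the weight to one source at a time to compare sensitivity.
    on_strong = weighted_odds(prior_odds, ratios, [w, 1.0])
    on_mild = weighted_odds(prior_odds, ratios, [1.0, w])
    print(f"w={w}: weight on ratio 6 -> {on_strong:.2f}, "
          f"weight on ratio 1.05 -> {on_mild:.2f}")
```

Raising the weight on the source with a likelihood ratio near 1 barely moves the posterior odds, while the same weight on the informative source changes them by orders of magnitude, which is the nonlinear effect of weighting discussed in the text.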
A similar conclusion follows for sources of data that are not truly independent of each other. To the extent that a subset of variables $\{x_i\}$ underlying $\{h_i\}$ is not independent, collinear components of the subset of variables could be interpreted as a smaller set of variables that are differentially weighted. For example, if $x_j$ and $x_k$ are linearly dependent so that $x_k = K x_j$, one could replace $x_j$ and $x_k$ with $(K+1) \cdot x_j$, and $K+1$ operates as a weight with the consequences described above.

[13] There is a similar effect of broad ranges in point-based grading systems. If the range for a course’s component is on the order of the point range for a single grade, extreme values for a single assignment can shift a term grade by a letter grade, and that is equivalent to a nonlinear consequence of the assignment on the relative odds that a student will earn one grade as opposed to another.

References

Azordegan, J., Byrnett, P., Campbell, K., Greenman, J., & Coulter, T. (2005). Diversifying teacher compensation. Denver, CO: Education Commission of the States. Retrieved July 26, 2009, from http://www.eric.ed.gov/ERICWebPortal/contentdelivery/servlet/ERICServlet?accno=ED489329.

Baratz-Snowden, J. (2009, June). Fixing tenure: A proposal for assuring teacher effectiveness and due process. Washington, DC: Center for American Progress. Retrieved July 10, 2009, from http://www.americanprogress.org/issues/2009/06/teacher_tenure.html.

Behrstock, E., & Akerstrom, J. (2008, December). Performance pay in Houston. Rockville, MD: Center for Educator Compensation Reform. Retrieved July 26, 2009, from http://www.cecr.ed.gov/guides/summaries/HoustonCaseSummary.pdf.

Biagioli, B., Scolletta, S., Cevenini, G., Barbini, E., Giomarelli, P., & Barbini, P. (2006). A multivariate Bayesian model for assessing morbidity after coronary artery surgery. Critical Care, 10(3), R94. doi: 10.1186/cc4951.

Bock, R., Wolfe, R., & Fisher, T. (1996). A review and analysis of the Tennessee value added assessment system [technical report]. Nashville, TN: Tennessee Office of Education Accountability.

Branigin, W. (2009, July 24). Obama launches “race” for $4 billion in education funds. Washington Post. Retrieved July 24, 2009, from http://www.washingtonpost.com/wp-dyn/content/article/2009/07/24/AR2009072402203.html.

Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103-130.

Fenton, N. E., & Neil, M. (2000). The jury observation fallacy and the use of Bayesian networks to present probabilistic legal arguments. Mathematics Today (Bulletin of the IMA), 36(6), 180-187.

Glass, G. V (1978). Standards and criteria. Journal of Educational Measurement, 15(4), 237-261.

Graham, P. (2004). Hackers and painters: Big ideas from the computer age. Sebastopol, CA: O’Reilly Media, Inc. Graham’s essay on spam filtering is also available at http://www.paulgraham.com/spam.html.

Hand, D. J., & Yu, K. (2001). Idiot’s Bayes: Not so stupid after all? International Statistical Review, 69(3), 385-398.

Howson, C., & Urbach, P. (2005). Scientific reasoning: The Bayesian approach (3rd ed.). Chicago: Open Court Publishing.

Kaye, D. H. (1999). Clarifying the burden of persuasion: What Bayesian decision rules do and do not do. International Journal of Evidence & Proof, 3(1).

Koretz, D. (2008). Measuring up: What educational testing really tells us. Cambridge, MA: Harvard University Press.

Lewis, D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. In Proceedings of ECML-98, 10th European Conference on Machine Learning (pp. 4-15). Heidelberg, Germany: Springer Verlag.

Lindsey, S., Hertwig, R., & Gigerenzer, G. (2003). Communicating statistical DNA evidence. Jurimetrics Journal, 43, 147-163.

Lockwood, J. R., Louis, T. A., & McCaffrey, D. F. (2002). Uncertainty in rank estimation: Implications for value-added modeling accountability systems. Journal of Educational and Behavioral Statistics, 27(3), 255-270.

Max, J. (2007, November). The evolution of performance pay in Florida. Rockville, MD: Center for Educator Compensation Reform. Retrieved July 26, 2009, from http://www.cecr.ed.gov/guides/summaries/FloridaCaseSummary.pdf.

Max, J., & Koppich, J. E. (2007, December). Engaging stakeholders in teacher pay reform. Rockville, MD: Center for Educator Compensation Reform. Retrieved July 26, 2009, from http://www.cecr.ed.gov/guides/EmergingIssuesReport1.pdf.

Potemski, A., & Rowland, C. (2009, April). Pay reform in Minneapolis Public Schools: Multiple approaches to alternative compensation. Rockville, MD: Center for Educator Compensation Reform. Retrieved July 26, 2009, from http://www.cecr.ed.gov/guides/summaries/MinneapolisCaseSummary.pdf.

Rish, I. (2001). An empirical study of the naive Bayes classifier. IBM Technical Report RC22230. Hawthorne, NY: T. J. Watson Research Center. Retrieved August 6, 2009, from http://www.research.ibm.com/people/r/rish/papers/RC22230.pdf.

Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A Bayesian approach to filtering junk e-mail. AAAI Technical Report WS-98-05. Menlo Park, CA: Association for the Advancement of Artificial Intelligence. Retrieved August 6, 2009, from http://www.aaai.org/Papers/Workshops/1998/WS-98-05/WS98-05-009.pdf.

Shulman, L. S. (1988). A union of insufficiencies: Strategies for teacher assessment in a period of educational reform. Educational Leadership, 46(3), 36-41.

Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009, June). The widget effect: Our national failure to acknowledge and act on differences in teacher effectiveness. Brooklyn, NY: The New Teacher Project. Retrieved July 10, 2009, from http://www.widgeteffect.org/.

Wood, R. (1972). Review of Bayesian Statistics edited by D. L. Meyer and R. O. Collier, Jr. The School Review, 80(4), 629-640.

Zhang, H. (2001). The optimality of naive Bayes. Fredericton, NB: University of New Brunswick. Paper presented at annual meeting of the Florida Artificial Intelligence Research Society (Miami Beach). Retrieved August 6, 2009, from http://www.aaai.org/Papers/FLAIRS/2004/Flairs04-097.pdf.