
Building a better evaluation system: full report

“Look at the back of the baseball card.” Around the major leagues, that’s the common response from managers when they’re asked about what kind of performance they expect from a player in an upcoming season.

Now, players’ batting averages vary from season to season. But managers clearly know that looking over statistical data from past seasons’ performances is a reliable way to gauge a player’s future performance. Reliable enough, in fact, for teams to consider it when making multi-million-dollar personnel decisions. Although the dollars involved may differ, most of us are accustomed to some quantifiable information being included in our job evaluations.

What about teachers?

The push to change teacher evaluation systems, and especially to include statistical measures of teachers’ effect on student learning, is here. In 2005, 13 states were able to link teachers to their students’ performance data; currently, 35 states are able to do so and the number is expected to grow. The Obama administration’s Race to the Top (RTTT) effort urged states and districts to use this teacher-student link in teacher evaluations in order to be eligible for grants. In response, 17 states reportedly changed their evaluation systems to improve their chances of receiving RTTT funds. Private foundations like the Bill & Melinda Gates Foundation have also used their resources to examine teachers’ effectiveness and encourage the use of such measures.

Clearly, this is a fast-moving train that will likely affect many, if not most, school districts eventually. In order to prepare, here’s what you should know:

  • The current system is lacking. Most current evaluation systems fail to identify the true variation in the effectiveness of teachers by rating all but a few teachers as "satisfactory."
  • Improving teacher effectiveness can dramatically impact student learning. Designing and implementing a quality teacher evaluation system -- one that identifies strong teaching where it exists and targets interventions where they’re needed for improvement -- would take additional funds and careful thought, but the benefits for students and for the community could be significant.
  • Value-added models have many caveats, but are better than what we have now. To improve teacher effectiveness, evaluations should include information on how much teachers impact their students’ learning. The fairest way to do so is by using value-added models, which work to isolate the impact a teacher has on his or her students’ achievement. Although a teacher’s value-added score may misidentify a teacher simply based on statistical variation, this variation should be considered against the current evaluation system, which almost certainly misidentifies many ineffective teachers as “satisfactory.”
  • Statistical measures are used effectively to evaluate people in other industries. Professional baseball, for instance, bases its multi-million-dollar salary decisions largely on a player’s statistics, which vary from year to year about as much as teachers’ value-added scores. Other professions that are evaluated on similarly imprecise year-to-year measures include realtors, financial investors, utility company repairmen, and others.
  • There are ways to improve value-added models. The more years of data used, the more precise their identifications become. For example, the chance of misidentifying a teacher decreases from 35 percent to 25 percent if three years of data are used instead of just one. Including more information about student achievement can also reduce the possibility of error. Improving the data from state assessments, and aligning the assessments to what is taught, could also improve value-added models.
  • Multiple measures are the way to go. Using traditional measures, such as classroom observation, along with value-added data will present a fuller, more accurate picture of not only how effective a teacher has been but how effective a teacher is likely to be in the future. Emerging research bears this out by showing a strong relationship between teachers’ value-added data and traditional principal observations as well as student surveys.

The current system is lacking

“As important as evaluation is to assessing teacher performance, what passes for teacher evaluation in many districts frankly isn’t up to this important task. Way too often, teacher evaluations are superficial. They’re subjective. They miss a prime opportunity to improve teacher practice and, thereby, increase student learning. And that’s what it’s all about, isn’t it?” –Randi Weingarten, President, American Federation of Teachers (Weingarten 2011)

Current teacher evaluation systems leave much to be desired in their ability to accurately measure teachers’ effectiveness. For example, a report from The New Teacher Project found that, in districts using a satisfactory/unsatisfactory evaluation system, only 1 percent of teachers received “unsatisfactory” on their most recent evaluation (Weisberg, et al. 2009). In districts that use multiple categories, just 6 percent of teachers fail to be rated in the top two categories and less than 1 percent of teachers are rated as unsatisfactory (Weisberg, et al. 2009). Yet, as a number of studies have shown, teachers’ effectiveness varies greatly even within schools (Kane and Cantrell 2010). Unfortunately, current teacher evaluations fail to identify such variations.

One reason may be that not all teachers get evaluated every year. In many cases, tenured teachers are only required to receive a formal evaluation every two to three years (Toch and Rothman 2008). New teachers get evaluated up to two or three times a year (Brandt, et al. 2007), but the type of evaluation they receive doesn’t necessarily provide helpful feedback. Even when teachers are evaluated, evaluations are typically based on two or fewer classroom observations of 60 minutes or less, conducted by school administrators who have not been trained to conduct an evaluation (Weisberg, et al. 2009). In a study by the Midwest Regional Educational Laboratory (Midwest REL), just two of seven Midwest states required administrators to receive training on how to conduct evaluations (Brandt, Thomas and Burke 2008). Most districts don’t require training, either. Just 8 percent of all districts had policies that referenced any form of training or certification criteria for raters (Brandt, et al. 2007).

This certainly calls into question the accuracy of current teacher evaluation systems. As the New Teacher Project report states, “When all teachers are rated good or great, those who are truly exceptional cannot be formally identified.” And the majority of teachers and administrators (59 percent and 63 percent, respectively) agree that their districts are not doing enough to identify the most effective teachers (Weisberg, et al. 2009). Without the ability to identify truly effective or ineffective teachers, either for personnel decisions or continuous improvement, districts will be hard pressed to improve the overall quality of their workforce.

Improving teacher effectiveness is crucial

Uncertainty over teachers’ true effectiveness leads to serious consequences for students. We know from many different studies that the effectiveness of teachers can dramatically impact individual students. For example, one study showed that above-average students assigned to highly effective teachers for three consecutive years scored 49 percentile points higher in math than students who started at the same achievement level but received three ineffective teachers in a row (Jordan, Mendro and Weerasinghe 1997). Practically, that could mean the difference between being placed in honors courses or remedial ones. Other research has shown that teachers have the single greatest impact on a student’s performance, more than family background or school (Center for Public Education 2005). Improving teacher quality through accurate evaluations, then, would significantly benefit student achievement.

Some researchers believe improving teacher quality can even have an impact on the national economy. Stanford University economist Dr. Eric Hanushek estimates that U.S. 15-year-olds could rank among the world leaders, such as Finland, in math and science performance, if 8 percent of the least effective teachers were replaced with teachers of average effectiveness (Hanushek 2010). And, he says, such a dramatic increase could add over $112 trillion to our economy. Hanushek also estimates that a highly effective teacher (that is, one at the 84th percentile) adds over $400,000 of value to the labor market per class taught. His calculation is based on the increased wages that teacher’s students are expected to earn due to the academic benefit they received from being in a single effective teacher’s class (Hanushek 2010). Students who had a highly effective 4th grade teacher, for example, will earn a combined $426,225 more during their lifetime than if they simply had an average 4th grade teacher (Hanushek 2010).

Using Hanushek’s numbers, then, if a single highly effective teacher stays in the classroom for 30 years, he or she will contribute nearly $13 million into our national economy over what a teacher with average effectiveness would. In contrast, compared to an average teacher, an ineffective teacher would actually subtract over $800,000 from the economy each year, or $24 million over a 30-year career (Hanushek 2010).
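
The career totals above are simple multiplication, and it can help to see the assumptions spelled out. The short Python sketch below reproduces them from the per-class figures quoted in this report. The function and constant names are illustrative, and the sketch assumes one class per year and uses $800,000 as a lower bound for the "over $800,000" figure; it is not Hanushek’s own calculation.

```python
# Per-class figures quoted in the report (Hanushek 2010).
# Names are illustrative; one class per year is an assumption of this sketch.
EFFECTIVE_GAIN_PER_CLASS = 426_225    # highly effective vs. average teacher
INEFFECTIVE_LOSS_PER_CLASS = 800_000  # lower bound for "over $800,000"
CAREER_YEARS = 30


def career_total(per_class_amount, years=CAREER_YEARS):
    """Total dollar effect over a career, assuming one class per year."""
    return per_class_amount * years


print(career_total(EFFECTIVE_GAIN_PER_CLASS))    # 12786750, "nearly $13 million"
print(career_total(INEFFECTIVE_LOSS_PER_CLASS))  # 24000000, "$24 million"
```

Changing the number of years or classes per year changes the totals proportionally, which is one reason these figures are best read as rough illustrations.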

Of course, this is only in theory. There is no evidence that there is an excess supply of average teachers to replace the least effective teachers. But that is why using teacher evaluations to improve the teachers we do have is so important. While the discussion about reforming teacher evaluations too often centers on the idea of identifying and removing ineffective teachers, truly useful evaluations would support and improve some ineffective teachers, and help average teachers become better as well. Teacher turnover already costs the nation over $7 billion a year, so simply removing ineffective teachers would only exacerbate that cost and wouldn’t be a good use of an evaluation system. In contrast, being able to identify ineffective teachers early and know where they need to improve would allow districts to provide them with appropriate support and professional development – saving the districts money and benefiting their current students. Designing and implementing a quality teacher evaluation system may take additional funds, but the benefits clearly outweigh the costs. To do so, incorporating student achievement data into teacher evaluations is the logical first step.

The movement toward including student achievement data

Traditionally, teachers’ quality has been defined by the number of years they have been teaching, the degrees they have earned, and to a lesser extent, their instructional techniques. There has been little, if any, emphasis placed on how much a teacher’s students actually learned. Even though there is no shortage of student achievement data, very few teacher evaluation systems currently include any sort of measure of how the teacher impacted their students’ achievement. Although 35 states have the data systems in place to match teachers with their students’ test scores over time, the systems are typically not designed to make high-stakes decisions about teachers (DQC 2011).

However, this may be changing. According to the 2010 State Teacher Policy Yearbook, 10 states now require evidence of student learning to hold a significant weight in a teacher’s evaluation, up from just 4 in 2009 (Jacobs, et al. 2011). An additional 6 states require some measure of student learning to be included (Jacobs, et al. 2011). The increase is likely a result of President Obama’s Race to the Top (RTTT) competition (Jacobs, et al. 2011); a full 17 states modified their teacher evaluation systems, reportedly in order to compete for RTTT funds (Bruce 2010). So, there has been an uptick in the number of states that require teachers to be evaluated based on students’ performance, although the number is still less than half.

As states and districts look toward linking student and teacher performance, the best tools available are value-added models. Value-added models are a type of growth model -- measurements of how much a particular student learned from one point in time to another. Many factors contribute to a student’s learning, but value-added growth models can measure the impact of one factor (in this case, a teacher) on the change in a student’s performance. So, for example, a value-added model would work to separate and identify the learning gains a student made over a school year with a particular teacher from the learning he or she brought in at the beginning of the year.
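
To make the idea concrete, here is a deliberately simplified Python sketch of one way a value-added estimate could be computed. It assumes only a prior-year score and a current-year score for each student, predicts each student’s current score from the prior score with a straight-line fit across all students, and averages the prediction errors for each teacher. The function name and data layout are hypothetical, and real value-added models control for many more factors than this sketch does.

```python
from collections import defaultdict
from statistics import mean


def simple_value_added(records):
    """records: iterable of (teacher, prior_score, current_score) tuples.

    Returns {teacher: average amount by which that teacher's students
    beat (or missed) the score predicted from their prior score}.
    Assumes prior scores are not all identical.
    """
    records = list(records)
    priors = [prior for _, prior, _ in records]
    currents = [current for _, _, current in records]
    prior_mean, current_mean = mean(priors), mean(currents)

    # Least-squares line predicting current score from prior score.
    sxx = sum((p - prior_mean) ** 2 for p in priors)
    sxy = sum((p - prior_mean) * (c - current_mean)
              for p, c in zip(priors, currents))
    slope = sxy / sxx
    intercept = current_mean - slope * prior_mean

    # A teacher's estimate is the mean gap between actual and predicted scores.
    gaps = defaultdict(list)
    for teacher, prior, current in records:
        gaps[teacher].append(current - (intercept + slope * prior))
    return {teacher: mean(values) for teacher, values in gaps.items()}


# Two hypothetical teachers whose students start at the same levels.
scores = simple_value_added([
    ("Teacher A", 40, 50), ("Teacher A", 60, 70), ("Teacher A", 80, 90),
    ("Teacher B", 40, 40), ("Teacher B", 60, 60), ("Teacher B", 80, 80),
])
print(scores)
```

Because each student is compared with a prediction based on where he or she started, a teacher in this sketch earns a higher estimate only when students outperform expectations, not simply because they arrived with high scores.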

Value-added models: understanding the drawbacks and the potential

Value-added models have gained prominence because of their attempt to provide an objective and standard measure of a teacher’s impact on student achievement. Unlike previous attempts at linking teachers and student achievement, teachers aren’t rewarded for having a class of high achieving students or penalized for teaching low-performers (Toch and Rothman 2008). In addition, value-added models are less expensive and time-consuming than observations and portfolios, and the results can be easily communicated.

While these strengths address many of the weaknesses found in teacher evaluations today, how precisely value-added models measure teachers’ effectiveness has been the subject of intense scrutiny and criticism among some experts, while being defended by others. All experts agree that value-added models are not perfect measures of teacher effectiveness, but they disagree about the usefulness of value-added models in evaluating teachers.

In addition, many have raised various other concerns about value-added models. For instance, value-added models cannot currently provide individual scores for all teachers, and they often measure teacher effectiveness relative to others in the building instead of to a fixed standard. To read more about these concerns, visit the “Policy Considerations” and “Statistical Caveats” pieces elsewhere in the guide.

While it’s important to understand the debate, value-added models can be used as a more accurate tool in evaluating teachers than the current evaluation systems in place. Value-added models’ caveats simply need to be understood so that their results are used correctly along with other measures of teacher performance in order to gain a more complete picture of a teacher’s true performance. If misused, using value-added models to evaluate teachers may do more harm than good in improving student achievement.

A good, but imprecise measure of effectiveness

Value-added models are one of the best objective measures of teacher effectiveness, but they should be used in concert with other measures of teacher effectiveness because the scores still contain a significant amount of “noise,” or statistical error. That is, a teacher’s value-added score may vary due to the statistical limitations of value-added models, not due to any real change in the true effectiveness of the teacher (McCaffrey, Sass and Lockwood 2008, Sass 2008). For example, multiple studies have found that, of teachers who ranked in the top 20 percent of effectiveness one year, fewer than a third had scores in the top 20 percent the next year (Sass 2008), though the vast majority stayed in the top half (McCaffrey, Sass and Lockwood 2008). So rankings based on value-added estimates change from year to year, and some of that change doesn’t necessarily reflect an actual change in teacher effectiveness (Goldhaber 2010, McCaffrey, Sass and Lockwood 2008, Sass 2008).

Value-added models’ effectiveness ratings vary from class to class, year to year, and from test to test, or when different statistical models are used (Economic Policy Institute 2010). A teacher may be identified as highly effective based on one measure, but be identified as highly ineffective based on another measure, even when evaluating the same year in the same content area (EPI 2010, McCaffrey and Lockwood 2008). Keep in mind that some of the instability may be due to actual change in a teacher’s performance (Glazerman, et al. 2010, Goldhaber and Hansen 2008).

While experts agree that value-added models are imperfect measures, they disagree whether those imperfections preclude value-added data from being a useful tool in evaluating teachers. Dan Goldhaber writes, “The question, however, should not be whether this is good or bad for teachers, but whether the number of incorrect classifications is acceptable given the impact on student learning.” (Goldhaber 2010)

Value-added models correlate with other measures of teacher effectiveness

Despite the variance, value-added results mirror those of other measures of teacher effectiveness – a boost for the theory that changes in a teacher’s value-added score also reflect a change in a teacher’s actual effectiveness (Glazerman et al. 2010, Goldhaber and Hansen 2008). Some recent studies have shown that other measures of teacher effectiveness correlate highly with a teacher’s value-added score.

For example, research has shown that principals can informally identify which teachers have the largest impact on their students’ achievement (Sass 2008). Moreover, recent studies have shown that principal evaluations are a better predictor of teacher value-added scores than are teacher experience and educational attainment (Harris and Sass 2009, Jacob and Lefgren 2008, Sass 2008).

Another study that compared instructional practices to value-added scores found similar results. The study found evidence that teachers who had high value-added scores had stronger instructional practices than teachers with low value-added scores (Grossman, et al. 2010). The study went on to conclude that “Overall, our study provides evidence that value-added measures do more than measure the characteristics of students that arrive in a teacher’s classroom. They also seem to be capturing important differences in the quality of instruction.”

These results are echoed by other current studies, such as an evaluation of the popular TAP system (“The System for Teacher and Student Advancement”), which uses multiple measures to evaluate teachers. The report found that when teachers demonstrate strong instructional skills, their students show higher academic growth (Jerald and Hook 2011).

Even beyond the correlation between teacher practices and their value-added scores is the finding that students know an effective teacher when they see one. In the ongoing Measures of Effective Teaching Project (MET), funded by the Bill & Melinda Gates Foundation, a report on initial findings has found that student perceptions of their teacher in one class are related to the achievement gains of students in other classes taught by the same teacher (Kane and Cantrell 2010).

Furthermore, how well value-added scores predict future teacher effectiveness offers other evidence that changes in those scores may not be due to “noise” alone. A recent study found that teacher value-added scores predict future teacher value-added scores when using multiple years of data (Harris and Sass 2009). Value-added scores not only provide information on what teachers did in the past; they are useful in predicting which teachers will be successful in the future.

Taking these results together provides some evidence that year to year fluctuations in teacher value-added scores are not due to statistical “noise” alone. Even if value-added measures are imprecise, they appear to identify those teachers with the best instructional practices and whose students are making the greatest gains. The high correlation with other ways of measuring teacher effectiveness suggests that using value-added models along with other measures provides a more reliable, effective method of evaluation.

Other industries’ use of statistics in evaluations

Can imprecise measures, no matter how valuable, be used to evaluate teachers? In fact, doing so is common practice in fields outside of teaching, such as professional baseball (Glazerman et al. 2010). A batting average is widely accepted as a measure of a player’s performance. Yet it fluctuates from year to year much as teachers’ value-added scores do (Glazerman et al. 2010). The back of Ted Williams’ baseball card says he had one of the highest career batting averages of all time at .344. But during the prime of his 19-year career, his full-season batting average ranged from .318 (merely good) to .406 (amazing), a very large difference in baseball. Even with these fluctuations, he is still considered one of the best hitters in the game. So managers still use batting averages in evaluating hitters – just as researchers found that even though teacher value-added scores fluctuate, the top-ranked teachers tend to demonstrate the strongest instructional skills (Jerald and Hook 2011).

Baseball is not the only profession to use statistical measures to evaluate its employees. Other professions do too, and the year-to-year correlation of their measures is similar to that of teachers’ value-added scores (Glazerman et al. 2010, Goldhaber 2010). A study on objective measures of job performance found that the year-to-year correlation in high-complexity jobs was similar to that of value-added models (Glazerman et al. 2010, Goldhaber and Hansen 2008, McCaffrey, Sass and Lockwood 2008). Similarly imprecise year-to-year measures used to evaluate other professions include the volume of home sales for realtors; rate of return for investors; output of sewing machine operators; productivity of field-service personnel for utility companies; and insurance sales, just to name a few (Glazerman et al. 2010; Goldhaber 2010).

As researcher Dan Goldhaber noted, “Ultimately, one has to make a judgment call about the risks of misclassification…Value-added models should be compared to the [evaluation systems] currently in place and not a nirvana that does not exist.” (Goldhaber 2010) While value-added models’ imprecision is a concern, the variation in scores should be considered against the current evaluation system, which almost certainly misidentifies a large number of ineffective teachers as “satisfactory.”

Improving value-added models

Even though value-added models provide similar accuracy to other industries’ evaluation models, there are also a number of ways to improve the accuracy of value-added scores. Here are some of them:

Include multiple years of data. To minimize the impact of small sample sizes, which can distort scores, value-added models can be measured over multiple years (Goldhaber 2010, McCaffrey, Sass and Lockwood 2008, Schochet and Hanley 2010). Researchers at Mathematica found that the chance of a value-added model misidentifying a truly average teacher as very effective or very ineffective is 25 percent if three years of data are included, compared with 35 percent if only one year of data is considered (Schochet and Hanley 2010). Ten years of data would reduce the error rate to 12 percent (Schochet and Hanley 2010). Note that these errors occur when value-added scores are used alone. Using multiple measures would reduce the error rate even more.
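
The intuition behind using more years can be sketched in a few lines of Python. The sketch below averages a teacher’s yearly scores and shows how the random noise in that average shrinks as years are added, under the textbook assumption that each year’s noise is independent and equally large. Both function names are illustrative, and this is not the Mathematica model, so it is not meant to reproduce the 35, 25 and 12 percent figures above.

```python
from math import sqrt
from statistics import mean


def multi_year_score(yearly_scores):
    """Average a teacher's value-added scores across the available years."""
    return mean(yearly_scores)


def noise_of_average(single_year_noise, years):
    """Standard error of a multi-year average.

    Assumes each year's noise is independent and has the same spread,
    so the noise shrinks with the square root of the number of years.
    """
    if years < 1:
        raise ValueError("years must be at least 1")
    return single_year_noise / sqrt(years)


# Hypothetical yearly scores for one teacher.
print(multi_year_score([0.30, -0.10, 0.10]))

# Noise remaining in the average, relative to a single year's noise of 1.0.
for years in (1, 3, 10):
    print(years, round(noise_of_average(1.0, years), 2))
```

The shrinkage slows as years are added, which is consistent with the pattern in the figures above: going from one year to three helps more per year than going from three to ten.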

Compare teachers within their school. Including more information about student achievement, such as test scores in all subjects, can also reduce the error in value-added models (Goldhaber 2010). To reduce error due to the non-random assignment of students to schools, districts may consider only comparing teachers within a school, on the theory that these teachers are more likely to be serving similar students (EPI 2010, Schochet and Hanley 2010). However, the drawback is that the data will not be useful in identifying whether less effective teachers are concentrated in certain schools within the district.

Improve the data value-added models are based on. This means better collection and verification of the data that is already collected to minimize missing data and incorrect data. It also means creating a more accurate measure of student achievement. Most current state assessments were not designed to be used with value-added models. For the assessments to capture student growth more effectively, they should be better aligned to the curriculum; measure a wide range of knowledge so they measure very low- and very high-achieving students; and be better aligned from one grade to the next (Economic Policy Institute 2010).

One possible path: using multiple measures

While there are many disagreements about value-added models, all experts agree on one thing: value-added models should be used in concert with multiple measures of teacher quality. Possible other measures include:

  • The school’s overall value-added score
  • Multiple classroom observations from a trained peer or administrator, with feedback
  • Parent/student surveys
  • Teacher portfolios
  • Lesson plan evaluation
  • Peer review

One evaluation system that uses multiple measures is the previously mentioned TAP system. This system evaluates teachers based on observations of teacher practices as well as gains in student achievement (Jerald and Hook 2011). In general terms, 50 percent of teachers’ evaluations are based on a classroom evaluation of the teacher’s skills, knowledge, and responsibilities, while 30 percent is based on the individual teacher’s value-added score and the remaining 20 percent is based on the school’s value-added results (Jerald and Hook 2011). For those teachers who are not able to be evaluated using value-added scores, 50 percent is based on the school’s value-added score, while the other 50 percent is based on classroom evaluation (Jerald and Hook 2011). 
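
As a rough illustration of how such weights combine, the Python sketch below computes a single composite score from the general percentages described above. It assumes all three components have already been placed on a common scale (a 1-to-5 scale is used here purely as an example), and the function and parameter names are hypothetical rather than part of the TAP system itself.

```python
def weighted_evaluation(classroom, school_value_added, teacher_value_added=None):
    """Combine evaluation components using the general weights described above.

    All inputs are assumed to be on the same scale (an assumption of this sketch).
    With an individual value-added score: 50% classroom, 30% teacher, 20% school.
    Without one: 50% classroom, 50% school.
    """
    if teacher_value_added is None:
        return 0.5 * classroom + 0.5 * school_value_added
    return (0.5 * classroom
            + 0.3 * teacher_value_added
            + 0.2 * school_value_added)


# Hypothetical scores on a 1-to-5 scale.
print(round(weighted_evaluation(4.0, 3.0, teacher_value_added=5.0), 2))
print(round(weighted_evaluation(4.0, 3.0), 2))
```

Writing the weights out this way makes clear how much a noisy individual value-added score can move the overall result, which is the policy question the weighting is meant to answer.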

While TAP bases 30 percent of a teacher’s evaluation on his or her individual value-added score, research offers no clear guidance on the best weight to give value-added scores in concert with other evaluation measures. Some experts feel 25 percent is too much (Economic Policy Institute 2010); others feel value-added scores should make up a more significant portion of the evaluation. In general, this leaves the decision to policymakers based on the resources and goals of their schools.

No matter how an evaluation system is set up, the goals for any system are that it (Goldhaber 2010):

  • Be rigorous and substantive while allowing for nuance
  • Provide meaningful teacher feedback
  • Be directly linked to consequences and outcomes
  • Be seen as trustworthy
  • Ultimately result in improved learning and achievement for students

If your district decides to tie teacher evaluations to their compensation, check out the joint statement by the American Association of School Administrators, the American Federation of Teachers, the National Education Association and the National School Boards Association on the 11 guiding principles for teacher incentive plans. Examining the research is just the first step in building a stable and well-accepted process according to the guidelines, which encourage school boards, administrators and unions/associations to work together, and offer specific recommendations such as sustainable funding, multiple measures of evaluation, and collaboration among teachers.


Effectively evaluating teachers has become a hotly debated topic, due to President Obama’s Race to the Top program and the push by private foundations. Research has long shown that the quality of our students’ teachers is vitally important to their, and our, success – but the current evaluation systems fail to accurately identify either effective or ineffective teachers. Instead, most of today’s current evaluation systems simply label nearly all teachers as satisfactory and provide no measure of how much a teacher impacted students’ learning.

Value-added models are tools that attempt to isolate the impact of teachers on students’ achievement. While they have several weaknesses that make them imprecise, they are the best tools available. And other professions – even ones with as much money at stake as professional baseball and real estate – evaluate their employees with similarly imprecise measures. Furthermore, there are statistical techniques to make value-added measures more accurate.

No matter how good they are, value-added results should just be one tool in a teacher evaluation system. How much weight to put on individual teacher value-added results is a policy question researchers have yet to answer. But teachers’ value-added data can be used effectively in concert with school-level value-added scores, results from multiple classroom observations, and other tools, such as student/parent surveys.

By understanding the strengths and weaknesses of value-added models and how to best incorporate them into a comprehensive evaluation system, school boards and other policymakers can create a teacher evaluation system that more accurately identifies the most and least effective teachers. Doing so is the first step in improving the quality of our teachers.

Questions for School Board Members

When considering implementing a teacher evaluation system including value-added data, school boards should first think about policy considerations, then examine the technical capabilities of the district to include value-added data, and finally make decisions about the design and implementation of the new teacher evaluation system and its results.

Policy Questions

  • Why do we want to use value-added results?
  • How will the results of the teacher evaluation be used?
  • Who has access to the value-added data?
  • How will it be disseminated?
  • How will the evaluation help improve a teacher’s performance?
  • How will the evaluation help to improve personnel decisions?
  • Will principals’ evaluations include value-added scores? Will value-added scores be part of the superintendent’s goals?
  • Are teachers, administrators and other stakeholders involved in the design?

Technical Questions

  • Are we able to connect teachers to student test scores?
  • Who will design the value-added model?
  • Where can we look for advice on designing a system that would work best for us?
  • What student data can we include in the value-added model?
  • How will the value-added model account for missing student data?

Design Questions

  • What percent of a teacher’s evaluation will be based on value-added scores?
  • What measures will be used to evaluate teachers without value-added scores?
  • What other measures (observation, portfolios, etc.) will be used to evaluate teachers in concert with value-added scores?
  • How will this affect multi-year tenure (for instance, two-year tenure) if the accuracy of value-added scores improves with more years of data?
  • How will the evaluation account for team teaching, if at all?
  • Should value-added scores be averaged over multiple years?
  • Do you want the value-added model to compare teachers within a single school or compare teachers across the district?
  • How will the value-added model account for differences in student populations and resources across schools?

This report was written by Jim Hull, Center for Public Education Senior Policy Analyst.

Posted: March 31, 2011
