Which student would you say achieved more? A high-scoring student who started out as a high-performer or an average-scoring student who started at the bottom?
Currently, under the 2001 No Child Left Behind Act (NCLB), schools are given credit for the percent of students achieving the state's "proficient" level, regardless of how far students progressed to get to proficient. Recent policy discussions about school and teacher accountability, however, are expanding the proficiency-only view of achievement by recognizing that some students have much farther to go to reach proficiency, even though it remains the minimum target for everyone. This has led policymakers to look at ways to measure academic growth via growth models.
Simply put, a growth model measures the amount of students' academic progress between two points in time. "Value-added" and "growth" models are the statistical methods most often cited for measuring student growth for accountability purposes. But what exactly are these methods? Do they measure what they claim to measure? How should they be used? More important, as a school policymaker, educator, parent, or voter, why should you care?
This guide is intended to answer these and other questions and to help you decide which model, if any, should be used in your state or district. Although we explain growth models within the framework of NCLB, they can be used for a variety of educational purposes—not just for high-stakes accountability as we also illustrate throughout this guide. To help you get the most from this guide, it is organized as follows:
Growth models can be sophisticated tools that help gauge how much student learning is taking place. But like all tools, they are most effective in the hands of those who understand how to use them. This guide illustrates why.
Why are policymakers talking about growth models?
Terms you should know
Status model: A method for measuring how students perform at one point in time. For example, the percent of fourth graders scoring at proficient or above in 2006.
Growth model: A method for measuring the amount of academic progress each student makes between two points in time. For example, Johnny showed a fifty point growth by improving his math score from three hundred last year in the fourth grade to three hundred fifty on this year's fifth grade exam.
Value-Added model: A method of measuring the degree to which teachers, schools, or education programs improve student performance.
Achievement level: Established categories of performance that describe how well students have mastered the knowledge and skills being assessed. For this guide, we use advanced, proficient, basic, and below basic for achievement levels. Proficient or above is assumed to represent the level that meets the state standard.
Scale score: A single numeric score that shows the overall performance on a standardized test. Typically, a raw score (number of questions answered correctly) is converted to a scale score according to the difficulty of the test and/or individual items. (For example, the 200–800 scale used for the SAT.)
Vertical scale scores: Numeric scores on standardized tests that have been constructed so that the scale used for scoring is the same for two or more grade levels. Hence, a student's scale score gain over multiple years represents the student's level of academic growth over that period of time.
Most of us are accustomed to getting information about academic results in the form of a score. Whether it's reported as a number or a letter grade, it tells us basically the same thing—how well students have learned certain subject matter or skills at one point in time. However, a score does not typically tell us how far students grew academically to produce that number or grade. We don't know if the score reflects relatively normal progress, if it represents a huge leap forward, or even if students lost ground.
This poses a real question for policymakers because of the challenges to our present definition of achievement for school accountability purposes. Since the 1990s, education policy has been mostly focused on results as defined by state academic standards, which all students are expected to meet. These standards are the tracks on which school accountability runs. Students are tested on the material described by state standards and schools are held accountable for whether or not students meet those standards. But growth models will shift accountability to include measures for how much progress students make, not just whether they meet state standards.
Most states had some form of standards-based accountability in place when NCLB was signed into law in 2002. However, NCLB took accountability to a new level by requiring schools to meet specific targets each year (called "Adequate Yearly Progress," or AYP) with all groups of students. AYP targets are based on a status model of achievement, meaning that schools are evaluated on the achievement status of their students, in this case the percent of students scoring at or above a "proficient" level of achievement. Each state establishes AYP targets that must culminate in one hundred percent student proficiency in the year 2014.
However, many educators argue that a status criterion alone is an unfair way to measure school effectiveness, particularly for high-poverty urban and rural schools that receive a large proportion of students who enter school already behind their peers who have greater home and community advantages. Under current law, schools can be labeled "in need of improvement" for failing to meet the state's AYP target, even if they produced more sizable gains with their students than more affluent schools. Many educators and an increasing number of policymakers believe that these schools should still be recognized for effecting significant student growth. California's Superintendent of Public Instruction, Jack O'Connell, echoed the sentiments of many educators when he said that "The growth model is a much more accurate portrayal of a school's performance" (Wallis and Steptoe 2007).
In the current NCLB environment, calls for growth models have largely centered on using them for high-stakes school accountability purposes, but some growth models, especially value-added models, can also be used to evaluate teacher or program effectiveness and as a tool for school improvement. Most researchers agree that these statistical tools present a more complete picture of school performance. However, they disagree over how precisely various growth models measure student growth and what role they should play in accountability. In the following sections, we describe each growth model; discuss what is needed to develop, implement, and maintain a growth model; and explore the strengths and limitations of different models.
What are the different types of growth models?
Growth models measure the amount of academic progress students make between two points in time. There are numerous types of growth models but most tend to fall into five general categories:
- Improvement
- Performance Index
- Simple Growth
- Growth to Proficiency
- Value-Added
Each of these categories encompasses several variations depending on the model's purpose and available data. Because growth models are relatively new in education, and different models continue to be developed, these five categories may not capture all models.
Also, we're including two models in this guide that do not necessarily measure the academic progress of individual students over time, which some researchers consider the definition of a growth model. These models, "Improvement" and "Performance Index," generally measure the change in the percent of students meeting a certain benchmark (typically "proficient") but do not measure the amount of growth each individual student made from year to year. However, both have been allowed as growth models for NCLB and state accountability programs, so we discuss them immediately below. Afterward, the rest of this guide refers only to growth models that follow individual students over time.
The improvement model
The Improvement Model compares the scores of one cohort, or class, of students in a particular grade to the scores of a subsequent cohort of students in the same grade. This model is based on achievement status—for example, students scoring at proficient or above—but the difference in scores over time would be considered growth in the school. For example, if fifty-five percent of last year's fourth graders scored at or above proficient and sixty percent of this year's fourth graders reached proficient, then, using the Improvement model, this school showed five percentage points in growth for fourth grade scores.
Figure 1: How an Improvement Model works
In this hypothetical school, the performance of this year's fourth graders is compared to last year's fourth graders. The difference is the "improvement" or change.
| |Last year's 4th graders|This year's 4th graders|Change|
|---|---|---|---|
|Percent at or above proficient|55%|60%|+ 5 pts.|
Sound familiar? Many of you will recognize this model as NCLB's "Safe Harbor" provision. It does not measure growth among individual students or even the same cohort of students. The model actually compares two totally distinct groups of students: in this example, last year's fourth graders and this year's fourth graders. The benefit of the improvement model is that it is fairly easy to implement and understand. While it doesn't track how individual students progress, it provides some indication of whether more students in a particular grade level are getting to proficiency from year to year. However, the change in the percent of students reaching proficiency may be due to different characteristics of the students in each cohort rather than a change in school effectiveness. For example, the difference between last year's fourth graders' performance and this year's could have been caused by an increase in class sizes after a nearby school in the district closed.
Performance index model
Most Performance Index models are status type models that give credit to schools for getting more students out of the lowest achievement levels even if they haven't reached proficiency. Just as with Status models, Performance Index models can be used as an Improvement Model. And just as with Improvement models they do not necessarily measure the academic growth of individual students, but account for change in the schools' performance from one year to the next. There is, however, one important distinction: Most Index models currently used by states recognize change across a range of academic achievement levels, not just at proficient or above.1 As the example below shows, the school received partial credit for the students scoring at the basic level but not below basic level.
In statistics, an index combines several indicators into one. Grade Point Average (GPA) is an index that we are all familiar with. It covers several indicators (the grades students earn in various courses) and it is weighted in favor of the highest grades: an "A" is worth four points, a "B" is three points, a "C" is two points, and so on. Figuring the GPA is a matter of elementary math: Add up the grade points, divide by the number of courses, and the result is the GPA. The GPA shows how close students come to earning A's across their classes, with straight A's earning a perfect 4.0 GPA.
Performance Index models developed by states for school accountability follow this same general principle. Think of it as the GPA for a school where the goal is to determine how close the school comes to getting all students to proficiency. It does so by measuring student performance based on the percent of students scoring in each achievement level. More points are then awarded for students scoring at the highest levels, just as students earn more points for higher grades. The points are averaged for the school and the result is the index.
Figure 2: How a Performance Index Model works
This hypothetical school is in a state using an Index Model for school accountability. The index awards points for achievement as follows:
|Achievement level|Points awarded|
|---|---|
|Students at proficient and above|100 pts.|
|Students at basic|50 pts.|
|Students at below basic|0 pts.|
A perfect score of one hundred points means that all students reached proficient. Our school would earn sixty-eight points as shown in the table. Using an Improvement Model, this same school would earn only fifty-five points for the percent of students who reached proficient.
Performance Index Model
|Achievement level|This year's 4th graders|Points|
|---|---|---|
|Proficient and above|.55 X 100 pts.|55|
|Basic|.25 X 50 pts.|12.5|
|Below basic|.20 X 0 pts.|0|
|Index score for school| |67.5, reported as 68|
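The index arithmetic in Figure 2 can be sketched in a few lines of Python. This is only an illustration: the function and variable names are made up for this example, and the point values (100, 50, and 0) are the ones used in the hypothetical school above, not a rule that every state follows.

```python
# Points per achievement level, as in the Figure 2 example.
POINTS = {"proficient_or_above": 100, "basic": 50, "below_basic": 0}


def performance_index(percent_by_level, points=POINTS):
    """Average points per student, given the percent of students at each level."""
    if sum(percent_by_level.values()) != 100:
        raise ValueError("percentages must add up to 100")
    total = sum(percent_by_level[level] * points[level] for level in percent_by_level)
    return total / 100


def percent_proficient(percent_by_level):
    """Status-style measure: only students at proficient or above count."""
    return percent_by_level["proficient_or_above"]


# This year's fourth graders in the hypothetical school.
fourth_grade = {"proficient_or_above": 55, "basic": 25, "below_basic": 20}

index_score = performance_index(fourth_grade)
status_score = percent_proficient(fourth_grade)
print(f"Index score: {index_score}")
print(f"Percent proficient: {status_score}")
```

The index comes out at 67.5, which the guide reports as sixty-eight points, while the proficient-only measure gives the same school fifty-five. The gap between the two numbers is the partial credit earned for students at the basic level.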
When comparing performance from year to year, a Performance Index will include changes that occurred at the low end of student achievement and can also be designed to include changes among students scoring at proficient or better.
Performance Index models can mitigate one of the frequently cited cautions about using Status models in accountability systems. That is, critics charge that evaluating schools only on the percent of students at proficient might motivate schools to concentrate their efforts on the so-called "bubble kids" (those students who score just above or just below the proficient level) to the possible neglect of students on the lowest and the highest ends of the performance scale. Under a Performance Index Model, schools have more incentive to concentrate on more students below proficiency, not just those on the cusp.
Although Performance Index models can be developed to give credit to schools for moving students from proficient to advanced, most do not because NCLB does not give credit for growth for students already above the proficiency level. However, changes in the index from year to year are a better indicator of how well schools are educating students who began at the lowest achievement levels—not just those on the proficient bubble—than the current status models.
As of 2006, twelve states had adopted some form of Index Model for NCLB purposes: Alabama, Louisiana, Massachusetts, Minnesota, Mississippi, New Mexico, New York, Oklahoma, Pennsylvania, Rhode Island, South Carolina, and Vermont (Sunderman 2006).
From a practical standpoint, there is one good thing about Performance Index models. Most don't require sophisticated data systems. Keep in mind, however, that these models generally don't measure the growth of individual students from year to year. They also don't capture change within each achievement level. For example, if a state set a cut score of two hundred for "basic" and three hundred for "proficient," schools wouldn't get credit for students whose scores improved from two hundred to two hundred ninety-eight. They would get credit for students who improved from two hundred ninety-nine to three hundred one. Establishing more achievement levels would help to capture these changes, making the model a more accurate measure of growth.
Simple growth model
In most cases, simple growth models don't require a statistician to explain or even compute data. Typically, growth is just the difference in scale scores from one year to the next. But unlike the Improvement and most Performance Index models, which compare successive cohorts at the same grade level (fourth graders in our hypothetical school), Simple Growth models actually document change in the scores of individual students as they move from grade to grade. For example, if a fourth grader in school X scored three hundred fifty last year and four hundred on this year's fifth grade assessment, the student showed fifty points of growth. The growth is calculated for each student who took both the fourth and fifth grade tests and then averaged to calculate the school's growth.
Figure 3: How a Simple Growth Model works
This hypothetical school has five fifth graders who took the fourth grade assessment last year. The change in scores is calculated in the table below for each student and a school average is reported.
[Table: Simple growth model. Columns: last year's 4th grade scale score; this year's 5th grade scale score]
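As a rough sketch of the calculation, the Python below subtracts each student's fourth grade score from the same student's fifth grade score and averages the gains. The student labels and scale scores are hypothetical and are not the values from the figure; only student A's 350 and 400 echo the example in the text. Students without a score in both years are left out of the average.

```python
from statistics import mean

# Hypothetical scale scores keyed by student; not the values from the figure.
scores_4th = {"A": 350, "B": 300, "C": 410, "D": 280, "E": 390}
scores_5th = {"A": 400, "B": 340, "C": 430, "D": 350, "F": 405}


def simple_growth(prior, current):
    """Per-student gain, limited to students tested in both years."""
    tested_both_years = prior.keys() & current.keys()
    return {s: current[s] - prior[s] for s in sorted(tested_both_years)}


gains = simple_growth(scores_4th, scores_5th)
school_growth = mean(gains.values())

for student, gain in gains.items():
    print(f"Student {student}: {gain:+d} points")
print(f"School average growth: {school_growth}")
```

Student E (no fifth grade score) and student F (no fourth grade score) do not appear in the result at all, so the school average describes only the four students who were tested twice.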
One drawback of this model? Only those students who took the tests in both years are included in the school's growth calculation. Another is that the points themselves provide no information (DePascale 2006). A fifty point gain may or may not mean the student has met a set target or is on track to meet it in the future. For Simple Growth models to be useful, experts, educators, and in many cases, policymakers must make informed judgments about how much growth is enough.
Growth to proficiency model
While Simple Growth models measure individual student growth, they do not indicate if students are developing the skills they need to meet state standards of proficiency. Growth to Proficiency models (also known as Growth to Standards or On-Track models) are designed to show whether students are on track to meet standards for proficient and above. As such, they have become popular, mainly in response to the U.S. Department of Education's NCLB Growth Model Pilot program.
At this writing, the federal pilot program is allowing nine states to use Growth models for NCLB accountability. However, these nine states had to meet a list of conditions to do so, most notably that one hundred percent of their students are still expected to be proficient by 2014, as the 2001 law states. This provision precluded from consideration some Growth models that were already in use, such as those used in North Carolina and Tennessee, because they did not require students to reach a certain benchmark in a certain amount of time.
Nearly twenty states submitted plans, including North Carolina and Tennessee, which revised their models in accordance with the federal guidelines. Several states developed a hybrid of Growth and Status models, or Growth to Proficiency. Although there are several variations, the key ingredient across all Growth to Proficiency models is that schools get credit when a student's progress keeps them on pace to reach an established benchmark (usually proficient) at a set time in the future, typically within three to four years or by the end of high school (Davidson and Davidson 2004).
The advantages of this model are that (1) schools are recognized for producing gains in student performance even if their students score below proficient and (2) there is more incentive to focus on all students below the proficiency level, not just the "bubble kids." There are even models that give incentives to schools to focus on students above the proficiency level. Tennessee developed its model so schools must ensure that students who are already scoring above proficient are still on pace to remain proficient in the coming years.
However, without targets, the model itself cannot determine which students are on track for proficiency. No matter what model is chosen, Growth or Status, it is up to policymakers to set goals to determine how much students should know and when they should know it. Then the model can be designed to determine which students are meeting those targets.
Figure 4: How a Growth to Proficiency Model works
Our hypothetical school has five students whose growth targets were established at the end of fourth grade based on meeting proficiency in seventh grade. Growth targets are based on the yearly growth needed to hit the seventh grade proficient score:
- Fifth grade proficient score = 400
- Seventh grade proficient score = 500
The goal for this year (NCLB Annual Measurable Objective, or AMO) is that seventy-five percent of the students must hit their targets. If students don't score at the proficient level they must hit their growth target for the school to make AYP.
[Table: for each of the five students, last year's 4th grade scale score achieved; this year's 5th grade scale score achieved; whether the student scored proficient; the student's growth target; whether non-proficient students hit their growth target; whether the student made AYP]
In this example, three of five students met the proficient target and therefore do not have to meet a growth target. Two students did not meet the proficiency target: One met his growth target while the other student did not meet hers. This means four out of five students met AYP, or eighty percent, which exceeds the seventy-five percent goal for this year (AMO). Therefore, this school made AYP.
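The logic of Figure 4 can be sketched as follows. Two things here are assumptions made for illustration: the five students' scores are invented (chosen only so the outcome matches the example, with three proficient students, one who hits a growth target, and one who misses), and the growth target is computed as equal yearly steps from the fourth grade score to the seventh grade proficient score. Actual state models may set targets differently.

```python
FIFTH_GRADE_PROFICIENT = 400
SEVENTH_GRADE_PROFICIENT = 500
YEARS_TO_TARGET = 3  # from the end of 4th grade to 7th grade
AMO = 0.75           # share of students who must make AYP this year


def growth_target(score_4th):
    """5th grade target, assuming equal yearly steps toward 7th grade proficiency."""
    yearly_step = (SEVENTH_GRADE_PROFICIENT - score_4th) / YEARS_TO_TARGET
    return score_4th + yearly_step


def student_makes_ayp(score_4th, score_5th):
    """Proficient students count outright; others must hit their growth target."""
    if score_5th >= FIFTH_GRADE_PROFICIENT:
        return True
    return score_5th >= growth_target(score_4th)


# Hypothetical (4th grade score, 5th grade score) pairs, not the figure's values.
students = {
    "S1": (380, 420),
    "S2": (400, 410),
    "S3": (360, 405),
    "S4": (290, 365),  # not proficient, target 360, hit
    "S5": (320, 350),  # not proficient, target 380, missed
}

results = {name: student_makes_ayp(*scores) for name, scores in students.items()}
share_making_ayp = sum(results.values()) / len(results)
school_makes_ayp = share_making_ayp >= AMO
print(results)
print(f"{share_making_ayp:.0%} of students made AYP; school makes AYP: {school_makes_ayp}")
```

With these invented scores, four of five students count toward AYP, which clears the seventy-five percent objective, the same result as in the example above.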
Note: A Value-Added Model is one type of growth model, but not all growth models are Value-Added. As a growth model, Value-Added measures change in individual students' performance over a period of time. But unlike other Growth models, Value-Added also measures how much a particular instructional resource, such as a school, teacher, or education program, contributed to that change (Hershberg, Simon and Lea-Kruger 2004). It's a distinction that often gets lost in discussions about measuring growth and somehow the two terms have become intertwined.
Value-added model
A Value-Added Model (VAM) is typically the most statistically complex of all Growth models. However, if used correctly, VAMs are quite possibly the most powerful statistical tools available for evaluating the effectiveness of teachers, schools, and education programs. Value-added models are designed to isolate the effects of outside factors—such as prior performance or student characteristics—from student achievement in order to determine how much value teachers, schools, and/or programs added to students' academic growth.
The calculations for value-added models take many forms and cannot be easily illustrated due to their complex statistical methodology. Instead, we present a simplified version of a VAM calculation in order to illustrate the basic principles. This Value-Added Model looks at individual students' past performance to predict how they should perform in the upcoming year. Usually an individual's predicted score is based on the average score of similar students in previous years who shared similar past performance patterns and characteristics.
If students perform above their predicted performance, they are considered to have shown positive growth. If they perform as predicted, then they are considered to have made expected growth. If they perform below their predicted performance, then they are considered to have negative growth. Under a VAM, a student could show improvement but still make "negative growth" if the improvement was less than predicted.
Figure 5: How a Value-Added model works
Say a VAM is being used to evaluate the effectiveness of our hypothetical school. Each of the five students' fifth grade scores is compared to the student's predicted score based on fourth grade performance. The difference determines whether students made positive, negative, or expected growth and becomes an indicator of the value this school added to its students' learning. Value-Added makes no statement about high or low scores, only the amount of student gains.
[Table: Fifth grade predicted growth based on fourth grade performance. Columns: last year's 4th grade scale score; this year's 5th grade scale score]
In this example, the school, on average, produced expected gains. However, the growth was largely due to more than expected growth among low-performers. High-performing students did not make their predicted growth. This school should evaluate what it is doing with its high-performers to make sure they make progress, even as it continues to do what seems to be helping its low-performers.
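The comparison of actual to predicted scores can be sketched as below. This is a deliberately simplified illustration, not a real value-added model: the predicted scores are simply given as inputs, whereas an actual VAM would estimate them statistically from past performance and student characteristics. All names and numbers are hypothetical and were chosen to mirror the pattern described above.

```python
from statistics import mean

# Hypothetical (predicted 5th grade score, actual 5th grade score) pairs.
students = {
    "low_1": (320, 340),
    "low_2": (340, 355),
    "mid_1": (400, 400),
    "high_1": (460, 445),
    "high_2": (480, 460),
}


def classify(predicted, actual):
    """Label growth relative to the prediction, not relative to last year's score."""
    if actual > predicted:
        return "positive"
    if actual < predicted:
        return "negative"
    return "expected"


differences = {name: actual - predicted for name, (predicted, actual) in students.items()}
labels = {name: classify(*scores) for name, scores in students.items()}
school_value_added = mean(differences.values())

for name in students:
    print(f"{name}: {differences[name]:+d} ({labels[name]} growth)")
print(f"Average difference from prediction: {school_value_added}")
```

Here the school average is zero, meaning expected growth overall, even though the two low-performers beat their predictions and the two high-performers fell short of theirs. Looking only at the average would hide that split.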
VAMs isolate the effects of teachers, schools, and education programs by using data from each of these categories. From there, statistical tools such as Hierarchical Linear Models (HLMs) are used to separate the teacher, school, or program effectiveness from other factors that may have influenced the change in student achievement. Tennessee is the well-known VAM pioneer, but states such as Pennsylvania and Ohio and various districts across the nation have been using VAM data in some form.
There are many variations to value-added models but they all share the same goal of measuring student growth and determining who or what is responsible for it.
What is needed to implement a Growth Model?
No matter which growth model is used and for what purpose, some basic features should be in place to design and implement a valid and reliable measure of student growth (Goldschmidt, et al. 2005). These features include:
- A statement of policy intent
- Properly designed, annual tests
- Data systems to collect, store, and analyze the data
- Statistical expertise
- Professional development
- Transparency and good communication
Although these features do not garner much attention in the growth model discussion, they are the foundation on which most growth models should be built.
Statement of policy intent
Policymakers should have a clear statement of intent when thinking about adopting a growth model because different purposes need different models (Goldschmidt, et al. 2005, McCaffrey, et al. 2003). Deciding exactly what is to be measured and how this information will be used is the first step. For example, a growth model used for high-stakes school accountability purposes may look different from a model used to identify professional development needs. A "Growth to Proficiency Model" might fill the first purpose, especially if the goal is to evaluate student progress toward an established performance target. On the other hand, decisions about professional development would benefit from understanding the effect teachers have with all their students, high- and low-achievers alike. A Value-Added model would probably work best for this purpose because it provides information about gains made by different students that are attributable to schools or teachers, regardless of whether they are high- or low-performers. Articulating a clear goal will help ensure that the growth model you design is the best fit.
Properly designed tests
To get the most accurate results, the tests used for measuring growth should have three key characteristics: (1) they document year-to-year growth on a single scale, (2) they measure a broad range of skills, and (3) the test content is aligned with state standards.
Tests that document year-to-year growth on a single scale
Although not a requirement, the best tests to use for measuring yearly growth are vertically aligned and scaled. This means that each successive test builds upon the content and skills measured by the previous test. It assures that tests taken over multiple grade levels show a coherent progression in learning. For example, the fifth grade math test should represent what a student should have learned in the year since taking the fourth grade test.
Think of a growth chart in a pediatrician's office. Children's height is measured against it and recorded during yearly check-ups. The measurements change as they grow, but the chart remains the same. The chart is also based on physical averages; doctors don't use a twenty-foot chart to measure human growth. Yet it still accommodates the lower and upper extremes of children's height.
Tests that are vertically scaled work the same way. Knowledge is gained and students are tested over several grade levels, but students are scored against the same scale. The range of the scale varies depending on the range of knowledge the tests are measuring and the number of grade levels they are addressing.
NCLB now requires states to conduct annual testing in grades three to eight, which takes care of one requirement of vertical scaling—annual tests. However, being tested every year doesn't necessarily mean that the change in scores reflects a year's growth in student achievement. That is where vertical scaling comes in. Tests are developed for different grade levels—for example, for fourth and fifth—but scored on the same scale. It's as though the student took the same test covering the range of skills in both grades. This way, educators are assured that a change in scores represents a change in student achievement instead of differences in the tests themselves.
Although it is possible to create some growth models without vertically scaled tests, researchers disagree about how accurate such models are and about the technical details of building them (CCSSO 2007). When it's done without vertical scaling, testing experts use statistical techniques to approximate the change in growth from year to year (Gong, Perie, and Dunn 2006). Lacking a vertical scale, the data is typically converted to a normed scale (McCall, Kingsbury, and Olson 2004), meaning statisticians compare students' performance to each other. Converting to a normed scale, sometimes called norming, is like teachers who grade on a curve by awarding "A's" to the highest ten percent, "F's" to the lowest ten percent, and "C's" to the bulk of their students. By definition, someone will always be above average and someone will always be below average on a normed scale, regardless of whether students are meeting standards or not. So policymakers and educators need to consider whether such a normed scale approach is what they want in a standards-based system.2
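To see why a normed scale can hide whether students are meeting standards, consider the sketch below. It uses z-scores (distance from the group average, in standard deviations) as one simple way to norm scores. That choice is an assumption for illustration and is not necessarily the technique used by the researchers cited above. The scores are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Express each score relative to the group's average and spread."""
    average = mean(scores)
    spread = pstdev(scores)
    return [(score - average) / spread for score in scores]


# Hypothetical scale scores for the same five students in two years.
last_year = [300, 350, 400, 450, 500]
this_year = [score + 50 for score in last_year]  # every student gains 50 points

normed_last_year = z_scores(last_year)
normed_this_year = z_scores(this_year)

for before, after in zip(normed_last_year, normed_this_year):
    print(f"{before:+.2f} -> {after:+.2f}")
```

Every student gained fifty points, yet each student's normed score is unchanged, because each one holds the same position relative to classmates. The normed average is zero in both years, so some students are always above it and some below, whatever happened to actual achievement.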
Tests that measure a broad range of skills
Some researchers say that, to measure growth effectively, tests should measure the full range of skills students may possess at that grade level, meaning that basic and advanced skills should be measured in addition to the skills that define proficient (Sanders 2003, McCaffrey, et al. 2003). For example, a test that focuses on basic skills would be ineffective at measuring the growth of Advanced Placement (AP) students because the test does not include high-end content and most AP students would likely get all the answers correct. This is what researchers refer to as a ceiling effect. Conversely, there is a floor effect when a test of advanced skills cannot show what a poor performance means about the test-taker's skills. Using the AP example again, if students in remedial math took the AP Calculus exam, most would likely get all the questions incorrect, masking any growth they made in mastering basic math.
Here's another way to look at it: Picture the body of students' knowledge as a football field that is one hundred yards long with the fifty yard line being "proficient." To get an accurate view of the range of student performance, the assessment would have to measure the full one hundred yards. However, many state tests currently in use are focused on the area near the fifty yard line, because their aim is to determine if a student is proficient or not. A low score on such a test means that the student is in his or her own team's territory and therefore not proficient, but it doesn't show whether the low-scoring student is on the ten or forty yard line. Likewise, a high score could mean the student scored a touchdown or just barely made it into field-goal range—if we didn't measure the broad range of skills, we just wouldn't know.
Because tests focused on the proficient level are unlikely to capture progress near the end zones, they are not the best instruments for capturing growth. On the other hand, tests that can distinguish how low a low performance is, and how high a high performance is, are capable of showing when students improve from the one yard line to the twenty-five yard line, even if they have yet to cross midfield.
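The football field analogy can be turned into a small sketch of floor and ceiling effects. The assumption here, made only for illustration, is that a test reports a student's position accurately inside the range it covers and reports the nearest edge of that range otherwise. The forty-to-sixty range for the narrow test is hypothetical.

```python
def observed_position(true_position, floor, ceiling):
    """A test cannot report below its floor or above its ceiling."""
    return max(floor, min(ceiling, true_position))


def measured_growth(before, after, floor, ceiling):
    """Growth as the test sees it, after floor and ceiling effects."""
    return observed_position(after, floor, ceiling) - observed_position(before, floor, ceiling)


NARROW_TEST = (40, 60)  # focused near the fifty yard line ("proficient")
BROAD_TEST = (0, 100)   # covers the whole field

# A student who improves from the one yard line to the twenty-five yard line.
low_growth_narrow = measured_growth(1, 25, *NARROW_TEST)
low_growth_broad = measured_growth(1, 25, *BROAD_TEST)

# A student who improves from the seventy to the ninety-five yard line.
high_growth_narrow = measured_growth(70, 95, *NARROW_TEST)
high_growth_broad = measured_growth(70, 95, *BROAD_TEST)

print(f"Low performer: narrow test {low_growth_narrow}, broad test {low_growth_broad}")
print(f"High performer: narrow test {high_growth_narrow}, broad test {high_growth_broad}")
```

On the narrow test both students appear to have made no progress at all, while the broad test shows gains of twenty-four and twenty-five yards.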
Test content is aligned with state standards
Researchers also emphasize that tests used for growth models should be aligned with standards (Braun 2004, Davidson and Davidson 2004). A teacher who effectively teaches the content in the standards and the students who learned it should not be penalized because the tests did not accurately measure the standards (Braun 2004). Even though this is a problem with today's status models, the problem is exacerbated when comparing results at two points in time. Without proper alignment between state standards and tests, the results of any growth model will be meaningless no matter how technically sound the model is. However, according to the American Federation of Teachers, many current state tests are not adequately aligned with their state standards (AFT 2006).
Data systems to collect, store, and analyze the data
It doesn't matter whether your assessments are vertically aligned and scaled, effectively measure a full range of knowledge, and are properly aligned to standards if there is nowhere to store the data. In recent years most, if not all, states have improved their data systems in response to NCLB (McCall, Kingsbury, and Olson 2004). However, in many cases, the data systems required by growth models, especially value-added models, tend to be more expensive (Goldschmidt, et al. 2005), a cost that should be taken into consideration if your state or district is looking to implement a growth model.
To measure individual student growth you need to set up a longitudinal data system. It sounds complicated but it's really not. A longitudinal data system has the ability to follow the same students as they move at least from grade to grade. Preferably it can also follow students from school to school and even from district to district within a state. Following students across state lines is not possible at the present time.
Most current data systems are designed to collect and store grade level data (not student level data) to determine what percent of students reached a certain benchmark. But they are not able to follow those students to the next grade to determine how their performance changed.
The key ingredient in a data system for growth models is a unique student identifier, or more commonly, a student ID number. A student ID works much like a Social Security number. Each student is assigned a unique number when they enter the school system and it remains with them throughout their academic career, even when they change schools or move to a new district in the state. Each student's test scores and characteristics such as race and socioeconomic status should be included with their student ID in the data system (Blank and Cavell 2005). This information is important for accountability based on student subgroups and for value-added models. Other information worth collecting to monitor program effectiveness includes the courses the student has taken, the grades in those courses, and the educational programs in which the student participated.
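To make the idea concrete, here is a minimal sketch in Python of how a unique student ID lets test records from different years be linked, even when a student changes schools. The record layout, ID format, school names, and scores are all hypothetical and invented for illustration; they are not drawn from any state's actual data system.

```python
from collections import defaultdict

# Hypothetical test records: (student_id, school_year, school, scale_score).
# All IDs, schools, and scores are invented for illustration.
records = [
    ("S001", 2006, "Adams Elementary", 410),
    ("S001", 2007, "Baker Elementary", 445),  # same student, new school
    ("S002", 2006, "Adams Elementary", 380),
    ("S002", 2007, "Adams Elementary", 402),
    ("S003", 2007, "Baker Elementary", 390),  # no 2006 record on file
]


def link_by_student(records):
    """Group records by student ID so each student's scores form a history."""
    history = defaultdict(dict)
    for student_id, year, _school, score in records:
        history[student_id][year] = score
    return history


def gains(history, start_year, end_year):
    """Return the score change for students tested in both years."""
    return {
        student_id: scores[end_year] - scores[start_year]
        for student_id, scores in history.items()
        if start_year in scores and end_year in scores
    }


history = link_by_student(records)
growth = gains(history, 2006, 2007)
print(growth)
```

Because the ID travels with the student, the move from one school to another does not break the link between the two scores. The student with only one year on file is simply left out of the gain calculation, which is one way missing data can arise.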
According to the Data Quality Campaign it costs between one and three million dollars just to develop and deploy a unique student identifier system (Smith 2007). These dollars primarily represent the cost to build technology systems associated with assigning student IDs, verifying that no students have more than one ID, sharing the ID with districts, updating state data systems, and paying vendors and contractors (Smith 2007). But this cost does not include data collection at either the state or district level. That cost will vary by state depending on the size of the state and the data systems previously in place. However, almost all states will have already incurred this cost and will have student ID systems in place by the end of the 2007–2008 school year (Smith 2007).
Some states also assign unique IDs to teachers to match teachers and students. This helps schools monitor teacher effectiveness, which can be valuable information for school improvement planning (Doran 2003). However, including teacher data should not be undertaken lightly. Before moving ahead, teachers should be included in the discussion about what data is collected and particularly how it will be used. Other school characteristics, such as specific academic programs, can also be coded and monitored for effectiveness.
Once a system is in place to collect individual student data there must be somewhere to store and analyze it. Databases that can store the vast amount of data needed to implement a growth model are usually much larger and more expensive than the databases needed for most current status model systems (Goldschmidt, et al. 2005). This is because growth models require data for every student while most current status model systems only need grade level data like the percent of students who score proficient. You will also need to invest in specialized software that can handle the calculations required by your particular growth model.
According to testimony by Dr. Chrys Dougherty, Director of Research at the National Center for Education Accountability (NCEA), to the House of Representatives' Committee on Education and Labor, only twenty-seven states will have the capacity to implement a growth model in the 2007–2008 school year. He also said that the number is likely to grow to forty states in the following three years. The elements in data systems missing most often are: (1) statewide student identifiers, (2) the ability to link students' test score records over time, and (3) information on untested students and the reasons they were not tested (Dougherty 2007). States need to add these elements and collect those data for at least two years, preferably three, before even the most basic growth model can be implemented (Blank and Cavell 2005, Davidson and Davidson 2004, Doran 2003).
Just as engineers are needed to build a bridge, psychometricians and statisticians are needed to build a growth model that accurately measures student achievement growth. Creating an effective model can be a complex technical process requiring adequately trained psychometricians and other statistical experts (Goldschmidt, et al. 2005). These experts are qualified to design a growth model that is aligned with the statement of policy intent using the data they have available to them. Sometimes these experts will already be in-house but if not, they need to be hired (Goldschmidt, et al. 2005).
Although the type of training will differ depending on the purpose of the growth model, some researchers believe it is vital for all stakeholders to receive training so they understand how to use this new information effectively (Drury and Doran 2003). Stakeholders include teachers, principals, school board members, central office administrators (Drury and Doran 2003), and students and parents. Without buy-in from these groups, the growth model's usefulness will be seriously limited (Drury and Doran 2003). As William Sanders noted, "You can do the best analysis in the world but if you don't have people trained and coached to use the information, not much is going to happen" (Schaeffer 2004).
Clear and open communication
Growth models can be complex and very difficult for non-statisticians to understand; therefore, some say professional development is a key factor for gaining buy-in. This is especially true for value-added models because there is no simple way to isolate the impact of teaching on student learning (Hershberg, Simon, and Lea-Kruger 2004). Others, such as University of Massachusetts economics professor Dale Ballou, believe that complex value-added models fail one of educators' most important criteria: that the models be transparent (Ballou 2002). For example, he says that few, if any, of those evaluated by these sophisticated models will understand why two schools (or teachers) with similar achievement gains receive different ratings, even though this is a possible outcome in a Value-Added system (Ballou 2002).
Others aren't so skeptical. These researchers argue that complexity isn't necessarily a drawback, noting that not everyone who uses a personal computer or drives a car understands how it works (Hershberg, Simon, and Lea-Kruger 2004). However, it is necessary that all stakeholders, especially teachers and administrators, know what is being measured and, more important, what the results mean.
Even if states can't provide enough transparency for non-experts to understand the model, they still need to assure stakeholders that the model is statistically sound and measures what it is intended to measure. Some researchers recommend that each state open its model's methodology to outside expert reviewers (Doran 2004), as states are required to do in the NCLB Pilot program. Educators and the public at large are likely to be more open to accepting and using results from a model that is open to review rather than kept a secret.
Behind every great idea there is a need for money to support it. Growth models are typically more expensive to administer than most current status models (Goldschmidt, et al. 2005). However, the cost will vary considerably from state to state or district to district depending on the state's or district's size and on what elements may already be in place (Goldschmidt, et al. 2005). Designing new tests, implementing a new longitudinal data system, hiring statistical experts and additional staff to collect and analyze the data, and providing effective professional development can be expensive enterprises. Keep in mind there is also likely to be a cost to districts for collecting and reporting data for each individual student to the state. But the rewards of implementing a growth model may well outweigh the costs, particularly if some of these elements already exist or can be easily refitted to serve a growth model.
Although it costs between one and three million dollars to implement a student ID system, which is required to calculate individual student growth, there is an additional cost to actually make the calculations. While there isn't much information on what it costs states to develop and run the calculations for a growth model, Ohio contracts with the SAS Institute in North Carolina for such purposes at two dollars per student. Ohio provides the necessary student level data to the SAS Institute, and the Institute performs the Value-Added calculations and provides the results to each school via a secure website. Ohio also contracts with Battelle for Kids to provide training on using Value-Added data effectively in the classroom. For approximately three million dollars per year Battelle works with Ohio using a train-the-trainer system. Ohio is just one example, but it provides some indication of the costs of implementing a growth model. There are likely additional costs at both the state and district levels that have not been included in the cost of implementing a growth model, such as staff time for collecting and using the data.
What are the limitations of growth models?
Growth models hold great promise for evaluating schools on the amount of growth their students achieve. But growth models, especially value-added models, are relatively new in the education realm and their limitations are still being debated within the research community. Please note, however, that the research community is almost united in the opinion that growth models provide more accurate measures of schools than the current status models alone. Moreover, current status models also suffer from many of the same limitations. While none of these issues should preclude states or districts from considering implementing a growth model, they do need to be acknowledged so the model developed will be the most effective tool for its purpose.
The limitations can be described as follows:
Measures of achievement can be good, but none is perfect
This guide doesn't debate the pros and cons of standardized testing; there are plenty of publications that do. But it is necessary to discuss limitations and how they can affect the reliability of a growth measurement.
As discussed earlier, it's important to use tests that are appropriate for growth models. Growth can be measured without tests, but any tests used should have the following features:
- They cover the lower and upper ranges of student performance, rather than cluster test content around the knowledge and skills that constitute "proficient."
- They are vertically aligned and scaled to more accurately measure student achievement from year to year.
- They are aligned with state standards.
Unfortunately, while some tests are clearly better than others, there is no perfect measure of achievement (Ballou 2002, McCaffrey, et al. 2003), a statement with which even the most ardent supporter of standardized testing would agree.
One of the problems with tests used for growth models is that gain scores over time tend to be what statisticians call "noisier" than measures of achievement at a single point in time. By this, statisticians mean that gain scores tend to fluctuate even though a student's true ability typically does not change much from year to year (Ballou 2002, Doran 2004). This happens because on any given test a student's performance is a result of his or her true ability plus random influences, such as distractions during the test and the selection of items. Statisticians call these effects measurement error. When scores from the two tests are subtracted from each other, as in Simple Growth models, the measurement errors from both tests carry into the gain score, so the "true" performance becomes less clear (Ballou 2002).
There are statistical adjustments to minimize "noise," such as including scores from other subjects and previous years (Raudenbush 2004b). Another way to minimize the effect of "noisy" data is to create rolling averages by averaging growth over multiple years to provide a more stable picture of performance (Drury and Doran 2003, Raudenbush 2004a). However, such adjustments will add to the complexity of the growth model and may make it difficult to explain to educators why two schools (or teachers) with similar achievement gains received different ratings of effectiveness (Ballou 2002).
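The effect of measurement error on gain scores, and the steadying effect of averaging over several years, can be illustrated with a small simulation. This is a sketch under stated assumptions, not a description of any state's model: the error size, the true gain, the starting score, and the number of students are invented, and every student is assumed to have the same true gain so that all of the spread in observed gains comes from measurement error.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

ERROR_SD = 10.0    # assumed measurement error (standard deviation) per test
TRUE_GAIN = 20.0   # assumed true yearly gain, the same for every student
N_STUDENTS = 5000  # assumed number of simulated students


def observed_scores(n_tests):
    """Observed scores for one student over consecutive yearly tests."""
    return [
        400.0 + TRUE_GAIN * year + random.gauss(0, ERROR_SD)
        for year in range(n_tests)
    ]


def yearly_gains(scores):
    """Differences between each pair of consecutive yearly scores."""
    return [after - before for before, after in zip(scores, scores[1:])]


# One gain score per student, computed from two tests.
one_year = [yearly_gains(observed_scores(2))[0] for _ in range(N_STUDENTS)]
one_year_sd = statistics.pstdev(one_year)

# Average of three consecutive yearly gains per student, from four tests.
three_year = [
    statistics.mean(yearly_gains(observed_scores(4)))
    for _ in range(N_STUDENTS)
]
three_year_sd = statistics.pstdev(three_year)

print(f"error SD on a single test:       {ERROR_SD:.1f}")
print(f"SD of one-year gain scores:      {one_year_sd:.1f}")
print(f"theory for one-year gains:       {math.sqrt(2) * ERROR_SD:.1f}")
print(f"SD of three-year average gains:  {three_year_sd:.1f}")
```

When the errors on two tests are independent and equally large, the error in their difference has a standard deviation of the square root of two times the single-test error, which is why the one-year gains are noisier than either score alone. Averaging gains over several years shrinks that spread, at the price of a model that is harder to explain.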
There is no data for untested subjects
As with all other test-based accountability systems, growth models are restricted to those subjects that can be and are tested. In many states, that confines judgments of growth to only two subjects, reading and mathematics, as is the case with the status models in many states today.
Some growth models, usually value-added models, incorporate scores from all tested subjects, such as social studies and science, that some argue have been overlooked since the passage of NCLB. Even when these other subjects are included in the formulas, researchers estimate that about one-third of teachers in a school will not be included in measuring the school's effectiveness (Andrejko 2004). For the most part, these would be teachers of subjects such as music and art which are not typically assessed using standardized tests.
There can be missing or incomplete student data
Even the best data collection system cannot assure that all data will be produced and reported for every student. In any given year, some students are absent during testing. Some students transfer into school during the school year from other states. Others transfer out. These factors and others lead to missing or incomplete data for some students. Not all growth models are able to incorporate information on students that do not have all their previous test scores, although some do (CCSSO 2007).
Missing data can have a large impact on growth results, depending on the characteristics of students who typically fall into this group. Students who are highly mobile, for example, tend to be lower achievers. A high incidence of mobility in a school would produce gaps in the data and could distort the "effect" reported for the school or its teachers (McCaffrey, et al. 2003). One analysis found gain scores would be affected if ten percent or more of the student records contained missing scores (Braun 2004).
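A simple data-quality check along these lines might look like the sketch below. The records and field layout are hypothetical, and the ten percent level is only the figure cited above from Braun (2004), used here as an illustrative flag rather than an established rule.

```python
# Hypothetical records: (student_id, score_2006, score_2007).
# None marks a test score that was never reported.
records = [
    ("S001", 410, 445),
    ("S002", 380, 402),
    ("S003", None, 390),  # transferred in, no prior-year score
    ("S004", 455, 470),
    ("S005", 395, None),  # absent during testing
    ("S006", 420, 441),
    ("S007", 430, 452),
    ("S008", 365, 388),
    ("S009", 400, 418),
    ("S010", 440, 459),
]

MISSING_THRESHOLD = 0.10  # illustrative flag level, from the figure cited above


def share_incomplete(records):
    """Fraction of student records with at least one missing score."""
    if not records:
        raise ValueError("no records to check")
    incomplete = sum(
        1 for _student_id, *scores in records
        if any(score is None for score in scores)
    )
    return incomplete / len(records)


share = share_incomplete(records)
flagged = share >= MISSING_THRESHOLD
print(f"{share:.0%} of records are incomplete; flagged: {flagged}")
```

A check like this only counts the gaps. It says nothing about which students are missing, which is the part that matters most for whether results are distorted.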
However, researchers are still divided over the impact students with missing test scores actually have on growth calculations. Some value-added models, such as the one used in Tennessee, do not exclude students from the calculation simply because they are missing some test scores. Tennessee is able to do this by including enough other data—such as previous scores and scores in other subjects—that missing some data is said to have little impact. States and districts need to decide how to deal with students with missing test scores when designing their growth model.
Experts dispute how completely "value-added" models capture teacher effect
There is a continuing debate among statisticians on the extent to which preexisting student factors, such as socioeconomic status (SES) and prior achievement, can be controlled for to truly isolate the effect a teacher has on student achievement. Since these measures are not perfect, most statisticians agree that they should not be the only tool used in evaluating teachers. However, they disagree on the role these measures should play. Although this section focuses on the use of value-added models (VAMs) in evaluating teachers, many of the same issues pertain to VAMs when evaluating schools and educational programs.
Value-added models used to evaluate teacher effectiveness are designed to measure a teacher's contribution to student achievement. VAMs typically compare an individual teacher's effectiveness to the average effective teacher in her district. Simply put, teacher effectiveness is computed as the difference between students' achievement after being in a teacher's class and what their achievement would have been if they had been with the "average" teacher. But there are other factors outside of teachers' control that could influence achievement, including student characteristics, school climate, district policies, and even the student's previous teachers (McCaffrey, et al. 2003).
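The basic comparison can be sketched in a few lines of Python. This is a deliberately stripped-down illustration, not the method used in Tennessee or in any other actual VAM: it stands in for "the average teacher" with the district-wide mean gain and makes no adjustment for the outside factors just described. The teachers, scores, and function name are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical rows for one district: (teacher, prior_score, current_score).
rows = [
    ("Teacher A", 400, 430),
    ("Teacher A", 420, 452),
    ("Teacher A", 380, 408),
    ("Teacher B", 410, 425),
    ("Teacher B", 390, 407),
    ("Teacher B", 430, 443),
]


def naive_teacher_effects(rows):
    """Mean gain of each teacher's students minus the district mean gain.

    Assumption: the district mean gain is what students would have gained
    with the "average" teacher. Real value-added models estimate this
    expectation with far more elaborate statistical methods.
    """
    all_gains = [current - prior for _teacher, prior, current in rows]
    district_mean_gain = statistics.mean(all_gains)

    gains_by_teacher = defaultdict(list)
    for teacher, prior, current in rows:
        gains_by_teacher[teacher].append(current - prior)

    return {
        teacher: statistics.mean(teacher_gains) - district_mean_gain
        for teacher, teacher_gains in gains_by_teacher.items()
    }


effects = naive_teacher_effects(rows)
for teacher, effect in effects.items():
    print(f"{teacher}: {effect:+.1f} points relative to the district average")
```

A positive number means the teacher's students gained more than the district average and a negative number means they gained less. Whether that difference reflects the teacher or something else about the students and school is exactly what the debate is about.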
Most researchers agree that VAMs must account for these outside factors to provide an accurate measure of teacher effectiveness. Statisticians have developed various techniques in an attempt to minimize the influence of outside factors, but the debate is by no means settled (McCaffrey, Lockwood, Koretz, and Hamilton 2004).
Another consideration is that most VAMs are meant to evaluate the effect of a single teacher. For this reason, they have been criticized for not accounting for the growing trend of team teaching. The argument goes like this: If a student is being taught by multiple teachers, how can any model isolate the effect of each one individually? Models like Tennessee's represent teacher effects as independent and additive even though other teachers and tutors may affect academic growth as well (Kupermintz 2003). However, some statisticians say that VAMs can be designed to account for team teaching and departmentalized instruction (Sanders 2006). And it is likely that VAMs could be set up to evaluate the impact of the team of teachers as well. Both may be done by including data from each of the teachers on the team that taught the student.
A single growth model does not serve all purposes
Growth data can be helpful in many ways, and it is tempting to create one growth model and use it for multiple purposes. Policymakers and educators should resist this temptation. Although a single model could save a lot of time and money, many researchers strongly discourage using just one model, because trying to pull distinct pieces of information from one model would likely lead to false conclusions (Ballou 2002). For example, a growth model developed for high-stakes school accountability, such as NCLB, should not be used for program evaluation when the evaluation controls for other variables such as socioeconomic status, which is not acceptable for NCLB (Gong, Perie, and Dunn 2006).
Measuring growth in high school is difficult
Some growth models are difficult and in some instances impossible to use in certain high school settings (Yeagley 2007). This is because high schools tend to lack key elements needed to track growth. For one, many states assess students only once during high school and do not administer annual tests. Even in states with more high school tests, the tests are typically by subject, not grade level, and are not typically vertically aligned and scaled.
How can growth models be used effectively?
When their limitations are acknowledged and minimized, growth models can be used effectively and provide information about schooling that is not otherwise available. Keep in mind, however, that the type of growth model used should be appropriate for its purpose.
Other ways growth models can be used include:
Probably the most familiar application of growth models is school accountability. The discussion now is not whether growth should be used for school accountability but the best way to measure it, particularly for NCLB. From almost the time NCLB was enacted in 2002, states and researchers have been developing growth models that would give credit to schools for making significant progress with students who entered school far from proficient while adhering to the goal of NCLB to have all students proficient by 2014. This led to the proliferation of what are known as Growth to Proficiency or Growth to Standards models, which we described earlier. Previous growth models struggled to answer the question, "How much growth is enough growth?" Growth to Proficiency models answer the question by declaring, "Enough growth for the student to reach proficiency by a set time" (Gong, Perie, and Dunn 2006). However, what the set time should be can only be determined by policymakers. The models can then be designed around that timeline. You should keep in mind that while a growth model may accurately measure student growth, its effectiveness can only be judged by how the data is used.
Growth to Proficiency models also mitigate one of the negative effects of some models that do little to narrow achievement gaps. For example, some Simple Growth models are set up to recognize average growth or one year's worth of growth among students.
When all students stay on pace, gaps between high- and low-achievers do not change. Models based on predicted growth, such as value-added models, that only expect a student to grow at a rate similar to past growth can actually widen these gaps because they do not expect students to grow more than they had in the past.
Thus low-achievers, who begin with less-than-average gains, will be "expected" to fall further and further behind (Blank and Cavell 2005). In contrast, Growth to Proficiency models expect low-performing students to accelerate their gains in order to meet a set target (Davidson and Davidson 2004, Doran 2003).
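The arithmetic behind "enough growth to reach proficiency by a set time" can be sketched as follows. This is a simplified illustration assuming a single vertical scale and straight-line growth; the cut score, student scores, and function names are hypothetical, and the approved state models are each calculated differently.

```python
PROFICIENT_SCORE = 500  # hypothetical cut score on a vertical scale


def required_annual_gain(current_score, proficient_score, years_remaining):
    """Points per year needed to reach the cut score by the target date."""
    if years_remaining <= 0:
        raise ValueError("years_remaining must be positive")
    return max(0.0, (proficient_score - current_score) / years_remaining)


def on_track(current_score, last_gain, proficient_score, years_remaining):
    """True if repeating last year's gain would reach the cut score in time.

    Assumption: growth continues in a straight line at last year's pace.
    """
    projected = current_score + last_gain * years_remaining
    return projected >= proficient_score


# Two hypothetical students who both gained 20 points last year.
students = {
    "low achiever": {"score": 410, "last_gain": 20},
    "high achiever": {"score": 480, "last_gain": 20},
}

for name, s in students.items():
    needed = required_annual_gain(s["score"], PROFICIENT_SCORE, 3)
    ok = on_track(s["score"], s["last_gain"], PROFICIENT_SCORE, 3)
    print(f"{name}: needs {needed:.1f} points per year, on track: {ok}")
```

Both students made the same gain, so the gap between them did not change. Under a growth-to-proficiency target, however, the lower-scoring student is expected to grow faster than before in order to close the distance within the set time.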
Three different types of Growth to Proficiency models include those in North Carolina and Tennessee that were approved by the Department of Education for their NCLB Growth Model Pilot Program, and Harold Doran and Lance Izumi's Rate of Expected Academic Change (REACH). Although each model is calculated differently, they have the same overarching goal of giving credit to schools that are getting their students on track to proficiency.
Evaluate teacher performance
Value-added models probably work best for evaluating teacher performance because they are the most effective at isolating teacher effects from outside factors. But again, the tools are not perfect. Most researchers assert that VAMs should be just one of several indicators used when evaluating teachers (Andrejko 2004, Ballou 2002, Braun 2005, McCaffrey, et al. 2003). Other measures such as classroom observations, examinations of lesson plans, portfolios, and other evaluations of professional practice should also be considered (Andrejko 2004, Braun 2005, Hershberg, Simon and Lea-Kruger 2004). Moreover, Value-Added measures should be used to inform local decision making, not replace it (Ballou 2002).
The policy discussion focuses on the usefulness of growth models in high-stakes, school accountability systems. But quite possibly the most effective use of growth information is for improving practice: To inform instructional improvement, evaluate the effectiveness of academic programs, and target professional development for teachers and administrators.
Value-Added measures can provide valuable information about the effects of curriculum, instructional techniques, and other instructional practices on student learning. Armed with data, teachers and administrators can help pinpoint what works and what needs to be improved to best meet the needs of their students. That information can also become the basis for school improvement plans (Hershberg, Simon, and Lea-Kruger 2004).
By seeing where and why they are effective, teachers can reflect on their own practices and share their best techniques with their colleagues. Administrators can analyze the data and target professional development for staff. Clearly, teachers found to be less effective can be given the assistance they need to improve. But even otherwise effective teachers can benefit from analyzing their own data. For example, a teacher may be effective overall, but a Value-Added analysis may suggest that she is more effective with her higher performing students than her low performers (Hershberg, Simon, and Lea-Kruger 2004). Professional development that provides strategies for helping struggling learners would help this teacher advance the learning of all her students.
The same data from VAMs that can help pinpoint professional development needs can also provide principals with valuable information for assigning teachers strategically. An analysis of the data will help principals identify which teachers are most effective in which subjects, grade levels, or even groups of students. They can then make the best match between teachers' individual strengths and students' needs (Hershberg, Simon, and Lea-Kruger 2004).
Evaluate teacher preparation programs
With the right data collection system, Value-Added measures can also be used to evaluate teacher preparation programs in state universities. The state of Louisiana is a leader in this regard. The state's Value-Added Teacher Preparation Program Assessment Model has the capacity to examine the performance of its K–12 students, and connect growth in student learning to state teacher preparation programs to determine the effect of their graduates (Berry and Fuller 2006). However, keep in mind that only those teachers of subject areas that are tested are included in the data, so caution should be taken when evaluating a teacher preparation program on a subset of their graduates (Berry and Fuller 2006).
How can you get the most from growth models?
The book on growth models is just being written. Researchers have only begun to determine how accurate the measures are and how they should best be used. Educators have just scratched the surface on how growth models can help them improve their schools.
Researchers emphasize that growth models should not be the sole basis for making high-stakes decisions, particularly in regard to teacher evaluation (Ballou 2002, Braun 2005, Kupermintz 2003, McCaffrey, et al. 2003, Raudenbush 2004a). Others caution that using growth models such as value-added models in teacher accountability could discourage teachers from using this valuable data to inform their instruction (Yeagley 2007).
We need to learn more before we will know the full effects of using growth models like VAMs for any sort of accountability. Nonetheless, several researchers acknowledge that even though growth models are not perfect, they are probably better than the accountability systems we currently have in place (McCaffrey, et al. 2003).
What all researchers can agree on is that growth models provide valuable information that is not otherwise available from models that only look at achievement status. Before policymakers consider implementing a growth model, they should ask some pertinent questions.
Questions for District Policymakers to Consider
- What is the purpose for using growth data and who will be affected?
- What additional resources will be needed at the school and district level to implement and maintain a growth model?
- How will teachers, administrators, and other education stakeholders be trained on how growth data will be calculated and used effectively?
- How will the growth data be disseminated to districts, schools, teachers, and the community at-large?
Additional Questions for State Policymakers to Consider
- Is a growth measure a better measure than is currently used?
- Which stakeholders should be involved in developing the growth model?
- What elements do we need to implement a valid and reliable growth model and how much would it cost?
- What elements are already in place?
- How will the absence of an element or elements affect the reliability and validity of the growth model results?
- Do the benefits outweigh the costs of implementing a growth model?
- How important is it that all stakeholders, including teachers and parents, understand how the growth data is calculated? Will the stakeholders accept a growth model if they do not fully understand how it is calculated?
- Will the growth model be open for peer and public review? If so, how often will it be reviewed and by whom?
Questions for Federal Policymakers to Consider
- How many states have the necessary elements in place to develop a valid and reliable growth model?
- What will it cost states to develop, implement, and maintain a growth model?
- How will the cost vary across states?
- What financial burden will there be for districts? Do rural and other small districts have the capacity to collect and disseminate the data necessary for a growth model?
- Which growth models would be best to meet the goals of NCLB?
- Which growth models will provide the most accurate identification of schools that need improvement?
- How will growth models affect the goal of one hundred percent proficiency by 2014?
- Which students should be included in growth model calculations and accountability? Should all students have the same growth targets?
- How much flexibility should states have in designing growth models for NCLB?
- What other purposes should the growth model serve besides school accountability?
- Would a growth model have a greater effect on improving schools if it was designed to inform instruction rather than for high-stakes school/teacher accountability?
- What are the limitations of growth models?
- What continuing research is needed to understand the impact growth models are having in classrooms and to help improve their impact at the classroom level?
1 Variations of the Performance Index models such as Transition Matrix models and Value Table models are able to measure individual students' growth as they score at higher achievement levels such as going from Below Basic in fourth grade to Basic in fifth.
2 There are other statistical techniques available such as using Z-scores or Multilevel Modeling Approaches that can be used to create a growth model in the absence of vertically scaled tests. But they add to the already complex model and are likely to alter the results you are hoping to measure. Gong, B., Perie, M., and Dunn, J. (2006). Using Student Longitudinal Growth Measures for School Accountability Under No Child Left Behind: An Update to Inform Design Decisions. Center for Assessment. Retrieved on June 7, 2007, from http://www.nciea.org/publications/GrowthModelUpdate_BGMAPJD07.pdf.
The document was written by Jim Hull, policy analyst, Center for Public Education. Special thanks to Cyndie Schmeiser and Jeff Allen at ACT, Inc.; Mary Delagardelle and staff at the Iowa School Boards Foundation; and Joseph Montacalvo for their insightful feedback and suggestions. However, the opinions and any errors found within the paper are solely those of the author.
Posted: November 9, 2007
©2007 Center for Public Education