The goal of this guide is to provide useful information about standardized testing, or assessment, for practitioners and non-practitioners who care about public schools. It includes the nature of assessment, types of assessments and tests, and definitions.
Assessment is the essential ingredient of accountability, and accountability is the key word in education today. The No Child Left Behind Act (NCLB) mandates accountability for academic progress, using tests and assessments at the state level to monitor student progress toward 100 percent proficiency for all students by 2014. Because of this mandate, districts, schools, and teachers supplement the required tests with additional assessments throughout the year to monitor student learning and ensure that students can do well on state tests.
Such intense focus on tests raises several issues: What kinds of tests are being used at the state, district, school, and classroom levels? What should school board members and administrators know about tests, both from the point of view of selecting them and from that of students taking the tests? What do terms such as standardized and high-stakes mean?
In order to answer these questions, an exploration of various tests and assessments is in order, along with a discussion of how they are created. We will look at:
- Formative and Summative Assessment
- Standardized and High-Stakes Tests
- Norm-Referenced and Criterion-Referenced Tests
- Reliability, Validity, and Fairness
We will also take a look at different formats of tests:
- Multiple-Choice Tests
- Open-Ended Tests
Finally, we will explore these other large-scale standardized tests:
- National Assessment of Educational Progress (NAEP)
- International Tests: Trends in International Mathematics and Science Study (TIMSS) and the Program for International Student Assessment (PISA)
- College Admissions: SAT, ACT
- Advanced Placement (AP)
- International Baccalaureate (IB)
Formative and summative assessment
Assessments and tests are means to an end, not the end in itself. Assessment is a dipstick dropped into the academic program to obtain information about what has been learned to produce data about student learning. As Linda Darling-Hammond writes, “Testing is information for an accountability system; it is not the system itself” (Darling-Hammond, 2004). Effective education requires information about learning at many points during the process, so two kinds of assessment have evolved—formative and summative.
Formative assessment provides information about learning in process. It consists of the weekly quizzes, tests, and even essays given by teachers to their classes. Teachers and students use the results of formative assessments to understand how students are progressing and to make adjustments in instruction. Rick Stiggins calls it “day-to-day classroom assessment” and claims evidence that it has triggered “remarkable gains in student achievement” (Stiggins, 2004).
Not all formative assessment is teacher-designed. Textbook publishers now include CD-ROMs of tests aligned to the chapters in their books. Additionally, test publishers have begun to supply assessments for use at intervals in the classroom (frequently every six or nine weeks). These benchmark tests are aligned to the state standards and tests that students will take for accountability purposes (Herman and Baker, 2005; Stokes, 2005; McIntire, 2005, 4-5).
Summative assessment provokes most of the controversy about testing because it includes “high-stakes, standardized” testing carried out by the states. Summative assessment records the state of student learning at certain end points in a student’s academic career, such as the end of a school year or certain grades (3rd, 5th, 8th, and 11th, for example). It literally “sums up” what students have learned.
This guide is mostly concerned with explaining the components and functions of summative assessment.
Standardized and high-stakes tests
Standardized testing means that a test is “administered and scored in a predetermined, standard manner” (Popham, 1999). Students take the same test in the same conditions at the same time, if possible, so results can be attributed to student performance and not to differences in the administration or form of the test (Wilde, 2004). For this reason, the results of standardized tests can be compared across schools, districts, or states.
Standardized testing is sometimes used as a shorthand expression for machine-scored multiple-choice tests. As we will see, however, standardized tests can have almost any format.
High-stakes testing has consequences attached to the results. For example, high-stakes tests can be used to determine students’ promotion from grade to grade or graduation from high school (Resnick, 2004; Cizek, 2001). State testing to document Adequate Yearly Progress (AYP) in accordance with NCLB is called “high-stakes” because of the consequences to schools (and of course to students) that fail to maintain a steady increase in achievement across the subpopulations of the schools (i.e., minority, poor, and special education students).
Low-stakes testing has no consequences outside the school, although the results may have classroom consequences such as contributing to students’ grades. Formative assessment is a good example of low-stakes testing.
With these definitions in mind, we now look at summative, high-stakes, and standardized tests from two points of view: that of prospective purchasers of tests (e.g., school administrators, school boards), and that of students facing test items demanding answers.
Norm-referenced and criterion-referenced tests
A prospective purchaser of tests is faced with a choice, to buy norm-referenced or criterion-referenced tests. The design and functions of each are so different that it is necessary to discuss them in some detail.
Norm-referenced tests are designed to compare individual students’ achievement to that of a “norm group,” a representative sample of their peers. The design is governed by the normal or bell-shaped curve in the sense that all elements of the test are directed toward spreading out the results on the curve (Monetti, 2003; NASBE, 2001; Zucker, 2003; Popham, 1999). This curve-governed design means that norm-referenced tests do not compare students’ achievement to standards for what they should know and be able to do; they only compare students to other students who are assumed to be in the same norm group. The Educators’ Handbook on Effective Testing (2002) lists the norms frequently used by major testing publishers. For example, the available norms for the Iowa Test of Basic Skills are: districts of similar sizes, regions of the country, socio-economic status, ethnicity, and type of school (e.g., public, Catholic, private non-Catholic), in addition to a representation of students nationally.
Purchasers of norm-referenced tests need to ensure that the chosen norm is a useful comparison for their students. Purchasers should also be sure that the norm has been developed recently, because populations change rapidly. A norm including a small percentage of English language learners can become a norm with almost 50 percent English language learners in less than the ten-year interval before it is revised.
Results of norm-referenced tests are frequently reported in terms of percentiles: a score in the 70th percentile means that the student has done better than 70 percent of the others in the norm group (Monetti, 2003). Percentile rankings are often used to identify students for various academic programs such as gifted and talented, regular, or remedial classes. On a symmetrical bell curve, a score in the 50th percentile is the average.
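The percentile arithmetic described above can be sketched in a few lines. This is an illustration only: the scores and the simple counting rule below are hypothetical, not drawn from any real norming study or publisher's scoring method.

```python
def percentile_rank(score, norm_group):
    """Percentage of the norm group scoring below the given score."""
    below = sum(1 for s in norm_group if s < score)
    return 100 * below / len(norm_group)

# A hypothetical norm group of ten raw scores
norm_group = [52, 58, 61, 64, 67, 70, 70, 73, 76, 82]

# A raw score of 71 beats 7 of the 10 norm-group scores
print(percentile_rank(71, norm_group))  # 70.0
```

In practice, publishers derive percentile ranks from large norming samples and smoothed score distributions, but the interpretation is the same: the rank locates a student within the norm group, not against a standard.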
Because norm-referenced tests are designed to spread students’ scores along the bell curve, the questions asked in the tests do not necessarily represent the knowledge and skills that all students are expected to have learned. Instead, during the test development process, “test items answered correctly by 80 percent or more of the test takers don’t make it past the final cut [into the final test]” writes Popham (1999).
Norm-referenced tests lead to frustration on two counts. First, they frustrate teachers who successfully teach important knowledge and skills, because items that most students answer correctly are culled from the test, so students are unlikely to face questions about that knowledge on it (Popham, 1999). Second, no group of students can achieve at higher levels without others achieving at lower levels. Norm-referenced tests make it mathematically impossible for “all the children to be above average” (ERS; Burley, 2002).
Criterion-referenced tests are designed to show how students achieve in comparison to standards, usually state standards (NASBE, 2001; Wilde, 2004; Zucker, 2003). In contrast to norm-referenced tests, it is theoretically possible for all students to achieve the highest score, or the lowest, because there is no attempt to compare students to each other, only to the standards. Results are reported in levels, typically basic, proficient, and advanced. The test items are not chosen to sort students but to ascertain whether they have mastered the knowledge and skills contained in the standards.
Criterion-referenced tests—sometimes, more correctly, called standards-based tests—begin from a state’s standards, which list the knowledge and skills students are expected to learn. Because standards are usually far more numerous than could ever be included in a test, test designers work with teachers and content specialists to narrow down the standards to essential knowledge and skills at the grades to be tested. They are the basis for the development of test items.
The number of criterion-referenced tests in use at the state level has dramatically increased since NCLB was implemented in 2001 (NCES, 2005), because they measure achievement of the knowledge and skills required by state standards. At this writing, 44 states use criterion-referenced assessments: 24 states use only criterion-referenced tests, and the other 20 use both criterion-referenced and norm-referenced tests. Thirteen states use “hybrid” tests, single tests whose results are reported both as norm-referenced (in percentiles or stanines, a nine-point scale used for normalized test scores) and as criterion-referenced (in basic, proficient, and advanced levels) in an attempt to show at the same time where students score in relation to standards and in relation to a norm group. Only one state, Iowa (home of the Iowa Test of Basic Skills, and also the only state in the nation without state academic standards), uses a norm-referenced test alone (Education Week, 2006).
Reliability, validity, and fairness
All standardized tests must meet psychometric standards (standards governing test design, administration, and scoring) for reliability, validity, and lack of bias (Zucker, 2003; Bracey, 2002; Joint Committee on Testing Practices, 2004). Reliability means that the test is internally consistent enough that a student could take it repeatedly and get approximately the same score; validity means that the test accurately measures what it is intended to measure. Tests, of course, must be unbiased; that is, students must not be at a disadvantage no matter what ethnic or social group they belong to. For example, a mathematics item proposed for a test to be marketed nationally referred to subway tokens, which are clearly familiar to students living in large cities with subway trains but not to students living in other areas. The item was dropped.
Most tests, whether purchased by states or districts, are developed by commercial test publishers, although states usually contract to use their own names on the tests and align them with their own standards. Some states, however, administer commercial tests under their original names: Florida, Alabama, Mississippi, and Nevada use, respectively, the Stanford Achievement Test (SAT-9), the Stanford 10, the TerraNova, and the Iowa Test of Basic Skills (ITBS). (The characteristics of these tests are described in Dickinson et al.) States designate a single test, with their own name or that of the commercial publisher, to report Adequate Yearly Progress (AYP) under NCLB to the U.S. Department of Education.
Although there are egregious examples of errors occurring during the administration and scoring of tests (Goldberg, 2005), an objective view of the huge number of tests administered in the 50 states and the District of Columbia would have to conclude that errors affect only a small percentage of schools and students. The large-scale testing industry is responding to ballooning demand in the face of NCLB, despite being “still in the primitive…knives and bearskins stage,” as the vice president of the Princeton Review put it (2003). The pressure for more and better testing will only increase, especially as the testing industry fully realizes the potential of computer technology (Stokes, 2005; NCES, 2005; Toch, 2006).
The students’ view: Multiple-choice, open-ended, and performance tests
Administrators and teachers know whether the test students are taking is norm- or criterion-referenced and whether the scores will be reported in levels or percentiles, but students perceive only that they are being asked to fill in bubbles or write explanations.
Anyone who has attended a U.S. school in the last half century is familiar with the bubble tests. Students face a question with four possible answers and respond by filling in a blank “bubble” with a number two pencil. Why a number two pencil? Because the lead in the pencil is a conductor of electricity so that the answer sheets can be scored by machine (Lemann, 1999). Tests that ask for bubbled answers are called multiple-choice, although sometimes controlled choice or selected response. All the terms refer to the fact that possible answers are given and the student has to choose rather than provide an individual response.
A few years ago it was justifiable to criticize multiple-choice testing because it seemed reductive (Mitchell, 1992). Critics charged that the questions focused on memorized facts and did not encourage thinking. However, test designers took up the challenge to make more sophisticated multiple-choice tests. In many cases multiple-choice tests now require considerable thought, even notes and calculations, before choosing a bubble.
Nonetheless, it remains true that multiple-choice tests “are clearly limited in the kinds of achievement they can measure” (Zucker, 2003; NASBE, 2001). These tests do not ask students to produce anything, but only to recognize (even after some thought) the “right” answer. In doing so, multiple-choice tests foster a mindset that expects a right answer even though further experience in both school and life tends to frustrate that expectation.
Open-ended test items ask students to respond either by writing a few sentences in short-answer form or by writing an extended essay. Open-ended questions are also known as “constructed response” because test-takers must construct their response as opposed to selecting a correct answer (Zucker, 2003). The advantage of open-ended items is that they allow a student to display knowledge and apply critical thinking skills. It is particularly difficult to assess writing ability, for example, without an essay or writing sample.
The disadvantage is that constructed-response items require human readers, although attempts are being made to develop computer programs to score essays (Sireci, 2000; Rudner, 2001; Shermis, 2001). Short-answer questions can be scored by looking for key terms since they often don’t ask for complete sentences. But many state assessments ask for an extended essay, often in separate tests from the one used to report AYP. Companies across the United States assemble groups of qualified people, often retired teachers when they can get them, to read and score essays or long answers using a common rubric for scoring (Stover, 1999).
A rubric is a guide to scoring that provides a detailed description of essays that should be given a particular score (frequently one to six points, with six being the best). After extensive training with models of each score, two readers rate an essay independently. If their scores differ, a third reader reads the essay without knowing the two preceding scores. Group scoring of essays has a long history and has proved to be remarkably reliable (Mitchell, 1992).
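The two-reader protocol just described can be sketched as follows. This is an illustration, not any publisher's actual procedure; in particular, the rule for settling a disagreement (averaging the third reader's score with the closer of the first two) is an assumption, since the text does not specify how the blind third read produces the final score.

```python
def resolve_scores(s1, s2, s3=None):
    """Resolve 1-6 rubric scores from two independent readers.

    If the two readers agree, that score stands. If they differ, a
    blind third read is required; here (an assumed rule) the final
    score averages the third score with the closer of the first two.
    """
    if s1 == s2:
        return s1
    if s3 is None:
        raise ValueError("readers disagree: a blind third read is required")
    closer = s1 if abs(s3 - s1) <= abs(s3 - s2) else s2
    return (s3 + closer) / 2

print(resolve_scores(4, 4))     # readers agree -> 4
print(resolve_scores(3, 5, 5))  # third read sides with the second reader -> 5.0
```

The essential point the protocol guarantees is that no essay receives a final score based on a single disputed reading, which is what makes group scoring against a common rubric reliable.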
Essays and long answers have the desirable effect of promoting more writing and writing instruction in the classroom, but they are expensive to score. Multiple-choice testing is less expensive because it is scored by machine (ERS; NASBE, 2001). Differences in cost can be gauged from a U.S. General Accounting Office report estimating that from 2002 to 2008, states will spend $1.9 billion on mandated testing if they use only machine-scored multiple-choice tests. States will spend $3.9 billion if they maintain the present mixture of multiple-choice and a few open-ended items. They will spend $5.3 billion if they increase the use of open-ended items, including essays, making the cost of using open-ended items more than 2.5 times the amount of using multiple-choice tests alone (GAO, 2003). Clearly, the difference in cost makes testing choices difficult.
Also called authentic assessment, performance assessment challenges students to perform a task just as it would be performed in the classroom or in life (e.g., a science experiment, a piano recital). Performance assessment was widely promoted in the early 1990s (Mitchell, 1992), but it is time-consuming, difficult to standardize, and expensive.
Portfolios are a type of performance assessment that were also popular before 2001, when state testing in accordance with NCLB came to dominate. Portfolios are collections of student work designed to show growth over a semester or a year. However, they are difficult to evaluate accurately, because their production and contents can not be standardized (Gearhart, 1993). Both portfolios and performance assessment are now used as formative rather than summative assessment.
Other large-scale standardized tests
We have discussed classroom tests that students take daily or weekly, and state-mandated tests that they encounter annually. Some students face additional tests emanating from other authorities or organizations.
The National Assessment of Educational Progress (NAEP)
Also known as the Nation’s Report Card, NAEP is the “only nationally representative and continuing assessment of what America’s students know and can do in various subject areas” (NAEP). NAEP is a matrix test, meaning that the test is divided among students, so that two students sitting next to each other may not be looking at the same questions. Students selected to take it form a statistically representative sample of the nation’s students. Not all schools in a district, or even all students in a school, will take the test.
NAEP tests students in the 4th, 8th, and 12th grades every two years in reading and mathematics and at longer intervals in other academic subjects such as science, history, and geography. All 50 states and the District of Columbia now participate in NAEP. Since 2003, NAEP has invited urban districts to participate in a special Trial Urban District Assessment; ten large city districts and the District of Columbia now take part.
NAEP is a criterion-referenced test. Its test items are derived from the NAEP Frameworks, the documents that act as national standards for NAEP. The tests combine multiple-choice, and short- and long-answer items. Results are reported in four levels: below basic, basic, proficient, and advanced. Separate scores are reported for groups of students based on characteristics such as race, ethnicity, family income, and gender. NAEP does not provide scores for individual students or schools.
NAEP is a yardstick for student achievement that can be used to evaluate the performance of states and urban districts volunteering to participate. State tests, in contrast, vary in type, quality, and scope. They cannot be compared to one another.
International Tests: TIMSS and PISA
Two tests that also do not yield individual student scores are international: the Trends in International Mathematics and Science Study (TIMSS) and the Program for International Student Assessment (PISA). Both are matrix, criterion-referenced tests that provide important information about the educational achievement of U.S. students compared to their peers worldwide. As such, they usually provoke headlines in the news media when scores are released.
TIMSS reports on science and mathematics achievement at grades four and eight. It is planning its next international test in 2007. Every three years, PISA tests 15-year-olds in industrialized countries on knowledge and skills essential for participation in 21st-century society.
College Admissions: SAT and ACT
Many high school students take the SAT and ACT, privately owned national tests that many colleges and universities require as part of a student’s application package. (SAT and ACT are the correct names of the tests—they are no longer acronyms.) Additionally, a few states, including Illinois and Colorado, now mandate that every student in grade 11 must take the college admissions test used in their state. Other states are considering adopting similar policies.
For most of its history the SAT, which is owned by the College Board but designed and administered by the Educational Testing Service, was deliberately not connected to any state or school’s curriculum. Because it was designed to predict college success, at least as far as the freshman year, many regarded it as an aptitude rather than achievement test. However, the SAT recently underwent a major and widely publicized redesign to “better reflect what students study in high school” (The College Board). One of the changes is that the formerly all multiple-choice test now includes a writing sample, an essay scored by readers.
The ACT is designed to establish mastery of a generalized curriculum spelled out in the Standards for Transition or College Readiness Standards. The ACT is a multiple-choice test that now offers an optional writing test. Research conducted in conjunction with the Education Trust showed that rigorous adherence to high standards in courses taught by well qualified teachers resulted in ACT scores correlated with success in freshman college courses (ACT/Education Trust, 2004).
Advanced Placement (AP)
AP examinations are owned and administered by the College Board to allow students to get credit for college-level work while in high school. Teachers attend courses offered by the College Board to familiarize themselves with the AP syllabus in 20 subject areas. The examinations consist of multiple-choice and essay questions, although a few AP examinations, such as Studio Art, require the submission of a portfolio of work. Colleges and universities will sometimes award college credit to students with a high score (three or higher) on an AP examination. However, this is no guarantee, because different institutions have different policies about accepting AP scores.
International Baccalaureate (IB)
Like AP, which is a combined program of courses and examinations, the IB offers a course of instruction and examinations that allow U.S. students to gain credentials equivalent to those in European schools. IB is now in almost 1,600 schools in 122 countries. The program begins at the grammar school level and continues through the end of high school, although students can enroll at any point. The assessments consist of essays except for problems in mathematics and science.
The goal of this guide was to provide the background needed to understand the assessments used in schools. We believe it also serves as a tutorial for school board members, parents and others to determine what questions about tests and assessments they would like their administrators and teachers to answer. For a complete research review about the effects of testing on teaching and learning, please go to the Center’s research review on the effects of testing.
This guide was prepared by Ruth Mitchell for the Center for Public Education. Mitchell, a freelance consultant and writer, specializes in education research and policy.
Posted: Feb. 15, 2006
© 2006 Center for Public Education