A guide to standardized testing: The nature of assessment

The goal of this guide is to provide useful information about standardized testing, or assessment, for practitioners and non-practitioners who care about public schools. It covers the nature of assessment, types of assessments and tests, and definitions.

Assessment is the essential ingredient of accountability, and accountability is the key word in education today. NCLB mandates accountability for academic progress, using tests and assessments at the state level to monitor student progress toward 100 percent proficiency for all students by 2014. Because of this mandate, districts, schools, and teachers supplement the required tests with additional assessments throughout the year to monitor student learning and ensure that students can do well on state tests.

Such intense focus on tests raises several issues: What kinds of tests are being used at the state, district, school, and classroom level? What should school board members and administrators know about tests, both from the point of view of selecting them and from that of students taking the tests? What do terms such as standardized and high-stakes mean?

To answer these questions, we explore various tests and assessments and discuss how they are created. We will look at formative and summative assessment; standardized and high-stakes tests; norm-referenced and criterion-referenced tests; and reliability, validity, and fairness. We will also take a look at different formats of tests: multiple-choice, open-ended, and performance tests. Finally, we will explore these other large-scale standardized tests: NAEP, the international tests TIMSS and PISA, the SAT and ACT, Advanced Placement (AP), and International Baccalaureate (IB).
Formative and summative assessment

Formative Assessment

Not all formative assessment is teacher-designed. Textbook publishers now include in their packaging CD-ROMs of tests aligned to the chapters in their books. Additionally, test publishers have begun to supply assessments for use at intervals in the classroom (frequently every six or nine weeks). These benchmark tests are aligned to the state standards and tests that students will take for accountability purposes (Herman and Baker, 2005; Stokes, 2005; McIntire, 2005, 4-5).

Summative Assessment

This guide is mostly concerned with explaining the components and functions of summative assessment.

Standardized and high-stakes tests

Standardized Testing

Standardized testing is sometimes used as a shorthand expression for machine-scored multiple-choice tests. As we will see, however, standardized tests can have almost any format.

High-Stakes Testing

Low-Stakes Testing

With these definitions in mind, we now look at summative, high-stakes, and standardized tests from two points of view: that of prospective purchasers of tests (e.g., school administrators, school boards), and that of students facing test items demanding answers.

Norm-referenced and criterion-referenced tests
Norm-Referenced Tests

Purchasers of norm-referenced tests need to ensure that the chosen norm is a useful comparison for their students. Purchasers should also be sure that the norm has been developed recently, because populations change rapidly. A norm group that includes a small percentage of English language learners can come to represent a population with almost 50 percent English language learners in less than the ten-year interval before the norm is revised.

Results of norm-referenced tests are frequently reported in terms of percentiles: a score in the 70th percentile means that the student has done better than 70 percent of the others in the norm group (Monetti, 2003). Percentile rankings are often used to identify students for various academic programs such as gifted and talented, regular, or remedial classes. On a symmetrical bell curve, a score in the 50th percentile is the average.

Because norm-referenced tests are designed to spread students’ scores along the bell curve, the questions asked in the tests do not necessarily represent the knowledge and skills that all students are expected to have learned. Instead, during the test development process, “test items answered correctly by 80 percent or more of the test takers don’t make it past the final cut [into the final test],” writes Popham (1999).

Norm-referenced tests lead to frustration on two counts. First, they frustrate the teacher’s success in teaching important knowledge and skills, because students are unlikely to face questions about that knowledge and those skills on the test (Popham, 1999). Second, no group of students can achieve at higher levels without others achieving at lower levels. Norm-referenced tests make it mathematically impossible for “all the children to be above average” (ERS; Burley, 2002).

Criterion-Referenced Tests
Criterion-referenced tests (sometimes, more correctly, called standards-based tests) begin from a state’s standards, which list the knowledge and skills students are expected to learn. Because standards are usually far more numerous than could ever be included in a test, test designers work with teachers and content specialists to narrow down the standards to essential knowledge and skills at the grades to be tested. These essentials are the basis for the development of test items.

The number of criterion-referenced tests in use at the state level has dramatically increased since NCLB was implemented in 2001 (NCES, 2005), because they measure achievement of the knowledge and skills required by state standards. At this writing, 44 states use criterion-referenced assessments: 24 states use only criterion-referenced tests, and the other 20 use both criterion-referenced tests and norm-referenced tests. Thirteen states use “hybrid” tests, single tests that are reported both as norm-referenced tests (in percentiles or stanines, a nine-point scale used for normalized test scores) and as criterion-referenced tests (in basic, proficient, and advanced levels) in an attempt to show at the same time where students score in relation to standards and in relation to a norm group. Only one state, Iowa (home of the Iowa Test of Basic Skills, and also the only state in the nation without state academic standards), uses a norm-referenced test alone (Education Week, 2006).

Reliability, validity, and fairness

Most tests, whether purchased by states or districts, are developed by commercial test publishers, although states usually contract to use their own names on the tests and align them with their own standards. Some, however, administer commercial tests by name: Florida, Alabama, Mississippi, and Nevada administer, respectively, the Stanford Achievement Test (SAT-9), the Stanford 10, the TerraNova, and the Iowa Test of Basic Skills (ITBS). (The characteristics of these tests are described in Dickinson et al. [2002].)
States designate a single test, under their own name or that of the commercial publisher, to report Adequate Yearly Progress (AYP) under NCLB to the U.S. Department of Education.

Although there are egregious examples of errors occurring during the administration and scoring of tests (Goldberg, 2005), an objective view of the huge number of tests administered in the 50 states and the District of Columbia would have to conclude that errors affect only a small percentage of schools and students. The large-scale testing industry is responding to a balloon of demand in the face of NCLB, despite being “still in the primitive…knives and bearskins stage,” as the vice president of Princeton Review put it (2003). The pressure for more and better testing will only increase, especially as the testing industry fully realizes the potential of computer technology (Stokes, 2005; NCES, 2005; Toch, 2006).

The students’ view: Multiple-choice, open-ended, and performance tests

Multiple-Choice Tests

A few years ago it was justifiable to criticize multiple-choice testing because it seemed reductive (Mitchell, 1992). Critics charged that the questions focused on memorized facts and did not encourage thinking. However, test designers took up the challenge to make more sophisticated multiple-choice tests. In many cases multiple-choice tests now require considerable thought, even notes and calculations, before choosing a bubble.

Nonetheless, it remains true that multiple-choice tests “are clearly limited in the kinds of achievement they can measure” (Zucker, 2003; NASBE, 2001). These tests do not ask students to produce anything, but only to recognize (even after some thought) the “right” answer. In doing so, multiple-choice tests foster a mindset that expects a right answer even though further experience in both school and life tends to frustrate that expectation.
Open-Ended Tests

The disadvantage is that constructed-response items require human readers, although attempts are being made to develop computer programs to score essays (Sireci, 2000; Rudner, 2001; Shermis, 2001). Short-answer questions can be scored by looking for key terms since they often don’t ask for complete sentences. But many state assessments ask for an extended essay, often in separate tests from the one used to report AYP.

Companies across the United States assemble groups of qualified people, often retired teachers when they can get them, to read and score essays or long answers using a common rubric for scoring (Stover, 1999). A rubric is a guide to scoring that provides a detailed description of essays that should be given a particular score (frequently one to six points, with six being the best). After extensive training with models of each score, two readers rate an essay independently. If their scores differ, a third reader reads the essay without knowing the two preceding scores. Group scoring of essays has a long history and has proved to be remarkably reliable (Mitchell, 1992).

Essays and long answers have the desirable effect of promoting more writing and writing instruction in the classroom, but they are expensive to score. Multiple-choice testing is less expensive because it is scored by machine (ERS; NASBE, 2001).

Differences in cost can be gauged from a U.S. General Accounting Office report estimating that from 2002 to 2008, states will spend $1.9 billion on mandated testing if they use only machine-scored multiple-choice tests. States will spend $3.9 billion if they maintain the present mixture of multiple-choice and a few open-ended items. They will spend $5.3 billion if they increase the use of open-ended items, including essays, making the cost of using open-ended items more than 2.5 times the cost of using multiple-choice tests alone (GAO, 2003). Clearly, the difference in cost makes testing choices difficult.
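The cost comparison above can be checked with a little arithmetic. The sketch below is only an illustration: the dollar figures are the GAO (2003) estimates quoted in this guide, while the scenario labels and the function name `cost_ratios` are assumptions introduced here for readability, not terms from the GAO report.

```python
# Estimated state spending on mandated testing, 2002 to 2008, in billions of
# U.S. dollars, as quoted in this guide from GAO (2003).
# The scenario labels are illustrative names chosen for this sketch.
ESTIMATES = {
    "multiple-choice only": 1.9,
    "current mix": 3.9,
    "more open-ended items": 5.3,
}


def cost_ratios(estimates, baseline):
    """Return each scenario's cost divided by the baseline scenario's cost."""
    base = estimates[baseline]
    return {name: cost / base for name, cost in estimates.items()}


ratios = cost_ratios(ESTIMATES, "multiple-choice only")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} times the multiple-choice-only cost")
```

Under these figures the heavier use of open-ended items comes to roughly 2.8 times the multiple-choice-only estimate, which is consistent with the "more than 2.5 times" comparison, and the present mixture comes to roughly twice that estimate.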
Performance Assessment

Portfolios

Other large-scale standardized tests

The National Assessment of Educational Progress (NAEP)

NAEP tests students in 4th, 8th, and 12th grade every two years in reading and mathematics and at longer intervals in other academic subjects such as science, history, and geography. Fifty-one states and territories now participate in NAEP. Since 2003, NAEP has invited urban districts to participate in a special Trial Urban Assessment; ten large city districts and the District of Columbia now participate.

NAEP is a criterion-referenced test. Its test items are derived from the NAEP Frameworks, the documents that act as national standards for NAEP. The tests combine multiple-choice, short-answer, and long-answer items. Results are reported in four levels: below basic, basic, proficient, and advanced. Separate scores are reported for groups of students based on characteristics such as race, ethnicity, family income, and gender. NAEP does not provide scores for individual students or schools.

NAEP is a yardstick for student achievement that can be used to evaluate the performance of states and urban districts volunteering to participate. State tests, in contrast, vary in type, quality, and scope. They cannot be compared to one another.

International Tests: TIMSS and PISA

TIMSS reports on science and mathematics achievement at grades four and eight. It is planning its next international test in 2007. Every three years PISA tests 15-year-olds in industrialized countries on knowledge and skills essential for participation in 21st-century society.

College Admissions: SAT and ACT

For most of its history the SAT, which is owned by the College Board but designed and administered by the Educational Testing Service, was deliberately not connected to any state or school’s curriculum. Because it was designed to predict college success, at least as far as the freshman year, many regarded it as an aptitude test rather than an achievement test.
However, the SAT recently underwent a major and widely publicized redesign to “better reflect what students study in high school” (The College Board). One of the changes is that the formerly all multiple-choice test now includes a writing sample, an essay scored by readers.

The ACT is designed to establish mastery of a generalized curriculum spelled out in the Standards for Transition or College Readiness Standards. The ACT is a multiple-choice test that now offers an optional writing test. Research conducted in conjunction with the Education Trust showed that rigorous adherence to high standards in courses taught by well-qualified teachers resulted in ACT scores correlated with success in freshman college courses (ACT/Education Trust, 2004).

Advanced Placement (AP)

International Baccalaureate (IB)

Moving Forward

This guide was prepared by Ruth Mitchell for the Center for Public Education. Mitchell, a freelance consultant and writer, specializes in education research and policy.

Posted: Feb. 15, 2006
© 2006 Center for Public Education

