
ENGLISH LANGUAGE PROFICIENCY TESTS

Language Proficiency

Language proficiency has not been defined uniformly. Cummins (1984) notes that language proficiency has been described by some as comprising 64 distinct components, and by others as consisting of a single factor. Valdés and Figueroa (1994) observe that knowing a language involves more than mastering its pronunciation, grammar, and conventions of politeness; it also requires command of a set of interrelated components that interact with one another depending on the context of the communication.

Oller and Damico (1991) state that the precise elements of language proficiency have yet to be determined and remain a matter of debate. Even so, every language proficiency test must rest on an accurate model or definition of language proficiency. The Council of Chief State School Officers (CCSSO) defines a student who is proficient in English as one who can use the language to ask questions, understand teachers and reading materials, express ideas, and answer questions in class. The four language skills that contribute to language proficiency are speaking, reading, listening, and writing.

Canales (1994) grounds the definition of language proficiency in socio-theoretical terms: language is not viewed as a collection of separate parts (e.g., pronunciation, vocabulary, and grammar). Language develops within a culture and serves as a medium for transmitting the culture's beliefs and customs (consider the case of translating idioms). Language proficiency is dynamic and contextual (varying with the situation, the status of the speakers, and the topic of conversation), discursive (requiring connected utterances), and integrative, demanding the coordinated use of multiple skills so that communicative competence can be achieved. In other words, language proficiency is the ability to use discrete elements of language, such as vocabulary, discourse structure, and body language, to convey meaning.

The language skills that underlie a student's academic success include the ability to respond to questions from classmates and teachers about specific information, to ask follow-up questions, and to synthesize reading material. Students must understand routine oral instructions in large-group settings and comments from peers in small groups. In reading, students are expected to extract information from various types of text. In writing, they are expected to produce short answers, paragraphs, essays, and papers. Successful language learners are also expected to know the social and cultural conventions that govern language use.

The conceptions of language proficiency sketched above have at least two features in common. First, each definition accommodates all four language skills: speaking, listening, reading, and writing. Second, each definition places language proficiency in a particular context, namely education. As a consequence, a language proficiency test should use procedures that reflect, as far as possible, the contextualized language used in most English-language classrooms.

Valdés and Figueroa (1994) argue that a language proficiency test should identify the level of demand imposed by a given context and the kinds of language ability generally used by the largely successful monolingual English-speaking students in that context. On that basis, criteria can be set for measuring the language skills of students who are not native speakers of English, in order to decide whether they should be taught in English or in their own language. The recommendation is understandable, because language proficiency tests are intended to help educators judge accurately whether a student needs support in his or her learning. Such a decision becomes difficult when the tasks on the test bear little resemblance to the tasks typically assigned in most classrooms.

General Nature of Language Proficiency Tests

Oller and Damico (1991) associate language proficiency testing with three schools of thought. The first is the discrete-point approach, which rests on the assumption that language consists of separable components such as phonology, morphology, lexicon, syntax, and so on, each of which can be further divided into distinct elements (for example, sounds into sound classes or phonemes, syllables, morphemes, words, idioms, and phrase structures). They state that a language test will not be valid if it includes several skills or structural domains (Lado, 1961). Under this model, the ideal assessment would involve evaluating every domain and every skill considered important; the results could then be combined to form an overall picture of language proficiency (p. 82).

Discrete-point language proficiency tests typically use formats such as phoneme discrimination, in which test takers are asked to decide whether two orally presented words are the same or different (e.g., /ten/ versus /den/). Another example is a vocabulary test that asks students to select the correct option from a fixed set of choices.

The weaknesses of the discrete-point model include:

the difficulty of restricting a language test to a single skill (e.g., writing) without involving other skills (e.g., reading);

the difficulty of restricting a test to a single linguistic domain (e.g., vocabulary) without involving other domains (e.g., phonology); and

the difficulty of testing language without involving context or relating it to human experience.


According to Damico and Oller (1991), these limitations gave rise to a second trend in testing, the integrative or holistic approach. Such tests require that language proficiency be assessed in rich discourse contexts (p. 83). The underlying assumption is that processing or using language implies the use of more than one component of language (e.g., vocabulary, grammar, gesture) and more than one skill (e.g., listening, speaking). Following this logic, an integrative task might ask a test taker to listen to a story and then retell it, or to listen to a story and write it down.

The third trend in language testing described by Damico and Oller (1992) is known as pragmatic language testing. The fundamental difference between this approach and the integrative approach is the attempt to connect the testing situation to the test taker's own experience. As Oller and Damico (1991) note, language use in normal situations involves people, places, events, and relationships that implicate the full range of experience, and that range is constrained by time, or temporal factors. Pragmatic language tests are therefore designed to be as true to "real life," or as authentic, as possible.

Unlike integrative tasks, the pragmatic testing approach asks test takers to carry out a listening task only under the textual and temporal conditions that characterize that activity. For example, if a test taker is to listen to a story and retell it, the following conditions must be met. From a pragmatic point of view, language learners generally do not listen to recorded stories; they generally listen to stories read aloud by an adult. In this respect, the task of listening to a recorded story does not meet pragmatic requirements. The pragmatic approach is characterized by the following:

Normal visual input is provided (e.g., the reader's cues, print on the page, an authentic number of pictures connected with the story).

Timing is arranged differently, so that learners have the opportunity to ask questions, draw inferences, and react normally to the content of the story.

The story, its theme, the reader, and the purpose of the activity shape the student's experience.

Oller and Damico (1991) see the strength of pragmatic testing in the fact that all the purposes served by discretely constructed test items (diagnosis, focus, isolation) are better achieved through rich contexts. As a method of linguistic analysis, the discrete-point testing approach has validity; but as a practical method for assessing language skills, it is misused, counterproductive, and logically impossible.

If the goal is to measure language proficiency in grammar, vocabulary, or pronunciation, that goal is more likely to be achieved through a pragmatic language approach than through a discrete-point approach.


Limitations of Current Language Proficiency Tests

A language proficiency test must be grounded in a theory or model of language proficiency. However, there is no consensus among language experts on the nature of language proficiency. As a result, a variety of language proficiency tests have emerged that differ from one another in fundamental ways. More important still, different language proficiency tests produce different language classifications (e.g., non-English speaking, limited English speaking, and fully English proficient) for the same students (Ulibarri, Spencer & Rivas, 1981). Valdés and Figueroa (1994) report that it is not only the quality of the tests with which educators must be concerned, but also the design of the language proficiency tests themselves.

Related to the design of language proficiency tests, there may be a propensity for test developers to use a discrete-point approach to language testing. Valdés and Figueroa (1994) state:

As might be expected, instruments developed to assess the language proficiency of "bilingual" students borrowed directly from traditions of second and foreign language testing. Rather than integrative and pragmatic, these language assessment instruments tended to resemble discrete-point, paper-and-pencil tests administered orally. (p. 64)

Consequently, and to the degree that the above two points are accurate, currently available language proficiency tests not only yield questionable results about students' language abilities, but those results are also based on the most impoverished model of language testing.

In closing this section of the handbook, consider the advice of Spolsky (1984):

Those involved with language tests, whether they are developing tests or using their results, have three responsibilities. The first is to avoid certainty: Anyone who claims to have a perfect test or to be prepared to make an important decision on the basis of a single test result is acting irresponsibly. The second is to avoid mysticism: Whenever we hide behind authority, technical jargon, statistics or cutely labelled new constructs, we are equally guilty. Thirdly, and this is fundamental, we must always make sure that tests, like dangerous drugs, are accurately labelled and used with considerable care. (p. 6)

In addition, bear in mind that the above advice applies to any testing situation (e.g., measuring intelligence, academic achievement, self-concept), not only language proficiency testing. Remember also that the use of standardized language proficiency testing, in the context of language minority education, is only about two decades old. Much remains to be learned. Finally, there is little doubt that any procedure for assessing a learner's language proficiency must also entail the use of additional strategically selected measures (e.g., teacher judgments, miscue analysis, writing samples).


The Tests Described

The English language proficiency tests presented in this Guide are the:

1) Basic Inventory of Natural Language (Herbert, 1979);
2) Bilingual Syntax Measure (Burt, Dulay & Hernández-Chávez, 1975);
3) Idea Proficiency Test (Dalton, 1978; 1994);
4) Language Assessment Scales (De Avila & Duncan, 1978; 1991); and
5) Woodcock-Muñoz Language Survey (1993).

Test Descriptions and Publisher Information

Figure 1: Five Standardized English Language Proficiency Tests Included in this Handbook

For each assessment instrument below, the publisher's contact information is followed by a general description.

Basic Inventory of Natural Language (BINL) CHECpoint Systems, Inc. 1520 North Waterman Ave. San Bernardino, CA 92404 1-800-635-1235

The BINL (1979) is used to generate a measure of the K-12 student's oral language proficiency. The test must be administered individually and uses large photographs to elicit unstructured, spontaneous language samples from the student which must be tape-recorded for scoring purposes. The student's language sample is scored based on fluency, level of complexity and average sentence length. The test can be used for more than 32 different languages.

Bilingual Syntax Measure (BSM) I and II Psychological Corporation P.O. Box 839954 San Antonio, TX 78283 1-800-228-0752

The BSM I (1975) is designed to generate a measure of the K-2 student's oral language proficiency; BSM II (1978) is designed for grades 3 through 12. The oral language sample is elicited using cartoon drawings with specific questions asked by the examiner. The student's score is based on whether or not the student produces the desired grammatical structure in their responses. Both the BSM I & BSM II are available in Spanish and English.

Idea Proficiency Tests (IPT) Ballard & Tighe Publishers 480 Atlas Street Brea, CA 92621 1-800-321-4332

The various forms of the IPT (1978 & 1994) are designed to generate measures of oral proficiency and reading and writing ability for students in grades K through adult. The oral measure must be individually administered, but the reading and writing tests can be administered in small groups. In general, the tests can be described as discrete-point, measuring content such as vocabulary, syntax, and reading for understanding. All forms of the IPT are available in Spanish and English.

Language Assessment Scales (LAS) CTB MacMillan McGraw-Hill 2500 Garden Road Monterey, CA 93940 1-800-538-9547

The various forms of the LAS (1978 & 1991) are designed to generate measures of oral proficiency and reading and writing ability for students in grades K through adult. The oral measure must be individually administered, but the reading and writing tests can be administered in small groups. In general, the tests can be described as discrete-point and holistic, measuring content such as vocabulary, minimal pairs, listening comprehension, and story retelling. All forms of the LAS are available in Spanish and English.

Woodcock-Muñoz Language Survey Riverside Publishing Co. 8420 Bryn Mawr Ave. Chicago, IL 60631 1-800-323-9540

The Language Survey (1993) is designed to generate measures of cognitive aspects of language proficiency for oral language as well as reading and writing for individuals 48 months and older. All parts of this test must be individually administered. The test is discrete-point in nature and measures content such as vocabulary, verbal analogies, and letter-word identification. The Language Survey is available in Spanish and English.

Approaches to Assessment

Assessment can be broadly divided into two areas, formal and informal, but as Farr (1991, p. 496) cautions, they really are on a continuum because both are based on student performance. Traditional formal assessment looks at what students know at the end of a given period of instruction. Informal assessment looks at how a student knows as well as what he knows. Formal assessments are usually published. Informal ones are usually teacher-developed, although there are published measures, including informal reading inventories, checklists, surveys, and interview guides.

Obviously, the measure that we as educators choose determines the information that the instrument will yield, so we must be very clear about our purpose. The choice of assessment instrument—from teacher observation to student survey to formal published test—should be informed by the assessor's purpose; selecting the wrong instrument will not allow inferences appropriate to the assessor's needs.

Traditionally, administrators seeking information about students' success in reading selected published, standardized tests with available normative information, such as the Iowas. This allowed them to compare district performance with statewide and national scores and to comply with Title I requirements. Although the comparisons may have given them confidence in the success of local curriculums, the scores yielded little information that would help guide instruction or curriculum design.

Tests


The preponderance of objective, norm-referenced tests traditionally have offered students little information about themselves as learners. However, the same could be said of the uninformed use of a teacher's pop quiz or the misuse of the portfolio as a mere paper repository. Traditional testing is akin to a behaviorist's view of the learner as the passive recipient of data. Current testing theory is based on the cognitive psychologists' view of the learner as an active construer of meaning from the information available from the environment.

We now know, for example, that we should not try to decontextualize test items by using short excerpts in reading that block the reader's use of prior knowledge to construct new information. Short passages prevent skilled readers from using the reading strategies they would employ with a longer passage as they become familiar with the topic and discover the organization of the text. Current theory dictates the use of long passages across a variety of text types and topics to gain a valid indication of reader proficiency. We no longer depend solely on short answers, such as multiple choice, but include open-ended items that permit test takers more latitude to display their reading skills.

Parallel issues arise in the assessment of writing. We no longer assume that students' abilities to revise and edit a given text reflect their abilities to generate, organize, and elaborate original ideas. In short, editing texts is not a complete test of writing proficiency. Current theory holds that any test that purports to be a valid test of writing must include opportunity for the writer to compose original, well-organized text with varied sentence structures and rich word choice using the conventions of standard written English.

New Jersey's new 4th-, 8th-, and 11th-grade tests, which are aligned to the language arts literacy standards, reflect much of current theory concerning learning and testing. Not only do they incorporate long reading passages with opportunities for open-ended responses to different text types and theme-based topics, but they also elicit multiple writing samples from students. In addition, they provide opportunities for students to integrate the reading and writing processes through decision making and problem solving in order to compose an original text using information from a reading passage as support.

The tests also honor the hallmarks of assessment outlined by Case. They are valid because they measure what they purport to measure; that is, they provide rich contexts for the assessment of meaningful speaking, listening, writing, reading, and viewing behaviors. The new tests are also fair because they are aligned to the language arts literacy standards and indicators that have been published and distributed to educators, who will share them with their students, parents, and the community. Furthermore, this curriculum framework provides the same audiences with vignettes and activities that vividly translate the standards into classroom practices. Teachers can use this material to enhance student attainment of the standards and to foster student success on the new tests.

2.1 Common characteristics across instruments. Bachman's (2000) review of the literature on language tests outlines the development of language testing over the last 20 years. He points out that while testing practice from the mid-1960s and the 1970s tended to be based on a construction of language as skills (listening, speaking, reading, writing) and components (grammar, vocabulary, pronunciation), such constructions were critiqued as new approaches to the study of language emerged. Specifically, in the 1980s, the influence of communicative approaches to language instruction was paramount. Since applied linguists were developing approaches to teaching that focused on the co-construction of meaning and the importance of context-based communication, traditional assessments (such as those developed in the 1970s) were ill-suited to the new approach. In the 1990s, test-makers became concerned with issues such as the development of (a) new research methodologies, such as criterion-referenced measurement, (b) practical advances, such as pragmatics testing, (c) factors that affect test performance, (d) authentic and performance assessment, and (e) ethical considerations of language testing.

2.2 Language constructs represented. The tests reviewed above are based on the assumption that language proficiency can be measured accurately by only sampling discrete aspects such as phonology, syntax, morphology, and lexicon. The tests rarely consider aspects of language that can be crucial to academic success, such as pragmatic competence (Cummins, 2000). In other words, most language proficiency tests limit the construction of language proficiency to grammatical competence. An important flaw with this construction is that to assess grammatical competence, tests usually rely on prescriptivist notions of grammar. For instance, if one such type of test were to assess students’ acquisition of the English verb system, an item like (1) below might be presented.

(1) Dad called earlier. He ___________ (might/ is/ had/ might could) stop by later this evening.

If a student were to fill in the blank with might could, he would probably be penalized, because the Standard English verb system allows only one modal verb in that position. However, if said student were a member of the group of native English speakers who make a distinction between (2) and (3) below, such an item would be invalid:

(2) He might stop by later this evening.


(3) He might could stop by later this evening.

While the differences in meaning are subtle and pragmatically determined, in (3) there is less likelihood that "he" will stop by than in (2) (Wolfram and Schilling-Estes, 1998:335). Speakers of the dialect in which sentences like (3) are common need contextual cues in order to distinguish the forcefulness of the assertion. However, a typical language proficiency test would not allow for nuances in meaning made by speakers of so-called non-Standard varieties of English. Furthermore, to limit the construction of language proficiency to a closed set of grammatical categories negates the real need for language learners to master communicative principles which are essential in informal and academic contexts. After all, language learners must develop a range of communicative styles to suit their purposes. A language learner whose repertoire is limited to academic discourse styles cannot be considered fully communicatively competent.

Up to this point we have discussed how commonly-used tests utilize similar constructions of language proficiency, and how this construction of language proficiency is closely linked to prescriptivist notions of standard grammar. In the next section we discuss the criticisms that standardized language proficiency tests have received in test reviews.

2.3 Critiques of the four most commonly-used tests. In addition to the limitation of language proficiency to grammatical competence, other criticisms are revealed in test reviews. These have indicated that some of the common shortcomings are (a) that many test items are not valid (Haber, 1985; Carpenter, 1994; Hedberg, 1995; Kao, 1998), (b) that interrater reliability is low (Crocker, 1998), and (c) that the tests are normed on populations that are not representative of the samples of children to whom these measures are commonly administered (Chesterfield, 1985; Haber, 1985; Shellenberger, 1985; Lopez, 2001). Table 3 includes a summary of the reviews.
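Interrater reliability, the second criticism above, is usually quantified with a chance-corrected agreement statistic. The sketch below computes Cohen's kappa for two raters. It is offered only to make concrete what such reviews measure; it is not drawn from any of the reviews cited, and the function and variable names are ours.

    from collections import Counter

    def cohens_kappa(ratings_a, ratings_b):
        """Chance-corrected agreement between two raters of the same items.

        ratings_a, ratings_b: equal-length sequences of category labels
        (e.g., rubric scores). Values near 0 mean agreement is no better
        than chance; values near 1 mean near-perfect agreement.
        """
        n = len(ratings_a)
        observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
        freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
        # Agreement expected if each rater assigned labels independently,
        # following his or her own marginal label frequencies.
        expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
        return (observed - expected) / (1 - expected)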

PURPOSES OF TESTS

Language tests are used:
• to make inferences about individuals' language ability
• to make predictions about individuals' ability to use language in contexts outside the test itself
• to make decisions about individuals

As Zucker observes, despite this variety, tests generally share some common goals:
• measuring what students know and can do
• improving instruction
• helping students achieve higher standards

The purpose of tests is to provide educators, students, parents, and policy makers with information that is valid, fair, and reliable. Standardized tests provide information that helps support four critically important tasks for educators and the public:

1. Identify the instructional needs of individual students so educators can respond with effective, targeted teaching and appropriate instructional materials;


2. Judge students' proficiency in essential basic skills and challenging standards and measure their educational growth over time;

3. Evaluate the effectiveness of educational programs; and

4. Monitor schools for educational accountability, including under the NCLB Act.

In sum, tests provide information to help students learn more successfully, teachers teach more effectively, and schools become more accountable.

There are limits to testing, however. Tests are a necessary but not the exclusive means to evaluate current achievement and students’ growth in skills. What may be tested is not, and cannot be, inclusive of all of the desired outcomes of instruction. Tests should be considered a means to an end and not ends in themselves. Tests should be used in combination with other important types of information such as teacher judgments of student work and classroom performance plus other individual and group assessments, to measure achievement and growth.

TYPES OF TESTS

High-Stakes Testing

High-stakes testing has consequences attached to the results. For example, high-stakes tests can be used to determine students' promotion from grade to grade or graduation from high school (Resnick, 2004; Cizek, 2001). State testing to document Adequate Yearly Progress (AYP) in accordance with NCLB is called "high-stakes" because of the consequences to schools (and of course to students) that fail to maintain a steady increase in achievement across the subpopulations of the schools (i.e., minority, poor, and special education students).

Low-Stakes Testing

Low-stakes testing has no consequences outside the school, although the results may have classroom consequences such as contributing to students' grades. Formative assessment is a good example of low-stakes testing.

Formative Assessment

This assessment provides information about learning in process. It consists of the weekly quizzes, tests, and even essays given by teachers to their classes. Teachers and students use the results of formative assessments to understand how students are progressing and to make adjustments in instruction. Rick Stiggins calls it "day-to-day classroom assessment" and claims evidence that it has triggered "remarkable gains in student achievement" (Stiggins, 2004).

Summative Assessment

Summative assessment provokes most of the controversy about testing because it includes "high-stakes, standardized" testing carried out by the states. Summative assessment records the state of student learning at certain end points in a student's academic career—at the end of a school year, or at certain grades such as grades 3, 5, 8, and 11. It literally "sums up" what students have learned.

INTERPRETING TEST RESULTS

In addition to designing for reliability, validity, and fairness, test publishers design a standardized test according to how its results will be reported and used. The number of correctly answered questions on a test, the student's raw score, only has meaning in the context of the test's interpretive framework. Types of interpretive frameworks include Norm-referenced Testing (NRT), Criterion-referenced Testing (CRT), and Standards-based Testing.

Norm-referenced Testing (NRT)

A standardized test designed in the NRT interpretive framework can be used to compare a test-taker's results to the results of a reference group that has taken the same test. To norm a test so that results can be compared, a test publisher gathers normative data through field trials of the test with a representative, national sample of students. To compare groups as large as entire school systems, norm-referenced tests are typically designed to cover a broad range of what test-takers are expected to know and be able to do within a subject area. When reporting the results of a norm-referenced test, the test-taker's raw score can be used to make a comparison to the reference group in various ways. Two common methods for making this comparison are to report the test's result as a percentile rank or as a stanine.

A percentile rank (PR) reports the percentage of test-takers in the reference group whose results fall below a certain score. For example, a test-taker with a PR of 80 on a test performed better than 80% of the corresponding reference group. The highest possible PR is 99, meaning that the test-taker scored higher than 99% of the reference group, while the lowest PR is 1, and a PR of 50 is the average.

A stanine indicates the relative standing of a test-taker's score in comparison to the reference group, with a low of one, a high of nine, and five as the average. Stanines 1, 2, and 3 are considered "below average"; stanines 4, 5, and 6 are considered "average"; and stanines 7, 8, and 9 are considered "above average." Each stanine represents an approximately equal unit of achievement. Therefore, the difference between stanines 2 and 4 represents about the same difference in achievement as between stanines 5 and 7. The percentage of scores in the reference group that falls in each stanine is 4, 7, 12, 17, 20, 17, 12, 7, and 4, respectively. Stanines correspond to fixed ranges of percentile ranks and are typically presented along the normal curve.
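The two reporting scales can be made concrete with a short computation. The sketch below is a minimal illustration, not any publisher's scoring program: it converts a raw score into a percentile rank against a norming sample, then maps the rank to a stanine using the cumulative 4-7-12-17-20-17-12-7-4 percentages given above. The function names and the sample scores are invented.

    from bisect import bisect_left

    def percentile_rank(raw_score, norm_scores):
        """Percent of the reference group scoring below raw_score, clamped to 1-99."""
        below = sum(1 for s in norm_scores if s < raw_score)
        return min(max(round(100 * below / len(norm_scores)), 1), 99)

    # Upper percentile-rank bound of each stanine band, derived from the
    # 4-7-12-17-20-17-12-7-4 distribution: stanine 1 covers PR 1-4,
    # stanine 2 covers PR 5-11, ..., stanine 9 covers PR 97-99.
    STANINE_UPPER_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]

    def stanine(pr):
        """Map a percentile rank (1-99) to a stanine (1-9)."""
        return bisect_left(STANINE_UPPER_BOUNDS, pr) + 1

    # Hypothetical use: a raw score of 46 against an invented norming sample.
    norms = [28, 31, 35, 38, 40, 41, 43, 45, 47, 50, 52, 55]
    pr = percentile_rank(46, norms)   # 8 of 12 scores fall below -> PR 67
    print(pr, stanine(pr))            # 67 -> stanine 6, the "average" band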

Norm-Referenced and Criterion-Referenced Tests

A prospective purchaser of tests is faced with a choice: to buy norm-referenced or criterion-referenced tests. The design and functions of each are so different that it is necessary to discuss them in some detail.


Norm-Referenced Tests

These tests are designed to compare an individual student's achievement to that of a "norm group," a representative sample of his or her peers. The design is governed by the normal or bell-shaped curve in the sense that all elements of the test are directed toward spreading out the results on the curve (Monetti, 2003; NASBE, 2001; Zucker, 2003; Popham, 1999). The curve-governed design of norm-referenced tests means that they do not compare students' achievement to standards for what they should know and be able to do—they only compare students to other students who are assumed to be in the same norm group. The Educators' Handbook on Effective Testing (2002) lists the norms frequently used by major testing publishers. For example, the available norms for the Iowa Test of Basic Skills are: districts of similar sizes, regions of the country, socio-economic status, ethnicity, and type of school (e.g., public, Catholic, private non-Catholic), in addition to a representation of students nationally.

Purchasers of norm-referenced tests need to ensure that the chosen norm is a useful comparison for their students. Purchasers should also be sure that the norm has been developed recently, because populations change rapidly. A norm including a small percentage of English language learners can become a norm with almost 50 percent English language learners in less than the ten-year interval before it is revised.

Results of norm-referenced tests are frequently reported in terms of percentiles: a score in the 70th percentile means that the student has done better than 70 percent of the others in the norm group (Monetti, 2003). Percentile rankings are often used to identify students for various academic programs such as gifted and talented, regular, or remedial classes. On a symmetrical bell curve, a score in the 50th percentile is the average.

Because norm-referenced tests are designed to spread students’ scores along the bell curve, the questions asked in the tests do not necessarily represent the knowledge and skills that all students are expected to have learned. Instead, during the test development process, “test items answered correctly by 80 percent or more of the test takers don’t make it past the final cut [into the final test]” writes Popham (1999).  
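Popham's point can be seen in a toy version of classical item analysis: compute each item's difficulty (the proportion of examinees who answer it correctly) and discard items that too many test takers get right, because such items do little to spread scores along the curve. The 0.80 ceiling comes from the passage quoted above; the matching floor for very hard items, and all of the names, are our own conventional additions.

    def item_difficulties(responses):
        """responses: one list of 0/1 item scores per examinee, equal lengths."""
        n_people, n_items = len(responses), len(responses[0])
        return [sum(person[i] for person in responses) / n_people
                for i in range(n_items)]

    def surviving_items(responses, ceiling=0.80, floor=0.20):
        """Indexes of items kept for a norm-referenced form: items answered
        correctly by 80 percent or more of test takers do not make the cut."""
        return [i for i, p in enumerate(item_difficulties(responses))
                if floor <= p < ceiling]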

Norm-referenced tests lead to frustration on two counts. First, they frustrate the teacher's success in teaching important knowledge and skills, because students are unlikely to face questions about that knowledge and those skills on the test (Popham, 1999). Second, no group of students can achieve at higher levels without others achieving at lower levels. Norm-referenced tests make it mathematically impossible for "all the children to be above average" (ERS; Burley, 2002).

Criterion-referenced Testing (CRT)

Rather than compare a student's test result with the results of a reference group, criterion-referenced tests are intended to measure a level of mastery according to a specific set of performance standards. Hence, the content of a criterion-referenced test often includes more focused subject matter than a norm-referenced test. The test-taker's score corresponds to a performance level, such as basic, proficient, or advanced. NCLB requires each state to design or select an assessment yielding results that can be used to classify students into performance levels for the corresponding academic subject.

What Is the Difference Between a Criterion-Referenced Test and a Norm-Referenced Test?

All standardized tests now administered to elementary and secondary school students measure student achievement against a set of academic standards or curricular objectives. The standards may be common among the states and major national academic organizations, thus enabling national comparisons. Or, the standards may be local standards chosen by the school district or state, which may only allow local comparisons among students in a district or state.

There are many ways to report and interpret the results of a standardized test. One way is based on specific criteria, such as academic skills or objectives and academic achievement standards developed at the state or local level. For example, "She has demonstrated mastery of reading at the third-grade level" is a determination made by a criterion-referenced test (CRT). A standardized test also can describe a student's performance compared to other students nationally or locally. For example, "He reads better than 90 percent of fourth grade students nationally" is a determination made from a norm-referenced test (NRT).

A student's score on a CRT using local academic standards is intended to be compared only with other students who have taken the same test. In contrast, a student's scores on an NRT can show performance on academic standards and also enable comparisons with students both locally and nationally. When a local CRT is used with a national NRT, the results can be interpreted together to obtain more comprehensive information about a student's performance. For example, "She is 'proficient' on a state-mandated CRT and is performing at an academic level that is better than 70 percent of students nationwide."

Criterion-Referenced Tests

These tests are designed to show how students achieve in comparison to standards, usually state standards (NASBE, 2001; Wilde, 2004; Zucker, 2003). In contrast to norm-referenced tests, it is theoretically possible for all students to achieve the highest—or the lowest—score, because there is no attempt to compare students to each other, only to the standards. Results are reported in levels that are typically basic, proficient, and advanced. The test items are not chosen to sort students but to ascertain whether they have mastered the knowledge and skills contained in the standards.

Criterion-referenced tests—sometimes, more correctly, called standards-based tests—begin from a state’s standards, which list the knowledge and skills students are expected to learn. Because standards are usually far more numerous than could ever be included in a test, test designers work with teachers and content specialists to narrow down the standards to essential knowledge and skills at the grades to be tested. They are the basis for the development of test items.

The number of criterion-referenced tests in use at the state level has dramatically increased since NCLB was implemented in 2001 (NCES, 2005), because they measure achievement of the knowledge and skills required by state standards. At this writing, 44 states now use criterion-referenced assessments: 24 states use only criterion-referenced tests, and the other 20 use both criterion-referenced tests and norm-referenced tests. Thirteen states use “hybrid” tests, single tests that are reported both as norm-referenced tests (in percentiles or stanines—a nine-point scale used for normalized test scores) and as criterion-referenced tests (in basic, proficient, and advanced levels) in an attempt to show at the same time where students score in relation to standards and in relation to a norm group. Only one state, Iowa (home of the Iowa Test of Basic Skills, and also the only state in the nation without state academic standards) uses a norm-referenced test alone (Education Week 2006).

STANDARDS-BASED TESTING

Standards-based testing allows states to accomplish both objectives (NRT and CRT) at once by incorporating elements of norm-referenced and criterion-referenced testing. A standards-based test is both normed to a reference group and aligned to a set of performance standards. This framework, also called the augmented NRT model, enables states to report standards-based information (content standards scores), performance levels (cut-scores), and percentile-rank information for every student. For example, a test publisher can use a state's academic standards to augment an existing norm-referenced test so that the test-taker's results can be used both for comparisons to a reference group and for assigning performance levels. Typically, statewide results from the first year that a standards-based test is administered are used to establish the test's reference group. Careful design by the test publisher ensures that the test is valid for measuring student mastery of the academic standards. Because NCLB requires states to report student performance levels while also comparing the results of specified student populations to the results of previous years, properly designed standards-based tests are especially suited to meet NCLB requirements.

Standardized testing means that a test is "administered and scored in a predetermined, standard manner" (Popham, 1999). Students take the same test in the same conditions at the same time, if possible, so results can be attributed to student performance and not to differences in the administration or form of the test (Wilde, 2004). For this reason, the results of standardized tests can be compared across schools, districts, or states.

Standardized testing is sometimes used as a shorthand expression for machine-scored multiple-choice tests. As we will see, however, standardized tests can have almost any format.

A standardized achievement test is, simply, a test that is developed using standard procedures and is then administered and scored in a consistent manner for all test takers. Students respond to identical or very similar questions under the same conditions and test directions. The standardization of test questions, directions, conditions of testing, and scoring is needed to make test scores comparable and to assure, as much as possible, that test takers have equal, unbiased opportunities to demonstrate what they know and can do. Standardization can apply to any type or format of test. However, some types of educational tests such as classroom and teacher-developed tests are not usually considered to be "standardized" tests because they are given under varying conditions and are scored using variable rules.

Standardized tests may be used for a variety of purposes. One purpose of testing is to enable educators to make high-stakes decisions about individual students through measures such as high school graduation tests. In contrast, the annual testing provisions of the NCLB Act are used to inform schools, teachers, and parents about student improvement in the classroom and to hold schools and states accountable for such improvement.

How Are Standardized Tests Used?

Information from standardized tests can be used for many purposes. These purposes may include:

Supporting instructional decisions for individual students by identifying their instructional needs. A test may be used to diagnose a student’s strengths and weaknesses, thus allowing the teacher or school to choose effective instructional programs for the student.

Demonstrating students’ proficiency in basic skills and their ability to meet academic standards. Test results are used by states to demonstrate individual student mastery of specified levels of achievement.

Informing parents and the public about school and student performance. States administer standardized assessments and report the results, in part to inform the public about how well the schools and their students are progressing over time and compared to other localities or schools. Many states and districts publish annual report cards on school districts and individual schools. The results of the tests can motivate education reform by informing and influencing parents to take action to improve the quality of local schools.


Holding schools and educators accountable for student performance on tests aligned to high standards of what students should know and be able to do. Consequences are often attached to test results and may include school improvement plans, technical assistance, increased or decreased funding for schools, salary bonuses, promotions, loss of accreditation, and takeovers of local schools by the state. Such consequences are used to leverage change at the school and classroom level.

Evaluating programs. Many federal and state education programs use standardized tests to determine if public policy objectives are being achieved, and if public funds are well-spent.

Determining rewards and sanctions. Tests may be used for high-stakes purposes with rewards and sanctions to make decisions about individual students, such as placement in specific programs or classes, graduation from high school, or promotion to the next grade.

TEST FORMATS

Multiple-choice questions: Many standardized tests require students to select a single correct response to each test question (called "items") from among a small number of specific choices. This format—called "multiple choice" or "selected response"—is efficient, practical, and usually produces highly reliable results. Multiple-choice tests offer the advantages of objectivity and uniformity in scoring, ease of administration, and low cost.

Performance assessment questions: Performance assessments require students to generate a response to a question rather than choosing from a set of responses provided to them. Examples include exhibitions, investigations, demonstrations, written or oral responses, journals, and portfolios. Performance assessments can be given and scored according to standard procedures and rules so that a test containing performance assessment questions is a standardized test. Performance assessments typically focus on the process of problem solving rather than on answers or solutions. Tests including performance assessments, however, are generally less reliable, more difficult to score, and more costly than tests using multiple choice items.

Constructed-response questions: Constructed-response items may be one type of performance assessment, in which students are given the opportunity to fill in a blank or provide a brief written response to a question, rather than select from an array of possible answers. Constructed-response questions are often included, along with multiple-choice questions, on a test to obtain additional and different types of information about what a student knows or can do.

Test Question Formats

While there is no set format for all questions on standardized tests, the most common standardized test question formats include Multiple-choice Questions and Short-answer Questions.

Page 17: Definitions of Language Proficiency - didisukyadi.staf.upi.edu  · Web viewEmpat keterampilan berbahasa yang memberi kontribusi atas kemahiran berbahasa ... gesture) dan keterampilan

Short-answer Questions

The short-answer question format, also known as the open-ended or constructed-response format, presents the test-taker with a question that is answered by a fill-in-the-blank or short written response. Answers to constructed-response questions are hand-scored using a rubric that allows for a range of acceptable and partially correct answers. Questions and answers in this format provide a more sophisticated evaluation of student performance than selected-response questions. However, the reliability of scores obtained using constructed-response questions depends more heavily on the scoring method. Carefully designed constructed-response questions with a clear scoring rubric can provide important information about student performance and knowledge that cannot be as effectively demonstrated by the selected-response format.

Open-Ended Tests

These test items ask students to respond either by writing a few sentences in short-answer form, or by writing an extended essay. Open-ended questions are also known as "constructed response" because test-takers must construct their response as opposed to selecting a correct answer (Zucker, 2003). The advantage of open-ended items is that they allow a student to display knowledge and apply critical thinking skills. It is particularly difficult to assess writing ability, for example, without an essay or writing sample.

The disadvantage is that constructed-response items require human readers, although attempts are being made to develop computer programs to score essays (Sireci, 2000; Rudner, 2001; Shermis, 2001). Short-answer questions can be scored by looking for key terms since they often don’t ask for complete sentences. But many state assessments ask for an extended essay, often in separate tests from the one used to report AYP. Companies across the United States assemble groups of qualified people, often retired teachers when they can get them, to read and score essays or long answers using a common rubric for scoring (Stover, 1999).

A rubric is a guide to scoring that provides a detailed description of essays that should be given a particular score (frequently one to six points, with six being the best). After extensive training with models of each score, two readers rate an essay independently. If their scores differ, a third reader reads the essay without knowing the two preceding scores. Group scoring of essays has a long history and has proved to be remarkably reliable (Mitchell, 1992).
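The two-reader procedure just described translates directly into an adjudication rule. In the sketch below, each reader is simply a function that applies the rubric to an essay. The way the blind third score is combined with the earlier scores (paired with whichever of the two it is nearer) is a common operational convention, not something the sources above specify.

    def score_essay(essay, reader1, reader2, reader3):
        """Rubric scoring with two independent reads and blind adjudication.

        Each reader is a callable returning an integer rubric score (e.g., 1-6).
        Returns the pair of scores whose sum would be the reported total.
        """
        s1, s2 = reader1(essay), reader2(essay)
        if s1 == s2:                  # the two independent reads agree
            return (s1, s2)
        s3 = reader3(essay)           # third reader sees neither earlier score
        # Assumed convention: report the third score with the nearer of the
        # first two (a tie falls to the first reader's score).
        nearer = min((s1, s2), key=lambda s: abs(s - s3))
        return (s3, nearer)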

Essays and long answers have the desirable effect of promoting more writing and writing instruction in the classroom, but they are expensive to score. Multiple-choice testing is less expensive because it is scored by machine (ERS; NASBE, 2001). Differences in cost can be gauged from a U.S. General Accounting Office report estimating that from 2002-2008, states will spend $1.9 billion on mandated testing if they use only machine-scored multiple-choice tests. States will spend $3.9 billion if they maintain the present mixture of multiple-choice and a few open-ended items. They will spend $5.3 billion if they increase the use of open-ended items—including essays—making the cost of using open-ended items more than 2.5 times the amount of using multiple-choice tests alone (GAO, 2003). Clearly, the difference in cost makes testing choices difficult.

Performance Assessment

Also called authentic assessment, performance assessment challenges students to perform a task just as it would be performed in the classroom or in life (e.g., a science experiment, a piano recital). Performance assessment was widely promoted in the early 1990s (Mitchell, 1992), but it is time-consuming, difficult to standardize, and expensive.

Portfolios

Portfolios are a type of performance assessment that was also popular before 2001, when state testing in accordance with NCLB came to dominate. Portfolios are collections of student work designed to show growth over a semester or a year. However, they are difficult to evaluate accurately, because their production and contents cannot be standardized (Gearhart, 1993). Both portfolios and performance assessment are now used as formative rather than summative assessment.

QUALITIES OF AN EFFECTIVE TEST

The requirements of NCLB pose a significant challenge to state educational systems: All students must have the same chance to be successful at showing what they know and can do in periodic, high-stakes assessments. Consequently, states must select or design high-quality tests that can be used by the general student population while meeting the special requirements of certain groups and even the needs of individual students. Moreover, the high stakes involved compel states to be certain that the tests accurately measure student achievement. All standardized tests must meet psychometric (test study, design, and administration) standards for reliability, validity, and lack of bias (Zucker, 2003; Bracey, 2002; Joint Committee on Testing Practices, 2004). For a test to solve this combination of challenges effectively, it must be proven to be:

• Reliable – The test must produce consistent results. Reliability means that the test is so internally consistent that a student could take it repeatedly and get approximately the same score (a minimal computation of internal consistency is sketched after this list).

• Valid – The test must be shown to measure what it is intended to measure.

• Unbiased – The test should not place students at a disadvantage because of gender, ethnicity, language, or disability.
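The internal consistency referred to in the first bullet is commonly estimated with Cronbach's alpha. The sketch below is a generic, minimal computation, not any publisher's procedure; it assumes a simple rectangular table of item scores.

    def cronbach_alpha(item_scores):
        """Internal-consistency reliability of a test form.

        item_scores: one list per examinee, one numeric score per item
        (0/1 for multiple choice). Values near 1.0 indicate that items
        rise and fall together, i.e., a consistent total score.
        """
        k = len(item_scores[0])       # number of items on the form
        def variance(xs):
            m = sum(xs) / len(xs)
            return sum((x - m) ** 2 for x in xs) / len(xs)
        total_var = variance([sum(person) for person in item_scores])
        item_vars = [variance([person[i] for person in item_scores])
                     for i in range(k)]
        return (k / (k - 1)) * (1 - sum(item_vars) / total_var)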


References

ACT, Inc. & The Education Trust. (2004). On course for success: A close look at selected high school courses that prepare all students for college and work. Washington, DC: The Education Trust. Available: http://www.act.org/path/policy/pdf/success_report.pdf

Bracey, G. W. (2002). Put to the test: An educator's and consumer's guide to standardized testing (2nd ed.). Bloomington, IN: Phi Delta Kappa International.

Burley, H. (2002, February). A measure of knowledge. American School Board Journal, 18(2).

Cannell, J. J. (1987). Nationally normed elementary achievement testing in America's public schools: How all fifty states are above the national average. West Virginia: Friends for Education.

Cizek, G. J. (1998). Filling in the blanks: Putting standardized tests to the test. Washington, DC: The Thomas B. Fordham Foundation.

Cizek, G. J. (2001, Winter). More unintended consequences of high-stakes testing. Educational Measurement: Issues and Practice, 20(4), 19-28.

Darling-Hammond, L. (2004, June). Standards, accountability, and school reform. Teachers College Record, 106(6), 1047-1085.

Data connections: Using assessment to improve teaching and learning [CD-ROM]. (2002). Charleston, WV: Edvantia (formerly Appalachian Educational Laboratory).

Dickinson, A. C., Friedman, M. I., Hatch, C. W., Jacobs, J. E., Nickerson, A. B., & Schnepel, K. C. (2002). Educators' handbook on effective testing. Columbia, SC: Institute for Evidence-Based Decision-Making in Education.

Educational Research Service. (n.d.). Focus on high-stakes testing. Arlington, VA: Educational Research Service.

Education Week. (2006). Quality counts at 10. Washington, DC: Editorial Projects in Education.

General Accounting Office. (2003). Characteristics of tests will influence expenses: Information sharing may help states realize efficiencies. Washington, DC: United States General Accounting Office.

Gearhart, M., Herman, J. L., Baker, E. L., & Whittaker, A. K. (1993, July). Whose work is it? A question for the validity of large-scale portfolio assessment. CSE Technical Report 363. Available: http://www.cse.ucla.edu/products/Reports/TECH363.pdf

Goldberg, M. (2005, January). Test mess 2: Are we doing better a year later? Phi Delta Kappan, 86(5), 389-400.

Herman, J. L., & Baker, E. L. (2005, November). Making benchmark testing work. Educational Leadership, 63(3), 49-53.

Joint Committee on Testing Practices. (2004). Code of fair testing practices in education (revised). Washington, DC: American Psychological Association.

Lemann, N. (1999). The big test. New York: Farrar, Straus and Giroux.

Linn, R. L. (2005, Summer). Fixing the NCLB accountability system. CRESST Policy Brief 8. Available: http://www.cse.ucla.edu/products/policy/cresst_policy8.pdf

McIntire, T. (2005, April). Data: Maximize your mining, part one. Technology and Learning, 25(9).

Mitchell, R. (1992). Testing for learning: How new approaches to evaluation can improve American schools. New York: Free Press.

Monetti, D. M., & Hinkle, K. T. (2003). Five important test interpretation skills for school counselors. ERIC Digest (ED481472).

National Association of State Boards of Education. (2001). A primer on state accountability and large-scale assessments. Available: http://www.nasbe.org/Educational_Issues/Reports/Assessment.pdf

National Education Goals Panel. (1998). Talking about tests: An idea book for state leaders. Washington, DC: United States Department of Education.

National Center for Education Statistics. (2005). State education reforms: Standards, assessment, and accountability. Table 1.5, Names and types of statewide assessments administered, by state: 2003-04 [Online report]. Retrieved December 7, 2005, from http://nces.ed.gov/programs/statereform/saa_tab5.asp

National Center for Education Statistics. (2005, August). Online assessment in mathematics and writing: Reports for the NAEP technology-based assessment project, research and development series. Washington, DC: United States Department of Education.

Popham, J. W. (1999, March). Why standardized tests don't measure educational quality. Educational Leadership.

Leadership, 56(6), 8-15.

Princeton Review. (2003). Testing the testers 2003: An annual ranking of state accountability systems.

Available: http://testprep.princetonreview.com/testingtesters/report.asp

Page 21: Definitions of Language Proficiency - didisukyadi.staf.upi.edu  · Web viewEmpat keterampilan berbahasa yang memberi kontribusi atas kemahiran berbahasa ... gesture) dan keterampilan

Resnick, B. (2004, April). Majority of districts/schools employ “high-stakes” testing. Successful School

Marketer. Retrieved December 9, 2005, from http://www.schooldata.com/ssm-resnick-majority.htm

Resnick, M. (2004). The educated student: Defining and advancing student achievement. Alexandria VA:

National School Boards Association.

Rudner, L., & Gagne, P. (2001). An overview of three approaches to scoring written essays by computer.

ERIC Digest. ED458290 2001-12-0

Shermis, M. D., Rasmussen, J. L., Rajecki, D. W., Olson, J., & Marsilio, C. (2001). All prompts are created

equal, but some prompts are more equal than others. Journal of Applied Measurement, 2(2), 154-70.

Sireci, S. G., & Rizavi, S. (2000). Comparing computerized and human scoring of students’ essays. New

York: The College Board. ERIC report number 354.

Stiggins, R. (2004, September). New assessment beliefs for a new school mission. Phi Delta Kappan, 88(1),

22-27.

Stokes, V. (2005, October). No longer a year behind. Learning and Leading with Technology, 33(2), 15-17.

Stover, D. (1999, March, 23). Who grades the essays on standardized tests? School Board News, p. 3.

Toch, T. (2006, January). Margins of Error: The Education Testing Industry in the No Child Left Behind Era.

Washington, D.C.: Education Sector./p>

Wilde, J. (2004, January). Definitions for the no child left behind act of 2001: Assessment. Washington DC:

National Clearinghouse for English Language Acquisition (NCELA).

Zucker, S. (2003, December). Fundamentals of standardized testing. San Antonio TX: Harcourt

Assessment, Inc.

References

American Psychological Association. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Amori, B. A., Dalton, E. F., & Tighe, P. L. (1992). IPT 1 Reading & Writing, Grades 2-3, Form 1A, English. Brea, CA: Ballard & Tighe, Publishers.

Anastasi, A. (1988). Psychological testing (6th ed.). New York, NY: Macmillan Publishing Company.

Ballard, W. S., Tighe, P. L., & Dalton, E. F. (1979, 1982, 1984, & 1991). Examiner's manual IPT I, Oral Grades K-6, Forms A, B, C, and D English. Brea, CA: Ballard & Tighe, Publishers.

Ballard, W. S., Tighe, P. L., & Dalton, E. F. (1979, 1982, 1984, & 1991). Technical manual IPT I, Oral Grades K-6, Forms C and D English. Brea, CA: Ballard & Tighe, Publishers.

Burt, M. K., Dulay, H. C., & Hernández-Chávez, E. (1976). Bilingual Syntax Measure I, technical handbook. San Antonio, TX: Harcourt, Brace, Jovanovich, Inc.

Burt, M. K., Dulay, H. C., Hernández-Chávez, E., & Taleporos, E. (1980). Bilingual Syntax Measure II, technical handbook. San Antonio, TX: Harcourt, Brace, Jovanovich, Inc.

Canale, M. (1984). On some theoretical frameworks for language proficiency. In C. Rivera (Ed.), Language proficiency and academic achievement. Avon, England: Multilingual Matters Ltd.

Canales, J. A. (1994). Linking language assessment to classroom practices. In R. Rodriguez, N. Ramos, & J. A. Ruiz-Escalante (Eds.), Compendium of readings in bilingual education: Issues and practices. Austin, TX: Texas Association for Bilingual Education.

CHECpoint Systems, Inc. (1987). Basic Inventory of Natural Language authentic language testing technical report. San Bernardino, CA: CHECpoint Systems, Inc.

Council of Chief State School Officers. (1992). Recommendations for improving the assessment and monitoring of students with limited English proficiency. Alexandria, VA: Council of Chief State School Officers, Weber Design.

CTB MacMillan McGraw-Hill. (1991). LAS preview materials: Because every child deserves to understand and be understood. Monterey, CA: CTB MacMillan McGraw-Hill.

Cummins, J. (1984). Wanted: A theoretical framework for relating language proficiency to academic achievement among bilingual students. In C. Rivera (Ed.), Language proficiency and academic achievement. Avon, England: Multilingual Matters Ltd.

Dalton, E. F. (1979, 1982, 1991). IPT Oral Grades K-6 technical manual, IDEA Oral Language Proficiency Test Forms C and D English. Brea, CA: Ballard & Tighe, Publishers.

Dalton, E. F., & Barrett, T. J. (1992). Technical manual IPT 1 & 2, Reading and Writing, Grades 2-6, Forms 1A and 2A English. Brea, CA: Ballard & Tighe, Publishers.

De Avila, E. A., & Duncan, S. E. (1990). LAS, Language Assessment Scales, oral technical report, English, Forms 1C, 1D, 2C, 2D, Spanish, Forms 1B, 2B. Monterey, CA: CTB MacMillan McGraw-Hill.

De Avila, E. A., & Duncan, S. E. (1981, 1982). A convergent approach to oral language assessment: Theoretical and technical specifications on the Language Assessment Scales (LAS), Form A. Monterey, CA: CTB McGraw-Hill.

De Avila, E. A., & Duncan, S. E. (1987, 1988, 1989, 1990). LAS, Language Assessment Scales, oral administration manual, English, Forms 2C and 2D. Monterey, CA: CTB MacMillan McGraw-Hill.

Duncan, S. E., & De Avila, E. A. (1988). Examiner's manual: Language Assessment Scales Reading/Writing (LAS R/W). Monterey, CA: CTB/McGraw-Hill.

Durán, R. P. (1988). Validity and language skills assessment: Non-English background students. In H. Wainer & H. I. Braun (Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.

National Commission on Testing and Public Policy. (1990). From gatekeeper to gateway: Transforming testing in America. Chestnut Hill, MA: National Commission on Testing and Public Policy.

Oller, J. W. Jr., & Damico, J. S. (1991). Theoretical considerations in the assessment of LEP students. In E. Hamayan & J. S. Damico (Eds.), Limiting bias in the assessment of bilingual students. Austin, TX: Pro-Ed Publications.

Rivera, C. (1995). How can we ensure equity in statewide assessment programs? Unpublished document. Evaluation Assistance Center-East, George Washington University, Arlington, VA.

Roos, P. (1995). Rights of limited English proficient students under federal law: A guide for school administrators. Unpublished paper presented at Weber State University, Success for All Students Conference, Ogden, UT.

Spolsky, B. (1984). The uses of language tests: An ethical envoi. In C. Rivera (Ed.), Placement procedures in bilingual education: Education and policy issues. Avon, England: Multilingual Matters Ltd.

Ulibarri, D., Spencer, M., & Rivas, G. (1981). Language proficiency and academic achievement: A study of language proficiency tests and their relationship to school ratings as predictors of academic achievement. NABE Journal, 5(3).

Valdés, G., & Figueroa, R. (1994). Bilingualism and testing: A special case of bias. Norwood, NJ: Ablex Publishing Corporation.

Wheeler, P., & Haertel, G. D. (1993). Resource handbook on performance assessment and measurement: A tool for students, practitioners, and policymakers. Berkeley, CA: The Owl Press.

Woodcock, R. W., & Muñoz-Sandoval, A. F. (1993). Woodcock-Muñoz Language Survey comprehensive manual. Chicago, IL: Riverside Publishing Company.

Table 3: Critiques of the four most commonly used tests

LAS
View of language: Language consists of discrete skills and elements.
Problematic aspects:
- Hedberg (1995): LAS-Oral is inadequate for placing language-minority students because of inadequate standardization procedures.
- Carpenter (1994): LAS Reading/Writing is inappropriate for making entry and exit decisions; teacher judgment would be just as valid.

IPT
View of language: Language consists of discrete skills and elements.
Problematic aspects:
- Lopez (2001): Norming procedures limit test validity for a wide range of U.S. students; there is greater emphasis on discrete aspects of language proficiency and less on pragmatic competence; and no studies were conducted to investigate how test content relates to achievement.
- Ochoa (2001): The standardization sample is not representative of the range of U.S. English speakers, nor is the Spanish version representative of the range of Spanish speakers in the U.S.

WMLS
View of language: Cummins' BICS/CALP distinction.
Problematic aspects:
- Crocker (1998): To account for construct validity, test makers rely on intercorrelations, not on an explanation of the underlying traits the test attempts to measure.
- Kao (1998): Test makers provide insufficient information about validity; there is little explanation of the Cognitive Academic Language Proficiency (CALP) construct.
- Schrank, Fletcher, and Guajardo Alvarado (1996):

LAB
View of language: Language consists of discrete skills and elements.
Problematic aspects:
- Chesterfield (1985): The LAB is problematic for identifying students for bilingual programs, contains unnecessary items, and is inadequate for predicting success or as a basis for intervention.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, D.C.: American Psychological Association.

August, D., & Hakuta, K. (Eds.). (1997). Improving schooling for language-minority children: A research agenda. Washington, D.C.: National Academy Press.

Bachman, L. (2000). Modern language testing at the end of the century: Assuring that what we count counts. Language Testing, 17(1), 1-42.

Bowman, B. T., Donovan, M. S., & Burns, M. (Eds.). (2001). Eager to learn: Educating our preschoolers. Washington, D.C.: National Academy Press.

Burt, M. K., Dulay, H. C., Hernández-Chávez, E., & Taleporos, E. (1980). Bilingual Syntax Measure II, technical handbook. San Antonio, TX: Harcourt, Brace, Jovanovich.

Carpenter, C. D. (1994). Review of Language Assessment Scales, Reading and Writing. Supplement to the eleventh mental measurements yearbook. Lincoln, NE: University of Nebraska Press.

Chesterfield, K. B. (1985). Review of Language Assessment Battery. The ninth mental measurements yearbook, Volume I. Lincoln, NE: University of Nebraska Press.

Crocker, L. (1998). Review of the Woodcock-Muñoz Language Survey. The thirteenth mental measurements yearbook. Lincoln, NE: University of Nebraska Press.

Cummins, J. (2000). Language, power, and pedagogy: Bilingual children in the crossfire. Clevedon, UK: Multilingual Matters Ltd.

Cummins, J., Muñoz-Sandoval, A. F., Alvarado, C. G., & Ruef, M. L. (1998). The Bilingual Verbal Ability Tests. Itasca, IL: Riverside.

Dalton, E. F. (1991). IPT Oral Grades K-6 technical manual, IDEA Oral Language Proficiency Test Forms C and D English. Brea, CA: Ballard & Tighe, Publishers.

De Avila, E. A., & Duncan, S. E. (1990). Language Assessment Scales, oral technical report, English, Forms 1C, 1D, 2C, 2D, Spanish, Forms 1B, 2B. Monterey, CA: CTB MacMillan McGraw-Hill.

Del Vecchio, A., & Guerrero, M. (1995). Handbook of language proficiency tests. Albuquerque, NM: Evaluation Assistance Center–Western Region, New Mexico Highlands University.

Garcia, E. (1985). Review of Bilingual Syntax Measure II. The ninth mental measurements yearbook, Volume I. Lincoln, NE: University of Nebraska Press.

Garcia, G. E., & Pearson, P. D. (1994). Assessment and diversity. Review of Research in Education, 20, 337-391.

Gee, J. P. (2003). Opportunity to learn: A language-based perspective on assessment. Assessment in Education, 10, 27-46.

Guyette, T. (1985). Review of Basic Inventory of Natural Language. The ninth mental measurements yearbook, Volume I. Lincoln, NE: University of Nebraska Press.

Guyette, T. (1994). Review of Language Assessment Scales, Reading and Writing. Supplement to the eleventh mental measurements yearbook. Lincoln, NE: University of Nebraska Press.

Harris Stefanakis, E. (1998). Whose judgment counts? Assessing bilingual children, K-3. Portsmouth, NH: Heinemann.

Haber, L. (1985). Review of Language Assessment Scales. The ninth mental measurements yearbook, Volume I. Lincoln, NE: University of Nebraska Press.

Hedberg, N. L. (1995). Review of Language Assessment Scales–Oral. The twelfth mental measurements yearbook. Lincoln, NE: University of Nebraska Press.

Kao, C. (1998). Review of the Woodcock-Muñoz Language Survey. The thirteenth mental measurements yearbook. Lincoln, NE: University of Nebraska Press.

Kindler, A. (2002). Survey of states' limited English proficiency students and available educational programs and services: 2000-2001 summary report. Washington, D.C.: National Clearinghouse for English Language Acquisition and Language Instruction Educational Programs.

Lopez, E. A. (2001). Review of the IDEA Oral Language Proficiency Test. The fourteenth mental measurements yearbook. Lincoln, NE: University of Nebraska Press.

MacSwan, J., Rolstad, K., & Glass, G. V. (2002). Do some school-age children have no language? Some problems of construct validity in the Pre-LAS Español. Bilingual Research Journal, 26, 213-238.

Macías, R. (1998). Summary report of the survey of the states' limited English proficient students and available educational programs and services 1995-96. Washington, D.C.: National Clearinghouse for Bilingual Education.

McLaughlin, B., Gesi Blanchard, A., & Osanai, Y. (1995). Assessing language development in bilingual preschool children. Washington, D.C.: National Clearinghouse for Bilingual Education.

Messick, S. (1988). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: American Council on Education/Macmillan.

No Child Left Behind Act. (2001). Retrieved October 2, 2002, from http://www.nochildleftbehind.gov

Ochoa, S. H. (2001). Review of the IDEA Oral Language Proficiency Test. The fourteenth mental measurements yearbook. Lincoln, NE: University of Nebraska Press.

Rueda, R. (in press). Student learning and assessment: Setting an agenda. In P. Pedraza & M. Rivera (Eds.), National Latino/a Education Research Agenda Project.

Shellenberger, S. (1985). Review of Bilingual Syntax Measure II. The ninth mental measurements yearbook, Volume I. Lincoln, NE: University of Nebraska Press.

Tidwell, P. S. (1995). Review of Language Assessment Scales–Oral. The twelfth mental measurements yearbook. Lincoln, NE: University of Nebraska Press.

Valdés, G., & Figueroa, R. A. (1994). Bilingualism and testing: A special case of bias. Norwood, NJ: Ablex.

Valdés, G. (2001). Learning and not learning English: Latino students in American schools. New York: Teachers College Press.

Wolfram, W., & Schilling-Estes, N. (1998). American English. Malden, MA: Blackwell.

Woodcock, R. W., & Muñoz-Sandoval, A. F. (1993). Woodcock-Muñoz Language Survey comprehensive manual. Chicago: Riverside Publishing Company.
