newton ch2

THE EARLY YEARS OF VALIDITY: 1800S-1951

Maryam Bolouri

Major developments in England, France, Germany, and the USA

1836: matriculation examinations

1845: first in the USA; the superiority of written exams over oral quizzes

1853: India Act for impartial selection for the civil service

1858: local examinations at Oxford and Cambridge

Development of a statistical approach in Britain, such as Spearman's contribution

1904: Binet in France: development of a series of tests to discriminate unmotivated and incapable children from the others

USA: Yerkes et al., development of intelligence tests for army recruits

Purpose: to bring scientific methods to the study of edu, such as achievement tests or the development of mental tests

Problem: growing discontent regarding the unreliability of marks and unfair evaluation by human minds

Personal equation concern:

Solution: sentence completion, T/F items, MC selection…

Development of objective and standard-based assessment (1st roots in the USA, so V is a product of North America)

Led to a mushrooming of published standard tests and of research into tests and testing, 1910-1920

The outcome of pre-1921

Structured and objective assessment

Distinction btw sub-domains of edu and psycho measurement:

1. Professional communities: diagnosis, achievement, selection

2. Scientific communities: explore personality characteristics and innate differences

Distinction btw different types of tests (ling vs. performance; individual vs. group; written and standardized tests)

Recognition of the Co. Co (correlation coefficient) as a tool for judging the quality of tests

Post 1921 era

The term “V” began to take root in the lexicon of researchers and practitioners.

1911 Freeman: technique and V of test methods

1915 Terman: evaluated the V of intelligence and IQ tests

1916 Starch: referred to V or fairness of measures

1916 Thorndike: essentials of a valid scale

1919 APA: attempts at professional certification in response to the use of mental tests by unqualified individuals

Post 1921 era

1921 NADER (National Association of Directors of Educational Research): sought standardization and consistency among concepts and procedures (similar to APA attempts in 1895, 1906).

Regulations proposed by them:

1. Preparation and selection

2. Experimental org of test and instruction

3. Trial of tentative test

4. Final org of test

5. Final cond of test (scoring, tabulation and interpretation)

6. Determine V

7. Determine R

8. Determine norms

1st official definition of V By NADER

Challenged to promote and develop new methods

1st classic definition of V:

The degree to which a test or examination measures what it purports to measure

The idea of criterion was central to this and the dominant approaches were predictive or concurrent ones.

Content considerations existed but were neither sig nor robust

1915-1930 boom period: new tests multiplied like rabbits, with little critical attention to the instruments or the results

Early years:

Over-simplistic descriptions

Elaboration of insights that had been established before

Elevation of empirical evidence at the expense of logical analysis (dust-bowl empiricism)

According to Shepard: 1920-1950: defense to test-criterion correlations

1940s: V = predictive Co. Co

According to Kane: criterion phase

According to Cronbach: whole of V theory: prediction

Some issues regarding early years:

1. We cannot ignore early years

Theory of prediction; descriptive and explanatory investigations

Omitting the early years from the discussion is counterproductive, and we shouldn’t teach V from the baseline of 1954.

Only with reference to the baseline of 1921 can the transition from the Trinitarian conception of V to present-day theory be understood.

Some issues regarding early years:

2. Too many seminal works

In the early years there were too many seminal works, which made it impossible for a coherent tradition to emerge.

Each with new perspectives

The 1920s were prolific for edu measurement

Differences in perspective among authors within sub-domains as well as across different sub-domains

Some issues regarding early years:

3. V in different ways and phases

Both wars influenced testing and validation.

Large-scale implementation of mental testing, and a method of scoring by stencil for rapid marking, by Otis during the 1st World War

The Army Alpha (α) and Beta (β): military aptitude testing gave mental testing publicity and prestige

Mechanical test construction to predict criterion measures (blindly empirical)

This is only one side of a complex story running from the mid-19th century into the 20th (to 1952)

Prediction phase: a caricature

1) Widespread adoption of blindly empirical methods, specifically aptitude testing for the army

2) The degradation of the classic definition over time: the method for measuring V was mistaken for the definition of V. It consists of 3 stages:

a) Quality of measurement

b) Degree of correlation btw test and criterion

c) Co. Co btw the test and criterion

From a to b: 1922, McCall: only by correlations do we know what a test measures

Classic definition: discrete V and validation; it was a conceptual abstraction.

A hypothetical true proficiency rank as an absolute criterion

There is no single true proficiency rank, but a range of ranks

No sense of prediction, just correlation btw actual test results and hypo proficiency

From a to b: 1922, McCall: only by correlations do we know what a test measures

2 methods to determine the correspondence:

1. Determine true proficiency by prolonged, careful observation in real-life situ and use it as the criterion; rank students on the test; correlate the two rankings

2. Take pupils of known proficiency rank; rank them on the test; correlate the two rankings
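
The rank-and-correlate procedure can be sketched in Python. This is only an illustration under stated assumptions, not a method taken from the source: the pupil data are hypothetical, the helper names (ranks, pearson, rank_correlation) are invented for this sketch, and the Co. Co is computed as a Pearson correlation on ranks (i.e. Spearman's rank correlation), which is an assumption about the statistic intended.

```python
def ranks(scores):
    """Rank scores from 1 (highest score) downward; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def rank_correlation(criterion_rank, test_scores):
    """Correlate a known proficiency ranking with the ranking on the test."""
    return pearson(criterion_rank, ranks(test_scores))


# Hypothetical data: five pupils whose proficiency rank is already known
# (1 = most proficient), and their scores on the new test.
criterion_rank = [1, 2, 3, 4, 5]
test_scores = [48, 50, 39, 35, 20]

rho = rank_correlation(criterion_rank, test_scores)
print(round(rho, 2))  # 0.9: only the top two pupils swap places
```

The closer rho is to 1, the more closely the test reproduces the criterion ranking; everything then hinges on how trustworthy the criterion ranking itself is.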

Other approaches to develop criterion:

Expert or teacher judgment

Results of multiple existing tests that measure the same thing

Results from specific tests

From b to c: change of criteria from conceptual abstraction to more concrete and pragmatic measures

Coefficient of V = Co. Co btw test scores and criterion scores

V = observed agreement, rather than a hypo agreement btw test scores and true proficiency

V = empirical correlation

There was no questioning of the V of the criterion scores!!

Fusion of definition and method

Underscored the use of the test: each test has a different V with regard to each use
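
A small Python sketch of the coefficient of V as an observed correlation, and of why the same test ends up with a different V for each use. The scores, the two criteria (teacher ratings, a later exam) and the function name validity_coefficient are all hypothetical, chosen only for illustration.

```python
def validity_coefficient(test_scores, criterion_scores):
    """Pearson correlation between test scores and criterion scores."""
    n = len(test_scores)
    mean_t = sum(test_scores) / n
    mean_c = sum(criterion_scores) / n
    cov = sum((t - mean_t) * (c - mean_c)
              for t, c in zip(test_scores, criterion_scores))
    var_t = sum((t - mean_t) ** 2 for t in test_scores)
    var_c = sum((c - mean_c) ** 2 for c in criterion_scores)
    return cov / (var_t * var_c) ** 0.5


# Hypothetical scores for five students on one test ...
test_scores = [10, 12, 15, 18, 20]
# ... and on two different criteria the test might be used to stand in for.
teacher_ratings = [2, 3, 3, 4, 5]
later_exam = [55, 40, 60, 50, 65]

v_ratings = validity_coefficient(test_scores, teacher_ratings)
v_exam = validity_coefficient(test_scores, later_exam)
print(round(v_ratings, 2), round(v_exam, 2))  # roughly 0.96 and 0.50
```

Nothing in the calculation checks the quality of the criterion scores themselves, which is exactly the gap noted above.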

From b to c: change of criteria from conceptual abstraction to more concrete and pragmatic measures

Dominance of atheoretical definition

Distinction btw practical V and factorial V

Practical V: a test is valid for anything with which it correlates (Guilford, 1946)

There are 2 kinds of V, and practical V addresses the fundamental question of V

Undue emphasis on empirical evidence

Problem: inadequacy of the definition, and the criterion problem

Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes

1. School achievement – Walter Monroe

V as a multifaceted concept based on correlation; a conceptual definition of V was expressed

a. Objectivity in describing the performances (rater)

b. Reliability (Co of R, index of R, error of measurement, Co of correspondence, overlapping of grade groups)

c. Discrimination (agreement with the normal curve)

d. Comparison with criterion measures

e. V inference based on test structure and admin

Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes

6 threats to valid interpretation:

1. Do the tasks require other abilities?

2. Can the tasks be answered in a variety of methods (other than the intended one)?

3. Is the test administered under a variety of conds?

4. Do students continue to exercise their ability across all tasks?

5. Are the tasks rep of the field of ability being measured?

6. Are all students given this opportunity?

Unitary conception of V:

Integration of multiple sources of empirical evidence and logical analysis

2 primary categories of sources of evidence:

1. Expert opinion vs. experimental – Ruch 1929

2. Curricular vs. statistical – Ruch 1933

3 approaches to logical analysis (Ruch 1929):

1. Judgment by competent persons on the appropriateness of content

2. Alignment of content with textbooks

3. Alignment of content with recommendations of national edu committees

Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes

Fundamental role: extensive sampling in school achievement tests, random sampling from the field, or rep of the most important elements, measuring the same thing or attribute

Tests parallel to actual teaching

Centrality of logical analysis

Problem: no field is perfectly homogeneous, so there would always be a certain degree of compromise

Major innovation: scaling, tests with different levels of difficulty; items of a test were not selected based on content and rep effectively

Problem: tension btw discrimination and sampling

From random sampling to restricted sampling

It is not possible to construct a robust measure of overall achievement based on weighted sampling of behavior across the entire achievement domain.

So instead of a rep sample we should tap the essence of achievement.

So items with a high correlation to general achievement must be selected. Each item plays a role in contributing to the essence of the general achievement attribute.

Items that discriminate btw high and low students correlate highly with the criterion.
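
Restricted sampling can be sketched as item selection by correlation with a general achievement measure. This is a rough illustration only: the item responses, the achievement scores, the 0.5 cut-off and the function names are hypothetical, not values or procedures reported in the source.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def select_items(responses, general_achievement, threshold):
    """Keep the items whose scores correlate with general achievement at or above threshold."""
    return [name for name, scores in responses.items()
            if pearson(scores, general_achievement) >= threshold]


# Hypothetical data: six students, ordered from high to low general achievement.
general_achievement = [90, 85, 70, 60, 45, 30]

# 1 = item answered correctly, 0 = incorrectly.
responses = {
    "item_1": [1, 1, 1, 0, 0, 0],  # separates high from low students
    "item_2": [1, 0, 1, 0, 1, 0],  # barely related to achievement
    "item_3": [1, 1, 0, 1, 0, 0],  # mostly separates high from low
    "item_4": [0, 1, 0, 1, 0, 1],  # slightly negative relation
}

selected = select_items(responses, general_achievement, threshold=0.5)
print(selected)  # ['item_1', 'item_3']
```

Note that items are kept or dropped purely on discrimination; whether they represent the curriculum is not considered, which is the compromise the curriculum viewpoint objects to.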

From random sampling to restricted sampling

V from the curriculum viewpoint and V from the general achievement viewpoint need to arrive at a compromise.

A large unresolved tension can be detected throughout the study by Lindquist (1936)

Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes

Tyler (1931): V in terms of the usefulness of the test in measuring the attainment of course objectives

He was not opposed to the empirical approach, but was not impressed by the use of teachers' marks (T marks) as the empirical criterion

His suggestion: development of preliminary tests for each course objective, to help with

1) creating comprehensive criterion measures

2) diagnostic purposes

Then preparation of some practical tests to be validated by correlation

Tyler’s concerns:

1. Sampling

2. Test construction

3. Validity

4. Mental process: no distinction btw the content of the subj and the required mental process; items test info, not the interpretation or application of principles

5. Negative impacts of tests on instruction and the reform of curriculum. Studying and teaching were adapted to the emphasis of tests

Tension btw empirical and logical, 1930s-1940s

Overemphasis on empirical: inadequacy of criteria for establishing V, and backwash effect on teaching and learning

Overemphasis on logical: impossibility of rep sampling and fallibility of human judgement

Tyler: rational hypo in test construction

Pendulum swings against empirical considerations (technician viewpoint)

2 key principles in evaluation movement

1. The evaluation could not begin until the curriculum had been defined in terms of behavioral objectives

2. Any useful device might be employed in producing an account of pupil growth:

Teacher judgment

Essay examination

Objective test

Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes

2. Intelligence

Logical approach: raw brain power, and Binet-Simon scales were extended.

Problem: a thorough description of the universe of intelligent behavior was not straightforward; there was no clear definition

Binet: faculties are different from general intelligence; a single test can be a test of intelligence.

Post-Binet: not a single test but combined tests (manifold and heterogeneous); performance on a test is the product of both faculties and general intelligence.

Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes

Solution: permissive sampling, i.e. assess considerably more than the essence of intelligence

V can be maximized by intentional construct under-representation or intentional construct irrelevance

Assumption: random, irrelevant item variance cancels out by the law of averages.
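
The law-of-averages assumption can be illustrated with a small simulation. This is a sketch under strong assumptions that are not in the source: every item score is modelled as the essence plus independent irrelevant variance, both normally distributed, with the irrelevant part twice as variable as the essence. The function names and all numbers are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def correlation_with_essence(n_items, n_people=2000, seed=0):
    """Correlate total test score with the simulated essence for a test of n_items."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    essence = [rng.gauss(0, 1) for _ in range(n_people)]
    totals = []
    for g in essence:
        # Each item = essence + its own random, irrelevant variance.
        totals.append(sum(g + rng.gauss(0, 2) for _ in range(n_items)))
    return pearson(essence, totals)


r_short = correlation_with_essence(n_items=2)
r_long = correlation_with_essence(n_items=40)
print(round(r_short, 2), round(r_long, 2))
```

Under these assumptions the longer, more heterogeneous test tracks the essence more closely, because the independent irrelevant parts average out. If the irrelevant variance were shared across items rather than random, it would accumulate instead of cancelling.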

Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes

Empirical approach: a criterion measure of intelligence is needed

During the 1st World War: a number of reputed tests of higher quality to be adopted as yardstick

Otis group test: most valid Terman Group

Miller Group test: least valid Army Alpha

Cattell, 1943: promoted F.A (factor analysis) as an important validation technique and transformed it from a lay activity to scientific practice

Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes

3. Aptitudes

For the purpose of vocational guidance and selection

1st assumption: aptitudes were stable, if not innate

2nd assumption: aptitudes differ across and within individuals along continua

Difference of aptitude measurement: the criterion was not sth of the present but of the future.

Successful performance in a vocation = exercise of skills and abilities that had not yet been developed.

Problem: how should it be validated?

Empirical approach of Aptitude test:

The idea of sampling is meaningless here, so it led to the elevation of empirical approaches, in 4 stages:

1) Administer the aptitude test

2) Wait until the required skills and abilities are acquired

3) Assess job proficiency in situ

4) Correlate the results of the test and the assessment of job proficiency

Empirical approach of Aptitude test:

Absence of clear rational principles

Development based on haphazard trial-and-error search for effective predictors

With minimum rationality

Large list of preference to discriminate btw professions

Selection of items with high correlation to criterion in successive fashion (multiple regression challenge)

Low inter-item correlation and high correlation with criterion (weakness of aptitude test)
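
Successive selection can be sketched as a greedy search: at each step, add the item that most raises the correlation between the running total and the criterion. This is a simplified stand-in, not the historical procedure: it uses unit weights instead of fitted multiple-regression weights, and the item scores, criterion and function names are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def forward_select(items, criterion, k):
    """Greedily pick k items; each pick maximizes r(composite, criterion)."""
    chosen = []
    composite = [0.0] * len(criterion)
    for _ in range(k):
        best_name, best_r = None, None
        for name, scores in items.items():
            if name in chosen:
                continue
            trial = [c + s for c, s in zip(composite, scores)]
            r = pearson(trial, criterion)
            if best_r is None or r > best_r:
                best_name, best_r = name, r
        chosen.append(best_name)
        composite = [c + s for c, s in zip(composite, items[best_name])]
    return chosen


# Hypothetical scores for eight workers on three candidate items.
items = {
    "item_A": [1, 1, 1, 1, -1, -1, -1, -1],
    "item_B": [1.5, 0.5, 1.5, 0.5, -0.5, -1.5, -0.5, -1.5],  # near-copy of item_A
    "item_C": [1, 1, -1, -1, 1, 1, -1, -1],                  # unrelated to item_A
}
# Hypothetical job-proficiency criterion.
criterion = [3, 3, 1, 1, -1, -1, -3, -3]

print(forward_select(items, criterion, k=2))  # ['item_A', 'item_C']
```

On its own, item_B correlates more highly with the criterion than item_C, yet item_C is picked second because it overlaps less with item_A. The resulting set predicts well but its items have little in common, and nothing in the search explains why they work.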

Achilles heel of aptitude testing

Robust criterion measures

V for criterion measures

2 major components of the criterion problem:

1. The definition of criterion: subjective judgment and widespread lack of agreement over occupational success

2. The development of a procedure to measure the criterion

Thorndike (1949): 3 categories of criteria

1. Ultimate category: complete final goal of a particular type of selection, multifaceted and not available for direct study

2. Intermediate category

3. Immediate category

Validation will fall back on no. 2 and 3

Blind empiricism is fragile and dangerous. This was said repeatedly by Messick, 1970s-1990s

Mid 1940s: Paul Meehl and Lee Cronbach, construct V

Paul Meehl:

Dissatisfied by client self-rating

Self-rating should not be used as a behavior surrogate but as an indirect sign of sth deeper

Because it requires

1. Appropriate level of self-understanding

2. Willingness to disclose

Mid 1940s: Paul Meehl and Lee Cronbach, construct V

Lee Cronbach:

Impact of item format

Response set: the tendency to respond differently to the same content when it is presented in a different item format

6 kinds of response set: Give many responses, Speed, Accuracy, Gamble…

A threat to V: different individuals demonstrate different response sets on the same set of items

Solution: use T/F less and MC more

Cronbach (1949): 5 technical criteria of a good test

1. Validity

2. Reliability

3. Objectivity

4. Norms

5. Good items

2 approaches: logical analysis (psychological understanding of attribute) and empirical evidence

Cronbach (1949): V as the correspondence of the test to the definition of the attribute

There are items that correspond to the definition of the attribute yet bring in irrelevant variables that make the items impure:

1. Items that test takers can answer using different methods

2. Items that are less accessible to test takers from certain cultural groups

3. Items that are vulnerable to response sets

4. Items that correspond to content yet fail to assess the desired processes

Cronbach (1949): ultimate consideration

1. Logical analysis is inferior to empirical evidence.

2. Most frequently used criteria: instructors' or supervisors' ratings, other tests of the same attribute

3. Discussed the criterion problem in depth

4. Rise of a particular empirical approach: factorial V, the degree to which a test could purely measure one type of ability
