newton ch2
THE EARLY YEARS OF VALIDITY: 1800s-1951
Maryam Bolouri
Major developments in England, France, Germany, and the USA
1836: matriculation examinations
1845: first in USA, the superiority of written exam over oral quiz
1853: India Act for impartial selection for civil services
1858: local examinations in Oxford and Cambridge
Development of statistical approach in Britain, such as Spearman's contribution
Major developments in England, France, Germany, and the USA
1904: Binet in France, development of a series of tests to discriminate unmotivated and incapable children from the others
USA, Yerkes et al.: development of intelligence tests for army recruits
Purpose: bring scientific methods to the study of edu, such as achievement tests or development of mental tests
Problem: growing discontent regarding the unreliability of marks and unfair evaluation by human minds
Personal equation concern:
Solution: sentence completion, T/F items, MC selection…
Development of objective and standard-based assessment (1st roots in USA, and so V is the product of NA)
Led to the mushrooming publication of standard tests and research into tests and testing from 1910 to 1920
The outcome of pre-1921
Structured and objective assessment
Distinction btw sub-domains of edu and psycho measurement:
1. Professional communities: diagnosis, achievement, selection
2. Scientific communities: explore personality characteristics and innate differences
Distinction btw different types of tests (linguistic vs. performance, individual vs. group, written and standardized tests)
Recognition of Co. Co (correlation coefficient) as a tool for judging the quality of tests
Post 1921 era
The term “V” began to take root in the lexicon of researchers and practitioners.
1911 Freeman: technique and V of test methods
1915 Terman: evaluated the V of intelligence and IQ tests
1916 Starch: referred to V or fairness of measures
1916 Thorndike: essentials of valid scale
1919 APA: attempts for professional certification in response to use of mental tests by unqualified individuals
Post 1921 era
1921 NADER (National Association of Directors of Educational Research): sought standardization and consistency among concepts and procedures (similar to APA attempts in 1895, 1906).
Regulations proposed by them:
1. Preparation and selection
2. Experimental org of test and instruction
3. Trial of tentative test
4. Final org of test
5. Final cond of test (scoring, tabulation and interpretation)
6. Determine V
7. Determine R
8. Determine norms
1st official definition of V by NADER
Challenged to promote and develop new methods
1st classic definition of V:
The degree to which a test or examination measures what it purports to measure
The idea of criterion was central to this and the dominant approaches were predictive or concurrent ones.
Content considerations existed, yet were not significant or robust
1915-1930 boom period: new tests multiplied like rabbits, with little critical scrutiny of the instruments or the results
Early years:
Over-simplistic descriptions
Elaboration of insights that had been established before
Elevation of empirical evidence at the expense of logical analysis (dust-bowl empiricism)
According to Shepard: 1920-1950: defense to test-criterion correlations
1940s: V = predictive Co. Co
According to Kane: criterion phase
According to Cronbach: whole of V theory: prediction
Some issues regarding early years: 1. We cannot ignore early years
Theory of prediction, descriptive and explanatory investigations
Omitting the early years from discussion is counterproductive, and we shouldn't teach V from the baseline of 1954.
Only with reference to the baseline of 1921 can the transition from the Trinitarian conception of V to present-day theory be understood.
Some issues regarding early years: 2. Too many seminal works
In the early years there were too many seminal works, each with new perspectives, which made it impossible for a coherent tradition to emerge.
The 1920s were prolific for edu measurement
Difference in perspectives among authors within sub-domains as well as in different sub-domains
Some issues regarding early years: 3. V in different ways and phases
Both wars influenced testing and validation.
Large implementation of mental testing and a method of scoring by stencil for rapid marking by Otis during 1st World War
The army α and β: military aptitude gave mental testing publicity and prestige
Mechanical test construction to predict criterion measures (blindly empirical)
This is only one side of this complex story from mid of 19th to 20th century (to 1952)
Prediction phase a caricature:
1) Widespread adoption of blindly empirical methods, specifically aptitude testing for the army
2) The degradation of the classic definition over time: the method for V measurement was mistaken for the definition of V. It consists of 3 stages:
a) Quality of measurement
b) Degree of correlation btw test and criterion
c) Co. Co btw the test and criterion
From a to b: 1922: McCall, only by correlations do we know what a test measures
Classic definition: discrete V and validation; it was a conceptual abstraction.
A hypothetical true proficiency rank as an absolute criterion
There is no single true proficiency rank but a range of ranks
No sense of prediction, just in terms of correlation btw actual test results and hypo proficiency
From a to b: 1922: McCall, only by correlations do we know what a test measures
2 methods to determine the correspondence:
1. Determine true proficiency by prolonged careful observation in real-life situ and use it as the criterion; rank students on the test; correlate the two rankings
2. Take pupils with known proficiency; rank them on the test; correlate the two rankings
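Both methods end in the same step: correlate a criterion ranking with a test ranking. A minimal sketch of that step, assuming Spearman's rank correlation for rankings without ties (the slides do not say which coefficient was used; the function name and the pupil data are hypothetical):

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two untied rankings of the same pupils."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("need two rankings of equal length >= 2")
    # Sum of squared rank differences, pupil by pupil.
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data: five pupils ranked on the criterion (proficiency
# established by observation) and ranked again on the test.
criterion_rank = [1, 2, 3, 4, 5]
test_rank = [2, 1, 3, 5, 4]

rho = spearman_rho(criterion_rank, test_rank)
print(round(rho, 2))
```

A value near 1 would mean the test orders pupils almost as the criterion does; the simplified formula holds only when neither ranking contains ties.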
Other approaches to develop criterion:
Expert or teacher judgment
Results of multiple existing tests that measure the same thing
Results from specific tests
From b to c: change of criteria from conceptual abstraction to more concrete and pragmatic measures
Coefficient of V = Co. Co btw the test scores and criterion scores
V = observed agreement rather than a hypo agreement btw test scores and true proficiency
V = empirical correlation
There was no questioning of the V of criterion scores!!
Fusion of definition and method
Underscored the use of the test: each test has different V with regard to the use
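A minimal sketch of the coefficient of V as an observed correlation, assuming the Pearson product-moment coefficient. The function name, the scores, and the two criteria are hypothetical and only illustrate why the same test gets a different V for each use:

```python
from math import sqrt


def validity_coefficient(test_scores, criterion_scores):
    """Pearson correlation between test scores and criterion scores."""
    n = len(test_scores)
    if n != len(criterion_scores) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mean_t = sum(test_scores) / n
    mean_c = sum(criterion_scores) / n
    cov = sum((t - mean_t) * (c - mean_c)
              for t, c in zip(test_scores, criterion_scores))
    ss_t = sum((t - mean_t) ** 2 for t in test_scores)
    ss_c = sum((c - mean_c) ** 2 for c in criterion_scores)
    if ss_t == 0 or ss_c == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(ss_t * ss_c)


# Hypothetical data for five test takers.
test_scores = [10, 12, 15, 18, 20]
supervisor_ratings = [3, 2, 4, 5, 4]   # criterion for one use of the test
course_grades = [55, 70, 60, 65, 50]   # criterion for a different use

v_ratings = validity_coefficient(test_scores, supervisor_ratings)
v_grades = validity_coefficient(test_scores, course_grades)
print(round(v_ratings, 3), round(v_grades, 3))
```

The computation never asks whether the ratings or grades are themselves valid, which is the gap the slide points to.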
From b to c: change of criteria from conceptual abstraction to more concrete and pragmatic measures
Dominance of atheoretical definition
Distinction btw practical V and factorial V
Practical V: a test is valid for anything with which it correlates (Guilford, 1946)
There are 2 kinds of V and the practical V addresses the fundamental Q of V
Undue emphasis on empirical evidence; problem: inadequacy of definition and criterion problem
Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes
1. School achievement – Walter Monroe
V as multifaceted concept based on correlation, and a conceptual definition of V was expressed
a. Objectivity in describing the performances (rater)
b. Reliability (Co of R, index of R, error of measurement, Co of correspondence, overlapping of grade groups)
c. Discrimination (agreement with normal curve)
d. Comparison with criterion measures
e. V inference based on test structure and admin
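Three of the reliability terms in item b have standard classical-test-theory formulas. A hedged sketch: the index of R is taken as the square root of the Co of R, and the error of measurement as SD × sqrt(1 − Co of R). Whether Monroe computed them exactly this way is not stated in these notes, and the function names and figures are hypothetical:

```python
from math import sqrt


def index_of_reliability(reliability_coefficient):
    """Square root of the reliability coefficient (classical test theory)."""
    if not 0 <= reliability_coefficient <= 1:
        raise ValueError("reliability coefficient must lie in [0, 1]")
    return sqrt(reliability_coefficient)


def standard_error_of_measurement(score_sd, reliability_coefficient):
    """Expected spread of observed scores around a test taker's true score."""
    if not 0 <= reliability_coefficient <= 1:
        raise ValueError("reliability coefficient must lie in [0, 1]")
    return score_sd * sqrt(1 - reliability_coefficient)


# Hypothetical test: reliability coefficient 0.81, score SD of 10 points.
index_r = index_of_reliability(0.81)
sem = standard_error_of_measurement(10, 0.81)
print(round(index_r, 2), round(sem, 2))
```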
Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes
6 threats to valid interpretation:
1. Do the tasks require other abilities?
2. Can the tasks be answered by a variety of methods (other than the intended one)?
3. Is the test administered under a variety of conds?
4. Do students continue to exercise their ability across all tasks?
5. Are the tasks rep of the field of ability being measured?
6. Are all students given this opportunity?
Unitary conception of V:
Integration of multiple sources of empirical evidence and logical analysis
2 primary categories of sources of evidence:
1. Expert opinion vs. experimental – Ruch 1929
2. Curricular vs. statistical – Ruch 1933
3 approaches to logical analysis (Ruch 1929):
1. Competent person judgment on the appropriateness of content
2. Alignment of content with textbook
3. Alignment of content with recommendation of national edu committees
Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes
Fundamental role: extensive sampling in school achievement tests, random sampling from the field, or rep of the most important elements, measuring the same thing or attribute
Tests parallel to actual teaching
Centrality of logical analysis
Problem: no field is perfectly homogeneous, so there would always be a certain degree of compromise
Major innovation: scaling, tests with different levels of difficulty; items of a test were not selected based on content and rep effectively
Problem: tension btw discrimination and sampling
From random sampling to restricted sampling
It is not possible to construct a robust measure of overall achievement based on weighted sampling of behavior across the entire achievement domain.
So instead of a rep sample we should tap the essence of achievement.
So those items with high correlation to general achievement must be selected. Each item plays a role contributing to the essence of the general achievement attribute.
Items that discriminate btw high and low students correlate highly with the criterion.
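Restricted sampling can be sketched as item selection by correlation with a general achievement criterion. This is an illustration under assumptions, not the procedure of any author cited here: items are scored 0/1, the criterion is an external achievement score, and the threshold of 0.5 and all data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        return 0.0
    return cov / sqrt(ss_x * ss_y)


def select_discriminating_items(item_scores, criterion, threshold):
    """Return indices of items whose correlation with the criterion
    is at least `threshold`. `item_scores[j]` holds item j's 0/1 scores."""
    return [j for j, scores in enumerate(item_scores)
            if pearson(scores, criterion) >= threshold]


# Hypothetical general achievement scores for six pupils (high to low).
general_achievement = [90, 80, 70, 40, 30, 20]
item_scores = [
    [1, 1, 1, 0, 0, 0],  # passed only by high achievers: discriminates well
    [1, 0, 1, 0, 1, 0],  # weakly related to achievement
    [0, 0, 0, 1, 1, 1],  # passed only by low achievers: negative correlation
    [1, 1, 1, 1, 1, 1],  # passed by everyone: no discrimination at all
]

kept = select_discriminating_items(item_scores, general_achievement, 0.5)
print(kept)
```

Only the first item survives, whatever content it covers. That is the tension with rep sampling: selection by discrimination ignores how much of the curriculum the retained items represent.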
From random sampling to restricted sampling
V from the curriculum viewpoint and V from the general achievement viewpoint need to arrive at a compromise.
A large unresolved tension can be detected throughout the study by Lindquist (1936)
Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes
Tyler (1931): V in terms of usefulness of the test in measuring the attainment of course objectives
He was not opposed to the empirical approach, but was not impressed by the use of teacher marks as empirical criterion
His suggestion: development of preliminary tests for each course objective to help with
1) creating comprehensive criterion measures
2) diagnostic purposes
Then preparation of some practical tests to be validated by correlation
Tyler's concerns:
1. Sampling
2. Test construction
3. Validity
4. Mental process: no distinction btw content of subj and the required mental process, and items test info, not the interpretation or application of principles
5. Negative impacts of tests on instruction and the reform of curriculum. Studying and teaching were adapted to the emphasis of tests
Tension btw empirical and logical, 1930s-1940s
Overemphasis on empirical: inadequacy of criteria for establishing V and backwash effect on teaching and learning
Overemphasis on logical: impossibility of rep sampling and fallibility of human judgement
Tyler: rational hypo in test construction
Pendulum swings against empirical considerations (technician viewpoint)
2 key principles in evaluation movement
1. The evaluation could not begin until the curriculum had been defined in terms of behavioral objectives
2. Any useful device might be employed in the production of pupil growth account:
Teacher judgment
Essay examination
Objective test
Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes
Logical approach: Raw brain power and Binet-Simon scales were extended.
Problem: thorough description of the universe of intelligent behavior was not straightforward; there was no clear definition
Binet: faculties are different from general intelligence; a single test can be a test of intelligence.
Post-Binet: not a single test, but combined tests (manifold and heterogeneous); performance on a test is the product of both faculties and general intelligence.
Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes
Solution: permissive sampling, assess considerably more than the essence of intelligence
V can be maximized by intentional construct under-representation or intentional construct-irrelevance
Assumption: random irrelevant item variance cancels out by the law of averages.
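The law-of-averages assumption can be written out as a short worked equation. This is a sketch in symbols that do not appear in the original notes: for one test taker, each item score X_i is assumed to be a common component g plus an item-specific irrelevant component e_i, with the e_i independent, mean zero, and of equal variance.

```latex
X_i = g + e_i, \qquad \mathbb{E}[e_i] = 0, \quad \operatorname{Var}(e_i) = \sigma^2, \quad i = 1, \dots, n

\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = g + \frac{1}{n}\sum_{i=1}^{n} e_i

\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} e_i\right) = \frac{\sigma^2}{n} \longrightarrow 0 \quad \text{as } n \to \infty
```

The cancellation depends on the independence assumption: irrelevant variance shared across items (systematic rather than random) would not shrink as items are added.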
Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes
Empirical approach: Criterion measure of intelligence is needed
During 1st World War: a number of reputed tests of higher quality to be adopted as yardstick
Otis group test: most valid Terman Group
Miller Group test: least valid Army Alpha
Cattell (1943): promoted F.A (factor analysis) as an important validation technique and transformed it from lay activity to scientific practice
Terman (1928): 3 primary concerns of edu and psycho measurement: 1. achievement 2. intelligence 3. aptitudes
For the purpose of vocational guidance and selection
1st assumption: aptitudes were stable, if not innate
2nd assumption: aptitudes differ across and within individuals along continua
Difference of aptitude measurement: the criterion was not sth of the present but of the future.
Successful performance in vocation = exercise of skills and abilities that had not yet been developed.
Problem: how should it be validated?
Empirical approach of Aptitude test:
The idea of sampling is meaningless, so it led to elevation of empirical approaches in 4 stages:
1) Administer the aptitude test
2) Wait until the required skills and abilities are acquired
3) Assess job proficiency in situ
4) Correlate the results of the test and the assessment of job proficiency
Empirical approach of Aptitude test:
Absence of clear rational principles
Development based on haphazard trial-and-error search for effective predictors, with minimum rationality
Large list of preference to discriminate btw professions
Selection of items with high correlation to criterion in successive fashion (multiple regression challenge)
Low inter-item correlation and high correlation with criterion (weakness of aptitude test)
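The successive selection described above can be sketched as a greedy loop: take predictors in order of their correlation with the criterion and skip any that overlap too much with one already chosen. This is a simplified heuristic for illustration, not multiple regression and not a documented historical procedure; the predictor names, the data, and both cut-offs are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        return 0.0
    return cov / sqrt(ss_x * ss_y)


def select_predictors(predictors, criterion, min_validity, max_overlap):
    """Greedy selection: high correlation with the criterion,
    low correlation with every predictor already selected."""
    validity = {name: abs(pearson(scores, criterion))
                for name, scores in predictors.items()}
    ranked = sorted(predictors, key=lambda name: validity[name], reverse=True)
    chosen = []
    for name in ranked:
        if validity[name] < min_validity:
            break  # remaining predictors correlate even less with the criterion
        overlaps = [abs(pearson(predictors[name], predictors[c])) for c in chosen]
        if all(r <= max_overlap for r in overlaps):
            chosen.append(name)
    return chosen


# Hypothetical 0/1 predictor scores and later job proficiency for 8 recruits.
predictors = {
    "clerical_speed":         [1, 1, 1, 1, 0, 0, 0, 0],
    "clerical_speed_variant": [1, 1, 1, 0, 1, 0, 0, 0],
    "spatial":                [1, 1, 0, 0, 1, 1, 0, 0],
}
job_proficiency = [3, 3, 2, 2, 1, 1, 0, 0]

chosen = select_predictors(predictors, job_proficiency,
                           min_validity=0.4, max_overlap=0.3)
print(chosen)
```

The variant is dropped even though it correlates more highly with the criterion than the spatial predictor, because it largely duplicates one already chosen. No rationale about what the predictors measure enters the loop, which is the blindly empirical character criticised above.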
Achilles heel of aptitude testing
Robust criterion measures
V for criterion measures
2 major components of criterion problem:
1. The definition of criterion: subjective judgment and widespread lack of agreement over occupational success
2. The development of a procedure to measure the criterion
Thorndike (1949): 3 categories of criteria
1. Ultimate category: complete final goal of a particular type of selection, multifaceted and not available for direct study
2. Intermediate category
3. Immediate category
Validation will fall back on no. 2 and 3
Blind empiricism is fragile, dangerous. It was repeatedly said by Messick, 1970s-1990s
Mid 1940s: Paul Meehl and Lee Cronbach, construct V
Paul Meehl:
Dissatisfied by client self-rating
Self-rating should not be used as a behavior surrogate but as an indirect sign of sth deeper
Because it requires
1. Appropriate level of self-understanding
2. Willingness to disclose
Mid 1940s: Paul Meehl and Lee Cronbach, construct V
Lee Cronbach:
Impact of item format
Response set: the tendency to respond differently to items when the same content is presented in a different form
6 kinds of response set: Give many responses, Speed, Accuracy, Gamble…
A threat to V: different individuals demonstrate different response sets on the same set of items
Solution: use T/F less and MC more
Cronbach (1949): 5 technical criteria of a good test
1. Validity
2. Reliability
3. Objectivity
4. Norms
5. Good items
2 approaches: logical analysis (psychological understanding of attribute) and empirical evidence
Cronbach (1949): V as the correspondence of test to definition of attribute
There are items that correspond to the definition of the attribute yet bring in irrelevant variables that make the items impure:
1. Items that test takers answer differently because they use different methods
2. Items that are less accessible to test takers from certain cultural groups
3. Items that are vulnerable to response sets
4. Items that correspond to content yet fail to assess desired processes
Cronbach (1949): ultimate consideration
1. Logical analysis is inferior to empirical evidence.
2. Most frequently used criterion: instructor or supervisor ratings, other tests of the same attribute
3. Discussed criterion problem in depth
4. Rise of a particular empirical approach: factorial V, the degree to which a test could purely measure one type of ability