Test Validity
Validity of measurement
• Reliability refers to consistency
  – Are we getting something stable over time?
  – Internally consistent?
• Validity refers to accuracy
  – Is the measure accurate?
  – Are we really measuring what we want?
Important distinction!
The term “validity” is used in two different ways
1. Validity of an assessment or method of collecting data
  • The validity of a test or questionnaire or interview
2. Validity of a research study
  • Was the entire study of high quality?
  • Did it have high internal and external validity?
Important distinction!
The term “validity” is used in two different ways
1. Referring to entire studies or research reports:
  – OK: “We examined the internal validity of the study.”
  – OK: “We looked for the threats to validity.”
  – OK: “That study involved randomly assigning students to groups, so it had strong internal validity, but it was carried out in a special school, so it is weak on external validity.”
2. Referring to a test or questionnaire or some assessment:
  – OK: “The test is a widely used and well-validated measure of student achievement.”
  – OK: “The checklist they used seemed reasonable, but they did not present any information on its reliability or validity.”
  – NOT: “The test lacked internal validity.” (This sounds very strange to me.)
Types of validity
• Validity: the extent to which the instrument (test, questionnaire, etc.) is measuring what it intends to measure
  – Examples:
    ▪ Math test
      ❖ is it covering the right content and concepts?
      ❖ is it also influenced by reading level or background knowledge?
    ▪ Attitude assessment
      ❖ are the questions appropriate?
      ❖ does it assess different dimensions of attitudes (intensity, direction, etc.)?
• Validity is also assessed in a particular context
  – A test may be valid in some contexts and not in others
  – A questionnaire may be useful with some populations and not so useful with other groups
  – Not: “The test has high validity.”
  – OK: “The test has been useful in assessing early reading skills among native speakers of English.”
Types of validity
• Content validity
  – The extent to which the items reflect a specific domain of content
    ▪ Is the sample of items really representative?
  – Often a matter of judgment
  – Experts may be asked to rate the relevance and appropriateness of the items or questions
    ▪ e.g., rate each item: very important / nice to know / not important
  – “Face validity” refers to whether the items appear to be valid (to the test taker or test user)
Types of validity
Criterion-related validity
• Concurrent validity
  – agreement with a separate measure
  – common in educational assessments
    ▪ e.g., Bayley Scales and S-B IQ test
    ▪ Complete version and screening test version
  – Issue: Is there really a strong existing measure, a “gold standard” we can use for validating a new measure?
• Predictive validity
  – agreement with some future measure
  – SAT scores and college GPA
  – GRE scores and graduate school performance
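Criterion-related validity is usually summarized as a correlation between the new measure and the criterion. The sketch below is only an illustration of that idea: the function name `pearson_r` and both score lists are hypothetical, invented for this example, and do not come from any of the tests named above.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical data: six children tested on a short screening version
# and on the complete version of the same test (concurrent validity).
screening_scores = [12, 15, 9, 20, 17, 11]
full_test_scores = [48, 55, 40, 71, 60, 45]

validity_coefficient = pearson_r(screening_scores, full_test_scores)
print(f"Concurrent validity coefficient: r = {validity_coefficient:.2f}")
```

For predictive validity the calculation is the same; only the second list changes, to a criterion collected later (for example, a later GPA).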
Types of validity (cont.)
Construct validity
• Does the measure appear to produce results that are consistent with our theories about the construct?
  – Example: We have a “stage-model” of development, so does our measure produce scores/results that look like “stages”?
• Convergent validity
  – Does our measure converge or agree with other measures that should be similar? And...
• Discriminant validity
  – Does our measure disagree (or diverge) where it should be different?
Stanford Achievement Test Example – Grade 1
Stanford Achievement Test Example – Grade 12
McCarthy Screening test example
• A test for pre-school children (ages 2.5 to 8.5)
• Six subtests:
  – Verbal, perceptual-performance, quantitative, general cognitive (composite), memory, motor
• Reliability evidence for using a short version as a screening test
  – Split-half correlations for several scales (r = .60 to .80)
  – Test-retest reliability for other scales (on a subset of children) showed a range of correlations, from .32 to .70.
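A split-half correlation can be sketched as follows. This is a minimal illustration, not the procedure used for the McCarthy data: the odd/even split, the function names, and the item scores are all assumptions made for the example. The Spearman-Brown step-up, 2r / (1 + r), is a standard companion to split-half correlations that the slide does not mention; it is included here only to show how a half-test correlation relates to a full-length estimate.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def split_half_reliability(item_scores):
    """Correlate odd-item and even-item totals, then apply Spearman-Brown.

    item_scores holds one list of item scores per child.
    Returns (half-test correlation, stepped-up full-length estimate).
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown step-up
    return r_half, r_full


# Hypothetical pass/fail (1/0) scores: five children, six items each.
item_scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

r_half, r_full = split_half_reliability(item_scores)
print(f"Half-test r = {r_half:.2f}, stepped-up estimate = {r_full:.2f}")
```

The slide does not say whether the reported values of .60 to .80 were stepped up in this way, so the sketch should not be read as a recalculation of them.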
McCarthy Scales of Children’s Abilities
• Reliability
• The internal consistency coefficients for the General Cognitive Index (GCI) averaged .93 across 10 age groups between 2.5 and 8.5 years.
• Test-retest reliability of the GCI over a one-month interval was .80. Stability coefficients of the cognitive scales ranged from .62 to .76, with the Motor Scale emerging as the only scale that lacked stability (r = .33).
A short version developed as a screening test
Validity information for a short version
• A sample of 60 children with learning disabilities
• On full version of entire test
  – 53 out of 60 (88%) failed at least 2 of the 6 subtests
• On the short version (the proposed screening version)
  – 40 out of 60 (67%) failed (and would be identified)
• Is this enough information?
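The percentages above can be checked with a few lines of arithmetic. The helper name `identification_rate` is hypothetical; the counts are the ones on the slide. One thing the numbers alone cannot show: the sample includes only children with learning disabilities, so they describe how many such children each version flags and say nothing about how often children without learning disabilities would be flagged.

```python
def identification_rate(n_identified, n_total):
    """Proportion of a sample that a test version flags."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_identified / n_total


n_children = 60  # all with learning disabilities, per the slide

full_rate = identification_rate(53, n_children)   # full version
short_rate = identification_rate(40, n_children)  # proposed screening version

print(f"Full version:  {full_rate:.0%}")
print(f"Short version: {short_rate:.0%}")
print(f"Fewer children identified by the short version: {53 - 40}")
```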