Lesson Six Reliability

Contents
• Definition of reliability
• Factors contributing to unreliability
• Types of reliability
• Indication of reliability: reliability coefficient
• Ways of obtaining a reliability coefficient:
  ◦ Alternate/parallel forms
  ◦ Test-retest
  ◦ Split-half & KR-21/KR-20
• Two ways of testing reliability
• How to make tests more reliable
Yun-Pi Yuan

Definition of Reliability (1)
• “The consistency of measures across different times, test forms, raters, and other characteristics of the measurement context” (Bachman, 1990, p. 24).
• If you give the same test to the same testees on two different occasions, the test should yield similar results.

Definition of Reliability (2)
• A reliable test is consistent and dependable.
• Scores are consistent and reproducible.
• The accuracy or precision with which a test measures something; that is, consistency, dependability, or stability of test results.

Factors Contributing to Unreliability
• X = T + E (observed score = true score + error score)
• Concerned with freedom from nonsystematic fluctuation.
• Fluctuations in
  ◦ the student
  ◦ scoring
  ◦ test administration
  ◦ the test itself

Types of Reliability
• Student- (or person-) related reliability
• Rater- (or scorer-) related reliability
  ◦ Intra-rater reliability
  ◦ Inter-rater reliability
• Test administration reliability
• Test (or instrument-related) reliability

Student-Related Reliability (1)
• The source of the error score is the test takers:
  ◦ Temporary illness
  ◦ Fatigue
  ◦ Anxiety
  ◦ Other physical or psychological factors
  ◦ Test-wiseness (i.e., strategies for efficient test taking)

Student-Related Reliability (2)
• Principles:
  ◦ Assess on several occasions
  ◦ Assess when the person is prepared and best able to perform well
  ◦ Ensure that the person understands what is expected (e.g., instructions are clear)

Rater (or Scorer) Reliability (1)
• Fluctuations: including human error, subjectivity, and bias
• Principles:
  ◦ Use experienced, trained raters.
  ◦ Use more than one rater.
  ◦ Raters should carry out their assessments independently.

Rater Reliability (2)
• Two kinds of rater reliability:
  ◦ Intra-rater reliability
  ◦ Inter-rater reliability

Intra-Rater Reliability
• Fluctuations including:
  ◦ Unclear scoring criteria
  ◦ Fatigue
  ◦ Bias toward particular good and bad students
  ◦ Simple carelessness

Inter-Rater Reliability (1)
• Fluctuations including:
  ◦ Lack of attention to scoring criteria
  ◦ Inexperience
  ◦ Inattention
  ◦ Preconceived biases

Inter-Rater Reliability (2)
• Used with subjective tests when two or more independent raters are involved in scoring
• Train the raters before scoring (e.g., TWE, department oral and composition tests for recommended students).

Inter-Rater Reliability (3)
• Compare the scores given to the same testees by different raters. If the correlation r between the raters' scores is high, there is inter-rater reliability.
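
The comparison of two raters' scores can be sketched as a Pearson correlation. This is only an illustration: the slide does not say which correlation statistic is meant, so Pearson's r is an assumption, and the scores and the helper name `pearson_r` are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical composition scores for five testees from two independent raters.
rater_a = [78, 85, 62, 90, 70]
rater_b = [75, 88, 65, 92, 68]

r = pearson_r(rater_a, rater_b)
print(round(r, 2))  # a value close to 1 suggests the raters rank testees alike
```

The same function could be applied to two administrations of one test (test-retest) or to two parallel forms, since each of those estimates is also a correlation between two sets of scores.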

Test Administration Reliability
• Street noise
  ◦ e.g., during a listening comprehension test
• Photocopying variations
• Lighting
• Variations in temperature
• Condition of desks and chairs
• Monitors

Test Reliability
• Measurement errors come from the test itself:
  ◦ Test is too long
  ◦ Test has a time limit
  ◦ Test format allows for guessing
  ◦ Ambiguous test items
  ◦ Test items with more than one correct answer

Reliability Coefficient (r)
• Quantifying the reliability of a test allows us to compare the reliability of different tests.
• 0 ≤ r ≤ 1 (ideally r = 1, which means the test gives precisely the same results for particular testees regardless of when it happens to be administered).
• If r = 1, the test is 100% reliable.
• A good achievement test: r ≥ .90
• If r < .70, the test should not be used.

How to Get a Reliability Coefficient
• Two forms, two administrations: alternate/parallel forms
• One form, two administrations: test-retest
• One form, one administration (internal consistency):
  ◦ split-half (Spearman-Brown procedure)
  ◦ KR-21

Alternate/Parallel Forms
• Two forms, two administrations:
  ◦ Equivalent forms (i.e., different items testing the same topic) taken by the same test taker on different days
  ◦ If r is high, the test is said to have good reliability.
  ◦ The most stringent form
[Diagram: one test plan leading to Form A and Form B]

Test-Retest
• One form, two administrations
• The same test is administered to the same testees after a short time lag, and then r is calculated.
• Appropriate for highly speeded tests
[Diagram: Test A given as Trial 1 and Trial 2]

Split-Half (Spearman-Brown Procedure)
• One test, one administration
• Split the test into halves (e.g., odd questions vs. even questions) to form two sets of scores.
• Also called internal consistency
[Diagram: questions Q1 to Q6 divided into a first half and a second half]
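
A minimal sketch of the odd/even split, assuming each testee's answers are stored as a list of 0/1 item scores. The response data and the helper names `split_half_scores` and `pearson_r` are hypothetical, and Pearson's r is assumed as the correlation.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def split_half_scores(responses):
    """Return (odd_totals, even_totals), one pair of half scores per testee."""
    odd = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return odd, even

# Hypothetical data: 5 testees x 6 items, 1 = correct, 0 = wrong.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

odd, even = split_half_scores(responses)
half_r = pearson_r(odd, even)  # correlation between the two half tests
print(odd, even, round(half_r, 2))
```

The value `half_r` describes a test only half as long as the real one, which is why a length correction is applied before it is reported as the reliability of the full test.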

Split-Half (2)
• Note that this r is not the reliability of the whole test; it is the correlation between the two halves.
• A mathematical relationship between test length and reliability: the longer the test, the more reliable it is.
• Spearman-Brown Prophecy Formula: $\text{rel}_{\text{total}} = \dfrac{nr}{1 + (n-1)r}$, where n is the factor by which the test is lengthened.
• E.g., correlation between the 2 parts of the test: r = .60; reliability of the full test (n = 2) = .75
• If the test is lengthened to 3 times (n = 3): reliability = .82
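
The two worked figures on this slide can be checked with a short function. This is a sketch of the prophecy formula as written above; the function name `spearman_brown` is just an illustrative choice.

```python
def spearman_brown(r, n):
    """Predicted reliability when a test with reliability r is made n times as long."""
    return n * r / (1 + (n - 1) * r)

# Half-test correlation of .60, corrected to full length (n = 2).
print(round(spearman_brown(0.6, 2), 2))  # 0.75

# The same starting value with the length tripled (n = 3).
print(round(spearman_brown(0.6, 3), 2))  # 0.82
```

With n = 1 the formula returns r unchanged, and the predicted reliability rises more and more slowly as n grows, so doubling a test never doubles its reliability.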

Kuder-Richardson Formula 21
• $\text{KR-21} = \dfrac{k}{k-1}\left\{1 - \dfrac{\bar{x}\,(1 - \bar{x}/k)}{s^2}\right\}$
• k = number of items; $\bar{x}$ = mean
• s = standard deviation (for the formula, see Bailey, p. 100)
  ◦ A description of how spread out a set of scores is (or of the score deviations from the mean)
  ◦ 0 ≤ s; the larger s is, the more spread out the scores are
  ◦ E.g., 2 sets of scores: (5, 4, 3) and (7, 4, 1); which group in general behaves more similarly?
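
A minimal sketch of both ideas on this slide: the spread of the two score sets, and KR-21 computed from total scores. It assumes the population form of the standard deviation (dividing by N), which may differ from the formula in Bailey; the total scores and the function name `kr21` are hypothetical.

```python
from statistics import mean, pstdev, pvariance

def kr21(total_scores, k):
    """KR-21 from testees' total scores on a test of k right/wrong items."""
    m = mean(total_scores)
    var = pvariance(total_scores)  # assumption: population variance (divide by N)
    return (k / (k - 1)) * (1 - (m * (1 - m / k)) / var)

# The two score sets from the slide share a mean of 4 but differ in spread.
sd_first = pstdev([5, 4, 3])
sd_second = pstdev([7, 4, 1])
print(round(sd_first, 2), round(sd_second, 2))

# Hypothetical total scores of five testees on a 10-item test.
totals = [9, 7, 5, 3, 6]
print(round(kr21(totals, 10), 2))  # 0.44
```

The first set has the smaller standard deviation, so its scores sit closer to the mean; that is the sense in which that group "behaves more similarly".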

Kuder-Richardson Formula 20
• $\text{KR-20} = \dfrac{k}{k-1}\left[1 - \dfrac{\sum pq}{s^2}\right]$
• p = item difficulty (proportion of people who got an item right)
• q = 1 - p (i.e., proportion of people who got an item wrong)
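
KR-20 needs item-level data rather than just total scores. The following is a sketch under stated assumptions: items are scored 0 or 1, the population variance of the total scores is used for s squared, and the response data and the function name `kr20` are hypothetical.

```python
from statistics import pvariance

def kr20(responses):
    """KR-20 for a list of per-testee lists of 0/1 item scores."""
    n_testees = len(responses)
    k = len(responses[0])  # number of items
    totals = [sum(row) for row in responses]
    var_total = pvariance(totals)  # assumption: population variance (divide by N)
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n_testees  # proportion correct
        sum_pq += p * (1 - p)                             # q = 1 - p
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical data: 5 testees x 6 items, 1 = correct, 0 = wrong.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

print(round(kr20(responses), 2))  # 0.75
```

Unlike KR-21, this version does not assume that all items are equally difficult, because each item contributes its own p and q.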

Ways of Testing Reliability
• Examine the amount of variation
  ◦ Standard Error of Measurement (SEM)
  ◦ The smaller the better
• Calculate the “reliability coefficient”
  ◦ “r”
  ◦ The bigger the better

Standard Error of Measurement (1)
• The SD of an individual's scores over a large number of testings
• Captures the variability of an individual's scores
• Shows how large the error component is likely to be
• Particularly useful in the interpretation of test scores
• $\text{SEM} = S\sqrt{1 - \text{rel.}}$, where S is the standard deviation of the test scores

Standard Error of Measurement (2)
• The average of a set of scores = the “true” score of the individual
• $X_1 = T_1 + E_1$
  $X_2 = T_2 + E_2$
  $\vdots$
  $X_n = T_n + E_n$
  $\bar{X} = T + 0$ (over many testings, the error scores average out to 0)
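
A small simulation can illustrate why the average of many observed scores approaches the true score. It is only a sketch: it assumes the individual's true score stays the same across testings and that the errors are random, normally distributed, and centered on zero. The true score of 70, the error SD of 5, and the function name are hypothetical.

```python
import random
from statistics import mean

def simulate_observed_scores(true_score, error_sd, n_testings, seed=0):
    """Draw n observed scores X = T + E with random, zero-mean error E."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [true_score + rng.gauss(0, error_sd) for _ in range(n_testings)]

# Hypothetical individual: true score 70, error SD 5, tested 10,000 times.
scores = simulate_observed_scores(true_score=70, error_sd=5, n_testings=10_000)

# Single observed scores vary, but their average lands very near 70.
print(round(min(scores), 1), round(max(scores), 1), round(mean(scores), 1))
```

The spread of the simulated scores around 70 plays the role of the standard error of measurement for this individual.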

Standard Error of Measurement (3)
• E.g., GRE: SD = 100, rel. = .91, so $\text{SEM} = 100\sqrt{1 - .91} = 30$
  ◦ How do we apply the SEM in the interpretation of the score?
• For a given spread of scores, the greater the reliability coefficient, the smaller the SEM will be.
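
One common way to apply the SEM is to report a band around the observed score instead of a single point. The sketch below reproduces the GRE figure from the slide and builds such a band; the observed score of 500, the choice of plus or minus one SEM, and the function names are assumptions for illustration.

```python
from math import sqrt

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

def score_band(observed, sd, reliability, n_sem=1):
    """Band of n_sem standard errors on either side of an observed score."""
    margin = n_sem * sem(sd, reliability)
    return observed - margin, observed + margin

gre_sem = sem(100, 0.91)
print(round(gre_sem, 1))  # 30.0

# Hypothetical testee with an observed score of 500.
low, high = score_band(500, 100, 0.91)
print(round(low), round(high))  # 470 530
```

If the errors are roughly normally distributed, a band of one SEM on either side is expected to cover the true score about two times out of three, so two testees whose scores differ by less than the SEM should not be treated as clearly different.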

Ways of Enhancing Reliability
• General strategies:
  ◦ Consider possible sources of unreliability
  ◦ Reduce or average out nonsystematic fluctuations in
    - raters
    - persons
    - test administration
    - instruments

How to Make Tests More Reliable? (1)
• Take enough samples of behavior
• Try to avoid ambiguous items
• Provide clear and explicit instructions
• Ensure tests are well laid out and perfectly legible
• Provide uniform and non-distracting conditions of administration
• Try to use objective tests

How to Make Tests More Reliable? (2)
• Try to use direct tests
• Have independent, trained raters
• Provide a detailed scoring key
• Try to identify the test takers by number, not by name
• Try to use multiple, independent scoring in subjective tests (Hughes, 1989, pp. 36-42).