A “Sweet Approach” to Understanding Basic Principles of Educational Measurement: Assessing the Performance of Chocolates
Objectives
At the end of instruction, participants will:
- Describe sources of error that threaten the reliability and validity of performance assessment measures
- List specific strategies to address these threats
- Define and appropriately employ performance measurement terminology
  - Anchors, Likert scales, horns and halo effects
- Develop and test a performance-based measurement instrument
  - Construct the scale
  - Train raters to use it effectively
  - Assess validity and reliability of measures
  - Identify and explain sources of error
Lesson Plan: Set the Task
- Judge at the Wisconsin State Fair for “Open Class” Commercial Chocolates:
  - Develop key factors to rate chocolates
  - Develop the rating scale
  - Train other “judges”
  - Taste and rate the chocolates
- Overview of key measurement principles:
  - Goal: identify sources of error and strategies to control them
  - Step-by-step approach to the task, paralleling the process in educational measurement
Timeline
- Introduction to measurement
- Development of criteria
- Develop the scale
- Train raters
- Sample and rate chocolates
- Identify sources of error
Underlying Assumption of Performance-Based Measures
- An individual's observed performance (score) is a combination of:
  True Score + Errors of Measurement
  - Random errors
  - Controllable errors
- All measurement seeks to control errors so that the measured score = the true score
- Does the OBSERVED score = the TRUE score?
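The true-score model above can be sketched numerically. This is a minimal, hypothetical simulation (all numbers are invented, not from the workshop): a fixed true score plus a controllable systematic error (a lenient rater) plus random noise. Averaging many observations washes out the random error but not the systematic bias, which is why rater training matters.

```python
import random

random.seed(42)

TRUE_SCORE = 5.0   # the chocolate's "true" quality on a 1-7 scale (assumed)
RATER_BIAS = 0.5   # controllable/systematic error, e.g., a lenient rater (assumed)
NOISE_SD = 1.0     # random error: mood, palate fatigue, etc. (assumed)

def observed_score():
    # Observed = True Score + systematic error + random error
    return TRUE_SCORE + RATER_BIAS + random.gauss(0, NOISE_SD)

# Averaging many ratings shrinks the random error toward zero,
# but the systematic bias remains in the mean.
n = 10_000
mean_obs = sum(observed_score() for _ in range(n)) / n
print(round(mean_obs, 2))  # close to TRUE_SCORE + RATER_BIAS = 5.5, not 5.0
```

The residual 0.5 gap between the mean observed score and the true score is exactly the "controllable" error the workshop strategies target.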
Familiar EBM Terminology
- Validity: does the measure actually reflect the variable of interest?
  - Relevance: appropriate to the purpose of the study
  - Meaningfulness: the measure reflects the variable of interest
  - Usefulness: aids decision-making
- Accuracy: is the measurement free from error?
  - Random error: e.g., who happens to take the test
  - Systematic error: bias
Principles of Measurement
Validity: What are you measuring? Is what you measure = what you intend to measure?
- Fasting blood sugar (vs. HbA1c) = diabetes control?
- USMLE exam score = competent MD?
- Smoothness of chocolate = best chocolate?
Principles of Measurement: 3 Types of Validity Evidence
1. Content-related evidence
- "Looks like a duck, sounds like a duck = duck"
- Appropriateness: does it logically get at the intended performance?
- Expert review ("face validity")
- Representative sample from the content domain (objectives)
Principles of Measurement: 3 Types of Validity Evidence
2. Criterion-related evidence
- "Sugar content of the grapes = best wine"
- Relationship between this measure and others:
  - Predictive: MCAT and medical school performance
  - Concurrent: comparison to a gold standard
- Often expressed as:
  - Correlation coefficients for continuous variables
  - Cross-tabulations for dichotomous variables
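As the slide notes, criterion-related evidence for continuous variables is often expressed as a correlation coefficient. A minimal sketch of a Pearson correlation, with invented paired scores (the variable names and data are assumptions for illustration, not from the deck):

```python
from statistics import mean

new_measure = [3.1, 4.8, 5.7, 2.2, 4.3, 4.7, 4.2]    # e.g., our chocolate ratings (invented)
gold_standard = [3.0, 5.0, 5.5, 2.5, 4.0, 4.9, 4.1]  # e.g., expert panel scores (invented)

def pearson_r(x, y):
    # Pearson product-moment correlation: covariance / (sd_x * sd_y)
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

r = pearson_r(new_measure, gold_standard)
print(round(r, 2))  # a high r is concurrent (criterion) validity evidence
```

A high correlation against a gold standard supports concurrent validity; a correlation with a later outcome (MCAT vs. medical school performance) supports predictive validity.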
Principles of Measurement: 3 Types of Validity Evidence
3. Construct-related evidence
- A psychological construct or characteristic for which no gold standard exists
  - Ex. (medicine): self-efficacy → medication adherence
  - Ex. (education): bodily-kinesthetic intelligence → suturing skills
- Test the relationship between performance and a theoretical model
  - Multiple regression, correlation, factor analysis
Validity: Accuracy and Reliability
Examples of error:
- Random error (things you cannot control):
  - Subject variability
  - Snowstorm delays
- Controllable error (things you CAN control):
  - Instrument variability
  - Observer variability (intra- vs. inter-observer)
  - Halo/horns effects
Principles of Measurement: Reliability
- Defined: consistency of the scores obtained
  - at one time
  - over time
- Which color is:
  1. Not reliable?
  2. Reliable but not valid?
  3. Reliable and valid?
Controlling Error in Performance Measures
- Subject:
  - Motivation, energy, anxiety
  - Location
  - Maturation
  - History
  - Regression to the mean (high/low scorers)
- Rater:
  - Age, gender, ethnicity biases
  - Halo/horns effects
  - Fatigue
- Instrument/Scales:
  - Characteristics: normative vs. criterion
  - Length
  - Inadequate instructions
  - Poor formatting
  - Illogical order
  - Vague terminology
  - Responses that fail to fit the question/scale
  - Items that favor one group over another
KEY: Always Think of the 4 Common Categories of ERRORS
1. Instrument
2. Raters
3. Design/Administration
4. Subjects
Strategies to Control Errors
- Standardize conditions (location, instrument, attitude)
- Obtain more information on subjects
  - Relevant characteristics (e.g., do they like chocolate?)
- Obtain more information on how and under what conditions data are collected
  - Location, instrumentation, history, subject attitude
- Write clear, specific descriptors of desired behaviors
- Use trained raters/proctors
- Use an appropriate design
AND NOW...
You have been selected to create the measurement tool with which to judge...
WISCONSIN'S BEST CHOCOLATE
Review Your Tasks: Chocolate Judge
- List criteria indicative of "best chocolate"
- Develop a Likert-scale rating item, with descriptive anchors, for each criterion
- Train other raters to use your item
- Pilot all items on samples of chocolate
- Identify potential sources of error
- Examine the reliability of ratings
  - Which errors contributed? Which could be controlled?
A few words about scales...
- Summated rating scale (Likert):
  - The length of the scale affects the accuracy with which raters can make decisions
  - Assumes equal intervals between decision points (an interval-scale assumption)
    - e.g., the gap between "excellent" and "good" = the gap between "good" and "satisfactory"
  - So you need to provide scale anchors to inform raters of your intent (controlling error)
A few words about scale assumptions...
- Normative: compared/relative to other learners
  - Standard score or mean score (USMLE Steps)
  - Compared to the other chocolates in the group: "average", "above average"
- Criterion-based: compared to a pre-established "gold standard" or minimum threshold
  - 80% correct on an examination
  - Godiva? Smooth like silk vs. granular like sand
STEP 1: Develop Valid Indicators of Best Chocolate
- Individually (3 min): list on the worksheet all the key dimensions/features (indicators) associated with "best" chocolate
- Small group (6 min): agree on 5 indicators
- Large group (10 min): agree on the indicators that will be used to judge the Wisconsin State Fair Open Commercial Chocolate Competition
Chocolate Performance Criteria
Indicator / Authors
1.
2.
3.
4.
5.
6.
7.
STEP 2: Develop the Rating Scale (10 min)
Each team develops one Likert-like scale:
- 7 points, from 1 (most favorable) to 7 (least favorable)
- The scale may be either:
  - Criterion: silky - gravel
  - Normative: excellent - horrible
- Write descriptors (anchors): associate at least scale points 1, 4, and 7 with a descriptor (e.g., 1 = silky)
- Transfer to newsprint for posting
Step 3: Train Your "Raters"
- 2 min: write your scale's indicator (dimension) and descriptors for at least points 1, 4, and 7 on newsprint
- 3 min: develop a training plan (and decide who will deliver it); goal: eliminate rater errors
- 1 min: train the group
  - You may use a chocolate for training
Step 4: Sampling
- Step 1: Starting at the beginning of the line, pick up a sample plate (pre-segmented with numbers).
- Step 2: Proceed in order down the table, picking up one piece of chocolate from each plate. Place each sample on the number associated with that location (sample 7 = plate position 7).
- Step 3: At the instructor's signal (do NOT start until told), taste each piece and complete the rating for that sample on the corresponding number on the rating form (sample 7 = line 7 on the form).
Step 5: Sources of Error (True Score ≠ Observed Score)
- Turn in your rating sheets now
- Note on the Sources of Error Worksheet any factors that would affect the reliability (consistency) of your ratings
- Goal: identify variance due to true variability (in the chocolate) vs. variance due to raters + variance due to instrumentation + variance due to administration + etc.
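The variance decomposition this step asks for can be illustrated with a toy example. The ratings below are invented, and the simple between/within split is only a sketch of the idea (not the workshop's actual analysis): spread among chocolate means is the "true" variability we want, while spread among raters on the same chocolate is error.

```python
# chocolate -> ratings from 4 raters (all values invented for illustration)
ratings = {
    "A": [2, 2, 3, 2],
    "B": [6, 5, 6, 6],
    "C": [4, 4, 3, 4],
}

def variance(xs):
    # population variance: mean squared deviation from the mean
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Between-chocolate variance: real differences among the chocolates
item_means = [sum(v) / len(v) for v in ratings.values()]
between = variance(item_means)

# Within-chocolate variance: rater disagreement and other error
within = sum(variance(v) for v in ratings.values()) / len(ratings)

print(round(between, 2), round(within, 2))
```

When the between-chocolate variance dominates the within-chocolate variance, most of the observed spread reflects true differences rather than measurement error.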
Potential Sources of Error
- ...
- ...
- ...
Sources of Error: Instrument, Raters, Design/Admin, Subjects (chocolate)
Step 6: Review of Scores
- Within-individual variance
  - e.g., all ratings "high"
- Between-individual variance
  - Halo/horns effects
Scale
Choc 1 2 3 4 5 6 7
Measurement = Chocolate
- Validity: does the criterion measure the essence of Wisconsin's "best" chocolate?
- Reliability: control errors due to:
  - Instrumentation: sample numbers confused; clarity of criterion
  - Raters: competent to judge; biases
  - Administration: allowing soda; eating in any order; time, directions, location, standardization
Chocolates
- #1: Confections Solid Milk Chocolate (1997 Wisconsin State Fair Seal of Excellence)
- #2: Hershey's Extra Dark 60% Cocoa
- #3: Regal Dynasty Milk
- #4: Hershey's Cookies 'n' Cream
- #5: Dove Dark Hearts
- #6: Palmer Milk Hearts
- #7: Nestlé's Milk Hearts
Do Your Sources of Error Explain the Ratings?
- Raters:
  - Did they chat amongst themselves? Contamination?
  - Drink Diet Coke, coffee?
  - Rating strategy: taste them all and then rate, or rate each independently?
  - Recognize brands? Bias?
  - Rater fatigue?
- Instrument:
  - Scale descriptors clear?
  - All variables considered in scale development? (e.g., cookie pieces // non-traditional students/learners)
Beyond Chocolate: Apply Principles to Assessment of Learner Performance
- What are common errors/problems in learner assessment?
Beyond Chocolate: Apply Principles to Assessment of Learner Performance
- Rating forms for resident/student performance
  - What content/dimensions?
  - Scales: criterion (behavioral anchors) vs. normative
- OSCEs (added variability: why?)
  - What else do you need to control?
- Why OSVE?
SUMMARY: Part I
True Score + Controllable Errors (Sources of Error) + Random Error = Observed Score (faculty rating, MCQ)
- Sources of error:
  - Instrument (valid dimensions, clear directions; pilot and revise)
  - Administration/Design (standardize)
  - Raters (train)
SUMMARY: Part II (sung to the tune of "Row, Row, Row Your Boat")
Think, think about
Errors of measurement,
Rater bias, validity too,
Every time you test.
Supplemental Slides
Follow-up: Chocolates
Seven different chocolates were evaluated on seven different indicators during the 2006 Chocolate Survey.

Table 1: Identities of Chocolates
1. Confections Solid Milk Chocolate
2. Hershey's Extra Dark 60% Cocoa
3. Regal Dynasty Milk
4. Hershey's Cookies 'n' Cream
5. Dove Dark Hearts
6. Palmer Milk Hearts
7. Nestlé's Milk Hearts
Results: Descriptive Stats

Table 2: Mean Scores* - Ascending Values (Max N = 12)

Indicator         Choc1  Choc2  Choc3  Choc4  Choc5  Choc6  Choc7   Ave
1 Texture          3.17   4.83   5.67   2.25   4.33   4.75   4.17
2 w/ Wine          3.80   3.90   5.30   3.00   5.10   4.80   4.54
3 Sweet/Bitter     3.00   5.17   3.75   2.17   4.58   3.00   2.50   3.45
4 Intensity        3.83   3.33   5.25   3.00   4.42   5.25   4.33
5 Meltability      3.58   4.50   4.67   5.25   3.00   3.92   4.50   4.20
6 Aftertaste       3.00   3.92   5.33   4.92   3.00   4.25   4.50   4.13
7 Moreishness      3.50   4.67   5.58   5.25   3.83   4.75   5.17   4.68
Average            3.41   4.24   5.02   4.81   3.25   4.22   4.49   4.21

*Scale: 1 = highest rating, 7 = lowest rating
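Each cell of Table 2 is simply the mean of the raters' scores for one chocolate on one indicator. A minimal sketch: the twelve raw ratings below are invented (the real rater-level data are not given in the slides), but are chosen so the mean reproduces the published Sweet/Bitter cell for chocolate #4 (2.17).

```python
# Hypothetical raw ratings from 12 raters for one cell of Table 2
# (Sweet/Bitter, chocolate #4); these values are invented.
sweet_bitter_choc4 = [2, 3, 2, 2, 1, 3, 2, 2, 3, 2, 2, 2]

mean_score = sum(sweet_bitter_choc4) / len(sweet_bitter_choc4)
print(round(mean_score, 2))  # 2.17, matching the Table 2 cell
```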
Inter-rater reliability
- Kendall's coefficient of concordance (W) varies on a scale from 0 (no agreement) to 1 (perfect agreement)
- Hope for concordance > 0.7
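Kendall's W can be computed directly from a matrix of rankings. A minimal sketch with three hypothetical raters ranking five chocolates (the data are invented; the formula is the standard W = 12S / (m²(n³ − n)) for untied ranks):

```python
# rows = raters, columns = chocolates; entries are ranks 1..5 (invented data)
ratings = [
    [1, 2, 3, 4, 5],
    [2, 1, 3, 5, 4],
    [1, 3, 2, 4, 5],
]

m = len(ratings)      # number of raters
n = len(ratings[0])   # number of items ranked

# Sum of ranks per item, and squared deviations from their mean
rank_sums = [sum(r[j] for r in ratings) for j in range(n)]
mean_sum = sum(rank_sums) / n
s = sum((rs - mean_sum) ** 2 for rs in rank_sums)

# Kendall's W: 0 = no agreement, 1 = perfect agreement
w = 12 * s / (m ** 2 * (n ** 3 - n))
print(round(w, 2))
```

Here the three raters rank the chocolates quite similarly, so W comes out well above the 0.7 "hope for" threshold; the W values in Table 3 (all below 0.7) indicate much weaker agreement.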
Results: Inter-rater agreement

Table 3: Inter-Rater Agreement (W) and Indicator Scales (Max N = 12), sorted by descending W
(Anchors: 1 = highest rating, 4 = midpoint, 7 = lowest rating)
- 1 Texture (W = .640): 1 = Creamy, Silky, Smooth; 4 = Moderately Smooth, Hints of Chalkiness; 7 = Grainy, Chalky
- 3 Sweetness/Bitterness (W = .571): 1 = Cloyingly Sweet; 4 = Perfect Balance; 7 = Disgustingly Bitter
- 2 w/ Wine (W = .495): 1 = No Wine Without More Chocolate; 4 = Nice, But Could Eat Separately; 7 = Spit Chocolate Out
- 4 Intensity (W = .400): 1 = Intense Taste; 4 = Good Taste; 7 = Sugary Taste
- 5 Meltability (W = .345): 1 = Melts Quickly and Smoothly; 4 = Melts More Slowly and Less Smoothly; 7 = Chewy
- 7 Moreishness (W = .295): 1 = Want More, Even If Tummy Hurts; 4 = Indifferent; 7 = Don't Want More
- 6 Aftertaste (W = .255): 1 = Lingering Echoes of Original Flavor; 4 = Unbalanced Echo of Original Flavor; 7 = No Echoes of Original Flavor
Lack of concordance: Why?