Chapter 8 Test Development

Test Development: 5 Stages
• Test conceptualization
• Test construction
• Test tryout
  – When all items are new, the test is called a beta test.
  – When you insert some new items into an existing test, those items are called field test items.
• Item analysis
• Test revision

Classical True Score Theory
• Began in the 1920s with Classical Test Theory (CTT)
  – True score theory
  – Focus on Observed Score = True Score + Error
  – The theory assumes that traits are constant and that variation in observed scores is caused by random errors.
  – Over many repeated measurements, these random errors are expected to cancel each other out. In the long run, the expected mean of the measurement errors should be zero.
• Disadvantage of CTT
  – Student scores are item-dependent: ability is estimated only from the number of correct answers.
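
The claim that random errors cancel out over repeated measurements can be illustrated with a small simulation. This is only a sketch: the true score of 70, the error standard deviation of 5, and the normally distributed errors are assumptions made for illustration, not values from the slides.

```python
import random
import statistics


def simulate_observed_scores(true_score, error_sd, n_measurements, seed=0):
    """Simulate Observed Score = True Score + Error for repeated measurements.

    Errors are drawn from a normal distribution with mean 0 (an assumption
    for this sketch; CTT itself only requires random, zero-mean errors).
    """
    rng = random.Random(seed)
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_measurements)]


# Hypothetical values chosen for illustration only.
true_score = 70.0
observed = simulate_observed_scores(true_score, error_sd=5.0, n_measurements=10_000)

errors = [score - true_score for score in observed]
mean_error = statistics.fmean(errors)
mean_observed = statistics.fmean(observed)

print(f"Mean error over {len(errors)} measurements: {mean_error:.3f}")
print(f"Mean observed score: {mean_observed:.3f} (true score: {true_score})")
```

With many measurements the mean error should sit close to zero, so the mean observed score approaches the true score.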

Introduction to IRT
• Item response theory (IRT) models the relationship between characteristics of items (item parameters) and characteristics of individuals (latent traits) to estimate the probability of a correct response.
• Advantages of item response theory
  – Improved precision of measurement
  – Enables persons to be measured using different sets of items: adaptive testing.

Introduction to IRT
• Disadvantage: when there is no variance, nothing can be done.
  – When a student answers all items correctly (100%), IRT cannot estimate his/her ability.
  – When an item is too easy (100% of students answer it correctly) or too difficult (0%), IRT cannot estimate its psychometric attributes.
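
One way to see why a perfect score cannot be estimated is to look at the likelihood of the responses as a function of ability. The sketch below assumes the standard one-parameter logistic (Rasch) form, which the slides do not spell out, and uses three hypothetical item difficulties. For an all-correct pattern the log-likelihood keeps rising as ability increases, so there is no finite best estimate; a mixed pattern peaks inside the search range.

```python
import math


def p_correct(ability, difficulty):
    """Probability of a correct response under an assumed one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_likelihood(ability, responses, difficulties):
    """Log-likelihood of a 0/1 response pattern at a given ability."""
    total = 0.0
    for response, difficulty in zip(responses, difficulties):
        p = p_correct(ability, difficulty)
        total += math.log(p) if response == 1 else math.log(1.0 - p)
    return total


# Hypothetical item difficulties and response patterns (not from the slides).
difficulties = [-1.0, 0.0, 1.0]
all_correct = [1, 1, 1]
mixed = [1, 1, 0]

# Search abilities from -4.0 to 4.0 in steps of 0.5.
grid = [step / 2 for step in range(-8, 9)]
ll_perfect = [log_likelihood(a, all_correct, difficulties) for a in grid]
ll_mixed = [log_likelihood(a, mixed, difficulties) for a in grid]

best_perfect = grid[ll_perfect.index(max(ll_perfect))]
best_mixed = grid[ll_mixed.index(max(ll_mixed))]

print(f"All correct: best ability on grid = {best_perfect} (the edge of the grid)")
print(f"Mixed pattern: best ability on grid = {best_mixed}")
```

Widening the grid would only push the "best" ability for the all-correct pattern further out, which is the practical meaning of "nothing can be done" when there is no variance.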

• A teacher wants to give her students a math test to assess their skill level at the beginning of the school year.
[Figure: test items grouped as HARD, AVERAGE, and EASY questions]

Item Calibration and Ability Estimation
We cannot judge a student's ability based solely on the number of items answered correctly. Item attributes, such as difficulty level, should be taken into account.

Item Calibration and Ability Estimation
• The ideal case: the Guttman pattern
[Figure: items grouped as HARD, AVERAGE, and EASY; students ordered from more proficient to less proficient]

Who is better?
[Figure: student responses to HARD, AVERAGE, and EASY questions]
• In this case, we cannot draw a firm conclusion that Student 4 and Student 6 have the same level of proficiency, because Student 4 answered two easy items correctly, whereas Student 6 answered two hard questions correctly.
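
The comparison above can be sketched in code. The response vectors below are hypothetical reconstructions (the original figure is not available); they only mirror the description that both students have the same raw score, one from easy items and one from hard items. The helper `is_guttman_pattern` is an illustrative name, not part of any library.

```python
def is_guttman_pattern(responses_easy_to_hard):
    """Return True if every correct answer (1) comes before every incorrect one (0).

    Items must be ordered from easiest to hardest.
    """
    seen_incorrect = False
    for answer in responses_easy_to_hard:
        if answer == 0:
            seen_incorrect = True
        elif seen_incorrect:
            # A correct answer on a harder item after missing an easier one.
            return False
    return True


# Hypothetical data: items ordered easy, easy, average, average, hard, hard.
student_4 = [1, 1, 0, 0, 0, 0]  # two easy items correct
student_6 = [0, 0, 0, 0, 1, 1]  # two hard items correct

raw_score_4 = sum(student_4)
raw_score_6 = sum(student_6)

print(f"Student 4: raw score {raw_score_4}, Guttman pattern: {is_guttman_pattern(student_4)}")
print(f"Student 6: raw score {raw_score_6}, Guttman pattern: {is_guttman_pattern(student_6)}")
```

Both students have the same number-correct score, yet only one response pattern matches the ideal ordering, which is why the raw score alone does not settle who is more proficient.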

Item Characteristic Curve
• The item characteristic curve is the basic building block of item response theory; all the other constructs of the theory depend upon this curve.
• The shape of the item characteristic curve is related to item difficulty and student proficiency (skill level).
[Figure: item characteristic curve; labels: skill level, probability, item difficulty]

One Parameter Item Characteristic Curve
[Figure: relationship between skill level and probability of a correct answer; vertical axis: the probability of answering an item correctly; horizontal axis: the student skill levels on a standardized scale, with the average marked]
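
The slide does not give an equation for this curve. A common way to write a one-parameter curve is the logistic (Rasch) form below, which is assumed here rather than taken from the slides. The symbols are also assumptions: \theta is the student's skill level and b is the item difficulty, both on the same standardized scale.

```latex
P(X = 1 \mid \theta, b) = \frac{e^{\theta - b}}{1 + e^{\theta - b}}

% Worked values under this assumed form:
% skill equal to difficulty:        \theta - b = 0 \;\Rightarrow\; P = \tfrac{1}{2}
% skill one unit above difficulty:  \theta - b = 1 \;\Rightarrow\; P = \tfrac{e}{1 + e} \approx 0.73
```

Under this form, a student whose skill level equals the item's difficulty has a 50% chance of answering correctly, and the chance rises as skill exceeds difficulty.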

Difficult Item vs. Easy Item
[Figure: item characteristic curve for a difficult item]
[Figure: item characteristic curve for an easy item]
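
The contrast between the two curves can be tabulated with a short sketch. It assumes the standard one-parameter logistic form (not stated on the slides) and two hypothetical difficulty values, -1.5 for the easy item and 1.5 for the difficult one.

```python
import math


def icc_probability(skill, difficulty):
    """Probability of a correct answer under an assumed one-parameter logistic curve."""
    return 1.0 / (1.0 + math.exp(-(skill - difficulty)))


# Hypothetical difficulties on a standardized scale (0 = average).
EASY_ITEM_DIFFICULTY = -1.5
DIFFICULT_ITEM_DIFFICULTY = 1.5

skill_levels = [-2.0, -1.0, 0.0, 1.0, 2.0]
rows = [
    (
        skill,
        icc_probability(skill, EASY_ITEM_DIFFICULTY),
        icc_probability(skill, DIFFICULT_ITEM_DIFFICULTY),
    )
    for skill in skill_levels
]

print("skill  P(easy)  P(difficult)")
for skill, p_easy, p_difficult in rows:
    print(f"{skill:+.1f}   {p_easy:.2f}     {p_difficult:.2f}")
```

At every skill level the easy item has the higher probability of a correct answer; the difficult item's curve is the same shape shifted toward higher skill levels.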

Misfit (Optional)
MS (mean square) = chi-square / degrees of freedom
Don’t worry about what “out” means now.

Fitness     Mean square
Good        Below 1.4
Marginal    Between 1.4 and 1.5
Bad         Above 1.5
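
The mean-square rule and the cutoffs in the table can be sketched as follows. The chi-square values, degrees of freedom, and item names are hypothetical, and treating exactly 1.4 and exactly 1.5 as "Marginal" is an assumption, since the table does not say how boundary values are handled.

```python
def mean_square(chi_square, degrees_of_freedom):
    """Mean square = chi-square / degrees of freedom."""
    if degrees_of_freedom <= 0:
        raise ValueError("degrees_of_freedom must be positive")
    return chi_square / degrees_of_freedom


def classify_fit(ms):
    """Label an item's fit using the cutoffs from the table above.

    Boundary values (exactly 1.4 or 1.5) are counted as Marginal (assumption).
    """
    if ms < 1.4:
        return "Good"
    if ms <= 1.5:
        return "Marginal"
    return "Bad"


# Hypothetical (chi-square, degrees of freedom) pairs for three items.
hypothetical_items = {
    "item_A": (18.0, 20),
    "item_B": (29.0, 20),
    "item_C": (36.0, 20),
}

fit_labels = {
    name: classify_fit(mean_square(chi_square, df))
    for name, (chi_square, df) in hypothetical_items.items()
}

for name, label in fit_labels.items():
    print(f"{name}: {label}")
```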

Summary
• IRT considers student proficiency/skill level AND item difficulty together.
• The Item Characteristic Curve (ICC) indicates the probability of answering an item correctly given a particular student's proficiency level.
• Problematic (misfit) items can be identified by the mean square (chi-square/df). In a misbehaving item, the response pattern does not correspond to the rest.

Think-aloud protocol
• Qualitative item analysis
• Also known as mental protocol
• User interface design: What do users think during the process?
• Diagnostic tool: What are the learners thinking? What are their misconceptions?
• Instructional design and assessment design: How do the experts solve the problem?
• Assessment tool: Does the learner really know what he or she is doing?

User interface design and ergonomics
• A method to collect data for testing usability or the user interface
• Introduced by Clayton Lewis at IBM
• The user verbalizes what he or she is thinking while performing the task

Diagnosis
• The goal of having the participants think aloud is to reveal what information is kept in short-term memory during the problem-solving process.
• Short-term memory corresponds to the "voice" one is aware of during self-talk.
• Because most cognitive theories hold that short-term memory is the working space for problem solving and for holding information from long-term memory, think-aloud protocols have been suggested as an essential data collection device for understanding cognitive processes for diagnosis.

Evidence-centered design
• Introduced by Robert Mislevy at ETS (he is now at the U. of Maryland, College Park)
• We need to know what the experts know to solve real problems
• We can interview them or ask them to do a think-aloud protocol
• What evidence should be shown to prove that the examinee has the expertise?

Exercise 1
• Construct to be assessed: post-processing skills in photography
• Method: Think-aloud protocol. The best photographer at APU will verbalize the process of enhancing photos in Adobe Photoshop. The item authors (you) will observe the process and write test items that provide evidence of competence in post-processing.
• Item format: Multiple-choice, T/F, or short essay

Exercise 2
• Select a task or a software application that you are familiar with
• Perform a think-aloud protocol; do not lecture. Lecturing is presenting information to an audience, whereas thinking aloud is talking to yourself while doing the job.
• Write three test items that provide evidence of mastery of the task or the software package.
• Item format: Multiple-choice or short essay (use a concept map if possible)

Exercise 2
• If you do not want to do a software demo, you can choose a non-computer-related activity, e.g. CPR.