Measurement 102
Steven Viger, Lead Psychometrician, Office of General Assessment and Accountability, Michigan Dept. of Education
Joseph Martineau, Ph.D., Interim Director, Office of General Assessment and Accountability, Michigan Dept. of Education

Student Performance Measurement
• The difference between validity and reliability
  – Validity
    • The degree to which the assessment measures the intended construct(s)
  – Reliability
    • The consistency with which the assessment produces scores
12/20/2021 2

Student Performance Measurement
[Figure: three panels labeled "Low reliability, High validity", "Low reliability, Low validity", and "High reliability, High validity"]

Student Performance Measurement
• Validity
  – Documenting validity is a process of gathering evidence that the assessment measures what is intended

Student Performance Measurement
• Individual item validity evidence includes
  – Focus is on elimination of “construct-irrelevant variance”
  – Item development/review procedures
  – Alignment of individual items
  – Bias
  – Simple item analyses

Student Performance Measurement
• Scale score validity evidence includes
  – Input from item-level validity evidence (the validity of the score scale depends upon the validity of the items that contribute to that score scale)
  – Convergent and divergent relationships with appropriate external criteria, for example:
    • Strong relationships with other measures of achievement
      – Teacher-assigned grades
      – Other subject area assessments
      – Success in college
  – Alignment of overall assessment to content standards
  – Comparability across forms and administrations
    • Accommodations
    • Year-to-year equating

Student Performance Measurement
• Item Response Theory is used to create the score scale
  – Treats all sub-components as a single construct
  – Assumes that there is a high correlation between sub-components
  – Statistically speaking, this indicates that there is a strong first principal component of all items that contribute to the construct in question
  – It would probably be better to measure the sub-components separately, but that would require significantly more assessment items
  – Assumes that a more able person has a higher probability of responding correctly to an item than a less able person
    • Specifically, when a person’s ability is greater than the item difficulty, they have a better than 50% chance of getting the item correct.

The Rasch Model (1 parameter logistic model)
• The psychometric/statistical model used for MEAP
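
A minimal sketch of the Rasch model in Python, assuming its standard form P(correct) = 1 / (1 + exp(-(theta - b))), where theta is the person's ability and b is the item difficulty, both in logits. The function and argument names are illustrative and do not come from the slides.

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch model.

    theta: person ability (logits); b: item difficulty (logits).
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Ability equal to difficulty gives exactly a 50% chance.
print(rasch_probability(0.0, 0.0))  # 0.5
# Ability above difficulty gives a better than 50% chance.
print(rasch_probability(1.0, 0.0) > 0.5)  # True
```

Only the difference theta - b matters, which is why ability and difficulty can be placed on one common scale.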

The 3 Parameter Logistic Model
• The psychometric/statistical model used with the MME
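
A minimal sketch of the 3 parameter logistic model in Python, assuming its standard form P(correct) = c + (1 - c) / (1 + exp(-D * a * (theta - b))). Here a is discrimination, b is difficulty and c is the lower asymptote ('guessability'). The scaling constant `D` is an assumption: some programs use 1.7, and the slides do not say which convention applies to the MME, so it defaults to 1.0 here.

```python
import math

def three_pl_probability(theta: float, a: float, b: float, c: float,
                         D: float = 1.0) -> float:
    """Probability of a correct response under the 3PL model.

    theta: ability; a: discrimination; b: difficulty;
    c: lower asymptote (guessing); D: scaling constant (assumed 1.0).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# With a = 1 and c = 0 the model reduces to the Rasch model.
print(three_pl_probability(0.0, a=1.0, b=0.0, c=0.0))  # 0.5
# With guessing, even a very low-ability examinee stays near c.
print(round(three_pl_probability(-50.0, a=1.0, b=0.0, c=0.25), 4))  # 0.25
```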

MEAP example (10 items scaled using Rasch)

MME model (10 items scaled using the 3-PL model)

How do we get there?
• Although the graphics on the previous screens may make conceptual sense, many wonder what we use to produce this curve.
• We are psychometricians…not psychomagicians, so the numbers come from somewhere.
• We need a person-by-item matrix to begin the process.
[Figure: MME Science person-by-item matrix of 0/1 responses]
0101001011110001111010111100010110
110011011000011101001
01101011001010011
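
A small sketch of what a person-by-item matrix looks like in code, assuming each row is one examinee and each column one item scored 0 (incorrect) or 1 (correct). The four response strings are hypothetical and are not MME data.

```python
# Hypothetical scored responses: one string per examinee, one character per item.
responses = ["0101", "1101", "1110", "0100"]

# Person-by-item matrix: rows are persons, columns are items.
matrix = [[int(ch) for ch in row] for row in responses]

# Raw score per person (row sums).
raw_scores = [sum(row) for row in matrix]

# Proportion correct per item (column means).
p_values = [sum(col) / len(matrix) for col in zip(*matrix)]

print(raw_scores)  # [2, 3, 3, 1]
print(p_values)    # [0.5, 1.0, 0.25, 0.5]
```

Row and column totals like these are the raw material an IRT program works from when it estimates person and item parameters.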

IRT Estimation
• The person-by-item matrix is fed into an IRT program to produce estimates of item parameters and person parameters.
  – Item parameters are the ‘guessability’, discrimination and difficulty parameters.
  – Person parameters are the ability estimates we use to create a student’s scale score.

Parameter Estimation
• For single parameter (item difficulty) models, WINSTEPS is the industry standard.
• More complex models like the 3 parameter model used in the MME require more specialized software such as PARSCALE.
• Once we know the parameters, we feed them into the appropriate model to give us an estimated probability of correct response.

The Rasch Model (MEAP and ELPA)

The 3 Parameter Logistic Model (MME)

From Theta to Scale Scores
• Once item parameters are known, we can use the item responses for the individuals to estimate their ability (theta).
• In general, when persons share the same response string (pattern of correct and incorrect responses) they will have the same estimate of theta.
• The estimation program will then produce a table that gives us the relationship between raw scores and theta.
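
A sketch of how a raw-score-to-theta table can be built, assuming the Rasch model and known item difficulties. Under that model the maximum likelihood estimate of theta for a raw score r is the value at which the expected score (the sum of the item probabilities) equals r; the sketch solves this by bisection. The five difficulties and the function names are hypothetical, and operational programs may use other estimators.

```python
import math

def rasch_probability(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def theta_for_raw_score(raw_score, difficulties, lo=-10.0, hi=10.0, tol=1e-8):
    """Find theta where the expected score equals raw_score (bisection)."""
    n = len(difficulties)
    if not 0 < raw_score < n:
        raise ValueError("zero and perfect scores have no finite estimate")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        expected = sum(rasch_probability(mid, b) for b in difficulties)
        if expected < raw_score:
            lo = mid   # theta too low: expected score below the target
        else:
            hi = mid   # theta too high (or exact): tighten from above
    return (lo + hi) / 2.0

# Hypothetical item difficulties in logits.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
table = {r: theta_for_raw_score(r, difficulties)
         for r in range(1, len(difficulties))}

for raw, theta in table.items():
    print(raw, round(theta, 3))
```

Zero and perfect raw scores are excluded because no finite theta reproduces them; how those cases are handled in practice is not covered in the slides.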

DIF
• Differential item functioning (DIF) is a phenomenon which occurs in the context of testing multiple groups.
• When we estimate item and person parameters we do so with a complete data set; the persons are from multiple demographic groups.
  – We do our best to have a representative sample.
• DIF occurs when we estimate the item calibrations separately based on groups of interest (e.g., males and females, ethnicity, type of instruction, geographic regions, etc.) and we find differences in item parameters.
• Depending on the DIF methodology used, there are different levels of DIF that may or may not be problematic.
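
A simplified sketch of the comparison described above: items are flagged when their difficulty estimates from two separate group calibrations differ by more than a cutoff. It assumes both calibrations are already on the same scale. The 0.5-logit `threshold`, the function name and the difficulty values are hypothetical, and this is not the specific DIF procedure MDE uses.

```python
def flag_dif(ref_difficulties, focal_difficulties, threshold=0.5):
    """Return indices of items whose difficulty differs between groups
    by more than `threshold` logits (threshold is an assumed cutoff)."""
    flagged = []
    for i, (b_ref, b_focal) in enumerate(zip(ref_difficulties,
                                             focal_difficulties)):
        if abs(b_focal - b_ref) > threshold:
            flagged.append(i)
    return flagged

# Hypothetical difficulty estimates from separate calibrations.
reference = [-1.0, 0.0, 0.4, 1.2]
focal = [-0.9, 0.7, 0.5, 1.1]

print(flag_dif(reference, focal))  # [1]
```

A flag like this is only the statistical starting point; flagged items would still go to content review.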

DIF
• Once DIF is detected, the item(s) is/are brought to the attention of the content specialists and at times the content and sensitivity review committees.
  – Items are either edited, deleted or kept in the assessment depending on the findings of the reviewers.
• The bottom line is that the finding of DIF from a statistical standpoint does not necessarily mean the item will be deleted.
  – Subjective decisions always follow the technical information.

Item Types
• MDE assessments contain a variety of items with different levels of maturity.
• Even though an assessment is new for a test cycle, there are parts of it which are not new.
• Generally speaking, we have core items and field test items.

Equating
• Core items are more established and have been used before. In fact, the core items are the only ones which contribute to the score.
  – Field test items are embedded within assessments to maintain the health of our item banks.
  – We can treat the item parameters from core items as ‘known’ and use those known parameters to drive the estimation of field test item parameters.
• Equating also utilizes what we know about the common items to link assessments from year to year because common items are used in concurrent test years.

Equating
• When a test is designed to measure the same construct from year to year but differs (somewhat) in specific content, we need to be able to put the scores on the same scale.
• What is being sought in test equating is a conversion from the units of one form of a test to the units of another form of the same test.

Equating
• Three restrictions are important for equating:
  1. The two tests must measure the same construct.
  2. The resulting conversion should be independent of the individuals from whom the data were drawn to develop the conversion.
  3. The conversions should be applicable in future situations.

Equipercentile Equating
• Two scores, one on Form X and the other on Form Y (where X and Y measure the same thing with the same degree of reliability), may be considered equivalent if their corresponding percentile ranks in any given group are equal.
• Plot the percentile rank to raw score curves for each form.
• Paired values for the forms are then interpolated at common points on the percentile rank distribution.
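
The steps above can be sketched in Python as follows. This is a simplified illustration, not an operational procedure: it assumes midpoint percentile ranks, linear interpolation between observed Form Y scores, and clamping at the lowest and highest observed Form Y scores. The function names and the score lists are hypothetical.

```python
from collections import Counter

def percentile_ranks(scores):
    """Map each observed score to its midpoint percentile rank (0-100)."""
    counts = Counter(scores)
    n = len(scores)
    ranks, below = {}, 0
    for score in sorted(counts):
        ranks[score] = 100.0 * (below + 0.5 * counts[score]) / n
        below += counts[score]
    return ranks

def equipercentile_equate(x_scores, y_scores):
    """Return {Form X score: Form Y score with the same percentile rank}."""
    x_ranks = percentile_ranks(x_scores)
    # Form Y curve as (percentile rank, score) pairs in increasing order.
    y_curve = sorted((pr, s) for s, pr in percentile_ranks(y_scores).items())
    conversion = {}
    for x, pr in x_ranks.items():
        if pr <= y_curve[0][0]:
            conversion[x] = float(y_curve[0][1])    # clamp at the low end
        elif pr >= y_curve[-1][0]:
            conversion[x] = float(y_curve[-1][1])   # clamp at the high end
        else:
            for (pr_lo, y_lo), (pr_hi, y_hi) in zip(y_curve, y_curve[1:]):
                if pr_lo <= pr <= pr_hi:
                    frac = (pr - pr_lo) / (pr_hi - pr_lo)
                    conversion[x] = y_lo + frac * (y_hi - y_lo)
                    break
    return conversion

# Hypothetical raw scores; Form Y here is uniformly 2 points easier.
form_x = [1, 2, 2, 3, 3, 3, 4, 4, 5]
form_y = [s + 2 for s in form_x]
print(equipercentile_equate(form_x, form_y))
```

In this constructed example every Form X score converts to the Form Y score two points higher, which is what the definition of equivalence by equal percentile rank implies.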

Linear Equating
• Based on the assumption that the two forms of the test, designed to be parallel (equivalent), will have essentially the same raw score distributions.
• When the assumption is met, it should be possible to convert scores on one form of the measure into the same metric as the other form by employing a linear function.

Linear Equating
• Analogous to multiple regression
• Generally expressed as Y* = a(X - c) + d
  – ‘a’ refers to the ratio of the standard deviation of Form Y to the standard deviation of Form X
  – ‘c’ refers to the mean of Form X
  – ‘d’ refers to the mean of Form Y
• To perform this type of equating, one of three basic data collection designs should be used.
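
The function Y* = a(X - c) + d can be sketched directly in Python. The function name and the score lists are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the slides do not specify which form of the standard deviation is used.

```python
import statistics

def linear_equate(x, x_scores, y_scores):
    """Convert a Form X score to the Form Y metric: Y* = a(X - c) + d."""
    a = statistics.pstdev(y_scores) / statistics.pstdev(x_scores)  # SD ratio
    c = statistics.mean(x_scores)   # mean of Form X
    d = statistics.mean(y_scores)   # mean of Form Y
    return a * (x - c) + d

# Hypothetical raw scores from the two forms.
form_x = [10, 20, 30]
form_y = [15, 35, 55]

# A score at the Form X mean maps to the Form Y mean.
print(linear_equate(20, form_x, form_y))
```

In this example Form Y scores are twice as spread out as Form X scores, so a Form X score 10 points above its mean converts to a Form Y score about 20 points above the Form Y mean.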

Basic Data Collection Designs
• Design 1: a large group of examinees is selected that is sufficiently heterogeneous to adequately sample all levels of the scores on both Form X and Form Y.
  – Divide this group randomly into two groups; each group gets a different form.
  – Compute the mean and standard deviation for each group and insert them in the aforementioned linear function.

Design 2
• Preferable when the test administrator has the luxury of more time to administer tests.
  – Both Form X and Form Y are administered to all subjects.
  – To control for order effects, half of the subjects receive Form X first and half receive Form Y first.
  – Calculations are slightly more involved because averages must be used and must be applied properly.

Design 3
• Two randomly assigned groups each take a different test along with a common equating test.
  – The common test is known as the ‘anchor test’ (Form Z).
  – Again, calculations are complicated by the addition of the anchor test.
  – Benefits: it is the industry standard; intact or non-random groups may be used; the anchor test is designed to adjust for any between-group differences that may be present.

IRT Equating
• When using IRT, if the data fit the model reasonably well, the item and ability parameters are invariant.
  – For a set of calibrated items, an examinee will be expected to obtain the same ability estimate from any subset of items.
  – For any sub-sample of examinees, item parameters will be the same.

IRT Equating Cont’d
• Based on the principles in the previous slide and the common anchor item methodology, MDE utilizes various forms of this equating methodology.
• Generally, when we embed core items into examinations from year to year, we already know the difficulty estimates of those items.
• When we perform our IRT estimation during the current year or with the form under investigation, we can ‘fix’ the parameters of those items when we feed our data into an estimation program.
  – The new ability estimates (and field test item parameter estimates) are anchored to the previous form or previous year’s administration.

Contact Information
Steven Viger
Michigan Department of Education
608 W. Allegan St.
Lansing, MI 48909
Office: (517) 241-2334
Fax: (517) 335-1186
Viger.S@Michigan.gov

Contact Information
Joseph Martineau
Michigan Department of Education
608 W. Allegan St.
Lansing, MI 48909
Office: (517) 241-4710
Fax: (517) 335-1186
Martineau.J@Michigan.gov