Automatic Generation of Verbal Analogy Items
Alan D. Mead
Illinois Institute of Technology
AIG (automatic item generation) in employment testing
• Rise of unproctored Internet testing (UIT)
• UIT may cause many security problems
  – These include item theft and coaching
• Solution: Generate the entire test from scratch for each examinee
  – Item theft less of a problem
  – Coaching less effective
  – Items could be “watermarked”
• Also reduces cost and speeds deployment
AIG in employment testing (cont.)
• Need a variety of test content
  – Verbal analogies
  – Vocabulary
  – Math
  – Perceptual speed and accuracy
  – Spatial ability
  – Personality
  – Situational judgment
  – Etc.
Verbal Analogies
Pair responses:
  Shovel: Dig
  a) Bag: Buy
  b) Baby: Cry
  c) Fork: Eat
  d) Car: Stop
Word responses:
  Shovel: Dig :: Fork: ?
  a) Buy
  b) Cry
  c) Eat
  d) Stop
• Identify a “bridge”: you DIG with a SHOVEL
• Find a matching answer: you EAT with a FORK
Generating Verbal Analogies
• Identified a database of relationships (e.g., “RIDER operates a BIKE”)
• Identified additional bridge relationships (“BOVINE means COW-like” & “ABSENT is the opposite of PRESENT”)
• Gathered data on word frequency and (as part of this study) word familiarity
Generating Verbal Analogies (cont.)
1. Randomly select a bridge
2. Randomly select TWO pairs for this bridge (one for the stem, one for the key)
3. Randomly select 2-3 additional pairs from other bridges (the foils)
4. Randomly assign the key pair to a response position; fill the remaining positions with the foil pairs
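The four generation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the `BRIDGES` dictionary is a hypothetical miniature database built from pairs shown on the slides, and the names `generate_item` and `format_item` are invented here.

```python
import random

# Hypothetical miniature database: bridge description -> list of (word1, word2) pairs.
# The pairs come from the slides; the real relationship database is not shown there.
BRIDGES = {
    "Y operates X": [("rocket", "astronaut"), ("jet", "pilot"), ("bike", "rider")],
    "Y is described by X": [("paternal", "father"), ("juvenile", "child")],
    "X and Y are opposites": [("unfold", "fold"), ("demand", "supply"), ("absent", "present")],
}

def generate_item(bridges, n_foils=3, rng=random):
    """Build one pair-response analogy item following the four steps."""
    # Step 1: randomly select a bridge (it must have at least two pairs).
    eligible = [b for b, pairs in bridges.items() if len(pairs) >= 2]
    bridge = rng.choice(eligible)
    # Step 2: randomly select two distinct pairs, one for the stem and one for the key.
    stem, key = rng.sample(bridges[bridge], 2)
    # Step 3: randomly select foil pairs from the other bridges.
    others = [p for b, pairs in bridges.items() if b != bridge for p in pairs]
    foils = rng.sample(others, n_foils)
    # Step 4: put the key at a random position; foils fill the remaining positions.
    key_index = rng.randrange(n_foils + 1)
    options = list(foils)
    options.insert(key_index, key)
    return {"bridge": bridge, "stem": stem, "key": key,
            "options": options, "key_index": key_index}

def format_item(item):
    """Render the item in the pair-response layout used on the sample slides."""
    lines = ["{}: {} :: ?".format(*item["stem"])]
    for letter, (first, second) in zip("abcd", item["options"]):
        lines.append(f"{letter}. {first}: {second}")
    return "\n".join(lines)

item = generate_item(BRIDGES, rng=random.Random(0))
print(format_item(item))
```

Passing a seeded `random.Random` makes a generated form reproducible, which could be one way to regenerate an examinee's test later; the slides do not say whether the original system works this way.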
Sample Items
1. paternal: father :: ?
   a. juvenile: child
   b. microphone: sound
   c. chalk: writer
   d. unfold: fold
3. rocket: astronaut :: ?
   a. lamp: light
   b. stick: skating rink
   c. jet: pilot
   d. demand: supply
Alternative format
1. paternal: father :: juvenile: ?
   a. child
   b. sound
   c. writer
   d. fold
3. rocket: astronaut :: jet: ?
   a. light
   b. skating rink
   c. pilot
   d. supply
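Comparing the two sample slides, the word-response format appears to move the first word of the key pair into the stem and keep only the second word of each option. A minimal sketch of that conversion, assuming this reading is right (the function name `to_word_response` is hypothetical):

```python
def to_word_response(stem, options, key_index):
    """Convert a pair-response item into the word-response layout.

    stem      -- (word1, word2) pair shown in the stem
    options   -- list of (word1, word2) response pairs
    key_index -- position of the correct pair in options
    """
    # The first word of the key pair becomes part of the stem.
    key_first, _ = options[key_index]
    new_stem = f"{stem[0]}: {stem[1]} :: {key_first}: ?"
    # Each response keeps only the second word of its pair.
    words = [second for _, second in options]
    return new_stem, words

# Sample item 1 from the slides (the key is option a).
stem = ("paternal", "father")
options = [("juvenile", "child"), ("microphone", "sound"),
           ("chalk", "writer"), ("unfold", "fold")]
new_stem, words = to_word_response(stem, options, key_index=0)
print(new_stem)
for letter, word in zip("abcd", words):
    print(f"{letter}. {word}")
```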
Keys
1. paternal: father :: ? [Bridge: FATHER is described by PATERNAL]
   a. juvenile: child ***
   b. microphone: sound (unrelated: sound is a (typical) theme of microphone)
   c. chalk: writer (unrelated: writer is a (typical) agent of chalk)
   d. unfold: fold (unrelated: unfold and fold are opposites/opposed)
3. rocket: astronaut :: ? [Bridge: ASTRONAUT operates ROCKET]
   a. lamp: light (unrelated: lamp is a (typical) result of light)
   b. stick: skating_rink (unrelated: skating_rink is a (typical) location of stick)
   c. jet: pilot ***
   d. demand: supply (unrelated: supply and demand are opposites/opposed)
Present Study
• H1: Two forms of AIG analogies (word responses and pair responses) will have comparable reliability & validity
• H2: AIG scales will have reliability comparable to a manually-written scale
• H3: AIG scales will have construct and criterion validity comparable to a manually-written scale
Method
• Sample of N=251 gathered online and from psychology classes
• Measures:
  – n=20 AIG & human-written verbal analogy scales
  – n=40 vocabulary
  – Self-reported performance at work & school
Feasibility
• Manually examined items for feasibility
• 40/64 (63%) items were feasible
• Reasons for infeasibility
  – Over-use of a bridge or a pair (some bridges have few pairs)
  – Ambiguous pairs (drum: drum?)
  – Foil inadvertently a correct key
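The study checked feasibility by hand. As a rough illustration of how some of the listed problems might be flagged automatically, here is a sketch under stated assumptions: items are dictionaries with hypothetical `bridge`, `stem`, `options`, and `key_index` fields, and a foil counts as "inadvertently correct" only when the database already lists it under the key's bridge. Foils that fit the bridge but are missing from the database would still need human review.

```python
from collections import Counter

def check_item(item, bridges):
    """Return a list of feasibility problems found in a single item."""
    problems = []
    # Ambiguous pairs: both words identical (e.g. drum: drum).
    for first, second in [item["stem"]] + item["options"]:
        if first == second:
            problems.append(f"ambiguous pair {first}: {second}")
    # Foil inadvertently correct: a non-key option that the database
    # also lists under the key's bridge.
    key_bridge_pairs = set(bridges[item["bridge"]])
    for i, pair in enumerate(item["options"]):
        if i != item["key_index"] and pair in key_bridge_pairs:
            problems.append(f"foil {pair[0]}: {pair[1]} also fits the key bridge")
    return problems

def overused_pairs(items, max_uses=1):
    """Return pairs that appear more than max_uses times across a form."""
    counts = Counter(p for it in items for p in [it["stem"]] + it["options"])
    return sorted(p for p, c in counts.items() if c > max_uses)

# Hypothetical database fragment and a deliberately flawed item.
bridges = {"Y operates X": [("rocket", "astronaut"), ("jet", "pilot"), ("bike", "rider")]}
item = {
    "bridge": "Y operates X",
    "stem": ("rocket", "astronaut"),
    "options": [("drum", "drum"), ("bike", "rider"), ("jet", "pilot"), ("demand", "supply")],
    "key_index": 2,
}
for problem in check_item(item, bridges):
    print(problem)
```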
Results for H1
Variable                          Mean   SD     n    1       2       3       4
1 Vocabulary                      0.75   0.14   40   (0.86)  0.66            0.69
2 Human-written items             0.65   0.14   20   0.46    (0.57)  0.97    1.04
3 AIG items with pair responses   0.73   0.16   20   0.52    0.63    (0.73)  0.94
4 AIG items with word responses   0.81   0.14   19   0.54    0.67    0.68    (0.72)
5 Self-Rated Performance          3.72   0.61   6    -0.04   -0.01   0.05    0.10
6 Academic Performance            0.02   0.72   3    0.14    0.22    0.20    0.14
H1: Two forms of AIG analogies (word responses and pair responses) will have comparable reliability & validity
Result: CONFIRMED
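The slides do not say how the above-diagonal values in the table were computed. They look consistent with the standard correction for attenuation, r_xy / sqrt(r_xx * r_yy), if the parenthesized diagonal values are read as reliability estimates. The sketch below checks that reading against the three analogy scales; both the interpretation and the variable names are assumptions, and agreement is only expected to within rounding of the two-decimal inputs.

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for unreliability in both measures."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Diagonal values from the table, read here as reliability estimates (assumption).
RELIABILITY = {"human": 0.57, "aig_pairs": 0.73, "aig_words": 0.72}

# Below-diagonal (observed) correlations from the table.
OBSERVED = {
    ("human", "aig_pairs"): 0.63,
    ("human", "aig_words"): 0.67,
    ("aig_pairs", "aig_words"): 0.68,
}

# Above-diagonal values reported in the table, for comparison.
REPORTED = {
    ("human", "aig_pairs"): 0.97,
    ("human", "aig_words"): 1.04,
    ("aig_pairs", "aig_words"): 0.94,
}

for (x, y), r in OBSERVED.items():
    corrected = disattenuate(r, RELIABILITY[x], RELIABILITY[y])
    print(f"{x} vs {y}: corrected {corrected:.3f}, reported {REPORTED[(x, y)]:.2f}")
```

A corrected value above 1.0, as for the human-written and word-response scales, can occur when the reliability estimates are low; it is usually read as the two scales being indistinguishable after correction rather than as a literal correlation.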
Results for H2
(Same correlation table as in the Results for H1 slide.)
H2: AIG scales will have reliability comparable to a manually-written scale
Result: NOT CONFIRMED, because the AIG scales had better reliability (0.73 and 0.72 vs. 0.57 for the human-written items)
Results for H3
(Same correlation table as in the Results for H1 slide.)
H3: AIG scales will have construct and criterion validity comparable to a manually-written scale
Result: CONFIRMED
Predicting Item Difficulty
Predictor                                             Correlation
Automatically generated (1) or manually written (0)   0.28*
Familiarity of least familiar word in item            0.33*
Familiarity of second least familiar word in item     0.39**
Mean familiarity of all words in item                 0.37**
Lowest log(count(word))                               0.14
Second lowest log(count(word))                        -0.06
Mean log(count(word))                                 0.17
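The familiarity predictors in this table can be derived per item from word-level ratings and then correlated with item difficulty. The following is a minimal sketch with made-up numbers: the familiarity ratings, the item difficulties, and the choice of proportion correct as the difficulty measure are all assumptions for illustration, not data from the study.

```python
import math

def familiarity_features(words, familiarity):
    """Lowest, second-lowest, and mean familiarity over the words in an item."""
    ratings = sorted(familiarity[w] for w in words)
    return {
        "lowest": ratings[0],
        "second_lowest": ratings[1],
        "mean": sum(ratings) / len(ratings),
    }

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical familiarity ratings (higher = more familiar).
FAMILIARITY = {"paternal": 4.0, "father": 7.0, "juvenile": 5.0, "child": 7.0,
               "rocket": 6.0, "astronaut": 5.5, "jet": 6.5, "pilot": 6.0,
               "bovine": 2.5, "cow": 7.0, "equine": 3.0, "horse": 7.0}

# Hypothetical items (words only) with made-up proportion-correct values.
ITEMS = [
    (["paternal", "father", "juvenile", "child"], 0.70),
    (["rocket", "astronaut", "jet", "pilot"], 0.85),
    (["bovine", "cow", "equine", "horse"], 0.55),
]

features = [familiarity_features(words, FAMILIARITY) for words, _ in ITEMS]
difficulty = [p for _, p in ITEMS]
for name in ("lowest", "second_lowest", "mean"):
    r = pearson([f[name] for f in features], difficulty)
    print(f"{name}: r = {r:.2f}")
```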
Future Directions
• Better handling of word senses (DRUM is for DRUMMING)
• Better difficulty calculations based on a larger sample of items
• Automated feasibility checking
• Enhanced database of relationships
• Choosing foils to have more semantic similarity to other words
Thank you!
mead@iit.edu