The Ups and Downs of Preposition Error Detection in ESL Writing
Joel Tetreault [Educational Testing Service]
Martin Chodorow [Hunter College of CUNY]

Motivation
- Increasing need for tools for instruction in English as a Second Language (ESL)
- Preposition usage is one of the most difficult aspects of English for non-native speakers
  - [Dalgish ’85]: 18% of sentences from ESL essays contain a preposition error
  - Our data: 8-10% of all prepositions in TOEFL essays are used incorrectly

Why are prepositions hard to master?
- Prepositions perform so many complex roles
  - Preposition choice in an adjunct is constrained by its object (“on Friday”, “at noon”)
  - Prepositions are used to mark the arguments of a predicate (“fond of beer.”)
  - Phrasal Verbs (“give in to their demands.”)
    - “give in” = “acquiesce, surrender”
  - Multiple prepositions can appear in the same context
    - “…the force of gravity causes the sap to move _____ the underside of the stem.” [to, onto, toward, on]

Objective
- Long Term Goal: develop NLP tools to automatically provide feedback to ESL learners on grammatical errors
- Preposition Error Detection
  - Selection Error (“They arrived to the town.”)
  - Extraneous Use (“They came to outside.”)
  - Omitted (“He is fond this book.”)
- Coverage: 34 most frequent prepositions

Outline
- Approach
  - Obs 1: Classifier Prediction
  - Obs 2: Training a Model
  - Obs 3: What features are important?
- Evaluation on Native Text
- Evaluation on ESL Text

Observation 1: Classification Problem
- Cast error detection task as a classification problem
- Given a model classifier and a context:
  - System outputs a probability distribution over all prepositions
  - Compare weight of system’s top preposition with writer’s preposition
- Error occurs when:
  - Writer’s preposition ≠ classifier’s prediction
  - And the difference in probabilities exceeds a threshold
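
The error condition on this slide can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name `flag_error`, the example probability distributions, and the threshold value of 0.5 are all assumptions made for the example.

```python
def flag_error(probs, writer_prep, threshold):
    """Return True if the writer's preposition should be flagged as an error.

    probs: dict mapping each candidate preposition to the classifier's probability.
    threshold: minimum probability gap required before flagging (assumed value).
    """
    top_prep = max(probs, key=probs.get)
    if top_prep == writer_prep:
        return False  # classifier agrees with the writer
    # Gap between the classifier's top choice and the writer's choice.
    gap = probs[top_prep] - probs.get(writer_prep, 0.0)
    return gap > threshold


# Hypothetical distributions, not taken from the slides.
fond = {"of": 0.85, "with": 0.05, "for": 0.10}
home = {"at": 0.40, "around": 0.35, "by": 0.25}

print(flag_error(fond, "with", threshold=0.5))    # large gap between "of" and "with"
print(flag_error(home, "around", threshold=0.5))  # small gap between "at" and "around"
```

Raising the threshold makes the system flag fewer cases, which fits the precision-over-recall preference stated in the conclusions.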

Observation 2: Training a Model
- Develop a training set of error-annotated ESL essays (millions of examples?):
  - Too labor intensive to be practical
- Alternative:
  - Train on millions of examples of proper usage
- Determining how “close to correct” writer’s preposition is

Observation 3: Features
- Prepositions are influenced by:
  - Words in the local context, and how they interact with each other (lexical)
  - Syntactic structure of context
  - Semantic interpretation

Summary
1. Extract lexical and syntactic features from well-formed (native) text
2. Train MaxEnt model on feature set to output a probability distribution over 34 preps
3. Evaluate on error-annotated ESL corpus by:
   1. Comparing system’s prep with writer’s prep
   2. If unequal, use thresholds to determine “correctness” of writer’s prep

Feature Extraction
- Corpus Processing:
  - POS tagged (Maxent tagger [Ratnaparkhi ’98])
  - Heuristic Chunker
  - Parse Trees?
    - “In consion, for some reasons, museums, particuraly known travel place, get on many people.”
- Feature Extraction
  - Context consists of:
    - +/- two word window
    - Heads of the following NP and preceding VP and NP
  - 25 features consisting of sequences of lemma forms and POS tags

Features

Feature  No. of Values  Description
PV       16,060         Prior verb
PN       23,307         Prior noun
FH       29,815         Headword of the following phrase
FP       57,680         Following phrase
TGLR     69,833         Middle trigram (pos + words)
TGL      83,658         Left trigram
TGR      77,460         Right trigram
BGL      30,103         Left bigram

Example: “He will take our place in the line”

Combination Features
- MaxEnt does not model the interactions between features
- Build “combination” features of the head nouns and commanding verbs
  - PV, PN, FH
- 3 types: word, tag, word+tag
  - Each type has four possible combinations
  - Maximum of 12 features

Combination Features

Class    Components  +Combo:word
p-N      FH          line
N-p-N    PN-FH       place-line
V-p-N    PV-PN       take-line
V-N-p-N  PV-PN-FH    take-place-line

Example: “He will take our place in the line.”
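
A rough Python sketch of how the four word-type combination features could be assembled for the example sentence. The function name `combo_word_features` is hypothetical, and only the word type is covered (the tag and word+tag types would follow the same pattern to reach the maximum of 12). One assumption deserves attention: the table lists PV-PN as the components of V-p-N, but its example value, “take-line”, joins the verb with the following head, so the sketch follows the example.

```python
def combo_word_features(pv, pn, fh):
    """Build the four +Combo:word features from head words.

    pv: prior verb, pn: prior noun, fh: headword of the following phrase.
    """
    return {
        "p-N": fh,
        "N-p-N": f"{pn}-{fh}",
        # Assumption: follows the slide's example value ("take-line"),
        # i.e. verb + following head, rather than the listed PV-PN.
        "V-p-N": f"{pv}-{fh}",
        "V-N-p-N": f"{pv}-{pn}-{fh}",
    }


# Heads taken from "He will take our place in the line."
features = combo_word_features(pv="take", pn="place", fh="line")
for cls, value in features.items():
    print(cls, value)
```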

Google-Ngram Features
- Typical way that non-native speakers check if usage is correct:
  - “Google” the phrase and alternatives
- Created a fast-access Oracle database from the POS-tagged Google N-gram corpus
- Queries provided frequency data for the +Combo features
- Top three prepositions per query were used as features for ME model
  - Maximum of 12 Google features

Google Features

Class    Combo:word       Google Features
p-N      line             P1=on, P2=in, P3=of
N-p-N    place-line       P1=in, P2=on, P3=of
V-p-N    take-line        P1=on, P2=to, P3=into
V-N-p-N  take-place-line  P1=in, P2=on, P3=after

Example: “He will take our place in the line”
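
The “top three prepositions per query” step might look like the following Python sketch. The counts are invented for illustration and the function name `top_three_preps` is hypothetical; the slides do not show how the Oracle database was actually queried.

```python
from collections import Counter


def top_three_preps(counts):
    """Return the three most frequent prepositions as P1..P3 features."""
    ranked = Counter(counts).most_common(3)
    return {f"P{i}": prep for i, (prep, _) in enumerate(ranked, start=1)}


# Hypothetical n-gram counts for the query "place <prep> line".
place_line_counts = {"in": 900, "on": 400, "of": 250, "at": 100, "to": 20}
print(top_three_preps(place_line_counts))
```

With four combination classes and three prepositions each, this yields the maximum of 12 Google features mentioned earlier.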

Preposition Selection Evaluation
- Test models on well-formed native text
- Metric: accuracy
  - Compare system’s output to writer’s
  - Has the potential to underestimate performance by as much as 7% [HJCL ’08]
- Two Evaluation Corpora:
  - WSJ
    - test=106k events
    - train=4.4M NANTC events
  - Encarta-Reuters
    - test=1.4M events
    - train=3.2M events
    - Used in [Gamon+ ’08]

Preposition Selection Evaluation

Model               WSJ    Enc-Reu*
Baseline (of)*      26.7%  27.2%
Lexical             70.8%  76.5%
+Combo              71.8%  77.4%
+Google             71.6%  76.9%
+Both               72.4%  77.7%
+Combo +Extra Data  74.1%  79.0%

* [Gamon et al., ’08] perform at 64% accuracy on 12 prep’s

Evaluation on Non-Native Texts
- Error Annotation
  - Most previous work used only one rater
  - Is one rater reliable? [HJCL ’08]
  - Sampling Approach for efficient annotation
- Performance Thresholding
  - How to balance precision and recall?
  - May not want to optimize a system using F-score
- ESL Corpora
  - Factors such as L1 and grade level greatly influence performance
  - Makes cross-system evaluation difficult

Related Work
- Most previous work has focused on:
  - Subset of prepositions
  - Limited evaluation on a small test corpus

Related Work

Work                        Method                                  Performance
[Eeg-Olofsson et al. ’03]   Handcrafted rules for Swedish learners  11/40 prepositions correct
[Izumi et al. ’03, ’04]     ME model to classify 13 error types     25% precision, 7% recall
[Lee & Seneff ’06]          Stochastic model on restricted domain   80% precision, 77% recall
[De Felice & Pulman ’08]    Maxent model (9 prep’s)                 ~57% precision, ~11% recall
[Gamon et al. ’08]          LM + decision trees (12 prep’s)         80% precision

Training Corpus for ESL Texts
- Well-formed text: training only on positive examples
- 6.8 million training contexts total
  - 3.7 million sentences
- Two sub-corpora:
  - MetaMetrics Lexile
    - 11th and 12th grade texts
    - 1.9M sentences
  - San Jose Mercury News
    - Newspaper Text
    - 1.8M sentences

ESL Testing Corpus
- Collection of randomly selected TOEFL essays by native speakers of Chinese, Japanese and Russian
- 8192 prepositions total (5585 sentences)
- Error annotation reliability between two human raters:
  - Agreement = 0.926
  - Kappa = 0.599
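
For readers unfamiliar with the two reliability figures, here is a small Python sketch of raw agreement and Cohen's kappa, assuming the reported kappa is the standard Cohen's kappa for two raters. The toy judgments are invented and are not the TOEFL annotations.

```python
from collections import Counter


def agreement_and_kappa(rater_a, rater_b):
    """Return (observed agreement, Cohen's kappa) for two label sequences."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Toy judgments on ten prepositions (hypothetical data).
rater_a = ["ok"] * 8 + ["error"] * 2
rater_b = ["ok"] * 7 + ["error", "error", "ok"]
observed, kappa = agreement_and_kappa(rater_a, rater_b)
print(observed, kappa)
```

Because most prepositions are used correctly, two raters agree often by chance alone, which is why agreement can be high while kappa stays moderate. Under the same assumption, the figures on the slide imply a chance agreement of about (0.926 - 0.599) / (1 - 0.599) ≈ 0.82.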

Expanded Classifier
[Pipeline diagram: Data, Pre Filter, Maxent (with Model), Post Filter, Extran. Use, Output]
- Pre-Processing Filter
- Maxent Classifier (uses model from training)
- Post-Processing Filter
- Extraneous Use Classifier (PC)

Pre-Processing Filter
- Spelling Errors
  - Blocked classifier from considering preposition contexts with spelling errors in them
- Punctuation Errors
  - TOEFL essays have many omitted punctuation marks, which affects feature extraction
- Tradeoff recall for precision

Post-Processing Filter
- Antonyms
  - Classifier confused prepositions with opposite meanings (with/without, from/to)
  - Resolution dependent on intention of writer
- Benefactives
  - Adjunct vs. argument confusion
  - Use WordNet to block classifier from marking benefactives as errors

Prohibited Context Filter
- Account for 142 of 600 errors in test set
- Two filters:
  - Plural Quantifier Constructions (“some of people”)
  - Repeated Prep’s (“can find friends with”)
- Filters cover 25% of 142 errors
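
A crude Python sketch of what a plural quantifier filter could look like. The slides do not give the actual rules, so the word lists, the regular expression, and the function name `has_extraneous_of` are all assumptions; a real filter would presumably rely on POS tags rather than a fixed word list.

```python
import re

# Hypothetical word lists; the slides do not specify the filter's actual rules.
QUANTIFIERS = ("some", "most", "many", "all", "few")
DETERMINERS = ("the", "these", "those", "my", "your", "his", "her", "its",
               "our", "their", "them", "us", "you")

PATTERN = re.compile(
    r"\b(?:%s)\s+of\s+(\w+)" % "|".join(QUANTIFIERS), re.IGNORECASE
)


def has_extraneous_of(sentence):
    """Flag 'quantifier + of + bare noun' patterns such as 'some of people'."""
    for match in PATTERN.finditer(sentence):
        # "some of the people" is fine; "some of people" is not.
        if match.group(1).lower() not in DETERMINERS:
            return True
    return False


print(has_extraneous_of("Some of people like tea."))
print(has_extraneous_of("Some of the people like tea."))
```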

Thresholding Classifier’s Output
- Thresholds allow the system to skip cases where the top-ranked preposition and what the student wrote differ by less than a prespecified amount

Thresholds
FLAG AS ERROR: “He is fond with beer”

Thresholds
FLAG AS OK: “My sister usually gets home around 3:00”

Results

Model                    Precision  Recall
Lexical                  80%        12%
+Combo:tag               82%        14%
+Combo:tag + Extraneous  84%        19%
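
As a reminder of how the two columns are defined, here is a short Python sketch. The counts are toy values chosen for round numbers and are not the system's actual confusion counts, which the slides do not report.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision: share of flagged cases that are real errors.
    Recall: share of real errors that were flagged."""
    flagged = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Toy counts (hypothetical): 10 cases flagged, 8 of them real, 32 errors missed.
precision, recall = precision_recall(true_positives=8, false_positives=2, false_negatives=32)
print(precision, recall)
```

A system tuned this way is right most of the time when it speaks up but stays silent on most errors, which is the profile shown in the table.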

Google Features
- Adding Google features had minimal impact
- Using solely Google features (or counts) as a classifier: ~45% accuracy on native text
- Disclaimer: very naïve implementation

Conclusions
- Present a combined ML and rule-based approach:
  - State-of-the-art preposition selection performance: 79%
  - Accurately detects preposition errors in ESL essays with P=0.84, R=0.19
- In instructional applications it is important to minimize false positives
  - Precision favored over recall
- This work is included in ETS’s Criterion℠ Online Writing Service and E-Rater
- Also see: “Native Judgments of Non-Native Usage” [HJCL ’08] (tomorrow afternoon)

Common Preposition Confusions

Writer’s Prep  Rater’s Prep  Frequency
to             null          9.5%
of             null          7.3%
in             at            7.1%
to             for           4.6%
in             null          3.2%
of             for           3.1%
in             on            3.1%