Automated Pronunciation Scoring, Combining Confidence Scoring and Landmark-Based SVM
Su-Youn Yoon, Mark Hasegawa-Johnson, and Richard Sproat
Presenter: Mark Hasegawa-Johnson

Goal: Automated Pronunciation Error Detection for L2 Learners
- Error localization is step one in any pronunciation scoring system.
- First-pass system: use the same algorithm for all phonemes (no label-dependent features).
- Which phonemes? L2 phonology studies identify phones with high P(error | intended phone, talker L1).
- Targeted error detection: phones with high error rates.

Previous Studies: Confidence Scoring
- Franco et al. (1997), Neumeyer et al. (2000), Witt (1999), Witt and Young (1997)
- Normalized acoustic score (e.g., P(phone | signal)) based on the acoustic models of an automatic speech recognizer
- Target: does the L2 utterance match the L1 acoustic models?
- Problem: difficult to specialize for a specific phoneme
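One common way to obtain a normalized acoustic score of the form P(phone | signal) is to apply Bayes' rule over the recognizer's per-phone log-likelihoods for a segment. The sketch below illustrates only that normalization. The function name `phone_posterior`, the equal-prior default, and the example log-likelihood values are assumptions made for illustration, not the scoring used in the cited studies.

```python
import math

def phone_posterior(log_likelihoods, target, log_priors=None):
    """Return P(target | signal) from per-phone log-likelihoods of one segment.

    log_likelihoods maps phone label -> log p(signal | phone).
    With no priors given, equal priors are assumed (hypothetical default).
    """
    if log_priors is None:
        log_priors = {phone: 0.0 for phone in log_likelihoods}
    joint = {p: log_likelihoods[p] + log_priors[p] for p in log_likelihoods}
    # Log-sum-exp keeps the normalization numerically stable.
    peak = max(joint.values())
    log_evidence = peak + math.log(sum(math.exp(v - peak) for v in joint.values()))
    return math.exp(joint[target] - log_evidence)

# Made-up log-likelihoods for a segment whose canonical phone is [f].
scores = {"f": -42.0, "p": -40.0, "v": -45.0}
confidence = phone_posterior(scores, "f")
```

A low value of `confidence` means that competing phone models explain the segment better than the canonical phone does, which is the signal a confidence-scoring detector thresholds.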

Previous Studies: Classifier Methods
- Pronunciation rating: Ito et al., 2005; Stouten and Martens, 2006
- Error localization: Truong et al., 2005; Strik et al., 2007
- Strik et al.: distinguish the velar fricative /x/ from the stop /k/
  - AP-classifier: articulatory-acoustic features
  - MFCC-classifier: MFCC features from near the release
  - Both classifiers showed higher accuracy than the confidence scoring method

Objective: Combine Confidence Scoring with Classifier
- Confidence scores: normalized acoustic scores
  - Same algorithm for all phonemes
- Classifiers: landmark-based support vector machines (SVMs)
  - Specialized for phones where L2 learners make frequent errors

Phone Classifier SVM
- For each target confusion: x = (L1 phone) → often misarticulated as y = (L2 phone)
- Distinguish each L2 phoneme from its possible substitution patterns
- Example: if the target English phone is x = [f] and its potential substitution pattern is y = [p], an SVM classifier was trained to distinguish [f] from [p].

Phone Classifier Target Pairs

Simulate Pronunciation Errors: Change the Dictionary
(1) Choose 8 frequent x→y phone confusions
    - Korean learners' frequent substitution error patterns were collected from Swan and Smith (2002)
(2) Find L1 English speech examples of phone y
(3) Change the dictionary so that it pretends the canonical phone is x
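The dictionary change can be sketched as a relabeling of a pronunciation lexicon: every entry that contains the substitution phone y is rewritten so that the canonical phone is x, and native recordings of those words then behave like x→y errors. This is a minimal illustration. The function name `simulate_errors`, the toy lexicon, and the choice to relabel every occurrence of y are assumptions, not the authors' tooling.

```python
def simulate_errors(lexicon, target, substitute):
    """Relabel every `substitute` phone as `target` in a pronunciation lexicon.

    Native speech that really contains `substitute` is then scored against
    `target`, imitating a learner's target -> substitute error.
    Only words that contain `substitute` are returned.
    """
    simulated = {}
    for word, phones in lexicon.items():
        if substitute in phones:
            simulated[word] = [target if p == substitute else p for p in phones]
    return simulated

# Toy lexicon (hypothetical entries) using the [f]/[p] pair as an example.
lexicon = {
    "pan": ["p", "ae", "n"],
    "fan": ["f", "ae", "n"],
    "nap": ["n", "ae", "p"],
}
simulated = simulate_errors(lexicon, target="f", substitute="p")
```

Here the recordings of "pan" and "nap" still contain a native [p], but the modified dictionary claims the canonical phone is [f], so a detector should flag those segments as errors.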

Simulated Data: Example

Phone Classifier SVMs Using Landmark-Based Features
- Choose PLP features from a landmark or segment that discriminates the two phones
  - Consonant: boundary of the phoneme
  - Vowel: center of the phoneme
- Used the same features for all phonemes: 39 (PLP+E+v+a) + 6 (formants+v+a)

Training
- Confidence scoring method (HUB4)
  - ASR acoustic model trained using HUB4 broadcast news data
- SVM classifiers (TIMIT)
  - SVMs trained using TIMIT read speech
  - Canonical phones (x) and substitution phones (y) extracted based on TIMIT transcriptions
  - Features extracted from a 7 × 10 ms segment near the landmark
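The landmark rule (phoneme boundary for consonants, phoneme center for vowels) and the 7 × 10 ms segment can be sketched as a frame-selection step. The slides say only "near landmark", so centering the window on the landmark, using the onset as the consonant boundary, and clamping at the start of the file are all assumptions here, as are the function names.

```python
def landmark_time(start, end, is_vowel):
    """Landmark in seconds: vowel -> segment center; consonant -> boundary.

    Using the segment onset as the consonant boundary is an assumption.
    """
    return (start + end) / 2.0 if is_vowel else start

def frames_near_landmark(landmark, n_frames=7, frame_shift=0.010):
    """Indices of n_frames consecutive frames around the landmark.

    Centering the window on the landmark is an assumption; the window is
    shifted right if it would start before frame 0.
    """
    center = round(landmark / frame_shift)
    first = max(0, center - n_frames // 2)
    return list(range(first, first + n_frames))

# A vowel aligned to 0.20-0.30 s and a consonant starting at 0.02 s.
vowel_frames = frames_near_landmark(landmark_time(0.20, 0.30, is_vowel=True))
consonant_frames = frames_near_landmark(landmark_time(0.02, 0.09, is_vowel=False))
```

The per-frame feature vectors at these indices would then be concatenated into one input vector for the SVM of the relevant phone pair.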

Validation and Test
- Validation data (Buckeye corpus)
  - Scoring combination SVM: trained an additional SVM to combine the confidence score and the SVM score (step 1)
  - Phone-specific threshold: found a threshold for each phoneme that maximized the F-score on the development test data (step 2)
- Test data (Buckeye corpus)
  - Target L2 phones were automatically extracted based on the time-alignment segmentation
  - For each phone, the confidence score and SVM score were calculated and combined using the scoring combination SVM (from step 1)
  - If the combination SVM score was lower than the phone-specific threshold (from step 2), the phone was classified as an error.
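Step 2 can be illustrated with a small threshold search: for one phone, try candidate thresholds on the development scores and keep the one with the highest F-score, flagging a phone as an error when its combined score falls below the threshold. This is a sketch under stated assumptions. The F-score is computed for the error class, candidates are midpoints between adjacent scores, and the scores and labels are made up.

```python
def f_score(scores, is_error, threshold):
    """F-score of the error class when score < threshold is flagged as an error."""
    tp = sum(1 for s, e in zip(scores, is_error) if s < threshold and e)
    fp = sum(1 for s, e in zip(scores, is_error) if s < threshold and not e)
    fn = sum(1 for s, e in zip(scores, is_error) if s >= threshold and e)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, is_error):
    """Pick the candidate threshold that maximizes the F-score for one phone."""
    ordered = sorted(set(scores))
    # Candidates: midpoints between adjacent scores, plus one above the maximum.
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    candidates.append(ordered[-1] + 1.0)
    return max(candidates, key=lambda t: f_score(scores, is_error, t))

# Made-up combined scores and error labels for one phone on development data.
dev_scores = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
dev_errors = [True, True, False, True, False, False]
threshold = best_threshold(dev_scores, dev_errors)
dev_f = f_score(dev_scores, dev_errors, threshold)
```

Running the same search separately for each phone yields the phone-specific thresholds that are then applied unchanged to the test data.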

Data Size of Training/Validation/Test data

Accuracy (F-Score)
- The combined method (confidence + SVM) improved the F-score for all phones except [ih]
- 3% improvement (17% relative) from combining the two scores
- The SVM score showed higher accuracy than the confidence score for [ae, ao]
- The confidence score showed higher accuracy than the SVM score for [ih, dh, f, v]

Discussion
- In general, the combined method (confidence + SVM) improved accuracy
- Systematic application of the landmark-based SVM method
  - Developed classifiers for diverse vowels and consonants
  - Extracted acoustic features from the appropriate time interval (near the landmark) for each phoneme
- Landmark-based SVMs provide information complementary to that of the ASR acoustic models
- The SVM can provide the acoustic characteristics of the incorrect phone; based on this information, feedback about how to correct the error can be provided