Yow-Bang Wang and Lin-shan Lee, "Supervised Detection and Unsupervised Discovery of Pronunciation Error Patterns for Computer Assisted Language Learning," IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 23, NO. 3, MARCH 2015. Presented by Ming-Han Yang, 2015/05/19. 1
The Era of Computer-Assisted Language Learning
• Speaking 2 or more languages is necessary
ü Speech processing techniques can be used to help learn a second language
ü Computer-assisted language learning (CALL)
– Virtual Language Tutor, HAFSS, SpeechRater, Rainbow Rummy game, CALLJ, CHELSEA
• CAPT analyzes the produced utterance to offer feedback to the language learner in the form of quantitative or qualitative evaluations of pronunciation proficiency.
ü Pronunciation is typically assessed with ASR posteriors (e.g., Goodness of Pronunciation, GOP)
ü How can the learner be given feedback on mispronounced sounds? (based on Error Patterns)
• Error Patterns (EPs) are patterns of erroneous pronunciations frequently produced by language learners.
ü Mispronunciations usually occur because the learner's native language (L1) lacks the articulation used in the target language (L2)
ü EPs differ across L1/L2 pairs; with many languages, the combinations become complex
• CAPT involves two aspects
ü EP derivation/discovery: building the EP dictionary, either for a specific L1-L2 pair or for L2 without specifying L1
ü EP detection: deciding whether each voice segment produced by the learner is correct, or belongs to a specific EP in the EP dictionary 2
• EP derivation research falls roughly into two categories
– Looking up an EP dictionary built for specific L1-L2 pairs
– Comparing free-phone ASR output against the phone labels of the corpus
ü Drawback: high cost (labeling is time-consuming and requires linguistic expertise, and ASR output is not always reliable)
• Unsupervised acoustic pattern discovery
ü Spoken term detection, OOV word modeling
ü No longer needs human-annotated data for acoustic model training
– Goal: automatically discover the acoustic patterns in a data set based on signal characteristics
– Proposed to avoid the high cost of labeled training data in conventional HMM training
– Motivation: labeling EPs requires even more expert effort, so such data are harder to obtain and more costly
✡ Major difficulty of EP detection:
– EPs are intrinsically similar to their corresponding canonical pronunciation, and EPs corresponding to the same canonical pronunciation are also intrinsically similar to each other.
– Distinguishing EPs from their corresponding canonical pronunciation is therefore difficult
– Some studies use the log-likelihood ratio or posterior probability to address this problem
• In this paper
– Supervised EP detection & unsupervised EP discovery
– Tested on a corpus from Mandarin Chinese learners 3
Data Collection
• Chinese language teachers from the International Chinese Language Program (ICLP) of NTU
• Corpus:
– 278 ICLP learners from 36 different countries
– Balanced gender
– 30 sentences / learner
– 6~24 characters / sentence
– The recording text prompts were chosen so as to cover as many Chinese syllables and tone patterns as possible 5
Universal Phoneme Posteriorgram (UPP)
• Used as the fundamental frame-level features for the two tasks
– Posterior probabilities have been widely used in CAPT and in unsupervised acoustic pattern discovery
– Much work on pattern discovery has adopted posteriorgrams as the features for further processing. Some work derives the posteriorgram with a GMM trained on the target corpus, and some with an MLP trained on a separate large corpus.
• We train an MLP on a large multi-speaker corpus of mixed languages.
– Posteriorgram feature extractor
• Output layer: softmax
• Output: the set of all phoneme units of the mixed languages
• Input: the MFCC feature vectors for all signal frames of the mixed-language corpora
• This is the UPP feature vector. 6
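The UPP extraction above can be sketched in a few lines of NumPy: an MLP maps each MFCC frame to a softmax distribution over the merged phoneme set. This is a minimal illustration, not the paper's actual network; the one-hidden-layer topology, the tanh nonlinearity, and the random weights are all placeholder assumptions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def upp_posteriorgram(mfcc_frames, W1, b1, W2, b2):
    """Map MFCC frames (T x 39) to a UPP posteriorgram (T x 73).

    Sketch of a one-hidden-layer MLP: tanh hidden layer, softmax
    output over the 73 merged Mandarin/English phoneme units.
    """
    h = np.tanh(mfcc_frames @ W1 + b1)   # hidden activations
    return softmax(h @ W2 + b2)          # per-frame phoneme posteriors

# Toy usage with random (untrained) weights; only the shapes follow the paper.
rng = np.random.default_rng(0)
T, D_in, H, D_out = 5, 39, 64, 73
upp = upp_posteriorgram(rng.normal(size=(T, D_in)),
                        rng.normal(size=(D_in, H)), np.zeros(H),
                        rng.normal(size=(H, D_out)), np.zeros(D_out))
# Each row of `upp` is a probability distribution over the 73 phoneme units.
```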
Universal phoneme posteriorgram (UPP) • We reduce speaker variation while preserving the traits of pronunciation variation, which is the key for both supervised EP detection and unsupervised EP discovery. 7
Supervised Detection of Pronunciation Error Patterns
• We consider the EPs as the pronunciation variations of each corresponding canonical pronunciation.
– Phoneme-level forced alignment can be performed on learners' recordings
– We expand the phoneme sequence of the orthographic transcription of the utterance into a network of canonical pronunciations and EPs
– The surface pronunciations with maximum likelihood are then automatically chosen during forced alignment.
• This framework requires an AM for each EP
– The proportion of mispronunciations in the corpus is too small, and learners' pronunciations are not limited to Chinese phonemes
– Based on the experts' EP descriptions, corresponding (L1) pronunciations are found in other corpora to train the initial EP models
• Cascaded adaptation
– 3 stages: global MLLR, class-based MLLR, Maximum A Posteriori 8
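The expansion of a canonical phoneme sequence into a network of alternatives could be sketched as follows. This is a schematic data structure only; the model names and the EP dictionary contents are hypothetical.

```python
def build_pronunciation_network(phoneme_seq, ep_dict):
    """Expand a canonical phoneme sequence into a network where each
    position allows either the canonical unit or any of its known EPs.

    Returns, per position, the list of alternative model names; forced
    alignment then picks the maximum-likelihood path through this network.
    """
    return [[p] + ep_dict.get(p, []) for p in phoneme_seq]

# Hypothetical example: the phoneme "zh" has two known EPs, "i" has one.
net = build_pronunciation_network(
    ["zh", "i"], {"zh": ["zh_EP1", "zh_EP2"], "i": ["i_EP1"]})
```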
Model Initialization & Adaptation Procedures
• AM adaptation has been widely adopted to alleviate the speaker or environment mismatch between the learner's voice and the training or reference set, for fairer comparison or evaluation
• Main purpose of model adaptation:
– to create AMs which better capture the characteristics of EPs.
• After the EP models are successfully built, we use them to construct the pronunciation network as in Fig. 4 for maximum-likelihood alignment, and the surface pronunciation of learners' recordings can thus be determined.
ü This is the baseline system in the experiments 9
EP Detection Framework Based on the Hybrid Approach
• We propose the following framework for EP detection
– 1st pass of Viterbi decoding: on the learners' utterances, using the EP AM set and the pronunciation network, to obtain more precise time boundaries.
– 2nd pass of Viterbi decoding: performed given the estimated segment boundaries, taking into account the scores from both the EP AMs and the MLP-based EP classifiers
• using 39-dimensional MFCCs, c0 to c12 plus derivatives and accelerations, as the frame-level feature vector
• using MFCCs, the UPPs proposed here, or different variants of UPPs as the frame-level feature vector (assuming UPPs are complementary to MFCCs)
• Score calculation: 10
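The second-pass score combination could look like the sketch below: a weighted combination of the acoustic-model log-likelihood and the EP classifier's log-posterior, with the best-scoring surface pronunciation chosen per segment. The interpolation form and the `weight` parameter are assumptions for illustration, not the paper's exact formula.

```python
import math

def combined_segment_score(am_loglik, clf_posterior, weight=0.5, eps=1e-12):
    """Log-linear combination of the EP acoustic-model score and the
    MLP EP-classifier posterior for one hypothesized surface pronunciation.

    `weight` trades off the two knowledge sources (hypothetical; the
    paper's actual interpolation may differ).
    """
    return (1.0 - weight) * am_loglik + weight * math.log(clf_posterior + eps)

def second_pass_decision(hypotheses, weight=0.5):
    """Given fixed boundaries from pass 1, pick the surface pronunciation
    (canonical or an EP) with the best combined score.

    hypotheses: list of (label, am_loglik, clf_posterior)."""
    return max(hypotheses,
               key=lambda h: combined_segment_score(h[1], h[2], weight))[0]

# Toy example: the classifier boosts EP_1 over the canonical hypothesis.
best = second_pass_decision([("canonical", -120.0, 0.3),
                             ("EP_1",      -118.0, 0.6),
                             ("EP_2",      -125.0, 0.1)])
```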
EP Classifiers & Confidence Estimation 11
Evaluation metrics 12
Unsupervised Discovery of Pronunciation Error Patterns • The task here is to unsupervisedly discover the EPs for each phoneme given a corpus of learner voice data • We can thus focus on one phoneme at a time: each time we are given a set of acoustic segments corresponding to a specific phoneme, and the goal is to divide this set into several clusters, each of which corresponds to an EP 13
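The per-phoneme clustering described above can be sketched with a naive average-linkage hierarchical agglomerative clustering (HAC, the algorithm the conclusion names). This is an O(N³) teaching sketch, not an efficient or faithful reimplementation; the fixed cluster count and the per-segment averaged feature vectors are assumptions.

```python
import numpy as np

def hac(segments, n_clusters):
    """Naive average-linkage HAC.

    segments: (N, D) array, e.g. one averaged UPP vector per acoustic
    segment of the phoneme under analysis (an assumption for this sketch).
    Returns integer cluster labels; each cluster is one candidate EP.
    """
    clusters = [[i] for i in range(len(segments))]
    while len(clusters) > n_clusters:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise Euclidean distance between members
                d = np.mean([np.linalg.norm(segments[i] - segments[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a].extend(clusters.pop(b))
    labels = np.empty(len(segments), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Two well-separated groups of toy "segments" collapse into two clusters.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = hac(pts, 2)
```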
Evaluation metrics 16
Experiment Setup
• We chose the monophone as the phoneme model unit for both Chinese and English
– Chinese phoneme models: ASTMIC Mandarin corpus
• 95 males, 95 females, 200 utterances, 24.6 hours
– English phoneme models: TIMIT corpus
• 462 speakers, 3.9 hours
• Input features:
ü 1) MFCC: 39 parameters, c0 to c12 plus first and second derivatives
ü 2) UPP: 73 posteriors for 73 Mandarin/English mono-phones
ü 3) log-UPP: logarithm of UPP
ü 4) PCA-log-UPP: principal component analysis (PCA) transformed log-UPP
• For PCA we retained 95% of the total variance
• One problem arose when training the binary-MLPs:
– there were far more correctly-pronounced than mispronounced instances
– during training, the correctly-pronounced instances were down-sampled to match the number of mispronounced ones 17
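The PCA-log-UPP step (keeping 95% of the total variance) can be sketched as follows. The function name and the toy data are illustrative; only the "smallest number of components reaching 95% cumulative explained variance" criterion comes from the slide.

```python
import numpy as np

def pca_95(X):
    """PCA keeping the smallest number of components whose cumulative
    explained variance reaches 95% (the criterion used for PCA-log-UPP)."""
    Xc = X - X.mean(axis=0)                   # center the features
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(vals)[::-1]            # sort descending
    vals, vecs = vals[order], vecs[:, order]
    ratio = np.cumsum(vals) / vals.sum()      # cumulative explained variance
    k = int(np.searchsorted(ratio, 0.95)) + 1 # smallest k reaching 95%
    return Xc @ vecs[:, :k], k

# Toy data: the two columns are perfectly correlated, so a single
# principal component already explains all of the variance.
X = np.array([[10.0, 0.1], [-10.0, -0.1], [5.0, 0.05], [-5.0, -0.05]])
Z, k = pca_95(X)
```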
Experiment: K-means
• (table of clustering results; unit: Rand index) 19
Experiment: GMM-MDL
• The table shows the ARI results using GMM-MDL with an automatically estimated number of EPs for each phoneme.
– Performance is worse than K-means, possibly due to the lack of expert knowledge about the number of EPs
• In other words, with UPP or its variants the machine is able to perform slightly finer clustering
– while there are some patterns with only subtle differences that human experts may consider the same.
• In contrast, MFCC resulted in a lower number of clusters
– this further shows the superior discriminating power of UPP in discovering EPs.
(unit: Rand index) 20
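The Adjusted Rand Index used to score these clusterings against the expert labels can be computed directly from the pair-counting definition. A minimal stdlib-only sketch (the function name is mine; the formula is the standard chance-corrected ARI):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two clusterings of the same items:
    pair-counting agreement, corrected for chance (1 = identical,
    ~0 = random agreement)."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:   # degenerate case: trivial clusterings
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Identical partitions give ARI = 1; the label names themselves
# do not matter, only the grouping.
ari = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```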
Automatically Discovered EPs & Analysis
• We analyze a typical set of examples of automatically discovered EPs in the log-UPP space
• Next, we calculate the displacement in each dimension of log-UPP, for each Chinese or English phoneme p
• Note that the displacements are evaluated in each dimension of the log-UPP space
– while each dimension of log-UPP represents the log-posterior probability of the input frame with respect to a certain Chinese or English phoneme
• Between two centroids in log-UPP space, the displacement in each dimension is in fact the (log) ratio of posteriors between the correct and the erroneous pronunciation 21
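This centroid-displacement interpretation can be verified numerically: since dimension p of log-UPP is log P(phoneme p | frame), subtracting centroids gives the log of a (geometric-mean) posterior ratio per phoneme. The toy posteriors below are made up for illustration.

```python
import numpy as np

def centroid_displacement(log_upp_err, log_upp_canon):
    """Per-dimension displacement between the centroids of erroneous and
    canonical segment clusters in log-UPP space.

    Because dimension p of log-UPP is log P(phoneme p | frame), the
    displacement in dimension p is the log of the (geometric-mean)
    posterior ratio between the two pronunciations for phoneme p.
    """
    return log_upp_err.mean(axis=0) - log_upp_canon.mean(axis=0)

# Toy check: if the erroneous cluster assigns e^2 times the posterior to
# phoneme 0, its displacement in that dimension is exactly 2.
canon = np.log(np.array([[0.05, 0.9], [0.05, 0.9]]))
err = canon.copy()
err[:, 0] += 2.0
disp = centroid_displacement(err, canon)
```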
Conclusion
• In this paper we consider both supervised detection and unsupervised discovery of pronunciation EPs in computer-assisted language learning.
• We propose new frameworks for both tasks, with empirical analysis over different approaches.
• Supervised EP detection
– We integrate the scores from both HMM-based EP models and MLP-based EP classifiers with a two-pass Viterbi decoding architecture.
– We use EP classifiers in Viterbi decoding to encompass different aspects of EP detection, while maintaining flexibility for fine tuning.
• Unsupervised EP discovery
– We use the hierarchical agglomerative clustering (HAC) algorithm to divide the speech segments corresponding to a phoneme into clusters.
• In both tasks we use the universal phoneme posteriorgram (UPP)
– derived from a multi-layer perceptron (MLP) trained on corpora of mixed languages, as a set of very useful features to reduce speaker variation while maintaining pronunciation variation across speech frames. 23