Automatic Detection-based Phone Recognition on TIMIT
Based on Chen and Wang in ISCSLP'08 and Interspeech'09
Hung-Shin Lee (李鴻欣)
Institute of Information Science, Academia Sinica
12 July 2011 @ IIS, Academia Sinica

Detection-Based ASR
• Human SR: Knowledge Detection → Integration → Knowledge (Higher Level)
• DB ASR: Detectors → Integrator → Results
  – Detectors: phonological attr., prosodic attr., acoustic attr., …
  – Integrator: HMM, CRF, …
  – Results: phone, syllable, word, sentence, semantic info, …

Phonological Systems
• SPE (Sound Pattern of English)
  – Literature: N. Chomsky & M. Halle, 1968
  – Feature type: production-based, binary
  – Feature number: 13
  – Examples: anterior, nasal, round
• MV (Multi-valued Feature)
  – Literature: S. King, 2000 (?)
  – Feature type: production-based, 2-10 values
  – Feature number: 6
  – Examples: centrality, front-back, manner, phonation, place, roundness
• GP (Government Phonology)
  – Literature: J. Harris, 1994
  – Feature type: sound structure primes, binary
  – Feature number: 11

Phonological Feature Detection (1)
• Input: 9 frames of 13 MFCCs (frames i-4, …, i, …, i+4)
• MLP (Detectors): input layer → hidden layer → posterior probability → quantization (0/1)
• Connections: time-delay, recurrent
• Output: binary feature vectors SPE_14 and GP_11
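The input stacking and output quantization on this slide can be sketched as follows. This is a minimal illustration, not the Nico Toolkit setup: the function names `stack_context` and `quantize`, the edge padding by repetition, and the 0.5 threshold are all assumptions.

```python
def stack_context(frames, left=4, right=4):
    """Concatenate each frame with its neighbours (i-4 .. i+4).

    Assumption: frames beyond the utterance edges are filled by repeating
    the first or last frame.
    """
    n = len(frames)
    stacked = []
    for i in range(n):
        window = []
        for j in range(i - left, i + right + 1):
            k = min(max(j, 0), n - 1)  # clamp index at utterance boundaries
            window.extend(frames[k])
        stacked.append(window)
    return stacked


def quantize(posteriors, threshold=0.5):
    """Turn per-feature posterior probabilities into 0/1 decisions.

    Assumption: a fixed threshold of 0.5.
    """
    return [1 if p >= threshold else 0 for p in posteriors]


# Toy utterance: 6 frames of 13 dummy "MFCC" values.
frames = [[float(t)] * 13 for t in range(6)]
stacked = stack_context(frames)
print(len(stacked), len(stacked[0]))      # 6 117
print(quantize([0.91, 0.08, 0.5, 0.49]))  # [1, 0, 1, 0]
```

With 9 frames of 13 coefficients, each MLP input vector has 9 × 13 = 117 dimensions.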

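Phonological Feature Detection (2)
• 6 MV features, one MLP per feature: MLP (Centrality), MLP (Front-Back), …, MLP (Roundness)
• Input: 9 frames of 13 MFCCs (frames i-4, …, i, …, i+4), time-delay
• Each MLP outputs a one-hot vector (e.g. 0 1 0 0); the six vectors are concatenated into MV_29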
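A rough sketch of how six per-feature outputs could be concatenated into a 29-dimensional vector is shown below. The number of values per feature is not given on the slide, so the sizes in `MV_SIZES` are hypothetical and chosen only so that they sum to 29; picking the most probable value (argmax) as the quantization step is also an assumption.

```python
# Hypothetical value counts per MV feature; only their sum (29) matches the slide.
MV_SIZES = {
    "centrality": 4,
    "front_back": 4,
    "manner": 7,
    "phonation": 3,
    "place": 7,
    "roundness": 4,
}


def one_hot(posteriors):
    """Return a 0/1 vector with a single 1 at the most probable value."""
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    return [1 if i == best else 0 for i in range(len(posteriors))]


def mv_vector(posteriors_by_feature):
    """Concatenate the one-hot outputs of the six per-feature MLPs."""
    vec = []
    for name, size in MV_SIZES.items():
        p = posteriors_by_feature[name]
        if len(p) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(p)}")
        vec.extend(one_hot(p))
    return vec


# Toy posteriors: the second value of every feature is the most probable one.
toy = {}
for name, size in MV_SIZES.items():
    p = [0.1] * size
    p[1] = 0.9
    toy[name] = p

vec = mv_vector(toy)
print(len(vec), sum(vec))  # 29 6
```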

Conditional Random Field (CRF) Integrator
• General Chain CRF
  – Output Y (phone): …, yi-1, yi, …
  – Input X (phonological features): …, xi-1, xi, xi+1, …
  – State feature function and transition feature function
  – λj, μk: feature function weight parameters
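The equation itself did not survive on this slide. For reference, the commonly used linear-chain CRF form that matches the symbols above is sketched here. The names t_j (transition feature function), s_k (state feature function), and Z(X) (normalization term), and the pairing of λ with transitions and μ with states, follow the usual textbook convention and are assumptions about what the slide showed.

```latex
p(Y \mid X) = \frac{1}{Z(X)} \exp\!\Big(
    \sum_{i} \sum_{j} \lambda_j \, t_j(y_{i-1}, y_i, X, i)
  + \sum_{i} \sum_{k} \mu_k \, s_k(y_i, X, i)
\Big),
\qquad
Z(X) = \sum_{Y'} \exp\!\Big(
    \sum_{i} \sum_{j} \lambda_j \, t_j(y'_{i-1}, y'_i, X, i)
  + \sum_{i} \sum_{k} \mu_k \, s_k(y'_i, X, i)
\Big)
```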

CRF Integrator – Training Issues
• Required labels for CRF training
  – Phone: y
  – Phonological features: x
• OT CRF (Oracle-data trained CRF)
  – Training data: phone labels → Mapping (phones → phonological features) → phonological features
• DT CRF (Detected-data trained CRF)
  – Training data: speech → Detectors (MLP) → phonological features (with errors), paired with phone labels
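The oracle-data path can be sketched as below: phone labels are mapped to feature vectors and written one frame per row, with feature columns first and the phone label last, and a blank line between utterances, which is the column layout CRF++ training files use. The mapping table here is a made-up three-feature toy, not the real SPE, GP, or MV table, and the function names are hypothetical.

```python
# Hypothetical phone -> binary feature table (3 made-up features).
PHONE_TO_FEATURES = {
    "h#": (0, 0, 0),
    "n": (1, 1, 0),
    "eh": (0, 0, 1),
}


def oracle_rows(phone_labels):
    """One row per frame: feature columns first, phone label last."""
    rows = []
    for phone in phone_labels:
        feats = PHONE_TO_FEATURES[phone]
        rows.append(" ".join(str(f) for f in feats) + " " + phone)
    return rows


def to_training_text(utterances):
    """Join utterances with a blank line between sequences."""
    return "\n\n".join("\n".join(oracle_rows(u)) for u in utterances) + "\n"


# Two toy utterances given as frame-level phone labels.
text = to_training_text([["h#", "n", "n", "eh"], ["eh", "h#"]])
print(text)
```

For a DT CRF, the feature columns would instead come from the MLP detectors run on the training speech, so the rows carry the same detection errors the CRF will see at test time.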

Experiments
• Corpus: TIMIT
  – No SA1, SA2
  – Training set (3296 utts), Dev set (400 utts)
  – Test set (1344 utts)
• Phone set: TIMIT 61
  – Evaluation: CMU/MIT 39
• Baseline
  – CI-HMM
• Toolkits
  – Nico Toolkit (for MLP), CRF++ (for CRF)

Results (1)

Model: OT CRF, Test: OD (oracle data)
Features   Phone Corr. %   Phone Acc. %
SPE 14     93.28           93.20
GP 11      98.39           98.36
MV 29      88.75           88.56

Model: OT/DT CRF, Test: DD (detected data)
System         Features   Phone Corr. %   Phone Acc. %
HMM-baseline   -          69.02           63.45
OT CRF         SPE 14     66.19           29.68
OT CRF         GP 11      69.03           31.38
OT CRF         MV 29      59.24           30.33
DT CRF         SPE 14     56.56           55.27
DT CRF         GP 11      55.74           54.53
DT CRF         MV 29      51.84           50.68
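The slides do not define Phone Corr. and Phone Acc. Assuming the usual HTK-style definitions, Corr = (N − S − D) / N and Acc = (N − S − D − I) / N, where N is the number of reference phones and S, D, I are substitutions, deletions, and insertions, the two measures can be computed as sketched below. The unit edit costs are a simplification; scoring tools typically use weighted costs.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of one minimum-edit alignment."""
    # Each cell holds (total_errors, subs, dels, ins).
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            c, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, s, d, n)          # match
            else:
                diag = (c + 1, s + 1, d, n)  # substitution
            c, s, d, n = prev[j]
            dele = (c + 1, s, d + 1, n)      # deletion
            c, s, d, n = cur[j - 1]
            ins = (c + 1, s, d, n + 1)       # insertion
            cur.append(min(diag, dele, ins, key=lambda cell: cell[0]))
        prev = cur
    _, s, d, n = prev[-1]
    return s, d, n


def corr_acc(ref, hyp):
    """Percent correct and percent accuracy for one reference/hypothesis pair."""
    s, d, i = align_counts(ref, hyp)
    n = len(ref)
    corr = 100.0 * (n - s - d) / n
    acc = 100.0 * (n - s - d - i) / n
    return corr, acc


ref = ["h#", "n", "eh", "s", "t"]
hyp = ["h#", "n", "n", "eh", "s", "s", "t"]  # two inserted phones
print(corr_acc(ref, hyp))  # (100.0, 60.0)
```

Under these definitions the gap between Corr. and Acc. equals the insertion rate, so a system with high Corr. but low Acc. (such as OT CRF on detected features, 66.19 vs. 29.68 for SPE 14) is inserting many extra phones.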

Results (2) System Fusion
Methods                   # System   Phone Corr. (%)   Phone Acc. (%)
HMM baseline              1          69.02             63.45
OT: SPE+GP+MV             3          61.97             60.65
DT: SPE+GP+MV             3          52.90             52.06
OT+DT: SPE+GP+MV          6          60.81             59.20
OT: SPE+GP+MV +HMM        4          65.53             64.31
DT: SPE+GP+MV +HMM        4          59.57             58.64
OT+DT: SPE+GP+MV +HMM     7          64.22             62.59

System Fusion with CRF
• Input X (…, xi-1, xi, xi+1, …): outputs of the SPE Sys., MV Sys., GP Sys., and HMM Sys.
• Output Y (…, yi-1, yi, …): phone sequence, i.e. the combined results (phone)

Two Types of AFDT Imperfection
• Phone: h# n eh ow kcl k w eh ae eh s tcl t ix n
• [Figure: AF tracks AF(A) and AF(A') shown against the phone sequence]
• AF asynchrony
• AFDT errors

CRF Training (1)
• Oracle Data Training: Phone (y) over time t → Mapping Table → AFs (x)
• Detected Data Training: Phone (y) over time t, paired with AFs (x) from the AFDT (with detected errors)

CRF Training (2)
• Aligned Data Training
  – Phone (y) over time t, with the AF sequence
  – AFs (x) from the AFDT

Results (3)
Condition     System   Phone Corr. (%)   Phone Acc. (%)
Upper Bound   OT CRF   98.31             98.28
Upper Bound   AT CRF   71.49             70.31
Real Case     OT CRF   70.55             34.38
Real Case     DT CRF   57.30             56.14
Real Case     AT CRF   64.87             62.32
• Accuracy drops by 27.97% absolute when AF asynchrony is introduced (98.28 → 70.31)
• Detection errors cause a further 7.99% absolute accuracy drop (70.31 → 62.32)
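The two accuracy drops quoted on this slide follow directly from the Phone Acc. figures; a quick check, using only the numbers reported above:

```python
# Phone accuracy (%) as reported on the Results (3) slide.
acc_ot_upper = 98.28  # OT CRF, upper bound
acc_at_upper = 70.31  # AT CRF, upper bound
acc_at_real = 62.32   # AT CRF, real case

asynchrony_drop = round(acc_ot_upper - acc_at_upper, 2)
detection_drop = round(acc_at_upper - acc_at_real, 2)

print(asynchrony_drop)  # 27.97
print(detection_drop)   # 7.99
```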

AF Asynchrony Compensation
• AF asynchrony is caused by context variation
• We can reduce AF asynchrony by letting our systems learn context variation directly
  – Long-term information
• [Figure: 23-dim Mel features over a 310 ms window are split into Left Context and Right Context; each passes through Windows + DCTs and an MLP, followed by a merging MLP; labeled dimensions: 72 Dim, 144 Dim, 72 Dim]
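The "Windows + DCTs" step over left and right contexts can be sketched roughly as follows. Several details are assumptions not stated on the slide: a 10 ms frame shift (so 310 ms is 31 frames), halves that share the centre frame, a full Hamming window on each half, and 11 DCT coefficients kept per Mel band. The function names are hypothetical.

```python
import math


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def dct2(x, num_coeffs):
    """Unnormalised DCT-II of x, keeping the first num_coeffs coefficients."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (t + 0.5) * c / n) for t in range(n))
        for c in range(num_coeffs)
    ]


def half_context_features(window, num_coeffs=11):
    """window: frames x Mel bands -> (left_vector, right_vector).

    Each band's temporal trajectory in a half is windowed and compressed by a DCT.
    """
    centre = len(window) // 2
    halves = (window[: centre + 1], window[centre:])
    out = []
    for half in halves:
        w = hamming(len(half))
        vec = []
        for band in range(len(half[0])):
            trajectory = [half[t][band] * w[t] for t in range(len(half))]
            vec.extend(dct2(trajectory, num_coeffs))
        out.append(vec)
    return tuple(out)


# Toy input: 31 frames (310 ms at an assumed 10 ms shift) of 23 Mel bands.
window = [[1.0] * 23 for _ in range(31)]
left, right = half_context_features(window)
print(len(left), len(right))  # 253 253
```

Each half-context vector would then be the input of its own MLP, which lets the detector see roughly 150 ms of context on each side instead of the 9-frame MFCC window.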

Results (4)
Test Data Type         System                            Corr    Acc
-                      CI-HMM                            69.02   63.45
-                      CD-HMM                            75.76   65.78
Detected (real case)   OT CRF (± 3)                      75.24   47.97
Detected (real case)   Long Term AFDT + DT CRF (± 3)     64.58   63.12
Ideal (upper bound)    Long Term AFDT + AT CRF           74.96   73.64
Ideal (upper bound)    MFCC AFDT + AT CRF (± 3)          72.87   71.62
Ideal (upper bound)    Long Term AFDT + AT CRF (± 3)     76.83   74.97
Detected (real case)   Long Term AFDT + AT CRF           69.83   66.97
Detected (real case)   MFCC AFDT + AT CRF (± 3)          66.21   63.16
Detected (real case)   Long Term AFDT + AT CRF (± 3)     71.01   67.67

Conclusions
• A well-designed phonological feature system is important
  – AF asynchrony minimization training and AF-phone synchronization could also be investigated
• Oracle-trained CRF is able to retrieve more phonological information from speech
  – High phone correct rate (but sensitive to detection errors)
  – Helpful for system combination
• Detection-based ASR is promising
  – The front-end detector is a major issue

AF and Phone Alignment Using AFDT
• [Figure: phone sequence and AF sequence aligned along time axes t]