Automatic speech recognition on the articulation index corpus

Guy J. Brown and Amy Beeston
Department of Computer Science, University of Sheffield
EPSRC Perceptual Constancy Steering Group Meeting | 19th May 2010

Aims
• The eventual aim is to develop a ‘perceptual constancy’ front end for automatic speech recognition (ASR).
• It should be compatible with the Watkins et al. findings but also validated on a ‘real world’ ASR task:
  – wider vocabulary
  – range of reverberation conditions
  – variety of speech contexts
  – naturalistic speech, rather than interpolated stimuli
  – consider phonetic confusions in reverberation in general
• Initial ASR studies use the articulation index corpus.
• Aim to compare human performance (Amy’s experiment) and machine performance on the same task.

Articulation index (AI) corpus
• Recorded by Jonathan Wright (University of Pennsylvania)
• Intended for speech recognition in noise experiments similar to those of Fletcher
• Suggested to us by Hynek Hermansky; utterances are similar to those used by Watkins:
  – American English
  – Target syllables are mostly nonsense, but some correspond to real words (including “sir” and “stir”)
  – Target syllables are embedded in a context sentence drawn from a limited vocabulary

Grammar for Amy’s subset of AI corpus

$cw1 = YOU | I | THEY | NO-ONE | WE | ANYONE | EVERYONE | SOMEONE | PEOPLE;

$cw2 = SPEAK | SAY | USE | THINK | SENSE | ELICIT | WITNESS | DESCRIBE |
       SPELL | READ | STUDY | REPEAT | RECALL | REPORT | PROPOSE | EVOKE |
       UTTER | HEAR | PONDER | WATCH | SAW | REMEMBER | DETECT | SAID |
       REVIEW | PRONOUNCE | RECORD | WRITE | ATTEMPT | ECHO | CHECK | NOTICE |
       PROMPT | DETERMINE | UNDERSTAND | EXAMINE | DISTINGUISH | PERCEIVE |
       TRY | VIEW | SEE | UTILIZE | IMAGINE | NOTE | SUGGEST | RECOGNIZE |
       OBSERVE | SHOW | MONITOR | PRODUCE;

$cw3 = ONLY | STEADILY | EVENLY | ALWAYS | NINTH | FLUENTLY | PROPERLY |
       EASILY | ANYWAY | NIGHTLY | NOW | SOMETIME | DAILY | CLEARLY | WISELY |
       SURELY | FIFTH | PRECISELY | USUALLY | TODAY | MONTHLY | WEEKLY | MORE |
       TYPICALLY | NEATLY | TENTH | EIGHTH | FIRST | AGAIN | SIXTH | THIRD |
       SEVENTH | OFTEN | SECOND | HAPPILY | TWICE | WELL | GLADLY | YEARLY |
       NICELY | FOURTH | ENTIRELY | HOURLY;

$test = SIR | STIR | SPUR | SKUR;

( !ENTER $cw1 $cw2 $test $cw3 !EXIT )
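To make the sentence structure concrete, here is a minimal Python sketch that reads a grammar of this flat form and enumerates the sentences it allows. It is not HTK's own parser: the function names are hypothetical, the regular expressions only handle this simple "classes plus one sequence" layout, and the word lists are abbreviated to keep the example short.

```python
import itertools
import re

# Abbreviated copy of the grammar: only the first few alternatives of each
# context-word class are kept (an assumption made to keep the example short).
GRAMMAR = """
$cw1 = YOU | I | THEY;
$cw2 = SPEAK | SAY | USE | THINK;
$cw3 = ONLY | STEADILY | EVENLY;
$test = SIR | STIR | SPUR | SKUR;
( !ENTER $cw1 $cw2 $test $cw3 !EXIT )
"""


def parse_grammar(text):
    """Return (classes, sequence) for a flat grammar of the form shown above."""
    classes = {}
    # Each definition looks like "$name = A | B | C;".
    for name, body in re.findall(r"\$(\w+)\s*=\s*([^;]+);", text):
        classes[name] = [word.strip() for word in body.split("|")]
    # The bracketed network lists the classes in sentence order.
    network = re.search(r"\(([^)]*)\)", text).group(1)
    sequence = re.findall(r"\$(\w+)", network)
    return classes, sequence


def enumerate_sentences(classes, sequence):
    """List every word sequence the grammar allows, in definition order."""
    slots = [classes[name] for name in sequence]
    return [" ".join(words) for words in itertools.product(*slots)]


classes, sequence = parse_grammar(GRAMMAR)
sentences = enumerate_sentences(classes, sequence)
```

With the abbreviated lists this gives 3 × 4 × 4 × 3 sentences. Counting the alternatives in the full grammar (9, 50 and 43 context words and 4 test words) gives 9 × 50 × 4 × 43 = 77,400 possible sentences.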

ASR system
• HMM-based phone recogniser
  – implemented in HTK
  – monophone models
  – 20 Gaussian mixtures per state
  – adapted from scripts by Tony Robinson/Dan Ellis
• Bootstrapped by training on TIMIT, then a further 10-12 iterations of embedded training on the AI corpus
• Word-level transcripts in the AI corpus expanded to phones using the CMU pronunciation dictionary
• All of the AI corpus used for training, except the 80 utterances in Amy’s experimental stimuli
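The transcript expansion step can be sketched as follows. This is an illustration only, not the scripts used for the talk: the function names are hypothetical, the dictionary is a tiny excerpt in CMU dictionary layout (word followed by phones), the entry for the nonsense word SKUR is an assumed hand-written pronunciation, and stripping the stress digits and padding with a "sil" label are assumptions about how monophone labels might be prepared.

```python
import re

# Tiny excerpt in CMU-dictionary layout. SKUR is a nonsense syllable, so its
# pronunciation here is a hand-written assumption.
DICT_TEXT = """\
YOU  Y UW1
SAY  S EY1
NOW  N AW1
SIR  S ER1
STIR  S T ER1
SPUR  S P ER1
SKUR  S K ER1
"""


def load_dictionary(text):
    """Map each word to its phone list, dropping stress digits (assumption)."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue  # skip blank lines and dictionary comments
        word, *phones = line.split()
        lexicon[word] = [re.sub(r"\d", "", phone) for phone in phones]
    return lexicon


def expand_to_phones(words, lexicon, silence="sil"):
    """Expand a word-level transcript to a phone sequence with silence padding."""
    phones = [silence]
    for word in words:
        if word not in lexicon:
            raise KeyError(f"no pronunciation for {word!r}")
        phones.extend(lexicon[word])
    phones.append(silence)
    return phones


lexicon = load_dictionary(DICT_TEXT)
phone_transcript = expand_to_phones("YOU SAY STIR NOW".split(), lexicon)
```

Raising an error for out-of-dictionary words, rather than skipping them, makes missing pronunciations for nonsense syllables visible before training starts.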

MFCC features
• Baseline system trained using mel-frequency cepstral coefficients (MFCCs)
  – 12 MFCCs + energy + delta + acceleration (total 39 features per frame)
  – cepstral mean normalization
• Baseline system performance on Amy’s clean subset of AI corpus (80 utterances, no reverberation):
  – 98.75% context words correct
  – 96.25% test words correct
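The 39-dimensional vector is 13 static values (12 MFCCs plus energy) with their delta and acceleration coefficients appended. The sketch below shows one common way to compute these, using the standard regression formula for deltas with a window of two frames on each side and edge frames repeated. The window size, the edge handling and the function names are assumptions, not settings taken from the talk.

```python
def cepstral_mean_normalise(frames):
    """Subtract the per-utterance mean from every coefficient."""
    count = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / count for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


def deltas(frames, window=2):
    """Regression-based time derivative; edge frames are repeated."""
    count = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(theta * theta for theta in range(1, window + 1))
    result = []
    for t in range(count):
        row = []
        for d in range(dim):
            num = 0.0
            for theta in range(1, window + 1):
                ahead = frames[min(t + theta, count - 1)][d]
                behind = frames[max(t - theta, 0)][d]
                num += theta * (ahead - behind)
            row.append(num / denom)
        result.append(row)
    return result


def add_dynamic_features(static):
    """Append delta and acceleration coefficients to each static frame."""
    first = deltas(static)
    second = deltas(first)
    return [s + d1 + d2 for s, d1, d2 in zip(static, first, second)]


# Synthetic input: 10 frames of 13 values, each coefficient rising linearly.
static = [[float(t * (d + 1)) for d in range(13)] for t in range(10)]
features = add_dynamic_features(cepstral_mean_normalise(static))
```

For a linearly rising coefficient the delta away from the edges equals the slope and the acceleration is zero, which gives a quick sanity check on the implementation.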

Amy’s experiment
• Amy’s first experiment used 80 utterances
  – 20 instances each of “sir”, “skur”, “spur” and “stir” test words
• Overall confusion rate was controlled by lowpass filtering at 1, 1.5, 2, 3 and 4 kHz
• Same reverberation conditions as in Watkins et al. experiments:

                    Test 0.32 m    Test 10 m
    Context 0.32 m  near-near      near-far
    Context 10 m    far-near       far-far

• Stimuli presented to the ASR system as in Amy’s human studies
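The condition labels follow a context-test naming pattern, which a few lines of Python can enumerate. This sketch assumes that every reverberation condition is crossed with every lowpass cutoff; the slides do not state this explicitly, and the function names are hypothetical.

```python
from itertools import product

# Distances in metres for the two reverberation settings on the slide.
DISTANCES = {"near": 0.32, "far": 10.0}
CUTOFFS_KHZ = [1, 1.5, 2, 3, 4]


def reverberation_conditions():
    """Labels are 'context-test', e.g. 'far-near' = far context, near test word."""
    return [f"{context}-{test}" for context, test in product(DISTANCES, repeat=2)]


def all_conditions():
    """Cross every reverberation condition with every cutoff (assumed design)."""
    return [(label, cutoff)
            for label in reverberation_conditions()
            for cutoff in CUTOFFS_KHZ]


conditions = all_conditions()
```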

Baseline ASR: context words
• Performance falls as the cutoff frequency decreases
• Performance falls as level of reverberation increases
• Near context substantially better than far context at most cutoffs

Baseline ASR: test words
• No particular pattern of confusions in the 2 kHz near-near case, but more frequent skur/spur/stir errors

Baseline ASR: human comparison
• Data for 4 kHz cutoff
• Even mild reverberation (near) causes substantial errors in the baseline ASR system
• Human listeners exhibit compensation in the AIC task; the baseline ASR system doesn’t (as expected)
[Figure: percentage error for near and far test words; panels: baseline ASR system, human data (20 subjects)]

Training on auditory features
• 80 channels between 100 Hz and 8 kHz
• 15 DCT coefficients + delta + acceleration (45 features per frame)
• Efferent attenuation set to zero for initial tests
• Performance of auditory features on Amy’s clean subset of AI corpus (80 utterances, no reverberation):
  – 95% context words correct
  – 97.5% test words correct
[Diagram labels: Stimulus, Auditory periphery, OME, DRNL, ATT, Hair cell, Frame & DCT, Recogniser, Efferent system]
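The "Frame & DCT" stage reduces each 80-channel frame to 15 coefficients. Appending delta and acceleration terms then gives 15 × 3 = 45 features. Below is a minimal sketch of that reduction using an orthonormal DCT-II. The scaling convention, the choice to keep coefficients 0 to 14, and the function name are assumptions; the slides do not specify them.

```python
import math


def dct_ii(frame, num_coeffs):
    """First `num_coeffs` coefficients of an orthonormal DCT-II of one frame."""
    size = len(frame)
    coeffs = []
    for k in range(num_coeffs):
        scale = math.sqrt(1.0 / size) if k == 0 else math.sqrt(2.0 / size)
        total = sum(value * math.cos(math.pi / size * (i + 0.5) * k)
                    for i, value in enumerate(frame))
        coeffs.append(scale * total)
    return coeffs


# One synthetic frame: 80 channels with identical output.
flat_frame = [1.0] * 80
coeffs = dct_ii(flat_frame, 15)
```

A frame that is flat across channels puts all of its energy into coefficient 0, so the remaining 14 coefficients are zero up to rounding error.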

Auditory features: context words
• Take a big hit in performance using auditory features
  – saturation in the auditory nerve (AN) is likely to be an issue
  – mean normalization
• Performance falls sharply with decreasing cutoff
• As expected, best performance in the least reverberated conditions

Auditory features: test words
[Figure: test word results using auditory features]

Effect of efferent suppression
• Not yet used full closed-loop model in ASR experiments
• Indication of likely performance obtained by increasing efferent attenuation in ‘far’ context conditions
[Diagram labels: Stimulus, Auditory periphery, OME, DRNL, ATT, Hair cell, Frame & DCT, Recogniser, Efferent system]
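The open-loop setup can be pictured as choosing a fixed attenuation from the context condition instead of deriving it from the signal. The sketch below uses the 10 dB value quoted in the later comparison slides. Treating the attenuation as a simple amplitude gain applied equally to all samples is an assumption about where and how the model applies it, and the function names are hypothetical.

```python
def attenuation_db_for_context(context, far_attenuation_db=10.0):
    """Open loop: attenuation is set by hand from the context condition."""
    if context not in ("near", "far"):
        raise ValueError(f"unknown context {context!r}")
    return far_attenuation_db if context == "far" else 0.0


def db_to_gain(attenuation_db):
    """Convert an attenuation in dB to a linear amplitude gain."""
    return 10.0 ** (-attenuation_db / 20.0)


def attenuate(samples, attenuation_db):
    """Apply the same attenuation to every sample (assumed amplitude scaling)."""
    gain = db_to_gain(attenuation_db)
    return [gain * sample for sample in samples]


far_gain = db_to_gain(attenuation_db_for_context("far"))
near_gain = db_to_gain(attenuation_db_for_context("near"))
```

A closed-loop version would replace `attenuation_db_for_context` with an estimate computed from the signal itself, which is the part the slides say has not yet been used.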

Auditory features: human comparison
• 4 kHz cutoff
• Efferent suppression effective for mild reverberation
• Detrimental to far test word
• Currently unable to model human data, but:
  – not closed loop
  – same efferent attenuation in all bands
[Figure: near and far test words; panels: no efferent suppression, 10 dB efferent suppression, human data (20 subjects)]

Confusion analysis: far-near condition
• Without efferent attenuation “skur”, “spur” and “stir” are frequently confused as “sir”
• These confusions are reduced by more than half when 10 dB of efferent attenuation is applied

far-near, 0 dB attenuation (rows: test word presented; columns: word recognised)

            SIR  SKUR  SPUR  STIR
    SIR      12     3     3     2
    SKUR     11     5     2     2
    SPUR     10     2     4     4
    STIR      7     1     7     5

far-near, 10 dB attenuation

            SIR  SKUR  SPUR  STIR
    SIR       9     2     2     7
    SKUR      3    10     4     3
    SPUR      5     4     4     7
    STIR      3     2     3    12
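The "more than half" claim can be checked directly from the far-near counts on this slide. The sketch below assumes that rows are the presented test word and columns the recognised word, which is consistent with each row summing to the 20 instances per word. The function name is hypothetical.

```python
WORDS = ["SIR", "SKUR", "SPUR", "STIR"]

# Far-near confusion counts from the slide; rows = presented, columns = recognised.
FAR_NEAR = {
    "0 dB": [[12, 3, 3, 2], [11, 5, 2, 2], [10, 2, 4, 4], [7, 1, 7, 5]],
    "10 dB": [[9, 2, 2, 7], [3, 10, 4, 3], [5, 4, 4, 7], [3, 2, 3, 12]],
}


def confused_as_sir(matrix):
    """Count SKUR, SPUR and STIR tokens that were recognised as SIR."""
    sir = WORDS.index("SIR")
    return sum(row[sir] for i, row in enumerate(matrix) if i != sir)


without_attenuation = confused_as_sir(FAR_NEAR["0 dB"])
with_attenuation = confused_as_sir(FAR_NEAR["10 dB"])
```

This gives 28 such confusions at 0 dB and 11 at 10 dB, out of 60 presented tokens in each case.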

Confusion analysis: far-far condition
• Again “skur”, “spur” and “stir” are commonly reported as “sir”
• These confusions are somewhat reduced by 10 dB efferent attenuation, but:
  – gain is outweighed by more frequent “skur”, “spur”, “stir” confusions
• Efferent attenuation recovers the dip in the temporal envelope but not cues to /k/, /p/ and /t/

far-far, 0 dB attenuation (rows: test word presented; columns: word recognised)

            SIR  SKUR  SPUR  STIR
    SIR      18     0     1     1
    SKUR     14     1     3     2
    SPUR     12     1     5     2
    STIR     12     0     3     5

far-far, 10 dB attenuation

            SIR  SKUR  SPUR  STIR
    SIR      13     2     1     4
    SKUR     11     3     1     5
    SPUR      6     5     2     7
    STIR     10     2     1     7
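The trade-off described for the far-far condition can be broken down from the counts on this slide. As before, the sketch assumes rows are the presented word and columns the recognised word, and the function name is hypothetical.

```python
WORDS = ["SIR", "SKUR", "SPUR", "STIR"]

# Far-far confusion counts from the slide; rows = presented, columns = recognised.
FAR_FAR = {
    "0 dB": [[18, 0, 1, 1], [14, 1, 3, 2], [12, 1, 5, 2], [12, 0, 3, 5]],
    "10 dB": [[13, 2, 1, 4], [11, 3, 1, 5], [6, 5, 2, 7], [10, 2, 1, 7]],
}


def error_breakdown(matrix):
    """Split results into correct, stop-word-heard-as-SIR, and stop-word swaps."""
    sir = WORDS.index("SIR")
    size = len(WORDS)
    correct = sum(matrix[i][i] for i in range(size))
    as_sir = sum(matrix[i][sir] for i in range(size) if i != sir)
    among_stops = sum(matrix[i][j]
                      for i in range(size) for j in range(size)
                      if i != sir and j != sir and i != j)
    return {"correct": correct, "as_sir": as_sir, "among_stops": among_stops}


before = error_breakdown(FAR_FAR["0 dB"])
after = error_breakdown(FAR_FAR["10 dB"])
```

Confusions as "sir" fall from 38 to 27, but confusions among "skur", "spur" and "stir" rise from 11 to 21, and the number of correct tokens out of 80 drops from 29 to 25.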

Summary
• ASR framework in place for the AI corpus experiments
• We can compare human and machine performance on the AIC task
• Reasonable performance from baseline MFCC system
• Need to address shortfall in performance when using auditory features
• Haven’t yet tried the full within-channel model as a front end