Voice DSP Processing I
Yaakov J. Stein, Chief Scientist, RAD Data Communications

Voice DSP
  • Part 1: Speech biology and what we can learn from it
  • Part 2: Speech DSP (AGC, VAD, features, echo cancellation)
  • Part 3: Speech compression techniques
  • Part 4: Speech recognition

Voice DSP - Part 1a: Speech production mechanisms
  • Biology of the vocal tract
  • Pitch and formants
  • Sonograms
  • The basic LPC model
  • The cepstrum
  • LPC cepstrum
  • Line spectral pairs

Voice DSP - Part 1b: Speech perception mechanisms
  • Biology of the ear
  • Psychophysical phenomena
    – Weber’s law
    – Fechner’s law
    – Changes
    – Masking

Voice DSP - Part 1c: Speech quality measurement
  • Subjective measurement
    – MOS and its variants
  • Objective measurement
    – PSQM, PESQ

Voice DSP - Part 2a: Basic speech processing
  • Simplest processing
    – AGC
    – Simplistic VAD
  • More complex processing
    – pitch tracking
    – formant tracking
    – U/V decision
    – computing LPC and other features

Voice DSP - Part 2b: Echo cancellation
  • Sources of echo (acoustic vs. line echo)
  • Echo suppression and cancellation
  • Adaptive noise cancellation
  • The LMS algorithm
  • Other adaptive algorithms
  • The standard LEC

Voice DSP - Part 3: Speech compression techniques
  • PCM
  • ADPCM
  • SBC
  • VQ
  • ABS-CELP
  • MBE
  • MELP
  • STC
  • Waveform Interpolation

Voice DSP - Part 4: Speech Recognition
  • Tasks
  • ASR engine
  • Phonetic labeling
  • DTW
  • HMM
  • State-of-the-art

Voice DSP - Part 1a: Speech production mechanisms

Speech Production Organs
[Figure: anatomical diagram labeled Brain, Hard Palate, Nasal cavity, Velum, Teeth, Lips, Mouth cavity, Uvula, Pharynx, Tongue, Esophagus, Larynx, Trachea, Lungs]

Speech Production Organs - cont.
  • Air from the lungs is exhaled into the trachea (windpipe)
  • Vocal cords (folds) in the larynx can produce periodic pulses of air by opening and closing (glottis)
  • Throat (pharynx), mouth, tongue and nasal cavity modify the air flow
  • Teeth and lips can introduce turbulence
  • The epiglottis separates the esophagus (food pipe) from the trachea

Voiced vs. Unvoiced Speech
  • When the vocal cords are held open, air flows unimpeded
  • When the laryngeal muscles stretch them, glottal flow is in bursts
  • When glottal flow is periodic, the speech is called voiced
    – The basic interval/frequency is called the pitch
    – Pitch period is usually between 2.5 and 20 milliseconds
    – Pitch frequency is between 50 and 400 Hz
    – You can feel the vibration of the larynx
  • Vowels are always voiced (unless whispered)
  • Consonants come in voiced/unvoiced pairs, for example:
    B/P  G/K  D/T  V/F  J/CH  TH/th  W/WH  Z/S  ZH/SH
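
The period and frequency ranges quoted above are two views of the same quantity, since frequency is the reciprocal of the period. A minimal sketch of the conversion follows; the function names and the 8 kHz sampling rate are illustrative assumptions, not taken from the slides.

```python
def pitch_frequency_hz(period_ms: float) -> float:
    """Convert a pitch period in milliseconds to a pitch frequency in Hz."""
    return 1000.0 / period_ms


def pitch_period_samples(freq_hz: float, sample_rate_hz: float = 8000.0) -> float:
    """Pitch period in samples; the 8 kHz default is an assumed telephony rate."""
    return sample_rate_hz / freq_hz


print(pitch_frequency_hz(2.5))      # shortest period in the slide's range
print(pitch_frequency_hz(20.0))     # longest period in the slide's range
print(pitch_period_samples(100.0))  # period of a 100 Hz voice, in samples
```

The two ends of the period range map onto the two ends of the frequency range, which is why a pitch tracker can search in either domain.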

Excitation spectra
  • Voiced speech: the pulse train is not sinusoidal, so it is harmonic rich
  • Unvoiced speech: common assumption is white noise
[Figure: sketches of the two excitation spectra against frequency f]

Effect of vocal tract
  • Mouth and nasal cavities have resonances
  • Resonant frequencies depend on geometry

Effect of vocal tract - cont.
  • Sound energy at these resonant frequencies is amplified
  • Frequencies of peak amplification are called formants
[Figure: frequency response with formant peaks F1, F2, F3, F4; resulting spectra of voiced speech (pitch F0) and unvoiced speech; horizontal axis: frequency]

Formant frequencies
  • Peterson-Barney data (note the “vowel triangle”)

Sonograms

Cylinder model(s)
  • Rough model of throat and mouth cavity
    [Figure labels: Voice Excitation, open]
  • With nasal cavity
    [Figure labels: Voice Excitation, open/closed]

Phonemes
  • The smallest acoustic unit that can change meaning
  • Different languages have different phoneme sets
  • Types (notations: phonetic, CVC, ARPABET):
    – Vowels
      • front (heed, hid, head, hat)
      • mid (hot, heard, hut, thought)
      • back (boot, book, boat)
      • diphthongs (buy, boy, down, date)
    – Semivowels
      • liquids (l, r)
      • glides (w, y)

Phonemes - cont.
    – Consonants
      • nasals (murmurs) (n, m, ng)
      • stops (plosives)
        – voiced (b, d, g)
        – unvoiced (p, t, k)
      • fricatives
        – voiced (v, that, z, zh)
        – unvoiced (f, think, s, sh)
      • affricates (j, ch)
      • whispers (h, what)
      • gutturals (ע, ח)
      • clicks, etc.

Basic LPC Model
[Block diagram: Pulse Generator and White Noise Generator feed the U/V Switch, whose output drives the LPC synthesis filter]

Basic LPC Model - cont.
  • Pulse generator produces a harmonic-rich periodic impulse train (with pitch period and gain)
  • White noise generator produces a random signal (with gain)
  • U/V switch chooses between voiced and unvoiced speech
  • LPC filter amplifies formant frequencies (all-pole or AR IIR filter)
  • The output will resemble true speech to within residual error
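
The model described above can be sketched in a few lines. This is only an illustration under stated assumptions: the single 500 Hz resonance, the pole radius, the 8 kHz sampling rate and all function names are hypothetical choices, whereas a real LPC synthesizer would use around ten coefficients estimated from speech.

```python
import math
import random


def excitation(n_samples, voiced, pitch_period=80, gain=1.0, seed=0):
    """U/V switch: impulse train when voiced, white noise when unvoiced."""
    if voiced:
        return [gain if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def lpc_synthesis(excite, a):
    """All-pole (AR) filter: s[n] = e[n] + sum_k a[k] * s[n-1-k]."""
    out = []
    for n, e in enumerate(excite):
        acc = e
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc += ak * out[n - 1 - k]
        out.append(acc)
    return out


def resonator(freq_hz, radius, sample_rate_hz=8000.0):
    """Coefficients of one pole pair, i.e. a single formant at freq_hz."""
    theta = 2.0 * math.pi * freq_hz / sample_rate_hz
    return [2.0 * radius * math.cos(theta), -radius * radius]


a = resonator(500.0, 0.95)  # hypothetical single formant, assumed values
voiced = lpc_synthesis(excitation(400, voiced=True), a)
unvoiced = lpc_synthesis(excitation(400, voiced=False), a)
```

Both outputs pass through the same filter, so they share the same formant; only the excitation differs, which is the point of the U/V switch.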

Cepstrum
Another way of thinking about the LPC model
  • The speech spectrum is obtained from multiplication:
    spectrum of (pitch) pulse train times vocal tract (formant) frequency response
  • So the log of this spectrum is obtained from addition:
    log spectrum of pitch train plus log of vocal tract frequency response
  • Consider this log spectrum to be the spectrum of some new signal, called the cepstrum
  • The cepstrum is the sum of two components: excitation plus vocal tract
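
The additivity claimed above can be checked numerically. The sketch below computes a real cepstrum (inverse DFT of the log magnitude spectrum) with a naive DFT; the frame length, the decaying pulse train and the one-pole "vocal tract" are toy assumptions chosen so that no spectral value is exactly zero, and circular convolution is used so that the spectra multiply exactly.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform, adequate for a short frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log magnitude spectrum of the frame."""
    n = len(frame)
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    # log_mag is real and even for a real frame, so the inverse DFT is real.
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n)
                for k in range(n)).real / n
            for q in range(n)]


def circular_convolve(x, h):
    """Circular convolution, so that DFT(result) = DFT(x) * DFT(h) exactly."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]


N, PITCH = 64, 16                                    # assumed toy sizes
excite = [0.8 ** (n // PITCH) if n % PITCH == 0 else 0.0 for n in range(N)]
tract = [0.6 ** n for n in range(N)]                 # toy one-pole response
speech = circular_convolve(excite, tract)

cep_e = real_cepstrum(excite)
cep_h = real_cepstrum(tract)
cep_s = real_cepstrum(speech)
pitch_peak = max(range(8, 32), key=lambda q: cep_s[q])  # quefrency of the peak
```

Here `cep_s` equals `cep_e + cep_h` sample by sample, and the excitation shows up as a peak at the pitch period while the vocal tract contribution is concentrated at low quefrency.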

Cepstrum - cont.
Cepstral processing has its own language
  • Cepstrum (note that this is really a signal in the time domain)
  • Quefrency (its units are seconds)
  • Liftering (filtering)
  • Alanysis
  • Saphe
Several variants:
  • complex cepstrum
  • power cepstrum
  • LPC cepstrum

Do we know enough?
  • The standard speech model (LPC), used by most speech processing/compression/recognition systems, is a model of speech production
  • Unfortunately, speech production and speech perception systems are not matched
  • So next we’ll look at the biology of the hearing (auditory) system and some psychophysics (perception)

Voice DSP - Part 1b: Speech hearing & perception mechanisms

Hearing Organs

Hearing Organs - cont.
  • Sound waves impinge on the outer ear and enter the auditory canal
  • Amplified waves cause the eardrum to vibrate
  • The eardrum separates the outer ear from the middle ear
  • The Eustachian tube equalizes the air pressure of the middle ear
  • Ossicles (hammer, anvil, stirrup) amplify vibrations
  • The oval window separates the middle ear from the inner ear
  • The stirrup excites the oval window, which excites liquid in the cochlea
  • The cochlea is curled up like a snail
  • The basilar membrane runs along the middle of the cochlea
  • The organ of Corti transduces vibrations to electric pulses
  • Pulses are carried by the auditory nerve to the brain

Function of Cochlea
  • The cochlea has 2 1/2 to 3 turns; were it straightened out it would be 3 cm in length
  • The basilar membrane runs down the center of the cochlea, as does the organ of Corti
  • 15,000 cilia (hairs) contact the vibrating basilar membrane and release neurotransmitter, stimulating 30,000 auditory neurons
  • The cochlea is wide (1/2 cm) near the oval window and tapers towards the apex; it is stiff near the oval window and flexible near the apex
  • Hence high frequencies cause the section near the oval window to vibrate, and low frequencies cause the section near the apex to vibrate
  • Overlapping bank of filters: frequency decomposition

Psychophysics - Weber’s law
  • Ernst Weber: professor of physiology at Leipzig in the early 1800s
  • Just Noticeable Difference (JND): minimal stimulus change that can be detected by the senses
  • Discovery: $\Delta I = K I$
  • Example, tactile sense: place coins in each hand; the subject could discriminate between 10 coins and 11, but not 20/21, but could 20/22!
  • Similarly: vision (lengths of lines), taste (saltiness), sound (frequency)
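
The coin example can be replayed numerically. In this sketch the Weber fraction K = 0.08 is an assumed value, picked only because it lies between the 5% change that went unnoticed (20 vs. 21) and the 10% changes that were detected; the function name is illustrative.

```python
def is_noticeable(base, changed, weber_fraction):
    """Weber's law: a change is detectable when |delta I| >= K * I."""
    return abs(changed - base) >= weber_fraction * base


K = 0.08  # assumed Weber fraction, not a measured value

print(is_noticeable(10, 11, K))  # 10 vs. 11 coins
print(is_noticeable(20, 21, K))  # 20 vs. 21 coins
print(is_noticeable(20, 22, K))  # 20 vs. 22 coins
```

The same absolute change of one coin is detectable on a light load and undetectable on a heavy one, because the threshold scales with the stimulus.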

Weber’s law - cont.
This makes a lot of sense
[Illustration: Bill Gates]

Psychophysics - Fechner’s law
  • Weber’s law is not a true psychophysical law: it relates stimulus threshold to stimulus (both physical entities), not internal representation (feelings) to physical entity
  • Gustav Theodor Fechner: student of Weber; medicine, physics, philosophy
  • Simplest assumption: the JND is a single internal unit
  • Using Weber’s law we find: $Y = A \log I + B$
  • Fechner Day (October 22, 1850)
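
One way to see where the logarithm comes from is to count JND steps. If each just-noticeable step multiplies the stimulus by (1 + K), then the number of steps between a reference and an intensity I grows as log I. The sketch below compares a brute-force count with the closed form; K = 0.08 and the function names are assumptions for illustration.

```python
import math


def jnd_steps(intensity, reference, weber_fraction):
    """Count whole JND steps from reference up to intensity."""
    steps = 0
    level = reference
    while level * (1.0 + weber_fraction) <= intensity:
        level *= 1.0 + weber_fraction
        steps += 1
    return steps


def fechner(intensity, reference, weber_fraction):
    """Closed form Y = A log I + B, with A = 1 / log(1 + K) and B = -A log(reference)."""
    return math.log(intensity / reference) / math.log(1.0 + weber_fraction)


K = 0.08  # assumed Weber fraction
for intensity in (10.0, 100.0, 1000.0):
    print(intensity, jnd_steps(intensity, 1.0, K), fechner(intensity, 1.0, K))
```

Each tenfold increase in intensity adds the same number of internal units, which is the compressive behaviour the law describes.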

Fechner’s law - cont.
  • Log is very compressive
  • Fechner’s law explains the fantastic ranges of our senses
    – Sight: single photon to direct sunlight, $10^{15}$
    – Hearing: eardrum moves 1 H atom to jet plane, $10^{12}$
  • Bel: defined to be $\log_{10}$ of the power ratio
  • decibel (dB): one tenth of a Bel
    $d(\mathrm{dB}) = 10 \log_{10} (P_1 / P_2)$
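
Applying the decibel definition to the two ranges quoted above shows how compressive the logarithm is. A minimal sketch (the function name is illustrative):

```python
import math


def power_ratio_db(p1, p2):
    """Decibels: ten times the base-10 log of a power ratio."""
    return 10.0 * math.log10(p1 / p2)


sight_db = power_ratio_db(1e15, 1.0)    # range quoted for sight
hearing_db = power_ratio_db(1e12, 1.0)  # range quoted for hearing
print(sight_db, hearing_db)
```

Fifteen and twelve orders of magnitude collapse to roughly 150 dB and 120 dB.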

Fechner’s law - sound amplitudes
  • Companding: adaptation of the logarithm to positive/negative signals
  • μ-law and A-law are piecewise linear approximations
  • Equivalent to linear sampling at 12-14 bits
    (8-bit linear sampling is significantly more noisy)
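
The effect of companding on a quiet sample can be sketched with the continuous μ-law curve. Note the assumptions: telephony codecs use a piecewise linear approximation of this curve rather than the formula itself, the quantizer here is a simplified uniform one, and the function names are illustrative.

```python
import math

MU = 255.0  # the usual mu-law parameter


def mu_law_compress(x, mu=MU):
    """Sign-preserving logarithmic compression of x in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize(v, bits=8):
    """Simplified uniform quantizer on [-1, 1] (assumed, not a codec spec)."""
    levels = 2 ** (bits - 1)
    return round(v * levels) / levels


x = 0.01  # a quiet sample
linear_err = abs(quantize(x) - x)
companded_err = abs(mu_law_expand(quantize(mu_law_compress(x))) - x)
print(linear_err, companded_err)
```

With the same 8 bits, the companded path reproduces the quiet sample far more accurately than plain linear quantization, which is why 8-bit companded speech is comparable to a wider linear word.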

Fechner’s law - sound frequencies
  • Octaves, well tempered scale: $\sqrt[12]{2}$
  • Frequency warping
    – Mel (from melody): 1 kHz = 1000 mel, JND afterwards
      $M \sim 1000 \log_2 (1 + f_{\mathrm{kHz}})$
  • Critical bands
    – Bark (from Barkhausen): tones that can be simultaneously heard excite different basilar membrane regions
      $B \sim 25 + 75 (1 + 1.4 f_{\mathrm{kHz}}^2)^{0.69}$
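
The two approximations on this slide, M ~ 1000 log2(1 + f) and B ~ 25 + 75 (1 + 1.4 f^2)^0.69 with f in kHz, are easy to evaluate. The sketch below is a direct transcription; the function names are illustrative, and B is taken to be a critical bandwidth in Hz, which is an interpretation of the slide rather than something it states.

```python
import math


def hz_to_mel(f_hz):
    """Mel scale approximation: M ~ 1000 log2(1 + f_kHz)."""
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)


def critical_bandwidth_hz(f_hz):
    """Critical bandwidth approximation: B ~ 25 + 75 (1 + 1.4 f_kHz^2)^0.69."""
    f_khz = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69


SEMITONE = 2.0 ** (1.0 / 12.0)  # well tempered scale: twelve equal ratios per octave

for f in (100.0, 1000.0, 4000.0):
    print(f, hz_to_mel(f), critical_bandwidth_hz(f))
```

By construction 1 kHz maps to 1000 mel, and the critical bandwidth stays near 100 Hz at low frequencies before widening.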

Psychophysics - changes
Our senses respond to changes
[Figure: Inverse E Filter]

Psychophysics - masking
  • Masking: strong tones block weaker ones at nearby frequencies
  • Narrowband noise blocks tones (up to the critical band)
[Figure: masking sketch against frequency f]

Voice DSP - Part 1c: Speech Quality Measurement

Why does it sound the way it sounds?
PSTN
  • BW = 0.2-3.8 kHz, SNR > 30 dB
  • PCM, ADPCM (BER $10^{-3}$)
  • five nines reliability
  • line echo cancellation
Voice over packet network
  • speech compression
  • delay, delay variation, jitter
  • packet loss/corruption/priority
  • echo cancellation

Subjective Voice Quality
Old Measures
  • 5/9
  • DRT
  • DAM
  Example word set: meet, neat, seat, feet, Pete, beat, heat
The modern scale
  • MOS
  • DMOS

MOS according to ITU P.800
Subjective Determination of Transmission Quality
Annex B: Absolute Category Rating (ACR)

  Score  Listening Quality  Listening Effort
  5      excellent          relaxed
  4      good               attention needed
  3      fair               moderate effort
  2      poor               considerable effort
  1      bad                no meaning with feasible effort
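
A MOS is simply the mean of the category votes collected on this scale. The sketch below shows the arithmetic; the listener ratings are made-up illustration data, and the function name is an assumption.

```python
from statistics import mean

# ACR listening-quality categories from the table above.
ACR_LABELS = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "bad"}


def mean_opinion_score(ratings):
    """Average a list of ACR votes, rejecting values outside the 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in ACR_LABELS for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)


votes = [4, 4, 3, 5, 4, 3, 4, 4]  # hypothetical listener votes
print(mean_opinion_score(votes))
```

A real test also controls the listening conditions and reports confidence intervals, which this sketch leaves out.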

MOS according to ITU (cont.)
Annex D: Degradation Category Rating (DCR)
Annex E: Comparison Category Rating (CCR)
  • ACR is not good at high quality speech

  DCR                        CCR
  5  inaudible                3  much better
  4  not annoying             2  better
  3  slightly annoying        1  slightly better
  2  annoying                 0  the same
  1  very annoying           -1  slightly worse
                             -2  worse
                             -3  much worse

Some MOS numbers
Effect of Speech Compression (from ITU-T Study Group 15):

  Condition                                    MOS
  Quiet room, 48 kHz 16-bit linear sampling    5.0
  PCM (A-law/μ-law) 64 kb/s                    4.1
  G.723.1 @ 6.3 kb/s, G.729 @ 8 kb/s           3.9
  ADPCM G.726 32 kb/s                          3.8
  GSM @ 13 kb/s                                3.6
  VSELP IS54 @ 8 kb/s                          3.4

[Figure annotation: toll quality]

The Problem(s) with MOS
Accurate MOS tests are the only reliable benchmark, BUT
  • MOS tests are off-line
  • MOS tests are slow
  • MOS tests are expensive
  • Different labs give consistently different results
  • Most MOS tests only check one aspect of the system

The Problem(s) with SNR
Naive question: Isn’t CCR the same as SNR?
SNR does not correlate well with subjective criteria
Squared difference is not an accurate comparator
  • Gain
  • Delay
  • Phase
  • Nonlinear processing
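
A small experiment makes the delay and gain items concrete. The sketch below computes a plain SNR from the squared difference between a reference tone and two copies of it that a listener would judge essentially unchanged: one delayed by 10 samples, one at half amplitude. The 400 Hz tone, the 8 kHz rate and the function name are assumptions for illustration.

```python
import math


def snr_db(reference, degraded):
    """SNR in dB, treating the sample-by-sample difference as noise."""
    signal = sum(r * r for r in reference)
    noise = sum((r - d) ** 2 for r, d in zip(reference, degraded))
    if noise == 0.0:
        return math.inf
    return 10.0 * math.log10(signal / noise)


FS = 8000  # assumed sampling rate in Hz
tone = [math.sin(2 * math.pi * 400 * n / FS) for n in range(800)]
# A 10-sample delay is half a period of 400 Hz at 8 kHz, so the copy is inverted.
delayed = [math.sin(2 * math.pi * 400 * (n - 10) / FS) for n in range(800)]
quieter = [0.5 * s for s in tone]

print(snr_db(tone, delayed))  # negative, although the tone sounds the same
print(snr_db(tone, quieter))  # about 6 dB for a simple volume change
```

Neither result says anything useful about perceived quality, which is why the squared difference is a poor comparator.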

Speech distance measures
Many objective measures have been proposed:
  • Segmental SNR
  • Itakura-Saito distance
  • Euclidean distance in cepstrum space
  • Bark spectral distortion
  • Coherence function
None correlate well with MOS
ITU target: find a quality measure that does correlate well
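
As an illustration of the first measure in the list, segmental SNR averages per-frame SNR values in dB instead of computing one ratio over the whole signal. The sketch below follows a commonly used form; the 160-sample frame, the clamping limits of -10 and 35 dB and the function name are assumed conventions, not values from the slides.

```python
import math


def segmental_snr_db(reference, degraded, frame_len=160, lo=-10.0, hi=35.0):
    """Mean of per-frame SNRs in dB, each clamped to [lo, hi]."""
    values = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        deg = degraded[start:start + frame_len]
        signal = sum(r * r for r in ref)
        noise = sum((r - d) ** 2 for r, d in zip(ref, deg))
        if signal == 0.0:
            continue  # skip silent frames
        snr = hi if noise == 0.0 else 10.0 * math.log10(signal / noise)
        values.append(min(hi, max(lo, snr)))
    if not values:
        raise ValueError("no frame with signal energy")
    return sum(values) / len(values)


tone = [math.sin(2 * math.pi * 400 * n / 8000) for n in range(800)]
half = [0.5 * s for s in tone]            # uniform gain change
dropout = [0.0] * 160 + tone[160:]        # first frame lost, rest intact

print(segmental_snr_db(tone, half))
print(segmental_snr_db(tone, dropout))
```

Averaging in the dB domain weights every frame equally, so a short localized error affects the score differently than it affects a global SNR.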

Some objective methods
  • Perceptual Speech Quality Measurement (PSQM): ITU-T P.861
  • Perceptual Analysis Measurement System (PAMS): BT proprietary technique
  • Perceptual Evaluation of Speech Quality (PESQ): ITU-T P.862
  • Objective Measurement of Perceived Audio Quality (PAQM): ITU-R BS.1387

Objective Quality Strategy
[Block diagram labels: speech, channel, QM, QM to MOS, estimate]

PSQM philosophy (from P.861)
[Block diagram: two parallel branches, each a Perceptual model producing an Internal Representation, feed an Audible Difference stage followed by a Cognitive Model]

PSQM philosophy (cont.)
Perceptual Modelling (Internal representation)
  • Short time Fourier transform
  • Frequency warping (telephone-band filtering, Hoth noise)
  • Intensity warping
Cognitive Modelling
  • Loudness scaling
  • Internal cognitive noise
  • Asymmetry
  • Silent interval processing
PSQM Values
  • 0 (no degradation) to 6.5 (maximum degradation)
Conversion to MOS
  • PSQM to MOS calibration using known references
  • Equivalent Q values

Problems with PSQM
Designed for telephony grade speech codecs
Doesn’t take network effects into account:
  • filtering
  • variable time delay
  • localized distortions
Draft standard P.862 adds:
  • transfer function equalization
  • time alignment, delay skipping
  • distortion averaging

PESQ philosophy (from P.862)
[Block diagram: as for PSQM, with a Time Alignment stage added; two Perceptual model branches producing Internal Representations feed an Audible Difference stage followed by a Cognitive Model]