A brain's-eye-view of speech perception
David Poeppel
Cognitive Neuroscience of Language Lab
Department of Linguistics and Department of Biology
Neuroscience and Cognitive Science Program
University of Maryland College Park
Students: Anthony Boemio, Maria Chait, Huan Luo, Virginie van Wassenhove
Colleagues: Allen Braun (NIH), Greg Hickok (UC Irvine), Jonathan Simon (Univ. Maryland)
encoding? → "chair" "uncomfortable" "lunch" "soon" → representation?
Is this a hard problem? Yes! If it could be solved straightforwardly (e.g., by machine), Mark Liberman would be in Tahiti having cold beers.
Outline
(1) Fractionating the problem in space: towards a functional anatomy of speech perception
(2) Fractionating the problem in time
(3) Towards a functional physiology of speech perception
- A hypothesis about the quantization of time
- Psychophysical evidence for temporal integration
- Imaging evidence
production, articulation of speech — hypothesis about production: distinctive features [-voice] [+labial] [+high] [...]
interface with lexical items, word recognition — hypothesis about storage: distinctive features [-voice] [+voice] [+labial] [+high] [+labial] [-round] [+round] [-round] [...]
Unifying concept: the distinctive feature
- analysis of auditory signal: spectrotemporal representation → FEATURES
- auditory-motor interface: coordinate transform from acoustic to articulatory space → production, articulation of speech
- auditory-lexical interface: interface with lexical items, word recognition
STG (bilateral): acoustic-phonetic speech codes
Area Spt (left): auditory-motor interface
pIFG/dPM (left): articulatory-based speech codes
pMTG (left): sound-meaning interface
Hickok & Poeppel (2000), Trends in Cognitive Sciences; Hickok & Poeppel (in press), Cognition
Indefrey & Levelt (in press), Cognition: meta-analysis of neuroimaging data, perception/production overlap
Shared neural correlates of word production and perception processes:
- bilateral mid/posterior STG
- left anterior STG
- left mid/posterior MTG
- left posterior IFG
MTG and IFG overlap when controlling for the overt/covert distinction across tasks.
Hypothesized functions: lexical selection (MTG), lexical phonological code retrieval (MTG), post-lexical syllabification (IFG)
Scott & Johnsrude 2003
Possible subregions of inferior frontal gyrus — Burton (2001)
Auditory studies: Burton et al. (2000), Demonet et al. (1992, 1994), Fiez et al. (1995), Zatorre et al. (1992, 1996)
Visual studies: Sergent et al. (1992, 1993), Poldrack et al. (1999), Paulesu et al. (1993, 1996), Shaywitz et al. (1995)
Auditory lexical decision versus (a) FM sweeps, (b) CP/syllables, and (c) rest. D. Poeppel et al. (in press)
fMRI (yellow blobs) and MEG (red dots) recordings of speech perception show pronounced bilateral activation of temporal cortex. T. Roberts & D. Poeppel (in preparation)
Binder et al. 2000
STG (bilateral): acoustic-phonetic speech codes
Area Spt (left): auditory-motor interface
pIFG/dPM (left): articulatory-based speech codes
pMTG (left): sound-meaning interface
Hickok & Poeppel (2000), Trends in Cognitive Sciences; Hickok & Poeppel (in press), Cognition
Outline
(1) Fractionating the problem in space: towards a functional anatomy of speech perception
(2) Fractionating the problem in time
(3) Towards a functional physiology of speech perception
- A hypothesis about the quantization of time
- Psychophysical evidence for temporal integration
- Imaging evidence
The local/global distinction in vision is intuitively clear Chuck Close
What information does the brain extract from speech signals?
Acoustic and articulatory phonetic phenomena occur on different time scales.
- Phenomena at the scale of formant transitions, subsegmental cues ("short stuff"): order of magnitude 20–50 ms → fine structure
- Phenomena at the scale of syllables, tonality and prosody ("long stuff"): order of magnitude 150–250 ms → envelope
Does different granularity in time matter?
- Segmental and subsegmental information: serial order in speech (fool/flu, carp/crap, bat/tab)
- Supra-segmental information: prosody ("Sleep during lecture!" vs. "Sleep during lecture?")
The local/global distinction can be conceptualized as a multi-resolution analysis in time.
- Supra-segmental information (time ~200 ms): syllabicity, metrics, tone
- Segmental information (time ~20–50 ms): features, segments
→ binding process → further processing
Outline
(1) Fractionating the problem in space: towards a functional anatomy of speech perception
(2) Fractionating the problem in time
(3) Towards a functional physiology of speech perception
- A hypothesis about the quantization of time
- Psychophysical evidence for temporal integration
- Imaging evidence
Temporal integration windows
Psychophysical and electrophysiological evidence suggests that perceptual information is integrated and analysed in temporal integration windows (v. Békésy 1933; Stevens and Hall 1966; Näätänen 1992; Theunissen and Miller 1995; etc.). The importance of the concept of a temporal integration window is that it suggests discontinuous processing of information in the time domain. The CNS, on this view, treats time not as a continuous variable but as a series of temporal windows, and extracts data from a given window. (Contrast: the arrow of time in physics vs. the arrow of time in the central nervous system.)
Asymmetric sampling/quantization of the speech waveform ("This paper is hard to publish"):
- short temporal integration windows: ~25 ms
- long temporal integration windows: ~200 ms
Two spectrograms of the same word illustrate how different analysis windows highlight different aspects of the sounds.
(a) High time, low frequency resolution: each glottal pulse visible as a vertical striation.
(b) Low time, high frequency resolution: each harmonic visible as a horizontal stripe.
Hypothesis: Asymmetric Sampling in Time (AST)
Left temporal cortical areas preferentially extract information over short, ~25 ms temporal integration windows. Right hemisphere areas preferentially integrate over long, 150–250 ms windows. By assumption, the auditory input signal has a neural representation that is bilaterally symmetric (e.g., at the level of core); beyond this initial representation, the signal is elaborated asymmetrically in the time domain. Another way to conceptualize the AST proposal: the sampling rate of non-primary auditory areas differs, with LH sampling at high frequencies (~40 Hz) and RH sampling at low frequencies (4–10 Hz).
a. Physiological lateralization: symmetric representation of spectro-temporal receptive fields in primary auditory cortex; temporally asymmetric elaboration of perceptual representations in non-primary cortex. The proportion of neuronal ensembles, plotted against the size of the temporal integration window, peaks near 25 ms (~40 Hz) in LH and near 250 ms (~4 Hz) in RH.
b. Functional lateralization: analyses requiring high temporal resolution (e.g., formant transitions) → LH; analyses requiring high spectral resolution (e.g., intonation contours) → RH.
Asymmetric sampling in time (AST): characteristics
• AST is an example of functional segregation, a standard concept.
• AST is an example of multi-resolution analysis, a signal-processing strategy common in other cortical domains (cf. visual areas MT and V4, which, among other differences, have phasic versus tonic firing properties, respectively).
• AST speaks to the "granularity" of perceptual representations: the model suggests that there exist basic perceptual representations corresponding to the different temporal windows (e.g., featural information and the envelope of syllables are equally basic, on this view).
• The AST model connects in plausible ways to the local versus global distinction: there are multiple representations of a given signal on different scales (cf. wavelets). Global ⇒ 'large-chunk' analysis, e.g., syllabic level; Local ⇒ 'small-chunk' analysis, e.g., subsegmental level.
a. Physiological lateralization: symmetric representation of spectro-temporal receptive fields in primary auditory cortex; temporally asymmetric elaboration of perceptual representations in non-primary cortex. The proportion of neuronal ensembles, plotted against the size of the temporal integration window, peaks near 25 ms (~40 Hz) in LH and near 250 ms (~4 Hz) in RH.
b. Functional lateralization: analyses requiring high temporal resolution (e.g., formant transitions) → LH; analyses requiring high spectral resolution (e.g., intonation contours) → RH.
Outline
(1) Fractionating the problem in space: towards a functional anatomy of speech perception
(2) Fractionating the problem in time
(3) Towards a functional physiology of speech perception
- A hypothesis about the quantization of time: the AST model
- Psychophysical evidence for temporal integration
- Imaging evidence
Perception of FM sweeps Huan Luo, Mike Gordon, Anthony Boemio, David Poeppel
FM sweep example: 80 ms linear FM sweep from 3 to 2 kHz (waveform and spectrogram).
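As a sketch, a stimulus like this can be generated with `scipy.signal.chirp`; the sample rate here is an assumption, since the slide specifies only the 80 ms duration and the 3→2 kHz linear trajectory:

```python
import numpy as np
from scipy.signal import chirp

fs = 44100                          # sample rate (Hz) -- an assumption, not stated on the slide
dur = 0.080                         # 80 ms sweep duration
t = np.arange(int(fs * dur)) / fs

# Linear FM sweep from 3 kHz down to 2 kHz, as in the example stimulus
sweep = chirp(t, f0=3000.0, f1=2000.0, t1=dur, method='linear')

# FM rate in octaves/second (the measure used in the psychophysics)
span_oct = np.log2(3000 / 2000)     # ≈ 0.58 octaves
rate = span_oct / dur               # ≈ 7.3 oct/s
```

Varying `dur` while holding the frequency span fixed reproduces the rate manipulation used in the experiments.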
The rationale
• Formant transitions are important cues for speech perception (for example, F2 direction can distinguish /ba/ from /da/).
• FM direction is important in tone languages.
• The vertebrate auditory system is well equipped to analyze FM signals.
Tone languages
• For example, Chinese, Thai, ...
• The direction of FM (of the fundamental frequency) makes lexical distinctions in the language.
• Four tones in Chinese: /ma1/, /ma2/, /ma3/, /ma4/
Questions
• How good are we at discriminating these signals? Determine the threshold stimulus duration (corresponding to FM rate) for detection of FM direction.
• Is there any performance difference between UP and DOWN detection?
• Does language experience affect performance on such a basic perceptual ability?
Stimuli
• Linearly frequency modulated
• Frequency range studied: 2–3 kHz (0.5 oct)
• Two directions (up/down)
• FM rate (frequency range/time) varied by changing duration; for each frequency range, the frequency span is kept constant (slow/fast)
• Stimulus duration: from 5 ms (100 oct/sec) to 640 ms (0.8 oct/sec)
Tasks
• Detection and discrimination of UP versus DOWN
• 2AFC, 2IFC, 3IFC
English speakers
• 3 frequency ranges relevant to speech (approximately the F1, F2, F3 ranges: 600–900 Hz, 1–1.5 kHz, 2–3 kHz)
• single-interval 2AFC
Two main findings:
• threshold for UP at 20 ms
• UP better than DOWN
Gordon & Poeppel (2001), JASA-ARLO
2IFC
• Eliminates a possible bias strategy subjects could use
• Tests whether the asymmetric performance of the English subjects is due to an "UP preference bias"
Interval 1: UP; Interval 2: DOWN. The two sounds have the same duration, so the only difference is direction. Which interval (1 or 2) contains the sound with the specified direction?
Results for Chinese subjects: no significant difference; threshold for both UP and DOWN is about 20 ms.
Results for English subjects: no difference now between UP and DOWN; threshold for both at 20 ms; no difference between Chinese and English subjects.
3IFC
Standard: UP; Interval 1: UP; Interval 2: DOWN. Choose which interval contains the sound that is DIFFERENT among the three (a quality difference rather than direction alone).
3IFC versus 2IFC: no difference between Chinese and English subjects; threshold confirmed at 20 ms.
Conclusion
• 20 ms is an important threshold for the discrimination of FM sweeps:
- corresponds to the temporal order threshold determined by Hirsh (1959)
- consistent with Schouten (1985, 1989), who tested FM sweeps
- this basic threshold arguably reflects the shortest integration window that generates robust auditory percepts
Click trains Anthony Boemio & David Poeppel
Click Stimuli
Psychophysics
Auditory-visual integration: the McGurk effect
Virginie van Wassenhove, Ken Grant, David Poeppel
McGurk effect
• Audiovisual (AV) token
• Visual (V) token
• Auditory (A) token
Identification task (3AFC): ApVk, TWI, true bimodal responses.
Response rate as a function of SOA (ms) in the ApVk McGurk pair. Mean responses (N=21) and standard errors. Fusion rate (open red squares) and corrected fusion rate (filled red squares, dotted line) are /ta/ responses; visually driven responses (open green triangles) are /ka/; auditorily driven responses (filled blue circles) are /pa/. A negative value in corrected fusion rate is interpreted as a visually dominated error response /ta/.
Simultaneity judgment task (2AFC): ApVk vs. AtVt and AbVg vs. AdVd.
Simultaneity judgment as a function of SOA (ms) in both incongruent and congruent conditions (ApVk and AtVt: N=21; AbVg and AdVd: N=18). The congruent conditions (open symbols) are associated with a broader and higher simultaneity judgment profile than the incongruent conditions (filled symbols).
Temporal window of integration (TWI) across tasks and bimodal speech stimuli

Stimulus | Task | A lead, left boundary (ms) | A lag, right boundary (ms) | Plateau center (ms) | Window size (ms)
ApVk     | ID   | -25  | +136 | +56 | 161
ApVk     | S    | -44  | +117 | +37 | 161
AtVt     | S    | -80  | +125 | +23 | 205
AbVg     | ID   | -34  | +174 | +70 | 208
AbVg     | S    | -37  | +122 | +43 | 159
AdVd     | S    | -74  | +131 | +29 | 205
Outline
(1) Fractionating the problem in space: towards a functional anatomy of speech perception
(2) Fractionating the problem in time
(3) Towards a functional physiology of speech perception
- A hypothesis about the quantization of time: the AST model
- Psychophysical evidence for temporal integration
  • FM sweeps and click trains: 20–30 ms integration
  • AV processing in McGurk: 200 ms integration
- Imaging evidence
Binding of Temporal Quanta in Speech Processing Maria Chait, Steven Greenberg, Takayuki Arai, David Poeppel
Multi-resolution analysis hypothesis: "SYLLABLE"
- Suprasegmental information (time scale ~300 ms): syllabicity, stress, tone
- (Sub)segmental information (time scale ~30 ms): features
→ binding process
Signal processing: computing the envelope and fine structure
1. Filter the original signal into 14 bands (e.g., lowest band 0–265 Hz, next band 265–315 Hz, ..., highest band 5045–6000 Hz).
2. For each band i, compute the envelope E_i and fine structure FS_i.
3. Low-pass filter each envelope (0–3 Hz) or high-pass filter it (22 Hz and above).
4. Multiply each filtered envelope by its fine structure (E_i × FS_i).
5. Sum across channels to produce S_low (low-passed envelopes) and S_high (high-passed envelopes).
• 0–6 kHz
• 14 channels
• spaced in 1/3-octave steps along the cochlear frequency map
• every two neighboring channels are separated by 50 Hz
Envelope extraction (amplitude vs. time).
Original, low-passed, and high-passed envelopes.
Original, high-passed, and low-passed signals.
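A minimal sketch of this decomposition for a single channel, using the analytic (Hilbert) signal to split a band into envelope and fine structure; the filter orders, the Butterworth design, and the toy input are illustrative assumptions, not the exact filters used in the study:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

fs = 16000
t = np.arange(fs) / fs
# Toy input: a 300 Hz carrier with a slow 3 Hz amplitude modulation
x = np.sin(2 * np.pi * 300 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))

# 1. Band-pass one channel (one of the 14 bands; 265-315 Hz as on the slide)
sos_band = butter(4, [265, 315], btype='bandpass', fs=fs, output='sos')
band = sosfiltfilt(sos_band, x)

# 2. Split the band into envelope E and fine structure FS via the analytic signal
analytic = hilbert(band)
E = np.abs(analytic)              # slow amplitude envelope
FS = np.cos(np.angle(analytic))   # rapidly varying fine structure

# 3. Low-pass the envelope at 3 Hz (for the S_low condition)
sos_lp = butter(4, 3, btype='lowpass', fs=fs, output='sos')
E_low = sosfiltfilt(sos_lp, E)

# 4. Recombine: the filtered envelope re-modulates the fine structure
channel_out = E_low * FS
```

Repeating steps 1–4 for all 14 bands and summing the `channel_out` signals would yield S_low (or S_high, with a 22 Hz high-pass in step 3).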
Evidence:
• Comodulation masking release
• Ahissar et al. (2001): phase locking in the auditory cortex to the envelope of sentence stimuli
• Shannon (1995)
• Drullman (1994): effect of low-pass filtering the envelope on speech reception:
- severe reduction at 0–2 Hz cutoff frequencies
- marginal contribution of frequencies above 16 Hz
Effect of high-pass filtering the envelope:
- reduction in speech intelligibility for cutoff frequencies above 64 Hz
- no reduction in sentence intelligibility when only frequencies below 4 Hz are reduced
Experiment 1
Stimuli:
- 53 sentences from the IEEE corpus
- Nonsense syllables (CUNY), 8 blocks: 2 (voiced/voiceless) × 2 vowels (/a/, /i/) × 2 (CV/VC)
- 3 manipulations: 0–3 Hz low pass; 22–40 Hz band pass; 0–3 and 22–40 Hz presented dichotically
Each subject hears all 53 sentences but only one manipulation per sentence. A practice block of 26 sentences precedes the experiment.
Task:
- Sentences: subjects write down what they heard as precisely as they can
- Syllables: 7-alternative forced choice
Results: low-pass, high-pass, high-pass plus low-pass?
The result reflects the interaction between information carried on the short and long time scales.
Outline
(1) Fractionating the problem in space: towards a functional anatomy of speech perception
(2) Fractionating the problem in time
(3) Towards a functional physiology of speech perception
- A hypothesis about the quantization of time: the AST model
- Psychophysical evidence for temporal integration
  • FM sweeps and click trains: 20–30 ms integration
  • AV processing in McGurk: 200 ms integration
  • Interaction of temporal windows
- Imaging evidence
fMRI study of temporal structure in concatenated FMs
Anthony Boemio, Allen Braun, Steven Fromm, David Poeppel
Stimulus Properties
Stimulus properties: spectrograms, PSDs, and amplitude vs. time for the FM, TONE, and CNST stimuli.
All 13 stimuli have nearly identical long-term spectra and RMS power over the entire 9-second stimulus duration. Stimuli differ only in segment duration, which was determined by drawing from a Gaussian distribution with means of 12, 25, 45, 85, 160, and 300 ms.
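A hedged sketch of how such a stimulus could be assembled: segment durations drawn from a Gaussian around one of the six means, each segment filled with a tone. The standard deviation, frequency range, and truncation rule are assumptions not stated on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000
total_dur = 9.0                  # 9-second stimulus, as on the slide
mean_ms = 160                    # one of the six mean segment durations (12 ... 300 ms)

segments = []
elapsed = 0.0
while elapsed < total_dur:
    # Draw a segment duration from a Gaussian around the mean (sd is an assumption)
    d = rng.normal(mean_ms, mean_ms * 0.25) / 1000.0
    d = max(d, 0.005)            # guard against non-positive draws
    n = int(fs * d)
    f = rng.uniform(500, 2000)   # per-segment frequency: an illustrative choice
    tseg = np.arange(n) / fs
    segments.append(np.sin(2 * np.pi * f * tseg))
    elapsed += d

# Concatenate and trim to exactly 9 seconds at most
stimulus = np.concatenate(segments)[: int(fs * total_dur)]
```

Changing `mean_ms` across the six values produces the segment-duration conditions while leaving the long-term spectrum roughly unchanged.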
fMRI
• Single-trial sparse acquisition paradigm (clustered volume acquisition)
• 1.5 T GE Signa, echoplanar sequence
• 11.4 s TR (9 s signal, 2.4 s volume acquisition), TE 40 ms
• 24 repetitions/condition
• SPM99 random-effects model, p < 0.05 corrected
SPM99 cohort analysis: FMs–CNST categorical contrasts (p < 0.05 corrected).
Hemodynamic response/stimulus model: not all segment transitions are equal. (Only 1 second of each stimulus is shown, for clarity; FM/TONE vs. CNST.) The acquisition threshold is set by the categorical contrast with the CNST stimulus; anything below this level will be zero in the SPM. A model including both the segment transitions and the segments themselves, but assuming that transitions between long segments contribute more to the response than transitions between shorter ones, reproduces the observed activation vs. segment-duration relation (left).
MEG study of spectral responses to complex sounds David Poeppel, Huan Luo, Dana Ritter, Anthony Boemio, Didier Depireux, Jonathan Simon
The asymmetric sampling in time (AST) hypothesis predicts electrophysiological asymmetries in specific frequency bands, gamma (25–55 Hz) and theta (3–8 Hz), because the hypothesized temporal quantization is reflected as oscillatory activity. Sensitivity of neuronal ensembles: LH peaks at short integration windows (~25 ms, ~40 Hz), RH at long windows (~250 ms, ~4 Hz).
Flow chart: LH and RH sensor signals are each passed through a gamma band-pass filter and a theta band-pass filter; RMS is then computed, yielding gamma power for LH and RH and theta power for LH and RH.
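The flow chart can be sketched as follows. Butterworth filters stand in for the Kaiser/Remez/elliptic designs reported on the results slide, the sampling rate is an assumption, and the sensor time series here are synthetic stand-ins:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_rms(x, lo, hi, fs):
    """Band-pass x into [lo, hi] Hz, then return its RMS."""
    sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
    y = sosfiltfilt(sos, x)
    return np.sqrt(np.mean(y ** 2))

fs = 600                             # MEG sampling rate: an assumption
rng = np.random.default_rng(1)
lh = rng.standard_normal(10 * fs)    # stand-in for LH sensor time series
rh = rng.standard_normal(10 * fs)    # stand-in for RH sensor time series

# Gamma (25-55 Hz) and theta (3-8 Hz) power per hemisphere
gamma = {'LH': band_rms(lh, 25, 55, fs), 'RH': band_rms(rh, 25, 55, fs)}
theta = {'LH': band_rms(lh, 3, 8, fs), 'RH': band_rms(rh, 3, 8, fs)}

# Lateralization ratio used on the results slide: P(L) / (P(L) + P(R))
ratio_gamma = gamma['LH'] / (gamma['LH'] + gamma['RH'])
ratio_theta = theta['LH'] / (theta['LH'] + theta['RH'])
```

A ratio below 0.5 indicates greater right-hemisphere power in that band.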
Multi-taper spectral analysis
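A minimal multi-taper PSD estimate, averaging the periodograms of several DPSS (Slepian) tapered copies of the signal. The time-bandwidth product NW and the taper count follow the usual 2NW−1 convention and are not values from the talk:

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(x, fs, NW=3):
    """Average the periodograms of several DPSS-tapered copies of x."""
    N = len(x)
    K = 2 * NW - 1                    # conventional number of tapers for a given NW
    tapers = dpss(N, NW, Kmax=K)      # shape (K, N)
    spectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2
    psd = spectra.mean(axis=0) / fs   # average across tapers
    freqs = np.fft.rfftfreq(N, 1 / fs)
    return freqs, psd

# Toy example: a 40 Hz (gamma-band) component in noise
fs = 600
t = np.arange(2 * fs) / fs
x = np.sin(2 * np.pi * 40 * t) + 0.1 * np.random.default_rng(2).standard_normal(len(t))

freqs, psd = multitaper_psd(x, fs)
peak = freqs[np.argmax(psd)]          # sits near the 40 Hz component
```

Multi-tapering trades a small loss of frequency resolution (set by NW) for a much lower-variance spectral estimate than a single periodogram.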
Result
Power ratio in specific frequency bands, P(L)/(P(L)+P(R)):

Band  | Kaiser | Remez  | Elliptic
Gamma | 0.4769 | 0.4751 | 0.4733
Theta | 0.3958 | 0.3965 | 0.4210

• The hemispheric difference is much greater in the theta (low-frequency) band.
• RH activation in the theta band is greater than LH.
Distribution of spectral responses
Outline
(1) Fractionating the problem in space: towards a functional anatomy of speech perception
(2) Fractionating the problem in time
(3) Towards a functional physiology of speech perception
- A hypothesis about the quantization of time: the AST model
- Psychophysical evidence for temporal integration
  • FM sweeps and click trains: 20–30 ms integration
  • AV processing in McGurk: 200 ms integration
  • Interaction of temporal windows
- Imaging evidence
  • fMRI: temporal sensitivity and lateralization
  • MEG spectral lateralization
STG (bilateral): acoustic-phonetic speech codes
Area Spt (left): auditory-motor interface
pIFG/dPM (left): articulatory-based speech codes
pMTG (left): sound-meaning interface
Hickok & Poeppel (2000), Trends in Cognitive Sciences; Hickok & Poeppel (in press), Cognition
Asymmetric sampling in time (AST) builds on anatomical symmetry but permits functional asymmetry.
a. Physiological lateralization: symmetric representation of spectro-temporal receptive fields in primary auditory cortex; temporally asymmetric elaboration of perceptual representations in non-primary cortex. The proportion of neuronal ensembles, plotted against the size of the temporal integration window, peaks near 25 ms (~40 Hz) in LH and near 250 ms (~4 Hz) in RH.
b. Functional lateralization: analyses requiring high temporal resolution (e.g., formant transitions) → LH; analyses requiring high spectral resolution (e.g., intonation contours) → RH.
Conclusion
The input signal (e.g., speech) must interface with higher-order symbolic representations of different types (e.g., segmental representations relevant to lexical access and supra-segmental representations relevant to interpretation). These higher-order representation categories appear to be lateralized (e.g., segmental phonology/LH, phrasal prosody/RH). The timing-based asymmetry provides a possible cortical 'logistical' or 'administrative' device that helps create representations of the appropriate granularity. If this is on the right track, the syllable is, at least for perception, as elementary a unit as the feature/segment. Both are basic.
Analysis-by-synthesis I
Analysis path: peripheral auditory processing → spectral representation → segmentation and labeling → recoding into a lexical access code → MATCHING PROCESS → BEST LEXICAL CANDIDATE.
Synthesis path: long-term memory (abstract lexical representations) plus contextual information → hypothesize-and-test models → acoustic-phonetic manifestations of words → MATCHING PROCESS.
Where do the candidates for synthesis come from?
Analysis-by-synthesis II
Analysis-by-synthesis model of lexical hypothesis generation and verification (adapted and extended from Klatt, 1979):
speech waveform → peripheral and central spectral analysis ('neurogram') → segmental analysis (partial feature matrix) → lexical search (lexical hypotheses) → best-scoring lexical candidates → syntactic/semantic analysis → acceptable word string.
Analysis-by-synthesis verification ("internal forward model") feeds back predicted subsequent items.
Analysis-by-synthesis III: candidate neural substrates
- spectral analysis (peripheral and central 'neurogram'): auditory cortex
- segmental analysis: pSTG?
- lexical search: MTG? ITG?
- analysis-by-synthesis verification ("internal forward model"): frontal areas (articulatory codes), left IFG, premotor; temporo-parietal areas?