Version WS 2007-8
Speech Science XIII
Speech perception is special (German accompanying notes)

Topics
• Speech perception as simple pattern matching?
• Evidence for and against a “speech mode” of speech perception
• A bird’s-eye view of the perception landscape
• Reading: BHR (3rd ed.), chapter 6 (part 2), pp. 203-229; (5th ed.), chap. 11, pp. 237-272. P.-M., 3.2.2 (part 2) + 3.2.3, pp. 162-173

Speech perception as pattern matching
• The “acoustic cue” concept suggests acoustic patterns which can be stored in memory (learned).
• But the huge variability in the acoustic structure of any linguistic unit (sound, syllable or word) argues against a simple pattern-matching mechanism.
• The issue of how much of the variability is stored and used when perceiving speech divides scientists. The brain is very powerful, but how is the power used?
• Most agree that we don’t just (passively) receive input, but that we actively work with it to create our percepts. But how?
  – We look first at vowels

How do we deal with vowels?
• Vowel formants vary greatly with the size of the vocal tract.
• But formants change in relation to one another, and they change together with other properties (e.g. F0: children vs. adults; women vs. men).
• The relative values of formants have therefore been examined.
• We do change our interpretation of formant values a) as a function of (very) different F0 values, and b) as a function of preceding formant values.
• And our two-formant model of vowels is not reality.

Two or more formants?
[Figure: two-formant synthetic vowels which best match natural vowels (after Carlsson et al. 1975, Fig. 1)]

F0 as a factor in perceived vowel quality?
For a 140 Hz fundamental, the same vowels are generally perceived with F1 values 80 Hz lower than for a 280 Hz F0 (after Miller 1953).

Vowels relative to preceding context
Ladefoged and Broadbent (1957) demonstrated that the size of the speaker producing a carrier phrase (and therefore the values of that speaker’s vowel formants) affected the interpretation of the test words at the end of the carrier phrase. (The test words themselves were not produced by different speakers.)
What matters is the relation of the carrier-phrase formants to the test-word formant values (e.g. “F1 up” = higher carrier-phrase formants, therefore the test word is heard as less open, i.e. as having a lower F1).
[Figure: responses /bIt/, /bet/, /b t/ as a function of the formants of the carrier relative to the test word]
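The direction of the carrier-phrase effect can be illustrated with a minimal sketch. It assumes, purely for illustration, that the listener judges the F1 of the test vowel as a ratio to the mean F1 of the carrier phrase and picks the nearest stored prototype. The function name `perceive`, the prototype ratios and all formant values are hypothetical and are not taken from Ladefoged and Broadbent's data.

```python
from statistics import mean

# Hypothetical prototypes: F1 of the test vowel divided by the mean F1 of the
# carrier phrase. Illustrative values only, not measured data.
PROTOTYPES = {"bit": 0.85, "bet": 1.10}


def perceive(test_f1_hz, carrier_f1_hz):
    """Label the test vowel by its F1 relative to the carrier-phrase mean F1."""
    ratio = test_f1_hz / mean(carrier_f1_hz)
    # Choose the prototype whose relative F1 is closest to the observed ratio.
    return min(PROTOTYPES, key=lambda word: abs(PROTOTYPES[word] - ratio))


test_f1 = 500.0                        # the same test word in both conditions
low_carrier = [380.0, 450.0, 520.0]    # assumed carrier with low F1 values
high_carrier = [500.0, 590.0, 680.0]   # assumed carrier with raised F1 values

print(perceive(test_f1, low_carrier))   # relatively high F1: more open vowel
print(perceive(test_f1, high_carrier))  # relatively low F1: less open vowel
```

The acoustically identical test word receives a different label depending on the carrier, which is the pattern the slide describes: raising the carrier-phrase F1 makes the test vowel sound less open.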

Immediate vs. wider context
• The carrier-phrase influence shows effects of wider context. The F0 effect is vowel-intrinsic, but average F0 over a phrase also provides a wider F0 context.
• So one important question is whether we simply change the frame within which we process vowel formants according to the information about the speaker that we collect during the utterance.
• This would mean that vowels would be more difficult to identify at the beginning of utterances (from unknown speakers!), i.e. vowels offered with no prior information. Is this the case?

Isolated vowels vs. vowels in syllabic context
Formants rarely stay constant for long in C_C syllabic context. This could lead to the assumption that isolated vowels with well-defined, steady-state formants should be identified with more certainty. But Stevens (1968) showed that steady-state isolated vowels are, in fact, less well identified than syllable-context vowels.

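Syllabic context 2
Strange et al. (1976) showed that the effect of syllabic context was more important (21.7-25.6% difference) than the effect of listening to one speaker at a time (7.5-11.4% difference).

Percent errors (cell assignment reconstructed from the differences quoted above):

                                Mixed speakers   One speaker at a time
Isolated vowels                     42.6%               31.2%
Vowels in syllabic context          17.0%                9.5%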
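The quoted differences can be checked against the four error rates with a few lines of arithmetic. The assignment of the four percentages to conditions is an assumption here: it is the only layout that reproduces both ranges, with the isolated vowels and the mixed-speaker condition taken to be the harder ones. The dictionary keys are hypothetical labels.

```python
# Percent errors, keyed by (vowel context, speaker condition).
# The assignment of values to cells is an assumption inferred from the
# differences quoted on the slide.
errors = {
    ("isolated", "mixed"): 42.6,
    ("isolated", "single"): 31.2,
    ("syllable", "mixed"): 17.0,
    ("syllable", "single"): 9.5,
}

# Effect of syllabic context, computed within each speaker condition.
context_effect = {
    spk: round(errors[("isolated", spk)] - errors[("syllable", spk)], 1)
    for spk in ("mixed", "single")
}

# Effect of hearing one speaker at a time, computed within each context.
speaker_effect = {
    ctx: round(errors[(ctx, "mixed")] - errors[(ctx, "single")], 1)
    for ctx in ("isolated", "syllable")
}

print(context_effect)   # {'mixed': 25.6, 'single': 21.7}
print(speaker_effect)   # {'isolated': 11.4, 'syllable': 7.5}
```

Under this layout the context effect (21.7 to 25.6 percentage points) is roughly two to three times the speaker effect (7.5 to 11.4 percentage points).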

The importance of vowel-target info. vs. vowel dynamics
Verbrugge & Rakerd (1986) investigated the contribution of the dynamic movement information vs. the “vowel-defining” target information. The whole syllable was clearly easiest to recognise (91.7%). But even if the central target section was missing, almost 80% were correctly identified.

The Motor Theory of Speech Perception
• The assumption of an articulatory basis to our speech perception mechanisms has been explicit for over 40 years (internationally since a landmark Speech Communication Seminar in Stockholm in 1962).
• The Haskins Laboratories (USA) presented evidence (from earlier experimental work) that:
  – We identify acoustically different stimuli as one and the same articulatorily defined speech sound.
  – We can only discriminate acoustic differences between stimuli that cross category boundaries, although the differences within categories are just as great.

Categorical Perception
[Figure: number of judgements for each category (/b/, /d/, /g/) and discriminability of stimulus pairs, plotted over a series of 15 acoustically equidistant stimuli]
E.g., 1 is a typical /b/ F2 transition, 8 is a typical /d/ transition and 15 is a typical /g/ transition. Stimuli 2-7 and 9-14 are steps between these typical stimuli.
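The link between the identification curve and the discrimination peak can be sketched for the /b/-/d/ half of the series (stimuli 1 to 8). The sketch assumes a logistic identification function with a hypothetical boundary and slope, and it assumes that listeners in an ABX task rely only on their covert category labels, guessing when both stimuli receive the same label. Under that assumption the predicted proportion correct for a pair is 0.5 * (1 + (p1 - p2)^2), where p1 and p2 are the probabilities of labelling each stimulus as /b/. The names `p_b` and `predicted_abx` are illustrative.

```python
import math

BOUNDARY = 4.5  # hypothetical /b/-/d/ boundary, in stimulus steps
SLOPE = 2.0     # hypothetical steepness of the identification function


def p_b(stimulus):
    """Assumed logistic probability of labelling a stimulus as /b/."""
    return 1.0 / (1.0 + math.exp(SLOPE * (stimulus - BOUNDARY)))


def predicted_abx(s1, s2):
    """Predicted proportion correct in ABX if only the labels are used."""
    return 0.5 * (1.0 + (p_b(s1) - p_b(s2)) ** 2)


# Adjacent pairs are acoustically equidistant, yet only the pair that
# straddles the boundary is predicted to rise clearly above chance (0.5).
for s in range(1, 8):
    print(f"pair {s}-{s + 1}: {predicted_abx(s, s + 1):.2f}")
```

Pairs well inside a category stay near chance, while the pair spanning the boundary shows the discrimination peak, which is the signature of categorical perception shown in the figure.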

Categorical Perception 2
• Further experiments with many other acoustic properties which come from articulations that are not categorically separable (VOT, /l/ vs. /r/, vowel categories, etc.) brought about a theoretical modification.
• Categorical perception is “acquired”, and the increased distinctiveness between categories is also acquired. The low-sensitivity baseline between the category boundaries can be seen as psychoacoustically normal sensitivity.
• Normal perception in persons with disturbed articulation induced a theoretical fall-back to a position where the link between perception and production was more abstract. This position was referred to as “the speech mode” of perception. It still made speech perception special.

The Speech Mode of perception
• Many experiments showed that the functional goal of speech perception made it special:
• Dichotic signals (different parts played into the left and right ear) were heard as one speech sound, but the separate elements were still audible.
• Separate words played into the left and right ear were heard as one word if the sounds of the two words could combine, e.g. “pay” + “lay” → “play”. This was heard even if the /l/ started before the release of the /p/!
• Even more dramatic is the perceptual “switch” which can occur with “sine analogue speech”. Some people hear it as strange music until they are asked whether they can understand what is being said. They then hear it as speech (and cannot switch back to the music mode).

Other influences on phonetic perception: Visual Information
• The prime input in speech perception is the acoustic signal, but we can also often see the person who is speaking and therefore have subconscious knowledge of the visual information accompanying the acoustics.
• A laboratory mistake led to the discovery that a video clip of a spoken /ga/ together with the acoustic signal of /ba/ is often perceived as /da/. Acoustic /ga/ with a video of /ba/, on the other hand, is heard as /ba/.
• This “McGurk” effect (after the person who discovered it) has since been systematically investigated. It confirms that we cannot ignore visual information, but the synchronisation must be accurate for fusion to take place.

Semantic influences
Along a stimulus series with a real word at one end and a non-word at the other, there is an effect of almost 25% in favour of recognising the real word rather than the non-word: the Ganong effect.

Anti-Speech-Mode
• There are still many scientists who consider the speech-mode approach too much like “hocus pocus”. They concentrate on a more direct relationship between the acoustic signal and the percept.
• Stevens’ “quantal theory” of (plosive) perception rests on the fact that /t, d/ tend to have high-frequency energy, /g, k/ have middle-frequency energy, and /b, p/ have low-frequency energy. Therefore, the same relative acoustic information serves the distinction independent of context.
• “Feature detectors” have been another attempt to link the acoustic signal directly with the linguistic units in a more passive model of speech perception. Animals have high-level neuronal detectors linked to vital functions, so why not humans?