Cues to Emotion: Language
Suzanne Yuen
Monday Oct 5, 2009
COMS 6998

Overview
• Two-Stream Emotion Recognition for Call Center Monitoring
• Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis

Two-Stream Emotion Recognition for Call Center Monitoring
• Background: To aid supervisors in evaluating agents at call centers*
• Objective: To present a two-stream processing technique for detecting strong emotion
• Previous work:
  – Fernandez categorized affect into four main components: intonation, loudness, rhythm, and voice quality
  – Yang studied feature selection methods in text categorization and suggested that information gain should be used
  – Petrushin and Yacoub examined agitation and calm states in people-machine interaction
* A typical medium-sized call center receives about 100,000 calls per day

Two-Stream Recognition
Acoustic Stream
• Extracted features based on pitch and energy
• Trained on 900 calls, ~60 hrs of speech
• Vocabulary system of more than 10,000 words
• TF-IDF scheme = Term Frequency – Inverse Document Frequency
Semantic Stream
• Performed speech-to-text conversion
• Text classification algorithms identified phrases such as “pleasure,” “thanks,” “useless,” and “disgusting.”
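To make the TF-IDF scheme named above concrete, here is a minimal sketch of one common variant (relative term frequency times the natural log of inverse document frequency). The function name `tf_idf` and the sample utterances are hypothetical illustrations, not the paper's data, and the slides do not say which exact weighting the authors used.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = (count / doc length) * ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical transcribed utterances, reusing the example words from the slide.
utterances = [
    "thanks it was a pleasure".split(),
    "this service is useless".split(),
    "thanks this is disgusting".split(),
]
weights = tf_idf(utterances)
```

In this sketch a word that appears in only one utterance ("pleasure") gets a higher weight than one shared across utterances ("thanks"), and a word present in every document gets weight zero.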

Implementation
• Method:
  – Two streams analyzed separately:
    • speech utterance/acoustic features
    • spoken text/semantics/speech recognition of conversation
  – Confidence levels of two streams combined
  – Examined 3 emotions:
    • Neutral
    • Hot-anger
    • Happy
• Tested two data sets:
  – LDC data
  – 20 real-world call-center calls
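The slide says only that the confidence levels of the two streams were combined, without giving the rule. As an illustration, the sketch below assumes a simple weighted average of per-emotion confidences; the function `fuse`, the `acoustic_weight` parameter, and the scores are all hypothetical.

```python
def fuse(acoustic, semantic, acoustic_weight=0.5):
    """Combine per-emotion confidences from two streams by weighted average.

    This fusion rule is an assumption for illustration, not the paper's method.
    Both inputs map the same emotion labels to confidence values.
    """
    combined = {
        emotion: acoustic_weight * acoustic[emotion]
        + (1.0 - acoustic_weight) * semantic[emotion]
        for emotion in acoustic
    }
    # Pick the emotion with the highest combined confidence.
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical confidences for one utterance.
acoustic = {"neutral": 0.5, "hot-anger": 0.3, "happy": 0.2}
semantic = {"neutral": 0.2, "hot-anger": 0.7, "happy": 0.1}
label, scores = fuse(acoustic, semantic)
```

Here the acoustic stream alone would pick "neutral", while the combined score favors "hot-anger" because the semantic stream is far more confident in it.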

Two-Stream: Conclusion
• Table 2 suggested that two-stream analysis is more accurate than either the acoustic or the semantic stream alone
• Recognition on LDC data was significantly higher than on real-world data
• Neutral emotion was recognized less accurately
• Combining the two streams improved identification of “happy” and “anger” emotions by ~20%
• Low acoustic stream accuracy may be attributed to the length of sentences in real-world data: people do not usually vary their emotion much within a long sentence

Discussion
• Gupta analyzed three emotions (happy, neutral, hot-anger): Why break it down into these categories? Implications? Can this technique be applied to a wider range of emotions? To other applications?
• Speech-to-text may not capture the complete conversation. Would further examination greatly improve results? What are the pros and cons?
• Pitch range was 50-400 Hz. The research may not be applicable outside this range. Do you think it is necessary to examine other frequencies?
• In this paper, the TF-IDF (Term Frequency – Inverse Document Frequency) technique is used to classify utterances. Accuracy for acoustics only is about 55%. Previous research suggests that alternative techniques may be better. Would implementing them yield better results? What are the pros and cons of using the TF-IDF technique?
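To ground the question about the 50-400 Hz pitch range, the sketch below shows one standard way to compute the two kinds of acoustic features mentioned earlier (energy and pitch). It assumes an autocorrelation pitch estimator restricted to that range; the slides do not state how the paper extracted its features, and the function names and the synthetic 200 Hz test tone are hypothetical.

```python
import math


def short_time_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def estimate_f0(frame, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate pitch as the autocorrelation peak within [f_min, f_max].

    Assumed technique for illustration. Lags outside the search range
    (that is, pitches outside 50-400 Hz by default) are never considered.
    """
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic 0.1 s frame: a 200 Hz sine sampled at 8 kHz (not real call data).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
f0 = estimate_f0(frame, sample_rate)
energy = short_time_energy(frame)
```

Because the lag search is bounded by `f_min` and `f_max`, a voice pitched outside that range would be mapped to some in-range value, which is one concrete reason the range limit matters.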

Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis
• Previous work:
  – 1995: Mozziconacci suggested that VQ combined with f0 could create affect
  – 2002: Gobl suggested that synthesized stimuli with VQ can add affective coloring. The study suggested that “VQ + f0” stimuli are more affective than “f0 only” stimuli
  – 2003: Gobl tested VQ with a large f0 range, but did not examine the contribution of affect-related f0 contours
• Objective: To examine the effects of VQ and f0 on affect expression

Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis
• 3 series of stimuli of the Swedish utterance “ja adjo”:
  – Stimuli exemplifying VQ
  – Stimuli with modal voice quality with different affect-related f0 contours
  – Stimuli combining both
• Tested parameters exemplifying 5 voice qualities (VQ):
  – Modal voice
  – Breathy voice
  – Whispery voice
  – Lax-creaky voice
  – Tense voice
• 15 synthesized stimuli test samples (see Table 1)

What is Voice Quality? Phonation Gestures
• Derived from a variety of laryngeal and supralaryngeal features
• Adductive tension: interarytenoid muscles adduct the arytenoid cartilages
• Medial compression: adductive force on the vocal processes; adjustment of the ligamental glottis
• Longitudinal tension: tension of the vocal folds

Tense Voice
• Very strong tension of vocal folds, very high tension in vocal tract

Whispery Voice
• Very low adductive tension
• Medial compression moderately high
• Longitudinal tension moderately high
• Little or no vocal fold vibration
• Turbulence generated by friction of air in and above larynx

Creaky Voice
• Vocal fold vibration at low frequency, irregular
• Low tension (only ligamental part of glottis vibrates)
• Vocal folds strongly adducted
• Longitudinal tension weak
• Moderately high medial compression

Breathy Voice
• Tension low:
  – Minimal adductive tension
  – Weak medial compression
• Medium longitudinal vocal fold tension
• Vocal folds do not come together completely, leading to frication

Modal Voice
• “Neutral” mode
• Muscular adjustments moderate
• Vibration of vocal folds periodic, full closing of glottis, no audible friction
• Frequency of vibration and loudness in low to mid range for conversational speech

Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis
• Six sub-tests with 20 native speakers of Hiberno-English
• Rated on 12 different affective attributes, presented as six opposing pairs:
  – Sad – happy
  – Intimate – formal
  – Relaxed – stressed
  – Bored – interested
  – Apologetic – indignant
  – Fearless – scared
• Participants asked to mark their response on a scale (example scale labels: Intimate, Formal, No affective load)

Voice Quality and f0 Test: Conclusion
• Categorized results into 4 groups. No simple one-to-one mapping between voice quality and affect
• “Happy” was the most difficult to synthesize
• Suggested that, in addition to f0, VQ should be used to synthesize affectively colored speech. VQ appears to be crucial for expressive synthesis

Voice Quality and f0 Test: Discussion
• If the scale runs from 1 to 7, then 3.5 should be “neutral”; however, most ratings are less than 2. Do the conclusions (see Fig 2) seem strong?
• In terms of VQ and f0, the groupings in Fig 2 seem to suggest that certain affects are closely related. What are the implications of this? For example, are happy and indignant affects closer than relaxed or formal? Do you agree?
• Do you consider an intimate voice more “breathy” or “whispery”? Does your intuition agree with the paper?
• Yanushevskaya found that VQ accounts for the highest affect ratings overall. How can range of voice quality be compared with frequency? Do you think they are comparable? Is there a different way to describe these qualities?

Questions?