Building & Applying Emotion Recognition
Cristian Canton, Microsoft, @CristianCanton
Anna S. Roth, Microsoft, @AnnaSRoth
Microsoft Cognitive Services
Vision: Computer Vision | Emotion | Face | Video
Speech: Custom Recognition | Speech
Language: Bing Spell Check | Language Understanding | Linguistic Analysis | Text Analytics | Web Language Model
Knowledge: Academic Knowledge | Entity Linking | Knowledge Exploration | Recommendations
Search: Bing Autosuggest | Bing Image Search | Bing News Search | Bing Video Search | Bing Web Search
We’re hiring!
Goals
• Emotion as a subjective problem
• Building an image classifier end-to-end
The Recipe: Data collection → Tagging → Aggregation → Data preprocessing → Architecture selection → Cost function → Training
“Emotion”
Basic Emotions
FACS
Emotion      Action Units
Happiness    6+12
Sadness      1+4+15
Surprise     1+2+5B+26
Fear         1+2+4+5+7+20+26
Anger        4+5+7+23
Disgust      9+15+16
Contempt     R12A+R14A
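The table above can be held as a simple lookup structure. The following is a minimal sketch, not the speakers' tooling: the names `FACS_PROTOTYPES` and `match_emotions` are hypothetical, and the intensity (A, B) and laterality (R) markers from the table are deliberately dropped so that each emotion is just a set of action-unit numbers.

```python
# Hypothetical lookup built from the FACS table above.
# Assumption: intensity letters (A, B) and the laterality prefix (R) are ignored.
FACS_PROTOTYPES = {
    "happiness": {6, 12},
    "sadness": {1, 4, 15},
    "surprise": {1, 2, 5, 26},
    "fear": {1, 2, 4, 5, 7, 20, 26},
    "anger": {4, 5, 7, 23},
    "disgust": {9, 15, 16},
    "contempt": {12, 14},
}


def match_emotions(active_units):
    """Return, sorted by name, every emotion whose full set of action units is active."""
    active = set(active_units)
    return sorted(name for name, units in FACS_PROTOTYPES.items() if units <= active)


print(match_emotions([6, 12, 25]))
```

Because the prototypes overlap (the surprise units are a subset of the fear units), one face can match several emotions under this simplified rule.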
Circumplex
• Valence axis: negative to positive
• Arousal axis: low to high
Lots of models of emotion
• Lövheim cube (image is CC-BY-SA-4.0 from Wikimedia user “Fred The Oyster”: https://en.wikipedia.org/wiki/File:L%C3%B6vheim_cube_of_emotion.svg)
• Plutchik wheel (image is public domain, from https://en.wikipedia.org/wiki/File:Plutchik-wheel.svg)
Subjective (dog image is CC-BY-SA-4.0 from Wikimedia user “Edmontcz”)
Subjective
Very Subjective
Other Subjective Problems
• Attractiveness
• Personality traits
• Style
FER Data – used for early academic work
http://www-etud.iro.umontreal.ca/~goodfeli/fer2013.html
In-house Data Collection
• 4.5 million web-crawled images
• Emotional keywords
• Names
Tagging
FACS
• More accurate and less subjective.
• Easy to expand to more emotions.
• Con: expensive and requires a certified tagger.
Appearance-based
• Cheap and doesn’t require a certified tagger.
• Con: crowdsourcing is very noisy.
Crowdsourced Tagging
• Each tagger chooses one of the 8 emotions, “unknown”, or “not a face”.
• We started by requiring at least 2 taggers to agree, using up to 5 taggers.
• Quality was very bad, especially for subtle emotions.
• We retagged all our data with 10 taggers.
• Quality improved drastically (detailed next).
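Turning raw tagger votes into a per-image label distribution and an agreement score could look like the sketch below. It is an illustration under stated assumptions: the label strings follow the public FER+ label set, the function name `aggregate_votes` is hypothetical, and agreement is defined here as the share of all taggers who picked the most-voted emotion, which may differ from the measure used in the talk.

```python
from collections import Counter

# Assumed label strings (8 emotions plus the two escape options on the slide).
EMOTIONS = ["neutral", "happiness", "surprise", "sadness",
            "anger", "disgust", "fear", "contempt"]
ESCAPE_OPTIONS = ["unknown", "not a face"]


def aggregate_votes(votes):
    """Return (distribution over emotions, majority agreement) for one image.

    The distribution is normalized over emotion votes only; agreement is the
    fraction of all taggers who chose the most-voted emotion.
    """
    counts = Counter(votes)
    emotion_votes = sum(counts[e] for e in EMOTIONS)
    if emotion_votes == 0:
        # Every tagger chose "unknown" or "not a face".
        return None, 0.0
    distribution = {e: counts[e] / emotion_votes for e in EMOTIONS}
    agreement = max(counts[e] for e in EMOTIONS) / len(votes)
    return distribution, agreement


# Ten taggers: 5 happiness, 4 surprise, 1 fear.
votes = ["happiness"] * 5 + ["surprise"] * 4 + ["fear"]
distribution, agreement = aggregate_votes(votes)
print(distribution["happiness"], agreement)
```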
How many taggers do we need?
[Plot “Number of taggers needed”: agreement percentage (0 to 100) vs. number of taggers (0 to 10)]
FER+
https://github.com/Microsoft/FERPlus
Unreliable?
Input data
Training pipeline: Input data → Preprocessing → DNN architecture → Cost function → Training
Data pre-processing
• Present the data to the system in a more or less homogeneous way
• Reduce the variability of the input data by exploiting any known characteristics
• In our case:
  • Grayscale conversion
  • Image cropping and scaling to the input size
  • No frontalization (cf. DeepFace, Taigman et al., 2014)
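The preprocessing steps listed above (grayscale conversion, cropping, scaling) can be sketched in plain Python on nested lists. This is only an illustration: the helper names, the `(top, left, height, width)` face-box format, the BT.601 luma weights, nearest-neighbour resampling, and the default input size of 64 are all assumptions, not details given in the talk.

```python
def to_grayscale(rgb):
    """Convert rows of (r, g, b) pixels to luminance (ITU-R BT.601 weights, assumed)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]


def crop(img, top, left, height, width):
    """Cut out the face region given a (hypothetical) detector box."""
    return [row[left:left + width] for row in img[top:top + height]]


def resize_nearest(img, out_h, out_w):
    """Scale to the network input size with nearest-neighbour sampling."""
    in_h, in_w = len(img), len(img[0])
    return [[img[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def preprocess(rgb, face_box, size=64):
    """Grayscale, crop to the face box, then scale to size x size."""
    top, left, height, width = face_box
    gray = to_grayscale(rgb)
    face = crop(gray, top, left, height, width)
    return resize_nearest(face, size, size)
```

A real pipeline would more likely use an image library with proper interpolation; the point here is only the order of operations.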
Data pre-processing: Augmentation Rotation
Data pre-processing: Augmentation Translation
Data pre-processing: Augmentation Scaling
Data pre-processing: Augmentation Flip
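The four augmentations named on these slides (rotation, translation, scaling, flip) can be combined into one random transform. The sketch below is a pure-Python illustration with nearest-neighbour sampling; the function names and the parameter ranges (±10 degrees, 0.9 to 1.1 scale, ±2 pixels, 50% flip) are assumptions chosen for the example, not values from the talk.

```python
import math
import random


def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def affine_nearest(img, angle_deg=0.0, scale=1.0, dx=0, dy=0, fill=0):
    """Rotate and scale about the image centre, then shift by (dx, dy)."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    angle = math.radians(angle_deg)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: locate the source pixel for each output pixel.
            u, v = x - dx - cx, y - dy - cy
            sx = (cos_a * u + sin_a * v) / scale + cx
            sy = (-sin_a * u + cos_a * v) / scale + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out


def random_augment(img, rng=random):
    """Apply one random rotation, scaling, translation and optional flip."""
    out = affine_nearest(
        img,
        angle_deg=rng.uniform(-10, 10),   # assumed rotation range
        scale=rng.uniform(0.9, 1.1),      # assumed scale range
        dx=rng.randint(-2, 2),            # assumed shift range, in pixels
        dy=rng.randint(-2, 2),
    )
    return flip_horizontal(out) if rng.random() < 0.5 else out
```

Pixels that map outside the source image are filled with a constant, which keeps the output the same size as the input.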
DNN Architecture
• It is very difficult to predict the performance of a given DNN architecture for a particular problem
• Explored several deep architectures: VGG16, VGG19, ResNet-50, ResNet-101
• Commodity architectures
Cost Function
• Links the information distilled from the tags to the cost function
• Softmax and cross-entropy
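For reference, the standard softmax and cross-entropy definitions are written out below. The notation is assumed here rather than taken from the slides: $z_k$ is the network output for emotion $k$, $p_k$ the predicted probability, $t_k$ the target probability derived from the tags, and $K$ the number of emotion classes.

```latex
p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)},
\qquad
\mathcal{L} = -\sum_{k=1}^{K} t_k \log p_k
```

With a one-hot target the loss reduces to $-\log p_c$ for the tagged class $c$; with a soft target it is smallest when the predicted distribution matches the tagger distribution.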
Emotion Probability Distribution
Example votes: Happiness 5, Surprise 4, Fear 1
• Majority Voting (MV): each face is associated with one emotion, the one with the majority vote.
• Probabilistic Drawing (PLD): during training, draw the target emotion according to its probability.
• Multi-Label Learning (ML): all emotions above a certain threshold are treated as valid emotions.
• Cross-Entropy Loss (CEL): learn the actual probability distribution.
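How each scheme turns a tagger distribution into a training target can be sketched as follows. This is an illustration only: the function names are hypothetical, the 0.3 threshold for ML is an assumed value, and the way the ML mask enters the loss is left out because the slides do not specify it.

```python
import math
import random


def majority_voting_target(dist):
    """MV: one-hot target on the most-voted emotion."""
    best = max(range(len(dist)), key=lambda k: dist[k])
    return [1.0 if k == best else 0.0 for k in range(len(dist))]


def probabilistic_draw_target(dist, rng=random):
    """PLD: sample a one-hot target according to the distribution.

    Intended to be called again each time the example is seen in training.
    """
    drawn = rng.choices(range(len(dist)), weights=dist, k=1)[0]
    return [1.0 if k == drawn else 0.0 for k in range(len(dist))]


def multi_label_targets(dist, threshold=0.3):
    """ML: mark every emotion above the (assumed) threshold as valid."""
    return [1.0 if p > threshold else 0.0 for p in dist]


def cross_entropy(dist, predicted, eps=1e-12):
    """CEL: cross-entropy between the tagger distribution and the prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(dist, predicted))


# Happiness 5, surprise 4, fear 1 out of 10 votes.
dist = [0.5, 0.4, 0.1]
print(majority_voting_target(dist))
print(multi_label_targets(dist))
```

In this example MV discards the four surprise votes entirely, while PLD, ML and CEL each keep some of that information in the target.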
Emotion Probability Distribution: training results (on FER+)
Scheme   Accuracy
MV       83.85 ± 0.63%
ML       83.97 ± 0.36%
PLD      84.99 ± 0.37%
CEL      84.72 ± 0.24%
The Recipe: Data collection → Tagging → Aggregation → Data preprocessing → Architecture selection → Cost function → Training → ... Future work?
Video
Emotion in Video
• Difficulties:
  • Temporal component of expressions
  • Necessity to track the face over time
  • Data tagging
• Potential approaches:
  • Frame-by-frame analysis + temporal aggregation
  • Fully train an RNN or LSTM (data hungry!)
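The first approach, frame-by-frame analysis followed by temporal aggregation, can be sketched with simple averaging. The aggregation rule is an assumption (the slides do not say which one is meant), and `smooth` and `clip_label` are hypothetical names; the per-frame probabilities are taken to come from an image classifier like the one described earlier.

```python
def smooth(frame_probs, window=5):
    """Moving average of per-frame emotion probabilities over the last `window` frames."""
    smoothed = []
    for i in range(len(frame_probs)):
        chunk = frame_probs[max(0, i - window + 1): i + 1]
        smoothed.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return smoothed


def clip_label(frame_probs, labels):
    """Average the probabilities over the whole clip and pick the top emotion."""
    mean = [sum(col) / len(frame_probs) for col in zip(*frame_probs)]
    best = max(range(len(mean)), key=lambda k: mean[k])
    return labels[best], mean


# Three frames, two emotions (illustrative numbers).
frames = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
label, mean = clip_label(frames, ["happiness", "surprise"])
print(label)
```

Averaging needs no video-level training data, which is the main contrast with the RNN/LSTM option, but it ignores the order of the frames.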
Multimodal Future
Multimodal Emotion
• Combine audio + video in sequences to improve the emotion recognition rate
• Combine audio + text to improve the recognition rate