Building & Applying Emotion Recognition
Cristian Canton, Microsoft, @CristianCanton
Anna S. Roth, Microsoft, @AnnaSRoth

Goals
• Emotion as a subjective problem
• Building an image classifier end-to-end

The Recipe
• Data collection
• Tagging
• Aggregation
• Data preprocessing
• Architecture selection
• Cost function
• Training

“Emotion”

Basic Emotions

FACS
Emotion     Action Units
Happiness   6+12
Sadness     1+4+15
Surprise    1+2+5B+26
Fear        1+2+4+5+7+20+26
Anger       4+5+7+23
Disgust     9+15+16
Contempt    R12A+R14A
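The emotion-to-action-unit pairs above can be held in a small lookup structure. The following is a minimal sketch, assuming the usual FACS notation in which a trailing letter A to E marks intensity and a leading R or L marks a one-sided action. The names `parse_action_units` and `matches` are hypothetical, and the "all prototype units present" rule is a simplification for illustration, not the speakers' method.

```python
import re

# Prototype action-unit combinations, copied from the slide.
FACS_PROTOTYPES = {
    "Happiness": "6+12",
    "Sadness": "1+4+15",
    "Surprise": "1+2+5B+26",
    "Fear": "1+2+4+5+7+20+26",
    "Anger": "4+5+7+23",
    "Disgust": "9+15+16",
    "Contempt": "R12A+R14A",
}

def parse_action_units(code):
    """Return the set of AU numbers in a code such as 'R12A+R14A'.

    Side prefixes (R/L) and intensity suffixes (A-E) are dropped here,
    which is a simplification for illustration.
    """
    return {int(re.search(r"\d+", part).group()) for part in code.split("+")}

def matches(detected_units):
    """List emotions whose prototype units are all present in detected_units."""
    detected = set(detected_units)
    return [
        emotion
        for emotion, code in FACS_PROTOTYPES.items()
        if parse_action_units(code) <= detected
    ]

print(matches({6, 12, 25}))
```

Because several prototypes share units (Surprise is a subset of Fear here), a detected set can match more than one emotion, which is one reason a certified tagger is still needed in practice.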

CIRCUMPLEX
[Figure: two axes. Valence: Negative to Positive. Arousal: Low to High.]

Lots of models of emotion
• Lövheim cube image is CC-BY-SA-4.0 from Wikimedia user “Fred The Oyster” - https://en.wikipedia.org/wiki/File:L%C3%B6vheim_cube_of_emotion.svg
• Plutchik wheel image is public domain, from: https://en.wikipedia.org/wiki/File:Plutchik-wheel.svg

Basic Emotions

Subjective
Dog image is CC-BY-SA-4.0 from Wikimedia user “Edmontcz”

Subjective

Very Subjective

Other Subjective Problems
• Attractiveness
• Personality traits
• Style

FER Data – used for early academic work
http://www-etud.iro.umontreal.ca/~goodfeli/fer2013.html

In-house Data Collection
• 4.5 million web-crawled images
• Emotional keywords
• Names

Tagging
FACS
• More accurate and less subjective.
• Easy to expand to more emotions.
• Con: Expensive and requires a certified tagger.
Appearance-based
• Cheap and doesn’t require a certified tagger.
• Con: Crowdsourcing is very noisy.

Crowd-Sourced Tagging
• Each tagger chooses 1 of the 8 emotions, or "unknown", or "not a face".
• We started by requiring at least 2 taggers to agree, using up to 5 taggers.
• Quality was very bad, especially with subtle emotions.
• We retagged all our data with 10 taggers.
• Quality improved drastically (detailed next).

How many taggers do we need?
[Plot “Number of taggers needed”: agreement percentage (0 to 100) vs. number of taggers (0 to 10)]
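One simple way to put a number on tagger agreement is the share of taggers who picked the most common label for a face. The sketch below assumes that definition; the metric behind the slide's plot is not stated, and the function names and sample tags are made up for illustration.

```python
from collections import Counter

def majority_agreement(tags):
    """Return (majority_label, share of taggers who chose it) for one image."""
    if not tags:
        raise ValueError("at least one tag is required")
    label, count = Counter(tags).most_common(1)[0]
    return label, count / len(tags)

def mean_agreement(tags_per_image, n_taggers):
    """Average majority share when only the first n_taggers tags are used."""
    shares = [majority_agreement(tags[:n_taggers])[1] for tags in tags_per_image]
    return sum(shares) / len(shares)

# Made-up tags for two faces, ten taggers each (illustrative data only).
example = [
    ["happiness"] * 5 + ["surprise"] * 4 + ["fear"],
    ["neutral"] * 8 + ["sadness"] * 2,
]
for n in (2, 5, 10):
    print(n, round(mean_agreement(example, n), 3))
```

With few taggers the majority share looks deceptively high; the disagreement on a subtle face only shows up once more taggers are included.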

FER+
https://github.com/Microsoft/FERPlus

Unreliable?

Input data

The Recipe
Pipeline: Input Data → Preprocessing → DNN Architecture → Cost Function → Training

Data pre-processing
• Present the data to the system in a more or less homogeneous way
• Reduce the variability of the input data by exploiting any known characteristics
• In our case:
  • Grayscale conversion
  • Image cropping and scaling to the input size
  • No frontalization (cf. DeepFace: Ranzato et al.; Taigman et al., 2014)
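The three steps named for this system (grayscale conversion, cropping, scaling to the input size) can be sketched in plain Python on nested lists. This is an illustration under assumptions, not the speakers' pipeline: the BT.601 luminance weights, nearest-neighbour scaling, the `face_box` argument (assumed to come from a face detector) and the default `input_size` are all placeholders.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) pixels to luminance using ITU-R BT.601 weights."""
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in rgb_image
    ]

def crop(image, top, left, height, width):
    """Cut out the face box; the box is assumed to come from a face detector."""
    return [row[left:left + width] for row in image[top:top + height]]

def resize_nearest(image, out_height, out_width):
    """Scale to the network input size with nearest-neighbour sampling."""
    in_height, in_width = len(image), len(image[0])
    return [
        [
            image[y * in_height // out_height][x * in_width // out_width]
            for x in range(out_width)
        ]
        for y in range(out_height)
    ]

def preprocess(rgb_image, face_box, input_size=64):
    """Grayscale, crop to the face box, scale to input_size x input_size."""
    top, left, height, width = face_box
    gray = to_grayscale(rgb_image)
    face = crop(gray, top, left, height, width)
    return resize_nearest(face, input_size, input_size)

# Tiny synthetic 4x4 image: a horizontal red ramp.
image = [[(x * 60, 0, 0) for x in range(4)] for _ in range(4)]
patch = preprocess(image, face_box=(1, 1, 2, 2), input_size=4)
print(len(patch), len(patch[0]))
```

In practice an image library would do the resampling with proper interpolation; the point here is only the order of the steps.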

Data pre-processing: Augmentation
• Rotation
• Translation
• Scaling
• Flip
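The four augmentations (rotation, translation, scaling, flip) can all be expressed as one 2x3 affine matrix applied to pixel coordinates. The sketch below is a plain-Python illustration; the function names and the random ranges in `random_augmentation` are hypothetical placeholders, not values from the talk.

```python
import math
import random

def augmentation_matrix(angle_deg, scale, shift_x, shift_y, flip, center):
    """Return a 2x3 affine matrix [[a, b, tx], [c, d, ty]].

    Points are flipped horizontally (optional), scaled and rotated about
    `center`, then translated by (shift_x, shift_y).
    """
    cx, cy = center
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta) * scale, math.sin(theta) * scale
    sx = -1.0 if flip else 1.0
    a, b = cos_t * sx, -sin_t
    c, d = sin_t * sx, cos_t
    # Choose the offset so that `center` maps to center + shift.
    tx = cx - (a * cx + b * cy) + shift_x
    ty = cy - (c * cx + d * cy) + shift_y
    return [[a, b, tx], [c, d, ty]]

def apply(matrix, point):
    """Map one (x, y) point through the affine matrix."""
    (a, b, tx), (c, d, ty) = matrix
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)

def random_augmentation(center, rng=random):
    """Draw one random transform; the ranges are illustrative placeholders."""
    return augmentation_matrix(
        angle_deg=rng.uniform(-10, 10),
        scale=rng.uniform(0.9, 1.1),
        shift_x=rng.uniform(-3, 3),
        shift_y=rng.uniform(-3, 3),
        flip=rng.random() < 0.5,
        center=center,
    )

center = (24.0, 24.0)
flip_only = augmentation_matrix(0, 1.0, 0, 0, True, center)
print(apply(flip_only, (0.0, 10.0)))
```

Composing the transforms into one matrix means the image is resampled only once per augmented sample, which avoids compounding interpolation blur.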

DNN Architecture
• It is very difficult to predict the performance of a given DNN architecture for a particular problem
• Explored several deep architectures: VGG16, VGG19, ResNet-50, ResNet-101
• Commodity architectures

Cost Function
• Links the information distilled from the tags to the cost function
• Softmax and cross-entropy
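As a reminder of the two building blocks named here, the sketch below computes a softmax over raw network outputs and the cross-entropy against a target distribution. The logits and target values are hypothetical numbers chosen for illustration.

```python
import math

def softmax(logits):
    """Turn raw network outputs into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy: -sum_k target_k * log(predicted_k)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

logits = [2.0, 1.0, 0.1]   # hypothetical network outputs for 3 classes
target = [0.5, 0.4, 0.1]   # hypothetical tag distribution
probs = softmax(logits)
print([round(p, 3) for p in probs], round(cross_entropy(target, probs), 3))
```

The target does not have to be one-hot: any distribution over emotions can be plugged in, which is what makes the choice of how to aggregate tags matter.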

Emotion Probability Distribution
Example: Happiness 5, Surprise 4, Fear 1
• Majority Voting (MV): Each face is associated with one emotion, the one that has the majority vote.
• Probabilistic Drawing (PLD): During training, draw the target emotion according to its probability.
• Multi-Label Learning (ML): All emotions above a certain threshold are treated as valid emotions.
• Cross-entropy loss (CEL): Learn the actual probability distribution.
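The four schemes differ only in how the tags become a training target. The sketch below builds each target from the example above, reading 5, 4 and 1 as tagger votes (an assumption about the slide's chart). The ML threshold of 0.3 and the form of `ml_loss` are illustrative choices, not confirmed details of the speakers' training code.

```python
import math
import random

EMOTIONS = ["happiness", "surprise", "fear"]
VOTES = [5, 4, 1]  # example from the slide, read here as tagger votes

def distribution(votes):
    """Normalise vote counts into a probability distribution."""
    total = sum(votes)
    return [v / total for v in votes]

def mv_target(votes):
    """Majority voting: one-hot vector on the most voted emotion."""
    winner = votes.index(max(votes))
    return [1.0 if k == winner else 0.0 for k in range(len(votes))]

def pld_target(votes, rng=random):
    """Probabilistic drawing: sample a one-hot target, redrawn at each use."""
    drawn = rng.choices(range(len(votes)), weights=votes, k=1)[0]
    return [1.0 if k == drawn else 0.0 for k in range(len(votes))]

def ml_mask(votes, threshold=0.3):
    """Multi-label: mark every emotion whose share exceeds the threshold."""
    return [1.0 if p > threshold else 0.0 for p in distribution(votes)]

def cel_target(votes):
    """Cross-entropy loss: the full tag distribution is the target."""
    return distribution(votes)

def cross_entropy(target, predicted, eps=1e-12):
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def ml_loss(mask, predicted, eps=1e-12):
    """One possible multi-label loss: score the best-predicted valid emotion."""
    return -math.log(max(m * p for m, p in zip(mask, predicted)) + eps)

predicted = [0.6, 0.3, 0.1]  # hypothetical softmax output
print("MV ", mv_target(VOTES), round(cross_entropy(mv_target(VOTES), predicted), 3))
print("PLD", pld_target(VOTES, random.Random(0)))
print("ML ", ml_mask(VOTES), round(ml_loss(ml_mask(VOTES), predicted), 3))
print("CEL", cel_target(VOTES), round(cross_entropy(cel_target(VOTES), predicted), 3))
```

MV discards the four surprise votes entirely, while PLD, ML and CEL each keep some of that information in the target.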

Emotion Probability Distribution
Training result (on FER+)
Scheme   Accuracy
MV       83.85 ± 0.63%
ML       83.97 ± 0.36%
PLD      84.99 ± 0.37%
CEL      84.72 ± 0.24%

The Recipe
Data collection, Tagging, Aggregation, Data preprocessing, Architecture selection, Cost function, Training ... Future work?

Video

Emotion in Video
• Difficulties:
  • Temporal component of expressions
  • Necessity to track the face over time
  • Data tagging
• Potential approaches:
  • Frame-by-frame analysis + temporal aggregation
  • Fully train an RNN or LSTM (data hungry!)
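The first approach, frame-by-frame analysis followed by temporal aggregation, can be as simple as smoothing the per-frame class probabilities of a tracked face. The sketch below uses a trailing moving average; the window size, function names and sample probabilities are hypothetical, and other aggregators would fit the same slot.

```python
def moving_average(frame_probs, window=5):
    """Average each frame's class probabilities with up to window-1 earlier frames."""
    smoothed = []
    for t in range(len(frame_probs)):
        start = max(0, t - window + 1)
        chunk = frame_probs[start:t + 1]
        n_classes = len(chunk[0])
        smoothed.append(
            [sum(frame[k] for frame in chunk) / len(chunk) for k in range(n_classes)]
        )
    return smoothed

def label_per_frame(frame_probs, labels):
    """Pick the most probable label for every frame."""
    return [labels[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]

labels = ["happiness", "surprise"]
# Hypothetical per-frame classifier outputs for one tracked face;
# the third frame is a one-frame glitch.
frames = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.9, 0.1], [0.85, 0.15]]
print(label_per_frame(frames, labels))
print(label_per_frame(moving_average(frames, window=3), labels))
```

Smoothing removes the one-frame flicker but also delays the response to a genuine change of expression, so the window length is a trade-off.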

Multimodal Future

Multimodal Emotion
• Combine audio + video in sequences to improve the emotion recognition rate
• Combine audio + text to improve the recognition rate
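The slide does not say how the modalities would be combined. One common option is late fusion, where each modality has its own classifier and the class probabilities are merged afterwards. The sketch below shows that option only as an assumption; the weights, labels and probabilities are invented for illustration.

```python
def late_fusion(modality_probs, weights):
    """Weighted average of per-modality class probabilities, renormalised."""
    names = list(modality_probs)
    total_weight = sum(weights[name] for name in names)
    n_classes = len(modality_probs[names[0]])
    return [
        sum(weights[name] * modality_probs[name][k] for name in names) / total_weight
        for k in range(n_classes)
    ]

labels = ["anger", "neutral", "happiness"]
# Hypothetical outputs of separate audio and video classifiers for one clip.
probs = {"audio": [0.5, 0.4, 0.1], "video": [0.2, 0.3, 0.5]}
fused = late_fusion(probs, weights={"audio": 0.4, "video": 0.6})
print(dict(zip(labels, (round(p, 2) for p in fused))))
```

Late fusion keeps the per-modality models independent, so a missing audio track or an occluded face only removes one term from the average.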