Fine-grained Language Identification with Multilingual CapsNet Model

  • Slides: 36
Fine-grained Language Identification with Multilingual CapsNet Model
Mudit Verma (1) & Arun Balaji Buduru (2)
1. Arizona State University, USA
2. IIIT-Delhi, India

Need • Emergency Call Routing Services • Intelligent Voice Assistants • Conversational AI • Speech is the easiest form of communication for humans.

Trends • Language Embeddings (i-vector) • Use Hand Crafted Features (& Phoneme Detection) • Mel Frequency Cepstral Coefficients • Spectrograms
P. Verma and P. K. Das, “i-vectors in speech processing applications: a survey,” International Journal of Speech Technology, vol. 18, no. 4, pp. 529–546, Dec 2015. [Online]. Available: https://doi.org/10.1007/s10772-015-9295-3

Trends View it as a classification Problem • SVM / GMM / HMM • Logistic Regression • Fully connected Neural Networks • BLSTM • CNN (based on VGG)

Issues • Manual Feature Extraction is hard • Data Requirements • Robustness to Noise

Fine-Grained LID Problem Characteristics: 1. Short Spoken Audio Snippets (5 s to 10 s) 2. Multiple Languages 3. Noise 4. Exiguous Train Data 5. Trivial Data Collection 6. Non-Class Identification 7. Multilingual

Dataset - Languages • Arabic - AR* • Bengali - BE* • Chinese (Mand.) - CH* • English - EN* • Hindi - HI* • Turkish** • Spanish** • Japanese** • Punjabi** • Portuguese**
*used for Language Identification **used for Non-Class Tests
• Deals with Indian languages alongside the more popular languages used for the LID task • Diverse set of languages • Exiguous data requirements help with regional LID

Dataset - Collection
Type: Audio recordings of local and global news, interviews, speeches etc.
Source: YouTube
Data Size: 100 hrs / language for LID Task (-> 500 hrs total); 30 hrs / language for Non-Class Task (-> 150 hrs total)
Train / Test Size: 70-30 for LID; 20-10 for Non-Class

Dataset - Characteristics
• Trivial and Easy Data Collection
• Various Types of Noise:
1. Heard/Not-Understandable (spoken language is not understood): background noise of cheers, slogans
2. Heard/Understandable (multiple spoken languages): interviews/news reporting in multiple languages
3. Unheard (noise but not spoken language): chimes/mic noise

Dataset - Processing • .wav format • Spectrogram Representation • Discretize using Hann window & 129 frequency bins • 8-bit grayscale
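
The processing steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `gray_spectrogram`, the 256-sample frame (which yields 256 / 2 + 1 = 129 frequency bins), the 128-sample hop, the log compression, and the 16 kHz sample rate are all assumptions chosen to match the "Hann window, 129 frequency bins, 8-bit grayscale" description.

```python
import numpy as np

def gray_spectrogram(samples, frame_len=256, hop=128):
    """Return an 8-bit grayscale spectrogram of shape (frame_len // 2 + 1, n_frames)."""
    samples = np.asarray(samples, dtype=np.float64)
    window = np.hanning(frame_len)  # Hann window applied to every frame
    n_frames = 1 + (len(samples) - frame_len) // hop
    frames = np.stack(
        [samples[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Real FFT of a 256-sample frame gives 129 frequency bins.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    log_mag = np.log1p(magnitude)  # compress dynamic range (assumed step)
    span = log_mag.max() - log_mag.min()
    scaled = (log_mag - log_mag.min()) / (span if span > 0 else 1.0)
    # Quantize to 8-bit grayscale; rows are frequency bins, columns are time.
    return np.round(scaled * 255).astype(np.uint8).T

# Usage on a synthetic 5-second, 440 Hz tone (stand-in for a decoded .wav snippet).
rate = 16000  # assumed sample rate
t = np.arange(5 * rate) / rate
tone = np.sin(2 * np.pi * 440.0 * t)
image = gray_spectrogram(tone)
```

The resulting array can be saved or fed to an image classifier directly, which is what treating LID "in the image domain" amounts to.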

Work • Handle Problem in Image Domain • Use Capsule Networks (CapsNet) for classification • Compare with variants of CNN: CNN + Bi-GRU, CNN + Bi-LSTM, CNN + Bi-GRU + Attention • Test deeper variant of CapsNet • Verify Non-Class Detection (Out of Distribution Samples)

Image Domain • Use Spectrogram • Mel-Frequency Cepstral Coeff. do not help much.

CapsNets - Theory • CNNs are great but they have a problem. They have: • Positional Invariance (thanks to pooling layers) • Tolerant to ViewPoint Invariance
S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in Advances in Neural Information Processing Systems, 2017, pp. 3856–3866.

CapsNets - Theory • Solves the “Picasso Problem” • CapsNets replace scalar-output feature detectors with vector-output capsules and max-pooling with routing-by-agreement. • Because capsules are independent, when multiple capsules agree, the probability of correct detection is much higher.
S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in Advances in Neural Information Processing Systems, 2017, pp. 3856–3866.
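
To make "vector-output capsules" and "routing-by-agreement" concrete, here is a small NumPy sketch of the squash non-linearity and the dynamic routing loop described by Sabour et al. It is an illustration under assumptions, not the model from these slides: the names `squash` and `route`, the 3 routing iterations, and the toy sizes (8 input capsules, 16-dimensional outputs) are chosen for the example; only the 5 output capsules mirror the 5 LID languages.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Shrink vector length into [0, 1) while keeping its direction."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def route(u_hat, n_iters=3):
    """Routing-by-agreement over predictions u_hat of shape (n_in, n_out, dim)."""
    n_in, n_out, _ = u_hat.shape
    logits = np.zeros((n_in, n_out))  # routing logits, start uniform
    for _ in range(n_iters):
        # Softmax over output capsules: how each input splits its vote.
        shifted = logits - logits.max(axis=1, keepdims=True)
        coupling = np.exp(shifted)
        coupling /= coupling.sum(axis=1, keepdims=True)
        # Weighted sum of predictions, then squash to get output capsules.
        s = np.einsum("ij,ijd->jd", coupling, u_hat)
        v = squash(s)
        # Agreement (dot product) raises the logit for matching outputs.
        logits = logits + np.einsum("ijd,jd->ij", u_hat, v)
    return v, coupling

# Usage with random predictions: 8 input capsules, 5 class capsules, 16 dims.
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 5, 16))
v, coupling = route(u_hat)
lengths = np.linalg.norm(v, axis=-1)  # one score in [0, 1) per class
```

The length of each output vector plays the role of the class probability, which is why agreement among several lower-level capsules translates into a more confident detection.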

CapsNets Architecture

CapsNets Architecture: Bottleneck

Baseline – CNN-(RNN)Attention
C. Bartz, T. Herold, H. Yang, and C. Meinel, “Language identification using deep convolutional recurrent neural networks,” in International Conference on Neural Information Processing. Springer, 2017, pp. 880–889.

Non-Class Detection • Verification Step. • Is CapsNet more robust* than baseline? • Thresholding Mechanism
*robustness here is over several languages.
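
The slides do not spell out the thresholding mechanism, so the following is only one plausible reading: accept the top-scoring language when its score clears a threshold, and otherwise flag the sample as non-class. The function name `detect_language`, the 0.5 threshold, and the example scores are hypothetical; for a CapsNet the scores would typically be the output capsule lengths.

```python
LANGUAGES = ["AR", "BE", "CH", "EN", "HI"]  # the five LID languages

def detect_language(scores, labels=LANGUAGES, threshold=0.5):
    """Return the best label, or None when no class is confident enough."""
    if len(scores) != len(labels):
        raise ValueError("need exactly one score per label")
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] < threshold:
        return None  # treated as a non-class (out-of-distribution) sample
    return labels[best]

# Usage with made-up scores.
in_class = detect_language([0.10, 0.20, 0.90, 0.30, 0.10])
non_class = detect_language([0.20, 0.30, 0.25, 0.10, 0.35])
```

In practice the threshold would be tuned on held-out data from both the in-class and non-class language sets.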

Results

Results: CapsNet is bad. Or is it?

Results: ROC Curve & AUC Score for Languages (1-5) by CapsNet (with Midcaps Layers), 5 second audio input. Left: Test Data. Right: Train Data.

Results: ROC Curve & AUC Score for Languages (1-5) by Bi-GRU, 10 second audio input. Left: Test Data. Right: Train Data.
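
For readers unfamiliar with per-language AUC scores, the sketch below shows one common way to compute them: treat each language as a one-vs-rest binary problem and measure how often a positive sample outranks a negative one. The helper names `roc_auc` and `one_vs_rest_auc` and the toy data are illustrative assumptions; the slides do not say how the reported scores were computed.

```python
def roc_auc(binary_labels, scores):
    """AUC as the chance a random positive outranks a random negative (ties count half)."""
    positives = [s for y, s in zip(binary_labels, scores) if y == 1]
    negatives = [s for y, s in zip(binary_labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def one_vs_rest_auc(true_labels, score_rows, classes):
    """Compute one AUC per class from per-sample score rows (one column per class)."""
    result = {}
    for k, name in enumerate(classes):
        binary = [1 if y == name else 0 for y in true_labels]
        result[name] = roc_auc(binary, [row[k] for row in score_rows])
    return result

# Usage on a toy two-language example with made-up scores.
classes = ["HI", "EN"]
true_labels = ["HI", "EN", "HI", "EN"]
score_rows = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.1, 0.9]]
auc = one_vs_rest_auc(true_labels, score_rows, classes)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based formulation is the usual choice for large test sets.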

Results: See a pattern? HI -> AR -> BE ~ CH > EN. Should we be expecting this order?

Results: Non-Class Detection for languages (6-10) • Why is Japanese low? • Why is Punjabi high? • Portuguese & Spanish • Turkish ~ Arabic

Results – CapsNet is Multilingual
S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. Weinstein, and K. Rao, “Multilingual speech recognition with a single end-to-end model,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4904–4908.

Future Work • Recurrent Layers with CapsNet • NAS over CapsNet architectures • Non-Class Detection

Thank You