Fine-grained Language Identification with Multilingual CapsNet Model
Mudit Verma (1) & Arun Balaji Buduru (2)
1. Arizona State University, USA 2. IIIT-Delhi, India

Need
• Emergency call routing services
• Intelligent voice assistants
• Conversational AI
Speech is the easiest form of communication for humans.

Trends
• Language embeddings (i-vector)
• Use hand-crafted features (& phoneme detection)
  Mel-frequency cepstral coefficients
  Spectrograms
P. Verma and P. K. Das, "i-vectors in speech processing applications: a survey," International Journal of Speech Technology, vol. 18, no. 4, pp. 529-546, Dec 2015. [Online]. Available: https://doi.org/10.1007/s10772-015-9295-3

Trends
View it as a classification problem
• SVM / GMM / HMM
• Logistic regression
• Fully connected neural networks
• BLSTM
• CNN (based on VGG)

Issues
• Manual feature extraction is hard
• Data requirements
• Robustness to noise

Fine-Grained LID
Problem characteristics:
1. Short spoken audio snippets (5 s-10 s)
2. Multiple languages
3. Noise
4. Exiguous training data
5. Trivial data collection
6. Non-class identification
7. Multilingual

Dataset - Languages
Used for Language Identification (languages 1-5): Arabic - AR, Bengali - BE, Chinese (Mand.) - CH, English - EN, Hindi - HI
Used for Non-Class Tests (languages 6-10): Turkish, Spanish, Japanese, Punjabi, Portuguese
• Covers Indian languages alongside the more popular languages used for the LID task
• Diverse set of languages
• Exiguous data requirements help with regional LID

Dataset - Collection
Type: Audio recordings of local and global news, interviews, speeches etc.
Source: YouTube
Data size:
  100 hrs / language for LID task (-> 500 hrs total)
  30 hrs / language for Non-Class task (-> 150 hrs total)
Train / test size (hrs per language):
  70-30 for LID
  20-10 for Non-Class

Dataset - Characteristics
Trivial and easy data collection
Various types of noise:
1. Heard / Not-Understandable (spoken language is not understood)
   Background noise of cheers, slogans
2. Heard / Understandable (multiple spoken languages)
   Interviews / news reporting in multiple languages
3. Unheard (noise but not spoken language)
   Chimes / mic noise

Dataset - Processing
• .wav format
• Spectrogram representation
  Discretize using Hann window & 129 frequency bins
  8-bit grayscale
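The processing steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the 256-sample frame (which yields 129 one-sided frequency bins), the 128-sample hop, the log scaling, the 16 kHz sample rate and the function name `spectrogram_8bit` are all assumptions chosen for the example.

```python
import numpy as np

def spectrogram_8bit(samples, frame_len=256, hop=128):
    """Return an 8-bit grayscale log-magnitude spectrogram, shape (freq, time).

    frame_len=256 is an assumed value; it gives 256 // 2 + 1 = 129 bins.
    """
    samples = np.asarray(samples, dtype=np.float64)
    window = np.hanning(frame_len)  # Hann window
    n_frames = 1 + (len(samples) - frame_len) // hop
    frames = np.stack([
        samples[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One-sided FFT of each windowed frame: (n_frames, 129)
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    log_mag = np.log1p(magnitude)  # compress dynamic range (assumed step)
    lo, hi = log_mag.min(), log_mag.max()
    scaled = (log_mag - lo) / (hi - lo) if hi > lo else np.zeros_like(log_mag)
    # Quantize to 0..255 and put frequency on the vertical axis
    return np.round(scaled * 255).astype(np.uint8).T

# Usage: one second of a 440 Hz tone at an assumed 16 kHz sample rate
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
image = spectrogram_8bit(np.sin(2 * np.pi * 440 * t))
```

The resulting array can be saved or fed to an image classifier like any other grayscale image.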

Work
• Handle problem in image domain
• Use Capsule Networks (CapsNet) for classification
• Compare with variants of
  CNN + Bi-GRU
  CNN + Bi-LSTM
  CNN + Bi-GRU + Attention
• Test deeper variant of CapsNet
• Verify non-class detection (out-of-distribution samples)

Image Domain
• Use spectrogram
• Mel-frequency cepstral coefficients (MFCCs) do not help much.

CapsNets - Theory
• CNNs are great but they have a problem. They have:
  • Positional invariance (thanks to pooling layers)
  • Tolerant to viewpoint invariance
S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Advances in Neural Information Processing Systems, 2017, pp. 3856-3866.

CapsNets - Theory
• Solves the "Picasso Problem"
• CapsNets replace scalar-output feature detectors with vector-output capsules and max-pooling with routing-by-agreement.
• Because capsules are independent, when multiple capsules agree, the probability of correct detection is much higher.
S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Advances in Neural Information Processing Systems, 2017, pp. 3856-3866.
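Routing-by-agreement, as described by Sabour et al., can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the model from these slides: the shapes (8 input capsules, 5 output capsules, 16-dimensional vectors), the 3 routing iterations and the random prediction vectors are assumptions for the example.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def route(u_hat, n_iters=3):
    """Dynamic routing between capsules.

    u_hat: prediction vectors, shape (n_in, n_out, dim), i.e. what each
           lower-level capsule predicts for each higher-level capsule.
    Returns the output capsule vectors, shape (n_out, dim).
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))  # routing logits, start uniform
    for _ in range(n_iters):
        # Coupling coefficients: softmax over output capsules per input capsule
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        s = np.einsum("ij,ijd->jd", c, u_hat)  # weighted sum of predictions
        v = squash(s)                          # vector-output capsule
        # Raise the logit where a prediction agrees with the output (dot product)
        b = b + np.einsum("ijd,jd->ij", u_hat, v)
    return v

# Usage with random (hypothetical) prediction vectors
rng = np.random.default_rng(0)
v = route(rng.normal(size=(8, 5, 16)))
lengths = np.linalg.norm(v, axis=-1)  # one length per output capsule
```

The length of each output vector plays the role of a class score, which is why agreement among several lower-level capsules translates into a more confident detection.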

CapsNets - Theory

CapsNets Architecture

CapsNets Architecture Bottleneck

Baseline – CNN-(RNN)Attention
C. Bartz, T. Herold, H. Yang, and C. Meinel, "Language identification using deep convolutional recurrent neural networks," in International Conference on Neural Information Processing. Springer, 2017, pp. 880-889.

Non-Class Detection
• Verification step.
• Is CapsNet more robust* than baseline?
• Thresholding mechanism
*robustness here is over several languages.
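The slides do not spell out the thresholding mechanism. One common form, shown here purely as an assumption, rejects an input as non-class when the highest class score falls below a cutoff. The function name, the 0.5 threshold and the example scores are hypothetical.

```python
def classify_with_rejection(scores, threshold=0.5):
    """Return the top-scoring label, or None if the input looks non-class.

    scores: dict mapping language label -> score in [0, 1]
            (for a CapsNet, this could be the output capsule length).
    threshold: assumed cutoff; in practice it would be tuned on held-out data.
    """
    label = max(scores, key=scores.get)
    return label if scores[label] >= threshold else None

# Usage with made-up scores
in_class = classify_with_rejection(
    {"AR": 0.10, "BE": 0.05, "CH": 0.08, "EN": 0.90, "HI": 0.12})
non_class = classify_with_rejection(
    {"AR": 0.30, "BE": 0.21, "CH": 0.18, "EN": 0.25, "HI": 0.28})
```

Here the first call keeps the prediction and the second returns None, marking the clip as outside the trained languages.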

Results

Results
CapsNet is bad. Or is it?

Results
ROC curve & AUC score for languages (1-5) by CapsNet (with Midcaps layers), 5 second audio input. Left: test data. Right: train data.

Results
ROC curve & AUC score for languages (1-5) by Bi-GRU, 10 second audio input. Left: test data. Right: train data.
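For readers unfamiliar with per-language AUC scores, a one-vs-rest AUC can be computed directly as the probability that a randomly chosen clip of the target language scores higher than a randomly chosen clip of any other language. The sketch below assumes this one-vs-rest setup; the function name and the sample scores are made up for illustration and are not results from the slides.

```python
def auc_one_vs_rest(scores, labels, positive):
    """Area under the ROC curve for one language against all others.

    scores: model score for the `positive` language, one per clip
    labels: true language label, one per clip
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Usage with made-up scores for the "HI" output
hi_scores = [0.9, 0.8, 0.3, 0.6, 0.2]
true_labels = ["HI", "HI", "AR", "EN", "HI"]
auc_hi = auc_one_vs_rest(hi_scores, true_labels, "HI")
```

Repeating the call once per language gives the set of per-language AUC scores that accompany the ROC curves.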

Results
See a pattern?
HI -> AR -> BE ~ CH > EN
Should we be expecting this order?

Results
Non-class detection for languages (6-10)
• Why is Japanese low?
• Why is Punjabi high?
• Portuguese & Spanish
• Turkish ~ Arabic

Results – CapsNet is Multilingual
S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. Weinstein, and K. Rao, "Multilingual speech recognition with a single end-to-end model," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4904-4908.

Future Work
• Recurrent layers with CapsNet
• NAS over CapsNet architectures
• Non-class detection

Thank You