Spoken Arabic Digits Recognition using Convolutional Neural Network

Presented by: Mona Abdelazim, Teaching Assistant, Ain Shams University
Supervised by: Prof. Nagwa Badr, Dr. Wedad Hussein

Introduction
Digit recognition is vital in many human-machine interaction applications. It can be used in telephone-based services such as dialing systems, airline reservation systems, bank transactions, and price extraction. The purpose of this research is to develop a new Convolutional Neural Network (CNN) based spoken digit recognition system for Arabic digits. The developed system uses a classification approach to perform the recognition task. CNNs are a powerful artificial neural network architecture because they preserve the spatial structure of their inputs, which is why they have achieved state-of-the-art results on various problems.

Related Work
Reference | Network Architecture | Language | System Accuracy | Dataset Size
"Pashto isolated digits recognition using deep convolutional neural network." Elsevier, 2020 | Deep CNN | Pashto | 84.17% | 500 utterances
"Spoken Arabic Digits Recognition Using Deep Learning." IEEE, 2019 | LSTM | Arabic | 69% | 1040 utterances
"Bi-directional recurrent end-to-end neural network classifier for spoken Arab digit recognition." IEEE, 2018 | Bi-directional RNN | Arabic | 98.77% | 8800 utterances
"Spoken Arabic digits recognizer using recurrent neural networks." IEEE, 2004 | RNN | Arabic | 94.5% | 1700 utterances
Proposed Solution | CNN | Arabic | 99% | 6600 utterances

Proposed Solution
The spoken Arabic digit recognition task was conducted using a CNN-based system that estimates the digit class of each utterance in the test data. The proposed solution takes sequences of Mel Frequency Cepstral Coefficient (MFCC) features as input, represented as fixed-size arrays. To do so, the training samples were encoded as a tensor of shape (6600, 93, 13), where 6600 is the number of samples in the training data, 93 is the number of frames in the longest MFCC sequence, and 13 is the number of MFCC coefficients used in this experiment. When a sequence has fewer than 93 frames, it is padded with zero vectors of size 13 until it reaches 93 frames.
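The padding step can be sketched in NumPy as follows. This is a minimal illustration, not the presenters' code: the function name `pad_utterances` and the toy data are hypothetical, and it assumes the zero frames are appended at the end of each sequence (the slides do not say where the padding goes).

```python
import numpy as np

MAX_FRAMES = 93  # frames in the longest utterance, as stated on the slide
N_MFCC = 13      # MFCC coefficients per frame, as stated on the slide

def pad_utterances(utterances, max_frames=MAX_FRAMES, n_mfcc=N_MFCC):
    """Stack variable-length (frames, n_mfcc) arrays into one zero-padded tensor.

    Assumption: zero frames are appended after the real frames.
    """
    batch = np.zeros((len(utterances), max_frames, n_mfcc), dtype=np.float32)
    for i, utt in enumerate(utterances):
        utt = np.asarray(utt, dtype=np.float32)
        if utt.ndim != 2 or utt.shape[1] != n_mfcc:
            raise ValueError(f"utterance {i} must have shape (frames, {n_mfcc})")
        if utt.shape[0] > max_frames:
            raise ValueError(f"utterance {i} has more than {max_frames} frames")
        # Copy the real frames; the remaining rows stay zero.
        batch[i, : utt.shape[0], :] = utt
    return batch

# Toy data standing in for real MFCC sequences of 4, 93, and 40 frames.
rng = np.random.default_rng(0)
toy = [rng.normal(size=(n, N_MFCC)) for n in (4, 93, 40)]
padded = pad_utterances(toy)
print(padded.shape)
```

Applied to the full training set, the same function would yield the (6600, 93, 13) tensor described above.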

How does CNN work?
Contents of a classic Convolutional Neural Network:
1. Convolutional layer
2. Pooling layer
3. Fully connected layer

How does CNN work?
Convolution layer: In a CNN, a filter (kernel) is convolved with the input data to produce a feature map. The convolution is executed by sliding the filter over the input. At every location, the filter and the input window under it are multiplied element-wise, and the sum of the products is written to the feature map.
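The sliding-filter operation can be written out explicitly. The sketch below is illustrative only: `conv2d_valid` is a hypothetical helper, it uses stride 1 with no padding, and, as in most deep learning libraries, it does not flip the kernel. Real frameworks use much faster implementations.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Slide `kernel` over `x` (stride 1, no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    feature_map = np.empty((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i : i + kh, j : j + kw]
            # Element-wise product of window and kernel, then sum.
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

# Toy 4x4 input and a 2x2 kernel that subtracts the lower-right
# neighbour from the upper-left value of each window.
x = np.arange(16, dtype=np.float64).reshape(4, 4)
k = np.array([[1.0, 0.0],
              [0.0, -1.0]])
fmap = conv2d_valid(x, k)
print(fmap)  # every entry is -5.0 for this input
```

A 4x4 input and a 2x2 filter give a 3x3 feature map, one value per filter position.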

How does CNN work?
Pooling layer: The function of pooling is to progressively reduce the dimensionality of the feature maps, which reduces the number of parameters and the amount of computation in the network. This shortens training time and helps control overfitting. The most common type of pooling is max pooling, which takes the maximum value in each window. This decreases the feature map size while keeping the most significant information.
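Max pooling over non-overlapping windows can be sketched as below. The helper name `max_pool2d` and the 2x2 window size are assumptions chosen for illustration; the slides do not state the pooling size used in the model.

```python
import numpy as np

def max_pool2d(fmap, size=2):
    """Take the maximum of each non-overlapping size x size window."""
    h, w = fmap.shape
    out_h, out_w = h // size, w // size
    # Drop trailing rows/columns that do not fill a complete window.
    trimmed = fmap[: out_h * size, : out_w * size]
    # Split each axis into (window index, position inside window).
    windows = trimmed.reshape(out_h, size, out_w, size)
    return windows.max(axis=(1, 3))

fmap = np.arange(16, dtype=np.float64).reshape(4, 4)
pooled = max_pool2d(fmap)
print(pooled)
# [[ 5.  7.]
#  [13. 15.]]
```

The 4x4 map shrinks to 2x2, and each output keeps only the largest value of its window.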

How does CNN work?
Fully connected layer: A fully connected layer connects every neuron in one layer to every neuron in the next layer. It is in principle the same as a traditional multi-layer perceptron (MLP). The flattened feature map passes through a fully connected layer to classify the test samples.
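The flatten-then-classify step might look like the following sketch. Several things here are assumptions rather than details from the slides: the feature map size, the random (untrained) weights, and the softmax output are illustrative; only the ten output classes (digits 0 to 9) come from the task itself.

```python
import numpy as np

N_CLASSES = 10  # digits 0 to 9

def dense(x, weights, bias):
    """Fully connected layer: every input feeds every output neuron."""
    return x @ weights + bias

def softmax(z):
    """Turn raw scores into probabilities (assumed output activation)."""
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
pooled = rng.normal(size=(4, 3))  # hypothetical pooled feature map
flat = pooled.reshape(-1)         # flatten to a vector of 12 values

# Untrained, randomly initialised parameters for illustration only.
W = rng.normal(size=(flat.size, N_CLASSES)) * 0.1
b = np.zeros(N_CLASSES)

probs = softmax(dense(flat, W, b))
predicted_digit = int(np.argmax(probs))
print(probs.shape, predicted_digit)
```

In a trained network, `W` and `b` would be learned, and the class with the highest probability is reported as the recognized digit.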

Model Architecture

Dataset
The dataset contains time series of MFCCs corresponding to spoken Arabic digits. It includes data from 44 male and 44 female native Arabic speakers.
Data description:
Training data: There are 660 blocks for each spoken digit. The first 330 blocks represent male speakers and the second 330 blocks represent female speakers.
Testing data: Each digit from 0 to 9 has 220 blocks. The first 110 blocks represent male speakers and the second 110 blocks represent female speakers.
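The block layout implies the split sizes and lets a label be derived from a block's position. The sketch below assumes that blocks are stored in digit order (all blocks for digit 0 first, then digit 1, and so on), which the slide does not state explicitly; `describe_block` is a hypothetical helper.

```python
TRAIN_PER_DIGIT = 660  # training blocks per digit, from the slide
TEST_PER_DIGIT = 220   # testing blocks per digit, from the slide
N_DIGITS = 10

def describe_block(index, blocks_per_digit):
    """Map a 0-based block index to (digit, speaker gender).

    Assumption: blocks are ordered by digit, and within each digit the
    first half are male speakers and the second half female speakers.
    """
    digit, offset = divmod(index, blocks_per_digit)
    gender = "male" if offset < blocks_per_digit // 2 else "female"
    return digit, gender

train_total = TRAIN_PER_DIGIT * N_DIGITS  # 6600 training utterances
test_total = TEST_PER_DIGIT * N_DIGITS    # 2200 testing utterances

print(train_total, test_total)
print(describe_block(0, TRAIN_PER_DIGIT))     # (0, 'male')
print(describe_block(330, TRAIN_PER_DIGIT))   # (0, 'female')
print(describe_block(6599, TRAIN_PER_DIGIT))  # (9, 'female')
```

The training total of 6600 matches the first dimension of the (6600, 93, 13) input tensor, and 220 blocks per digit matches the support column of the classification results.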

Classification Results
Class  Precision  Recall  F1 score  Support
0  0.96  0.98  0.97  220
1  1.00  220
2  1.00  0.98  0.99  220
3  1.00  0.99  220
4  1.00  220
5  0.99  1.00  220
6  1.00  0.99  220
7  0.97  220
8  1.00  220
9  0.97  0.98  220

Classification Performance
[Figure: Recognition performance. Precision, recall, and F1 score (ratio, 0.95 to 1) for each digit class, 0 to 9.]

Validation accuracy vs. validation loss
[Figure: Training performance. Validation loss and validation accuracy (percentage, 0 to 100) against epoch number (1 to 100).]

References
Zada, Bakht, and Rahim Ullah. "Pashto isolated digits recognition using deep convolutional neural network." Heliyon 6.2, Elsevier, 2020.
Zerari, Naima, et al. "Bi-directional recurrent end-to-end neural network classifier for spoken Arab digit recognition." 2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP). IEEE, 2018.
Wazir, Abdulaziz Saleh Mahfoudh Ba, and Joon Huang Chuah. "Spoken Arabic Digits Recognition Using Deep Learning." 2019 IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS). IEEE, 2019.
Alotaibi, Yousef Ajami. "Spoken Arabic digits recognizer using recurrent neural networks." Proceedings of the Fourth IEEE International Symposium on Signal Processing and Information Technology, 2004. IEEE, 2004.