CMSC 5707 Topics in A.I.
CMSC Assignment 1: Audio signal processing
Assignment 1 of CMSC 5707, V.0a

Task 1
• (5%) Recording of the templates: Use your own sound recording device (e.g. mobile phone, Windows Sound Recorder or http://www.goldwave.com/) to record the numbers 1, 2, 3, ..., N and name these files s1A.wav, s2A.wav, s3A.wav, ..., sNA.wav, respectively. Each word should last about 0.6 to 0.8 seconds; use http://format-factory.en.softonic.com/ to convert your files to .wav if necessary. (You may pronounce the words in English, Cantonese or Mandarin.) These N files are called set A and are used as the templates of our speech recognition system. You may use any sampling rate (Fs) and any number of bits per sample (bps). However, typical values are Fs=22050 Hz (or lower) and bps=16 bits per sample.

Task 2
• (5%) Recording for the testing data: Repeat the above recording procedure for the same N (e.g. N=10) numbers 1, 2, 3, 4, etc., and save the files as s1B.wav, s2B.wav, s3B.wav, s4B.wav, etc., respectively. These N files are called set B and are used as the testing data of our speech recognition system.

Task 3
• (5%) Plotting:
– Pick one wav file out of your sound files (e.g. x.wav), read the file and plot the time domain signal. (Hint: you may use “wavread” and “plot” in MATLAB or OCTAVE; recent MATLAB releases have removed “wavread” and provide “audioread” instead. Type “>help wavread” and “>help plot” in MATLAB to learn how to use them.)
– Plot x.wav and save the plot in a picture file “x.jpg”.
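The reading step can also be sketched outside MATLAB/OCTAVE. The following is a minimal illustration in Python (standard library only), assuming a 16-bit mono PCM file; the function names `read_wav_mono16` and `time_axis_ms` are hypothetical and not part of the assignment. Plotting itself is left to whatever plotting tool you use.

```python
import struct
import wave

def read_wav_mono16(path):
    """Return (fs, samples) for a 16-bit mono PCM .wav file.

    Samples are scaled to floats in [-1, 1). Other formats are
    rejected to keep the sketch short (an assumption of this example).
    """
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("this sketch handles 16-bit mono PCM only")
        fs = wf.getframerate()
        n = wf.getnframes()
        raw = wf.readframes(n)
    ints = struct.unpack("<%dh" % n, raw)  # little-endian signed 16-bit
    return fs, [v / 32768.0 for v in ints]

def time_axis_ms(fs, n):
    """Time stamp in milliseconds of each of n samples (horizontal axis)."""
    return [1000.0 * i / fs for i in range(n)]
```

The two returned lists (time in ms, amplitude) are what the time domain plot of x.wav needs.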

Task 4
• (35%) Signal analysis:
– From “x.wav”, write a program to find the start (T1) and stop (T2) locations in time (ms) of your N recorded sounds automatically.
– Extract one segment called Seg1 (20 ms, at a location of your choice) from the voiced vowel part of x.wav between T1 and T2. Seg1 can be saved as an array in C++ or a vector in MATLAB/OCTAVE. You may choose the segment by manual inspection and hardcode the locations in your program.
– Find and plot the Fourier transform (energy against frequency) of Seg1. The energy is equal to |Square_root([real]^2 + [imaginary]^2)|. The horizontal axis is frequency and the vertical axis is energy. Label the axes of the plot. Save the plot as “fourier_x.jpg”.
– Find the pre-emphasis signal (Pem_Seg1) of Seg1 if the pre-emphasis constant α is 0.945. Plot Seg1 and Pem_Seg1. Submit your program.
– Find the 10 LPC parameters if the order of LPC for Pem_Seg1 is 10. You should write your own autocorrelation code, but you may use the inverse function (inv) in MATLAB/OCTAVE to solve the linear matrix equation.
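The numerical steps of Task 4 can be sketched as follows. This is an illustration in Python (standard library only), not the required MATLAB/OCTAVE solution, and it rests on several assumptions that the assignment does not fix: endpoints are found with a simple energy threshold (`ratio` times the largest frame energy), pre-emphasis is taken as pem[n] = s[n] - α·s[n-1] with the first sample kept unchanged, and the LPC coefficients a[1..p] are defined by the predictor s[n] ≈ Σ a[k]·s[n-k] and obtained from the autocorrelation normal equations. All function names are hypothetical.

```python
import cmath

def find_endpoints(samples, fs, frame_ms=10, ratio=0.1):
    """Return (T1, T2) in ms, or None for an empty or silent signal.
    Assumed rule: a frame is speech if its energy > ratio * max energy."""
    step = max(1, int(fs * frame_ms / 1000))
    energies = [sum(v * v for v in samples[i:i + step])
                for i in range(0, len(samples), step)]
    if not energies:
        return None
    threshold = ratio * max(energies)
    active = [k for k, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return (1000.0 * active[0] * step / fs,
            1000.0 * (active[-1] + 1) * step / fs)

def dft_magnitude(seg):
    """|sqrt(real^2 + imag^2)| of the DFT for bins 0..N/2 (direct O(N^2) sum)."""
    n = len(seg)
    return [abs(sum(seg[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def pre_emphasis(seg, alpha=0.945):
    """pem[n] = seg[n] - alpha * seg[n-1]; first sample kept as is."""
    if not seg:
        return []
    return [seg[0]] + [seg[n] - alpha * seg[n - 1] for n in range(1, len(seg))]

def autocorrelation(x, max_lag):
    """r[k] = sum_n x[n] * x[n+k] for k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting
    (stands in for inv(a)*b in MATLAB/OCTAVE)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def lpc(x, order=10):
    """LPC coefficients a[1..order] from R*a = r, with R[i][j] = r[|i-j|]."""
    r = autocorrelation(x, order)
    toeplitz = [[r[abs(i - j)] for j in range(order)] for i in range(order)]
    return solve_linear(toeplitz, r[1:])
```

For a 20 ms segment at Fs=22050 Hz (441 samples), `lpc(pre_emphasis(seg1), 10)` returns the ten coefficients; bin k of `dft_magnitude` corresponds to the frequency k·Fs/N Hz.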

Task 5
• (50%) Build a speech recognition system: You may use any MATLAB/OCTAVE functions you like in this part. Use the tool at http://www.mathworks.com/matlabcentral/fileexchange/32849-htk-mfcc-matlab to extract the MFCC parameters (Mel-frequency cepstrum, http://en.wikipedia.org/wiki/Mel-frequency_cepstrum) from your sound files. Each sound file (.wav) will give one set of MFCC parameters. See “A tutorial of using the htk-mfcc tool” in the appendix for how to extract MFCC parameters. Build a dynamic programming (DP) based N-numeral speech recognition system. Use set A as templates and set B as testing inputs. You may follow the steps below to complete your assignment.

MFCC parameter extraction
From http://en.wikipedia.org/wiki/Mel-frequency_cepstrum
• MFCCs are very popular in music and speech analysis.
• Take the Fourier transform of (a windowed excerpt of) a signal.
• Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
• Take the logs of the powers at each of the mel frequencies.
• Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
• The MFCCs are the amplitudes of the resulting spectrum.
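The last three steps of the list above (mel mapping, logs, DCT) can be sketched for a single frame as follows. This is an illustration in Python (standard library only) and not the htk-mfcc tool: it assumes the common mel formula mel = 2595·log10(1 + f/700), an unnormalized DCT-II, no liftering, and a small floor to avoid log(0). The function names are hypothetical, and the numbers will not match the tool's output exactly.

```python
import math

def hz_to_mel(f):
    """Common mel-scale formula (an assumption of this sketch)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, fs):
    """Triangular overlapping filters, uniformly spaced on the mel scale.

    Returns n_filters rows of n_bins weights; the bins are assumed to
    cover 0..fs/2 uniformly (the unique half of the spectrum).
    """
    top = hz_to_mel(fs / 2.0)
    # n_filters + 2 edge frequencies in Hz, equally spaced in mel
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [(fs / 2.0) * k / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc_from_power(power, bank, n_ceps):
    """Logs of the mel-band powers, then an unnormalized DCT-II."""
    floor = 1e-10  # avoids log(0) for empty bands
    logs = [math.log(max(floor, sum(w * p for w, p in zip(row, power))))
            for row in bank]
    m = len(logs)
    return [sum(logs[j] * math.cos(math.pi * i * (j + 0.5) / m)
                for j in range(m))
            for i in range(n_ceps)]
```

Here `power` is the power spectrum of one windowed frame (bins 0..N/2), so calling `mfcc_from_power` once per frame yields one column of the MFCC matrix.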

MFCC (inside MFCC.m)
• Pre-emphasize the whole signal, then:

:
% Framing and windowing (frames as columns)
frames = vec2frames( speech, Nw, Ns, 'cols', window, false );

% Magnitude spectrum computation (as column vectors)
MAG = abs( fft(frames,nfft,1) );

% Triangular filterbank with uniformly spaced filters on mel scale
H = trifbank( M, K, R, fs, hz2mel, mel2hz ); % size of H is M x K

% Filterbank application to unique part of the magnitude spectrum
FBE = H * MAG(1:K,:); % FBE( FBE<1.0 ) = 1.0; % apply mel floor

% DCT matrix computation
DCT = dctm( N, M );

% Conversion of logFBEs to cepstral coefficients through DCT
CC = DCT * log( FBE );

% Cepstral lifter computation
lifter = ceplifter( N, L );

% Cepstral liftering gives liftered cepstral coefficients
CC = diag( lifter ) * CC; % ~ HTK's MFCCs
:

Step (a) of task 5
• Convert the sound files in set A and set B into MFCC parameters, so that each sound file gives an MFCC matrix of size 13 x 70 (no_of_MFCCs_parameters=13 and no_of_frame_segments=70). This is because, if the time shift is 10 ms, a 0.7-second sound has 70 frame segments, and there are 13 MFCC parameters for each frame. Here we use M(j, t) to represent the MFCC parameters, where ‘j’ is the index of the MFCC parameter, ranging from 1 to 13, and ‘t’ is the index of the time segment, ranging from 1 to 70. Therefore a (13-parameter) sound segment at time index t is M(1:13, t).
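The frame-count arithmetic and the M(j, t) indexing can be checked with a short sketch. It is written in Python for illustration only; the function names are hypothetical, and `full_frame_count` assumes a 25 ms window (the example length quoted in step (b)) with no padding of the final partial frame, which may differ from what the MFCC tool does.

```python
def approx_frame_count(duration_ms, shift_ms):
    """Frames when one frame starts every shift_ms (the estimate in the text)."""
    return duration_ms // shift_ms

def full_frame_count(duration_ms, window_ms, shift_ms):
    """Frames that fit entirely inside the sound (assumes no padding)."""
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // shift_ms

def frame_at(M, t):
    """M(1:13, t) for a matrix stored as a list of rows; t is 1-based as in the text."""
    return [row[t - 1] for row in M]

print(approx_frame_count(700, 10))    # 70, the figure used above
print(full_frame_count(700, 25, 10))  # 68 complete 25 ms windows
```

So the number of columns you actually get may be slightly below 70, and it varies with the length of each recording.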

Step (b) of task 5
• Assume we have two short time segments (e.g. 25 ms each): one from the t-th (t=28) time segment of sound X, represented by the 13 MFCC parameters Mx(1:13, t=28), and another from the t’-th (t’=32) time segment of sound Y, represented by the MFCC parameters My(1:13, t’=32). The distortion (dist) between these two segments is: [equation missing in the source]
• Note: The first row of the MFCC matrix, M(j=1, time), is the energy term. It is not recommended for use in the comparison procedure because it does not contain the relevant spectral information, so the summation starts from j=2.
• Use dynamic programming to find the minimum accumulated distance (minimum accumulated score) between sound X and sound Y.
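A minimal sketch of this step, in Python for illustration only, is given here. It makes assumptions that the text does not confirm: the distortion is taken to be the Euclidean distance over j=2..13, the dynamic programming path may move one step horizontally, vertically or diagonally, and the path is anchored at the first and last frames of both sounds. The function names are hypothetical.

```python
import math

def frame_dist(mx, my, t, tp):
    """Assumed distortion: Euclidean distance between column t of mx and
    column tp of my (both 1-based), skipping the first (energy) row."""
    return math.sqrt(sum((mx[j][t - 1] - my[j][tp - 1]) ** 2
                         for j in range(1, len(mx))))  # 0-based j=1.. is j=2.. in the text

def dtw_distance(mx, my):
    """Minimum accumulated distance between two MFCC matrices.

    mx and my are lists of rows (13 rows, one column per frame).
    acc[i][k] holds the best accumulated score ending at frames (i, k).
    """
    tx, ty = len(mx[0]), len(my[0])
    inf = float("inf")
    acc = [[inf] * (ty + 1) for _ in range(tx + 1)]
    acc[0][0] = 0.0  # virtual start cell before the first frames
    for i in range(1, tx + 1):
        for k in range(1, ty + 1):
            best_prev = min(acc[i - 1][k],      # vertical step
                            acc[i][k - 1],      # horizontal step
                            acc[i - 1][k - 1])  # diagonal step
            acc[i][k] = frame_dist(mx, my, i, k) + best_prev
    return acc[tx][ty]
```

With the sizes from step (a), `frame_dist(Mx, My, 28, 32)` is the distortion between the two example segments, and `dtw_distance(Mx, My)` is the value to enter in the comparison table for that pair of sounds.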

Step (c) of task 5
• Build a speech recognition system: You should show an N x N confusion matrix table (or comparison-matrix-table) as the result (e.g. N=10). Each entry of this matrix table is the minimum accumulated distance between a sound in set A and a sound in set B. You may use the above steps to find the minimum accumulated distance for each sound pair (there are N x N pairs, because there are N sound files in set A and N sound files in set B, e.g. N=10) and fill in the confusion matrix table manually or by a program.
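Filling the table by a program can be sketched as follows (Python, for illustration only). The `distance` argument stands in for the minimum accumulated distance of step (b); the function names and the toy scalar data in the usage lines are hypothetical. Recognition is assumed to pick, for each test sound, the template with the smallest distance.

```python
def comparison_matrix(set_a, set_b, distance):
    """table[i][k] = distance between template i (set A) and test sound k (set B)."""
    return [[distance(a, b) for b in set_b] for a in set_a]

def recognize(table):
    """For each test sound (column), the row index of the closest template."""
    n_rows, n_cols = len(table), len(table[0])
    return [min(range(n_rows), key=lambda i: table[i][k]) for k in range(n_cols)]

def accuracy(table):
    """Fraction of test sounds whose closest template has the same index."""
    hits = sum(1 for k, i in enumerate(recognize(table)) if i == k)
    return hits / len(table[0])

# Toy usage with scalars in place of MFCC matrices (illustration only).
table = comparison_matrix([1, 5, 9], [1.2, 4.7, 8], lambda a, b: abs(a - b))
print(recognize(table))  # [0, 1, 2]: every test sound matches its own template
```

A correct system shows the smallest value of each column on the diagonal of the table.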

Step (d) of task 5
• Pick any one sound file from set A (e.g. the sound of ‘one’) and the corresponding sound file from set B (e.g. the sound of ‘one’), compare these two files using dynamic programming, and plot the optimal path on the accumulated matrix diagram.
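Recovering the optimal path can be sketched as follows (Python, for illustration only). The input `local` is assumed to be a precomputed matrix of frame-to-frame distortions, with local[i][k] comparing frame i of one sound with frame k of the other (0-based). The step pattern (horizontal, vertical, diagonal) and the tie-breaking in favour of the diagonal are assumptions of this sketch, and `dtw_path` is a hypothetical name.

```python
def dtw_path(local):
    """Return (acc, path): the accumulated matrix and the optimal path
    as a list of (i, k) cells from (0, 0) to the last cell."""
    tx, ty = len(local), len(local[0])
    inf = float("inf")
    acc = [[inf] * ty for _ in range(tx)]
    for i in range(tx):
        for k in range(ty):
            if i == 0 and k == 0:
                prev = 0.0
            else:
                prev = min(acc[i - 1][k] if i > 0 else inf,
                           acc[i][k - 1] if k > 0 else inf,
                           acc[i - 1][k - 1] if i > 0 and k > 0 else inf)
            acc[i][k] = local[i][k] + prev
    # Backtrack from the last cell, always moving to the cheapest predecessor.
    i, k = tx - 1, ty - 1
    path = [(i, k)]
    while (i, k) != (0, 0):
        candidates = []
        if i > 0 and k > 0:
            candidates.append((acc[i - 1][k - 1], i - 1, k - 1))
        if i > 0:
            candidates.append((acc[i - 1][k], i - 1, k))
        if k > 0:
            candidates.append((acc[i][k - 1], i, k - 1))
        _, i, k = min(candidates)  # ties resolve to the diagonal cell
        path.append((i, k))
    path.reverse()
    return acc, path
```

Drawing `acc` as an image and overlaying the (i, k) pairs of `path` gives the accumulated matrix diagram with the optimal path.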

Step 6
• Pick any one sound file from set A (e.g. the sound of ‘one’) and the corresponding sound file from set B (e.g. the sound of ‘one’), compare these two files using dynamic programming, and plot the optimal path on the accumulated matrix diagram.

What to submit:
– All your programs, with a readme file showing how to run them
– All sound files of your recordings
– The picture files
– The N x N confusion matrix table (also called the comparison-matrix-table) of the speech recognition system

Appendix
• A tutorial of using the htk_tool:
– http://www.cse.cuhk.edu.hk/~khwong/www2/cmsc5707/A_tutorial_of_using_the_htk_tool.docx