TIMBRE AND MODULATION FEATURES FOR MUSIC GENRE/MOOD CLASSIFICATION

TIMBRE AND MODULATION FEATURES FOR MUSIC GENRE/MOOD CLASSIFICATION
J.-S. Roger Jang & Jia-Min Ren
Multimedia Information Retrieval Lab, Dept. of CSIE, National Taiwan University

OUTLINE
- Audio features and modulation spectral analysis
- MIREX 2011 method and its improvement
- Experimental setup and results
- Conclusions and future work

INTRODUCTION – MUSIC GENRES/MOODS
- Descriptions of music contents
* Pictures from www.playonradio.com, brainpickings.org & mpac.ee.ntu.edu.tw

MOTIVATION
- Rapid growth of digital music
  - Apple iTunes: 28 million songs; 7digital: 20 million tracks
- Organization of large collections of audio music
  - Important but challenging
  - Manual labeling by tags: labor-intensive and time-consuming
  - Thus, machine learning for classification is called for!
- Training: music clips for training -> feature extraction -> classifier training
- Test: music clip for test -> feature extraction -> classifiers -> evaluation -> result
- Features: short-term (MFCC, OSC); long-term (beat, tempo, pitch)
- Classifiers: KNN, GMM, SVM

SYSTEM OVERVIEW

PERFORMANCE EVALUATION
- Dataset-dependent criteria for evaluation
  - GTZAN: 10-fold cross-validation
  - ISMIR 2004 Genre: holdout test, the same split as in the ISMIR 2004 Genre Classification Contest, with 729 clips for training and 729 clips for test

AUDIO FEATURES – SHORT-TERM TIMBRE FEATURES
- Statistical spectrum descriptors (SSD)
  - Spectral centroid (SC)
  - Spectral flux (SF)
  - Spectral rolloff (SR)
  - Spectral skewness (SS)
  - Spectral kurtosis (SK)
- MFCC
  - Models the subjective frequency content of audio signals
  - 21-dim (including energy)
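
The slide names the five SSD features without defining them. The sketch below uses one common set of textbook definitions; the function name `ssd_features`, the 85% rolloff threshold, and the choice to compute flux as a squared difference between consecutive magnitude spectra are all assumptions, not necessarily what the authors used.

```python
import numpy as np

def ssd_features(mag, prev_mag, sr, rolloff_pct=0.85):
    """Five statistical spectrum descriptors for one frame (illustrative definitions).

    mag, prev_mag: magnitude spectra of the current and previous frame (same length).
    sr: sampling rate in Hz; bins are assumed to span 0 .. sr/2 uniformly.
    """
    freqs = np.linspace(0.0, sr / 2.0, len(mag))
    weights = mag / (mag.sum() + 1e-12)            # treat the spectrum as a distribution
    centroid = np.sum(freqs * weights)             # first moment
    spread = np.sqrt(np.sum((freqs - centroid) ** 2 * weights)) + 1e-12
    z = (freqs - centroid) / spread
    skewness = np.sum(z ** 3 * weights)            # third standardized moment
    kurtosis = np.sum(z ** 4 * weights)            # fourth standardized moment
    flux = np.sum((mag - prev_mag) ** 2)           # frame-to-frame spectral change
    cum = np.cumsum(mag)
    # lowest frequency below which rolloff_pct of the total magnitude lies
    rolloff = freqs[np.searchsorted(cum, rolloff_pct * cum[-1])]
    return np.array([centroid, flux, rolloff, skewness, kurtosis])
```

Calling this on every frame of a clip yields a 5-column matrix, which matches the 5 SSD dimensions counted later in the deck.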

AUDIO FEATURES – SHORT-TERM TIMBRE FEATURES
- Spectral contrast & valley (SCV)
  - Measures spectral contrast/valley in octave-based subbands of each FFT frame
  - Peak: harmonic components
  - Valley: non-harmonic components/noise
  - For each subband, compute the peak/valley by averaging the values in the larger/smaller percentage of spectra ( )
  - contrast = peak - valley: relative distribution
- 8 frequency subbands (Hz):
  1: [0, 100)
  2: [100, 200)
  3: [200, 400)
  4: [400, 800)
  5: [800, 1600)
  6: [1600, 3200)
  7: [3200, 6400)
  8: [6400, 11025]
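
The peak/valley averaging described above can be sketched as follows. The subband edges come from the slide; the fraction `alpha=0.2`, the log compression, and the function name are assumptions (the slide's own percentage did not survive extraction), so treat this as an illustration of the idea rather than the authors' exact formula.

```python
import numpy as np

EDGES_HZ = [0, 100, 200, 400, 800, 1600, 3200, 6400, 11025]  # from the slide

def spectral_contrast_valley(mag, sr=22050, alpha=0.2):
    """Return 8 contrast values followed by 8 valley values for one FFT frame."""
    freqs = np.linspace(0.0, sr / 2.0, len(mag))
    contrast, valley = [], []
    for lo, hi in zip(EDGES_HZ[:-1], EDGES_HZ[1:]):
        # half-open bands, except the last one which includes its upper edge
        upper = (freqs <= hi) if hi == EDGES_HZ[-1] else (freqs < hi)
        band = np.sort(mag[(freqs >= lo) & upper])
        if band.size == 0:
            raise ValueError("FFT too short to resolve this subband")
        k = max(1, int(round(alpha * band.size)))   # size of the averaged neighbourhood
        v = np.log(band[:k].mean() + 1e-12)         # valley: mean of the smallest values
        p = np.log(band[-k:].mean() + 1e-12)        # peak: mean of the largest values
        valley.append(v)
        contrast.append(p - v)
    return np.array(contrast + valley)
```

With a 1024-point FFT at 22.05 kHz (513 bins), even the narrowest band [0, 100) contains several bins, so every subband yields a value and the output has the 16 dimensions counted on the later slides.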

AUDIO FEATURES – SHORT-TERM TIMBRE FEATURES
- Spectral flatness measure (SFM)
  - Measures the noisiness of the spectrum within a subband
  - Defined in terms of the i-th magnitude spectrum in the a-th subband and the # of spectra in the a-th subband
  - ≈1: a similar amount of power is distributed over all spectral bands
  - ≈0: spectral power is concentrated in a relatively small # of bands
- Spectral crest measure (SCM)
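
The formulas on this slide were lost in extraction. A minimal sketch, assuming the standard definitions (flatness as the ratio of geometric to arithmetic mean, crest as the ratio of maximum to arithmetic mean, both over the magnitudes in one subband), reproduces the ≈1 and ≈0 behaviour described above. The function name is hypothetical.

```python
import numpy as np

def flatness_crest(band):
    """Spectral flatness and crest of the magnitudes in a single subband."""
    band = np.asarray(band, dtype=float) + 1e-12   # avoid log(0)
    arith = band.mean()
    geo = np.exp(np.mean(np.log(band)))            # geometric mean via logs
    sfm = geo / arith                              # near 1 for flat, near 0 for peaky
    scm = band.max() / arith                       # 1 for flat, large for peaky
    return sfm, scm
```

For example, a perfectly flat band gives a flatness and crest of about 1, while a band with a single non-zero bin gives a flatness close to 0 and a crest equal to the number of bins.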

AUDIO FEATURES – SHORT-TERM TIMBRE FEATURES
- For each feature dimension, we compute its mean and standard deviation.
- Total dimensions for short-term timbre features: 2*(5+21+16+16)=116
  - 2: mean & std
  - 5: SSD; 21: MFCC; 16: SCV; 16: SFM/SCM
  - SCV and SFM/SCM are computed on octave-based subbands; all are frame-based features
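
The mean/std summarisation is simple enough to show directly. In this sketch the 58 frame-level columns stand for the 5 SSD, 21 MFCC, 16 SCV and 16 SFM/SCM dimensions; the function name `mu_std` and the random placeholder data are illustrative assumptions.

```python
import numpy as np

def mu_std(frame_features):
    """Collapse a (n_frames, n_dims) matrix into one clip-level vector.

    The result is [mean of each dim, std of each dim], so it has 2 * n_dims entries.
    """
    frame_features = np.asarray(frame_features, dtype=float)
    return np.concatenate([frame_features.mean(axis=0), frame_features.std(axis=0)])

# Placeholder frame-level features: 1000 frames x (5 + 21 + 16 + 16) dims.
rng = np.random.default_rng(0)
clip_vector = mu_std(rng.normal(size=(1000, 5 + 21 + 16 + 16)))
```

Here `clip_vector` has 116 entries, matching the count on the slide.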

MODULATION SPECTRAL ANALYSIS
- MFCC, SC, SFM/SCM
  - Capture only short-time spectral properties of audio signals
- Modulation spectral analysis
  - Captures long-term spectral dynamics within audio signals
  - Computes the spectrogram, then creates the modulation spectrogram by applying the FFT again along the time axis of the spectrogram
  - Low/high modulation frequency corresponds to slow/fast spectral change
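
The two-stage FFT can be sketched in a few lines of NumPy. The frame length, hop size, Hann window and function name are assumptions chosen for illustration; the slide only specifies "spectrogram, then FFT again along the time axis".

```python
import numpy as np

def modulation_spectrogram(x, sr, frame_len=1024, hop=512):
    """Return (mod, mod_freqs): |FFT over time| of the magnitude spectrogram.

    mod has shape (n_modulation_bins, n_acoustic_bins).
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))     # spectrogram: (n_frames, n_bins)
    mod = np.abs(np.fft.rfft(spec, axis=0))        # second FFT along the time axis
    # each spectrogram row is one sample of a signal sampled at sr / hop Hz
    mod_freqs = np.fft.rfftfreq(n_frames, d=hop / sr)
    return mod, mod_freqs
```

As a sanity check, a 1 kHz tone whose amplitude is modulated at 4 Hz should show its strongest non-DC modulation component near 4 Hz in the acoustic bin that contains 1 kHz.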

MODULATION SPECTRAL ANALYSIS OF TIMBRE FEATURES
- Flowchart outputs: MSC (modulation spectral contrast) and MSV
- 7 modulation frequency subbands (Hz): [0, 0.33), [0.33, 0.66), [0.66, 1.32), [1.32, 2.64), [2.64, 5.28), [5.28, 10.56), [10.56, 21.03)
- The same process is applied to MFCC and SFM/SCM.
- MSP/MSV: the strength of rhythm in music
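
Since the flowchart itself is not recoverable, the following is only one plausible reading of it. It assumes that the peak (MSP) and valley (MSV) of a modulation subband are the maximum and minimum modulation magnitude in that band, that MSC = MSP - MSV, and that each resulting matrix is summarised by mean and std along both axes (an assumption that reproduces the 2*(21*2+7*2)=112 count given for MMFCC elsewhere in the deck). Function names are hypothetical.

```python
import numpy as np

MOD_EDGES_HZ = [0, 0.33, 0.66, 1.32, 2.64, 5.28, 10.56, 21.03]  # from the slide

def modulation_contrast_valley(feats, frame_rate):
    """feats: (n_frames, D) trajectory of a frame-level feature such as MFCC."""
    mod = np.abs(np.fft.rfft(feats, axis=0))                 # (n_mod_bins, D)
    mod_freqs = np.fft.rfftfreq(feats.shape[0], d=1.0 / frame_rate)
    msc, msv = [], []
    for lo, hi in zip(MOD_EDGES_HZ[:-1], MOD_EDGES_HZ[1:]):
        band = mod[(mod_freqs >= lo) & (mod_freqs < hi)]
        if band.shape[0] == 0:
            raise ValueError("clip too short to resolve this modulation subband")
        peak, valley = band.max(axis=0), band.min(axis=0)    # assumed MSP / MSV
        msv.append(valley)
        msc.append(peak - valley)                            # assumed MSC
    return np.array(msc), np.array(msv)                      # each (7, D)

def summarise(mat):
    """Mean and std along both axes of a (J, D) matrix: 2*D + 2*J values."""
    return np.concatenate([mat.mean(axis=0), mat.std(axis=0),
                           mat.mean(axis=1), mat.std(axis=1)])
```

For a 21-dimensional MFCC trajectory, concatenating `summarise(msc)` and `summarise(msv)` gives 2*(21*2+7*2)=112 values.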

MODULATION SPECTRAL ANALYSIS OF TIMBRE FEATURES
- Reference
  - C.-H. Lee, J.-L. Shih, K.-M. Yu, and H.-S. Lin, “Automatic music genre classification based on modulation spectral analysis of spectral and cepstral features,” IEEE Trans. Multimedia, vol. 11, no. 4, pp. 670-682, June 2009.

PROPOSED JOINT ACOUSTIC FREQUENCY AND MODULATION FREQUENCY FEATURES
- Motivation
  - Averaging and mean/std computation smooth out MD info.
- Computation of joint frequency features (proposed)
  - Compute the modulation spectrogram (FFT) from an entire music clip
  - Compute SCV (spectral contrast/valley) and SFM/SCM (spectral flatness/crest measure) within each joint acoustic-modulation (AM) frequency subband
  - Resulting features: AMSCV, AMSFM/AMSCM
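
A rough sketch of the joint computation: the modulation spectrogram is cut into 8 acoustic x 7 modulation blocks, and contrast/valley and flatness/crest are computed inside each block. The subband edges are taken from the earlier slides; the fraction `alpha`, the log compression, the flattening of each 2-D block, and the function name are assumptions made for illustration.

```python
import numpy as np

AC_EDGES_HZ = [0, 100, 200, 400, 800, 1600, 3200, 6400, 11025]
MOD_EDGES_HZ = [0, 0.33, 0.66, 1.32, 2.64, 5.28, 10.56, 21.03]

def _mask(freqs, edges, i):
    """Half-open band i, except the last band which includes its upper edge."""
    lo, hi = edges[i], edges[i + 1]
    upper = (freqs <= hi) if i == len(edges) - 2 else (freqs < hi)
    return (freqs >= lo) & upper

def am_features(mod, ac_freqs, mod_freqs, alpha=0.2):
    """mod: (n_mod_bins, n_acoustic_bins). Returns (AMSCV, AMSFM/AMSCM) vectors."""
    amscv, amsfm_scm = [], []
    for a in range(len(AC_EDGES_HZ) - 1):
        cols = _mask(ac_freqs, AC_EDGES_HZ, a)
        for m in range(len(MOD_EDGES_HZ) - 1):
            rows = _mask(mod_freqs, MOD_EDGES_HZ, m)
            block = np.sort(mod[np.ix_(rows, cols)].ravel()) + 1e-12
            if block.size == 0:
                raise ValueError("empty acoustic-modulation subband")
            k = max(1, int(round(alpha * block.size)))
            valley = np.log(block[:k].mean())
            peak = np.log(block[-k:].mean())
            arith = block.mean()
            amscv += [peak - valley, valley]                   # contrast, valley
            amsfm_scm += [np.exp(np.log(block).mean()) / arith,  # flatness
                          block.max() / arith]                 # crest
    return np.array(amscv), np.array(amsfm_scm)
```

Each returned vector has 8*7*2=112 entries, and because the blocks are taken from the whole-clip modulation spectrogram, no averaging over frames or texture windows is involved.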

AUDIO FEATURES USED IN OUR STUDY
- All possible audio features
  - Extract SSD, MFCC, SCV, and SFM/SCM from audio frames, followed by mean/std computation
    - MuStd: dim=2*(5+21+16+16)=116
  - Perform modulation spectral analysis on MFCC, OSC, SFM/SCM
    - MMFCC: dim=2*(21*2+7*2)=112
    - MSCV: dim=2*(16*2+7*2)=92
    - MSFM/MSCM: dim=2*(16*2+7*2)=92
  - Compute SCV, SFM/SCM within acoustic-modulation (AM) frequency subbands
    - AMSCV: dim=8*7*2=112
    - AMSFM/AMSCM: dim=8*7*2=112
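
The dimension counts on this slide can be checked mechanically. The helper names below are hypothetical and simply mirror the slide's formulas.

```python
def mu_std_dim(n_ssd=5, n_mfcc=21, n_scv=16, n_sfm_scm=16):
    # mean and std of every frame-level dimension
    return 2 * (n_ssd + n_mfcc + n_scv + n_sfm_scm)

def modulation_dim(n_feat, n_mod_bands=7):
    # mirrors the slide's 2*(n_feat*2 + n_mod_bands*2)
    return 2 * (n_feat * 2 + n_mod_bands * 2)

def joint_dim(n_acoustic_bands=8, n_mod_bands=7):
    # two values per acoustic-modulation subband
    return n_acoustic_bands * n_mod_bands * 2

dims = {
    "MuStd": mu_std_dim(),
    "MMFCC": modulation_dim(21),
    "MSCV": modulation_dim(16),
    "MSFM/MSCM": modulation_dim(16),
    "AMSCV": joint_dim(),
    "AMSFM/AMSCM": joint_dim(),
}
```

Evaluating `dims` gives 116, 112, 92, 92, 112 and 112, in agreement with the figures listed above.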

AUDIO FEATURE SETS AND CLASSIFIER
- Audio feature sets
  - MIREX 2011 method: MuStd+MMFCC+MSCV+MSFM/MSCM, dim=116+112+92+92=412
  - Improved method: MuStd+MMFCC+AMSCV+AMSFM/AMSCM, dim=116+112+112+112=452
- Classifier construction with RBF-kernel SVMs
  - Three-fold inner cross-validation to tune hyper-parameters
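
A minimal sketch of the classifier stage, assuming scikit-learn is used (the slides do not name a toolkit). The feature standardisation step, the grid of `C` and `gamma` values, and the function name `train_rbf_svm` are assumptions; only the RBF kernel and the three-fold inner cross-validation come from the slide.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_rbf_svm(X, y):
    """Fit an RBF-kernel SVM, tuning C and gamma by three-fold inner CV."""
    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = {
        "svc__C": [1, 10, 100],              # illustrative grid, not the authors'
        "svc__gamma": ["scale", 0.01, 0.001],
    }
    search = GridSearchCV(pipe, grid, cv=3)  # inner three-fold cross-validation
    return search.fit(X, y)

# Toy stand-in for clip-level feature vectors: two well-separated classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 20)),
               rng.normal(3.0, 1.0, size=(60, 20))])
y = np.array([0] * 60 + [1] * 60)
model = train_rbf_svm(X, y)
```

In a real experiment `X` would hold the 412- or 452-dimensional feature vectors, and the whole tuning procedure would run inside each outer training fold so that test clips never influence the hyper-parameters.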

EXPERIMENTAL SETUP AND RESULTS OF MIREX 2011 GENRE/MOOD CLASSIFICATION TASKS
- Datasets
  - Genre classification: 10 genres, 700 30-sec clips in each one
  - Mood classification: 5 categories, 120 30-sec clips in each one
- Evaluation metric
  - Three-fold cross-validation; classification accuracy
- Results (JR1 is ours)

EXPERIMENTAL RESULTS OF MIREX 2008-2012 GENRE/MOOD CLASSIFICATION TASKS
  Participants         Task (Year)     Accuracy (%)   Rank (# of submissions)
  Wu and Jang          Genre (2013)    76.23          1 (13)
  Wu and Jang          Genre (2012)    76.13          1 (16)
  Wu and Ren           Genre (2011)    75.57          1 (15)
  Our submission       Genre (2011)    74.23          4 (15)
  Seyerlehner et al.   Genre (2010)    73.64          1 (24)
  Cao and Li           Genre (2009)    73.33          1 (31)
  Tzanetakis           Genre (2008)    67.83          1 (13)
  Wu and Jang          Mood (2013)     68.33          1 (23)
  Panda and Paiva      Mood (2012)     67.83          1 (20)
  Our submission       Mood (2011)     69.50          1 (17)
  Wang et al.          Mood (2010)     64.17          1 (36)
  Cao and Li           Mood (2009)     65.67          1 (33)
  Peeters              Mood (2008)     63.67          1 (13)

EXTENDED EXPERIMENTS
- Four datasets
  Dataset       Category   # of classes   Min/Max # of clips per class   Total # of clips   Duration of each clip
  GTZAN         Genre      10             100/100                        1,000              30 s
  Unique        Genre      14             26/766                         3,115              ~30 s
  Soundtracks   Mood       6              30/30                          180                18 s to 30 s
  MIR-Mood      Mood       4              464/619                        2,223              ~30 s or ~60 s
- Performance evaluation
  - Randomly stratified 10-fold cross-validation, repeated 10 times to obtain the average result
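
The evaluation protocol (stratified 10-fold cross-validation repeated 10 times) can be sketched without any toolkit. The function name, the seed handling and the round-robin assignment of clips to folds are assumptions; the authors' actual splitting code is not shown in the slides.

```python
import numpy as np

def repeated_stratified_folds(labels, n_folds=10, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated stratified k-fold CV."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    for _ in range(n_repeats):
        fold_of = np.empty(len(labels), dtype=int)
        for c in np.unique(labels):
            # shuffle the clips of one class, then deal them out to folds in turn
            idx = rng.permutation(np.flatnonzero(labels == c))
            fold_of[idx] = np.arange(len(idx)) % n_folds
        for f in range(n_folds):
            yield np.flatnonzero(fold_of != f), np.flatnonzero(fold_of == f)

# GTZAN-like label layout: 10 genres x 100 clips.
gtzan_labels = np.repeat(np.arange(10), 100)
splits = list(repeated_stratified_folds(gtzan_labels))
```

The reported accuracy would then be the mean over all 100 train/test splits. For strongly unbalanced sets such as Unique (26 to 766 clips per class), stratification keeps every class represented in each fold as far as its size allows.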

EXTENDED EXPERIMENTS
- Averaged classification accuracy (%) of combining different feature sets on four datasets

EXTENDED EXPERIMENTS
- Comparison of our methods with other recent work

CONCLUSIONS
- Timbre & modulation features
  - Won 1st place (MIREX 2011 mood classification)
- Timbre & improved modulation features
  - Improves accuracy by 2.47%/2.08% on GTZAN/Unique
  - Achieves 2.50%/0.14% higher accuracy than the MIREX 2011 method on Soundtracks/MIR-Mood

Thank you for listening. Questions & comments welcome!
