Collective Annotation of Music from Multiple Semantic Categories

Zhiyao Duan (1, 2), Lie Lu (1), and Changshui Zhang (2)
1. Microsoft Research Asia (MSRA), Beijing, China.
2. Department of Automation, Tsinghua University, Beijing, China.

Summary
- Two collective semantic annotation methods for music, modeling not only individual labels but also label correlations:
  - Generative: Gaussian Mixture Model (GMM)-based method, which approximates the joint label posterior using label-pair posteriors;
  - Discriminative: Conditional Random Field (CRF)-based method.
- 50 musically relevant labels are manually selected for music annotation, covering 10 aspects of music perception.
- Normalized mutual information (Norm. MI) is employed to measure the correlation between two semantic labels. Its properties:
  - 0 <= Norm. MI(X; Y) <= 1;
  - Norm. MI(X; Y) = 0 when X and Y are statistically independent;
  - Norm. MI(X; X) = 1.
- Label pairs with strong correlation are selected and modeled (the selected pairs are listed in Table 2).
- Experimental results show slight but consistent improvements of the collective annotation methods over the individual ones.

Motivation
- Semantic annotation of music is an important research direction.
  - Semantic labels (text, words) are a more compact and efficient representation than raw audio or low-level features.
  - Annotation potentially facilitates applications, e.g. music retrieval and recommendation.
- Disadvantages of previous methods:
  - Vocabularies without structured labels yield annotations that do not cover sufficient musical aspects.
  - They model audio-label relations only, without label-label relations, e.g. "hard rock" & "electric guitar", "happy" & "minor key".
- We therefore divide the semantic vocabulary into categories and attempt to model the label correlations.

Semantic Vocabulary
The vocabulary (Table 1) consists of 50 labels, manually selected from web-parsed musically relevant words and organized into 10 semantic categories (aspects).
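The Norm. MI used to rank label pairs can be sketched for two binary label vectors (presence/absence of each label across songs). The poster does not give the exact normalization; the version below, I(X;Y) / sqrt(H(X)·H(Y)), is one common choice that satisfies all three listed properties, and the function name is illustrative.

```python
import numpy as np

def normalized_mi(x, y, eps=1e-12):
    """Norm. MI between two binary label vectors (one entry per song).

    Uses the symmetric normalization I(X;Y) / sqrt(H(X) * H(Y)); the
    exact normalization on the poster is not specified, so this is an
    assumption that satisfies the three stated properties.
    """
    x, y = np.asarray(x), np.asarray(y)
    # Joint distribution over the four (x, y) outcomes.
    p = np.array([[np.mean((x == a) & (y == b)) for b in (0, 1)]
                  for a in (0, 1)])
    px, py = p.sum(axis=1), p.sum(axis=0)
    # Mutual information I(X;Y) and marginal entropies H(X), H(Y).
    mi = sum(p[a, b] * np.log(p[a, b] / (px[a] * py[b]))
             for a in (0, 1) for b in (0, 1) if p[a, b] > eps)
    hx = -sum(q * np.log(q) for q in px if q > eps)
    hy = -sum(q * np.log(q) for q in py if q > eps)
    return 0.0 if hx < eps or hy < eps else float(mi / np.sqrt(hx * hy))
```

Identical label vectors give a value of 1, statistically independent ones a value of 0, matching the properties above.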
For annotation, the number of labels chosen from each category is limited. Normalized mutual information (Norm. MI, Eq. 4) measures the correlation of each label pair; it is the mutual information I(X; Y) between two labels X and Y, normalized by their entropies H(X) and H(Y). Only the label pairs whose Norm. MI values are larger than a threshold are selected to be modeled.

Audio Feature Extraction
A bag of beat-level feature vectors is used to represent a song:
1. Each song is divided into beat segments.
2. Each segment contains a number of frames of 20 ms length with 10 ms overlap.
3. Timbre features (94-d) and rhythm features (8-d) are extracted to compose a 102-d feature vector for each segment.
   - Timbre features: means and standard deviations of 8th-order MFCCs, spectral shape features, and spectral contrast features.
   - Rhythm features: average tempo, average onset frequency, rhythm regularity, rhythm contrast, rhythm strength, average drum frequency, drum amplitude, and drum confidence [1].
4. PCA reduces the dimensionality to 65, preserving 95% of the energy.

Semantic Annotation
Problem: find semantic words to describe a song. This can be viewed as a multi-label binary classification problem.
- Input: a vocabulary consisting of labels (words), and a bag of feature vectors X of a song.
- Output: an annotation vector a, where each a_i is a binary variable for label i (1: presence, -1: absence).
- Solution: Maximum A Posteriori (MAP) estimation, a* = argmax_a P(a | X) (Eq. 1).

Previous methods treat the labels as independent.
- Individual GMM-based method: each single-label posterior (Eq. 2) is computed via Bayes' rule, P(a_i | X) ∝ P(X | a_i) P(a_i) (Eq. 3). The likelihood P(X | a_i) is estimated using a GMM from training data, and the prior probability P(a_i) is set to a uniform distribution.

Proposed methods consider the relations between labels.
1) Collective GMM-based method: approximates the joint label posterior by combining the single-label posteriors with the label-pair posteriors P(a_i, a_j | X) for the selected pairs (i, j) in the set S, weighted by a trade-off λ between the label posteriors and the label-pair posteriors. The likelihoods P(X | a_i) and P(X | a_i, a_j) are estimated using 8-kernel GMMs from training data.
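The individual (Eqs. 1-3) and collective GMM-based annotation can be sketched as follows. This is a simplified illustration, not the poster's implementation: single diagonal Gaussians stand in for the 8-kernel GMMs, the bag of feature vectors is collapsed to one vector per song, priors are uniform, and MAP inference is done by exhaustive search over label vectors (feasible only for a handful of labels). All names are illustrative.

```python
import itertools
import numpy as np

def fit_gaussian(X):
    """Diagonal Gaussian; a stand-in for the poster's 8-kernel GMMs."""
    X = np.asarray(X, dtype=float)
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def loglik(x, model):
    """Log-likelihood of feature vector x under a diagonal Gaussian."""
    mu, var = model
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)))

def annotate_collective(x, single_models, pair_models, lam=0.5):
    """MAP annotation over all label vectors (Eq. 1), collective variant.

    single_models[i][v]        : Gaussian for label i taking value v in {0, 1}
    pair_models[(i, j)][(v,w)] : Gaussian for a selected label pair (i, j)
    lam                        : trade-off between label and label-pair terms
    Uniform label priors are assumed, so MAP reduces to likelihood scoring;
    setting lam=0 recovers the individual method.
    """
    K = len(single_models)
    best, best_score = None, -np.inf
    for a in itertools.product((0, 1), repeat=K):  # exhaustive for small K
        score = sum(loglik(x, single_models[i][a[i]]) for i in range(K))
        score += lam * sum(loglik(x, pair_models[(i, j)][(a[i], a[j])])
                           for (i, j) in pair_models)
        if score > best_score:
            best, best_score = list(a), score
    # Poster convention: 1 = presence, -1 = absence.
    return [1 if v == 1 else -1 for v in best]
```

Training amounts to fitting one Gaussian per label value and one per selected pair-value combination from the annotated songs.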
2) Collective CRF-based method: a Conditional Random Field (CRF) is an undirected graphical model in which the nodes are the label variables and the edges represent relations between labels. The multi-label classification CRF model [2] (Eq. 5) combines an overall potential of the nodes with an overall potential of the edges, where x is a sample (a song) represented by an input feature vector, y is the output label vector, Z(x) is the normalizing factor, f and g are the CRF features (predefined real-valued functions), and λ and μ are the parameters to be estimated using training data.
Note: unlike the GMM-based method, a "bag of features" cannot be used here; instead, each song is represented by a 115-d feature vector: 115-d = 65-d (mean of the beat-level features) + 50-d (word likelihoods).

Experiments
Data set:
- ~5,000 Western popular songs;
- manually annotated with semantic labels from the vocabulary in Table 1, according to the label number limitations;
- 25% for training, 75% for testing;
- 49 label pairs with Norm. MI > 0.1 are selected to be modeled.
Compared methods:
1. Collective GMM-based method
2. Individual GMM-based method
3. Collective CRF-based method
4. Individual CRF-based method: uses the CRF framework in Eq. (5) without the overall potential of the edges.

Results
Per-category performance, i.e. the performance for each category (Table 3):
1. The CRF-based methods outperform the GMM-based methods;
2. The collective annotation methods slightly but consistently improve on their individual counterparts, for both the GMM-based and the CRF-based methods.
Per-song performance, i.e. the average performance for a song (Table 4):
1. While the recalls are similar, the precision is improved significantly from the generative models to the discriminative models;
2. The collective methods slightly outperform their individual counterparts.

Open question
The performance improvement from individual modeling to collective modeling is not large. A possible reason: in the individual modeling methods, "correlated" labels share many songs in their training sets (since each song has multiple labels), so the individually trained models of correlated labels are themselves correlated; in other words, the correlation is already implicitly modeled.
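The inference implied by Eq. (5) can be sketched on a toy label set. This is a hedged illustration, not the poster's model: the node features are reduced to linear scores using the poster's ±1 label convention, the edge features to a 2x2 score table per selected pair, the parameters are taken as given (the poster estimates them from training data), and Z(x) is computed by exhaustive enumeration, which is feasible only for a few labels.

```python
import itertools
import numpy as np

def crf_posterior(x, W_node, W_edge, edges):
    """Toy posterior for a small multi-label CRF in the spirit of Eq. (5).

    x       : song feature vector
    W_node  : (K, d) weights; node score for label i is y_i * (W_node[i] @ x)
              with y_i in {+1, -1} (presence/absence)
    W_edge  : dict mapping a selected pair (i, j) -> 2x2 co-occurrence scores
    edges   : list of selected label pairs
    Returns the full posterior over label vectors, normalized by Z(x).
    """
    K = W_node.shape[0]
    scores = {}
    for y in itertools.product((0, 1), repeat=K):  # all label vectors
        s = sum((1 if y[i] else -1) * float(W_node[i] @ x) for i in range(K))
        s += sum(W_edge[e][y[e[0]], y[e[1]]] for e in edges)
        scores[y] = np.exp(s)
    Z = sum(scores.values())  # normalizing factor Z(x)
    return {y: v / Z for y, v in scores.items()}
```

Dropping the edge term recovers the individual CRF-based baseline used in the comparison above.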
Future Work
1. Exploit better methods to model the label correlations.
2. Exploit better features, especially the song-level feature vector for the CRF-based methods.
3. Apply the obtained annotations in various applications, such as music similarity measurement, music search, and recommendation.

References
[1] Lu, L., Liu, D. and Zhang, H.-J., "Automatic mood detection and tracking of music audio signals," IEEE Trans. on Audio, Speech and Lang. Process., vol. 14, no. 1, pp. 5-18, 2006.
[2] Ghamrawi, N. and McCallum, A., "Collective multi-label classification," in Proc. 14th ACM International Conference on Information and Knowledge Management (CIKM), 2005, pp. 195-200.