Towards Musical Query-by-Semantic-Description using the CAL 500 Dataset
Douglas Turnbull, Computer Audition Lab, UC San Diego. Work with Luke Barrington, David Torres, and Gert Lanckriet. SIGIR, June 25, 2007
How do we find music? • Query-by-Metadata - artist, song, album, year – We must know what we want • Query-by-(Humming, Tapping, Beatboxing) – Requires talent • Query-by-Song-Similarity – We must possess ‘acoustically’ similar songs • Query-by-Semantic-Description – Google seems to work pretty well for text – Semantic Image Labeling is a hot topic in Computer Vision – Can it work for music? 1
Semantic Music Annotation and Retrieval Our goal is to build a system that can 1. Annotate a song with meaningful words 2. Retrieve songs given a text-based query Frank Sinatra ‘Fly Me to the Moon’ Annotation Retrieval ‘Jazz’ ‘Male Vocals’ ‘Sad’ ‘Mellow’ ‘Slow Tempo’ Plan: Learn a probabilistic model that captures the relationship between the audio content of a song and the words that describe the song. We treat this as a supervised multi-class, multi-label problem. 2
System Overview [Diagram: Data (training data + vocabulary) → Features (annotation vectors y; audio-feature extraction X) → Modeling (parametric model: set of GMMs; parameter estimation: EM algorithm) → Evaluation (novel song → inference → annotation / music review; text query → retrieval)] 3
System Overview [Diagram: Data stage highlighted - training data and annotation] 4
The CAL 500 data set The Computer Audition Lab 500-song (CAL 500) data set • 500 ‘Western Popular’ songs • 174-word vocabulary – genre, emotion, usage, instrumentation, rhythm, pitch, vocal characteristics • 3 or more annotations per song • 55 paid undergrads annotated music for 120 hours Other Techniques 1. Text-mining of web documents 2. ‘Human Computation’ games (e.g., Listen Game) 5
System Overview [Diagram: Features stage highlighted - document vectors (y); audio-feature extraction (X)] 6
Semantic Representation We choose a vocabulary of ‘musically relevant’ words Each annotation is converted to a real-valued vector. – Each element represents the ‘semantic association’ between a word and the song. Example: Frank Sinatra’s ‘Fly Me to the Moon’ Vocab = {funk, jazz, guitar, female vocals, sad, passionate} Annotation Vector = [0/4, 3/4, 4/4, 0/4, 2/4, 1/4] 7
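As a minimal sketch (with made-up annotator counts for the example song), each element of the annotation vector is the fraction of annotators who used that word:

```python
# Hypothetical counts: how many of the 4 annotators used each word for the song
counts = {"funk": 0, "jazz": 3, "guitar": 4, "female vocals": 0, "sad": 2, "passionate": 1}
n_annotators = 4

# Divide each count by the number of annotators to get semantic associations
annotation_vector = [c / n_annotators for c in counts.values()]
print(annotation_vector)  # [0.0, 0.75, 1.0, 0.0, 0.5, 0.25]
```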
Acoustic Representation Each song is represented as a bag of feature vectors – Pass a short-time window over the audio signal – Extract a feature vector for each short-time audio segment Specifically, we calculate Delta MFCC feature vectors – Mel-frequency Cepstral Coefficients (MFCCs) represent the shape of a short-term (23 msec) spectrum – Popular for representing speech, music, and sound effects – Instantaneous derivatives (deltas) encode short-time temporal info – 10,000 39-dimensional vectors per minute 8
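A rough sketch of the delta-stacking step. MFCC extraction itself (e.g. via an audio library) is assumed to have already produced a 13 × T matrix; `np.gradient` stands in for the instantaneous derivative:

```python
import numpy as np

def stack_delta_mfcc(mfcc):
    """Append 1st and 2nd time-derivatives to a (13, T) MFCC matrix,
    producing the (39, T) Delta-MFCC representation described above."""
    d1 = np.gradient(mfcc, axis=1)   # approximate instantaneous derivative
    d2 = np.gradient(d1, axis=1)     # second derivative (delta-delta)
    return np.vstack([mfcc, d1, d2])

# toy input: 13 coefficients over 100 short-time frames
feats = stack_delta_mfcc(np.random.randn(13, 100))
print(feats.shape)  # (39, 100)
```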
System Overview Data Features Training Data Vocabulary T T Annotation Modeling Parametric Model: Set of GMMs Document Vectors (y) Parameter Estimation: Audio-Feature Extraction (X) EM Algorithm 9
Statistical Model We adapt the Supervised Multi-class Labeling (SML) model – Set of probability distributions over the audio feature space – One Gaussian Mixture Model (GMM) per word - p(x|w) – Estimate parameters for each GMM using the set of training songs that are positively associated with the word Notes: – Developed for image annotation by Carneiro and Vasconcelos – Modified for real-valued semantic weights rather than binary class labels – Extended formulation to handle multi-word queries 10
Gaussian Mixture Model (GMM) A GMM is often used to model arbitrary probability distributions over high-dimensional spaces: A GMM is a weighted combination of R Gaussian distributions, p(x) = Σ_{r=1}^{R} π_r N(x | μ_r, Σ_r) • π_r is the r-th mixing weight • μ_r is the r-th mean • Σ_r is the r-th covariance matrix These parameters are usually estimated using the Expectation Maximization (EM) algorithm. 11
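The mixture density can be sketched directly, here specialized to diagonal covariances for simplicity:

```python
import numpy as np

def gmm_pdf(x, weights, means, variances):
    """p(x) = sum_r pi_r * N(x | mu_r, Sigma_r), with diagonal covariances.
    weights: length-R list; means/variances: R arrays of dimension D; x: (D,)."""
    p = 0.0
    for w, mu, var in zip(weights, means, variances):
        norm = np.prod(1.0 / np.sqrt(2.0 * np.pi * var))   # Gaussian normalizer
        p += w * norm * np.exp(-0.5 * np.sum((x - mu) ** 2 / var))
    return p

# single standard-normal component: density at the mean is 1/sqrt(2*pi) ~ 0.3989
p0 = gmm_pdf(np.zeros(1), [1.0], [np.zeros(1)], [np.ones(1)])
```

In practice the parameters would be fit with EM rather than set by hand, as the slides describe.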
Step 1 - Song GMMs To model each song: 1. Segment the audio signal 2. Extract short-time feature vectors 3. Estimate the GMM distribution using ‘standard’ EM [Figure: bag of MFCC vectors fit with a GMM] 12
Word GMMs - p(x|w) For each word w, we learn a word model p(x|w) 1. Identify all songs associated with w – i.e., all ‘romantic’ songs 2. Estimate song-level GMMs 3. Use Weighted Mixture Hierarchies EM to estimate p(x|w) – Soft clustering of Gaussian components from song GMMs [Figure: ‘romantic’ song GMMs combined by efficient hierarchical estimation into the semantic class model p(x|w)] 13
System Overview [Diagram: Evaluation stage highlighted - novel song → inference → annotation / music review] 14
Annotation Given a set of word GMMs and a novel song X = {x_1, …, x_T}, we calculate the likelihood of the song under each word GMM, assuming 1. A uniform word prior P(w) 2. Feature vectors are conditionally independent given a word (Naïve Bayes) These likelihoods can be interpreted as a semantic multinomial distribution over the vocabulary of words. Annotation involves picking the peaks of the semantic multinomial. 15
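In code, this step might look like the following sketch, where `loglik[v, t] = log p(x_t | w_v)` is assumed to come from the word GMMs:

```python
import numpy as np

def semantic_multinomial(loglik):
    """loglik: (V, T) array of log p(x_t | w_v). Naive Bayes sums over
    frames; subtracting the max keeps the exponentiation numerically stable."""
    scores = loglik.sum(axis=1)      # log p(X | w_v), up to a shared constant
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()               # normalize into a multinomial

def annotate(p, vocab, k=2):
    """Pick the k peaks of the semantic multinomial."""
    return [vocab[i] for i in np.argsort(p)[::-1][:k]]

# toy example: 3 words, 2 frames; 'jazz' fits the audio best, 'metal' worst
vocab = ["jazz", "metal", "sad"]
loglik = np.array([[-1.0, -1.0], [-5.0, -5.0], [-2.0, -2.0]])
p = semantic_multinomial(loglik)
print(annotate(p, vocab))  # ['jazz', 'sad']
```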
Annotation Semantic Multinomial for “Give it Away” by the Red Hot Chili Peppers 16
Annotation: Automatic Music Reviews Dr. Dre (feat. Snoop Dogg) - Nuthin' but a 'G' thang This is a dance poppy, hip-hop song that is arousing and exciting. It features drum machine, backing vocals, male vocal, a nice acoustic guitar solo, and rapping, strong vocals. It is a song that is very danceable and with a heavy beat that you might like listen to while at a party. Frank Sinatra - Fly me to the moon This is a jazzy, singer / songwriter song that is calming and sad. It features acoustic guitar, piano, saxophone, a nice male vocal solo, and emotional, high-pitched vocals. It is a song with a light beat and a slow tempo that you might like listen to while hanging with friends. 17
System Overview [Diagram: Evaluation stage highlighted - text query → retrieval] 18
Retrieval 1. Annotate each song in the corpus with a semantic multinomial p • p = {P(w_1|X), …, P(w_V|X)} 2. Given a text-based query, construct a query multinomial q – q_w = 1/|w|, if word w appears in the query string – q_w = 0, otherwise 3. Rank-order all songs by the Kullback-Leibler (KL) divergence between the query multinomial and each song’s semantic multinomial The compact semantic multinomial representation of a song allows us to quickly rank-order songs. 19
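A sketch of the ranking step (names and toy distributions are illustrative; since q is zero outside the query words, the KL sum only runs over its support):

```python
import numpy as np

def query_multinomial(query_words, vocab):
    """Uniform multinomial over the query words, zero elsewhere."""
    q = np.array([1.0 if w in query_words else 0.0 for w in vocab])
    return q / q.sum()

def kl_divergence(q, p, eps=1e-12):
    """KL(q || p), restricted to the support of q."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / (p[mask] + eps))))

def retrieve(query_words, vocab, song_multinomials):
    """Return song indices sorted by increasing KL divergence (best first)."""
    q = query_multinomial(query_words, vocab)
    return sorted(range(len(song_multinomials)),
                  key=lambda i: kl_divergence(q, song_multinomials[i]))

vocab = ["pop", "female vocals", "tender", "metal"]
songs = [np.array([0.1, 0.1, 0.1, 0.7]),   # mostly 'metal'
         np.array([0.3, 0.3, 0.3, 0.1])]   # matches the query well
ranking = retrieve(["pop", "female vocals", "tender"], vocab, songs)
print(ranking)  # [1, 0]
```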
Retrieval The top 3 results for “pop, female vocals, tender”: 1. Shakira - The One 2. Alicia Keys - Fallin’ 3. Evanescence - My Immortal 20
Retrieval: Query-by-Semantic-Description Query Retrieved Songs ‘Tender’ Crosby, Stills and Nash - Guinnevere Jewel - Enter from the East Art Tatum - Willow Weep for Me John Lennon - Imagine Tom Waits - Time ‘Female Vocals’ Alicia Keys - Fallin’ Shakira - The One Christina Aguilera - Genie in a Bottle Junior Murvin - Police and Thieves Britney Spears - I'm a Slave 4 U ‘Tender’ AND ‘Female Vocals’ Jewel - Enter from the East Evanescence - My Immortal Cowboy Junkies - Postcard Blues Everly Brothers - Take a Message to Mary Sheryl Crow - I Shall Believe 21
System Overview [Diagram: full system - data, features, modeling, and evaluation] 22
Quantifying Annotation Our system annotates the CAL 500 songs with 10 words from our vocabulary of 174 words.
Model         Precision  Recall
Random        0.14       0.06
Our System    0.27       0.16
Human         0.30       0.15
23
Quantifying Retrieval We rank-order the songs once for each query.
Model                    AROC
Random                   0.50
Our System - 1 Word      0.71
Our System - 2 Words     0.72
Our System - 3 Words     0.73
24
Demos • CAL Music Search Engine - a content-based semantic music search engine. • Listen Game - a ‘game with a purpose’ to collect semantic annotations of music. 25
What’s on tap… Large-scale system – Web-based, large-scale collection of reliable human annotations => Multiplayer, online game - Listen Game - ISMIR 07 – Prune and extend vocabulary (automatically) - ISMIR 07 – Novel Applications - Music Search Engine / Radio Player Personalized search – Model homogeneous groups / individuals rather than population • Personalized Audio Search – Adjust to affective state of the user Novel query paradigms - Query-by-semantic-example - ICASSP 07 - Heterogeneous queries 26
“Talking about music is like dancing about architecture” - origins unknown cosmal.ucsd.edu/cal 27
System Overview [Diagram: full system - data, features, modeling, and evaluation] 28
Acoustic Representation Calculating Delta MFCC feature vectors – Calculate a time series for 13 MFCCs – Append 1st and 2nd instantaneous derivatives – 5,200 39-dimensional feature vectors per minute of audio content – Denoted by X = {x_1, …, x_T}, where T depends on the length of the song [Figure: short-time Fourier transform; time series of MFCCs; reconstruction based on MFCCs (log frequency)] 29
Three Approaches to Parameter Estimation For each word w, we want to estimate the parameters of a GMM p(x|w). 1. Direct Estimation a. Take the union of the sets of feature vectors from all songs semantically associated with w b. Use ‘standard’ Expectation Maximization (EM) to estimate the GMM parameters Direct Estimation is computationally difficult and empirically converges to poor local optima. 30
Three Approaches to Parameter Estimation For each word w, we want to estimate the parameters of a GMM. 2. Model Averaging a. For each song associated with w, estimate a ‘song GMM’ p(x|s) using the standard EM algorithm b. Concatenate mixture components and renormalize mixture weights Model averaging produces a distribution with a variable number of mixture components. As the training set grows, evaluating this distribution becomes prohibitively expensive. 31
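The pooling step of model averaging can be sketched as follows; each song GMM is represented here as a (weights, means, covariances) tuple, an illustrative layout rather than the paper's actual data structures:

```python
def average_models(song_gmms):
    """Pool mixture components from several song GMMs into one large GMM,
    scaling each song's weights by 1/n so the pooled weights sum to 1."""
    n = len(song_gmms)
    weights, means, covs = [], [], []
    for w, m, c in song_gmms:
        weights.extend(wi / n for wi in w)
        means.extend(m)
        covs.extend(c)
    return weights, means, covs

# two toy 1-D song GMMs with 2 components each -> pooled 4-component GMM
gmm_a = ([0.5, 0.5], [0.0, 1.0], [1.0, 1.0])
gmm_b = ([0.2, 0.8], [2.0, 3.0], [1.0, 1.0])
w, m, c = average_models([gmm_a, gmm_b])
print(len(w))  # 4
```

The component count grows linearly with the number of training songs, which is exactly the cost problem the slide points out.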
Mixture Hierarchies Density Estimation For each word w, we want to estimate the parameters of a GMM. 3. Mixture Hierarchies EM a. For each song associated with w, estimate a ‘song GMM’ using the standard EM algorithm b. Learn a ‘mixture of mixture components’ using the Mixture Hierarchies EM algorithm [Vasconcelos 01] Notes: • Computationally efficient for both parameter estimation and inference. • Each song is re-represented as a ‘smoothed’ estimate of its bag of feature vectors. • Combining song models abstracts the semantics of a common word. 32
Quantifying Annotation Our system annotates the CAL 500 songs with 10 words from our vocabulary of 174 words. – ‘Population Annotation’ Ground Truth Metric: ‘Word’ Precision & Recall. Consider word w: Precision = (# songs correctly annotated with w) / (# songs annotated with w) Recall = (# songs correctly annotated with w) / (# songs that should have been annotated with w) Mean Word Precision and Word Recall are the averages over all words in our vocabulary. 33
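Under the definitions above, per-word precision and recall can be computed with a sketch like this (annotations are represented as a word set per song; the toy data is illustrative):

```python
def word_precision_recall(predicted, truth, word):
    """predicted/truth: lists of word-sets, one per song."""
    correct = sum(1 for p, t in zip(predicted, truth) if word in p and word in t)
    annotated = sum(1 for p in predicted if word in p)          # system used w
    relevant = sum(1 for t in truth if word in t)               # w belongs in truth
    precision = correct / annotated if annotated else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall

predicted = [{"jazz", "sad"}, {"jazz"}, {"rock"}]
truth     = [{"jazz"},        {"rock"}, {"jazz"}]
print(word_precision_recall(predicted, truth, "jazz"))  # (0.5, 0.5)
```

Mean word precision/recall would then average these values over every word in the vocabulary.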
Quantifying Annotation Our system annotates the CAL 500 songs with 10 words from our vocabulary of 174 words.
Model         Precision  Recall
Random        0.14       0.06
Upper Bound   0.71       0.38
Our System    0.27       0.16
Human         0.30       0.15
Compared with a human, our model is • worse on objective categories - instrumentation, genre • about the same on subjective categories - emotion, usage 34
Quantifying Retrieval For each 1-, 2-, and 3-word query for which there are at least 5 songs in the ground truth, we rank-order the test set songs according to the KL divergence between the query multinomial and the semantic multinomial. Metric: Area under the ROC Curve (AROC) – An ROC curve is a plot of the true positive rate as a function of the false positive rate as we move down this ranked list of songs. – Integrating the curve gives a scalar between 0 and 1, where 0.5 is the expected value when guessing randomly. Mean AROC is the average AROC over a large number of queries. 35
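AROC for a single query can be computed directly from the ranked list's 0/1 relevance labels, since the area under the ROC curve equals the probability that a relevant song is ranked above an irrelevant one (a sketch):

```python
def aroc(ranked_relevance):
    """Area under the ROC curve for a ranked list of 0/1 relevance labels,
    best-ranked songs first. Counts concordant (positive-above-negative)
    pairs and normalizes by the total number of positive/negative pairs."""
    pos = sum(ranked_relevance)
    neg = len(ranked_relevance) - pos
    tp, area = 0, 0
    for rel in ranked_relevance:
        if rel:
            tp += 1
        else:
            area += tp  # this negative is out-ranked by tp positives so far
    return area / (pos * neg)

print(aroc([1, 1, 0, 0]))  # 1.0 - perfect ranking
print(aroc([1, 0, 1, 0]))  # 0.75
```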
A biased view of Music Classification 2000-03: Music classification (by genre, emotion, instrumentation) becomes a popular MIR task – Undergrad thesis on genre classification with G. Tzanetakis 2003-04: MIR community starts to criticize music classification problems – ill-posed due to subjectivity – not an end in itself – performance ‘glass ceiling’ 2004-06: Focus turns to music similarity research – Recommendation – Playlist generation 2006-07: We view music annotation as a supervised multi-class labeling problem – Like classification, but with a large, less restrictive vocabulary 36
Modeling Semantic Classes Given a set of word models p(x|w) over a vocabulary of words, Annotation: Given a novel song, we pick words by comparing the likelihood of the audio features under each word model. Retrieval: Given a text query, we pick songs that are likely under the word models associated with the words in the query. 37