Audio-based Music Similarity Analysis (MUMT 611)

Audio-based Music Similarity Analysis
- MUMT 611
- Beinan Li
- Music Tech @ McGill
- 2005-3-17

Content
- Overview (based on Foote and Logan’s work)
  - Application background
  - Summary of the common approach
- Foote 1997: Content-based Retrieval of Music and Audio
- Foote 2002: Audio Retrieval by Rhythmic Similarity
- Logan 2001: A Music Similarity Function Based on Signal Analysis

Overview
- Application background:
  - Query by similarity
  - Automatic play-list generation
  - Automatic D.J.
  - Automatic summarization and categorization

Overview
- Common approach of audio similarity analysis
  - Find an underlying metaphor that relates acoustic similarity to statistical audio features, based on the defined application domain.
  - The distance between features from different audio samples is taken as the measure of similarity.
  - Supervised vs. unsupervised approach.

Overview
- Common steps:
  - Audio parameterization
    - Window the time-domain signal and extract low-level features, usually frequency-domain or cepstral features.
  - Quantization (lowers the dimensionality)
    - Transform low-level features into higher-level statistical features according to the metaphor.
    - Supervised: discriminative quantization, training involved.
    - Unsupervised: statistical clustering
  - Distance calculation
    - Calculate the distance between high-level features from different audio samples, based on a chosen type of measure.

Foote: Content-based Retrieval of Music and Audio
- Application domain:
  - Audio search engine for audio documents by acoustic similarity
- Metaphor:
  - Sample-specific template: histogram of feature groups
  - Maximized Mutual Information (MMI), from information theory
- Supervised approach
- Multiple distance measures are tested.

Foote: Content-based Retrieval of Music and Audio
- (Picture taken from Foote 1997)

Foote: Content-based Retrieval of Music and Audio
- Audio parameterization:
  - Hamming window, overlapping steps.
  - 13-D feature:
    - 12 MFCC coefficients and an energy term
    - Emphasizes mid-frequency bands.
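
The windowing step above can be sketched as follows. This is a minimal illustration only: the function names `hamming` and `frames`, and the `size` and `hop` parameters, are assumptions made for the example and do not come from Foote 1997. The MFCC computation itself is not shown.

```python
import math

def hamming(n):
    """Return a Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frames(signal, size, hop):
    """Split a signal into overlapping frames and apply a Hamming window.

    `size` is the frame length in samples and `hop` the step between
    frame starts; hop < size gives overlapping frames (both hypothetical
    parameter names).
    """
    window = hamming(size)
    return [
        [s * w for s, w in zip(signal[start:start + size], window)]
        for start in range(0, len(signal) - size + 1, hop)
    ]
```

Each returned frame would then be passed to a feature extractor such as an MFCC routine.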

Foote: Content-based Retrieval of Music and Audio
- Quantization
  - Tree-based vector quantization.
  - Leaves -> Histogram bins (features fall in…)
- Off-line training of binary-tree construction:
  - Supervised training: data labeled by known classes
  - Binary branching threshold:
    - determined via MMI between features and class labels
    - 1-D feature space partition (find the dimension by MMI)
  - Stopping rule for tree construction
    - Practical: thresholds for probability-weighted MMI
- Sample-specific template: “signature” (Logan 2001)
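
One way to read the branching rule above is sketched below: for each feature dimension, try candidate thresholds and keep the (dimension, threshold) pair that maximizes the mutual information between the binary split and the class labels. This is a simplified illustration, not Foote's implementation; the function names and the exhaustive threshold search are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(values, labels, threshold):
    """Mutual information between the split `value <= threshold` and the labels."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    conditional = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - conditional

def best_split(vectors, labels):
    """Return (mutual information, dimension, threshold) of the best 1-D split."""
    best = (0.0, None, None)
    for d in range(len(vectors[0])):
        values = [v[d] for v in vectors]
        # Every observed value except the largest is a candidate threshold.
        for t in sorted(set(values))[:-1]:
            mi = mutual_information(values, labels, t)
            if mi > best[0]:
                best = (mi, d, t)
    return best
```

Applying `best_split` recursively to the two halves of the data would grow the tree; a stopping rule is not included in this sketch.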

Foote: Content-based Retrieval of Music and Audio
- Distance calculation:
  - Euclidean distance
    - Straightforward
    - Sensitive to magnitude
    - Successful in the Speaker ID domain.
  - Cosine distance
    - Derived from the scalar product
    - Insensitive to magnitude
  - No evidence on the relation of these measures to perception.
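
The difference between the two measures can be illustrated with a short sketch. The function names are chosen for this example; the inputs stand for histogram templates of equal length, and `cosine_distance` assumes neither vector is all zeros.

```python
import math

def euclidean(a, b):
    """Euclidean distance: sensitive to the magnitude of the vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b.

    Derived from the scalar product, so scaling either vector leaves
    the result unchanged. Assumes both vectors are non-zero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm
```

For two histograms with the same shape but different total counts, for example `[1, 2, 3]` and `[2, 4, 6]`, the Euclidean distance is non-zero while the cosine distance is zero up to rounding.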

Foote: Content-based Retrieval of Music and Audio
- Experiments
  - Performance measurement:
    - No subjective test
    - File-naming hints (oboe*) -> TREC-like Average Precision.
    - Percentage of top-ranked items that are actually relevant
  - On simple sounds (laughter, musical notes, animals, etc.)
  - On music clips
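
A minimal sketch of an average-precision computation over a ranked result list is given below. It assumes that every relevant item appears somewhere in the ranked list, so the sum is normalized by the number of relevant items found; the exact TREC-style variant used in Foote 1997 may differ. The file names in the usage note are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average precision of a ranked list of booleans (True = relevant).

    Precision is evaluated at the rank of each relevant item and the
    values are averaged. Returns 0.0 when no item is relevant.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

With file-naming hints, relevance could be derived from the names, for example `[name.startswith("oboe") for name in ranked_names]` for a query that is itself an oboe sample.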

Foote: Content-based Retrieval of Music and Audio
- Experiments on simple sounds (no predominance)
  - Q-tree vs. Muscle Fish (unsupervised)
  - Pitch vs. timbre similarity -> subjective importance?
- (Diagram from Foote 1997)

Foote: Content-based Retrieval of Music and Audio
- Experiments on music clips
  - Each artist as a class
  - Cosine distance performs best
- (Diagram from Foote 1997)

Foote: Content-based Retrieval of Music and Audio
- Conclusion and future:
  - Histogram bins can be further weighted to maximize the entropy.
  - May be used to measure subjective perceptual qualities.
  - Audio content change (via templates within a stream)
  - Tree can show the importance of feature dimensions (1-D)
  - Compressed audio may help skip the step of parameterization.
- Demo link: http://www.fxpal.com/people/foote/musicr/doc0.html

Foote: Audio Retrieval by Rhythmic Similarity
- Previous tempo/rhythm tracking approaches:
  - Restricted to narrow application domains
    - Dannenberg 1987: MIDI
    - Scheirer 1998: strong percussive elements
    - Goto 1994: 4/4, bass drum on downbeat
    - Muscle Fish: drum-only tracks
    - Cliff 2000: dance music
  - Not robust.

Foote: Audio Retrieval by Rhythmic Similarity
- Application domain:
  - Automatic D.J., play-list via rhythmic similarity
- Metaphor:
  - Beat Spectrum: autocorrelation
  - Within a certain time range (lag), autocorrelation of spectral-related audio features hints at rhythm.
- Unsupervised approach
- Multiple distance measures are tested.

Foote: Audio Retrieval by Rhythmic Similarity
- Audio Parameterization
  - Overlapped windowing, FFT
  - Logarithmic magnitude response (power spectrum)
  - Others (similar sounds -> similar parameters)
    - Linear prediction
    - MFCC
    - Psychoacoustic
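
As an illustration of the log power spectrum of one windowed frame, the sketch below uses a direct DFT so that it needs only the standard library; a real system would use an FFT routine. The function name and the `floor` parameter (added to avoid taking the log of zero) are assumptions made for the example.

```python
import cmath
import math

def log_power_spectrum(frame, floor=1e-10):
    """Log power of the non-negative frequency bins of one frame.

    Uses a direct O(n^2) DFT for clarity. `floor` is a small constant
    (hypothetical choice) that keeps the logarithm finite for empty bins.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        x = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(math.log(abs(x) ** 2 + floor))
    return spectrum
```

Applied to every frame of a recording, this yields one feature vector per frame.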

Foote: Audio Retrieval by Rhythmic Similarity
- Quantization
  - Distance between ST-powers within the stream
  - Similarity Matrix (visualization of audio structure)
    - End-to-end repeated time-line
    - Main diagonal (self-correlation)
    - Diagonal stripes (D(i, j) = S(k, k+l))
    - Visible periodicity (repeated stripes)
- (Picture from Foote 2002)
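
A similarity matrix of the kind described above can be sketched by comparing every pair of frame-level feature vectors. The example assumes cosine similarity as the pairwise measure and non-zero feature vectors; other measures could be substituted, and the function names are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm

def similarity_matrix(features):
    """S[i][j] = similarity between frame i and frame j of one recording."""
    return [[cosine_similarity(fi, fj) for fj in features] for fi in features]
```

For a sequence that alternates between two kinds of frames, the matrix shows the main diagonal plus stripes at every second lag, which is the visible periodicity mentioned above.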

Foote: Audio Retrieval by Rhythmic Similarity
- Similarity Matrix
  - Brightness -> distance
- Beat Spectrum (in a range)
  - Autocorrelation
  - Peak of BS -> repetition
- (Picture from Foote 2002)
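
One simple reading of the beat spectrum is a sum along the diagonals of the similarity matrix, B(l) = sum over k of S(k, k+l). The sketch below follows that reading for a square matrix given as a list of lists; the function name and the choice to sum the same number of terms for every lag are assumptions made for the example.

```python
def beat_spectrum(s, max_lag):
    """Sum the similarity matrix along its diagonals for lags 0..max_lag-1.

    `s` is a square matrix (list of lists) and max_lag must be smaller
    than its size. Every lag sums over the same range of k so that the
    values are comparable.
    """
    n = len(s)
    return [
        sum(s[k][k + lag] for k in range(n - max_lag))
        for lag in range(max_lag)
    ]
```

Peaks in the returned list correspond to lags at which the audio repeats itself.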

Foote: Audio Retrieval by Rhythmic Similarity
- Distance calculation:
  - Euclidean distance
  - Cosine distance
  - Fourier Beat Spectral Coefficients
    - Further lower the dimensionality
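
The idea of describing a beat spectrum by a few Fourier coefficients can be sketched as below. This is a simplified illustration that keeps the magnitudes of the low-order DFT coefficients and skips the zeroth one; it does not reproduce the exact normalization used in Foote 2002, and the function name and `keep` parameter are assumptions.

```python
import cmath
import math

def beat_spectral_coefficients(beat_spectrum, keep):
    """Magnitudes of DFT coefficients 1..keep of a beat spectrum.

    Coefficient 0 (the overall level) is skipped. Magnitudes do not
    change when the beat spectrum is circularly shifted, and keeping
    only `keep` values lowers the dimensionality.
    """
    n = len(beat_spectrum)
    coeffs = []
    for k in range(1, keep + 1):
        x = sum(
            b * cmath.exp(-2j * math.pi * k * t / n)
            for t, b in enumerate(beat_spectrum)
        )
        coeffs.append(abs(x))
    return coeffs
```

Two recordings could then be compared by applying a Euclidean or cosine distance to their coefficient vectors.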

Foote: Audio Retrieval by Rhythmic Similarity
- Experiments:
  - Euclidean distance
  - Different-tempo versions of identical music
  - “Find itself”
- (Diagram from Foote 2002)

Foote: Audio Retrieval by Rhythmic Similarity
- Experiments:
  - Three 10-sec sections from each of 4 songs
  - Relevant: sections of the same song only
  - Lag size carefully chosen
    - Within a certain time range (lag), autocorrelation of spectral-related audio features hints at rhythm.
    - Rule out overly small/large lag candidates
  - Cosine and FBSC win with a precision of 96.7%

Foote: Audio Retrieval by Rhythmic Similarity
- Conclusion and future
  - The Beat Spectrum is actually a vector space, so common classification / machine learning can be used.
  - Auto-play-list: build a Similarity Matrix from the ending of Candidate Song N and the beginning of Candidate Song N+1.
  - Knowledge constraints.

Logan: A Music Similarity Function Based on Signal Analysis
- Application domain:
  - Automatic play-list via similarity
- Metaphor:
  - Song-signature: instrument type, singing-presence?
  - Spectral features
  - Transformation cost between signatures
- Unsupervised approach
- Multiple distance measures are tested.

Logan: A Music Similarity Function Based on Signal Analysis
- Audio Parameterization
  - Windowing, MFCC
  - Many other candidates, so long as a distance measure can be found.

Logan: A Music Similarity Function Based on Signal Analysis
- Quantization
  - Signature based on Foote 1997 (“template”)
    - Supervision may rely too heavily on training data and thus overemphasize a few specific histogram bins.
  - K-means clustering
    - Assume the number of clusters is fixed for a song
  - Signature: vector of common statistical parameters
    - (means, covariance, weight)
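
A song signature of this kind can be sketched with a small k-means routine. The example is a simplified illustration under several assumptions that are not from Logan 2001: the first k frames are used as initial centers, the number of iterations is fixed, and a per-dimension variance stands in for the covariance. The function name is hypothetical.

```python
def kmeans_signature(frames, k, iterations=20):
    """Cluster frame-level feature vectors and summarize each cluster.

    Returns a list of (mean, variance, weight) tuples, one per non-empty
    cluster, where weight is the fraction of frames in the cluster.
    """
    centers = [list(f) for f in frames[:k]]  # naive initialization (assumption)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in frames:
            # Assign the frame to the nearest center (squared Euclidean).
            nearest = min(
                range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(f, centers[c])),
            )
            clusters[nearest].append(f)
        for c, members in enumerate(clusters):
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    signature = []
    for c, members in enumerate(clusters):
        if not members:
            continue
        variance = [
            sum((x - m) ** 2 for x in col) / len(members)
            for col, m in zip(zip(*members), centers[c])
        ]
        signature.append((centers[c], variance, len(members) / len(frames)))
    return signature
```

The weights of a signature sum to one, so each signature can be treated as a distribution of probability mass over its clusters.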

Logan: A Music Similarity Function Based on Signal Analysis
- Distance measure
  - Earth Mover’s Distance (EMD)
    - Weighted cost of moving probability mass from one cluster to another. EMD is the normalized cost.
    - The distance between clusters is based on the Kullback-Leibler distance.
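
The cluster-to-cluster cost can be sketched as a symmetrized Kullback-Leibler divergence between two Gaussians. The example assumes diagonal covariances (one variance per dimension) and uses the standard closed form; it is not claimed to match the exact expression in Logan 2001, and the transport optimization that turns these costs into an EMD is not shown.

```python
import math

def kl_gaussian(mean_p, var_p, mean_q, var_q):
    """KL divergence KL(p || q) between two diagonal-covariance Gaussians."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )

def symmetric_kl(mean_p, var_p, mean_q, var_q):
    """Symmetrized divergence: KL(p || q) + KL(q || p)."""
    return (
        kl_gaussian(mean_p, var_p, mean_q, var_q)
        + kl_gaussian(mean_q, var_q, mean_p, var_p)
    )
```

The symmetrized form is zero for identical clusters and does not depend on the order of its arguments, which makes it usable as a cost between cluster pairs.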

Logan: A Music Similarity Function Based on Signal Analysis
- Experiments:
  - Test over 8,000 style-variant songs in a database.
  - Several numbers of MFCC coefficients are tested.
  - Main metrics:
    - Average distance between all songs
    - Average distance between songs on the same album
    - Average distance between songs in the same genre
    - Average distance between songs by the same artist
  - Objective and subjective relevance tested.
  - Robustness to corruption tested.
    - Remove a section of a song on purpose
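
The average-distance metrics listed above can be sketched with one helper that averages a distance over all pairs, or only over pairs that share a property. The function name, the `same` predicate, and the dictionary layout of a song in the usage note are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def average_distance(items, distance, same=None):
    """Average pairwise distance over items.

    If `same` is given, only pairs for which same(a, b) is true are
    counted (for example, songs on the same album). Assumes at least
    one pair qualifies.
    """
    pairs = [
        (a, b) for a, b in combinations(items, 2)
        if same is None or same(a, b)
    ]
    return sum(distance(a, b) for a, b in pairs) / len(pairs)
```

With songs stored as dictionaries, `average_distance(songs, d, same=lambda a, b: a["album"] == b["album"])` would give the same-album average for a distance function `d`; a useful similarity measure should make it smaller than the average over all songs.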