www.kdd.uncc.edu CCI, UNC-Charlotte
Music Information Retrieval based on multi-label cascade classification system
http://www.mir.uncc.edu
Presented by Zbigniew W. Ras
Research sponsored by NSF IIS-0414815, IIS-0968647

MIRAI - Musical Database (mostly MUMS) [music pieces played by 57 different music instruments]. Goal: design and implement a system for automatic indexing of music by instruments (objective task) and emotions (subjective task). Outcome: a musical database represented as an FS-tree, guaranteeing efficient storage and retrieval [music pieces indexed by instruments and emotions].

MIRAI - Musical Database [music pieces played by 57+ different music instruments (see below) and described by over 910 attributes]: Alto Flute, Bach-trumpet, bass-clarinet, bassoon, bass-trombone, Bb-trumpet, b-flat-clarinet, cello-bowed, cello-martele, cello-muted, cello-pizzicato, contrabass-clarinet, contrabassoon, crotales, c-trumpet, c-trumpet-harmonStemOut, doublebass-bowed, doublebass-martele, doublebass-muted, doublebass-pizzicato, eflat-clarinet, electric-bass, electric-guitar, english-horn, flute, french-horn, french-horn-muted, glockenspiel, marimba-crescendo, marimba-singlestroke, oboe, piano-9ft, piano-hamburg, piccolo-flutter, saxophone-soprano, saxophone-tenor, steeldrums, symphonic, tenor-trombone-muted, tuba, tubular-bells, vibraphone-bowed, vibraphone-hardmallet, viola-bowed, viola-martele, viola-muted, viola-natural, viola-pizzicato, violin-artificial, violin-bowed, violin-ensemble, violin-muted, violin-natural-harmonics, xylophone

Automatic Indexing of Music. What is needed? A database of monophonic and polyphonic music signals and their descriptions in terms of new features (including temporal features) in addition to the standard MPEG-7 features. These signals are labeled by instruments and emotions, forming additional features called decision features. Why is it needed? To build classifiers for automatic indexing of musical sound by instruments and emotions.

MIRAI - Cooperative Music Information Retrieval System based on Automatic Indexing. [Flowchart: the user's query goes through a query adapter to the indexed audio database (music objects indexed by instruments and durations); when the answer is empty, the query is adapted and retried.]

Challenges to applying KDD in MIR: the nature and types of raw data.

Data source | Organization | Volume | Type | Quality
Traditional data | structured | modest | discrete, categorical | clean
Audio data | unstructured | very large | continuous, numeric | noisy

Feature extraction: amplitude values at each sample point form the lower-level raw data. Feature extraction (for instance using Matlab) produces higher-level, manageable representations that are stored in a feature database, on which traditional pattern recognition tasks operate: classification, clustering, regression.

MPEG-7 features. [Diagram: signal → Hamming window → STFT (NFFT points) → signal envelope, signal power spectrum → Spectral Centroid, Log Attack Time, Temporal Centroid; Harmonic Peaks Detection → Fundamental Frequency, Instantaneous Harmonic Spectral Centroid, Instantaneous Harmonic Spectral Spread, Instantaneous Harmonic Spectral Deviation, Instantaneous Harmonic Spectral Variation.] STFT - short-time Fourier transform; NFFT - non-uniform fast Fourier transform.
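
A minimal sketch of the first stage of this pipeline, assuming plain NumPy and illustrative frame/hop sizes (not the authors' implementation): a Hamming-windowed STFT, the per-frame power spectrum, and the spectral centroid and spread derived from it.

```python
import numpy as np

def spectral_centroid_spread(signal, sr, frame=2048, hop=512):
    """Per-frame spectral centroid C and spread S from a Hamming-windowed STFT."""
    window = np.hamming(frame)
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
    centroids, spreads = [], []
    for start in range(0, len(signal) - frame + 1, hop):
        power = np.abs(np.fft.rfft(signal[start:start + frame] * window)) ** 2
        total = power.sum() or 1e-12                # guard against silent frames
        c = (freqs * power).sum() / total           # power-weighted mean frequency
        s = np.sqrt((((freqs - c) ** 2) * power).sum() / total)
        centroids.append(c)
        spreads.append(s)
    return np.array(centroids), np.array(spreads)
```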

Derived Database.

MPEG-7 features: Spectrum Centroid (C), Spectrum Spread (S), Spectrum Flatness (F), Spectrum Basis Functions, Spectrum Projection Functions, Log Attack Time, Harmonic Peaks, …

Non-MPEG-7 features and new temporal features: Roll-Off, Flux, Mel frequency cepstral coefficients (MFCC), Tristimulus and similar parameters (contents of odd and even partials: Od, Ev), mean frequency deviation for low partials, changing ratios of spectral spread, changing ratios of spectral centroid.

New Temporal Features - S′(i), C′(i), S″(i), C″(i)

S′(i) = [S(i+1) - S(i)] / S(i); C′(i) = [C(i+1) - C(i)] / C(i),

where S(i), S(i+1) and C(i), C(i+1) are the spectral spread and spectral centroid of two consecutive frames, frame i and frame i+1. These changing ratios of spectral spread and spectral centroid between consecutive frames are treated as first derivatives of the two features. Following the same method, we calculate the second derivatives:

S″(i) = [S′(i+1) - S′(i)] / S′(i); C″(i) = [C′(i+1) - C′(i)] / C′(i)

Remark: the sequence [S(i), S(i+1), S(i+2), …, S(i+k)] can be approximated by a polynomial p(x) = a0 + a1·x + a2·x² + a3·x³ + …; the coefficients a0, a1, a2, a3, … become new features.
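
The derivative features above are straightforward to compute once the per-frame S and C sequences exist. A hedged sketch, assuming NumPy arrays with nonzero frame values; the polynomial fit mirrors the remark, with np.polyfit's coefficient order reversed so that a0 comes first.

```python
import numpy as np

def changing_ratio(x):
    """First 'derivative' as defined above: x'(i) = (x[i+1] - x[i]) / x[i]."""
    x = np.asarray(x, dtype=float)
    return (x[1:] - x[:-1]) / x[:-1]

def temporal_features(S, C, degree=3):
    S1, C1 = changing_ratio(S), changing_ratio(C)      # S', C'
    S2, C2 = changing_ratio(S1), changing_ratio(C1)    # S'', C''
    coeffs = np.polyfit(np.arange(len(S)), S, degree)[::-1]  # a0, a1, a2, a3
    return S1, C1, S2, C2, coeffs
```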

Experiment with WEKA: 19 instruments [flute, piano, violin, saxophone, vibraphone, trumpet, marimba, french-horn, viola, bassoon, clarinet, cello, trombone, accordion, guitar, tuba, english-horn, oboe, double-bass]; J48 with a 0.25 confidence factor for tree pruning and a minimum of 10 instances per leaf; KNN with 3 neighbors (each object is assigned to the class most common among its 3 nearest neighbors).

Classification confidence with temporal features:

Experiment | Features | Classifier | Confidence
1 (A) | S, C | Decision Tree | 80.47%
2 (B) | S, C, S′, C′ | Decision Tree | 83.68%
3 (C) | S, C, S′, C′, S″, C″ | Decision Tree | 84.76%
4 | S, C | KNN | 80.31%
5 | S, C, S′, C′ | KNN | 84.07%
6 | S, C, S′, C′, S″, C″ | KNN | 85.51%
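
For readers replicating this outside WEKA, a rough scikit-learn analogue of the two configurations; this is an assumption for illustration, not the authors' setup: J48's 0.25 confidence factor has no direct scikit-learn counterpart, min_samples_leaf=10 mirrors the per-leaf minimum, and the 10-fold cross-validation is likewise assumed.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def evaluate(X, y):
    """Compare a pruned decision tree and 3-NN, as in the experiments above."""
    classifiers = {
        "Decision Tree": DecisionTreeClassifier(min_samples_leaf=10),
        "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
    }
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
        print(f"{name}: mean confidence {scores.mean():.2%}")
```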

Confusion matrices: left is from Experiment 1, right is from Experiment 3. The correctly classified instances are highlighted in green and the incorrectly classified instances in yellow.

[Charts: precision, recall, and F-score of the decision tree for each instrument.]

Polyphonic sounds - how to handle? 1. Single-label classification based on sound separation. 2. Multi-label classifiers. Problems? Information loss during the signal subtraction. [Sound separation flowchart: polyphonic sound → get frame → segmentation → feature extraction → classifier → get instrument → sound separation.]

Timbre estimation in polyphonic sounds and designing multi-label classifiers. Timbre (tone color or tone quality) - relevant descriptors: Spectrum Centroid, Spectrum Spread, Spectrum Flatness Band Coefficients, Harmonic Peaks, Mel frequency cepstral coefficients (MFCC), Tristimulus.

Sub-pattern of a single instrument in a mixture - feature extraction: Mel-frequency cepstral coefficients.

Timbre estimation based on a multi-label classifier: 40 ms window segmentation → get frame → feature extraction (timbre descriptors) → classifier → ranked candidates with confidence (Candidate 1: 70%, Candidate 2: 50%, …, Candidate N: 10%).
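
A minimal sketch of this per-frame estimation, assuming a scikit-learn-style classifier with predict_proba trained on single-instrument frames; the candidate cut-off and n_max are illustrative (the 0.4 threshold appears on the next slide).

```python
import numpy as np

def frame_candidates(clf, frame_features, threshold=0.4, n_max=4):
    """Return (instrument, confidence) candidates for one frame, best first."""
    probs = clf.predict_proba([frame_features])[0]   # one confidence per instrument
    order = np.argsort(probs)[::-1][:n_max]          # strongest candidates first
    return [(clf.classes_[i], probs[i]) for i in order if probs[i] >= threshold]
```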

Timbre estimation results based on different methods [instruments: 45; training data (TD): 2917 single-instrument sounds from MUMS; testing on 308 mixed sounds randomly chosen from TD; window size 1 s, frame size 120 ms, hop size 40 ms (~25 frames); Mel-frequency cepstral coefficients (MFCC) extracted from each frame]:

Experiment # | Pitch-based | Sound separation | N(labels) max | Recall | Precision | F-score
1 | Yes/No | - | 1 | 54.55% | 39.2% | 45.60%
2 | Yes | - | 2 | 61.20% | 38.1% | 46.96%
3 | Yes | No | 2 | 64.28% | 44.8% | 52.81%
4 | Yes | No | 4 | 67.69% | 37.9% | 48.60%
5 | Yes | No | 8 | 68.3% | 36.9% | 47.91%

The threshold 0.4 controls the total number of estimations for each index window.

Polyphonic sound classifiers: [flowchart: polyphonic sound (window) → get frame → feature extraction → multiple labels]. Compressed representations of the signal (Harmonic Peaks, Mel frequency cepstral coefficients (MFCC), Spectral Flatness, …) remove irrelevant information (inharmonic frequencies or partials). However, violin and viola have similar MFCC patterns, and the same holds for double bass and guitar; it is difficult to distinguish them in polyphonic sounds. More information from the raw signal is needed.

Short-term power spectrum - a low-level representation of the signal (calculated by STFT). Spectrum slice: 0.12 seconds long. The power spectrum patterns of flute and trombone can be seen in the mixture.

Experiment: Middle C instrument sounds (pitch equal to C4 in MIDI notation, frequency 261.6 Hz). Training set: power spectra from 3323 frames, extracted by STFT from 26 single-instrument sounds: electric guitar, bassoon, oboe, B-flat clarinet, marimba, C trumpet, E-flat clarinet, tenor trombone, French horn, flute, viola, violin, English horn, vibraphone, accordion, electric bass, cello, tenor saxophone, B-flat trumpet, bass flute, double bass, alto flute, piano, Bach trumpet, tuba, and bass clarinet. Testing set: fifty-two audio files, each mixed (using Sound Forge) from two of these 26 single-instrument sounds. Classifier: KNN with Euclidean distance (spectrum-match-based classification).
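
A sketch of the spectrum-match classification, assuming training spectra stored as rows of a NumPy array: KNN with Euclidean distance compares the test frame's power spectrum directly against the stored spectra, and up to n labels are returned by neighbor vote.

```python
import numpy as np
from collections import Counter

def spectrum_match(test_spectrum, train_spectra, train_labels, k=1, n=2):
    """Label a test power-spectrum frame by its k nearest training spectra."""
    dists = np.linalg.norm(train_spectra - test_spectrum, axis=1)  # Euclidean
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return [label for label, _ in votes.most_common(n)]  # up to n labels
```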

Timbre pattern match based on power spectrum:

Experiment # | Description | Recall | Precision | F-score
1 | feature-based + decision tree (n=2) | 64.28% | 44.8% | 52.81%
2 | spectrum match + KNN (k=1; n=2) | 79.41% | 50.8% | 61.96%
3 | spectrum match + KNN (k=5; n=2) | 82.43% | 45.8% | 58.88%
4 | spectrum match + KNN (k=5; n=2), without percussion instruments | 87.1% | - | -

(n - number of labels assigned to each frame; k - parameter of KNN)

Schema I - Hornbostel Sachs. [Tree diagram: instrument families Idiophone, Membranophone, Aerophone, Chordophone; aerophone subcategories include lip vibration (C trumpet, French horn, tuba), single reed, free, whip, and side (flute, alto flute); bassoon and oboe appear among the aerophones.]

Schema II - Play Methods. [Tree diagram: play methods such as blown (alto flute, …flute, piccolo, bassoon), bowed, muted, picked, pizzicato, shaken.]

Decision Table.

Obj | Classification attributes (CA1 … CAn) | Hornbostel Sachs | Play Method
1 | 0.22 … 0.28 | [Aerophone, Side, Alto Flute] | [Blown, Alto Flute]
2 | 0.31 … 0.77 | [Idiophone, Concussion, Bell] | [Concussive, Bell]
3 | 0.05 … 0.21 | [Chordophone, Composite, Cello] | [Bowed, Cello]
4 | 0.12 … 0.11 | [Chordophone, Composite, Violin] | [Martele, Violin]

(Hornbostel Sachs and Play Method are the decision attributes.)

Example: hierarchical attribute values. At Level I the classification attribute c takes values c[1], c[2] and the decision attribute d takes values d[1], d[2], d[3]; at Level II c[2] refines into c[2,1], c[2,2] and d[3] refines into d[3,1], d[3,2].

X | a | b | c | d
x1 | a[1] | b[2] | c[1] | d[3]
x2 | a[1] | b[1] | c[1] | d[3,1]
x3 | a[1] | b[2] | c[2,2] | d[1]
x4 | a[2] | b[2] | c[2] | d[1]

(a, b, c - classification attributes; d - decision attribute)

Instrument granularity: classifiers are trained at each level of the hierarchical (Hornbostel/Sachs) tree. We do not include membranophones, because instruments in this family usually do not produce harmonic sounds and would need special techniques to be identified.

Modules of the cascade classifier for single-instrument estimation - Hornbostel/Sachs. [Diagram: classifier modules along the path to pitch 3B, with confidences 96.02%, 91.80%, 98.94%, and 95.00%.]

New Experiment: Middle C instrument sounds (pitch equal to C4 in MIDI notation, frequency 261.6 Hz). Training set: 2762 frames extracted from the following instrument sounds: electric guitar, bassoon, oboe, B-flat clarinet, marimba, C trumpet, E-flat clarinet, tenor trombone, French horn, flute, viola, violin, English horn, vibraphone, accordion, electric bass, cello, tenor saxophone, B-flat trumpet, bass flute, double bass, alto flute, piano, Bach trumpet, tuba, and bass clarinet. Classifiers (WEKA): (1) KNN with Euclidean distance (spectrum-match-based classification); (2) decision tree (classification based on previously extracted features). Confidence: the ratio of correctly classified instances to the total number of instances.

Classification on different feature groups:

Group | Feature description | KNN | Decision Tree
A | 33 spectrum flatness band coefficients | 99.23% | 94.69%
B | 13 MFCC coefficients | 98.19% | 93.57%
C | 28 harmonic peaks | 86.60% | 91.29%
D | 38 spectrum projection coefficients | 47.45% | 31.81%
E | log spectral centroid, spread, flux, rolloff, zero-crossing | 99.34% | 99.77%

Feature and classifier selection at each level of the cascade system. Root level: KNN + band coefficients; a sketch of the cascade follows the tables.

Node | Feature | Classifier
chordophone | band coefficients | KNN
aerophone | MFCC coefficients | KNN
idiophone | band coefficients | KNN

Node | Feature | Classifier
chrd_composite | band coefficients | KNN
aero_double-reed | MFCC coefficients | KNN
aero_lip-vibrated | MFCC coefficients | KNN
aero_side | MFCC coefficients | KNN
aero_single-reed | band coefficients | Decision Tree
idio_struck | band coefficients | KNN
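
A simplified sketch of the cascade idea behind these tables, with hypothetical names (not the authors' system): each node owns its selected feature subset and classifier, predicts a family (or an instrument at a leaf), and hands the frame down to the matching child.

```python
class CascadeNode:
    """One node of the cascade: a classifier plus its selected feature subset."""

    def __init__(self, classifier, feature_selector, children=None):
        self.classifier = classifier      # e.g. a fitted KNN or decision tree
        self.select = feature_selector    # maps a frame to this node's features
        self.children = children or {}    # predicted label -> child CascadeNode

    def classify(self, frame):
        label = self.classifier.predict([self.select(frame)])[0]
        child = self.children.get(label)
        return child.classify(frame) if child else label  # leaf: instrument
```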

Classification on combinations of different feature groups. [Charts: classification results based on KNN and on the decision tree.]

From those two experiments, we see that: 1) the KNN classifier works better with feature vectors such as spectral flatness coefficients, projection coefficients, and MFCC; 2) the decision tree works better with harmonic peaks and statistical features. Simply adding more features together does not improve the classifiers and sometimes even worsens classification results (for example, adding harmonic peaks to other feature groups).

HIERARCHICAL STRUCTURE BUILT BY CLUSTERING ANALYSIS. Seven common methods to calculate the distance or similarity between clusters: single linkage (nearest neighbor), complete linkage (furthest neighbor), unweighted pair-group method using arithmetic averages (UPGMA), weighted pair-group method using arithmetic averages (WPGMA), unweighted pair-group method using the centroid average (UPGMC), weighted pair-group method using the centroid average (WPGMC), and Ward's method. Six most common distance functions: Euclidean, Manhattan, Canberra (the sum of a series of fractional differences between coordinates of a pair of objects), Pearson correlation coefficient (PCC, which measures the degree of association between objects), Spearman's rank correlation coefficient, and Kendall's tau (which counts the number of pairwise disagreements between two ranked lists). Clustering algorithm: HCLUST (agglomerative hierarchical clustering), R package.
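
A sketch of this clustering step using SciPy instead of R's hclust; the method names map onto the list above (e.g. "weighted" is WPGMA/McQuitty, "correlation" gives a Pearson-based distance), and cutting the tree into a fixed number of clusters is an assumption.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_frames(features, method="ward", metric="euclidean", n_clusters=37):
    """Agglomerative clustering of frame feature vectors; returns cluster IDs."""
    dists = pdist(features, metric=metric)   # e.g. "cityblock", "correlation"
    tree = linkage(dists, method=method)     # e.g. "ward", "average", "weighted"
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```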

Testing datasets (MFCC, flatness coefficients, harmonic peaks): the Middle C pitch group, which contains 46 different musical sound objects. Each sound object is segmented into multiple 0.12 s frames, and each frame is stored as an instance in the testing dataset; there are 2884 frames in total. This dataset is represented by 3 different sets of features (MFCC, flatness coefficients, and harmonic peaks). Total number of experiments: 3 × 7 × 6 = 126 (3 feature sets × 7 linkage methods × 6 distance functions). Clustering: when the algorithm finishes the clustering process, a cluster ID is assigned to each single frame.

Contingency table derived from the clustering result: rows correspond to instruments (Instrument 1, …, Instrument n) and columns to clusters (Cluster 1, …, Cluster n); the entry Xij is the number of frames of instrument i assigned to cluster j.
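
A hedged sketch of an evaluation built on this contingency table; it assumes that an instrument's clustering accuracy is the share of its frames falling into its dominant cluster, which is one plausible reading of the α used on the next slide.

```python
import pandas as pd

def evaluate_clustering(instruments, cluster_ids):
    """alpha (avg. per-instrument accuracy), w (clusters), score = alpha * w."""
    table = pd.crosstab(pd.Series(instruments, name="instrument"),
                        pd.Series(cluster_ids, name="cluster"))  # entries X_ij
    per_instrument = table.max(axis=1) / table.sum(axis=1)  # dominant-cluster share
    alpha = per_instrument.mean()
    w = table.shape[1]
    return alpha, w, alpha * w
```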

Evaluation result of the Hclust algorithm (the 14 results with the highest score among the 126 experiments). Features: flatness coefficients and MFCC; linkage methods: ward, mcquitty, average; distance metrics: pearson, euclidean, manhattan, kendall, spearman, maximum. The best result - flatness coefficients with Ward linkage and the Pearson metric - gives α = 87.3%, w = 37, score = 32.30.

α | w | score
87.3% | 37 | 32.30
85.8% | 37 | 31.74
85.6% | 36 | 30.83
81.0% | 36 | 29.18
83.0% | 35 | 29.05
82.9% | 35 | 29.03
80.5% | 35 | 28.17
80.1% | 35 | 28.04
81.3% | 34 | 27.63
83.7% | 33 | 27.62
86.1% | 32 | 27.56
79.8% | 34 | 27.12
88.9% | 30 | 26.67
87.3% | 30 | 26.20

(w - number of clusters; α - average clustering accuracy over all instruments; score = α·w)

Clustering result from the Hclust algorithm with Ward linkage and the Pearson distance measure; flatness coefficients are used as the selected feature. "ctrumpet" and "bachtrumpet" are clustered in the same group, while "ctrumpet_harmonStemOut" forms a group of its own instead of merging with "ctrumpet". Bassoon appears as a sibling of the regular French horn, while "French horn muted" is clustered in a different group together with "English horn" and "oboe".

Looking for the optimal [classification method, data representation] in monophonic music [Middle C pitch group - 46 different musical sound objects]:

Experiment | Classification method | Description | Recall | Precision | F-score
1 | non-cascade | feature-based | 64.3% | 44.8% | 52.81%
2 | non-cascade | spectrum-match | 79.4% | 50.8% | 61.96%
3 | cascade | Hornbostel/Sachs | 75.0% | 43.5% | 55.06%
4 | cascade | play method | 77.8% | 53.6% | 63.47%
5 | cascade | machine learned | 87.5% | 62.3% | 72.78%

Looking for the optimal [classification method, data representation] in polyphonic music [Middle C pitch group - 46 different musical sound objects]. Testing data: 49 polyphonic sounds, each created by selecting three different single-instrument sounds from the training database and mixing them together. This set of sounds is used to test our different arrangements of [classification method, data representation]; KNN (k=3) is used as the classifier in each experiment.

Looking for the optimal [classification method, data representation] in polyphonic music - results:

Exp # | Classifier | Method | Recall | Precision | F-score
1 | non-cascade | single-label, based on sound separation | 31.48% | 43.06% | 36.37%
2 | non-cascade | feature-based multi-label classification | 69.44% | 58.64% | 63.59%
3 | non-cascade | spectrum-match multi-label classification | 85.51% | 55.04% | 66.97%
4 | cascade (Hornbostel) | multi-label classification | 64.49% | 63.10% | 63.79%
5 | cascade (play method) | multi-label classification | 66.67% | 55.25% | 60.43%
6 | cascade (machine learned) | multi-label classification | 63.77% | 69.67% | 66.59%

www.mir.uncc.edu - MIRAI

Questions?

User entering a query: he is looking for a particular piece of music - Mozart, 40th Symphony. The user is not satisfied and enters a new query: "Yes, but I'm sad today; play the same song, but make it sadder." Modified Mozart, 40th Symphony - Action Rules System.

Action Rule. An action rule is defined as a term [(ω) ∧ (α → β)] → (ϕ → ψ), interpreted over an information system: ω is a conjunction of fixed condition features shared by both groups, (α → β) describes proposed changes in values of flexible features, and (ϕ → ψ) is the desired effect of the action. For example, with the genre of a piece kept fixed (ω), slowing the tempo (α → β) may move the evoked emotion from neutral to sad (ϕ → ψ).

Action Rules Discovery. Meta-actions based decision system S(d) = (X, A ∪ {d}, V), with A = {A1, A2, …, Am}.

Influence matrix: rows correspond to meta-actions M1, M2, …, Mn and columns to attributes A1, A2, …, Am; the entry Eij describes the change of attribute Aj triggered by meta-action Mi.

If E32 = [a2 → a2'], then E31 = [a1 → a1'] and E34 = [a4 → a4'].
Candidate action rule r = [(A1, a1 → a1') ∧ (A2, a2 → a2') ∧ (A4, a4 → a4')] → (d, d1').
Rule r is supported and covered by M3.
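
A toy sketch of the support check above, with hypothetical dictionaries standing in for influence-matrix rows: a meta-action supports a candidate rule if it triggers every attribute change the rule requires.

```python
def supports(meta_action_row, rule_changes):
    """meta_action_row and rule_changes map attribute -> (old_value, new_value)."""
    return all(meta_action_row.get(attr) == change
               for attr, change in rule_changes.items())

# Row M3 of the influence matrix: E31 = [a1 -> a1'], E32 = [a2 -> a2'], E34 = [a4 -> a4']
m3 = {"A1": ("a1", "a1'"), "A2": ("a2", "a2'"), "A4": ("a4", "a4'")}
rule = {"A1": ("a1", "a1'"), "A2": ("a2", "a2'"), "A4": ("a4", "a4'")}
print(supports(m3, rule))  # True: rule r is supported by M3
```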

"Action Rules Discovery without pre-existing classification rules", Z. W. Ras, A. Dardzinska, Proceedings of

"Action Rules Discovery without pre-existing classification rules", Z. W. Ras, A. Dardzinska, Proceedings of RSCTC 2008 Conference, in Akron, Ohio, LNAI 5306, Springer, 2008, 181 -190 http: //www. cs. uncc. edu/~ras/Papers/Ras-Aga-AKRON. pdf

Since the window diminishes the signal at both edges, it leads to information loss due to the narrowing of the frequency spectrum. In order to preserve this information, consecutive analysis frames overlap in time. Empirical experiments show that the best overlap is two-thirds of the window size. [Diagram: overlapping analysis frames along the time axis.]
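
A small sketch of this segmentation, assuming NumPy: with a two-thirds overlap the hop size is one third of the window size, so every sample is covered by three windowed frames.

```python
import numpy as np

def overlapping_frames(signal, window_size, overlap=2/3):
    """Split a signal into Hamming-windowed frames with the given overlap."""
    hop = max(1, int(window_size * (1 - overlap)))  # e.g. window_size // 3
    window = np.hamming(window_size)                # tapers the frame edges
    return [signal[i:i + window_size] * window
            for i in range(0, len(signal) - window_size + 1, hop)]
```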

Windowing: the Hamming window reduces spectral leakage. [Figure: Hamming window and its effect on spectral leakage.]