Note-level Music Transcription by Maximum Likelihood Sampling
Zhiyao Duan¹ & David Temperley²
1. Department of Electrical and Computer Engineering
2. Eastman School of Music
University of Rochester
Presentation at ISMIR 2014, Taipei, Taiwan, October 28, 2014

Different Levels of Music Transcription
• Frame-level (multi-pitch estimation)
  – Estimate pitches and polyphony in each frame
  – Many methods
• Note-level (note tracking)
  – Estimate pitch, onset, and offset of notes
  – Fewer methods
• Song-level (multi-pitch streaming)
  – Stream pitches by sources
  – Very few methods

Existing Note Tracking Methods
• Connect proximate frame-level pitch estimates
  – Misses in pitch estimates cause fragmented notes
  – False alarms generate spurious notes that are unreasonably short
• Fill gaps and prune short notes
  – Deals with notes individually and does not consider interactions between different notes
Ryynänen ’05, Bello ’06, Kameoka ’07, Poliner ’07, Lagrange ’07, Chang ’08, Raczynski ’09, Dessein ’10, Grindlay ’11, Benetos ’11, Grosche ’12, etc.
[Figure: frame-level pitch estimates]

Problems
• Output contains many spurious notes caused by consistent multi-pitch estimation (MPE) errors (usually octave/harmonic errors)
• Output often violates instantaneous polyphony constraints
[Figure: results from the existing “connect-fill-prune” approach vs. ground truth]

Our Idea
• Consider interactions between notes
• A generation-evaluation strategy
  – Generate a number of transcription candidates
  – Evaluate each candidate on how well its notes explain the audio as a whole

Proposed System [Duan, Pardo, & Zhang, 2010]
• Generate subsets as transcription candidates
• Evaluate candidates and select the best

Note Sampling Strategies
• How to sample efficiently and effectively?
• What we want
  – Sampling space not too big
  – Only sample “good” notes
  – Diversity in transcription candidates
  – Candidates obey polyphony constraints

Note Sampling Algorithm
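
The sampling goals stated in this talk ("good" notes, diverse candidates, candidates that obey polyphony constraints) can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the note fields, the `weight` value (assumed here to grow with note length and note likelihood), and the rejection rule are all hypothetical choices made for illustration.

```python
import random


def sample_candidate(notes, max_polyphony, rng):
    """Sample one transcription candidate (a subset of `notes`).

    Each note is a dict with hypothetical keys: 'pitch', 'onset' and
    'offset' (frame indices, offset exclusive) and 'weight' (a positive
    sampling weight, assumed to reflect note length and note likelihood).
    """
    remaining = list(notes)
    chosen = []
    active = {}  # frame index -> number of chosen notes sounding there
    while remaining:
        # Weighted draw without replacement: high-weight notes tend to
        # be considered first, but the order varies between runs.
        weights = [n["weight"] for n in remaining]
        note = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(note)
        frames = range(note["onset"], note["offset"])
        # Keep the note only if no frame would exceed the polyphony limit.
        if all(active.get(f, 0) < max_polyphony for f in frames):
            chosen.append(note)
            for f in frames:
                active[f] = active.get(f, 0) + 1
    return chosen


# Three fully overlapping notes (MIDI pitch numbers) and a polyphony limit of 2.
notes = [
    {"pitch": 48, "onset": 0, "offset": 10, "weight": 5.0},
    {"pitch": 60, "onset": 0, "offset": 10, "weight": 1.0},
    {"pitch": 64, "onset": 0, "offset": 10, "weight": 4.0},
]
candidates = [
    sample_candidate(notes, max_polyphony=2, rng=random.Random(seed))
    for seed in range(5)
]
```

Because the draw order is random, repeated runs can yield different subsets, which gives diversity, while the per-frame counter keeps every candidate within the polyphony limit.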

Note Likelihood
• Indicates how “good” the note is by itself
  – Also called “salience”, “activation”, or “strength”
• Note likelihood = geometric mean of the single-pitch likelihoods of the pitches in the note
  – Multi-pitch estimation algorithms almost always estimate a likelihood (salience) for each pitch estimate
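
As a small illustration of the geometric-mean definition, the sketch below computes a note's log-likelihood from the log single-pitch likelihoods of its pitch estimates. Working in the log domain is an assumption made here for numerical convenience; the function name and the input values are hypothetical.

```python
import math


def note_log_likelihood(pitch_log_likelihoods):
    """Log of the geometric mean of the single-pitch likelihoods.

    The geometric mean of likelihoods equals the exponential of the
    arithmetic mean of their logs, so averaging log-likelihoods avoids
    underflow when the likelihoods are very small.
    """
    if not pitch_log_likelihoods:
        raise ValueError("a note must contain at least one pitch estimate")
    return sum(pitch_log_likelihoods) / len(pitch_log_likelihoods)


# Hypothetical log single-pitch likelihoods for a three-frame note.
example = note_log_likelihood([-340.0, -338.0, -342.0])

# Check against the direct definition on values that do not underflow.
likelihoods = [0.5, 0.125]
direct = math.sqrt(likelihoods[0] * likelihoods[1])
via_logs = math.exp(note_log_likelihood([math.log(p) for p in likelihoods]))
```

Because the mean is taken over the note's own pitch estimates, the value does not automatically favor shorter or longer notes.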

Candidate Evaluation

Single-pitch vs. Multi-pitch Likelihood
• Single-pitch likelihood (salience) → note likelihood
  – E.g., total spectral energy at its harmonic positions
  – Describes how well a pitch fits the audio individually
    • A correct pitch usually has a high likelihood
    • Octave/harmonic errors may also have high likelihood
• Multi-pitch likelihood → transcription likelihood
  – Defined as the match between spectral peaks and harmonics of all pitches
  – Describes how well a set of pitches explains the audio as a whole
    • Adding a pitch in an octave/harmonic relation to an existing pitch would not improve the likelihood much
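
To make the octave-error point concrete, here is a toy Python sketch of a salience computed as total spectral energy at harmonic positions. It is only an illustration, not the likelihood model used in the talk: the peak dictionary, the tolerance, and the number of harmonics are all assumed values.

```python
def harmonic_salience(peaks, f0, n_harmonics=8, tol_hz=3.0):
    """Sum the energy of spectral peaks lying near harmonics of f0.

    `peaks` maps peak frequency in Hz to energy; it is a toy stand-in
    for a magnitude spectrum.
    """
    total = 0.0
    for h in range(1, n_harmonics + 1):
        target = h * f0
        total += sum(e for f, e in peaks.items() if abs(f - target) <= tol_hz)
    return total


# Toy spectrum: a single source at 200 Hz with four harmonics.
peaks = {200.0: 1.0, 400.0: 0.8, 600.0: 0.6, 800.0: 0.4}

correct = harmonic_salience(peaks, 200.0)       # matches all four peaks
octave_below = harmonic_salience(peaks, 100.0)  # its even harmonics hit the same peaks
octave_above = harmonic_salience(peaks, 400.0)  # matches the 400 and 800 Hz peaks
unrelated = harmonic_salience(peaks, 250.0)     # matches no peak
```

In this toy case the candidate one octave below collects exactly the same energy as the correct pitch, which is why a salience measured per pitch cannot rule out octave/harmonic errors on its own.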

An Example
Trombone: C3; Violin: E4 (higher value is better)

Pitch candidate                 C3        C4          E4
Log single-pitch likelihood     -338.8    -466.9      -475

Pitch set candidate             {C3}      {C3, C4}    {C3, E4}
Log multi-pitch likelihood      -338.8    -346.2      -318.9
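
The comparison in this example can be replayed with the numbers from the slide. The short Python sketch below only ranks the given values; the variable names are illustrative.

```python
# Log-likelihood values copied from the example (higher is better).
single_pitch = {"C3": -338.8, "C4": -466.9, "E4": -475.0}
multi_pitch = {
    frozenset({"C3"}): -338.8,
    frozenset({"C3", "C4"}): -346.2,
    frozenset({"C3", "E4"}): -318.9,
}

# Ranked individually, the octave error C4 beats the correct pitch E4.
second_by_single = max(["C4", "E4"], key=single_pitch.get)

# Evaluated as sets, {C3, E4} explains the audio best, and adding C4
# to {C3} even lowers the likelihood.
best_set = max(multi_pitch, key=multi_pitch.get)
c4_gain = multi_pitch[frozenset({"C3", "C4"})] - multi_pitch[frozenset({"C3"})]

print(second_by_single, sorted(best_set), round(c4_gain, 1))
```

Single-pitch likelihood alone would prefer the octave error, while the multi-pitch likelihood selects the true pitch set.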

Experiments
• Bach10 dataset: 110 polyphonic combinations derived from 10 pieces of 4-part J. S. Bach chorales, played by violin, clarinet, saxophone, and bassoon
  – 60 duets, 40 trios, 10 quartets
• Comparison methods
  – Benetos ’13: shift-invariant PLCA (frame-level) + median filtering of pitch activity matrix (note-level)
  – Klapuri ’06: iterative spectral subtraction (frame-level) + our preliminary note tracking (note-level)
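
The stated counts are consistent with taking every combination of two, three, or four parts from each piece. The Python check below assumes that construction, which the slide does not spell out.

```python
from itertools import combinations

parts = ["violin", "clarinet", "saxophone", "bassoon"]
n_pieces = 10

# Number of part subsets of each size, multiplied by the number of pieces.
counts = {
    size: n_pieces * len(list(combinations(parts, size)))
    for size in (2, 3, 4)
}
total = sum(counts.values())

print(counts, total)
```

Under that assumption there are 6 duets, 4 trios, and 1 quartet per piece, which matches the 60, 40, and 10 reported on the slide.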

Performance Measures

Comparison with state of the art

Works with state of the art

Example

Conclusions
• A new method for note-level transcription that considers note interactions
  – Generate transcription candidates by sampling notes according to note length and note likelihood, derived from single-pitch likelihood
  – Evaluate candidates according to transcription likelihood, derived from multi-pitch likelihood
• Good performance against state of the art
• Can work with any MPE or note tracking algorithm, as long as single-pitch likelihood (salience) is calculated

Limitations and Future Work
• Only removes spurious notes, but can’t add back missed notes
• Different runs of sampling are independent
• A better sampling technique
  – E.g., using Markov chain Monte Carlo to add back missed notes and to consider dependencies between different runs of sampling
• A better evaluation technique
  – E.g., considering musical knowledge to evaluate the “musical plausibility” of transcription candidates