Extracting Information from Music Audio

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Engineering, Columbia University, NY USA
http://labrosa.ee.columbia.edu/
  1. Motivation: Learning Music
  2. Notes Extraction
  3. Drum Pattern Modeling
  4. Music Similarity
Music Information Extraction - Ellis 2006-05-22 p. 1 /35

LabROSA Overview
  [Figure: LabROSA overview diagram. Labels: Information Extraction; Machine Learning; Signal Processing; Music; Speech; Environment; Recognition; Separation; Retrieval]

1. Learning from Music
  A lot of music data available
    e.g. 60 GB of MP3, about 1000 hr of audio, 15k tracks
  What can we do with it?
    implicit definition of music
  Quality vs. quantity
    Speech recognition lesson: 10x data, 1/10th annotation, twice as useful
  Motivating applications
    music similarity (recommendation, playlists)
    computer (assisted) music generation
    insight into music

Ground Truth Data
  A lot of unlabeled music data available
    manual annotation is expensive and rare
  Unsupervised structure discovery possible
    .. but labels help to indicate what you want
  Weak annotation sources
    artist-level descriptions
    symbol sequences without timing (MIDI)
    errorful transcripts
  Evaluation requires ground truth
    limiting factor in Music IR evaluations?
  [Figure: annotated spectrogram of /Users/dpwe/projects/aclass/aimee.wav; frequency axis 0 to 7000 Hz, time axis 0:02 to 0:18; segment labels "vox" and "mus"]

Talk Roadmap
  [Figure: roadmap diagram with numbered markers 1 to 4. Labels: Music audio; Melody extraction; Drums extraction; Event extraction?; Fragment clustering; Eigenrhythms; Anchor models; Semantic bases; Similarity/recommend'n; Synthesis/generation]

2. Notes Extraction
  Audio -> Score very desirable
    for data compression, searching, learning
  Full solution is elusive
    signal separation of overlapping voices
    music constructed to frustrate!
  Maybe simplify problem:
    Dominant Melody at each time frame
  with Graham Poliner
  [Figure: spectrogram, Frequency (0 to 4000) vs. Time (0 to 5)]

Conventional Transcription
  Pitched notes have harmonic spectra
    transcribe by searching for harmonics
    e.g. sinusoid modeling + grouping
  Explicit expert-derived knowledge
  [Figure: spectrogram, freq / Hz (0 to 3000) vs. time / s (0 to 4)]

Transcription as Classification
  Signal models typically used for transcription
    harmonic spectrum, superposition
  But... trade domain knowledge for data
    transcription as pure classification problem:
    Audio -> Trained classifier -> p("c0" | Audio), p("c#0" | Audio), p("d0" | Audio), p("d#0" | Audio), p("e0" | Audio), p("f0" | Audio)
    single N-way discrimination for melody
    per-note classifiers for polyphonic transcription

Melody Transcription Features
  Short-time Fourier Transform Magnitude (Spectrogram)
  Standardize over 50 pt frequency window
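
One way to read "standardize over 50 pt frequency window" is local normalization along the frequency axis of each spectrogram frame. The sketch below is an illustration under that assumption, not the authors' code: the function name `standardize_frame`, the centred window, and the handling of the spectrum edges are all hypothetical choices.

```python
import math

def standardize_frame(mag, win=50):
    """Standardize one STFT magnitude frame along frequency.

    Each bin is replaced by (value - local mean) / local std, where the
    local statistics come from the bins within win // 2 of it (so up to
    win + 1 bins, fewer at the spectrum edges). Assumed interpretation.
    """
    n = len(mag)
    half = win // 2
    out = []
    for k in range(n):
        lo = max(0, k - half)
        hi = min(n, k + half + 1)
        local = mag[lo:hi]
        mean = sum(local) / len(local)
        var = sum((v - mean) ** 2 for v in local) / len(local)
        std = math.sqrt(var)
        # A perfectly flat neighbourhood carries no information: emit 0.
        out.append((mag[k] - mean) / std if std > 0 else 0.0)
    return out
```

A feature of this kind is insensitive to overall level: scaling the whole frame by a constant leaves the output unchanged.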

Training Data
  Need {data, label} pairs for classifier training
  Sources:
    pre-mixing multitrack recordings + hand-labeling?
    synthetic music (MIDI) + forced-alignment?
  [Figure: spectrogram panels, freq / kHz (0 to 2) vs. time / sec (0 to 3.5)]

Melody Transcription Results
  Trained on 17 examples
    .. plus transpositions out to +/- 6 semitones
    All-pairs SVMs (Weka)
  Tested on ISMIR MIREX 2005 set
    includes foreground/background detection
  Rank  Participant  Overall Accuracy  Voicing d'  Raw Pitch  Raw Chroma  Runtime / s
  1     Dressler     71.4%             1.85        68.1%      71.4%       32
  2     Ryynänen     64.3%             1.56        68.6%      74.1%       10970
  3     Poliner      61.1%             1.56        67.3%      73.4%       5471
  3     Paiva 2      61.1%             1.22        58.5%      62.0%       45618
  5     Marolt       59.5%             1.06        60.1%      67.1%       12461
  6     Paiva 1      57.8%             0.83        62.7%      66.7%       44312
  7     Goto         49.9%*            0.59*       65.8%      71.8%       211
  8     Vincent 1    47.9%*            0.23*       59.8%      67.6%       ?
  9     Vincent 2    46.4%*            0.86*       59.6%      71.1%       251
  10    Brossier     3.2%*             0.14*       3.9%       8.1%        41
  Example...

Polyphonic Transcription
  Train SVM detectors for every piano note
    same features & classifier but different labels
    88 separate detectors, independent smoothing
  Use MIDI syntheses, player piano recordings
    about 30 min training data
  [Figure: "Bach 847 Disklavier", freq / pitch (A1 to A6) vs. time / sec (0 to 9), level / dB (-20 to 20)]

Piano Transcription Results
  Significant improvement from classifier:
  frame-level accuracy results:
  Algorithm          Errs   False Pos  False Neg  d'
  SVM                43.3%  27.9%      15.4%      3.44
  Klapuri&Ryynänen   66.6%  28.1%      38.5%      2.71
  Marolt             84.6%  36.5%      48.1%      2.35
  Breakdown by frame type:
  [Figure: classification error % (0 to 120), split into false negatives and false positives, vs. # notes present (1 to 8)]
  http://labrosa.ee.columbia.edu/projects/melody/

Melody Clustering
  Goal: Find fragments that recur in melodies
    .. across large music database
    .. trade data for model sophistication
  Training data -> Melody extraction -> 5 second fragments -> VQ clustering -> Top clusters
  Data sources
    pitch tracker, or MIDI training data
  Melody fragment representation
    DCT(1:20) - removes average, smoothes detail
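
The DCT(1:20) fragment representation can be sketched as below. This is an illustration under assumptions rather than the authors' implementation: it reads "1:20" as DCT-II coefficients 1 through 20 with the DC term (coefficient 0) dropped, which is what makes the representation independent of the fragment's average pitch. The function name `dct_features` is hypothetical.

```python
import math

def dct_features(contour, n_coef=20):
    """Low-order DCT-II coefficients of a pitch contour, skipping the DC term.

    contour: pitch values for one fragment (e.g. in semitones), one per frame.
    Dropping coefficient 0 removes the average pitch; keeping only the first
    n_coef remaining terms discards fine detail (a smoothed contour shape).
    """
    n = len(contour)
    feats = []
    for k in range(1, n_coef + 1):
        c = sum(x * math.cos(math.pi * k * (i + 0.5) / n)
                for i, x in enumerate(contour))
        feats.append(c)
    return feats
```

Under this reading, two fragments with the same contour shape in different keys map to the same feature vector, so a VQ clusterer groups them together.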

Melody clustering results
  Clusters match underlying contour:
  Some interesting matches:
    e.g. Pink + Nsync

3. Eigenrhythms: Drum Pattern Space
  Pop songs built on repeating drum loop
    variations on a few bass, snare, hi-hat patterns
  with John Arroyo
  Eigen-analysis (or...) to capture variations?
    by analyzing lots of (MIDI) data, or from audio
  Applications
    music categorization
    beat box synthesis
    insight

Aligning the Data
  Need to align patterns prior to modeling...
    tempo (stretch): by inferring BPM & normalizing
    downbeat (shift): correlate against mean template
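
The downbeat (shift) step can be illustrated with a circular cross-correlation against a template. This is a minimal sketch under assumptions: it treats a pattern as a single tempo-normalized row of onset weights (real patterns have bass, snare, and hi-hat rows), and the names `best_shift` and `align` are hypothetical.

```python
def best_shift(pattern, template):
    """Return the circular shift of `pattern` that best matches `template`.

    Both arguments are equal-length lists of onset weights, one per time
    step of a tempo-normalized bar. The score is the plain dot product.
    """
    n = len(pattern)

    def score(s):
        return sum(pattern[(i + s) % n] * template[i] for i in range(n))

    return max(range(n), key=score)

def align(pattern, template):
    """Rotate `pattern` so its downbeat lines up with the template's."""
    s = best_shift(pattern, template)
    return pattern[s:] + pattern[:s]
```

In practice the template would be the mean of the already-aligned patterns, so alignment and template estimation could be iterated.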

Eigenrhythms (PCA)
  Need 20+ Eigenvectors for good coverage of 100 training patterns (1200 dims)
  Eigenrhythms both add and subtract

Posirhythms (NMF)
  [Figure: six panels, Posirhythm 1 to Posirhythm 6, each with HH, SN, and BD rows; horizontal axes in samples (0 to 400) and in beats (1 to 4, twice)]
  Nonnegative: only adds beat-weight
  Capturing some structure

Eigenrhythms for Classification
  Projections in Eigenspace / LDA space
  [Figure: two scatter plots, "PCA(1,2) projection (16% corr)" and "LDA(1,2) projection (33% corr)"; genres: blues, country, disco, hiphop, house, newwave, rock, pop, punk, rnb]
  10-way Genre classification (nearest nbr):
    PCA3: 20% correct
    LDA4: 36% correct

Eigenrhythm BeatBox
  Resynthesize rhythms from eigen-space

4. Music Similarity
  Can we predict which songs sound alike to a listener?
    .. based on the audio waveforms?
    many aspects to subjective similarity
  Applications
    query-by-example
    automatic playlist generation
    discovering new music
  Problems
    the right representation
    modeling individual similarity
  with Mike Mandel and Adam Berenzweig

Music Similarity Features
  Need timbral features: Mel-Frequency Cepstral Coeffs (MFCCs)
    auditory-like frequency warping
    log-domain
    discrete cosine transform orthogonalization
  [Figure: three panels: Spectrogram; Mel-frequency Spectrogram; Mel-Frequency Cepstral Coefficients]
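
The three MFCC ingredients listed above (frequency warping, log, DCT) can be sketched as follows. This is an illustration, not the feature code used in the talk: the mel formula is one common convention (2595 log10(1 + f/700)), the mel filterbank itself is omitted, and the names `hz_to_mel` and `cepstra_from_mel_energies` are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Auditory-like frequency warping (one common mel-scale convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def cepstra_from_mel_energies(mel_energies, n_ceps=13):
    """Log-compress mel-band energies, then decorrelate them with a DCT-II.

    mel_energies: non-negative energies from a mel filterbank (assumed
    computed elsewhere). A small floor avoids log(0).
    """
    n = len(mel_energies)
    log_e = [math.log(e + 1e-10) for e in mel_energies]
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n)
            for i, v in enumerate(log_e))
        for k in range(n_ceps)
    ]
```

With this convention the scale is roughly linear below 1 kHz (1000 Hz maps to about 1000 mel) and logarithmic above.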

Timbral Music Similarity
  Measure similarity of feature distribution
    i.e. collapse across time to get density p(x_i)
    compare by e.g. KL divergence
  e.g. Artist Identification
    learn artist model p(x_i | artist X) (e.g. as GMM)
    classify unknown song to closest model
  [Figure: training MFCCs -> GMMs for Artist 1 and Artist 2; test song compared to each by KL; Min -> Artist]
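
A simplified version of "classify unknown song to closest model" can be sketched as below. It is an illustration under stated assumptions, not the system described: each model is a single diagonal-covariance Gaussian instead of a GMM (KL divergence between GMMs has no closed form), the divergence is symmetrized, and all function names are hypothetical.

```python
import math

def fit_diag_gaussian(frames):
    """Collapse a list of feature frames (e.g. MFCC vectors) across time
    into per-dimension mean and variance."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(d)]
    # Small variance floor keeps the divergence finite (assumed value).
    var = [sum((f[j] - mean[j]) ** 2 for f in frames) / n + 1e-6
           for j in range(d)]
    return mean, var

def kl_diag(p, q):
    """Closed-form KL(p || q) for diagonal Gaussians p and q."""
    (mp, vp), (mq, vq) = p, q
    return 0.5 * sum(
        math.log(b / a) + (a + (m1 - m2) ** 2) / b - 1.0
        for m1, a, m2, b in zip(mp, vp, mq, vq)
    )

def symmetric_kl(p, q):
    return kl_diag(p, q) + kl_diag(q, p)

def closest_artist(song_model, artist_models):
    """Return the artist whose model has minimum divergence to the song."""
    return min(artist_models,
               key=lambda name: symmetric_kl(song_model, artist_models[name]))
```

With GMMs in place of single Gaussians, the divergence would have to be approximated, for example by sampling.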

Anchor Space
  Acoustic features describe each song
    .. but from a signal, not a perceptual, perspective
    .. and not the differences between songs
  Use genre classifiers to define new space
    prototype genres are anchors
  [Figure: for each audio input (Class i, Class j), a bank of anchor classifiers outputs p(a_1 | x), p(a_2 | x), ..., p(a_n | x), an n-dimensional vector in "Anchor Space"; Conversion to Anchorspace -> GMM Modeling -> Similarity Computation (KL-d, EMD, etc.)]

Anchor Space
  Frame-by-frame high-level categorizations
    compare to raw features?
    properties in distributions? dynamics?
  [Figure: two scatter plots of frames for madonna and bowie. "Cepstral Features": axes third cepstral coef and fifth cepstral coef. "Anchor Space Features": axes Electronica and Country]

Playola Similarity Browser

Ground-truth data
  Hard to evaluate Playola's accuracy
    user tests...
    ground truth?
  Musicseer online survey:
    ran for 9 months in 2002
    > 1,000 users, > 20k judgments
    http://labrosa.ee.columbia.edu/projects/musicsim/

Evaluation
  Compare classifier measures against Musicseer subjective results
    triplet agreement percentage
    Top-N ranking agreement score:
      $s_i = \sum_{r=1}^{N} \alpha_r^{\,r} \, \alpha_c^{\,k_r}$, with $\alpha_r = \left(\tfrac{1}{2}\right)^{1/3}$ and $\alpha_c = \alpha_r^{2}$
    Average Dynamic Recall? (Typke et al.)
    First-place agreement percentage
      - simple significance test
  [Figure: bar chart, % (0 to 80), for measures cei, cmb, erd, e3d, opn, kn2, rnd, ANK; legend: SrvKnw 4789x3.58, SrvAll 6178x8.93, GamKnw 7410x3.96, GamAll 7421x8.92]
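
The top-N ranking agreement score sums, over the reference's top N items, a weight that decays with the reference rank r (base (1/2)^(1/3)) and with the candidate rank k_r (base equal to the square of that). The sketch below is one reading of it, not the evaluation code: it assumes k_r is the rank the candidate measure gives to the reference's r-th item, that items missing from the candidate list contribute nothing, and that scores are normalized by the self-agreement maximum. The function names are hypothetical.

```python
ALPHA_R = 0.5 ** (1.0 / 3.0)   # decay over reference rank r
ALPHA_C = ALPHA_R ** 2         # decay over candidate rank k_r

def top_n_agreement(reference, candidate, n=10):
    """Sum of ALPHA_R**r * ALPHA_C**k_r over the reference's top-n items.

    reference, candidate: lists of item ids, most similar first.
    Items absent from `candidate` are skipped (assumption).
    """
    cand_rank = {item: pos for pos, item in enumerate(candidate, start=1)}
    score = 0.0
    for r, item in enumerate(reference[:n], start=1):
        k_r = cand_rank.get(item)
        if k_r is not None:
            score += (ALPHA_R ** r) * (ALPHA_C ** k_r)
    return score

def normalized_agreement(reference, candidate, n=10):
    """Scale so that a candidate identical to the reference scores 1.0."""
    return (top_n_agreement(reference, candidate, n)
            / top_n_agreement(reference, reference, n))
```

Because both weights decay geometrically, disagreements near the top of either list cost more than disagreements further down.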

Using SVMs for Artist ID
  Support Vector Machines (SVMs) find hyperplanes in a high-dimensional space
    relies only on matrix of distances between points
    much smarter than nearest-neighbor/overlap
    want diversity of reference vectors...
  [Figure: maximum-margin hyperplane in the $(x_1, x_2)$ plane with normal vector $w$: decision boundary $(w \cdot x) + b = 0$, margin $(w \cdot x) + b = +1$ on the $y_i = +1$ side, margin $(w \cdot x) + b = -1$ on the $y_i = -1$ side]

Song-Level SVM Artist ID
  Instead of one model per artist/genre, use every training song as an anchor
    then SVM finds best support for each artist
  [Figure: block diagram. Labels: Training; Artist 1; Artist 2; MFCCs; Song Features; D (repeated); DAG SVM; Test Song; Artist]

Artist ID Results
  ISMIR/MIREX 2005 also evaluated Artist ID
    148 artists, 1800 files (split train/test) from uspop2002
  Song-level SVM clearly dominates
    using only MFCCs!
  MIREX 05 Audio Artist (USPOP2002)
  Rank  Participant  Raw Accuracy  Normalized  Runtime / s
  1     Mandel       68.3%         68.0%       10240
  2     Bergstra     59.9%         60.9%       86400
  3     Pampalk      56.2%         56.0%       4321
  4     West         41.0%         41.0%       26871
  5     Tzanetakis   28.6%         28.5%       2443
  6     Logan        14.8%         14.8%       ?
  7     Lidy         Did not complete

Playlist Generation
  SVMs are well suited to active learning
    solicit labels on items closest to current boundary
  Automatic player with skip = Ground truth data collection
    active-SVM automatic playlist generation
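
"Solicit labels on items closest to current boundary" can be illustrated with a linear decision function standing in for the trained SVM. This is a minimal sketch under that assumption; the weights, the song names, and the functions `decision_value` and `next_query` are hypothetical.

```python
def decision_value(w, b, x):
    """Signed score of a linear classifier: (w . x) + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def next_query(w, b, unlabeled):
    """Pick the unlabeled item whose score is closest to the boundary.

    unlabeled: dict mapping item id -> feature vector. The item with the
    smallest |(w . x) + b| is the one the classifier is least sure about,
    so its label (e.g. play vs. skip) is the most informative to ask for.
    """
    return min(unlabeled,
               key=lambda name: abs(decision_value(w, b, unlabeled[name])))
```

In a player, each skip or non-skip would supply the requested label, after which the classifier is retrained and the next query chosen.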

5. Artistic Application
  Compositional applications of automatic music analysis
  with Douglas Repetto, Ron Weiss, and the rest of the MEAP team
    music reformulation
    automatic mashup generator

Conclusions
  [Figure: roadmap diagram. Labels: Music audio; Melody extraction; Drums extraction; Event extraction; Fragment clustering; Eigenrhythms; Anchor models; Semantic bases; Similarity/recommend'n; Synthesis/generation]
  Lots of data + noisy transcription + weak clustering -> musical insights?