ELEN E4896 MUSIC SIGNAL PROCESSING
Lecture 12: Alignment and Matching
  1. Music Alignment
  2. Cover Song Detection
  3. Echo Nest Analyze
Dan Ellis
Dept. Electrical Engineering, Columbia University
dpwe@ee.columbia.edu http://www.ee.columbia.edu/~dpwe/e4896/
E4896 Music Signal Processing (Dan Ellis) 2013-04-15-1 /22

1. Music Alignment
Often have versions of the same music with unmatched time axes
  different performances
  performance vs. score
Various applications for aligning them
  synchronizing different tracks (with time-scale modification, TSM)
  synchronized score display
  ground truth transcriptions
Kurth et al., 2007

The Similarity Matrix
Point-to-point comparison of sequences (Foote 1999)
  e.g. Euclidean distance
    $$d_{euc}(i, j) = \sqrt{\sum_k \left( x_i(k) - y_j(k) \right)^2}$$
  or normalized inner product (cosine distance)
    $$d_{cos}(i, j) = 1 - \frac{\sum_k x_i(k)\, y_j(k)}{\sqrt{\sum_k x_i(k)^2 \, \sum_k y_j(k)^2}}$$
[Figure: Euclidean distance d_ij between Let It Be - The Beatles (index j) and Let It Be - Nick Cave (index i), with chroma bins shown along each axis; both axes in time / beats.]
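
A minimal sketch of the two distances above in plain Python, assuming each sequence is a list of equal-length feature vectors (for example one chroma vector per beat). The function names, the handling of all-zero frames, and the toy data are illustrative assumptions, not taken from the lecture.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    """1 minus the normalized inner product; 0 for parallel vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0.0:
        return 1.0  # assumption: an all-zero frame counts as maximally distant
    return 1.0 - dot / norm

def distance_matrix(X, Y, dist=cosine_distance):
    """D[i][j] = dist(X[i], Y[j]) for every pair of frames."""
    return [[dist(x, y) for y in Y] for x in X]

# Toy 3-bin frames (made-up values, one row per beat).
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
Y = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [0.0, 1.0, 0.0]]
D = distance_matrix(X, Y)
```

Because the cosine distance normalizes each frame, X[0] and Y[0] are at distance 0 even though their magnitudes differ, whereas the Euclidean distance between them is 1.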

Dynamic Programming
Find best path combining local + transition costs (Bellman 1957)
  works for any kind of similarity matrix
Allowable transitions into {i_k, j_k} from {i_{k-1}, j_{k-1}}
  T(1,0) = 0.1
  T(1,1) = 0.0
  T(0,1) = 0.1
Finds path {i_k, j_k} to minimize cost...
  $$C_{i_{max}, j_{max}} = \sum_k \left[ d(i_k, j_k) + T(i_k - i_{k-1},\, j_k - j_{k-1}) \right]$$
... recursively
  $$C_{i,j} = \min_{(x,y) \in \{(1,1),(1,0),(0,1)\}} \left[ d(i, j) + T(x, y) + C_{i-x,\, j-y} \right]$$
[Figure: local costs d_ij (color scale 0.1 to 1), best cumulative costs C* (numbers) and paths, on an 8 x 8 grid with i horizontal and j vertical.]
C* values shown in the grid:
         i=1  i=2  i=3  i=4  i=5  i=6  i=7  i=8
  j=8    2.7  2.8  2.5  2.3  2.0  1.7  1.6  1.5
  j=7    2.1  2.2  2.0  1.8  1.5  1.3  1.2  1.6
  j=6    1.4  1.5  1.3  1.2  1.0  0.9  1.3  1.5
  j=5    1.2  1.2  1.0  1.0  0.7  0.9  1.2  1.5
  j=4    0.8  0.8  0.7  0.6  0.7  1.2  1.8  2.4
  j=3    0.5  0.5  0.4  0.5  0.9  1.5  2.2  2.7
  j=2    0.3  0.3  0.4  0.6  1.0  1.5  2.1  2.6
  j=1    0.1  0.4  0.6  0.8  1.1  1.6  2.3  2.9
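
The recursion on this slide can be sketched in plain Python as follows. This is an illustrative implementation, not the lecture's own code: it assumes the local costs are given as a list of rows with 0-based indices, that the path must run from the first cell to the last, and it uses the slide's transition costs as defaults. The name dp_align is hypothetical.

```python
def dp_align(d, T=None):
    """Minimum-cost path through a local-cost matrix d (list of rows).

    C[i][j] = d[i][j] + min over steps (x, y) of T[(x, y)] + C[i-x][j-y].
    Returns (total_cost, path), where path is a list of (i, j) pairs.
    """
    if T is None:
        # Transition costs from the slide: diagonal free, others 0.1.
        T = {(1, 1): 0.0, (1, 0): 0.1, (0, 1): 0.1}
    n, m = len(d), len(d[0])
    inf = float("inf")
    C = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    C[0][0] = d[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            for (x, y), t in T.items():
                pi, pj = i - x, j - y
                if pi < 0 or pj < 0:
                    continue  # predecessor would fall outside the grid
                cand = d[i][j] + t + C[pi][pj]
                if cand < C[i][j]:
                    C[i][j] = cand
                    back[i][j] = (pi, pj)
    # Trace the best path back from the final cell to the origin.
    path = [(n - 1, m - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return C[n - 1][m - 1], path

# Toy local costs (made-up values): the second sequence has one extra frame.
d = [[0.0, 0.0, 5.0],
     [5.0, 5.0, 0.0]]
cost, path = dp_align(d)
```

In the toy example the cheapest path takes one (0,1) step to absorb the extra frame and then a diagonal step, so the total cost is just the 0.1 transition penalty.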

Audio-to-Audio Alignment
Dynamic programming to get time mapping
  + phase vocoder time scaling
[Figure: time mapping between the two recordings found by dynamic programming.]
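
As a rough illustration of the "time mapping" step, the sketch below turns a dynamic-programming path (a list of matched frame pairs) into pairs of matched times that a time-scaling stage could then follow. The function name, the hop parameters, and the choice to average repeated matches are assumptions made for this example; the phase vocoder itself is not shown.

```python
def path_to_time_map(path, hop_a=1.0, hop_b=1.0):
    """For each frame i of recording A, the matched time in recording B.

    path is a list of (i, j) frame pairs. When one i is paired with
    several j (a run of (0,1) steps), the mean of those j is used.
    hop_a and hop_b are the frame hops of A and B in seconds.
    """
    matches = {}
    for i, j in path:
        matches.setdefault(i, []).append(j)
    return [(i * hop_a, hop_b * sum(js) / len(js))
            for i, js in sorted(matches.items())]

# Toy path (made-up): frame 1 of A is matched to frames 1 and 2 of B.
time_map = path_to_time_map([(0, 0), (1, 1), (1, 2), (2, 3)])
```

The local slope of this mapping gives the stretch factor that the time-scaling stage would need to apply around each point.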

Audio-Score Alignment
Aligning a score representation (e.g. MIDI) is a proxy for polyphonic transcription
[Figure: Let It Be + aligned MIDI labels; axes: freq / Hz (0 to 1000) vs. time / sec (0 to 22).]

Peak Structure Distance
How do we match spectra to score notes?
  synthesize audio from MIDI & compare audio?
Peak Structure distance: is energy where we expect? (Orio & Schwarz 2001)
  Predicted spectrum = mask M[k]
  Peak Structure = energy below mask
  $$d_{psd} = 1 - \frac{\sum_k M[k]\, X[k]}{\sum_k X[k]}$$
[Figure: MIDI piano roll (note, C2 to C6, vs. time / frames); synthesized audio (freq / kHz vs. time / sec); spectrogram panels (freq / bins vs. time / frames).]
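
The distance above is the fraction of spectral energy that falls outside the predicted mask. A minimal sketch, assuming X is one frame of non-negative spectral energies and M is a mask of the same length; the harmonic_mask helper, which marks the first few integer multiples of a fundamental bin, is a hypothetical simplification of how a mask might be predicted from a score note.

```python
def harmonic_mask(f0_bin, n_bins, n_harmonics=4):
    """Binary mask with ones at the first harmonics of f0_bin.

    Hypothetical helper: a real system would derive the mask from the
    score notes active in the frame.
    """
    mask = [0.0] * n_bins
    for h in range(1, n_harmonics + 1):
        k = h * f0_bin
        if k < n_bins:
            mask[k] = 1.0
    return mask

def peak_structure_distance(X, M):
    """1 - (energy under the mask) / (total energy) for one frame."""
    total = sum(X)
    if total == 0.0:
        return 1.0  # assumption: a silent frame matches nothing
    return 1.0 - sum(m * x for m, x in zip(M, X)) / total

M = harmonic_mask(2, 10)                     # ones at bins 2, 4, 6, 8
on_target = [0, 0, 3.0, 0, 1.0, 0, 0, 0, 0, 0]
off_target = [0, 0, 0, 4.0, 0, 0, 0, 0, 0, 0]
d_good = peak_structure_distance(on_target, M)
d_bad = peak_structure_distance(off_target, M)
```

A frame whose energy sits entirely on the expected harmonics gets distance 0, and one whose energy sits entirely elsewhere gets distance 1.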

2. Cover Song Detection
Musicians are fond of cover versions
  usually alter melody, harmony, instrumentation, rhythm, style
  can be hard to spot even for a human!
Can try to match via alignment...
  with some threshold on best alignment cost?

Smith-Waterman Local Alignment
Cover version may have different form
  different number, ordering of verse/chorus/bridge
  want to find any large aligned regions
Local alignment measure
  $$S_{i,j} = \max_{x,y} \max\{0,\; s(i, j) - P(x, y) + S_{i-x,\, j-y}\}$$
  want largest score S*
  similarity s(i, j) must exceed penalty P(x, y) on avg.
  (e.g. 0.96 for diagonal, 1.2 for off-diagonal)
[Figure: Smith-Waterman local alignment, Beatles vs. Carol Woods, cosine distance; both axes in time / beats.]
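
A small sketch of the local alignment recursion above, in plain Python. It assumes s is a similarity matrix given as a list of rows (larger means more similar) and uses the slide's example penalties as defaults. The function name and the toy data are illustrative, and the toy call passes its own smaller penalties so that a binary 0/1 similarity can accumulate a score.

```python
def smith_waterman(s, P=None):
    """Largest local alignment score S* and the cell where it ends.

    S[i][j] = max(0, max over steps (x, y) of s(i,j) - P(x,y) + S[i-x][j-y]).
    """
    if P is None:
        # Example penalties from the slide: diagonal 0.96, off-diagonal 1.2.
        P = {(1, 1): 0.96, (1, 0): 1.2, (0, 1): 1.2}
    n, m = len(s), len(s[0])
    # One extra row and column of zeros so every cell has predecessors.
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_end = 0.0, None
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = 0.0  # the max{0, ...} lets an alignment restart anywhere
            for (x, y), p in P.items():
                score = max(score, s[i - 1][j - 1] - p + S[i - x][j - y])
            S[i][j] = score
            if score > best:
                best, best_end = score, (i - 1, j - 1)
    return best, best_end

# Toy binary similarity (made-up): one diagonal run of three matches.
s = [[0.0, 1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0]]
toy_penalties = {(1, 1): 0.5, (1, 0): 0.7, (0, 1): 0.7}  # assumed values
best, end = smith_waterman(s, toy_penalties)
```

Each matching step gains 1 - 0.5 = 0.5, so the three-step run scores 1.5 and ends at cell (2, 3); once the run stops, the penalties pull the score back toward zero.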

Local Alignment Cover Detection
Serrà & Gómez, 2008
Smith-Waterman needs predictable values
  use binary similarity based on best transposition
[Figure: panels labeled Euclidean, Binary, Non-cover.]

Cross-correlation Covers System
DP is good for time-warping, but expensive
  beat-timing is tempo independent (if it works)
  simply cross-correlate beat-chroma patches?
    how big are the pieces?
    how do we combine individual scores?
    also expensive
[Figure: a patch is extracted from the Query and cross-correlated with the Candidate; axes: chroma bins vs. beats.]

Global Cross-Correlation
Ellis & Poliner, 2007
Cross-correlate entire beat-chroma matrices...
  at all possible transpositions (circular)
  implicit combination of match quality and duration
One good matching fragment is sufficient...?
[Figure: beat-chroma matrices (chroma bins vs. beats @281 BPM) for Elliott Smith - Between the Bars and Glen Phillips - Between the Bars, and their cross-correlation (skew / semitones vs. skew / beats).]
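
The idea of cross-correlating whole beat-chroma matrices at every beat skew and every circular transposition can be sketched as below. This is a plain-Python illustration under stated assumptions, not the Ellis & Poliner implementation: each matrix is a list of beats, each beat a chroma vector, the result is left unnormalized, and the function names and toy melody are made up.

```python
def cross_correlate(A, B, max_lag):
    """xc[r][lag + max_lag]: correlation of A with B shifted by lag beats,
    where B's chroma axis is read r bins higher (circular rotation)."""
    n_chroma = len(A[0])
    xc = []
    for r in range(n_chroma):
        row = []
        for lag in range(-max_lag, max_lag + 1):
            total = 0.0
            for t, a in enumerate(A):
                u = t + lag
                if 0 <= u < len(B):
                    b = B[u]
                    total += sum(a[f] * b[(f + r) % n_chroma]
                                 for f in range(n_chroma))
            row.append(total)
        xc.append(row)
    return xc

def best_match(xc, max_lag):
    """(peak value, transposition in semitones, skew in beats)."""
    peak, r, k = max((v, r, k) for r, row in enumerate(xc)
                     for k, v in enumerate(row))
    return peak, r, k - max_lag

def one_hot(pitch_class, n_chroma=12):
    """Toy chroma frame with all energy in one pitch class."""
    return [1.0 if f == pitch_class else 0.0 for f in range(n_chroma)]

melody = [0, 4, 7, 0, 5, 9]                  # made-up pitch classes
query = [one_hot(p) for p in melody]
# Candidate: same melody 2 semitones higher, after one silent beat.
candidate = [[0.0] * 12] + [one_hot((p + 2) % 12) for p in melody]

xc = cross_correlate(query, candidate, max_lag=3)
peak, semitones, lag = best_match(xc, max_lag=3)
```

The peak lands at a transposition of 2 semitones and a skew of 1 beat, with all six beats contributing. Since the raw sum grows with the number of overlapping beats, it mixes match quality with duration, as the slide notes.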

Filtered Cross-Correlation
Raw correlation not as important as precise local match
  looking for large contrast at ±1 beat skew
  i.e. high-pass filter
[Figure: cross-correlation (skew / semitones vs. skew / beats), and raw vs. filtered cross-correlation @ skew = +2 semitones (vs. skew / beats).]
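
To illustrate why high-pass filtering along the skew axis favors sharp peaks, the sketch below subtracts a short centered moving average from one row of a cross-correlation. The moving-average filter is an assumption chosen for simplicity, not necessarily the filter used in the lecture's system, and the data is made up.

```python
def highpass(xc_row, half_width=1):
    """Subtract a centered moving average (a crude high-pass filter).

    Broad humps are mostly removed; peaks that stand out from their
    immediate neighbors survive.
    """
    n = len(xc_row)
    out = []
    for k in range(n):
        lo, hi = max(0, k - half_width), min(n, k + half_width + 1)
        local_mean = sum(xc_row[lo:hi]) / (hi - lo)
        out.append(xc_row[k] - local_mean)
    return out

# Made-up correlation vs. skew: a broad hump centered on index 5
# plus a narrow spike at index 2 that is slightly lower in raw value.
raw = [0.0, 1.0, 4.5, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
filtered = highpass(raw)
raw_peak = raw.index(max(raw))
filtered_peak = filtered.index(max(filtered))
```

In the raw row the broad hump wins (index 5), but after filtering the narrow spike at index 2 has the largest value, which is the behavior the slide asks for.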

Cover Song Results
23 covers found in the 8700-song uspop2002 collection
  popular decoys
  normalization issues
[Figure: Cover Songs - dpwe23 - 12/23 correct; Query (rows) vs. Test (columns) over the following tracks:]
  Abracadabra/sugar_ray
  Addicted_To_Love/tina_turner
  All_Along_The_Watchtower/dave_matthews_band
  America/simon_and_garfunkel
  Before_You_Accuse_Me/eric_clapton
  Between_The_Bars/glen_phillips
  Blue_Collar_Man/styx
  Caroline_No/brian_wilson
  Cecilia/simon_and_garfunkel
  Claudette/roy_orbison
  Cocaine/nazareth
  Come_Together/beatles
  Day_Tripper/cheap_trick
  Enjoy_The_Silence/tori_amos
  Faith/limp_bizkit
  God_Only_Knows/brian_wilson
  Gold_Dust_Woman/sheryl_crow
  Grand_Illusion/styx
  Hush/milli_vanilli
  I_Can_t_Get_No_Satisfaction/rolling_stones
  I_Love_You/faith_hill
  Let_It_Be/nick_cave
  Take_Me_To_The_River/annie_lennox

Analyzing Cover Song Correlation
Look inside global cross-correlation to find matching fragments...
  $$\mathrm{xcorr} = \sum_t \sum_f C_1(t, f)\, C_2(t, f)$$
  view along time
[Figure: beat-chroma of Let It Be / Beatles (beats 11-441) and Let It Be / Nick Cave (beats 13-443), with a third panel along the same axis; axes: chroma vs. time / beats.]
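
One way to "view along time" is to keep the sum over chroma bins f but not the sum over beats t, which gives each beat's contribution to the total correlation. A minimal sketch, assuming the two beat-chroma matrices are lists of chroma vectors and that the best skew and transposition have already been found; the function name, its lag and rotation parameters, and the toy data are illustrative.

```python
def per_beat_contribution(C1, C2, lag=0, rotation=0):
    """c[t] = sum over f of C1[t][f] * C2[t + lag][f + rotation].

    Summing c over t gives the cross-correlation value at that lag
    and circular chroma rotation.
    """
    n_chroma = len(C1[0])
    out = []
    for t, a in enumerate(C1):
        u = t + lag
        if 0 <= u < len(C2):
            b = C2[u]
            out.append(sum(a[f] * b[(f + rotation) % n_chroma]
                           for f in range(n_chroma)))
        else:
            out.append(0.0)  # no overlapping beat in C2
    return out

# Toy 3-bin chroma (made-up): the first two beats match, the last two do not.
C1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
C2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
contrib = per_beat_contribution(C1, C2)
total = sum(contrib)
```

Runs of high values in the per-beat curve mark the fragments that actually match, while low stretches show where the two versions diverge.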

Cover Song False Alarm
Correlation can be weak
  Cocaine (Clapton) vs. Satisfaction (Stones)
[Figure: beat-chroma of Eric Clapton - Cocaine (beats 17:1027) and Rolling Stones - Satisfaction (beats 1:1011), with a third panel (range -2 to 2) along the same beat axis.]

3. Echo Nest Analyze
Web service to provide beat, chroma, ... analysis (and much more)
  register for free API key
    http://developer.echonest.com/account/register/
  upload MP3, get back XML with analysis data
[Figure: TRKUYPW128F92E1FC0 - Tori Amos - Smells Like Teen Spirit; panels: Original (freq / Hz), EN Features (chroma bins), Resynth (freq / Hz); time / sec (0 to 14).]

EN Analyze Usage
Matlab wrapper function

Million Song Dataset (MSD)
Commercial-scale dataset available to MIR researchers
  1M pop songs
  250 GB of features (6 years of listening)
  EN Analyze features + ...
  Lyrics, Tags, Covers, Listeners...
Thierry Bertin-Mahieux
http://labrosa.ee.columbia.edu/millionsong

MSD Metadata

EN Metadata
  artist: 'Tori Amos'
  release: 'LIVE AT MONTREUX'
  title: 'Smells Like Teen Spirit'
  id: 'TRKUYPW128F92E1FC0'
  key: 5
  mode: 0
  loudness: -16.6780
  tempo: 87.2330
  time_signature: 4
  duration: 216.4502
  sample_rate: 22050
  audio_md5: '8'
  7digitalid: 5764727
  familiarity: 0.8500
  year: 1992

SHS Covers
  %5489,4468, Smells Like Teen Spirit
  TRTUOVJ128E078EE10  Nirvana
  TRFZJOZ128F4263BE3  Weird Al Yankovic
  TRJHCKN12903CDD274  Pleasure Beach
  TRELTOJ128F42748B7  The Flying Pickets
  TRJKBXL128F92F994D  Rhythms Del Mundo feat. Shanade
  TRIHLAW128F429BBF8  The Bad Plus
  TRKUYPW128F92E1FC0  Tori Amos

Last.fm Tags
  100.0 cover
  57.0 covers
  43.0 female vocalists
  42.0 piano
  34.0 alternative
  14.0 singer-songwriter
  11.0 acoustic
  8.0 tori amos
  7.0 beautiful
  6.0 rock
  6.0 pop
  6.0 Nirvana
  6.0 female vocalist
  6.0 90s
  5.0 out of genre covers
  5.0 cover songs
  4.0 soft rock
  4.0 nirvana cover
  4.0 Mellow
  4.0 alternative rock
  3.0 chick rock
  3.0 Ballad
  3.0 Awesome Covers
  2.0 melancholic
  2.0 k00l ch1x
  2.0 indie
  2.0 female vocalistist
  2.0 female
  2.0 cover song
  2.0 american

MxM Lyric Bag-of-Words
  12 hello
  11 i
  10 a
  9 and
  7 it
  6 are
  6 we
  6 now
  6 here
  6 us
  6 entertain
  4 the
  4 feel
  4 yeah
  3 to
  3 my
  3 is
  3 with
  3 oh
  3 out
  3 an
  3 light
  3 less
  3 danger

Summary
Music Alignment
  Dynamic Programming finds correspondence
Cover Songs
  DP, or cross-correlation for efficiency
EN Analyze
  Web service to analyze audio

References
R. Bellman, Dynamic Programming, Princeton University Press, 1957.
D. Ellis and G. Poliner, Identifying Cover Songs With Chroma Features and Dynamic Programming Beat Tracking, Proc. ICASSP-07, Hawai'i, pp. IV-1429-1432, 2007.
J. Foote, Visualizing Music and Audio using Self-Similarity, Proc. ACM Multimedia, Orlando, pp. 77-80, 1999.
F. Kurth, M. Müller, C. Fremerey, Y. Chang, and M. Clausen, Automated synchronization of scanned sheet music with audio recordings, Proc. 8th International Conference on Music Information Retrieval (ISMIR), Vienna, pp. 261-266, 2007.
N. Orio and D. Schwarz, Alignment of monophonic and polyphonic music to a score, Proc. Int. Comp. Music Conf., Havana, pp. 155-158, 2001.
J. Serrà, E. Gómez, P. Herrera, and X. Serra, Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification, IEEE Trans. on Audio, Speech and Lang. Proc., 16(6), pp. 1138-1151, 2008.