Extracting Information from Music Audio

1 Extracting Information from Music Audio
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Engineering, Columbia University, NY USA
1. Motivation: Learning Music
2. Notes Extraction
3. Drum Pattern Modeling
4. Music Similarity
Music Information Extraction - Ellis p. 1 /35

2 LabROSA Overview
[Diagram labels: Information Extraction; Machine Learning; Signal Processing; Music; Speech; Environment; Recognition; Separation; Retrieval]

3 1. Learning from Music
- A lot of music data available
  - e.g. 60G of MP3, 1000 hr of audio, 15k tracks
- What can we do with it?
  - implicit definition of music
- Quality vs. quantity
  - Speech recognition lesson: 10x data, 1/10th annotation, twice as useful
- Motivating Applications
  - music similarity (recommendation, playlists)
  - computer (assisted) music generation
  - insight into music

4 Ground Truth Data
- A lot of unlabeled music data available
  - manual annotation is expensive and rare
- Unsupervised structure discovery possible
  - .. but labels help to indicate what you want
- Weak annotation sources
  - artist-level descriptions
  - symbol sequences without timing (MIDI)
  - errorful transcripts
- Evaluation requires ground truth
  - limiting factor in Music IR evaluations?
[Figure: spectrogram of /Users/dpwe/projects/aclass/aimee.wav, frequency in Hz against time (0:02 to 0:18), with vox / mus segment labels]

5 Talk Roadmap
[Diagram nodes, numbered 1 to 4 by talk section: Music audio; Melody extraction; Drums extraction; Event extraction?; Fragment clustering; Eigenrhythms; Anchor models; Semantic bases; Similarity/recommend'n; Synthesis/generation]

6 2. Notes Extraction
- Audio → Score very desirable
  - for data compression, searching, learning
- Full solution is elusive
  - signal separation of overlapping voices
  - music constructed to frustrate!
- Maybe simplify problem: Dominant Melody at each time frame
  - with Graham Poliner
[Figure: spectrogram, Frequency against Time]

7 Conventional Transcription
- Pitched notes have harmonic spectra
  - transcribe by searching for harmonics
  - e.g. sinusoid modeling + grouping
- Explicit expert-derived knowledge
[Figure: spectrogram, freq / Hz against time / s]

8 Transcription as Classification
- Signal models typically used for transcription
  - harmonic spectrum, superposition
- But... trade domain knowledge for data
  - transcription as pure classification problem:
    Audio → Trained classifier → p("c0" | Audio), p("c#0" | Audio), p("d0" | Audio), p("d#0" | Audio), p("e0" | Audio), p("f0" | Audio)
  - single N-way discrimination for melody
  - per-note classifiers for polyphonic transcription

9 Melody Transcription Features
- Short-time Fourier Transform Magnitude (Spectrogram)
- Standardize over 50 pt frequency window
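The slide does not spell out the standardization, so the following is only a sketch under assumptions: each magnitude bin is rescaled to zero mean and unit variance relative to the 50 bins around it in the same frame. The function name `standardize_spectrum` and the edge handling (shorter windows at the band edges) are hypothetical choices, not taken from the talk.

```python
import math

def standardize_spectrum(magnitudes, window=50):
    """Locally standardize one STFT magnitude frame along the frequency axis.

    Assumption: bin k is compared with the `window` bins centered on it
    (truncated at the band edges); the slides do not give the exact scheme.
    """
    half = window // 2
    n_bins = len(magnitudes)
    standardized = []
    for k in range(n_bins):
        lo = max(0, k - half)
        hi = min(n_bins, k + half)
        neighborhood = magnitudes[lo:hi]
        mean = sum(neighborhood) / len(neighborhood)
        variance = sum((v - mean) ** 2 for v in neighborhood) / len(neighborhood)
        std = math.sqrt(variance)
        # A flat neighborhood carries no local contrast, so map it to zero.
        standardized.append((magnitudes[k] - mean) / std if std > 0 else 0.0)
    return standardized

# Toy frame: the same spectrum at two different gains gives the same features.
frame = [abs(math.sin(0.3 * k)) + 0.01 * k for k in range(200)]
features = standardize_spectrum(frame)
louder = standardize_spectrum([3.0 * v for v in frame])
```

One plausible reason for a step like this is that it makes the classifier input insensitive to overall level and broad spectral tilt, which the toy example illustrates for a simple gain change.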

10 Training Data
- Need {data, label} pairs for classifier training
- Sources:
  - pre-mixing multitrack recordings + hand-labeling?
  - synthetic music (MIDI) + forced-alignment?
[Figure: spectrograms, freq / kHz against time / sec]

11 Melody Transcription Results
- Trained on 17 examples
  - .. plus transpositions out to +/- 6 semitones
  - All-pairs SVMs (Weka)
- Tested on ISMIR MIREX 2005 set
  - includes foreground/background detection

Columns: Rank, Participant, Overall Accuracy, Voicing d', Raw Pitch, Raw Chroma, Runtime / s
1 Dressler 71.4% % 71.4% 32
2 Ryynänen 64.3% % 74.1%
Poliner 61.1% % 73.4%
Paiva % % 62.0%
Marolt 59.5% % 67.1%
Paiva % % 66.7%
Goto 49.9%* 0.59* 65.8% 71.8%
Vincent %* 0.23* 59.8% 67.6% ?
9 Vincent %* 0.86* 59.6% 71.1%
Brossier 3.2%* 0.14* 3.9% 8.1% 41

Example...

12 Polyphonic Transcription
- Train SVM detectors for every piano note
  - same features & classifier but different labels
  - 88 separate detectors, independent smoothing
- Use MIDI syntheses, player piano recordings
  - about 30 min training data
[Figure: Bach 847 on Disklavier; freq / pitch (A3 to A6) against time / sec, level / dB]

13 Piano Transcription Results
- Significant improvement from classifier: frame-level accuracy results:

Algorithm          Errs    False Pos   False Neg   d'
SVM                43.3%   27.9%       15.4%       3.44
Klapuri&Ryynänen   66.6%   28.1%       38.5%       2.71
Marolt             84.6%   36.5%       48.1%       2.35

- Breakdown by frame type:
[Figure: Classification error % (False Negatives, False Positives) against # notes present]
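In the table above, each Errs value equals False Pos plus False Neg (for example 27.9% + 15.4% = 43.3%). The sketch below shows one way such frame-level rates could be computed from two piano-roll style transcriptions. The normalization by the number of reference note-frames and the name `frame_errors` are assumptions for illustration; the slide does not state how the percentages were normalized.

```python
def frame_errors(reference, estimate):
    """Frame-level error rates for polyphonic transcription.

    `reference` and `estimate` are equal-length lists of frames; each frame
    is the set of note numbers active in that frame.
    Assumption: rates are divided by the total number of reference notes.
    """
    false_pos = false_neg = ref_total = 0
    for ref, est in zip(reference, estimate):
        false_pos += len(est - ref)   # notes reported but not present
        false_neg += len(ref - est)   # notes present but missed
        ref_total += len(ref)
    if ref_total == 0:
        raise ValueError("reference contains no notes")
    fp_rate = false_pos / ref_total
    fn_rate = false_neg / ref_total
    return {"false_pos": fp_rate, "false_neg": fn_rate, "errs": fp_rate + fn_rate}

# Three frames: one missed note (64) and one spurious note (67).
reference = [{60, 64}, {60}, {62}]
estimate = [{60}, {60, 67}, {62}]
rates = frame_errors(reference, estimate)
```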

14 Melody Clustering
- Goal: Find fragments that recur in melodies
  - .. across large music database
  - .. trade data for model sophistication
- Data sources
  - pitch tracker, or MIDI training data
- Melody fragment representation
  - DCT(1:20) - removes average, smooths detail
[Diagram: Training data → Melody extraction → 5 second fragments → VQ clustering → Top clusters]
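A minimal sketch of the fragment representation, assuming that "DCT(1:20)" means keeping 20 low-order DCT coefficients while skipping the zeroth (constant) one, which is what would remove the average. The function names and the unnormalized DCT-II are illustrative assumptions; the indexing used in the original work may differ.

```python
import math

def dct_ii(signal):
    """Unnormalized DCT-II of a list of numbers."""
    n = len(signal)
    return [sum(x * math.cos(math.pi / n * (i + 0.5) * k)
                for i, x in enumerate(signal))
            for k in range(n)]

def fragment_features(pitch_contour, n_coeffs=20):
    """Describe a melody fragment by DCT coefficients 1..n_coeffs.

    Dropping coefficient 0 discards the average pitch; truncating at
    n_coeffs discards fine frame-to-frame detail.
    """
    coeffs = dct_ii(pitch_contour)
    return coeffs[1:1 + n_coeffs]

# Toy contour (MIDI note numbers per frame) and the same contour an octave up.
contour = [60 + (i % 7) for i in range(50)]
features = fragment_features(contour)
transposed = fragment_features([p + 12 for p in contour])
```

Because the constant term is dropped, the two feature vectors agree up to rounding, so fragments with the same contour at different pitch heights would fall into the same cluster.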

15 Melody clustering results
- Clusters match underlying contour:
- Some interesting matches: e.g. Pink + Nsync

16 3. Eigenrhythms: Drum Pattern Space
- Pop songs built on repeating drum loop
  - variations on a few bass, snare, hi-hat patterns
  - with John Arroyo
- Eigen-analysis (or...) to capture variations?
  - by analyzing lots of (MIDI) data, or from audio
- Applications
  - music categorization
  - beat box synthesis
  - insight

17 Aligning the Data
- Need to align patterns prior to modeling...
  - tempo (stretch): by inferring BPM & normalizing
  - downbeat (shift): correlate against mean template
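The downbeat step can be sketched as a circular cross-correlation: try every rotation of a tempo-normalized pattern and keep the one that best matches the template. This is an illustration under assumptions (patterns as equal-length lists of onset weights, helper names `circular_shift` and `best_shift` invented here), not the procedure used in the talk.

```python
def circular_shift(pattern, shift):
    """Rotate a pattern to the right by `shift` steps."""
    shift %= len(pattern)
    if shift == 0:
        return list(pattern)
    return list(pattern[-shift:]) + list(pattern[:-shift])

def best_shift(pattern, template):
    """Return the rotation of `pattern` that correlates best with `template`."""
    def correlation(shift):
        rotated = circular_shift(pattern, shift)
        return sum(a * b for a, b in zip(rotated, template))
    return max(range(len(pattern)), key=correlation)

# Toy template (one bar, 8 steps) and a copy that starts 3 steps late.
template = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.5, 0.0]
late_pattern = circular_shift(template, 3)
shift = best_shift(late_pattern, template)
aligned = circular_shift(late_pattern, shift)
```

In practice the template would be the mean of the patterns aligned so far, so alignment and averaging could be iterated.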

18 Eigenrhythms (PCA)
- Need 20+ Eigenvectors for good coverage of 100 training patterns (1200 dims)
- Eigenrhythms both add and subtract
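As a rough illustration of the eigen-analysis, the sketch below finds the first principal component of a set of aligned patterns by power iteration. It is a stand-in written with the standard library only; the function name and the toy data are hypothetical, and a real analysis of 1200-dimensional patterns would use a linear algebra package and extract 20 or more components.

```python
import math
import random

def top_principal_component(patterns, iterations=100, seed=0):
    """Return (mean, first principal direction) of the rows in `patterns`."""
    n, dims = len(patterns), len(patterns[0])
    mean = [sum(p[d] for p in patterns) / n for d in range(dims)]
    centered = [[p[d] - mean[d] for d in range(dims)] for p in patterns]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(dims)]
    for _ in range(iterations):
        # One power-iteration step on X^T X without forming the matrix.
        projections = [sum(c * w for c, w in zip(row, v)) for row in centered]
        v = [sum(proj * row[d] for proj, row in zip(projections, centered))
             for d in range(dims)]
        norm = math.sqrt(sum(w * w for w in v))
        if norm == 0.0:
            break  # degenerate data: no variance to explain
        v = [w / norm for w in v]
    return mean, v

# Toy "patterns": 4 steps each, varying mostly in how strong steps 0 and 2 are.
patterns = [
    [1.0, 0.0, 1.0, 0.2],
    [0.8, 0.0, 0.8, 0.2],
    [0.6, 0.0, 0.6, 0.2],
    [0.4, 0.0, 0.4, 0.2],
]
mean, eigenrhythm = top_principal_component(patterns)
```

The component describes deviations from the mean pattern, so its entries can be positive or negative, which matches the remark that eigenrhythms both add and subtract.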

19 Posirhythms (NMF)
- Nonnegative: only adds beat-weight
- Capturing some structure
[Figure: Posirhythm 1 to Posirhythm 6, each panel with HH, SN and BD rows; horizontal axes in samples and in beats]
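For comparison with PCA, here is a small sketch of nonnegative matrix factorization using the standard multiplicative updates for squared error. The slides do not say which NMF algorithm or settings were used, so the update rule, rank, iteration count and helper names are all assumptions made for illustration.

```python
import random

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def matmul(a, b):
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor nonnegative v (patterns x steps) as w (patterns x rank) times h (rank x steps)."""
    rng = random.Random(seed)
    w = [[rng.random() + eps for _ in range(rank)] for _ in v]
    h = [[rng.random() + eps for _ in v[0]] for _ in range(rank)]
    for _ in range(iterations):
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[hv * n / (d + eps) for hv, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(h, numer, denom)]
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[wv * n / (d + eps) for wv, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(w, numer, denom)]
    return w, h

def reconstruction_error(v, w, h):
    approx = matmul(w, h)
    return sum((a - b) ** 2 for vr, ar in zip(v, approx) for a, b in zip(vr, ar))

# Toy data: three 4-step patterns built from two nonnegative building blocks.
v = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
weights, posirhythms = nmf(v, rank=2)
```

Every entry of `weights` and `posirhythms` stays nonnegative, so each basis pattern can only add beat-weight to the reconstruction.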

20 Eigenrhythms for Classification
- Projections in Eigenspace / LDA space
[Figure: PCA(1,2) projection (16% corr) and LDA(1,2) projection (33% corr); genres: blues, country, disco, hiphop, house, newwave, rock, pop, punk, rnb]
- 10-way Genre classification (nearest nbr):
  - PCA3: 20% correct
  - LDA4: 36% correct

21 Eigenrhythm BeatBox
- Resynthesize rhythms from eigen-space

22 4. Music Similarity
- Can we predict which songs sound alike to a listener?
  - .. based on the audio waveforms?
  - many aspects to subjective similarity
- Applications
  - query-by-example
  - automatic playlist generation
  - discovering new music
- Problems
  - the right representation
  - modeling individual similarity
- with Mike Mandel and Adam Berenzweig

23 Music Similarity Features
- Need timbral features: Mel-Frequency Cepstral Coeffs (MFCCs)
  - auditory-like frequency warping
  - log-domain
  - discrete cosine transform orthogonalization
[Figure panels: Spectrogram; Mel-frequency Spectrogram; Mel-Frequency Cepstral Coefficients]

24 Timbral Music Similarity
- Measure similarity of feature distribution
  - i.e. collapse across time to get density p(x_i)
  - compare by e.g. KL divergence
- e.g. Artist Identification
  - learn artist model p(x_i | artist X) (e.g. as GMM)
  - classify unknown song to closest model
[Diagram: Training: MFCCs → GMMs for Artist 1, Artist 2; Test Song → KL to each model → Min → Artist]
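The KL divergence between two GMMs has no closed form, so the sketch below swaps in a simpler model: one diagonal-covariance Gaussian per artist and per test song, for which the divergence can be written exactly. This is a deliberate simplification for illustration; the function names, the variance floor and the symmetrized distance are assumptions, not the system described in the talk.

```python
import math

def fit_diag_gaussian(frames, var_floor=1e-6):
    """Collapse feature frames over time into per-dimension mean and variance."""
    n, dims = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, var_floor)
           for d in range(dims)]
    return mean, var

def kl_divergence(p, q):
    """KL(p || q) for diagonal Gaussians given as (mean, variance) pairs."""
    (mp, vp), (mq, vq) = p, q
    return 0.5 * sum(math.log(b / a) + (a + (m1 - m2) ** 2) / b - 1.0
                     for m1, a, m2, b in zip(mp, vp, mq, vq))

def symmetric_kl(p, q):
    return kl_divergence(p, q) + kl_divergence(q, p)

def closest_artist(song_frames, artist_models):
    """Return the artist whose model is nearest to the song's model."""
    song = fit_diag_gaussian(song_frames)
    return min(artist_models, key=lambda name: symmetric_kl(song, artist_models[name]))

# Toy 2-D "MFCC" frames for two artists and one unknown song.
artist_models = {
    "artist_a": fit_diag_gaussian([[0.0, 1.0], [1.0, 0.0], [-1.0, 0.5], [0.5, -0.5]]),
    "artist_b": fit_diag_gaussian([[5.0, 6.0], [6.0, 5.0], [4.0, 5.5], [5.5, 4.5]]),
}
test_song = [[0.2, 0.8], [0.9, 0.1], [-0.8, 0.4], [0.4, -0.3]]
predicted = closest_artist(test_song, artist_models)
```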

25 Anchor Space
- Acoustic features describe each song
  - .. but from a signal, not a perceptual, perspective
  - .. and not the differences between songs
- Use genre classifiers to define new space
  - prototype genres are anchors
[Diagram: Audio Input (Class i) and Audio Input (Class j) → Anchor classifiers → n-dimensional vector in "Anchor Space", [p(a_1 | x), p(a_2 | x), ..., p(a_n | x)] → Conversion to Anchorspace → GMM Modeling → Similarity Computation (KL-d, EMD, etc.)]
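The conversion to anchor space can be pictured as mapping each frame x to a vector of posteriors p(a_i | x), one per anchor. The sketch below assumes each anchor model supplies a log-likelihood log p(x | a_i) and that all anchors have equal priors; the classifiers in the talk may well output posteriors directly, so treat `anchor_posteriors` and its inputs as hypothetical.

```python
import math

def anchor_posteriors(log_likelihoods):
    """Turn per-anchor log-likelihoods for one frame into posteriors.

    Assumes equal prior probability for every anchor. Subtracting the
    maximum before exponentiating keeps the computation numerically stable.
    """
    peak = max(log_likelihoods)
    weights = [math.exp(value - peak) for value in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]

# One frame scored against three anchor models (made-up log-likelihoods).
frame_scores = [-12.0, -10.0, -15.0]
anchor_vector = anchor_posteriors(frame_scores)
```

A song then becomes a cloud of such vectors, which could be modeled and compared in the same way as raw cepstral features.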

26 Anchor Space
- Frame-by-frame high-level categorizations
  - compare to raw features?
  - properties in distributions? dynamics?
[Figure: madonna and bowie frames plotted as Cepstral Features (third cepstral coef against fifth cepstral coef) and as Anchor Space Features (Electronica against Country)]

27 Playola Similarity Browser

28 Ground-truth data
- Hard to evaluate Playola's accuracy
  - user tests...
  - ground truth?
- Musicseer online survey:
  - ran for 9 months in 2002
  - > 1,000 users, > 20k judgments
  - projects/musicsim/

29 Evaluation
- Compare Classifier measures against Musicseer subjective results
  - triplet agreement percentage
  - Top-N ranking agreement score:
    $s_i = \sum_{r=1}^{N} \alpha_r^{\,r} \, \alpha_c^{\,k_r}$, with $\alpha_r = (\ )$ and $\alpha_c = \alpha_r^{2}$
  - Average Dynamic Recall? (Typke et al.)
  - First-place agreement percentage - simple significance test
[Figure: agreement % against ground-truth sets SrvKnw 4789x, SrvAll 6178x8.93, GamKnw 7410x, GamAll 7421x; measures: cei, cmb, erd, e3d, opn, kn2, rnd, ANK]
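The top-N ranking agreement score can be sketched as follows, reading k_r as the rank that the candidate measure gives to the item the reference ranks r-th. The value of alpha_r did not survive in the slide text, so the default used here, (1/2)^(1/3), is an assumption, as are the function names, the handling of items missing from the candidate list, and the normalization by the perfect-agreement score.

```python
def top_n_agreement(reference, candidate, n, alpha_r=0.5 ** (1.0 / 3.0)):
    """Sum over the reference top-n of alpha_r**r * alpha_c**k_r.

    Assumptions: alpha_c = alpha_r**2 as on the slide; alpha_r defaults to
    (1/2)**(1/3); items absent from `candidate` contribute nothing.
    """
    alpha_c = alpha_r ** 2
    candidate_rank = {item: k for k, item in enumerate(candidate, start=1)}
    score = 0.0
    for r, item in enumerate(reference[:n], start=1):
        k_r = candidate_rank.get(item)
        if k_r is not None:
            score += alpha_r ** r * alpha_c ** k_r
    return score

def normalized_agreement(reference, candidate, n):
    """Scale the score so that identical rankings give 1.0."""
    return top_n_agreement(reference, candidate, n) / top_n_agreement(reference, reference, n)

# Hypothetical rankings of artists most similar to some target artist.
survey_ranking = ["a", "b", "c", "d"]
same = normalized_agreement(survey_ranking, ["a", "b", "c", "d"], n=4)
reversed_score = normalized_agreement(survey_ranking, ["d", "c", "b", "a"], n=4)
```

Because both decay factors are below 1, agreement near the top of the lists counts far more than agreement further down.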

30 Using SVMs for Artist ID
- Support Vector Machines (SVMs) find hyperplanes in a high-dimensional space
  - relies only on matrix of distances between points
  - much smarter than nearest-neighbor/overlap
  - want diversity of reference vectors...
[Figure: separating hyperplane $(w \cdot x) + b = 0$ between margins $(w \cdot x) + b = +1$ (class $y_i = +1$) and $(w \cdot x) + b = -1$ (class $y_i = -1$); axes $x_1$, $x_2$; normal vector $w$]

31 Song-Level SVM Artist ID
- Instead of one model per artist/genre, use every training song as an anchor
  - then SVM finds best support for each artist
[Diagram: Training (Artist 1, Artist 2) and Test Song → MFCCs → Song Features → D → DAG SVM → Artist]

32 Artist ID Results
- ISMIR/MIREX 2005 also evaluated Artist ID
  - 148 artists, 1800 files (split train/test) from uspop2002
- Song-level SVM clearly dominates
  - using only MFCCs!

MIREX 05 Audio Artist (USPOP2002)
Rank   Participant   Raw Accuracy   Normalized   Runtime / s
1      Mandel        68.3%          68.0%
       Bergstra      59.9%          60.9%
       Pampalk       56.2%          56.0%
       West          41.0%          41.0%
       Tzanetakis    28.6%          28.5%
       Logan         14.8%          14.8%        ?
7      Lidy          Did not complete

33 Playlist Generation
- SVMs are well suited to active learning
  - solicit labels on items closest to current boundary
- Automatic player with skip = Ground truth data collection
  - active-svm automatic playlist generation
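Selecting the item closest to the current boundary can be sketched with a linear decision function f(x) = (w · x) + b: the most informative unlabeled item is the one with the smallest |f(x)|. The weights, the song feature vectors and the helper names below are made up for illustration; a kernel SVM would use its own decision function in the same role.

```python
def decision_value(w, b, x):
    """Signed score (w . x) + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def most_uncertain(w, b, unlabeled):
    """Index of the unlabeled item nearest to the decision boundary."""
    return min(range(len(unlabeled)),
               key=lambda i: abs(decision_value(w, b, unlabeled[i])))

# Hypothetical current classifier and three candidate songs (2-D features).
w, b = [1.0, -1.0], 0.0
candidates = [[3.0, 0.0], [1.0, 0.9], [-2.0, 2.0]]
query_index = most_uncertain(w, b, candidates)
```

In a player, the queried song would be played next, and a skip or a full listen would supply its label before the classifier is retrained.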

34 5. Artistic Application
- Compositional applications of automatic music analysis
  - with Douglas Repetto, Ron Weiss, and the rest of the MEAP team
  - music reformulation
  - automatic mashup generator

35 Conclusions
[Diagram nodes: Music audio; Melody extraction; Drums extraction; Event extraction; Fragment clustering; Eigenrhythms; Anchor models; Semantic bases; Similarity/recommend'n; Synthesis/generation]
- Lots of data + noisy transcription + weak clustering → musical insights?
