Vocal Melody Extraction from Polyphonic Audio with Pitched Accompaniment


Vocal Melody Extraction from Polyphonic Audio with Pitched Accompaniment
Vishweshwara Rao (05407001)
Ph.D. Defense, June 2011
Guide: Prof. Preeti Rao
Department of Electrical Engineering, Indian Institute of Technology Bombay

OUTLINE
- Introduction: objective, background, motivation, approaches & issues; Indian music
- Proposed melody extraction system: design, evaluation, problems (competing pitched accompanying instrument)
- Enhancements for increasing robustness to pitched accompaniment:
  - Dual-F0 tracking
  - Identification of vocal segments by a combination of static and dynamic features
  - Signal-sparsity-driven window length adaptation
- Graphical user interface for melody extraction
- Conclusions and future work

INTRODUCTION: Objective
- Vocal melody extraction from polyphonic audio
  - Polyphony: multiple musical sound sources present
  - Vocal: the lead melodic instrument is the singing voice
- Melody: a sequence of notes (a symbolic representation of music, each note a frequency over time), represented here as the pitch contour of the singing voice

INTRODUCTION: Background
- Pitch: a perceptual attribute of sound, closely related to periodicity, i.e. the fundamental frequency F0 = 1/T0
- [Figure: example waveforms with F0 = 1/T0 = 100 Hz and F0 = 1/T0 = 300 Hz, and a vocal pitch contour]
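The pitch-periodicity relation above can be demonstrated in a few lines of code. This is an illustrative sketch only (the thesis system uses TWM-based salience, not the autocorrelation function): the ACF of a periodic frame peaks at the lag T0, so F0 = 1/T0 falls out of a simple peak search.

```python
import numpy as np

def estimate_f0_acf(x, fs, f0_min=80.0, f0_max=500.0):
    """Estimate the F0 of a short frame via the autocorrelation function (ACF).

    Demonstration of the pitch/periodicity relation on the slide; the search
    range [f0_min, f0_max] is an assumption for illustration.
    """
    x = np.asarray(x, float) - np.mean(x)
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..N-1
    lag_min = int(fs / f0_max)          # shortest period considered
    lag_max = int(fs / f0_min)          # longest period considered
    lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return fs / lag                     # F0 = 1 / T0
```

For a clean 100 Hz tone sampled at 8 kHz the ACF peaks at a lag of 80 samples, giving 100 Hz back.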

INTRODUCTION: Motivation, Complexity and Approaches
- Motivation
  - Music information retrieval: query-by-singing/humming (QBSH), artist ID, cover-song ID
  - Music edutainment: singing learning, karaoke creation
  - Musicology
- Problem complexity
  - Singing: large F0 range, pitch dynamics
  - Diversity: inter-singer, across cultures
  - Polyphony: crowded signal (percussive & tonal instruments)
- Approaches: understanding without separation vs. source separation [Lag08]; multi-F0 analysis vs. classification [Pol05]
- [Block diagram: polyphonic audio signal -> signal representation -> multi-F0 analysis -> predominant-F0 trajectory extraction + voicing detection -> voice F0 contour]

INTRODUCTION: Indian classical music - signal characteristics
- Sound sources: singer, tanpura (drone), harmonium (secondary melody), tabla (percussion)
- [Figure: spectrogram (0-2000 Hz, 0-20 s) of an excerpt, with tabla strokes "Tun", "Na", "Ghe" marked]

INTRODUCTION: Melody extraction in Indian classical music - issues
- Signal complexity: singing, polyphony, variable tonic
- Non-availability of ground-truth data: the music is almost completely improvised (no universally accepted notation)
- [Audio example with tabla strokes "Thit", "Ke", "Tun"]

SYSTEM DESIGN: Our Approach
- [Block diagram: polyphonic audio signal -> signal representation -> multi-F0 analysis -> predominant-F0 trajectory extraction + singing voice detection -> voice F0 contour]
- Design considerations: suitability for singing, robustness to pitched accompaniment, flexibility

SYSTEM DESIGN: Signal Representation
- Frequency-domain representation: pitched sounds have harmonic spectra
- Short-time analysis and DFT: X(n, ω) = Σ_{m=0}^{M-1} x(m) w(n-m) e^{-i2πωm/M}
- Window length: chosen to resolve harmonics of the minimum expected F0
- Sinusoidal representation: more compact & relevant
- Methods of sinusoid identification: magnitude-based, phase-based, main-lobe matching (sinusoidality) [Grif88]
- Main-lobe matching found to be the most reliable: the frequency transform of the window has a known shape, so local peaks whose shape closely matches the window main lobe are declared sinusoids
- [Figure: magnitude (dB) of the frequency transform of a 40 ms Hamming window, -500 to 500 Hz]
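As a rough illustration of main-lobe matching, the sketch below scores each local spectral peak by the cosine similarity between its neighbourhood and the transform of the analysis window. Only the Hamming window, the idea of matching the known main-lobe shape, and a sinusoidality threshold of 0.8 come from the slides; the FFT size, the exact similarity measure, and the 1%-of-maximum amplitude gate are assumptions of this sketch.

```python
import numpy as np

def detect_sinusoids(frame, fs, nfft=4096, sinusoidality_min=0.8):
    """Toy main-lobe-matching sinusoid detector (sketch, not the thesis code)."""
    M = len(frame)
    win = np.hamming(M)
    spec = np.abs(np.fft.rfft(frame * win, nfft))
    # Reference main-lobe shape: transform of the window itself.
    lobe = np.abs(np.fft.rfft(win, nfft))
    half = int(np.ceil(4 * nfft / M / 2))           # ~half main-lobe width in bins
    ref = lobe[:half + 1]
    ref = np.concatenate([ref[::-1][:-1], ref])     # symmetric template
    ref = ref / np.linalg.norm(ref)

    found = []
    for k in range(half, len(spec) - half):
        if spec[k] < 0.01 * spec.max():             # amplitude gate (assumption)
            continue
        if spec[k] > spec[k - 1] and spec[k] >= spec[k + 1]:   # local peak
            seg = spec[k - half:k + half + 1]
            s = float(np.dot(seg / np.linalg.norm(seg), ref))  # sinusoidality
            if s > sinusoidality_min:
                found.append(k * fs / nfft)         # peak frequency in Hz
    return found
```

A 40 ms frame of a clean 1 kHz tone should yield a detected sinusoid at (very nearly) 1000 Hz.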

SYSTEM DESIGN: Multi-F0 Analysis
- Objective: reliably detect the voice F0 in polyphony with a high salience
- F0-candidate identification: sub-multiples of well-formed sinusoids (sinusoidality > 0.8)
- F0-salience function
  - Typical salience functions (maximize the autocorrelation function (ACF), maximize comb-filter output, harmonic-sieve type [Pol07]) are sensitive to strong harmonic sounds
  - Two-way mismatch (TWM) [Mah94]: an error function sensitive to the deviation of measured partials/sinusoids from ideal harmonic locations
- F0-candidate pruning: sort in ascending order of TWM error; prune weaker F0 candidates in the close vicinity (25 cents) of stronger ones
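A minimal sketch of a TWM-style error in the spirit of [Mah94] follows: each predicted harmonic is compared with its nearest measured sinusoid, and each measured sinusoid with its nearest predicted harmonic, with frequency- and amplitude-dependent weights. The parameter values p, q, r, rho follow the published TWM recommendations; restricting the predicted harmonics to the measured frequency range and the 10-harmonic limit are implementation assumptions.

```python
import numpy as np

def twm_error(f0, freqs, amps, n_harm=10, p=0.5, q=1.4, r=0.5, rho=0.33):
    """Simplified two-way-mismatch (TWM) error for one F0 candidate.

    Lower is better. freqs/amps: detected sinusoid frequencies (Hz) and
    magnitudes for one frame.
    """
    freqs = np.asarray(freqs, float)
    amps = np.asarray(amps, float)
    a_max = amps.max()
    pred = f0 * np.arange(1, n_harm + 1)            # ideal harmonic locations
    pred = pred[pred <= 1.05 * freqs.max()]         # stay within measured range

    # Predicted-to-measured: each predicted harmonic vs nearest sinusoid.
    err_pm = 0.0
    for fh in pred:
        i = int(np.argmin(np.abs(freqs - fh)))
        df = abs(freqs[i] - fh)
        err_pm += df * fh ** -p + (amps[i] / a_max) * (q * df * fh ** -p - r)
    # Measured-to-predicted: each sinusoid vs nearest predicted harmonic.
    err_mp = 0.0
    for fk, ak in zip(freqs, amps):
        df = np.abs(pred - fk).min()
        err_mp += df * fk ** -p + (ak / a_max) * (q * df * fk ** -p - r)

    return err_pm / len(pred) + rho * err_mp / len(freqs)
```

For a measured harmonic series of 200 Hz, the candidate 200 Hz scores a lower error than 150 Hz or the sub-multiple 100 Hz, which is exactly the property the salience function needs.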

SYSTEM DESIGN: Predominant-F0 Trajectory Extraction
- Objective: find the path through the F0-candidate vs. time space that best represents the predominant-F0 trajectory
- Dynamic-programming [Ney83] based path finding
  - Measurement cost = TWM error
  - Smoothness cost must be based on musicological considerations
- Candidate smoothness costs, with p and p' the F0s in the previous and current frames respectively:
  - W(p, p') = OJC * (log2(p'/p))^2, with OJC = 1.0
  - W(p, p') = 1 - exp(-(log2 p' - log2 p)^2 / (2σ^2)), with σ = 0.1
- [Figure: normalized distribution of adjacent-frame pitch transitions for male & female singers (hop = 10 ms), and the two cost functions vs. log change in pitch]
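The DP path finding described above can be sketched as a Viterbi-style recursion over per-frame candidate lists, using the Gaussian log smoothness cost from the slide. The equal weighting of measurement and smoothness costs is an assumption of this sketch.

```python
import numpy as np

def track_f0(cands, costs, sigma=0.1):
    """DP (Viterbi-style) predominant-F0 path finding.

    cands[t] : list of F0 candidates (Hz) in frame t
    costs[t] : corresponding measurement costs (e.g. TWM errors)
    Smoothness: W(p, p') = 1 - exp(-(log2 p' - log2 p)^2 / (2*sigma^2)).
    """
    T = len(cands)
    D = [np.asarray(costs[0], float)]          # accumulated cost per candidate
    back = []
    for t in range(1, T):
        p_prev = np.log2(cands[t - 1])
        p_cur = np.log2(cands[t])
        W = 1.0 - np.exp(-(p_cur[None, :] - p_prev[:, None]) ** 2
                         / (2 * sigma ** 2))   # [prev, cur] transition costs
        total = D[-1][:, None] + W + np.asarray(costs[t], float)[None, :]
        back.append(np.argmin(total, axis=0))
        D.append(np.min(total, axis=0))
    # Backtrack the cheapest path.
    path = [int(np.argmin(D[-1]))]
    for t in range(T - 2, -1, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return [cands[t][i] for t, i in enumerate(path)]
```

With a slowly gliding voice candidate and a spurious high-cost candidate in each frame, the smoothness cost keeps the tracker on the continuous contour.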

EVALUATION: Predominant-F0 extraction on Indian music
- Data: classical - 4 min of multi-track data; film - 2 min of multi-track data
- Ground truth: output of the YIN PDA [Chev02] on the clean voice tracks, with manual correction
- Evaluation metrics
  - Pitch accuracy (PA): % of vocal frames whose pitch is correctly tracked (within 50 cents)
  - Chroma accuracy (CA): as PA, except that octave errors are forgiven
- Parameters: frame length 40 ms; hop 10 ms; F0 limits 100-1280 Hz; spectral content up to 5000 Hz
- Results:

  Genre             Audio content                            PA (%)  CA (%)
  Indian classical  Voice + percussion                       97.4    99.5
  Indian classical  Voice + percussion + drone               92.9    98.2
  Indian classical  Voice + percussion + drone + harmonium   67.3    73.04
  Indian pop        Voice + guitar                           96.7    96.7
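The PA and CA metrics defined above reduce to a per-frame cents-error computation; this sketch shows one way to compute them, with octave forgiveness implemented by folding the error into +/- 600 cents (an implementation choice, not taken from the slides).

```python
import numpy as np

def pitch_chroma_accuracy(ref_hz, est_hz, tol_cents=50.0):
    """Pitch Accuracy (PA) and Chroma Accuracy (CA) over vocal frames.

    ref_hz, est_hz: per-frame reference and estimated F0 values (Hz).
    """
    ref = np.asarray(ref_hz, float)
    est = np.asarray(est_hz, float)
    cents = 1200.0 * np.log2(est / ref)            # signed pitch error in cents
    pa = np.mean(np.abs(cents) <= tol_cents) * 100
    chroma = (cents + 600.0) % 1200.0 - 600.0      # fold out octave errors
    ca = np.mean(np.abs(chroma) <= tol_cents) * 100
    return pa, ca
```

An exact octave error (e.g. 400 Hz estimated against a 200 Hz reference) counts against PA but is forgiven by CA.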

SYSTEM DESIGN: Voicing Detection
- Features (extracted from the polyphonic signal)
  - FS1: 13 MFCCs
  - FS2: 7 static timbral features
  - FS3: normalized harmonic energy (NHE)
- Classifier: GMM, 4 mixtures per class
- Boundary detection and grouping: audio novelty detector [Foote] applied to NHE; frame decision labels grouped over the detected segments
- Data: 23 min of Hindustani training data, 7 min of Hindustani testing data
- Results on testing data (recall = % of actual frames correctly labeled):

  Feature set  Frame-level vocal /         After grouping vocal /
               instrumental recall (%)     instrumental recall (%)
  FS1          92.17 / 66.43               97.86 / 61.61
  FS2          92.38 / 66.29               96.28 / 68.98
  FS3          89.05 / 92.10               93.15 / 96.58
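The classify-then-group pipeline above can be sketched compactly. For brevity this stand-in fits a single diagonal Gaussian per class instead of the 4-mixture GMM the thesis uses, and "grouping" is reduced to a majority vote over fixed-length frame blocks rather than novelty-detected segments; both simplifications are assumptions of the sketch.

```python
import numpy as np

class SimpleFrameClassifier:
    """Frame-level vocal/instrumental classifier sketch (single Gaussian per
    class as a stand-in for the 4-mixture GMM)."""

    def fit(self, X_vocal, X_instr):
        self.stats = []
        for X in (X_vocal, X_instr):
            X = np.asarray(X, float)
            self.stats.append((X.mean(0), X.var(0) + 1e-6))
        return self

    def _loglik(self, X, mu, var):
        X = np.asarray(X, float)
        return -0.5 * (((X - mu) ** 2 / var) + np.log(2 * np.pi * var)).sum(1)

    def predict(self, X, group=1):
        ll_v = self._loglik(X, *self.stats[0])
        ll_i = self._loglik(X, *self.stats[1])
        lab = (ll_v > ll_i).astype(int)          # 1 = vocal, 0 = instrumental
        # Grouping: majority vote over consecutive blocks of `group` frames.
        for s in range(0, len(lab), group):
            lab[s:s + group] = int(round(lab[s:s + group].mean()))
        return lab
```

With well-separated 1-D feature distributions, frames near the vocal mean are labeled 1 and frames near the instrumental mean 0.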

EVALUATION: Submission to MIREX 2008 & 2009
- MIREX (Music Information Retrieval Evaluation eXchange): started in 2004; run by the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL); a common platform for evaluation on common datasets
- Tasks: audio genre/artist/mood classification, audio melody extraction, audio beat tracking, audio key detection, query by singing/humming, audio chord estimation
- [Block diagram of the submitted system: music signal -> signal representation (DFT, main-lobe matching, parabolic interpolation) -> sinusoid frequencies and magnitudes -> multi-F0 analysis (sub-multiples of sinusoids in the F0 search range, TWM error computation, ascending sort, vicinity pruning) -> F0 candidates and measurement costs -> predominant-F0 trajectory extraction (dynamic-programming-based optimal path finding) -> predominant-F0 contour -> voicing detection (thresholding of normalized harmonic energy, grouping over homogeneous segments) -> vocal segment pitch tracks]

EVALUATION: MIREX 2008 & 2009 - Datasets & Evaluation
- ADC 2004: publicly available; 20 excerpts (about 20 s each) from pop, opera, jazz & MIDI
- MIREX 2005: secret data; 25 excerpts (10-40 s) from rock, R&B, pop, jazz, solo piano
- MIREX 2008: ICM data; 4 one-minute excerpts from a male and a female Hindustani vocal performance; 2 min each with and without a loud harmonium
- MIREX 2009: MIR-1K data; 374 karaoke recordings of Chinese songs, each mixed at 3 signal-to-accompaniment ratios (SARs): {-5, 0, 5 dB}
- Evaluation metrics
  - Pitch: pitch accuracy (PA) and chroma accuracy (CA)
  - Voicing: vocal recall (Vx recall) and vocal false-alarm rate (Vx false alm)
  - Overall accuracy: % of correctly detected vocal frames with correctly detected pitch
  - Run-time

EVALUATION: MIREX 2009 & 2010 - MIREX 05 dataset (vocal)

  Participant  Vx recall  Vx false alm  PA       CA       Overall  Runtime (dd:hh:mm)
  2009
  cl1          91.2267    67.7071       70.807   73.924   59.7095  -
  cl2          80.9512    44.8062       70.807   73.924   64.4609  -
  dr1          93.7543    53.5255       76.1145  77.7138  66.9613  -
  dr2          88.0635    38.0207       70.9258  75.9161  66.5175  -
  hjc1         65.8441    19.8206       62.6594  73.4868  54.8538  -
  hjc2         65.8441    19.8206       54.1294  69.4221  52.2905  -
  jjy          88.8546    41.9813       76.2696  79.3236  66.3123  -
  kd           82.6017    15.2554       77.4622  80.8218  76.962   -
  mw           99.9345    99.7947       75.7398  80.3791  53.7323  -
  pc           75.6436    21.0879       71.7068  72.5089  70.465   -
  rr           92.908     56.3639       75.9506  79.1084  65.7745  -
  toos         91.2267    67.7071       70.807   73.924   59.7095  -
  2010
  HJ1          70.80      44.90         71.70    74.93    53.89    00:59:31
  TOOS1        84.65      41.93         68.87    74.62    60.84    00:50:31
  JJY2         96.86      69.62         70.17    78.32    60.81    01:09:30
  JJY1         97.33      70.16         71.58    78.61    61.54    03:48:02
  SG1          76.39      22.78         61.80    73.70    62.13    00:08:15

EVALUATION: MIREX 2009 & 2010 - MIREX 09 dataset (0 dB mix)

  Participant  Vx recall  Vx false alm  PA       CA       Overall  Runtime (dd:hh:mm)
  2009
  cl1          92.4858    83.5749       59.138   62.9508  43.9659  00:00:28
  cl2          77.2085    59.7352       59.138   62.9508  49.2294  00:00:33
  dr1          91.8716    55.3555       69.8804  72.5138  60.1294  16:00:00
  dr2          87.3985    47.3422       66.549   70.7923  59.5076  00:08:44
  hjc1         34.1722    1.7909        72.6577  75.2906  53.1752  00:05:44
  hjc2         34.1722    1.7909        51.6871  70.002   51.7469  00:09:38
  jjy          38.906     19.4063       75.9354  80.2461  49.686   02:14:06
  kd           91.1846    47.7842       80.4565  81.8811  68.2237  00:00:24
  mw           99.992     99.4688       67.2905  71.0018  43.6365  00:02:12
  pc           73.1175    43.4773       50.8895  53.3672  51.5001  03:05:57
  rr           88.8091    50.7595       68.6242  71.3714  60.7733  00:00:26
  toos         99.9829    99.4185       82.2943  85.7474  53.5623  01:00:28
  2010
  HJ1          82.06      14.27         83.15    84.23    76.17    14:39:16
  TOOS1        94.17      38.58         82.59    86.18    72.23    12:07:21
  JJY2         98.33      70.62         81.29    83.83    62.55    14:06:20
  JJY1         98.34      70.65         82.20    84.57    62.90    65:21:11
  SG1          89.65      30.22         80.05    85.50    73.59    01:56:27

EVALUATION: Problems in Melody Extraction
- No marked improvement in melody extraction performance over the last 3 years (2007-2009) [Dres2010]
- Errors due to loud pitched accompaniment
  - Accompaniment pitch tracked instead of the voice: error in predominant-F0 trajectory extraction
  - Accompaniment pitch tracked along with the voice: error in voicing detection
- Errors due to signal dynamics
  - Octave errors due to the fixed window length: error in the signal representation

ENHANCEMENTS: PREDOMINANT-F0 TRACKING - Problems
- Incorrect tracking of loud pitched accompaniment (ICM data)
- Largest reduction in accuracy for audio in which the voice displays large, rapid modulations while the instrument pitch is flat
- DP-based path finding depends on suitably defined measurement and smoothness costs; accompaniment errors arise from
  - Bias in the measurement cost toward a salient (spectrally rich) instrument
  - Bias in the smoothness cost toward a stable-pitched instrument

ENHANCEMENTS: PREDOMINANT-F0 TRACKING - Design & Implementation
- Extension of DP to track ordered pairs of F0 candidates (nodes), called dual-F0 tracking
- Node formation
  - All possible pairs: computationally expensive (10P2 = 90), and an F0 and its (sub-)multiple may be tracked together
  - So pairing of harmonically related F0 candidates is prohibited, with a low harmonic threshold of 5 cents; this still allows pairing of the voice F0 and an octave-separated instrument F0 because of voice detuning
- Node measurement cost: joint TWM error [Mah94]
- Node smoothness cost: sum of the constituent F0 candidates' smoothness costs
- Final selection of the predominant-F0 contour: based on voice-harmonic instability
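The node-formation rule above can be sketched directly: keep every ordered pair of candidates unless the two F0s lie within 5 cents of an exact integer ratio. The upper limit of 8 on the integer multiple is an assumption of this sketch.

```python
import numpy as np
from itertools import permutations

def form_nodes(cands, harm_thresh_cents=5.0, max_mult=8):
    """Form dual-F0 nodes: ordered pairs of F0 candidates, excluding
    harmonically related pairs (ratio within 5 cents of an integer)."""
    def harmonically_related(f1, f2):
        lo, hi = sorted((f1, f2))
        for m in range(1, max_mult + 1):
            cents = abs(1200.0 * np.log2(hi / (m * lo)))
            if cents <= harm_thresh_cents:
                return True
        return False

    return [(a, b) for a, b in permutations(cands, 2)
            if not harmonically_related(a, b)]
```

An exact octave pair (200, 400) is rejected, but a detuned octave pair such as (200, 403), about 13 cents off the exact ratio, is kept, matching the voice-detuning observation on the slide.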

ENHANCEMENTS: PREDOMINANT-F0 TRACKING - Selection of the Predominant-F0 Contour
- Harmonic sinusoidal model (HSM): the partial-tracking algorithm used in SMS [Serra98]; tracks are indexed and linked by harmonic number
- Std.-dev. pruning: within 200 ms segments, prune tracks whose std. dev. is < 2 Hz
- The contour with the greater residual harmonic energy in a 200 ms segment is marked as the predominant F0 there
- [Figures: spectrogram; HSM before pruning; HSM after pruning]

ENHANCEMENTS: PREDOMINANT-F0 TRACKING - Final Implementation Block Diagram
- [Block diagram: music signal -> signal representation (DFT, main-lobe matching, parabolic interpolation) -> sinusoid frequencies and magnitudes -> multi-F0 analysis (sub-multiples of sinusoids in the F0 search range, TWM error computation, ascending sort, vicinity pruning) -> F0 candidates and saliences -> predominant-F0 trajectory extraction (ordered pairing of F0 candidates with harmonic constraint, joint TWM error computation, optimal path finding over nodes (F0 pairs), vocal pitch identification) -> melodic contour]

ENHANCEMENTS: PREDOMINANT-F0 TRACKING - Experimental Evaluation: Setup
- Participating systems
  - TWMDP (single- and dual-F0)
  - LIWANG [LiWang07]: uses an HMM to track the predominant F0; includes the possibility of a 2-pitch hypothesis but finally outputs a single F0; shown to be superior to other contemporary systems
  - Same F0 search range for both: 80-500 Hz
- Evaluation metrics
  - Multi-F0 stage: % presence of the true voice F0 in the candidate list
  - Predominant-F0 extraction: PA & CA for single-F0 and dual-F0
  - Final contour accuracy; either-or accuracy: the correct pitch is present in at least one of the two outputs

ENHANCEMENTS: PREDOMINANT-F0 TRACKING - Experimental Evaluation: Data & Results

  Dataset  Description                                                 Vocal (s)  Total (s)
  1        Li & Wang data                                              55.4       97.5
  2        Examples from the MIR-1K dataset with loud pitched accomp.  61.8       98.1
  3        Examples from MIREX 08 data (Indian classical music)        91.2       99.2
  Total                                                                208.4      294.8

  Multi-F0 evaluation: percentage presence of the voice F0 (%)
  Dataset  Top 5 candidates  Top 10 candidates
  1        92.9              95.4
  2        88.5              95.1
  3        90.0              94.1

- [Figure: pitch and chroma accuracies (%) of the TWMDP single-F0 tracker vs. LIWANG on dataset 1 at SARs of 10, 5, 0, -5 dB]

ENHANCEMENTS: PREDOMINANT-F0 TRACKING - Experimental Evaluation: Results (Dual-F0)
- TWMDP single-F0 is significantly better than the LIWANG system for all datasets
- TWMDP dual-F0 is significantly better than TWMDP single-F0 for datasets 2 & 3
- Scope remains for further improvement in final predominant-F0 identification, indicated by the gap between the TWMDP dual-F0 either-or and final accuracies

  TWMDP accuracies, with % improvement over LIWANG (A1) in parentheses:
  Dataset  Metric  Single-F0 (A2)  Dual-F0 either-or  Dual-F0 final (A3)
  1        PA (%)  88.5 (8.3)      89.3 (0.9)         84.1 (2.9)
  1        CA (%)  90.2 (6.4)      92.0 (1.1)         88.8 (3.9)
  2        PA (%)  57.0 (24.5)     74.2 (-6.8)        69.1 (50.9)
  2        CA (%)  61.1 (14.2)     81.2 (-5.3)        74.1 (38.5)
  3        PA (%)  66.0 (11.3)     85.7 (30.2)        73.9 (24.6)
  3        CA (%)  66.5 (9.7)      87.1 (18.0)        76.3 (25.9)

- [Bar chart: A1, A2 and A3 accuracies for D2 (PA), D2 (CA), D3 (PA), D3 (CA)]

ENHANCEMENTS: PREDOMINANT-F0 EXTRACTION - Example of F0 Collisions
- [Figure: F0 (octaves ref. 110 Hz) vs. time, 0-4 s: (a) single-F0 tracking - ground-truth voice and harmonium pitches vs. the single-F0 output; (b) intermediate dual-F0 tracking - the two dual-F0 contours; (c) final dual-F0 output vs. ground-truth voice pitch]
- Contour switching occurs at F0 collisions

ENHANCEMENTS: VOICING DETECTION - Problems

ENHANCEMENTS: VOICING DETECTION - Features
- Proposed feature set: a combination of static & dynamic features
- Features extracted using a harmonic sinusoidal model representation
- Feature selection within each feature set using information entropy [Weka]
- C1 (static timbral): F0; 10 harmonic powers; spectral centroid (SC); sub-band energy (SE)
- C2 (dynamic timbral): Δ of the 10 harmonic powers; ΔSC & ΔSE; std. dev. of SC over 0.5, 1 & 2 s; modulation energy ratio (MER) of SC over 0.5, 1 & 2 s; std. dev. of SE over 0.5, 1 & 2 s; MER of SE over 0.5, 1 & 2 s
- C3 (dynamic F0-harmonic): mean & median of ΔF0; mean, median & std. dev. of Δharmonics in [0, 2 kHz] and in [2, 5 kHz]; mean, median & std. dev. of Δharmonics 1-5, 6-10 and 1-10; ratio of the mean, median & std. dev. of Δharmonics 1-5 to those of Δharmonics 6-10

ENHANCEMENTS: VOICING DETECTION - Data

  Genre           Songs  Vocal    Instrumental  Overall
  I. Western      11     7m 19s   7m 02s        14m 21s
  II. Greek       10     6m 30s   6m 29s        12m 59s
  III. Bollywood  13     6m 10s   6m 26s        12m 36s
  IV. Hindustani  8      7m 10s   5m 24s        12m 54s
  V. Carnatic     12     6m 15s   5m 58s        12m 13s
  Total           45     33m 44s  31m 19s       65m 03s

- I. Western: syllabic singing, no large pitch modulations, voice often softer than the instrument; mainly flat-note instruments (piano, guitar) with a pitch range overlapping the voice
- II. Greek: syllabic, replete with fast pitch modulations; equal occurrence of flat-note plucked-string/accordion and pitch-modulated violin
- III. Bollywood: syllabic, more pitch modulations than Western but fewer than the other Indian genres; mainly pitch-modulated woodwind & bowed instruments, pitches often much higher than the voice
- IV. Hindustani: syllabic and melismatic, varying from long, pitch-flat, vowel-only notes to large & rapid modulations; mainly flat-note harmonium with a pitch range overlapping the voice
- V. Carnatic: syllabic and melismatic, replete with fast pitch modulations; mainly pitch-modulated violin, with an F0 range generally higher than the voice but with some overlap

ENHANCEMENTS: VOICING DETECTION - Evaluation
- Two cross-validation experiments: intra-genre (leave 1 song out) and inter-genre (leave 1 genre out)
- Feature combination: concatenation vs. classifier combination
- Baseline features: 13 MFCCs [Roc07]
- Evaluation: vocal recall (%) and vocal precision (%)
- Overall results
  - C1 better than the baseline
  - C1+C2+C3 better than C1
  - Classifier combination better than feature concatenation
- [Figure: vocal precision vs. recall curves for MFCC, C1 and C1+C2+C3 across genres in the leave-1-song-out experiment]

ENHANCEMENTS: VOICING DETECTION - Evaluation (contd.)

  Leave-1-song-out vocal recall (%), semi-automatic F0-driven HSM:
  Feature set  I     II    III   IV    V     Total
  Baseline     77.2  66.0  65.6  82.6  83.2  74.9
  F0-MFCCs     78.9  77.8  78.0  85.9  85.9  81.2
  C1           79.6  77.4  79.3  82.3  87.1  81.0
  C1+C2        82.3  83.6  85.4  83.3  86.8  84.2
  C1+C3        80.2  83.4  81.7  89.7  88.2  84.5
  C1+C2+C3     81.1  86.9  86.4  88.5  87.3  85.9

  Leave-1-song-out vocal recall (%), fully-automatic F0-driven HSM:
  Feature set  I     II    III   IV    V     Total
  Baseline     77.2  66.0  65.6  82.6  83.2  74.9
  F0-MFCCs     76.9  72.3  70.0  78.9  83.0  76.2
  C1           81.1  67.8  74.8  78.9  84.5  77.4
  C1+C2        82.9  78.5  82.8  79.5  85.0  81.7
  C1+C3        81.5  72.9  77.6  83.9  84.9  80.2
  C1+C2+C3     82.1  81.1  83.5  83.0  84.7  82.8

- Genre-specific feature-set adaptation: C1+C2 for Western, C1+C3 for Hindustani

ENHANCEMENTS: SIGNAL REPRESENTATION - Sparsity-driven window length adaptation
- Relation between window length and signal characteristics
  - Dense spectrum (multiple harmonic sources) -> long window
  - Non-stationarity (rapid pitch modulations) -> short window
- Adaptive time segmentation for signal modeling and synthesis [Good97] minimizes the reconstruction error between the synthesized and original signals, but at high computational cost
- Instead: easily computable measures for adapting the window length; a sparse spectrum has concentrated components
- Window length selected from {23.2, 46.4, 92.9 ms} by maximizing signal sparsity
- Sparsity measures, with X_n(k) the magnitude spectrum of frame n over N bins, X̄_n its mean, and the magnitudes sorted in ascending order of k for GI:
  - L2 norm: L2 = √( Σ_k X_n²(k) )
  - Normalized kurtosis: KU = (1/N) Σ_k (X_n(k) - X̄_n)⁴ / [ (1/N) Σ_k (X_n(k) - X̄_n)² ]²
  - Gini index: GI = 1 - 2 Σ_k (X_n(k) / Σ_j X_n(j)) * ((N - k + 0.5) / N)
  - Hoyer measure: HO = ( √N - Σ_k X_n(k) / √(Σ_k X_n²(k)) ) / ( √N - 1 )
  - Spectral flatness: SF = ( Π_k X_n²(k) )^(1/N) / ( (1/N) Σ_k X_n²(k) )
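The selection rule above can be sketched with the normalized-kurtosis measure (chosen here because the next slide reports it as the best-performing one): compute the spectrum of the current frame at each candidate length and keep the length whose spectrum is sparsest. Taking the frame from the end of the signal and the common FFT size are assumptions of this sketch.

```python
import numpy as np

def choose_window(x, fs, lengths_ms=(23.2, 46.4, 92.9)):
    """Sparsity-driven window length selection (kurtosis-based sketch)."""
    def kurtosis(s):
        m = s.mean()
        v = ((s - m) ** 2).mean()
        return ((s - m) ** 4).mean() / (v ** 2 + 1e-20)

    best_ms, best_ku = None, -np.inf
    for ms in lengths_ms:
        n = int(round(fs * ms / 1000.0))
        frame = x[-n:] * np.hamming(n)              # most recent n samples
        spec = np.abs(np.fft.rfft(frame, 8192))     # common FFT size for fairness
        ku = kurtosis(spec)                         # sparser -> higher kurtosis
        if ku > best_ku:
            best_ms, best_ku = ms, ku
    return best_ms
```

For a stationary tone the longest window concentrates the energy into the fewest bins and therefore wins, consistent with the dense-spectrum/long-window rule above.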

ENHANCEMENTS: SIGNAL REPRESENTATION - Sparsity-driven window length adaptation (contd.)
- Experimental comparison between fixed and adaptive schemes: fixed and adaptive window lengths (with the different sparsity measures); sinusoid detection by main-lobe matching
- Data
  - Simulations: two-sound mixtures (polyphony) and a vibrato signal
  - Real: Western pop (Whitney, Mariah) and Hindustani taans
- Evaluation metrics: recall (%) and frequency deviation (Hz) at the expected harmonic locations computed from the ground-truth pitch
- Results
  1. Adaptive windowing gives higher recall and lower frequency deviation
  2. Kurtosis-driven adaptation is superior to the other sparsity measures

GRAPHICAL USER INTERFACE: Motivation
- A generalized music transcription system is still unavailable; the solution [Wang08] is a semi-automatic approach with application-specific design (e.g. music tutoring)
- Two, possibly independent, aspects of melody extraction
  - Voice pitch extraction: manually difficult
  - Vocal segment detection: manually easier
- Semi-automatic tool goal: to facilitate the extraction & validation of the voice pitch in polyphonic recordings with minimal human intervention
- Design considerations: accurate pitch detection; completely parametric control; user-friendly control of vocal segment detection

GRAPHICAL USER INTERFACE: Design
- Salient features
  - Melody extraction back-end
  - Validation: visual (spectrogram) and aural (re-synthesis)
  - Segmental parameter variation; easy non-vocal labeling
  - Saving of the final result & parameters
  - Selective use of the dual-F0 tracker; switching between contours
- [Screenshot: (A) waveform viewer; (B) spectrogram & pitch view; (C) menu bar; (D) controls for viewing, scrolling, playback & volume; (E) parameter window; (F) log viewer]

CONCLUSIONS AND FUTURE WORK: Final system block diagram
- [Block diagram: music signal -> signal representation (DFT, main-lobe matching, parabolic interpolation) -> sinusoid frequencies and magnitudes -> multi-F0 analysis (sub-multiples of sinusoids in the F0 search range, TWM error computation, ascending sort, vicinity pruning) -> F0 candidates and saliences -> predominant-F0 trajectory extraction (ordered pairing of F0 candidates with harmonic constraint, joint TWM error computation, optimal path finding over nodes, vocal pitch identification) -> predominant-F0 contour -> voicing detector (harmonic sinusoidal model, feature extraction, classifier, boundary detection, grouping) -> voice pitch contour]

CONCLUSIONS AND FUTURE WORK: Conclusions
- A state-of-the-art melody extraction system was designed by making careful choices for the system modules
- Enhancements to this system increase robustness to loud pitched accompaniment
  - Dual-F0 tracking for predominant-F0 extraction
  - Combination of static & dynamic, timbral & F0-harmonic features for voicing detection
- Fully-automatic, high-accuracy melody extraction is still not feasible
  - Large variability in the underlying signal conditions due to the diversity of music
  - A priori knowledge of the music and signal conditions (male/female singer, rate of pitch variation) helps
- High-accuracy melodic contours can be extracted using a semi-automatic approach

CONCLUSIONS AND FUTURE WORK: Summary of contributions
Design & validation of a novel, practically useful melody extraction system with increased robustness to pitched accompaniment:
- Signal representation: choice of the main-lobe-matching criterion for sinusoid identification; improved sinusoid detection by signal-sparsity-driven window length adaptation
- Multi-F0 analysis: choice of the TWM error as the salience function; improved voice-F0 detection by separating F0-candidate identification from salience computation
- Predominant-F0 trajectory extraction: Gaussian log smoothness cost; dual-F0 tracking; final predominant-F0 contour identification by voice-harmonic instability
- Voicing detection: use of a predominant-F0-derived signal representation; combination of static and dynamic, timbral and F0-harmonic features
- Design of a novel graphical user interface for semi-automatic use of the melody extraction system

Future work
- Melody extraction:
  - Identification of the single predominant-F0 contour from the dual-F0 output, using dynamic features
  - F0 collisions: detection based on minima in the difference of the constituent F0s of nodes; correction by allowing an F0 to be paired with itself around these locations
  - Use of prediction-based partial tracking [Lag07]
  - Validation across larger, more diverse datasets
  - Incorporation of predictive path-finding in the DP algorithm
  - Extension to instrumental pitch tracking in polyphony: homophonic music with a lead instrument (e.g. flute) and accompaniment; polyphonic instruments (e.g. sitar)
- Applications of melody extraction: singing evaluation and feedback, QBSH systems, musicological studies
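The proposed F0-collision detection (minima in the difference of the constituent F0s of a node) could look like the following sketch; the cents threshold and the function name are assumptions for illustration:

```python
import numpy as np

def collision_frames(f0_a, f0_b, thresh_cents=50.0):
    """Flag frames where two tracked F0 contours nearly collide.

    A collision candidate is a local minimum of the absolute cents
    difference between the two contours that falls below a threshold
    (50 cents here is an illustrative choice).
    """
    f0_a = np.asarray(f0_a, dtype=float)
    f0_b = np.asarray(f0_b, dtype=float)
    diff = np.abs(1200.0 * np.log2(f0_a / f0_b))  # per-frame distance in cents
    hits = []
    for t in range(1, len(diff) - 1):
        is_local_min = diff[t] <= diff[t - 1] and diff[t] <= diff[t + 1]
        if is_local_min and diff[t] < thresh_cents:
            hits.append(t)
    return hits
```

Around the flagged frames, the pairing step could then be relaxed to allow an F0 to be paired with itself, as the slide proposes.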

List of related publications

International journals
- V. Rao, P. Gaddipati and P. Rao, "Signal-driven window adaptation for sinusoid identification in polyphonic music," IEEE Transactions on Audio, Speech, and Language Processing, 2011 (accepted).
- V. Rao and P. Rao, "Vocal melody extraction in the presence of pitched accompaniment in polyphonic music," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2145-2154, Nov. 2010.

International conferences
- V. Rao, C. Gupta and P. Rao, "Context-aware features for singing voice detection in polyphonic music," 9th International Workshop on Adaptive Multimedia Retrieval, 2011 (submitted for review).
- V. Rao, S. Ramakrishnan and P. Rao, "Singing voice detection in polyphonic music using predominant pitch," in Proc. InterSpeech, Brighton, U.K., 2009.
- V. Rao and P. Rao, "Improving polyphonic melody extraction by dynamic programming-based dual-F0 tracking," in Proc. 12th International Conference on Digital Audio Effects (DAFx), Como, Italy, 2009.
- V. Rao and P. Rao, "Vocal melody detection in the presence of pitched accompaniment using harmonic matching methods," in Proc. 11th International Conference on Digital Audio Effects (DAFx), Espoo, Finland, 2008.
- A. Bapat, V. Rao and P. Rao, "Melodic contour extraction for Indian classical vocal music," in Proc. Music-AI (International Workshop on Artificial Intelligence and Music) at IJCAI, Hyderabad, India, 2007.
- V. Rao and P. Rao, "Melody extraction using harmonic matching," in Proc. Music Information Retrieval Evaluation eXchange (MIREX), 2008 & 2009. URL: http://www.music-ir.org/mirex/abstracts/2009/rr.pdf

List of related publications [contd.]

National conferences
- S. Pant, V. Rao and P. Rao, "A melody detection user interface for polyphonic music," in Proc. National Conference on Communications (NCC), Chennai, India, 2010.
- N. Santosh, S. Ramakrishnan, V. Rao and P. Rao, "Improving singing voice detection in the presence of pitched accompaniment," in Proc. National Conference on Communications (NCC), Guwahati, India, 2009.
- V. Rao, S. Pant, M. Bhaskar and P. Rao, "Applications of a semi-automatic melody extraction interface for Indian music," in Proc. International Symposium on Frontiers of Research in Speech and Music (FRSM), Gwalior, India, Dec. 2009.
- V. Rao, S. Ramakrishnan and P. Rao, "Singing voice detection in north Indian classical music," in Proc. National Conference on Communications (NCC), Mumbai, India, 2008.
- V. Rao and P. Rao, "Objective evaluation of a melody extractor for north Indian classical vocal performances," in Proc. International Symposium on Frontiers of Research in Speech and Music (FRSM), Kolkata, India, 2008.
- V. Rao and P. Rao, "Vocal trill and glissando thresholds for Indian listeners," in Proc. International Symposium on Frontiers of Research in Speech and Music (FRSM), Mysore, India, 2007.

Patent
- P. Rao, V. Rao and S. Pant, "A device and method for scoring a singing voice," Indian Patent Application No. 1338/MUM/2009, filed June 2, 2009.

REFERENCES
[Pol07] G. Poliner, D. Ellis, A. Ehmann, E. Gomez, S. Streich and B. Ong, "Melody transcription from music audio: Approaches and evaluation," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1247-1256, May 2007.
[Grif88] D. Griffin and J. Lim, "Multiband excitation vocoder," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 8, pp. 1223-1235, 1988.
[Wang08] Y. Wang and B. Zhang, "Application-specific music transcription for tutoring," IEEE Multimedia, vol. 15, no. 3, pp. 70-74, 2008.
[Chev02] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," J. Acoust. Soc. Amer., vol. 111, no. 4, pp. 1917-1930, 2002.
[LiWang07] Y. Li and D. Wang, "Separation of singing voice from music accompaniment for monaural recordings," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1475-1487, 2007.
[Mah94] R. Maher and J. Beauchamp, "Fundamental frequency estimation of musical signals using a two-way mismatch procedure," J. Acoust. Soc. Amer., vol. 95, no. 4, pp. 2254-2263, Apr. 1994.
[Dres2010] K. Dressler, "Audio melody extraction for MIREX 2009," Ilmenau: Fraunhofer IDMT, 2010.
[Ney83] H. Ney, "Dynamic programming algorithm for optimal estimation of speech parameter contours," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, no. 3, pp. 208-214, Apr. 1983.
[Lag07] M. Lagrange, S. Marchand and J. B. Rault, "Enhancing the tracking of partials for the sinusoidal modeling of polyphonic sounds," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 5, pp. 1625-1634, 2007.
[Chao09] C. Hsu and R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Trans. Audio, Speech, Lang. Process., 2009 (accepted).
[Good97] M. Goodwin, "Adaptive signal models: Theory, algorithms and audio applications," Ph.D. dissertation, MIT, 1997.
[Lag08] M. Lagrange, L. Martins, J. Murdoch and G. Tzanetakis, "Normalized cuts for predominant melodic source separation," IEEE Trans. Audio, Speech, Lang. Process. (Special Issue on MIR), vol. 16, no. 2, pp. 278-290, Feb. 2008.
[Pol05] G. Poliner and D. Ellis, "A classification approach to melody transcription," in Proc. Int. Conf. Music Information Retrieval, London, 2005.