Singing Pitch Extraction and Singing Voice Separation Advisor: Jyh-Shing Roger Jang Presenter: Chao-Ling Hsu Multimedia Information Retrieval Lab (MIR) Department of Computer Science National Tsing Hua University
Outline Introduction Binary Mask based Separation System Overview Proposed Method Voiced Singing Separation Unvoiced Singing Separation Evaluation Conclusions 2011/1/24 MIR, CS, NTHU 2
Problems Come with Music Accompaniment
Many applications encounter difficulties when music accompaniment is present: the accompaniment acts like noise and interferes with the analysis of the singing voice.
Solution: singing voice separation
Goal of the Dissertation
Goal: separate the singing voice from the background music.
Comparison of Speech and Singing Voice Separation
Singing voice separation is similar to speech separation, with analogous applications:
  speech recognition : lyrics recognition
  speaker identification : singer identification
  subtitle alignment : lyrics alignment
Differences:
                                        Speech                                   Singing
  Correlation of noise and target       Uncorrelated                             Strongly correlated
  Noise type                            Periodic/aperiodic, narrow/broad band    Mostly periodic, mostly broadband
  Target pitch range                    80~500 Hz                                Up to 1400 Hz
Binary Mask based Separation
[Figure: waveforms and cochleagrams of clean speech and noisy speech; applying a binary mask to the noisy speech yields the separated speech]
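The binary-mask idea can be sketched numerically. In the sketch below (all names hypothetical; a plain energy matrix stands in for the cochleagram), a T-F unit is assigned to the target when the target energy exceeds the interference energy, and the resulting mask is applied to the mixture:

```python
import numpy as np

def ideal_binary_mask(target_tf, noise_tf):
    """1 where the target dominates a T-F unit (0 dB local SNR criterion)."""
    return (np.abs(target_tf) > np.abs(noise_tf)).astype(float)

def apply_mask(mixture_tf, mask):
    """Keep only the T-F units assigned to the target."""
    return mixture_tf * mask

# Toy example: 4 frequency channels x 5 time frames of T-F energies.
rng = np.random.default_rng(0)
speech = rng.random((4, 5))
noise = rng.random((4, 5))
mixture = speech + noise

mask = ideal_binary_mask(speech, noise)
separated = apply_mask(mixture, mask)
# Units dominated by noise are zeroed; the rest keep the mixture value.
assert separated[mask == 0].sum() == 0.0
```

The "ideal" mask needs the clean components, so it serves only as a ceiling for evaluation; the rest of the system estimates the mask from the mixture alone.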
System Overview
Stage 1 (Segmentation): Time-Frequency Decomposition of the polyphonic song, followed by A/U/V Detection into voiced and unvoiced frames.
Stage 2 (Grouping): Voiced-dominant T-F unit identification within voiced frames; unvoiced-dominant T-F unit identification within unvoiced frames.
Stage 3 (Resynthesis): Resynthesis of the separated singing voice.
Time-Frequency Decomposition
The input song is passed through a gammatone filterbank (frequency channels) and then framed in time (time frames); each frequency-channel/time-frame cell is a T-F unit.
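A minimal sketch of this front end, under simplifying assumptions (a textbook 4th-order gammatone impulse response with Glasberg-Moore ERB bandwidths, plain per-frame energies, and hypothetical names throughout):

```python
import numpy as np

def gammatone_ir(fc, fs, dur=0.05, order=4):
    """4th-order gammatone impulse response at center frequency fc (Hz)."""
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 + 0.108 * fc                 # Glasberg-Moore ERB bandwidth
    b = 1.019 * erb
    return t**(order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def tf_decompose(x, fs, center_freqs, frame_len, hop):
    """Filter with a gammatone bank, then frame: (channels, frames) energies."""
    units = []
    for fc in center_freqs:
        y = np.convolve(x, gammatone_ir(fc, fs), mode="same")
        n_frames = 1 + (len(y) - frame_len) // hop
        energies = [np.sum(y[i*hop : i*hop + frame_len]**2)
                    for i in range(n_frames)]
        units.append(energies)
    return np.array(units)

fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)      # 1-second test tone
tf = tf_decompose(x, fs, center_freqs=[110, 440, 1760], frame_len=160, hop=80)
# The channel centered near 440 Hz should capture the most energy.
assert tf[1].sum() > tf[0].sum() and tf[1].sum() > tf[2].sum()
```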
Time-Frequency Decomposition Example
[Figure: the input song signal, 1. frequency decomposition by the filterbank, 2. time decomposition by framing]
A/U/V Detection
This block performs Accompaniment / Unvoiced sound / Voiced sound (A/U/V) detection. A hidden Markov model (HMM) is employed to decode the most likely state sequence:

    Ŝ = argmax_S  p(x_0 | s_0) ∏_{t≥1} p(x_t | s_t) p(s_t | s_{t-1}),   s_t ∈ {A, U, V}

where p(x | s_A), p(x | s_U), and p(x | s_V) are the output likelihoods of the accompaniment, unvoiced, and voiced states.
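Decoding the most likely state sequence from the per-frame output likelihoods is a standard Viterbi computation. A minimal sketch with made-up numbers (the actual emission and transition models are trained, not hard-coded as here):

```python
import numpy as np

def viterbi(log_lik, log_trans, log_init):
    """Most likely state path for an HMM.
    log_lik: (T, S) per-frame log-likelihoods log p(x_t | s).
    log_trans: (S, S) log p(s_t | s_{t-1}); log_init: (S,) log p(s_0)."""
    T, S = log_lik.shape
    delta = log_init + log_lik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans       # (from-state, to-state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_lik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                 # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy 3-state example: 0 = A (accompaniment), 1 = U (unvoiced), 2 = V (voiced).
log_lik = np.log(np.array([[0.8, 0.1, 0.1],
                           [0.1, 0.7, 0.2],
                           [0.1, 0.2, 0.7],
                           [0.1, 0.1, 0.8]]))
log_trans = np.log(np.full((3, 3), 0.3) + np.eye(3) * 0.1)  # mildly sticky
log_init = np.log(np.array([1/3, 1/3, 1/3]))
path = viterbi(log_lik, log_trans, log_init)   # → [0, 1, 2, 2]
```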
Examples of A/U/V Detection
[Figure: detection result with successive segments labeled A, U, V]
Voiced Singing Separation
Three improvements for singing pitch extraction:
a) Vocal component enhancement
b) Trend estimation
c) A tandem algorithm for singing pitch extraction and voice separation
The tandem algorithm (1)
The tandem algorithm [Hu and Wang 2010] has been shown to be highly effective and robust for separating speech from various types of noise. It performs pitch estimation and voice separation jointly and iteratively. However, it does not help when the intrusion is music.
The tandem algorithm (2)
N1: white noise, N2: noise bursts, N3: cocktail-party noise, N4: rock music, N5: siren, N6: trill telephone, N7: female speech, N8: male speech, N9: female speech.
Schematic Diagram of Voiced Singing Separation
Mixture → Trend Estimation → Singing Voice Detection → Singing Pitch Extraction ⇄ Singing Voice Separation (iterative procedure) → Separated Singing
Trend estimation
Objective: estimate a rough pitch range of the target singing voice.
Mixture → Vocal Component Enhancement → Pitch Range Estimation → Estimated Trend
Trend estimation-vocal component enhancement (1)
We employ HPSS (Harmonic/Percussive Sound Separation) [Tachibana 2010] to enhance the singing voice. This method uses the anisotropic smoothness of sounds:
Harmonic sound: smooth in the temporal direction, because it is sustained and periodic for a while.
Percussive sound: smooth in the frequency direction, because it is instantaneous and aperiodic.
HPSS exploits the anisotropic smoothness of harmonic sound and percussive sound to separate them.
[Figure: spectrogram (0-8000 Hz, 0-6 s) illustrating the two kinds of smoothness]
Trend estimation-vocal component enhancement (2)
HPSS is formulated as an optimization problem that minimizes the objective function

    J(H, P) = ∬ |∂H^γ(t, ω)/∂t|² dt dω + ∬ |∂P^γ(t, ω)/∂ω|² dt dω

subject to H(t, ω) + P(t, ω) = W(t, ω), where H(t, ω) and P(t, ω) are the complex spectrograms of the harmonic and percussive sounds to be estimated, W(t, ω) is the spectrogram of the original signal, and γ is an exponential constant of approximately 0.6 (to imitate the auditory system).
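As an illustration only: a popular lightweight approximation of HPSS replaces the gradient-based objective with median filtering, smoothing along time for the harmonic layer and along frequency for the percussive layer. This is not the optimization the slide describes, just a sketch of the same anisotropic-smoothness idea on a toy magnitude spectrogram (rows = frequency, columns = time):

```python
import numpy as np

def median_filter_1d(a, size, axis):
    """Running median along one axis (edge-padded)."""
    pad = size // 2
    padded = np.pad(a, [(pad, pad) if ax == axis else (0, 0)
                        for ax in range(a.ndim)], mode="edge")
    stacked = [np.take(padded, range(i, i + a.shape[axis]), axis=axis)
               for i in range(size)]
    return np.median(np.stack(stacked), axis=0)

def hpss_masks(W, size=5):
    """Harmonic/percussive assignment by anisotropic smoothing."""
    H_smooth = median_filter_1d(W, size, axis=1)   # smooth along time
    P_smooth = median_filter_1d(W, size, axis=0)   # smooth along frequency
    mask_h = H_smooth >= P_smooth                  # binary assignment
    return W * mask_h, W * (~mask_h)

# Toy spectrogram: one sustained horizontal partial plus one broadband click.
W = np.zeros((16, 16))
W[4, :] = 1.0      # harmonic: constant frequency over time
W[:, 8] += 2.0     # percussive: broadband at one instant
H, P = hpss_masks(W)
assert H[4, 2] > 0 and P[10, 8] > 0   # each structure lands in its own layer
```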
Trend estimation-vocal component enhancement (3)
How can the singing voice be enhanced relative to the background music? The frequency of the singing voice fluctuates more than that of instruments such as guitar and piano. With a long STFT window, the spectrogram has low temporal resolution and high frequency resolution, so steady instrumental partials remain temporally smooth (harmonic) while the fluctuating vocal partials behave like the percussive component and are separated out with it.
Trend estimation-vocal component enhancement (4)
[Figure: spectrograms of a music excerpt (top) and a vocal excerpt (bottom) computed with window sizes of 64 ms and 256 ms]
Trend estimation-vocal component enhancement (5)
[Figure: spectrograms of the mixture and its separated harmonic (H) and percussive (P) components]
Pitch Range Estimation (1)
[Figure: the spectrogram after HPSS (0-1200 Hz, 0-11 s)]
Pitch Range Estimation (2)
[Figure: MR-FFT output; frequency bins in steps of 0.25 semitone]
Pitch Range Estimation (3)
[Figure: result after overtone deletion; frequency bins in steps of 0.25 semitone]
Pitch Range Estimation (4)
[Figure]
Pitch Range Estimation (5)
[Figure: spectrogram (0-1200 Hz, 0-11 s)]
Pitch Extraction Result of the Tandem Algorithm (1): No HPSS, No Trend Estimation
Pitch Extraction Result of the Tandem Algorithm (2): With HPSS, No Trend Estimation
Pitch Extraction Result of the Tandem Algorithm (3): With HPSS and Trend Estimation
Pitch Extraction Result of the Tandem Algorithm (4): Post-Processing
Mask Comparison
[Figure: binary masks of the IBM, the proposed method, and Li-Wang 2007, together with the 0 dB mixture]
Evaluation for Voiced Singing Separation
Dataset: MIR-1K
1000 clips of Chinese pop music, 4 to 13 seconds each; total length is 133 minutes.
Each clip contains two tracks: the singing voice and the music accompaniment.
The two tracks are mixed at -5 dB, 0 dB, and 5 dB SNR for evaluation.
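Mixing the two tracks at a prescribed SNR amounts to scaling the accompaniment so that the voice-to-accompaniment power ratio hits the target; a short sketch (function and variable names hypothetical):

```python
import numpy as np

def mix_at_snr(voice, accomp, snr_db):
    """Scale the accompaniment so the voice/accompaniment ratio is snr_db."""
    p_v = np.mean(voice**2)
    p_a = np.mean(accomp**2)
    gain = np.sqrt(p_v / (p_a * 10 ** (snr_db / 10)))
    return voice + gain * accomp

# Toy tracks standing in for the MIR-1K vocal and accompaniment channels.
rng = np.random.default_rng(1)
voice = rng.standard_normal(16000)
accomp = rng.standard_normal(16000)
mix = mix_at_snr(voice, accomp, 0.0)
# At 0 dB the scaled accompaniment carries the same power as the voice.
scaled = mix - voice
assert abs(np.mean(voice**2) / np.mean(scaled**2) - 1.0) < 1e-6
```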
Evaluation of Singing Voice Detection
[Figure: precision, recall, and overall accuracy at (a) -5 dB, (b) 0 dB, and (c) 5 dB mixture SNR, without and with HPSS]
Evaluation of Trend Estimation and Singing Pitch Extraction (1)
[Figure: percentage of correct pitch detection at -5, 0, and 5 dB mixture SNR for (a) voiced frames only, (b) overall results with vocal detection, and (c) overall results without vocal detection; methods compared: trend estimation, the proposed method before post-processing, the proposed method, Hu-Wang 2010 (with and without HPSS), and Li-Wang 2007]
Evaluation of Trend Estimation and Singing Pitch Extraction (2)
Winner of the MIREX 2010 Audio Melody Extraction task.
[Figure: combined results of MIREX 2009 and 2010 (based on clips with vocals only) over the ADC2004 (12 clips), MIREX2005 (15 clips), INDIAN08, and MIREX09 (0, -5, +5 dB) datasets, plus the average; the proposed submission is HJ1 (2010)]
Evaluation of Singing Voice Separation (SNR Gain)
[Figure: SNR gain of the separated target (dB) at -5, 0, and 5 dB mixture SNR for (a) voiced frames only and (b) overall results with vocal detection; methods compared: IBM, ideal pitch, proposed, Li-Wang 2007 (ideal and estimated pitch), and Ozerov 2007]
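The SNR gain here is the SNR of the separated voice minus the SNR of the raw mixture, both measured against the clean vocal track; a sketch of the metric, assuming this standard definition (names hypothetical):

```python
import numpy as np

def snr_db(target, estimate):
    """SNR of an estimate against the clean target signal."""
    noise = estimate - target
    return 10 * np.log10(np.sum(target**2) / np.sum(noise**2))

def snr_gain(target, mixture, separated):
    """Improvement of the separated signal over the unprocessed mixture."""
    return snr_db(target, separated) - snr_db(target, mixture)

# Toy check: halving the interference should yield a ~6 dB gain.
rng = np.random.default_rng(2)
voice = rng.standard_normal(1000)
interference = rng.standard_normal(1000)
mixture = voice + interference
separated = voice + 0.5 * interference
gain = snr_gain(voice, mixture, separated)   # = 10*log10(4) ≈ 6.02 dB
assert abs(gain - 10 * np.log10(4)) < 1e-9
```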
What is an Unvoiced Sound?
[Figure: examples of unvoiced sounds]
Unvoiced Singing Separation
To identify the T-F units of the unvoiced frames that are dominated by unvoiced sounds, Mel-frequency cepstral coefficients (MFCCs) are used as features, and two GMMs are trained for each frequency channel. Binary masks are then established for the unvoiced part of the singing voice.
[Figure: masks before and after unvoiced T-F unit identification, compared with the ideal binary mask]
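The per-channel classification can be sketched as a likelihood-ratio test between the two trained models. The sketch below substitutes a single diagonal Gaussian for each of the two GMMs (an intentional simplification; all names and numbers are hypothetical):

```python
import numpy as np

class DiagGaussian:
    """Single diagonal Gaussian standing in for a per-channel GMM."""
    def fit(self, X):
        self.mu = X.mean(axis=0)
        self.var = X.var(axis=0) + 1e-6      # floor to avoid division by zero
        return self
    def log_lik(self, x):
        return -0.5 * np.sum(np.log(2 * np.pi * self.var)
                             + (x - self.mu) ** 2 / self.var, axis=-1)

def unvoiced_mask(features, voc_model, acc_model):
    """1 where the vocal-dominant model is more likely than accompaniment."""
    return (voc_model.log_lik(features) > acc_model.log_lik(features)).astype(int)

# Toy MFCC-like feature vectors for one frequency channel.
rng = np.random.default_rng(3)
vocal_train = rng.normal(3.0, 1.0, size=(200, 4))
accomp_train = rng.normal(-3.0, 1.0, size=(200, 4))
voc = DiagGaussian().fit(vocal_train)
acc = DiagGaussian().fit(accomp_train)

test = np.vstack([rng.normal(3.0, 1.0, size=(5, 4)),     # vocal-dominant units
                  rng.normal(-3.0, 1.0, size=(5, 4))])   # accompaniment units
mask = unvoiced_mask(test, voc, acc)
assert mask[:5].sum() == 5 and mask[5:].sum() == 0
```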
Evaluation of Unvoiced Singing Voice Separation (SNR Gain)
[Figure: (a) overall GNSDR of the separated singing voice and (b) GNSDR of the separated unvoiced singing voice in unvoiced frames, at -5, 0, and 5 dB mixture SNR; methods compared: Ozerov's method, Li and Wang's method, and the proposed method]
Sound Demos (1) - Lyrics: 人說情歌總是老的好 ("People say the old love songs are always the best")
Sound Demos (2) - Lyrics: 時間在逃亡 ("Time is on the run")
Sound Demos (3) - Lyrics: 然而談的情, 說的愛不夠喔, 說來就來說走就走喔 ("Yet the affection shared and the love spoken were not enough; it comes when it comes and goes when it goes")
Conclusions
We proposed an extended tandem algorithm for singing pitch extraction and singing voice separation from music accompaniment.
A trend estimation algorithm is proposed to estimate the pitch range of the singing voice.
HPSS is employed to improve the performance of singing voice detection.
A post-processing step is proposed to deal with the sequential grouping problem.
The first unvoiced singing voice separation method based on pitch-based inference is introduced.
Publications (1)
Journal Papers:
1. Chao-Ling Hsu and Jyh-Shing Roger Jang, "On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-1K Dataset", IEEE Trans. Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 310-319, 2010.
2. Chao-Ling Hsu, DeLiang Wang, and Jyh-Shing Roger Jang, "A Tandem Algorithm for Singing Pitch Extraction and Voice Separation from Music Accompaniment", submitted to IEEE Trans. Audio, Speech, and Language Processing.
Conference Papers:
1. Chao-Ling Hsu, DeLiang Wang, and Jyh-Shing Roger Jang, "A Trend Estimation Algorithm for Singing Pitch Detection in Musical Recordings", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2011.
2. Chao-Ling Hsu and Jyh-Shing Roger Jang, "Singing Pitch Extraction by Voice Vibrato/Tremolo Estimation and Instrument Partial Deletion", International Society for Music Information Retrieval (ISMIR), Utrecht, Netherlands, Aug. 2010.
3. Chao-Ling Hsu and Jyh-Shing Roger Jang, "Singing Pitch Extraction at MIREX 2010", extended abstract in International Society for Music Information Retrieval (ISMIR), Utrecht, Netherlands, Aug. 2010. (Ranked 1st for vocal songs in the audio melody extraction competition)
4. Chao-Ling Hsu, Liang-Yu Chen, Jyh-Shing Roger Jang, and Hsing-Ji Li, "Singing Pitch Extraction From Monaural Polyphonic Songs By Contextual Audio Modeling and Singing Harmonic Enhancement", International Society for Music Information Retrieval (ISMIR), Kobe, Japan, Oct. 2009.
5. Chao-Ling Hsu, Liang-Yu Chen, and Jyh-Shing Roger Jang, "Singing Pitch Extraction at MIREX 2009", extended abstract in International Society for Music Information Retrieval (ISMIR), Kobe, Japan, Oct. 2009.
Publications (2)
6. Chao-Ling Hsu, Jyh-Shing Roger Jang, and Te-Lu Tsai, "Separation of Singing Voice from Music Accompaniment with Unvoiced Sounds Reconstruction for Monaural Recordings", Proceedings of the 125th Audio Engineering Society Convention (AES), San Francisco, USA, Oct. 2008.
7. Jyh-Shing Roger Jang, Nien-Jung Lee, and Chao-Ling Hsu, "Simple But Effective Methods for QBSH at MIREX 2006", extended abstract in International Symposium on Music Information Retrieval (ISMIR), Victoria, Canada, Oct. 2006.
8. Ruo-Han Chen, Chao-Ling Hsu, Jyh-Shing Roger Jang, and Fong-Jhu Luo, "Content-based Music Emotion Recognition", Workshop on Computer Music and Audio Technology, Taipei, Taiwan, March 2006.
9. Jyh-Shing Roger Jang, Chao-Ling Hsu, and Hong-Ru Lee, "Continuous HMM and Its Enhancement for Singing/Humming Query Retrieval", International Symposium on Music Information Retrieval (ISMIR), London, UK, Sept. 2005.
10. Chao-Ling Hsu, Hong-Ru Lee, and Jyh-Shing Roger Jang, "On the Improvement and Error Analysis of Singing/Humming Query Retrieval", Workshop on Computer Music and Audio Technology, Taipei, Taiwan, March 2005.
11. Hong-Ru Lee, Chao-Ling Hsu, Yi-Cin Wang, and Jyh-Shing Roger Jang, "Multi-modal Music Retrieval System", Conference on Digital Archive Technology, Taipei, Taiwan, 2004.