Singing Pitch Extraction and Singing Voice Separation


1 Singing Pitch Extraction and Singing Voice Separation Advisor: Jyh-Shing Roger Jang Presenter: Chao-Ling Hsu Multimedia Information Retrieval Lab (MIR) Department of Computer Science National Tsing Hua University

2 Outline Introduction Binary Mask based Separation System Overview Proposed Method Voiced Singing Separation Unvoiced Singing Separation Evaluation Conclusions 2011/1/24 MIR, CS, NTHU 2

3 Problems Come with Music Accompaniment
Many applications encounter difficulties when music accompaniment is present.
Music accompaniment acts like noise and interferes with the analysis of the singing voice.
Solution: singing voice separation.

4 Goal of the Dissertation
Goal: Separate the singing voice from the background music.

5 Comparison of Speech and Singing Voice Separation
Singing voice separation is similar to speech separation, with similar applications (speech : singing):
speech recognition : lyrics recognition
speaker identification : singer identification
subtitle alignment : lyric alignment
Differences:
Property | Speech | Singing
Correlation of background noise and target | Uncorrelated | Strongly correlated
Noise type | Periodic/aperiodic, narrow band/broad band | Mostly periodic, mostly broad band
Target pitch range | 80~500 Hz | Up to 1400 Hz

6 Binary Mask based Separation
[Figure: cochleagrams and waveforms of clean speech and noisy speech; applying the binary mask to the noisy speech yields the separated speech]
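The masking idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the per-unit energies of target and interference are taken as known (the "ideal" case), energies are treated as additive in the mixture, and the 0 dB local-SNR threshold and the function names `ideal_binary_mask` and `apply_mask` are hypothetical choices, not taken from the slides.

```python
import math

def ideal_binary_mask(target_energy, noise_energy, threshold_db=0.0):
    """Return a 0/1 mask per T-F unit: 1 where the target dominates.

    Both inputs are lists of rows (frequency channels) holding one
    non-negative energy per time frame. threshold_db is an assumed
    local-SNR criterion, not a value from the slides.
    """
    mask = []
    for t_row, n_row in zip(target_energy, noise_energy):
        row = []
        for t, n in zip(t_row, n_row):
            if t <= 0.0:
                row.append(0)      # no target energy: reject the unit
            elif n <= 0.0:
                row.append(1)      # target only: keep the unit
            else:
                local_snr_db = 10.0 * math.log10(t / n)
                row.append(1 if local_snr_db > threshold_db else 0)
        mask.append(row)
    return mask

def apply_mask(mixture_energy, mask):
    """Zero out the T-F units of the mixture that the mask rejects."""
    return [[m * k for m, k in zip(m_row, k_row)]
            for m_row, k_row in zip(mixture_energy, mask)]

# Toy example: 2 frequency channels x 2 time frames.
target = [[4.0, 0.1], [0.0, 9.0]]
noise = [[1.0, 1.0], [2.0, 3.0]]
# Assumption: energies add in the mixture (cross terms ignored).
mixture = [[t + n for t, n in zip(tr, nr)] for tr, nr in zip(target, noise)]
mask = ideal_binary_mask(target, noise)
separated = apply_mask(mixture, mask)
```

In a real system the mask has to be estimated from the mixture alone; the ideal mask serves as the upper-bound reference.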

7 System Overview
Input: Polyphonic Song
Stage 1 (Segmentation Stage): Time-Frequency Decomposition, then A/U/V Detection, yielding Voiced Frames and Unvoiced Frames
Stage 2 (Grouping Stage): Voiced-dominant T-F unit Identification within Voiced Frames; Unvoiced-dominant T-F unit Identification within Unvoiced Frames
Stage 3 (Resynthesis Stage): Resynthesis
Output: Separated Singing Voice

8 Time-Frequency Decomposition
[System overview diagram, as on slide 7]

9 Time-Frequency Decomposition
[Diagram: Input Song -> Gammatone filterbank -> Framing; the result is a grid of T-F units indexed by frequency channel and time frame]
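The framing step can be sketched as follows, assuming the filterbank outputs are already available as one sample list per frequency channel (the gammatone filterbank itself is omitted). The function name `frame_energies` and the frame length and hop values are hypothetical, chosen only to keep the example small.

```python
def frame_energies(channel_signals, frame_len, hop):
    """Split each channel into overlapping frames and return the energy
    of every T-F unit as units[channel][frame]."""
    units = []
    for samples in channel_signals:
        row = []
        for start in range(0, len(samples) - frame_len + 1, hop):
            frame = samples[start:start + frame_len]
            row.append(sum(x * x for x in frame))
        units.append(row)
    return units

# Two hypothetical filterbank channels, six samples each.
channels = [
    [1.0, 1.0, 0.0, 0.0, 2.0, 2.0],
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
]
units = frame_energies(channels, frame_len=4, hop=2)
```

Each entry of `units` corresponds to one T-F unit: one frequency channel observed over one time frame.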

10 Time-Frequency Decomposition Example
[Figure: input song signal, followed by frequency decomposition and (2) time decomposition]

11 A/U/V Detection
[System overview diagram, as on slide 7]

12 A/U/V Detection
This block performs Accompaniment / Unvoiced sound / Voiced sound (A/U/V) detection.
A hidden Markov model (HMM) is employed:
$$\hat{S} = \arg\max_{S} \; p(x_0 \mid s_0) \prod_{t} \left\{ p(x_t \mid s_t)\, p(s_t \mid s_{t-1}) \right\}$$
where each state $s_t \in \{s_A, s_U, s_V\}$ has its own output likelihood: $p(x \mid s_A)$, $p(x \mid s_U)$, and $p(x \mid s_V)$.
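The maximization over state sequences in such an HMM is usually solved with Viterbi decoding. The sketch below is a minimal illustration, not the presenter's implementation: the likelihood and transition values are made up, and a uniform initial prior is assumed so that the first frame contributes only its output likelihood.

```python
import math

STATES = ("A", "U", "V")  # accompaniment, unvoiced, voiced

def viterbi(log_likelihoods, log_trans, log_init):
    """Most likely state sequence for a 3-state HMM.

    log_likelihoods: list of dicts, log p(x_t | s) for each frame t.
    log_trans[a][b]: log p(s_t = b | s_{t-1} = a).
    log_init[s]: log prior of the first state (assumed uniform below).
    """
    score = {s: log_init[s] + log_likelihoods[0][s] for s in STATES}
    backptr = []
    for frame in log_likelihoods[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda a: score[a] + log_trans[a][s])
            new_score[s] = score[prev] + log_trans[prev][s] + frame[s]
            ptr[s] = prev
        score = new_score
        backptr.append(ptr)
    # Backtrack from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical per-frame output likelihoods p(x_t | s).
likelihoods = [
    {"A": 0.7, "U": 0.2, "V": 0.1},
    {"A": 0.6, "U": 0.2, "V": 0.2},
    {"A": 0.1, "U": 0.2, "V": 0.7},
    {"A": 0.4, "U": 0.1, "V": 0.5},
    {"A": 0.1, "U": 0.1, "V": 0.8},
]
log_likelihoods = [{s: math.log(p) for s, p in f.items()} for f in likelihoods]
# Hypothetical "sticky" transitions: staying in a state is most likely.
log_trans = {a: {b: math.log(0.8 if a == b else 0.1) for b in STATES}
             for a in STATES}
log_init = {s: math.log(1.0 / 3.0) for s in STATES}
path = viterbi(log_likelihoods, log_trans, log_init)
```

The transition term is what smooths the labels: the fourth frame is nearly ambiguous on its own, but it is decoded as voiced because its neighbours are.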

13 Examples of A/U/V Detection
[Figure: frame labels A, U, V, U, V, U, V]

14 Voiced Singing Separation
[System overview diagram, as on slide 7]

15 Voiced Singing Separation
Three improvements for singing pitch extraction:
a) Vocal Component Enhancement
b) Trend Estimation
c) A tandem algorithm for singing pitch extraction and voice separation

16 The tandem algorithm (1)
The tandem algorithm [Hu and Wang 2010] has been shown to be very effective and robust for separating speech from various noises.
It performs pitch estimation and voice separation jointly and iteratively.
However, it does not help when the intrusion type is music.

17 The tandem algorithm (2)
Intrusion types: N1 white noise, N2 noise bursts, N3 cocktail party noise, N4 rock music, N5 siren, N6 trill telephone, N7 female speech, N8 male speech, and N9 female speech.

18 Schematic Diagram of Voiced Singing Separation
[Diagram blocks: Mixture; Trend Estimation; Singing Voice Detection; Singing Pitch Extraction and Singing Voice Separation (iterative procedure); Separated Singing]

19 Trend estimation
Objective: estimate a rough pitch range of the target singing voice.
[Diagram: Mixture -> Vocal Component Enhancement -> Pitch Range Estimation -> Estimated Trend]

20 Trend estimation-vocal component enhancement (1)
We employ HPSS (Harmonic/Percussive Sound Separation) [Tachibana2010] to enhance the singing voice.
This method uses the anisotropic smoothness of sounds in the spectrogram:
Harmonic sound: smooth in the temporal direction, because it is sustained and periodic for a while.
Percussive sound: smooth in the frequency direction, because it is instantaneous and aperiodic.
HPSS exploits the anisotropic smoothness of harmonic and percussive sounds to separate them.

21 Trend estimation-vocal component enhancement (2)
HPSS is formulated as an optimization problem that minimizes the objective function
$$J(H, P) = \iint \left| \frac{\partial}{\partial t} H^{\gamma}(t, \omega) \right|^{2} dt\, d\omega + \iint \left| \frac{\partial}{\partial \omega} P^{\gamma}(t, \omega) \right|^{2} dt\, d\omega$$
subject to
$$H(t, \omega) + P(t, \omega) = W(t, \omega)$$
H(t, ω) and P(t, ω) are the complex spectrograms of the harmonic sound and the percussive sound to be estimated.
W(t, ω) is the spectrogram of the original signal.
γ is an exponential constant of approximately 0.6 (to imitate the auditory system).
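To make the smoothness idea concrete, the sketch below separates a toy spectrogram with median filtering, a simpler and well-known alternative to the optimization described on this slide; it is not the method of [Tachibana2010]. The function name `median_hpss`, the filter width, and the toy input are assumptions, and the range compression by γ is omitted.

```python
from statistics import median

def median_hpss(spec, width=3):
    """Split a magnitude spectrogram spec[freq][time] into harmonic and
    percussive parts using binary masks from median filtering."""
    n_freq, n_time = len(spec), len(spec[0])
    half = width // 2
    harmonic = [[0.0] * n_time for _ in range(n_freq)]
    percussive = [[0.0] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        for t in range(n_time):
            # Median along time keeps sustained (harmonic) ridges.
            time_win = spec[f][max(0, t - half):t + half + 1]
            # Median along frequency keeps broadband (percussive) bursts.
            freq_win = [spec[k][t]
                        for k in range(max(0, f - half),
                                       min(n_freq, f + half + 1))]
            if median(time_win) >= median(freq_win):
                harmonic[f][t] = spec[f][t]
            else:
                percussive[f][t] = spec[f][t]
    return harmonic, percussive

# Toy spectrogram: a sustained tone in channel 2 and a click at frame 2.
spec = [[1.0 if (f == 2 or t == 2) else 0.0 for t in range(5)]
        for f in range(5)]
harmonic, percussive = median_hpss(spec)
```

Because each unit is assigned to exactly one component, the two outputs always sum back to the input, mirroring the constraint H + P = W.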

22 Trend estimation-vocal component enhancement (3)
How can the singing voice be enhanced against the background music?
The frequency of the singing voice fluctuates more than that of instruments such as guitar and piano.
With a long STFT window, the spectrogram has low temporal resolution and high frequency resolution, so the fluctuating singing voice is no longer smooth in time and tends to fall into the percussive component P.
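The resolution trade-off can be checked with a short calculation. Without zero padding, the spacing between STFT frequency bins is the reciprocal of the window duration, so it does not depend on the sampling rate. The sketch uses the 64 ms and 256 ms windows compared on the next slide; the function name `stft_resolution` is hypothetical.

```python
def stft_resolution(window_ms):
    """Return (frequency bin spacing in Hz, window duration in s)
    for an STFT window of the given length, without zero padding."""
    duration_s = window_ms / 1000.0
    return 1.0 / duration_s, duration_s

short_df, short_dt = stft_resolution(64)    # about 15.6 Hz per bin
long_df, long_dt = stft_resolution(256)     # about 3.9 Hz per bin
```

The longer window resolves frequency four times more finely, but each frame averages over four times as much signal.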

23 Trend estimation-vocal component enhancement (4)
[Figure: spectrograms (Frequency vs. Time) of music and vocal, each computed with window sizes of 64 ms and 256 ms]

24 Trend estimation-vocal component enhancement (5)
[Figure: the mixture and its H and P components]

25 Pitch Range Estimation (1)
[Figure: the spectrogram after HPSS; Frequency (Hz) vs. Time (Secs)]

26 Pitch Range Estimation (2)
[Figure: MR-FFT; Frequency bin (step by 0.25 semitone) vs. Time (Secs)]

27 Pitch Range Estimation (3)
[Figure: overtone deletion; Frequency bin (step by 0.25 semitone) vs. Time (Secs)]

28 Pitch Range Estimation (4)

29 Pitch Range Estimation (5)
[Figure: Frequency (Hz) vs. Time (Secs)]

30 Pitch Extraction Result of the Tandem Algorithm (1)
[Figure: no HPSS, no trend estimation]

31 Pitch Extraction Result of the Tandem Algorithm (2)
[Figure: with HPSS, no trend estimation]

32 Pitch Extraction Result of the Tandem Algorithm (3)
[Figure: with HPSS and trend estimation]

33 Pitch Extraction Result of the Tandem Algorithm (4)
[Figure: post-processing]

34 Mask Comparison
[Figure: binary masks for a 0 dB mixture: IBM, Proposed, and Li_Wang 2007]

35 Evaluation for Voiced Singing Separation
Dataset: MIR-1K
1000 clips of Chinese pop music.
Each clip lasts 4 to 13 seconds; the total length is 133 minutes.
Each clip contains two tracks: the singing voice and the music accompaniment.
The two tracks are mixed at -5 dB, 0 dB, and 5 dB SNR for evaluation.
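Mixing two tracks at a prescribed SNR amounts to rescaling one of them before adding. The sketch below rescales the accompaniment; whether the original experiments scaled the voice or the accompaniment is not stated, so that choice, the function names, and the toy signals are assumptions.

```python
import math

def mix_at_snr(voice, accompaniment, target_snr_db):
    """Scale the accompaniment so that the voice-to-accompaniment
    energy ratio equals target_snr_db, then add the two tracks."""
    voice_energy = sum(v * v for v in voice)
    acc_energy = sum(a * a for a in accompaniment)
    # Gain g with 10*log10(voice_energy / (g^2 * acc_energy)) == target.
    gain = math.sqrt(voice_energy
                     / (acc_energy * 10.0 ** (target_snr_db / 10.0)))
    scaled = [gain * a for a in accompaniment]
    mixture = [v + a for v, a in zip(voice, scaled)]
    return mixture, scaled

def measure_snr_db(signal, noise):
    """Energy ratio of signal to noise in dB."""
    return 10.0 * math.log10(sum(s * s for s in signal)
                             / sum(n * n for n in noise))

# Hypothetical toy tracks standing in for one clip's two channels.
voice = [0.5, -0.5, 0.25, -0.25]
accompaniment = [0.1, 0.2, -0.1, -0.2]
measured = []
for target in (-5.0, 0.0, 5.0):
    mixture, scaled = mix_at_snr(voice, accompaniment, target)
    measured.append(measure_snr_db(voice, scaled))
```

After the loop, `measured` holds the SNR actually achieved for each of the three mixing conditions.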

36 Evaluation of Singing voice detection
[Figure: performance of singing voice detection, Accuracy (%) for Precision, Recall, and Overall Accuracy, without HPSS vs. with HPSS. Panels: (a) -5 dB; (b) 0 dB; (c) 5 dB]

37 Evaluation of Trend Estimation and Singing Pitch Extraction (1)
[Figure: percent of correct detection vs. mixture SNR (-5 dB, 0 dB, 5 dB). Panels: (a) Voiced Only; (b) Overall Results with Vocal Detection; (c) Overall Results without Vocal Detection. Legend: Trend estimation; Before post-processing*; Proposed; Hu-Wang 2010 (with HPSS)*; Hu-Wang 2010 (without HPSS)*; Li-Wang]

38 Evaluation of Trend Estimation and Singing pitch extraction (2)
Winner of the MIREX 2010 Audio Melody Extraction task.
[Figure: combined results of MIREX 2009 and 2010 (based on clips with vocal only). Systems: HJ1 (2010, proposed), TOOS1 (2010), JJY2 (2010), JJY1 (2010), SG1 (2010), cl1 (2009), cl2 (2009), dr1 (2009), dr2 (2009), hjc1 (2009), hjc2 (2009), jjy (2009), kd (2009), mw (2009), pc (2009), rr (2009), toos (2009). Datasets: ADC2004 (12 clips), MIREX2005 (15 clips), indian08, MIREX09 0dB, MIREX09 -5dB, MIREX09 +5dB, Average]

39 Evaluation of Singing voice separation (SNR gain)
[Figure: SNR gain of separated target (dB) vs. mixture SNR (-5 dB, 0 dB, 5 dB). Panels: (a) Voiced Only; (b) Overall Results with Vocal Detection. Legend: IBM; Ideal pitch; Proposed; Li-Wang 2007 (ideal pitch); Li-Wang 2007 (estimated pitch); Ozerov]
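SNR gain is commonly computed as the SNR of the separated output minus the SNR of the unprocessed mixture, both measured against a reference voice signal. The sketch below follows that common definition; the exact reference signal used in the evaluation is not stated on the slide, so the clean voice is assumed here, and all names and signals are hypothetical.

```python
import math

def snr_db(reference, estimate):
    """SNR of an estimate against a reference: reference energy over
    the energy of the difference, in dB."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal / error)

def snr_gain_db(clean_voice, mixture, separated):
    """Improvement of the separated output over the raw mixture."""
    return snr_db(clean_voice, separated) - snr_db(clean_voice, mixture)

clean = [1.0, -1.0, 1.0, -1.0]
mixture = [1.5, -0.5, 1.5, -0.5]          # voice plus 0.5 of accompaniment
separated = [1.05, -0.95, 1.05, -0.95]    # hypothetical separator output
gain = snr_gain_db(clean, mixture, separated)
```

Here the residual error shrinks by a factor of 100 in energy, so the gain is about 20 dB.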

40 Unvoiced Singing Separation
[System overview diagram, as on slide 7]

41 What is an Unvoiced Sound?
[Figure: unvoiced sounds]

42 Unvoiced Singing Separation
Goal: identify the T-F units in the unvoiced frames that are dominated by unvoiced sounds.
Mel-frequency cepstral coefficients (MFCCs) are used as features.
Two GMMs are trained for each frequency channel.
Binary masks are then established for the unvoiced part of the singing voice.
[Figure: masks before unvoiced T-F unit identification, after unvoiced T-F unit identification, and the ideal binary mask]
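One way to use two GMMs per channel is a likelihood comparison: a T-F unit is labelled unvoiced-dominant when the "unvoiced" model explains its feature vector better than the other model. The sketch below assumes that the two models represent unvoiced-dominant and accompaniment-dominant units, which the slide does not state; the parameters are made up rather than trained, and 2-D features stand in for MFCCs.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi)
                             + (xi - mi) ** 2 / vi)
        component_logs.append(log_p)
    peak = max(component_logs)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))

def is_unvoiced_dominant(x, unvoiced_gmm, accompaniment_gmm):
    """Compare the two models' log-likelihoods for one T-F unit."""
    return (gmm_log_likelihood(x, *unvoiced_gmm)
            > gmm_log_likelihood(x, *accompaniment_gmm))

# Hypothetical models for one frequency channel: (weights, means, variances).
unvoiced_gmm = ([0.5, 0.5], [[2.0, 2.0], [3.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
accompaniment_gmm = ([1.0], [[-2.0, -2.0]], [[1.0, 1.0]])

# Feature vectors of two unvoiced frames in this channel.
frames = [[2.5, 1.5], [-1.5, -2.5]]
mask_row = [1 if is_unvoiced_dominant(x, unvoiced_gmm, accompaniment_gmm) else 0
            for x in frames]
```

Repeating this for every frequency channel yields one mask row per channel, which together form the binary mask for the unvoiced frames.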

43 Evaluation of Unvoiced Singing voice separation (SNR gain)
[Figure: (a) GNSDR of separated singing voice: Overall GNSDR (dB) vs. Mixture SNR (dB); (b) GNSDR of separated unvoiced singing voice: GNSDR in unvoiced frames (dB) vs. Mixture SNR (dB). Legend: Ozerov's method; Li and Wang's method; Proposed method]

44 Sound Demos (1)
Lyrics: 人說情歌總是老的好 (People say old love songs are always the best)

45 Sound Demos (2)
Lyrics: 時間在逃亡 (Time is running away)

46 Sound Demos (3)
Lyrics: 然而談的情, 說的愛不夠喔, 說來就來說走就走喔 (Yet the romance we share and the love we speak of are not enough; it comes as it pleases and goes as it pleases)

47 Conclusions
We proposed an extended tandem algorithm for singing pitch extraction and singing voice separation from music accompaniment.
A trend estimation algorithm is proposed to estimate the pitch range of the singing voice.
HPSS is employed to improve the performance of singing voice detection.
A post-processing step is proposed to deal with the sequential grouping problem.
We introduced the first method for separating unvoiced singing voice within a pitch-based inference framework.

48 Publications (1)
Journal Papers:
1. Chao-Ling Hsu and Jyh-Shing Roger Jang, "On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-1K Dataset", IEEE Trans. Audio, Speech, and Language Processing, volume 18, issue 2.
2. Chao-Ling Hsu, DeLiang Wang, and Jyh-Shing Roger Jang, "A Tandem Algorithm for Singing Pitch Extraction and Voice Separation from Music Accompaniment", submitted to IEEE Trans. Audio, Speech, and Language Processing.
Conference Papers:
1. Chao-Ling Hsu, DeLiang Wang, and Jyh-Shing Roger Jang, "A Trend Estimation Algorithm for Singing Pitch Detection in Musical Recordings", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May.
2. Chao-Ling Hsu and Jyh-Shing Roger Jang, "Singing Pitch Extraction by Voice Vibrato/Tremolo Estimation and Instrument Partial Deletion", International Society for Music Information Retrieval (ISMIR), Utrecht, Netherlands, Aug.
3. Chao-Ling Hsu and Jyh-Shing Roger Jang, "Singing Pitch Extraction at MIREX 2010", extended abstract in International Society for Music Information Retrieval (ISMIR), Utrecht, Netherlands, Aug. (Ranked 1st for vocal songs in the audio melody extraction competition)
4. Chao-Ling Hsu, Liang-Yu Chen, Jyh-Shing Roger Jang, and Hsing-Ji Li, "Singing Pitch Extraction From Monaural Polyphonic Songs By Contextual Audio Modeling and Singing Harmonic Enhancement", International Society for Music Information Retrieval (ISMIR), Kobe, Japan, Oct.
5. Chao-Ling Hsu, Liang-Yu Chen, and Jyh-Shing Roger Jang, "Singing Pitch Extraction at MIREX 2009", extended abstract in International Society for Music Information Retrieval (ISMIR), Kobe, Japan, Oct.

49 Publications (2)
6. Chao-Ling Hsu, Jyh-Shing Roger Jang, and Te-Lu Tsai, "Separation of Singing Voice from Music Accompaniment with Unvoiced Sounds Reconstruction for Monaural Recordings", Proceedings of the 125th Audio Engineering Society Convention (AES), San Francisco, USA, Oct.
7. Jyh-Shing Roger Jang, Nien-Jung Lee, and Chao-Ling Hsu, "Simple But Effective Methods for QBSH at MIREX 2006", extended abstract in International Symposium on Music Information Retrieval (ISMIR), Victoria, Canada, Oct.
8. Ruo-Han Chen, Chao-Ling Hsu, Jyh-Shing Roger Jang, and Fong-Jhu Luo, "Content-based Music Emotion Recognition", Workshop on Computer Music and Audio Technology, Taipei, Taiwan, March.
9. Jyh-Shing Roger Jang, Chao-Ling Hsu, and Hong-Ru Lee, "Continuous HMM and Its Enhancement for Singing/Humming Query Retrieval", International Symposium on Music Information Retrieval (ISMIR), London, UK, Sept.
10. Chao-Ling Hsu, Hong-Ru Lee, and Jyh-Shing Roger Jang, "On the Improvement and Error Analysis of Singing/Humming Query Retrieval", Workshop on Computer Music and Audio Technology, Taipei, Taiwan, March.
11. Hong-Ru Lee, Chao-Ling Hsu, Yi-Cin Wang, and Jyh-Shing Roger Jang, "Multi-modal Music Retrieval System", Conference on Digital Archive Technology, Taipei, Taiwan.
