SIMULATED FORMANT MODELING OF ACCOMPANIED SINGING SIGNALS FOR VOCAL MELODY EXTRACTION


Yu-Ren Chien,1,2 Hsin-Min Wang,2 Shyh-Kang Jeng1,3
1 Graduate Institute of Communication Engineering, National Taiwan University, Taiwan
2 Institute of Information Science, Academia Sinica, Taiwan
3 Department of Electrical Engineering, National Taiwan University, Taiwan
yrchien@ntu.edu.tw, whm@iis.sinica.edu.tw, skjeng@ew.ee.ntu.edu.tw

ABSTRACT

This paper deals with the task of extracting vocal melodies from accompanied singing recordings. The challenging aspect of this task is the tendency of instrumental sounds to interfere with the extraction of the desired vocal melodies, especially when the singing voice is not necessarily predominant among the other sound sources. Existing methods in the literature are either rule-based or statistical. It is difficult for rule-based methods to adequately take advantage of human voice characteristics, whereas statistical approaches typically require large-scale data collection and labeling efforts. In this work, the extraction is based on a model of the input signals that integrates acoustic-phonetic knowledge and real-world data under a probabilistic framework. The resulting vocal pitch estimator is simple, being determined by a small set of parameters and a small set of data. Tested on a publicly available dataset, the proposed method achieves a transcription accuracy of 76%.

1. INTRODUCTION

Music lovers have always been faced with a large collection of music recordings and concert performances to choose from. Whereas successful choices are possible with a small set of metadata, disappointment recurs because the metadata provides only limited information about the musical content. This has motivated researchers to work on systems that extract musically relevant features from audio recordings. One potential benefit of such processing is the possibility that machines will be able to make personalized music purchase decisions on behalf of humans.

In this paper, we focus on the extraction of vocal melodies from polyphonic audio signals. A melody is defined as a succession of pitches and durations; as one might expect, melodies represent one of the most significant features that listeners can identify in musical pieces. In various musical cultures, and in popular music in particular, predominant melodies are commonly carried by singing voices. In view of this, this work aims at analyzing a singing voice accompanied by musical instruments. Instrumental accompaniment is common in vocal music, where the main melodies are exclusively carried by a solo singing voice, with the musical instruments providing harmony. In brief, the goal of the analysis considered in this work is finding the fundamental frequency of the singing voice as a function of time.

Copyright: (c) 2012 Yu-Ren Chien et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

The specific task outlined above is challenging because melody extraction is prone to interference from the accompaniment unless a mechanism is in place for distinguishing the human voice from instrumental sound. [1], [2], and [3] determined the predominant pitch as the one that accounts for the most signal power among all the simultaneous pitches. The concept of pitch predominance is also present in [5] and [6], which defined predominance in terms of harmonicity. For these methods, the problem proves difficult whenever the signal is dominated by a harmonic musical instrument rather than by the singing voice. [7] and [8] realized a timbre recognition mechanism with classification techniques; on the other hand, pitch classification entails quantization of pitch, which in turn causes the loss of such musical information as vibrato, portamento, and non-standard tuning.

The singing voice is probably the oldest mechanism in human history for music performance. It shares considerable acoustic characteristics with speech, which have been formulated analytically in acoustic phonetics [9]. However, a typical acoustic-phonetic model involves some free parameters, i.e., the formant frequencies, which are highly variable across vowels and singers. In view of this, we take a probabilistic approach to vocal melody extraction, by which acoustic knowledge and real-world data can be integrated in a unified manner. With an accompanied singing signal observed, estimation of the vocal pitch is based on the pitch likelihood (the likelihood function of the pitch), which is in turn based on the voice likelihood (the likelihood function of the singing voice). By simulating the singing voice signal, the pitch likelihood can be approximated by an average of voice-likelihood values evaluated at a set of simulated voice examples. The simulation is realized by synthesizing voice signals of various timbres in advance, according to formant frequencies extracted from a wide variety of (possibly accompanied) singing recordings. Since formant frequencies represent spectrum envelopes of the human voice, their extraction does not require the sampled singing recordings to densely cover various pitch values, nor is it impaired by accompaniment of modest loudness in the recordings.

The proposed method offers several potential advantages over previous approaches to vocal melody extraction. First of all, imposing acoustic-phonetic constraints on the extraction enables the proposed method to distinguish the human voice from instrumental sound better than the predominant pitch estimators in [1-3, 5, 6]. Secondly, the acoustic-phonetic constraints save the proposed method from the large-scale data collection and labeling efforts that are common for purely data-driven systems [7]. Third, some systems [4, 10] depend on pitch instability to identify the vocal pitch; in contrast, without discriminating between stable and unstable pitches, the proposed method allows for such cases as an unstable instrumental pitch (e.g., violin) or a stably sung vocal pitch. Fourth, although our earlier approach in [15] was also based on acoustic-phonetic knowledge, it did not statistically model the joint distribution of formant frequencies, nor did it model the accompaniment signal at all. The signal model proposed here for accompanied singing promises to better represent vocal characteristics and to handle interference from the accompaniment. Lastly, we highlight the advantage of the proposed method over the method in [8]. These two methods are interestingly related to each other, both adopting spectrum envelope modeling and the Viterbi algorithm. In spectrum envelope modeling, [8] extracts linear-predictive and cepstral features from sinusoidally resynthesized vocal or instrumental sounds, whereas our approach models vocal spectrum envelopes by formant-synthesizing voice examples. The proposed signal model turns out 1) to be applicable to both vocal pitch estimation and voicing detection, and 2) not to depend on any sound samples of musical instruments.

2. OVERVIEW OF VOCAL PITCH ESTIMATION

To facilitate the estimation, we quantize the vocal pitch into a discrete variable with 88 possible values. The pitch at k quarter tones (k = 1, 2, ..., 88) is associated with a fundamental frequency of $440 \cdot 2^{(k-60)/24}$ hertz. Therefore, the 88 pitch values are quarter-tone-spaced samples of the fundamental frequency in the vocal range from 80 hertz to 1,000 hertz. The fact that we are estimating a discrete-valued signal (the vocal pitch sequence) from an observed signal (the accompanied singing) makes it possible to characterize the pair of signals with a hidden Markov model (HMM) and to find the best pitch sequence by the Viterbi algorithm [11]. Here, the accompanied singing signal is represented by a vector-valued observation sequence, which consists of 100 N-vector observations per second. Each N-vector observation is made up of N consecutive time samples of the signal. Accordingly, there are 88 states in the HMM. As one might expect, the HMM is defined by two probabilistic models: the observation model and the prior model. The observation model describes the probability distribution of an observation given a particular state, while the prior model comprises the state transition probability distribution and the initial state distribution.
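As a concrete check of this quantization (an illustration added here, not part of the original paper), the following Python sketch enumerates the 88 quarter-tone-spaced candidate frequencies and reproduces the stated 80-1,000 hertz vocal range.

    import numpy as np

    # Quarter-tone pitch grid assumed by the HMM: state k (k = 1, ..., 88) maps to a
    # fundamental frequency of 440 * 2**((k - 60) / 24) Hz.
    k = np.arange(1, 89)
    f0_grid = 440.0 * 2.0 ** ((k - 60) / 24.0)
    print(f0_grid[0], f0_grid[-1])  # approximately 80 Hz and 988 Hz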
3. OBSERVATION MODEL

Let the accompanied singing signal, the (unobserved) singing voice, and the vocal pitch at a particular time point be denoted by the random N-vector z, the random N-vector x, and the random variable w, respectively. Then, the likelihood function of w, i.e., the pitch likelihood, can be expanded as a marginalizing integral:

    p_{z|w}(z | w) = \int p_{z|w,x}(z | w, x) \, p_{x|w}(x | w) \, dx.    (1)

With the term p_{z|w,x}(z | w, x) taken as a function of x, this integral can be thought of as the expectation of p_{z|w,x}(z | w, x) and approximated by the corresponding sample mean:

    p_{z|w}(z | w) \approx \frac{1}{N_e} \sum_{i=1}^{N_e} p_{z|w,x}(z | w, x^{(i,w)}),    (2)

where N_e is the number of voice examples available for each of the 88 pitch values, and x^{(i,k)} denotes the i-th voice example for pitch k. Here, the voice examples {x^{(i,w)}}_{i=1}^{N_e} simulate the random experiment underlying the probability distribution described by the density p_{x|w}(. | w). Given the singing voice x, the vocal pitch w can be regarded as a constant, which is independent of any other random quantity; as a result, w can be dropped from the right-hand-side condition in (2):

    p_{z|w}(z | w) \approx \frac{1}{N_e} \sum_{i=1}^{N_e} p_{z|x}(z | x^{(i,w)}),    (3)

which is an average of the values of the voice likelihood p_{z|x}(z | .) evaluated at the voice examples {x^{(i,w)}}_{i=1}^{N_e}. The preparation of the voice examples is presented in Section 3.1; it is an offline procedure performed well in advance of melody extraction. After that, the evaluation of the likelihood of each voice example, i.e., p_{z|x}(z | x^{(i,k)}), is described in Section 3.2.
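To make the sample-mean approximation of (3) concrete, a minimal sketch is given below (illustrative only; the function and variable names are assumptions, and the voice-likelihood callable stands for the model developed in Section 3.2).

    import numpy as np

    def pitch_likelihood(z, voice_examples, voice_likelihood):
        """Sample-mean approximation of p(z | w), following (3).

        z                -- observed frame of accompanied singing (length-N vector)
        voice_examples   -- array of shape (N_e, N): voice examples x^(i,w) for pitch w
        voice_likelihood -- callable implementing p(z | x)
        """
        return np.mean([voice_likelihood(z, x) for x in voice_examples])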

3.1 Synthesizing Voice Examples

Comprehensive collection of real-world singing voice data is difficult, as a result of several facts about the singing voice. In the first place, most vocal performances are accompanied, which makes unaccompanied singing voice recordings extremely scarce. Although non-professional unaccompanied singing data can be collected with less difficulty, untrained singing is typically of less practical relevance than professional singing. Secondly, the pitches most often used in a song are confined 1) to the scale of its key and 2) to the registers of the singer; consequently, it would take a huge number of collected songs to cover the entire vocal pitch range densely. Finally, to provide timbral variety, the collection must include various singers and various voiced sounds (vowels, nasal consonants, etc.). To circumvent the difficulty in collecting singing voice data, we 1) collect accompanied singing data, 2) extract vocal spectrum envelopes from the data, and 3) synthesize voice examples of various pitches from the extracted envelopes. (These steps are described in Sections 6.1, 3.1.1, and 3.1.2, respectively.) Since vocal spectrum envelopes follow a well-defined formant structure, they can be extracted reliably in the presence of instrumental sounds, as long as the singing voice is sufficiently loud in comparison with the instruments. Moreover, by giving a pitch-independent description of timbre, the vocal spectrum envelopes eliminate the need to cover various pitches in data collection. In this way, sufficient data can be collected for the sole purpose of representing the timbral diversity in singing voice.

3.1.1 Extracting Vocal Spectrum Envelopes

A vocal spectrum envelope is an amplitude function of frequency that models the spectrum envelope of a particular voiced sound (a vowel, a nasal consonant, etc.). By giving partial amplitudes as its samples at partial frequencies, it provides a pitch-independent description of the specific timbre of the voiced sound. In our implementation, it is determined by seven parameters: the first five oral formant frequencies f_1, f_2, ..., f_5 (hertz), a nasal formant frequency f_p (hertz), and a nasal anti-formant frequency f_z (hertz) [12]. To be more specific, it is defined by (see [9])

    A(f_h) = 20 \log_{10} \left| U_R(f_h) \, K_R(f_h) \prod_{n \in I} H_n(2\pi f_h) \right|,    (4)

where A(.) is the amplitude function in dB, f_h denotes the frequency in hertz, and U_R(.) represents the (radiated) spectrum envelope of the glottal excitation [9]:

    U_R(f_h) = \frac{f_h / 100}{1 + (f_h / 100)^2}.    (5)

K_R(.) represents all formants of order six and above [9]; its contribution in dB, 20 \log_{10} K_R(f_h), is approximated by a correction term involving the constants 0.43 and 4,500 and applicable for f_h up to 5,000 hertz (6). Furthermore, I = {1, 2, 3, 4, 5, p, z}, and H_n(.) represents the frequency response of formant n [9]:

    H_n(\omega) = \frac{1}{\left(1 - \frac{j\omega}{\sigma_n + j\omega_n}\right)\left(1 - \frac{j\omega}{\sigma_n - j\omega_n}\right)}, \qquad n = 1, 2, 3, 4, 5, p,    (7)

    H_z(\omega) = \left(1 - \frac{j\omega}{\sigma_z + j\omega_z}\right)\left(1 - \frac{j\omega}{\sigma_z - j\omega_z}\right).    (8)

In (7), \omega_n is the frequency of formant n in rad/s, i.e., \omega_n = 2\pi f_n, and \sigma_n is half the bandwidth of formant n in rad/s, which can be approximated as a function of \omega_n by a polynomial regression model [13]. As an example, a vocal spectrum envelope is plotted in Figure 1, which was extracted by the following procedure from a recording of a performance by Dietrich Fischer-Dieskau. In a short-time spectrum (computed by the constant-Q transform [14]) of accompanied singing, the amplitudes at the partial frequencies of a (manually identified) vocal pitch constitute a noisy observation for estimating the underlying vocal spectrum envelope. As a consequence, the vocal spectrum envelope can be estimated by fitting its spectral samples to the observed amplitudes:

    \hat{v} = \arg\min_{v \in V} \sum_{l=1}^{40} \left( a_l^q - a - A(l f_0) \right)^2,    (9)

where a is an amplitude variable (in dB) that modifies the overall magnitude of the spectrum envelope,

    v = (a, f_1, f_2, f_3, f_4, f_5, f_p, f_z)^T,    (10)

V describes range constraints imposed on the formant frequencies (11), a_l^q denotes the amplitude (in dB) observed at the l-th partial, and f_0 denotes the vocal pitch in hertz. The constrained optimization problem in (9) is solved by the multistart coordinate-descent distance minimization procedure described in [15].

Figure 1. A vocal spectrum envelope with formant frequencies (in hertz) of f_1 = 270, f_2 = 1274, f_3 = 2630, f_4 = 2920, f_5 = 3270, and f_p = 920.
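The envelope of (4)-(8) can be evaluated numerically. The sketch below is illustrative rather than the authors' implementation: it realizes the glottal term (5) and the pole/zero resonators (7)-(8), while the higher-formant correction K_R and the regression-based bandwidth model of [13] are replaced by a fixed-bandwidth assumption.

    import numpy as np

    def formant_envelope_db(f_hz, f_oral, f_nasal_pole, f_nasal_zero, bw_hz=90.0):
        """Approximate vocal spectrum envelope (dB) at frequencies f_hz, after (4)-(8).

        f_oral is a list of the five oral formant frequencies (Hz); bw_hz is an
        assumed constant formant bandwidth standing in for the regression model of
        [13]; the higher-formant correction K_R of (6) is omitted in this sketch."""
        f = np.asarray(f_hz, dtype=float)
        jw = 2j * np.pi * f                                   # j * omega
        u_r = (f / 100.0) / (1.0 + (f / 100.0) ** 2)          # glottal excitation, (5)
        sigma = np.pi * bw_hz                                 # half bandwidth in rad/s
        h = np.ones_like(jw)
        for fn in list(f_oral) + [f_nasal_pole]:              # pole pairs, (7)
            wn = 2.0 * np.pi * fn
            h /= (1.0 - jw / (sigma + 1j * wn)) * (1.0 - jw / (sigma - 1j * wn))
        wz = 2.0 * np.pi * f_nasal_zero                       # zero pair, (8)
        h *= (1.0 - jw / (sigma + 1j * wz)) * (1.0 - jw / (sigma - 1j * wz))
        return 20.0 * np.log10(np.abs(u_r * h) + 1e-12)

    # Example call with the oral formants of Figure 1 (nasal anti-formant chosen arbitrarily):
    env = formant_envelope_db(131.0 * np.arange(1, 6), [270, 1274, 2630, 2920, 3270], 920, 1500)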

3.1.2 Synthesis From a Spectrum Envelope

Let A^{(i)}(.) denote the i-th vocal spectrum envelope extracted from accompanied singing data (i = 1, ..., N_e). To synthesize the i-th voice example for pitch k (i.e., x^{(i,k)}), we compute its partial amplitudes according to the envelope A^{(i)}(.):

    a_l^{(i)} = A^{(i)}\left( l \cdot 440 \cdot 2^{(k-60)/24} \right), \quad l = 1, ..., L, \qquad L = \left\lfloor \frac{5000}{440 \cdot 2^{(k-60)/24}} \right\rfloor,    (12)

where a_l^{(i)} denotes the amplitude (in dB) of the l-th partial. Then, the voice example can be synthesized as

    x_t^{(i,k)} = \sum_{l=1}^{L} 10^{a_l^{(i)}/20} \cos\left( 2\pi l \cdot 440 \cdot 2^{(k-60)/24} \, \frac{t}{f_s} \right), \quad t = 1, ..., N,    (13)

where f_s is the 11,025-hertz sampling rate (cf. Section 6.2).
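A direct reading of (12)-(13) yields a short synthesis routine. The sketch below is illustrative (not the authors' code); it assumes the 11,025 hertz sampling rate of Section 6.2, and envelope_db stands for a sampled envelope A^(i)(.) such as the one sketched after Section 3.1.1.

    import numpy as np

    def synthesize_voice_example(envelope_db, k, n_samples, fs=11025.0):
        """Synthesize x^(i,k) from a spectrum envelope, following (12)-(13).

        envelope_db -- callable mapping frequency (Hz) to amplitude (dB), i.e. A^(i)(.)
        k           -- quarter-tone pitch index, 1..88
        """
        f0 = 440.0 * 2.0 ** ((k - 60) / 24.0)       # fundamental frequency of pitch k
        num_partials = int(5000.0 // f0)            # L in (12)
        t = np.arange(1, n_samples + 1)
        x = np.zeros(n_samples)
        for l in range(1, num_partials + 1):
            amp = 10.0 ** (envelope_db(l * f0) / 20.0)
            x += amp * np.cos(2.0 * np.pi * l * f0 * t / fs)
        return x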

3.2 Likelihood of a Voice Example

To evaluate the likelihood of the voice example x^{(i,k)}, we take advantage of the fact that the accompanied singing signal is the sum of the singing voice signal and the accompaniment signal:

    z = x + y,    (14)

where x, y, and z are random N-vectors representing the singing voice, the accompaniment, and the accompanied singing, respectively. By taking (14) as a transformation of y into z, the likelihood can be evaluated as

    p_{z|x}(z | x^{(i,k)}) = p_{y|x}(z - x^{(i,k)} | x^{(i,k)}).    (15)

Approximate independence can be assumed between the vectors x and y, in that they represent separate sound sources with independent phases; hence, we have

    p_{z|x}(z | x^{(i,k)}) \approx p_y(z - x^{(i,k)}).    (16)

The dependence and trend among the time samples in y represent the specific timbre or polyphony of the accompaniment, of which, however, we have no knowledge at the time of melody extraction. In consequence, approximate i.i.d. is assumed among the time samples:

    p_{z|x}(z | x^{(i,k)}) \approx \prod_{t=1}^{N} p_y(z_t - x_t^{(i,k)}).    (17)

To determine the probability distribution of each time sample in y, we collected 256,000 time sample values by randomly sampling the accompaniment data in the MIR-1K dataset [16]. The histogram of these values, plotted in Figure 2, suggests that the probability distribution can be approximated by a zero-mean Gaussian distribution. The Q-Q plot of these values against the standard normal distribution, shown in Figure 3, confirms the approximation by presenting a curve that resembles a straight line. With this approximation, we have

    p_{z|x}(z | x^{(i,k)}) \propto \exp\left\{ -\frac{\| z - x^{(i,k)} \|^2}{2\sigma_y^2} \right\},    (18)

where \sigma_y denotes the standard deviation of the Gaussian distribution.

Figure 2. Histogram of sample values in accompaniment signals.

Figure 3. Q-Q plot of sample values in accompaniment signals against the standard normal distribution.

The voice examples {x^{(i,k)}}_{i=1}^{N_e} are intended to simulate the random experiment underlying the probability distribution described by the density p_{x|w}(. | k); even so, we cannot afford to synthesize a huge number of voice examples that collectively represent the diversity in such trivial signal specifications as various loudness levels and various sinusoidal phase angles. Therefore, as described in Section 3.1, the examples serve only to represent the timbral variety in singing voice; meanwhile, each voice example needs to be matched against the accompanied singing in a phase- and loudness-insensitive fashion. As a first step toward these insensitivities, we substitute a scaled frequency-domain total power for the N-sample signal energy in (18):

    p_{z|x}(z | x^{(i,k)}) \approx \exp\left\{ -c \sum_{f=1}^{192} \left| A_f^z e^{j\phi_f^z} - A_f^{(i,k)} e^{j\phi_f^{(i,k)}} \right|^2 \right\},    (19)

where c is a manually specified scaling constant, f is a frequency index into a constant-Q spectrum [14] with 192 quarter-tone-spaced bins, A_f^z and \phi_f^z denote the constant-Q magnitude and phase spectra of the accompanied singing signal, and A_f^{(i,k)} and \phi_f^{(i,k)} denote those of the voice example. Now, for the phase-insensitivity, we relax the phase of the voice example and maximize the likelihood with respect to the relaxed phase, thereby creating a modified voice example \tilde{x}^{(i,k)} with phase spectrum {\phi_f^z}_{f=1}^{192}:

    p_{z|x}(z | \tilde{x}^{(i,k)}) \approx \exp\left\{ -c \sum_{f=1}^{192} \left( A_f^z - A_f^{(i,k)} \right)^2 \right\},    (20)

    CQT\{\tilde{x}^{(i,k)}\} = \left\{ A_f^{(i,k)} e^{j\phi_f^z} \right\}_{f=1}^{192},    (21)

where CQT{.} denotes the constant-Q transform. Next, to achieve the insensitivity to loudness, we relax the loudness of the voice example and maximize the likelihood with respect to the relaxed loudness, thereby creating an amplified or attenuated voice example \hat{x}^{(i,k)}:

    p_{z|x}(z | \hat{x}^{(i,k)}) \approx \exp\left\{ -c \left[ \sum_{f=1}^{192} (A_f^z)^2 - \frac{\left( \sum_{f=1}^{192} A_f^z A_f^{(i,k)} \right)^2}{\sum_{f=1}^{192} (A_f^{(i,k)})^2} \right] \right\},    (22)

    \hat{x}^{(i,k)} = \frac{\sum_{f=1}^{192} A_f^z A_f^{(i,k)}}{\sum_{f=1}^{192} (A_f^{(i,k)})^2} \, \tilde{x}^{(i,k)},    (23)

which orthogonally projects the accompanied singing onto the subspace of amplified or attenuated versions of the voice example. In the end, to evaluate the likelihood of pitch w, we substitute the modified voice examples {\hat{x}^{(i,w)}}_{i=1}^{N_e} for the voice examples {x^{(i,w)}}_{i=1}^{N_e} in (3):

    p_{z|w}(z | w) \approx \frac{1}{N_e} \sum_{i=1}^{N_e} p_{z|x}(z | \hat{x}^{(i,w)}).    (24)
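The phase- and loudness-insensitive likelihood of (20)-(23) reduces to a few operations on constant-Q magnitude spectra. A minimal sketch (illustrative; the spectra and the constant c are assumed inputs):

    import numpy as np

    def matched_likelihood(A_z, A_ex, c):
        """Phase- and loudness-insensitive voice likelihood, following (22)-(23).

        A_z  -- constant-Q magnitude spectrum of the accompanied singing (192 bins)
        A_ex -- constant-Q magnitude spectrum of a voice example (192 bins)
        Returns the likelihood value of (22) and the projection gain of (23)."""
        gain = np.dot(A_z, A_ex) / np.dot(A_ex, A_ex)
        residual = np.dot(A_z, A_z) - np.dot(A_z, A_ex) ** 2 / np.dot(A_ex, A_ex)
        return np.exp(-c * residual), gain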
4. PRIOR MODEL

We use a Markov chain {w_m}_{m=1}^{M} to model the vocal pitch sequence, which consists of 100 pitch values per second:

    P(w_1, ..., w_M) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) \cdots P(w_M | w_1, ..., w_{M-1})
                     = P(w_1) \prod_{m=2}^{M} P(w_m | w_{m-1}),    (25)

where the random variable w_m \in {1, ..., 88} represents the m-th element in the vocal pitch sequence. The second equality in (25) results from the Markov property that, given the previous pitch w_{m-1}, the current pitch w_m is independent of all the earlier pitches w_{m-2}, w_{m-3}, ..., w_1. The initial state distribution is assumed to be uniform over all possible pitch values:

    P(w_1 = k) = \frac{1}{88}, \qquad k \in {1, ..., 88}.    (26)

The state transition probability distribution is also assumed to be uniform, but only over pitch values within 2 quarter tones of the previous pitch:

    P(w_m = k_2 | w_{m-1} = k_1) =
        1/3  if k_1 \in {1, 88} and |k_1 - k_2| \leq 2;
        1/4  if k_1 \in {2, 87} and |k_1 - k_2| \leq 2;
        1/5  if 3 \leq k_1 \leq 86 and |k_1 - k_2| \leq 2;
        0    if |k_1 - k_2| > 2.    (27)

In almost all cases, there are five pitch values around the previous pitch that are each assigned a nonzero probability (1/5) for the current pitch. The other cases are associated with only 3 or 4 pitch values. For example, when the previous pitch is 88, the only possible values for the current pitch are 86, 87, and 88.
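Together with the frame-wise pitch likelihoods of (24), the prior of (26)-(27) is all the Viterbi decoder needs. A minimal log-domain sketch (illustrative, not the authors' code):

    import numpy as np

    def transition_matrix(n_states=88, width=2):
        """Transition probabilities of (27): uniform over pitches within `width`
        quarter tones of the previous pitch, zero elsewhere."""
        A = np.zeros((n_states, n_states))
        for k1 in range(n_states):
            lo, hi = max(0, k1 - width), min(n_states - 1, k1 + width)
            A[k1, lo:hi + 1] = 1.0 / (hi - lo + 1)
        return A

    def viterbi(log_lik, A, log_init):
        """log_lik: (M, 88) array of frame-wise log pitch likelihoods from (24)."""
        M, S = log_lik.shape
        logA = np.log(A + 1e-300)
        delta = log_init + log_lik[0]
        back = np.zeros((M, S), dtype=int)
        for m in range(1, M):
            scores = delta[:, None] + logA          # scores[i, j]: move from state i to j
            back[m] = np.argmax(scores, axis=0)
            delta = scores[back[m], np.arange(S)] + log_lik[m]
        path = np.empty(M, dtype=int)
        path[-1] = np.argmax(delta)
        for m in range(M - 1, 0, -1):
            path[m - 1] = back[m, path[m]]
        return path + 1                             # 1-based pitch indices

With the uniform initial distribution of (26), log_init would simply be np.full(88, -np.log(88)).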

5. VOICING DETECTION

In addition to estimating the vocal pitch sequence from accompanied singing, vocal melody extraction must find the particular time points at which no singing voice is actually sounding. Such time points may be found during vocal rests, at plosives, etc. For each of these time points, the pitch estimate should be overridden by a state indicating the absence of singing voice. In other words, we need a mechanism for detecting the singing voice at each time point. To this end, we estimate the short-time spectra of the singing voice on the basis of the accompanied singing. Ideally, the estimation will give a zero spectrum for each time point that is not voiced. To estimate the spectrum at a particular time point, we use its minimum mean square error (MMSE) estimator:

    E[CQT\{x\} | z] = \int CQT\{x\} \, p_{x|z}(x | z) \, dx = \int CQT\{x\} \, \frac{p_{z|x}(z | x)}{p_z(z)} \, p_x(x) \, dx.    (28)

With the term CQT\{x\} \frac{p_{z|x}(z | x)}{p_z(z)} taken as a function of x, this integral can be thought of as the expectation of CQT\{x\} \frac{p_{z|x}(z | x)}{p_z(z)} and approximated by the corresponding sample mean:

    E[CQT\{x\} | z] \approx \frac{1}{88 N_e} \sum_{k=1}^{88} \sum_{i=1}^{N_e} CQT\{\hat{x}^{(i,k)}\} \, \frac{p_{z|x}(z | \hat{x}^{(i,k)})}{p_z(z)}.    (29)

The density p_z(z) can again be approximated in this fashion:

    p_z(z) = \int p_{z|x}(z | x) \, p_x(x) \, dx \approx \frac{1}{88 N_e} \sum_{k=1}^{88} \sum_{i=1}^{N_e} p_{z|x}(z | \hat{x}^{(i,k)}).    (30)

Since all the modified voice examples share the same phase spectrum, the magnitude of the spectrum estimate is evaluated as

    \left| E[CQT\{x\} | z] \right|_f \approx \frac{1}{88 N_e} \sum_{k=1}^{88} \sum_{i=1}^{N_e} \tilde{A}_f^{(i,k)} \, \frac{p_{z|x}(z | \hat{x}^{(i,k)})}{p_z(z)}, \qquad f = 1, ..., 192,    (31)

where \tilde{A}_f^{(i,k)} denotes the constant-Q magnitude spectrum of the modified voice example \hat{x}^{(i,k)}. Eventually, the loudness of the singing voice can be estimated by correcting the magnitude spectrum according to the trends in the 40-phon equal-loudness contour (ELC) [17], which quantifies the dependency of human loudness perception on frequency:

    \Lambda(z) = \sum_{f=1}^{192} \left( \left| E[CQT\{x\} | z] \right|_f \, 10^{(40 - \kappa_f)/20} \right)^2,    (32)

where \kappa_f denotes the 40-phon ELC, plotted in Figure 4. If, and only if, \Lambda(z) exceeds an empirically chosen threshold, the time point is deemed voiced.

Figure 4. The 40-phon equal-loudness contour.
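Given the per-example likelihood values and magnitude spectra, the voicing statistic of (29)-(32) is a weighted average followed by an equal-loudness correction. A sketch (illustrative; elc_40phon_db is an assumed lookup of the 40-phon contour at the 192 bin frequencies, and the decision threshold is a tuned constant not reproduced here):

    import numpy as np

    def voicing_statistic(likelihoods, mag_spectra, elc_40phon_db):
        """Loudness statistic Lambda(z) of (29)-(32).

        likelihoods   -- array of shape (88, N_e): p(z | x^(i,k)) for every modified example
        mag_spectra   -- array of shape (88, N_e, 192): constant-Q magnitudes of the examples
        elc_40phon_db -- length-192 array: 40-phon equal-loudness contour at the bin frequencies
        """
        p_z = likelihoods.mean()                                                   # (30)
        weights = likelihoods / p_z
        est_mag = np.einsum('ki,kif->f', weights, mag_spectra) / likelihoods.size  # (31)
        corrected = est_mag * 10.0 ** ((40.0 - elc_40phon_db) / 20.0)              # ELC correction
        return np.sum(corrected ** 2)                                              # (32)

    # A frame is deemed voiced when the statistic exceeds an empirically chosen threshold.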
6. EXPERIMENTS

In this section, to compare our method with some existing methods, we conduct vocal melody extraction experiments on a publicly available dataset. Since the synthesis of voice examples is based on a collection of accompanied singing data, we start by describing that collection.

6.1 Data Collection for Voice Example Synthesis

To synthesize voice examples, we extracted N_e = 84 vocal spectrum envelopes from 14 recordings of about 1 minute each. The 14 recordings represent 14 distinct types of singing voice, including 10 recordings of professional (accompanied) singing captured from YouTube and 4 recordings of non-professional (unaccompanied) singing adapted from clips in the MIR-1K dataset [16]. From each recording, 6 spectrum envelopes were extracted that represent 6 distinct types of voiced sound. The 14 types of singing voice are tenor (José Carreras), soprano (Kiri Te Kanawa), baritone (Dietrich Fischer-Dieskau), mezzo-soprano (Cecilia Bartoli), pop high male voice (Terry Lin), pop high female voice (Stella Chang), pop low male voice (Shieng Luo), pop low female voice (Inn-Jae Chen), pop nasal male voice (Wakin Chau), pop nasal female voice (Chiou-Feng Tsai), non-professional high male voice (Bobon), non-professional high female voice (Annar), non-professional low male voice (Davidson), and non-professional low female voice (Ani). The nasal singers are well known in Taiwan for nasalizing their vowels significantly. The 6 types of voiced sound are /i/, /E/, /A/, /O/, /u/, and a miscellaneous type defined by /@/, the syllabic /z/ or /ü/, or the syllabic nasals /m/, /n/, and /N/. Each sound in the miscellaneous type does not occur in all recordings: /@/ is absent in all 4 Taiwanese-language recordings, perhaps because it seldom occurs in the northern speech of the Taiwanese language; the syllabic nuclei /z/ and /ü/ are specific to languages such as Mandarin Chinese; and the nasal hummings, due to their low loudness, are rarely used in operatic singing. To extract the vocal spectrum envelopes, the first author subjectively selected 6 short-time spectra from each recording that exemplify the 6 sound types, respectively.

6.2 Dataset Description

The dataset adopted for performance evaluation is a subset of the one built for the Melody Extraction Contest in the ISMIR 2004 Audio Description Contest (ADC 2004). The whole ADC 2004 dataset consists of 20 audio recordings, each around 20 seconds in duration, among which eight recordings have instrumental melodies and the other twelve have vocal melodies. Since this work considers vocal melodies only, experiments are carried out exclusively on 9 of the 12 vocal recordings, including two pop song excerpts, three song excerpts with synthesized vocal, and four opera excerpts. The other three vocal excerpts are not included because one contains falsetto singing and the other two contain an ensemble of vocals. The dataset has been used in several Music Information Retrieval Evaluation eXchange (MIREX) contests since 2006; therefore, it affords extensive comparison among methods. Before melody extraction, each audio file in the dataset is resampled at 11,025 hertz and constant-Q transformed [14] (Q = 34) into a sequence of short-time spectra. Each resulting spectrum is a quarter-tone-spaced sampling of a continuous spectrum that is capable of resolving the interference between two half-tone-spaced sinusoids all the way up to 5,428.6 hertz.
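The analysis grid of Section 6.2 follows directly from quarter-tone spacing. The sketch below (an added illustration; the lowest bin frequency is inferred from the spacing rather than stated in the paper) lists the 192 bin frequencies and confirms that quarter-tone spacing implies a Q of about 34:

    import numpy as np

    bins_per_octave = 24                                   # quarter-tone spacing
    n_bins = 192
    f_top = 5428.6                                         # highest analyzed frequency (Hz)
    f_bins = f_top * 2.0 ** (-np.arange(n_bins)[::-1] / bins_per_octave)
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)       # about 34.1, matching Q = 34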

6.3 Performance Measures

In the experiments documented here, the tested system gives vocal melodies in the format of a voicing/pitch value for each frame (at the rate of 100 frames per second). If a frame is estimated to be voiced, the output specifies the pitch estimate for the frame; otherwise, the output specifies that the frame is estimated to be unvoiced. MIREX adopts several measures for evaluating the performance of a melody extraction system [18]. In the first place, to determine how well the system performs voicing detection, we use the voicing detection rate, the voicing false alarm rate, and the discriminability. The voicing detection rate is computed as the fraction of frames that are both labeled and estimated to be voiced, among all the frames that are labeled voiced. The voicing false alarm rate is computed as the fraction of frames that are estimated to be voiced but are actually not voiced, among all the frames that are not voiced according to the reference transcription. The discriminability combines the above two measures in such a way that it can be deemed independent of the value of any threshold involved in the voicing detection decision:

    d' = Q^{-1}(P_F) + Q^{-1}(1 - P_D),    (33)

where Q^{-1}(.) denotes the inverse of the Gaussian tail function, P_F denotes the false alarm rate, and P_D denotes the detection rate. Second, to determine how well the system performs pitch estimation, we use the raw pitch accuracy and the raw chroma accuracy. The raw pitch accuracy is computed as the fraction of frames that are labeled voiced and have pitch estimated within one quarter tone of the true pitch, among all the frames that are labeled voiced. To focus on pitch class estimation while ignoring octave errors, we also compute the raw chroma accuracy, which is computed in the same way as the raw pitch accuracy, except that the pitch is here measured in terms of chroma, or pitch class, a quantity derived from the pitch by wrapping it into one octave. Finally, the performance of voicing detection and pitch estimation can be measured jointly by the overall transcription accuracy, defined as the fraction of frames that receive the correct voicing classification and, if voiced, a pitch estimate within one quarter tone of the true pitch, among all the frames.
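These frame-wise measures can be computed directly from a reference transcription and the system output. A minimal sketch (illustrative; array names are assumptions), using SciPy's inverse survival function for Q^{-1}:

    import numpy as np
    from scipy.stats import norm

    def melody_metrics(ref_voiced, est_voiced, ref_pitch, est_pitch, tol_semitones=0.5):
        """Frame-wise voicing and pitch measures; pitches in semitones, voicing flags boolean."""
        p_d = np.mean(est_voiced[ref_voiced])                      # voicing detection rate
        p_f = np.mean(est_voiced[~ref_voiced])                     # voicing false alarm rate
        d_prime = norm.isf(p_f) + norm.isf(1.0 - p_d)              # discriminability, (33)
        pitch_ok = np.abs(est_pitch - ref_pitch) <= tol_semitones  # within one quarter tone
        raw_pitch = np.mean(pitch_ok[ref_voiced])                  # raw pitch accuracy
        overall = np.mean(np.where(ref_voiced, est_voiced & pitch_ok, ~est_voiced))
        return p_d, p_f, d_prime, raw_pitch, overall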
6.4 Results

The results are listed in Table 1. The overall transcription accuracies listed in the column titled "All" range from 61% to 99%, with an average of 76%. The minimum is found at the excerpt pop3. A significant error made in the analysis of this excerpt is depicted in Figure 5, which reveals that the system mistakenly selected the pitch (87 hertz) of a high-energy (instrumental) bass from 1.34 s to 1.77 s because of a tendency of the proposed signal model to assume a low-energy accompaniment. Still, the energy of frequency components away from the hypothesized partials is irrelevant to the timbre of the hypothesized voice. This suggests that further improvement to the accuracy may be made by leaving out the off-partial frequency components in the calculation of the voice likelihood. At the other end of the accuracies, we see that the maximum occurs at the excerpt daisy4, which might have been particularly easy for our approach because its melodic source is a synthesized vocal. The raw pitch accuracies in the column titled "Voiced" are highly correlated with the overall transcription accuracies, which suggests that further improvement to this system should be made in pitch estimation, not in voicing detection. The column titled "Chroma" contains raw chroma accuracies similar to the raw pitch accuracies, which suggests that octave errors were successfully avoided by the system.

Figure 5. Spectrogram of a segment of the test excerpt pop3, overlaid with the true melody in green and the melody estimate in red.

Table 1. Experimental results. (pop1 and pop2, which contain an ensemble of vocals, are not included here. daisy3 is excluded because it contains falsetto singing.)

Shown in Table 2 is a comparison of the proposed method with the MIREX 2011 submissions in terms of the overall transcription accuracy (OTA). Notably, if the proposed method had entered the evaluation in 2011, it would have ranked 5th out of a total of 11 submissions. Moreover, the accuracy of the proposed system is within 10% of the highest accuracy in the 2011 evaluation. Compared with the method we proposed in [15], which corresponds to Method 6 in Table 2, our current method turns out to give a slightly lower accuracy. This confirms the feasibility of adopting the new approach as the foundation for our future work on vocal melody extraction.

Table 2. Comparison with the MIREX 2011 Audio Melody Extraction results.

7. CONCLUSIONS

A novel approach to vocal melody extraction has been presented that integrates acoustic-phonetic knowledge and real-world data in estimating the vocal pitch sequence. The performance of the proposed method has been evaluated on a publicly available dataset and found to be comparable to state-of-the-art performance.

In the future, we expect that a minor modification to the proposed signal model will further improve the performance in vocal pitch estimation.

Acknowledgments

This work was supported in part by the Taiwan e-Learning and Digital Archives Program (TELDAP), sponsored by the National Science Council of Taiwan under Grant NSC H. Comments from the anonymous reviewers were valuable for enhancing the quality of this paper.

8. REFERENCES

[1] M. Goto and S. Hayamizu, A real-time music scene description system: Detecting melody and bass lines in audio signals, in Proc. IJCAI-CASA.
[2] R. P. Paiva, T. Mendes, and A. Cardoso, On the detection of melody notes in polyphonic audio, in Proc. ISMIR.
[3] S. Jo and C. D. Yoo, Melody extraction from polyphonic audio based on particle filter, in Proc. ISMIR.
[4] K. Dressler, An auditory streaming approach for melody extraction from polyphonic music, in Proc. ISMIR.
[5] M. Lagrange, L. Martins, J. Murdoch, and G. Tzanetakis, Normalized cuts for predominant melodic source separation, IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 2.
[6] J.-L. Durrieu, G. Richard, and B. David, Singer melody extraction in polyphonic signals using source separation methods, in Proc. ICASSP.
[7] D. P. W. Ellis and G. E. Poliner, Classification-based melody transcription, Mach. Learn., vol. 65, no. 2-3.
[8] H. Fujihara, T. Kitahara, M. Goto, K. Komatani, T. Ogata, and H. G. Okuno, F0 estimation method for singing voice in polyphonic audio signal based on statistical vocal model and Viterbi search, in Proc. ICASSP.
[9] G. Fant, Acoustic Theory of Speech Production with Calculations Based on X-Ray Studies of Russian Articulations. The Hague: Mouton.
[10] V. Rao and P. Rao, Vocal melody extraction in the presence of pitched accompaniment in polyphonic music, IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 8.
[11] L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, 1989.
[12] D. H. Klatt, Software for a cascade/parallel formant synthesizer, J. Acoust. Soc. Am., vol. 67, no. 3.
[13] J. W. Hawks and J. D. Miller, A formant bandwidth estimation procedure for vowel synthesis, J. Acoust. Soc. Am., vol. 97, no. 2.
[14] J. C. Brown and M. S. Puckette, An efficient algorithm for the calculation of a constant Q transform, J. Acoust. Soc. Am., vol. 92, no. 5.
[15] Y.-R. Chien, H.-M. Wang, and S.-K. Jeng, An acoustic-phonetic approach to vocal melody extraction, in Proc. ISMIR, 2011.
[16] C.-L. Hsu and J.-S. R. Jang, On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset, IEEE Trans. Audio, Speech, Lang. Process.
[17] ISO 226, Acoustics: Normal equal-loudness-level contours.
[18] G. E. Poliner, D. P. W. Ellis, A. F. Ehmann, E. Gómez, S. Streich, and B. Ong, Melody transcription from music audio: Approaches and evaluation, IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4.


More information

Robert Alexandru Dobre, Cristian Negrescu

Robert Alexandru Dobre, Cristian Negrescu ECAI 2016 - International Conference 8th Edition Electronics, Computers and Artificial Intelligence 30 June -02 July, 2016, Ploiesti, ROMÂNIA Automatic Music Transcription Software Based on Constant Q

More information

NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING

NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING Zhiyao Duan University of Rochester Dept. Electrical and Computer Engineering zhiyao.duan@rochester.edu David Temperley University of Rochester

More information

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University A Pseudo-Statistical Approach to Commercial Boundary Detection........ Prasanna V Rangarajan Dept of Electrical Engineering Columbia University pvr2001@columbia.edu 1. Introduction Searching and browsing

More information

EE391 Special Report (Spring 2005) Automatic Chord Recognition Using A Summary Autocorrelation Function

EE391 Special Report (Spring 2005) Automatic Chord Recognition Using A Summary Autocorrelation Function EE391 Special Report (Spring 25) Automatic Chord Recognition Using A Summary Autocorrelation Function Advisor: Professor Julius Smith Kyogu Lee Center for Computer Research in Music and Acoustics (CCRMA)

More information

Polyphonic Audio Matching for Score Following and Intelligent Audio Editors

Polyphonic Audio Matching for Score Following and Intelligent Audio Editors Polyphonic Audio Matching for Score Following and Intelligent Audio Editors Roger B. Dannenberg and Ning Hu School of Computer Science, Carnegie Mellon University email: dannenberg@cs.cmu.edu, ninghu@cs.cmu.edu,

More information

A probabilistic framework for audio-based tonal key and chord recognition

A probabilistic framework for audio-based tonal key and chord recognition A probabilistic framework for audio-based tonal key and chord recognition Benoit Catteau 1, Jean-Pierre Martens 1, and Marc Leman 2 1 ELIS - Electronics & Information Systems, Ghent University, Gent (Belgium)

More information

SINCE the lyrics of a song represent its theme and story, they

SINCE the lyrics of a song represent its theme and story, they 1252 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 6, OCTOBER 2011 LyricSynchronizer: Automatic Synchronization System Between Musical Audio Signals and Lyrics Hiromasa Fujihara, Masataka

More information

Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE, and Bryan Pardo, Member, IEEE

Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE, and Bryan Pardo, Member, IEEE IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 6, OCTOBER 2011 1205 Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE,

More information

Effects of acoustic degradations on cover song recognition

Effects of acoustic degradations on cover song recognition Signal Processing in Acoustics: Paper 68 Effects of acoustic degradations on cover song recognition Julien Osmalskyj (a), Jean-Jacques Embrechts (b) (a) University of Liège, Belgium, josmalsky@ulg.ac.be

More information

Classification-Based Melody Transcription

Classification-Based Melody Transcription Classification-Based Melody Transcription Daniel P.W. Ellis and Graham E. Poliner LabROSA, Dept. of Electrical Engineering Columbia University, New York NY 10027 USA {dpwe,graham}@ee.columbia.edu February

More information

Transient behaviour in the motion of the brass player s lips

Transient behaviour in the motion of the brass player s lips Transient behaviour in the motion o the brass player s lips John Chick, Seona Bromage, Murray Campbell The University o Edinburgh, The King s Buildings, Mayield Road, Edinburgh EH9 3JZ, UK, john.chick@ed.ac.uk

More information

SINGING VOICE ANALYSIS AND EDITING BASED ON MUTUALLY DEPENDENT F0 ESTIMATION AND SOURCE SEPARATION

SINGING VOICE ANALYSIS AND EDITING BASED ON MUTUALLY DEPENDENT F0 ESTIMATION AND SOURCE SEPARATION SINGING VOICE ANALYSIS AND EDITING BASED ON MUTUALLY DEPENDENT F0 ESTIMATION AND SOURCE SEPARATION Yukara Ikemiya Kazuyoshi Yoshii Katsutoshi Itoyama Graduate School of Informatics, Kyoto University, Japan

More information

1. Introduction NCMMSC2009

1. Introduction NCMMSC2009 NCMMSC9 Speech-to-Singing Synthesis System: Vocal Conversion from Speaking Voices to Singing Voices by Controlling Acoustic Features Unique to Singing Voices * Takeshi SAITOU 1, Masataka GOTO 1, Masashi

More information

Sparse Representation Classification-Based Automatic Chord Recognition For Noisy Music

Sparse Representation Classification-Based Automatic Chord Recognition For Noisy Music Journal of Information Hiding and Multimedia Signal Processing c 2018 ISSN 2073-4212 Ubiquitous International Volume 9, Number 2, March 2018 Sparse Representation Classification-Based Automatic Chord Recognition

More information

INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION

INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION ULAŞ BAĞCI AND ENGIN ERZIN arxiv:0907.3220v1 [cs.sd] 18 Jul 2009 ABSTRACT. Music genre classification is an essential tool for

More information

CSC475 Music Information Retrieval

CSC475 Music Information Retrieval CSC475 Music Information Retrieval Monophonic pitch extraction George Tzanetakis University of Victoria 2014 G. Tzanetakis 1 / 32 Table of Contents I 1 Motivation and Terminology 2 Psychacoustics 3 F0

More information

Interactive Classification of Sound Objects for Polyphonic Electro-Acoustic Music Annotation

Interactive Classification of Sound Objects for Polyphonic Electro-Acoustic Music Annotation for Polyphonic Electro-Acoustic Music Annotation Sebastien Gulluni 2, Slim Essid 2, Olivier Buisson, and Gaël Richard 2 Institut National de l Audiovisuel, 4 avenue de l Europe 94366 Bry-sur-marne Cedex,

More information

A CHROMA-BASED SALIENCE FUNCTION FOR MELODY AND BASS LINE ESTIMATION FROM MUSIC AUDIO SIGNALS

A CHROMA-BASED SALIENCE FUNCTION FOR MELODY AND BASS LINE ESTIMATION FROM MUSIC AUDIO SIGNALS A CHROMA-BASED SALIENCE FUNCTION FOR MELODY AND BASS LINE ESTIMATION FROM MUSIC AUDIO SIGNALS Justin Salamon Music Technology Group Universitat Pompeu Fabra, Barcelona, Spain justin.salamon@upf.edu Emilia

More information

A Bayesian Network for Real-Time Musical Accompaniment

A Bayesian Network for Real-Time Musical Accompaniment A Bayesian Network for Real-Time Musical Accompaniment Christopher Raphael Department of Mathematics and Statistics, University of Massachusetts at Amherst, Amherst, MA 01003-4515, raphael~math.umass.edu

More information

Analysis, Synthesis, and Perception of Musical Sounds

Analysis, Synthesis, and Perception of Musical Sounds Analysis, Synthesis, and Perception of Musical Sounds The Sound of Music James W. Beauchamp Editor University of Illinois at Urbana, USA 4y Springer Contents Preface Acknowledgments vii xv 1. Analysis

More information

Music Representations

Music Representations Lecture Music Processing Music Representations Meinard Müller International Audio Laboratories Erlangen meinard.mueller@audiolabs-erlangen.de Book: Fundamentals of Music Processing Meinard Müller Fundamentals

More information

Classification-based melody transcription

Classification-based melody transcription DOI 10.1007/s10994-006-8373-9 Classification-based melody transcription Daniel P.W. Ellis Graham E. Poliner Received: 24 September 2005 / Revised: 16 February 2006 / Accepted: 20 March 2006 / Published

More information

Subjective evaluation of common singing skills using the rank ordering method

Subjective evaluation of common singing skills using the rank ordering method lma Mater Studiorum University of ologna, ugust 22-26 2006 Subjective evaluation of common singing skills using the rank ordering method Tomoyasu Nakano Graduate School of Library, Information and Media

More information

MUSICAL NOTE AND INSTRUMENT CLASSIFICATION WITH LIKELIHOOD-FREQUENCY-TIME ANALYSIS AND SUPPORT VECTOR MACHINES

MUSICAL NOTE AND INSTRUMENT CLASSIFICATION WITH LIKELIHOOD-FREQUENCY-TIME ANALYSIS AND SUPPORT VECTOR MACHINES MUSICAL NOTE AND INSTRUMENT CLASSIFICATION WITH LIKELIHOOD-FREQUENCY-TIME ANALYSIS AND SUPPORT VECTOR MACHINES Mehmet Erdal Özbek 1, Claude Delpha 2, and Pierre Duhamel 2 1 Dept. of Electrical and Electronics

More information

SINGING VOICE MELODY TRANSCRIPTION USING DEEP NEURAL NETWORKS

SINGING VOICE MELODY TRANSCRIPTION USING DEEP NEURAL NETWORKS SINGING VOICE MELODY TRANSCRIPTION USING DEEP NEURAL NETWORKS François Rigaud and Mathieu Radenen Audionamix R&D 7 quai de Valmy, 7 Paris, France .@audionamix.com ABSTRACT This paper

More information

AN ON-THE-FLY MANDARIN SINGING VOICE SYNTHESIS SYSTEM

AN ON-THE-FLY MANDARIN SINGING VOICE SYNTHESIS SYSTEM AN ON-THE-FLY MANDARIN SINGING VOICE SYNTHESIS SYSTEM Cheng-Yuan Lin*, J.-S. Roger Jang*, and Shaw-Hwa Hwang** *Dept. of Computer Science, National Tsing Hua University, Taiwan **Dept. of Electrical Engineering,

More information