SIMULATED FORMANT MODELING OF ACCOMPANIED SINGING SIGNALS FOR VOCAL MELODY EXTRACTION


Yu-Ren Chien,1,2 Hsin-Min Wang,2 Shyh-Kang Jeng1,3
1 Graduate Institute of Communication Engineering, National Taiwan University, Taiwan
2 Institute of Information Science, Academia Sinica, Taiwan
3 Department of Electrical Engineering, National Taiwan University, Taiwan
yrchien@ntu.edu.tw, whm@iis.sinica.edu.tw, skjeng@ew.ee.ntu.edu.tw

ABSTRACT

This paper deals with the task of extracting vocal melodies from accompanied singing recordings. The challenging aspect of this task is the tendency of instrumental sounds to interfere with the extraction of the desired vocal melodies, especially when the singing voice is not necessarily predominant among the other sound sources. Existing methods in the literature are either rule-based or statistical. It is difficult for rule-based methods to adequately take advantage of human voice characteristics, whereas statistical approaches typically require large-scale data collection and labeling efforts. In this work, the extraction is based on a model of the input signals that integrates acoustic-phonetic knowledge and real-world data under a probabilistic framework. The resulting vocal pitch estimator is simple, being determined by a small set of parameters and a small set of data. Tested on a publicly available dataset, the proposed method achieves a transcription accuracy of 76%.

1. INTRODUCTION

Music lovers have always been faced with a large collection of music recordings and concert performances to choose from. Whereas successful choices are possible with a small set of metadata, disappointment recurs because the metadata provides only limited information about the musical content. This has motivated researchers to work on systems that extract musically relevant features from audio recordings. One potential benefit of such processing is the possibility that machines will be able to make personalized music purchase decisions on behalf of humans.

In this paper, we focus on the extraction of vocal melodies from polyphonic audio signals. A melody is defined as a succession of pitches and durations; as one might expect, melodies represent one of the most significant features that listeners can identify in musical pieces. In various musical cultures, and in popular music in particular, predominant melodies are commonly carried by singing voices. In view of this, this work aims at analyzing a singing voice accompanied by musical instruments. Instrumental accompaniment is common in vocal music, where the main melodies are exclusively carried by a solo singing voice, with the musical instruments providing harmony. In brief, the goal of the analysis considered in this work is finding the fundamental frequency of the singing voice as a function of time.

Copyright: (c) 2012 Yu-Ren Chien et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

The specific task outlined above is challenging because melody extraction is prone to interference from the accompaniment unless a mechanism is in place for distinguishing the human voice from instrumental sound. [1], [2], and [3] determined the predominant pitch as the one that accounts for the most signal power among all the simultaneous pitches. The concept of pitch predominance is also present in [5] and [6], which defined predominance in terms of harmonicity. For these methods, the problem proves difficult whenever the signal is dominated by a harmonic musical instrument rather than by the singing voice. [7] and [8] realized a timbre recognition mechanism with classification techniques; on the other hand, pitch classification entails quantization of pitch, which in turn causes the loss of such musical information as vibrato, portamento, and non-standard tuning.

The singing voice is probably the oldest mechanism in human history for music performance. It shares considerable acoustic characteristics with speech, which have been formulated analytically in acoustic phonetics [9]. However, a typical acoustic-phonetic model involves some free parameters, i.e., the formant frequencies, which are highly variable across vowels and singers. In view of this, we take a probabilistic approach to vocal melody extraction, by which acoustic knowledge and real-world data can be integrated in a unified manner. With an accompanied singing signal observed, estimation of the vocal pitch is based on the pitch likelihood (the likelihood function of the pitch), which is in turn based on the voice likelihood (the likelihood function of the singing voice). By simulating the singing voice signal, the pitch likelihood can be approximated by an average of voice-likelihood values evaluated at a set of simulated voice examples. The simulation is realized by synthesizing voice signals of various timbres in advance, according to formant frequencies extracted from a wide variety of (possibly accompanied) singing recordings. Since formant frequencies represent spectrum envelopes of the human voice, their extraction does not require the sampled singing recordings to densely cover various pitch values, nor is it impaired by accompaniment of modest loudness in the recordings.

The proposed method offers several potential advantages over previous approaches to vocal melody extraction. First of all, imposing acoustic-phonetic constraints on the extraction enables the proposed method to distinguish the human voice from instrumental sound better than the predominant pitch estimators in [1-3, 5, 6]. Secondly, the acoustic-phonetic constraints save the proposed method from the large-scale data collection and labeling efforts that are common for purely data-driven systems [7]. Third, some systems [4, 10] depend on pitch instability to identify the vocal pitch; in contrast, without discriminating between stable and unstable pitches, the proposed method allows for such cases as an unstable instrumental pitch (e.g., violin) or a stably sung vocal pitch. Fourth, although our earlier approach in [15] was also based on acoustic-phonetic knowledge, it did not statistically model the joint distribution of formant frequencies, nor did it model the accompaniment signal at all. The signal model proposed here for accompanied singing promises to better represent vocal characteristics and to handle interference from the accompaniment. Lastly, we highlight the advantage of the proposed method over the method in [8]. These two methods are interestingly related to each other, both adopting spectrum envelope modeling and the Viterbi algorithm. In spectrum envelope modeling, [8] extracts linear-predictive and cepstral features from sinusoidally resynthesized vocal or instrumental sounds, whereas our approach models vocal spectrum envelopes by formant-synthesizing voice examples. The proposed signal model turns out 1) to be applicable to both vocal pitch estimation and voicing detection, and 2) not to depend on any sound samples of musical instruments.

2. OVERVIEW OF VOCAL PITCH ESTIMATION

To facilitate the estimation, we quantize the vocal pitch into a discrete variable with 88 possible values. The pitch at k quarter tones (k = 1, 2, ..., 88) is associated with a fundamental frequency of $440 \cdot 2^{(k-60)/24}$ hertz. Therefore, the 88 pitch values are quarter-tone-spaced samples of the fundamental frequency in the vocal range from 80 hertz to 1,000 hertz. The fact that we are estimating a discrete-valued signal (the vocal pitch sequence) from an observed signal (the accompanied singing) makes it possible to characterize the pair of signals with a hidden Markov model (HMM) and to find the best pitch sequence by the Viterbi algorithm [11]. Here, the accompanied singing signal is represented by a vector-valued observation sequence, which consists of 100 N-vector observations per second. Each N-vector observation is made up of N consecutive time samples of the signal. Accordingly, there are 88 states in the HMM. As one might expect, the HMM is defined by two probabilistic models: the observation model and the prior model. The observation model describes the probability distribution of an observation given a particular state, while the prior model comprises the state transition probability distribution and the initial state distribution.
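As a concrete check of this quantization (an illustration added here, not part of the original paper), the following Python sketch enumerates the 88 quarter-tone-spaced candidate frequencies and reproduces the stated 80-1,000 hertz vocal range.

    import numpy as np

    # Quarter-tone pitch grid assumed by the HMM: state k (k = 1, ..., 88) maps to a
    # fundamental frequency of 440 * 2**((k - 60) / 24) Hz.
    k = np.arange(1, 89)
    f0_grid = 440.0 * 2.0 ** ((k - 60) / 24.0)
    print(f0_grid[0], f0_grid[-1])  # approximately 80 Hz and 988 Hz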
3. OBSERVATION MODEL

Let the accompanied singing signal, the (unobserved) singing voice, and the vocal pitch at a particular time point be denoted by the random N-vector z, the random N-vector x, and the random variable w, respectively. Then, the likelihood function of w, i.e., the pitch likelihood, can be expanded as a marginalizing integral:

    p_{z|w}(z | w) = \int p_{z|w,x}(z | w, x) \, p_{x|w}(x | w) \, dx.    (1)

With the term p_{z|w,x}(z | w, x) taken as a function of x, this integral can be thought of as the expectation of p_{z|w,x}(z | w, x) and approximated by the corresponding sample mean:

    p_{z|w}(z | w) \approx \frac{1}{N_e} \sum_{i=1}^{N_e} p_{z|w,x}(z | w, x^{(i,w)}),    (2)

where N_e is the number of voice examples available for each of the 88 pitch values, and x^{(i,k)} denotes the i-th voice example for pitch k. Here, the voice examples {x^{(i,w)}}_{i=1}^{N_e} simulate the random experiment underlying the probability distribution described by the density p_{x|w}(. | w). Given the singing voice x, the vocal pitch w can be regarded as a constant, which is independent of any other random quantity; as a result, w can be dropped from the right-hand-side condition in (2):

    p_{z|w}(z | w) \approx \frac{1}{N_e} \sum_{i=1}^{N_e} p_{z|x}(z | x^{(i,w)}),    (3)

which is an average of the values of the voice likelihood p_{z|x}(z | .) evaluated at the voice examples {x^{(i,w)}}_{i=1}^{N_e}. The preparation of the voice examples is presented in Section 3.1; it is an offline procedure performed well in advance of melody extraction. After that, the evaluation of the likelihood of each voice example, i.e., p_{z|x}(z | x^{(i,k)}), is described in Section 3.2.
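To make the sample-mean approximation of (3) concrete, a minimal sketch is given below (illustrative only; the function and variable names are assumptions, and the voice-likelihood callable stands for the model developed in Section 3.2).

    import numpy as np

    def pitch_likelihood(z, voice_examples, voice_likelihood):
        """Sample-mean approximation of p(z | w), following (3).

        z                -- observed frame of accompanied singing (length-N vector)
        voice_examples   -- array of shape (N_e, N): voice examples x^(i,w) for pitch w
        voice_likelihood -- callable implementing p(z | x)
        """
        return np.mean([voice_likelihood(z, x) for x in voice_examples])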

3.1 Synthesizing Voice Examples

Comprehensive collection of real-world singing voice data is difficult, as a result of several facts about the singing voice. In the first place, most vocal performances are accompanied, which makes unaccompanied singing voice recordings extremely scarce. Although non-professional unaccompanied singing data can be collected with less difficulty, untrained singing is typically of less practical relevance than professional singing. Secondly, the pitches most often used in a song are confined 1) to the scale of its key and 2) to the registers of the singer; consequently, it would take a huge number of collected songs to cover the entire vocal pitch range densely. Finally, to provide timbral variety, the collection must include various singers and various voiced sounds (vowels, nasal consonants, etc.). To circumvent the difficulty in collecting singing voice data, we 1) collect accompanied singing data, 2) extract vocal spectrum envelopes from the data, and 3) synthesize voice examples of various pitches from the extracted envelopes. (These steps are described in Sections 6.1, 3.1.1, and 3.1.2, respectively.) Since vocal spectrum envelopes follow a well-defined formant structure, they can be extracted reliably in the presence of instrumental sounds, as long as the singing voice is sufficiently loud in comparison with the instruments. Moreover, by giving a pitch-independent description of timbre, the vocal spectrum envelopes eliminate the need to cover various pitches in data collection. In this way, sufficient data can be collected for the sole purpose of representing the timbral diversity in singing voice.

3.1.1 Extracting Vocal Spectrum Envelopes

A vocal spectrum envelope is an amplitude function of frequency that models the spectrum envelope of a particular voiced sound (a vowel, a nasal consonant, etc.). By giving partial amplitudes as its samples at partial frequencies, it provides a pitch-independent description of the specific timbre of the voiced sound. In our implementation, it is determined by seven parameters: the first five oral formant frequencies f_1, f_2, ..., f_5 (hertz), a nasal formant frequency f_p (hertz), and a nasal anti-formant frequency f_z (hertz) [12]. To be more specific, it is defined by (see [9])

    A(f_h) = 20 \log_{10} \left| U_R(f_h) \, K_R(f_h) \prod_{n \in I} H_n(2\pi f_h) \right|,    (4)

where A(.) is the amplitude function in dB, f_h denotes the frequency in hertz, and U_R(.) represents the (radiated) spectrum envelope of the glottal excitation [9]:

    U_R(f_h) = \frac{f_h / 100}{1 + (f_h / 100)^2}.    (5)

K_R(.) represents all formants of order six and above [9]; its contribution in dB, 20 \log_{10} K_R(f_h), is approximated by a correction term involving the constants 0.43 and 4,500 and applicable for f_h up to 5,000 hertz (6). Furthermore, I = {1, 2, 3, 4, 5, p, z}, and H_n(.) represents the frequency response of formant n [9]:

    H_n(\omega) = \frac{1}{\left(1 - \frac{j\omega}{\sigma_n + j\omega_n}\right)\left(1 - \frac{j\omega}{\sigma_n - j\omega_n}\right)}, \qquad n = 1, 2, 3, 4, 5, p,    (7)

    H_z(\omega) = \left(1 - \frac{j\omega}{\sigma_z + j\omega_z}\right)\left(1 - \frac{j\omega}{\sigma_z - j\omega_z}\right).    (8)

In (7), \omega_n is the frequency of formant n in rad/s, i.e., \omega_n = 2\pi f_n, and \sigma_n is half the bandwidth of formant n in rad/s, which can be approximated as a function of \omega_n by a polynomial regression model [13]. As an example, a vocal spectrum envelope is plotted in Figure 1, which was extracted by the following procedure from a recording of a performance by Dietrich Fischer-Dieskau. In a short-time spectrum (computed by the constant-Q transform [14]) of accompanied singing, the amplitudes at the partial frequencies of a (manually identified) vocal pitch constitute a noisy observation for estimating the underlying vocal spectrum envelope. As a consequence, the vocal spectrum envelope can be estimated by fitting its spectral samples to the observed amplitudes:

    \hat{v} = \arg\min_{v \in V} \sum_{l=1}^{40} \left( a_l^q - a - A(l f_0) \right)^2,    (9)

where a is an amplitude variable (in dB) that modifies the overall magnitude of the spectrum envelope,

    v = (a, f_1, f_2, f_3, f_4, f_5, f_p, f_z)^T,    (10)

V describes range constraints imposed on the formant frequencies (11), a_l^q denotes the amplitude (in dB) observed at the l-th partial, and f_0 denotes the vocal pitch in hertz. The constrained optimization problem in (9) is solved by the multistart coordinate-descent distance minimization procedure described in [15].

Figure 1. A vocal spectrum envelope with formant frequencies (in hertz) of f_1 = 270, f_2 = 1274, f_3 = 2630, f_4 = 2920, f_5 = 3270, and f_p = 920.
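The envelope of (4)-(8) can be evaluated numerically. The sketch below is illustrative rather than the authors' implementation: it realizes the glottal term (5) and the pole/zero resonators (7)-(8), while the higher-formant correction K_R and the regression-based bandwidth model of [13] are replaced by a fixed-bandwidth assumption.

    import numpy as np

    def formant_envelope_db(f_hz, f_oral, f_nasal_pole, f_nasal_zero, bw_hz=90.0):
        """Approximate vocal spectrum envelope (dB) at frequencies f_hz, after (4)-(8).

        f_oral is a list of the five oral formant frequencies (Hz); bw_hz is an
        assumed constant formant bandwidth standing in for the regression model of
        [13]; the higher-formant correction K_R of (6) is omitted in this sketch."""
        f = np.asarray(f_hz, dtype=float)
        jw = 2j * np.pi * f                                   # j * omega
        u_r = (f / 100.0) / (1.0 + (f / 100.0) ** 2)          # glottal excitation, (5)
        sigma = np.pi * bw_hz                                 # half bandwidth in rad/s
        h = np.ones_like(jw)
        for fn in list(f_oral) + [f_nasal_pole]:              # pole pairs, (7)
            wn = 2.0 * np.pi * fn
            h /= (1.0 - jw / (sigma + 1j * wn)) * (1.0 - jw / (sigma - 1j * wn))
        wz = 2.0 * np.pi * f_nasal_zero                       # zero pair, (8)
        h *= (1.0 - jw / (sigma + 1j * wz)) * (1.0 - jw / (sigma - 1j * wz))
        return 20.0 * np.log10(np.abs(u_r * h) + 1e-12)

    # Example call with the oral formants of Figure 1 (nasal anti-formant chosen arbitrarily):
    env = formant_envelope_db(131.0 * np.arange(1, 6), [270, 1274, 2630, 2920, 3270], 920, 1500)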

3.1.2 Synthesis From a Spectrum Envelope

Let A^{(i)}(.) denote the i-th vocal spectrum envelope extracted from accompanied singing data (i = 1, ..., N_e). To synthesize the i-th voice example for pitch k (i.e., x^{(i,k)}), we compute its partial amplitudes according to the envelope A^{(i)}(.):

    a_l^{(i)} = A^{(i)}\left( l \cdot 440 \cdot 2^{(k-60)/24} \right), \quad l = 1, ..., L, \qquad L = \left\lfloor \frac{5000}{440 \cdot 2^{(k-60)/24}} \right\rfloor,    (12)

where a_l^{(i)} denotes the amplitude (in dB) of the l-th partial. Then, the voice example can be synthesized as

    x_t^{(i,k)} = \sum_{l=1}^{L} 10^{a_l^{(i)}/20} \cos\left( 2\pi l \cdot 440 \cdot 2^{(k-60)/24} \, \frac{t}{f_s} \right), \quad t = 1, ..., N,    (13)

where f_s is the 11,025-hertz sampling rate (cf. Section 6.2).
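A direct reading of (12)-(13) yields a short synthesis routine. The sketch below is illustrative (not the authors' code); it assumes the 11,025 hertz sampling rate of Section 6.2, and envelope_db stands for a sampled envelope A^(i)(.) such as the one sketched after Section 3.1.1.

    import numpy as np

    def synthesize_voice_example(envelope_db, k, n_samples, fs=11025.0):
        """Synthesize x^(i,k) from a spectrum envelope, following (12)-(13).

        envelope_db -- callable mapping frequency (Hz) to amplitude (dB), i.e. A^(i)(.)
        k           -- quarter-tone pitch index, 1..88
        """
        f0 = 440.0 * 2.0 ** ((k - 60) / 24.0)       # fundamental frequency of pitch k
        num_partials = int(5000.0 // f0)            # L in (12)
        t = np.arange(1, n_samples + 1)
        x = np.zeros(n_samples)
        for l in range(1, num_partials + 1):
            amp = 10.0 ** (envelope_db(l * f0) / 20.0)
            x += amp * np.cos(2.0 * np.pi * l * f0 * t / fs)
        return x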

3.2 Likelihood of a Voice Example

To evaluate the likelihood of the voice example x^{(i,k)}, we take advantage of the fact that the accompanied singing signal is the sum of the singing voice signal and the accompaniment signal:

    z = x + y,    (14)

where x, y, and z are random N-vectors representing the singing voice, the accompaniment, and the accompanied singing, respectively. By taking (14) as a transformation of y into z, the likelihood can be evaluated as

    p_{z|x}(z | x^{(i,k)}) = p_{y|x}(z - x^{(i,k)} | x^{(i,k)}).    (15)

Approximate independence can be assumed between the vectors x and y, in that they represent separate sound sources with independent phases; hence, we have

    p_{z|x}(z | x^{(i,k)}) \approx p_y(z - x^{(i,k)}).    (16)

The dependence and trend among the time samples in y represent the specific timbre or polyphony of the accompaniment, of which, however, we have no knowledge at the time of melody extraction. In consequence, approximate i.i.d. is assumed among the time samples:

    p_{z|x}(z | x^{(i,k)}) \approx \prod_{t=1}^{N} p_y(z_t - x_t^{(i,k)}).    (17)

To determine the probability distribution of each time sample in y, we collected 256,000 time sample values by randomly sampling the accompaniment data in the MIR-1K dataset [16]. The histogram of these values, plotted in Figure 2, suggests that the probability distribution can be approximated by a zero-mean Gaussian distribution. The Q-Q plot of these values against the standard normal distribution, shown in Figure 3, confirms the approximation by presenting a curve that resembles a straight line. With this approximation, we have

    p_{z|x}(z | x^{(i,k)}) \propto \exp\left\{ -\frac{\| z - x^{(i,k)} \|^2}{2\sigma_y^2} \right\},    (18)

where \sigma_y denotes the standard deviation of the Gaussian distribution.

Figure 2. Histogram of sample values in accompaniment signals.

Figure 3. Q-Q plot of sample values in accompaniment signals against the standard normal distribution.

The voice examples {x^{(i,k)}}_{i=1}^{N_e} are intended to simulate the random experiment underlying the probability distribution described by the density p_{x|w}(. | k); even so, we cannot afford to synthesize a huge number of voice examples that collectively represent the diversity in such trivial signal specifications as various loudness levels and various sinusoidal phase angles. Therefore, as described in Section 3.1, the examples serve only to represent the timbral variety in singing voice; meanwhile, each voice example needs to be matched against the accompanied singing in a phase- and loudness-insensitive fashion. As a first step toward these insensitivities, we substitute a scaled frequency-domain total power for the N-sample signal energy in (18):

    p_{z|x}(z | x^{(i,k)}) \approx \exp\left\{ -c \sum_{f=1}^{192} \left| A_f^z e^{j\phi_f^z} - A_f^{(i,k)} e^{j\phi_f^{(i,k)}} \right|^2 \right\},    (19)

where c is a manually specified scaling constant, f is a frequency index into a constant-Q spectrum [14] with 192 quarter-tone-spaced bins, A_f^z and \phi_f^z denote the constant-Q magnitude and phase spectra of the accompanied singing signal, and A_f^{(i,k)} and \phi_f^{(i,k)} denote those of the voice example. Now, for the phase-insensitivity, we relax the phase of the voice example and maximize the likelihood with respect to the relaxed phase, thereby creating a modified voice example \tilde{x}^{(i,k)} with phase spectrum {\phi_f^z}_{f=1}^{192}:

    p_{z|x}(z | \tilde{x}^{(i,k)}) \approx \exp\left\{ -c \sum_{f=1}^{192} \left( A_f^z - A_f^{(i,k)} \right)^2 \right\},    (20)

    CQT\{\tilde{x}^{(i,k)}\} = \left\{ A_f^{(i,k)} e^{j\phi_f^z} \right\}_{f=1}^{192},    (21)

where CQT{.} denotes the constant-Q transform. Next, to achieve the insensitivity to loudness, we relax the loudness of the voice example and maximize the likelihood with respect to the relaxed loudness, thereby creating an amplified or attenuated voice example \hat{x}^{(i,k)}:

    p_{z|x}(z | \hat{x}^{(i,k)}) \approx \exp\left\{ -c \left[ \sum_{f=1}^{192} (A_f^z)^2 - \frac{\left( \sum_{f=1}^{192} A_f^z A_f^{(i,k)} \right)^2}{\sum_{f=1}^{192} (A_f^{(i,k)})^2} \right] \right\},    (22)

    \hat{x}^{(i,k)} = \frac{\sum_{f=1}^{192} A_f^z A_f^{(i,k)}}{\sum_{f=1}^{192} (A_f^{(i,k)})^2} \, \tilde{x}^{(i,k)},    (23)

which orthogonally projects the accompanied singing onto the subspace of amplified or attenuated versions of the voice example. In the end, to evaluate the likelihood of pitch w, we substitute the modified voice examples {\hat{x}^{(i,w)}}_{i=1}^{N_e} for the voice examples {x^{(i,w)}}_{i=1}^{N_e} in (3):

    p_{z|w}(z | w) \approx \frac{1}{N_e} \sum_{i=1}^{N_e} p_{z|x}(z | \hat{x}^{(i,w)}).    (24)
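The phase- and loudness-insensitive likelihood of (20)-(23) reduces to a few operations on constant-Q magnitude spectra. A minimal sketch (illustrative; the spectra and the constant c are assumed inputs):

    import numpy as np

    def matched_likelihood(A_z, A_ex, c):
        """Phase- and loudness-insensitive voice likelihood, following (22)-(23).

        A_z  -- constant-Q magnitude spectrum of the accompanied singing (192 bins)
        A_ex -- constant-Q magnitude spectrum of a voice example (192 bins)
        Returns the likelihood value of (22) and the projection gain of (23)."""
        gain = np.dot(A_z, A_ex) / np.dot(A_ex, A_ex)
        residual = np.dot(A_z, A_z) - np.dot(A_z, A_ex) ** 2 / np.dot(A_ex, A_ex)
        return np.exp(-c * residual), gain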
4. PRIOR MODEL

We use a Markov chain {w_m}_{m=1}^{M} to model the vocal pitch sequence, which consists of 100 pitch values per second:

    P(w_1, ..., w_M) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) \cdots P(w_M | w_1, ..., w_{M-1})
                     = P(w_1) \prod_{m=2}^{M} P(w_m | w_{m-1}),    (25)

where the random variable w_m \in {1, ..., 88} represents the m-th element in the vocal pitch sequence. The second equality in (25) results from the Markov property that, given the previous pitch w_{m-1}, the current pitch w_m is independent of all the earlier pitches w_{m-2}, w_{m-3}, ..., w_1. The initial state distribution is assumed to be uniform over all possible pitch values:

    P(w_1 = k) = \frac{1}{88}, \qquad k \in {1, ..., 88}.    (26)

The state transition probability distribution is also assumed to be uniform, but only over pitch values within 2 quarter tones of the previous pitch:

    P(w_m = k_2 | w_{m-1} = k_1) =
        1/3  if k_1 \in {1, 88} and |k_1 - k_2| \leq 2;
        1/4  if k_1 \in {2, 87} and |k_1 - k_2| \leq 2;
        1/5  if 3 \leq k_1 \leq 86 and |k_1 - k_2| \leq 2;
        0    if |k_1 - k_2| > 2.    (27)

In almost all cases, there are five pitch values around the previous pitch that are each assigned a nonzero probability (1/5) for the current pitch. The other cases are associated with only 3 or 4 pitch values. For example, when the previous pitch is 88, the only possible values for the current pitch are 86, 87, and 88.
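Together with the frame-wise pitch likelihoods of (24), the prior of (26)-(27) is all the Viterbi decoder needs. A minimal log-domain sketch (illustrative, not the authors' code):

    import numpy as np

    def transition_matrix(n_states=88, width=2):
        """Transition probabilities of (27): uniform over pitches within `width`
        quarter tones of the previous pitch, zero elsewhere."""
        A = np.zeros((n_states, n_states))
        for k1 in range(n_states):
            lo, hi = max(0, k1 - width), min(n_states - 1, k1 + width)
            A[k1, lo:hi + 1] = 1.0 / (hi - lo + 1)
        return A

    def viterbi(log_lik, A, log_init):
        """log_lik: (M, 88) array of frame-wise log pitch likelihoods from (24)."""
        M, S = log_lik.shape
        logA = np.log(A + 1e-300)
        delta = log_init + log_lik[0]
        back = np.zeros((M, S), dtype=int)
        for m in range(1, M):
            scores = delta[:, None] + logA          # scores[i, j]: move from state i to j
            back[m] = np.argmax(scores, axis=0)
            delta = scores[back[m], np.arange(S)] + log_lik[m]
        path = np.empty(M, dtype=int)
        path[-1] = np.argmax(delta)
        for m in range(M - 1, 0, -1):
            path[m - 1] = back[m, path[m]]
        return path + 1                             # 1-based pitch indices

With the uniform initial distribution of (26), log_init would simply be np.full(88, -np.log(88)).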

5. VOICING DETECTION

In addition to estimating the vocal pitch sequence from accompanied singing, vocal melody extraction must find the particular time points at which no singing voice is actually sounding. Such time points may be found during vocal rests, at plosives, etc. For each of these time points, the pitch estimate should be overridden by a state indicating the absence of singing voice. In other words, we need a mechanism for detecting the singing voice at each time point. To this end, we estimate the short-time spectra of the singing voice on the basis of the accompanied singing. Ideally, the estimation will give a zero spectrum for each time point that is not voiced. To estimate the spectrum at a particular time point, we use its minimum mean square error (MMSE) estimator:

    E[CQT\{x\} | z] = \int CQT\{x\} \, p_{x|z}(x | z) \, dx = \int CQT\{x\} \, \frac{p_{z|x}(z | x)}{p_z(z)} \, p_x(x) \, dx.    (28)

With the term CQT\{x\} \frac{p_{z|x}(z | x)}{p_z(z)} taken as a function of x, this integral can be thought of as the expectation of CQT\{x\} \frac{p_{z|x}(z | x)}{p_z(z)} and approximated by the corresponding sample mean:

    E[CQT\{x\} | z] \approx \frac{1}{88 N_e} \sum_{k=1}^{88} \sum_{i=1}^{N_e} CQT\{\hat{x}^{(i,k)}\} \, \frac{p_{z|x}(z | \hat{x}^{(i,k)})}{p_z(z)}.    (29)

The density p_z(z) can again be approximated in this fashion:

    p_z(z) = \int p_{z|x}(z | x) \, p_x(x) \, dx \approx \frac{1}{88 N_e} \sum_{k=1}^{88} \sum_{i=1}^{N_e} p_{z|x}(z | \hat{x}^{(i,k)}).    (30)

Since all the modified voice examples share the same phase spectrum, the magnitude of the spectrum estimate is evaluated as

    \left| E[CQT\{x\} | z] \right|_f \approx \frac{1}{88 N_e} \sum_{k=1}^{88} \sum_{i=1}^{N_e} \tilde{A}_f^{(i,k)} \, \frac{p_{z|x}(z | \hat{x}^{(i,k)})}{p_z(z)}, \qquad f = 1, ..., 192,    (31)

where \tilde{A}_f^{(i,k)} denotes the constant-Q magnitude spectrum of the modified voice example \hat{x}^{(i,k)}. Eventually, the loudness of the singing voice can be estimated by correcting the magnitude spectrum according to the trends in the 40-phon equal-loudness contour (ELC) [17], which quantifies the dependency of human loudness perception on frequency:

    \Lambda(z) = \sum_{f=1}^{192} \left( \left| E[CQT\{x\} | z] \right|_f \, 10^{(40 - \kappa_f)/20} \right)^2,    (32)

where \kappa_f denotes the 40-phon ELC, plotted in Figure 4. If, and only if, \Lambda(z) exceeds an empirically chosen threshold, the time point is deemed voiced.

Figure 4. The 40-phon equal-loudness contour.
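Given the per-example likelihood values and magnitude spectra, the voicing statistic of (29)-(32) is a weighted average followed by an equal-loudness correction. A sketch (illustrative; elc_40phon_db is an assumed lookup of the 40-phon contour at the 192 bin frequencies, and the decision threshold is a tuned constant not reproduced here):

    import numpy as np

    def voicing_statistic(likelihoods, mag_spectra, elc_40phon_db):
        """Loudness statistic Lambda(z) of (29)-(32).

        likelihoods   -- array of shape (88, N_e): p(z | x^(i,k)) for every modified example
        mag_spectra   -- array of shape (88, N_e, 192): constant-Q magnitudes of the examples
        elc_40phon_db -- length-192 array: 40-phon equal-loudness contour at the bin frequencies
        """
        p_z = likelihoods.mean()                                                   # (30)
        weights = likelihoods / p_z
        est_mag = np.einsum('ki,kif->f', weights, mag_spectra) / likelihoods.size  # (31)
        corrected = est_mag * 10.0 ** ((40.0 - elc_40phon_db) / 20.0)              # ELC correction
        return np.sum(corrected ** 2)                                              # (32)

    # A frame is deemed voiced when the statistic exceeds an empirically chosen threshold.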
6. EXPERIMENTS

In this section, to compare our method with some existing methods, we conduct vocal melody extraction experiments on a publicly available dataset. Since the synthesis of voice examples is based on a collection of accompanied singing data, we start by describing that collection.

6.1 Data Collection for Voice Example Synthesis

To synthesize voice examples, we extracted N_e = 84 vocal spectrum envelopes from 14 recordings of about 1 minute each. The 14 recordings represent 14 distinct types of singing voice, including 10 recordings of professional (accompanied) singing captured from YouTube and 4 recordings of non-professional (unaccompanied) singing adapted from clips in the MIR-1K dataset [16]. From each recording, 6 spectrum envelopes were extracted that represent 6 distinct types of voiced sound. The 14 types of singing voice are tenor (José Carreras), soprano (Kiri Te Kanawa), baritone (Dietrich Fischer-Dieskau), mezzo-soprano (Cecilia Bartoli), pop high male voice (Terry Lin), pop high female voice (Stella Chang), pop low male voice (Shieng Luo), pop low female voice (Inn-Jae Chen), pop nasal male voice (Wakin Chau), pop nasal female voice (Chiou-Feng Tsai), non-professional high male voice (Bobon), non-professional high female voice (Annar), non-professional low male voice (Davidson), and non-professional low female voice (Ani). The nasal singers are well known in Taiwan for nasalizing their vowels significantly. The 6 types of voiced sound are /i/, /E/, /A/, /O/, /u/, and a miscellaneous type defined by /@/, the syllabic /z/ or /ü/, or the syllabic nasals /m/, /n/, and /N/. Each sound in the miscellaneous type does not occur in all recordings: /@/ is absent in all 4 Taiwanese-language recordings, perhaps because it seldom occurs in the northern speech of the Taiwanese language; the syllabic nuclei /z/ and /ü/ are specific to languages such as Mandarin Chinese; and the nasal hummings, due to their low loudness, are rarely used in operatic singing. To extract the vocal spectrum envelopes, the first author subjectively selected 6 short-time spectra from each recording that exemplify the 6 sound types, respectively.

6.2 Dataset Description

The dataset adopted for performance evaluation is a subset of the one built for the Melody Extraction Contest in the ISMIR 2004 Audio Description Contest (ADC 2004). The whole ADC 2004 dataset consists of 20 audio recordings, each around 20 seconds in duration, among which eight recordings have instrumental melodies and the other twelve have vocal melodies. Since this work considers vocal melodies only, experiments are carried out exclusively on 9 of the 12 vocal recordings, including two pop song excerpts, three song excerpts with synthesized vocal, and four opera excerpts. The other three vocal excerpts are not included because one contains falsetto singing and the other two contain an ensemble of vocals. The dataset has been used in several Music Information Retrieval Evaluation eXchange (MIREX) contests since 2006; therefore, it affords extensive comparison among methods. Before melody extraction, each audio file in the dataset is resampled at 11,025 hertz and constant-Q transformed [14] (Q = 34) into a sequence of short-time spectra. Each resulting spectrum is a quarter-tone-spaced sampling of a continuous spectrum that is capable of resolving the interference between two half-tone-spaced sinusoids all the way up to 5,428.6 hertz.
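The analysis grid of Section 6.2 follows directly from quarter-tone spacing. The sketch below (an added illustration; the lowest bin frequency is inferred from the spacing rather than stated in the paper) lists the 192 bin frequencies and confirms that quarter-tone spacing implies a Q of about 34:

    import numpy as np

    bins_per_octave = 24                                   # quarter-tone spacing
    n_bins = 192
    f_top = 5428.6                                         # highest analyzed frequency (Hz)
    f_bins = f_top * 2.0 ** (-np.arange(n_bins)[::-1] / bins_per_octave)
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)       # about 34.1, matching Q = 34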

6.3 Performance Measures

In the experiments documented here, the tested system gives vocal melodies in the format of a voicing/pitch value for each frame (at the rate of 100 frames per second). If a frame is estimated to be voiced, the output specifies the pitch estimate for the frame; otherwise, the output specifies that the frame is estimated to be unvoiced. MIREX adopts several measures for evaluating the performance of a melody extraction system [18]. In the first place, to determine how well the system performs voicing detection, we use the voicing detection rate, the voicing false alarm rate, and the discriminability. The voicing detection rate is computed as the fraction of frames that are both labeled and estimated to be voiced, among all the frames that are labeled voiced. The voicing false alarm rate is computed as the fraction of frames that are estimated to be voiced but are actually not voiced, among all the frames that are not voiced according to the reference transcription. The discriminability combines the above two measures in such a way that it can be deemed independent of the value of any threshold involved in the voicing detection decision:

    d' = Q^{-1}(P_F) + Q^{-1}(1 - P_D),    (33)

where Q^{-1}(.) denotes the inverse of the Gaussian tail function, P_F denotes the false alarm rate, and P_D denotes the detection rate. Second, to determine how well the system performs pitch estimation, we use the raw pitch accuracy and the raw chroma accuracy. The raw pitch accuracy is computed as the fraction of frames that are labeled voiced and have pitch estimated within one quarter tone of the true pitch, among all the frames that are labeled voiced. To focus on pitch class estimation while ignoring octave errors, we also compute the raw chroma accuracy, which is computed in the same way as the raw pitch accuracy, except that the pitch is here measured in terms of chroma, or pitch class, a quantity derived from the pitch by wrapping it into one octave. Finally, the performance of voicing detection and pitch estimation can be measured jointly by the overall transcription accuracy, defined as the fraction of frames that receive the correct voicing classification and, if voiced, a pitch estimate within one quarter tone of the true pitch, among all the frames.
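These frame-wise measures can be computed directly from a reference transcription and the system output. A minimal sketch (illustrative; array names are assumptions), using SciPy's inverse survival function for Q^{-1}:

    import numpy as np
    from scipy.stats import norm

    def melody_metrics(ref_voiced, est_voiced, ref_pitch, est_pitch, tol_semitones=0.5):
        """Frame-wise voicing and pitch measures; pitches in semitones, voicing flags boolean."""
        p_d = np.mean(est_voiced[ref_voiced])                      # voicing detection rate
        p_f = np.mean(est_voiced[~ref_voiced])                     # voicing false alarm rate
        d_prime = norm.isf(p_f) + norm.isf(1.0 - p_d)              # discriminability, (33)
        pitch_ok = np.abs(est_pitch - ref_pitch) <= tol_semitones  # within one quarter tone
        raw_pitch = np.mean(pitch_ok[ref_voiced])                  # raw pitch accuracy
        overall = np.mean(np.where(ref_voiced, est_voiced & pitch_ok, ~est_voiced))
        return p_d, p_f, d_prime, raw_pitch, overall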
6.4 Results

The results are listed in Table 1. The overall transcription accuracies listed in the column titled "All" range from 61% to 99%, with an average of 76%. The minimum is found at the excerpt pop3. A significant error made in the analysis of this excerpt is depicted in Figure 5, which reveals that the system mistakenly selected the pitch (87 hertz) of a high-energy (instrumental) bass from 1.34 s to 1.77 s because of a tendency of the proposed signal model to assume a low-energy accompaniment. Still, the energy of frequency components away from the hypothesized partials is irrelevant to the timbre of the hypothesized voice. This suggests that further improvement to the accuracy may be made by leaving out the off-partial frequency components in the calculation of the voice likelihood. At the other end of the accuracies, we see that the maximum occurs at the excerpt daisy4, which might have been particularly easy for our approach because its melodic source is a synthesized vocal. The raw pitch accuracies in the column titled "Voiced" are highly correlated with the overall transcription accuracies, which suggests that further improvement to this system should be made in pitch estimation, not in voicing detection. The column titled "Chroma" contains raw chroma accuracies similar to the raw pitch accuracies, which suggests that octave errors were successfully avoided by the system.

Figure 5. Spectrogram of a segment of the test excerpt pop3, overlaid with the true melody in green and the melody estimate in red.

Table 1. Experimental results. (pop1 and pop2, which contain an ensemble of vocals, are not included here. daisy3 is excluded because it contains falsetto singing.)

Shown in Table 2 is a comparison of the proposed method with the MIREX 2011 submissions in terms of the overall transcription accuracy (OTA). Notably, if the proposed method had entered the evaluation in 2011, it would have ranked 5th out of a total of 11 submissions. Moreover, the accuracy of the proposed system is within 10% of the highest accuracy in the 2011 evaluation. Compared with the method we proposed in [15], which corresponds to Method 6 in Table 2, our current method turns out to give a slightly lower accuracy. This confirms the feasibility of adopting the new approach as the foundation for our future work on vocal melody extraction.

Table 2. Comparison with the MIREX 2011 Audio Melody Extraction results.

7. CONCLUSIONS

A novel approach to vocal melody extraction has been presented that integrates acoustic-phonetic knowledge and real-world data in estimating the vocal pitch sequence. The performance of the proposed method has been evaluated on a publicly available dataset and found to be comparable to state-of-the-art performance.

In the future, we expect that a minor modification to the proposed signal model will further improve the performance in vocal pitch estimation.

Acknowledgments

This work was supported in part by the Taiwan e-Learning and Digital Archives Program (TELDAP), sponsored by the National Science Council of Taiwan under Grant NSC H. Comments from the anonymous reviewers were valuable for enhancing the quality of this paper.

8. REFERENCES

[1] M. Goto and S. Hayamizu, A real-time music scene description system: Detecting melody and bass lines in audio signals, in Proc. IJCAI-CASA.
[2] R. P. Paiva, T. Mendes, and A. Cardoso, On the detection of melody notes in polyphonic audio, in Proc. ISMIR.
[3] S. Jo and C. D. Yoo, Melody extraction from polyphonic audio based on particle filter, in Proc. ISMIR.
[4] K. Dressler, An auditory streaming approach for melody extraction from polyphonic music, in Proc. ISMIR.
[5] M. Lagrange, L. Martins, J. Murdoch, and G. Tzanetakis, Normalized cuts for predominant melodic source separation, IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 2.
[6] J.-L. Durrieu, G. Richard, and B. David, Singer melody extraction in polyphonic signals using source separation methods, in Proc. ICASSP.
[7] D. P. W. Ellis and G. E. Poliner, Classification-based melody transcription, Mach. Learn., vol. 65, no. 2-3.
[8] H. Fujihara, T. Kitahara, M. Goto, K. Komatani, T. Ogata, and H. G. Okuno, F0 estimation method for singing voice in polyphonic audio signal based on statistical vocal model and Viterbi search, in Proc. ICASSP.
[9] G. Fant, Acoustic Theory of Speech Production with Calculations Based on X-Ray Studies of Russian Articulations. The Hague: Mouton.
[10] V. Rao and P. Rao, Vocal melody extraction in the presence of pitched accompaniment in polyphonic music, IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 8.
[11] L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, 1989.
[12] D. H. Klatt, Software for a cascade/parallel formant synthesizer, J. Acoust. Soc. Am., vol. 67, no. 3.
[13] J. W. Hawks and J. D. Miller, A formant bandwidth estimation procedure for vowel synthesis, J. Acoust. Soc. Am., vol. 97, no. 2.
[14] J. C. Brown and M. S. Puckette, An efficient algorithm for the calculation of a constant Q transform, J. Acoust. Soc. Am., vol. 92, no. 5.
[15] Y.-R. Chien, H.-M. Wang, and S.-K. Jeng, An acoustic-phonetic approach to vocal melody extraction, in Proc. ISMIR, 2011.
[16] C.-L. Hsu and J.-S. R. Jang, On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset, IEEE Trans. Audio, Speech, Lang. Process.
[17] ISO 226, Acoustics: Normal equal-loudness-level contours.
[18] G. E. Poliner, D. P. W. Ellis, A. F. Ehmann, E. Gómez, S. Streich, and B. Ong, Melody transcription from music audio: Approaches and evaluation, IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4.


More information

Robert Alexandru Dobre, Cristian Negrescu

Robert Alexandru Dobre, Cristian Negrescu ECAI 2016 - International Conference 8th Edition Electronics, Computers and Artificial Intelligence 30 June -02 July, 2016, Ploiesti, ROMÂNIA Automatic Music Transcription Software Based on Constant Q

More information

NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING

NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING Zhiyao Duan University of Rochester Dept. Electrical and Computer Engineering zhiyao.duan@rochester.edu David Temperley University of Rochester

More information

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University A Pseudo-Statistical Approach to Commercial Boundary Detection........ Prasanna V Rangarajan Dept of Electrical Engineering Columbia University pvr2001@columbia.edu 1. Introduction Searching and browsing

More information

EE391 Special Report (Spring 2005) Automatic Chord Recognition Using A Summary Autocorrelation Function

EE391 Special Report (Spring 2005) Automatic Chord Recognition Using A Summary Autocorrelation Function EE391 Special Report (Spring 25) Automatic Chord Recognition Using A Summary Autocorrelation Function Advisor: Professor Julius Smith Kyogu Lee Center for Computer Research in Music and Acoustics (CCRMA)

More information

Polyphonic Audio Matching for Score Following and Intelligent Audio Editors

Polyphonic Audio Matching for Score Following and Intelligent Audio Editors Polyphonic Audio Matching for Score Following and Intelligent Audio Editors Roger B. Dannenberg and Ning Hu School of Computer Science, Carnegie Mellon University email: dannenberg@cs.cmu.edu, ninghu@cs.cmu.edu,

More information

A probabilistic framework for audio-based tonal key and chord recognition

A probabilistic framework for audio-based tonal key and chord recognition A probabilistic framework for audio-based tonal key and chord recognition Benoit Catteau 1, Jean-Pierre Martens 1, and Marc Leman 2 1 ELIS - Electronics & Information Systems, Ghent University, Gent (Belgium)

More information

SINCE the lyrics of a song represent its theme and story, they

SINCE the lyrics of a song represent its theme and story, they 1252 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 6, OCTOBER 2011 LyricSynchronizer: Automatic Synchronization System Between Musical Audio Signals and Lyrics Hiromasa Fujihara, Masataka

More information

Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE, and Bryan Pardo, Member, IEEE

Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE, and Bryan Pardo, Member, IEEE IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 6, OCTOBER 2011 1205 Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE,

More information

Effects of acoustic degradations on cover song recognition

Effects of acoustic degradations on cover song recognition Signal Processing in Acoustics: Paper 68 Effects of acoustic degradations on cover song recognition Julien Osmalskyj (a), Jean-Jacques Embrechts (b) (a) University of Liège, Belgium, josmalsky@ulg.ac.be

More information

Classification-Based Melody Transcription

Classification-Based Melody Transcription Classification-Based Melody Transcription Daniel P.W. Ellis and Graham E. Poliner LabROSA, Dept. of Electrical Engineering Columbia University, New York NY 10027 USA {dpwe,graham}@ee.columbia.edu February

More information

Transient behaviour in the motion of the brass player s lips

Transient behaviour in the motion of the brass player s lips Transient behaviour in the motion o the brass player s lips John Chick, Seona Bromage, Murray Campbell The University o Edinburgh, The King s Buildings, Mayield Road, Edinburgh EH9 3JZ, UK, john.chick@ed.ac.uk

More information

SINGING VOICE ANALYSIS AND EDITING BASED ON MUTUALLY DEPENDENT F0 ESTIMATION AND SOURCE SEPARATION

SINGING VOICE ANALYSIS AND EDITING BASED ON MUTUALLY DEPENDENT F0 ESTIMATION AND SOURCE SEPARATION SINGING VOICE ANALYSIS AND EDITING BASED ON MUTUALLY DEPENDENT F0 ESTIMATION AND SOURCE SEPARATION Yukara Ikemiya Kazuyoshi Yoshii Katsutoshi Itoyama Graduate School of Informatics, Kyoto University, Japan

More information

1. Introduction NCMMSC2009

1. Introduction NCMMSC2009 NCMMSC9 Speech-to-Singing Synthesis System: Vocal Conversion from Speaking Voices to Singing Voices by Controlling Acoustic Features Unique to Singing Voices * Takeshi SAITOU 1, Masataka GOTO 1, Masashi

More information

Sparse Representation Classification-Based Automatic Chord Recognition For Noisy Music

Sparse Representation Classification-Based Automatic Chord Recognition For Noisy Music Journal of Information Hiding and Multimedia Signal Processing c 2018 ISSN 2073-4212 Ubiquitous International Volume 9, Number 2, March 2018 Sparse Representation Classification-Based Automatic Chord Recognition

More information

INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION

INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION ULAŞ BAĞCI AND ENGIN ERZIN arxiv:0907.3220v1 [cs.sd] 18 Jul 2009 ABSTRACT. Music genre classification is an essential tool for

More information

CSC475 Music Information Retrieval

CSC475 Music Information Retrieval CSC475 Music Information Retrieval Monophonic pitch extraction George Tzanetakis University of Victoria 2014 G. Tzanetakis 1 / 32 Table of Contents I 1 Motivation and Terminology 2 Psychacoustics 3 F0

More information

Interactive Classification of Sound Objects for Polyphonic Electro-Acoustic Music Annotation

Interactive Classification of Sound Objects for Polyphonic Electro-Acoustic Music Annotation for Polyphonic Electro-Acoustic Music Annotation Sebastien Gulluni 2, Slim Essid 2, Olivier Buisson, and Gaël Richard 2 Institut National de l Audiovisuel, 4 avenue de l Europe 94366 Bry-sur-marne Cedex,

More information

A CHROMA-BASED SALIENCE FUNCTION FOR MELODY AND BASS LINE ESTIMATION FROM MUSIC AUDIO SIGNALS

A CHROMA-BASED SALIENCE FUNCTION FOR MELODY AND BASS LINE ESTIMATION FROM MUSIC AUDIO SIGNALS A CHROMA-BASED SALIENCE FUNCTION FOR MELODY AND BASS LINE ESTIMATION FROM MUSIC AUDIO SIGNALS Justin Salamon Music Technology Group Universitat Pompeu Fabra, Barcelona, Spain justin.salamon@upf.edu Emilia

More information

A Bayesian Network for Real-Time Musical Accompaniment

A Bayesian Network for Real-Time Musical Accompaniment A Bayesian Network for Real-Time Musical Accompaniment Christopher Raphael Department of Mathematics and Statistics, University of Massachusetts at Amherst, Amherst, MA 01003-4515, raphael~math.umass.edu

More information

Analysis, Synthesis, and Perception of Musical Sounds

Analysis, Synthesis, and Perception of Musical Sounds Analysis, Synthesis, and Perception of Musical Sounds The Sound of Music James W. Beauchamp Editor University of Illinois at Urbana, USA 4y Springer Contents Preface Acknowledgments vii xv 1. Analysis

More information

Music Representations

Music Representations Lecture Music Processing Music Representations Meinard Müller International Audio Laboratories Erlangen meinard.mueller@audiolabs-erlangen.de Book: Fundamentals of Music Processing Meinard Müller Fundamentals

More information

Classification-based melody transcription

Classification-based melody transcription DOI 10.1007/s10994-006-8373-9 Classification-based melody transcription Daniel P.W. Ellis Graham E. Poliner Received: 24 September 2005 / Revised: 16 February 2006 / Accepted: 20 March 2006 / Published

More information

Subjective evaluation of common singing skills using the rank ordering method

Subjective evaluation of common singing skills using the rank ordering method lma Mater Studiorum University of ologna, ugust 22-26 2006 Subjective evaluation of common singing skills using the rank ordering method Tomoyasu Nakano Graduate School of Library, Information and Media

More information

MUSICAL NOTE AND INSTRUMENT CLASSIFICATION WITH LIKELIHOOD-FREQUENCY-TIME ANALYSIS AND SUPPORT VECTOR MACHINES

MUSICAL NOTE AND INSTRUMENT CLASSIFICATION WITH LIKELIHOOD-FREQUENCY-TIME ANALYSIS AND SUPPORT VECTOR MACHINES MUSICAL NOTE AND INSTRUMENT CLASSIFICATION WITH LIKELIHOOD-FREQUENCY-TIME ANALYSIS AND SUPPORT VECTOR MACHINES Mehmet Erdal Özbek 1, Claude Delpha 2, and Pierre Duhamel 2 1 Dept. of Electrical and Electronics

More information

SINGING VOICE MELODY TRANSCRIPTION USING DEEP NEURAL NETWORKS

SINGING VOICE MELODY TRANSCRIPTION USING DEEP NEURAL NETWORKS SINGING VOICE MELODY TRANSCRIPTION USING DEEP NEURAL NETWORKS François Rigaud and Mathieu Radenen Audionamix R&D 7 quai de Valmy, 7 Paris, France .@audionamix.com ABSTRACT This paper

More information

AN ON-THE-FLY MANDARIN SINGING VOICE SYNTHESIS SYSTEM

AN ON-THE-FLY MANDARIN SINGING VOICE SYNTHESIS SYSTEM AN ON-THE-FLY MANDARIN SINGING VOICE SYNTHESIS SYSTEM Cheng-Yuan Lin*, J.-S. Roger Jang*, and Shaw-Hwa Hwang** *Dept. of Computer Science, National Tsing Hua University, Taiwan **Dept. of Electrical Engineering,

More information