MELODY EXTRACTION FROM POLYPHONIC AUDIO BASED ON PARTICLE FILTER

11th International Society for Music Information Retrieval Conference (ISMIR 2010)

Seokhwan Jo, Chang D. Yoo
Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, 373-1 Guseong-dong, Yuseong-gu, Daejeon, 305-701, Korea
aniland00@kaist.ac.kr, cdyoo@ee.kaist.ac.kr

ABSTRACT

This paper considers a particle filter based algorithm to extract melody from polyphonic audio in the short-time Fourier transform (STFT) domain. The extraction focuses on overcoming the difficulties due to harmonic and percussive sound interference, possible octave mismatch, and dynamic variation in melody. The main idea of the algorithm is to consider probabilistic relations between melody and polyphonic audio. Melody is assumed to follow a Markov process, and the framed segments of polyphonic audio are assumed to be conditionally independent given the parameters that represent the melody. The melody parameters are estimated using sequential importance sampling (SIS), which is a conventional particle filter method. In this paper, the likelihood and state transition are defined to overcome the aforementioned difficulties. The SIS algorithm relies on a sequential importance density, and this density is designed using multiple pitches which are estimated by a simple multi-pitch extraction algorithm. Experimental results show that the considered algorithm outperforms other well-known melody extraction algorithms in terms of the raw pitch accuracy (RPA) and the raw chroma accuracy (RCA).

1. INTRODUCTION

It is widely believed that listeners recognize music as a sequence of monophonic notes called the melody, and for this reason melody extraction plays an important role in music content processing, which has recently become an important research area. Although the debate over the definition of melody is ongoing [1–3], many experts concur that melody should be the dominant pitch sequence of a polyphonic audio signal.
In this paper, melody is defined to be the singing voice pitch sequence in the vocal part and the pitch sequence of the solo instrument in the non-vocal part or in non-vocal music. When music contains a singing voice, most people recognize the music by the vocal melody line in the vocal part. However, in non-vocal parts such as an intermezzo and in non-vocal music such as jazz and orchestra, most people recognize music by the melody line of the solo instrument.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. © 2010 International Society for Music Information Retrieval.

Many melody extraction algorithms have been proposed over the last decade [1–6], albeit with limited success. Melody extraction from polyphonic audio is still difficult for the following reasons:

1. Harmonic interference: Harmonics of other instrument signals interfere with the estimation of the melody pitch harmonics.
2. Percussive sound interference: Percussive sounds interfere with the estimation of the melody pitch because their energy forms a vertical ridge with a strong, wideband spectral envelope.
3. Octave mismatch: The estimated pitch can be one octave higher or lower than the ground truth.
4. Dynamic variation in melody: Accurate pitch estimation in the beginning, end, and sudden transient regions of a melody is difficult.

In this paper, the melody pitch frequency and harmonic amplitudes that represent the melody are estimated in the short-time Fourier transform (STFT) domain. The main idea of the algorithm is to consider probabilistic relations between melody and polyphonic audio. The melody pitch frequency and harmonic amplitudes are assumed to follow Markov processes, and the framed segments of polyphonic audio are assumed to be conditionally independent given the melody pitch frequency and harmonic amplitudes.
Thus, the melody pitch frequency and harmonic amplitudes can be estimated from the polyphonic audio based on the Bayesian sequential model once the likelihood and state transition are defined. The likelihood is defined to be robust to harmonic and percussive sound interference. The state transition of the melody pitch frequency is adjusted by a control parameter that provides robustness to octave mismatch and dynamic variation in the melody. The sequential importance sampling (SIS) algorithm, a conventional particle filter algorithm, is used to estimate the melody parameters. The SIS algorithm relies on a so-called sequential importance density, and this density is designed using multiple pitches which are estimated by a simple multi-pitch extraction algorithm.

This paper is organized as follows. Section 2 presents the melody extraction from polyphonic audio based on the particle filter. Section 3 provides experimental results. Finally, Section 4 concludes this paper.

2. MELODY EXTRACTION FROM POLYPHONIC AUDIO BASED ON PARTICLE FILTER

2.1 Melody extraction from polyphonic audio

The melody pitch harmonics $x_t[n]$ in the $t$-th frame are defined as follows:

$$x_t[n] = w[n] \sum_{m=1}^{H} A_{m,t} \cos(m\,\omega_{0,t}\,n + \phi_{m,t}), \tag{1}$$

where $A_{m,t}$, $\omega_{0,t}$, $\phi_{m,t}$, $H$, and $w[n]$ are the amplitude of the $m$-th harmonic in the $t$-th frame, the melody pitch frequency in the $t$-th frame, the phase of the $m$-th harmonic in the $t$-th frame, the number of melody pitch harmonics, and the analysis window function, respectively. The polyphonic audio can be expressed as

$$z_t[n] = x_t[n] + y_t[n], \tag{2}$$

where $z_t[n]$ and $y_t[n]$ are the polyphonic audio signal and the signal of the other instruments in the $t$-th frame, respectively. In the frequency domain, the following relationship holds:

$$\mathbf{z}_t = \mathbf{x}_t + \mathbf{y}_t, \tag{3}$$

where $\mathbf{z}_t$, $\mathbf{x}_t$, and $\mathbf{y}_t$ are the $N$-point discrete Fourier transforms (DFT) of $z_t[n]$, $x_t[n]$, and $y_t[n]$, respectively.

The parameters of the melody pitch harmonics, namely the melody pitch frequency and the harmonic amplitudes, must be estimated for melody extraction. This paper assumes that the phase of the melody pitch harmonics is the same as the phase of the polyphonic audio; that is, the phase of the melody pitch is not estimated, since the human ear is assumed to be insensitive to phase variations. Thus, the parameter set of the $t$-th frame is defined as

$$\Theta_t = (\omega_{0,t}, \mathbf{A}_t), \tag{4}$$

where $\mathbf{A}_t = [A_{1,t}, A_{2,t}, \ldots, A_{H,t}]$. The objective of melody extraction is to estimate $\Theta_t$ from the given $\mathbf{z}_t$. It is usually observed that successive parameters $\omega_{0,t}$ and $\mathbf{A}_t$ are highly correlated. In this paper, it is assumed that $\Theta_t$ is a Markov process and that $\mathbf{z}_t$ at each frame is conditionally independent given $\Theta_t$. Here, $\Theta_t$ is considered latent while $\mathbf{z}_t$ is observed. From this perspective, the Bayesian sequential model for melody extraction can be constructed as shown in Figure 1.
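As an illustration of the signal model in Eqns. (1) and (2), the following minimal sketch synthesizes one frame of melody harmonics and adds a stand-in accompaniment signal. The function names (`hann`, `melody_frame`), the sampling rate, the pitch, the amplitudes, and the accompaniment signal are all assumptions chosen for the example; they are not taken from the paper.

```python
import math


def hann(length):
    """Hann analysis window w[n], n = 0..length-1 (requires length >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def melody_frame(omega0, amps, phases, window):
    """Eq. (1): x_t[n] = w[n] * sum_{m=1..H} A_m * cos(m * omega0 * n + phi_m)."""
    frame = []
    for n, w in enumerate(window):
        harmonics = sum(a * math.cos(m * omega0 * n + phi)
                        for m, (a, phi) in enumerate(zip(amps, phases), start=1))
        frame.append(w * harmonics)
    return frame


# Hypothetical settings, for illustration only.
fs = 8000.0                          # sampling rate in Hz (assumed)
omega0 = 2.0 * math.pi * 220.0 / fs  # melody pitch of 220 Hz in rad/sample
amps = [1.0, 0.5, 0.25]              # A_1..A_H with H = 3
phases = [0.0, 0.0, 0.0]             # phi_1..phi_H
window = hann(64)

x = melody_frame(omega0, amps, phases, window)                 # melody harmonics
y = [0.1 * math.sin(0.9 * n) for n in range(len(window))]      # stand-in for other instruments
z = [xi + yi for xi, yi in zip(x, y)]                          # Eq. (2): polyphonic frame
```

Because the DFT is linear, taking the DFT of `z` gives the sum of the DFTs of `x` and `y`, which is the relationship stated in Eqn. (3).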
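In Figure 1, $p(\mathbf{z}_t \mid \Theta_t)$, $p(\Theta_t \mid \Theta_{t-1})$, and $\rho_t$ are the likelihood, the state transition, and the control parameter that determines the state transition of the melody pitch frequency, respectively. From this Bayesian sequential model, the posterior probability $p(\Theta_{0:t} \mid \mathbf{z}_{1:t})$ is estimated, and it is used to estimate $\Theta_t$ for melody extraction. (The notation $a_{0:t}$ means $a_{0:t} = [a_0, a_1, \ldots, a_t]^T$.) To estimate $p(\Theta_{0:t} \mid \mathbf{z}_{1:t})$, the likelihood and the state evolution equations with the state transition need to be defined.

Figure 1. Bayesian sequential model for melody extraction. $\mathbf{z}_t$, $\Theta_t$, and $\rho_t$ are the polyphonic audio, the melody parameters ($\omega_{0,t}$ and $\mathbf{A}_t$), and the control parameter, respectively.

To obtain the likelihood, it is assumed that the DFT coefficients of $\mathbf{y}_t$ follow a zero-mean complex multivariate Gaussian distribution, which is given by

$$\mathbf{y}_t \sim \mathcal{N}(\mathbf{0}, \Sigma_t), \qquad \Sigma_t = \mathrm{diag}(\sigma^2_{t,1}, \sigma^2_{t,2}, \ldots, \sigma^2_{t,N}), \tag{5}$$

where $\Sigma_t$ and $\sigma^2_{t,k}$ are the covariance matrix in the $t$-th frame and the variance of the $k$-th bin in the $t$-th frame, respectively. Eqn. (5) yields the likelihood as follows:

$$p(\mathbf{z}_t \mid \Theta_t) = \mathcal{N}(\mathbf{z}_t; \mathbf{x}_t, \Sigma_t) \propto \exp\left\{ -(\mathbf{z}_t - \mathbf{x}_t)^H \Sigma_t^{-1} (\mathbf{z}_t - \mathbf{x}_t) \right\}, \tag{6}$$

where $(\cdot)^H$ is the Hermitian operator. To define $p(\mathbf{z}_t \mid \Theta_t)$, $\sigma^2_{t,k}$ must be estimated. In this paper, $\sigma^2_{t,k}$ is estimated using the decision-directed method [7] as follows:

$$\sigma^2_{t,k} = \alpha\,\sigma^2_{t-1,k} + (1 - \alpha)\,|Y_{t,k}|^2, \tag{7}$$

where $\alpha$ and $Y_{t,k}$ are a smoothing factor and the $k$-th bin DFT coefficient of $\mathbf{y}_t$, respectively. However, Eqn. (7) cannot be used directly since $Y_{t,k}$ is unknown. It is assumed that $Y_{t,k}$ is highly correlated with $Y_{t-1,k}$. Therefore, the estimation is modified as follows:

$$\sigma^2_{t,k} = \alpha\,\sigma^2_{t-1,k} + (1 - \alpha)\,|\hat{Y}_{t-1,k}|^2. \tag{8}$$

Accurate estimation of $\Sigma_t$ leads to robustness to harmonic and percussive sound interference. Figure 2 shows an example of $\mathbf{z}_t$ and an estimate of $\Sigma_t$, and it is easily shown that the likelihood in Eqn. (6) is maximized at the true $\Theta_t$.

The state evolution equations, which describe the relationships of the parameters between frames, are set as follows:

$$A_{m,t} = A_{m,t-1} + v_{A,t-1}, \tag{9}$$

$$\omega_{0,t} = \omega_{0,t-1} + v_{\omega_0,t-1}, \tag{10}$$

where $v_{A,t-1}$ and $v_{\omega_0,t-1}$ are the random perturbations corresponding to the harmonic amplitudes and the melody pitch frequency of the $(t-1)$-th frame, respectively.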
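The likelihood of Eqn. (6) and the variance recursion of Eqn. (8) can be sketched as below. This is a minimal illustration under stated assumptions: the covariance is diagonal as in Eqn. (5), the recursion is read as using the previous frame's variance, and the residual passed to `update_variance` is a hypothetical stand-in for the estimate of the other-instrument spectrum in the previous frame (the paper does not spell out how that estimate is formed). The function names are illustrative.

```python
def log_likelihood(z, x, var):
    """Unnormalized log of Eq. (6) for a diagonal covariance:
    -sum_k |z_k - x_k|^2 / var_k, with z and x given as complex DFT bins."""
    return -sum(abs(zk - xk) ** 2 / vk for zk, xk, vk in zip(z, x, var))


def update_variance(prev_var, prev_residual, alpha=0.98):
    """Eq. (8), assumed reading: var_t[k] = alpha * var_{t-1}[k]
    + (1 - alpha) * |Y_hat_{t-1}[k]|^2, where prev_residual holds Y_hat_{t-1}."""
    return [alpha * v + (1.0 - alpha) * abs(r) ** 2
            for v, r in zip(prev_var, prev_residual)]


# Toy two-bin example (values are made up for illustration).
z = [1 + 1j, 2 + 0j]          # observed polyphonic spectrum
x_good = [1 + 1j, 2 + 0j]     # melody hypothesis that explains z exactly
x_bad = [1 + 0j, 0 + 0j]      # melody hypothesis that leaves a residual
var = [1.0, 4.0]              # per-bin interference variances

print(log_likelihood(z, x_good, var))  # largest possible value, 0
print(log_likelihood(z, x_bad, var))   # smaller: residual is penalized per bin
```

Bins with a large estimated variance, such as those dominated by a percussive burst, contribute less to the sum, which is the mechanism by which an accurate variance estimate makes the likelihood tolerant of interference.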
This type of state evolution equation is called a random walk: the current state is a random perturbation of the previous state. It is important to define $p(v_{A,t-1})$ and $p(v_{\omega_0,t-1})$ accurately. In this paper, $p(v_{A,t-1})$ is assumed to be a truncated Gaussian as shown in Figure 3, since $A_{m,t} > 0$, and $p(v_{\omega_0,t-1})$ is assumed to be a Gaussian whose variance is controlled by $\rho_t$.

Figure 2. Example of polyphonic audio ($\mathbf{z}_t$) and the estimated variances ($\Sigma_t$) of the other instrument signal. (Axes: magnitude versus frequency bin; curves: polyphonic audio and square root of variances.)

Figure 3. State transition in harmonic amplitudes: (a) $p(v_{A,t-1})$, (b) $p(A_{m,t} \mid A_{m,t-1})$.

The melody line is characterized by prolonged periods of smoothness, with infrequent sharp changes at note transitions or during vibrato regions. Furthermore, there are two general rules concerning the melody line: 1) the vibrato exhibits an extent of 60–200 cents for singing voice and only 20–30 cents for other instruments [8], and 2) the transitions are typically limited to one octave [1]. Therefore, the assumption that $v_{\omega_0,t-1}$ follows a Gaussian distribution with a fixed variance is not appropriate. In this paper, the state transition of the melody pitch frequency from the $(t-1)$-th state to the $t$-th state is controlled by $\rho_t$, which indicates the degree to which the melody line is in transition. Here, transition includes vibrato. $\rho_t$ is defined as

$$\rho_t = |\omega_{0,t-1} - \omega_{0,t-2}|, \tag{11}$$

and $p(v_{\omega_0,t-1})$ is given by

$$p(v_{\omega_0,t-1}) = \begin{cases} \mathcal{N}(0,\ 20\ \text{cent}) & \rho_t < 50\ \text{cent} \\ \mathcal{N}(0,\ 50\ \text{cent}) & 50\ \text{cent} \le \rho_t < 100\ \text{cent} \\ \mathcal{N}(0,\ 100\ \text{cent}) & 100\ \text{cent} \le \rho_t \end{cases} \tag{12}$$

When $\rho_t$ is small, the current melody pitch frequency represents a certain note frequency and has a value similar to the previous melody pitch frequency. When $\rho_t$ is large, the current melody pitch frequency is, with high probability, in a note transition or vibrato region and has a value dissimilar to the previous melody pitch frequency. The state transition of the melody pitch frequency defined by Eqn. (12) can lead to robustness to octave mismatch and dynamic variation in melody. The cent is a unit of logarithmic frequency, and it is defined as $f_{\text{cent}} = 6900 + 1200 \log_2 (f_{\text{Hz}} / 440)$.

2.2 Melody extraction based on particle filter

In this paper, $p(\Theta_{0:t} \mid \mathbf{z}_{1:t})$ is approximated using Monte Carlo integration, and $\Theta_t$ is estimated using the particle filter. The SIS algorithm, which is a common particle filter method [9, 10], is adopted to estimate the parameters of the melody. If the likelihood and the state transition follow Gaussian distributions, the problem can be solved by a Kalman filter. However, the state transition is not assumed to be Gaussian. The SIS algorithm is used to obtain $p(\Theta_{0:t} \mid \mathbf{z}_{1:t})$ based on the Bayesian sequential model shown in Figure 1. The posterior density $p(\Theta_{0:t} \mid \mathbf{z}_{1:t})$ can be approximated as follows:

$$p(\Theta_{0:t} \mid \mathbf{z}_{1:t}) \approx \sum_{i=1}^{N_p} w_t^{(i)}\, \delta(\Theta_{0:t} - \Theta_{0:t}^{(i)}), \tag{13}$$

where $\Theta_{0:t}^{(i)}$, $w_t^{(i)}$, and $N_p$ are the $i$-th particle of $\Theta_{0:t}$, the associated weight, and the number of particles, respectively. The weights are normalized such that $\sum_{i=1}^{N_p} w_t^{(i)} = 1$. The weights are chosen using the method of importance sampling. If the particles $\Theta_{0:t}^{(i)}$ are drawn from an importance density $q(\Theta_{0:t} \mid \mathbf{z}_{1:t})$, the weights in Eqn. (13) are defined as follows:

$$w_t^{(i)} \propto \frac{p(\Theta_{0:t}^{(i)} \mid \mathbf{z}_{1:t})}{q(\Theta_{0:t}^{(i)} \mid \mathbf{z}_{1:t})}. \tag{14}$$

If the importance density is chosen to factorize as follows:

$$q(\Theta_{0:t} \mid \mathbf{z}_{1:t}) = q(\Theta_t \mid \Theta_{0:t-1}, \mathbf{z}_{1:t})\, q(\Theta_{0:t-1} \mid \mathbf{z}_{1:t-1}), \tag{15}$$

then one can obtain particles $\Theta_{0:t}^{(i)} \sim q(\Theta_{0:t} \mid \mathbf{z}_{1:t})$ by augmenting each of the existing particles $\Theta_{0:t-1}^{(i)} \sim q(\Theta_{0:t-1} \mid \mathbf{z}_{1:t-1})$ with the new state $\Theta_t^{(i)} \sim q(\Theta_t \mid \Theta_{0:t-1}, \mathbf{z}_{1:t})$. The weight update equation can be derived as follows using Eqn. (14) and Eqn. (15):

$$w_t^{(i)} \propto w_{t-1}^{(i)}\, \frac{p(\mathbf{z}_t \mid \Theta_t^{(i)})\, p(\Theta_t^{(i)} \mid \Theta_{t-1}^{(i)})}{q(\Theta_t^{(i)} \mid \Theta_{t-1}^{(i)}, \mathbf{z}_t)}. \tag{16}$$

A common problem with the particle filter is the degeneracy phenomenon, where after a few iterations most particles have negligible weight [9, 10]. A suitable measure of degeneracy is the effective particle size, $N_{\mathrm{eff}}$, which is given by

$$N_{\mathrm{eff}} = \frac{1}{\sum_{i=1}^{N_p} \left(w_t^{(i)}\right)^2}. \tag{17}$$
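The pieces of this subsection that are purely arithmetic (the cent scale, the pitch transition controlled by the recent pitch change, the weight update of Eqn. (16), and the effective particle size of Eqn. (17)) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the second argument of each Gaussian in Eqn. (12) is treated here as a standard deviation in cents, pitches are handled directly in cents, and all function names are hypothetical.

```python
import math
import random


def hz_to_cent(f_hz):
    """Cent scale used in the paper: 6900 + 1200 * log2(f / 440)."""
    return 6900.0 + 1200.0 * math.log2(f_hz / 440.0)


def transition_spread(rho):
    """Eq. (12): spread of the pitch perturbation as a function of rho (cents).
    Assumption: the listed values are used as standard deviations."""
    if rho < 50.0:
        return 20.0
    if rho < 100.0:
        return 50.0
    return 100.0


def propagate_pitch(prev_cent, prev_prev_cent, rng):
    """Eqs. (10)-(11): random-walk step whose spread depends on the last pitch change."""
    rho = abs(prev_cent - prev_prev_cent)
    return prev_cent + rng.gauss(0.0, transition_spread(rho))


def update_weights(prev_weights, likelihoods, transitions, proposals):
    """Eq. (16) followed by normalization so that the weights sum to one."""
    raw = [w * l * p / q for w, l, p, q
           in zip(prev_weights, likelihoods, transitions, proposals)]
    total = sum(raw)
    return [r / total for r in raw]


def effective_size(weights):
    """Eq. (17): N_eff = 1 / sum_i w_i^2 for normalized weights."""
    return 1.0 / sum(w * w for w in weights)


rng = random.Random(0)
new_pitch = propagate_pitch(hz_to_cent(440.0), hz_to_cent(440.0), rng)
weights = update_weights([0.5, 0.5], [1.0, 3.0], [1.0, 1.0], [1.0, 1.0])
print(weights, effective_size(weights))
```

Uniform weights give an effective size equal to the number of particles, while a single dominant particle drives it toward one, which is why a small value signals degeneracy. A practical implementation would usually carry the weights in the log domain to avoid underflow; that detail is omitted here for brevity.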

In this paper, to avoid the degeneracy problem, a resampling algorithm is used when $N_{\mathrm{eff}}$ falls below a threshold relative to $N_p$. Finally, the parameters are estimated by the posterior mean after obtaining $p(\Theta_{0:t} \mid \mathbf{z}_{1:t})$:

$$\hat{\omega}_{0,0:t} = \sum_{i=1}^{N_p} w_t^{(i)}\, \omega_{0,0:t}^{(i)}, \tag{18}$$

$$\hat{\mathbf{A}}_{0:t} = \sum_{i=1}^{N_p} w_t^{(i)}\, \mathbf{A}_{0:t}^{(i)}. \tag{19}$$

2.2.1 Design of sequential importance density

The performance of the SIS algorithm depends on the choice of $q(\Theta_t \mid \Theta_{t-1}^{(i)}, \mathbf{z}_t)$. Setting $q(\Theta_t \mid \Theta_{t-1}^{(i)}, \mathbf{z}_t) = p(\Theta_t \mid \Theta_{t-1}^{(i)})$ leads not only to an unnecessarily large number of particles but also to difficulties in estimating $p(\Theta_{0:t} \mid \mathbf{z}_{1:t})$. In this paper, a multiple pitch estimation algorithm is used to define $q(\Theta_t \mid \Theta_{t-1}^{(i)}, \mathbf{z}_t)$, since the melody pitch frequency is assumed to be one of the pitch estimates given by the multiple pitch estimation. The main idea in defining $q(\Theta_t \mid \Theta_{t-1}^{(i)}, \mathbf{z}_t)$ is to generate particles of the melody parameters similar to the estimated multiple pitch parameters. To obtain the multiple pitch parameters, the multiple pitch estimation algorithm proposed in [11] is used.

Before drawing particles from the importance density, $q(\Theta_t \mid \Theta_{t-1}^{(i)}, \mathbf{z}_t)$ is factorized as follows:

$$q(\omega_{0,t}, \mathbf{A}_t \mid \omega_{0,t-1}^{(i)}, \mathbf{A}_{t-1}^{(i)}, \mathbf{z}_t) = q(\mathbf{A}_t \mid \omega_{0,t}^{(i)}, \mathbf{A}_{t-1}^{(i)}, \mathbf{z}_t)\, q(\omega_{0,t} \mid \omega_{0,t-1}^{(i)}, \mathbf{z}_t). \tag{20}$$

Here, $\omega_{0,t}$ and $\mathbf{A}_t$ are considered conditionally independent given $\omega_{0,t-1}$, $\mathbf{A}_{t-1}$, and $\mathbf{z}_t$. First, melody pitch particles are drawn as given by

$$\omega_{0,t}^{(i)} \sim q(\omega_{0,t} \mid \omega_{0,t-1}^{(i)}, \mathbf{z}_t), \tag{21}$$

where $q(\omega_{0,t} \mid \omega_{0,t-1}^{(i)}, \mathbf{z}_t)$ is shown in Figure 4. In defining $q(\omega_{0,t} \mid \omega_{0,t-1}^{(i)}, \mathbf{z}_t)$, the current melody pitch particles are drawn near the N-best pitch candidates obtained from the multiple-pitch estimation and near the melody pitch particles drawn in the previous frame.

Figure 4. Design of $q(\omega_{0,t} \mid \omega_{0,t-1}^{(i)}, \mathbf{z}_t)$. (Importance density versus pitch candidate in cents, with the N-best candidates and the previous particle marked.)

After drawing the melody pitch particles, the harmonic amplitude particles are drawn as given by

$$\mathbf{A}_t^{(i)} \sim q(\mathbf{A}_t \mid \omega_{0,t}^{(i)}, \mathbf{A}_{t-1}^{(i)}, \mathbf{z}_t) = \mathcal{N}\!\left( \frac{\mathbf{A}_{t-1}^{(i)} + \mathbf{A}_{z\omega_{0,t}}}{2},\ \left| \mathbf{A}_{t-1}^{(i)} - \mathbf{A}_{z\omega_{0,t}} \right| \right), \tag{22}$$

where $\mathbf{A}_{z\omega_{0,t}}$ denotes the harmonic amplitudes corresponding to the pitch candidate near $\omega_{0,t}^{(i)}$, with the constraint $\mathbf{A}_t^{(i)} > 0$. In defining $q(\mathbf{A}_t \mid \omega_{0,t}^{(i)}, \mathbf{A}_{t-1}^{(i)}, \mathbf{z}_t)$, the current harmonic amplitude particles are generated to be similar to both the previous harmonic amplitude particles and the harmonic amplitudes of the N-best pitch candidates. If $\mathbf{A}_{t-1}^{(i)}$ and $\mathbf{A}_{z\omega_{0,t}}$ are similar, then $|\mathbf{A}_{t-1}^{(i)} - \mathbf{A}_{z\omega_{0,t}}| \approx 0$; therefore, $\mathbf{A}_t^{(i)} \approx (\mathbf{A}_{t-1}^{(i)} + \mathbf{A}_{z\omega_{0,t}})/2$. If $\mathbf{A}_{t-1}^{(i)}$ and $\mathbf{A}_{z\omega_{0,t}}$ are not similar, then $|\mathbf{A}_{t-1}^{(i)} - \mathbf{A}_{z\omega_{0,t}}| \gg 0$; therefore, $\mathbf{A}_t^{(i)}$ is generated somewhat randomly. The outline of the considered algorithm is given below.

Outline of the considered algorithm: melody extraction based on the SIS

For i = 1, ..., $N_p$:
1. Generate the particles.
   Melody pitch particles: $\omega_{0,t}^{(i)} \sim q(\omega_{0,t} \mid \omega_{0,t-1}^{(i)}, \mathbf{z}_t)$
   Harmonic amplitude particles: $\mathbf{A}_t^{(i)} \sim q(\mathbf{A}_t \mid \omega_{0,t}^{(i)}, \mathbf{A}_{t-1}^{(i)}, \mathbf{z}_t)$
2. Update the weights: Eqn. (16).
Normalize the weights ($\sum_{i=1}^{N_p} w_t^{(i)} = 1$).
Resampling: The resampling algorithm is used when $N_{\mathrm{eff}}$ falls below a threshold relative to $N_p$.
Estimation: The melody pitch frequency in the $t$-th frame is estimated by Eqn. (18). The harmonic amplitudes of the melody pitch harmonics in the $t$-th frame are estimated by Eqn. (19).

3. EVALUATION

The considered algorithm was evaluated and compared to other melody extraction algorithms using the ISMIR 2004 Audio Description Contest (ADC04) database. The database contains 20 polyphonic musical audio pieces. All test data are single-channel PCM data with a 44.1 kHz sample rate and 16-bit quantization. Table 1 shows the data composition of the ADC04 set. The search range of the melody pitch frequency was between 80 Hz and 1280 Hz in the frequency domain (3950 cent and 8750 cent in the cent domain). The Hanning window was used with a 48 ms frame length and a 10 ms frame hop size. $\alpha = 0.98$ was used in Eqn. (8), and $N_p = 500$ was used in Eqn. (13). The estimated melody is considered correct when the absolute value of the difference between the ground-truth frequency and the estimated frequency is less than 50 cent (1/4 tone).

Table 1. Summary of the ADC04 data set. The number in parentheses is the number of corresponding pieces.

Melody instrument                 Style
Synthesized voice (4)             Pop
Saxophone (4)                     Jazz
MIDI instruments (4)              Folk (2), Pop (2)
Human voice (2 male, 2 female)    Classical opera
Male voice (4)                    Pop

The performance of the considered algorithm was evaluated in terms of raw pitch accuracy (RPA) and raw chroma accuracy (RCA). The RPA is defined as the proportion of frames in which the estimated melody pitch is within ±1/4 tone of the reference pitch. The RCA is defined in the same manner as the RPA; however, both the estimated and reference frequencies are mapped into a single octave in order to forgive octave transpositions.

The considered algorithm was compared to other well-known melody extraction algorithms, namely those proposed by Goto [2], Paiva et al. [3], Marolt [4], Ryynänen et al. [5], and Ellis et al. [6]. Their performances are based on the results of the Music Information Retrieval Evaluation eXchange (MIREX) [12]. Table 2 shows the evaluation results for all algorithms considered.

Table 2. Result comparison. The number in parentheses is the year when the algorithm was submitted to MIREX.

Algorithm               RPA             RCA
Goto [2]                65.8% (2005)    71.8% (2005)
Paiva et al. [3]        62.7% (2005)    66.7% (2005)
Marolt [4]              60.1% (2005)    67.1% (2005)
Ryynänen et al. [5]     68.6% (2005)    74.1% (2005)
Ellis et al. [6]        73.2% (2006)    76.4% (2006)
Considered algorithm    77.3%           83.8%

The considered algorithm outperformed the others in terms of both the RPA and the RCA. The difference between the RPA and the RCA is proportional to the octave mismatch error.
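To make the two evaluation measures concrete, the sketch below computes RPA and RCA from per-frame pitch tracks given in cents, using the 50-cent tolerance stated in this section. It is a simplified illustration: it assumes every frame is voiced in both tracks, and the function names and the toy pitch tracks are made up for the example.

```python
def raw_pitch_accuracy(est_cent, ref_cent, tol=50.0):
    """Fraction of frames whose estimate is within tol cents of the reference."""
    hits = sum(1 for e, r in zip(est_cent, ref_cent) if abs(e - r) < tol)
    return hits / len(ref_cent)


def raw_chroma_accuracy(est_cent, ref_cent, tol=50.0):
    """Same as RPA, but the difference is folded into one octave (1200 cents),
    so an estimate that is off by whole octaves still counts as correct."""
    def chroma_distance(e, r):
        d = abs(e - r) % 1200.0
        return min(d, 1200.0 - d)

    hits = sum(1 for e, r in zip(est_cent, ref_cent)
               if chroma_distance(e, r) < tol)
    return hits / len(ref_cent)


# Toy tracks: frame 2 is exactly one octave too high, frame 4 is 200 cents off.
ref = [6900.0, 6900.0, 7000.0, 7100.0]
est = [6910.0, 8100.0, 7000.0, 7300.0]

rpa = raw_pitch_accuracy(est, ref)
rca = raw_chroma_accuracy(est, ref)
print(rpa, rca)  # the gap between the two comes from the octave error in frame 2
```

In this toy case the octave error is forgiven by the chroma measure but not by the pitch measure, so the gap between the two values isolates the octave mismatches.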
Although the algorithm in this paper is designed to be robust against octave mismatch, the difference between the RPA and the RCA is 6.5%. The multiple pitch estimation algorithm proposed in [11] is quite simple and vulnerable to octave errors; that is, inaccuracy in the sequential importance density led to inaccurate melody pitch candidates.

4. CONCLUSION

A melody extraction algorithm for polyphonic audio based on the particle filter is considered in this paper. Most people recognize music not as the full set of note sequences but as a particular monophonic note sequence called the melody. However, melody extraction from polyphonic audio is difficult due to the following impediments: harmonic interference, percussive sound interference, octave mismatch, and dynamic variation in melody. The main idea of the algorithm is to consider probabilistic relations between melody and polyphonic audio. Melody is assumed to follow a Markov process, and the framed segments of polyphonic audio are assumed to be conditionally independent given the parameters that represent the melody. The parameters are estimated using the SIS algorithm. The likelihood and state transition required by the SIS algorithm are defined to be robust against the aforementioned impediments. The performance of the SIS algorithm depends on the sequential importance density, and this density is designed using multiple pitch estimates. Experimental results show that the considered algorithm outperformed the other well-known melody extraction algorithms.

5. ACKNOWLEDGEMENTS

This work was supported by the Ministry of Culture, Sports and Tourism (MCST) and Korea Culture Content Agency (KOCCA) in the Culture Technology (CT) Research and Development Program 2009.

6. REFERENCES

[1] G. E. Poliner, D. P. W. Ellis, and A. F. Ehmann: "Melody transcription from music audio: approach and evaluation," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 4, pp. 1247–1256, 2007.
[2] M. Goto: "A real-time music-scene-description system: predominant-F0 estimation for detecting melody and bass lines in real-world audio signals," Speech Communication, Vol. 43, No. 4, pp. 311–329, 2004.
[3] R. P. Paiva, T. Mendes, and A. Cardoso: "Melody detection in polyphonic musical signals: exploiting perceptual rules, note salience, and melodic smoothness," Computer Music Journal, Vol. 30, No. 4, pp. 80–98, 2006.
[4] M. Marolt: "On finding melodic lines in audio recordings," Proceedings of the 7th International Conference on Digital Audio Effects (DAFx 04), pp. 217–221, 2004.
[5] M. P. Ryynänen and A. P. Klapuri: "Note event modeling for audio melody extraction," MIREX 2005 Audio Melody Extraction Contest, 2005.
[6] D. P. W. Ellis and G. E. Poliner: "Classification-based melody transcription," Machine Learning, Vol. 65, pp. 439–456, 2006.

[7] Yariv Ephraim: "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 32, No. 6, pp. 1109–1121, 1984.
[8] R. Timmers and P. W. M. Desain: "Vibrato: the questions and answers from musicians and science," Proceedings of the International Conference on Music Perception and Cognition, 2000.
[9] A. Doucet, N. de Freitas, and N. J. Gordon: Sequential Monte Carlo Methods in Practice, Springer-Verlag, New York, 2001.
[10] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp: "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Transactions on Signal Processing, Vol. 50, No. 2, pp. 174–188, 2002.
[11] S. Joo, S. Jo, and C. D. Yoo: "Melody extraction from polyphonic audio signal MIREX 2009," MIREX 2009 Audio Melody Extraction Contest, 2009.
[12] J. S. Downie, K. West, A. Ehmann, and E. Vincent: "The 2005 music information retrieval evaluation exchange (MIREX 2005): preliminary overview," Proceedings of the Sixth International Conference on Music Information Retrieval, pp. 320–323, 2005.