MELODY EXTRACTION FROM POLYPHONIC AUDIO BASED ON PARTICLE FILTER


11th International Society for Music Information Retrieval Conference (ISMIR 2010)

Seokhwan Jo, Chang D. Yoo
Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, 373-1 Guseong-dong, Yuseong-gu, Daejeon, 305-701, Korea
aniland00@kaist.ac.kr, cdyoo@ee.kaist.ac.kr

ABSTRACT

This paper considers a particle filter based algorithm to extract melody from polyphonic audio in the short-time Fourier transform (STFT) domain. The extraction focuses on overcoming the difficulties due to harmonic and percussive sound interference, the possibility of octave mismatch, and dynamic variation in melody. The main idea of the algorithm is to consider probabilistic relations between melody and polyphonic audio. Melody is assumed to follow a Markov process, and the framed segments of polyphonic audio are assumed to be conditionally independent given the parameters that represent the melody. The melody parameters are estimated using sequential importance sampling (SIS), which is a conventional particle filter method. In this paper, the likelihood and state transition are defined to overcome the aforementioned difficulties. The SIS algorithm relies on a sequential importance density, and this density is designed using multiple pitches which are estimated by a simple multi-pitch extraction algorithm. Experimental results show that the considered algorithm outperforms other well-known melody extraction algorithms in terms of the raw pitch accuracy (RPA) and the raw chroma accuracy (RCA).

1. INTRODUCTION

Many people believe that listeners recognize music as a sequence of monophonic notes called melody, and for this reason melody extraction plays an important role in music content processing, which has recently become an important research area. Although the debate over the definition of melody is ongoing [1-3], many experts concur that melody should be the dominant pitch sequence of a polyphonic audio.
In this paper, melody is defined to be the singing voice pitch sequence in the vocal part and the pitch sequence of the solo instrument in the non-vocal part or in non-vocal music. When a piece of music contains singing voice, most people recognize the music by the vocal melody line in the vocal part. However, in a non-vocal part such as an intermezzo, and in non-vocal music such as jazz and orchestra, most people recognize music by the melody line of the solo instrument.

Many melody extraction algorithms have been proposed over the last decade [1-6], albeit with limited success. Melody extraction from polyphonic audio is still difficult for the following reasons:

1. Harmonic interference: Harmonics of other instrument signals interfere with the estimation of the melody pitch harmonics.
2. Percussive sound interference: Percussive sounds interfere with the estimation of the melody pitch because their energy forms a vertical ridge with strong and wideband spectral envelopes.
3. Octave mismatch: The estimated pitch can be one octave higher or lower than the ground truth.
4. Dynamic variation in melody: Accurate pitch estimation in the beginning, end and sudden transient regions of a melody is difficult.

In this paper, the melody pitch frequency and harmonic amplitudes that represent the melody are estimated in the short-time Fourier transform (STFT) domain. The main idea of the algorithm is to consider probabilistic relations between melody and polyphonic audio. Melody pitch frequency and harmonic amplitudes are assumed to follow Markov processes, and the framed segments of polyphonic audio are assumed to be conditionally independent given melody pitch frequency and harmonic amplitudes.
Thus, melody pitch frequency and harmonic amplitudes can be estimated from the polyphonic audio based on the Bayesian sequential model once the likelihood and state transition are defined. The likelihood is defined to be robust to harmonic and percussive sound interference. The state transition of melody pitch frequency is adjusted by a control parameter that discourages octave mismatch and handles dynamic variation in the melody. The sequential importance sampling (SIS) algorithm, a conventional particle filter algorithm, is used to estimate the melody parameters. The SIS algorithm relies on a so-called sequential importance density, and this density is designed using multiple pitches which are estimated by a simple multi-pitch extraction algorithm.

This paper is organized as follows. Section 2 presents the melody extraction from polyphonic audio based on particle filter. Section 3 provides experimental results. Finally, Section 4 concludes this paper.

2. MELODY EXTRACTION FROM POLYPHONIC AUDIO BASED ON PARTICLE FILTER

2.1 Melody extraction from polyphonic audio

The melody pitch harmonics $x_t[n]$ in the $t$th frame are defined as follows:

$$x_t[n] = w[n] \sum_{m=1}^{H} A_{m,t} \cos(m\,\omega_{0,t}\,n + \phi_{m,t}), \tag{1}$$

where $A_{m,t}$, $\omega_{0,t}$, $\phi_{m,t}$, $H$ and $w[n]$ are the amplitude of the $m$th harmonic in the $t$th frame, the melody pitch frequency in the $t$th frame, the phase of the $m$th harmonic in the $t$th frame, the number of melody pitch harmonics, and the analysis window function, respectively. The polyphonic audio can be expressed as

$$z_t[n] = x_t[n] + y_t[n], \tag{2}$$

where $z_t[n]$ and $y_t[n]$ are the polyphonic audio signal and the signal of the other instruments in the $t$th frame, respectively. In the frequency domain, the following relationship holds:

$$z_t = x_t + y_t, \tag{3}$$

where $z_t$, $x_t$, and $y_t$ are the $N$-point discrete Fourier transforms (DFT) of $z_t[n]$, $x_t[n]$, and $y_t[n]$, respectively.

The parameters of the melody pitch harmonics, namely the melody pitch frequency and the harmonic amplitudes, must be estimated for the melody extraction. This paper assumes that the phase of the melody pitch harmonics is the same as the phase of the polyphonic audio; that is, the phase of the melody pitch is not estimated, since the human ear is assumed to be insensitive to phase variations. Thus, the $t$th frame parameter set is defined as

$$\Theta_t = (\omega_{0,t}, A_t), \tag{4}$$

where $A_t = [A_{1,t}, A_{2,t}, \ldots, A_{H,t}]$. The objective of melody extraction is to estimate $\Theta_t$ from the given $z_t$. It is usually observed that successive parameters $\omega_{0,t}$ and $A_t$ are highly correlated. In this paper, it is assumed that $\Theta_t$ is a Markov process and that $z_t$ at each frame is conditionally independent given $\Theta_t$. Here, $\Theta_t$ is considered latent while $z_t$ is observed. From this perspective, the Bayesian sequential model for melody extraction can be constructed as shown in Figure 1.
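To make the signal model of Eqn. (1) and the parameter set of Eqn. (4) concrete, the following minimal sketch synthesizes one windowed frame of melody harmonics. The function names `hann` and `melody_frame`, the 8 kHz sampling rate, the 220 Hz pitch and the three amplitudes are illustrative assumptions, not values from the paper.

```python
import math

def hann(n_samples):
    """Symmetric Hann window of length n_samples (stands in for w[n])."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def melody_frame(omega0, amplitudes, phases, window):
    """Evaluate Eqn. (1): x_t[n] = w[n] * sum_m A_m * cos(m*omega0*n + phi_m)."""
    frame = []
    for n, w_n in enumerate(window):
        # enumerate() starts at 0, so the harmonic index is m + 1.
        s = sum(a * math.cos((m + 1) * omega0 * n + phi)
                for m, (a, phi) in enumerate(zip(amplitudes, phases)))
        frame.append(w_n * s)
    return frame

# Hypothetical parameter set Theta_t = (omega0, A_t) with H = 3 harmonics.
fs = 8000.0                        # assumed sampling rate in Hz
f0 = 220.0                         # assumed melody pitch in Hz
omega0 = 2.0 * math.pi * f0 / fs   # pitch in radians per sample
theta_t = (omega0, [1.0, 0.5, 0.25])
x_t = melody_frame(theta_t[0], theta_t[1], [0.0, 0.0, 0.0], hann(384))
```

Here 384 samples correspond to a 48 ms frame at the assumed sampling rate. The phases are set to zero only for illustration; in the text they are taken from the polyphonic audio rather than estimated.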
In Figure 1, $p(z_t \mid \Theta_t)$, $p(\Theta_t \mid \Theta_{t-1})$, and $\rho_t$ are the likelihood, the state transition, and the control parameter that decides the state transition of the melody pitch frequency, respectively. From this Bayesian sequential model, the posterior probability $p(\Theta_{0:t} \mid z_{1:t})$ is estimated, and it is used to estimate $\Theta_t$ for melody extraction. (The notation $a_{0:t}$ means $a_{0:t} = [a_0, a_1, \ldots, a_t]^T$.) To estimate $p(\Theta_{0:t} \mid z_{1:t})$, the likelihood and the state evolution equations with state transition need to be defined.

Figure 1. Bayesian sequential model for melody extraction. $z_t$, $\Theta_t$, and $\rho_t$ are the polyphonic audio, the melody parameters ($\omega_{0,t}$ and $A_t$), and the control parameter, respectively.

To obtain the likelihood, it is assumed that the DFT coefficients of $y_t$ follow a zero-mean complex multivariate Gaussian distribution, which is given by

$$y_t \sim \mathcal{N}(0, \Sigma_t), \qquad \Sigma_t = \mathrm{diag}(\sigma^2_{t,1}, \sigma^2_{t,2}, \ldots, \sigma^2_{t,N}), \tag{5}$$

where $\Sigma_t$ and $\sigma^2_{t,k}$ are the covariance matrix in the $t$th frame and the variance of the $k$th bin in the $t$th frame, respectively. Eqn. (5) yields the likelihood as follows:

$$p(z_t \mid \Theta_t) = \mathcal{N}(z_t; x_t, \Sigma_t) \propto \exp\left\{ -(z_t - x_t)^H \Sigma_t^{-1} (z_t - x_t) \right\}, \tag{6}$$

where $(\cdot)^H$ is the Hermitian operator. To define $p(z_t \mid \Theta_t)$, $\sigma^2_{t,k}$ must be estimated. In this paper, $\sigma^2_{t,k}$ is estimated using the decision-directed method [7] as follows:

$$\sigma^2_{t,k} = \alpha\,\sigma^2_{t-1,k} + (1 - \alpha)\,|Y_{t,k}|^2, \tag{7}$$

where $\alpha$ and $Y_{t,k}$ are a smoothing factor and the $k$th-bin DFT coefficient of $y_t$, respectively. However, Eqn. (7) cannot be used directly since $Y_{t,k}$ is unknown. It is assumed that $Y_{t,k}$ is highly correlated with $Y_{t-1,k}$. Therefore, the estimation is modified as follows:

$$\sigma^2_{t,k} = \alpha\,\sigma^2_{t-1,k} + (1 - \alpha)\,|\hat{Y}_{t-1,k}|^2. \tag{8}$$

Accurate estimation of $\Sigma_t$ will lead to robustness to harmonic and percussive sound interference. Figure 2 shows an example of $z_t$ and an estimate of $\Sigma_t$, and it is easily shown that the likelihood in Eqn. (6) is maximized at the true $\Theta_t$.

The state evolution equations, which describe the relationships of the parameters at frame $t$, are set as follows:

$$A_{m,t} = A_{m,t-1} + v_{A,t-1}, \tag{9}$$

$$\omega_{0,t} = \omega_{0,t-1} + v_{\omega_0,t-1}, \tag{10}$$

where $v_{A,t-1}$ and $v_{\omega_0,t-1}$ are the random perturbations corresponding to the harmonic amplitudes and the melody pitch frequency of the $(t-1)$th frame, respectively.
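The variance recursion of Eqn. (8) and the diagonal Gaussian likelihood of Eqn. (6) can be sketched as below. This is a minimal illustration under stated assumptions: the names `update_variances`, `log_likelihood` and `prev_residual` are hypothetical, and `prev_residual` simply stands in for whatever estimate of the other-instrument spectrum from the previous frame is available. Only the exponent of Eqn. (6) is computed; the normalisation constant is dropped on the assumption that the same variances are shared by all candidate parameter sets within a frame, so it cancels once weights are normalised.

```python
def update_variances(prev_var, prev_residual, alpha=0.98):
    """Eqn. (8): var[k] = alpha * prev_var[k] + (1 - alpha) * |Y_hat_prev[k]|^2."""
    return [alpha * v + (1.0 - alpha) * abs(y) ** 2
            for v, y in zip(prev_var, prev_residual)]

def log_likelihood(z, x, var):
    """Exponent of Eqn. (6) for a diagonal covariance:
    -(z - x)^H Sigma^{-1} (z - x) = -sum_k |z_k - x_k|^2 / var_k."""
    return -sum(abs(zk - xk) ** 2 / vk for zk, xk, vk in zip(z, x, var))

# Toy two-bin example with made-up complex DFT coefficients.
var_t = update_variances([1.0, 4.0], [1 + 0j, 2 + 0j], alpha=0.5)
score = log_likelihood([1 + 1j, 2 + 0j], [1 + 0j, 0 + 0j], [1.0, 4.0])
```

Bins with a large estimated variance contribute little to the score, which is how a strong interfering harmonic or a wideband percussive hit is down-weighted in the comparison between the observed spectrum and a candidate melody spectrum.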
This type of state evolution equation is called a random walk: the current state is a random perturbation of the previous state. It is important to define $p(v_{A,t-1})$ and $p(v_{\omega_0,t-1})$ accurately. In this paper, $p(v_{A,t-1})$ is assumed to be a truncated Gaussian, as shown in Figure 3, since $A_{m,t} > 0$, and $p(v_{\omega_0,t-1})$ is assumed to be a Gaussian whose variance is controlled by $\rho_t$.

Figure 2. Example of polyphonic audio ($z_t$) and the estimated variances ($\Sigma_t$) of the other-instrument signal. (Curves: polyphonic audio, square root of variances; magnitude on a logarithmic axis versus frequency bin.)

Figure 3. State transition in harmonic amplitudes: (a) $p(v_{A,t-1})$; (b) $p(A_{m,t} \mid A_{m,t-1})$.

The melody line is characterized by prolonged periods of smoothness, with infrequent sharp changes at note transitions or during vibrato regions. Furthermore, there are two general rules concerning the melody line: 1) the vibrato exhibits an extent of 60-200 cent for singing voice and only 20-30 cent for other instruments [8], and 2) the transitions are typically limited to one octave [1]. Therefore, the assumption that $v_{\omega_0,t-1}$ follows a Gaussian distribution with fixed variance is not appropriate. In this paper, the state transition from the $(t-1)$th state to the $t$th state of the melody pitch frequency is controlled by $\rho_t$, which indicates the degree to which the melody line is in transition. Here, transition includes vibrato. $\rho_t$ is defined as

$$\rho_t = \left| \omega_{0,t-1} - \omega_{0,t-2} \right|, \tag{11}$$

and $p(v_{\omega_0,t-1})$ is given by

$$p(v_{\omega_0,t-1}) = \begin{cases} \mathcal{N}(0,\ 20\ \text{cent}), & \rho_t < 50\ \text{cent} \\ \mathcal{N}(0,\ 50\ \text{cent}), & 50\ \text{cent} \le \rho_t < 100\ \text{cent} \\ \mathcal{N}(0,\ 100\ \text{cent}), & 100\ \text{cent} \le \rho_t \end{cases} \tag{12}$$

When $\rho_t$ is small, the current melody pitch frequency represents a certain note frequency and has a value similar to the previous melody pitch frequency. When $\rho_t$ is large, the current melody pitch frequency is with high probability in a note transition or vibrato region and has a value dissimilar to the previous melody pitch frequency. The state transition of melody pitch frequency defined by Eqn. (12) can lead to robustness to octave mismatch and dynamic variation in melody.

The cent is a unit of logarithmic frequency, and it is defined as $f_{\text{cent}} = 6900 + 1200 \log_2 (f_{\text{Hz}} / 440)$.

2.2 Melody extraction based on particle filter

In this paper, $p(\Theta_{0:t} \mid z_{1:t})$ is approximated using Monte Carlo integration and $\Theta_t$ is estimated using the particle filter. The SIS algorithm, which is a common particle filter method [9, 10], is adopted to estimate the parameters of the melody. If the likelihood and the state transition followed Gaussian distributions, the problem could be solved by a Kalman filter. However, the state transition is not assumed to be Gaussian. The SIS algorithm is used to obtain $p(\Theta_{0:t} \mid z_{1:t})$ based on the Bayesian sequential model shown in Figure 1. The posterior density $p(\Theta_{0:t} \mid z_{1:t})$ can be approximated as follows:

$$p(\Theta_{0:t} \mid z_{1:t}) \approx \sum_{i=1}^{N_p} w_t^{(i)}\, \delta(\Theta_{0:t} - \Theta_{0:t}^{(i)}), \tag{13}$$

where $\Theta_{0:t}^{(i)}$, $w_t^{(i)}$, and $N_p$ are the $i$th particle of $\Theta_{0:t}$, its associated weight, and the number of particles, respectively. The weights are normalized such that $\sum_{i=1}^{N_p} w_t^{(i)} = 1$. The weights are chosen using the method of importance sampling. If the particles $\Theta_{0:t}^{(i)}$ are drawn from an importance density $q(\Theta_{0:t} \mid z_{1:t})$, the weights in Eqn. (13) are defined as follows:

$$w_t^{(i)} \propto \frac{p(\Theta_{0:t}^{(i)} \mid z_{1:t})}{q(\Theta_{0:t}^{(i)} \mid z_{1:t})}. \tag{14}$$

If the importance density is chosen to factorize as follows

$$q(\Theta_{0:t} \mid z_{1:t}) = q(\Theta_t \mid \Theta_{0:t-1}, z_{1:t})\, q(\Theta_{0:t-1} \mid z_{1:t-1}), \tag{15}$$

then one can obtain particles $\Theta_{0:t}^{(i)} \sim q(\Theta_{0:t} \mid z_{1:t})$ by augmenting each of the existing particles $\Theta_{0:t-1}^{(i)} \sim q(\Theta_{0:t-1} \mid z_{1:t-1})$ with the new state $\Theta_t^{(i)} \sim q(\Theta_t \mid \Theta_{0:t-1}, z_{1:t})$. The weight update equation can be derived as follows using Eqn. (14) and Eqn. (15):

$$w_t^{(i)} \propto w_{t-1}^{(i)}\, \frac{p(z_t \mid \Theta_t^{(i)})\, p(\Theta_t^{(i)} \mid \Theta_{t-1}^{(i)})}{q(\Theta_t^{(i)} \mid \Theta_{t-1}^{(i)}, z_t)}. \tag{16}$$

A common problem with the particle filter is the degeneracy phenomenon, where after a few iterations most particles have negligible weight [9, 10]. A suitable measure of degeneracy is the effective particle size, $N_{\text{eff}}$, which is given by

$$N_{\text{eff}} = \frac{1}{\sum_{i=1}^{N_p} \left( w_t^{(i)} \right)^2}. \tag{17}$$
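The pieces introduced in this subsection (the cent scale, the piecewise transition spread of Eqn. (12), the weight update of Eqn. (16) and the effective particle size of Eqn. (17)) can be sketched as follows. All function names are hypothetical. Two points are assumptions rather than statements from the paper: the three spreads are taken as 20, 50 and 100 cent and interpreted as standard deviations, and multinomial resampling via the standard-library `random.choices` is used as one common resampling scheme, since the text does not say which scheme is applied.

```python
import math
import random

def hz_to_cent(f_hz):
    """Cent scale used in the text: 6900 + 1200 * log2(f / 440)."""
    return 6900.0 + 1200.0 * math.log2(f_hz / 440.0)

def transition_spread(rho_cent):
    """Piecewise spread of the pitch random walk, following Eqn. (12).
    Assumption: the values are treated as standard deviations in cent."""
    if rho_cent < 50.0:
        return 20.0
    if rho_cent < 100.0:
        return 50.0
    return 100.0

def update_weights(prev_w, lik, trans, prop):
    """Eqn. (16), then normalisation so that the weights sum to one."""
    raw = [w * l * p / q for w, l, p, q in zip(prev_w, lik, trans, prop)]
    total = sum(raw)
    return [r / total for r in raw]

def effective_size(weights):
    """Eqn. (17): N_eff = 1 / sum_i w_i^2."""
    return 1.0 / sum(w * w for w in weights)

def resample(particles, weights):
    """Multinomial resampling (an assumed scheme): draw N_p particles with
    probability proportional to weight, then reset weights to 1 / N_p."""
    n = len(particles)
    chosen = random.choices(particles, weights=weights, k=n)
    return chosen, [1.0 / n] * n

# Toy step with two particles; likelihood, transition and proposal values are made up.
w_t = update_weights([0.5, 0.5], [1.0, 3.0], [1.0, 1.0], [1.0, 1.0])
n_eff = effective_size(w_t)
```

With equal weights the effective size equals the number of particles, and it shrinks toward one as the weight mass concentrates on a single particle, which is why it serves as a trigger for resampling.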

In this paper, to avoid the degeneracy problem, a resampling algorithm is used when $N_{\text{eff}}$ falls below a threshold set relative to $N_p$. Finally, estimation of the parameters is achieved by the posterior mean after obtaining $p(\Theta_{0:t} \mid z_{1:t})$:

$$\hat{\omega}_{0,0:t} = \sum_{i=1}^{N_p} w_t^{(i)}\, \omega_{0,0:t}^{(i)}, \tag{18}$$

$$\hat{A}_{0:t} = \sum_{i=1}^{N_p} w_t^{(i)}\, A_{0:t}^{(i)}. \tag{19}$$

2.2.1 Design of sequential importance density

The performance of the SIS algorithm depends on the choice of $q(\Theta_t \mid \Theta_{t-1}, z_t)$. Setting $q(\Theta_t \mid \Theta_{t-1}, z_t) = p(\Theta_t \mid \Theta_{t-1})$ leads not only to an unnecessarily large number of particles but also to difficulties in estimating $p(\Theta_{0:t} \mid z_{1:t})$. In this paper, a multiple pitch estimation algorithm is used to define $q(\Theta_t \mid \Theta_{t-1}, z_t)$, since the melody pitch frequency is assumed to be one of the pitch estimates given by the multiple pitch estimation. The main idea in defining $q(\Theta_t \mid \Theta_{t-1}, z_t)$ is to generate particles of the melody parameters similar to the estimated multiple pitch parameters. To obtain the multiple pitch parameters, the multiple pitch estimation algorithm proposed in [11] is used.

Before drawing particles from the importance density, $q(\Theta_t \mid \Theta_{t-1}, z_t)$ is factorized as follows:

$$q(\omega_{0,t}, A_t \mid \omega_{0,t-1}, A_{t-1}, z_t) = q(A_t \mid \omega_{0,t}, A_{t-1}, z_t)\, q(\omega_{0,t} \mid \omega_{0,t-1}, z_t). \tag{20}$$

Here, $\omega_{0,t}$ and $A_t$ are considered conditionally independent given $\omega_{0,t-1}$, $A_{t-1}$, and $z_t$. First, melody pitch particles are drawn as given by

$$\omega_{0,t}^{(i)} \sim q(\omega_{0,t} \mid \omega_{0,t-1}^{(i)}, z_t), \tag{21}$$

where $q(\omega_{0,t} \mid \omega_{0,t-1}, z_t)$ is shown in Figure 4. In defining $q(\omega_{0,t} \mid \omega_{0,t-1}, z_t)$, the current melody pitch particles are drawn near the N-best pitch candidates obtained from the multiple-pitch estimation and near the melody pitch particles drawn in the previous frame.

Figure 4. Design of $q(\omega_{0,t} \mid \omega_{0,t-1}, z_t)$. (Curves: importance density, N-best candidates, previous particle; horizontal axis: pitch candidate [cent].)

After drawing the melody pitch particles, the harmonic amplitude particles are drawn as given by

$$A_t^{(i)} \sim q(A_t \mid \omega_{0,t}^{(i)}, A_{t-1}^{(i)}, z_t) = \mathcal{N}\!\left( \frac{A_{t-1}^{(i)} + A^{z}_{\omega_{0,t}^{(i)}}}{2},\ \left| A_{t-1}^{(i)} - A^{z}_{\omega_{0,t}^{(i)}} \right| \right), \tag{22}$$

where $A^{z}_{\omega_{0,t}^{(i)}}$ denotes the harmonic amplitudes corresponding to the pitch candidate near $\omega_{0,t}^{(i)}$, with the constraint $A_t^{(i)} > 0$. In defining $q(A_t \mid \omega_{0,t}, A_{t-1}, z_t)$, current harmonic amplitude particles are generated that are similar to both the previous harmonic amplitude particles and the harmonic amplitudes of the N-best pitch candidates. If $A_{t-1}^{(i)}$ and $A^{z}_{\omega_{0,t}^{(i)}}$ are similar, then $|A_{t-1}^{(i)} - A^{z}_{\omega_{0,t}^{(i)}}| \approx 0$ and therefore $A_t^{(i)} \approx (A_{t-1}^{(i)} + A^{z}_{\omega_{0,t}^{(i)}})/2$. If they are not similar, then $|A_{t-1}^{(i)} - A^{z}_{\omega_{0,t}^{(i)}}| \gg 0$ and therefore $A_t^{(i)}$ is generated somewhat randomly. The outline of the considered algorithm is given below.

Outline of the considered algorithm: melody extraction based on the SIS

For $i = 1, \ldots, N_p$:
  1. Generate the particles.
     Melody pitch particles: $\omega_{0,t}^{(i)} \sim q(\omega_{0,t} \mid \omega_{0,t-1}^{(i)}, z_t)$
     Harmonic amplitude particles: $A_t^{(i)} \sim q(A_t \mid \omega_{0,t}^{(i)}, A_{t-1}^{(i)}, z_t)$
  2. Update the weights: Eqn. (16).
Normalize the weights ($\sum_{i=1}^{N_p} w_t^{(i)} = 1$).
Resampling: the resampling algorithm is used when $N_{\text{eff}}$ falls below a threshold set relative to $N_p$.
Estimation: the melody pitch frequency in the $t$th frame is estimated by Eqn. (18). The harmonic amplitudes of the melody pitch harmonics in the $t$th frame are estimated by Eqn. (19).

3. EVALUATION

The considered algorithm was evaluated and compared to other melody extraction algorithms using the ISMIR 2004 Audio Description Contest (ADC04) database. The database contains 20 polyphonic musical audio pieces. All test data are single-channel PCM data with 44.1 kHz sample rate and 16-bit quantization. Table 1 shows the data composition of the ADC04 set. The search range of the melody pitch frequency was between 80 Hz and 1280 Hz in the frequency domain (3950 cent and 8750 cent in the cent domain). The Hanning window was used with 48 ms frame length and 10 ms frame hop size. $\alpha = 0.98$ in Eqn. (8) was used. $N_p = 500$ in Eqn. (13) was used. The estimated melody is considered correct when the absolute value of the difference between the ground-truth frequency and the estimated frequency is less than 50 cent (1/4 tone).

Table 1. Summary of ADC04 data set. The number in parentheses is the number of corresponding pieces.

  Melody instrument                  Style
  Synthesized voice (4)              Pop
  Saxophone (4)                      Jazz
  MIDI instruments (4)               Folk (2), Pop (2)
  Human voice (2 male, 2 female)     Classical opera
  Male voice (4)                     Pop

Table 2. Result comparison. The number in parentheses is the year when the algorithms were submitted to the MIREX.

  Algorithm                 RPA             RCA
  Goto [2]                  65.8% (2005)    71.8% (2005)
  Paiva et al. [3]          62.7% (2005)    66.7% (2005)
  Marolt [4]                60.1% (2005)    67.1% (2005)
  Ryynanen et al. [5]       68.6% (2005)    74.1% (2005)
  Ellis et al. [6]          73.2% (2006)    76.4% (2006)
  Considered algorithm      77.3%           83.8%

The performance of the considered algorithm was evaluated in terms of raw pitch accuracy (RPA) and raw chroma accuracy (RCA). The RPA is defined as the proportion of frames in which the estimated melody pitch is within ±1/4 tone of the reference pitch. The RCA is defined in the same manner as the RPA; however, both the estimated and reference frequencies are mapped into a single octave in order to forgive octave transpositions. The considered algorithm was compared to other well-known melody extraction algorithms, namely those proposed by Goto [2], Paiva et al. [3], Marolt [4], Ryynanen et al. [5], and Ellis et al. [6]. Their performances are based on the results of the Music Information Retrieval Evaluation eXchange (MIREX) [12]. Table 2 shows the evaluation results for all algorithms considered. The considered algorithm outperformed the others in terms of the RPA and the RCA. The difference between the RPA and the RCA is proportional to the octave mismatch error.
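The two metrics described above can be sketched as follows, with pitches given in cent and a tolerance of 50 cent. The function names are hypothetical, and the sketch assumes that every frame passed in carries a reference pitch; handling of frames without melody is left out.

```python
def raw_pitch_accuracy(est_cent, ref_cent, tol=50.0):
    """Fraction of frames whose estimate lies within tol cent of the reference."""
    hits = sum(1 for e, r in zip(est_cent, ref_cent) if abs(e - r) < tol)
    return hits / len(ref_cent)

def raw_chroma_accuracy(est_cent, ref_cent, tol=50.0):
    """Like RPA, but the pitch error is folded into a single octave first."""
    hits = 0
    for e, r in zip(est_cent, ref_cent):
        d = (e - r) % 1200.0            # error modulo one octave, in [0, 1200)
        if min(d, 1200.0 - d) < tol:    # distance to the nearest octave multiple
            hits += 1
    return hits / len(ref_cent)

# Toy example: one exact-ish hit, two octave errors, one 100-cent miss.
ref = [6900.0, 6900.0, 6900.0, 6900.0]
est = [6910.0, 8100.0, 5700.0, 7000.0]
rpa = raw_pitch_accuracy(est, ref)
rca = raw_chroma_accuracy(est, ref)
```

In this toy example the two frames that are exactly one octave off count as errors for the RPA but are forgiven by the RCA, so the gap between the two values reflects the share of octave mismatches.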
Although the algorithm in this paper is considered to be robust against octave mismatch, the difference between the RPA and the RCA is 6.5%. The multiple pitch estimation algorithm proposed in [11] was quite simple and vulnerable to octave error; that is, inaccuracy in the sequential importance density led to inaccurate melody pitch candidates.

4. CONCLUSION

A melody extraction algorithm for polyphonic audio based on the particle filter is considered in this paper. Most people recognize music not as the full set of note sequences but as a special monophonic note sequence called melody. However, melody extraction from polyphonic audio is difficult due to the following impediments: harmonic interference, percussive sound interference, octave mismatch, and dynamic variation in melody. The main idea of the algorithm is to consider probabilistic relations between melody and polyphonic audio. Melody is assumed to follow a Markov process, and the framed segments of polyphonic audio are assumed to be conditionally independent given the parameters that represent the melody. The parameters are estimated using the SIS algorithm. This paper shows that the likelihood and state transition required in the SIS algorithm can be defined to be robust against the aforementioned impediments. The performance of the SIS algorithm depends on a sequential importance density, and this density is designed using multiple pitch estimates. Experimental results show that the considered algorithm outperformed the other well-known melody extraction algorithms.

5. ACKNOWLEDGEMENTS

This work was supported by the Ministry of Culture, Sports and Tourism (MCST) and Korea Culture Content Agency (KOCCA) in the Culture Technology (CT) Research and Development Program 2009.

6. REFERENCES

[1] G. E. Poliner, D. P. W. Ellis, and A. F. Ehmann: Melody transcription from music audio: approach and evaluation, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 4, pp. 1247-1256, 2007.
[2] M. Goto: A real-time music-scene-description system: predominant-F0 estimation for detecting melody and bass lines in real-world audio signals, Speech Communication, Vol. 43, No. 4, pp. 311-329, 2004.
[3] R. P. Paiva, T. Mendes, and A. Cardoso: Melody detection in polyphonic musical signals: exploiting perceptual rules, note salience, and melodic smoothness, Computer Music Journal, Vol. 30, No. 4, pp. 80-98, 2006.
[4] M. Marolt: On finding melodic lines in audio recordings, Proceedings of the 7th International Conference on Digital Audio Effects (DAFx'04), pp. 217-221, 2004.
[5] M. P. Ryynanen and A. P. Klapuri: Note event modeling for audio melody extraction, MIREX 2005 Audio Melody Extraction Contest, 2005.
[6] D. P. W. Ellis and G. E. Poliner: Classification-based melody transcription, Machine Learning, Vol. 65, pp. 439-456, 2006.

[7] Yariv Ephraim: Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 32, No. 6, pp. 1109-1121, 1984.
[8] R. Timmers and P. W. M. Desain: Vibrato: the questions and answers from musicians and science, Proceedings of the International Conference on Music Perception and Cognition, 2000.
[9] A. Doucet, N. de Freitas, and N. J. Gordon: Sequential Monte Carlo methods in practice, Springer-Verlag, New York, 2001.
[10] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking, IEEE Transactions on Signal Processing, Vol. 50, No. 2, pp. 174-188, 2002.
[11] S. Joo, S. Jo, and C. D. Yoo: Melody extraction from polyphonic audio signal MIREX 2009, MIREX 2009 Audio Melody Extraction Contest, 2009.
[12] J. S. Downie, K. West, A. Ehmann, and E. Vincent: The 2005 music information retrieval evaluation exchange (MIREX 2005): preliminary overview, Proceedings of the Sixth International Conference on Music Information Retrieval, pp. 320-323, 2005.