F0 ESTIMATION FOR NOISY SPEECH BY EXPLORING TEMPORAL HARMONIC STRUCTURES IN LOCAL TIME FREQUENCY SPECTRUM SEGMENT


Dongmei Wang, John H. L. Hansen

Dept. Electrical Engineering, University of Texas at Dallas
800 West Campbell Road, Richardson, Tx. 75080
{dongmei.wang, john.hansen}@utdallas.edu

ABSTRACT

In this paper, we propose a noise-robust F0 estimation approach that explores the temporal harmonic structures in local time-frequency (TF) spectrum segments. Since the speech energy is sparsely distributed on the TF plane, the speech harmonic structures in TF segments with higher speech energy tend to dominate over noise. Thus, we attempt to derive F0 from such high signal-to-noise ratio (SNR) TF segments rather than from the full-band signal. Our algorithm comprises two stages: i) F0 candidate estimation for a series of TF segments; ii) F0 tracking based on the acoustic features of each TF segment as well as the F0 temporal continuity constraints. Experimental results show that our approach outperforms the compared methods in terms of F0 estimation accuracy.

Index Terms: F0 estimation, local TF segment, SNR estimation, temporal continuity constraints

1. INTRODUCTION

Fundamental frequency (F0) is one of the most important characteristics of human speech; it represents the vibration rate of the vocal cords during speech production. A promising F0 estimation system will facilitate many speech signal processing areas, such as speech source separation, emotion recognition, speaker/language identification, etc. Recently, F0 estimation has also been applied to assist mental disease diagnosis [1] [2].

The straightforward way to analyze F0 is either to explore harmonic structures in the frequency domain [3] [4] or to examine the periodic cues in the time domain [5-7]. Correspondingly, the autocorrelation function (ACF) and the average magnitude difference function (AMDF) are the two basic time domain F0 estimation approaches, while subharmonic summation [3] and comb filtering [29] are usually adopted as frequency domain methods. However, in adverse conditions these traditional F0 estimation methods become ineffective because both the temporal periodic cues and the harmonic structure are distorted to some degree. Many efforts have been made by researchers to deal with the noisy situation.
For instance, ACF and AMDF are combined to obtain better periodic peak detection [8] [9]. In addition, various types of adaptive speech representation methods are introduced to enhance the speech component so as to provide a more reliable source for F0 estimation [10-12]. Signal pre-processing is also proposed to attenuate some noise for F0 estimation [13] [14]. Auditory filter bank based F0 estimation is proposed to take advantage of high SNR sub-channels [15-17]. Moreover, the F0 temporal continuity constraints are modeled to ensure more accurate F0 tracking [12] [15] [18]. Recently, statistical and machine learning methods are also widely used for noise robust pitch estimation [19-22].

Among the previous studies, temporal harmonic structures have been investigated for noise robust F0 estimation because of the harmonic similarity between adjacent speech frames [12]. The speech sparsity characteristic [23] is also exploited, in that F0 can be estimated from the less noise-affected channels [15-17] in each frame. However, temporal harmonic continuity and sparsity are seldom considered simultaneously for F0 estimation. If the particular spectrum area (TF segment) dominated by continuous frames of harmonic structures can be detected and used for F0 estimation, the performance could be improved.

In this work we focus on F0 estimation by exploring temporal harmonic structures in the local TF segment. First, the noisy speech spectrum is decomposed into a series of overlapped TF segments. An F0 candidate contour is estimated for each local TF segment. Subsequently, overall F0 tracking is performed based on a hidden Markov model (HMM). Two features are proposed to indicate the F0 accuracy in each TF segment: the logarithmic frequency scale correlation coefficient (LogFcc) and an estimated SNR.

[Fig. 1. Algorithm overview. Block diagram: noisy speech s, short-term (ST) and long-term (LT) spectra, formation of ST-TF and LT-TF segments, LogFcc calculation with a 0.4 threshold (segments below it are skipped), F0 candidate estimation, SNR estimation, segment features, overall F0 track. ST: short term TF; LT: long term TF; LogFcc: logarithmic frequency scale correlation coefficient.]
In addition, two dynamic factors are developed to model the F0 temporal continuity constraints: the inter-frame and the inter-segment F0 transition probability. A similar F0 estimation algorithm was proposed in our previous paper [30]; the overall F0 tracking is improved in this work.

This paper is organized as follows. Section 2 describes an overview of the system. The F0 candidate estimation is presented in Section 3. Section 4 illustrates the target F0 tracking. Experiments and results are described in Section 5. Finally, the conclusions are drawn in Section 6.

978-1-4799-9988-0/16/$31.00 ©2016 IEEE, ICASSP 2016

2. ALGORITHM OVERVIEW

In this section an overall algorithm overview is presented. The general block diagram is shown in Fig. 1. Generally, our algorithm consists of two main stages: i) F0 candidate contour estimation for every single TF segment; ii) F0 tracking across the overall TF plane.

At first, we analyze the noisy speech signal based on a long-short term associated harmonic model [24]. On one hand, short term spectrum analysis preserves the short-time stationary property of the speech signal. On the other hand, long term spectrum analysis obtains a higher frequency resolution, making the speech harmonics more discriminable from noise interference. Each TF segment is 5 frames long in time and 800 Hz wide in frequency. We choose 800 Hz as the TF segment bandwidth because at least two harmonic partials are included in such a frequency range. An F0 candidate contour with a duration of five frames is estimated for each TF segment.

After that, the overall F0 tracking is performed based on an HMM. The observed likelihood of an F0 candidate being true or false is indicated by two acoustic features: LogFcc and an estimated SNR. All five F0 candidates located in one TF segment are assigned the same average likelihood. Meanwhile, the F0 temporal continuity constraints are taken into account by using both the inter-frame and the inter-segment F0 transition probabilities. Finally, the Viterbi algorithm is used for F0 decoding.

3. F0 CANDIDATE ESTIMATION

3.1. Initial detection of speech dominated TF segment

The speech harmonic structures usually change more slowly than noise spectra. The higher the correlation coefficient between two adjacent frames, the more probable it is that the TF segment is dominated by speech. Thus we propose to calculate the LogFcc for each short term TF segment to indicate its likelihood of being dominated by speech or not. The computation of LogFcc is shown in Eq. (1) - (3):

$$\mathrm{LogFcc} = \frac{1}{N-1}\sum_{n=1}^{N}\left(\frac{X_{\log F}(n)-\mu_X}{\sigma_X}\right)\left(\frac{Y_{\log F}(n)-\mu_Y}{\sigma_Y}\right) \qquad (1)$$

$$X_{\log F}(n) = \log_a\big(X(n)\big) \qquad (2)$$

$$Y_{\log F}(n) = \log_a\big(Y(n)\big) \qquad (3)$$

where $X$ and $Y$ are the two neighboring spectrum amplitude vectors in a particular TF segment, $N$ is the sample number of $X$ and $Y$, $n$ is the index of the frequency bin, $n = f N_{fft} / f_s$, $f \in [1, 800]$ Hz, $N_{fft}$ is the FFT point, $f_s$ is the sampling rate, and $\mu$ and $\sigma$ are the mean and standard deviation respectively. We set $a = 1.5$ and choose $N_{fft}$ relative to $f_s$ empirically (the exact expression, which involves $f_s$ and 500, is illegible in the source). For each TF segment, an average LogFcc is obtained across five frames.
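The LogFcc feature described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it reads Eqs. (2) - (3) as a base-a logarithm applied to each spectrum vector, treats the normalizer in Eq. (1) as the sample standard deviation, and averages over the adjacent frame pairs of one segment. The function names `log_fcc` and `segment_log_fcc` and the small `eps` guard are hypothetical.

```python
import numpy as np

def log_fcc(x, y, a=1.5, eps=1e-12):
    """Correlation coefficient between two log-compressed spectrum vectors.

    x, y : spectrum amplitude vectors of two neighboring frames (same length).
    a    : logarithm base (1.5 in the text); eps is an assumed guard against log(0).
    """
    x_log = np.log(np.asarray(x, float) + eps) / np.log(a)
    y_log = np.log(np.asarray(y, float) + eps) / np.log(a)
    # z-score each vector with the sample standard deviation (ddof=1)
    xz = (x_log - x_log.mean()) / x_log.std(ddof=1)
    yz = (y_log - y_log.mean()) / y_log.std(ddof=1)
    return float(np.sum(xz * yz) / (len(x_log) - 1))

def segment_log_fcc(frames, a=1.5):
    """Average LogFcc over all adjacent frame pairs of one TF segment."""
    frames = np.asarray(frames, float)
    vals = [log_fcc(frames[i], frames[i + 1], a) for i in range(len(frames) - 1)]
    return float(np.mean(vals))

# Five frames sharing one harmonic shape at different gains: strongly correlated.
shape = np.array([0.2, 1.0, 0.3, 2.0, 0.25, 1.5, 0.4, 1.0])
frames = np.array([shape * g for g in (1.0, 1.5, 2.0, 1.2, 0.8)])
keep_segment = segment_log_fcc(frames) >= 0.4  # threshold quoted in the text
```

Because the correlation coefficient is invariant to scaling and offset, the choice of logarithm base does not change the result in this reading; it would matter only if the logarithm were applied to the frequency axis instead.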
In addition, the linear frequency scale is transformed into a logarithmic one in order to restrain the notable frequency differences between high order harmonic structures in two successive frames. A TF segment whose average LogFcc value is smaller than a threshold is considered as noise and is discarded before further processing. Otherwise, the average LogFcc value is saved and used for the overall F0 tracking in the next step. Here the threshold is empirically set to 0.4.

[Fig. 2. Overview of F0 candidate estimation.]

3.2. F0 candidate contour estimation

In this subsection, we perform F0 candidate estimation in the initially detected speech dominated TF segments. Here long term TF segments are used instead of short term ones to increase the frequency resolution for F0 estimation. Fig. 2 shows the general flowchart of the F0 candidate estimation.

We take a long TF segment as an example. The ACF is obtained for each frame and is normalized by dividing by the maximum amplitude in each frame. The frequencies of the ACF peaks in each frame are considered as the F0 candidates, and the amplitudes of the corresponding normalized ACF peaks are considered as the observation likelihoods of the candidates belonging to the true F0. Meanwhile, the F0 transition probability between two consecutive frames, $p(F0_t / F0_{t-1})$, is learnt from the Keele [25] and CSTR [26] databases, both of which provide ground truth F0 values. We assume $p(F0_t / F0_{t-1})$ is equivalent to the probability of the F0 change in logarithmic scale between two neighboring frames, as shown in Eq. (4):

$$p(F0_t / F0_{t-1}) = p\left(\log_{1.5}\frac{F0_t}{F0_{t-1}}\right) \qquad (4)$$

A Gaussian mixture model is adopted to model the logarithmic F0 change, which is then used as the F0 transition probability. With the observed likelihood and the F0 transition probability, the Viterbi algorithm is applied for F0 decoding.

Furthermore, we use the sub-harmonic summation technique [3] [27] [28] to correct some F0 estimation errors caused by the ACF based approach. The core of sub-harmonic summation is to compress the spectrum vector in each frame along the frequency axis by a series of integer factors and to sum the compressed spectra together. As a consequence, multiple harmonics coincide and reinforce one another, producing a maximum spectrum peak at the fundamental frequency.
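The compress-and-sum idea behind sub-harmonic summation can be illustrated as follows. This is only a sketch under simplifying assumptions: compression by an integer factor m is done by plain decimation (bin m·k is mapped to bin k, no interpolation or weighting), and the factors run over 1..`max_factor`. The function name `subharmonic_sum` is hypothetical.

```python
import numpy as np

def subharmonic_sum(spectrum, max_factor):
    """Sum copies of `spectrum` compressed along frequency by 1..max_factor.

    Compression by factor m maps bin m*k onto bin k, so the m-th harmonic of
    a tone lands on the bin of its fundamental.
    """
    spectrum = np.asarray(spectrum, float)
    n = len(spectrum)
    total = np.zeros(n)
    for m in range(1, max_factor + 1):
        idx = np.arange(0, n, m)           # source bins 0, m, 2m, ...
        total[: len(idx)] += spectrum[idx]  # place them at bins 0, 1, 2, ...
    return total

# Toy spectrum: harmonics at bins 20, 40, 60, 80, with the second one strongest.
spec = np.zeros(100)
spec[[20, 40, 60, 80]] = [1.0, 3.0, 1.0, 1.0]
shs = subharmonic_sum(spec, max_factor=4)
f0_bin = int(np.argmax(shs[1:])) + 1       # skip the DC bin
```

In this toy case the raw spectrum peaks at bin 40, while the summed spectrum peaks at bin 20, the fundamental, which is the correction effect the text describes.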
In our case, the integer factors are equal to the harmonic orders calculated by dividing the TF segment frequency bounds by the initially detected F0s from the ACF method. The frequency of the maximum peak of the compressed and summed spectrum is taken as the updated F0 candidate in each frame. The idea behind this is that the ACF based method and the sub-harmonic summation based method should produce the same F0 results. If conflicts happen, there is a high probability that the estimated F0 is wrong. In our case, one typical cause of F0 error in ACF estimation is that some TF segments are occupied by equally spaced noise spectrum peaks. Unfortunately, those frequency distances are easily detected by ACF as F0. However, these spectrum peaks are not harmonically correlated with each other, and their frequencies do not have a common factor. Therefore the sub-harmonic summation technique provides a post-processing step for error correction.
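The ACF candidate step of this subsection can be sketched as below. The sketch rests on an interpretation that the text does not state outright: since equally spaced noise spectrum peaks are said to be mistaken for F0, the ACF is assumed here to be taken along the frequency axis of the segment's magnitude spectrum, so that a peak lag corresponds to the harmonic spacing. The function name `spectral_acf_candidates`, the search range `fmin`/`fmax`, and the mean removal are assumptions for illustration.

```python
import numpy as np

def spectral_acf_candidates(mag, bin_hz, fmin=60.0, fmax=400.0):
    """Return (F0 candidate in Hz, normalized ACF peak height) pairs for one frame.

    mag    : magnitude spectrum of one frame restricted to a TF segment.
    bin_hz : frequency spacing between adjacent bins, in Hz.
    """
    mag = np.asarray(mag, float)
    mag = mag - mag.mean()                                  # assumed: remove the offset
    acf = np.correlate(mag, mag, mode="full")[len(mag) - 1:]
    peak = np.max(np.abs(acf))
    if peak == 0:
        return []
    acf = acf / peak                                        # normalize by the maximum
    lo = max(int(np.ceil(fmin / bin_hz)), 1)
    hi = min(int(np.floor(fmax / bin_hz)), len(acf) - 2)
    cands = []
    for lag in range(lo, hi + 1):
        # keep positive local maxima; their height serves as the likelihood
        if acf[lag] > 0 and acf[lag] > acf[lag - 1] and acf[lag] >= acf[lag + 1]:
            cands.append((lag * bin_hz, float(acf[lag])))
    return cands

# 800 Hz wide segment at 2 Hz per bin, with partials spaced 100 Hz apart.
mag = np.zeros(400)
mag[[50, 100, 150, 200, 250, 300, 350]] = 1.0
cands = spectral_acf_candidates(mag, bin_hz=2.0)
best_f0 = max(cands, key=lambda c: c[1])[0]
```

The same function would also return a confident candidate for equally spaced noise peaks, which is why the text pairs it with sub-harmonic summation as a consistency check.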

4. OVERALL F0 TRACKING

4.1. Feature extraction for each TF segment

With the estimated F0 candidate contours in each TF segment, we select the optimal pitch by searching the speech dominated TF segments on the overall TF plane. Two parameters are proposed for measuring the likelihood that a specific TF segment is speech dominated or noise dominated. One is LogFcc, which was described in Section 3, and the other is an SNR value, which is explained here.

The SNR is estimated for each TF segment based on harmonics regeneration with the estimated pitch candidate contour [22]. First, the harmonic amplitude is obtained by choosing the spectrum peak which is closest to the ideal harmonic frequency ($k \cdot F0$) within the predefined deviation range, as shown in Eq. (5):

$$A_k = A_P\left\langle k \, F0 \, N_{fft} / f_s \right\rangle \qquad (5)$$

where $A_k$ is the selected amplitude of the $k$-th order harmonic, $A_P$ is the spectrum amplitude peak vector, and $\langle a \rangle$ represents selecting an existing number that is closest to $a$.

Next, the generated harmonic spectrum is obtained by convolving the harmonic peaks with the spectrum of the Hamming window (with equal size as the short term speech analysis window), see Eq. (6):

$$S_H(n) = A_{ham}(n) * \sum_{k=K_1}^{K_2} A_k \, \delta\!\left(n - F_k N_{fft}/f_s\right) e^{jP_k} \qquad (6)$$

where $A_{ham}$ is the spectrum amplitude of the Hamming window, $A_k$, $F_k$ and $P_k$ are the amplitude, frequency and phase of the $k$-th order harmonic, $P_k$ is extracted from the noisy speech directly, and $K_1$ and $K_2$ are the lower and upper harmonic order bounds of a particular TF segment. Here $K_1 = f_l / \hat{F}0$ and $K_2 = f_u / \hat{F}0$, where $f_l$ and $f_u$ are the lower and upper frequency bounds of that TF segment. In addition, $*$ denotes convolution.

Accordingly, the SNR is calculated as in Eq. (7):

$$\mathrm{SNR}_{esti} = \frac{1}{L}\sum_{l=1}^{L} \max\!\left(\,\cdot\,,\ 10\,\frac{S_H(l)}{S_N(l) - S_H(l)}\right) \qquad (7)$$

where $l$ is the index of the frame, $L$ is the total frame number in a TF segment, and $S_N(l)$ is the noisy speech spectrum. (The first argument of the max, and any logarithm or norm applied to the ratio, are illegible in the source.)

Finally, the observation likelihood $p$ of a specific TF segment containing the true F0 candidate is obtained from $\mathrm{SNR}_{esti}$ and LogFcc in Eq. (8); the operator combining the two features in that equation is illegible in the source.

[Fig. 3. TF segment status representation.]

4.2. F0 tracking

The F0 tracking step selects the best F0 candidate from the candidate list for each frame. Here we model all of the F0 candidates as states in a hidden Markov model (HMM).
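Both the per-segment decoding of Section 3.2 and the tracking introduced here rely on Viterbi decoding over F0 candidate states, with an observation likelihood per candidate and a transition term that depends on the logarithmic F0 change between consecutive frames. A minimal sketch of that mechanism follows. It is not the authors' model: a single zero-mean Gaussian on the base-1.5 log ratio stands in for the trained Gaussian mixture model, and the width `sigma`, the guard `eps`, and the function names are hypothetical.

```python
import numpy as np

def log_transition(f_prev, f_cur, sigma=0.5, base=1.5):
    """Log of an assumed Gaussian transition score on log_base(f_cur / f_prev)."""
    d = np.log(f_cur / f_prev) / np.log(base)
    return -0.5 * (d / sigma) ** 2

def viterbi_f0(cands, likes, eps=1e-12):
    """Pick one F0 candidate per frame by maximizing the summed log scores.

    cands[t] : list of F0 candidates (Hz) for frame t.
    likes[t] : matching observation likelihoods (positive numbers).
    """
    score = np.log(np.asarray(likes[0], float) + eps)
    back = []
    for t in range(1, len(cands)):
        prev = np.asarray(cands[t - 1], float)[:, None]
        cur = np.asarray(cands[t], float)[None, :]
        total = score[:, None] + log_transition(prev, cur)  # (n_prev, n_cur)
        back.append(np.argmax(total, axis=0))                # best predecessor
        score = np.max(total, axis=0) + np.log(np.asarray(likes[t], float) + eps)
    idx = int(np.argmax(score))
    path = [idx]
    for bp in reversed(back):                                # backtrack
        idx = int(bp[idx])
        path.append(idx)
    path.reverse()
    return [cands[t][i] for t, i in enumerate(path)]

cands = [[100, 200], [102, 204], [101, 300]]
likes = [[0.6, 0.4], [0.45, 0.55], [0.6, 0.4]]
track = viterbi_f0(cands, likes)
greedy = [c[int(np.argmax(l))] for c, l in zip(cands, likes)]
```

In this toy input the frame-by-frame maximum picks 204 Hz in the middle frame, while the decoded track stays near 100 Hz because the octave jump is penalized by the transition term.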
The TF segment features obtained in Section 4.1 are used as the observation likelihood of the F0 candidate states. In addition, we propose an F0 transition probability for the model that contains two different dynamic factors. One is the F0 change over consecutive time frames, and the other is the F0 change over adjacent TF segments. The former is obtained by the same procedure as in Section 3.2, while the latter is defined in Eq. (9):

$$p(S_{i',j'} / S_{i,j}) = \begin{cases} 0.7, & i' = i,\ j' = j + 1 \\ 0.2, & j' = 1,\ j = 5 \\ 0.1, & \text{others} \end{cases} \qquad (9)$$

where $S_{i,j}$ and $S_{i',j'}$ are the TF segment status of the previous and current F0 candidate respectively, $i$ denotes the frequency channel index, which runs from one to the total number of channels, and $j$ represents the frame index in each TF segment, which runs from one to the overall frame number of a TF segment.

Fig. 3 shows an example of the TF segment status. Each purple horizontal bar represents a TF segment. In fact the TF segments overlap in both time and frequency; however, we display the overlapped TF segments separately in different frequency channels in Fig. 3. The TF segment states are shown on the bar frame by frame. Since the F0 candidates are estimated from TF segments which overlap both in time and frequency, the optimal F0 track might switch between different TF segments. Nevertheless, it is essential to guarantee that the F0 track goes through an entire TF segment in most cases, avoiding frequent halfway hopping between adjacent TF segments. Therefore, we assign a higher probability to F0 transitions inside a TF segment, and a lower probability to the other cases.

With the observation likelihoods and F0 transition probabilities, a Viterbi algorithm is performed to decode the overall F0 contour by maximizing the likelihood, as shown in Eq. (10):

$$Q_T = \arg\max_{1 \le i \le N_c,\ 1 \le j \le N_F} \left[\, p(F0_t)\, p(F0_t / F0_{t-1})\, p(S_{i',j'} / S_{i,j}) \,\right] \qquad (10)$$

where $p(F0_t)$ is the observed probability of the current F0 candidate, which equals the $p$ obtained in Eq. (8), and $p(F0_t / F0_{t-1})$ is the frame based F0 state transition probability.

5. EXPERIMENTS AND RESULTS

We use the Keele [25] and CSTR [26] databases for the performance evaluation; both provide ground truth pitch labels that can be used as a reference for performance assessment.
The Keele database contains 10 long sentences spoken by five female and five male native British English speakers, with a total duration of 9 mins. The CSTR database contains 50 English utterances, spoken by one female and one male native English speaker, with a duration of 7 mins. Six types of daily life noise are used to simulate the noisy environments: airport, babble, exhibition, restaurant, street, and train noise. Seven SNR levels are set from -10 dB to 20 dB. Three other state-of-the-art non-parametric F0 estimation algorithms are used for performance comparison: RAPT [5], YIN [8] and PEFAC [13]. Our algorithm is denoted as TF. Neither the proposed nor the reference algorithms require any prior voiced/unvoiced decision. Global pitch error (GPE) is used as the evaluation metric, under which an estimated pitch that deviates by more than 5% from the ground truth value is considered incorrect [13].

[Fig. 4. GPE results for Keele database. Panels: (a) airport, (b) babble, (c) exhibition, (d) restaurant, (e) street, (f) train.]

[Fig. 5. GPE results for CSTR database. Panels: (a) airport, (b) babble, (c) exhibition, (d) restaurant, (e) street, (f) train.]

Fig. 4 and Fig. 5 show the GPE results for the Keele and CSTR databases respectively. From Fig. 4 and Fig. 5 we can see that our proposed algorithm outperforms the reference algorithms in most of the noise conditions. However, there are still several noise scenes (e.g., exhibition, train) at low SNR levels (-10 dB) where PEFAC performs better than the proposed algorithm. The reason is probably that at low SNR levels fewer speech dominated TF segments stand out over the noise, which brings down the accuracy of the pitch candidate contour estimation. In this case, a full band spectrum with enough redundancy is preferable for pitch estimation. When the SNRs are above 0 dB, our algorithm is comparable with all of the reference methods.

6. CONCLUSIONS

We presented a study on noise-robust F0 estimation by exploring the temporal harmonic structures in local TF segments. First, a series of F0 candidate contours are estimated from different TF segments. Second, F0 tracking is performed across the overall TF plane to select the best F0. The speech dominated TF segments have a better SNR level than the full band signal, and hence the harmonic structures in these high SNR TF segments provide a more reliable source for F0 estimation in noise. Experiments and results have shown that our algorithm substantially outperforms the compared state-of-the-art methods in terms of pitch estimation accuracy.

8. REFERENCES

[1] M. Asgari, A. Bayestehtashk, I. Shafran, "Robust and accurate features for detecting and diagnosing autism spectrum disorders," in Proc. INTERSPEECH, Lyon, France, pp. 191-194, 2013.
[2] Y. Yang, C. Fairbairn, J. F. Cohn, "Detecting depression severity from vocal prosody," IEEE Trans. Audio Speech Lang. Process., vol. 4, no. 2, pp. 142-150, 2013.
[3] D. J. Hermes, "Measurement of pitch by subharmonic summation," J. Acoust. Soc. Am., vol. 83, no. 1, pp. 257-264, 1988.
[4] H. Duifhuis, L. F. Willems, R. J. Sluyter, "Measurement of pitch in speech: An implementation of Goldstein's theory of pitch perception," J. Acoust. Soc. Am., vol. 71, no. 6, pp. 1568-1580, 1982.
[5] D. Talkin, "Robust algorithm for pitch tracking," Speech Coding and Synthesis, pp. 497-518, 1995.
[6] Y. Gong, J. Haton, "Time domain harmonic matching pitch estimation using time-dependent speech modeling," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-35, no. 10, pp. 1386-1400, Oct. 1987.
[7] W. Hess, Pitch Determination of Speech Signals. Springer-Verlag, Berlin, Germany, 1983.
[8] A. Cheveigne, H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," J. Acoust. Soc. Am., vol. 111, no. 4, pp. 1917-1930, 2002.
[9] T. Shimamura, H. Kobayashi, "Weighted autocorrelation for pitch extraction of noisy speech," IEEE Trans. Speech, Audio Processing, vol. 9, no. 7, pp. 727-730, Oct. 2001.
[10] F. Huang, T. Lee, "Pitch estimation in noisy speech using accumulated peak spectrum and sparse estimation technique," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 21, no. 1, pp. 99-109, 2013.
[11] D. Liu, C. Lin, "Fundamental frequency estimation based on the joint time-frequency analysis of harmonic spectral structure," IEEE Trans. Acoust., Speech, Signal Process., vol. 9, no. 6, pp. 609-621, Sep. 2001.
[12] J. L. Roux, H. Kameoka, N. Ono, A. Cheveigne, S. Sagayama, "Single and multiple F0 contour estimation through parametric spectrogram modeling of speech in noisy environments," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 15, no. 4, pp. 1135-1145, 2007.
[13] S. Gonzalez, M. Brookes, "PEFAC - A pitch estimation algorithm robust to high levels of noise," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 22, no. 2, pp. 518-530, 2014.
[14] H. Boril, P. Pollak, "Direct time domain fundamental frequency estimation of speech in noisy conditions," in Proc. Eurospeech, 2004, vol. 2, pp. 1003-1006.
[15] M. Wu, D. Wang, "A multipitch tracking algorithm for noisy speech," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 11, no. 3, pp. 229-241, 2003.
[16] B. S. Lee, D. P. W. Ellis, "Noise robust pitch tracking by subband autocorrelation classification," in Proc. Interspeech 2012, Sep. 2012, Portland.
[17] L. N. Tan, A. Alwan, "Multi-band summary correlogram-based pitch detection for noisy speech," Speech Communication, vol. 55, pp. 841-856, 2013.
[18] M. Mauch, S. Dixon, "PYIN: A fundamental frequency estimator using probabilistic threshold distributions," ICASSP 2014, May 2014, Florence.
[19] W. Chu, A. Alwan, "SAFE: A statistical approach to F0 estimation under clean and noisy conditions," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 20, no. 3, pp. 933-944, 2012.
[20] K. Han, D. Wang, "Neural network based pitch tracking in very noisy speech," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 22, no. 12, pp. 2158-2168, 2014.
[21] E. Terhardt, "Calculating virtual pitch," Hearing Research, vol. 1, pp. 155-182, 1979.
[22] D. Wang, P. C. Loizou, J. H. L. Hansen, "F0 estimation in noisy speech based on long-term harmonic feature analysis combined with neural network classification," in Proc. Interspeech 2014, Sep. 2014, Singapore.
[23] M. Cooke, "A glimpsing model of speech perception in noise," J. Acoust. Soc. Am., vol. 119, no. 3, pp. 1562-1573, 2005.
[24] Q. Huang, D. Wang, "Single channel speech separation based on long-short frame associated harmonic model," Digital Signal Processing, vol. 21, pp. 497-507, Mar. 2011.
[25] F. Plante, G. Meyer, W. A. Ainsworth, "A pitch extraction reference database," in Proc. Eurospeech, 1995, pp. 837-840.
[26] P. C. Bagshaw, S. M. Hiller, M. A. Jack, "Enhanced pitch tracking and the processing of F0 contours for computer aided intonation teaching," in Proc. Eurospeech, 1993, vol. 2, pp. 1003-1006.
[27] E. Terhardt, "Pitch, consonance, and harmony," J. Acoust. Soc. Am., vol. 55, pp. 1061-1069, 1974.
[28] E. Terhardt, G. Stoll, M. Seewann, "Algorithm for extraction of pitch and pitch salience from complex tonal signals," J. Acoust. Soc. Am., vol. 71, pp. 679-688, 1982.
[29] M. Gainza, B. Lawlor, E. Coyle, "Multi pitch estimation by using modified IIR comb filters," in Proc. International Symposium focused on Multimedia Systems and Applications (ELMAR), Zadar, 2005.
[30] D. Wang, J. H. L. Hansen, E. Tobey, "F0 estimation for noisy speech based on exploring local time frequency segment," in Proc. WASPAA-2015, Oct. 2015, New Paltz.