F0 ESTIMATION FOR NOISY SPEECH BY EXPLORING TEMPORAL HARMONIC STRUCTURES IN LOCAL TIME FREQUENCY SPECTRUM SEGMENT

Dongmei Wang, John H. L. Hansen

Dept. Electrical Engineering, University of Texas at Dallas
800 West Campbell Road, Richardson, TX 75080
{dongmei.wang, john.hansen}@utdallas.edu

ABSTRACT

In this paper, we propose a noise-robust F0 estimation approach that explores the temporal harmonic structures in local time-frequency (TF) spectrum segments. Since speech energy is sparsely distributed on the TF plane, the speech harmonic structures in the TF segments with higher speech energy tend to dominate over noise. Thus, we attempt to derive F0 from such high signal-to-noise ratio (SNR) TF segments rather than from the full-band signal. Our algorithm comprises two stages: i) F0 candidate estimation for a series of TF segments; ii) F0 tracking based on the acoustic features of each TF segment as well as the F0 temporal continuity constraints. Experimental results show that our approach outperforms the compared methods in terms of F0 estimation accuracy.

Index Terms: F0 estimation, local TF segment, SNR estimation, temporal continuity constraints

1. INTRODUCTION

Fundamental frequency (F0) is one of the most important characteristics of human speech; it represents the vibration rate of the vocal cords during speech production. A reliable F0 estimation system facilitates many speech signal processing areas, such as speech source separation, emotion recognition, and speaker/language identification. Recently, F0 estimation has also been applied to assist the diagnosis of mental disease [1] [2]. The straightforward way to analyze F0 is either to explore harmonic structures in the frequency domain [3] [4] or to examine the periodic cues in the time domain [5-7]. Correspondingly, the autocorrelation function (ACF) and the average magnitude difference function (AMDF) are the two basic time domain F0 estimation approaches, while subharmonic summation [3] and the comb filter [29] are usually adopted as frequency domain methods. However, in adverse conditions these traditional F0 estimation methods become ineffective, because both the temporal periodic cues and the harmonic structure are distorted to some degree. Many efforts have been made to deal with the noisy situation.
For instance, ACF and AMDF are combined to obtain better periodic peak detection [8] [9]. In addition, various types of adaptive speech representation methods are introduced to enhance the speech component so as to provide a more reliable source for F0 estimation [10-12]. Signal pre-processing is also proposed to attenuate some of the noise before F0 estimation [13] [14]. Auditory filter bank based F0 estimation is proposed to take advantage of high SNR sub-channels [15-17]. Moreover, the F0 temporal continuity constraints are modeled to ensure more accurate F0 tracking [12] [15] [18]. Recently, statistical and machine learning methods have also been widely used for noise robust pitch estimation [19-22].

[Fig. 1. Algorithm overview. Block diagram: noisy speech s; long and short term (LS) analysis giving the short-term (ST) and long-term (LT) spectra; form ST-TF segment and LT-TF segment; LogFcc calculation; test LogFcc against 0.4 (skip the segment if the test fails); F0 candidate estimation; SNR estimation (SNR_esti); segment features; overall F0 tracking; estimated F0. Legend: s: noisy speech; LS: long and short term; ST: short term TF; LT: long term TF; LogFcc: logarithmic frequency scale correlation coefficient.]

Among the previous studies, temporal harmonic structures have been investigated for noise robust F0 estimation because of the harmonic similarity between adjacent speech frames [12]. The sparsity characteristic of speech [23] has also been exploited, so that F0 can be estimated from the less noise-affected channels [15-17] in each frame. However, temporal harmonic continuity and sparsity are seldom considered simultaneously for F0 estimation. If the particular spectrum areas (TF segments) dominated by continuous frames of harmonic structures can be detected for F0 estimation, the performance could be improved. In this work we focus on F0 estimation by exploring temporal harmonic structures in the local TF segment. First, the noisy speech spectrum is decomposed into a series of overlapped TF segments. An F0 candidate contour is estimated for each local TF segment. Subsequently, overall F0 tracking is performed based on a Hidden Markov Model (HMM). Two features are proposed to indicate the F0 accuracy in each TF segment: the logarithmic frequency scale correlation coefficient (LogFcc) and an estimated SNR.
In addition, two dynamic factors are developed to model the F0 temporal continuity constraints: the inter-frame and the inter-segment F0 transition probabilities. A similar F0 estimation algorithm was proposed in our previous paper [30]; the overall F0 tracking is improved in this work.

This paper is organized as follows. Section 2 gives an overview of the system. The F0 candidate estimation is presented in Section 3. Section 4 describes the target F0 tracking. Experiments and results are described in Section 5. Finally, the conclusions are drawn in Section 6.

978-1-4799-9988-0/16/$31.00 ©2016 IEEE, ICASSP 2016

2. ALGORITHM OVERVIEW

In this section an overview of the algorithm is presented. The general block diagram is shown in Fig. 1. Our algorithm consists of two main stages: i) F0 candidate contour estimation for
every single TF segment; ii) F0 tracking across the overall TF plane. At first, we analyze the noisy speech signal based on a long-short term associated harmonic model [24]. On one hand, short term spectrum analysis preserves the short-time stationary property of the speech signal. On the other hand, long term spectrum analysis obtains a higher frequency resolution, making the speech harmonics easier to discriminate from noise interference. Each TF segment is 5 frames long in time and 800 Hz wide in frequency. We choose 800 Hz as the TF segment bandwidth because at least two harmonic partials are included in such a frequency range. An F0 candidate contour with a duration of five frames is estimated for each TF segment. After that, the overall F0 tracking is performed based on an HMM. The observed likelihood of an F0 candidate being true or false is indicated by two acoustic features: LogFcc and an estimated SNR. Moreover, all five F0 candidates located in one TF segment are assigned the same average likelihood. Meanwhile, the F0 temporal continuity constraints are taken into account by using both the inter-frame and the inter-segment F0 transition probabilities. Finally, the Viterbi algorithm is used for F0 decoding.

3. F0 CANDIDATE ESTIMATION

3.1. Initial detection of speech dominated TF segment

The speech harmonic structures usually change more slowly than noise spectra. The higher the correlation coefficient between two adjacent frames, the more probable it is that the TF segment is dominated by speech. Thus we propose to calculate the LogFcc for each short term TF segment to indicate its likelihood of being dominated by speech. The computation of LogFcc is shown in Eq. (1) - (3):

$$\mathrm{LogFcc} = \frac{1}{N-1} \sum_{n=1}^{N} \left( \frac{X_{\log}(n) - \mu_{X}}{\sigma_{X}} \right) \left( \frac{Y_{\log}(n) - \mu_{Y}}{\sigma_{Y}} \right) \qquad (1)$$

$$X_{\log}(n) = \log_{a}\left( X(n) \right) \qquad (2)$$

$$Y_{\log}(n) = \log_{a}\left( Y(n) \right) \qquad (3)$$

where $X$ and $Y$ are the two neighboring spectrum amplitude vectors in a particular TF segment, $N$ is the sample number of $X$ and $Y$, $n$ is the index of the frequency bin, $n = f N_{fft} / f_s$, $f \in [1, 800]$ Hz, $N_{fft}$ is the number of FFT points, $f_s$ is the sampling rate, and $\mu$ and $\sigma^2$ are the mean and variance, respectively. We set $a = 1.5$ empirically; $N_{fft}$ is also set empirically in terms of $f_s$ and the constant 500 (the exact relation is not legible in this copy). For each TF segment, an average LogFcc is obtained across five frames.
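The LogFcc computation of Eq. (1) - (3) can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names `logfcc`, `segment_logfcc` and `is_speech_dominated` are hypothetical, the small `eps` guard against zero amplitudes is an added assumption, and averaging over adjacent frame pairs is one possible reading of "an average LogFcc is obtained across five frames". The base `a = 1.5` and the threshold 0.4 are the values quoted in the text.

```python
import math

def logfcc(x, y, a=1.5, eps=1e-12):
    """Correlation coefficient of two log-compressed (base a) amplitude vectors.

    eps is an assumption added here to avoid log(0); it is not in the paper.
    Constant vectors (zero variance) are not handled in this sketch.
    """
    x_log = [math.log(v + eps, a) for v in x]
    y_log = [math.log(v + eps, a) for v in y]
    n = len(x_log)
    mx, my = sum(x_log) / n, sum(y_log) / n
    # Sample standard deviations (denominator N - 1), matching the 1/(N-1) factor.
    sx = math.sqrt(sum((v - mx) ** 2 for v in x_log) / (n - 1))
    sy = math.sqrt(sum((v - my) ** 2 for v in y_log) / (n - 1))
    cov = sum((u - mx) * (v - my) for u, v in zip(x_log, y_log))
    return cov / ((n - 1) * sx * sy)

def segment_logfcc(frames, a=1.5):
    """Average LogFcc over adjacent frame pairs of one TF segment (assumed averaging)."""
    vals = [logfcc(x, y, a) for x, y in zip(frames[:-1], frames[1:])]
    return sum(vals) / len(vals)

def is_speech_dominated(frames, threshold=0.4):
    """Keep the segment unless its average LogFcc is smaller than the threshold."""
    return segment_logfcc(frames) >= threshold
```

Because the correlation is taken after the logarithm, scaling one frame by a constant gain leaves LogFcc unchanged, so the feature responds to the shape of the spectrum rather than its level.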
In addition, transforming the linear frequency scale into a logarithmic one restrains the notable frequency differences between high order harmonic structures in two successive frames. Accordingly, a TF segment whose average LogFcc value is smaller than a threshold is considered as noise and is discarded before further processing. Otherwise, the average LogFcc values are saved and used for the overall F0 tracking in the next step. Here the threshold is empirically set to 0.4.

[Fig. 2. Overview of F0 candidate estimation.]

3.2. F0 candidate contour estimation

In this subsection, we perform F0 candidate estimation in the initially detected speech dominated TF segments. Here long term TF segments are used instead of short term ones to increase the frequency resolution for F0 estimation. Fig. 2 shows the general flowchart of the F0 candidate estimation. We take a long term TF segment as an example. The ACF is obtained for each frame and is normalized by dividing by the maximum amplitude in each frame. The frequencies of the ACF peaks in each frame are considered as the F0 candidates, and the amplitudes of the corresponding normalized ACF peaks are considered as the observation likelihoods of the candidates being the true F0. Meanwhile, the F0 transition probability between two consecutive frames, $p(F0_t \mid F0_{t-1})$, is learned from the Keele [25] and CSTR [26] databases, both of which provide ground truth F0 values. We assume that $p(F0_t \mid F0_{t-1})$ is equivalent to the probability of the F0 change on a logarithmic scale between two neighboring frames, as shown in Eq. (4):

$$p(F0_t \mid F0_{t-1}) = p\left( \log_{1.5} \frac{F0_t}{F0_{t-1}} \right) \qquad (4)$$

A Gaussian mixture model is adopted to model the logarithmic F0 change, which is then used as the F0 transition probability. With the observed likelihood and the F0 transition probability, the Viterbi algorithm is applied for F0 decoding.

Furthermore, we use the sub-harmonic summation technique [3] [27] [28] to correct some F0 estimation errors caused by the ACF based approach. The core of sub-harmonic summation is to compress the spectrum vector in each frame along the frequency axis by a series of integer factors and to sum the compressed spectra together. As a consequence, multiple harmonics coincide and reinforce each other, producing a maximum spectrum peak at the fundamental frequency.
In our case, the integer factors are equal to the harmonic orders calculated by dividing the TF segment frequency bound by the F0s initially detected with the ACF method. The frequency of the maximum peak of the compressed and summed spectrum is considered as the updated F0 candidate in each frame. The idea behind this is that the ACF based method and the sub-harmonic summation based method should produce the same F0 results. If they conflict, there is a high probability that the estimated F0 is wrong. In our case, one typical cause of F0 error in ACF estimation is that some TF segments are occupied by noise spectrum peaks located at equal distances. Unfortunately, those frequency distances are easily detected by ACF as F0. However, these spectrum peaks are not harmonically related to each other, and their frequencies do not have a common factor. Therefore the sub-harmonic summation technique provides a post-processing step for error correction.
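The cross-check between the ACF estimate and sub-harmonic summation can be illustrated as follows. This is a minimal sketch under stated assumptions, not the authors' code: the spectrum is assumed to be sampled on a uniform bin grid with spacing `df` Hz, candidates are searched only on that grid, and the names `shs_f0` and `agrees`, the 60-400 Hz search range, and the 5% agreement tolerance are all hypothetical choices made for the example.

```python
import math

def shs_f0(spectrum, df, f0_acf, f_upper, f_min=60.0, f_max=400.0):
    """Return the grid frequency (Hz) that maximizes the sub-harmonic sum.

    spectrum : magnitude spectrum, spectrum[b] is the value at b * df Hz
    f0_acf   : initial F0 from the ACF step (sets the number of harmonics)
    f_upper  : upper frequency bound of the TF segment
    """
    # Harmonic orders: segment frequency bound divided by the initial F0.
    n_harm = max(1, int(f_upper // f0_acf))
    best_f, best_val = None, float("-inf")
    for b in range(int(math.ceil(f_min / df)), int(f_max // df) + 1):
        # Compressing the spectrum by factor k and reading bin b is the same
        # as reading bin b * k of the original spectrum.
        total = sum(spectrum[b * k] for k in range(1, n_harm + 1)
                    if b * k < len(spectrum))
        if total > best_val:
            best_f, best_val = b * df, total
    return best_f

def agrees(f0_acf, f0_shs, tol=0.05):
    """True if the two estimates agree within a relative tolerance (assumed 5%)."""
    return abs(f0_acf - f0_shs) <= tol * f0_acf
```

For a harmonic spectrum with peaks at 200, 400, 600 and 800 Hz, the summed spectrum peaks at 200 Hz and confirms an ACF estimate of 200 Hz. For peaks at 250, 400, 550 and 700 Hz, which are equally spaced by 150 Hz but not harmonically related, no candidate collects more than one peak, so the result does not confirm an ACF estimate of 150 Hz.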
[Fig. 3. TF segment status representation.]

4. OVERALL F0 TRACKING

4.1. Feature extraction for each TF segment

With the estimated F0 candidate contours in each TF segment, we select the optimal pitch by searching for the speech dominated TF segments on the overall TF plane. Two parameters are proposed for measuring the likelihood that a specific TF segment is speech dominated rather than noise dominated. One is LogFcc, which we described in Section 3, and the other is an SNR value, which is explained here. The SNR is estimated for each TF segment based on harmonic regeneration with the estimated pitch candidate contour [22]. First, the harmonic amplitude is obtained by choosing the spectrum peak which is closest to the ideal harmonic frequency ($k F0$) within the predefined deviation range, as shown in Eq. (5):

$$A_k = A_P\left( \left\langle k \, F0 \, N_{fft} / f_s \right\rangle \right) \qquad (5)$$

where $A_k$ is the selected amplitude of the $k$th order harmonic, $A_P$ is the spectrum amplitude peak vector, and $\langle a \rangle$ represents selecting an existing peak position that is closest to $a$. Next, the regenerated harmonic spectrum is obtained by convolving the harmonic peaks with the spectrum of the Hamming window (with the same size as the short term speech analysis window), see Eq. (6):

$$S_H(n) = A_{ham}(n) * \sum_{k=K_1}^{K_2} A_k \, \delta\left( n - F_k N_{fft} / f_s \right) e^{j P_k} \qquad (6)$$

where $A_{ham}$ is the spectrum amplitude of the Hamming window; $A_k$, $F_k$ and $P_k$ are the amplitude, frequency and phase of the $k$th order harmonic; $P_k$ is extracted from the noisy speech directly; and $K_1$ and $K_2$ are the lower and upper harmonic order bounds of a particular TF segment, with $K_1 = f_l / \hat{F}0$ and $K_2 = f_u / \hat{F}0$, where $f_l$ and $f_u$ are the lower and upper frequency bounds of that TF segment. In addition, $*$ denotes convolution. Accordingly, the SNR in each frame is calculated as Eq. (7):

$$SNR_{esti} = \ldots \qquad (7)$$

where $l$ is the index of the frame, $L$ is the total frame number in a TF segment, and $S_N(\cdot)$ is the noisy speech spectrum. (The right-hand side of Eq. (7) is not legible in this copy; the surviving fragments show a $\max(\cdot)$ operation, the constant 10, and a sum over $l = 1, \ldots, L$ involving $S_H$ and $S_N$.) Finally, the observation likelihood of a specific TF segment containing the true F0 candidate is obtained as:

$$p = \ldots \qquad (8)$$

(The right-hand side of Eq. (8) is not legible in this copy; the surviving fragments are $SNR_{esti}$ and LogFcc.)

4.2. F0 tracking

The F0 tracking step selects the best F0 candidate from the candidate list for each frame. Here we model all of the F0 candidates as states in a hidden Markov model (HMM).
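Decoding one candidate per frame from observation likelihoods and a transition probability on the logarithmic F0 change, as already used in Section 3.2, can be sketched with a small Viterbi routine. This is an illustrative sketch only: the paper models the log-F0 change with a Gaussian mixture model learned from the Keele and CSTR databases, whereas the stand-in below is a single unnormalized Gaussian with a hypothetical width `sigma`; the names `log_change_score` and `viterbi` are likewise hypothetical, and the TF segment status term of the overall tracker is left out.

```python
import math

TINY = 1e-300  # floor to keep log() finite; an implementation detail of this sketch

def log_change_score(f_prev, f_cur, sigma=0.3, base=1.5):
    """Unnormalized Gaussian score of the base-1.5 log F0 change (GMM stand-in)."""
    d = math.log(f_cur / f_prev, base)
    return math.exp(-0.5 * (d / sigma) ** 2)

def viterbi(cands, likes, trans=log_change_score):
    """Pick one F0 per frame.

    cands[t] : list of F0 candidates (Hz) in frame t
    likes[t] : observation likelihoods of those candidates
    """
    # delta[s]: best log-score of any path ending in candidate s of the current frame.
    delta = [math.log(max(p, TINY)) for p in likes[0]]
    back = []
    for t in range(1, len(cands)):
        new_delta, ptr = [], []
        for s, f in enumerate(cands[t]):
            scores = [delta[r] + math.log(max(trans(fp, f), TINY))
                      for r, fp in enumerate(cands[t - 1])]
            r_best = max(range(len(scores)), key=scores.__getitem__)
            new_delta.append(scores[r_best] + math.log(max(likes[t][s], TINY)))
            ptr.append(r_best)
        delta = new_delta
        back.append(ptr)
    # Backtrack from the best final state.
    s = max(range(len(delta)), key=delta.__getitem__)
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    path.reverse()
    return [cands[t][s] for t, s in enumerate(path)]
```

Working in the log domain turns the product of likelihoods and transition probabilities into a sum, which avoids numerical underflow over long utterances. A smooth contour is preferred over an octave jump even when the jumping candidate has a slightly higher observation likelihood in a single frame.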
The TF segment features obtained in Section 4.1 are used as the observation likelihoods of the F0 candidate states. In addition, we propose an F0 transition probability for the model that contains two different dynamic factors. One is the F0 change over consecutive time frames, and the other is the F0 change over adjacent TF segments. The former is obtained with the same procedure as in Section 3.2, while the latter is defined in Eq. (9):

$$p(S_{i,j} \mid S_{i',j'}) = \begin{cases} 0.7, & i = i',\ j = j' + 1 \\ 0.2, & j = 1,\ j' = 5 \\ 0.1, & \text{others} \end{cases} \qquad (9)$$

where $S_{i',j'}$ and $S_{i,j}$ are the TF segment status of the previous and the current F0 candidate, respectively; $i$ denotes the frequency channel index, which runs from one to the total number of channels; and $j$ represents the frame index in each TF segment, which runs from one to the overall frame number of a TF segment.

Fig. 3 shows an example of the TF segment status. Each purple horizontal bar represents a TF segment. In fact the TF segments overlap in both time and frequency; however, we display the overlapped TF segments separately in different frequency channels in Fig. 3. The TF segment states are shown on the bar frame by frame. Since the F0 candidates are estimated from TF segments which overlap in both time and frequency, the optimal F0 track might switch between different TF segments. Nevertheless, it is essential to guarantee that in most cases the F0 track goes through an entire TF segment, avoiding frequent halfway hopping between adjacent TF segments. Therefore, we assign a higher probability to F0 transitions inside a TF segment, and a lower probability to the other cases.

With the observation likelihoods and the F0 transition probabilities, the Viterbi algorithm is used to decode the overall F0 contour by maximizing the likelihood, as shown in Eq. (10):

$$Q_T = \arg\max_{1 \le i \le N_c,\ 1 \le t \le N_F} \left[ p(F0_t) \, p(F0_t \mid F0_{t-1}) \, p(S_{i,j} \mid S_{i',j'}) \right] \qquad (10)$$

where $p(F0_t)$ is the observed probability of the current F0 candidate, which is equal to $p$ obtained in Eq. (8), and $p(F0_t \mid F0_{t-1})$ is the frame based F0 state transition probability.

5. EXPERIMENTS AND RESULTS

We use the Keele [25] and CSTR [26] databases for the performance evaluation; both provide ground truth pitch labels that can be used as a reference for performance assessment.
The Keele database contains 10 long sentences spoken by five female and five male native British English speakers, with a total duration of 9 mins. The CSTR database contains 50 English utterances, spoken by one female and one male native English speaker, with a duration of 7 mins. Six types of daily life noise are used to simulate the noisy environments: airport, babble, exhibition, restaurant, street, and train noise. Seven SNR levels are set from -10dB to 20dB. Three other state-of-the-art non-parametric F0 estimation algorithms are used for performance comparison: RAPT [5], YIN [8] and PEFAC [13]. Our algorithm is denoted as TF. Neither the proposed nor the reference algorithms require any prior voiced/unvoiced decision. Global pitch error (GPE) is used as the evaluation metric: an estimated pitch that deviates from the ground truth value by more than 5% is considered incorrect [13].

[Fig. 4. GPE results for Keele database. Panels: (a) airport, (b) babble, (c) exhibition, (d) restaurant, (e) street, (f) train.]

[Fig. 5. GPE results for CSTR database. Panels: (a) airport, (b) babble, (c) exhibition, (d) restaurant, (e) street, (f) train.]

Fig. 4 and Fig. 5 show the GPE results for the Keele and CSTR databases, respectively. From Fig. 4 and Fig. 5 we can see that our proposed algorithm outperforms the reference algorithms in most of the noise conditions. However, there are still several noise scenes (e.g., exhibition, train) at low SNR levels (-10dB) where PEFAC performs better than the proposed algorithm. The reason is probably that at low SNR levels fewer speech dominated TF segments stand out over the noise, which brings down the accuracy of the pitch candidate contour estimation. In this case, a full band spectrum with enough redundancy is preferable for pitch estimation. When the SNRs are above 0dB, our algorithm is comparable with all of the reference methods.

6. CONCLUSIONS

We presented a study on noise-robust F0 estimation by exploring the temporal harmonic structures in local TF segments. First, a series of F0 candidate contours are estimated from different TF segments. Second, F0 tracking is performed across the overall TF plane to select the best F0. The speech dominated TF segments have a better SNR level than the full band signal, and hence the harmonic structures in these high SNR TF segments provide a more reliable source for F0 estimation in noise. Experiments and results have shown that our algorithm substantially outperforms the compared state-of-the-art methods in terms of pitch estimation accuracy.
8. REFERENCES

[1] M. Asgari, A. Bayestehtashk, I. Shafran, Robust and accurate features for detecting and diagnosing autism spectrum disorders, in Proc. INTERSPEECH, Lyon, France, pp. 191-194, 2013.
[2] Y. Yang, C. Fairbairn, J. F. Cohn, Detecting depression severity from vocal prosody, IEEE Trans. Audio Speech Lang. Process., vol. 4, no. 2, pp. 142-150, 2013.
[3] D. J. Hermes, Measurement of pitch by subharmonic summation, J. Acoust. Soc. Am., vol. 83, no. 1, pp. 257-264, 1988.
[4] H. Duifhuis, L. F. Willems, R. J. Sluyter, Measurement of pitch in speech: An implementation of Goldstein's theory of pitch perception, J. Acoust. Soc. Am., vol. 71, no. 6, pp. 1568-1580, 1982.
[5] D. Talkin, Robust algorithm for pitch tracking, Speech Coding and Synthesis, pp. 497-518, 1995.
[6] Y. Gong, J. Haton, Time domain harmonic matching pitch estimation using time-dependent speech modeling, IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-35, no. 10, pp. 1386-1400, Oct. 1987.
[7] W. Hess, Pitch Determination of Speech Signals. Springer-Verlag, Berlin, Germany, 1983.
[8] A. Cheveigne, H. Kawahara, YIN, a fundamental frequency estimator for speech and music, J. Acoust. Soc. Am., vol. 111, no. 4, pp. 1917-1930, 2002.
[9] T. Shimamura, H. Kobayashi, Weighted autocorrelation for pitch extraction of noisy speech, IEEE Trans. Speech, Audio Processing, vol. 9, no. 7, pp. 727-730, Oct. 2001.
[10] F. Huang, T. Lee, Pitch estimation in noisy speech using accumulated peak spectrum and sparse estimation technique, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 21, no. 1, pp. 99-109, 2013.
[11] D. Liu, C. Lin, Fundamental frequency estimation based on the joint time-frequency analysis of harmonic spectral structure, IEEE Trans. Acoust., Speech, Signal Process., vol. 9, no. 6, pp. 609-621, Sep. 2001.
[12] J. L. Roux, H. Kameoka, N. Ono, A. Cheveigne, S. Sagayama, Single and multiple F0 contour estimation through parametric spectrogram modeling of speech in noisy environments, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 15, no. 4, pp. 1135-1145, 2007.
[13] S. Gonzalez, M. Brookes, PEFAC - A pitch estimation algorithm robust to high levels of noise, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 22, no. 2, pp. 518-530, 2014.
[14] H. Boril, P. Pollak, Direct time domain fundamental frequency estimation of speech in noisy conditions, in Proc. Eurospeech, 2004, vol. 2, pp. 1003-1006.
[15] M. Wu, D. Wang, A multipitch tracking algorithm for noisy speech, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 11, no. 3, pp. 229-241, 2003.
[16] B. S. Lee, D. P. W. Ellis, Noise robust pitch tracking by subband autocorrelation classification, in Proc. Interspeech 2012, Sep. 2012, Portland.
[17] L. N. Tan, A. Alwan, Multi-band summary correlogram-based pitch detection for noisy speech, Speech Communication, vol. 55, pp. 841-856, 2013.
[18] M. Mauch, S. Dixon, PYIN: A fundamental frequency estimator using probabilistic threshold distributions, ICASSP 2014, May 2014, Florence.
[19] W. Chu, A. Alwan, SAFE: A statistical approach to F0 estimation under clean and noisy conditions, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 20, no. 3, pp. 933-944, 2012.
[20] K. Han, D. Wang, Neural network based pitch tracking in very noisy speech, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 22, no. 12, pp. 2158-2168, 2014.
[21] E. Terhardt, Calculating virtual pitch, Hearing Research, vol. 1, pp. 155-182, 1979.
[22] D. Wang, P. C. Loizou, J. H. L. Hansen, F0 estimation in noisy speech based on long-term harmonic feature analysis combined with neural network classification, in Proc. Interspeech 2014, Sep. 2014, Singapore.
[23] M. Cooke, A glimpsing model of speech perception in noise, J. Acoust. Soc. Am., vol. 119, no. 3, pp. 1562-1573, 2005.
[24] Q. Huang, D. Wang, Single channel speech separation based on long-short frame associated harmonic model, Digital Signal Processing, vol. 21, pp. 497-507, Mar. 2011.
[25] F. Plante, G. Meyer, W. A. Ainsworth, A pitch extraction reference database, in Proc. Eurospeech, 1995, pp. 837-840.
[26] P. C. Bagshaw, S. M. Hiller, M. A. Jack, Enhanced pitch tracking and the processing of F0 contours for computer aided intonation teaching, in Proc. Eurospeech, 1993, vol. 2, pp. 1003-1006.
[27] E. Terhardt, Pitch, consonance, and harmony, J. Acoust. Soc. Am., vol. 55, pp. 1061-1069, 1974.
[28] E. Terhardt, G. Stoll, M. Seewann, Algorithm for extraction of pitch and pitch salience from complex tonal signals, J. Acoust. Soc. Am., vol. 71, pp. 679-688, 1982.
[29] M. Gainza, B. Lawlor, E. Coyle, Multi pitch estimation by using modified IIR comb filters, in Proc. International Symposium focused on Multimedia Systems and Applications (ELMAR), Zadar, 2005.
[30] D. Wang, J. H. L. Hansen, E. Tobey, F0 estimation for noisy speech based on exploring local time frequency segment, in Proc. WASPAA-2015, Oct. 2015, New Paltz.