
AN ESTIMATION METHOD OF VOICE TIMBRE EVALUATION VALUES USING FEATURE EXTRACTION WITH GAUSSIAN MIXTURE MODEL BASED ON REFERENCE SINGER

Soichi Yamane (1), Kazuhiro Kobayashi (1), Tomoki Toda (2), Tomoyasu Nakano (3), Masataka Goto (3), Satoshi Nakamura (1)

(1) Graduate School of Information Science, Nara Institute of Science and Technology (NAIST), Japan
(2) Information Technology Center, Nagoya University, Japan
(3) National Institute of Advanced Industrial Science and Technology (AIST), Japan

ABSTRACT

This paper presents an estimation method of voice timbre evaluation values for arbitrary singers' singing voices generated with a singing voice synthesis system, towards the development of a singing voice retrieval system. The voice timbre evaluation values are numerical values corresponding to voice timbre expression words, such as "Age" and "Gender", and they usually need to be manually assigned to individual singers' singing voices through listening. To make it possible to estimate them automatically from a given singer's singing voices, an acoustic feature that captures only each singer's voice timbre is extracted with a Gaussian mixture model trained using parallel data between singing voices sung by many pre-stored target singers and the same voices sung by a reference singer. Then, the voice timbre evaluation values are estimated from the extracted feature using regression models. The experimental results show that the proposed method is capable of accurately estimating those values for some expression words, such as "Age" and "Gender", and that nonlinear regression is effective for the expression words "Powerfulness" and "Uniqueness".

Index Terms: singing voice synthesis, voice timbre, estimation of evaluation values, Gaussian mixture model, reference singer

1. INTRODUCTION

In creating vocal music, a singing voice synthesis system, such as VOCALOID [1], UTAU [2] or Sinsy [3], is often used by many end users.
The singing voice synthesis system allows the users to easily synthesize a singing voice as they want by manually inputting score information, such as pitch, onset time, and duration, to control the melody, and linguistic information to represent the desired lyrics. Moreover, it enables the users to easily change the voice timbre of synthesized singing voices by not only manipulating control parameters of voice timbre but also selecting different singers' singing voice data. In particular, the voice timbre of singing voices has a large impact on the vocal music, and therefore it is important for the users to select it carefully so that the voice timbre of the synthesized singing voices suits the music created by the user. However, a large number of singing voice data are currently available, and this number tends to increase more and more. For example, the UTAU voice libraries [2] include over 5,000 kinds of singing voice data [4]. Consequently, it is difficult to search for the most suitable singing voice data among them. It will be helpful to develop a system to retrieve the suitable singing voice data. (We thank Suzuki Serif, who provided the voice timbre evaluation values for the UTAU voice libraries used in these experiments. This work was supported in part by JSPS KAKENHI Grant Number 26286 and by the JST OngaCREST project.)

Music information retrieval has been widely studied, and various methods focusing on singing voices have also been proposed. For example, several music information retrieval systems based on voice timbre similarity have been proposed to search for music data including a singing voice whose voice timbre is similar to that in input music data [5-10]. As another retrieval method, a singing style retrieval method uses an input singer's singing voice to search for singing voices sung in a singing style similar to that of the input one, based on a similarity measure calculated with a probability distribution on the phase plane (i.e., the f0-Δf0 plane) to express dynamic variations of f0 patterns [11].
In these methods, the user basically needs to input reference singing voice or music data as a query whose singing style or voice timbre is similar to the target one that the user wants to search for. Therefore, it is still difficult to search for the desired singing voices used in the singing voice synthesis system if the user cannot find such suitable reference data. To develop a singing voice retrieval method with no need of reference data, it is essential to define a measure to describe target characteristics of singing voices, such as a singing style or voice timbre. In this paper, we focus on voice timbre because a singing style can be well controlled by the users in the singing voice synthesis system. As related work, there have been several attempts at describing the voice quality of speaking voices using an evaluation value on voice quality expression words [12] and singing impression words [13]. According to the article [12], several word pairs expressing voice quality, such as husky/clear (clearness) and elder/younger (age), have been selected by applying factor analysis to the result of a large-scale perceptual evaluation using many speakers' natural voices. Each word pair can be used to manually assign a 5-scaled evaluation value (-2: disagree, -1: disagree a little, 0: neither, 1: agree a little, 2: agree) to describe the voice quality of individual natural voices. It has been reported that this description method is helpful for developing a speech synthesis system or a voice conversion system that makes it possible for the users to intuitively control the voice quality of synthesized/converted speech as they want [14, 15]. Inspired by this conventional method, in this paper we also use evaluation values for several word pairs to describe the voice timbre of individual singers' singing voices. These voice timbre evaluation values will be effectively used to evaluate the similarity of voice timbre between different singing voice data, and also to intuitively design an input query for searching for singing voice data with the desired voice timbre.
On the other hand, in order to develop such a singing voice data retrieval system, it is inevitable to assign the voice timbre evaluation values to all existing singing voice data. However, these

values basically need to be manually assigned to each singing voice data through listening. In order to reduce the huge amount of effort this requires, it is worthwhile to develop a technique for automatically assigning these values to the existing singing voice data. In this paper, we propose an automatic estimation method of the voice timbre evaluation values towards the development of a singing voice data retrieval system that helps the users find their desired singing voice data to be used in the singing voice synthesis system. To extract an acoustic feature that captures only the voice timbre of each singing voice data, we use the joint probability density modeling method based on a reference singer [16], which was originally proposed as a voice conversion technique [17, 18] and then was successfully applied to singing voice conversion as well [19]. This method makes it possible to separately model the voice timbre of singing voices and the acoustic variations caused by changes of phones, using a Gaussian mixture model (GMM) trained with parallel data sets of synthesized singing voices between the reference singer and many pre-stored target singers. After extracting the acoustic features of the individual pre-stored target singers, regression analysis is performed to develop a model estimating the voice timbre evaluation values from the extracted acoustic feature. We conduct several experimental evaluations to investigate (1) the effectiveness of the proposed estimation method, (2) the effectiveness of using nonlinear regression rather than linear regression, and (3) an effective way to develop the parallel data sets used for the GMM training.

2. VOICE TIMBRE FEATURE EXTRACTION BASED ON JOINT PROBABILITY DENSITY MODELING WITH REFERENCE SINGER

As acoustic features to capture voice timbre, we used segmental features, such as a spectral parameter and an aperiodic parameter. However, these features are also affected by phonemes and prosody (i.e., F0 and duration). Therefore, it is essential to remove their effects on the acoustic features so as to represent only voice timbre.
To do so, we apply the joint probability density modeling technique with a reference singer to the voice timbre feature extraction of singing voice data used in a singing voice synthesis system. Figure 1 shows an overview of the proposed feature extraction process. First, parallel data sets of singing voices sharing the same score information and lyrics between a single reference singer and many pre-stored target singers are synthesized by using a singing voice synthesis system. The segmental features are extracted frame by frame from each singing voice, and they are time-aligned between each pre-stored target singer and the reference singer. Then, a joint probability density function of these time-aligned segmental features between the reference singer and the s-th pre-stored target singer is modeled with a GMM as follows:

P(X_t, Y_t^{(s)} \mid \mu^{(s)}, \lambda) = \sum_{m=1}^{M} \alpha_m \, \mathcal{N}\!\left( \begin{bmatrix} X_t \\ Y_t^{(s)} \end{bmatrix}; \begin{bmatrix} \mu_m^{X} \\ \mu_m^{Y,(s)} \end{bmatrix}, \begin{bmatrix} \Sigma_m^{XX} & \Sigma_m^{XY} \\ \Sigma_m^{YX} & \Sigma_m^{YY} \end{bmatrix} \right),  (1)

\mu^{(s)} = \left[ \mu_1^{Y,(s)\top}, \ldots, \mu_M^{Y,(s)\top} \right]^\top,  (2)

where X_t = [x_t^\top, \Delta x_t^\top]^\top is the joint static and dynamic feature vector of the reference singer, Y_t^{(s)} = [y_t^{(s)\top}, \Delta y_t^{(s)\top}]^\top is that of the s-th pre-stored target singer, and \top denotes transposition. The normal distribution with mean vector \mu and covariance matrix \Sigma is denoted as \mathcal{N}(\cdot; \mu, \Sigma). The total number of mixture components is M, and the mixture component index is m. The m-th mixture component weight is \alpha_m. The mean vector of the m-th mixture component consists of the reference singer's mean vector \mu_m^{X} and the s-th pre-stored target singer's mean vector \mu_m^{Y,(s)}. The covariance matrix of the m-th mixture component consists of the reference singer's covariance matrix \Sigma_m^{XX}, the pre-stored target singer's covariance matrix \Sigma_m^{YY}, and their cross-covariance matrices \Sigma_m^{XY} and \Sigma_m^{YX}. Note that these parameters, except for the pre-stored target singers' mean vectors, are shared among all pre-stored target singers; only the target mean vectors depend on the individual pre-stored target singers. Therefore, they are concatenated to form the super vector \mu^{(s)} of Eq. (2), used as the voice timbre feature vector of the s-th pre-stored target singer. The other parameters are included in the shared parameter set \lambda.
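To make the construction of these joint vectors concrete, the following minimal numpy sketch builds the static-plus-dynamic vectors X_t and Y_t^(s) from frame-aligned static features and stacks them. The function names are ours, a simple first difference stands in for whatever delta window the actual system uses, and frames are assumed to be already time-aligned (e.g., by DTW).

```python
import numpy as np

def join_static_dynamic(c):
    """Append dynamic (delta) features to static frames c of shape (T, D)."""
    delta = np.diff(c, axis=0, prepend=c[:1])  # simple first difference
    return np.hstack([c, delta])               # shape (T, 2D)

def joint_vectors(x_static, y_static):
    """Stack time-aligned reference (X) and target (Y) frames into the
    joint vectors [X_t, Y_t] modeled by the GMM of Eq. (1)."""
    X = join_static_dynamic(x_static)
    Y = join_static_dynamic(y_static)
    T = min(len(X), len(Y))                    # assumes prior time alignment
    return np.hstack([X[:T], Y[:T]])           # shape (T, 4D)
```

With 24-dimensional static features, each joint vector has 96 dimensions (static + delta for both singers).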
To optimize these parameters, first a target-singer-independent GMM is trained using all parallel data sets between the reference singer and the individual pre-stored target singers as follows:

\{\hat{\mu}, \hat{\lambda}\} = \arg\max_{\{\mu, \lambda\}} \prod_{s=1}^{S} \prod_{t=1}^{T_s} P(X_t, Y_t^{(s)} \mid \mu, \lambda),  (3)

where T_s is the number of time-aligned frames of the parallel data set for the s-th pre-stored target singer and S is the total number of pre-stored target singers. Then, a singer-dependent GMM for the s-th pre-stored target singer, which is given by Eqs. (1) and (2), is trained by updating only the super vector using only the s-th parallel data set as follows:

\hat{\mu}^{(s)} = \arg\max_{\mu} \prod_{t=1}^{T_s} P(X_t, Y_t^{(s)} \mid \mu, \hat{\lambda}).  (4)

It is noted that each mixture component consistently models the same phone space over all singers (i.e., the reference singer and all pre-stored target singers), thanks to the parameters shared over different singers and the use of parallel data in the parameter optimization [16, 17]. In the joint density modeling, the reference singer's data plays the role of an anchor that aligns each mixture component to the same phone space over all pre-stored target singers. Consequently, acoustic variations caused by different phones are modeled with different mixture components, and only singer-dependent acoustic features are modeled with the super vector in Eq. (2). The use of parallel data sets that also share prosodic features as well as linguistic information effectively makes the differences of the resulting super vectors between different pre-stored target singers depend only on the differences of their voice timbre.

3. AUTOMATIC ESTIMATION OF VOICE TIMBRE EVALUATION VALUE USING REGRESSION ANALYSIS

Regression analysis is applied to the estimation of the voice timbre evaluation values. First, we manually assign the voice timbre evaluation values for pre-determined word pairs expressing voice timbre to all of, or a part of, the pre-stored target singers. Then, we develop a regression model to estimate the voice timbre evaluation values from the voice timbre feature using the pre-stored target singers' data.
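The singer-dependent update of Eq. (4) can be sketched as follows. This is a simplified illustration rather than the paper's implementation: component posteriors are computed under the shared GMM, and only the target-side block of each component mean is re-estimated as a posterior-weighted average, ignoring the cross-covariance coupling an exact maximization would involve; all function names are ours.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Per-frame log density of a full-covariance Gaussian; x is (T, D)."""
    d = x - mean
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, d.T)
    return (-0.5 * np.sum(z ** 2, axis=0)
            - np.log(np.diag(L)).sum()
            - 0.5 * len(mean) * np.log(2 * np.pi))

def super_vector(weights, means, covs, joint_feats, dim_x):
    """Simplified Eq. (4): compute component posteriors under the shared
    (target-singer-independent) GMM, re-estimate only the target-side block
    of each component mean, and concatenate them into the super vector of
    Eq. (2)."""
    M = len(weights)
    logp = np.stack([np.log(weights[m]) + log_gauss(joint_feats, means[m], covs[m])
                     for m in range(M)])            # (M, T)
    logp -= logp.max(axis=0)                        # numerical stability
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=0)                      # posteriors, (M, T)
    y = joint_feats[:, dim_x:]                      # target-side features
    mu_y = (gamma @ y) / (gamma.sum(axis=1)[:, None] + 1e-10)
    return mu_y.ravel()                             # (M * dim_y,)
```

Running this once per singer, with the shared parameters fixed, yields the super vectors that serve as voice timbre features for the regression models of Section 3.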
3.1. Estimation Method based on Multiple Regression

The voice timbre evaluation values of the s-th pre-stored target singer are stored as a voice timbre evaluation vector w^{(s)} = [w_1^{(s)}, \ldots, w_J^{(s)}]^\top, where w_j^{(s)} is the voice timbre evaluation

value for the j-th pre-determined word pair, and the number of voice timbre evaluation values (i.e., the number of pre-determined word pairs) is J.

[Fig. 1. Extraction of voice timbre features using parallel data: joint features of the reference singer and the target singers are used to train a target-singer-independent GMM (SI-GMM) by maximum likelihood estimation of its mixture weights, mean vectors, and covariance matrices; singer-dependent GMMs (SD-GMMs) are then obtained by updating only the target singers' mean vectors, which are concatenated into super vectors.]

In multiple regression analysis, the voice timbre evaluation vector w^{(s)} is estimated from the corresponding super vector \mu^{(s)} as follows:

\hat{w}^{(s)} = A \mu^{(s)} + b,  (5)

where A and b are regression parameters, which are determined with minimum mean square error estimation between the given voice timbre evaluation vectors w^{(1)}, \ldots, w^{(S)} and the estimated ones \hat{w}^{(1)}, \ldots, \hat{w}^{(S)}.

3.2. Estimation Method based on Kernel Regression

In kernel regression analysis, the voice timbre evaluation vector of the s-th target singer is estimated from the corresponding super vector as follows:

\hat{w}^{(s)} = V^\top \phi(\mu^{(s)}),  (6)

where \phi(\cdot) is a function that maps the super vector to a higher-dimensional feature space, and V is a regression parameter in the higher-dimensional feature space, which is given by

V = \sum_{s=1}^{S} \phi(\mu^{(s)}) Z_s^\top,  (7)

where Z_s is a weighting parameter for the s-th mapped super vector. From Eqs. (6) and (7), the estimate of w^{(s)} is written as

\hat{w}^{(s)} = Z k(\mu^{(s)}),  (8)

k(\mu^{(s)}) = \left[ k(\mu^{(1)}, \mu^{(s)}), \ldots, k(\mu^{(S)}, \mu^{(s)}) \right]^\top,  (9)

where Z = [Z_1, \ldots, Z_S] and k(\cdot, \cdot) is a kernel function. In this paper, we use the Gaussian kernel as the kernel function, which is given by

k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{\sigma^2} \right),  (10)

where \sigma is a parameter of any positive value. The weighting parameter Z is determined with minimum mean square error estimation, also using regularization, as follows:

Z = W \left( K + r I \right)^{-1},  (11)

where K = [k(\mu^{(1)}), \ldots, k(\mu^{(S)})], W = [w^{(1)}, \ldots, w^{(S)}], and r is a regularization parameter.
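Both estimators follow directly from Eqs. (5) and (8)-(11); the numpy sketch below is illustrative (the function names and the values of σ and r are ours, not from the paper).

```python
import numpy as np

def fit_linear(M, W):
    """Multiple regression (Eq. 5): least-squares fit of W ~ A*mu + b.
    M: (S, D) super vectors; W: (S, J) evaluation vectors."""
    M1 = np.hstack([M, np.ones((len(M), 1))])          # absorb the bias b
    Ab, *_ = np.linalg.lstsq(M1, W, rcond=None)
    return Ab                                           # (D + 1, J)

def predict_linear(Ab, mu):
    return np.append(mu, 1.0) @ Ab

def gauss_kernel(x, xp, sigma):
    """Gaussian kernel of Eq. (10)."""
    return np.exp(-np.sum((x - xp) ** 2) / sigma ** 2)

def fit_kernel(M, W, sigma, r):
    """Kernel regression (Eq. 11): Z = W (K + r I)^(-1); K is the Gram
    matrix over the training super vectors, rows of W are w^(s)."""
    K = np.array([[gauss_kernel(a, b, sigma) for b in M] for a in M])
    # (K + rI) is symmetric, so W^T (K + rI)^(-1) = solve(K + rI, W)^T
    return np.linalg.solve(K + r * np.eye(len(M)), W).T  # (J, S)

def predict_kernel(Z, M, mu, sigma):
    """Eq. (8): w_hat = Z k(mu)."""
    k = np.array([gauss_kernel(a, mu, sigma) for a in M])
    return Z @ k
```

With a small r, the kernel model nearly interpolates the training singers; larger r trades that fit for smoothness, which is the role of the regularization parameter in Eq. (11).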
3.3. Estimation of Voice Timbre Evaluation Values for an Arbitrary Target Singer

In order to automatically estimate the voice timbre evaluation values for a given target singer, first we generate a parallel data set between the reference singer and the given target singer using a singing voice synthesis system, and then the target singer's super vector is extracted in the same manner as shown in Eq. (4). Finally, the voice timbre evaluation values are estimated from the extracted super vector using the trained regression model.

The proposed method can easily be applied to natural singing voices as well. Thanks to the singing voice synthesis system, a parallel data set consisting of the reference singer's singing voices and the natural singing voices is easily developed by generating the reference singer's singing voices corresponding to the natural ones. Therefore, the voice timbre evaluation values for the given natural singing voices are estimated in the same manner as mentioned above. It is also possible to extract the super vector of the given natural singing voices without generating the parallel data set. The probability density function of only the target acoustic features, P(Y_t^{(s)} \mid \mu, \lambda), is easily derived from the target-singer-independent GMM P(X_t, Y_t^{(s)} \mid \mu, \lambda) in Eq. (4) by marginalizing out the reference singer's acoustic features X_t. Then, the super vector is optimized by using only the acoustic features of the given natural singing voices and the marginalized probability density function. Well-known model adaptation techniques, such as maximum a posteriori estimation [20], maximum likelihood linear regression [21], and eigenvoice [22], are available in this optimization.

4. EXPERIMENTAL EVALUATION

4.1. Experimental Conditions

We used 40 kinds of UTAU voice libraries as the singing voice data. Three different types of synthesized voices, Monosyllabic voice, Talking voice, and Singing voice, were generated by using the UTAU singing voice synthesis system, and they were used for training and evaluation. Monosyllabic voice consisted of 100 kinds of Japanese syllables generated with 7 different F0 values, i.e., 700 samples in total.
Talking voice consisted of 50 synthesized voices generated by manually mimicking the prosody of Japanese normal speech in the ATR phoneme-balanced sentence set [23]. Singing voice consisted of 6 phrases extracted from songs using MIDI data of the RWC Japanese popular music database [24]. The length of each sample of Monosyllabic voice, Talking voice, and Singing voice was about 2, 5, and 20 seconds, respectively.

[Fig. 2. Correlation coefficients between the estimated and target voice timbre evaluation values when using (left) multiple regression, (middle) kernel regression, and (right) different types of parallel data. Each graph plots the correlation coefficient for each voice timbre expression word (AGE, CLR, GEN, LSN, POW, UNQ); the left and middle graphs compare the Mel-cepstrum, Band aperiodicity, and Joint features, and the right graph compares Monosyllabic voice, Talking voice, and Singing voice.]

Table 1. Word pairs to express voice timbre

  Specification         Word pair
  Adultness (AGE)       Young - Adult
  Clearness (CLR)       Noisy - Clear
  Gender (GEN)          Feminine - Masculine
  Listenability (LSN)   Lisping - Lucidly
  Powerfulness (POW)    Tender - Powerful
  Uniqueness (UNQ)      Universality - Peculiarity

STRAIGHT analysis [25] was used to extract the spectral envelope, which was further parameterized into the 1st through 24th mel-cepstral coefficients as a spectral feature. As an excitation feature, STRAIGHT analysis [26] was also used to extract aperiodic components, which were averaged in five frequency bands, i.e., 0-1, 1-2, 2-4, 4-6 and 6-8 kHz. The frame shift was set to 5 ms. The sampling frequency was set to 16 kHz. The number of mixture components for the spectral feature was 128, and that for the aperiodic feature was 16. Table 1 shows the 6 kinds of word pairs to express voice timbre used in this evaluation. A 7-scaled evaluation value was used to describe the voice timbre corresponding to each word pair (e.g., Adultness (AGE) is annotated from 1 (Young) to 7 (Adult)). 9 annotators manually assigned these values to the individual 40 singers' singing voice data, and the values averaged over all annotators were then used as the voice timbre evaluation values. For each acoustic feature (i.e., the spectral feature or the aperiodic feature), the target-singer-independent GMM was trained using all singers' singing voice data. Then, the super vector for each singer was extracted.
In the regression analysis, leave-one-out cross-validation was performed to train the regression model and evaluate its estimation accuracy for the voice timbre evaluation values. The parameters of the kernel regression were optimized manually.

4.2. Experimental Results

Figure 2 shows the results of the estimation of the voice timbre evaluation values by the multiple regression (left graph) and the kernel regression (middle graph) when using Monosyllabic voice as the parallel data. Each graph shows results using the spectral feature (Mel-cepstrum), the aperiodic feature (Band aperiodicity), and their joint feature as the acoustic feature modeled by the GMM. We can see that the spectral feature yields higher estimation accuracy than the aperiodic feature. Even when the joint feature is used, its estimation accuracy is almost the same as that of the spectral feature. Therefore, the aperiodic feature is not effective for the estimation of the voice timbre evaluation values. We can also see that the correlation coefficients for Adultness (AGE) and Gender (GEN) are both high (that for Gender is over 0.9). On the other hand, a very weak correlation is observed for Clearness (CLR). These results are consistent for both the multiple regression and the kernel regression. Meanwhile, the correlation coefficients for Powerfulness (POW) and Uniqueness (UNQ) tend to increase when the kernel regression is used rather than the multiple regression. These results imply that there exists nonlinearity in the mapping from the super vector to these voice timbre evaluation values, although the number of dimensions of the super vector is already quite high. The results when using different types of parallel data are also shown in Fig. 2 (right graph). In this experiment, we used the joint feature and the kernel regression. We can see that the difference in estimation accuracy caused by using different types of parallel data is small, except for the estimation of the voice timbre evaluation score for Uniqueness (UNQ). Because of these small differences, we may flexibly select the type of parallel data according to that of the available singing voice data.
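The leave-one-out procedure described above can be sketched as follows (illustrative only; the multiple regression of Eq. (5) is refit with each singer held out, and a per-word-pair correlation coefficient is computed, mirroring the quantity plotted in Fig. 2).

```python
import numpy as np

def loo_correlations(M, W):
    """Leave-one-out cross-validation: hold out each singer, fit a multiple
    regression on the remaining singers, predict the held-out evaluation
    vector, and return the correlation coefficient per word pair (column).
    M: (S, D) super vectors; W: (S, J) evaluation vectors."""
    S = len(M)
    preds = np.zeros_like(W, dtype=float)
    for s in range(S):
        keep = np.arange(S) != s
        M1 = np.hstack([M[keep], np.ones((S - 1, 1))])
        Ab, *_ = np.linalg.lstsq(M1, W[keep], rcond=None)
        preds[s] = np.append(M[s], 1.0) @ Ab
    return np.array([np.corrcoef(preds[:, j], W[:, j])[0, 1]
                     for j in range(W.shape[1])])
```

The same loop applies to the kernel regression by swapping in its fit and predict steps.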
5. CONCLUSION

This paper has presented an estimation method of voice timbre evaluation values from given singing voice data used in a singing voice synthesis system. To extract an acoustic feature that represents only voice timbre while minimizing the effects of acoustic variations caused by different phones or prosody, we have successfully applied joint probability density modeling using a reference singer's singing voices to the proposed estimation process. Regression analysis has also been used to estimate the voice timbre evaluation values from the extracted voice timbre feature. Experimental evaluations of the estimation of 6 voice timbre evaluation values, for Adultness, Clearness, Gender, Listenability, Powerfulness and Uniqueness, have been conducted, demonstrating that (1) very high estimation accuracy is achieved for Adultness and Gender (r > 0.9) using the mel-cepstral coefficients as the acoustic feature in the proposed method; (2) a clear improvement in estimation accuracy is observed for Powerfulness and Uniqueness when kernel regression is used instead of multiple regression; and (3) various types of parallel data, such as Monosyllabic voice, Talking voice, and Singing voice, can also be used in the proposed method. We plan to develop a singing voice retrieval system for a singing voice synthesis system using the proposed method.

6. REFERENCES

[1] H. Kenmochi and H. Ohshita, "VOCALOID - commercial singing synthesizer based on sample concatenation," Proc. INTERSPEECH, pp. 4011-4012, Aug. 2007.
[2] Ameya, "UTAU - singing voice synthesis tool," http://utau2008.web.fc2.com.
[3] K. Oura, A. Mase, S. Muto, Y. Nankaku, and K. Tokuda, "Recent development of the HMM-based singing voice synthesis system - Sinsy," Proc. SSW7, pp. 211-216, Sep. 2010.
[4] Ruto, "UTAU voice libraries database," http://ruto.yu.to/.
[5] A. Mesaros, T. Virtanen, and A. Klapuri, "Singer identification in polyphonic music using vocal separation and pattern recognition methods," Proc. ISMIR, Sep. 2007.
[6] T. L. Nwe and H. Li, "Exploring vibrato-motivated acoustic features for singer identification," IEEE Trans. Audio, Speech, and Lang. Process., vol. 15, pp. 519-530, Feb. 2007.
[7] H. Fujihara, M. Goto, T. Kitahara, and H. G. Okuno, "A modeling of singing voice robust to accompaniment sounds and its application to singer identification and vocal-timbre-similarity-based music information retrieval," IEEE Trans. Audio, Speech, and Lang. Process., vol. 18, pp. 638-648, Mar. 2010.
[8] W. H. Tsai and H. P. Lin, "Background music removal based on cepstrum transformation for popular singer identification," IEEE Trans. Audio, Speech, and Lang. Process., vol. 19, pp. 1196-1205, July 2011.
[9] M. Lagrange, A. Ozerov, and E. Vincent, "Robust singer identification in polyphonic music using melody enhancement and uncertainty-based learning," Proc. ISMIR, Oct. 2012.
[10] T. Nakano, K. Yoshii, and M. Goto, "Vocal timbre analysis using latent Dirichlet allocation and cross-gender vocal timbre similarity," Proc. ICASSP, pp. 5239-5243, May 2014.
[11] T. Kako, Y. Ohishi, H. Kameoka, K. Kashino, and K. Takeda, "Automatic identification for singing style based on sung melodic contour characterized in phase plane," Proc. ISMIR, pp. 393-398, Oct. 2009.
[12] H. Kido and H. Kasuya, "Everyday expressions associated with voice quality of normal utterance: extraction by perceptual evaluation," Journal of the Acoustical Society of Japan, pp. 337-344, May 2001 (in Japanese).
[13] A. Kanato, T. Nakano, M. Goto, and H. Kikuchi, "An automatic singing impression estimation method using factor analysis and multiple regression," Proc. ICMC|SMC, pp. 1244-1250, Sep. 2014.
[14] M. Tachibana, T. Nose, J. Yamagishi, and T. Kobayashi, "A technique for controlling voice quality of synthetic speech using multiple regression HSMM," Proc. INTERSPEECH, pp. 2438-2441, Sep. 2006.
[15] K. Ohta, T. Toda, Y. Ohtani, H. Saruwatari, and K. Shikano, "Adaptive voice-quality control based on one-to-many eigenvoice conversion," Proc. INTERSPEECH, pp. 2158-2161, Sep. 2010.
[16] T. Toda, Y. Ohtani, and K. Shikano, "One-to-many and many-to-one voice conversion based on eigenvoices," Proc. ICASSP, pp. 1249-1252, Apr. 2007.
[17] T. Toda, Y. Ohtani, and K. Shikano, "Eigenvoice conversion based on Gaussian mixture model," Proc. INTERSPEECH, pp. 2446-2449, Sep. 2006.
[18] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio, Speech, and Lang. Process., vol. 15, no. 8, pp. 2222-2235, Nov. 2007.
[19] H. Doi, T. Toda, T. Nakano, M. Goto, and S. Nakamura, "Singing voice conversion method based on many-to-many eigenvoice conversion and training data generation using a singing-to-singing synthesis system," Proc. APSIPA ASC, Nov. 2012.
[20] J. L. Gauvain and C. H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech and Audio Process., vol. 2, pp. 291-298, Apr. 1994.
[21] C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, vol. 9, no. 2, pp. 171-185, Feb. 1995.
[22] R. Kuhn, J. C. Junqua, P. Nguyen, and N. Niedzielski, "Rapid speaker adaptation in eigenvoice space," IEEE Trans. Speech and Audio Process., vol. 8, no. 6, pp. 695-707, Nov. 2000.
[23] K. Iso, T. Watanabe, and H. Kuwabara, "Design of a Japanese sentence list for a speech database," Preprints, Spring Meeting of Acoust. Soc. Jpn., vol. 2-2-9, pp. 88-90, Mar. 1988 (in Japanese).
[24] M. Goto, T. Nishimura, H. Hashiguchi, and R. Oka, "RWC music database: music genre database and musical instrument sound database," Proc. ISMIR, pp. 229-230, Oct. 2003.
[25] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction," Speech Communication, vol. 27, no. 3-4, pp. 187-207, Apr. 1999.
[26] H. Kawahara and H. Katayose, "Scat generation research program based on STRAIGHT, a high-quality speech analysis, modification and synthesis system," Journal of IPSJ, vol. 43, no. 2, pp. 208-218, Feb. 2002 (in Japanese).