AN ON-THE-FLY MANDARIN SINGING VOICE SYNTHESIS SYSTEM


Cheng-Yuan Lin*, J.-S. Roger Jang*, and Shaw-Hwa Hwang**
*Dept. of Computer Science, National Tsing Hua University, Taiwan
**Dept. of Electrical Engineering, National Taipei University, Taiwan
Email: gavins@cs.nthu.edu.tw

ABSTRACT

An on-the-fly Mandarin singing voice synthesis system, called SINVOIS (singing voice synthesis), is proposed in this paper. The SINVOIS system receives continuous speech of the lyrics of a song and immediately generates the corresponding singing voice, based on the music score information (embedded in a MIDI file) of the song. Two sub-systems are embedded in the system: a synthesis unit generator and a pitch-shifting module. In the first, Viterbi decoding is applied to the continuous speech to generate the synthesis units for the singing voice. In the second, the PSOLA method is employed to implement the pitch-shifting function; energy, duration, and spectrum modifications of the synthesis units are implemented there as well. The synthesized singing voice sounds reasonably good: in a subjective listening test, MOS (mean opinion score) values of 4.5 and 3.1 were obtained for the original and synthesized singing voices, respectively.

1. INTRODUCTION

Text-to-speech (TTS) systems have been developed over the past few decades, and the most recent TTS systems can produce human-like, natural-sounding speech. The success of TTS systems can be attributed to their wide range of applications as well as to the advances in modern computers. On the other hand, research and development in singing voice synthesis is not as mature as in speech synthesis, partly due to its limited application domains. However, as computer-based games and entertainment become more popular, interesting applications of singing voice synthesis are emerging, including software for vocal training, synthesized singing voices for virtual singers, and so on.

In a conventional concatenation-based Chinese TTS system, the synthesis units are taken from a set of pre-recorded 411 syllable clips, representing the distinct base syllables of Mandarin Chinese. A concatenation-based singing voice synthesis system works in a similar way, except that the singing voice must be synthesized according to the music score and lyrics of a given song. More specifically, a singing voice synthesis system takes as inputs the lyrics and the melody information of a song. The lyrics are converted into syllables, and the corresponding syllable clips are selected for concatenation. The system then performs pitch/time modification and adds other desirable effects, such as vibrato and echo, to make the synthesized singing voice sound more natural. The following figure demonstrates the flow chart of a conventional singing voice synthesis system:

[Figure: flow chart of a conventional singing voice synthesis system]

However, such conventional singing voice synthesis systems cannot produce personalized singing unless the user records the 411 base Mandarin syllables in advance, which is a time-consuming process. Moreover, the co-articulation effects must be re-created, since they are not available in the original base-syllable recordings. Considering these disadvantages of the conventional systems, we propose the use of speech recognition technology as the front end of our SINVOIS system. In other words, to create a personalized singing voice, the user needs to read the lyrics, sentence by sentence, to our system. Our system then employs forced alignment via Viterbi decoding to detect the boundary of each character, as well as its consonant and vowel parts. Once these parts are identified, we can use them as synthesis units to synthesize a singing voice for the song, retaining all the timbre and co-articulation effects of the user. Other add-on features, such as vibrato and echo, can be imposed in post-processing. The following figure demonstrates the flow chart of our SINVOIS system:

[Figure: flow chart of the SINVOIS system]

2. RELATED WORK

Due to limited computing power, most previous approaches to singing voice synthesis employed acoustic models of human voice production. These include:

1. The SPASM system by Perry Cook [4]
2. The CHANT system by Bennett et al. [1]
3. Rule-based synthesis by Sundberg [14]
4. The frequency modulation method by Chowning [3]

However, the performance of the above methods is not acceptable, since the acoustic models cannot produce natural-sounding human voices. More recently, the success of concatenation-based text-to-speech systems has motivated the use of concatenation for singing voice synthesis. For example, the LYRICOS system by Macon et al. [9][10] is a typical concatenation-based singing voice synthesis system. The SMALLTALK system by the OKI company [7] in Japan is another example; it adopts the PSOLA method [6] (introduced in Section 4) to synthesize singing voices. Even though these systems can produce satisfactory results, they cannot produce personalized singing voices on the fly for a specific user.

3. GENERATION OF SYNTHESIS UNIT

The conventional method of synthesis unit generation for speech synthesis derives from a database of 411 base syllables recorded in advance by a specific person with a clear voice. Once the recordings of the base syllables are available, the speech data is processed according to the following steps:

1. End-point detection [15] based on energies and zero-crossing rates is employed to identify the exact extent of each speech recording.
2. The pitch marks of each syllable are identified; these are the positions on the time axis indicating the beginning of each pitch period.
3. The consonant part and the vowel part of each syllable are labeled.

For best performance, the above three steps are usually carried out manually, which is a rather time-consuming process. In our SINVOIS system, the singing voice must be synthesized on the fly; hence all three steps are performed automatically. Moreover, we also need to identify each syllable boundary via Viterbi decoding.

3.1 Syllable Detection

For a given recording of a lyric sentence, each syllable is detected by forced alignment via Viterbi decoding [12][13]. The process can be divided into the following two steps:

1. Each character in the lyric sentence is labeled with a base syllable. This task is not as trivial as it seems, since some character-to-syllable mappings are one-to-many. A maximum matching method is used in conjunction with a dictionary of Chinese terms to determine the best character-to-syllable mapping (a sketch of this matching is given after this list).
2. The syllable sequence of the lyric sentence is then converted into bi-phone models, forming a single-sentence linear lexicon. Viterbi decoding [12][13] is then employed to align the frames of the speech recording to the bi-phone models of this one-sentence linear lexicon, such that the state sequence of maximal probability is found. The obtained optimal state sequence indicates the best alignment of each frame to a state in the lexicon.

Therefore, we can correctly identify the position of each syllable, including its consonant and vowel parts.
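
As a rough illustration of step 1, the sketch below implements forward maximum matching for character-to-syllable mapping. The dictionary contents and the names (SYLLABLE_DICT, chars_to_syllables) are hypothetical and merely illustrative; a real system would use a full pronunciation lexicon.

```python
# Hypothetical sketch of forward maximum matching for character-to-syllable
# mapping. Multi-character terms disambiguate one-to-many readings, such as
# that of the character 樂 (yue vs. le).
SYLLABLE_DICT = {
    "音樂": ["yin", "yue"],   # "music"
    "快樂": ["kuai", "le"],   # "happy"
    "音": ["yin"],
    "樂": ["le"],             # fallback single-character reading
    "快": ["kuai"],
}

def chars_to_syllables(sentence: str, max_len: int = 4) -> list[str]:
    """Greedy forward maximum matching: at each position, take the longest
    dictionary term starting there and emit its syllable sequence."""
    syllables, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            term = sentence[i:i + length]
            if term in SYLLABLE_DICT:
                syllables.extend(SYLLABLE_DICT[term])
                i += length
                break
        else:
            i += 1  # unknown character: skip it (a real system would back off)
    return syllables

print(chars_to_syllables("音樂快樂"))  # ['yin', 'yue', 'kuai', 'le']
```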

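The forced alignment of step 2 can likewise be sketched as a dynamic program. This is a generic left-to-right Viterbi alignment over per-frame log-likelihoods, not the paper's actual HMM implementation; the log-likelihood matrix is assumed to come from the bi-phone acoustic models.

```python
import numpy as np

def viterbi_align(log_lik: np.ndarray) -> list[int]:
    """Force-align T frames to S left-to-right states (assumes T >= S).
    log_lik[t, s] is the log-likelihood of frame t under state s.
    Each frame either stays in the current state or advances to the next,
    so the recovered path gives the frame index of every state boundary."""
    T, S = log_lik.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_lik[0, 0]              # alignment starts in state 0
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else -np.inf
            if stay >= move:
                score[t, s], back[t, s] = stay + log_lik[t, s], s
            else:
                score[t, s], back[t, s] = move + log_lik[t, s], s - 1
    path = [S - 1]                           # alignment must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]                        # state label for every frame
```
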
Of course, before Viterbi decoding can be used, an acoustic model must be available. The acoustic model used here contains 52 bi-phone models, obtained from a speech corpus of 7 subjects to achieve speaker independence. The complete acoustic model ensures the precision of syllable detection. The following plot demonstrates a typical result of syllable detection:

[Figure: waveform of a lyric sentence with the detected syllable boundaries]

For simplicity, once a syllable is detected, we can also use zero-crossing rates directly to distinguish the consonant part from the vowel part, since unvoiced consonants exhibit much higher zero-crossing rates than voiced vowels.
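
A rough sketch of such a zero-crossing-rate split is given below; the frame size and the threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def split_consonant_vowel(syllable: np.ndarray, frame: int = 256,
                          zcr_thresh: float = 0.15) -> int:
    """Return the sample index where the vowel part is assumed to begin.
    Frames whose zero-crossing rate exceeds zcr_thresh are treated as
    consonant (noise-like); the first low-ZCR frame starts the vowel."""
    n_frames = len(syllable) // frame
    for i in range(n_frames):
        seg = syllable[i * frame:(i + 1) * frame]
        zcr = np.mean(np.abs(np.diff(np.sign(seg))) > 0)  # fraction of sign flips
        if zcr < zcr_thresh:
            return i * frame  # vowel onset (zero if there is no consonant)
    return 0
```
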
3.2 Identification of Pitch Marks

Pitch marks are the positions where complete pitch periods start; we need to identify them for effective time/pitch modification. In our system, pitch mark identification is performed only on the vowel part of each syllable, since the consonant part has no clearly defined pitch. Accordingly, the consonant part of a syllable is kept unchanged during pitch shifting. The steps of pitch mark identification are listed next:

1. Use the ACF (autocorrelation function) or the AMDF (average magnitude difference function) to compute the average pitch period T of a given syllable recording.
2. Find the global maximum of the syllable waveform and label its time coordinate t_m; this is the position of the first pitch mark.
3. Search for further pitch marks to the right of t_m by finding the waveform maximum in the region [t_m + 0.9T, t_m + 1.1T]. Repeat the same procedure until all pitch marks to the right of the global maximum are found.
4. Search for the pitch marks to the left of t_m analogously, using the region [t_m - 1.1T, t_m - 0.9T] instead. Repeat the same procedure until all pitch marks to the left of the global maximum are found.

The following plot shows a waveform after the pitch marks (denoted as circles) have been found:

[Figure: vowel waveform with the detected pitch marks marked as circles]

Once the pitch marks are found, we can perform the necessary pitch/time modification according to the music score of the song and add other desirable singing effects. These procedures are introduced in the next section.
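
A condensed sketch of this search is given below, assuming an ACF-based period estimate and the 0.9T to 1.1T search windows described above; the sampling rate and pitch range are illustrative assumptions.

```python
import numpy as np

def average_pitch_period(vowel: np.ndarray, fs: int = 16000) -> int:
    """Average pitch period in samples via the autocorrelation function,
    searched over an assumed pitch range of 60-400 Hz."""
    acf = np.correlate(vowel, vowel, mode="full")[len(vowel) - 1:]
    lo, hi = fs // 400, fs // 60
    return lo + int(np.argmax(acf[lo:hi]))

def find_pitch_marks(vowel: np.ndarray, T: int) -> list[int]:
    """Place pitch marks at waveform maxima, stepping about one period at a
    time from the global maximum: first rightward, then leftward."""
    marks = [int(np.argmax(vowel))]               # first mark: global maximum
    while marks[-1] + int(1.1 * T) < len(vowel):  # rightward search
        lo = marks[-1] + int(0.9 * T)
        hi = marks[-1] + int(1.1 * T)
        marks.append(lo + int(np.argmax(vowel[lo:hi])))
    while marks[0] - int(1.1 * T) >= 0:           # leftward search
        lo = marks[0] - int(1.1 * T)
        hi = marks[0] - int(0.9 * T)
        marks.insert(0, lo + int(np.argmax(vowel[lo:hi])))
    return marks
```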

4. PITCH SHIFTING MODULE

In this section we introduce the essential operations of STS (speech-to-singing conversion), which include pitch/time-scale modification and energy normalization. Afterwards, further fine-tuning, such as the echo effect, pitch vibrato, and co-articulation handling, is applied to make the singing voice more natural.

4.1 Pitch Shifting

Pitch shifting of speech/audio signals is an essential part of speech and music synthesis. There are several well-known approaches to pitch shifting:

1. PSOLA (pitch-synchronous overlap and add) [6]
2. Cross-fading [2]
3. Residual signal with PSOLA [5]
4. Sinusoidal modeling [11]

In our system, we adopt the PSOLA method to achieve a balance between quality and efficiency. The basic concept behind PSOLA is to multiply the speech signal by a Hamming window centered at each pitch mark. The windowed signals are then relocated as necessary. More specifically, to shift the pitch up, the distance between neighboring pitch marks is decreased; conversely, to shift the pitch down, the distance between neighboring pitch marks is increased. When performing a pitch-up operation, the half-width of the Hamming window is equal to the desired distance between neighboring pitch marks of the pitch-shifted signal. When performing a pitch-down operation, the half-width of the Hamming window is equal to the distance between neighboring pitch marks of the original signal. As a result, zeros may need to be inserted between two windowed signals if a pitch-down operation to less than 50% of the original pitch frequency is desired. The following plot demonstrates both the pitch-up and pitch-down operations:

[Figure: PSOLA pitch-up and pitch-down operations, showing Hamming-windowed segments relocated at the new pitch-mark spacing, with zeros inserted for large downward shifts]
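
The following is a minimal PSOLA sketch along the lines described above; it is a simplified illustration under the stated windowing assumptions, not the paper's exact implementation. Note that relocating the pitch marks also changes the overall duration, which the time modification of Section 4.2 would then restore.

```python
import numpy as np

def psola_pitch_shift(vowel: np.ndarray, marks: list[int],
                      ratio: float) -> np.ndarray:
    """Shift the pitch of a vowel segment by `ratio` (>1 raises it).
    Hamming-windowed pitch periods are overlap-added at the rescaled
    pitch-mark spacing; for ratios below 0.5 the gaps between windows
    are implicit zeros, matching the zero insertion described above.
    Requires at least two pitch marks."""
    T = int(np.median(np.diff(marks)))   # original pitch period
    new_T = max(int(T / ratio), 1)       # new pitch-mark spacing
    half = new_T if ratio > 1.0 else T   # window half-width (see text)
    win = np.hamming(2 * half)
    out = np.zeros((len(marks) - 1) * new_T + 2 * half)
    for k, m in enumerate(marks):
        if m - half < 0 or m + half > len(vowel):
            continue                     # skip marks too close to the edges
        pos = half + k * new_T           # relocated pitch mark
        out[pos - half:pos + half] += vowel[m - half:m + half] * win
    return out
```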

4.2 Time Modification

Time modification is used to increase or decrease the duration of a synthesis unit. We use a simple linear mapping method for time modification in our system; the method duplicates or deletes fundamental periods as necessary, as shown in the following diagram:

[Figure: contraction of a waveform by deleting fundamental periods, and extension by duplicating them]

Before performing time modification on a syllable, we must separate the consonant and vowel parts. Usually the consonant part is not changed at all; only the vowel part is shortened or lengthened. The speech-based recording already includes natural co-articulation, so no special arrangement is needed for this issue. (In our previous approach to singing voice synthesis based on the 411 base-syllable recordings, the transitions between syllables had to be smoothed by cross-fading over a small overlapped region.)

4.3 Energy Modification

The concatenated singing voice occasionally sounds unnatural because each synthesis unit has a different level of energy (intensity or volume). Therefore, we simply adjust the amplitude of each syllable so that its energy equals the average energy of the whole sentence. The energy normalization procedure is as follows:

1. Compute the energy of each syllable in the recorded lyric sentence, E_1, E_2, ..., E_N, where N is the number of syllables.
2. Compute the average energy E_ave = (1/N) * (E_1 + E_2 + ... + E_N).
3. Multiply the waveform of the k-th syllable by the constant sqrt(E_ave / E_k), so that its energy becomes E_ave.

On the other hand, if the recorded lyric sentence already bears the desired energy profile, this energy normalization procedure need not be applied.
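
A direct sketch of this normalization follows; energy is taken here as the mean power of each syllable, an assumption, since the paper does not give its exact definition.

```python
import numpy as np

def normalize_energy(syllables: list[np.ndarray]) -> list[np.ndarray]:
    """Scale each syllable so that its energy matches the sentence average.
    Energy is assumed to be mean squared amplitude (mean power)."""
    energies = [float(np.mean(s.astype(float) ** 2)) for s in syllables]
    e_ave = float(np.mean(energies))
    # Amplitude scales with the square root of the energy ratio.
    return [s * np.sqrt(e_ave / e) for s, e in zip(syllables, energies)]
```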

4.4 Other Desirable Effects

The results of the above synthesis procedure often contain some undesirable, artificial-sounding buzzy components. To mask them, we implement an echo effect according to the following formula:

y[n] = x[n] + a * y[n - k]

or, in terms of its z-transform:

H(z) = Y(z) / X(z) = 1 / (1 - a * z^(-k))

The value of k controls the amount of delay and can be adjusted accordingly; usually the delay is set to 0.7 seconds. The echo effect essentially masks the undesirable buzzy components. Furthermore, it makes the whole synthesized singing voice sound more genuine and softer, as exemplified by the fact that almost every karaoke machine has a knob for echo control.

Besides the echo effect, the inclusion of a vibrato effect [9] is an important factor in making the synthesized singing voice more natural. The vibrato effect is implemented according to the following guidelines:

1. We use a sinusoidal function to model the vibrato. For instance, to make the pitch curve of a syllable oscillate sinusoidally within the range [a, b] (for instance, [0.8, 1.2]), we rescale and shift the basic sinusoid sin(ωt) as ((b - a)/2) * sin(ωt) + (a + b)/2, where ω is the vibrato angular frequency and t is the frame index.
2. Reassign the positions of the pitch marks based on the above sinusoidal pitch curve with vibrato.
3. Only syllables with durations greater than 0.8 seconds are allowed to have the vibrato effect. Moreover, vibrato is applied only to the vowel part.
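
The echo formula above is a simple feedback comb filter; a direct sketch follows, in which the feedback gain value is an illustrative assumption.

```python
import numpy as np

def add_echo(x: np.ndarray, fs: int = 16000,
             delay_sec: float = 0.7, a: float = 0.3) -> np.ndarray:
    """Feedback echo y[n] = x[n] + a * y[n - k], with k the delay in samples.
    The gain a = 0.3 is an illustrative choice; |a| < 1 keeps the loop stable."""
    k = int(delay_sec * fs)
    y = x.astype(float).copy()
    for n in range(k, len(y)):
        y[n] += a * y[n - k]
    return y
```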

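The sinusoidal pitch curve of vibrato guideline 1 can be sketched similarly. It produces the per-frame pitch-scaling curve that drives the pitch-mark reassignment of guideline 2; the [0.8, 1.2] range follows the paper's example, while the vibrato rate and frame rate are illustrative assumptions.

```python
import numpy as np

def vibrato_curve(n_frames: int, frame_rate: float = 100.0,
                  vib_hz: float = 5.0, a: float = 0.8,
                  b: float = 1.2) -> np.ndarray:
    """Sinusoidal pitch-scaling curve oscillating within [a, b]:
    ((b - a)/2) * sin(w*t) + (a + b)/2, with t the frame index."""
    t = np.arange(n_frames)
    w = 2 * np.pi * vib_hz / frame_rate  # vibrato angular frequency per frame
    return ((b - a) / 2) * np.sin(w * t) + (a + b) / 2
```
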
The following figures demonstrate a synthesized singing voice without the vibrato effect; the first plot is the time-domain waveform, and the second is the corresponding pitch curve:

[Figure: waveform and pitch curve (after median/merging filtering) of a synthesized singing voice without vibrato]

After adding the vibrato effect, the waveform and pitch curve are as follows:

[Figure: waveform and pitch curve (after median/merging filtering) of the same synthesized singing voice with vibrato]

5. RESULTS AND ANALYSIS

The performance of our SINVOIS system depends on three factors: the outcome of forced alignment via Viterbi decoding, the result of pitch/time modification, and the special singing-voice effects. We had 5 persons try 15 different Mandarin Chinese pop songs and obtained a 95% recognition rate for syllable detection. The resulting synthesized singing voices all seemed acceptable.

We adopted an MOS (mean opinion score) test [8] to obtain subjective assessments of our system. In the test, ten persons listened to the fifteen synthesized singing voices, and each person gave a score to each song. Scores range from 1 to 5, with 5 representing the highest grade of naturalness. The following table lists the average MOS for each song:

Song  1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
MOS   3.2  3.2  3.5  2.4  3.8  2.6  2.9  3.1  3.7  2.9  3.4  2.8  3.3  2.6  3.5

From the above table, it is clear that the synthesized singing voices are acceptable, but definitely not natural-sounding enough to be called satisfactory. The major reason is that the synthesis units are obtained from recordings of speech instead of singing; the synthesized voices are therefore reminiscent of speech rather than of natural singing. However, the effect of on-the-fly synthesis is entertaining, and most people are eager to try the system for fun.

6. CONCLUSIONS AND FUTURE WORK

In this paper, we have described the development of a singing voice synthesis system called SINVOIS (singing voice synthesis). The system accepts a user's speech input of the lyric sentences and generates a synthesized singing voice based on the input recording and the song's music score. The operation of the system is divided into two parts: the synthesis unit generator based on Viterbi decoding, and the time/pitch modification with special effects. To assess the performance of SINVOIS, we designed an MOS experiment for subjective evaluation. The experimental results are acceptable, but not fully satisfactory, due to the way our synthesis units are obtained. However, much of the appeal of the system comes precisely from the personal recording, which enables on-the-fly synthesis that retains personal characteristics through the voice timbre.

This is only a preliminary study, and there are many directions for future work. Some of the immediate tasks include:

1. Find a transformation in the frequency domain that can capture and transform the speech recordings into their singing counterparts.
2. Find the most likely pitch contour via methods from system identification, so that the synthesized singing voice has a natural pitch contour.
3. Try other frequency-domain techniques for pitch shifting, such as sinusoidal modeling [11].

7. REFERENCES

[1] Bennett, Gerald, and Rodet, Xavier, "Synthesis of the Singing Voice," in Current Directions in Computer Music Research (M. V. Mathews and J. R. Pierce, eds.), pp. 19-44, MIT Press, 1989.
[2] Chen, S. G., and Lin, G. J., "High Quality and Low Complexity Pitch Modification of Acoustic Signals," Proceedings of the 1995 IEEE International Conference on Acoustics, Speech, and Signal Processing, Detroit, USA, May 1995, pp. 2987-2990.
[3] Chowning, John M., "Frequency Modulation Synthesis of the Singing Voice," in Current Directions in Computer Music Research (M. V. Mathews and J. R. Pierce, eds.), pp. 57-63, MIT Press, 1989.
[4] Cook, P. R., "SPASM, a Real-Time Vocal Tract Physical Model Controller; and Singer, the Companion Software Synthesis System," Computer Music Journal, vol. 17, no. 1, pp. 31-43, Spring 1993.
[5] Edgington, M., and Lowry, A., "Residual-Based Speech Modification Algorithms for Text-to-Speech Synthesis," Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP 96), vol. 3, 1996, pp. 1425-1428.
[6] Charpentier, F., and Moulines, E., "Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones," European Conference on Speech Communication and Technology, Paris, 1989, pp. 13-19.
[7] http://www.oki.com/jp/cng/softnew/english/sm.htm
[8] ITU-T, Methods for Subjective Determination of Transmission Quality, International Telecommunication Union, 1996.
[9] Macon, Michael W., Jensen-Link, Leslie, Oliverio, James, Clements, Mark A., and George, E. Bryan, "A Singing Voice Synthesis System Based on Sinusoidal Modeling," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 435-438, 1997.
[10] Macon, Michael W., Jensen-Link, Leslie, Oliverio, James, Clements, Mark A., and George, E. Bryan, "Concatenation-Based MIDI-to-Singing Voice Synthesis," 103rd Meeting of the Audio Engineering Society, New York, 1997.
[11] Macon, Michael W., Speech Synthesis Based on Sinusoidal Modeling, PhD thesis, Georgia Institute of Technology, October 1996.
[12] Ney, H., and Aubert, X., "Dynamic Programming Search: From Digit Strings to Large Vocabulary Word Graphs," in C.-H. Lee, F. Soong, and K. Paliwal, eds., Automatic Speech and Speaker Recognition, Kluwer, Norwell, Mass., 1996.
[13] Rabiner, L., and Juang, B.-H., Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., pp. 339-341, 1993.
[14] Sundberg, Johan, "Synthesis of Singing by Rule," in Current Directions in Computer Music Research (M. V. Mathews and J. R. Pierce, eds.), pp. 57-63, MIT Press, 1989.
[15] Zhang, Yiying, Zhu, Xiaoyan, Hao, Yu, and Luo, Yupin, "A Robust and Fast Endpoint Detection Algorithm for Isolated Word Recognition," 1997 IEEE International Conference on Intelligent Processing Systems (ICIPS '97), vol. 2, 1997, pp. 819-822.