AN ON-THE-FLY MANDARIN SINGING VOICE SYNTHESIS SYSTEM

Cheng-Yuan Lin*, J.-S. Roger Jang*, and Shaw-Hwa Hwang**
*Dept. of Computer Science, National Tsing Hua University, Taiwan
**Dept. of Electrical Engineering, National Taipei University, Taiwan
Email: gavins@cs.nthu.edu.tw

ABSTRACT

An on-the-fly Mandarin singing voice synthesis system, called SINVOIS (singing voice synthesis), is proposed in this paper. The SINVOIS system receives continuous speech of the lyrics of a song and immediately generates the corresponding singing voice, based on the music score information of the song (embedded in a MIDI file). Two sub-systems are designed and embedded in the system: one is the synthesis unit generator and the other is the pitch-shifting module. In the first, the Viterbi decoding algorithm is applied to continuous speech to generate the synthesis units for the singing voice. In the second, the PSOLA method is employed to implement pitch shifting; energy, duration, and spectrum modifications of the synthesis units are also implemented in this part. The synthesized singing voice sounds reasonably good. In a subjective listening test, MOS (mean opinion score) values of 4.5 and approximately 3 were obtained for the original and synthesized singing voices, respectively.

1. INTRODUCTION

Text-to-speech (TTS) systems have been developed over the past few decades, and the most recent TTS systems can produce human-like, natural-sounding speech. The success of TTS systems can be attributed to their wide range of applications as well as to the advances in modern computers. On the other hand, research and development in singing voice synthesis is not as mature as in speech synthesis, partly due to its limited application domains. However, as computer-based games and entertainment become increasingly popular, interesting applications of singing voice synthesis are emerging, including software for vocal training, synthesized singing voices for virtual singers, and so on.

In a conventional concatenation-based Chinese TTS system, the synthesis units are taken from a set of pre-recorded syllabic clips representing the distinct base syllables of Mandarin Chinese. A concatenation-based singing voice synthesis system works in a similar way, except that the singing voice must be synthesized according to the music score and lyrics of a given song. More specifically, a singing voice synthesis system takes the lyrics and melody information of a song as its inputs. The lyrics are converted into syllables, and the corresponding syllable clips are selected for concatenation. The system then performs pitch/time modification and adds other desirable effects, such as vibrato and echo, to make the synthesized singing voice sound more natural. The following figure demonstrates the flow chart of a conventional singing voice synthesis system.

However, such conventional singing voice synthesis systems cannot produce personalized singing unless the user records all the base Mandarin syllables in advance, which is a time-consuming process. Moreover, the co-articulation effects must be re-created, since they are not available in the isolated base syllable recordings. Considering these disadvantages of conventional systems, we propose the use of speech recognition technology as the front end of our SINVOIS system. In other words, to create a personalized singing voice, the user needs to read the lyrics, sentence by sentence, to our system.
Our system then employs forced alignment via Viterbi decoding to detect the boundary of each character, as well as its consonant and vowel parts. Once these parts are identified, we can use them as synthesis units to synthesize the singing voice of the song, retaining all the timbre and co-articulation effects of the user. Other add-on features, such as vibrato and echo, can be imposed in post-processing. The following figure demonstrates the flow chart of our SINVOIS system.

2. RELATED WORK

Due to limited computing power, most previous approaches to singing voice synthesis employed acoustic models of human voice production. These include:
1. The SPASM system by Cook [4]
2. The CHANT system by Bennett et al. [1]
3. Rule-based synthesis by Sundberg [14]
4. The frequency modulation method by Chowning [3]
However, the performance of these methods is not acceptable, since the acoustic models cannot produce natural-sounding human voices. More recently, the success of concatenation-based text-to-speech systems has motivated the use of concatenation for singing voice synthesis. For example, the LYRICOS system by Macon et al. [9][10] is a typical concatenation-based singing voice synthesis system. The SMALLTALK system by the OKI company [7] in Japan is another example; it adopts the PSOLA method [6] (introduced in Section 4) to synthesize singing voices. Even though these systems can produce satisfactory results, they cannot produce personalized singing voices on the fly for a specific user.

3. GENERATION OF SYNTHESIS UNIT

The conventional method of synthesis unit generation for speech synthesis derives from a database of all base Mandarin syllables, recorded in advance by a specific person with a clear voice. Once the recordings of the base syllables are available, the speech data must be processed according to the following steps:
1. End-point detection [15] based on energies and zero-crossing rates is employed to identify the exact positions of speech within the recordings.
2. The pitch marks of each syllable, that is, the positions on the time axis indicating the beginning of each pitch period, must be found.
3. The consonant part and the vowel part of each syllable must be labeled.
For best performance, these three steps are usually carried out manually, which is a rather time-consuming process. In our SINVOIS system, the singing voice must be synthesized on the fly; hence all three steps are performed automatically. Moreover, we also need to identify each syllable boundary, via Viterbi decoding.

3.1 Syllable Detection

For a given recording of a lyric sentence, each syllable is detected by forced alignment via Viterbi decoding [12][13]. The process can be divided into the following two steps:
1. Each character in the lyric sentence is labeled with a base syllable. This task is not as trivial as it seems, since some character-to-syllable mappings are one-to-many. A maximum matching method is used in conjunction with a dictionary of about 9, terms to determine the best character-to-syllable mapping (a sketch of this step follows the list).
2. The syllable sequence of the lyric sentence is then converted into bi-phone models to construct a single-sentence linear lexicon. Viterbi decoding [12][13] is then employed to align the frames of the speech recording to the bi-phone models in this one-sentence linear lexicon, such that the state sequence with maximal probability is found. The obtained optimal state sequence indicates the best alignment of each frame to a state in the lexicon.
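To make step 1 concrete, the following is a minimal Python sketch of forward maximum matching over a pronunciation dictionary. The toy dictionary, its pinyin labels, and the function name are our own illustrative choices; the actual system's dictionary and syllable inventory are not reproduced here.

# Sketch of step 1: forward maximum matching over a pronunciation
# dictionary to label each character with a base syllable.
# The dictionary below is a hypothetical toy example.
DICT = {
    "星星": ["xing", "xing"],   # multi-character terms resolve ambiguity
    "星": ["xing"],
    "月亮": ["yue", "liang"],
    "月": ["yue"],
    "亮": ["liang"],
}

def label_syllables(sentence, max_len=4):
    # Greedily match the longest dictionary term at each position.
    syllables, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            term = sentence[i:i + n]
            if term in DICT:
                syllables.extend(DICT[term])
                i += n
                break
        else:
            raise KeyError("character not in dictionary: " + sentence[i])
    return syllables

print(label_syllables("星星月亮"))  # -> ['xing', 'xing', 'yue', 'liang']

Because longer multi-character entries are matched first, the greedy search resolves one-to-many character-to-syllable mappings by context, which is the point of the maximum matching method.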
Therefore, we can correctly identify the position of each syllable, including its consonant and vowel parts. Of course, before Viterbi decoding can be applied, an acoustic model must be available in advance. The acoustic model used here contains 52 bi-phone models, obtained from a speech corpus of 7 subjects to achieve speaker independence. The complete acoustic model ensures the precision of syllable detection. The following plot demonstrates a typical result of syllable detection. [Figure: speech waveform with detected syllable boundaries] For simplicity, once a syllable is detected, zero-crossing rates can also be used directly to distinguish the consonant part from the vowel part.

3.2 Identification of Pitch Marks

Pitch marks are the positions where complete pitch periods start; they must be identified for effective time/pitch modification. In our system, pitch mark identification is performed only on the vowel part of each syllable, since the consonant part has no clearly defined pitch. Accordingly, the consonant part of a syllable is kept unchanged during pitch shifting. The steps involved in pitch mark identification are listed next (a sketch follows):
1. Use the ACF (autocorrelation function) or the AMDF (average magnitude difference function) to compute the average pitch period T of a given syllable recording.
2. Find the global maximum of the syllable waveform and label its time coordinate t_m; this is the position of the first pitch mark.
3. Search for further pitch marks to the right of t_m by finding the maximum in the region [t_m + 0.9*T, t_m + 1.1*T]. Repeat the same procedure until all pitch marks to the right of the global maximum are found.
4. Search for the pitch marks to the left of t_m in the same manner, using the region [t_m - 1.1*T, t_m - 0.9*T] instead. Repeat until all pitch marks to the left of the global maximum are found.
The following plot shows a waveform after the pitch marks (denoted as circles) have been found. [Figure: vowel waveform with detected pitch marks]
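The following NumPy sketch implements the four steps above for a voiced (vowel) segment x sampled at rate fs. The ±10% search regions follow the list; the ACF search range of 50-400 Hz and all function names are our own illustrative choices.

import numpy as np

def average_pitch_period(x, fs, fmin=50.0, fmax=400.0):
    # Step 1: estimate the average pitch period T (in samples) via the ACF.
    x = x - np.mean(x)
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    return lo + int(np.argmax(acf[lo:hi]))

def find_pitch_marks(x, fs):
    T = average_pitch_period(x, fs)
    t_m = int(np.argmax(x))            # step 2: global maximum
    marks = [t_m]
    t = t_m                            # step 3: search to the right
    while t + int(1.1 * T) < len(x):
        lo, hi = t + int(0.9 * T), t + int(1.1 * T)
        t = lo + int(np.argmax(x[lo:hi]))
        marks.append(t)
    t = t_m                            # step 4: search to the left
    while t - int(1.1 * T) >= 0:
        lo, hi = t - int(1.1 * T), t - int(0.9 * T)
        t = lo + int(np.argmax(x[lo:hi]))
        marks.insert(0, t)
    return np.array(marks), T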
Once the pitch marks are found, we can perform the necessary pitch/time modifications according to the music score of the song, and add other desirable effects for the singing voice. These procedures are introduced in the next section.

4. PITCH SHIFTING MODULE

In this section we introduce the essential operations of speech-to-singing (STS) conversion, which include pitch/time-scale modification and energy normalization. We then perform fine tuning, adding echo, pitch vibrato, and co-articulation effects to make the singing voice more natural.

4.1 Pitch Shifting

Pitch shifting of speech/audio signals is an essential part of speech and music synthesis. There are several well-known approaches to pitch shifting:
1. PSOLA (pitch-synchronous overlap and add) [6]
2. Cross-fading [2]
3. Residual signal with PSOLA [5]
4. Sinusoidal modeling [11]
In our system, we adopt the PSOLA method to achieve a balance between quality and efficiency. The basic concept behind PSOLA is to multiply the speech signal by a Hamming window centered at each pitch mark. The windowed signals at the pitch marks are then relocated as necessary (a combined sketch of the pitch and time operations appears after Section 4.2). More specifically, to shift the pitch up, the distance between neighboring pitch marks is decreased; conversely, to shift the pitch down, the distance between neighboring pitch marks is increased. When performing a pitch-up operation, the half-width of the Hamming window equals the desired distance between neighboring pitch marks of the pitch-shifted signal. When performing a pitch-down operation, the half-width of the Hamming window equals the distance between neighboring pitch marks of the original signal. As a result, some zeros may have to be inserted between two windowed signals if a pitch-down operation to less than 50% of the original pitch frequency is desired. The following plot demonstrates both the pitch-up and pitch-down operations. [Figure: original, pitch-up, and pitch-down signals, showing the Hamming windows and the inserted zeros]

4.2 Time Modification

Time modification is used to increase or decrease the duration of a synthesis unit. We use a simple linear mapping method for time modification in our system. The method duplicates or deletes fundamental periods as necessary, as shown in the following diagram. [Figure: contraction and extension of a waveform] Before applying time modification to a syllable, we must separate its consonant and vowel parts. Usually the consonant part is not changed at all; only the vowel part is shortened or lengthened. The speech-based recording already includes natural co-articulation, so no special arrangement is needed for this issue. (In our previous approach to singing voice synthesis based on isolated syllable recordings, we needed to smooth the transition between syllables by cross-fading over a small overlapped region.)
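Below is a simplified, self-contained NumPy sketch of PSOLA-style pitch and time modification, assuming a mono signal x and the pitch marks found in Section 3.2. Mapping each output mark to the nearest analysis mark is our own simplification that realizes both the grain relocation of Section 4.1 and the period duplication/deletion of Section 4.2; it is a sketch, not the paper's exact procedure.

import numpy as np

def overlap_add(y, grain, pos):
    # Add `grain` into `y` starting at index `pos`, clipping at borders.
    lo, hi = max(pos, 0), min(pos + len(grain), len(y))
    y[lo:hi] += grain[lo - pos:hi - pos]

def psola(x, marks, pitch_ratio=1.0, time_ratio=1.0):
    marks = np.asarray(marks)
    T = int(np.mean(np.diff(marks)))     # average analysis pitch period
    T_out = T / pitch_ratio              # spacing of the output marks
    # Half window: the target spacing for pitch-up, the original period
    # for pitch-down (zeros then appear between grains automatically).
    half = int(round(min(T_out, T)))
    y = np.zeros(int(len(x) * time_ratio))
    t = float(half)
    while t < len(y) - half:
        # Analysis mark nearest to the time-scaled output position;
        # marks get reused or skipped, which realizes time modification.
        src = int(marks[np.argmin(np.abs(marks - t / time_ratio))])
        grain = np.zeros(2 * half)
        lo, hi = max(src - half, 0), min(src + half, len(x))
        grain[lo - (src - half):hi - (src - half)] = x[lo:hi]
        grain *= np.hamming(2 * half)
        overlap_add(y, grain, int(round(t)) - half)
        t += T_out
    return y

# Example: raise the pitch by two semitones and lengthen the vowel by 50%.
# y = psola(x, marks, pitch_ratio=2 ** (2 / 12), time_ratio=1.5)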
4.3 Energy Modification

The concatenated singing voice occasionally sounds unnatural because each synthesis unit has a different energy level (intensity or volume). Therefore, we simply adjust the amplitude of each syllable such that its energy equals the average energy of the whole sentence. The energy normalization procedure is described as follows (a sketch follows):
1. Compute the energy of each syllable in the recorded lyric sentence, E_1, E_2, ..., E_N, where N is the number of syllables.
2. Compute the average energy E_ave = (E_1 + E_2 + ... + E_N) / N.
3. Multiply the waveform of the k-th syllable by the constant sqrt(E_ave / E_k), so that its energy becomes E_ave.
On the other hand, if the recorded lyric sentence already bears the desired energy profile, this energy normalization procedure need not be applied.
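A minimal sketch of this procedure, assuming each syllable is a 1-D NumPy array and that energy is the sum of squared samples (a common convention; the paper does not spell out its energy definition):

import numpy as np

def normalize_energy(syllables):
    # E_1 .. E_N : energy of each syllable waveform
    energies = [float(np.sum(s ** 2)) for s in syllables]
    e_ave = sum(energies) / len(energies)          # E_ave
    # sqrt(E_ave / E_k), because energy scales with amplitude squared
    return [s * np.sqrt(e_ave / e) for s, e in zip(syllables, energies)]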
4.4 Other Desirable Effects

The results of the above synthesis procedure often contain some undesirable, artificial-sounding buzzy components. To mask them, we adopt the following formula to implement an echo effect:

  y[n] = x[n] + a * y[n - k]

or, in terms of its z-transform:

  H(z) = Y(z) / X(z) = 1 / (1 - a * z^(-k))

The value of k controls the amount of delay and can be adjusted accordingly; usually the delay is set to 0.7 second. The echo effect essentially masks the undesirable buzzy components. Furthermore, it makes the whole synthesized singing voice sound more genuine and softer, which is also exemplified by the fact that almost every karaoke machine has a knob for echo control.

Besides the echo effect, the inclusion of a vibrato effect [9] is an important factor in making the synthesized singing voice more natural. The vibrato effect is implemented according to the following guidelines (a sketch of both effects follows this list):
1. We use a sinusoidal function to model the vibrato. For instance, to make the pitch curve of a syllable follow a sinusoid within the range [a, b] (for instance, [0.8, 1.2]), we simply rescale and shift the basic sinusoid sin(ωt) as (a + b)/2 + sin(ωt) * (b - a)/2, where ω is the vibrato angular frequency and t is the frame index.
2. Reassign the positions of the pitch marks based on the above sinusoidal pitch curve with vibrato.
3. Only syllables with durations greater than 0.8 seconds are permitted to have the vibrato effect. Moreover, vibrato is applied only to the vowel part.
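The following sketch implements the echo filter above and the sinusoidal vibrato pitch curve of guideline 1. The feedback gain a and the vibrato rate omega are illustrative values; the delay follows the 0.7-second setting in the text, and the returned pitch curve would drive the pitch mark reassignment of guideline 2 (e.g., through the PSOLA sketch of Section 4).

import numpy as np

def add_echo(x, fs, a=0.4, delay=0.7):
    # y[n] = x[n] + a * y[n - k], i.e. H(z) = 1 / (1 - a * z^(-k))
    k = int(delay * fs)
    y = np.array(x, dtype=float)
    for n in range(k, len(y)):
        y[n] += a * y[n - k]
    return y

def vibrato_pitch_curve(n_frames, a=0.8, b=1.2, omega=0.5):
    # Rescaled sinusoid in [a, b]: (a + b)/2 + sin(omega*t) * (b - a)/2
    t = np.arange(n_frames)
    return (a + b) / 2 + np.sin(omega * t) * (b - a) / 2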
The following figure demonstrates the synthesized singing voice without the vibrato effect; the first plot is the time-domain waveform, and the second plot is the corresponding pitch curve without vibrato. [Figure: waveform and pitch curve (in semitones, after median/merging filtering) without vibrato]

After adding the vibrato effect, the waveform and the pitch curve are as shown in the following plot. [Figure: waveform and pitch curve (in semitones, after median/merging filtering) with vibrato]

5. RESULTS AND ANALYSIS

The performance of our SINVOIS system depends on three factors: the outcome of forced alignment via Viterbi decoding, the result of pitch/time modification, and the special effects added to the singing voice. We had 5 persons try 15 different Mandarin Chinese pop songs and obtained a 95% recognition rate for syllable detection. The resulting synthesized singing voices all seemed acceptable.

We adopted the MOS (mean opinion score) test [8] to obtain subjective assessments of our system. In the test, ten persons listened to the fifteen synthesized singing voices, and each person gave a score to each song. The scores range from 1 to 5, with 5 representing the highest grade of naturalness. The following table shows the average MOS for each song:

Song:  1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
MOS:   3.2  3.2  3.5  2.4  3.8  2.6  2.9  3.   3.7  2.9  3.4  2.8  3.3  2.6  3.5

From the above table, it is clear that the synthesized singing voices are acceptable, but not satisfactory enough to be described as natural sounding. The major reason is that the synthesis units are obtained from recordings of speech instead of singing; the synthesized voices are therefore reminiscent of speech rather than natural singing. However, the effect of on-the-fly synthesis is entertaining, and most people are eager to try the system for fun.

6. CONCLUSIONS AND FUTURE WORK

In this paper, we have described the development of a singing voice synthesis system called SINVOIS (singing voice synthesis). The system accepts a user's speech input of the lyric sentences and generates a synthesized singing voice based on the input recording and the song's music score. The operation of the system is divided into two parts: one is the synthesis unit generator based on Viterbi decoding, and the other is time/pitch modification together with special effects. To assess the performance of SINVOIS, we designed an MOS experiment for subjective evaluation. The experimental results are acceptable, but not totally satisfactory, due to the way our synthesis units are obtained. However, the fun of the system also comes from the personal recording, which can be used for on-the-fly synthesis that retains personal features via vocal timbre.

This is only a preliminary study, and there are many directions for future work. Some of the immediate future work includes:
1. Finding a transformation in the frequency domain that can capture and transform the speech recordings into their singing counterparts.
2. Finding the most likely pitch contour via methods from system identification, such that the synthesized singing voice has a natural pitch contour.
3. Trying other frequency-domain techniques for pitch shifting, such as sinusoidal modeling [11].

7. REFERENCES

[1] Bennett, Gerald, and Rodet, Xavier, "Synthesis of the singing voice," in Current Directions in Computer Music Research (M. V. Mathews and J. R. Pierce, eds.), pp. 19-44, MIT Press, 1989.
[2] Chen, S. G. and Lin, G. J., "High quality and low complexity pitch modification of acoustic signals," Proceedings of the 1995 IEEE International Conference on Acoustics, Speech, and Signal Processing, Detroit, USA, May 1995, pp. 2987-2990.
[3] Chowning, John M., "Frequency modulation synthesis of the singing voice," in Current Directions in Computer Music Research (M. V. Mathews and J. R. Pierce, eds.), pp. 57-63, MIT Press, 1989.
[4] Cook, P. R., "SPASM, a real-time vocal tract physical model controller; and Singer, the companion software synthesis system," Computer Music Journal, vol. 17, pp. 30-43, Spring 1993.
[5] Edgington, M. and Lowry, A., "Residual-based speech modification algorithms for text-to-speech synthesis," Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP 96), vol. 3, 1996, pp. 425-428.
[6] Charpentier, F. and Moulines, E., "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones," European Conference on Speech Communication and Technology, Paris, 1989, pp. 13-19.
[7] http://www.oki.com/jp/cng/softnew/english/sm.htm
[8] ITU-T, Methods for Subjective Determination of Transmission Quality, International Telecommunication Union, 1996.
[9] Macon, Michael W., Jensen-Link, Leslie, Oliverio, James, Clements, Mark A., and George, E. Bryan, "A singing voice synthesis system based on sinusoidal modeling," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 435-438, 1997.
[10] Macon, Michael W., Jensen-Link, Leslie, Oliverio, James, Clements, Mark A., and George, E. Bryan, "Concatenation-based MIDI-to-singing voice synthesis," 103rd Meeting of the Audio Engineering Society, New York, 1997.
[11] Macon, Michael W., Speech Synthesis Based on Sinusoidal Modeling, Ph.D. thesis, Georgia Institute of Technology, October 1996.
[12] Ney, H. and Aubert, X., "Dynamic programming search: from digit strings to large vocabulary word graphs," in C.-H. Lee, F. Soong, and K. Paliwal, eds., Automatic Speech and Speaker Recognition, Kluwer, Norwell, Mass., 1996.
[13] Rabiner, L. and Juang, B.-H., Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993.
[14] Sundberg, Johan, "Synthesis of singing by rule," in Current Directions in Computer Music Research (M. V. Mathews and J. R. Pierce, eds.), pp. 57-63, MIT Press, 1989.
[15] Zhang, Yiying, Zhu, Xiaoyan, Hao, Yu, and Luo, Yupin, "A robust and fast endpoint detection algorithm for isolated word recognition," Proceedings of the 1997 IEEE International Conference on Intelligent Processing Systems (ICIPS '97), vol. 2, 1997, pp. 819-822.