Singing voice synthesis in Spanish by concatenation of syllables based on the TD-PSOLA algorithm

ALEJANDRO RAMOS-AMÉZQUITA
Computer Science Department, Tecnológico de Monterrey (Campus Ciudad de México)
Calle del Puente 222, Colonia Ejidos de Huipulco, Tlalpan 14380, México D.F., MEXICO
alejandro.ramos.amezquita@itesm.mx
http://www.ccm.itesm.mx/micampus/

Abstract: The present work describes the development of a Spanish singing voice synthesizer based on a TD-PSOLA algorithm. The main goal of the development was to test the hypothesis that, while diphones are linguistically the units with the best intelligibility-flexibility compromise for spoken voice synthesis, syllables are the best-suited units for concatenative singing voice synthesis. The hypothesis is particularly strong for Spanish, since its rules for syllable construction are comprehensive, relatively simple, and few in number. To test it, a relatively small set of vowels and syllables in Spanish was recorded by a soprano singer at both the F4 and C5 tones, each with a duration of 1 second (±0.2 s). The syllables were modified only with respect to tone and duration. Matlab was used as the programming platform, mainly because of the author's familiarity with it. To evaluate the performance of the system, several melodic tasks were given to it, including the singing of a popular Mexican song (Las Mañanitas). Results show that a highly intelligible synthesized Spanish singing voice based on syllable concatenation can be achieved with minimal control mechanisms. While the duration variation introduces very few noticeable digital errors, a transposition of up to a just fourth was possible without generating very obvious digital errors; a variation of 5% (0.05) in the frequency scale corresponds to a semitone in the equally tempered modern scale.

Key-Words: Singing voice, synthesis, concatenation, syllables, Spanish, Time Domain, PSOLA.
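The 5% figure quoted in the abstract can be verified with a line of arithmetic: in twelve-tone equal temperament a semitone is a frequency ratio of 2^(1/12) ≈ 1.0595, i.e. a change of roughly 5-6% in the frequency scale. A minimal check (Python is used here purely for illustration; the paper's own code is in Matlab):

```python
# Equal-tempered semitone: one octave (factor 2) split into 12 equal steps.
semitone = 2 ** (1 / 12)
print(round(semitone, 4))   # 1.0595 -> a ~5.9% frequency change

# A just fourth (the largest transposition reported) spans 5 semitones:
fourth = semitone ** 5
print(round(fourth, 3))     # ~1.335, close to the just ratio 4/3
```

This also shows why a just fourth is a demanding test: it compounds five such ~5.9% steps into a ~33% frequency change.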
1 Introduction
Singing voice synthesis springs from two well-researched fields: spoken voice coding and synthesis, and musical instrument synthesis. Voice synthesis has been studied since the XVIII century, when Wolfgang von Kempelen built his mechanical synthesizer. Later, Leon Theremin created the first well-known electrical synthesizer, and in 1939 the Vocoder (Voice Coder), built by Homer Dudley, constituted the first singing voice synthesizer. Nowadays, human voice synthesis models can be classified into spectral and physical models. The former are based fundamentally on mechanisms of auditory perception, while the latter are based on modeling the mechanisms of the sound production sources. The benefit of the latter is that their parameters are closely related to the control mechanisms that a singer uses on his or her own vocal system, so some of the actual control mechanisms are incorporated in the design. However, mapping the large number of parameters onto intuitive controls of the production mechanisms is not a trivial task [1]. A wide variety of pseudo-physical models are also available. In these, the model is decomposed into a source and a vocal tract. A typical example is the linear prediction method, in which the resonances of the vocal tract are modeled as poles of a filter and the residual error is taken as the source signal. The problem with this approach is that modifying the filter does not produce the expected results, since there is more to the signal than just the glottal excitation of the sound source. This is due to non-linear effects of the vocal tract that such a model cannot reproduce.
On the other hand, vocoders divide the speech spectrum into channels for which gain and source parameters are approximated, while formant synthesizers resemble the linear prediction method in that they include the option of different sound sources, voiced and unvoiced, with the vocal tract modeled by a set of formant filters [4]. The formant wave functions and the frequency-modulated approach are slightly different spectral
ISBN: 978-1-61804-096-1 210

methods. The former model the formant's impulse response in the time domain, and each of these functions can be excited at the fundamental frequency required to produce the singing voice. The frequency-modulated approach tries to generate a spectrum that resembles that of the singing voice using two oscillators, a carrier and another that drives its frequency, with excellent results in elegantly modeling the singer's effort. However, the most successful examples of voice synthesis are those based on the sampling approach, in which the output signal is the result of the sequential concatenation of samples from a database. As such, this does not constitute a synthesis technique; rather, according to Bonada and Serra [1], it should be regarded as a synthesis model. The success of the method lies in its simplicity and in the fact that it captures the natural sound from its real counterpart. Its most fundamental problem is the lack of flexibility and expression that a professional musician expects of an instrument. For such instruments, utilizing large databases and sampling a wide portion of the instrument's sound space can achieve an acceptable quality level. Although storage capacity becomes less of an impediment as technology progresses, it remains an issue one must consider. A database of reasonable size can be used when a way to morph a particular sound in the database into a different one is incorporated, in a manner in which the outcome also sounds natural. To achieve such a transformation, one can parameterize the stored samples, or one can store exact waveforms in the database, as happens with the PSOLA methods. The PSOLA (Pitch Synchronous Overlap and Add) method decomposes the signal into elementary waveforms, each corresponding in length to one pitch period. When such waveforms are added with a certain amount of overlap, the original signal can be rebuilt.
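The decomposition and reconstruction just described can be sketched in a few lines. The sketch below is an illustration in Python/NumPy rather than the paper's Matlab, and it uses idealized, evenly spaced pitch marks on a synthetic periodic tone (a real system estimates the marks from the signal): two-period Hann-windowed elementary waveforms are extracted at each mark and the signal is rebuilt by overlap-add with window-sum normalization.

```python
import numpy as np

# Synthetic "voiced" tone: 100 Hz plus one overtone, sampled at 8 kHz,
# so the pitch period is exactly 80 samples.
fs, f0 = 8000, 100
period = fs // f0
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * f0 * t) + 0.3 * np.sin(2 * np.pi * 2 * f0 * t)

# Idealized pitch marks, one per period (a real system would estimate them).
marks = np.arange(period, len(x) - period, period)

# Analysis: two-period Hann-windowed elementary waveforms around each mark.
win = np.hanning(2 * period)
segments = [x[m - period:m + period] * win for m in marks]

# Synthesis: overlap-add at the same marks, normalizing by the window sum.
y = np.zeros(len(x))
wsum = np.zeros(len(x))
for s, m in zip(segments, marks):
    y[m - period:m + period] += s
    wsum[m - period:m + period] += win
y = y / np.maximum(wsum, 1e-12)

# Away from the edges the overlapping windows fully cover the signal,
# so the original is recovered almost exactly.
err = np.max(np.abs((y - x)[2 * period:-2 * period]))
print(err)
```

With unchanged marks this is an identity operation; the manipulations of interest come from changing the marks before the overlap-add, as described next.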
A temporal modification can be achieved by the repetition or elimination of pitch periods, while a frequency modification can be achieved by changing the time between elementary waveforms. Such methods work only for the voiced sections of the signal, since there has to be a fundamental period for the model to work. The PSOLA method works well for small transformations, and it resembles the wavetable reading method.

2 Problem Formulation
Spoken voice synthesis and singing voice synthesis have been around for a long time, and yet there is still the issue of trading natural sound for intelligibility and vice versa. The differences between the singing and the spoken voice are not yet fully understood, and there are reasons to believe that they might differ in aspects as basic as the fundamental phonetic unit. Also, human languages differ with regard to phones, graphemes, pronunciation, and orthographic rules, and therefore some knowledge of the acoustic and phonetic characteristics must be developed for each language individually. The present work aims to tackle this notion particularly for the Spanish language, for which little or no research on sung voice has been carried out. The system developed is based on the synthesis model of concatenation of pre-recorded sung syllables, in an effort to prove or disprove the hypothesis that syllables are in fact the basic structural unit of the singing Spanish voice (at least in Mexican popular songs). The hypothesis is sustained by the fact that in all Spanish lyrical scores a melodic line is present in which there is a one-to-one correspondence between notes and syllables (except in melismas). Being a first approach to singing voice synthesis in Spanish, the primary objectives were modest but constituted important developmental steps toward a complete and comprehensive Spanish singing voice synthesis system.
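The two PSOLA manipulations introduced above (repeating or dropping pitch periods for duration, re-spacing the elementary waveforms for frequency) can be combined into a toy routine. The sketch below is again an illustrative Python/NumPy version, not the paper's implementation: the function name `psola_modify` is hypothetical, and a constant, known f0 is assumed so that the analysis pitch marks can be placed analytically rather than estimated.

```python
import numpy as np

def psola_modify(x, fs, f0, pitch_factor=1.0, dur_factor=1.0):
    """Toy TD-PSOLA under a constant, known f0 (idealized pitch marks).

    pitch_factor > 1 raises the pitch by placing synthesis marks closer
    together; dur_factor > 1 lengthens the output by re-using (repeating)
    analysis periods. Real implementations estimate the marks instead.
    """
    p = int(round(fs / f0))                  # analysis pitch period
    win = np.hanning(2 * p)
    marks = np.arange(p, len(x) - p, p)      # idealized analysis marks
    segs = [x[m - p:m + p] * win for m in marks]

    hop = p / pitch_factor                   # synthesis mark spacing
    out_len = int(len(x) * dur_factor)
    y = np.zeros(out_len + 2 * p)
    wsum = np.zeros_like(y)
    m = float(p)
    while m < out_len - p:
        # Map each synthesis mark back to the nearest analysis segment;
        # this mapping repeats or drops periods to realize dur_factor.
        k = int(round(m / dur_factor / p)) - 1
        k = min(max(k, 0), len(segs) - 1)
        mi = int(m)
        y[mi - p:mi + p] += segs[k]
        wsum[mi - p:mi + p] += win
        m += hop
    return y[:out_len] / np.maximum(wsum[:out_len], 1e-8)

# Example: a 200 Hz tone raised by a just fourth (4/3, five semitones)
# and stretched to 1.5 times its original duration.
fs, f0 = 8000, 200
x = np.sin(2 * np.pi * f0 * np.arange(fs) / fs)
y = psola_modify(x, fs, f0, pitch_factor=4 / 3, dur_factor=1.5)
```

For this pure tone the re-spacing moves the dominant spectral line to roughly pitch_factor times f0 (around 267 Hz here), while the output lasts dur_factor times the input. The key property for voice is that each elementary waveform, and hence its spectral envelope, is kept intact, which preserves the vowel identity under transposition.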
The main objective was simply to test the results of applying the PSOLA algorithm to the modification of the tone and duration of a subset of syllables in Spanish, and to try them out in a particular Mexican-Spanish popular song (Las Mañanitas). The results would be measured in terms of the balance between how natural the resulting voice sounds and the intelligibility of the performance.

2.1 The PSOLA decision
The problem with the concatenation of segments is that, due to prosodic variability, it is not possible to generalize them to contexts not included in the training process. For such purposes there are techniques that allow the prosody of a unit to be modified to match a desired prosody. Although such techniques degrade the quality of the synthesized voice, they also bring benefits that exceed the inconvenience of the distortion they introduce. The objective of prosody modification is to change the amplitude, duration, and tone of a voice segment. The modification of the amplitude can be

easily achieved by direct multiplication; the duration and tone, however, present more difficulties [5]. The PSOLA algorithm allows for the smooth concatenation of pre-recorded samples of spoken voice and provides good tone and duration controls, making it the algorithmic choice for the present work. Although all the PSOLA variants work in a similar manner, the time-domain algorithm is the most widely used due to its computational efficiency. The basic algorithm consists of three steps: 1) the analysis, in which the signal is divided into separate, often overlapping short-time signals; 2) the modification of each analysis signal to generate the synthesis signal; and 3) the synthesis step, in which such segments are recombined by overlapping and adding [2]. The short-time signals are obtained from the digital waveform by multiplying the signal by a pitch-synchronous sequence of analysis windows. The window usually used is the Hanning window, centered on successive instants called pitch marks. Such marks are placed at a rate synchronized with the pitch over the voiced sections, and at a constant rate over the unvoiced sections. The length of the window is proportional to the local pitch period, with a window factor between 2 and 4. The pitch marks are determined either by inspection of the signal or automatically by some estimation method. The segment recombination in the synthesis step is performed after defining the new sequence of pitch marks [2]. The frequency manipulation is achieved by changing the pitch mark intervals; the duration, on the other hand, is modified through the repetition or elimination of voice segments.

2.1.1 Spanish syllables and the PSOLA method
Presumably, a syllable-based system would dramatically diminish the noise generated by the discontinuities of the concatenation procedure.
This is due to their intrinsic articulatory characteristics: syllables include the language's co-articulation and constitute a clear phonetic unit, which minimizes the border effects naturally present in systems based on other units such as diphones. Also, the manner of syllable construction in Spanish facilitates the implementation of the PSOLA algorithm, since the tone of the note to which a syllable is associated always relates the fundamental frequency to the sung vowel. Being able to extract pitch information, the PSOLA algorithm can control the most important voiced characteristic of the syllables and, more importantly, modify it.

3 Problem Solution
The implementation of the PSOLA algorithm for this first stage of the development was carried out on a Matlab platform, mainly because of the author's familiarity with it, but also because of its straightforwardness and accessibility. The program basically consisted of a single function (tdpsola) that included a sub-function necessary to find the pitch marks (find_pmarks). The pre-recorded syllables were fed into the program in .wav format, and the voice of a Mexican soprano singer was chosen for the construction of a small database of Spanish sung syllables that would allow the testing of the algorithm.

3.1 The database construction
Before the actual recording of the segments, a half-hour warm-up session took place, which included three-note legatos, chromatic scales with a maximum range of an octave and a half, and arpeggios of one and a half octaves. The singer's register is two octaves, from C4 to C6. The recordings were carried out in a home studio (5 x 5 x 2.5 meters) using a Pro Tools 7.4 system and a Digidesign Mbox digital audio card at a 48 kHz sampling frequency and a depth of 24 bits (the highest available). An AKG 414 microphone in cardioid mode was used during the recordings.
Considering the vocal interval of most popular singers, the singer was asked to sing the syllables without vibrato at two different tones, F4 and C5, at a tempo of 60 bpm. The length of the syllables was 1 second (±0.2 s), and the selection of syllables to record included all the vowels plus the syllables appearing in the popular songs that the system would be required to synthesize.

3.2 The tdpsola function
The novelty of the application of the TD-PSOLA algorithm in the present work is the possibility of choosing the percentage of frequency and duration modification at the beginning and at the end of the signal, and the fact that it is applied to singing voice. The modification is distributed linearly throughout the signal. At the end of the modifying procedure, the first and last cycles of the modified signal are eliminated, since these areas are the most likely to present undesired features such as distortion. The main function tdpsola has six input parameters: 1) the syllable

to modify; 2) the sampling frequency of the signal; 3) the frequency scaling percentage at the beginning of the syllable; 4) the frequency scaling percentage at the end of the syllable; 5) the duration scaling percentage at the beginning of the syllable; and 6) the duration scaling percentage at the end of the syllable.

3.2.1 The pitch_marks function
For the prosodic modification, the most important step is the calculation of the new pitch marks, a procedure in which the length between pitch marks is obtained by multiplying the modification percentage by the current length. After obtaining the number of modified pitch marks, the number of pitch marks to add or subtract has to be calculated for the first period. To increase the duration of a signal, periods are duplicated until the number of samples still needed is less than one period; to decrease it, periods are eliminated until the number of samples left to remove is less than one period. When duplicating windows, the even copies (2nd, 4th, 6th, etc.) are inverted to avoid generating periodic effects in aperiodic signals. The new windows, multiplied by the Hanning window, are added to the output signal; their samples may overlap with previous samples, but a normalization factor has to be calculated to diminish the distortion that the Hanning window can introduce. The pitch_marks function has two input parameters, the audio signal to analyze and the sampling frequency, and it outputs the positions of the new pitch marks. This function, as can be inferred from the previous paragraph, is the most fundamental section of the synthesis system, and its functioning can be divided into four steps: 1. The generation of the energy contour of the signal. 2. The approximation of the pitch marks from the local maxima in the energy contour. 3.
The addition of pitch marks to the aperiodic sections of the signal. 4. The optimization of the pitch marks to the signal maxima. The calculation of the pitch marks was performed automatically by means of an algorithm developed by Vladimir Goncharoff and Patrick Gries of the University of Illinois at Chicago [3]. For the evaluation of the performance of the singing voice synthesizer, a reduced number of tasks related to melodic notions were given to it, involving changes of tone and duration, the execution of intervals in the major scale, and the performance of vowel glissandos. Although it is difficult to report written results about a singing performance, it can be said that the most important test performed was the actual singing of the Mexican song Las Mañanitas.

Fig. 1: Las Mañanitas, popular Mexican song. The lyrics in the 6th and 7th bars were replaced with the most common version: "ser, día de tu santo te las can".

3.3 Results and Results Analysis

Fig. 2: Unmodified syllable "Es", 1st bar.
Fig. 3: Modified syllable "Es", 1st bar.

Fig. 4: Unmodified syllable "San", 7th bar.
Fig. 5: Modified syllable "San", 7th bar.

Results show that a highly intelligible Spanish singing voice can be achieved with minimal control mechanisms. This in itself suggests that choosing the syllable as the basic phonetic unit for singing was the right decision. However, the natural sound of the voice is, to some degree, compromised, and the severity depends on the interval, whether in frequency or duration, as well as on the recorded execution of the database. As an example, Figures 2 through 5 show spectral differences for the unvoiced S phonemes at different placements (beginning and end of the syllable) and for different frequency and temporal modifications. Figures 3 and 5 show the effect of transposing a recorded syllable sung with vibrato: the vibrato is exaggerated, making it sound unnatural. Nevertheless, results show that, depending on the syllable and when no vibrato is present, a transposition of up to a just fourth is possible without generating very obvious digital errors, and that a variation of 5% (0.05) in the frequency scale corresponds to a semitone in the equally tempered modern scale. Although the percentage of variation that the voice segments allow without the introduction of noticeable errors depended very much on the syllable itself, it was clear that the duration variation introduces less noticeable errors, since it is applied mainly to the vowels of the segment, and the adding of new cycles is less problematic for voiced segments when no transposition is involved, as it has no spectral impact and the overlapping is unnoticeable. For lack of better adjectives, the performance of the synthesizer can vary from recited to expressive, and it was found that a better outcome could be achieved if the modifications at the first and last pitch marks coincide, making the execution more melodic and reducing the legato effect. This was to be expected, since the pitch period of a Spanish singing performer usually does not change within a syllable (supporting the idea that it is the fundamental singing unit), and when a singer forces a glissando into a syllable, he or she probably does not do it linearly over voiced and unvoiced segments alike. Finally, it can be stated that, although a formal intelligibility test has not yet been performed, the listener can clearly recognize the tune that is being sung, and that a promising path toward a comprehensive Spanish singing voice synthesis system lies ahead.

4 Conclusion
It is a common objective of any effort towards singing voice synthesis to achieve a synthesis engine capable of sounding as natural and expressive as a real singer, having only the score and lyrics as input. The general architecture proposed to achieve such a goal includes a section for the generalization of the traditional score that can include any symbolic information required for the synthesizer's control; a section destined to translate the input controls into low-level interpretative actions; another section that creates the parametric trajectories that appropriately express the different paths within the sound space of the instrument; and a module containing the synthesis engine that produces the output signal by concatenation of a transformed sample sequence that resembles the performed trajectory [1]. It is this author's opinion that, just as effective speaking voice synthesis was only possible after a fundamental understanding of the physical acoustics of voice production was gathered, a well-rounded singing voice synthesis (with outputs both natural and intelligible) is only possible after a profound knowledge of the singing voice sound space is obtained. Such a sound space is undeniably related not only to the human physical

singing capabilities and techniques, but also to the language sung, its phonetic and grammatical inner laws, and the type of musical style it forms a part of. Most of this knowledge is yet to be attained for the Spanish language. Although, of the general architecture of a singing voice synthesis system, the present work lacks only the section for the generalization of the traditional score, its real contribution is that it represents a means to obtain more information about the sound space of the instrument itself. For the concatenation approach, the importance of the database is paramount, since not only does it include the interpretative recordings, but it also carries the models and measurements related to the interpretative space and provides relevant information for the conversion of a high-level representation (the score) into an output signal. In that sense, substantial evidence has been gathered supporting the idea that in Spanish singing the syllable is the basic phonetic unit, and that understanding the singer's manipulation of such a unit will not only provide the information needed for natural and expressive singing voice synthesis, but will also shed light on Spanish singing techniques and on the Spanish language itself. Finally, acknowledging the fact that there are only around 2,000 Spanish syllables, the results obtained from this work lead to the promising conclusion that a complete and comprehensive Spanish singing synthesis system can be achieved with a syllabic database with low concatenation and prosodic distortion. In a second developmental stage, higher-level controls will be added to drive the transformation of the parametric trajectories to obtain finer control from the engine, and the construction of a (musical) user interface for the generalization of the score would render the work complete.

References:
[1] J. Bonada and X. Serra, "Synthesis of the Singing Voice by Performance Sampling and Spectral Models," IEEE Signal Processing Magazine, Vol. 24, 2007, pp. 67-79.
[2] S. Lemmetty, Review of Speech Synthesis Technology, http://www.acoustics.hut.fi/publications/files/theses/lemmetty_mst/thesis.pdf, last accessed April 9, 2012.
[3] V. Goncharoff and P. Gries, "An Algorithm for Accurately Marking Pitch Pulses in Speech Signals," Proceedings of the IASTED International Conference on Signal and Image Processing, Las Vegas, Nevada, USA, 1998, pp. 281-234.
[4] V. Siivola, A Survey of Methods for the Synthesis of the Singing Voice, http://www.cis.hut.fi/vsiivola/papers/svs.ps, last accessed April 9, 2012.
[5] X. Huang, A. Acero, and H. W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice Hall, 2001.