ELEC 484 Project
Pitch Synchronous Overlap-Add

Joshua Patton
University of Victoria, BC, Canada

This report discusses steps towards implementing a real-time audio system based on the Pitch Synchronous Overlap and Add (PSOLA) algorithm. This time-domain algorithm, together with its formant-preserving variant (PSOLAF), is explored to produce the desired pitch manipulation effects. Background information is provided, along with the motivation for using PSOLAF pitch shifting over less complex methods such as Time Stretch and Resample and Delay-Line Modulation. An ideal solution for implementing the system is discussed, together with a timeline to completion and some possible audio test clips for evaluation.

1.0 Introduction

The importance of pitch manipulation in digital audio processing and effects cannot be overstated. Applications for pitch shifting include vocoders such as those in cell phones, the creation of realistic choir effects from a single singer, high-end audio playback equipment, audio editing and recording software, and voice-disguising applications [SPL00]. The major motivation behind this project was to demonstrate a reliable way to modify the pitch of an audio source for any of these applications. PSOLA methods offer some of the best sound reproduction with the fewest drawbacks, and are contrasted briefly with several other ways to modify the pitch of an audio signal.

2.0 Pitch Related Methods

There are several ways to modify a source signal's pitch. The methods below are related to pitch shifting and do change the pitch, but they are not well suited to modern applications, for the reasons explained in each subsection.

2.1 Variable Speed Replay

This method of pitch shifting is very straightforward: the original sound is played back at an increased or decreased rate, which shifts the pitch.
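As a minimal sketch of variable speed replay, the snippet below reads a test tone at a rate of c input samples per output sample, using linear interpolation between samples. The sampling rate, tone frequency and replay factor c are illustrative assumptions and are not taken from the project code; c > 1 shortens the clip and raises the pitch.

```matlab
% Variable speed replay sketch: step through the input c samples at a time.
% fs, f0 and c are assumed values chosen only for illustration.
fs = 8000;                       % assumed sampling rate in Hz
f0 = 200;                        % assumed test-tone frequency in Hz
n  = (0:fs-1)';                  % one second of sample indices
x  = sin(2*pi*f0*n/fs);          % input tone (column vector)

c  = 1.5;                        % assumed replay factor (c > 1: faster, higher)
tq = (1:c:length(x))';           % fractional read positions in the input
y  = interp1((1:length(x))', x, tq, 'linear');

% The duration shrinks by roughly 1/c and the tone moves to about c*f0.
fprintf('input: %d samples, output: %d samples\n', length(x), length(y));
```

The same loop with c < 1 lengthens the clip and lowers the pitch, which is the coupling between duration and pitch that the later methods are designed to avoid.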
The replayed signal can be written as

$$x_{\mathrm{replay}}(n) = x_{\mathrm{in}}(c\,n)$$

where c < 1 gives time expansion and c > 1 gives time compression.

Figure 1: VSR Leading to Time and Spectral Envelope Distortion [DAFX]

Figure 1 shows the detrimental effects on the signal: the duration of the clip is expanded or compressed depending on the pitch shift. This type of shifting also changes the spectral envelope, which makes the signal sound like a chipmunk when compressed (c > 1) and more like a baritone when expanded (c < 1). These effects are undesirable for practical use.

2.2 Delay-Line Modulation

This method has been described in several publications and can be implemented in several ways [BB89, DAFX]. The first approach implements a pitch shift using two sawtooth waves, set half a period apart, to control the time-varying delay lines. The resulting output waveforms are multiplied by a cross-fade filter and divided into blocks. When the blocks are read faster or slower, the pitch goes up or down accordingly. The downside is a fair amount of distortion in the signal, and the output becomes more prone to noise.

Figure 2: Pitch Shifting by Delay Line Modulation

Alternatively, an overlap-and-add scheme that does not require estimation of the fundamental frequency can be employed using three in-phase time-varying delay lines. Each line is used on a block that overlaps 2/3 of the next full block length. The result gives the same desired effect [DZ99].

2.3 SOLA Time Stretch and Resample

This method applies the SOLA algorithm described below to the original signal and then performs a linear resample, giving an output signal of the same duration but with a shifted pitch. Resampling is done at the rate alpha * f_s, where alpha is the time-stretch factor.

2.4 Synchronous Overlap Add (SOLA)

This algorithm underlies all of the methods studied later, so it must be understood before the more complex algorithms intended for the real-time system. Synchronous overlap and add proceeds in several steps [MEJ86, RW85]:

Step 1: Separate the input signal into overlapping segments of fixed length, as shown in Figure 3 below.

Step 2: Shift the overlapping segments by the scaling factor (alpha).

Step 3: Search the overlapping samples for the discrete time lag of maximum similarity. At that lag, weight the samples with a fade-in/fade-out function to avoid transients, then add them together to create the final signal with its changed duration.

Figure 3: SOLA Time Manipulation

3.0 Background

The goal of pitch shifting is to move the pitch of an audio signal up or down without losing its information, which is carried in the frequency content and the harmonic ratios. If done correctly, the new audio signal has the same length and sounds like the original, but at the desired pitch.

3.1 Pitch Detection/Marking

Detection and marking of pitches for the input sound are crucial to the next two algorithms.
For input signals of constant pitch, the pitch marks can be placed at the time indices where the signal reaches its maximum amplitude. For more complicated signals involving multiple instruments and vocals, this becomes a much more involved task. The main problem then lends itself to finding a way to separate the different pitch periods of the input in order to determine the pitch marks for each segment accurately.

3.2 Pitch Synchronous Overlap Add (PSOLA)

This method combines the SOLA algorithm and time-domain resampling in a manner similar to Section 2.3. The major difference between the two lies in the resampling, where interpolation between pitch marks is used to create the desired pitch effect, as described by Moulines et al. [HMC89, MC90]. Voice and speech processing are the applications at which this algorithm excels.

Based on the assumption that the input can be characterized by a series of pitches, PSOLA is a two-step process. The first step, known as analysis, segments the input sound into its harmonic, non-harmonic and transient parts and characterizes it by pitches. The second step, known as synthesis, applies various transformations to the signal according to a parameter set [SPL00]. The two phases are carried out as follows, with illustrations for clarification:

I. Analysis:
1. Determine the pitch period. Divide the signal into small blocks in which the pitch is considered constant, then perform pitch detection on each block in succession.
2. Use a Hanning window centered on each pitch mark to extract a segment two pitch periods long. This provides a smooth fade-in/fade-out transition between segments [BJ95].

Figure 4: PSOLA Pitch Analysis [DAFX]

II. Synthesis:
1. Choose the analysis segment identified by its corresponding time marking.
2. Use the Overlap and Add algorithm, where the scaling factor (alpha) decides whether the time signal is to be expanded or compressed. A scaling factor less than 1 causes segments to be discarded, resulting in time compression, while a scaling factor greater than 1 causes segments to be repeated, resulting in time expansion.
3. Finally, find the new time index at which to centre the next synthesis segment so that the pitch is preserved.

Figure 5: PSOLA Synthesis (time stretching) [DAFX]

The effect of this process is a shift in pitch. This is accomplished by linear interpolation on the time-stretched signal to recreate values between the samples, followed by resampling to obtain the desired pitch. This approach is used rather than the simple resampling seen in the SOLA method, and should offer much improved sound quality over the previously discussed methods.

3.3 PSOLA with Formant Preservation (PSOLAF)

Formant preservation is similar to resampling in the time domain, with the difference that the resampling is done in frequency, on the short-time spectral envelope, rather than on the entire signal. The spectral envelope is defined as the line that passes through all the harmonic amplitudes, as seen below in Figure 6.
Figure 6: PSOLA Pitch Shifting: Frequency Re-sampling of Spectral Envelope [DAFX]

All harmonics are scaled by the scaling factor, but their amplitudes are determined by sampling the spectral envelope. For good results during analysis, pitch marks must be placed pitch-synchronously, at the local maxima of each windowed segment [SPL00].

Figure 7: PSOLA Analysis (pitch shifting) [DAFX]

Preserving the formants of the signal effectively preserves the identity of the voice or instrument after synthesis [ML95]. Figure 7 above shows that PSOLA analysis for pitch shifting is identical to the analysis for time stretching. Figure 8 below shows the difference during synthesis between the time-stretch-and-resample method and pure pitch shifting.

Figure 8: PSOLA Synthesis (pitch shifting) [DAFX]

During synthesis, rather than purely adding or removing segments in blocks and thereby stretching the time, segments are added or removed by overlapping Hanning windows, which preserves the duration of the signal while modifying its pitch.

4.0 Discussion and Results

The project's final realization was achieved with some difficulties along the way, which are examined below.

4.1 PSOLA Final Implementation

The bulk of the frustration came from trying to implement pitch scaling using the psola.m file from the DAFX text together with the time-scale and resampling method described above. The m-file TimescaleResamplePSOLA.m simply calls the psola function with different alpha values to set the time scaling. However, there were problems with mismatched matrix dimensions and with index ranges, both at the Hanning window and during resampling, which for some signals caused an outright failure to process the signal for reasons that remained unclear.
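One plausible contributor to errors of this kind, offered as an assumption and not as a diagnosis of the original code, is vector orientation and grain range. Multiplying a row-vector signal segment by a column-vector window either raises a dimension error or, in releases with implicit expansion, silently produces a matrix; and a grain centred on a pitch mark near either end of the signal can index outside it. The sketch below guards against both. All variable names and values are hypothetical, and the Hann window is computed directly to avoid a toolbox dependency.

```matlab
% Sketch: guard a windowed grain against orientation and range problems.
% Names and values are illustrative; they are not taken from psola.m.
x     = 1:1000;                  % signal stored as a ROW vector (1-by-N)
pit   = 40;                      % assumed local pitch period in samples
marks = [30 500];                % assumed pitch marks; the first is near the start
w     = 0.5*(1 - cos(2*pi*(0:2*pit)'/(2*pit)));   % Hann window, column vector

x = x(:);                        % force column orientation before windowing
grainSizes = zeros(numel(marks), 2);
for k = 1:numel(marks)
    st = marks(k) - pit;         % first sample of the grain
    en = marks(k) + pit;         % last sample of the grain
    if st < 1 || en > length(x)
        gr = [];                 % grain would run off the signal: skip it
    else
        gr = x(st:en) .* w;      % element-wise product of two columns
    end
    grainSizes(k, :) = size(gr); % record the shape for inspection
end
disp(grainSizes);
```

Here the first mark is skipped because its grain would start before sample 1, and the second yields a single column of 2*pit+1 samples.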
The output of the psola algorithm is indeed shifted in pitch, but it does not preserve the character of the original signal. This can be heard in the x1.wav clip, where the higher-pitched voice sounds chipmunk-like and the lower-pitched one sounds very baritone. These effects were successfully overcome using formant preservation, as described in the next section.
4.2 PSOLA with Formant Preservation Final Implementation

This method was successful overall in producing the desired pitch scaling. The produced sounds are almost identical to the unmodified original, apart from a slight addition of noise or clipping that may be due to the Hanning windows. When scaling up or down, the integrity of the source is well preserved, so the result sounds like the source at a higher or lower pitch depending on the pitch-scaling parameter (beta in the appendix code). A value higher than 1 results in a higher pitch, while a fractional value less than 1 results in a lower pitch. Changing the gamma parameter, which scales the formant frequencies, offered another range of options that was explored only briefly.

Test files and outputs are available, ranging from very simple, short tones to longer clips including vocals. The parameters used in the test code to generate the resulting sounds can be found in the Matlab script file PSOLA_Formant.m, available in the appendix and on my website. Also available are several original .wav files from the DAFX text and the modified ones, in .zip format to save space. The original files used in testing are:

1) la.wav
2) flute2.wav
3) moore_guitar.wav
4) x1.wav

Sound files and m-files can be accessed at: http://www.ece.uvic.ca/~jpatton/yeshua1984/Elec484/Elec484.html

Several sound files were tested that did not work with the algorithm, including some proposed in the initial report submission and extra samples of music from my own library. As with the PSOLA algorithm, the error message seemed to be related to pitch marks. This conclusion is an educated guess: the pitch marker program that was developed is probably not sophisticated enough to place the marks properly for complex signals with many harmonics.
It is also possible that many of these signals, which included multiple instruments and the like, had no primary harmonics to work on, and that this led to the errors. Another explanation may be that too many pitch marks were found (erroneously), so that the shifted Hanning window could not operate properly on the signal; this is where psolaf1.m failed on the more complex signals, and where psola.m failed on those signals as well as others. Since all the signals that did run through psolaf1.m had fairly distinct pitches, it is reasonable to expect that the algorithm will work for other signals provided that their pitch marks are determined with very good accuracy.

5.0 Conclusions

Given the limited time available for this project, much further work could clearly be done in this area. That said, the sound files produced show that the project succeeded in realizing a system that can modify pitch while maintaining the integrity of the original sound.

6.0 Future Considerations

Although this project was intended to be implemented as a real-time system, this was not possible given the time available and the problems encountered. With further resources and a better understanding of how to transfer programs from a Matlab environment to a real-time system, this PSOLA with formant preservation program would be implemented in Marsyas. Due to limitations in time and ability this did not occur. More importantly, accurate pitch detection should be considered a high priority, because the better methods that preserve the message quality need input pitch marks on which to centre some sort of window. Without properly placed marks, this project is not very useful for any real-world application.
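As one possible starting point for more accurate pitch detection, the sketch below estimates the pitch period of a single block by autocorrelation instead of by peak picking. This is a swapped-in, standard technique and not the method used in pitchmarker.m. The sampling rate, test signal and 80 to 500 Hz search range are assumptions chosen for illustration.

```matlab
% Sketch: estimate the pitch period of one block by autocorrelation.
% fs, f0 and the lag search range are assumed values.
fs = 8000;                       % assumed sampling rate in Hz
f0 = 200;                        % assumed fundamental of the test signal
n  = (0:1023)';
x  = sin(2*pi*f0*n/fs) + 0.5*sin(2*pi*2*f0*n/fs);   % tone plus one harmonic

minLag = round(fs/500);          % shortest period searched (500 Hz)
maxLag = round(fs/80);           % longest period searched (80 Hz)
r = zeros(maxLag, 1);            % lags below minLag are left at zero
for lag = minLag:maxLag
    % unnormalised autocorrelation of the block at this lag
    r(lag) = sum(x(1:end-lag) .* x(1+lag:end));
end
[~, period] = max(r);            % lag with the strongest self-similarity
fprintf('estimated period: %d samples (%.1f Hz)\n', period, fs/period);
```

Once a period estimate is available for each block, pitch marks could be placed one period apart near the local maxima, which constrains the search far more than an amplitude threshold alone.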
7.0 References

[BB89] K. Bogdanowicz and R. Blecher. Using multiple processors for real-time audio effects. In AES 7th International Conference, pp. 336-342, 1989.
[BJ95] R. Bristow-Johnson. A detailed analysis of a time-domain formant-corrected pitch shifting algorithm. J. Audio Eng. Soc., 43(5):340-353, 1995.
[DAFX] U. Zölzer. Digital Audio Effects. John Wiley and Sons, pp. 202-225, 2005. http://www.dafx.de/
[DZ99] S. Disch and U. Zölzer. Modulation and delay line based digital audio effects. In Proc. DAFX-99 Digital Audio Effects Workshop, pp. 4-8, Trondheim, December 1999.
[HMC89] C. Hamon, E. Moulines and F. Charpentier. A diphone synthesis system based on time-domain prosodic modifications of speech. In Proc. ICASSP, pp. 238-244, 1989.
[MC90] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 9(5/6):453-467, 1990.
[MEJ86] J. Makhoul and A. El-Jaroudi. Time-scale modification in medium to low rate speech coding. In Proc. ICASSP, pp. 1705-1708, 1986.
[ML95] E. Moulines and J. Laroche. Non-parametric techniques for pitch-scale and time-scale modification of speech. Speech Communication, 16:175-205, 1995.
[RW85] S. Roucos and A.M. Wilgus. High quality time-scale modification for speech. In Proc. ICASSP, pp. 493-496, 1985.
[SPL00] N. Schnell, G. Peeters, S. Lemouton, P. Manoury and X. Rodet. Synthesizing a choir in real-time using Pitch Synchronous Overlap Add (PSOLA). Ircam Centre Georges-Pompidou, pp. 1-4, 2000.
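Appendix: PSOLA_Formant.m

% Pitch Shifting by PSOLA with Formant Preservation
% Josh Patton
% PSOLA_Formant.m
% Files required:
%   psolaf1.m
%   pitchmarker.m
% Lines marked "assumed" are not in the listing as printed. They are filled
% in from the psolaf1 signature, and the default parameter values are guesses.
clear all
close all
clc

%% la.wav
[x,fs,nbits]=wavread('la.wav');
m=pitchmarker(x);                   % assumed: pitch marks for this clip
alpha=1; beta=1; gamma=1;           % assumed defaults (no change)
gamma=2;                            % scale the formant frequencies
y=psolaf1(x,m,alpha,beta,gamma);    % assumed
wavwrite(y, fs, 'la_gamma2.wav');
beta=(3/2);                         % raise the pitch
y=psolaf1(x,m,alpha,beta,gamma);    % assumed
wavwrite(y, fs, 'la_high.wav');
beta=(3/4);                         % lower the pitch
y=psolaf1(x,m,alpha,beta,gamma);    % assumed
wavwrite(y, fs, 'la_low.wav');

%% flute2.wav
[x,fs,nbits]=wavread('flute2.wav');
m=pitchmarker(x);                   % assumed
alpha=1; beta=1; gamma=1;           % assumed defaults
gamma=2;
y=psolaf1(x,m,alpha,beta,gamma);    % assumed
wavwrite(y, fs, 'flute2_gamma2.wav');
beta=(3/2);
y=psolaf1(x,m,alpha,beta,gamma);    % assumed
wavwrite(y, fs, 'flute2_high.wav');
beta=(3/4);
y=psolaf1(x,m,alpha,beta,gamma);    % assumed
wavwrite(y, fs, 'flute2_low.wav');

%% moore_guitar.wav
[x,fs,nbits]=wavread('moore_guitar.wav');
m=pitchmarker(x);                   % assumed
alpha=1; beta=1; gamma=1;           % assumed defaults
y=psolaf1(x,m,alpha,beta,gamma);    % assumed
wavwrite(y, fs, 'moore_guitar_gamma1.wav');
gamma=2;
y=psolaf1(x,m,alpha,beta,gamma);    % assumed
wavwrite(y, fs, 'moore_guitar_gamma2.wav');
beta=(3/2);
y=psolaf1(x,m,alpha,beta,gamma);    % assumed
wavwrite(y, fs, 'moore_guitar_high.wav');
beta=(3/4);
y=psolaf1(x,m,alpha,beta,gamma);    % assumed
wavwrite(y, fs, 'moore_guitar_low.wav');

%% x1.wav
[x,fs,nbits]=wavread('x1.wav');
m=pitchmarker(x);                   % assumed
alpha=1; beta=1; gamma=1;           % assumed defaults
y=psolaf1(x,m,alpha,beta,gamma);    % assumed
wavwrite(y, fs, 'x1_gamma1.wav');
gamma=2;
y=psolaf1(x,m,alpha,beta,gamma);    % assumed
wavwrite(y, fs, 'x1_gamma2.wav');
beta=(3/2);
y=psolaf1(x,m,alpha,beta,gamma);    % assumed
wavwrite(y, fs, 'x1_high.wav');
beta=(3/4);
y=psolaf1(x,m,alpha,beta,gamma);    % assumed
wavwrite(y, fs, 'x1_low.wav');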
Appendix: psolaf1.m

% psolaf1.m
% Performs pitch-synchronous overlap-add pitch shifting with formant
% preservation, using pitch marks from an external source.
% Based on psolaf.m from DAFX.
function out=psolaf1(in,m,alpha,beta,gamma)
%...
% gamma newformantfreq/oldformantfreq
%...
P = diff(m);                       % compute pitch periods
if m(1)<=P(1),                     % remove first pitch mark
    m=m(2:length(m));
    P=P(2:length(P));
end
if m(length(m))+P(length(P))>length(in)   % remove last pitch mark
    m=m(1:length(m)-1);
else
    P=[P P(length(P))];
end
Lout=ceil(length(in)*alpha);
out=zeros(1,Lout);                 % output signal
tk = P(1)+1;                       % output pitch mark
while round(tk)<Lout
    [minimum i]=min(abs(alpha*m-tk));     % find analysis segment
    pit=P(i); pitStr=floor(pit/gamma);
    gr=in(m(i)-pit:m(i)+pit).*hanning(2*pit+1);
    gr=interp1(-pit:1:pit,gr,-pitStr*gamma:gamma:pit);   % stretch segm.
    iniGr=round(tk)-pitStr; endGr=round(tk)+pitStr;
    if endGr>Lout, break; end
    out(iniGr:endGr)=out(iniGr:endGr)+gr; % overlap new segment
    tk=tk+pit/beta;
end
Appendix: TimescaleResamplePSOLA.m

% Pitch Shifting by PSOLA Time Stretching and Resampling
% Josh Patton
% TimescaleResamplePSOLA.m
% Files required:
%   psola.m
%   pitchmarker.m
% Lines marked "assumed" are not in the listing as printed.

%% test one x1
[x,fs,nbits]=wavread('x1.wav');
m=pitchmarker(x);                    % assumed: pitch marks for this clip
beta=1;                              % assumed: time stretch only
alpha=(3/2);
y=psola(x,m,alpha,beta);             % stretch to 3/2 of the original length
y=resample(y,length(x),length(y));   % back to original length, pitch up
wavwrite(y, fs, 'psola_high_x1.wav');
alpha=(3/4);
y=psola(x,m,alpha,beta);             % compress to 3/4 of the original length
y=resample(y,length(x),length(y));   % back to original length, pitch down
wavwrite(y, fs, 'psola_low_x1.wav');

%% test moore_guitar
[x,fs,nbits]=wavread('moore_guitar.wav');
m=pitchmarker(x);                    % assumed
beta=1;                              % assumed
alpha=1.5;
y=psola(x,m,alpha,beta);
y=resample(y,length(x),length(y));
wavwrite(y, fs, 'psola_high_moore_guitar.wav');
alpha=0.75;
y=psola(x,m,alpha,beta);
y=resample(y,length(x),length(y));
wavwrite(y, fs, 'psola_low_moore_guitar.wav');
Appendix: psola.m

% psola.m
% from DAFX
% Josh Patton
function out=psola(in,m,alpha,beta)
% in     input signal
% m      pitch marks (from PitchMarker.m function)
% alpha  time stretching factor
% beta   pitch shifting factor
P = diff(m);                       % compute pitch periods
if m(1)<=P(1),                     % remove first pitch mark
    m=m(2:length(m));
    P=P(2:length(P));
end
if m(length(m))+P(length(P))>length(in)   % remove last pitch mark
    m=m(1:length(m)-1);
else
    P=[P P(length(P))];
end
Lout=ceil(length(in)*alpha);
out=zeros(1,Lout);                 % output signal
tk = P(1)+1;                       % output pitch mark
while round(tk)<Lout
    [minimum i] = min( abs(alpha*m - tk) );   % find analysis segment
    pit=P(i); st=m(i)-pit; en=m(i)+pit;
    gr = in(st:en) .* hanning(2*pit+1);
    iniGr=round(tk)-pit; endGr=round(tk)+pit;
    if endGr>Lout, break; end
    out(iniGr:endGr) = out(iniGr:endGr)+gr';  % overlap new segment
    tk=tk+pit/beta;
end
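To sanity-check an overlap-add loop of this kind without depending on a pitch marker, one option is a synthetic signal whose pitch marks are known in advance. The sketch below is a simplified, self-contained variant written for this purpose and is not the DAFX routine itself: the period P0, the evenly spaced marks, the directly computed Hann window and the parameter values are all assumptions.

```matlab
% Sketch: PSOLA-style overlap-add on a synthetic signal with known marks.
% P0, N, alpha and beta are assumed values chosen for illustration.
P0 = 40;                                  % assumed pitch period in samples
N  = 4000;                                % assumed signal length
x  = sin(2*pi*(0:N-1)'/P0);               % periodic test signal (column)
m  = (P0+1:P0:N-P0)';                     % evenly spaced pitch marks
alpha = 1;                                % keep the duration
beta  = 1.25;                             % raise the pitch by 25 percent

w    = 0.5*(1 - cos(2*pi*(0:2*P0)'/(2*P0)));   % Hann window, two periods long
Lout = ceil(N*alpha);
out  = zeros(Lout, 1);                    % output signal
tk   = P0 + 1;                            % first synthesis pitch mark
while round(tk) + P0 <= Lout
    [~, i] = min(abs(alpha*m - tk));      % nearest analysis pitch mark
    gr  = x(m(i)-P0 : m(i)+P0) .* w;      % windowed two-period grain
    idx = round(tk)-P0 : round(tk)+P0;    % where the grain lands in the output
    out(idx) = out(idx) + gr;             % overlap-add
    tk = tk + P0/beta;                    % synthesis marks are P0/beta apart
end
newPeriod = P0/beta;                      % expected output period in samples
```

Because every grain is identical here, the interior of the output should repeat every P0/beta samples, which gives a simple check that the synthesis spacing, and not the grain content, sets the new pitch.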
Appendix: pitchmarker.m

% pitchmarker.m
% Josh Patton
% Finds all the pitch marks in the input signal and returns them
% as a vector of sample indices
function [ pitch ] = pitchmarker(blk_section)
%% test from within (comment out the above function line)
%[x,fs,bit]=wavread('moore_guitar.wav');
%blk_section=x;

%% Detection
% initial setup
mark=[1:length(blk_section)]*0;
last_pos=1;
place=1;
blk_size=300;
i=1;
while last_pos+floor(blk_size*1.7) < length(blk_section)
    % grabs the next block to examine
    temp=blk_section(last_pos+50:last_pos+floor(blk_size*1.7));
    % finds the high point in the block
    [mag,place]=max(temp);
    % check for a signal in the current block
    if mag < 0.01
        place=length(temp);
        mode = 0;
        mark(place+last_pos+50)=1;
        pitch(i)=place+last_pos+50;
    else
        mode = 1;
    end
    % check for pitch mark before current pitch mark
    while mode == 1
        % find the largest point in block from start to current pitch mark
        [mag2,place2]=max(temp(1:place-50));
        % check if high mark has great enough magnitude to be a pitch mark
        if mag2 > 0.90*mag
            mag=mag2;
            place=place2;
        else
            mode = 0;
            mark(place+last_pos+50)=1;
            pitch(i)=place+last_pos+50;
        end
    end
    % next block to look at is 50 samples after current block
    blk_size=place+50;
    % makes sure next blk_size is of large enough size
    if blk_size < 150
        blk_size=150;
    end
    last_pos=place+last_pos+50;
    i=i+1;
end

%% Plotting if needed
% figure(1)
% hold on
% plot(mark)
% plot(blk_section,'r')