A Real-time Framework for Video Time and Pitch Scale Modification

Dublin Institute of Technology, ARROW@DIT — Conference papers, Audio Research Group, 2008-06-01

A Real-time Framework for Video Time and Pitch Scale Modification

Ivan Damnjanovic, Queen Mary University of London
Dan Barry, Dublin Institute of Technology, dan.barry@dit.ie
David Dorran, Dublin Institute of Technology
Josh Reiss, Queen Mary University of London

Follow this and additional works at: http://arrow.dit.ie/argcon
Part of the Signal Processing Commons

Recommended Citation: Damnjanovic, I. et al. (2008) A Real-Time Framework for Video Time and Pitch Scale Modification. Proc. of the 11th International Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, September 1-4, 2008.

This Conference Paper is brought to you for free and open access by the Audio Research Group at ARROW@DIT. It has been accepted for inclusion in Conference papers by an authorized administrator of ARROW@DIT. For more information, please contact yvonne.desmond@dit.ie, arrow.admin@dit.ie, brian.widdis@dit.ie. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.

MM-002816.R2

A Real-time Framework for Video Time and Pitch Scale Modification
Ivan Damnjanovic, Dan Barry, David Dorran, and Joshua D. Reiss

Abstract— A framework is presented which addresses the issues related to the real-time implementation of synchronised video and audio time-scale and pitch-scale modification algorithms. It allows for seamless real-time transition between continually varying, independent time-scale and pitch-scale parameters arising as a result of manual or automatic intervention. We illuminate the problems which arise in a real-time context as well as provide novel solutions to prevent artefacts, minimise latency, and improve synchronisation. The time and pitch scaling approach is based on a modified phase vocoder with optional phase locking and an integrated transient detector which enables high quality transient preservation in real-time. A novel method for audio/visual synchronisation was implemented in order to ensure no perceptible latency between audio and video while real-time time scaling and pitch shifting is applied. Evaluation results are reported which demonstrate both high audio quality and minimal synchronisation error.

Index Terms— Time scale modification, audio/visual synchronisation, adaptive video refresh rate

I. INTRODUCTION

Synchronised audio and video time stretching is often used in video editing and production whenever video content needs to be sped up or slowed down, either as a creative effect or to fit certain time slots within a programme schedule, as is the case in television advertisements. Time-scale modification (TSM) is typically used to change the tempo of musical content or the playback rate of speech without affecting pitch content. Conversely, pitch-scale modification (PSM) algorithms enable pitch shifting without affecting the playback rate of the audio content. A significant amount of research has been dedicated to both TSM and PSM, yielding a variety of time and frequency domain algorithms.
Despite this abundance of literature and readily available commercial applications, there is still a lack of information, understanding and consideration for real-time implementations of TSM and PSM algorithms. Here we illuminate some of the problems which arise in a real-time context, as well as provide novel solutions to these issues. A real-time software-based framework is presented which allows time stretching of audio content within digital video streams whilst maintaining synchronisation with the video content. Time-scale changes can be made in real-time with almost unperceivable latency and no transitional artefacts. In addition, the approach also supports real-time pitch shifting of the audio content independent of time-scale changes. The approach is based on a modified phase vocoder with optional phase locking and an integrated transient detector which enables high quality transient preservation in real-time.

Copyright (c) 2008 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org. Manuscript received June 11, 2009. This work was supported in part by the European Community under the Information Society Technologies (IST) programme of the 6th FP for RTD - project EASAIER, contract IST-033902. I. Damnjanovic is with Queen Mary University of London, London, E1 4NS, UK (telephone: +44-2078827880, e-mail: ivan.damnjanovic@elec.qmul.ac.uk). Dan Barry is with the Audio Research Group in the Dublin Institute of Technology, Kevin Street, Dublin 8, Ireland (telephone: +353 1 4022862, e-mail: dan.barry@dit.ie). David Dorran is with the Audio Research Group in the Dublin Institute of Technology, Kevin Street, Dublin 8, Ireland (telephone: +353 1 4024873, e-mail: david.dorran@dit.ie). Josh Reiss is with Queen Mary University of London, London, E1 4NS, UK (telephone: +44-2078827982, e-mail: josh.reiss@elec.qmul.ac.uk).
Within this article, emphasis is given to audio/visual synchronisation issues which arise in such a framework. Despite the growth in algorithms for independent audio time or pitch modification, there are relatively few applications which address combined time stretching of video and audio. In [1], a method for adjusting video playback rate to compensate for network delay is presented. Similarly, [2] presents an adaptive method for video playback, intended to address issues concerning packet loss and random delays in streaming applications. Their method uses audio time scaling when the streamed video playback speed is modified, as suggested for packet loss in voice communication [3]. Synchronised audio and video time scaling is typically used in video editing and production whenever video content needs to be sped up or slowed down, either as a creative effect or to fit certain time slots within a programme schedule. For example, TSM can be used to alter the duration of an advertisement whilst preserving the pitch and timbre of speech and other audio content. Experiments have shown that increasing the information rate in commercials makes them more engaging and more favorable to viewers. In [4], it was suggested that an increase in the rate of information of up to 130 percent of the typical speech rate can significantly increase the impact of advertisements. The driving force for the work presented here on real-time synchronised audio/video time-stretching comes from user requirements and user feedback in music education research [5, 6], which indicated that time-scaled video would be desirable in applications related to aural learning, music transcription and musical technique analysis. The effects of audio/video time-compression and expansion on the learning

process have been thoroughly studied [4-8]. Besides time efficiency benefits, it was shown that learning from accelerated material can be at least equally as effective as the normal speed of presentation. There were further findings that students watching accelerated material stay more focused. At normal speech rates they become bored and their attention begins to wander [7], and learning processes benefit from acceleration of presentation as long as intelligibility can be maintained [8]. For entertainment applications, internet video streaming, digital video players and set-top devices can benefit greatly from an audio/video time stretching tool. Studies of digital video browsing [9] noted that one of the highest rated enhanced features was watching time-compressed video.

II. AUDIO TIME-SCALE MODIFICATION

Time-scale modification can be achieved in a number of ways in both the time and frequency domain. However, time domain approaches are typically not considered ideally suited to mixed audio content, which may include speech, polyphonic music and ambient noise. As such, the real-time time-scale modification technique proposed here is based on a set of modifications to the phase vocoder [10], a popular frequency domain approach to time-scaling. A comprehensive tutorial outlining the theory of the traditional phase vocoder is presented in [11], and a brief description is provided here. The Fourier transform interpretation of the phase vocoder is mathematically equivalent to a short time Fourier transform (STFT) [12], which segments the analysed signal into overlapping frames which are separated by a certain hop size. Within phase vocoder implementations, TSM is achieved by varying the analysis hop size R_a with respect to the resynthesis hop size R_s, such that the time scaling factor is calculated as α = R_s/R_a. It follows then that R_a > R_s will result in time-scale compression (speed up), and R_a < R_s will result in time-scale expansion (slow down).
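The hop-size relationship above can be sketched in a few lines of plain Python (the frame length and hop values here are illustrative, not prescribed by this section):

```python
# Minimal sketch of the phase-vocoder hop relationship alpha = R_s / R_a.
N = 4096                 # illustrative analysis frame length
R_s = N // 4             # fixed synthesis hop for 75% overlap

def analysis_hop(alpha):
    """Analysis hop R_a implied by the time-scale factor alpha = R_s / R_a."""
    return R_s / alpha

# alpha < 1: R_a > R_s, input frames taken further apart -> compression (speed up)
# alpha > 1: R_a < R_s, input frames taken closer together -> expansion (slow down)
print(analysis_hop(0.5))   # 2048.0 (speed up)
print(analysis_hop(2.0))   # 512.0  (slow down)
```

Note the direction of the inequality: a larger analysis hop consumes input faster than it is resynthesised, which is why R_a > R_s compresses.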
Within the phase vocoder, analysis frames are remapped along the time axis, resulting in newly constructed synthesis frames, each with a modified phase spectrum, to ensure that the synthesis frames maintain phase coherence through time. Since the phase spectrum of each frame must be modified, the windowing function will also be affected. For this reason, a resynthesis window is necessary, and a 75% overlap is recommended to avoid modulation at the output. This will result in the output having a constant gain factor of approximately 1.5, which can easily be compensated by multiplying all samples by the reciprocal of the gain factor. An overlap of 75% corresponds to a fixed synthesis hop size, R_s, of N/4 samples. In order for the synthesis frames to overlap synchronously, the frame phases must be updated such that phase continuity is maintained between adjacent output frames. The standard method used to calculate suitable synthesis phases involves calculation of the instantaneous frequency of each bin in radians per sample. Having obtained the instantaneous frequency, it is possible to predict the expected phase of any component for a given synthesis hop size. Given that the frequency content of both music and speech is stationary only over short periods, phase estimates will decrease in accuracy as the hop sizes increase. The most accurate way to estimate the phase of each component is by first calculating the principal argument of the heterodyned phase increment between adjacent analysis frames, as defined in [10, 11]. The instantaneous frequency is then calculated in radians per sample. In order to calculate the phase spectrum for the new synthesis frame at the time-scaled output, the instantaneous frequency is multiplied by the synthesis hop size R_s and added to the resultant synthesis phases from the previous frame. This is known as phase propagation or phase updating. The newly modified phases, along with the original magnitude spectrum, are then used to reconstruct the audio frame.
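The textbook phase-propagation step described above can be sketched as follows (a NumPy sketch of the standard procedure from [10, 11], not this paper's exact code; function and variable names are illustrative):

```python
import numpy as np

def princarg(phi):
    """Map phase to the principal range [-pi, pi)."""
    return np.mod(phi + np.pi, 2.0 * np.pi) - np.pi

def propagate_phases(synth_phase_prev, phase_prev, phase_curr, R_a, R_s, N):
    """One step of standard phase propagation: take the principal argument of
    the heterodyned phase increment between adjacent analysis frames, form the
    instantaneous frequency, then advance the synthesis phase by one synthesis hop."""
    k = np.arange(N // 2 + 1)
    omega_k = 2.0 * np.pi * k / N                  # bin centre frequencies (rad/sample)
    dphi = princarg(phase_curr - phase_prev - R_a * omega_k)
    inst_freq = omega_k + dphi / R_a               # instantaneous frequency (rad/sample)
    return synth_phase_prev + R_s * inst_freq      # phase update
```

For a component sitting exactly on a bin centre, dphi is zero and the synthesis phase simply advances by R_s times the bin frequency, as expected.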
Although the time-scaled output is horizontally phase coherent at this point, the timbral quality is often described as sounding "phasey" or distant, and is generally not regarded as natural sounding. Particularly noticeable is how transients are affected by the phase vocoder. These artifacts can be attributed to the fact that the standard phase vocoder only attempts to achieve an optimal phase relationship between adjacent frames, known as horizontal phase coherence. However, the pursuit of horizontal phase coherence has a profoundly negative effect on vertical phase coherence, which describes the relationship between the phases of frequency components within a single frame. Maintaining vertical phase coherence is an important consideration in order to achieve natural sounding TSM. The improved phase vocoder [13] explicitly attempts to identify sinusoidal frequency bins in FFT frames by a peak picking process within the magnitude spectrum. The phases of these truly sinusoidal peak frequency bins are then updated in the traditional manner, i.e., by maintaining horizontal phase coherence between corresponding peak frequency bins of successive frames. The non-sinusoidal frequency bins are then updated by maintaining the phase difference that existed between each bin and its closest peak/sinusoidal frequency bin. This process is known as peak locking.

III. REAL-TIME CONSIDERATIONS FOR DYNAMIC TIME-SCALING

When a fixed time-scale factor is applied to an entire audio signal, both R_a and R_s remain fixed. In this case, the position in time of any analysis or synthesis frame can be defined as t_a^u = u·R_a and t_s^u = u·R_s, respectively, where u is an incrementing integer representing a sequence of frames, as in [10]. For real-time implementations, where the time-scale factor, α, may be varying dynamically due to user intervention, this definition will introduce distortions into the time-scaled output, since the analysis hop is no longer fixed. The solution is to redefine t_a^u = u·R_a as t_a^u = t_a^(u−1) + R_s/α.
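The redefined analysis-position recurrence can be sketched as follows (plain Python; the hop value is illustrative):

```python
# Sketch of the redefined analysis position: rather than the fixed
# t_a(u) = u * R_a, each position advances from the previous one by the hop
# implied by the *current* time-scale factor (R_a = R_s / alpha).
R_s = 1024  # illustrative synthesis hop

def next_analysis_position(t_a_prev, alpha):
    return t_a_prev + R_s / alpha

# With a constant alpha this reduces to the fixed-hop definition; when alpha
# changes, only the hops taken from that point on are affected.
t_a = 0.0
for alpha in (1.0, 1.0, 2.0, 2.0):     # the user slows playback down mid-stream
    t_a = next_analysis_position(t_a, alpha)
print(t_a)   # 1024 + 1024 + 512 + 512 = 3072.0
```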
This recurrence ensures that the current analysis frame position, t_a^u, is always updated correctly. The position in time of the current analysis frame is always related to both the previous analysis frame and the current time scaling factor, α. Although it is favorable to vary the analysis hop R_a and fix the synthesis hop R_s to achieve TSM, it can result in inaccurate

frequency estimation for time-scaling factors α < 1. When the signal is being sped up, the distance between analysis frames exceeds N/4. It becomes impossible to accurately predict the amount of phase unwrapping to be applied during the frequency estimation stage of the horizontal phase update procedure described in [10, 11], resulting in inaccurate synthesis phase estimates. In addition to this, when α is varied over time, the accuracy of the instantaneous frequency estimates also varies. This leads to momentary artefacts whenever the time scale factor, α, is changed. Effectively, the transitions between frames with different TSM factors are not perceptually smooth, despite the windowed overlapping scheme. The solution to both of these problems is to ensure that the instantaneous frequency estimates are always derived using the phase differences between the current analysis frame and a frame one synthesis hop back from the position of the current analysis frame, X(t_a^u − R_s, Ω_k). Although an extra FFT and an extra buffer are required to obtain the phases of this frame, it guarantees that phase unwrapping errors will not be present and that the instantaneous frequency estimates will be consistent regardless of variation in α. The phase update equation [10, 11] is now redefined in (1):

∠Y(t_s^u, Ω_k) = ∠Y(t_s^(u−1), Ω_k) + ∠X(t_a^u, Ω_k) − ∠X(t_a^u − R_s, Ω_k)    (1)

When vertical phase coherence is to be maintained, peak locking can be used, and only the sinusoidal or peak frequency bins are updated using (1), with all other bins updated as in [10, 11]. This method of phase updating removes the need to estimate the instantaneous frequency. However, for the case where pitch scale modification is required, calculation of instantaneous frequency is still necessary. Nonetheless, the hop-back method described above is used to avoid phase unwrapping errors and to maintain smooth pitch and time scale transitions. This will be discussed in the next section.
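A NumPy sketch of the hop-back update of (1) follows. It is illustrative only (the function name, the Hann window and the framing details are assumptions, not the paper's implementation); the second FFT is the extra cost noted in the text:

```python
import numpy as np

def hop_back_phase_update(x, t_a, synth_phase_prev, R_s, N, window):
    """Sketch of Eq. (1): advance the synthesis phase by the phase difference
    between the frame at t_a and the frame exactly one synthesis hop earlier,
    at t_a - R_s (requires t_a >= R_s)."""
    curr = x[t_a : t_a + N] * window
    back = x[t_a - R_s : t_a - R_s + N] * window       # input to the extra FFT
    phase_curr = np.angle(np.fft.rfft(curr))
    phase_back = np.angle(np.fft.rfft(back))
    # Phases only matter modulo 2*pi, so the difference over the fixed hop R_s
    # can be accumulated directly -- no unwrapping against a variable analysis
    # hop is needed, however alpha varies.
    return synth_phase_prev + (phase_curr - phase_back)
```

Because both phase spectra are taken the same fixed distance R_s apart, a stationary sinusoid's synthesis phase advances by its frequency times R_s (modulo 2π), independent of the current analysis hop.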
A similar phase update procedure was proposed in [14], in which time-scale modification is achieved through the insertion and deletion of entire frames. Since the approach we propose here uses a variable analysis hop size, it has the advantage of maintaining better estimates of the magnitude spectrum, thereby greatly reducing the possibility of removing or repeating perceptually salient characteristics within the time-scaled signal.

IV. REAL-TIME PITCH SHIFTING

The simplest method to shift the apparent pitch of a signal is by interpolating or decimating the time domain signal. The resulting signal, although pitch shifted, is also shortened or lengthened by the reciprocal of the interpolation/decimation factor β. A common technique used to shift the pitch and maintain duration is to pitch scale the signal using interpolation/decimation, and then apply complementary time scale modification to restore the original length of the signal. This is easily achieved in the offline context but becomes difficult to implement in a real-time context. If both pitch shifting and time scaling are required simultaneously, the problem becomes more difficult, since time scaling is required for two alternate operations (pitch and time scaling) within the same frame. When the signal is both time scaled and interpolated, for any time scaling factor α and pitch shifting factor β, the required compensatory time scale factor such that the resultant signal has both the required pitch and length [15] is simply αβ. In a real-time context, the pitch and time scaling must be carried out within a single frame interval (in this implementation, 23 ms). Two issues arise. First, the computational requirements are directly related to the product of α and β, since each frame must now be time-scaled internally to compensate for pitch shifting. This makes real-time operation unfeasible for large products αβ. Second, the length of the resultant frame is no longer fixed.
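The duration bookkeeping behind the compensatory factor αβ can be checked with a trivial sketch (illustrative values, not the paper's code): resampling by pitch factor β scales duration by 1/β, and TSM by factor γ scales duration by γ.

```python
# Duration after pitch-resampling by beta (length x 1/beta), then TSM by gamma.
def resulting_duration(n_samples, beta, gamma):
    return n_samples * (1.0 / beta) * gamma

n = 48000
alpha, beta = 1.5, 2.0                   # stretch 1.5x, pitch up one octave
out = resulting_duration(n, beta, alpha * beta)
print(out == n * alpha)                  # True: gamma = alpha*beta hits the target length
```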
An additional buffer must be used in order to handle the overflow if the resultant frame exceeds N (the analysis frame size) samples. If αβ < 1, the resultant frame will be smaller than the required N samples. In this case, more input frames need to be processed until there are sufficient samples to generate an output frame. These issues can make the output unpredictable, added to which the solutions are computationally intensive. Here we present a novel method for real-time pitch shifting which resolves the problematic issues raised above. The computational requirements are not dependent on α and β, and the method guarantees that a fixed frame length can be generated independent of the time and pitch scale factors used. No inter-frame time scaling and no additional buffers are required. The pitch shifting is performed using linear resampling in the time domain, and phase vocoder theory is then applied using a modified phase update equation which incorporates the pitch scaling factor β. In order to generate a pitch shifted frame of known length, we interpolate or decimate the input time domain signal over the range t_a to t_a + Nβ, where N is a fixed analysis frame size chosen to ensure adequate frequency resolution. This results in a time domain frame of length N which has been generated by interpolating or decimating Nβ samples by the pitch scaling factor β. Figure 1 illustrates this procedure. This frame now constitutes an analysis frame which can have arbitrary time scaling applied using the phase update equations presented below.

Figure 1. The real-time re-sampling method used for obtaining fixed length pitch shifted frames. A illustrates no pitch change, B pitching down and C pitching up.
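The fixed-length resampling step can be sketched as below (a NumPy sketch of the Figure 1 procedure under the stated linear-resampling assumption; the function name is illustrative):

```python
import numpy as np

def pitch_shift_frame(x, t_a, N, beta):
    """Read ~N*beta input samples starting at t_a and linearly resample them
    to a fixed-length-N analysis frame: pitch scaled by beta, length fixed.
    Assumes x is long enough that t_a + (N-1)*beta + 1 is a valid index."""
    pos = t_a + np.arange(N) * beta                # fractional read positions
    i = pos.astype(np.int64)
    frac = pos - i
    return (1.0 - frac) * x[i] + frac * x[i + 1]   # linear interpolation
```

On a linear ramp the interpolation is exact, which makes the behaviour easy to verify: the output frame is simply the ramp sampled at stride β.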

The goal is to estimate the phase propagation required to allow successive interpolated frames to be updated such that the pitch shifted and time scaled output is horizontally phase coherent. Recall (1), which was introduced as a preferred method to ensure reliably wrapped phase difference estimates. This was achieved by using an extra FFT to estimate the phases of the frame exactly one synthesis hop back from the current analysis frame, thereby allowing the phase differences to be estimated over a known fixed interval equal to R_s. The apparent analysis hop is now equal to the synthesis hop, but the actual value of R_a is still variable. In order to estimate suitable synthesis phases for pitch shifted frames, the instantaneous frequency must be calculated as follows. A new method to calculate the heterodyned phase increment for pitch shifted frames is given by (2), where the interpolation factor, β, is now included in the equation:

ΔΦ_k^p = ∠X(t_a, Ω_k) − ∠X(t_a − R_s, Ω_k) − R_s·Ω_k/β    (2)

where ∠X(t_a, Ω_k) and ∠X(t_a − R_s, Ω_k) represent the phases of the current analysis frame and of an analysis frame exactly one synthesis hop back from the current value of t_a. The resulting term, ΔΦ_k^p, is then the principal argument of the heterodyned phase increment of the pitch shifted frame, such that it lies in the range −π to π. Since the frames have been interpolated or decimated (resulting in frequency shifts), they will no longer exhibit the expected phase derivatives over a given hop, R_s. To calculate the correct phase increment, the hop must also be multiplied by the reciprocal of the pitch scaling factor, β. The instantaneous frequency in radians per sample of the pitch shifted frame is given by (3):

ω̂_k(t_a) = Ω_k + β·ΔΦ_k^p / R_s    (3)

As opposed to the standard method [10, 11], we divide the phase deviation by R_s instead of R_a, because the method used to calculate the phase difference in (3) uses two frames separated by a fixed distance, R_s.
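Equations (2) and (3) can be sketched directly (a NumPy sketch with illustrative names; the phase spectra are assumed to come from the hop-back FFTs described above):

```python
import numpy as np

def princarg(phi):
    """Principal argument, mapped to [-pi, pi)."""
    return np.mod(phi + np.pi, 2.0 * np.pi) - np.pi

def pitch_shifted_inst_freq(phase_curr, phase_back, R_s, N, beta):
    """Heterodyned phase increment and instantaneous frequency for a resampled
    (pitch-shifted) frame, with the two phase spectra taken exactly R_s apart."""
    k = np.arange(N // 2 + 1)
    omega_k = 2.0 * np.pi * k / N
    dphi_p = princarg(phase_curr - phase_back - R_s * omega_k / beta)   # Eq. (2)
    return omega_k + beta * dphi_p / R_s                                # Eq. (3)
```

As a sanity check, a component whose phase advances by exactly R_s·Ω_k/β over the hop (i.e. a bin-centre component after resampling by β) yields ΔΦ_k^p = 0 and an instantaneous frequency of exactly Ω_k.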
The standard phase update equation [10, 11] can now be used, and peak locking can be applied as discussed previously. The advantages of using (1) for phase updating have already been incorporated in (2) above. We now have modified phase vocoder equations which allow real-time pitch shifting and time stretching simultaneously. A key advantage of using this method for pitch shifting is that compensatory time scaling is not required. Instead, the pitch scaling factor is incorporated in the phase update equations. This guarantees that the computational load remains fixed and predictable for any combination of time and pitch scaling factors.

V. REAL-TIME TRANSIENT PRESERVATION

Although peak locking contributes to maintaining the timbral quality of transients during TSM, transients should not be time-scaled if a natural sounding output is required. An off-line solution was proposed in [16]. The approach taken here is to identify transients automatically in real-time. Upon detection of a transient, the time scale factor α is returned to 1 (no scaling), and the analysis phases are mapped directly to the synthesis phases (phase locking) for the duration of the transient. When the transient has passed, the time scale factor is automatically reset to the α value in force prior to the transient. Transients represent an ideal place to lock the phases, since any discontinuities introduced to the time scaled signal will be masked by the transient itself. In order to identify an analysis frame as a transient [17], the log difference of each frequency component between consecutive frames is calculated as in (4). This measure effectively tells us how rapidly the spectrogram is fluctuating:

X_f(t_a, k) = 20·log₁₀( X(t_a, k) / X(t_a − R_s, k) ),    1 ≤ k ≤ N/2    (4)

where X_f(t_a, k) is the log energy difference between frames separated by R_s, and t_a is the current analysis frame instant. In order to detect the presence of a transient, we define a measure given in (5).
Pe(t_a) = Σ_{k=1}^{N/2} P(t_a, k),  where P(t_a, k) = 1 if X_f(t_a, k) > T_1, and P(t_a, k) = 0 otherwise    (5)

where T_1 is a threshold which signifies the rise in energy, measured in dB, which must be detected within a frequency channel before it is deemed to belong to an onset. In order for the frame to be declared a transient, Pe(t_a) must exceed a second threshold, T_2. In practice we have found that T_1 = 6 dB and T_2 = 3N/8 give satisfactory results for most popular music. Thus, a transient is detected at frame t_a if at least 75% of the bins in the log difference spectrogram, equation (4), exceed a value of 6 dB. Note that using this measure, the energy present in the signal is not the defining factor of the transient. Instead, we assign the transient probability, Pe(t_a), using a measure of how broadband or percussive the onset is [17]. This is based on the number of bins exhibiting a positive first derivative, as described by equation (5). Figure 2 shows the effectiveness of this approach. Despite the fact that the signal itself has little dynamic range, the feature detector is rarely prone to false detections, which makes it ideal for transient detection in time scaling. Furthermore, it can easily be implemented within the current framework, since the only requirement is that the current and previous frame magnitudes are available.

Figure 2. A highly dynamically compressed signal containing rock music is depicted in the top plot. The bottom plot shows the output of the percussive onset detector. Upon detecting a transient, the time scale factor, α, is automatically returned to 1, inhibiting TSM momentarily.
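The detector of (4)-(5) reduces to a few lines (a NumPy sketch; the small eps guard against log of zero is an added assumption, not part of the paper):

```python
import numpy as np

def is_transient(mag_curr, mag_prev, T1_db=6.0, frac=0.75, eps=1e-12):
    """Declare a transient when more than frac of the bins rise by over T1_db
    dB relative to the frame R_s earlier. The paper's T1 = 6 dB and
    T2 = 3N/8 correspond to frac = 0.75 of the N/2 bins."""
    diff_db = 20.0 * np.log10((mag_curr + eps) / (mag_prev + eps))  # Eq. (4)
    Pe = np.count_nonzero(diff_db > T1_db)                          # Eq. (5)
    return Pe > frac * mag_curr.size

# A spectrum that jumps by 20 dB in every bin is flagged; an unchanged one is not.
print(is_transient(np.full(64, 10.0), np.ones(64)))   # True
print(is_transient(np.ones(64), np.ones(64)))         # False
```

As the text notes, this is a broadband-rise criterion rather than an energy threshold, which is why it tolerates heavily compressed material.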

MM-002816.R2

We term this method transient hopping. In addition, the frame phases are locked and the frame is mapped directly to the output. This mechanism preserves the transient and ensures that it is reproduced unaffected at the output. Since we use 75% overlap (R_s = 1024 for an analysis frame length of 4096), a short transient will exist in 4 consecutive frames. In order to preserve the transient correctly, the TSM factor, α, must remain at a value of 1 until all overlapping frames have passed the transient. Since the local time scale factor is reduced, a time scale compensation factor is applied after each transient. Eq. (6) describes this action:

α_T = { 1                  if t < 4
      { α(m − 1)/(m − α)   if 4 ≤ t < N_F + 4      (6)
      { α                  otherwise

where t is the number of frames elapsed since the transient was detected, α is the global time scale factor, α_T is the TSM factor to be applied during the frames around the transient, and m is the maximum desired TSM factor, which must be strictly greater than 1. The number of frames, N_F, over which the time scale compensation factor must be applied after the transient depends on the maximum timescale factor, such that N_F = 4m − 4. Using a larger number of frames to compensate for the transient has the advantage that smaller TSM factors may be distributed over a longer time period, thus reducing signal distortion due to excessive timescale factors. Figure 3 illustrates how the time scaling factor is varied before and after the transient in order to both preserve the transient and to maintain a constant global time scale factor.

Figure 3. Time scale factor as a function of transient detection.

VI. BUFFER SCHEMES

One of the key issues in a real-time implementation of TSM is the choice of buffer scheme, and for completeness' sake we suggest a suitable scheme here. In offline processing, the entire signal is overlapped and concatenated before playback. However, in a real-time environment, a constant stream of processed audio must be outputted and consecutive output frames must be continuous. For seamless concatenation, the boundaries of each output frame must be at the constant gain associated with the overlap factor in order to avoid modulation. The method presented below addresses this concern. For reasons discussed in previous sections, a 75% overlap is recommended. This effectively means that at any one time instant, 4 analysis frames are actively contributing to the current output frame.

Figure 4. The relationship between input and output frames for α=1.

In Figure 4, the audio to be processed is divided into overlapping frames of length N. In order to output a processed frame, 4 full frames would need to be processed and overlapped. This leads to considerable latency from the time a parameter change is effected to the time when its effects are audible at the output. However, given that the synthesis hop size is fixed at R_s = N/4, we can load and process a single frame of length N, output ¼ of the frame, and retain the rest in a buffer to overlap with audio in successive output frames. To do this, a buffer of length N is required in which the current processed frame (with synthesis window applied) is placed. Three additional buffers of length 3N/4, N/2 and N/4 will also be required to store the remaining segments from the 3 previously processed frames. Each output frame of length N/4 is then generated by summing samples from each of these 4 buffers. Figure 5 shows how the buffer scheme works. On each iteration, a full frame, F, of length N is processed and placed in buffer 1. The remaining samples from the 3 previous frames occupy buffers 2, 3 and 4. The required output frame of length N/4, S, is generated as defined in (7).

Figure 5. Real-time output buffer scheme using a 75% overlap. The gray arrows indicate how each segment of each buffer is shifted after the output frame has been generated.

S(n) = F(n) + F¹(n + N/4) + F²(n + N/2) + F³(n + 3N/4),   1 ≤ n ≤ N/4   (7)

where F^i denotes the frame processed i iterations earlier. From (7), it can be seen that the output frame, S(n), is generated by summing the first N/4 samples from each buffer.
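The summation of equation (7) and the buffer shifting of Figure 5 can be sketched with a single running tail standing in for buffers 2-4. This is a hedged sketch, not the paper's implementation: the class and method names are ours, and frames are assumed to arrive already windowed.

```python
class OverlapAddOutput:
    """Four-buffer output scheme for 75% overlap (equation (7), Figure 5):
    each call accepts one processed, windowed frame of length N and
    returns N/4 finished output samples. A running tail of length 3N/4
    plays the role of buffers 2-4; names are ours."""

    def __init__(self, n):
        self.n = n
        self.hop = n // 4                    # synthesis hop R_s = N/4
        self.tail = [0.0] * (n - self.hop)   # pending samples of the 3 previous frames

    def push(self, frame):
        assert len(frame) == self.n
        acc = self.tail + [0.0] * self.hop
        acc = [a + f for a, f in zip(acc, frame)]  # overlap-add, eq. (7)
        self.tail = acc[self.hop:]                 # shift buffers (gray arrows, Fig. 5)
        return acc[:self.hop]                      # S(n): the N/4 output samples
```

With a constant-overlap window, the steady-state output gain equals the overlap factor; pushing all-ones frames, for instance, settles to a constant 4 after three warm-up calls, which is the constant boundary gain required for seamless concatenation.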
Once the output frame has been generated and outputted, the first N/4 samples in each buffer can be discarded. The data in all buffers must now be shifted in order to prepare for the next iteration. The gray arrows in Figure 5 illustrate how each segment of each buffer is shifted in order to accommodate a newly processed frame in the next iteration. The order in which the buffers are shifted is vital. Buffer 4 is filled with the remaining N/4 samples from buffer 3, buffer 3 is then filled

with the remaining N/2 samples from buffer 2, and finally buffer 2 is filled with the remaining 3N/4 samples from buffer 1. Buffer 1 is now empty and ready to receive the next processed frame of length N. The result of this scheme is that ¼ of a processed frame will be outputted at time intervals of R_s, which is equal to N/4 samples. Using the suggested frame size of 4096 samples, the output will be updated every 1024 samples, which is approximately equal to 23.2 milliseconds at a 44.1 kHz sample rate. The audio will be processed with newly updated parameters every 23.2 milliseconds, but the latency will be larger than this and depends on the time required to access and write to hardware buffers in the audio interface. In general, however, it is possible to achieve latencies < 40 ms.

VII. SYNCHRONISATION WITH THE HOST APPLICATION

The requirement to synchronise independent time and pitch scaling with video and screen updates adds additional complexity. To maintain multimedia synchronisation, the time scaling process should control the master clock within an application. In this section, we present a real-time media synchronisation framework which has made this possible. Previous sections have described in detail the audio processing blocks required to achieve real-time time and pitch scaling simultaneously. Figure 6 shows how the overall system is configured.

Figure 6. Overview of clocking between time/pitch scale modification and host application.

Firstly, it is important to note that, in order to allow time scale modification to be carried out in real-time whilst maintaining synchronisation with other media such as video or screen updates, e.g., the audio locators, it is necessary to pass full control of the host clock to the time scaling algorithm. This is because time scaling by its very nature involves manipulation of the time base of the audio. As described previously, the time increment between frames is purely dependent on the choice of time scale factor.
Furthermore, if we wish to continuously vary the time-scale factor, the time line becomes non-linear at transition points. Essentially, the time scale algorithm must be able to request any audio frame, starting at any sample point within the audio stream. With this in mind, the first stage involves loading an audio frame defined by the time scale algorithm itself. Immediately following this, the first stage of pitch shifting is achieved by interpolating or decimating the input waveform by the pitch scaling factor. Regardless of time or pitch scale factor, one full frame is always populated on every iteration. For example, using a pitch scale factor of 2, 2N samples will be interpolated to produce an N sample frame, where N is the frame size. If the frame is identified as a transient, no further processing is applied, and time scaling is suspended for 4 frames (due to the 75% overlap). The frames around a transient are reproduced at the output identical to the input, and the audio clock is updated as normal. If no transient has been detected, the phases are updated according to the modified phase update equations. Pitch shifting is only completed at this stage, since the phase update procedure needs to include the interpolation factor. Following this, the processed audio frame is reproduced and re-windowed. The audio clock is then updated, and the frame incremented by a varying factor depending on the user input (i.e., TSM factor). In order to produce a continuous stream of audio, the buffer scheme described above is used. Regardless of what processing is carried out by the timescaling algorithm, it is solely responsible for updating the host clock. The host then uses this information to update screen components which depend on audio playback position. Thus, all screen components, processes and visualisations are synchronised with the audio clock, which is controlled by the time-scale modification algorithm.

VIII.
VIDEO SYNCHRONISATION

Combined audio/visual artefacts that can be introduced due to loss of synchronisation are often the most perceptually undesirable. Failure to keep audio and video streams synchronised, known as lip sync error, results in audio events occurring before or after the associated video frames. When audio advances video by 20 ms, or when audio lags video by 40 ms, it becomes detectable. Errors of +40 ms and −160 ms are subjectively annoying, as reported by the International Telecommunications Union (ITU) in 1993 [18]. Further research reported in ITU-R BT.1359-1 [19] showed reliable detection of 45 ms audio leading and 125 ms audio lagging, while the acceptability region is even wider. The ITU recommends that the difference between audio and video should be no less than −185 ms and no more than +90 ms. In reality, this range is probably too wide for acceptable performance. For example, in video footage of musical instruments being played, key strokes or string plucks are more precise than lip movement during speech, so the synchronisation thresholds need to be reduced. In addition, when a video has been stretched it can be easier to analyse, and therefore synchronisation errors become more perceivable. In this section, three approaches to the preservation of audio/video synchronisation in time scaling applications will be presented. Insertion and deletion of frames is necessary when the frame rate is dictated by the playback device. Television standards such as PAL/SECAM and NTSC use

standardised refresh rates, and hence the output of a time stretching module must maintain a corresponding frame rate. However, many software implementations of video players, including MPlayer, VLC player and others, allow for change of the video rate once the compressed video is unpacked. The screen refresh rate of modern equipment is in the range of 100-200 Hz, so variations in the frame rate can be introduced by choosing when a particular video frame will be shown on the display device. Hence, less noticeable artefacts and smoother picture transitions can be obtained when a variable frame rate, the second method, is applied. The third method, Adaptive Video Refresh Rate (AVRR), relies on the precision of the audio clock. Synchronisation is maintained by ensuring that the video time code remains locked to the audio time code within an allowable threshold. Video time stretching for conventional broadcast uses insertion and deletion of frames to maintain synchronisation. When speeding up the video, some frames need to be dropped, whilst when slowing down some need to be duplicated. When frames are duplicated or dropped, the maximal synchronisation error is half of a video frame length, since we round to the closest frame. Hence, if the frame rate is 25 fps, the maximal error will be 20 ms. This error range (−20 ms to +20 ms) meets ITU recommendations for lip sync error to be undetectable. However, it may not be good enough for more demanding applications such as time stretching of video, when precise movements are slowed and become easier to analyse. In addition, frame duplication can cause jerkiness to be perceived in the video of slow, steady movements. Changing the video frame rate by the scaling factor will generally give a smoother image, since frames are equally spaced in time. An additional advantage is that no frames are dropped when speeding up. Ideally, timing for a new frame is easy to calculate by advancing the previous frame time by the new frame rate interval.
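For the frame insertion/deletion method, the mapping from an output display time to a source frame is a single rounding step, sketched below with our own names (`speed` is the playback speed factor; frames are dropped when speed > 1 and duplicated when speed < 1). Rounding to the closest frame is what bounds the synchronisation error by half a frame period.

```python
def source_frame_for(t_out, fps, speed):
    """Frame insertion/deletion at a fixed display rate: map an output
    display time t_out (seconds) to the nearest source frame index.
    The source position after t_out seconds of output is t_out * speed,
    so rounding to the closest frame bounds the A/V error by half a
    frame period, i.e. +/-20 ms at 25 fps. A sketch; names are ours."""
    return round(t_out * speed * fps)   # nearest source frame index
```

Note that Python's `round` uses banker's rounding on exact .5 boundaries; for display scheduling either tie-breaking direction stays within the half-frame error bound.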
However, because timing precision is influenced by factors such as temperature and humidity, simply setting up the next frame to display a given period after the previous frame, without comparing it to a master clock, can cause long-term synchronisation errors. The AVRR method refreshes the display with a new frame when the video time code is equal to (or within a threshold of) the original time code of the audio frame being outputted. The refresh rate is adaptive since the period between two frames adapts to the audio clock. Ideally, it should be equal to the reciprocal of the scaled frame rate, but will oscillate around that value. We define here two time-lines: one is the media player's actual time-line and the other is the original media time-line. It is crucial for this method to calculate precisely the time on the media time-line of the audio sample currently being played. This time value is then compared with the original time code associated with non-time-scaled video frames, and the display is refreshed with this frame when the video frame time code is smaller than or equal to the time of the audio sample that is currently being outputted. To minimise loss of synchronisation due to computationally intensive processing, the decoding algorithm needs to be efficient and implemented in a separate high-priority thread.

Figure 7. Video time scaling implementation.

The video-synchronised time stretching algorithm described above was implemented as presented in Figure 7, and intended for a demanding application requiring fast access to audio frames while other intensive processing tasks are performed. Here, the audio stream is first uncompressed and stored locally in an audio input buffer. Unlike audio, however, uncompressed video would require an unacceptably large local buffer, so video packets are accessed directly from the compressed stream. Since video decoding is done on-line, particular consideration was given to its implementation. Higher time compression rates will demand that video frames be decoded and scaled much faster than usual. Hence, the video decoding is carried out together with video zooming in a separate high-priority thread. The video decoding thread receives two control inputs from the user interface. Video zoom factors, changeable from the interface, are sent directly to the video scaler, which scales a frame according to a zoom factor and sends it to the video display buffer. A change of playback position is sent to the decoder; it instructs the decoder to seek the stream and also to erase any previously decoded frames from the video display buffer. The time-stretching factor is sent to the audio processing engine in order to change the analysis hop size, and the audio output frame timestamp is calculated accordingly. However, this timestamp is not sufficient for proper A/V synchronisation, since it represents the time when the audio frame is sent to the audio hardware buffer. For example, if an audio frame is 1024 samples and the sample rate is 44100 Hz, the time resolution will be 23.2 ms. For normal playback speed this may be sufficient, but in the case of doubling the playback speed the time span between two audio sample points on the media timeline becomes 46.4 ms. Hence, some measure of fullness of the audio hardware buffer needs to be introduced for precise timing of outputted audio samples.
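One way to recover sub-frame timing that the hardware-buffer timestamp alone cannot give is to pair each frame's media timestamp with a high-resolution counter reading, as sketched below. The names are ours; Python's `time.perf_counter` stands in for the CPU-count timer discussed in the following paragraphs, and the division by the stretch factor transposes elapsed wall time onto the media time-line as in equation (9).

```python
import time

class AudioClock:
    """Approximate the media-time of the audio sample now being played:
    record a high-resolution counter reading when a frame is sent to the
    hardware buffer, then add the wall-clock time elapsed since, scaled
    by 1/alpha onto the original media time-line. A sketch; names are
    ours and time.perf_counter stands in for a CPU-count timer."""

    def __init__(self, alpha):
        self.alpha = alpha             # time-stretch factor
        self.t_audio = 0.0             # media timestamp of the last frame sent
        self.sent_at = time.perf_counter()

    def frame_sent(self, media_timestamp):
        """Call when an audio frame is handed to the hardware buffer."""
        self.t_audio = media_timestamp
        self.sent_at = time.perf_counter()

    def now(self):
        dt = time.perf_counter() - self.sent_at   # Δt on the playback time-line
        return self.t_audio + dt / self.alpha     # transposed to the media time-line
```

Between two frame submissions, `now()` advances smoothly at 1/alpha real-time rate, giving the video thread a timing reference far finer than the 23.2 ms (or 46.4 ms at double speed) frame granularity.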
The fullness of the hardware audio buffer is hardware dependent and measuring it is often a complex task, so we propose to find the approximate timing of the audio sample by measuring the time difference (Δt) between the moment the audio frame is sent to the hardware buffer and the current time. This value is then added to the timestamp of the audio frame that was sent to the audio buffer (T_audio), and is then compared with the video frame timestamp (T_video). The display is refreshed with this frame when the video frame time code is smaller than or equal to the calculated audio time:

T_video ≤ T_audio + Δt   (8)

Another issue is timer precision for measuring Δt. In the Windows OS, the maximal precision that can be achieved with the standard timer is 15 ms, which is hardly enough for a synchronisation application. Hence, Δt is determined by measuring CPU counts from the moment the frame is sent to the hardware buffer and then dividing by the CPU count frequency. Since Δt gives a value related to the real playback time-line, it is transposed to the media time-line by dividing it by the time-stretching factor α:

Δt = (1/α) · (CNT_cpu / f_cpu)   (9)

However, both variable frame rate and adaptive video refresh rate have the potential disadvantage that at higher time scale factors, since more frames are displayed per second, frames need to be decoded much faster. Synchronisation can be lost if a frame is not decoded within a frame interval, so a preferred solution is to combine AVRR with frame dropping when loss of synchronisation occurs. In our implementation, whenever the video lag exceeds 20 ms, the application instructs the decoder not to decode the following frame, and returns to full decoding when the lag returns to under 10 ms.

IX. AUDIO QUALITY EVALUATION

Since the focus of this research is concerned with the real-time implementation of a synchronised video/audio and multimedia time and pitch scale modification algorithm, the evaluation of the audio time-scale algorithm presented here is not intended to be comprehensive.
Instead, to ensure that this real-time implementation has not resulted in a compromise to the audio quality of the algorithm, a series of subjective listening tests was carried out in order to ensure that the TSM algorithm is at least as good as that described in [13]. Transient detection was not used in these comparison tests, since [13] does not employ transient detection. In total, 10 subjects undertook a series of 20 tests¹ each, totalling 200 individual tests. The tests used included slowing and speeding of audio as well as pitch shifting in both directions by a range of factors. Both time and pitch scale factors ranged from 0.75 to 1.5. A range of signals, including solo and ensemble music from a range of genres and male and female speech segments sampled at 16 bit, 44.1 kHz, comprises the test suite. Each listener was presented with an unprocessed reference signal and two alternative processed signals. The same processing parameters and frame sizes are used in each algorithm. The order in which the algorithms are presented was randomised.

Figure 8. Subjective listening test results for 10 subjects. Along the horizontal axis, 1 indicates a predominant preference for real-time TSM whereas 5 indicates predominant preference for the improved phase vocoder [13].

¹ http://www.audioresearchgroup.com/downloads/tsmtests.zip

The results are presented in Figure 8, where results for each subject are given from 1 to 5: 1 indicates predominant preference for real-time TSM, 3 indicates no preference, and 5 indicates predominant preference for the improved phase vocoder. The subjective listening tests indicate that the overall trend is such that the algorithms are perceived to perform equally well. The average value over all 200 tests was 2.985, very close to no preference, with a relatively low standard deviation of 0.94. Subjects who were predisposed to distinctly choosing one algorithm over the other tended to choose each algorithm a similar number of times, indicating equivalence of the algorithms. Many subjects reported that the algorithms sounded very similar but felt compelled to make explicit decisions regardless. The data is skewed slightly in favor of the real-time TSM algorithm, but it is likely that a greater number of test subjects would introduce greater balance in the data. Some differences between the algorithms which may account for this include the fact that the real-time TSM algorithm does not perform peak locking above 10 kHz, because peak locking is intended to maintain the phase relationship between the peak and lobes of sinusoidal components. Significant acoustic energy above 10 kHz is often stochastic and attributed to transients, noise and ambience. Peak locking above 10 kHz forces non-sinusoidal components into a state of unnatural phase coherence, which can sound objectionable to subjects with acute hearing in the upper frequency range. Theoretically, the pitch shifting quality in [13] should outperform that of the real-time algorithm, but subjective tests have shown that the differences are largely imperceptible for moderate time scaling factors (in the region of 0.75 to 1.5), although the real-time algorithm can become noticeably more objectionable when opposing time and pitch scale factors are used simultaneously (i.e., slow down and pitch up simultaneously).
This is due to the efficient pitch shifting technique used to achieve frame-synchronous pitch shifting.

X. A/V SYNCHRONISATION EVALUATION

To measure the quality of the A/V synchronisation algorithm, we compared an integration of our time-stretching into the FFmpeg (v0.4, ffmpeg.org/ffplay-doc.html) platform with the MPlayer implementation (v1.0rc2, www.mplayerhq.hu/) in Linux OS. FFplay is a well-known, efficient, open source application for video playback, and MPlayer is a robust, open source video player based on FFmpeg libraries. One of the many features of MPlayer is the possibility to change playback speed, but without independent pitch-shifting. Nevertheless, this feature, its robust implementation and the possibility to extract A/V synchronisation information make MPlayer useful for evaluation and comparison with our algorithm. For A/V synchronisation, FFplay uses duplicating and dropping of video frames, whereas MPlayer uses a variable frame rate. We compared the video players on the Casino Royale trailer sequence coded in MPEG-1 format with video frame dimensions 640x352 at 23.97 frames per second and an audio sample rate of 44100 Hz. The video frame lag with respect to audio is presented for 100 video frames from the middle of the sequence in the case of playing the video at half of the original speed (Figure 9) and at double the original speed (Figure 10). It can be seen that our adaptive video refresh rate algorithm clearly outperforms the other two, because of the precise matching of the video timestamp to the audio clock. The video lag of the AVRR time-stretching algorithm is also well below the ITU lip sync error recommendation, with the maximal video lag being 14 ms and the maximal video advance being 13 ms in the case of doubled playback speed. Moreover, the standard deviation of the video lag is 3.328 ms, showing the stability of this solution.

Figure 9.
Comparison of video lag for three video player implementations when playback speed is half of the original.

Figure 10. Video lag when playback speed is doubled.

Figure 11. Average video lag as a function of the time scaling factor.

Figure 11 depicts the average video lag as a function of the

time scaling factor for the three video synchronisation techniques. FFplay was modified to ensure that it would not decode dropped frames; otherwise its performance would be significantly worse. However, it still shows notable degradation in performance as the time scaling factor increases beyond 2 and video frame decoding becomes significantly slower than the time to process a time-scaled audio frame. MPlayer maintains suitable performance as the time scale increases, though it does not adapt the variable refresh rate to the precise audio time codes. The AVRR method maintains strong synchronisation over the entire range of time scaling factors. Only at time scaling factors beyond 3.5 does the AVRR occasionally lose synchronisation, and opt not to decode a frame.

XI. CONCLUSIONS

A framework for real-time independent video time scaling and pitch shifting was presented. Careful consideration was given to the problems which arise in a real-time context, and novel solutions to these issues have been provided. It was shown how time-scale changes can be achieved in real-time with almost imperceptible latency and no transitional artefacts. The approach is based on a modified phase vocoder with optional phase locking and an integrated transient detector which enables high quality transient preservation in real-time. The framework presented is the basis for the development of applications which allow for a seamless real-time transition between continually varying, independent video time-scale and pitch-scale parameters. A novel solution for audio/visual synchronisation, called adaptive video refresh rate, has also been developed. Since synchronisation errors in the foreseen applications will be easier to detect, special focus was given to minimising video lags and advances, resulting in an algorithm that significantly outperforms existing algorithms.

REFERENCES

[1] M. C. Yang, S. T. Liang, and Y. G. Chen, "Dynamic video playout smoothing method for multimedia applications," Multimedia Tools and Applications, vol. 6, pp. 47-59, 1998.
[2] M. Kalman, E. Steinbach, and B. Girod, "Adaptive media playout for low delay video streaming over error-prone channels," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, pp. 841-851, 2004.
[3] Y. J. Liang, N. Färber, and B. Girod, "Adaptive playout scheduling using time-scale modification in packet voice communication," presented at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 1445-1448, Salt Lake City, 2001.
[4] P. LaBarbera and J. MacLachlan, "Time-compressed speech in radio advertising," Journal of Marketing, vol. 43, pp. 30-36, 1979.
[5] C. Landone, J. Harrop, and J. D. Reiss, "Enabling access to sound archives through integration, enrichment and retrieval: the EASAIER project," presented at the 8th International Conference on Music Information Retrieval (ISMIR), Vienna, 2007.
[6] C. Duffy, "A case study of networked sound resources for education in traditional music: the HOTBED project," presented at Integration of Music in Multimedia Applications, Barcelona, Spain, 2004.
[7] J. S. Olson, "A study of the relative effectiveness of verbal and visual augmentation of rate-modified speech in the presentation of technical material," in Annual Convention of the Association for Educational Communications and Technology (AECT), Anaheim, CA, 1985.
[8] K. Harrigan, "The SPECIAL system: searching time-compressed digital video lectures," Journal of Research on Computing in Education, vol. 33, pp. 77-86, 2000.
[9] F. C. Li, A. Gupta, E. Sanocki, L. He, and Y. Rui, "Browsing digital video," presented at ACM CHI, The Hague, Netherlands, 2000.
[10] J. L. Flanagan, D. I. S. Meinhart, R. M. Golden, and M. M. Sondhi, "Phase vocoder," The Journal of the Acoustical Society of America, vol. 38, p. 939, 1965.
[11] M. Dolson, "The phase vocoder: a tutorial," Computer Music Journal, vol. 10, pp. 14-27, 1986.
[12] M. Portnoff, "Implementation of the digital phase vocoder using the fast Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, pp. 243-248, 1976.
[13] J. Laroche and M. Dolson, "Improved phase vocoder time-scale modification of audio," IEEE Trans. Speech and Audio Processing, vol. 7, pp. 323-332, 1999.
[14] J. Bonada, "Automatic technique in frequency domain for near-lossless time-scale modification of audio," presented at the International Computer Music Conference, pp. 396-399, Berlin, Germany, 2000.
[15] J. Laroche, "Autocorrelation method for high quality time/pitch scaling," presented at IEEE WASPAA, pp. 131-134, Mohonk, NY, 1993.
[16] C. Duxbury, M. Davies, and M. Sandler, "Improved time-scaling of musical audio using phase locking at transients," presented at the 112th AES Convention, pp. 1-5, Munich, Germany, May 10-13, 2002.
[17] D. Barry, D. FitzGerald, and E. Coyle, "Drum source separation using percussive feature detection and spectral modulation," presented at the IEE Irish Signals and Systems Conference, pp. 13-17, Dublin, Ireland, 2005.
[18] International Telecommunication Union Document 11A/47-E, 13 October 1993.
[19] "Relative timing of sound and vision for broadcasting," Recommendation ITU-R BT.1359-1, International Telecommunication Union, 1998.