GMM-based Synchronization Rules for HMM-based Audio-Visual Laughter Synthesis


2015 International Conference on Affective Computing and Intelligent Interaction (ACII)

GMM-based Synchronization Rules for HMM-based Audio-Visual Laughter Synthesis

Hüseyin Çakmak, UMONS, Place du Parc 20, 7000 Mons, huseyin.cakmak@umons.ac.be
Kévin El Haddad, UMONS, Place du Parc 20, 7000 Mons, kevin.elhaddad@umons.ac.be
Thierry Dutoit, UMONS, Place du Parc 20, 7000 Mons, thierry.dutoit@umons.ac.be

Abstract - In this paper we propose synchronization rules between acoustic and visual laughter synthesis systems. Previous works have addressed acoustic and visual laughter synthesis separately, following an HMM-based approach. The need for synchronization rules comes from the constraint that, in laughter, HMM-based synthesis cannot be performed with a unified system in which common transcriptions are shared by both modalities, as has been shown to be possible for audio-visual speech synthesis. Acoustic and visual models are therefore trained independently, without any synchronization constraint. In this work, we propose rules derived from the analysis of audio and visual laughter transcriptions in order to generate visual laughter transcriptions corresponding to given audio laughter data.

Keywords - audio-visual; synchronization; laughter; synthesis

I. INTRODUCTION

Among the features of human interactions, laughter is one of the most significant. It is a way to express our emotions and may even be a response in some interactions. The last decades witnessed considerable progress in speech processing and affective computing, and human-machine interactions are becoming more and more present in our daily lives. Considering the importance of laughter in our daily communications, this non-verbal communicative signal can and should be successfully detected, analyzed and produced by machines.

This work focuses on laughter production and, more specifically, on the synchronization between audio and visual laughter in the framework of HMM-based audio-visual laughter synthesis. Acoustic synthesis of laughter using Hidden Markov Models (HMMs) has already been addressed in a previous work [1]. To characterize acoustic laughter, phonemic transcriptions were used and the results outperformed the state of the art. Extensions of that work were made to perform automatic phonemic transcription [2] and to integrate arousal into the system [3].

The goal of audio-visual laughter synthesis is to generate an audio waveform of laughter as well as its corresponding facial animation. In statistical data-driven audio-visual speech synthesis, it is common that acoustic and visual models are trained separately [4], [5], [6], [7]. The training also sometimes includes an additional explicit time difference model for synchronization purposes [8], [9]. In 2014, a visual laughter synthesis system was proposed and is the basis of the visual laughter synthesis in this work [10]. The authors showed in that work that a separate segmentation of the laughter is needed to correctly model the visual trajectories, meaning that phonemic transcriptions are not suited to describe the visual cues of laughter, although this has been shown to be feasible for speech [8], [11], [12], [13]. Further developments have shown that head motion should be modeled separately as well [14]. Modeling audio, facial data and head data independently means having specific transcriptions for each, and thus the need for synchronization arises.
In [10], the synchronization between modalities was guaranteed by imposing synthesized durations to be the same as in the database, in which the transcriptions are synchronous in the first place. To go beyond this and to be able to synthesize audio-visual laughter of any desired duration, a method was proposed in [15] to model the relationships between the transcriptions. An improved method for synchronization between audio and visual transcriptions is proposed in this paper. The basic principle underlying the proposed method is a Gaussian Mixture Model (GMM)-based mapping used to generate the time delays between the beginning (respectively ending) of the audio laughter and the beginning (respectively ending) of the visual laughter. First, a silence removal method is used to estimate at what times the laugh begins and ends in the given audio file. Then, a GMM [16] is trained on features extracted from the audio. It is then used to generate the time delays to add to the audio laughter limits in order to obtain the visual laughter limits. Once these limits are set, visual transcriptions may be built to feed an HMM-based visual laughter synthesizer, which will produce visual trajectories synchronous with the given laughter audio file. Two improvements are introduced in comparison to the previous work: i) phonemic transcriptions are no longer needed, since the method relies only on the input audio file; ii) the accuracy of the predicted delays has been improved, as shown by the RMSE comparison.

The paper is organized as follows: Section II gives a brief overview of the database used in this work, Section III explains the audio and visual laughter synthesis methods within which this work takes place, Section IV explains the previously proposed method for synchronization between acoustic and visual synthesis, Section V describes the

new method proposed in this paper, Section VI describes the evaluation, and Section VII concludes and gives an overview of future work.

Figure 1. Data recording pipeline.

Figure 2. Overview of the pipeline for HMM-based audio-visual laughter synthesis.

II. THE AVLASYN DATABASE

The AVLASYN database [17] used in this work is a synchronous audio-visual laughter database designed for laughter synthesis. The corpus contains data from one male subject recorded using professional audio equipment and a marker-based motion capture system. Figure 1 gives an overview of the recording pipeline. The database contains laughter-segmented audio files in WAV format and corresponding motion data in the Biovision Hierarchy (BVH) format. A first segmentation was done to obtain files containing only laughter; these files were then phonemically annotated. Please refer to [18] for more information on the transcriptions. The laughs were triggered by watching videos found on the web, and the subject was free to watch whatever he wanted. A total of 125 minutes was watched by the subject to build this corpus, which led to roughly 48 minutes of visual laughter and 13 minutes of audible laughter. This work uses a subset of the AVLASYN database: only the most common laugh pattern (i.e. a silence phase followed by a laugh, a possible inhalation, and a final silence) is considered [19].

III. HMM-BASED LAUGHTER SYNTHESIS

Details on the models can be found in [1] for the acoustic models and in [10] for the visual models. A brief overview is given below. The HMM-based trajectories were synthesized using the HMM-based Speech Synthesis System (HTS) [20]. Figure 2 gives the general pipeline followed to build the models. The main steps that must be introduced to understand the remainder of this paper are:

1) Features are extracted from the audio, the face movements and the head movements [10], [14].
2) These features are modeled independently with their respective transcriptions.
3) Once the models are trained for each modality, trajectories are synthesized. For audio synthesis, the duration of each phone may either be estimated by the system or imposed. For visual synthesis, durations are imposed from rules based on acoustic features in order to be synchronized with the audio (cf. Sections IV and V).

The reference visual transcriptions are built with an automated Gaussian Mixture Model (GMM)-based segmentation system detailed in [10], [14]. The aim of the present work is to be able to generate new visual transcriptions of this kind that are synchronized with a corresponding audio laughter file. This allows visual laughter animation to be synthesized from existing HMM models in a way that is consistent with a given audio laughter. Since the synchronization method presented in this work does not rely on the synthesis method itself, we focus on the generation of visual transcriptions from a given audio file in the remainder of this paper.

IV. PREVIOUS ATTEMPT

In the HMM-based synthesis framework, the first stage is the training of the models. The training stage strongly relies on the provided annotations. It is thus important that the annotations correctly represent the data that the HMMs will model. In the case of acoustic data, the annotations are phonemic transcriptions.
An ideal case would have been that these phonemic transcriptions could be used as annotations for the visual data as well. This was successfully applied to audio-visual speech synthesis in [11], [12], [13], [8]. However, because laughter, contrary to speech, is an inarticulate utterance [21], the correlation between the produced sounds and the facial expressions, in particular the mouth shape, is much lower, which makes it impossible to use the same annotations for both modalities. This is why separate annotations were necessary for training on the visual data [10]. Instead of phonemic classes, three specific classes related to the deformations of the face were used. A subsequent study showed that a third modality, the head motion, should be considered independently in order to better model the shaking motion of the head.

This approach performed better in perception tests [14]. Finally, three different modalities (audio, facial and head movements), all related to the same phenomenon (laughter), are used. Each modality has its own transcriptions, and synchronization rules between these modalities are therefore necessary, since they are trained independently and nothing ensures synchronization at the synthesis step.

The transcriptions for the audio modality consist of several successive phones such as fricatives, different vowels, nasal sounds, nareal fricatives, silence and inhalation (cf. [17] for more details). The most common phonemic sequence for audio laughter is similar to: silence-h-a-h-a-h-a-inhalation-silence. In the case of facial data, three classes are used in the visual transcriptions: neutral, laughter and semi-smile. The latter is a facial expression between no expression at all and a slight smile (cf. [10] for more details). The majority of the laughs in the database are a succession of the first two classes in the order neutral-laughter-neutral. Finally, the head motion transcriptions are the result of a sub-segmentation of the facial laughter class defined above: each occurrence of one class during a laughter sequence represents one period of the head oscillation that occurs in laughter (cf. [14] for more details).

Figure 3 gives a schematic overview of the different transcriptions.

Figure 3. Schematic representation of the different transcriptions (audio, face, head) during laughter.

As can be seen, the beginning of the audio laughter (end of the silence) is not exactly aligned with the beginning of the visual laughter (end of the neutral face). Similarly, the visual laughter class ends some time after the last audible contribution. This shows that visual laughter is temporally wider than acoustic voiced laughter. Figure 3 also shows the head oscillations, marked with red circles: head motion transcriptions are defined such that the neutral class remains the same as in the facial transcriptions, while the laughter class is sub-segmented into oscillation periods.

The previous attempt to build a synchronization method was to study the relation between the audio and facial data transcriptions. The aim was to derive rules from the study of the transcriptions and later use these rules to produce facial transcriptions corresponding to given phonemic transcriptions. To model the relationship between audio and visual transcriptions, the time delay between the end of the initial silence in the phonemic transcriptions and the end of the neutral expression in the visual transcriptions was calculated, as well as the time shift at the end of the laughter between the audio and visual modalities. These time delays were modeled using kernel density estimation, a non-parametric method to estimate the probability density function of a random variable [22], [23]. This fitting process was done for three different cases for the beginning delay and three different cases for the ending delay.
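For illustration, the fit-and-sample step of this previous method can be sketched with a non-parametric kernel density estimate as follows. The delay values, the "silence-nasal" category and the boundary time used here are invented for the example and are not taken from the AVLASYN data.

```python
# Minimal sketch of the KDE-based delay modelling of the previous method.
# Variable names and numbers are illustrative, not the authors' data or code.
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical measured start delays (s) for laughs whose phonemic transcription
# begins with "silence-nasal" (one of the six cases of Table I).
delays_start_nasal = np.array([0.12, 0.18, 0.09, 0.21, 0.15, 0.11])

# Fit a non-parametric probability density function to the observed delays.
kde = gaussian_kde(delays_start_nasal)

# At synthesis time, draw one delay for a new laugh of this category and shift
# the audio boundary by it to place the visual boundary (illustrative sign).
sampled_delay = kde.resample(size=1)[0, 0]
t_audio_start = 0.35                       # boundary from the phonemic transcription (s)
t_visual_start = t_audio_start - sampled_delay
print(f"sampled delay: {sampled_delay:.3f} s -> visual start at {t_visual_start:.3f} s")
```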
Table I gives an overview of the probability density functions used in [15].

Table I. The six PDFs used to model the delays between audio and visual transcriptions. Different PDFs were built based on characteristics of the acoustic transcriptions. nasal = voiced nasal sound like "n"; nf = nareal fricative, an unvoiced expulsion of air through the nose; fricative_h_i = the "h" sound occurring when inhaling through the mouth; nf_i = the sound occurring when inhaling through the nose.

Delta_AV,start: laugh starts with silence-nasal | laugh starts with silence-nf | anything else
Delta_AV,end:   ending inhalation is fricative_h_i | ending inhalation is nf_i | anything else

In that work, as in the present one, the facial transcriptions are assumed to be a sequence of type neutral-laughter-neutral, which is the most common sequence (approximately 80% of the database). Once the facial transcriptions are generated, the head motion transcriptions are generated from them. Finally, the generated visual transcriptions are used as input to their respective HMM models for synthesis and trajectories are produced. The synthesized visual data for face and head are then merged and transformed appropriately [14] before being applied to a 3D face. Finally, video animations are produced with the corresponding audio data.

V. PROPOSED GMM-BASED MAPPING METHOD

The present work also aims at estimating the temporal boundaries of visual laughter to feed an HMM-based visual laughter synthesizer, as explained above. Compared to the previous method detailed in Section IV, two improvements are targeted:

1) To remove the need for phonemic transcriptions as input to the synchronization system. The aim is to work directly with the audio file using an automated process rather than manually annotated phonemic transcriptions, which may not always be available.
2) To improve the accuracy of the estimated time delays by using a GMM mapping approach based on acoustic features directly extracted from the input audio laughter file.

These improvements would bring us one step further towards the synchronization of a laughter animation with any given audio laughter as input. Figure 4 gives an overview of the proposed method.

Figure 4. Overview of the proposed method for time delay prediction.

A. Defining references from the audio

In order to obtain the transcriptions automatically, a simple silence removal method implemented in Matlab [24] was used to discriminate the silent parts from the laughter parts in a given audio file. Originally intended for audio signals containing speech, this method proved to be efficient for laughter detection as well. Note that this is not a laughter recognition process, since our audio files contain only laughter or silence, as mentioned earlier. Audio files are divided into non-overlapping frames of 5 ms each. Two features are then extracted from each frame: the signal energy and the spectral centroid. Threshold values are then determined for each of these features and used to discriminate the laughter segments from the silence ones, as described in [24]. As the method may give several laughter detections within the same laughter segment, a post-processing step merges the possibly overlapping detected segments, so that one audio file contains one audio laughter segment, as the input files are assumed to be.

To evaluate this method, the estimated beginning and ending times of laughter were compared to their corresponding values in the manual transcriptions. The comparison was done via a simple Root Mean Square Error (RMSE) estimation. For the beginning and ending times, we obtained RMSE values of 0.1169 s and 0.1836 s respectively. These results suggest that the method is accurate enough for the purpose of this work. Indeed, rather than finding exactly the same boundaries as in the manual transcriptions, what is needed here is a consistent way of determining the beginning and ending times of the audio laughter. From these boundaries, the time delays between audio and visual laughter are calculated and the models are built as explained below.

B. GMM-based mapping to determine the time delays between audio and visual laughter

Once the beginning time t_A,start and ending time t_A,end of the audio laughter are determined using the method explained in the previous section, we can calculate the time delays Δ_AV,start and Δ_AV,end between these estimated audio limits and the visual laughter limits taken from the reference visual transcriptions. A temporal schematic representation of the time delays is given in Figure 5.

Figure 5. Schematic representation of the different times (t_V,start, t_A,start, t_A,end, t_V,end) and delays (Δ_AV,start, Δ_AV,end).
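As a rough illustration of this step, the sketch below estimates the audio laughter boundaries of a laughter-only file with a simple energy and spectral-centroid threshold and then derives the two delays against reference visual boundaries. It is not the silence-removal implementation of [24]: the 5 ms framing follows the text, but the threshold values, the sign convention of the delays and the reference visual times are illustrative assumptions.

```python
# Simplified stand-in for the boundary estimation and delay computation.
import numpy as np

def laughter_boundaries(signal, sr, frame_s=0.005):
    """Return (t_start, t_end) of the non-silent part of a laughter-only file."""
    frame_len = max(1, int(frame_s * sr))
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)

    # Frame-level features: short-time energy and spectral centroid.
    energy = (frames ** 2).mean(axis=1)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    centroid = (spec * freqs).sum(axis=1) / (spec.sum(axis=1) + 1e-12)

    # Illustrative thresholds; [24] derives them from the feature histograms.
    active = (energy > 0.5 * energy.mean()) & (centroid > 0.5 * centroid.mean())
    idx = np.flatnonzero(active)
    if idx.size == 0:
        return 0.0, len(signal) / sr
    # Merge everything between the first and last active frame into one segment.
    return idx[0] * frame_s, (idx[-1] + 1) * frame_s

# Toy example: 1 s of noise "laughter" padded with silence, 16 kHz.
sr = 16000
rng = np.random.default_rng(0)
audio = np.concatenate([np.zeros(8000), 0.3 * rng.standard_normal(16000), np.zeros(8000)])
t_a_start, t_a_end = laughter_boundaries(audio, sr)

# Delays w.r.t. hypothetical reference visual transcription boundaries (s).
t_v_start, t_v_end = 0.35, 2.10
delta_av_start = t_a_start - t_v_start
delta_av_end = t_v_end - t_a_end
print(t_a_start, t_a_end, delta_av_start, delta_av_end)
```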
1) Features used for GMM modeling: A set of features is extracted from the audio files. The first four features are scalar values, while the rest are curves extracted using a frame length of 25 ms and a frame shift of 10 ms. The considered features are given in Table II.

Table II. List of the features considered in GMM modeling.

Scalar features:     RMS value of a curve derived from the spectrogram; utterance length; variance; duration = t_A,end - t_A,start.
Continuous features: zero crossing rate; energy; energy entropy; spectral centroid; spectral entropy; spectral flux; spectral rolloff; 13 MFCCs; fundamental frequency F0; chroma vector; spectral zone.

To reduce the dimensionality of each of the continuous features, their histogram is calculated by imposing the number and centers of the bins for each feature. This produces a 3-dimensional feature vector for each of the continuous features listed above. The RMS values are also included.

The Pearson correlation coefficients are then calculated between all the features and the values of the delays Δ_AV,start and Δ_AV,end for each file. The most correlated features are kept for GMM modeling; they are summarized below (a small sketch of this reduction and selection step follows the list):

- duration = t_A,end - t_A,start
- RMS energy
- 1st histogram bin of the spectral centroid
- 3rd histogram bin of MFCC 1
- 3rd histogram bin of MFCC 7
- 1st histogram bin of MFCC 11
- 3rd histogram bin of MFCC 13
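A minimal sketch of the histogram-based dimension reduction and the correlation-based selection might look as follows; the bin edges, feature values and delay values are invented for the example, and the frame-level feature extraction itself is assumed to be done beforehand.

```python
import numpy as np

def histogram_3bins(curve, bin_edges):
    """Compress a frame-level feature curve into a normalised 3-bin histogram."""
    hist, _ = np.histogram(curve, bins=bin_edges)
    return hist / max(hist.sum(), 1)

# Example: reduce a (fake) spectral-centroid curve to a 3-dimensional vector.
rng = np.random.default_rng(1)
centroid_curve = rng.uniform(500.0, 3000.0, size=200)          # Hz, invented values
centroid_vec = histogram_3bins(centroid_curve, bin_edges=[0, 1000, 2000, 4000])

# Correlation-based selection: keep the candidates whose Pearson correlation
# with the delays is highest.  Per-file feature and delay values are invented.
candidate = np.array([0.31, 0.44, 0.29, 0.52, 0.38, 0.47])     # e.g. 1st centroid bin per file
delta_av_start = np.array([0.10, 0.16, 0.08, 0.20, 0.13, 0.18])
r = np.corrcoef(candidate, delta_av_start)[0, 1]
print(centroid_vec, f"correlation with delta_AV_start: {r:.2f}")
```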

2) GMM-based mapping: We investigate the use of the GMM mapping framework proposed in 1996 by Stylianou [25] for voice conversion. The implementation used here is the one of Kain [26], also used in recent work such as [27]. It is based on the joint probability density of the source and target vectors, p(Z) = p(X, Y), with:

$$
Z = [X \;\; Y] =
\begin{bmatrix}
x_{11} & \cdots & x_{1d_x} & y_{11} & \cdots & y_{1d_y} \\
\vdots &        & \vdots   & \vdots &        & \vdots   \\
x_{N1} & \cdots & x_{Nd_x} & y_{N1} & \cdots & y_{Nd_y}
\end{bmatrix}
\qquad (1),(2)
$$

where X and Y are the sequences of source and target vectors (N observations each) and d_x, d_y are the dimensions of the source and target vectors. In the present work, d_x is equal to 7 (the selected features listed above) and d_y is equal to 2, corresponding to Δ_AV,start and Δ_AV,end, the values that we want to predict from the set of 7 source features. The mapping function that estimates the target vector ŷ_t from a source vector x_t at time t is formulated as

$$
\hat{y}_t = F(x_t) = \sum_{m=1}^{M} \left( W_m x_t + b_m \right) P(c_m \mid x_t) \qquad (3)
$$

where W_m is the transformation matrix and b_m the bias vector related to the m-th component of the model, defined as

$$
W_m = \Sigma_m^{YX} \left( \Sigma_m^{XX} \right)^{-1} \qquad (4)
$$

$$
b_m = \mu_m^{Y} - W_m \, \mu_m^{X} \qquad (5)
$$

and where

$$
\Sigma_m = \begin{bmatrix} \Sigma_m^{XX} & \Sigma_m^{XY} \\ \Sigma_m^{YX} & \Sigma_m^{YY} \end{bmatrix} \qquad (6)
$$

$$
\mu_m = \begin{bmatrix} \mu_m^{X} \\ \mu_m^{Y} \end{bmatrix} \qquad (7)
$$

P(c_m | x_t) is the probability that the source vector belongs to the m-th component. This probability is defined as

$$
P(c_m \mid x_t) = \frac{\alpha_m \, \mathcal{N}\!\left(x_t;\, \mu_m^{X}, \Sigma_m^{XX}\right)}{\sum_{p=1}^{M} \alpha_p \, \mathcal{N}\!\left(x_t;\, \mu_p^{X}, \Sigma_p^{XX}\right)} \qquad (8)
$$

where N(x; μ, Σ) is a Gaussian distribution with mean μ and covariance matrix Σ, and α_m is the weight of the m-th component. The Gaussian distributions are trained using the iterative Expectation-Maximization (EM) algorithm. Since the observations in this work are the laughter utterance files, from which a finite number of features (7) is extracted, the data available for training the GMMs is quite limited. Therefore, in order to limit the number of parameters to be estimated in the training stage, we use the simplest configuration: a GMM with 1 component, 9 dimensions (source + target) and a full covariance matrix.
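Under the single-component, full-covariance configuration described above, the EM fit reduces to the sample mean and covariance of the joint vectors, and Eq. (3) becomes a single linear regression from the 7 acoustic features to the two delays. The sketch below illustrates Eqs. (1)-(8) in that special case on random stand-in data; it is not the Kain implementation [26] used by the authors.

```python
# Illustrative joint-density GMM mapping, single full-covariance component.
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, N = 7, 2, 40                      # 7 source features, 2 delays, N files

X = rng.standard_normal((N, d_x))           # selected acoustic features
Y = 0.3 * X[:, :2] + 0.05 * rng.standard_normal((N, d_y))   # delays to predict
Z = np.hstack([X, Y])                       # joint vectors, Eqs. (1)-(2)

# With M = 1 component, EM reduces to the sample mean and covariance of Z.
mu = Z.mean(axis=0)                         # Eq. (7)
sigma = np.cov(Z, rowvar=False)             # Eq. (6), full covariance

mu_x, mu_y = mu[:d_x], mu[d_x:]
sxx = sigma[:d_x, :d_x]
syx = sigma[d_x:, :d_x]

W = syx @ np.linalg.inv(sxx)                # Eq. (4)
b = mu_y - W @ mu_x                         # Eq. (5)

def map_delays(x_t):
    """Eq. (3); with one component, P(c_1 | x_t) = 1."""
    return W @ x_t + b

x_new = rng.standard_normal(d_x)            # features of an unseen laugh
delta_start, delta_end = map_delays(x_new)
print(delta_start, delta_end)
```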

VI. RMSE-BASED EVALUATION

The first synchronization method (Method 1), presented in Section IV, was evaluated in a previous study through perception tests. The results showed that the obtained synchronization was perceived as slightly less accurate than when the original visual transcriptions were used. To compare the method proposed in this work (Method 2) with Method 1, we calculated the Root Mean Square Error (RMSE) between the generated visual transcriptions and the original reference visual transcriptions for both methods. Since random processes are part of the methods, the accuracy and therefore the RMSE may change between two different applications of the methods. To alleviate this, we ran both methods on all the available files 100 times and calculated the mean RMSE for each method and for each delay to estimate (Δ_AV,start and Δ_AV,end). In the case of Method 2, which includes data-driven training, the GMMs were trained for each file following a leave-one-out protocol, meaning that every time a target vector had to be estimated, the corresponding source vector was not included in the training while all the other observations in the available data were. Table III gives the results.

Table III. Mean RMSE values with their standard errors for each method and each predicted value.

                              Method 1                   Method 2
Mean RMSE Δ_AV,start (sec)    0.1588 (std err. 0.0012)   0.1340 (0.0014)
Mean RMSE Δ_AV,end (sec)      1.0477 (0.0156)            0.5770* (0.0059)

A Tukey Honest Significant Difference (HSD) test with a significance level of 95% was performed on the means obtained for each method. There is no significant difference between the RMSE means of the two methods in the case of Δ_AV,start (p-value = 0.85), while there is a strong difference between the means in the case of Δ_AV,end (p-value < 0.01). These results tend to show that the proposed method performs as well as the previous one for the estimation of the time delay between audio and visual laughter at the beginning (Δ_AV,start) and performs better for the estimation of the delay at the end (Δ_AV,end).
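The leave-one-out RMSE protocol described above can be sketched as follows, reusing the single-component closed-form fit from the previous sketch; the data is synthetic, and the 100-run averaging is omitted here since this simplified fit has no random component.

```python
# Sketch of the leave-one-out RMSE protocol of Section VI on synthetic data.
import numpy as np

def fit_single_gmm(X, Y):
    """Single-component joint fit: returns the linear map (W, b) of Eqs. (4)-(5)."""
    Z = np.hstack([X, Y])
    mu, sigma = Z.mean(axis=0), np.cov(Z, rowvar=False)
    d_x = X.shape[1]
    W = sigma[d_x:, :d_x] @ np.linalg.inv(sigma[:d_x, :d_x])
    b = mu[d_x:] - W @ mu[:d_x]
    return W, b

def loo_rmse(X, Y):
    """Leave-one-out root-mean-square error per predicted delay."""
    errors = []
    for i in range(len(X)):
        keep = np.arange(len(X)) != i       # train on all files but the i-th
        W, b = fit_single_gmm(X[keep], Y[keep])
        errors.append(W @ X[i] + b - Y[i])
    return np.sqrt(np.mean(np.square(errors), axis=0))

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 7))                              # 7 features per file
Y = 0.3 * X[:, :2] + 0.05 * rng.standard_normal((40, 2))      # two delays per file
rmse_start, rmse_end = loo_rmse(X, Y)
print(f"RMSE delta_AV_start={rmse_start:.3f}, delta_AV_end={rmse_end:.3f}")
```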
VII. CONCLUSIONS AND FUTURE WORK

In this paper, we have proposed a synchronization method based on the generation of visual transcriptions intended to be used as input to an HMM-based visual laughter synthesizer. Compared to the previous method, two improvements were introduced. Firstly, the phonemic transcriptions, which may not always be available, are no longer needed as input, as was the case in the previous method; the proposed method only uses an audio laughter file as input. Secondly, mean RMSE calculations between generated and original visual transcriptions were conducted for both the previous and the proposed methods. The results showed that the proposed method performs better than the previous one for the prediction of the delay at the end of the laughter, and as well as it for the prediction of the delay at the beginning.

Future work includes a deeper evaluation of how the accuracy improvement impacts the perception of the rendered animations. The extension of the proposed method to synthesize visual animation from audio laughter files belonging to several different persons is also an interesting line to follow. Further accuracy improvements may be achieved by adding more supra-segmental features, such as the number of vowels and fricatives in the file; this would require the development of a robust automatic laughter phoneme recognition system.

ACKNOWLEDGMENT

H. Çakmak receives a Ph.D. grant from the Fonds de la Recherche pour l'Industrie et l'Agriculture (F.R.I.A.), Belgium.

REFERENCES

[1] J. Urbain, H. Çakmak, and T. Dutoit, "Evaluation of HMM-based laughter synthesis," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, 2013.
[2] J. Urbain, H. Çakmak, and T. Dutoit, "Automatic phonetic transcription of laughter and its application to laughter synthesis," in Proceedings of the 5th biannual Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), Geneva, Switzerland, 2-5 September 2013, pp. 153-158.
[3] J. Urbain, H. Çakmak, A. Charlier, M. Denti, T. Dutoit, and S. Dupont, "Arousal-driven synthesis of laughter," IEEE Journal of Selected Topics in Signal Processing, vol. 8, pp. 273-284, 2014.
[4] G. Hofer, J. Yamagishi, and H. Shimodaira, "Speech-driven lip motion generation with a trajectory HMM," 2008.
[5] S. Sako, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "HMM-based text-to-audio-visual speech synthesis," in ICSLP, 2000.
[6] L. Wang, Y. Wu, X. Zhuang, and F. Soong, "Synthesizing visual speech trajectory with minimum generation error," in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, IEEE, 2011, pp. 4580-4583.
[7] G. Hofer and K. Richmond, "Comparison of HMM and TMDN methods for lip synchronisation," 2010.
[8] O. Govokhina, G. Bailly, G. Breton, et al., "Learning optimal audiovisual phasing for an HMM-based control model for facial animation," in 6th ISCA Workshop on Speech Synthesis (SSW6), 2007.
[9] G. Bailly, O. Govokhina, F. Elisei, and G. Breton, "Lip-synching using speaker-specific articulation, shape and appearance models," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2009, p. 5, 2009.
[10] H. Çakmak, J. Urbain, J. Tilmanne, and T. Dutoit, "Evaluation of HMM-based visual laughter synthesis," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, 2014.
[11] D. Schabus, M. Pucher, and G. Hofer, "Joint audiovisual hidden semi-Markov model-based speech synthesis," IEEE Journal of Selected Topics in Signal Processing, 2013.
[12] T. Masuko, T. Kobayashi, M. Tamura, J. Masubuchi, and K. Tokuda, "Text-to-visual speech synthesis based on parameter generation from HMM," in Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, 1998, vol. 6.
[13] M. Tamura, T. Masuko, T. Kobayashi, and K. Tokuda, "Visual speech synthesis based on parameter generation from HMM: Speech-driven and text-and-speech-driven approaches," in AVSP'98 Int. Conf. on Auditory-Visual Speech Processing, 1998.
[14] H. Çakmak, J. Urbain, and T. Dutoit, "HMM-based synthesis of laughter facial expression," Journal on Multimodal User Interfaces (JMUI), 2015, [Submitted].
[15] H. Çakmak, J. Urbain, and T. Dutoit, "Synchronization rules for HMM-based audio-visual laughter synthesis," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[16] G. McLachlan and D. Peel, Finite Mixture Models, 2000.
[17] H. Çakmak, J. Urbain, and T. Dutoit, "The AV-LASYN database: A synchronous corpus of audio and 3D facial marker data for audio-visual laughter synthesis," in Proc. of the 9th Int. Conf. on Language Resources and Evaluation (LREC'14), 2014.
[18] J. Urbain and T. Dutoit, "A phonetic analysis of natural laughter, for use in automatic laughter processing systems," in ACII 2011, 2011, pp. 397-406.
[19] W. Ruch and P. Ekman, "The expressive pattern of laughter," Emotion, Qualia, and Consciousness, pp. 426-443, 2001.
[20] H. Zen, T. Nose, J. Yamagishi, S. Sako, T. Masuko, A. Black, and K. Tokuda, "The HMM-based speech synthesis system (HTS) version 2.0," in Proc. of the Sixth ISCA Workshop on Speech Synthesis, 2007.
[21] W. Ruch and P. Ekman, "The expressive pattern of laughter," in Emotion, Qualia and Consciousness, A. Kaszniak, Ed., pp. 426-443, World Scientific Publishers, 2001.
[22] A. W. Bowman and A. Azzalini, Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations, OUP Oxford, 1997.
[23] N. L. Johnson, S. Kotz, and N. Balakrishnan, Continuous Univariate Distributions, vol. 2, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, Wiley & Sons, 1995.
[24] T. Giannakopoulos, "A method for silence removal and segmentation of speech signals, implemented in Matlab," University of Athens, Athens, 2009.
[25] I. Stylianou, "Harmonic plus noise models for speech, combined with statistical methods, for speech and speaker modification," Ph.D. thesis, Ecole Nationale Supérieure des Télécommunications, 1996.
[26] A. B. Kain, "High resolution voice transformation," Ph.D. thesis, Oregon Health & Science University, 2001.
[27] T. Hueber, E.-L. Benaroya, B. Denby, and G. Chollet, "Statistical mapping between articulatory and acoustic data for an ultrasound-based silent speech interface," in INTERSPEECH, 2011, pp. 593-596.