Fusion for Audio-Visual Laughter Detection


Fusion for Audio-Visual Laughter Detection
Boris Reuderink
September 13, 2007


Abstract

Laughter is a highly variable signal, and can express a spectrum of emotions. This makes the automatic detection of laughter a challenging but interesting task. We perform automatic laughter detection using audio-visual data from the AMI Meeting Corpus. Audio-visual laughter detection is performed by combining (fusing) the results of separate audio and video classifiers on the decision level. The video classifier uses features based on the principal components of 20 tracked facial points; for audio we use the commonly used PLP and RASTA-PLP features. Our results indicate that RASTA-PLP features outperform PLP features for laughter detection in audio. We compared hidden Markov models (HMMs), Gaussian mixture models (GMMs) and support vector machine (SVM) based classifiers, and found that RASTA-PLP combined with a GMM resulted in the best performance for the audio modality. The video features classified using an SVM resulted in the best single-modality performance. Fusion on the decision level resulted in laughter detection with a significantly better performance than single-modality classification.


Acknowledgements

I would like to thank my supervisors, Maja Pantic, Mannes Poel, Khiet Truong and Ronald Poppe, for supervising me, supporting me and pointing me in the right directions when needed. In addition to my supervisors, I would like to thank the HMI department of the University of Twente for supporting my trip to London. I would also like to thank Michel Valstar, who helped me tremendously during my stay in London, and Stavros Petridis, for helping with the creation of the video features. I would like to thank Luuk Peters and Roald Dijkstra for proofreading my drafts. And last but not least I would like to thank Sanne Beukers for supporting me while I was working on my thesis.


Contents

1 Introduction
2 Literature
  2.1 Laughter
  2.2 Laughter detection in audio
  2.3 Facial expressions
  2.4 Audio-visual fusion
3 Methodology
  3.1 Dataset
    3.1.1 AMI Meeting Corpus
    3.1.2 Segmentation
    3.1.3 Corpus
  3.2 Features
    3.2.1 Audio features
    3.2.2 Video features
  3.3 Test setup
    3.3.1 Classifiers
    3.3.2 Fusion
    3.3.3 Cross validation scheme
    3.3.4 Model-parameter estimation
    3.3.5 Performance measure
4 Results
  4.1 Single-modality classifiers
  4.2 High level fusion
5 Conclusions
A Principal components for the video features
B Normalized features


Chapter 1 Introduction

Laughter is important. Someone's mental state and emotions are conveyed in paralinguistic cues, such as laughter, a trembling voice and coughs. Because laughter occurs frequently in spontaneous speech, it is an interesting research subject. Laughter is not limited to positive emotions; negative feelings and attitudes such as sadness and contempt can be expressed with laughter [35]. This spectrum of expressed emotions combined with the high variability of laughter makes the automatic detection of laughter a challenging but interesting task. Automatic laughter detection can be used, for example, in meetings, where laughter can provide cues to semantically meaningful events. Another application of laughter detection is the detection of non-speech for automatic speech recognition. Laughter can possibly be used as a feedback mechanism in human-computer interaction interfaces.

Earlier work on laughter detection has mainly focused on laughter detection in audio only. Currently the focus is starting to shift from laughter detection in audio to audio-visual detection of laughter, because additional visual information can possibly improve the detection of laughter. Research investigating audio-visual laughter detection was suggested by Truong et al. [38]. In this thesis we investigate fusion for audio-visual laughter detection. We will investigate whether fusion of the audio and video modalities can improve the performance of automatic laughter detection. Fusion will be performed on the decision level, which means that audio and video are classified separately, and the results are fused to make a final classification. We will evaluate different feature sets and different classification algorithms in order to find strong audio and video classifiers, and fuse those results to create an audio-visual classifier which hopefully outperforms both the audio and the video classifiers.

The rest of this thesis is organized as follows. In the next chapter we describe earlier work on laughter detection in audio, detection of facial expressions in video, and work on audio-visual emotion recognition. Then we describe the methodology we use to evaluate the performance of decision-level fusion. This includes a description of the data set, the machine-learning techniques we use, and the performance measure we use. The results are presented in the next chapter, followed by the conclusions in the last chapter.


Chapter 2 Literature

2.1 Laughter

Although laughter occurs frequently in conversations, we do not seem to know a lot about laughter. Research on the acoustic properties of laughter often contradicts other research, and the terminology used differs from work to work. To increase the confusion even more, smiles and laughter are often not discussed together while they seem to be related. Therefore we will describe the different terminologies used to describe laughter and the relation between speech, laughter and smiles, before we describe the variability of the laughter signal.

Laughter is usually analyzed on three levels. Bachorowski [5] defines the following three levels: bouts, calls and segments. Bouts are entire laugh episodes that occur during one exhalation. Calls are discrete acoustic events that together form a bout. Bouts start with long calls, followed by calls that are about half as long. A call is voiced, unvoiced, mixed, or is made up of glottal pulses, fry registers and glottal whistles. The mouth can be open or closed during the production of a call. Calls can be subdivided into segments. Segments are temporally delimited spectrogram components. A similar division into three levels was suggested by Trouvain [36]: phrases, syllables and segments. Phrases are comparable to bouts. Syllables are defined as interpulse intervals, and form phrases when combined. Segments can be vowels or consonants. The consonantal segment in a laugh is often seen as an interval or pause.

Smiling and laughter often occur together, and seem to be different forms of the same event. Laughter often shows a facial expression similar to smiling combined with an involuntary exhalation, sometimes followed by uncontrolled inhalations and exhalations. This involuntary breathing is not present during smiling. Laughter and smiles could be extremes of a smile-laugh continuum, but there are some indications that there is a more complex relation between laughter, smiling and even speech than we would expect. Aubergé and Cathiard demonstrated that a genuine smile includes a specific manipulation of the prosody of the speech [4], which cannot be attributed to the facial deformation of a smile; not only laughter, but also smiling is audible. Like smiling, laughter does occur during speech, and does so very often according to Trouvain [35]. In the Kiel Corpus of Spontaneous Speech, 60% of all labeled laughs are instances that overlap speech. Simultaneous production of speech and laughter is not simply laughter imposed on articulation, and there is no prototypical pattern for speech-laughs. In a later work, Trouvain reports [36] that laughter is a mix of laughter interspersed with speech-laughs and smiled speech. Smiling and laughter seem to be different categories rather than extremes of a continuum.

Laughter is a highly variable signal [5, 40]. Voiced laughter shows much more source-related variability than is associated with speech, and the individual identity and sex are conveyed in laugh acoustics. The variability between individuals greatly exceeds the variability within an individual. Laughter seems to be better conceptualized as a repertoire of sounds, which makes it difficult to detect automatically. Kipper and Todt [27] report that the successive syllables of laughter, which appear similar, show dynamic changes of acoustic parameters, and are in fact different. For example, the fundamental frequency, the amplitude and the duration of a syllable vary during a laughter bout. Laughter seems to be a very variable signal, on both the phrase and the syllable level.

The automatic recognition of laughter seems to be a very challenging problem. The laughter signal is highly variable on multiple levels, and can be described best as a group of sounds. Laughter and smiles seem to be different categories, and should not be regarded as different manifestations of the same event.

2.2 Laughter detection in audio

Automatic laughter detection has been studied several times, in the context of meetings, for audio indexing and to detect affective states. We will describe a few studies on automatic laughter detection, and summarize some characteristics of these studies. An overview of automatic laughter detection can be found in Table 2.1.

Campbell et al. [8] developed a system to classify a laugh into different categories. They constructed a corpus containing four affective classes of laughter: a hearty laugh, an amused laugh, a satirical laugh and a social laugh. A training set of 0 hand-labeled laughs was used to train hidden Markov models (HMMs). The HMMs recognized the affective class correctly in 75% of the test cases.

Automatic laughter detection can be used in audio indexing applications. For example, Lockerd and Mueller [28] performed laughter detection using their affective indexing camcorder. Laughter was detected using HMMs. One HMM was trained on 40 laughter examples, the other HMM was trained on speech. The classifier correctly identified laughter in 88% of the test segments. Misleading segments were sounds such as coughs, and sounds produced by cars and trains. Arias et al. [2] performed audio indexing using Gaussian mixture models (GMMs) and support vector machines (SVMs) on spectral features. Each frame is classified and then smoothed using a smoothing function to merge small parts. The accuracy of their laughter detection is very high (97.26% with GMMs and 97.12% with an SVM). However, their data set contains 1 minute of laughter for every 180 minutes of audio. Only predicting non-laughter would result in a baseline accuracy of 99.4%, which makes it unclear how well their laughter detection really performs.

Automatic laughter detection is frequently studied in the context of meetings. Kennedy and Ellis [25] detected multiple laughing participants in the ICSI Meeting database. Using an SVM on one-second windows of Mel-Frequency Cepstrum Coefficient (MFCC) features, an equal error rate (EER) of 13% was obtained. The same data set was used by Truong and Van Leeuwen [37]. Using Gaussian mixture models (GMMs) on Perceptual Linear Predictive (PLP) features, they also obtained an EER of 13%. The data set contained examples in which both speech and laughter were present, and some inaudible laughs.

After removing these difficult instances, the performance increased, resulting in an EER of 7.1%. Different audio features were tested; PLP outperformed pitch-and-energy, pitch-and-voicing and modulation spectrum features. In a more recent work, Truong and Van Leeuwen [38] used the cleaned ICSI meeting data set to train GMM and SVM classifiers. For the SVM classifier the frame-level features were transformed to a fixed length using a Generalized Linear Discriminant Sequence (GLDS) kernel. The SVM classifier performed better than the GMM classifier in most cases. The best feature set appeared to be the PLP feature set. The scores of different classifiers based on different features were fused using a linear combination of the scores, or fused using an SVM or an MLP trained on the scores. Fusion of GMM- and SVM-based classifiers increases the discriminative power, as does fusion between classifiers based on spectral features and classifiers based on prosodic information.

Study | Dataset | Performance | Remarks
Truong (2007) [38] | ICSI-Bmr, clean set | EER GMM: 6.3, EER SVM: 2.6, EER fused: 2.9 | EER fused was tested on a different corpus than EER SVM
Arias (2005) [2] | Broadcast | A: 97% | MFCCs with GMMs and SVMs; 1 minute laughter, 180 minutes non-laughter
Campbell (2005) [8] | ESP | A: 75% | HMMs to classify a laugh into 4 categories
Ito (2005) [23] | Audio-visual laughter | Audio: (95% R, 60% P) |
Truong (2005) [37] | ICSI-Bmr | EER: 13.4%, EER clean: 7.1% | PLP, GMM; EER clean on set with unclear samples removed
Kennedy (2004) [25] | ICSI-Bmr | EER: 13% | MFCCs + SVM
Lockerd (2002) [28] | Single person, 40 laughs | A: 88% | HMMs

Table 2.1: Automatic laughter recognition in audio.

When we compare the results of these studies, GMMs and SVMs seem to be used most for automatic laughter recognition. Spectral features seem to outperform prosodic features. An EER of 12-13% seems to be usual. Removing unclear examples improves the classification performance enormously. This suggests that the performance largely depends on the difficulty of the chosen data set.

2.3 Facial expressions

The detection of facial expressions in video is a popular area of research. Therefore, we will only describe a few studies that are related to fusion and to the Patras-Pantic particle filtering tracking scheme [33], which we will use to extract video features.

Valstar et al. [39] conducted a study to automatically differentiate between posed and spontaneous brow actions. Timing is a critical factor for the interpretation of facial behavior. The facial expressions are labeled according to the Facial Action Coding System (FACS) [16] action units (AUs). The SVM-based AU detectors detect the temporal segments (neutral, onset, apex, offset) of the atomic AUs based on a sequence of 20 tracked facial points. These points were tracked using the Patras-Pantic particle filtering tracking scheme. Using the three detected brow AUs (AU1, AU2, AU4), mid-level features based on intensity, duration, trajectory, symmetry and co-occurrence with other muscle actions were created to determine the spontaneity of an instance.

This resulted in a classification with an accuracy of 90.7%.

Gunes and Piccardi [18] compare fusion of facial expressions and affective body gestures at the feature and decision level. For both modalities, single expressive frames are manually selected, which are classified into six emotions. The body-modality classifier was able to classify frames with a 100% accuracy. Using feature-level fusion, which combines the feature vectors of both modalities into a single multi-modal feature vector, again an accuracy of 100% was obtained. Decision-level fusion was performed using different rules to combine the scores of the classifiers for both modalities, which resulted in an accuracy of only 91%. Clearly the fusion rules used were not well chosen for their problem.

Pantic et al. [32] used the head, face and shoulder modalities to differentiate between spontaneous and posed smiles. The tracking of the facial expressions was performed using the Patras-Pantic particle filtering tracking scheme [33]. For mid- and high-level fusion, frames are classified, and filtered to create neutral-onset-apex-offset-neutral sequences. Mid-level fusion is performed by transforming features into symbols such as temporal aspects of AUs, and the head and shoulder actions. For these symbols, mid-level features such as morphology, speed, symmetry and the duration of apex overlap of modalities are calculated, and the order of the different actions is computed. Low-level fusion (recall: 93%, precision: 89%) yields better results than mid-level (recall: 79%, precision: 79%) and high-level fusion (recall: 93%, precision: 63%). The head modality is the most important modality for recognition on this data set, although the difference is not significant. The fusion of these modalities improves the performance significantly.

2.4 Audio-visual fusion

Most work on audio-visual fusion has focused on the detection of emotion in audio-visual data [49, 47, 44, 18, 45, 48, 46, 21, 41, 17, 7]. Some other audio-visual studies are conducted on cry detection [31], movie classification [42], tracking [6, 3], speech recognition [13] and laughter detection [23]. These studies all try to exploit the complementary nature of audio-visual data. Decision-level fusion is usually performed using the product or a (weighted) sum of the predictions of single-modality classifiers, or using hand-crafted rules for classification. Other commonly used fusion techniques include mid-level fusion using multi-stream hidden Markov models (MHMMs), and feature-level fusion. We will describe some studies in more detail and make some general observations. An overview of these studies can be found in Table 2.2.

Zeng et al. [48] used a Sparse Network of Winnows (SNoW) classifier to detect 11 affective states in audio-visual data. Fusion was performed using voting on the frame level to obtain a class for each instance. For a second, person-independent test, fusion was performed by using a weighted summation of component HMMs. In a following study [46], Zeng et al. performed automatic emotion recognition of positive and negative emotions in a realistic conversation setting. The facial expressions were encoded using FACS. Video features were based on the facial texture; prosodic features were used for audio classification. Fusion was regarded as a multi-class classification problem, with the outputs of the different component HMMs as features. An AdaBoost learning scheme performed best of the tested classifiers.
Another study that compared feature-level fusion and decision-level fusion for automatic emotion recognition was conducted by Busso et al. [7]. Video texture and prosodic audio features were classified using SVMs.

The confusion matrices of the audio and video modalities show that pairs of emotions that are confused in one modality can be easily classified using the other modality. Different decision-level fusion rules were tested; the best results (89%) were obtained using the product of the predictions of both modalities. Feature-level fusion resulted in an accuracy of 89%. In this experiment, feature-level fusion and decision-level fusion had a similar performance.

Study | Dataset | Performance | Remarks
Zajdel (2007) [44] | Posed, 2 emotions | A: 45%, V: 67%, MF: 78% | Dynamic Bayesian Network
Zeng (2007) [46] | AAI, 2 emotions | A: 70%, V: 86%, DF: 90% | AdaBoost on component HMMs
Zeng (2007) [48] | Posed, 11 emotional states | A: 66%, V: 39%, DF: 72% | SNoW, MHMM
Pal (2006) [31] | Unknown, 5 cry types | A: 74%, V: 64%, DF: 75% | Rule-based fusion using confusion matrices
Zeng (2006) [45] | AAI, 2 emotions | A: 70%, V: 86%, DF: 90% | AdaBoost on component HMMs
Asoh (2005) [3] | Speech, 2 states and location | MF: 85% | Particle filter
Hoch (2005) [21] | Posed, 3 emotions | A: 82%, V: 67%, DF: 87% | SVM, weighted-sum fusion
Ito (2005) [23] | Spontaneous, laughter | A: (95% R, 60% P), V: (71% R, 52% P), DF: (71% R, 74% P) | Rule-based fusion
Wang (2005) [41] | Posed, 6 emotions | A: 66%, V: 49%, FF1: 70%, FF2: 82% | FF1: FLDA classifier, FF2: rule-based voting
Xu (2005) [42] | Movies, horror vs comedy | DF: (R: 97%, P: 91%) | Voting, rule-based fusion
Busso (2004) [7] | Posed, single person, 4 emotions | A: 71%, V: 85%, FF: 89%, DF: 89% | DF: product fusion
Go (2003) [17] | 6 emotions | A: 93%, V: 93%, DF: 97% | Rule-based fusion
Dupont (2000) [13] | M2VTS, 10 words, noisy | A: 52%, V: 60%, FF: 70%, MF: 80%, DF: 82% | Fusion using MHMMs

Table 2.2: Audio-visual fusion.

Dupont and Luettin [13] used both acoustic and visual speech data for automatic speech recognition. An MHMM is used to combine the audio and video modalities on the feature level, the decision level and the mid-level. The fused system performs better than systems based on a single modality in the presence of noise. Both mid-level and decision-level fusion perform better than feature-level fusion. Without the addition of noise to the features, audio classification alone is sufficient for almost perfect classification.

A quite different approach to fusion was taken by Asoh et al. [3]. Particle filtering was used to track the location of human speech events. The audio modality consisted of a microphone array, the video modality consisted of a monocular camera. Audio-visual tracking was performed by modeling the position and the type of the signal as a hidden state. The noisy observations are used to estimate the hidden state using particle filtering. This approach provides a simple method to compute the probability of a location and occurrence of a speech event.

Ito et al. [23] focused on the detection of a smiling face and the utterance of laughter sounds in natural dialogues. A database was created with Japanese, English and Chinese subjects. Video features consist of the lip lengths, the lip angles and the mean intensities of the cheek areas.

Frame-level classification of the video features is performed using a perceptron, resulting in a recall of 71% and a precision of 52%. Laughter sound detection is performed on MFCC and delta-MFCC features, using two GMMs: one for laughter, and one for other sounds. The frame-by-frame sequences are smoothed using a moving-average filter. A recall of 96% and a precision of 60% was obtained with 16 Gaussian mixtures. The audio and video channels are combined using hand-crafted rules. The combined system obtained a recall of 71% and a precision of 74%. Ito et al. do not report whether fusion significantly increases the performance of their detector.

Xu et al. [42] performed affective content analysis of comedy and horror movies using audio emotional events, such as laughing and horror sounds. The audio is classified using a left-to-right HMM with four states. After classification, the predictions are filtered using sliding-window majority voting. Some horror sounds were too short to detect using an HMM; they were detected by finding large amplitude changes. The audio features consist of MFCCs, with delta and acceleration features to accentuate the temporal characteristics of the signal. The recall and precision are over 90% for horror sounds and canned laughter.

The performance of decision-level fusion seems to be similar to the performance of feature-level fusion. The fusion of audio and video seems to boost the classification performance in these studies by about 4%. However, most work does not report the significance of this gain in performance. Fusion seems to work best when the individual modalities both have a low performance, for example due to noise in the audio-visual speech recognition of Dupont [13]. When single classifiers have a high performance, the performance gain obtained by fusion of the modalities is low, and sometimes fusion even degrades the performance, as observed in the work of Gunes [18].

Chapter 3 Methodology

Fusion of audio and video can be performed on different levels. We perform fusion on the decision level, where the audio and video modalities are classified separately. When the classifiers for both modalities have classified the instance, their results are fused into a final multi-modal prediction. See Figure 3.1 for a schematic overview. An alternative approach is fusion on the feature level, where the audio and video features are merged into a single, fused feature set. A classifier then classifies the fused features of a single instance.

We have chosen to evaluate decision-level fusion instead of feature-level fusion for two reasons. The first reason is that decision-level fusion allows the use of different classifiers for the different modalities. The different results for the different classifiers help us understand the nature of the audio-visual signal, and possibly result in a better performance. The second reason is that we use a very small data set. The feature-level fusion approach has a higher dimensionality, which requires a larger data set to learn a classifier [1]. We therefore use decision-level fusion. In the next subsections, we will describe the preprocessing we applied to our data set, the features we used and the design we used to evaluate our fusion techniques.

Figure 3.1: Decision-level fusion. The audio and video features are classified separately; the two predictions are fused into a final multi-modal prediction.

3.1 Dataset

In order to measure the classification performance of different fusion techniques, we need a corpus containing both laughter and non-laughter examples to use for training and testing. We created a corpus based on the AMI Meeting Corpus [29].

In the following sections, we will describe the AMI Meeting Corpus, the segmentation process used to select examples (instances), and the details regarding the construction of the corpus based on the segmentation data.

3.1.1 AMI Meeting Corpus

The AMI Meeting Corpus consists of 100 hours of meeting recordings, stored in different signals that are synchronized to a common time line. The meetings are recorded in English, mostly spoken by non-native speakers. For each meeting, there are multiple audio and video recordings. We used seven non-scenario meetings recorded in the IDIAP room (IB1, IB2, IB3, IB4, IB5, IB4010, IB4011). These meetings contain a fair amount of spontaneous laughter. In the first five meetings, the four participants plan an office move. In the last two meetings, four people discuss the selection of films to show for a fictitious movie club. We removed two participants: one displayed extremely asymmetrical facial expressions (IB5.2), the other displayed a strong nervous tic in the muscles around the mouth (IB3.3, IB3.4). Both participants were removed because their unusual expressions would have a huge impact on our results due to the small size of our dataset. The remaining 10 participants are displayed in Figure 3.3. We used the close-up video recording (DivX AVI codec 5.2.1, 2 Kbps, pixels, 25 frames per second) and the headset audio recording (16 kHz WAV file) of each participant for our corpus. In total we have used 17 hours of raw audio-visual meeting data to construct our corpus.

3.1.2 Segmentation

The seven meetings we selected from the AMI Meeting Corpus were segmented into laughter and smile segments. The presence of laughter was determined using the definition for audible laughter of Vettin and Todt [40]: vocalizations comprising several vocal elements must consist mainly of expiratory elements; inspiratory elements might occur at the end of vocalisations; expiratory elements must be shorter than 600 ms and successive elements have to be similar in their acoustic structure; single-element vocalizations must be expiratory with a vowel-like acoustic structure, or, when noisy, the element must begin with a distinct onset.

For smiles we used a definition based on visual information. We define a visible smile as the visible contraction of the Zygomatic Major (FACS Action Unit 12). The activation of AU12 pulls the lip corners towards the cheekbones [14]. We define the start of the smile as the moment the corners of the mouth start to move; the end is defined as the moment the corners of the mouth return to a neutral position. Using these definitions, the 17 hours of audio-visual meeting recordings were segmented into 2049 smiles and 960 laughs. Due to the spontaneous nature of these meetings, speech, chewing and occlusions sometimes co-occur with the smile and laugh segments.

Figure 3.2: Segmentation of the data and extraction of instances for the corpus. On top the typical observed intensity of the smiles and laughs is shown. Based on the observations, a segmentation is made, as shown in the middle diagram. The laughs are padded with 3 seconds on both sides to form the positive instances. The negative instances are created from the remaining non-smile space.

3.1.3 Corpus

The final corpus is built using this segmentation data. The laughter instances are created by padding each laughter segment with 3 seconds on each side to capture the onset and offset of a visual laughter event (see Figure 3.2). A preliminary experiment showed that these onset and offset segments increased the performance of the classifier. Laughter segments that overlap after padding are merged into a single laughter instance. This effectively merges separate laughter calls into an instance containing a single laughter bout. The non-laughter instances are created from the audio-visual data that remains after removing all the laughter and smile segments; the smile segments are not used during this research. The length of each non-laughter instance is drawn from a Gaussian distribution with a mean and standard deviation equal to the mean and standard deviation of the laughter segments. Due to time constraints we have based our corpus on 60 randomly selected laughter and 120 randomly selected non-laughter instances, in which the 20 facial points needed for tracking are visible. Of these 180 instances, 59% contain speech of the visible participant. Almost all instances contain background speech. Together these instances comprise 25 minutes of audio-visual data.

3.2 Features

This section outlines the features we have used for the audio and video modalities. For audio we use features that are commonly used for the detection of laughter in audio. For video we use features based on the location of 20 facial points.

3.2.1 Audio features

In order to detect laughter in audio, the audio signal has to be transformed into useful features for classification algorithms. Spectral or cepstral audio features, such as Mel-Frequency Cepstrum Coefficients (MFCCs) [25] and Perceptual Linear Predictive (PLP) Analysis [19], have been used successfully for automatic speech recognition and laughter detection.

Figure 3.3: Laughter examples for each individual in the corpus.

We decided to use PLP features, with the same settings as used by Truong and Van Leeuwen [38] for automatic laughter detection, and RASTA-PLP features with similar settings. PLP and RASTA-PLP can be understood best as a sequence of transformations. The first transformation is Linear Predictive Coding (LPC). LPC encodes speech based on the assumption that speech is comparable with a buzzer at the end of a tube; the formants of the speech are removed, and encoded with the intensity and frequency of the remaining buzz. PLP adds a transformation of the short-term spectrum to LPC-encoded audio, in order to mimic human hearing. We used these PLP features for audio classification. In addition to the PLP audio features, we derived RASTA-PLP [20] features. RASTA-PLP adds filtering capabilities for channel distortions to PLP, and yields significantly better results than PLP for speech recognition tasks in noisy environments [13]. A visualisation of PLP and RASTA-PLP features can be found in Appendix B.

For the PLP features we used the same settings as were used by Truong and Van Leeuwen [38] for laughter detection (see Table 3.1). The 13 cepstral coefficients (12 model order, 1 gain) are calculated over a window of 32 ms with a step-size of 16 ms. Combined with the temporal derivative (calculated by convolving with a simple linear-slope filter over 5 audio frames) this results in a 26-dimensional feature vector per audio frame. The RASTA-PLP features are created using the same settings. We normalize these 26-dimensional feature vectors to a mean µ = 0 and a standard deviation σ = 1 using z-normalisation.

Setting | PLP | RASTA-PLP
Sampling frequency | 16 kHz | 16 kHz
Window size | 32 ms | 32 ms
Window step-size | 16 ms | 16 ms
Model order | 12 | 12
Delta window | 5 frames | 5 frames
Log-RASTA filtering | false | true

Table 3.1: Settings used for the PLP and RASTA-PLP features.

3.2.2 Video features

The video channel was transformed into sequences of 20 two-dimensional facial points located on key features of the human face. These point sequences are subsequently transformed into orthogonal features using a Principal Component Analysis (PCA). The points were tracked as follows. The points were manually assigned at the first frame of an instance movie and tracked using a tracking scheme based on particle filtering with factorized likelihoods [33]. We track the brows (2 points each), the eyes (4 points each), the nose (3 points), the mouth (4 points) and the chin (1 point). This tracking configuration has been used successfully [39] for the detection of the atomic action units of the FACS. This results in a compact representation of the facial movement in a movie using 20 (x, y) tuples per frame (see Figure 3.4).

After tracking, we performed a PCA on the 20 points per video frame. A PCA linearly transforms a set of correlated variables into a set of uncorrelated variables [24]. The principal components are ordered so that the first few retain most of the variance of the original variables. Therefore a PCA can be used as a dimension-reduction technique for features [1]; however, we chose to keep all the dimensions because we do not know in advance which principal components are useful for laughter detection.

Figure 3.4: The tracked facial points.

We have chosen to use PCA over manually defined features because PCA can detect factors such as differences in head shape that are otherwise difficult to detect and remove from the features. For each frame in the videos we defined a 40-dimensional shape vector by concatenating all the Cartesian (x, y) coordinates. Using a PCA we extracted 40 principal components (eigenvectors) for all the frames in the data set. The original shape vectors can be reconstructed by adding a linear combination of these eigenvectors to the mean of the shape vectors:

x = x̄ + b P^T    (3.1)

Here x is the original shape vector, x̄ is the mean of the shape vectors, b is a vector of weights and P is the matrix of eigenvectors. An analysis of the eigenvectors revealed that the first five principal components encode the head pose, including translation, rotation and scale. The other components encode interpersonal differences, facial expressions, corrections for the linear approximations of movements and less obvious factors of the facial configuration. See Figure 3.5 for a visualisation of the first 12 principal components; for more information please refer to Appendix A.

The matrix of eigenvectors serves as a parametric model for the tracked facial points. The Active Shape Model developed by Cootes et al. [10] uses a similar technique to create a model for shapes. The main difference is that Cootes et al. removed global linear transformations from the model by aligning the shapes before the PCA is applied. We did not align the shapes because the head modality seems to contain valuable cues for laughter detection that we want to include in the model. We use the input for this model (the weight vector b) as the feature vector for the video data. For unseen data, this feature vector can be calculated using Equation 3.2.

b = (x - x̄) P    (3.2)

In order to capture temporal aspects of this model, the first-order derivative of each weight is added to each frame. The derivative is calculated with Δt = 4 frames on a moving average of the weights with a window length of 2 frames. Facial activity (onset-apex-offset) can last from 0.25 seconds (for example a blink) to several minutes [16]. With Δt = 4 frames even the fastest facial activity is captured in the derivative of the features. We normalize this 80-dimensional feature vector to a mean µ = 0 and a standard deviation σ = 1 using z-normalisation. This results in a normalized 80-dimensional feature vector per frame which we use for classification (Appendix B).

Figure 3.5: A visualisation of the influence of the first 12 principal components (PC 1: translation, PC 2: translation, PC 3: head roll, PC 4: scale, PC 5: head yaw, PC 6: aspect ratio, PC 7: face length, PC 8: mouth corners / head pitch, PC 9, PC 10: mouth width, PC 11: mouth opening / eyes, PC 12: mouth corners). The arrows point from -3σ to 3σ, where σ is the standard deviation.

3.3 Test setup

3.3.1 Classifiers

We selected Gaussian mixture models (GMMs), hidden Markov models (HMMs) and support vector machines (SVMs) as machine learning techniques to be used for classification. GMMs and HMMs are frequently used in speech recognition and speaker identification, and have been used before for laughter recognition [2, 38, 23, 8, 28, 26, 42]. SVMs have been used for laughter detection in [25, 2, 34, 38].

HMMs and GMMs are generative models. Therefore, a different model has to be trained for each class. After training using the EM algorithm [11, 43], the log-likelihood for both class models is computed and compared for each instance. Using these log-likelihoods the final output is computed as the logarithm of the ratio between the probability of the positive and the negative model (Eq. 3.3).

score(I) = log(P_pos(I) / P_neg(I)) = log P_pos(I) - log P_neg(I)    (3.3)

We use HMMs that model the generated output using a mixture of Gaussian distributions. For the HMM classifiers we used two different topologies (Figure 3.6). The first is commonly used in speech recognition, and contains only forward connections. The advantage of this left-right HMM model is that fewer parameters have to be learned, and the left-right architecture seems to fit the sequential nature of speech. An ergodic HMM allows state transitions from every state to every state. This topology is more flexible, but more variables have to be learned. Kevin Murphy's HMM Toolbox [30] was used to implement the GMM and HMM classification.

Figure 3.6: A left-right HMM (left) and an ergodic HMM (right).

SVMs expect a fixed-length feature vector, but our data consists of sequences with a variable length. Therefore we use a sliding window to create features for the SVM. During training the classes of windowed sections of the instances are learned. During classification a probability estimate for the different windows of an instance is calculated. The final score of an instance is the mean of its window scores; a median could be used as well. We use Radial Basis Function (RBF) kernel SVMs, which are trained using LIBSVM [9].
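As an illustration of the generative scoring in Equation 3.3, the sketch below fits one GMM per class and scores an instance by its summed frame log-likelihood ratio. It uses scikit-learn's GaussianMixture as a stand-in for the HMM Toolbox used in this work; the data, dimensionality and number of mixtures are placeholders, not the settings of our experiments.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Minimal sketch of the generative classifier of Eq. 3.3 (placeholder data and
# settings; the thesis used Kevin Murphy's HMM Toolbox, not scikit-learn).
rng = np.random.default_rng(0)
laughter_frames = rng.normal(0.5, 1.0, size=(2000, 26))      # stand-in for 26-dim PLP frames
non_laughter_frames = rng.normal(-0.5, 1.0, size=(2000, 26))

# One generative model per class, trained with EM.
gmm_pos = GaussianMixture(n_components=4, covariance_type="diag").fit(laughter_frames)
gmm_neg = GaussianMixture(n_components=4, covariance_type="diag").fit(non_laughter_frames)

def score_instance(frames):
    """Eq. 3.3: log P_pos(I) - log P_neg(I), summed over the frames of one instance."""
    return gmm_pos.score_samples(frames).sum() - gmm_neg.score_samples(frames).sum()

test_instance = rng.normal(0.5, 1.0, size=(150, 26))  # one variable-length instance
print(score_instance(test_instance))  # a positive score suggests laughter
```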

3.3.2 Fusion

Fusion is performed on the decision level, which means that the outputs of an audio and a video classifier are used as input for the final fused prediction. For each instance we classify, we generate two numbers, representing the probability of laughter in the audio and the video modality. Fusion SVMs are trained on these numbers using the same train, validation and test sets as used for the single-modality classifiers (see Section 3.3.3). The output of these SVMs is a multi-modal prediction based on high-level fusion. As an alternative to this learned fusion, we test fusion using a weighted sum (Equation 3.4) of the scores of the single-modality classifiers.

s_fused = α · s_video + (1 - α) · s_audio    (3.4)

3.3.3 Cross validation scheme

In order to compare different fusion techniques, we need to be able to measure the generalisation performance of a classifier. We decided to use a preprocessed data set, so the preprocessing is done once for the whole data set. We have chosen to exclude the preprocessing from the cross-validation loop in order to measure the generalisation error of the fusion without the additional generalisation error of the preprocessing. The preprocessing consists of feature extraction and z-normalisation, which transforms the data to a mean µ = 0 and a standard deviation σ = 1. Using this setup we measure the generalisation error of the classification, and not the combined generalisation error of preprocessing and classification. Because we have a small data set we use a cross-validation scheme to create multiple train, validation and test sets (see Figure 3.7).

Algorithm 1: The used cross-validation scheme.
for K in [1..10] do
    S_train = S - S_K
    for L in [1..3] do
        S_validation = S_KL
        S_test = S_K - S_KL
        C = trainer.learn(S_train, S_validation)
        S_test.performance = trainer.test(C, S_test)
    end
end

The preprocessed data set is divided into K = 10 subsets. During each of the K folds, 1 subset is set aside. The other 9 subsets are used for training. The remaining subset is used to create a validation and a test set for three folds. One third is used as validation set and the remaining two thirds as test set (see Algorithm 1). Different model-parameters are used to train classifiers on the train set. The classifier with the best performance on the validation set is selected, and tested on the test set. This results in performance measurements for 10 × 3 = 30 different folds of the data set.
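To make the weighted-sum fusion of Equation 3.4 concrete, the sketch below fuses audio and video scores and picks α by maximizing the AUC-ROC on a validation fold, mirroring the validation-based model selection described above. The score arrays, labels and the grid of α values are placeholders, not values from this work.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fuse(s_audio, s_video, alpha):
    """Eq. 3.4: weighted sum of the single-modality scores."""
    return alpha * s_video + (1.0 - alpha) * s_audio

def select_alpha(s_audio, s_video, labels, grid=np.linspace(0.0, 1.0, 21)):
    """Pick the weight with the highest AUC-ROC on a validation set."""
    return max(grid, key=lambda a: roc_auc_score(labels, fuse(s_audio, s_video, a)))

# Placeholder validation scores for 30 instances (higher = more laughter-like).
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=30)
s_audio = labels + rng.normal(0.0, 0.8, size=30)
s_video = labels + rng.normal(0.0, 0.5, size=30)

alpha = select_alpha(s_audio, s_video, labels)
print(alpha, roc_auc_score(labels, fuse(s_audio, s_video, alpha)))
```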

Figure 3.7: A train, validation and test set are used to measure the generalisation performance of a classifier.

3.3.4 Model-parameter estimation

Most machine learning techniques have model-parameters (for example, the number of states and the number of Gaussian mixtures for an HMM, or the C and γ parameters for an SVM with an RBF kernel) that influence their performance. We find good parameters by performing a multi-resolution grid search [22] in the model-parameter space, in which we search for the parameters that result in the best performance on the validation set after training. For an SVM with an RBF kernel, we test different values for log(C) and log(γ). The parameters that result in the highest AUC-ROC (see Section 3.3.5) form the center of a smaller grid, whose values are again tested on the validation set. The best scoring classifier is the final classifier. For generative models, such as HMMs and GMMs, we perform the same grid-based parameter search. Because we need a model for both the positive and the negative instances, the grid search is performed for both classes individually. The performance measure during this search is the log-likelihood of the model on the validation set. For GMMs, we estimate the best number of Gaussian mixtures for our data set. For HMMs, we search for the best values for the number of states and the number of Gaussian mixtures, and a Boolean that determines whether the HMM is fully connected or not.

3.3.5 Performance measure

In order to calculate the generalisation performance of a classifier, we need to select a suitable measure for the performance. We have chosen to use the Area Under the Curve of the Receiver Operating Characteristic (AUC-ROC) [15] as the primary and the Equal Error Rate (EER) as the secondary performance measure.

accuracy = (TP + TN) / (P + N)    (3.5)

recall = TP / P    (3.6)

precision = TP / (TP + FP)    (3.7)

The most commonly used measure in previous work is the accuracy (Equation 3.5), or the recall and precision pair (Equation 3.6 and Equation 3.7). The accuracy measure is not suitable to measure the performance for a two-class problem, because a very high accuracy can be obtained by predicting the most frequent class for problems with a high class skew. The combination of recall and precision is more descriptive. Recall expresses the fraction of positive instances that is detected; precision describes the fraction of the detected instances that are real positives. These measures can be calculated using the values found in the confusion matrix (Fig. 3.8).

prediction \ class | positive | negative
positive | TP | FP
negative | FN | TN
all | P | N

Figure 3.8: A confusion matrix, where the columns represent the real class, and the rows represent the prediction of a classifier. The cells contain the true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN).

Most classifiers can be modified to output a probability of a class instead of a binary decision. A trade-off between the costs of the different errors (FP versus FN) can be made by thresholding this probabilistic output. This trade-off can be visualized in a receiver operating characteristic (ROC), in which the true-positive rate is plotted against the false-positive rate for different thresholds (see Figure 3.9). A recall-precision pair corresponds to a single point on the ROC. One of the advantages of the ROC over other thresholded plots is its invariance to class skew [15]. Because we do not know in advance which costs are associated with the different errors, we cannot define a single point of interest on the ROC. Therefore we measure the performance using the area under the ROC curve (AUC-ROC). The AUC-ROC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. In addition to the AUC-ROC performance, we will report the EER for a classifier. The EER is a single point on the ROC, defined as the point for which the false-positive rate equals the false-negative rate.

We will use a paired two-tailed t-test to compare the AUC-ROCs of the cross-validation folds. This K-fold cross-validated paired t-test suffers from the problem that the train sets overlap, which results in an elevated probability of detecting a difference between classifiers when no such difference exists (type I error) [12]. As a solution for this problem the 5×2 cross-validated paired t-test has been developed, which has an acceptable type I error. Because this method uses only half of the data for training during a fold, it is unsuitable for our data set. Therefore we use the K-fold cross-validated paired t-test to compare the AUC-ROC values for different classifiers, and note the possibility of a type I error.
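A minimal sketch of how the AUC-ROC and the EER can be computed from the probabilistic scores of a classifier, assuming scikit-learn is available; the labels and scores below are placeholders, not output of our classifiers.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Placeholder labels and probabilistic scores standing in for classifier output.
labels = np.array([1, 1, 0, 0, 1, 0, 1, 0, 0, 1])
scores = np.array([0.90, 0.80, 0.70, 0.30, 0.60, 0.40, 0.20, 0.10, 0.50, 0.85])

fpr, tpr, thresholds = roc_curve(labels, scores)
auc_roc = auc(fpr, tpr)  # primary performance measure

# EER: the point on the ROC where the false-positive rate equals the false-negative rate.
fnr = 1.0 - tpr
eer_index = np.argmin(np.abs(fpr - fnr))
eer = (fpr[eer_index] + fnr[eer_index]) / 2.0

print(f"AUC-ROC: {auc_roc:.3f}, EER: {eer:.3f}")
```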

Figure 3.9: The ROC, the AUC-ROC and the EER for a classifier. The probabilistic output of the classifier is thresholded to generate the ROC curve. Points on the curve define the relation between the true-positive rate and the false-positive rate. The area under the ROC (the AUC-ROC) is our primary performance measure. The EER for a classifier is the error rate at the intersection of the ROC with the EER line from (0, 1) to (1, 0).

Chapter 4 Results

In this chapter we will describe the results of our experiments. We will start with the results for the single-modality classifiers. The best single-modality classifiers are used to construct a fused classifier, which we will compare to the best performing single-modality classifier.

4.1 Single-modality classifiers

We will start with the audio classifiers. We have trained different classifiers on the two sets of audio features. Figure 4.1 shows a ROC plot for the audio features. The figure shows that all the trained classifiers have a similar performance for PLP features. The only real differences are in the area with a very low threshold (high recall, low precision) and the area with a high threshold (low recall, high precision). In those areas the generative models (GMMs and HMMs) seem to perform better. When we look at Table 4.1, we see that the number of Gaussian mixtures for the positive and negative model seems to be proportional to the amount of training data. We expect that more training data would increase the number of mixtures and possibly the performance of our GMM and HMM classifiers. This is supported by the work of Truong et al. [38], where models with 1024 Gaussian mixtures were trained using more than 0 instances.

Classifier | Features | Positive model (#states, #mix.) | Negative model (#states, #mix.) | AUC-ROC | EER
GMM | PLP | (3.2) | (3.1) | (0.169) |
GMM* | RASTA | (2.8) | (5.9) | (0.143) |
GMM | Video | (0.7) | (0.6) | (0.129) |
HMM | PLP | 11.0 Erg. (0), 2.1 (0.5) | 18.5 (1.1) Erg., 2.5 (0.9) | (0.160) |
HMM | RASTA | 11.6 Erg. (1.9), 2.1 (0.4) | 21.3 (1.9) Erg., 2.0 (0) | (0.135) |
HMM | Video | 2.5 LR (0.5), 4.0 (0) | 1.2 (0.4) Erg., 3.0 (0) | (0.129) |

Classifier | Features | Window | Step | log2(C) | log2(γ) | AUC-ROC | EER
SVM | PLP | 1.12 s | 0.64 s | -8.9 (3.7) | -22 (2.6) | (0.173) |
SVM | RASTA | 1.12 s | 0.64 s | -9.8 (4.1) | (3.2) | (0.157) |
SVM* | Video | 1.20 s | 0.60 s | 1.3 (5) | -18 (0) | (0.114) |

Table 4.1: Results of the different classifiers trained on different features. For the model-parameters and the performance measures, the mean value is displayed with the standard deviation between parentheses. The classifiers marked with an asterisk are the best performing classifiers for the audio and video modality.

The results for the RASTA-PLP features are remarkably different. The ROC is not as smooth as for the PLP features, and the SVM performance is degraded dramatically. However, RASTA-PLP features result in a slightly better performance than the PLP features for the generative models. The filtering that RASTA-PLP adds to PLP seems to smoothen the signal (Appendix B). This results in features that can be modeled using fewer mixtures (see Table 4.1), which allows for the training of more states, or training with a higher accuracy. RASTA-PLP was developed with speech recognition in mind, which explains why the generative models that are commonly used in speech recognition perform better with RASTA-PLP features than with PLP features. While the distribution of the values of the features is simplified, the performance for SVMs degrades. SVM classifiers trained on RASTA-PLP features generally have a lower C-parameter, which indicates a smoother hyperplane. Therefore we assume that the smoother RASTA-PLP signal allows for more overfitting, which can explain the degraded performance for SVMs on RASTA-PLP features.

Figure 4.1: The ROC for the PLP features (left), the RASTA-PLP features (middle) and the video features (right).

When we compare the results of the different classifiers trained using PLP and RASTA-PLP features, we observe that the SVM-based classifiers have the worst performance. The difference in performance for the generative models is not as clear. Using a paired samples t-test, we find that the RASTA-PLP features have a significantly higher AUC-ROC (t(59) = 2.15, p < 0.05) than the PLP features. We conclude that the combination of a GMM or HMM classifier with RASTA-PLP features results in the best performance for laughter detection in audio using our data set.

For the video features we evaluated the same classifiers using different model-parameters. The ROC plots can be found in Figure 4.1. The ROC plots show that classifiers trained on the video modality have a better performance than classifiers trained on the audio modality. When we look at the average model for the HMM classifier trained on the video features, we notice that the model for the positive instances is a left-right (LR) HMM, while the model for the positive instances for audio is an ergodic HMM (see Table 4.1). The visual laugh seems to display a sequential order that is not modeled in the audio HMMs. Another difference is that the video modality is modeled using fewer Gaussian mixtures. This can be the result of the higher dimensionality of the video features. The best result for the video modality was obtained using an SVM classifier. This can be the result of the more sequential pattern of visual laughter, which can be detected more reliably inside a sliding window than the variable audio signal. The video SVM classifier has the best single-modality performance.
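The fold-wise comparison above can be sketched with a paired two-tailed t-test over the per-fold AUC-ROC values; the two arrays below are placeholder fold scores, not the actual results of this experiment.

```python
import numpy as np
from scipy.stats import ttest_rel

# Sketch of the K-fold cross-validated paired t-test used to compare two classifiers.
# Placeholder per-fold AUC-ROC values for two feature sets (30 folds each).
rng = np.random.default_rng(2)
auc_rasta = np.clip(rng.normal(0.82, 0.10, size=30), 0.0, 1.0)
auc_plp = np.clip(rng.normal(0.78, 0.12, size=30), 0.0, 1.0)

t_stat, p_value = ttest_rel(auc_rasta, auc_plp)  # paired, two-tailed by default
print(f"t({len(auc_rasta) - 1}) = {t_stat:.2f}, p = {p_value:.3f}")
```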


Multi-modal Kernel Method for Activity Detection of Sound Sources 1 Multi-modal Kernel Method for Activity Detection of Sound Sources David Dov, Ronen Talmon, Member, IEEE and Israel Cohen, Fellow, IEEE Abstract We consider the problem of acoustic scene analysis of multiple

More information

On Human Capability and Acoustic Cues for Discriminating Singing and Speaking Voices

On Human Capability and Acoustic Cues for Discriminating Singing and Speaking Voices On Human Capability and Acoustic Cues for Discriminating Singing and Speaking Voices Yasunori Ohishi 1 Masataka Goto 3 Katunobu Itou 2 Kazuya Takeda 1 1 Graduate School of Information Science, Nagoya University,

More information

Comparison Parameters and Speaker Similarity Coincidence Criteria:

Comparison Parameters and Speaker Similarity Coincidence Criteria: Comparison Parameters and Speaker Similarity Coincidence Criteria: The Easy Voice system uses two interrelating parameters of comparison (first and second error types). False Rejection, FR is a probability

More information

AUTOMATIC RECOGNITION OF LAUGHTER

AUTOMATIC RECOGNITION OF LAUGHTER AUTOMATIC RECOGNITION OF LAUGHTER USING VERBAL AND NON-VERBAL ACOUSTIC FEATURES Tomasz Jacykiewicz 1 Dr. Fabien Ringeval 2 JANUARY, 2014 DEPARTMENT OF INFORMATICS - MASTER PROJECT REPORT Département d

More information

hit), and assume that longer incidental sounds (forest noise, water, wind noise) resemble a Gaussian noise distribution.

hit), and assume that longer incidental sounds (forest noise, water, wind noise) resemble a Gaussian noise distribution. CS 229 FINAL PROJECT A SOUNDHOUND FOR THE SOUNDS OF HOUNDS WEAKLY SUPERVISED MODELING OF ANIMAL SOUNDS ROBERT COLCORD, ETHAN GELLER, MATTHEW HORTON Abstract: We propose a hybrid approach to generating

More information

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval DAY 1 Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval Jay LeBoeuf Imagine Research jay{at}imagine-research.com Rebecca

More information

A NOVEL CEPSTRAL REPRESENTATION FOR TIMBRE MODELING OF SOUND SOURCES IN POLYPHONIC MIXTURES

A NOVEL CEPSTRAL REPRESENTATION FOR TIMBRE MODELING OF SOUND SOURCES IN POLYPHONIC MIXTURES A NOVEL CEPSTRAL REPRESENTATION FOR TIMBRE MODELING OF SOUND SOURCES IN POLYPHONIC MIXTURES Zhiyao Duan 1, Bryan Pardo 2, Laurent Daudet 3 1 Department of Electrical and Computer Engineering, University

More information

Deep Neural Networks Scanning for patterns (aka convolutional networks) Bhiksha Raj

Deep Neural Networks Scanning for patterns (aka convolutional networks) Bhiksha Raj Deep Neural Networks Scanning for patterns (aka convolutional networks) Bhiksha Raj 1 Story so far MLPs are universal function approximators Boolean functions, classifiers, and regressions MLPs can be

More information

Phone-based Plosive Detection

Phone-based Plosive Detection Phone-based Plosive Detection 1 Andreas Madsack, Grzegorz Dogil, Stefan Uhlich, Yugu Zeng and Bin Yang Abstract We compare two segmentation approaches to plosive detection: One aproach is using a uniform

More information

Classification of Musical Instruments sounds by Using MFCC and Timbral Audio Descriptors

Classification of Musical Instruments sounds by Using MFCC and Timbral Audio Descriptors Classification of Musical Instruments sounds by Using MFCC and Timbral Audio Descriptors Priyanka S. Jadhav M.E. (Computer Engineering) G. H. Raisoni College of Engg. & Mgmt. Wagholi, Pune, India E-mail:

More information

WE ADDRESS the development of a novel computational

WE ADDRESS the development of a novel computational IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 3, MARCH 2010 663 Dynamic Spectral Envelope Modeling for Timbre Analysis of Musical Instrument Sounds Juan José Burred, Member,

More information

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 04, April -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 MUSICAL

More information

GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS. Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1)

GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS. Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1) GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1) (1) Stanford University (2) National Research and Simulation Center, Rafael Ltd. 0 MICROPHONE

More information

A Large Scale Experiment for Mood-Based Classification of TV Programmes

A Large Scale Experiment for Mood-Based Classification of TV Programmes 2012 IEEE International Conference on Multimedia and Expo A Large Scale Experiment for Mood-Based Classification of TV Programmes Jana Eggink BBC R&D 56 Wood Lane London, W12 7SB, UK jana.eggink@bbc.co.uk

More information

INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION

INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION ULAŞ BAĞCI AND ENGIN ERZIN arxiv:0907.3220v1 [cs.sd] 18 Jul 2009 ABSTRACT. Music genre classification is an essential tool for

More information

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University A Pseudo-Statistical Approach to Commercial Boundary Detection........ Prasanna V Rangarajan Dept of Electrical Engineering Columbia University pvr2001@columbia.edu 1. Introduction Searching and browsing

More information

Story Tracking in Video News Broadcasts. Ph.D. Dissertation Jedrzej Miadowicz June 4, 2004

Story Tracking in Video News Broadcasts. Ph.D. Dissertation Jedrzej Miadowicz June 4, 2004 Story Tracking in Video News Broadcasts Ph.D. Dissertation Jedrzej Miadowicz June 4, 2004 Acknowledgements Motivation Modern world is awash in information Coming from multiple sources Around the clock

More information

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY Eugene Mikyung Kim Department of Music Technology, Korea National University of Arts eugene@u.northwestern.edu ABSTRACT

More information

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed,

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed, VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS O. Javed, S. Khan, Z. Rasheed, M.Shah {ojaved, khan, zrasheed, shah}@cs.ucf.edu Computer Vision Lab School of Electrical Engineering and Computer

More information

Singer Recognition and Modeling Singer Error

Singer Recognition and Modeling Singer Error Singer Recognition and Modeling Singer Error Johan Ismael Stanford University jismael@stanford.edu Nicholas McGee Stanford University ndmcgee@stanford.edu 1. Abstract We propose a system for recognizing

More information

Semi-supervised Musical Instrument Recognition

Semi-supervised Musical Instrument Recognition Semi-supervised Musical Instrument Recognition Master s Thesis Presentation Aleksandr Diment 1 1 Tampere niversity of Technology, Finland Supervisors: Adj.Prof. Tuomas Virtanen, MSc Toni Heittola 17 May

More information

Recognising Cello Performers Using Timbre Models

Recognising Cello Performers Using Timbre Models Recognising Cello Performers Using Timbre Models Magdalena Chudy and Simon Dixon Abstract In this paper, we compare timbre features of various cello performers playing the same instrument in solo cello

More information

Topic 10. Multi-pitch Analysis

Topic 10. Multi-pitch Analysis Topic 10 Multi-pitch Analysis What is pitch? Common elements of music are pitch, rhythm, dynamics, and the sonic qualities of timbre and texture. An auditory perceptual attribute in terms of which sounds

More information

Composer Style Attribution

Composer Style Attribution Composer Style Attribution Jacqueline Speiser, Vishesh Gupta Introduction Josquin des Prez (1450 1521) is one of the most famous composers of the Renaissance. Despite his fame, there exists a significant

More information

A Framework for Segmentation of Interview Videos

A Framework for Segmentation of Interview Videos A Framework for Segmentation of Interview Videos Omar Javed, Sohaib Khan, Zeeshan Rasheed, Mubarak Shah Computer Vision Lab School of Electrical Engineering and Computer Science University of Central Florida

More information

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Aric Bartle (abartle@stanford.edu) December 14, 2012 1 Background The field of composer recognition has

More information

Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio

Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio Jana Eggink and Guy J. Brown Department of Computer Science, University of Sheffield Regent Court, 11

More information

A STATISTICAL VIEW ON THE EXPRESSIVE TIMING OF PIANO ROLLED CHORDS

A STATISTICAL VIEW ON THE EXPRESSIVE TIMING OF PIANO ROLLED CHORDS A STATISTICAL VIEW ON THE EXPRESSIVE TIMING OF PIANO ROLLED CHORDS Mutian Fu 1 Guangyu Xia 2 Roger Dannenberg 2 Larry Wasserman 2 1 School of Music, Carnegie Mellon University, USA 2 School of Computer

More information

Audio-Based Video Editing with Two-Channel Microphone

Audio-Based Video Editing with Two-Channel Microphone Audio-Based Video Editing with Two-Channel Microphone Tetsuya Takiguchi Organization of Advanced Science and Technology Kobe University, Japan takigu@kobe-u.ac.jp Yasuo Ariki Organization of Advanced Science

More information

Topics in Computer Music Instrument Identification. Ioanna Karydi

Topics in Computer Music Instrument Identification. Ioanna Karydi Topics in Computer Music Instrument Identification Ioanna Karydi Presentation overview What is instrument identification? Sound attributes & Timbre Human performance The ideal algorithm Selected approaches

More information

Transcription of the Singing Melody in Polyphonic Music

Transcription of the Singing Melody in Polyphonic Music Transcription of the Singing Melody in Polyphonic Music Matti Ryynänen and Anssi Klapuri Institute of Signal Processing, Tampere University Of Technology P.O.Box 553, FI-33101 Tampere, Finland {matti.ryynanen,

More information

MOVIES constitute a large sector of the entertainment

MOVIES constitute a large sector of the entertainment 1618 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 18, NO. 11, NOVEMBER 2008 Audio-Assisted Movie Dialogue Detection Margarita Kotti, Dimitrios Ververidis, Georgios Evangelopoulos,

More information

Analysis, Synthesis, and Perception of Musical Sounds

Analysis, Synthesis, and Perception of Musical Sounds Analysis, Synthesis, and Perception of Musical Sounds The Sound of Music James W. Beauchamp Editor University of Illinois at Urbana, USA 4y Springer Contents Preface Acknowledgments vii xv 1. Analysis

More information

DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS

DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS Item Type text; Proceedings Authors Habibi, A. Publisher International Foundation for Telemetering Journal International Telemetering Conference Proceedings

More information

A CLASSIFICATION-BASED POLYPHONIC PIANO TRANSCRIPTION APPROACH USING LEARNED FEATURE REPRESENTATIONS

A CLASSIFICATION-BASED POLYPHONIC PIANO TRANSCRIPTION APPROACH USING LEARNED FEATURE REPRESENTATIONS 12th International Society for Music Information Retrieval Conference (ISMIR 2011) A CLASSIFICATION-BASED POLYPHONIC PIANO TRANSCRIPTION APPROACH USING LEARNED FEATURE REPRESENTATIONS Juhan Nam Stanford

More information

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM A QUER B EAMPLE MUSIC RETRIEVAL ALGORITHM H. HARB AND L. CHEN Maths-Info department, Ecole Centrale de Lyon. 36, av. Guy de Collongue, 69134, Ecully, France, EUROPE E-mail: {hadi.harb, liming.chen}@ec-lyon.fr

More information

Time Series Models for Semantic Music Annotation Emanuele Coviello, Antoni B. Chan, and Gert Lanckriet

Time Series Models for Semantic Music Annotation Emanuele Coviello, Antoni B. Chan, and Gert Lanckriet IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 5, JULY 2011 1343 Time Series Models for Semantic Music Annotation Emanuele Coviello, Antoni B. Chan, and Gert Lanckriet Abstract

More information

UC San Diego UC San Diego Previously Published Works

UC San Diego UC San Diego Previously Published Works UC San Diego UC San Diego Previously Published Works Title Classification of MPEG-2 Transport Stream Packet Loss Visibility Permalink https://escholarship.org/uc/item/9wk791h Authors Shin, J Cosman, P

More information

Speech and Speaker Recognition for the Command of an Industrial Robot

Speech and Speaker Recognition for the Command of an Industrial Robot Speech and Speaker Recognition for the Command of an Industrial Robot CLAUDIA MOISA*, HELGA SILAGHI*, ANDREI SILAGHI** *Dept. of Electric Drives and Automation University of Oradea University Street, nr.

More information

Recognising Cello Performers using Timbre Models

Recognising Cello Performers using Timbre Models Recognising Cello Performers using Timbre Models Chudy, Magdalena; Dixon, Simon For additional information about this publication click this link. http://qmro.qmul.ac.uk/jspui/handle/123456789/5013 Information

More information

Digital Video Telemetry System

Digital Video Telemetry System Digital Video Telemetry System Item Type text; Proceedings Authors Thom, Gary A.; Snyder, Edwin Publisher International Foundation for Telemetering Journal International Telemetering Conference Proceedings

More information

Computational Models of Music Similarity. Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST)

Computational Models of Music Similarity. Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST) Computational Models of Music Similarity 1 Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST) Abstract The perceived similarity of two pieces of music is multi-dimensional,

More information

Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis

Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis Fengyan Wu fengyanyy@163.com Shutao Sun stsun@cuc.edu.cn Weiyao Xue Wyxue_std@163.com Abstract Automatic extraction of

More information

arxiv: v1 [cs.sd] 8 Jun 2016

arxiv: v1 [cs.sd] 8 Jun 2016 Symbolic Music Data Version 1. arxiv:1.5v1 [cs.sd] 8 Jun 1 Christian Walder CSIRO Data1 7 London Circuit, Canberra,, Australia. christian.walder@data1.csiro.au June 9, 1 Abstract In this document, we introduce

More information

BitWise (V2.1 and later) includes features for determining AP240 settings and measuring the Single Ion Area.

BitWise (V2.1 and later) includes features for determining AP240 settings and measuring the Single Ion Area. BitWise. Instructions for New Features in ToF-AMS DAQ V2.1 Prepared by Joel Kimmel University of Colorado at Boulder & Aerodyne Research Inc. Last Revised 15-Jun-07 BitWise (V2.1 and later) includes features

More information

Browsing News and Talk Video on a Consumer Electronics Platform Using Face Detection

Browsing News and Talk Video on a Consumer Electronics Platform Using Face Detection Browsing News and Talk Video on a Consumer Electronics Platform Using Face Detection Kadir A. Peker, Ajay Divakaran, Tom Lanning Mitsubishi Electric Research Laboratories, Cambridge, MA, USA {peker,ajayd,}@merl.com

More information

THE importance of music content analysis for musical

THE importance of music content analysis for musical IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 1, JANUARY 2007 333 Drum Sound Recognition for Polyphonic Audio Signals by Adaptation and Matching of Spectrogram Templates With

More information

Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility

Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility Karim M. Ibrahim (M.Sc.,Nile University, Cairo, 2016) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE DEPARTMENT

More information

The MAHNOB Laughter Database. Stavros Petridis, Brais Martinez, Maja Pantic

The MAHNOB Laughter Database. Stavros Petridis, Brais Martinez, Maja Pantic Accepted Manuscript The MAHNOB Laughter Database Stavros Petridis, Brais Martinez, Maja Pantic PII: S0262-8856(12)00146-1 DOI: doi: 10.1016/j.imavis.2012.08.014 Reference: IMAVIS 3193 To appear in: Image

More information

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Kazuyoshi Yoshii, Masataka Goto and Hiroshi G. Okuno Department of Intelligence Science and Technology National

More information

Pitch. The perceptual correlate of frequency: the perceptual dimension along which sounds can be ordered from low to high.

Pitch. The perceptual correlate of frequency: the perceptual dimension along which sounds can be ordered from low to high. Pitch The perceptual correlate of frequency: the perceptual dimension along which sounds can be ordered from low to high. 1 The bottom line Pitch perception involves the integration of spectral (place)

More information

Automatic Music Similarity Assessment and Recommendation. A Thesis. Submitted to the Faculty. Drexel University. Donald Shaul Williamson

Automatic Music Similarity Assessment and Recommendation. A Thesis. Submitted to the Faculty. Drexel University. Donald Shaul Williamson Automatic Music Similarity Assessment and Recommendation A Thesis Submitted to the Faculty of Drexel University by Donald Shaul Williamson in partial fulfillment of the requirements for the degree of Master

More information

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG?

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? NICHOLAS BORG AND GEORGE HOKKANEN Abstract. The possibility of a hit song prediction algorithm is both academically interesting and industry motivated.

More information

Figure 1: Feature Vector Sequence Generator block diagram.

Figure 1: Feature Vector Sequence Generator block diagram. 1 Introduction Figure 1: Feature Vector Sequence Generator block diagram. We propose designing a simple isolated word speech recognition system in Verilog. Our design is naturally divided into two modules.

More information

Automatic Labelling of tabla signals

Automatic Labelling of tabla signals ISMIR 2003 Oct. 27th 30th 2003 Baltimore (USA) Automatic Labelling of tabla signals Olivier K. GILLET, Gaël RICHARD Introduction Exponential growth of available digital information need for Indexing and

More information

Music Genre Classification

Music Genre Classification Music Genre Classification chunya25 Fall 2017 1 Introduction A genre is defined as a category of artistic composition, characterized by similarities in form, style, or subject matter. [1] Some researchers

More information

Singer Traits Identification using Deep Neural Network

Singer Traits Identification using Deep Neural Network Singer Traits Identification using Deep Neural Network Zhengshan Shi Center for Computer Research in Music and Acoustics Stanford University kittyshi@stanford.edu Abstract The author investigates automatic

More information

A Parametric Autoregressive Model for the Extraction of Electric Network Frequency Fluctuations in Audio Forensic Authentication

A Parametric Autoregressive Model for the Extraction of Electric Network Frequency Fluctuations in Audio Forensic Authentication Proceedings of the 3 rd International Conference on Control, Dynamic Systems, and Robotics (CDSR 16) Ottawa, Canada May 9 10, 2016 Paper No. 110 DOI: 10.11159/cdsr16.110 A Parametric Autoregressive Model

More information

Singing voice synthesis based on deep neural networks

Singing voice synthesis based on deep neural networks INTERSPEECH 2016 September 8 12, 2016, San Francisco, USA Singing voice synthesis based on deep neural networks Masanari Nishimura, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda

More information

Experiments on musical instrument separation using multiplecause

Experiments on musical instrument separation using multiplecause Experiments on musical instrument separation using multiplecause models J Klingseisen and M D Plumbley* Department of Electronic Engineering King's College London * - Corresponding Author - mark.plumbley@kcl.ac.uk

More information

Music Emotion Recognition. Jaesung Lee. Chung-Ang University

Music Emotion Recognition. Jaesung Lee. Chung-Ang University Music Emotion Recognition Jaesung Lee Chung-Ang University Introduction Searching Music in Music Information Retrieval Some information about target music is available Query by Text: Title, Artist, or

More information

Neural Network for Music Instrument Identi cation

Neural Network for Music Instrument Identi cation Neural Network for Music Instrument Identi cation Zhiwen Zhang(MSE), Hanze Tu(CCRMA), Yuan Li(CCRMA) SUN ID: zhiwen, hanze, yuanli92 Abstract - In the context of music, instrument identi cation would contribute

More information

Lyrics Classification using Naive Bayes

Lyrics Classification using Naive Bayes Lyrics Classification using Naive Bayes Dalibor Bužić *, Jasminka Dobša ** * College for Information Technologies, Klaićeva 7, Zagreb, Croatia ** Faculty of Organization and Informatics, Pavlinska 2, Varaždin,

More information

ABSOLUTE OR RELATIVE? A NEW APPROACH TO BUILDING FEATURE VECTORS FOR EMOTION TRACKING IN MUSIC

ABSOLUTE OR RELATIVE? A NEW APPROACH TO BUILDING FEATURE VECTORS FOR EMOTION TRACKING IN MUSIC ABSOLUTE OR RELATIVE? A NEW APPROACH TO BUILDING FEATURE VECTORS FOR EMOTION TRACKING IN MUSIC Vaiva Imbrasaitė, Peter Robinson Computer Laboratory, University of Cambridge, UK Vaiva.Imbrasaite@cl.cam.ac.uk

More information

1 Introduction to PSQM

1 Introduction to PSQM A Technical White Paper on Sage s PSQM Test Renshou Dai August 7, 2000 1 Introduction to PSQM 1.1 What is PSQM test? PSQM stands for Perceptual Speech Quality Measure. It is an ITU-T P.861 [1] recommended

More information

An Accurate Timbre Model for Musical Instruments and its Application to Classification

An Accurate Timbre Model for Musical Instruments and its Application to Classification An Accurate Timbre Model for Musical Instruments and its Application to Classification Juan José Burred 1,AxelRöbel 2, and Xavier Rodet 2 1 Communication Systems Group, Technical University of Berlin,

More information

About Giovanni De Poli. What is Model. Introduction. di Poli: Methodologies for Expressive Modeling of/for Music Performance

About Giovanni De Poli. What is Model. Introduction. di Poli: Methodologies for Expressive Modeling of/for Music Performance Methodologies for Expressiveness Modeling of and for Music Performance by Giovanni De Poli Center of Computational Sonology, Department of Information Engineering, University of Padova, Padova, Italy About

More information

Toward Multi-Modal Music Emotion Classification

Toward Multi-Modal Music Emotion Classification Toward Multi-Modal Music Emotion Classification Yi-Hsuan Yang 1, Yu-Ching Lin 1, Heng-Tze Cheng 1, I-Bin Liao 2, Yeh-Chin Ho 2, and Homer H. Chen 1 1 National Taiwan University 2 Telecommunication Laboratories,

More information

Effects of acoustic degradations on cover song recognition

Effects of acoustic degradations on cover song recognition Signal Processing in Acoustics: Paper 68 Effects of acoustic degradations on cover song recognition Julien Osmalskyj (a), Jean-Jacques Embrechts (b) (a) University of Liège, Belgium, josmalsky@ulg.ac.be

More information

Analysis of the effects of signal distance on spectrograms

Analysis of the effects of signal distance on spectrograms 2014 Analysis of the effects of signal distance on spectrograms SGHA 8/19/2014 Contents Introduction... 3 Scope... 3 Data Comparisons... 5 Results... 10 Recommendations... 10 References... 11 Introduction

More information

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University Week 14 Query-by-Humming and Music Fingerprinting Roger B. Dannenberg Professor of Computer Science, Art and Music Overview n Melody-Based Retrieval n Audio-Score Alignment n Music Fingerprinting 2 Metadata-based

More information

HUMANS have a remarkable ability to recognize objects

HUMANS have a remarkable ability to recognize objects IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 9, SEPTEMBER 2013 1805 Musical Instrument Recognition in Polyphonic Audio Using Missing Feature Approach Dimitrios Giannoulis,

More information

Lecture 9 Source Separation

Lecture 9 Source Separation 10420CS 573100 音樂資訊檢索 Music Information Retrieval Lecture 9 Source Separation Yi-Hsuan Yang Ph.D. http://www.citi.sinica.edu.tw/pages/yang/ yang@citi.sinica.edu.tw Music & Audio Computing Lab, Research

More information

Music Radar: A Web-based Query by Humming System

Music Radar: A Web-based Query by Humming System Music Radar: A Web-based Query by Humming System Lianjie Cao, Peng Hao, Chunmeng Zhou Computer Science Department, Purdue University, 305 N. University Street West Lafayette, IN 47907-2107 {cao62, pengh,

More information

Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng

Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng Introduction In this project we were interested in extracting the melody from generic audio files. Due to the

More information