Improving Frame Based Automatic Laughter Detection


Mary Knox
EE225D Class Project
knoxm@eecs.berkeley.edu
December 13, 2007

Abstract

Laughter recognition is an underexplored area of research. My goal for this project was to improve upon my previous work to automatically detect laughter on a frame-by-frame basis. My previous system (baseline system) detected laughter based on short-term features including MFCCs, pitch, and energy. In this project, I have explored the utility of additional features (phone and prosodic) both by themselves and in combination with the baseline system. I improved the baseline system by 0.1% absolute and achieved an equal error rate (EER) of 7.9% for laughter detection on the ICSI Meetings database.

1 Introduction

Audio communication contains a wealth of information in addition to spoken words. Specifically, laughter provides cues regarding the emotional state of the speaker [1], topic changes in the conversation [2], and the speaker's identity. Accurate laughter detection could be useful in a variety of applications. A laughter detector incorporated with a digital camera could be used to identify an opportune time to take a picture [3]. Laughter could be useful in a video search of humorous clips [4]. In speech recognition, identifying laughter could decrease word error rate by identifying non-speech sounds [2]. The overall goal of this study is to use laughter for speaker recognition, as my intuition is that many individuals have their own distinct laugh. To be able to explore the utility of laughter segments for speaker recognition, however, we first need to build a robust system to detect laughter, which is the focus of this paper.

Previous work has studied the acoustics of laughter [5, 6, 7]. Many agree that laughter has a breathy consonant-vowel structure [5, 8]. Some have made generalizations about laughter, such as Provine, who concluded that laughter is usually a series of short syllables repeated approximately every 210 ms [7]. Yet, others have found laughter to be highly variable [8] and thus difficult to stereotype [6]. These conclusions lead me to believe that automatic laughter detection is not a simple task.

The most relevant previous work on automatic laughter detection has been that of Kennedy and Ellis [2], Truong and van Leeuwen [1], and Knox and Mirghafori [9] (my previous work). However, the experimental setups and objectives of these works differ from this study, with the exception of my previous work.

Kennedy and Ellis [2] studied the detection of overlapped (multiple speaker) laughter in the Meetings domain. They split the data into non-overlapping one second segments, which were then classified based on whether or not multiple speakers laughed. They used support vector machines (SVMs) trained on four features: MFCCs, delta MFCCs, modulation spectrum, and spatial cues. They achieved a true positive rate of 87%.

Truong and van Leeuwen [1] classified presegmented ICSI Meetings data as laughter or speech. The segments were determined prior to training and testing their system and had variable time durations. The average durations of laughter and speech segments were 2.21 and 2.02 seconds, respectively. They used Gaussian mixture models trained with perceptual linear prediction (PLP) features, pitch and energy, pitch and voicing, and modulation spectrum. They built models for each of the feature sets. The model trained with PLP features performed the best at 13.4% EER (with each segment, which had varying duration, weighted equally in the EER computation) for a data set similar to the one used in this study.

In my previous work [9], I did laughter detection at the frame level. A neural network was trained on each of the following features: MFCCs, energy (RMS), fundamental frequency (F0), and the largest cross correlation value used to compute the fundamental frequency (AC PEAK). The score level combination of MFCC and AC PEAK features had the lowest EER at 8.0% (with each frame, or time unit, weighted equally in the EER computation) for the same dataset used in this study.

Current state-of-the-art speech recognition systems also detect laughter. However, the threshold used is such that most occurrences of laughter are not identified. For example, SRI's speech recognizer was run on the same data used in this study and achieved a false acceptance rate of 0.1% and a false rejection rate of 78.2% for laughter. Since laughter is a somewhat rare event, occurring only 6.3% of the time in the dataset used in this study, it is important to identify as many laughter segments as possible in order to have data to use in a speaker recognition system. That being the case, the SRI speech recognizer would not be very useful for this task. Furthermore, the systems which used 1 second (or greater) segments or presegmented data would not be able to precisely detect laughter segments. Thus, for this project I expanded upon my previous system, which performed laughter detection for each frame, by including phone and prosodic features.

The outline for this report is as follows: in Section 2 I describe the data used in this study, in Section 3 I further describe my previous work, in Section 4 I explain the current system and results, in Section 5 I discuss the results, and in Section 6 I provide my conclusions and ideas for future work.

2 Data

I trained and tested the detector on the ICSI Meeting Recorder Corpus [10], a hand-transcribed corpus of multi-party meeting recordings, in which each of the speakers was recorded on a close-talking microphone (which is the data used in this study) as well as distant microphones. The full text was transcribed in addition to non-lexical events (including coughs, lip smacks, mic noise, and most importantly, laughter). There were a total of 75 meetings in this corpus.

In order to compare my results to the work done by Kennedy and Ellis [2] and Truong and van Leeuwen [1], I used the same training and testing sets, which were from the Bmr subset of the corpus. This subset contains 29 meetings. The first 26 were used for training and the last 3 were used to test the detector.

I trained and tested only on data which was hand transcribed to be either laughter or non-laughter. Laughter-colored speech, that is, cases in which the hand-transcribed documentation had both speech and laughter listed under a single start and end time, was disregarded since I would not specifically know which time interval(s) contained laughter. Also, if the transcription did not include information for a period of time for a channel, that audio was excluded. This exclusion reduced training and testing on cross-talk and allowed me to train and test on channels only when they were in use. Ideally, an automatic silence detector would be employed in this step instead of relying on the transcripts. As a note, unlike Truong and van Leeuwen, I included audio that contained non-lexical vocalized sounds other than laughter.

Figure 1: Histogram of laugh duration for the Bmr subset of the ICSI Meeting Recorder Corpus.

Figure 1 shows the histogram of the laughter durations. The average laugh duration was 1.615 seconds with a standard deviation of 1.241 seconds.
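To make the frame-level labeling described above concrete, the following Python sketch (a hypothetical helper, assuming the transcript has already been parsed into (start, end, label) tuples per channel; the corpus's actual transcript format is not shown here) converts hand-transcribed segments into per-frame targets at the 10 ms frame rate, leaving laughter-colored speech and untranscribed time excluded:

    import numpy as np

    FRAME_SHIFT = 0.010  # 10 ms frames, matching the feature frame rate

    def frame_labels(segments, num_frames):
        # segments: list of (start_sec, end_sec, label) tuples for one channel, where
        # label is 'laughter', 'non-laughter', or 'laughter-colored speech' (assumed names).
        # Returns per-frame targets: 1 = laughter, 0 = non-laughter, -1 = excluded.
        labels = np.full(num_frames, -1, dtype=np.int8)  # untranscribed time stays excluded
        for start, end, label in segments:
            lo = int(round(start / FRAME_SHIFT))
            hi = min(int(round(end / FRAME_SHIFT)), num_frames)
            if label == 'laughter':
                labels[lo:hi] = 1
            elif label == 'non-laughter':
                labels[lo:hi] = 0
            # laughter-colored speech is left excluded, as described above
        return labels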

3 Previous Work

3.1 Previous System

My previous system trained a neural network with MFCC, pitch, and energy features to detect whether a given frame contained laughter.

3.1.1 Features

Mel Frequency Cepstral Coefficients (MFCCs). MFCCs were used to capture the spectral features of laughter and non-laughter. The first order regression coefficients of the MFCCs (delta MFCCs) and the second order regression coefficients (delta-delta MFCCs) were also computed and used as features for the neural network. I used the first 12 MFCCs as well as the 0th coefficient, which were computed over a 25 ms window with a 10 ms forward shift. MFCC features were extracted using the Hidden Markov Model Toolkit (HTK) [11]. For each frame I computed 13 MFCCs, 13 delta MFCCs, and 13 delta-delta MFCCs.

Pitch and energy. Studies in the acoustics of laughter [5, 6] and in automatic laughter detection [1] investigated the pitch and energy of laughter as potentially important features. I used the ESPS pitch tracker get_f0 [12] to extract the fundamental frequency (F0), local root mean squared energy (RMS), and the highest normalized cross correlation value found to determine F0 (AC PEAK) for each frame. The delta and delta-delta coefficients were computed for each of these features as well.

3.1.2 Learning Method

I did frame-wise laughter detection. Since the frames were short in duration (10 ms) and each laughter segment lasted 1.615 seconds on average in this data set, I decided it would be best to use a large context window of features as inputs to the neural network. A neural network with one hidden layer was trained using QuickNet [13]. The input to the neural network was a window of feature frames, where the center frame was the target frame. I used the softmax activation function to compute the probability that the frame was laughter. To prevent over-fitting, the data used to train the neural network was split into two groups: training (the first 21 Bmr meetings) and cross validation (the last 5 meetings from the original training set). The neural network weights were updated based on the training data via the back-propagation algorithm, and the cross validation data was scored after every training epoch, resulting in the cross validation frame accuracy (CVFA). Training was concluded once the CVFA increased by less than 0.5% for a second time.

3.2 Previous experiments and results

3.2.1 Parameter settings

I first needed to determine the input window size and the number of hidden units in the neural network. Empirically, I found that a context window of 75 consecutive frames (0.75 seconds) worked well. To make the classification of laughter based on the middle frame, I set the offset to 37 frames. In other words, the inputs to the neural network were the features from the frame to be classified and the 37 frames before and after this frame. Figure 2 shows the windowing technique.

Figure 2: For each frame being evaluated, features from a window of 75 frames are input to the neural network.

I also had to determine the number of hidden units. MFCCs were the most valuable features for Kennedy and Ellis [2] and I suspected my system would have similar results. Thus, I used the MFCCs as the input features and modified the number of hidden units while keeping all other parameters the same. Based on the accuracy on the cross validation set, I saw that 200 hidden units performed best. Similarly, I varied the number of hidden units using F0 features. The CVFA was approximately the same for a range of hidden units, but the system with 200 was marginally better than the rest.
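As an illustration of the windowing in Figure 2, here is a minimal sketch of building the 75-frame input for a target frame, assuming the per-frame features are stacked in a NumPy array; how the original system handled frames near the start or end of a recording is not specified, so edge repetition below is my assumption:

    import numpy as np

    OFFSET = 37  # 37 frames on each side of the target frame gives a 75-frame window

    def window_input(features, t):
        # features: (num_frames, feat_dim) NumPy array of per-frame features.
        # Returns the flattened 75-frame context centered on frame t; indices beyond the
        # ends of the recording are filled by repeating the edge frames (my assumption).
        idx = np.clip(np.arange(t - OFFSET, t + OFFSET + 1), 0, len(features) - 1)
        return features[idx].reshape(-1)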

3.2.2 Systems

The neural networks were first separately trained on the four classes of features: MFCCs, F0, RMS, and AC PEAK. The EERs for each of the classes are shown in Table 1. Each column lists the EER for a neural network trained with the feature itself, the deltas, the delta-deltas, and the feature level combination of the feature, delta, and delta-delta (the All system).

Table 1: Equal Error Rates (%).

              MFCCs    F0      RMS     AC PEAK
  Feature     11.35    23.26   32.22   16.75
  Delta        9.62    24.42   26.52   22.37
  Delta-Delta 11.23    27.83   26.62   27.61
  All         10.66    22.80   26.01   16.72

I combined the All systems on the score level to improve the results using another neural network, this time with a smaller window size and fewer hidden units. Since each of the inputs was the probability of laughter for each frame, I shortened the input window size of the combiner neural network from 75 frames to 9 and reduced the number of hidden units to 2. Since the system using MFCC features had the lowest EER, I combined MFCCs with each of the other classes of features. Table 2 shows that after combining, the MFCC+AC PEAK system (baseline system) performed the best. I then combined MFCC, AC PEAK, and RMS features. Finally, I combined all of the systems and computed the EER.

Table 2: Equal Error Rates for Combined Systems (%).

  System                   EER
  MFCC+F0                  9.80
  MFCC+RMS                 8.92
  MFCC+AC PEAK             7.97
  MFCC+AC PEAK+RMS         8.26
  MFCC+AC PEAK+RMS+F0      8.58
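For reference, a frame-weighted EER like those reported in Tables 1 and 2 can be computed from per-frame scores and labels roughly as follows (a sketch using scikit-learn, which is not the tooling used in the original experiments):

    import numpy as np
    from sklearn.metrics import roc_curve

    def equal_error_rate(labels, scores):
        # labels: binary per-frame targets (1 = laughter); scores: per-frame laughter
        # probabilities from the network. Every frame is weighted equally.
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1.0 - tpr
        i = np.argmin(np.abs(fpr - fnr))  # operating point where the two error rates cross
        return 0.5 * (fpr[i] + fnr[i])    # EER; multiply by 100 to report a percentage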

4 Current Work

While the results from the previous work were good, I hoped to improve upon the baseline system by including more features.

4.1 Current System

4.1.1 Additional Features

Phones. Laughter has a repeated consonant-vowel structure [5, 8]. Thus, phone sequences seemed to be a good feature to use to identify laughter. I used the SRI phone recognizer, Decipher [14], in order to extract the phones; however, Decipher annotates nonstandard phones including laughter. Although this was not the original information I intended to extract, it seemed plausible for Decipher's laughter detector to improve the baseline results.

Prosodic. My previous system only included short-term features. However, laughter is different from most speech sounds because it repeats approximately every 210 ms [7]. Since prosodic features are extracted over longer intervals of time, they may distinguish laughter from non-laughter. I used 21 prosodic features, which were statistics (including min, max, mean, standard deviation, etc.) on pitch, energy, long-term average spectrum, and noise-to-harmonic ratio.

4.1.2 Learning Methods

For the phone features, I trained a neural network with a 46 dimensional feature vector (1 dimension for each possible phone) for each frame. Each feature vector had only 1 nonzero value, which corresponded to the phone for that frame. I again used features over a context window as the input to the neural network and trained the neural network using the same training and cross validation sets as before in order to prevent over-fitting.

Figure 3: Example hand-transcription.

Prosodic features were extracted for each segment as documented in the hand-transcriptions. Figure 3 is an example hand-transcription which shows each segment marked with a start and end time. Segmenting based on the transcript guaranteed that non-laughter segments were separated from laughter segments. In the future, it would be better to have an automatic segmenter. However, since this is my first experimentation with prosodic features, I wanted to easily determine which features perform well for the task of laughter detection. Since the prosodic features were computed for the entire segment, I used an SVM to build a model to detect laughter.
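A minimal sketch of the 46-dimensional phone encoding described in Section 4.1.2, assuming the recognizer output has already been mapped to integer phone indices (the phone inventory and its ordering are assumptions here, not details from Decipher):

    import numpy as np

    NUM_PHONES = 46  # one dimension per phone in the recognizer's inventory

    def encode_phones(phone_ids):
        # phone_ids: per-frame integer phone indices (0..45) derived from the recognizer
        # output; the mapping from phone symbols to indices is assumed, not shown.
        onehot = np.zeros((len(phone_ids), NUM_PHONES), dtype=np.float32)
        onehot[np.arange(len(phone_ids)), phone_ids] = 1.0  # single nonzero value per frame
        return onehot

These one-hot frames are then windowed in the same way as the other frame-level features before being fed to the neural network.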

4.2 Current experiments and results

4.2.1 Parameter settings

Many of the parameters used in the neural network trained on phone features were chosen to be compatible with the previous system. For example, I again used a window size of 75 frames. This was done in order to compare score level combinations of the phone and baseline systems with feature level combinations, which will be computed in the future. Based on the CVFA, I chose the number of hidden units to be 9.

For the prosodic features, I set the cost-factor (the amount by which training errors on positive examples outweigh training errors on negative examples) to 10 when building the SVM models. This was done because there were approximately 10 non-laughter segments for every laughter segment.
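The cost-factor above amounts to asymmetric error weighting. A rough equivalent using scikit-learn (not the SVM package used for the original models; the kernel choice is an assumption) weights training errors on laughter segments ten times more heavily:

    from sklearn.svm import SVC

    def train_prosodic_svm(X, y):
        # X: (num_segments, 21) matrix of per-segment prosodic statistics;
        # y: segment labels, 1 = laughter, 0 = non-laughter.
        # Errors on laughter segments are weighted 10x, mirroring the cost-factor of 10;
        # the linear kernel is an assumption, not a detail from the original system.
        model = SVC(kernel='linear', class_weight={0: 1.0, 1: 10.0})
        return model.fit(X, y)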

4.2.2 System Results

I first trained systems on phone features and prosodic features alone. The neural network trained on phone features achieved an EER of 18.45%, as shown in Table 4. Using the training set, SVM models were trained using all of the statistics for each class of prosodic features: pitch, energy, long-term average spectrum (LTAS), and noise-to-harmonic ratio (N2H). The EER was computed using the cross validation set (each segment, which had varying duration, was equally weighted in the EER computation). As shown in Table 3, the energy features performed the best. I then combined the energy features with each of the other classes of features. The system trained with energy, noise-to-harmonic ratio, and pitch features was the best prosodic system. I then computed the EER with each frame weighted equally, which came to 50%; this result is shown in Table 4.

Table 3: Equal Error Rates for Prosodic Features (%), with each segment equally weighted.

  System                      EER
  PITCH                       36.20
  ENERGY                      27.87
  LTAS                        32.75
  N2H                         44.31
  ENERGY+PITCH                27.29
  ENERGY+LTAS                 32.40
  ENERGY+N2H                  21.88
  ENERGY+N2H+PITCH            21.75
  ENERGY+N2H+LTAS             32.34
  ENERGY+N2H+PITCH+LTAS       31.69

I then performed score level combinations of the phone, prosodic, and baseline systems using the same neural network parameters as the previous system's score level combinations. While the score level combination of the baseline system and the phone system improved to 7.9% EER, the combination of the baseline and prosodic systems degraded to 9.8% EER. When all three systems were combined, the EER was 8.75%. As shown in Table 4, the combination of the baseline and phone systems had the best result.

Table 4: Equal Error Rates (%), with each frame equally weighted.

  System                      EER
  PHONES                      18.45
  PROSODIC                    50
  PHONES+PROSODIC             11.30
  BASELINE+PHONES             7.87
  BASELINE+PROSODIC           9.80
  BASELINE+PHONES+PROSODIC    8.75

5 Discussion

From Table 4, it is clear that the prosodic system does not score as well when each frame is weighted equally. It may be beneficial to either change the training method for the prosodic system from the segment level to the frame level, perhaps using a neural network, or modify the method of combining the prosodic system with the other systems to make use of the prosodic system's results when each segment is weighted equally.

Since the prosodic system did not score well on its own, it was not surprising that the resulting systems were worse when it was combined with the baseline system and the baseline+phone system. However, when the prosodic system was combined with the phone system, the EER improved. Despite this improvement, the score was still worse than that of the baseline system.

Taking advantage of Decipher's phone recognizer proved beneficial. The EER for the combined baseline and phone system improved to 7.87%, which is currently the best EER for laughter detection at the frame level.

6 Conclusion and future work

In conclusion, I have improved upon the baseline results by including additional features. The phone system improved the baseline system by 0.1% absolute. Although prosodic features did not improve the baseline system, I think that there are many other prosodic features which could improve laughter detection.

In the future, I plan on scoring feature level combinations of the system using a comparable number of hidden units. Also, many other prosodic features have already been extracted and may be beneficial for this study. After determining which prosodic features work well, I will use a moving window technique to extract the features so that this laughter detector can be run on audio that may not have hand-transcriptions. I also plan on experimenting with using a neural network to train the prosodic system instead of an SVM.
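As a sketch of the moving-window extraction proposed above (the window and hop lengths are illustrative placeholders, not values from this study), per-window statistics could be computed from a frame-level prosodic track as follows:

    import numpy as np

    def moving_window_stats(track, win=100, hop=10):
        # track: 1-D NumPy array holding a frame-level prosodic track (e.g. F0 or RMS).
        # Returns per-window (min, max, mean, std) statistics over a sliding window of
        # `win` frames advanced by `hop` frames; both lengths are illustrative only.
        stats = []
        for start in range(0, max(len(track) - win + 1, 1), hop):
            chunk = track[start:start + win]
            stats.append([chunk.min(), chunk.max(), chunk.mean(), chunk.std()])
        return np.array(stats)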

7 Acknowledgments

I would like to thank Christian Mueller for all of his help in extracting the prosodic features and Nikki Mirghafori for her guidance throughout this study.

References

[1] Truong, K.P. and van Leeuwen, D.A., "Automatic detection of laughter," in Proceedings of Interspeech, Lisbon, Portugal, 2005.

[2] Kennedy, L. and Ellis, D., "Laughter detection in meetings," NIST ICASSP 2004 Meeting Recognition Workshop, Montreal, 2004.

[3] A. Carter, "Automatic acoustic laughter detection," Masters Thesis, Keele University, 2000.

[4] Cai, R., Lu, L., Zhang, H.-J., and Cai, L.-H., "Highlight sound effects detection in audio stream," in Proc. International Conference on Multimedia and Expo, Baltimore, 2003.

[5] C. Bickley and S. Hunnicutt, "Acoustic analysis of laughter," in Proc. ICSLP, pp. 927-930, Banff, Canada, 1992.

[6] Bachorowski, J., Smoski, M., and Owren, M., "The acoustic features of human laughter," Journal of the Acoustical Society of America, pp. 1581-1597, 2001.

[7] R. Provine, "Laughter," American Scientist, January-February 1996.

[8] J. Trouvain, "Segmenting phonetic units in laughter," in Proc. ICPhS, 2003.

[9] M. Knox and N. Mirghafori, "Automatic laughter detection using neural networks," unpublished.

[10] Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Peskin, B., Pfau, T., Shriberg, E., Stolcke, A., and Wooters, C., "The ICSI meeting corpus," ICASSP, Hong Kong, April 2003.

[11] Hidden Markov Model Toolkit (HTK), http://htk.eng.cam.ac.uk/.

[12] Entropic Research Laboratory, Washington, D.C., "ESPS version 5.0 programs manual," August 1993.

[13] QuickNet, http://www.icsi.berkeley.edu/speech/qn.html.

[14] Cohen, M., Murveit, H., Bernstein, J., Price, P., and Weintraub, M., "The DECIPHER speech recognition system," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 77-80, Albuquerque, 1990.