Automatic Laughter Detection


Mary Knox 1803707
knoxm@eecs.berkeley.edu
December 2006

Abstract

We built a system to automatically detect laughter from acoustic features of audio. To implement the system, we trained a neural network on four classes of features: mel frequency cepstral coefficients (MFCCs), fundamental frequency, RMS energy, and ac peak. We used the ICSI [1] Meeting Recorder corpus to train and test our system. The combined system of MFCC and ac peak features performed best, with an equal error rate (EER) [2] of 8.1%.

[1] International Computer Science Institute; Berkeley, CA.
[2] The EER is the rate at which the percentage of misses equals the percentage of false alarms.

1 Introduction

There are many facets to spoken communication other than the words that are used. One of the many examples of this is laughter. Laughter is a powerful cue in human interaction because it provides additional information to the listener: it communicates the emotional state of the speaker. This information can be beneficial in human-machine interaction; for example, it could be used to automatically detect an opportune time for a digital camera to take a picture (Carter, 2000). Laughter can also be utilized in speech processing by identifying jokes or topic changes in meetings, and it can improve speech-to-text accuracy by recognizing non-speech sounds (Kennedy and Ellis, 2004). Furthermore, many people have unique laughs, so hearing a person's laugh helps in identifying them. The goal of this project was to build an automatic laughter detector in order to examine the many uses of laughter in the future.

Previous work has been done both in studying the characteristics of laughter (Bickley and Hunnicutt, 1992) (Bachorowski, Smoski, and Owren, 2001) (Provine, 1996) and in building automatic laughter detectors (Carter, 2000) (Kennedy and Ellis, 2004) (Truong and van Leeuwen, 2005) (Cai, Lu, Zhang, and Cai, 2003). Researchers have found a wide range of results with respect to characterizing laughter. Many agree that laughter has a breathy consonant-vowel structure (Bickley and Hunnicutt, 1992) (Trouvain, 2003). Some researchers go on to make generalizations about laughter; for instance, Provine concluded that laughter is usually a series of short syllables repeated at roughly regular intervals (Provine, 1996).

However, many have found that laughter is highly variable (Trouvain, 2003) and thus difficult to stereotype (Bachorowski, Smoski, and Owren, 2001).

Kennedy and Ellis (2004) ran experiments to detect when multiple people laughed, using a support vector machine (SVM) classifier trained on four types of features: mel frequency cepstral coefficients (MFCCs), delta MFCCs, modulation spectrum, and spatial cues. The data was split into one-second windows, which were classified as multiple-speaker laughter or non-laughter events. They achieved a true positive rate (percentage of laughter that was correctly classified) of 87%. Truong and van Leeuwen (2005) used Gaussian mixture models (GMMs) trained with perceptual linear prediction (PLP) features, pitch and energy, pitch and voicing, and modulation spectrum. Their experiments were run on presegmented laughter and speech segments; determining the start and end time of laughter was not part of the experiment. They built models for each of the four feature sets. The model trained with PLP features performed best, at 13.4% equal error rate (EER) for a data set similar to the one used in our experiments.

The goal of this experiment is to automatically detect the onset and offset of laughter. In order to do so, a neural network with one hidden layer was trained with MFCC and pitch features, which are described in Section 3.2. These features were chosen based on the results of previous experiments. Although Kennedy and Ellis and Truong and van Leeuwen used the modulation spectrum, in both cases this feature did not perform well in comparison to the other features. Our task is slightly different from both Kennedy and Ellis, who used one-second windows, and Truong and van Leeuwen, who tested on presegmented data. In Section 2 we discuss the data that was used in this experiment. Section 3 describes our neural network. Results from the experiment are given in Section 4, and in Section 5 we conclude and discuss our results.

2 Data

We used the ICSI Meeting Recorder Corpus (Janin et al., 2003) to train and test the detector. It is a hand-transcribed corpus of multi-party meeting recordings in which each of the speakers wore a close-talking microphone. Distant microphones were also recorded; however, they were not used in this experiment. The full text was transcribed, as were non-lexical events, including coughs, laughs, lip smacks, etc. There were a total of 75 meetings in this corpus.

In order to compare our results to the work done by Kennedy and Ellis (2004) and Truong and van Leeuwen (2005), we used the same training and testing sets, which were drawn from the Bmr subset of the corpus. This subset contains 29 meetings: the first 26 were used for training and the last 3 were used to test the detector.

In order to clean the data, we only trained and tested on data that was transcribed as pure laughter or pure non-laughter. Cases in which the hand transcription listed both speech and laughter under a single start and stop time were disregarded. Furthermore, if a speaker was silent over a period of time, then their channel at that time was not used. This reduced cross-talk (other speakers appearing on the designated speaker's channel) and allowed us to train and test on channels only when they were in use.
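As a rough illustration of this filtering step, the sketch below shows one way the pure-laughter / pure-non-laughter selection could be expressed. The Segment fields and helper names are hypothetical and are not taken from the ICSI transcript format.

# A rough sketch (not the authors' code) of the "pure laughter / pure non-laughter"
# filtering described above. The Segment fields and helper names are hypothetical;
# the actual ICSI transcripts are distributed in their own transcript format.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Segment:
    channel: str        # close-talking microphone channel
    start: float        # start time in seconds
    end: float          # end time in seconds
    has_speech: bool    # transcript marks speech in this interval
    has_laughter: bool  # transcript marks laughter in this interval

def label(seg: Segment) -> Optional[int]:
    """1 = pure laughter, 0 = pure non-laughter, None = discard."""
    if seg.has_laughter and not seg.has_speech:
        return 1
    if seg.has_speech and not seg.has_laughter:
        return 0
    return None  # speech and laughter under one start/stop time, or unused channel

def clean(segments: List[Segment]) -> List[Tuple[Segment, int]]:
    labeled = [(s, label(s)) for s in segments]
    return [(s, y) for s, y in labeled if y is not None]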

Table 1 has the statistics of the cleaned data. The average laugh duration was 1.61 seconds with a standard deviation of 1.41 seconds. Figure 1 shows the histogram of the laughter durations.

Table 1: Bmr statistics

                                Training Data    Testing Data    All Data
Pure Laughter (seconds):        5865.069         739.707         6604.776
Pure Non-Laughter (seconds):    90954.894        7766.517        98721.411
Percentage Pure Laughter (%):   6.058            8.696           6.271

Figure 1: Histogram of laugh duration.

3 Method

3.1 Neural Network

A neural network [3] is a statistical learning method which consists of an input layer, one or more hidden layers, and an output layer. The intuition behind it is that the output is a nonlinear function of weighted sums of the inputs. A neural network with one hidden layer was used to classify feature vectors as either laughter or non-laughter; a schematic is shown in Figure 2. In this case the input units are the features. The input units are linked to each of the hidden units. Each hidden unit takes a weighted sum of the input units, $b_i = \sum_j W_{i,j} F_j$, where $W_{i,j}$ is the weight associated with feature $F_j$ and hidden unit $A_i$. These weights are determined from training data. An activation function $g$ is applied to the sum $b_i$ to determine the value of $A_i$. Similarly, to compute the output of the neural net, a weighted sum is taken of the hidden units and a softmax activation function is applied to that sum to determine the output values: the posterior probabilities of laughter and non-laughter.

[3] For more information on neural networks, see Russell and Norvig (2003).
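To make the forward computation concrete, here is a minimal sketch of a one-hidden-layer network with a softmax output. The choice of tanh as the activation and the omission of bias terms are simplifications, not details taken from the report, and this is not the ICSI training code.

# Minimal sketch of the forward pass just described: one hidden layer, an
# activation g applied to the weighted sums, and a softmax output giving the
# posteriors of laughter and non-laughter. g=tanh and the missing bias terms
# are simplifying assumptions.
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(F: np.ndarray, W_hidden: np.ndarray, W_out: np.ndarray,
            g=np.tanh) -> np.ndarray:
    """F: input features (n_features,); W_hidden: (n_hidden, n_features);
    W_out: (2, n_hidden). Returns [P(laughter | F), P(non-laughter | F)]."""
    b = W_hidden @ F             # b_i = sum_j W_{i,j} F_j
    A = g(b)                     # hidden unit values A_i = g(b_i)
    return softmax(W_out @ A)    # two outputs that sum to one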

Using the softmax activation function ensures that the two outputs sum to one.

To prevent over-fitting (tailoring the system to the training data such that it performs poorly on test data), the data used to train the neural network was split into two groups: training (the first 21 Bmr meetings) and cross validation (the rest of the original training set). The weights were adjusted on the training meetings via the back-propagation algorithm, which modifies each weight based on the partial derivative of the error with respect to that weight. After each epoch (pass through the training data), the cross validation frame accuracy (CVFA) is evaluated; the CVFA is the ratio of true negatives plus true positives to all cross validation data. Using the default values for training neural networks at ICSI (Ellis, 2000), the learning rate was initially set to 0.008. Once the CVFA increases by less than 0.5% from the previous epoch, the learning rate is halved at the beginning of each subsequent epoch. The next time the CVFA increases by less than 0.5%, training is stopped.

Figure 2: A neural network with n input units, 200 hidden units, and 2 output units.
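The learning-rate schedule and stopping rule just described can be sketched as follows; train_one_epoch and compute_cvfa are caller-supplied placeholders rather than functions from any particular toolkit.

# Hedged sketch of the schedule described above: start at 0.008, halve the
# learning rate once the CVFA improves by less than 0.5% over an epoch, and
# stop the next time the improvement is again below 0.5%.
def train(train_one_epoch, compute_cvfa, lr=0.008, threshold=0.5):
    """train_one_epoch(lr): one back-propagation pass over the training meetings.
    compute_cvfa(): cross validation frame accuracy, in percent."""
    prev_cvfa = 0.0
    ramping_down = False
    while True:
        train_one_epoch(lr)
        cvfa = compute_cvfa()
        small_gain = (cvfa - prev_cvfa) < threshold
        if small_gain and ramping_down:
            break                    # second sub-threshold improvement: stop
        if small_gain:
            ramping_down = True      # start halving from the next epoch on
        if ramping_down:
            lr /= 2.0
        prev_cvfa = cvfa
    return cvfa                      # final cross validation frame accuracy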

3.2 Features

Feature selection is an essential part of machine learning. In this case, we used MFCC and pitch features to distinguish laughter from non-laughter.

3.2.1 Mel Frequency Cepstral Coefficients

Mel frequency cepstral coefficients (MFCCs) are obtained by taking the Fourier transform of a signal, mapping the resulting spectrum onto the mel scale, and taking the discrete cosine transform of the log of the mel-scaled spectrum (Wikipedia, 2006). The mel scale is a perceptual scale intended to reflect more accurately what humans hear. In this experiment, MFCCs were used to capture the spectral characteristics of laughter and non-laughter. The first order regression coefficients of the MFCCs (delta MFCCs) and the second order regression coefficients (delta-delta MFCCs) were also computed and used as features for the neural network. We used the first 12 MFCCs (including the energy component), computed over a 25 ms window with a 10 ms forward shift, as inputs to the neural network. MFCC features were extracted using the Hidden Markov Model Toolkit (HTK, 2006).

3.2.2 Pitch

From the laughter characterization literature, we concluded that pitch features differ between laughter and speech. Previous studies showed that the fundamental frequency (F0) contour of laughter does not decline as it typically does in speech (Bickley and Hunnicutt, 1992) and that F0 has a large range during laughter (Bachorowski, Smoski, and Owren, 2001). Using the ESPS pitch tracker get_f0 (Entropic Research Laboratory, 1993), we extracted the F0, RMS value, and ac peak (the highest normalized cross correlation value found while determining F0) for each frame. The delta and delta-delta coefficients were computed for each of these features as well.

4 Experiments and Results

Before running our system, we needed to tune two parameters: the input window size and the number of hidden units. Since individual frames are very short and, in this data set, people laughed for a consecutive 1.61 seconds at a time on average, we fed a window of multiple consecutive frames into the neural net to determine whether or not a person is laughing. After trying 0.25, 0.50, 0.75, and 1 second windows (equivalent to 25, 50, 75, and 100 frames, respectively), we found that a window of 0.75 seconds worked best, based on the CVFA. We wanted the classification of laughter to be based on the middle frame, so we set the offset to be 37 frames.

We also had to determine the number of hidden units. Since MFCCs were the most valuable features for Kennedy and Ellis (2004), we used MFCCs as the input units and varied the number of hidden units over 20, 50, 200, and 300 while keeping all other parameters the same. Based on the accuracy on the cross validation set, 200 hidden units performed best.

The neural networks were first trained separately on the four classes of features: MFCCs, F0, RMS energy, and ac peak. The detection error trade-off (DET) curves for each of the classes are shown in Figures 3, 4, 5, and 6. Each plot shows the DET curve for a neural network trained on the feature itself (blue line), its deltas (green line), its delta-deltas (cyan line), and the combination of the feature, delta, and delta-delta (red line). The EERs for each system are shown in Table 2. Since the "All" systems, which combine the feature, delta, and delta-delta, usually performed better than the individual systems, we used them when combining feature classes. Since the system using MFCC features had the lowest EER, we combined MFCCs with each of the other classes of features. Table 3 shows that, after combining, the MFCC+AC Peak system performed the best. We then combined the MFCC, ac peak, and RMS energy features. Finally, we combined all of the systems and computed the EER.
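To make the feature pipeline concrete, the sketch below shows MFCCs with delta and delta-delta coefficients and a centered window of consecutive frames stacked into a single network input. The report used HTK and the ESPS get_f0 tracker; librosa is used here only as a stand-in, and the 25 ms / 10 ms framing, the 12 coefficients, and the 75-frame context are reconstructions that should be treated as assumptions.

# Illustrative sketch only: librosa stands in for HTK/ESPS purely to show the
# shape of the pipeline (MFCCs plus deltas, then a centered frame window).
import numpy as np
import librosa

def mfcc_with_deltas(path: str, sr: int = 16000, n_mfcc: int = 12) -> np.ndarray:
    y, sr = librosa.load(path, sr=sr)
    win, hop = int(0.025 * sr), int(0.010 * sr)          # 25 ms window, 10 ms shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=win, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)                     # delta MFCCs
    d2 = librosa.feature.delta(mfcc, order=2)            # delta-delta MFCCs
    return np.vstack([mfcc, d1, d2]).T                   # shape: (n_frames, 3 * n_mfcc)

def context_window(frames: np.ndarray, center: int, width: int = 75) -> np.ndarray:
    """Stack `width` consecutive frames centered on `center` into one input vector.
    Frames near the file edges would need padding; that is omitted here."""
    half = width // 2
    return frames[center - half:center - half + width].reshape(-1)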

Figure 3: DET curve for MFCC features (miss probability vs. false alarm probability, in %).

Figure 4: DET curve for ac peak features.

Figure 5: DET curve for F0 features.

Figure 6: DET curve for RMS features.
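The EER reported throughout can be read off a threshold sweep like the one behind these DET curves; the sketch below shows one way to compute it from per-frame scores and labels, and is not the report's evaluation code.

# Hedged sketch: sweep the decision threshold and find the point where the miss
# rate equals the false alarm rate. `scores` are laughter posteriors from a
# network; `labels` are 1 for laughter frames and 0 otherwise.
import numpy as np

def equal_error_rate(scores, labels) -> float:
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_miss, best_fa = 1.0, 1.0
    for t in np.unique(scores):
        decide_laughter = scores >= t
        miss = np.mean(~decide_laughter[labels == 1])    # laughter frames rejected
        fa = np.mean(decide_laughter[labels == 0])       # non-laughter frames accepted
        if abs(miss - fa) < abs(best_miss - best_fa):
            best_miss, best_fa = miss, fa
    return 100.0 * (best_miss + best_fa) / 2.0           # EER in percent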

Figure 7: DET curve for the combined systems (MFCC+AC, MFCC+RMS, MFCC+F0, MFCC+AC+RMS, MFCC+AC+RMS+F0).

Table 2: Equal error rates (%) for the individual feature systems

              MFCCs    F0      RMS     AC Peak
Feature       11.3     3.6     3.      16.7
Delta         9.6      4.4     6.      .37
Delta-Delta   11.3     7.83    6.6     7.61
All           .66      .80     6.01    16.7

Table 3: Equal error rates for the combined systems (%)

System                   EER
MFCC+F0                  9.38
MFCC+RMS                 8.9
MFCC+AC Peak             8.1
MFCC+AC Peak+RMS         8.19
MFCC+AC Peak+RMS+F0      8.80
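Table 3 lists EERs for combined systems, but the report does not spell out how the per-feature systems were fused. As one plausible, assumed approach, the sketch below averages the per-frame laughter posteriors of the individual networks before thresholding.

# Assumed fusion method, not taken from the report: frame-by-frame averaging of
# the laughter posteriors produced by the individual feature systems.
import numpy as np

def combine_posteriors(*system_scores: np.ndarray) -> np.ndarray:
    """Each argument: per-frame laughter posteriors from one feature system."""
    return np.mean(np.vstack(system_scores), axis=0)

# Example (hypothetical variable names):
# combined = combine_posteriors(mfcc_scores, ac_peak_scores)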

5 Discussion and Conclusions

From Table 2, it is clear that MFCC features outperformed the pitch-related features. This is consistent with the results of Kennedy and Ellis (2004). For Truong and van Leeuwen (2005), PLP features outperformed the other features; PLP features, like MFCCs, are perceptually scaled spectra, so it is not surprising that they also performed well for laughter detection. AC peak values also performed well, which suggests that the cross correlation of an audio signal helps in detecting laughter. This seems reasonable since laughter is repetitive (Provine, 1996), although the repetition interval Provine reports is longer than the span over which the cross correlation is computed.

Comparing the combined systems, we see that the combination of MFCC and ac peak features performed best. This shows that by combining different systems we can gain more information about the data and achieve better EERs. However, when we added the RMS energy and fundamental frequency features, the EER increased. A possible reason for this increase is that the ac peak, RMS energy, and fundamental frequency are closely related to one another, so adding these features did not contribute new information about the data and instead only cluttered the system with more inputs.

In the future we plan to use additional features to detect laughter. Trouvain noted the repetition of a consonant-vowel syllable structure (Trouvain, 2003). Based on this model, we could run a phoneme recognizer on the audio and detect when a phoneme is repeated. Another approach would be to compute the modulation spectrum, which should also capture the repetitive nature of laughter. Although previous experiments have shown it to perform worse than other features such as MFCCs and pitch, it may improve the EER when combined with other features.

In conclusion, we have shown that neural networks can be used to automatically detect the onset and offset of laughter with an EER of 8.1%. These results are slightly better than previous experiments, which classified presegmented data. By adding more features, we hope to further improve these results.

6 Acknowledgements

We would like to thank Nikki, Chuck, Howard, Lara, Khiet, Mathew, and Galen for their help with this experiment.

References

HTK (2006). Hidden Markov Model Toolkit (HTK). http://htk.eng.cam.ac.uk/.

Wikipedia (2006). Mel frequency cepstral coefficient. http://en.wikipedia.org/wiki/Mel_Frequency_Cepstral_Coefficients.

Bachorowski, J., Smoski, M., and Owren, M. (2001). The acoustic features of human laughter. Journal of the Acoustical Society of America, pages 1581-1597.

Bickley, C. and Hunnicutt, S. (1992). Acoustic analysis of laughter. In Proc. ICSLP, pages 927-930.

Cai, R., Lu, L., Zhang, H., and Cai, L. (2003). Highlight sound effects detection in audio stream. In Proc. International Conference on Multimedia and Expo.

Carter, A. (2000). Automatic acoustic laughter detection. Masters Thesis, Keele University.

Ellis, D. (2000). ICSI Speech FAQ: How are neural nets trained? http://www.icsi.berkeley.edu/speech/faq/nn-train.html.

Entropic Research Laboratory (1993). ESPS Version 5.0 Programs Manual.

Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Peskin, B., Pfau, T., Shriberg, E., Stolcke, A., and Wooters, C. (2003). The ICSI meeting corpus. In Proc. ICASSP.

Kennedy, L. and Ellis, D. (2004). Laughter detection in meetings. In NIST ICASSP 2004 Meeting Recognition Workshop.

Provine, R. (1996). Laughter. American Scientist, January-February 1996.

Russell, S. and Norvig, P. (2003). Artificial Intelligence: A Modern Approach. Prentice Hall, New Jersey.

Trouvain, J. (2003). Segmenting phonetic units in laughter. In Proc. ICPhS.

Truong, K. P. and van Leeuwen, D. A. (2005). Automatic detection of laughter. In Proceedings of Interspeech.