Automatic Laughter Detection


Mary Knox
Final Project (EECS 294)
knoxm@eecs.berkeley.edu
December 1, 2006

1 Introduction

Laughter is a powerful cue in communication: it conveys the emotional state of the speaker to listeners. Emotion detection is beneficial in human-machine interaction; for example, it could be used to automatically detect an opportune moment for a digital camera to take a picture [1]. Laughter can also be exploited in speech processing, both to identify jokes or topic changes in meetings and to improve speech-to-text accuracy by recognizing non-speech sounds [2]. Furthermore, many people have distinct laughs, so hearing a person's laugh aids in the auditory recognition of that person. Before these uses of laughter can be explored, an automatic laughter detector is needed, and building one is the goal of this project.

Previous work has been done both in characterizing laughter [3] [4] [5] and in building automatic laughter detectors [1] [2] [6] [7]. Bachorowski [4] found that laughter is highly variable and difficult to stereotype; compared to speech, laughter had more source-related variability. Provine concluded that laughter is usually a series of short syllables repeated approximately every 210 ms [5]. In order to detect when multiple people laughed, Kennedy and Ellis [2] used a support vector machine (SVM) classifier trained on mel frequency cepstral coefficients (MFCCs), delta MFCCs, modulation spectrum, and spatial cues. The data was split into one-second windows, which were classified as multiple-speaker laughter or non-laughter events. They achieved a true positive rate of 87% and a false positive rate of 13%. Truong and van Leeuwen [6] used Gaussian mixture models (GMMs) trained with perceptual linear prediction (PLP) features, pitch and energy, pitch and voicing, and modulation spectrum. Their experiments were run on presegmented laughter and speech segments; determining the start and end times of laughter was not part of the experiment. They built a model for each of the four feature sets. The model trained with PLP features performed best, at 13.4% EER on a data set similar to the one used in our experiments.

The goal of this experiment is to automatically detect the onset and offset of laughter. To do so, a neural network with one hidden layer was trained with MFCC and pitch features, which are described in Section 3.2. This task is slightly different from both Kennedy and Ellis, who used one-second windows, and Truong and van Leeuwen, who tested on presegmented data. In Section 2, we discuss the data used in this experiment. Section 3 describes our neural network. Results from the experiment are given in Section 4, and in Section 5 we conclude and discuss our results.

2 Data

We used the ICSI (International Computer Science Institute, Berkeley, CA) Meeting Recorder Corpus [8] to train and test the detector. It is a hand-transcribed corpus of multi-party meeting recordings in which each of the speakers wore a close-talking microphone. Distant microphones were also recorded; however, they were not used in this experiment. The full text was transcribed, as were non-lexical events, including coughs, laughs, lip smacks, etc. There were a total of 75 meetings in this corpus. Similar to the work done by Kennedy and Ellis [2] and Truong and van Leeuwen [6], we trained and tested on the Bmr subset of the corpus, which includes 29 meetings. The first 26 were used for training and the last 3 were used to test the detector.

In order to clean the data, we only trained and tested on data that was transcribed as pure laughter or pure non-laughter. Cases in which the hand transcription listed both speech and laughter under a single start and stop time were disregarded. Furthermore, if a speaker was silent over a period of time, their channel during that time was not used in training. This reduced cross-talk (other speakers appearing on the designated speaker's channel) and allowed us to train on channels only when they were in use. All of the data was tested, but only time that was transcribed as pure laughter or pure non-laughter was included in the computation of the equal error rate (EER). Table 1 gives the statistics of the cleaned data. The average laugh duration was 1.61 seconds with a standard deviation of 1.41 seconds. Figure 1 shows the histogram of the laughter durations.

Table 1: Bmr statistics

                                 Training Data   Testing Data   All Data
Pure Laughter (seconds):         86.069          739.707        6604.776
Pure Non-Laughter (seconds):     9094.894        7766.17        9871.411
Percentage Pure Laughter (%):    6.08            8.696          6.71

Figure 1: Histogram of Laugh Duration

3 Method

3.1 Neural Network

A neural network with one hidden layer was used to classify feature vectors as either laughter or non-laughter. A schematic of a neural network is shown in Figure 2. It consists of input units, hidden units, and output units [9]. In this case the input units are the features, and the two output units are the probability that the frame is laughter and the probability that it is not laughter. The input units are linked to each of the hidden units. Each hidden unit takes a weighted sum of the input units, $b_i = \sum_j W_{i,j} F_j$, where $W_{i,j}$ is the weight associated with feature $F_j$ and hidden unit $A_i$; these weights are determined from training data. An activation function, $g$, is applied to the sum $b_i$ to determine the value of $A_i$.
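To make this concrete, here is a minimal NumPy sketch of the forward pass described in this section (hidden weighted sums followed by the softmax output detailed just below). The sigmoid hidden activation is an assumption, since the report only calls the activation g, and the weight shapes are illustrative.

import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(features, W_hidden, W_out):
    """Forward pass of a one-hidden-layer network of the kind in Section 3.1.

    features : (n_features,) input vector F
    W_hidden : (n_hidden, n_features) weights W[i, j] from feature j to hidden unit i
    W_out    : (2, n_hidden) weights from the hidden units to the two outputs
    """
    b = W_hidden @ features              # b_i = sum_j W[i, j] * F[j]
    A = 1.0 / (1.0 + np.exp(-b))         # activation g (sigmoid assumed here)
    return softmax(W_out @ A)            # posteriors: P(laughter), P(non-laughter)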

Similarly, to compute the output of the neural network, a weighted sum is taken of the hidden units and a softmax activation function is applied to that sum to determine the output values, the posterior probabilities of laughter and non-laughter.

In order to prevent over-fitting, the data used to train the neural network was split into two groups: training (the first 21 Bmr meetings) and cross validation (the rest of the original training set). The weights are adjusted on the first 21 Bmr meetings via the back-propagation algorithm, which modifies each weight based on the partial derivative of the error with respect to that weight. After each epoch the cross validation frame accuracy (CVFA) is evaluated; the CVFA is the ratio of true negatives and true positives to all cross validation data. Following the setup used at ICSI, the learning rate was initially set to 0.008. Once the CVFA does not increase by at least 0.5% from the previous epoch, the learning rate is halved at the beginning of each subsequent epoch. The next time the CVFA does not improve by 0.5%, training is stopped.

Figure 2: A neural network with n input units, 200 hidden units, and 2 output units
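A small sketch of that learning-rate schedule is given below, assuming hypothetical helpers train_one_epoch(lr) and cv_frame_accuracy(); it illustrates the rule described above and is not the actual ICSI training code.

def train_with_lr_halving(train_one_epoch, cv_frame_accuracy,
                          init_lr=0.008, min_gain=0.005):
    """Halve the learning rate once the CVFA gain drops below 0.5%,
    and stop the next time the CVFA again fails to improve by 0.5%.

    train_one_epoch(lr)  -- one epoch of back-propagation (hypothetical helper)
    cv_frame_accuracy()  -- CVFA on the held-out Bmr meetings (hypothetical helper)
    """
    lr = init_lr
    halving = False        # becomes True once the CVFA gain first falls short
    prev_acc = 0.0
    while True:
        train_one_epoch(lr)
        acc = cv_frame_accuracy()
        if acc - prev_acc < min_gain:
            if halving:    # second failure to improve by 0.5%: stop training
                break
            halving = True
        if halving:
            lr /= 2.0      # halved rate takes effect at the next epoch
        prev_acc = acc
    return lr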

3.2 Features

3.2.1 Mel Frequency Cepstral Coefficients

Mel frequency cepstral coefficients (MFCCs) are coefficients obtained by taking the Fourier transform of a signal, mapping it onto the mel scale, and finally taking the discrete cosine transform of the mel-scaled spectrum [10]. The mel scale is a perceptual scale intended to more closely reflect what humans hear. In this experiment, MFCCs were used to capture the spectral characteristics of laughter and non-laughter. The first-order regression coefficients of the MFCCs (delta MFCCs) and the second-order regression coefficients (delta-delta MFCCs) were also computed and used as features for the neural network. We used the first 12 MFCCs (including the energy component), computed over a 25 ms window with a 10 ms forward shift, as features for the neural network. MFCC features were extracted using HTK.

3.2.2 Pitch

The laughter characterization literature indicates that pitch features differ between laughter and speech. In [3], the fundamental frequency (F0) contour of laughter was found not to decline as it typically does in speech. In [4], F0 was found to have large ranges during laughter. Using the ESPS pitch tracker get_f0 [11], we extracted the F0, the rms value, and the ac peak (the highest normalized cross-correlation value found in determining F0) for each frame. The delta and delta-delta coefficients were computed for each of these features as well.
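The MFCCs above were extracted with HTK and the pitch features with ESPS get_f0; purely as an illustration, the sketch below approximates similar frame-level features with librosa. The file name, the pitch range, and the use of pyin's voicing probability as a stand-in for the ac peak are all assumptions, so values will not match the original setup.

import numpy as np
import librosa

# Hypothetical input file; the report used close-talking channels of the ICSI corpus.
y, sr = librosa.load("bmr_channel.wav", sr=16000)

hop = int(0.010 * sr)   # 10 ms frame shift
win = int(0.025 * sr)   # 25 ms analysis window

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=win, hop_length=hop)
d_mfcc = librosa.feature.delta(mfcc)             # first-order regression coefficients
dd_mfcc = librosa.feature.delta(mfcc, order=2)   # second-order regression coefficients

# Pitch: pyin returns F0 and a voicing probability; the latter is only a rough
# stand-in for get_f0's normalized cross-correlation peak (the ac peak).
f0, voiced, ac_peak = librosa.pyin(y, fmin=60, fmax=400, sr=sr,
                                   frame_length=2 * win, hop_length=hop)
f0 = np.nan_to_num(f0)                           # unvoiced frames -> 0

rms = librosa.feature.rms(y=y, frame_length=win, hop_length=hop)[0]

n = min(mfcc.shape[1], len(f0), len(rms))        # align frame counts defensively
features = np.vstack([mfcc[:, :n], d_mfcc[:, :n], dd_mfcc[:, :n],
                      f0[None, :n], rms[None, :n], ac_peak[None, :n]])
# (the deltas and delta-deltas of the pitch features, used in the report, are omitted here)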

4 Experiments

Since individual frames are so short and, in this data set, people generally laughed for about 1.61 consecutive seconds at a time, we fed a window of multiple frames into the neural net to determine whether a person is laughing. After trying 0.25, 0.50, 0.75, and 1 second windows, we found that a window of 0.75 seconds worked best, based on the CVFA. We wanted the classification to be associated with the middle frame, so we set the offset to 37 frames.

We also had to determine the number of hidden units. Since MFCCs were the most valuable features for Kennedy and Ellis [2], we used MFCCs as the input units and varied the number of hidden units over 50, 100, 200, and 300 while keeping all other parameters the same. Based on the accuracy on the cross validation set, 200 hidden units performed best. The other parameters of the neural network, including the learning rate, were set to values that work well for speech recognition at ICSI [12].

As stated earlier, the neural network was trained only on data that was transcribed as either pure laughter or pure non-laughter. After the weights were set on this training data, all of the test data was run through the neural network, and each frame was given two output scores: the probability that the frame was non-laughter and the probability that it was laughter. Although the entire test set was scored, only the pure laughter and pure non-laughter frames were used to compute the detection error trade-off (DET) curves and the equal error rate (EER).

The neural networks were first trained with each of the features individually (i.e., MFCCs, delta MFCCs, delta-delta MFCCs, F0, delta F0, etc.). Then we combined each class of features (MFCCs, F0, rms value, ac peak) with their respective deltas and delta-deltas and retrained the neural networks. The EERs are shown in Table 2 and the DET curves are shown in Figures 3, 4, 5, and 6. Each figure contains four DET curves: the feature itself, the delta feature (* D), the delta-delta feature (* A), and the combination of the feature, delta, and delta-delta (* ALL).
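For reference, one straightforward way to compute an equal error rate from per-frame laughter scores and labels is sketched below; it is a generic illustration, not the scoring code used in the project.

import numpy as np

def equal_error_rate(scores, labels):
    """EER from frame-level laughter scores.

    scores : posterior probability of laughter for each scored frame
    labels : 1 for pure-laughter frames, 0 for pure-non-laughter frames
             (frames that are neither are assumed to have been removed)
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = 1.0, None
    for t in np.unique(scores):
        decide = scores >= t
        fa = np.mean(decide[labels == 0])       # false alarm rate
        miss = np.mean(~decide[labels == 1])    # miss rate
        if abs(fa - miss) < best_gap:
            best_gap, eer = abs(fa - miss), (fa + miss) / 2.0
    return eer   # point where the miss and false alarm rates meet

With the network outputs, this would be called as equal_error_rate(posteriors[:, 1], frame_labels) over the pure-laughter and pure-non-laughter test frames.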

Figure 3: DET curve for MFCC features

Figure 4: DET curve for ac peak features

Figure 5: DET curve for F0 features

Figure 6: DET curve for rms features

Table 2: Equal Error Rates (%)

              MFCCs   F0     RMS    AC Peak
Feature       11.3    3.6    3.     16.7
Delta         9.6     4.4    6.     .37
Delta-Delta   11.3    7.83   6.6    7.61
All           10.66   .80    6.01   16.7

5 Discussion and Conclusions

From Table 2, it is clear that MFCC features outperformed the pitch-related features. This is consistent with Kennedy and Ellis's [2] results. For Truong and van Leeuwen [6], PLP features outperformed the other features; PLP features, like MFCCs, are perceptually scaled spectra, so it is not surprising that they also performed well for laughter detection. The ac peak values performed well too, which suggests that the cross-correlation of the audio signal helps in detecting laughter. This seems reasonable since laughter is repetitive [5]. However, Provine [5] found that the repetitions were about 210 ms apart, which exceeds the window used to compute the correlation.

In the future, we plan to combine the features at both the score level and the feature level; the added features would likely improve our results. Additional features could also be used to detect laughter. Trouvain noted the repetition of a consonant-vowel syllable structure in laughter [13]. Based on this model, we could run a phoneme recognizer on the audio and detect when a phoneme is repeated. Another approach would be to compute the modulation spectrum, which should also capture the repetitive nature of laughter.

In conclusion, we have shown that neural networks can be used to automatically detect the onset and offset of laughter with an EER of 9.6%. These results are slightly better than previous experiments, which classified presegmented data. By adding more features and combining features we hope to further improve these results.

6 Acknowledgements

We would like to thank Nikki, Chuck, Howard, Lara, Khiet, Mathew, and Galen for their help with this experiment.
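As a rough sketch of the modulation-spectrum idea mentioned above, here is one simple variant that takes the spectrum of a frame-level energy envelope (some prior work computes it per sub-band instead); the 10 ms frame rate is an assumption carried over from the feature setup.

import numpy as np

def modulation_spectrum(frame_energy, frame_rate=100.0):
    """Magnitude spectrum of the frame-level energy envelope.

    frame_energy : per-frame energy (e.g. RMS) at `frame_rate` frames/second
    Returns (modulation frequencies in Hz, magnitudes). A peak near 4-5 Hz
    would reflect laughter syllables repeating roughly every 210 ms [5].
    """
    env = np.asarray(frame_energy, dtype=float)
    env = env - env.mean()                       # remove the DC component
    spectrum = np.abs(np.fft.rfft(env * np.hanning(len(env))))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / frame_rate)
    return freqs, spectrum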

References

[1] A. Carter, Automatic acoustic laughter detection, Master's Thesis, Keele University, 2000.

[2] L.S. Kennedy and D.P.W. Ellis, Laughter detection in meetings, NIST ICASSP 2004 Meeting Recognition Workshop, Montreal, 2004.

[3] C. Bickley and S. Hunnicutt, Acoustic analysis of laughter, in Proc. ICSLP, pp. 927-930, Banff, Canada, 1992.

[4] J. Bachorowski, M. Smoski, and M. Owren, The acoustic features of human laughter, Journal of the Acoustical Society of America, pp. 1581-1597, 2001.

[5] R. Provine, Laughter, American Scientist, January-February 1996.

[6] K.P. Truong and D.A. van Leeuwen, Automatic detection of laughter, in Proceedings of Interspeech, Lisbon, Portugal, 2005.

[7] R. Cai, L. Lu, H.-J. Zhang, and L.-H. Cai, Highlight sound effects detection in audio stream, in Proc. International Conference on Multimedia and Expo, Baltimore, 2003.

[8] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters, The ICSI meeting corpus, ICASSP, Hong Kong, April 2003.

[9] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, New Jersey, 2003.

[10] Mel frequency cepstral coefficient, http://en.wikipedia.org/wiki/Mel_Frequency_Cepstral_Coefficients, 2006.

[11] Entropic Research Laboratory, Washington, D.C., ESPS Version 5.0 Programs Manual, August 1993.

[12] D. Ellis, ICSI Speech FAQ: How are neural nets trained, http://www.icsi.berkeley.edu/speech/faq/nn-train.html, 2000.

[13] J. Trouvain, Segmenting phonetic units in laughter, in Proc. ICPhS, Barcelona, 2003.