AUTOMATIC RECOGNITION OF LAUGHTER


AUTOMATIC RECOGNITION OF LAUGHTER USING VERBAL AND NON-VERBAL ACOUSTIC FEATURES

Tomasz Jacykiewicz [1], Dr. Fabien Ringeval [2]

JANUARY 2014

DEPARTMENT OF INFORMATICS - MASTER PROJECT REPORT
Département d'Informatique - Departement für Informatik
Université de Fribourg - Universität Freiburg
Boulevard de Pérolles, Fribourg, Switzerland

[1] tomasz.jacykiewicz@unifr.ch, BENEFRI Master Student, University of Fribourg
[2] fabien.ringeval@unifr.ch, Supervisor, University of Fribourg


Abstract

Laughter is a fundamental social event, yet our knowledge about it is incomplete. Several studies have been conducted so far on the automatic recognition of laughter, most of them focused on non-verbal, spectral-related features. In this Master thesis we investigate three classification problems using two approaches. We consider the discrimination of speech from (1) laughter, (2) speech-laugh and (3) the two previous types of laughter plus an acted one. All experiments were conducted on the MAHNOB Laughter Database. We applied leave-one-subject-out cross-validation as the evaluation framework to achieve both language and speaker independence during automatic laughter recognition, and we evaluated the scores using weighted accuracy (WA). The first approach is based on non-verbal features. We tested four feature sets prepared for the INTERSPEECH (IS) Challenges between 2010 and 2013 and proposed another four feature sets based on formant values (F). We obtained very high performance ((1) WA(IS) = 98%, WA(F) = 92%; (2) WA(IS) = 86%, WA(F) = 75%; (3) WA(IS) = 92%, WA(F) = 86%) for both groups of feature sets, with a dominance of the IS feature sets. Feature-level fusion was investigated for the best set of each group and improved the scores in classification problem (1), even though the results were already very high. The second approach is based on verbal features. We tested Bag-of-Words and n-gram modeling based on automatically detected acoustic events, i.e. voiced/unvoiced segments, pseudo-vowels/pseudo-consonants and acoustic landmarks. The feature sets based on acoustic landmarks (AL) achieved the best scores in all experiments ((1) WA(AL) = 78%; (2) WA(AL) = 73%; (3) WA(AL) = 74%), though the results were not as good as in the first approach. We observed that n-grams generally perform better than Bag-of-Words, which shows that the sequencing of units based on acoustic landmarks is more pertinent for automatic laughter recognition than their distribution. The results obtained in this thesis are very promising, since the state-of-the-art performance in automatic recognition of laughter from the speech signal was significantly improved, i.e. from 77% to 98%.


Acknowledgment

I am grateful to Prof. Rolf Ingold, the head of the DIVA research group, for giving me the opportunity to write this Master thesis in his group. I am also very thankful to my supervisor, Dr. Fabien Ringeval, for his assistance in my approach to the machine learning domain, his professional explanations, his pertinent advice and constructive feedback, and our fruitful discussions. Finally, I want to express my gratitude to those who supported me mentally during the time of writing this Master thesis.


Contents

1 Introduction
  Motivations
  Thesis structure
  Laughter in human interactions
    Eliciting laughter
    Categories of laughter
  Psychoacoustic models of laughter
  Automatic speech processing
  Thesis contributions

2 State-of-the-art in laughter recognition
  Available databases
  Relevant acoustic features
    Spectral-related features
    Prosodic features
    Voicing-related features
    Modulation spectrum-related features
    Acoustic-based verbal features
  Popular classification methods
    Support Vector Machines (SVMs)
    Gaussian Mixture Models (GMMs)
    Artificial Neural Networks (ANN)
  Actual performance
    Audio-only
    Video-only
    Audio-visual

3 Automatic laughter recognition based on acoustic features
  MAHNOB database
    Structure
    Annotations
  Non-verbal acoustic features
    INTERSPEECH Challenge feature sets
    Formants and vocalization triangle
  Classification
  Results
    Speech vs Laughter
    Speech vs Speech-Laugh
    Speech vs All types of laughter
  Conclusions

4 Automatic laughter recognition based on verbal features
  Acoustic events
    Voiced/unvoiced/silence segments
    Pseudo-vowel/pseudo-consonant/silence segments
    Phonetic-based landmarks
    Perceptual centers
  Results: acoustic events
    Speech vs Laughter
    Speech vs Speech-Laugh
    Speech vs All types of laughter
  Direct fusion of acoustic events and p-centers
  Results: acoustic events & p-centers
    Speech vs Laughter
    Speech vs Speech-laugh
    Speech vs All types of laughter
  Conclusions

5 Conclusions

List of Figures

1.1 A waveform (top) and the corresponding frequency spectrum (bottom) of a typical laugh, composed of six vowel-like notes, showing the regularities. Adapted from [84]
1.2 Four generations of speech and speaker recognition research; taken from [93] and adapted from [37]
3.1 The vocalization triangle (in red). Adapted from [93]
- Spectrograms of speech (top) and laughter (bottom) segments with the formant values marked in color: red - F1, green - F2, blue - F3 and yellow - F4
- A waveform (top) and the corresponding spectrogram (bottom) of a segment annotated as speech-laugh; between 0.0 and 1.0 s the segment contains normal speech, and between approximately 1.05 and 1.4 s the speech is interfered with by laughter
- Redistribution of features after applying the CFS
- Redistribution of features after applying the CFS for the Set LE
- A speech signal (blue) with its energy (black), voiced (dark green) and unvoiced (red) segments. The black dashed line is the silence threshold. Adapted from [107]
- A waveform (top) of a short two-syllable ("ha-ha") laughter segment produced by a subject, with landmarks indicated, and the corresponding spectrum (bottom)
- Energy waveforms of 5 frequency bands and one voicing band (bottom). Adapted from [14]
- A rhythmic envelope extracted from a speech signal, adapted from [93]
- A rhythmic envelope extracted from a speech signal (red) and perception levels of p-centers with thresholds of 1/3, 1/4 and 1/6 of the amplitude (gray scale); adapted from [93]


Chapter 1

Introduction

This chapter unveils the subject and the aim of this Master thesis. First, the motivations that guided this study are given and the thesis structure is outlined. Then, the notion of laughter in human interactions is explained and the psychoacoustic model of laughter is presented. A brief description of the concepts of automatic speech processing and automatic laughter recognition is also given. At the end, the contributions brought by this work are listed.

1.1 Motivations

Laughter, along with other non-lexical utterances like moaning or crying, was reported to appear before the development of speech and to be used as an expressive-communicative social signal [96]. However, the knowledge about laughter is incomplete and it lacks empirical studies [52]. Knowing how to automatically recognize laughter can be useful in analyzing the context and circumstances that provoked it. That could have practical applications in more efficient exploration of neuro-behavioral topics [85] or better perception of human affects [97]. It would also improve human-computer interfaces (i.e. human-centered computing) by sensing human behavioral signals [71] and social attitudes, like agreement or disagreement [13]. Moreover, recognizing laughter more reliably as non-speech segments of the signal would augment the performance of automatic speech recognition systems [75]. Automatic recognition of laughter could also be helpful in the automatic analysis of non-verbal communication in groups [39]. Finally, automatic tagging of multimedia data and, in consequence, their retrieval [122] could be improved by using a user's laughter as feedback.

1.2 Thesis structure

This thesis is split into five chapters. Chapter 1 describes briefly what laughter is, what its role in human interactions is, what its characteristics are and what the benefits of recognizing laughter automatically could be. It also gives an introduction to automatic signal processing. Chapter 2 introduces the state-of-the-art in automatic laughter recognition, with an overview of available databases containing laughter, a summary of the types of features utilized for that task, classification methods and results of accomplished experiments. Chapter 3 presents our first approach to laughter recognition, which is based on acoustic features of the speech signal. We describe in detail the MAHNOB database [75] that we used for the experiments, the feature sets that we tested and our methodology. In chapter 4 we present our second approach to laughter recognition, which is based on verbal features extracted from acoustic events of the speech signal. We describe the types of acoustic events we used and the corresponding results. In chapter 5 we express our conclusions and discuss perspectives on future work.

1.3 Laughter in human interactions

Human communication is a basis of human existence. It is fundamentally a social phenomenon and is important for many social processes. Humans communicate by exchanging information with others. The word communicate comes from the Latin communicationem, noun of action of communicare - "to share, divide out; communicate, impart, inform; join, unite, participate in". A piece of information can be transmitted with the use of verbal codes (i.e. words) or non-verbal codes. In both cases, the information is valid as long as it is comprehensible to both the sender and the receiver of the message. However, unlike verbal communication, non-verbal communication can be unintentional, i.e. the speaker or listener may not be aware of it. Such communication includes postures, proximity, gestures, eye contact and sounds like groans, sighs or laughter. The last is of particular interest, since it can bring some context-related information, which can help to better understand the meaning of an utterance. Reflections about laughter have been offered in the past in different ways by many important figures in science and philosophy, namely Aristotle, Kant, Darwin, Bergson and Freud [84]. Laughter is estimated to be about 7 million years old [63], which means that when we laugh

spontaneously, we use a capacity rooted in our most primitive biology. We are born with this innate skill, since laughter is observable even among deaf-blind children [96]. Laughter was already identified in 5-week-old infants [95], far before they say their first word. It is a direct appeal for mutuality [32] and, in contrast to speech, indicates positive feedback when done synchronously with others. Its evolution assisted the progress of developing positive social relations in groups, as suggested in [68]. Since laughter is associated with the release of tension and some aspects of stress physiology [10], it has been used clinically for patients' relaxation [112]. Further work in that field could help medicine take advantage of these positive effects of laughter in the treatment of psychiatric disorders like depression [70].

Eliciting laughter

Laughter can be elicited in different ways, generally depending on age. Among infants, laughter occurs during tickling or surprising sounds, sights, or movements, as well as due to motor accomplishments like standing up for the first time [95]. Among children, laughter was reported in response to energetic social games like the chasing and running activities of rough-and-tumble play [108]. Among adults, laughter occurs mostly during friendly social interactions, like greetings or remarks, and not in response to explicit verbal jokes [70]. Less than 20% of conversational laughter is elicited by humorous comments or stories [84]. Laughter is a social signal, so the stimulus that elicits it the most is another person rather than something funny [86]. That explains why people laugh about 30 times less often when they are alone and without any stimulus, like television or a book, than they do in a social situation - we are more likely to smile and talk to ourselves than laugh while being alone [84]. Moreover, teasing or ironically criticizing our relatives and those we admire may also provoke laughter - the more important the person is to us, the more mirth it provokes [70]. The superiority theory suggests that we laugh at someone else's mistakes or misfortune because we feel superior to this person [16]. On the other hand, persistent one-sided laughter can signal the seeking of dominance [70]. Laughter can also be elicited in response to a relief of tension - this kind of manipulation is used in movies: moments of suspense are often followed by a side, comic comment [16]. People are so keen on laughing that humorist has become a real profession. Since 1998, in the United States, the greatest humorists have been annually awarded for their contribution to American humor and their impact on American society by the John F. Kennedy Center for the Performing Arts - the Mark Twain Prize for American Humor, named after the 19th-century novelist, essayist and satirist Mark Twain [5]. So far, such great comedians as Richard Pryor (1998), Whoopi Goldberg (2001), George Carlin (2008) and Bill Cosby (2009) have been awarded. However, we have to keep in mind that in some cultures laughing or smiling at others is disapproved of [99].

Categories of laughter

Laughter is not a stereotyped signal and it can be categorized along different dimensions. A standard laughter does not exist - there are plenty of varieties of laughter [9]. Laughter can be voiced (about 30% of all analyzed laughs), unvoiced (about 50%) or mixed (the remaining 20%). Laughter can be categorized according to the number of laugh syllables: comment laugh - 1 syllable, quiet laugh (chuckle) - 2 syllables, rhythmical laugh (3 and more syllables) and very high-pitched laugh (squeal) [64]. Laughter can also be categorized according to its emotional content [46, 55]. Humans are capable of intuitively interpreting a wide range of meanings contained in laughter, such as sincerity, nervousness, hysteria, embarrassment, raillery or strength of character [32]. It is also interesting to note that a variety of synonyms of the term laughter are often onomatopoeic, e.g. giggle, cackle or titter [32]. Speech-synchronous forms of laughter, i.e. speech-laugh - laughing simultaneously with articulation - seem to constitute another category of non-verbal vocalization, since the segments of laughter are not just superimposed on articulation but are nested, while the articulation configuration is preserved [117]. Smiled speech can also be considered as laughter, since it can be distinguished from non-smiled speech by listening alone [113]. Since Darwin [24], there has been a discussion whether smiling and laughing are extremes of the same continuum.

1.4 Psychoacoustic models of laughter

To represent laughter, R.R. Provine [84] took the perspective of a visiting extraterrestrial who meets a group of laughing human beings: "What would the visitor make of the large bipedal animals emitting paroxysms of sound from a toothy vent in their faces?". He proposed to describe the physical characteristics of that noisy behavior, the mechanism that produces it and the rules that control it. Characteristics of the creature producing the sound (such as gender or age) and information about other species that emit similar sounds might also be useful. R.R. Provine tried to describe the sonic structure of human laughter, but he found it difficult, since humans do not have as much conscious control over laughter as over speech - he asked people in public places to laugh, but about half of the subjects reported that they could not laugh on command [84]. Moreover, although the laughter of a given person can be identifiable, it is not invariant - typically every human uses several patterns of laughter. The good news, regarding computational models of laughter, is that there is no significant difference in the sounds used in laughter between various cultures [32]. By means of a sound spectrograph (a device that captures a sound and represents it visually as variations of the frequencies and intensities of the sound over time), Provine [84] analyzed the sonic properties of laughter. A sound of laughter can be represented as a series of notes (syllables) that resemble a vowel. Typically, a

note is about 75 milliseconds long and is repeated regularly at 210-millisecond intervals. Figure 1.1 presents a waveform and the corresponding spectrogram of a typical laughter.

Figure 1.1: A waveform (top) and the corresponding frequency spectrum (bottom) of a typical laugh, composed of six vowel-like notes, showing the regularities. Adapted from [84]

A sequence of laughter syllables in one exhalation phase is called a bout, and a laughter episode is defined as a sequence of bouts separated by one or more inhalation phases [117]. The production of the sound of laughter depends on the openness of the mouth (closed/half open/fully open) and glottalization [32]. The aspiration /h/ is a central sound feature of laughter and it can be repeated or combined with any vowel or vocalic nasal, e.g. /m/ or /n/ [32]. There are no specific vowel sounds that define laughter; however, a particular laugh is typically composed of similar vowel sounds, e.g. "ha-ha-ha" or "hi-hi-hi" are possible structures of a laugh, but "ha-hi-ha-hi" is not. This is due to some intrinsic constraints of our vocal apparatus that prevent us from producing such sounds [84]. If a variation happens, it is generally in the first or the last note of a sequence, i.e. "ha-ha-hi" is a possible structure of a laugh. Another constraint prevents us from producing abnormally long laughter notes [84]. That is why we rarely hear laughs like "haaa-haaa-haaa"; and when we actually do hear one, we might suspect that it is faked, such as the laughter of Nelson from The Simpsons series. Abnormally short notes, i.e. notes that last much less than 75 milliseconds, are likewise contrary to our nature. Similarly, too long or too short inter-note intervals are rare. Laughter has a harmonic structure [84]. Each harmonic is a multiple of a low, fundamental frequency.
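The temporal and harmonic regularities just described (vowel-like notes of about 75 ms repeated at roughly 210 ms intervals, harmonics stacked on a low fundamental, a gradual decrescendo) can be made concrete with a small synthesis sketch. The following Python snippet is purely illustrative and not part of the thesis; the fundamental frequency of 220 Hz, the number of harmonics and the decay rate are arbitrary assumptions.

import numpy as np

SR = 16000            # sampling rate (Hz)
NOTE_DUR = 0.075      # one vowel-like note lasts about 75 ms
INTERVAL = 0.210      # notes repeat roughly every 210 ms
F0 = 220.0            # assumed fundamental frequency (Hz); higher for female voices

def laugh_note(duration=NOTE_DUR, f0=F0, sr=SR, n_harmonics=5):
    """One vowel-like note: a few harmonics of f0 under a smooth envelope."""
    t = np.arange(int(duration * sr)) / sr
    note = sum(np.sin(2 * np.pi * f0 * (h + 1) * t) / (h + 1)
               for h in range(n_harmonics))
    return note * np.hanning(len(t))

def laugh_bout(n_notes=6, sr=SR):
    """A bout: n_notes notes at regular ~210 ms intervals, with the loudness
    declining over the bout (the decrescendo described in the text)."""
    out = np.zeros(int(n_notes * INTERVAL * sr))
    for i in range(n_notes):
        start = int(i * INTERVAL * sr)
        note = laugh_note() * (1.0 - 0.12 * i)
        out[start:start + len(note)] += note
    return out / np.max(np.abs(out))

signal = laugh_bout()   # a crude, regular "ha-ha-ha-ha-ha-ha"-like pulse train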

The fundamental frequency of the laughter of females is higher than that of males. However, all human laughter is a variation of this basic form, which allows us to recognize laughter no matter how much we differ from each other. Laugh notes are temporally symmetrical [83], which means that a short bout played backward will sound similar to its original version. This can also be seen on the sound spectrum - its form is very similar when read in either direction. However, not every aspect of laughter is reversible. The loudness of a segment of laughter declines gradually over time (probably because of the lack of air) and can thus be described by a decrescendo (a gradual decrease in volume of a musical passage; taken from the Merriam-Webster Dictionary). Finally, the placement of laughter segments in the flow of speech is not random, which makes it a very important feature. In more than 99% of cases (1192 out of 1200 samples) the laughter appeared during pauses at the end of phrases [84]. This is called the punctuation effect [83] and it may suggest that a neurologically based process gives more priority for accessing the single vocalization channel to speech than to laughter.

1.5 Automatic speech processing

Research in automatic speech processing has been evolving for over 50 years now [36]. Various types of systems have been proposed so far: from a simple isolated digit recognition system for a single speaker [25] up to complex systems recognizing speakers and their utterances in multi-party meetings [34]. Figure 1.2 presents the evolution of the research. The general schema of automatic speech recognition (ASR) is composed of three steps: (1) retrieving raw data from sensors, (2) extracting characteristic details (i.e. feature extraction) and (3) detecting patterns based on predefined models, i.e. classification. While the first step is more linked to the environment, the second and third steps depend more on the specificity of the task. There are two major categories of modeling methods for speech data: verbal, e.g. Bag-of-Words [124], n-grams [28] or maximum likelihood estimates [27], and non-verbal, e.g. the perceptual linear predictive (PLP) technique [44] or mel-frequency cepstral features [43], as well as several classification methods, e.g. Hidden Markov Models (HMM) [87], Artificial Neural Networks (ANN) [57], Gaussian Mixture Models (GMM) [91] or Support Vector Machines (SVM) [38]. However, human communication consists not only of speaking but also of other wordless cues. One of them, para-language, uses the same communication channel as speech does, i.e. the voice. Knowing how to recognize these types of signals can not only improve the performance of speech recognizers but also help to retrieve other important information, perhaps relevant for understanding the

Figure 1.2: Four generations of speech and speaker recognition research; taken from [93] and adapted from [37].

context. One of the branches that explores paralinguistic signals is the research on laughter recognition systems. The main goal in this research area is to detect any kind of laughter in spontaneous speech data. The main difficulty of laughter recognition is the definition of laughter itself and the variability of its forms. The general process of automatic laughter recognition is the same as for automatic speech recognition. There have already been several studies in this research field. Earlier attempts aimed solely to detect whether a particular segment of signal, generally of a duration of 1 second, contained laughter, e.g. [49, 118]. More recent works, e.g. [53, 54], are more precise in their predictions and are able to detect the start and the end of a laughter segment, though there still remain some challenges to reach the ultimate goal, i.e. human-level performance. More on the state-of-the-art experiments in laughter recognition is presented in chapter 2.

1.6 Thesis contributions

The purpose of this Master thesis is to develop a system for laughter recognition using two different approaches: non-verbal and verbal methods. The non-verbal approach is especially used in automatic speech and emotion recognition, but it has also shown good performance in laughter recognition, e.g. in [77].

Our motivation to investigate this approach is driven by a desire to improve performance by using other types of features, such as new models of voice quality. The use of a verbal model was motivated by the lack of experiments with this method. The only attempt, to the best of our knowledge, at using a verbal approach in laughter recognition was done by Pammi et al. [69] and presented promising results. Thus, we use Bag-of-Words and n-grams as feature extraction methods, applied to automatically detected acoustic events of different natures. Since a standard laughter does not exist [9], we introduce three classification problems: discrimination of speech (1) from laughter, (2) from speech-laugh and (3) from all types of laughter (including the two previous types of laughter plus an acted one). All experiments are done on a multilingual speech corpus containing spontaneous and forced laughter as well as speech-laughs, i.e. the MAHNOB Laughter Database [75], to investigate both speaker- and language-independent recognition of laughter. Our results for the non-verbal approach showed that the classification of acoustic-based features can obtain very high scores and that formant-related features can also be pertinent, especially when using a logarithmic scale of formant values normalized by their energies. The results obtained using the verbal approach, although not as good as those obtained with the non-verbal approach, also showed good performance and great potential. Since this method has been little exploited in laughter recognition, many perspectives wait to be unfolded.

Chapter 2

State-of-the-art in laughter recognition

This chapter presents the state-of-the-art in laughter recognition. It includes brief descriptions of the databases which contain laughter episodes, popular types of features, commonly used classifiers and the performance obtained for its automatic recognition.

2.1 Available databases

There are several audiovisual databases, available for download, that contain laughter events. An overview is presented in Table 2.1.

The ILHAIRE Laughter Database [59] is an ensemble of laughter episodes extracted from five existing databases. (1) The Belfast Naturalistic Database is composed of video materials, drawn from television programmes, talk shows, religious and factual programmes, that contain positive and negative emotions. 53 out of 127 clips contain laughter and were included in the ILHAIRE database. (2) The HUMAINE Database [30] is composed of 50 audiovisual clips with diverse examples of emotional content, drawn from various sources like TV interviews or reality shows. Although their quality is variable, they present a variety of situations in which laughter occurs. 46 laughter episodes were extracted for inclusion in the ILHAIRE database. (3) The Green Persuasive Database [30] contains 8 interactions between a University Professor and his students, in which he tries to persuade them to adopt a more environmentally friendly lifestyle. 280 instances of conversational or social laughter were extracted for inclusion in the ILHAIRE database. (4) The Belfast Induced Natural Emotion Database [109] is composed of 3 sets of audiovisual clips containing emotionally coloured naturalistic responses to a series of laboratory-based tasks or to emotional video clips. 289 laughter episodes were extracted from a total of 565 clips of Set 1. Ongoing works aim

to add laughter episodes from Sets 2 and 3. (5) The SEMAINE Database [60] is composed of high-quality audiovisual clips recorded during an emotionally coloured interaction with an avatar, known as a Sensitive Artificial Listener (SAL). 443 instances of conversational and social laughter were extracted from 345 video clips for inclusion in the ILHAIRE database.

The AMI Meeting Corpus [19] (AMI stands for Augmented Multi-party Interaction) is a multi-modal data set consisting of 100 hours of meeting recordings. About two-thirds of the meetings are scenarios played by four people - design team members who kick off and finish a design project. The rest consists of naturally occurring meetings, such as a discussion between four colleagues about selecting a movie for a fictitious movie club or a debate of three linguistics students who plan a postgraduate workshop [6]. The corpus provides orthographic transcription and annotations for many different phenomena like dialogs, hand/leg/head movements etc. Laughter annotations are also provided, but only approximately, i.e. neither start nor end time is given, only a time stamp indicating an occurring laughter. Although the language spoken in the meetings is English, most of the participants are non-native speakers, with therefore a variety of accents. This database was used, inter alia, in [77, 76, 73] (only close-up video recordings of the subject's face and the related individual headset audio recordings).

The ICSI Meeting Corpus [62] is a collection of 75 meetings (approximately 72 hours of speech) collected at the International Computer Science Institute in Berkeley. In comparison to the AMI Meeting Corpus, the ICSI corpus contains real-life meetings - regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project, with an average of six participants per meeting and a total of 53 unique speakers (13 females and 40 males) varying in fluency in English. Each participant wore a close-talking microphone. In addition, six tabletop microphones simultaneously recorded the audio. Annotations include events like coughs, lip smacks, microphone noise and laughter. This database was used in [118, 53].

The AudioVisual Laughter Cycle (AVLC) database [119] is a collection of audiovisual recordings of 24 subjects (9 females and 15 males) from different countries who were recorded by one web-cam, seven infrared cameras and a headset microphone while watching a 10-minute compilation of funny videos. Annotations were added by one annotator using a hierarchical annotation protocol: a main class (laughter, breath, verbal, clap, silence or trash) and its subclasses. The laughter subclasses include temporal structure (number of bouts and syllables, as proposed by Trouvain [117]) and type of sound (e.g. voiced, breathy, nasal, etc.). The number of laughter episodes is around 44 per participant (on average 23.5% of all recordings) and 871 in total. This database was used in [69].

AVIC (Audiovisual Interest Corpus) [100] is a collection of audiovisual recordings where subjects interact with an experimenter who plays the role of a product presenter and leads the subject through

a commercial presentation. The subject is asked to interact actively but naturally, depending on his/her interest in the proposed product. The presentations are held in English, but most of the 21 subjects (10 females and 11 males) are non-native speakers. Data is recorded by a camera and two microphones, one headset and one far-field microphone. The total duration is 10 hours and 22 minutes, with 324 laughter episodes. Annotations were done by four independent annotators, mainly to describe a level of interest (disinterest, indifference, neutrality, interest, curiosity), but some additional annotations for non-linguistic vocalizations (like laughter, consent or hesitation) are also available.

The MAHNOB Laughter Database [75] is the most recent audiovisual corpus available on-line (after an end-user license agreement is signed). It consists of 22 subjects (12 males and 10 females) from 12 different countries. During the sessions they were watching funny video clips in order to elicit laughter, but some posed laughter and smiles were also recorded. In addition, they were asked to give a short speech in English as well as in their native language to create a multilingual speech corpus, since all previous works employ only utterances in English, which can bring some bias to the discrimination models. All recordings were done by a camera with a microphone and an additional lapel microphone. Annotations were performed by one human annotator using 9 labels, i.e. laughter, speech, speech-laugh, posed smile, acted laughter, laughter + inhalation, speech-laugh + inhalation, posed laughter + inhalation, and other. Such an amount of labels resolves problems like whether an audible inhalation that follows a laughter belongs to it or not [117]. In addition to that, a second level of annotations was added which separates all the laughter episodes into voiced and unvoiced. This step was performed by a combination of two approaches, i.e. manual labeling by two human annotators and automatic detection of unvoiced frames based on the pitch contour by the PRAAT software [12]. This database was chosen for this master project because of its multilingual character and the fact that it contains speech from the same subjects that produce laughter, which makes it appropriate for training a system to distinguish laughter and speech characteristics. More about its structure and annotations can be found in chapter 3.

2.2 Relevant acoustic features

This section describes briefly the most common features and extraction methods used in state-of-the-art systems for automatic laughter recognition from the speech signal. The purpose of the feature extraction step in speech processing is to reduce the quantity of information passed to the classifier by adapting its parameters so that their discrimination capabilities are adjusted to the classes that are modeled. The most common extraction techniques use models of the human auditory system [40] and are based on a

short-term spectral analysis [88]. They are successfully used in tasks like speech recognition, speaker recognition and emotion recognition [101], but also to distinguish the speaker's paralinguistic and idiosyncratic traits such as gender or age [61].

Table 2.1: Overview of the existing databases containing laughter. Three types of interaction exist: dyadic - subjects interact with agents who play a role; elicit - laughter is engendered by watching funny videos; spontaneous - recordings from real-life meetings. H: headset, L: lapel, F: far-field, ?: no information available.

Name | Interaction | # subjects | # episodes | Camera res. | Mic. type | # raters
SEMAINE [60] | Dyadic | | | x580 | H, F | 28
AVIC [100] | Dyadic | | | x576 | L, F | 4
MAHNOB [75] | Elicit | | | x576 | C, L | 2
AVLC [119] | Elicit | | | x480 | H | 1
Belfast Nat. [29] | Elicit | | | ? | | 3
Belfast Ind. [109] | Elicit | | | x1080 | F | 1
AMI [19] | Spontaneous | | | x576 | H, L, F | ?
ICSI [62] | Spontaneous | | | | H, L, F | ?

Since a speech signal is not stationary in the long term and is quasi-stationary in the short term, before the feature extraction phase the signal is divided into overlapping segments using a sliding window. Although there is no standard segment length, in speech recognition a duration of approximately 25 ms is used (which corresponds to the length of a phoneme), with a shift of 10 ms to ensure stationarity between adjacent frames and to cover phonetic transitions, which contain important information about the nearby phones [80]. Characteristics obtained in this way are also called Low-Level Descriptors (LLDs). Depending on the type of information extracted from the speech signal, acoustic features can be further divided into spectral, prosodic, voicing-related and modulation features; we detail them in the next sections. They can be exploited in two approaches: dynamic or static [102, 94]. In the dynamic approach, a classifier is optimized directly on the LLD values. In the static approach, the LLDs first undergo a set of statistical measures over a duration and are then passed to a classifier. To reduce the growing dimensionality of the features, a set of techniques can be used, e.g. Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA); calculating a mean and a standard deviation for a temporal window [78] or for the whole laughter segment [49]; polynomial fitting that approximately describes the curve of feature values using a p-th order polynomial (the best results seem

to be produced by a quadratic polynomial [77]); or the Correlation Feature Selection (CFS).

Spectral-related features

The Perceptual Linear Predictive (PLP) analysis uses some concepts from psychoacoustics and is more consistent with human hearing than conventional linear predictive (LP) analysis. The representation of an audio signal is low-dimensional. PLP analysis is computationally efficient and speaker-independent [44]. In [118, 76], 13 PLP coefficients per frame were used, while in [77] and [49] only 7 PLP coefficients were calculated, which was found to lead to a better performance in laughter recognition: an F-measure of 64% was obtained for 13 coefficients on the AMI corpus using neural networks [76], while an F-measure of 68% was obtained for 7 coefficients on the same corpus, also using neural networks [77]. In all the above cases, their delta values were calculated as well.

The RelAtive SpecTrAl-Perceptual Linear Predictive (RASTA-PLP) technique suppresses the spectral components that are outside the range of a typical rate of change of speech (the vocal tract shape) [45]. Thus, it adds some filtering capabilities for channel distortions to PLP features. In [31], it was shown that these features result in better performance than PLP in speech recognition tasks in noisy environments. RASTA-PLP features were tested in [89] on a laughter detection task, comparing GMM and HMM. GMM classifiers performed slightly better (mean AUC-ROC of 0.825) than HMM classifiers (mean AUC-ROC of 0.822).

The Mel-frequency cepstrum is a representation of the short-term power spectrum of a sound, representing the speech amplitude spectrum in a compact form. Mel Frequency Cepstral Coefficients (MFCCs) have been dominant features, widely used in speech, speaker and emotion recognition tasks. Typically 13 MFCCs are used (e.g. in [53]). However, it has been reported [49] that using only the first 6 MFCCs in laughter detection results in the same performance as using 13 MFCCs, which may suggest that the characteristics of laughter are more discriminative in lower frequencies. 6 MFCCs were later used in [79, 78].

Prosodic features

Prosody reflects several features of the sound like its intonation or stress, thus prosodic features are used in speech, speaker and emotion recognition. The prosodic features of a sound signal are variations in pitch (the perceived fundamental frequency of a sound), loudness (the perceived physical strength of a sound), voice quality (the perceived shape of acoustic formants) and duration patterns (rhythm). They were therefore used in laughter recognition for their ability to describe dynamic patterns, e.g. in [118, 78].
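To make the frame-based, static approach concrete, the sketch below computes MFCC low-level descriptors on 25 ms frames with a 10 ms shift and summarizes them with mean and standard deviation functionals per segment. It is only an illustration: the thesis extracts its features with openSMILE, whereas librosa is used here for convenience, and the choice of 6 coefficients follows the observation from [49] mentioned above.

import numpy as np
import librosa

def static_mfcc_functionals(wav_path, n_mfcc=6, win_s=0.025, hop_s=0.010):
    """Frame-level MFCC LLDs (25 ms window, 10 ms shift) plus their deltas,
    summarized by mean and standard deviation over the whole segment."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(win_s * sr),
                                hop_length=int(hop_s * sr))
    delta = librosa.feature.delta(mfcc)                # delta coefficients
    llds = np.vstack([mfcc, delta])                    # shape: (2*n_mfcc, n_frames)
    # static approach: one fixed-length feature vector per segment
    return np.concatenate([llds.mean(axis=1), llds.std(axis=1)])

# feature_vector = static_mfcc_functionals("laughter_segment.wav")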

Voicing-related features

Bickley and Hunnicutt [11] found that the ratio of unvoiced to voiced duration in laughter signals is decidedly greater than the typical ratio for spoken English. This characteristic was used in [118] by calculating the fraction of locally unvoiced frames and the degree of voice breaks. Voice breaks are determined by dividing the total duration of breaks between voiced parts of the signal by the total duration of the analyzed segment of signal.

Modulation spectrum-related features

The rhythm and the repetitive syllable sounds, e.g. the vowel sounds which are characteristic of most laughter, can be extracted by calculating the amplitude envelope, via a Hilbert transform which employs the discrete Fourier transform (DFT), then applying a low-pass filter and down-sampling. The expectation is to capture the repeated high-energy pulses which occur at roughly regular intervals in laughter [11]. In [118], the first 16 coefficients of the DFT are used as features, whereas in [49] the system uses the first 20 coefficients of the DFT.

Acoustic-based verbal features

Although all the preceding features are non-verbal, there also exists an approach to extract verbal features, i.e. by quantifying the presence and the sequencing of linguistic or pseudo-linguistic units, via Bag-of-Words and n-grams [17, 20], respectively. These types of units can be detected automatically, in an unsupervised way, i.e. a data-driven approach, e.g. voiced/unvoiced segments or pseudo-vowels/pseudo-consonants. It is also possible to recognize words in a supervised manner; however, this requires the use of a classifier like Hidden Markov Models [21]. The idea of combining the automatic detection of units in a speech signal was proposed originally in [107]; it was successfully applied to the emotion recognition task on the SEMAINE corpus. The obtained scores were higher than for the method using manually transcribed words, i.e. the unweighted accuracies for the arousal and the valence dimensions were, respectively, 6% and 5.1% higher than in the experiment using words. In this thesis, we reuse this method by applying it to laughter recognition, as in [69], however with other types of acoustic and rhythmic units.
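A minimal sketch of the Bag-of-Words and n-gram representations over automatically detected units is given below. It assumes each segment has already been converted into a sequence of unit labels (here a hypothetical voiced/unvoiced alphabet 'V'/'U'); the example sequences are invented for illustration only.

from collections import Counter
from itertools import chain

def bag_of_units(sequence):
    """Bag-of-Words over unit labels: counts only, order is discarded."""
    return Counter(sequence)

def ngrams(sequence, n):
    """n-grams over unit labels: counts of length-n sub-sequences, so order matters."""
    return Counter(tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1))

def vectorize(counts, vocabulary):
    """Map a Counter onto a fixed vocabulary so that every segment yields
    a feature vector of the same length."""
    return [counts.get(term, 0) for term in vocabulary]

# invented example sequences: 'V' = voiced unit, 'U' = unvoiced unit
seq_laugh = list("UVUVUVUVU")       # alternating short voiced notes
seq_speech = list("VVVVUVVVVV")     # mostly voiced

vocab_1g = sorted(set(chain(seq_laugh, seq_speech)))
vocab_3g = sorted(set(ngrams(seq_laugh, 3)) | set(ngrams(seq_speech, 3)))

x_laugh = (vectorize(bag_of_units(seq_laugh), vocab_1g)
           + vectorize(ngrams(seq_laugh, 3), vocab_3g))
x_speech = (vectorize(bag_of_units(seq_speech), vocab_1g)
            + vectorize(ngrams(seq_speech, 3), vocab_3g))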

2.3 Popular classification methods

This section briefly describes the machine learning algorithms that are commonly used in automatic laughter recognition.

Support Vector Machines (SVMs)

The SVM is a machine learning method for solving binary classification problems. The general idea is to map input vectors that are non-linearly separable into a higher-dimensional feature space that allows a linear separation of the two classes [23]. The optimization of SVMs was for a long time a bottleneck of this method, since the training required solving very large quadratic programming (QP) problems [92]. However, in 1998 Platt [81] proposed the Sequential Minimal Optimization (SMO) algorithm, which breaks a QP problem into several, smallest possible, QP problems. SVMs show a good generalization performance in many classification problems, e.g. handwritten digit recognition [56] or face detection [67]. For laughter recognition they were used, inter alia, in [89], and they are very popular in many speech-related decision problems for their ability to deal efficiently with very large feature vectors, e.g. 5-6k features.

Gaussian Mixture Models (GMMs)

The Gaussian Mixture Model (GMM) is a parametric probability density function represented as a weighted sum of Gaussian component densities and is widely used to model continuous measurements or biometric features [90]. It has found applications in systems for speaker identification or hand geometry detection [98]. It was successfully used for automatic laughter recognition in different studies [47, 89, 118].

Artificial Neural Networks (ANN)

The Artificial Neural Network is a computational methodology of analysis inspired by biological neuronal networks [126]. The model, based on our knowledge of the structure and functions of neurons (i.e. neurocomputing), is composed of layers of computing nodes interconnected by weighted lines [26]. The ANN has been widely used in different areas, e.g. speech recognition, medicine [50] or forecasting [115]. It was also used for automatic laughter recognition [53, 77, 76, 78, 79]. However, such systems are not easy to use on both large datasets and large feature vectors due to the time needed for the training phase, which can take up to several weeks on very powerful computers.
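As an illustration of how such classifiers are typically evaluated in this setting, the sketch below trains an SVM in a leave-one-subject-out protocol and reports the unweighted average recall (the WA used in this thesis). It uses scikit-learn rather than the WEKA SMO implementation employed in the thesis, and random placeholder data instead of real feature vectors.

import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))             # placeholder feature vectors
y = rng.integers(0, 2, size=200)           # 0 = speech, 1 = laughter
subjects = rng.integers(0, 22, size=200)   # subject ID per segment (22 subjects)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

recalls = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
    clf.fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    # unweighted average recall over the two classes
    recalls.append(recall_score(y[test_idx], y_pred, average="macro"))

print("mean WA over held-out subjects: %.3f" % np.mean(recalls))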

2.4 Actual performance

A number of attempts have been made so far at building laughter recognition systems. This section presents a summary of experiments done with audio-only, video-only and audiovisual features.

Audio-only

The majority of experiments that use only the audio signal to recognize laughter are focused on spectral features, especially PLP and MFCC. The best performance (F-measure of 77%) among audio-only experiments was achieved using 7 PLP coefficients and their delta values (14 features in total) with a context window of 320 ms [77]. The system used neural networks and was trained on the AMI corpus. The Adaptive Boosting meta-algorithm selected the best features for classification. A slightly worse result (F-measure of 75%) was achieved in another experiment in [77], with the same set of 14 features and neural networks, however using mean and standard deviation values over the context window. Without any context information, this system achieved an F-measure of 68%. A system in [76] obtained an F-measure of 62% while using 13 PLP coefficients and their delta values (26 features in total), also using neural networks trained on the AMI corpus. Another study [118] that exploits PLP features was built using a GMM classifier trained on the ICSI corpus. It achieved an EER of 13% when using 13 PLP coefficients and their delta values as features, and 19% when using global pitch and voicing-related features.

High results were also achieved using MFCCs. In [53], neural networks with one hidden layer composed of 200 neurons were trained on the ICSI corpus using 13 MFCCs and the highest normalized cross-correlation value found to determine F0 (AC PEAK), along with a context window of 750 ms. The system achieved an EER of 8.15%, while without the AC PEAK the EER increased. In [74], 13 MFCCs with their delta values were computed every 10 ms over a window of 40 ms. Time-delay neural networks with one hidden layer were trained on the SEMAINE database and achieved F-measures of 95.9% and 17.1% for the detection of speech and laughter, respectively.

An experiment using features other than spectral ones was performed in [69]. Their system used the n-gram approach based on automatically acquired acoustic segments using Automatic Language Independent Speech Processing (ALISP) models [22, 51], trained on the SEMAINE and AVLC databases and evaluated on the MAHNOB database. 3-grams and 5-grams yielded a similar F-measure of about 75%; however, a small difference in precision and recall was noted between them, i.e. the 3-gram model showed better recall but smaller precision than the 5-gram model.
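Several of the cited systems report the equal error rate (EER). A minimal sketch of how an EER can be computed from detection scores via the ROC curve is shown below, with made-up labels and scores used purely for illustration.

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, scores):
    """EER: the operating point where the false-acceptance rate (FPR)
    equals the false-rejection rate (1 - TPR)."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2.0

# toy example: higher score = more laughter-like
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.3, 0.9])
print("EER = %.2f" % equal_error_rate(y_true, scores))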

Video-only

The best performance of systems using only video features was achieved in [77], with an F-measure of 83%. Their system used neural networks trained on the AMI corpus with 20 facial points projected onto 4 principal components (PCs 7 to 10) as the feature vector. A worse result, an F-measure of 60%, was achieved in [47], using GMM classifiers with 10 facial points, trained on their own dataset (composed of seven 4-8 minute dialogs). Another study [89] trained an SVM on the AMI corpus using 20 facial points and achieved an EER of 13%. In [74], a 3D tracker capturing facial expression over 113 facial points [66] was tested. Time-delay neural networks with one hidden layer were trained on the SEMAINE database and achieved F-measures of 89.4% and 9.8% for the detection of speech and laughter, respectively.

Audio-visual

Among systems using both the audio signal and the video sequence for laughter recognition, the best performance, i.e. an F-measure of 88%, was achieved in [78]. The system, composed of neural networks trained on the AMI corpus, uses, as audio features, 6 MFCCs with their delta values and the mean and standard deviation values of pitch and energy calculated over a context window of 320 ms, and 20 facial points projected onto 5 PCs (6 to 10) as video features. The fusion is done at the feature level. The same performance was achieved in [79], but using detection algorithms instead of classification. The system, like the previously mentioned one, is composed of neural networks, but trained on the SAL corpus and tested on the AMI corpus. It also uses 6 MFCCs but only 4 PCs, and the classification is done by a prediction technique. The prediction is made in three dimensions: (1) the value of the current frame is predicted from the past values, (2) the current audio frame is predicted from the current video frame and (3) the current video frame is predicted from the current audio frame. A slightly worse performance was achieved in [77]. The system, composed of neural networks trained on the AMI corpus, uses 7 PLP coefficients and their delta values as audio features and 4 PCs as video features. When the modalities were fused at the decision level, the system achieved an F-measure of 86% in comparison to 83% when the fusion was done at the feature level. In another study [76], a similar system was built, but instead of 7 PLP coefficients, 13 and their delta values were used. In that study the decision-level fusion also performed better than feature-level fusion, i.e. F-measures of 82% and 81%, respectively; however, the results were worse than when using 7 PLP coefficients + their delta values. In [74], a decision-level fusion of 13 MFCCs with their delta values and a 3D tracker capturing facial expression over 113 facial points [66] was tested. Time-delay neural networks with one hidden layer were trained on the SEMAINE database and achieved F-measures of 97.5% and 25.5% for the detection of speech and laughter, respectively.
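The difference between feature-level and decision-level fusion mentioned above can be sketched as follows. The arrays are random placeholders, a logistic regression stands in for the neural networks used in the cited works, and training and scoring on the same data is kept only to keep the illustration short.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_audio = rng.normal(size=(300, 14))   # e.g. 7 PLP coefficients + deltas per segment
X_video = rng.normal(size=(300, 4))    # e.g. 4 principal components of facial points
y = rng.integers(0, 2, size=300)       # 0 = speech, 1 = laughter

# Feature-level fusion: concatenate the modalities, train one classifier.
clf_feat = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_video]), y)
y_feat = clf_feat.predict(np.hstack([X_audio, X_video]))

# Decision-level fusion: one classifier per modality, then average the posteriors.
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(X_video, y)
p_fused = 0.5 * clf_a.predict_proba(X_audio) + 0.5 * clf_v.predict_proba(X_video)
y_dec = p_fused.argmax(axis=1)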


Chapter 3

Automatic laughter recognition based on acoustic features

In this chapter, we present the first part of our experiments on the recognition of laughter in spontaneous speech signals. We investigate the usability of acoustic features extracted from the speech data. All experiments are performed on the MAHNOB corpus [75], which is the most recent database containing speech and laughter episodes produced by the same subjects, which makes it relevant for the development of a system that distinguishes laughter and speech characteristics. This corpus has already been used in research on laughter recognition, i.e. [69, 75]. A brief description of the MAHNOB database was presented in chapter 2. In the first section of this chapter, we explain the structure of the chosen corpus and the types of annotations it uses. Next, we describe the types and the extraction methods of the features that we tested, i.e. the feature sets proposed by the INTERSPEECH Challenges and new feature sets related to voice quality based on formants. In the subsequent section, we illustrate the classification and evaluation procedures. In the last section of this chapter, we present and discuss the results obtained for the three discrimination tasks that we performed: (1) speech vs laughter, (2) speech vs speech-laugh and (3) speech vs all types of laughter. Moreover, each task is divided into three parts: (I) the first part presents the results achieved for the feature sets from the INTERSPEECH Challenges, (II) the second part presents the results obtained with our feature sets based on formants and (III) the last one presents the results achieved with the fusion of the best feature sets from the two previous parts.

3.1 MAHNOB database

Structure

The MAHNOB database contains 180 sessions recorded from 22 subjects from 12 different countries using a camera with a built-in microphone (2 channels, 48 kHz, 16 bits) and a lapel microphone (1 channel, 44.1 kHz, 16 bits). Each session is named after the combination of the subject ID and the session number, and it corresponds to the time spent watching 1 to 5 funny video clips, depending on their length. Subjects were not aware (with the exception of the three authors participating in the study) of either the content of the clips or the purpose of the research. Additionally, every subject performed two extra speech sessions, in which they spoke for about 90 seconds in English and for the same amount of time in their native language. This makes the corpus multilingual and allows research on the influence of language on the discrimination of laughter from speech. Moreover, each subject was asked to produce laughter without any humorous stimuli (i.e. posed laughter); however, more than half of them found it difficult to do, as confirmed in [96]. In all, the corpus consists of 90 laughter sessions, 38 speech sessions and 23 posed laughter sessions. These sessions contain 344 speech episodes, 149 laughter episodes, 52 speech-laugh episodes and 5 posed-laughter episodes.

Annotations

Start and end points of a speech signal are quite easy to determine. However, it is much harder for laughter episodes, since it is not clearly defined how a laughter episode should be divided [117]. The MAHNOB corpus follows the principle, proposed in [9], that laughter is any sound expression that would be characterized as laughter by an ordinary person in ordinary circumstances. Annotations were added by one human annotator using the audio channels, with the support of the video channel where the segmentation was not obvious. They consist of 9 classes; however, we focused only on 4 of them (speech, laughter, speech-laugh and posed laughter), since the rest are either related to smiling (e.g. posed smile), while we are interested uniquely in laughter, or are extensions of the 4 selected ones (e.g. laughter + inhalation). Annotations are stored in three different formats per session: (1) a Comma Separated Value (CSV) file with the start and end times in seconds (for audio processing) and the start and end frame numbers (for video processing); (2) an ELANAnnotation file (i.e. the software that was used to perform the annotation [18]); and (3) a FaceTrackingAnnotation file. In addition, each session is provided with an XML file that specifies general information like session ID and recording time, but also some clues specific to the subject, like his/her ethnicity and whether or not he/she has glasses, a beard or a mustache.
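As an illustration of how such time-stamped annotations can be used to cut a session recording into labeled segments, a minimal Python sketch is given below. The column names (label, start_sec, end_sec) and file names are assumptions made for the example and do not reflect the actual MAHNOB file layout.

import csv
import os
import soundfile as sf   # pysoundfile, for reading and writing WAV data

def cut_segments(wav_path, csv_path, out_dir="segments"):
    """Slice one session recording into labeled WAV segments, using a CSV
    with the (assumed) columns: label, start_sec, end_sec."""
    os.makedirs(out_dir, exist_ok=True)
    audio, sr = sf.read(wav_path)
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            start = int(float(row["start_sec"]) * sr)
            end = int(float(row["end_sec"]) * sr)
            sf.write(os.path.join(out_dir, "%04d_%s.wav" % (i, row["label"])),
                     audio[start:end], sr)

# cut_segments("session_lapel.wav", "session_annotations.csv")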

3.2 Non-verbal acoustic features

Audio recordings were split into segments according to the time stamps provided in the annotation files. Non-verbal acoustic features were extracted using openSMILE and the configuration files from the popular INTERSPEECH ComParE international challenges on paralinguistics. These features are detailed in the next sections. The aim of this task is to obtain characteristics for each segment and prepare them for the classification task. The next, optional, step is correlation-based feature selection (CFS). Its goal is to reduce the set of features to a subset that is highly correlated with laughter while having low intra-correlation, which therefore reduces redundancy among the features (i.e. features are not correlated with each other) [42]. The CFS accelerates the classification, since it is done only once for every application, and the size of the feature subset can be very small, e.g. from 6k down to less than 100 features. For this purpose, we use the WEKA data mining software [41] with the Best-First search to find the best subset of features. We also use WEKA for the classification task, because this software allows comparisons of performance with other results from the literature [103, 105, 104, 106, 48, 65, 123, 8].

WEKA uses a specific file format, the Attribute-Relation File Format (ARFF), which is divided into two sections: the Header and the Data. The Header section contains the name of the relation (which is meta-data describing the type of information present in the file) and a list of attributes (features) with their names and types. The last attribute always specifies the classes. Lines that begin with a % are comments. Listing 1 presents an example, taken from [1], of the header of an ARFF file that describes iris plants. The second section, the Data, contains the values of the specified attributes. Each line represents one instance. Values are comma-separated and the last one must match one of the specified classes. Attributes whose values are unknown must be written with a question mark (which corresponds to a NaN value in programming). Listing 2 presents an example, taken from [1], of the data section of an ARFF file that describes iris plants. WEKA provides a functionality to concatenate the data from two files if they specify the same attributes. The concatenation of attributes is, however, not supported by WEKA, but can be achieved with a custom script.

Listing 1: A header of an ARFF file describing iris plants [1].

% 1. Title: Iris Plants Database
%
% 2. Sources:
%      (a) Creator: R.A. Fisher
%      (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
%      (c) Date: July, 1988
%
@RELATION iris

@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth  NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth  NUMERIC
@ATTRIBUTE class       {Iris-setosa,Iris-versicolor,Iris-virginica}

Listing 2: A data section of an ARFF file describing iris plants [1].

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
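A minimal sketch of such a custom script is shown below: it loads two ARFF files that describe the same instances in the same order, drops the duplicated class attribute and concatenates the remaining attributes column-wise. It is only an illustration (it writes the merged set to CSV, which WEKA can also import), not the script used in this thesis, and the file and attribute names in the usage comment are hypothetical.

from scipy.io import arff
import pandas as pd

def concatenate_attributes(arff_a, arff_b, class_attr="class"):
    """Column-wise merge of two ARFF files that describe the same instances
    in the same order; the class attribute of the first file is kept once."""
    df_a = pd.DataFrame(arff.loadarff(arff_a)[0])
    df_b = pd.DataFrame(arff.loadarff(arff_b)[0])
    assert len(df_a) == len(df_b), "both files must describe the same instances"
    merged = pd.concat([df_a.drop(columns=[class_attr]),
                        df_b.drop(columns=[class_attr])], axis=1)
    merged[class_attr] = df_a[class_attr]
    return merged

# merged = concatenate_attributes("is10_features.arff", "formant_features.arff")
# merged.to_csv("fused_features.csv", index=False)   # WEKA can import CSV as well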

INTERSPEECH Challenge feature sets

INTERSPEECH (IS) is the Annual Conference of the International Speech Communication Association (ISCA) [2]. Since 2009, each year the conference has been dedicated to a different issue concerning speech communication science and technology, from both theoretical and empirical points of view. In addition, every year brings a new challenge for spoken language processing specialists. Topics like paralinguistics, speaker state and speaker traits have been the main subjects of the challenges. As a result, after every challenge, a set of features that obtained the best performance for a particular task is published. The distribution of the types of features, for the last four years, is presented in Table 3.1. More details on the features can be found in the corresponding publications.

Table 3.1: Distribution of LLDs according to feature type from the INTERSPEECH ComParE challenge feature sets between 2010 and 2013; Total corresponds to: LLDs x functionals.

Year / reference | Energy related | Spectral related | Voicing related | Total
2010 / [103] | | | |
2011 / [105] | | | |
2012 / [104] | | | |
2013 / [106] | | | |

In order to better understand the purpose of the selection of features for particular challenges, we present below short summaries of the tasks for which they were designed:

INTERSPEECH 2010 addresses three sub-challenges: (1) the Age Sub-Challenge aims to classify a speaker into one of four age groups (children, youth, adults or seniors); the baseline result (unweighted average recall, i.e. weighted accuracy (WA)) is 48.91%; (2) the Gender Sub-Challenge aims to classify a speaker as male, female or child; the baseline result is 81.21%; (3) the Affect Sub-Challenge is a regression task of detecting the speaker's state of interest in ordinal representation; the baseline result is expressed as the Pearson correlation coefficient.

INTERSPEECH 2011 addresses two sub-challenges: (1) the Intoxication Sub-Challenge aims to classify a speaker's alcoholisation level as alcoholised (blood alcohol concentration (BAC) higher than 0.5 per mill) or non-alcoholised (BAC equal to or below 0.5 per mill); the baseline result (WA) is 65.9%; (2) the Sleepiness Sub-Challenge aims to classify the level of a speaker's sleepiness as sleepy (for values above 7.5/10 on the Karolinska Sleepiness Scale (KSS)) or non-sleepy (for values equal to or below 7.5 on the KSS); the baseline result (WA) is 70.3%.

INTERSPEECH 2012 addresses three sub-challenges: (1) the Personality Sub-Challenge aims to classify a speaker along the five OCEAN personality dimensions [125], each mapped onto two classes; the baseline result (mean of the WA over all five classification tasks) by SVM is 68.0% and 68.3% by random forests; (2) the Likability Sub-Challenge aims to classify the likability of a speaker's voice into one of two classes, although the annotation provides likability on multiple levels; the baseline result (WA)

is 55.9% with SVM and 59.0% with random forests; (3) the Pathology Sub-Challenge aims to determine the intelligibility of a speaker in a pathological condition; the baseline result (WA) is 68.0% with SVM and 68.9% with random forests.

INTERSPEECH 2013 addresses four sub-challenges: the Social Signal Sub-Challenge aims to detect and localize non-linguistic events of a speaker, such as laughter or sighs; the baseline result (WA) is 83.3%; the Conflict Sub-Challenge aims to detect conflicts in group discussions; the baseline result (WA) is 80.8%; the Emotion Sub-Challenge aims to classify the user's emotion into one of 12 emotional categories; the baseline result (WA) is 40.9%; the Autism Sub-Challenge aims to determine the type of pathology of a speaker in two evaluation tasks, typically vs. atypically developing children and a diagnosis task classifying into one of four disorder categories; the baseline results (WA) are 90.7% and 67.1%, respectively.

Since paralinguistics is concerned with the non-verbal elements of speech, we expect these feature sets to be relevant to laughter recognition. We use the openSMILE toolkit, the official feature extractor of the INTERSPEECH ComParE Challenges since 2009 [33], to extract characteristics from our data. It outputs the results in different formats, including CSV, HTK (for ASR with the HTK toolbox [4]) and ARFF (for the WEKA data mining toolkit).

3.2.2 Formants and vocalization triangle

In 1948, Potter and Peterson [82] suggested that a quantitative analysis of the main trajectories of the acoustic resonances could be used to distinguish vowels. Nowadays, we know that the first two formants F1 and F2 carry clues about the place of articulation (e.g. front, central, back) and the degree of aperture (e.g. close, mid, open), determined by the position of the tongue [93]. These properties, together with roundedness, are characteristic of vowels. In 1952, Peterson and Barney [72] published a representation of the formant values of English vowels in the F1-F2 plane. The area filled by the vowels is called the vocalization triangle (cf. figure 3.1). Formants are widely used in speech recognition and emotion recognition, because they carry information about the speaker's articulatory effort [116, 120]. This, together with the fact that a laugh is a series of vowel-like notes, suggests that formants are relevant for laughter recognition. Thus, we use the values of the two formants F1 and F2, the values of their energies and the value of the vocalization area. The purpose of the vocalization area is to describe the degree of articulation of an utterance: a small F1-F2 area corresponds to a hypo-articulated utterance, while a large area corresponds to a hyper-articulated utterance.

Figure 3.1: The vocalization triangle (in red). Adapted from [93].

The script computing the vocalization area was implemented in Matlab [3] and uses the values of the first and second formants, extracted over the voiced segments with the Snack Sound Toolkit. For each voiced segment, the script looks for the two extreme (i.e. minimum and maximum) values of F2 and stores their coordinates. It then goes through all the values of F1 and selects the one that maximizes the area of the triangle formed by the two extreme F2 points and the selected F1 point. The area of the triangle is computed using Heron's formula, T = √(s(s − a)(s − b)(s − c)), where s is the semi-perimeter of the triangle, s = (a + b + c)/2 (a sketch of this computation is given after the list of feature sets below).

We apply 29 statistical measures (cf. Table 3.2) to these values; the resulting set is thus composed of 145 voice quality features. In addition to the raw values (1), we investigated combinations of them as well as perceptual-scale modeling, in order to take psychoacoustic phenomena into account. Given the logarithmic scale of perception, the second set of features (2) is composed of the logarithmic values of F1, F2 and their energies. In the third set of features (3), the values of F1 and F2 are normalized by their respective energies. The last set (4) combines the two previous ones: the values of F1, F2 and their energies are logarithmized and, in addition, the values of F1 and F2 are normalized by their respective energies. The values of the formantic area were always computed from the raw formant values and logarithmized afterwards for set 2 (L) and set 4 (LE). In all, we created 4 sets of features:

Set 1 (R): F1, F2, E_F1, E_F2, Area
Set 2 (L): log(F1), log(F2), log(E_F1), log(E_F2), log(Area)
Set 3 (E): F1/E_F1, F2/E_F2, E_F1, E_F2, Area
Set 4 (LE): log(F1)/log(E_F1), log(F2)/log(E_F2), log(E_F1), log(E_F2), log(Area)
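The vocalization-area computation described above was implemented as a Matlab script in this work; the following Python sketch is only an illustration of the same procedure, assuming per-frame F1 and F2 tracks of one voiced segment.

import math

def triangle_area(p, q, r):
    # Heron's formula: T = sqrt(s(s-a)(s-b)(s-c)), with s the semi-perimeter.
    a = math.dist(p, q)
    b = math.dist(q, r)
    c = math.dist(r, p)
    s = (a + b + c) / 2.0
    return math.sqrt(max(s * (s - a) * (s - b) * (s - c), 0.0))

def vocalization_area(f1, f2):
    # f1, f2: formant values (Hz) over one voiced segment, where
    # (f1[i], f2[i]) is the formant pair of frame i.
    points = list(zip(f1, f2))
    p_min = min(points, key=lambda p: p[1])   # frame with the minimum F2
    p_max = max(points, key=lambda p: p[1])   # frame with the maximum F2
    # pick the F1/F2 point maximizing the triangle spanned with the two extremes
    return max(triangle_area(p_min, p_max, p) for p in points)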

Symbol: Description
1: Maximum - value of the maximum
2: Rpos_max - relative position of the maximum
3: Minimum - value of the minimum
4: Rpos_min - relative position of the minimum
5: Rpos_diff - difference between the relative positions of the maximum and the minimum
6: Range - maximum minus minimum
7: Range_norm - (maximum minus minimum) divided by the difference between their relative positions
8: Mean - mean value
9: σ_norm - standard deviation, normalized by N-1
10: Skewness - third statistical moment
11: Kurtosis - fourth statistical moment
12: Q1 - value of the first quartile (25/100)
13: Q2 - median value (50/100)
14: Q3 - value of the third quartile (75/100)
15: IQR - interquartile range
16: σ_IQR - standard deviation of the interquartile range
17: Slope - slope of the regression line
18: Onset - value of the onset (first value)
19: Target - value of the target (middle value)
20: Offset - value of the offset (last value)
21: Target - Onset - difference between the target and the onset values
22: Offset - Onset - difference between the offset and the onset values
23: Offset - Target - difference between the offset and the target values
24: ↑values / segs - average number of increasing values per segment
25: ↓values / segs - average number of decreasing values per segment
26: µ↑ - mean of the increasing values
27: σ↑ - standard deviation of the increasing values
28: µ↓ - mean of the decreasing values
29: σ↓ - standard deviation of the decreasing values

Table 3.2: Set of statistical measures used for modeling the laughter.
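As an illustration of Table 3.2, the sketch below (Python with NumPy, an illustrative stand-in for the actual implementation) computes a few of the 29 measures for a one-dimensional contour such as an F1 trajectory; it assumes a contour of at least two frames.

import numpy as np

def some_functionals(x):
    # x: 1-D contour (e.g. F1 values over a voiced segment), at least 2 frames.
    x = np.asarray(x, dtype=float)
    n = len(x)
    return {
        'maximum':  x.max(),
        'rpos_max': x.argmax() / (n - 1),               # relative position of the maximum
        'minimum':  x.min(),
        'range':    x.max() - x.min(),
        'mean':     x.mean(),
        'std':      x.std(ddof=1),                      # normalized by N-1
        'q1':       np.percentile(x, 25),
        'median':   np.percentile(x, 50),
        'q3':       np.percentile(x, 75),
        'iqr':      np.percentile(x, 75) - np.percentile(x, 25),
        'slope':    np.polyfit(np.arange(n), x, 1)[0],  # slope of the regression line
        'onset':    x[0],
        'target':   x[n // 2],
        'offset':   x[-1],
    }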

Figure 3.2: Spectrograms of speech (top) and laughter (bottom) segments with the values of the formants marked in color: red - F1, green - F2, blue - F3 and yellow - F4.

Note that, to avoid confusion, in the rest of this work we will refer to Set 1 as Set R (raw values), to Set 2 as Set L (logarithmic values), to Set 3 as Set E (values normalized by energy) and to Set 4 as Set LE (logarithmic values normalized by energy).

3.3 Classification

As classifier we used Support Vector Machines (SVM) trained with Sequential Minimal Optimization (SMO), given their small generalization error for large vectors of non-linearly separable features [93]. Our models (complexity and kernel) were optimized on a development partition using leave-one-subject-out (LOSO) cross-validation as evaluation framework, to ensure that the results are user independent and,

hence, language independent, since we used the MAHNOB database. This methodology is based on n folds, where n is the number of subjects in the corpus, i.e. n = 15 in our study. For each fold, one subject is kept as the testing set, on which no information or optimization is used or performed, whereas the remaining subjects are divided equally into two partitions: a training set and a development set. In addition, we applied permutations while selecting the subjects of the training and development sets, so that the data is randomly distributed. The first part of each experiment consists of training the system on the training set and optimizing the performance on the development set in order to select the best setting of the SVM classifier, i.e. the value of the complexity parameter (values ranging from 10^-5 to 1), the type of kernel (polynomial or radial basis function (RBF) kernel) and the value of the exponent for the polynomial kernel (1, 2 or 3) or of gamma for the RBF kernel (values ranging from 10^-6 to 10^-3).

Because the distribution of classes is unbalanced in the data, we use the weighted accuracy (WA, i.e. the unweighted average recall) as primary evaluation measure, even if we also give the unweighted accuracy (UA, i.e. the weighted average recall) for informative purposes. These values are calculated with the help of a confusion matrix. Table 3.3 shows an example of a confusion matrix for the two classes Speech and Laughter.

                     Predicted: Speech        Predicted: Laughter
Actual: Speech       True Speech (TS)         False Laughter (FL)
Actual: Laughter     False Speech (FS)        True Laughter (TL)
Table 3.3: An example of a confusion matrix, used to calculate the unweighted accuracy and the weighted accuracy.

Once the matrix is filled with the actual values, we calculate our evaluation measures using the following equations:

Weighted accuracy (WA) = ( TS/(TS+FL) + TL/(TL+FS) ) / 2
Unweighted accuracy (UA) = (TS+TL) / (TS+FS+TL+FL)

The setting that obtains the best score (i.e. the highest WA) is then tested once on the testing set of each of the 15 folds, and the mean value is taken as the final score. For each set of features we perform three tasks: the first and second tasks (speech vs laughter and speech vs speech-laugh, respectively) are 2-class discrimination problems; the third one, speech vs all types of laughter (including laughter, speech-laugh and posed laughter), is a one-vs-all discrimination problem.
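The two measures can be computed directly from the counts of Table 3.3, following the equations above; the short Python sketch below is illustrative.

def wa_ua(ts, fl, fs, tl):
    # ts, fl, fs, tl: counts of the confusion matrix of Table 3.3
    # WA = unweighted average recall: mean of the per-class recalls
    wa = (ts / (ts + fl) + tl / (tl + fs)) / 2.0
    # UA = weighted average recall: overall proportion of correct decisions
    ua = (ts + tl) / (ts + fl + fs + tl)
    return wa, ua

# e.g. wa_ua(ts=300, fl=20, fs=10, tl=90)  (arbitrary example counts)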

We expect worse performance for the speech vs speech-laugh and speech vs all experiments than for speech vs laughter, since both classes in these two experiments contain some segments of speech. Moreover, a speech-laugh instance may contain only one or two words expressed while laughing, whereas all the others are uttered in a non-laughing style, which complicates the discrimination of speech-laugh from speech. An example waveform of such a case is presented in figure 3.3: between 0.0 and 1.0 second the subject speaks in a non-laughing way, and approximately between 1.05 and 1.4 seconds he laughs while speaking.

Figure 3.3: A waveform (top) and the corresponding spectrogram (bottom) of a segment annotated as speech-laugh; between 0.0 and 1.0 second the segment contains normal speech, and approximately between 1.05 and 1.4 seconds the speech is interleaved with laughter.

3.4 Results

In this section, we present the most important results achieved for each of the three experiments. In addition to the results for the feature sets presented in section 3.2, we also present the results obtained for a feature set composed of a combination of the two feature sets that achieved the best scores in each of the two groups, i.e. the IS feature sets and those based on formants. We expect this fusion to complement the features and, in consequence, improve the performance.

3.4.1 Speech vs Laughter

The results of the experiment distinguishing speech from laughter are in line with expectations. We can observe that the performance of all IS sets is very high, far above the chance level, which is 69.8%.

Table 3.4: Classification Speech vs Laughter: unweighted accuracy (UA) and weighted accuracy (WA), in %, for the full set and the CFS set of each tested feature set (the INTERSPEECH feature sets IS10, IS11, IS12 and IS13; the formant-based Sets R, E, L and LE) and the fusions of the best of each group (IS11 + Set LE and IS12 + Set LE).

Two sets obtained the same best score: IS11 and IS12, the feature sets tailored for detecting speaker states (such as the level of intoxication or sleepiness) and speaker traits (such as personality or the likability of the speaker's voice), respectively. Those two were selected for a fusion with the feature set based on formants. The application of the CFS did not enhance the performance in any case, but the scores were already very high. In contrast, the scores of all the feature sets based on formants are slightly better when the CFS is applied. The reason is the complexity of redundancy removal: it is easier to remove the redundancy in a small set of features that is already very efficient than in a large ensemble where the redundancy is much more manifest. Moreover, the performance of the formant-based sets is remarkable given that, using as few as 145 features, they model only a small component of the prosody with few parameters, in comparison to the IS feature sets, which contain over 4k features representing three large categories of prosodic information. However, the fusion of both IS11 and IS12 with the LE feature set slightly improved the performance, showing the complementarity of our new feature set with the IS-based ones. These sets also obtained scores much better than the chance level, although worse than any IS

feature set. We can observe that, among the CFS sets, the score obtained by Set E is higher than the score of the raw values (Set R). The same relationship can be noted between Sets L and LE, which suggests that the normalization of the formant values by their energy is pertinent for laughter recognition. Moreover, both of these sets (i.e. L and LE) obtained better results than the sets without the logarithm applied (i.e. R and E), which demonstrates the importance of the logarithmic scale of perception for formants. The repartition of the features of the best set, i.e. Set LE, after applying the CFS on the whole corpus (cf. figure 3.4) shows that more than half of the new set is composed of features related to the 1st formant, which suggests strong discriminative properties; however, it could also mean that the 2nd formant is not extracted as precisely as the 1st one. Moreover, it shows that only 8 formant-based features suffice to obtain a good performance in laughter recognition, comparable to sets composed of more than 1k features. Finally, the fusion of IS11 and Set LE slightly improved the score, which confirms our expectations. As for the IS feature sets, the CFS did not significantly change the results of the fusion. The results, unweighted accuracy (UA) and weighted accuracy (WA), for all IS feature sets, the feature sets based on formants and the fusion of the best ones, are presented in table 3.4.

Figure 3.4: Redistribution of the Set LE features after applying the CFS: 5 features (63%) combining log(F1) and log(E(F1)), 2 (25%) combining log(F2) and log(E(F2)) and 1 (13%) based on log(E(F1)).

3.4.2 Speech vs Speech-Laugh

The results of the experiment distinguishing speech from speech-laugh are not as high as for the previous experiment. This is due to a more difficult classification task, since both classes contain speech. In addition, because of a huge imbalance in the number of instances per class (344 vs 52), the chance level is very high: 86.9%. None of the tested feature sets achieved a better score than this chance level, thus we do not go into the details of the results obtained after applying the CFS. We can clearly observe, as we also did on the previous task (speech vs laughter), that the modifications of the formant-based feature set can improve the score: the set based on values normalized by their energies (Set E) obtained a score about 15% better than the raw set (Set R); the score of the set with logarithmic values (Set L) was higher than for Set R and Set E; and the set with logarithmic values normalized by their energies (Set LE) achieved the best score.

Table 3.5: Classification Speech vs Speech-Laugh: unweighted accuracy (UA) and weighted accuracy (WA), in %, for the full set and the CFS set of each tested feature set (IS10-IS13; Sets R, E, L and LE) and the fusion of the best of each group (IS11 + Set LE).

This shows the importance of each successive step that we have proposed, i.e. using a logarithmic scale to represent the formants and their energy, as well as weighting the formant values by their respective energy values, in order to take perception phenomena into account by measuring the voice quality. That set was used in the fusion together with IS10; however, it did not improve the score. The results, unweighted accuracy (UA) and weighted accuracy (WA), for all IS feature sets, the feature sets based on formants and the fusion of the best ones, are presented in table 3.5.

3.4.3 Speech vs All types of laughter

The last experiment in this section, the discrimination of speech from all types of laughter, brought good results. The number of instances in this experiment was the most balanced among the three experiments, and the chance level is thus 62.5%. All scores of the IS feature sets are approximately 90%. The best one, 91%, was obtained by IS12, a feature set tailored for speaker trait classification, such as evaluating a speaker's intelligibility level while reading a text under different pathological conditions. However, the results obtained by all the IS feature sets are very close to each other. The feature sets based on formants followed our expectations (except Set E, which was the worst in this experiment), i.e.

Table 3.6: Classification Speech vs All types of laughter: unweighted accuracy (UA) and weighted accuracy (WA), in %, for the full set and with CFS for each tested feature set (IS10-IS13; Sets R, E, L and LE) and the fusion of the best of each group (IS12 + Set LE).

using both logarithmic values and weighting. The best result among these feature sets was achieved, as in the two previous experiments, by Set LE. We can see that the CFS improved all the scores of the formant-based feature sets. The repartition of the features of the best set, i.e. Set LE, after applying the CFS on the whole corpus (cf. figure 3.5) shows that the new set of selected features consists of only 23 of the original 145 features, and that about half of them are related to the 1st formant, as was also observed in the previous experiment (speech vs laughter). This suggests that the position of the tongue is very important for laughter recognition. At the same time, we can imagine that the variability of F1 must be weak for laughter compared to speech, since the position of the tongue is almost stable during a laughing phase. This may be less pertinent for F2, which corresponds to the shape of the mouth and can vary with the intensity of the laughter. However, the larger number of selected features in comparison to the experiment mentioned above, and the fact that other types of features were kept, i.e. features based on the formantic area, suggest that, for more complicated tasks such as the discrimination of speech from different types of laughter, which can contain some speech episodes as well, more voice quality information is needed. The fusion of the best feature sets, i.e. IS12 and Set LE, did

not improve the performance; the obtained result was similar to the one of IS12. The results, unweighted accuracy (UA) and weighted accuracy (WA), for all IS feature sets, the feature sets based on formants and the fusion of the best ones, are presented in table 3.6.

Figure 3.5: Redistribution of the Set LE features after applying the CFS: 11 features (48%) combining log(F1) and log(E(F1)), 5 (22%) combining log(F2) and log(E(F2)), 3 (13%) based on log(E(F1)), 2 (9%) based on log(E(F2)) and 2 (9%) based on log(Area).

3.5 Conclusions

In this chapter we presented two approaches for the automatic recognition of laughter based on acoustic features. All tests were done on the MAHNOB laughter database, which contains 15 different subjects, using leave-one-subject-out cross-validation in order to investigate both language- and speaker-independent automatic laughter recognition. We chose the SVM as classifier and the weighted accuracy as evaluation measure.

The first approach uses the feature sets of the INTERSPEECH Challenges, which were tailored for paralinguistic classification tasks. Among the four tested feature sets, three obtained the best scores in different experiments. Moreover, all the results obtained by the IS feature sets are very close to each other. The best score in distinguishing speech from all types of laughter, which, in our opinion, is the most complex task, is 91% (whereas the chance level is 62.5%). The CFS did not improve the already very high scores.

The second approach employed feature sets based on formants. We tested four sets of features, which are, in fact, combinations of the base feature set composed of the raw formant values, the formant energy values and the corresponding formantic area. The logarithmic scale of perception and the normalization of the values are taken into account by applying logarithms and by weighting the formant values by their respective energy values, respectively. We observed that each successive step improved the performance: weighting increased the scores compared to the feature set of raw formant values, and applying logarithms improved the scores in relation to the feature sets based on non-logarithmic values. The set composed of logarithmic formant values normalized by the energy (i.e. Set LE) obtained the highest scores among all formant-based feature sets for all three experiments, i.e. 93%, 75% and 86% for speech vs laughter, speech vs speech-laugh and speech vs all types of laughter, respectively.

The analysis of the features selected by the CFS showed that as few as 8 features are enough to obtain a good performance in laughter recognition. About half of the selected features were based on the 1st formant, which suggests its strong discriminative properties. However, for the more complicated classification, i.e. speech vs all types of laughter, more formant-based features were needed, although the difference in the total number of features remains huge in relation to the IS feature sets.

Finally, we combined the best feature sets of both groups at the feature level. Since the feature sets of the INTERSPEECH Challenges set the bar very high, it was difficult to improve the scores. However, the fusion enhanced the results of the speech vs laughter experiment, which shows that voice quality plays an important role in laughter recognition. In the next chapter we present another approach to laughter recognition, based on verbal features.


Chapter 4

Automatic laughter recognition based on verbal features

In the previous chapter, we presented an approach to laughter recognition based on the acoustic characteristics of the speech signal. We achieved a very high score with the feature sets tailored for the INTERSPEECH Challenges, a good score with the feature sets based on formants, and a small improvement in score using their fusion. However, since laughter was found to be a series of notes of similar length, repeated at intervals of similar length [84], in this chapter we investigate the relevance of the distribution and the sequencing of units based on acoustic events. Acoustic events are detected with a data-driven approach, which is more robust and efficient than the machine learning algorithms used for automatic speech recognition. The feature extraction is done automatically in a supra-segmental way, which means that the signal is divided into units of varying length, depending on the detected events.

In the first section of this chapter, we introduce the notion of acoustic events, which are related to changes in the production and perception of speech [110], and present our motivations for using them in laughter recognition tasks. We particularly focus on events related to speech production, e.g. acoustic landmarks, which correspond to abrupt changes in the articulation system, or pseudo-vowel/pseudo-consonant segments, and to speech perception, e.g. voiced/unvoiced/silent segments and the perceptual center (p-center) of speech. We use the p-centers for a direct fusion with the three types of acoustic events mentioned before. The best results are shown in the subsequent sections. The classification process is done in the same way as in the previous chapter, i.e. LOSO on the MAHNOB database with SVM. We use Bag-of-Words (BoW) to quantify the distribution of units and n-grams for their sequencing.
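To make the two representations concrete, the sketch below (Python; the unit labels and vocabularies are illustrative examples of the annotations introduced in section 4.1) builds a Bag-of-Words histogram and an n-gram count vector from a sequence of unit labels.

from collections import Counter

def bag_of_words(units, vocabulary):
    # Distribution of units: one count per entry of a fixed vocabulary.
    counts = Counter(units)
    return [counts[u] for u in vocabulary]

def ngrams(units, n, vocabulary_ngrams):
    # Sequencing of units: counts of all consecutive n-tuples.
    grams = Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))
    return [grams[g] for g in vocabulary_ngrams]

# Example with voiced (V) / unvoiced (U) / silence (S) labels:
seq = ['S', 'V', 'U', 'V', 'U', 'V', 'S']
print(bag_of_words(seq, ['V', 'U', 'S']))                    # [3, 2, 2]
print(ngrams(seq, 2, [('V', 'U'), ('U', 'V'), ('S', 'V')]))  # [2, 2, 1]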

4.1 Acoustic events

An acoustic event, or speech unit, can be a time stamp that marks the moment of a change in the articulation system, or a whole segment that represents a piece of the signal with specific acoustic properties. In this section we introduce four types of such units: (1) voiced/unvoiced/silence segments, (2) pseudo-vowel/pseudo-consonant/silence segments, (3) phonetic-based landmarks and (4) p-centers. Since the annotations of p-centers are composed of only two alternating labels (p-center/silence), we found it irrelevant to classify them separately and we decided to use them in a direct fusion (cf. section 4.3) with the rest of the units.

4.1.1 Voiced/unvoiced/silence segments

Laughter can be described as a sequence of alternating voiced and unvoiced segments [11], which are related to speech perception. Also, in [117], it was described as an alternating voiced-unvoiced pattern. A voiced segment is a piece of the speech signal where the vocal cords vibrate. Voiced and unvoiced segments can be identified using the pitch and loudness values of the signal. We use the openSMILE tool to extract these two features and a Matlab script to detect and label the segments. Figure 4.1 presents a Matlab plot with a speech signal and the detected voiced/unvoiced segments.

Figure 4.1: A speech signal (blue) with its energy (black), voiced (dark green) and unvoiced (red) segments. The black dashed line is the silence threshold. Adapted from [107].
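A minimal sketch of this labeling step is given below (Python; the thesis extracts F0 and loudness with openSMILE and performs the segmentation with a Matlab script, so the frame-wise representation and the thresholding rule used here are illustrative assumptions).

def label_frames(f0, energy, silence_thr):
    # f0: frame-wise pitch values (0 where no pitch was detected),
    # energy: frame-wise loudness/energy, silence_thr: silence threshold.
    labels = []
    for pitch, e in zip(f0, energy):
        if e < silence_thr:
            labels.append('S')      # silence: energy below the threshold
        elif pitch > 0:
            labels.append('V')      # voiced: a pitch value was detected
        else:
            labels.append('U')      # unvoiced: energy present but no pitch
    return labels

def to_segments(labels):
    # Merge consecutive identical frame labels into (label, length) segments.
    segments = []
    for lab in labels:
        if segments and segments[-1][0] == lab:
            segments[-1][1] += 1
        else:
            segments.append([lab, 1])
    return segments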

4.1.2 Pseudo-vowel/pseudo-consonant/silence segments

Pseudo-vowel and pseudo-consonant segments are related to speech production. Their detection is based on pseudo-phonemes, speech units introduced by the computational sciences that are based on the stationarity of the speech signal. In consequence, the identification of pseudo-phonemes is relevant for vowels, since their acoustic waveforms were observed to be stationary for more than 30 ms. Consonants happen to be shorter than this and their waveforms are often non-linear. More details on the method for detecting pseudo-vowel and pseudo-consonant segments can be found in [93]. We are interested in this type of units since it was reported that laughter can be described as a series of short vowel-like notes [84]. Another study [117] characterizes a typical laugh as a syllable of consonant-vowel structure.

4.1.3 Phonetic-based landmarks

Phonetic-based landmarks are events that correspond to a change in the articulation system [111]. These abrupt changes in amplitude can be observed in the acoustic signal simultaneously across wide frequency ranges. Thus, in order to automatically detect acoustic landmarks, the speech signal is divided into several frequency bands and a voicing contour is computed. For each band, an energy waveform is constructed and its time derivative is computed, which is then used to detect its peaks. The peaks represent the times of abrupt spectral changes in the bands. A landmark is detected if the peaks appear simultaneously in several bands in a specified pattern [111] and the amplitude of the change reaches an empirically derived abruptness threshold [14]. This method is implemented in the SpeechMark tool [15], which was used in our experiments. The following types of landmarks can be detected [14]:

+g (glottis): the onset of voicing;
-g: the offset of voicing;
+s (syllabicity): the onset/release of a voiced sonorant consonant;
-s: the offset/closure of a voiced sonorant consonant;
+b (burst): the onset of a burst of air following a stop or affricate consonant release, or the onset of frication noise for fricative consonants;
-b: the point where aspiration or frication noise ends;
V (vowel): the point of the harmonic power maximum.

In addition, energy changes correlated with frication patterns are also detected (+/-f and +/-v). The SpeechMark tool generates annotation files containing the time of occurrence and the type of each landmark, cf. listing 3. The tool can also visualize the analysis of a signal. Figure 4.2 presents the waveform and the corresponding spectrum of a two-syllable ("ha-ha") segment of laughter. Figure 4.3 shows three scenarios of changes in the frequency and voicing bands: (a) an energy increase is detected in the frequency bands just before the onset of voicing, but the change is not large enough in some bands, so no landmark is identified; (b) a large energy increase is detected in all frequency bands just

Figure 4.2: A waveform (top) of a short two-syllable ("ha-ha") segment of laughter produced by a subject, with landmarks indicated, and the corresponding spectrum (bottom).

Figure 4.3: Energy waveforms of 5 frequency bands and one voicing band (bottom). Adapted from [14].

before the onset of voicing, so a +b (burst) landmark is identified; (c) a large energy increase is detected in all frequency bands during voicing, so a +s (syllabic) landmark is identified.
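The multi-band principle illustrated in figure 4.3 can be sketched as follows; this is not the SpeechMark algorithm, only a simplified illustration (Python with SciPy), and the band edges, frame length and thresholds are arbitrary assumptions.

import numpy as np
from scipy.signal import butter, sosfiltfilt, find_peaks

def abrupt_change_times(x, sr, bands=((800, 1500), (1500, 2500), (2500, 3500),
                                      (3500, 5000), (5000, 7000)),
                        frame=0.005, thr_db=9.0, min_bands=4):
    # Simplified illustration of the multi-band principle: look for times where
    # the frame-wise band energy rises abruptly in several frequency bands at
    # (almost) the same moment. Band edges must stay below the Nyquist frequency.
    hop = int(sr * frame)
    band_peaks = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype='bandpass', fs=sr, output='sos')
        y = sosfiltfilt(sos, x)
        n_frames = len(y) // hop
        energy = np.array([np.sum(y[i * hop:(i + 1) * hop] ** 2) + 1e-10
                           for i in range(n_frames)])
        rise = np.diff(10.0 * np.log10(energy))      # dB change between frames
        peaks, _ = find_peaks(rise, height=thr_db)
        band_peaks.append(set(peaks.tolist()))
    # keep candidate frames with coincident peaks in at least min_bands bands
    candidates = sorted(set.union(*band_peaks))
    return [f * frame for f in candidates
            if sum(any(abs(f - p) <= 1 for p in band) for band in band_peaks)
            >= min_bands]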

Listing 3 A landmark annotation file for a short segment of laughter produced by one subject; each line contains the time of occurrence and the landmark type (here +b, +g, +s, -s and -g).

Since different sounds produce different patterns of abrupt changes, the analysis of those patterns can help to detect the types of sounds. In addition, a stemming technique can be applied to those values by treating onsets and offsets of the same group as the same word.

4.1.4 Perceptual centers

A perceptual center (p-center) is related to the perception of temporal patterns in speech, music and other temporally sensitive activities [121]. That includes isochrony: if a subject can identify an isochronous pattern, i.e. a rhythm, it means that successive p-centers appear at constant intervals. In other words, a sequence of words sounds isochronous to a listener if the locus of one word is at an equal temporal distance from the loci of the surrounding words [35]. Several methods for measuring the location of p-centers exist. The most commonly used, the rhythm adjustment method (first described in detail in [58]), involves a subject listening to a repetition of a pattern composed of two short alternating sounds, say sound A and sound B, which is not perceptually isochronous. His task is to adjust the onset of sound B until he (subjectively) feels the rhythm, while the distance between consecutive occurrences of sound A stays untouched. The result of the final adjustment is an estimate of the interval between the p-centers of sound B and sound A [121]. Another method worth mentioning is the finger-tapping method proposed in [7]. It was originally designed to discover the locus of a stress beat. In that method, a subject is asked to tap his fingers whenever he perceives a particular syllable in a sentence, the same sentence being repeated 50 times. The results showed that subjects tended to tap before the onset of the vowel in a stressed syllable, which was defined as the moment where the occurrence of the syllable is perceived [58]. An automatic method for extracting a rhythmic envelope of a speech signal was proposed in [114], cf. figure 4.4. The method uses a set of numeric filters intended to represent the process of perceiving the rhythm of speech. The rhythmic envelope allows the p-centers to be located by defining a threshold on its amplitude, which corresponds to a level of perception of the rhythmic prominence, cf. figure 4.5. We use four values of the threshold, i.e. 1/3, 1/4, 1/6 and 1/8.
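The stemming idea can be applied directly to annotation files of the kind shown in Listing 3; the sketch below (Python) assumes each line contains a time stamp and a landmark label separated by a comma, and the file name used in the example is hypothetical.

def read_landmarks(path):
    # Assumed line layout of the annotation file: "<time>,<label>", e.g. "0.135,+g".
    events = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            time, label = line.split(',')
            events.append((float(time), label))
    return events

def stem(label):
    # Stemming: treat the onset (+) and offset (-) of the same group as one word,
    # e.g. '+g' and '-g' both become 'g'; 'V' stays 'V'.
    return label.lstrip('+-')

# Hypothetical usage: words = [stem(label) for _, label in read_landmarks('laugh_001.lm')]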

Figure 4.4: A rhythmic envelope extracted from a speech signal, adapted from [93].

Figure 4.5: A rhythmic envelope extracted from a speech signal (red) and perception levels of p-centers with thresholds of 1/3, 1/4 and 1/6 of the amplitude (gray scale); adapted from [93].

4.2 Results: acoustic events

In this section, we present the results achieved for each of the three experiments using BoW and 1-, 2- and 3-grams on speech units as feature extraction methods. We have tested four types of features.


More information

On human capability and acoustic cues for discriminating singing and speaking voices

On human capability and acoustic cues for discriminating singing and speaking voices Alma Mater Studiorum University of Bologna, August 22-26 2006 On human capability and acoustic cues for discriminating singing and speaking voices Yasunori Ohishi Graduate School of Information Science,

More information

Pitch Perception and Grouping. HST.723 Neural Coding and Perception of Sound

Pitch Perception and Grouping. HST.723 Neural Coding and Perception of Sound Pitch Perception and Grouping HST.723 Neural Coding and Perception of Sound Pitch Perception. I. Pure Tones The pitch of a pure tone is strongly related to the tone s frequency, although there are small

More information

IP Telephony and Some Factors that Influence Speech Quality

IP Telephony and Some Factors that Influence Speech Quality IP Telephony and Some Factors that Influence Speech Quality Hans W. Gierlich Vice President HEAD acoustics GmbH Introduction This paper examines speech quality and Internet protocol (IP) telephony. Voice

More information

Lyrics Classification using Naive Bayes

Lyrics Classification using Naive Bayes Lyrics Classification using Naive Bayes Dalibor Bužić *, Jasminka Dobša ** * College for Information Technologies, Klaićeva 7, Zagreb, Croatia ** Faculty of Organization and Informatics, Pavlinska 2, Varaždin,

More information

Hidden Markov Model based dance recognition

Hidden Markov Model based dance recognition Hidden Markov Model based dance recognition Dragutin Hrenek, Nenad Mikša, Robert Perica, Pavle Prentašić and Boris Trubić University of Zagreb, Faculty of Electrical Engineering and Computing Unska 3,

More information

Pitch-Synchronous Spectrogram: Principles and Applications

Pitch-Synchronous Spectrogram: Principles and Applications Pitch-Synchronous Spectrogram: Principles and Applications C. Julian Chen Department of Applied Physics and Applied Mathematics May 24, 2018 Outline The traditional spectrogram Observations with the electroglottograph

More information

Nature Neuroscience: doi: /nn Supplementary Figure 1. Emergence of dmpfc and BLA 4-Hz oscillations during freezing behavior.

Nature Neuroscience: doi: /nn Supplementary Figure 1. Emergence of dmpfc and BLA 4-Hz oscillations during freezing behavior. Supplementary Figure 1 Emergence of dmpfc and BLA 4-Hz oscillations during freezing behavior. (a) Representative power spectrum of dmpfc LFPs recorded during Retrieval for freezing and no freezing periods.

More information

2 2. Melody description The MPEG-7 standard distinguishes three types of attributes related to melody: the fundamental frequency LLD associated to a t

2 2. Melody description The MPEG-7 standard distinguishes three types of attributes related to melody: the fundamental frequency LLD associated to a t MPEG-7 FOR CONTENT-BASED MUSIC PROCESSING Λ Emilia GÓMEZ, Fabien GOUYON, Perfecto HERRERA and Xavier AMATRIAIN Music Technology Group, Universitat Pompeu Fabra, Barcelona, SPAIN http://www.iua.upf.es/mtg

More information

LEARNING SPECTRAL FILTERS FOR SINGLE- AND MULTI-LABEL CLASSIFICATION OF MUSICAL INSTRUMENTS. Patrick Joseph Donnelly

LEARNING SPECTRAL FILTERS FOR SINGLE- AND MULTI-LABEL CLASSIFICATION OF MUSICAL INSTRUMENTS. Patrick Joseph Donnelly LEARNING SPECTRAL FILTERS FOR SINGLE- AND MULTI-LABEL CLASSIFICATION OF MUSICAL INSTRUMENTS by Patrick Joseph Donnelly A dissertation submitted in partial fulfillment of the requirements for the degree

More information

Automatic Construction of Synthetic Musical Instruments and Performers

Automatic Construction of Synthetic Musical Instruments and Performers Ph.D. Thesis Proposal Automatic Construction of Synthetic Musical Instruments and Performers Ning Hu Carnegie Mellon University Thesis Committee Roger B. Dannenberg, Chair Michael S. Lewicki Richard M.

More information

UNIVERSITY OF DUBLIN TRINITY COLLEGE

UNIVERSITY OF DUBLIN TRINITY COLLEGE UNIVERSITY OF DUBLIN TRINITY COLLEGE FACULTY OF ENGINEERING & SYSTEMS SCIENCES School of Engineering and SCHOOL OF MUSIC Postgraduate Diploma in Music and Media Technologies Hilary Term 31 st January 2005

More information

Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio

Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio Jana Eggink and Guy J. Brown Department of Computer Science, University of Sheffield Regent Court, 11

More information

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY Eugene Mikyung Kim Department of Music Technology, Korea National University of Arts eugene@u.northwestern.edu ABSTRACT

More information

Musical Hit Detection

Musical Hit Detection Musical Hit Detection CS 229 Project Milestone Report Eleanor Crane Sarah Houts Kiran Murthy December 12, 2008 1 Problem Statement Musical visualizers are programs that process audio input in order to

More information

A Survey of Audio-Based Music Classification and Annotation

A Survey of Audio-Based Music Classification and Annotation A Survey of Audio-Based Music Classification and Annotation Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang IEEE Trans. on Multimedia, vol. 13, no. 2, April 2011 presenter: Yin-Tzu Lin ( 阿孜孜 ^.^)

More information

Proposal for Application of Speech Techniques to Music Analysis

Proposal for Application of Speech Techniques to Music Analysis Proposal for Application of Speech Techniques to Music Analysis 1. Research on Speech and Music Lin Zhong Dept. of Electronic Engineering Tsinghua University 1. Goal Speech research from the very beginning

More information

A Phonetic Analysis of Natural Laughter, for Use in Automatic Laughter Processing Systems

A Phonetic Analysis of Natural Laughter, for Use in Automatic Laughter Processing Systems A Phonetic Analysis of Natural Laughter, for Use in Automatic Laughter Processing Systems Jérôme Urbain and Thierry Dutoit Université de Mons - UMONS, Faculté Polytechnique de Mons, TCTS Lab 20 Place du

More information

Expressive Singing Synthesis based on Unit Selection for the Singing Synthesis Challenge 2016

Expressive Singing Synthesis based on Unit Selection for the Singing Synthesis Challenge 2016 Expressive Singing Synthesis based on Unit Selection for the Singing Synthesis Challenge 2016 Jordi Bonada, Martí Umbert, Merlijn Blaauw Music Technology Group, Universitat Pompeu Fabra, Spain jordi.bonada@upf.edu,

More information

Getting Started. Connect green audio output of SpikerBox/SpikerShield using green cable to your headphones input on iphone/ipad.

Getting Started. Connect green audio output of SpikerBox/SpikerShield using green cable to your headphones input on iphone/ipad. Getting Started First thing you should do is to connect your iphone or ipad to SpikerBox with a green smartphone cable. Green cable comes with designators on each end of the cable ( Smartphone and SpikerBox

More information

Enhancing Music Maps

Enhancing Music Maps Enhancing Music Maps Jakob Frank Vienna University of Technology, Vienna, Austria http://www.ifs.tuwien.ac.at/mir frank@ifs.tuwien.ac.at Abstract. Private as well as commercial music collections keep growing

More information

The Human Features of Music.

The Human Features of Music. The Human Features of Music. Bachelor Thesis Artificial Intelligence, Social Studies, Radboud University Nijmegen Chris Kemper, s4359410 Supervisor: Makiko Sadakata Artificial Intelligence, Social Studies,

More information

Assessment may include recording to be evaluated by students, teachers, and/or administrators in addition to live performance evaluation.

Assessment may include recording to be evaluated by students, teachers, and/or administrators in addition to live performance evaluation. Title of Unit: Choral Concert Performance Preparation Repertoire: Simple Gifts (Shaker Song). Adapted by Aaron Copland, Transcribed for Chorus by Irving Fine. Boosey & Hawkes, 1952. Level: NYSSMA Level

More information