Fusion for Audio-Visual Laughter Detection


Fusion for Audio-Visual Laughter Detection
Boris Reuderink
September 13, 2007


Abstract

Laughter is a highly variable signal, and can express a spectrum of emotions. This makes the automatic detection of laughter a challenging but interesting task. We perform automatic laughter detection using audio-visual data from the AMI Meeting Corpus. Audio-visual laughter detection is performed by combining (fusing) the results of separate audio and video classifiers on the decision level. The video classifier uses features based on the principal components of 20 tracked facial points; for audio we use the commonly used PLP and RASTA-PLP features. Our results indicate that RASTA-PLP features outperform PLP features for laughter detection in audio. We compared hidden Markov models (HMMs), Gaussian mixture models (GMMs) and support vector machine (SVM) based classifiers, and found that RASTA-PLP combined with a GMM resulted in the best performance for the audio modality. The video features classified using an SVM resulted in the best single-modality performance. Fusion on the decision level resulted in laughter detection with a significantly better performance than single-modality classification.


Acknowledgements

I would like to thank my supervisors, Maja Pantic, Mannes Poel, Khiet Truong and Ronald Poppe, for supervising me, supporting me and pointing me in the right directions when needed. In addition to my supervisors, I would like to thank the HMI department of the University of Twente for supporting my trip to London. I would also like to thank Michel Valstar, who helped me tremendously during my stay in London, and Stavros Petridis, for helping with the creation of the video features. I would like to thank Luuk Peters and Roald Dijkstra for proofreading my drafts. And last but not least I would like to thank Sanne Beukers for supporting me while I was working on my thesis.


Contents

1 Introduction
2 Literature
  2.1 Laughter
  2.2 Laughter detection in audio
  2.3 Facial expressions
  2.4 Audio-visual fusion
3 Methodology
  3.1 Dataset
    3.1.1 AMI Meeting Corpus
    3.1.2 Segmentation
    3.1.3 Corpus
  3.2 Features
    3.2.1 Audio features
    3.2.2 Video features
  3.3 Test setup
    3.3.1 Classifiers
    3.3.2 Fusion
    3.3.3 Cross validation scheme
    3.3.4 Model-parameter estimation
    3.3.5 Performance measure
4 Results
  4.1 Single-modality classifiers
  4.2 High level fusion
5 Conclusions
A Principal components for the video features
B Normalized features


Chapter 1 Introduction

Laughter is important. Someone's mental state and emotions are conveyed in paralinguistic cues, such as laughter, a trembling voice and coughs. Because laughter occurs frequently in spontaneous speech, it is an interesting research subject. Laughter is not limited to positive emotions; negative feelings and attitudes such as sadness and contempt can be expressed with laughter [35]. This spectrum of expressed emotions combined with the high variability of laughter makes the automatic detection of laughter a challenging but interesting task. Automatic laughter detection can be used, for example, in meetings, where laughter can provide cues to semantically meaningful events. Another application of laughter detection is the detection of non-speech for automatic speech recognition. Laughter can possibly be used as a feedback mechanism in human-computer interaction interfaces.

Earlier work on laughter detection has mainly focused on laughter detection in audio only. Currently the focus is starting to shift from laughter detection in audio to audio-visual detection of laughter, because additional visual information can possibly improve the detection of laughter. Research investigating audio-visual laughter detection was suggested by Truong et al. [38]. In this thesis we investigate fusion for audio-visual laughter detection. We will investigate whether fusion of the audio and video modalities can improve the performance of automatic laughter detection. Fusion will be performed on the decision level, which means that audio and video are classified separately, and the results are fused to make a final classification. We will evaluate different feature sets and different classification algorithms in order to find strong audio and video classifiers, and fuse those results to create an audio-visual classifier which hopefully outperforms both the audio and the video classifiers.

The rest of this thesis is organized as follows. In the next chapter we describe earlier work on laughter detection in audio, detection of facial expressions in video, and work on audio-visual emotion recognition. Then we describe the methodology we use to evaluate the performance of decision-level fusion. This includes a description of the data set, the machine-learning techniques we use, and the performance measure we use. The results are presented in the next chapter, followed by the conclusions in the last chapter.


Chapter 2 Literature

2.1 Laughter

Although laughter occurs frequently in conversations, we do not seem to know a lot about laughter. Research on the acoustic properties of laughter often contradicts other research, and the terminology used differs from work to work. To increase the confusion even more, smiles and laughter are often not discussed together while they seem to be related. Therefore we will describe the different terminologies used to describe laughter and the relation between speech, laughter and smiles, before we describe the variability of the laughter signal.

Laughter is usually analyzed on three levels. Bachorowski [5] defines the following three levels: bouts, calls and segments. Bouts are entire laugh episodes that occur during one exhalation. Calls are discrete acoustic events that together form a bout. Bouts start with long calls, followed by calls that are about half as long. A call is voiced, unvoiced, mixed, or is made up of glottal pulses, fry registers and glottal whistles. The mouth can be open or closed during the production of a call. Calls can be subdivided into segments. Segments are temporally delimited spectrogram components. A similar division into three levels was suggested by Trouvain [36]: phrases, syllables and segments. Phrases are comparable to bouts. Syllables are defined as interpulse intervals, and form phrases when combined. Segments can be vowels or consonants. The consonantal segment in a laugh is often seen as an interval or pause.

Smiling and laughter often occur together, and seem to be different forms of the same event. Laughter often shows a facial expression similar to smiling combined with an involuntary exhalation, sometimes followed by uncontrolled inhalations and exhalations. This involuntary breathing is not present during smiling. Laughter and smiles could be extremes of a smile-laugh continuum, but there are some indications that there is a more complex relation between laughter, smiling and even speech than we would expect. Aubergé and Cathiard demonstrated that a genuine smile includes a specific manipulation of the prosody of the speech [4], which cannot be attributed to the facial deformation of a smile; not only laughter, but also smiling is audible. Like smiling, laughter does occur during speech, and does so very often according to Trouvain [35]. In the Kiel Corpus of Spontaneous Speech, 60% of all labeled laughs are instances that overlap speech. Simultaneous production of speech and laughter is not simply laughter imposed on articulation, and there is no prototypical pattern for speech-laughs. In a later work, Trouvain reports [36] that laughter is a mix of laughter interspersed with speech-laughs and smiled speech. Smiling and laughter seem to be different categories rather than extremes of a continuum.

Laughter is a highly variable signal [5, 40]. Voiced laughter shows much more source-related variability than is associated with speech, and the individual identity and sex are conveyed in laugh acoustics. The variability between individuals greatly exceeds the variability within an individual. Laughter seems to be better conceptualized as a repertoire of sounds, which makes it difficult to detect automatically. Kipper and Todt [27] report that the successive syllables of laughter, which appear similar, show dynamic changes of acoustic parameters, and are in fact different. For example, the fundamental frequency, the amplitude and the duration of a syllable vary during a laughter bout. Laughter seems to be a very variable signal, on both the phrase and the syllable level.

The automatic recognition of laughter seems to be a very challenging problem. The laughter signal is highly variable on multiple levels, and can be described best as a group of sounds. Laughter and smiles seem to be different categories, and should not be regarded as different manifestations of the same event.

2.2 Laughter detection in audio

Automatic laughter detection has been studied several times, in the context of meetings, for audio indexing and to detect affective states. We will describe a few studies on automatic laughter detection, and summarize some characteristics of these studies. An overview of automatic laughter detection can be found in Table 2.1.

Campbell et al. [8] developed a system to classify a laugh into different categories. They constructed a corpus containing four affective classes of laughter: a hearty laugh, an amused laugh, a satirical laugh and a social laugh. A training set of 0 hand-labeled laughs was used to train hidden Markov models (HMMs). The HMMs recognized the affective class correctly in 75% of the test cases.

Automatic laughter detection can be used in audio indexing applications. For example, Lockerd and Mueller [28] performed laughter detection using their affective indexing camcorder. Laughter was detected using HMMs. One HMM was trained on 40 laughter examples, the other HMM was trained on speech. The classifier correctly identified laughter in 88% of the test segments. Misleading segments were sounds such as coughs, and sounds produced by cars and trains. Arias et al. [2] performed audio indexing using Gaussian mixture models (GMMs) and support vector machines (SVMs) on spectral features. Each frame is classified and then smoothed using a smoothing function to merge small parts. The accuracy of their laughter detection is very high (97.26% with GMMs and 97.12% with an SVM). However, their data set contains 1 minute of laughter for every 180 minutes of audio. Only predicting non-laughter would result in a baseline accuracy of 99.4%, which makes it unclear how well their laughter detection really performs.

Automatic laughter detection is frequently studied in the context of meetings. Kennedy and Ellis [25] detected multiple laughing participants in the ICSI Meeting database. Using an SVM on one-second windows of Mel-Frequency Cepstrum Coefficient (MFCC) features, an equal error rate (EER) of 13% was obtained. The same data set was used by Truong and Van Leeuwen [37]. Using Gaussian mixture models (GMMs) on Perceptual Linear Predictive (PLP) features, they also obtained an EER of 13%. The data set contained examples in which both speech and laughter were present, and some inaudible laughs.

After removing these difficult instances, the performance increased, resulting in an EER of 7.1%. Different audio features were tested; PLP outperformed pitch-and-energy, pitch-and-voicing and modulation spectrum features. In a more recent work, Truong and Van Leeuwen [38] used the cleaned ICSI meeting data set to train GMM and SVM classifiers. For the SVM classifier the frame-level features were transformed to a fixed length using a Generalized Linear Discriminant Sequence (GLDS) kernel. The SVM classifier performed better than the GMM classifier in most cases. The best feature set appeared to be the PLP feature set. The scores of different classifiers based on different features were fused using a linear combination of the scores, or fused using an SVM or an MLP trained on the scores. Fusion of GMM- and SVM-based classifiers increases the discriminative power, as does fusion between classifiers based on spectral features and classifiers based on prosodic information.

Study | Dataset | Performance | Remarks
Truong (2007) [38] | ICSI-Bmr, clean set | EER GMM: 6.3, EER SVM: 2.6, EER fused: 2.9 | EER fused was tested on a different corpus than EER SVM
Arias (2005) [2] | Broadcast | A: 97% | MFCCs with GMMs and SVMs; 1 minute laughter, 180 minutes non-laughter
Campbell (2005) [8] | ESP | A: 75% | HMMs to classify a laugh into 4 categories
Ito (2005) [23] | Audio-visual laughter | Audio: (95% R, 60% P) |
Truong (2005) [37] | ICSI-Bmr | EER: 13.4%, EER clean: 7.1% | PLP, GMM; EER clean on set with unclear samples removed
Kennedy (2004) [25] | ICSI-Bmr | EER: 13% | MFCCs + SVM
Lockerd (2002) [28] | Single person, 40 laughs | A: 88% | HMMs

Table 2.1: Automatic laughter recognition in audio.

When we compare the results of these studies, GMMs and SVMs seem to be used most for automatic laughter recognition. Spectral features seem to outperform prosodic features. An EER of 12-13% seems to be usual. Removing unclear examples improves the classification performance enormously. This suggests that the performance largely depends on the difficulty of the chosen data set.

2.3 Facial expressions

The detection of facial expressions in video is a popular area of research. Therefore, we will only describe a few studies that are related to fusion and to the Patras-Pantic particle filtering tracking scheme [33], which we will use to extract video features.

Valstar et al. [39] conducted a study to automatically differentiate between posed and spontaneous brow actions. Timing is a critical factor for the interpretation of facial behavior. The facial expressions are labeled according to the Facial Action Coding System (FACS) [16] action units (AUs). The SVM-based AU detectors detect the temporal segments (neutral, onset, apex, offset) of the atomic AUs based on a sequence of 20 tracked facial points. These points were tracked using the Patras-Pantic particle filtering tracking scheme. Using the three detected brow AUs (AU1, AU2, AU4), mid-level features based on intensity, duration, trajectory, symmetry and co-occurrence with other muscle actions were created to determine the spontaneity of an instance.

This resulted in a classification with an accuracy of 90.7%.

Gunes and Piccardi [18] compare fusion of facial expressions and affective body gestures at the feature and decision level. For both modalities, single expressive frames are manually selected, which are classified into six emotions. The body-modality classifier was able to classify frames with a 100% accuracy. Using feature-level fusion, which combines the feature vectors of both modalities into a single multi-modal feature vector, again an accuracy of 100% was obtained. Decision-level fusion was performed using different rules to combine the scores of the classifiers for both modalities, which resulted in an accuracy of only 91%. Clearly the fusion rules used were not well chosen for their problem.

Pantic et al. [32] used the head, face and shoulder modalities to differentiate between spontaneous and posed smiles. The tracking of the facial expressions was performed using the Patras-Pantic particle filtering tracking scheme [33]. For mid- and high-level fusion, frames are classified, and filtered to create neutral-onset-apex-offset-neutral sequences. Mid-level fusion is performed by transforming features into symbols such as temporal aspects of AUs, and the head and shoulder actions. For these symbols, mid-level features such as morphology, speed, symmetry and the duration of apex overlap of modalities are calculated, and the order of the different actions is computed. Low-level fusion (recall: 93%, precision: 89%) yields better results than mid-level (recall: 79%, precision: 79%) and high-level fusion (recall: 93%, precision: 63%). The head modality is the most important modality for recognition on this data set, although the difference is not significant. The fusion of these modalities improves the performance significantly.

2.4 Audio-visual fusion

Most work on audio-visual fusion has focused on the detection of emotion in audio-visual data [49, 47, 44, 18, 45, 48, 46, 21, 41, 17, 7]. Some other audio-visual studies are conducted on cry detection [31], movie classification [42], tracking [6, 3], speech recognition [13] and laughter detection [23]. These studies all try to exploit the complementary nature of audio-visual data. Decision-level fusion is usually performed using the product or a (weighted) sum of the predictions of single-modality classifiers, or using hand-crafted rules for classification. Other commonly used fusion techniques include mid-level fusion using multi-stream hidden Markov models (MHMMs), and feature-level fusion. We will describe some studies in more detail and make some general observations. An overview of these studies can be found in Table 2.2.

Zeng et al. [48] used a Sparse Network of Winnows (SNoW) classifier to detect 11 affective states in audio-visual data. Fusion was performed using voting on the frame level to obtain a class for each instance. For a second, person-independent test, fusion was performed by using a weighted summation of component HMMs. In a following study [46], Zeng et al. performed automatic emotion recognition of positive and negative emotions in a realistic conversation setting. The facial expressions were encoded using FACS. Video features were based on the facial texture; prosodic features were used for audio classification. Fusion was regarded as a multi-class classification problem, with the outputs of the different component HMMs as features. An AdaBoost learning scheme performed best of the tested classifiers.
Another study that compared feature-level fusion and decision-level fusion for automatic emotion recognition was conducted by Busso et al. [7]. Video texture and prosodic audio features were classified using SVMs.

The confusion matrices of the audio and video modalities show that pairs of emotions that are confused in one modality can be easily classified using the other modality. Different decision-level fusion rules were tested; the best results (89%) were obtained using the product of the predictions of both modalities. Feature-level fusion resulted in an accuracy of 89%. In this experiment, feature-level fusion and decision-level fusion had a similar performance.

Study | Dataset | Performance | Remarks
Zajdel (2007) [44] | Posed, 2 emotions | A: 45%, V: 67%, MF: 78% | Dynamic Bayesian Network
Zeng (2007) [46] | AAI, 2 emotions | A: 70%, V: 86%, DF: 90% | AdaBoost on component HMMs
Zeng (2007) [48] | Posed, 11 emotional states | A: 66%, V: 39%, DF: 72% | SNoW, MHMM
Pal (2006) [31] | Unknown, 5 cry types | A: 74%, V: 64%, DF: 75% | Rule-based fusion using confusion matrices
Zeng (2006) [45] | AAI, 2 emotions | A: 70%, V: 86%, DF: 90% | AdaBoost on component HMMs
Asoh (2005) [3] | Speech, 2 states and location | MF: 85% | Particle filter
Hoch (2005) [21] | Posed, 3 emotions | A: 82%, V: 67%, DF: 87% | SVM, weighted-sum fusion
Ito (2005) [23] | Spontaneous, laughter | A: (95% R, 60% P), V: (71% R, 52% P), DF: (71% R, 74% P) | Rule-based fusion
Wang (2005) [41] | Posed, 6 emotions | A: 66%, V: 49%, FF1: 70%, FF2: 82% | FF1: FLDA classifier, FF2: rule-based voting
Xu (2005) [42] | Movies, horror vs comedy | DF: (R: 97%, P: 91%) | Voting, rule-based fusion
Busso (2004) [7] | Posed, single person, 4 emotions | A: 71%, V: 85%, FF: 89%, DF: 89% | DF: product fusion
Go (2003) [17] | 6 emotions | A: 93%, V: 93%, DF: 97% | Rule-based fusion
Dupont (2000) [13] | M2VTS, 10 words, noisy | A: 52%, V: 60%, FF: 70%, MF: 80%, DF: 82% | Fusion using MHMMs

Table 2.2: Audio-visual fusion.

Dupont and Luettin [13] used both acoustic and visual speech data for automatic speech recognition. An MHMM is used to combine the audio and video modalities on the feature level, the decision level and the mid-level. The fused system performs better than systems based on a single modality in the presence of noise. Both mid-level and decision-level fusion perform better than feature-level fusion. Without the addition of noise to the features, audio classification alone is sufficient for almost perfect classification.

A quite different approach to fusion was taken by Asoh et al. [3]. Particle filtering was used to track the location of human speech events. The audio modality consisted of a microphone array, the video modality consisted of a monocular camera. Audio-visual tracking was performed by modeling the position and the type of the signal as a hidden state. The noisy observations are used to estimate the hidden state using particle filtering. This approach provides a simple method to compute the probability of a location and occurrence of a speech event.

Ito et al. [23] focused on the detection of a smiling face and the utterance of laughter sounds in natural dialogues. A database was created with Japanese, English and Chinese subjects. Video features consist of the lip lengths, the lip angles and the mean intensities of the cheek areas.

Frame-level classification of the video features is performed using a perceptron, resulting in a recall of 71% and a precision of 52%. Laughter sound detection is performed on MFCC and delta-MFCC features, using two GMMs: one for laughter, and one for other sounds. The frame-by-frame sequences are smoothed using a moving-average filter. A recall of 96% and a precision of 60% was obtained with 16 Gaussian mixtures. The audio and video channels are combined using hand-crafted rules. The combined system obtained a recall of 71% and a precision of 74%. Ito et al. do not report whether fusion significantly increases the performance of their detector.

Xu et al. [42] performed affective content analysis of comedy and horror movies using audio emotional events, such as laughing and horror sounds. The audio is classified using a left-to-right HMM with four states. After classification, the predictions are filtered using sliding-window majority voting. Some horror sounds were too short to detect using an HMM; they were detected by finding large amplitude changes. The audio features consist of MFCCs, with delta and acceleration features to accentuate the temporal characteristics of the signal. The recall and precision are over 90% for horror sounds and canned laughter.

The performance of decision-level fusion seems to be similar to the performance of feature-level fusion. The fusion of audio and video seems to boost the classification performance in these studies by about 4%. However, most work does not report the significance of this gain in performance. Fusion seems to work best when the individual modalities both have a low performance, for example due to noise in the audio-visual speech recognition of Dupont [13]. When single classifiers have a high performance, the performance gain obtained by fusion of the modalities is low, and sometimes fusion even degrades the performance, as observed in the work of Gunes [18].

Chapter 3 Methodology

Fusion of audio and video can be performed on different levels. We perform fusion on the decision level, where the audio and video modalities are classified separately. When the classifiers for both modalities have classified the instance, their results are fused into a final multi-modal prediction. See Figure 3.1 for a schematic overview. An alternative approach is fusion on the feature level, where the audio and video features are merged into a single, fused feature set. A classifier then classifies the fused features of a single instance.

We have chosen to evaluate decision-level fusion instead of feature-level fusion for two reasons. The first reason is that decision-level fusion allows the use of different classifiers for the different modalities. The different results for the different classifiers help us understand the nature of the audio-visual signal, and possibly result in a better performance. The second reason is that we use a very small data set. The feature-level fusion approach has a higher dimensionality, which requires a larger data set to learn a classifier [1]. We therefore use decision-level fusion. In the next subsections, we will describe the preprocessing we applied to our data set, the features we used and the design we used to evaluate our fusion techniques.

Figure 3.1: Decision-level fusion. The audio and video features are classified separately; the two predictions are fused into a final multi-modal prediction.

3.1 Dataset

In order to measure the classification performance of different fusion techniques, we need a corpus containing both laughter and non-laughter examples to use for training and testing. We created a corpus based on the AMI Meeting Corpus [29].

In the following sections, we will describe the AMI Meeting Corpus, the segmentation process used to select examples (instances), and the details regarding the construction of the corpus based on the segmentation data.

3.1.1 AMI Meeting Corpus

The AMI Meeting Corpus consists of 100 hours of meeting recordings, stored in different signals that are synchronized to a common time line. The meetings are recorded in English, mostly spoken by non-native speakers. For each meeting, there are multiple audio and video recordings. We used seven non-scenario meetings recorded in the IDIAP room (IB1, IB2, IB3, IB4, IB5, IB4010, IB4011). These meetings contain a fair amount of spontaneous laughter. In the first five meetings, the four participants plan an office move. In the last two meetings, four people discuss the selection of films to show for a fictitious movie club. We removed two participants: one displayed extremely asymmetrical facial expressions (IB5.2), the other displayed a strong nervous tic in the muscles around the mouth (IB3.3, IB3.4). Both participants were removed because their unusual expressions would have a huge impact on our results due to the small size of our dataset. The remaining 10 participants are displayed in Figure 3.3. We used the close-up video recording (DivX AVI codec 5.2.1, 2 Kbps, pixels, 25 frames per second) and the headset audio recording (16 kHz WAV file) of each participant for our corpus. In total we have used 17 hours of raw audio-visual meeting data to construct our corpus.

3.1.2 Segmentation

The seven meetings we selected from the AMI Meeting Corpus were segmented into laughter and smile segments. The presence of laughter was determined using the definition for audible laughter of Vettin and Todt [40]: vocalizations comprising several vocal elements must consist mainly of expiratory elements; inspiratory elements might occur at the end of vocalisations; expiratory elements must be shorter than 600 ms and successive elements have to be similar in their acoustic structure; single-element vocalizations must be expiratory with a vowel-like acoustic structure, or, when noisy, the element must begin with a distinct onset.

For smiles we used a definition based on visual information. We define a visible smile as the visible contraction of the Zygomatic Major (FACS Action Unit 12). The activation of AU12 pulls the lip corners towards the cheekbones [14]. We define the start of the smile as the moment the corners of the mouth start to move; the end is defined as the moment the corners of the mouth return to a neutral position. Using these definitions, the 17 hours of audio-visual meeting recordings were segmented into 2049 smiles and 960 laughs. Due to the spontaneous nature of these meetings, speech, chewing and occlusions sometimes co-occur with the smile and laugh segments.

Figure 3.2: Segmentation of the data and extraction of instances for the corpus. On top the typical observed intensity of the smiles and laughs is shown. Based on the observations, a segmentation is made, as shown in the middle diagram. The laughs are padded with 3 seconds on both sides to form the positive instances. The negative instances are created from the remaining non-smile space.

3.1.3 Corpus

The final corpus is built using this segmentation data. The laughter instances are created by padding each laughter segment with 3 seconds on each side to capture the onset and offset of a visual laughter event (see Figure 3.2). A preliminary experiment showed that these onset and offset segments increased the performance of the classifier. Laughter segments that overlap after padding are merged into a single laughter instance. This effectively merges separate laughter calls into an instance containing a single laughter bout. The non-laughter instances are created from the audio-visual data that remains after removing all the laughter and smile segments; the smile segments are not used during this research. The length of each non-laughter instance is drawn from a Gaussian distribution with a mean and standard deviation equal to the mean and standard deviation of the laughter segments. Due to time constraints we have based our corpus on 60 randomly selected laughter and 120 randomly selected non-laughter instances, in which the 20 facial points needed for tracking are visible. Of these 180 instances, 59% contain speech of the visible participant. Almost all instances contain background speech. Together these instances comprise 25 minutes of audio-visual data.

3.2 Features

This section outlines the features we have used for the audio and video modalities. For audio we use features that are commonly used for the detection of laughter in audio. For video we use features based on the location of 20 facial points.

3.2.1 Audio features

In order to detect laughter in audio, the audio signal has to be transformed into useful features for classification algorithms. Spectral or cepstral audio features, such as Mel-Frequency Cepstrum Coefficients (MFCCs) [25] and Perceptual Linear Predictive (PLP) Analysis [19], have been used successfully for automatic speech recognition and laughter detection.

Figure 3.3: Laughter examples for each individual in the corpus.

We decided to use PLP features, with the same settings as used by Truong and Van Leeuwen [38] for automatic laughter detection, and RASTA-PLP features with similar settings. PLP and RASTA-PLP can be understood best as a sequence of transformations. The first transformation is Linear Predictive Coding (LPC). LPC encodes speech based on the assumption that speech is comparable with a buzzer at the end of a tube; the formants of the speech are removed, and encoded with the intensity and frequency of the remaining buzz. PLP adds a transformation of the short-term spectrum to LPC-encoded audio, in order to mimic human hearing. We used these PLP features for audio classification. In addition to the PLP audio features, we derived RASTA-PLP [20] features. RASTA-PLP adds filtering capabilities for channel distortions to PLP, and yields significantly better results than PLP for speech recognition tasks in noisy environments [13]. A visualisation of PLP and RASTA-PLP features can be found in Appendix B.

For the PLP features we used the same settings as were used by Truong and Van Leeuwen [38] for laughter detection (see Table 3.1). The 13 cepstral coefficients (12 model order, 1 gain) are calculated over a window of 32 ms with a step-size of 16 ms. Combined with the temporal derivative (calculated by convolving with a simple linear-slope filter over 5 audio frames) this results in a 26-dimensional feature vector per audio frame. The RASTA-PLP features are created using the same settings. We normalize these 26-dimensional feature vectors to a mean µ = 0 and a standard deviation σ = 1 using z-normalisation.

Setting | PLP | RASTA-PLP
Sampling frequency | 16 kHz | 16 kHz
Window size | 32 ms | 32 ms
Window step-size | 16 ms | 16 ms
Model order | 12 | 12
Delta window | 5 frames | 5 frames
Log-RASTA filtering | false | true

Table 3.1: Settings used for the PLP and RASTA-PLP features.

3.2.2 Video features

The video channel was transformed into sequences of 20 two-dimensional facial points located on key features of the human face. These point sequences are subsequently transformed into orthogonal features using a Principal Component Analysis (PCA). The points were tracked as follows. The points were manually assigned at the first frame of an instance movie and tracked using a tracking scheme based on particle filtering with factorized likelihoods [33]. We track the brows (2 points each), the eyes (4 points each), the nose (3 points), the mouth (4 points) and the chin (1 point). This tracking configuration has been used successfully [39] for the detection of the atomic action units of the FACS. This results in a compact representation of the facial movement in a movie using 20 (x, y) tuples per frame (see Figure 3.4).

After tracking, we performed a PCA on the 20 points per video frame. A PCA linearly transforms a set of correlated variables into a set of uncorrelated variables [24]. The principal components are ordered so that the first few retain most of the variance of the original variables. Therefore a PCA can be used as a dimension-reduction technique for features [1]; however, we chose to keep all the dimensions because we do not know in advance which principal components are useful for laughter detection.

Figure 3.4: The tracked facial points.

We have chosen to use PCA over manually defined features because PCA can detect factors such as differences in head shape that are otherwise difficult to detect and remove from the features. For each frame in the videos we defined a 40-dimensional shape vector by concatenating all the Cartesian (x, y) coordinates. Using a PCA we extracted 40 principal components (eigenvectors) for all the frames in the data set. The original shape vectors can be reconstructed by adding a linear combination of these eigenvectors to the mean of the shape vectors:

x = x̄ + b P^T    (3.1)

Here x is the original shape vector, x̄ is the mean of the shape vectors, b is a vector of weights and P is the matrix of eigenvectors. An analysis of the eigenvectors revealed that the first five principal components encode the head pose, including translation, rotation and scale. The other components encode interpersonal differences, facial expressions, corrections for the linear approximations of movements and less obvious factors of the facial configuration. See Figure 3.5 for a visualisation of the first 12 principal components; for more information please refer to Appendix A.

The matrix of eigenvectors serves as a parametric model for the tracked facial points. The Active Shape Model developed by Cootes et al. [10] uses a similar technique to create a model for shapes. The main difference is that Cootes et al. removed global linear transformations from the model by aligning the shapes before the PCA is applied. We did not align the shapes because the head modality seems to contain valuable cues for laughter detection that we want to include in the model. We use the input for this model (the weight vector b) as the feature vector for the video data. For unseen data, this feature vector can be calculated using Equation 3.2.

b = (x - x̄) P    (3.2)

In order to capture temporal aspects of this model, the first-order derivative of each weight is added to each frame. The derivative is calculated with Δt = 4 frames on a moving average of the weights with a window length of 2 frames. Facial activity (onset-apex-offset) can last from 0.25 seconds (for example a blink) to several minutes [16]. With Δt = 4 frames even the fastest facial activity is captured in the derivative of the features. We normalize this 80-dimensional feature vector to a mean µ = 0 and a standard deviation σ = 1 using z-normalisation. This results in a normalized 80-dimensional feature vector per frame which we use for classification (Appendix B).

Figure 3.5: A visualisation of the influence of the first 12 principal components (PC 1: translation, PC 2: translation, PC 3: head roll, PC 4: scale, PC 5: head yaw, PC 6: aspect ratio, PC 7: face length, PC 8: mouth corners / head pitch, PC 9, PC 10: mouth width, PC 11: mouth opening / eyes, PC 12: mouth corners). The arrows point from -3σ to 3σ, where σ is the standard deviation.

3.3 Test setup

3.3.1 Classifiers

We selected Gaussian mixture models (GMMs), hidden Markov models (HMMs) and support vector machines (SVMs) as machine learning techniques to be used for classification. GMMs and HMMs are frequently used in speech recognition and speaker identification, and have been used before for laughter recognition [2, 38, 23, 8, 28, 26, 42]. SVMs have been used for laughter detection in [25, 2, 34, 38].

HMMs and GMMs are generative models. Therefore, a different model has to be trained for each class. After training using the EM algorithm [11, 43], the log-likelihood for both class models is computed and compared for each instance. Using these log-likelihoods the final output is computed as the logarithm of the ratio between the probability of the positive and the negative model (Eq. 3.3).

score(I) = log(P_pos(I) / P_neg(I)) = log P_pos(I) - log P_neg(I)    (3.3)

We use HMMs that model the generated output using a mixture of Gaussian distributions. For the HMM classifiers we used two different topologies (Figure 3.6). The first is commonly used in speech recognition, and contains only forward connections. The advantage of this left-right HMM model is that fewer parameters have to be learned, and the left-right architecture seems to fit the sequential nature of speech. An ergodic HMM allows state transitions from every state to every state. This topology is more flexible, but more variables have to be learned. Kevin Murphy's HMM Toolbox [30] was used to implement the GMM and HMM classification.

Figure 3.6: A left-right HMM (left) and an ergodic HMM (right).

SVMs expect a fixed-length feature vector, but our data consists of sequences with a variable length. Therefore we use a sliding window to create features for the SVM. During training the classes of windowed sections of the instances are learned. During classification a probability estimate for the different windows of an instance is calculated. The final score of an instance is the mean of its window scores; a median could be used as well. We use Radial Basis Function (RBF) kernel SVMs, which are trained using LIBSVM [9].
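As an illustration of the generative scoring in Equation 3.3, the sketch below fits one GMM per class and scores an instance by its summed frame log-likelihood ratio. It uses scikit-learn's GaussianMixture as a stand-in for the HMM Toolbox used in this work; the data, dimensionality and number of mixtures are placeholders, not the settings of our experiments.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Minimal sketch of the generative classifier of Eq. 3.3 (placeholder data and
# settings; the thesis used Kevin Murphy's HMM Toolbox, not scikit-learn).
rng = np.random.default_rng(0)
laughter_frames = rng.normal(0.5, 1.0, size=(2000, 26))      # stand-in for 26-dim PLP frames
non_laughter_frames = rng.normal(-0.5, 1.0, size=(2000, 26))

# One generative model per class, trained with EM.
gmm_pos = GaussianMixture(n_components=4, covariance_type="diag").fit(laughter_frames)
gmm_neg = GaussianMixture(n_components=4, covariance_type="diag").fit(non_laughter_frames)

def score_instance(frames):
    """Eq. 3.3: log P_pos(I) - log P_neg(I), summed over the frames of one instance."""
    return gmm_pos.score_samples(frames).sum() - gmm_neg.score_samples(frames).sum()

test_instance = rng.normal(0.5, 1.0, size=(150, 26))  # one variable-length instance
print(score_instance(test_instance))  # a positive score suggests laughter
```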

3.3.2 Fusion

Fusion is performed on the decision level, which means that the outputs of an audio and a video classifier are used as input for the final fused prediction. For each instance we classify, we generate two numbers, representing the probability of laughter in the audio and the video modality. Fusion SVMs are trained on these numbers using the same train, validation and test sets as used for the single-modality classifiers (see Section 3.3.3). The output of these SVMs is a multi-modal prediction based on high-level fusion. As an alternative to this learned fusion, we test fusion using a weighted sum (Equation 3.4) of the scores of the single-modality classifiers.

s_fused = α · s_video + (1 - α) · s_audio    (3.4)

3.3.3 Cross validation scheme

In order to compare different fusion techniques, we need to be able to measure the generalisation performance of a classifier. We decided to use a preprocessed data set, so the preprocessing is done once for the whole data set. We have chosen to exclude the preprocessing from the cross-validation loop in order to measure the generalisation error of the fusion without the additional generalisation error of the preprocessing. The preprocessing consists of feature extraction and z-normalisation, which transforms the data to a mean µ = 0 and a standard deviation σ = 1. Using this setup we measure the generalisation error of the classification, and not the combined generalisation error of preprocessing and classification. Because we have a small data set we use a cross-validation scheme to create multiple train, validation and test sets (see Figure 3.7).

Algorithm 1: The used cross-validation scheme.
for K in [1..10] do
    S_train = S - S_K
    for L in [1..3] do
        S_validation = S_KL
        S_test = S_K - S_KL
        C = trainer.learn(S_train, S_validation)
        S_test.performance = trainer.test(C, S_test)
    end
end

The preprocessed data set is divided into K = 10 subsets. During each of the K folds, 1 subset is set aside. The other 9 subsets are used for training. The remaining subset is used to create a validation and a test set for three folds. One third is used as validation set and the remaining two thirds as test set (see Algorithm 1). Different model-parameters are used to train classifiers on the train set. The classifier with the best performance on the validation set is selected, and tested on the test set. This results in performance measurements for 10 × 3 = 30 different folds of the data set.
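To make the weighted-sum fusion of Equation 3.4 concrete, the sketch below fuses audio and video scores and picks α by maximizing the AUC-ROC on a validation fold, mirroring the validation-based model selection described above. The score arrays, labels and the grid of α values are placeholders, not values from this work.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fuse(s_audio, s_video, alpha):
    """Eq. 3.4: weighted sum of the single-modality scores."""
    return alpha * s_video + (1.0 - alpha) * s_audio

def select_alpha(s_audio, s_video, labels, grid=np.linspace(0.0, 1.0, 21)):
    """Pick the weight with the highest AUC-ROC on a validation set."""
    return max(grid, key=lambda a: roc_auc_score(labels, fuse(s_audio, s_video, a)))

# Placeholder validation scores for 30 instances (higher = more laughter-like).
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=30)
s_audio = labels + rng.normal(0.0, 0.8, size=30)
s_video = labels + rng.normal(0.0, 0.5, size=30)

alpha = select_alpha(s_audio, s_video, labels)
print(alpha, roc_auc_score(labels, fuse(s_audio, s_video, alpha)))
```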

Figure 3.7: A train, validation and test set are used to measure the generalisation performance of a classifier.

3.3.4 Model-parameter estimation

Most machine learning techniques have model-parameters (for example, the number of states and the number of Gaussian mixtures for an HMM, or the C and γ parameters for an SVM with an RBF kernel) that influence their performance. We find good parameters by performing a multi-resolution grid search [22] in the model-parameter space, in which we search for the parameters that result in the best performance on the validation set after training. For an SVM with an RBF kernel, we test different values for log(C) and log(γ). The parameters that result in the highest AUC-ROC (see Section 3.3.5) form the center of a smaller grid, whose values are again tested on the validation set. The best scoring classifier is the final classifier. For generative models, such as HMMs and GMMs, we perform the same grid-based parameter search. Because we need a model for both the positive and the negative instances, the grid search is performed for both classes individually. The performance measure during this search is the log-likelihood of the model on the validation set. For GMMs, we estimate the best number of Gaussian mixtures for our data set. For HMMs, we search for the best values for the number of states and the number of Gaussian mixtures, and a Boolean that determines whether the HMM is fully connected or not.

3.3.5 Performance measure

In order to calculate the generalisation performance of a classifier, we need to select a suitable measure for the performance. We have chosen to use the Area Under the Curve of the Receiver Operating Characteristic (AUC-ROC) [15] as the primary and the Equal Error Rate (EER) as the secondary performance measure.

accuracy = (TP + TN) / (P + N)    (3.5)

recall = TP / P    (3.6)

precision = TP / (TP + FP)    (3.7)

The most commonly used measure in previous work is the accuracy (Equation 3.5), or the recall and precision pair (Equation 3.6 and Equation 3.7). The accuracy measure is not suitable to measure the performance for a two-class problem, because a very high accuracy can be obtained by predicting the most frequent class for problems with a high class skew. The combination of recall and precision is more descriptive. Recall expresses the fraction of positive instances that is detected; precision describes the fraction of the detected instances that are real positives. These measures can be calculated using the values found in the confusion matrix (Fig. 3.8).

prediction \ class | positive | negative
positive | TP | FP
negative | FN | TN
all | P | N

Figure 3.8: A confusion matrix, where the columns represent the real class, and the rows represent the prediction of a classifier. The cells contain the true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN).

Most classifiers can be modified to output a probability of a class instead of a binary decision. A trade-off between the costs of the different errors (FP versus FN) can be made by thresholding this probabilistic output. This trade-off can be visualized in a receiver operating characteristic (ROC), in which the true-positive rate is plotted against the false-positive rate for different thresholds (see Figure 3.9). A recall-precision pair corresponds to a single point on the ROC. One of the advantages of the ROC over other thresholded plots is its invariance to class skew [15]. Because we do not know in advance which costs are associated with the different errors, we cannot define a single point of interest on the ROC. Therefore we measure the performance using the area under the ROC curve (AUC-ROC). The AUC-ROC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. In addition to the AUC-ROC performance, we will report the EER for a classifier. The EER is a single point on the ROC, defined as the point for which the false-positive rate equals the false-negative rate.

We will use a paired two-tailed t-test to compare the AUC-ROCs of the cross-validation folds. This K-fold cross-validated paired t-test suffers from the problem that the train sets overlap, which results in an elevated probability of detecting a difference between classifiers when no such difference exists (type I error) [12]. As a solution for this problem the 5×2 cross-validated paired t-test has been developed, which has an acceptable type I error. Because this method uses only half of the data for training during a fold, it is unsuitable for our data set. Therefore we use the K-fold cross-validated paired t-test to compare the AUC-ROC values for different classifiers, and note the possibility of a type I error.
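A minimal sketch of how the AUC-ROC and the EER can be computed from the probabilistic scores of a classifier, assuming scikit-learn is available; the labels and scores below are placeholders, not output of our classifiers.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Placeholder labels and probabilistic scores standing in for classifier output.
labels = np.array([1, 1, 0, 0, 1, 0, 1, 0, 0, 1])
scores = np.array([0.90, 0.80, 0.70, 0.30, 0.60, 0.40, 0.20, 0.10, 0.50, 0.85])

fpr, tpr, thresholds = roc_curve(labels, scores)
auc_roc = auc(fpr, tpr)  # primary performance measure

# EER: the point on the ROC where the false-positive rate equals the false-negative rate.
fnr = 1.0 - tpr
eer_index = np.argmin(np.abs(fpr - fnr))
eer = (fpr[eer_index] + fnr[eer_index]) / 2.0

print(f"AUC-ROC: {auc_roc:.3f}, EER: {eer:.3f}")
```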

Figure 3.9: The ROC, the AUC-ROC and the EER for a classifier. The probabilistic output of the classifier is thresholded to generate the ROC curve. Points on the curve define the relation between the true-positive rate and the false-positive rate. The area under the ROC (the AUC-ROC) is our primary performance measure. The EER for a classifier is the error rate at the intersection of the ROC with the EER line from (0, 1) to (1, 0).

Chapter 4 Results

In this chapter we will describe the results of our experiments. We will start with the results for the single-modality classifiers. The best single-modality classifiers are used to construct a fused classifier, which we will compare to the best performing single-modality classifier.

4.1 Single-modality classifiers

We will start with the audio classifiers. We have trained different classifiers on the two sets of audio features. Figure 4.1 shows a ROC plot for the audio features. The figure shows that all the trained classifiers have a similar performance for PLP features. The only real differences are in the area with a very low threshold (high recall, low precision) and the area with a high threshold (low recall, high precision). In those areas the generative models (GMMs and HMMs) seem to perform better. When we look at Table 4.1, we see that the number of Gaussian mixtures for the positive and negative model seems to be proportional to the amount of training data. We expect that more training data would increase the number of mixtures and possibly the performance of our GMM and HMM classifiers. This is supported by the work of Truong et al. [38], where models with 1024 Gaussian mixtures were trained using more than 0 instances.

Classifier | Features | Positive model (#states, #mix.) | Negative model (#states, #mix.) | AUC-ROC | EER
GMM | PLP | (3.2) | (3.1) | (0.169) |
GMM* | RASTA | (2.8) | (5.9) | (0.143) |
GMM | Video | (0.7) | (0.6) | (0.129) |
HMM | PLP | 11.0 Erg. (0), 2.1 (0.5) | 18.5 (1.1) Erg., 2.5 (0.9) | (0.160) |
HMM | RASTA | 11.6 Erg. (1.9), 2.1 (0.4) | 21.3 (1.9) Erg., 2.0 (0) | (0.135) |
HMM | Video | 2.5 LR (0.5), 4.0 (0) | 1.2 (0.4) Erg., 3.0 (0) | (0.129) |

Classifier | Features | Window | Step | log2(C) | log2(γ) | AUC-ROC | EER
SVM | PLP | 1.12 s | 0.64 s | -8.9 (3.7) | -22 (2.6) | (0.173) |
SVM | RASTA | 1.12 s | 0.64 s | -9.8 (4.1) | (3.2) | (0.157) |
SVM* | Video | 1.20 s | 0.60 s | 1.3 (5) | -18 (0) | (0.114) |

Table 4.1: Results of the different classifiers trained on different features. For the model-parameters and the performance measures, the mean value is displayed with the standard deviation between parentheses. The classifiers marked with an asterisk are the best performing classifiers for the audio and video modality.

The results for the RASTA-PLP features are remarkably different. The ROC is not as smooth as for the PLP features, and the SVM performance is degraded dramatically. However, RASTA-PLP features result in a slightly better performance than the PLP features for the generative models. The filtering that RASTA-PLP adds to PLP seems to smoothen the signal (Appendix B). This results in features that can be modeled using fewer mixtures (see Table 4.1), which allows for the training of more states, or training with a higher accuracy. RASTA-PLP was developed with speech recognition in mind, which explains why the generative models that are commonly used in speech recognition perform better with RASTA-PLP features than with PLP features. While the distribution of the values of the features is simplified, the performance for SVMs degrades. SVM classifiers trained on RASTA-PLP features generally have a lower C-parameter, which indicates a smoother hyperplane. Therefore we assume that the smoother RASTA-PLP signal allows for more overfitting, which can explain the degraded performance for SVMs on RASTA-PLP features.

Figure 4.1: The ROC for the PLP features (left), the RASTA-PLP features (middle) and the video features (right).

When we compare the results of the different classifiers trained using PLP and RASTA-PLP features, we observe that the SVM-based classifiers have the worst performance. The difference in performance for the generative models is not as clear. Using a paired samples t-test, we find that the RASTA-PLP features have a significantly higher AUC-ROC (t(59) = 2.15, p < 0.05) than the PLP features. We conclude that the combination of a GMM or HMM classifier with RASTA-PLP features results in the best performance for laughter detection in audio using our data set.

For the video features we evaluated the same classifiers using different model-parameters. The ROC plots can be found in Figure 4.1. The ROC plots show that classifiers trained on the video modality have a better performance than classifiers trained on the audio modality. When we look at the average model for the HMM classifier trained on the video features, we notice that the model for the positive instances is a left-right (LR) HMM, while the model for the positive instances for audio is an ergodic HMM (see Table 4.1). The visual laugh seems to display a sequential order that is not modeled in the audio HMMs. Another difference is that the video modality is modeled using fewer Gaussian mixtures. This can be the result of the higher dimensionality of the video features. The best result for the video modality was obtained using an SVM classifier. This can be the result of the more sequential pattern of visual laughter, which can be detected more reliably inside a sliding window than the variable audio signal. The video SVM classifier has the best single-modality performance.
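The fold-wise comparison above can be sketched with a paired two-tailed t-test over the per-fold AUC-ROC values; the two arrays below are placeholder fold scores, not the actual results of this experiment.

```python
import numpy as np
from scipy.stats import ttest_rel

# Sketch of the K-fold cross-validated paired t-test used to compare two classifiers.
# Placeholder per-fold AUC-ROC values for two feature sets (30 folds each).
rng = np.random.default_rng(2)
auc_rasta = np.clip(rng.normal(0.82, 0.10, size=30), 0.0, 1.0)
auc_plp = np.clip(rng.normal(0.78, 0.12, size=30), 0.0, 1.0)

t_stat, p_value = ttest_rel(auc_rasta, auc_plp)  # paired, two-tailed by default
print(f"t({len(auc_rasta) - 1}) = {t_stat:.2f}, p = {p_value:.3f}")
```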


Multi-modal Kernel Method for Activity Detection of Sound Sources 1 Multi-modal Kernel Method for Activity Detection of Sound Sources David Dov, Ronen Talmon, Member, IEEE and Israel Cohen, Fellow, IEEE Abstract We consider the problem of acoustic scene analysis of multiple

More information

On Human Capability and Acoustic Cues for Discriminating Singing and Speaking Voices

On Human Capability and Acoustic Cues for Discriminating Singing and Speaking Voices On Human Capability and Acoustic Cues for Discriminating Singing and Speaking Voices Yasunori Ohishi 1 Masataka Goto 3 Katunobu Itou 2 Kazuya Takeda 1 1 Graduate School of Information Science, Nagoya University,

More information

Comparison Parameters and Speaker Similarity Coincidence Criteria:

Comparison Parameters and Speaker Similarity Coincidence Criteria: Comparison Parameters and Speaker Similarity Coincidence Criteria: The Easy Voice system uses two interrelating parameters of comparison (first and second error types). False Rejection, FR is a probability

More information

AUTOMATIC RECOGNITION OF LAUGHTER

AUTOMATIC RECOGNITION OF LAUGHTER AUTOMATIC RECOGNITION OF LAUGHTER USING VERBAL AND NON-VERBAL ACOUSTIC FEATURES Tomasz Jacykiewicz 1 Dr. Fabien Ringeval 2 JANUARY, 2014 DEPARTMENT OF INFORMATICS - MASTER PROJECT REPORT Département d

More information

hit), and assume that longer incidental sounds (forest noise, water, wind noise) resemble a Gaussian noise distribution.

hit), and assume that longer incidental sounds (forest noise, water, wind noise) resemble a Gaussian noise distribution. CS 229 FINAL PROJECT A SOUNDHOUND FOR THE SOUNDS OF HOUNDS WEAKLY SUPERVISED MODELING OF ANIMAL SOUNDS ROBERT COLCORD, ETHAN GELLER, MATTHEW HORTON Abstract: We propose a hybrid approach to generating

More information

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval DAY 1 Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval Jay LeBoeuf Imagine Research jay{at}imagine-research.com Rebecca

More information

A NOVEL CEPSTRAL REPRESENTATION FOR TIMBRE MODELING OF SOUND SOURCES IN POLYPHONIC MIXTURES

A NOVEL CEPSTRAL REPRESENTATION FOR TIMBRE MODELING OF SOUND SOURCES IN POLYPHONIC MIXTURES A NOVEL CEPSTRAL REPRESENTATION FOR TIMBRE MODELING OF SOUND SOURCES IN POLYPHONIC MIXTURES Zhiyao Duan 1, Bryan Pardo 2, Laurent Daudet 3 1 Department of Electrical and Computer Engineering, University

More information

Deep Neural Networks Scanning for patterns (aka convolutional networks) Bhiksha Raj

Deep Neural Networks Scanning for patterns (aka convolutional networks) Bhiksha Raj Deep Neural Networks Scanning for patterns (aka convolutional networks) Bhiksha Raj 1 Story so far MLPs are universal function approximators Boolean functions, classifiers, and regressions MLPs can be

More information

Phone-based Plosive Detection

Phone-based Plosive Detection Phone-based Plosive Detection 1 Andreas Madsack, Grzegorz Dogil, Stefan Uhlich, Yugu Zeng and Bin Yang Abstract We compare two segmentation approaches to plosive detection: One aproach is using a uniform

More information

Classification of Musical Instruments sounds by Using MFCC and Timbral Audio Descriptors

Classification of Musical Instruments sounds by Using MFCC and Timbral Audio Descriptors Classification of Musical Instruments sounds by Using MFCC and Timbral Audio Descriptors Priyanka S. Jadhav M.E. (Computer Engineering) G. H. Raisoni College of Engg. & Mgmt. Wagholi, Pune, India E-mail:

More information

WE ADDRESS the development of a novel computational

WE ADDRESS the development of a novel computational IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 3, MARCH 2010 663 Dynamic Spectral Envelope Modeling for Timbre Analysis of Musical Instrument Sounds Juan José Burred, Member,

More information

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 04, April -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 MUSICAL

More information

GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS. Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1)

GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS. Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1) GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1) (1) Stanford University (2) National Research and Simulation Center, Rafael Ltd. 0 MICROPHONE

More information

A Large Scale Experiment for Mood-Based Classification of TV Programmes

A Large Scale Experiment for Mood-Based Classification of TV Programmes 2012 IEEE International Conference on Multimedia and Expo A Large Scale Experiment for Mood-Based Classification of TV Programmes Jana Eggink BBC R&D 56 Wood Lane London, W12 7SB, UK jana.eggink@bbc.co.uk

More information

INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION

INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION ULAŞ BAĞCI AND ENGIN ERZIN arxiv:0907.3220v1 [cs.sd] 18 Jul 2009 ABSTRACT. Music genre classification is an essential tool for

More information

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University A Pseudo-Statistical Approach to Commercial Boundary Detection........ Prasanna V Rangarajan Dept of Electrical Engineering Columbia University pvr2001@columbia.edu 1. Introduction Searching and browsing

More information

Story Tracking in Video News Broadcasts. Ph.D. Dissertation Jedrzej Miadowicz June 4, 2004

Story Tracking in Video News Broadcasts. Ph.D. Dissertation Jedrzej Miadowicz June 4, 2004 Story Tracking in Video News Broadcasts Ph.D. Dissertation Jedrzej Miadowicz June 4, 2004 Acknowledgements Motivation Modern world is awash in information Coming from multiple sources Around the clock

More information

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY Eugene Mikyung Kim Department of Music Technology, Korea National University of Arts eugene@u.northwestern.edu ABSTRACT

More information

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed,

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed, VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS O. Javed, S. Khan, Z. Rasheed, M.Shah {ojaved, khan, zrasheed, shah}@cs.ucf.edu Computer Vision Lab School of Electrical Engineering and Computer

More information

Singer Recognition and Modeling Singer Error

Singer Recognition and Modeling Singer Error Singer Recognition and Modeling Singer Error Johan Ismael Stanford University jismael@stanford.edu Nicholas McGee Stanford University ndmcgee@stanford.edu 1. Abstract We propose a system for recognizing

More information

Semi-supervised Musical Instrument Recognition

Semi-supervised Musical Instrument Recognition Semi-supervised Musical Instrument Recognition Master s Thesis Presentation Aleksandr Diment 1 1 Tampere niversity of Technology, Finland Supervisors: Adj.Prof. Tuomas Virtanen, MSc Toni Heittola 17 May

More information

Recognising Cello Performers Using Timbre Models

Recognising Cello Performers Using Timbre Models Recognising Cello Performers Using Timbre Models Magdalena Chudy and Simon Dixon Abstract In this paper, we compare timbre features of various cello performers playing the same instrument in solo cello

More information

Topic 10. Multi-pitch Analysis

Topic 10. Multi-pitch Analysis Topic 10 Multi-pitch Analysis What is pitch? Common elements of music are pitch, rhythm, dynamics, and the sonic qualities of timbre and texture. An auditory perceptual attribute in terms of which sounds

More information

Composer Style Attribution

Composer Style Attribution Composer Style Attribution Jacqueline Speiser, Vishesh Gupta Introduction Josquin des Prez (1450 1521) is one of the most famous composers of the Renaissance. Despite his fame, there exists a significant

More information

A Framework for Segmentation of Interview Videos

A Framework for Segmentation of Interview Videos A Framework for Segmentation of Interview Videos Omar Javed, Sohaib Khan, Zeeshan Rasheed, Mubarak Shah Computer Vision Lab School of Electrical Engineering and Computer Science University of Central Florida

More information

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Aric Bartle (abartle@stanford.edu) December 14, 2012 1 Background The field of composer recognition has

More information

Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio

Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio Jana Eggink and Guy J. Brown Department of Computer Science, University of Sheffield Regent Court, 11

More information

A STATISTICAL VIEW ON THE EXPRESSIVE TIMING OF PIANO ROLLED CHORDS

A STATISTICAL VIEW ON THE EXPRESSIVE TIMING OF PIANO ROLLED CHORDS A STATISTICAL VIEW ON THE EXPRESSIVE TIMING OF PIANO ROLLED CHORDS Mutian Fu 1 Guangyu Xia 2 Roger Dannenberg 2 Larry Wasserman 2 1 School of Music, Carnegie Mellon University, USA 2 School of Computer

More information

Audio-Based Video Editing with Two-Channel Microphone

Audio-Based Video Editing with Two-Channel Microphone Audio-Based Video Editing with Two-Channel Microphone Tetsuya Takiguchi Organization of Advanced Science and Technology Kobe University, Japan takigu@kobe-u.ac.jp Yasuo Ariki Organization of Advanced Science

More information

Topics in Computer Music Instrument Identification. Ioanna Karydi

Topics in Computer Music Instrument Identification. Ioanna Karydi Topics in Computer Music Instrument Identification Ioanna Karydi Presentation overview What is instrument identification? Sound attributes & Timbre Human performance The ideal algorithm Selected approaches

More information

Transcription of the Singing Melody in Polyphonic Music

Transcription of the Singing Melody in Polyphonic Music Transcription of the Singing Melody in Polyphonic Music Matti Ryynänen and Anssi Klapuri Institute of Signal Processing, Tampere University Of Technology P.O.Box 553, FI-33101 Tampere, Finland {matti.ryynanen,

More information

MOVIES constitute a large sector of the entertainment

MOVIES constitute a large sector of the entertainment 1618 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 18, NO. 11, NOVEMBER 2008 Audio-Assisted Movie Dialogue Detection Margarita Kotti, Dimitrios Ververidis, Georgios Evangelopoulos,

More information

Analysis, Synthesis, and Perception of Musical Sounds

Analysis, Synthesis, and Perception of Musical Sounds Analysis, Synthesis, and Perception of Musical Sounds The Sound of Music James W. Beauchamp Editor University of Illinois at Urbana, USA 4y Springer Contents Preface Acknowledgments vii xv 1. Analysis

More information

DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS

DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS Item Type text; Proceedings Authors Habibi, A. Publisher International Foundation for Telemetering Journal International Telemetering Conference Proceedings

More information

A CLASSIFICATION-BASED POLYPHONIC PIANO TRANSCRIPTION APPROACH USING LEARNED FEATURE REPRESENTATIONS

A CLASSIFICATION-BASED POLYPHONIC PIANO TRANSCRIPTION APPROACH USING LEARNED FEATURE REPRESENTATIONS 12th International Society for Music Information Retrieval Conference (ISMIR 2011) A CLASSIFICATION-BASED POLYPHONIC PIANO TRANSCRIPTION APPROACH USING LEARNED FEATURE REPRESENTATIONS Juhan Nam Stanford

More information

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM A QUER B EAMPLE MUSIC RETRIEVAL ALGORITHM H. HARB AND L. CHEN Maths-Info department, Ecole Centrale de Lyon. 36, av. Guy de Collongue, 69134, Ecully, France, EUROPE E-mail: {hadi.harb, liming.chen}@ec-lyon.fr

More information

Time Series Models for Semantic Music Annotation Emanuele Coviello, Antoni B. Chan, and Gert Lanckriet

Time Series Models for Semantic Music Annotation Emanuele Coviello, Antoni B. Chan, and Gert Lanckriet IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 5, JULY 2011 1343 Time Series Models for Semantic Music Annotation Emanuele Coviello, Antoni B. Chan, and Gert Lanckriet Abstract

More information

UC San Diego UC San Diego Previously Published Works

UC San Diego UC San Diego Previously Published Works UC San Diego UC San Diego Previously Published Works Title Classification of MPEG-2 Transport Stream Packet Loss Visibility Permalink https://escholarship.org/uc/item/9wk791h Authors Shin, J Cosman, P

More information

Speech and Speaker Recognition for the Command of an Industrial Robot

Speech and Speaker Recognition for the Command of an Industrial Robot Speech and Speaker Recognition for the Command of an Industrial Robot CLAUDIA MOISA*, HELGA SILAGHI*, ANDREI SILAGHI** *Dept. of Electric Drives and Automation University of Oradea University Street, nr.

More information

Recognising Cello Performers using Timbre Models

Recognising Cello Performers using Timbre Models Recognising Cello Performers using Timbre Models Chudy, Magdalena; Dixon, Simon For additional information about this publication click this link. http://qmro.qmul.ac.uk/jspui/handle/123456789/5013 Information

More information

Digital Video Telemetry System

Digital Video Telemetry System Digital Video Telemetry System Item Type text; Proceedings Authors Thom, Gary A.; Snyder, Edwin Publisher International Foundation for Telemetering Journal International Telemetering Conference Proceedings

More information

Computational Models of Music Similarity. Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST)

Computational Models of Music Similarity. Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST) Computational Models of Music Similarity 1 Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST) Abstract The perceived similarity of two pieces of music is multi-dimensional,

More information

Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis

Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis Fengyan Wu fengyanyy@163.com Shutao Sun stsun@cuc.edu.cn Weiyao Xue Wyxue_std@163.com Abstract Automatic extraction of

More information

arxiv: v1 [cs.sd] 8 Jun 2016

arxiv: v1 [cs.sd] 8 Jun 2016 Symbolic Music Data Version 1. arxiv:1.5v1 [cs.sd] 8 Jun 1 Christian Walder CSIRO Data1 7 London Circuit, Canberra,, Australia. christian.walder@data1.csiro.au June 9, 1 Abstract In this document, we introduce

More information

BitWise (V2.1 and later) includes features for determining AP240 settings and measuring the Single Ion Area.

BitWise (V2.1 and later) includes features for determining AP240 settings and measuring the Single Ion Area. BitWise. Instructions for New Features in ToF-AMS DAQ V2.1 Prepared by Joel Kimmel University of Colorado at Boulder & Aerodyne Research Inc. Last Revised 15-Jun-07 BitWise (V2.1 and later) includes features

More information

Browsing News and Talk Video on a Consumer Electronics Platform Using Face Detection

Browsing News and Talk Video on a Consumer Electronics Platform Using Face Detection Browsing News and Talk Video on a Consumer Electronics Platform Using Face Detection Kadir A. Peker, Ajay Divakaran, Tom Lanning Mitsubishi Electric Research Laboratories, Cambridge, MA, USA {peker,ajayd,}@merl.com

More information

THE importance of music content analysis for musical

THE importance of music content analysis for musical IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 1, JANUARY 2007 333 Drum Sound Recognition for Polyphonic Audio Signals by Adaptation and Matching of Spectrogram Templates With

More information

Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility

Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility Karim M. Ibrahim (M.Sc.,Nile University, Cairo, 2016) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE DEPARTMENT

More information

The MAHNOB Laughter Database. Stavros Petridis, Brais Martinez, Maja Pantic

The MAHNOB Laughter Database. Stavros Petridis, Brais Martinez, Maja Pantic Accepted Manuscript The MAHNOB Laughter Database Stavros Petridis, Brais Martinez, Maja Pantic PII: S0262-8856(12)00146-1 DOI: doi: 10.1016/j.imavis.2012.08.014 Reference: IMAVIS 3193 To appear in: Image

More information

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Kazuyoshi Yoshii, Masataka Goto and Hiroshi G. Okuno Department of Intelligence Science and Technology National

More information

Pitch. The perceptual correlate of frequency: the perceptual dimension along which sounds can be ordered from low to high.

Pitch. The perceptual correlate of frequency: the perceptual dimension along which sounds can be ordered from low to high. Pitch The perceptual correlate of frequency: the perceptual dimension along which sounds can be ordered from low to high. 1 The bottom line Pitch perception involves the integration of spectral (place)

More information

Automatic Music Similarity Assessment and Recommendation. A Thesis. Submitted to the Faculty. Drexel University. Donald Shaul Williamson

Automatic Music Similarity Assessment and Recommendation. A Thesis. Submitted to the Faculty. Drexel University. Donald Shaul Williamson Automatic Music Similarity Assessment and Recommendation A Thesis Submitted to the Faculty of Drexel University by Donald Shaul Williamson in partial fulfillment of the requirements for the degree of Master

More information

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG?

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? NICHOLAS BORG AND GEORGE HOKKANEN Abstract. The possibility of a hit song prediction algorithm is both academically interesting and industry motivated.

More information

Figure 1: Feature Vector Sequence Generator block diagram.

Figure 1: Feature Vector Sequence Generator block diagram. 1 Introduction Figure 1: Feature Vector Sequence Generator block diagram. We propose designing a simple isolated word speech recognition system in Verilog. Our design is naturally divided into two modules.

More information

Automatic Labelling of tabla signals

Automatic Labelling of tabla signals ISMIR 2003 Oct. 27th 30th 2003 Baltimore (USA) Automatic Labelling of tabla signals Olivier K. GILLET, Gaël RICHARD Introduction Exponential growth of available digital information need for Indexing and

More information

Music Genre Classification

Music Genre Classification Music Genre Classification chunya25 Fall 2017 1 Introduction A genre is defined as a category of artistic composition, characterized by similarities in form, style, or subject matter. [1] Some researchers

More information

Singer Traits Identification using Deep Neural Network

Singer Traits Identification using Deep Neural Network Singer Traits Identification using Deep Neural Network Zhengshan Shi Center for Computer Research in Music and Acoustics Stanford University kittyshi@stanford.edu Abstract The author investigates automatic

More information

A Parametric Autoregressive Model for the Extraction of Electric Network Frequency Fluctuations in Audio Forensic Authentication

A Parametric Autoregressive Model for the Extraction of Electric Network Frequency Fluctuations in Audio Forensic Authentication Proceedings of the 3 rd International Conference on Control, Dynamic Systems, and Robotics (CDSR 16) Ottawa, Canada May 9 10, 2016 Paper No. 110 DOI: 10.11159/cdsr16.110 A Parametric Autoregressive Model

More information

Singing voice synthesis based on deep neural networks

Singing voice synthesis based on deep neural networks INTERSPEECH 2016 September 8 12, 2016, San Francisco, USA Singing voice synthesis based on deep neural networks Masanari Nishimura, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda

More information

Experiments on musical instrument separation using multiplecause

Experiments on musical instrument separation using multiplecause Experiments on musical instrument separation using multiplecause models J Klingseisen and M D Plumbley* Department of Electronic Engineering King's College London * - Corresponding Author - mark.plumbley@kcl.ac.uk

More information

Music Emotion Recognition. Jaesung Lee. Chung-Ang University

Music Emotion Recognition. Jaesung Lee. Chung-Ang University Music Emotion Recognition Jaesung Lee Chung-Ang University Introduction Searching Music in Music Information Retrieval Some information about target music is available Query by Text: Title, Artist, or

More information

Neural Network for Music Instrument Identi cation

Neural Network for Music Instrument Identi cation Neural Network for Music Instrument Identi cation Zhiwen Zhang(MSE), Hanze Tu(CCRMA), Yuan Li(CCRMA) SUN ID: zhiwen, hanze, yuanli92 Abstract - In the context of music, instrument identi cation would contribute

More information

Lyrics Classification using Naive Bayes

Lyrics Classification using Naive Bayes Lyrics Classification using Naive Bayes Dalibor Bužić *, Jasminka Dobša ** * College for Information Technologies, Klaićeva 7, Zagreb, Croatia ** Faculty of Organization and Informatics, Pavlinska 2, Varaždin,

More information

ABSOLUTE OR RELATIVE? A NEW APPROACH TO BUILDING FEATURE VECTORS FOR EMOTION TRACKING IN MUSIC

ABSOLUTE OR RELATIVE? A NEW APPROACH TO BUILDING FEATURE VECTORS FOR EMOTION TRACKING IN MUSIC ABSOLUTE OR RELATIVE? A NEW APPROACH TO BUILDING FEATURE VECTORS FOR EMOTION TRACKING IN MUSIC Vaiva Imbrasaitė, Peter Robinson Computer Laboratory, University of Cambridge, UK Vaiva.Imbrasaite@cl.cam.ac.uk

More information

1 Introduction to PSQM

1 Introduction to PSQM A Technical White Paper on Sage s PSQM Test Renshou Dai August 7, 2000 1 Introduction to PSQM 1.1 What is PSQM test? PSQM stands for Perceptual Speech Quality Measure. It is an ITU-T P.861 [1] recommended

More information

An Accurate Timbre Model for Musical Instruments and its Application to Classification

An Accurate Timbre Model for Musical Instruments and its Application to Classification An Accurate Timbre Model for Musical Instruments and its Application to Classification Juan José Burred 1,AxelRöbel 2, and Xavier Rodet 2 1 Communication Systems Group, Technical University of Berlin,

More information

About Giovanni De Poli. What is Model. Introduction. di Poli: Methodologies for Expressive Modeling of/for Music Performance

About Giovanni De Poli. What is Model. Introduction. di Poli: Methodologies for Expressive Modeling of/for Music Performance Methodologies for Expressiveness Modeling of and for Music Performance by Giovanni De Poli Center of Computational Sonology, Department of Information Engineering, University of Padova, Padova, Italy About

More information

Toward Multi-Modal Music Emotion Classification

Toward Multi-Modal Music Emotion Classification Toward Multi-Modal Music Emotion Classification Yi-Hsuan Yang 1, Yu-Ching Lin 1, Heng-Tze Cheng 1, I-Bin Liao 2, Yeh-Chin Ho 2, and Homer H. Chen 1 1 National Taiwan University 2 Telecommunication Laboratories,

More information

Effects of acoustic degradations on cover song recognition

Effects of acoustic degradations on cover song recognition Signal Processing in Acoustics: Paper 68 Effects of acoustic degradations on cover song recognition Julien Osmalskyj (a), Jean-Jacques Embrechts (b) (a) University of Liège, Belgium, josmalsky@ulg.ac.be

More information

Analysis of the effects of signal distance on spectrograms

Analysis of the effects of signal distance on spectrograms 2014 Analysis of the effects of signal distance on spectrograms SGHA 8/19/2014 Contents Introduction... 3 Scope... 3 Data Comparisons... 5 Results... 10 Recommendations... 10 References... 11 Introduction

More information

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University Week 14 Query-by-Humming and Music Fingerprinting Roger B. Dannenberg Professor of Computer Science, Art and Music Overview n Melody-Based Retrieval n Audio-Score Alignment n Music Fingerprinting 2 Metadata-based

More information

HUMANS have a remarkable ability to recognize objects

HUMANS have a remarkable ability to recognize objects IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 9, SEPTEMBER 2013 1805 Musical Instrument Recognition in Polyphonic Audio Using Missing Feature Approach Dimitrios Giannoulis,

More information

Lecture 9 Source Separation

Lecture 9 Source Separation 10420CS 573100 音樂資訊檢索 Music Information Retrieval Lecture 9 Source Separation Yi-Hsuan Yang Ph.D. http://www.citi.sinica.edu.tw/pages/yang/ yang@citi.sinica.edu.tw Music & Audio Computing Lab, Research

More information

Music Radar: A Web-based Query by Humming System

Music Radar: A Web-based Query by Humming System Music Radar: A Web-based Query by Humming System Lianjie Cao, Peng Hao, Chunmeng Zhou Computer Science Department, Purdue University, 305 N. University Street West Lafayette, IN 47907-2107 {cao62, pengh,

More information

Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng

Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng Introduction In this project we were interested in extracting the melody from generic audio files. Due to the

More information