Improving Frame Based Automatic Laughter Detection
Mary Knox
EE225D Class Project
December 13, 2007

Abstract

Laughter recognition is an underexplored area of research. My goal for this project was to improve upon my previous work to automatically detect laughter on a frame-by-frame basis. My previous system (the baseline system) detected laughter based on short-term features including MFCCs, pitch, and energy. In this project, I explored the utility of additional features (phone and prosodic), both by themselves and in combination with the baseline system. I improved the baseline system by 0.1% absolute and achieved an equal error rate (EER) of 7.9% for laughter detection on the ICSI Meetings database.

1 Introduction

Audio communication contains a wealth of information in addition to spoken words. Specifically, laughter provides cues regarding the emotional state of the speaker [1], topic changes in the conversation [2], and the speaker's identity. Accurate laughter detection could be useful in a variety of applications. A laughter detector incorporated into a digital camera could be used to identify an opportune time to take a picture [3]. Laughter could be useful in a video search for humorous clips [4]. In speech recognition, identifying laughter could decrease the word error rate by identifying non-speech sounds [2]. The overall goal of this study is to use laughter for speaker recognition, as my intuition is that many individuals have their own distinct laugh. To explore the utility of laughter segments for speaker recognition, however, we first need to build a robust system to detect laughter, which is the focus of this paper.

Previous work has studied the acoustics of laughter [5, 6, 7]. Many agree that laughter has a breathy consonant-vowel structure [5, 8]. Some have made generalizations about laughter, such as Provine, who concluded that laughter is usually a series of short syllables repeated approximately every 210 ms [7]. Yet others have found laughter to be highly variable [8] and thus difficult to stereotype [6]. These conclusions lead me to believe that automatic laughter detection is not a simple task.
The most relevant previous work on automatic laughter detection is that of Kennedy and Ellis [2], Truong and van Leeuwen [1], and Knox and Mirghafori [9] (my previous work). However, the experimental setups and objectives of these works differ from this study, with the exception of my previous work.

Kennedy and Ellis [2] studied the detection of overlapped (multiple-speaker) laughter in the Meetings domain. They split the data into non-overlapping one-second segments, which were then classified based on whether or not multiple speakers laughed. They used support vector machines (SVMs) trained on four features: MFCCs, delta MFCCs, modulation spectrum, and spatial cues. They achieved a true positive rate of 87%.

Truong and van Leeuwen [1] classified presegmented ICSI Meetings data as laughter or speech. The segments were determined prior to training and testing their system and had variable durations. The average durations of the laughter and speech segments were 2.21 and 2.02 seconds, respectively. They used Gaussian mixture models trained with perceptual linear prediction (PLP) features, pitch and energy, pitch and voicing, and modulation spectrum. They built models for each of the feature sets. The model trained with PLP features performed the best, at 13.4% EER (with each segment, which had varying duration, weighted equally in the EER computation) for a data set similar to the one used in this study.

In my previous work [9], I did laughter detection at the frame level. A neural network was trained on each of the following features: MFCCs, energy (RMS), fundamental frequency (F0), and the largest cross-correlation value used to compute the fundamental frequency (AC PEAK). The score-level combination of the MFCC and AC PEAK features had the lowest EER, at 8.0% (with each frame, or time unit, weighted equally in the EER computation) for the same dataset used in this study.

Current state-of-the-art speech recognition systems also detect laughter. However, the threshold used is such that most occurrences of laughter are not identified. For example, SRI's speech recognizer was run on the same data used in this study and achieved a false acceptance rate of 0.1% and a false rejection rate of 78.2% for laughter. Since laughter is a somewhat rare event, occurring only 6.3% of the time in the dataset used in this study, it is important to identify as many laughter segments as possible in order to have data to use in a speaker recognition system. That being the case, the SRI speech recognizer would not be very useful for this task. Furthermore, the systems which used one-second (or longer) segments or presegmented data would not be able to precisely detect laughter segments. Thus, for this project I expanded upon my previous system, which performed laughter detection for each frame, by including phone and prosodic features.

The outline of this report is as follows: in Section 2 I describe the data used in this study, in Section 3 I further describe my previous work, in Section 4 I explain the current system and results, in Section 5 I discuss the results, and in Section 6 I provide my conclusions and ideas for future work.

2 Data

I trained and tested the detector on the ICSI Meeting Recorder Corpus [10], a hand-transcribed corpus of multi-party meeting recordings, in which each of the speakers was recorded on a close-talking microphone (the data used in this study) as well as distant microphones.
The full text was transcribed, in addition to non-lexical events (including coughs, lip smacks, mic noise, and, most importantly, laughter). There were a total of 75 meetings in this corpus.

In order to compare my results to the work done by Kennedy and Ellis [2] and Truong and van Leeuwen [1], I used the same training and testing sets, which were taken from the Bmr subset of the corpus. This subset contains 29 meetings. The first 26 were used for training and the last 3 were used to test the detector.

I trained and tested only on data which was hand-transcribed to be either laughter or non-laughter. Laughter-colored speech, that is, cases in which the hand transcription listed both speech and laughter under a single start and end time, was disregarded, since I would not specifically know which time interval(s) contained laughter. Also, if the transcription did not include information for a period of time on a channel, that audio was excluded. This exclusion reduced training and testing on cross-talk and allowed me to train and test on channels only when they were in use. Ideally, an automatic silence detector would be employed in this step instead of relying on the transcripts. As a note, unlike Truong and van Leeuwen, I included audio that contained non-lexical vocalized sounds other than laughter.

Figure 1 shows the histogram of the laughter durations. The average laugh duration was seconds with a standard deviation of seconds.

Figure 1: Histogram of laugh duration for the Bmr subset of the ICSI Meeting Recorder Corpus.
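To make the data-selection rules above concrete, the sketch below shows one way transcript segments might be filtered into laughter and non-laughter training material. This is only an illustration, not code from this project: the Segment fields and label strings are assumptions about the transcript format rather than the actual ICSI annotation scheme.

```python
# Hypothetical transcript representation; treat purely as an illustration.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Segment:
    channel: str        # close-talking microphone channel
    start: float        # segment start time (seconds)
    end: float          # segment end time (seconds)
    labels: List[str]   # hand-transcribed events, e.g. ["laugh"] or ["speech"]

def select_segments(segments: List[Segment]) -> Tuple[List[Segment], List[Segment]]:
    """Keep only segments that are purely laughter or purely non-laughter."""
    laughter, non_laughter = [], []
    for seg in segments:
        has_laugh = "laugh" in seg.labels
        has_speech = "speech" in seg.labels
        if has_laugh and has_speech:
            # "Laughter-colored speech": the laughter interval is unknown, so skip it.
            continue
        (laughter if has_laugh else non_laughter).append(seg)
    # Audio with no transcription for a channel never appears in the segment
    # list, so it is excluded automatically.
    return laughter, non_laughter
```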
3 Previous Work

3.1 Previous System

My previous system trained a neural network with MFCC, pitch, and energy features to detect whether a given frame contained laughter.

Features

Mel Frequency Cepstral Coefficients (MFCCs). MFCCs were used to capture the spectral features of laughter and non-laughter. The first-order regression coefficients of the MFCCs (delta MFCCs) and the second-order regression coefficients (delta-delta MFCCs) were also computed and used as features for the neural network. I used the first 12 MFCCs as well as the 0th coefficient, computed over a 25 ms window with a 10 ms forward shift, as features for the neural network. MFCC features were extracted using the Hidden Markov Model Toolkit (HTK) [11]. For each frame I computed 13 MFCCs, 13 delta MFCCs, and 13 delta-delta MFCCs.

Pitch and energy. Studies of the acoustics of laughter [5, 6] and of automatic laughter detection [1] investigated the pitch and energy of laughter as potentially important features. I used the ESPS pitch tracker get_f0 [12] to extract the fundamental frequency (F0), the local root mean squared energy (RMS), and the highest normalized cross-correlation value found while determining F0 (AC PEAK) for each frame. The delta and delta-delta coefficients were computed for each of these features as well.

Learning Method

I did frame-wise laughter detection. Since the frames were short in duration (10 ms) and each laughter segment was on average seconds in this data set, I decided it would be best to use a large context window of features as inputs to the neural network.

A neural network with one hidden layer was trained using QuickNet [13]. The input to the neural network was a window of feature frames, where the center frame was the target frame. I used the softmax activation function to compute the probability that the frame was laughter. To prevent over-fitting, the data used to train the neural network was split into two groups: training (the first 21 Bmr meetings) and cross validation (the last 5 meetings from the original training set). The neural network weights were updated based on the training data via the back-propagation algorithm, and the cross validation data was scored after every training epoch, yielding the cross validation frame accuracy (CVFA). Training was concluded once the CVFA increased by less than 0.5% for a second time.

3.2 Previous experiments and results

Parameter settings

I first needed to determine the input window size and the number of hidden units in the neural network. Empirically, I found that a context window of 75 consecutive frames (0.75 seconds) worked well. To base the classification of laughter on the middle frame, I set the offset to 37 frames. In other words, the inputs to the neural network were the features from the frame to be classified and the 37 frames before and after it. Figure 2 shows the windowing technique.

I also had to determine the number of hidden units. MFCCs were the most valuable features for Kennedy and Ellis [2], and I suspected my system would have similar results. Thus, I used the MFCCs as the input features and modified the number of hidden units while keeping all other parameters the same. Based on the accuracy on the cross validation set, I found that 200 hidden units performed best. Similarly, I varied the number of hidden units using F0 features. The CVFA was approximately the same for a range of hidden units, but the system with 200 was marginally better than the rest.
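As an illustration of the short-term feature extraction and 75-frame windowing described above, the following sketch builds the 39-dimensional MFCC stream and stacks it into context windows. It is an approximation: the project used HTK [11] and the ESPS get_f0 tracker rather than librosa, and the edge-padding at utterance boundaries is my assumption.

```python
import numpy as np
import librosa

def mfcc_stream(wav_path: str, sr: int = 16000) -> np.ndarray:
    """13 MFCCs plus deltas and delta-deltas per 10 ms frame (25 ms analysis window)."""
    y, _ = librosa.load(wav_path, sr=sr)
    n_fft = int(0.025 * sr)            # 25 ms window
    hop = int(0.010 * sr)              # 10 ms forward shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)             # first-order regression coefficients
    delta2 = librosa.feature.delta(mfcc, order=2)   # second-order regression coefficients
    return np.vstack([mfcc, delta, delta2]).T       # shape: (num_frames, 39)

def context_windows(frames: np.ndarray, offset: int = 37) -> np.ndarray:
    """Stack each target frame with the 37 frames before and after it (75 total)."""
    padded = np.pad(frames, ((offset, offset), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * offset + 1].ravel()
                     for t in range(frames.shape[0])])

# Example: the input to the 200-hidden-unit network is then 75 * 39 = 2925-dimensional.
```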
Figure 2: For each frame being evaluated, features from a window of 75 frames (0.75 s) are input to the neural network; the target frame (10 ms) is at the center of the window.

Systems

The neural networks were first trained separately on the four classes of features: MFCCs, F0, RMS, and AC PEAK. The EERs for each of the classes are shown in Table 1. Each column lists the EER for a neural network trained with the feature itself, the deltas, the delta-deltas, and the feature-level combination of the feature, delta, and delta-delta (the All system).

I then combined the All systems at the score level using another neural network, this time with a smaller window size and fewer hidden units. Since each of the inputs was the probability of laughter for each frame, I shortened the input window of the combiner neural network from 75 frames to 9 and reduced the number of hidden units to 2. Since the system using MFCC features had the lowest EER, I combined MFCCs with each of the other classes of features. Table 2 shows that after combining, the MFCC+AC PEAK system (the baseline system) performed the best. I then combined the MFCC, AC PEAK, and RMS features. Finally, I combined all of the systems and computed the EER.

Table 1: Equal Error Rates (%).

                 MFCCs   F0   RMS   AC PEAK
    Feature
    Delta
    Delta-Delta
    All

Table 2: Equal Error Rates for Combined Systems (%).

                             EER
    MFCC+F0
    MFCC+RMS                8.92
    MFCC+AC PEAK            7.97
    MFCC+AC PEAK+RMS        8.26
    MFCC+AC PEAK+RMS+F0
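The equal error rates in Tables 1 and 2 are obtained by sweeping the detection threshold until the false acceptance and false rejection rates meet. The sketch below is a straightforward frame-weighted version of that computation; it is my reconstruction, not the scoring tool used in this project.

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """scores: per-frame laughter probabilities; labels: 1 for laughter, 0 otherwise.
    Every frame counts equally, matching the frame-weighted EER used here."""
    best_eer, best_gap = 1.0, np.inf
    for threshold in np.unique(scores):
        decide_laugh = scores >= threshold
        fa = np.mean(decide_laugh[labels == 0])     # false acceptance rate
        fr = np.mean(~decide_laugh[labels == 1])    # false rejection rate
        if abs(fa - fr) < best_gap:
            best_gap, best_eer = abs(fa - fr), (fa + fr) / 2.0
    return best_eer
```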
4 Current Work

While the results from the previous work were good, I hoped to improve upon the baseline system by including more features.

4.1 Current System

Additional Features

Phones. Laughter has a repeated consonant-vowel structure [5, 8]. Thus, phone sequences seemed to be a good feature for identifying laughter. I used the SRI phone recognizer, Decipher [14], to extract the phones; however, Decipher annotates non-standard phones, including laughter. Although this was not the information I originally intended to extract, it seemed plausible that Decipher's laughter detector could improve the baseline results.

Prosodic. My previous system included only short-term features. However, laughter differs from most speech sounds because it repeats approximately every 210 ms [7]. Since prosodic features are extracted over longer intervals of time, they may help distinguish laughter from non-laughter. I used 21 prosodic features, which were statistics (including min, max, mean, standard deviation, etc.) on pitch, energy, long-term average spectrum, and noise-to-harmonic ratio.

Learning Methods

For the phone features, I trained a neural network with a 46-dimensional feature vector (one dimension for each possible phone) for each frame. Each feature vector had only one nonzero value, which corresponded to the phone for that frame. I again used features over a context window as the input to the neural network and trained the neural network using the same training and cross validation sets as before in order to prevent over-fitting.

Prosodic features were extracted for each segment as documented in the hand-transcriptions. Figure 3 is an example hand-transcription which shows each segment marked with a start and end time. Segmenting based on the transcript guaranteed that non-laughter segments were separated from laughter segments. In the future, it would be better to have an automatic segmenter. However, since this was my first experiment with prosodic features, I wanted to easily determine which features perform well for the task of laughter detection. Since the prosodic features were computed over the entire segment, I used an SVM to build a model to detect laughter.

Figure 3: Example hand-transcription.
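As an illustration of the two additional feature types, the sketch below builds the 46-dimensional one-hot phone vectors and a few segment-level prosodic statistics. The statistics shown (min, max, mean, standard deviation) are only a subset of the 21 actually used, and the handling of unvoiced frames is my assumption.

```python
import numpy as np

NUM_PHONES = 46   # one dimension per phone in the recognizer's inventory

def one_hot_phones(phone_ids: np.ndarray) -> np.ndarray:
    """phone_ids: (num_frames,) integer phone index per frame -> (num_frames, 46)."""
    feats = np.zeros((len(phone_ids), NUM_PHONES))
    feats[np.arange(len(phone_ids)), phone_ids] = 1.0
    return feats

def segment_stats(pitch: np.ndarray, energy: np.ndarray) -> np.ndarray:
    """Segment-level statistics over the frame-level pitch and energy contours."""
    stats = []
    for contour in (pitch, energy):
        voiced = contour[contour > 0]               # drop unvoiced (zero-pitch) frames
        voiced = voiced if voiced.size else np.zeros(1)
        stats.extend([voiced.min(), voiced.max(), voiced.mean(), voiced.std()])
    return np.asarray(stats)
```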
4.2 Current experiments and results

Parameter settings

Many of the parameters of the neural network trained on phone features were chosen to be compatible with the previous system. For example, I again used a window size of 75 frames. This was done in order to compare score-level combinations of the phone and baseline systems with feature-level combinations, which will be computed in the future. Based on the CVFA score on the cross validation set, I chose the number of hidden units to be 9.

For the prosodic features, I set the cost-factor (the amount by which training errors on positive examples outweigh training errors on negative examples) to 10 when building the SVM models. This was done because there were approximately 10 non-laughter segments for every laughter segment.

System Results

I first trained systems on phone features and prosodic features alone. The neural network trained on phone features achieved an EER of 18.45%, as shown in Table 4. Using the training set, SVM models were trained using all of the statistics for each class of prosodic features: pitch, energy, long-term average spectrum (LTAS), and noise-to-harmonic ratio (N2H). The EER was computed using the cross validation set (each segment, which had varying duration, was equally weighted in the EER computation). As shown in Table 3, the energy features performed the best. I then combined the energy features with each of the other classes of features. The system trained with energy, noise-to-harmonic ratio, and pitch features was the best prosodic system. I then computed the EER with each frame weighted equally, which was 50%. This result is shown in Table 4.

Table 3: Equal Error Rates for Prosodic Features (%); each segment was equally weighted.

                               EER
    PITCH
    ENERGY
    LTAS
    N2H
    ENERGY+PITCH
    ENERGY+LTAS
    ENERGY+N2H
    ENERGY+N2H+PITCH
    ENERGY+N2H+LTAS
    ENERGY+N2H+PITCH+LTAS
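The cost-factor of 10 described under Parameter settings compensates for the roughly 10:1 ratio of non-laughter to laughter segments. A rough scikit-learn equivalent is sketched below; mapping the cost-factor onto class_weight is my approximation, since the text does not name the SVM toolkit that was used.

```python
import numpy as np
from sklearn.svm import SVC

def train_prosodic_svm(X: np.ndarray, y: np.ndarray) -> SVC:
    """X: (num_segments, num_prosodic_features) segment-level statistics.
    y: 1 for laughter segments, 0 for non-laughter segments."""
    # Weight errors on the rare laughter class roughly 10x more heavily,
    # analogous to the cost-factor of 10 described above.
    model = SVC(kernel="linear", class_weight={1: 10.0, 0: 1.0}, probability=True)
    model.fit(X, y)
    return model
```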
I then performed score-level combinations of the phone, prosodic, and baseline systems using the same neural network parameters as the previous system's score-level combinations. While the score-level combination of the baseline and phone systems improved the EER to 7.9%, the combination of the baseline and prosodic systems degraded it to 9.8%. When all three systems were combined, the EER was 8.75%. As shown in Table 4, the combination of the baseline and phone systems had the best result.

Table 4: Equal Error Rates (%).

                                EER
    PHONES                     18.45
    PROSODIC                   50
    PHONES+PROSODIC
    BASELINE+PHONES            7.87
    BASELINE+PROSODIC          9.80
    BASELINE+PHONES+PROSODIC   8.75

5 Discussion

From Table 4, it is clear that the prosodic system does not score as well when each frame is weighted equally. It may be beneficial either to change the training method for the prosodic system from the segment level to the frame level, perhaps using a neural network, or to modify the method of combining the prosodic system with the other systems so as to make use of the prosodic system's results when each segment is weighted equally.

Since the prosodic system did not score well on its own, it was not surprising that combining it with the baseline system and with the baseline+phone system made the resulting systems worse. However, when the prosodic system was combined with the phone system, the EER improved. Despite this improvement, the score was still worse than that of the baseline system.

Taking advantage of Decipher's phone recognizer proved beneficial. The EER for the combined baseline and phone system improved to 7.87%, which is currently the best EER for laughter detection at the frame level.

6 Conclusion and future work

In conclusion, I have improved upon my baseline results by including additional features. The phone system improved the baseline system by 0.1% absolute. Although prosodic features did not improve the baseline system, I think that there are many other prosodic features which will improve laughter detection.

In the future, I plan to score feature-level combinations of the systems using a comparable number of hidden units. Also, many other prosodic features have already been extracted and may be beneficial for this study. After determining which prosodic features work well, I will use a moving-window technique to extract the features so that this laughter detector can be run on audio that may not have hand-transcriptions. I also plan to experiment with using a neural network to train the prosodic system instead of an SVM.
7 Acknowledgments

I would like to thank Christian Mueller for all of his help in extracting the prosodic features and Nikki Mirghafori for her guidance throughout this study.
References

[1] Truong, K. P. and van Leeuwen, D. A., Automatic detection of laughter, in Proceedings of Interspeech, Lisbon, Portugal.

[2] Kennedy, L. and Ellis, D., Laughter detection in meetings, NIST ICASSP 2004 Meeting Recognition Workshop, Montreal.

[3] Carter, A., Automatic acoustic laughter detection, Master's thesis, Keele University.

[4] Cai, R., Lu, L., Zhang, H.-J., and Cai, L.-H., Highlight sound effects detection in audio stream, in Proceedings of the International Conference on Multimedia and Expo, Baltimore.

[5] Bickley, C. and Hunnicutt, S., Acoustic analysis of laughter, in Proceedings of ICSLP, Banff, Canada.

[6] Bachorowski, J., Smoski, M., and Owren, M., The acoustic features of human laughter, Journal of the Acoustical Society of America.

[7] Provine, R., Laughter, American Scientist, January-February.

[8] Trouvain, J., Segmenting phonetic units in laughter, in Proceedings of ICPhS.

[9] Knox, M. and Mirghafori, N., Automatic laughter detection using neural networks, unpublished.

[10] Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Peskin, B., Pfau, T., Shriberg, E., Stolcke, A., and Wooters, C., The ICSI meeting corpus, in Proceedings of ICASSP, Hong Kong, April.

[11] Hidden Markov Model Toolkit (HTK).

[12] Entropic Research Laboratory, Washington, D.C., ESPS version 5.0 programs manual, August.

[13] QuickNet.

[14] Cohen, M., Murveit, H., Bernstein, J., Price, P., and Weintraub, M., The DECIPHER speech recognition system, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Albuquerque.