Laughbot: Detecting Humor in Spoken Language with Language and Audio Cues
Kate Park, Annie Hu, Natalie Muenster

Abstract

We propose detecting and responding to humor in spoken dialogue by extracting language and audio cues and feeding these features into a combined Recurrent Neural Network (RNN) and logistic regression model. In this paper, we parse Switchboard phone conversations to build a corpus of punchlines and unfunny lines, where punchlines precede laughter tokens in Switchboard transcripts. We create a combined RNN and logistic regression model that uses both acoustic and language cues to predict whether a conversational agent should respond to an utterance with laughter. Our model achieves an F1-score of 63.2 and accuracy of [...]. This model outperforms our logistic language model (F1-score 56.6) and RNN acoustic model (59.4), as well as the final RNN model of Bertero et al.'s (2016) paper (52.9). Using our final model, we create a laughbot that audibly responds to a user with laughter when their utterance is classified as a punchline. A conversational agent outfitted with a humor-recognition system such as the one we present in this paper would be valuable as these agents gain utility in everyday life.

1 Introduction

Our project aims to detect humor and thus predict when someone will laugh based on textual and acoustic cues. Humor depends on the words spoken, on how the speaker's voice changes, and on situational context. Predicting when someone will laugh is a significant part of contemporary media. For example, sitcoms are written and performed primarily to cause laughter, so the ability to detect precisely why something will draw laughter is of extreme importance to screenwriters and actors. Being able to detect humor could also improve embodied conversational agents, which are attributed more human qualities when they are able to display an appreciation of humor (Nijholt, 2003).
Siri currently will only say "you are very funny," but it could become more human if it had the ability to recognize social cues like humor and respond to them with laughter. Thus, the objective is to train and test a model that can detect whether or not a line is funny and should be responded to with laughter.

2 Background / Related Work

Existing work in predicting humor uses a combination of text and audio features and various machine learning methods to tackle the classification problem. The goal of Piot et al.'s (2014) research in "Predicting when to Laugh with Structured Classification" was to predict laughter and create an avatar that imitated a human expert's sense of when to laugh in a conversation. Piot et al. reduced the problem to a multi-class classification problem and found that a large-margin algorithm produced the most natural behavior (laughing in the same proportion as the expert). In a later paper, "Imitation Learning Applied to Embodied Conversational Agents," Piot et al. (2015) examined the problem of using a laughter-enabled Interaction Manager to make decisions about whether to laugh or not. They defined the Regularized Classification for Apprenticeship Learning (RCAL) algorithm using audio features. Final results showed that RCAL performed much better than multi-class classification on one class
and slightly worse on the other. The paper noted the problem of a large class imbalance between laughter and non-laughter in the dataset, and suggested potential future work in Inverse Reinforcement Learning (IRL) algorithms. From this, we propose weighting the datasets to reduce imbalance so that non-laughter does not overpower what the model learns from punchlines.

Bertero and Fung (2016) compared three supervised machine learning methods to predict and detect humor in dialogues from the sitcom The Big Bang Theory. From a corpus where punchlines were annotated by laughter, Bertero et al. extracted audio and language features. These features were fed into three models (CNN, RNN, and CRF), which were compared against a logistic regression baseline (F1-score 29.2). The CNN using word vectors and overlapping time frames of 25 ms performed the best (F1-score of 68.6). The RNN model (F1-score 52.9) was expected to perform best but actually performed worse than the CNN, likely due to overfitting. The paper proposes future work on building a better dialog system that understands humor.

In this paper, our objective is to improve the RNN model. Rather than a sitcom dataset, we use everyday phone conversations, whose humor we hypothesize is more relevant for a conversational agent like Siri. Further, our objective is to build a simple dialog system, laughbot, that responds to humor.

3 Dataset

The focus of this project is detecting humor in everyday, regular speech, in contrast to the past work we have examined, which analyzes humor in sitcoms. We use the Switchboard Corpora available on AFS, which consists of around 3000 groups of audio files, transcripts, and word-level time intervals. We classified each line of a transcript as a punchline if it preceded any indication of laughter at the beginning of the next person's response:

B: That, that's the major reason I'm walking up the stairs. (Punchline)
A: [Laughter] To go skiing?
Table 1: Punchline inducing laughter

A: Uh-huh. Well, you must have a relatively clean conscience then [laughter]. (Punchline)
B: [Laughter]
Table 2: Laughter that induces laughter is classified as a punchline

A: just for fun. (Not Punchline)
B: Shaking the scorpions out of their shoes [laughter].
Table 3: A line preceding someone laughing at themselves does not count as a punchline

We split our dataset into 80% train / 10% validation / 10% test sets. Previous work ran into imbalanced datasets, and our original dataset is heavily imbalanced towards non-punchlines, so we sample 5% of the non-punchlines. This achieves a more balanced train set between the positive (punchline) and negative (unfunny line) classes. Our final datasets each have about 35-40% punchlines.

3.1 Features

We extract a combination of language and audio features from the data. Language features include:

- Unigrams, bigrams, trigrams: we prune the vocabulary and keep the n-grams that appear at least a threshold number of times in the train set, tuning the threshold on our validation set. In this model, we keep n-grams that appear at least twice.
- Parts of speech: we used NLTK's POS tagger to count the nouns, verbs, adjectives, adverbs, and pronouns appearing in the example (Bird et al., 2009).
- Sentiment: we used NLTK's VADER toolkit to extract sentiment from the punchline, on a scale from more negative to more positive (Bird et al., 2009).
- Length, average word length: from reading past work, we learned that sitcom punchlines are often short, so we use length features (Bertero, 2016).

We also extract acoustic features from each audio file (converted to .wav format) with the openSMILE toolkit and match the timestamped features to the timed transcripts in Switchboard to extract corresponding acoustic and language features for a given example (Eyben et al., 2013). Acoustic features include:

- MFCC: we expect MFCC vectors to store the most information about an audio sample, so we sample 12 vectors every 10 ms with a maximum of 50 time intervals per example, since certain lines may be too long to fully store.
- Energy level: we also expect the speaker's energy to be a strong indicator of humor, so we include this as an additional feature.

4 Approach / Models

4.1 Baseline

Our baseline was an all-positive classifier, predicting every example as a punchline. The precision of this classifier is the proportion of true punchlines in the dataset (around 35%), and its recall is 100%. We also used an all-negative classifier, predicting every line as unfunny, which has a precision equal to the proportion of unfunny lines (around 65%) and 0% recall.

Table 4: Baseline metrics (columns: Classifier, Precision, Recall, F1-score; rows: All Positive, All Negative)

4.2 Logistic Regression Language Model

We train a logistic regression model using only language features (n-grams, sentiment, line length) as a secondary baseline. Logistic regression is an intuitive starting model for binary classification, and it also allows us to observe and tune the performance of just our language features on predicting humor.

4.3 RNN Acoustic Model

We next train a Recurrent Neural Network (RNN) using only acoustic features, to observe the performance of our acoustic features on classifying lines of audio as punchlines or not. We choose an RNN to better capture the sequential nature, and thus the conversational context, of dialogue, and we use Gated Recurrent Unit (GRU) cells so our model can better remember earlier timesteps in a line instead of overemphasizing the latest timesteps. During training, we use standard softmax cross entropy to calculate cost.
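As a small illustration of the softmax cross-entropy cost mentioned above, the following sketch computes the loss for a single two-class prediction in plain Python. The function names and the example scores are assumptions made for illustration; they are not taken from the authors' implementation, which would compute this cost inside the RNN training framework.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max score for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(logits, true_class):
    """Negative log-probability that the model assigns to the true class."""
    return -math.log(softmax(logits)[true_class])

# Hypothetical scores for (not punchline, punchline) on one line of dialogue.
undecided_loss = softmax_cross_entropy([0.0, 0.0], true_class=1)
confident_loss = softmax_cross_entropy([0.0, 4.0], true_class=1)
```

With equal scores the model assigns probability 0.5 to each class, so the loss equals ln 2; a confident correct prediction yields a smaller loss.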
We initially used an Adam optimizer, as it handles less-frequently seen training features better and converges more smoothly than Stochastic Gradient Descent. Our final RNN uses an Adamax optimizer to further stabilize the model between epochs and to make the model more robust in handling less-frequently seen features and gradient noise.

4.4 Final Combined Model

After designing separate language and acoustic models, we combined the two as follows:

1. Run our RNN on all acoustic features in the training set, and extract the final hidden state vector in the RNN on each training example.
2. Concatenate this vector with all language features for its corresponding training example.
3. Use the combined feature vectors to train a logistic regression model.

Figure 1: Diagram of final RNN + LogReg model

Figure 1 shows our combined model architecture. For testing, we follow a similar process: we run the acoustic features through our pre-trained RNN, concatenate the final hidden state vector with the language features for an example, and run the combined feature vector through our pre-trained logistic regression model to obtain the prediction.

5 Laughbot Architecture

The laughbot is a simple user-interface application that implements the model we built during our research and testing, predicting humor using a chatbot-style audio prompter. It is intended for demonstration purposes, both to let users experiment with custom input and to show the results of our project in an accessible and tangible form. The user speaks into the microphone, after which the laughbot classifies whether what was said was funny, and audibly laughs if so.
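The three-step combination in Section 4.4, which the laughbot reuses at prediction time, can be sketched roughly as below. This is a minimal illustration under stated assumptions: the hidden state, language features, weights, and bias are all made-up values, and the helper names `combine_features` and `predict_punchline` are hypothetical. In the actual system the hidden state would come from the trained GRU and the weights from the trained logistic regression.

```python
import math

def combine_features(rnn_hidden_state, language_features):
    """Concatenate the RNN's final hidden state with the language features."""
    return list(rnn_hidden_state) + list(language_features)

def predict_punchline(features, weights, bias, threshold=0.5):
    """Logistic regression: sigmoid of a weighted sum, then a threshold."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability = 1.0 / (1.0 + math.exp(-score))
    return probability, probability >= threshold

# Hypothetical values: a 3-dimensional hidden state and 2 language features.
hidden = [0.2, -0.1, 0.7]
language = [1.0, 0.0]
features = combine_features(hidden, language)

# Hypothetical learned parameters, one weight per combined feature.
weights = [0.5, 0.5, 1.0, 0.8, -0.3]
probability, is_punchline = predict_punchline(features, weights, bias=-0.5)
```

Because the logistic regression sees one flat vector, the acoustic summary and the language features are weighted jointly, which is the property the combined model relies on.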
The laughbot is designed to take user input audio, transcribe it, and feed the audio file and transcription into our pre-trained RNN and logistic regression model. Multithreading allows the user to speak into the microphone for as much of the maximum 60-second time segment as they would like, before pressing Enter to indicate the end of speech. The audio is then saved as a .wav file and transcribed by calling the Google Cloud Speech API. Both the transcription and the original audio file are sent through the pre-trained model, in which acoustic features are extracted and run through the pre-trained RNN. The last hidden states are combined with textual features extracted from the transcription to form the features for the logistic regression model. Once a classification is obtained (not funny or funny), the laughbot will either keep a straight face by staying silent and just prompt for more audio, or it will randomly play one of several laugh tracks that we recorded during the late-night hours of project development. The classification is almost immediate. The bulk of the runtime of our implementation depends on the speed of transcription from the Google Cloud Speech API, and thus on the strength of the wifi connection.

During development of the laughbot, we originally tried to design it to work in real time, so the user could speak continually and the laughbot could laugh at every point in the one-sided conversation where it recognized humor. We were able to transcribe audio in real time with the Google Speech API, and intended to multithread it to capture an audio file simultaneously, but we faced problems structuring the rest of the interface to allow it to continually run the input through the model until humor was detected or the speaker paused for more than a certain threshold.
Real-time recognition and laughing is a next-step implementation that will involve sending partial transcripts and audio files through the model continuously, concatenating audio and transcript data to preceding chunks to allow context and longer audio cues to contribute to the classification. To see a sample of our laughbot in action, see this video:

6 Results

6.1 Evaluation

We evaluate using accuracy, precision, recall, and F1 scores, with greatest emphasis on F1 scores. Accuracy calculates the proportion of correct predictions:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision calculates the proportion of predicted punchlines that were actual punchlines:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall calculates the proportion of true punchlines that were captured as punchlines:

$$\text{Recall} = \frac{TP}{TP + FN}$$

F1 is the harmonic mean of precision and recall, which can be calculated as below.

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Table 5 and Figure 2 show the final performance of our models on these metrics, evaluated on the test dataset. Notably, our final model not only beat our baseline, it also beat the final RNN model of Bertero (2016), the most similar approach to our own model. Note that Bertero et al.'s paper used The Big Bang Theory sitcom dialogues, which we hypothesize is an easier dataset to classify than general phone conversations. Their dataset also had a higher proportion of punchlines, leading to a higher F1 score for their positive baseline. Thus their final CNN model improved less over their baseline (14.36% improvement) than ours did over our baseline (16.35% improvement). Further, while Bertero's final CNN model had a higher F1-score than our final RNN model, our model had higher accuracy. See Table 6.

6.2 Model Analysis

Our final combined RNN acoustic and logistic regression language model performed best of all our models. This fit our expectations, as humor should depend on both what was said and how it was said.
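As a worked instance of the metrics defined in Section 6.1, the sketch below computes them from confusion-matrix counts and applies them to an all-positive classifier. The counts are hypothetical (100 lines, 35 of them punchlines, chosen to mirror the roughly 35% punchline rate reported earlier), and the function name is an assumption; returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts.

    Punchlines are treated as the positive class. A ratio whose
    denominator is zero is reported as 0.0 by convention.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1

# Hypothetical test set: 100 lines, 35 of which are true punchlines.
# An all-positive classifier labels every line a punchline.
accuracy, precision, recall, f1 = classification_metrics(tp=35, tn=0, fp=65, fn=0)
```

Under these assumed counts, precision equals the punchline rate (0.35), recall is 1.0, and F1 works out to 0.7 / 1.35, a little under 0.52, which shows why a trivial all-positive baseline can still post a non-trivial F1 on a dataset with many positives.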
Both the language-only and audio-only models had fairly similar accuracies to the final model, but had much lower recall scores (especially the language model), suggesting that the combination model was better at correctly predicting punchlines while the individual models perhaps tended to be too conservative in their predictions, in that they predicted non-punchline for too many true punchlines.

Table 5: Comparison of all models on all datasets (columns: Classifier, Accuracy, Precision, Recall, F1-score; rows: Logistic Regression, RNN, and Combined, each on the train, validation, and test sets)

Table 6: Comparison of our model against Bertero et al. (columns: Classifier, Accuracy, Precision, Recall, F1-score; rows: Bertero's Positive Baseline, Our Positive Baseline, Logistic regression (language only), RNN (audio only), Bertero's Final RNN, Our Final Model (RNN + LogReg), Bertero's Final CNN)

Figure 2: Comparison of models on test datasets

We tuned our RNN and regression models separately on the validation set. We found that the language model performed best when the frequent n-grams threshold was set at 2 (so we only included n-grams that occurred at least twice in the train set), and performance dropped as this threshold was increased. This makes sense: for bigrams and especially trigrams, the number that appear at least x times in a dataset drops drastically as x increases, so with a too-high threshold we were excluding too many potentially useful features. We also found that sentence length was a particularly important feature, which confirmed our expectation that most punchlines would be relatively short. With the RNN, we found that increasing the number of hidden states greatly improved model performance up to a certain point, then began causing overfitting past that point. The same was true of the number of epochs we ran the model through during training.
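The n-gram frequency threshold discussed above can be illustrated with a short sketch that counts unigrams, bigrams, and trigrams over the training lines and keeps those meeting a minimum count. The function names, the whitespace tokenization, and the three example lines are assumptions for illustration only; they are not Switchboard data or the authors' code.

```python
from collections import Counter

def extract_ngrams(tokens, max_n=3):
    """Return all unigrams, bigrams, and trigrams in a token list."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def build_vocabulary(train_lines, min_count=2):
    """Keep n-grams that occur at least min_count times in the train set."""
    counts = Counter()
    for line in train_lines:
        counts.update(extract_ngrams(line.lower().split()))
    return {ngram for ngram, count in counts.items() if count >= min_count}

# Hypothetical training lines, not taken from Switchboard.
train_lines = ["that is so funny", "that is true", "so funny"]
vocabulary = build_vocabulary(train_lines, min_count=2)
stricter_vocabulary = build_vocabulary(train_lines, min_count=3)
```

In this toy example no trigram survives a threshold of 2 and nothing at all survives a threshold of 3, which mirrors the observation that raising the threshold quickly discards longer n-grams.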
As we were using an Adamax optimizer, which already adapts the model learning rate, we did not perform much tuning on our initial learning rate.

6.3 Error Analysis

Table 5 and Figure 3 show the performance of our language-only, audio-only, and combined models on our train, validation, and test datasets. All models performed significantly better on the train set than on the validation or test set, especially the final combined model, suggesting that our model is strongly overfitting to the training data. This may be helped with hyperparameter tuning (i.e., decreasing the number of epochs or the number of hidden units in our RNN, or changing the regularization of our logistic regression model) or by simplifying the model (i.e., decreasing the maximum number of MFCC vectors to extract or decreasing the number of language features). As we explore in the next section, this may also be helped by training on larger datasets.

Figure 3: Comparison of F1-scores on train, val, and test

6.4 Dataset Analysis

We initially ran our models on only 20% of the full Switchboard dataset to speed up training. Figure 4 shows the training accuracy and cost curves of our RNN model, which suggest strong overfitting to the train set. Once we finalized our combined RNN and logistic regression model, we continued to run on larger portions of the Switchboard dataset and achieved higher and higher F1 scores on the test set. Table 7 shows our model performance on different portions of the dataset, with noticeably less overfitting and better test set performance as the dataset portion increased. This suggests that having an even larger dataset could achieve better results.

Figure 4: RNN model train accuracy and cost on 20% of dataset

Table 7: Model performance on varying dataset sizes (columns: Classifier, Dataset Size, Accuracy, Precision, Recall, F1-score)
Combined (train), 20% (4659 examples)
Combined (test), 20% (618 examples)
Combined (train), 50% (12011 examples)
Combined (test), 50% (1527 examples)
Combined (train), 100% (23658 examples)
Combined (test), 100% (2893 examples)

6.5 Laughbot Analysis

In testing our laughbot by speaking to it and waiting for laughter or silence, we found that it responded appropriately to several types of inputs. We noticed that laughter was often caused by shorter inputs (though not in all cases, as seen in Figure 7), as well as by laughter in the punchline. On longer inputs, inputs with negative sentiment, or both, laughbot generally considered the line unfunny. Laughbot responded positively to jokes, in some cases waiting for the actual punchline. Laughbot considered questions and statements unfunny, as shown in Table 8.

Table 8: Laughbot successes (columns: Transcript, Response). Transcript text: you're cute ha ha ha you're so fun to talk to haha my grandma died why did the chicken cross the road to get to the other side do you like cheese i finished my cs224s project

There were cases that fooled the laughbot, in which it inappropriately responded with laughter. For example, the laughbot laughed at "I love you," likely because the statement has positive sentiment and is short. Sometimes, unfunny lines said in a funny manner (raising the pitch of the last word) can induce laughter. For example, saying "my grandma died" with a high pitch at the end will cause laughbot to respond with laughter. Whether this should be considered a success on the part of the laughbot is up to the discretion of the user.

7 Conclusion and Future Work

Our combined RNN and logistic regression model performed best, with an F1-score of 63.2 on the test set and an accuracy of [...]. Future work will focus on reducing overfitting, as our final model run on the entire dataset still performs significantly better on the train set. Since logistic regression is a much more naive model than an RNN, we will work on improving this ensemble model to fully utilize the predictive power of both. We also wish to explore Convolutional Neural Networks (CNNs), both stand-alone and in ensemble models, as our research showed CNNs to have a higher F1-score than RNN models for this task (Bertero, 2016). We would also train on a larger dataset for a more generalizable model. To test laughbot, we would like to make it real-time, as well as run it on sitcoms without their laugh tracks to see how closely laughbot's laughter matches the original laugh tracks of the TV show. Additional implementations could include more complex classification to identify the level or type of humor; laughbot would then be able to respond with giggles or guffaws based on user input.
Sarcasm in particular has always been difficult to detect in natural and spoken language processing, but our model for detecting humor is a step towards being able to recognize the common textual cues along with the specific intonations that normally accompany sarcastic speech. Since we trained and tested on real conversations, this model's humor detection is more applicable to real, everyday speech than to scripted jokes. In this paper we also identified areas in our implementation with room for improvement. With future work, our model could be a viable addition to conversational agents to make them embody more human-like attributes.

Acknowledgments

We would like to thank our professor Andrew Maas and the CS224S Spoken Natural Language Processing teaching team. Special thanks to Raghav and Jiwei for their direction on our combined RNN and regression model.

References

Dario Bertero and Pascale Fung. 2016. Deep learning of audio and language features for humor prediction.
Bilal Piot, Olivier Pietquin, and Matthieu Geist. 2014. Predicting when to laugh with structured classification. Interspeech.
Bilal Piot, Matthieu Geist, and Olivier Pietquin. 2015. Imitation learning applied to embodied conversational agents. MLIS.
Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. 2013. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. ACM Multimedia (MM).
Anton Nijholt. 2003. Humor and embodied conversational agents.
Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media Inc.
Universität Bielefeld June 27, 2014 An Impact Analysis of Features in a Classification Approach to Irony Detection in Product Reviews Konstantin Buschmeier, Philipp Cimiano, Roman Klinger Semantic Computing
More informationarxiv: v1 [cs.ir] 16 Jan 2019
It s Only Words And Words Are All I Have Manash Pratim Barman 1, Kavish Dahekar 2, Abhinav Anshuman 3, and Amit Awekar 4 1 Indian Institute of Information Technology, Guwahati 2 SAP Labs, Bengaluru 3 Dell
More informationComputational modeling of conversational humor in psychotherapy
Interspeech 2018 2-6 September 2018, Hyderabad Computational ing of conversational humor in psychotherapy Anil Ramakrishna 1, Timothy Greer 1, David Atkins 2, Shrikanth Narayanan 1 1 Signal Analysis and
More informationChord Classification of an Audio Signal using Artificial Neural Network
Chord Classification of an Audio Signal using Artificial Neural Network Ronesh Shrestha Student, Department of Electrical and Electronic Engineering, Kathmandu University, Dhulikhel, Nepal ---------------------------------------------------------------------***---------------------------------------------------------------------
More informationChord Label Personalization through Deep Learning of Integrated Harmonic Interval-based Representations
Chord Label Personalization through Deep Learning of Integrated Harmonic Interval-based Representations Hendrik Vincent Koops 1, W. Bas de Haas 2, Jeroen Bransen 2, and Anja Volk 1 arxiv:1706.09552v1 [cs.sd]
More informationAutomatic Piano Music Transcription
Automatic Piano Music Transcription Jianyu Fan Qiuhan Wang Xin Li Jianyu.Fan.Gr@dartmouth.edu Qiuhan.Wang.Gr@dartmouth.edu Xi.Li.Gr@dartmouth.edu 1. Introduction Writing down the score while listening
More informationLaughter Valence Prediction in Motivational Interviewing based on Lexical and Acoustic Cues
Laughter Valence Prediction in Motivational Interviewing based on Lexical and Acoustic Cues Rahul Gupta o, Nishant Nath, Taruna Agrawal o, Panayiotis Georgiou, David Atkins +, Shrikanth Narayanan o o Signal
More informationMachine Learning Term Project Write-up Creating Models of Performers of Chopin Mazurkas
Machine Learning Term Project Write-up Creating Models of Performers of Chopin Mazurkas Marcello Herreshoff In collaboration with Craig Sapp (craig@ccrma.stanford.edu) 1 Motivation We want to generative
More informationAudio-Based Video Editing with Two-Channel Microphone
Audio-Based Video Editing with Two-Channel Microphone Tetsuya Takiguchi Organization of Advanced Science and Technology Kobe University, Japan takigu@kobe-u.ac.jp Yasuo Ariki Organization of Advanced Science
More informationMusic Genre Classification and Variance Comparison on Number of Genres
Music Genre Classification and Variance Comparison on Number of Genres Miguel Francisco, miguelf@stanford.edu Dong Myung Kim, dmk8265@stanford.edu 1 Abstract In this project we apply machine learning techniques
More informationMusic Genre Classification
Music Genre Classification chunya25 Fall 2017 1 Introduction A genre is defined as a category of artistic composition, characterized by similarities in form, style, or subject matter. [1] Some researchers
More informationDeep Jammer: A Music Generation Model
Deep Jammer: A Music Generation Model Justin Svegliato and Sam Witty College of Information and Computer Sciences University of Massachusetts Amherst, MA 01003, USA {jsvegliato,switty}@cs.umass.edu Abstract
More informationSinger Recognition and Modeling Singer Error
Singer Recognition and Modeling Singer Error Johan Ismael Stanford University jismael@stanford.edu Nicholas McGee Stanford University ndmcgee@stanford.edu 1. Abstract We propose a system for recognizing
More informationABSOLUTE OR RELATIVE? A NEW APPROACH TO BUILDING FEATURE VECTORS FOR EMOTION TRACKING IN MUSIC
ABSOLUTE OR RELATIVE? A NEW APPROACH TO BUILDING FEATURE VECTORS FOR EMOTION TRACKING IN MUSIC Vaiva Imbrasaitė, Peter Robinson Computer Laboratory, University of Cambridge, UK Vaiva.Imbrasaite@cl.cam.ac.uk
More informationInstrument Recognition in Polyphonic Mixtures Using Spectral Envelopes
Instrument Recognition in Polyphonic Mixtures Using Spectral Envelopes hello Jay Biernat Third author University of Rochester University of Rochester Affiliation3 words jbiernat@ur.rochester.edu author3@ismir.edu
More informationLEARNING AUDIO SHEET MUSIC CORRESPONDENCES. Matthias Dorfer Department of Computational Perception
LEARNING AUDIO SHEET MUSIC CORRESPONDENCES Matthias Dorfer Department of Computational Perception Short Introduction... I am a PhD Candidate in the Department of Computational Perception at Johannes Kepler
More informationLAUGHTER serves as an expressive social signal in human
Audio-Facial Laughter Detection in Naturalistic Dyadic Conversations Bekir Berker Turker, Yucel Yemez, Metin Sezgin, Engin Erzin 1 Abstract We address the problem of continuous laughter detection over
More informationarxiv: v1 [cs.cl] 3 May 2018
Binarizer at SemEval-2018 Task 3: Parsing dependency and deep learning for irony detection Nishant Nikhil IIT Kharagpur Kharagpur, India nishantnikhil@iitkgp.ac.in Muktabh Mayank Srivastava ParallelDots,
More informationWHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG?
WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? NICHOLAS BORG AND GEORGE HOKKANEN Abstract. The possibility of a hit song prediction algorithm is both academically interesting and industry motivated.
More informationSentiMozart: Music Generation based on Emotions
SentiMozart: Music Generation based on Emotions Rishi Madhok 1,, Shivali Goel 2, and Shweta Garg 1, 1 Department of Computer Science and Engineering, Delhi Technological University, New Delhi, India 2
More informationSome Experiments in Humour Recognition Using the Italian Wikiquote Collection
Some Experiments in Humour Recognition Using the Italian Wikiquote Collection Davide Buscaldi and Paolo Rosso Dpto. de Sistemas Informáticos y Computación (DSIC), Universidad Politécnica de Valencia, Spain
More informationAN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY
AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY Eugene Mikyung Kim Department of Music Technology, Korea National University of Arts eugene@u.northwestern.edu ABSTRACT
More informationCHORD GENERATION FROM SYMBOLIC MELODY USING BLSTM NETWORKS
CHORD GENERATION FROM SYMBOLIC MELODY USING BLSTM NETWORKS Hyungui Lim 1,2, Seungyeon Rhyu 1 and Kyogu Lee 1,2 3 Music and Audio Research Group, Graduate School of Convergence Science and Technology 4
More informationMusic Generation from MIDI datasets
Music Generation from MIDI datasets Moritz Hilscher, Novin Shahroudi 2 Institute of Computer Science, University of Tartu moritz.hilscher@student.hpi.de, 2 novin@ut.ee Abstract. Many approaches are being
More informationMusical Hit Detection
Musical Hit Detection CS 229 Project Milestone Report Eleanor Crane Sarah Houts Kiran Murthy December 12, 2008 1 Problem Statement Musical visualizers are programs that process audio input in order to
More informationChapter Two: Long-Term Memory for Timbre
25 Chapter Two: Long-Term Memory for Timbre Task In a test of long-term memory, listeners are asked to label timbres and indicate whether or not each timbre was heard in a previous phase of the experiment
More informationCombination of Audio & Lyrics Features for Genre Classication in Digital Audio Collections
1/23 Combination of Audio & Lyrics Features for Genre Classication in Digital Audio Collections Rudolf Mayer, Andreas Rauber Vienna University of Technology {mayer,rauber}@ifs.tuwien.ac.at Robert Neumayer
More informationAutomatic Speech Recognition (CS753)
Automatic Speech Recognition (CS753) Lecture 22: Conversational Agents Instructor: Preethi Jyothi Oct 26, 2017 (All images were reproduced from JM, chapters 29,30) Chatbots Rule-based chatbots Historical
More information6.UAP Project. FunPlayer: A Real-Time Speed-Adjusting Music Accompaniment System. Daryl Neubieser. May 12, 2016
6.UAP Project FunPlayer: A Real-Time Speed-Adjusting Music Accompaniment System Daryl Neubieser May 12, 2016 Abstract: This paper describes my implementation of a variable-speed accompaniment system that
More informationSupervised Learning in Genre Classification
Supervised Learning in Genre Classification Introduction & Motivation Mohit Rajani and Luke Ekkizogloy {i.mohit,luke.ekkizogloy}@gmail.com Stanford University, CS229: Machine Learning, 2009 Now that music
More informationComposer Identification of Digital Audio Modeling Content Specific Features Through Markov Models
Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Aric Bartle (abartle@stanford.edu) December 14, 2012 1 Background The field of composer recognition has
More informationNoise Flooding for Detecting Audio Adversarial Examples Against Automatic Speech Recognition
Noise Flooding for Detecting Audio Adversarial Examples Against Automatic Speech Recognition Krishan Rajaratnam The College University of Chicago Chicago, USA krajaratnam@uchicago.edu Jugal Kalita Department
More informationINGEOTEC at IberEval 2018 Task HaHa: µtc and EvoMSA to Detect and Score Humor in Texts
INGEOTEC at IberEval 2018 Task HaHa: µtc and EvoMSA to Detect and Score Humor in Texts José Ortiz-Bejar 1,3, Vladimir Salgado 3, Mario Graff 2,3, Daniela Moctezuma 3,4, Sabino Miranda-Jiménez 2,3, and
More informationBi-Modal Music Emotion Recognition: Novel Lyrical Features and Dataset
Bi-Modal Music Emotion Recognition: Novel Lyrical Features and Dataset Ricardo Malheiro, Renato Panda, Paulo Gomes, Rui Paiva CISUC Centre for Informatics and Systems of the University of Coimbra {rsmal,
More informationA Fast Alignment Scheme for Automatic OCR Evaluation of Books
A Fast Alignment Scheme for Automatic OCR Evaluation of Books Ismet Zeki Yalniz, R. Manmatha Multimedia Indexing and Retrieval Group Dept. of Computer Science, University of Massachusetts Amherst, MA,
More informationHidden Markov Model based dance recognition
Hidden Markov Model based dance recognition Dragutin Hrenek, Nenad Mikša, Robert Perica, Pavle Prentašić and Boris Trubić University of Zagreb, Faculty of Electrical Engineering and Computing Unska 3,
More informationAutomatic Joke Generation: Learning Humor from Examples
Automatic Joke Generation: Learning Humor from Examples Thomas Winters, Vincent Nys, and Daniel De Schreye KU Leuven, Belgium, info@thomaswinters.be, vincent.nys@cs.kuleuven.be, danny.deschreye@cs.kuleuven.be
More informationSemi-supervised Musical Instrument Recognition
Semi-supervised Musical Instrument Recognition Master s Thesis Presentation Aleksandr Diment 1 1 Tampere niversity of Technology, Finland Supervisors: Adj.Prof. Tuomas Virtanen, MSc Toni Heittola 17 May
More informationarxiv: v2 [cs.sd] 31 Mar 2017
On the Futility of Learning Complex Frame-Level Language Models for Chord Recognition arxiv:1702.00178v2 [cs.sd] 31 Mar 2017 Abstract Filip Korzeniowski and Gerhard Widmer Department of Computational Perception
More informationYour Sentiment Precedes You: Using an author s historical tweets to predict sarcasm
Your Sentiment Precedes You: Using an author s historical tweets to predict sarcasm Anupam Khattri 1 Aditya Joshi 2,3,4 Pushpak Bhattacharyya 2 Mark James Carman 3 1 IIT Kharagpur, India, 2 IIT Bombay,
More informationA Note Based Query By Humming System using Convolutional Neural Network
INTERSPEECH 2017 August 20 24, 2017, Stockholm, Sweden A Note Based Query By Humming System using Convolutional Neural Network Naziba Mostafa, Pascale Fung The Hong Kong University of Science and Technology
More informationGenerating Original Jokes
SANTA CLARA UNIVERSITY COEN 296 NATURAL LANGUAGE PROCESSING TERM PROJECT Generating Original Jokes Author Ting-yu YEH Nicholas FONG Nathan KERR Brian COX Supervisor Dr. Ming-Hwa WANG March 20, 2018 1 CONTENTS
More informationSpeech To Song Classification
Speech To Song Classification Emily Graber Center for Computer Research in Music and Acoustics, Department of Music, Stanford University Abstract The speech to song illusion is a perceptual phenomenon
More informationThe Sparsity of Simple Recurrent Networks in Musical Structure Learning
The Sparsity of Simple Recurrent Networks in Musical Structure Learning Kat R. Agres (kra9@cornell.edu) Department of Psychology, Cornell University, 211 Uris Hall Ithaca, NY 14853 USA Jordan E. DeLong
More informationEnabling editors through machine learning
Meta Follow Meta is an AI company that provides academics & innovation-driven companies with powerful views of t Dec 9, 2016 9 min read Enabling editors through machine learning Examining the data science
More informationTopics in Computer Music Instrument Identification. Ioanna Karydi
Topics in Computer Music Instrument Identification Ioanna Karydi Presentation overview What is instrument identification? Sound attributes & Timbre Human performance The ideal algorithm Selected approaches
More informationTake a Break, Bach! Let Machine Learning Harmonize That Chorale For You. Chris Lewis Stanford University
Take a Break, Bach! Let Machine Learning Harmonize That Chorale For You Chris Lewis Stanford University cmslewis@stanford.edu Abstract In this project, I explore the effectiveness of the Naive Bayes Classifier
More informationComputational Modelling of Harmony
Computational Modelling of Harmony Simon Dixon Centre for Digital Music, Queen Mary University of London, Mile End Rd, London E1 4NS, UK simon.dixon@elec.qmul.ac.uk http://www.elec.qmul.ac.uk/people/simond
More informationDetecting Sarcasm in English Text. Andrew James Pielage. Artificial Intelligence MSc 2012/2013
Detecting Sarcasm in English Text Andrew James Pielage Artificial Intelligence MSc 0/0 The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference
More informationUsing Deep Learning to Annotate Karaoke Songs
Distributed Computing Using Deep Learning to Annotate Karaoke Songs Semester Thesis Juliette Faille faillej@student.ethz.ch Distributed Computing Group Computer Engineering and Networks Laboratory ETH
More informationA Framework for Segmentation of Interview Videos
A Framework for Segmentation of Interview Videos Omar Javed, Sohaib Khan, Zeeshan Rasheed, Mubarak Shah Computer Vision Lab School of Electrical Engineering and Computer Science University of Central Florida
More informationDISCOURSE ANALYSIS OF LYRIC AND LYRIC-BASED CLASSIFICATION OF MUSIC
DISCOURSE ANALYSIS OF LYRIC AND LYRIC-BASED CLASSIFICATION OF MUSIC Jiakun Fang 1 David Grunberg 1 Diane Litman 2 Ye Wang 1 1 School of Computing, National University of Singapore, Singapore 2 Department
More informationA combination of approaches to solve Task How Many Ratings? of the KDD CUP 2007
A combination of approaches to solve Tas How Many Ratings? of the KDD CUP 2007 Jorge Sueiras C/ Arequipa +34 9 382 45 54 orge.sueiras@neo-metrics.com Daniel Vélez C/ Arequipa +34 9 382 45 54 José Luis
More informationEfficient Computer-Aided Pitch Track and Note Estimation for Scientific Applications. Matthias Mauch Chris Cannam György Fazekas
Efficient Computer-Aided Pitch Track and Note Estimation for Scientific Applications Matthias Mauch Chris Cannam György Fazekas! 1 Matthias Mauch, Chris Cannam, George Fazekas Problem Intonation in Unaccompanied
More informationA STUDY ON LSTM NETWORKS FOR POLYPHONIC MUSIC SEQUENCE MODELLING
A STUDY ON LSTM NETWORKS FOR POLYPHONIC MUSIC SEQUENCE MODELLING Adrien Ycart and Emmanouil Benetos Centre for Digital Music, Queen Mary University of London, UK {a.ycart, emmanouil.benetos}@qmul.ac.uk
More informationAutomatic Music Clustering using Audio Attributes
Automatic Music Clustering using Audio Attributes Abhishek Sen BTech (Electronics) Veermata Jijabai Technological Institute (VJTI), Mumbai, India abhishekpsen@gmail.com Abstract Music brings people together,
More informationMUSI-6201 Computational Music Analysis
MUSI-6201 Computational Music Analysis Part 9.1: Genre Classification alexander lerch November 4, 2015 temporal analysis overview text book Chapter 8: Musical Genre, Similarity, and Mood (pp. 151 155)
More informationWHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs
WHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs Abstract Large numbers of TV channels are available to TV consumers
More informationA real time study of plosives in Glaswegian using an automatic measurement algorithm
A real time study of plosives in Glaswegian using an automatic measurement algorithm Jane Stuart Smith, Tamara Rathcke, Morgan Sonderegger University of Glasgow; University of Kent, McGill University NWAV42,
More information