Laughbot: Detecting Humor in Spoken Language with Language and Audio Cues

Kate Park, Annie Hu, Natalie Muenster
Email: katepark@stanford.edu, anniehu@stanford.edu, ncm000@stanford.edu

Abstract—We propose detecting and responding to humor in spoken dialogue by extracting language and audio cues and feeding these features into a combined recurrent neural network (RNN) and logistic regression model. We parse Switchboard phone conversations to build a corpus of punchlines and unfunny lines, where punchlines are the lines that precede laughter tokens in the Switchboard transcripts. We create a combined RNN and logistic regression model that uses both acoustic and language cues to predict whether a conversational agent should respond to an utterance with laughter. Our model achieves an F1-score of 63.2 and accuracy of 73.9. It outperforms our logistic regression language model (F1-score 56.6) and our RNN acoustic model (59.4), as well as the final RNN model of D. Bertero, 2016 (52.9). Using our final model, we create a laughbot that audibly responds with laughter when a user's utterance is classified as a punchline. A conversational agent outfitted with a humor-recognition system such as the one we present in this paper would become increasingly valuable as these agents gain utility in everyday life.

Keywords—Chatbots; spoken natural language processing; deep learning; machine learning

I. INTRODUCTION

Our project takes the unique approach of building a laughbot that detects humor and thus predicts when someone will laugh based on textual and acoustic cues. Humor depends on the words themselves, on how the speaker's voice changes, and on situational context. Predicting when someone will laugh is a significant part of contemporary media. Sitcoms, for example, are written and performed primarily to cause laughter, so the ability to detect precisely why something will draw laughter is of extreme importance to screenwriters and actors. Detecting humor could also improve embodied conversational agents, which are perceived as having more human attributes when they can display an appreciation of humor [1]. Siri currently will only say "you are very funny", but could become more human if she had the ability to recognize social cues like humor and respond to them with laughter. Thus, our objective is to train and test a model that can detect whether or not a line is funny and should be appropriately responded to with laughter.

II. LIMITATIONS OF PREVIOUS WORK

Existing work in predicting humor uses a combination of text and audio features and various machine learning methods to tackle the classification problem. The goal of the research in [2] was to predict laughter and create an avatar that imitated a human expert's sense of when to laugh in a conversation. The authors reduced the problem to a multi-class classification task and found that a large-margin algorithm resulted in the most natural behavior (laughing in roughly the same proportion as the expert). A later paper [3] examined the problem of using a laughter-enabled interaction manager to make decisions about whether or not to laugh. It defined the regularized classification for apprenticeship learning (RCAL) algorithm over audio features. Final results showed that RCAL performed much better than multi-class classification on some measures and slightly worse on others. The paper noted the problem of a large class imbalance between laughter and non-laughter in the dataset, and suggested inverse reinforcement learning (IRL) algorithms as potential future work.
From this, we propose weighting the dataset to reduce the imbalance so that non-punchlines do not overpower what the model learns from punchlines. In [4], the author compared three supervised machine learning methods to predict and detect humor in dialogues from The Big Bang Theory sitcom. From a corpus in which punchlines were annotated by laughter, audio and language features were extracted and fed into three models: a convolutional neural network (CNN), an RNN, and a conditional random field (CRF), each compared against a logistic regression baseline (F1-score 29.2). The CNN using word vectors and overlapping 25 ms time frames performed the best (F1-score 68.6). The RNN model (F1-score 52.9) was expected to perform best but actually performed worse than the CNN, likely due to overfitting. That paper proposes future work on building a better dialog system that understands humor.

III. PRESENT WORK

Rather than a sitcom dataset, our paper uses spontaneous phone conversations, whose humor we hypothesize is more relevant for a conversational agent like Siri. We create an ensemble model combining a recurrent neural network with a logistic regression classifier that, when run on audio and language features from lines of conversation, identifies whether a line is humorous. We additionally implement a laughbot, a simple dialog system based on our ensemble model that converses with a user and responds to humorous input with laughter.

In Section IV, we introduce the Switchboard dataset we used for training, the preprocessing steps we performed, and the features we extracted to implement our models. Section V then gives implementation details for these models, including a baseline, a language-only model, an audio-only model, and a combined model, and Section VI gives details for our laughbot interface. In Section VII we define our evaluation metrics, present results, and analyze errors. Finally, in Section VIII we discuss future plans to improve our humor-detection model and our laughbot system.

IV. DATASET

Since the focus of this project is detecting humor in everyday, regular speech, our dataset must reflect everyday conversations. We use the Switchboard Corpora available on AFS, which consists of around 3000 groups of audio files, transcripts, and word-level time intervals from phone conversations between two speakers. We classified each line of a transcript as a punchline if it preceded any indication of laughter at the beginning of the next person's response. Tables I, II, and III show sample lines and their corresponding classifications.

TABLE I. LINE INDUCING LAUGHTER IS CLASSIFIED AS A PUNCHLINE
B: That, that's the major reason I'm walking up the stairs.   (Punchline)
A: [Laughter] To go skiing?

TABLE II. LAUGHTER THAT INDUCES LAUGHTER IS CLASSIFIED AS A PUNCHLINE
A: Uh-huh. Well, you must have a relatively clean conscience then [laughter].   (Punchline)
B: [Laughter]

TABLE III. LINE PRECEDING SOMEONE LAUGHING AT THEMSELVES DOES NOT COUNT AS A PUNCHLINE
A: just for fun.   (Not a punchline)
B: Shaking the scorpions out of their shoes [laughter].

We split our dataset into 80% train / 10% validation / 10% test sets. Mindful of the class imbalance that previous work struggled with, and because our original dataset is heavily skewed towards non-punchlines, we kept only 5% of the non-punchlines. This produced a more balanced training set between the positive (punchline) and negative (unfunny line) classes; our final datasets each contain roughly 35-40% punchlines.

A. Features

We extracted a combination of language and audio features from the data. Language features include:

- Unigrams, bigrams, and trigrams: we pruned the vocabulary and kept the n-grams that appear more than a threshold number of times in the training set, tuning the threshold on our validation set. In this model, we kept n-grams that appear at least twice.
- Parts of speech: we used NLTK's POS tagger to count the nouns, verbs, adjectives, adverbs, and pronouns appearing in the example [5].
- Sentiment: we used NLTK's VADER toolkit to score the sentiment of the line on a scale from most negative to most positive [5].
- Length and average word length: past work found that sitcom punchlines are often short, so we included length features [4].

We also extracted acoustic features from each audio file (converted to .wav format) with the openSMILE toolkit, matching its timestamped output to the timed Switchboard transcripts so that each example has corresponding acoustic and language features [6]. Acoustic features include:

- MFCCs: we expected MFCC vectors to carry the most information about an audio sample, so we sampled 12 coefficients every 10 ms, with a maximum of 50 time intervals per example, since certain lines are too long to store in full.
- Energy level: we expected the speaker's energy to be a strong indicator of humor, so we included it as an additional feature.
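To make the feature pipeline concrete, the following is a minimal sketch, not the authors' released code, of how the punchline labeling rule and the language features above could be computed with NLTK. The helper names (is_punchline, build_ngram_vocab, language_features) and the exact tokenization choices are illustrative assumptions.

```python
# Illustrative sketch of Section IV feature extraction; assumes NLTK with the
# punkt, averaged_perceptron_tagger and vader_lexicon resources installed.
from collections import Counter
from nltk import word_tokenize, pos_tag, ngrams
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
POS_GROUPS = {"noun": "NN", "verb": "VB", "adj": "JJ", "adv": "RB", "pron": "PRP"}

def is_punchline(next_turn: str) -> bool:
    """Labeling heuristic: a line is a punchline if the next speaker's
    turn opens with a laughter token such as '[laughter]'."""
    return next_turn.lstrip().lower().startswith("[laughter")

def build_ngram_vocab(train_lines, min_count=2, max_n=3):
    """Keep uni/bi/trigrams that occur at least min_count times in training."""
    counts = Counter()
    for line in train_lines:
        toks = word_tokenize(line.lower())
        for n in range(1, max_n + 1):
            counts.update(ngrams(toks, n))
    return {g for g, c in counts.items() if c >= min_count}

def language_features(line, vocab):
    toks = word_tokenize(line.lower())
    feats = {}
    # pruned n-gram indicator features
    for n in (1, 2, 3):
        for g in ngrams(toks, n):
            if g in vocab:
                feats["ng=" + " ".join(g)] = 1.0
    # part-of-speech counts (nouns, verbs, adjectives, adverbs, pronouns)
    tags = [t for _, t in pos_tag(toks)]
    for name, prefix in POS_GROUPS.items():
        feats["pos_" + name] = sum(t.startswith(prefix) for t in tags)
    # VADER sentiment (compound score in [-1, 1])
    feats["sentiment"] = sia.polarity_scores(line)["compound"]
    # length features: token count and average word length
    feats["len"] = len(toks)
    feats["avg_word_len"] = sum(map(len, toks)) / max(len(toks), 1)
    return feats
```

In the full system these feature dictionaries would then be vectorized (for example with a DictVectorizer) before being passed to the classifiers of Section V.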
V. IMPLEMENTED MODELS

A. Baseline

Our baseline was an all-positive classifier, predicting every example to be a punchline. The precision of this classifier is the proportion of true punchlines in the dataset (around 35%), and its recall is 100%. We also use an all-negative classifier, predicting every line to be unfunny; its precision is the proportion of unfunny lines (around 65%) and its recall is 0%. Table IV displays these baseline results.

TABLE IV. BASELINE METRICS
Classifier      Precision   Recall   F1-score
All Positive    37.2        100.0    54.2
All Negative    52.8        0.0      0.0

B. Logistic Regression Language Model

We trained a logistic regression model using only language features (n-grams, sentiment, line length) as a secondary baseline. Logistic regression was an intuitive starting model for binary classification, and it also allowed us to observe and tune how well the language features alone predict humor.

C. RNN Acoustic Model

We next trained an RNN using only acoustic features, to observe how well the acoustic features alone classify lines of audio as punchlines or not. We chose an RNN to better capture the sequential nature, and thus the conversational context, of dialogue, and we use Gated Recurrent Unit (GRU) cells so our model can better remember earlier timesteps in a line instead of overemphasizing the latest timesteps. During training, we used standard softmax cross-entropy loss. We initially used the Adam optimizer because it handles rarely seen training features well and converges more smoothly than stochastic gradient descent; our final RNN uses the Adamax optimizer to further stabilize the model between epochs and to make it more robust to rarely seen features and gradient noise.
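As a rough illustration of this acoustic model, here is a minimal Keras sketch of a GRU classifier over padded MFCC-plus-energy frames trained with softmax cross entropy and the Adamax optimizer. The hidden size, masking, and other settings are assumptions for illustration rather than the authors' exact configuration.

```python
# Minimal sketch of a GRU acoustic classifier in the spirit of Section V-C.
import tensorflow as tf

MAX_FRAMES, N_ACOUSTIC = 50, 13   # up to 50 frames of 12 MFCCs + energy per 10 ms

def build_acoustic_rnn(hidden_size=128):
    inputs = tf.keras.Input(shape=(MAX_FRAMES, N_ACOUSTIC))
    # Masking lets shorter utterances be zero-padded up to MAX_FRAMES frames.
    x = tf.keras.layers.Masking(mask_value=0.0)(inputs)
    # The GRU keeps information from earlier frames in the utterance.
    state = tf.keras.layers.GRU(hidden_size)(x)   # final hidden state
    logits = tf.keras.layers.Dense(2)(state)      # punchline vs. not punchline
    model = tf.keras.Model(inputs, logits)
    model.compile(
        optimizer=tf.keras.optimizers.Adamax(),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model
```

The final hidden state produced by the GRU layer is the vector that the combined model of Section V-D concatenates with the language features.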

Fig. 1. Diagram of the final RNN + logistic regression model.

D. Final Combined Model

After designing separate language and acoustic models, we combined the two by:
1) Running our RNN on the acoustic features of the training set and extracting the final hidden state vector of the RNN for each training example.
2) Concatenating this vector with all language features for the corresponding training example.
3) Using the combined feature vectors to train a logistic regression model.

Fig. 1 shows our combined model architecture. For testing, we followed a similar process: we run the acoustic features through our pre-trained RNN, concatenate the final hidden state vector with the language features, and pass the combined feature vector through our pre-trained logistic regression model to obtain the prediction.
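The ensemble can be summarized in a short sketch. It reuses the hypothetical helpers from the earlier sketches (build_acoustic_rnn, language_features); the vectorizer, regularization, and training settings here are assumptions, not the authors' reported configuration.

```python
# Sketch of the ensemble in Fig. 1: GRU hidden state + language features -> LogReg.
import numpy as np
import tensorflow as tf
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def train_combined_model(acoustic_train, lines_train, y_train, vocab):
    # 1) Train the acoustic GRU, then reuse it as a fixed feature extractor
    #    mapping each utterance to its final hidden state vector.
    rnn = build_acoustic_rnn()
    rnn.fit(acoustic_train, y_train, epochs=10, batch_size=32, verbose=0)
    state_extractor = tf.keras.Model(rnn.input, rnn.layers[-2].output)
    hidden = state_extractor.predict(acoustic_train)

    # 2) Concatenate the hidden state with the (vectorized) language features.
    vec = DictVectorizer(sparse=False)
    lang = vec.fit_transform([language_features(l, vocab) for l in lines_train])
    features = np.hstack([hidden, lang])

    # 3) Fit the logistic regression on the combined feature vectors.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, y_train)
    return state_extractor, vec, clf
```

At test time the same state extractor and vectorizer are applied to the new utterance before calling clf.predict, mirroring the testing procedure described above.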
VI. LAUGHBOT APPLICATION ARCHITECTURE

A. Overview

The laughbot is a simple user-interface application that wraps the model we built during our research and testing, predicting humor through a chatbot-style audio prompter. It is intended for demonstration purposes, both to let users experiment with custom input and to show the results of our project in an accessible and tangible form. The user speaks into the microphone, after which the laughbot classifies whether what was said was funny, and audibly laughs if so.

B. Transcription Architecture

The laughbot is designed to take the user's input audio, transcribe it, and feed the audio file and transcription into our pre-trained RNN and logistic regression model. Multithreading allows the user to speak into the microphone for as much of the maximum 60-second window as he or she would like before pressing Enter to indicate the end of speech. The audio is then saved as a .wav file and transcribed by calling the Google Cloud Speech API. Both the transcription and the original audio file are sent through the pre-trained model: acoustic features are extracted and run through the pre-trained RNN, and the last hidden state is combined with textual features extracted from the transcription as input to the logistic regression model. Once a funny or not-funny classification is obtained, the laughbot will either keep a straight face by staying silent and simply prompt for more audio, or it will randomly play one of several laughtracks that we recorded during the late-night hours of project development. The classification itself is almost immediate; the bulk of the runtime depends on the speed of transcription from the Google Cloud Speech API, and thus on the strength of the wifi connection. To see a sample of our laughbot in action, see this video: https://youtu.be/t6je0kznyxg.
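A hedged sketch of one laughbot turn is shown below, assuming the google-cloud-speech Python client and 16 kHz mono LINEAR16 audio; record_utterance, classify_punchline, and play_random_laughtrack are hypothetical stand-ins for the recording, model, and playback code, whose exact interfaces the paper does not specify.

```python
# Sketch of a single laughbot turn (Section VI-B), not the authors' implementation.
import random
from google.cloud import speech

client = speech.SpeechClient()
CONFIG = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

def transcribe(wav_path: str) -> str:
    """Send the recorded utterance to Google Cloud Speech and return the text."""
    with open(wav_path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    response = client.recognize(config=CONFIG, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)

def laughbot_turn(laughtracks):
    wav_path = record_utterance(max_seconds=60)   # hypothetical: stop on Enter
    transcript = transcribe(wav_path)
    if classify_punchline(wav_path, transcript):  # combined RNN + LogReg model
        play_random_laughtrack(random.choice(laughtracks))
    # otherwise keep a straight face and simply prompt for more audio
```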

VII. RESULTS

A. Evaluation

We evaluated using accuracy, precision, recall, and F1 scores, with greatest emphasis on F1. Accuracy is the proportion of correct predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision is the proportion of predicted punchlines that were actual punchlines:

Precision = TP / (TP + FP)

Recall is the proportion of true punchlines that were captured as punchlines:

Recall = TP / (TP + FN)

F1 is the harmonic mean of precision and recall:

F1 = 2 * Precision * Recall / (Precision + Recall)

Table V and Fig. 2 show the final performance of our models on these metrics, evaluated on the test dataset. Notably, our final model not only beat our baseline, it also beat the final RNN model of [4], the most similar approach to our own. Bertero et al. used The Big Bang Theory sitcom dialogues, which we hypothesize are an easier dataset to classify than general phone conversations. Their dataset also had a higher proportion of punchlines, leading to a higher F1 score for their positive baseline. Their final convolutional neural network (CNN) model, a deep feed-forward network, performed best, but it improved less over their baseline (a 14.36% improvement) than our final model improved over ours (16.35%). Further, while the final CNN model proposed by Bertero et al. had a higher F1-score than our final model, our model had higher accuracy (see Table VI).

TABLE V. COMPARISON OF ALL MODELS ON ALL DATASETS
Classifier                          Accuracy   Precision   Recall   F1-score
Logistic Regression (train)         87.7       90.1        76.7     82.8
RNN (train)                         75.0       73.8        55.2     63.2
Combined (train)                    91.1       90.5        86.3     88.3
Logistic Regression (validation)    68.5       60.2        48.5     53.7
RNN (validation)                    70.0       61.5        54.7     57.9
Combined (validation)               73.4       66.0        60.5     63.1
Logistic Regression (test)          70.6       62.7        51.4     56.5
RNN (test)                          71.7       63.5        55.9     59.4
Combined (test)                     73.9       66.5        60.3     63.2

TABLE VI. COMPARISON OF OUR MODEL AGAINST MODELS IN BERTERO ET AL.
Classifier                                 Accuracy   Precision   Recall   F1-score
Bertero's Positive Baseline                42.8       42.8        100.0    59.9
Our Positive Baseline                      37.2       37.2        100.0    54.2
Our Logistic Regression (language only)    70.6       62.7        51.4     56.5
Our RNN (audio only)                       71.7       63.5        55.9     59.4
Bertero's Final RNN                        65.8       64.4        44.9     52.9
Our Final Model (RNN + LogReg)             73.9       66.5        60.3     63.2
Bertero's Final CNN                        73.8       70.3        66.7     68.5

Fig. 2. Comparison of models on the test dataset.

B. Model Analysis

Our final combined RNN acoustic and logistic regression language model performed the best of all our models. This fit our expectations, as humor should depend on both what was said and how it was said. The language-only and audio-only models had accuracies fairly similar to the final model but much lower recall (especially the language model), suggesting that the combined model was better at correctly predicting punchlines while the individual models tended to be too conservative, predicting non-punchlines for too many true punchlines.

We tuned our RNN and regression models separately on the validation set. We found that the language model performed best when the frequent n-gram threshold was set at 2 (i.e., we only included n-grams that occurred at least twice in the training set), and performance dropped as this threshold was increased. This makes sense: for bigrams and especially trigrams, the number that appear at least x times in a dataset drops drastically as x increases, so with too high a threshold we excluded too many potentially useful features. We also found that sentence length was a particularly important feature, which confirmed our expectation that most punchlines are relatively short.

With the RNN, we found that increasing the number of hidden states greatly improved model performance up to a certain point, then began causing overfitting past that point. The same was true of the number of training epochs. As we were using the Adamax optimizer, which already adapts the model learning rate, we did not perform much tuning on our initial learning rate.

C. Error Analysis

Table V and Fig. 3 show the performance of our language-only, audio-only, and combined models on the training, validation, and test sets. All models performed significantly better on the training set than on the validation or test sets, especially the final combined model, suggesting that our model strongly overfits the training data. This may be helped by hyperparameter tuning, such as decreasing the number of epochs or the number of hidden units in our RNN, or changing the regularization of our logistic regression model. Simplifying the model could also help, for example by decreasing the maximum number of MFCC vectors extracted or the number of language features. As we explore in the next section, overfitting may also be reduced by training on larger datasets.

Fig. 3. Comparison of F1-scores on the train, validation, and test sets.

D. Dataset Analysis

We initially ran our models on only 20% of the full Switchboard dataset to speed up development. Fig. 4 shows the training accuracy and cost curves of our RNN model on this subset, which suggest strong overfitting to the training set. Once we finalized our combined RNN and logistic regression model, we ran on progressively larger portions until we used the full Switchboard dataset and achieved the highest F1 score on the test set.

Fig. 4. RNN model training accuracy and cost on 20% of the dataset.

TABLE VII. MODEL PERFORMANCE ON VARYING DATASET SIZES
Classifier         Dataset Size             Accuracy   Precision   Recall   F1-score
Combined (train)   20% (4659 examples)      95.1       96.3        90.7     93.4
Combined (test)    20% (618 examples)       69.7       44.84       61.0     51.7
Combined (train)   50% (12011 examples)     93.8       92.4        91.1     91.8
Combined (test)    50% (1527 examples)      71.3       63.5        58.0     60.6
Combined (train)   100% (23658 examples)    91.1       90.5        86.3     88.3
Combined (test)    100% (2893 examples)     73.9       66.5        60.3     63.2

Table VII shows our model performance on different portions of the dataset, with noticeably less overfitting and better test set performance as the dataset portion increased. This suggests that an even larger dataset could achieve better results.

E. Laughbot Analysis

In testing our laughbot by speaking to it and checking whether it responded with laughter, as shown in Fig. 5 and Fig. 6, our laughbot responded appropriately to several types of input.

Fig. 5. Example laughbot success: responding with laughter.

Fig. 6. Example laughbot success: responding with no laughter.

TABLE VIII. LAUGHBOT SUCCESSES
Transcript                             Response
you're cute                            laughter
ha ha ha you're so fun to talk to      laughter
haha my grandma died                   no laughter
why did the chicken cross the road     no laughter
to get to the other side               laughter
do you like cheese                     no laughter
I finished my cs224s project           no laughter

We noticed that laughter was often triggered by shorter inputs (though not in all cases, as seen in Table VIII), as well as by laughter within the punchline itself. On longer inputs, inputs with negative sentiment, or both, the laughbot generally and correctly considered the line not to be a punchline. The laughbot responded positively to jokes, in some cases waiting for the actual punchline, and it considered questions and plain statements unfunny, as shown in Table VIII. There were also cases that fooled the laughbot into responding inappropriately with laughter. For example, the laughbot laughed at "I love you", likely because the statement has positive sentiment and is short. Sometimes an unfunny line said in a funny manner (raising the pitch of the last word) can induce laughter; saying "my grandma died" with a high pitch at the end will cause the laughbot to respond with laughter. Whether this should be considered a success for the laughbot is up to the discretion of the user.

VIII. LIMITATIONS AND FUTURE WORK

A. Improving the Model

Our combined RNN and logistic regression model performed best, with an F1-score of 63.2 and an accuracy of 73.9 on the test set. Future work will focus on reducing overfitting, as our final model, even when run on the entire dataset, still performs significantly better on the training set. Since logistic regression is a much more naive model than an RNN, we will work on improving this ensemble to fully utilize the predictive power of both. We also wish to explore CNNs, both stand-alone and in ensemble models, as our research showed CNNs achieving higher F1-scores than RNN models for this task [4]. We would also train on a larger dataset for a more generalizable model. Additional implementations could include more complex classification to identify the level or type of humor, so that the laughbot could respond with giggles or guffaws based on the user input. Sarcasm in particular has always been difficult to detect in natural and spoken language processing, but our model for detecting humor is a step towards recognizing the common textual cues along with the specific intonations that normally accompany sarcastic speech.
B. Making Laughbot Realtime

During development of the laughbot, we originally tried to design it to work in real time, so the user could speak continuously and the laughbot could laugh at every point in the one-sided conversation where it recognized humor. We were able to transcribe audio in real time with the Google Speech API, and intended to multithread the application to capture an audio file simultaneously, but we faced problems structuring the rest of the interface so that it could continually run the input through the model until humor was detected or the speaker paused for longer than a certain threshold.

Real-time recognition and laughter is a next-step implementation that will involve sending partial transcripts and audio files through the model continuously, concatenating audio and transcript data with the preceding chunks so that context and longer audio cues can contribute to the classification. Once real-time recognition and response are implemented, we could run the laughbot on sitcoms stripped of their laughtracks and let it respond, then compare how closely the laughbot's laughs match the original laughtracks of the TV show.

IX. CONCLUSION

Our project takes the unique approach of training on phone conversations and combining an RNN and a logistic regression model to classify spoken utterances as funny or not funny. Our final model achieves an F1-score of 63.2 and an accuracy of 73.9 on the test set, outperforming RNN models in previous work. Since we trained and tested on real conversations, this model's humor detection is most applicable to the real, everyday speech that a conversational agent might face. We also outline the architecture of the laughbot, a conversational agent that listens to users and responds with laughter to funny utterances. With future work, our model is a promising development towards conversational agents that detect and respond to humor in real time.

ACKNOWLEDGMENTS

We would like to thank our professors Andrew Maas and Dan Jurafsky, and the Stanford University CS224S Spoken Natural Language Processing teaching team. Special thanks to Raghav and Jiwei for their direction on our combined RNN and regression model.

REFERENCES

[1] Anton Nijholt. Humor and embodied conversational agents. 2003.
[2] Bilal Piot, Matthieu Geist, and Olivier Pietquin. Predicting when to laugh with structured classification. Interspeech, 2014.
[3] Bilal Piot, Matthieu Geist, and Olivier Pietquin. Imitation learning applied to embodied conversational agents. MLIS, 2015.
[4] Dario Bertero. Deep learning of audio and language features for humor prediction. 2016.
[5] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media Inc., 2009.
[6] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. ACM Multimedia (MM), pages 835-838, 2013.