Laughbot: Detecting Humor in Spoken Language with Language and Audio Cues


Kate Park (katepark@stanford.edu), Annie Hu (anniehu@stanford.edu), Natalie Muenster (ncm000@stanford.edu)

Abstract

We propose detecting and responding to humor in spoken dialogue by extracting language and audio cues and feeding these features into a combined Recurrent Neural Network (RNN) and logistic regression model. We parse Switchboard phone conversations to build a corpus of punchlines and unfunny lines, where punchlines are the utterances that precede laughter tokens in the Switchboard transcripts. Our combined RNN and logistic regression model uses both acoustic and language cues to predict whether a conversational agent should respond to an utterance with laughter. The model achieves an F1-score of 63.2 and an accuracy of 73.9, outperforming our logistic regression language model (F1-score 56.6) and our RNN acoustic model (59.4), as well as the final RNN model of Bertero and Fung's (2016) paper (52.9). Using our final model, we create a laughbot that audibly responds with laughter when a user's utterance is classified as a punchline. A conversational agent outfitted with a humor-recognition system such as the one we present in this paper would be valuable as these agents gain utility in everyday life.

1 Introduction

Our project aims to detect humor, and thus predict when someone will laugh, based on textual and acoustic cues. Humor depends on the words spoken, on how the speaker's voice changes, and on situational context. Predicting when someone will laugh is a significant part of contemporary media: sitcoms are written and performed primarily to cause laughter, so the ability to detect precisely why something will draw laughter is of extreme importance to screenwriters and actors. Detecting humor could also improve embodied conversational agents, which are attributed more human qualities when they can display an appreciation of humor (Nijholt, 2003). Siri currently will only say "you are very funny," but could become more human-like if it had the ability to recognize social cues like humor and respond to them with laughter. Thus, our objective is to train and test a model that can detect whether or not a line is funny and should be responded to with laughter.

2 Background / Related Work

Existing work in predicting humor uses a combination of text and audio features and various machine learning methods to tackle the classification problem. The goal of Piot et al.'s (2014) research in "Predicting when to Laugh with Structured Classification" was to create an avatar that imitated a human expert's sense of when to laugh in a conversation. Piot et al. reduced the problem to a multi-class classification problem and found that a large-margin algorithm produced the most natural behavior (laughing in the same proportion as the expert). In a later paper, "Imitation Learning Applied to Embodied Conversational Agents," Piot et al. (2015) examined the problem of using a laughter-enabled Interaction Manager to make decisions about whether to laugh or not. They defined the Regularized Classification for Apprenticeship Learning (RCAL) algorithm using audio features. Final results showed that RCAL performed much better than multi-class classification on one of the two classes and slightly worse on the other.

The paper noted the problem of a large class imbalance between laughter and non-laughter in the dataset, and suggested potential future work on Inverse Reinforcement Learning (IRL) algorithms. From this, we propose weighting our datasets to reduce imbalance so that the non-laughter class does not overpower what the model learns from punchlines.

Bertero and Fung (2016) compared three supervised machine learning methods to predict and detect humor in dialogues from the sitcom The Big Bang Theory. From a corpus in which punchlines were annotated by laughter, Bertero et al. extracted audio and language features and fed them into three models, a CNN, an RNN, and a CRF, which were compared against a logistic regression baseline (F1-score 29.2). The CNN using word vectors and overlapping 25 ms time frames performed best (F1-score 68.6). The RNN model (F1-score 52.9) was expected to perform best but actually performed worse than the CNN, likely due to overfitting. The paper proposes future work on building a better dialog system that understands humor. In this paper, our objective is to improve on the RNN model. Rather than a sitcom dataset, we use spontaneous phone conversations, whose humor we hypothesize is more relevant for a conversational agent like Siri. Further, our objective is to build a simple dialog system, laughbot, that responds to humor.

3 Dataset

The focus of this project is detecting humor in everyday, regular speech, in contrast to the past work we examined, which analyzes humor in sitcoms. We use the Switchboard corpus available on AFS, which consists of around 3000 groups of audio files, transcripts, and word-level time intervals. We classified each line of a transcript as a punchline if it preceded any indication of laughter at the beginning of the next speaker's response:

B: That, that's the major reason I'm walking up the stairs. (Punchline)
A: [Laughter] To go skiing?

Table 1: A punchline inducing laughter

A: Uh-huh. Well, you must have a relatively clean conscience then [laughter]. (Punchline)
B: [Laughter]

Table 2: Laughter that induces laughter is classified as a punchline

A: just for fun. (Not a punchline)
B: Shaking the scorpions out of their shoes [laughter].

Table 3: A line preceding someone laughing at themselves does not count as a punchline

We split our dataset into 80% train / 10% validation / 10% test sets. Mindful of the imbalanced datasets that previous work ran into, we sample 5% of the non-punchlines, since our original dataset is heavily imbalanced towards non-punchlines. This yields a more balanced train set between the positive (punchline) and negative (unfunny line) classes. Our final datasets each contain about 35-40% punchlines.

3.1 Features

We extract a combination of language and audio features from the data. Language features include:

Unigrams, bigrams, trigrams: we prune the vocabulary and keep the n-grams that appear more than a certain number of times in the train set, tuning the threshold on our validation set. In this model, we keep n-grams that appear at least twice.

Parts of speech: we use NLTK's POS tagger to count the nouns, verbs, adjectives, adverbs, and pronouns appearing in the example (Bird et al., 2009).

Sentiment: we use NLTK's VADER toolkit to extract sentiment of the line, on a scale from more negative to more positive (Bird et al., 2009).

Length, average word length: from reading past work, we learned that sitcom punchlines are often short, so we use length features (Bertero, 2016).
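To make these language features concrete, the following is a minimal illustrative sketch of how they could be extracted with NLTK. The function names, feature keys, and the min_count=2 pruning threshold mirror the description above but are our own choices, not the authors' code, and the sketch assumes the NLTK punkt tokenizer, averaged-perceptron POS tagger, and VADER lexicon have been downloaded.

```python
from collections import Counter

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

_sia = SentimentIntensityAnalyzer()

def build_ngram_vocab(train_utterances, min_count=2):
    """Keep unigrams, bigrams, and trigrams seen at least min_count times in train."""
    counts = Counter()
    for utt in train_utterances:
        tokens = nltk.word_tokenize(utt.lower())
        for n in (1, 2, 3):
            counts.update(nltk.ngrams(tokens, n))
    return {gram for gram, c in counts.items() if c >= min_count}

def language_features(utterance, ngram_vocab):
    """Pruned n-gram counts, coarse POS counts, VADER sentiment, and length features."""
    tokens = nltk.word_tokenize(utterance.lower())
    feats = Counter()
    for n in (1, 2, 3):
        feats.update(g for g in nltk.ngrams(tokens, n) if g in ngram_vocab)
    for _, tag in nltk.pos_tag(tokens):
        if tag[:2] in ("NN", "VB", "JJ", "RB", "PR"):  # nouns, verbs, adj., adv., pronouns
            feats["pos_" + tag[:2]] += 1
    feats["sentiment"] = _sia.polarity_scores(utterance)["compound"]  # -1 (neg) to +1 (pos)
    feats["num_words"] = len(tokens)
    feats["avg_word_len"] = sum(len(t) for t in tokens) / max(len(tokens), 1)
    return dict(feats)
```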
We also extract acoustic features from each audio file (converted to .wav format) with the openSMILE toolkit (Eyben et al., 2013), and we match the timestamped features to the timed transcripts in Switchboard to obtain corresponding acoustic and language features for each example. Acoustic features include:

MFCC: we expect MFCC vectors to store the most information about an audio sample, so we sample 12 MFCC coefficients every 10 ms, with a maximum of 50 time intervals per example, since certain lines may be too long to store fully.

Energy level: we also expect the speaker's energy to be a strong indicator of humor, so we include it as an additional feature.
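The sketch below shows one way such a per-utterance acoustic feature matrix could be assembled. We substitute librosa for the openSMILE pipeline the authors used, and the 12-coefficient / 10 ms / 50-frame shape follows the description above, so treat it as an approximation rather than the original extraction code.

```python
import librosa
import numpy as np

MAX_FRAMES = 50  # cap on the number of 10 ms feature frames per example

def acoustic_features(wav_path, sr=16000):
    """Return a (MAX_FRAMES, 13) matrix: 12 MFCCs plus RMS energy per 10 ms frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(0.010 * sr)  # one frame every 10 ms
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, hop_length=hop)   # (12, T)
    energy = librosa.feature.rms(y=y, hop_length=hop)                    # (1, T)
    feats = np.vstack([mfcc, energy]).T                                  # (T, 13)
    feats = feats[:MAX_FRAMES]                      # truncate lines that are too long
    if feats.shape[0] < MAX_FRAMES:                 # zero-pad short lines
        pad = np.zeros((MAX_FRAMES - feats.shape[0], feats.shape[1]))
        feats = np.vstack([feats, pad])
    return feats
```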

4 Approach / Models

4.1 Baseline

Our baseline is an all-positive classifier, predicting every example as a punchline. The precision of this classifier is the proportion of true punchlines in the dataset (around 35%), with 100% recall. We also used an all-negative classifier, predicting every line as unfunny, which has a precision of the proportion of unfunny lines (around 65%) and 0% recall.

Classifier | Precision | Recall | F1-score
All Positive | 37.2 | 100.0 | 54.2
All Negative | 52.8 | 0.0 | 0.0

Table 4: Baseline metrics

4.2 Logistic Regression Language Model

We train a logistic regression model using only language features (n-grams, sentiment, line length) as a secondary baseline. Logistic regression is an intuitive starting model for binary classification, and it also lets us observe and tune the performance of the language features alone in predicting humor.

4.3 RNN Acoustic Model

We next train a Recurrent Neural Network (RNN) using only acoustic features, to observe how well the acoustic features classify lines of audio as punchlines or not. We choose an RNN to better capture the sequential nature, and thus the conversational context, of dialogue, and we use Gated Recurrent Unit (GRU) cells so our model can better remember earlier timesteps in a line instead of overemphasizing the latest timesteps. During training, we use standard softmax cross-entropy loss. We initially used an Adam optimizer, as it handles infrequently seen training features better and converges more smoothly than stochastic gradient descent. Our final RNN uses an Adamax optimizer to further stabilize the model between epochs and to make it more robust to infrequently seen features and gradient noise.

4.4 Final Combined Model

After designing separate language and acoustic models, we combined the two as follows:

1. Run our RNN on all acoustic features in the training set, and extract the final hidden state vector of the RNN for each training example.
2. Concatenate this vector with all language features for the corresponding training example.
3. Use the combined feature vectors to train a logistic regression model.

Figure 1 (Diagram of final RNN + LogReg model) shows our combined model architecture. For testing, we follow a similar process: we run the acoustic features through our pre-trained RNN, concatenate the final hidden state vector with the language features for an example, and run the combined feature vector through our pre-trained logistic regression model to obtain the prediction.
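A minimal sketch of this three-step combination follows, under our own assumptions: TensorFlow/Keras for the GRU acoustic encoder, scikit-learn for the logistic regression, and a guessed hidden-state size. The original implementation may differ in framework and hyperparameters.

```python
import numpy as np
import tensorflow as tf
from sklearn.linear_model import LogisticRegression

MAX_FRAMES, N_ACOUSTIC, N_HIDDEN = 50, 13, 100   # N_HIDDEN is a guess, not the paper's value

# Step 1: train a GRU-based RNN on the acoustic feature sequences alone.
acoustic_in = tf.keras.Input(shape=(MAX_FRAMES, N_ACOUSTIC))
final_state = tf.keras.layers.GRU(N_HIDDEN)(acoustic_in)      # final hidden state
logits = tf.keras.layers.Dense(2)(final_state)                 # punchline vs. not
rnn = tf.keras.Model(acoustic_in, logits)
rnn.compile(optimizer=tf.keras.optimizers.Adamax(),
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# rnn.fit(X_acoustic_train, y_train, epochs=...)

# Step 2: extract each example's final hidden state and append its language features.
encoder = tf.keras.Model(acoustic_in, final_state)

def combined_features(X_acoustic, X_language):
    hidden = encoder.predict(X_acoustic)                # (n_examples, N_HIDDEN)
    return np.concatenate([hidden, X_language], axis=1)

# Step 3: train a logistic regression on the combined feature vectors.
# clf = LogisticRegression(max_iter=1000).fit(
#     combined_features(X_acoustic_train, X_lang_train), y_train)
# predictions = clf.predict(combined_features(X_acoustic_test, X_lang_test))
```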

5 Laughbot Architecture

The laughbot is a simple user-interface application that wraps the model we built during our research and testing, predicting humor through a chatbot-style audio prompter. It is intended for demonstration purposes, both to let users experiment with custom input and to show the results of our project in an accessible, tangible form. The user speaks into the microphone, after which the laughbot classifies whether what was said was funny, and audibly laughs if so.

The laughbot takes user input audio, transcribes it, and feeds the audio file and transcription into our pre-trained RNN and logistic regression model. Multithreading allows the user to speak into the microphone for as much of the maximum 60-second time segment as he or she would like before pressing Enter to indicate the end of speech. The audio is then saved as a .wav file and transcribed by calling the Google Cloud Speech API. Both the transcription and the original audio file are sent through the pre-trained model: acoustic features are extracted and run through the pre-trained RNN, and the last hidden states are combined with textual features extracted from the transcription as input to the logistic regression model. Once a classification is obtained, not funny or funny, the laughbot either keeps a straight face by staying silent and simply prompting for more audio, or randomly plays one of several laughtracks that we recorded during the late-night hours of project development. Classification is almost immediate; the bulk of the runtime depends on the speed of transcription from the Google Cloud Speech API, and thus on the strength of the wifi connection.

During development of the laughbot, we originally tried to design it to work in real time, so the user could speak continually and the laughbot could laugh at every point in the one-sided conversation where it recognized humor. We were able to transcribe audio in real time with the Google Speech API, and intended to multithread it to capture an audio file simultaneously, but we faced problems structuring the rest of the interface to continually run the input through the model until humor was detected or the speaker paused for more than a certain threshold. Real-time recognition and laughing is a next-step implementation that will involve sending partial transcripts and audio files through the model continuously, concatenating audio and transcript data with preceding chunks so that context and longer audio cues can contribute to the classification.

To see a sample of our laughbot in action, see this video: https://youtu.be/t6je0kznyxg
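The following is a hedged sketch of this record-transcribe-classify-laugh loop, not the authors' code. The record_utterance and predict_punchline callables are hypothetical stand-ins for the microphone capture and the combined model sketched earlier, and the playback library is our own choice; the Google Cloud Speech calls follow that library's standard recognize flow.

```python
import random

import simpleaudio                      # assumption: any WAV-playback library would do
from google.cloud import speech         # pip install google-cloud-speech

LAUGH_TRACKS = ["laugh1.wav", "laugh2.wav", "laugh3.wav"]   # pre-recorded laugh tracks

def transcribe(wav_bytes):
    """Send the recorded .wav audio to the Google Cloud Speech API and return text."""
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(language_code="en-US")
    audio = speech.RecognitionAudio(content=wav_bytes)
    response = client.recognize(config=config, audio=audio)
    return " ".join(result.alternatives[0].transcript for result in response.results)

def laughbot_loop(record_utterance, predict_punchline):
    """record_utterance() -> wav bytes (mic capture until Enter, <= 60 s);
    predict_punchline(wav_bytes, transcript) -> bool (combined RNN + LogReg model)."""
    while True:
        wav_bytes = record_utterance()
        transcript = transcribe(wav_bytes)
        if predict_punchline(wav_bytes, transcript):
            simpleaudio.WaveObject.from_wave_file(random.choice(LAUGH_TRACKS)).play()
        # otherwise keep a straight face and prompt for the next utterance
```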
6 Results

6.1 Evaluation

We evaluate using accuracy, precision, recall, and F1 scores, with the greatest emphasis on F1. Accuracy is the proportion of correct predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision is the proportion of predicted punchlines that were actual punchlines:

Precision = TP / (TP + FP)

Recall is the proportion of true punchlines that were captured as punchlines:

Recall = TP / (TP + FN)

F1 is the harmonic mean of precision and recall:

F1 = (2 × Precision × Recall) / (Precision + Recall)

Table 5 and Figure 2 show the final performance of our models on these metrics, evaluated on the test dataset. Notably, our final model not only beat our baseline, it also beat the final RNN model of Bertero (2016), the most similar approach to our own. Note that Bertero et al.'s paper used The Big Bang Theory sitcom dialogues, which we hypothesize are an easier dataset to classify than general phone conversations. Their dataset also had a higher proportion of punchlines, leading to a higher F1 score for their positive baseline. Thus their final CNN model improved less over their baseline (a 14.36% improvement) than our final model did over ours (a 16.35% improvement). Further, while Bertero's final CNN model had a higher F1-score than our final model, our model had higher accuracy (see Table 6).
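As a quick consistency check, plugging our final model's test-set precision and recall (66.5 and 60.3, from Table 5 below) into the F1 formula recovers the reported score:

F1 = (2 × 66.5 × 60.3) / (66.5 + 60.3) ≈ 63.2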

Classifier | Accuracy | Precision | Recall | F1-score
Logistic Regression (train) | 87.7 | 90.1 | 76.7 | 82.8
RNN (train) | 75.0 | 73.8 | 55.2 | 63.2
Combined (train) | 91.1 | 90.5 | 86.3 | 88.3
Logistic Regression (validation) | 68.5 | 60.2 | 48.5 | 53.7
RNN (validation) | 70.0 | 61.5 | 54.7 | 57.9
Combined (validation) | 73.4 | 66.0 | 60.5 | 63.1
Logistic Regression (test) | 70.6 | 62.7 | 51.4 | 56.5
RNN (test) | 71.7 | 63.5 | 55.9 | 59.4
Combined (test) | 73.9 | 66.5 | 60.3 | 63.2

Table 5: Comparison of all models on all datasets

Classifier | Accuracy | Precision | Recall | F1-score
Bertero's Positive Baseline | 42.8 | 42.8 | 100.0 | 59.9
Our Positive Baseline | 37.2 | 37.2 | 100.0 | 54.2
Logistic regression (language only) | 70.6 | 62.7 | 51.4 | 56.5
RNN (audio only) | 71.7 | 63.5 | 55.9 | 59.4
Bertero's Final RNN | 65.8 | 64.4 | 44.9 | 52.9
Our Final Model (RNN + LogReg) | 73.9 | 66.5 | 60.3 | 63.2
Bertero's Final CNN | 73.8 | 70.3 | 66.7 | 68.5

Table 6: Comparison of our model against Bertero et al.

Figure 2: Comparison of models on the test dataset

6.2 Model Analysis

Our final combined RNN acoustic and logistic regression language model performed best of all our models. This fit our expectations, as humor should depend both on what was said and on how it was said. The language-only and audio-only models had accuracies fairly similar to the final model's, but much lower recall scores (especially the language model), suggesting that the combined model was better at correctly predicting punchlines, while the individual models perhaps tended to be too conservative in their predictions, labeling too many true punchlines as non-punchlines.

We tuned our RNN and regression models separately on the validation set. We found that the language model performed best when the frequent n-gram threshold was set to 2 (so we only included n-grams that occurred at least twice in the train set), and performance dropped as this threshold was increased. This makes sense: for bigrams and especially trigrams, the number that appear at least x times in a dataset drops drastically as x increases, so with too high a threshold we were excluding too many potentially useful features. We also found that sentence length was a particularly important feature, which confirmed our expectation that most punchlines would be relatively short.

With the RNN, we found that increasing the number of hidden states greatly improved model performance up to a certain point, then began causing overfitting past that point. The same was true of the number of epochs we ran the model through during training. As we were using an Adamax optimizer, which already adapts the model's learning rate, we did not perform much tuning on our initial learning rate.

6.3 Error Analysis

Table 5 and Figure 3 show the performance of our language-only, audio-only, and combined models on our train, validation, and test datasets. All models performed significantly better on the train set than on the validation or test set, especially the final combined model, suggesting that our model is strongly overfitting to the training data. This may be helped with hyperparameter tuning (e.g., decreasing the number of epochs or the number of hidden units in our RNN, or changing the regularization of our logistic regression model) or by simplifying the model (e.g., decreasing the maximum number of MFCC vectors to extract or decreasing the number of language features). As we explore in the next section, this may also be helped by training on larger datasets.

Figure 3: Comparison of F1-scores on train, val, and test

6.4 Dataset Analysis

We initially ran our models on only 20% of the full Switchboard dataset to speed up training. Figure 4 (RNN model train accuracy and cost on 20% of the dataset) shows the training accuracy and cost curves of our RNN model, which suggest strong overfitting to the train set. Once we finalized our combined RNN and logistic regression model, we continued to run on larger portions of the Switchboard dataset and achieved higher and higher F1 scores on the test set. Table 7 shows our model's performance on different portions of the dataset, with noticeably less overfitting and better test-set performance as the dataset portion increased. This suggests that having an even larger dataset could achieve better results.

Classifier | Dataset Size | Accuracy | Precision | Recall | F1-score
Combined (train) | 20% (4659 examples) | 95.1 | 96.3 | 90.7 | 93.4
Combined (test) | 20% (618 examples) | 69.7 | 44.84 | 61.0 | 51.7
Combined (train) | 50% (12011 examples) | 93.8 | 92.4 | 91.1 | 91.8
Combined (test) | 50% (1527 examples) | 71.3 | 63.5 | 58.0 | 60.6
Combined (train) | 100% (23658 examples) | 91.1 | 90.5 | 86.3 | 88.3
Combined (test) | 100% (2893 examples) | 73.9 | 66.5 | 60.3 | 63.2

Table 7: Model performance on varying dataset sizes

6.5 Laughbot Analysis

In testing our laughbot by speaking to it and waiting for laughter or silence, the laughbot responded appropriately to several types of inputs. We noticed that laughter was often triggered by shorter inputs (though not in all cases, as seen in Table 8), as well as by laughter within the punchline itself. On longer inputs, inputs with negative sentiment, or both, laughbot generally considered the line unfunny. Laughbot responded positively to jokes, in some cases waiting for the actual punchline. Laughbot considered questions and plain statements unfunny, as shown in Table 8.

Transcript:
you're cute
ha ha ha you're so fun to talk to haha
my grandma died
why did the chicken cross the road
to get to the other side
do you like cheese
i finished my cs224s project

Table 8: Laughbot successes (example inputs)

There were cases that fooled the laughbot, where it inappropriately responded with laughter. For example, the laughbot laughed at "I love you," likely because the statement has positive sentiment and is short in length.

Sometimes, unfunny lines said in a funny manner (raising the pitch of the last word) can induce laughter. For example, saying "my grandma died" with a high pitch at the end will cause laughbot to respond with laughter. Whether this should be considered a success on the part of the laughbot is up to the discretion of the user.

7 Conclusion and Future Work

Our combined RNN and logistic regression model performed best, with an F1-score of 63.2 and an accuracy of 73.9 on the test set. Future work will focus on reducing overfitting, as our final model, even when run on the entire dataset, still performs significantly better on the train set. Since logistic regression is a much more naive model than an RNN, we will work on improving this ensemble model to fully utilize the predictive power of both. We also wish to explore Convolutional Neural Networks (CNNs), both stand-alone and in ensemble models, as our research showed CNNs to have a higher F1-score than RNN models on this task (Bertero, 2016). We would also train on a larger dataset for a more generalizable model.

To test laughbot further, we would like to make it real-time, and to run it on sitcom episodes with the laughtracks removed to see how closely laughbot's laughter matches the original laughtrack of the TV show. Additional implementations could include more complex classification to identify the level or type of humor; laughbot would then be able to respond with giggles or guffaws based on the user's input. Sarcasm in particular has always been difficult to detect in natural and spoken language processing, but our model for detecting humor is a step towards recognizing the common textual cues along with the specific intonations that normally accompany sarcastic speech. Since we trained and tested on real conversations, this model's humor detection applies to real, everyday speech more so than to scripted jokes. In this paper we also identified areas in our implementation with room for improvement. With future work, our model could be a viable addition to conversational agents, helping them embody more human-like attributes.

Acknowledgments

We would like to thank our professor Andrew Maas and the CS224S Spoken Natural Language Processing teaching team. Special thanks to Raghav and Jiwei for their direction on our combined RNN and regression model.

References

Dario Bertero and Pascale Fung. 2016. Deep learning of audio and language features for humor prediction.

Bilal Piot, Matthieu Geist, and Olivier Pietquin. 2014. Predicting when to laugh with structured classification. Interspeech.

Bilal Piot, Olivier Pietquin, and Matthieu Geist. 2015. Imitation learning applied to embodied conversational agents. MLIS.

Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. 2013. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. ACM Multimedia (MM), pages 835-838. https://doi.org/10.1145/2502081.2502224

Anton Nijholt. 2003. Humor and embodied conversational agents.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media Inc.