Sentiment and Sarcasm Classification with Multitask Learning

Navonil Majumder, Soujanya Poria, Haiyun Peng, Niyati Chhaya, Erik Cambria, and Alexander Gelbukh

arXiv:1901.08014v1 [cs.CL] 23 Jan 2019

Abstract. Sentiment classification and sarcasm detection are both important NLP tasks. We show that these two tasks are correlated, and present a multi-task learning-based framework using a deep neural network that models this correlation to improve the performance of both tasks. Our method outperforms the state of the art by 3-4%.

I. INTRODUCTION

The surge of the Internet has enabled large-scale text-based opinion sharing on a wide range of topics. This has created the opportunity to mine user sentiment on various topics from data publicly available over the Internet. The most important task in the analysis of users' opinions is sentiment classification: determining whether a given text, such as a user review, comment, or tweet, expresses positive or negative sentiment.

When expressing their opinions or sentiment, users often use sarcasm for emphasis. In a sarcastic text, the sentiment intended by the author is the opposite of its literal meaning. E.g., the sentence "Thank you alarm for never going off" is literally positive ("Thank you"); however, the intended sentiment is negative ("alarm never going off"). Unless this sentiment shift is detected through semantics, a sarcasm classifier may fail to spot the sarcasm.

Currently, most researchers focus on either sentiment classification or sarcasm detection [19], [7], without considering the possibility of mutual influence between the two tasks. However, one can observe that the two tasks are correlated: people usually (though not always) use sarcasm as a device for the expression of emphatic negative sentiment. This observation suggests a simple way in which one of the two tasks can help improve the other:
if an expression can be detected as sarcastic, its sentiment can be assumed negative; if the expression can be classified as positive, then it can be assumed not sarcastic. We show here that while this logic does lead to a slight improvement, there is a better way of combining the two tasks. Namely, in this paper, we train a classifier for both sarcasm and sentiment in a single neural network using multi-task learning, a learning scheme that has gained recent popularity [1], [11]. We empirically show that this method outperforms the results obtained with two separate classifiers and, in particular, outperforms the current state of the art [14].

The rest of the paper is structured as follows. Section II outlines the related work; Section III presents our approach; Section IV lists the baselines; Section V discusses the results; and Section VI concludes the paper.

N. Majumder and A. Gelbukh are with the CIC, Instituto Politécnico Nacional, Mexico City, Mexico; e-mail: see http://www.gelbukh.com. S. Poria is with the SCSE, Nanyang Technological University, Singapore; e-mail: see http://www.ntu.edu.sg/home/sporia. H. Peng and E. Cambria are with the SCSE, Nanyang Technological University, Singapore; e-mail: see http://sentic.net/erikcambria. N. Chhaya is with Adobe Research, India; e-mail: nchhaya@adobe.com.

II. RELATED WORK

Machine learning methods, e.g., [20], [26], and deep neural networks, such as CNNs [8], [17], [13], recursive neural networks [5], [24], recurrent neural networks [25], or memory networks [10], have shown good performance for sentiment detection. Knowledge-based methods explore syntactic characteristics/patterns/rules [18] and employ sentiment resources [6]. Work on sarcasm detection, in contrast, currently focuses on extracting features, such as syntactic [2], surface pattern-based [4], or personality-based features [19], as well as contextual incongruity [7]. Mishra et al.
[14] extracted multimodal cognitive features for both sentiment classification and sarcasm detection, without modelling the two tasks in a single system. Recently, however, multi-task learning has been successfully applied to many NLP tasks, such as implicit discourse relationship identification [11] and key-phrase boundary classification [1]. In this paper, we apply it to sentiment classification and sarcasm detection.

III. METHOD

As observed in [22], many sarcastic sentences carry negative sentiment. We leverage this to improve both sentiment classification and sarcasm detection. We use multi-task learning, where a single neural network is used to perform
more than one classification task, in our case sentiment classification and sarcasm detection. This network facilitates synergy between the two tasks, resulting in improved performance on both tasks in comparison with their standalone counterparts.

Fig. 1. Our multi-task architecture. (The figure shows the input sentence X fed word by word to a GRU, task-specific representations H_sar and H_sen, a neural tensor network with tensor T producing the fused representation s+, and the two outputs for sarcasm detection, s_sar, and sentiment detection, s_sen.)

a) Task Definition: We solve two tasks with a single network. Given a sentence [w_1, w_2, ..., w_l], where the w_i are words, we assign it both a sentiment tag (positive / negative) and a sarcasm tag (yes / no).

b) Input Representation: We use D_g-dimensional (D_g = 300) GloVe word embeddings [16] x_i ∈ R^{D_g} to represent the words w_i, padding the variable-length input sentences to a fixed length with null vectors. Thus, the input is represented as a matrix X = [x_1, x_2, ..., x_L], where L is the length of the longest sentence.

c) Sentence Representation: In the next layers, we obtain the sentence representation from X using a Gated Recurrent Unit (GRU) [3] with an attention mechanism [12], as follows.

d) Sentence-level word representation: The sentence X is fed to a GRU of size D_gru = 500 with parameters W_[z,r,h] ∈ R^{D_g × D_gru} and U_[z,r,h] ∈ R^{D_gru × D_gru} to get context-rich sentence-level word representations H = [h_1, h_2, ..., h_L], h_t ∈ R^{D_gru}, at the hidden output of the GRU. We use H for both sarcasm and sentiment. Thus, H is transformed into H_sar and H_sen using two different fully-connected layers of size D_t = 300 in order to accommodate the two different tasks, sarcasm detection and sentiment classification:

  H_sar = ReLU(H W_sar + b_sar),
  H_sen = ReLU(H W_sen + b_sen),

where W_[sar,sen] ∈ R^{D_gru × D_t} and b_[sar,sen] ∈ R^{D_t}.

e) Attention network: The word representations in H_* are encoded with task-specific sentence-level context. To aggregate these context-rich representations into the sentence representation s_*, we use an attention mechanism, due to its ability to prioritize the words relevant for the classification:

  P_* = tanh(H_* W_ATT),      (1)
  α_* = softmax(P_*^T W_α),   (2)
  s_* = α_* H_*,              (3)

where W_ATT ∈ R^{D_t × 1}, W_α ∈ R^{L × L}, P_* ∈ R^{L × 1}, and s_* ∈ R^{D_t}. In Eq. (2), α_* ∈ [0, 1]^L gives the relevance of the words for the task, which is multiplied in Eq. (3) with the context-aware word representations in H_*.

f) Inter-Task Communication: We use a Neural Tensor Network (NTN) [23] of size D_ntn = 100 to fuse the sarcasm- and sentiment-specific sentence representations, s_sar and s_sen, into the fused representation s+:

  s+ = tanh(s_sar T^[1:D_ntn] s_sen^T + (s_sar ⊕ s_sen) W + b),

where T ∈ R^{D_ntn × D_t × D_t}, W ∈ R^{2 D_t × D_ntn}, b, s+ ∈ R^{D_ntn}, and ⊕ stands for concatenation. The vector s+ contains information relevant to both sentiment and sarcasm. Instead of the NTN, we also tried attention and concatenation for fusion, which resulted in inferior performance (Section V).

g) Classification: For the two tasks, we use two different softmax layers for classification.

h) Sentiment classification: We use only s_sen as the sentence representation for sentiment classification, since we observed the best performance without s+. We apply a softmax layer of size C (C = 2 for the binary task) on s_sen for classification as follows:

  P_sen = softmax(s_sen W_sen^softmax + b_sen^softmax),
  ŷ_sen = argmax_j(P_sen[j]),

where W_sen^softmax ∈ R^{D_t × C}, b_sen^softmax ∈ R^C, P_sen ∈ R^C, j is the class value (0 for negative and 1 for positive), and ŷ_sen is the estimated class value.
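To make the attention aggregation of Eqs. (1)-(3) and the NTN fusion concrete, here is a minimal NumPy sketch with toy dimensions (the paper uses D_gru = 500, D_t = 300, D_ntn = 100). The GRU encoder and the softmax heads are omitted, and the random matrices stand in for learned parameters; function and variable names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D_t, D_ntn = 6, 8, 4  # toy sizes; the paper uses D_t = 300, D_ntn = 100

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(H_star, W_att, W_alpha):
    """Eqs (1)-(3): P = tanh(H* W_ATT), alpha = softmax(P^T W_alpha), s* = alpha H*."""
    P = np.tanh(H_star @ W_att)        # (L, 1)
    alpha = softmax(P.T @ W_alpha)     # (1, L); entries in [0, 1], summing to 1
    return (alpha @ H_star).ravel(), alpha.ravel()  # s* in R^{D_t}

def ntn_fuse(s_sar, s_sen, T, W, b):
    """NTN fusion: s+ = tanh(s_sar T s_sen^T + (s_sar concat s_sen) W + b)."""
    bilinear = np.einsum('i,kij,j->k', s_sar, T, s_sen)  # (D_ntn,) bilinear slices
    concat = np.concatenate([s_sar, s_sen])              # (2 D_t,)
    return np.tanh(bilinear + concat @ W + b)

# Task-specific word representations standing in for ReLU(H W_* + b_*).
H_sar = relu(rng.standard_normal((L, D_t)))
H_sen = relu(rng.standard_normal((L, D_t)))

# Shared attention parameters, as in the best model of Section IV h).
W_att = rng.standard_normal((D_t, 1))
W_alpha = rng.standard_normal((L, L))
s_sar, alpha_sar = attend(H_sar, W_att, W_alpha)
s_sen, alpha_sen = attend(H_sen, W_att, W_alpha)

# NTN parameters and fusion.
T = rng.standard_normal((D_ntn, D_t, D_t))
W = rng.standard_normal((2 * D_t, D_ntn))
b = rng.standard_normal(D_ntn)
s_plus = ntn_fuse(s_sar, s_sen, T, W, b)

# Sarcasm head sees s_sar concatenated with s+; sentiment head sees s_sen only.
sar_input = np.concatenate([s_sar, s_plus])
print(s_sar.shape, s_plus.shape, sar_input.shape)
```

The `einsum` call computes the D_ntn bilinear forms s_sar T^[k] s_sen^T in one step; a loop over the slices of T would be equivalent but slower.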
i) Sarcasm classification: We use s_sar ⊕ s+ as the sentence representation for sarcasm classification, with a softmax layer of size C (C = 2), as follows:

  P_sar = softmax((s_sar ⊕ s+) W_sar^softmax + b_sar^softmax),
  ŷ_sar = argmax_j(P_sar[j]),

where W_sar^softmax ∈ R^{(D_t + D_ntn) × C}, b_sar^softmax ∈ R^C, P_sar ∈ R^C, j is the class value (0 for no and 1 for yes), and ŷ_sar is the estimated class value.

j) Training: We use categorical cross-entropy as the loss function (J_*; * is sar or sen) for training:

  J_* = -(1/N) Σ_{i=1}^{N} Σ_{j=0}^{C-1} y*_ij log P*_i[j],

where N is the number of samples, i is the index of a sample, j is the class value, and

  y*_ij = 1 if the expected class value of sample i is j, and 0 otherwise.

As the training algorithm, we use the Stochastic Gradient Descent (SGD)-based ADAM algorithm [9], which optimizes each parameter individually with different and adaptive learning rates. We minimize both loss functions, J_sen and J_sar, with equal priority, by optimizing the parameter set θ = {U_[z,r,h], W_[z,r,h], W_[sar,sen], b_[sar,sen], W_ATT, W_α, T, W, b, W_[sar,sen]^softmax, b_[sar,sen]^softmax}.

IV. EXPERIMENTS

a) Dataset: The dataset [15] consists of 994 samples, each containing a text snippet labeled with a sarcasm tag, a sentiment tag, and the eye-movement data of 7 readers. We ignored the eye-movement data in our experiments. Of these samples, 383 are positive and 350 are sarcastic.

b) Baselines and Model Variants: We evaluated the following baselines and variants of our model.

c) Standalone classifiers: Here, we used

  h_* = FCLayer(GRU(X)),
  P_* = SoftmaxLayer(h_*),

where * represents sar or sen and X is the input sentence as a list of word embeddings. We feed X to a GRU and pass its final output through a fully-connected layer (FCLayer) to obtain the sentence representation h_*. We apply the final softmax classification (SoftmaxLayer) to h_*.

d) Sentiment coerced by sarcasm: In this classifier, the sentences classified as sarcastic are forced to be considered negative by the sentiment classifier.
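The coercion of baseline d) is a one-line post-processing rule; a sketch, with labels encoded as in Section III (sentiment: 0 = negative, 1 = positive; sarcasm: 0 = no, 1 = yes):

```python
def coerce_sentiment(sentiment_pred: int, sarcasm_pred: int) -> int:
    """If the sarcasm classifier fires, force the sentiment to negative (0);
    otherwise keep the sentiment classifier's prediction."""
    return 0 if sarcasm_pred == 1 else sentiment_pred

# A sentence predicted positive but sarcastic is coerced to negative;
# non-sarcastic predictions pass through unchanged.
print(coerce_sentiment(1, 1), coerce_sentiment(1, 0), coerce_sentiment(0, 0))  # 0 1 0
```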
e) Simple multi-task classifier: The following equations summarize this variant:

  h_* = FCLayer_*(GRU(X)),     (4)
  P_* = SoftmaxLayer_*(h_*),   (5)

where * represents sar or sen. This setting shares the GRU between the two tasks. The final output of the GRU is taken as the sentence representation, which is fed to two different task-specific fully-connected layers (FCLayer_*), giving h_*. Subsequently, the h_* are fed to two different softmax layers (SoftmaxLayer_*) for classification.

f) Simple multi-task classifier with fusion: In this variant, we changed Eq. (5) to:

  P_sar = SoftmaxLayer_sar(h_sar ⊕ F),   (6)
  P_sen = SoftmaxLayer_sen(h_sen),       (7)

where F = NTN(h_sar, h_sen). Here, h_sar and h_sen are fed to a Neural Tensor Network (NTN), whose output is concatenated with h_sar for classification. Sentiment classification is done with h_sen only. We also tried variants with other methods of fusion (such as a fully-connected layer or the Hadamard product) instead of the NTN, as well as variants with h_sen ⊕ F instead of, or in addition to, h_sar ⊕ F, but they did not improve the results.

g) Task-specific GRU with fusion: Here, we used two separate GRUs for the two tasks in Eq. (4):

  h_* = FCLayer_*(GRU_*(X)).   (8)

We used Eq. (6) and Eq. (7) for P_*. Again, we tried concatenating F with h_sen, with both, or with neither, as in Eq. (5), but this did not improve the results.

h) Best model: shared attention: Here, we added the attention mechanism over the matrix H in Eq. (4), and used Eq. (6) and Eq. (7) for P_*. This model, described in detail in Section III, is the main model we present in this paper, since it gave the best results. We also tried separate GRUs as in Eq. (8), but this did not improve the results.

V. RESULTS AND DISCUSSION

The results using 10-fold cross-validation are shown in Table I. As baselines, we used the standalone sentiment and sarcasm classifiers, as well as the CNN-based state-of-the-art method [14] (SoA).
Our standalone GRU-based sentiment and sarcasm classifiers performed slightly better than the SoA, even though the SoA also uses the gaze data
TABLE I
RESULTS FOR VARIOUS EXPERIMENTS.

                                                            Sentiment                   Sarcasm             Average
Variant                                               Precision Recall F-Score  Precision Recall F-Score  F-Score
State of the art [14]                                   79.89   74.86   77.30     87.42   87.03   86.97    82.13
Standalone classifiers                                  79.02   78.03   78.13     89.96   89.25   89.37    83.75
Standalone coerced                                      81.57   80.06   80.38       --      --      --       --
Multi-task simple                                       80.41   79.88   79.70     89.42   89.19   89.04    84.37
Multi-task with fusion                                  82.32   81.71   81.53     90.94   90.74   90.67    86.10
Multi-task with fusion and separate GRUs                80.54   80.02   79.86     91.01   90.66   90.62    85.24
Multi-task with fusion and shared attention (Sec. III)  83.67   83.10   83.03     90.50   90.34   90.29    86.66

present in the dataset, which is never available in any real-life setting. In contrast, our method, besides improving the results, applies to plain-text documents such as tweets, without any gaze data.

As expected, the sentiment classifier coerced by the sarcasm classifier performed better than the standalone sentiment classifier. This means that an efficient sarcasm detector can boost the performance of a sentiment classifier.

All our multi-task classifiers outperformed both standalone classifiers. However, the margin of improvement of the multi-task classifiers over the standalone classifiers is greater for sentiment than for sarcasm, probably because sarcasm detection is a subtask of sentiment analysis.

Analyzing examples and attention visualizations of the multi-task network, we observed that the multi-task network mainly helps improve sarcasm classification when there is a strong sentiment shift, which indicates the possibility of sarcasm in the sentence. The example given in the introduction was classified incorrectly by the standalone sarcasm classifier but correctly by the standalone sentiment classifier; coercing one of the classifiers by the other would not change the result.
In the multi-task network, both sentiment and sarcasm are detected correctly, apparently because the network detected the sentiment shift in the sentence, which improved sarcasm classification. Similarly, the sentence "Absolutely love when water is spilt on my phone, just love it" is classified as positive by the standalone sentiment classifier, due to "Absolutely love" being highlighted by the attention scores (not presented in this short paper). However, the standalone sarcasm classifier identified it as sarcastic due to "water is spilt on my phone" (as seen from the attention scores), and in the multi-task network this clue corrected the sentiment classifier's output.

Even our standalone GRU-based classifiers outperformed the CNN-based state-of-the-art method. The multi-task classifiers outperformed the standalone classifiers because of the shared representation, which serves as additional regularization of each task by the other task. Adding NTN fusion to the multi-task classifier further improved the results, giving the best performance for sarcasm detection. Adding an attention network shared between the tasks further improves the performance for sentiment classification. As the last column of Table I shows, on average the best results across the two tasks were obtained with the architecture described in Section III.

VI. CONCLUSIONS

We presented a classifier architecture that can be trained on sentiment or sarcasm data and outperforms the state of the art in both cases on the dataset used by [14]. Our architecture uses a GRU-based neural network, while the state-of-the-art method of [14] used a CNN. Furthermore, we showed that multi-task learning-based methods significantly outperform the standalone sentiment and sarcasm classifiers. This indicates that sentiment classification and sarcasm detection are related tasks. Finally, we presented the multi-task learning architecture that gave the best results out of the variants we tried.
In the future, we intend to incorporate multimodal information [21] in our network for improved performance.

REFERENCES

[1] I. Augenstein and A. Søgaard. Multi-task learning of keyphrase boundary classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 341–346, Vancouver, Canada, July 2017. Association for Computational Linguistics.
[2] F. Barbieri, H. Saggion, and F. Ronzano. Modelling sarcasm in Twitter, a novel approach. In WASSA@ACL, pages 50–58, 2014.
[3] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.
[4] D. Davidov, O. Tsur, and A. Rappoport. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 107–116. Association for Computational Linguistics, 2010.
[5] L. Dong, F. Wei, C. Tan, D. Tang, M. Zhou, and K. Xu. Adaptive recursive neural network for target-dependent Twitter sentiment classification. In ACL (2), pages 49–54, 2014.
[6] A. Esuli and F. Sebastiani. SentiWordNet: A high-coverage lexical resource for opinion mining. Evaluation, pages 1–26, 2007.
[7] A. Joshi, V. Sharma, and P. Bhattacharyya. Harnessing context incongruity for sarcasm detection. In ACL (2), pages 757–762, 2015.
[8] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[9] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[10] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387, 2016.
[11] M. Lan, J. Wang, Y. Wu, Z.-Y. Niu, and H. Wang. Multi-task attention-based neural networks for implicit discourse relationship representation and identification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1299–1308, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.
[12] M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421, Lisbon, Portugal, September 2015. Association for Computational Linguistics.
[13] N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, and E. Cambria. DialogueRNN: An attentive RNN for emotion detection in conversations. arXiv preprint arXiv:1811.00405, 2018.
[14] A. Mishra, K. Dey, and P. Bhattacharyya. Learning cognitive features from gaze data for sentiment and sarcasm classification using convolutional neural network.
In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 377–387, Vancouver, Canada, July 2017. Association for Computational Linguistics.
[15] A. Mishra, D. Kanojia, and P. Bhattacharyya. Predicting readers' sarcasm understandability by modeling gaze behavior, 2016.
[16] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[17] S. Poria, E. Cambria, and A. Gelbukh. Aspect extraction for opinion mining with a deep convolutional neural network. Knowledge-Based Systems, 108:42–49, 2016.
[18] S. Poria, E. Cambria, A. Gelbukh, F. Bisio, and A. Hussain. Sentiment data flow analysis by means of dynamic linguistic patterns. IEEE Computational Intelligence Magazine, 10(4):26–36, 2015.
[19] S. Poria, E. Cambria, D. Hazarika, and P. Vij. A deeper look into sarcastic tweets using deep convolutional neural networks. In COLING, pages 1601–1612, 2016.
[20] S. Poria, A. Gelbukh, A. Hussain, S. Bandyopadhyay, and N. Howard. Music genre classification: A semi-supervised approach. In Mexican Conference on Pattern Recognition, pages 254–263. Springer, 2013.
[21] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508, 2018.
[22] E. Riloff, A. Qadir, P. Surve, L. D. Silva, N. Gilbert, and R. Huang. Sarcasm as contrast between a positive sentiment and negative situation. In EMNLP, 2013.
[23] R. Socher, D. Chen, C. D. Manning, and A. Ng. Reasoning with neural tensor networks for knowledge base completion. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 926–934. Curran Associates, Inc., 2013.
[24] R. Socher, C. C. Lin, C. Manning, and A. Y. Ng.
Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129–136, 2011.
[25] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[26] A. Zadeh, P. P. Liang, S. Poria, P. Vij, E. Cambria, and L.-P. Morency. Multi-attention recurrent network for human communication comprehension. arXiv preprint arXiv:1802.00923, 2018.