Temporal patterns of happiness and sarcasm detection in social media (Twitter)
Pradeep Kumar
NPSO Innovation Day, November 22, 2017
Our Data Science Team
- Patricia Prüfer
- Pradeep Kumar
- Marcia den Uijl
- Peter Fontein
- Hendri Adriaens
- Next member?
Content:
1. Average Happiness Measurement
   1.1 Introduction
   1.2 Data Collection from Twitter
   1.3 Data Cleaning
   1.4 Method
   1.5 Result
   1.6 Interpretation
2. Sarcasm Detection in Tweets
   2.1 Introduction
   2.2 Training data collection
   2.3 Method
   2.4 Training Result
1.1 Introduction
Why Twitter data?
- Popular microblogging site
- 500 million tweets a day, about 200 billion a year
- 240+ million active users
- The audience ranges from ordinary people to celebrities
- Users often discuss current affairs and personal views on various subjects
Challenges
- Tweets are highly unstructured and often ungrammatical
- Non-standard vocabulary and abbreviations
- Lexical variations
- Cultural context of phrases, terms and symbols
- Hidden sarcasm
Population of Twitter Users in the Netherlands
- 2.6 million Dutch users, of which 0.9 million use Twitter daily
- By 2016: stable overall, with decreasing use among young people and increasing use among the elderly
[Chart: usage by age category (15-19 yrs, 20-39 yrs, 40-64 yrs, 65-79 yrs, 80+ yrs); values shown: 10%, 19%, 8%, 23%, 25%]
Social media analytics and process
Capture
- Gather data from various sources
- Preprocess the data
- Extract pertinent information from the data
Understand
- Remove noisy data
- Perform advanced analytics: opinion mining, topic modelling, trend analysis, sentiment analysis
- Temporal (time series) happiness/sentiment analysis
- Sarcasm analysis
Present
- Summarize and evaluate the findings from the Understand stage
Twitter structure
Data hidden in plain sight: Time, Social network, Author, Tweet, Description, Location, Popularity, Sentiment, Topic
Approach: An Overview
1. Tweet download using the Twitter API
2. Preprocessing and cleaning
3. Sanitization and emoticon replacement
4. Tokenizer
5. Happiness calculation, term frequency and topic modelling
1.2 Data Collection from Twitter
Tweet streaming criteria:
- Only a 1% sample of all tweets can be streamed
- The words in the top 10 are either articles (de, het, een), prepositions (in, van), conjunctions (en, dat), a personal pronoun (ik), a negation (niet) or a conjugation of "to be" (is) [1][2]
- The words in the top 10 are the same for men and women
[1] Eric Sanders, Collecting and Analysing Chats and Tweets in SoNaR, CLST, Radboud University Nijmegen
[2] https://dev.twitter.com/streaming/overview
1.3 Tweet Cleaning
Step 1: Removing the HTTP links (URLs)
Step 2: Removing the # tags
Step 3: Replacement of emoticons (faces, objects, nature, flags)
Emoji cheat sheet categories [3]: Smileys and People; Animals and Nature; Activity; Travel and Places; Objects; Symbols; Flags
[3] https://www.webpagefx.com/tools/emoji-cheat-sheet/
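Steps 1 to 3 could be sketched roughly as follows. This is a minimal illustration, not the code used for the project: the function name `clean_tweet`, the regular expressions and the tiny emoticon map are all assumptions, and the sketch assumes that "removing the # tags" means dropping the whole hashtag token.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # Step 1: HTTP links
HASHTAG_RE = re.compile(r"#\w+")       # Step 2: whole hashtag tokens (assumption)

# Hypothetical emoticon-to-word map; the slides do not list the real one.
EMOTICON_WORDS = {":)": "happy", ":-)": "happy", ":(": "sad", "😀": "happy", "😢": "sad"}


def clean_tweet(text):
    """Remove URLs and hashtags, then replace known emoticons with words."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    for emoticon, word in EMOTICON_WORDS.items():   # Step 3
        text = text.replace(emoticon, f" {word} ")
    return " ".join(text.split())                   # normalise whitespace


print(clean_tweet("Mooi weer :) #zomer https://t.co/abc"))
```

A real replacement table would cover all the emoji categories listed above; the five entries here only show the mechanism.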
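Step 4: Sanitization (treating abbreviations and repeated sounds)
Some Dutch abbreviations:
- Hgh: hoe gaat het (how are you)
- Gmj: goed met jou (fine, and you?)
- Idk: I don't know
- Gwn: gewoon (just, simply)
- Vgm: volgens mij (in my opinion)
- K: ik (I)
- Vaka: vakantie (holiday)
- Das: dat is (that is)
- Sws: sowieso (in any case)
- Wnr: wanneer (when)
- T: het (it, the)
Step 5: Monogram (unigram) model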
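Steps 4 and 5 might look like the sketch below. The abbreviation table is taken from the slide; everything else is an assumption made for illustration: the function name `sanitize`, the rule that runs of three or more identical characters collapse to one, and the naive `\w+` tokenizer.

```python
import re

# Abbreviations from the slide; a production dictionary would be much larger.
ABBREVIATIONS = {
    "hgh": "hoe gaat het", "gmj": "goed met jou", "idk": "i don't know",
    "gwn": "gewoon", "vgm": "volgens mij", "k": "ik", "vaka": "vakantie",
    "das": "dat is", "sws": "sowieso", "wnr": "wanneer", "t": "het",
}

# Assumed heuristic: 3+ repeats of one character collapse to a single one.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def sanitize(text):
    """Collapse repeated sounds, expand abbreviations, return unigrams."""
    text = REPEAT_RE.sub(r"\1", text.lower())
    unigrams = []
    for token in re.findall(r"\w+", text):          # naive tokenizer
        unigrams.extend(ABBREVIATIONS.get(token, token).split())
    return unigrams


print(sanitize("Hgh? Gwn leuuuk, sws!"))
```

Collapsing to a single character is lossy for words with a legitimate double letter that has been stretched (for example a lengthened "heel"), so a real pipeline would probably check candidates against a word list.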
1.4 Method (Word collection and ranking)
Sourcing of words:
- Twitter
- Google Books
- New York Times
- Music Library
Ranking methodology:
1. Top 5000 words (most frequent) from each corpus merged, resulting in 10,222 words [4]
2. 50 evaluations per word
3. Words ranked on a scale of 1-9
Top words:
  Rank  Word       Average happiness
  1     laughter   8.50
  2     happiness  8.44
  3     love       8.42
  4     joy        8.16
Bottom words:
  Rank  Word       Average happiness
  1     killer     1.56
  2     cancer     1.54
  3     death      1.44
  4     terrorist  1.30
[4] Data set: data collected from LabMT. The 10,222 unique words were labeled using Amazon's Mechanical Turk.
1.4 Method (Temporal Happiness Calculation)
Mathematical formula:

$$\text{Average Happiness} = \frac{\sum_{i=1}^{n} h_{avg}(W_i)\, f_i}{\sum_{i=1}^{n} f_i}$$

where
- $f_i$ = frequency of the $i$th word
- $h_{avg}(W_i)$ = estimate of the average happiness of the $i$th word
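The formula is a frequency-weighted mean of word scores, which can be sketched as below. The eight scores come from the top and bottom word lists on the previous slide; the function name `average_happiness` and the choice to skip words that are missing from the lexicon are assumptions for this illustration.

```python
from collections import Counter

# Excerpt of the happiness lexicon (values from the word-ranking slide).
HAPPINESS = {
    "laughter": 8.50, "happiness": 8.44, "love": 8.42, "joy": 8.16,
    "killer": 1.56, "cancer": 1.54, "death": 1.44, "terrorist": 1.30,
}


def average_happiness(tokens, lexicon=HAPPINESS):
    """Return sum(h_avg(W_i) * f_i) / sum(f_i) over words found in the lexicon."""
    counts = Counter(t for t in tokens if t in lexicon)   # f_i per scored word
    total = sum(counts.values())
    if total == 0:
        return None            # no scored words, so no score (assumption)
    return sum(lexicon[w] * f for w, f in counts.items()) / total


print(average_happiness(["love", "love", "death", "the"]))
```

For the example call, "the" is ignored and the result is (2 × 8.42 + 1 × 1.44) / 3. Computing this per hour of collected tweets would give a time series like the one shown in the Result slide.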
1.5 Result
Average Happiness Index
An interactive dynamic graph is available at https://www.centerdata.nl/nl/projecten-van-centerdata/tijdelijke-blijheidsscore-van-nederlandse-tweets
1.6 Interpretation
Two instances:

Score: 3.92 (5 AM on 19th August 2017)
Most used terms: hard, idiot, good, get, knows, like, police, Mexican, struggle, fuck, loves, blonde, fantastic, drug, government, dismissed, care
Top hashtags: [('#nieuws', 248), ('#nieuwstwitter', 207), ('#vacature', 204), ('#NL', 191), ('#actueel', 120), ('#NieuwsTwitter', 120), ('#Krant', 102), ('#feywil', 73), ('#lab', 64), ('#kkl', 61), ('#brugopen', 60), ('#Nieuws', 55), ('#Nederland', 54), ('#voetbal', 53), ('#Politie', 46), ('#Amsterdam', 43), ('#LaraconEU', 37), ('#HLN', 36), ('#E313', 35), ('#tdd', 33)]

Score: 4.65 (4 PM on 20th August 2017, Sunday)
Most used terms: good, request, bright, care, okay, victory, lucky, sunday, passion, well, cookies, happy, dismissed, theaters, like, mature, weekend, hard, thought, strange, main, car, personal, social, stole, love, helps, walking, negative, spa, laugh, ride, start, sea, sonic, needed, sitting
Top hashtags: [('#ajagro', 787), ('#Ajax', 328), ('#nieuwstwitter', 275), ('#nieuws', 254), ('#NieuwsTwitter', 169), ('#actueel', 168), ('#ANDSTV', 163), ('#brugopen', 143), ('#AJAgro', 116), ('#utrwil', 116), ('#excfey', 106), ('#CNBLUE', 105), ('#andstv', 89), ('#PushAwardsKathNiels', 83), ('#FCGroningen', 82), ('#voetbal', 79), ('#PSV', 78), ('#NACpsv', 77), ('#NACpraat', 72)]
Topic Modelling on Twitter Data

Score: 4.65 (4 PM on 20th August 2017, Sunday)
Topic 1: [goed, nook, ooik, we, juist, steed, zee, my, kom, wel, meisje, vrouwen, nederland]
Topic 2: [iik, wel, ooik, hebt, waar, echt, heel, denik, weer, erg, gaat, zit, mensen, zin, morgen]
Topic 3: [iik, leuik, vind, video, frans, gelezen, waal, waarom, geld, nl, stuik, wedstrijd, blijft, vragen]
Topic 4: [minder, bedankt, middle, smokkelmaffia, ngo s, knechten, afname, verdrinkingen, verdraait, club, bal, tijd, leuke, blonde]
Topic 5: [weer, nee, niet, ooik, gewoon, volgens, smile, we, gaat, uur, keer, gaan, zomer, wel, ij]

Score: 3.92 (5 AM on 19th August 2017)
Topic 1: [iik, zegtweer, waar, he, ooik, gaan, allemaal, wel, mooi, nieuwe, bal, barcelona, we, vakantie]
Topic 2: [man, weg, grote, tijdens, krijgt, rood, geik, twee, meisje, no, jullie, mensen, omg, speelt]
Topic 3: [iik, video, vind, leuik, via, wonder, live, wisdom, toegevoegd, ooik, we, afspeellijst, goal, onze, amp]
Topic 4: [nou, minder, juist, gaat, nederland, zee, middel, worden, tijd, ooik, smokkelmaffia, knechten, afname, ngo s]
Topic 5: [my, beste, gemaakt, school, gelijik, you, toe, geniet, pa, gewoon, extra]
2. Sarcasm Detection in Tweets
   2.1 Introduction
   2.2 Training data collection
   2.3 Methods
   2.4 Training Result
2.1 Introduction
Underlying hypothesis: contrast in sarcastic tweets
Sarcasm detection relies on the assumption that, in a sarcastic document, a negative situation often appears after a positive sentiment [5].
Pattern: [positive verb phrase] + [negative verb phrase]
Examples:
1. "Honesty is the best policy - when there is money in it." - Mark Twain
2. (a parent to a child with a bad report card) "Je bent weer eens de beste leerling van de klas!" (Once again you are the best student in the class!)
The training dataset consists of:
- 20,000 clean sarcastic tweets
- 100,000 clean non-sarcastic tweets
[5] Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert and Ruihong Huang, Sarcasm as Contrast between a Positive Sentiment and Negative Situation
2.2 Feature Engineering
- Sentiment analysis
- Topic modeling
- Part-of-speech tagging
- n-gram model, e.g.
  1. unigrams are single words (examples: really, great, super, awesome)
  2. bigrams are two-word sequences (examples: really great, super awesome)
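The n-gram features can be illustrated with a few lines of code. This is a generic sketch; the helper name `ngrams` is an assumption and not taken from the slides.

```python
def ngrams(tokens, n):
    """Return all contiguous n-word sequences from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["really", "great", "super", "awesome"]
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
```

In a classifier, the counts of these n-grams per tweet would typically become one column each in the feature matrix.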
2.3 Method
Step 1: Split each tweet into one, two and three parts
Step 2: Sentiment analysis on all parts

  Splitting     Feature                 Value
  No split      Blob sentiment           0.213793322025
  No split      Blob subjectivity        0.0501915056376
  Two splits    Blob sentiment 1/2      -0.0218950362976
  Two splits    Blob sentiment 2/2      -0.0323621717231
  Two splits    Blob subjectivity 1/2   -0.100033759951
  Two splits    Blob subjectivity 2/2   -0.0904785266556
  Three splits  Blob sentiment 1/3       0.0209378078692
  Three splits  Blob sentiment 2/3      -0.0577754412137
  Three splits  Blob sentiment 3/3       0.0344419200665
  Three splits  Blob subjectivity 1/3    0.110001944706
  Three splits  Blob subjectivity 2/3    0.0676131604139
  Three splits  Blob subjectivity 3/3    0.0520556715094

Step 3: Topic modelling: decompose each tweet as a sum of topics, to be used as features
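Steps 1 and 2 can be sketched as follows. The "Blob" labels suggest that TextBlob was used for scoring, but that is an assumption; to keep the sketch self-contained it uses a toy polarity lexicon as a stand-in. The names `split_parts`, `toy_polarity` and `contrast_features`, and the rule of splitting by token count into near-equal parts, are all hypothetical.

```python
def split_parts(tokens, k):
    """Split a token list into k contiguous parts of near-equal length."""
    size, extra = divmod(len(tokens), k)
    parts, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        parts.append(tokens[start:end])
        start = end
    return parts


# Toy stand-in lexicon; not a real sentiment resource.
TOY_POLARITY = {"love": 1.0, "great": 0.8, "stuck": -0.6, "traffic": -0.4}


def toy_polarity(tokens):
    """Mean polarity of the known words in a part, 0.0 if none are known."""
    scores = [TOY_POLARITY[t] for t in tokens if t in TOY_POLARITY]
    return sum(scores) / len(scores) if scores else 0.0


def contrast_features(text, scorer=toy_polarity):
    """Score the whole tweet, its halves and its thirds."""
    tokens = text.lower().split()
    features = {}
    for k in (1, 2, 3):
        for i, part in enumerate(split_parts(tokens, k), start=1):
            features[f"sentiment {i}/{k}"] = scorer(part)
    return features


for name, value in contrast_features("I love being stuck in traffic").items():
    print(name, round(value, 2))
```

With TextBlob installed, the stand-in could be replaced by passing `scorer=lambda part: TextBlob(" ".join(part)).sentiment.polarity`. The point of the split is that a sarcastic tweet may look neutral as a whole while its first part is positive and its last part is negative, which is the contrast described in the hypothesis.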
2.4 Training Results
Training results, support vector machine (linear kernel)

               Precision  Recall  F1-score
  Sarcasm      0.91       0.93    0.92
  Non-sarcasm  0.63       0.58    0.61
  Avg          0.86       0.87    0.87
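For reference, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values in the table; the function name `f1_score` is chosen for this illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.91, 0.93), 2))  # sarcasm row
print(round(f1_score(0.63, 0.58), 2))  # non-sarcasm row
```

Recomputing from the rounded table values gives 0.92 for the sarcasm row and 0.60 for the non-sarcasm row. The reported 0.61 is consistent with the precision and recall having been rounded to two decimals before display.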
Conclusions and future work
- Enrich the library for acronym expansion and emoticon replacement
- Apply deep learning methods to sarcasm analysis
- Collect labelled Dutch tweets for training the sarcasm detection model
- Explore additional features to tune the algorithm for sarcasm detection in Dutch tweets
- Consider retweets as a factor
Thanks for your attention!!