Linguistic Ethnography: Identifying Dominant Word Classes in Text Rada Mihalcea University of Michigan Stephen Pulman Oxford University
Linguistic Ethnography? Finding and understanding patterns in given types of text Find the characteristics of a text Reflective of behavior or style Examples Female vs. male authored texts (gender) Texts describing happy vs. sad moods (mood) Humorous vs. non-humorous text (comic) Introvert vs. extrovert authors (psychology)
Linguistic Ethnography vs. Text Classification Text classification: Automatic separation of classes of text Supervised or semi-supervised algorithms (Naïve Bayes, SVM, perceptron, etc.) Feature weighting and selection Linguistic ethnography Identification of classes of words over salient features Understand the characteristics of the texts Insights into the properties and behaviors modeled by those texts
An Example: Finding Happiness Well kids, I had an awesome birthday thanks to you. =D Just wanted to so thank you for coming and thanks for the gifts and junk. =) I have many pictures and I will post them later. hearts Home alone for too many hours, all week long... screaming child, headache, tears that just won't let themselves loose... and now I've lost my wedding band. I hate this. mood: mood:
Corpus-derived Happiness Factors
  yay       86.67
  shopping  79.56
  awesome   79.71
  birthday  78.37
  lovely    77.39
  concert   74.85
  cool      73.72
  cute      73.20
  lunch     73.02
  books     73.02
  goodbye   18.81
  hurt      17.39
  tears     14.35
  cried     11.39
  upset     11.12
  sad       11.11
  cry       10.56
  died      10.07
  lonely     9.50
  crying     5.50
Identifying Word Classes in Text Foreground corpus: corpus of texts of interest Background corpus: neutral texts Collection of texts that do not have the property shared by the foreground corpus Balanced corpus Mix of texts Goal: identify word classes that are dominant in the foreground corpus
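Word Class Dominance
  Given a word class $C = \{W_1, W_2, \ldots, W_n\}$:
  $$\mathrm{Coverage}_F(C) = \frac{\sum_{W_i \in C} \mathrm{Frequency}_F(W_i)}{\mathrm{Size}(F)}$$
  $$\mathrm{Coverage}_B(C) = \frac{\sum_{W_i \in C} \mathrm{Frequency}_B(W_i)}{\mathrm{Size}(B)}$$
  $$\mathrm{Dominance}_F(C) = \frac{\mathrm{Coverage}_F(C)}{\mathrm{Coverage}_B(C)}$$
  Score significantly higher than 1: word classes that are dominant in the foreground corpus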
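The dominance score can be sketched in a few lines of Python. This is a minimal illustration, assuming each corpus is a flat list of lowercased tokens and a word class is a plain set of words. The function names `coverage` and `dominance`, the handling of a class that never occurs in the background, and the toy corpora are all assumptions made for the example, not part of the original work.

```python
from collections import Counter


def coverage(word_class, corpus_tokens):
    """Fraction of corpus tokens that belong to the word class."""
    if not corpus_tokens:
        raise ValueError("corpus must not be empty")
    counts = Counter(corpus_tokens)
    return sum(counts[w] for w in word_class) / len(corpus_tokens)


def dominance(word_class, foreground_tokens, background_tokens):
    """Ratio of foreground coverage to background coverage."""
    cov_b = coverage(word_class, background_tokens)
    if cov_b == 0:
        # Assumption: a class unseen in the background gets an infinite score.
        return float("inf")
    return coverage(word_class, foreground_tokens) / cov_b


# Toy corpora, invented for illustration only.
foreground = "you can always find what you are not looking for".split()
background = "the market closed higher today and you may find gains".split()
you_class = {"you", "thou", "thy", "thee"}

score = dominance(you_class, foreground, background)
print(f"dominance = {score:.2f}")
```

In practice the background coverage of a small class can be very low, so some smoothing or a minimum-frequency threshold may be preferable to returning infinity; that choice is left open here.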
Lexical Resources for Word Classes Roget Thesaurus of the English language 100,000 words grouped based on synonymy and other semantic relations Linguistic Inquiry and Word Count (LIWC) Lexicon developed for psycholinguistic analysis (Pennebaker et al.) 2,200 words grouped into 70 classes WordNet-Affect Resource built on top of WordNet Annotations with the emotions in the classification of Ortony Focus on: anger, disgust, fear, joy, sadness, surprise
Word Class Examples
  Roget:
    PERFECTION: perfection, purity, integrity, impeccability, ...
    MEDIOCRITY: mediocrity, dullness, indifference, inferiority, ...
  LIWC:
    OPTIMISM: accept, best, confidence, glorious, hope, ...
    SOCIAL: adult, advice, affair, boy, buddies, comrade, ...
  WordNet-Affect:
    ANGER: offense, temper, irritation, fury, rage, ...
    JOY: worship, adoration, sympathy, tenderness, respect, love, ...
A Case Study: Verbal Humour Gain insights into the language of humour Find classes of words that are dominant in humorous text Foreground corpus: humorous text Two types of verbal humour: One-liners Humorous news articles Background corpus: non-humorous text A mix of data from non-humorous sources: Reuters newspapers, British National Corpus, proverbs, Open Mind Common Sense
Humorous Data: One-liners
  He who smiles in a crisis has found someone to blame
  Short sentence, simple syntax
  Deliberate use of rhetorical devices (alliteration, rhyme)
  Frequent use of creative language
  Comic effect
  Web-based bootstrapping
    Start with a few manually selected seeds
    Identify a list of Web pages including at least one seed
    Parse Web pages and find new one-liners
    Repeat
  16,000 one-liners
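The bootstrapping loop can be illustrated on a toy, in-memory stand-in for the Web. This is only a sketch under stated assumptions: `pages` is a hypothetical dictionary from page identifiers to lists of text lines (replacing real search and HTML parsing), the length filter `max_words` is an invented heuristic for what counts as a one-liner, and `bootstrap_one_liners` is an illustrative name rather than the authors' implementation.

```python
def bootstrap_one_liners(seeds, pages, max_rounds=10, max_words=15):
    """Grow a set of one-liners from seeds over an in-memory page collection.

    `pages` maps a page identifier to the list of text lines on that page;
    it stands in for real Web retrieval and parsing.
    """
    found = set(seeds)
    visited = set()
    for _ in range(max_rounds):
        # Pages not yet harvested that contain at least one known one-liner.
        hits = [
            url for url, lines in pages.items()
            if url not in visited and found.intersection(lines)
        ]
        if not hits:
            break
        for url in hits:
            visited.add(url)
            for line in pages[url]:
                # Assumed filter: keep short, non-empty lines only.
                if line and len(line.split()) <= max_words:
                    found.add(line)
    return found


# Hypothetical pages, invented for illustration.
pages = {
    "page-a": [
        "He who smiles in a crisis has found someone to blame.",
        "Only adults have trouble with child-proof bottles.",
    ],
    "page-b": [
        "Only adults have trouble with child-proof bottles.",
        "When everything comes your way, you are in the wrong lane.",
    ],
    "page-c": ["Stock markets closed higher on Monday."],
}
seeds = ["He who smiles in a crisis has found someone to blame."]

result = bootstrap_one_liners(seeds, pages)
for line in sorted(result):
    print(line)
```

Here the seed leads to page-a, whose second line leads to page-b in the next round, while page-c is never reached because it shares no line with the growing set. A real pipeline would also need to filter noisy candidates, which this sketch does not attempt.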
Humorous Data: News stories
  The Onion: "the best source of humour out there" (Jeff Greenfield, CNN)
  Canadian Prime Minister Jean Chrétien and Indian President Abdul Kalam held a subdued press conference in the Canadian Capitol building Monday to announce that the two nations have peacefully and sheepishly resolved a dispute over their common border. "We are - well, I guess proud isn't the word - relieved, I suppose, to restore friendly relations with India after the regrettable dispute over the exact coordinates of our shared border," said Chrétien, who refused to meet reporters' eyes as he nervously crumpled his prepared statement. "The border that, er... Well, I guess it turns out that we don't share a border after all." Chrétien then officially withdrew his country's demand that India hand over a 20-mile-wide stretch of land that was to have served as a demilitarized buffer zone between the two nations.
  1,125 news articles from August 2005 to March 2006
  1,000-10,000 characters
Dominant Roget Word Classes in Humorous Text
  anonymity       3.48 : you, person, cover, anonymous, unknown, unidentified, unspecified
  odor            3.36 : nose, smell, strong, breath, inhale, stink, pong, perfume, flavor
  secrecy         2.96 : close, wall, secret, meeting, apart, ourselves, security, censorship
  wrong           2.83 : wrong, illegal, evil, terrible, shame, beam, incorrect, pity, horror
  unorthodoxy     2.52 : error, non, err, wander, pagan, fallacy, atheism, erroneous, fallacious
  overestimation  2.45 : think, exaggerate, overestimated, overestimate, exaggerated
  disarrangement  2.18 : trouble, throw, ball, bug, insanity, confused, upset, mess, confuse
Dominant LIWC Word Classes in Humorous Text
  you     3.17 : you, thou, thy, thee, thin
  I       2.84 : myself, mine
  swear   2.81 : hell, ass, butt, suck, dick, arse, bastard, sucked, sucks, boobs
  self    2.23 : our, myself, mine, lets, ourselves, ours
  sexual  2.07 : love, loves, loved, naked, butt, gay, dick, boobs, cock, horny, fairy
  groom   2.06 : soap, shower, perfume, makeup
  cause   1.99 : why, how, because, found, since, product, depends, thus, cos
  humans  1.79 : man, men, person, children, human, child, kids, baby, girl, boy
Dominant WordNet-Affect Word Classes in Humorous Text
  surprise  3.31 : stupid, wonder, wonderful, beat, surprised, surprise, amazing, terrific
Evaluation
  How good are these classes?
  Derive word classes from different data sets and measure correlation
    Split the one-liners in two: 8,000 one-liners vs. 8,000 one-liners
    Split the news stories in two: 550 stories vs. 550 stories
    16,000 one-liners vs. 1,100 news stories

  Data sets                        Roget   LIWC
  one-liners vs. one-liners        0.95    0.96
  news stories vs. news stories    0.84    0.88
  one-liners vs. news stories      0.63    0.42
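One way to measure agreement between two sets of class scores is sketched below. The slide does not say which correlation coefficient was used, so Pearson correlation over the classes present in both score maps is an assumption here, and the score values in the example are invented placeholders rather than reported results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def class_score_correlation(scores_a, scores_b):
    """Correlate dominance scores over the classes both data sets share."""
    shared = sorted(set(scores_a) & set(scores_b))
    return pearson([scores_a[c] for c in shared],
                   [scores_b[c] for c in shared])


# Hypothetical dominance scores from two halves of a corpus.
half_1 = {"you": 3.1, "swear": 2.8, "cause": 2.0, "humans": 1.8}
half_2 = {"you": 3.3, "swear": 2.7, "cause": 1.9, "humans": 1.7}

r = class_score_correlation(half_1, half_2)
print(f"correlation = {r:.2f}")
```

A rank-based coefficient such as Spearman's would be a reasonable alternative if only the ordering of classes matters.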
Characteristics of Verbal Humour Observed by analyzing the word classes Human-centeredness YOU, I, SELF, HUMANS "you" occurs in more than 25% of the one-liners You can always find what you are not looking for. professional communities It was so cold last winter, that I saw a lawyer with his hands in his own pockets.
Characteristics of Verbal Humour Negative polarity WRONG, UNORTHODOXY, DISARRANGEMENT Only adults have trouble with child-proof bottles. When everything comes your way, you are in the wrong lane.
Dominant Classes in Humour Human-centeredness: human-related semantic classes found dominant in humorous text as compared to non-humorous text Negative polarity: semantic classes with negative orientation Humour as natural therapy where tensions related to negative scenarios concerning us humans are relieved through laughter Correlation with empirical observations from previous work Human-centeredness, negative polarity, sexual vocabulary, swear words, surprise
Conclusions Find the dominant word classes in types of text Reflective of behavior or style Systematic and portable Case study on humour: Good correlation among classes derived from different corpora Correlation with empirical observations from previous work
? A conclusion is simply the place where you got tired of thinking.