Basic Natural Language Processing


Why NLP?
- Understanding intent: search engines
- Question answering: Azure QnA, bots, Watson
- Digital assistants: Cortana, Siri, Alexa
- Translation systems: Azure Language Translation, Google Translate
- News digests: Flipboard, Facebook, Twitter
- Other uses: Pollect, crime mapping, earthquake prediction

Understanding human language is hard
NLP requires inputs from:
- Linguistics
- Computer science
- Mathematics
- Statistics
- Machine learning
- Psychology
- Databases
Human -> (U)nderstanding -> Computer -> (G)eneration -> Human

THE KEY: Changing uncertainty to certainty
Vectorizing:
- "I am changing this sentence to numbers" -> 1 2 3 4 5 6 7
- "You are changing too many sentences!" -> 8? 3? 9?
Remember: there is no ambiguity with numbers!
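The word-to-number idea on this slide can be sketched in a few lines. This is only an illustration under assumptions: the function names `build_vocabulary` and `encode` are made up here, words are split on whitespace, and unseen words are mapped to a placeholder ID of 0 (the slide leaves that case open with its question marks).

```python
def build_vocabulary(sentences):
    """Give each distinct lowercase word the next unused integer ID, starting at 1."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary


def encode(sentence, vocabulary, unknown_id=0):
    """Replace each word with its ID; unseen words get unknown_id (an assumption)."""
    return [vocabulary.get(word, unknown_id) for word in sentence.lower().split()]


first = "I am changing this sentence to numbers"
second = "You are changing too many sentences"

vocab = build_vocabulary([first])
print(encode(first, vocab))   # [1, 2, 3, 4, 5, 6, 7]
print(encode(second, vocab))  # [0, 0, 3, 0, 0, 0]
```

Only "changing" is shared between the two sentences, so it is the only word in the second sentence that maps to a known number. "too" and "sentences" look close to "to" and "sentence" but are different strings, which is one reason stemming is applied before vectorizing.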

Challenges in NLP: Syntax vs. Semantics
- Syntax: "Lamb a Mary had little"
- Semantics: "Merry hat hey lid tell lam"
- "Colorless orange liquid"
- "Address, number, resent"

Challenges in NLP: Ambiguity pt 1
- CC attachment: "I like swimming in warm lakes and rivers"
- Ellipsis and parallelism: "I gave Steven a shovel and Joseph a ruler"
- Metonymy: "Sydney is essential to this class"
- Phonetic: "My toes are getting number"
- PP attachment: "You ate spaghetti with meatballs / pleasure / a fork / Jillian"

Challenges in NLP: Ambiguity pt 2
- Referential: "Sharon complimented Lisl. She had been kind all day."
- Reflexive: "Brandon brought himself an apple"
- Sense: "Julia took the math quiz"
- Subjectivity: "Karen believes that the economy will stay strong"
- Syntactic: "Call a dentist for Wayne"

Challenges in NLP: Others
- Parsing n-grams: "United States of America", "hot dog"
- Typos: "John Hopkins" vs. "Johns Hopkins"
- Non-standard language: "(208)929-6136" vs. "208-929-6136"; "cause" = "because"
- Sarcasm: "I love rotting apples"

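Edit Distance: How We Spellcheck
- Each cell can reference the box above, to the left, or diagonally up-left
- Moving from above or from the left always costs +1
- Moving diagonally costs +0 if the letters match and +1 if they don't
- Each cell takes the cheapest of the three options
- The score is the box at the bottom-right

         S  T  R  E  N  G  T  H
      0  1  2  3  4  5  6  7  8
   T  1  1  1  2  3  4  5  6  7
   R  2  2  2  1  2  3  4  5  6
   E  3  3  3  2  1  2  3  4  5
   N  4  4  4  3  2  1  2  3  4
   D  5  5  5  4  3  2  2  3  4

Edit distance between TREND and STRENGTH: 4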
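The table-filling procedure can be sketched as the standard Levenshtein dynamic program. This is a minimal version, assuming unit cost for insertions, deletions, and substitutions; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(source, target):
    """Levenshtein distance between two strings, computed with a full table."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]

    # First column and first row: distance from or to the empty string.
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            # The diagonal move is free only when the two letters match.
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,                 # from above (deletion)
                table[i][j - 1] + 1,                 # from the left (insertion)
                table[i - 1][j - 1] + substitution,  # diagonal (match or substitution)
            )
    return table[-1][-1]


print(edit_distance("TREND", "STRENGTH"))  # 4
```

A spellchecker built on this idea would compare a misspelled word against dictionary entries and suggest the entries with the smallest distance.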

Semantic Relationships
- Measuring how words are related to each other
- "Birdcage" will be more similar to "dog kennel" than it will be to "bird"
- Many different systems draw out semantic relationships, but WordNet is one of the most commonly used
- Similarity metric: Sim(v, w) = -ln(pathlength(v, w))
- Sim(run, miracle) would be -ln(7)
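To make the similarity metric concrete, here is a sketch on a tiny hand-made graph. The graph `TOY_GRAPH` is invented for illustration and is not WordNet data, and the path length is assumed to count the nodes on the shortest path (so a word compared with itself has path length 1 and similarity 0).

```python
import math
from collections import deque

# Hypothetical miniature hierarchy; real WordNet is far larger and differently shaped.
TOY_GRAPH = {
    "bird": ["animal"],
    "dog": ["animal"],
    "animal": ["bird", "dog", "entity"],
    "birdcage": ["enclosure"],
    "dog kennel": ["enclosure"],
    "enclosure": ["birdcage", "dog kennel", "entity"],
    "entity": ["animal", "enclosure"],
}


def path_length(graph, start, goal):
    """Number of nodes on the shortest path from start to goal (breadth-first search)."""
    queue = deque([(start, 1)])
    seen = {start}
    while queue:
        node, length = queue.popleft()
        if node == goal:
            return length
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, length + 1))
    return None  # no path between the two words


def similarity(graph, v, w):
    """Sim(v, w) = -ln(pathlength(v, w)); unconnected words get negative infinity."""
    length = path_length(graph, v, w)
    return float("-inf") if length is None else -math.log(length)


print(similarity(TOY_GRAPH, "birdcage", "dog kennel"))  # -ln(3)
print(similarity(TOY_GRAPH, "birdcage", "bird"))        # -ln(5)
```

In this toy graph "birdcage" sits closer to "dog kennel" than to "bird", so its similarity score is higher (less negative), which matches the claim on the slide.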

Preprocessing: Stopwords and punctuation
Why do we want to get rid of them?
- "And", "if", "but", ".", and "," will almost ALWAYS be your most frequent tokens
- They tell you nothing about what's going on
- Don't get rid of them if you are focused on Natural Language Generation!

Preprocessing: Porter's Algorithm
- Measure: the measure of a word is an indication of how many syllables are in it
  - Consonants = C, vowels = V
  - Every sequence of VC is counted as +1
  - Intellectual = (VC)C(VC)C(VC)CV(VC) = 4
- Stemming: strip a word down to its barest form
  - Ex (transformational rule): Alleviation - ation + ate = Alleviate

Stemming: Sample Rules
If m > 0:
- ies -> i: abilities = abiliti
- ational -> ate: national = national (the remaining stem "n" has m = 0, so the rule does not fire); recreational = recreate
- sses -> ss: sunglasses = sunglass
- biliti -> ble: abiliti = able

Stemming: Example
Original word: Computational
- Computational - ational + ate = Computate
- Computate - ate = Comput
Final word: Comput

Original word: Computer
- Computer - er = Comput
Final word: Comput
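The measure and a handful of the suffix rules can be sketched as follows. This is not the full Porter stemmer: only five rules are included, they are applied once each in a fixed order, and the names `measure`, `RULES`, and `stem_sketch` are made up here. The measure thresholds (m > 0 for "ational", m > 1 for "ate" and "er") follow Porter's published rule list.

```python
import re


def is_consonant(word, i):
    """Porter's definition: not a, e, i, o, u, and not a 'y' preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count VC sequences after collapsing runs of consonants and vowels."""
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    collapsed = re.sub(r"V+", "V", re.sub(r"C+", "C", pattern))
    return collapsed.count("VC")


# (suffix, replacement, minimum measure the remaining stem must exceed; None = no condition)
RULES = [
    ("sses", "ss", None),
    ("ies", "i", None),
    ("ational", "ate", 0),
    ("ate", "", 1),
    ("er", "", 1),
]


def stem_sketch(word):
    """Apply each rule once, in order, whenever its suffix and measure condition hold."""
    word = word.lower()
    for suffix, replacement, min_measure in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if min_measure is None or measure(stem) > min_measure:
                word = stem + replacement
    return word


print(measure("intellectual"))       # 4
print(stem_sketch("computational"))  # comput
print(stem_sketch("computer"))       # comput
print(stem_sketch("national"))       # national
```

Because "computational" and "computer" both reduce to "comput", a later counting step treats them as the same term.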

Sentence Boundary Recognition
- Problems with things like "Dr.", "A.M.", "U.S.A."
- Use a decision tree to estimate the boundary
- Features:
  - Punctuation
  - Formatting
  - Fonts
  - Spaces
  - Capitalization
  - Known abbreviations
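A learned decision tree needs labeled training data, so the sketch below substitutes a few hand-written decisions that use three of the listed features (punctuation, known abbreviations, capitalization). The abbreviation list and the names `is_boundary` and `split_sentences` are assumptions made for illustration.

```python
# Hypothetical abbreviation list; a real system would use a much larger one.
KNOWN_ABBREVIATIONS = {"dr.", "mr.", "mrs.", "a.m.", "p.m.", "u.s.a."}


def is_boundary(token, next_token):
    """Hand-written stand-in for a decision tree over three features."""
    if not token.endswith((".", "!", "?")):   # feature: punctuation
        return False
    if next_token is None:                    # end of text always closes a sentence
        return True
    if token.lower() in KNOWN_ABBREVIATIONS:  # feature: known abbreviations
        return False
    return next_token[0].isupper()            # feature: capitalization


def split_sentences(text):
    """Split whitespace-separated tokens into sentences using is_boundary."""
    tokens = text.split()
    sentences, current = [], []
    for index, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[index + 1] if index + 1 < len(tokens) else None
        if is_boundary(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = "Dr. Manette arrived at 9 A.M. sharp. He left for the U.S.A. yesterday."
for sentence in split_sentences(text):
    print(sentence)
```

These rules miss a sentence that genuinely ends with an abbreviation in the middle of a text, which is the kind of case a trained classifier with more features is meant to handle.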

N-Gram Modeling
- Words that have a separate meaning when combined with other words
- The best way to highlight the importance of context
- Examples:
  - Unigram: "Apple"
  - Bigram: "Hot Dog"
  - Trigram: "George Bush Sr."
- "I'll meet you in Times {?????}"

Preprocessing Checklist
- Remove extraneous text
- Convert sentences to lower case
- Tokenize sentences
- Tokenize words
- Remove stopwords & punctuation
- Stemming / lemmatizing
- Identify n-grams
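Part of the checklist can be sketched with the standard library alone. The version below covers lowercasing, word tokenization, stopword and punctuation removal, and n-gram extraction; it leaves out sentence tokenization and stemming. The stopword list and the names `preprocess` and `ngrams` are assumptions for illustration.

```python
import re

# Hypothetical, deliberately short stopword list.
STOPWORDS = {"and", "if", "but", "the", "a", "is", "in", "of", "to"}


def preprocess(text):
    """Lowercase, keep only runs of letters, digits, and apostrophes, drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def ngrams(tokens, n):
    """Every run of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = preprocess("The hot dog stand in Times Square is great!")
print(tokens)             # ['hot', 'dog', 'stand', 'times', 'square', 'great']
print(ngrams(tokens, 2))  # includes ('hot', 'dog') and ('times', 'square')
```

Removing stopwords before building n-grams, as the checklist orders it, can make words adjacent that were not adjacent in the original sentence, so candidate n-grams are usually filtered by how often they occur.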

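Words to Numbers
- Corpus creation: create a library of all words in the original dataset
- Vectorizing: changing words to numbers, often a raw count
- TF-IDF: term frequency weighted by inverse document frequency
- Example (term frequency): "this" is mentioned 3 times in a given review, and the review has 27 words in it, so TF = 3 / 27 = 1/9. The inverse document frequency then scales this value down for words that appear in many documents.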
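Here is a small sketch of the two factors. It assumes one common variant, IDF = ln(N / number of documents containing the term); libraries often add smoothing, so their numbers will differ. The toy documents and function names are invented for illustration.

```python
import math


def term_frequency(term, document):
    """Share of the document's tokens that are the given term."""
    return document.count(term) / len(document)


def inverse_document_frequency(term, documents):
    """ln(N / df); returns 0.0 when the term appears in no document."""
    containing = sum(1 for document in documents if term in document)
    return math.log(len(documents) / containing) if containing else 0.0


def tf_idf(term, document, documents):
    return term_frequency(term, document) * inverse_document_frequency(term, documents)


# Hypothetical tokenized reviews.
documents = [
    ["this", "coffee", "is", "great"],
    ["this", "bean", "is", "bitter"],
    ["this", "is", "fine"],
]

print(tf_idf("this", documents[0], documents))  # 0.0, because "this" is in every review
print(tf_idf("bean", documents[1], documents))  # 0.25 * ln(3)
```

A word that appears in every document gets a weight of zero no matter how often it occurs, which is how TF-IDF discounts words such as "this".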

Bayes' Theorem
P(A | B) = P(A) P(B | A) / P(B)

Predicting the next { }
Example from Charles Dickens: P("Darnay looked at Dr. Manette")
Use maximum likelihood estimates for the n-gram probabilities:
- Unigram: P(w) = c(w) / V, where V is the total number of words in the corpus
- Bigram: P(w2 | w1) = c(w1, w2) / c(w1), where w1 is the word immediately before w2
Values:
- P("Darnay") = 533 / 598633 = .00089
- P("looked" | "Darnay") = 3 / 676 = .0044
- P("at" | "looked") = 77 / 312 = .247
- P("Dr. Manette" | "at") = 2 / 4512 = .000443
Bigram probability:
- P("Darnay looked at Dr. Manette") = 4.28 * 10^-10
- P("at Dr. Manette Darnay looked") = 0
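The maximum likelihood estimates can be sketched on a toy corpus. The eleven-word corpus below is invented for illustration (it is not the Dickens text), no smoothing is applied, and the function names are assumptions.

```python
from collections import Counter


def train_bigram_model(tokens):
    """Count single words and adjacent word pairs."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return unigram_counts, bigram_counts


def unigram_probability(word, unigram_counts):
    return unigram_counts[word] / sum(unigram_counts.values())


def bigram_probability(previous, word, unigram_counts, bigram_counts):
    """P(word | previous) = c(previous, word) / c(previous)."""
    if unigram_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / unigram_counts[previous]


def sequence_probability(words, unigram_counts, bigram_counts):
    """P(first word) times the product of the bigram probabilities that follow."""
    probability = unigram_probability(words[0], unigram_counts)
    for previous, word in zip(words, words[1:]):
        probability *= bigram_probability(previous, word, unigram_counts, bigram_counts)
    return probability


# Hypothetical toy corpus.
tokens = "darnay looked at the doctor and the doctor looked at darnay".split()
unigrams, bigrams = train_bigram_model(tokens)

print(sequence_probability(["darnay", "looked", "at"], unigrams, bigrams))  # 2/11 * 1/2 * 2/2
print(sequence_probability(["at", "looked", "darnay"], unigrams, bigrams))  # 0.0
```

The scrambled sequence scores zero because the pair "at looked" never occurs in the corpus, the same effect as the zero probability in the slide's second example.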

The Bag of Words Approach
- P(Positive Review | Words Contained)
- Look at the unordered words of a document to determine underlying characteristics
- Coffee reviews with the word "bean" tend to be far more positive
- Common in sentiment and feature analysis
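One way to estimate P(Positive Review | Words Contained) is a naive Bayes classifier over word counts, sketched below. The slide does not name a specific classifier, so this choice is an assumption, as are the add-one smoothing, the four made-up training reviews, and the names `train` and `predict`.

```python
import math
from collections import Counter


def train(labeled_reviews):
    """Collect per-label word counts, per-label review counts, and the vocabulary."""
    word_counts, label_counts, vocabulary = {}, Counter(), set()
    for text, label in labeled_reviews:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary


def predict(text, word_counts, label_counts, vocabulary):
    """Return the label with the highest log prior plus summed log word likelihoods."""
    words = [word for word in text.lower().split() if word in vocabulary]
    total_reviews = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_reviews)
        for word in words:
            # Add-one smoothing keeps unseen words from zeroing out a label.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training reviews, invented for this sketch.
reviews = [
    ("rich bean flavor", "positive"),
    ("fresh bean aroma", "positive"),
    ("stale bitter taste", "negative"),
    ("weak bitter flavor", "negative"),
]
model = train(reviews)

print(predict("bean flavor rich", *model))  # positive
print(predict("rich flavor bean", *model))  # positive: word order does not matter
print(predict("bitter aroma", *model))      # negative
```

The first two calls give the same answer because only the counts of the words are used, never their order, which is what "bag of words" means.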