LING/C SC 581: Advanced Computational Linguistics. Lecture Notes Feb 6th


Administrivia

The homework pipeline:
- Homework 2 graded
- Homework 4 not back yet; soon
- Homework 5 due Weds by midnight
- No classes next week: I'm out of town on business
- No new homework assigned this week

Today's Topics
- Homework 4 review

Homework 4 Review: Question 1

Construct a WSJ text corpus that excludes both words tagged as -NONE- and punctuation words (defined previously). Show your Python console.
- How many words are in the corpus? How many distinct words?
- Plot the cumulative frequency distribution graph.
- How many top words do you need to account for 50% of the corpus?

Homework 4 Review: Question 1

excluded = set(['-NONE-', '-LRB-', '-RRB-', 'SYM', ':', '.', ',', '``', "''"])
tokens = [x[0] for x in ptb.tagged_words(categories=['news']) if x[1] not in excluded]
words = set(tokens)
text = nltk.Text(tokens)
print('Tokens: {}; #Words: {}'.format(len(text), len(words)))
Tokens: 1037490; #Words: 49184
len(words)
49184
print('Lexical diversity: {:.3f}'.format(len(words)/len(text)))
Lexical diversity: 0.047
dist = nltk.FreqDist(text)
print(dist)
<FreqDist with 49184 samples and 1037490 outcomes>

Homework 4 Review: Question 1

ranked = sorted(dist.items(), key=lambda t: t[1], reverse=True)
half = len(text) / 2.0
total = 0
index = 0
while total < half:
    total += ranked[index][1]
    index += 1
print('No of words: {}; total: {}'.format(index, total))
No of words: 217; total: 518763

(Half the corpus: 1037490 / 2 = 518745.)
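Equivalently, NLTK's FreqDist.most_common method returns the same ranking in one call:

ranked = dist.most_common()  # list of (word, freq) pairs, most frequent first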

Homework 4 Review: Question 1

print('{:12s} {:5s}'.format('Word', 'Freq'))
for word, freq in ranked[:index]:
    print('{:12s} {:5d}'.format(word, freq))

Homework 4 Review: Question 1 [slide shows the resulting top-word frequency table and cumulative frequency distribution plot]

Homework 4 Review: Question 2

With case folding:

tokens = [x[0].lower() for x in ptb.tagged_words(categories=['news']) if x[1] not in excluded]

Tokens: 1037490; #Words: 43746
Lexical diversity: 0.042
No of words: 176; total: 518944 (half the corpus: 1037490 / 2 = 518745)

Homework 4 Review: Question 2 [slide shows the corresponding results with case folding]

Colorless green ideas

Examples:
(1) colorless green ideas sleep furiously
(2) furiously sleep ideas green colorless

Chomsky (1957): "... It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not."

The idea: (1) is syntactically valid; (2) is word salad.

One piece of supporting evidence: (1) is pronounced with normal intonation; (2) is pronounced like a list of words.

Background: Language Models and N-grams

Given a word sequence w1 w2 w3 ... wn, how do we compute its probability?

Chain rule:
p(w1 w2) = p(w1) p(w2|w1)
p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1 w2)
...
p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)

For example: p(the cat sat) = p(the) p(cat|the) p(sat|the cat).

Note: it's not easy to collect (meaningful) statistics on p(wn|w1 ... wn-2 wn-1) for all possible word sequences.

Background: Language Models and N-grams

Given a word sequence w1 w2 w3 ... wn.

Bigram approximation: just look at the previous word (not all the preceding words). This is the Markov assumption: a finite-length history, here a 1st-order Markov model.

p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)

Note: p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1 ... wn-2 wn-1).
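As a minimal sketch of how a bigram model can be estimated with NLTK (reusing the tokens list built in the Homework 4 code above; bigram_prob is an illustrative name, not an NLTK function):

import nltk

# Collect bigram counts: cfd[w1][w2] = number of times w1 is followed by w2
cfd = nltk.ConditionalFreqDist(nltk.bigrams(tokens))

def bigram_prob(w1, w2):
    # Maximum-likelihood estimate of p(w2 | w1) = count(w1 w2) / count(w1)
    return cfd[w1].freq(w2)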

Colorless green ideas

Sentences:
(1) colorless green ideas sleep furiously
(2) furiously sleep ideas green colorless

Statistical experiment (Pereira 2002): a bigram language model over pairs wi-1 wi.
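Pereira's actual model was a smoothed, class-based (aggregate) bigram model; the smoothing matters, because a raw maximum-likelihood bigram model assigns both sentences probability 0 (colorless never occurs in the PTB). As a crude stand-in, here is an add-one-smoothed sketch continuing the code above (smoothed_bigram_prob and score are illustrative names):

V = len(words)  # vocabulary size, from the Homework 4 corpus

def smoothed_bigram_prob(w1, w2):
    # Add-one (Laplace) smoothing: unseen bigrams get a small nonzero probability
    return (cfd[w1][w2] + 1) / (cfd[w1].N() + V)

def score(sentence):
    # Product of bigram probabilities, ignoring the initial unigram term
    p = 1.0
    for w1, w2 in nltk.bigrams(sentence.split()):
        p *= smoothed_bigram_prob(w1, w2)
    return p

print(score('colorless green ideas sleep furiously'))
print(score('furiously sleep ideas green colorless'))

Note that all eight bigrams involved are (almost certainly) unseen in the PTB, so add-one smoothing scores the two sentences nearly identically; it is Pereira's class-based smoothing, which shares statistics across distributionally similar words (e.g., green ~ new), that separates them.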

Part-of-Speech (POS) Tag Sequence

Chomsky's example:
  colorless green ideas sleep furiously
  JJ JJ NNS VBP RB (POS tags)

Similar but grammatical example (LSLT, pg. 146):
  revolutionary new ideas appear infrequently
  JJ JJ NNS VBP RB
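As a quick sanity check, NLTK's off-the-shelf tagger can be run on both sentences (a sketch; it assumes the averaged-perceptron tagger data has been downloaded, and its output is not guaranteed to match the hand-assigned tags above):

import nltk

# nltk.download('averaged_perceptron_tagger')  # one-time setup
print(nltk.pos_tag('colorless green ideas sleep furiously'.split()))
print(nltk.pos_tag('revolutionary new ideas appear infrequently'.split()))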

Stanford Parser

The Stanford Parser: a probabilistic PS (phrase structure) parser trained on the Penn Treebank.

Penn Treebank (PTB) Corpus: word frequencies

Word        POS   Frequency
colorless   --    0
green       NNP   33
            JJ    19
            NN    5
ideas       NNS   32
sleep       VB    5
            NN    4
            VBP   2
            NNP   1
furiously   RB    2

Word           POS   Frequency
revolutionary  JJ    6
               NNP   2
               NN    2
new            JJ    1795
               NNP   1459
               NNPS  2
               NN    1
ideas          NNS   32
appear         VB    55
               VBP   41
infrequently   --    0

Stanford Parser

Structure of NPs: colorless green ideas vs. revolutionary new ideas

Phrase            Frequency
[NP JJ JJ NNS]    1073
[NP NNP JJ NNS]   61
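To see where counts like these come from, here is a minimal sketch over NLTK's 10% treebank sample (assuming the 'treebank' data is downloaded; the full-PTB query is analogous, with larger counts):

from collections import Counter
from nltk import Tree
from nltk.corpus import treebank

# Tally the child-category sequence of every NP node in the parsed sample
counts = Counter()
for tree in treebank.parsed_sents():
    for np in tree.subtrees(lambda t: t.label() == 'NP'):
        counts[tuple(c.label() for c in np if isinstance(c, Tree))] += 1

print(counts[('JJ', 'JJ', 'NNS')])  # frequency of [NP JJ JJ NNS] in the sample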

An experiment

Examples:
(1) colorless green ideas sleep furiously
(2) furiously sleep ideas green colorless

Question: Is (1) even the most likely permutation of these particular five words?

Parsing Data

All 5! (= 120) permutations of colorless green ideas sleep furiously.
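Generating the permutations is a one-liner (a minimal sketch):

from itertools import permutations

five_words = 'colorless green ideas sleep furiously'.split()
sentences = [' '.join(p) for p in permutations(five_words)]
print(len(sentences))  # 5! = 120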

Parsing Data

The winning sentence, after training on sections 02-21 (approx. 40,000 sentences), was:
1. furiously ideas sleep colorless green
   - sleep selects for an ADJP object with 2 heads
   - the adverb (RB) furiously modifies the noun

Parsing Data

The next two highest-scoring permutations were:
2. Furiously green ideas sleep colorless. (sleep takes an NP object)
3. Green ideas sleep furiously colorless. (sleep takes an ADJP object)

Parsing Data

Pereira (2002) compared Chomsky's original minimal pair:
23. colorless green ideas sleep furiously
36. furiously sleep ideas green colorless

These ranked #23 and #36, respectively, out of 120.

Parsing Data

But the graph (next slide) shows how arbitrary these rankings are when the parser is trained on randomly chosen sections covering 14K-31K sentences.
- Example: #36 furiously sleep ideas green colorless outranks #23 colorless green ideas sleep furiously (and the top 3) over much of the training space.
- Example: Chomsky's original sentence, #23 colorless green ideas sleep furiously, outranks both the top 3 and #36 only briefly, at one data point.

[Figure: Sentence Rank vs. Amount of Training Data for the best three sentences (#1, #2, #3); rank (0-120) on the y-axis, amount of training data on the x-axis.]

[Figure: Sentence Rank vs. Amount of Training Data for #23 colorless green ideas sleep furiously and #36 furiously sleep ideas green colorless.]

[Figure: Sentence Rank vs. Amount of Training Data for #1, #2, #3, #23, and #36 combined.]