LING/C SC 581: Advanced Computational Linguistics. Lecture Notes Feb 6th


Administrivia

The homework pipeline:
- Homework 2 graded
- Homework 4 not back yet; soon
- Homework 5 due Weds by midnight
- No classes next week: I'm out of town on business
- No new homework assigned this week

Today's Topics
- Homework 4 review

Homework 4 Review: Question 1

Construct a WSJ text corpus that excludes both words tagged as -NONE- and punctuation words (defined previously). Show your Python console.
- How many words are in the corpus? How many distinct words?
- Plot the cumulative frequency distribution graph.
- How many top words do you need to account for 50% of the corpus?

Homework 4 Review: Question 1

excluded = set(['-NONE-', '-LRB-', '-RRB-', 'SYM', ':', '.', ',', '``', "''"])
tokens = [x[0] for x in ptb.tagged_words(categories=['news']) if x[1] not in excluded]
words = set(tokens)
text = nltk.Text(tokens)
print('Tokens: {}; #Words: {}'.format(len(text), len(words)))
Tokens: 1037490; #Words: 49184
len(words)
49184
print('Lexical diversity: {:.3f}'.format(len(words)/len(text)))
Lexical diversity: 0.047
dist = nltk.FreqDist(text)
print(dist)
<FreqDist with 49184 samples and 1037490 outcomes>

Homework 4 Review: Question 1

ranked = sorted(dist.items(), key=lambda t: t[1], reverse=True)
half = len(text) / 2.0
total = 0
index = 0
while total < half:
    total += ranked[index][1]
    index += 1
print('No of words: {}; total: {}'.format(index, total))
No of words: 217; total: 518763

(Half the corpus: 1037490 / 2 = 518745.)
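Equivalently, NLTK's FreqDist.most_common method returns the same ranking in one call:

ranked = dist.most_common()  # list of (word, freq) pairs, most frequent first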

Homework 4 Review: Question 1

print('{:12s} {:5s}'.format('Word', 'Freq'))
for word, freq in ranked[:index]:
    print('{:12s} {:5d}'.format(word, freq))

Homework 4 Review: Question 1 [slide shows the resulting top-word frequency table and cumulative frequency distribution plot]

Homework 4 Review: Question 2

With case folding:

tokens = [x[0].lower() for x in ptb.tagged_words(categories=['news']) if x[1] not in excluded]

Tokens: 1037490; #Words: 43746
Lexical diversity: 0.042
No of words: 176; total: 518944 (half the corpus: 1037490 / 2 = 518745)

Homework 4 Review: Question 2 [slide shows the corresponding results with case folding]

Colorless green ideas

Examples:
(1) colorless green ideas sleep furiously
(2) furiously sleep ideas green colorless

Chomsky (1957): "... It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not."

The idea: (1) is syntactically valid; (2) is word salad.

One piece of supporting evidence: (1) is pronounced with normal intonation; (2) is pronounced like a list of words.

Background: Language Models and N-grams

Given a word sequence w1 w2 w3 ... wn, how do we compute its probability?

Chain rule:
p(w1 w2) = p(w1) p(w2|w1)
p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1 w2)
...
p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)

For example: p(the cat sat) = p(the) p(cat|the) p(sat|the cat).

Note: it's not easy to collect (meaningful) statistics on p(wn|w1 ... wn-2 wn-1) for all possible word sequences.

Background: Language Models and N-grams

Given a word sequence w1 w2 w3 ... wn.

Bigram approximation: just look at the previous word (not all the preceding words). This is the Markov assumption: a finite-length history, here a 1st-order Markov model.

p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)

Note: p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1 ... wn-2 wn-1).
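As a minimal sketch of how a bigram model can be estimated with NLTK (reusing the tokens list built in the Homework 4 code above; bigram_prob is an illustrative name, not an NLTK function):

import nltk

# Collect bigram counts: cfd[w1][w2] = number of times w1 is followed by w2
cfd = nltk.ConditionalFreqDist(nltk.bigrams(tokens))

def bigram_prob(w1, w2):
    # Maximum-likelihood estimate of p(w2 | w1) = count(w1 w2) / count(w1)
    return cfd[w1].freq(w2)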

Colorless green ideas

Sentences:
(1) colorless green ideas sleep furiously
(2) furiously sleep ideas green colorless

Statistical experiment (Pereira 2002): a bigram language model over pairs wi-1 wi.
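Pereira's actual model was a smoothed, class-based (aggregate) bigram model; the smoothing matters, because a raw maximum-likelihood bigram model assigns both sentences probability 0 (colorless never occurs in the PTB). As a crude stand-in, here is an add-one-smoothed sketch continuing the code above (smoothed_bigram_prob and score are illustrative names):

V = len(words)  # vocabulary size, from the Homework 4 corpus

def smoothed_bigram_prob(w1, w2):
    # Add-one (Laplace) smoothing: unseen bigrams get a small nonzero probability
    return (cfd[w1][w2] + 1) / (cfd[w1].N() + V)

def score(sentence):
    # Product of bigram probabilities, ignoring the initial unigram term
    p = 1.0
    for w1, w2 in nltk.bigrams(sentence.split()):
        p *= smoothed_bigram_prob(w1, w2)
    return p

print(score('colorless green ideas sleep furiously'))
print(score('furiously sleep ideas green colorless'))

Note that all eight bigrams involved are (almost certainly) unseen in the PTB, so add-one smoothing scores the two sentences nearly identically; it is Pereira's class-based smoothing, which shares statistics across distributionally similar words (e.g., green ~ new), that separates them.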

Part-of-Speech (POS) Tag Sequence

Chomsky's example:
  colorless green ideas sleep furiously
  JJ JJ NNS VBP RB (POS tags)

Similar but grammatical example (LSLT, pg. 146):
  revolutionary new ideas appear infrequently
  JJ JJ NNS VBP RB
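As a quick sanity check, NLTK's off-the-shelf tagger can be run on both sentences (a sketch; it assumes the averaged-perceptron tagger data has been downloaded, and its output is not guaranteed to match the hand-assigned tags above):

import nltk

# nltk.download('averaged_perceptron_tagger')  # one-time setup
print(nltk.pos_tag('colorless green ideas sleep furiously'.split()))
print(nltk.pos_tag('revolutionary new ideas appear infrequently'.split()))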

Stanford Parser

The Stanford Parser: a probabilistic PS (phrase structure) parser trained on the Penn Treebank.

Penn Treebank (PTB) Corpus: word frequencies

Word        POS   Frequency
colorless   --    0
green       NNP   33
            JJ    19
            NN    5
ideas       NNS   32
sleep       VB    5
            NN    4
            VBP   2
            NNP   1
furiously   RB    2

Word           POS   Frequency
revolutionary  JJ    6
               NNP   2
               NN    2
new            JJ    1795
               NNP   1459
               NNPS  2
               NN    1
ideas          NNS   32
appear         VB    55
               VBP   41
infrequently   --    0

Stanford Parser

Structure of NPs: colorless green ideas vs. revolutionary new ideas

Phrase            Frequency
[NP JJ JJ NNS]    1073
[NP NNP JJ NNS]   61
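To see where counts like these come from, here is a minimal sketch over NLTK's 10% treebank sample (assuming the 'treebank' data is downloaded; the full-PTB query is analogous, with larger counts):

from collections import Counter
from nltk import Tree
from nltk.corpus import treebank

# Tally the child-category sequence of every NP node in the parsed sample
counts = Counter()
for tree in treebank.parsed_sents():
    for np in tree.subtrees(lambda t: t.label() == 'NP'):
        counts[tuple(c.label() for c in np if isinstance(c, Tree))] += 1

print(counts[('JJ', 'JJ', 'NNS')])  # frequency of [NP JJ JJ NNS] in the sample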

An experiment

Examples:
(1) colorless green ideas sleep furiously
(2) furiously sleep ideas green colorless

Question: Is (1) even the most likely permutation of these particular five words?

Parsing Data

All 5! (= 120) permutations of colorless green ideas sleep furiously.
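Generating the permutations is a one-liner (a minimal sketch):

from itertools import permutations

five_words = 'colorless green ideas sleep furiously'.split()
sentences = [' '.join(p) for p in permutations(five_words)]
print(len(sentences))  # 5! = 120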

Parsing Data

The winning sentence, after training on sections 02-21 (approx. 40,000 sentences), was:
1. furiously ideas sleep colorless green
   - sleep selects for an ADJP object with 2 heads
   - the adverb (RB) furiously modifies the noun

Parsing Data

The next two highest-scoring permutations were:
2. Furiously green ideas sleep colorless. (sleep takes an NP object)
3. Green ideas sleep furiously colorless. (sleep takes an ADJP object)

Parsing Data

Pereira (2002) compared Chomsky's original minimal pair:
23. colorless green ideas sleep furiously
36. furiously sleep ideas green colorless

These ranked #23 and #36, respectively, out of 120.

Parsing Data

But the graph (next slide) shows how arbitrary these rankings are when the parser is trained on randomly chosen sections covering 14K-31K sentences.
- Example: #36 furiously sleep ideas green colorless outranks #23 colorless green ideas sleep furiously (and the top 3) over much of the training space.
- Example: Chomsky's original sentence, #23 colorless green ideas sleep furiously, outranks both the top 3 and #36 only briefly, at one data point.

[Figure: Sentence Rank vs. Amount of Training Data for the best three sentences (#1, #2, #3); rank (0-120) on the y-axis, amount of training data on the x-axis.]

[Figure: Sentence Rank vs. Amount of Training Data for #23 colorless green ideas sleep furiously and #36 furiously sleep ideas green colorless.]

[Figure: Sentence Rank vs. Amount of Training Data for #1, #2, #3, #23, and #36 combined.]