LING/C SC 581: Advanced Computational Linguistics. Lecture 2, Jan 15th

Transcription:

LING/C SC 581: Advanced Computational Linguistics
Lecture 2, Jan 15th

From last time
Did everyone install Python 3 and nltk/nltk_data?
We'll do Homework 2 on this today.

Importing your own corpus
Learning to import your own texts: plain text OR Beautiful Soup (html)
Read nltk book chapter 3
Assume:
import nltk, re, pprint
from nltk import word_tokenize
Reading local files

Project Gutenberg http://www.gutenberg.org/catalog/

Step 1: download
Download: raw file = 1 (long) string
Text number 2554 is an English translation of Crime and Punishment
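A minimal sketch of the download step, assuming only the standard library. The helper name fetch_text and the sample file are made up for illustration; a local file:// URL stands in for a real Project Gutenberg address so the sketch runs offline.

```python
from pathlib import Path
from urllib import request
import tempfile


def fetch_text(url, encoding="utf-8"):
    """Fetch a URL and decode the bytes into one (long) string.

    fetch_text is a hypothetical helper name, not part of nltk or urllib.
    """
    with request.urlopen(url) as response:
        return response.read().decode(encoding)


# Stand-in for a remote text: write a small file and read it via file://
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(b"A placeholder title\nA placeholder first line.\n")
    raw = fetch_text(sample.as_uri())

print(type(raw).__name__, len(raw))
```

For a real download, pass the http(s) URL of the text and the encoding the file declares.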

Step 1: word_tokenize() Tokenize: list of words

Step 1: Beautiful Soup. html: get_text() from BeautifulSoup
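To show roughly what stripping markup involves, here is a standard-library stand-in built on html.parser. It is not BeautifulSoup and will not match get_text() in every detail; the class and function names are invented for this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text outside of tags
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


html = "<html><body><h1>Mrs. Dalloway</h1><p>For there she <b>was</b>.</p></body></html>"
text = strip_tags(html)
print(text)
```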

nltk Text object: methods .collocations() and .concordance(word)

Step 1: getting rid of extraneous start/end text Adjusting start and end:

Mrs. Dalloway
Project Gutenberg Australia (not indexed by www.gutenberg.org)
http://gutenberg.net.au/ebooks02/0200991.txt
Mrs. Dalloway by Virginia Woolf (1925)
Code to read plaintext:
from urllib import request
url = "http://gutenberg.net.au/ebooks02/0200991.txt"
response = request.urlopen(url)
raw = response.read().decode('latin-1')  # this file declares Latin-1; utf-8 is more common

Mrs. Dalloway
Dealing with html
>>> html = request.urlopen(url).read().decode('latin-1')
>>> html[:60]
'\r\n\r\nï» <table width="45%" border ="0">\r\n<tr>\r\n<td bgcolor="#'
>>> from bs4 import BeautifulSoup
>>> raw = BeautifulSoup(html).get_text()
>>> tokens = word_tokenize(raw)
>>> len(tokens)
77977
>>> tokens[:100]
['ï', '»', ' ', 'Project', 'Gutenberg', 'Australia', 'a', 'treasure-trove', 'of', 'literature', 'treasure', 'found', 'hidden', 'with', 'no', 'evidence', 'of', 'ownership', 'Title', ':', 'Mrs.', 'Dalloway', '(', '1925', ')', 'Author', ':', 'Virginia', 'Woolf', '*', 'A', 'Project', 'Gutenberg', 'of', 'Australia', 'ebook', '*', 'ebook', 'No', '.', ':', '0200991.txt', 'Edition', ':', '1', 'Language', ':', 'English', 'Character', 'set', 'encoding', ':', 'Latin-1', '(', 'ISO-8859-1', ')', '--', '8', 'bit', 'Date', 'first', 'posted', ':', 'November', '2002', 'Date', 'most', 'recently', 'updated', ':', 'November', '2002', 'This', 'ebook', 'was', 'produced', 'by', ':', 'Don', 'Lainson', 'dlainson', '@', 'sympatico.ca', 'Project', 'Gutenberg', 'of', 'Australia', 'ebooks', 'are', 'created', 'from', 'printed', 'editions', 'which', 'are', 'in', 'the', 'public', 'domain', 'in']
>>> tokens[-100:]
['shall', 'go', 'and', 'talk', 'to', 'him', '.', 'I', 'shall', 'say', 'goodnight', '.', 'What', 'does', 'the', 'brain', 'matter', ',', "''", 'said', 'Lady', 'Rosseter', ',', 'getting', 'up', ',', '``', 'compared', 'with', 'the', 'heart', '?', "''", '``', 'I', 'will', 'come', ',', "''", 'said', 'Peter', ',', 'but', 'he', 'sat', 'on', 'for', 'a', 'moment', '.', 'What', 'is', 'this', 'terror', '?', 'what', 'is', 'this', 'ecstasy', '?', 'he', 'thought', 'to', 'himself', '.', 'What', 'is', 'it', 'that', 'fills', 'me', 'with', 'extraordinary', 'excitement', '?', 'It', 'is', 'Clarissa', ',', 'he', 'said', '.', 'For', 'there', 'she', 'was', '.', 'THE', 'END', 'This', 'site', 'is', 'full', 'of', 'FREE', 'ebooks', '-', 'Project', 'Gutenberg', 'Australia']
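The stray 'ï' and '»' at the start of the output are consistent with a UTF-8 byte order mark (bytes EF BB BF) being decoded as Latin-1; that reading is an inference from the output, not something the slides state. The sketch below reproduces the effect on a made-up byte string and shows one way to drop the debris.

```python
# Made-up sample: a UTF-8 byte order mark followed by ASCII text
data = b"\xef\xbb\xbfTitle: Mrs. Dalloway"

as_latin1 = data.decode("latin-1")      # each BOM byte becomes its own character
as_utf8_sig = data.decode("utf-8-sig")  # this codec strips a leading BOM

# The three characters a BOM turns into under Latin-1
BOM_AS_LATIN1 = "\u00ef\u00bb\u00bf"

# For a Latin-1 file with a BOM in front: decode, then remove the prefix
if as_latin1.startswith(BOM_AS_LATIN1):
    cleaned = as_latin1[len(BOM_AS_LATIN1):]
else:
    cleaned = as_latin1

print(repr(as_latin1[:3]))
print(cleaned)
```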

Mrs. Dalloway
>>> raw[:150]
'\r\n\r\nï» <table width="45%" border ="0">\r\n<tr>\r\n<td bgcolor="#ffe4e1"><font color="#800000" size="5"><p style="text-align:center"><b><a href="http://gut'

Mrs. Dalloway
>>> response = request.urlopen(url)
>>> raw = response.read().decode('latin-1')
>>> m = re.search('Title',raw)
>>> m
<_sre.SRE_Match object; span=(426, 431), match='Title'>
>>> raw = raw[431:]
>>> m = re.search('Title',raw)
>>> m
<_sre.SRE_Match object; span=(1217, 1222), match='Title'>
>>> raw = raw[1217:]
>>> raw[:200]
'Title: Mrs. Dalloway\r\nAuthor: Virginia Woolf\r\n\r\n\r\n\r\n\r\nMrs. Dalloway said she would buy the flowers herself.\r\n\r\nFor Lucy had her work cut out for her. The doors would be taken\r\noff their hing'

Mrs. Dalloway
>>> raw[-400:]
'What is\r\nit that fills me with extraordinary excitement?\r\n\r\nIt is Clarissa, he said.\r\n\r\nFor there she was.\r\n\r\n\r\n\r\nTHE END\r\n\r\n\r\n\r\n\r\n\r\n</pre>\r\n<p style="marginleft:10%"><img src="/pga-australia.jpg" width="80" height="75" alt=""> </p>\r\n\r\n<p><b>This site is full of FREE ebooks - <a href="http://gutenberg.net.au" target="_blank">Project Gutenberg Australia</a></b></p>\r\n<!-- ad goes here -- >\r\n\r\n\r\n\r\n\r\n'
>>> m = re.search('THE END',raw)
>>> m
<_sre.SRE_Match object; span=(368969, 368976), match='THE END'>
>>> raw = raw[:368976]
>>> raw[-400:]
'Sally. "I shall go\r\nand talk to him. I shall say goodnight. What does the brain\r\nmatter," said Lady Rosseter, getting up, "compared with the heart?"\r\n\r\n"I will come," said Peter, but he sat on for a moment. What is\r\nthis terror? what is this ecstasy? he thought to himself. What is\r\nit that fills me with extraordinary excitement?\r\n\r\nIt is Clarissa, he said.\r\n\r\nFor there she was.\r\n\r\n\r\n\r\nTHE END'
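The trimming above copies span numbers by hand. A sketch of the same idea that reads the offsets from the match objects instead, assuming the start and end markers are known; the function name trim and the sample page are made up for illustration.

```python
import re


def trim(raw, start_pattern, end_pattern, start_index=0):
    """Keep the text from the chosen start marker through the last end marker.

    start_index selects which occurrence of the start marker to use
    (0 = first, 1 = second, ...). trim is a hypothetical helper name.
    """
    starts = list(re.finditer(start_pattern, raw))
    ends = list(re.finditer(end_pattern, raw))
    if len(starts) <= start_index or not ends:
        raise ValueError("marker not found")
    begin = starts[start_index].start()  # keep the start marker itself
    finish = ends[-1].end()              # keep the end marker itself
    if finish <= begin:
        raise ValueError("end marker precedes start marker")
    return raw[begin:finish]


# Made-up page: site header, body text, site footer
page = (
    "<b>Title: site header</b>\n"
    "Title: Mrs. Dalloway\nAuthor: Virginia Woolf\n\n"
    "For there she was.\n\nTHE END\n<p>site footer</p>\n"
)
body = trim(page, r"Title", r"THE END", start_index=1)
print(body)
```

Using m.start() and m.end() avoids retyping numbers such as 1217 or 368976, which change whenever the page changes.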

Mrs. Dalloway
>>> tokens = word_tokenize(raw)
>>> type(tokens)
<class 'list'>
>>> len(tokens)
77718
>>> tokens[:100]
['Title', ':', 'Mrs.', 'Dalloway', 'Author', ':', 'Virginia', 'Woolf', 'Mrs.', 'Dalloway', 'said', 'she', 'would', 'buy', 'the', 'flowers', 'herself', '.', 'For', 'Lucy', 'had', 'her', 'work', 'cut', 'out', 'for', 'her', '.', 'The', 'doors', 'would', 'be', 'taken', 'off', 'their', 'hinges', ';', 'Rumpelmayer', "'s", 'men', 'were', 'coming', '.', 'And', 'then', ',', 'thought', 'Clarissa', 'Dalloway', ',', 'what', 'a', 'morning', '--', 'fresh', 'as', 'if', 'issued', 'to', 'children', 'on', 'a', 'beach', '.', 'What', 'a', 'lark', '!', 'What', 'a', 'plunge', '!', 'For', 'so', 'it', 'had', 'always', 'seemed', 'to', 'her', ',', 'when', ',', 'with', 'a', 'little', 'squeak', 'of', 'the', 'hinges', ',', 'which', 'she', 'could', 'hear', 'now', ',', 'she', 'had', 'burst']

Mrs. Dalloway
>>> text = nltk.Text(tokens)
>>> len(text)
77718
>>> len(set(text))
7623
>>> len(set(text)) / len(text)
0.09808538562495175
>>> text.count('Dalloway')
104
>>> text.count('Mrs.')
118
>>> fd = nltk.FreqDist(text)
>>> fd
FreqDist({',': 6098, '.': 3017, 'the': 3015, 'and': 1625, 'of': 1525, ';': 1473, 'to': 1447, 'a': 1328, 'was': 1254, 'her': 1227,...})
>>> print(fd)
<FreqDist with 7623 samples and 77718 outcomes>
>>> fd['Dalloway']
104
>>> text.collocations()
Peter Walsh; Sir William; Lady Bruton; Miss Kilman; Dr. Holmes; Prime Minister; Ellie Henderson; Mrs. Filmer; Mrs. Dalloway; Hugh Whitbread; Warren Smith; Sally Seton; Aunt Helena; Big Ben; Richard Dalloway; motor car; Miss Parry; motor cars; years ago; Bond Street
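The ratio len(set(text)) / len(text) is the type/token ratio (lexical diversity). A small sketch on a made-up token list, using collections.Counter as a stand-in for nltk.FreqDist (which subclasses Counter in NLTK 3), also shows that counting is case-sensitive.

```python
from collections import Counter


def lexical_diversity(tokens):
    """Distinct tokens (types) divided by total tokens."""
    return len(set(tokens)) / len(tokens)


# Made-up sample, already tokenized
tokens = "For there she was . For so it had always seemed to her .".split()

counts = Counter(tokens)                      # stand-in for nltk.FreqDist
folded = Counter(t.lower() for t in tokens)   # case-folded counts

print(lexical_diversity(tokens))
print(counts["For"], counts["for"], folded["for"])
```

Whether to case-fold before counting is a design choice: folding merges sentence-initial 'For' with 'for', but also merges proper nouns with common words.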

Mrs. Dalloway
>>> fd2 = nltk.FreqDist(len(word) for word in text)
>>> fd2.most_common()
[(3, 17056), (1, 13433), (4, 11541), (2, 11452), (5, 7915), (6, 5236), (7, 4496), (8, 2980), (9, 1684), (10, 966), (11, 446), (12, 253), (13, 158), (14, 51), (15, 37), (16, 4), (18, 4), (17, 3), (19, 1), (20, 1), (21, 1)]
>>> fd2.plot()
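Where no plotting backend is available, a word-length distribution can be inspected as text. A sketch on a made-up token list, with Counter again standing in for nltk.FreqDist; the bar format is an arbitrary choice.

```python
from collections import Counter

# Made-up sample tokens
tokens = ["Mrs.", "Dalloway", "said", "she", "would", "buy",
          "the", "flowers", "herself", "."]

# Map each token to its length and count how often each length occurs
lengths = Counter(len(word) for word in tokens)

# Print one bar per length, ordered by length rather than by frequency
for length in sorted(lengths):
    print(f"{length:2d} {'#' * lengths[length]} ({lengths[length]})")
```

Note that punctuation tokens count as words of length 1 here, as they do in the slide's output.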

Mrs. Dalloway

Mrs. Dalloway
Searching Tokenized Text in nltk: angle brackets < > mark token boundaries
>>> text[:20]
['Title', ':', 'Mrs.', 'Dalloway', 'Author', ':', 'Virginia', 'Woolf', 'Mrs.', 'Dalloway', 'said', 'she', 'would', 'buy', 'the', 'flowers', 'herself', '.', 'For', 'Lucy']
>>> text.findall(r"<Mrs\.> (<\w+>)")
Dalloway; Dalloway; Foxcroft; Dalloway; Asquith; Dalloway; Richard; Dalloway; Dalloway; Dalloway; Coates; Coates; Bletchley; Bletchley; Dempster; Dempster; Dempster; Dempster; Dempster; Dempster; Dempster; Dalloway; Walker; Dalloway; Walker; Dalloway; Dalloway; Dalloway; Dalloway; Turner; Filmer; Hugh; Septimus; Filmer; Filmer; Warren; Smith; Filmer; Smith; Warren; Dalloway; Whitbread; Marsham; Marsham; Marsham; Marsham; Hilbery; Dalloway; Dalloway; Dalloway; Dalloway; Dalloway; Dalloway; Marsham; Marsham; Dalloway; Dalloway; Gorham; Dalloway; Filmer; Peters; Peters; Filmer; Peters; Peters; Filmer; Peters; Peters; Peters; Peters; Filmer; Peters; Peters; Peters; Filmer; Filmer; Filmer; Williams; Filmer; Filmer; Filmer; Filmer; Filmer; Filmer; Filmer; Filmer; Burgess; Burgess; Burgess; Morris; Morris; Walker; Walker; Dalloway; Walker; Walker; Walker; Parkinson; Barnet; Barnet; Barnet; Barnet; Barnet; Garrod; Hilbery; Mount; Dakers; Durrant; Hilbery; Hilbery; Dalloway; Dalloway; Dalloway; Dalloway; Hilbery; Hilbery
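What the pattern <Mrs\.> (<\w+>) asks for is: every single-word token that directly follows the token 'Mrs.'. A plain-Python sketch of that query on a made-up token list; this is a simplification for illustration, not how nltk implements findall, and words_after is an invented name.

```python
import re


def words_after(tokens, target):
    """Return each token that directly follows `target` and is a \\w+ word."""
    return [
        nxt
        for cur, nxt in zip(tokens, tokens[1:])  # adjacent token pairs
        if cur == target and re.fullmatch(r"\w+", nxt)
    ]


# Made-up sample, already tokenized
tokens = ["Title", ":", "Mrs.", "Dalloway", "Author", ":", "Virginia", "Woolf",
          "Mrs.", "Dalloway", "said", "Mrs.", ",", "Mrs.", "Foxcroft"]

names = words_after(tokens, "Mrs.")
print("; ".join(names))
```

The match is on whole tokens, so 'Mrs.' followed by a comma contributes nothing, and the comparison is case-sensitive.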

nltk: .sent_tokenize()
3.8 Segmentation
Sentence segmentation
Brown corpus (pre-segmented):
>>> len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())
20.250994070456922
(average sentence length in terms of number of words)
>>> raw = "'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL. Soup does very\nwell without--maybe it's always pepper that makes people hot-tempered,'..."
>>> nltk.sent_tokenize(raw)
["'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL.", "Soup does very\nwell without--maybe it's always pepper that makes people hot-tempered,'..."]
>>> nltk.sent_tokenize(raw)[0]
"'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\nthough), 'I won't have any pepper in my kitchen AT ALL."
>>> nltk.sent_tokenize(raw)[1]
"Soup does very\nwell without--maybe it's always pepper that makes people hot-tempered,'..."
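The Brown figure is total words divided by total sentences. The same arithmetic as a small function over pre-segmented, pre-tokenized sentences; the sample data is made up and the function name is an assumption for this sketch.

```python
def average_sentence_length(sentences):
    """Mean number of tokens per sentence.

    `sentences` is a list of token lists, e.g. the output of a sentence
    splitter followed by a word tokenizer.
    """
    if not sentences:
        raise ValueError("no sentences given")
    total_words = sum(len(sentence) for sentence in sentences)
    return total_words / len(sentences)


# Made-up sample: three short sentences, punctuation counted as tokens
sentences = [
    ["For", "there", "she", "was", "."],
    ["What", "a", "lark", "!"],
    ["What", "a", "plunge", "!"],
]

print(average_sentence_length(sentences))
```

Whether punctuation tokens count as words affects the result, so two corpora should be tokenized the same way before their averages are compared.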

Homework 2
Virginia Woolf was famous for her stream-of-consciousness style of writing:
How fresh, how calm, stiller than this of course, the air was in the early morning; like the flap of a wave; the kiss of a wave; chill and sharp and yet (for a girl of eighteen as she then was) solemn, feeling as she did, standing there at the open window, that something awful was about to happen; looking at the flowers, at the trees with the smoke winding off them and the rooks rising, falling; standing and looking until Peter Walsh said, "Musing among the vegetables?"--was that it?--"I prefer men to cauliflowers"--was that it?
Dumbledore's death in the style
https://www.theguardian.com/books/2005/jul/13/harrypotter.jkjoannekathleenrowling4
Download Mrs. Dalloway
http://gutenberg.net.au/ebooks02/0200991.txt

Homework 2
Compute the average sentence length of Mrs. Dalloway
Compare with the average sentence length of the Brown Corpus
Is it true that stream-of-consciousness writing leads to (significantly) longer sentences?
Submit homework by Friday evening
One PDF file: show your workings (Python interpreter)