A picture is worth 13.6 words (on average)

Yiannis Aloimonos, Tamara Berg, Alex Berg, Jesse Dodge, Amit Goyal, Yejin Choi, Xufeng Han, Alyssa Mensch, Meg Mitchell, Kota Yamaguchi, Karl Stratos, Ching Lik Teo, Yezhou Yang

An on-paper experiment
Write a caption for this image, one sentence in length (in English).

People write weird captions
Another dream car to add to the list, this one spotted in Hanbury St.
Shot out my car window while stuck in traffic because people in Cincinatti can't drive in the rain
1. A distorted photo of a man cutting up a large cut of meat in a garage.
2. A man smiling at the camera while carving up meat.
3. A man smiling while he cuts up a piece of meat.
4. A smiling man is standing next to a table dressing a piece of venison.
5. The man is smiling into the camera as he cuts meat.

Two complementary questions...
Image → Text? Example: "two women sitting on bench reading magazine" (brunette, blonde)
Text → Image? Example: "looking for castles in the clouds out my car window"
Understanding and Predicting Importance in Images, BBDDGHMMSSY, CVPR 2012
Midge: Generating Image Descriptions from Computer Vision Detections, MDGYSHMBBD, EACL 2012
Corpus-Guided Sentence Generation of Natural Images, YTDA, EMNLP 2011
Detecting Visual Text, DGHMMSYCDBB, NAACL 2012

Why do this?
Caption generation, e.g., "the sheep meandered along a desolate road in the highlands of Scotland through frozen grass"
Visual scene construction, e.g., Coyne & Sproat, WordsEye: An Automatic Text-to-Scene Conversion System, SIGGRAPH 2001: "the small white cat is -17 inches above the hat. the tiny white illuminator is in front of the cat. it is night. the ground is red." and "the 200 foot tall dragon is facing the 100 foot tall car. The ground is a checkerboard. the sky is pink"
Training object detectors from text, e.g., Farhadi + Sadeghi, Recognition Using Visual Phrases, CVPR 2011: "a person riding a horse" (Person + Horse), "elephant in the beach"

What is visual text
Photographer/viewer distinctions: Kevin's mom, so punxrawk in Kev's black flag hat
Amount of inference: Another dream car to add to the list, this one spotted in Hanbury St.
Temporal events: Tuckered out from playing in Nannie's yard.
A phrase is visual if there is a piece of the image you can cut out, place in another image, and still use the same description.

Okay, so can we detect it?
Data: SBU Flickr captions, 3 NPs per caption
800 images: 3 annotations each; 48k images: 1 annotation each
People largely agree (74%, whatever that means...)
70% of the NPs are visual

Okay, so can a computer detect it?
Example: Another dream car to add to the list...
Features, computed inside, before, and after the NP:
Words + stems: another/anoth, dream/dream, car/car; to/to, add/add
Bigrams: another_dream, dream_car; to_add
Spelling: Aa+, a+, a+; a+, a+
Hypernyms (of car): vehicle, artifact, entity
Result: 67% AUC
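The feature families on this slide can be sketched as follows. This is a minimal illustration, not the classifier used in the work: the suffix-stripping stemmer, the two-token context window, the assumption that the head noun is the last token of the NP, and the toy `HYPERNYMS` table (a stand-in for a lexical resource such as WordNet) are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical toy hypernym table; a real system would query a lexical
# resource such as WordNet instead.
HYPERNYMS: Dict[str, List[str]] = {
    "car": ["vehicle", "artifact", "entity"],
    "horse": ["animal", "organism", "entity"],
}


def crude_stem(word: str) -> str:
    """Rough suffix stripping (an assumption, not a real Porter stemmer)."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_shape(word: str) -> str:
    """Collapse runs of character classes: 'Another' -> 'Aa+', 'dream' -> 'a+'."""
    classes = ["A" if c.isupper() else "a" if c.islower() else "0" if c.isdigit() else c
               for c in word]
    shape, i = [], 0
    while i < len(classes):
        j = i
        while j < len(classes) and classes[j] == classes[i]:
            j += 1
        shape.append(classes[i] + ("+" if j - i > 1 else ""))
        i = j
    return "".join(shape)


def region_features(tokens: List[str], region: str) -> Set[str]:
    """Word, stem, spelling-shape and bigram features, tagged with their region."""
    feats: Set[str] = set()
    lowered = [t.lower() for t in tokens]
    for tok, low in zip(tokens, lowered):
        feats.add(f"{region}:word={low}")
        feats.add(f"{region}:stem={crude_stem(low)}")
        feats.add(f"{region}:shape={word_shape(tok)}")
    for a, b in zip(lowered, lowered[1:]):
        feats.add(f"{region}:bigram={a}_{b}")
    return feats


def np_features(tokens: List[str], start: int, end: int, window: int = 2) -> Set[str]:
    """Features for the NP tokens[start:end] plus `window` tokens before and after it."""
    feats = region_features(tokens[start:end], "inside")
    feats |= region_features(tokens[max(0, start - window):start], "before")
    feats |= region_features(tokens[end:end + window], "after")
    head = tokens[end - 1].lower()  # assumption: the head noun is the NP's last token
    for hyper in HYPERNYMS.get(head, []):
        feats.add(f"inside:hypernym={hyper}")
    return feats


tokens = "Another dream car to add to the list".split()
features = np_features(tokens, 0, 3)  # NP = "Another dream car"
print(sorted(f for f in features if f.startswith("inside:")))
```

The resulting sparse feature set would then feed a standard binary classifier (visual vs. not visual); which classifier the talk used is not stated here.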

Bootstrapping visual terminology
Start with some seeds, then apply bootstrapping or label propagation.
Seed adjectives: visual (V): brown, green, wooden, orange, rectangular, furry, shiny, rusty; non-visual (NV): public, original, whole, righteous, political, personal, intrinsic, individual
Seed nouns: visual (V): car, house, tree, horse, animal, man, table, bottle, woman, computer; non-visual (NV): idea, bravery, trust, dedication, deceit, anger, humour, luck, inflation, honesty
Seed adjectives by visual category:
Color: purple, blue, maroon, beige, green
Material: plastic, cotton, wooden, metallic, silver
Shape: circular, square, round, rectangular, triangular
Size: small, big, tiny, tall, huge
Surface: coarse, smooth, furry, fluffy, rough
Direction: sideways, north, upward, left, down
Pattern: striped, dotted, checked, plaid, feathered, quilted
Quality: shiny, dirty, burned, glittery, rusty
Beauty: beautiful, cute, pretty, gorgeous, lovely
Age: young, mature, immature, older, senior
Ethnicity: french, asian, american, greek, hispanic
Terms found by propagation: color: emerald, rufous, grayish, chestnut, #A81C07; shape: hemispherical, oblong, quadrangular, convex
Detection: 67% AUC → 71% AUC
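One way to realize "apply bootstrapping or label propagation" is iterative label propagation over a word-similarity graph, with the seed words clamped to their labels. The sketch below is a minimal version under stated assumptions: the graph, its edge weights, and the two-way visual/non-visual label set are hypothetical (a real system might derive similarities from distributional statistics), and this is not necessarily the procedure used in the talk.

```python
from typing import Dict, List, Tuple


def label_propagation(
    edges: List[Tuple[str, str, float]],
    seeds: Dict[str, str],
    labels: List[str],
    iterations: int = 50,
) -> Dict[str, str]:
    """Propagate seed labels over an undirected weighted graph; seeds stay clamped."""
    neighbors: Dict[str, Dict[str, float]] = {}
    for u, v, w in edges:
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w
    for seed in seeds:
        neighbors.setdefault(seed, {})

    # Seeds start (and stay) one-hot; every other word starts uniform.
    dist: Dict[str, Dict[str, float]] = {}
    for node in neighbors:
        if node in seeds:
            dist[node] = {l: 1.0 if l == seeds[node] else 0.0 for l in labels}
        else:
            dist[node] = {l: 1.0 / len(labels) for l in labels}

    for _ in range(iterations):
        updated: Dict[str, Dict[str, float]] = {}
        for node, nbrs in neighbors.items():
            if node in seeds or not nbrs:
                updated[node] = dist[node]  # clamp seeds; isolated words keep their prior
                continue
            total = sum(nbrs.values())
            # New distribution = weighted average of the neighbors' distributions.
            updated[node] = {
                l: sum(w * dist[m][l] for m, w in nbrs.items()) / total for l in labels
            }
        dist = updated

    return {node: max(d, key=d.get) for node, d in dist.items()}


# Hypothetical similarity edges between words (weights are made up).
edges = [
    ("emerald", "green", 0.9), ("emerald", "bravery", 0.1),
    ("chestnut", "brown", 0.8), ("chestnut", "emerald", 0.3),
    ("bravery", "trust", 0.9),
    ("deceit", "honesty", 0.7), ("deceit", "trust", 0.5),
]
seeds = {"green": "visual", "brown": "visual", "trust": "non-visual", "honesty": "non-visual"}
result = label_propagation(edges, seeds, ["visual", "non-visual"])
print(result["emerald"], result["chestnut"], result["bravery"], result["deceit"])
```

Bootstrapping (repeatedly adding the most confident new words to the seed set) is an alternative; propagation is shown because it fits in a few lines.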

But this doesn't use the images!!!
[Chart: Random, Model, Model+Lists, and Human compared on a 50 to 95 scale]
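The detection results above are reported as AUC (67% and 71%). As a reminder of what that number measures, here is a minimal sketch of the rank-based (Mann-Whitney) formulation: the probability that a randomly chosen positive (visual) phrase is scored above a randomly chosen negative one, with ties counting half, so a random scorer gets 0.5 in expectation. The scores in the usage line are invented for illustration.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: P(score of a random positive > score of a random negative)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for four noun phrases (1 = visual, 0 = not visual).
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # prints 0.75
```

The pairwise loop is quadratic; for large evaluation sets a sort-based version (or a library routine) is preferable.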

What I used to think vision did...

Now I know better...

Adding in image features
Example caption: Ecuador, amazon basin, near coca, rain forest, passion fruit flower
Does a detector corresponding to this head noun exist?
Did it fire?
How many times did it fire?
How confident was the best firing?
What percentage of the image's pixels fall inside that bounding box?
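The five questions above translate directly into numeric features for a head noun. The sketch assumes a hypothetical `Detection` record with a label, a confidence score, and a pixel bounding box, and reads "that bounding box" as the box of the best-scoring firing; the score threshold for counting a firing is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass
class Detection:
    """Hypothetical detector output record."""
    label: str
    score: float
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def vision_features(
    head_noun: str,
    detections: Iterable[Detection],
    available_detectors: Set[str],
    image_w: int,
    image_h: int,
    threshold: float = 0.0,
) -> Dict[str, float]:
    """Answer the slide's five questions for one head noun, as numeric features."""
    hits = [d for d in detections if d.label == head_noun and d.score > threshold]
    feats = {
        "detector_exists": float(head_noun in available_detectors),
        "fired": float(bool(hits)),
        "num_firings": float(len(hits)),
        "best_score": max((d.score for d in hits), default=0.0),
        "best_box_pixel_fraction": 0.0,
    }
    if hits:
        x0, y0, x1, y1 = max(hits, key=lambda d: d.score).box
        area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
        feats["best_box_pixel_fraction"] = area / (image_w * image_h)
    return feats


dets = [
    Detection("horse", 0.8, (0, 0, 50, 40)),
    Detection("horse", 0.3, (60, 60, 90, 90)),
    Detection("person", 0.9, (10, 10, 30, 80)),
]
print(vision_features("horse", dets, {"horse", "person"}, 100, 100))
```

When no detector exists for the head noun, all features stay at zero, which matches the slide's caveat that vision features are only available for a minority of examples.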

Results with vision features
[Chart: Random, Model, Model+Lists, +Vision, and Human compared on a 50 to 95 scale]
Vision features are only available on about 11% of examples.
8% improvement on phrases with recognizers.

Detecting on a large scale...
bird, boat, bottle, bowl
Hal Daumé III (me@hal3.name)

What do people describe?
1) Given an image, predict what people will describe.
Example description: two women sitting on bench reading magazine (brunette, blonde)
Objects: bench, magazine, grass, skirt, women

Predicting what will be described
What's in this image? man, baby, sling, ladder, fridge, table, watermelon, chair, boxes, cups, water bottle, wall, pacifier, beard, glasses, shirt
What do people describe?
A bearded man is holding a child in a sling.
A bearded man stands while holding a small child in a green sheet.
A bearded man with a baby in a sling poses.
Man standing in kitchen with little girl in green sack.
Man with beard and baby

Description factors
What factors influence what someone will describe about an image?
Two kinds of factors: compositional and semantic

Compositional factors: size/saliency and location
Examples: A sail boat on the ocean. Two men standing on beach.
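The two compositional factors can be turned into simple per-object features computed from a bounding box. This is a hypothetical sketch: measuring size as the fraction of the image the box covers, and location as how close the box center is to the image center, are assumptions made for illustration, not necessarily the features used in the work.

```python
import math
from typing import Dict, Tuple


def compositional_factors(
    box: Tuple[float, float, float, float], image_w: float, image_h: float
) -> Dict[str, float]:
    """Size (fraction of image area) and centrality (1 at the center, 0 at a corner)."""
    x0, y0, x1, y1 = box
    size = max(0.0, x1 - x0) * max(0.0, y1 - y0) / (image_w * image_h)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    dist = math.hypot(cx - image_w / 2, cy - image_h / 2)
    max_dist = math.hypot(image_w / 2, image_h / 2)
    return {"size": size, "centrality": 1.0 - dist / max_dist}


# A large, centered object vs. a small one tucked into a corner (hypothetical boxes).
print(compositional_factors((25, 25, 75, 75), 100, 100))
print(compositional_factors((0, 0, 10, 10), 100, 100))
```

Features like these could be combined with the semantic factors in any standard classifier that predicts whether an object gets mentioned.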

Semantic factors: object type, nameable scene, unusualness
Examples: girl in the street; kitchen in house; elephant in the beach; A tree in water and a boy with a beard

Generating captions
a) Detect objects and scenes in the input image;
b) Estimate the optimal sentence-structure quadruplet T;
c) Generate a sentence from T.
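Steps b) and c) can be illustrated with a brute-force stand-in. Here the quadruplet T is assumed to be (noun, verb, scene, preposition); each candidate is scored by multiplying vision confidences for the noun and scene with corpus statistics linking noun to verb and scene to preposition, and the best T is realized with one fixed template. All scores, the scoring rule, and the template are hypothetical; the actual system may use a different model and inference procedure.

```python
from itertools import product
from typing import Dict, List, Tuple

Quad = Tuple[str, str, str, str]  # assumed layout: (noun, verb, scene, preposition)


def best_quadruplet(
    noun_conf: Dict[str, float],
    scene_conf: Dict[str, float],
    verbs: List[str],
    preps: List[str],
    p_verb_given_noun: Dict[Tuple[str, str], float],
    p_prep_given_scene: Dict[Tuple[str, str], float],
) -> Tuple[Quad, float]:
    """Exhaustively score every (n, v, s, p) and return the highest-scoring one."""
    best: Quad = ("", "", "", "")
    best_score = -1.0
    for n, v, s, p in product(noun_conf, verbs, scene_conf, preps):
        score = (noun_conf[n] * scene_conf[s]
                 * p_verb_given_noun.get((n, v), 0.0)
                 * p_prep_given_scene.get((s, p), 0.0))
        if score > best_score:
            best, best_score = (n, v, s, p), score
    return best, best_score


def realize(t: Quad) -> str:
    """Fill a single fixed template from T (a deliberate simplification)."""
    n, v, s, p = t
    return f"The {n} is {v} {p} the {s}."


# Hypothetical detector confidences and corpus statistics.
noun_conf = {"dog": 0.8, "cat": 0.3}
scene_conf = {"park": 0.7, "kitchen": 0.2}
p_vn = {("dog", "running"): 0.6, ("dog", "sleeping"): 0.2, ("cat", "sleeping"): 0.7}
p_ps = {("park", "in"): 0.5, ("park", "near"): 0.2, ("kitchen", "in"): 0.6}
t, score = best_quadruplet(noun_conf, scene_conf, ["running", "sleeping"], ["in", "near"], p_vn, p_ps)
print(realize(t))
```

Exhaustive search is fine for a handful of candidates; with larger vocabularies a factored model with dynamic-programming inference would be the natural replacement.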

Example

Sample Results

Evaluation Result

Using large corpora to compose natural captions (why write your own material when you can just steal it?)

Composing captions
a) monkey playing in the tree canopy, Monte Verde in the rain forest
b) capuchin monkey in front of my window
c) monkey spotted in Apenheul Netherlands under the tree
d) a white-faced or capuchin in the tree in the garden
e) the monkey sitting in a tree, posing for his picture

Captioning with (some) evidence
Caption images where:
we assume some evidence for 1 object (e.g., Tag: mare → evidence for horse) &
the object detector is confident (high detection score)
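The selection rule on this slide can be written as a small filter. The sketch assumes a hypothetical tag-to-detector mapping (`TAG_TO_CLASS`, e.g. "mare" → horse) and an arbitrary confidence threshold of 0.5; an image qualifies when some tagged object's detector fires at or above that threshold.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical mapping from user tags to detector classes.
TAG_TO_CLASS: Dict[str, str] = {
    "mare": "horse", "stallion": "horse", "pony": "horse", "puppy": "dog",
}


def confirmed_objects(
    tags: Iterable[str],
    detections: Iterable[Tuple[str, float]],
    threshold: float = 0.5,
) -> Set[str]:
    """Detector classes that are both suggested by a tag and confidently detected."""
    suggested = {TAG_TO_CLASS.get(t.lower(), t.lower()) for t in tags}
    return {label for label, score in detections
            if label in suggested and score >= threshold}


def should_caption(
    tags: Iterable[str],
    detections: Iterable[Tuple[str, float]],
    threshold: float = 0.5,
) -> bool:
    """True when at least one tagged object is confirmed by its detector."""
    return bool(confirmed_objects(tags, detections, threshold))


print(confirmed_objects(["Mare", "field"], [("horse", 0.92), ("dog", 0.7)]))
```

Requiring both signals trades coverage for precision: fewer images get captioned, but the object the caption is built around is more likely to be present.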

Generation: Grab 'N Mash
Grab phrases based on image similarity between the query and a captioned database:
object detection similarity → NPs, VPs
stuff detection similarity → PPs
scene similarity → PPs
Mash phrases: compose descriptions using simple rule-based concatenation

Getting NPs: objects
Detect: fruit. Find matching fruit detections by color similarity.
Tray of glace fruit in the market at Nice, France
Fresh fruit in the market
A box of oranges was just catching the sun, bringing out detail in the skin.
mandarin oranges in glass bowl
The street market in Santanyi, Mallorca is a must for the oranges and local crafts.
An orange tree in the backyard of the house.
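Finding "matching fruit detections by color similarity" can be sketched as nearest-neighbour retrieval over color histograms of the detected regions. The quantization (4 levels per RGB channel), the histogram-intersection similarity, and the toy pixel patches are assumptions for illustration; the actual system's color descriptor may differ.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (r, g, b), each 0-255


def color_histogram(pixels: Sequence[Pixel], bins: int = 4) -> List[float]:
    """Normalized joint RGB histogram with `bins` levels per channel."""
    counts = [0] * (bins ** 3)
    for r, g, b in pixels:
        idx = (r * bins // 256) * bins * bins + (g * bins // 256) * bins + (b * bins // 256)
        counts[idx] += 1
    total = max(1, len(pixels))  # guard against an empty region
    return [c / total for c in counts]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """1.0 for identical normalized histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def retrieve_phrases(
    query: Sequence[Pixel], database: List[Tuple[Sequence[Pixel], str]], k: int = 2
) -> List[str]:
    """Return the captions of the k database detections most similar in color."""
    q = color_histogram(query)
    ranked = sorted(
        database,
        key=lambda item: histogram_intersection(q, color_histogram(item[0])),
        reverse=True,
    )
    return [caption for _, caption in ranked[:k]]


# Toy "detections": uniform pixel patches standing in for cropped regions (hypothetical).
database = [
    ([(245, 150, 10)] * 10, "mandarin oranges in glass bowl"),
    ([(20, 200, 30)] * 10, "green manure in the veg field"),
    ([(255, 130, 5)] * 8 + [(0, 0, 0)] * 2,
     "A box of oranges was just catching the sun, bringing out detail in the skin."),
]
print(retrieve_phrases([(250, 140, 0)] * 10, database))
```

The same retrieval shape carries over to the VP and PP slides below by swapping the descriptor (shape/pose, stuff color, or a scene descriptor).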

Getting NPs: objects
Elephant: The muddy elephant; An elephant; small elephant; A very large and seemingly old elephant; musk male elephant; African elephant; the temple elephant
Duck: a lonesome duck; a native new zealand duck; The duck; male Mallard duck; several other ducks; a so-called navigation duck; this duck; a duck; duck; mandarin duck
Flower: Fushia flower; a flower; a pink zinna flower; This beautiful flower; a roman pink flower; a tiny pink flower; pink bursting flowers; a perfectly pink gerbera daisy

Getting VPs: objects
Detect: cow. Find matching cow detections by shape/pose similarity.
theses cows live in the field behind my house
The cow was more interested in eating than looking at me with a camera!
A cow eating flowers in the south of the Netherlands.
While cycling north on Tremaine Road near Milton, this cow gazed across the road intently.

Getting PPs: stuff
Detect: grass. Find matching grass detections by color similarity.
green manure in the veg field - Plaw Hatch
I am happy in a field of green Maryland grass
Sheep in a field spotted during a coastal drive from Tramore to
Found on hawthorn in boggy grass field

Getting PPs: scenes
Extract a scene descriptor. Find matching images by scene similarity.
Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere
I'm about to blow the building across the street over with my massive lung power.
Only in Paris will you find a bottle of wine on a table outside a bookstore
View from our B&B in this photo

Composing captions
Inputs: object color, object pose, scene, stuff
NP (object color): the sheep
VP (object pose): meandered along a desolate road
PP_scene: in the highlands of Scotland
PP_stuff: through frozen grass
Various composition patterns: NP VP; NP PP_stuff; NP PP_scene; NP VP PP_scene PP_stuff
Result: the sheep meandered along a desolate road in the highlands of Scotland through frozen grass
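The "simple rule based concatenation" step amounts to filling one of the composition patterns with the retrieved phrases. A minimal sketch, assuming the phrases arrive in a dict keyed by slot name (a hypothetical interface); patterns whose slots are not all available are skipped.

```python
from typing import Dict, List, Tuple

# The composition patterns listed on the slide.
PATTERNS: List[Tuple[str, ...]] = [
    ("NP", "VP"),
    ("NP", "PP_stuff"),
    ("NP", "PP_scene"),
    ("NP", "VP", "PP_scene", "PP_stuff"),
]


def compose_all(phrases: Dict[str, str]) -> List[str]:
    """Concatenate phrases for every pattern whose slots are all filled."""
    return [
        " ".join(phrases[slot] for slot in pattern)
        for pattern in PATTERNS
        if all(slot in phrases for slot in pattern)
    ]


phrases = {
    "NP": "the sheep",
    "VP": "meandered along a desolate road",
    "PP_scene": "in the highlands of Scotland",
    "PP_stuff": "through frozen grass",
}
for caption in compose_all(phrases):
    print(caption)
```

How to choose among the candidate captions (e.g., preferring the pattern that uses the most evidence) is left open here.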

Good results
A duck was having a bath in the harbor at whitehaven, cumbria, england in the water near Camley St
A female Monarch butterfly was visiting the plant in my front yard in Devon 17/10/10
her flower girl dress designed by Mainbocher in the house
A double-decker bus under some spreading shade trees
Stained glass window depicting Christ and numerous saints in Washington National Cathedral in the Eglise
cat enjoys hiding under the tree

Not so good results
Language issues: A Moo cow tied up around the city eating grass in various places under the tree at the young tree male tiger sighting in twelve months of a street
Vision issues: a girl walking by in a green field in the sun; The silhouetted building and cross stands under water around Loon Mountain
Just plain silly: bike was left here by an ancient civilization not as sophisticated as our own in the grass of granite dogs running pic, this time, racing through the sea at Fraisthorpe near Bridlington of Christmas tree in bed

Open question...
Can we do this without using pre-defined object/scene/etc. detectors?
1. Build a representation of each image in the database
2. Build a representation of the test image
3. Find the 10 most similar database images
4. Merge their NL descriptions using text-to-text generation techniques
Q: Where do these representations come from???
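The first three steps of this recipe can be sketched as nearest-neighbour retrieval over image representations. The merging step is replaced here by a much simpler stand-in, not text-to-text generation: it picks the retrieved caption with the highest total word overlap (Jaccard similarity) with the others. The image vectors are hypothetical; where good representations come from is exactly the open question on the slide.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_captions(
    query: Sequence[float], database: List[Tuple[Sequence[float], str]], k: int = 10
) -> List[str]:
    """Captions of the k database images whose vectors are most similar to the query."""
    ranked = sorted(database, key=lambda item: cosine(query, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


def consensus_caption(captions: List[str]) -> str:
    """Stand-in for merging: the caption sharing the most words with the others."""
    token_sets = [set(c.lower().split()) for c in captions]

    def support(i: int) -> float:
        return sum(
            len(token_sets[i] & token_sets[j]) / max(1, len(token_sets[i] | token_sets[j]))
            for j in range(len(captions))
            if j != i
        )

    return captions[max(range(len(captions)), key=support)]


# Hypothetical 2-D image representations paired with their captions.
database = [
    ([1.0, 0.0], "a monkey sitting in a tree"),
    ([0.9, 0.1], "monkey in the tree canopy"),
    ([0.8, 0.2], "a monkey in a tree in the garden"),
    ([0.0, 1.0], "a double-decker bus"),
]
neighbors = nearest_captions([1.0, 0.0], database, k=3)
print(consensus_caption(neighbors))
```

Selecting a single consensus caption cannot produce new sentences; genuine merging would need the text-to-text generation the slide mentions.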

And why are we trying to do this...???
Captioning the world for people with visual impairments
But the captions we have are not really descriptive of the world
Use vision to ground out language
Is it turtles all the way down?
That's how babies work! Sadly we don't have baby-esque robots yet

Why work on a task at all?
A solution is of benefit to society
The process focuses attention on phenomena that are worthy of study
René Descartes (1596-1650)
What is worthy of study? (IMO)
Low-level linguistic phenomena that hide in the tail
Human-like abilities to generalize from small data
Very basic learning of correlations between different modalities (operant conditioning)

What about 2nd language learning?
Obvious problems:
Assumes knowledge of the 1st language
Assumes knowledge of the world
Still don't have a robot...
But we do have software with exercises for SLA
It's hard for people, too!

Aspects of computational 2ndLL
Very specific linguistic variants: number, case, agreement, etc.
Not enough to get the majority case
Focus on subtle visual differences

Aspects of computational 2ndLL
AI-style reasoning & one-shot learning
It's learnable. Proof of concept:

What is needed to solve this?
A linguistic model over character sequences (words not okay!) without any language-specific background
Pre-trained (?) visual detectors for objects, poses, and physical relationships (e.g., gaze)
Ability to reason and generalize from a few examples

Thanks! Questions?