Visual Dialog. Devi Parikh


VQA Visual Dialog Devi Parikh

People coloring a street on a college campus

It was a great event! It brought families out, and the whole community together.

Q. What are they coloring the street with? A. Chalk

AI: What a nice picture! What event was this?
User: Color College Avenue. It was a lot of fun!
AI: I am sure it was! Do they do this every year?
User: I wish they would. I don't think they've organized it again since 2012.

Aid visually-impaired users
Peter just uploaded a picture from his vacation in Hawaii
Great, is he at the beach?
No, on a mountain

Aid situationally-impaired analysts
Did anyone enter this room last week?
Yes, 127 instances logged on camera
Were any of them carrying a black bag?

Natural language instructions for robots
Is there smoke in any room around you?
Yes, in one room
Go there and look for people
Image Credit: Lockheed Martin; DARPA Robotics Challenge

Outline: Visual Question Answering, Visual Dialog

Visual Question Answering (VQA)
Question: What is the mustache made of?
AI System answer: bananas
www.visualqa.org

Visual Question Answering (VQA): example questions
What color are her eyes?
What is the mustache made of?
How many slices of pizza are there?
Is this a vegetarian pizza?
Is this person expecting company?
What is just under the tree?
Does it appear to be rainy?
Does this person have 20/20 vision?

VQA Dataset [Antol et al., ICCV 2015]
>0.25 million images
254,721 images (COCO)
50,000 scenes
>0.76 million questions
~10 million answers
Questions: "Stump a smart robot! Ask a question that a human can answer, but a smart robot probably can't!"

Papers using VQA

VQA Challenge @ CVPR16
Winning entry (MCB): Open-ended 66%, Multiple-choice 70%
~30 teams

The Power of Language Priors
A giraffe is standing in grass next to a tree
Is there a clock? yes (98%)
Is the man wearing glasses? yes (94%)
Are the lights on? yes (85%)
Do you see a ...? yes (87%)
Is the man standing? no (69%)
What sport is ...? tennis (41%)
How many ...? 2 (39%)
What animal is ...? dog (35%)
Slide credit: Yash Goyal and Peng Zhang

Balancing the VQA dataset

VQA v2.0
More balanced than VQA v1.0: entropy of answers increases by 56%
Bigger than VQA v1.0: ~1.8 times the image-question pairs
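The slide does not say exactly how the answer entropy was measured. As a rough illustration only, the sketch below computes the Shannon entropy of an empirical answer distribution; the function name `answer_entropy` and both toy answer lists are assumptions made for this example, not data from the VQA datasets.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical distribution over answers."""
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical answer lists for one question type (not real VQA statistics).
skewed = ["yes"] * 9 + ["no"]        # a strong prior toward "yes"
balanced = ["yes"] * 5 + ["no"] * 5  # the kind of distribution balancing aims for

print(answer_entropy(skewed))
print(answer_entropy(balanced))
```

A more balanced answer distribution gives a higher entropy, which is the direction of change the slide reports.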

Benchmarking SOTA VQA models
SOTA VQA models: drop in performance by 7-8%; gain 1-2% back when re-trained on balanced dataset
By answer types: biggest drop in performance in yes/no (10-12%); biggest improvement gained by re-training in yes/no (3-4%) and number (2-3%)

Trends (chart; values shown: 0.15%, 1.51%, 7.03%, 3.5%)

VQA v2.0: 2nd VQA Challenge @ CVPR17!

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (CVPR 2017)
Yash Goyal (Virginia Tech), Tejas Khot (Virginia Tech), Doug Summers-Stay (Army Research Lab), Dhruv Batra (Georgia Tech / FAIR), Devi Parikh (Georgia Tech / FAIR)

(Another) problem with existing setup
Train: Q: What color is the dog? A: White
Training prior (answers shown): white, red, blue, green, yellow
Test: Q: What color is the dog? A: Black. Prediction: White
Slide credit: Aishwarya Agrawal

(Another) problem with existing setup
Train: Q: Is the person wearing shorts? A: No
Training prior (answer shown): no
Test: Q: Is the person wearing shorts? A: Yes. Prediction: No
Figure labels: female, woman
Slide credit: Aishwarya Agrawal

(Another) problem with existing setup
Similar priors in train and test
Memorization does not hurt as much
Problematic for benchmarking progress
Slide credit: Aishwarya Agrawal
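To make the point about priors concrete, here is a minimal sketch of a "blind" baseline that ignores the image and answers each question with its most frequent training answer. The helper names and the toy question-answer pairs are assumptions for illustration; they are not taken from VQA or VQA-CP.

```python
from collections import Counter, defaultdict

def fit_prior(train_pairs):
    """Map each question string to its most frequent training answer."""
    counts = defaultdict(Counter)
    for question, answer in train_pairs:
        counts[question][answer] += 1
    return {q: c.most_common(1)[0][0] for q, c in counts.items()}

def prior_accuracy(prior, test_pairs):
    """Accuracy of answering from the training prior alone (no image input)."""
    hits = sum(prior.get(q) == a for q, a in test_pairs)
    return hits / len(test_pairs)

Q = "What color is the dog?"
train = [(Q, "white")] * 3 + [(Q, "black")]
same_prior_test = [(Q, "white")] * 3 + [(Q, "black")]     # test prior matches train
changed_prior_test = [(Q, "black")] * 3 + [(Q, "white")]  # test prior differs

prior = fit_prior(train)
print(prior_accuracy(prior, same_prior_test))
print(prior_accuracy(prior, changed_prior_test))
```

When the test split shares the training prior, memorization scores well; when the prior changes, the same baseline does poorly, which is the motivation the slides give for a changing-priors split.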

Meet VQA-CP! Visual Question Answering under Changing Priors A new split of the VQA v1.0 dataset (Antol et al., ICCV 2015) Slide credit: Aishwarya Agrawal

VQA-CP Train Split, VQA-CP Test Split
Slide credit: Aishwarya Agrawal

Performance of VQA models on VQA-CP
(Antol et al. ICCV15): 31% drop
(Andreas et al. CVPR16): 25% drop
(Yang et al. CVPR16): 29% drop
(Fukui et al. EMNLP16): 27% drop
Slide credit: Aishwarya Agrawal

Grounded-VQA (GVQA)
Inputs: Image (I), Question (Q)
Diagram labels: VGG Extractor; Q main; Visual Concept Classifier (VCC) with Att Hop 1 and Att Hop 2; concepts grouped into clusters (bus, car, cone; red, yellow, green; 5, 2, 3); Answer Predictor (AP); Answer Cluster Predictor (ACP); Visual Verifier (VV); Question Classifier; LSTM; POS-tags-based extractor; GloVe; concept clusters (object, color, number); Concept Extractor (CE)
Outputs: VQA Answers (998); VQA Answers (Yes / No)
Slide credit: Aishwarya Agrawal

Aishwarya Agrawal (Virginia Tech), Dhruv Batra (Georgia Tech / FAIR), Devi Parikh (Georgia Tech / FAIR), Ani Kembhavi (AI2)

C-VQA: Compositional VQA

Aishwarya Agrawal (Virginia Tech), Dhruv Batra (Georgia Tech / FAIR), Devi Parikh (Georgia Tech / FAIR)

Outline: Visual Question Answering, Visual Dialog

A man and a woman are holding umbrellas
Q: What color is his umbrella?
A: His umbrella is black
Q: What about hers?
A: Hers is multi-colored
Q: How many other people are in the image?
A: I think 3. They are occluded
Q: How many are men?

References highlighted across rounds:
"his" refers to "man"; "his umbrella" to "umbrella"
"hers" refers to "woman" / "her" and to "umbrella"
"other people" is relative to "man and a woman"
"How many are men?" refers to "3" / "other people"

Visual Dialog: Task
Given: Image I; history of human dialog (Q_1, A_1), (Q_2, A_2), ..., (Q_{t-1}, A_{t-1}); follow-up question Q_t
Task: produce free-form natural language answer A_t

Visual Dialog: Evaluation Protocol
Given: Image I; history of human dialog (Q_1, A_1), (Q_2, A_2), ..., (Q_{t-1}, A_{t-1}); follow-up question Q_t; 100 answer options
Answer options: 50 answers from NN questions, 30 popular answers, 20 random answers
Evaluation task: rank the list of 100 options
Accuracy/Error: mean-rank-of-gt, mean-reciprocal-rank
Example question: Do people look happy?
GT: Not really
Other options: Yes they do; I can't tell; Not facing me; Yes they look happy; Yes; I can only see 1 of their faces but she looks happy; Not really but not unhappy either
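The two ranking metrics named in the evaluation protocol can be sketched as follows. This is a minimal illustration, assuming each model assigns a score to every candidate answer and higher scores rank first; the function names and the toy scores are hypothetical.

```python
def rank_of_ground_truth(scores, gt_index):
    """1-based rank of the ground-truth option when options are sorted by score, high to low."""
    gt_score = scores[gt_index]
    return 1 + sum(s > gt_score for s in scores)

def evaluate(ranks):
    """Return (mean rank of ground truth, mean reciprocal rank) over a list of ranks."""
    mean_rank = sum(ranks) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    return mean_rank, mrr

# Toy example with 3 options instead of 100; index 2 is the ground truth.
example_rank = rank_of_ground_truth([0.1, 0.7, 0.2], gt_index=2)
mean_rank, mrr = evaluate([1, 2, 4])
print(example_rank, mean_rank, mrr)
```

Lower mean rank is better and higher MRR is better, so the two numbers move in opposite directions as a model improves.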

VisDial Dataset
Live two-person chat on Amazon Mechanical Turk
Roles: Questioner, Answerer
(C) Dhruv Batra

VisDial v0.9 Stats
>120k images (from COCO)
1 dialog/image
10 question-answer rounds/dialog
Total of >1.2 million dialog QA pairs

visualdialog.org

Models for Visual Dialog
Encoder
1. Late Fusion
2. Hierarchical Recurrent Encoder
3. Memory Network
Decoder
1. Generative
o During training, maximizes LL of human response
o For evaluation, ranks options by LL scores
2. Discriminative
o Learn to rank 100 options

Visual Dialog Model #3 Memory Network Encoder Slide credit: Abhishek Das


Results
Memory Network (generally) performs best
0.53 MRR / ~17 mean rank (Generative)
0.60 MRR / ~5.5 mean rank (Discriminative)
Slide credit: Abhishek Das

Results
Q: Is the bottle open? A: Yes
Q: Red or white wine? A: Red
Q: Is there only 1 glass? A: Yes
Q: Any food? A: No
Q: What color is the table? A: Brown
Q: Is it in a restaurant? A: No
Q: Is it in the kitchen? A: Yes

Q: What color is the building? A: Brown
Q: What color is the clock? A: Black and white
Q: Is it a digital or analog clock? A: Analog
Q: What color are the numbers? A: Black
Q: Are all of the numbers visible? A: Yes
Q: Is it an outdoor view? A: Yes
Q: Are there any people? A: No
Q: Can you see other buildings? A: No
Slide credit: Abhishek Das

Results
Q: What color is the cat? A: Orange and white
Q: Is the cat on the floor? A: Yes
Q: What is the cat on? A: I can't tell
Q: Can you see anything in the mirror? A: No, just the cat and the cat
Q: Can you see a ceiling light? A: No
Q: What color are the cat's eyes? A: I can't see the eyes
Q: What color is the cat's nose? A: Orange
Slide credit: Abhishek Das

Results
Q: What color is the vase? A: White
Q: Can you tell what kind of flowers? A: I can't tell
Q: Is there any water in the vase? A: Yes
Q: How much water is there? A: A small amount
Q: What is the vase on? A: Can't tell
Q: Does this look like someone's home? A: Can't tell
Slide credit: Abhishek Das

Visual Dialog (CVPR 2017)
Abhishek Das (Georgia Tech), Satwik Kottur (CMU), Khushi Gupta (CMU), Avi Singh (UC Berkeley), Deshraj Yadav (Virginia Tech), José Moura (CMU), Devi Parikh (Georgia Tech / FAIR), Dhruv Batra (Georgia Tech / FAIR)

Caption: a man and a dog on a motorcycle.
Q1: what is the man wearing? A1: green shirt and pants
Q2: what is he doing? A2: he's on a motorcycle.
Q3: how old is the man? A3: maybe in his 40s

Diagram labels: Generator; Discriminator; HCIAE Encoder; LSTM; Gumbel Sampler; deep metric learning; inputs: Image, Question, History
Legend: a_t^gt: ground truth answer; a_N: negative answer N; e_t: encoder feature; f(): embedding function

Quantitative: ground truth response scores higher more often
Qualitative: responses are more informative, longer, and more diverse
Slide credit: Jiasen Lu

Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model (arXiv)
Jiasen Lu (Virginia Tech), Jianwei Yang (Georgia Tech), Anitha Kannan (Facebook AI Research), Dhruv Batra (Georgia Tech / FAIR), Devi Parikh (Georgia Tech / FAIR)

Open directions
Improve dialog agents via self-talk: no additional human intervention. Are these agents better at human-bot interaction?
Domain adaptation via self-talk: no need to collect a new dataset for each domain
Dialog rollouts, future prediction, theory of mind, ...

Conclusion
Natural progression in Vision+Language: Captioning, VQA, Visual Dialog
VQA: elevating the role of image understanding (balancing, changing priors, compositional)
Visual Dialog: new AI task. Challenges: memory, history, reasoning over time
VisDial dataset: live 2-person chat on AMT; 120k COCO images, 1 dialog/image, ~1.2 million dialog QA pairs
Visual Dialog models (neural encoder-decoders): Late Fusion, Hierarchical Recurrent Encoder, Memory Network

Thank you.

Visual Dialog: Towards AI agents that can see, talk, and act Dhruv Batra

Outline
Cooperative Visual Dialog Agents
Emergence of Grounded Dialog
Negotiation Dialog Agents

Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning [ICCV 17] Abhishek Das* (Georgia Tech) Satwik Kottur* (CMU) José Moura (CMU) Stefan Lee (Virginia Tech) Dhruv Batra (Georgia Tech)

Visual Dialog: Task
Given: Image I; history of human dialog (Q_1, A_1), (Q_2, A_2), ..., (Q_{t-1}, A_{t-1}); follow-up question Q_t
Task: produce free-form natural language answer A_t
(C) Dhruv Batra

Problems
No goal: why are we talking?
Agent not in control: artificially injected at every round into a human conversation; can't steer conversation; doesn't get to see its errors during training
Learning equivalent utterances: many ways of answering the same question that should be treated equally, but aren't. Is log-likelihood of human response really a good metric?

Image Guessing Game
Q-Bot: asks questions, is blindfolded
A-Bot: answers questions, sees an image
Slide Credit: Abhishek Das

RL for Cooperative Dialog Agents
Agents: (Q-bot, A-bot)
Environment: Image
Action:
Q-bot: question (symbol sequence) q_t, e.g. "Any people in the shot?"
A-bot: answer (symbol sequence) a_t, e.g. "No, there aren't any."
Q-bot: image regression
State: Q-bot, A-bot
Policy: Q-bot, A-bot
Reward

Policy Networks (Q-Bot, A-Bot)
Diagram labels: VGG-16, Fact Embedding, History Encoder
Example used in the build:
Two zebra are walking around their pen at the zoo.
Is this zoo? Yes
How many zebra? Two
Slide Credit: Abhishek Das

Policy Gradients: REINFORCE Gradients
Slide Credit: Abhishek Das
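The slide's equations did not survive extraction. As a generic reminder of the REINFORCE estimator (not the talk's exact formulation), the sketch below computes a single-sample gradient for a softmax policy over a small discrete action set; the function names, the three-action setup, and the learning rate are assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_gradient(logits, action, reward):
    """Single-sample REINFORCE estimate: reward * d log pi(action) / d logits.

    For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi(i).
    """
    probs = softmax(logits)
    return [reward * ((1.0 if i == action else 0.0) - p) for i, p in enumerate(probs)]

def ascent_step(logits, grad, lr=0.5):
    """One gradient-ascent step on the expected reward."""
    return [x + lr * g for x, g in zip(logits, grad)]

logits = [0.0, 0.0, 0.0]                                   # uniform policy over 3 actions
grad = reinforce_gradient(logits, action=1, reward=1.0)    # action 1 was sampled and rewarded
updated = ascent_step(logits, grad)
print(softmax(updated))
```

A positive reward raises the probability of the sampled action and a negative reward lowers it; in the dialog setting the "action" would be a generated token sequence rather than a single discrete choice.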

Turing Test

SL vs RL: SL Agents, RL Agents

Image Guessing

Concurrent Work


Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog [EMNLP 17] Satwik Kottur* (CMU) José Moura (CMU) Stefan Lee (Virginia Tech) Dhruv Batra (Georgia Tech)

Toy World
Sanity check: a simple, synthetic world
Attributes and values:
shape: triangle, square, circle, star
color: blue, green, red, purple
style: filled, dashed, dotted, solid
Instances: (shape, color, style); total of 4^3 (64) instances
Example instances: (triangle, purple, filled), (square, blue, solid), (circle, blue, dotted)
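The toy world is small enough to enumerate directly. The following sketch builds every (shape, color, style) instance from the attribute values on the slide; the constant names are chosen here for illustration.

```python
from itertools import product

SHAPES = ["triangle", "square", "circle", "star"]
COLORS = ["blue", "green", "red", "purple"]
STYLES = ["filled", "dashed", "dotted", "solid"]

# Every instance is a (shape, color, style) triple: 4 * 4 * 4 = 64 in total.
INSTANCES = list(product(SHAPES, COLORS, STYLES))

print(len(INSTANCES))
print(("triangle", "purple", "filled") in INSTANCES)
```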

Task & Talk
Task (G): inquire pair of attributes, e.g. (color, shape), (shape, color)
Instance: (purple, square, filled)
Talk: single token per round, two rounds
Example for task (color, shape): Q1: Y, A1: 2, Q2: Z, A2: 3
Q-bot guesses a pair. Reward: +1 / -1
Prediction order matters! Guess: (purple, square). Get reward!
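The reward rule on this slide, including the point that prediction order matters, can be sketched as a small function. It assumes instances are stored as (shape, color, style) triples as in the toy-world definition; `ATTRIBUTE_INDEX` and `task_reward` are hypothetical names introduced for this example.

```python
# Position of each attribute inside a (shape, color, style) instance (assumed layout).
ATTRIBUTE_INDEX = {"shape": 0, "color": 1, "style": 2}

def task_reward(instance, task, guess):
    """Return +1 if the guess matches the requested attributes in the requested order, else -1."""
    target = tuple(instance[ATTRIBUTE_INDEX[name]] for name in task)
    return 1 if tuple(guess) == target else -1

instance = ("square", "purple", "filled")
task = ("color", "shape")

print(task_reward(instance, task, ("purple", "square")))  # right values, right order
print(task_reward(instance, task, ("square", "purple")))  # right values, wrong order
```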

Emergence of Grounded Dialog
T: (style, color), P: (solid, green). Tokens: X, 3, Z, 4. Grounding: color? green; style? solid
T: (style, shape), P: (filled, triangle). Tokens: Y, 1, Z, 2. Grounding: shape? triangle; style? filled

Emergence of Grounded Dialog
Compositional grounding
Predict dialog for unseen instances
Task (color, shape) Q1: Y Q2: X Q1: 2 Q2: 2

Summary of findings

Setting           V_Q   V_A   Memory (Q-bot)   Memory (A-bot)   Generalization
A. Overcomplete   64    64    Yes              Yes              25.6%
B. Attribute      3     12    Yes              Yes              38.5%
C. Minimal        3     4     Yes              No               74.4%

Characteristics
A. Overcomplete: non-compositional language; Q-bot insignificant; inconsistent A-bot grounding; poor generalization
B. Attribute: non-compositional language; Q-bot uses one round to convey task; inconsistent A-bot grounding; poor generalization
C. Minimal: compositional language; Q-bot uses both rounds; consistent A-bot grounding; good generalization

Deep Multi-Agent Communication

NIPS 16
[DeepMind] Learning to Communicate with Deep Multi-Agent Reinforcement Learning. Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, Shimon Whiteson. NIPS 16.
[NYU / FAIR] Learning Multiagent Communication with Backpropagation. Sainbayar Sukhbaatar, Arthur Szlam, Rob Fergus. NIPS 16.

Arxiv 17
[OpenAI] Emergence of Grounded Compositional Language in Multi-Agent Populations. Igor Mordatch, Pieter Abbeel.
[FAIR] Multi-Agent Cooperation and the Emergence of (Natural) Language. Angeliki Lazaridou, Alexander Peysakhovich, Marco Baroni.
Learning to play guess who? and inventing a grounded language as a consequence. Emilio Jorge, Mikael Kågebäck, and Emil Gustavsson.
Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. Serhii Havrylov and Ivan Titov.
[Berkeley] Translating neuralese. Jacob Andreas, Anca Dragan and Dan Klein. ACL 2017.


Deal or No Deal? End-to-End Learning for Negotiation Dialogues [EMNLP 17] Mike Lewis (FAIR) Denis Yarats (FAIR) Yann Dauphin (FAIR) Devi Parikh (Georgia Tech) Dhruv Batra (Georgia Tech)

Why Negotiation?
Negotiation sits between adversarial and cooperative settings
Negotiation is useful when agents have different goals and not all can be achieved at once (all the time)
Both a linguistic and a reasoning problem:
Interpret multiple sentences, and generate new message
Plan ahead, make proposals, counter-offers, bluffing, lying, compromising
Slide Credit: Mike Lewis

Framework
Both agents are given a reward function, and can't observe each other's
Dialogue until they agree on common action
Both agents independently select agreement
If agents agree, they are given reward
Diagram labels: Agent 1 Goals, Agent 1 Output, Agent 1 Reward; Dialog; Agent 2 Goals, Agent 2 Output, Agent 2 Reward
Slide Credit: Mike Lewis

Object Division Task
Agents are shown the same set of objects but different values for each
Asked to agree on how to divide the objects between them
Example values: 2 points each, 1 point each, 5 points each
Slide Credit: Mike Lewis
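One agent's score for a proposed division can be sketched as a weighted sum. The item names, the assignment of the slide's point values to items, and the `agreed` flag (zero reward without agreement) are assumptions for illustration; the slides only state that agents receive reward if they agree.

```python
def division_score(values, allocation, agreed=True):
    """Points one agent earns: value per item times the number of that item it receives.

    Returns 0 when no agreement was reached (an assumption made for this sketch).
    """
    if not agreed:
        return 0
    return sum(values[item] * count for item, count in allocation.items())

# Hypothetical private values for one agent, using the point values shown on the slide.
values = {"book": 2, "hat": 1, "ball": 5}
allocation = {"book": 1, "ball": 1}  # this agent receives one book and the ball

print(division_score(values, allocation))
```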

Multi-Issue Bargaining
I'd like the ball and hats
I need the hats, you can have the ball
Ok, if I get both books?
Ok, deal
Slide Credit: Mike Lewis

Data Collection on AMT Slide Credit: Mike Lewis

Dataset
~6k dialogs
Average 6.6 turns/dialog
Average 7.6 words/turn
80% agreed solutions
77% Pareto Optimal solutions
Slide Credit: Mike Lewis
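The slide reports the share of Pareto-optimal solutions without spelling out the check. One common definition is that no other division makes one agent better off without making the other worse off. The sketch below tests that by brute-force enumeration; the item counts, both value tables, and all function names are hypothetical.

```python
from itertools import product

def score(values, allocation):
    """Total points for an allocation under one agent's per-item values."""
    return sum(values[item] * n for item, n in allocation.items())

def all_splits(counts):
    """Yield every way to divide the items between agent A (mine) and agent B (theirs)."""
    items = sorted(counts)
    for taken in product(*(range(counts[item] + 1) for item in items)):
        mine = dict(zip(items, taken))
        theirs = {item: counts[item] - mine[item] for item in items}
        yield mine, theirs

def is_pareto_optimal(counts, values_a, values_b, mine):
    """True if no other split is at least as good for both agents and strictly better for one."""
    theirs = {item: counts[item] - mine.get(item, 0) for item in counts}
    a, b = score(values_a, mine), score(values_b, theirs)
    for other_mine, other_theirs in all_splits(counts):
        oa, ob = score(values_a, other_mine), score(values_b, other_theirs)
        if oa >= a and ob >= b and (oa > a or ob > b):
            return False
    return True

COUNTS = {"book": 2, "hat": 2, "ball": 1}    # hypothetical item pool
VALUES_A = {"book": 2, "hat": 1, "ball": 5}  # hypothetical values for agent A
VALUES_B = {"book": 5, "hat": 1, "ball": 2}  # hypothetical values for agent B

print(is_pareto_optimal(COUNTS, VALUES_A, VALUES_B, {"book": 0, "hat": 2, "ball": 1}))
print(is_pareto_optimal(COUNTS, VALUES_A, VALUES_B, {"book": 2, "hat": 0, "ball": 0}))
```

Enumeration is feasible here because the item pools are tiny; the number of splits is the product of (count + 1) over items.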

Baseline Model
Language model predicts both agents' tokens: <write> Give me both books <read> ok deal
Read input at each timestep
Attention over complete dialogue
Separate classifier for each output
Diagram labels: Input Encoder, Output Decoder
Slide Credit: Mike Lewis

SL-Pretraining
Train to maximize likelihood of human-human dialogues
Decode by sampling likely messages
Model knows nothing about task, just tries to imitate human actions
Agrees too easily
Can't go beyond human strategies
Slide Credit: Mike Lewis

Goal-based RL-Finetuning
Generate dialogues using self-play (example shown: reward = 9 points)
Backpropagate reward using REINFORCE
Very sensitive to hyperparameters
Interleave with supervised updates
Slide Credit: Mike Lewis

Dialog Rollouts: Goal-based Decoding
Dialog rollouts use model to simulate remainder of conversation
Average scores to estimate future reward
Slide Credit: Mike Lewis
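The rollout idea can be sketched independently of any particular dialog model. In the sketch below, `simulate` is a hypothetical callable standing in for "use the model to play out the rest of the conversation and return the final score"; the function names, the number of rollouts, and the toy scores are assumptions for illustration.

```python
import random

def estimate_future_reward(candidate, simulate, num_rollouts, rng):
    """Average the final scores of several simulated continuations of one candidate message."""
    total = sum(simulate(candidate, rng) for _ in range(num_rollouts))
    return total / num_rollouts

def choose_by_rollouts(candidates, simulate, num_rollouts=10, rng=None):
    """Pick the candidate message whose rollouts have the highest average score."""
    rng = rng or random.Random(0)
    return max(
        candidates,
        key=lambda c: estimate_future_reward(c, simulate, num_rollouts, rng),
    )

# Toy stand-in for a dialog model: each candidate has a fixed final score.
toy_scores = {"Ok deal": 1.0, "I need the book and one hat": 5.0}

def toy_simulate(candidate, rng):
    return toy_scores[candidate]

print(choose_by_rollouts(list(toy_scores), toy_simulate))
```

With a real model the simulated scores would vary between rollouts, which is why several are averaged before comparing candidates.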

Intrinsic Evaluation
Bar chart: Perplexity and Average Rank for Likelihood vs. Reinforce
Supervised learning gives most human-like dialog
Slide Credit: Mike Lewis

End-to-End Evaluation against SL negotiators
Bar charts: Relative Score (all), Relative Score (agreed), % Agreed, % Pareto Optimal for SL, RL, SL+Rollouts
Slide Credit: Mike Lewis

End-to-End Evaluation against Turkers
Bar charts: Relative Score (all), Relative Score (agreed), % Agreed, % Pareto Optimal for SL, RL, SL+Rollouts
Slide Credit: Mike Lewis

6 1 0 I need the book and hats Can I have the hats and book? 3 1 3 I need the book and 2 hats I can not make that deal. I need the ball and book, you can have the hats No deal then No deal doesn't work for me sorry Sorry, I want the book and one hat Ok deal How about I give you the book and I keep the rest
Model generates meaningful novel language
Slide Credit: Mike Lewis

2 1 4 0 10 0 I would like the ball and two hats I need the book and 3 hats That would work for me. I can take the ball and 1 hat
Model can be deceptive to achieve its goals
Slide Credit: Mike Lewis

Conclusion
Negotiation is useful and challenging
End-to-end approach trades cheaper data for difficult modelling
Goal-based training and decoding improves over likelihood
Model can generate meaningful language and be deceptive to achieve its goals
Slide Credit: Mike Lewis


Sneak Peek: Inner Dialog: Pragmatic Visual Dialog Agents that Rollout a Mental Model of their Interlocutors

Inner Dialog

So far: Vision + Language (Captioning, VQA, Visual Dialog)
What next? Interacting with an intelligent agent
Perceive + Communicate + Act: Vision + Language + Reinforcement Learning
"Ok Google, can you find my picture where I was wearing this red shirt? And order me a new one?"

Agents in Virtual Environments: AI2 Thor

Teaching with natural language: "No, not that shirt. This one."

Machine Learning & Perception Group Qing Sun Aishwarya Agrawal PhD Yash Goyal Michael Cogswell Dhruv Batra Assistant Professor Abhishek Das Ashwin Kalyan Aroma Mahendru Akrit Mohapatra Postdoc Stefan Lee MS Deshraj Yadav Tejas Khot Viraj Prabhu Interns

Computer Vision Lab

Thanks!