FOIL it! Find One mismatch between Image and Language caption


FOIL it! Find One mismatch between Image and Language caption ACL, Vancouver, 31st July, 2017 Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurelie Herbelot, Moin Nabi, Enver Sangineto, Raffaella Bernardi {firstname.lastname}@unitn.it https://foilunitn.github.io

Overview: Research Question. Do language and vision models genuinely integrate both modalities, plus their interaction?

Overview: Image Captioning. Example captions for an image: "People riding bicycles down the road approaching a pigeon." / "A group of people on bicycles coming down a street."

Overview: Visual Question Answering. Question: "How many people are riding a bicycle?" Answer: "three".

Overview: Our contribution. The FOIL dataset and tasks serve as a (challenging) benchmark for state-of-the-art (SoA) models. Take-home: current models fail to deeply integrate the two modalities.

Related Work: Binary Forced-Choice Tasks (Hodosh and Hockenmaier, 2016). Given two captions, an original and a distractor, an image captioning model has to pick one; models fail to pick the original caption. Limitations: it is hard to pinpoint the reason for a model's failure, since multiple words change simultaneously; and the task is comparatively easy, since the model only has to choose between two captions. [Micah Hodosh and Julia Hockenmaier. Focused Evaluation for Image Description with Binary Forced-Choice Tasks. VL (ACL), 2016]

Related Work: CLEVR Dataset (Johnson et al., 2016). An artificial dataset to evaluate visual reasoning, used to analyse the shortcomings of VQA models. Limitations: a task-specific model achieves super-human performance (Santoro et al., 2017), and some questions are hard for humans to answer. [Johnson et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. CVPR, 2017; Santoro et al. A Simple Neural Network Module for Relational Reasoning. arXiv, 2017]

Motivation. We need to generate resources automatically, with less effort; we need tasks for which automatic and human evaluation share the same metric; and we need a diagnostic way to evaluate the limitations of SoA models.

FOIL Dataset. For a given image and its original captions, generate foil captions by replacing one NOUN in the original caption. Original caption: "A person on bike going through green light with red bus nearby in a sunny day." Target word: bus. Foil word: truck. Target-foil pair = bus-truck. Generated foil caption: "A person on bike going through green light with red truck nearby in a sunny day."

FOIL Dataset. Original captions are based on the MS-COCO (Lin et al., 2014) dataset of images and captions. Target-foil pair creation is based on MS-COCO object super-categories: objects within the same super-category are replaced with each other, e.g. cat-dog, car-truck. [Tsung-Yi Lin et al. Microsoft COCO: Common Objects in Context. ECCV, 2014]
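The replacement step above can be sketched as follows. This is a minimal illustration: the super-category table and the caption are toy examples, not the actual MS-COCO data, and the real pipeline applies further filtering criteria (described on the next slides).

```python
# Sketch of FOIL-style foil generation: swap one noun for another noun
# from the same super-category. The table below is a toy subset chosen
# for illustration only.
SUPER_CATEGORY = {
    "vehicle": ["car", "truck", "bus", "boat"],
    "animal": ["cat", "dog", "bird", "zebra"],
}

def foil_candidates(caption, target):
    """Yield (foil_word, foil_caption) pairs for one target noun."""
    words = caption.split()
    for members in SUPER_CATEGORY.values():
        if target in members:
            for foil in members:
                if foil == target:
                    continue
                foiled = [foil if w == target else w for w in words]
                yield foil, " ".join(foiled)

pairs = list(foil_candidates("a cat sitting on the wheel of a red car", "cat"))
# Each candidate differs from the original caption by exactly one noun.
```

Each yielded caption is a candidate foil; the dataset keeps only candidates that also pass the "foil not present" and "salient target" checks.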

FOIL Dataset: Criteria. Foil not present: perform a replacement only if the foil word is not present in the image. Salient target: replace a target word only if it is visually salient. Hardest foil: mine the hardest foil caption using the NeuralTalk (Karpathy and Fei-Fei, 2015) loss. [Andrej Karpathy and Fei-Fei Li. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR, 2015]

FOIL Dataset: Sample. Original captions: 1. "An orange cat hiding on the wheel of a red car." 2. "A cat sitting on a wheel of a vehicle." Generated foil captions: 1. "An orange cat hiding on the wheel of a red boat." 2. "A dog sitting on a wheel of a vehicle."

FOIL Dataset: Composition of the FOIL-COCO dataset.

        # datapoints   # images   # captions   # target-foil pairs
Train   197,788        65,697     395,576      256
Test     99,480        32,150     198,960      216

FOIL Dataset: Proposed Tasks.
Task 1: Binary classification (original or foil)
Task 2: Foil word detection
Task 3: Foil word correction

Proposed Tasks: Task 1, binary classification (original or foil). Given an image and a caption, decide whether the caption is the original or a foil. Original caption: "People riding bicycles down the road approaching a bird." Foil caption: "People riding bicycles down the road approaching a dog." Human performance (AMT): Majority (2/3): 92.89; Unanimity (3/3): 76.32.
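The two human-performance figures (majority vs. unanimity) can be computed from per-item annotator votes roughly as follows. This is a sketch with made-up votes; the exact AMT aggregation procedure is not given on the slide.

```python
def human_accuracy(votes, gold, k):
    """Fraction of items on which at least k of the annotators gave
    the gold answer (k=2 -> majority of 3, k=3 -> unanimity)."""
    correct = sum(1 for item_votes, g in zip(votes, gold)
                  if sum(v == g for v in item_votes) >= k)
    return correct / len(gold)

# Toy data: 3 annotators judge 4 captions as "orig" or "foil".
votes = [("orig", "orig", "orig"),
         ("foil", "foil", "orig"),
         ("orig", "foil", "foil"),
         ("foil", "foil", "foil")]
gold = ["orig", "foil", "foil", "foil"]

majority = human_accuracy(votes, gold, k=2)   # all 4 items have >= 2 correct votes
unanimity = human_accuracy(votes, gold, k=3)  # only items 1 and 4 are unanimous
```

Unanimity is necessarily no higher than majority, which matches the 92.89 vs. 76.32 pattern in the reported numbers.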

Proposed Tasks: Task 2, foil word detection. Given an image and a foil caption, identify the foil word. "People riding bicycles down the road approaching a dog." Where is the mistake in the caption? Human performance (AMT): Majority (2/3): 97.00; Unanimity (3/3): 73.60.

Proposed Tasks: Task 3, foil word correction. Given an image, a foil caption, and the foil word location, correct the foil caption. Can you correct the mistake? "People riding bicycles down the road approaching a dog." → "People riding bicycles down the road approaching a bird."

FOIL Is NOT Equal to Visual Question Answering. In VQA, answers are highly dependent on the (linguistic) context of the question, e.g. "What is the man riding?" for "A person on motorcycle going through green light with red bus nearby in a sunny day." In FOIL, we ask for context-independent, fine-grained information about the image.

FOIL Is NOT Equal to Object Classification/Detection. In standard computer vision tasks, the question is generally "what objects are present in the image?" In FOIL, the question is "what object is NOT in the image?" (foil classification/detection), and "what object is there, given the context?" (correction).

Models Tested: VQA models and an image captioning model.

Models Tested: Baseline Models. Language only (Blind): an LSTM over the question, followed by an MLP. (Diagram: Question → LSTM → MLP)

Models Tested: Baseline Models. CNN + LSTM (Zhou et al., 2015): a CNN over the image and an LSTM over the question, joined by concatenation and followed by an MLP. (Diagram: Question → LSTM; Image → CNN; concatenation → MLP) [Zhou et al. Simple Baseline for Visual Question Answering. arXiv, 2015]

Models Tested: VQA Models. LSTM + norm I (Antol et al., 2015): a CNN over the image and an LSTM over the question, joined by pointwise multiplication and followed by an MLP. (Diagram: Question → LSTM; Image → CNN; pointwise multiplication → MLP) [Antol et al. VQA: Visual Question Answering. ICCV, 2015]
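The two fusion strategies above (concatenation for the CNN + LSTM baseline, pointwise multiplication for LSTM + norm I) differ only in how the image and question features are combined before the MLP. A minimal sketch with toy feature vectors — real models operate on CNN/LSTM outputs of much higher dimensionality:

```python
def fuse_concat(img_feat, txt_feat):
    """CNN + LSTM baseline: concatenate the two feature vectors."""
    return img_feat + txt_feat

def fuse_pointwise(img_feat, txt_feat):
    """LSTM + norm I: elementwise product (requires equal lengths)."""
    assert len(img_feat) == len(txt_feat)
    return [i * t for i, t in zip(img_feat, txt_feat)]

# Toy 3-dimensional features standing in for CNN and LSTM outputs.
img = [0.5, 1.0, 2.0]
txt = [2.0, 0.5, 1.0]
concat = fuse_concat(img, txt)        # length 6: dimensions are stacked
pointwise = fuse_pointwise(img, txt)  # length 3: dimensions interact
```

Concatenation lets the MLP learn the interaction between modalities, while pointwise multiplication builds a multiplicative interaction directly into the fused representation.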

Models Tested: VQA Models. Hierarchical Co-attention (HieCoAttn) (Lu et al., 2016): a CNN over the image and an LSTM over the question; image and question are co-attended alternately and recursively (attention modules Attn1, Attn2, Attn3) before a final MLP. [Lu et al. Hierarchical Question-Image Co-Attention for Visual Question Answering. NIPS, 2016]

Models Tested: Image Captioning Model. Bi-directional IC model (IC-Wang) (Wang et al., 2016): given the image and both past (w1 … wp-1) and future (wp+1 … wn) context, the model predicts the current word wp. [Wang et al. Image Captioning with Deep Bidirectional LSTMs. ACM MM, 2016]

Results, Task 1: Binary Classification (accuracy).

                    Overall   Correct   Foil
Blind               55.62     86.20     25.04
CNN + LSTM          61.07     89.16     32.98
LSTM + norm I       63.26     92.02     34.51
HieCoAttn           64.14     91.89     36.38
IC-Wang             42.21     38.98     45.44
Human (Majority)    92.89     91.24     94.52
Human (Unanimity)   76.32     73.73     78.90

Results, Task 2: Foil Word Detection (accuracy).

                    Only Nouns   All Words
Chance              23.25        15.87
LSTM + norm I       26.32        24.25
HieCoAttn           38.79        33.69
IC-Wang             27.59        23.32
Human (Majority)    -            97.00
Human (Unanimity)   -            73.60

Results, Task 3: Foil Word Correction (accuracy).

                 All Target Words
Chance            1.38
LSTM + norm I     4.7
HieCoAttn         4.21
IC-Wang          22.16

Conclusion. We created a challenging dataset and correspondingly challenging tasks, used to evaluate the limitations of language and vision models; the approach can be extended to other parts of speech (see Shekhar et al., 2017), scenes, etc. Knowing the source of error will help in designing better models. Fine-grained joint understanding of language and vision is needed. [Shekhar et al. Vision and Language Integration: Moving beyond Objects. IWCS, 2017]

Thank You! Q&A. Dataset: https://foilunitn.github.io

Crowdflower instructions: Read and understand the caption, and look carefully at the image. Determine whether the caption provides a correct description of what is depicted in the image. If you judge the caption as "wrong", you will be asked to type the word that makes the caption incorrect.


FOIL Dataset: Criteria (details) — foil not present; salient target.

FOIL Dataset: Criteria — Foil not present. Perform a replacement only if the foil word is not present in the image, i.e. check that the foil word is not used by any other MS-COCO annotator. For example, with captions I. "A boy is running on the beach" and II. "A boy and a little girl are playing on the beach", the target-foil pair boy-girl is rejected: another annotator's caption mentions "girl", so a girl is likely present in the image.
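The "foil not present" check amounts to a scan over all annotators' captions for the same image. A minimal sketch (toy captions; the real data comes from MS-COCO, and the real check would also handle morphology such as plurals):

```python
def foil_is_absent(foil_word, all_captions):
    """Allow the replacement only if no annotator's caption for the
    same image mentions the foil word."""
    return all(foil_word not in c.lower().split() for c in all_captions)

captions = ["a boy is running on the beach",
            "a boy and a little girl are playing on the beach"]
ok = foil_is_absent("girl", captions)  # rejected: another annotator used "girl"
```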

FOIL Dataset: Criteria — Salient target. Replace a target word only if it is visually salient in the image, based on annotator agreement, i.e. more than one annotator used the target word. For example, given the captions: I. "Two zebras standing in the grass near rocks." II. "Two zebras grazing together near rocks in their enclosure." III. "Two zebras are standing near some rocks." IV. "Two zebras in a field near one another." V. "A grassy area shows artificially arranged rocks and two zebras, as well as part of the lower half of a deer." The pair zebra-dog is used, since "zebras" appears in every caption; the pair deer-dog is not used, since "deer" appears in only one caption.
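The saliency criterion can be sketched the same way, as an annotator-agreement count. A toy illustration (the threshold of two annotators follows the slide; real captions would need morphological normalisation):

```python
def target_is_salient(target_word, all_captions, min_annotators=2):
    """Treat the target as visually salient if at least `min_annotators`
    captions for the same image mention it."""
    uses = sum(target_word in c.lower().split() for c in all_captions)
    return uses >= min_annotators

captions = ["two zebras standing in the grass near rocks",
            "two zebras grazing together near rocks in their enclosure",
            "a grassy area shows rocks and two zebras and part of a deer"]
# "zebras" appears in all three captions; "deer" in only one.
```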

FOIL Dataset: Mining the Hardest Foil Caption. To eliminate visual-language bias: every original caption could produce one or more foil captions, and the NeuralTalk loss is used to mine the hardest one. This eliminates both visual and language bias.
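One plausible reading of the mining step is: among the candidate foil captions for an image, keep the one a pretrained captioning model finds most plausible (lowest loss), i.e. the one that is hardest to reject. The sketch below uses a stand-in `caption_loss` scorer, not the actual NeuralTalk model:

```python
def hardest_foil(image_id, candidate_captions, caption_loss):
    """Return the candidate foil caption with the lowest captioning
    loss, i.e. the one the model finds hardest to reject."""
    return min(candidate_captions, key=lambda c: caption_loss(image_id, c))

# Stand-in scorer for illustration: pretend shorter captions are more
# plausible. The real pipeline would score with the NeuralTalk loss.
def toy_loss(image_id, caption):
    return len(caption.split())

foil = hardest_foil("img-1",
                    ["a dog on a very long red truck nearby",
                     "a dog on a truck"],
                    toy_loss)
```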