Image-to-Markup Generation with Coarse-to-Fine Attention

Image-to-Markup Generation with Coarse-to-Fine Attention
Presenter: Ceyer Wakilpoor
Yuntian Deng (Harvard University), Anssi Kanervisto (University of Eastern Finland), Alexander M. Rush (Harvard University)
ICML, 2017

Outline
1. Introduction: Problem Statement
2. Model: Convolutional Network, Row Encoder, Decoder, Attention
3. Experiment Details
4. Results

Introduction
Optical character recognition for mathematical expressions.
Challenge: recovering markup from a compiled image. The model has to pick up markup that encodes how characters are laid out, not just which characters appear.
Goal: a data-driven model that does not require domain knowledge.
The work builds on the attention-based encoder-decoder models previously used in machine translation and image captioning.
A multi-row recurrent encoder is added before the attention layer, which improves performance.

Problem Statement
Convert a rendered source image into markup that can render the image.
The source x ∈ X is a grayscale image of height H and width W, so x ∈ R^{H×W}.
The target y ∈ Y is a sequence of tokens y_1, y_2, ..., y_C, where C is the length of the output and each y_c is a token from the markup language's vocabulary.
Effectively, the task is to learn to invert the compile function of the markup from supervised examples: the goal is compile(y) ≈ x.
The model generates a hypothesis ŷ, and x̂ = compile(ŷ) is the predicted compiled image.
Evaluation compares x̂ with x, i.e., whether the generated markup renders an image similar to the original input.
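The slides do not spell out the training criterion; as a hedged sketch, models of this form are usually trained by maximum likelihood over the supervised (image, markup) pairs, which here would read (D denotes the training set):

```latex
% Assumed maximum-likelihood objective over the dataset D of (image, markup)
% pairs; this is the standard form for attention-based encoder-decoders.
\max_{\theta} \sum_{(x, y) \in \mathcal{D}} \sum_{c=1}^{C}
  \log p\left(y_c \mid y_1, \ldots, y_{c-1}, x;\ \theta\right)
```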

Model
A convolutional neural network (CNN) extracts image features.
Each row of the feature grid is then encoded with a recurrent neural network (RNN); whenever the paper says RNN, it means a long short-term memory network (LSTM).
The encoded features are consumed by an RNN decoder with a visual attention layer, which implements a conditional language model over the markup vocabulary.

Convolutional Network
Visual features are extracted with a multi-layer convolutional neural network with interleaved max-pooling layers, based on the model used for OCR by Shi et al.
Unlike some other OCR models, there is no fully connected layer after the convolutional layers: the spatial relationships of the extracted features are preserved.
The CNN takes an input in R^{H×W} and produces a feature grid V of size C × H' × W', where C is the number of channels and H' and W' are the dimensions reduced by pooling.
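A minimal PyTorch sketch of this kind of feature extractor; the filter counts, kernel sizes, and input resolution are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Illustrative CNN feature extractor (layer sizes are assumptions): convolutions
# interleaved with max-pooling and no fully connected layer at the end, so the
# spatial layout of the extracted features survives.
cnn = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
)

x = torch.randn(1, 1, 64, 256)      # one grayscale image, H=64, W=256
V = cnn(x)                          # feature grid of size C x H' x W'
print(V.shape)                      # torch.Size([1, 256, 16, 64])
```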

Row Encoder
Unlike image captioning, OCR has significant sequential information (i.e., reading left-to-right).
Each row of the feature grid is encoded separately with an RNN. Most markup languages read left-to-right by default, which an RNN naturally picks up.
Encoding each row lets the RNN use surrounding horizontal context to improve the hidden representation.
Generic RNN: h_t = RNN(h_{t-1}, v_t; θ).
The row encoder takes V and outputs Ṽ by running the RNN over all rows h ∈ {1, ..., H'} and columns w ∈ {1, ..., W'}:
Ṽ_{h,w} = RNN(Ṽ_{h,w-1}, V_{h,w})
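A minimal sketch of the row encoder in the same vein (the hidden size and the trivial batch handling are assumptions, and any positional information the full model adds to each row is ignored here):

```python
import torch
import torch.nn as nn

# Sketch of the row encoder: each row of the CNN feature grid V (C x H' x W')
# is treated as a length-W' sequence of C-dimensional vectors and run through
# an LSTM, producing the encoded grid Vtilde.
C, Hp, Wp, hidden = 256, 16, 64, 512
row_rnn = nn.LSTM(input_size=C, hidden_size=hidden, batch_first=True)

V = torch.randn(1, C, Hp, Wp)                    # CNN output for one image
rows = V.permute(0, 2, 3, 1).reshape(Hp, Wp, C)  # treat the H' rows as a batch
Vtilde, _ = row_rnn(rows)                        # (H', W', hidden)
print(Vtilde.shape)                              # torch.Size([16, 64, 512])
```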

Decoder
The decoder is trained as a conditional language model, modeling the probability of the next token given the previous ones:
p(y_{t+1} | y_1, ..., y_t, Ṽ) = softmax(W_out o_t)
where W_out is a learned linear transformation and
o_t = tanh(W_c [h_t; c_t])
h_t = RNN(h_{t-1}, [y_{t-1}; o_{t-1}])
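A single decoder step under these formulas might look as follows in PyTorch (embedding and hidden sizes are assumptions; the context vector c_t is a placeholder here because the attention layer that produces it is described on the next slide):

```python
import torch
import torch.nn as nn

# Sketch of one decoder step (sizes are assumptions). c_t would come from the
# attention layer; here it is just a zero placeholder tensor.
vocab, emb, hidden = 500, 80, 512
embed = nn.Embedding(vocab, emb)
cell  = nn.LSTMCell(emb + hidden, hidden)        # input is [y_{t-1}; o_{t-1}]
W_c   = nn.Linear(2 * hidden, hidden)            # maps [h_t; c_t] -> o_t
W_out = nn.Linear(hidden, vocab)

y_prev = torch.tensor([3])                       # previous token id
o_prev = torch.zeros(1, hidden)                  # previous output state o_{t-1}
h_prev = (torch.zeros(1, hidden), torch.zeros(1, hidden))
c_t    = torch.zeros(1, hidden)                  # context from attention

h_t, cell_state = cell(torch.cat([embed(y_prev), o_prev], dim=1), h_prev)
o_t = torch.tanh(W_c(torch.cat([h_t, c_t], dim=1)))
p_next = torch.softmax(W_out(o_t), dim=1)        # p(y_{t+1} | y_1..y_t, Vtilde)
```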

Attention
General form of the context vector used to assist the decoder at time step t:
c_t = φ({Ṽ_{h,w}}, α_t)
General form of the scores e and the weight vector α:
e_t = a(h_t, {Ṽ_{h,w}})
α_t = softmax(e_t)
Based on empirical success, a is chosen as:
e_{it} = β^T tanh(W_h h_t + W_v Ṽ_i)   and   c_t = Σ_i α_{it} Ṽ_i
c_t and h_t are simply concatenated and used to predict the token y_t.
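A sketch of this attention computation (dimensions are assumptions; the encoded grid Ṽ is flattened so that every spatial position i can be scored against the current decoder state h_t):

```python
import torch
import torch.nn as nn

# Sketch of the attention layer described above. Scores e_i are computed from
# the decoder state and each encoded feature, softmaxed into weights alpha,
# and used to form the context vector c_t as a weighted sum of features.
D, attn_dim = 512, 512
W_h  = nn.Linear(D, attn_dim, bias=False)
W_v  = nn.Linear(D, attn_dim, bias=False)
beta = nn.Linear(attn_dim, 1, bias=False)

Vtilde = torch.randn(16 * 64, D)                 # flattened encoded grid
h_t    = torch.randn(1, D)                       # current decoder state

e     = beta(torch.tanh(W_h(h_t) + W_v(Vtilde))).squeeze(-1)   # (H'*W',)
alpha = torch.softmax(e, dim=0)                                # weights
c_t   = (alpha.unsqueeze(1) * Vtilde).sum(dim=0, keepdim=True) # context
```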

Model (figure)

Attention (figure)

Encoder-Decoder with Attention (figure)

Example (figure)

Model Architecture (figure)

Experiment Details
Beam search is used at test time, since the decoder models the conditional language probability of the generated tokens.
The primary experiment uses IM2LATEX-100K, a dataset of mathematical expressions written in LaTeX.
The LaTeX vocabulary is tokenized into relatively specific tokens, e.g., modifier characters such as ^ or symbols such as \sigma.
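As an illustration of this kind of tokenization (the regex below is a simplification, not the exact IM2LATEX-100K tokenizer):

```python
import re

# Rough sketch of LaTeX tokenization: commands like \sigma stay whole, while
# other non-space characters (^, {, }, digits, ...) become individual tokens.
formula = r"\sigma ^ { 2 } + \frac { 1 } { 2 }"
tokens = re.findall(r"\\[A-Za-z]+|\S", formula)
print(tokens)
# ['\\sigma', '^', '{', '2', '}', '+', '\\frac', '{', '1', '}', '{', '2', '}']
```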

Experiment Details
A duplicate model without the row encoder, called CNNEnc, was created as a control to compare against image-captioning-style models.
Evaluation compares the input image with the rendered image of the output LaTeX.
Training uses an initial learning rate of 0.1, halved whenever the validation perplexity does not decrease; low validation perplexity indicates good generalization from the training set to the validation set.
Training runs for 12 epochs, and beam search uses a beam size of 5.
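A minimal sketch of that learning-rate schedule using PyTorch's plateau scheduler (the model, optimizer choice, and validation loop below are placeholders; the original implementation was in Lua Torch):

```python
import torch
import torch.optim as optim

# Start at lr = 0.1 and halve it whenever validation perplexity fails to
# decrease; run for 12 epochs. Model and validation metric are placeholders.
model = torch.nn.Linear(10, 10)
opt = optim.SGD(model.parameters(), lr=0.1)
sched = optim.lr_scheduler.ReduceLROnPlateau(opt, mode="min", factor=0.5,
                                             patience=0)

for epoch in range(12):
    # ... train for one epoch, then compute validation perplexity ...
    val_perplexity = 2.0                      # placeholder value
    sched.step(val_perplexity)                # halves lr when no improvement
```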

Results
97.5% exact-match accuracy when decoding HTML images.
The image-to-caption work was reimplemented on LaTeX and achieved over 75% exact-match accuracy.

Results (figure)

Implementation
Mostly written in Torch, with Python for preprocessing, using Lua libraries.
Inputs were bucketed into batches of similarly sized images.
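A sketch of the bucketing idea (the data structures are assumptions): group examples by image size so each minibatch contains images of the same shape and needs no extra padding.

```python
from collections import defaultdict

def bucket_by_size(samples):
    """Group (image, markup) pairs by image shape.

    `samples` is any iterable of (image, markup) pairs where `image` exposes
    a `.shape` attribute, e.g. a NumPy array or a torch tensor.
    """
    buckets = defaultdict(list)
    for image, markup in samples:
        buckets[tuple(image.shape)].append((image, markup))
    return buckets
```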

Citations
https://theneuralperspective.com/2016/11/20/recurrent-neural-network-rnn-part-4-attentional-interfaces