Less is More: Picking Informative Frames for Video Captioning


Less is More: Picking Informative Frames for Video Captioning
ECCV 2018
Yangyu Chen 1, Shuhui Wang 2, Weigang Zhang 3 and Qingming Huang 1,2
1 University of Chinese Academy of Sciences, Beijing, 100049, China
2 Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS, Beijing, 100190, China
3 Harbin Inst. of Tech., Weihai, 264200, China
yangyu.chen@vipl.ict.ac.cn, wangshuhui@ict.ac.cn, wgzhang@hit.edu.cn, qmhuang@ucas.ac.cn
2018-07-30

Video Captioning

Seq2Seq translation:
- encoding: use a CNN and an RNN to encode the video content
- decoding: use an RNN to generate a sentence conditioned on the encoded feature

Figure 1: Standard encoder-decoder framework for video captioning [1]

[1] S. Venugopalan et al. Sequence to sequence - video to text. In: Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE Computer Society Press, 2015, pp. 4534-4542.
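
As a concrete illustration of this encoder-decoder framework, a minimal PyTorch sketch is given below. It is not the authors' exact model: the GRU layers, the 2048-dimensional frame features, and the vocabulary size are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """CNN frame features -> GRU encoder -> GRU decoder over words (a minimal sketch)."""

    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T, feat_dim) CNN features of the sampled frames
        # captions:    (B, L) word indices of the (shifted) ground-truth sentence
        _, v = self.encoder(frame_feats)            # v: encoded video state, (1, B, hidden)
        logits, _ = self.decoder(self.embed(captions), v)
        return self.out(logits)                     # (B, L, vocab_size) word scores

# Example: 30 equally sampled frames, a 12-word caption prefix, batch size 2.
model = EncoderDecoder()
scores = model(torch.randn(2, 30, 2048), torch.randint(0, 10000, (2, 12)))
```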

Motivation

Frame selection perspective: equal-interval frame sampling selects many frames with duplicated and redundant visual appearance information, and processing them also incurs remarkable computation expenditure.

(a) Equally sampled 30 frames from a video  (b) Informative frames

Figure 2: A video may contain much redundant information. The whole video can be represented by a small portion of frames (b), while equally sampled frames still contain redundant information (a).

Motivation

Downstream task perspective: temporal redundancy may lead to an unexpected information overload in the visual-linguistic correlation analysis model, so using more frames does not always lead to better performance.

Figure 3: The best METEOR score on the validation sets of MSVD and MSR-VTT when using different numbers of equally sampled frames (5 to 30). The standard encoder-decoder model is used to generate captions. (MSVD: 32.0, 32.2, 32.7, 32.8, 32.7, 32.3; MSR-VTT: 27.5, 27.6, 27.6, 27.5, 27.0, 27.0 for 5, 10, 15, 20, 25, 30 frames.)

Picking Informative Frames for Captioning

Figure 4: Insert PickNet into the encode-decode procedure for captioning.

Insert PickNet before the encoder-decoder, so frame selection is performed before the downstream task is processed. Since frame-level annotations are not available, reinforcement learning is used to optimize the picking policy.

PickNet

Given an input image z_t and the last picking memory \hat{g}, PickNet produces a Bernoulli distribution for the picking decision:

    d_t = g_t - \hat{g}                                      (1)
    s_t = W_2 \max(W_1 \mathrm{vec}(d_t) + b_1, 0) + b_2     (2)
    a_t \sim \mathrm{softmax}(s_t)                           (3)
    \hat{g} \leftarrow g_t                                   (4)

where the W's and b's are the parameters of our model, g_t is the flattened gray-scale image of z_t, and d_t is the difference between the gray-scale images; the picking memory \hat{g} is updated to g_t when the frame is picked. Other network structures (e.g., LSTM/GRU) can also be applied.
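
A minimal PyTorch sketch of Eqs. (1)-(4) follows. The 56x56 gray-scale resolution, the hidden size, and the binary {skip, pick} output head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PickNet(nn.Module):
    """Two-layer MLP over the gray-scale frame difference d_t (a sketch; sizes are assumed)."""

    def __init__(self, frame_pixels=56 * 56, hidden=1024):
        super().__init__()
        self.fc1 = nn.Linear(frame_pixels, hidden)   # W_1, b_1 in Eq. (2)
        self.fc2 = nn.Linear(hidden, 2)              # W_2, b_2 -> {skip, pick} logits s_t

    def forward(self, gray_frame, memory):
        d = (gray_frame - memory).flatten(start_dim=1)   # d_t = g_t - \hat{g}, then vec(d_t)
        s = self.fc2(F.relu(self.fc1(d)))                # Eq. (2)
        return F.softmax(s, dim=-1), s                   # Eq. (3): picking distribution

# Walk through a video, sample pick/skip decisions, and update the memory on picks.
picknet = PickNet()
memory = torch.zeros(1, 56, 56)                          # picking memory \hat{g}, starts empty
picked = []
for t, frame in enumerate(torch.rand(30, 1, 56, 56)):    # 30 dummy gray-scale frames
    probs, _ = picknet(frame, memory)
    action = torch.distributions.Categorical(probs).sample()
    if action.item() == 1:                               # picked: \hat{g} <- g_t (Eq. 4)
        memory = frame
        picked.append(t)
```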

Rewards

Visual diversity reward: the average cosine distance over all pairs of picked frames,

    r_v(V_i) = \frac{2}{N_p (N_p - 1)} \sum_{k=1}^{N_p} \sum_{m>k} \left( 1 - \frac{x_k^\top x_m}{\|x_k\|_2 \|x_m\|_2} \right)     (5)

where V_i is the set of picked frames, N_p is the number of picked frames, and x_k is the feature of the k-th picked frame.

Language reward: the semantic similarity between the generated sentence and the ground truth,

    r_l(V_i, S_i) = \mathrm{CIDEr}(c_i, S_i)     (6)

where S_i is the set of annotated sentences and c_i is the generated sentence.

Picking limitation:

    r(V_i) = \begin{cases} \lambda_l r_l(V_i, S_i) + \lambda_v r_v(V_i) & \text{if } N_{\min} \le N_p \le N_{\max} \\ -R & \text{otherwise} \end{cases}     (7)

where N_p is the number of picked frames and R is the punishment.
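
The rewards can be computed directly from the picked-frame features and the caption metric. Below is a small sketch of Eqs. (5) and (7); the lambda weights, pick limits, and punishment value are placeholder assumptions, and the CIDEr score is assumed to be computed elsewhere.

```python
import torch
import torch.nn.functional as F

def visual_diversity_reward(feats):
    """Average pairwise cosine distance of picked-frame features (N_p x D), Eq. (5)."""
    n = feats.size(0)
    if n < 2:
        return 0.0
    normed = F.normalize(feats, dim=1)
    cos = normed @ normed.t()                     # N_p x N_p cosine similarities
    iu = torch.triu_indices(n, n, offset=1)       # indices of all frame pairs (k < m)
    return (1.0 - cos[iu[0], iu[1]]).mean().item()

def reward(feats, cider, n_min=3, n_max=12, lam_l=1.0, lam_v=0.5, punishment=1.0):
    """Combined reward of Eq. (7); lambda values and pick limits are assumptions."""
    n_picked = feats.size(0)
    if not (n_min <= n_picked <= n_max):
        return -punishment                        # out-of-range pick counts are punished
    return lam_l * cider + lam_v * visual_diversity_reward(feats)
```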

Training

Supervision stage: train the encoder-decoder with the cross-entropy loss

    L_X(y; \omega) = -\sum_{t=1}^{m} \log p_\omega(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, v)     (8)

where \omega is the parameter of the encoder-decoder, y = (y_1, y_2, \ldots, y_m) is an annotated sentence, and v is the encoded video feature.

Reinforcement stage: train PickNet. The reward is tied to the actions through the picked set V_i = \{x_t \mid a_t^s = 1, x_t \in v_i\}, and the loss is the negative expected reward

    L_R(a^s; \theta) = -\mathbb{E}_{a^s \sim p_\theta}[r(V_i)] = -\mathbb{E}_{a^s \sim p_\theta}[r(a^s)]     (9)

where \theta is the parameter of PickNet and a^s is the action sequence.

Adaptation stage: train the encoder-decoder and PickNet jointly,

    L = L_X(y; \omega) + L_R(a^s; \theta)     (10)

The combinatorial explosion of direct frame selection is avoided.
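
For the supervision stage, Eq. (8) is the usual teacher-forced cross-entropy over the caption. A short sketch, reusing the hypothetical EncoderDecoder above and ignoring padding/masking for brevity:

```python
import torch.nn.functional as F

def supervision_loss(model, frame_feats, captions):
    """Negative log-likelihood of each ground-truth word given its history, Eq. (8)."""
    inputs, targets = captions[:, :-1], captions[:, 1:]        # teacher-forcing shift
    logits = model(frame_feats, inputs)                        # (B, L-1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```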

REINFORCE

Use the REINFORCE [2] algorithm to estimate the gradients.

Gradient expression:

    \nabla_\theta L_R(a^s; \theta) = -\mathbb{E}_{a^s \sim p_\theta}\left[ r(a^s) \, \nabla_\theta \log p_\theta(a^s) \right]     (11)

Based on the chain rule:

    \nabla_\theta L_R(\theta) = \sum_{t=1}^{T} \frac{\partial L_R(\theta)}{\partial s_t} \frac{\partial s_t}{\partial \theta} = \sum_{t=1}^{T} \mathbb{E}_{a^s \sim p_\theta}\left[ r(a^s) \left( p_\theta(a_t^s) - 1_{a_t^s} \right) \right] \frac{\partial s_t}{\partial \theta}     (12)

Apply Monte-Carlo sampling (a single sampled action sequence):

    \nabla_\theta L_R(a^s; \theta) \approx \sum_{t=1}^{T} r(a^s) \left( p_\theta(a_t^s) - 1_{a_t^s} \right) \frac{\partial s_t}{\partial \theta}     (13)

[2] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In: Machine Learning 8.3-4 (1992), pp. 229-256.
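
In practice the Monte-Carlo estimate in Eq. (13) can be obtained by letting autograd differentiate -r(a^s) * sum_t log p_theta(a_t^s). A minimal sketch, reusing the hypothetical PickNet above and assuming a reward_fn that returns r(a^s) for the sampled picks:

```python
import torch

def reinforce_step(picknet, optimizer, gray_frames, reward_fn):
    """One single-sample REINFORCE update of the picking policy (a sketch)."""
    memory = torch.zeros_like(gray_frames[0])
    log_probs, picked = [], []
    for t, frame in enumerate(gray_frames):
        probs, _ = picknet(frame, memory)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))        # log p_theta(a_t^s)
        if action.item() == 1:                         # picked: update memory
            memory = frame
            picked.append(t)
    r = reward_fn(picked)                              # episode reward r(a^s)
    loss = -r * torch.stack(log_probs).sum()           # L_R = -E[r]; Eq. (13) via autograd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return r
```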

Picking Results

Ours: a woman is seasoning meat / GT: someone is seasoning meat
Ours: a person is solving a rubik's cube / GT: person playing with toy
Ours: a man is shooting a gun / GT: a man is shooting
Ours: there is a woman is talking with a woman / GT: it is a movie

Figure 5: Example results on MSVD and MSR-VTT. The green boxes indicate picked frames.

Picking Results

We investigate our method on three types of artificially combined videos: a) two identical videos; b) two semantically similar videos; c) two semantically dissimilar videos.

(a) Ours: a woman is doing exercise / Baseline: a girl is doing a
(b) Ours: two polar bears are playing / Baseline: a man is dancing
(c) Ours: a cat is eating / Baseline: a bear is running

Figure 6: Example results on joint videos. Green boxes indicate picked frames. The baseline method is Enc-Dec on equally sampled frames.

Analysis

Figure 7: Statistics on the behavior of our PickNet. (a) Distribution of the number of picks. (b) Distribution of the position of picks. In the vast majority of videos, fewer than 10 frames are picked, and the probability of picking a frame decreases as time goes by.

Performance

Model            BLEU-4  ROUGE-L  METEOR  CIDEr  Time
Previous works
  LSTM-E          45.3    -        31.0    -      5x
  p-RNN           49.9    -        32.6    65.8   5x
  HRNE            43.8    -        33.1    -      33x
  BA              42.5    -        32.4    63.5   12x
Baselines
  Full            44.8    68.5     31.6    69.4   5x
  Random          35.6    64.5     28.4    49.2   2.5x
  k-means (k=6)   45.2    68.5     32.4    70.9   1x
  Hecate          43.2    67.4     31.7    68.8   1x
Our models
  PickNet (V)     46.3    69.3     32.3    75.1   1x
  PickNet (L)     49.9    69.3     32.9    74.7   1x
  PickNet (V+L)   52.3    69.6     33.3    76.5   1x

Table 1: Experiment results on MSVD. All values are reported as percentages (%). L denotes using the language reward and V denotes using the visual diversity reward. k is set to the average number of picks N_p on MSVD (N_p ≈ 6).

Performance

Model             BLEU-4  ROUGE-L  METEOR  CIDEr  Time
Previous works
  ruc-uva          38.7    58.7     26.9    45.9   4.5x
  Aalto            39.8    59.8     26.9    45.7   4.5x
  DenseVidCap      41.4    61.1     28.3    48.9   10.5x
  MS-RNN           39.8    59.3     26.1    40.9   10x
Baselines
  Full             36.8    59.0     26.7    41.2   3.8x
  Random           31.3    55.7     25.2    32.6   1.9x
  k-means (k=8)    37.8    59.1     26.9    41.4   1x
  Hecate           37.3    59.1     26.6    40.8   1x
Our models
  PickNet (V)      36.9    58.9     26.8    40.4   1x
  PickNet (L)      37.3    58.9     27.0    41.9   1x
  PickNet (V+L)    39.4    59.7     27.3    42.3   1x
  PickNet (V+L+C)  41.3    59.8     27.7    44.1   1x

Table 2: Experiment results on MSR-VTT. All values are reported as percentages (%). C denotes using the provided category information. k is set to the average number of picks N_p on MSR-VTT (N_p ≈ 8).

Time Estimation

Model           Appearance        Motion     Sampling method              Frame num.  Time
Previous work
  LSTM-E         VGG (0.5x)        C3D (2x)   uniform sampling 30 frames   30 (5x)     5x
  p-RNN          VGG (0.5x)        C3D (2x)   uniform sampling 30 frames   30 (5x)     5x
  HRNE           GoogleNet (0.5x)  C3D (2x)   first 200 frames             200 (33x)   33x
  BA             ResNet (0.5x)     C3D (2x)   every 5 frames               72 (12x)    12x
Our models
  Baseline       ResNet (1x)       -          uniform sampling 30 frames   30 (5x)     5x
  Random         ResNet (1x)       -          random sampling              15 (2.5x)   2.5x
  k-means (k=6)  ResNet (1x)       -          k-means clustering           6 (1x)      1x
  Hecate         ResNet (1x)       -          video summarization          6 (1x)      1x
  PickNet (V)    ResNet (1x)       -          picking                      6 (1x)      1x
  PickNet (L)    ResNet (1x)       -          picking                      6 (1x)      1x
  PickNet (V+L)  ResNet (1x)       -          picking                      6 (1x)      1x

Table 3: Running time estimation on MSVD. OF means optical flow. BA uses ResNet-50 while our models use ResNet-152. k is set to the average number of picks N_p on MSVD (N_p ≈ 6).

Time Estimation

Model           Appearance        Motion         Sampling method              Frame num.  Time
Previous work
  ruc-uva        GoogleNet (0.5x)  C3D (2x)       every 10 frames              36 (4.5x)   4.5x
  Aalto          GoogleNet (0.5x)  C3D+IDT (2x)   one frame every second       36 (4.5x)   4.5x
  DenseCap       ResNet (0.5x)     C3D (2x)       sampling 90 frames           90 (10.5x)  10.5x
  MS-RNN         ResNet (1x)       C3D (2x)       uniform sampling 40 frames   40 (5x)     10x
Our models
  Baseline       ResNet (1x)       -              uniform sampling 30 frames   30 (3.8x)   3.8x
  Random         ResNet (1x)       -              random sampling              15 (1.9x)   1.9x
  k-means (k=8)  ResNet (1x)       -              k-means clustering           8 (1x)      1x
  Hecate         ResNet (1x)       -              video summarization          8 (1x)      1x
  PickNet (V)    ResNet (1x)       -              picking                      8 (1x)      1x
  PickNet (L)    ResNet (1x)       -              picking                      8 (1x)      1x
  PickNet (V+L)  ResNet (1x)       -              picking                      8 (1x)      1x

Table 4: Running time estimation on MSR-VTT. IDT means improved dense trajectories. DenseCap uses ResNet-50 while our models use ResNet-152. k is set to the average number of picks N_p on MSR-VTT (N_p ≈ 8).

Online Captioning

When PickNet selects a frame, it signals that new information has appeared. The encoder-decoder is then triggered by PickNet and a more detailed description is generated.

Conclusion

Flexibility: a plug-and-play, reinforcement-learning-based PickNet that picks informative frames for video understanding tasks.

Efficiency: the architecture largely cuts down the number of convolution operations, making our method more applicable to real-world video processing.

Effectiveness: experiments show that our model achieves comparable or even better performance than the state of the art while using only a small number of frames.

Thanks!