Summarizing Long First-Person Videos


CVPR 2016 Workshop: Moving Cameras Meet Video Surveillance: From Body-Borne Cameras to Drones. Summarizing Long First-Person Videos. Kristen Grauman, Department of Computer Science, University of Texas at Austin, with Yong Jae Lee, Yu-Chuan Su, Bo Xiong, Lu Zheng, Ke Zhang, Wei-Lun Chao, Fei Sha.

First person vs. third person: traditional third-person view vs. first-person view. (UT TEA dataset)

First person vs. third person. First-person (egocentric) vision is linked to the ongoing experience of the camera wearer: the world is seen in the context of the camera wearer's activity and goals. (Traditional third-person view vs. first-person view; UT Interaction and JPL First-Person Interaction datasets)

Goal: Summarize egocentric video. Input: egocentric video of the camera wearer's day, from a wearable camera (9:00 am, 10:00 am, 11:00 am, 12:00 pm, 1:00 pm, 2:00 pm). Output: storyboard (or video skim) summary.

Why summarize egocentric video? Memory aid, law enforcement, mobile robot discovery. (RHex Hexapedal Robot, Penn's GRASP Laboratory)

What makes egocentric data hard to summarize? Subtle event boundaries, subtle figure/ground, and long streams of data.

Prior work on video summarization is largely third-person: static cameras, where low-level cues are informative, and summarization is treated as a sampling problem. [Wolf 1996, Zhang et al. 1997, Ngo et al. 2003, Goldman et al. 2006, Caspi et al. 2006, Pritch et al. 2007, Laganiere et al. 2008, Liu et al. 2010, Nam & Tewfik 2002, Ellouze et al. 2010, ...]

[Lu & Grauman, CVPR 2013] Goal: story-driven summarization. Characters and plot: key objects and their influence.

[Lu & Grauman, CVPR 2013] Summarization as subshot selection: a good summary is a chain of k selected subshots in which each influences the next via some subset of key objects, balancing influence, importance, and diversity.
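The objective above mixes influence, importance, and diversity; as a minimal sketch of just the weakest-link aspect (and not the optimization actually used in the paper), the dynamic program below picks k subshots in temporal order so that the smallest pairwise influence along the chain is as large as possible, assuming an influence matrix is given (e.g., estimated as on the following slides).

```python
# Hypothetical sketch: chain of k subshots maximizing the weakest influence link.
import numpy as np

def best_chain(influence, k):
    """influence: (n, n) matrix with influence[i, j] defined for i < j."""
    n = influence.shape[0]
    # best[m, j]: largest achievable weakest link of a chain of (m + 1) subshots ending at j
    best = np.full((k, n), -np.inf)
    back = np.full((k, n), -1, dtype=int)
    best[0] = np.inf                      # a single subshot has no link to constrain yet
    for m in range(1, k):
        for j in range(n):
            for i in range(j):
                link = min(best[m - 1, i], influence[i, j])
                if link > best[m, j]:
                    best[m, j], back[m, j] = link, i
    j = int(best[k - 1].argmax())         # strongest chain of exactly k subshots
    chain = [j]
    for m in range(k - 1, 0, -1):
        j = int(back[m, j])
        chain.append(j)
    return chain[::-1]
```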

[Lu & Grauman, CVPR 2013] Egocentric subshot detection: an ego-activity classifier labels frames as static, in transit, or head motion; an MRF over frames and frame grouping yield subshots 1, ..., i, ..., n.
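A minimal caricature of that pipeline, not the authors' implementation: given hypothetical per-frame scores for the three ego-activity classes, a chain-MRF (Viterbi) pass smooths the labels, and runs of the same label become subshots.

```python
# Sketch under assumptions: frame_scores and the switching penalty lam are hypothetical inputs.
import numpy as np

def segment_subshots(frame_scores, lam=1.0):
    """frame_scores: (T, K) array, higher = more likely label k (static/in-transit/head-motion) at frame t."""
    T, K = frame_scores.shape
    dp = np.zeros((T, K))                 # best cumulative score ending in label k
    back = np.zeros((T, K), dtype=int)
    dp[0] = frame_scores[0]
    for t in range(1, T):
        # transition cost: 0 to keep the same label, lam to switch (chain-MRF pairwise term)
        trans = dp[t - 1][:, None] - lam * (1 - np.eye(K))
        back[t] = trans.argmax(axis=0)
        dp[t] = frame_scores[t] + trans.max(axis=0)
    labels = np.zeros(T, dtype=int)
    labels[-1] = dp[-1].argmax()
    for t in range(T - 1, 0, -1):
        labels[t - 1] = back[t, labels[t]]
    # group consecutive frames with the same ego-activity label into subshots
    bounds = [0] + [t for t in range(1, T) if labels[t] != labels[t - 1]] + [T]
    return [(s, e, labels[s]) for s, e in zip(bounds[:-1], bounds[1:])]
```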

Learning object importance: we learn to rate regions by their egocentric importance. Egocentric cues: distance to hand, distance to frame center, frequency. Region cues: the candidate region's appearance and motion; the surrounding area's appearance and motion; object-like appearance and motion [Endres et al. ECCV 2010, Lee et al. ICCV 2011]; size, width, height, centroid; overlap with face detection. [Lee et al. CVPR 2012, IJCV 2015]
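A minimal sketch of how such a rating could be learned, not the method from [Lee et al. CVPR 2012, IJCV 2015]: assemble the cues above into one feature vector per candidate region and fit a regressor on importance-labeled regions. All field names are hypothetical placeholders; appearance and motion are assumed to be precomputed 1-D descriptors.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def region_features(r):
    return np.concatenate([
        [r["hand_dist"],           # distance to the detected hand
         r["frame_center_dist"],   # distance to the frame center
         r["frequency"],           # how often a similar region recurs
         r["size"], r["width"], r["height"],
         r["centroid_x"], r["centroid_y"],
         r["face_overlap"]],       # overlap with a face detection
        r["appearance"],           # e.g., color/texture descriptor of the region and its surround
        r["motion"],               # e.g., flow histogram of the region vs. its surround
    ])

def train_importance_model(labeled_regions):
    X = np.stack([region_features(r) for r in labeled_regions])
    y = np.array([r["importance"] for r in labeled_regions])
    return GradientBoostingRegressor().fit(X, y)
```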

[Lu & Grauman, CVPR 2013] Estimating visual influence: aim to select the k subshots that maximize the influence between objects (on the weakest link).

Estimating visual influence: a graph over subshots and objects (or words), with a sink node, captures how reachable subshot j is from subshot i via any object o. [Lu & Grauman, CVPR 2013]
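As a crude proxy for that reachability, and not the actual model in [Lu & Grauman, CVPR 2013], one can take a two-step walk: from subshot i to one of its objects, then from that object to a later subshot containing it.

```python
# Hypothetical sketch: influence[i, j] = probability of reaching later subshot j from subshot i
# by stepping uniformly to an object in i, then uniformly to a later subshot containing that object.
import numpy as np

def influence_matrix(subshot_objects):
    """subshot_objects: one set of object (or visual-word) ids per subshot, in time order."""
    n = len(subshot_objects)
    infl = np.zeros((n, n))
    for i in range(n):
        if not subshot_objects[i]:
            continue
        p_step_to_obj = 1.0 / len(subshot_objects[i])        # uniform step: subshot i -> object o
        for o in subshot_objects[i]:
            later = [j for j in range(i + 1, n) if o in subshot_objects[j]]
            for j in later:
                infl[i, j] += p_step_to_obj / len(later)     # uniform step: object o -> later subshot j
    return infl
```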

Datasets. UT Egocentric (UT Ego) [Lee et al. 2012]: 4 videos, each 3-5 hours long, uncontrolled setting; we use visual words and subshots. Activities of Daily Living (ADL) [Pirsiavash & Ramanan 2012]: 20 videos, each 20-60 minutes, daily activities in a house; we use object bounding boxes and keyframes.

Example keyframe summary (UT Ego data): original video (3 hours), our summary (12 frames). http://vision.cs.utexas.edu/projects/egocentric/ [Lee et al. CVPR 2012, IJCV 2015]

Example skim summary (UT Ego data): ours vs. baseline. [Lu & Grauman, CVPR 2013]

Generating storyboard maps Augment keyframe summary with geolocations [Lee et al., CVPR 2012, IJCV 2015]

Human subject results (blind taste test): how often do subjects prefer our summary?

                                      UT Egocentric   Activities of Daily Living
Vs. Uniform sampling                      90.0%                 75.7%
Vs. Shortest-path                         90.9%                 94.6%
Vs. Object-driven [Lee et al. 2012]       81.8%                  N/A

34 human subjects, ages 18-60; 12 hours of original video; each comparison done by 5 subjects; 535 tasks in total, 45 hours of subject time. [Lu & Grauman, CVPR 2013]

Summarizing egocentric video, key questions: What objects are important, and how are they linked? When is the recorder engaging with the scene? Which frames look intentional? Can we teach a system to summarize?

Goal: Detect engagement. Definition: a time interval in which the recorder is attracted by some object(s) and interrupts their ongoing flow of activity to purposefully gather more information about the object(s). [Su & Grauman, ECCV 2016]

Egocentric Engagement Dataset: 14 hours of labeled egocentric video from 9 recorders; browsing scenarios with long, natural clips; frame-level labels from 10 annotators. [Su & Grauman, ECCV 2016]

Challenges in detecting engagement: interesting things vary in appearance; being engaged is not the same as being stationary; high-engagement intervals vary in length; and there are no cues from active camera control. [Su & Grauman, ECCV 2016]

Our approach: learn motion patterns indicative of engagement. [Su & Grauman, ECCV 2016]
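A minimal sketch under assumptions, not the ECCV 2016 model: pool simple optical-flow statistics over a sliding temporal window and score each window with a binary engagement classifier trained on the frame-level labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def window_motion_features(flow_frames, win=30):
    """flow_frames: list of (H, W, 2) optical-flow fields, one per video frame."""
    feats = []
    for t in range(len(flow_frames) - win + 1):
        mags = np.stack([np.linalg.norm(f, axis=2) for f in flow_frames[t:t + win]])  # (win, H, W)
        feats.append([mags.mean(),                       # overall motion energy in the window
                      mags.std(),                        # spread of motion magnitudes
                      mags.mean(axis=(1, 2)).std()])     # frame-to-frame variation of motion
    return np.array(feats)

# Usage (hypothetical training data):
# clf = LogisticRegression().fit(window_motion_features(train_flow), window_labels)
# engagement_scores = clf.predict_proba(window_motion_features(test_flow))[:, 1]
```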

Results: detecting engagement Blue=Ground truth Red=Predicted [Su & Grauman, ECCV 2016]

Results: failure cases Blue=Ground truth Red=Predicted [Su & Grauman, ECCV 2016]

Results: detecting engagement 14 hours of video, 9 recorders [Su & Grauman, ECCV 2016]

Summarizing egocentric video, key questions: What objects are important, and how are they linked? When is the recorder engaging with the scene? Which frames look intentional? Can we teach a system to summarize?

Which photos were purposely taken by a human? Incidental wearable-camera photos vs. intentional, human-taken photos. [Xiong & Grauman, ECCV 2014]

Idea: Detect snap points. An unsupervised, data-driven approach to detect frames in first-person video that look intentional: a Web prior plus domain-adapted similarity yield a snap point score. [Xiong & Grauman, ECCV 2014]
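A minimal sketch of the web-prior idea, not the actual ECCV 2014 method (which also adapts the feature space between web photos and egocentric frames): score each frame by its proximity to its nearest neighbors in a pool of intentionally taken web photos, so frames that look like web photos get high snap point scores.

```python
import numpy as np

def snap_point_scores(frame_feats, web_feats, k=5):
    """frame_feats: (M, D) descriptors of video frames; web_feats: (N, D) descriptors of web photos."""
    # pairwise distances between each frame and every web photo
    d = np.linalg.norm(frame_feats[:, None, :] - web_feats[None, :, :], axis=2)
    knn = np.sort(d, axis=1)[:, :k]          # k closest web photos for each frame
    return -knn.mean(axis=1)                 # closer to the web prior -> higher snap point score
```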

Example snap point predictions

Snap point predictions [Xiong & Grauman, ECCV 2014]

Summarizing egocentric video, key questions: What objects are important, and how are they linked? When is the recorder engaging with the scene? Which frames look intentional? Can we teach a system to summarize?

Supervised summarization Can we teach the system how to create a good summary, based on human-edited exemplars? [Zhang et al. CVPR 2016, Chao et al. UAI 2015, Gong et al. NIPS 2014]

Determinantal Point Processes for video summarization: select the subset of items that maximizes diversity and quality. For a ground set of N items with per-item quality q_i and an N x N similarity matrix S, the DPP kernel is L_ij = q_i S_ij q_j, and a subset Y (given by its subset indicator) is drawn with probability proportional to det(L_Y), favoring high-quality, mutually diverse items. Figure: Kulesza & Taskar. [Zhang et al. CVPR 2016, Chao et al. UAI 2015, Gong et al. NIPS 2014]
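A minimal sketch of this DPP machinery (Kulesza & Taskar's quality/similarity decomposition), not the sequential DPP summarizers cited above: build L from quality and similarity, then greedily add the item that most increases det(L_Y).

```python
import numpy as np

def greedy_dpp(quality, similarity, k):
    """quality: (N,) scores; similarity: (N, N) PSD matrix with positive diagonal; k: summary length."""
    L = np.outer(quality, quality) * similarity       # L_ij = q_i * S_ij * q_j
    chosen = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(quality)):
            if i in chosen:
                continue
            idx = chosen + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_gain:       # prefer the subset with the largest det(L_Y)
                best, best_gain = i, logdet
        if best is None:
            break
        chosen.append(best)                           # a real implementation would update factors incrementally
    return chosen
```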

Summary Transfer. Ke Zhang (USC), Wei-Lun Chao (USC), Fei Sha (UCLA), Kristen Grauman (UT Austin). Idea: transfer the underlying summarization structures. Training kernels: idealized. Test kernel: synthesized from related training kernels. [Zhang et al. CVPR 2016]

Summary Transfer [Zhang et al. CVPR 2016], results (F-score):

                    Kodak (18)   OVP (50)   YouTube (31)   MED (160)
VSUMM [Avila 11]       69.5        70.3        59.9          28.9
seqDPP [Gong 14]       78.9        77.7        60.8            -
Ours                   82.3        76.5        61.8          30.7

             VidMMR [Li 10]   SumMe [Gygli 14]   Submodular [Gygli 15]   Ours
SumMe (25)        26.6              39.3                39.7             40.9

Example summary: VSUMM1 (F = 54), seqDPP (F = 57), Ours (F = 74). Promising results on existing annotated datasets.

Next steps: video summary as an index for search; streaming computation; visualization and display; multiple modalities, e.g., audio, depth, ...

Summary. First-person summarization tools are needed to cope with the deluge of wearable camera data. New ideas: story-like summaries; detecting when engagement occurs; intentional-looking snap points from a passive camera; supervised summarization learning methods. (With Yong Jae Lee, Yu-Chuan Su, Bo Xiong, Lu Zheng, Ke Zhang, Wei-Lun Chao, Fei Sha.) CVPR 2016 Workshop: Moving Cameras Meet Video Surveillance: From Body-Borne Cameras to Drones.

Papers
Summary Transfer: Exemplar-based Subset Selection for Video Summarization. K. Zhang, W.-L. Chao, F. Sha, and K. Grauman. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, June 2016.
Detecting Snap Points in Egocentric Video with a Web Photo Prior. B. Xiong and K. Grauman. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, Sept 2014.
Detecting Engagement in Egocentric Video. Y.-C. Su and K. Grauman. To appear, Proceedings of the European Conference on Computer Vision (ECCV), 2016.
Predicting Important Objects for Egocentric Video Summarization. Y. J. Lee and K. Grauman. International Journal on Computer Vision, Volume 114, Issue 1, pp. 38-55, August 2015.
Story-Driven Summarization for Egocentric Video. Z. Lu and K. Grauman. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, June 2013.
Discovering Important People and Objects for Egocentric Video Summarization. Y. J. Lee, J. Ghosh, and K. Grauman. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, June 2012.