Video Summarization from Spatio-Temporal Features

Similar documents
10. Water tank. Example I. Draw the graph of the amount z of water in the tank against time t.. Explain the shape of the graph.

Overview ECE 553: TESTING AND TESTABLE DESIGN OF. Ad-Hoc DFT Methods Good design practices learned through experience are used as guidelines:

application software

4.1 Water tank. height z (mm) time t (s)

THE INCREASING demand to display video contents

Workflow Overview. BD FACSDiva Software Quick Reference Guide for BD FACSAria Cell Sorters. Starting Up the System. Checking Cytometer Performance

Lab 2 Position and Velocity

Measurement of Capacitances Based on a Flip-Flop Sensor

application software

MULTI-VIEW VIDEO COMPRESSION USING DYNAMIC BACKGROUND FRAME AND 3D MOTION ESTIMATION

CE 603 Photogrammetry II. Condition number = 2.7E+06

TRANSFORM DOMAIN SLICE BASED DISTRIBUTED VIDEO CODING

MELODY EXTRACTION FROM POLYPHONIC AUDIO BASED ON PARTICLE FILTER

DO NOT COPY DO NOT COPY DO NOT COPY DO NOT COPY

Adaptive Down-Sampling Video Coding

-To become familiar with the input/output characteristics of several types of standard flip-flop devices and the conversion among them.

On Mopping: A Mathematical Model for Mopping a Dirty Floor

Hierarchical Sequential Memory for Music: A Cognitive Model

Coded Strobing Photography: Compressive Sensing of High-speed Periodic Events

Real-time Facial Expression Recognition in Image Sequences Using an AdaBoost-based Multi-classifier

Nonuniform sampling AN1

Computer Vision II Lecture 8

Computer Vision II Lecture 8

A Turbo Tutorial. by Jakob Dahl Andersen COM Center Technical University of Denmark

A ROBUST DIGITAL IMAGE COPYRIGHT PROTECTION USING 4-LEVEL DWT ALGORITHM

(12) (10) Patent N0.: US 7,260,789 B2 Hunleth et a]. (45) Date of Patent: Aug. 21, 2007

2015 Communication Guide

A Methodology for Evaluating Storage Systems in Distributed and Hierarchical Video Servers

Drivers Evaluation of Performance of LED Traffic Signal Modules

Automatic location and removal of video logos

Region-based Temporally Consistent Video Post-processing

The Art of Image Acquisition

Determinants of investment in fixed assets and in intangible assets for hightech

Besides our own analog sensors, it can serve as a controller performing variegated control functions for any type of analog device by any maker.

Solution Guide II-A. Image Acquisition. Building Vision for Business. MVTec Software GmbH

Evaluation of a Singing Voice Conversion Method Based on Many-to-Many Eigenvoice Conversion

Telemetrie-Messtechnik Schnorrenberg

TUBICOPTERS & MORE OBJECTIVE

LATCHES Implementation With Complex Gates

UPDATE FOR DESIGN OF STRUCTURAL STEEL HOLLOW SECTION CONNECTIONS VOLUME 1 DESIGN MODELS, First edition 1996 A.A. SYAM AND B.G.

BLOCK-BASED MOTION ESTIMATION USING THE PIXELWISE CLASSIFICATION OF THE MOTION COMPENSATION ERROR

Removal of Order Domain Content in Rotating Equipment Signals by Double Resampling

AUTOCOMPENSATIVE SYSTEM FOR MEASUREMENT OF THE CAPACITANCES

Solution Guide II-A. Image Acquisition. HALCON Progress

The Art of Image Acquisition

Video inpainting of complex scenes based on local statistical model

Singing voice detection with deep recurrent neural networks

Digital Panel Controller

Automatic Selection and Concatenation System for Jazz Piano Trio Using Case Data

Mean-Field Analysis for the Evaluation of Gossip Protocols

Physics 218: Exam 1. Sections: , , , 544, , 557,569, 572 September 28 th, 2016

AN ESTIMATION METHOD OF VOICE TIMBRE EVALUATION VALUES USING FEATURE EXTRACTION WITH GAUSSIAN MIXTURE MODEL BASED ON REFERENCE SINGER

Computer Graphics Applications to Crew Displays

A Delay-efficient Radiation-hard Digital Design Approach Using CWSP Elements

A Delay-efficient Radiation-hard Digital Design Approach Using CWSP Elements

United States Patent (19) Gardner

The Impact of e-book Technology on Book Retailing

TEA2037A HORIZONTAL & VERTICAL DEFLECTION CIRCUIT

SOME FUNCTIONAL PATTERNS ON THE NON-VERBAL LEVEL

Student worksheet: Spoken Grammar

Truncated Gray-Coded Bit-Plane Matching Based Motion Estimation and its Hardware Architecture

First Result of the SMA Holography Experirnent

G E T T I N G I N S T R U M E N T S, I N C.

Advanced Handheld Tachometer FT Measure engine rotation speed via cigarette lighter socket sensor! Cigarette lighter socket sensor FT-0801

Supercompression for Full-HD and 4k-3D (8k) Digital TV Systems

And the Oscar Goes to...peeeeedrooooo! 1

EX 5 DIGITAL ELECTRONICS (GROUP 1BT4) G

VECM and Variance Decomposition: An Application to the Consumption-Wealth Ratio

Source and Channel Coding Issues for ATM Networks y. ECSE Department, Rensselaer Polytechnic Institute, Troy, NY 12180, U.S.A

ANANKASTIC CONDITIONALS

Enabling Switch Devices

Communication Systems, 5e

SMD LED Product Data Sheet LTSA-G6SPVEKT Spec No.: DS Effective Date: 10/12/2016 LITE-ON DCC RELEASE

SC434L_DVCC-Tutorial 1 Intro. and DV Formats

Marjorie Thomas' schemas of Possible 2-voice canonic relationships

THERMOELASTIC SIGNAL PROCESSING USING AN FFT LOCK-IN BASED ALGORITHM ON EXTENDED SAMPLED DATA

Circuit Breaker Ratings A Primer for Protection Engineers

I (parent/guardian name) certify that, to the best of my knowledge, the

R&D White Paper WHP 120. Digital on-channel repeater for DAB. Research & Development BRITISH BROADCASTING CORPORATION.

Signing Naturally, Teacher s Curriculum Guide, Units 7 12 Copyright 2014 Lentz, Mikos, Smith All Rights Reserved.

Sustainable Value Creation: The role of IT innovation persistence

DIGITAL MOMENT LIMITTER. Instruction Manual EN B

Ten Music Notation Programs

Theatrical Feature Film Trade in the United States, Europe, and Japan since the 1950s: An Empirical Study of the Home Market Effect

LOW LEVEL DESCRIPTORS BASED DBLSTM BOTTLENECK FEATURE FOR SPEECH DRIVEN TALKING AVATAR

ZEP - 644SXWW 640SX - LED 150 W. Profile spot

Trinitron Color TV KV-TG21 KV-PG21 KV-PG14. Operating Instructions M70 M61 M40 P70 P (1)

MELSEC iq-f FX5 Simple Motion Module User's Manual (Advanced Synchronous Control) -FX5-40SSC-S -FX5-80SSC-S

LABORATORY COURSE OF ELECTRONIC INSTRUMENTATION BASED ON THE TELEMETRY OF SEVERAL PARAMETERS OF A REMOTE CONTROLLED CAR

Personal Computer Embedded Type Servo System Controller. Simple Motion Board User's Manual (Advanced Synchronous Control) -MR-EM340GF

Monitoring Technology

Diffusion in Concert halls analyzed as a function of time during the decay process

H3CR. Multifunctional Timer Twin Timer Star-delta Timer Power OFF-delay Timer H3CR-A H3CR-AS H3CR-AP H3CR-A8 H3CR-A8S H3CR-A8E H3CR-G.

TLE7251V. 1 Overview. Features. Potential applications. Product validation. High Speed CAN-Transceiver with Bus Wake-up

TLE Overview. High Speed CAN FD Transceiver. Qualified for Automotive Applications according to AEC-Q100

TLE7251V. Data Sheet. Automotive Power. High Speed CAN-Transceiver with Bus Wake-up TLE7251VLE TLE7251VSJ. Rev. 1.0,

TLE9251V. 1 Overview. High Speed CAN Transceiver. Qualified for Automotive Applications according to AEC-Q100. Features

Press Release. Dear Customers, Dear Friends of Brain Products,

Tarinaoopperabaletti

ERGODIC THEORY APPROACH TO CHAOS: REMARKS AND COMPUTATIONAL ASPECTS

Transcription:

Robert Laganière, Raphael Bacco, Arnaud Hocevar
VIVA lab, SITE, University of Ottawa, K1N 6N5, CANADA
laganier@site.uottawa.ca

Patrick Lambert, Grégory Païs
LISTIC, Polytech Savoie, Annecy, FRANCE
patrick.lambert@univ-savoie.fr

Bogdan E. Ionescu
LAPI, University "Politehnica" of Bucharest, 061071 Bucharest, ROMANIA
bionescu@alpha.imag.pub.ro

ABSTRACT
In this paper we present a video summarization method based on the study of spatio-temporal activity within the video. The visual activity is estimated by measuring the number of interest points, jointly obtained in the spatial and temporal domains. The proposed approach is composed of five steps. First, image features are collected using the spatio-temporal Hessian matrix. Then, these features are processed to retrieve the candidate video segments for the summary (denoted clips). Further on, two specific steps are designed to first detect the redundant clips, and second to eliminate the clapperboard images. The final step consists in the construction of the final summary, which is performed by retaining the clips showing the highest level of activity. The proposed approach was tested on the BBC Rushes Summarization task within the TRECVID 2008 campaign.

Categories and Subject Descriptors: I.2.10 [Computing Methodologies]: Artificial Intelligence - vision and scene understanding; H.3.m [Information Systems]: Information Storage and Retrieval - miscellaneous

General Terms: Algorithms, Performance.

Keywords: Video abstract, spatio-temporal features, Hessian-Laplace.

1. INTRODUCTION
The volume of digital video is continuously growing and users require tools to deal with this very large amount of data. Among the different existing tools, video summarization is essential because it allows the relevant content of a video to be grasped quickly. In the literature [10] [8], there are many papers proposing efficient approaches to video summarization. All these approaches differ according to the form of the abstract (still-image, i.e. a collection of salient images, or video skim, i.e. a collection of video segments), to the information sources (internal, provided by the video stream, or external), to the video modality handled (image, sound or text) and to the features extracted for each modality. Generally, the main problem with all these techniques is the gap between the information retrieved from the video data and the semantic concepts required to achieve an efficient summary. In this paper, we try to overcome this issue by addressing the video skimming task using activity-based features. As the aim is to get a very short summary from video rushes, i.e. less than 2%, we have decided to measure the activity within the video using spatio-temporal features. We therefore exploit the hypothesis that, for our task, the relevant candidate video segments are the ones containing a high activity level. Considering previous works, the originality of this approach lies in the joint detection of spatial and temporal features.

The layout of this article is as follows. In the next section, we give a global presentation of the proposed approach.
Techniques for extracting spatio-temporal features are discussed in Section 3, and the way these features are used to obtain the clips (the video segments that are candidates for the summary) is detailed in Section 4. Section 5 presents the algorithm for constituting the final summary. It includes two post-processing steps: redundancy reduction and clapperboard detection (presented in Section 6). Some results are presented in Section 7. Section 8 concludes this article.

2. THE PROPOSED APPROACH
The proposed approach is described in Figure 1. It consists of five processing steps. First, the spatio-temporal features are extracted. This extraction is based on the use of the spatio-temporal Hessian matrix and provides a measure of the activity within each frame. Then, this information is processed to retrieve the keyframes and thereafter the video segments (denoted as clips in the following) which form the candidate set used to build the summary. The basic principle relies on the selection of the segments where the activity level is high. As video rushes contain a lot of redundancy and junk frames (e.g. clapperboard frames), two specific steps are designed to first detect redundant clips and second to eliminate clapperboard images. The final step consists in fusing together all these pieces of information to achieve the final summary, taking into account the time constraint.

Figure 1: Diagram of the proposed approach (spatio-temporal feature detection, keyframe identification, clip extraction, redundancy reduction, clapperboard detection, summary construction along the activity level curve).
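To make the data flow between these five steps concrete, the following minimal Python outline sketches the pipeline. The type and function names (Clip, detect_features, extract_clips, build_summary) are illustrative assumptions introduced here for exposition, not the authors' implementation; each step is detailed in the sections that follow.

```python
# Illustrative outline of the five-step pipeline described above.
# All names are assumptions introduced for exposition only.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Clip:
    t_o: int          # first frame of the clip
    t_f: int          # last frame of the clip
    keyframe: int     # frame of maximal activity inside the clip
    activity: float   # H(c), the clip activity level (Section 4)


def detect_features(video: np.ndarray) -> np.ndarray:
    """Step 1: spatio-temporal Hessian features -> per-frame count n(t) (Section 3)."""
    raise NotImplementedError


def extract_clips(n_t: np.ndarray) -> List[Clip]:
    """Steps 2-3: keyframe identification and clip extraction (Section 4)."""
    raise NotImplementedError


def build_summary(clips: List[Clip], video: np.ndarray, sd: int) -> List[Clip]:
    """Steps 4-5: redundancy reduction, clapperboard removal and
    summary construction under the duration budget SD (Sections 5-6)."""
    raise NotImplementedError


def summarize(video: np.ndarray, sd: int) -> List[Clip]:
    n_t = detect_features(video)        # activity level per frame
    clips = extract_clips(n_t)          # candidate clips around local maxima
    return build_summary(clips, video, sd)
```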

3. SPATIO-TEMPORAL FEATURES
Image features have been largely used in computer vision to perform matching, tracking and recognition tasks. These features, or interest points, are generally detected as local image structures exhibiting significant intensity variations in more than one direction. Since the image partial spatial derivatives are, most often, estimated using Gaussian filters, the idea of performing interest point detection at different scales rapidly emerged as a powerful framework for scale-invariant feature detection. Lindeberg [4] proposed a scale-invariant detector where a feature is declared at a local maximum of the normalized Laplacian in scale-space. Lowe [5] proposed to approximate the Laplacian using DoG filters. Mikolajczyk and Schmid [6] introduced the scale-adapted variant of the classical Harris operator. Other operators have also been proposed; some of them are compared in [7].

More recently, spatial interest point operators have been extended to the third, temporal dimension, i.e. the concept of spatio-temporal feature detectors has been introduced. Motivated by the success of feature-based object recognition, visual features in space-time have been used in event description and recognition. The interest points in a video sequence then become the ones with large variations of pixel intensities in both the spatial and temporal dimensions. Laptev and Lindeberg [3] used the idea of Harris points and constructed a 3x3 spatio-temporal second-moment matrix. As in the Harris operator, this matrix is composed of second order Gaussian derivatives averaged in a predefined neighborhood. The spatio-temporal interest points are the ones for which this matrix has 3 significant eigenvalues. They applied these features to the context of video interpretation. These features have also been used for video synchronization [2].

In this paper, we use spatio-temporal features in the context of video summarization. To this end, we base our operator on the Hessian matrix, which has been shown to produce stable features performing well in object recognition and feature matching applications [6] [7] [1]. The Hessian matrix of a spatio-temporal signal I(x, y, t) is given by:

H(I) = \begin{pmatrix} \frac{\partial^2 I}{\partial x^2} & \frac{\partial^2 I}{\partial x \partial y} & \frac{\partial^2 I}{\partial x \partial t} \\ \frac{\partial^2 I}{\partial x \partial y} & \frac{\partial^2 I}{\partial y^2} & \frac{\partial^2 I}{\partial y \partial t} \\ \frac{\partial^2 I}{\partial x \partial t} & \frac{\partial^2 I}{\partial y \partial t} & \frac{\partial^2 I}{\partial t^2} \end{pmatrix}   (1)

where \partial/\partial x denotes the partial derivative with respect to x. In object recognition, a scale-adapted version of this matrix is generally used. The derivatives are estimated by convolving the image sequence with appropriate Gaussian filters. In the case of a three-dimensional signal (x, y, t), 3D Gaussian filters are, in principle, required, which would be computationally expensive to apply. However, thanks to the separability property of these filters, it is possible to apply the temporal and spatial components separately. The term \partial^2 I / (\partial x \partial t) is then approximated by:

\frac{\partial^2}{\partial x \, \partial t} \left( g_{\sigma_s}(x, y) \star g_{\sigma_t}(t) \star I(x, y, t) \right)   (2)

where g_{\sigma_s}(x, y) is the 2-dimensional Gaussian with variance \sigma_s^2, g_{\sigma_t}(t) is the 1-dimensional Gaussian with variance \sigma_t^2, and \star denotes the convolution operator. The other terms of the matrix are computed similarly. The variance of the Gaussian filters controls the spatial and temporal scales at which the derivatives are evaluated. In our experiments we used a fixed value of 1.5 for both \sigma_t^2 and \sigma_s^2.

A spatio-temporal feature point is declared whenever the determinant of the Hessian matrix, det(H), exceeds a predefined threshold. It is observed that such features occur at prominent motions of visual interest points.

Figure 2: Representing the activity level (number of detected features) vs. frame number for a segment (sample frames 28205, 28216, 28244 and 28254 are shown with their detected features; the keyframe is frame 28222).
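As a rough illustration of this feature-counting step, the sketch below estimates the Hessian entries with separable Gaussian derivative filters (scipy.ndimage.gaussian_filter) and counts, for each frame, the pixels whose det(H) exceeds a threshold. The variances follow the value 1.5 quoted above (gaussian_filter expects standard deviations, hence the square root); the determinant threshold and the grayscale float input are assumptions made for exposition, not the authors' exact settings.

```python
# Sketch of the per-frame activity measure n(t) of Section 3: separable
# Gaussian derivative filtering of a (T, H, W) grayscale volume, followed
# by thresholding det(H). The det threshold is an illustrative value.
import numpy as np
from scipy.ndimage import gaussian_filter


def activity_level(video: np.ndarray,
                   var_s: float = 1.5,          # sigma_s^2 (paper's value)
                   var_t: float = 1.5,          # sigma_t^2 (paper's value)
                   det_threshold: float = 1e-4) -> np.ndarray:
    """Return n(t): per-frame count of pixels with det(H) above the threshold."""
    # gaussian_filter takes standard deviations per axis, here (t, y, x).
    sig = (np.sqrt(var_t), np.sqrt(var_s), np.sqrt(var_s))

    def deriv(order):
        # Gaussian-smoothed partial derivative; 'order' is per-axis (t, y, x).
        return gaussian_filter(video, sigma=sig, order=order)

    Ixx, Iyy, Itt = deriv((0, 0, 2)), deriv((0, 2, 0)), deriv((2, 0, 0))
    Ixy, Ixt, Iyt = deriv((0, 1, 1)), deriv((1, 0, 1)), deriv((1, 1, 0))

    # Determinant of the symmetric spatio-temporal Hessian (Eq. 1) per pixel.
    det = (Ixx * (Iyy * Itt - Iyt ** 2)
           - Ixy * (Ixy * Itt - Iyt * Ixt)
           + Ixt * (Ixy * Iyt - Iyy * Ixt))

    return (det > det_threshold).sum(axis=(1, 2))   # n(t), one count per frame
```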
Figure 2 shows the detected features in a few frames of a short video segment. The next section describes how these visual features are used to extract salient activity clips in a video sequence.

4. CLIP EXTRACTION
Our objective is to summarize a video sequence by extracting only the clips inside the sequence in which the level of activity is significantly high. These clips of salient activity can correspond to scenes where characters are performing quick actions, or where multiple entities are interacting together; they most often constitute key scenes that summarize well the action within the movie. Our main observation, on which the approach presented in this paper is based, is that salient activities in a video sequence will generate important spatio-temporal variations of the pixel intensity. Consequently, it seems appropriate to use the density of spatio-temporal features to detect key clips.

We define a clip, c_{[t_o, t_f]}, as a short video segment (subsequence) composed of the frames starting at I(t_o) and ending at I(t_f). Our strategy then consists in extracting a set of disjoint clips C from the original video which are found to be representative of the film content.

These clips will stand as the basis for generating the final video summary.

To be considered significant, a clip must encompass a scene exhibiting a high activity level. In our approach, the candidate clips are the ones inside which the level of activity reaches a local maximum. The keyframe associated with a given clip will therefore be the one with the highest value of n(t). It follows from this definition that one can identify the candidate set of clips simply by locating the local temporal maxima of n(t). Figure 2 illustrates how the number of feature points n(t) evolves with the action in a video segment.

First we extract the Hessian spatio-temporal features for each frame I(t), with t = 1, ..., T and T representing the video sequence length. The number of detected features, n(t), in each frame is counted. The clip selection process then starts with an empty set of candidate clips, C = ∅. Clips are then iteratively added to C by considering, at each iteration, the frame with the highest feature count, that is:

t_{max} = \arg\max_t \{\, n(t) \mid I(t) \notin C \,\}   (3)

where I(t) ∈ C if ∃ c_{[t_o, t_f]} ∈ C with t ∈ [t_o, t_f]. Each of the frames thus extracted constitutes a keyframe of the film. In order to attenuate the effect of local variations, we first smooth the activity level curve by applying a mean filter (of width equal to 11).

Once a keyframe is identified, the next step consists in determining the boundaries of the enclosing clip (i.e. finding the t_o and t_f of the clip). This is done by identifying the interval inside which the level of activity remains significant enough. Starting from the current I(t_max), we seek forward in the sequence to find the first frame for which the number of spatio-temporal features is less than a fraction α of n(t_max), that is I(t_f) with:

t_f = \min \{\, t \mid n(t) < \alpha \, n(t_{max}) \,\}   (4)

where t ∈ [t_max + 1, T] and 0 < α < 1 corresponds to the percentage of activity reduction (feature count) used to delimit the clip (we used 0.2). The smaller the value of α, the longer the extracted clip. If a frame I(t) ∈ C is found before this condition is met, then t_f is set to the frame preceding it. Similarly, starting from the frame I(t_max), we seek backwards in the sequence for a frame such that:

t_o = \max \{\, t \mid n(t) < \alpha \, n(t_{max}) \,\}   (5)

where t ∈ [1, t_max − 1]. Once the clip boundaries are identified, we can then estimate the level of activity in a clip c_{[t_o, t_f]} as:

H(c_{[t_o, t_f]}) = \sum_{t = t_o}^{t_f} \log n(t)   (6)

The number of feature points is used here to measure the magnitude of the activity; using the log function prevents a single isolated frame with high activity from having too much impact on the global activity level of the clip. This clip activity level measure will be used to build the summary, as explained in the next section.

Note also that, with this approach, the extracted clips vary in duration; the method aims at encompassing the action from the point where it starts until it ends. In some cases, however, one might want to restrict the duration of each clip to a certain maximum, d_max. This can be done using the following approach: if the extracted clip exceeds this maximum, the clip is cropped on each side of its representative keyframe by increasing the value of the threshold α until the required clip duration is reached.

After determining a clip's boundaries, the clip c_{[t_o, t_f]} is included in the clip set C and the process starts over with the selection of a new keyframe. In principle, one can continue to extract keyframes until no more frames are available, since the subset of clips from C that will form the summary is selected at the final step. However, in practice, it is useless to continue clip extraction when the number of features in the current keyframe becomes very low. At this point we have a set of candidate clips extracted from the entire video sequence.
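A compact sketch of this extraction loop, following Equations (3) to (6), is given below. It assumes n_t is the per-frame feature count from Section 3; the minimum-count stopping rule and the log(0) guard are illustrative choices rather than the authors' exact settings.

```python
# Sketch of the clip-extraction loop of Section 4 (Eqs. 3-6). n_t is the
# per-frame feature count n(t); alpha and the smoothing width follow the
# text, while min_count and the log(0) guard are illustrative assumptions.
import numpy as np


def extract_clips(n_t, alpha: float = 0.2, smooth_width: int = 11,
                  min_count: float = 10.0):
    # Mean-filter the activity curve to attenuate local variations.
    n = np.convolve(n_t, np.ones(smooth_width) / smooth_width, mode="same")
    T = len(n)
    free = np.ones(T, dtype=bool)          # frames not yet covered by a clip
    clips = []                             # (t_o, t_f, keyframe, H)

    while free.any():
        candidates = np.flatnonzero(free)
        t_max = int(candidates[np.argmax(n[candidates])])     # Eq. (3)
        if n[t_max] < min_count:           # stop when activity gets very low
            break
        level = alpha * n[t_max]

        t_f = t_max                        # Eq. (4): grow forward while the
        while t_f + 1 < T and free[t_f + 1] and n[t_f + 1] >= level:
            t_f += 1                       # activity stays above alpha*n(t_max)
        t_o = t_max                        # Eq. (5): grow backward likewise
        while t_o - 1 >= 0 and free[t_o - 1] and n[t_o - 1] >= level:
            t_o -= 1

        # Eq. (6): clip activity level (the max() guards against a zero count).
        H = float(np.sum(np.log(np.maximum(n[t_o:t_f + 1], 1.0))))
        clips.append((t_o, t_f, t_max, H))
        free[t_o:t_f + 1] = False          # mark these frames as covered
    return clips
```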
We now want to select a subset of these clips which, when assembled together, gives rise to a summary of a user-specified duration SD. This is explained in the next section. However, a first pruning of this set can be achieved simply by eliminating the clips that are too short. Indeed, very short clips are visually uninteresting and usually not very relevant. The duration of a clip is simply given by:

d(c) = t_f − t_o   (7)

We remove any clip such that d(c) < d_min (e.g. < 1 sec.).

5. BUILDING THE SUMMARY
The summary is composed based on the following considerations: first, the clips with higher levels of activity are generally more meaningful and should therefore be considered in priority; second, a summary should present a good variety of visual elements, so only one instance of a subset of clips with very similar content should be included in the final summary; finally, to ensure that the summary is complete and representative of the original movie, the extracted clips should be well distributed over the temporal duration of the film.

These observations suggest that the summary can be built through an incremental procedure. Starting with a new set of clips, S, initially empty, clips from C are transferred to S until no more clips can be added without exceeding the pre-established maximal summary duration, SD. The clips are added to the summary set according to the following rules:

1. Extract the clip c_max ∈ C with the highest activity level, i.e. H(c_max) > H(c), ∀ c ∈ C with c ≠ c_max.

2. If [d(c_max) + Σ_{c ∈ S} d(c)] > SD then c_max is deleted.

3. If ∃ c_s ∈ S such that D_v(c_s, c_max) < DV_low then c_max is deleted, where D_v() measures the visual distance between clips and DV_low is a threshold under which clips are judged very similar in content.

4. If ∃ c ∈ S such that D_t(c, c_max) < DT_min AND D_v(c, c_max) < DV_high then c_max is deleted. D_t() measures the temporal distance between two clips, which must be greater than the threshold DT_min. When the visual distance between two clips exceeds the threshold DV_high, the two clips are considered very dissimilar in content.

5. If conditions 2, 3 and 4 are not fulfilled, then the current clip c_max is added to S.

The visual distance between clips, D_v(), is computed using a simple yet efficient method based on the computation of color histograms. To capture the global color signature of a clip, a mean color histogram is computed on a percentage p% of the clip frames (usually p ∈ [15; 30]%). The retained frames are color reduced using an accurate color reduction scheme, the Floyd-Steinberg dithering algorithm. The mean histogram is computed as:

\bar{h}_c(i) = \frac{1}{N_c} \sum_{j=1}^{N_c} h_c^j(i)   (8)

where N_c is the total number of retained frames in the clip c, h_c^j(i) is the color histogram of frame j from clip c, and i is the color index from the reduced color palette. The visual similarity function between two clips c_1 and c_2, D_v(c_1, c_2), is then given by the Euclidean distance between their histograms, thus D_v(c_1, c_2) = D_E(\bar{h}_{c_1}, \bar{h}_{c_2}).

Therefore, Step 3 of the algorithm ensures the exclusion of clips that are visually very similar to the ones already selected, while Step 4 ensures that the clips in the summary provide an adequate temporal coverage of the original film. However, in order to avoid excluding significant video segments, clips that show a high level of dissimilarity are accepted even if they are in close temporal proximity. This procedure is repeated until the set C becomes empty. Finally, the clips in S are reordered with respect to time and then assembled together to produce the summary.
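The selection rules above can be sketched as follows. For simplicity this version quantizes colors uniformly instead of using Floyd-Steinberg dithering, represents clips as plain dictionaries, and uses placeholder thresholds; it is an illustrative reading of Section 5, not the authors' code.

```python
# Sketch of the summary-selection rules of Section 5. Clips are plain dicts
# with keys t_o, t_f, keyframe, activity and hist (their mean color
# histogram); colors are reduced by uniform quantization rather than
# Floyd-Steinberg dithering, and all thresholds are placeholder values.
import numpy as np


def clip_histogram(frames, bins: int = 4) -> np.ndarray:
    """Mean color histogram over a subset of the clip's frames (Eq. 8)."""
    hists = []
    for f in frames:                        # f: uint8 RGB image (H, W, 3)
        q = (f.astype(int) // (256 // bins)).reshape(-1, 3)
        idx = (q[:, 0] * bins + q[:, 1]) * bins + q[:, 2]
        h = np.bincount(idx, minlength=bins ** 3).astype(float)
        hists.append(h / h.sum())
    return np.mean(hists, axis=0)


def build_summary(clips, SD: int, DV_low: float = 0.05,
                  DV_high: float = 0.4, DT_min: int = 500):
    """Greedy selection implementing rules 1-5 (SD and DT_min in frames)."""
    ordered = sorted(clips, key=lambda c: c["activity"], reverse=True)  # rule 1
    S, used = [], 0
    for c in ordered:
        dur = c["t_f"] - c["t_o"]                                 # Eq. (7)
        d_v = [np.linalg.norm(c["hist"] - s["hist"]) for s in S]
        d_t = [abs(c["keyframe"] - s["keyframe"]) for s in S]
        if used + dur > SD:                                       # rule 2
            continue
        if any(v < DV_low for v in d_v):                          # rule 3
            continue
        if any(t < DT_min and v < DV_high
               for t, v in zip(d_t, d_v)):                        # rule 4
            continue
        S.append(c)                                               # rule 5
        used += dur
    return sorted(S, key=lambda c: c["t_o"])    # reorder clips by time
```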
6. DETECTING THE CLAPPERBOARDS
In particular cases, such as for video rushes, the original film often contains numerous occurrences of meaningless visual elements. Color bars and clapperboards are the two most common such elements. From the point of view of the summarization task, these elements are not significant and must be disregarded. In our approach, color bars are easily excluded as they do not contain motion. On the other hand, clapperboard scenes always involve significant motion and they are selected as clips of salient activity by our method.

In fact, through our experimental tests, it turns out that our approach acts as an excellent clapperboard detector. Indeed, the exact instant in time when the clapperboard is clapped generally corresponds to a peak of activity on the spatio-temporal feature density graph; this means that the extracted keyframe in a clapperboard shot will be the frame showing the clapperboard clapping, which generally occurs when the clapper is located right at the center of the frame. In addition, clappers have a very specific visual pattern that can be easily characterized. As a consequence, it becomes relatively simple to reliably identify clapperboards in the keyframes extracted by our algorithm just by looking at the color histogram of the center region of the image.

Clapperboards are either white with black edges or black with white edges. Consequently, a clapperboard will produce a histogram with a majority of black and white pixels. We therefore first look for unsaturated pixels, the saturation (or colorfulness) being measured as:

Sat(R, G, B) = \max(R, G, B) − \min(R, G, B)   (9)

If this value is less than a threshold (we used 64), the pixel is considered unsaturated (a gray level). If the majority of the pixels inside a center window (half the size of the image) are unsaturated, we count the number of black and white pixels. For a white clapperboard, the number of bright pixels must be sufficiently large while the number of dark pixels must be low (the intensity being determined by the sum R + G + B). This simple algorithm works surprisingly well, as can be seen in Figure 4.

Figure 4: Several keyframes extracted following the method from Section 4 for one of the TRECVID08 video rushes (red X's = detected clapperboards).
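A minimal sketch of this keyframe test is given below, assuming an RGB keyframe. The unsaturation threshold of 64 follows the text, while the bright/dark intensity cut-offs, the required pixel fractions, and the symmetric handling of black clapperboards are assumptions added for illustration.

```python
# Sketch of the clapperboard test of Section 6 on a keyframe. The
# unsaturation threshold (64) follows the text; the bright/dark intensity
# cut-offs and the required pixel fractions are illustrative assumptions.
import numpy as np


def looks_like_clapperboard(keyframe: np.ndarray,
                            unsat_thresh: int = 64,
                            bright_frac: float = 0.3,
                            dark_frac: float = 0.2) -> bool:
    """keyframe: uint8 RGB image of shape (H, W, 3)."""
    H, W, _ = keyframe.shape
    # Center window: half the size of the image.
    win = keyframe[H // 4: 3 * H // 4, W // 4: 3 * W // 4].astype(int)

    # Saturation (colorfulness) per pixel: max(R,G,B) - min(R,G,B), Eq. (9).
    sat = win.max(axis=2) - win.min(axis=2)
    unsaturated = sat < unsat_thresh
    if unsaturated.mean() <= 0.5:            # majority must be gray-level
        return False

    # Intensity given by R + G + B (range 0..765); cut-offs are assumptions.
    intensity = win.sum(axis=2)
    bright = (intensity > 600) & unsaturated   # "white" pixels
    dark = (intensity < 150) & unsaturated     # "black" pixels

    # White clapperboard: many bright pixels, few dark ones; the black-board
    # case is handled symmetrically here (an added assumption).
    n = unsaturated.sum()
    white_board = bright.sum() > bright_frac * n and dark.sum() < dark_frac * n
    black_board = dark.sum() > bright_frac * n and bright.sum() < dark_frac * n
    return white_board or black_board
```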

7. EXPERIMENTAL RESULTS
The result of the proposed summarization method is illustrated in Figure 3, where a 34 sec. summary was produced from a 30-min video rush. The graph shows the evolution of the activity level (number of detected spatio-temporal features) across the entire sequence. Note that the very high peaks mostly correspond to the shot boundaries in the video. Indeed, scene cuts produce instantaneous spatio-temporal changes when transiting from one shot to another. However, these peaks are discarded from the summary because of their very short duration and the very low total activity level of their corresponding clips. The extracted keyframes, shown in Figure 4, are located on the graph using dots. The video segments that form the summary built from the process described in Section 5 are represented by the set of thick lines. Nine clips were used to build this summary; they correspond to the keyframes 2, 4, 6, 10, 17, 18, 30 and 31 in Figure 4. Inclusion of duplicate scenes, such as the ones represented by keyframes 10 and 12 or by keyframes 30, 33 and 36, has been avoided through the similarity measure described in Section 5.

Figure 3: Graphical representation of n(t) vs. frame number for a TRECVID video. The keyframes shown in Figure 4 are located with dots while the frames included in the summary are represented by gray thick lines.

In the framework of TRECVID 2008 [9], our summaries were judged to have a very pleasant rhythm (rank 5 out of 43), while in terms of junk and duplicate scenes the method obtained average results (rank 24 for junk and 12 for duplicates). However, a low 26% was obtained for the inclusion criterion (rank 39). The fact that the video segments extracted by our method are centered at the peak of activity level constitutes an important factor in the production of summaries with a pleasant tempo. It ensures that each clip is extracted with an appropriate time window that includes the start and the end of the action shown. Also, the fact that we selected a maximal clip duration of 3 seconds contributes to the production of summaries with good fluidity. However, individual clips of longer duration reduce the total number of clips that can be included in the summary and hence reduce the probability of inclusion. If, for an equivalent summary duration, one wants to include more clips in the summary then, with our method, the maximum clip duration should be reduced. Each clip will then be shorter and the procedure will thus include a larger number of clips. However, this will result in a saccadic summary.

8. DISCUSSIONS AND CONCLUSION
This paper presented a method for video summarization based on the detection of spatio-temporal features. Summaries were produced based on the assumption that the relevant candidate video segments to be selected for a summary are the ones exhibiting high activity levels. This is certainly a debatable statement, but the subjective testing of the results seems to confirm its validity. High activity video segments generally correspond to clips where characters are performing important actions. Some important scenes can, however, be missed, such as panorama panning or shots of inactive persons; such scenes are of lesser importance when summarizing the story of a film, but they might be relevant when summarizing a collection of rushes. It also happens that some background motions (e.g. water motion) produce scenes with high activity levels; these are then undesirably included in the summary.

The detection of duplicate scenes is based on the comparison of color histograms. Currently, the thresholds used in these comparisons are set empirically. We are working on an adaptive technique based on clip clustering that will allow the values of these parameters to be determined automatically. The detection of the spatio-temporal features also requires some parameters to be set. However, the method is quite tolerant to these as it relies on the detection of local maxima, similar results being obtained for a large range of values. The most critical parameters are the minimum and maximum clip duration as well as the summary duration; these are, however, entirely subjective and depend on the user's objectives.

9. ACKNOWLEDGMENTS
This work was partially supported by the Rhône-Alpes region, Research Cluster 2 - LIMA project, and by CNCSIS - National University Research Council of Romania, grant 6/01-10-2007/RP-2. Part of this work was done while the first author was an invited professor at Polytech Savoie (Université de Savoie).

10. REFERENCES
[1] A. Bhatia, R. Laganière, G. Roth, Performance Evaluation of Scale-Interpolated Hessian-Laplace and Haar Descriptors for Feature Matching, International Conference on Image Analysis and Processing, pages 61-66, Modena, Italy, 2007.
[2] D. Wedge, D. Huynh, P. Kovesi, Using Space-Time Interest Points for Video Sequence Synchronization, IAPR Conference on Machine Vision Applications, pages 190-194, Tokyo, Japan, 2007.
[3] I. Laptev, T. Lindeberg, Space-time interest points, ICCV03, pages 432-439, 2003.
[4] T. Lindeberg, Feature Detection with Automatic Scale Selection, International Journal of Computer Vision, 30(2), pages 79-116, 1998.
[5] D. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision, 60(2), pages 91-110, 2004.
[6] K. Mikolajczyk, C. Schmid, Scale and affine invariant interest point detectors, International Journal of Computer Vision, 60(1), pages 63-86, 2004.
[7] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, L. Van Gool, A Comparison of Affine Region Detectors, International Journal of Computer Vision, 65(1-2), pages 43-72, 2005.
[8] A.G. Money, H. Agius, Video summarisation: A conceptual framework and survey of the state of the art, Journal of Visual Communication and Image Representation, 19(2), pages 121-143, 2008.
[9] P. Over, A.F. Smeaton, G. Awad, TRECVID 2008 BBC Rushes Summarization Evaluation, Proceedings of the ACM International Workshop on TRECVID Video Summarization, pages 1-20, Vancouver, Canada, 2008.
[10] B.T. Truong, S. Venkatesh, Video abstraction: A systematic review and classification, ACM TOMCCAP, 3(1), 2007.