Multi-View Video Summarization Using Bipartite Matching Constrained Optimum-Path Forest Clustering


1166 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 17, NO. 8, AUGUST 2015

Sanjay K. Kuanar, Kunal B. Ranga, and Ananda S. Chowdhury, Member, IEEE

Abstract—The task of multi-view video summarization is to efficiently represent the most significant information from a set of videos captured for a certain period of time by multiple cameras. The problem is highly challenging because of the huge size of the data, the presence of many unimportant frames with low activity, inter-view dependencies, and significant variations in illumination. In this paper, we propose a graph-theoretic solution to the above problems. A semantic feature in the form of a visual bag of words and visual features like color, texture, and shape are used to model shot-representative frames after temporal segmentation. Gaussian entropy is then applied to filter out frames with low activity. Inter-view dependencies are captured via bipartite graph matching. Finally, the optimum-path forest algorithm is applied for the clustering purpose. Subjective as well as objective evaluations clearly indicate the effectiveness of the proposed approach.

Index Terms—Bipartite matching, Gaussian entropy, multi-view video summarization, optimum-path forest, visual bag of words.

I. INTRODUCTION

Increasing demand for security and traffic monitoring in recent years has led to the deployment of multiple video cameras with overlapping fields of view at public places like banks, ATMs, and road junctions. The surveillance/monitoring systems simultaneously record a set of videos capturing various events. Multi-view video summarization techniques can be applied for obtaining significant information from these videos in a short time [4], [16].
Application areas in which multi-view video summarization can be of immense help include investigative analysis of post-accident scenarios, close scrutiny of traffic patterns, and prompt recognition of suspicious events and activities like theft and robbery at public places. Many works on the summarization of monocular (single-view) videos can be found in the literature [1], [3], [5]–[13]. However, multi-view video summarization poses certain challenges distinct from the mono-view case. The size of the multi-view video data collected by a surveillance camera system for even a few hours can be very large. Moreover, since these videos are captured by fixed camera systems, much of the recorded content is uninteresting, which makes useful information extraction more difficult [2]. Thirdly, since all cameras capture the same scene from different viewpoints, these videos have a large amount of (inter-view) statistical dependency [4]. So, correlations among videos captured from multiple views need to be properly modeled for obtaining an informative and compact summary. Finally, the individual views can suffer from significant variations in illumination. So, mono-view video summarization approaches may not necessarily work well for the multi-view problem [4], [16], [41], [42].

Manuscript received March 21, 2015; revised May 23, 2015; accepted May 30, 2015. Date of publication June 10, 2015; date of current version July 15, 2015. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Changsheng Xu. The authors are with the Department of Electronics and Telecommunications Engineering, Jadavpur University, Kolkata, India (e-mail: sanjay.kuanar@gmail.com; kunalranga@gmail.com; aschowdhury@etce.jdvu.ac.in). Color versions of one or more of the figures in this paper are available online.
In this paper, we propose a graph-based solution for multi-view video summarization where correlations among different views are established via bipartite graph matching [43], [44] and clustering of the high-dimensional video data is performed using the optimum-path forest (OPF) algorithm [34], [35]. After temporal segmentation, semantic as well as visual features (e.g., color, texture, and shape) are used for modeling shot-representative frames. Gaussian entropy is used to filter out frames with low activity.

II. RELATED WORK

We start this section with some representative mono-view video summarization methods. For a comprehensive review of this subject area, please see [1], [5]. Jiang et al. [6] developed an automatic video summarization algorithm following a set of consumer-oriented guidelines as high-level semantic rules. Hannon et al. [7] used time-stamped opinions from social networking sites for generating summaries of soccer matches. Semantically important information from a set of user-inputted keyframes was used by Han et al. [8] for producing video summaries. A generic framework of a user attention model through multiple sensory perceptions was employed by Ma et al. [9]. Since our proposed method is based on graph algorithms, we now mention some graph-based summarization approaches. Dynamic Delaunay clustering with information-theoretic pre-sampling was proposed by Kuanar et al. [11]. Lu et al. [12] developed a graph optimization method that computes an optimal video skim in each scene via dynamic programming. In another approach, Peng et al. [13] showed that highlighted events can be detected using an effective similarity metric for video clips. Temporal graph analysis was carried out by Ngo et al. [14] to effectively encapsulate information for video summarization.

Though several works can be found in the area of mono-view video summarization, very little work has thus far been reported on summarizing multi-view videos. The paucity of research papers in this highly important area has primarily motivated us to investigate the problem. Fu et al. [4] presented a multi-view video summarization technique using random walks applied on spatio-temporal shot graphs. A hypergraph-based representation is used for capturing correlations among different views. Clique expansion is then used to convert the hypergraph into the spatio-temporal graph. Random walks are applied on the spatio-temporal graph for the final clustering. The authors in [4] have mentioned that the graph building process is intrinsically complex and consumes most of the overall processing time. Moreover, short walks may end up in local network neighborhoods [15], producing erroneous clustering results. Li et al. [16] presented another method for abstracting multiple key frames from video datasets using a Support Vector Machine (SVM) and rough sets. However, the performance of an SVM depends on the choice of the kernel and its design parameters [17]. Recently, Ou et al. [42] proposed a low-complexity online multi-view video summarization method running on wireless video sensors to save compression and transmission power while keeping critical information.

We now highlight the contributions of the current work from the viewpoints of both multimedia and graph-theoretic pattern clustering. From the multimedia standpoint, our method produces a more accurate multi-view summary compared to [4], [16], [42] and a faster summary compared to [4]. The salient features of the work are stated below.

1. We use a novel combination of features, namely, color, texture, visual bag of words, and Tamura.
While color and texture comprise the regular visual features, the visual bag of words helps in handling different lighting conditions and the three Tamura features help in handling orientations. This choice of features, which improves the final clustering process, has not been reported in any similar work.

2. In sharp contrast to [4], we develop a compact spatio-temporal graph on which the clustering algorithm is to be applied. Specifically, intra-view redundancy removal, which is equivalent to order reduction of the above spatio-temporal graph, is achieved through Gaussian entropy.

3. Unlike the related existing approaches, we capture the correlations among multiple views in a more accurate and efficient manner using Maximum Cardinality Minimum Weight (MCMW) bipartite matching [43], [44].

4. The unsupervised Optimum-Path Forest (OPF) algorithm [34] is used for the first time in the field of multimedia for rapid clustering of high-volume and somewhat high-dimensional data, obviating the need for any dimensionality reduction technique. To further improve the performance of the OPF algorithm, the match sets from the MCMW matching are imposed as constraints in its input adjacency matrix.

From the point of view of graph-theoretic pattern clustering, MCMW bipartite matching-constrained OPF has, to the best of our knowledge, not been applied before.

III. PROPOSED METHOD

In this section, we provide a detailed description of our method. In Fig. 1, the four main components of our method, namely, video pre-processing, unimportant frame elimination, multi-view correlation, and shot clustering, are shown.

Fig. 1. Flowchart showing various components of our method.

A. Video Preprocessing

The preprocessing part of a video consists of two steps, namely, (i) shot detection and representation, and (ii) feature extraction. These steps are described below.

1) Shot Detection and Representation: Shot boundary detection or temporal segmentation [18]–[21] is carried out first.
To parse the multiple views in our problem, we apply a motion-based shot boundary detection method [22]. Various schemes exist to represent the detected shots using a single key frame or a set of key frames [23]–[25]. However, for large datasets, like surveillance videos, which contain a large number of shots, comparing every pair of frames within a shot becomes computationally prohibitive. Moreover, many shots from the multi-view videos are static in nature with no camera motion. Hence, for representing a shot, we use the middle frame, as it captures the general view of the overall shot content [23].

2) Feature Extraction: We employ both visual and semantic features for content modeling. Within the visual category, color, edge, and shape features are used. We obtain the color histogram in the HSV color space (16 ranges of H, 4 ranges of S, and 4 ranges of V) because it is found to be more resilient to noise and also robust to small changes of the camera position [7], [26]. Since the global color histogram alone is incapable of preserving spatial information, we use texture features as well. These texture features are extracted using the edge histogram descriptor [27]. A video frame is first sub-divided into 16 blocks. Then the local edge histograms for each of these blocks are obtained. Edges are broadly grouped into five bins: vertical, horizontal, 45° diagonal, 135° diagonal, and isotropic. So, the texture of a frame is represented by an 80-dimensional (16 blocks × 5 bins) feature vector. In addition to the color and edge information, we extract three pixel-level Tamura features [28], as they correlate very strongly with human perception. These three features denote coarseness, contrast, and directionality for the neighborhood of the pixels.

In a multi-view video, the representative frame of a shot from one view and that from the adjoining shot of a different view can be considered as partially rotated versions in which the other visual features remain the same. In order to distinguish such frames, Tamura features are used. Multi-view videos are often found to suffer from significant variations in illumination. In such cases, a single event simultaneously captured by different views will be erroneously treated as different events, since the views are visually dissimilar. To deal with this type of situation, we also consider semantic features between the shots. An event is modeled as a semantic object. Semantic similarity between documents is addressed using the Bag of Words model [29]. To capture the semantic similarity of the events in the multi-view videos, we apply the visual bag of words (BoVW) model to our problem [30], [31]. Visual words are obtained by applying K-means clustering on SIFT features [32] extracted from all the shot-representative frames. Each visual word is represented by a cluster. The ith visual word appears in a shot if there exist some SIFT feature points of the shot-representative frame within the ith cluster. A shot is represented by a vector (w1, ..., wK), where wi represents the normalized frequency of the ith visual word and K is the total number of visual words/clusters. Hence, we define a vocabulary as a set of centroids, where every centroid represents a word. We consider 500,000 SIFT features for this work and group them into 100 or 1000 clusters based on the duration and heterogeneity of the videos. After combining all the features, a frame is eventually represented by a 439-dimensional feature vector (256 for color, 80 for texture, 3 for Tamura, and 100 for the visual bag of words).
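To make the feature-modeling step concrete, the sketch below assembles a 439-dimensional descriptor for one synthetic frame. The 16×4×4 HSV split, the gradient-based edge binning, the random "vocabulary", and the zeroed Tamura component are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def color_hist(hsv):
    # 16x4x4 HSV quantization -> 256-bin normalized histogram (assumed split).
    h, _ = np.histogramdd(hsv.reshape(-1, 3), bins=(16, 4, 4),
                          range=[(0, 360), (0, 1), (0, 1)])
    h = h.ravel()
    return h / max(h.sum(), 1)

def edge_hist(gray, grid=4):
    # 16 blocks x 5 orientation bins = 80-D edge histogram descriptor.
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180
    bins = np.full(gray.shape, 4)                      # isotropic by default
    strong = mag > mag.mean()
    bins[strong & (np.abs(ang - 90) < 22.5)] = 0       # vertical
    bins[strong & ((ang < 22.5) | (ang > 157.5))] = 1  # horizontal
    bins[strong & (np.abs(ang - 45) < 22.5)] = 2       # 45-degree diagonal
    bins[strong & (np.abs(ang - 135) < 22.5)] = 3      # 135-degree diagonal
    bh, bw = gray.shape[0] // grid, gray.shape[1] // grid
    out = []
    for i in range(grid):
        for j in range(grid):
            blk = bins[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].ravel()
            h = np.bincount(blk, minlength=5).astype(float)
            out.append(h / max(h.sum(), 1))
    return np.concatenate(out)

def bovw_hist(desc, vocab):
    # Nearest visual word per SIFT-like descriptor, normalized word counts.
    d = np.linalg.norm(desc[:, None, :] - vocab[None, :, :], axis=2)
    h = np.bincount(d.argmin(axis=1), minlength=len(vocab)).astype(float)
    return h / max(h.sum(), 1)

# Synthetic stand-ins for one shot-representative frame and its descriptors.
hsv = np.dstack([rng.random((64, 64)) * 360,
                 rng.random((64, 64)), rng.random((64, 64))])
gray = rng.random((64, 64))
vocab = rng.random((100, 128))   # stand-in for K-means centroids over SIFT
bovw = bovw_hist(rng.random((150, 128)), vocab)
tamura = np.zeros(3)             # coarseness, contrast, directionality (stub)

frame_vector = np.concatenate([color_hist(hsv), edge_hist(gray), tamura, bovw])
```

The four pieces concatenate to 256 + 80 + 3 + 100 = 439 dimensions, matching the descriptor length used throughout the paper.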
B. Unimportant Shot Elimination Using Gaussian Entropy

A careful scrutiny of the multi-view videos reveals a lot of redundancy in the contents of the individual views, as they are mostly captured by static cameras. In this step, we remove the unimportant (low- or no-activity) shots using the Gaussian entropy fusion model [4]. The importance of a shot is expressed as the interaction of its feature sets, as in (1) and (2), where I(s) and I(f) denote the information content of a shot and that of a set of features, respectively. A well-known measure of information content is entropy. We accordingly modify equation (2) into (3), where f_k^i is the kth feature set for the ith shot-representative frame and H(f_k^i) denotes the entropy of that feature set. The entropy of a shot is calculated by taking into consideration the color, texture, shape, and BoVW features. Hence, the number of feature sets F is in our case set to 4. The Gaussian entropy of a shot is then expressed as in (4).

Fig. 2. Gaussian entropy scores for frames in the Office1 video. Blue dots show unimportant (low-activity) frames with low scores and green dots show important (high-activity) frames with high scores.

The representative frames of the various shots are sorted in ascending order of Gaussian entropy. We then eliminate those shots whose representative frames have entropy scores below a certain threshold, chosen experimentally for a given dataset. In Fig. 2, we show the distribution of the frames according to their Gaussian entropy scores for view 1 of the Office1 video. Here the lowest entropy value was found to be 7.20, and the threshold was set experimentally.

C. Multi-View Correlation Using Bipartite Matching

The use of bipartite matching for solving correspondence problems in various contexts, such as shape matching and object recognition, is well known in the area of computer vision [43], [44]. The general underlying principle is to model the correspondence problem as an optimal assignment problem.
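Stepping back to the filtering stage of Section III-B, a minimal sketch of entropy-based shot elimination is shown below. The simple sum fusion of per-feature-set entropies is a simplifying assumption for illustration; the paper's Gaussian entropy fusion (1)–(4) may weight the feature sets differently.

```python
import math

def shannon_entropy(hist):
    # Entropy of a normalized feature histogram (0 log 0 := 0).
    return -sum(p * math.log(p) for p in hist if p > 0)

def shot_score(feature_sets):
    # Simplified fusion: sum the entropies of the F feature sets
    # (color, texture, shape/Tamura, BoVW -> F = 4 in the paper).
    return sum(shannon_entropy(h) for h in feature_sets)

def filter_shots(shots, threshold):
    # Keep only shots whose fused entropy score clears the threshold.
    return [name for name, feats in shots if shot_score(feats) >= threshold]

flat = [0.25, 0.25, 0.25, 0.25]      # high-activity (high-entropy) stand-in
peaked = [0.97, 0.01, 0.01, 0.01]    # low-activity (low-entropy) stand-in
shots = [("busy", [flat] * 4), ("static", [peaked] * 4)]
kept = filter_shots(shots, threshold=1.0)  # "static" falls below threshold
```

As in the paper, low-activity shots concentrate their feature histograms into few bins, score low, and are dropped before graph construction.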
For this work, the problem of finding the correlation among multiple views is modeled as a Maximum Cardinality Minimum Weight (MCMW) matching problem in bipartite graphs. A key advantage of using bipartite matching is its inherent scalability, which arises from its polynomial (cubic) time complexity [10]. We assume that the similarity between shots across multiple views can be measured through the similarity of the key frames extracted from the corresponding shots. Let us represent any two overlapping views of a multi-view video by V_a and V_b, with n_a and n_b as their respective numbers of shot-representative frames (each frame being a point in the 439-dimensional feature space). So, we can write V_a = {f_1^a, ..., f_{n_a}^a} and V_b = {f_1^b, ..., f_{n_b}^b}, where f_i^a represents the ith shot-representative frame in view a. To capture the similarity between these views, we construct a bipartite graph with two disjoint vertex sets representing the two views. The vertices of the graph are the shot-representative frames of the views which pass through the Gaussian entropy based filtering stage. The edge set is denoted by E, where e_ij denotes the edge connecting the vertices f_i^a and f_j^b. The edge weight between f_i^a and f_j^b is computed as the Euclidean distance between these two points in the 439-dimensional space. After applying the MCMW algorithm, we obtain a match set whose elements

indicate the actual key frame correspondences between the two views V_a and V_b. The time complexity of the MCMW algorithm is O(n^3), where n = max(n_a, n_b). We can apply this MCMW bipartite matching algorithm between every pair of views and obtain similar correspondences. So, for N views, we need to apply the MCMW bipartite matching algorithm N(N-1)/2 times. The overall complexity of the matching process is N(N-1)/2 times O(n^3), which remains practical since N is typically small (e.g., N = 4 for the Office1 dataset).

D. Shot Clustering by OPF

We apply the OPF algorithm in an unsupervised manner [34] for clustering the key frames. We choose this method for its rapid clustering of high-volume and somewhat high-dimensional data, obviating the need for any dimensionality reduction technique. The method is also not constrained by the number or the form of the clusters [35], [36]. OPF is based on the Image Foresting Transform (IFT) [33]. An appropriate graph is constructed on which OPF is to be applied. The node set contains the shot-representative frames retained after the Gaussian entropy based filtering. For every frame there is a feature vector. Node pairs appearing in any of the N(N-1)/2 MCMW bipartite match sets (for N views) form the edges of the graph. So, bipartite matching is used to refine the adjacency relation for OPF in the form of the following constraint: a node pair (s, t) is adjacent if (s, t) appears in some MCMW match set, and non-adjacent otherwise (5). Let d(s, t) be the adjacent edge weight, i.e., the distance between two adjacent frames s and t in the feature space (6). The graph also has node weights in the form of a probability density function (pdf) that can characterize the relevant clusters. The pdf can be estimated using a Parzen-window [36] approach with a Gaussian kernel,

rho(s) = (1 / (sqrt(2*pi*sigma^2) * |A(s)|)) * sum_{t in A(s)} exp(-d(s, t)^2 / (2*sigma^2))  (7)

where A(s) is the adjacency set of node s, sigma = d_f / 3 [34], and d_f is the maximum arc weight in the graph. This choice of sigma considers all adjacent nodes for density computation, since a Gaussian function covers most samples within 3*sigma.
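The MCMW matching that supplies these match sets can be sketched with SciPy's Hungarian-method solver, which returns exactly a maximum-cardinality minimum-weight assignment on a rectangular cost matrix. The frame descriptors here are random stand-ins for the 439-D vectors of two views.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
# Stand-ins for the 439-D shot-representative frames of two views
# (5 surviving frames in view A, 7 in view B, after entropy filtering).
view_a = rng.random((5, 439))
view_b = rng.random((7, 439))

# Edge weights: Euclidean distances between every cross-view frame pair.
cost = cdist(view_a, view_b)

# Hungarian method: maximum-cardinality minimum-weight bipartite matching,
# solved in polynomial (cubic) time.
rows, cols = linear_sum_assignment(cost)
match_set = list(zip(rows.tolist(), cols.tolist()))
```

Every frame of the smaller side is matched to a distinct frame of the larger side. For N views the pairwise matching runs N(N-1)/2 times; with N = 4 that is 6 runs.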
For the clustering process, we require the concept of a path and a connectivity function denoting the strength of association of any node in the path to its root. A path pi_t is a sequence of adjacent nodes starting from a root and ending at a node t, with <t> being a trivial path and pi_s . <s, t> the concatenation of pi_s and the arc (s, t). We need to maximize the connectivity f(pi_t) for all nodes t, where

f(<t>) = rho(t) if t belongs to R, and rho(t) - delta otherwise;
f(pi_s . <s, t>) = min{f(pi_s), rho(t)}  (8)

In (8), delta is a small positive constant and R is a root set with one element for each maximum of the pdf. Initially, all nodes define trivial paths. Higher values of delta reduce the number of maxima. We set the parameters following [37]. The IFT algorithm maximizes f such that the optimum paths form an optimum-path forest. This algorithm is implemented using the code made available by the authors of [34], and its time complexity follows that reported in [34].

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS

Four multi-view datasets with thirty videos in total, along with the corresponding ground truths from [4], [42], are used for the experiments, which are carried out on a desktop PC with an Intel(R) Core(TM) i processor and 8 GB of DDR2 memory. Table I shows the information on the experimental data.

TABLE I. DATASET INFORMATION

For the multi-view summaries generated by our method, please visit: result/multi-view-video-summarization. The F-measure [8] is used as the objective measure, while Informativeness [4] and Visual pleasantness [38] are used as the subjective measures for performance evaluation. For the BoVW model, we experimentally choose 100 words for the Office1 and Office Lobby datasets and 1000 words for the more complex Campus and BL-7F datasets to achieve the best performance. Fig. 3 shows the F-measure for various choices of K for the four datasets.

Fig. 3. Variations in F-measure values with the number of words (K) for the different datasets.
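As a concrete illustration of the clustering of Section III-D, here is a simplified, pure-Python OPF sketch: it estimates a Gaussian density over each node's adjacency, lets density maxima become roots, and propagates labels by maximizing f(pi_s . <s, t>) = min{f(pi_s), rho(t)}. The normalization constant of (7) is folded into the density and the delta handling is simplified, so this is a didactic sketch rather than the implementation of [34].

```python
import heapq
import math

def opf_cluster(points, adjacency, sigma):
    # points: list of coordinate tuples; adjacency: list of neighbor lists.
    n = len(points)

    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b]))

    # Gaussian (Parzen-like) density over each node's neighbors.
    rho = [sum(math.exp(-d2(s, t) / (2 * sigma ** 2)) for t in adjacency[s])
           / max(len(adjacency[s]), 1) for s in range(n)]

    delta = 1e-9
    val = [r - delta for r in rho]   # trivial-path values rho(t) - delta
    label = [-1] * n
    heap = [(-val[s], s) for s in range(n)]
    heapq.heapify(heap)
    next_label = 0
    while heap:
        negv, s = heapq.heappop(heap)
        if -negv < val[s]:
            continue                  # stale queue entry
        if label[s] == -1:            # s becomes the root of a new cluster
            label[s] = next_label
            next_label += 1
            val[s] = rho[s]           # roots get the un-penalized value
        for t in adjacency[s]:
            v = min(val[s], rho[t])   # connectivity along pi_s . <s, t>
            if v > val[t]:
                val[t] = v
                label[t] = label[s]
                heapq.heappush(heap, (-v, t))
    return label

# Two well-separated point groups, adjacency restricted within each group
# (as the MCMW match sets would restrict it across views).
pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
adj = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4]]
labels = opf_cluster(pts, adj, sigma=1.0)
```

Each connected, high-density region grows one optimum-path tree, so the two groups receive two distinct cluster labels.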
A. Validation of the Components in the Solution Pipeline

To show the successive improvements from the various components in our solution pipeline, we use a simple base method (A) where the K-means clustering algorithm is applied to the visual features (color, edge, and shape) for obtaining the multi-view summary. K-means is chosen because of its low computational overhead in clustering high-dimensional data [39]. To show the utility of OPF, we apply both K-means [39] and OPF clustering to the visual features (denoted as B). The F-measure values in Table II clearly

indicate that B performs significantly better than A for all four datasets. We next show the effectiveness of the semantic features by applying the OPF algorithm on the combined (visual bag of words plus visual) feature set (denoted as C). Table II clearly demonstrates that C has a better score than B. This is mainly because of the poor recall score of B, as its summary contains more false negatives (more missed frames) compared to that of C. These results show that the use of semantic features can lead to a significant increase in the number of salient events detected in the summaries.

TABLE II. COMPARATIVE PERFORMANCE ANALYSIS WITH VARIOUS APPROACHES

Fig. 4. Representative frames of strongly correlated shots, as detected by our method, across pairs of different views for the Office1 dataset.

We next demonstrate the effectiveness of the MCMW bipartite matching. We call the method where MCMW bipartite matching is added to C method D. Thus, D represents our complete solution. From a comparison of the measures for C and D in Table II, it becomes evident that the incorporation of bipartite matching significantly improves the quality of the results. In Fig. 4, we present the strongly correlated shots, with their representative middle frames, across different pairs of views for the Office1 video. Here we notice that most of the strongly correlated shots show the same activity simultaneously recorded by the different views. We finally demonstrate the utility of the Gaussian entropy fusion model for the elimination of unimportant frames. For that purpose, we use D without the Gaussian entropy model. As shown in Table II, the precision of D is much higher than that of this variant; the drop in precision is attributed to the presence of unimportant shots as false positives in the generated summary.
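The objective comparisons above and in the following tables rest on precision, recall, and their harmonic mean. A minimal computation of the F-measure from event counts (the variable names are illustrative, and the counts below are made up for the example):

```python
def f_measure(matched, detected, ground_truth):
    # matched: events correctly detected; detected: events in the summary;
    # ground_truth: events annotated in the reference summary.
    precision = matched / detected
    recall = matched / ground_truth
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 9 of 10 summarized events are correct,
# out of 12 ground-truth events.
score = f_measure(9, 10, 12)   # precision 0.9, recall 0.75
```

A summary with many false positives (unimportant shots retained) lowers precision, while missed events lower recall; the F-measure penalizes both, which is why it is used to rank the pipeline variants A through D.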
B. Comparison With Some Mono-View Methods

We first compare our method with some mono-view video summarization methods, as in [4]. Table III reveals that there is a lot of redundancy (simultaneous presence of most of the events) in the summaries obtained from all the mono-view strategies. Furthermore, there exist significant temporal overlaps among the summarized multi-view shots in these methods. The proposed multi-view summarization method has much less redundancy and captures most of the important events that were missing in the mono-view summaries. This is because a mono-view method fails to capture the correlation between multiple views, as it simply coalesces the videos from multiple views into a single combined video.

C. Comparison With the State-of-the-Art Multi-View Methods

We now compare our method with four state-of-the-art multi-view methods, namely, [4], [16], [41], [42], based on the available results and ground truths. Out of these four methods, [41] uses a learning strategy. Note that the methods in [2], [40] use video data from multiple cameras mainly for the purpose of

tracking and are hence left out of the comparison. In [4], the authors capture the correlations among the different views using a hypergraph-based representation, which was found to be intrinsically complex and slow. The Office1 video is a case in point. We found that, on average, the MCMW bipartite matching algorithm for any pair of views takes less than 1 min to complete. For the Office1 video, there are 4 views. This requires altogether 6 MCMW matchings, which take less than 5 min to complete. The feature extraction takes 3 min and obtaining the Gaussian entropy scores for the shots takes around 2 min. So, the processing time for our method is about 10 min, as compared to the reported value of 15 min in [4], before a clustering algorithm is applied. The execution time of the OPF clustering in our work is about the same as that of the random-walks-based clustering in [4]. Hence, we say that the proposed method is almost 30% faster than that of [4].

TABLE III. PERFORMANCE COMPARISON WITH TWO MONO-VIEW AND THREE MULTI-VIEW METHODS

TABLE IV. PERFORMANCE COMPARISON WITH A LEARNING-BASED MULTI-VIEW METHOD [41]

TABLE V. STATISTICAL DATA SHOWING SUBJECTIVE COMPARISON WITH [4]

Next, we show the comparisons using the objective measure in Table III and Table IV. Table III shows that the precision of our method, as well as that of [4], is 100% for the Office1 and Office Lobby datasets and somewhat lower for the Campus video dataset. The presence of the fence in the Campus video has caused this reduction in precision. Still, for the Campus video dataset, the precision of our method is about 6% better than that of [4]. For all three datasets, we obtain a better recall than [4]. Overall, the F-measure of our method clearly surpasses that of [4] for all three datasets. The same table also demonstrates the superiority of the proposed method over [16] in terms of a much higher F-measure value.
Note that the method of [16] is frame-based, and more than one key frame is used to describe a single shot. So, in that work, more than one key frame may be detected for one video shot (one event shot). In contrast, our method as well as that of [4] are both shot-based, and only a single key frame is used to describe a single shot. That is why the number of detected events in the case of [16] is higher.

Fig. 5. Representative frames of some events detected by our method which were missed by other approaches.

Table III also exhibits the superiority of the proposed method over [42] for the Office, Lobby, and BL-7F datasets in terms of a much higher F-measure. Table IV shows that, although we have not used any learning strategy in our method, our results are quite comparable to those of [41], which uses a metric learning approach. For the subjective evaluation, we conducted a user study with 10 volunteers. Summaries generated by the proposed method and by [4] for all the datasets were shown to the volunteers. We asked them to assign a score between 1 and 5 (1 indicates the worst score and 5 the best score) for Informativeness and Visual pleasantness for each summary. From the user study results in Table V, it is evident that most of our summaries are more informative compared to those of [4]. The maximum increase in Informativeness is 18% for the challenging Campus video. Similarly, a comparison

of the Visual pleasantness values indicates that for four out of five videos we obtain a better result. The maximum improvement is about 12% for the Office Lobby video. We now point out some of the relevant events which we could correctly detect but which were reported as missed (false negatives) by [4]. In the Office1 video, the event of a member pulling a thick book from the shelf, which is the 36th shot from the second view, is preserved by our summary [Fig. 5(a)]. The event from the 4th view of the Campus video, which captures a bus moving from right to left outside the fence, is preserved by our summary [Fig. 5(b)]. Similarly, another event from the 3rd view of the Office Lobby video, where a woman wearing a white coat walks across the lobby towards the gate without interrupting the man playing with the baby, is captured by our method [Fig. 5(c)]. So, the summary can play an important role in sophisticated video surveillance tasks like event and activity detection [45]. The reasons we outperform three multi-view approaches and stay close to the learning-based one are that we use i) a very strong set of features to handle complex issues like lighting conditions and varying orientations, ii) a compact spatio-temporal graph with intra-view redundancy removed by Gaussian entropy, iii) an efficient and accurate way of capturing the correlation among multiple views via MCMW bipartite matching, and iv) OPF clustering to handle high-volume and somewhat high-dimensional data. Finally, to statically represent our multi-view summaries, we introduce a View-board, as illustrated in Fig. 6. The representative middle frames of the summarized shots are assembled along the timeline across multiple views.

Fig. 6. View-board of the multi-view summary. Representative frames of the summarized shots arranged in temporal order.
Each shot is associated with a number (shown inside a box) that indicates the view to which the shot belongs.

V. CONCLUSION AND FUTURE WORK

We have presented a novel framework for the multi-view video summarization problem using bipartite matching constrained OPF. The problem of capturing the inherent correlation between the multiple views was modeled as an MCMW matching problem in bipartite graphs. OPF clustering is finally used for summarizing the multi-view videos. Performance comparisons show marked improvement over several existing mono-view and multi-view summarization algorithms. In the future, we will focus on integrating a more extensive set of video features. We also plan to work with long-duration multi-view surveillance videos to demonstrate the scalability of our approach.

ACKNOWLEDGMENT

The authors would like to thank Prof. R. W. Robinson of the University of Georgia, Athens, GA, USA, and Prof. A. X. Falcão of the University of Campinas, Campinas, Brazil, for helpful discussions on bipartite graph matching and OPF clustering.

REFERENCES

[1] B. T. Truong and S. Venkatesh, "Video abstraction: A systematic review and classification," ACM Trans. Multimedia Comput., Commun., Appl., vol. 3, no. 1, pp. 1–37, 2007.
[2] C. De Leo and B. S. Manjunath, "Multicamera video summarization and anomaly detection from activity motifs," ACM Trans. Sensor Netw., vol. 10, no. 2, pp. 1–30, 2014, Article 27.
[3] A. S. Chowdhury, S. Kuanar, R. Panda, and M. N. Das, "Video storyboard design using Delaunay graphs," in Proc. IEEE Int. Conf. Pattern Recog., Nov. 2012.
[4] Y. Fu, Y. Guo, Y. Zhu, F. Liu, C. Song, and Z. H. Zhou, "Multi-view video summarization," IEEE Trans. Multimedia, vol. 12, no. 7, Nov. 2010.
[5] A. G. Money and H. W. Agius, "Video summarization: A conceptual framework and survey of the state of the art," J. Vis. Commun. Image Representation, vol. 19, no. 2, 2008.
[6] W. Jiang, C. Cotton, and A. C. Loui, "Automatic consumer video summarization by audio and visual analysis," in Proc.
IEEE Int. Conf. Multimedia Expo, Jul. 2011.
[7] J. Hannon, K. McCarthy, J. Lynch, and B. Smyth, "Personalized and automatic social summarization of events in video," in Proc. Int. Conf. Intell. User Interfaces, 2011.
[8] B. Han, J. Hamm, and J. Sim, "Personalized video summarization with human in the loop," in Proc. IEEE Workshop Appl. Comput. Vis., 2011.
[9] Y. F. Ma, X. S. Hua, L. Lu, and H. J. Zhang, "A generic framework of user attention model and its application in video summarization," IEEE Trans. Multimedia, vol. 7, no. 5, Oct. 2005.
[10] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Delhi, India: Prentice-Hall of India.
[11] S. K. Kuanar, R. Panda, and A. S. Chowdhury, "Video key frame extraction through dynamic Delaunay clustering with a structural constraint," J. Vis. Commun. Image Representation, vol. 24, no. 7, 2013.
[12] S. Lu, I. King, and M. R. Lyu, "Video summarization by video structure analysis and graph optimization," in Proc. IEEE Int. Conf. Multimedia Expo, 2004.
[13] Y. Peng and C. W. Ngo, "Clip-based similarity measure for query-dependent clip retrieval and video summarization," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 5, May 2006.
[14] C. W. Ngo, Y. F. Ma, and H. J. Zhang, "Video summarization and scene detection by graph modeling," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 2, Feb. 2005.
[15] A. Z. Broder, A. R. Karlin, P. Raghavan, and E. Upfal, "Trading space for time in undirected S-T connectivity," SIAM J. Comput., vol. 23, 1994.

[16] P. Li, Y. Guo, and H. Sun, "Multi-keyframe abstraction from videos," in Proc. IEEE Int. Conf. Image Process., Brussels, Belgium, Sep. 2011.
[17] A. Ben-Hur and J. Weston, "A user's guide to support vector machines," Methods Molecular Biol., Data Mining Tech. Life Sci., vol. 609.
[18] C. Cotsaces, N. Nikolaidis, and I. Pitas, "Video shot detection and condensed representation: A review," IEEE Signal Process. Mag., vol. 23, no. 2, Mar.
[19] H. Zhang, A. Kankanhalli, and S. W. Smoliar, "Automatic partitioning of full-motion video," Multimedia Syst., vol. 1, no. 1.
[20] D. A. Adjeroh and M. C. Lee, "Robust and efficient transform domain video sequence analysis: An approach from the generalized color ratio model," J. Vis. Commun. Image Representation, vol. 8, no. 2.
[21] R. Zabih, J. Miller, and K. Mai, "A feature-based algorithm for detecting and classifying production effects," Multimedia Syst., vol. 7, no. 2.
[22] A. Amel, B. Abdessalem, and M. Abdellatif, "Video shot boundary detection using motion activity descriptor," J. Telecommun., vol. 2, no. 1.
[23] Z. Rasheed and M. Shah, "Detection and representation of scenes in videos," IEEE Trans. Multimedia, vol. 7, no. 6, Dec.
[24] V. T. Chasanis, A. C. Likas, and N. P. Galatsanos, "Scene detection in videos using shot clustering and sequence alignment," IEEE Trans. Multimedia, vol. 11, no. 1, Jan.
[25] P. P. Mohanta, S. K. Saha, and B. Chanda, "A heuristic algorithm for video scene detection using shot cluster sequence analysis," in Proc. Indian Conf. Vis. Graph. Image Process., 2010.
[26] G. Paschos, "Perceptually uniform color spaces for color texture analysis: An empirical evaluation," IEEE Trans. Image Process., vol. 10, no. 6, Jun.
[27] B. S. Manjunath, J. R. Ohm, V. V. Vasudevan, and A. Yamada, "Color and texture descriptors," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 6, Jun.
[28] H. Tamura, S. Mori, and T. Yamawaki, "Textural features corresponding to visual perception," IEEE Trans. Syst., Man, Cybern., vol. SMC-8, no. 6, Jun.
[29] D. Bollegala, Y. Matsuo, and M. Ishizuka, "A web search engine based approach to measure semantic similarity between words," IEEE Trans. Knowl. Data Eng., vol. 23, no. 7, Jul.
[30] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2003, vol. 2.
[31] Y. G. Jiang, J. Yang, C. W. Ngo, and A. G. Hauptmann, "Representations of keypoint-based semantic concept detection: A comprehensive study," IEEE Trans. Multimedia, vol. 12, no. 1, Jan.
[32] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2.
[33] A. X. Falcão, J. Stolfi, and R. A. Lotufo, "The image foresting transform: Theory, algorithms, and applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 1, Jan.
[34] L. M. Rocha, F. A. M. Cappabianco, and A. X. Falcão, "Data clustering as an optimum path forest problem with applications in image analysis," Int. J. Imag. Syst. Technol., vol. 19, no. 2.
[35] J. P. Papa, F. A. M. Cappabianco, and A. X. Falcão, "Optimizing optimum-path forest classification for huge datasets," in Proc. Int. Conf. Pattern Recog., 2010.
[36] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. New York, NY, USA: Wiley-Interscience, 2001, vol. 2.
[37] F. A. M. Cappabianco, A. X. Falcão, C. L. Yasuda, and J. K. Udupa, "Brain tissue MR-image segmentation via optimum-path forest clustering," J. Comput. Vis. Image Understanding, vol. 116.
[38] J. Sasongko, C. Rohr, and D. Tjondronegoro, "Efficient generation of pleasant video summaries," in Proc. TRECVID BBC Rushes Summarization Workshop, ACM Multimedia, New York, NY, USA, 2008.
[39] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. Berkeley Symp. Math. Statist. Probability, 1967.
[40] X. Zhu, J. Liu, J. Wang, and H. Lu, "Key observation selection-based effective video synopsis for camera network," Machine Vis. Appl., vol. 25.
[41] Y. Fu, "Multi-view metric learning for multi-view video summarization," CoRR, vol. abs/, 2014. [Online].
[42] S. H. Ou, C. H. Lee, V. S. Somayazulu, Y. K. Chen, and S. Y. Chien, "On-line multi-view video summarization for wireless video sensor network," IEEE J. Sel. Topics Signal Process., vol. 9, no. 1, Feb.
[43] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, Apr.
[44] A. Shokoufandeh and S. Dickinson, "Applications of bipartite matching to problems in object recognition," in Proc. ICCV Workshop Graph Algorithms Comput. Vis., 1999.
[45] P. Napoletano, G. Boccignone, and F. Tisato, "Attentive monitoring of multiple video streams driven by a Bayesian foraging strategy," CoRR, vol. abs/, 2014. [Online].

Sanjay K. Kuanar received the M.E. degree in electronics and telecommunication engineering from Jadavpur University, Kolkata, India, in 2007, and is currently working towards the Ph.D. degree at Jadavpur University. His current research interests include pattern recognition, multimedia analysis, and computer vision.

Kunal B. Ranga received the B.E. degree in computer science and engineering from the Government Engineering College Bikaner, Bikaner, India, in 2007, and is currently working towards the M.E. degree at Jadavpur University, Kolkata, India. His current research interests include pattern recognition, multimedia analysis, and computer vision.

Ananda S. Chowdhury (M'01) received the Ph.D. degree in computer science from the University of Georgia, Athens, GA, USA, in July. He is currently an Associate Professor with the Department of Electronics and Telecommunication Engineering, Jadavpur University, Kolkata, India, where he leads the Imaging, Vision and Pattern Recognition group. He was a Post-Doctoral Fellow with the Department of Radiology and Imaging Sciences, National Institutes of Health, Bethesda, MD, USA, from 2007 to. He has authored or coauthored more than forty-five papers in leading international journals and conferences, in addition to a monograph in the Springer Advances in Computer Vision and Pattern Recognition Series. His current research interests include computer vision, pattern recognition, biomedical image processing, and multimedia analysis. Dr. Chowdhury is a member of the IEEE Computer Society, the IEEE Signal Processing Society, and the IAPR TC on Graph-Based Representations. His Erdős number is 2.


UNIVERSAL SPATIAL UP-SCALER WITH NONLINEAR EDGE ENHANCEMENT

UNIVERSAL SPATIAL UP-SCALER WITH NONLINEAR EDGE ENHANCEMENT UNIVERSAL SPATIAL UP-SCALER WITH NONLINEAR EDGE ENHANCEMENT Stefan Schiemenz, Christian Hentschel Brandenburg University of Technology, Cottbus, Germany ABSTRACT Spatial image resizing is an important

More information

A Novel Approach towards Video Compression for Mobile Internet using Transform Domain Technique

A Novel Approach towards Video Compression for Mobile Internet using Transform Domain Technique A Novel Approach towards Video Compression for Mobile Internet using Transform Domain Technique Dhaval R. Bhojani Research Scholar, Shri JJT University, Jhunjunu, Rajasthan, India Ved Vyas Dwivedi, PhD.

More information

An FPGA Implementation of Shift Register Using Pulsed Latches

An FPGA Implementation of Shift Register Using Pulsed Latches An FPGA Implementation of Shift Register Using Pulsed Latches Shiny Panimalar.S, T.Nisha Priscilla, Associate Professor, Department of ECE, MAMCET, Tiruchirappalli, India PG Scholar, Department of ECE,

More information

FAST SPATIAL AND TEMPORAL CORRELATION-BASED REFERENCE PICTURE SELECTION

FAST SPATIAL AND TEMPORAL CORRELATION-BASED REFERENCE PICTURE SELECTION FAST SPATIAL AND TEMPORAL CORRELATION-BASED REFERENCE PICTURE SELECTION 1 YONGTAE KIM, 2 JAE-GON KIM, and 3 HAECHUL CHOI 1, 3 Hanbat National University, Department of Multimedia Engineering 2 Korea Aerospace

More information

Abstract 1. INTRODUCTION. Cheekati Sirisha, IJECS Volume 05 Issue 10 Oct., 2016 Page No Page 18532

Abstract 1. INTRODUCTION. Cheekati Sirisha, IJECS Volume 05 Issue 10 Oct., 2016 Page No Page 18532 www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 5 Issue 10 Oct. 2016, Page No. 18532-18540 Pulsed Latches Methodology to Attain Reduced Power and Area Based

More information

Motion Re-estimation for MPEG-2 to MPEG-4 Simple Profile Transcoding. Abstract. I. Introduction

Motion Re-estimation for MPEG-2 to MPEG-4 Simple Profile Transcoding. Abstract. I. Introduction Motion Re-estimation for MPEG-2 to MPEG-4 Simple Profile Transcoding Jun Xin, Ming-Ting Sun*, and Kangwook Chun** *Department of Electrical Engineering, University of Washington **Samsung Electronics Co.

More information

Bit Rate Control for Video Transmission Over Wireless Networks

Bit Rate Control for Video Transmission Over Wireless Networks Indian Journal of Science and Technology, Vol 9(S), DOI: 0.75/ijst/06/v9iS/05, December 06 ISSN (Print) : 097-686 ISSN (Online) : 097-5 Bit Rate Control for Video Transmission Over Wireless Networks K.

More information

Analysis of Packet Loss for Compressed Video: Does Burst-Length Matter?

Analysis of Packet Loss for Compressed Video: Does Burst-Length Matter? Analysis of Packet Loss for Compressed Video: Does Burst-Length Matter? Yi J. Liang 1, John G. Apostolopoulos, Bernd Girod 1 Mobile and Media Systems Laboratory HP Laboratories Palo Alto HPL-22-331 November

More information

INTRA-FRAME WAVELET VIDEO CODING

INTRA-FRAME WAVELET VIDEO CODING INTRA-FRAME WAVELET VIDEO CODING Dr. T. Morris, Mr. D. Britch Department of Computation, UMIST, P. O. Box 88, Manchester, M60 1QD, United Kingdom E-mail: t.morris@co.umist.ac.uk dbritch@co.umist.ac.uk

More information

Temporal Error Concealment Algorithm Using Adaptive Multi- Side Boundary Matching Principle

Temporal Error Concealment Algorithm Using Adaptive Multi- Side Boundary Matching Principle 184 IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 Temporal Error Concealment Algorithm Using Adaptive Multi- Side Boundary Matching Principle Seung-Soo

More information

Robust Transmission of H.264/AVC Video using 64-QAM and unequal error protection

Robust Transmission of H.264/AVC Video using 64-QAM and unequal error protection Robust Transmission of H.264/AVC Video using 64-QAM and unequal error protection Ahmed B. Abdurrhman 1, Michael E. Woodward 1 and Vasileios Theodorakopoulos 2 1 School of Informatics, Department of Computing,

More information

DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS

DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS Item Type text; Proceedings Authors Habibi, A. Publisher International Foundation for Telemetering Journal International Telemetering Conference Proceedings

More information

AN IMPROVED ERROR CONCEALMENT STRATEGY DRIVEN BY SCENE MOTION PROPERTIES FOR H.264/AVC DECODERS

AN IMPROVED ERROR CONCEALMENT STRATEGY DRIVEN BY SCENE MOTION PROPERTIES FOR H.264/AVC DECODERS AN IMPROVED ERROR CONCEALMENT STRATEGY DRIVEN BY SCENE MOTION PROPERTIES FOR H.264/AVC DECODERS Susanna Spinsante, Ennio Gambi, Franco Chiaraluce Dipartimento di Elettronica, Intelligenza artificiale e

More information

Music Similarity and Cover Song Identification: The Case of Jazz

Music Similarity and Cover Song Identification: The Case of Jazz Music Similarity and Cover Song Identification: The Case of Jazz Simon Dixon and Peter Foster s.e.dixon@qmul.ac.uk Centre for Digital Music School of Electronic Engineering and Computer Science Queen Mary

More information

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University A Pseudo-Statistical Approach to Commercial Boundary Detection........ Prasanna V Rangarajan Dept of Electrical Engineering Columbia University pvr2001@columbia.edu 1. Introduction Searching and browsing

More information

Constant Bit Rate for Video Streaming Over Packet Switching Networks

Constant Bit Rate for Video Streaming Over Packet Switching Networks International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Constant Bit Rate for Video Streaming Over Packet Switching Networks Mr. S. P.V Subba rao 1, Y. Renuka Devi 2 Associate professor

More information

White Paper : Achieving synthetic slow-motion in UHDTV. InSync Technology Ltd, UK

White Paper : Achieving synthetic slow-motion in UHDTV. InSync Technology Ltd, UK White Paper : Achieving synthetic slow-motion in UHDTV InSync Technology Ltd, UK ABSTRACT High speed cameras used for slow motion playback are ubiquitous in sports productions, but their high cost, and

More information

Improved Error Concealment Using Scene Information

Improved Error Concealment Using Scene Information Improved Error Concealment Using Scene Information Ye-Kui Wang 1, Miska M. Hannuksela 2, Kerem Caglar 1, and Moncef Gabbouj 3 1 Nokia Mobile Software, Tampere, Finland 2 Nokia Research Center, Tampere,

More information

WITH the rapid development of high-fidelity video services

WITH the rapid development of high-fidelity video services 896 IEEE SIGNAL PROCESSING LETTERS, VOL. 22, NO. 7, JULY 2015 An Efficient Frame-Content Based Intra Frame Rate Control for High Efficiency Video Coding Miaohui Wang, Student Member, IEEE, KingNgiNgan,

More information

New Approach to Multi-Modal Multi-View Video Coding

New Approach to Multi-Modal Multi-View Video Coding Chinese Journal of Electronics Vol.18, No.2, Apr. 2009 New Approach to Multi-Modal Multi-View Video Coding ZHANG Yun 1,4, YU Mei 2,3 and JIANG Gangyi 1,2 (1.Institute of Computing Technology, Chinese Academic

More information

A Combined Compatible Block Coding and Run Length Coding Techniques for Test Data Compression

A Combined Compatible Block Coding and Run Length Coding Techniques for Test Data Compression World Applied Sciences Journal 32 (11): 2229-2233, 2014 ISSN 1818-4952 IDOSI Publications, 2014 DOI: 10.5829/idosi.wasj.2014.32.11.1325 A Combined Compatible Block Coding and Run Length Coding Techniques

More information

Robust Transmission of H.264/AVC Video Using 64-QAM and Unequal Error Protection

Robust Transmission of H.264/AVC Video Using 64-QAM and Unequal Error Protection Robust Transmission of H.264/AVC Video Using 64-QAM and Unequal Error Protection Ahmed B. Abdurrhman, Michael E. Woodward, and Vasileios Theodorakopoulos School of Informatics, Department of Computing,

More information

Using enhancement data to deinterlace 1080i HDTV

Using enhancement data to deinterlace 1080i HDTV Using enhancement data to deinterlace 1080i HDTV The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher Andy

More information

Research Article Design and Analysis of a High Secure Video Encryption Algorithm with Integrated Compression and Denoising Block

Research Article Design and Analysis of a High Secure Video Encryption Algorithm with Integrated Compression and Denoising Block Research Journal of Applied Sciences, Engineering and Technology 11(6): 603-609, 2015 DOI: 10.19026/rjaset.11.2019 ISSN: 2040-7459; e-issn: 2040-7467 2015 Maxwell Scientific Publication Corp. Submitted:

More information

Video summarization based on camera motion and a subjective evaluation method

Video summarization based on camera motion and a subjective evaluation method Video summarization based on camera motion and a subjective evaluation method Mickaël Guironnet, Denis Pellerin, Nathalie Guyader, Patricia Ladret To cite this version: Mickaël Guironnet, Denis Pellerin,

More information