Sensor-Based Analysis of User Generated Video for Multi-camera Video Remixing


Francesco Cricri 1, Igor D.D. Curcio 2, Sujeet Mate 2, Kostadin Dabov 1, and Moncef Gabbouj 1

1 Department of Signal Processing, Tampere University of Technology, Tampere, Finland
{Francesco.Cricri,Kostadin.Dabov,Moncef.Gabbouj}@tut.fi
2 Nokia Research Center, Tampere, Finland
{Igor.Curcio,Sujeet.Mate}@nokia.com

Abstract. In this work we propose to exploit context sensor data for analyzing user generated videos. Firstly, we perform a low-level indexing of the recorded media with the instantaneous compass orientations of the recording device. Subsequently, we exploit the low-level indexing to obtain a higher-level indexing for discovering camera panning movements, classifying them, and for identifying the Region of Interest (ROI) of the recorded event. Thus, we extract information about the content without performing content analysis but by leveraging sensor data analysis. Furthermore, we develop an automatic remixing system that exploits the obtained high-level indexing for producing a video remix. We show that the proposed sensor-based analysis can correctly detect and classify camera panning and identify the ROI; in addition, we provide examples of their application to automatic video remixing.

Keywords: Sensor, compass, video, analysis, indexing, multi-camera.

1 Introduction

In recent years there has been a rapid convergence between Internet services and mobile technologies. Internet services are increasingly becoming more socially oriented, often allowing people to publish and share media files. Due to the easy portability of camera-enabled mobile phones, video recording with mobile phones has become one of the most popular means for capturing videos of interesting and unanticipated events. One of the most popular architectural patterns in the Web 2.0 is the Participation-Collaboration pattern [1], in which each user of a web service collaborates in achieving a certain task on a topic of mutual interest by contributing a small amount of information. A typical example of such a pattern is Wikipedia. This concept can be extended to the video domain by combining various user generated videos (recorded, for example, at the same public happening) in order to generate a video remix, i.e., a succession of video segments extracted from the contributing videos. When the number of source videos is large, an automatic approach to generating the remix is preferable. Previous work has mainly concentrated on video content analysis for achieving this task, as we will discuss in Section 2.

However, content analysis (which needs to be applied to each video clip) typically has a high computational cost and does not always provide the necessary semantics. Alternatively, data from sensors embedded in modern mobile phones (such as electronic compass, accelerometer, gyroscope, GPS, etc.) can be captured simultaneously with the video recording in order to obtain important context information about the recorded video.

In this paper, we present a set of novel sensor-based analysis methods for indexing multimedia content. We then use the indexing results for automatically generating a video remix of an event for which a number of users have recorded videos using their mobile phones. The contribution of this work is the use of sensor data to detect camera panning movements and their speed. Also, we propose a sensor-based analysis for identifying the Region of Interest (ROI) of the whole event, to be applied to those cases in which a ROI exists (such as in live music shows). In this sense, we gather information about the content without performing content analysis. Instead, we use sensor data analysis, which is computationally less demanding.

Section 2 introduces the related work on the subject; Section 3 describes the proposed automatic video remix system; Section 4 describes the proposed sensor-based indexing methods; Section 5 presents the experimental results; finally, Section 6 concludes the paper.

2 Related Work

The field of automatically summarizing video content has been studied quite intensively in the last fifteen years. Most of this work is based on the analysis of video and/or audio content (thus being computationally demanding) for video captured by a single camera [2], and also for multi-camera setups [3-5]. In [6], a case study that compares manual and automatic video remix creation is presented. In [7] a video editing system is described which automatically generates a summary video from multiple source videos by exploiting context data limited to time and location information. Camera motion is determined in some automatic video editing systems either for detecting fast motion, which can cause blurry images [8], or for detecting scene boundaries [9]. These methods are all based on video content analysis. Shrestha et al. [10] propose an automatic video remix system in which one of the performed analysis steps consists of estimating suitable cut-points. A frame is considered suitable as a cut-point if it represents a change in the video, such as the beginning or end of a camera motion. Another interesting work is presented in [11], in which the authors analyze the content of user generated videos to determine camera motion, by assuming that people usually move their cameras with the intent of creating some kind of story. In the work presented in this paper, we exploit the camera motion within an automatic video editing framework by using sensors embedded inside modern mobile phones, without the need to analyze the video content as traditionally done in earlier research. In [12] the authors detect interesting events by analyzing motion correlation between different cameras. Finally, in [13] the authors exploit compass sensors embedded in mobile phones for detecting group rotation, in order to identify the presence of something of interest.

In contrast, we propose to analyze compass data from each mobile phone in order to detect and classify individual camera panning movements, and also to identify the ROI of the recorded event. In this way, we are able to consider view changes performed by each user within the ROI, and also to account for the different types of camera motion, which may have different semantics (e.g., a slow panning might have been performed to obtain a panoramic view).

3 Automatic Video Remixing System

The proposed automatic video remixing (or editing) system is implemented as a distributed system based on a client-server architecture (an illustration is given in Fig. 1).

[Fig. 1. Overview of the automatic video remixing system: client mobile phones upload video and context data to the remixing server; audio side: audio quality analysis, stitching audio segments, audio track; video side: panning detection, panning classification, ROI identification, view switching, video track; merging audio and video yields the video remix.]

3.1 Context-Sensing Client

The client side of the system consists of the mobile camera device. We have developed a client application that enables simultaneous recording of video and of context information captured by the embedded sensors, such as electronic compass, GPS, and accelerometer. These sensors are used for indexing the video content while it is being recorded, by storing both sensor data and the associated timestamps. In this work we exploit the electronic compass. This sensor (usually implemented in mobile phones by embedding two or three magnetometers) measures the orientation of the device with respect to the magnetic North (which is assumed to be at orientation equal to zero). When the camera is started, the client application (which runs as a background process) starts capturing data from the electronic compass. The sensor readings are sampled at a fixed rate of 10 samples/second. The recorded compass data can be regarded as a separate data stream. Each sample is a time-stamped value representing the orientation of the camera with respect to the magnetic North. By using the timestamps, any orientation sample can be uniquely associated with a certain segment of video within the recorded media, and thus it is possible to determine the direction towards which the camera was recording at any particular time interval.
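To make the low-level indexing concrete, the sketch below (illustrative Python, not the actual client implementation; the sample values and function names are ours, only the 10 samples/second rate comes from the text above) treats the compass stream as time-stamped orientation samples and looks up the camera direction for a given point on the video timeline.

from bisect import bisect_right

SAMPLING_RATE_HZ = 10  # compass sampling rate used by the client application

# Hypothetical stream: each sample pairs a timestamp (seconds from the start
# of the recording) with a compass orientation in degrees from magnetic North.
compass_stream = [
    (0.0, 178.2),
    (0.1, 178.4),
    (0.2, 179.0),
    # ... one sample every 1 / SAMPLING_RATE_HZ seconds ...
]

def orientation_at(stream, t):
    """Return the orientation sample recorded at or just before time t,
    i.e. the direction the camera was pointing at that instant."""
    timestamps = [ts for ts, _ in stream]
    idx = max(bisect_right(timestamps, t) - 1, 0)
    return stream[idx][1]

# Example: which way was the camera pointing 0.15 s into the clip?
print(orientation_at(compass_stream, 0.15))  # -> 178.4 degrees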

3.2 Automatic Video Remixing Server

The server of the automatic video remixing system exploits the availability of various user generated videos of the same event and the sensor-based low-level indexing performed on each client device by the context-sensing client. This system allows people who have recorded videos at the same event to collaborate in order to obtain an automatically generated video remix, which consists of segments extracted from the recorded videos. The participating users upload their videos to the server along with the associated context files. Firstly, all the source videos must be synchronized to each other so that they can be placed on a common timeline. This is achieved by means of a global clock synchronization algorithm such as the Network Time Protocol [14]. On the audio side, the audio track consists of a succession of best-quality audio segments extracted from the source videos; the audio analysis itself is not the focus of the work presented in this paper. On the visual side, the remix consists of a sequence of video segments present in the original source videos. We group the criteria for switching view into two categories: sensor-related criteria and temporal criteria. The sensor-related criteria are all based on compass data, and they include camera panning detection, the classification of camera panning based on speed, and the identification of the Region of Interest. Thus, there is a need to perform a higher-level indexing at the server side, which uses the low-level indexing data (i.e., the instantaneous compass orientations) in order to detect and classify the camera panning movements and to identify the ROI. The considered temporal criteria are lower-bound and upper-bound temporal thresholds. The timing for switching view and the specific videos to be used at each switching instant are decided by jointly evaluating the mentioned temporal and sensor-based criteria. The lower-bound temporal threshold is used to prevent two consecutive view switches (triggered by the sensor-related criteria) from happening within too short a temporal window. If no view switch happens for a time interval greater than the upper-bound threshold, a switch to a different view is forced.
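The interplay between the two kinds of criteria can be sketched as follows. This is an illustrative Python example under our own assumptions, not the authors' implementation; the threshold values are placeholders, since the paper does not report the ones it used. Candidate switch instants proposed by the sensor analysis are dropped if they fall within the lower-bound window after the previous switch, and a switch is forced whenever the upper-bound window elapses with no accepted candidate.

# Hypothetical thresholds (seconds); the paper does not specify its values.
LOWER_BOUND_S = 4.0    # minimum time between two consecutive view switches
UPPER_BOUND_S = 15.0   # maximum time allowed without any view switch

def plan_switches(candidates, duration):
    """candidates: sorted switch instants (seconds) proposed by the
    sensor-related criteria. Returns the instants actually used in the remix."""
    switches = []
    last = 0.0
    for t in candidates:
        # Force intermediate switches if no candidate arrived for too long.
        while t - last > UPPER_BOUND_S:
            last += UPPER_BOUND_S
            switches.append(last)
        if t - last >= LOWER_BOUND_S:   # respect the lower-bound threshold
            switches.append(t)
            last = t
    while duration - last > UPPER_BOUND_S:   # tail of the timeline
        last += UPPER_BOUND_S
        switches.append(last)
    return switches

print(plan_switches([2.0, 6.0, 7.5, 40.0], duration=70.0))
# -> [6.0, 21.0, 36.0, 40.0, 55.0]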

4 Sensor-Based Analysis

In the following sub-sections we provide a detailed description of how the analysis of the sensor data is performed to generate semantic information from the compass data, and thus to obtain information about the content without analyzing it directly.

4.1 Camera Panning Detection

Camera panning is the horizontal rotational movement of a camera, commonly used either for following an object of interest or for changing the focus point (or direction) of the recording. The most common techniques for detecting a camera panning or other camera motion are based on video content analysis (for example as described in [9] and [11]). In an automatic video editing system using a multi-camera setup there is a need to understand when to perform a view switch from one source video to another, and also which specific video to use after the view switch. One of the criteria that we consider is based on the fair assumption that when a user intentionally performs a camera panning, the obtained view is likely to be interesting. In fact, as also stated in [11], camera motion is an indicator of the camera user's interest in the scene and can also attract the viewer's attention. One reason for considering a panning as interesting is that it is performed to include something of interest. Also, it is unlikely that the view obtained after the panning will be exactly the same (in terms of view angle and camera position) as any of the views provided by the other cameras; thus, a new view of the scene becomes available to the system for inclusion in the remix. We mark all panning movements as potential switching points, which are then evaluated when deciding how to perform the view switches.

In order to detect camera panning, we analyze the data captured by the electronic compass during the video recording activity, instead of relying on video content analysis. One advantage of our method over content-based methods is that we analyze the real motion of the camera and not the effects of the camera motion on the recorded video content. Furthermore, motion of objects in the scene is a major obstacle for content-analysis methods; our method is not affected at all by such motion. For detecting a camera panning we perform the following steps:

1. Apply low-pass filtering to the raw compass data.
2. Compute the first discrete derivative over time of the filtered compass signal.
3. Select the peaks of the obtained derivative by using a threshold T_P.

The low-pass filtering in the first step is necessary to avoid obtaining peaks that are due to short or shaky camera movements rather than to intentional camera motion. Fig. 2 shows the detection of camera panning movements by analyzing compass data. Each detected camera panning is represented by two timestamps: the start-panning and stop-panning timestamps.

[Fig. 2. Detection of camera panning movements: raw compass orientation (degrees) over time (seconds), with the detected start- and stop-panning instants marked.]
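The three steps above can be sketched in a few lines of NumPy. This is an illustrative re-implementation under our own assumptions, not the authors' code: the choice of a Butterworth filter, its 0.5 Hz cut-off, and the threshold value T_P are placeholders, since the paper does not specify them.

import numpy as np
from scipy.signal import butter, filtfilt

FS = 10.0   # compass sampling rate, samples/second (as stated in Section 3.1)
T_P = 5.0   # threshold on the derivative, degrees/second (assumed value)

def detect_pannings(orientation_deg):
    """Detect panning movements from raw compass data following the three
    steps above. Returns a list of (start_index, stop_index) sample pairs."""
    # Unwrap the 0/360 discontinuity so a panning across North is not a jump.
    unwrapped = np.degrees(np.unwrap(np.radians(orientation_deg)))
    # 1. Low-pass filter to suppress short or shaky movements
    #    (second-order Butterworth, 0.5 Hz cut-off: assumed parameters).
    b, a = butter(2, 0.5 / (FS / 2.0))
    smoothed = filtfilt(b, a, unwrapped)
    # 2. First discrete derivative over time (degrees/second).
    rate = np.diff(smoothed) * FS
    # 3. Threshold the derivative magnitude and group consecutive
    #    above-threshold samples into start/stop panning indices.
    active = np.abs(rate) > T_P
    edges = np.flatnonzero(np.diff(active.astype(int)))
    if active[0]:
        edges = np.r_[0, edges]
    if active[-1]:
        edges = np.r_[edges, active.size - 1]
    return list(zip(edges[0::2], edges[1::2]))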

4.2 Classification of Camera Panning Based on Speed

We classify the detected panning movements with respect to their speed. Each panning can be a slow one, performed by the user with the intention of capturing the scene during the camera motion (e.g., for obtaining a panoramic view or for following a moving object), or a faster one, which might have been performed for changing the focus point (or direction) of the video recording. This is a very important difference from a video editing point of view, because a too quick camera motion may result in blurry video content and should not be included in the final video remix, whereas a panoramic panning should be included to give the viewer of the video remix a better understanding of the whole scene. We therefore classify a panning movement as slow or fast depending on whether its speed is below or above a predefined threshold. Fig. 3 shows a sequence of panning movements and their classification.

[Fig. 3. Classification of camera panning movements based on speed: raw compass orientation (degrees) over time (seconds), with slow and fast pannings marked.]
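A minimal sketch of this classification step, continuing the detection sketch above (illustrative only; the 20 degrees/second boundary is an assumed value, as the paper does not report the threshold it used):

SPEED_THRESHOLD = 20.0  # degrees/second; assumed value, not taken from the paper

def classify_panning(orientation_deg, start, stop, fs=10.0):
    """Classify a detected panning, given by its start/stop sample indices in
    the (unwrapped) compass signal, as 'slow' or 'fast' from its average speed."""
    duration_s = (stop - start) / fs
    swept_deg = abs(orientation_deg[stop] - orientation_deg[start])
    speed = swept_deg / duration_s
    return "slow" if speed < SPEED_THRESHOLD else "fast"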

4.3 Identifying the Region of Interest

In some use cases it is important to know which area of a public happening is considered the most interesting by the audience. The approach considered in this work for identifying the ROI exploits the fact that most public events have a single scene of common interest, for example the performance stage in a live music show (see Fig. 4). Therefore, most of the people recording such an event usually point their cameras towards the stage, at least for most of the recording time, as it represents the main attraction area. The automatic video remix system identifies the ROI as the specific angular range of orientations (with respect to North) towards which the users have recorded video content for most of the time. The relative location of the users with respect to each other is not taken into account in this work. Instead, the proposed ROI identification method assumes that the stage of the recorded show is a proscenium stage (i.e., the audience lies on only one side of the stage), which is the most common case at least for live music performances.

[Fig. 4. ROI identification. (a) An example scenario where users use their cameras to record a live music performance. (b) The (unwrapped) compass data captured by seven users while recording a music show for one of our tests.]

The algorithm for ROI identification works as follows (a code sketch is given after the list):

1. The preferred angular extent of the ROI is a parameter that is set manually.
2. For each recording user, compute the time spent recording in each orientation and update a histogram of recorded seconds over the orientations.
3. Analyze the obtained histogram and find the angular range (of the specified extent) that maximizes the length of video content captured towards that range. This angular range represents the ROI.

The extent of the ROI can be set to any reasonable value.
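The histogram-based procedure could look roughly as follows (an illustrative Python sketch, not the authors' implementation; the 1-degree bins and the sliding-window search over a circular histogram are our own choices):

import numpy as np

def identify_roi(compass_streams, roi_extent_deg=90, fs=10.0):
    """compass_streams: one array of compass orientations (degrees) per
    recording user. Returns (start_deg, end_deg) of the angular range of the
    requested extent that maximizes the total recorded time pointing into it."""
    # Step 2: histogram of recorded seconds over orientations (1-degree bins).
    hist = np.zeros(360)
    for stream in compass_streams:
        bins = np.mod(np.round(stream).astype(int), 360)
        np.add.at(hist, bins, 1.0 / fs)   # each sample accounts for 1/fs seconds
    # Step 3: slide a window of the preferred extent over the circular
    # histogram and keep the placement covering the most recorded time.
    doubled = np.concatenate([hist, hist])            # handle wrap-around at 360
    window = np.convolve(doubled, np.ones(roi_extent_deg), mode="valid")[:360]
    start = int(np.argmax(window))
    return start, (start + roi_extent_deg) % 360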

4.4 Use of Panning Movements and ROI for View Switching

The automatic video remix system decides about the timing of the view switching, and about which view to select, by considering the detected panning movements, their classification, and the pointing direction of the cameras with respect to the identified ROI. The detailed description of the view switching algorithm is as follows:

FOR each detected panning event:
    IF panning ends inside the ROI:
        IF panning is classified as slow:
            Switch view to the video containing the panning, when the panning starts.
        ELSE (panning is classified as fast):
            Switch view to the video containing the panning, when the panning ends.
    ELSE (panning ends outside the ROI):
        No view switching is performed.

A video segment containing a fast panning is not included, in order not to have blurry video content in the video remix.

5 Results

An evaluation of the proposed sensor-based multimedia indexing techniques was performed using the following setup. We used a dataset consisting of 47 video recordings, spanning an overall duration of about 62 minutes, captured by nine different users during two live music shows. The users were not aware of the goal of the test. By visually inspecting the recorded video content, we identified and annotated 129 panning movements performed by the recording users. Table 1 summarizes the test results for the panning detection algorithm, reporting accuracy in terms of precision (P, the fraction of detections that are indeed true panning movements), recall (R, the fraction of true panning movements that are correctly detected) and balanced F-measure (F, the harmonic mean of precision and recall). As can be seen from the table, our sensor-based panning detection performs well.

Regarding the panning classification, we considered the panning movements that were correctly detected by our detection algorithm. We manually classified them as either slow or fast depending on whether the video content captured during each panning was respectively pleasant or unpleasant to watch. Table 2 summarizes the performance of the proposed panning classification method. Among the 112 correctly detected panning movements, only four are misclassified.

We performed a test of the ROI identification using the videos recorded by seven users at one of the two music shows in our dataset. Fig. 4b shows the compass data captured by the users during the show. Six of the users pointed their cameras towards a similar orientation for most of the recording time. We specified a preferred ROI extent of 90 degrees. The proposed method identified the ROI to be in the range [110, 200] degrees, which is satisfactory, as it corresponds to orientations pointing towards the stage of the recorded event.

Table 1. Test results for the camera panning detection algorithm. GT stands for ground truth (manually annotated panning movements), TP for true positives, FP for false positives.

  GT    TP    FP    P      R      F
  129   112   12    0.90   0.87   0.89

Table 2. Confusion matrix for the classification of panning movements based on speed. Only the correctly detected panning movements (i.e., the true positives) are considered in the table.

                                               Automatically        Automatically
                                               detected as fast     detected as slow
  Manually annotated as fast (ground truth)          25                   1
  Manually annotated as slow (ground truth)           3                  83

As can be seen in Fig. 4b, during the recording session some of the users performed panning movements either inside the identified ROI (for recording the main area of interest) or outside it (for recording objects or people that they found interesting). In particular, one user recorded something lying outside the ROI for the entire time.

Fig. 5 shows an example of using the proposed sensor-based analysis for automatically generating a video remix.

[Fig. 5. Example of using the proposed sensor-based multimedia indexing techniques for automatically generating a video remix. Due to prior state, the excerpt of the remix starts with Video 2.]

For this experiment, we use the content and context data from two of the users, positioned at different viewing angles. For each user, the plot of the compass data captured during the video recording is shown. Some relevant frames extracted from the recorded videos are displayed below the associated compass data. These frames have been extracted at the time points indicated by the red arrows (mostly before and after each panning). The bottom-most part of the figure represents the video remix obtained by stitching together segments extracted from the two source videos. The switching points are indicated by the vertical dashed lines that originate either from video 1 or video 2, depending on which video has triggered the view switch. Panning detection, panning classification and ROI identification were performed. The identified ROI was the range [-41, 49] degrees. User 1 performed one slow and one fast panning, and recorded within the ROI at all times. User 2 performed four slow pannings: the first and the third end outside the ROI, the second and the fourth end inside the ROI. At about 25 seconds, user 2 turned the camera towards an orientation outside the ROI, so the system switches view from video 2 (which is the initial state) to video 1. At 55 seconds, user 2 turned the camera back towards the ROI, and a view switch to video 2 is performed. At about 62 seconds, user 1 performed a slow panning, so the system uses video 1 from the starting time of the panning. At about 86 seconds, user 2 performed a panning from outside to inside the ROI, and a view switch is then triggered. Finally, as user 1 performed a fast panning, the remix contains a segment of video 1 from the end of that panning.
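As a quick check, the precision, recall and F-measure reported in Table 1 follow directly from the reported counts (GT = 129, TP = 112, FP = 12), using the standard definitions given above:

GT, TP, FP = 129, 112, 12          # counts reported in Table 1
precision = TP / (TP + FP)          # 112 / 124 = 0.903...
recall = TP / GT                    # 112 / 129 = 0.868...
f_measure = 2 * precision * recall / (precision + recall)   # 0.885...
print(round(precision, 2), round(recall, 2), round(f_measure, 2))  # 0.9 0.87 0.89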

6 Conclusions

In this paper we presented methods for indexing user generated videos based on context sensor data. These methods are used to automate the video remixing process in a multi-camera recording scenario. The novelty of these methods is the use of sensor data for performing the multimedia indexing, thus avoiding computationally costly video content analysis. In this work we have focused on analyzing compass data from each mobile phone and shown that the proposed methods can correctly detect and classify camera pannings, and identify the ROI of the recorded event. In this way, the system is able to generate a video remix that takes into account the panning movements performed by each user within the Region of Interest during the event. Furthermore, we are able to account for the different semantics of camera motion.

References

1. Governor, J., Hinchcliffe, D., Nickull, D.: Web 2.0 Architectures. O'Reilly Media / Adobe Developer Library (2009)
2. Huang, C.H., Wu, C.H., Kuo, J.H., Wu, J.L.: A Musical-driven Video Summarization System Using Content-aware Mechanisms. In: IEEE International Symposium on Circuits and Systems, Kobe, Japan, vol. 3, pp. 2711-2714. IEEE (2005)
3. Kennedy, L., Naaman, M.: Less Talk, More Rock: Automated Organization of Community-Contributed Collections of Concert Videos. In: 18th International Conference on World Wide Web, Madrid, Spain, pp. 311-320. ACM (2009)
4. El-Saban, M., Refaat, M., Kaheel, A., Abdul-Hamid, A.: Stitching Videos Streamed by Mobile Phones in Real-Time. In: 17th ACM International Conference on Multimedia, Beijing, China, pp. 1009-1010. ACM (2009)
5. Zsombori, V., Frantzis, M., Guimaraes, R.L., Ursu, M.F., Cesar, P., Kegel, I., Craigie, R., Bulterman, D.C.A.: Automatic Generation of Video Narratives from Shared UGC. In: 22nd ACM Conference on Hypertext and Hypermedia, Eindhoven, The Netherlands, pp. 325-334. ACM (2011)
6. Vihavainen, S., Mate, S., Seppälä, L., Cricri, F., Curcio, I.D.D.: We Want More: Human-Computer Collaboration in Mobile Social Video Remixing of Music Concerts. In: ACM CHI Conference on Human Factors in Computing Systems, Vancouver, Canada, pp. 287-296. ACM (2011)
7. Järvinen, S., Peltola, J., Plomp, J., Ojutkangas, O., Heino, I., Lahti, J., Heinilä, J.: Deploying Mobile Multimedia Services for Everyday Experience Sharing. In: IEEE International Conference on Multimedia and Expo, Cancun, Mexico, pp. 1760-1763. IEEE (2009)
8. Foote, J., Cooper, M., Girgensohn, A.: Creating Music Videos using Automatic Media Analysis. In: 10th ACM International Conference on Multimedia, Juan les Pins, France, pp. 553-560. ACM (2002)
9. Peyrard, N., Bouthemy, P.: Motion-based Selection of Relevant Video Segments for Video Summarisation. In: IEEE International Conference on Multimedia and Expo, Baltimore, U.S.A., vol. 2, pp. 409-412. IEEE (2003)
10. Shrestha, P., de With, P.H.N., Weda, H., Barbieri, M., Aarts, E.H.L.: Automatic Mashup Generation from Multiple-camera Concert Recordings. In: ACM International Conference on Multimedia, Firenze, Italy, pp. 541-550. ACM (2010)
11. Abdollahian, G., Taskiran, C.M., Pizlo, Z., Delp, E.J.: Camera Motion-Based Analysis of User Generated Video. IEEE Transactions on Multimedia 12(1), 28-41 (2010)
12. Cricri, F., Dabov, K., Curcio, I.D.D., Mate, S., Gabbouj, M.: Multimodal Event Detection in User Generated Videos. In: IEEE International Symposium on Multimedia, Dana Point, U.S.A. IEEE (2011)
13. Bao, X., Choudhury, R.R.: MoVi: Mobile Phone based Video Highlights via Collaborative Sensing. In: 8th International Conference on Mobile Systems, Applications and Services, San Francisco, U.S.A., pp. 357-370. ACM (2010)
14. Network Time Protocol, Version 4, IETF RFC 5905 (2010)