Audio-Based Video Editing with Two-Channel Microphone

Tetsuya Takiguchi, Organization of Advanced Science and Technology, Kobe University, Japan (takigu@kobe-u.ac.jp)
Yasuo Ariki, Organization of Advanced Science and Technology, Kobe University, Japan (ariki@kobe-u.ac.jp)
Jun Adachi, Graduate School of Science and Technology, Kobe University, Japan (j-adachi@me.cs.scitec.kobe-u.ac.jp)

Abstract

Audio is a key index in digital videos that can provide useful information for video editing, such as capturing conversations only, clipping only talking people, and so on. In this paper, we study video editing based on audio with a two-channel (stereo) microphone, which is standard equipment on video cameras, where the video content is recorded automatically without a cameraman. In order to capture only a talking person on video, a novel voice/non-voice detection algorithm using AdaBoost, which can achieve extremely high detection rates in noisy environments, is used. In addition, the sound source direction is estimated by the CSP (Crosspower-Spectrum Phase) method in order to zoom in on the talking person by clipping frames from the video, where the two-channel (stereo) microphone provides information about time differences between the microphones.

1. Introduction

Video camera systems are becoming popular in home environments, and they are often used in our daily lives to record family growth, small home parties, and so on. In home environments, however, the video content is greatly restricted by the fact that there is no production staff, such as a cameraman, editor, or switcher, as there is at broadcasting or television stations. When we watch a broadcast or television video, the camera work helps us stay interested in the content and understand it easily, thanks to the panning and zooming of the camera. This means that the camera work is strongly associated with the events on video, and the most appropriate camera work is chosen according to those events. Through camera work combined with event recognition, more interesting and intelligible video content can be produced [4].

Audio is a key index in digital videos that can provide useful information for video retrieval. In [10], audio features are used for video scene segmentation; in [3, 2], they are used for video retrieval; and in [5], multiple microphones are used for detection and separation of audio in meeting recordings. In [9], the authors describe an automation system to capture and broadcast lectures to online audiences, where a two-channel microphone is used for locating talking audience members in a lecture room. There are also many possible approaches to content production, such as generating highlights, summaries, and so on [7, 1, 12] for home video content.

In this paper, we study home video editing based on audio. In home environments, since it may be difficult for one person to record video continuously (especially for small home parties with just two persons), the video content needs to be recorded automatically without a cameraman. However, this may result in a large volume of video content. It therefore requires digital camera work that performs virtual panning and zooming by clipping frames from high-resolution images and controlling the frame size and position [4]. In this paper, we propose a method of video editing based on audio, such as voice/non-voice events and sound source direction, applied to video content that is recorded without a cameraman.
This system can automatically capture conversations only, using a voice/non-voice detection algorithm based on AdaBoost. In addition, it can clip and zoom in on the talking person only, using the sound source direction estimated by CSP with the two-channel (stereo) microphone.

Figure 1. Video editing system by audio-based digital camera work. (Voice detection by AdaBoost and estimation of the sound source direction by CSP are applied to home-environment video recorded without a cameraman, in order to capture only conversation scenes and to clip and zoom in on the talking person only.)

One of the advantages of this digital shooting is that the camera work, such as panning and zooming, is adjusted to user preferences. This means that the user can watch his/her own video produced by his/her own virtual editor, cameraman, and switcher based on the user's personal preferences. The main point of this paper is that home video events can be recognized using a microphone-array technique and then used as key indices to retrieve the events and also to summarize the whole home video.

The organization of this paper is as follows. In Section 2, the overview of the audio-based video editing system is presented. Section 3 describes voice detection with AdaBoost, used to capture conversation scenes only. Section 4 describes the estimation of the talker's direction with CSP, used to zoom in on the talking person by clipping frames from the conversation scene videos. Section 5 describes the digital camera work.

2 Overview of the System

Figure 1 shows the overview of the video editing system using audio-based digital camera work. The system is composed of two steps. The first step is voice detection with AdaBoost, where the system identifies whether the audio signal is a voice or not in order to capture conversation scenes only. When the captured video is a conversation scene, the system performs the second step: estimation of the sound source direction using the CSP (Crosspower-Spectrum Phase) method with a two-channel microphone. Using the sound source direction, the system can clip and zoom in on the talking person only.

3 Voice Detection with AdaBoost

In automatic production of home videos, a speech detection algorithm plays an especially important role in capturing conversation scenes only. In this section, a speech/non-speech detection algorithm using AdaBoost, which can achieve extremely high detection rates, is described. Boosting is a technique in which a set of weak classifiers is combined to form one high-performance prediction rule, and AdaBoost [6] is an adaptive boosting algorithm in which the rule for combining the weak classifiers adapts to the problem and is able to yield extremely efficient classifiers.

Figure 2 shows the overview of the voice detection system based on AdaBoost. The audio waveform is split into small segments by a window function. Each segment is converted to the linear spectral domain by applying the discrete Fourier transform (DFT). Then the logarithm is applied to the linear power spectrum, and the feature vector is obtained (a short code sketch is given below).

The AdaBoost algorithm [6] uses a set of training data,

{(X(1), Y(1)), ..., (X(N), Y(N))},   (1)

where X(n) is the n-th feature vector of the observed signal and Y is a set of possible labels. For speech detection, we consider just two possible labels, Y = {-1, 1}, where the label 1 means voice and the label -1 means noise. Next, the initial weight for the n-th training data is set to

w_1(n) = 1/(2m) if Y(n) = 1 (voice), and w_1(n) = 1/(2l) if Y(n) = -1 (noise),

where m is the total number of voice frames and l is the total number of noise frames.
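As a concrete illustration of the feature extraction sketched above (short-term windowing, DFT, log power spectrum), a minimal Python sketch is given below. The frame length, hop size, and window type are assumptions; the paper does not specify them.

```python
import numpy as np

def log_power_spectrum_frames(signal, frame_len=512, hop=256):
    """Split a mono signal into windowed frames and return log power spectra.

    Each row is one feature vector X(n). frame_len, hop, and the Hamming
    window are assumed values; the paper only states that the waveform is
    split into short segments and transformed by the DFT.
    """
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    features = []
    for i in range(n_frames):
        frame = signal[i * hop: i * hop + frame_len] * window
        spectrum = np.fft.rfft(frame)            # DFT of the segment
        power = np.abs(spectrum) ** 2            # linear power spectrum
        features.append(np.log(power + 1e-10))   # log power spectrum, X(n)
    return np.array(features)
```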

Figure 2. Voice detection with AdaBoost. (The audio waveform is short-term analyzed and DFT-based feature extraction yields feature vectors X(n), n = 1, ..., N; signal detection with AdaBoost outputs Y(n) = 1 for voice and Y(n) = -1 for noise. Training: initialize the weight vector w_1(n), n = 1, ..., N; for t = 1, ..., T: (1) train a weak learner that generates a hypothesis h_t, (2) calculate the error e_t of h_t, (3) set α_t = (1/2) log[(1 - e_t)/e_t], and (4) update the weights w_{t+1}(n); output the final hypothesis H(X) = sign(Σ_{t=1}^{T} α_t h_t(X)).)

As shown in Figure 2, the weak learner generates a hypothesis h_t : X → {-1, 1} that has a small error. In this paper, single-level decision trees (also known as decision stumps) are used as the base classifiers. After training the weak learner on the t-th iteration, the error of h_t is calculated by

e_t = Σ_{n : h_t(X(n)) ≠ Y(n)} w_t(n).   (2)

Next, AdaBoost sets a parameter α_t. Intuitively, α_t measures the importance that is assigned to h_t. Then the weight w_t is updated:

w_{t+1}(n) = w_t(n) exp{-α_t Y(n) h_t(X(n))} / Σ_{n=1}^{N} w_t(n) exp{-α_t Y(n) h_t(X(n))}.   (3)

Equation (3) increases the weights of the data misclassified by h_t, so the weights tend to concentrate on hard data. After the T-th iteration, the final hypothesis, H(X), combines the outputs of the T weak hypotheses using a weighted majority vote.

In home video environments, speech signals may be severely corrupted by noise because the person speaks far from the microphone. In such situations, the speech signal captured by the microphone will have a low SNR (signal-to-noise ratio), which leads to hard data. As AdaBoost trains the weights focusing on hard data, we can expect it to achieve extremely high detection rates in low-SNR situations. For example, in [11], the proposed method was evaluated in car environments, and the experimental results show an improved voice detection rate compared to that of conventional detectors based on the GMM (Gaussian Mixture Model) in a car moving at highway speed (an SNR of 20 dB).
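The training loop of Equations (2) and (3) and the weighted majority vote can be sketched as follows. This is a minimal illustration, assuming a brute-force threshold scan for the decision-stump weak learner, which the paper does not detail.

```python
import numpy as np

def train_stump(X, Y, w):
    """Search one feature dimension, threshold, and polarity (decision stump)."""
    best = (1.0, 0, 0.0, 1)                 # (weighted error, dim, threshold, polarity)
    for d in range(X.shape[1]):
        for thr in np.unique(X[:, d]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, d] - thr) > 0, 1, -1)
                err = w[pred != Y].sum()
                if err < best[0]:
                    best = (err, d, thr, pol)
    return best

def adaboost(X, Y, T=50):
    """X: (N, D) log-power-spectrum features, Y: labels in {-1, +1} (noise/voice)."""
    m, l = np.sum(Y == 1), np.sum(Y == -1)
    # initial weights: 1/(2m) for voice frames, 1/(2l) for noise frames
    w = np.where(Y == 1, 1.0 / (2 * m), 1.0 / (2 * l))
    stumps, alphas = [], []
    for _ in range(T):
        err, d, thr, pol = train_stump(X, Y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)          # alpha_t
        pred = np.where(pol * (X[:, d] - thr) > 0, 1, -1)
        w = w * np.exp(-alpha * Y * pred)              # Eq. (3), numerator
        w /= w.sum()                                   # Eq. (3), normalization
        stumps.append((d, thr, pol))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    """Final hypothesis H(X) = sign(sum_t alpha_t h_t(X))."""
    score = np.zeros(len(X))
    for (d, thr, pol), a in zip(stumps, alphas):
        score += a * np.where(pol * (X[:, d] - thr) > 0, 1, -1)
    return np.where(score >= 0, 1, -1)
```

In use, adaboost would be trained on log-power-spectrum frames labeled voice (+1) or noise (-1), and predict applied frame by frame to new recordings to mark voice intervals.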

4 Estimation of Sound Source Direction with CSP

The video editing system is required to detect the person who is talking from among a group of persons. This section describes the estimation of the person's direction (horizontal localization) from the voice. As a home video system may have only limited computing capability, the CSP (Crosspower-Spectrum Phase)-based technique [8] has been implemented in the video editing system as a real-time localization method.

The crosspower-spectrum is computed through the short-term Fourier transform applied to windowed segments of the signal x_i[t] received by the i-th microphone at time t:

CS(n; ω) = X_i(n; ω) X_j*(n; ω),   (4)

where * denotes the complex conjugate, n is the frame number, and ω is the spectral frequency. Then the normalized crosspower-spectrum is computed by

φ(n; ω) = X_i(n; ω) X_j*(n; ω) / (|X_i(n; ω)| |X_j(n; ω)|),   (5)

which preserves only information about the phase differences between x_i and x_j. Finally, the inverse Fourier transform is computed to obtain the time lag (delay) corresponding to the source direction:

C(n; l) = F^{-1}[φ(n; ω)].   (6)

Given the above representation, the source direction can be derived. If the sound source is not moving, C(n; l) should consist of a dominant straight line at the theoretical delay. In this paper, the source direction is estimated by averaging the angles corresponding to these delays. The lag is therefore given by

l̂ = argmax_l Σ_{n=1}^{N} C(n; l),   (7)

where N is the total number of frames in a voice interval estimated by AdaBoost. (An implementation sketch of Equations (4)-(7) is given after Section 5.1 below.)

Figure 3 shows the overview of the estimation of the sound source direction by CSP. Figure 4 shows the CSP coefficients. The top panel is the result for a speaker direction of about 60 degrees, the middle panel is that for 150 degrees, and the bottom panel is that for two speakers talking. As shown in Figure 4, the peak of the CSP coefficients in the top panel is at about 60 degrees, where the speaker is actually located. When only one speaker is talking in a voice interval, a sharp peak is obtained. However, when plural speakers are talking in a voice interval, a sharp peak is not obtained, as shown in the bottom panel. Therefore, we set a threshold, and a peak above the threshold is selected as the sound source direction. In the experiments, the threshold was set to 0.08. When the peak is below the threshold, a wide shot is taken.

Figure 3. Estimation of sound source direction by CSP. (The two microphone signals x_1 and x_2 are transformed by the DFT, the normalized crosspower-spectrum X_1(n; ω) X_2*(n; ω) / (|X_1(n; ω)| |X_2(n; ω)|) is computed, and the inverse DFT gives the CSP coefficients.)

Figure 4. CSP coefficients plotted against direction in degrees (top: one speaker at about 60 degrees; middle: one speaker at about 150 degrees; bottom: two speakers talking).

Figure 5. Processing flow of digital zooming in and out. (Voice detection gives a voice interval; the CSP coefficients give the sound source direction. When plural speakers are talking in the interval, the system zooms out to a wide shot (1280 x 720); when one speaker is talking, it zooms in (640 x 360).)

5 Camera work module

In the camera work module, only one digital panning or zooming operation is performed within a voice interval. Digital panning is performed on the HD image by moving the coordinates of the clipping window, and digital zooming is performed by changing the size of the clipping window.

5.1 Zooming

Figure 5 shows the processing flow of the digital camera work (zooming in and out). After a voice interval is captured by AdaBoost, the sound source direction is estimated by CSP in order to zoom in on the talking person by clipping frames from the video. As described in Section 4, we can estimate whether one speaker or plural speakers are talking in a voice interval. In the camera work, when plural speakers are talking, a wide shot (1280 x 720) is taken. On the other hand, when one speaker is talking in a voice interval, the digital camera work zooms in on the speaker.
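Returning to Section 4, the CSP computation of Equations (4)-(7) and the conversion from the selected lag to a horizontal angle can be sketched as below. The sampling rate, microphone spacing, frame length, and the far-field lag-to-angle conversion are assumptions; the paper gives only the equations.

```python
import numpy as np

def csp_direction(x1, x2, fs=16000, mic_dist=0.10, frame_len=1024, hop=512, c=340.0):
    """Estimate the source direction (degrees) over a two-channel voice interval.

    Implements Eqs. (4)-(7): normalized crosspower-spectrum (phase only),
    inverse DFT to get C(n; l), then the lag maximizing the sum over frames.
    fs, mic_dist, frame_len, and hop are assumed values.
    """
    window = np.hamming(frame_len)
    n_frames = 1 + (min(len(x1), len(x2)) - frame_len) // hop
    acc = np.zeros(frame_len)
    for n in range(n_frames):
        s = slice(n * hop, n * hop + frame_len)
        X1 = np.fft.fft(x1[s] * window)
        X2 = np.fft.fft(x2[s] * window)
        cross = X1 * np.conj(X2)                  # Eq. (4)
        phi = cross / (np.abs(cross) + 1e-12)     # Eq. (5): keep phase only
        acc += np.real(np.fft.ifft(phi))          # Eq. (6), accumulated over frames
    acc /= max(n_frames, 1)                       # per-frame average; argmax unchanged
    max_lag = int(fs * mic_dist / c)              # physically possible lags
    lags = np.arange(-max_lag, max_lag + 1)
    lag = lags[np.argmax(acc[lags])]              # Eq. (7)
    tdoa = lag / fs
    # far-field approximation: delay = mic_dist * cos(theta) / c
    cos_theta = np.clip(tdoa * c / mic_dist, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))
```

The peak value of the averaged CSP coefficients (acc in the sketch) would be the quantity compared against the threshold discussed above before the estimated direction is used for zooming.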
In this paper, the size of the clipping window (zooming in) is fixed to 640 x 360.

5.2 Clipping position (Panning)

The centroid of the clipping window is selected according to the face region estimated using the OpenCV library. If the centroid of the clipping window changes frequently within a voice interval, the video becomes hard to follow, so the centroid of the clipping window is kept fixed within a voice interval.
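A minimal sketch of the digital zooming and panning described above: a fixed 640 x 360 window is clipped from the 1280 x 720 HD frame around a centroid that is held constant over the voice interval. Clamping the window to the frame borders is an assumed detail not stated in the paper.

```python
def clip_window(frame, center_x, center_y, win_w=640, win_h=360):
    """Clip a fixed-size zoom window from an HD frame (numpy array, H x W x 3).

    center_x / center_y come from the averaged face-region centroid and are
    kept constant for the whole voice interval. Clamping to the frame border
    is an assumed detail.
    """
    h, w = frame.shape[:2]
    x0 = min(max(int(center_x - win_w // 2), 0), w - win_w)
    y0 = min(max(int(center_y - win_h // 2), 0), h - win_h)
    return frame[y0:y0 + win_h, x0:x0 + win_w]
```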

The face regions are detected within 200 pixels of the sound source direction in a voice interval, as shown in Figure 6. Then the average centroid of the face region is calculated in order to decide the centroid of the clipping window.

Figure 6. Clipping window for zooming in. (Faces are detected in a region of the 1280 x 720 HD image around the sound source direction; the center of the detected face region, averaged over the voice interval, gives the center coordinate of the 640 x 360 clipping window.)

6 Experiments

Preliminary experiments were performed to test the voice detection algorithm and the CSP method in a room. Figure 7 shows the room used for the experiments, where a two-person conversation is recorded. The total recording time is about 330 seconds. In the experiments, we used a Victor GR-HD1 Hi-Vision camera (1280 x 720). The focal length is 5.2 mm. The image format size is 2.735 mm (height), 4.864 mm (width), and 5.58 mm (diagonal). From these parameters, we can calculate the pixel position corresponding to the sound source direction in order to clip frames from the high-resolution images (a small sketch of this calculation is given after Figure 8 below). In the proposed method, we can calculate the horizontal localization only.

Figure 7. Room used for the experiments; a two-person conversation is recorded. (The two-channel microphone lies on a desk in front of the video camera; speaker A is located at about 60 degrees and speaker B at about 150 degrees from the microphone pair.)

Figure 8 shows the intervals of the conversation scenes that were estimated by AdaBoost. The average interval is 1.32 sec, the maximum is 6.7 sec, and the minimum is 0.46 sec. The total number of conversation scenes detected by AdaBoost is 149 (186.4 sec), and the detection accuracy is 94.6%.

Figure 8. Intervals of the conversation scenes estimated by AdaBoost (histogram of voice-interval length in seconds).
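The pixel-position calculation mentioned in the experiments can be sketched with a pinhole-camera model using the reported focal length (5.2 mm) and image-format width (4.864 mm for 1280 pixels). How the CSP direction, measured against the microphone baseline, maps to the camera's off-axis angle depends on the camera placement, which the paper does not specify, so the sketch below takes the off-axis angle directly.

```python
import math

def off_axis_angle_to_pixel(off_axis_deg, focal_mm=5.2, sensor_w_mm=4.864, image_w_px=1280):
    """Map an angle measured from the camera's optical axis to an image column.

    Uses a pinhole model with the focal length and image-format width reported
    in the experiments. Converting the CSP direction (relative to the microphone
    baseline) into this off-axis angle requires knowing how the camera is
    oriented relative to the microphones, which the paper does not state.
    """
    x_mm = focal_mm * math.tan(math.radians(off_axis_deg))  # offset on the sensor plane
    px_per_mm = image_w_px / sensor_w_mm
    return int(round(image_w_px / 2.0 + x_mm * px_per_mm))

# The returned column would serve as the center of the region (within 200 pixels)
# in which faces are searched, as described in Section 5.2.
```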

After capturing only the conversations, the sound source direction is estimated by CSP in order to zoom in on the talking person by clipping frames from the video. The clipping accuracy is 65.5% in this experiment. Some conversation scenes cause a decrease in the clipping accuracy. This is because two speakers are talking within one voice (conversation) interval estimated by AdaBoost, and it is difficult to set the threshold of the CSP coefficient. Figure 9 shows an example of the time sequence for zooming in and out, and Table 1 shows the results of the digital camera work (zooming in and out). Figure 10 shows an example of the digital shooting (zooming in). In this experiment, the clipping size is fixed to 640 x 360. In the future, we need to automatically select the size of the clipping window according to each situation.

Figure 9. Example of the time sequence for zooming in and out.

Figure 10. Example of digital shooting (zooming in).

Table 1. Total time of zooming in and out (seconds).

                               correct time   estimated time
  zooming in A                 63.0           67.3
  zooming in B                 41.0           55.6
  zooming in another direction 0.0            0.5
  zooming out                  81.8           63.0

7 Conclusions

In this paper, we investigated home video editing based on audio with a two-channel (stereo) microphone, where the video content is automatically recorded without a cameraman. In order to capture a talking person only, a novel voice/non-voice detection algorithm using AdaBoost, which can achieve extremely high detection rates in noisy environments, is used. In addition, the sound source direction is estimated by the CSP (Crosspower-Spectrum Phase) method in order to zoom in on the talking person by clipping frames from the video, where a two-channel (stereo) microphone is used to obtain information about time differences between the microphones. Our proposed system can not only produce the video content but can also retrieve scenes in the video content by using the detected voice intervals or the information about the talking person as indices. To make the system more advanced, we will develop sound source estimation and emotion recognition in the future, and we will evaluate the proposed method on more test data.

References

[1] B. Adams and S. Venkatesh. Dynamic shot suggestion filtering for home video based on user performance. In ACM Int. Conf. on Multimedia, pages 363-366, 2005.
[2] K. Aizawa. Digitizing personal experiences: Capture and retrieval of life log. In Proc. Multimedia Modelling Conf., pages 10-15, 2005.
[3] T. Amin, M. Zeytinoglu, L. Guan, and Q. Zhang. Interactive video retrieval using embedded audio content. In Proc. ICASSP, pages 449-452, 2004.
[4] Y. Ariki, S. Kubota, and M. Kumano. Automatic production system of soccer sports video by digital camera work based on situation recognition. In Eighth IEEE International Symposium on Multimedia (ISM), pages 851-858, 2006.
[5] F. Asano and J. Ogata. Detection and separation of speech events in meeting recordings. In Proc. Interspeech, pages 2586-2589, 2006.
[6] Y. Freund and R. E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.
[7] X.-S. Hua, L. Lu, and H.-J. Zhang. Optimization-based automated home video editing system. IEEE Transactions on Circuits and Systems for Video Technology, 14(5):572-583, 2004.
[8] M. Omologo and P. Svaizer. Acoustic source location in noisy and reverberant environment using CSP analysis. In Proc. ICASSP, pages 921-924, 1996.
[9] Y. Rui, A. Gupta, J. Grudin, and L. He. Automating lecture capture and broadcast: technology and videography. ACM Multimedia Systems Journal, pages 3-15, 2004.
[10] H. Sundaram and S.-F. Chang.
Video scene segmentation using audio and video features. In Proc. ICME, pages 1145-1148, 2000.
[11] T. Takiguchi, H. Matsuda, and Y. Ariki. Speech detection using real AdaBoost in car environments. In Fourth Joint Meeting of ASA and ASJ, page 1pSC2, 2006.
[12] P. Wu. A semi-automatic approach to detect highlights for home video annotation. In Proc. ICASSP, pages 957-960, 2004.