Automatic Replay Generation for Soccer Video Broadcasting


Jinjun Wang 2,1, Changsheng Xu 1, Engsiong Chng 2, Kongwah Wan 1, Qi Tian 1
1 Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore
{stuwj2, xucs, kongwah, tian}@i2r.a-star.edu.sg
2 CeMNet, SCE, Nanyang Technological University, Singapore
jjwang@pmail.ntu.edu.sg, aseschng@ntu.edu.sg

ABSTRACT
While most current approaches to sports video analysis are based on broadcast video, in this paper we present a novel approach for highlight detection and automatic replay generation for soccer videos taken by the main camera. This research is important because current soccer highlight detection and replay generation for a live game is a labor-intensive process. A robust multi-level, multi-modal event detection framework is proposed to detect events and event boundaries from the video taken by the main camera. The framework explores the available analysis cues, using a mid-level representation to bridge the gap between low-level features and high-level events. The event detection results and mid-level representation are then used to generate replays, which are automatically inserted into the video. Experimental results are promising and found to be comparable with those generated by broadcast professionals.

Categories and Subject Descriptors
I.5.5 [Pattern Recognition]: Implementation - Interactive systems; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - Abstracting methods, Indexing methods

General Terms
Algorithms, Design, Experimentation

Keywords
Event detection, Sports video analysis, Broadcast, Replay

1. INTRODUCTION
The growing appetite for sporting excellence and patriotic passion at both the international and domestic club levels has created new cultures and businesses in the sports domain. Sports video is widely distributed over various networks, and its mass appeal to large global audiences has attracted increasing research attention in recent years [1]. We constrain the following discussion to the domain of soccer, as soccer video analysis remains a challenging task due to the loose structure of soccer games. Many studies have been conducted on broadcast or unedited soccer game video, and promising results have been reported [2, 3, 4, 5]. In [4], soccer event detection using unedited game video is attempted. In [5], both raw video image information and post-production information are utilized, and events like Shooting, Yellow/Red Card and Penalty are detected. Most of these works focus on semantic annotation, indexing, summarization and retrieval for sports video. They do not address video editing and production tasks such as automatic replay generation and broadcast video generation. Generating soccer highlights from a live game is a time-critical and labor-intensive process. Typically, multiple cameras are installed around the sporting arena and the broadcast director decides which video feed to go on-air.
Of these cameras, a main camera perched high above pitch level provides a panoramic view of the game and is often used as the main broadcast view. At sporadic moments in the game that he deems appropriate, the director launches replays of the prior game action. These replays are manually selected by reviewing the log of a particular camera view and choosing appropriate start and end times to play back at a slower-than-real-time rate. Replay segments are short (15-20 seconds) and must be launched quickly (within 10 seconds of the event). In a live broadcast, inserting a replay can be a risky decision call for the director, since it trades off the live action on the field. It is not uncommon for these replay segments to be prematurely cut off in order to return to the live action in the main camera view. There is clearly an opportunity to automate the production of such sports highlight moments. Robust detection of highlights is currently achievable via detection of excited commentary and/or visual analysis of ball and goalmouth action. While these technologies may not replace the entire studio crew, it is foreseeable that they can substantially cut down the crew size in the broadcast studio and streamline the highlight generation process. In automatic replays, the primary concern is arguably their qualitative assessment, and we seek to address this by robust replay boundary determination via extensive use of broadcast rules and domain knowledge. Our approach is to detect highlights using only the unedited main camera video feed. The boundary end-points of replays extracted from this feed can then be used as time-stamp markers to extract the corresponding video feed from the other cameras. Collated this way, it is possible to apply further qualitative assessment as to which video feed to select as the final replay. Even if the option for automatic replay generation is not invoked, collating and synchronizing the various video feeds using an appropriate interface will greatly simplify the final replay selection. It is also interesting to note that, with automatic replays, the determination of replay-worthy highlights in the game is no longer the exclusive purview of the broadcast director. With advances in digital TV and set-top boxes with hard-drive storage, automatic replay based on personalized parameters may be performed at the client end. With automatic replays, the number of replays would almost certainly increase. While not all of these would go on-air via traditional TV channels, they can be streamed via alternative media channels, e.g. wireless video. Hence, the proliferation of replay segments may potentially open a secondary market of game viewership.

This paper presents an automatic system to generate replays for soccer broadcasting. The research is challenging for the following reasons. Firstly, it is more difficult to detect events from the unedited main camera video: this type of video contains neither the post-production information, nor the multiple camera views, nor the commentary information that are available in broadcast video, so fewer cues can be used for event detection. Secondly, soccer event detection is difficult because soccer events do not possess a strong temporal structure, i.e. the same semantic event can happen in different situations with different durations. Thirdly, soccer video is noisy: the low-level visual and audio features extracted are often affected by factors such as audience noise, weather, luminance, etc. Lastly, upon detecting an interesting segment for replay, we also need to locate a suitable time slot for the replay that minimizes the interruption of the main camera view. The main contributions of the paper include:
1. Efficient mid-level representations suitable for analyzing unedited soccer video taken by the main camera are introduced. The accuracy of the play position keyword and the audio keyword is improved compared with related work;
2. The proposed event detection approach is able to identify not only soccer events but also event boundaries, using unedited soccer video as well as broadcast video;
3. An automatic replay generation scheme is presented, and the generated replays are found to be comparable with those produced by human broadcasters.

2. FRAMEWORK
Figure 1: Framework of the automatic replay generation system

Automatic highlight identification is a difficult process, as there is no clear relationship between low-level feature patterns and high-level events. To bridge the large gap between features and events, we proposed a mid-level representation in our previous work [3]. Here we adopt this idea and propose a three-level framework. Previous research [6, 7, 8, 9] has shown that intermodal collaboration can improve the robustness of a system, e.g. visual and text streams [6], audio and motion [7], captions [8, 9], etc. Similarly, we apply information from different domains in our setup, making it a multi-level, multi-modal system.
Fig. 1 illustrates our proposed framework. Specifically, the low-level modules extract features from the audio stream, video stream and motion vector field. Here we assume that the audio information is available from the video taken by the main camera. These raw feature streams are first analyzed by the mid-level system to generate keyword sequences. The high-level system then combines these mid-level keywords to detect events and their boundaries. Lastly, in our automatic replay generation application, an application level uses the event detection results and mid-level representations to generate replays and insert them into the output video automatically. In the following sections, the details of the mid-level representation and high-level event detection are presented. As the low-level implementation is straightforward, it will not be discussed.

3. MID-LEVEL REPRESENTATION
The mid-level system creates five synchronized keyword sequences from low-level visual, motion and audio features. The keywords and their associated analysis are listed in Table 1:

Table 1: Analysis description

  ID   Description                    Analysis
  F1   Active play position keyword   Visual
  F2   Ball trajectory                Visual
  F3   Goalmouth location             Visual
  F4   Motion activity                Motion
  F5   Audio keyword                  Audio

3.1 Visual analysis (F1, F2, F3)
The visual analysis creates three keywords: F1, F2 and F3. The creation of each keyword is discussed in the following subsections.

3.1.1 Position keyword (F1)
The F1 keyword reflects the location of the play in the soccer field. In our implementation, the field is divided into 15 areas (Fig. 2a). Symmetrical regions of the field are given the same labels, resulting in six keyword labels (Fig. 2b). In comparison with [4], which uses 12 coarser field regions, our field division is finer with greater precision.

(a) 15 areas  (b) 6 labels
Figure 2: Soccer field model

Video from the main camera is used to identify the play region in the field. The raw video shows only a cropped portion of the field as the main camera pans and zooms. Previous work [5] implemented field-line detection to identify the penalty area position. In our model, as we need to identify play regions spanning the entire field, the following three features are used:
1. Field-line locations, represented in polar coordinates (ρ_i, θ_i), i = 1, 2, ..., N, where ρ_i and θ_i are the i-th radial and angular coordinates respectively and N is the total number of lines;
2. Goalmouth location, represented by its central point (x_g, y_g), where x_g and y_g are the X and Y coordinates;
3. Central circle location, represented by its central point (x_e, y_e), where x_e and y_e are the X and Y coordinates.

To detect the active play region, we propose a Competition Network (CN) using the three shape features described above. The CN consists of 15 dependent classifier nodes, each node representing one area of the field as illustrated in Fig. 2a. The 15 nodes compete with each other, and the accumulated winning node is identified as the chosen region of play. The CN operates in the following manner: at time t, every detected field-line (ρ_it, θ_it), together with the goalmouth (x_gt, y_gt) and central circle (x_et, y_et), forms the feature vector v_i(t), where i = 1, ..., N and N is the number of lines detected at time t. Specifically,

  v_i(t) = [ρ_it, θ_it, x_gt, y_gt, x_et, y_et]^T,  i = 1, ..., N    (1)

The response of each node is

  r_j(t) = Σ_{i=1}^{N} w_j · v_i(t),  j = 1, ..., 15    (2)

where

  w_j = [w_j1, w_j2, ..., w_j6],  j = 1, ..., 15    (3)

is the weight vector associated with the j-th node, one for each of the 15 regions. The set of winning nodes at time t is

  {j*(t)} = arg max_{j=1,...,15} r_j(t)    (4)

Then the accumulated response is computed by

  R_j(t) = R_j(t-1) + r_j(t)·α − Dist(j, j*(t))·β    (5)

where R_j(t) is the accumulated response of node j, α is a positive scaling constant, β is an attenuation constant, and Dist(j, j*(t)) is the Euclidean distance from node j to the nearest instantaneous winning node in the list {j*(t)}. A large Dist(j, j*(t)) results in stronger attenuation in Eq. 5. To compute the final output of the CN at time t, the maximal accumulated response is found at node j#(t), where

  j#(t) = arg max_{j=1,...,15} R_j(t)    (6)

If R_{j#}(t) is bigger than a predefined threshold, the value of the position keyword F1 at time instant t is set to j#(t); otherwise it remains unchanged.
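To make the competition concrete, the following Python sketch implements Eqs. (2)-(6) for one time step. The weight matrix W, the constants alpha and beta, the node coordinates used by Dist, and the decision threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def cn_step(V, W, R_prev, node_xy, alpha=1.0, beta=0.5, threshold=10.0, prev_label=None):
    """One Competition Network update (Eqs. 2-6); a minimal sketch.

    V       : (N, 6) feature vectors v_i(t), one per detected field-line
    W       : (15, 6) weight vectors w_j, one per field area (assumed pre-trained)
    R_prev  : (15,) accumulated responses R_j(t-1)
    node_xy : (15, 2) assumed 2-D node coordinates used by Dist(j, j*(t))
    """
    r = W @ V.sum(axis=0)                    # Eq. 2: r_j(t) = sum_i w_j . v_i(t)
    winners = np.flatnonzero(r == r.max())   # Eq. 4: instantaneous winning node(s)
    # Distance from every node to its nearest instantaneous winner
    dist = np.min(np.linalg.norm(node_xy[:, None, :] - node_xy[winners][None, :, :],
                                 axis=2), axis=1)
    R = R_prev + alpha * r - beta * dist     # Eq. 5: accumulate, attenuate far nodes
    j_hash = int(np.argmax(R))               # Eq. 6: node with maximal accumulation
    label = j_hash if R[j_hash] > threshold else prev_label  # keep old label otherwise
    return R, label
```

Calling cn_step once per frame with the running R and the previous label reproduces the "accumulated winning node" behavior described above.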
3.1.2 Ball trajectory (F2)
The detected and tracked position of the ball is a strong and direct cue for recognizing some events. For example, the relative position between the ball and the goalmouth can indicate events such as scoring and shooting. In this paper, the ball trajectories are obtained by the trajectory-based ball-detection-and-tracking algorithm presented in our previous work [2]. Unlike object-based algorithms, this algorithm does not evaluate whether a single object is the ball. Instead, it uses a Kalman filter to evaluate whether a candidate trajectory is a ball trajectory. We denote the ball trajectory by ID F2 (Table 1); F2 is a two-dimensional vector stream recording the two coordinates of the ball in each frame.

3.1.3 Goalmouth location (F3)
Besides being used in the position keyword model, the goalmouth location is itself an important mid-level representation. A goalmouth can be formed from the two detected goalposts and is expressed by its four vertexes. We denote the goalmouth location by ID F3 (Table 1).

3.2 Motion analysis (F4)
Motion information has been widely studied for video analysis, e.g. Motion Texture [10] and the MPEG-7 intensity of motion activity descriptor [7]. In soccer games, the main camera always follows the movement of the ball, so the camera motion provides an important cue to represent the general activity. In our framework, we calculate the camera motion using the motion vector field that is readily available in compressed video. A texture filter is applied to remove inaccurate motion vectors. Then the algorithm in [11] is used to compute the pan factor p_p, tilt factor p_t and zoom factor p_z of the camera. In addition, the average motion magnitude p_m is computed. Thus a motion activity vector [p_z, p_p, p_t, p_m]^T is formed as a measure of the motion activity. Since the motion information is only available and extracted from P frames, the motion activity vector for I and B frames is set to the last computed P frame value. We denote this motion activity vector stream by ID F4 (Table 1).

3.3 Audio analysis (F5)
The purpose of the audio analysis is to label each audio frame (20 ms in our experiment) with a predefined class. For our purpose, we have three classes: Whistle, Acclaim and Noise, and the audio keyword found is denoted F5. The classifier used is the Support Vector Machine (SVM) with the Gaussian (RBF) kernel function. As the SVM is a two-class classifier, it is used in a one-against-all configuration for our three-class problem. The input audio feature to the SVM is found by exhaustive search over the following candidate features: Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Coefficients (LPC), LPC Cepstral coefficients (LPCC), Short Time Energy (STE), Spectral Power (SP), and Zero Crossing Rate (ZCR). The best parameters found are a combination of an LPCC subset and an MFCC subset.

3.4 Post-processing
The first function of post-processing is to eliminate sudden errors in the created keywords. The mid-level keywords are coarse semantic representations, so a keyword value should not change too fast. Any sudden change in a keyword sequence can be considered an error and is eliminated using majority voting within a sliding window of length w_l and step size w_s frames. For the different keywords, the sliding window parameters are set experientially: position keyword F1: w_l = 25 and w_s = 10; ball trajectory keyword F2: no post-processing is applied, as it has already been smoothed by the Kalman filter; goalmouth position keyword F3: w_l = 12 and w_s = 8; motion activity keyword F4: no post-processing is applied, as it is obtained objectively from the compressed video; audio keyword F5: w_l = 5 and w_s = 1. The second function of post-processing is to synchronize keywords from different domains. Audio labels are created on a smaller sliding window (20 ms in our system) than the visual frame rate (25 fps, i.e. each video frame lasts 40 ms). Since the audio sequence rate is twice that of the video sequence, it is easy to synchronize them. After post-processing, the mid-level outputs are used by the next level for event detection.
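As an illustration of this majority-voting step, the sketch below smooths a discrete keyword sequence with a window of length w_l and step w_s. Overwriting each window with its majority label is our simplification, since the paper does not spell out this detail.

```python
from collections import Counter

def majority_vote(keywords, w_l, w_s):
    """Majority-vote smoothing of a keyword sequence (Section 3.4); a sketch.

    keywords : list of discrete keyword labels, one per frame
    w_l, w_s : sliding-window length and step, in frames
    """
    smoothed = list(keywords)
    for start in range(0, max(1, len(keywords) - w_l + 1), w_s):
        window = keywords[start:start + w_l]
        label, _ = Counter(window).most_common(1)[0]   # majority label in this window
        for i in range(start, start + len(window)):
            smoothed[i] = label
    return smoothed

# Example: a one-frame glitch in the position keyword is voted away
print(majority_vote([2, 2, 2, 5, 2, 2], w_l=5, w_s=1))  # -> [2, 2, 2, 2, 2, 2]
```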

4. EVENT DETECTION
This section discusses three problems associated with event detection from the video taken by the main camera for automatic replay generation:
1. The lack of general criteria for defining which events should be selected for replay. For a human broadcaster, suitable selection comes with experience;
2. The requirement to achieve acceptable event detection accuracy from the video taken by the main camera. As mentioned in Section 1, this is a difficult problem because fewer cues are available than for event detection from broadcast video;
3. The difficulty of detecting the time boundaries of interesting events. For the purpose of generating a replay, not only must the event be detected, we also need to extract its time boundary. However, the event boundary is an even more subjective concept.
Our proposed solutions to these three problems are discussed in the following three subsections, respectively.
4.1 Selection of replay event
To find general criteria for the selection of events for replay, a quantitative study of 143 replays in five FIFA WC 2002 games was conducted. It shows that all of the replayed events belong to the three types listed in Table 2.

Table 2: Events for replay

              Total   Attack   Foul   Other
  Number      143     -        -      -
  Percentage  100%    49%      47%    4%

The three event types are: Attack events, consisting of scoring or just-missed shots on goal; Foul events, consisting of referee decisions (referee whistle); and Other events, consisting of injury events and miscellaneous. If none of the above events is detected, the output of the classifier defaults to no-event. Our automatic replay generation system generates replays for these three event types.

4.2 Event moment detection
We detect events based on the created keyword sequences. Event detection from broadcast video has been widely studied [1]. In broadcast video, the transition between the types of shot/view is closely related to the semantic state of the game, hence the Hidden Markov Model (HMM) based classifier, which is good at discovering temporal patterns, is applicable [12]. In our previous work [13] we also used the HMM for event detection from mid-level representations created from broadcast soccer video. However, when applying an HMM to the keyword sequences created in the above section, we noticed that there is less temporal pattern in these sequences, which makes the HMM method unsuitable. Instead, we find certain distinct feature patterns that appear only during the occurrence of an event. We call such moments with a distinguishing feature pattern event moments, e.g. the moment of hearing the whistle in a Foul, or the moment of very close distance between goalmouth and ball in an Attack. By detecting these moments it is possible to detect the occurrence of an event. Fig. 3 illustrates the structure of part of a game from the perspective of events. As can be seen from Fig. 3a, the timeline of the game consists of event/no-event segments. In addition, within the event boundary there is a smaller boundary of the event moment as described above. The event in this example is an Attack event. We observed that the event moment of an Attack consists of (1) a very small ball-goalmouth distance (Fig. 3b); (2) the position keyword having value 2 (Fig. 3c), which is designated for the penalty area (Fig. 2b); and (3) the audio keyword being Acclaim (Fig. 3d). The choice of which mid-level representations to use for detecting event moments is derived from heuristic and statistical methods. In the above example, the reason for choosing the ball-goalmouth distance and the position keyword is the intrinsic nature of soccer scoring [14], and the reason for choosing the audio keyword is the close relationship between a possible scoring event and the response of the spectators.

Figure 3: Event moment of attack. (a) event segment (b) ball-goalmouth distance (pixel) (c) position (labels in Fig. 2b) (d) audio (label)

As illustrated in Fig. 1, the chosen keyword streams are synchronized and integrated into a multi-dimensional keyword vector stream from which the event moment is detected. To avoid employing heuristics, a statistical classifier is employed to learn the decision boundary, e.g. how small the ball-goalmouth distance is in an Attack event, or how slow the motion is during a Foul event. To classify the three event types, three classifiers are trained to detect the event moments of the associated events. To make the classifiers robust, each classifier uses a different set of mid-level keywords as input. Specifically, the inputs are: Attack classifier: position keyword (F1), ball trajectory (F2), goalmouth location (F3) and audio keyword (F5); Foul classifier: position keyword (F1), motion activity keyword (F4) and audio keyword (F5); Other classifier: position keyword (F1) and motion activity keyword (F4). The output of each classifier is Attack/no-event, Foul/no-event and Other/no-event, respectively. The classifier used is the SVM with the Gaussian radial basis function (RBF) kernel. To train the SVM classifiers, event and no-event segments are first manually identified, and the mid-level representations are then created. To generate the training data, the specific event moments within the events are manually tagged and used as positive examples; sequences from the rest of the clips are used as negative training samples. In the detection process, the entire keyword sequence from the test video is fed to the SVM classifier, and segments with the same statistical pattern as an event moment are identified. By applying post-processing similar to that in Section 3.4, small fluctuations in the SVM classification results are eliminated to avoid duplicate detection of the event moment of the same event.
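A minimal sketch of one such detector is given below, for the Attack classifier's inputs (F1, a ball-goalmouth distance derived from F2 and F3, and F5). The use of scikit-learn's SVC and the exact per-frame feature layout are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.svm import SVC

def attack_features(f1, ball_xy, goal_xy, f5):
    """Per-frame feature vectors for the Attack event-moment classifier.

    f1, f5   : arrays of position and audio keyword labels, one per frame
    ball_xy  : (T, 2) ball coordinates from F2
    goal_xy  : (T, 2) goalmouth center derived from F3
    """
    dist = np.linalg.norm(ball_xy - goal_xy, axis=1)  # ball-goalmouth distance
    return np.column_stack([f1, dist, f5])

# Binary SVM with RBF kernel: frames inside manually tagged event moments are
# positive (1), frames from the remaining clips are negative (0).
clf = SVC(kernel="rbf", gamma="scale")
# X_train = attack_features(...); y_train = 0/1 event-moment labels
# clf.fit(X_train, y_train)
# y_pred = clf.predict(attack_features(...))
# y_pred would then be smoothed by majority voting as in Section 3.4.
```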
4.3 Event boundary decision
If an event moment is found, a search algorithm is applied backward and forward from the event moment instance to identify the duration of the event. The entire video segment within this duration is used as the replay of the event. Many factors affect the human perceptual understanding of the duration of an event. One factor is time, i.e. events usually possess only a certain temporal duration. Another factor is the position where the event happens: mostly an event happens at a certain position, so scenes from a previous location may not be of much interest to the audience. This assumption holds unless the position changes quickly during the event (e.g. a goal scored by a long shot from midfield). These observations motivate us to detect event boundaries using the position keyword (F1) and time duration.

The backward search to identify the event starting boundary checks whether the location keyword F1 has changed in the interval from t_s − D_1 back to t_s − D_2, where t_s is the event moment starting time and D_1 < D_2 are the minimal and maximal offset thresholds, respectively. Specifically, the following pseudo code illustrates our approach:
1. Let time t = t_s − D_1.
2. If F1(t) ≠ F1(t_s − D_1), then set the event starting time t_es to t and go to step 5.
3. If t < t_s − D_2, then set t_es to t and go to step 5.
4. Let t = t − 1, and loop back to step 2.
5. Stop.
The forward search detects the event ending time t_ee. The algorithm is similar to the backward search; the differences are only in the thresholds and that the search proceeds forward in time. We have noted that different types of events require different thresholds, which can be found by empirical evaluation.
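The pseudo code above translates directly into the following sketch. The forward variant with its own offsets D1f/D2f is our naming, since the paper only states that it mirrors the backward search; frame-indexed keyword arrays with sufficient length are assumed.

```python
def backward_search(F1, t_s, D1, D2):
    """Find the event starting time t_es (Section 4.3 pseudo code); a sketch.

    F1     : frame-indexed position keyword sequence (assumes t_s - D2 >= 0)
    t_s    : event moment starting frame
    D1, D2 : minimal and maximal backward offsets, D1 < D2
    """
    t = t_s - D1
    while True:
        if F1[t] != F1[t_s - D1]:   # position keyword changed: event starts here
            return t
        if t < t_s - D2:            # maximal offset reached without a change
            return t
        t -= 1                      # step one frame backward

def forward_search(F1, t_e, D1f, D2f):
    """Symmetric forward search for the event ending time t_ee.
    D1f, D2f are the (event-type dependent, empirically found) forward offsets."""
    t = t_e + D1f
    while F1[t] == F1[t_e + D1f] and t <= t_e + D2f:
        t += 1
    return t
```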

5. REPLAY GENERATION
Based on the events and event boundaries detected from the video taken by the main camera, we can automatically generate replays for these events and decide whether and where to insert them. Since this has been a very subjective decision for human broadcasters, we need to set general criteria for this production step. Another quantitative study was done on the same video database mentioned in Section 4.1, and the result is given in Table 3.

Table 3: Possible replay insertion place

          Total   Instant replay   Delayed replay
                                   MM    FI    IE

  MM: missed by main camera; FI: followed by another interesting segment; IE: very important event

It is found that all replays belong to two classes: instant replay and delayed replay. Most replays are instant replays, inserted almost immediately after the event if the subsequent segments are uninteresting. The other class, delayed replay, occurs for several reasons: a) the event was missed by the main camera (MM); b) the event to be replayed is followed by an interesting segment (FI), so the broadcaster has to delay the replay; and c) the event is important and worth replaying many times (IE).

The input to the replay generation system is the event detection result, which segments the game into a sequential event/no-event structure, as illustrated in Fig. 4 row 1. If an event segment is identified, the system examines whether an instant replay can be inserted in the following no-event segment, and reacts accordingly. This is shown in Fig. 4 rows 2 and 3, where instant replays are inserted for both event 1 and event 2. In addition, the system examines whether the same event meets the delayed replay condition. If so, the system buffers the event and inserts the replay in a suitable subsequent time slot. This is shown in Fig. 4 rows 2 and 3, where a delayed replay is inserted at a later time slot for event 1. Fig. 4 row 4 shows the generated video after replay insertion.

Figure 4: Replay structure

In our current application, we have not examined the use of sub-camera capture for the replay scenes. The current work restricts the replay to the main camera capture; an enhancement to use sub-camera capture is ongoing.

5.1 Instant replay generation
The replay starting time t_rs and ending time t_re are computed as:

  t_rs = t_ee + D_3    (7)
  t_re = t_rs + (t_ee − t_es) · ν    (8)

where t_es and t_ee are the starting and ending times of the event as defined in Section 4.3. D_3 represents the time duration between the end of an event and the start of the instant replay; we arbitrarily set D_3 to 1 second, and this is adjustable. ν is a factor defining how slowly the replay is displayed compared with real time. The system then examines whether the time slot from t_rs to t_re in the subsequent no-event segment meets one of the following conditions: no/low motion; or high motion but with the position not at area 2 in Fig. 2b (the penalty area). If so, an instant replay is inserted.

5.2 Delayed replay generation
As mentioned at the start of this section, delayed replays should be inserted for MM, FI or IE events. MM events cannot be processed by our system, as they cannot be detected from the video taken by the main camera. Our replay generation system buffers the FI and IE events and finds suitable time slots to insert delayed replays. To identify whether an event is an IE event, an importance measure I is assigned to the event based on the duration of its event moment, as generally the longer the event moment, the more important the event:

  I = t_te − t_ts    (9)

Events with I > T_4 are deemed important. In our system, T_4 is set to 80 frames empirically, so that only 5% of detected events become important ones; this ratio is consistent with the identification of important events in broadcast video. The duration of the delayed replay is the same as that of the instant replay. The system searches subsequent no-event segments for a time slot of length t_re − t_rs that meets the following condition: no motion. If such a time slot is found, a delayed replay is inserted. This search continues until a suitable time slot is found for an FI event, or two delayed replays have been inserted for an IE event, or a more important IE event occurs.
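The sketch below computes the instant replay slot from Eqs. (7)-(8) and applies the importance test of Eq. (9). The slow-motion factor ν = 2 and the 25 fps frame rate used for second-to-frame conversion are illustrative assumptions.

```python
FPS = 25  # main camera frame rate, as in Section 3.4

def instant_replay_slot(t_es, t_ee, D3=1.0 * FPS, nu=2.0):
    """Eqs. (7)-(8): replay start/end, in frames, for an event [t_es, t_ee]."""
    t_rs = t_ee + D3                  # Eq. 7: wait D3 after the event ends
    t_re = t_rs + (t_ee - t_es) * nu  # Eq. 8: event duration stretched by nu
    return t_rs, t_re

def is_important(t_ts, t_te, T4=80):
    """Eq. (9): an event is an IE event if its event moment exceeds T4 frames."""
    return (t_te - t_ts) > T4

# Example: a 5-second event whose event moment lasted 100 frames
t_rs, t_re = instant_replay_slot(t_es=1000, t_ee=1125)
print(t_rs, t_re, is_important(t_ts=1030, t_te=1130))  # 1150.0 1400.0 True
```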
6. EXPERIMENTAL RESULTS
6.1 Accuracy of mid-level representation
A soccer event usually lasts many frames, so the detection process examines a collection of frames for each event. In our experiments, we noted that sporadic classification errors occur in the mid-level representations. However, these scattered errors are time-averaged and hence do not affect the overall classification performance of the high-level system.

6.1.1 Position keyword
Fig. 5 demonstrates the detection of two typical areas defined in Fig. 2b.

(a) Position 2  (b) Position 3
Figure 5: Position keywords creation

To evaluate the performance of the position keyword creation, a total of 10 minutes of video (from two FIFA WC 2002 games, Senegal vs Turkey and Germany vs Brazil), consisting of main camera video segments only, was manually labeled. The keyword generation results for this database are compared against the labels, and the accuracy of the position keyword is listed in Table 4. Note that the detection accuracy for field area 4 is low compared with the other labels. This is easily explained: field area 4 (Fig. 2b) has fewer cues than the other areas, e.g. it contains no field-lines, goalmouth or central circle. This lack of distinct information results in poorer accuracy.

Table 4: Accuracy of position keyword (the positions are the 6 labels given in Fig. 2b)

  Position   Accuracy   Position   Accuracy
  1          -          4          -
  2          -          5          -
  3          -          6          -

6.1.2 Ball trajectory
The ball trajectory test is conducted on 15 sequences (176 seconds) taken from the FIFA WC 2002 Final. These sequences are representative in that they include short to long sequences, ball-less sequences, and sequences with assorted close-up and full-view frames. Table 5 shows the performance. More detailed ball trajectory tracking results can be found in our previous work [2].

Table 5: Accuracy of ball trajectory

  Detected and tracked   False positive   Accuracy
  4283 frames            25 frames        98.8%

6.1.3 Audio keyword
To evaluate the accuracy of the audio keyword generation module, three audio classes are defined: Acclaim, Whistle and Noise. 30 minutes of soccer audio data are segmented into 20 ms frames, and each frame is classified into one of the three classes. In this experiment, a 50%/50% training/testing split is used. The performance of the audio features selected by exhaustive search is compared with our previous work [15], where feature selection was done using domain knowledge.

Table 6: Accuracy of audio keywords

                         Acclaim   Whistle   Noise
  Previous method [15]   91.2%     90.8%     89.2%
  Current method         93.8%     94.4%     96.3%

6.2 Event detection
To examine the performance of our system, we selected 50 minutes of unedited main camera video (from the Singapore League) and 4.5 hours of FIFA WC 2002 broadcast video for the experiment. There are two reasons for also choosing broadcast soccer video: firstly, it is very difficult to collect main camera video, as most TV stations do not keep such tapes; secondly, the broadcast video can be used as ground truth to evaluate our application-level results, as described in the later sections. The non-main-camera shots in the broadcast video are identified and filtered out by our visual analysis blocks, so that only main camera segments are processed. The event detection results for these two types of video are listed in Tables 7 and 8, respectively.

Table 7: Accuracy from main camera video

  Event    No.   Recall   Precision   BDA
  Attack   3     60%      100%        72.2%
  Foul     -     -        70.0%       71.4%
  Other    -     -        50.0%       60.0%

Table 8: Accuracy from broadcast video

  Event    No.   Recall   Precision   BDA
  Attack   -     -        78.3%       69.4%
  Foul     -     -        72.8%       80.9%
  Other    12    80%      66.7%       65.0%

BDA: boundary decision accuracy. The BDA in Tables 7 and 8 is computed by

  BDA = |τ_db ∩ τ_mb| / max(|τ_db|, |τ_mb|)    (10)

where τ_db and τ_mb are the automatically detected event boundary and the manually labeled event boundary, respectively.
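Reading Eq. (10) as temporal intersection over the longer of the two boundary spans (the operator between τ_db and τ_mb was lost in extraction, so this reading is our assumption), the measure can be computed as:

```python
def bda(detected, manual):
    """Boundary decision accuracy (Eq. 10), assuming intersection-over-max.

    detected, manual : (start, end) frame intervals tau_db and tau_mb
    """
    (ds, de), (ms, me) = detected, manual
    overlap = max(0, min(de, me) - max(ds, ms))  # |tau_db intersect tau_mb|
    return overlap / max(de - ds, me - ms)       # normalized by the longer span

print(bda((100, 200), (110, 230)))  # 0.75: 90 overlapping frames / 120
```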
It is observed that the boundary decision accuracy for the Other event is lower than for the other two events. This is because the Other event mainly comprises injuries and sudden events. The cameraman usually keeps moving the camera to follow the ball until the game is stopped, e.g. until the ball is kicked over the touch-line so that the injured player can be treated; only then is the camera focused on the injured players. This results in either the exact event moment being missed by the main camera or an unpredictable duration of camera movement. These factors affect the event moment detection and hence the boundary decision accuracy.

6.3 Automatic replay generation
As we have both the automatically generated video and the broadcast video from the broadcast TV program, we can use the latter as ground truth to evaluate the quality of the generated replays. Table 9 compares the replays generated automatically by our system with the broadcast video's replays.

Table 9: Replay generation

             Automatic generation   Broadcast replay
  total      37                     15
  same       13                     13
  missed                            2
  recall     86.7%
  precision  35.1%

The term "same" in Table 9 means that a replay is inserted at that event in both the automatically generated video and the broadcast video. It is observed from Table 9 that our system generates significantly more replays than the human broadcaster's selection. This can be understood in at least three ways:

1) Lack of a general broadcast syntax: As noted in the previous section, the selection of replays is a subjective choice. We observed, for example, that for one just-missed shot event the human broadcaster's choice was a long close-up view of the disappointed player, returning to the main camera view when the game resumed, with no slow-motion replay; yet in a similar event with the same game features, the choice was to launch a replay. In another example, a replay was shown when an offside (a Foul event) was detected; for such foul events, replays using side-camera logged video are often used to confirm the correctness of the assistant referee's decision, while in other cases no replay was given. Hence it is clear that an automated system will generate more replays whenever its predefined conditions are met.

2) The capability of automation: Generating live soccer highlights is currently a time-critical and labor-intensive process. The strict time limit for generating a replay means that a good replay segment can be missed. With the assistance of an automatic system, more replays are generated.

3) The limits of event detection accuracy: The third possible reason for the excess replays is failure of event detection. Incorrect event detection ultimately leads to missed or incorrect replays. Currently, automatic systems are not able to detect events, and especially event boundaries, as accurately as humans. However, this problem can be minimized with human intervention at much lower cost than broadcasting solely by humans: the director only needs to confirm the necessity of each generated replay, instead of monitoring the whole match and manually finding suitable replay time slots.

7. CONCLUSIONS AND FUTURE WORK
This paper presents a novel framework to detect events from soccer videos taken by a single main camera and to automatically generate soccer replays for broadcasting. This is clearly important for reducing manual processing, thereby reducing the size of the crew in the broadcast studio and streamlining the highlight generation process. The accuracy of the event detection and of the soccer replays generated by our framework can be complemented and refined by a small amount of human intervention. We have built a demo system for the framework. Although the system currently performs off-line, it can be upgraded to an on-line processing system: the required mid-level representations can be generated on the fly, and these analyses are not computationally expensive, e.g. we do not need to track players to analyze their behaviors. The subsequent high-level and application-level processing can be done in one pass, though some processing delay would be introduced, e.g. to search for a suitable replay time slot. We have begun investigating the next stage of the proposed system.
The future framework includes several parts: firstly, to improve the event detection accuracy by employing more cues, through additional mid-level representations and/or high-level semantic event modeling and detection; secondly, to examine new techniques to detect events missed by the main camera, to enable a full-scale replay generation system; thirdly, to extend the system's functionality into a fully automatic broadcast control or broadcast video generation system, by incorporating automatic selection of sub-camera capture for replays, multi-camera switching control, interactive caption overlay, and similar tasks; fourthly, to extend the system to other sports domains by investigating the respective domain knowledge and introducing new domain analysis while keeping the generic structure of the framework presented in this paper unchanged.

8. REFERENCES
[1] N. Adami, R. Leonardi, and P. Migliorati, "An overview of multi-modal techniques for the characterization of sport programmes," SPIE-VCIP '03, 2003.
[2] X. Yu et al., "Trajectory-based ball detection and tracking with applications to semantic analysis of broadcast soccer video," ACM MM '03, 2003.
[3] L. Duan et al., "A mid-level representation framework for semantic sports video analysis," ACM MM '03, 2003.
[4] J. Assfalg et al., "Semantic annotation of soccer videos: automatic highlights identification," CVIU, vol. 92, 2003.
[5] A. Ekin, A. M. Tekalp, and R. Mehrotra, "Automatic soccer video analysis and summarization," IEEE Trans. on Image Processing, vol. 12, no. 7, 2003.
[6] N. Babaguchi and N. Nitta, "Intermodal collaboration: A strategy for semantic content analysis for broadcasted sports video," ICIP '03, vol. 1, 2003.
[7] Z. Xiong, R. Radhakrishnan, and A. Divakaran, "Generation of sports highlights using motion activity in combination with a common audio feature extraction framework," ICIP '03, vol. 1, pp. 5-8, 2003.
[8] N. Nitta and N. Babaguchi, "Automatic story segmentation of closed-caption text for semantic content analysis of broadcasted sports video," Inter. Workshop on MM Info. Sys. '02, 2002.
[9] D. Zhang and S.-F. Chang, "Event detection in baseball video using superimposed caption recognition," ACM MM '02, 2002.
[10] Y. Ma and H. Zhang, "Motion pattern based video classification using support vector machines," ISCAS '02, Theme: Circuits and Systems for Ubiquitous Computing, 2002.
[11] Y. Tan et al., "Rapid estimation of camera motion from compressed video with application to video annotation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 10, no. 1, 2000.
[12] L. Xie et al., "Structure analysis of soccer video with domain knowledge and hidden markov models," Pattern Recognition Letters, vol. 24, 2003.
[13] J. Wang, C. Xu, E. Chng, and Q. Tian, "Sports highlight detection from keyword sequences using HMM," ICME '04, 2004.
[14] International Football Association Board, "Laws of the Game," Fédération Internationale de Football Association, Zurich, Switzerland.
[15] M. Xu et al., "Creating audio keywords for event detection in soccer video," ICME '03, vol. 2, 2003.


More information

Acoustic Scene Classification

Acoustic Scene Classification Acoustic Scene Classification Marc-Christoph Gerasch Seminar Topics in Computer Music - Acoustic Scene Classification 6/24/2015 1 Outline Acoustic Scene Classification - definition History and state of

More information

Name Identification of People in News Video by Face Matching

Name Identification of People in News Video by Face Matching Name Identification of People in by Face Matching Ichiro IDE ide@is.nagoya-u.ac.jp, ide@nii.ac.jp Takashi OGASAWARA toga@murase.m.is.nagoya-u.ac.jp Graduate School of Information Science, Nagoya University;

More information

On-Supporting Energy Balanced K-Barrier Coverage In Wireless Sensor Networks

On-Supporting Energy Balanced K-Barrier Coverage In Wireless Sensor Networks On-Supporting Energy Balanced K-Barrier Coverage In Wireless Sensor Networks Chih-Yung Chang cychang@mail.tku.edu.t w Li-Ling Hung Aletheia University llhung@mail.au.edu.tw Yu-Chieh Chen ycchen@wireless.cs.tk

More information

The Measurement Tools and What They Do

The Measurement Tools and What They Do 2 The Measurement Tools The Measurement Tools and What They Do JITTERWIZARD The JitterWizard is a unique capability of the JitterPro package that performs the requisite scope setup chores while simplifying

More information

Temporal data mining for root-cause analysis of machine faults in automotive assembly lines

Temporal data mining for root-cause analysis of machine faults in automotive assembly lines 1 Temporal data mining for root-cause analysis of machine faults in automotive assembly lines Srivatsan Laxman, Basel Shadid, P. S. Sastry and K. P. Unnikrishnan Abstract arxiv:0904.4608v2 [cs.lg] 30 Apr

More information

Color Quantization of Compressed Video Sequences. Wan-Fung Cheung, and Yuk-Hee Chan, Member, IEEE 1 CSVT

Color Quantization of Compressed Video Sequences. Wan-Fung Cheung, and Yuk-Hee Chan, Member, IEEE 1 CSVT CSVT -02-05-09 1 Color Quantization of Compressed Video Sequences Wan-Fung Cheung, and Yuk-Hee Chan, Member, IEEE 1 Abstract This paper presents a novel color quantization algorithm for compressed video

More information

CHAPTER 8 CONCLUSION AND FUTURE SCOPE

CHAPTER 8 CONCLUSION AND FUTURE SCOPE 124 CHAPTER 8 CONCLUSION AND FUTURE SCOPE Data hiding is becoming one of the most rapidly advancing techniques the field of research especially with increase in technological advancements in internet and

More information

Subjective Similarity of Music: Data Collection for Individuality Analysis

Subjective Similarity of Music: Data Collection for Individuality Analysis Subjective Similarity of Music: Data Collection for Individuality Analysis Shota Kawabuchi and Chiyomi Miyajima and Norihide Kitaoka and Kazuya Takeda Nagoya University, Nagoya, Japan E-mail: shota.kawabuchi@g.sp.m.is.nagoya-u.ac.jp

More information

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY Eugene Mikyung Kim Department of Music Technology, Korea National University of Arts eugene@u.northwestern.edu ABSTRACT

More information

Research Topic. Error Concealment Techniques in H.264/AVC for Wireless Video Transmission in Mobile Networks

Research Topic. Error Concealment Techniques in H.264/AVC for Wireless Video Transmission in Mobile Networks Research Topic Error Concealment Techniques in H.264/AVC for Wireless Video Transmission in Mobile Networks July 22 nd 2008 Vineeth Shetty Kolkeri EE Graduate,UTA 1 Outline 2. Introduction 3. Error control

More information

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 04, April -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 MUSICAL

More information

Region Adaptive Unsharp Masking based DCT Interpolation for Efficient Video Intra Frame Up-sampling

Region Adaptive Unsharp Masking based DCT Interpolation for Efficient Video Intra Frame Up-sampling International Conference on Electronic Design and Signal Processing (ICEDSP) 0 Region Adaptive Unsharp Masking based DCT Interpolation for Efficient Video Intra Frame Up-sampling Aditya Acharya Dept. of

More information

PulseCounter Neutron & Gamma Spectrometry Software Manual

PulseCounter Neutron & Gamma Spectrometry Software Manual PulseCounter Neutron & Gamma Spectrometry Software Manual MAXIMUS ENERGY CORPORATION Written by Dr. Max I. Fomitchev-Zamilov Web: maximus.energy TABLE OF CONTENTS 0. GENERAL INFORMATION 1. DEFAULT SCREEN

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox Final Project (EECS 94) knoxm@eecs.berkeley.edu December 1, 006 1 Introduction Laughter is a powerful cue in communication. It communicates to listeners the emotional

More information

Dual frame motion compensation for a rate switching network

Dual frame motion compensation for a rate switching network Dual frame motion compensation for a rate switching network Vijay Chellappa, Pamela C. Cosman and Geoffrey M. Voelker Dept. of Electrical and Computer Engineering, Dept. of Computer Science and Engineering

More information

Motion Video Compression

Motion Video Compression 7 Motion Video Compression 7.1 Motion video Motion video contains massive amounts of redundant information. This is because each image has redundant information and also because there are very few changes

More information

Intra-frame JPEG-2000 vs. Inter-frame Compression Comparison: The benefits and trade-offs for very high quality, high resolution sequences

Intra-frame JPEG-2000 vs. Inter-frame Compression Comparison: The benefits and trade-offs for very high quality, high resolution sequences Intra-frame JPEG-2000 vs. Inter-frame Compression Comparison: The benefits and trade-offs for very high quality, high resolution sequences Michael Smith and John Villasenor For the past several decades,

More information

TRAFFIC SURVEILLANCE VIDEO MANAGEMENT SYSTEM

TRAFFIC SURVEILLANCE VIDEO MANAGEMENT SYSTEM TRAFFIC SURVEILLANCE VIDEO MANAGEMENT SYSTEM K.Ganesan*, Kavitha.C, Kriti Tandon, Lakshmipriya.R TIFAC-Centre of Relevance and Excellence in Automotive Infotronics*, School of Information Technology and

More information

Hidden Markov Model based dance recognition

Hidden Markov Model based dance recognition Hidden Markov Model based dance recognition Dragutin Hrenek, Nenad Mikša, Robert Perica, Pavle Prentašić and Boris Trubić University of Zagreb, Faculty of Electrical Engineering and Computing Unska 3,

More information

Repeating Pattern Discovery and Structure Analysis from Acoustic Music Data

Repeating Pattern Discovery and Structure Analysis from Acoustic Music Data Repeating Pattern Discovery and Structure Analysis from Acoustic Music Data Lie Lu, Muyuan Wang 2, Hong-Jiang Zhang Microsoft Research Asia Beijing, P.R. China, 8 {llu, hjzhang}@microsoft.com 2 Department

More information

White Paper. Video-over-IP: Network Performance Analysis

White Paper. Video-over-IP: Network Performance Analysis White Paper Video-over-IP: Network Performance Analysis Video-over-IP Overview Video-over-IP delivers television content, over a managed IP network, to end user customers for personal, education, and business

More information

Detection and demodulation of non-cooperative burst signal Feng Yue 1, Wu Guangzhi 1, Tao Min 1

Detection and demodulation of non-cooperative burst signal Feng Yue 1, Wu Guangzhi 1, Tao Min 1 International Conference on Applied Science and Engineering Innovation (ASEI 2015) Detection and demodulation of non-cooperative burst signal Feng Yue 1, Wu Guangzhi 1, Tao Min 1 1 China Satellite Maritime

More information

Digital Video Telemetry System

Digital Video Telemetry System Digital Video Telemetry System Item Type text; Proceedings Authors Thom, Gary A.; Snyder, Edwin Publisher International Foundation for Telemetering Journal International Telemetering Conference Proceedings

More information

Story Tracking in Video News Broadcasts

Story Tracking in Video News Broadcasts Story Tracking in Video News Broadcasts Jedrzej Zdzislaw Miadowicz M.S., Poznan University of Technology, 1999 Submitted to the Department of Electrical Engineering and Computer Science and the Faculty

More information

GRADIENT-BASED MUSICAL FEATURE EXTRACTION BASED ON SCALE-INVARIANT FEATURE TRANSFORM

GRADIENT-BASED MUSICAL FEATURE EXTRACTION BASED ON SCALE-INVARIANT FEATURE TRANSFORM 19th European Signal Processing Conference (EUSIPCO 2011) Barcelona, Spain, August 29 - September 2, 2011 GRADIENT-BASED MUSICAL FEATURE EXTRACTION BASED ON SCALE-INVARIANT FEATURE TRANSFORM Tomoko Matsui

More information

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG?

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? NICHOLAS BORG AND GEORGE HOKKANEN Abstract. The possibility of a hit song prediction algorithm is both academically interesting and industry motivated.

More information

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION Halfdan Rump, Shigeki Miyabe, Emiru Tsunoo, Nobukata Ono, Shigeki Sagama The University of Tokyo, Graduate

More information

An Overview of Video Coding Algorithms

An Overview of Video Coding Algorithms An Overview of Video Coding Algorithms Prof. Ja-Ling Wu Department of Computer Science and Information Engineering National Taiwan University Video coding can be viewed as image compression with a temporal

More information

Visual Communication at Limited Colour Display Capability

Visual Communication at Limited Colour Display Capability Visual Communication at Limited Colour Display Capability Yan Lu, Wen Gao and Feng Wu Abstract: A novel scheme for visual communication by means of mobile devices with limited colour display capability

More information

Bit Rate Control for Video Transmission Over Wireless Networks

Bit Rate Control for Video Transmission Over Wireless Networks Indian Journal of Science and Technology, Vol 9(S), DOI: 0.75/ijst/06/v9iS/05, December 06 ISSN (Print) : 097-686 ISSN (Online) : 097-5 Bit Rate Control for Video Transmission Over Wireless Networks K.

More information

White Paper : Achieving synthetic slow-motion in UHDTV. InSync Technology Ltd, UK

White Paper : Achieving synthetic slow-motion in UHDTV. InSync Technology Ltd, UK White Paper : Achieving synthetic slow-motion in UHDTV InSync Technology Ltd, UK ABSTRACT High speed cameras used for slow motion playback are ubiquitous in sports productions, but their high cost, and

More information

Dual Frame Video Encoding with Feedback

Dual Frame Video Encoding with Feedback Video Encoding with Feedback Athanasios Leontaris and Pamela C. Cosman Department of Electrical and Computer Engineering University of California, San Diego, La Jolla, CA 92093-0407 Email: pcosman,aleontar

More information

Transcription of the Singing Melody in Polyphonic Music

Transcription of the Singing Melody in Polyphonic Music Transcription of the Singing Melody in Polyphonic Music Matti Ryynänen and Anssi Klapuri Institute of Signal Processing, Tampere University Of Technology P.O.Box 553, FI-33101 Tampere, Finland {matti.ryynanen,

More information

MUSICAL INSTRUMENT RECOGNITION WITH WAVELET ENVELOPES

MUSICAL INSTRUMENT RECOGNITION WITH WAVELET ENVELOPES MUSICAL INSTRUMENT RECOGNITION WITH WAVELET ENVELOPES PACS: 43.60.Lq Hacihabiboglu, Huseyin 1,2 ; Canagarajah C. Nishan 2 1 Sonic Arts Research Centre (SARC) School of Computer Science Queen s University

More information

Principles of Video Segmentation Scenarios

Principles of Video Segmentation Scenarios Principles of Video Segmentation Scenarios M. R. KHAMMAR 1, YUNUSA ALI SAI D 1, M. H. MARHABAN 1, F. ZOLFAGHARI 2, 1 Electrical and Electronic Department, Faculty of Engineering University Putra Malaysia,

More information

Research Article. ISSN (Print) *Corresponding author Shireen Fathima

Research Article. ISSN (Print) *Corresponding author Shireen Fathima Scholars Journal of Engineering and Technology (SJET) Sch. J. Eng. Tech., 2014; 2(4C):613-620 Scholars Academic and Scientific Publisher (An International Publisher for Academic and Scientific Resources)

More information

Bridging the Gap Between CBR and VBR for H264 Standard

Bridging the Gap Between CBR and VBR for H264 Standard Bridging the Gap Between CBR and VBR for H264 Standard Othon Kamariotis Abstract This paper provides a flexible way of controlling Variable-Bit-Rate (VBR) of compressed digital video, applicable to the

More information

SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION

SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION th International Society for Music Information Retrieval Conference (ISMIR ) SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION Chao-Ling Hsu Jyh-Shing Roger Jang

More information

Multi-modal Kernel Method for Activity Detection of Sound Sources

Multi-modal Kernel Method for Activity Detection of Sound Sources 1 Multi-modal Kernel Method for Activity Detection of Sound Sources David Dov, Ronen Talmon, Member, IEEE and Israel Cohen, Fellow, IEEE Abstract We consider the problem of acoustic scene analysis of multiple

More information

WHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs

WHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs WHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs Abstract Large numbers of TV channels are available to TV consumers

More information

Department of Electrical & Electronic Engineering Imperial College of Science, Technology and Medicine. Project: Real-Time Speech Enhancement

Department of Electrical & Electronic Engineering Imperial College of Science, Technology and Medicine. Project: Real-Time Speech Enhancement Department of Electrical & Electronic Engineering Imperial College of Science, Technology and Medicine Project: Real-Time Speech Enhancement Introduction Telephones are increasingly being used in noisy

More information

Understanding PQR, DMOS, and PSNR Measurements

Understanding PQR, DMOS, and PSNR Measurements Understanding PQR, DMOS, and PSNR Measurements Introduction Compression systems and other video processing devices impact picture quality in various ways. Consumers quality expectations continue to rise

More information

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM A QUER B EAMPLE MUSIC RETRIEVAL ALGORITHM H. HARB AND L. CHEN Maths-Info department, Ecole Centrale de Lyon. 36, av. Guy de Collongue, 69134, Ecully, France, EUROPE E-mail: {hadi.harb, liming.chen}@ec-lyon.fr

More information

SCALABLE video coding (SVC) is currently being developed

SCALABLE video coding (SVC) is currently being developed IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 16, NO. 7, JULY 2006 889 Fast Mode Decision Algorithm for Inter-Frame Coding in Fully Scalable Video Coding He Li, Z. G. Li, Senior

More information

Musical Hit Detection

Musical Hit Detection Musical Hit Detection CS 229 Project Milestone Report Eleanor Crane Sarah Houts Kiran Murthy December 12, 2008 1 Problem Statement Musical visualizers are programs that process audio input in order to

More information