SHOT DETECTION METHOD FOR LOW BIT-RATE VIDEO CODING

J. Sastre*, G. Castelló, V. Naranjo
Communications Department, Polytechnic Univ. of Valencia, Valencia, Spain
email: Jorsasma@dcom.upv.es

J.M. López, A. Ferreras
Multimedia Tech. Div., Inform. Soc. Devel. Sector
Telefónica I+D, Telefónica Soluciones
Telefónica, Madrid, Spain

ABSTRACT
This paper presents a low-complexity shot detection method for real-time low bit-rate video coding. It is oriented toward compression efficiency rather than indexing or scene analysis. Based on the macroblock intra/inter decision information and two thresholds, one fixed and one adaptive, it provides robust scene change detection under camera motion, zoom, high-motion scenes and low frame rates, while setting the minimum number of necessary but costly refresh points (intra frames) in the stream. Its excellent results led the Spanish company Telefónica to implement it in several low bit-rate mobile video transmission applications. One of them, called Miravideo, is presented in this paper; it transmits audio and video via GPRS and UMTS using the new shot detection method.

KEY WORDS
Low bit-rate video coding, real-time, shot detection.

1 Introduction
There is an increasing need to extract key information automatically from video for indexing, fast retrieval and scene analysis [1, 2, 3]. To support these purposes, reliable scene change detection algorithms have been developed [4, 5, 6, 7]. In most of these algorithms the video (uncompressed or compressed) has been stored or encoded beforehand. This paper presents a real-time method that detects shot changes while the video is being encoded. Because an inter-coded shot change reduces the compression efficiency of the pictures that follow it, detecting shot changes is important for improving the quality of the compressed video.
On the other hand, in very low bit-rate video coding it is important to minimize the number of key frames, as they need many more bits than motion-compensated predicted frames. The objective of the algorithm is for the encoder to code the first picture of each shot as an intra frame, improving efficiency and quality while setting the minimum number of refresh points (intra-coded pictures) in the stream.

Figure 1. Frame numbers and shot changes of QCIF 1 and CIF 1 sequences.

2 Shot detection method
When a cut happens, quality and/or coding efficiency decrease because the video encoder tries to encode the first picture of the new shot from the last picture of the previous one, and the two are very weakly correlated. It is better to encode this first picture as an independent intra frame (I) and the following pictures of the same shot from previously coded pictures via motion compensation (inter P, bidirectional B, etc.). It is therefore desirable for the encoder to detect shot changes automatically while it is encoding a sequence, especially in low bit-rate video coding, where compression efficiency is the main objective, instead of coding pictures as intra frames periodically to refresh the stream and protect it against errors.

The proposed shot detection algorithm is based on the intra MB (macroblock) decision information and two thresholds, one fixed and the other adaptive. It detects shots from the number of intra MBs that appear in inter frames. Previous algorithms took that decision based on absolute thresholds and were oriented to already encoded video [8]. This paper presents a low-complexity, encoding-oriented technique.

Two generated sequences were used for initial testing. The first, named CIF 1 or QCIF 1 depending on the format, is a concatenation of fragments of the high- and low-motion sequences Flower Garden, Akiyo, Stephan, Hall, Snow and Container Ship (see fig. 1). The second, named CIF 2 or QCIF 2, consists of fragments of the high- and moderate-motion sequences Flower Garden, Fun Fair Left, Coastguard, Mobile & Calendar, Stephan and Snow, all of which have high or moderate camera motion and/or zoom in addition to high or moderate motion of the scene elements (see fig. 2). After initial testing, different kinds of sequences (commercials, films, news reports), with several hundred shots, were used to fix the algorithm parameters. All the sequences are in CIF and QCIF formats, as the method is aimed at low bit-rate video coding.

Figure 2. Frame numbers and shot changes of QCIF 2 and CIF 2 sequences.

Figure 3, panel (a): Q = 4, 15 fps.

Figures 3 and 4 present the number of intra MBs in inter frames vs. the frame number in an H.263+ [9] coding of the CIF 2 sequence, for different values of the quantization parameter Q and of the number of frames per second (fps). For Q = 4 and 15 fps, cuts are easily detected in high bit-rate video coding using a fixed threshold of 198, i.e. 50% of the total MBs in the picture (396 MBs in CIF format), as the number of intra MBs increases significantly when coding the first picture of a new shot. The number of intra MBs also increases with high motion, zooms and high camera motion, but this increase is usually progressive. Even so, in the last shot of the sequence the number of intra MBs rises almost to the threshold because of the high motion of the Snow sequence, which shows people skiing. For Q = 31 and 15 fps the number of intra MBs increases slightly, especially with high camera motion, zooms, etc., and the fixed threshold has to be raised to avoid false detections in the last shot. For low frame rates the number of intra MBs increases significantly, and in most cases it is impossible to find a fixed threshold for shot detection that gives a low number of false and missed detections.
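The fixed-threshold baseline discussed above can be sketched as follows. This is only an illustration: the function name `fixed_threshold_cuts` and the intra-MB counts are made up for the example and are not data from the paper. A threshold of 50% of the 396 MBs of a CIF picture gives the value 198 quoted in the text.

```python
CIF_TOTAL_MB = 396  # macroblocks per CIF picture (value given in the text)


def fixed_threshold_cuts(intra_mb_counts, threshold_pct=50.0, total_mb=CIF_TOTAL_MB):
    """Return indices of inter frames whose intra-MB count reaches a fixed threshold."""
    threshold = total_mb * threshold_pct / 100.0  # 198 MBs for CIF at 50%
    return [i for i, mb in enumerate(intra_mb_counts) if mb >= threshold]


# Hypothetical counts: an isolated peak (a cut) at index 3, followed by a
# progressive rise of the kind produced by high motion or camera motion.
counts = [5, 8, 6, 310, 7, 60, 110, 160, 205, 150]
print(fixed_threshold_cuts(counts))
```

With these made-up counts the rule flags index 3 (the cut) and index 8 (the top of the progressive rise). Raising the threshold removes the second detection here, but it would also start missing weaker cuts, which is the trade-off a single fixed threshold cannot escape.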
To illustrate this situation, figure 5 presents an example of the number of false and missed detections using only a fixed threshold T_2 when coding the QCIF 2 sequence. As can be seen, with only one fixed threshold it is not possible to avoid false and missed shot detections, i.e. there is no value of T_2 that gives zero false and missed detections. The situation is even worse when coding at low frame rates and with moderate or high camera motion (pan, tilt, zoom, etc.).

Figure 3. Number of intra MB vs. frame number in an H.263+ coding of the CIF 2 sequence. Panel (b): Q = 31, 15 fps.

So, instead of using a single fixed threshold, we propose to use two thresholds, T_1 and T_2. The new detection method consists of the following steps.

Update an average m_k of the number of intra MBs of the past coded pictures with the number of intra MBs of the last coded picture:

\[ m_k = m_{k-1} (1 - \alpha) + MB_k \, \alpha, \]

where k is the number of coded pictures since the last intra-coded picture, m_0 = 0, 0 < α < 1 and MB_k is the number of intra MBs in the last coded picture.

When coding the next picture (k + 1), after the intra/inter decision of each MB, if

\[ (MB_{k+1} \geq m_k + T_1 \ \text{and} \ k + 1 > n) \ \text{or} \ (MB_{k+1} \geq T_2), \]

then the current coding is aborted and the picture is coded as an intra frame.

Here n is the minimum number of non-intra frames between two consecutive intra frames detected by means of the first threshold T_1; it avoids the detection of false consecutive cuts. Note that the value m_k + T_1 is an adaptive threshold that follows the increase of intra MBs in the case of high motion, zoom, high camera motion, etc. So the algorithm produces a detection if an abrupt increase over the average number of intra MBs m_k takes place and the last n frames are not intra frames.

The other threshold, T_2, is fixed and takes large values. A picture will always be intra coded if the number of intra MBs found while inter coding it reaches T_2, regardless of the coding type (intra/inter) of the previous frames. T_2 is set high enough that intra coding is more efficient than inter coding for an inter picture with T_2 intra MBs. So most shot detections take place by means of the adaptive threshold m_k + T_1; the fixed one is only used to detect very fast and clear shot changes (very close in time).

Figure 4. Number of intra MB vs. frame number in an H.263+ coding of the CIF 2 sequence. Panels: (c) Q = 4, 1.875 fps; (d) Q = 31, 1.875 fps.

Figure 5. False and missed detections using only a fixed threshold T_2 in an H.263+ coding of the QCIF 2 sequence.

3 Results
The algorithm parameters T_1, T_2 and α were tuned by coding the test sequences named above with an H.263+ coder and counting the number of missed and false detections. The values of the thresholds T_1 and T_2 are given in % of the total MBs in the format (396 MBs for CIF, 99 MBs for QCIF). The H.263+ optional modes enabled were Advanced Intra Coding, Unrestricted Motion Vectors, Advanced Prediction, Syntax-based Arithmetic Coding and Deblocking Filter [9], and the motion estimation algorithm used was that described in [10]. The coding scheme was I P P P. The sequences CIF 1, CIF 2, QCIF 1 and QCIF 2 were coded at 15, 7.5, 5, 3.75 and 1.875 fps with fixed Q = 4, 5, ..., 31.

Figure 6 presents the number of intra MBs vs. the frame number in an H.263+ coding of the CIF 2 sequence for Q = 4 and 15 fps, together with the 17 detected cuts, for T_1 = 15%, n = 10 frames and α = 75%, with the effect of the fixed threshold removed by setting T_2 = 100%. For such low values of T_1 false detections occur; they are reduced by the condition imposed by the parameter n, which prevents contiguous detections through the adaptive threshold. We will see that the optimum values of T_1 lie between 42 and 50%. On the other hand, such high values of the parameter α (75%) can cause too much oscillation in the average m_k, while low values of α prevent m_k from following the variations in the number of intra MBs correctly. Both situations increase the number of false detections.

Figure 6. Shot detections (marked with *) over the number of intra MB vs. frame number in an H.263+ coding of the CIF 2 sequence, Q = 4, 15 fps, α = 75%, T_1 = 15%, n = 10, T_2 = 100%.

Figures 7, 8 and 9 present the number of shot detections when varying both thresholds T_1 and T_2 for the QCIF 2 sequence coded with [Q = 4, 15 fps, α = 25%], [Q = 15, 5 fps, α = 25%] and [Q = 31, 1.875 fps, α = 25%], respectively. In the bottom right corner missed detections occur (number of detections < 5; the sequence has 5 cuts), and at the left and the top false detections occur (> 5). The results show that the algorithm is not very sensitive to the thresholds at high quality and high frame rate, as a wide range of both thresholds gives correct results. This is because very well-defined peaks over the adaptive mean m_k of intra MBs appear at shot changes under those conditions. The zone of correct detection shrinks as the quality decreases (quantization step Q increasing), but it shrinks much more at low values of fps. It is important to note that in that case some theoretically false detections would

be right detections because of the low correlation between consecutive coded frames taken in pans, tilts, zooms, etc. at such a low number of frames per second. Figure 10 presents the results for [Q = 31, 1.875 fps, α = 75%]. It shows that the number of false detections increases and the zone of correct detection shrinks if the value of the memory parameter α of the average m_k is excessively large.

Figure 7. Number of shot detections varying T_1 (horizontal) and T_2 (vertical) of QCIF 2 sequence (5 cuts), with Q = 4, 15 fps, α = 25%, n = 10. Zone filled with 5 indicates correct detection of the 5 cuts.

Figure 8. Number of shot detections varying T_1 (horizontal) and T_2 (vertical) of QCIF 2 sequence (5 cuts), with Q = 15, 5 fps, α = 25%, n = 10.

Figure 9. Number of shot detections varying T_1 (horizontal) and T_2 (vertical) of QCIF 2 sequence (5 cuts), with Q = 31, 1.875 fps, α = 25%, n = 10.

Figure 10. Shot detections varying T_1 (horizontal) and T_2 (vertical) of QCIF 2 sequence (5 cuts), Q = 31, 1.875 fps, α = 75%, n = 10.

Figure 11 presents the number of shot detections in the QCIF 2 sequence when varying T_2, for n = 10, Q = 4, ..., 31 and fps = 15, 7.5, 5, 3.75, 1.875. For that case, i.e. α = 35%, T_1 = 50% and T_2 ≥ 65%, the shot detections are correct for all 140 combinations of Q and fps, including the low-fps ones.

It is important to note that the sequences used to build the complete QCIF 2 are very different from one another. In the real sequences tested, the shots on either side of a cut are sometimes similar: think, for example, of the shot changes in a dialog against a homogeneous background, or of dark scenes. In fact, testing the algorithm with such sequences leads to lower values of T_1 for proper detection.

The statistical performance of temporal segmentation used to obtain the recommended parameter values is based on the number of missed detections (MDs) and false alarms (FAs), expressed as recall and precision [5]:

\[ \mathrm{Recall} = \frac{\mathrm{Detects}}{\mathrm{Detects} + \mathrm{MDs}}, \qquad \mathrm{Prec.} = \frac{\mathrm{Detects}}{\mathrm{Detects} + \mathrm{FAs}}, \qquad (1) \]

and finally, with n = 4 fixed, the recommended values for the parameters α, T_1 and T_2 in terms of maximizing recall and precision are α = 32%, T_1 = 42% and T_2 = 75% for QCIF sequences, and α = 35%, T_1 = 50% and T_2 = 75% for CIF sequences. Note that giving

lower values to T_1 and T_2 will make the algorithm more sensitive to the detection of new shots.

Figure 11. Shot detection results varying T_2, fixed T_1 = 50% and n = 10, for QCIF 2 sequence coded with Q = 4, ..., 31 and fps = 15, 7.5, 5, 3.75, 1.875 (140 combinations of Q and fps). The zone filled with 140 and 100% below 5 indicates correct detection (the 5 cuts properly detected).

        Recall   Prec.   CM/Z/ST   t_I/t_P   t_T
MQ      99.13    99.13   0         123.00    0.15
MQ 2    97.92    97.92   3.03      145.22    0.59
LQ      91.79    97.81   2.56      104.20    0.59

Table 1. Shot change detection results (%): recall and precision of abrupt shot changes; % of undesired detections due to high camera motion (CM), zoom (Z) or special transitions (ST: wipes, dissolves, etc.); average processing time ratio between coding the cuts as intra frames (t_I, with detection) and as inter frames (t_P, without detection); and increase of the total processing time t_T when coding with the shot detection method relative to coding without it (scheme I P P P).

It is important to note that the method does not detect special transitions such as fades, dissolves, wipes, etc. Wipes are generated by translating a line across the frame in some direction, where the content on the two sides of the line belongs to the two shots separated by the edit. In a fade, the luminance gradually decreases to, or increases from, zero. In a dissolve, two shots are mixed, one increasing in intensity and the other decreasing. These special effects cannot be detected from the MB types alone, since they occur gradually over a series of pictures and the MBs tend to remain bidirectionally or inter predicted rather than intra coded over this span. But in low bit-rate video coding it is not desirable to detect gradual transitions, as the correlation between contiguous pictures in that case is not low and the encoder can exploit this correlation to improve coding efficiency. So the encoder should only detect them if a low frame rate makes them less gradual.
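A minimal sketch of the two-threshold detection rule from Section 2, using the recommended QCIF values quoted in the text (α = 32%, T_1 = 42%, T_2 = 75%, n = 4, 99 MBs per picture). Several things here are assumptions made for illustration: the function name `detect_shots`, the use of "greater than or equal" in both comparisons, restarting the average at m_0 = 0 after every intra-coded picture, and the sample intra-MB counts. A real encoder would take the counts from its own intra/inter mode decision.

```python
def detect_shots(intra_mb_counts, total_mb=99, alpha=0.32,
                 t1_pct=42.0, t2_pct=75.0, n=4):
    """Return indices of pictures that would be recoded as intra frames.

    intra_mb_counts[i] is the number of intra MBs chosen by the intra/inter
    decision while picture i is being coded as an inter frame.
    """
    t1 = total_mb * t1_pct / 100.0  # adaptive offset T_1 in MBs
    t2 = total_mb * t2_pct / 100.0  # fixed threshold T_2 in MBs
    m = 0.0  # running average m_k, with m_0 = 0
    k = 0    # coded pictures since the last intra-coded picture
    cuts = []
    for i, mb in enumerate(intra_mb_counts):
        adaptive_hit = mb >= m + t1 and (k + 1) > n
        fixed_hit = mb >= t2
        if adaptive_hit or fixed_hit:
            cuts.append(i)   # abort inter coding, recode picture as intra
            m, k = 0.0, 0    # assumption: restart the average after an intra
        else:
            k += 1
            m = m * (1.0 - alpha) + mb * alpha  # m_k = m_{k-1}(1-a) + MB_k a
    return cuts


# Hypothetical QCIF counts: a cut at index 5, then a progressive rise
# (indices 8 to 12) and a second cut at index 13.
counts = [2, 3, 2, 4, 3, 60, 3, 2, 10, 20, 30, 40, 45, 90, 5]
print(detect_shots(counts))
```

In this made-up trace the progressive rise is absorbed by the average and is not flagged, while the two abrupt peaks are. A peak that occurs within n pictures of the previous intra frame is only flagged if it also reaches the fixed threshold T_2.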
Table 1 gives the recall and precision results obtained with the recommended values of the algorithm parameters given above, considering only the detection of abrupt shot changes (cuts). The tests were made with three kinds of sequences and coding conditions:

MQ: medium quality coding. Sequences from films in CIF format, 25 fps, quantization parameter Q = 4. Mainly abrupt changes (115 shots). Alternation of high, moderate and low motion scenes; moderate camera motion and zooms; 571.81 kbps average.

MQ 2: medium quality coding. News reports and commercials in CIF format, 25 fps, Q = 4. All kinds of changes: cuts, fades, dissolves, etc. Very fast and short shots (52 shots). High motion, high camera motion and zooms, hence a higher average bit rate: 1642.76 kbps (constant quantization parameter Q).

LQ: low quality coding. Sequences in QCIF format, 5 fps, Q = 15. Mainly abrupt changes (200 shots). Moderate and low motion, and moderate camera motion and zooms; 13.401 kbps average.

Table 1 also gives the % of undesired detections due to high camera motion (CM), zoom (Z) or special transitions (ST: wipes, dissolves, etc.). Finally, it presents the average processing time ratio (%) between coding the first picture of a cut as an intra frame, including the time to detect it (t_I), and coding it as an inter frame (t_P, without detection), and the increase in % of the total processing time t_T when coding with the shot detection method relative to coding without it (scheme I P P P). Table 2 presents the improvement of the average PSNR (dB) over the next 10 pictures after the shot change for the CIF 1 sequence coded with Q = 4 at 15 fps, showing improvements between 0.17 and 0.8 dB.

Recall and precision are above 91% in all tests and close to 100% in most cases, for both the medium and low quality coding applications. The number of undesired detections due to high camera motion (pan, tilt, zoom, etc.) and gradual transitions (wipe, dissolve, etc.) is at most 3.03% in all tests.
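The recall and precision measures of equation (1) can be computed as in the following sketch. The function names and the counts are hypothetical and are not taken from the paper.

```python
def recall(detects, missed):
    """Recall = Detects / (Detects + MDs)."""
    return detects / (detects + missed)


def precision(detects, false_alarms):
    """Precision = Detects / (Detects + FAs)."""
    return detects / (detects + false_alarms)


# Hypothetical counts: 95 correctly detected cuts, 5 missed, 2 false alarms.
detects, missed, false_alarms = 95, 5, 2
print(f"recall    = {100 * recall(detects, missed):.2f}%")
print(f"precision = {100 * precision(detects, false_alarms):.2f}%")
```

Recall penalizes missed cuts and precision penalizes false alarms, so the two have to be read together: lowering T_1 and T_2 tends to raise recall at the expense of precision.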
The processing load of the algorithm is very low (≤ 0.59%), and the time needed to code part of the cut picture as an inter frame, abort it and recode it as an intra picture is almost the same as that of inter coding it in very low bit-rate video coding (104.20%, LQ sequences). The compression gain for the MQ, MQ 2 and LQ sequences was 0.74, 0.08 and 3.41%, respectively, i.e. greatest for the low bit-rate case, LQ.

Shot sequence   PSNR lum   PSNR Cr   PSNR Cb
Akiyo           0.1775     0.2305    0.2575
Stephan         0.5425     0.6779    0.6301
Hall            0.4116     0.8051    0.5444
Snow            0.3677     0.3908    0.4499
Container       0.2922     0.2842    0.3342

Table 2. Improvement of the average PSNR (dB) of the next 10 pictures after the cut, CIF 1 seq., Q = 4, 15 fps.

If B frames are used (which is not the case in low-delay, low-complexity applications), a cut can be located using the number of forward, backward or bidirectionally predicted MBs in addition to the number of intra MBs MB_k, see [8, p. 8-10]. If the first picture of a new shot is coded as a B frame, there will be a significant increase in the backward prediction of its MBs. If this first picture is an inter frame, there will be a significant increase in the forward prediction of the MBs of the previous B frame. In both cases there will be a significant decrease in the number of bidirectionally predicted MBs in the B frame, and new thresholds have to be fixed for these new variables, which is beyond the scope of this paper.

Figure 12. Miravideo Application.

Figure 12 presents the receiver of the Miravideo application developed by Telefónica I+D, a company of the Telefónica Group. This application transmits synchronized audio and video over GPRS/UMTS channels. The receiver and transmitter of the application run on a PC or notebook with a GPRS/UMTS modem and use the proposed detection method. The encoder can code real-time video from a camera or other sources, inserting the minimum number of intra frames.

4 Conclusions
An algorithm based on fixed and adaptive thresholds that automatically detects shot changes in low bit-rate video coding has been presented. Results on its excellent performance in high and low bit-rate video coding, its processing time and the quality improvement have been given. The advantages of this algorithm are better quality of the coded sequence, low processing time, lower bit rate and more stable performance, because the pictures coded as intra frames are those that are not correlated with the previous ones, which is better than coding pictures as intra frames periodically to refresh the stream. Moreover, it can be used with any codec that takes an MB intra/inter decision, such as the H.26X and MPEGX codecs.

Acknowledgements
This work has been partially supported by the Generalitat Valenciana grant GV04B-478 and the Polytechnic University of Valencia interdisciplinary project 5607-2004.
References
[1] Philippe Aigrain, HongJiang Zhang, and Dragutin Petkovic, Content-based representation and retrieval of visual media: A state-of-the-art review, Multimedia Tools and Applications, 3(3), 1996, 179-202.
[2] G. Ahanger and T.D.C. Little, A survey of technologies for parsing and indexing digital video, Journal of Visual Communication and Image Representation, 7(1), 1996, 28-43.
[3] R. Brunelli, O. Mich, and C.M. Modena, A survey on the automatic indexing of video data, Journal of Visual Communication and Image Representation, 10, 1999, 78-112.
[4] W. A. C. Fernando, C. N. Canagarajah, and D. R. Bull, A unified approach to scene change detection in uncompressed and compressed video, IEEE Transactions on Consumer Electronics, 46, 2000.
[5] J. Calic and E. Izquierdo, Towards real-time shot detection in the MPEG compressed domain, in WIAMIS 2001 - Workshop on Image Analysis for Multimedia Interactive Services, Tampere, Finland, 2001.
[6] B. Yeo and B. Liu, Rapid scene analysis on compressed video, IEEE Transactions on Circuits and Systems for Video Technology, 5(6), 1995, 533-544.
[7] Antonio Albiol, Valery Naranjo, and Jesús Angulo, Low complexity cut detection in the presence of flicker, in Proc. of International Conference on Image Processing 2000, IEEE, 2000.
[8] V. Kobla, D. Doermann, and A. Rosenfeld, Compressed domain video segmentation, CfAR Technical Report CAR-TR-839 (CS-TR-3688), 1996, citeseer.nj.nec.com/vikrant96compressed.html.
[9] Draft 21 Recommendation H.263+, Video coding for low bit rate communication, ITU-T, 1998.
[10] J. Sastre, A. Ferreras, and J.F. Hernández-Gil, Motion vector size-compensation based method for very low bit rate video coding, IEEE Transactions on Circuits and Systems for Video Technology, 10(7), 2000, 1192-1197.