
VIDEO ANALYSIS IN MPEG COMPRESSED DOMAIN

The papers collected here form the basis of a supplication for the degree of Doctor of Philosophy at the Department of Computer Science and Software Engineering of the University of Western Australia.

By

Lifang Gu

September 2002

© Copyright 2002 by Lifang Gu

Abstract

The amount of digital video has been increasing dramatically due to technology advances in video capturing, storage, and compression. The usefulness of vast repositories of digital information is limited by the effectiveness of the access methods, as shown by the Web explosion. The key issues in addressing the access methods are those of content description and of information space navigation. While textual documents in digital form are somewhat self-describing (i.e., they provide explicit indices, such as words and sentences, that can be directly used to categorise and access them), digital video does not provide such an explicit content description. In order to access video material in an effective way, without looking at the material in its entirety, it is therefore necessary to analyse and annotate video sequences, and provide an explicit content description targeted to the user needs.

Digital video is a very rich medium, and the characteristics in which users may be interested are quite diverse, ranging from the structure of the video to the identity of the people who appear in it, their movements and dialogues, and the accompanying music and audio effects. Indexing digital video, based on its content, can be carried out at several levels of abstraction, beginning with indices like the video program name and name of subject, down to much lower-level aspects of video like the location of edits and motion properties of video.

Manual video indexing requires the sequential examination of the entire video clip. This is a time-consuming, subjective, and expensive process. As a result, there is an urgent need for tools to automate the indexing process. In response to such needs, various video analysis techniques from the research fields of image processing and computer vision have been proposed to parse, index and annotate the massive amount of digital video data.

However, most of these video analysis techniques have been developed for uncompressed video. Since most video data are stored in compressed formats for efficiency of storage and transmission, it is necessary to decompress compressed video before such analysis techniques can be applied. Two consequences of having to decompress before processing are the computation time incurred for decompression and the extra auxiliary storage required.

To save on the computational cost of decompression and to lower the overall size of the data which must be processed, this study attempts to make use of features available in compressed video data and proposes several video processing techniques operating directly on compressed video data. Specifically, techniques for processing MPEG-1 and MPEG-2 compressed data have been developed to help automate the video indexing process. This includes the tasks of video segmentation (shot boundary detection), camera motion characterisation, and highlights extraction (detection of skin-colour regions, text regions, moving objects and replays) in MPEG compressed video sequences. The approach of performing analysis on the compressed data has the advantage of dealing with a much reduced data size and is therefore suitable for computationally intensive low-level operations. Experimental results show that most analysis tasks for video indexing can be carried out efficiently in the compressed domain. Once intermediate results, which are dramatically reduced in size, are obtained from the compressed-domain analysis, partial decompression can be applied to enable high-resolution processing to extract high-level semantic information.

Acknowledgements

First of all, I would like to thank my thesis supervisor, Professor Robyn Owens, for providing me with the opportunity to write this thesis. I thank her for her support, inspiration, and encouragement during the course of this study. I have benefited tremendously from her insight and vision. I am very grateful to David Keightley, who provided me with the opportunity to enter the exciting research area of MPEG video processing, and to Dr. Graham Reynolds for allowing me the freedom and resources to pursue this area of research within the Digital Media Information Systems (DMIS) group in the CSIRO Division of Mathematical and Information Sciences (CMIS). My thanks also go to Dr. Ken Tsui and Dr. Don Bone, who have been especially helpful with technical discussions and suggestions. I would also like to thank the members of the DMIS group for the rewarding and friendly research environment. Finally, I thank my family for their love and support. Especially, I would like to express my eternal gratitude to my mother. Without her support and help, this thesis might never have been written.

Contents

Abstract
Acknowledgements
List of Papers Included
Contribution of Candidate to Submitted Work

1 Introduction
  1.1 Digital Video and Research Challenges
  1.2 Scope and Themes of the Study
  1.3 Organisation of the Thesis

2 MPEG
  2.1 Introduction
  2.2 MPEG Stream Structure
  2.3 MPEG Compression Algorithm
    2.3.1 Motion Compensation
    2.3.2 DCT Transform
  2.4 Information Available From an MPEG Stream
  2.5 Preprocessing and Minimal Decoding
    2.5.1 Reconstruction of DC Images
    2.5.2 Preprocessing of Motion Vectors in B-Pictures
  2.6 Conclusions

3 Overview of Included Papers
  3.1 Shot Boundary Detection
    3.1.1 Cut Detection
    3.1.2 Dissolve Detection
    3.1.3 Discussion
  3.2 Motion Analysis
    3.2.1 Panning and Zooming Detection
    3.2.2 Global Motion Estimation and Moving Object Detection
    3.2.3 Discussion
  3.3 Skin Colour Region Detection
    3.3.1 Skin Colour Model
    3.3.2 Skin Colour Region Detection Based on Region Growing
    3.3.3 Discussion
  3.4 Text Region Detection and Extraction
    3.4.1 Text Region Detection
    3.4.2 Text Extraction
    3.4.3 Discussion
  3.5 Replay Detection in Sports Video
    3.5.1 Exact Replay Detection
    3.5.2 Slow Motion Replay Detection
    3.5.3 Discussion
  3.6 Conclusions

Included Papers

4 Summary and Conclusions

Bibliography

Papers Included in this Thesis

Paper 1 (Refereed)
Gu, L., Tsui, K., Keightley, D., 1997a. Dissolve detection in MPEG compressed video. In Proc. of IEEE International Conference on Intelligent Processing Systems (ICIPS), Beijing, China, October 1997, pp. 1692–1696.

Paper 2 (Refereed)
Srinivasan, U., Gu, L., Tsui, K., and Simpson-Young, W.G., 1997b. A data model to support content-based search in digital video libraries. The Australian Computer Journal, November 1997, Vol. 29, No. 4, pp. 141–147.

Paper 3 (Refereed)
Gu, L., 1998. Scene analysis of video sequences in the MPEG domain. In Proc. of International Conference on Signal and Image Processing (SIP), Las Vegas, USA, October 1998, pp. 486–490.

Paper 4 (Refereed)
Gu, L. and Bone, D., 1999a. Skin colour region detection in MPEG video sequences. In Proc. of International Conference on Image Analysis and Processing (ICIAP), Venice, Italy, September 1999.

Paper 5 (Refereed)
Gu, L., Bone, D. and Reynolds, G., 1999b. Replay detection in sports video sequences. In Proc. of the Eurographics Workshop on Multimedia, Multimedia 99, Eds. Correia, N., Chambel, T. and Davenport, D., Springer Verlag, Milan, Italy, September 1999, pp. 3–12.

Paper 6 (Refereed)
Gu, L., 2001. Text detection and extraction in MPEG video sequences. In Proc. of the IEEE Workshop on Content-Based Multimedia Indexing (CBMI), Brescia, Italy, September 2001, pp. 233–240.

Contribution of Candidate to Submitted Work

Paper 1: 90% contribution. Developed and implemented algorithms, and wrote the paper. Ken Tsui reviewed the paper and David Keightley was the project manager.

Paper 2: 30% contribution. Wrote the section on video analysis.

Paper 3: Sole author.

Paper 4: 90% contribution. Developed and implemented algorithms, and wrote the paper. Don Bone offered some technical support through discussion and by reviewing the paper. He also presented the paper at the conference while the candidate was on maternity leave.

Paper 5: 90% contribution. Developed and implemented algorithms, and wrote the paper. Don Bone reviewed the paper. Graham Reynolds was the project leader and presented the paper at the workshop while the candidate was on maternity leave.

Paper 6: Sole author.


Chapter 1

Introduction

The six papers comprising this thesis for the degree of Doctor of Philosophy are the principal outcomes of a still-active research program commenced in 1996. This work is relevant to the fields of Digital Video Processing, Digital Video Compression, and Content-Based Search and Indexing. The nature of conference papers and journal articles causes inevitable differences between a collection of such works and a conventional doctoral dissertation. This introduction and the following two chapters attempt to describe how the papers relate to each other to form a cohesive study within the above-mentioned fields.

Section 1.1 gives some background on research in the field of digital video processing and points out some of the research challenges. The scope and themes of this study are outlined in Section 1.2. The organisation of the rest of the thesis is then outlined in Section 1.3.

1.1 Digital Video and Research Challenges

The availability of digital images and video, the integration of information from heterogeneous and distributed sources, the accessibility of fast communication networks, and very powerful computers have led to a flood of emerging technologies for multimedia systems, digital libraries, interactive television, telemedicine, virtual classrooms, and the like.

With the advances in capturing and scanning technologies, more and more archived analog video material is being converted into digital formats. In addition, new digital acquisition units capture content into bit streams (compressed or uncompressed) directly at creation time. Digital video is therefore becoming an increasingly common data type in the new generation of multimedia databases [AB95, BMS95, DM95, Fox91, Jai95]. Many broadcasters are switching to digital formats for broadcasting, and some of them already have a significant amount of video material available in digital format for previewing. Improved compression technologies and increased Internet bandwidth have made webcasting a reality. The production of multimedia material distributed on CD-ROMs has been increasing dramatically during the last few years, and the introduction of the new DVD technology is offering consumers unprecedented new experiences of interacting with video data.

The ever-growing amount of digital video poses new challenges of storage, transmission and access as vast repositories are being built at an increasing pace. Research has made significant progress in data compression technology to reduce the data size of images and videos for efficient transmission and storage. International standards have been established for lossy compression of still images through JPEG [Wal91] and lossy compression of moving pictures (video) through MPEG [LeG91]. In addition, faster communication networks have been emerging and have become available for transmitting digital video data.

As shown by the Web explosion, the usefulness of vast repositories of digital information is limited by the effectiveness of the access methods. The key issues are those of content description and of information space navigation. While textual documents in digital form are somewhat self-describing (i.e., they provide explicit indices, such as words and sentences, that can be directly used to categorise and access them), digital video does not provide such an explicit content description. In order to access video material in an effective way, without looking at the material in its entirety, it is therefore necessary to annotate video sequences and provide an explicit content description targeted to the user needs.

Digital video is a very rich medium, and the characteristics in which users may be interested are quite diverse, ranging from the structure of the video, i.e. its decomposition into shots and scenes, to the most representative frames of sequences, the identity of the people who appear in it, their movements and dialogues, and the accompanying music and audio effects. Indexing digital video, based on its content, can be carried out at several levels of abstraction, beginning with indices like the video program name and name of subject, down to much lower-level aspects of video like the location of edits and motion properties of video [ZSW+95, HJW94a]. The cost of manually generating such indices is inversely related to the level of abstraction of the index. For example, the indexing effort for video library applications which model video by title is much less than the indexing effort for a multimedia authoring application which organises video based on the content and style of the shots used to compose the video.

Manual video indexing requires the sequential examination of the entire video clip in order to annotate it. This is a time-consuming, subjective, and expensive process; an hour-long footage can take up to 10 hours (depending on its content) to be completely described and archived. The automation of the indexing process thus becomes essential as the granularity of video access becomes finer. A growing body of research studies the problem of how video indexing could benefit from the use of automated procedures [FL95, ABL95, AL96] to make the process faster and cheaper. A variety of algorithms and systems have been developed to help automate some of the indexing tasks [HJW94a, ZSW+95, FSN+95, SK97].

It is expected that in future a significant portion of all digital video will be in various compressed formats. However, most of the current video analysis techniques, such as those for video indexing, have been developed for uncompressed video. It would be necessary to perform decompression on compressed video before such processing or analysis techniques could be applied. Two consequences of having to first decompress before processing are incurring computation time for decompression and requiring extra auxiliary storage. This study thus investigates video analysis techniques which operate directly on compressed video data.

1.2 Scope and Themes of the Study

As more and more video data has been and will continue to be stored and distributed in compressed formats, it would be advantageous to develop processing algorithms that operate directly on compressed representations, saving the computational cost of decompression and lowering the overall size of the data which must be processed. This eliminates the time-consuming decompression and often leads to computationally more efficient algorithms [CM95]. Consequently, this study explores the possibility of making use of features available in compressed video data and proposes several video processing techniques operating directly on compressed video data. Specifically, techniques for processing MPEG-1 and MPEG-2 compressed data are developed to help automate the video indexing process. This is because MPEG-1 and MPEG-2 are international standards for video compression and have been the most widely used video formats in many applications. The study links features available from MPEG compressed data to features useful for video indexing.

Figure 1.1 shows the structure of a typical computer-assisted video indexing system. There are four main modules in this system: video analysis, audio analysis, image indexing, and video access. In this diagram, it is assumed that all modules operate directly on compressed data. Without this assumption, another decompression module at the very beginning would be required in order to apply conventional uncompressed analysis techniques. Following the approach proposed in this thesis, incoming compressed video can be directly fed into the video and audio analysis modules for processing. Analysis results are then used for video annotation and for building high-level (abstract) video structures. These abstracts can be used for video browsing and retrieval. This study focuses on the video analysis module only. The components covered by this study are shaded in the diagram.

In particular, the problem of temporal video segmentation (shot boundary detection) is considered first, since it is often the first task in video indexing. This segmentation divides a video into manageable units (shots), which are used as temporal limits for the annotations by professional archivists, and as basic units for nonlinear access to the video stream.

[Figure 1.1: The structure of a typical computer-assisted video indexing system in the compressed domain. The shaded blocks are covered in this study.]

Once a video stream is decomposed into shots, features within each shot can be extracted for indexing. The majority of features used in the past have been low-level image descriptions extracted from the key frames, such as colour and texture [FSN+95]. These low-level features do not match the semantic representation used by most people to categorise scene content. In this study, attempts are made to extract semi-semantic features, such as moving objects, skin colours and captions.

Although general object recognition is still beyond the reach of current techniques in the field of computer vision, object detection and even recognition within limited domains are possible. Two important examples are faces and captions, which often have the role of highlights. As a result, techniques for detecting skin colour regions and text regions directly in MPEG video sequences are developed in this study. As another example of highlights extraction, replay detection in sports video sequences is also addressed to demonstrate how this can be effectively achieved by using features in MPEG compressed video data.

Camera operation information is very important for the analysis and classification of video shots, since it often explicitly reflects the communication intentions of the film (video) director. Motion analysis techniques based on motion vectors in MPEG streams are developed to detect and classify camera operations (e.g. panning), and to detect and track moving objects. Both the types of camera operations and the trajectories of moving objects can be very useful indices for subsequent search and retrieval.

In summary, the aim of this thesis study is to develop a number of techniques for extracting the following features directly from MPEG compressed video data:

- Shot boundaries
- Camera motion
- Object motion
- Skin colour
- Text captions
- Action replays

All techniques proposed in this thesis provide essential tools for automatic and efficient video indexing and content-based search in very large video databases. The study demonstrates that these tasks can be performed effectively directly in the MPEG domain.

1.3 Organisation of the Thesis

The rest of the thesis is organised as follows. Chapter 2 gives some background knowledge on the MPEG standards and briefly describes the basic MPEG syntax and structures. It then describes the method for reconstructing DC images, which are images reduced 64 times in size, from an MPEG video sequence, and proposes some pre-processing of the motion vectors used for motion analysis. In essence, Chapter 2 provides background on MPEG compression schemes, presents the argument for processing MPEG video data directly, and describes what minimal processing is needed to provide the input data required by the algorithms proposed in this study.

Chapter 3 gives an overview of the problems addressed by each of the included papers. Specifically, it discusses the problems of shot boundary detection, camera/object motion detection, skin colour region detection, text region detection, and replay detection directly in MPEG compressed video. This chapter also summarises the approach taken in this study and adds comments and discussions which could not be included in the original papers due to page limits.

Chapter 4 concludes the thesis with a summary of the contributions this research has made and some suggestions for future research work.

Chapter 2

MPEG

2.1 Introduction

MPEG (Moving Picture Experts Group) is a working group of ISO/IEC (the International Standards Organization/the International Electrotechnical Commission) in charge of the development of standards for coded representation of digital audio and video. Established in 1988, the group has produced MPEG-1, the standard on which such products as Video CD and MP3 are based; MPEG-2, the standard on which such products as digital television set-top boxes and DVD are based; and MPEG-4, the standard for interactive multimedia for the fixed and mobile web. More recently, MPEG-7, the standard for description and search of audio and visual content, was finalised in July 2001. Work on the new standard, the MPEG-21 Multimedia Framework, started in June 2000, and its different parts are at different stages of development. According to the MPEG-21 time schedule, the standard will be finalised as a Final Draft International Standard by July 2003.

MPEG-1 and MPEG-2 have been widely used both for storage and transmission purposes up to now [LeG91, ISO93, ISO96]. MPEG-1 is targeted at applications with digital storage media, such as Video CD, at up to about 1.5 Mbit/s, while MPEG-2 is designed for applications requiring higher resolution. However, they both use a suite of similar techniques to reduce both spatial and temporal redundancy in a video sequence to achieve high compression ratios.

Most existing video processing operations require full frame decompression in order to operate on compressed sequences. The use of information directly in compressed format, without decompression or with only minimal decoding, saves both the time to perform full frame decompression and the additional storage for holding the decompressed data. It thus offers the possibility of computationally more efficient algorithms. While MPEG-1 and MPEG-2 can significantly reduce the number of bits needed to represent a video sequence without appreciable degradation of image quality, the compressed format does not lend itself to easy video processing.

In this chapter, the structure of an MPEG-1 stream is first described in Section 2.2. This is followed by a description of the MPEG-1 video compression algorithms in Section 2.3. These descriptions are mostly valid for MPEG-2 as well. Information directly available in an MPEG compressed stream is then listed in Section 2.4. Pre-processing and minimal decoding are discussed in Section 2.5. In particular, DC image reconstruction is given in Section 2.5.1. Pre-processing of motion vectors, which are used for motion analysis, is finally described in Section 2.5.2.

2.2 MPEG Stream Structure

The difficult challenge in the design of the MPEG compression algorithm is the following: on the one hand, the quality requirements demand a very high compression ratio, not achievable with intra-frame coding alone; on the other hand, the random access requirement is best satisfied with pure intra-frame coding. Inter-frame coding can achieve high compression but does not permit easy random access. This requires a delicate balance between intra- and inter-frame coding, and between recursive and non-recursive temporal redundancy reduction. This balance is achieved by forming a group of pictures, which usually includes one intra-coded frame and several inter-coded frames. The intra-coded frame serves as a random access point while the inter-coded frames facilitate a high compression ratio.

Figure 2.1 shows the hierarchical structure of an MPEG video stream, which is represented by the following six layers:

[Figure 2.1: The hierarchical structure of an MPEG video stream.]

Video Sequence: A sequence is the top level of MPEG video coding. It is composed of several groups of pictures (GoPs) and is used as a random access unit. For video analysis, information such as frame rate, picture width and height, aspect ratio, and video bit rate can be obtained from the video sequence layer.

Group of Pictures (GoP): A GoP consists of a series of pictures and is used as a random access unit for video coding. Information such as the number of pictures in the GoP is available from the GoP layer.

Picture: A picture is the primary coding unit and consists of several slices. Information such as picture type (I, P and B) and motion vector resolution is available for video analysis from this layer.

Slice: A slice is used as a re-synchronisation unit and consists of several macroblocks.

Macroblock: A macroblock contains a 16×16 pixel region of the luminance component and the spatially corresponding 8×8 pixel region of each chrominance component, since the chrominance components are sampled at half the luminance resolution. It thus has four 8×8 luminance blocks and two 8×8 chrominance blocks. For video analysis, the macroblock layer provides the type of coding (intra versus non-intra) and the motion vector.

Block: A block is 8×8 pixels in size and is the unit of the subsequent Discrete Cosine Transform (DCT). It provides 64 DCT coefficients (either of the original pixel values or of the residues after motion compensation).

MPEG uses a component colour representation for each colour pixel, namely one luminance component (Y) and two chrominance components (Cb and Cr). The conversion from YCbCr to conventional RGB space can be carried out by a linear mapping (a 3×3 matrix). Since the human visual system (HVS) is most sensitive to the resolution of an image's luminance component, the Y values are encoded at full resolution. The HVS is less sensitive to the chrominance information. As a result, the two chrominance components are encoded at half the resolution of their luminance counterpart. This considerably reduces the amount of information to be compressed.
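
Since the analysis techniques in this thesis operate in the YCbCr space, the RGB conversion is only needed when a frame must be rendered. As a minimal sketch of the 3×3 linear mapping mentioned above (in Python; the full-range BT.601 coefficients and the function name are illustrative assumptions, since the text does not specify a particular matrix):

```python
import numpy as np

# One common YCbCr -> RGB mapping (ITU-R BT.601 full-range coefficients).
# The thesis only states that the conversion is a 3x3 linear mapping; the
# particular coefficients below are an assumption, not taken from the text.
YCBCR_TO_RGB = np.array([
    [1.0,  0.0,      1.402],
    [1.0, -0.34414, -0.71414],
    [1.0,  1.772,    0.0],
])

def ycbcr_to_rgb(ycbcr: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) YCbCr image with 8-bit components to RGB."""
    y = ycbcr[..., 0].astype(np.float64)
    # Cb and Cr are stored with a 128 offset so they fit in unsigned bytes.
    cb = ycbcr[..., 1].astype(np.float64) - 128.0
    cr = ycbcr[..., 2].astype(np.float64) - 128.0
    rgb = np.stack([y, cb, cr], axis=-1) @ YCBCR_TO_RGB.T
    return np.clip(rgb, 0, 255).astype(np.uint8)
```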

2.3 MPEG Compression Algorithm

The MPEG compression algorithm relies on two basic techniques: block-based motion compensation for the reduction of temporal redundancy, and transform domain-based compression for the reduction of spatial redundancy. MPEG uses two inter-frame coding techniques: predictive and interpolative. This results in three basic picture types in an MPEG stream: I-, P- and B-pictures. An I-picture is completely intra-coded. It provides access points for random access, but only with moderate compression. A P-picture is predictively coded with reference to a past picture, which can be either an I- or a P-picture. A P-picture will in general be used as a reference for future prediction. A B-picture is bi-directionally coded. It is similar to a P-picture, but requires both a past and a future reference picture for prediction. B-pictures provide the highest amount of compression. The relation between the three picture types is illustrated in Figure 2.2.

[Figure 2.2: The relationship between the three picture types.]

2.3.1 Motion Compensation

Motion-compensated prediction assumes that, locally, the current picture can be modelled as a translation of the picture at some previous time. "Locally" means that the magnitude and the direction of the displacement need not be the same everywhere in the picture. The local unit used in MPEG-1 is the 16×16 pixel macroblock. This is the result of a trade-off between the coding gain provided by the motion information and the cost associated with coding the motion information. Each macroblock in a P-picture is matched to the most similar group of 16×16 pixels in its past reference picture. This process is called motion estimation. Motion estimation obtains the motion vector, which is the displacement between a macroblock and its predictor candidate, by minimising a cost function measuring the mismatch between the two macroblocks.
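
As a concrete instance of this cost-minimising search (a sketch only, not an encoder implementation; the formal statement appears as Equation 2.9 in Section 2.5.2), the following exhaustive-search block matcher uses the sum of absolute differences (SAD) as the mismatch cost, one common choice among several:

```python
import numpy as np

def motion_vector(cur: np.ndarray, ref: np.ndarray, top: int, left: int,
                  search: int = 7, block: int = 16) -> tuple[int, int]:
    """Exhaustive-search block matching for one macroblock.

    Returns the displacement (dy, dx) minimising the sum of absolute
    differences (SAD) between the current macroblock and a candidate
    region in the reference frame.  The MPEG standard leaves both the
    cost function and the search strategy to the encoder.
    """
    mb = cur[top:top + block, left:left + block].astype(np.int32)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue  # candidate falls outside the reference picture
            cand = ref[y:y + block, x:x + block].astype(np.int32)
            sad = np.abs(mb - cand).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv
```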

If no match is found within a specified search range, a macroblock will be intra-coded. A macroblock in a P-picture can also be skipped, meaning it is exactly the same as the macroblock at the same location in the reference picture. A skipped macroblock is therefore not coded at all. Each macroblock in a B-picture can be forward-predicted, backward-predicted, or bi-directionally predicted. Its motion information accordingly consists of one forward motion vector, one backward motion vector, or both a forward and a backward motion vector. Once the motion vector for each macroblock is estimated, the prediction error or residue, the difference between a macroblock and its matched candidate, is calculated. The residue is then intra-coded by the DCT transform method described in Section 2.3.2.

2.3.2 DCT Transform

Both still-image and difference-image (residue) signals have a very high spatial redundancy [Wal91]. Because of the block-based nature of the motion-compensation process, and because of its relatively straightforward implementation, the two-dimensional Discrete Cosine Transform (DCT) is chosen as the basis of compression of each I-picture and of the residue images from P- and B-pictures. As explained above, each macroblock is divided into four 8×8 luminance blocks and two 8×8 chrominance blocks (because of sub-sampling of the chrominance components). Each such 8×8 block is fed to the Forward DCT. The Forward and Inverse DCT are defined as follows:

$$c(i,j) = \frac{1}{4}\,k(i)\,k(j) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) \cos\frac{(2x+1)i\pi}{16} \cos\frac{(2y+1)j\pi}{16} \quad (2.1)$$

$$f(i,j) = \frac{1}{4} \sum_{x=0}^{7} \sum_{y=0}^{7} k(x)\,k(y)\,c(x,y) \cos\frac{(2i+1)x\pi}{16} \cos\frac{(2j+1)y\pi}{16} \quad (2.2)$$

where c(i,j) is the DCT coefficient, f(i,j) is the original pixel value, i, j = 0, 1, ..., 7, and

$$k(i) = \begin{cases} \frac{1}{\sqrt{2}} & i = 0, \\ 1 & \text{otherwise.} \end{cases} \quad (2.3)$$

In terms of matrix notation, we can write

$$C = T F T^t \quad (2.4)$$

$$F = T^t C T \quad (2.5)$$

where the 8×8 matrices C = [c(i,j)] and F = [f(i,j)] hold the 64 DCT coefficients and the original pixel values respectively, and the matrix T^t denotes the transpose of the matrix T, which is the DCT matrix with entries t(i,j) given by

$$t(i,j) = \frac{1}{2}\,k(j) \cos\frac{(2i+1)j\pi}{16}. \quad (2.6)$$

Among the 64 DCT coefficients, c(0,0) is the weight of the DCT basis function which has no frequency in either the horizontal or the vertical direction, and is therefore referred to as the DC term. The other 63 DCT coefficients, c(i,j) for i, j = 0, 1, ..., 7 and (i,j) ≠ (0,0), are generally referred to as the AC coefficients. The DC term c(0,0) is related to the pixel values f(i,j) by

$$c(0,0) = \frac{1}{8} \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y), \quad (2.7)$$

which is 8 times the average intensity of the block. The Forward DCT yields 64 coefficients for each block. These coefficients are quantised, zig-zag ordered, run-length coded and then Huffman coded to reduce spatial redundancy. Note that the DC terms are usually processed separately from the AC terms and as a result are more accessible in the coded (compressed) stream.
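
To make these equations concrete, the following sketch (a minimal numpy illustration, not code from the thesis) builds the 8×8 DCT matrix from Equation 2.6 and checks both the invertibility of the transform and the key fact exploited throughout this thesis: the DC coefficient is 8 times the block average (Equation 2.7). The multiplication order is chosen so that the results match Equations 2.1 and 2.2 directly.

```python
import numpy as np

# Build the 8x8 DCT matrix T from Equation 2.6:
#   t(i, j) = (1/2) k(j) cos((2i + 1) j pi / 16)
k = np.where(np.arange(8) == 0, 1 / np.sqrt(2), 1.0)
i, j = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
T = 0.5 * k[j] * np.cos((2 * i + 1) * j * np.pi / 16)

def forward_dct(block: np.ndarray) -> np.ndarray:
    """2-D forward DCT of an 8x8 block (matrix form of Equation 2.1)."""
    return T.T @ block @ T

def inverse_dct(coeffs: np.ndarray) -> np.ndarray:
    """2-D inverse DCT of an 8x8 coefficient block (Equation 2.2)."""
    return T @ coeffs @ T.T

block = np.random.randint(0, 256, (8, 8)).astype(np.float64)
C = forward_dct(block)
# The DC term equals 8 times the block's average intensity (Equation 2.7).
assert np.isclose(C[0, 0], 8 * block.mean())
# The transform is invertible: the original block is recovered exactly.
assert np.allclose(inverse_dct(C), block)
```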

2.4 Information Available From an MPEG Stream

Since there are three different picture types in an MPEG-1/2 stream, different kinds of information are available from each picture type. For each I-picture, all macroblocks are intra-coded. As a result, the 64 DCT coefficients for each of its four luminance blocks and two chrominance blocks are available after entropy decoding and de-quantisation. These two steps are straightforward and require very little computation.

For each P-picture, macroblocks can be either intra-coded or motion compensated (MC). For intra-coded macroblocks, the same information (64 DCT coefficients for each of the six blocks) as for macroblocks in an I-picture is available. For each MC macroblock, a motion vector and the DCT coefficients of the difference blocks are available.

For each B-picture, macroblocks can be intra-coded, MC with a forward motion vector, MC with a backward motion vector, or MC with both forward and backward motion vectors. The same information is available for intra-coded macroblocks and for MC macroblocks with either a forward or a backward motion vector as in a P-picture. For each MC macroblock with both forward and backward motion vectors, two motion vectors (forward and backward) and the DCT coefficients of the difference blocks are available.

2.5 Preprocessing and Minimal Decoding

Following the discussion of MPEG compression algorithms in the previous section, we now describe the preprocessing of MPEG compressed data and the extraction of relevant information. Efficient processing can be achieved by developing analysis algorithms which make use of the features directly available from MPEG compressed video data, thereby avoiding full decompression.

Section 2.5.1 studies the reconstruction of spatially reduced images from the different picture types of MPEG video. Such reduced images use only the DC coefficient of each block and are therefore called DC images.

The size of a DC image is 64 times smaller than that of its original image. Such images, though greatly reduced in size, still capture important global image features useful for many analysis purposes, such as shot boundary detection and skin colour region detection. Section 2.5.2 then discusses preprocessing of the motion vector information in B-pictures to obtain more reliable motion vectors for motion analysis, such as global motion estimation and moving object detection.

2.5.1 Reconstruction of DC Images

Since each image in an MPEG stream is divided into 16×16 macroblocks, which in turn consist of four 8×8 luminance blocks and two 8×8 chrominance blocks, the number of luminance blocks in each dimension is reduced 8 times compared with the number of pixels, while the number of chrominance blocks in each dimension is reduced 16 times. In the following, only the reconstruction of the luminance DC images is discussed. Chrominance DC images can be reconstructed similarly.

In the DCT domain, each block has one DC coefficient and 63 AC coefficients. The DC coefficient of each block is equal to 8 times the average of its original pixel values. If only the DC coefficient is used, this corresponds to using one pixel to represent every 8×8 block. Such an image is therefore reduced 8 times in each dimension and is called a DC image.

For intra-coded I-pictures, reconstruction of DC images is trivial since the DCT DC value of each block can be directly obtained from the MPEG stream. For the predictively coded P-pictures and bi-directionally coded B-pictures, reconstruction of DC images is not straightforward, since macroblocks in a P- or B-picture can be either intra-coded or motion compensated (MC). The DC values of an intra-coded macroblock in a P-picture can be obtained in the same way as those in an I-picture. Extraction of exact DC values for MC macroblocks in a P-picture is given by Chang and Messerschmitt [CM95] and is computationally expensive. Here we describe an approximation method proposed by Meng et al. [MJC95]. An MC macroblock in a P-picture has a motion vector and four blocks of DCT-coded MC errors.

The motion vector allows us to trace the macroblock back to its matching counterpart in the previous reference picture. Each of the four luminance blocks is matched to a location in the reference picture, as shown in Figure 2.3. The matching block may overlap as many as four blocks in the reference picture. Assume that the DC values of the reference picture are available and that the luminance variance within each block is small. Then the DC value of an MC block in a P-picture can be approximated by taking the overlapping-area-weighted average of the four blocks in the reference picture pointed to by the motion vector, plus the DC value of its residue:

$$DC(b) = \frac{1}{64} \sum_{i=1}^{4} h_i w_i \, DC(b_i) + DC(b_{residue}), \quad (2.8)$$

where DC(b_i) is the DC value of block i in the reference picture, and w_i and h_i are the overlapping width and height respectively. Their values are related to the motion vector (u, v) as follows: w_1 = w_3 = u, w_2 = w_4 = 8 − u, h_1 = h_2 = v and h_3 = h_4 = 8 − v. The term DC(b_residue) is the residue DC value of the current block, b.

[Figure 2.3: Illustration of the relation between the reference block (b_ref), the current block, and the motion vector.]
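
A minimal sketch of this approximation (illustrative Python, not the thesis implementation) is given below. It assumes the motion vector has already been reduced to block-relative offsets 0 ≤ u, v ≤ 8, and the row-major ordering of the four overlapped reference blocks is an assumption, since the exact correspondence depends on Figure 2.3:

```python
def approx_dc(dc_ref, u: int, v: int, dc_residue: float) -> float:
    """Approximate the DC value of a motion-compensated block (Equation 2.8).

    dc_ref     -- DC values of the four reference blocks b1..b4 overlapped by
                  the displaced block (assumed row-major: top-left, top-right,
                  bottom-left, bottom-right)
    (u, v)     -- horizontal/vertical block-relative displacement, 0..8
    dc_residue -- DC value of the coded residue for this block

    Weights follow the text: w1 = w3 = u, w2 = w4 = 8 - u,
    h1 = h2 = v, h3 = h4 = 8 - v.
    """
    w = [u, 8 - u, u, 8 - u]
    h = [v, v, 8 - v, 8 - v]
    weighted = sum(hi * wi * dc for hi, wi, dc in zip(h, w, dc_ref))
    return weighted / 64.0 + dc_residue
```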

The DC values of a B-picture can be reconstructed similarly for MC macroblocks with either a forward or a backward motion vector. For those MC macroblocks with both forward and backward motion vectors, their DC values can be calculated as the average of those reconstructed from the previous reference picture and the future reference picture, plus the DC values of their residues. Using the above method, we can reconstruct a DC image sequence from an MPEG stream, no matter what picture types (I, P or B) it contains. The subsequent analysis algorithms presented in this thesis can then be applied to these DC image sequences.

Figure 2.4 shows some examples of DC images reconstructed from the three picture types using the above method. The size of the original frames in this MPEG-1 sequence is 352×288 pixels. It can be seen that the DC images capture the global features of the original images well, although they are greatly reduced in size.

The accuracy of the reconstructed DC values of MC macroblocks in P- and B-frames can be assessed by comparing them to the true values computed from the fully decompressed images (the approximation error). This error obviously depends on several factors, such as the picture type, the accuracy of motion vector estimation, and the scene content. Examples of error DC images are given by Yeo [Yeo96], who showed that over 90% of blocks have errors smaller than 5 in their reconstructed DC values.

The processing time for reconstructing a DC image likewise depends on the picture type. For a GoP pattern of IBBPBBPBBPBBP in MPEG-1 sequences with a frame size of 352×288 pixels, the average speed of DC image extraction is about 100 frames per second on a 167 MHz Sun workstation, roughly 4 times faster than real time.

[Figure 2.4: Examples of some reconstructed DC images. The images of the three columns are reconstructed from the I-, P- and B-pictures in an MPEG-1 stream respectively. The size of the reconstructed DC images is 44×36 pixels while the size of the original frames is 352×288 pixels. The images from top to bottom rows illustrate scenes of a baby in the bath, a woman walking, two men walking, a revolving door, and a close-up of a document.]

2.5.2 Preprocessing of Motion Vectors in B-Pictures

The MPEG syntax specifies how the motion information is represented: one or two motion vectors per 16×16 macroblock of the picture, depending on the type of motion compensation (one for forward-predicted or backward-predicted macroblocks, and two for bi-directionally interpolated macroblocks). The MPEG standard does not, however, specify how such vectors are to be computed. Because of the block-based motion representation, block-matching techniques [JJ81, GM90, LZ93] are usually used. In a block-matching technique, the motion vector is obtained by minimising a cost function measuring the mismatch between a block and its predicted candidate block. Let M_i be a macroblock in the current picture P_c, and v the displacement with respect to the reference picture P_r. Then the optimal displacement ("motion vector") is obtained by the formula:

$$v_i = \arg\min_{v \in V} \sum_{x \in M_i} D[P_c(x) - P_r(x + v)], \quad (2.9)$$

where the search range V of the possible motion vectors and the selection of the cost function D are left entirely to the implementation. Exhaustive searches, where all the possible motion vectors are considered, are known to give good results, but at the expense of a very large complexity for large search ranges. The decision on the trade-off between the quality of the motion vector field and the complexity of the motion estimation process is for the implementer to make.

No matter how large the search range is, and therefore how accurate the motion vectors are, a motion vector in MPEG only represents the approximate amount of motion for a macroblock, because of the nature of the block-based representation and the assumed translational motion. In addition, there is the aperture problem, which exists for any motion estimation technique [Tek95, pages 78–79]. The aperture problem applies particularly to blocks with uniform grey-level values or one-dimensional features such as edges.

Assume that the motion in the scene, whether it is global camera motion or local object motion, is smooth in the temporal direction. Since each macroblock in a B-picture can have both forward and backward motion vectors, the two motion vectors (normalised to the displacement per frame) should not differ much in either direction or magnitude for macroblocks with sufficient texture, if they are correctly estimated.

On the other hand, the forward and backward motion vectors of macroblocks with uniform grey-level values or one-dimensional features tend to have a random relationship, and hence their difference will be large. For macroblocks with only a forward or a backward motion vector, we can use the variance of the four DC values, or the sum of the absolute AC coefficients of the difference blocks, to measure their texture levels and hence the reliability of their motion vectors.

Based on the above observations, we have developed the following algorithm for extracting reliable motion vectors from B-pictures; a brief illustration is shown in Figure 2.5. The type of each macroblock in a B-picture is first checked. If it is intra-coded, no motion vector is available. For MC macroblocks with both motion vectors, their difference is calculated and compared with a threshold, which is empirically set to 5. If the difference is smaller than this threshold, the average of the forward and backward motion vectors is calculated and used as the macroblock's motion vector. Otherwise, the motion vector is considered unreliable and is not used for any subsequent processing. For an MC macroblock with only one motion vector, the variance of its DC values, or some other measure of texture level, is calculated. If the macroblock has enough texture (i.e., a large variance), the motion vector is kept. Otherwise, it is declared unreliable and excluded from further processing.

[Figure 2.5: Process of extracting reliable motion vectors in a B-picture. The DC variance measure can be replaced with other alternatives, such as the sum of the magnitudes of the AC coefficients of the four residue blocks.]
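
The decision procedure can be summarised in a short sketch (illustrative Python; the macroblock record layout, helper names, and the texture threshold value are assumptions, since the text only fixes the difference threshold of 5 and notes that the texture test is set empirically):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MV_DIFF_THRESHOLD = 5.0   # empirical value given in the text
TEXTURE_THRESHOLD = 10.0  # assumed value for the DC-variance texture test

@dataclass
class Macroblock:
    intra: bool
    fwd_mv: Optional[Tuple[float, float]]  # normalised to displacement/frame
    bwd_mv: Optional[Tuple[float, float]]
    dc_variance: float                     # variance of the 4 luminance DCs

def reliable_mv(mb: Macroblock) -> Optional[Tuple[float, float]]:
    """Return a reliable motion vector for a B-picture macroblock, or None."""
    if mb.intra:
        return None  # intra-coded: no motion vector available
    if mb.fwd_mv is not None and mb.bwd_mv is not None:
        # Both vectors present: accept their average if they roughly agree.
        diff = max(abs(f - b) for f, b in zip(mb.fwd_mv, mb.bwd_mv))
        if diff < MV_DIFF_THRESHOLD:
            return tuple((f + b) / 2 for f, b in zip(mb.fwd_mv, mb.bwd_mv))
        return None  # vectors disagree: unreliable
    # One vector only: keep it if the block has enough texture.
    mv = mb.fwd_mv if mb.fwd_mv is not None else mb.bwd_mv
    return mv if mb.dc_variance > TEXTURE_THRESHOLD else None
```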

Once reliable motion vectors in B-pictures have been extracted, they can be used for any subsequent motion analysis.

2.6 Conclusions

In this chapter, we have briefly discussed the MPEG-1 standard and some important features directly available from compressed MPEG data. We have also shown how reduced (DC) images can be reconstructed from any picture type (I, P and B) of MPEG compressed video. These DC images retain important global features useful for video analysis. The benefits of processing such reduced images are threefold: no full decompression is necessary; the storage needed for DC images is small (approximately 1/64) compared to the fully decompressed data; and complexity is reduced because of the small data size. Based on some simple observations, we have also developed a novel algorithm for extracting reliable motion vectors from B-pictures.

Chapter 3

Overview of Included Papers

This chapter gives an overview of the problem that each of the included papers is trying to solve, summarises the approach taken in this study, and adds some comments and discussions. Section 3.1 discusses the problem of video segmentation, especially dissolve detection in MPEG compressed video sequences using reconstructed DC image sequences. Section 3.2 describes camera motion detection and moving object detection using motion vectors extracted from MPEG streams. Sections 3.3 and 3.4 describe algorithms for extracting highlights (skin colour regions and text regions) from MPEG video sequences. Finally, Section 3.5 proposes an efficient algorithm for detecting replays in sports video sequences.

3.1 Shot Boundary Detection

An important first task in video analysis is to segment a video sequence into temporal shots, each representing an event or a continuous sequence of actions. A shot is what is captured between record and stop operations. Further scene analysis and interpretation can then be performed on such shots. The segmented video sequences can also be used for browsing, in which only one or a few key frames of each shot are displayed. The boundaries between video shots are commonly called scene changes (this is actually a misnomer: in film production, a scene is a collection of shots and is a division of an act presenting continuous action in one place).

The act of segmenting a video into shots is accordingly referred to as scene change detection. In this study, the term shot boundary detection is used instead, in order to reflect the true meaning of the segmentation process.

There are two types of shot transitions: abrupt and gradual. In the first case, the change from one shot to the next occurs from one frame to the next, as illustrated in Figure 3.1. This type of shot transition is also called a camera break or cut. In the second case, the change occurs over a longer period of time. This is often the result of applying special editing techniques, such as fades, dissolves and wipes, to join two shots smoothly. An example of a gradual transition (a dissolve) is shown in Figure 3.2, where a shot of a document scene slowly changes to that of an outdoor scene, and the transition occurs over a period of about 2 seconds (39 frames).

[Figure 3.1: Example of an abrupt shot transition.]

[Figure 3.2: Example of a gradual shot transition (dissolve).]

3.1.1 Cut Detection

The difference in grey level, as well as in colour information, between two consecutive frames is usually large at an abrupt shot boundary, due to the content dissimilarity of the two shots. Many of the early methods for cut detection [ZKS93, HJW94b, ZMM95, AL96] were based on difference metrics, such as pixel intensity difference and histogram difference. One problem with these difference-based algorithms is that they are sensitive to busy scenes, in which intensities change substantially from frame to frame due to camera/object motion.

Since the availability of MPEG video, several algorithms for detecting cuts directly in the MPEG domain have emerged [AHC93, YL95, SD95, ZLS95, LZ95, FLM96, KDR96, IP97, GHP98, JHEJ98, KKC99, MIP99]. These methods use the information directly available from an MPEG stream, such as DCT coefficients, motion vectors and bit rates, to calculate frame dissimilarity. A full review of cut detection algorithms operating directly on MPEG video data is given by the candidate [Gu00] in a commercial-in-confidence report.

An efficient cut detection algorithm [GTK96] using the motion vector information directly available from an MPEG stream has been designed and implemented by the candidate. It uses one single measure for the different types of frames in MPEG streams and is therefore fast, simple and reliable. This algorithm has been commercialised and has become part of a series of MPEG video processing products of Mediaware Pty Ltd [Med]. Due to intellectual property issues, this algorithm is not included as part of this thesis study.
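
As a concrete illustration of the histogram-difference metric mentioned above (a generic sketch only, not the candidate's commercialised algorithm, which is not disclosed here), a cut can be flagged when the grey-level histogram difference between consecutive DC images exceeds a threshold; the threshold value below is an assumption and would be tuned per sequence:

```python
import numpy as np

CUT_THRESHOLD = 0.5  # assumed fraction of changed mass; tuned in practice

def histogram_diff(a: np.ndarray, b: np.ndarray, bins: int = 64) -> float:
    """Normalised grey-level histogram difference between two DC images (0..1)."""
    ha, _ = np.histogram(a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(b, bins=bins, range=(0, 256))
    return np.abs(ha - hb).sum() / (2.0 * a.size)

def detect_cuts(dc_images: list) -> list:
    """Return indices i where a cut is declared between frames i-1 and i."""
    return [i for i in range(1, len(dc_images))
            if histogram_diff(dc_images[i - 1], dc_images[i]) > CUT_THRESHOLD]
```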

3.1.2 Dissolve Detection

Since the difference between two consecutive frames is small at a gradual transition, the difference measures introduced for cut detection are not suitable for detecting gradual transitions. Different editing techniques result in different types of gradual transitions, which in turn have different characteristics [Oha93]. Detection of dissolves, which are the most common gradual transitions present in movies and TV programs, is considered here. A dissolve is a gradual transition from one shot to another in which the first shot fades out and the second shot fades in. Mathematically, a dissolve operation from shot S_1 to shot S_2 produces a sequence of frames represented by the following formula:

$$G(x, y, t) = g_1(x, y, t)\,[1 - \alpha(t)] + g_2(x, y, t)\,\alpha(t) \quad (3.1)$$

where G(x, y, t) represents the intensity function of the editing frames at time t, g_1(x, y, t) and g_2(x, y, t) are the intensity functions of shots S_1 and S_2 respectively, and α(t) = (t − t_s)/(t_e − t_s) increases from 0 to 1 during the dissolve. Here t_s and t_e stand for the start and end times of the dissolve. It is assumed in Equation 3.1 that the fade-in and fade-out start at the same time. It can be seen from Equation 3.1 that a fade can be regarded as a special dissolve, with the intensity values g_1(x, y, t) of the first shot being constant for a fade-in and the intensity values g_2(x, y, t) of the second shot being constant for a fade-out.

Very little work on dissolve detection has been reported, especially in the compressed domain. Meng et al. [MJC95] proposed an algorithm for detecting dissolves directly in MPEG compressed video, based on the observation that frame intensity variances during an ideal dissolve follow a clear parabolic shape. The algorithm uses the DC images reconstructed from MPEG video streams to calculate the intensity variances. However, frame intensity variances are often affected by factors such as motion, and thus the parabolic shape is not always present during a real dissolve involving motion. In addition, when the variance of one shot is much larger than that of the other, one half of the parabolic shape almost disappears, and such dissolve detection algorithms fail in these cases.

Yeo [Yeo96] also proposed a method for detecting dissolves using the DC image sequence reconstructed from an original MPEG video sequence. It first calculates the difference between a frame and the following kth frame. A sequence of such frame differences is then observed to show a plateau during a dissolve.

Paper 1 [Gu et al. 1997a] addresses the problem of detecting dissolves directly in MPEG compressed video. Two reliable methods are proposed, based on the characteristic that intensity values (both local and global) in the editing frames change linearly during a dissolve. DC images are used to calculate the intensity values.

Dissolve Detection Using Average Frame Intensity

The method based on global intensity information attempts to reduce the intra-shot intensity changes caused by noise and motion by using average frame intensity values to calculate inter-frame changes. The average frame intensity values can be calculated from the DC images, which are reconstructed from an MPEG bit stream using the method described in Section 2.5.1 of Chapter 2. Assuming that the average intensity values of the two shots are \bar{g}_1 and \bar{g}_2, the rate of change of the average intensity during a dissolve follows from Equation 3.1 as

    R = \frac{\bar{g}_2 - \bar{g}_1}{t_e - t_s}.        (3.2)

This rate is relatively stable during the dissolve; its value depends on the dissolve duration, t_e - t_s, and on the average intensity difference of the two shots, \bar{g}_2 - \bar{g}_1. This implies that the average inter-frame differences also change little during the dissolve, which produces a distinctive rectangular shape on the average intensity difference curve: the width of the rectangle is the dissolve duration, while its height is determined by the rate of change R. Consequently, dissolves can be detected by finding rectangular shapes on the average intensity difference curve.
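A minimal sketch of this rectangle search, assuming the per-frame average intensities have already been computed from the DC images (the function name and threshold values are illustrative, not taken from Paper 1):

    import numpy as np

    def detect_rectangles(avg_intensity, min_len=10, max_len=60,
                          min_level=0.3, tolerance=0.15):
        # avg_intensity: per-frame average grey levels from the DC images.
        # Thresholds are illustrative and sequence-dependent.
        diff = np.diff(avg_intensity)          # inter-frame change, approx. R
        candidates, start = [], None
        for i, d in enumerate(diff):
            if start is None:
                if abs(d) > min_level:         # possible start of a rectangle
                    start = i
            elif abs(d - diff[start]) > tolerance or abs(d) <= min_level:
                # Level drifted or dropped: close the run and test its width.
                if min_len <= i - start <= max_len:
                    candidates.append((start, i))   # dissolve spans frames start..i
                start = i if abs(d) > min_level else None
        return candidates

The width of each reported run estimates the dissolve duration t_e - t_s, and the level it holds estimates the rate R.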

Dissolve Detection Using Average Block Intensity

When the two shots being connected have similar average intensity values, the above method has difficulty detecting the dissolve because the average intensity difference between two consecutive editing frames is too small to be detected. However, if we look at individual blocks, their corresponding intensity values in the two shots differ, since the two shots have different content. Our second dissolve detection method therefore looks at the average intensity change of each 8x8 luminance block, the basic unit in an MPEG stream. The average intensity value of an 8x8 luminance block is one eighth of its block DC value in an MPEG stream, which is directly available for I-frames and can be reconstructed for P- and B-frames by minimal partial decoding, as described in Section 2.5.1 of Chapter 2. From Equation 3.1 it follows that the DC value of each block also changes linearly during a dissolve, so the DC value difference of each block between any two editing frames falls within a range determined by the corresponding DC difference between the two shots. We use as our measure the percentage of blocks with large DC value differences between two consecutive frames. This measure is large and roughly constant during a dissolve and small otherwise. Dissolves can therefore be detected by finding periods in which this block percentage stays above a given threshold for a length in the range of typical dissolve durations.
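The block-based measure can be sketched as follows, assuming one array of per-block DC values per frame, obtained by the minimal decoding of Section 2.5.1 (thresholds and names are again illustrative assumptions):

    import numpy as np

    def block_change_fraction(dc_prev, dc_curr, dc_threshold=2.0):
        # dc_prev, dc_curr: 2-D arrays of per-block DC values for two frames.
        # Returns the fraction of blocks whose DC value changed noticeably.
        return float(np.mean(np.abs(dc_curr - dc_prev) > dc_threshold))

    def detect_dissolves(dc_frames, frac_threshold=0.6, min_len=10, max_len=60):
        # dc_frames: list of per-block DC arrays, one per frame.
        fractions = [block_change_fraction(a, b)
                     for a, b in zip(dc_frames, dc_frames[1:])]
        dissolves, start = [], None
        for i, f in enumerate(fractions):
            if f >= frac_threshold:
                if start is None:
                    start = i                   # fraction stays high: candidate
            else:
                if start is not None and min_len <= i - start <= max_len:
                    dissolves.append((start, i))
                start = None
        return dissolves

Because each block is tested independently, this measure remains large even when the global averages of the two shots coincide, which is exactly the failure case of the first method.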

3.1.3 Discussion

The above two dissolve detection algorithms have been implemented on top of a general MPEG parsing/decoding library. The DC image sequences are obtained from an MPEG stream by reconstructing the DC values of motion-compensated blocks in P- and B-frames, as detailed in Chapter 2. Both algorithms have been tested on several MPEG video sequences; Table 3.1 shows the dissolve detection results of the proposed algorithms on these sequences.

                                      Number of dissolves
    Video sequence (type)         true   detected   missed   false
    test1.mpg (commercial)         15       17         1       3
    test2.mpg (TV news)            28       31         0       3
    test3.mpg (movies)             41       45         1       5
    test4.mpg (documentary)         5        6         0       1

    Table 3.1: Results of dissolve detection in several MPEG-1 video sequences.

For video sequences with small camera/object motion, both algorithms reliably detect all dissolves with a false detection rate of 10%. These false positives mostly correspond to scenes involving consistent small motion. For video sequences with large camera panning, both algorithms detect the panning segments as dissolves, leading to a higher false detection rate. However, segments with camera motion can easily be identified by the camera motion detection algorithms to be described in the next section, and such segments usually last longer than normal dissolves. These false detections can therefore be easily removed.

While the method based on average frame intensity values is simple and fast, it has difficulty with video sequences containing shots with similar overall intensity values. The method based on average block intensity values, on the other hand, can detect dissolves in video sequences involving substantial object motion, even during the dissolves themselves. Figure 3.3 shows three snapshot frames at the beginning, middle and end of such a dissolve involving large motion: the singer moves substantially from right to left during the dissolve. Figure 5 in Paper 1 shows the response of the proposed method for the video sequence containing this dissolve (the first peak), and the result clearly shows that the proposed method detects this dissolve reliably. Zabih et al. [ZMM95] compared the results of several pixel domain algorithms applied to this video sequence and concluded that only their feature-based algorithm could detect this kind of dissolve. However, their algorithm operates on pixel data only and requires several computationally intensive steps (Gaussian smoothing, edge detection, and edge tracking) on top of the time-consuming decompression; a typical speed of 2 frames per second for detecting dissolves in MPEG sequences was quoted in their paper. The two algorithms proposed in this thesis, by contrast, use features directly available in MPEG compressed data and are thus fast and efficient. The average speed of the two algorithms (including DC image extraction, difference calculation, and final decision making) on an

Figure 3.3: An example of a dissolve involving large motion.