Motion Video Compression


7 Motion Video Compression

7.1 Motion video

Motion video contains massive amounts of redundant information. This is because each image has redundant information and also because there are very few changes from one image to the next. Motion video compression relies on two facts:

- Images have a great deal of redundancy (repeated, repetitive, superfluous or duplicated information, exceeding what is necessary, and so on).
- The human eye and brain have limitations on what they can perceive.

Chapter 5 discussed how JPEG encodes still images. Motion video is basically a series of still images, and the basis of MPEG coding is to treat the video information as a series of compressed images, allowing compression of around 130:1.

7.2 MPEG-1 overview

The Motion Picture Experts Group (MPEG) developed an international open standard for the compression of high-quality audio and video information. At the time, single-speed CD-ROM technology allowed a maximum bit rate of 1.2 Mbps, and it is this rate that the standard was built around (these days, 8× and 10× CD-ROM bit rates are common). MPEG's main aim was to provide good-quality video and audio using hardware processors (and, in some cases, on workstations with sufficient computing power, to perform the tasks in software). Figure 7.1 shows the main processing steps of encoding:

- Image conversion: this normally involves converting images from RGB into YUV (or YCrCb) form, with optional color subsampling.
- Conversion into slices and macroblocks: a key part of MPEG-1 compression is the detection of movement within a frame. To detect motion, a frame is first subdivided into slices, and each slice is then divided into a number of macroblocks. Only the luminance component is used for the motion calculations: luminance (Y) values use a 16×16-pixel macroblock, whereas the two chrominance components use 8×8-pixel blocks.
- Motion estimation: MPEG-1 uses a motion estimation algorithm which searches among multiple blocks of pixels within a given search area and thereby tries to track objects that move across the image.
- DCT conversion: as with JPEG, MPEG-1 uses the DCT method. This transform is used because it exploits the physiology of the human eye: it converts a block of pixels from the spatial domain into the frequency domain, allowing the higher-frequency terms to be reduced, as the human eye is less sensitive to high-frequency changes.
- Encoding: the final stages are run-length encoding and fixed Huffman coding, which produce a variable-length code.

Figure 7.1 MPEG encoding with block matching (images → RGB-to-YUV conversion → block matching against a reference frame, producing error terms → DCT transform → quantization → run-length encoding with zero suppression → Huffman coding → variable-size code)

7.3 MPEG-1 video compression

MPEG-1 typically uses the CIF format for its input, which has the following parameters:

- For NTSC, 352×240 pixels for luminance and 176×120 pixels for the U and V color-difference components (i.e. 4:2:0 subsampling).
- For PAL/SECAM, 352×288 pixels for luminance and 176×144 pixels for the U and V color-difference components (i.e. 4:2:0 subsampling).

This gives a picture quality similar to that of VCR technology.
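To put the CIF numbers in context, the following short Python sketch (an illustration added here, not part of the original text) computes the raw data rate of 4:2:0 CIF video and the compression needed to reach the 1.2 Mbps CD-ROM rate:

    def cif_raw_rate_mbps(width, height, fps):
        # One 8-bit Y sample per pixel, plus U and V planes
        # subsampled by 2 in each direction (4:2:0).
        luma = width * height
        chroma = 2 * (width // 2) * (height // 2)
        return (luma + chroma) * 8 * fps / 1e6

    ntsc = cif_raw_rate_mbps(352, 240, 30)   # ~30.4 Mbps
    pal = cif_raw_rate_mbps(352, 288, 25)    # also ~30.4 Mbps
    print(f"compression needed: {ntsc / 1.2:.0f}:1")  # ~25:1 for the video alone

Note that the NTSC and PAL variants produce the same raw rate (about 30.4 Mbps); the 130:1 figure quoted earlier is relative to full-resolution broadcast TV rather than CIF.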

MPEG-1 differs from conventional TV in that it is non-interlaced (known as progressive scanning), but the frame rate is the same as conventional TV, i.e. 25 fps (for PAL and SECAM) and 30 fps (for NTSC). Note that MPEG-1 can also use larger pixel frames, such as CCIR-601 720×480, but the CIF format is the most frequently used.

Taking into account the interlacing effect, the CIF format is actually derived from the CCIR-601 format. The CCIR-601 digital television standard defines a picture size of 720×243 (or 240) at 60 fields per second. Note that a frame actually comprises two fields, where the odd and even lines are interlaced to create the full picture. While the interlaced luminance information occupies the full 720×480 frame, the chrominance components are reduced by 4:2:2 subsampling to give 360×243 (or 240) at 60 fields per second. MPEG-1 further reduces the chrominance components by halving the pixel data in the vertical, horizontal and time directions. It also reduces the image size so that the number of pixels is divisible by 8 or 16; this is because the motion analysis and DCT conversion operate on 16×16 or 8×8 pixel blocks. As a result, the number of lines of an MPEG-1 encoded movie differs between the NTSC standard and the PAL and SECAM standards: the final figure for PAL and SECAM is 288 lines at 25 fps, while for NTSC it is 240 lines at 30 fps. Both require the same number of bits to encode the stream (288×25 = 240×30 lines per second).

The MPEG encoded bitstream comprises three components: compressed video, compressed audio and system-level information. To ease synchronization and lip synching, the audio and video streams are time stamped using a 90 kHz reference clock.

7.4 MPEG-1 compression process

7.4.1 Color space conversion

The first stage of MPEG encoding is to convert a video image into the correct color space format. In most cases, the incoming data is in 24-bit RGB color format and is converted into 4:2:0 YCrCb (or YUV) form. Some information is obviously lost, but this in itself yields some compression. A conversion sketch is given below.
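As an illustration of this first stage, the following Python sketch (a minimal version using the common ITU-R BT.601 weightings; the exact coefficients are an assumption, as the text does not give them) converts 24-bit RGB into YCbCr and applies 4:2:0 subsampling by averaging each 2×2 block of chrominance samples:

    import numpy as np

    def rgb_to_ycbcr_420(rgb):
        # rgb: (h, w, 3) uint8 array; h and w are assumed even.
        r, g, b = (rgb[..., i].astype(np.float32) for i in range(3))
        y  = 0.299 * r + 0.587 * g + 0.114 * b            # luminance
        cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b  # blue difference
        cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b  # red difference

        def subsample(c):  # average each 2x2 block (4:2:0)
            h, w = c.shape
            return c.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

        return y, subsample(cb), subsample(cr)

For a 352×288 PAL CIF frame this yields one 352×288 luminance plane and two 176×144 chrominance planes, matching the figures in section 7.3.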

7.4.2 Slices and macroblocks

MPEG-1 compression tries to detect movement within a frame. This is done by subdividing a frame into slices and then subdividing each slice into a number of macroblocks. For example, a PAL-format frame of 352×288 pixels (101,376 pixels) can, when divided into 16×16 blocks, give a whole number of 396 macroblocks: dividing 288 by 16 gives a whole number of 18 slices, and dividing 352 by 16 gives 22 macroblocks per slice. Thus the image is split into 22 macroblocks in the x-direction and 18 in the y-direction, as illustrated in Figure 7.2.

Figure 7.2 Segmentation of a 352×288 image into subblocks (22 macroblocks across by 18 down)

Luminance (Y) values use a 16×16-pixel macroblock, whereas the two chrominance components have 8×8-pixel blocks. Note that only the luminance component is used for the motion calculations.

7.4.3 Motion estimation

MPEG-1 uses a motion estimation algorithm which searches among multiple blocks of pixels within a given search area and thereby tries to track objects that move across the image. Each luminance (Y) 16×16 macroblock is compared with macroblocks within either a previous or a future frame to find a close match. When a close match is found, a vector is used to describe where this block should be located, together with any difference information relative to the compared block. As there tend to be very few changes from one frame to the next, this is far more efficient than transmitting the original data. A sketch of such a block search is given below.
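A minimal full-search block matcher might look as follows (a Python sketch for illustration only; MPEG-1 does not mandate a particular search strategy, and real encoders use faster searches and half-pixel accuracy):

    import numpy as np

    def full_search(block, ref, top, left, search_range=8):
        # block: 16x16 luminance macroblock from the current frame
        # ref:   luminance plane of a previous (or future) frame
        # (top, left): the macroblock's position in the current frame
        h, w = ref.shape
        best_sad, best_vec = None, (0, 0)
        for dy in range(-search_range, search_range + 1):
            for dx in range(-search_range, search_range + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + 16 > h or x + 16 > w:
                    continue  # candidate block falls outside the frame
                cand = ref[y:y + 16, x:x + 16].astype(int)
                sad = np.abs(block.astype(int) - cand).sum()
                if best_sad is None or sad < best_sad:
                    best_sad, best_vec = sad, (dy, dx)
        dy, dx = best_vec
        match = ref[top + dy:top + dy + 16, left + dx:left + dx + 16]
        residual = block.astype(int) - match.astype(int)
        return best_vec, residual  # the motion vector and the error terms

If even the best sum of absolute differences (SAD) is too large, the encoder falls back to intracoding the macroblock, as described below.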

Figure 7.3 shows two consecutive images of 2D luminance, each divided into a grid of 16×16-pixel macroblocks (rows 0 to 4, columns 0 to 16). It can be seen that, in this example, there are very few differences between the two images. If the previous image is transmitted in its entirety, then the current image can be transmitted with reference to the previous image. For example, the macroblocks at (0,1), (0,2) and (0,3) in the current image are the same as in the previous image, so they can be coded simply with a reference to the previous image. The (0,4) macroblock differs from the previous image, but it is identical to the (0,3) block of the previous image, so a reference to that block is made. This can continue, with most of the blocks in the image being identical to the previous image. The only other differences in the current image are at (4,0) and (4,1); these blocks can be stored in their entirety, or specified by their differences from a previous similar block.

Figure 7.3 Two consecutive images (previous image and current image, macroblocks (0,0) to (4,16))

Each macroblock is compared mathematically with another block in a previous or future frame. The offset to the candidate block can cross a macroblock boundary, or even lie at an arbitrary pixel offset. The comparison repeats until a match is found or the specified search area within the frame has been exhausted. If no match is available, the search process can be repeated using a different frame, or the macroblock can be stored as a complete set of data. As previously stated, if a match is found, the vector specifying where the matching macroblock is located is sent along with any difference information.

As the technique involves a great many searches over a wide search area, and there are many frames to encode, the encoder must normally be a high-powered workstation. This has several implications:

- An asymmetrical compression process is adopted, where a relatively large amount of computing power is required for the encoder and much less for the decoder. Normally the encoding is also done in non-real time, whereas the decoder reads the data in real time. As processing power and memory capacity increase, more computers will be able to compress video information in real time.
- Encoders influence the quality of the decoded image dramatically. Encoding shortcuts, such as limited search areas in macroblock matching, can produce poor picture quality irrespective of the quality of the decoder.
- The decoder normally requires a large amount of electronic memory to store the past and future frames that may be needed for motion estimation.

With the motion estimation completed, the raw data describing the frame can now be converted using the DCT algorithm, ready for Huffman coding.

7.4.4 I, P and B frames

MPEG video compression uses three main types of frame: I-frames, P-frames and B-frames.

Intra frame (I-frame)
An intra frame, or I-frame, is a complete image which requires no extra information to make it complete. No motion estimation processing is performed on an I-frame. Mainly used to provide a known starting point, it is usually the first frame to be sent.

Predictive frame (P-frame)
A predictive frame, or P-frame, uses the preceding I-frame as its reference and has motion estimation processing. Each macroblock in this frame is supplied either as a vector and difference relative to the I-frame or, if no match was found, as a completely encoded macroblock (called an intracoded macroblock). The decoder must thus retain the I-frame information to allow the P-frame to be decoded.

Bidirectional frame (B-frame)
A bidirectional frame, or B-frame, is similar to the P-frame except that its reference frames are the nearest preceding I- or P-frame and the next future I- or P-frame. When compressing the data, the motion estimation works on the future frame first, followed by the past frame. If this does not give a good match, an average of the two frames is used. If all else fails, the macroblock can be intracoded. Needless to say, decoding B-frames requires that many I- and P-frames are retained in memory.

MPEG-1 frame sequence

MPEG-1 allows frames to be ordered in any sequence. Unfortunately, a large amount of reordering requires many frame buffers, which must be retained until all dependencies are cleared. The MPEG-1 format allows random access to a video sequence, thus the file must contain regular I-frames. It also allows enhanced modes such as fast forward, which means that an I-frame is required every 0.4 seconds, i.e. 12 frames between each I-frame (at 30 fps). At this rate, a typical sequence is a starting I-frame, followed by two B-frames, a P-frame, followed by two B-frames, and so on. This is known as a group of pictures (GOP):

I B B P B B I B B P B B I B B P ...

When decoding, the decoder must store the I-frame; the next two B-frames must also be stored locally until the P-frame arrives. The P-frame can then be decoded using the stored I-frame, and the two B-frames can be decoded using the I- and P-frames. One solution to this is to reorder the frames so that the I- and P-frames are sent together, followed by the two intermediate B-frames. Another, more radical, solution is not to send B-frames at all and simply use I- and P-frames.

On computers with limited memory and limited processing power, B-frames are difficult because:

- They increase the encoding computational load and memory storage. The inclusion of the previous and future I- and P-frames, as well as the arithmetic average, greatly increases the processing needed, and extra frame buffers are required to hold frames while the encode and decode processes proceed. This argument is, again, less valid with the advent of large, high-density memories.
- They do not provide a direct reference in the way that an I- or P-frame does.

The advantage of B-frames is that they lead to an improved signal-to-noise ratio, because of the averaging of macroblocks between I- and P-frames. This averaging effectively reduces high-frequency random noise. It is particularly useful in lower bit rate applications, but is of less benefit at higher rates, which normally have improved signal-to-noise ratios anyway.
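The transmission reordering described above is mechanical, and can be sketched in a few lines of Python (illustrative only): each I- or P-frame is moved in front of the B-frames that depend on it.

    def bitstream_order(display_order):
        # display_order: frame types in display order, e.g. ['I','B','B','P',...]
        out, pending_b = [], []
        for frame in display_order:
            if frame == 'B':
                pending_b.append(frame)  # hold until the next reference arrives
            else:
                out.append(frame)        # emit the I- or P-frame first...
                out.extend(pending_b)    # ...then the B-frames that needed it
                pending_b = []
        return out + pending_b  # trailing B-frames really wait for the next GOP

    print(bitstream_order(['I', 'B', 'B', 'P', 'B', 'B', 'P']))
    # -> ['I', 'P', 'B', 'B', 'P', 'B', 'B']

This is the I-P-B-B transmission order referred to again in section 7.8.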

7.4.5 DCT conversion

As with JPEG, MPEG-1 uses the DCT. It transforms macroblocks of luminance (16×16) and chrominance (8×8) into the frequency domain. This allows the higher-frequency terms to be reduced, as the human eye is less sensitive to high-frequency changes. This type of coding is the same as that used in JPEG still image conversion, described in the previous chapter.

Frames are broken up into slices 16 pixels high, and each slice is broken up into a vector of macroblocks of 16×16 pixels. Each macroblock contains luminance and chrominance components for each of four 8×8-pixel blocks. Color decimation can be applied to a macroblock, yielding four 8×8 blocks of luminance and two 8×8 blocks (Cb and Cr) of chrominance, i.e. one chrominance value for each group of four luminance values. This is called the 4:2:0 format; two other formats are available (4:2:2 and 4:4:4, respectively two luminance values per chrominance value and one-to-one), which require higher data rates.

For each macroblock, a spatial offset between a macroblock in the predicted frame and the reference frame(s) is given if one exists (a motion vector), along with luminance and/or chrominance difference values (an error term) if needed. Macroblocks with no differences can be skipped, except in intra frames. Blocks with differences are internally compressed using a combination of a discrete cosine transform (DCT) algorithm on the pixel blocks (or error blocks) and variable quantization of the resulting frequency coefficients (rounding values to one of a limited set). The DCT algorithm accepts signed 9-bit pixel values and produces signed 12-bit coefficients. The DCT is applied to one block at a time and works much as it does for JPEG, converting each 8×8 block into an 8×8 matrix of frequency coefficients. The variable quantization process divides each coefficient by a corresponding factor in a matching 8×8 matrix and rounds it to an integer.

7.4.6 Quantization

As with JPEG, the transformed data is divided, or quantized, to remove higher-frequency components and to make more of the values zero. This results in numerous zero coefficients, particularly for the high-frequency terms at the high end of the matrix. Accordingly, amplitudes are recorded in run-length form following a diagonal scan pattern from low frequency to high frequency.

7.4.7 Encoding

After the DCT and quantization stages, the resultant data is compressed using Huffman coding with a set of fixed tables. The Huffman code specifies not only the number of zeros, but also the value that ended the run of zeros. This is extremely efficient at compressing the zigzag-scanned DCT output. A sketch of this whole stage is given below.
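The chain of DCT, quantization and run-length coding can be sketched for a single 8×8 block as follows (a Python illustration; the uniform quantizer and the plain (run, value) pairs are simplifications of the standard's 8×8 quantization matrix and Huffman tables):

    import numpy as np
    from scipy.fft import dctn

    # Zigzag scan order: diagonals from low to high frequency,
    # alternating direction, as in JPEG.
    ZIGZAG = sorted(((i, j) for i in range(8) for j in range(8)),
                    key=lambda p: (p[0] + p[1],
                                   p[0] if (p[0] + p[1]) % 2 else -p[0]))

    def encode_block(block, q=16):
        coeffs = dctn(block.astype(float), norm='ortho')  # 8x8 frequency matrix
        quant = np.rint(coeffs / q).astype(int)           # uniform quantizer
        scan = [quant[i, j] for i, j in ZIGZAG]           # low to high frequency
        pairs, run = [], 0                                # run-length encode
        for v in scan:
            if v == 0:
                run += 1
            else:
                pairs.append((run, v))
                run = 0
        return pairs  # fixed Huffman tables would then code these pairs

    print(encode_block(np.full((8, 8), 10)))  # flat block -> [(0, 5)]: DC only

A flat block reduces to a single DC coefficient, while a detailed block produces many (run, value) pairs; this is exactly why the earlier stages try to drive as many coefficients as possible to zero.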

7.5 MPEG-1 decoder

The resultant encoded bitstream contains both video and audio data. These two elements are identified using system-level coding, which specifies a multiplexed data format allowing multiple simultaneous audio and video streams, as well as privately defined data streams, to be carried together. This coding includes the following:

- Synchronization data for decoded audio and video frames. Each frame carries a time stamp so that the decoder can synchronize the decoding and playback of audio with the correct video sequence, achieving lip synchronization. The time-stamping gives the decoder great flexibility in playback; it even allows variable data rates, where frames can be dropped when they cannot be processed in time, without loss of synchronization. The synchronization is achieved with a 90 kHz reference clock, as illustrated below.
- Random frame access within the stream, with absolute time identification. This is important when decoding, as the time reference can be independent of the environment.
- Data buffer management, to prevent overflow and underflow errors. Frames are not necessarily stored in consecutive time sequence, so buffers must be set up to hold data temporarily for future decoding.
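As a small worked example of the time-stamping (an illustration only; the real system stream wraps these values in packet headers), a frame's presentation time is simply its index converted into ticks of the 90 kHz clock, which divides exactly by all the common frame rates:

    CLOCK_HZ = 90_000  # MPEG system reference clock

    def presentation_time(frame_index, fps):
        # 90 000 is divisible by 24, 25 and 30, so every frame period
        # is an exact integer number of ticks.
        return frame_index * CLOCK_HZ // fps

    print(presentation_time(1, 25))  # 3600 ticks per PAL frame
    print(presentation_time(1, 30))  # 3000 ticks per NTSC frame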

7.6 MPEG-1 audio compression

MPEG-1 audio compression will be covered in Chapter 10.

7.7 MPEG-2

The original MPEG-1 specification proved so successful that, as soon as it was published, the MPEG committee started work on three derivatives, called MPEG-2, MPEG-3 and MPEG-4. MPEG-2 has since been published; MPEG-3 was incorporated into MPEG-2, and work continues on the MPEG-4 standard. The main drawbacks of the MPEG-1 standard are:

- It did not directly support broadcast television pictures as defined in the CCIR-601 specification. In particular, it did not support the interlaced mode of operation, although it could support the larger picture size of 720×480 at 30 fps.
- It was designed for a 1.5 Mbps bitstream.

Interlacing dramatically affects the motion estimation process, because picture components can move from one field to another and vice versa. As a result, MPEG-1 was poor at handling interlaced images.

The main objective of the MPEG-2 standard was flexibility: it supports a number of modes, called profiles, with a wide range of options. These profiles define the algorithms that may be used; each profile has a number of associated levels, which define the parameters used.

7.7.1 MPEG-2 profiles and levels

MPEG-2 defines several profiles to provide a set of known configurations for different applications; it can be used for anything from low-level video conferencing to high-definition television. If it were a unitary standard, every encoder and decoder would have to process the signals for the entire range of applications. This would, for example, burden a video conferencing system with the capability to handle very high-definition images, and the cost of doing so would make MPEG-2 unworkable for video conferencing. Table 7.1 outlines the valid combinations of profile and level. The main profiles are:

- Simple: the same as the main profile, but B-frames are not supported (so it is mainly used in software-based applications).
- Main: supports the main area of current development.
- SNR: enhanced signal-to-noise ratio.
- Spatial: enhanced main profile.
- High.

There are four main levels:

- Low: the low level is similar to the MPEG-1 standard and supports the CIF format of 352×240 at 30 fps (or 352×288 at 25 fps for PAL). This equates to 3.05 Mpixels per second and a bit rate of up to 4 Mbps. Low-level applications are aimed at the consumer market and offer quality similar to a domestic VCR.
- Main: the main level supports a maximum frame size of 720×480 at 30 fps (as defined in the CCIR-601 specification). This equates to 10.4 Mpixels per second and a bit rate of up to 15 Mbps. This level is aimed at the higher-quality consumer market.
- High 1440: the high-1440 level supports a maximum frame size of 1440×1152 at 30 fps. This frame size is four times the CCIR-601 specification and equates to 47 Mpixels per second, giving a bit rate of up to 60 Mbps. This level is aimed at the high-definition TV (HDTV) consumer market.
- High: the high level supports a maximum frame size of 1920×1080 at 30 fps, several times the CCIR-601 specification, and gives a bit rate of up to 80 Mbps. This level is also aimed at the HDTV consumer market.
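These level limits lend themselves to a simple lookup. The following Python sketch (values transcribed from the list above and Table 7.1, a simplification of the standard's full constraint set) picks the smallest level that can carry a given stream:

    # Maximum frame size and bit rate per MPEG-2 level (simplified).
    LEVELS = {
        'low':       {'size': (352, 288),   'mbps': 4},
        'main':      {'size': (720, 576),   'mbps': 15},
        'high-1440': {'size': (1440, 1152), 'mbps': 60},
        'high':      {'size': (1920, 1152), 'mbps': 80},
    }

    def smallest_level(width, height, mbps):
        for name, lim in LEVELS.items():  # ordered smallest to largest
            w, h = lim['size']
            if width <= w and height <= h and mbps <= lim['mbps']:
                return name
        return None  # exceeds even the high level

    print(smallest_level(704, 576, 6))     # -> 'main'
    print(smallest_level(1920, 1080, 20))  # -> 'high'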

Table 7.1 MPEG-2 profiles and levels (maximum frame size and rate; the bracketed sizes are the lower layer of the scalable Spatial and High profiles)

Level             Simple (SP)      Main (MP)          SNR              Spatial                      High (HP)
High (HL)         Illegal          1920×1152, 60 fps  Illegal          Illegal                      1920×1152, 60 fps (960×576)
High-1440 (H-14)  Illegal          1440×1152, 60 fps  Illegal          1440×1152, 60 fps (720×576)  1440×1152, 60 fps (720×576)
Main (ML)         720×576, 30 fps  720×576, 30 fps    720×576, 30 fps  Illegal                      720×576, 30 fps (352×288)
Low (LL)          Illegal          352×288, 30 fps    352×288, 30 fps  Illegal                      Illegal

7.8 MPEG-2 system layer

The MPEG data stream consists of two layers:

- A compression layer.
- A system layer.

The system decoder splits the data stream into video and audio, each processed by a separate decoder. Every 700 ms (or faster), a 33-bit system clock reference (SCR) is inserted into the data. For synchronization, the video and audio clocks are periodically set to the same value, again every 700 ms (or faster), using 33-bit presentation time stamps (PTSs); these serve to invoke a particular picture or audio sequence.

The topmost layer of MPEG-1, the video sequence layer, can be expressed as:

    video sequence is {
        next start code
        repeat {
            sequence header
            repeat {
                group of pictures
            } while (next word in stream is group start code)
        } while (next word in stream is sequence header code)
        sequence end code
    }
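In the bitstream itself, each of these syntactic elements is introduced by a 32-bit start code: the byte pattern 00 00 01 followed by a code byte. As a sketch (the start-code values below are those of MPEG-1 video; the scanner itself is only an illustration), the sequence structure can be located as follows:

    # Selected MPEG-1 video start codes (final byte of 00 00 01 xx)
    START_CODES = {
        0xB3: 'sequence header',
        0xB8: 'group of pictures',
        0x00: 'picture',
        0xB7: 'sequence end',
    }

    def scan_start_codes(data: bytes):
        # Yield (offset, name) for each recognized start code.
        i = data.find(b'\x00\x00\x01')
        while i != -1 and i + 3 < len(data):
            name = START_CODES.get(data[i + 3])
            if name:
                yield i, name
            i = data.find(b'\x00\x00\x01', i + 4)

    sample = b'\x00\x00\x01\xB3' + b'\x00' * 8 + b'\x00\x00\x01\xB8'
    print(list(scan_start_codes(sample)))
    # -> [(0, 'sequence header'), (12, 'group of pictures')]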

The video stream comprises a header, a series of frames and an end-of-sequence code. The stream contains periodic I-frames; these provide full images to be used as periodic references, and so allow reasonably random access to the data stream. Other frames are predicted using either preceding I-pictures (which creates P-pictures) or a combination of preceding and following I-pictures (which creates B-pictures). The encoder decides how I-, B- and P-pictures are interspersed and ordered. A typical sequence would be I-B-B-P. The order of pictures in the data stream is not the order of display; for example, the previous sequence would be sent as I-P-B-B.

7.9 Other MPEG-2 enhancements

MPEG-2 adds, among other features, an alternate scan order, which further improves compression. All control data, vectors and DCT coefficients are further compressed using Huffman-like variable-length encoding.

7.10 MPEG-2 bit rate

With MPEG-2, the uncompressed TV rate (162 Mbps, see Table 7.2) can be reduced to 4 Mbps for PAL and SECAM, and 3 Mbps for NTSC, giving a compression ratio of around 40:1 for high-quality TV. MPEG-1 typically compresses TV signals to 1.2 Mbps, giving a compression ratio of around 130:1; unfortunately the quality is then reduced to near VCR quality. Table 7.2 outlines these parameters.

The base bit rate of a standard Ethernet network is 10 Mbps. This allows compressed video to be transmitted over the network when there is no other traffic: the 4 Mbps rate will load the network by approximately 50% (the raw payload is 40% of 10 Mbps; framing and protocol overheads add the rest). Standards and compression techniques will be discussed in the next chapter.
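The figures above can be checked quickly (a Python one-off using the values in Table 7.2):

    UNCOMPRESSED_TV_MBPS = 162  # from Table 7.2

    for name, mbps in [('MPEG-2', 4), ('MPEG-1', 1.2)]:
        ratio = UNCOMPRESSED_TV_MBPS / mbps
        load = 100 * mbps / 10  # payload share of a 10 Mbps Ethernet
        print(f'{name}: about {ratio:.0f}:1, {load:.0f}% of 10 Mbps')
    # MPEG-2: about 40:1, 40% of 10 Mbps (roughly 50% once overheads are added)
    # MPEG-1: about 135:1, 12% of 10 Mbps (quoted as around 130:1 in the text)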

Table 7.2 Motion video compression

Type              Bit rate   Compression  Comment
Uncompressed TV   162 Mbps   1:1
MPEG-2            4 Mbps     40:1         PAL, SECAM TV quality
MPEG-1            1.2 Mbps   130:1        VCR quality

7.11 Exercises

7.1 Explain the main steps in MPEG coding.

7.2 State how standard TV differs in its interlacing from MPEG-1.

7.3 Explain why a frame must be divisible by 16 in both the x- and y-directions. Also give an example of a frame split into a number of macroblocks.

7.4 Explain how I-, P- and B-frames might be used.