The Essence of Image and Video Compression

1E8: Introduction to Engineering, Introduction to Image and Video Processing
Dr. Anil C. Kokaram, Electronic and Electrical Engineering Dept., Trinity College, Dublin 2, Ireland, anil.kokaram@tcd.ie

Overview

This handout covers the basics of Image and Video compression as follows:

1. What is compression and why is it needed?
2. The simplest possible compression scheme: Run Length Encoding
3. Representing signals by sums of sines and cosines [The Fourier Transform]
4. Transform compression and JPEG
5. Motion estimation and predicting pictures in a sequence
6. Video Compression

2 The need for compression

Consider a typical television image. It consists of 720 pixels in each row, and there are 576 rows. A 4:2:2 (broadcast standard) video frame (as you would get from your Digital Set Top box, or DVD) represents colour as below.

[Figure: chrominance subsampling patterns 4:2:2, 4:1:1 and 4:2:0.]

In one frame there are 720 × 576 + 2 × (360 × 576) = 829440 pixels. As each pixel is represented by one byte, then that is 829440 bytes. At 25 frames/sec this means a bandwidth of 20736000 bytes/sec = 19.78 MB/sec is required to transmit the VIDEO ALONE! This means about 19.78 × 3600/1024 = 70 GB to store one hour of movie. This is the RAW DATA bandwidth.

The available bandwidth for a single Digital television channel is at best 6 Mbits/sec. This is about 30 times smaller than the 20 MB/sec needed. DVD can store at most 4 GB: how does one fit 2 hours of movie on a DVD? Your digital mobile phone can handle maybe 1 Mbit/sec absolute TOPS. That is 180 times smaller than required for video.

Imagine you are a film and TV archive (like www.ina.fr or the BBC or RTE). You need to keep a record of 24 hours of programming on 100's of channels daily for up to 50 years (in the case of the BBC). Hmm... there is not enough space in a town to stack up the CD's needed to store that!

So a mechanism is needed to represent images with fewer bytes than the raw data.
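As a quick check of this arithmetic, here is the whole calculation as a few lines of Python (a sketch; 1 MB taken as 2^20 bytes, matching the figures above):

    # Raw bandwidth of 4:2:2 standard-definition video, one byte per sample.
    width, height = 720, 576
    luma = width * height                 # Y samples per frame
    chroma = 2 * (width // 2) * height    # Cb + Cr, horizontally subsampled
    bytes_per_frame = luma + chroma       # 829440 bytes
    fps = 25
    bytes_per_sec = bytes_per_frame * fps            # 20736000 bytes/sec
    mb_per_sec = bytes_per_sec / 2**20               # ~19.78 MB/sec
    gb_per_hour = mb_per_sec * 3600 / 1024           # ~70 GB per hour
    print(bytes_per_frame, bytes_per_sec, round(mb_per_sec, 2), round(gb_per_hour, 1))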

3 Towards compression

I don't really need 720 × 576 pixels for my 1 inch mobile screen, do I? So I can keep only every 4th pixel and every 4th line (subsampling) for instance, and yield a 180 × 144 picture instead. So now I can show the same picture for 1/16 the storage. Not good enough. Besides, 180 × 144 pictures look really crap on a TV set.

    Format    Total Resolution    Active Resolution    MB/sec
    CCIR 601, 30 frames/sec, 4:3 Aspect Ratio, 4:2:2
    QCIF      214 × 131           176 × 120            1.27
    CIF       429 × 262           352 × 240            5.07
    Full      858 × 525           720 × 485            20.95
    CCIR 601, 25 frames/sec, 4:3 Aspect Ratio, 4:2:2
    QCIF      216 × 156           176 × 144            1.27
    CIF       432 × 312           352 × 288            5.07
    Full      864 × 625           720 × 576            20.74

What if I start to think about mathematical models for pictures...? Then I can send/store the parameters of my model instead of the actual pictures, and if my model is simple, I can store fewer parameters than pixels and get some compression. Hmmm. But pictures look pretty complicated. In fact most interesting pictures tend to be different from other pictures. Otherwise why look? It turns out that you can make some generic statements about images and image sequences:

1. In small, local regions, pixel intensity and colour tend to be the same, or at least slowly varying. For small, think 8 × 8 blocks of pixels.
2. You can construct any picture by adding together a weighted set of pre-defined primitive pictures. These primitive pictures are in fact the 2D equivalent of sines and cosines.
3. In a video sequence, consecutive pictures tend to look the same except for the moving bits.

We'll use these ideas now.

4 Run Length Encoding

Consider that you want to transmit a fax as an image. There are just 2 colours: 0 = black and 1 = white. Let's say your image is as below (the letter H in a binary image).

    1 0 0 0 0 0 0 1
    1 0 0 0 0 0 0 1
    1 0 0 0 0 0 0 1
    1 1 1 1 1 1 1 1
    1 0 0 0 0 0 0 1
    1 0 0 0 0 0 0 1
    1 0 0 0 0 0 0 1
    1 0 0 0 0 0 0 1

Instead of sending every single pixel, since there tend to be long lengths of consecutive repeated pixels (i.e. long runs), we could send a 0 (for instance) followed by the number of times it is repeated. So instead of sending or storing 0 0 0 0 0 0 0 0 for instance, you would store 0 8, the first number being the colour, and the second being the number of times that colour occurred consecutively. Instead of storing 8 bytes, we have stored just 2. We have encoded some raw data of 8 zeros as just 2 bytes. We have achieved a compression factor of 8 : 2 = 4 : 1!

In typical RLE schemes, you do not account for all possible runs. Instead you only allow for runs of length say 0 to 32 for instance. Then a run of length 64 would need to be encoded as 2 runs of length 32. Let's say for our RLE scheme we allow a maximum run length of 8, and the data is either 0 or 1. The image example then can be represented by...

But what about a real/grayscale image? Hmm. RLE might get inefficient if the data is not mostly flat! A block of grayscale values like 0 32 22 2 0 20 20 0 0 30 20 0 8 3 20 5 contains hardly any runs at all.
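A minimal sketch of such a run-length encoder in Python (the helper names are illustrative, not part of any standard; the maximum run length is capped at 8 as in the scheme above):

    def rle_encode(pixels, max_run=8):
        """Encode a flat sequence of pixel values as (value, run_length) pairs.
        Runs longer than max_run are split, as described above."""
        out = []
        i = 0
        while i < len(pixels):
            value = pixels[i]
            run = 1
            while i + run < len(pixels) and pixels[i + run] == value and run < max_run:
                run += 1
            out.append((value, run))
            i += run
        return out

    def rle_decode(pairs):
        """Inverse operation: expand (value, run_length) pairs back to pixels."""
        return [v for v, run in pairs for _ in range(run)]

    row = [0] * 18                  # a run of 18 zeros
    print(rle_encode(row))          # [(0, 8), (0, 8), (0, 2)]

On flat binary data like the fax example this wins easily; on the grayscale block above, almost every pair encodes a run of 1 and the "compressed" data is larger than the original.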

5 Signal Transforms

What if it were possible to change the image in some reversible process, so that we created a result that was easier to compress? In other words we take our data and transform it in some clever way to make RLE work better.

This is related to another idea. Suppose I had a photoalbum/dictionary of all the possible images in the world ever made in the past and ever will be possible in the future. And suppose I gave you a copy of this dictionary in which each image was assigned a number. Then instead of having to send you the raw data, I would just send you the number of the image in the dictionary, and you could look it up and you'd have the picture!

This dictionary would be very large, since pictures come in many flavours. To make a smaller dictionary, you can instead choose images which, when added together, make up the picture you want to send or store. So now to send a picture, the transmitting end has to work out which set of images could be added together to give the picture. Then the transmitter sends the indexes of those images to the receiver. The receiver then looks up the pictures and adds them together to give the received image.

About 200 years ago [1], a guy called Fourier spotted that you could actually do this with any signal. He was working on 1D signals but the same applies to 2D ones.

[1]: No electricity, no computers, no cinema, no television, no hot baths, no baths, no showers. Lice in your hair all the time, no soap, no nylon, no jeans, no flushing toilets, no sewage system...

5.1 Representing signals with waves

The brilliant discovery of Fourier was that any 1D signal can be represented by a weighted sum of sines and cosines. So to make a triangle wave for instance, all you need to do is to add a bunch of sines and cosines together of different frequencies and different amplitudes.

[Figure: four sinusoidal components plotted against time (0 to 5 seconds) summing towards a triangle wave, with the corresponding amplitude-versus-frequency plot showing amplitudes 2/π, 1/π, ... at the component frequencies.]

And he came up with a mathematical formula that says which frequencies and which amplitudes are needed to synthesise a particular signal. Since we all know what sines and cosines look like, we can summarise this signal decomposition with a graph of Amplitude versus Frequency. That graph will tell us how much of each frequency should be added together.

This is the Frequency Spectrum of a signal. Given this graph, Fourier also worked out how to reconstruct the original signal. He discovered a completely reversible transform: The Fourier Transform. It converts or transforms a signal from the time domain into a frequency domain. For audio signals like music this sorta makes intuitive sense; for images and other signals it's less intuitive but no less useful.

150 years later [2] (in the 1960's) people [3] worked out how to use this for Digital signals and how it could be automated with computers. Then Fourier's idea really became super-useful. You see: we can think of the sines and cosines at different frequencies as our dictionary, and the amplitudes as a weight attached to each one. So to transmit some data all you need to do is to work out frequencies and amplitudes and send that instead of the actual raw data. The signals in this special dictionary are called basis functions and the corresponding amplitudes needed are called coefficients.

[2]: People were sorting out the showers, baths, electricity, lice in the meantime.
[3]: A guy with the funny name of Tukey.

So it's a bit like saying: instead of sending the sawtooth wave (in the example above), send instead the graph of amplitude versus frequency. That graph is a whole lot smaller, but it contains all the same information.

Think of this. Suppose I have a music signal which is a pure sine wave lasting 10 secs at 50 Hertz that is represented by a digital signal sampled at 44.1 KHz. This means that my data record is 441 K samples long. Say we're using 16 bit audio, that's 441 K × 2 bytes. Instead of transmitting all 882 K bytes: how could I send the same signal with just 3 bytes?
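Here is a numerical sketch of that idea using numpy (reading the 3 bytes as frequency, amplitude and duration is one interpretation, not spelled out in the handout). The FFT of the 10-second tone is zero everywhere except at a single frequency:

    import numpy as np

    fs = 44100                        # sampling rate (Hz)
    t = np.arange(0, 10, 1 / fs)      # 10 seconds -> 441000 samples
    x = np.sin(2 * np.pi * 50 * t)    # pure 50 Hz tone

    X = np.fft.rfft(x)
    k = np.argmax(np.abs(X))          # index of the single spectral spike
    freq = k * fs / len(x)            # frequency of that spike
    amp = 2 * np.abs(X[k]) / len(x)   # amplitude recovered from the FFT
    print(freq, amp)                  # ~50.0 Hz, amplitude ~1.0

Frequency, amplitude and duration are enough to regenerate all 441000 samples exactly, which is the whole point of transform representations.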

5.2 Image Transforms

With 2D signals things are a bit trickier. 2D sines and cosines look a bit like a wave in a wave tank, or a wave in your bath, or a wave in the sea. Except the wave is a wave in intensity or brightness. The equation for working out how much of each wave you need to make a picture is also a bit tricky. Furthermore, each wave is represented by a complex number. Urgh?

[Figure: a 2D cosine intensity wave f(x, y) = a cos(ω₁x + ω₂y + φ), with a = 1.0, ω₁ = 0.29, ω₂ = 0.1. The wave is directed at 20 degrees off horizontal, the frequency is 0.05 cycles per pel in that direction, and the phase lag φ = 0.]

Instead, electrical/signal processing engineers have come up with a simpler [4] Transform that uses only Cosine waves. This transform, known as the Discrete Cosine Transform, results in only real numbers. It is the basis of JPEG.

[4]: Not really.
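For the curious, the wave in the figure can be synthesised directly from its formula; a numpy sketch using the caption's parameter values:

    import numpy as np

    a, w1, w2, phi = 1.0, 0.29, 0.1, 0.0    # amplitude, spatial frequencies, phase
    y, x = np.mgrid[0:64, 0:64]             # a 64 x 64 pixel grid
    f = a * np.cos(w1 * x + w2 * y + phi)   # brightness wave over the image

    # Direction and frequency of the wave, matching the caption:
    # np.degrees(np.arctan2(w2, w1)) ~ 19 degrees off horizontal
    # np.hypot(w1, w2) / (2 * np.pi)  ~ 0.05 cycles per pel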

5.3 JPEG for First Year Undergraduates

JPEG is based on Transforming 8 × 8 blocks of pixels using the 2D DCT. For a signal of 8 samples, the 8 possible DCT basis functions (the dictionary) are as below.

[Figure: the 8 basis functions of the 8-point DCT, plotted in two panels: rows 1 to 4, and rows 5 to 8.]

The 64 2D DCT basis functions and the 2D DCT of a block in Lenna are shown below.

[Figure: the 8 × 8 grid of 64 2D DCT basis images, and the 2D DCT of an 8 × 8 block from the Lenna image.]

Now we can see that the effect of Transforming a block of pixels is to reduce its overall energy. It's flatter in the DCT space. This means that we have less information to transmit.

Here is what happens if we take every 8 × 8 block in Lenna and transform it with the 2D DCT.

Now we're almost there... You can see that in the Transformed images there are many coefficients that are almost zero. So why transmit or store them at all? If we wanted to reconstruct the image exactly, we would need all these tiny values, but because we know that the Human Visual System can tolerate defects in pictures, we know that maybe we can throw away the small coefficients and keep the big ones and still have a reasonable looking picture.

In fact, in JPEG what is done is to quantise the coefficients with varying degrees of accuracy. So the top left hand corner coefficient is quantised with 32 levels say, while the bottom right hand corner is quantised to 2 levels. This is because low frequency information is more important than high frequency for visual perception. When you set the Quality setting for JPEG in Adobe Photoshop, you are changing the quantisation levels. For low quality, you throw away more information, i.e. you quantise more coarsely. For high quality you keep more information, so you quantise finely.
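A sketch of this transform-quantise-reconstruct chain on a single 8 × 8 block, using scipy's DCT (the quantiser step sizes below are illustrative only, not the real JPEG tables):

    import numpy as np
    from scipy.fft import dctn, idctn

    block = np.random.randint(0, 256, (8, 8)).astype(float)  # stand-in for an image block

    coeffs = dctn(block - 128, norm='ortho')    # 2D DCT of the level-shifted block

    # Coarser steps towards the bottom-right: high frequencies quantised harder.
    u, v = np.meshgrid(np.arange(8), np.arange(8), indexing='ij')
    step = 4 + 4 * (u + v)                      # illustrative quantisation matrix
    q = np.round(coeffs / step)                 # many high-frequency entries become 0

    recon = idctn(q * step, norm='ortho') + 128 # decoder: dequantise and invert

Raising the step sizes (lower Quality) zeros more coefficients and makes the block cheaper to code, at the cost of visible error in recon.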

After that step, JPEG uses RLE to encode each block of coefficients in a zig-zag scan.

     0  1  5  6 14 15 27 28
     2  4  7 13 16 26 29 42
     3  8 12 17 25 30 41 43
     9 11 18 24 31 40 44 53
    10 19 23 32 39 45 52 54
    20 22 33 38 46 51 55 60
    21 34 37 47 50 56 59 61
    35 36 48 49 57 58 62 63

Problems: blocking artefacts and mosquito noise.
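The zig-zag ordering in the table above is easy to generate; a short Python sketch:

    def zigzag_order(n=8):
        """Return (row, col) pairs in JPEG zig-zag order for an n x n block."""
        order = []
        for s in range(2 * n - 1):          # each anti-diagonal has r + c = s
            diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
            order.extend(diag if s % 2 else diag[::-1])
        return order

    # First few entries: (0,0), (0,1), (1,0), (2,0), (1,1), (0,2), ...
    print(zigzag_order()[:6])

Reading the quantised coefficients in this order groups the low frequencies first and leaves the many zero-valued high frequencies in long runs, which is exactly what RLE likes.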

6 Video Compression

All the best codecs for media are based on transforming the data in some way. JPEG2000 is based on a new kind of transform, the Wavelet Transform, discovered only in the late 1980's. Compression of audio (.mp3) is based on the 1D DCT. MPEG (Motion Picture Experts Group) is used for compression of video for DVD or DTV [MPEG 1, 2, 4]. Ireland was a major player in establishing the MPEG-4 standard. Intel Indeo, Apple Quicktime, Divx are all based on MPEGGy ideas.

MPEG is based again on the 8 point DCT, just like JPEG, except... In video most consecutive pictures look the same. So if I knew what one picture looked like, then in theory I could build all the others by slightly adjusting that one. This is called prediction. But things move around in video, so we have to estimate that motion to work out how to shift the pixels around in order to create the next image.

6.0.1 On Motion Compensated Prediction

To understand how prediction can help with video compression, the top row of figure 2 shows a sequence of images of the Suzie sequence. It is QCIF (176 × 144) resolution and at a frame rate of 30 frames/sec. We have already seen that Transform coding of images yields significant levels of compression, e.g. JPEG. Therefore a first step at compressing a sequence of data is to consider each picture separately. Consider using the 2D DCT of 8 × 8 blocks. The DCT coefficients for each frame of Suzie are shown in the second row of figure 2. The use of the DCT on the raw image data yields a compression of the original 8 bits/pel data to about 0.8 bits/pel on each frame. Note that the DCT coefficients have NOT been quantised using the standard JPEG Quantisation matrix, for demonstration purposes.

We know that most images in a sequence are mostly the same as the frames nearby, except with different object locations. Thus we can propose that the image sequence obeys a simple predictive model (discussed in previous lectures) as follows:

    I_n(x) = I_{n-1}(x + d_{n,n-1}(x)) + e(x)        (1)

where e(x) is some small prediction error that is due to a combination of noise and model mismatch. Thus we can measure the prediction error at each pixel in a frame as

    e(x) = I_n(x) - I_{n-1}(x + d_{n,n-1}(x))        (2)

This is the motion compensated prediction error, sometimes referred to as the Displaced Frame Difference (DFD). The only model parameter required to be estimated is the motion vector d(·).
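In code, equation 2 for a block with a single integer motion vector is just a shifted block difference. A sketch (grayscale numpy frames assumed; the helper name is illustrative):

    import numpy as np

    def dfd_block(I_n, I_prev, x0, y0, dx, dy, N=8):
        """Displaced Frame Difference for the N x N block at (x0, y0) in frame n,
        predicted from frame n-1 shifted by the motion vector (dx, dy).
        Assumes the displaced block lies inside frame n-1."""
        target = I_n[y0:y0 + N, x0:x0 + N].astype(float)
        pred = I_prev[y0 + dy:y0 + dy + N, x0 + dx:x0 + dx + N].astype(float)
        return target - pred        # e(x) of equation 2 over the block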

[Figure 1: Explaining how motion compensation works. A block in frame n, its motion vector, and the shifted (prediction) block at the corresponding location in frame n-1 are shown.]

Assume for the moment that we use some process to estimate these vectors. We will look at that later. Figure 1 illustrates how motion compensation can be applied to predict any frame from any previous frame using motion estimation. The figure shows block based motion vectors being used to match every block in frame n with the block that is most similar in frame n-1. The difference between the corresponding pixels in these blocks, according to equation 2, is the prediction error.

In MPEG, the situation shown in figure 1 (where frame n is predicted by a motion compensated version of frame n-1) is called Forward Prediction. The block that is to be constructed, i.e. the block in frame n, is called the Target Block. The frame that is supplying the prediction is called the Reference Picture, and the resulting data used for the motion compensation (i.e. the displaced block in frame n-1) is the Prediction Block.

6.0.2 Image prediction

The fourth row of Figure 2 shows the prediction error of each frame of the Suzie sequence, starting from the first frame as a reference. A three level Block Matcher was used with 8 × 8 blocks and a motion threshold for motion detection of 1.0 at the highest resolution level. The accuracy of the search was ±0.5 pixels. Each DFD frame is the difference between frame n and a motion compensated frame n-1, given the original frame n-1.

[Figure 2: Frames 50-53 of the Suzie sequence processed by various means. From top to bottom row: Original frames; DCT of top row; Non-motion compensated DFD; Motion compensated DFD with backward prediction; DCT of previous row.]

Again, we can compress this sequence of transformed images (including the first I frame) using the DCT of blocks of 8 × 8. Now the amount of data needed per frame is about 0.4 bits/pel. Substantial compression has been achieved over attempting to compress each image separately. Of course, you will have deduced that this was going to be the case, because there is much less information content in the DFD frames than in the original picture data. To confirm that it is indeed motion compensated prediction that is contributing most of the benefit, the 3rd row of figure 2 shows the non-motion compensated frame difference (FD), I_n(x) - I_{n-1}(x), between the frames of Suzie. There is substantially more energy in these FD frames than in the DFD frames, hence the higher bit rate.

6.0.3 Problems with occlusion

A closer look at the DFD frame sequence in row 4 of Figure 2 shows that in frames 52 and 53 (in particular) there are some areas that show very high DFD. This is explained by observing the behaviour of Suzie in the top row. In those frames her head moves such that she uncovers or occludes some area of the background. The phone handset also uncovers a portion of her swinging hair. In the situation of uncovering, the data in some parts of frame n simply does not exist in frame n-1. Thus the DFD must be high. However, the data that is uncovered in frame n typically is also exposed in frame n+1. Therefore, if we could look into the next frame as well as the previous frame, we probably will be able to find a good match for any block, whether it is occluded or uncovered. Using such Bi-directional prediction gives much better image fidelity. This idea is used in MPEG-2. It uses both forward prediction for some frames (P frames) and bidirectional prediction for others (B frames). The sequencing is shown in figure 3. Typically MPEG-2 encodes images in the following order: IBBPBBPBBPBBPI...

[Figure 3: A typical Group of Pictures (GOP) in MPEG-2: I B B P B B P B B I]

I-frames (Intra-coded frames) are encoded just like JPEG, i.e. without any motion compensation. This allows the codec to cope with varying image content... think what would happen if you tried to predict every image in a movie from the first frame. It's not going to work, is it? So I-frames are slipped in every 12 frames or so to give a new reference frame for prediction of the next 12 frames.

6.1 Sledgehammer motion estimation: Block Matching

The most popular, and to some extent the most robust, technique to date for motion estimation is Block Matching (BM). Two basic assumptions are made in this technique:

1. Constant translational motion over small blocks (say 8 × 8 or 16 × 16) in the image. This is the same as saying that there is a minimum object size that is larger than the chosen block size.

2. There is a maximum (pre-determined) range for the horizontal and vertical components of the motion vector at each pixel site. This is the same as assuming a maximum velocity for the objects in the sequence. This restricts the range of vectors to be considered and thus reduces the cost of the algorithm.

The image in frame n is divided into blocks, usually of the same size, N × N. Each block is considered in turn and a motion vector is assigned to each. The motion vector is chosen by matching the block in frame n with a set of blocks of the same size at locations defined by some search pattern in the previous frame.

Given a possible vector v = [dx, dy], we can define the DFD between a pixel in the current frame and its motion compensated pixel in the previous frame as

    DFD(x, v) = I_n(x) - I_{n-1}(x + v)        (3)

Define the Mean Absolute Error of the DFD between the block in the current frame and that in the previous frame as

    MAE(v) = (1/N^2) Σ_{x ∈ Block} |DFD(x, v)|        (4)

We can use Mean Squared Error (MSE) as well, but MAE is more robust to noise. The block matching algorithm then proceeds as follows at each image block, as sketched in code below:

1. Pre-determine a set of candidate vectors v to be tested as the motion vector for the current block.
2. For each v calculate the MAE.
3. Choose the motion vector for the block as that v which yields the minimum MAE.
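A sketch of this full-search procedure in Python (integer-accurate vectors, MAE criterion of equation 4; grayscale numpy frames assumed and names illustrative):

    import numpy as np

    def block_match(I_n, I_prev, x0, y0, N=8, w=7):
        """Full-search BM: return the motion vector minimising the MAE of the
        DFD for the N x N block at (x0, y0), over displacements up to +/- w pels."""
        target = I_n[y0:y0 + N, x0:x0 + N].astype(float)
        best_v, best_mae = (0, 0), np.inf
        for dy in range(-w, w + 1):
            for dx in range(-w, w + 1):
                ys, xs = y0 + dy, x0 + dx
                if ys < 0 or xs < 0 or ys + N > I_prev.shape[0] or xs + N > I_prev.shape[1]:
                    continue    # candidate block falls outside frame n-1
                pred = I_prev[ys:ys + N, xs:xs + N].astype(float)
                mae = np.abs(target - pred).mean()    # equation 4
                if mae < best_mae:
                    best_v, best_mae = (dx, dy), mae
        return best_v, best_mae

The double loop visits all (2w + 1)^2 candidate displacements, which is exactly the cost counted in section 6.1.1 below.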

[Figure 4: Motion estimation via Block Matching. The marked positions in frame n-1 are searched for a match with the N × N block in frame n. One block to be examined, located at displacement [1, 2], is shaded.]

The set of vectors v in effect yield a set of candidate motion compensated blocks in the previous frame n-1 for evaluation. The separation of the candidate blocks in the search space determines the smallest vector that can be estimated. For integer accurate motion estimation the position of each block coincides with the image grid. For fractional accuracy, blocks need to be extracted between locations on the image grid. This requires some interpolation. In most cases Bilinear interpolation is sufficient.

Figure 4 shows the search space used in a full motion search technique. The current block is compared to every block of the same size in an area of size (2w + N) × (2w + N). The search space [5] is chosen by deciding on the maximum displacement allowed: in Figure 4 the maximum displacement estimated is ±w for both horizontal and vertical components. The technique arises from a direct solution of equation 1. The BM solution can be seen to minimise the Mean Absolute DFD (or Mean Square DFD) with respect to v, over the N × N block. The chosen displacement d satisfies the model equation in some average sense.

6.1.1 Computation

The Full Motion Search is computationally demanding. Given a maximum expected displacement of ±w pels, there are (2w + 1)^2 searched blocks (assuming integer displacements only). Each block considered requires on the order of N^2 operations to calculate the MAE. This implies N^2 (2w + 1)^2 operations per block for an integer accurate motion estimate. Several reduced search techniques have been introduced which lessen this burden. They attempt to reduce the operations required either by reducing the locations searched or by reducing the number of pixels sampled in each block. However, reduced searches may find local minima in the DFD function and yield spurious matches.

[5]: There are (2w + 1)^2 searched locations.

[Figure 5: Illustration of searched locations (the central pixel of each searched block is shown) in Three-step BM (left) and Cross-search BM (right). The search window extent is shown in red for Cross-search. The best matches at each search level are circled in blue.]

6.1.2 Three step search

The simplest mechanism for reducing the computational burden of Full Search BM is to reduce the number of motion vectors that are evaluated. The Three-step search is a hierarchical search strategy that evaluates first 9, then 8 and finally again 8 motion vectors to refine the motion estimate in three successive steps. At each step the distance between the evaluated blocks is reduced. The next search is centred on the position of the best matching block in the previous search. It can be generalised to more steps to refine the motion estimate further. Figure 5 shows the searched blocks in frame n-1 for this process.

6.1.3 Cross Search

The cross search is another variant on the subsampled motion vector visiting strategy. It changes the geometry of the search pattern to a '+' or 'x' pattern. Figure 5 shows the searched blocks in frame n-1 for this process. If the best match is found at the centre of the search pattern, or at the boundary of the search window, then the search step is reduced.
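A sketch of the Three-step search (step sizes 4, 2, 1, giving a search range of about ±7 pels; border checks are omitted for brevity, so the block is assumed to stay inside the frame):

    import numpy as np

    def mae(I_n, I_prev, x0, y0, dx, dy, N=8):
        """MAE of the DFD for one candidate vector (dx, dy)."""
        target = I_n[y0:y0 + N, x0:x0 + N].astype(float)
        pred = I_prev[y0 + dy:y0 + dy + N, x0 + dx:x0 + dx + N].astype(float)
        return np.abs(target - pred).mean()

    def three_step_search(I_n, I_prev, x0, y0, N=8):
        """Evaluate 9 candidates, then 8, then 8, halving the step each time.
        Each search is centred on the best match of the previous step."""
        cx, cy = 0, 0                       # current best vector
        for step in (4, 2, 1):
            candidates = [(cx + sx * step, cy + sy * step)
                          for sx in (-1, 0, 1) for sy in (-1, 0, 1)]
            cx, cy = min(candidates,
                         key=lambda v: mae(I_n, I_prev, x0, y0, v[0], v[1], N))
        return cx, cy

Only 25 MAE evaluations are needed instead of the 225 a full ±7 search would take, at the risk of locking onto a local minimum of the DFD.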

6.1.4 Problems

The BM algorithm is noted for being a robust estimator of motion, since noise effects tend to be averaged out over the block operations. However, if there is no textural information in the two blocks compared, then noise dominates the matching process and causes spurious motion estimates. This problem can be isolated by comparing the best match found (E_m) to the no motion match (E_0). If these matches are sufficiently different then the motion estimate is accepted, otherwise no motion is assumed. A threshold acts on the ratio r_b = E_0 / E_m. The error measure used is the MAE. If r_b < t, where t is some threshold chosen according to the noise level suspected, then no motion is assumed. This algorithm verifies the validity of the motion estimate once motion is detected.

The main disadvantages of Block Matching are the heavy computation involved (although these are byte wise manipulations) and the motion averaging effect of the blocks. If the blocks chosen are too large, then many differently moving objects may be enclosed by one block, and the chosen motion vector is unlikely to match the motion of any of the objects. The advantages are that it is very simple to implement [6] and it is robust to noise due to the averaging over the blocks.

[6]: It has been implemented on Silicon for video coding applications.

There are many more useful motion estimators than this. These others do give you motion better matched to what is actually going on in the scene. But we will not look at these here.

6.2 Video codec issues

DVD and DTV both use MPEG-2, and the core is exactly as described here. MPEG-2 became a standard around 1992, and just 4 years later Digital Television was a reality. This is quite amazing considering that the advances in research in video compression that made this possible were only really about 5 years old at the time. Compare that to the 200 years it took Fourier to be really appreciated! Mobile phone video communications will use MPEG-4 (established around 1998). Unfortunately that is going through some teething trouble at the moment.

Sadly, the creation of MPEG standards is not as simple as motion estimation, DFD, DCT, quantisation and transmission. When you actually start to think about putting together codecs, the following issues arise.

Compression: There are at least three fundamentally different types of multimedia data sources: pictures, audio and text. Different compression techniques are needed for each data type. Each piece of data has to be identified with unique codewords for transmission.

Sequencing: The compressed data from each source is scanned into a sequence of bits. This sequence is then packetised for transport. The problem here is to identify each different part of the bitstream uniquely to the decoder, e.g. header information, DCT coefficient information.

Multiplexing: The audio and video data (for instance) have to be decoded at the same time (or approximately the same time) to create a coherent signal at the receiver. This implies that the transmitted elementary data streams should be somehow combined so that they arrive at the correct time at the decoder. The challenge is therefore to allow for identifying the different parts of the multiplexed stream and to insert information about the timing of each elementary data stream.

Media: The compressed and multiplexed data has to be stored on some DSM (Digital Storage Media) and then later (or live) broadcast to receivers across air or other links. Access to different Media channels (including DSM) is governed by different constraints, and this must somehow be allowed for in the standards description.

Errors: Errors in the received bitstream invariably occur. The receiver must cope with errors such that the system performance is robust to errors, or it degrades in some graceful way.

Bandwidth: The bandwidth available for the multimedia transmission is limited. The transmission system must ensure that the bandwidth of the bitstream does not exceed these limits. This problem is called Rate Control, and applies both to the control of the bitrate of the elementary data streams and the multiplexed stream.

Multiplatform: The coded bitstream may need to be decoded on many different types of device with varying processor speeds and storage resources. It would be interesting if the transmission system could provide a bitstream which could be decoded to varying extents by different devices. Thus a low capacity device could receive a lower quality picture than a high capacity device, which would receive further features and higher picture quality. This concept, applied to the construction of a suitable bitstream format, is called Scalability.

What we have covered here is the core of the standard used for image and video compression. This just says how the data itself is compressed. If you open up an .avi or .mpg file, you will not see this data in that same form. It has to be encoded into symbols, and timing and copyright information embedded at the very least. This makes the design of codecs a tricky business. But it is certainly true that without standards, there would be no business in video communications.

Finally, note that none of the compression standards actually describe how you do the things you have to do. They just describe how to represent bits and package them. So you can use cleverer DCTs or cleverer motion estimators to get better speed and performance. That is why one manufacturer's codec could be better than another's, even though they both create compressed video according to the same standard.