
Impact of scan conversion methods on the performance of scalable video coding

E. Dubois, N. Baaziz and M. Matta
INRS-Telecommunications
16 Place du Commerce, Verdun, Quebec, Canada H3E 1H6

ABSTRACT

The ability to flexibly access coded video data at different resolutions or bit rates is referred to as scalability. We are concerned here with the class of methods referred to as pyramidal embedded coding, in which specific subsets of the binary data can be used to decode lower-resolution versions of the video sequence. Two key techniques in such a pyramidal coder are the scan-conversion operations of down-conversion and up-conversion. Down-conversion is required to produce the smaller, lower-resolution versions of the image sequence. Up-conversion is used to perform conditional coding, whereby the coded lower-resolution image is interpolated to the same resolution as the next higher level and used to assist in the encoding of that level. The coding efficiency depends on the accuracy of this up-conversion process. In this paper, techniques for down-conversion and up-conversion are addressed in the context of a two-level pyramidal representation. We first present the pyramidal technique for spatial scalability and review the methods used in MPEG-2. We then discuss some enhanced methods for down- and up-conversion, and evaluate their performance in the context of the two-level scalable system.

1 INTRODUCTION

Video signals are an important component of many multimedia information services. Due to the immense amount of data associated with video, compression is essential for a practical service. The compressed data stored in the video database are accessed over a broadband network by a variety of client workstations that may have different display capabilities. Thus some users may want to access the data at lower resolution (i.e. picture size) than others. Also, since the cost of the service is sensitive to the amount of data retrieved, some users may want to access the data at a lower bit rate. The ability to flexibly access the data at different resolutions or bit rates is referred to as scalability. This can be provided in a straightforward way by storing a separately encoded copy for each supported format or bit rate. However, this is inefficient and wasteful of storage. Another option is real-time transcoding at the server, but this is excessively complex. Thus, we consider only the class of methods referred to as embedded coding, in which specific subsets of the binary data can be used to decode lower-resolution versions of the video sequence. There are two main schemes for embedded coding: pyramidal coding and subband coding. Because of its greater flexibility, we focus here on the class of pyramidal coding schemes.1 In this method, a series of representations of the video sequence of decreasing resolution is produced. The lowest-resolution version is first coded, and the coded representation forms the first part of the stored data. This can be directly accessed by those clients desiring only the lowest version. This version is locally decoded at the encoder and used as conditional information to encode the next-higher-resolution version of the image sequence. Only information required to refine this representation need be encoded and transmitted. This process is continued until the full-resolution version is reached.

Two key techniques in such a pyramidal coder are the scan-conversion operations of down-conversion and up-conversion. Down-conversion is required to produce the smaller, lower-resolution versions of the image sequence from the original data. The goal is to perform this operation such that the quality of each level is optimized and the users accessing the data at each level get the best possible picture. This involves retaining as much image detail as possible while avoiding aliasing artifacts. Up-conversion is generally used to perform the conditional coding. The lower-resolution image is interpolated to be of the same resolution as the next-higher image. Then, this up-converted decoded image is used as part of the prediction process in encoding the next-higher level. The coding efficiency will depend on the accuracy of this up-conversion process.

In this paper, techniques for down-conversion and up-conversion are addressed in the context of a two-level pyramidal representation. Specifically, the lower level is a 352 by 240, 30 frames per second progressive format (SIF), while the upper level is a 704 by 480, 60 fields per second interlaced format (4:2:0). The interlaced signal used in the high level poses a number of problems. We first provide more detail on the pyramidal technique for spatial scalability, and review the methods used in MPEG-2. We then discuss some enhanced methods for down- and up-conversion, and evaluate their performance in the context of the two-level scalable system. We also address the problem of prediction in the higher level, given the decoded lower level.

2 BACKGROUND

The two-level pyramidal coding system studied here is shown in Fig. 1 and Fig. 2. The original high-resolution sequence is in 4:2:2 ITU-R 601 format, i.e. 2:1 interlaced frames of size 720 pixels per picture width by 480 lines per picture height, and 60 fields per second for the luminance, and with 360 pixels per picture width for the two chrominance components. It is first converted to a 4:2:0 format. This is still an interlaced format, with 704 pixels per line for the luminance, obtained by cropping 8 samples from each side of the picture; the two chrominance components are subsampled vertically, resulting in pictures of size 352 by 240. The lower level of the pyramid is produced by down-conversion to a SIF format. This is a non-interlaced (progressive) format with 352 pixels per line, 240 lines per frame, and 30 frames per second for the luminance. The two chrominance components have 176 pixels per line and 120 lines per frame. This down-converted signal is passed through the base-layer coder to produce the base-layer bitstream. At the encoder, the base-layer bitstream is decoded, and the image up-converted to the input signal picture dimensions. This signal is used as a priori information in the encoding of the high-resolution image. The conditionally-coded upper-layer bitstream is then multiplexed with the lower-layer bitstream for storage or transmission.

A low-resolution receiver accesses only the base-layer bitstream. From this it can decode and display the low-resolution version of the image sequence. A high-resolution receiver accesses both bitstreams. It first decodes the base layer and up-converts it to the full picture size, using the same interpolator as the transmitter. This is used to decode the conditionally-coded upper layer to produce the high-resolution decoded picture.

In this study, we first use the conditional coding method described in MPEG-2.2 In this method, a given picture in the high-resolution sequence is spatially predicted by the up-converted lower-level picture. It may also be temporally predicted by one or more previously coded pictures in the high level, depending on the type of picture to be coded. The final prediction is then either the spatial prediction, the temporal prediction, or a spatiotemporal weighted prediction using a set of pre-defined weights. Some investigations have been conducted in order to optimize this set of weights and make the contribution of the spatial prediction more efficient in a progressive-to-interlace spatial-scalable scheme. The goal is to use the prediction that will give the best coding gain for that picture. MPEG-2 does not specify a down-conversion algorithm, since the algorithm used does not affect the decoding operation.

Figure 1: Functional diagram of video encoder with two levels of spatial scalability

Figure 2: Functional diagram of video decoder with two levels of spatial scalability
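The two-level structure of Figs. 1 and 2 can be sketched end to end as follows. This is an illustrative outline only: a uniform quantizer stands in for the actual MPEG-2 base- and upper-layer coders, simple averaging stands in for the down-conversion filters, and a two-tap (0.5, 0.5) interpolator plays the role of the standard up-converter.

```python
import numpy as np

def down_convert(frame):
    """2:1 down-conversion in both dimensions (averaging stand-in for the
    filtering/decimation chain of Section 3)."""
    return 0.25 * (frame[0::2, 0::2] + frame[1::2, 0::2] +
                   frame[0::2, 1::2] + frame[1::2, 1::2])

def up_convert(lo):
    """2:1 up-conversion with a two-tap (0.5, 0.5) linear filter in each
    dimension (periodic edge handling, for brevity)."""
    h, w = lo.shape
    hi = np.empty((2 * h, 2 * w))
    hi[0::2, 0::2] = lo
    hi[1::2, 0::2] = 0.5 * (lo + np.roll(lo, -1, axis=0))
    hi[:, 1::2] = 0.5 * (hi[:, 0::2] + np.roll(hi[:, 0::2], -1, axis=1))
    return hi

def code_layer(x, step):
    """Toy 'coder': uniform quantization stands in for DCT coding."""
    return np.round(x / step) * step

def scalable_encode(frame, base_step=8.0, enh_step=4.0):
    base = code_layer(down_convert(frame), base_step)   # base-layer data
    pred = up_convert(base)                             # locally decoded + up-converted
    resid = code_layer(frame - pred, enh_step)          # conditionally coded refinement
    return base, resid                                  # the two (multiplexed) layers

def scalable_decode_high(base, resid):
    """A high-resolution receiver decodes both layers."""
    return up_convert(base) + resid
```

A low-resolution receiver would decode `base` alone; the refinement layer is meaningful only relative to the same up-converter used at the encoder, which is why the standard must specify the up-conversion.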

Figure 3: Two-level pyramid system for spatial scalable coding

In general, most coders simply drop every second field from the interlaced sequence. The remaining fields are filtered horizontally and down-sampled by two. This can result in significant vertical aliasing in the down-converted picture. The up-conversion process is specified by the MPEG-2 standard, since both the encoder and decoder must use the same up-converter. For the case where the lower-level picture is progressive (the case considered here), the vertical up-conversion is accomplished by a simple two-tap linear filter with coefficients (0.5, 0.5). The horizontal up-conversion is done in a similar manner. The result is considered to form the two interlaced fields of the corresponding high-resolution frame. This is suboptimal, since the two fields in the high-level picture occur at different times, but they are estimated from a low-resolution picture at a single time instant.

3 SCAN CONVERSION METHODS

3.1 Down-conversion

A spatiotemporal pyramid is derived from the original high-resolution image sequence using windowing, low-pass filtering and down-sampling operations (see Fig. 3). A low-resolution version is not necessarily extracted from the totality of the top-level area (windowing), and the spatiotemporal sampling lattice may be different from one pyramid level to another. In this study, down-conversion is applied to produce a two-level pyramid system, where the top level is a 4:2:0 format and the bottom level is a SIF format. In MPEG-2 the 2:1 horizontal decimation (for both luminance and chrominance) is done using a 7-tap low-pass filter in order to remove the high-frequency components and hence reduce the horizontal aliasing.
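The field-dropping down-conversion just described might be sketched as follows; the 7-tap filter coefficients used here are generic low-pass values chosen for illustration, not the normative MPEG-2 taps.

```python
import numpy as np

def mpeg2_style_down_convert(fields):
    """fields: interlaced sequence of shape (num_fields, H, W).
    Drop every second field, then filter horizontally and decimate by 2."""
    kept = fields[0::2]  # interlace -> progressive by simple field dropping
    # illustrative 7-tap low-pass filter (sums to 1); NOT the normative taps
    taps = np.array([-1.0, 0.0, 9.0, 16.0, 9.0, 0.0, -1.0]) / 32.0
    filtered = np.apply_along_axis(
        lambda row: np.convolve(row, taps, mode='same'), -1, kept)
    return filtered[:, :, 0::2]  # 2:1 horizontal decimation
```

Because no vertical filtering precedes the field dropping, vertical detail above half the field rate aliases; that is the artifact the proposed chain below is designed to remove.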
However, the interlace-to-progressive conversion is achieved simply by dropping every second field of the original sequence. Vertical aliasing is therefore expected to result in reduced quality when converting back to the original 4:2:0 format (and subsequently to the CCIR 601 format). To avoid this, we propose to first perform a spatiotemporal deinterlacing, followed by vertical low-pass filtering and decimation. This is followed by the 2:1 temporal subsampling and horizontal decimation (Fig. 4). Any deinterlacing method can be used, either adaptive or with motion compensation.3
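Under the same illustrative assumptions, the proposed chain of Fig. 4 could be sketched like this; the line-averaging deinterlacer and the short (0.25, 0.5, 0.25) filters are simple placeholders for the adaptive or motion-compensated methods the text allows.

```python
import numpy as np

def deinterlace(field, parity):
    """Rebuild a full frame from one field by line averaging (a stand-in for
    the spatiotemporal or motion-compensated deinterlacers cited in the text).
    parity 0: field holds the even lines; parity 1: the odd lines."""
    h, w = field.shape
    frame = np.empty((2 * h, w))
    frame[parity::2] = field
    frame[1 - parity::2] = 0.5 * (field + np.roll(field, -1, axis=0))
    return frame

def lowpass_decimate(x, axis):
    """(0.25, 0.5, 0.25) low-pass along `axis`, then 2:1 decimation
    (periodic edge handling, for brevity)."""
    y = 0.25 * np.roll(x, 1, axis=axis) + 0.5 * x + 0.25 * np.roll(x, -1, axis=axis)
    return np.take(y, np.arange(0, y.shape[axis], 2), axis=axis)

def proposed_down_convert(fields):
    """fields: (num_fields, H, W). Deinterlace every field, vertical 2:1
    filtering/decimation, temporal 2:1 subsampling, horizontal 2:1 (Fig. 4)."""
    frames = np.stack([deinterlace(f, t % 2) for t, f in enumerate(fields)])
    frames = lowpass_decimate(frames, axis=1)   # vertical filtering + 2:1
    frames = frames[0::2]                       # temporal 2:1 subsampling
    return lowpass_decimate(frames, axis=2)     # horizontal filtering + 2:1
```

The key difference from the MPEG-2-style chain is that the vertical band-limiting happens on deinterlaced frames before any samples are discarded, so the retained fields are free of vertical alias components.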

Figure 4: Proposed Down-Conversion

3.2 Up-conversion

Up-conversion has been investigated in two different contexts: format conversion for the display process, and interpolation for the spatial-scalable process. We are more interested here in the second case. We need to perform an up-conversion from a low-layer SIF format (progressive) to a high-layer 4:2:0 format (interlaced). A first method considers this operation as a purely spatial up-sampling operation, as recommended in the MPEG-2 standard. We have investigated the performance of a cubic interpolation4 as an alternative to the linear approximation proposed in MPEG-2. A second method considers this operation as a spatiotemporal process combining a progressive-to-interlace conversion with a spatial up-sampling. The frame-rate conversion has been performed using adaptive temporal interpolation and motion-compensated interpolation.

3.2.1 Adaptive temporal interpolation

Progressive-to-interlace conversion is a 2:1 temporal up-conversion that creates the missing fields. The latter occur at new temporal positions and have vertically shifted sample positions. Depending on the magnitude of the temporal variations over the sequence, one can choose between a simple temporal filtering method and a motion-compensated interpolation method.

Temporal filtering: We start by creating new fields A' and C' from two adjacent fields A and C by performing a vertical filtering with an even-tap maximally flat digital filter.5 The resulting samples have half-pixel vertically shifted positions. Then we perform a temporal filtering, with a (0.5, 0.5) linear filter, of A' and C' to get a new field B (Fig. 5).

Motion-compensated filtering: A motion vector v is estimated for each sample position of the missing field B. The sample value is then estimated using motion-compensated interpolation.
This is illustrated in Fig. 6. Since the two methods described above perform well in different circumstances, depending on the temporal variations over the image sequence, we introduce an adaptive method based on a motion-indicator test that controls switching between temporal and motion-compensated interpolation. We define the motion indicator M(i, j, t) as an average motion-magnitude value over a centered 3 x 3 window:

    M(i, j, t) = (1/9) \sum_{k=i-1}^{i+1} \sum_{l=j-1}^{j+1} \| v(k, l, t) \|
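The motion indicator defined above, and the switch it controls, can be sketched as follows; the motion-field array convention and the two interpolator callables are illustrative assumptions, while the unit threshold is taken from the mode-decision rule in the text.

```python
import numpy as np

def motion_indicator(v, i, j):
    """M(i, j) = (1/9) * sum over the centered 3x3 window of ||v(k, l)||.
    v: motion field of shape (H, W, 2); assumes 1 <= i <= H-2, 1 <= j <= W-2."""
    window = v[i - 1:i + 2, j - 1:j + 2]
    return float(np.mean(np.linalg.norm(window, axis=-1)))

def interpolate_sample(v, i, j, temporal_interp, mc_interp, threshold=1.0):
    """Pick temporal interpolation in quiet areas and motion-compensated
    interpolation where the local motion magnitude is large."""
    if motion_indicator(v, i, j) < threshold:
        return temporal_interp(i, j)
    return mc_interp(i, j)
```

Per the text, this switch applies to the luminance only; the chrominance fields are always temporally filtered.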

Figure 5: Temporal Conversion

The interpolation mode decision has the following form:

    \hat{I}(i, j, t) = I_t(i, j, t)   if M(i, j, t) < 1
                       I_m(i, j, t)   if M(i, j, t) > 1

where \hat{I}(i, j, t) denotes the interpolated value, I_m(i, j, t) the motion-compensated interpolation value, and I_t(i, j, t) the temporal interpolation value. Note that only temporal filtering is used to interpolate the chrominance fields. Finally, for the horizontal 2:1 up-conversion, a cubic interpolation is applied for both luminance and chrominance components.

3.3 Simulation results

Interpolation performance is evaluated here in the context of spatial-scalable MPEG-2 coding. The enhancement layer is coded at a bit rate of about 4 Mbit/sec, conditionally on a coded base layer at 2 Mbit/sec (6 Mbit/sec total). The results of Fig. 7 for the standard Flowergarden sequence show and compare the PSNR values of the decoded enhancement layer when the linear interpolation (proposed in the MPEG-2 standard) is replaced by a cubic interpolation or by a motion-compensated interpolation. Both new methods perform better than the standard one (compare the corresponding PSNR values). Moreover, cubic interpolation seems to be a good compromise between image quality and implementation complexity.

Figure 6: Motion Compensated Conversion

A second test consisted of coding the down-converted sequences, followed by a simple up-conversion, in order to evaluate the impact on the lower-resolution image at the receiver when using the new down-conversion method. The coding was done at a bit rate of 1.5 Mbit/sec. The PSNR between the recovered sequence and the original one, for both the MPEG-2 and the new down-conversion method, is shown in Fig. 8 for 14 fields. The overall averages were also plotted. There is a larger PSNR in the MPEG-2 method for even fields, due to the smoothing

introduced by deinterlacing and filtering, and a larger PSNR in the new scheme for odd fields, caused by the removal of aliasing. The new method has an overall average higher by 0.28 dB, with a lower PSNR fluctuation. In real-time viewing, a considerable reduction of aliasing was observed in the new method when compared with the MPEG-2 one.

Figure 7: PSNR results for a decoded Flowergarden sequence using different interpolation methods

4 SPATIAL-SCALABLE PREDICTION

In the MPEG-2 spatial-scalable algorithm, an additional spatial prediction mode is introduced. A high-resolution sequence is conditionally coded using a spatial prediction from a decoded low-resolution version. This process requires an up-conversion of correctly selected reference frames. The conditional coding method operates differently, depending on the type of picture being coded. When the current high-resolution picture is Intra-coded, each block (8 by 8) can be either Intra-coded or spatially predicted. In the latter case, no data is transmitted. When the current high-resolution picture is Inter-coded (Predicted or Bidirectionally interpolated), a motion-compensated temporal prediction is made from previously decoded pictures in the upper layer. In addition, a spatial prediction is formed from the base layer. After a decision process, the final prediction, on a macroblock basis (16 by 16), is either a temporal, spatial or spatiotemporal weighted prediction using a set of pre-defined weights (a, b). For each case, difference information (original minus prediction) may be coded and transmitted, except for the spatial prediction mode, where only prediction-type information is transmitted.
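The weighted spatiotemporal prediction can be sketched per macroblock as follows, with (a, b) weighting the spatial (up-converted base-layer) prediction on the top- and bottom-field lines respectively, as the text describes; the even/odd row convention for the two fields is an assumption.

```python
import numpy as np

def spatiotemporal_prediction(spatial, temporal, a, b):
    """Blend the 16x16 spatial and temporal macroblock predictions.
    Top-field lines (even rows) take proportion a of the spatial prediction
    and 1-a of the temporal one; bottom-field lines (odd rows) use b and 1-b."""
    pred = np.empty_like(spatial, dtype=float)
    pred[0::2] = a * spatial[0::2] + (1 - a) * temporal[0::2]
    pred[1::2] = b * spatial[1::2] + (1 - b) * temporal[1::2]
    return pred
```

With the standard weight pairs, b never reaches 1, so the bottom field always retains a temporal component; the modified scheme of Section 4 adds (1, 1), allowing a purely spatial macroblock with coded difference data.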
Consequently, since transmission of difference data increases the image quality, this algorithm gives predominance to the temporal and spatiotemporal modes; i.e. the individual spatial prediction is rarely selected, and is mostly considered as additional information that improves the temporal processing (the spatiotemporal case). When the upper layer is an interlaced sequence and the base layer is progressive, the pair of weights (a, b) can take one of the pre-defined values (1, 0), (0.5, 0), (1, 0.5) and (0.5, 0.5). Here a and b give the proportion of spatial prediction corresponding to the top field and the bottom field, respectively; the proportions of temporal prediction are then 1 - a and 1 - b. In the following, we describe a modified spatial-scalable scheme that improves the spatial prediction contribution for both the Intra- and Inter-coding cases.

Spatial-scalable Intra coding

Spatial-scalable Intra coding has been performed in a similar way as differential or pyramidal coding. The

whole high-resolution picture is spatially predicted from the low-resolution layer; the difference signal is then coded and transmitted. Intra-coded macroblocks are still possible, but are reserved only for image areas that are not covered by the interpolated low-layer level (independent coding). However, coding the difference signal in Intra pictures requires adapted quantization matrices and adequate run-length code tables that are different from those of Intra coding.

Figure 8: PSNR results for a reconstructed coded Calendar sequence (SIF at 1.5 Mbit/sec) using different down-conversion methods

Spatial-scalable Inter coding

The modified spatial-scalable Inter coding still follows the syntax of the spatial-scalable MPEG-2 algorithm, but increases the contribution of the individual spatial prediction mode. An improved set of weights has been introduced in order to give some priority to the spatial prediction selection. Indeed, with the additional pair of weights (1, 1), a macroblock is now allowed to be completely spatially predicted, with transmission of coded difference information. In this way, there is no a priori predominance between temporal and spatial prediction. Moreover, when spatial prediction is selected, the coding cost in terms of transmitted bits concerns mainly the difference information, since no motion information is transmitted.

4.1 Simulation results

The performance of the modified spatial-scalable coding has been evaluated in comparison to the standard spatial-scalable method (MPEG-2). In this simulation, the base layer is MPEG-2 coded at 1.5 Mbit/sec and the enhancement layer is coded at 4.5 Mbit/sec (6 Mbit/sec total). Table 1 gives the percentage of macroblocks where spatial prediction and/or spatiotemporal prediction have been selected (compatibility %). It shows clearly the increased contribution of these modes when the modified scalable algorithm is applied. Table 2 and Fig. 9 give a PSNR comparison between different coding methods (simulcast coding, standard and modified spatial-scalable coding, and single-layer coding). One can see the superiority of the modified method in terms of PSNR

values of the decoded high-layer sequence (Flowergarden sequence). Improvements appear mainly in Intra (I) and Predicted (P) pictures, where the spatial prediction contribution has increased. In Bidirectionally interpolated (B) pictures, bidirectional temporal prediction is still predominant. Future simulations will combine the improved up-conversion methods with the new spatial prediction strategies. More sophisticated conditional Inter-coding algorithms are currently under investigation. The goal is to approach as closely as possible the performance of single-layer MPEG-2 coding (see the 6 Mbit/sec single-layer MPEG-2 coding results in Fig. 9).

Table 1: Percentage of compatibility for spatial-scalable coding of the Flowergarden sequence (60 fields, high layer at 4.5 Mbit/sec, total 6 Mbit/sec, coding structure IBBPBBP...)

Picture type                                       I pictures   P pictures   B pictures
Compatibility % for spatial scalability              77.50 %      71.31 %      8.31 %
Compatibility % for modified spatial scalability     88.50 %      80.26 %      8.61 %
Comparison %                                         11.00 %       8.95 %      0.30 %

Table 2: PSNR comparison of simulcast versus standard and modified spatial-scalable coding versus single-layer coding (60 fields, high layer 4.5 Mbit/sec, total 6 Mbit/sec, coding structure IBBPBBP...)

Coding method                                 I pictures   P pictures   B pictures   Overall
Simulcast (4.5 Mbit/sec)                         30.77        31.19        31.21      31.17
Spatial scalability (4.5 Mbit/sec)               30.86        31.99        31.54      31.62
Modified spatial scalability (4.5 Mbit/sec)      32.30        32.58        31.72      31.99
Single layer (6 Mbit/sec)                        32.67        33.00        32.83      32.87

Figure 9: PSNR comparison of single layer versus standard and modified spatial-scalable coding

5 ACKNOWLEDGEMENTS

This work was supported by a grant from the Canadian Institute for Telecommunication Research (CITR), under the Networks of Centers of Excellence Program of the Canadian Government.

6 REFERENCES

[1] M. Vetterli and M. Uz, "Multiresolution coding techniques for digital television: a review (invited paper)," Multidimensional Systems and Signal Processing, vol. 3, pp. 161-187, 1992.
[2] Draft International Standard ISO/IEC 13818-2, "Coding of moving pictures and associated audio," March 1994.
[3] A. Nguyen and E. Dubois, "Spatio-temporal adaptive interlaced-to-progressive conversion," in Signal Processing of HDTV, IV. Proc. International Workshop on HDTV'92, Kawasaki, Japan, Nov. 18-20, 1992 (E. Dubois and L. Chiariglione, eds.), pp. 749-756, Elsevier, 1993.
[4] R. G. Keys, "Cubic convolution interpolation for digital image processing," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-29, pp. 1153-1160, Dec. 1981.
[5] N. Aikawa, W. Yabiku, and M. Sato, "Design method of maximally flat FIR filter in consideration of even and odd order," in Proc. IEEE Int. Symp. Circuits and Systems, pp. 276-279, 1991.
[6] A. Puri and A. Wong, "Spatial domain resolution scalable video coding," in Proc. SPIE, vol. 2094, pp. 718-729, 1993.