Advanced Scalable Hybrid Video Coding


Poznań University of Technology (Politechnika Poznańska)
Faculty of Electrical Engineering
Institute of Electronics and Telecommunications
Division of Multimedia Telecommunications and Radioelectronics
ul. Piotrowo 3A, Poznań

Łukasz Błaszak

Advanced Scalable Hybrid Video Coding

Doctoral Dissertation

Advisor: Prof. Marek Domański

Poznań, 2006

Polish title: Zaawansowane Skalowalne Hybrydowe Kodowanie Sygnałów Wizyjnych

Doctoral dissertation submitted to the Council of the Faculty of Electrical Engineering, Poznań University of Technology

Advisor: Prof. dr hab. inż. Marek Domański

Contents

List of symbols and abbreviations

Chapter 1 Introduction
    Scope of the dissertation
    Thesis and goals of the dissertation
    Overview of the dissertation

Chapter 2 Advanced Video Coding
    Introduction
    Coding Tools
    Logical layers
    Spatial prediction
    Temporal prediction for variable block sizes
    Motion vectors prediction
    Motion estimation
    Integer transform
    In-loop deblocking filter
    Adaptive entropy coding
    Improvements of interlaced video coding
    Multiple resolution tool
    Fading compensation
    Switching frames technique
    Error resilience
    Summary

Chapter 3 Scalable Video Coding
    Introduction
    Basic groups of scalable video codecs
    Wavelet video codecs
    Classification
    Development of wavelet video coding
    Implementation of motion-compensated temporal filtering
    Scalability in wavelet video coding
    Hybrid scalable coding
    Temporal scalability
    Spatial scalability
    Quality scalability

Chapter 4 Multilayer Advanced Video Coding
    Introduction
    Spatial Scalability
    Temporal scalability scenarios for scalable codec
    Interpolation and Decimation

Chapter 5 Spiral scan
    Introduction
    Spiral scan in video compression
    Spiral Scan for Quality Scalability in AVC Codecs
    Introduction
    Intra-frame prediction
    Inter-frame prediction
    CABAC coding
    Model of codec with quality scalability
    Overview
    Spiral Scan in AVC - Complexity

Chapter 6 Codecs implementation parameter setting
    Introduction
    Determining Huffman codes for symbols representing encoding modes for scalable H.263 codec
    Determining encoding modes hierarchy for scalable H.264 codec
    Determining k parameter for edge-adaptive bi-cubic interpolation for scalable H.264 codec

Chapter 7 Experimental assessment of the scalable video codecs
    Introduction
    Assessments using objective measure
    Experiments for H.263 codec with spatio-temporal scalability
    Testing of H.264 codec with spatio-temporal scalability
    Comparison to simulcast and non-scalable H
    Comparison to encoder proposed by the MPEG: JSVM
    Experiments: comparison of non-scalable H.264 with raster scan and with spiral scan
    Intra-frame coding test results
    Inter-frame coding test results
    Testing the H.264 with raster scan and spatial, temporal and quality scalability
    Assessment using subjective measure
    Introduction
    Testing the scalable model with quality scalability based on JM 7.3 reference software
    Testing the scalable model with quality scalability based on JSVM 1.0 software

Chapter 8 Conclusions
    Summary
    Original achievements

References
    Author's contributions
    References used or mentioned in the dissertation

Annex A Intra prediction
    A.1. Intra prediction for chroma blocks
    A.2. Intra prediction for luma 16×16 pixel blocks

Annex B Proposal for slice header syntax

Abstract

This dissertation deals with the coding of digital video sequences using advanced scalable video codecs. An advanced scalable video codec based on a multilayer coding structure is proposed. The proposed codec provides spatial, temporal and fine granular scalability. The dissertation proposes an innovative adoption of a modified macroblock coding order as a tool for fine granularity scalability. Several coding tools have been adapted to the spiral scan of macroblocks. These modifications increase encoding efficiency when the spiral scan is used. The encoder with the spiral scan and modified tools has the same coding efficiency as the encoder with the raster scan of macroblocks. The model of the advanced video codec, proposed and tested by the author and described in this dissertation, is based on the verification model of the non-scalable H.264/AVC codec. The proposed codec is fully compatible with its non-scalable predecessor (H.264/AVC). Objective and subjective assessments of encoding efficiency have been performed for the individual coding techniques as well as for the whole scalable codec. The coding efficiency has been compared to that of other well-known scalable video coding techniques.

Streszczenie (Summary)

The dissertation addresses the compression of video sequences with advanced scalable codecs. An advanced video codec with a multi-loop structure is proposed. The codec implements temporal, spatial and fine granular scalability. The work proposes a novel use of a modified macroblock coding order as a tool for fine granular scalability. Various coding tools have been adapted to the spiral ordering of coded macroblocks. These modifications increase the efficiency of the encoder with spiral ordering, so that it is no lower than that of the encoder with the standard macroblock ordering. The model of the advanced scalable codec, built and tested by the author and described in this work, is based on the verification model of the non-scalable H.264/AVC codec. The proposed codec is fully compatible with the non-scalable H.264/AVC codec. Objective and subjective assessments of efficiency are presented both for the individual coding techniques and for the whole scalable codec. This efficiency has been compared with that of other known contemporary methods of scalable video compression.

List of symbols and abbreviations

2-D - two-dimensional
3-D - three-dimensional
4CIF - progressive 4:2:0 video sequence of 704×576 pixels
AC - DCT coefficient for which the frequency in one or both dimensions is non-zero
AMC-FGS - Adaptive Motion-Compensated Fine Granularity Scalability
ASO - Arbitrary Slice Order
AVC - Advanced Video Coding
bitrate - number of bits per second
B-frame - bi-directionally inter-frame encoded frame used in non-scalable coding and in the base layer of scalable coding
BE-frame - B-frame occurring only in the enhancement layer; it is not used as a reference frame
BR-frame - B-frame occurring only in the enhancement layer; it is used as a reference frame
CABAC - Context-based Adaptive Binary Arithmetic Coding
CAVLC - Context Adaptive Variable Length Coding
CIF - progressive 4:2:0 video sequence of 352×288 pixels
CoI - Centre of Interest
DC - DCT coefficient with zero frequency in both dimensions
DCT - Discrete Cosine Transform
DP - Data Partitioning
DPCM - Differential Pulse Code Modulation
DWT - Discrete Wavelet Transform
EZBC - Embedded image coding algorithm using ZeroBlocks of subband/wavelet coefficients and Context modeling
EZW - Embedded Zerotree Wavelet
FGS - Fine Granularity Scalability
FIR - Finite Impulse Response
FMO - Flexible Macroblock Order
fps - frames per second
GOP - Group of Pictures
HH - high-high spatial frequency subband
HL - high-low spatial frequency subband
IBMATF - In-Band Motion Aligned Temporal Filtering
IDR - Instantaneous Decoding Refresh
I-frame - Intra-frame encoded frame
IDCT - Inverse Discrete Cosine Transform
Intra - intra-frame
ITU - International Telecommunication Union
JM - Joint Model
JSVM - Joint Scalable Video Model
JVT - Joint Video Team
Kbps - kilobits per second
LBR - Low Bit Rates
LH - low-high spatial frequency subband
LL - low-low spatial frequency subband
LPS - Least Probable Symbol probability
LZW - Lempel-Ziv-Welch
MBAFF - MacroBlock Adaptive Frame Field
MC-EZBC - Motion Compensated Embedded image coding algorithm using ZeroBlocks of subband/wavelet coefficients and Context modeling
MCTF - Motion Compensated Temporal Filtering
MOS - Mean Opinion Score
MPEG - Moving Picture Experts Group
MV - full-frame motion vectors
NAL - Network Abstraction Layer
PFGS - Progressive Fine Granularity Scalability
PicAFF - Picture Adaptive Frame Field
PSNR - Peak Signal to Noise Ratio
QCIF - progressive 4:2:0 video sequence of 176×144 pixels
QP - Quantization Parameter
RD - Rate Distortion
RoI - Region of Interest
RS - Redundant Slices
SDMATF - Spatial-Domain Motion Aligned Temporal Filtering
SNR - Signal to Noise Ratio
SPIHT - Set Partitioning in Hierarchical Trees
SSMM - Single Stimulus MultiMedia
SVM - Scalable Video Model
UVLC - Universal Variable Length Coding
VCEG - Video Coding Experts Group
VCL - Video Coding Layer
VLC - Variable Length Coding

Chapter 1
Introduction

1.1. Scope of the dissertation

Since the early 1990s, significant progress has been made in digital video coding techniques. Up to the present day, the efficiency of encoding techniques has improved considerably. Although only a few generations of video codecs have been developed during that time, their performance and applications have been extended significantly. Among the various proposed video coding algorithms, the major technology accepted by the commercial market has been hybrid coding with motion-compensated prediction and block-based transform coding.

Each newly developed video coding technique was subjected to a standardization process. The three groups of standards are listed below:
- MPEG-1 [ISO93], H.261 [ISO9];
- MPEG-2 [ISO94], MPEG-4 [ISO94], H.263 [ISO96];
- H.264 [ISO-AVC], VC-1 [SMPTE5].

Each consecutive standard covered techniques of significantly better encoding performance, but algorithm complexity has grown together with performance. The techniques developed during that time were mostly related to non-scalable coding, i.e. if the available throughput is smaller than the required bitrate, transmission is not possible.

The appearance of new network technologies creates a problem for video coding and video transmission. Because various network technologies are interconnected, the channel capacity between a video transmitter and a video receiver may vary in time or depend on the receiver location. Thus, it has become difficult to estimate the bitrate available for encoded video bitstreams. This is where scalable video technology can help. Scalable video coding is the coding of embedded bitstreams, each representing a

different level of quality. Thus, a decoder of scalable video bitstreams is able to decode a video sequence using the whole received bitstream or only a part of it. As consecutive embedded bitstreams are received, the quality of the decoded video sequence improves.

Scalability was already present in MPEG-2, but the technique used in that standard was inefficient and was therefore rarely used. Meanwhile, better techniques have been developed [Domc, Dom1, Li1, Rad99b].

The most recently developed and standardized technology is advanced video coding. The many new tools it comprises make it very efficient in comparison to earlier technologies. Two main advanced video coding technologies, developed at the same time, have been standardized: H.264 and VC-1. Advanced video coding is very flexible because of the multiplicity of coding tools it consists of, but it lacks scalability. There is a need for a tool that would provide a codec with scalability while maintaining high encoding efficiency.

There are different types of scalability and application scenarios. The quality may be reduced by dropping some video frames as well as by decreasing the spatial resolution. There is also so-called SNR scalability, where the temporal and spatial resolution remain unchanged but the amount of detail in the video sequence is reduced. Spatial scalability should be applied when the receivers' resolutions differ, for example when one is a standard TV monitor and another is a cell phone. Temporal scalability may be used wherever the picture quality and resolution should be the highest, for example in security systems.
SNR scalability may be used for broadcast TV and internet TV, where the decoded video sequence is expected to be fluent and at constant resolution.

1.2. Thesis and goals of the dissertation

Goals:

The goals of the work are the following:
- To propose tools which allow adding the functionality of scalability to advanced video coding techniques.

- To propose a consistent technology of scalable video coding that is as compliant as possible with the existing standards of advanced video coding.
- The proposed scalable video coding technology should provide high compression efficiency, close to that of modern advanced video coding.
- The proposed scalable video coding techniques should be assessed by experimental comparison to the existing advanced video coding techniques.

Requirements:
- The proposed tools for scalable coding should be as simple as possible in order not to increase the complexity of the codec structure excessively.
- The proposed techniques should be suitable for systems with low encoding and decoding delays.
- The encoder with the proposed tools embedded should be backward compatible with the non-scalable decoders mentioned above.

Thesis:

It is possible to enhance advanced video codecs for scalable coding and achieve high compression performance by the use of a limited set of new tools.

Methodology:

The proposal of new tools and techniques will be prepared on the basis of studies of the bibliographic references as well as experience with advanced video coding and with designs of scalable codecs for classic coding technology. The proposed tools and techniques will be assessed by means of a set of experiments. The experiments will be performed using the experimental model designed and built by the author. This experimental model will be software based on existing software verification models for advanced video coding.

There are two widely used standard advanced video codecs:
- AVC (an ISO/IEC standard, known also as H.264) [ISO-AVC],
- SMPTE VC-1 (M421)/Windows Media 9 [SMPTE5].

The first codec has been well documented in an international standard since its very beginning. Moreover, a reference software implementation is publicly available. The specification of the other one became public only very recently and was not available at the time of the author's work on this doctoral dissertation. Therefore, the AVC/H.264 codec has been chosen as the reference for the experiments.

Two experimental software implementations have been used by the author. The first is the verification model of a scalable codec, created by the author together with other scientists from Poznań University of Technology on the basis of the AVC/H.264 reference software ver. 7.3 (JM 7.3). Another scalable codec verification model was also used in this dissertation: the recently available JSVM [ISO-JSVM] (Joint Scalable Video Model) ver. 2.0, extended with the tools proposed by the author.

1.3. Overview of the dissertation

This dissertation describes the results of the research that the author carried out while he was an active developer of the new international scalable advanced video coding standard. The research was constructive, i.e. a proposal of a codec with a set of tools is described. This codec and its coding tools were proposed during the standardization process. The resulting codec exhibits coding efficiency similar to that of the codec which has been chosen as the international standard.

This doctoral dissertation consists of eight chapters, including this introduction. Chapter 2 describes advanced video coding techniques in general, the new tools used for video encoding and the main leading technologies used nowadays. Chapter 3 describes scalable video coding techniques which have been developed recently, including techniques which are still under development. Chapter 4 contains the description of the multilayer advanced video codec developed by the author. The description includes the generic structure of the codec and the author's inventions.
Next, Chapter 5 describes a specific order of macroblock scan, with special modifications of data prediction, proposed by the author for hybrid DCT-like codecs. Chapter 6 shows how the author has set the parameters for the verification models. The results of several experiments used for setting the codec parameters are included there. Chapter 7 gathers all experiments concerning verification of the proposed scalable video codec's performance. Chapter 8 summarizes this doctoral dissertation.

Chapter 2
Advanced Video Coding

2.1. Introduction

The history of hybrid video codecs begins in the early 1980s. At the beginning, DPCM, scalar quantization and variable-length coding were used for video compression. Those tools were used to define the first international standard of digital video coding, ITU-T (ex-CCITT) Rec. H.120. In the late 1980s, motion compensation and background prediction were introduced to video coding, and they were also added to the second version of the H.120 standard. Although some of these tools are still in use, the H.120 standard itself is essentially not in use today.

Since the early 1990s, each frame of a video sequence has been partitioned into blocks. Each block is encoded separately by use of a motion-compensated DCT-like transform; the frequency-domain coefficients are then quantized and encoded by means of a Huffman entropy coder. Based on those techniques, the first widespread practical success of digital video coding was H.261 [ISO9], with version 1 in 1990 and version 2 in 1993. Soon after, in the same year, MPEG-1 [ISO93] was introduced, followed in 1994 by MPEG-2/H.262 [ISO94]. The H.263 standard [ISO96] was released in three versions: version 1 in 1995, version 2 in 1998 and version 3 in 2000. Meanwhile, MPEG-4 (Part 2) [ISO99] was proposed in 1999. All these standards were designed for specific applications such as video conferencing, broadcast television, etc.

In the 1990s there was great progress in video compression. Many new techniques were developed and enhanced, such as motion vector prediction, spatial

prediction, filtering of block transform artifacts, and the application of multiple reference pictures for motion estimation. Motion estimation accuracy was enhanced by interpolation of the reference picture. At the end of the 1990s object-based coding was proposed, in which the picture is treated as a scene containing various objects. Each object can be encoded separately and its position can be freely changed at the decoder side. At the same time body and face animation were proposed. Object-based coding and body and face animation became part of the MPEG-4 (Part 2) [ISO99] standard.

At the beginning of this century scientists decided to create a new video compression technique which would combine the most successful existing tools with new ones. The most powerful tools were taken; some were modified, such as block subpartitioning, motion vector prediction, spatial prediction and deblocking filtering, and some new tools were added, such as a fully reversible integer transform and new techniques of entropy coding (CABAC, CAVLC). On the basis of those tools two competing video coding standards were proposed: ITU-T H.264/AVC [ISO-AVC] and VC-1 [SMPTE5]. Through the standardization process the application range of these codecs became quite wide.

All the above techniques were initially designed to serve different application domains. H.261 was designed for video telephony over networks such as ISDN (Integrated Services Digital Network), while H.263 and MPEG-4 were intended for network video communication over networks such as the PSTN (Public Switched Telephone Network), but also over the Internet and mobile networks. MPEG-1 was used for consumer video on CD, while MPEG-2 was used on DVD. The latter was mainly designed for broadcast of standard-definition or high-definition TV but, together with MPEG-4, it was also used for network video communication in ATM (Asynchronous Transfer Mode) networks.
Over the years in which these techniques were widely used, the borders between their initial applications disappeared.

2.2. Coding Tools

The newest and most advanced coding technology uses various tools and techniques to achieve the best encoding performance. Currently, there are two codecs based on the architecture of a motion-compensated DCT-like transform which may be classified as advanced video codecs: H.264/AVC and Windows Media 9. Some of the tools and techniques they use are very similar, and although they use different algorithms they achieve almost the same encoding efficiency. Because most of the techniques used by these codecs are similar, they will be described on the basis of one of the codecs.

2.2.1. Logical layers

Advanced video codecs were designed to cover a wide range of applications. This means that such codecs have to be able to produce bitstreams that can be transmitted over almost any kind of network. Because of this, the structure of the encoded bitstream needs to be very adaptive. A good example of such a structure is provided by the H.264/AVC codec. The hierarchical structure of a video sequence in H.264/AVC [Wie3a, Sul4] is as follows:
- The sequence is composed of pictures;
- The picture is divided into slices, which can be of different sizes;
- The slices are composed of macroblocks, which are blocks of 16×16 pixels;
- Each macroblock can be divided into partitions, which are blocks of 16×8, 8×16 or 8×8 pixels;
- The partitions of 8×8 may be divided into subpartitions, which are blocks of 8×4, 4×8 or 4×4 pixels.

Moreover, the H.264 advanced video codec defines two separate abstract layers: the Video Coding Layer (VCL), which corresponds to the slice-layer bits, and the Network Abstraction Layer (NAL), which corresponds to the higher-layer bits. The NAL was designed to easily adapt the produced bitstreams to a variety of delivery frameworks (e.g. broadcast, wireless, and storage media).

2.2.2. Spatial prediction

Spatial prediction of macroblock samples is a technique used in H.264/AVC [ISO-AVC] as well as in the Windows Media 9 encoder [SMPTE5].

Intra coding [VCEG-N54, VCEG-O31, VCEG-O48 and Ric3] may exploit the spatial correlation among pixels by means of spatial prediction. There are several algorithms of spatial prediction, based on full macroblocks (16×16 blocks), on 8×8 blocks or even on 4×4 blocks. The pixels may be predicted from various spatial directions. As an example see Fig. 2.1, which shows the directions of spatial prediction used in the H.264 encoder.

Fig. 2.1. Example of directions for spatial prediction of a 4×4 block in the H.264/AVC standard.

2.2.3. Temporal prediction for variable block sizes

Temporal prediction can be made for variable block sizes. For instance, H.264/AVC [ISO-AVC] uses the following block sizes: 16×16, 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4 (see Fig. 2.2). Separate motion estimation can be performed for each block.

Fig. 2.2. Macroblock partitioning for motion estimation and compensation: macroblock partitions, and sub-macroblock partitions of an 8×8 block.
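
As a minimal sketch of the partitioning described above, the following Python snippet tiles a 16×16 macroblock into partitions and, for the 8×8 case, splits every partition into sub-partitions. The function names, the `(x, y, width, height)` tuple layout and the simplification that all four 8×8 partitions share one sub-partition size are assumptions made for illustration; they are not taken from the H.264/AVC reference software.

```python
# Allowed block sizes (width, height) as listed in the text.
MACROBLOCK_PARTITIONS = {(16, 16), (16, 8), (8, 16), (8, 8)}
SUB_PARTITIONS = {(8, 8), (8, 4), (4, 8), (4, 4)}


def tile(x0, y0, width, height, part_w, part_h):
    """Split the area at (x0, y0) into part_w x part_h rectangles, row by row."""
    return [
        (x0 + dx, y0 + dy, part_w, part_h)
        for dy in range(0, height, part_h)
        for dx in range(0, width, part_w)
    ]


def partition_macroblock(part_size, sub_size=(8, 8)):
    """Return the rectangles that each get their own motion vector.

    Assumption for brevity: every 8x8 partition uses the same sub_size.
    """
    if part_size not in MACROBLOCK_PARTITIONS:
        raise ValueError(f"unsupported partition size: {part_size}")
    if sub_size not in SUB_PARTITIONS:
        raise ValueError(f"unsupported sub-partition size: {sub_size}")

    blocks = tile(0, 0, 16, 16, *part_size)
    if part_size != (8, 8):
        return blocks  # only 8x8 partitions may be split further

    result = []
    for x, y, w, h in blocks:
        result.extend(tile(x, y, w, h, *sub_size))
    return result


if __name__ == "__main__":
    for size in [(16, 16), (16, 8), (8, 16)]:
        print(size, "->", len(partition_macroblock(size)), "motion vector(s)")
    print("8x8 split into 4x4 ->", len(partition_macroblock((8, 8), (4, 4))))
```

The number of returned rectangles equals the number of blocks for which separate motion estimation can be performed, ranging from one for an unsplit macroblock to sixteen when every 8×8 partition is split into 4×4 blocks.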

2.2.4. Motion vectors prediction

Before being sent to the decoder, motion vectors are first predicted from the upper and left neighboring motion vectors. There are several ways of predicting a motion vector, depending on the availability of the neighboring blocks. The prediction may be the motion vector of one of the neighboring blocks or the median of the motion vectors of three neighboring blocks. For the Direct mode, motion vector prediction uses either motion vectors from spatially neighboring blocks or motion vectors from blocks in reference frames. Because a motion vector may point to one of several different reference frames, the prediction of a motion vector uses only other vectors which point to the same reference frame.

2.2.5. Motion estimation

Motion estimation can be made in two directions: forward or backward (or both) [Wed3, Fli3]. A block can be predicted from several reference pictures. Pixel values of the reference picture are first interpolated to achieve quarter-pixel accuracy for luminance and for chrominance. For example, in order to create half-pixel values in H.264/AVC, the following filter is used: [1, -5, 20, 20, -5, 1]/32. Filtering is done separately horizontally and vertically. To achieve quarter-pixel accuracy, the interpolation for luminance is performed by averaging two nearby values on the half-pixel grid, horizontally, vertically or diagonally.

H.264/AVC may also use a special mode called the Direct mode, in which the motion vectors are not explicitly sent but are derived by scaling the motion vectors of the co-located macroblock in another reference frame, or by inferring motion from spatially neighboring regions.

In the case of bi-directionally predicted frames, a special weighted prediction can be applied. The encoder can use a weighted average of two predictions for bi-prediction.
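
The effect of such a weighted average can be illustrated with a small Python sketch. It is a simplified model under stated assumptions: blocks are flat lists of 8-bit samples, the weights `w0` and `w1` are floating-point numbers chosen by a hypothetical encoder, and the rounding and clipping rule is illustrative rather than the integer arithmetic defined in the H.264/AVC standard.

```python
def bi_predict(block0, block1, w0=0.5, w1=0.5):
    """Weighted average of two prediction blocks, rounded and clipped to 8 bits."""
    if len(block0) != len(block1):
        raise ValueError("prediction blocks must have the same size")
    predicted = []
    for a, b in zip(block0, block1):
        value = int(w0 * a + w1 * b + 0.5)         # round to nearest integer
        predicted.append(max(0, min(255, value)))  # clip to the 8-bit range
    return predicted


def sad(block_a, block_b):
    """Sum of absolute differences, a simple measure of prediction error."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


# Hypothetical cross-fade: the current frame is 25% scene A and 75% scene B.
scene_a = [200, 180, 160, 140]
scene_b = [40, 60, 80, 100]
current = [int(0.25 * a + 0.75 * b + 0.5) for a, b in zip(scene_a, scene_b)]

plain = bi_predict(scene_a, scene_b)                 # equal weights
weighted = bi_predict(scene_a, scene_b, 0.25, 0.75)  # weights follow the fade

if __name__ == "__main__":
    print("error with equal weights:   ", sad(current, plain))
    print("error with matching weights:", sad(current, weighted))
```

In this toy example the equally weighted prediction leaves a residual that would still have to be coded, while weights that follow the fade remove it.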
Weighted prediction is very useful for phenomena such as cross-fades between different video scenes.

2.2.6. Integer transform

After the prediction of encoded samples, a transform is applied to decorrelate the data spatially. This transform is applied to 4×4 blocks, and this integer transform is

fully invertible and does not use floating-point arithmetic. The following transform was adapted for the H.264 encoder, and a similar technique was used in the Windows Media 9 encoder. The main features of the new transform [Wie3b] are the following:
- It is a fully invertible integer transform;
- For 8-bit input data precision, 16-bit arithmetic is sufficient for the transform implementation;
- The transform and quantization are of low complexity [Mal3]; the 4×4 transform can be implemented using just a few additions, subtractions and bit shifts.

$$T = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}$$

In addition, if an Intra prediction mode was used during encoding together with the 4×4 block size transform, the Hadamard transform is applied to the DC coefficients of all 4×4 luminance blocks. Here $H_1$ is the 4×4 Hadamard matrix and $H_2$ the corresponding 2×2 Hadamard matrix:

$$H_1 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix}, \qquad H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$$

2.2.7. In-loop deblocking filter

Another tool that significantly improves subjective and objective image quality is the deblocking filter [Lis3, VCEG-M2, JVT-B39]. It reduces the blockiness introduced into a picture. It is an adaptive filter which adjusts its strength depending on the quantization parameter, the motion vectors, the frame/field coding decision and the values of the pixels. The coarser the quantization, the stronger the filtering applied. In the case of H.264/AVC this tool is used inside the coding-decoding loop.

An alternative to deblocking filtering, used in Windows Media 9 [SMPTE5], is overlap smoothing. It is a technique that achieves block edge smoothing using a simpler

operation than in the case of deblocking filtering, which is very complex because of its non-linear operations. Overlap smoothing is achieved by the use of a lapped transform [Tra3], i.e. a transform whose input spans, besides the data elements of the current block, a few adjacent elements of neighboring blocks.

2.2.8. Adaptive entropy coding

A tool used to remove redundant information from encoded data is lossless entropy coding. There are many algorithms for lossless entropy coding; the most basic ones are Huffman coding [Huf52], arithmetic coding [Wit87] and the dictionary technique LZW (Lempel-Ziv-Welch) [Ziv77, Ziv78]. Entropy coding replaces data elements with a coded representation, at the same time reducing the remaining correlation between data elements and thus the data size. Data to which entropy coding can be applied include, for example, residuals of motion vector prediction, transform coefficients, etc.

Two entropy codecs have been used for advanced video coding [VCEG-L13, VCEG-M59] in H.264/AVC [Mar3], and one in Windows Media 9, briefly described in [Sri4]. Here, the author focuses on the H.264 entropy coding algorithms.

The first entropy codec used for advanced video coding is based on the VLC (Variable Length Coding) technique. Here, the exponential Golomb code, also called UVLC (Universal Variable Length Coding), is used. UVLC codewords have the generic form [ZEROS][1][DATA], where DATA is a binary representation of an unsigned integer and ZEROS is a run of zeros whose length equals the length of DATA. The input data elements can be mapped to codewords in any order, depending on the statistical frequency with which they occur. Several code tables may be defined, and the selection of the currently used table may be based on some context. The technique then becomes CAVLC (Context Adaptive Variable Length Coding), giving a tradeoff between speed of execution and performance.
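
The [ZEROS][1][DATA] layout can be made concrete with a short Python sketch. The mapping from a code number to DATA used here (add one, then drop the leading 1 bit) is the usual exponential Golomb convention and is an assumption, since the text gives only the layout; the function names and the use of character strings for bits are likewise illustrative.

```python
def uvlc_encode(code_num):
    """Return the [ZEROS][1][DATA] codeword of a non-negative code number."""
    if code_num < 0:
        raise ValueError("code number must be non-negative")
    value = code_num + 1
    length = value.bit_length() - 1   # number of DATA bits and of leading zeros
    data = value - (1 << length)      # value with its leading 1 bit removed
    data_bits = format(data, "b").zfill(length) if length else ""
    return "0" * length + "1" + data_bits


def uvlc_decode(bits):
    """Decode one codeword from the start of bits.

    Returns (code_num, number_of_bits_consumed).
    """
    length = 0
    while length < len(bits) and bits[length] == "0":
        length += 1                   # count the leading zeros
    if length >= len(bits):
        raise ValueError("no marker bit found")
    data_bits = bits[length + 1 : 2 * length + 1]
    if len(data_bits) != length:
        raise ValueError("truncated codeword")
    data = int(data_bits, 2) if length else 0
    return (1 << length) + data - 1, 2 * length + 1


if __name__ == "__main__":
    for n in range(6):
        print(n, uvlc_encode(n))
```

Because the run of zeros announces the length of DATA, codewords can be concatenated without separators and still be parsed one by one; small code numbers receive short codewords, which is why frequent symbols should be mapped to small code numbers.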
The other entropy codec, CABAC (Context-based Adaptive Binary Arithmetic Coding) [Mar3], increases compression efficiency by roughly 10% relative to CAVLC, but at the cost of additional complexity. CABAC is based on arithmetic coding, which permits assigning a non-integer number of bits per symbol. Generally, the CABAC entropy coding scheme consists of the following steps (see Fig. 2.3): a suitable model is chosen according to a set of past observations of relevant

elements (each element, such as a motion vector or a transform coefficient, has its own model); then, if a given symbol is non-binary valued, it is mapped onto a sequence of binary decisions, so-called bins; and finally each bin is encoded with an adaptive binary arithmetic coding engine.

Fig. 2.3. Generic block diagram of the CABAC entropy coding scheme: context modeling, binarization, and the adaptive binary arithmetic coder (probability estimation and coding engine, with an update of the probability estimate).

2.2.9. Improvements of interlaced video coding

The advanced video coding technique also supports coding of interlaced video sequences. H.264/AVC and Windows Media 9 may encode a video sequence in two modes: field picture coding mode and frame picture coding mode. The first encodes each field separately and the second encodes both fields jointly. For H.264/AVC, special adaptive techniques for coding interlaced video sequences were additionally designed: Picture Adaptive Frame Field (PicAFF), which allows switching between frame and field coding mode for each picture, and MacroBlock Adaptive Frame Field (MBAFF), which allows switching between frame and field mode for each pair of macroblocks. These techniques significantly improve the coding efficiency for interlaced video input.

2.2.10. Multiple resolution tool

A tool for low bit rate (LBR) scenarios has also been proposed [Sri4]. The tool makes it possible to encode frames at multiple resolutions. This is obtained by scaling down one or two dimensions of each coded frame. At the decoder side the frame is up-scaled by the factor received from the encoder. This technique has been applied in the Windows Media 9 encoder [SMPTE5].
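
The idea of the multiple resolution tool can be sketched in a few lines of Python. Everything specific in this sketch is an assumption made for illustration: frames are lists of rows of integer samples, down-scaling is done by block averaging, up-scaling by sample repetition, and the scaling factors travel with the frame in a plain dictionary. The actual filters and syntax of Windows Media 9 are not reproduced here.

```python
def downscale(frame, fx, fy):
    """Average fx x fy blocks; frame dimensions must be multiples of fx and fy."""
    height, width = len(frame), len(frame[0])
    if height % fy or width % fx:
        raise ValueError("frame size must be a multiple of the scaling factors")
    area = fx * fy
    out = []
    for y in range(0, height, fy):
        row = []
        for x in range(0, width, fx):
            total = sum(frame[y + j][x + i] for j in range(fy) for i in range(fx))
            row.append((total + area // 2) // area)   # rounded average
        out.append(row)
    return out


def upscale(frame, fx, fy):
    """Repeat every sample fx times horizontally and every row fy times vertically."""
    out = []
    for row in frame:
        wide = [sample for sample in row for _ in range(fx)]
        out.extend(list(wide) for _ in range(fy))
    return out


def encode_reduced(frame, fx, fy):
    """Hypothetical encoder side: keep the factors together with the small frame."""
    return {"scale": (fx, fy), "samples": downscale(frame, fx, fy)}


def decode_reduced(coded):
    """Hypothetical decoder side: up-scale by the factors received from the encoder."""
    fx, fy = coded["scale"]
    return upscale(coded["samples"], fx, fy)


if __name__ == "__main__":
    frame = [[10, 20, 30, 40], [10, 20, 30, 40]]
    coded = encode_reduced(frame, 2, 1)   # scale down the horizontal dimension only
    print(coded["samples"])
    print(decode_reduced(coded))
```

Setting one factor to 1 scales a single dimension, as in the example above; the decoded frame has the original size but less detail, which is the price paid for the reduced number of coded samples.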

2.2.11. Fading compensation

Another proposed technique is fading compensation [Sri4], which can significantly reduce the number of bits needed to code global illumination changes. The effects that can be compensated are, for example, fade-to-black or fade-from-black. The standard motion-compensation technique is ineffective for frames with such effects.

2.2.12. Switching frames technique

One of the innovations included in H.264/AVC is a set of new picture coding types. These are switching frames [Kar3], both intra and inter coded. Their main feature is the ability to reconstruct specific exact sample values even when different reference pictures, or a different number of reference pictures, are used in the prediction process. Such frames permit bitstream switching, random access, fast forward, reverse, and stream splicing.

2.2.13. Error resilience

Tools with a specific application to network transmission are the error resilience tools [Wen3, Sto3]. These tools are used in the H.264/AVC codec. Several tools protect the video bitstream against network transmission problems: Flexible Macroblock Order (FMO), Arbitrary Slice Order (ASO), Redundant Slices (RS) and Data Partitioning (DP). With FMO the macroblocks may be sent in random order, so that when a segment of data is lost, the errors are distributed randomly over the video frame. When ASO is used, the slices may be received in any order. RS increase bitstream protection by repeatedly sending slices representing the same part of a picture. With DP it is possible to separate the coded slice data into separately decodable sections and protect each one at a level that reflects how important its data are.
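
Why a scattered macroblock order helps can be shown with a small Python sketch. It uses a checkerboard assignment of macroblocks to two slice groups, which is only one simple dispersed arrangement chosen for illustration; the function names are hypothetical and the map types actually defined for FMO in H.264/AVC are not reproduced.

```python
def checkerboard_map(mb_width, mb_height, groups=2):
    """Assign each macroblock to a slice group in a dispersed pattern."""
    return [[(x + y) % groups for x in range(mb_width)] for y in range(mb_height)]


def intact_neighbours(slice_map, lost_group, x, y):
    """Count the 4-connected neighbours of macroblock (x, y) that were not lost."""
    height, width = len(slice_map), len(slice_map[0])
    count = 0
    for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nx, ny = x + dx, y + dy
        inside = 0 <= nx < width and 0 <= ny < height
        if inside and slice_map[ny][nx] != lost_group:
            count += 1
    return count


if __name__ == "__main__":
    dispersed = checkerboard_map(4, 3)
    single_slice = [[0] * 4 for _ in range(3)]   # all macroblocks in one group
    # Suppose the data of slice group 0 is lost and macroblock (1, 1) must be concealed.
    print("dispersed map:   ", intact_neighbours(dispersed, 0, 1, 1), "intact neighbours")
    print("single slice map:", intact_neighbours(single_slice, 0, 1, 1), "intact neighbours")
```

With the dispersed map, every lost macroblock is surrounded by correctly received ones, so a decoder has neighbouring samples available for concealment; when all macroblocks travel together, a single loss wipes out a contiguous area.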

2.3. Summary

Advanced video coding is a set of several advanced tools and techniques that improve encoding efficiency and error resilience, but at the same time increase encoding complexity. Combined, these tools form a potentially very powerful coding technique, which can be much more efficient than widely used existing techniques. However, even though advanced video coding has great potential, it includes no tools for scalable coding. While this doctoral dissertation was being written, several groups of experts, including the author of this dissertation, were working on scalable tools for advanced video coding. Several solutions have been proposed; their general description may be found in the next chapter. The author's proposal is described in detail later in this work.

Chapter 3
Scalable Video Coding

3.1. Introduction

Scalable video coding may be defined as the ability of the encoder to produce a bitstream that consists of layers, each representing a different spatial resolution, a different temporal resolution or a different signal-to-noise ratio (SNR). A scalable encoded video bitstream consists of embedded bitstreams, each representing a single layer. The more embedded bitstreams are received, the higher the quality of the decoded video sequence. Thus, there are three types of scalability. The first one, called spatial scalability, enables encoding one video sequence represented at several spatial resolutions into one bitstream. Depending on the channel capacity, the decoder may receive and decode only part of the bitstream or the whole bitstream, obtaining a video sequence of lower spatial resolution or of full spatial resolution, respectively. Another one is temporal scalability, which makes it possible to produce a bitstream in such a way that the decoder may decode, from part of the bitstream, a video sequence with a reduced frame rate. This means that, for example, every second frame is dropped while the remaining frames keep full spatial resolution and image quality. The last one is quality scalability (SNR scalability), which makes it possible to generate from the input video sequence an encoded bitstream that contains layers, each representing a different image quality. The quality of the output video sequence depends on the number of layers the decoder is capable of receiving. The decoder always decodes a video sequence of full spatial and temporal resolution, but the quality may be reduced. The

quality scalability may be obtained by Fine Granularity Scalability (FGS), a functionality for precise tuning of the bitstream size. With this technique a single packet of data may be very small; thus, the bitstream may be cut close to the maximum channel capacity. There are two major groups of approaches to achieving scalability [Ohm1]: wavelet-based techniques and extensions of hybrid transform codecs. Wavelet codecs, such as 3-D wavelet codecs, naturally enable scalable encoding [e.g.: Ohm2, Woo2, Hsi1]. The hybrid coder structures based on motion-compensated prediction and block transform coding were designed for non-scalable coding of video sequences. Both techniques are still in competition, trying to achieve better results in both scalable and non-scalable coding.

Basic groups of scalable video codecs

Motion picture codecs may be divided into two groups: hybrid codecs with a DCT-like transform and wavelet codecs. The main difference between them is how they deal with scalability. The first group was designed mainly for non-scalable purposes. The most efficient techniques of this group have been described in the previous chapter of this doctoral dissertation. Scalability was developed later, in response to the requirements of new applications. Joint work of scientists and industrial experts has led to a very powerful scalable hybrid codec with a DCT-like transform [ISO-JSVM]. The other group of codecs, based on the wavelet transform, was designed from the beginning as a scalable coding technique. From their beginnings, wavelet coding techniques have provided spatial scalability and quality scalability. Those codecs were called 2D codecs. This kind of coding was presented in [Sha93, Tau94, and Sai96]. Next, the wavelet analysis of the input video signal may be extended to the time domain, which leads to three-dimensional wavelet-based video coding. Those codecs were given the name of 3D codecs.
Various 3D wavelet codecs may be found in [Kar88, Kro9, Ohm93, Ohm94, and Cho99]. The most recent proposals for the structure of scalable hybrid codecs include a temporal analysis technique taken from wavelet codecs. For this reason, the name 3D codecs has also been extended to hybrid scalable codecs.

3.3. Wavelet video codecs

Classification

Scalable video codecs that use motion-compensated temporal filtering (MCTF) may be divided into three categories: T+2D, 2D+T and multi-T+2D. This classification was discussed in [ISO4d, Xio5a]. The T+2D category, also called the spatial-domain motion aligned temporal filtering (SDMATF) scheme, applies temporal filtering at the encoder side directly to the input video frames, before spatial decomposition. At the decoder side, temporal filtering is done at the resolution of the target output video. In this scheme, when the target resolution at the decoder side is lower than the input video frame resolution, artifacts appear in regions with complex motion. This scheme was used in [Xio4, Xio5b]. The 2D+T category, also called the in-band motion aligned temporal filtering (IBMATF) scheme, applies temporal filtering after spatial decomposition. Filtering is done for each subband. Because MCTF is performed after spatial decomposition, i.e. after the decimation procedure, the critically sampled wavelet transform is only periodically shift-invariant and exhibits certain aliasing effects. This scheme was used in [And4, Li4, Par, and Ye3]. The last category is multi-T+2D, presented in [ISO4e, ISO4f]. Here, the input video frame is first down-sampled, so that various resolutions are generated. In this way a redundant pyramid representation of the original frame is produced. In this scheme, motion-compensated temporal filtering is performed on each of these spatial resolution layers. No mismatch between the encoder and the decoder is present. Additionally, motion compensation in the image domain, before the spatial transform, generally has higher accuracy and efficiency than when applied in the subband domain of lower resolution. The disadvantage is high redundancy in the overlapped spatio-temporal subbands across the various spatial resolution layers.
Therefore, this redundancy has to be reduced.

Development of wavelet video coding

The wavelet decomposition of a video sequence, applied in the spatial domain and the temporal domain, is used by wavelet codecs to generate embedded bitstreams, each

representing a different spatial and/or temporal resolution. Each embedded bitstream contains the encoded data of a subband. Wavelet encoders differ in how the data from the subbands are encoded. In this section, several techniques are described, beginning with two-dimensional coding, later enhanced to three-dimensional coding and subsequently extended by, for example, motion compensation. One of the first successful codecs based on the wavelet transform was EZBC (Embedded image coding algorithm using ZeroBlocks of subband/wavelet coefficients and Context modeling), belonging to the group of 2D codecs. This coding technique builds on other successful techniques: embedded zero-tree/-block coding (EZW) [Sha93] and context modeling of the subband/wavelet coefficients [Tau94]. The zero-tree/-block encoder takes two facts into account. One is that most of the signal energy in the frequency domain is accumulated near low frequencies. The other is that there is a strong correlation between data in the subbands of the wavelet decomposition. EZBC was proposed in [Hsi]. The authors adopted the adaptive quadtree splitting method [And97] to separate significant coefficients and then encode every block of zero pixels as one symbol. In EZBC the context models were designed for coding quadtree nodes at different tree levels and subbands. This encoder is inherently applicable to resolution-scalable applications. A few years after EZW, Said and Pearlman proposed in [Sai96] the set partitioning in hierarchical trees (SPIHT) image codec, which is a 2D codec. This codec is built on three basic concepts: coding and transmission of important information first, based on the bit-plane representation of pixels; ordered bit-plane refinement; and coding along preferred paths/trees called spatial orientation trees. Later, 3D SPIHT coding was proposed in [Pea98].
This technique also consists of three parts: partial ordering by magnitude of the 3D wavelet-transformed video with a 3D set partitioning algorithm; ordered bit-plane transmission of refinement bits; and exploitation of self-similarity across spatio-temporal orientation trees. However, three-dimensional coding was first proposed by Karlsson and Vetterli in [Kar88], where a simple 2-tap Haar filter was used for temporal filtering.
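A 2-tap Haar temporal filter of this kind can be sketched in a few lines. This is a minimal illustration without motion compensation, assuming frames are flat lists of pixel values and using the (a+b)/2 and (b-a)/2 scaling; the function names are hypothetical.

```python
# Minimal sketch of 2-tap Haar temporal filtering on a pair of frames.
# No motion compensation; frames are flat lists of pixel values (assumption).

def haar_analysis(frame_a, frame_b):
    """Return the temporal low band (average) and high band (half difference)."""
    low = [(a + b) / 2 for a, b in zip(frame_a, frame_b)]
    high = [(b - a) / 2 for a, b in zip(frame_a, frame_b)]
    return low, high

def haar_synthesis(low, high):
    """Invert haar_analysis exactly: a = low - high, b = low + high."""
    frame_a = [l - h for l, h in zip(low, high)]
    frame_b = [l + h for l, h in zip(low, high)]
    return frame_a, frame_b

previous_frame = [10, 20, 30]
current_frame = [12, 20, 26]
low_band, high_band = haar_analysis(previous_frame, current_frame)
restored = haar_synthesis(low_band, high_band)
```

Decoding only the low band yields one averaged frame per input pair, i.e. half the frame rate, while both bands together restore the two frames exactly.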

Later on, building on the work of Kronander [Kro9], where motion-compensated temporal filtering within the 3D subband coding framework was presented, Ohm introduced the idea of a perfect reconstruction filter with a block-matching algorithm [Ohm93, Ohm94]. Similar work was done by Choi and Woods in [Cho99]. The structure of the temporal decomposition is shown in Fig. 3.1, followed by a figure of the 3D subband structure in a GOP.

Fig. Octave based five-band temporal decomposition. (Diagram labels: GOP; L, H; LL, LH; LLL, LLH; LLLL, LLLH.)

Fig. 3D subband structure in a GOP.

Later, Hsiang and Woods proposed an enhancement of the EZBC technique known as MC-EZBC (Motion Compensated Embedded image coding algorithm using ZeroBlocks of subband/wavelet coefficients and Context modeling), which belongs to the group of 3D wavelet codecs. MC-EZBC [Hsi1] is a video coding technique that uses a 3-D subband/wavelet transform along the motion trajectory. It exploits temporal correlation and is fully embedded in quality/bitrate, spatial resolution and frame rate. The basic structure of the coder is shown in the figure below.

Fig. Basic structure of MC-EZBC [Hsi1]. (Diagram blocks: input video, MCTF, Spatial Analysis, 3D-EZBC, Packetizer, output bitstream; Motion Estimation, Motion Field Coding.)

The coder exploits motion-compensated temporal filtering (MCTF) and the EZBC spatial coder. In this coder, MCTF is used to reduce the aliasing effect when the resolution of the video sequence in the time domain is decreased. Moreover, this system does not

suffer from the drift problem that is present in hybrid coders with feedback loops. Motion-compensated temporal filtering is a very important part of motion-compensated 3-D subband/wavelet coding. MCTF is used for the subband/wavelet decomposition of a video sequence in the time domain. One way of implementing this temporal filtering is the lifting scheme [Cal98]; another approach was proposed by J.-R. Ohm [Ohm94] and extended by Choi and Woods [Cho99].

Implementation of motion-compensated temporal filtering

The basic idea of motion-compensated temporal filtering is to perform temporal filtering along the motion trajectory. The problem is how to deal with connected and unconnected pixels. Two solutions were proposed. One was Ohm's method [Ohm94], where, after motion compensation, for the unconnected pixels (see Fig. 3.4) the original pixel values of the current frame were taken as the values for the low temporal subband, and the scaled displaced frame difference was taken as the value for the high temporal subband. In [Cho96], Choi proposed that, for unconnected pixels, the low temporal subband should take the original value from the reference frame, because unconnected pixels are more likely to be uncovered ones. Both motion-compensated filtering methods, Ohm's and Choi's, are shown in the figure below.

Fig. Motion compensated filtering for connected/unconnected pixels. (Diagram labels: Ohm's method and Choi's method; A: previous frame, B: current frame; pixel values a and b; low-band values (a+b)/2, high-band values (b-a)/2; unconnected pixels, connected pixels.)

Another approach to motion-compensated temporal filtering is the lifting scheme, which consists of three steps:

polyphase decomposition,
prediction step,
update step.

The lifting implementation of the analysis and synthesis filter banks is shown in the figure below.

Fig. Lifting representation of analysis and synthesis filter bank. (Diagram labels: Lifting Scheme (Analysis Filterbank), Inverse Lifting Scheme (Synthesis Filterbank); signals $s_k$, $s_{2k}$, $s_{2k+1}$, $h_k$, $l_k$; operators P and U.)

At the analysis side the input signal is divided into odd samples $s_{2k+1}$ and even samples $s_{2k}$. The odd samples are predicted from the even samples using the prediction operator $P(s_{2k+1})$. The output signal $h_k$ (high pass) consists of the prediction residuals. The signal $l_k$ (low pass) is the sum of the even samples $s_{2k}$ and the signal obtained by applying the update operator $U(s_{2k})$ to the signal $h_k$. The $p_i$ and $u_i$ are the prediction step and update step coefficients.

$$h_k = s_{2k+1} - P(s_{2k+1}) \quad \text{with} \quad P(s_{2k+1}) = \sum_i p_i \, s_{2k+2i}, \tag{3.1}$$

$$l_k = s_{2k} + U(s_{2k}) \quad \text{with} \quad U(s_{2k}) = \sum_i u_i \, h_{k+i}. \tag{3.2}$$

This is a perfect reconstruction filter bank because the prediction and update steps are fully invertible. At the synthesis side the prediction and update operators are applied in reverse order with the signs of the summations inverted. For uni-directional motion-compensated prediction the prediction and update step operators are as follows:

$$P_{uni}(s_{\mathbf{x},2k+1}) = s_{\mathbf{x}+\mathbf{m}_{P0},\,2k-2r_{P0}}, \tag{3.3}$$

$$U_{uni}(s_{\mathbf{x},2k}) = \tfrac{1}{2}\, h_{\mathbf{x}+\mathbf{m}_{U0},\,k+r_{U0}}, \tag{3.4}$$

where $s_{\mathbf{x},k}$ is a video signal with the spatial coordinate $\mathbf{x} = (x, y)^T$ and the temporal coordinate $k$, $\mathbf{m}_{Pz}$ and $\mathbf{m}_{Uz}$ are the motion vectors for the prediction and update steps from frame $z$ ($z = 0, 1$), and $r_{Pz}$ and $r_{Uz}$ are the reference frames for the prediction and update steps. For bi-directional motion-compensated prediction the operators are as follows:

$$P_{bi}(s_{\mathbf{x},2k+1}) = \tfrac{1}{2}\left( s_{\mathbf{x}+\mathbf{m}_{P0},\,2k-2r_{P0}} + s_{\mathbf{x}+\mathbf{m}_{P1},\,2k+2+2r_{P1}} \right), \tag{3.5}$$

$$U_{bi}(s_{\mathbf{x},2k}) = \tfrac{1}{4}\left( h_{\mathbf{x}+\mathbf{m}_{U0},\,k+r_{U0}} + h_{\mathbf{x}+\mathbf{m}_{U1},\,k-1-r_{U1}} \right). \tag{3.6}$$

The idea of bi-directional MCTF was introduced by Ohm [Ohm94]. Bi-directional motion-compensated prediction increases the motion-vector rate in the bitstream but considerably decreases the energy of the prediction residuals. In [ISO4a] the authors proposed a three-band motion-compensated filter bank. The proposed tool consists of a triadic motion-compensated temporal filter bank with bi-directional predict and update operators.

Scalability in wavelet video coding

The wavelet coding technique decomposes the input data, which may be a still picture as well as a video sequence, into subbands. Each consecutive subband is encoded and may be sent to the decoder as an enhancement layer. After decoding such a layer, the decoder increases the quality of the output picture or video sequence, which provides the scalability functionality. Wavelet coding takes advantage of two facts:
signal energy after wavelet decomposition is accumulated near low frequencies,
there is strong correlation between subband data.
The number of subbands may be increased, thereby providing more layers that enhance the quality of the base layer in the spatial as well as the temporal domain. If the input data are three-dimensional, i.e. a video sequence, various techniques of three-dimensional wavelet decomposition may be used. Depending on the technique, various modifications have been introduced into wavelet coding, such as MCTF, motion vector coding, context-based arithmetic coding of residual data, etc.
However, none of these modifications affects the basic concept of wavelet video coding.
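The layered use of subbands can be illustrated with a one-dimensional sketch built on the lifting steps of Eqs. (3.1) and (3.2). It assumes the simplest coefficients (prediction from the preceding even sample, update by half of the residual), no motion compensation, and an input length divisible by 2 at every level; all function names are hypothetical.

```python
# Sketch: lifting-based subband decomposition and layered reconstruction.
# Assumptions: p_0 = 1, u_0 = 1/2 (Haar-like), no motion compensation.

def lifting_analysis(signal):
    """One level: polyphase split, prediction step, update step."""
    even, odd = signal[0::2], signal[1::2]
    high = [o - e for o, e in zip(odd, even)]      # h_k = s_{2k+1} - s_{2k}
    low = [e + h / 2 for e, h in zip(even, high)]  # l_k = s_{2k} + h_k / 2
    return low, high

def lifting_synthesis(low, high):
    """Undo the update step, then the prediction step, then interleave."""
    even = [l - h / 2 for l, h in zip(low, high)]
    odd = [h + e for h, e in zip(high, even)]
    out = []
    for e, o in zip(even, odd):
        out += [e, o]
    return out

def decompose(signal, levels):
    """Return the coarsest low band and the high bands, finest band first."""
    low, bands = list(signal), []
    for _ in range(levels):
        low, high = lifting_analysis(low)
        bands.append(high)
    return low, bands

def reconstruct(low, bands, received):
    """Rebuild the signal using only the `received` coarsest high bands."""
    out = list(low)
    for i, high in enumerate(reversed(bands)):
        layer = high if i < received else [0.0] * len(high)
        out = lifting_synthesis(out, layer)
    return out

signal = [4, 6, 10, 12, 8, 6, 2, 0]
base, enhancement = decompose(signal, levels=2)
coarse = reconstruct(base, enhancement, received=0)
exact = reconstruct(base, enhancement, received=2)
```

With no enhancement bands the decoder obtains a smoothed approximation; each additional band refines it, and all bands together restore the input exactly.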

3.4. Hybrid scalable coding

Scalable video coding for hybrid codecs with a DCT-like transform was already provided and accepted for general use in MPEG-2 [ISO94] and later in MPEG-4 [ISO99]. However, the coding efficiency with scalability enabled was not satisfactory. The structure used there consists of one layer for each spatial resolution of the output video. Each layer may also be partitioned into sub-layers, each representing a different output video quality at the given spatial resolution. The structure is shown in the figure below.

Fig. Multi-layer scalable encoder structure. (Diagram blocks: input; high, medium and low resolution coders, each producing transform coefficients followed by data partitioning; spatial and/or temporal subsampling; spatial interpolation; motion vectors mv_h and mv_m; high-resolution, middle-resolution and low-resolution bitstreams.)

This structure was also used in the scalable techniques proposed in [Doma, Dom1, Mac2, Dom2e, Dom2f, and ISO4c].

Temporal scalability

For such a structure, the number of layers of reduced spatial resolution and the number of layers of reduced temporal resolution are determined at the encoder side. For hybrid codecs, two methods of obtaining temporal scalability can be distinguished. The first one, used in most hybrid codecs, divides encoded frames into the following types:
frames which can be used as a reference for other frames,

frames which cannot be used as a reference for other frames.
If a frame cannot be a reference frame, it can be dropped in the communication channel if there is not enough bandwidth to deliver the bitstream to the decoder, or at the decoder side if the decoder lacks the computational power to decode it. Both types of frames may be encoded:
as access point frames, i.e. frames encoded using only spatial prediction modes (such a frame does not use other frames for prediction);
as frames using one-directional or two-directional temporal prediction (such frames use other frames for prediction).
This way of achieving temporal scalability was used in [ISO94, ISO99, ISO-AVC, Dom3, Bla3c, and Bla5d]. Another technique for achieving temporal scalability uses motion-compensated temporal filtering, which has already been described in Section 3.3. In [ISO4c], Schwarz, Marpe and Wiegand proposed a scalable extension of H.264/AVC in which MCTF was applied. Combining MCTF, taken from wavelet technology, with a highly efficient hybrid codec results in a very efficient and promising scalable hybrid encoder, which became a basis for developing a new advanced scalable encoder. An example of temporal decomposition using MCTF is shown in the figure below.

Fig. Temporal decomposition of a group of 12 pictures providing 3 levels of temporal scalability with temporal resolution ratios of 1/2, 1/4, and 1/ (Diagram: time (frames) on the horizontal axis and layers on the vertical axis; the input layer holds 12 pictures L; layer 1 alternates L1 and H1; layer 2 alternates H2 and L2; layer 3 contains H3, L3, H3.)
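The first method, based on reference and non-reference frames, can be sketched as follows. The example assumes a hypothetical frame pattern in which every B-frame is a non-reference frame; the `Frame` type and function name are illustrative only.

```python
# Sketch: temporal scalability by dropping non-reference frames.
# The I/P/B pattern and the rule "B is never a reference" are assumptions.
from dataclasses import dataclass

@dataclass
class Frame:
    index: int
    kind: str           # "I", "P" or "B"
    is_reference: bool  # may other frames be predicted from this one?

def drop_non_reference(frames):
    """Keep only frames that other frames may depend on."""
    return [f for f in frames if f.is_reference]

gop = [Frame(i, kind, kind != "B") for i, kind in enumerate("IBPBPBPB")]
reduced = drop_non_reference(gop)
print([f.index for f in reduced])  # frames that remain decodable on their own
```

Because no remaining frame was predicted from a dropped one, the reduced sequence decodes correctly at half the original frame rate.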

Spatial scalability

Spatial scalability is obtained by encoding each spatial resolution layer separately, using a motion-compensated encoder; the layers may even use different coding techniques. The data from lower layers are up-sampled and used in the prediction loop of the higher-layer encoders. Multilayer scalability was already presented in [ISO94, ISO99, Hor99, Doma-Domd, Mac2]; the author has also analyzed several techniques of scalable encoding based on the layered structure in [Dom2a-Dom2f, Bla3a, Dom3, Bla3a-Bla3e, Bla4a, Bla4c, Dom4a, Dom4b]. The coding efficiency of codecs with spatial scalability depends on the decimation and interpolation processes, as described in [Dom3, Bla3e, Lan4], and on efficient exploitation of reference and interpolated pictures, as described in [Dom2e, Bla3b, Dom3]. For spatial scalability in scalable AVC, as presented in [ISO4c], the concept of spatial scalability already introduced in the video coding standards H.262/MPEG-2 Visual, H.263 and MPEG-4 Visual was adapted. The base layer with reduced spatial resolution is encoded as a low-pass signal. Then the reconstructed pictures L are spatially up-sampled. The up-sampled pictures, which have the same spatial resolution as the enhancement layer, can be used as a prediction for macroblocks in the enhancement layer. The figure below presents the concept of spatial scalability.

Fig. Implementation of spatial scalability. (Diagram labels: high resolution layer, low resolution layer, interpolation, interpolated frames, inter-layer prediction.)
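Inter-layer prediction of this kind can be sketched on a small picture. The example assumes 2x2 averaging for decimation and nearest-neighbour interpolation, both far simpler than the filters considered in this work, and it omits quantization, so the base layer is lossless; all names are hypothetical.

```python
# Sketch: two-layer spatial scalability with inter-layer prediction.
# Assumptions: 2x2 averaging, nearest-neighbour up-sampling, no quantization.

def downsample2(img):
    """Average each 2x2 block (crude low-pass filtering plus decimation)."""
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4
             for x in range(0, len(img[0]), 2)]
            for y in range(0, len(img), 2)]

def upsample2(img):
    """Repeat every pixel twice horizontally and every row twice vertically."""
    out = []
    for row in img:
        wide = [value for value in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

def subtract(a, b):
    return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

original = [[10, 12, 20, 22],
            [14, 16, 24, 26],
            [30, 32, 40, 42],
            [34, 36, 44, 46]]
base_layer = downsample2(original)             # low resolution layer
prediction = upsample2(base_layer)             # interpolated frame
enhancement = subtract(original, prediction)   # residual sent in the enhancement layer
decoded_high = add(prediction, enhancement)    # high resolution reconstruction
```

A decoder that receives only `base_layer` shows the low-resolution picture; one that also receives `enhancement` recovers the full-resolution picture.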

Each spatial resolution layer is encoded using a separate motion estimation and compensation process.

Quality scalability

The third type of scalability used in hybrid coding is quality scalability. It may be obtained by several techniques. One is the multilayer SNR scalability technique. There are two layers at the same frame rate and the same spatial resolution, but using different quantization parameters. The decoder structure with SNR scalability defined in MPEG-2 [ISO94] is shown in the first of the two figures that follow. The enhancement layer consists of variable-length-coded DCT coefficients of residuals. Here, the enhancement layer is used in the motion-prediction loop. If the enhancement layer is not received by the decoder, drift occurs and the coding efficiency may be low. This technique may be extended to multiple layers. Each layer has the same spatial resolution, but the quantization parameters differ. The number of layers depends on the quantization parameter step between corresponding layers. Moreover, each layer may use its own motion-compensated prediction. Such quality scalability for the advanced video coding hybrid technique was proposed in [Sch3, Sch4]. The structure of a coder with this functionality is shown in the second figure.

Fig. Decoder structure with SNR scalability defined in MPEG-2. (Diagram blocks: Enhancement Bitstream, Enhancement Layer Decoding, Base Layer Bitstream, VLD, Q^-1, IDCT, Motion Compensation, Frame Memory, Video Output. VLD: variable-length decoding; Q^-1: inverse quantization; IDCT: Inverse Discrete Cosine Transform.)
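The two-layer idea can be sketched on a handful of transform coefficients. The quantizer step sizes below are arbitrary illustrative values, not quantization parameters of any standard, and the motion-prediction loop is left out.

```python
# Sketch: two-layer SNR scalability by requantizing the base-layer residual.
# Step sizes 16 and 4 are assumptions chosen only for illustration.

def quantize(values, step):
    return [round(v / step) for v in values]

def dequantize(levels, step):
    return [level * step for level in levels]

def total_error(reference, reconstruction):
    return sum(abs(r - x) for r, x in zip(reference, reconstruction))

coefficients = [52, -17, 9, 3, -3, 0]
base_step, enhancement_step = 16, 4

# Base layer: coarse quantization.
base_levels = quantize(coefficients, base_step)
base_rec = dequantize(base_levels, base_step)

# Enhancement layer: finer quantization of what the base layer missed.
residual = [c - r for c, r in zip(coefficients, base_rec)]
enhancement_levels = quantize(residual, enhancement_step)
enhanced_rec = [r + e for r, e in
                zip(base_rec, dequantize(enhancement_levels, enhancement_step))]

print(total_error(coefficients, base_rec), total_error(coefficients, enhanced_rec))
```

Decoding the base layer alone gives a coarse reconstruction; adding the enhancement layer reduces the reconstruction error at the same frame rate and spatial resolution.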

Fig. Encoder structure, with quality scalability, based on multiple layers. (Diagram: the input sequence is encoded into a base layer (layer 0) and enhancement layers 1 to N-1; each layer has its own encoding, decoding and entropy coding blocks. N: number of layers; QP: quantization parameter; QP -= QP_Step indicates decreasing the quantization parameter value by QP_Step from one layer to the next.)

SNR scalability may also be obtained by bit-plane coding, which is an extension of the standard coding of transform coefficients at the encoder side. The conventional method treats the transform coefficients as a two-dimensional matrix of integer values, while the bit-plane coding technique treats these coefficients as several two-dimensional matrices. Each matrix is composed of one-bit values. These one-bit values are consecutive bits of the binary representation of each coefficient. Thus, for example, for an 8×8 DCT block, a bit-plane of the block is defined as an array of 64 bits, taken one from each absolute value of the DCT coefficients at the same significant position [Li1]. Bit-plane coding and matching pursuit coding of the image residue [Ben98, Che98, Li98, Schu98] are techniques for obtaining Fine Grain Scalability (FGS), which produces a bitstream that may be cut, at the decoder side, at almost any position. Those techniques have been developed for years. Several improvements were proposed, such as:
hybrid temporal-SNR FGS proposed in [Sch1], where the temporal scalable

layer and the SNR scalable layer are considered to be one scalable layer;
macroblock-based progressive fine granularity scalable coding (PFGS) as described in [Wu, Wu1, Sun4], where high-quality reference frames are used in enhancement layer coding (see Fig. 3.12), which provides higher coding efficiency but also introduces potentially higher drifting errors at the decoder side;
adaptive motion-compensation fine granularity scalability (AMC-FGS), where adaptive switching between two-loop motion-compensated FGS and single-loop motion-compensated FGS is performed [Sch2] in order to achieve optimal streaming performance over the network. The system chooses the most suitable FGS structure based on the bandwidth variations or device capabilities;
quality scalability based on bit-plane coding of matching pursuit atoms as described in [Lin5].
FGS techniques based on bit-plane coding were also developed and analyzed by Maćkowiak and Domański in [Mac2, Dom1] and also in [Dom2c, Dom2d, Bla3e].

Fig. PFGS coding scheme. (Diagram labels: High Resolution Enhancement Layer, Low Resolution Enhancement Layer, Base Layer.)

FGS scalability can also be achieved by macroblock reordering. The reordering is done so that the most important macroblocks are encoded into the bitstream first and the less important ones later. The most important macroblocks represent the part of the image that is subjectively most important to a human observer. Thus, this FGS technique tries not to lose, after the bitstream is cut, the subjectively most important parts of the encoded picture. This technique was proposed in [Par2, Bla4d, and Bla4e] and is described in detail later in this doctoral dissertation.
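The reordering principle can be sketched as follows. The importance scores and payloads are made-up placeholders; how importance is actually measured is described later in the dissertation and is not modelled here.

```python
# Sketch: FGS by macroblock reordering and bitstream truncation.
# Importance values and payload sizes are hypothetical placeholders.

def reorder_by_importance(macroblocks):
    """Sort (index, importance, payload) tuples, most important first."""
    return sorted(macroblocks, key=lambda mb: mb[1], reverse=True)

def truncate(ordered, budget_bytes):
    """Return the indices of macroblocks that fit before the cut point."""
    kept, used = [], 0
    for index, _importance, payload in ordered:
        if used + len(payload) > budget_bytes:
            break
        kept.append(index)
        used += len(payload)
    return kept

macroblocks = [(0, 0.1, b"aaaa"), (1, 0.9, b"bbbb"),
               (2, 0.5, b"cccc"), (3, 0.7, b"dddd")]
ordered = reorder_by_importance(macroblocks)
surviving = truncate(ordered, budget_bytes=10)
print(surviving)  # macroblocks that survive a cut after 10 bytes
```

Wherever the bitstream is cut, the macroblocks that survive are the ones ranked as most important.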


Chapter 4
Multilayer Advanced Video Coding

4.1. Introduction

This chapter deals with the problem of adapting and implementing multilayer scalable video coding in advanced video coding technology. A generic structure of a multi-loop scalable encoder, shown in Fig. 4.1, was also used for scalable MPEG-2 by S. Maćkowiak in his doctoral dissertation [Mac2]. In this work, the author considers the possibility of modifying and adapting this structure for use with the newly developed advanced video coding techniques. The multi-layer scalable advanced video codec has been described in Section 3.4 of the previous chapter and in [Dom2b, Dom2e, Dom2f, Dom3, Bla3a, Bla3c, Bla3d, Bla3e, Bla4a, Bla4c, and Lan4]. SNR scalability has also been added by the author to advanced video coding technology in [Dom2c, Dom2d, Bla4d, Bla4e, Bla5a, Bla5d, and Bla5e]. As proposed, a scalable coder may, in general, consist of several sub-coders, at least two. Each sub-coder has its own prediction loop with its own motion estimation and compensation. As was shown in [Mać2], the bitrate needed for the additional data associated with multiple motion estimation is well compensated by the decrease in the number of bits needed for prediction error coding [Dom, Łuc]. Thus, such a structure may be well suited to advanced video coding as well as to earlier video coding techniques.

Fig. A generic block diagram of the multi-loop scalable encoder. (Diagram blocks: input; spatial and/or temporal sub-sampling; high, medium and low resolution encoders; inter-layer prediction; data partitioning; high resolution enhancement bitstream, medium resolution enhancement bitstream, low resolution bitstream; enhancement layer, base layer.)

Fig. A generic block diagram of the multi-loop scalable decoder. (Diagram blocks: high, medium and low resolution enhancement bitstreams; high, medium and low resolution decoders; inter-layer prediction; high, medium and low resolution video.)

This general structure represents a scalable video codec that may take advantage of three types of scalability: spatial, temporal and quality scalability. Each sub-encoder may encode a video signal of different spatial and temporal resolution.

Moreover, each layer of the structure, except the lowest one, may partition the encoded data in order to obtain quality scalability. The lowest layer does not partition data, for backward compatibility with the non-scalable base codec. Here, the author shows that such a structure may be successfully used for designing a model of a scalable codec based on earlier video coding techniques such as H.263 [ISO96], as well as a model of an encoder based on advanced video coding techniques such as H.264/AVC [ISO-AVC]. The multilayer structure of the codec may provide backward compatibility with non-scalable codecs. The lowest layer may be treated as a non-scalable base layer and may have the same syntax as a non-scalable bitstream. In this doctoral dissertation the author presents both models of scalable encoders.

Spatial Scalability

Spatial scalability is obtained by encoding the sub-bitstream for each spatial resolution separately, using additional information from the decoded lower layer. For spatial scalability, the enhancement layer encoder is an extended version of the base layer encoder. The enhancement layers contain the following types of frames:
IE-frames, where spatial prediction and prediction from the lower layer are allowed,
PE-frames, where spatial prediction, temporal prediction (one direction) and prediction from the lower layer are allowed,
BR-frames, where spatial prediction, temporal prediction (two directions) and prediction from the lower layer are allowed.
All three types of frames may be used as a reference for temporal prediction. The common feature of all these frame types is that in the prediction process they use a frame interpolated from the lower layer (which may be another enhancement layer) and corresponding to the same display time. The author's idea was to take advantage of one of the advanced video coding tools, namely multi-reference prediction: the codec can use more than one reference picture in the prediction process. The idea was to take the interpolated frame from the lower layer and put it into the list of reference frames as an additional frame.
In this way, the syntax of the bitstream remains unchanged, while the enhancement layer encoder can still achieve better encoding efficiency. Note that the currently encoded frame and the interpolated frame represent the picture at the same display time. So, it can be assumed that the best prediction of the currently encoded

macroblock from the interpolated frame is located at the same spatial coordinates. Consequently, there is no need to send motion information to the decoder, and for this prediction mode the motion vectors are not transmitted. IE-frames in the advanced video coding technique may use spatial prediction from the same frame or prediction from the lower layer frame. PE-frames and BR-frames may additionally use a reference frame created by averaging the interpolated lower layer frame and a reference frame from the same layer. This is the so-called averaging mode. In [Mać2] the proposed averaging mode took the best temporal prediction and then averaged it with the interpolated block from the lower layer. The author of this doctoral dissertation has adapted this method to the advanced video coding technique. The method has been implemented and included in AVC. It had to be modified to work properly with the multiple prediction modes and multiple reference frames that exist in this new advanced technology.

Fig. Averaging mode from [Mać2]. (Diagram labels: previous reference frames, motion vector MV, the macroblock interpolated from the lower layer, weight ½, predicted macroblock.)

Here, the author's proposal is an extension of the averaging mode proposed by Maćkowiak. The idea is to find the block in the previous reference frames which, after being averaged with the interpolated block from the lower layer, gives the best estimate of the currently coded block. Thus, the proposal in [Mać2] becomes a special case of the new technique. The difference is that the new technique searches for the average that gives the best prediction, whereas the previous method searches for the best prediction from the previous frame and then averages the predicted block with the interpolated block. The following figure shows the idea:
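Alongside the figure, the difference between the two searches can be expressed in a few lines of code. This is a minimal sketch and not the codec implementation: blocks are flat lists of pixel values, the motion search is reduced to a choice among a given list of candidate blocks, the matching criterion is the sum of absolute differences, and the rounding in the average is an assumption.

```python
# Sketch: earlier averaging mode versus the extended averaging mode.
# Candidate lists, SAD criterion and rounding are illustrative assumptions.

def sad(a, b):
    """Sum of absolute differences between two blocks."""
    return sum(abs(x - y) for x, y in zip(a, b))

def average(a, b):
    """Pixel-wise average using one addition and one shift per pixel."""
    return [(x + y + 1) >> 1 for x, y in zip(a, b)]

def earlier_mode(current, candidates, interpolated):
    """Pick the best temporal match first, then average it with the interpolated block."""
    best = min(candidates, key=lambda c: sad(current, c))
    return average(best, interpolated)

def extended_mode(current, candidates, interpolated):
    """Pick the candidate whose average with the interpolated block matches best."""
    best = min(candidates, key=lambda c: sad(current, average(c, interpolated)))
    return average(best, interpolated)

current = [100, 100, 100, 100]
interpolated = [90, 90, 90, 90]                        # block from the lower layer
candidates = [[98, 98, 98, 98], [110, 110, 110, 110]]  # blocks from reference frames

old_prediction = earlier_mode(current, candidates, interpolated)
new_prediction = extended_mode(current, candidates, interpolated)
print(sad(current, old_prediction), sad(current, new_prediction))
```

In this constructed case the earlier mode selects the closest temporal block, whose average with the interpolated block is off target, while the extended mode selects a block that is a worse match on its own but yields a better average. Since both modes choose from the same candidates, the extended search can never produce a larger matching error than the earlier one.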

Fig. New averaging mode proposed by the author. (Diagram labels: interpolation from lower layer, interpolated macroblock from lower layer, motion vector estimation from previous frames, macroblock predicted from reference frame, average, predicted macroblock.)

This new averaging mode is a modification of the earlier algorithm. The modification is made in order to minimize the cost of providing an additional coding mode in the bitstream syntax of the scalable advanced video codec. The algorithm is computationally more complex than the previous method because, for each block match in the motion search, the averaging with the interpolated block has to be performed. Thus, each block match requires an additional summation and shift for each pixel value. The IE-frames proposed in this work differ from the frame type proposed in [Mać2]. Here, this frame has a structure similar to the P-frame. Thus, all intra modes (spatial predictions) are allowed, together with one prediction mode (block size 16×16) which is based on the temporal prediction coding structure but uses no motion vectors. The reference frame list contains only one frame, i.e. the interpolated frame from the lower layer.

Temporal scalability scenarios for scalable codec

The temporal scalability discussed here is widely used in hybrid DCT-based codecs. The idea is not to use some video frames for predictive coding of other frames. Such frames need not be decoded at the decoder side, and the remaining frames are still decodable. Moreover, those frames can be encoded using prediction from

48 neighboring frames from two directions: forward and backward. When such frames are used, the coding efficiency increases. In considered layered structure all frames may also use a satial rediction, and some of bi-directionally redicted frames may also use other bi-directionally redicted frames. Some of ossible scenarios for two layers structure, already resented in [Mac ] are shown in Fig. 4.3 and Fig IE BE BR BE PE BE BR Enhancement Layer I BR P BR Base Layer Fig Exemlary structures of low-resolution and high-resolution video sequences with temoral sub-samling by factor 2. IE BE PE BE PE Enhancement Layer I P P Base Layer IE P PE P PE Enhancement Layer I P P Base Layer Fig. 4.3 cont. Exemlary structures of low-resolution and high-resolution video sequences with temoral sub-samling by factor

Figure: Exemplary structure of low-resolution and high-resolution video sequences with temporal sub-sampling by factor 3.
Enhancement layer: IE BE BE PE BE BE
Base layer: I P

There are two types of bi-directionally predicted (B) frames: BR-frames and BE-frames. A BR-frame may also use a frame from the lower layer for prediction and may be a reference frame for other B-frames, but it cannot be used as a reference for P-frames. BE-frames are bi-directionally predicted frames which may use P-frames or BR-frames as reference frames. BR-frames and BE-frames belong to the two different categories described in Chapter 3, i.e. frames which can or cannot be a reference for other frames (see Chapter 3, Section 3.4.1).

4.4. Interpolation and Decimation

The input video signal for the scalable video coder has to be spatially and/or temporally decimated. The decimation process has to be performed as many times as there are enhancement-layer encoders in the scalable coder. Before the video signal is down-sampled, it has to be filtered with a low-pass filter in order to avoid aliasing. In the case of scalable coding, the low-pass filtering may also be used as a tool for distributing the input signal energy between the encoded spatial layers. In the opposite process, up-sampling and estimation of the missing pixels are performed. The interpolated picture is used for prediction of the enhancement-layer picture.

For the verification model of the scalable multilayer advanced video encoder, a decimation filter with the following design conditions has been used:
- passband attenuation below 1 dB,
- passband cutoff frequency of about 0.4 of the Nyquist frequency,
- stopband attenuation over 5 dB.

Figure: Decimation filter design (attenuation A [dB] versus frequency f; marked values A_P, A_R, f_P, f_R).

Two filters meeting the above conditions have been designed in the Matlab environment using the least-squares error technique: one 12th-order FIR filter and one 24th-order FIR filter. Figs. 4.8 and 4.9 show the magnitude responses of both filters. The filter coefficients are the following.

12th-order filter:
h(n) = [.15259, , , ,.83263,.33814,.41166,.33814,.83263, , , ,.15259];

24th-order filter:
h(n) = [.5622,.16861,.1351, , , -.163,.43615,.2659, , ,.6525,.39576, ,.39576,.6525, , ,.2659,.43615, -.163, , ,.1351,.16861,.5622].

Fig. 4.8. Magnitude response of the 12th-order filter.

Fig. 4.9. Magnitude response of the 24th-order filter.
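The filter-then-subsample step can be sketched as follows. Note the assumptions: the coefficients below are not the ones listed above; a Hamming-windowed sinc design is used as a simple stand-in for the least-squares design, with 13 taps (12th order) and a cutoff of 0.4 of the Nyquist frequency taken from the design conditions. The sketch works on one row of samples, replicates border samples (an assumed border rule), and all function names are hypothetical.

```python
import math


def lowpass_taps(num_taps, cutoff):
    """Hamming-windowed sinc low-pass filter; cutoff is a fraction of the Nyquist frequency."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        ideal = cutoff if t == 0 else math.sin(math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [h / total for h in taps]  # normalise to unity gain at DC


def decimate_by_2(samples, taps):
    """Low-pass filter one row (border samples replicated), then keep every second sample."""
    half = len(taps) // 2
    last = len(samples) - 1
    out = []
    for i in range(0, len(samples), 2):
        acc = 0.0
        for k, h in enumerate(taps):
            j = min(max(i + k - half, 0), last)  # clamp index at the row borders
            acc += h * samples[j]
        out.append(acc)
    return out


taps = lowpass_taps(13, 0.4)  # 13 taps = 12th-order filter
flat = decimate_by_2([100.0] * 16, taps)
print(len(flat), round(flat[0], 6))
```

For a two-dimensional picture the same routine would be applied to the rows and then to the columns. A longer filter gives a sharper transition band, but each output sample then depends on a wider neighbourhood, which is the trade-off examined in the comparison of the two filter orders.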

The reason for designing two filters of the same type but of different orders was to determine which feature is more important in a decimation filter for scalable coding (see Fig. 4.1): a better magnitude response at the cost of longer distortions around edges, or a poorer magnitude response with shorter distortions around edges. The experiment described in Chapter 6.4 shows that the two decimation filters differ in coding efficiency. The results of those experiments are given in Table 4.1 below.

Table 4.1. Influence of the designed decimation filters on the encoding efficiency.
Columns: test sequence | short decimation filter: PSNR [dB], bitrate [kbps] | long decimation filter: PSNR [dB], bitrate [kbps] | bitrate overhead [%]
city (±64 search range) 36, ,96 36, ,1,61
city (±4 search range) 34, ,8 34, ,35 1,11
crew (±64 search range) 38, ,62 38, ,47,65
crew (±4 search range) 37,25 147,72 37,25 158,37 1,2
harbour (±64 search range) 35, ,81 35, ,98,6
harbour (±4 search range) 34, ,45 34, ,78,78
ice (±64 search range) 4, ,94 4,72 143,76 1,14
ice (±4 search range) 38,9 935,49 38,91 949,45 1,5

The conclusion from the analysis of the two designed filters is that the coding efficiency is higher for the lower-order decimation filter. The bitrate overhead of the higher-order filter is between 0.6% and 1.5%.

For the scalable H.264/AVC verification model proposed by the author, a modified bi-cubic interpolation has been chosen because of its good tradeoff between computational cost and accuracy. The technique has been taken from [Ram99]. It is an edge-adaptive bi-cubic interpolation and an extension of the standard non-adaptive separable bi-cubic interpolation. The author has proposed this technique in [Bla3e, Dom3, Lan4]. The algorithm can be described as follows. The interpolation of a two-dimensional image is performed in two steps: first horizontal interpolation, then vertical interpolation.
Let $f(x)$ be the value to be estimated, with the nearest available values located at coordinates $x_k$ (left) and $x_{k+1}$ (right). Let $s = x - x_k$ and $1 - s = x_{k+1} - x$, where $0 \le s \le 1$. For separable bi-cubic interpolation,

$$f(x) = f(x_{k-1})\,\frac{-s^3 + 2s^2 - s}{2} + f(x_k)\,\frac{3s^3 - 5s^2 + 2}{2} + f(x_{k+1})\,\frac{-3s^3 + 4s^2 + s}{2} + f(x_{k+2})\,\frac{s^3 - s^2}{2},$$

where $x_{k-1}$, $x_k$, $x_{k+1}$ and $x_{k+2}$ are the positions of the four neighboring known pixels. In the edge-adaptive scheme, a modified value $s'$ is used instead of $s$:

$$s' = s - kAs(s - 1),$$

where $k$ is a positive parameter that controls the intensity of warping and $A$ is a function of the asymmetry of the data in the neighborhood of $x$:

$$A = \frac{|f(x_{k+1}) - f(x_{k-1})| - |f(x_{k+2}) - f(x_k)|}{L - 1},$$

where $L = 256$ for 8-bit sample representation.

In order to obtain the value of $k$, several experiments have been performed. Figs. 4.10 and 4.11 compare the coding efficiency obtained with the adaptive and the non-adaptive interpolation filter. A detailed description of the experiments and their results is presented in Chapter 6.4.

Fig. 4.10. Comparison of non-adaptive and edge-adaptive bi-cubic interpolation filters for the CREW test sequence (bitrate [kbps] versus k; decimation filter length: 13).

Fig. 4.11. Comparison of non-adaptive and edge-adaptive bi-cubic interpolation filters for the HARBOUR test sequence (bitrate [kbps] versus k; decimation filter length: 13).

As can be seen in the figures above, adaptation of the interpolation increases the encoding efficiency. Moreover, this efficiency strongly depends on the value of the parameter k. The gain in coding efficiency when adaptation is used may reach up to 1%.
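The one-dimensional step of the edge-adaptive interpolation can be sketched as below. It follows the cubic weights, the warped distance s' = s - kAs(s - 1) and the asymmetry measure A with L = 256 as described in this section; the function names, the list-based signal and the sample values are hypothetical, and clipping of the result to the valid sample range is left out.

```python
def cubic_weights(s):
    """Weights of the four neighbours f(x[k-1]), f(x[k]), f(x[k+1]), f(x[k+2])."""
    return (
        (-s**3 + 2 * s**2 - s) / 2,
        (3 * s**3 - 5 * s**2 + 2) / 2,
        (-3 * s**3 + 4 * s**2 + s) / 2,
        (s**3 - s**2) / 2,
    )


def warped_distance(f, idx, s, k, levels=256):
    """Modified distance s' = s - k*A*s*(s - 1); A measures local asymmetry."""
    a = (abs(f[idx + 1] - f[idx - 1]) - abs(f[idx + 2] - f[idx])) / (levels - 1)
    return s - k * a * s * (s - 1)


def interpolate(f, idx, s, k=0.0):
    """Value at distance s to the right of f[idx]; k = 0 gives the non-adaptive case."""
    weights = cubic_weights(warped_distance(f, idx, s, k))
    return sum(w * f[idx - 1 + i] for i, w in enumerate(weights))


row = [0, 0, 0, 255]  # hypothetical samples next to an edge
print(interpolate(row, 1, 0.5, k=0.0))  # non-adaptive bi-cubic
print(interpolate(row, 1, 0.5, k=1.0))  # edge-adaptive, k = 1
```

In this example the edge lies to the right of the interpolated position, so A is negative and the warped distance moves towards the flat side, which reduces the undershoot of the cubic kernel. For a two-dimensional picture the routine would be run along the rows first and along the columns second.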

Chapter 5
Spiral scan

5.1. Introduction

Since the beginning of digital image filtering and compression, people have been used to taking the upper left corner of the picture as the starting point for image processing. The image is then processed row by row, from left to right and from top to bottom. This is called the raster scan. This processing order stems from human habit and has no justification in signal processing theory. If the filtering of the image starts from the lower right corner and proceeds in the opposite direction, the results are the same as before.

In the case of video sequence compression, individual frames are divided into blocks of pixels called macroblocks. The images are processed macroblock by macroblock. The processing order of the macroblocks is the same as for the filtering process: from left to right and from top to bottom. This coding order is widely used in many video compression techniques and in all video coding standards, including MPEG-2 and MPEG-4 [ISO94, ISO98] as well as the newest standard, AVC/H.264 [ISO-AVC].

Here a question arises: is it possible to use other macroblock processing orders which could be useful in video compression? And if so, what are their implications and applications?

The DCT-based video coding technique partitions the image into macroblocks and then encodes them one by one. Commonly used codecs such as MPEG-1,2,4 or H.26x use the raster scan of macroblocks. However, there is no reason why other scans should not be used. The question is how such an order influences the encoding process, which eventually leads to the question of coding efficiency. The idea is:
- to find a new macroblock coding order which will be useful for some purposes,
- to modify the coding process for this macroblock coding order in such a way that the coding efficiency remains unchanged.

The consequences of modifying the coding order strongly depend on the coding algorithm. In the new coding standard AVC/H.264, the dependency between neighboring macroblocks during encoding is very strong, so changing the macroblock coding order strongly influences the coding efficiency. In MPEG-2, where the dependency between neighboring macroblocks is weaker, the macroblock coding order influences the coding efficiency less.

5.2. Spiral scan in video compression

It is important to define some specific features of encoded video shots. In general, video shots can be divided into two main groups:
- video shots containing one or a few objects placed against some background;
- video shots containing only the background, without explicit objects.

An object can be defined as a part of a video frame which is recognizable from the human point of view. The object may be a human being, an animal, a building, etc. Moreover, those objects have to be the major regions of interest in the image for the human viewer. The background can be defined as the part of the image containing no recognizable objects, or containing so many objects that it is impossible to decide which of them are the most important.

An example of a video frame containing one object and background is a video sequence from a news channel in which the news presenter is speaking. In such a case the news presenter is the object and the rest of the image is the background (see Fig. 5.1).

Fig. 5.1. The object of interest placed in the center of the image.

A video frame may contain more than one object; for example, there can be a video in which two people are talking (see Fig. 5.2).

Fig. 5.2. Two objects of interest inside one image.

Video with no objects (only the background) can be, for example, shots of a forest, mountains, etc. (see Fig. 5.3).

Fig. 5.3. Image containing no objects of interest.

The psycho-visual effect of human perception is focusing on a part of the image instead of the whole image. In most cases, a cameraman shoots video in such a way that the most interesting object is in the center of the picture. If there is more than one object, they are placed around the center. A person focuses on one of those objects at a time.

The idea is to encode a video sequence in such a way that the most important parts of the images (here, the objects) are processed first and the less important parts are processed later. As a realization of this idea, the spiral scan of encoded macroblocks is proposed. The classic scan and the spiral scan are shown in Fig. 5.4.

Fig. 5.4. a) Raster scan of macroblocks, b) spiral scan of macroblocks.

A similar solution, called the water ring scan order, has been proposed in [Par2]. Fig. 5.5 shows the basic idea.

Fig. 5.5. Basic idea of the water ring scan order (figure taken from [Par2]). Labels: water ring in the lake, origin of the water ring, direction of the water ring, water ring scan order applied in the image frame: Water Ring(0), Water Ring(1), Water Ring(2), ..., Water Ring(i), ..., Water Ring(N).

However, the water ring scan order technique is less efficient than the spiral scan order because it lacks context modification. The water ring technique does not modify the coding standard, so it is easier to use, but it is not as efficient as the technique proposed in this dissertation. The modification proposed here makes it possible to use any continuous macroblock coding order without losing coding efficiency. Thus, the water ring technique may become as efficient as the spiral scan technique when combined with context modification.

The spiral scan could be used, for example, for SNR scalability. When there is one object, a single spiral starting in the centre of the images of the video sequence could be used. The outer parts of the images may then be represented with a lower bitrate if the overall bitrate must be reduced. Often the resulting quality deterioration caused by the bitrate reduction is not perceived by the viewer. If the classic scan is used, the visual effect is poorer. A comparison of the visual effect of bitrate reduction for the spiral scan and for the raster scan is shown in Fig. 5.6.

Fig. 5.6. A decoded image using a) raster scan, b) spiral scan (legend: higher quality macroblocks, lower quality macroblocks).

This effect may be used in scalable video coding, where low-bitrate bitstreams are embedded in a high-bitrate bitstream. With the spiral scan of macroblocks, the respective bitstream may be cut after an arbitrary number of macroblocks, thus giving good quality in the area of interest of the images (see Fig. 5.6 b). In the decoder, the macroblocks which have not been decoded are reconstructed from the low-quality base layer. The standard macroblock order would place the high-quality area at the top of the image (Fig. 5.6 a).

In the case of more than one object, several spirals can be used. An example of an image with two spiral scan areas is shown in Fig. 5.7.

Fig. 5.7. Two spiral scans per one frame.

For this example, the areas of good quality after the bitstream cut are shown in Fig. 5.8. There are two good-quality regions because each spiral is received at the decoder as an individual unit, and so the spirals may be cut independently.
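The effect of cutting the bitstream after a given number of macroblocks can be illustrated with a small sketch. It generates a square spiral of macroblock coordinates from a chosen starting point and marks the macroblocks that survive the cut. The turning direction (right, down, left, up), the rule of skipping positions outside the frame, and the function name are assumptions made for illustration; the exact spiral used in the dissertation is defined by its figures.

```python
def spiral_order(width, height, start):
    """Macroblock coordinates (x, y) visited along a square spiral that begins at `start`."""
    x, y = start
    order = []
    if 0 <= x < width and 0 <= y < height:
        order.append((x, y))
    dx, dy = 1, 0  # first move: to the right
    run = 1        # current side length of the spiral
    while len(order) < width * height:
        for _ in range(2):  # two sides share each run length
            for _ in range(run):
                x, y = x + dx, y + dy
                if 0 <= x < width and 0 <= y < height:  # skip positions outside the frame
                    order.append((x, y))
            dx, dy = -dy, dx  # turn: right -> down -> left -> up
        run += 1
    return order


order = spiral_order(5, 5, (2, 2))
kept = set(order[:9])  # enhancement bitstream cut after 9 macroblocks
for y in range(5):
    print("".join("H" if (x, y) in kept else "." for x in range(5)))
```

With the cut after nine macroblocks the high-quality area is the 3×3 block around the starting point, whereas the first nine macroblocks of a raster scan would cover the top row and most of the second row. Moving `start` shifts the protected area towards the object of interest.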

Fig. 5.8. Example of high-importance bit allocation for two spiral scans (legend: higher quality macroblocks, lower quality macroblocks).

The starting point of the spiral scan depends on the position of the object inside the image. It should be located in the centre of the object. An example of spiral scans with various starting points is shown in Fig. 5.9.

Fig. 5.9. Example of various starting points for the spiral scan.

The areas of good image quality for such spirals are shown in Fig. 5.10.

Fig. 5.10. Example of high-importance bit allocation for spiral scans with various starting points (legend: higher quality macroblocks, lower quality macroblocks).

Additionally, an aspect ratio may be defined for the spiral scan, which is directly related to the shape of the encoded object. The aspect ratio defines the proportion of the number of macroblocks in the horizontal direction to the number of macroblocks in the vertical direction. By using different aspect ratios it is possible to fit the spiral scan of macroblocks better to the shape of the encoded object. The better the macroblock scan fits the shape of an object, the better the bit allocation. Some examples of various aspect ratios for the spiral scan are shown in Fig. 5.11.

Fig. 5.11. Spiral scan for two different aspect ratios.

In a video sequence, the frames are partitioned into slices. The ability to partition the image into slices is very useful for streaming. Each slice is transmitted as a separate unit. This means that if a packet is lost in the communication channel, it is still possible to receive and decode a part of the image. If there is only one slice per image, then an error during transmission causes the whole frame to be lost. Moreover, each slice may be transmitted with a different priority, which means that different protection methods may be used for each slice.

Slice partitioning may also be applied to the spiral scan of macroblocks. Here, two types of slices may be defined. A slice of the first type initiates a new spiral scan (see Fig. 5.12) with its own parameters: the starting point, the aspect ratio and the slice id. A slice of the second type continues the spiral scan it belongs to, identified by the slice id parameter and the starting point (see Fig. 5.13).

Fig. 5.12. Exemplary frame partitioning into two slices using one spiral scan of macroblocks.

Fig. 5.13. Exemplary frame partitioning into two slices using two spiral scans of macroblocks.

5.3. Spiral Scan for Quality Scalability in AVC Codecs

5.3.1. Introduction

A macroblock is the basic coding unit but, as mentioned before, the neighboring macroblocks influence the encoding of the current macroblock. Already the MPEG-2 standard takes advantage of predictive encoding of DC coefficients and of motion vectors. For the raster (linear) scan the direction of prediction is constant, while in the spiral scan it must be adapted to the current direction of processing. Therefore, some modifications of the coding algorithm are needed in order to preserve high compression efficiency.

In this dissertation the AVC/H.264 encoder has been adapted to encode images in the spiral order of macroblocks. AVC/H.264 is a new technique which is much more efficient than older ones, because many more elements are encoded using sophisticated adaptive predictions with contexts defined in several ways. Therefore, the respective modifications for the new coding order are much more complex.

For the prediction of various syntax elements, data from neighboring macroblocks are used. For macroblocks, the AVC standard defines four possible neighboring macroblocks which can be used in the prediction process. These are the left, upper, upper-left and upper-right neighbors (see Fig. 5.14).

Fig. 5.14. Neighborhood defined in the AVC/H.264 codec (legend: neighboring macroblocks, current macroblock).

In the case of the spiral scan, the position of the neighborhood depends directly on the direction of processing of the current macroblock on the spiral. Only already processed macroblocks may be used for prediction. Fig. 5.15 shows the placement of the neighbors used as a context for a macroblock, depending on its position on the spiral.

Fig. 5.15. Neighborhood for the spiral scan, for position types 0, 1, 2 and 3 (legend: neighboring macroblocks, current macroblock).

Because the H.264 encoder uses a fixed neighborhood, as shown in Fig. 5.14, the new neighborhood of the spiral scan cannot be used directly by this encoder. The following prediction tools therefore need adaptation to the spiral scan:
- prediction of (4×4)-pixel, (8×8)-pixel and (16×16)-pixel luminance blocks for intra-frame coding,
- prediction of chrominance blocks for intra-frame coding,
- motion vector prediction for all block sizes,
- prediction of macroblock encoding parameters,
- context prediction for CABAC coding:
  o block-based prediction,
  o bit-based prediction.

The modifications (see Annex A) do not influence the bitstream syntax; only the semantics of the bitstream elements change. An alternative technique proposed in [Par2] has a very similar macroblock processing order, but it has one major disadvantage compared to the technique proposed in this doctoral dissertation: it does not fully exploit the available context for predictive coding. For the water ring scan, and for the spiral scan without modified context usage, the coding efficiency decreases compared to the raster scan.
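Why the fixed AVC neighborhood does not fit the spiral scan can be checked with a short sketch that lists, for any macroblock coding order, which of the eight surrounding macroblocks have already been coded. This is only an illustration of the causality constraint; it does not reproduce the position types or the context rules of Annex A. The 3×3 spiral order below assumes a right, down, left, up turning direction, and all names are hypothetical.

```python
# Offsets (dx, dy) of the eight surrounding macroblocks; y grows downwards.
OFFSETS = {
    "up-left": (-1, -1), "up": (0, -1), "up-right": (1, -1),
    "left": (-1, 0), "right": (1, 0),
    "down-left": (-1, 1), "down": (0, 1), "down-right": (1, 1),
}


def available_neighbours(order):
    """Map each macroblock to the names of its neighbours that are coded before it."""
    rank = {mb: i for i, mb in enumerate(order)}
    result = {}
    for (x, y), i in rank.items():
        result[(x, y)] = [
            name for name, (dx, dy) in OFFSETS.items()
            if rank.get((x + dx, y + dy), i) < i  # missing neighbours never qualify
        ]
    return result


raster = [(x, y) for y in range(3) for x in range(3)]
spiral = [(1, 1), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1), (0, 0), (1, 0), (2, 0)]

print(available_neighbours(raster)[(1, 1)])
print(available_neighbours(spiral)[(0, 2)])
```

For the raster order the centre macroblock sees exactly the upper-left, upper, upper-right and left neighbours, which is the fixed AVC neighborhood. On the spiral, the macroblock at (0, 2) can only use its right and upper-right neighbours, so a predictor that always looks left and up would find no usable context there.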

5.3.2. Intra-frame prediction

For intra prediction, the AVC/H.264 encoder performs spatial prediction of the pixels of a whole macroblock or of a block (which is a part of the macroblock) using the available pixels on the boundaries of the neighboring macroblocks or blocks. Depending on the block size, the intra prediction modes in the AVC/H.264 encoder may be divided into three groups:
- prediction of (16×16)-pixel blocks,
- prediction of (8×8)-pixel blocks,
- prediction of (4×4)-pixel blocks.

For each block size individual prediction algorithms are defined. In the case of (16×16)-pixel blocks there are four possible predictions, which may be graphically represented as: vertical prediction, horizontal prediction, prediction of the DC coefficient, and plane prediction.

For the spiral scan of macroblocks those predictions have to be modified independently for each position on the spiral. Fig. 5.16 A to D show the four groups of predictions for each of the cases.

Fig. 5.16. Four groups of prediction modes for the spiral scan: A) position type 0, B) position type 1, C) position type 2, D) position type 3.

For the prediction of (8×8)-pixel blocks and (4×4)-pixel blocks, the adaptation to the spiral scan is more complex. Here, the prediction is based on blocks instead of macroblocks. It means that for the classic scan the blocks are processed in the order shown in Fig. 5.17.
