Video Coding: Part II of Fundamentals of Source and Video Coding


Foundations and Trends® in Signal Processing, Vol. 10, No. 1-3 (2016)
© 2016 H. Schwarz and T. Wiegand

Video Coding: Part II of Fundamentals of Source and Video Coding

Heiko Schwarz
Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, Germany
heiko.schwarz@hhi.fraunhofer.de

Thomas Wiegand
Berlin Institute of Technology and Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, Germany
thomas.wiegand@tu-berlin.de

Contents

1 Introduction
   The Video Communication Problem
   Scope and Overview of the Text

2 Acquisition, Representation, Display, and Perception
   Fundamentals of Image Formation
   Image Formation with Lenses
   Diffraction and Optical Resolution
   Visual Perception
   The Human Visual System
   Color Perception
   Visual Acuity
   Representation of Digital Images and Video
   Spatio-Temporal Sampling
   Linear Color Spaces and Color Gamut
   Non-linear Encoding
   The Y'CbCr Color Representation Format
   Quantization of Sample Values
   Image Acquisition
   Image Sensor
   Capture of Color Images
   Image Processor
   Display of Images and Video
   Chapter Summary

3 Video Coding Overview
   Properties of Digital Video Signals
   Intra-Picture Coding
   Hybrid Video Coding
   Structure of Hybrid Video Encoders and Decoders
   Picture Partitioning, Scanning, and Syntax
   Interoperability and Video Coding Standards
   Coding Efficiency
   Chapter Summary

4 Video Encoder Control
   Encoder Control using Lagrange Multipliers
   Lagrangian Optimization in Hybrid Video Encoders
   Mode Decision
   Motion Estimation
   Quantization
   Selection of Lagrange Multiplier
   Summary and Potential Improvements
   Additional Aspects of Video Encoders
   Chapter Summary

5 Intra-Picture Coding
   Transform Coding of Sample Blocks
   Orthogonal Block Transforms
   Quantization
   Entropy Coding
   Intra-Picture Prediction between Transform Blocks
   Prediction in Transform Domain
   Spatial Intra Prediction
   Block Sizes for Prediction and Transform Coding
   Block Size Selection in Video Coding Standards
   Experimental Analysis
   Chapter Summary

6 Inter-Picture Coding
   Accuracy of Motion-Compensated Prediction
   Theoretical Considerations
   Choice of Interpolation Filters
   Motion Vector Accuracy
   Motion Models
   Block Sizes for Motion-Compensated Prediction
   Variable Prediction Block Sizes
   Prediction Block Sizes in Video Coding Standards
   Further Improvements
   Advanced Motion-Compensated Prediction
   Adaptive Reference Picture Selection
   Multi-Hypothesis Prediction
   Weighted Prediction
   Further Motion Compensation Techniques
   Coding of Motion Parameters
   Motion Vector Prediction
   Inferred Motion Parameters
   Coding Structures
   In-Loop Filters
   Deblocking Filter
   Sample Adaptive Offset Filter
   Adaptive Wiener Filter
   Chapter Summary

7 Video Coding Standards
   Syntax Features and Coding Tools
   ITU-T Rec. H.262 | MPEG-2 Video
   ITU-T Rec. H.263 and MPEG-4 Visual
   ITU-T Rec. H.264 | MPEG-4 AVC
   ITU-T Rec. H.265 | MPEG-H HEVC
   Comparison of Syntax Features
   Comparison of Coding Efficiency
   Intra-Only Coding
   Interactive Video Applications
   Entertainment-Quality Video Applications
   Chapter Summary

8 Summary

Acknowledgements

Appendices
   A Test Sequences
   B Software for Coding Experiments

References

Abstract

Digital video coding technologies have become an integral part of the way we create, communicate, and consume visual information. In the first part of this two-part text, we introduced the fundamental source coding techniques: entropy coding, quantization, prediction, and transform coding. The present second part describes the application of these techniques to video coding. We introduce the basic design of hybrid video encoders and decoders, explain the basic concepts of intra-picture coding, motion-compensated prediction, and prediction error coding, and discuss encoder optimization techniques. Special emphasis is put on a fair analysis of various design aspects and coding tools in terms of coding efficiency. We highlight the application of the discussed concepts in modern video coding standards and compare important standards with respect to the achievable coding efficiency.

H. Schwarz and T. Wiegand. Video Coding: Part II of Fundamentals of Source and Video Coding. Foundations and Trends® in Signal Processing, vol. 10, no. 1-3, 2016.

1 Introduction

The application areas of digital video today range from multimedia messaging, video telephony, and video conferencing over mobile television, wireless and wired Internet video streaming, standard- and high-definition television broadcasting, subscription and pay-per-view services to personal video recorders, digital camcorders, and optical storage media such as the digital versatile disc (DVD) or the Blu-ray disc. Ultra-high definition (UHD) television sets, with a resolution of 3840 × 2160 image points, four times as many as high-definition screens, have recently become available for end consumers. Due to its horizontal resolution, this UHD format is often also called 4K format. Internet streaming providers have started to produce and deliver content in 4K. At the time of writing this text, the first satellite broadcasters are testing a new UHD infrastructure; the first UHD demo channels are already on air [230] and the first sport events and concerts have been successfully broadcast live in 4K [231].

One of the key techniques that enabled the variety of digital video applications is video coding, also called video compression. Even though the main task of video coding is to compress visual information so that it requires as little bit rate as possible, the availability of advanced

video coding techniques also enables a number of new applications. For example, the availability of the improved video coding standard H.264 | MPEG-4 AVC [121] was a driving factor for the broad introduction of high-definition television (HDTV) and several video streaming services. Future video services and applications will benefit from the even more efficient standard H.265 | MPEG-H HEVC [123]. In fact, the percentage of the Internet traffic that is caused by the transmission of compressed video data increases continuously. According to a study by Cisco [48], 66% of the bits transmitted in the consumer Internet in 2013 were video data. For 2018, an increase to 79% is predicted.

Even though the various applications of digital video differ in the spatial resolution, the required compression ratio, and the acceptable video quality, the same basic video compression principles are employed. In the following, we give an overview of the main elements of a video communication chain, from the capturing of pictures to display and human perception. The main focus of the present text lies, however, on the description of the fundamental principles of video compression. In that context, we will also discuss the improvements in video coding technology that led to a continuous increase of coding efficiency from one generation of video coding standards to the next.

1.1 The Video Communication Problem

The most important processing steps of a typical video transmission system are illustrated in the block diagram of Figure 1.1. At the beginning of the signal processing chain, a digital video signal is captured by a camera. The lens of the camera projects an image of a 3D scene onto the surface of an image sensor, which samples the optical signal. The resulting raw data samples are further processed inside the camera and transformed into a representation format. The video signal eventually delivered by the camera consists of arrays of discrete-amplitude samples. An optional preprocessing step may be applied, for example, for improving the contrast and color representation or for reducing the noise; the latter has typically a beneficial effect on the following coding step. The video encoder maps the sample arrays of the representation

format into a so-called bitstream, which usually has a much lower bit rate than the raw data samples. In the simplest case, the video bitstream generated by the encoder is stored, typically inside a container format such as the Audio Video Interleave (AVI), QuickTime, or Matroska format. In most applications, however, the compressed video is transmitted. The transmission chain usually consists of a channel encoder, a modulator, the actual physical transmission channel, a demodulator, and a channel decoder. The channel encoder extends the bitstream by adding structured redundancy suitable for detecting or correcting potential transmission errors at the receiver side. The modulator maps the resulting data stream into an analog signal, which is transmitted over the physical channel. At the receiver side, the demodulator extracts a digital data stream from the received analog signal. The channel decoder produces the received bitstream by detecting and correcting transmission errors and extracting the video data packets from the data stream. Note that if not all transmission errors can be corrected, the received bitstream is not identical to the bitstream generated by the video encoder.

[Figure 1.1: Structure of a typical video communication scenario. The pre- and postprocessing steps are optional; the transmission may be replaced by storage.]

The video decoder reconstructs the sample arrays from the received bitstream or, alternatively, the bitstream read from a storage device. Optionally, the decoded signal may be postprocessed in order to reduce the impact of transmission errors and coding artifacts on the subjective video quality. At the end of the communication chain, the video signal is typically displayed and perceived by human beings.

This monograph focuses on the video encoder and video decoder, which are often summarized under the term video codec. The encoder maps the samples of the original video pictures to a set of so-called coding parameters and writes these coding parameters to a bitstream. The bitstream represents the input video in compressed form and is transmitted to the video decoder. The format in which the coding parameters are written to the bitstream is referred to as bitstream syntax. It has to be known to both the encoder and the decoder. The video decoder parses the received bitstream according to the given bitstream syntax and thereby decodes the transmitted coding parameters. Finally, the video pictures are reconstructed by following a defined decoding process, which is controlled by the transmitted coding parameters.

For achieving the required transmission bit rates, video codecs apply lossy coding algorithms. Hence, even in error-free transmission scenarios, the digital video signal reconstructed by the decoder is different from the encoder's input signal. Since, in most applications, the coded video is perceived by human beings, the degradation of the perceived video quality should be as small as possible. The basic video coding problem can be stated as representing a video signal with the highest possible subjective quality without exceeding an available bit rate or, alternatively, as conveying the video signal with the lowest possible bit rate while maintaining a specified subjective quality. In practice, the subjective quality of a video signal is very hard to specify and therefore objective distortion measures calculated based on the differences between the original and decoded sample values are often used instead. The ability of a codec to choose a suitable trade-off between signal distortion and bit rate is referred to as its coding efficiency.
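The objective distortion measures mentioned above are typically computed as the mean squared error (MSE) between original and decoded samples, or as the peak signal-to-noise ratio (PSNR) derived from it. The following minimal Python sketch is not part of the original text and uses made-up numbers; it only illustrates these measures and the kind of distortion-rate trade-off an encoder has to resolve, here expressed as a Lagrangian cost D + λR of the form used for the encoder control in Section 4.

```python
import math

def mse(original, decoded):
    """Mean squared error between two equally sized lists of sample values."""
    return sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)

def psnr(original, decoded, max_value=255):
    """Peak signal-to-noise ratio in dB for samples with the given maximum value."""
    d = mse(original, decoded)
    return float("inf") if d == 0 else 10.0 * math.log10(max_value ** 2 / d)

# Hypothetical coding options for one block: (distortion D, rate R in bits).
candidates = [(120.0, 2000), (80.0, 2600), (60.0, 3500)]

# A simple way to trade off distortion against rate is to minimize the
# Lagrangian cost D + lambda * R (cf. the encoder control in Section 4).
lam = 0.05
best = min(candidates, key=lambda dr: dr[0] + lam * dr[1])
print(best)  # (80.0, 2600) minimizes D + lambda * R for lambda = 0.05
```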

Besides the coding efficiency, the applicability of a video codec for a certain communication scenario is also influenced by the implementation complexity of the algorithms used as well as the structural and processing delay of the codec, which determine, among other factors, the crucial end-to-end delay between capturing and displaying a video picture.

In the most common setup of video codecs, the video decoder merely extracts the coding parameters from the received bitstream and follows a defined decoding process for reconstructing the video pictures. Given a particular bitstream syntax and decoding process, all decoder implementations generate the same (or, sometimes, nearly the same) video pictures. The achievable coding efficiency is limited by the set of syntax features and coding tools that are supported in the bitstream syntax and the decoding process. However, the actual coding efficiency of a bitstream is highly dependent on the encoder implementation. For a given bitstream syntax and decoding process, both the bit rate and the reconstruction quality are determined by the encoding algorithm that maps the original pictures into a sequence of coding parameters.

1.2 Scope and Overview of the Text

The present text provides a description of the fundamental concepts of video coding. It is aimed at aiding students and engineers to investigate the subject. Since the topic of video coding and video communication is too broad and too deep to exhaustively describe all its aspects in the chosen presentation format, we concentrate on the signal processing in video encoders and decoders. This also means that we will leave out a number of areas, including software and hardware implementation aspects, the topics of pre- and postprocessing, and the whole field of video transmission and error-robust coding.

The intention of this text is to provide an in-depth treatment of the basic principles and coding tools found in modern video codecs. Subjects that we consider particularly important will be covered in greater detail. For giving examples and analyzing certain coding tools, we will often refer to video coding standards. These standards do not only represent the dominant technology for real-world applications, but also reflect the state of the art in the field of video coding. In fact, many advanced coding tools have been developed in international

standardization projects. Video coding approaches that are less relevant for practical applications, such as 3D subband coding or distributed video coding, will not be explained in this text. Moreover, we will neither discuss scalable nor 3D video coding. These coding schemes typically represent extensions of conventional video coding designs. Even though they include some additional coding tools, the same fundamental concepts as in conventional video coding are employed.

The monograph is divided into two parts. In the first part [301], we introduced the fundamental source coding techniques (entropy coding, quantization, prediction, and transform coding) and analyzed their coding efficiency based on simple models for 1D random signals. The present second part describes the application of these techniques to video coding. We describe the basic structure of video codecs, discuss the fundamental concepts of video coding, and highlight their application in modern video coding standards. The effectiveness of various coding tools will be demonstrated based on experimental results.

Section 2 gives an overview of the acquisition, representation, and display of video signals. It describes the raw data formats used in video coding applications and highlights the relationship between the acquisition, representation, and display of video signals and the way we perceive visual information. Section 3 introduces the basic principles of hybrid video coding and describes the structure of typical video encoders and decoders. It further introduces the measures that we will use for comparing the coding efficiency of different codecs. Section 4 describes the concept of a Lagrangian encoder control, which we will use in all coding experiments. The usage of a unified and highly effective encoder control allows us to fairly compare different coding tools in terms of coding efficiency. Section 5 discusses the application of transform coding in video codecs and introduces techniques for intra-picture coding. Section 6 describes coding tools for inter-picture coding. It introduces the concept of motion-compensated prediction and analyzes several design aspects in terms of coding efficiency. Section 7 compares important video coding standards with respect to their coding efficiency. A summary of important results is given in Section 8.

2 Acquisition, Representation, Display, and Perception of Image and Video Signals

In digital video communication, we typically capture a natural scene by a camera, transmit or store data representing the scene, and eventually reproduce the captured scene on a display. The camera converts the light emitted or reflected from objects in a 3D scene into arrays of discrete-amplitude samples. In the display device, the arrays of discrete-amplitude samples are converted into light that is emitted from the display and perceived by human beings. The primary task of video coding is to represent the sample arrays generated by the camera and used by the display device with a small number of bits, suitable for transmission or storage. Since the achievable compression for an exact representation of the sample arrays recorded by a camera is not sufficient for most applications, the sample arrays are approximated in a way that they can be represented with a given maximum number of bits or bits per time unit. Ideally, the degradation of the perceived image quality due to the modifications of the sample arrays should be as small as possible. Hence, even though video coding eventually deals with mapping arrays of discrete-amplitude samples into a bitstream, the quality of the displayed video is largely influenced by the way we acquire, represent, display, and perceive visual information.

Certain properties of human visual perception have in fact a large impact on the construction of cameras, the design of displays, and the way visual information is represented as sample arrays. And even though today's video coding standards have been mainly designed from a signal processing perspective, they provide features that can be used for exploiting some properties of human vision. In the following section, we briefly review properties of image formation and human vision, describe raw data formats, and discuss the design of cameras and displays. For additional information on these topics, the reader is referred to the comprehensive overview in [208].

2.1 Fundamentals of Image Formation

In digital cameras, a 3D scene is projected onto an image sensor, which measures physical quantities of the incident light and converts them into arrays of samples. For obtaining an image of the real world on the sensor's surface, we require an optical device that basically projects all light rays that are emitted or reflected from an object point and fall through the opening of the camera into a point in the image plane. In the following, we review some basic properties of image formation. For a more detailed treatment of the subject, we recommend the classic references by Born and Wolf [14] and Hecht [95].

2.1.1 Image Formation with Lenses

Lenses consist of transparent materials such as glass. They change the direction of light rays falling through the lens due to refraction at the boundary between the lens material and the surrounding air. The shape of a lens determines how the wavefronts of the light are deformed. Lenses that project all light rays originating from an object point into a single image point have a hyperbolic shape at both sides [95]. This is, however, only valid for monochromatic light and a single object point; there are no lens shapes that form perfect images of objects. Since it is easier and less expensive to manufacture lenses with spherical surfaces, most lenses used in practice are spherical lenses. Aspheric lenses are, however, often used for minimizing aberration in lens systems.

Thin Lenses. We restrict our considerations to paraxial approximations (the angles between the light rays and the optical axis are very small) for thin lenses (the thickness is small compared to the radii of curvature). Under these assumptions, a convex lens projects an object at a distance s from the lens onto an image plane located at a distance b on the other side of the lens, see Figure 2.1(a). The relationship between the object distance s and the image distance b is given by

$\frac{1}{s} + \frac{1}{b} = \frac{1}{f}$,   (2.1)

which is known as the Gaussian lens formula (a derivation is, for example, given in [95]). The quantity f is called the focal length and represents the distance from the lens plane in which light rays that are parallel to the optical axis are focused into a single point. For focusing objects at different locations, the distance b between lens and image sensor can be modified. Far objects ($s \to \infty$) are in focus if the distance b is approximately equal to the focal length f. As illustrated in Figure 2.1(b), for a given image sensor, the focal length f of the lens determines the field of view. With d representing the width, height, or diagonal of the image sensor, the angle of view is given by

$\theta \approx 2 \arctan\!\left(\frac{d}{2f}\right)$.   (2.2)

Aperture and Depth of Field. In addition to the focal length, a lens is characterized by its opening, which is referred to as the aperture of the lens. As illustrated in Figure 2.1(c), the aperture determines the bundle of light rays that is focused in the image plane. In camera lenses, typically adjustable apertures with an approximately circular shape are used. The aperture diameter a is commonly notated as $a = f/F$. The number F is referred to as the f-number. For a given distance b between lens and sensor, only object points that are located in a plane at a particular distance s are focused on the sensor. As shown in Figure 2.1(d), object points located at distances $s + \Delta s_F$ and $s - \Delta s_N$ would be focused at image distances $b - \Delta b_F$ and $b + \Delta b_N$, respectively. On the image sensor, at the distance b, the

object points appear as blur spots, which are called circles of confusion. If the blur spots are small enough, the projected objects still appear to be sharp in a photo or video.

[Figure 2.1: Image formation with lenses: (a) Object and image location for a thin convex lens; (b) Angle of view; (c) Aperture; (d) Relationship between the acceptable diameter c for the circles of confusion and the depth of field D.]

Given a maximum acceptable diameter c for the circles of confusion, we can derive the range of object distances for which we obtain a sharp projection on the image sensor. By considering similar triangles at the image side in Figure 2.1(d) and using the Gaussian lens formula (2.1), we obtain

$\Delta s_F = \dfrac{F\,c\,s\,(s-f)}{f^2 - F\,c\,(s-f)}$  and  $\Delta s_N = \dfrac{F\,c\,s\,(s-f)}{f^2 + F\,c\,(s-f)}$.   (2.3)

The distance $D = \Delta s_F + \Delta s_N$ between the nearest and farthest objects that appear acceptably sharp in an image is called the depth of field. It is given by

$D = \dfrac{2\,F\,c\,f^2\,s\,(s-f)}{f^4 - F^2\,c^2\,(s-f)^2} \approx \dfrac{2\,F\,c\,s^2}{f^2}$.   (2.4)

For the simplification at the right side of (2.4), we used the often valid approximations $s \gg f$ and $c \ll f^2/s$.
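To make the formulas above concrete, the following Python sketch evaluates the Gaussian lens formula (2.1), the angle of view (2.2), and the depth-of-field relations (2.3) and (2.4) numerically. The chosen numbers (a 50 mm lens at f/2.8, an object at 3 m, and a circle-of-confusion diameter of 0.03 mm) are illustrative assumptions, not values from the text.

```python
import math

def image_distance(s, f):
    """Image distance b from the Gaussian lens formula 1/s + 1/b = 1/f, cf. (2.1)."""
    return 1.0 / (1.0 / f - 1.0 / s)

def angle_of_view(d, f):
    """Angle of view (2.2) for a sensor dimension d and a focal length f."""
    return 2.0 * math.atan(d / (2.0 * f))

def depth_of_field(s, f, F, c):
    """Near and far limits (2.3) and total depth of field (2.4)."""
    ds_far = F * c * s * (s - f) / (f ** 2 - F * c * (s - f))
    ds_near = F * c * s * (s - f) / (f ** 2 + F * c * (s - f))
    return ds_near, ds_far, ds_near + ds_far

f, F, s, c = 0.050, 2.8, 3.0, 0.00003   # all lengths in meters
ds_near, ds_far, D = depth_of_field(s, f, F, c)
print(f"image distance b = {image_distance(s, f) * 1000:.2f} mm")
print(f"angle of view for d = 36 mm: {math.degrees(angle_of_view(0.036, f)):.1f} deg")
print(f"depth of field: -{ds_near:.3f} m / +{ds_far:.3f} m, D = {D:.3f} m")
print(f"approximation 2*F*c*s^2/f^2 = {2 * F * c * s * s / (f * f):.3f} m")
```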

The maximum acceptable diameter c for the circles of confusion could be defined in multiple ways. Based on considerations about the resolution capabilities of the human eye and the typical viewing angle for a photo or video, it is common practice to define c as a fraction of the sensor diagonal d (for example, $c \approx d/1500$). By using (2.2), we then obtain the relationship

$D \approx \dfrac{2\,F\,c\,s^2}{f^2} = \dfrac{8\,F\,c\,s^2}{d^2}\,\tan^2\!\left(\frac{\theta}{2}\right) \approx \dfrac{8\,F\,s^2}{1500\,d}\,\tan^2\!\left(\frac{\theta}{2}\right)$,   (2.5)

where θ denotes the diagonal angle of view. Note that the depth of field D increases with decreasing sensor size. When we film a scene with a given camera, the depth of field can basically only be controlled by changing the aperture of the lens.

[Figure 2.2: Perspective projection of the 3D space onto an image plane.]

Projection by Lenses. As we have discussed above, a lens actually generates a 3D image of a scene and the image sensor basically extracts a plane of this 3D image. For many applications, the projection of the 3D world onto the image plane can be reasonably well approximated by the perspective projection model. If we define the world and image coordinate systems as illustrated in Figure 2.2, a point P at world coordinates (X, Y, Z) is projected into a point p at the image coordinates (x, y). The image coordinates are given by

$x = \dfrac{b}{Z}\,X \approx \dfrac{f}{Z}\,X$  and  $y = \dfrac{b}{Z}\,Y \approx \dfrac{f}{Z}\,Y$.   (2.6)
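The perspective projection (2.6) reduces to a scaling of the world coordinates X and Y by f/Z. A minimal sketch (the focal length and point coordinates are made-up example values):

```python
def project(point, f):
    """Perspective projection (2.6): (X, Y, Z) -> (x, y) = (f*X/Z, f*Y/Z)."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("the point must lie in front of the lens (Z > 0)")
    return f * X / Z, f * Y / Z

# A 50 mm lens and a point 4 m in front of the camera, 1 m to the right, 0.5 m up.
x, y = project((1.0, 0.5, 4.0), f=0.050)
print(f"x = {x * 1000:.2f} mm, y = {y * 1000:.2f} mm")   # 12.50 mm, 6.25 mm
```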

[Figure 2.3: Diffraction in cameras: (a) Diffraction of a plane wave at an aperture; (b) Diffraction in cameras can be modeled using Fraunhofer diffraction.]

2.1.2 Diffraction and Optical Resolution

Until now, we assumed that rays of light in a homogeneous medium propagate in rectilinear paths. Experiments show, however, that light rays are bent when they encounter small obstacles or openings. This phenomenon is called diffraction and can be explained by the wave character of light. As we will discuss in the following, diffraction effects limit the resolving power of optical instruments such as cameras.

As shown in Figure 2.3(a), we consider a plane wave with wavelength λ that encounters an aperture with the pupil function g(ζ, η). The pupil function is defined in a way that values of g(ζ, η) = 0 specify opaque points and values of g(ζ, η) = 1 specify transparent points in the aperture plane. The irradiance Ψ(x, y) observed on a screen at distance z depends on the spatial position (x, y). For $z \gg a^2/\lambda$, with a being the largest dimension of the aperture, the phase differences between the individual contributions that are superposed on the screen only depend on the viewing angles given by $\sin\varphi = x/R$ and $\sin\theta = y/R$, with $R = \sqrt{x^2 + y^2 + z^2}$. This far-field approximation is referred to as Fraunhofer diffraction. Since a lens placed behind an aperture focuses parallel light rays into a point, as illustrated in Figure 2.3(b), diffraction observed in cameras can be modeled using Fraunhofer diffraction. The observed irradiance pattern [95] is given by

$\Psi(x, y) = \hat{\Psi}\,\left| G\!\left(\dfrac{x}{\lambda R}, \dfrac{y}{\lambda R}\right) \right|^2$,   (2.7)

where $\hat{\Psi}$ is a constant and G(u, v) represents the 2D Fourier transform

of the pupil function g(ζ, η). For cameras with circular apertures, the diffraction pattern on the sensor [95] at a distance $z \approx f$ is given by

$\Psi(r) = \Psi_0 \left(\dfrac{2\,J_1(\beta r)}{\beta r}\right)^{\!2}$  with  $\beta = \dfrac{\pi a}{\lambda R} \approx \dfrac{\pi a}{\lambda f} = \dfrac{\pi}{\lambda F}$,   (2.8)

where $r = \sqrt{x^2 + y^2}$ represents the distance from the optical axis, $\Psi_0 = \Psi(0)$ is the maximum irradiance, a, f, and F = f/a denote the aperture diameter, focal length, and f-number, respectively, and $J_1(x)$ represents the Bessel function of the first kind and order one. The diffraction pattern (2.8), which is illustrated in Figure 2.4(a), is called an Airy pattern and its bright central region is called an Airy disk.

[Figure 2.4: Optical resolution: (a) Airy pattern; (b) Two just resolved image points; (c) Modulation transfer function of a diffraction-limited lens with a circular aperture.]

Optical Resolution. The imaging quality of an optical system can be described by the point spread function (PSF) or line spread function. They specify the projected patterns for a focused point or line source. For large object distances, the wave fronts encountering the aperture are approximately planar. If we have a circular aperture and assume that diffraction is the only source of blurring, the PSF is given by the Airy pattern (2.8). Optical systems for which the imaging quality is only limited by diffraction are referred to as diffraction-limited or perfect optics. In real lenses, we have additional sources of blurring, which are caused by deviations from the paraxial approximation (2.1).

The PSF of an optical system determines its ability to resolve details in the image. Two image points are said to be just resolvable when the center of one diffraction pattern coincides with the first minimum of

the other diffraction pattern. This rule is known as the Rayleigh criterion and is illustrated in Figure 2.4(b). For cameras with diffraction-limited lenses and circular apertures, two image points are resolvable if the distance $\Delta r$ between the centers of the Airy patterns satisfies

$\Delta r \geq \Delta r_{\min} = \dfrac{x_1}{\pi}\,\lambda\,F \approx 1.22\,\lambda\,F$,   (2.9)

where $x_1$ represents the first zero of $J_1(x)/x$. As an example, we consider a camera with a 13.2 mm × 8.8 mm sensor and an aperture of f/4 and assume a wavelength of λ = 550 nm (in the middle of the visible spectrum). Even with a perfect lens, we cannot discriminate more than about 16 million points (16 Megapixels) on the image sensor. The number of discriminable points increases with decreasing f-number and increasing sensor size. By considering (2.5), we can, however, conclude that for a given picture (same field of view and depth of field), the number of distinguishable points is independent of the sensor size.

Modulation Transfer Function. The resolving capabilities of lenses are often specified in the frequency domain. The optical transfer function (OTF) is defined as the 2D Fourier transform of the point spread function, OTF(u, v) = FT{PSF(x, y)}. The amplitude spectrum MTF(u, v) = |OTF(u, v)| is referred to as the modulation transfer function (MTF). Typically, only a 1D slice MTF(u) of the modulation transfer function MTF(u, v) is considered, which corresponds to the Fourier transform of the line spread function. The contrast C of an irradiance pattern shall be defined by

$C = \dfrac{\Psi_{\max} - \Psi_{\min}}{\Psi_{\max} + \Psi_{\min}}$,   (2.10)

where $\Psi_{\min}$ and $\Psi_{\max}$ represent the minimum and maximum irradiances. The modulation transfer MTF(u) specifies the reduction in contrast C for harmonic stimuli with a spatial frequency u,

$\mathrm{MTF}(u) = C_{\mathrm{image}} / C_{\mathrm{object}}$,   (2.11)

where $C_{\mathrm{object}}$ and $C_{\mathrm{image}}$ denote the contrasts in the object and image domain, respectively. The OTF of diffraction-limited optics can also be calculated as the normalized autocorrelation function of the pupil

function g(ζ, η) [81]. For a camera with a diffraction-limited lens and a circular aperture with the f-number F, the MTF is given by

$\mathrm{MTF}(u) = \begin{cases} \dfrac{2}{\pi}\left(\arccos\dfrac{u}{u_0} - \dfrac{u}{u_0}\sqrt{1 - \left(\dfrac{u}{u_0}\right)^{2}}\,\right) & : u \le u_0 \\ 0 & : u > u_0 \end{cases}$,   (2.12)

where $u_0 = 1/(\lambda F)$ represents the cut-off frequency. This function is illustrated in Figure 2.4(c). The MTF for real lenses generally lies below that for diffraction-limited optics.

Optical Aberrations. The Gaussian lens formula represents only an approximation for thin lenses and paraxial rays. Deviations from the predictions of Gaussian optics that are not caused by diffraction are called aberrations. Aberrations that are caused by the imperfect geometry of lenses and occur even with monochromatic light are referred to as monochromatic aberrations. In contrast, chromatic aberrations occur only for light consisting of multiple wavelengths. They arise from the fact that the phase velocity of an electromagnetic wave in a medium depends on its frequency, a phenomenon called dispersion. As a result, light rays of different wavelengths are refracted at different angles and, thus, the effective focal length depends on the wavelength. Aberrations can be reduced by combining multiple lenses. Typical camera lenses consist of about 10 to 20 lens elements, including aspherical lenses and lenses of extra-low dispersion materials.

2.2 Visual Perception

In most areas of image communication, whether it be photography, television, home entertainment, video streaming or video conferencing, the photos and videos are eventually viewed by human beings. The way humans perceive visual information determines whether a reproduction of a real-world scene in the form of a printed photograph or pictures displayed on a monitor or television screen looks realistic and truthful. In fact, certain aspects of human vision are not only taken into account for designing cameras, displays and printers, but are also exploited for digitally representing and coding still and moving pictures.
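The numerical example given above (a 13.2 mm × 8.8 mm sensor at f/4 and λ = 550 nm) can be reproduced with a few lines of Python; the sketch below also evaluates the diffraction-limited MTF (2.12). The helper names are ours, and the Rayleigh constant 1.22 is used directly instead of the exact zero of $J_1(x)/x$.

```python
import math

def rayleigh_limit(wavelength, F):
    """Minimum resolvable separation (2.9) of a diffraction-limited lens."""
    return 1.22 * wavelength * F

def resolvable_points(width, height, wavelength, F):
    """Rough count of just-resolvable points on a sensor (Rayleigh criterion)."""
    dr = rayleigh_limit(wavelength, F)
    return (width / dr) * (height / dr)

def mtf_diffraction_limited(u, wavelength, F):
    """Diffraction-limited MTF (2.12) with cut-off frequency u0 = 1/(lambda * F)."""
    u0 = 1.0 / (wavelength * F)
    if u >= u0:
        return 0.0
    x = u / u0
    return (2.0 / math.pi) * (math.acos(x) - x * math.sqrt(1.0 - x * x))

lam, F = 550e-9, 4.0
print(f"Rayleigh limit: {rayleigh_limit(lam, F) * 1e6:.2f} um")
print(f"points on a 13.2 mm x 8.8 mm sensor: "
      f"{resolvable_points(13.2e-3, 8.8e-3, lam, F) / 1e6:.1f} million")
print(f"MTF at half the cut-off frequency: "
      f"{mtf_diffraction_limited(0.5 / (lam * F), lam, F):.3f}")
```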

[Figure 2.5: Basic structure of the human eye.]

In the following, we give a brief overview of the human visual system with particular emphasis on the perception of color. We will mainly concentrate on aspects that influence the way we capture, represent, code and display pictures. For more details on human vision, the reader is referred to the books by Wandell [285] and Palmer [196]. The topic of colorimetry is comprehensively treated in the classic reference by Wyszecki and Stiles [324] and the book by Koenderink [151].

2.2.1 The Human Visual System

The human eye has similar components as a camera. Its basic structure is illustrated in Figure 2.5. The cornea and the crystalline lens, which is embedded in the ciliary muscle, form a two-lens system. They act like a single convex lens and project an image of real-world objects onto a light-sensitive surface, the retina. The photoreceptor cells in the retina convert absorbed photons into neural signals that are further processed by the neural circuitry in the retina and transmitted through the optic nerve to the visual cortex of the brain. The area of the retina that provides the sharpest vision is called the fovea. We always move our eyes such that the image of the object we look at falls on the fovea. The iris is a sphincter muscle that controls the size of the hole in its middle, called the pupil, and thus the amount of light entering the retina.

Human Optics. In contrast to cameras, the distance between lens and retina cannot be modified for focusing objects at varying distances. Instead, focusing is achieved by adapting the shape of the crystalline lens by the ciliary muscle. This process is referred to as accommodation.

[Figure 2.6: Illustration of the distribution of photoreceptor cells along the horizontal meridian of the human eye (plotted using experimental data of [50]).]

Investigations of the optical quality of the human eye [25, 167, 176] show that the eye is far from being perfect optics. While for very small pupil sizes, the eye is nearly diffraction-limited, for larger pupil sizes, the imperfections of the cornea and crystalline lens cause significant monochromatic aberrations, much larger than that of camera lenses. The dispersion of the substances inside the eye leads also to significant chromatic aberrations. In typical lighting conditions, the green range of the spectrum, which the eye is most sensitive to, is sharply focused on the retina, while the focal planes for the blue and red ranges are in front of and behind the retina, respectively. The strongest blurring is observed for the short wavelength range of visible light [87].

Human Photoreceptors. The retina contains two classes of photoreceptor cells, the rods and cones. At very low light levels, only the rods contribute to the visual perception; this case is called scotopic vision. Under well-lit viewing conditions, however, only the cones are effective. This case is referred to as photopic vision. It represents the type of vision that is relevant for video communication applications. There are about 100 million rods and 5 million cones in each eye, which are very differently distributed throughout the retina [195, 50], see Figure 2.6. The rods are concentrated in the periphery. The fovea does not contain any rods, but by far the highest concentration of cones. At the location of the optic nerve, also referred to as the blind spot, there are no photoreceptors. Although the retina contains much more

rods than cones, the visual acuity of scotopic vision is much lower than that of photopic vision. The reason is that the photocurrent responses of many rods are combined into a single neural response, whereas each cone signal is further processed by several neurons in the retina [285].

[Figure 2.7: Spectral sensitivity of human vision: (a) Luminous efficiency functions; (b) Spectral sensitivity of the human photoreceptors.]

Spectral Sensitivity. The sensitivity of the human eye depends on the spectral characteristics of the observed light stimulus. Based on the data of several brightness-matching experiments, for example [72], the Commission Internationale de l'Eclairage (CIE) defined the so-called CIE luminous efficiency function V(λ) for photopic vision [42] in 1924. This function characterizes the average spectral sensitivity of human brightness perception. Two light stimuli with different radiance spectra Φ(λ) are perceived as equally bright if the corresponding values $\int_0^\infty \Phi(\lambda)\,V(\lambda)\,d\lambda$ are the same. V(λ) determines the relation between radiometric and photometric quantities. For example, the analogous photometric quantity of the radiance $\Phi = \int_0^\infty \Phi(\lambda)\,d\lambda$ is the luminance $I = K \int_0^\infty V(\lambda)\,\Phi(\lambda)\,d\lambda$, where K is a constant (683 lumen per Watt). The SI unit of the luminance is candela per square meter (cd/m²). Viewing experiments under scotopic conditions lead to the definition of a scotopic luminous efficiency function V'(λ) [44]. The luminous efficiency functions V(λ) and V'(λ) are depicted in Figure 2.7(a). Both functions are noticeably greater than zero in the range from about 390 to 700 nm. For that reason, electromagnetic radiation in this part of the spectrum is commonly called visible light.
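The photometric relation $I = K \int V(\lambda)\,\Phi(\lambda)\,d\lambda$ can be evaluated numerically once V(λ) is available in sampled form. The sketch below uses a crude Gaussian stand-in for V(λ) (peaking at 555 nm) purely as a placeholder; for accurate results, the tabulated CIE values have to be used.

```python
import math

K = 683.0   # lm/W, maximum luminous efficacy for photopic vision

def V_approx(lam_nm):
    """Crude Gaussian placeholder for the CIE photopic luminous efficiency
    function V(lambda), peaking at 555 nm; replace with tabulated CIE data."""
    return math.exp(-0.5 * ((lam_nm - 555.0) / 42.0) ** 2)

def luminance(spectrum, lam_start=380.0, lam_step=5.0):
    """Numerical approximation of I = K * integral of V(lambda) * Phi(lambda).
    'spectrum' holds spectral radiance samples in W/(sr m^2 nm) on a 5 nm grid."""
    return K * sum(V_approx(lam_start + i * lam_step) * phi * lam_step
                   for i, phi in enumerate(spectrum))

# Example: a flat (equal-energy) spectrum of 0.01 W/(sr m^2 nm) from 380 to 780 nm.
flat_spectrum = [0.01] * 81
print(f"luminance approx. {luminance(flat_spectrum):.0f} cd/m^2")
```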

In very low light (scotopic vision), we can only discriminate between different brightness levels, but under photopic conditions, we are able to see colors. The reason is that the human retina contains only a single rod type, but three types of cones, each with a different spectral sensitivity. The existence of three types of photoreceptors was postulated in the 19th century by Young [331] and Helmholtz [97]. In the 1960s, direct measurements on single photoreceptor cells of the human retina [21] confirmed the Young-Helmholtz theory of trichromatic vision. The cone types are referred to as L-, M- and S-cones, where L, M and S stand for the long, medium and short wavelength range, respectively, and characterize the peak sensitivities. The spectral sensitivity of cone cells was determined by measuring photocurrent responses [10, 218]. For describing color perception, we are, however, more interested in spectral sensitivities with respect to light entering the cornea, which are different, since the short wavelength range is strongly absorbed by different components of the eye before reaching the retina. Such sensitivity functions, which are also called cone fundamentals, can be estimated by comparing color-matching data (see Section 2.2.2) of individuals with normal vision with that of individuals lacking one or two cone types. In Figure 2.7(b), the cone fundamentals estimated by Stockman et al. [245, 244] are depicted together with the spectral sensitivity function for the rods, which is the same as the scotopic luminous efficiency function V'(λ).

Luminance Sensitivity. The sensing capabilities of the human eye span a luminance range of about 11 orders of magnitude, from the visual threshold of about $10^{-6}$ cd/m² to about $10^{5}$ cd/m² [87]. However, in each moment, only luminance levels in a range of about 2 to 3 orders of magnitude can be distinguished. In order to cover the huge range of ambient light levels, the human eye adapts its sensitivity to the lighting conditions. A fast adaptation mechanism is the pupillary light reflex, which controls the pupil size depending on the luminance on the retina. The main factors, however, which are also responsible for the transition between rod and cone vision, are photochemical reactions in the pigments of the rod and cone cells and neural processes.

To a large extent, the sensitivities of the three cone types are independently controlled. As a consequence, the human eye does not only adjust to the luminance level, but also to the spectral composition of the incident light. In connection with certain properties of the neural processing, this aspect causes the phenomenon of color constancy, which describes the effect that the perceived colors of objects are relatively independent of the spectral composition of the illuminating light.

Another property of human vision is that our ability to distinguish two areas with the same color but a particular difference in luminance depends on the brightness of the viewed scene. Let I and ΔI denote the background luminance, to which the eye is adapted, and the just perceptible increase in luminance, respectively. Within a wide range of luminance values I, from about 50 to $10^4$ cd/m² [87], the relative sensitivity ΔI/I is nearly constant (approximately 1 to 2%). This behavior is known as the Weber-Fechner law.

Opponent Colors. The theory of opponent colors was first formulated by Hering [98]. He found that certain hues are never perceived to occur together. While colors can be perceived as a combination of, for example, yellow and red (orange), red and blue (purple), or green and blue (cyan), there are no colors that are perceived as a combination of red and green or yellow and blue. Hering concluded that the human color perception includes a mechanism with bipolar responses to red-green and blue-yellow. These hue pairs are referred to as opponent colors. According to the opponent color theory, any light stimulus is received as containing either one or the other of the opponent colors, or, if both contributions cancel out, none of them.

For a long time, the opponent color theory seemed to be irreconcilable with the Young-Helmholtz theory of trichromatic vision. In the 1950s, Jameson and Hurvich [130, 104] performed hue-cancellation experiments by which they estimated the spectral sensitivities of the opponent-color mechanisms. Furthermore, measurements of electrical responses in the retina of goldfish [257, 258] and the lateral geniculate nucleus of the macaque monkey [53] showed the existence of neural signals that were consistent with the bipolar responses formulated by

Hering. These and other experimental findings resulted in a wide acceptance of the modern theory of opponent colors, according to which the responses of the three cone types to light stimuli are not directly transmitted to the brain. Instead, neurons along the visual pathways transform the cone responses into three opponent signals, as illustrated in Figure 2.8(a). The transformation can be considered as approximately linear and the outputs are an achromatic signal, which corresponds to a relative luminance measure, as well as a red-green and a yellow-blue color difference signal.

[Figure 2.8: Opponent color theory: (a) Simplified model for the neural processing of the cone responses; (b) Estimates [243] of the spectral sensitivities of the opponent-color processes (for the eye adapted to equal-energy white).]

Since the cone sensitivities are to a large extent independently adjusted, the spectral sensitivities of the opponent processes depend on the present illumination. Estimates of the spectral sensitivity curves for the eye adapted to equal-energy white (same spectral radiance for all wavelengths) are shown in Figure 2.8(b). The depicted curves represent linear combinations, suggested in [243], of the Stockman and Sharpe cone fundamentals [244]. As an example, let Φ(λ) denote the radiance spectrum of a light stimulus and let $c_{rg}(\lambda)$ represent the spectral sensitivity curve for the red-green process. If the integral $\int_0^\infty \Phi(\lambda)\,c_{rg}(\lambda)\,d\lambda$ is positive, the light stimulus is perceived as containing a red component; if it is negative, the stimulus appears to include a green component. As has been shown in [22], the conversion of the cone responses into opponent signals is effectively a decorrelation. It can be interpreted as a way of improving the neural transmission of color information.
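The transformation sketched in Figure 2.8(a) is, to a first approximation, a fixed linear map from the cone responses to the three opponent signals. The small Python sketch below applies one such map; the matrix entries are illustrative placeholders chosen only to show the structure of the transform, not the estimates from [243].

```python
# Placeholder linear map from cone responses (L, M, S) to opponent signals
# (achromatic, red-green, yellow-blue); the weights are illustrative only.
OPPONENT_MATRIX = [
    [0.60,  0.40,  0.00],   # achromatic (luminance-like): weighted L + M
    [1.00, -1.00,  0.00],   # red-green difference:        L - M
    [0.50,  0.50, -1.00],   # yellow-blue difference:      (L + M) - S
]

def to_opponent(lms):
    """Apply the linear opponent-color transform to a cone response vector."""
    return [sum(w * c for w, c in zip(row, lms)) for row in OPPONENT_MATRIX]

print(to_opponent([0.8, 0.7, 0.2]))   # approx. [0.76, 0.10, 0.55]
```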

Neural Processing. The neural responses of the photoreceptor cells are first processed by the neurons in the retina and then transmitted to the visual cortex of the brain, where the visual information is further processed and eventually interpreted, yielding the images of the world we perceive every day. The mechanisms of the neural processing are extremely complex and not yet fully understood. Although many more aspects of visual perception than the ones mentioned in this section are already known, we will not discuss them in this monograph, since they are virtually not exploited in today's video communication applications. The main reason that most properties of human vision are neglected in image and video coding is that no simple and sufficiently accurate model has been found that allows to quantify the perceived image quality based on samples of an image or video.

2.2.2 Color Perception

While the previous section gave a brief overview of the human visual system, we will now further analyze and quantitatively describe the perception and reproduction of color information. In particular, we will discuss the colorimetric standards of the CIE, which are widely used as a basis for specifying color in image and video representation formats.

Metamers. It is a well-known fact that, by using a prism, a ray of sunlight can be split into components of different wavelengths, which we perceive to have different colors, ranging from violet over blue, cyan, green, yellow, orange to red. We can conclude that light with a particular spectral composition induces the perception of a particular color. But the converse is not true. Two light stimuli that appear to have the same color can have very different spectral compositions. Color is not a physical quantity, but a sensation in the viewer's mind induced by the interaction of electromagnetic waves with the human cones. A light stimulus emitted or reflected from the surface of an object and falling through the pupil of the eye can be physically characterized by its radiance spectrum, specifying its composition of electromagnetic waves with different wavelengths. The light falling on the retina excites the three cone types in different ways. Let l(λ), m(λ) and s(λ) represent

the normalized spectral sensitivity curves of the L-, M- and S-cones, respectively, which have been illustrated in Figure 2.7. Then, a radiance spectrum Φ(λ) is effectively mapped to a 3D vector

$\begin{pmatrix} L \\ M \\ S \end{pmatrix} = \int_0^\infty \begin{pmatrix} l(\lambda) \\ m(\lambda) \\ s(\lambda) \end{pmatrix} \frac{\Phi(\lambda)}{\Phi_0}\, d\lambda$,   (2.13)

where $\Phi_0 > 0$ represents an arbitrarily chosen reference radiance, which is introduced for making the vector (L, M, S) dimensionless. If two light stimuli with different radiance spectra yield the same cone excitation response (L, M, S), they cannot be distinguished by the human visual system and are therefore perceived as having the same color. Light stimuli with that property are called metamers. As an example, the radiance spectra shown in Figure 2.9 are metamers.

[Figure 2.9: Metamers: All four radiance spectra shown in the diagram induce the same cone excitation responses and are perceived as the same color (orange).]

Metameric color matches play a very important role in all color reproduction techniques. They are the basis for color photography (see Section 2.4), color printing, color displays (see Section 2.5) as well as for the representation of color images and videos (see Section 2.3).

For specifying the cone excitation responses in (2.13), we used normalized spectral sensitivity functions without paying attention to the different peak sensitivities. Actually, this aspect does not have any impact on the characterization of metamers. If two vectors $(L_1, M_1, S_1)$ and $(L_2, M_2, S_2)$ are the same, the appropriately scaled versions $(\alpha L_1, \beta M_1, \gamma S_1)$ and $(\alpha L_2, \beta M_2, \gamma S_2)$, with non-zero scaling factors α, β and γ, are also the same, and vice versa.
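Equation (2.13) reduces the perceived color of a stimulus to three inner products between its radiance spectrum and the cone sensitivities. The sketch below evaluates these inner products on a sampled wavelength grid and tests two spectra for metamerism; the Gaussian cone-sensitivity curves are rough placeholders for the actual cone fundamentals, and the constant $\Phi_0$ is absorbed into the overall scale.

```python
import math

LAMBDAS = list(range(390, 701, 10))   # wavelength grid in nm

def gauss(lam, mu, sigma):
    return math.exp(-0.5 * ((lam - mu) / sigma) ** 2)

# Rough placeholder cone sensitivities (peaks near 570, 545 and 445 nm);
# tabulated cone fundamentals should be used for real computations.
def l_cone(lam): return gauss(lam, 570.0, 50.0)
def m_cone(lam): return gauss(lam, 545.0, 45.0)
def s_cone(lam): return gauss(lam, 445.0, 30.0)

def cone_response(spectrum):
    """Numerical version of (2.13): project a sampled spectrum onto l, m and s."""
    return tuple(sum(cone(lam) * phi for lam, phi in zip(LAMBDAS, spectrum))
                 for cone in (l_cone, m_cone, s_cone))

def are_metamers(spec1, spec2, tol=1e-9):
    """Two spectra are metamers if they yield the same (L, M, S) response."""
    return all(abs(a - b) < tol for a, b in zip(cone_response(spec1),
                                                cone_response(spec2)))

flat = [1.0] * len(LAMBDAS)
print(cone_response(flat))
print(are_metamers(flat, flat))   # True: identical spectra are trivially metamers
```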

An aspect that is, however, neglected in equation (2.13) is the chromatic adaptation of the human eye, i.e., the changing of the scaling factors α, β and γ in dependence of the spectral properties of the observed light (see Section 2.2.1). For the following considerations, we assume that the eye is adapted to a particular viewing condition and, thus, the mapping between radiance spectra and cone excitation responses is linear, as given by (2.13). Another point to note is that the so-called quality of a color, typically characterized by the hue and saturation, is solely determined by the ratio L : M : S. Two colors given by the cone response vectors $(L_1, M_1, S_1)$ and $(L_2, M_2, S_2) = (\alpha L_1, \alpha M_1, \alpha S_1)$, with α > 1, have the same quality, i.e., the same hue and saturation, but the luminance¹ of the color $(L_2, M_2, S_2)$ is by a factor of α larger than that of $(L_1, M_1, S_1)$.

Although (2.13) could be directly used for quantifying the perception of color, the colorimetric standards are based on empirical data obtained in color-matching experiments. One reason is that the spectral sensitivities of the human cones had not been known at the time when the standards were developed. Actually, the cone fundamentals are typically estimated based on data of color-matching experiments [244].

Mixing of Primary Colors. Since the perceived color of a light stimulus can be represented by three cone excitation levels, it seems likely that, for each radiance spectrum Φ(λ), we can also create a metameric spectrum Φ'(λ) by suitably mixing three primary colors or, more correctly, primary lights. The radiance spectra of the three primary lights A, B and C shall be given by $p_A(\lambda)$, $p_B(\lambda)$ and $p_C(\lambda)$, respectively. With $p(\lambda) = (p_A(\lambda), p_B(\lambda), p_C(\lambda))^T$, the radiance spectrum of a mixture of the primaries A, B and C is given by

$\Phi'(\lambda) = A\,p_A(\lambda) + B\,p_B(\lambda) + C\,p_C(\lambda) = (A, B, C)\; p(\lambda)$,   (2.14)

where A, B and C denote the mixing factors, which are also referred to as tristimulus values. The radiance spectrum Φ'(λ) of the light mixture is a metamer of Φ(λ) if and only if it yields the same cone excitation responses. Thus, with (L, M, S) being the vector of cone excitation

¹ As mentioned in Section 2.2.1, we assume that the luminance can be represented as a linear combination of the cone excitation responses L, M, and S.

responses for Φ(λ), we require

$\begin{pmatrix} L \\ M \\ S \end{pmatrix} = \int_0^\infty \begin{pmatrix} l(\lambda) \\ m(\lambda) \\ s(\lambda) \end{pmatrix} \frac{\Phi'(\lambda)}{\Phi_0}\, d\lambda = \mathbf{T} \begin{pmatrix} A \\ B \\ C \end{pmatrix}$,   (2.15)

with the transformation matrix T being given by

$\mathbf{T} = \int_0^\infty \begin{pmatrix} l(\lambda) \\ m(\lambda) \\ s(\lambda) \end{pmatrix} \frac{p(\lambda)^T}{\Phi_0}\, d\lambda$.   (2.16)

If the primaries are selected in a way that the matrix T is invertible, the mapping between the tristimulus values (A, B, C) and (L, M, S) is bijective. In this case, the color of each possible radiance spectrum Φ(λ) can be matched by a mixture of the three primary lights. And therefore, the color description in the (A, B, C) system is equivalent to the description in the (L, M, S) system. A sufficient condition for a suitable selection of the three primaries is that all primaries are perceived as having a different color and the color of none of the primaries can be matched by a mixture of the two other primaries. By combining the equations (2.15) and (2.13), we obtain

$\begin{pmatrix} A \\ B \\ C \end{pmatrix} = \mathbf{T}^{-1} \begin{pmatrix} L \\ M \\ S \end{pmatrix} = \mathbf{T}^{-1} \int_0^\infty \begin{pmatrix} l(\lambda) \\ m(\lambda) \\ s(\lambda) \end{pmatrix} \frac{\Phi(\lambda)}{\Phi_0}\, d\lambda = \int_0^\infty c(\lambda)\, \frac{\Phi(\lambda)}{\Phi_0}\, d\lambda$,   (2.17)

which specifies the direct mapping of radiance spectra Φ(λ) onto the tristimulus values (A, B, C). The components $\bar{a}(\lambda)$, $\bar{b}(\lambda)$ and $\bar{c}(\lambda)$ of the vector function $c(\lambda) = (\bar{a}(\lambda), \bar{b}(\lambda), \bar{c}(\lambda))^T$ are referred to as color-matching functions for the primaries A, B and C, respectively. They represent equivalents to the cone fundamentals l(λ), m(λ) and s(λ). Thus, if we know the color-matching functions $\bar{a}(\lambda)$, $\bar{b}(\lambda)$ and $\bar{c}(\lambda)$ for a set of three primaries, we can uniquely describe all perceivable colors by the corresponding tristimulus values (A, B, C).

Before we discuss how color-matching functions can be determined, we highlight an important property of color mixing, which is a direct consequence of (2.17). Let $\Phi_1(\lambda)$ and $\Phi_2(\lambda)$ be the radiance spectra of two lights with the tristimulus values $(A_1, B_1, C_1)$ and $(A_2, B_2, C_2)$, respectively. Now, we mix an amount α of the first with an amount β

of the second light. For the tristimulus values (A, B, C) of the resulting radiance spectrum $\Phi(\lambda) = \alpha\,\Phi_1(\lambda) + \beta\,\Phi_2(\lambda)$, we obtain

$\begin{pmatrix} A \\ B \\ C \end{pmatrix} = \int_0^\infty c(\lambda)\, \frac{\alpha\,\Phi_1(\lambda) + \beta\,\Phi_2(\lambda)}{\Phi_0}\, d\lambda = \alpha \begin{pmatrix} A_1 \\ B_1 \\ C_1 \end{pmatrix} + \beta \begin{pmatrix} A_2 \\ B_2 \\ C_2 \end{pmatrix}$.   (2.18)

The tristimulus values of a linear combination of multiple lights are given by the linear combination, with the same weights, of the tristimulus values of the individual lights. This property has been experimentally discovered by Grassmann [86] and is often called Grassmann's law.

Color-Matching Experiments. In order to experimentally determine the color-matching functions c(λ) for three given primary lights, the color of sufficiently many monochromatic lights² can be matched with a mixture of the primaries. For each monochromatic light with wavelength λ, the radiance spectrum is $\Phi(\lambda') = \Phi_\lambda\,\delta(\lambda' - \lambda)$, where $\Phi_\lambda$ is the absolute radiance and δ(·) represents the Dirac delta function. According to (2.17), the tristimulus vector is given by

$\begin{pmatrix} A \\ B \\ C \end{pmatrix}_{\!\lambda} = \frac{\Phi_\lambda}{\Phi_0} \int_0^\infty c(\lambda')\,\delta(\lambda' - \lambda)\, d\lambda' = \frac{\Phi_\lambda}{\Phi_0}\, c(\lambda)$.   (2.19)

Except for a factor, the tristimulus vector for a monochromatic light with wavelength λ represents the value of c(λ) for that wavelength. Even though the value of $\Phi_0$ can be chosen arbitrarily, the ratio of the absolute radiances $\Phi_\lambda$ of the monochromatic lights to any constant reference radiance $\Phi_0$ has to be known for all wavelengths.

The basic idea of color-matching experiments is typically attributed to Maxwell [179]. The color-matching data that led to the creation of the widely used CIE 1931 colorimetric standard were obtained in experiments by Wright [319] and Guild [89]. The principle of their color-matching experiments [88, 318] is illustrated in Figure 2.10. At a visual angle of 2°, the observers looked at both a monochromatic test light and a mixture of the three primaries, for which a red, green, and blue light source were used.

² In practice, lights with a reasonably small spectrum are used.

[Figure 2.10: Principle of color-matching experiments.]

The amounts of the primaries could be adjusted by the observers. Since not all lights can be matched with positive amounts of the primary lights, it was possible to move any of the primaries to the side of the test light, in which case the amount of the corresponding primary was counted as a negative value³. For determining the color-matching functions c(λ), only the ratios of the amounts of the primary lights were utilized [234]. These data were combined with the already known luminous efficiency function V(λ) for photopic vision. For that purpose, it was assumed that V(λ) can be represented as a linear combination of the three color-matching functions $\bar{a}(\lambda)$, $\bar{b}(\lambda)$ and $\bar{c}(\lambda)$. This assumption is equivalent to the often used model (see Section 2.2.1) that the sensation of luminance is generated by linearly combining the cone excitation responses in the neural circuitry of the human visual system. The utilization of the luminous efficiency function V(λ) had the advantage that the effect of luminance perception could be excluded in the experiments and that it was not necessary to know the ratios of the absolute radiances $\Phi_\lambda$ to a common reference $\Phi_0$ for all monochromatic lights (see above).

Changing Primaries. Before we discuss the results of Wright and Guild, we consider how the color-matching functions for an arbitrary set of primaries can be derived from the measurements for another set of primaries. Let us assume that we measured the color-matching functions $c_1(\lambda) = (\bar{a}_1(\lambda), \bar{b}_1(\lambda), \bar{c}_1(\lambda))^T$ for a first set of primaries given

³ Due to the linearity of color mixing, adding a particular amount of a primary to the test light is mathematically equivalent to subtracting the same amount from the mixture of the other primaries.

by the radiance spectra p_1(λ) = ( p_A1(λ), p_B1(λ), p_C1(λ) )^T. Based on these data, we want to determine the color-matching functions c_2(λ) for a second set of primaries, which shall be given by the radiance spectra p_2(λ). For each radiance spectrum Φ(λ), the tristimulus vectors t_1 = (A_1, B_1, C_1)^T and t_2 = (A_2, B_2, C_2)^T for the primary sets one and two, respectively, are given by

t_1 = \int_0^\infty c_1(λ) \, \frac{Φ(λ)}{Φ_0} \, dλ   and   t_2 = \int_0^\infty c_2(λ) \, \frac{Φ(λ)}{Φ_0} \, dλ.   (2.20)

The radiance spectra Φ(λ), Φ_1(λ) = p_1(λ)^T t_1, and Φ_2(λ) = p_2(λ)^T t_2 are metamers. Consequently, all three spectra correspond to the same color representation for any set of primaries. In particular, we require

t_1 = \int_0^\infty c_1(λ) \, \frac{Φ_2(λ)}{Φ_0} \, dλ = \left( \int_0^\infty c_1(λ) \, \frac{p_2(λ)^T}{Φ_0} \, dλ \right) t_2 = T_21 \, t_2.   (2.21)

The tristimulus vector in one system of primaries can thus be converted into any other system of primaries using a linear transformation. Since this relationship is valid for all radiance spectra Φ(λ), including those of the monochromatic lights, the color-matching functions for the second set of primaries can be calculated according to

c_2(λ) = T_21^{-1} c_1(λ) = T_12 \, c_1(λ).   (2.22)

It should be noted that the columns of a matrix T_ik represent the tristimulus vectors (A, B, C) of the primary lights of set i in the primary system k. These values can be directly measured, so that the color-matching functions can be transformed from one primary system into another even if the radiance spectra p_1(λ) and p_2(λ) are unknown.
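To make the change of primaries in (2.21) and (2.22) concrete, the following sketch applies the two transformations with numpy. The matrix T_21 and the sample vectors are hypothetical values chosen only for illustration; they are not taken from any measurement.

```python
import numpy as np

# Tristimulus vectors of the three primaries of set 2, measured in the
# color space of primary set 1 (hypothetical values for illustration).
# As stated in the text, these vectors form the columns of T_21.
T_21 = np.array([[0.9, 0.2, 0.1],
                 [0.1, 0.7, 0.2],
                 [0.0, 0.1, 0.7]])

# Converting a tristimulus vector from system 2 to system 1, cf. (2.21).
t_2 = np.array([0.4, 0.5, 0.1])
t_1 = T_21 @ t_2

# Converting color-matching functions from system 1 to system 2, cf. (2.22):
# c_2(lambda) = T_12 c_1(lambda) with T_12 = inv(T_21).
T_12 = np.linalg.inv(T_21)
c_1 = np.array([0.3, 0.6, 0.2])     # value of c_1(lambda) at one wavelength (assumed)
c_2 = T_12 @ c_1
```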

CIE Standard Colorimetric Observer. In 1931, the CIE adopted the colorimetric standard known as CIE Standard Colorimetric Observer [43] based on the experimental data of Wright and Guild. Since Wright and Guild used different primaries in their experiments, the data had to be converted into a common primary system. For that purpose, monochromatic primaries with wavelengths of 700 nm (red), 546.1 nm (green), and 435.8 nm (blue) were selected⁴. The ratio of the absolute radiances of the primary lights was chosen so that white light with a constant radiance spectrum is represented by equal amounts of all three primaries. Hence, the to-be-determined color-matching functions r̄(λ), ḡ(λ), and b̄(λ) for the red, green, and blue primaries, respectively, had to fulfill the condition

\int_0^\infty r̄(λ) \, dλ = \int_0^\infty ḡ(λ) \, dλ = \int_0^\infty b̄(λ) \, dλ.   (2.23)

This led to a luminance ratio I_R : I_G : I_B equal to 1 : 4.5907 : 0.0601, where I_R, I_G, and I_B represent the luminances of the red, green, and blue primaries, respectively. The corresponding ratio Φ_R : Φ_G : Φ_B of the absolute radiances is approximately 1 : 0.019 : 0.014. Finally, the normalization factor for the color-matching functions, i.e., the ratio Φ_R/Φ_0, was chosen such that the condition

V(λ) = r̄(λ) + \frac{I_G}{I_R} ḡ(λ) + \frac{I_B}{I_R} b̄(λ)   (2.24)

is fulfilled. The resulting CIE 1931 RGB color-matching functions r̄(λ), ḡ(λ), and b̄(λ) were tabulated for wavelengths from 380 to 780 nm at intervals of 5 nm [43, 234]. They are shown in Figure 2.11(a). It is clearly visible that r̄(λ) has negative values over a range of medium wavelengths between the blue and green primaries. In fact, for most of the wavelengths inside the range of visible light, one of the color-matching functions is negative, meaning that most of the monochromatic lights cannot be represented by a physically meaningful mixture of the chosen red, green, and blue primaries.

The CIE decided to develop a second set of color-matching functions x̄(λ), ȳ(λ), and z̄(λ), which are now known as CIE 1931 XYZ color-matching functions, as the basis for their colorimetric standard. Since all sets of color-matching functions are linearly dependent, x̄(λ), ȳ(λ), and z̄(λ) had to obey the relationship

( x̄(λ), ȳ(λ), z̄(λ) )^T = T_XYZ ( r̄(λ), ḡ(λ), b̄(λ) )^T,   (2.25)

⁴ The primaries were chosen to be producible in a laboratory.

Figure 2.11: CIE 1931 color-matching functions: (a) RGB color-matching functions, the primaries are marked with R, G, and B; (b) XYZ color-matching functions.

with T_XYZ being an invertible, but otherwise arbitrary, transformation matrix. For specifying the 3×3 matrix T_XYZ, the following desirable properties were considered:
- All values of x̄(λ), ȳ(λ), and z̄(λ) were to be non-negative;
- The color-matching function ȳ(λ) was to be chosen equal to the luminous efficiency function V(λ) for photopic vision;
- The scaling was to be chosen so that the tristimulus values for an equal-energy spectrum are equal to each other;
- For the long wavelength range, the entries of the color-matching function z̄(λ) were to be equal to zero;
- Subject to the above criteria, the area that physically meaningful radiance spectra occupy inside a plane given by a constant sum X + Y + Z was to be maximized.

By considering these design principles, the transformation matrix

T_XYZ = \frac{1}{0.17697} \begin{pmatrix} 0.49000 & 0.31000 & 0.20000 \\ 0.17697 & 0.81240 & 0.01063 \\ 0.00000 & 0.01000 & 0.99000 \end{pmatrix}   (2.26)

was adopted. A detailed description of how this matrix was derived can be found in [234, 64]. The resulting XYZ color-matching functions x̄(λ), ȳ(λ), and z̄(λ) are depicted in Figure 2.11(b). They have been tabulated for the range from 380 to 780 nm, in intervals of 5 nm, and

specify the CIE 1931 standard colorimetric observer [43]. The color of a radiance spectrum Φ(λ) can be represented by the tristimulus values

(X, Y, Z)^T = \int_0^\infty ( x̄(λ), ȳ(λ), z̄(λ) )^T \, \frac{Φ(λ)}{Φ_0} \, dλ.   (2.27)

The reference radiance Φ_0 is typically chosen in a way that X, Y, and Z lie in a range from 0 to 1 for the considered viewing condition. Note that, due to the choice ȳ(λ) = V(λ), the value Y represents a scaled and dimensionless version of the luminance I. It is correctly referred to as relative luminance; however, the term luminance is often used for both the absolute luminance I and the relative luminance Y.

In the 1950s, Stiles and Burch [240] performed color-matching experiments for a visual angle of 10°. Based on these results, the CIE defined the CIE 1964 Supplementary Standard Colorimetric Observer [45]. The data by Stiles and Burch are considered the most accurate set of existing color-matching functions [19]. They have been used as the basis for the Stockman and Sharpe cone fundamentals [244] and the recent CIE proposal [47] of physiologically relevant color-matching functions. Baylor, Nunn, and Schnapf [10] measured direct photocurrent responses in the cones of a monkey and could predict the color-matching functions of Stiles and Burch with reasonable accuracy. Nonetheless, the CIE 1931 Standard Colorimetric Observer [43] is still used in most applications.

Chromaticity Diagram. The black curve in Figure 2.12(a) shows the locus of monochromatic lights with a particular radiance in the XYZ space. The tristimulus values of all possible radiance spectra represent linear combinations, with non-negative weights, of the (X, Y, Z) values for monochromatic lights. They are located inside a cone, which has its apex in the origin and lies completely in the positive octant. The cone's surface is spanned by the locations of the monochromatic lights and an imaginary purple plane, which connects the tangents for the short and long wavelength ends. As mentioned above, the quality of a color is solely determined by the ratio of the tristimulus values X : Y : Z. Hence, all lights that have the same quality of color lie on a line that intersects

Figure 2.12: The CIE 1931 chromaticity diagram: (a) Locus of monochromatic lights and the imaginary purple plane in the XYZ space; (b) Space of real radiance spectra with the plane X + Y + Z = 1 and the line of all equal-energy spectra; (c) Chromaticity diagram illustrating the region of all perceivable colors in the x-y plane. The diagram additionally shows the point of equal-energy white (E) as well as the primaries (R, G, B) and white point (W) of the sRGB [105] color space.

the origin, as is illustrated by the gray arrow in Figure 2.12(b), which represents the color of equal-energy radiance spectra. For differentiating between the luminance and the quality of a color, it is common to introduce normalized chromaticity coordinates

x = \frac{X}{X + Y + Z},   y = \frac{Y}{X + Y + Z},   and   z = \frac{Z}{X + Y + Z}.   (2.28)

The z-coordinate is actually redundant, since it is given by z = 1 − x − y. The tristimulus values (X, Y, Z) of a color can be represented by the chromaticity coordinates x and y, which specify the quality of the color, and the relative luminance Y. For a given quality of color, the chromaticity coordinates x and y correspond to the values of X and Y, respectively, inside the plane X + Y + Z = 1. The set of qualities of colors that is perceivable by human beings is called the human gamut.

Its location in the x-y coordinate system is shown in Figure 2.12(c)⁵. This plot is referred to as the chromaticity diagram. The human gamut has a horseshoe shape; its boundaries are given by the projection of the monochromatic lights, referred to as the spectral locus, and the purple line, which is a projection of the imaginary purple plane. For the spectral locus, the figure includes wavelength labels in nanometers; it also shows the location (marked by E) of equal-energy spectra.

Linear Color Spaces. All color spaces that are linearly related to the LMS cone excitation space are called linear color spaces. When we neglect measurement errors, the CIE RGB and XYZ spaces represent linear color spaces. Hence, there exists a 3×3 transformation matrix by which the XYZ (or RGB) color-matching functions are transformed into cone fundamentals according to (2.22). While we specified the primary spectra for the CIE 1931 RGB color space, the XYZ color-matching functions were derived by defining a transformation matrix T_XYZ (without explicitly stating primary spectra). Given the color-matching functions c(λ) = ( x̄(λ), ȳ(λ), z̄(λ) )^T, the associated primary spectra p(λ) = ( p_X(λ), p_Y(λ), p_Z(λ) )^T are not uniquely defined. With I being the 3×3 identity matrix, they only have to fulfill the condition

\int_0^\infty c(λ) \, p(λ)^T \, dλ = Φ_0 \, I,   (2.29)

which represents a special case of (2.21). There are infinitely many spectra p(λ) that fulfill (2.29), but they all have negative entries and, thus, represent imaginary primaries⁶. The fact that there are no physically meaningful primary spectra with non-negative color-matching functions is often referred to as the primary paradox. It is caused by the

⁵ The complete human gamut cannot be reproduced on a display or in a print and the perception of a color depends on the illumination conditions. Thus, the colors shown in Figure 2.12(c) should be interpreted as a rough illustration.
⁶ This can be verified as follows. For obtaining \int ȳ(λ) p_Y(λ) dλ = Φ_0, the spectrum p_Y(λ) has to contain values greater than 0 inside the range for which ȳ(λ) is greater than 0; but since either x̄(λ) or z̄(λ) is also greater than 0 inside this range, the integrals \int x̄(λ) p_Y(λ) dλ and \int z̄(λ) p_Y(λ) dλ cannot become equal to 0 unless p_Y(λ) also has negative entries.

overlapping support of the cone fundamentals. There is no radiance spectrum with p(λ) ≥ 0 for all λ that excites the M-cones without also exciting the L- or S-cones.

For all real primaries, the associated color-matching functions have negative entries. Consequently, not all colors of the human gamut can be represented by a physically meaningful mixture of primary lights. As an example, the chromaticity diagram in Figure 2.12(c) shows the chromaticity coordinates for the sRGB primaries [105]. Displays that use primaries with these chromaticity coordinates can only represent the colors that are located inside the triangle spanned by the primaries. This set of colors is called the color gamut of the display device.

In cameras, the situation is different. Since the transmittance spectra of the color filters (see Section 2.4), which represent the color-matching functions of the camera color space, are always non-negative, it is, in principle, possible to capture all colors of the human gamut. However, the camera color space is a linear color space only if the transmittance spectra of the color filters represent linear combinations of the cone fundamentals (or, equivalently, the XYZ color-matching functions). In practice, this can only be realized to a certain extent.

Since camera color spaces are associated with imaginary primaries, the sample arrays captured by an image sensor cannot be directly used for operating a display device. The data always have to be converted. The simplest variant of a conversion algorithm consists of a linear transformation of the tristimulus values (for changing the primaries) and a subsequent clipping of negative values. The conversion between a linear RGB color space and the XYZ color space can be written as

(X, Y, Z)^T = \begin{pmatrix} X_r & X_g & X_b \\ Y_r & Y_g & Y_b \\ Z_r & Z_g & Z_b \end{pmatrix} (R, G, B)^T,   (2.30)

where X_r represents the X-component of the red primary, etc. RGB color spaces are typically defined by the chromaticity coordinates of the red, green, and blue primaries, which shall be denoted by (x_r, y_r), (x_g, y_g), and (x_b, y_b), respectively, and the chromaticity coordinates (x_w, y_w) of the so-called white point, which represents the quality of

color for R = G = B. The chromaticity coordinates of the white point are necessary, because they determine the length ratios of the primary vectors in the XYZ coordinate system. According to (2.28), we can replace X by (x/y) Y and Z by ((1 − x − y)/y) Y in equation (2.30). If we then write this equation for the white point of the RGB color space, which is given by R = G = B, we obtain

\frac{Y_w}{R} \begin{pmatrix} x_w / y_w \\ 1 \\ (1 − x_w − y_w)/y_w \end{pmatrix} = \begin{pmatrix} (x_r/y_r)\,Y_r & (x_g/y_g)\,Y_g & (x_b/y_b)\,Y_b \\ Y_r & Y_g & Y_b \\ ((1 − x_r − y_r)/y_r)\,Y_r & ((1 − x_g − y_g)/y_g)\,Y_g & ((1 − x_b − y_b)/y_b)\,Y_b \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}.   (2.31)

It should be noted that Y_w/R > 0 is only a scaling factor, which specifies the relative luminance of the stimuli with R = G = B = 1. It can be chosen arbitrarily and is often set equal to 1. Then, the linear equation system can be solved for the unknown values Y_r, Y_g, and Y_b, which finally determine the transformation matrix.
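The solution of the linear equation system (2.31) is easy to carry out numerically. The following sketch derives the RGB-to-XYZ matrix of (2.30) from given chromaticity coordinates; the BT.709/sRGB primaries and the D65 white point used in the example are values assumed here for illustration and are not part of the derivation above.

```python
import numpy as np

def rgb_to_xyz_matrix(xy_r, xy_g, xy_b, xy_w, Yw=1.0):
    """Derive the RGB-to-XYZ conversion matrix of (2.30) by solving (2.31).

    xy_* are (x, y) chromaticity coordinates of the primaries and the
    white point; Yw is the relative luminance assigned to R = G = B = 1.
    """
    def xyz_from_xy(x, y):
        # Column vector (X, Y, Z)^T for a relative luminance Y = 1, cf. (2.28).
        return np.array([x / y, 1.0, (1.0 - x - y) / y])

    # Matrix whose columns are the (X, Y, Z)^T vectors of the primaries for unit luminance.
    P = np.stack([xyz_from_xy(*xy_r), xyz_from_xy(*xy_g), xyz_from_xy(*xy_b)], axis=1)
    # Solve (2.31) for the luminances (Yr, Yg, Yb) of the primaries.
    Y_rgb = np.linalg.solve(P, Yw * xyz_from_xy(*xy_w))
    # Scale the columns; the result maps (R, G, B)^T to (X, Y, Z)^T.
    return P * Y_rgb

# Example: BT.709 primaries and D65 white point (chromaticities assumed here).
M = rgb_to_xyz_matrix((0.640, 0.330), (0.300, 0.600), (0.150, 0.060), (0.3127, 0.3290))
print(np.round(M, 4))   # second row ~ (0.2126, 0.7152, 0.0722)
```

The second row of the resulting matrix contains the luminance weights Y_r, Y_g, and Y_b of the three primaries.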

Illumination. With the exception of computer monitors, television sets, and mobile phone displays, we rarely look at surfaces that emit light. In most situations, the objects we look at reflect light from one or more illumination sources, such as the sun or an incandescent light bulb. Using a simple model, the radiance spectrum Φ(λ) entering the eye from a particular surface point can be expressed as the product

Φ(λ) = S(λ) \, R(λ)   (2.32)

of the incident spectral radiance S(λ) reaching the surface point from the light source and the reflectance spectrum R(λ) of the surface point. The color of an object does not only depend on the physical properties of the object surface, but also on the spectrum of the illumination source. This aspect is illustrated in Figure 2.13, where we consider two typical illumination sources, daylight and a tungsten light bulb, and the reflectance spectrum for the petals of a particular flower. Due to the different spectral properties of the two illuminants, the radiance spectra that are reflected from the flower petals are very different and, as a result, the tristimulus and chromaticity values are also different. It should be noted that two objects that are perceived as having the same color for a particular illuminant can become distinguishable from each other when the illumination is changed.

Figure 2.13: Influence of the illumination: (a) Normalized radiance spectra of a tungsten light bulb (illuminant A) and normal daylight (illuminant D65); (b) Reflectance spectrum of the flower veronica fruticans [6]; (c) Normalized radiance spectra of the reflected light for both illuminants; the chromaticity coordinates (x, y) are (0.3294, …) for the light bulb and (0.1971, …) for daylight.

The color of a material can only be described with respect to a given illumination source. For that purpose, several illuminants have been standardized. The radiance spectrum of incandescent light sources, i.e., materials for which the emission of light is caused by their temperature, can be described by Planck's law. A so-called black body at an absolute temperature T emits light with a radiance spectrum given by

Φ_T(λ) = \frac{2 h c^2}{λ^5} \left( e^{\frac{h c}{k_B T λ}} − 1 \right)^{-1},   (2.33)

where k_B is the Boltzmann constant, h the Planck constant, and c the speed of light in the medium. The temperature T is also referred to as the color temperature of the emitted light. Figure 2.14(a) illustrates the radiance spectra for three temperatures. For low temperatures, the emitted light mainly includes long-wavelength components. When the temperature is increased, the peak of the radiance spectrum is shifted toward the short-wavelength range. Figure 2.14(c) shows the chromaticity coordinates (x, y) of light emitted by black-body radiators in the CIE 1931 chromaticity diagram. The curve representing the black-body radiators for different temperatures is called the Planckian locus. The radiance spectrum for a black-body radiator of about 2856 K has been standardized as illuminant A [108] by the CIE; it represents the typical light emitted by tungsten filament light bulbs.
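As a small illustration of (2.32) and (2.33), the sketch below evaluates Planck's law for a temperature of 2856 K (illuminant A) on a visible-wavelength grid and multiplies the normalized spectrum with an assumed reflectance spectrum; the Gaussian-shaped reflectance is purely hypothetical.

```python
import numpy as np

# Physical constants (SI units).
H = 6.62607015e-34      # Planck constant [J s]
KB = 1.380649e-23       # Boltzmann constant [J/K]
C = 2.99792458e8        # speed of light [m/s]

def blackbody_spectrum(wavelength_nm, T):
    """Spectral radiance of a black body at temperature T, cf. (2.33)."""
    lam = wavelength_nm * 1e-9
    return 2.0 * H * C**2 / lam**5 / (np.exp(H * C / (KB * T * lam)) - 1.0)

# Normalized spectrum of illuminant A (about 2856 K) on a visible-light grid.
lam = np.arange(380, 781, 5, dtype=float)
S = blackbody_spectrum(lam, 2856.0)
S /= S.max()

# Simple illumination model of (2.32): the reflected radiance is the product
# of the incident spectrum and a (hypothetical) surface reflectance spectrum.
R = 0.2 + 0.6 * np.exp(-((lam - 450.0) / 40.0) ** 2)   # assumed reflectance
Phi = S * R
```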

Figure 2.14: Illumination sources: (a) Black-body radiators; (b) Natural daylight; (c) Chromaticity coordinates of black-body radiators (Planckian locus).

There are several light sources for which the light emission is not caused by temperature. The light of such illuminants is often characterized by the so-called correlated color temperature. It represents the temperature of the black-body radiator for which the perceived color most closely matches the color of the considered light source. With the goal of approximating the radiance spectrum of average daylight, the CIE standardized the illuminant D65 [108]. It is based on various spectral measurements and has a correlated color temperature of 6504 K. Daylight for different conditions can be well approximated by linearly combining three radiance spectra. The CIE specified these three radiance spectra and recommended a procedure for determining the weights given a correlated color temperature in the range from 4000 to 25000 K. These daylight approximations are also referred to as CIE series-D illuminants. Figure 2.14(b) shows the approximations for average daylight (illuminant D65), morning light (4300 K), and twilight (12000 K). The chromaticity coordinates of the illuminant D65 specify the white point of the sRGB format [105]; they are typically also used as the standard setting for the white point of displays.

Chromatic Adaptation. The tristimulus values of light reflected from an object's surface highly depend on the spectral composition of the light source. However, to a large extent, our visual system adapts to the spectral characteristics of the illumination source. Even though we notice the difference between, for example, the orange light of a tungsten light bulb and the blueish twilight just before dark (see Figure 2.14), a sheet of paper is recognized as being white under a large variety of illumination sources. This aspect of the human visual system is referred to as chromatic adaptation. As discussed above, linear color spaces provide a mechanism for determining whether two light stimuli appear to have the same color, but only under the assumption that the viewing conditions do not change. By modeling the chromatic adaptation of the human visual system, we can, to a certain degree, predict how an object observed under one illuminant looks under a different illuminant.

A simple theory of chromatic adaptation, which was first postulated by von Kries [284] in 1902, is that the sensitivities of the three cone types are independently adapted to the spectral characteristics of the illumination sources. With (L_1, M_1, S_1) and (L_2, M_2, S_2) being the cone excitation responses for two different viewing conditions, the von Kries model can be formulated as

(L_2, M_2, S_2)^T = \begin{pmatrix} α & 0 & 0 \\ 0 & β & 0 \\ 0 & 0 & γ \end{pmatrix} (L_1, M_1, S_1)^T.   (2.34)

If we assume that the white points, i.e., the LMS tristimulus values of light stimuli that appear white, are given by (L_w1, M_w1, S_w1) and (L_w2, M_w2, S_w2) for the two considered viewing conditions, the scaling factors are determined by

α = L_w2 / L_w1,   β = M_w2 / M_w1,   γ = S_w2 / S_w1.   (2.35)

Today it is known that the chromatic adaptation of our visual system cannot be solely described by an independent re-scaling of the cone sensitivity functions, but also includes non-linear components as well as cognitive effects. Nonetheless, variations of the simple von Kries method are widely used in practice and form the basis of most modern chromatic adaptation models.

A generalized linear model for chromatic adaptation in the CIE 1931 XYZ color space can be written as

(X_2, Y_2, Z_2)^T = M_CAT^{-1} \begin{pmatrix} α & 0 & 0 \\ 0 & β & 0 \\ 0 & 0 & γ \end{pmatrix} M_CAT \, (X_1, Y_1, Z_1)^T,   (2.36)

where the matrix M_CAT specifies the transformation from the XYZ color space into the color space in which the von Kries-style chromatic adaptation is applied. If the chromaticity coordinates (x_w1, y_w1) and (x_w2, y_w2) of the white points for both viewing conditions are given and we assume that the relative luminance Y shall not change, the scaling factors can be determined according to

α = A_w2 / A_w1,   β = B_w2 / B_w1,   γ = C_w2 / C_w1   with   (A_wk, B_wk, C_wk)^T = M_CAT \begin{pmatrix} x_wk / y_wk \\ 1 \\ (1 − x_wk − y_wk)/y_wk \end{pmatrix}.   (2.37)

The transformation specified by the matrix M_CAT is referred to as the chromatic adaptation transform. If we strictly follow von Kries' idea, it specifies the transformation from the XYZ into the LMS color space. On the basis of several viewing experiments, it has been found that transformations into color spaces that are represented by so-called sharpened cone fundamentals yield better results. The chromatic adaptation transform that is suggested in the color appearance model CIECAM02 [46, 172] is given by the matrix

M_CAT(CIECAM02) = \begin{pmatrix} 0.7328 & 0.4296 & −0.1624 \\ −0.7036 & 1.6975 & 0.0061 \\ 0.0030 & 0.0136 & 0.9834 \end{pmatrix}.   (2.38)

For more details about chromatic adaptation transforms and modern color appearance models, the reader is referred to [208, 63].
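The generalized von Kries model of (2.36) and (2.37) can be implemented in a few lines. The sketch below uses the CIECAM02 matrix of (2.38); the XYZ test vector and the white-point chromaticities (values commonly associated with illuminants A and D65) are assumptions made only for the example.

```python
import numpy as np

# CAT02 chromatic adaptation matrix of (2.38) (XYZ -> sharpened cone space).
M_CAT02 = np.array([[ 0.7328, 0.4296, -0.1624],
                    [-0.7036, 1.6975,  0.0061],
                    [ 0.0030, 0.0136,  0.9834]])

def white_xyz(x, y):
    # XYZ vector of a white point with relative luminance Y = 1, cf. (2.28).
    return np.array([x / y, 1.0, (1.0 - x - y) / y])

def adapt(xyz, xy_w1, xy_w2, M=M_CAT02):
    """Von Kries-style chromatic adaptation according to (2.36) and (2.37)."""
    w1 = M @ white_xyz(*xy_w1)          # white point 1 in the adaptation space
    w2 = M @ white_xyz(*xy_w2)          # white point 2 in the adaptation space
    D = np.diag(w2 / w1)                # scaling factors alpha, beta, gamma
    return np.linalg.inv(M) @ D @ M @ np.asarray(xyz)

# Example: adapt an assumed stimulus from an illuminant-A-like white point to D65.
xyz_adapted = adapt([0.40, 0.35, 0.20], (0.4476, 0.4074), (0.3127, 0.3290))
```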

In contrast to the human visual system, digital cameras do not automatically adjust to the properties of the present illumination; they simply measure the radiant energy of the light falling through the color filters (see Section 2.4). For obtaining natural looking images, the raw data recorded by the image sensor have to be processed in order to simulate the chromatic adaptation of the human visual system. The corresponding processing step is referred to as white balancing. It is often based on a standard chromatic adaptation transform and directly incorporated into the conversion from the internal color space of the camera to the color space of the representation format. With (R_1, G_1, B_1) being the recorded tristimulus values and (R_2, G_2, B_2) being the tristimulus values of the representation format, we have

(R_2, G_2, B_2)^T = M_Rep^{-1} \, M_CAT^{-1} \begin{pmatrix} α & 0 & 0 \\ 0 & β & 0 \\ 0 & 0 & γ \end{pmatrix} M_CAT \, M_Cam \, (R_1, G_1, B_1)^T.   (2.39)

The matrices M_Cam and M_Rep specify the conversion from the camera and representation RGB spaces, respectively, into the XYZ space. The scaling factors α, β, and γ are determined according to (2.37), where the white point (x_w2, y_w2) is given by the used representation format. For selecting the white point (x_w1, y_w1) of the actual viewing condition, cameras typically provide various methods, ranging from selecting the white point from a predefined set ("sunny", "cloudy", etc.), through calculating it based on (2.33) for a specified color temperature, to estimating it automatically from the recorded samples.

Figure 2.15: Example for white balancing: (left) Original picture taken between sunset and dusk, implicitly assuming an equal-energy white point; (right) Picture after white balancing (the white point was defined by a selected area of the boat).

An example for white balancing is shown in Figure 2.15. As a result of the spectral composition of the natural light between sunset and dusk, the original image recorded by the camera has a noticeable purple color cast. After white balancing, which was done by using the chromaticity coordinates of an area of the boat as the white point (x_w1, y_w1), the color cast is removed and the image looks more natural.

Perceptual Color Spaces. The CIE 1931 XYZ color space provides a method for predicting whether two radiance spectra are perceived as the same color. It is, however, not suitable for quantitatively describing the difference in perception between two light stimuli. As a first aspect, the perceived brightness difference between two stimuli does not only depend on the difference in luminance, but also on the luminance level to which the eye is adapted (Weber-Fechner law, see Section 2.2.1). The CIE 1931 xy chromaticity space is not perceptually uniform either. The experiments of MacAdam [174] showed that the range of imperceptible chromaticity differences for a given reference chromaticity (x_0, y_0) can be described by an ellipse in the x-y plane centered around (x_0, y_0), but the orientation and size of the ellipses highly depend on the considered reference chromaticity (x_0, y_0).

With the goal of defining approximately perceptually uniform color spaces with a simple relationship to the CIE 1931 XYZ color space, the CIE specified the color spaces CIE 1976 L*a*b* [109] and CIE 1976 L*u*v* [110], which are commonly called CIELAB and CIELUV, respectively. Typically, the CIE L*a*b* color space is considered to be more perceptually uniform. Its relation to the XYZ space is given by

(L*, a*, b*)^T = \begin{pmatrix} 0 & 116 & 0 \\ 500 & −500 & 0 \\ 0 & 200 & −200 \end{pmatrix} \begin{pmatrix} f(X/X_w) \\ f(Y/Y_w) \\ f(Z/Z_w) \end{pmatrix} + \begin{pmatrix} −16 \\ 0 \\ 0 \end{pmatrix}   (2.40)

with

f(t) = \begin{cases} t^{1/3} & : t > (6/29)^3 \\ \frac{1}{3} (29/6)^2 \, t + 4/29 & : t ≤ (6/29)^3 \end{cases}.   (2.41)

The values (L*, a*, b*) do not only depend on the considered point in the XYZ space, but also on the tristimulus values (X_w, Y_w, Z_w) of the reference white point determined by the present illumination. Hence, the L*a*b* color space includes a chromatic normalization, which corresponds to a simple von Kries-style model (2.36) with M_CAT equal to the identity matrix. The function f(t) mimics the non-linear behavior of the human visual system. The coordinate L* is called lightness; it is a perceptually corrected version of the relative luminance Y. The components a* and b* represent color differences between reddish-magenta and green and between yellow and blue, respectively.
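A minimal sketch of the XYZ-to-CIELAB conversion of (2.40) and (2.41) is given below; the white point used in the example (a D65-like normalization) is an assumption for illustration.

```python
import numpy as np

def f(t):
    """Non-linearity of (2.41)."""
    t = np.asarray(t, dtype=float)
    delta = 6.0 / 29.0
    return np.where(t > delta**3, np.cbrt(t), t / (3.0 * delta**2) + 4.0 / 29.0)

def xyz_to_lab(xyz, xyz_w):
    """CIE 1976 L*a*b* coordinates according to (2.40)."""
    fx, fy, fz = f(np.asarray(xyz) / np.asarray(xyz_w))
    L = 116.0 * fy - 16.0
    a = 500.0 * (fx - fy)
    b = 200.0 * (fy - fz)
    return np.array([L, a, b])

# Example with an assumed D65-like white point (normalized so that Y_w = 1).
lab = xyz_to_lab([0.20, 0.25, 0.30], [0.9505, 1.0000, 1.0890])
```

The Euclidean distance between two such (L*, a*, b*) vectors then yields the color difference measure discussed below.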

Due to the approximate perceptual uniformity of the CIE L*a*b* color space, the difference between two light stimuli can be quantified by the Euclidean distance between the corresponding (L*, a*, b*) vectors,

ΔE = \sqrt{ (L*_1 − L*_0)^2 + (a*_1 − a*_0)^2 + (b*_1 − b*_0)^2 }.   (2.42)

There are many other color spaces that have been developed for different purposes. Most of them can be derived from the XYZ color space, which can be seen as a master color space, since it has been specified based on experimental data. For image and video coding, the Y'CbCr format is particularly important. It has some of the properties of CIELAB and will be discussed in Section 2.3, where we describe representation formats for image and video coding.

2.2.3 Visual Acuity

The ability of the human visual system to resolve fine details is determined by three factors: the resolution of the human optics, the sampling of the projected image by the photoreceptor cells, and the neural processing of the photoreceptor signals. The influence of the first two factors was evaluated in several experiments. The resolution of the human optics was evaluated by measurements of the modulation transfer function [25, 167, 176]. The estimated cut-off frequency ranges from about 50 cycles per degree (cpd), for pupil sizes of 2 mm, to 200 cpd, for pupil sizes of 7.3 mm [167]. Investigations of retina tissue showed that, in the foveal region, the average distance between rows of cones is about 0.5 minutes of arc [308, 50]. This corresponds to a Nyquist frequency of 60 cpd. The impact of the neural processing can only be evaluated in connection with the human optics and the retinal sampling.

An ophthalmologist typically checks visual acuity using a Snellen chart. At luminance levels of at least 120 cd/m², a person with normal visual acuity has to be able to read letters covering a visual angle of 5 minutes of arc, for example, letters of 8.73 mm height at a distance of 6 m. The letters used can be considered to consist of essentially 3 black and 2 white lines in one direction and, hence, people with normal acuity can resolve spatial frequencies of at least 30 cpd.

Contrast Sensitivity. The resolving capabilities of the human visual system are often characterized by contrast sensitivity functions, which specify the contrast threshold between visible and invisible. The contrast C of a stimulus is typically defined as the Michelson contrast

C = \frac{I_max − I_min}{I_max + I_min},   (2.43)

where I_min and I_max are the minimum and maximum luminance of the stimulus. The contrast sensitivity s_c = 1/C_t is the reciprocal of the contrast C_t at which a pattern is just perceivable. The smallest possible value of the contrast sensitivity is s_c = 1; it means that, regardless of the contrast, the stimulus is invisible for a human observer. For analyzing the visual acuity, the contrast sensitivity is typically measured for spatio-temporal sinusoidal stimuli

I(α, t) = Ī ( 1 + C \cos(2π u α) \cos(2π v t) ),   (2.44)

where Ī = (I_min + I_max)/2 is the average luminance, u is the spatial frequency in cycles per visual angle, v is the temporal frequency in Hz, α denotes the visual angle, and t represents the time. By varying the spatial and temporal frequency, a function s_c(u, v) is obtained, which is called the spatio-temporal contrast sensitivity function (CSF).

Spatial Contrast Sensitivity. The spatial CSF s_c(u) specifies the contrast sensitivity for sinusoidal stimuli that do not change over time (i.e., for v = 0). It can be considered a psychovisual version of the modulation transfer function. Experimental investigations [279, 26, 289, 278] showed that it highly depends on various parameters, such as the average luminance Ī and the field of view. An analytic model for the spatial CSF was proposed by Barten [9]. Figure 2.16(a) illustrates the basic form of the spatial CSF for foveal vision and different average luminances Ī. The spatial CSF has a band-pass character. Except for very low luminance levels, the contrast sensitivity in the low-frequency range is nearly independent of the average luminance Ī (Weber-Fechner law). In the high-frequency range, however, the CSF highly depends on the average luminance level Ī. For photopic luminances, the CSF has its peak sensitivity between 2 and 4 cpd; the cut-off frequency is between 40 and 60 cpd.
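The following sketch generates the sinusoidal test stimulus of (2.44) for one time instant and verifies its Michelson contrast according to (2.43); all stimulus parameters are arbitrary example values.

```python
import numpy as np

def stimulus(alpha, t, I_mean, C, u, v):
    """Spatio-temporal sinusoidal test stimulus of (2.44).

    alpha: visual angle in degrees, t: time in seconds,
    u: spatial frequency in cpd, v: temporal frequency in Hz.
    """
    return I_mean * (1.0 + C * np.cos(2 * np.pi * u * alpha) * np.cos(2 * np.pi * v * t))

def michelson_contrast(I):
    """Michelson contrast of (2.43)."""
    return (I.max() - I.min()) / (I.max() + I.min())

alpha = np.linspace(0.0, 2.0, 1000)   # 2 degree field of view (assumed)
I = stimulus(alpha, t=0.0, I_mean=100.0, C=0.3, u=4.0, v=0.0)
print(michelson_contrast(I))          # ~0.3 by construction
```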

Figure 2.16: Spatial contrast sensitivity: (a) Contrast sensitivity function for different luminance levels (generated using the model of Barten [9]); (b) Comparison of the contrast sensitivity function for isochromatic and isoluminant stimuli (approximation for the experimental data of Mullen [182]).

In order to analyze the resolving capabilities of the opponent processes in human vision, the spatial CSF was also measured for isoluminant stimuli with varying color [182, 229]. Such stimuli with a spatial frequency u and a contrast C are, in principle, obtained by using two sinusoidal gratings with the same spatial frequency u, average luminance Ī, and contrast C, but different colors, and superimposing them with a phase shift of π. Figure 2.16(b) shows a comparison of the spatial CSFs for isochromatic and isoluminant red-green and blue-yellow stimuli. In contrast to the CSF for isochromatic stimuli, the CSF for isoluminant stimuli has a low-pass shape and the cut-off frequency is significantly lower. This demonstrates that the human visual system is less sensitive to changes in color than to changes in luminance.

Spatio-Temporal Contrast Sensitivity. The influence of temporal changes on the contrast sensitivity was also investigated in several experiments, for example in [213, 147]. A model for the spatio-temporal CSF was proposed in [7, 8]. Figure 2.17(a) illustrates the impact of temporal changes on the spatial CSF s_c(u). By increasing the temporal frequency v, the contrast sensitivity is at first increased for low spatial frequencies and the spatial CSF becomes a low-pass function; a further increase of the temporal frequency results in a decrease of the contrast sensitivity for the entire range of spatial frequencies.

Figure 2.17: Spatio-temporal contrast sensitivity: (a) Spatial CSF s_c(u) for different temporal frequencies v; (b) Temporal CSF s_c(v) for different spatial frequencies u. The shown curves represent approximations for the data of Robson [213].

Similarly, as illustrated in Figure 2.17(b), the temporal CSF s_c(v) also has a band-pass shape for low spatial frequencies. When the spatial frequency is moderately increased, the contrast sensitivity is improved for low temporal frequencies and the shape of s_c(v) becomes a low-pass. By further increasing the spatial frequency, the contrast sensitivity is reduced for all temporal frequencies. The experiments show that the spatial and temporal aspects are not independent of each other. The temporal cut-off frequency at which a temporally changing stimulus starts to have a steady appearance is called the critical flicker frequency (CFF); it is in the range of about 50 to 60 Hz. Investigations of the spatio-temporal CSF for chromatic isoluminant stimuli [148] showed that not only the spatial but also the temporal sensitivity to isoluminant stimuli is lower than that for isochromatic stimuli. For isoluminant stimuli, the CFF lies in a considerably lower range.

Pattern Sensitivity. The contrast sensitivity functions provide a description of spatial and temporal aspects of human vision. The human visual system is, however, not linear. Thus, the analysis of the responses to harmonic stimuli is not sufficient to completely describe the resolving capabilities of human vision. There are several neural aspects that influence the way we see and discriminate patterns or track the motion of objects over time. For a further discussion of such aspects, the reader is referred to the literature on human vision [285, 196].

Figure 2.18: Spatial sampling of images and video: (a) Orthogonal spatial sampling lattice; (b) Top and bottom field samples in interlaced video.

2.3 Representation of Digital Images and Video

In the following, we describe data formats that serve as input formats for image and video encoders and as output formats of image and video decoders. These raw data formats are commonly referred to as representation formats and specify how visual information is represented as arrays of discrete-amplitude samples. Important examples for representation formats are the ITU-R Recommendations BT.601 [114], BT.709 [114], and BT.2020 [114], which specify raw data formats for standard definition (SD), high definition (HD), and ultra-high definition (UHD) television, respectively.

2.3.1 Spatio-Temporal Sampling

In order to process images or videos with a microprocessor or computer, the physical quantities describing the visual information have to be discretized, i.e., they have to be sampled and quantized. The physical quantities that we measure in the image plane of a camera are irradiances observed through color filters. Let c_cont(x, y, t) be a continuous function that represents the irradiance for a particular color filter in the image plane of a camera. In image and video coding applications, orthogonal sampling lattices as illustrated in Figure 2.18(a) are used. The W × H sample array c_n[l, m] representing a color component at a particular time instant t_n is, in principle, given by

c_n[l, m] = c_cont( l Δx, m Δy, n Δt ),   (2.45)

where l, m, and n are integer values with 0 ≤ l < W and 0 ≤ m < H. The sampling is done by the image sensor of a camera. Due to the finite size of the photocells and the finite exposure time, each sample actually represents an integral over approximately a cuboid in the x-y-t space. The samples output by an image sensor have discrete amplitudes. However, since the number of provided amplitude levels is significantly greater than the number of amplitude levels in the final representation format, we treat c_n[l, m] as continuous-amplitude samples in the following. Furthermore, it is presumed that the same sampling lattice is used for all color components. If the required image size is different from that given by the image sensor or the sampling lattices are not aligned, the color components have to be re-sampled using appropriate discrete filters.

The size of a discrete picture is determined by the numbers of samples W and H in horizontal and vertical direction, respectively. The spatial sampling lattice is further characterized by the sample aspect ratio (SAR) and the picture aspect ratio (PAR) given by

SAR = \frac{Δx}{Δy}   and   PAR = \frac{W Δx}{H Δy} = \frac{W}{H} \, SAR.   (2.46)

Table 2.1 lists the picture sizes, sample aspect ratios, and picture aspect ratios for some common picture formats.

Table 2.1: Examples for common picture formats.

                              picture size    sample aspect   picture aspect
                              (in samples)    ratio (SAR)     ratio (PAR)
standard definition (SD)      720 × 480       10:11           4:3
                              720 × 576       12:11           4:3
                              720 × 576       16:11           16:9
                              720 × 480       40:33           16:9
high definition (HD)          1280 × 720      1:1             16:9
                              1440 × 1080     4:3             16:9
                              1920 × 1080     1:1             16:9
ultra-high definition (UHD)   3840 × 2160     1:1             16:9
                              7680 × 4320     1:1             16:9

In the SD formats, only 704 samples are displayed per scanline.
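The relation (2.46) between sample and picture aspect ratios can be checked with a few lines of code. The sketch below uses exact rational arithmetic; note that, as stated above, only 704 of the samples per scanline are displayed in the SD formats, which is why 704 is used in the SD example.

```python
from fractions import Fraction

def picture_aspect_ratio(width, height, sar):
    """Picture aspect ratio according to (2.46): PAR = (W/H) * SAR."""
    return Fraction(width, height) * sar

# 625-line SD: 704 displayed samples per scanline, SAR 12:11 -> PAR 4:3.
print(picture_aspect_ratio(704, 576, Fraction(12, 11)))   # 4/3
# HD with square samples: PAR equals W/H.
print(picture_aspect_ratio(1920, 1080, Fraction(1, 1)))   # 16/9
```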

The picture size W × H determines the range of viewing angles at which a displayed picture appears sharp to a human observer. For that reason, it is also referred to as the spatial resolution of a picture. The temporal resolution of a video is determined by the frame rate f_t = 1/Δt. Typical frame rates are 24/1.001, 24, 25, 30/1.001, 30, 50, 60/1.001, and 60 Hz.

The spatio-temporal sampling described above is also referred to as progressive sampling. An alternative that was introduced for saving bandwidth in analog television, but is still used in digital broadcast, is the interlaced sampling illustrated in Figure 2.18(b). The spatial sampling lattice is partitioned into odd and even scan lines. The even scan lines (starting with index zero) form the top field and the odd scan lines form the bottom field of an interlaced frame. The top and bottom fields are alternately scanned at successive time instants. The sample arrays of a field have the size W × (H/2). The number of fields per second, called the field rate, is twice the frame rate.

2.3.2 Linear Color Spaces and Color Gamut

For a device-independent description of color information, representation formats include the specification of linear color spaces. As discussed in Section 2.2.2, displays are not capable of reproducing all colors of the human gamut. Since the number of amplitude levels required for representing colors with a given accuracy increases with increasing gamut, representation formats typically use linear color spaces with real primaries and non-negative tristimulus values. Hence, the chosen color space determines the set of representable colors. This set of colors is also referred to as the gamut of the representation format. The color spaces are described by the CIE 1931 chromaticity coordinates of the primaries and the white point. The 3×3 matrix specifying the conversion between the tristimulus values of the representation format and the CIE 1931 XYZ color space can be determined by solving the linear equation system in (2.31). As examples, Figure 2.19 lists the chromaticity coordinates for selected representation formats and illustrates the corresponding gamuts in the chromaticity diagram. In contrast to the HD and UHD specifications BT.709 and BT.2020, the ITU-R Recommendation BT.601 for SD television does not include the specification of a linear color space. For conventional SD television systems, the linear

Figure 2.19: Color spaces of selected representation formats: (left) CIE 1931 chromaticity coordinates for the color primaries and the white point (D65); (right) Comparison of the corresponding color gamuts to the human gamut.

color spaces specified in EBU Tech [61] (for 625-line systems) and SMPTE 170M [235] (for 525-line systems) are used⁷. Due to continuing improvements in display technology, the color primaries for the UHD specification BT.2020 have been selected to lie on the spectral locus, yielding a significantly larger gamut than the SD and HD specifications. As a consequence, BT.2020 also recommends larger bit depths for representing amplitude values.

At the sender side, the color sample arrays captured by the image sensor(s) of the camera have to be converted into the color space of the representation format. For each point (l, m) of the sampling lattice, the conversion can be realized by a linear transform according to (2.39)⁸. If the transform yields a tristimulus vector with one or more negative entries, the color lies outside the gamut of the representation format and has to be mapped to a similar color inside the gamut; the easiest way of such a mapping is to set the negative entries equal to zero. It is common practice to scale the transform matrix in a way that the components of the resulting tristimulus vectors have a maximum value

⁷ Since the 6th edition, BT.601 lists the chromaticity coordinates specified in EBU Tech [61] (625-line systems) and SMPTE 170M [235] (525-line systems).
⁸ Note that the white point of the representation format is used for both the determination of the conversion matrix M_Rep and the calculation of the white balancing factors α, β, and γ. If the camera captures C > 3 color components, the conversion matrix M_Cam and the combined transform matrix have a size of 3 × C.

of one. At the receiver side, a similar linear transform is required for converting the color vectors of the representation format into the color space of the display device. In accordance with video coding standards such as H.264 MPEG-4 AVC [121] or H.265 MPEG-H HEVC [123], we denote the tristimulus values of the representation format with E_R, E_G, and E_B and presume that their values lie in the interval [0; 1].

2.3.3 Non-linear Encoding

The human visual system has a non-linear response to differences in luminance. As discussed in Sections 2.2.1 and 2.2.3, the perceived brightness difference between two image regions with luminances I_1 and I_2 does not only depend on the difference in luminance ΔI = I_1 − I_2, but also on the average luminance Ī = (I_1 + I_2)/2. If we add a certain amount of quantization noise to the tristimulus values of a linear color space, whether by discretizing the amplitude levels or by lossy coding, the noise is more visible in dark image regions. This effect can be circumvented if we introduce a suitable non-linear mapping f_TC(E) for the linear color components E and quantize the resulting non-linear color components E' = f_TC(E). The non-linear mapping f_TC is often referred to as the transfer function or transfer characteristic. For relative luminances Y with amplitudes in the range [0; 1], the perceived brightness can be approximated by a power law

Y' = f_TC(Y) = Y^{γ_e}.   (2.47)

For the exponent γ_e, which is called the encoding gamma, a value of about 1/2.2 is typically suggested. The non-linear mapping Y → Y' is commonly referred to as gamma encoding or gamma correction. Since a color component E of a linear color space represents the relative luminance of the corresponding primary spectrum, the power law (2.47) can also be applied to the tristimulus values of a linear color space. At the receiver side, it has to be ensured that the luminances I produced on the display are roughly proportional to

Y ∝ f_TC^{-1}(Y') = (Y')^{γ_d}   with   γ_d = 1/γ_e,   (2.48)

so that the end-to-end relationship between the luminance measured

by the camera and the reproduced luminance is approximately linear. The exponent γ_d is referred to as the decoding gamma. Interestingly, in cathode ray tube (CRT) displays, the luminance I is proportional to (V + ε)^γ, where V represents the applied voltage, ε is a constant voltage offset, and the exponent γ lies in a range of about 2.35 to 2.55. The original motivation for the development of gamma encoding was to compensate for this non-linear voltage-luminance relationship. In modern image and video applications, however, gamma encoding is applied for transforming the linear color components into a nearly perceptually uniform domain and thus minimizing the bit depth required for representing color information [201].

The power law (2.47) has an infinite slope at zero and yields unsuitably high values for very small input values. For small input values, it is therefore often replaced by a linear function, which yields the piecewise-defined transfer function

E' = f_TC(E) = \begin{cases} κ E & : 0 ≤ E < b \\ a E^γ − (a − 1) & : b ≤ E ≤ 1 \end{cases}.   (2.49)

The exponent γ and the slope κ are specified in representation formats. The values a and b are determined in a way that both sub-functions of f_TC yield the same value and derivative at the connection point E = b. BT.709 and BT.2020 specify the exponent γ = 0.45 and the slope κ = 4.5, which yields the values a ≈ 1.099 and b ≈ 0.018. Representation formats specify the application of the transfer function (2.49) to the linear components E_R, E_G, and E_B, which have amplitudes in the range [0; 1]. The resulting non-linear color components are denoted as E'_R, E'_G, and E'_B; their range of amplitudes is also [0; 1]. In most applications, E_R, E_G, and E_B already have discrete amplitudes. For a reasonable application of gamma encoding, the bit depth of the linear components has to be at least 3 bits larger than the bit depth used for representing the gamma-encoded values.

Figure 2.20(a) illustrates the subjective effect of non-linear encoding for the relative luminance Y of an achromatic signal. In Figure 2.20(b), the transfer function f_TC specified in BT.709 and BT.2020 is compared to the simple power law with γ_e = 1/2.2 and to the transfer function used in the CIE L*a*b* color space (see Section 2.2.2).
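A direct implementation of the transfer function (2.49) and its inverse, using the BT.709/BT.2020 parameters stated above, could look as follows (a sketch, not a bit-exact reference implementation).

```python
import numpy as np

# Parameters of the BT.709 / BT.2020 transfer function (2.49).
GAMMA, KAPPA, A, B = 0.45, 4.5, 1.099, 0.018

def encode(E):
    """Transfer function f_TC of (2.49) for linear components E in [0, 1]."""
    E = np.asarray(E, dtype=float)
    return np.where(E < B, KAPPA * E, A * np.power(E, GAMMA) - (A - 1.0))

def decode(E_prime):
    """Inverse of (2.49), mapping gamma-encoded E' back to linear components."""
    E_prime = np.asarray(E_prime, dtype=float)
    return np.where(E_prime < KAPPA * B,
                    E_prime / KAPPA,
                    np.power((E_prime + (A - 1.0)) / A, 1.0 / GAMMA))

E = np.linspace(0.0, 1.0, 5)
assert np.allclose(decode(encode(E)), E)   # round trip is the identity
```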

Figure 2.20: Non-linear encoding: (a) Comparison of a linearly increasing Y and a linearly increasing Y' = f_TC(Y) using the transfer function f_TC specified in BT.709 and BT.2020; the bottom parts illustrate uniform quantization; (b) Comparison of selected transfer functions.

2.3.4 The Y'CbCr Color Representation Format

Color television was introduced as a backwards-compatible extension of the existing black-and-white television. This was achieved by transmitting two signals with color difference information in addition to the conventional luminance-related signal. As will be discussed in the following, the representation of color images as a luminance-related signal and two color difference signals has some advantages, due to which it is still widely used in image and video communication applications.

Firstly, let us assume that the luminance-related signal, which shall be denoted by L, and the color difference signals C_1 and C_2 represent linear combinations of the linear color components E_R, E_G, and E_B. The mapping between the vectors (L, C_1, C_2) and the CIE 1931 XYZ color space can then be represented by the matrix equation

(X, Y, Z)^T = \begin{pmatrix} X_r & X_g & X_b \\ Y_r & Y_g & Y_b \\ Z_r & Z_g & Z_b \end{pmatrix} \begin{pmatrix} R_l & R_c1 & R_c2 \\ G_l & G_c1 & G_c2 \\ B_l & B_c1 & B_c2 \end{pmatrix} (L, C_1, C_2)^T,   (2.50)

where the first matrix specifies the given mapping between the linear RGB color space of the representation format and the XYZ color space and the second matrix specifies the mapping from the LC_1C_2 space to the RGB space. We consider the following desirable properties:
- Achromatic signals (x = x_w and y = y_w) have C_1 = C_2 = 0;
- Changes in the color difference components C_1 or C_2 do not have any impact on the relative luminance Y.

The first property requires R_l = G_l = B_l. The second criterion is fulfilled if, for k being equal to 1 and 2, we have

Y_r R_ck + Y_g G_ck + Y_b B_ck = 0.   (2.51)

Probably to simplify implementations, early researchers additionally chose R_c1 = 0 and B_c2 = 0. With s_l, s_c1, and s_c2 being arbitrary non-zero scaling factors, this choice yields

L = s_l ( Y_r E_R + Y_g E_G + Y_b E_B )
C_1 = s_c1 ( −Y_r E_R − Y_g E_G + (Y_r + Y_g) E_B )
C_2 = s_c2 ( (Y_g + Y_b) E_R − Y_g E_G − Y_b E_B ).   (2.52)

By using Y = Y_r E_R + Y_g E_G + Y_b E_B, we can also write

L = s_l Y
C_1 = s_c1 ( (Y_r + Y_g + Y_b) E_B − Y )
C_2 = s_c2 ( (Y_r + Y_g + Y_b) E_R − Y ).   (2.53)

The component L is, as expected, proportional to the relative luminance Y; the components C_1 and C_2 represent differences between a primary component and the appropriately scaled relative luminance Y.

Y'CbCr. Due to decisions made in the early years of color television, the transformation (2.53) from the RGB color space into a color space with a luminance-related and two color difference components is applied after gamma encoding⁹. The transformation is given by

E'_Y = K_R E'_R + (1 − K_R − K_B) E'_G + K_B E'_B
E'_Cb = (E'_B − E'_Y) / (2 − 2 K_B)
E'_Cr = (E'_R − E'_Y) / (2 − 2 K_R).   (2.54)

The component E'_Y is called the luma component and the color difference signals E'_Cb and E'_Cr are called chroma components. The terms luma and chroma have been chosen to indicate that the signals are computed as linear combinations of gamma-encoded color components; the non-linear nature is also indicated by the prime symbol. The representation of color images by a luma and two chroma components is

⁹ In the age of CRT TVs, this processing order had the advantage that the decoded E'_R, E'_G, and E'_B signals could be directly fed to a CRT display.

referred to as the Y'CbCr or YCbCr color format. In contrast to the RGB color spaces discussed in Section 2.3.2, the Y'CbCr format is not an absolute color space. It does not restrict the color gamut, but merely represents the RGB values in a different way. The scaling factors in (2.54) are chosen in a way that the luma component has an amplitude range of [0; 1] and the chroma components have amplitude ranges of [−0.5; 0.5]. If we neglect the impact of gamma encoding, the constants K_R and K_B have to be chosen according to

K_R = \frac{Y_r}{Y_r + Y_g + Y_b}   and   K_B = \frac{Y_b}{Y_r + Y_g + Y_b},   (2.55)

where Y_r, Y_g, and Y_b are, as indicated in (2.50), determined by the selected RGB color space. For BT.709 (K_R = 0.2126, K_B = 0.0722) and BT.2020 (K_R = 0.2627, K_B = 0.0593), the specified values of K_R and K_B can be directly derived from the chromaticity coordinates of the primaries and the white point. BT.601, which does not define a color space, specifies the values K_R = 0.299 and K_B = 0.114, which were derived based on the color space of an old NTSC standard [272]¹⁰.

In the Y'CbCr format, color images are represented by an achromatic signal E'_Y, a blue-yellow difference signal E'_Cb, and a red-green difference signal E'_Cr. In that respect, the Y'CbCr format is similar to the CIELAB color space and the opponent processes in human vision. The transformation into the Y'CbCr domain effectively decorrelates the cone responses and thus also the RGB data for typical natural images. Hence, the Y'CbCr format is well suited for an independent coding of the individual color components. Figure 2.21 illustrates the differences between the RGB representation and the Y'CbCr format for an example image. The red, green, and blue components of natural images are typically highly correlated. In the Y'CbCr format, however, most of the visual information is concentrated in the luma component. Due to these properties, the Y'CbCr format is suitable for lossy coding and is used in nearly all image and video communication applications.

¹⁰ In SD television, there is a discrepancy between the Y'CbCr format and the linear color spaces [61, 235] used in existing systems. As a result, quantization errors in the chroma components have a larger impact on the luminance I of decoded pictures than would be the case with the choice given in (2.55).
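The transformation (2.54) is straightforward to implement. The sketch below defaults to the BT.709 constants given above; it operates on gamma-encoded components in their nominal ranges and does not include the integer quantization discussed in Section 2.3.5.

```python
import numpy as np

def rgb_to_ycbcr(E_R, E_G, E_B, K_R=0.2126, K_B=0.0722):
    """Y'CbCr conversion of gamma-encoded components according to (2.54).

    The default constants are the BT.709 values; pass K_R=0.2627, K_B=0.0593
    for BT.2020 or K_R=0.299, K_B=0.114 for BT.601.
    """
    E_Y = K_R * E_R + (1.0 - K_R - K_B) * E_G + K_B * E_B
    E_Cb = (E_B - E_Y) / (2.0 - 2.0 * K_B)
    E_Cr = (E_R - E_Y) / (2.0 - 2.0 * K_R)
    return E_Y, E_Cb, E_Cr

# Example: pure gamma-encoded red maps to Cb < 0 and Cr = 0.5.
print(rgb_to_ycbcr(1.0, 0.0, 0.0))
```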

Figure 2.21: Representation of a color image (left) as red, green, and blue components E'_R, E'_G, E'_B (top right) and as luma and chroma components E'_Y, E'_Cb, E'_Cr (bottom right). All components are represented as gray-value pictures; for the signed components E'_Cb and E'_Cr, a constant offset (middle gray) is added.

Chroma Subsampling. When we discussed contrast sensitivity functions in Section 2.2.3, we noted that human beings are much more sensitive to high-frequency components in isochromatic than in isoluminant stimuli. For saving bit rate, the chroma components are therefore often downsampled. For normal viewing distances, the reduction of the number of chroma samples does not result in any perceivable degradation of image quality. Table 2.2 summarizes the color formats used in image and video coding applications. The most commonly used format is the Y'CbCr 4:2:0 format, in which the chroma sample arrays are downsampled by a factor of two in both horizontal and vertical direction. Representation formats do not specify filters for resampling the chroma components.

Table 2.2: Common color formats for image and video coding.

format         description
RGB (4:4:4)    The red, green, and blue components have the same size.
Y'CbCr 4:4:4   The chroma components have the same size as the luma component.
Y'CbCr 4:2:2   The chroma components are horizontally subsampled by a factor of two. The height of the chroma components is the same as that of the luma component.
Y'CbCr 4:2:0   The chroma components are subsampled by a factor of two in both horizontal and vertical direction. Each chroma component contains a quarter of the samples of the luma component.
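As an illustration of 4:2:0 chroma subsampling, the following sketch reduces a chroma plane by a factor of two in each direction. The 2×2 box filter used here is only one possible choice; as noted above, representation formats do not prescribe a particular resampling filter, and the chosen filter determines the effective chroma sample locations discussed next.

```python
import numpy as np

def downsample_420(chroma):
    """Downsample a chroma plane by a factor of two horizontally and
    vertically, as needed for the Y'CbCr 4:2:0 format.

    A simple 2x2 box filter (averaging) is used for illustration.
    """
    c = np.asarray(chroma, dtype=float)
    h, w = c.shape
    c = c[: h - (h % 2), : w - (w % 2)]          # crop to even dimensions
    return 0.25 * (c[0::2, 0::2] + c[1::2, 0::2] + c[0::2, 1::2] + c[1::2, 1::2])

cb = np.random.rand(1080, 1920)                  # example chroma plane
print(downsample_420(cb).shape)                  # (540, 960)
```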

Figure 2.22: Nominal locations of chroma samples (indicated by circles) relative to those of luma samples (indicated by crosses) for different chroma sampling formats: 4:4:4, 4:2:2, 4:2:0 as in BT.2020, 4:2:0 as in MPEG-1, and 4:2:0 as in MPEG-2.

For displaying the pictures, however, the phase shifts of the resampling filters and thus the locations of the chroma samples relative to the luma samples should be known. For the 4:4:4 and 4:2:2 chroma sampling formats, representation formats and video coding standards generally specify that the top-left chroma samples coincide with the top-left luma sample (see Figure 2.22). For the 4:2:0 chroma sampling format, however, different alternatives are used. While BT.2020 specifies that the top-left chroma samples coincide with the top-left luma sample (third picture in Figure 2.22), in the video coding standards MPEG-1 Video [112], H.261 [119], and H.263 [120], the chroma samples are located in the center of the four associated luma samples (fourth picture in Figure 2.22). And in the video coding standards H.262 MPEG-2 Video [122], H.264 MPEG-4 AVC [121], and H.265 MPEG-H HEVC [123], the nominal offset between the top-left chroma and luma samples is zero in horizontal and half a luma sample in vertical direction (fifth picture in Figure 2.22). Video coding standards such as H.264 MPEG-4 AVC and H.265 MPEG-H HEVC include syntax that allows indicating the location of the chroma samples of the 4:2:0 format inside the bitstream.

Constant Luminance Y'CbCr. The application of gamma encoding before calculating the Y'CbCr components in (2.54) has the effect that changes in the chroma components due to quantization or subsampling influence the relative luminance of the displayed signal. BT.2020 [116] specifies an alternative format, which is given by the components

E'_YC = f_TC( K_R E_R + (1 − K_R − K_B) E_G + K_B E_B )   (2.56)
E'_CbC = (E'_B − E'_YC) / N_B   (2.57)
E'_CrC = (E'_R − E'_YC) / N_R.   (2.58)

The sign-dependent scaling factors N_B and N_R are given by

    N_X = 2 ( a (1 - K_X)^γ - a + 1 )    for  E'_X - E'_YC ≤ 0,
    N_X = 2 a ( 1 - K_X^γ )              for  E'_X - E'_YC > 0,        (2.59)

where a and γ represent the corresponding parameters of the transfer function f_TC in (2.49). By defining s_Y = Y_r + Y_g + Y_b and using (2.56), we obtain for the relative luminance Y of the decoded signal

    Y = s_Y ( K_R E_R + (1 - K_R - K_B) E_G + K_B E_B )
      = s_Y ( K_R E_R + ( f_TC^{-1}(E'_YC) - K_R E_R - K_B E_B ) + K_B E_B )
      = s_Y f_TC^{-1}(E'_YC).                                           (2.60)

The relative luminance Y depends only on E'_YC. For that reason, the alternative format is also referred to as the constant luminance Y CbCr format. In the document BT.2246 [117], the impact on video coding was evaluated by encoding eight test sequences, given in an RGB format, with H.265 MPEG-H HEVC [123]. The reconstruction quality was measured in the CIELAB color space using the distortion measure ∆E given in (2.42). It is reported that by choosing the constant luminance format instead of the conventional Y CbCr format, on average 12% bit rate savings are obtained for the same average distortion. The constant luminance Y CbCr format has similar properties to the currently dominating conventional Y CbCr format and could potentially replace it in future image and video applications.

2.3.5 Quantization of Sample Values

For obtaining discrete-amplitude samples suitable for coding and digital transmission, the luma and chroma components E'_Y, E'_Cb, and E'_Cr are quantized using uniform quantization. The ITU-R Recommendations BT.601, BT.709, and BT.2020 specify that the corresponding integer color components Y, Cb, and Cr are obtained according to

    Y  = [ (219 E'_Y  +  16) · 2^(B-8) ],                               (2.61)
    Cb = [ (224 E'_Cb + 128) · 2^(B-8) ],                               (2.62)
    Cr = [ (224 E'_Cr + 128) · 2^(B-8) ],                               (2.63)

where B denotes the bit depth, in bits per sample, for representing the amplitude values and the operator [ · ] represents rounding to the nearest integer.
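For illustration, the quantization of (2.61)-(2.63) can be written as a few lines of Python; a full-range variant, as addressed further below, is sketched alongside. The helper names are ours, and the full-range mapping is only one common interpretation of using the minimum and maximum integer values 0 and 2^B - 1.

```python
import numpy as np

def quantize_limited_range(ey, ecb, ecr, bit_depth=8):
    """Limited ("video") range quantization according to (2.61)-(2.63):
    E'_Y in [0, 1], E'_Cb and E'_Cr in [-0.5, 0.5]."""
    scale = 2.0 ** (bit_depth - 8)
    y  = np.round((219.0 * np.asarray(ey)  +  16.0) * scale).astype(int)
    cb = np.round((224.0 * np.asarray(ecb) + 128.0) * scale).astype(int)
    cr = np.round((224.0 * np.asarray(ecr) + 128.0) * scale).astype(int)
    return y, cb, cr

def quantize_full_range(ey, ecb, ecr, bit_depth=8):
    """Full-range variant: the entire interval [0, 2^B - 1] is used;
    results are clipped to the valid range at the extremes."""
    vmax = 2 ** bit_depth - 1
    mid = 1 << (bit_depth - 1)
    y  = np.clip(np.round(vmax * np.asarray(ey)), 0, vmax).astype(int)
    cb = np.clip(np.round(vmax * np.asarray(ecb) + mid), 0, vmax).astype(int)
    cr = np.clip(np.round(vmax * np.asarray(ecr) + mid), 0, vmax).astype(int)
    return y, cb, cr

# Mid-gray luma with neutral chroma, for 8 and 10 bit:
print(quantize_limited_range(0.5, 0.0, 0.0, 8))    # (126, 128, 128)
print(quantize_limited_range(0.5, 0.0, 0.0, 10))   # (502, 512, 512)
```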

While BT.601 and BT.709 recommend bit depths of 8 or 10 bits, the UHD specification BT.2020 recommends the usage of 10 or 12 bits per sample. Video coding standards typically support the usage of different bit depths for the luma and chroma components. In the most widely used profiles, however, only bit depths of 8 bits per sample are supported. If the RGB format is used for coding, all three color components are quantized according to (2.61).

Quantization according to (2.61)-(2.63) does not use the entire range of B-bit integer values. The ranges of unused values are referred to as footroom (small values) and headroom (large values). They allow the implementation of signal processing operations such as filtering or analog-to-digital conversion without the need to clip the results. In the xvYCC color space [106], the headroom and footroom are used for extending the color gamut. When this format is used, the linear components E as well as the gamma-encoded components E' are no longer restricted to the interval [0; 1] and the definition of the transfer function f_TC is extended beyond the domain [0; 1]. As an alternative, the video coding standards H.264 MPEG-4 AVC [121] and H.265 MPEG-H HEVC [123] provide a syntax element by which it can be indicated that the full range of B-bit integer values is used for representing amplitude values, in which case the quantization equations (2.61)-(2.63) are modified so that the minimum and maximum used integer values are 0 and 2^B - 1, respectively.

2.4 Image Acquisition

Modern digital cameras are complex devices that consist of a multitude of components, which often include advanced systems for automatic focusing, exposure control, and white balancing. The most important components are illustrated in Figure 2.23. The camera lens forms an image of a real-world scene on the image sensor, which is located in the image plane of the camera. The lens (or some lens elements) can be moved for focusing objects at different distances.

As discussed in Section 2.1, the focal length of the lens determines the field of view and its aperture regulates the depth of field as well as the illuminance (photometric equivalent of irradiance) falling on the image sensor.

Figure 2.23: Basic principle of image acquisition with a digital camera: objects in the 3D world are imaged through the aperture and lens onto the image sensor; the image processor converts the sensor output into a digital picture.

The image sensor basically converts the illuminance pattern observable on its surface into an electric signal. This is achieved by measuring the energy of visible light that falls onto small areas of the image sensor during a certain period of time, which is referred to as exposure time or shutter speed. The image processor controls the image sensor and converts the electric signal that is output by the image sensor into a digital representation of the captured scene.

The amount of visible light energy per unit area that is used for creating a picture is called exposure; it is given by the product of the illuminance on the image sensor and the exposure time t_e. The illuminance on the sensor is proportional to the area of the entrance pupil and, thus, to the square of the aperture diameter a. But the area of the image on the sensor is also approximately proportional to the square of the focal length f. Hence, for a given scene, the illuminance on the image sensor depends only on the f-number F = f/a. The camera settings that influence the exposure are often expressed as the exposure value EV = log2( F² / t_e ). All combinations of aperture and shutter speed that have the same exposure value give the same exposure for a chosen scene. An increment of one, commonly called one stop, corresponds to halving the amount of visible light energy. Note that different camera settings with the same exposure value still yield different pictures, because the depth of field depends on the f-number and the amount of motion blur on the shutter speed. For video, the exposure time has to be smaller than the reciprocal of the frame rate.
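The relation between f-number, exposure time, and exposure value can be made concrete with a small sketch (purely illustrative; the function name and the chosen settings are ours):

```python
import math

def exposure_value(f_number: float, exposure_time: float) -> float:
    """EV = log2(F^2 / t_e); one EV step ("stop") halves or doubles
    the amount of light energy reaching the sensor."""
    return math.log2(f_number ** 2 / exposure_time)

# Roughly equivalent exposures: one stop smaller aperture combined with
# one stop longer exposure time gives (almost) the same exposure value,
# but different depth of field and motion blur.
print(round(exposure_value(8.0, 1/125), 2))   # f/8,   1/125 s -> about EV 13.0
print(round(exposure_value(5.6, 1/250), 2))   # f/5.6, 1/250 s -> about EV 12.9
```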

Figure 2.24: Image sensor: (a) Array of light-sensitive photocells; (b) Illustration of the exposure-voltage transfer function for a photocell.

2.4.1 Image Sensor

An image sensor consists of an array of light-sensitive photocells, as is illustrated in Figure 2.24(a). Each photocell corresponds to a pixel in the acquired images. Typically, microlenses are located above the photocells. Their purpose is to improve the light efficiency by directing most of the incident light to the light-sensitive parts of the sensor. For some types of sensors, which we will further discuss in Section 2.4.2, color filters that block light outside a particular spectral range are placed between the photocells and microlenses. Another filter is typically inserted between the lens and the sensor. It is used for removing wavelengths to which human beings are not sensitive, but to which the image sensor is sensitive. Without such a filter, the acquired images would have incorrect colors or gray values, since parts of the infrared or ultraviolet spectrum would contribute to the generated image signal.

Modern digital cameras use either charge-coupled device (CCD) or complementary metal-oxide-semiconductor (CMOS) image sensors. Both types of sensors employ the photoelectric effect. When a photon (quantum of electromagnetic radiation) strikes the semiconductor of a photocell, it creates an electron-hole pair. By applying an electric field, the positive and negative charges are collected during the exposure time and a voltage proportional to the number of incoming photons is generated. At the end of an exposure, the generated voltages are read out, converted to digital signals, and further processed by the image processor. Since the created charges are proportional to the number of incoming photons, the exposure-voltage transfer function for a photocell is basically linear. However, as shown in Figure 2.24(b), there is a saturation level, which is determined by the maximum collectible charge.

If the exposure exceeds the saturation level for a significant number of photocells, the captured image is overexposed; the lost image details cannot be recovered by the following signal processing.

Sensor Noise. The number of photons that arrive at a photocell during the exposure time is random; it can be well modeled as a random variable with a Poisson distribution. The resulting noise in the captured image is called photon shot noise. The Poisson distribution has the property that the variance σ² is equal to the mean µ. Hence, if we assume a linear relationship between the number of photons and the generated voltage, the signal-to-noise ratio (SNR) of the output signal is proportional to the average number of incoming photons (µ²/σ² = µ). Other types of noise that affect the image quality are:

  - Dark current noise: a certain number of charges per time unit can also be created by thermal vibration;
  - Read noise: thermal noise in the readout circuitry;
  - Reset noise: some charges may remain after resetting the photocells at the beginning of an exposure;
  - Fixed pattern noise: caused by manufacturing variations across the photocells of a sensor.

Most noise sources are independent of the irradiance on the sensor. An exception is the photon shot noise, which becomes predominant above a certain irradiance level. The SNR of a captured image increases with the number of photons arriving at a photocell during exposure. Consequently, pictures and videos captured with the small image sensors (and small photocells) of smartphones are considerably noisier than those captured with the large sensors of professional cameras.
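The relation SNR = µ²/σ² = µ for photon shot noise can be checked with a short Monte Carlo sketch, assuming ideal Poisson photon counts and a linear sensor response (function names and parameter values are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

def shot_noise_snr(mean_photons: float, n_pixels: int = 1_000_000) -> float:
    """Simulate Poisson-distributed photon counts for a uniformly lit
    patch and return the resulting SNR in dB (signal power over noise
    power). For a Poisson process, SNR = mu^2 / sigma^2 = mu."""
    counts = rng.poisson(mean_photons, size=n_pixels)
    noise_var = counts.var()
    return 10.0 * np.log10(mean_photons ** 2 / noise_var)

# Quadrupling the collected light (e.g., two stops more exposure or a
# photocell with four times the area) gains about 6 dB of SNR.
for mu in (100, 400, 1600):
    print(mu, round(shot_noise_snr(mu), 1))   # ~20.0, ~26.0, ~32.0 dB
```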

ISO Speed. The ISO speed or ISO sensitivity is a measure that was originally standardized by the International Organization for Standardization (ISO) for specifying the light-sensitivity of photographic films. It is now also used as a measure for the sensitivity of image sensors. Digital cameras typically allow a selection of the ISO speed inside a given range. Changing the ISO speed modifies the amplification factor of the sensor's output signal before analog-to-digital conversion. The ISO system defines a linear and a logarithmic scale. Digital cameras typically use the linear scale (with values of 100, 200, etc.), for which a doubling of the ISO sensitivity corresponds to a doubling of the amplification factor. Note that higher ISO values correspond to lower signal-to-noise ratios, since the noise in the sensor's output is also amplified. The ISO speed is the third parameter, besides the aperture and the shutter speed, by which the exposure of a picture can be controlled. Typically, an image is considered to be correctly exposed if nearly the entire range of digital amplitude levels is utilized and the portion of saturated photocells or clipped sample values is very small. For a given scene, the photographer or videographer can select one of multiple suitable combinations of aperture, shutter speed, and ISO sensitivity and, thus, control the depth of field, motion blur, and noise level within certain ranges. For filming in dark environments, increasing the ISO sensitivity is often the only way to achieve the required frame rate.

2.4.2 Capture of Color Images

The photocells of an image sensor basically only count photons. They cannot discriminate between photons of different wavelengths inside the visible spectrum. As discussed in Section 2.2.2, we need, however, at least three image signals for representing color images (each for a different range of wavelengths). Consequently, the spectrum of visible light has to be decomposed into three spectral components. There are two dominating techniques in today's cameras: three-sensor systems and single sensors with color filter arrays. A third technique, the multi-layer sensor, is also used in some cameras.

Three-Sensor Systems. As the name suggests, three-sensor systems use three image sensors, each for a different part of the spectrum. The light that falls through the lens is split by a trichroic prism assembly, which consists of two prisms with dichroic coatings (dichroic prisms), as illustrated in Figure 2.25(a). The dichroic optical coatings have the property that they reflect or transmit light depending on the light's wavelength.

Figure 2.25: Color separation: (a) Three-sensor camera with color separation by a trichroic prism assembly; (b) Sensor with color filter array (Bayer pattern).

In the example of Figure 2.25(a), the short wavelength range is reflected at the first coating and directed to the image sensor that captures the blue color component. The remaining light passes through. At the second filter coating, the long wavelength range is reflected and directed to the sensor capturing the red component. The remaining middle wavelength range, which corresponds to the green color component, is transmitted and captured by the third sensor. In contrast to image sensors with color filter arrays, three-sensor systems have the advantage that basically all photons are used by one of the image sensors and that no interpolation is required. As a consequence, they typically provide images with better resolution and lower noise. Three-sensor systems are, however, also more expensive, and they are large and heavy, in particular when large image sensors are used.

Sensors with Color Filter Arrays. Another possibility to distinguish photons of different wavelength ranges is to use a color filter array with a single image sensor. As illustrated in Figure 2.25(b), each photocell is covered by a small color filter that basically blocks photons with wavelengths outside the desired range from reaching the photocell. The color filters are typically placed between the photocell and the microlens, as shown in Figure 2.24(a). The color filter pattern shown in Figure 2.25(b) is called the Bayer pattern. It is the most common type of color filter array and consists of a repeating 2×2 grid with two green, one red, and one blue color filter. The reason for using twice as many green as red or blue color filters is that humans are more sensitive to the middle wavelength range of visible light. Several alternatives to the

70 2.4. Image Acquisition 65 Bayer pattern are used by some manufacturers. These patterns either use filters of different colors or a different arrangement, or they include filters with a fourth color. Since each photocell of a sensor can only count photons for one of the wavelength ranges, the sample arrays for the color components contain a significant number of holes. The unknown sample values have to be generated using interpolation algorithms. This processing step is commonly called demosaicing. For a Bayer sensor, actually half of the samples for the green component and three quarters of the samples for the red and blue components have to be interpolated. If the assumptions underlying the employed demosaicing algorithm are not true for an image region, interpolation errors can cause visible artifacts. The most frequently observed artifacts are Moiré patterns, which typically appear as wrong color patterns in fine detailed image regions. For reducing demosaicing artifacts, digital image sensors with color filter arrays typically incorporate an optical low-pass filter or anti-aliasing filter, which is placed directly in front of the sensor. Often, this filter consists of two layers of a birefringent material and is combined with an infrared absorption filter. The optical low-pass filter splits every ray of light into four rays, each of which falls on one photocell of a 2 2 cluster. By decreasing the high-frequency components of the irradiance pattern on the photocell array, it reduces the demosaicing artifacts, but also the sharpness of the captured image. Image sensors with color filter arrays are smaller, lighter, and less expensive than three-sensor systems. But due to the color filters they have a lower light efficiency. The demosaicing can cause visible interpolation artifacts; in connection with the often applied optical filtering it also reduces sharpness. Multi-Layer Image Sensors. In a multi-layer sensor, the photocells for different wavelength ranges are not arranged in a 2D, but a 3D array. At each spatial location, three photodiodes are vertically stacked. The sensor employs the property that the absorption coefficient of silicon is highly wavelength dependent. As a result, each of the three stacked photodiodes at a sample location responds to a different wavelength range. The sample values of the three primary colors (red, green, blue)

71 66 Acquisition, Representation, Display, and Perception are generated by an appropriate processing of the captured signals. Since three color samples are captured for each spatial location, optical low-pass filtering and demosaicing is not required and interpolation artifacts do not occur. The spectral sensitivity curves resulting from the employed wavelength separation by absorption are less linearly related to the cone fundamentals than typical color filters. As a consequence, it is often reported that multi-layer sensors have a lower color accuracy than sensors with color filter arrays Image Processor The signals that are output by the image sensor have to be further processed and eventually converted into a format suitable for image or video exchange. As a first step, which is required for any further signal processing, the analog voltage signals have to be converted into digital signals. In order to reduce the impact of this quantization on the following processing, typically a bit depth significantly larger than the bit depth of the final representation format is used. The analog-todigital conversion is often integrated into the sensor. For converting the obtained digital sensor signals into a representation format, the following processing steps are required: Demosaicing (for sensors with color filter arrays, Section 2.4.2), a conversion from the camera color space to the linear color space of the representation format, including white balancing (Section 2.2.2), gamma encoding of the linear color components (Section 2.3.3), optionally, a transform from the linear color space to a Y CbCr format (Section 2.3.4), and a final quantization of the sample values (Section 2.3.5). Beside these required processing steps, image processors often also apply algorithms for improving the image quality, for example, denoising and sharpening algorithms or processing steps for reducing the impact of lens imperfections, such as vignetting, geometrical distortions, and chromatic aberrations, in the output images. Particularly in consumer cameras, the raw image data are typically also compressed using an image or video encoder. The outputs of the camera are then bitstreams (embedded in a container format) that conform to a widely accepted coding standard, such as JPEG [127] or H.264 MPEG-4 AVC [121].
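A highly simplified sketch of this processing chain is given below. It assumes a linear sensor output normalized to [0, 1] with an RGGB Bayer layout, uses a very crude demosaicing stand-in, arbitrary white-balance gains, a pure power-law approximation of the transfer function, and BT.709 luma weights; it is meant only to show the order of the processing steps, not an actual camera implementation.

```python
import numpy as np

# Illustrative constants: BT.709 luma weights and a pure power-law
# approximation of the gamma encoding (real transfer functions also
# contain a linear segment near black).
K_R, K_B = 0.2126, 0.0722
GAMMA = 0.45

def demosaic_crude(raw):
    """Crude demosaicing stand-in for an RGGB Bayer layout: take the
    red/blue sample and the average of the two green samples of each
    2x2 cell and reuse them for all four positions."""
    r = raw[0::2, 0::2]
    g = 0.5 * (raw[0::2, 1::2] + raw[1::2, 0::2])
    b = raw[1::2, 1::2]
    up = lambda c: np.repeat(np.repeat(c, 2, axis=0), 2, axis=1)
    return up(r), up(g), up(b)

def camera_pipeline(raw, wb_gains=(2.0, 1.0, 1.6)):
    """Sensor output (linear, normalized to [0, 1]) -> gamma-encoded
    Y CbCr components; quantization as in (2.61)-(2.63) would follow."""
    r, g, b = demosaic_crude(raw)
    r, g, b = [np.clip(c * gain, 0.0, 1.0) for c, gain in zip((r, g, b), wb_gains)]
    rp, gp, bp = (c ** GAMMA for c in (r, g, b))          # gamma encoding
    ey  = K_R * rp + (1.0 - K_R - K_B) * gp + K_B * bp    # luma component
    ecb = (bp - ey) / (2.0 * (1.0 - K_B))                 # chroma components
    ecr = (rp - ey) / (2.0 * (1.0 - K_R))
    return ey, ecb, ecr
```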

72 2.5. Display of Images and Video Display of Images and Video In most applications, we capture and encode visual information for eventually presenting them to human beings. Display devices act as interface between machine and human. At the present time, a rather large variety of display techniques are available. New technologies and improvements to existing technologies are still developed. Independent of the actual used technology, for producing the sensation of color, each element of a displayed picture has to be composed of at least three primary colors (see Section 2.2.2). The actual employed technology determines the chromaticity coordinates of the primary colors and, thus, the display-internal color space. In general, samples of the representation format provided to the display device have to be transformed into the display s color space; this transformation typically includes gamma decoding and a color space conversion. Modern display devices often apply additional signal processing algorithms for improving the perceived quality of natural video. In the following, we briefly review some important display techniques. For a more detailed discussion, the reader is referred to the overview in [208]. Cathode Ray Tube (CRT) Displays. Some decades ago, all devices for displaying natural pictures were cathode ray tube (CRT) displays. It is the oldest type of electronic display technology, but has now been nearly completely replaced by more modern technologies. As illustrated in Figure 2.26(a), a CRT display basically consists of electron guns, a deflection system, and a phosphor-coated screen. Each electron gun contains a heated cathode that produces electrons by thermionic emission. By applying electric fields the electrons are accelerated and focused to form an electron beam. When the electron beam hits the phosphor-coated screen, it causes the emission of photons. The intensity of the emitted light is controlled by varying the electric field in the electron gun. For producing a picture on the screen, the electron beam is linewise swept over the screen, typically 50 or 60 times per second. The direction of the beam is controlled by the deflection system consisting of magnetic coils. In color CRT displays, three electron guns and three types of phosphors, each for emitting photons for one of the

primary colors red, green, and blue, are used. The different phosphors are arranged in clusters or stripes. A shadow mask mounted in front of the screen prevents electrons from hitting the wrong phosphor.

Figure 2.26: Basic principles of display technologies: (a) Cathode ray tube (CRT) display; (b) Liquid crystal display (LCD); (c) Plasma display; (d) OLED display.

Liquid Crystal Displays (LCDs). Liquid crystals used in displays are liquid organic substances with a crystal molecular structure. They are arranged in a layer between two transparent electrodes so that the alignment of the liquid crystals inside the layer can be controlled by the applied voltage. LCDs employ the effect that, depending on the orientation of the liquid crystals inside the layer and, thus, the applied voltage, the polarization direction of the transmitted light is modified. The basic structure of LCDs is illustrated in Figure 2.26(b). The light emitted by the display's backlight is passed through a first polarizer, which is followed by the liquid crystal layer and a second polarizer with a polarization direction perpendicular to that of the first polarizer. By adjusting the voltages applied to the liquid crystal layer, the modification of the polarization direction and, thus, the amount of light

74 2.5. Display of Images and Video 69 transmitted through the second polarizer is controlled. Finally, the light is passed through a layer with color filters (typically red, green, and blue filters) and a color picture perceivable by human beings is generated on the surface of the screen. A disadvantage of LCDs is that a backlight is required and a significant amount of light is absorbed. Due to that reason, LCDs have a rather large power consumption and do not achieve such good black levels as plasma or OLED displays. Plasma Displays. Plasma displays are based on the phenomenon of gas discharge. As illustrated in Figure 2.26(c), a layer of cells typically filled with a mixture of helium and xenon [208] is embedded between two electrodes. The electrode at the front side of the display has to be transparent. When a voltage is applied to a cell, the accelerated electrons may ionize the contained gas for a short duration. If the excited atoms return to their ground state, photons with a wavelength inside the ultraviolet (UV) range are emitted. A part of the UV photons excites the phosphors inside the cell, which eventually emit light in the visible range. The intensity of the emitted light can be controlled by the applied voltage. For obtaining color images, three types of phosphors, which emit light in the red, green, and blue range of the spectrum, are used. The corresponding cells are arranged in a suitable spatial layout. OLED Displays. Organic light-emitting diode (OLED) displays use organic substances that emit visible light when an electric current is passed through them. As illustrated in Figure 2.26(d), a layer of an organic semiconductor is situated between two electrodes. At least one of the electrodes is transparent. If a voltage is applied, the electrons and holes injected from the cathode and anode, respectively, form electronhole pairs called excitons. When an exciton recombines, the excess energy is emitted in the form of a photon; this process is called radiative recombination. The wavelength of the emitted photon depends on the band energy of the organic material. The light intensity can be controlled by adjusting the applied voltage. In OLED displays, typically three types of OLEDs with organic substances that emit light in the red, green, and blue wavelength range are used.

75 70 Acquisition, Representation, Display, and Perception Projectors. In contrast to the display devices discussed so far, projectors do not display the image on the light modulator itself, but on a diffusely reflecting screen. Due to the loose coupling of the light modulator and the screen, very large images can be displayed. That is why projectors are particularly suitable for large audiences. In LCD projectors, the white light of a bright lamp is first split into red, green, and blue components, either by dichroic mirrors or prisms (see Section 2.4.2). Each of the resulting beams is passed through a separate transparent LCD panel, which modulates the intensity according to the sample values of the corresponding color component. Finally, the modulated beams are combined by dichroic prisms and passed through a lens, which projects the image on the screen. Digital light processing (DLP) projectors use a chip with microscopic mirrors, one for each pixel. The mirrors can be rotated to send light from a lamp either through the lens or towards a light absorber. By quickly toggling the mirrors, the intensity of the light falling through the lens can be modulated. Color images are typically generated by placing a color wheel between the lamp and the micromirror chip, so that the color components of an image are sequentially displayed. Liquid crystal on silicon (LCoS) projectors are similar to LCD projectors, but instead of transparent LCD panels, they use reflective light modulators (similar to DLP projectors). The light modulators basically consist of a liquid crystal layer that is fabricated directly on top of a silicon chip. The silicon is coated with a highly reflective metal, which simultaneously acts as electrode and mirror. As in LCD panels, the light modulation is achieved by changing the orientation of liquid crystals. 2.6 Chapter Summary In this section, we gave an overview of some fundamental properties of image formation and human visual perception. And based on certain aspects of human vision, we reviewed the basic principles that are used for capturing, representing, and displaying digital video signals. For acquiring video signals, the lens of a camera projects a scene of the 3D world onto the surface of an image sensor. The focal length

76 2.6. Chapter Summary 71 and the aperture of the lens determine the field of view and the depth of field of the projection. Independent of the fabrication quality of the lens, the resolution of the image on the sensor is limited by diffraction; its effect increases with decreasing aperture. For real lenses, the image quality is additionally reduced by optical aberrations. The human visual system has similar components as a camera; a lens projects an image onto the retina, where the image is sampled by light-sensitive cells. The photoreceptor responses are send to the brain, where the visual information is interpreted. Under well-lit conditions, three types of photoreceptor cells with different spectral sensitivities are active. They basically map the infinite-dimensional space of electromagnetic spectra onto a 3D space. Hence, light stimuli with different spectra can be perceived as having the same color. This property of human vision is the basis of all color reproduction techniques; it is employed in capturing, representing, and displaying image and video signals. For defining a common color system, the CIE standardized a so-called standard colorimetric observer by defining color-matching functions based on experimental data. The derived CIE 1931 XYZ color space represents the basis for specifying color in video communication applications. In display devices, colors are typically reproduced by mixing three suitably selected primary lights; it is, however, not possible to reproduce all colors perceivable by humans. Color spaces that are spanned by three primaries are called linear color spaces. The human eye adapts to the illumination of a scene; this aspect has to be taken into account when processing the signals acquired with an image sensor. The acuity of human vision is determined by several factors such as the optics of the eye, the density of photoreceptor cells, and the neural processing. Human beings are more sensitive to details in luminance than to details in the quality of color. Certain properties of human vision are also exploited for efficiently representing visual information. For describing color information, each video picture consists of three samples arrays. The primary colors are specified in the CIE 1931 XYZ color space. Since the human visual system has a non-linear response to differences in luminance, the linear color components are non-linear encoded. This processing step, also

77 72 Acquisition, Representation, Display, and Perception called gamma encoding, yields color components with the property that a certain amount of quantization noise has approximately the same subjective impact on dark and light image regions. In most video coding applications, the gamma-encoded color components are transformed into a Y CbCr format, in which color pictures are specified using a luminance-related component, called luma component, and two color difference components, which are called chroma components. This transformation effectively decorrelates the color components. Since humans are significantly more sensitive to details in luminance than to details in color difference data, the chroma components are typically downsampled. In the most common format, the Y CbCr 4:2:0 format, the chroma components contain only a quarter of the samples of the luma component. The luma and chroma sample values are typically represented with a bit depth of 8 or 10 bits per sample. The image sensor in a camera samples the illuminance pattern that is projected onto its surface and converts it into a discrete representation of a picture. Each cell of an image sensor corresponds to an image point and basically counts the photons arriving during the exposure time. For capturing color images, the incident light has to be split into at least three spectral ranges. In most digital cameras, this is achieved either by using a trichroic beam splitter with a separate image sensor for each color component or by mounting a color filter array on top of a single image sensor. In display devices, color images are reproduced by mixing (at least) three primary colors for each image point according to the corresponding sample values. Important display technologies are CRT, LCD, plasma, and OLED displays. For large audiences, as in a cinema, projectors are used.

3 Video Coding Overview

A digital video consists of a sequence of digital pictures, each of which is usually composed of three color components. In uncompressed form, the color components of a picture are 2D arrays of discrete-amplitude samples. These sample arrays are characterized by the spatio-temporal sampling, the linear color space (color gamut), the transfer function, the color representation format, and the sample bit depths. Data formats that represent digital video signals as raw data samples are referred to as representation formats. At the sender side of a video communication system, the camera-internal data are converted into a representation format, and at the receiver side, the raw video data are provided to the display. The bit rate that would be required for transmitting the samples using fixed-length codewords is referred to as the raw data rate. It is determined by the number of pictures per time unit, the sizes of the sample arrays, and the sample bit depths. In most applications, the available bit rate is significantly smaller than the video's raw data rate. For illustration, Table 3.1 lists typical characteristics for three example application scenarios. The main task of video coding is to map the raw sample arrays into a bitstream that can be transmitted over the provided channel.

Table 3.1: Examples for typical video coding applications.

                           HD movie                  UHD broadcast             Video chat
                           on Blu-ray disc           over DVB-S2               over the Internet
  raw video                1920 × 1080, 24 fps,      3840 × 2160, 60 fps,      1280 × 720, 50 fps,
  data format              Y CbCr 4:2:0, 8 bit       Y CbCr 4:2:0, 10 bit      Y CbCr 4:2:0, 8 bit
  raw data rate            ca. 600 Mbit/s            ca. 7.5 Gbit/s            ca. 550 Mbit/s
  channel bit rate         36 Mbit/s (read speed)    58 Mbit/s (8PSK 2/3)      depends on connection
  typical video bit rate   ca. 20 Mbit/s             ca. 25 Mbit/s             ca. 1 Mbit/s
  required compression     ca. 30 : 1                ca. 300 : 1               ca. 500 : 1

The required compression factors highly depend on the actual application; they typically lie in the range of about 10 to several hundred. Such compression factors are only achievable by lossy compression, i.e., by coding techniques that approximate the input signal in a certain way. Consequently, the bitstream generated by a video encoder should provide the best possible reconstruction quality for a given maximum bit rate. In practice, additional aspects such as the end-to-end delay and the complexity of encoder and decoder implementations have to be taken into account.

In this text, we will restrict ourselves to a discussion of hybrid video coding. The name of this coding concept indicates that the basic source coding algorithm is a hybrid of two fundamental techniques, namely inter-picture prediction and transform coding. While inter-picture prediction utilizes dependencies between video pictures, transform coding exploits spatial dependencies within the resulting prediction error signals. It is also possible to code pictures or regions of a picture without referring to other pictures of a video sequence. This type of coding is called intra-picture coding. It typically employs the concept of transform coding, but often also includes intra-picture prediction techniques. In most hybrid video coding designs, the algorithms for inter-picture prediction, intra-picture prediction, and transform coding are applied on the basis of rectangular blocks of samples. These block-based hybrid video coding approaches represent the most successful class of video coding designs. All video coding standards that are widely used in practice follow this basic design principle.
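The raw data rates and compression factors in Table 3.1 follow directly from the picture size, frame rate, chroma format, and bit depth, as the following short sketch (with our own helper names) illustrates:

```python
def raw_data_rate(width, height, fps, bit_depth, chroma_samples_per_luma=1.5):
    """Raw data rate in bit/s for a Y CbCr video; 4:2:0 sampling carries
    0.5 chroma samples per luma sample (factor 1.5 in total)."""
    samples_per_picture = width * height * chroma_samples_per_luma
    return samples_per_picture * bit_depth * fps

apps = {
    "HD movie (Blu-ray)": (1920, 1080, 24,  8, 20e6),
    "UHD broadcast":      (3840, 2160, 60, 10, 25e6),
    "Video chat":         (1280,  720, 50,  8,  1e6),
}
for name, (w, h, fps, bits, coded_rate) in apps.items():
    raw = raw_data_rate(w, h, fps, bits)
    print(f"{name}: raw {raw/1e6:.0f} Mbit/s, "
          f"compression ca. {raw/coded_rate:.0f}:1")
# Roughly 600 / 7500 / 550 Mbit/s raw data rate and compression factors
# of about 30:1, 300:1, and 550:1, matching the entries of Table 3.1.
```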

80 3.1. Properties of Digital Video Signals 75 Another approach that gained some attention in the research community is 3D subband coding [145, 191, 266, 35, 228]. It is based on the concept of transform coding. The distinctive feature of 3D subband coding is that the used transform represents a spatio-temporal subband decomposition of a group of successive video pictures. For taking into account the motion in video sequences, the 3D subband decomposition often includes techniques for motion compensation, which are similar to the methods used in block-based hybrid video coding. The implementation complexity for a video encoder is typically much higher than that for the corresponding decoder. This is actually desirable for many application areas, such as television broadcast or video streaming. However, there are also some applications, as, for example, the area of low-power surveillance cameras, that could benefit from shifting the complexity from the encoder to the decoder side. Such coding concepts have also been studied by several researchers and are commonly referred to as distributed video coding [78, 198, 56]. This section introduces the basic concepts of hybrid video coding and gives an overview of the structures of typical video encoders and decoders. In the following sections, we will then discuss the components of hybrid video codecs in some detail. This discussion will include a description of coding tools used in modern video coding standards as well as an analysis of the effectiveness of several coding techniques. For details on other video coding approaches such as 3D subband coding or distributed video coding, we refer to the corresponding literature. 3.1 Properties of Digital Video Signals Since a receiver does not know the transmitted video in advance, the sample arrays of a video signal can be considered as realizations of a random process. Hence, we can apply the fundamental source coding techniques, which we studied in the first part [301] of this monograph. In fact, entropy coding, quantization, prediction, and linear transforms represent the basic building blocks of video codecs. There are, however, some important differences between digital video signals and the stationary and continuous-amplitude random signals that we investi-

81 76 Video Coding Overview gated in the source coding part [301]. Even though the samples of a digital video can be ordered into a sequence, video signals cannot be well described by 1D signal models. The color components of a digital picture are natural 2D discrete-space signals. And digital videos can be best described as a set of 3D discrete space-time signals, one for each color component. For a suitable application of the source coding techniques, the space-time dependencies in video signals have to be taken into account. The fact that raw video samples already have discrete amplitudes only impacts actual implementations, it can often be neglected for the basic codec design. Another important difference is, however, that the dependencies inside video signals cannot be well described by a stationary random model. On the one hand, there are significant differences between different video signals; as an example, action scenes in movies have other properties than typical video conferencing content. On the other hand, the statistical properties also vary within a picture or video scene. For coping with the non-stationary characteristics, video codecs typically include concepts that allow an adaptation to local statistics in a picture or video sequence. Even though video signals cannot be well described by a simple random process, they have certain properties that can be exploited for an efficient coding. The present text focuses on the coding of natural video, i.e., video that is captured with a conventional camera. Each picture of a natural video sequence shows the projections of real-world objects, which are characterized by specific structures and surface properties. Image regions that show the projection of homogeneous surface parts contain samples with very similar values. Inside such regions, the signal has similar properties as a realization of a stationary random process with large correlation coefficients. At the boundaries of objects or homogeneous surface parts, however, the sample values can change rapidly. Since the boundaries can have arbitrary orientations, the correlation coefficients between neighboring samples inside an image region typically highly depend on the considered spatial direction. If we film a static scene with a fixed camera, the video pictures show the same content and the small differences between successive pictures are mainly caused by sensor noise. In most cases, however, some of the

82 3.2. Intra-Picture Coding 77 objects in the captured scene move or the camera is moved. Hence, an important source for changes between successive pictures is the motion of objects relative to the camera. Even though picture differences can also be caused by abrupt lighting changes (e.g., flash lights), scene cuts, or fades between different scenes, for most of the video pictures, the main source is the motion of objects. Videos that are not captured with a conventional camera can have different properties. For example, animation movies and videos with screen content often contain large areas with constant sample values, while medical videos (e.g., ultrasonic videos) are typically characterized by a much higher noise level, more diffuse motion, and less clear object boundaries. In this text, we concentrate on the coding of natural videos. Nonetheless, the discussed coding concepts can also be used for other types of videos. But often the coding efficiency can be further improved if special coding tools are incorporated that exploit the characteristic properties of the considered video source. 3.2 Intra-Picture Coding A simple way of compressing a video signal is to code each picture (or color component of a picture) separately. This type of coding, which utilizes only dependencies inside pictures, is referred to as intra-picture coding, or simply intra coding. In the source coding part [301] of this monograph, we discussed three lossy coding techniques that are capable of exploiting dependencies between the samples of a color component: Vector quantization, predictive coding, and transform coding. Even though vector quantization can theoretically provide a higher coding efficiency than the other two techniques, it is generally considered as too complex for image and video coding applications. Our analyses with a 1D Gauss-Markov model [301] showed that, for this type of signals, transform coding is more efficient than predictive coding, in particular in the low bit rate range, which represents the typical operational range of image and video codecs. Another argument for using transform coding is that quantization errors in the transform domain are often less visible than quantization errors inside a sample prediction

loop. The research on image and video coding actually started with the investigation of transform coding approaches [59, 4, 202]. In the framework of hybrid video coding, the sample arrays of the color components are partitioned into rectangular, typically square, blocks and transform coding is applied to these blocks of samples. The basic concept is illustrated in Figure 3.1.

Figure 3.1: 2D block transform coding: (a) Encoder (2D transform, scalar quantization, entropy coding); (b) Decoder (entropy decoding, decoder mapping, 2D inverse transform).

At the encoder side, a block of samples is transformed using a 2D decorrelating transform, the transform coefficients are quantized with scalar quantizers, the resulting quantization indexes are entropy coded, and the obtained codewords are written to the bitstream. In the decoder, the quantization indexes are decoded from the sequence of bits and mapped onto reconstructed transform coefficients. Finally, the 2D inverse transform is applied and the reconstructed block of samples is obtained. In principle, transform coding could also be applied to image rows or columns, or parts thereof. The usage of 2D transforms has the crucial advantage that dependencies in both spatial directions can be utilized. The main reasons for partitioning the sample arrays into blocks instead of transforming entire sample arrays are the following. On the one hand, our analyses for 1D Gauss-Markov processes and a Karhunen-Loève transform [301] indicate that although larger transform sizes increase the coding efficiency, the potential improvements become rather small beyond a certain transform size. And since the implementation complexity continuously increases with the transform size, the usage of suitably sized blocks provides a reasonable trade-off between coding efficiency and complexity. On the other hand, natural images are characterized by nonstationarities such as object boundaries. The energy of such local features is typically distributed over a large number of transform coefficients. By using reasonably small transforms, the impact is restricted to certain blocks.
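The encoder and decoder chain of Figure 3.1 can be sketched in a few lines of Python. Entropy coding is omitted, and the 8×8 orthonormal DCT-II, the uniform quantizer, and the step size are illustrative choices rather than the configuration used elsewhere in this text:

```python
import numpy as np

N = 8
# Orthonormal DCT-II matrix (one common choice of decorrelating transform).
k = np.arange(N)
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
C[0, :] = np.sqrt(1.0 / N)

def encode_block(block, step):
    """2D transform followed by uniform scalar quantization."""
    coeffs = C @ block @ C.T
    return np.round(coeffs / step).astype(int)      # quantization indexes

def decode_block(levels, step):
    """Decoder mapping (scaling) followed by the 2D inverse transform."""
    coeffs = levels * step
    return C.T @ coeffs @ C

rng = np.random.default_rng(1)
block = rng.integers(0, 256, size=(N, N)).astype(float)
levels = encode_block(block, step=16.0)
rec = decode_block(levels, step=16.0)
print("nonzero levels:", np.count_nonzero(levels),
      "max error:", np.abs(rec - block).max())
```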

For still image coding, the block-based processing was suggested in [316, 91, 311]. In the context of video coding, it has the additional advantage that a region-wise transform coding can be efficiently combined with motion compensation techniques. A prominent example of block-based transform coding is the widely used Baseline process of the image coding standard JPEG [127]. The intra-picture coding techniques used in modern video coding standards combine block-based transform coding with an intra-picture prediction, which additionally exploits dependencies between neighboring transform blocks (a simple variant is also used in Baseline JPEG). A detailed discussion of intra-picture coding tools is provided in Section 5.

3.3 Hybrid Video Coding

Intra-picture coding techniques only exploit statistical dependencies inside pictures. But since video sequences typically contain a very large amount of temporal redundancy, the additional utilization of dependencies between the pictures of a video can significantly improve coding efficiency. Coding techniques that exploit these temporal dependencies are referred to as inter-picture coding techniques or simply inter coding techniques. The basic idea of inter-picture coding can be traced back to a British patent [146] from 1929. The ability to utilize the temporal dependencies in video sequences for improving coding efficiency is what fundamentally distinguishes video coding from still image coding.

The advantage of inter-picture coding can be illustrated by a simple example. If we consider successive pictures in typical video conferencing sequences or news programs, most areas of a picture are essentially repeated in following pictures. It is obvious that we can compress such video sequences more efficiently if we do not transmit all pictures independently of each other, but transmit only the areas that noticeably change relative to an already transmitted picture and repeat the content of the already transmitted picture for the other areas. This method is called conditional replenishment [181]; it was the only method for exploiting temporal dependencies in the first version of the international video coding standard ITU-T Recommendation H.120 [118].
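A per-block sketch of conditional replenishment is given below; the block size, the mean-absolute-difference test, and the threshold are our own illustrative choices:

```python
import numpy as np

def conditional_replenishment(current, previous_rec, block=16, thr=5.0):
    """For each block, either repeat the co-located block of the already
    transmitted picture ("skip") or transmit the block itself ("intra").
    Returns the reconstruction and the positions of replenished blocks."""
    rec = previous_rec.copy()
    sent = []
    h, w = current.shape
    for y in range(0, h, block):
        for x in range(0, w, block):
            cur = current[y:y+block, x:x+block]
            ref = previous_rec[y:y+block, x:x+block]
            if np.mean(np.abs(cur - ref)) > thr:      # noticeable change?
                rec[y:y+block, x:x+block] = cur       # transmit ("intra")
                sent.append((y, x))
            # otherwise: skip, i.e., repeat the previously sent content
    return rec, sent
```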

The method of conditional replenishment can be efficiently combined with block-based transform coding. For each block, we can choose between two coding modes. Either we transmit the block in the so-called intra mode, in which we code the samples using transform coding, or we repeat the samples of an already transmitted picture, which we refer to as skip mode. The selected mode can be signaled by one bit per block.

A significant shortcoming of conditional replenishment is that we can only choose between transmitting an image area without exploiting any temporal dependencies and repeating the area of a previous picture. Often a substantial coding gain can be achieved if we introduce a third coding mode, which shall be called difference mode. In this mode, the co-located samples of an already coded picture are used as prediction signal. The reconstructed samples are obtained by adding a refinement signal, which is transmitted using transform coding. The inclusion of the difference coding mode already leads to the basic framework of hybrid video coding^1, which was first described by Schroeder [220]. The principle of a hybrid video codec is illustrated in Figure 3.2. It is similar to the concept of differential pulse code modulation (DPCM), which we discussed in the source coding part [301].

Figure 3.2: Basic structure of a hybrid video codec (encoder and decoder, each containing the transform coding and prediction stages).

The samples s[x, y] of a current picture are predicted using reconstructed samples s'_ref[x, y] of an already coded picture,

    ŝ[x, y] = f( s'_ref[x, y] ).                                      (3.1)

^1 The term hybrid coding was originally introduced by Habibi [90] for describing a still image codec that combines a linear transform with multiple predictive quantization loops. In the context of video coding, it is now used for characterizing approaches that apply transform coding inside an inter-picture prediction loop.

The resulting prediction error signal

    u[x, y] = s[x, y] - ŝ[x, y],                                      (3.2)

which is also referred to as residual signal, is transmitted using transform coding. The reconstructed residual signal can be written as

    u'[x, y] = Q( u[x, y] ),                                          (3.3)

where Q(·) is an operator that specifies the forward transform, the mappings of the used scalar quantizers, and the inverse transform. The reconstructed signal s'[x, y] is obtained by adding the reconstructed residual signal u'[x, y] to the prediction signal ŝ[x, y],

    s'[x, y] = ŝ[x, y] + u'[x, y].                                    (3.4)

Note that the temporal prediction uses reconstructed samples s'_ref[x, y] of already transmitted pictures. Hence, in error-free transmission scenarios, the prediction signal ŝ[x, y] in encoder and decoder is the same. In the intra coding mode, the samples of the prediction signal are set equal to zero, ŝ[x, y] = 0, or, if intra coding includes an intra-picture prediction, they are derived using already reconstructed samples s'[x, y] of the current picture, ŝ[x, y] = f(s'[x, y]). The skip mode corresponds to the special case that all samples of the transmitted residual signal are equal to zero, u'[x, y] = 0. In the difference mode described, the samples of the prediction signal are set equal to the co-located samples of an already coded reference picture, ŝ[x, y] = s'_ref[x, y].
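The closed-loop property expressed by (3.1)-(3.4), namely that the encoder predicts from its own reconstructions so that encoder and decoder stay synchronized, can be made explicit with a short sketch. For brevity, the transform coding of the residual is replaced by a simple coarse quantizer, only the difference mode is modeled, and all names are our own:

```python
import numpy as np

def q(residual, step=8.0):
    """Stand-in for transform coding of the residual: quantize and
    reconstruct (the operator Q in (3.3))."""
    return np.round(residual / step) * step

def encode_sequence(pictures, step=8.0):
    """Difference-mode coding loop: predict each picture from the
    *reconstructed* previous picture, code the residual, reconstruct."""
    reconstructed = []
    prev_rec = np.zeros_like(pictures[0])    # first picture: zero prediction
    for s in pictures:
        s_hat = prev_rec                     # prediction signal, cf. (3.1)
        u = s - s_hat                        # residual signal, cf. (3.2)
        u_rec = q(u, step)                   # transmitted approximation, cf. (3.3)
        s_rec = s_hat + u_rec                # reconstruction, cf. (3.4)
        reconstructed.append(s_rec)
        prev_rec = s_rec                     # closed loop: use the reconstruction
    return reconstructed

# A decoder receiving the same u_rec values computes identical
# reconstructions, since its prediction is built from the same s_rec.
```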

But if an object of the captured scene moves between the sampling times of the reference picture and the current picture, the prediction signal ŝ[x, y] often differs substantially from the original signal s[x, y]. Hence, the difference mode is not very effective for such image regions; transform coding of the original sample values will often provide a higher coding efficiency. The temporal prediction can, however, be significantly improved if we compensate for the motion of objects. This concept is referred to as motion-compensated prediction. The basic idea is illustrated in Figure 3.3.

Figure 3.3: Basic principle of motion-compensated prediction: the current block is predicted from the best-matching block in an already coded reference picture; the displacement between the two blocks is given by the vector m = (m_x, m_y)^T.

Let us assume we want to encode an area of a current picture that belongs to a moving object, such as the block shown in the figure. Instead of using the co-located area for predicting the current block, we can select the area in the reference picture that best matches the current block. If we find a well-matching area, the energy of the residual signal u[x, y] is significantly reduced and, hence, the coding efficiency is improved. Coding modes that use motion-compensated prediction are referred to as inter modes. The prediction signal for a current image region is given by a displaced area in a reconstructed reference picture,

    ŝ[x, y] = s'_ref[x + m_x, y + m_y].                               (3.5)

The vector m = (m_x, m_y)^T that specifies the reference area is referred to as displacement vector or motion vector. Since this vector is also required for generating the prediction signal at the decoder side, it has to be transmitted as part of the bitstream. At the encoder side, a well-matching reference area has to be selected. Since the search for a suitable area can be considered as an estimation of the motion between pictures, it is generally referred to as motion estimation. Although most image regions can typically be well predicted using motion-compensated prediction, there are also regions, such as uncovered parts of the background, for which no good match may be found. And since non-matched prediction can decrease coding efficiency [301], hybrid video codecs provide at least two coding modes for a picture area, an inter mode and an intra mode. The coding mode chosen by the encoder has to be transmitted inside the bitstream. The actual bitstream syntax may include features that provide a very efficient signaling for special variants of the inter mode, e.g., for the skip mode or the difference mode described above.
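A minimal full-search block-matching routine shows how the motion vector in (3.5) can be estimated and used for prediction; the integer-sample accuracy, the SSD matching criterion, the search range, and the row/column indexing convention are illustrative assumptions:

```python
import numpy as np

def motion_estimate(cur_block, ref_pic, x0, y0, search_range=8):
    """Full-search block matching: return the motion vector (mx, my)
    that minimizes the sum of squared differences between the current
    block at position (x0, y0) and the displaced reference block."""
    n, m = cur_block.shape
    h, w = ref_pic.shape
    best_cost, best_mv = None, (0, 0)
    for my in range(-search_range, search_range + 1):
        for mx in range(-search_range, search_range + 1):
            y, x = y0 + my, x0 + mx
            if y < 0 or x < 0 or y + n > h or x + m > w:
                continue                           # candidate outside the picture
            ref_block = ref_pic[y:y+n, x:x+m]
            ssd = np.sum((cur_block - ref_block) ** 2)
            if best_cost is None or ssd < best_cost:
                best_cost, best_mv = ssd, (mx, my)
    return best_mv

def motion_compensate(ref_pic, x0, y0, mv, block_shape):
    """Prediction signal according to (3.5): a displaced block of the
    reconstructed reference picture."""
    n, m = block_shape
    mx, my = mv
    return ref_pic[y0+my:y0+my+n, x0+mx:x0+mx+m]
```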

88 3.3. Hybrid Video Coding 83 The research on video coding with motion-compensated prediction started in the 1970s [214, 215, 24, 73, 169, 184]. Block-based motion compensation was introduced in [262, 20]. The hybrid coding structure with block-based motion-compensated prediction and transform coding inside the inter-picture prediction loop was first widely published by Jain and Jain [129]. The first video coding standard that follows this basic approach was ITU-T Recommendation H.261 [119]. Since then, all major video coding standards are based on the approach of blockbased hybrid video coding with motion-compensated prediction. Even though modern video codecs still use the concept of hybrid video coding developed in the 1980s, many details have been improved over time. One of the most important aspects is the exploitation of the dependencies between pictures. Several details of inter-picture coding will be discussed in Section 6. In Section 7, we will compare the features and compression capabilities of important video coding standards Structure of Hybrid Video Encoders and Decoders Figure 3.4 depicts a simplified block diagram of a modern hybrid video encoder. It shows all major components that are found in encoders for the state-of-the-art video coding standards H.264 MPEG-4 AVC [121] and H.265 MPEG-H HEVC [123]. In the following, we introduce the basic principles. A detailed description follows in later sections. For simplifying the description, we consider the coding of a single color component, e.g., the luma component. While coding parameters such as coding modes or motion vectors are typically determined only once for a picture area and are used for all associated color components, the actual signal processing is carried out separately for each of the color components. The video pictures are processed in a certain order, which is referred to as coding order. As will be demonstrated later, it can be advantageous if the coding order is different from the display order, i.e., the order in which the video pictures are captured and displayed. The input video signal shall be denoted by s k [x, y], where (x, y) represents the spatial location of a sample inside a video picture and k specifies the picture index in coding order. The video pictures are partitioned into smaller areas. Even though these areas could, in

principle, have arbitrary shapes, we assume that they represent blocks of samples, which is actually the case in most hybrid video codecs. The blocks of samples that represent the basic entities for coding shall be called coding blocks; the block size can be variable.

Figure 3.4: Simplified block diagram of a typical hybrid video encoder, comprising the encoder control, transform and quantization, scaling and inverse transform, entropy coding, intra-picture prediction, motion-compensated prediction, motion estimation, in-loop filtering, a buffer for the current picture, and a decoded picture buffer.

A coding block s_k[x, y] is coded using either intra-picture prediction or motion-compensated prediction. When intra-picture prediction is used, the prediction signal ŝ_k[x, y] is generated using reconstructed samples s'_k[x, y] of already coded neighboring blocks of the same picture. Video codecs often support multiple intra prediction modes. The selection of a suitable intra prediction mode is referred to as intra mode decision. If no prediction between the blocks of a picture is supported (which is the case in older video coding standards), all samples of the prediction signal ŝ_k[x, y] can be considered to be equal to zero. In inter-picture coding modes, the prediction signal ŝ_k[x, y] for a coding block is formed by reconstructed samples of an already coded and temporally stored picture s'_r[x, y]. The area s'_r[x + m_x, y + m_y] of

The area s'_r[x + m_x, y + m_y] of the reference picture that is used for motion-compensated prediction is specified by a motion vector m = (m_x, m_y)^T. The encoder's selection of a suitable motion vector is referred to as motion estimation. Modern hybrid video codecs typically support the storage of multiple reconstructed pictures in a decoded picture buffer, and the encoder can choose one of them. In that case, the reference picture index r that indicates the used reference picture s'_r[x, y] has to be transmitted in addition to the motion vector. Video codecs often also support inter-picture coding with multi-hypothesis prediction. The motion-compensated prediction signal ŝ_k[x, y] for a block is then formed by a weighted sum of multiple (usually two) displaced reference blocks.

Since video codecs support multiple modes for transmitting a coding block (at least an intra and an inter mode), the encoder has to select one of them. In Figure 3.4, the decision process is denoted as coding mode decision. For the actual prediction and the signaling of prediction parameters (such as motion vectors or intra prediction modes), a coding block may be split into smaller prediction blocks.

For the first picture of a video sequence, no reference pictures are available. Hence, all coding blocks have to be transmitted in an intra mode. Pictures that are coded without referencing other pictures are called intra pictures. In many applications, for example in television broadcast, intra pictures are inserted at regular intervals for enabling clean random access.

After a coding mode is selected for a coding block, the corresponding prediction signal ŝ_k[x, y] is subtracted from the input signal s_k[x, y], yielding the residual signal u_k[x, y]. An approximation u'_k[x, y] of the residual signal u_k[x, y] is transmitted using the concept of transform coding. Similarly to the prediction, a coding block can also be split into multiple transform blocks. A transform block represents a block of samples that is transformed using a single 2D transform. For each transform block, the original residual signal u_k[x, y] is transformed and the obtained transform coefficients are quantized using a scalar quantizer. The resulting quantization indexes are referred to as transform coefficient levels. The trade-off between approximation quality and bit rate can be adjusted by selecting one of multiple supported quantizers. The chosen quantizer is indicated by a quantization parameter.

91 86 Video Coding Overview All coding parameters that are required for reconstructing the video pictures are losslessly compressed and the obtained codewords are written to the bitstream. This lossless compression is also called entropy coding. The transmitted data typically include subdivision information, coding modes, intra prediction modes, the number of motion hypotheses, reference picture indexes, motion vectors, quantization parameters, and transform coefficient levels. The resulting bitstream represents the input video in compressed form. Beside the coding parameters that are determined in the encoder, it also includes some high-level syntax elements, which specify data such as the picture size, the coding order of pictures, or the actually used set of coding tools. Due to the closed-loop prediction structure, the video pictures have to be reconstructed inside the encoder. First, the transform coefficient levels are mapped to reconstructed transform coefficients. This is typically achieved by scaling the transform coefficient levels with the quantization step size, which is specified by the quantization parameter. The reconstructed residual signal u k [x, y] is then obtained by transforming the reconstructed transform coefficients using the inverse transform (the inverse of the transform that was applied before quantization). Finally, the reconstructed residual signal u k [x, y] is added to the prediction signal ŝ k [x, y]. The resulting reconstructed samples s k [x, y] are written to a buffer for the current picture and can be used for intrapicture prediction of following blocks of the same picture. Hybrid video codecs may include additional filters for improving the reconstruction quality. For example, at low bit rates, a deblocking filter is sometimes applied to the reconstructed samples s k [x, y] in order to reduce block artifacts that are induced by the block-based processing. Beside the subjective quality, the filters typically also improve the effectiveness of motion-compensated prediction for pictures that reference the filtered pictures. Due to that reason, such filters are often applied inside the inter-picture prediction loop. The corresponding processing step is then also referred to as in-loop filtering. Note that a filtering outside the prediction loop does not have any impact on the encoder or decoder structure; it can be considered as part of an additional/optional postprocessing step.
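To make the closed-loop structure described above more concrete, the following sketch carries out the per-block processing for one coding block: form a prediction signal, transform and quantize the residual, and reconstruct the block exactly as a decoder would. It is a minimal illustration under simplifying assumptions (an orthonormal DCT as the transform, a single uniform quantization step size, 8-bit samples), not the processing chain of any particular standard; the example input, block position, and motion vector are made up.

```python
import numpy as np
from scipy.fft import dctn, idctn

def code_block(s_orig, s_pred, q_step):
    """Closed-loop coding of one block: transform, quantize, reconstruct.

    s_orig : original samples s_k of the coding block (2D array)
    s_pred : prediction signal (intra-picture or motion-compensated)
    q_step : quantization step size (selected via the quantization parameter)
    Returns the transform coefficient levels (to be entropy coded) and the
    reconstructed block (stored for prediction of following blocks/pictures).
    """
    u = s_orig.astype(np.float64) - s_pred            # residual signal u_k
    coeffs = dctn(u, norm='ortho')                    # 2D transform
    levels = np.round(coeffs / q_step)                # transform coefficient levels
    u_rec = idctn(levels * q_step, norm='ortho')      # scaling + inverse transform
    s_rec = np.clip(np.rint(s_pred + u_rec), 0, 255)  # reconstructed samples s'_k
    return levels.astype(int), s_rec.astype(np.uint8)

# Toy example: motion-compensated prediction by a displaced block of a
# reconstructed reference picture, with a made-up motion vector m = (-2, -1).
ref = np.random.randint(0, 256, (64, 64))
cur = np.roll(ref, shift=(1, 2), axis=(0, 1))         # current picture content
mx, my = -2, -1
block = cur[16:24, 16:24]
pred = ref[16 + my:24 + my, 16 + mx:24 + mx].astype(np.float64)
levels, rec = code_block(block, pred, q_step=8.0)
```

Because the reconstruction is computed from the quantized levels, encoder and decoder stay synchronized even though the quantization is lossy.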

Figure 3.5: Simplified block diagram of a typical hybrid video decoder (entropy decoding, scaling & inverse transform, intra-picture prediction, motion-compensated prediction, in-loop filtering, buffer for the current picture, decoded picture buffer).

The filtered pictures s'_k[x, y] are stored in a decoded picture buffer; they can be used for motion-compensated prediction of following pictures. The operation of the decoded picture buffer either has to follow a certain defined mechanism or the bitstream has to include data that specify which pictures are inserted into and removed from the buffer.

In the encoder, a multitude of parameters such as subdivision data, coding modes, intra prediction modes, reference picture indexes, motion vectors, quantization parameters, and transform coefficient levels have to be selected. The entirety of the used decision algorithms is referred to as encoder control. As indicated by the dashed arrows in Figure 3.4, the encoder control determines in particular the estimation of motion parameters and intra prediction modes, the decision between different coding modes, and the quantization of transform coefficients.

Figure 3.5 shows a block diagram of a hybrid video decoder. Note that the gray-marked components are also found in the encoder. The input bitstream is parsed and the transmitted coding parameters are obtained using entropy decoding. The residual signal u'_k[x, y] is reconstructed by scaling the transmitted transform coefficient levels with the quantization step size and transforming the resulting reconstructed transform coefficients using the inverse transform.

Depending on the transmitted coding mode, the prediction signal ŝ_k[x, y] for a block is generated by either intra-picture prediction or motion-compensated prediction. In intra-picture prediction, the samples of already reconstructed blocks are utilized; the actual construction of the prediction signal is typically specified by a transmitted intra prediction mode. When motion-compensated prediction is used, the prediction signal is formed by a displaced area s'_r[x + m_x, y + m_y] of an already reconstructed picture s'_r[x, y]. The used reference picture and the selected picture area are specified by the transmitted reference picture index r and motion vector m = (m_x, m_y)^T, respectively. As noted above, the prediction signal ŝ_k[x, y] may also be formed by a weighted sum of multiple reference blocks. Finally, the prediction signal ŝ_k[x, y] is added to the reconstructed residual signal u'_k[x, y]. The results are the reconstructed samples s'_k[x, y], which are written to a buffer for the current picture. After the potential application of in-loop filters, the reconstructed pictures s'_k[x, y] are stored in a decoded picture buffer. The decoded video is obtained by outputting the video pictures s'_k[x, y] in correct display order.

3.3.2 Picture Partitioning, Scanning, and Syntax

For carrying out the fundamental processing steps of intra prediction, motion-compensated prediction, and transform coding, the pictures are segmented into smaller regions. In principle, the used image regions could have arbitrary shapes. But when multiple options for partitioning a picture are supported, the selected segmentation has to be signaled to the decoder, which requires a certain number of bits. A larger set of partitioning options is only advantageous if the bit rate reduction that results from the improved prediction and transform coding outweighs the additional bit rate required for transmitting the partitioning data. For practical implementations, it also has to be taken into account that the potential coding gain associated with an increased set of options can only be exploited if an encoder evaluates a significant number of the supported choices. For these reasons, video codecs typically provide little freedom for segmenting the pictures into regions used for prediction and transform coding.

Figure 3.6: Example for a quadtree partitioning of a block of samples. The root node of the quadtree is marked with a filled square, the leaf nodes are represented by hollow circles. The numbers indicate the scanning order of the resulting subblocks.

Picture Partitioning. In the early video coding standards H.261 [119], MPEG-1 Video [112], and H.262 MPEG-2 Video [122], the pictures are always partitioned into blocks of 16×16 luma samples. These blocks represent the basic processing units, for which coding modes and motion vectors are transmitted. Transform coding is applied to blocks of 8×8 samples. Such a fixed partitioning scheme is straightforward to implement and does not require the transmission of any segmentation data. But since video signals are non-stationary, a more flexible partitioning that can be adapted to regions with constant motion or areas of similar sample values can improve coding efficiency. Newer video codecs [121, 123] generally support variable block sizes.

The most common approach for partitioning a picture into blocks of variable sizes uses a quadtree-based hierarchical subdivision [276, 248, 277, 251]. Typically, a picture is first partitioned into square blocks of a fixed size of 2^N × 2^N samples, which can then be further subdivided in a hierarchical fashion. The partitioning of each 2^N × 2^N block is represented by a quadtree structure; an example is shown in Figure 3.6. Each node of the quadtree is associated with a block of 2^n × 2^n samples and either represents a leaf node or a branching node with exactly four descendant nodes. Leaf nodes indicate blocks that are not further split, whereas branching nodes are associated with 2^n × 2^n blocks that are subdivided into four 2^(n−1) × 2^(n−1) blocks. The root node of the quadtree structure represents the complete 2^N × 2^N block. One advantage of quadtree-based segmentations is that the chosen partition can

be efficiently signaled to the decoder. We simply have to code a binary decision for each quadtree node (in a causal order), which indicates whether or not the node represents a leaf node. For leaf nodes that correspond to blocks of the smallest supported size, the decisions do not need to be transmitted (a schematic parsing routine is sketched after the block-type definitions below). Another benefit is that the hierarchical tree structure is well suited for the application of fast encoder decision algorithms, both optimal [36] and sub-optimal [180] variants.

The quadtree approach is not restricted to square 2^n × 2^n blocks. However, this choice simplifies implementations. For the transforms used in image and video coding, particularly fast algorithms can be used if the transform size is equal to an integer power of two. The quadtree concept can also be combined with an additional segmentation of the resulting subblocks. As an example, for signaling motion parameters, H.264 MPEG-4 AVC [121] and H.265 MPEG-H HEVC [123] allow the subdivision of a square block into two rectangular blocks. Note that the block sizes used for transform coding can differ from the block sizes used for prediction. In this text, we use the terminology of H.265 MPEG-H HEVC [123] and distinguish the following four block types:

Coding tree block: A fixed-size block of samples for a color component. Coding tree blocks are obtained by an initial partitioning of a picture into fixed-size blocks. They represent the roots of quadtree structures and can be decomposed into smaller blocks.

Coding block: A block of samples for which one of multiple coding modes can be selected; the chosen coding mode determines whether all samples of the block are predicted using intra-picture prediction or motion-compensated prediction. A coding tree block can be partitioned into multiple coding blocks.

Prediction block: A block of samples that is predicted using the same intra prediction mode or motion parameters. A coding block can be partitioned into multiple prediction blocks.

Transform block: A block of samples to which a single 2D transform is applied. A coding block can be partitioned into multiple transform blocks.
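To illustrate how such a quadtree partition can be signaled and traversed, the following sketch parses one binary split flag per node in depth-first order and collects the resulting leaf blocks; nodes of the smallest supported size need no flag. It is a schematic example rather than the syntax of any particular standard, and the flag sequence as well as the helper names are made up (the flag reader stands in for the entropy decoder).

```python
def parse_quadtree(x0, y0, size, min_size, read_split_flag, leaves):
    """Recursively parse the quadtree partition of a (size x size) block.

    x0, y0          : top-left position of the current block
    size            : current block size (a power of two, 2^n)
    min_size        : smallest supported block size (no flag is transmitted)
    read_split_flag : callable returning the next binary split decision
    leaves          : output list of (x0, y0, size) leaf blocks in coding order
    """
    if size > min_size and read_split_flag():
        half = size // 2
        # Depth-first traversal of the four subblocks = z-scan coding order.
        for dy in (0, half):
            for dx in (0, half):
                parse_quadtree(x0 + dx, y0 + dy, half, min_size,
                               read_split_flag, leaves)
    else:
        leaves.append((x0, y0, size))

# Example: decode the split flags of one 64x64 coding tree block.
flags = iter([1, 0, 1, 0, 0, 0, 0, 0, 0])   # made-up bitstream content
leaves = []
parse_quadtree(0, 0, 64, 8, lambda: next(flags), leaves)
# 'leaves' now lists the resulting blocks in their z-scan coding order.
```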

96 3.3. Hybrid Video Coding 91 For the selection of coding modes and motion parameters, sample blocks of all color components are often treated as one unit. We will use the terms coding tree unit, coding unit, prediction unit, and transform unit for describing the coding tree blocks, coding blocks, prediction blocks, and transform blocks, respectively, of all color components that cover the same area of a picture and share certain coding parameters. In older video coding standards, the pictures are partitioned into so-called macroblocks. A macroblock represents a block of luma samples and the associated blocks of chroma samples. Since coding modes are selected on a macroblock basis, a macroblock in these standards corresponds to both a coding tree unit and a coding unit. It should be noted that sample-accurate segmentations, e.g., with the goal of partitioning a picture into differently moving regions, have not resulted in noticeable coding efficiency improvements. The reasons are the increased bit rate required for transmitting the segmentation data and the difficulty to determine an efficient segmentation in the encoder. However, investigations with simple non-rectangular regions showed potential for further coding efficiency improvements. The approach presented in [137] supports the subdivision of square blocks by a straight line; the resulting regions are used for motion-compensated prediction. The authors of [99] report significant coding gains by using triangular instead of rectangular image regions for both prediction and transform coding. But even though the usage of non-rectangular image regions might be a topic worth to be further investigated, in this text, we consider only picture partitionings with rectangular blocks. Scanning Order. The coding decisions associated with the blocks of a picture have to be mapped into a 1D bitstream. For that purpose, the blocks need to be processed in a certain scanning order, which has to be known to both encoder and decoder. Since data of the causal past are often exploited for prediction or conditional entropy coding, the used scanning order impacts the compression efficiency. Fixed-size blocks, such as the coding tree units described above, are typically processed in raster-scan order, i.e., in a line-wise fashion from the upper-left to the lower-right block. Within a quadtree partitioning,

97 92 Video Coding Overview the subblocks are often traversed in a depth-first order. This coding order is also referred to as z-scan and is illustrated for the quadtree example in Figure 3.6. Both the raster scan and the z-scan order ensure that for each block, except those located at the top or left picture boundary, the blocks containing the samples above the current block and left to the current block have already been coded. As a consequence, the reconstructed neighboring samples can be used for intra-picture prediction and the coding parameters of the associated blocks can be exploited for predicting the coding parameters of the current block or selecting one of multiple codeword tables or probability models. Bitstream Syntax. Each video picture is eventually represented by a set of coding parameters (partitioning data, coding modes, motion parameters, transform coefficient levels, etc.). The coding parameters are losslessly compressed and the corresponding codewords are written to the bitstream. Since the decoder has to recover the transmitted coding parameters for reconstructing the video pictures, it has to know the transmission order as well as the used codeword tables or, in case of arithmetic coding, the used probability models. Often coding symbols are only transmitted if a previous symbol has a certain value; for example, motion data are only coded if the chosen coding mode specifies motion-compensated prediction. All rules for transmitting the coding symbols are specified in the bitstream syntax, which is known to encoder and decoder. It specifies the transmission order of coding parameters, conditions for their presence, and the entropy coding. The transmitted symbols are also referred to as syntax elements. The bitstream syntax is typically split into a high-level syntax and a low-level syntax. The high-level syntax describes parameters that apply to groups of pictures, individual pictures, or larger areas of a picture. High-level syntax elements do typically not depend on the samples of the video pictures to be coded. As an example, the high-level syntax includes syntax elements that specify the size of the video pictures, the coding order of pictures, the operation of the decoded picture buffer, or the set of enabled coding tools. In contrast, the low-level syntax comprises all parameters that describe the coding decisions on a block

level, such as coding modes, motion parameters, and transform coefficient levels. The low-level syntax elements basically represent the sample values of the reconstructed video signal in a different form.

Entropy coding is tightly coupled with the bitstream syntax. On an abstract level, the syntax elements form a sequence of coding symbols {s_0, s_1, s_2, ..., s_{N−1}}, where each symbol s_i ∈ A_i represents a letter of an M_i-ary alphabet A_i = {a^i_0, a^i_1, a^i_2, ..., a^i_{M_i−1}}, with M_i ≥ 2. The order of the coding symbols and the symbol alphabets are determined by the bitstream syntax. The entropy coding for each symbol can be adjusted to the associated alphabet and an assumed or estimated probability distribution. Typically, either conventional variable-length coding or arithmetic coding is used. For adapting the lossless compression to the present coding symbols, the codeword tables or probability models can be switched according to the syntax specification.

3.3.3 Interoperability and Video Coding Standards

Video coding is rarely used in closed systems, where a single manufacturer has control over all encoder and decoder implementations. Instead, video encoders and decoders are typically integrated in a wide range of application environments, and the interoperability between different products is of essential importance. It has to be ensured that bitstreams generated by an encoder of one manufacturer can be reliably decoded by products of other manufacturers. At the same time, it is also important to give the developers of encoder and decoder products as much freedom as possible for adapting their implementations to certain platform architectures and resource constraints. Video coding standards basically represent a specification of a widely accepted interface between encoder and decoder implementations. In order to enable interoperability and simultaneously provide a large degree of freedom for designing actual products, the scope of video coding standards is typically limited to the following two aspects [192]:

99 94 Video Coding Overview Bitstream syntax : Video coding standards specify the bitstream syntax, i.e., the format in which the coding symbols are transmitted. This includes the entropy coding as well as constraints for the bitstream (e.g., the maximum picture size, the maximum bit rate, and the maximum size of the decoded picture buffer). Bitstreams that follow the defined format and obey all specified constraints are called conforming bitstreams; Decoding result : In addition to the bitstream syntax, standards specify an example decoding process for conforming bitstreams. A decoder is said to be a conforming decoder if it generates the same reconstructed pictures 2 as the specified decoding process. All other elements of a video communication system are left out of the scope of video coding standards. There is actually only a little freedom for implementing conforming decoders. In contrast, the complete encoding process, which determines how input pictures are mapped to the transmitted coding symbols is not specified in the standard. A conforming encoder is only required to produce conforming bitstreams. There is no guarantee that a conforming encoder will provide a certain reconstruction quality. Instead, the coding efficiency of two encoders that conform to the same standard can be very different. Even though this text is not restricted to a discussion of video coding standards, we will take into account the interoperability aspect. We do not consider coding algorithms for which the decoder has to make any assumptions regarding the encoding process. For a fair and meaningful evaluation of different design aspects with respect to their impact on coding efficiency, we will apply a uniform and consistent approach for operating the corresponding encoders. The used encoder concept will be described in Section 4. It does not only offer the possibility of comparing different coding tools, but also provides a very good coding efficiency for a given set of coding tools. Video coding standards will be discussed in more detail in Section 7. 2 While the newer video coding standards H.264 MPEG-4 AVC [121] and H.265 MPEG-H HEVC [123] specify exact sample values for the reconstructed pictures, older standards such as H.262 MPEG-2 Video [122] specify, due to the usage of a floating-point transform, only an accuracy requirement for the inverse transform.

100 3.4. Coding Efficiency Coding Efficiency Before we investigate details of hybrid video coding, we have to discuss how we evaluate coding efficiency. Coding efficiency describes the ability of a video codec to trade-off bit rate and reconstruction quality. In most applications, we want to provide the best possible reconstruction quality for a given available bit rate. Hence, for comparing the coding efficiency of different video codecs, we have to measure the bit rate and the quality of the reconstructed video in a defined way. The average bit rate of a bitstream can be measured by simply counting the bits in the bitstream and dividing this number by the duration (in seconds) of the video sequence. For all comparisons in this text, we will use the average bit rate. It should, however, be mentioned that in real transmission or storage scenarios not only the average bit rate, but also the distribution of the bits among the pictures plays an important role. This aspect will be further discussed in Section 4.3, where we introduce the basic principles of rate control algorithms. In contrast to the average bit rate, the reconstruction quality of a video is not straightforward to evaluate. Since reconstructed video sequences are typically viewed by human observers, quality measures should ideally take into account the properties of human visual perception. But the characteristics of human perception are very complex and not fully understood. The most reliable way of determining the video quality is to perform subjective viewing tests, in which the video sequences are shown to human observers who rate their quality. In order to obtain reliable and reproducible results, the viewing tests should be performed in accordance with widely accepted guidelines and practices. As an example, the ITU-R Recommendation BT.500 [115] specifies properties of the testing environment (e.g., illumination conditions, screen sizes, and viewing distances) as well as testing methods. The outcome of subjective viewing tests are mean opinion score (MOS) values, which represent the average of the ratings for the group of test persons. In addition, the reliability of the subjective test is characterized by confidence intervals. An x% confidence interval specifies that, with a probability of x%, the true mean score lies inside the confidence interval. It is common practice [115] to use 95% confidence intervals.

Subjective viewing tests with a reasonably large group of human subjects provide reliable quality measures. But they are very expensive and time-consuming and can thus only be used for important quality evaluations. For most comparisons, we require objective distortion measures that can be calculated based on the sample values of the original and reconstructed video pictures. Furthermore, only objective distortion measures can be used for guiding encoder decisions.

A rather new research topic deals with the direct measurement of perceived video quality using electroencephalography (EEG) signals. Recent studies [219, 1] showed that abrupt quality changes in a video signal can be detected in the measured brain signal. The EEG-based measures were found to be significantly correlated with the outcome of subjective tests. Potentially, the direct assessment of neural signals could lead to more objective measurements of the perceived quality.

Distortion Measures. The development of objective distortion measures that approximate the perceptual quality is an active research field. Well-known examples for such measures are VQM [200] and SSIM [288]. Comprehensive assessments [54, 55] showed, however, that these measures are often not much more correlated with the results of subjective tests than the mean squared error. And since the mean squared error is simpler to calculate and has some desirable properties, such as mathematical tractability, it is often used in coding efficiency comparisons as well as encoder control algorithms. In this text, we will exclusively use the mean squared error (MSE) and the related peak signal-to-noise ratio (PSNR). The MSE for a sample array of width W and height H is given by

\mathrm{MSE} = \frac{1}{W H} \sum_{x=0}^{W-1} \sum_{y=0}^{H-1} \big( s'[x, y] - s[x, y] \big)^2,   (3.6)

where s[x, y] and s'[x, y] represent the original and reconstructed samples for the considered sample array. With B being the bit depth of the samples, the associated PSNR is calculated as

\mathrm{PSNR} = 10 \log_{10} \frac{(2^B - 1)^2}{\mathrm{MSE}}.   (3.7)
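The measures (3.6) and (3.7) translate directly into a few lines of code. The following helper is a straightforward sketch with the bit depth B as a parameter; the sequence-level PSNR used in the following is then simply the average of the per-picture values.

```python
import numpy as np

def mse(orig, rec):
    """Mean squared error (3.6) between original and reconstructed samples."""
    diff = rec.astype(np.float64) - orig.astype(np.float64)
    return np.mean(diff ** 2)

def psnr(orig, rec, bit_depth=8):
    """PSNR (3.7) for one sample array with bit depth B = bit_depth."""
    peak = (1 << bit_depth) - 1
    return 10.0 * np.log10(peak ** 2 / mse(orig, rec))

# Sequence-level PSNR: average of the per-picture PSNR values.
# psnr_sequence = np.mean([psnr(o, r) for o, r in zip(orig_pics, rec_pics)])
```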

For specifying the quality of video sequences, it is common practice to use the average of the PSNR values for the individual pictures. For a sequence of N pictures, the PSNR for a color component is given by

\mathrm{PSNR} = \frac{1}{N} \sum_{k=0}^{N-1} \mathrm{PSNR}_k,   (3.8)

where PSNR_k represents the PSNR value for the considered color component of the k-th picture. For video in YCbCr format, the chroma components typically require much less bit rate than the luma component for a comparable reconstruction quality. For that reason, the PSNR of the luma component is often used as quality measure. If all color components need to be considered, a weighted combination

\mathrm{PSNR}_{\mathrm{YCbCr}} = \frac{w_Y \cdot \mathrm{PSNR}_Y + \mathrm{PSNR}_{Cb} + \mathrm{PSNR}_{Cr}}{w_Y + 2}   (3.9)

is often used, where PSNR_Y represents the PSNR for the luma component and PSNR_Cb and PSNR_Cr are the PSNR values for the chroma components. When we use the combined measure PSNR_YCbCr, we set the weighting factor equal to w_Y = 6, as suggested in [192].

Rate-Distortion Curves. Since video sequences have widely varying properties and cannot be described by a single random model, the coding efficiency of two or more video codecs can only be compared by performing actual encodings for a number of test sequences. For a meaningful comparison, we require multiple rate-distortion points for each test sequence, which should be distributed over the bit rate or quality range of interest. In this text, the rate-distortion points for one test sequence are obtained by modifying the quantization parameter (or the base quantization parameter). The used quantization parameters are selected in a way that the encoder operation points range from encodings with very disturbing coding artifacts to encodings for which the reconstructed sequences are indistinguishable from the original. For visually comparing the coding efficiency of multiple video codecs, we will often plot the measured rate-distortion points for a representative test sequence into a rate-PSNR diagram.

Figure 3.7: Coding efficiency comparison for one test sequence: (a) Example rate-distortion curves for two video codecs; (b) The associated bit-rate saving plot (bit-rate saving of codec B vs. codec A, B_BA = (R_A − R_B)/R_A, illustrated for an example target quality of 39 dB).

Figure 3.7(a) shows an example for two video codecs, which are denoted as codec A and codec B. Each point represents the average bit rate and average PSNR for one bitstream. The curve that is obtained by connecting the measured rate-PSNR points is referred to as operational rate-distortion curve or simply as rate-distortion curve for the video codec. For the example in Figure 3.7(a), we can clearly see that codec B has a higher coding efficiency than codec A. It achieves a higher PSNR and, thus, a lower distortion than codec A at the same bit rate; and it requires a lower bit rate for the same reconstruction quality.

As a measure for the difference in coding efficiency, we will often use bit-rate savings for a given reconstruction quality. For that purpose, the rate-distortion curves for a test sequence are interpolated in the logarithmic rate domain³ using cubic splines with the not-a-knot condition at the border points. For a given target quality, the bit-rate saving B_BA of a codec B relative to a reference codec A is given by

B_{BA} = \frac{R_A - R_B}{R_A},   (3.10)

where R_A and R_B are the bit rates that are needed by codec A and codec B, respectively, for achieving the target quality. These bit rates are given by the interpolated rate-distortion curves. By plotting the bit-rate savings over the reconstruction quality, we obtain a bit-rate saving plot, as shown in Figure 3.7(b). In Figure 3.7, the calculation of the bit-rate saving is illustrated for a reconstruction quality of 39 dB.

³ An interpolation in the log R–PSNR domain often yields smoother curves.
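A minimal sketch of this computation under the stated approach: the measured rate-distortion points of each codec are interpolated in the logarithmic rate domain with not-a-knot cubic splines, and (3.10) is evaluated at a target quality; averaging the resulting saving over the quality range in which both curves overlap then yields the average bit-rate saving used in the following. The function names and the measurement points are made up for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def log_rate_spline(rates_kbps, psnrs_db):
    """Interpolate log(rate) over PSNR with a not-a-knot cubic spline."""
    order = np.argsort(psnrs_db)
    return CubicSpline(np.asarray(psnrs_db, dtype=float)[order],
                       np.log(np.asarray(rates_kbps, dtype=float)[order]),
                       bc_type='not-a-knot')

def bitrate_saving(codec_a, codec_b, target_psnr):
    """Bit-rate saving (3.10) of codec B relative to codec A at one quality."""
    r_a = np.exp(log_rate_spline(*codec_a)(target_psnr))
    r_b = np.exp(log_rate_spline(*codec_b)(target_psnr))
    return float((r_a - r_b) / r_a)

# Made-up measurement points: (average rates in kbit/s, average PSNR in dB).
codec_a = ([1000, 2000, 4000, 8000], [34.0, 37.0, 40.0, 43.0])
codec_b = ([800, 1500, 3000, 6000], [34.5, 37.5, 40.5, 43.5])
print(f"bit-rate saving at 39 dB: {100 * bitrate_saving(codec_a, codec_b, 39.0):.1f} %")
```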

104 3.5. Chapter Summary 99 For expressing the coding efficiency difference as a single number, we calculate the average bit-rate saving by integrating the bit-rate saving curve over the quality range for which the considered rate-distortion curves overlap each other. For the example in Figure 3.7, the average bit-rate saving of codec B relative to codec A is 40.8%. It should be noted that the described method for calculating average bit-rate savings represents a generalization, for rate-distortion curves with more than four points, of the often used Bjøntegaard delta rate [11]. When rate-distortion curves with four points are used, which is required for calculating the Bjøntegaard delta rate, both measures are the same. For characterizing coding efficiency differences over a set of test sequences, we will average the average bit-rate savings over the set of sequences. Differences in coding efficiency can also be represented as PSNR differences. Based on interpolated rate-distortion curves, we can plot the differences in PSNR over the bit rate and calculate corresponding averages. For areas in which it is common practice to evaluate the effectiveness of an approach using PSNR differences (e.g., quantization and transform coding), we will use these alternative measures. 3.5 Chapter Summary The most successful class of video coding designs follows the approach of block-based hybrid video coding. The basic source coding algorithm represents a hybrid of inter-picture prediction and transform coding of the prediction error signal. The video pictures are partitioned into blocks and each block is either intra-picture coded, without referring to other pictures in the video sequence, or it is inter-picture predicted. Inter-picture prediction is the key concept for utilizing the large amount of temporal dependencies found in video sequences. Since most changes between successive pictures are caused by the motion of objects relative to the camera, the prediction signal is typically formed by a displaced block of an already coded picture. The usage of displaced image regions in a reference picture is referred to as motion-compensated prediction. The spatial displacements are also called motion vectors. They have to be selected in an encoder and transmitted as part of the bitstream.

105 100 Video Coding Overview The encoders search for suitable motion vectors is referred to as motion estimation. Intra-picture coding of a sample block may also comprise a prediction. This type of prediction is called intra-picture prediction. The prediction signal is derived using reconstructed samples of already coded neighboring blocks inside the same picture. For both intra-picture and inter-picture coded blocks, the prediction error signal or, if no prediction is used, the original signal is transmitted using transform coding. The transform coefficients that are obtained by applying a 2D transform are quantized using scalar quantizers and the resulting quantization indexes, which are also called transform coefficient levels, are entropy coded together with side information such as coding modes, intra prediction modes, and motion parameters. Due to the non-stationary character of video signals, it is beneficial to use blocks with variable sizes for prediction and transform coding. A simple and effective approach for partitioning pictures into blocks of variable sizes is the hierarchical subdivision with quadtree structures. For reconstructing the video pictures at the decoder side, the encoder decisions have to be transmitted in the form of coding parameters, which include coding modes, motion vectors, and transform coefficient levels. The order in which these coding symbols are transmitted as well as the used entropy codes are specified by the bitstream syntax. The bitstream syntax has to be known to both encoder and decoder. For comparing the coding efficiency of different video codecs, we have to measure the bit rate and reconstruction quality for selected test sequences. The actual perceptual quality of a video can only be determined by performing expensive and time consuming subjective viewing tests. In this text, we measure the reconstruction quality using the mean squared error (MSE) and the related peak signal-to-noise ratio (PSNR). The coding efficiency of two or more codecs will be compared by running encodings for multiple operation points and plotting the obtained rate-distortion curves in a diagram. For quantifying the difference in coding efficiency, we will also calculate bit-rate savings or PSNR differences based on interpolated rate-distortion curves.

4 Video Encoder Control

The coding efficiency of a video codec is determined by two factors. On the one hand, it is limited by the set of coding tools and features that are provided by the bitstream syntax and the decoding process. For example, if inter-picture prediction is not supported, the dependencies between the pictures of a video sequence cannot be exploited. But on the other hand, the decision process in the encoder that selects coding modes, motion vectors, transform coefficient levels, and other coding parameters is of crucial importance. The chosen parameters determine both the bit rate of the resulting bitstream and the sample values of the reconstructed pictures. For a given syntax and decoding process, the encoding algorithm, which is also referred to as encoder control, determines the coding efficiency of a generated bitstream.

The main task of an encoder control is to determine the set of coding parameters, and thereby the bitstream, such that the best possible reconstruction quality is obtained while a given target bit rate is not exceeded and additional constraints, such as a maximum delay or a maximum buffer capacity, are not violated. The encoding algorithms that were used in the early years of video coding, as for example the Test Model 5 [113] for H.262 MPEG-2 Video [122], typically decide

107 102 Video Encoder Control between different coding options by comparing the signal energies of the associated prediction error signals. These approaches consider neither the distortion of the reconstructed samples nor the number of bits that are required for transmitting the coding parameters. In particular for video coding designs that support a large number of coding options, as for example the video coding standards H.264 MPEG-4 AVC [121] and H.265 MPEG-H HEVC [123], the usage of such an encoder control cannot exploit the compression capabilities of the bitstream syntax and decoding process. A good encoder control is not only important for the actual generation of bitstreams, but also for evaluating the effectiveness of coding tools and comparing different video coding designs. In this text, we use a unified approach for all encoder decisions. It is based on an optimization with Lagrange multipliers. Due to its conceptual simplicity and effectiveness, the Lagrangian optimization technique can be applied to all decision problems in video encoders and allows a fair comparison of different coding tools and designs. It is also used in the reference encoder models [168, 180] for H.264 MPEG-4 AVC [121] and H.265 MPEG-H HEVC [123]. In this section, we review the concept of the Lagrangian encoder control and describe its application in hybrid video encoders. The description concentrates on the achievable coding efficiency; other aspects such as real-time operation or error robustness are neglected. The described encoder control will be used for all coding efficiency comparisons in following sections. 4.1 Encoder Control using Lagrange Multipliers On an abstract level, the goal of an encoder control can be stated as follows. Given are a bitstream syntax, which specifies the data format for transmitting the encoder decisions, and a decoding process, which specifies how the reconstructed video sequence s v (b) is derived from a conforming 1 bitstream b. In the video encoder, we want to determine a conforming bitstream b for a given input video sequence s v such that the distortion D( s v, s v (b) ) between the input video sequence s v 1 Conforming bitstreams are bitstreams that are formatted using the given bitstream syntax and obey all associated constraints (see Section 3.3.3).

and its reconstruction s'_v(b) is minimized subject to certain constraints. The set of constraints usually includes a maximum bit rate, a maximum coding delay, and maximum buffer capacities. With B_c being the set of all conforming bitstreams that obey the considered constraints, the optimal bitstream b* for any distortion measure D(s_v, s'_v) is given by

b^* = \arg\min_{b \in B_c} D\big( s_v, s'_v(b) \big).   (4.1)

Due to the extremely large parameter space B_c and the required encoding delay, a direct application of the minimization in (4.1) is impossible. Instead, this overall optimization problem has to be split into a series of smaller optimization problems by partly neglecting interdependencies between coding parameters or coding decisions.

In the following, we restrict our considerations to the determination of low-level syntax elements, i.e., to block-level coding decisions. The selection of low-level coding parameters basically only impacts the distortion of the reconstructed video pictures and the number of bits required for transmitting the coding decisions. Most other constraints, such as, for example, requirements regarding the structural encoding-decoding delay or the minimum interval between random access points, can be fulfilled by choosing appropriate high-level syntax elements. Some high-level syntax elements actually specify important aspects such as the picture or slice coding types, the temporal prediction structure, or the enabled coding features and therefore have a significant impact on coding efficiency. Nonetheless, the selection of high-level parameters can be considered as part of the encoder configuration rather than the actual encoder control.

Lagrangian Optimization. We consider the coding of a set of input samples s, which could be a block of a picture, a complete picture, or a group of pictures. For coding the samples s, we have to select a set of low-level coding parameters p out of the set P of supported coding options². Any particular choice p ∈ P yields a certain reconstructed set of samples s'(p) and requires a certain number of bits R(p) for representing the coded samples inside the bitstream.

² Since the coding parameters p and the bitstream b are related via a reversible mapping (entropy coding), b = γ(p), they can be used interchangeably.

If we assume that we are given a bit budget R_B for coding the considered samples s, the objective of the encoder control is to select the coding parameters p that minimize a distortion measure D(p) = D(s, s'(p)) between the original samples s and their reconstructions s'(p) subject to the constraint that the required number of bits R(p) does not exceed the bit budget R_B,

\min_{p \in P} D(p) \quad \text{subject to} \quad R(p) \le R_B.   (4.2)

All solutions {p_opt} of the constrained optimization problem (4.2) that can be obtained by varying R_B represent optimal operation points for the given bitstream syntax and decoding process. Using the technique of Lagrange multipliers, the constrained optimization problem (4.2) can be transformed into an unconstrained optimization problem [62],

\min_{p \in P} \; D(p) + \lambda R(p),   (4.3)

where λ ≥ 0 is referred to as Lagrange multiplier. In general, we cannot find all solutions {p_opt} of the original optimization problem (4.2) by conducting the minimization (4.3) with varying values of λ. However, each solution p* of the unconstrained problem (4.3), for a particular value of λ, also represents a solution to the constrained optimization problem (4.2). This can be proved as follows [62]. Let p*_λ represent a solution of the Lagrangian optimization problem (4.3) for a particular value of λ, with λ ≥ 0. By definition, we have

\forall p \in P: \quad D(p^*_\lambda) + \lambda R(p^*_\lambda) \;\le\; D(p) + \lambda R(p) \quad\Longleftrightarrow\quad D(p) - D(p^*_\lambda) \;\ge\; \lambda \big( R(p^*_\lambda) - R(p) \big).   (4.4)

Since λ ≥ 0, the above inequality implies

\forall p \in P: \quad R(p) \le R(p^*_\lambda) \;\Longrightarrow\; D(p) \ge D(p^*_\lambda).   (4.5)

Hence, p*_λ is also a solution to the original constrained optimization problem (4.2) with R_B = R(p*_λ).

Figure 4.1(a) illustrates the optimization for a simple example. The depicted points shall represent the rate-distortion points (R(p), D(p)) for all possible choices of p ∈ P.

Figure 4.1: Lagrangian optimization for discrete sets: (a) Rate-distortion diagram showing the available R-D points, the solutions of the constrained optimization problem (4.2), and the solutions of the unconstrained optimization problem (4.3); (b) Unconstrained optimization (4.3) in the R-(D+λR) space.

The shown staircase function describes the minimal achievable distortion D as a function of the rate budget R_B. It is given by D(p_opt(R)), where p_opt(R) represents a solution of (4.2) for R_B = R. The solutions {p*} of the Lagrangian optimization problem (4.3) represent a subset of the solutions {p_opt} of the constrained problem (4.2), {p*} ⊆ {p_opt}. The associated rate-distortion points lie on the convex hull of the set of all available rate-distortion points. The points on the convex hull have the property that they minimize the distance to lines D = −λR for a certain range of slopes given by λ ≥ 0. Since the distance d of a rate-distortion point (R(p), D(p)) to a line D = −λR is given by³

d = \frac{ D(p) + \lambda R(p) }{ \sqrt{1 + \lambda^2} },   (4.6)

minimizing the distance d to a line D = −λR is equivalent to minimizing the Lagrange function D(p) + λR(p). As shown in Figure 4.1(b), the Lagrangian optimization approach can also be interpreted as a coordinate transform D → D + λR. By adding the term λR to all distortion values, the line D = −λR, with the same value of λ, becomes equal to the R-axis. Hence, the distance d to the line D = −λR can be minimized by selecting one of the rate-distortion points with the smallest ordinate D(p) + λR(p).

³ The distance between a point on the line D = −λR and a given point (R_p, D_p) is d(R) = \sqrt{ (R_p - R)^2 + (D_p + \lambda R)^2 }. Minimization with respect to R yields (4.6).
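The unconstrained selection (4.3) is easily illustrated in code: for a given λ, the option with the smallest Lagrangian cost D + λR is chosen, and sweeping λ from large to small values traces rate-distortion points on the convex hull from low to high rates. The operating points in this toy example are made up.

```python
def lagrangian_choice(options, lam):
    """Return the (rate, distortion) pair minimizing D + lam * R, cf. (4.3)."""
    return min(options, key=lambda rd: rd[1] + lam * rd[0])

# Hypothetical operating points (rate, distortion) for one set of samples.
points = [(1, 100), (2, 60), (3, 45), (5, 30), (8, 24), (12, 22)]

# Sweeping the Lagrange multiplier from large to small values selects
# rate-distortion points on the convex hull, from low to high rates.
for lam in (50.0, 10.0, 4.0, 1.0, 0.25):
    r, d = lagrangian_choice(points, lam)
    print(f"lambda = {lam:5.2f}  ->  R = {r:2d}, D = {d:3d}, J = {d + lam * r:6.1f}")
```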

Lagrangian Bit Allocation. For the initially considered problem of selecting a set of coding parameters p ∈ P for a given set of samples s, the optimization using Lagrange multipliers does not provide any benefit. For solving both the constrained problem (4.2) and the unconstrained problem (4.3), we have to check the distortion D(p) and the number of bits R(p) for all possible coding options p ∈ P. The key advantage of the Lagrangian approach lies in the allocation of restricted resources among multiple entities [62, 232]. In the considered encoding problem, the restricted resources are the bits required for transmitting the coding decisions and the entities are sets of samples.

Let us assume that the considered set of samples s can be partitioned into a number of subsets s_k in such a way that the associated coding decisions p_k ∈ P_k are independent of each other. As an example, the subsets s_k could represent blocks of a picture s, which are independently coded. We further assume that an additive distortion measure D_k(p_k) is used⁴. Using the technique of Lagrange multipliers, the problem of finding the optimal coding decisions p = {p_0, p_1, ...} for the complete set of samples s can then be formulated as

\min_{p_0 \in P_0,\; p_1 \in P_1,\; \ldots} \;\; \sum_k D_k(p_k) + \lambda \sum_k R_k(p_k).   (4.7)

It is rather obvious that the solution p* = {p*_0, p*_1, ...} of the above minimization can also be obtained by separate minimizations

\forall k: \quad \min_{p_k \in P_k} \; D_k(p_k) + \lambda R_k(p_k).   (4.8)

Note that such a splitting is not possible with the constrained problem formulation (4.2). Hence, the crucial advantage of the optimization with Lagrange multipliers is that an optimal solution for a collection of independent sets and an additive distortion measure can be obtained by independently solving Lagrangian optimization problems (with the same Lagrange multiplier λ) for the individual sets. The obtained solution implicitly yields an optimal bit allocation {R_0, R_1, ...}. A simple example for the bit allocation is illustrated in Figure 4.2.

⁴ The distortion measures that are typically used in video encoders are additive distortion measures.

Figure 4.2: Lagrangian bit allocation: (a) Selectable rate-distortion points for five independent subsets; (b) Rate-distortion points for all combinations; the points on the convex hull can be obtained by separate optimizations for the individual subsets.

We consider a set of samples that can be partitioned into five independent subsets and assume an additive distortion measure. For each of the subsets, we can select among six operating points; the associated rate-distortion points are shown in Figure 4.2(a). If we consider the entire set of samples, there are 6^5 = 7776 coding options. The rate-distortion points for all possible combinations are shown in Figure 4.2(b). By conducting the separate optimization with a constant value of λ, we find an optimal solution on the convex hull. It should be noted that the separate Lagrangian optimization requires only 30 comparisons, while a constrained optimization would require the evaluation of all 7776 combinations. For the optimization problems found in real video encoders, the effect is even much more drastic.
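The example of Figure 4.2 can be mimicked in a few lines: with a fixed λ, each of the five subsets is optimized independently according to (4.8), which requires only 5 · 6 = 30 cost evaluations, and the resulting total Lagrangian cost equals that of an exhaustive search over all 6^5 = 7776 combinations. The rate-distortion points are generated randomly for illustration.

```python
import itertools, random

random.seed(1)
# Five independent subsets, each with six made-up (rate, distortion) options.
subsets = [[(r, random.randint(5, 40) + 200 // (r + 1)) for r in (1, 2, 4, 8, 16, 32)]
           for _ in range(5)]
lam = 2.0

# Separate Lagrangian optimization (4.8): only 5 * 6 = 30 cost evaluations.
separate = [min(opts, key=lambda rd: rd[1] + lam * rd[0]) for opts in subsets]

# An exhaustive search over all 6**5 = 7776 combinations yields the same
# minimal total Lagrangian cost (and thus an equivalent bit allocation).
exhaustive = min(itertools.product(*subsets),
                 key=lambda combo: sum(d + lam * r for r, d in combo))
assert sum(d + lam * r for r, d in exhaustive) == sum(d + lam * r for r, d in separate)

print("allocated rates:", [r for r, _ in separate],
      " total rate:", sum(r for r, _ in separate),
      " total distortion:", sum(d for _, d in separate))
```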

In a video encoder, the decisions for the individual sample blocks are not independent of each other. Due to the application of predictive coding techniques, such as intra-picture prediction, motion-compensated prediction, and motion vector prediction, as well as backward-adaptive entropy coding techniques, the decisions taken for a block impact the decisions for following blocks. We can still apply the discussed concept of Lagrangian bit allocation, but we have to partly neglect the interdependencies between coding decisions. In order to keep the algorithm as simple as in the case of independent blocks, the coding parameters p_k for a block of samples s_k can be determined according to

\min_{p_k \in P_k} \; D_k(p_k \mid p_{k-1}, p_{k-2}, \ldots) + \lambda\, R_k(p_k \mid p_{k-1}, p_{k-2}, \ldots).   (4.9)

While past decisions {p_{k−1}, p_{k−2}, ...} are taken into account by calculating the distortion and rate terms based on decisions for already coded blocks, the impact of a decision p_k on the coding of following blocks is ignored. This approach does not yield a global optimum for a picture or a group of pictures, but it is simple, and various coding experiments have verified that it provides a high coding efficiency.

4.2 Lagrangian Optimization in Hybrid Video Encoders

First approaches for applying Lagrangian bit allocation techniques in video encoders were proposed in [322, 250]. Nowadays, the concept of Lagrangian optimization forms the basis of modern encoding algorithms [254, 194, 302, 192]. Since the coding parameter vector p_k ∈ P_k for a sample block s_k often includes data such as motion vectors and transform coefficient levels, which can take a large number of values, the parameter space P_k for a block is still too large to evaluate all possible combinations according to (4.9). In order to obtain a practical encoding algorithm, we have to further neglect some dependencies between the coding parameters for a block, such as dependencies between motion vectors and transform coefficient levels. For that reason, the encoding of a block is typically split into the sub-problems of mode decision, motion estimation, and quantization. The concept of Lagrangian optimization can be applied to each of these decision steps. But for keeping the complexity within reasonable limits, different methods of approximation and simplification are required.

For measuring the distortion D_k of a block of samples s_k, we consider additive distortion measures of the form

D_k = \sum_{s_i \in s_k} \big| s'_i - s_i \big|^{\beta},   (4.10)

where s_i and s'_i represent original and reconstructed samples (or transform coefficients), respectively, of the considered block. For β = 1, the distortion D_k represents the sum of absolute differences (SAD), and for β = 2, the sum of squared differences (SSD). Except for motion estimation, we will generally use the SSD for all coding decisions in our experiments, so that the encoder is optimized with respect to the mean squared error (MSE).

It should, however, be noted that basically any other additive distortion measure could be used.

4.2.1 Mode Decision

The bitstream syntax of video codecs provides different coding modes for certain blocks of samples, such as macroblocks or coding units. The considered set of coding modes C_k for a block s_k may contain simply an intra and an inter coding mode, i.e., C_k = {Intra, Inter}, or it may consist of an intra coding mode and multiple inter coding modes with different partitionings for motion-compensated prediction. The coding modes could also represent different intra prediction modes, block subdivisions, or any other type of coding modes. Each potential coding mode c ∈ C_k is associated with additional parameters such as motion vectors, transform coefficient levels, or coding parameters for subblocks. These parameters are determined before the actual mode decision is carried out. By determining the associated parameters, we assign a coding parameter vector p_k(c) to each potential coding mode c ∈ C_k and thereby select a subset P_{C_k} = { p_k(c) : c ∈ C_k } of the parameter space P_k for the considered block s_k. The resulting subset P_{C_k} is small enough for evaluating all included parameter vectors according to (4.9). Consequently, the coding mode c*_k and, thus, the coding parameters p*_k = p_k(c*_k) for a given block s_k are determined according to

c_k^* = \arg\min_{c \in C_k} \; D_k\big(c \mid p_k(c), p_{k-1}, \ldots\big) + \lambda\, R_k\big(c \mid p_k(c), p_{k-1}, \ldots\big).   (4.11)

The distortion term D_k(c | ·) represents the SSD between the original block s_k and the reconstruction s'_k that is obtained by choosing the coding parameters p_k(c). Depending on the considered decision problem and the actual implementation, the distortion represents the SSD either for the luma samples only or for all color components. The rate term R_k(c | ·) specifies the number of bits (or an estimate thereof) that are required for transmitting the coding parameters p_k(c) using the given bitstream syntax. It includes the bits for the coding mode(s), the associated side information (e.g., motion vectors or intra prediction modes), and the transform coefficient levels.

It should be noted that the decisions {p_{k−1}, p_{k−2}, ...} for already coded blocks are generally taken into account when calculating the distortion and rate terms. This is ensured by using the correct reconstructed samples for spatial intra prediction and motion-compensated prediction and by choosing the correct predictors for syntax elements such as motion vectors.

The described mode decision concept was proposed in [299, 300] for the macroblock mode selection in H.263. In encoders for modern video coding standards, as for example the reference encoding methods for H.264 MPEG-4 AVC [168] and H.265 MPEG-H HEVC [180], the approach is typically used for the following coding decisions:

Decision between intra and inter coding modes;
Decision whether a block is subdivided into smaller blocks;
Determination of intra prediction modes;
Selection of transform sizes or subdivisions for transform coding.

At this point, we want to point out that the mode decision concept is well suited for determining tree-based partitionings (see Section 3.3.2) for sample blocks. For each internal node of a tree structure, we have to decide whether or not the associated block is split into subblocks. It is important that we determine the partitioning of the subblocks in advance. By processing the tree structure in a depth-first order, it is ensured that we evaluate each potential block only once and still use the correct predictors for samples and coding parameters.

For demonstrating the effectiveness of the Lagrangian approach, we consider the macroblock mode decision in H.262 MPEG-2 Video [122] as a simple example. The described mode decision is compared to the reference encoding method specified in Test Model 5 (TM5) [113], which selects the coding mode by comparing the variance of the original signal with the MSE between the original signal and the motion-compensated prediction signal. We use a simple IPPP coding structure, where the first picture is coded as an intra picture and all following pictures are coded as P pictures. Only the macroblock mode decision for P pictures is modified; both tested encoders use the TM5 algorithms for motion estimation and quantization. For each macroblock of a P picture, we consider four coding modes, C_k = {Intra, Inter, NoCoeff, ZeroMv}.

Figure 4.3: Lagrangian optimization for the macroblock mode decision in H.262 MPEG-2 Video: (a) Rate-distortion curves for the sequence Kimono; (b) Bit-rate savings of the Lagrangian approach (4.11) relative to Test Model 5 [113] for the entertainment-quality test sequences (average savings: Cactus 13.0%, BQTerrace 5.6%, BasketballDrive 5.9%, Kimono 8.6%, ParkScene 9.2%).

While the first two modes are the conventional intra and inter coding modes, the latter two represent variants of the inter coding mode, which are included because the bitstream syntax provides special features for signaling these modes. For the first of these modes (NoCoeff), a motion vector is transmitted, but all transform coefficient levels are inferred to be equal to zero. The other mode (ZeroMv) specifies that a non-zero residual signal is transmitted, but both motion vector components are inferred to be equal to zero. For each encoding run, the quantization parameter (QP) was held constant for all macroblocks. Different operating points are obtained by varying the QP over the entire supported range. As will be further discussed in Section 4.2.4, the Lagrange multiplier was set equal to λ = 0.6·QP².

We ran simulations using the first 100 pictures of the test sequences listed in Appendix A.1. Figure 4.3(a) compares the operational rate-distortion curves for the sequence Kimono; the bit-rate savings for all five test sequences are shown in Figure 4.3(b). The Lagrangian mode decision improves the coding efficiency for all test sequences; on average, 8.5% of the bit rate is saved. Since the described mode decision requires a complete transform coding for each of the tested modes (except for modes that imply a zero residual signal), the encoding complexity is increased in comparison to the TM5 approach. But as long as the set of considered modes is reasonably small, it nonetheless provides a suitable trade-off between complexity and coding efficiency.
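A schematic version of the mode decision (4.11), assuming that each candidate mode has already been fully coded so that its reconstruction and bit count are available: the SSD distortion is computed for every candidate and the mode with the smallest Lagrangian cost is selected. The candidate reconstructions and bit counts below are made up; in a real encoder they result from actually coding the block in each mode.

```python
import numpy as np

def mode_decision(s_orig, candidates, lam):
    """Select the coding mode minimizing D + lambda * R, cf. (4.11).

    candidates : dict mapping a mode name to (s_rec, rate_bits), where s_rec is
                 the reconstruction obtained by coding the block in that mode
                 and rate_bits is the associated number of bits.
    """
    def cost(item):
        s_rec, rate_bits = item[1]
        ssd = np.sum((s_orig.astype(np.float64) - s_rec) ** 2)   # distortion D_k
        return ssd + lam * rate_bits                             # Lagrangian cost
    return min(candidates.items(), key=cost)[0]

# Made-up candidates for one macroblock (reconstruction, bit count).
rng = np.random.default_rng(0)
block = rng.integers(0, 256, (16, 16)).astype(np.float64)
candidates = {
    'Intra':   (block + rng.normal(0.0, 3.0, block.shape), 260),
    'Inter':   (block + rng.normal(0.0, 2.0, block.shape), 140),
    'NoCoeff': (block + rng.normal(0.0, 6.0, block.shape),  12),
    'ZeroMv':  (block + rng.normal(0.0, 4.0, block.shape),  70),
}
qp = 8
print(mode_decision(block, candidates, lam=0.6 * qp ** 2))
```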

An efficient mode decision approach is even more important for newer video coding standards, which support a much larger number of coding modes. In fact, both H.264 MPEG-4 AVC [121] and H.265 MPEG-H HEVC [123] have been designed by anticipating a Lagrangian mode decision. Without such an approach, the large coding efficiency improvements relative to older standards could not be achieved.

4.2.2 Motion Estimation

Motion estimation refers to the encoder's search for a suitable motion vector or, equivalently, a suitable reference block in an already coded picture. The bitstream syntax typically supports a very large range of motion vectors. For the following investigations, we will use a search range of [−32; 32] × [−32; 32] luma samples around the motion vector predictor, which is sufficient for our test sequences.
As reference method for our experiments, we use the motion estimation algorithm specified in TM5, which selects the motion vector that minimizes the SAD distortion between the original luma signal of the considered block and its prediction signal, i.e., the displaced block in the reconstructed reference picture. H.262 MPEG-2 Video uses a motion vector accuracy of half the distance between two luma samples, which is commonly referred to as half-sample precision. Since the prediction signal for sub-sample positions is generated using interpolation filters, the calculation of the SAD for sub-sample positions is significantly more complex than for positions on the integer grid. As a consequence, the TM5 algorithm proceeds in two steps: First, all integer-sample positions (or vectors) inside the search range are tested. Then, the eight half-sample positions around the best integer-sample position are additionally evaluated.
Cost Measure. In principle, the motion vectors can be interpreted as coding modes and we could apply a minimization similar to (4.11) over all vectors m inside a search range M. Such an approach would indeed provide the best possible results for a given quantization algorithm (and the considered independent block processing). But while the execution of the complete transform coding is feasible for a few coding modes, it is quite complex for thousands of motion vector candidates.

A first approach for reducing the complexity is to neglect the quantization of the prediction error coding during motion estimation and instead implicitly assume that the transmitted prediction error signal is equal to zero. With M_k denoting the search range for a given block s_k, the motion vector m_k is then chosen according to

    m_k = argmin_{m ∈ M_k} [ D_k(m) + λ_M · R_k(m) ],    (4.12)

where D_k(m) represents the distortion between the original signal and the prediction signal for the motion vector m and R_k(m) denotes the number of bits required for transmitting the motion vector m.
If we consider (4.12) as a simplification of (4.11), we would use the SSD as distortion measure and set the Lagrange multiplier λ_M equal to the Lagrange multiplier λ that is used for mode decision. Since the computation of the SAD can typically be more efficiently implemented, it is often used instead of the SSD in real encoder implementations. Another alternative is to calculate the SAD in the transform domain, which is more complex, but better approximates the rate-distortion cost of the actual prediction error coding. For that purpose, usually a separable Hadamard transform is employed [168, 180]. It has similar properties as the DCT, which is used for the actual prediction error coding, but can be more efficiently implemented. If the distortion is measured using the SAD in the sample or Hadamard domain, a different Lagrange multiplier λ_M has to be used. We found that the choice λ_M = √λ suggested in [295, 254] provides stable and consistent results.
Figure 4.4(a) analyzes the effect of the discussed cost measures for a representative test sequence. As in the previous experiment, we use H.262 MPEG-2 Video and an IPPP coding structure. All encoders use the mode decision discussed in Section 4.2.1, the TM5 quantization algorithm, and an exhaustive search over all sub-sample positions inside a [−32; 32] × [−32; 32] window around the motion vector predictor. The quantization parameter QP was fixed for each encoder run and the Lagrange multipliers were selected as discussed above (λ = 0.6·QP², λ_M = λ for SSD, and λ_M = √λ for SAD). The direct application of the mode decision concept (4.11), which takes into account the transform coding of the prediction error, provides the highest bit-rate savings

relative to the TM5 reference. But as noted above, it is much too complex for real implementations. For the simplified motion search (4.12), both the SSD and SAD distortion measures provide nearly the same coding efficiency. By using the SAD in the Hadamard domain, however, we can significantly increase the coding gain, in particular for the high-quality range, which verifies that this measure is a reasonable approximation of the rate-distortion cost of the actual prediction error coding. A Lagrangian cost term composed of the Hadamard SAD of the prediction error signal and the number of bits for transmitting the side information could also be used for reducing the complexity of the mode decision approach; this is, for example, suggested in the low-complexity mode of the reference encoding method [168] for H.264 MPEG-4 AVC.

[Figure 4.4: Comparison of motion estimation strategies for the sequence Kimono: (a) Exhaustive searches with different cost measures (full transform coding: avg. 17.8%; Hadamard SAD: avg. 12.2%; SAD: avg. 6.6%; SSD: avg. 6.5%); (b) Different search strategies (Hadamard SAD: avg. 12.2%; sub-sample refinement: avg. 10.3%; SAD + HSAD: avg. 9.3%; fast SAD + HSAD: avg. 9.1%; SAD + SAD: avg. 5.3%). Both diagrams show the bit-rate savings relative to the TM5 motion estimation.]

Search Strategy. In addition to the discussed simplification of the cost calculation, the number of tested motion vector candidates has to be reduced. As noted above, the evaluation of sub-sample vectors is particularly complex, since it requires an interpolation of the reference picture signal. For that reason, it is common practice to split the motion estimation into a search for the best vector with integer-sample precision and a successive sub-sample refinement. When half-sample precision vectors are used, the eight half-sample vectors that surround the best integer vector are typically evaluated in the refinement step (as in the TM5 algorithm). If the considered bitstream syntax sup-

120 4.2. Lagrangian Optimization in Hybrid Video Encoders 115 ports motion vectors with a higher precision, the refinement is usually split into multiple stages. For quarter-sample precision, the half-sample refinement is followed by an evaluation of the eight quarter-sample precision vectors around the best half-sample precision vector. For most use cases, we also have to apply fast search strategies for the integer-sample search. Well-known examples of fast search algorithms are the three-step search [152, 166], the logarithmic search [129], the conjugate directional search [238], and the enhanced predictive zonal search [269]. In our experiments, we use the fast search strategy implemented in the reference encoder [126, 180] for H.265 MPEG-H HEVC, which represents a combination of different techniques. Due to the computational complexity of the Hadamard transform, the Hadamard SAD is typically only used for the sub-sample refinement, whereas the integer-sample search is done with the normal SAD. The impact of the discussed encoding features on coding efficiency was analyzed for H.262 MPEG-2 Video. We used the same framework as for the previous experiment and only modified the applied search strategy. The bit-rate savings of the tested configurations relative to the TM5 approach are shown in Figure 4.4(b) for a selected test sequence. The exhaustive search over all integer- and half-sample vectors with the Hadamard SAD was chosen as starting point. The corresponding curve in Figure 4.4(b) is the same as in Figure 4.4(a). If we reduce the number of tested vectors by combining an exhaustive search for integer-sample vectors and a half-sample refinement, the average bit-rate saving is reduced from 12.2% to 10.3%. By additionally replacing the Hadamard SAD with the simple SAD for the integer search ( SAD + HSAD ), the coding efficiency is further reduced by about 1%. The usage of a fast search strategy for the integer search ( Fast SAD + HSAD ) reduces the bit-rate saving from 9.3% to 9.1%. We consider the combination of the fast SAD-based integer search and the sub-sample refinement with the Hadamard SAD as suitable compromise between complexity and coding efficiency and will use it for all following coding experiments. The diagram in Figure 4.4(b) additionally includes the result for an encoder configuration (labeled as SAD + SAD ), in which we used an exhaustive integer search and a sub-sample refinement, both with the

SAD as distortion measure. Note that this configuration is very similar to the TM5 approach. The only difference is that the TM5 directly uses the SAD as cost measure, while the modified version additionally takes into account the motion vector bits. This modification provides an average bit-rate saving of 5.3% for the selected sequence. Since a consideration of the motion vector bits favors small variations between the motion vectors of neighboring blocks (motion vectors are typically coded using a prediction from neighboring blocks), the Lagrangian approach leads to smoother motion vector fields. The impact of the rate term increases with smaller blocks and larger values of λ_M.
The usage of Lagrangian techniques for improving the motion estimation in hybrid video encoders was first proposed by Sullivan and Baker [250]. Various extensions of the basic concept have been investigated [41, 154, 29, 223, 30]. Today, some variant of rate-constrained motion estimation is used in all modern video encoders.
Reference Picture Selection. If the bitstream syntax supports multiple reference pictures, the encoder has to select the reference picture in addition to the motion vector. The reference picture is specified by a reference index r ∈ R into a list of available reference pictures. Since the set R of selectable reference indexes is usually small, the determination of the reference index r can be integrated into the motion estimation without requiring any further simplifications. In typical implementations, the encoder first determines a motion vector m_r for each available reference picture. Then, given the determined motion vectors {m_r : r ∈ R}, the reference index r is chosen according to

    r = argmin_{r ∈ R} [ D_k(r, m_r) + λ_M · R_k(r, m_r) ],    (4.13)

where D_k(r, m_r) and R_k(r, m_r) represent the distortion and rate terms, respectively, that are associated with choosing the reference index r and the motion vector m_r. The rate term R_k(r, m_r) includes the bits for both the reference index r and the motion vector m_r. In order to avoid any additional distortion calculation, the same distortion measure as for the sub-sample motion vector refinement should be used.
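The following Python sketch combines the simplified motion search (4.12) with the reference picture selection (4.13). It is an illustrative outline only: the helper callables sad, mv_bits, and ref_bits are assumed placeholders for the actual distortion computation, motion vector coding, and reference index coding of a real codec, which would operate on picture buffers rather than abstract objects.

# Sketch of rate-constrained motion estimation with reference picture
# selection, following (4.12) and (4.13).

def motion_search(block, ref_pic, predictor, lam_m, sad, mv_bits, search=32):
    """Integer-sample search over [-search, search]^2 around the predictor."""
    best_mv, best_cost = None, float('inf')
    px, py = predictor
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            mv = (px + dx, py + dy)
            cost = sad(block, ref_pic, mv) + lam_m * mv_bits(mv, predictor)
            if cost < best_cost:
                best_mv, best_cost = mv, cost
    return best_mv, best_cost

def select_reference(block, ref_pics, predictor, lam_m, sad, mv_bits, ref_bits):
    """Choose the reference index according to (4.13): first find the best
    motion vector per reference picture, then compare D + lambda_M * R,
    where R covers the bits for both the reference index and the vector."""
    best = None
    for r, ref_pic in enumerate(ref_pics):
        mv, _ = motion_search(block, ref_pic, predictor, lam_m, sad, mv_bits)
        cost = (sad(block, ref_pic, mv)
                + lam_m * (ref_bits(r) + mv_bits(mv, predictor)))
        if best is None or cost < best[0]:
            best = (cost, r, mv)
    return best[1], best[2]

In practice the exhaustive double loop of motion_search would be replaced by a fast search strategy and a sub-sample refinement, as described above.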

Multi-Hypothesis Prediction. Video codecs often support inter-picture coding modes with multi-hypothesis prediction, in which the prediction signal is formed by a weighted average of multiple displaced reference blocks. For these coding modes, two or more sets of motion parameters, which are also referred to as motion hypotheses, have to be selected. The simplest approach is to determine the reference index and motion vector for each hypothesis independently. Note, however, that independently selected motion parameters are not the best choice for the combined prediction signal. In order to improve coding efficiency without performing a minimization in the product space, an iterative refinement approach [65] is often used, which will be discussed in more detail in Section 6.3.2. The number of motion hypotheses used can also be selected using a Lagrangian approach, either with or without considering the transform coding of the residual signal.

4.2.3 Quantization

Quantization is the process of determining quantization indexes that approximate a given signal u. The N = A·B quantization indexes of an A×B transform block, which are also referred to as transform coefficient levels, are scanned in an order defined by the syntax, for example, a zig-zag scan order, and the resulting sequence q = (q_0, q_1, …, q_{N−1}) is entropy coded. The reconstructed samples u′ are obtained by first mapping the integer values q onto reconstructed transform coefficients t′ = (t′_0, t′_1, …, t′_{N−1}) and then applying the inverse transform B,

    u′ = B·t′   with   t′_k = t′_k(q_k).    (4.14)

Note that, for error-free transmission scenarios, the distortion of the final reconstruction signal s′ = ŝ + u′ is equal to the distortion of the reconstructed residual u′, since

    s′ − s = (ŝ + u′) − (ŝ + u) = u′ − u.    (4.15)

Using the concept of Lagrangian optimization, the vector of transform coefficient levels q has to be selected according to

    q = argmin_{q ∈ Q^N} [ D(q) + λ·R(q) ],    (4.16)

where Q^N denotes the vector space of the N transform coefficient levels, D(q) represents the SSD distortion of the reconstructed block u′ that is obtained for the choice q, and R(q) represents the number of bits that are required for transmitting the transform coefficient levels q. It is infeasible to conduct the minimization in (4.16) over the entire product space Q^N. But since most vectors q ∈ Q^N can be excluded by simple considerations, it is possible to design feasible decision algorithms without neglecting important dependencies inside a transform block.
The inverse transforms B that are typically used in video coding have orthogonal or at least nearly orthogonal basis functions. Hence, the SSD distortion D(q) can be represented as a weighted sum of the squared differences (t_k − t′_k)² for the individual transform coefficients. The original transform coefficients t_k are given by applying the forward transform A = B⁻¹ to the input signal u,

    t = A·u = B⁻¹·u.    (4.17)

The newer video coding standards H.264 MPEG-4 AVC [121] and H.265 MPEG-H HEVC [123] use uniform reconstruction quantizers, which are characterized by the inverse quantizer mapping

    t′_k = Δ_k·q_k.    (4.18)

The quantization step sizes Δ_k are determined by the selected quantization parameter QP and the scaling matrix. Even though the concepts discussed in the following can be straightforwardly extended to other quantizer designs, we restrict our considerations to uniform reconstruction quantizers. When we neglect rounding effects, the distortion of a transform block is then given by

    D(q) = Σ_{k=0}^{N−1} D_k(q_k) = Σ_{k=0}^{N−1} α_k·(t_k − Δ_k·q_k)².    (4.19)

The weighting factors α_k are determined by the ℓ2-norms of the transform basis functions. For given quantization step sizes Δ_k, the distortion is minimized if all values t_k/Δ_k are rounded to the nearest integers, i.e., if the transform coefficient levels are selected according to

    q_k = sgn(t_k)·⌊ |t_k|/Δ_k + 1/2 ⌋,    (4.20)

where the operator ⌊·⌋ specifies the largest integer value less than or equal to its argument and sgn(·) represents the signum function.
Since the bitstream syntax generally exploits dependencies between the transform coefficient levels inside a transform block, the evaluation of the rate term R(q) requires a joint consideration of multiple transform coefficients. For designing a feasible algorithm, we can, however, substantially reduce the number of considered candidates for the transform coefficient levels. As a first aspect, it is reasonable to assume that each reconstruction vector t′ = (t′_0, t′_1, …, t′_{N−1}) of the resulting vector quantizer lies inside the associated quantization cell. Hence, for a given coefficient t_k, we only need to consider the two nearest integer values of t_k/Δ_k as candidate levels. Additionally, we can assume that levels with an absolute value |q_k| never require more bits than the less probable levels with an absolute value of |q_k| + 1. As a consequence, for each level q_k, we have to consider only the two candidates

    q_{k,0} = sgn(t_k)·⌊ |t_k|/Δ_k ⌋   and   q_{k,1} = sgn(t_k)·⌊ |t_k|/Δ_k + 1/2 ⌋,    (4.21)

which may also have the same value.
Under certain circumstances, it could be advantageous to consider additional candidates with absolute values smaller than |q_{k,0}|. However, if these candidates are chosen rather often, it means that we have unfavorably designed quantizers for which some reconstruction levels lie outside the associated quantization intervals. The reason is typically that the selected Lagrange multiplier λ is too large for the used quantization step size. Our experiments showed that, if we select a suitable Lagrange multiplier λ, a consideration of additional candidates does not provide any noticeable coding gains.
With the reduced set (4.21) of candidate levels, it is often possible to perform the minimization of the Lagrangian cost (4.16) without any further (or only very small) assumptions. The actual algorithm highly depends on the used bitstream syntax. Quantization methods that take into account the number of bits required for transmitting the transform coefficient levels are often referred to as soft decision quantization or rate-distortion optimized quantization (RDOQ). A first algorithm was proposed in [207] for H.262 MPEG-2 Video. The concept was later also adapted to the bitstream syntax of H.263 [294, 293] and H.264

125 120 Video Encoder Control MPEG-4 AVC [326, 144, 327, 139]. An RDOQ approach is also used in the reference encoder [180, 126] for H.265 MPEG-H HEVC. In the following, we sketch the design of RDOQ algorithms for two example entropy coding schemes, the run-level coding used in H.262 MPEG-2 Video and the H.265 MPEG-H HEVC entropy coding. The entropy coding techniques will be described in Section in more detail. Run-Level Coding. In run-level coding, the scanned sequence of transform coefficient levels q = (q 0, q 1,, q N 1 ) is converted into a sequence of run-level pairs and each run-level pair is represented by a variable-length codeword, which is specified in a codeword table. The run specifies the number of successive transform coefficient levels equal to zero that precede the next non-zero transform coefficient level and the level is the value of this non-zero transform coefficient level. The codeword table also includes an end-of-block symbol (eob), which is required for representing transform coefficient levels equal to zero at the end of the scan. The end-of-block symbol has to be transmitted after the last run-level pair; it specifies that all remaining levels are equal to zero. This code is a practical example of a V2V code [301]. For run-level coding, a feasible RDOQ algorithm can be constructed without any further approximations. We can simply process the transform coefficient levels in coding order using a trellis-based approach. The distortion terms that are associated with the candidate level q k,i can be calculated independently of each other. For all potential subsequences q k = (q 0,, q k ) with q k 0, we can also compute the rate terms by adding up the codeword lengths for the run-level pairs. Hence, we can compare the Lagrangian costs D + λ R that are associated with these subsequences and need to keep only the subsequence q k, with q k 0, that minimizes the Lagrangian cost. In addition, we have to keep up to k+1 subsequences q k with q k = 0, each with a different number of zeros at the end. For these subsequences, we cannot determine the rate terms without knowing the following levels. When reaching the last transform coefficient, we can determine the Lagrangian costs D + λ R for all remaining potential sequences q and finally select the one with the smallest cost. The algorithm is illustrated in Table 4.1 for

a simple example. Note that, in comparison to a simple rounding, which would yield the levels q = (4, 1, 1, 1, 0, 1), the distortion is increased from 53 to 73, but the Lagrangian cost is reduced from 283 to 263.

Table 4.1: RDOQ example for run-level coding with Δ_k = 10, α_k = 1, λ = 10, and the six transform coefficients t = (36, 8, 12, 7, 2, 6).

 t_k | q_k,i | (q_0, …, q_k)      | distortion D | number of bits R       | D + λ·R
 36  |  3    | {3}                | 6² = 36      | R(0,3) = 6             |  96  → discard
     |  4    | {4}                | 4² = 16      | R(0,4) = 8             |  96
  8  |  0    | {4, 0}             | 16+64 =  80  | ?                      |   ?  [incomplete]
     |  1    | {4, 1}             | 16+ 4 =  20  | R(0,4)+R(0,1) = 11     | 130
 12  |  1    | {4, 0, 1}          | 80+ 4 =  84  | R(0,4)+R(1,1) = 12     | 204  → discard
     |       | {4, 1, 1}          | 20+ 4 =  24  | 11+R(0,1) = 14         | 164
  7  |  0    | {4, 1, 1, 0}       | 24+49 =  73  | ?                      |   ?  [incomplete]
     |  1    | {4, 1, 1, 1}       | 24+ 9 =  33  | 14+R(0,1) = 17         | 203
  2  |  0    | {4, 1, 1, 0, 0}    | 73+ 4 =  77  | ?                      |   ?  [incomplete]
     |       | {4, 1, 1, 1, 0}    | 33+ 4 =  37  | ?                      |   ?  [incomplete]
  6  |  0    | {4, 1, 1, 0, 0, 0} | 77+36 = 113  | 14+R(eob) = 16         | 273
     |       | {4, 1, 1, 1, 0, 0} | 37+36 =  73  | 17+R(eob) = 19         | 263  → choose
     |  1    | {4, 1, 1, 0, 0, 1} | 77+16 =  93  | 14+R(2,1)+R(eob) = 21  | 303
     |       | {4, 1, 1, 1, 0, 1} | 37+16 =  53  | 17+R(1,1)+R(eob) = 23  | 283

 Excerpt of the H.262 MPEG-2 Video codeword table for transform coefficient levels (s = sign):
 run | level | codeword        run | level | codeword
  0  |  ±1   | 11s              0  |  ±4   | 0000 110s
  0  |  ±3   | 0010 1s          1  |  ±1   | 011s
  2  |  ±1   | 0101 s          eob |       | 10
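The final cost comparison of Table 4.1 can be reproduced with a few lines of Python. The codeword lengths follow the table excerpt above (R(0,1) = 3, R(1,1) = 4, R(2,1) = 5, R(0,3) = 6, R(0,4) = 8, end-of-block = 2 bits); the snippet is only a check of the arithmetic, not an RDOQ implementation.

# Lagrangian costs D + lambda*R of the two complete candidate sequences
# from Table 4.1 (run-level coding, Delta = 10, alpha = 1, lambda = 10).

def run_level_bits(levels, table, eob_bits=2):
    """Number of bits for coding a level sequence with run-level VLCs."""
    bits, run = 0, 0
    for q in levels:
        if q == 0:
            run += 1
        else:
            bits += table[(run, abs(q))]
            run = 0
    return bits + eob_bits          # trailing zeros are covered by the eob symbol

table = {(0, 1): 3, (0, 3): 6, (0, 4): 8, (1, 1): 4, (2, 1): 5}
t, delta, lam = (36, 8, 12, 7, 2, 6), 10, 10

for q in [(4, 1, 1, 1, 0, 1),       # simple rounding according to (4.20)
          (4, 1, 1, 1, 0, 0)]:      # result of the RDOQ algorithm
    dist = sum((ti - delta * qi) ** 2 for ti, qi in zip(t, q))
    rate = run_level_bits(q, table)
    print(q, dist, rate, dist + lam * rate)   # -> D=53, R=23, cost 283 and D=73, R=19, cost 263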

H.265 MPEG-H HEVC. In H.265 MPEG-H HEVC, the levels of a transform block are partitioned into 4×4 subblocks. The scanning order is defined in a way that all levels of a subblock are consecutively traversed; it starts with the locations of high-frequency coefficients. If a transform block contains any non-zero levels, which is indicated by a coded block flag, at first the x and y coordinates of the first non-zero level in scanning order are transmitted. The subblock that contains the signaled position shall be referred to as the first significant subblock. Each subblock is associated with a coded subblock flag. If the flag is equal to zero, it indicates that all levels in the subblock are equal to zero. For all subblocks that precede the first significant subblock in coding order, the coded subblock flag is inferred to be equal to zero, while for the first significant subblock and the last subblock in scanning order, it is inferred to be equal to one. For all remaining subblocks, the flag is transmitted as part of the syntax. If the coded subblock flag is equal to one, the values of all transform coefficient levels inside the subblock are transmitted. Only the levels that precede the transmitted position of the first non-zero level in scanning order are excluded; they are inferred to be equal to zero. All syntax elements are coded using context-based adaptive binary arithmetic coding [177]. The used binary probability models are chosen based on the values of already transmitted binary decisions and are adapted to the actual symbol statistics.
Due to the adaptive arithmetic coding, an accurate computation of the rate terms is very complicated. But if we neglect some aspects of the probability model selection and assume that the probability models are not adapted inside a transform block, it is possible to design an RDOQ algorithm with reasonable complexity. Such an algorithm is included in the reference encoder [180, 126] for H.265 MPEG-H HEVC. It consists of the following basic processing steps:

1. For each scanning position k, a level q_k is selected by minimizing the Lagrangian cost D_k(q_k) + λ·R_k(q_k) under the assumption that the level is not inferred to be equal to zero. D_k(q_k) denotes the squared error α_k·(t_k − Δ_k·q_k)² and R_k(q_k) represents an estimate of the number of bits required for transmitting q_k;
2. The coded subblock flags for the 4×4 subblocks are determined by comparing the Lagrangian costs for the following two cases: (a) the levels selected in step 1 are used; (b) the coded subblock flag is set equal to zero and, thus, all levels of the subblock are equal to zero;
3. The position of the first non-zero level is determined by comparing the Lagrangian costs that are obtained by choosing one of the non-zero levels (after step 2) as the first non-zero level in scanning order (the preceding levels are set equal to zero);
4. The coded block flag is determined by comparing the Lagrangian costs for the level sequence obtained after step 3 and the case that all levels inside the transform block are set equal to zero.

Low-Complexity Quantization. In the discussed RDOQ algorithms, we tried to determine the distortion and rate terms as accurately as possible. But due to the joint coding of multiple transform coefficient levels, the consideration of the actual entropy coding significantly contributes to the complexity of the algorithms. If we, however, neglect the dependencies between the levels and approximate the rate terms by a simple model, we can process the transform coefficients independently of each other and, thus, obtain low-complexity algorithms.
Without loss of generality, we start by considering non-negative transform coefficients t_k. If we assume that the reconstruction values of the resulting quantizer lie inside the associated quantization intervals (see above), each transform coefficient t_k is represented by one of the following transform coefficient levels,

    q_{k,0} = ⌊ t_k/Δ_k ⌋   or   q_{k,1} = q_{k,0} + 1.    (4.22)

The basic concept is illustrated in Figure 4.5.

[Figure 4.5: Low-complexity quantization. Given a coefficient t_k, first the potential levels q_{k,0} and q_{k,1} = q_{k,0} + 1 are calculated. The chosen candidate is determined by comparing t_k/Δ_k with a threshold d_k(q_{k,0}), which only depends on q_{k,0} and separates the quantization intervals for q_{k,0} and q_{k,1}.]

We choose the candidate level q_{k,0} if and only if

    D_k(q_{k,0}) + λ·R(q_{k,0}) ≤ D_k(q_{k,0} + 1) + λ·R(q_{k,0} + 1),    (4.23)

where R(q) represents the model for approximating the number of bits required for transmitting a level q. With D_k(q) = α_k·(t_k − Δ_k·q)², the condition can be reformulated as

    t_k/Δ_k ≤ d_k(q_{k,0}) = q_{k,0} + 1/2 + λ/(2·α_k·Δ_k²) · ( R(q_{k,0}+1) − R(q_{k,0}) ),    (4.24)

where d_k(q_{k,0}) represents the decision threshold between the quantization intervals associated with q_{k,0} and q_{k,1}. Hence, the level q_k can be selected by comparing t_k/Δ_k with a single threshold d_k(q_{k,0}). Note that the thresholds {d_k(0), d_k(1), …} do not depend on t_k; they can be calculated in advance and stored in a table. Using the rounding operator, the eventually chosen transform coefficient level q_k can also be written as

    q_k = ⌊ t_k/Δ_k + max( 0, 1 − (d_k(q_{k,0}) − q_{k,0}) ) ⌋.    (4.25)

The maximum operator was introduced to correctly describe the decision for cases in which the threshold d_k(q_{k,0}) lies outside the considered range, i.e., d_k(q_{k,0}) > q_{k,1}. For reasonably selected Lagrange multipliers λ, we should always have q_{k,0} ≤ d_k(q_{k,0}) ≤ q_{k,1}.
An often used rate model [102] assumes that the number of bits increases linearly with the absolute value of the level, R(q) = a + b·|q| with a, b > 0. With this model, we have

    d_k(q_{k,0}) − q_{k,0} = 1/2 + b·λ / (2·α_k·Δ_k²).    (4.26)

If we now extend our considerations to negative values of the transform coefficients t_k, we eventually obtain the simple quantization rule

    q_k = sgn(t_k)·⌊ |t_k|/Δ_k + f_k ⌋   with   f_k = max( 0, 1/2 − b·λ / (2·α_k·Δ_k²) ).    (4.27)

It should be noted that this rule is very similar to the simple rounding (4.20); only the offset parameter f_k is decreased. If we consider transforms for which all basis functions have the same norm and assume that the same quantization step size is used for all transform coefficients, the offset parameter f_k becomes independent of the scanning position k. As will be discussed in Section 4.2.4, the Lagrange multiplier is often selected according to λ = const·Δ², in which case the offset parameter also becomes independent of the quantization step size. A simple quantization according to (4.27) with a constant value of f_k is actually widely used in practice; it is, for example, specified in the low-complexity mode of the reference encoder [125, 168] for H.264 MPEG-4 AVC.
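The quantization rule (4.27) amounts to one rounding operation with a reduced offset per coefficient. The short Python sketch below assumes the linear rate model R(q) = a + b·|q| and unit-norm basis functions (α_k = 1); the value chosen for b in the usage example is an illustrative assumption, not a calibrated model parameter.

import math

def quantize_low_complexity(coeffs, delta, lam, b=2.0, alpha=1.0):
    """Independent quantization according to (4.27): rounding with the
    reduced dead-zone offset f = max(0, 1/2 - b*lam / (2*alpha*delta**2)).
    b is the slope of the assumed rate model R(q) = a + b*|q|."""
    f = max(0.0, 0.5 - b * lam / (2.0 * alpha * delta ** 2))
    return [int(math.copysign(math.floor(abs(t) / delta + f), t)) for t in coeffs]

# For lam = 0 the offset equals 1/2 and the rule reduces to the simple
# rounding (4.20); a larger lam (or a larger rate-model slope b) decreases
# the offset and pushes small coefficients towards zero.
levels = quantize_low_complexity((36, 8, 12, 7, 2, 6), delta=10, lam=10)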

Similar simple quantization rules can also be derived for improved rate models [101]. As an example, if we use the above discussed model R(q) = a + b·|q| for the absolute values of the levels and assume that we need c additional bits for signaling the sign of levels q ≠ 0, we obtain

    q_k = 0,                              if |t_k|/Δ_k ≤ min( 1, 1/2 + λ·(b+c) / (2·α_k·Δ_k²) ),
    q_k = sgn(t_k)·⌊ |t_k|/Δ_k + f_k ⌋,   otherwise,    (4.28)

with f_k being given in (4.27).
Coding Efficiency. We analyzed the coding efficiency of the discussed approaches for the example of IPPP coding with H.265 MPEG-H HEVC. All encoders used the Lagrangian mode decision and motion estimation described above. The quantization parameter QP was fixed for each bitstream and the Lagrange multiplier was chosen according to the λ–QP relationship for H.265 MPEG-H HEVC (λ ∝ 2^{QP/3}, see Section 4.2.4). Three quantization methods were tested: the RDOQ algorithm, which is already included in the reference encoder [126, 180], the simple rounding (f_k = 0.5), and a quantization with an experimentally optimized offset parameter (f_k = 0.2). Figure 4.6 shows the rate-distortion curves and bit-rate savings for a selected sequence.

[Figure 4.6: Comparison of different quantization strategies for the test sequence Kimono: (a) Rate-distortion curves; (b) Associated bit-rate savings (RDOQ vs f_k = 0.5: avg. 20.5%; f_k = 0.2 vs f_k = 0.5: avg. 15.8%; RDOQ vs f_k = 0.2: avg. 5.8%).]

In comparison to the simple rounding, the RDOQ approach provides 20.5% average bit-rate savings. Most of the gain can already be achieved by decreasing the offset f_k in the independent quantization (4.27). The consideration of the actual entropy coding in

the RDOQ algorithm, however, still provides an average bit-rate saving of 5.8% relative to an optimized independent quantization.

4.2.4 Selection of Lagrange Multiplier

If we assume that high-level syntax features such as the temporal coding structure and the quantization scaling matrices are given, then the operation point of the above described Lagrangian encoder control is determined by two parameters: the quantization parameter QP, which can also be modified on a block basis and is transmitted as part of the bitstream, and the Lagrange multiplier λ. The Lagrange multiplier λ_M, which is used in motion estimation, is not considered as an additional degree of freedom. Depending on the chosen distortion measure, we use either λ_M = √λ (SAD or Hadamard SAD) or λ_M = λ (SSD).
For a given Lagrange multiplier λ, only certain settings of the block quantization parameters QP_k minimize the Lagrangian cost D + λ·R. Similarly as for other low-level syntax elements, the selection of the block quantization parameters QP_k could be incorporated into the encoder control. Instead of choosing a coding mode c_k for a block s_k out of a given set C_k = {c_0, c_1, …}, we could perform the minimization (4.11) over the product space C_k × Q_k, where Q_k = {QP_0, QP_1, …} represents the set of candidate quantization parameters. The coding mode and quantization parameter for a block would be jointly selected and the operation point of the encoder control could be adjusted using the Lagrange multiplier λ. Such an approach would, however, substantially increase the complexity of the encoding algorithm. Hence, it is desirable to use a deterministic relationship between the Lagrange multiplier λ and the block quantization parameters QP_k.
In the source coding text [301], we showed that the design algorithm for entropy-constrained quantizers generates a certain set of reconstruction levels (or vectors) for a given Lagrange multiplier λ and a given probability density function. The reconstruction levels in a video codec are specified by the block quantization parameters. Hence, if we assume that the actual signal properties have a rather small impact, there should also exist an approximately deterministic relationship between the Lagrange multiplier and the quantization parameters.

Let us assume that our encoder control provides a single real-valued parameter p for adjusting the operation point. We further assume that by modifying the parameter p we obtain a continuous, differentiable, and strictly convex operational distortion-rate function D(R). Then, the Lagrangian cost function J = D + λ·R is also strictly convex and the minimization of J over the domain of definition of p is equivalent to setting the derivative of J with respect to R equal to zero,

    d/dR ( D(R) + λ·R ) = 0   ⟹   λ = − d/dR D(R).    (4.29)

As a result, we obtain that the Lagrange multiplier λ is equal to the negative slope of the operational distortion-rate function D(R) at the associated operation point (see also Figure 4.1). Following the derivation in [254, 297], we additionally assume that D(R) can be described by the high-rate approximation⁵ D(R) = a·e^{−b·R} [131], which yields

    λ = − d/dR D(R) = a·b·e^{−b·R} = b·D(R).    (4.30)

At high rates, reasonably well-behaved probability density functions can be considered to be constant within each quantization cell. For scalar uniform reconstruction quantizers with decision thresholds in the middle between the reconstruction levels, this assumption leads to the approximation D = Δ²/12 [79, 301], where Δ denotes the distance between two neighboring reconstruction levels, i.e., the quantization step size. For the Lagrange multiplier λ, we then obtain

    λ = c·Δ²,    (4.31)

with c = b/12 being a constant. Although our assumptions are not completely realistic for a video codec, the derivation indicates that there may indeed exist a strong dependency between the Lagrange multiplier λ and the optimal quantization parameter, where λ is approximately proportional to the square of the quantization step size.
For confirming the relationship in (4.31), we performed experiments with H.265 MPEG-H HEVC and the simple IPPP coding structure.

⁵ Note that all high-rate approximations that we derived in the source coding text [301] can be described by the model D(R) = a·e^{−b·R} with b = 2·ln 2.

[Figure 4.7: Lagrange multiplier λ and quantization parameter QP for IPPP coding with H.265 MPEG-H HEVC: (a) Average Lagrangian cost D + λ·R, per luma sample, as a function of the quantization parameter QP for selected values of λ and the test sequence Kimono (minima are marked by circles); (b) Relation between the Lagrange multiplier λ and the optimal quantization parameter QP for the test sequences listed in Appendix A.1.]

The used reference encoder implementation [126, 180] utilizes the above described methods of rate-distortion optimized mode decision, motion estimation, and quantization. For each encoder run, all block quantization parameters were set to the same value, QP_k = QP. We modified the encoder in a way that the Lagrange multiplier and the quantization parameter can be independently selected. The Lagrange multiplier λ was varied over several orders of magnitude, starting at 0.1, in logarithmic steps. For each value of λ, we generated 52 bitstreams by varying the quantization parameter over the entire set of supported values, QP ∈ {0, 1, …, 51}. As coding efficiency measure, the average Lagrangian cost D + λ·R was used, where D represents the luma MSE over the entire video sequence and R denotes the average bit rate per luma sample.
As an example, Figure 4.7(a) shows the measured Lagrangian cost D + λ·R as a function of the quantization parameter QP for selected values of λ and one test sequence. The quantization parameter that minimizes the Lagrangian cost for a given value of λ shall be referred to as the optimal quantization parameter for the corresponding λ. In the diagram of Figure 4.7(a), the minima are marked with circles. The location of the minima for the shown example already indicates that there is indeed a strong dependency between the Lagrange multiplier λ and the optimal quantization parameter.

In H.265 MPEG-H HEVC, the quantization step size Δ is approximately proportional to 2^{QP/6}. If the relationship in (4.31) is valid, we should expect a proportionality between the Lagrange multiplier λ and 2^{QP/3}. For further analyzing this dependency, we plotted λ as a function of the associated optimal quantization parameter QP using a logarithmic scale for the λ-axis. The diagram in Figure 4.7(b) includes the data for all sequences listed in Appendix A.1 and all tested values of λ. The measured λ–QP points basically lie on a line in the log-linear plot; a dependency on the test sequence cannot be observed. As shown in the diagram, the data points can be well approximated by a relationship of the form

    λ = const · 2^{QP/3},    (4.32)

which experimentally confirms the proportionality (4.31) between λ and Δ². It should be mentioned that a log-linear regression with our measured data actually yields an exponent of QP/2.8263 instead of QP/3, but the difference to (4.32) is negligible. In particular, for the most often used QP range of about 20 to 40, the difference between the two approximations is significantly smaller than one QP step.
The quantization step size in H.262 MPEG-2 Video is directly proportional to the quantization parameter (presuming the high-level syntax element q_scale_type is set equal to zero). If we use the original TM5 quantization method, our experimental investigation yielded the approximation λ = 0.6·QP². For encodings with the improved quantization method discussed in Section 4.2.3, we obtained the relationship λ = 1.0·QP². A similar but slightly different experiment for determining the λ–QP dependency is described in [254, 297]; the investigation for H.263 confirms the proportionality between the Lagrange multiplier λ and the square of the quantization step size Δ².
The mentioned experiments verify that (4.31) provides an acceptable and robust characterization of the λ–QP relationship in hybrid video coding. The proportionality factor depends on the actually used bitstream syntax and encoding method. For the described encoder control with Lagrangian mode decision, motion estimation, and quantization, the experimentally determined relationships for different video coding standards are summarized in Table 4.2. Note that we obtained different proportionality factors for intra and inter pictures.
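In an encoder implementation, the mapping from the quantization parameter to the Lagrange multipliers can be kept in a small helper function. The sketch below uses the H.262 MPEG-2 Video relationships reported above; for H.264 MPEG-4 AVC and H.265 MPEG-H HEVC only the proportionality λ ∝ 2^{QP/3−4} is reproduced, so the scale factor c is an assumed placeholder that would have to be calibrated as described in this section.

def lagrange_multiplier(qp, standard='H.262', intra=False, c=1.0):
    """Map the quantization parameter QP to the Lagrange multiplier lambda.
    For H.262 MPEG-2 Video the relations lambda = 0.6*QP^2 (intra pictures)
    and lambda = 1.0*QP^2 (inter pictures) are used; for H.264/H.265 only
    the proportionality lambda = c * 2^(QP/3 - 4) is modeled, with c an
    illustrative calibration constant."""
    if standard == 'H.262':
        return (0.6 if intra else 1.0) * qp ** 2
    return c * 2.0 ** (qp / 3.0 - 4.0)

def lambda_motion(lam, distortion_measure='SAD'):
    """Lagrange multiplier for motion estimation: sqrt(lambda) for the SAD
    or Hadamard SAD, lambda itself for the SSD (see above)."""
    return lam ** 0.5 if distortion_measure in ('SAD', 'HSAD') else lam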

Table 4.2: Relationships between the Lagrange multiplier λ and the quantization parameter QP for different video coding standards. For H.264 MPEG-4 AVC and H.265 MPEG-H HEVC, λ scales with 2^{QP/3−4} = (1/16)·2^{QP/3}, because for these standards a value a = 2^{QP/6−2} specifies approximately the same quantization step size as the quantization parameter QP = a for the video coding standards H.262 MPEG-2 Video, MPEG-4 Visual, and H.263.

                      quantization   Lagrange multiplier λ = f(QP) for ...
                      step size      intra pictures      inter pictures
 H.262 MPEG-2 Video   ∝ QP           λ = 0.6·QP²         λ = 1.0·QP²
 MPEG-4 Visual        ∝ QP           λ = 0.5·QP²         λ = 1.0·QP²
 H.263                ∝ QP           λ = 0.5·QP²         λ = 0.9·QP²
 H.264 MPEG-4 AVC     ∝ 2^{QP/6}     λ ∝ 2^{QP/3−4}      λ ∝ 2^{QP/3−4}
 H.265 MPEG-H HEVC    ∝ 2^{QP/6}     λ ∝ 2^{QP/3−4}      λ ∝ 2^{QP/3−4}

4.2.5 Summary and Potential Improvements

As we pointed out in the previous subsections, it is impossible to build an encoder control that takes into account all dependencies between coding parameters. For obtaining a feasible encoding algorithm, we have to neglect the impact of certain coding decisions on the selection of other coding parameters. The chosen degree of simplification determines both the complexity of the algorithm and the coding efficiency. We consider the following configuration a reasonable compromise:

- Selection of the operation point using the quantization parameter;
- Setting of the Lagrange multipliers as discussed in Section 4.2.4;
- Motion estimation according to (4.12), which is split into a fast integer-sample search using the SAD as distortion measure and a sub-sample refinement using the Hadamard SAD;
- Rate-distortion optimized quantization (see Section 4.2.3);
- Decision between coding modes according to (4.11).

This encoding algorithm will be used for all coding experiments in the following sections. The consequent application of Lagrangian optimization techniques does not only provide a very good coding efficiency, but also enables a fair comparison of different coding tools.

[Figure 4.8: Coding efficiency improvements by successively replacing the TM5 algorithms for mode decision (MD), motion estimation (ME), and quantization (Q) with Lagrangian methods: (a) Rate-distortion curves for the test sequence Kimono; (b) Associated bit-rate savings (Exhaustive Opt.: avg. 28.7%; Opt. MD, ME, Q: avg. 23.2%; Opt. MD, ME: avg. 16.8%; Opt. MD: avg. 8.5%). The curves labeled as Exhaustive Opt. illustrate the maximum achievable coding efficiency for an independent block processing.]

For demonstrating the efficiency of the described encoder control, we compared it to the TM5 reference encoder [113] for H.262 MPEG-2 Video. Figure 4.8 shows the coding efficiency improvements that are obtained by successively replacing the TM5 algorithms for mode decision, motion estimation, and quantization with the Lagrangian methods described in Sections 4.2.1 to 4.2.3. For the selected test sequence, the chosen encoding algorithm provides an average bit-rate saving of 23.2% relative to the TM5 reference. We want to point out that the Lagrangian mode decision in particular becomes even more important for video codec designs that provide a larger number of coding options. The reason we chose H.262 MPEG-2 Video for the comparison is that the reference encoding methods for newer video coding standards [168, 180] already include the discussed Lagrangian optimization techniques.
Figure 4.8 additionally shows a configuration ("Exhaustive Opt."), in which we evaluated all sub-sample positions inside the motion search range and determined the associated rate-distortion costs using RDOQ. This encoder configuration basically provides the highest coding efficiency that is achievable with an independent block processing in coding order. In comparison to our favored setting, the exhaustive optimization yielded a bit-rate saving of about 7%, but at the cost of a drastically increased complexity (in our simulations, the encoder run time was increased by a factor of about 1000).

137 132 Video Encoder Control In our discussion of Lagrangian encoder optimization, we neglected the dependencies between different blocks right from the beginning. However, by taking into account the impact of coding decisions for a block on following blocks in coding order, it is often possible to further increase the coding efficiency. The dependencies between the coding parameters of different blocks are typically represented using trellis structures. The resulting optimization problems, which basically consist of minimizing a Lagrangian cost measure for a picture or a group of pictures, can be solved using dynamic programming techniques. Due to reasons of complexity, only a rather small number of coding options can be included in the multi-block optimization. Encoding algorithms that use a trellis-based approach for exploiting dependencies between the blocks of a picture were suggested in [299, 300, 222, 30]. The encoding approach described in [205, 206] exploits temporal dependencies inside a video sequence for the selection of picture quantization parameters. An extension that additionally includes the determination of picture coding types was presented in [164, 165]. In [221, 310, 309], the exploitation of temporal dependencies for selecting transform coefficient levels was investigated. By determining coding modes and motion parameters in a pre-processing step and using a simple rate model, the optimization problem could be formulated as l 1 -regularized least squares problem [335], which can be solved using numerical methods [51, 317]. Encoding concepts that exploit dependencies between blocks are typically too complex for practical encoder implementation, but they show that there is still potential for further coding efficiency improvements. 4.3 Additional Aspects of Video Encoders The main focus of our investigations in this text lies on the coding efficiency, which we evaluate by means of the average bit rate and the average PSNR. For comparing different approaches and coding tools, we will use the Lagrangian encoder control described in Section 4.2 and select operation points by adjusting the quantization parameter. This approach is adequate for evaluating the achievable coding efficiency of different coding tools, but it ignores several constraints faced in actual

138 4.3. Additional Aspects of Video Encoders 133 transmission or storage scenarios. In the following, we briefly discuss the most important aspects that have to be additionally considered in practical video encoders. Rate Control. Since abrupt changes in reconstruction quality are very disturbing for human observers, an encoder should attempt to maintain a nearly constant reconstruction quality 6 throughout a video sequence. For achieving this goal, we have to assign more bits to video pictures that are difficult to compress, such as intra pictures or inter pictures that cannot be well predicted, and fewer bits to pictures that are easy to code. However, in most applications, the bitstreams are transmitted through channels with a given constant or maximum bit rate. As a consequence, bit-rate variations have to be compensated using buffers at encoder and decoder, as is illustrated in Figure 4.9(a). But since the buffer sizes are finite, an encoder has to ensure that the bit-rate fluctuations can be compensated with given buffer sizes. The corresponding control algorithm in the encoder is referred to as rate control. Let us consider the buffer mechanism illustrated in Figure 4.9(a) for a transmission with constant bit rate (CBR). The encoder compresses the input pictures in coding order and writes the generated bits into the encoder buffer. After the bits for the first picture are inserted into the buffer, the encoder waits a certain time before the actual transmission is started. Then, the bits from the encoder buffer are removed at a constant bit rate R and send through the channel. At the receiver side, the transmitted bits are first stored in the decoder buffer. The decoder removes the bits from this buffer and eventually reconstructs and outputs the transmitted video pictures. Since the encoder buffer has similarities with a leaky bucket, the described buffer model is also referred to as leaky bucket model. In a transmission with constant bit rate, the buffers must not underflow or overflow. In order to enable interoperability with respect to bit-rate variations inside a bitstream, video coding standards specify an idealized decoder model, which is called hypothetical reference decoder (HRD). The encoder has to con- 6 The usage of fixed quantization parameters (as in the encoder control described in Section 4.2) typically yields an approximately constant video quality.

trol the bit-rate fluctuations in a way that the HRD could decode the bitstream without any underflow or overflow of the decoder buffer. At the decoder side, the leaky bucket model is characterized by three parameters: the decoder buffer size B, the transmission rate R, and the initial decoding delay Δt_dec. The operation of the HRD in the so-called constant-delay mode is illustrated in Figure 4.9(b). If we assume that the bitstream is transmitted at a constant bit rate, the decoder buffer fullness continuously increases with the bit rate R. With t_init denoting the time instant at which the first bit enters the decoder buffer, the removal time of the first picture in coding order is given by t_0 = t_init + Δt_dec. At this point in time, the bits for the first picture are removed from the buffer and the picture is instantaneously decoded. In the same way, all following pictures in coding order are instantaneously decoded (and the corresponding bits are removed) at the associated removal times. The encoding-to-decoding delay is the same for all pictures. For video sequences with a constant frame rate f_t, the interval between two successive removal times is equal to 1/f_t.

[Figure 4.9: Rate control and hypothetical reference decoder: (a) Buffer model for a transmission at a given bit rate R (constant bit rate or peak bit rate), with encoder buffer and decoder buffer; (b) Example of the decoder buffer fullness during decoding, showing the buffer size B, the initial decoding delay Δt_dec, the removal times t_0, t_1, t_2, …, and the picture interval 1/f_t.]

Many transmission channels do not use a constant bit rate, but are characterized by an available peak bit rate. Examples are packet-based networks or optical discs (maximum read speed). The leaky bucket model can also be used for variable bit rate (VBR) channels. However, in contrast to the CBR case, the encoder buffer may become empty and stay empty for a certain time. As long as the encoder buffer contains any bits, the video data are transmitted with the peak bit rate R. If

the buffer becomes empty, the transmission is paused until the bits for the next picture are inserted. Similarly as in the CBR case, the decoding requirements can be characterized by the leaky bucket parameters (R, B, Δt_dec), where R now denotes the peak transmission rate. Note that the decoder requirements for a bitstream can be described by multiple sets of leaky bucket parameters. For example, if we increase the peak transmission rate R, we may be able to decode the bitstream with a smaller buffer size B and a smaller initial decoding delay Δt_dec. Modern video coding standards [121, 123] allow the specification of multiple sets of leaky bucket parameters (R_k, B_k, Δt_dec,k) in the bitstream. Given a certain peak transmission rate R, the HRD interpolates among the specified parameters and selects the smallest buffer size and initial decoding delay that allow a decoding without buffer violations. For interactive applications, the HRD can also be operated in a low-delay mode, in which the encoding-to-decoding delay is not the same for all pictures. In order to reduce the average end-to-end delay, this mode allows a larger delay for pictures that are represented with a large number of bits (e.g., intra pictures at the beginning of a video sequence). For a detailed description of the HRD operation, the reader is referred to the comprehensive discussion in [210].
The rate control algorithm in the encoder has to ensure that the buffer constraints are not violated. This can be accomplished by controlling the fullness of an appropriate encoder buffer [210]. Rate control algorithms typically consist of a frame-level and a block-level rate control. The frame-level rate control assigns a target number of bits to each video picture. The block-level rate control selects the block quantization parameters (or Lagrange multipliers) within each picture to (approximately) meet the assigned target number of bits. On the one hand, the frame-level rate control has to choose the target number of bits in a way that the encoder buffer fullness remains in the desired range. But on the other hand, it also has to ensure that the reconstruction quality does not noticeably change over time. To consider the impact on image quality, typically a complexity measure is calculated for each picture. These measures indicate how difficult it is to compress a certain picture. They are either determined in a

pre-analysis step [329] or they are estimated using the coding results of already processed pictures [34, 212]. The frame-level bit allocation then selects the target number of bits for a current picture by considering both the fullness of the encoder buffer and the complexity measures for the current picture and a certain number of following pictures.
After the target number of bits R_pic for a picture is determined, the block-level rate control has to ensure that this target is approximately achieved. Given a certain block encoding algorithm, the number of bits R_k and the distortion D_k for a block are determined by the block quantization parameter QP_k. Hence, the block quantization parameters have to be selected in a way that Σ_{k=0}^{N−1} R_k(QP_k) ≈ R_pic. But besides achieving the target number of bits, we also want to provide a consistent reconstruction quality throughout the picture. Thus, similarly as for the frame-level algorithm, we require a complexity measure c_k for each block. In order to keep the encoding complexity small, block-level rate control algorithms typically use certain rate–QP and distortion⁷–QP models, R_k = f_R(QP_k, c_k) and D_k = f_D(QP_k, c_k), respectively. Based on these models, the quantization parameter QP_k for the k-th block in coding order can be directly computed given the bit budget R_pic, the number of generated bits for already coded blocks, and the complexity measures c_k for all blocks in the picture. Model parameters for the functions f_R and f_D can be updated based on the actual coding results. As complexity measures, often the variance, SSD, SAD, or Hadamard SAD of the prediction error signals are used.
The block-level rate control can be straightforwardly combined with a Lagrangian encoder control. After the quantization parameter QP_k for a block is selected, the associated Lagrange multiplier λ_k is set according to λ_k = f_λ(QP_k), where f_λ specifies a fixed mapping between λ and QP (see Section 4.2.4). Then, given λ_k and QP_k, the mode decision, motion estimation, and quantization for the block are performed as described in Section 4.2. Examples of rate control algorithms are described in [113, 211, 173]; the presentation in [173] particularly highlights the interaction with a Lagrangian encoder control.

⁷ The distortion D_k shall reflect the subjective quality of the k-th block. For that purpose, distortion measures other than MSE can be used.
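A minimal sketch of such a block-level bit allocation is given below. The rate model R_k = c_k / QP_k² is a simple illustrative assumption (not the model used in the cited algorithms), and the budget update uses the modeled rather than the actually generated bits; both would be replaced in a practical encoder.

def block_level_rate_control(complexities, r_pic, qp_min=1, qp_max=51):
    """Select block quantization parameters so that the modeled block rates
    approximately sum up to the picture bit budget r_pic.
    complexities: one complexity measure c_k per block in coding order."""
    qps, bits_spent = [], 0.0
    remaining_complexity = float(sum(complexities))
    for c_k in complexities:
        # bit budget still available for the remaining blocks
        budget = max(r_pic - bits_spent, 1.0)
        # share of the budget assigned to this block, proportional to c_k
        target_k = budget * c_k / max(remaining_complexity, 1e-9)
        # invert the assumed rate model R = c / QP^2  ->  QP = sqrt(c / R)
        qp = int(round((c_k / max(target_k, 1e-9)) ** 0.5))
        qp = min(max(qp, qp_min), qp_max)
        qps.append(qp)
        bits_spent += c_k / qp ** 2   # a real encoder would use the actual bits
        remaining_complexity -= c_k
    return qps

After QP_k has been selected for a block, the Lagrange multiplier λ_k = f_λ(QP_k) is derived as described in Section 4.2.4 and the block is encoded with the Lagrangian decisions of Section 4.2.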

142 4.3. Additional Aspects of Video Encoders 137 Real-Time Encoding. In various applications, such as video conferencing or live broadcasting, the input pictures have to be encoded in real time. For a video with a frame rate f t, the time required for encoding a picture has to be less than 1/f t. While short-term variations can be compensated using an additional buffer at the encoder output, longer violations of the real-time constraint have to be avoided. In Section 4.2, we pointed out some possibilities for reducing the complexity of the mode decision, motion estimation, and quantization algorithms. But in particular for modern codecs that support a very large number of coding options [121, 123], additional complexity reductions are often required. If we already use a fast motion search and a low-complexity quantization, the largest complexity reduction can be obtained by reducing the average number of considered coding modes. Fast mode decision algorithms [328, 197, 320] are typically designed using the following basic principle. Coding modes that are most likely (i.e., modes that are chosen for the majority of blocks) are evaluated first. If the coding result for the best of these modes fulfills certain conditions, the mode decision for the current block is terminated and the best among the tested modes is selected. Only if the conditions are not satisfied, additional coding modes are tested. These additional modes may be again partitioned into more and less likely modes. As a simple termination criterion, we could compare the Lagrangian cost D + λ R with a threshold [328]. Given the actually required encoding times, the used threshold can also be adapted during encoding. The classification of modes into more and less likely modes may be selected in an offline training or it can be determined based on certain characteristics of the image signal for the considered block [197, 320]. Perceptual Quality. As we mentioned in Section 3, the mean squared error (MSE) is not the best criterion for quantifying the perceived image quality. The perceptual quality of a video coded at a given bit rate may be improved if some important properties of the human visual systems are taken into account during encoding. Video compression that incorporates aspects of human perception is an active research topic. An overview of developed perceptual models and coding approaches is

given in [321]. Conceptually, we can modify the Lagrangian encoder control by replacing the MSE-related measures with distortion measures that better reflect the perceived image quality. In a straightforward approach, we do not use completely different distortion measures, but weight the commonly used SSD distortion with a visual sensitivity measure w_k for the considered block. The usage of block sensitivity measures is motivated by the observation that typical coding artifacts with a given MSE distortion are only perceivable in some regions of an image [264]. For example, small quantization errors often degrade the subjective quality in low-contrast regions, while they are not perceivable in high-contrast regions. Given a subjective distortion measure D_sub, we want to minimize the Lagrangian cost function D_sub + λ R in our encoder control,

min D_sub + λ R.   (4.33)

If the used distortion measure D_sub represents a scaled version of the SSD distortion D_SSD, i.e., D_sub = w_k D_SSD, this minimization is equivalent to minimizing D_SSD + λ_k R with a Lagrange multiplier λ_k = λ/w_k that depends on the sensitivity measure w_k,

min D_SSD + λ_k R   with   λ_k = λ / w_k.   (4.34)

Hence, we do not need to modify the actual block encoding algorithm, but only have to scale the Lagrange multiplier according to the calculated sensitivity measure for the considered block. If the block quantization parameters QP_k are not determined as part of the decision process, they should also be adapted according to the sensitivity measures. For that purpose we can utilize the relationship λ_k ∝ Δ_k², which we derived in Section 4.2.4, where Δ_k represents the quantization step size for the k-th block. For obtaining a suitable bit allocation, the Lagrange parameter λ that is used in connection with the subjective distortion measure D_sub should be held constant for a picture (see Section 4.1), which yields Δ_k² w_k = const. Hence, the block quantization parameters QP_k are chosen in such a way that the quantization step size Δ_k is inversely proportional to the square root of the sensitivity measure,

Δ_k ∝ 1 / √w_k.   (4.35)
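As a small illustration of this adaptation, the sketch below converts a block sensitivity measure w_k into a scaled Lagrange multiplier and an adjusted quantization parameter. The translation of (4.35) into a QP offset of −3 · log2(w_k) assumes the approximate relationship Δ ∝ 2^(QP/6) used by recent standards; the numerical values in the example are arbitrary.

```python
import math

def perceptual_block_parameters(lambda_pic, qp_pic, w_k):
    # lambda_k = lambda / w_k, cf. (4.34)
    lambda_k = lambda_pic / w_k
    # Delta_k ~ 1/sqrt(w_k), cf. (4.35); with Delta ~ 2**(QP/6) this becomes a QP offset
    qp_k = qp_pic - 3.0 * math.log2(w_k)
    return lambda_k, int(round(qp_k))

# a block with doubled sensitivity (w_k = 2) gets half the Lagrange multiplier
# and a quantization parameter reduced by 3, i.e., a finer quantization step size
print(perceptual_block_parameters(lambda_pic=100.0, qp_pic=32, w_k=2.0))
```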

144 4.3. Additional Aspects of Video Encoders 139 An encoder control that utilizes the sketched concept is presented in [33]; the subjective test results reported in this work verify the effectiveness of the approach. Even though this approach is a simple variation of the conventional Lagrangian encoder control, it indicates the potential of perceptually optimized encoding techniques. Error-Robust Encoding. In error-prone transmission scenarios, the bitstream received at the decoder side often differs from the bitstream generated by the encoder. For modern packet-oriented transmission systems, the only type of error we have to consider is the loss of complete video packets [241], i.e., the loss of slices 8 or pictures. The transmission protocols used can reliably detect erroneous transmission packets. If a packet contains bit errors that cannot be corrected, it is typically discarded. And even if the decoder receives an erroneous packet, it will most likely detect violations of certain syntax constraints during decoding. Since it is usually impossible to detect the location of the first bit error, a video decoder will discard the complete packet in this case. When the decoder detects a lost packet, it has to reconstruct the associated sample blocks based on certain assumptions. The decoder processing that attempts to recover the missing blocks is referred to as error concealment. A very simple strategy is to repeat the co-located sample blocks of a previous picture. The visual impact of packet losses can, however, be reduced if more advanced error concealment methods [286, 216, 287] are employed. Due to the usage of motion-compensated prediction, the differences between the recovered samples and the correct sample values are propagated to following pictures. While packet losses do typically not impact the reconstruction of intra-picture coded blocks 9 in following pictures, they result in wrong inter-picture prediction signals and, thus, affect the reconstruction of inter blocks. The reconstruction quality for error-prone transmissions can often be improved if the error propagation is already considered during encoding. This aspect can be elegantly incorporated into a Lagrangian 8 A slice is a coded sequence of blocks (macroblocks or coding tree blocks). Each video picture can be coded as a single slice or it can be split into multiple slices. 9 For error-prone environments, the intra prediction is usually restricted in a way that only samples of other intra blocks are used for generating the prediction signal.

145 140 Video Encoder Control encoder control. For example, the coding mode c k for a block s k could be selected according to c k = arg min c k C k E{D k (c k )} + λ R k (c k ). (4.36) The only difference to the conventional mode decision in (4.11) is that instead of the distortion D k for error-free reconstructions, the expected decoder distortion E{D k } for error-prone transmissions is used. The expected decoder distortion E{D k (c k )} for a mode c k depends on the error characteristics of the transmission channel and the error concealment strategy employed. Example methods for estimating the expected decoder distortion are described in [49, 296, 334, 149, 242]. 4.4 Chapter Summary The coding efficiency of a video codec does not only depend on the used bitstream syntax and decoding process, but, to a large extent, it is also determined by the encoding algorithm. The decision algorithm, which is also referred to as encoder control, chooses the transmitted coding parameters such as coding modes, motion parameters, and transform coefficient levels and, thus, determines both the bit rate and the reconstruction quality of a bitstream. The main task of the encoder control can be formulated as a constrained optimization problem. For a given bitstream syntax and decoding process, the coding parameters for the input video should be selected in such a way that the distortion of the reconstructed video is minimized while a certain maximum bit rate is not exceeded. Due to the large amount of dependencies between the coding decisions in hybrid video coding, it is impossible to find an optimal solution of this minimization problem. For obtaining a feasible encoding algorithm, the overall decision problem has to be split into a series of smaller optimization tasks, which is only possible if some interdependencies between coding decisions are neglected. Using the technique of Lagrange multipliers, the constrained optimization problem of selecting suitable coding parameters can be formulated as an unconstrained problem. By minimizing the Lagrangian cost function D + λ R of the distortion D and rate R for a certain value

146 4.4. Chapter Summary 141 of the Lagrange multiplier λ 0, we obtain a solution to the original constrained problem. The Lagrangian optimization technique is a powerful tool for allocating restricted resources among multiple entities. If we use an additive distortion measure and assume that the coding decisions for different blocks are independent of each other, the optimal solution for a collection of blocks can be found by independently minimizing the Lagrangian costs for the individual blocks. Although the coding decisions for the sample blocks in a video are not independent of each other, a suitable application of the Lagrangian optimization concept yields simple and effective encoding algorithms. We discussed some alternatives with respect to the trade-off between coding efficiency and complexity. The decision for a block of samples is often split into mode decision, motion estimation, and quantization. While decisions for preceding blocks in coding order are taken into account, the impact of a decision on future blocks is typically neglected. As a result of our discussion, we presented a decision algorithm with a reasonable trade-off between coding efficiency and complexity, which is also used in the reference encoder specifications for the video coding standard H.265 MPEG-H HEVC. Due to its algorithmic simplicity, the Lagrangian encoder control does not only provide a high coding efficiency, but also enables a fair comparison of different syntax features and coding tools in terms of coding efficiency. The described decision algorithm will be used for all coding experiments in this text. At the end of this section, we discussed the design of rate control algorithms and real-time encoding algorithms. We also pointed out how the concept of Lagrangian optimization can be combined with perceptual quality metrics. Moreover, we briefly discussed approaches for designing error-robust encoding algorithms.

147 5 Intra-Picture Coding In hybrid video coding, we distinguish two types of block coding modes: Intra-picture and inter-picture coding modes, which are also simply called intra and inter modes. The fundamental difference between these two categories is that inter-picture coding modes utilize dependencies between pictures via motion-compensated prediction, whereas intrapicture coding modes represent a block of a picture without referring to other pictures. Due to that restriction, intra-picture coding cannot exploit the large amount of temporal dependencies found in typical video sequences and thus it has, on average, a lower coding efficiency than inter-picture coding. Nonetheless, intra-picture coding plays an important role in video coding. It is used in two different settings: Intra pictures (or I pictures): All blocks of an intra picture are represented using an intra-picture coding mode. Individual intra blocks: Some blocks of a picture are coded using an intra-picture coding mode, while other blocks of the same picture are coded using an inter-picture coding mode. An intra picture can be decoded independently of all other pictures. The first picture of a video has to be coded as intra picture, since no 142

148 143 PSNR (Y) [db] all coding modes Basketball Drive, IPPP intra-picture coding modes are disabled in inter pictures bit-rate increase [%] Basketball Drive, IPPP bit-rate increase due to disabling of intra-picture coding modes in inter pictures (on average, 10.9%) (a) bit rate [Mbit/s] (b) PSNR (Y) [db] Figure 5.1: Intra-picture coding modes in inter pictures. The example demonstrates the loss in coding efficiency when intra-picture coding modes in inter pictures are disabled: (a) Rate-distortion curves for IPPP coding with H.265 MPEG-H HEVC and the HD test sequence Basketball Drive; (b) Associated bit-rate increase. already coded pictures are available and, hence, motion-compensated prediction cannot be used. In broadcast and streaming applications, intra pictures are inserted in regular intervals (typically ranging from about half a second to a few seconds). This allows a decoder to start the decoding at an intra picture and, thus, provides random access, which is required for channel switching or fast forwarding. The usage of intra pictures is also important in editing applications, where they enable a splicing of parts from different bitstreams. In error-prone environments, intra pictures may be used for terminating the propagation of reconstruction errors caused by packet losses. Pictures that support the use of inter-picture coding modes are referred to as inter pictures. In general, the syntax of inter pictures allows a selection between intra- and inter-picture coding modes on a block basis. The main reason for supporting intra-picture coding of individual blocks in inter pictures is the associated improvement in coding efficiency. Typically, some blocks of a video picture cannot be well predicted using motion-compensated prediction, for example, blocks that contain uncovered regions or regions with diffuse or complicated motion. And since non-matched prediction can be disadvantageous in rate-distortion sense [301], the usage of intra-picture coding modes for individual blocks often improves coding efficiency. Figure 5.1 demonstrates this effect for an example; for the selected sequence, a disabling of intra coding modes resulted in an average bit-rate increase of 11%.

149 144 Intra-Picture Coding If there are no statistical dependencies between a picture and the available reference pictures (e.g., at a scene cut), it may be advantageous to code the corresponding picture as intra picture in order to reduce the bit rate required for signaling the chosen coding modes. In interactive video applications, in which the transmission of complete intra pictures causes an undesirable high encoding-decoding delay, individual intra blocks are often inserted for improving the error resilience. For older video coding standards, such as H.262 MPEG-2 Video, a regular insertion of intra blocks is also required for limiting the accumulation of inverse transform mismatches (see Section 5.1.1). In this section, we review important techniques for intra-picture coding. In particular, we will discuss 2D transform coding of sample blocks, intra-picture prediction between blocks of a picture, and the usage of variable-size blocks for prediction and transform coding. Even though filtering techniques such as deblocking filters also improve the coding efficiency of intra-picture coding, we will describe them as part of the inter-picture coding techniques in Section Transform Coding of Sample Blocks Transform coding of sample blocks is one of the fundamental coding techniques found in all hybrid video codecs. It is not only used in intrapicture coding modes, but also for coding the prediction error signal of inter-picture predicted blocks. As will be further discussed in Section 5.2, intra-picture coding in modern hybrid video codecs often also includes a prediction, which is referred to as intra-picture prediction. The samples of an image block are first predicted using already reconstructed samples of neighboring blocks and then the resulting prediction error signal is coded using transform coding. Since blocks of prediction error samples have different statistical properties than blocks of original samples, we will consider transform coding for both block types in the following discussion. The basic concept of transform coding is illustrated in Figure 5.2. In the encoder, an N M block s of input samples is transformed using a linear analysis transform A. The result is a block or vector t of trans-

150 5.1. Transform Coding of Sample Blocks 145 N M block of original samples s 2D linear analysis transform t 0 t 1 α 0 α 1 q 0 q 1 entropy coding codewords A t NM 1 α NM 1 q NM 1 γ (a) scalar quantization codewords entropy decoding q 0 q 1 β 0 β 1 t 0 t 1 2D linear synthesis transform N M block of reconstr. samples s γ 1 q NM 1 β NM 1 t NM 1 B (b) scalar decoder mapping Figure 5.2: Basic concept of transform coding: (a) Encoder; (b) Decoder. form coefficients {t k }, which represents the input samples in a different coordinate system. The N M independent scalar quantizers, map the transform coefficients {t k } to quantization indexes {q k }, which are also referred to as transform coefficient levels. Finally, the obtained levels are entropy coded and the resulting codewords are written to the bitstream. At the decoder side, the transform coefficient levels {q k } are decoded from the received codewords and mapped to reconstructed transform coefficients {t k }. The reconstructed samples s are obtained by transforming the block or vector t of reconstructed transform coefficients {t k } using a linear synthesis transform B. It should be noted that the 2D character of the input data does not change the basic concept of transform coding. The blocks s of input samples can always be arranged into vectors, in which case we obtain the 1D transform coding structure that we discussed in some detail in the source coding part [301]. The entropy coding γ at the encoder side has to be the inverse of entropy decoding γ 1 at the decoder side. But the algorithm for obtaining the transform coefficient levels {q k } does not need to consist of an analysis transform A and independent scalar encoder mappings {α k }.
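As a toy illustration of this encoder/decoder structure, the following sketch implements the vectorized form with an arbitrary orthogonal basis B (obtained here from a QR decomposition) and simple uniform scalar quantizers; the entropy coding of the levels is omitted. It only mirrors the structure of Figure 5.2 and is not the transform or quantizer of any particular codec.

```python
import numpy as np

def encode(block, B, step):
    s_vec = block.reshape(-1)                 # arrange the N x M block as a vector
    t = B.T @ s_vec                           # analysis transform, A = B^T
    return np.round(t / step).astype(int)     # scalar quantization -> levels q_k

def decode(levels, B, step, shape):
    t_rec = levels * step                     # reconstructed transform coefficients
    return (B @ t_rec).reshape(shape)         # synthesis transform

rng = np.random.default_rng(0)
B, _ = np.linalg.qr(rng.standard_normal((16, 16)))          # some orthogonal 16x16 basis
block = rng.integers(0, 256, size=(4, 4)).astype(float)     # a 4x4 "sample block"
levels = encode(block, B, step=8.0)
print(np.abs(decode(levels, B, step=8.0, shape=(4, 4)) - block).max())  # bounded error
```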

151 146 Intra-Picture Coding Entropy coding typically exploits dependencies between transform coefficient levels. Hence, the coding efficiency can often be improved if these dependencies are taken into account during quantization. Suitable quantization algorithms have been discussed in Section For analyzing the design of the transform and the quantizers, we neglect the impact of exploiting dependencies in entropy coding. Instead, we assume that the quantization is performed using N M independent scalar quantizers, each of which is given by an encoder mapping α k and a decoder mapping β k. Additionally, we assume that the analysis transform is the inverse of the synthesis transform, A = B 1. This configuration ensures perfect reconstruction in the absence of quantization and is used in all practical encoders. Furthermore, we restrict our considerations to orthogonal transform matrices, B 1 = B T (yielding A = B T ). In practice, this choice has the advantage that the MSE distortion in the signal space is equal to the MSE distortion in the transform domain. Hence, the MSE distortion between an original and reconstructed sample block can be minimized using independent scalar quantizers. Even if the quantization takes dependencies between transform coefficient levels (introduced by the entropy coding) into account, the usage of orthogonal transforms significantly simplifies the corresponding quantization algorithm. Transform coding with orthogonal block transforms represents a constrained form of vector quantization. The reconstruction vectors lie on a rectangular grid in the (N M)-dimensional signal space. The orientation of the grid is determined by the chosen transform B and the distances between the grid points are given by the decoder mappings {β k } of the scalar quantizers. Due to its structural constraints, transform coding with orthogonal transforms has a lower coding efficiency than unconstrained vector quantization. In particular, the spacefilling advantage of vector quantization cannot be exploited. But in contrast to more general forms of vector quantization, the implementation complexity of transform coding is very low. The good trade-off between implementation complexity and coding efficiency is the main reason why orthogonal block transforms are widely used in image and video coding applications.
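A quick numerical check of the orthogonality property used above, namely that the MSE between original and reconstructed samples equals the MSE between the corresponding transform coefficients, is given below; the random orthogonal basis and the rounding step merely stand in for an arbitrary transform and quantizer.

```python
import numpy as np

rng = np.random.default_rng(1)
B, _ = np.linalg.qr(rng.standard_normal((64, 64)))   # orthogonal basis, B^{-1} = B^T
s = rng.standard_normal(64)                          # samples of an 8x8 block, vectorized
t = B.T @ s                                          # transform coefficients
t_rec = np.round(t)                                  # some quantization of the coefficients
s_rec = B @ t_rec                                    # reconstructed samples
print(np.mean((s - s_rec) ** 2), np.mean((t - t_rec) ** 2))   # the two MSE values coincide
```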

152 5.1. Transform Coding of Sample Blocks Orthogonal Block Transforms Let s vec be a vector that consists of the samples of an N M block s. In the general form, the forward and inverse transforms are given by t vec = B T s vec and s vec = B t vec, (5.1) respectively. s vec is the vector of reconstructed samples, t vec and t vec represent the original and reconstructed vectors of transform coefficients, and B is an (N M) (N M) inverse transform matrix. Separable Block Transforms. An implementation according to (5.1) provides the largest degree of freedom for designing the transform. Due to the associated complexity, it is however not used in image and video coding. Instead, the analysis and synthesis transforms are typically implemented in a separable fashion. With s and s being N M blocks of original and reconstructed samples and t and t representing the original and reconstructed transform coefficients, arranged in N M matrices, the separable transforms are given by t = B T V s B H and s = B V t B T H. (5.2) At the analysis side, a multiplication of the sample block s with the N N vertical transform matrix B T V applies a 1D transform to the columns of the sample block. The rows of the resulting matrix (B T V s) are then transformed by a multiplication with the M M horizontal transform matrix B H. Note that the horizontal and vertical transforms can also be carried out in reverse order. The reconstructed samples s are obtained by applying the inverse operations to the blocks t. Due to the 2D character of the input data, the potential loss in coding efficiency that is caused by constraining the transform to be separable is typically small (indications will be presented later). The implementation complexity is, however, significantly reduced. 2D Discrete Cosine Transform (DCT). The optimal transform in rate-distortion sense is signal-dependent. As will be discussed below, the determination of an approximately optimal transform is a very complex task. For using signal-dependent transforms in a video codec, the

transform matrices either have to be transmitted inside the bitstream, which increases the required bit rate, or they have to be estimated in a backward-adaptive manner, which would drastically increase the complexity of encoder and, in particular, decoder implementations. In our discussion of transform coding in [301], we introduced the discrete cosine transform (DCT) of type II [2] as a signal-independent transform. We showed that for Gauss-Markov sources, the optimal transform 1 actually approaches the DCT for high correlation coefficients. Even though video signals cannot be well described by a Gauss-Markov model, we will demonstrate later that the DCT is indeed a reasonable choice for transform coding. There are also several computationally efficient algorithms [32, 100, 171, 153] for implementing the forward and inverse transforms. Due to these reasons, the DCT has become a popular choice in image and video coding and is included in many coding standards [127, 119, 122, 120, 111]. The N × N inverse transform matrix B_DCT = {b_ik} of the DCT is given by the matrix coefficients

b_ik = (a_k / √N) · cos( (π/N) · k · (i + 1/2) )   with   a_k = { 1 : k = 0,  √2 : k > 0 }.   (5.3)

The DCT is applied in horizontal and vertical direction, i.e., both transform matrices B_H and B_V in (5.2) are set equal to the DCT matrix B_DCT of the correct size. The first transform coefficient t_00 represents the average of the samples and is referred to as DC coefficient. All other transform coefficients are called AC coefficients. Integer Transforms. A disadvantage of the DCT is that most of the matrix coefficients b_ik in (5.3) are irrational numbers. For implementing the forward and inverse transform in a digital computer, the DCT matrix coefficients have to be approximated by binary numbers with finite precision. If the approximations for the inverse transform differ in the implementations of encoder and decoder, the decoder does not always obtain the same reconstructed samples as the encoder. This 1 For Gauss-Markov sources, the optimal transform in rate-distortion sense is the Karhunen-Loève Transform (KLT) [84, 301].

154 5.1. Transform Coding of Sample Blocks 149 often causes problems in hybrid video coding, since the mismatches between encoder and decoder reconstruction can accumulate due to the usage of motion-compensated prediction or intra-picture prediction. As a consequence, video coding standards that use the DCT, such as H.262 MPEG-2 Video, recommend a regular insertion of intra blocks (which reset the mismatch propagation in these standards). In the newer video coding standards H.264 MPEG-4 AVC [121] and H.265 MPEG-H HEVC [123], the specified transform matrices B are finite precision approximations of DCT transform matrices B DCT. These standards specify the inverse transform as series of exact integer operations. In contrast to implementation dependent approximations, this approach has the benefit that mismatches between encoder and decoder reconstructions are completely avoided. While H.264 MPEG-4 AVC only includes a 4 4 and an 8 8 transform 2, H.265 MPEG-H HEVC specifies transform sizes ranging from 4 4 to All transforms in H.265 MPEG-H HEVC are specified by a single integer matrix; the smaller transform matrices are obtained by resampling the transform matrix. As an example, the inverse transform matrix for the 8 8 transform is given by B (8) H.265 = (5.4) For designing the transform, a compromise of several desirable properties, such as orthogonality, basis functions with the same norm, closeness to the DCT, symmetry, bit depth, and implementation complexity, had to be found. The basis functions of the transforms in H.265 MPEG-H HEVC are not exactly orthogonal to each other and the norms are not exactly the same. But the deviations are very small, so 2 The standard also includes a transform, which is, however, realized as a cascaded transform. First, the 4 4 subblocks of the block are transformed using the 4 4 transform, and in the second step, the resulting 4 4 array of DC coefficients is additionally transformed using a 4 4 Hadamard transform.

that for practical implementations, the transforms can be considered to consist of orthogonal basis functions with the same norm. The scaling that is required for obtaining basis functions with a norm equal to one is incorporated into the quantization and reconstruction process. The integer transforms have the same symmetry properties as the DCT, but they do not provide the trigonometric relationships between matrix elements that are additionally employed in fast DCT algorithms such as the algorithm described in [32]. For more details on the transform design for H.265 MPEG-H HEVC, the reader is referred to the discussions in [71, 23]. More details on the integer transforms of H.264 MPEG-4 AVC can be found in [175, 15, 82]. Transform Gain. The coding efficiency of a transform is difficult to evaluate, since all components of a transform codec influence each other. For high bit rates, Gaussian sources, and entropy-constrained scalar quantizers, the coding efficiency gain (in dB) of transform coding relative to a direct scalar quantization of the sample values is given by the transform gain [301],

G_T = 10 log_10( ( (1/(NM)) ∑_{i=0}^{NM−1} σ_i² ) / ( ∏_{i=0}^{NM−1} σ_i² )^(1/(NM)) ),   (5.5)

where σ_i² denotes the variance of the i-th transform coefficient. The transform gain G_T represents the ratio of the arithmetic and geometric means of the transform coefficient variances (in dB). It is a measure for the decorrelation property of the transform. The transform that maximizes the transform gain G_T defined in (5.5) is the Karhunen-Loève Transform (KLT) [301], which generates transform coefficients that are completely decorrelated. It should be noted that for 2D blocks, the KLT is, in general, a non-separable transform. For obtaining an indication of the effectiveness of separable block transforms and, in particular, the transforms used in video coding, we compared the transform gain G_T of a non-separable KLT with that of a separable KLT 3, the DCT, and the integer transform specified in 3 For a separable KLT, the transform matrices B_V^T and B_H^T in (5.2) represent KLTs for the columns and rows of the image blocks, respectively.

156 5.1. Transform Coding of Sample Blocks 151 Loss rel. to non-sep. KLT [db] (a) Separable KLT DCT (type II) HEVC transform Original pictures DCT transform gain (in db) is shown above the bars Bas BQT Cac Kim Par Loss rel. to non-sep. KLT [db] (b) Separable KLT DCT (type II) HEVC transform 2.87 Residual pictures 1.03 DCT transform gain (in db) is shown above the bars Bas BQT Cac Kim Par Figure 5.3: Loss in transform gain relative to a non-separable KLT: (a) Original 8 8 luma blocks; (b) Residual 8 8 luma blocks. H.265 MPEG-H HEVC. In our experiment, we used blocks of 8 8 samples and determined the KLTs and the transform gains for individual pictures of the five HD test sequences listed in Appendix A.1. Figure 5.3 illustrates the losses in transform gain G T relative to a nonseparable KLT for original image blocks and blocks of residual samples. The latter were generated using motion-compensated prediction (according to H.265 MPEG-H HEVC) with reference pictures of typical video quality. The usage of the separable KLT decreases the transform gain by 0.12 db for blocks of original samples and 0.02 db for blocks of residual samples. For the DCT, the losses are increased to 0.19 db (original) and 0.09 db (residual). The transform gain for the tested integer transform is virtually the same as that for the DCT. Comparison of Coding Efficiency. The (non-separable) KLT generates transform coefficients that are decorrelated, but for most sources there are still non-linear dependencies. The effectiveness of a transform also depends on the subsequent quantization and entropy coding. For the following considerations, we assume that the quantization of the transform coefficients is performed with optimal entropy-constrained scalar quantizers, which we discussed in the source coding part [301], and an optimal bit allocation among the transform coefficients (i.e., all quantizers are designed for the same value of the Lagrange multiplier λ). We further assume that the lossless coding of the quantization indexes is performed with independent, but optimal entropy coders.

For these assumptions, the average bit rate per sample is given by

R = (1/(NM)) ∑_{i=0}^{NM−1} H_i,   (5.6)

where H_i represents the marginal (first-order) entropy for the i-th transform coefficient level. The coding efficiency of a transform does not only depend on the decorrelation property of the transform, but also on the actual distribution of the transform coefficients. In [301], we showed that for Gaussian sources, the KLT is the optimal transform. For other sources, however, a different orthogonal transform may increase the coding efficiency [57]. A necessary and sufficient condition for the optimality of an orthogonal transform A* at high rates was derived by Akyol and Rose [3]. They showed that an orthogonal transform A*, with the inverse transform B* = (A*)^T, is optimal if and only if it satisfies

A* = argmin_A D_KL( f_T(t) ‖ ∏_{i=0}^{NM−1} f_Ti(t_i) ),   (5.7)

where f_T(t) denotes the joint pdf of the transform coefficients, f_Ti(t_i) represents the marginal pdf of the i-th transform coefficient, and D_KL(f‖g) specifies the Kullback-Leibler divergence between two pdfs f and g, which is given by

D_KL(f‖g) = ∫_X f(x) log_2( f(x) / g(x) ) dx.   (5.8)

Note that the divergence D_KL in (5.7) measures the difference between the joint pdf of the transform coefficients and the joint pdf that would be obtained if the transform coefficients had the same marginal pdfs but were independent of each other. Hence, we can say that the optimal transform for high rate quantization is the transform that minimizes the statistical dependencies between the transform coefficients. For low rate quantization, there is no general optimality criterion. Archer and Leen [5] developed an iterative algorithm for designing an optimal orthogonal transform, which they called coding optimal transform (COT). Given a sufficiently large training set {s_k} of sample

158 5.1. Transform Coding of Sample Blocks 153 vectors of size N and a Lagrange multiplier λ, the algorithm consists of the following steps: 1. Choose an initial orthogonal transform (e.g., a KLT ), which shall be represented by the inverse transform matrix B; 2. Generate a set of transform coefficient vectors {t k } by transforming all samples vectors {s k } using the transform B T ; 3. Develop an entropy-constrained scalar quantizer (see [301]) for each of the N transform coefficients using the given Lagrange multiplier λ and generate the set of reconstructed transform coefficients {t k} by quantizing the transform coefficients {t k }; 4. Calculate the inverse orthogonal transform matrix B that minimizes the MSE distortion D for the given sets of sample vectors {s k } and reconstructed transform coefficients {t k}; 5. Repeat the previous three steps until convergence. The authors of [5] show that the orthogonal matrix B that minimizes the distortion D for the training set (in step 4) has the property Q B = (Q B) T with Q = k t k s T k. (5.9) It can be found by using the matrix B of the previous iteration as starting point B 0 and applying a series of Givens (or Jacobi) rotations [80], B k = B k 1 R k. In each step, the rotation matrix R k is determined in a way that a symmetry measure for the matrix M = QB k, which is given by m sym = N 2 N 1 i=0 j=i+1 (m ij m ji ) 2, (5.10) is minimized. The Givens rotations are continued until the symmetry measure m sym becomes smaller than a certain threshold. For more details, the reader is referred to the derivation and description in [5]. Figure 5.4 compares the coding efficiency of the discussed transforms for 8 8 blocks of original and residual samples. The experiments were carried out using 10 pictures of two selected HD test sequences. For determining the COTs and KLTs, the same pictures were used as

159 154 Intra-Picture Coding Loss rel. to non-sep. KLT [db] (a) Cactus (original pictures) separable KLT COT DCT (type II) bit rate (first-order entropy) [Mbit/s] Loss rel. to non-sep. KLT [db] (b) Kimono (original pictures) separable KLT DCT (type II) 0.05 COT bit rate (first-order entropy) [Mbit/s] Loss rel. to non-sep. KLT [db] (c) Cactus (residual pictures) DCT (type II) separable KLT COT bit rate (first-order entropy) [Mbit/s] Loss rel. to non-sep. KLT [db] (d) Kimono (residual pictures) DCT (type II) separable KLT COT bit rate (first-order entropy) [Mbit/s] Figure 5.4: Comparison of the coding efficiency of selected orthogonal transforms: (a,b) 8 8 luma blocks of original samples; (c,d) 8 8 luma blocks of residual samples. All diagrams show the PSNR loss relative to a non-separable KLT. for the actual simulation. The results demonstrate that a non-separable KLT is not the most efficient transform for typical video signals. Often a separable KLT and sometimes also the DCT provides a higher ratedistortion efficiency. The best coding efficiency is provided by the COT. But given the complexity of the design algorithms (COT and KLT), the required bit rate for transmitting the transform matrices, and the complexity of the actual transforms, the signal-independent 2D DCT represents a reasonable choice for transform coding in a video codec Quantization In the following, we consider the quantization of transform coefficients. In the source coding part [301], we have shown that entropyconstrained scalar quantizers (ECSQs) are the best scalar quantizers in rate-distortion sense. An optimal bit allocation among the transform

160 5.1. Transform Coding of Sample Blocks 155 coefficients is achieved if all component quantizers operate at the same slope of their operational distortion rate functions, which means that all quantizers have to be designed using the same Lagrange multiplier λ. Even though a quantization with entropy-constrained scalar quantizers is optimal, it is typically not used in image and video coding. The main reason is that the ECSQ design depends on both the statistical properties of the input signal and the target bit rate (or quality). Hence, for using ECSQs in a video codec, we either would have to signal the reconstruction levels inside the bitstream or simultaneously design the quantizers at encoder and decoder side (based on a certain signal model and the transmitted transform coefficient levels). While the first method would noticeably increase the bit rate, the latter approach would significantly increase the encoder and decoder complexity. In image and video coding, the quantization of the transform coefficients is usually performed using very simple, but yet efficient scalar quantizers. In the following, we describe these simple quantizers and compare their coding efficiency with that of optimal ECSQs. Distribution of Transform Coefficients. Before we discuss the scalar quantization of transform coefficients, we analyze the distribution of transform coefficients (in particular, the distribution of DCT coefficients) for typical blocks of original and prediction error samples. The distribution of DCT coefficients has been investigated by various researchers and many distributions have been proposed, such as Laplacian [209, 162], Cauchy [58], generalized Gaussian [183], and Gaussian mixture [60] models. One of the most popular choices is the Laplacian distribution. It can be derived as follows [162]. We assume that the samples of a transform block are identically distributed. Each DCT coefficient is given by a weighted sum of the samples of the considered block. According to the central limit theorem, a weighted summation of a sufficiently large number of identically distributed random variables can be well approximated by a Gaussian distribution. Due to the nature of the DCT, for all AC coefficients, the mean of the distribution is equal to zero. If we assume that all sample blocks have the same normalized autocorrelation matrix, the

variance σ_i² of a transform coefficient t_i is proportional to the sample variance σ_s² of the block. An important property of natural images is that the sample variance σ_s² changes across blocks. Therefore, the distribution of an AC transform coefficient t_i is modeled according to

f_i(t) = ∫_0^∞ f_i(t | σ_i²) f_i(σ_i²) dσ_i²,   (5.11)

where f_i(t | σ_i²) represents the conditional pdf given a certain transform coefficient variance σ_i² and f_i(σ_i²) denotes the distribution of transform coefficient variances for the i-th coefficient. As discussed above, the conditional pdf f_i(t | σ_i²) can be approximated by a Gaussian pdf,

f_i(t | σ_i²) = (1 / √(2π σ_i²)) · e^(−t² / (2σ_i²)).   (5.12)

If we assume that the distribution of sample variances σ_s² and thus also the distribution of transform coefficient variances σ_i² can be approximated by an exponential pdf,

f_i(σ_i²) = a · e^(−a σ_i²),   (5.13)

we obtain (using an integral table [85])

f_i(t) = ∫_0^∞ (a / √(2π σ_i²)) · e^(−t² / (2σ_i²)) · e^(−a σ_i²) dσ_i² = √(a/2) · e^(−√(2a) |t|).   (5.14)

Hence, if our assumptions are true, the AC transform coefficients should have approximately a zero-mean Laplacian distribution. For evaluating the validity of our assumptions, we investigated histograms of block variances and DCT coefficients for 8 × 8 blocks of original and residual pictures. Figure 5.5 shows selected examples. It can be seen that the distribution of block variances cannot be well approximated by an exponential pdf.

Figure 5.5: Distribution of block variances and DCT coefficients for 8 × 8 blocks of the test sequence Kimono: (a,b) Block variance distribution for blocks of original and residual samples; (c,d) Transform coefficient distribution for two selected transform coefficients (t_1,1 and t_2,4) of residual blocks.

Nonetheless, for many transform coefficients, in particular for coefficients with a low-frequency index, the distribution can be well modeled by a Laplacian pdf. For other transform coefficients, in particular high-frequency components, a Gaussian pdf often provides a better fit. Our experiments indicated that the distribution of AC transform coefficients can often be well modeled by a generalized Gaussian pdf (typically between Laplacian and Gaussian); the shape parameter depends on the frequency index. This observation is consistent with the analysis of high order statistics in [162], from which the authors concluded that the generalized Gaussian model provides a reasonable approximation. The distribution of the DC coefficient for blocks of original samples is similar to the distribution of the original sample values. For residual blocks, it can typically be well approximated by a Laplacian pdf.
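The Gaussian scale-mixture argument behind (5.11)–(5.14) can be checked numerically with a few lines of code: drawing the coefficient variance from an exponential distribution and then the coefficient from a zero-mean Gaussian with that variance produces an (approximately) Laplacian marginal. The parameter value and sample count below are arbitrary choices for the experiment.

```python
import numpy as np

rng = np.random.default_rng(2)
a = 0.5                                              # parameter of the exponential variance model
var = rng.exponential(scale=1.0 / a, size=200_000)   # sigma^2 drawn from a * exp(-a * sigma^2)
t = rng.normal(0.0, np.sqrt(var))                    # t | sigma^2 ~ N(0, sigma^2)

# compare the empirical density with the Laplacian sqrt(a/2) * exp(-sqrt(2a) * |t|) of (5.14)
hist, edges = np.histogram(t, bins=101, range=(-10, 10), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
laplace = np.sqrt(a / 2.0) * np.exp(-np.sqrt(2.0 * a) * np.abs(centers))
print(np.max(np.abs(hist - laplace)))                # small deviation between the two
```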

163 158 Intra-Picture Coding s' -4 s' -3 s' -2 s' -1 s' 0 s' 1 s' 2 s' 3 s' 4 u -3 u -2 u -1 u 0 u 1 u 2 u 3 u 4 z 3 z 2 z 1 z 0 z 0 z 1 z 2 z 3-4 Δ -3 Δ -2 Δ -Δ 0 Δ 1 Δ 3 Δ 4 Δ s Figure 5.6: Design of a uniform reconstruction quantizer for a symmetric pdf. Scalar Quantization. The most popular choice of scalar quantizers in video coding are uniform reconstruction quantizers (URQs). Their design is illustrated in Figure 5.6. URQs have the property that the reconstruction levels {s k } are equally spaced. The distance between two neighboring levels s k and s k+1 is referred to as quantization step size. One of the levels is equal to zero, s 0 = 0. Hence, the complete set of reconstruction levels {s k } is uniquely specified by the quantization step size. The decoder mapping of quantization indexes q to reconstructed transform coefficients t is given by the simple formula t = q. (5.15) While the reconstruction levels s k are specified by a single parameter, the decision thresholds u k, which are only used at the encoder side, can still be arbitrarily selected. The encoder has the freedom to optimize the decision thresholds according to the statistical properties of the input signal. As illustrated in Figure 5.6, a decision threshold u can also be specified by a decision offset z, which represents the distance z to the next reconstruction level s toward zero. Since the pdfs of the transform coefficients are symmetric around zero (with exception of the DC coefficient for original sample blocks), we restrict our considerations to symmetric designs and use the convention shown in Figure 5.6 for numbering the decision offsets z k. The state-of-the-art video coding standards H.264 MPEG-4 AVC and H.265 MPEG-H HEVC exclusively use URQs. Older video coding standards, such as H.262 MPEG-2 Video, also specify modified URQs for which the distances between the reconstruction level s 0 = 0 and the first non-zero reconstruction levels s 1 and s 1 are increased to three

164 5.1. Transform Coding of Sample Blocks 159 halves of the quantization step size. In the following analysis of coding efficiency, we will restrict our considerations to URQs. As discussed above, the distribution of transform coefficients is often modeled using a zero-mean Laplacian pdf. The Laplacian pdf is one of the few pdfs for which the optimal scalar quantizer (for squarederror distortion measures) can be analytically derived. In [247], Sullivan showed that, with exception of the dead zone (center quantization interval), the ECSQ for a Laplacian pdf is characterized by constant distances between neighboring reconstruction levels, s k+1 s k = for k 0 k 1, z k = z for k 0. The reconstruction levels in the center of the pdf are closer together, i.e., s 0 s 1 = s 1 s 0 <. In the same paper [247], it is also shown that, for a zero-mean Laplacian pdf, the SNR gap between a simple URQ with constant decision offsets (z k = const) and the optimal ECSQ is always less than db (for the entire range of bit rates). In the following, we discuss an algorithm for designing URQs and experimentally compare the developed URQs with optimal ECSQs. The distortion D and rate R of an URQ are given by D = k R = k u k+1 u k (s k ) 2 f(s) ds, (5.16) l k u k+1 u k f(s) ds, (5.17) where f(s) represents the pdf of the input signal. Given the decision thresholds u k, the rate is minimized (i.e., it achieves the entropy limit), if the codeword lengths l k are set according to l k = log 2 u k+1 u k f(s) ds. (5.18) For given codeword lengths l k and a given quantization step size, the optimal decision thresholds can be found by minimizing the Lagrangian cost function D + λ R. Since URQs only restrict the choice

165 160 Intra-Picture Coding of reconstruction levels, the optimality condition is the same as for ECSQs [301]. Using the relationship s k = k, we obtain ( u k = k 1 ) + λ 2 2 (l k l k 1 ). (5.19) For given decision thresholds, the choice of the reconstruction levels, which are determined by the quantization step size, only influences the distortion D. By setting the derivative of D with respect to the quantization step size equal to zero, dd d = 2 k k 2 u k+1 u k+1 f(s) ds 2 k s f(s) ds = 0, (5.20) u k k u k we obtain the optimal quantization step size = k k k uk+1 u k s f(s) ds. (5.21) uk+1 k 2 f(s) ds Using (5.18), (5.19), and (5.21), we can construct on iterative algorithm for designing URQs. Given a Lagrange parameter λ and a pdf f(s), the algorithm consists of the following steps: u k 1. Choose an initial quantization step size = init and an initial set of codeword lengths {l k }. 2. Select the decision thresholds {u k } according to (5.19). 3. Select the quantization step size according to (5.21). 4. Set the codeword lengths {l k } according to (5.18). 5. Repeat the previous three steps until convergence. Our experimental investigations showed that the algorithm converges slowly, but reliably if the initial quantization step size init is larger than the optimum quantization step size. However, if the selected initial quantization step size init is too small, some of the quantization intervals may become vanishing small (u k+1 u k 0) and the algorithm approaches a local minimum. We want to point out that the described

166 5.1. Transform Coding of Sample Blocks 161 design algorithm can also be used with a (sufficiently large) training set, instead of a given pdf. In this case, the integrals in the equations (5.18) and (5.21) have to be replaced with corresponding sums over the samples of the training set. For simplifying the encoder s decision process, it is often desirable to restrict the selection of the decision thresholds u k. A particular simple encoder is obtained if all decisions offsets z k have the same value. In another configuration, we may want to determine an independent decision offset z 0, but use a constant value for the remaining decision offsets. Let us assume, we want to use the same value z for all decision offsets z k with k k 0. By setting the derivative of the Lagrangian cost function D + λ R with respect to z equal to zero, we can derive an iterative algorithm. For a symmetric pdf f(s), the iteration rule z (i+1) = λ 2 2 k k 0 f(s k + z (i) ) (l k+1 l k ) k k 0 f(s k + z (i) ) (5.22) is obtained, where the superscripts denote the iteration index. For constructing an URQ with restricted decision thresholds, only step 2 of the presented algorithm has to be modified. The decision thresholds that correspond to decision offsets z k, with k k 0, are determined using the iteration given in (5.22); the starting point z (0) can be set equal to 1/2. The remaining decision thresholds are determined according to (5.19) as in the original design algorithm. For investigating the coding efficiency of scalar quantization with uniformly spaced reconstruction levels, we constructed different URQs using the described algorithm and compared them to optimal ECSQs, which were constructed using the method described in the source coding part [301]. The diagrams in Figure 5.7 show the measured SNR losses for different URQ designs relative to optimal ECSQs for Laplacian and Gaussian pdfs with zero mean. The unrestricted URQ design, for which all decision thresholds are independently optimized, is labeled as URQ (opt.) in the diagrams. The labels URQNT, with N being equal to 1, 2, or 3, indicate that the corresponding URQs use N different decision offsets. While all decision offsets z k with k N 1

167 162 Intra-Picture Coding SNR loss relative to ECSQ [db] (a) URQ1T URQ2T URQ (opt.) rate (entropy) [bit/sample] SNR loss relative to ECSQ [db] (b) URQ2T URQ (opt.) rate (entropy) [bit/sample] URQ1T (maximum at db) URQ3T Figure 5.7: Comparison of the coding efficiency of URQs and optimal ECSQs for a zero-mean Laplacian (a) and Gaussian (b) pdf. are set to the same value z, the offsets z k with k < N 1 (for N > 1) are independently optimized. For the Laplacian pdf, the maximum SNR loss of the optimal URQ relative to the ECSQ design is about db. Even for the URQ with a constant decision offset z k = z, the SNR loss is always less than db 4. By using two different decision offsets z 0 and z 1 = z 2 = = z, we already achieve virtually the same coding efficiency as for the optimal URQ. The experiment with the Gaussian pdf shows larger SNR losses. In particular the simple URQ with a single decision offset yields SNR losses of up to db. However, by using two decision offsets, the maximum SNR loss is already significantly decreased to about db. For the optimal URQ, the maximum SNR loss is about db; this value can also be achieved by an URQ with three independent decision offsets. The experimental results indicate that, for typical pdfs, the losses in coding efficiency that result from restricting the scalar quantizers to URQs is very small. The simple encoder mapping with a single decision offset z k = z may, however, result in noticeable losses compared to the optimal encoder operation. It should be noted that an URQ with a single decision offset corresponds to an encoder mapping according to (4.27), whereas the encoder mapping in (4.28) uses two decision offsets. Experimental results for real video data will be presented below. 4 The same value has been reported in [247] (see above).

Bit Allocation. A problem that we have not examined so far is the bit allocation among transform coefficients. For URQs, the bit rate is determined by the chosen quantization step size and the selected encoder mapping. If we assume that all scalar quantizers are operated using a certain given encoder mapping, the bit allocation problem reduces to the question how we should select the quantization step sizes Δ_k for the transform coefficients t_k of a transform block. In the source coding part [301], we showed that an optimal bit allocation is achieved if all scalar quantizers are operated at the same slope of their operational distortion rate functions D_k(R_k). Since the Lagrange multiplier λ_k that is used for constructing a scalar quantizer represents the negative slope of the distortion rate function D_k(R_k), an optimal bit allocation is achieved if all scalar quantizers are constructed using the same Lagrange multiplier, λ_k = λ. In order to derive an approximate rule for optimally selecting the quantization step sizes Δ_k, we assume that all component quantizers are operated at bit rates for which the high-rate approximations are valid. The high-rate approximation for the operational distortion rate functions D_k(R_k) of the component quantizers is given by

D_k = ε_k² σ_k² 2^(−2 R_k),   (5.23)

where σ_k² denotes the variance for the corresponding transform coefficient and ε_k² represents a constant factor that depends on the shape of the pdf and the actually used quantizer. For an optimal bit allocation, we require

dD_k / dR_k = −2 ln 2 · ε_k² σ_k² 2^(−2 R_k) = −2 ln 2 · D_k = const.   (5.24)

By using the high-rate approximation for the distortions D_k of uniform reconstruction quantizers,

D_k = Δ_k² / 12,   (5.25)

we obtain the simple high-rate bit allocation rule

Δ_k = const.   (5.26)

For URQs and high rates, an optimal bit allocation is achieved if all quantizers use the same quantization step size.
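A minimal sketch of the resulting quantization scheme is given below: a uniform reconstruction quantizer with a single decision (rounding) offset, applied with the same step size to every coefficient of a block. The offset value of 1/3 is only a typical illustrative choice for such a dead-zone quantizer, and the coefficient values are made up for the example.

```python
import numpy as np

def urq_quantize(t, step, rounding_offset=1.0 / 3.0):
    # encoder mapping with a single decision offset: q = sign(t) * floor(|t|/Delta + f),
    # where the rounding offset f corresponds to 1 - z/Delta for the decision offset z used in the text
    return (np.sign(t) * np.floor(np.abs(t) / step + rounding_offset)).astype(int)

def urq_reconstruct(q, step):
    # decoder mapping t' = Delta * q, cf. (5.15)
    return q * step

coeffs = np.array([[310.2, -41.7, 8.3],
                   [-25.1, 4.9, -0.7],
                   [6.2, -1.4, 0.2]])
levels = urq_quantize(coeffs, step=16.0)
print(levels)
print(urq_reconstruct(levels, step=16.0))
```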

169 164 Intra-Picture Coding In video coding, most of the component quantizers are typically operated at bit rates for which the high-rate approximations are not valid. For evaluating the coding efficiency of transform coding with URQs and the simple bit allocation rule (5.26), we run simulations with an 8 8 DCT for both blocks of original and residual samples. The following quantization schemes were tested: Optimal ECSQ: The transform coefficients are quantized using ECSQs and optimal bit allocation. This is the best possible configuration for an independent scalar quantization. URQ (same λ): The transform coefficients are quantized using unrestricted URQs and optimal bit allocation. URQ (same ): The transform coefficients are quantized using unrestricted URQs with a constant quantization step size. URQNT (same ): The transform coefficients are quantized using URQs with N independent decision offsets (see above). For all quantizers the same quantization step size is used. For constructing and evaluating the quantizers, 10 pictures of the HD test sequences listed in Appendix A.1 were used. The bit rate is determined according to (5.6), i.e., by calculating the first-order entropies for the resulting transform coefficient levels. The simulation results for original and residual blocks of two selected test sequences are shown in Figure 5.8. In comparison to ECSQs with optimal bit allocation, the usage of URQs results in coding efficiency losses of about 0.01 to 0.03 db. By additionally using the same quantization step size for all component quantizers, the SNR losses relative to the optimal configuration are slightly increased to about 0.01 to 0.04 db. A restriction to URQs with two decision offsets (URQ2T) has only a minor impact on coding efficiency; the maximum SNR loss in our simulations was approximately 0.01 db. The usage of URQs with a single decision offset (URQ1T) yields SNR losses of up to 0.06 db. The experimental results demonstrate that a quantization of DCT coefficients with URQs of a constant quantization step size yields a coding efficiency that is only slightly worse than that of the best possible

170 5.1. Transform Coding of Sample Blocks 165 Loss rel. to optimal ECSQ [db] (a) Loss rel. to optimal ECSQ [db] (c) Cactus (original pictures) 0.07 URQ1T (same Δ) URQ2T (same Δ) URQ (same Δ) 0.01 URQ (same λ) bit rate (first-order entropy) [Mbit/s] Cactus (residual pictures) URQ1T (same Δ) 0.04 URQ2T (same Δ) URQ (same Δ) URQ (same λ) bit rate (first-order entropy) [Mbit/s] Loss rel. to optimal ECSQ [db] (b) Loss rel. to optimal ECSQ [db] (d) Kimono (original pictures) URQ1T (same Δ) 0.04 URQ2T (same Δ) 0.03 URQ (same Δ) URQ (same λ) bit rate (first-order entropy) [Mbit/s] Kimono (residual pictures) 0.06 URQ1T (same Δ) 0.05 URQ2T (same Δ) 0.04 URQ (same Δ) 0.03 URQ (same λ) bit rate (first-order entropy) [Mbit/s] Figure 5.8: Comparison of different scalar quantizers and bit allocation methods for transform coding of 8 8 blocks using the DCT: (a,b) Blocks of original samples; (c,d) Blocks of residual samples. The diagrams show the SNR loss relative to the best possible independent quantization using ECSQs and optimal bit allocation. independent quantization. Since this approach is also characterized by a very simple decoder operation and requires only a single parameter for signaling the reconstruction levels of all transform coefficients to the decoder, it is widely used in video coding. Quantization Parameter and Weighting Matrices. The DCT coefficients are typically quantized using URQs. The quantization step size can often be modified on a block basis. For that purpose, video coding standards provide a predefined set of quantization step sizes and the used step size is transmitted using an index into this set. This index is referred to as quantization parameter (QP). Older standards typically specify sets of linearly increasing step sizes, = const QP. In the newer standards H.264 MPEG-4 AVC and H.265 MPEG-H HEVC, the relationship between the quantization

171 166 Intra-Picture Coding parameter and the quantization step size is approximately const 2 QP/6. (5.27) Since the bit rate that is required for transmitting the transform coefficient levels increases roughly linearly with the quantization step size, this exponential relationship between and QP allows a better adjustment of the bit rate over the entire supported range. Video coding standards often provide the possibility to use different quantization step sizes for the individual transform coefficients. This is achieved by specifying so-called quantization weighting matrices w, which can be selected by the encoder, typically on a sequence or picture level, and are transmitted as part of the bitstream. A matrix w has the same size as the corresponding blocks t of transform coefficients. The quantization step size ij for a transform coefficient t ij is given by ij = w ij block, (5.28) where block denotes the quantization step size (signaled by the quantization parameter QP) for the considered block. Even though the quantization weighting matrices could be used for optimizing the bit allocation, the main intention is to provide a possibility for introducing the quantization noise in a perceptual meaningful way. By using appropriate weighting matrices w, the spatial contrast sensitivity of human vision (see Section 2.2.3) can be exploited for achieving a better trade-off between bit rate and subjective quality. Typical weighting matrices specify larger factors for high-frequency components. As an example, the default weighting matrix for 8 8 intra-coded luma blocks in H.265 MPEG-H HEVC is given by w intra = (5.29) Usually different quantization weighting matrices can be used for luma and chroma blocks as well as for intra-coded and inter-coded blocks.
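The sketch below ties these two mechanisms together: the block step size is derived from the quantization parameter via the approximate relation Δ ≈ const · 2^(QP/6) of (5.27), and a weighting matrix scales it per coefficient as in (5.28). The constant delta0 and the small example weighting matrix are illustrative assumptions, not values taken from a standard.

```python
import numpy as np

def step_size_from_qp(qp, delta0=0.625):
    # Delta ~ const * 2**(QP/6): adding 6 to QP doubles the quantization step size
    return delta0 * 2.0 ** (qp / 6.0)

def coefficient_step_sizes(qp, weighting_matrix):
    # Delta_ij = w_ij * Delta_block, i.e., coarser quantization where w_ij is large
    return weighting_matrix * step_size_from_qp(qp)

w = np.array([[1.0, 1.0, 1.2, 1.5],
              [1.0, 1.1, 1.4, 1.8],
              [1.2, 1.4, 1.9, 2.5],
              [1.5, 1.8, 2.5, 3.4]])                    # larger weights for high-frequency coefficients
print(step_size_from_qp(22), step_size_from_qp(28))     # the second value is twice the first
print(coefficient_step_sizes(28, w))
```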

For the coding experiments presented in this text, we measure the video quality using the MSE. For that reason, we do not apply quantization weighting matrices, but use the same quantization step size for all transform coefficients of a block.

5.1.3 Entropy Coding

In the following, we investigate important statistical properties of the transform coefficient levels and discuss entropy coding techniques that are used in practical image and video codecs.

Distribution of Transform Coefficient Levels. The probability mass function (pmf) for a transform coefficient level q depends on the probability density function (pdf) of the corresponding transform coefficient t and the decision thresholds of the used scalar quantizer. As discussed in Section 5.1.2, the distribution of transform coefficients t can often be well modeled using a zero-mean Laplacian pdf,

f(t) = (a/2) · e^{−a|t|}.  (5.30)

If a symmetric pdf with zero mean is quantized using a quantizer with symmetric decision thresholds, the pmf of the generated quantization indexes is also symmetric around zero. Hence, the transform coefficient levels q can be efficiently represented by their absolute values |q| and, for non-zero absolute values, their signs. The signs have a binary pmf {1/2, 1/2} and, thus, the optimal lossless code consists of a single bit. The pmf of the absolute values is obtained by quantizing the pdf

f(v) = a · e^{−a v}  (5.31)

of the absolute transform coefficients. If we assume that a simple URQ with quantization step size Δ and a single decision offset z is used, we obtain the pmf p(k), with k > 0,

p(0) = ∫_0^z a e^{−a v} dv = 1 − e^{−a z},  (5.32)

p(k) = ∫_{(k−1)Δ+z}^{kΔ+z} a e^{−a v} dv = e^{−a((k−1)Δ+z)} · (1 − e^{−aΔ}).  (5.33)

Table 5.1: Unary binarization of the pmf p(k) given in (5.34).

  k     pmf p(k)                binarization: b_0 b_1 b_2 b_3 b_4 b_5 b_6 ...
  0     1 − p_a                 1
  1     p_a (1 − p)             0  1
  2     p_a p (1 − p)           0  0  1
  3     p_a p^2 (1 − p)         0  0  0  1
  4     p_a p^3 (1 − p)         0  0  0  0  1
  5     p_a p^4 (1 − p)         0  0  0  0  0  1
  6     p_a p^5 (1 − p)         0  0  0  0  0  0  1
  ...   p_a p^{k−1} (1 − p)     0  0  0  0  0  0  0  ...
  bin probability p_i = P(b_i = 0):   p_a  p  p  p  p  p  p

By introducing the probabilities p_a = e^{−az} and p = e^{−aΔ}, the pmf p(k) for the absolute values of the transform coefficient levels is given by

p(k) = { 1 − p_a                  : k = 0
       { p_a · p^{k−1} · (1 − p)  : k > 0.   (5.34)

Note that the conditional pmf p(k | k > 0) = p^{k−1} (1 − p) represents a geometric distribution. With binary arithmetic coding, the absolute values could basically be transmitted without any redundancy. A particularly simple approach is obtained if a unary binarization is used (see Table 5.1). Then, all bins (binary decisions) b_i, with i > 0, have the same pmf {p, 1 − p} and can be coded using the same context model. Only for the first bin b_0, which indicates whether a transform coefficient level is equal to zero, a different context model has to be used. If the encoder uses N independent decision offsets z, the first N bins have a different pmf, in which case N + 1 context models are required for an optimal lossless coding. Since, for usual coding conditions, most of the transform coefficient levels are zero or have very small values, the redundancy of a binary arithmetic coding with unary binarization is generally very low if a separate context model is used for the first bin or the first few bins. In this context, the actual distribution of transform coefficients and the used decision thresholds play only a minor role. For larger absolute values, even the binarization can be modified without any noticeable impact on coding efficiency.
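As a small numerical check of (5.34) and Table 5.1, the following Python sketch evaluates the level pmf and its unary binarization; the parameter values a, Δ, and z as well as the function names are purely illustrative.

```python
import math

a, delta, z = 1.0, 0.8, 0.4            # example Laplacian parameter, step size, offset
p_a = math.exp(-a * z)                 # probability that a level is non-zero
p   = math.exp(-a * delta)             # "continue" probability of the geometric tail

def pmf(k: int) -> float:
    """p(k) of (5.34): 1 - p_a for k = 0, p_a * p**(k-1) * (1-p) for k > 0."""
    return 1.0 - p_a if k == 0 else p_a * p ** (k - 1) * (1.0 - p)

def unary_bins(k: int) -> list:
    """Unary binarization of Table 5.1: k zero-bins followed by a terminating one."""
    return [0] * k + [1]

# With this binarization, P(b_0 = 0) = p_a and P(b_i = 0 | bin i is coded) = p for
# i > 0, so a single context suffices for all bins except the first one.
assert abs(sum(pmf(k) for k in range(1000)) - 1.0) < 1e-12
```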

174 5.1. Transform Coding of Sample Blocks (a) (b) (c) Figure 5.9: Scanning of transform coefficient levels: (a) Probabilities that individual transform coefficient levels are not equal to zero (example for 4 4 residual blocks); (b) Zig-zag scan; (c) Diagonal scan in H.265 MPEG-H HEVC (as will be discussed later, the levels are actually coded in reverse scan order). However, with simple variable length codes, which map each transform coefficient level to a codeword, a nearly optimal lossless coding is not possible. The reason is that the probability p(0) that a transform coefficient level is equal to zero is often significantly greater than 0.5. Hence, variable length codes can only be efficient when the codewords are assigned to combinations of multiple transform coefficient levels. Scanning. The transform coefficient levels {q ij } of a block have to be transmitted in a certain order that is known to the decoder. If the individual levels were coded independently of each other, the chosen order would not have any impact on coding efficiency. However, as we will further discuss below, the entropy coding methods that are used in practice exploit some dependencies between the levels of a transform block. Let P ij represent the probability that the level q ij is unequal to zero. It is typically advantageous to arrange the transform coefficient levels {q ij } of a block in the order of decreasing probabilities P ij. The probability P ij usually decreases with increasing frequency indexes i and j. An example for 4 4 residual blocks and typical coding conditions is shown in Figure 5.9(a). A signal-independent scan that approximately arranges the transform coefficient levels in the desired order is the zig-zag scan. This scan, which is illustrated in Figure 5.9(b) for the example of a 4 4 block, is used in most video coding standards. H.265 MPEG-H HEVC specifies the scan depicted in Figure 5.9(c). It has similar properties as the zig-zag scan but provides some benefits for certain implementations [259].
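For reference, one common way to generate the zig-zag scan of Figure 5.9(b) programmatically is sketched below; the actual scan patterns of the standards are normatively tabulated, so this helper is only an illustration.

```python
def zigzag_scan(n: int) -> list:
    """Return the (i, j) frequency positions of an n x n block in zig-zag order."""
    order = []
    for d in range(2 * n - 1):                       # anti-diagonals with i + j = d
        cells = [(i, d - i) for i in range(n) if 0 <= d - i < n]
        # even diagonals are traversed from high to low first index, odd ones reversed
        order.extend(sorted(cells, reverse=(d % 2 == 0)))
    return order

print(zigzag_scan(4)[:6])    # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```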

175 170 Intra-Picture Coding Statistical Dependencies. If we assume that the applied transform generates independent transform coefficients, the quantization indexes were also independent of each other. In that case, a lossless coding that treats the individual levels separately could achieve the bit rate limit given by the block entropy. However, as will be demonstrated by coding experiments, the transform coefficient levels are not independent of each other. Hence, the coding efficiency can be improved by using lossless coding methods that exploit conditional or joint probabilities. The statistical dependencies between transform coefficient levels for different frequency components could be evaluated by comparing marginal and conditional (or joint) pmfs. Due to the large signal space involved, we used a different method and investigated the entropy limits for selected coding concepts, which are designed to exploit potential dependencies between the levels of a transform block. The investigated entropy coding methods are motivated by approaches that are found in actual video coding standards: CBF: At the beginning of each transform block, a so-called coded block flag (CBF) is transmitted. If the CBF is equal to zero, it signals that all levels of the block are equal to zero; otherwise, the levels are transmitted using independent codes; EOB: At the beginning of a block and after each non-zero level, an end-of-block flag (EOB) is transmitted. If it is equal to one, it signals that all following levels inside the block are equal to zero and are not transmitted. The codes that are used for transmitting the EOBs and levels depend on the scan position. Note that this concept represents a generalization of the CBF approach; LastPos: At the beginning of a block, the scan position of the last non-zero level in scanning order (LastPos) is transmitted. The code for transmitting LastPos includes a special value for signaling that the block does not contain any non-zero levels. Subsequently, the levels are transmitted in scanning order up to the signaled last scan position. The codes that are used for transmitting the levels depend on the scan position and on whether the scan position is the last scan position;

CtxNumSig: The transform coefficient levels are transmitted in scanning order. However, the levels are coded using conditional codes. As condition, the number of already coded non-zero levels (for the current block) is used. Hence, for each scan position k, one of k different codeword tables is adaptively selected. This concept is tested for a transmission in scanning order (forward) and in reverse scanning order (backward). It should be noted that the conditional coding in reverse scanning order represents a generalization of the LastPos method.

We investigated the described concepts for a transform coding using an 8 × 8 DCT and optimal URQs with a constant quantization step size. All tested approaches used the zig-zag scan. For comparing the different entropy coding methods, we did not perform an actual coding, but calculated the entropy limits. The statistics have been measured over ten pictures of selected HD test sequences.

The diagrams in Figure 5.10 compare the calculated entropy limits with the sum of the marginal entropies for the individual transform coefficient levels (which represents the minimum average codeword length for an independent coding). The experimental results clearly demonstrate that there are statistical dependencies between the transform coefficient levels of a block. With the exception of the CBF concept for original sample blocks, all tested approaches reduced the entropy limit, which would not be possible if the transform coefficient levels were independent of each other. The largest gains are observed for the conditional coding in reverse scanning order. But, in particular for the interesting bit-rate range, simpler concepts such as the addition of an end-of-block flag or the transmission of the last scan position already provide the major part of the gains.

Run-Level Coding. In the image compression standard JPEG [127] and the early video coding standards H.261 [119], MPEG-1 Video [112], and H.262 MPEG-2 Video [122], the transform coefficient levels are transmitted using run-level coding. In run-level coding, the transform coefficient levels of a block are converted into an ordered sequence of run-level pairs, which are eventually mapped to variable-length codewords.

(Figure 5.10, panels (a)–(d): bit-rate increase [%] relative to an optimal independent coding, plotted over the bit rate (first-order entropy) [Mbit/s], for Cactus and Kimono, original and residual pictures; compared concepts: CBF, EOB, LastPos, CtxNumSig (forward), and CtxNumSig (backward).)

Figure 5.10: Comparison of entropy limits for different lossless coding concepts for transform coefficient levels: (a,b) 8 × 8 blocks of original samples; (c,d) 8 × 8 blocks of residual samples. The diagrams show the resulting bit-rate increase relative to an optimal independent coding of the transform coefficient levels.

The transform coefficient levels are processed in scanning order, typically using the zig-zag scan described above. For each run-level pair, the run indicates the number of transform coefficient levels equal to zero that precede the next non-zero transform coefficient level in scanning order and the level represents the value of this next non-zero transform coefficient level. Since the level is always unequal to zero, the codeword table includes a special end-of-block (eob) symbol, which specifies that all remaining levels of the block are equal to zero. For illustrating the conversion into run-level pairs, consider the following scanned sequence of transform coefficient levels:

5, −3, 0, 0, 0, 1, 0, −1, 0, 0, −1, 0, ..., 0.

A conversion into run-level pairs (run, level) yields

(0,5) (0,−3) (3,1) (1,−1) (2,−1) (eob).
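The conversion can be sketched in a few lines of Python; the example sequence used here is the one reconstructed above (with the trailing zeros omitted), and the function name is only illustrative.

```python
def run_level_pairs(levels: list) -> list:
    """Convert a scanned level sequence into (run, level) pairs plus an eob symbol."""
    pairs, run = [], 0
    for q in levels:
        if q == 0:
            run += 1                  # count zeros preceding the next non-zero level
        else:
            pairs.append((run, q))
            run = 0
    pairs.append("eob")               # all remaining levels are zero
    return pairs

print(run_level_pairs([5, -3, 0, 0, 0, 1, 0, -1, 0, 0, -1]))
# [(0, 5), (0, -3), (3, 1), (1, -1), (2, -1), 'eob']
```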

178 5.1. Transform Coding of Sample Blocks 173 Each run-level pair (including the eob symbol) is mapped to a variablelength codeword. The standards typically specify fixed codeword tables. Depending on the actual standard, different codeword tables may be used for intra and inter blocks. In JPEG [127], the run-level pairs are actually decomposed into run-category pairs and a level refinement. The category specifies a range for the level and the refinement information identifies the level inside this range. The run-category pairs are coded using a variable-length code (the codeword table is transmitted). For transmitting the level refinement, fixed-length codes are used, where the codeword length is determined by the category. In run-level coding, a variable number of coding symbols is mapped to variable-length codewords. It represents an example of a V2V code, which we discussed in the source coding part [301]. Due to the joint coding of multiple transform coefficient levels, certain dependencies inside transform blocks can be exploited. Another important aspect is that the probability that a transform coefficient level is equal to zero is typically considerably larger than 1/2. Hence, a direct mapping of transform coefficient levels to variable-length codewords would be very inefficient. By combining multiple levels, the probabilities of all runlevel pairs become significantly smaller than 1/2. As a consequence, the redundancy of the variable-length code is reduced. Run-length coding of transform coefficient levels is often combined with the signalization of a so-called coded block pattern (CBP). In the above mentioned video coding standards, the CBP is a syntax element that specifies which 8 8 transform blocks of a macroblock contain non-zero transform coefficient levels. Since transform blocks without any non-zero levels are already indicated by the CBP, often a slightly different codeword table (without the end-of-block symbol) is used for coding the first run-level pair of a transform block. For intra blocks, the transform coefficient levels for DC coefficients are often treated separately (using an additional codeword table). In that case, the CBP and the run-level pairs represent only the AC transform coefficient levels. For more details on run-level coding in video coding standards, the reader is referred to the specifications of H.261 [119], MPEG-1 Video [112], and H.262 MPEG-2 Video [122].

Advanced Run-Level Coding Techniques. The video coding standards H.263 [120] and MPEG-4 Visual [111] use a modified version of run-level coding, which is referred to as run-level-last coding. With this technique, the scanned sequence of transform coefficient levels is converted into a sequence of 3D events (run, level, last) and the variable-length codewords are assigned to these events. The run and level have the same meaning as in conventional run-level coding. The element last represents a flag, which indicates whether the non-zero transform coefficient level of the run-level representation is the last non-zero level in scanning order. For the previously considered example of transform coefficient levels (5, −3, 0, 0, 0, 1, 0, −1, 0, 0, −1, 0, ..., 0), we obtain the following sequence of (run, level, last) events:

(0,5,0) (0,−3,0) (3,1,0) (1,−1,0) (2,−1,1).

Since the last flag specifies the position of the last non-zero level in scanning order, an end-of-block symbol is not included in the variable-length code. The signaling of a coded block flag or coded block pattern is, however, required, since the run-level-last code can only be used if there is at least one non-zero transform coefficient level. The main advantage of run-level-last coding is that additional conditional probabilities can be taken into account for constructing the codeword table. The probability that a certain transform coefficient level represents the last non-zero level in scanning order highly depends on its absolute value.

A further improved version of run-level coding is used in H.264 MPEG-4 AVC [121]. Due to the usage of conditional variable-length codes, it is referred to as context-based adaptive variable length coding (CAVLC). In contrast to the previously discussed run-level coding techniques, the runs and levels are coded separately. First the values of the non-zero transform coefficient levels are transmitted and then their locations are specified by coding the runs of transform coefficient levels equal to zero. Furthermore, for most syntax elements, one of multiple codeword tables is selected based on already transmitted data.

For coding the values of the non-zero transform coefficient levels, the following observations are exploited:

The absolute values of the non-zero levels typically decrease in scanning order. In particular, the non-zero levels at the end of the scan often have values of 1 or −1;

It is likely that the number of non-zero levels in a block is similar to the number of non-zero levels in neighboring blocks.

The values of the non-zero levels are transmitted as follows:

1. A syntax element coeff_token is transmitted. It specifies the number of non-zero transform coefficient levels in the transform block as well as the number of trailing ones, which are the non-zero levels with a magnitude of one at the end of the scanned sequence. The number of trailing ones is clipped to a maximum value of 3. The standard specifies multiple codeword tables for transmitting coeff_token; the used table is selected based on the number of non-zero levels in neighboring transform blocks;

2. The signs of the trailing ones are coded in reverse scanning order using one bit per sign (a bit equal to 1 indicates a level of −1);

3. For all non-zero levels that are not trailing ones, the actual values are transmitted in reverse scanning order. The used codeword table depends on the values of already transmitted levels.

Finally, the run information is coded by the following steps:

1. The total number of zeros before the last non-zero level in scanning order is coded. The selected codeword table depends on the already transmitted number of non-zero coefficients;

2. The runs (i.e., the number of zeros directly preceding non-zero levels) are transmitted in reverse scanning order. Since the maximum possible value is determined by the total number of zeros and the already transmitted runs, the used codeword table is selected accordingly. If there are no zeros left, no data are transmitted. For the first non-zero level in scanning order, the run is not transmitted; it can be derived based on the transmitted data.
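The following Python sketch derives these CAVLC symbols from a scanned level sequence; it follows the textual description above rather than the exact syntax and binarization of the standard, and the function and key names are illustrative. The worked example that follows can be used as a check.

```python
def cavlc_symbols(levels: list) -> dict:
    """Derive the CAVLC symbols described above from a scanned level sequence."""
    nz = [(pos, q) for pos, q in enumerate(levels) if q != 0]
    total_coeffs = len(nz)
    # trailing ones: up to three levels with magnitude one at the end of the scan
    t1 = 0
    while t1 < min(3, total_coeffs) and abs(nz[total_coeffs - 1 - t1][1]) == 1:
        t1 += 1
    trailing = [q for _, q in nz[total_coeffs - t1:]][::-1]      # reverse scan order
    sign_flags = [1 if q < 0 else 0 for q in trailing]           # 1 indicates -1
    remaining = [q for _, q in nz[:total_coeffs - t1]][::-1]     # reverse scan order
    total_zeros = nz[-1][0] + 1 - total_coeffs                   # zeros before last non-zero
    # run_before values in reverse scan order; stop when no zeros are left and never
    # transmit the run of the first non-zero level in scan order (it can be derived)
    runs, zeros_left, prev_pos = [], total_zeros, None
    for pos, _ in reversed(nz):
        if prev_pos is not None:
            if zeros_left == 0:
                break
            run = prev_pos - pos - 1
            runs.append(run)
            zeros_left -= run
        prev_pos = pos
    return dict(coeff_token=(total_coeffs, t1), sign_flags=sign_flags,
                remaining_levels=remaining, total_zeros=total_zeros, runs=runs)

print(cavlc_symbols([5, -3, 0, 0, 0, 1, 0, -1, 0, 0, -1]))
# {'coeff_token': (5, 3), 'sign_flags': [1, 1, 0], 'remaining_levels': [-3, 5],
#  'total_zeros': 6, 'runs': [2, 1, 3]}
```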

For our example of a sequence of transform coefficient levels, 5, −3, 0, 0, 0, 1, 0, −1, 0, 0, −1, 0, ..., 0, the following data are transmitted:

coeff_token : (5, 3)   [ 5 levels, 3 trailing ones ]
signs of trailing ones : − − +   [ in reverse scanning order ]
remaining levels : −3  5   [ in reverse scanning order ]
total number of zeros : 6
runs : 2  1  3   [ in reverse scanning order ].

The advantage of CAVLC in comparison to the above discussed run-level and run-level-last coding is that conditional probabilities are used for designing the codeword tables, so that dependencies between the transform coefficient levels inside a transform block as well as between neighboring transform blocks can be exploited. CAVLC has been designed for 4 × 4 transform blocks (and 2 × 2 chroma DC blocks), which were exclusively used in the first version of H.264 MPEG-4 AVC. For coding the levels of the later introduced 8 × 8 transform blocks, the codeword tables are re-used. The levels of an 8 × 8 block are partitioned into four sets of 16 levels and each set is treated as a conventional 4 × 4 transform block. For more details on CAVLC, the reader is referred to the standard text [121] and the original proposal [12].

Context-based Adaptive Binary Arithmetic Coding. In addition to variable-length coding, H.264 MPEG-4 AVC specifies a second method for the lossless compression of coding symbols, which is referred to as context-based adaptive binary arithmetic coding (CABAC). When CABAC is selected as the entropy coding method, all syntax elements are mapped onto a series of binary decisions. The binary decisions are also called bins. The resulting bin sequence is transmitted using binary arithmetic coding, which we introduced in the source coding part [301]. In contrast to conventional variable-length coding, arithmetic coding has the advantage that it is also very efficient for pmfs that include a symbol with a large probability (greater than 1/2). Moreover, conditional and adaptive probability models can be elegantly incorporated.

182 5.1. Transform Coding of Sample Blocks 177 In CABAC, each bin is associated with a probability model, which is referred to as a context. For most bins, the context represents an adaptive probability model; the associated binary pmf is updated based on the coded bin values. Conditional probabilities are exploited by switching the contexts for certain bins based on already transmitted data. The new video coding standard H.265 MPEG-H HEVC does not support variable-length coding for low-level syntax elements, but specifies CABAC as the only entropy coding method. Although the basic design of CABAC in H.265 MPEG-H HEVC is very similar to that in H.264 MPEG-4 AVC, in particular the coding of transform coefficient levels differs in several details. One reason is that H.265 MPEG-H HEVC supports more and larger transform block sizes than H.264 MPEG-4 AVC. In the following, we briefly review both approaches. For more details on CABAC and the coding of transform coefficient levels, the reader is referred to the overviews in [177, 260, 236, 186]. In the CABAC approach of H.264 MPEG-4 AVC, the coding of transform coefficient levels is split into two steps. First the number and the locations of the non-zero transform coefficient levels inside a transform block are indicated using a binary-valued significance map and then the actual values of the non-zero levels are transmitted. The significance map is coded as follows: 1. The one-bit symbol coded_block_flag is transmitted. It indicates whether the transform block includes any non-zero levels. Depending on the coded_block_flag s of neighboring transform blocks, one of four context models is selected; 2. If coded_block_flag is equal to one (i.e., there are non-zero levels in the block), the locations of the non-zero levels are specified by transmitting the binary symbols significant_coeff_flag and last_significant_coeff_flag in scanning order. The one-bit symbol significant_coeff_flag is transmitted for each scan position. A value of one indicates that the transform coefficient level at the scan position is not equal to zero. In that case, the one-bit syntax element last_significant_coeff_flag is coded directly after the significant_coeff_flag. If this flag is equal to one, it specifies

that the current non-zero level is the last non-zero level in scanning order. For our example sequence of transform coefficient levels, 5, −3, 0, 0, 0, 1, 0, −1, 0, 0, −1, 0, ..., 0, the following sequence of the symbols significant_coeff_flag and last_significant_coeff_flag (in parentheses) is obtained:

1(0) 1(0) 0 0 0 1(0) 0 1(0) 0 0 1(1).

Typically, the transmission of the significance map is terminated by a last_significant_coeff_flag equal to one. Note, however, that the flags are never transmitted for the last scan position. If the last scan position is reached and the coding of the significance map was not terminated, it is obvious that the level at the last scan position has to be non-zero. For 4 × 4 transform blocks, a separate context is used for each scan position (one for each flag). In order to keep the number of contexts reasonably small, for 8 × 8 transform blocks, the same context is used for four successive scan positions.

After the locations of the non-zero levels have been specified, their magnitudes and signs are coded in reverse scanning order. The syntax elements coeff_abs_level_minus1 specify the absolute values of the levels decremented by one (it is already known that the corresponding levels are not equal to zero) and the one-bit symbols coeff_sign_flag specify the signs. The non-binary symbols coeff_abs_level_minus1 are binarized using a combination of a unary prefix code (see Table 5.1) and an Exp-Golomb suffix code (see [177] or [121] for details). For coding the first bin of the unary prefix part, one of five defined contexts is selected. The remaining bins of the unary prefix part are coded using the same context, which is also chosen out of a set of five contexts. The selection of these two contexts depends on the number of already coded levels with a magnitude of one and the number of already coded levels with a magnitude greater than one. Note that the context selection for the unary prefix part is the reason why the level values are transmitted in reverse scanning order. The bins of the Exp-Golomb suffix part as well as the sign flags coeff_sign_flag are coded in the bypass mode of CABAC (non-adaptive context with the pmf {1/2, 1/2}).
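A minimal sketch of how the significance map symbols above can be derived from a scanned level sequence is given below, using the reconstructed example sequence padded to the 16 positions of a 4 × 4 block; it mirrors the description in the text rather than the normative decoding process.

```python
def significance_map(levels: list) -> list:
    """significant_coeff_flag per position; last_significant_coeff_flag in parentheses."""
    last_nz = max(pos for pos, q in enumerate(levels) if q != 0)
    symbols = []
    for pos, q in enumerate(levels[:-1]):        # nothing is sent for the last position
        if q == 0:
            symbols.append("0")
        else:
            last = 1 if pos == last_nz else 0
            symbols.append(f"1({last})")
            if last:
                break                            # all following levels are zero
    return symbols

seq = [5, -3, 0, 0, 0, 1, 0, -1, 0, 0, -1] + 5 * [0]     # 4 x 4 block, 16 positions
print(" ".join(significance_map(seq)))
# 1(0) 1(0) 0 0 0 1(0) 0 1(0) 0 0 1(1)
```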

Figure 5.11: Coding of transform coefficient levels in H.265 MPEG-H HEVC: (a) Partitioning of a transform block into 4 × 4 subblocks; (b) Coding order of transform coefficient levels inside a 4 × 4 subblock.

With respect to transform coding, the main difference of H.265 MPEG-H HEVC compared to H.264 MPEG-4 AVC is that more and larger transform sizes are supported. The transform coefficient level coding in H.265 MPEG-H HEVC follows a generic concept that can be applied for all supported transform sizes of 4 × 4, 8 × 8, 16 × 16, and 32 × 32 samples. Transform blocks that are larger than 4 × 4 samples are partitioned into 4 × 4 subblocks. The partitioning is illustrated in Figure 5.11(a) for the example of a transform block. The coding order of the 4 × 4 subblocks as well as the coding order of the levels inside a 4 × 4 subblock are, in general, specified by the reverse diagonal scan, which is shown in Figure 5.11(a) and Figure 5.11(b). For certain intra-coded blocks, a horizontal or vertical scan pattern is used. The coding order always starts with high-frequency locations.

The levels of a transform block are transmitted on the basis of 4 × 4 subblocks. The entropy coding includes the following steps:

1. Similar to H.264 MPEG-4 AVC, a coded_block_flag is transmitted, which signals whether there are any non-zero coefficients in the transform block. If coded_block_flag is equal to zero, no further data are transmitted for the transform block;

2. The x and y coordinates of the first non-zero level in coding order are transmitted;

3. Starting with the 4 × 4 subblock that contains the first non-zero level in coding order, the subblocks are processed in coding order as will be discussed in the following.

185 180 Intra-Picture Coding With exception of the subblocks that precede the first subblock in coding order (as indicated by the location of the first non-zero level), for each subblock, the following data are transmitted: 1. The coded_sub_block_flag indicates whether the subblock includes any non-zero levels. For the first and last subblocks (i.e., the subblocks that contain the first non-zero level or the DC level), this flag is not transmitted but inferred to be equal to one; 2. For the levels inside a subblock with coded_sub_block_flag equal to one, the significant_coeff_flag indicates whether the corresponding level is not equal to zero. This flag is only transmitted for those scan positions for which it cannot be inferred based on already transmitted data; 3. For the first 8 levels with significant_coeff_flag equal to one (in scanning order), the flag coeff_abs_level_greater1_flag is transmitted. It indicates whether the absolute value of the level is greater than one; 4. For the first level with coeff_abs_level_greater1_flag equal to one (if any), the flag coeff_abs_level_greater2_flag is transmitted. It indicates whether the absolute value of the level is greater than two; 5. For levels with significant_coeff_flag equal to one, the coeff_sign_flag specifies the sign of the level; 6. For all levels for which the absolute value is not already specified by the values of the flags coeff_abs_level_greater1_flag and coeff_abs_level_greater2_flag, the remainder of the absolute value is transmitted using the multi-level syntax element coeff_abs_level_remaining. The context that is chosen for coding the coded_sub_block_flag depends on the values of coded_sub_block_flag for already coded neighboring subblocks. The context for the significant_coeff_flag is selected based on the scan position inside a subblock, the size of the transform block, and the values of coded_sub_block_flag in neighboring subblocks. For the flags coeff_abs_level_greater1_flag and

186 5.1. Transform Coding of Sample Blocks 181 coeff_abs_level_greater2_flag, the context selection depends on whether the current subblock includes the DC level and whether any coeff_abs_level_greater1_flag equal to one has been transmitted for the previous subblock. For the coeff_abs_level_greater1_flag, it further depends on the number and the values of the already coded coeff_abs_level_greater1_flag s for the subblock. The signs coeff_sign_flag and the remainder of the absolute values coeff_abs_level_remaining are coded in the bypass mode of the binary arithmetic coder. For mapping coeff_abs_level_remaining onto a sequence of bins, an adaptive binarization scheme is used. The binarization is controlled by a single parameter, which is adapted based on already coded values for the subblock. In comparison to H.264 MPEG-4 AVC, the number of bins that are transmitted in the bypass mode of the arithmetic coder has been increased in order to reduce the complexity of the transform coefficient coding. H.265 MPEG-H HEVC also includes a so-called sign data hiding mode, in which (under certain conditions) the transmission of the sign for the last non-zero level inside a subblock is omitted. Instead, the sign for this level is embedded in the parity of the sum of the absolute values for the levels of the corresponding subblock. Note that the encoder has to consider this aspect in determining appropriate transform coefficient levels. For more details on the transform coefficient level coding in H.265 MPEG-H HEVC, the reader is referred to the standard text [123] or the overview papers [236, 186]. Comparison of Entropy Coding Techniques. For obtaining an indication of the efficiency of different lossless coding techniques, we compared the following methods in terms of their coding efficiency: Run-level coding (using the H.262 MPEG-2 Video syntax); Run-level-last coding (using the MPEG-4 Visual syntax); CABAC (using the H.265 MPEG-H HEVC syntax). Since H.262 MPEG-2 Video and MPEG-4 Visual only support 8 8 blocks, we performed the experiment with 8 8 blocks of original and residual samples. The employed transform coding consists of a DCT

and optimal URQs with a constant quantization step size. The same set of transform coefficient levels is used for all entropy coding methods. For limiting the investigation to the coding of transform coefficient levels, we did not consider syntax elements (such as the coded block pattern) that signal data for multiple transform blocks. Instead, for each tested entropy coding method, a coded block flag was transmitted (which signals whether all levels of a block are equal to zero). In our simulations, the bit rates were measured over 10 pictures of the HD test sequences specified in Appendix A.1. The results for two selected sequences are depicted in Figure 5.12.

(Figure 5.12, panels (a)–(d): bit-rate increase [%] relative to the first-order entropy, plotted over the bit rate (first-order entropy) [Mbit/s], for Cactus and Kimono, original and residual pictures; compared methods: run-level coding (H.262 MPEG-2 Video), run-level-last coding (MPEG-4 Visual), and CABAC (H.265 MPEG-H HEVC).)

Figure 5.12: Comparison of entropy coding techniques: (a,b) 8 × 8 blocks of original samples; (c,d) 8 × 8 blocks of residual samples. The diagrams show the bit-rate increase relative to the first-order entropy specified in (5.6).

The diagrams compare the actual measured bit rates with the first-order entropy limit given in (5.6). CABAC clearly outperforms the variable-length coding techniques, in particular for residual blocks (which it was designed for). Often, the CABAC bit rate is smaller than the first-order entropy.
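As a rough indication of how the comparison baseline can be computed, the sketch below estimates the first-order entropy of a set of scanned coefficient blocks as the sum of the per-position marginal entropies; this assumes that (5.6) denotes this independent-coding bound, as suggested by the comparisons above, and all names are illustrative.

```python
import math
from collections import Counter

def first_order_entropy(blocks: list) -> float:
    """Bits per block for an independent, position-wise coding of the levels."""
    num_positions = len(blocks[0])
    bits = 0.0
    for pos in range(num_positions):
        counts = Counter(block[pos] for block in blocks)
        n = sum(counts.values())
        bits += -sum(c / n * math.log2(c / n) for c in counts.values())
    return bits
```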

5.2 Intra-Picture Prediction between Transform Blocks

Transform coding represents a simple but efficient approach for exploiting statistical dependencies between the samples inside a transform block. In intra pictures, there are, however, also strong statistical dependencies between neighboring transform blocks. Such dependencies can be exploited using linear prediction. The transform coefficients or original samples of a transform block are predicted using already coded data of neighboring blocks. This type of prediction is commonly referred to as intra-picture prediction.

5.2.1 Prediction in Transform Domain

A very simple form of prediction between transform blocks is already used in the early video coding standard H.262 MPEG-2 Video [122] and the image coding standard JPEG [127]. In these standards, the DC transform coefficient level of intra blocks is predicted using the DC level of the previously coded intra block of the same color component. Hence, only the difference between the DC transform coefficient levels of the current transform block and the preceding transform block in coding order is transmitted. In H.262 MPEG-2 Video, the DC predictor is reset to a pre-defined value (the middle of the supported range for DC levels) at the start of a slice and after coding an inter-predicted block. Since the DC transform coefficients represent the average values of transform blocks, the intra DC prediction is based on the assumption that the averages of neighboring blocks have similar values.

Typically, there are more structural dependencies between neighboring blocks than just the similarity of the average sample values. A first improvement of intra-picture prediction is used in the optional Advanced Intra Coding mode of H.263, which is specified in Annex I of the standard [120]. In contrast to prior standards, it supports three different intra prediction modes, which are illustrated in Figure 5.13:

DC prediction: The DC transform coefficient of an intra transform block is predicted by the average of the reconstructed DC transform coefficients of the transform block above the current block and the transform block to the left of the current block;

189 184 Intra-Picture Coding Figure 5.13: Intra prediction in transform coefficient domain as specified in the Advanced Intra Coding mode (Annex I) of H.263: (left) DC prediction and zig-zag scan; (middle) Horizontal prediction and alternate-vertical scan; (right) Vertical prediction and alternate-horizontal scan. Horizontal prediction : The first column of the block of transform coefficients is predicted using the first column of reconstructed transform coefficients of the transform block to the left; Vertical prediction : The first row of the block of transform coefficients is predicted using the first row of reconstructed transform coefficients of the transform block above the current block. The used prediction mode is selected on a macroblock basis and signaled to the decoder. The same prediction mode is used for the four 8 8 luma blocks as well as the two 8 8 chroma blocks of a macroblock. The intra-picture prediction always uses neighboring blocks of the same color component. Since the transform blocks of inter macroblocks represent motion-compensated prediction errors, the described prediction is only used between intra blocks. If a neighboring block belongs to an inter macroblock, basically only the DC coefficient is predicted using a pre-defined value (or the DC coefficient of the other neighbor). The horizontal prediction mode is designed for transform blocks that are dominated by horizontal image structures. When this mode is used, it is implicitly assumed that the dominant horizontal structures in the current block, which are represented by the first column of the transform coefficients, are similar to the dominant horizontal structures in the block to the left. If a transform block represents mainly horizontal image structures, most of its energy is concentrated in the region of low

horizontal frequencies. As a consequence, for transform blocks that are coded using the horizontal prediction mode, the transform coefficient levels are scanned using an alternate-vertical scanning pattern, which is defined in a way that low horizontal frequency positions are scanned before low vertical frequency positions. Similarly, the vertical prediction mode is designed for transform blocks that are dominated by vertical image structures. When this mode is selected, the transform coefficient levels are scanned using an alternate-horizontal scan. The alternative scanning patterns are illustrated in Figure 5.13. For the DC prediction mode, the conventional zig-zag scan is used.

In MPEG-4 Visual [111], a very similar concept for intra-picture prediction is used. However, the syntax provides only a choice between the simple DC prediction and an improved prediction. The selection between horizontal and vertical prediction as well as the selection of the neighboring block that is used for predicting the DC coefficient is automatically determined based on data that were transmitted for neighboring blocks.

5.2.2 Spatial Intra Prediction

Since both the transform and the prediction of transform coefficients are linear operations, the order of transform and prediction can be changed without impacting the result (if we neglect rounding effects). As an example, we consider the vertical prediction of transform coefficients, which is shown in Figure 5.14(a). In the transform domain, the first row of transform coefficients is predicted using the same transform coefficients of the block above the current block. The resulting operation in the sample domain is the following: All samples of a column of the current block are predicted by the average of the samples of the same column in the block above. This is illustrated in Figure 5.14(b). Let ŝ_ver[x, y] represent the vertical prediction signal for the current N × M block. The reconstructed samples of already coded blocks are given by the (incomplete) sample array s′[x, y]. The coordinates x and y are defined in a way that the top-left sample of the current block is indexed by x = 0 and y = 0 and both x and y increase toward the bottom-right corner.

Figure 5.14: Comparison of intra prediction in transform and spatial domain: (a) Vertical prediction of transform coefficients; (b) Interpretation of the transform coefficient prediction in spatial domain (each sample is predicted using the column average); (c) Simplified (and improved) vertical prediction in spatial domain.

If we assume that the block above also has the size N × M, the vertical prediction signal is given by

ŝ_ver[x, y] = (1/N) · Σ_{k=0}^{N−1} s′[x, −1−k].  (5.35)

Since the correlation between two samples typically decreases with an increasing distance between the sample locations, the prediction in the sample domain can be simplified and, at the same time, improved if we use the directly adjacent samples instead of the column averages as predictors. This is illustrated in Figure 5.14(c). The prediction signal for the vertical prediction in the sample domain is then given by

ŝ_ver[x, y] = s′[x, −1].  (5.36)

In a similar way, the horizontal prediction in the sample domain can be defined according to

ŝ_hor[x, y] = s′[−1, y].  (5.37)

And for the DC prediction, we obtain

ŝ_DC[x, y] = (1/(N + M)) · ( Σ_{k=0}^{N−1} s′[−1, k] + Σ_{k=0}^{M−1} s′[k, −1] ).  (5.38)

The spatial intra prediction of sample blocks has a similar (perhaps slightly higher) complexity than a comparable operation in the transform domain. But it provides a number of benefits. Due to the usage of directly adjacent samples, the coding efficiency is typically improved. Moreover, it can also be applied (in a straightforward way) when the

neighboring blocks represent inter-picture predicted blocks⁵. But the most important advantage is that the discussed spatial prediction can be straightforwardly extended to prediction directions that are not aligned with the horizontal or vertical axis. As a simple example, the prediction signal for a prediction direction of 135° can be specified by

ŝ_135[x, y] = s′[ max(0, x − y) − 1, max(0, y − x) − 1 ].  (5.39)

With an interpolation of intermediate border samples, basically every prediction direction that references border sample locations inside already coded blocks can be supported.

As we discussed in Section 3.1, natural images have non-stationary statistical properties. For many image blocks, the sample arrays represent textures with a dominant direction, which could be well predicted using a suitable directional prediction. But since the dominant direction changes inside images, multiple prediction modes should be supported. The used prediction mode can be determined in the encoder and signaled to the decoder (forward adaptation), or it can be simultaneously derived in encoder and decoder based on samples of neighboring blocks (backward adaptation). The forward adaptation has the advantage that the decoder operation remains simple and the encoder has the freedom to choose the best prediction mode in a rate-distortion sense. Since the intra prediction mode applies to a block of samples, the reduction in signal energy typically justifies the bit-rate overhead that is caused by the required transmission of the used prediction mode. As will be demonstrated later, a forward-adaptive switched spatial prediction of sample blocks usually improves coding efficiency for natural images.

H.264 MPEG-4 AVC. The first standard that included a spatial intra prediction was H.264 MPEG-4 AVC [121]. Besides the I_PCM mode, in which the samples are transmitted without any compression, the first version of the standard specifies two intra macroblock modes: the Intra4x4 and the Intra16x16 modes.

⁵ A spatial intra prediction from inter-picture predicted blocks can be disadvantageous in error-prone environments, because the error propagation can then only be terminated by intra pictures and not by individual intra blocks. The video coding standards H.264 MPEG-4 AVC and H.265 MPEG-H HEVC include a high-level syntax element by which the intra prediction from inter blocks can be disabled.
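The basic prediction rules (5.36)–(5.39) translate directly into code; the sketch below assumes a hypothetical accessor s(x, y) that returns reconstructed samples of already coded blocks for x < 0 or y < 0, and the mode names are illustrative rather than taken from any standard.

```python
def spatial_intra_prediction(mode: str, width: int, height: int, s) -> list:
    """Compute a prediction block from reconstructed border samples s(x, y)."""
    dc = (sum(s(-1, k) for k in range(height)) +
          sum(s(k, -1) for k in range(width))) / (height + width)   # cf. (5.38)
    pred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mode == "vertical":                 # (5.36)
                pred[y][x] = s(x, -1)
            elif mode == "horizontal":             # (5.37)
                pred[y][x] = s(-1, y)
            elif mode == "dc":                     # (5.38)
                pred[y][x] = dc
            elif mode == "diagonal_135":           # (5.39)
                pred[y][x] = s(max(0, x - y) - 1, max(0, y - x) - 1)
    return pred
```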

Figure 5.15: Spatial intra prediction of sample blocks. As an example, the figure illustrates the H.264 MPEG-4 AVC intra prediction modes for 4 × 4 luma blocks: (a) Vertical; (b) Horizontal; (c) DC; (d) Diagonal down-left; (e) Diagonal down-right; (f) Vertical-right; (g) Horizontal-down; (h) Vertical-left; (i) Horizontal-up.

When the Intra4x4 mode is selected, the 16 × 16 array of luma samples is partitioned into sixteen 4 × 4 blocks, which are processed in z-scan order (see Section 3.3.2). Each of the 4 × 4 blocks is first predicted using directly adjacent samples of already processed blocks and then the resulting residual signal is transmitted using transform coding. For each 4 × 4 block, one of the 9 prediction modes illustrated in Figure 5.15 is selected and signaled to the decoder. One of the modes represents a DC prediction and the other 8 modes specify directional predictions. For the vertical, horizontal, and DC prediction modes, the prediction signals are determined according to (5.36), (5.37), and (5.38), respectively. The prediction signal for the other directional prediction modes is basically obtained using the following principle: If the prediction direction refers to a half-sample location at the block border, the filter (1/2, 1/2) is used for generating the corresponding intermediate border sample. For integer-sample locations, the filter (1/4, 1/2, 1/4) is applied. In the Intra16x16 mode, the entire 16 × 16 array of luma samples is spatially predicted and the resulting residual signal is coded using

194 5.2. Intra-Picture Prediction between Transform Blocks 189 a low-complexity variant of a transform. Four spatial prediction modes are provided: The vertical, horizontal, and DC prediction mode as well as a plane prediction mode. In the plane prediction mode, the prediction signal ŝ[x, y] represents a plane in the x-y-s space. The three parameters that describe the plane are determined based on the border samples. For both intra macroblock modes, the 8 8 arrays of chroma samples (in the 4:2:0 chroma sampling format) are predicted as one unit. The same 4 prediction modes as for the luma signal of Intra16x16 macroblocks are provided. The selected chroma prediction mode is applied to both chroma components of a macroblock. In later specified profiles, such as the High profile, an additional Intra8x8 macroblock mode was introduced, in which the luma signal of the macroblock is partitioned into four 8 8 blocks. For each 8 8 block, a separate intra prediction mode is selected and the residual is coded using an 8 8 transform. Basically the same nine prediction modes as for the 4 4 blocks in the Intra4x4 mode are provided. However, the border samples are first low-pass filtered using the (1/4, 1/2, 1/4) filter before they are used for spatial intra prediction. This reference sample smoothing often reduces the high-frequency content in the residual signal. The coding of the chroma components is the same as for the Intra4x4 and Intra16x16 modes. Further details on intra-picture coding in H.264 MPEG-4 AVC can be found in the standard text [121]. H.265 MPEG-H HEVC. In H.265 MPEG-H HEVC [123], the flexibility of spatial intra prediction is further increased. Transform coding and intra prediction are supported for block sizes of 4 4, 8 8, 16 16, and samples. For all block sizes, 35 spatial intra prediction modes are provided. Beside the DC prediction mode and a planar prediction mode, there are 33 directional prediction modes, which cover an angular range of 180. The prediction directions are chosen in a way that a denser coverage is provided for near-horizontal and near-vertical directions. The plane mode in H.264 MPEG-4 AVC often introduces discontinuities along the block boundaries. The planar mode specified in H.265 MPEG-H HEVC overcomes this disadvantage. The prediction signal ŝ planar [x, y] represents the average of two linear prediction

signals. For an N × N block, the scaled linear prediction signals ŝ_h[x, y] and ŝ_v[x, y] are calculated according to

ŝ_h[x, y] = (N − 1 − x) · s′[−1, y] + (1 + x) · s′[N, −1],  (5.40)

ŝ_v[x, y] = (N − 1 − y) · s′[x, −1] + (1 + y) · s′[−1, N].  (5.41)

If we neglect rounding, the final prediction signal is then given by

ŝ_planar[x, y] = (1/(2N)) · ( ŝ_h[x, y] + ŝ_v[x, y] ).  (5.42)

For block sizes greater than 4 × 4, H.265 MPEG-H HEVC specifies a low-pass filtering of the border samples. Similar to the Intra8x8 mode of H.264 MPEG-4 AVC, the 3-tap filter (1/4, 1/2, 1/4) is used. However, the reference sample smoothing is not applied in all cases. Its usage depends on the block size and the chosen intra prediction mode.

In order to reduce discontinuities at block boundaries, in the DC prediction mode, the left and top borders of the generated intra prediction signal are smoothed using a 2-tap filter (for the corner sample, a 3-tap filter is used). In the vertical or horizontal prediction mode, the top or left border is filtered, respectively. The smoothing filter is only applied for the luma component.

For chroma blocks, basically the same intra prediction modes are supported as for luma blocks. However, the used mode can only be selected out of a set of five candidate modes. This set includes the horizontal, vertical, planar, and DC prediction modes as well as the mode that is used for the co-located luma block (or, in some cases, the left-downward diagonal mode). The same intra prediction mode is used for both chroma components of a coding unit.
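A small sketch of the planar prediction (5.40)–(5.42), ignoring the integer rounding used in the standard; s(x, y) is again a hypothetical accessor for reconstructed neighboring samples.

```python
def planar_prediction(n: int, s) -> list:
    """Planar intra prediction for an n x n block, cf. (5.40)-(5.42)."""
    pred = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            s_h = (n - 1 - x) * s(-1, y) + (1 + x) * s(n, -1)      # (5.40)
            s_v = (n - 1 - y) * s(x, -1) + (1 + y) * s(-1, n)      # (5.41)
            pred[y][x] = (s_h + s_v) / (2 * n)                     # (5.42)
    return pred
```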

In contrast to prior standards, H.265 MPEG-H HEVC also uses a special transform for intra-predicted 4 × 4 luma blocks. Due to the nature of spatial intra prediction, the energy of the prediction error signal typically increases in prediction direction within the predicted block. For such a case, a discrete sine transform (DST) typically provides a better energy compaction than a DCT [136, 92, 217]. However, the implementation complexity of a DST is also significantly higher than that of a DCT. And since the observed coding efficiency improvements for larger block sizes were found to be marginal, the DST is only used for 4 × 4 luma blocks. Similar to the DCT case, H.265 MPEG-H HEVC specifies an integer approximation of the transform. The inverse transform matrix is given by

B_DST =
⎡ 29   55   74   84 ⎤
⎢ 74   74    0  −74 ⎥
⎢ 84  −29  −74   55 ⎥
⎣ 55  −84   74  −29 ⎦   (5.43)

For additional details on the intra-picture coding in H.265 MPEG-H HEVC, the interested reader is referred to the overview in [156] and the standard text [123].

Coding Efficiency. For demonstrating the impact of intra prediction on coding efficiency, we ran coding experiments in which we varied the number of supported intra prediction modes. The HEVC reference software [126], version 16.3, was slightly modified in order to restrict the selectable set of intra prediction modes. In our experiments, the following configurations were compared:

DC prediction only: All intra blocks are coded using the DC prediction mode.

DC, horizontal, and vertical prediction: For all intra blocks, the prediction mode can be chosen among the DC, the horizontal, and the vertical prediction mode.

9 prediction modes: The set of candidate modes basically consists of the 9 intra prediction modes specified in H.264 MPEG-4 AVC. The actual coding is done using H.265 MPEG-H HEVC.

35 prediction modes: All spatial intra prediction modes of H.265 MPEG-H HEVC are included in the decision process.

For each sample block, one of the prediction modes is chosen using the Lagrangian mode decision (see Section 4.2.1). The coding experiments were performed for the HD test sequences listed in Appendix A.1. Since we are mainly interested in the coding efficiency for intra-predicted blocks, all pictures of the test sequences were coded as intra pictures. The quantization parameter (QP) was varied over the entire supported range. Inside a bitstream, the same QP is used for all blocks.

197 192 Intra-Picture Coding In order to exclude effects that are related to varying block sizes, we restricted the used block sizes in a first experiment. The encoder was configured in a way that intra prediction and transform coding are always applied to blocks of 8 8 samples (for both luma and chroma). The obtained rate-distortion curves for two selected test sequences are depicted in Figure 5.16(a,b). The associated bit-rate savings relative to the configuration in which all blocks are coded using the DC prediction modes are shown in Figure 5.16(c,d). It can be seen that the coding efficiency increases with the number of supported intra prediction modes. By enabling the horizontal and vertical prediction modes in addition to the DC prediction mode, bit-rate savings of about 5-10% have been measured. If all intra prediction modes of H.265 MPEG-H HEVC are enabled, the bit-rate savings are increased to approximately 15-30%. In a second experiments, we enabled all block sizes (see Section 5.3) that are supported in H.265 MPEG-H HEVC. The measured bit-rate savings are shown in Figure 5.16(e,f). Similar as in the first experiment, the rate-distortion efficiency is generally improved by increasing the number of intra prediction modes. However, the coding gains are smaller than for 8 8 blocks. By enabling all 35 intra prediction modes, we obtained bit-rate savings of about 5-22%. 5.3 Block Sizes for Prediction and Transform Coding The coding efficiency of transform coding typically increases with the chosen block size. However, the potential gains become very small beyond a certain block size. An analysis for high-rate coding of Gauss- Markov sources is presented in the source coding part [301]. Furthermore, the complexity of encoder and decoder implementations increases with the size of the transform blocks. Thus, for very large transform block sizes, the increase in complexity usually does not justify the small coding efficiency improvements. Since the correlation between sample values decreases with increasing sample distances, the intra prediction is typically more effective for smaller block sizes. For large blocks, the assumption that the texture has a dominant direction is often not valid. However, the side informa-

(Figure 5.16, panels (a)–(f): (a,b) PSNR (Y) [dB] over bit rate [Mbit/s] for Cactus (1920 × 1080, 50 Hz) and Kimono (1920 × 1080, 24 Hz) with 8 × 8 blocks; (c,d) bit-rate saving vs. DC prediction [%] over PSNR (Y) [dB] for 8 × 8 blocks; (e,f) the same for all block sizes; compared configurations: DC prediction only; DC, horizontal, and vertical prediction; 9 prediction modes; 35 prediction modes.)

Figure 5.16: Coding efficiency of spatial intra prediction for two selected HD test sequences: (a,b) Rate-distortion curves for block sizes of 8 × 8 samples; (c,d) Associated bit-rate savings relative to the configuration in which all blocks are predicted using the DC prediction mode; (e,f) Bit-rate savings for the case that all block sizes supported in H.265 MPEG-H HEVC are enabled.

199 194 Intra-Picture Coding tion rate that is required for transmitting the intra prediction modes increases with the block size. Hence, the optimal block size for prediction and transform coding depends on the actual signal properties. Since natural images have highly non-stationary statistical properties, there is no single optimal block size. For some image areas, as for example rather flat image regions, large block sizes often improve the effectiveness of transform coding without noticeably impacting the quality of the prediction signal. For other image areas, such as regions with fine structures, the improvement in prediction that is obtained with smaller block sizes often outweighs the loss in transform coding performance and the increase in side information rate. Hence, it can be expected that an adaptive selection of the block sizes used for prediction and transform coding can improve the efficiency of intra-picture coding. But since the used block sizes need to be signaled to the decoder, the average bit rate that is required for transmitting the encoder s choices has to be noticeably smaller than the obtained average bit-rate saving. Due to that reason, typically only very simple approaches for partitioning an image into blocks for prediction and transform coding are used, as for example a quadtree partitioning (see Section 3.3.2). In the following, we first describe the approaches for block size selection that are used in video coding standards. Based on experimental investigations with H.265 MPEG-H HEVC, we will then analyze the effect of using different block sizes and demonstrate the benefit of adaptive methods Block Size Selection in Video Coding Standards The video coding standards H.261 [119], MPEG-1 Video [112], H.262 MPEG-2 Video [122], H.263 [120], and MPEG-4 Visual [111] support only a single transform block size of 8 8 samples. The same block size is also used in the image coding standard JPEG [127]. In H.263 (Annex I) and MPEG-4 Visual, which provide multiple intra prediction modes (in the transform domain), the selected intra prediction mode is transmitted on a macroblock basis. It applies to the four luma and the two chroma 8 8 blocks of a macroblock.

Figure 5.17: Partitioning of the luma component of a macroblock into blocks for intra prediction and transform coding in H.264 MPEG-4 AVC (Intra4x4, Intra8x8, and Intra16x16 macroblock modes).

H.264 MPEG-4 AVC. The first video coding standard that supports an adaptive selection of block sizes for intra prediction and transform coding was H.264 MPEG-4 AVC [121]. The used block size can be selected on a macroblock level. As in older standards, a macroblock comprises a 16×16 array of luma samples and, in the 4:2:0 chroma sampling format, two co-located 8×8 arrays of chroma samples. For intra-picture coding, three macroblock modes⁶ are provided: The Intra4x4 mode, the Intra8x8 mode, and the Intra16x16 mode. The subdivision of the luma component into transform blocks is illustrated in Figure 5.17. In the Intra4x4 mode, the 16×16 array of luma samples is subdivided into sixteen 4×4 blocks, which are processed in z-scan order. Each 4×4 block is first spatially predicted and then the resulting prediction error signal is transmitted using a 4×4 transform. Since a separate intra prediction mode is coded for each of the subblocks, the prediction modes can be chosen independently of each other. In the Intra8x8 mode, the intra prediction and transform coding are applied to 8×8 blocks of luma samples. When the Intra16x16 macroblock mode is selected, the entire luma sample array of the macroblock is predicted using the adjacent samples of already processed macroblocks. The resulting residual signal is coded using a low-complexity variant of a 16×16 transform. In all three intra macroblock modes, a single chroma prediction mode is transmitted. It applies to both chroma components. The 8×8 chroma blocks are first spatially predicted and then the prediction error signal is transmitted using a low-complexity 8×8 transform.

⁶ The Intra8x8 mode is only supported in some profiles. The I_PCM mode, which is not further discussed here, does not use intra prediction or transform coding.
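The z-scan order mentioned above visits the sixteen 4×4 blocks quadrant by quadrant, so that each block is processed after its left and top neighbors within the macroblock. The following small Python sketch is only an illustration of this nested scan pattern (the helper names are our own and are not taken from any standard); it maps a z-scan index to the corresponding sample position inside a 16×16 macroblock.

```python
# Z-scan (Morton) ordering of the sixteen 4x4 luma blocks of a 16x16 macroblock.
# A minimal sketch of the nested quadrant pattern that the text refers to as
# z-scan order; helper names are our own, not taken from any standard.

def zscan_position(index, block_size=4):
    """Map a z-scan block index to its (x, y) sample position inside the macroblock."""
    x = y = 0
    bit = 0
    i = index
    while i > 0:
        x |= (i & 1) << bit         # even bits of the index -> horizontal coordinate
        y |= ((i >> 1) & 1) << bit  # odd bits of the index  -> vertical coordinate
        i >>= 2
        bit += 1
    return x * block_size, y * block_size

if __name__ == "__main__":
    # Print the scan order as a 4x4 table of block indices.
    grid = [[0] * 4 for _ in range(4)]
    for idx in range(16):
        x, y = zscan_position(idx)
        grid[y // 4][x // 4] = idx
    for row in grid:
        print(" ".join(f"{v:2d}" for v in row))
    # Expected nested pattern:
    #  0  1  4  5
    #  2  3  6  7
    #  8  9 12 13
    # 10 11 14 15
```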

H.265 MPEG-H HEVC. H.265 MPEG-H HEVC [123] supports more and larger block sizes for intra prediction and transform coding than H.264 MPEG-4 AVC. Furthermore, the approach for selecting the used block sizes is more flexible and the standard does not use less efficient low-complexity variants for large transforms. In H.265 MPEG-H HEVC, the video pictures are first partitioned into so-called coding tree units (CTUs). In contrast to all prior video coding standards, in which macroblocks with a size of 16×16 luma samples are used, the size of the CTUs can be chosen on a sequence level. The standard supports CTU sizes of 16×16, 32×32, and 64×64 luma samples. The usage of larger CTU sizes increases the flexibility and typically provides better coding efficiency, but it may also increase the delay, the memory requirements, and the encoder complexity. The luma and chroma sample arrays that are associated with a CTU are referred to as coding tree blocks (CTBs). The CTUs can be further partitioned into coding units (CUs) of variable sizes. This subdivision is described using a quadtree structure (see Section 3.3.2). At the CTU level, a subdivision flag is transmitted, which specifies whether the complete CTU forms a CU or whether it is split into four equally-sized square blocks. If a CTU is subdivided, for each of the resulting blocks, another subdivision flag is transmitted, which specifies whether the corresponding block is further split. This hierarchical quadtree subdivision is continued until none of the blocks is further split or a minimum CU size is reached. The minimum CU size is also selected on a sequence level and can range from 8×8 luma samples to the CTU size. The luma and chroma sample arrays of a CU are referred to as coding blocks (CBs). In the 4:2:0 chroma sampling format, a CU consists of a 2^N × 2^N luma CB and two 2^(N−1) × 2^(N−1) chroma CBs, where N lies in the range from 3 to 6, inclusive. An example for the partitioning of a luma CTB into CBs is shown in Figure 5.18(a). The CUs inside a CTU are coded in z-scan order, which we introduced in Section 3.3.2 (an example is shown in Figure 3.6). This coding order ensures that for each CU, except those located at a top or left boundary of a slice, the blocks above the CU and left to the CU are

Figure 5.18: Partitioning and coding of intra prediction modes in H.265 MPEG-H HEVC: (a) Example for the partitioning of a luma CTB into CBs and TBs (CBs that are partitioned into multiple TBs are marked gray); (b) Coding of a single luma prediction mode per CB; (c) Coding of four luma prediction modes per CB.

already reconstructed, so that the neighboring samples can be used for intra-picture prediction. The CU represents the entity for which the encoder has to determine a coding mode. In particular, the encoder has to choose between intra-picture and inter-picture prediction. In that respect, a CU is similar to a macroblock in older standards. If a CU is coded using intra-picture prediction, in general, a luma and a chroma intra prediction mode are transmitted for the CU. The luma intra prediction mode applies to the 2^N × 2^N luma CB and the chroma intra prediction mode applies to both 2^(N−1) × 2^(N−1) chroma CBs. However, if the CU has the minimum CU size signaled inside the bitstream, the luma CB can also be split into four equally-sized square blocks, and then a separate luma intra prediction mode is transmitted for each of these four luma blocks. The selection of luma intra prediction modes is illustrated in Figure 5.18(b,c). In H.265 MPEG-H HEVC, intra prediction modes can be transmitted for blocks of 4×4 to 64×64 samples. But, as will be described in the following, the actual intra prediction may be applied to smaller blocks. For transform coding, a CB can be further partitioned into so-called transform blocks (TBs). The subdivision is specified by a second quadtree structure, which is referred to as residual quadtree (RQT). Each luma CB is associated with an RQT. Similarly as for the CU

quadtree, for each node of the RQT, a subdivision flag is transmitted, which specifies whether the associated sample block represents a single TB or whether it is subdivided into four equally-sized square blocks. Figure 5.18(a) shows an example for the partitioning of a luma CTB into CBs and TBs. The RQT structures are restricted by three parameters, which are signaled on a sequence level: The minimum and maximum luma transform sizes and the maximum depth of the RQTs. The minimum and maximum transform sizes can be chosen in the range from 4×4 to 32×32 samples, inclusive. If a luma CB is larger than the chosen maximum transform size, it is forced to be subdivided in order to comply with the restrictions on the transform size. For this forced split, no subdivision flags are transmitted. Similarly, if a luma block has the minimum transform size or the depth of the RQT is equal to the signaled maximum depth, the subdivision flag is not transmitted, but it is inferred that the block is not further split. Furthermore, if a luma CB is partitioned into four blocks for specifying the intra prediction modes, the transform cannot span the complete luma CB. Hence, in that case, the luma CB is always subdivided (into at least four TBs). The subdivision of the two chroma CBs follows the subdivision of the associated luma CB. There is, however, one exception to this rule. If an 8×8 luma block is split into four 4×4 TBs, the two associated 4×4 chroma CBs (in the 4:2:0 chroma sampling format) are not split, because H.265 MPEG-H HEVC does not support a 2×2 transform. The luma and chroma TBs are not only used for transform coding, but, in intra-coded CUs, they also specify the size of the blocks for which the intra prediction signals are generated. Since the correlation between samples generally decreases with the distance between the samples, the coding efficiency can typically be increased if the blocks for which the intra prediction is generated are as small as possible. The effect is illustrated in Figure 5.19. Due to this reason, in H.265 MPEG-H HEVC, intra prediction and transform coding are always applied to TBs. The possibility to signal intra prediction modes for blocks that comprise multiple TBs can be seen as a way to reduce the side information rate.
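To make the hierarchical subdivision signaling described above more concrete, the following Python sketch mimics how a decoder could recover a block partitioning from a sequence of subdivision flags, including the cases in which no flag is transmitted because a split is forced or impossible. It is a simplified illustration; the function and parameter names are our own and do not reproduce the H.265 MPEG-H HEVC syntax.

```python
# Minimal sketch of a quadtree subdivision driven by split flags, following the
# rules described in the text: no flag is read when a split is forced (block larger
# than the maximum allowed size) or impossible (block at the minimum size).
# Function and parameter names are illustrative, not taken from the standard.

def parse_quadtree(read_flag, x, y, size, min_size, max_size, leaves):
    """Recursively partition a square block at (x, y) of the given size.

    read_flag: callable returning the next subdivision flag (True = split).
    leaves:    list collecting the resulting leaf blocks as (x, y, size).
    """
    if size > max_size:
        split = True                 # forced split, no flag transmitted
    elif size == min_size:
        split = False                # split impossible, flag inferred as 0
    else:
        split = read_flag()          # explicit subdivision flag
    if split:
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                parse_quadtree(read_flag, x + dx, y + dy, half,
                               min_size, max_size, leaves)
    else:
        leaves.append((x, y, size))

if __name__ == "__main__":
    # Example: a 64x64 block, leaf sizes between 8x8 and 64x64.
    flags = iter([True,                               # split 64 -> four 32x32
                  False,                              # first 32x32 kept
                  True, False, False, False, False,   # second 32x32 split into 16x16
                  False, False])                      # remaining two 32x32 kept
    leaves = []
    parse_quadtree(lambda: next(flags), 0, 0, 64,
                   min_size=8, max_size=64, leaves=leaves)
    for block in leaves:
        print(block)
```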

Figure 5.19: Horizontal intra prediction for an 8×8 CB with 4×4 TBs: (a) Intra prediction on the basis of CBs (this approach is not supported in H.265 MPEG-H HEVC); (b) Intra prediction on the basis of TBs.

5.3.2 Experimental Analysis

In the following, we analyze the impact of variable block sizes for intra prediction and transform coding. Since H.265 MPEG-H HEVC provides the greatest flexibility in that respect, we used the H.265 MPEG-H HEVC reference software [126] for our experiments. In order to simulate less flexible block partitioning schemes, we modified the encoder in a way that the set of selectable block sizes can be specified in the configuration file. The experiments were done using the HD test sequences listed in Appendix A.1. All pictures of a sequence are coded as intra pictures. For encoder configurations that support multiple block sizes, the block sizes that are used for intra prediction and transform coding are selected using the Lagrangian mode decision introduced in Section 4.2.1. As in previous experiments, the same quantization parameter is used for all blocks of a video sequence.

In a first experiment, we restricted the encoder in a way that a fixed block size is used for intra prediction and transform coding. The block sizes are chosen in the entire supported range of 4×4 to 32×32 samples. Additionally, we included a setting in which all block sizes that are supported in H.265 MPEG-H HEVC are enabled. In order to evaluate the impact of intra-picture prediction on the block size selection, we tested two configurations. While in the first configuration, all blocks are predicted in the DC prediction mode, the second configuration includes all supported intra prediction modes. The obtained rate-distortion curves for two selected test sequences are shown in Figure 5.20.
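The Lagrangian mode decision used in these experiments simply selects, among the candidate block partitionings, the one that minimizes the cost D + λ·R. The following Python sketch illustrates this selection rule; the candidate distortion and rate values in the example are invented for illustration and are not measurement results.

```python
# Minimal sketch of a Lagrangian mode decision: among a set of candidate coding
# options, each with an associated distortion D (e.g., SSD) and rate R (in bits),
# the encoder selects the option that minimizes the cost J = D + lambda * R.
# The candidate numbers below are invented for illustration, not measured data.

def lagrangian_decision(candidates, lam):
    """Return the candidate (name, distortion, rate) with the smallest D + lam * R."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])

if __name__ == "__main__":
    # Hypothetical candidates for one image area: (partitioning, distortion, rate).
    candidates = [
        ("4x4 blocks",   1000.0, 600),
        ("8x8 blocks",   1400.0, 380),
        ("16x16 blocks", 2000.0, 250),
        ("32x32 block",  3000.0, 170),
    ]
    for lam in (1.0, 3.0, 12.0, 20.0):
        name, dist, rate = lagrangian_decision(candidates, lam)
        print(f"lambda = {lam:4.1f} -> chosen partitioning: {name} (J = {dist + lam * rate:.0f})")
    # Larger lambda values (i.e., lower target bit rates) shift the decision towards
    # options that spend fewer bits, here the larger block sizes.
```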

Figure 5.20: Comparison of different block sizes for intra prediction and transform coding. The diagrams show the measured rate-distortion curves for two selected test sequences, Cactus (1920×1080, 50 Hz) and Kimono (1920×1080, 24 Hz): (a,b) All blocks are coded using the DC prediction mode; (c,d) All intra prediction modes supported in H.265 MPEG-H HEVC are enabled.

If we restrict the intra-picture prediction to a minimum (DC prediction), the coding efficiency increases with the chosen block size, as can be seen in Figure 5.20(a,b). But the coding gains from one block size to the next become significantly smaller for large block sizes. The results indicate that the effectiveness of transform coding typically increases with the transform size. However, depending on the source material, an adaptive block size selection can still provide a significant coding gain relative to the best single block size. The rate-distortion curves for the configuration in which all intra prediction modes are enabled are shown in Figure 5.20(c,d). It can be seen that the differences in coding efficiency between the tested configurations are reduced. For content with fine spatial details, such as the Cactus sequence, and high bit rates, smaller block sizes often yield a better coding efficiency than larger block sizes. Here, the improvements of the prediction signals outweigh the losses in transform coding performance.

Figure 5.21: Coding efficiency improvements by successively enabling different block sizes for intra prediction and transform coding. The diagrams show bit-rate savings relative to the configuration in which intra prediction and transform coding are applied to 8×8 blocks, for the test sequences Cactus (1920×1080, 50 Hz) and Kimono (1920×1080, 24 Hz): (a,b) All blocks are coded using the DC prediction mode; (c,d) All intra prediction modes supported in H.265 MPEG-H HEVC are enabled.

Nonetheless, the best coding performance is achieved with an adaptive block size selection.

In a second experiment, we started with a block size of 8×8 samples, which represents the transform size that is used in early video coding standards, and successively enabled additional block sizes. As for the first experiment, we investigated two configurations. In the first configuration, all blocks are predicted using the DC prediction mode, and in the second configuration all intra prediction modes are enabled. The diagrams in Figure 5.21 show the measured bit-rate savings relative to the setting with 8×8 blocks for two selected test sequences. It can be clearly seen that, for all test cases, the coding efficiency is continuously increased when additional block sizes for intra prediction and transform coding are enabled. If all block sizes supported in H.265

MPEG-H HEVC are used, bit-rate savings of about 10-50% have been measured. The coding efficiency gains are generally larger for low bit rates. As in the previous experiment, the block size selection has a smaller effect if more intra prediction modes are supported.

5.4 Chapter Summary

In block-based hybrid video coding, there are two fundamental classes of coding modes: Inter-picture and intra-picture coding modes. While inter-picture coding modes utilize dependencies between different video pictures using motion-compensated prediction, intra-picture coding modes represent the samples of a block without referring to other pictures. Even though, on average, inter-picture coding yields a higher coding efficiency due to the exploitation of inter-picture dependencies, intra-picture coding is required for providing random access or improving the error robustness of a bitstream. Since most video pictures include areas that cannot be well represented using motion-compensated prediction, the support of intra-picture coding modes in otherwise inter-predicted pictures can also improve coding efficiency.

A fundamental coding technique that is used in both intra-picture and inter-picture coding modes is transform coding. In transform coding, blocks of original or residual samples are first transformed using an (approximately) orthogonal transform, the obtained transform coefficients are quantized using scalar quantizers, and the resulting quantization indexes are transmitted using entropy coding. Since all components of a transform codec influence each other, it is difficult to determine the optimal orthogonal transform for a given source. Video codecs typically use a DCT or an integer approximation of the DCT. Our coding experiments indicate that the DCT does indeed provide a good coding efficiency. And since it is signal-independent and can be implemented using fast algorithms, it represents a suitable choice in video coding. For the scalar quantization of transform coefficients, most video codecs employ very simple uniform reconstruction quantizers (URQs). Our experiments show that the coding efficiency loss that is caused by using URQs instead of optimal, but more complex, entropy-constrained

scalar quantizers is actually negligible. A nearly optimal bit allocation for the transform coefficients is achieved if all URQs use the same quantization step size. For entropy coding of transform coefficient levels, we discussed several techniques that are used in video coding standards. Among these approaches, context-based adaptive arithmetic coding (CABAC) achieves the best coding efficiency.

Transform coding exploits only dependencies between the samples within a block. For additionally utilizing dependencies between transform blocks, intra-picture prediction techniques can be used. We discussed intra-picture prediction in both the transform and the spatial domain. A prediction in the sample domain has the advantage that the prediction signal can be constructed for arbitrary prediction directions. Based on experimental investigations, we verified that the coding efficiency of intra-picture coding can be improved by an adaptive selection of intra prediction modes. The coding efficiency typically increases with the number of supported prediction modes. Due to the non-stationary character of video pictures, the coding efficiency can also be improved by an adaptive selection of block sizes for intra prediction and transform coding. While large blocks are often advantageous for flat image regions, small blocks typically improve the coding efficiency for fine detailed image areas. A quadtree-based block partitioning represents a simple and effective approach for supporting an adaptive selection of block sizes.

6 Inter-Picture Coding

Natural video sequences are obtained by filming real-world scenes with a camera. The changes between successive video pictures are caused by changes of appearance and position of objects, lighting changes, camera motion, sensor noise, or a modification of camera parameters such as the focal length or the aperture. Even though all of these aspects can contribute to temporal changes within video sequences, the primary source is usually the motion of objects relative to the image plane of the camera. If the motion is taken into account in the design of a video codec, the coding efficiency can often be substantially improved. Coding techniques that exploit the statistical dependencies between video pictures are referred to as inter-picture coding. The inclusion of inter-picture coding is what fundamentally distinguishes video coding from still image coding. The key technique for utilizing the temporal dependencies is motion-compensated prediction (MCP), which we introduced in the video coding overview in Section 3.3. In the inter-picture coding modes of a hybrid video codec, MCP is typically combined with transform coding of the prediction error signal. For demonstrating the importance of MCP, we compared the coding efficiency of different inter-picture coding techniques with that of intra-

picture coding. The following concepts (see Section 3.3) are included in the comparison:

Intra-picture coding (Intra): All pictures of a video sequence are coded as intra pictures. The maximum block size for intra prediction and transform coding was set to 32×32 samples.

Conditional replenishment (CR): Each block is either coded in an intra mode or its samples are obtained by copying the co-located samples of the preceding reconstructed picture.

Frame difference coding (FD): Each block is either coded in an intra mode or in the difference mode. In the difference mode, a prediction signal is formed by the co-located samples of the preceding reconstructed picture and the resulting prediction error signal is transmitted using transform coding.

Simple MCP: The difference mode is extended to a coding mode with MCP. The prediction signal for a block is formed by a displaced region of the preceding reconstructed picture. The displacement is specified by a motion vector with luma sample precision, which is estimated in the encoder.

Improved MCP: The efficiency of MCP is improved by various aspects that will be discussed later in this section. This approach represents the capabilities of H.265 MPEG-H HEVC with the restriction that the video pictures are coded in display order.

The coding experiment was conducted using the H.265 MPEG-H HEVC reference software [126]. For simulating the different approaches, the set of available coding options has been restricted accordingly. All encoders were operated using the same Lagrangian coder control. The simulation results for two selected test sequences are shown in Figure 6.1. The video conferencing sequence Johnny was captured with a static camera. Here, the simple concepts of conditional replenishment and frame difference coding already provide very large gains relative to intra-picture coding. But for the test sequence BQTerrace, which was filmed with a moving camera, these simple approaches yield only comparably small improvements. The inclusion of MCP substantially

Figure 6.1: Comparison of the coding efficiency of intra-picture coding (Intra), conditional replenishment (CR), frame difference coding (FD), and two versions of motion-compensated prediction (MCP): (a) Johnny (1280×720, 60 Hz); (b) BQTerrace (1920×1080, 60 Hz).

improves coding efficiency for both test sequences. However, a comparison of the two tested versions clearly shows that the actual design of MCP has a very large impact on the coding performance.

The basic idea of MCP is the following: At the encoder side, the motion of a current image region R relative to an already coded and reconstructed reference picture s′_ref is estimated. Given the estimated motion field m(x) = (m_x(x, y), m_y(x, y)), the prediction signal for the image region R is formed according to

ŝ[x, y] = s′_ref( x + m_x(x, y), y + m_y(x, y) ).    (6.1)

Since the decoder has to generate the same prediction signal as the encoder, both the partitioning of a picture into image regions with constant motion and the corresponding motion fields m(x) have to be transmitted inside the video bitstream. If the motion vector m(x) for a location x = (x, y) points to a fractional sample position, the reference sample s′_ref(x + m(x)) has to be generated using interpolation. Since the differences between two video pictures cannot be completely described by a motion model according to (6.1), the prediction error signal

u[x, y] = s[x, y] − ŝ[x, y]    (6.2)

is typically additionally transmitted using transform coding. At the decoder side, the reconstructed samples s′[x, y] are obtained by adding the reconstructed residual signal u′[x, y] to the motion-compensated prediction signal ŝ[x, y],

s′[x, y] = ŝ[x, y] + u′[x, y].    (6.3)
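As a concrete illustration of (6.1)-(6.3), the following Python sketch forms the motion-compensated prediction for one block with a single integer-sample motion vector, computes the prediction error, and reconstructs the block from the (here unquantized) residual. It is a toy example on synthetic data; the interpolation required for fractional motion vectors and the transform coding of the residual are omitted.

```python
# Toy illustration of motion-compensated prediction, eqs. (6.1)-(6.3), for a single
# block and an integer-sample motion vector. Interpolation for fractional motion
# vectors and transform coding of the residual are intentionally left out.
import numpy as np

def mcp_block(ref, x0, y0, size, mv):
    """Prediction for the block at (x0, y0): displaced region of the reference."""
    mx, my = mv
    return ref[y0 + my : y0 + my + size, x0 + mx : x0 + mx + size]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.integers(0, 256, size=(64, 64)).astype(np.float64)  # reconstructed reference picture
    # Synthesize a "current" picture as the reference shifted by (3, 2) plus noise.
    cur = np.roll(np.roll(ref, 2, axis=0), 3, axis=1) + rng.normal(0.0, 2.0, ref.shape)

    x0, y0, size = 16, 16, 8
    block = cur[y0:y0 + size, x0:x0 + size]

    pred = mcp_block(ref, x0, y0, size, mv=(-3, -2))  # motion vector pointing back to the source region
    resid = block - pred                              # eq. (6.2): prediction error
    recon = pred + resid                              # eq. (6.3), here with an unquantized residual

    print("mean squared prediction error:", float(np.mean(resid ** 2)))
    print("lossless reconstruction:", bool(np.allclose(recon, block)))
```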

Even though the basic concept of inter-picture coding with MCP is rather straightforward, there are numerous design aspects that impact its efficiency. The improvement of inter-picture coding was always a key aspect for increasing the efficiency of video codecs. Transform coding of prediction error signals has already been comprehensively discussed in Section 5.1. In this section, we mainly investigate the concept of motion-compensated prediction and analyze various design aspects with respect to their impact on coding efficiency.

6.1 Accuracy of Motion-Compensated Prediction

It can be expected that the quality of the motion-compensated prediction signal for a given image area is maximized if we compensate for the true motion in the captured scene. However, the projection of the real object motion into the image plane can be quite complex. Moreover, not all changes in the image plane can be described by 2D motion fields. And since the used motion fields have to be transmitted to the decoder, rather simple approximations might actually provide a better overall coding efficiency. In addition, the complexity of both the estimation of motion parameters in the encoder and the actual prediction is typically smaller for simple motion models.

6.1.1 Theoretical Considerations

Before we discuss the realization of MCP in hybrid video codecs, we investigate the coding gain of inter-picture coding with MCP relative to intra-picture coding based on simple signal models. The following theoretical analysis was first published by Girod in [74, 75], where the reader can also find additional details. We consider an idealized model of a hybrid video codec, which is illustrated in Figure 6.2. The input pictures s[x] are predicted using MCP and the resulting prediction error pictures u[x] = s[x] − ŝ[x] are coded using rate-distortion optimal intra-picture coding. For analyzing

Figure 6.2: Idealized model of a hybrid video codec. The prediction error pictures u[x] = s[x] − ŝ[x] are coded using rate-distortion optimal intra-picture coding. (Block diagram: the input pictures s[x] are predicted by motion-compensated prediction, ŝ[x] = s′_ref(x + m), using pictures s′_ref[x] from the decoded picture buffer; the reconstructed residuals u′[x] are added to the prediction to form the output pictures s′[x].)

the effectiveness of MCP in a hybrid video codec, we compare the information rate-distortion functions that are associated with a coding of prediction error pictures u[x] and original input pictures s[x]. To keep the problem tractable, we assume that both the prediction error pictures u[x] and the original pictures s[x] represent realizations of stationary Gaussian random processes with zero mean. Although this assumption is not realistic, it still provides useful insights. It should also be noted that the rate-distortion function R(D) of a Gaussian random process with a particular power spectral density Φ_SS(ω) represents an upper bound for the rate-distortion functions of all other random processes with the same power spectral density Φ_SS(ω). In the source coding part [301], we derived the rate-distortion function for 1D stationary Gaussian processes with zero mean. An extension to 2D stationary Gaussian processes with zero mean yields the parametric formulation, with θ > 0,

D(θ) = (1 / 4π²) ∫_{−π}^{π} ∫_{−π}^{π} min( θ, Φ_SS(ω_x, ω_y) ) dω_x dω_y,    (6.4)

R(θ) = (1 / 4π²) ∫_{−π}^{π} ∫_{−π}^{π} max( 0, (1/2) log₂( Φ_SS(ω_x, ω_y) / θ ) ) dω_x dω_y,    (6.5)

where Φ_SS(ω) = Φ_SS(ω_x, ω_y) represents the power spectral density of the considered random process.
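The parametric rate-distortion function (6.4)-(6.5) can be evaluated numerically for any given power spectral density by replacing the integrals with sums over a discrete frequency grid. The short Python sketch below does this for an example power spectrum of the isotropic form introduced later in (6.21); the grid resolution and the chosen values of θ and ϱ are arbitrary and only serve as an illustration.

```python
# Numerical evaluation of the parametric rate-distortion function (6.4)-(6.5)
# for a 2D stationary zero-mean Gaussian model, using a discrete frequency grid
# in place of the integrals. The example power spectrum follows the isotropic
# model used later in the text; grid size and rho are arbitrary choices.
import numpy as np

def rd_point(theta, psd, dw):
    """Distortion D(theta) and rate R(theta) per eqs. (6.4) and (6.5)."""
    d = np.sum(np.minimum(theta, psd)) * dw * dw / (4.0 * np.pi ** 2)
    r = np.sum(np.maximum(0.0, 0.5 * np.log2(psd / theta))) * dw * dw / (4.0 * np.pi ** 2)
    return d, r

if __name__ == "__main__":
    n = 256
    w = np.linspace(-np.pi, np.pi, n, endpoint=False)
    wx, wy = np.meshgrid(w, w)
    dw = w[1] - w[0]

    rho, var = 0.9, 1.0
    psd = (1.0 + (wx ** 2 + wy ** 2) / np.log(rho) ** 2) ** -1.5
    psd *= var / (np.sum(psd) * dw * dw / (4.0 * np.pi ** 2))  # normalize so that (6.22) holds

    for theta in (1e-3, 1e-2, 1e-1):
        d, r = rd_point(theta, psd, dw)
        print(f"theta={theta:8.3g}  D={d:8.5f}  "
              f"SNR={10 * np.log10(var / d):6.2f} dB  R={r:6.3f} bit/sample")
```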

Power Spectral Density of the Prediction Error Signal. For comparing the coding efficiency of inter-picture and intra-picture coding, we have to derive an approximation for the power spectral density Φ_UU(ω) of the prediction error signal u[x] = s[x] − ŝ[x]. Let us assume that the MCP for a picture s[x] or a region of a picture is conducted with a motion vector m = (m_x, m_y). In general, the motion vector m can have a fractional-sample accuracy, in which case the prediction signal ŝ[x] has to be generated by interpolating the used reconstructed reference picture s′_ref[x]. The corresponding operation can be described as follows. The discrete reference picture s′_ref[x] is first convolved with a continuous interpolation kernel g(x). The motion-compensated prediction signal ŝ[x] is then obtained by resampling the resulting continuous picture (g ∗ s′_ref)(x) at the spatial locations x + m,

ŝ[x] = (g ∗ s′_ref)(x + m).    (6.6)

In practice, the order of convolution and sampling can be switched. The prediction signal ŝ[x] can also be generated by convolving the reference picture s′_ref[x] with a discrete filter h[x] that represents a sampled version of the interpolation kernel g(x). In the frequency domain, the filtering operation (6.6) is given by

Ŝ(ω) = G(ω) · S′_ref(ω) · e^{ j ω^T m },    (6.7)

where ω = (ω_x, ω_y) denotes the angular frequency and Ŝ(ω), S′_ref(ω), and G(ω) represent the 2D band-limited Fourier transforms, with |ω_x| < π and |ω_y| < π, of the prediction signal ŝ[x], the reference picture s′_ref[x], and the interpolation kernel g(x), respectively. Now, we further assume that the current image signal s[x] can be predicted, up to a noise term n[x], by displacing the reconstructed reference picture s′_ref[x] using the true displacement vector d = (d_x, d_y). In the frequency domain, we have

S(ω) = S′_ref(ω) · e^{ j ω^T d } + N(ω),    (6.8)

where S(ω) and N(ω) represent the band-limited Fourier transforms of the original image signal s[x] and the noise term n[x], respectively. The noise term n[x] shall represent all differences between s[x] and s′_ref[x]

that are not caused by object or camera motion; it includes lighting changes, camera noise, and quantization noise in the reference picture. Using (6.7) and (6.8), the Fourier transform U(ω) of the prediction error signal u[x] = s[x] − ŝ[x] can be written as

U(ω) = S(ω) − Ŝ(ω) = S(ω) − G(ω) S′_ref(ω) e^{ j ω^T m } = S(ω) − G(ω) ( S(ω) − N(ω) ) e^{ j ω^T (m − d) }.    (6.9)

By introducing the displacement error Δ = (Δ_x, Δ_y) = d − m, which represents the difference between the true displacement vector d and the used motion vector m, we obtain

U(ω) = S(ω) ( 1 − G(ω) e^{ −j ω^T Δ } ) + N(ω) G(ω) e^{ −j ω^T Δ }.    (6.10)

If we assume that the input signal s[x] and the noise term n[x] are statistically independent, the conditional power spectral density Φ_UU(ω | Δ) of the prediction error signal u[x] = s[x] − ŝ[x] for a given displacement error Δ can be written as

Φ_UU(ω | Δ) = Φ_SS(ω) ( 1 + |G(ω)|² − 2 Re{ G(ω) e^{ −j ω^T Δ } } ) + Φ_NN(ω) |G(ω)|²,    (6.11)

where Φ_SS(ω) and Φ_NN(ω) denote the power spectral densities of the input signal s[x] and the noise term n[x], respectively, and the operator Re{·} specifies the real part of its argument. For a random displacement error Δ that is statistically independent of both the input signal s[x] and the noise n[x], we finally obtain the power spectral density

Φ_UU(ω) = E_Δ{ Φ_UU(ω | Δ) } = Φ_SS(ω) ( 1 + |G(ω)|² − 2 Re{ G(ω) F(ω) } ) + Φ_NN(ω) |G(ω)|²,    (6.12)

where F(ω) represents the Fourier transform of the probability density function f_Δ(Δ) = f_Δ(Δ_x, Δ_y) of the displacement error Δ,

F(ω) = E_Δ{ e^{ −j ω^T Δ } } = ∫∫ f_Δ(Δ_x, Δ_y) e^{ −j (ω_x Δ_x + ω_y Δ_y) } dΔ_x dΔ_y.    (6.13)

Interpolation Filter. The power spectral density Φ_UU(ω) of the prediction error signal and, thus, the rate-distortion function depend on the signal characteristic given by the power spectral densities Φ_SS(ω) and Φ_NN(ω), the accuracy of the motion vectors given by F(ω), and the used interpolation kernel g(x). The optimal interpolation kernel that minimizes the power spectral density Φ_UU(ω) for each spatial frequency ω = (ω_x, ω_y), with |ω_x| < π and |ω_y| < π, has been derived in [74]. The corresponding Wiener filter is given by the frequency response

G_opt(ω) = F*(ω) · Φ_SS(ω) / ( Φ_SS(ω) + Φ_NN(ω) ),    (6.14)

where F*(ω) represents the complex conjugate of F(ω). By inserting (6.14) into (6.12), we obtain the power spectral density

Φ_UU^opt(ω) = Φ_SS(ω) ( 1 − |F(ω)|² Φ_SS(ω) / ( Φ_SS(ω) + Φ_NN(ω) ) ).    (6.15)

For evaluating the impact of the Wiener filter on coding efficiency, we additionally consider the case in which the prediction signal at fractional sample locations is generated using an ideal interpolator. The frequency response of this interpolator is given by G_int(ω) = 1, for |ω_x| < π and |ω_y| < π. It yields the power spectral density

Φ_UU^int(ω) = 2 Φ_SS(ω) ( 1 − Re{ F(ω) } ) + Φ_NN(ω).    (6.16)

Motion Vector Accuracy. As a model for the motion vector accuracy, we assume that the displacement errors Δ are uniformly distributed inside an interval [−Δ_max, Δ_max] × [−Δ_max, Δ_max]. This choice yields

F(ω) = F(ω_x, ω_y) = (1 / (2Δ_max)²) ∫_{−Δ_max}^{Δ_max} ∫_{−Δ_max}^{Δ_max} e^{ −j (ω_x Δ_x + ω_y Δ_y) } dΔ_x dΔ_y = sinc( Δ_max ω_x ) · sinc( Δ_max ω_y ),    (6.17)

where sinc(x) = sin(x)/x represents the sinc function. For simplifying the following considerations, we specify the maximum displacement error Δ_max according to

Δ_max = 2^{−(1+β)}.    (6.18)

Note that the choices β = 0, β = 1, and β = 2 correspond to the usage of integer-sample, half-sample, and quarter-sample accurate motion vectors, and so on. The Fourier transform F(ω) of the displacement error pdf is then given by

F(ω) = F(ω_x, ω_y) = sinc( 2^{−(1+β)} ω_x ) · sinc( 2^{−(1+β)} ω_y ).    (6.19)

Signal Model. For deriving a model for the power spectral density Φ_SS(ω), we assume an isotropic autocorrelation function

R_SS(ζ, η) = E{ S(x, y) · S(x − ζ, y − η) } = σ_S² · ϱ^{ √(ζ² + η²) },    (6.20)

where σ_S² specifies the signal variance, S(x, y) represents the random process associated with the image signal s[x] = s[x, y], and ϱ is the correlation coefficient between two directly neighboring image samples. The power spectral density Φ_SS(ω) = Φ_SS(ω_x, ω_y) is the Fourier transform of the autocorrelation function,

Φ_SS(ω) = Φ_SS(ω_x, ω_y) = ∫∫ R_SS(ζ, η) e^{ −j (ω_x ζ + ω_y η) } dζ dη = K ( 1 + (ω_x² + ω_y²) / (ln ϱ)² )^{−3/2}.    (6.21)

Since we consider band-limited signals that are sampled at the Nyquist rate, the power spectral density Φ_SS(ω) is equal to zero outside the range [−π, π] × [−π, π]. Accordingly, the constant K has to be determined in a way that the following condition is fulfilled,

σ_S² = (1 / 4π²) ∫_{−π}^{π} ∫_{−π}^{π} Φ_SS(ω_x, ω_y) dω_x dω_y.    (6.22)

For the noise term n[x], we assume a constant power spectrum inside the base band [−π, π] × [−π, π]. In this range, the power spectral density Φ_NN(ω) is given by

Φ_NN(ω) = σ_N² = σ_S² · 10^{−Θ/10},    (6.23)

where σ_N² denotes the variance of the noise and Θ = 10 log₁₀( σ_S² / σ_N² ) represents the corresponding signal-to-noise ratio in dB.
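The complete model can now be evaluated on a frequency grid. The Python sketch below assembles Φ_SS(ω), Φ_NN(ω), and F(ω) for a chosen motion vector accuracy β and compares the prediction error spectra obtained with the ideal interpolator, eq. (6.16), and with the Wiener filter, eq. (6.15). The parameter values are illustrative; the sketch is not intended to reproduce the exact curves of the following figures.

```python
# Evaluation of the model prediction error spectra: ideal interpolator, eq. (6.16),
# versus Wiener filter, eq. (6.15), using the displacement error model (6.19) and
# the signal/noise model (6.21)-(6.23). Parameter values are illustrative only.
import numpy as np

def model_spectra(rho=0.9, theta_db=20.0, beta=2, n=256):
    w = np.linspace(-np.pi, np.pi, n, endpoint=False)
    wx, wy = np.meshgrid(w, w)
    dw = w[1] - w[0]

    phi_ss = (1.0 + (wx ** 2 + wy ** 2) / np.log(rho) ** 2) ** -1.5
    phi_ss /= np.sum(phi_ss) * dw * dw / (4.0 * np.pi ** 2)       # variance normalized to 1, eq. (6.22)
    phi_nn = np.full_like(phi_ss, 10.0 ** (-theta_db / 10.0))     # eq. (6.23) with sigma_S^2 = 1

    d_max = 2.0 ** -(1 + beta)                                    # eq. (6.18)
    f = np.sinc(d_max * wx / np.pi) * np.sinc(d_max * wy / np.pi) # eq. (6.19); np.sinc(x) = sin(pi x)/(pi x)

    phi_uu_ideal = 2.0 * phi_ss * (1.0 - f) + phi_nn                      # eq. (6.16), F real here
    phi_uu_wiener = phi_ss * (1.0 - f ** 2 * phi_ss / (phi_ss + phi_nn))  # eq. (6.15)
    return phi_uu_ideal, phi_uu_wiener, dw

if __name__ == "__main__":
    for beta, label in ((0, "integer-sample"), (2, "quarter-sample")):
        ideal, wiener, dw = model_spectra(beta=beta)
        var_i = np.sum(ideal) * dw * dw / (4.0 * np.pi ** 2)
        var_w = np.sum(wiener) * dw * dw / (4.0 * np.pi ** 2)
        print(f"{label:15s}  prediction error variance: ideal {var_i:.4f}, Wiener {var_w:.4f}")
```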

Theoretical Results. When using the introduced signal models, the power spectral density Φ_UU(ω) of the prediction error signal as well as the rate-distortion function R(D) for inter-picture coding only depend on the signal variance σ_S², the correlation coefficient ϱ, the signal-to-noise parameter Θ, the chosen motion vector accuracy β, and the choice of the interpolation filter (ideal interpolator or Wiener filter). The signal variance σ_S² represents only a scaling factor; it does not impact the result if we represent the distortion D as a signal-to-noise ratio. Since the correlation coefficient ϱ between two directly neighboring image samples does not vary much for natural images, we use the approximation ϱ = 0.9 for all following experiments.

As a first aspect, we compare the prediction error spectrum Φ_UU(ω) with the signal spectrum Φ_SS(ω). Figure 6.3(a) shows the theoretical results for a small SNR parameter of Θ = 5 dB. If we use the ideal interpolator, we obtain a rather flat prediction error spectrum Φ_UU(ω). In comparison to the signal spectrum Φ_SS(ω), the low-frequency components are significantly reduced while the high-frequency components are noticeably increased. By applying the Wiener filter, the high-frequency components are attenuated below the original signal spectrum. MCP with integer- and quarter-sample accurate motion vectors yields nearly the same prediction error spectrum; only the low-frequency components are slightly smaller for more accurate motion vectors. Figure 6.3(b) shows the results for an SNR parameter of Θ = 20 dB. Here, all components of the prediction error spectra are smaller than the corresponding components of the original signal spectrum. The Wiener filter has only a slight influence. But the usage of motion vectors with quarter-sample accuracy significantly reduces the prediction error in comparison to integer-sample accurate vectors.

The impact of the chosen model parameters on the information rate-distortion function R(D) is illustrated in Figure 6.3(c,d). Since the signal model (6.8) with a constant SNR parameter Θ implicitly presumes high-quality coding, the rate-distortion functions for inter-picture coding are only shown for comparably high bit rates. For Θ = 5 dB, the difference between integer- and quarter-sample accurate motion vectors is very small. But the choice of the interpolation filter has a signifi-

Figure 6.3: Comparison of power spectral densities and rate-distortion functions for intra-picture and inter-picture coding with ϱ = 0.9 (dashed curves: ideal interpolator; solid curves: Wiener filter): (a,b) Power spectral densities for Θ = 5 dB and Θ = 20 dB; (c,d) Associated information rate-distortion functions (for high rates).

cant impact on coding efficiency. The ideal interpolator yields a coding efficiency that is clearly worse than that of intra-picture coding. By using the Wiener filter, however, we obtain a small coding gain. For the large SNR parameter of Θ = 20 dB, all evaluated configurations of inter-picture coding clearly improve the coding efficiency relative to intra-picture coding. While the motion vector accuracy has a significant influence, the impact of the Wiener filter is comparably small.

In Figure 6.4(a), the high-rate SNR gain of inter-picture coding relative to intra-picture coding is plotted as a function of the motion vector accuracy. The coding gain increases with increasing accuracy, but approaches a limit. The influence of the motion vector accuracy is higher for larger SNR parameters Θ. Since the bit rate for transmitting the used motion vectors also increases with the accuracy, there is an

Figure 6.4: Coding gain of inter-picture coding relative to intra-picture coding for ϱ = 0.9: (a) High-rate SNR gain as a function of the motion vector accuracy for selected values of Θ (dashed curves: ideal interpolator; solid curves: Wiener filter); (b) Frequency responses G_opt(ω) of the Wiener filters for quarter-sample accurate MCP (β = 2) and selected values of Θ.

optimum, which depends on the actual noise level. In comparison to the ideal interpolator, the Wiener filter always provides coding gains. Its impact increases with the noise level. As shown in Figure 6.4(b), the optimal filters have a low-pass characteristic, where a stronger low-pass filtering is applied for smaller SNR parameters Θ. For Θ → ∞, the Wiener filter approaches the ideal interpolator.

6.1.2 Choice of Interpolation Filters

The results of the theoretical analysis in Section 6.1.1 indicate that the applied interpolation filter has a significant impact on the efficiency of motion-compensated inter-picture coding. In the following, we discuss the choice of interpolation filters in practical video codecs. Most video codecs use a simple translational motion model and specify a motion vector m = (m_x, m_y) for square or rectangular image blocks. In modern video coding standards [121, 123], the motion vector components m_x and m_y are transmitted with an accuracy of a quarter luma sample. In the 4:2:0 chroma sampling format, this corresponds to eighth-sample accurate motion vectors for the chroma components. Due to the low spatial detail of chroma components, the choice of the chroma interpolation filters has only a minor impact on the effectiveness of motion-compensated coding. In the following, we concentrate on the selection of interpolation filters for the luma signal.

Figure 6.5: MCP with quarter-sample accuracy (integer-, half-, and quarter-sample locations): (a) Sub-sample locations; (b) In H.264 MPEG-4 AVC, the signal at quarter-sample locations is obtained by averaging two samples on the half-sample grid (as indicated by the black lines).

Separable FIR Filters. Figure 6.5(a) illustrates the integer- and sub-sample locations for quarter-sample accurate MCP. As mentioned in Section 6.1.1, we do not require a continuous interpolation kernel g(x), but only a set of discrete filters h[k, l] for the supported sub-sample locations. The prediction signal ŝ[x, y] for a sample block is given by

ŝ[x, y] = s′_ref( x + m_x, y + m_y ),    (6.24)

where s′_ref(x, y) represents the interpolated reference picture at the used sub-sample resolution. For keeping the complexity reasonably small, the interpolated signal s′_ref(x, y) is typically generated using separable finite impulse response (FIR) filters¹. With (x, y) being a sub-sample location and x_i = ⌊x⌋ and y_i = ⌊y⌋ denoting the rounded coordinates (rounded down to the nearest integer value), an intermediate horizontally interpolated signal s′_hor(x, y_i), defined for the integer rows y_i, is obtained according to

s′_hor(x, y_i) = Σ_k h_hor[k] · s′_ref[ x_i − k, y_i ],    (6.25)

where h_hor[k] denotes the horizontal filter. The interpolated reference picture s′_ref(x, y) is then obtained using the vertical filter h_ver[k],

s′_ref(x, y) = Σ_k h_ver[k] · s′_hor( x, y_i − k ).    (6.26)

¹ In H.264 MPEG-4 AVC [121], the interpolated signal at half-sample locations is generated using separable FIR filters. At quarter-sample locations, the samples are obtained by averaging two samples on the half-sample grid, as is illustrated in Figure 6.5(b). For some cases, this results in simple non-separable filters.

The summations in (6.25) and (6.26) proceed over all non-zero filter coefficients. The choice of the interpolation filters h_hor[k] and h_ver[k] depends on the phases φ_x = x − x_i and φ_y = y − y_i, respectively. If no intermediate rounding is applied, the order of horizontal and vertical interpolation can be switched without changing the result. Typically, the same interpolation filters are used for the horizontal and vertical interpolation and the values at integer-sample locations are not filtered² (s′_ref(x_i, y_i) = s′_ref[x_i, y_i]). Then, the reference picture interpolation for quarter-sample accurate MCP can be specified by two FIR filters: A half-sample filter for phases φ = 1/2 and a quarter-sample filter for phases φ = 1/4 (for φ = 3/4, the mirrored filter is used).

Examples of Interpolation Filters. Figure 6.6 shows the filter coefficients and the frequency responses of four selected filter sets. The 2-tap filters specify simple linear interpolation, which is used in H.262 MPEG-2 Video [122], H.263 [120], and MPEG-4 Visual [111] (except if quarter-sample precision is enabled). The 8/7-tap and 4-tap filters represent the luma and chroma interpolation filters specified in H.265 MPEG-H HEVC [123]. The 6-tap filters represent approximations³ of the luma interpolation filters used in H.264 MPEG-4 AVC [121]. For preventing mismatches between encoder and decoder reconstructions, video coding standards specify the filter coefficients as integer values. At some point in the reconstruction process, the prediction or reconstruction signal has to be scaled and rounded accordingly.

The frequency responses depicted in Figure 6.6 show that all interpolation filters have a low-pass character. For the selected filter sets, the strength of the low-pass filtering decreases with the filter length. Furthermore, the half-sample filters represent stronger low-pass filters than the quarter-sample filters. And since the reference picture is not filtered at integer-sample locations, we can assert that the fractional part of a motion vector (m_x, m_y) does not only specify the phases φ_x and φ_y, but also the strength of the low-pass filtering.

² This corresponds to the usage of the filter h[k] = {1} for the phase φ = 0.
³ By using the listed 6-tap filters in our experiments, we neglect the impact of diagonal averaging and intermediate rounding specified in H.264 MPEG-4 AVC.
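The separable interpolation of (6.25) and (6.26) can be written compactly with NumPy: the rows needed by the vertical filter are first interpolated horizontally, and the vertical filter is then applied to the intermediate values. The sketch below computes a single half/half-sample value; the 8-tap coefficients are the integer values commonly quoted for the H.265 MPEG-H HEVC half-sample luma filter and should be treated as an illustrative assumption here, and the rounding, clipping, and bit-depth handling of a real codec are omitted.

```python
# Separable FIR interpolation at one half/half-sample position, following
# eqs. (6.25) and (6.26): horizontal pass on the needed rows, then a vertical
# pass on the intermediate values. Integer-precision rounding, clipping, and
# border handling of a real codec are omitted.
import numpy as np

# 8-tap half-sample filter (integer taps, normalized by 64), as commonly quoted
# for HEVC luma interpolation; treated here as an illustrative assumption.
HALF_PEL = np.array([-1, 4, -11, 40, 40, -11, 4, -1], dtype=np.float64) / 64.0
# Offsets of the integer samples the taps apply to, relative to the left/top
# integer neighbor (x_i, y_i) of the half-sample position.
TAP_OFFSETS = np.arange(-3, 5)

def interp_half_half(ref, x_i, y_i, h=HALF_PEL, offs=TAP_OFFSETS):
    """Value of the reference picture at the (x_i + 1/2, y_i + 1/2) position."""
    # Horizontal pass (6.25) on the rows needed by the vertical filter.
    hor = np.array([np.dot(h, ref[y_i + dy, x_i + offs]) for dy in offs])
    # Vertical pass (6.26) on the intermediate, horizontally interpolated values.
    return float(np.dot(h, hor))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    ref = rng.integers(0, 256, size=(16, 16)).astype(np.float64)
    flat = np.full((16, 16), 100.0)
    # A constant picture must interpolate to the same constant (the taps sum to 1).
    print("half/half sample of a flat picture:", interp_half_half(flat, 7, 7))
    print("half/half sample at (7.5, 7.5):    ", round(interp_half_half(ref, 7, 7), 2))
```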

Figure 6.6: Selected interpolation filters (2-tap, 4-tap, 6-tap, and 8/7-tap sets): (a,b) Magnitude responses of the half- and quarter-sample filters; (c) Table of the integer filter coefficients, whose filtering results are divided by 64 (the bold values indicate the coefficients h[0]); (d) Phase delay of the quarter-sample filters (due to their symmetry, all half-sample filters have a constant phase delay of half a sample).

Coding Experiments. For evaluating the impact of the interpolation filters on coding efficiency, we modified the H.265 MPEG-H HEVC reference software [126] in a way that the used filters can be selected among the four sets given in Figure 6.6. We chose two test sequences with different noise characteristics: While the low-resolution sequence BQSquare is characterized by high spatial detail and low noise (large signal-to-noise parameter Θ), the video conferencing sequence Johnny has low spatial detail and a moderate noise level (small Θ).

The results of the coding experiments are summarized in Figure 6.7. The diagrams show bit-rate savings relative to the simple linear interpolation (2-tap filters) for different encoder configurations. In a first setting, we used an IPPP coding structure and a fixed prediction block size. For separately testing the half-sample filters, we

Figure 6.7: Bit-rate savings for the interpolation filters listed in Figure 6.6 relative to the 2-tap filters for different configurations and the test sequences BQSquare (416×240, 60 Hz) and Johnny (1280×720, 60 Hz): (a,b) IPPP coding structure with a fixed block size and half-sample accurate motion vectors; (c,d) IPPP coding structure with a fixed block size and quarter-sample accurate motion vectors; (e,f) IBBB coding structure (which supports adaptive bi-prediction) with multiple reference pictures, variable block sizes, and quarter-sample accurate motion vectors.

used only half-sample accurate motion vectors. The obtained simulation results are shown in Figure 6.7(a,b). For the sequence BQSquare, the coding efficiency increases with the length of the interpolation filters. In contrast to that, the 4-tap filter provides the best average coding efficiency for the sequence Johnny. As illustrated in Figure 6.7(c,d), similar results are obtained if we use motion vectors with quarter-sample accuracy. For the low-noise sequence BQSquare, the filter set with the weakest low-pass character provides the best coding efficiency. For the sequence Johnny, which is characterized by a lower signal-to-noise ratio Θ, a filter set with a stronger low-pass filtering (6-tap filters) is advantageous.

In a last configuration, we enabled the additional coding features of variable block size MCP (see Section 6.2), multiple reference pictures (see Section 6.3.1), and adaptive bi-prediction (see Section 6.3.2). If we neglect the possibility to use advanced coding structures (see Section 6.5), this configuration of H.265 MPEG-H HEVC provides the best coding performance. The results in Figure 6.7(e,f) show that, for this setting, the 8/7-tap filters specified in H.265 MPEG-H HEVC outperform the other filter sets for both sequences.

The coding experiments indicate that it is generally advantageous to choose interpolation filters that represent close approximations of an ideal interpolator. For source material with high spatial detail and low noise, such filters provide significant coding gains. And, for content with lower signal-to-noise ratios, the loss relative to the optimal filter is comparably small, in particular if all coding features of modern video coding standards are enabled. This observation seems to contradict the theoretical results of Section 6.1.1, where we found that spatial filtering has a larger effect for small values of the signal-to-noise parameter Θ. The main reasons for this discrepancy are the following:

Block-adaptive coding: In a practical hybrid video codec, the encoder can always choose between intra- and inter-picture coding for each macroblock or coding unit. This local adaptivity generally improves coding efficiency (in particular for small Θ). With a suitable encoder control, the usage of inter pictures will always provide a coding gain compared to intra-picture coding.

High SNR: For the blocks that are coded using MCP, the signal-to-noise parameter Θ is typically rather large, so that the optimal filter does not significantly differ from the ideal interpolator.

Bi-prediction: The averaging of multiple prediction signals in multi-hypothesis prediction (such as bi-prediction) has a similar effect as a low-pass filtering. As a consequence, the optimal filters for multi-hypothesis prediction have a weaker low-pass character than those for conventional MCP and, thus, the spatial low-pass filtering becomes less important [76, 77].

In-loop filters: Modern video codecs often include in-loop filters (see Section 6.6), such as deblocking filters, which basically apply an adaptive low-pass filter to reconstructed pictures.

Longer FIR filters provide more degrees of freedom for the design and can, thus, better approximate the ideal filter response. But since longer filters are also associated with a larger implementation complexity and more often produce ringing artifacts, in practice, FIR interpolation filters of moderate sizes are used. The design of the interpolation filters for the video coding standards H.264 MPEG-4 AVC [121] and H.265 MPEG-H HEVC [123] is described in [292] and [270, 271], respectively.

Further Improvements. The shown results demonstrate that the interpolation of reference pictures significantly influences the efficiency of motion-compensated coding. The investigation of improved interpolation techniques represents an active research area. In the following, we point out some directions for further improving coding efficiency:

Internal bit depth: It is generally advantageous to perform intermediate processing steps with a higher arithmetic precision. For example, it has been shown [142] that noticeable coding gains can already be achieved by removing the intermediate rounding in the quarter-sample interpolation of H.264 MPEG-4 AVC. Additional coding efficiency improvements can be obtained if the complete reconstruction process is performed with a higher internal precision and the reconstructed reference pictures are stored with an increased bit depth [37].

Filter length: As mentioned above, a better approximation of the optimal filter response can be achieved by increasing the number of filter taps. The first working draft [298] of H.265 MPEG-H HEVC included 12-tap filters, which provided small coding gains relative to the eventually specified 8/7-tap filters.

Generalized interpolation: A conventional interpolation uses an interpolating kernel g(x). The interpolated signal is given by

s(x) = Σ_k s[k] · g(x − k).    (6.27)

For a given set of sub-sample locations x, the interpolation can be implemented using discrete FIR filters. The generalized interpolation uses non-interpolating basis functions ψ(x),

s(x) = Σ_k c_k · ψ(x − k).    (6.28)

Typically, the required expansion coefficients c_k are determined in a way that the interpolation condition, s(k) = s[k], is fulfilled for all integer values k. This pre-processing step can be implemented using a discrete infinite impulse response (IIR) filter [83, 273]. Compared to conventional interpolation, the generalized concept provides a larger degree of freedom for selecting the basis functions ψ(x). Popular choices are B-splines [274, 275] and other functions of the MOMS family [13]. Coding efficiency improvements relative to conventional FIR filters at a similar worst-case complexity have been demonstrated in [157, 158, 161].

Signal-adaptive filters: Since the optimal filters depend on the source content and the bit rate, the coding efficiency can often be improved if suitable filters are determined in the encoder and the corresponding filter coefficients are transmitted as part of the bitstream. Given the reconstructed reference picture(s) and the motion parameters for a current picture, the optimal set of FIR filters of a given length can be determined by calculating a corresponding autocorrelation matrix and cross-correlation vector and solving the resulting system of normal equations (see the discussion of optimal linear prediction in the source coding part [301]).

Adaptive interpolation filters with different complexities were proposed in [290, 280, 291, 281, 315, 282]. These investigations showed that signal-adaptive interpolation filters can improve the coding efficiency. But the determination of the filter coefficients increases both the encoder complexity and the coding delay.

Switched filters: The concept of switched interpolation filters [143, 138] represents a low-complexity alternative to signal-adaptive filters. It includes the definition of multiple sets of interpolation filters with different frequency characteristics (known to encoder and decoder). For each picture, slice, or block, the encoder can select one of the pre-defined filter sets; the encoder's choice is signaled inside the bitstream.

6.1.3 Motion Vector Accuracy

The theoretical investigations in Section 6.1.1 indicated that the usage of high-precision motion vectors increases the effectiveness of motion-compensated prediction. However, for transmitting more accurate vectors, we also require a larger bit rate. In the following, we analyze the impact of the motion vector precision on coding efficiency. For adjusting the motion vector accuracy without changing other aspects of a video codec, we modified the H.265 MPEG-H HEVC reference software [126] in a way that the used motion vector precision can be selected among integer-, half-, and quarter-sample precision. The prediction signal at sub-sample locations is generated using the 8/7-tap filters specified in the standard (see Figure 6.6). The coding experiments were performed using an IPPP coding structure and a fixed prediction block size. The different configurations of inter-picture coding are compared to intra-picture coding. Figure 6.8 shows the obtained rate-distortion curves for four selected test sequences. The average bit-rate savings for the two classes of test sequences (see Appendix A) are summarized in Table 6.1.

The experimental results demonstrate that inter-picture coding (with individual intra-picture coded blocks) clearly outperforms intra-picture coding. For all tested sequences, the coding efficiency could also be improved by increasing the motion vector accuracy from integer- to

Figure 6.8: Comparison of intra-picture coding and inter-picture coding with varying motion vector precision for selected test sequences: (a) BQSquare (416×240, 60 Hz); (b) Johnny (1280×720, 60 Hz); (c) Cactus (1920×1080, 50 Hz); (d) Kimono (1920×1080, 24 Hz). MCP was performed using an IPPP coding structure with a fixed block size and the H.265 MPEG-H HEVC interpolation filters (8/7-tap filters in Figure 6.6).

half- and from half- to quarter-sample precision. The actual achievable coding gain depends on the properties of the video content. For videos with high spatial detail and low noise, such as the low-resolution test sequence BQSquare, an increase of the motion vector accuracy generally has a greater impact on coding efficiency than for typical high-resolution content. The coding results further demonstrate that an increase of the motion vector accuracy from integer- to half-sample precision yields larger coding gains than a further increase from half- to quarter-sample precision. Depending on the statistical properties of the source material, the usage of eighth-sample accurate motion vectors may provide additional considerable coding efficiency improvements [292, 291]. However, for typical high-resolution video content, the achievable gains are considered too small for justifying the additional complexity of the encoder's motion estimation.
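To relate the discussed motion vector accuracies to the encoder side, the following Python sketch performs a brute-force integer-sample block matching with an SAD criterion and then refines the best vector to half-sample precision using a simple bilinear interpolation of the reference picture. It only illustrates the principle; practical encoders use faster search strategies, the standardized interpolation filters, and a rate-constrained matching cost.

```python
# Toy motion estimation: exhaustive integer-sample search (SAD criterion) followed
# by a half-sample refinement based on bilinear interpolation. Real encoders use
# smarter search, the standardized interpolation filters, and rate-constrained costs.
import numpy as np

def sad(a, b):
    return float(np.sum(np.abs(a - b)))

def ref_block(ref, x, y, size):
    """Block of the reference at (x, y) given in half-sample units (bilinear interpolation)."""
    xi, yi = x // 2, y // 2
    fx, fy = (x % 2) * 0.5, (y % 2) * 0.5
    a = ref[yi:yi + size + 1, xi:xi + size + 1]
    top = (1 - fx) * a[:size, :size] + fx * a[:size, 1:size + 1]
    bot = (1 - fx) * a[1:size + 1, :size] + fx * a[1:size + 1, 1:size + 1]
    return (1 - fy) * top + fy * bot

def motion_search(cur, ref, x0, y0, size, search=4):
    block = cur[y0:y0 + size, x0:x0 + size]
    # Integer-sample full search.
    best = min(((sad(block, ref[y0 + my:y0 + my + size, x0 + mx:x0 + mx + size]), mx, my)
                for my in range(-search, search + 1)
                for mx in range(-search, search + 1)), key=lambda t: t[0])
    _, mx, my = best
    # Half-sample refinement around the best integer vector (offsets in half-sample units).
    cands = [(sad(block, ref_block(ref, 2 * (x0 + mx) + dx, 2 * (y0 + my) + dy, size)), dx, dy)
             for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    _, dx, dy = min(cands, key=lambda t: t[0])
    return mx + dx / 2.0, my + dy / 2.0

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    ref = rng.normal(size=(64, 64)).cumsum(axis=0).cumsum(axis=1)  # smooth synthetic picture
    cur = np.roll(np.roll(ref, 1, axis=0), -2, axis=1)             # content shifted by (-2, +1) samples
    print("estimated motion vector:", motion_search(cur, ref, 24, 24, 8))
```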

Table 6.1: Average bit-rate savings for different motion vector accuracies. The abbreviations IS-MCP, HS-MCP, and QS-MCP stand for inter-picture coding with integer-, half-, and quarter-sample accurate MCP, respectively.

    entertainment-quality video content
    codec      bit-rate savings relative to
               Intra      IS-MCP     HS-MCP
    IS-MCP       %
    HS-MCP       %           %
    QS-MCP       %           %        8.88%

    video conferencing content
    codec      bit-rate savings relative to
               Intra      IS-MCP     HS-MCP
    IS-MCP       %
    HS-MCP       %           %
    QS-MCP       %           %       11.75%

Motion Vectors Over Picture Boundaries. In the early video coding standard H.262 MPEG-2 Video, the motion vectors are constrained to point to areas that lie completely inside the reference picture. Modern video codecs also allow motion vectors that address reference blocks that are partly or completely located outside the picture boundaries. The samples outside a picture are typically set equal to the closest border sample. This simple generalization often yields noticeable coding gains, in particular for videos that were filmed with a moving camera.
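The border extension described in the preceding paragraph can be implemented by simply clamping the displaced sample coordinates to the picture area. The following Python sketch illustrates this for integer-sample MCP; it is a minimal illustration under our own naming conventions and is not taken from any standard or reference software.

    import numpy as np

    def mcp_with_border_padding(ref_picture, block_pos, block_size, mv):
        """Copy a reference block for integer-sample MCP; samples addressed
        outside the picture are replaced by the closest border sample."""
        h, w = ref_picture.shape
        y0, x0 = block_pos          # top-left corner of the current block
        bh, bw = block_size
        mvx, mvy = mv               # integer-sample motion vector
        pred = np.empty((bh, bw), dtype=ref_picture.dtype)
        for dy in range(bh):
            for dx in range(bw):
                # clamp the displaced coordinates to the picture area
                ry = min(max(y0 + dy + mvy, 0), h - 1)
                rx = min(max(x0 + dx + mvx, 0), w - 1)
                pred[dy, dx] = ref_picture[ry, rx]
        return pred

In practice, the same effect is usually obtained by padding the reconstructed reference picture with its border samples once, so that the prediction loop itself needs no coordinate checks.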

6.1.4 Motion Models

Most video codecs use a simple translational motion model. The motion of an image block is represented by a single motion vector (m_x, m_y),

    m_x(x, y) = m_x,    m_y(x, y) = m_y.    (6.29)

Since all samples of the prediction signal for a block can be calculated using the same FIR interpolation filter, the implementation complexity of MCP with a translational model is comparably low. However, with this simple model, only translations parallel to the image plane can be accurately described (if we assume ideal perspective projection, see Section 2.1.1). The real object motion in the image plane is typically much more complicated. But general sample-accurate motion fields are not only extremely difficult to estimate in an encoder, their transmission also requires a large bit rate, which makes them unsuitable for video coding.

A better compromise between accuracy and side information rate can be achieved with parametric motion fields, for which the motion field m(x) inside an image block is described by a few model parameters. The translational model (6.29) represents a very simple parametric model. The more general affine motion model,

    m_x(x, y) = a_0 + a_1 x + a_2 y,    m_y(x, y) = a_3 + a_4 x + a_5 y,    (6.30)

is characterized by six model parameters {a_0, ..., a_5}. It can represent a translation, rotation, scaling, and shearing in the image plane. For the simplified model of a parallel projection (large object distances), the affine motion model can describe the motion of a plane in the 3D object space. In order to describe the motion of a plane with perspective projection, the planar perspective model,

    m_x(x, y) = (a_0 + a_1 x + a_2 y) / (1 + a_6 x + a_7 y) − x,
    m_y(x, y) = (a_3 + a_4 x + a_5 y) / (1 + a_6 x + a_7 y) − y,    (6.31)

is required, which is characterized by eight parameters. Another candidate model is the 12-parameter parabolic model,

    m_x(x, y) = a_0 + a_1 x + a_2 y + a_3 x^2 + a_4 xy + a_5 y^2,
    m_y(x, y) = a_6 + a_7 x + a_8 y + a_9 x^2 + a_{10} xy + a_{11} y^2.    (6.32)

It can well approximate perspective motion and is also capable of representing some forms of uneven stretching, which could be beneficial for modeling the motion of non-rigid objects. Figure 6.9 illustrates the capabilities of the mentioned higher-order parametric motion models using representative examples.

In motion-compensated prediction with higher-order parametric motion models, the reference sample locations do not fall on a defined sub-sample grid. Hence, the samples of the prediction signal ŝ[x, y] have to be calculated using a continuous interpolation kernel g(x) or, when a generalized interpolation according to (6.28) is applied, a continuous basis function ψ(x). Alternatively, the spatial coordinates x + m_x(x, y) and y + m_y(x, y) of the reference sample locations could be rounded to a given sub-sample precision, in which case the interpolation of the reference picture can also be realized using a set of FIR filters.
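To make the role of the model parameters in (6.29)–(6.31) concrete, the following Python sketch evaluates the translational, affine, and planar perspective motion models for all samples of a block. It is only an illustration under our own naming conventions; an actual codec would additionally interpolate the reference picture at the resulting sub-sample positions.

    import numpy as np

    def motion_field(block_w, block_h, a, model="affine"):
        """Return per-sample displacements (m_x, m_y) for a block,
        evaluated with a parametric motion model."""
        x, y = np.meshgrid(np.arange(block_w), np.arange(block_h))
        if model == "translational":        # eq. (6.29), a = (m_x, m_y)
            mx = np.full_like(x, a[0], dtype=float)
            my = np.full_like(y, a[1], dtype=float)
        elif model == "affine":             # eq. (6.30), a = (a_0, ..., a_5)
            mx = a[0] + a[1] * x + a[2] * y
            my = a[3] + a[4] * x + a[5] * y
        elif model == "perspective":        # eq. (6.31), a = (a_0, ..., a_7)
            den = 1.0 + a[6] * x + a[7] * y
            mx = (a[0] + a[1] * x + a[2] * y) / den - x
            my = (a[3] + a[4] * x + a[5] * y) / den - y
        else:
            raise ValueError("unknown motion model")
        return mx, my

    # example: a small combined zoom/rotation described by affine parameters
    mx, my = motion_field(8, 8, (0.5, 0.02, -0.01, -0.25, 0.01, 0.02))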

Figure 6.9: Capabilities of higher-order parametric motion models. The pictures show example reference locations for the samples of a square block: (a) Affine motion model; (b) Planar perspective motion model; (c) Parabolic motion model.

Due to the large parameter space involved, a suitable parameter vector a = (a_0, a_1, ...) cannot be determined by calculating the distortion and rate for a defined set of candidate vectors, as it is typically done for conventional motion vectors (see Section 4.2.2). Typically, the motion parameters are estimated using an iterative Gauss-Newton method. An example of a video codec that utilizes higher-order parametric motion models for MCP is described in [141, 140]. The codec uses up to 12 model parameters (parabolic model) and provides significant coding gains relative to the ITU-T video coding standard H.263.

Another example of the application of parametric motion models is the concept of global motion compensation (GMC) in MPEG-4 Visual [111]. When GMC is enabled for a picture, a global parametric motion field is estimated and transmitted. For each macroblock of the picture, the encoder can decide whether the prediction signal is calculated using the global motion field or whether MCP with conventional translational motion vectors is used. The coding scheme suggested in [303, 239, 306] is based on the concept of multiple reference pictures (see Section 6.3.1) and can be considered as an extension of the GMC approach. Affine motion models are used for generating additional reference pictures, which are inserted into a multi-picture buffer and are treated as conventional reconstructed pictures in the actual coding process.

Since the usage of higher-order motion models is often combined with advanced interpolation algorithms that support the interpolation at arbitrary sub-sample locations, the reported coding gains are often partly caused by improved interpolation methods. In [160, 159], a coding

algorithm with a block-adaptive selection between the affine and translational motion model and a cubic spline interpolation was investigated and compared to the video coding standard H.264 MPEG-4 AVC. A detailed analysis of the coding efficiency improvements showed that most of the gains could be attributed to the improved reference picture interpolation. The average bit-rate savings that resulted from the inclusion of the affine motion model lay in the range of 1–2%.

Although higher-order motion models can better approximate the real object motion between video pictures, all video codecs of practical importance exclusively use the simple translational model. The main reasons are the difficulty in estimating higher-order model parameters in the encoder, the increased complexity of motion-compensated prediction, and the comparably small improvements in coding efficiency. The concept of motion-compensated prediction with variable block sizes, which will be discussed in the next section, often provides a similar coding efficiency at a significantly reduced complexity.

6.2 Block Sizes for Motion-Compensated Prediction

The partitioning of a picture into blocks with constant motion has a significant impact on the efficiency of inter-picture coding. On the one hand, by using smaller block sizes we can increase the quality of the motion-compensated prediction signal and, thus, reduce the energy of the residual signal. But on the other hand, smaller blocks also require a larger bit rate for transmitting the motion data. It should be noted that the block sizes chosen for MCP also impact the efficiency of the subsequent residual coding. Discontinuities in the prediction error signal often result in a large number of non-zero transform coefficient levels and, thus, reduce the effectiveness of transform coding. Due to that reason, transform coding is typically not applied across motion boundaries. As a consequence, the usage of small blocks for MCP limits the effectiveness of residual coding. Larger blocks not only enable the usage of larger, typically more effective, transforms for prediction error coding, but also provide a higher degree of freedom for adapting the transform sizes to local statistics.

6.2.1 Variable Prediction Block Sizes

Due to the non-stationary properties of video signals, the usage of a single fixed block size for MCP is suboptimal. While large blocks are typically preferable for image areas with consistent and approximately translational motion, smaller blocks are often advantageous for regions with more complicated motion and high spatial details. In this context, it should be noted that the optimal block size does not only depend on the local properties of the source signal, but also on the target bit rate. More accurate motion descriptions generally result in an improved rate-distortion efficiency of the prediction error coding. And the higher the overall bit rate, the more often these gains outweigh the associated increase in side information rate.

Since a signal-adaptive selection of the block sizes for MCP requires the transmission of additional segmentation data, video codecs usually provide only a limited set of partitioning options. A simple and effective approach is the quadtree-based partitioning [27, 246, 251] introduced earlier in this text. For effectively combining MCP with a transform coding of the prediction error signal (see Section 5), the block dimensions of the quadtree structure typically represent integer powers of two. The quadtree concept can also be extended by providing additional partitioning options (see below). An effective and conceptually simple encoding algorithm for selecting a suitable partitioning has been discussed in Section 4.2.1.

For demonstrating the advantage of variable block sizes for MCP, we compared the quadtree-based adaptive block size selection with the usage of different fixed block sizes. The coding experiments were conducted using a slightly modified version of the H.265 MPEG-H HEVC reference software [126], in which we enabled only block sizes of 64×64, 32×32, 16×16, 8×8, and 4×4 luma samples.⁴ In a first configuration, we excluded the impact of prediction block sizes on residual coding by using a fixed transform block size of 4×4 samples. The rate-distortion curves for two selected test sequences are shown in Figure 6.10(a,b). Among the tested fixed block sizes, 16×16 blocks typically provide the best coding efficiency.

⁴ H.265 MPEG-H HEVC does not support motion-compensated prediction with 4×4 blocks. We slightly modified the syntax for including this additional option.

Figure 6.10: Motion-compensated coding with different square block sizes for the test sequences Cactus and Johnny: (a,b) Residual coding using 4×4 transform blocks; (c,d) Residual coding with adaptive transform block sizes. [The plots show PSNR (Y) in dB over the bit rate in Mbit/s for fixed block sizes and for the adaptive block size selection. Average bit-rate savings of the adaptive selection relative to the best fixed block size: (b) Johnny, 4×4 transform: 34.8% vs 32×32; (c) Cactus, all transform sizes: 21.7% vs 16×16; (d) Johnny, all transform sizes: 29.2% vs 32×32.]

Only for very simple content, 32×32 blocks sometimes yield a slightly better performance. The adaptive block size selection significantly improves the coding efficiency for all sequences and bit rates. Compared to the best fixed block size, average bit-rate savings in the range of 20–37% have been measured for our test set.

In a second configuration, we removed the restriction on residual coding and enabled the adaptive quadtree-based selection of transform block sizes supported in H.265 MPEG-H HEVC. The roots of the residual quadtrees are associated with the blocks that are used for MCP, so that the transforms are not applied across prediction block boundaries. The simulation results for the sequences Cactus and Johnny are shown in Figure 6.10(c,d). In comparison to the 4×4 transform, the adaptive residual coding improves the coding efficiency for larger prediction blocks as well as for the adaptive block size selection. Depending on the actual video sequence, 16×16 or 32×32 blocks represent the best fixed block sizes.

As in the first configuration, an adaptive selection of the prediction block sizes provides large coding gains. For the used set of test sequences, the average bit-rate savings relative to the best fixed block size lie in the range of 17–32%.

6.2.2 Prediction Block Sizes in Video Coding Standards

The early video coding standards H.261 [119], MPEG-1 Video [112], and H.262 MPEG-2 Video⁵ [122] support only a single fixed block size. The video pictures are partitioned into 16×16 macroblocks, which consist of a 16×16 luma block and, in the 4:2:0 chroma sampling format, the two co-located 8×8 chroma blocks. If a macroblock is coded using MCP, a single motion vector⁶ is transmitted for the entire macroblock. The later standards H.263 [120] and MPEG-4 Visual [111] add another coding mode, in which the macroblock is split into four 8×8 blocks and a separate motion vector is transmitted for each of these subblocks.

H.264 MPEG-4 AVC. Similarly to older standards, H.264 MPEG-4 AVC [121] uses 16×16 macroblocks, but it provides a significantly increased flexibility for partitioning a macroblock into prediction blocks. A macroblock is either coded as a single 16×16 block, or it is partitioned into two 16×8, two 8×16, or four 8×8 blocks. If a partitioning into 8×8 blocks is chosen, each of the 8×8 blocks can be further decomposed into two 8×4, two 4×8, or four 4×4 blocks. The partitioning concept is similar to a two-level quadtree, but additionally supports the splitting of a square block into two rectangular blocks of the same size. These additional options provide another compromise between prediction accuracy and side information rate and can be advantageous at roughly horizontal or vertical motion boundaries. In some profiles of H.264 MPEG-4 AVC, such as the High profile, the transform size for coding the luma residual signal can be selected on a macroblock basis. The standard supports 4×4 and 8×8 transforms.

⁵ An exception is the field prediction mode, which splits the macroblock signal into two field components. This mode is included for coding interlaced content.
⁶ We only consider uni-predictive coding. MPEG-1 Video and H.262 MPEG-2 Video also include bi-predictive coding, which will be discussed in Section 6.3.2.

Figure 6.11: Partitioning modes in H.265 MPEG-H HEVC [123] for splitting a CU into one, two, or four PUs: M×M, M×(M/2), (M/2)×M, (M/2)×(M/2), and the four asymmetric modes that split a CU into PU pairs of sizes M×(M/4) and M×(3M/4) or (M/4)×M and (3M/4)×M. The (M/2)×(M/2) mode and the asymmetric modes shown in the bottom row are not supported for all CU sizes.

For avoiding transforms across prediction block boundaries, the 8×8 transform can only be selected if the macroblock does not comprise prediction blocks that are smaller than 8×8 luma samples.

H.265 MPEG-H HEVC. In the state-of-the-art video coding standard H.265 MPEG-H HEVC [123], the block partitioning concept is further generalized, in particular towards larger block sizes. As described in Section 5.3.1, a video picture is first partitioned into coding tree units (CTUs) with a fixed size of either 16×16, 32×32, or 64×64 luma samples. Using a quadtree approach, the CTUs can be further partitioned into coding units (CUs). Similarly to the CTU size, the minimum CU size can be selected on a sequence level; it can range from 8×8 luma samples to the CTU size. The CU represents the entity for which the encoder chooses between intra- and inter-picture coding. When a CU is coded using inter-picture prediction, it can be further decomposed into up to four prediction units (PUs). A PU consists of a luma prediction block (PB) and, in non-monochrome formats, the two associated chroma PBs. All samples of a PU use the same motion parameters for MCP.

H.265 MPEG-H HEVC supports the eight partitioning modes illustrated in Figure 6.11. In the M×M mode, the entire CU is represented as a single PU. The (M/2)×(M/2) mode splits the CU into four equally-sized square PUs. Since this is conceptually equivalent to partitioning the image block into four CUs and coding each of these CUs as a single PU, the (M/2)×(M/2) mode is only supported for the minimum CU size.

Figure 6.12: Bit-rate savings for non-square PUs and a simple IPPP coding structure: (a) Cactus; (b) Johnny. [Bit-rate savings relative to square-only PUs, plotted over PSNR (Y) in dB. Cactus: square and symmetric PUs, average 1.5%; all supported PU sizes, average 2.2%. Johnny: square and symmetric PUs, average 2.5%; all supported PU sizes, average 4.0%.]

For minimizing the worst-case memory bandwidth, this mode is not supported for 8×8 CUs either, so that MCP with 4×4 blocks is not possible. The M×(M/2) and (M/2)×M modes split a CU into two rectangular PUs of the same size. Beside these symmetric partitioning modes, H.265 MPEG-H HEVC also supports four asymmetric partitioning modes, which decompose a CU into two rectangular PUs of different sizes. One of the resulting PUs covers one quarter of the CU area and the other covers the remaining three quarters. The asymmetric partitioning modes are only supported for CUs that are greater than 8×8 luma samples.

For illustrating the coding efficiency improvements that result from an additional support of non-square PUs, we started with a configuration in which only square PUs were enabled and successively added the symmetric and asymmetric rectangular partitioning modes. The resulting bit-rate savings for the test sequences Cactus and Johnny are shown in Figure 6.12. It can be seen that both the symmetric and asymmetric rectangular partitioning modes provide coding gains. But compared to the influence of the quadtree partitioning (see Figure 6.10), the gains are rather small for the associated increase in encoder complexity.

For coding the prediction error, a CU can be partitioned into multiple transform blocks. As described in Section 5.3.1, this subdivision is specified by a second quadtree structure, the residual quadtree (RQT). In general, the subdivision of the residual signal into transform blocks can be selected independently of the chosen PU partitioning.

Table 6.2: Average bit-rate savings for increasing the set of supported block sizes in the development of video coding standards (IPPP coding structure).

    entertainment-quality video content
    config.    bit-rate savings relative to
               H.262      H.263      H.264
    H.263        %
    H.264        %        6.04%
    H.265        %           %          %

    video conferencing content
    config.    bit-rate savings relative to
               H.262      H.263      H.264
    H.263        %
    H.264        %        3.80%
    H.265        %           %          %

Hence, in some cases, it is possible to apply a transform across motion boundaries. Experimental investigations showed that this additional degree of freedom typically provides small coding gains. As an exception, for inter-picture coded CUs, one level of splitting is always applied if the maximum RQT depth is set equal to zero (no splitting) and the CU is decomposed into multiple PUs.

Coding Efficiency Comparison. The set of supported block sizes for motion-compensated prediction and transform coding has been successively increased from one generation of video coding standards to the next. For demonstrating that this increased flexibility represents one of the key aspects for improving the compression performance, we simulated the partitioning capabilities of different video coding standards by restricting⁷ the block size selection in the H.265 MPEG-H HEVC reference software. Note that for all other aspects, the coding tools of H.265 MPEG-H HEVC are used. We did not compare the different standards, but only investigated the impact of the block partitioning. As in previous experiments, we used a simple IPPP coding structure. The average bit-rate savings for the different encoder configurations are summarized in Table 6.2. The results show that the increased flexibility for partitioning the video pictures into prediction and transform blocks does indeed significantly contribute to the coding efficiency.

⁷ For simulating the partitioning capabilities of H.264 MPEG-4 AVC, the support of 4×4 prediction blocks has been added to the software.

The largest step in coding efficiency improvement was observed between the H.264 MPEG-4 AVC and H.265 MPEG-H HEVC configurations; it is mainly caused by the support of larger block sizes. In comparison to the fixed block partitioning in the H.262 MPEG-2 Video configuration, the flexibility in H.265 MPEG-H HEVC provides bit-rate savings of about 30% for the entertainment-quality sequences (Appendix A.1) and 39% for video conferencing content (Appendix A.2).

6.2.3 Further Improvements

The demonstrated coding efficiency improvements for increasing the number of partitioning options for MCP and transform coding indicate that it may be possible to achieve further improvements by adding even more partitioning modes. In the following, we point out some possibilities that have been suggested by various authors:

Larger block sizes: In particular for coding ultra-high-resolution (UHD) video content, it can be beneficial to add larger block sizes for motion-compensated prediction and transform coding [28].

Non-rectangular regions: The usage of non-rectangular regions for MCP could improve the adaptivity to real motion boundaries in video pictures. While arbitrarily shaped regions are infeasible due to the associated signalization overhead, it was shown [103, 137] that a partitioning of square blocks into two non-rectangular regions using a straight line can provide coding gains.

Block merging: The discussed partitioning concepts always split larger blocks into smaller blocks. For transmitting the motion parameters, it can, however, be beneficial to merge some of the blocks into larger entities [333, 52, 261]. H.265 MPEG-H HEVC actually supports the concept of block merging [96]; it will be discussed in the context of motion parameter coding in Section 6.4.

Non-square transform blocks: By supporting non-square transform blocks, the transform sizes can be better adjusted to the boundaries of prediction blocks [307, 332], which can yield additional coding efficiency improvements.

Figure 6.13: Motion-compensated prediction with multiple reference pictures. [Illustration of a current picture s_k[x] and the available reconstructed reference pictures s'_{k−1}[x], s'_{k−2}[x], and s'_{k−3}[x].]

6.3 Advanced Motion-Compensated Prediction

For introducing motion-compensated prediction, we considered the case that all inter-picture coded blocks of a video picture are predicted using a single reconstructed reference picture and that the prediction signal for a block is specified by a single motion vector. In the following, we discuss approaches for generalizing and improving the basic concept of motion-compensated prediction.

6.3.1 Adaptive Reference Picture Selection

The largest statistical dependencies typically exist between consecutive video pictures. This does, however, not mean that for each block of a current picture the directly preceding picture represents the most suitable reference picture. Due to different samplings of the optical signals, different noise characteristics, short-time occlusions, and other effects such as aliasing, the usage of an older picture can improve the quality of the prediction signal. Hence, a block-adaptive selection of the reference picture could improve the efficiency of inter-picture coding.

Figure 6.13 illustrates the basic concept of MCP with multiple reference pictures for a configuration in which all video pictures are coded in display order. For each inter-coded block of a current picture s_k[x], the encoder selects the reference picture used out of a set of available reconstructed pictures (s'_{k−1}[x], s'_{k−2}[x], and s'_{k−3}[x] in the example).

A suitable algorithm for the reference picture selection has been discussed in Section 4.2.2. Similar to the motion vector, the choice of the reference picture has to be signaled inside the bitstream. For that purpose, the available reference pictures are typically ordered in a so-called reference picture list (the order has to be known to both encoder and decoder) and an index r into this list, which is referred to as reference index, is transmitted together with the motion vector.

The usage of multiple reference pictures was first investigated for improving the error robustness of video transmissions. Later, it was shown [305, 306] that a block-adaptive selection of the reference picture can also significantly improve coding efficiency. In video coding standardization, MCP with multiple reference pictures and a block-adaptive selection was first specified in the optional Annex U of H.263 [120]. Due to its effectiveness, it forms an integral part of the later specified video coding standards H.264 MPEG-4 AVC and H.265 MPEG-H HEVC.

Memory Management. The storage of reconstructed pictures that are intended to be used as reference pictures for MCP needs to be kept synchronized between encoder and decoder. Furthermore, for each picture (or slice of a picture), the same mechanism has to be used for ordering the available reference pictures into a reference picture list. Since the reference index r represents a relative index, any deviation between the reference picture lists that are used in encoder and decoder could result in differing prediction signals and, thus, also different reconstructed video pictures.

Typically, the usage of reconstructed pictures for MCP is controlled by two mechanisms: the memory management and the reference list construction. For storing reconstructed pictures, encoder and decoder include a decoded picture buffer (DPB), see Figures 3.4 and 3.5, which basically consists of a certain number M ≥ 1 of frame memories. While the memory management controls the storage of reconstructed pictures into the DPB and their removal from the DPB, the reference list construction controls which of the reference pictures available in the DPB are used for motion-compensated prediction of a picture (or a slice of a picture) and determines their order in the reference picture list.

The memory management can be operated in different ways. In practice, typically one of the following two operation modes is used:

Sliding Window: The DPB is operated similar to a first-in-first-out buffer of size M. After a picture is completely reconstructed, it is inserted into the DPB. If there is no free frame store, the oldest picture in the DPB is removed in advance.

Adaptive Buffering: The bitstream includes syntax elements that specify whether the current picture is stored in the DPB and which of the present pictures are removed from the DPB (or kept in the DPB). The encoder has to select the syntax elements in a way that the capacity M of the DPB is never exceeded.

The sliding window mode is particularly useful if all pictures of a video sequence are coded in display order (IPPP coding structure). In such a configuration, it ensures that the M reconstructed pictures that directly precede the current picture are available for MCP (see Figure 6.13). The adaptive buffering mode provides the greatest possible flexibility, since all operations of the DPB can be signaled inside the bitstream. Among other features, it enables the usage of advanced coding structures (see Section 6.5). If the video pictures are not coded in display order, the DPB is also used for delaying the output of certain pictures. In that case, the M frame stores of the DPB are typically partitioned into two groups. The reconstructed pictures that can be later used as reference pictures are stored in the first group of M_ref ≤ M frame stores and the remaining M − M_ref frame stores are used for delaying the output of pictures that are no longer used as reference pictures.

The reference picture list for a current picture is constructed using the reference pictures available in the DPB. The bitstream syntax typically includes a parameter that specifies the length of the list. The order of the pictures is either determined by a default ordering mechanism or it is specified using additional syntax elements, which are often called reference picture list modification commands. For details on the memory management and reference list construction in video coding standards, the reader is referred to the corresponding specifications [120, 121, 123].
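The sliding window mode described above can be sketched in a few lines of code. The Python class below is a simplified illustration (the class and method names are our own and do not correspond to any standardized syntax); the reference picture list is simply the buffer content in reverse coding order, as used in the experiments of this section.

    from collections import deque

    class SlidingWindowDPB:
        """Simplified decoded picture buffer operated in sliding-window mode."""

        def __init__(self, capacity):
            self.capacity = capacity            # number M of frame stores
            self.buffer = deque()               # pictures in coding order

        def store(self, picture):
            # remove the oldest reference picture if no frame store is free
            if len(self.buffer) == self.capacity:
                self.buffer.popleft()
            self.buffer.append(picture)

        def reference_list(self):
            # list for the next picture: available references in
            # reverse coding order (most recent picture gets index 0)
            return list(reversed(self.buffer))

    dpb = SlidingWindowDPB(capacity=4)
    for k in range(6):
        dpb.store(f"picture_{k}")
    assert dpb.reference_list()[0] == "picture_5"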

Figure 6.14: Coding efficiency improvement due to the usage of multiple reference pictures in IPPP coding: (a) BQSquare; (b) Johnny; (c) Cactus; (d) Kimono. [Bit-rate savings relative to a single reference picture, plotted over PSNR (Y) in dB. Average savings for 2 / 4 / 8 reference pictures: BQSquare 16.6% / 29.0% / 35.7%; Johnny 5.3% / 9.0% / 10.1%; Cactus 3.0% / 5.7% / 7.6%; Kimono 0.9% / 1.5% / 1.6%.]

Coding Efficiency. The coding efficiency improvement for using multiple reference pictures with a block-adaptive selection is demonstrated for the example of H.265 MPEG-H HEVC. We used an IPPP coding structure with the simple sliding window mechanism and modified the number M of reference pictures that are available for MCP. The pictures in the reference picture list were always ordered in reverse coding order. Figure 6.14 illustrates the experimental results for four test sequences. The diagrams show the bit-rate savings for M = 2, 4, 8 relative to a configuration with a single reference picture (M = 1). For all test sequences and bit rates, the coding efficiency increases with the number of reference pictures used. Similar to our investigations of the interpolation filter and motion vector accuracy, the largest coding gains are observed for sequences with high spatial details and low noise. This indicates that the different sampling of the optical signal in the reference pictures represents one reason for the observed gains.

Figure 6.15: Motion-compensated prediction with multiple motion hypotheses. [Illustration of a current picture s_k[x] whose blocks are predicted from one or more of the reconstructed reference pictures s'_{k−1}[x], s'_{k−2}[x], and s'_{k−3}[x].]

6.3.2 Multi-Hypothesis Prediction

In conventional MCP, the prediction signal ŝ[x] for a block is formed by a displaced block s'_r(x + m) of an already reconstructed picture, where r denotes the reference index that indicates the reference picture used and m represents the motion vector. Since the prediction signal is formed using a single reference block, this approach can also be referred to as single-hypothesis prediction. The concept of multi-hypothesis prediction generalizes the approach of MCP in a way that multiple displaced reference blocks are used. Each motion hypothesis {r_k, m_k} is characterized by a reference index r_k and an associated motion vector m_k and specifies a single displaced block s'_{r_k}(x + m_k). The final prediction signal is obtained by averaging the individual displaced image blocks. In the simplest and most commonly used variant, the prediction signal ŝ[x] for a block is given by

    ŝ[x] = (1/K) Σ_{k=0}^{K−1} s'_{r_k}(x + m_k),    (6.33)

where K ≥ 1 denotes the used number of motion hypotheses. More general forms of averaging will be discussed in Section 6.3.3.

As illustrated in Figure 6.15, the motion hypotheses for a block may refer to the same or to different reference pictures. And since multi-hypothesis prediction may not improve the coding efficiency for all inter-picture coded blocks of a picture, the syntax of video codecs

typically allows a block-adaptive selection of the number of motion hypotheses used. Let K_max denote the maximum number of motion hypotheses that are supported by a video codec. The prediction signal for a block is then specified by the number K ≤ K_max of motion hypotheses used and the associated reference picture indexes r_k and motion vectors m_k, with k ∈ [0, K−1]. For selecting the number K of motion hypotheses for a block, the supported values of K ∈ [1, K_max] can be considered as different coding modes and the Lagrangian mode decision algorithm described in Section 4.2.1 can be applied.

Multi-hypothesis prediction can be motivated as follows. Let u_k[x] denote the prediction error signals for the individual hypotheses,

    u_k[x] = s[x] − s'_{r_k}(x + m_k).    (6.34)

The final prediction error signal u[x] can then be written as

    u[x] = s[x] − (1/K) Σ_{k=0}^{K−1} s'_{r_k}(x + m_k) = (1/K) Σ_{k=0}^{K−1} u_k[x].

If we now assume that the prediction error signals u_k[x] for the individual motion hypotheses have zero mean and are uncorrelated with each other, the power spectral density Φ_UU(ω) for the final prediction error signal is given by

    Φ_UU(ω) = (1/K²) Σ_{k=0}^{K−1} Φ_{U_k U_k}(ω),    (6.35)

where Φ_{U_k U_k}(ω) denotes the power spectral density for the prediction error signal of the k-th motion hypothesis. For simplifying the consideration, we further assume that the individual hypotheses yield a similar power spectrum for the prediction error signals, which gives

    Φ_UU(ω) ≈ (1/K) Φ_{U_0 U_0}(ω).    (6.36)

For our idealized assumptions, the residual power spectrum Φ_UU(ω) for multi-hypothesis prediction is reduced to 1/K-th of the residual power spectrum Φ_{U_0 U_0}(ω) for conventional single-hypothesis prediction. And as discussed in Section 6.1.1, a reduction of the power spectral density for the prediction error results in a coding efficiency improvement.
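The 1/K behavior of (6.36) can be illustrated with a small numerical experiment. The Python sketch below models the prediction error signals of the individual hypotheses as zero-mean, uncorrelated noise of equal variance, exactly as in the idealized assumptions above; it is not meant to reflect the behavior of real video signals.

    import numpy as np

    rng = np.random.default_rng(0)
    num_samples, variance = 100_000, 4.0

    for K in (1, 2, 4):
        # K uncorrelated, zero-mean prediction error signals of equal variance
        u = rng.normal(0.0, np.sqrt(variance), size=(K, num_samples))
        # final prediction error of multi-hypothesis prediction (average of the
        # individual errors)
        u_avg = u.mean(axis=0)
        print(f"K = {K}: measured variance = {u_avg.var():.3f}, "
              f"idealized value = {variance / K:.3f}")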

The assumptions made above are not completely realistic. The prediction errors u_k[x] for the individual hypotheses are typically not completely uncorrelated and the power spectra Φ_{U_k U_k}(ω) can noticeably differ from each other. Nonetheless, in many cases, the usage of multiple motion hypotheses reduces the energy of the prediction error signal relative to single-hypothesis prediction; but the reduction is smaller than for the idealized case given in (6.36). A more rigorous analysis of the efficiency of multi-hypothesis prediction is given in [76, 77, 67]. The authors extended the model that we introduced in Section 6.1.1 toward multi-hypothesis prediction and additionally considered the impact of the motion vector accuracy and the choice of interpolation filters.

Even though multi-hypothesis prediction typically improves the quality of the prediction signal, it also requires a larger side information rate. Instead of a single set of motion data, we have to transmit multiple motion vectors and reference indexes (if multiple reference pictures are enabled). Since the associated bit rate increase could outweigh the improvement of the prediction signal, it is important that the number K of motion hypotheses can be adapted on a block level.

Motion Estimation. For multi-hypothesis prediction with K motion hypotheses, the encoder has to determine K motion vectors m_k and K reference indexes r_k. It should be noted that even if the reference indexes r_k are given, an independent determination of the motion vectors m_k for the individual hypotheses cannot provide the best possible prediction signal. For an optimal solution, the motion search would have to be done in the product space of all hypotheses. Let us assume we apply the Lagrangian motion search discussed in Section 4.2.2 and use the SAD between the original and prediction signal as distortion measure. Then, the optimal motion data {r_k, m_k} for a block can be found by minimizing the Lagrangian cost

    Σ_x | s[x] − (1/K) Σ_{k=0}^{K−1} s'_{r_k}(x + m_k) | + λ ( Σ_{k=0}^{K−1} R(r_k, m_k) ),    (6.37)

where R(r_k, m_k) denotes the number of bits that are required for transmitting the reference index r_k and the associated motion vector m_k.

Due to the extremely large product space, a direct minimization of the Lagrangian cost (6.37) is not possible in practice. A practical and still very efficient algorithm is obtained if the individual motion hypotheses are determined in an iterative manner [323, 66, 65]. In each iteration step, the parameters of one hypothesis are determined conditioned on given parameters for the other hypotheses. Let us assume we want to determine the reference index r_i and the motion vector m_i for the i-th hypothesis given the parameters for the other hypotheses. Then, a minimization of (6.37) is equivalent to minimizing the cost

    Σ_x | s_i[x] − s'_{r_i}(x + m_i) | + λ' R(r_i, m_i),    (6.38)

with λ' = K λ and s_i[x] given by

    s_i[x] = K s[x] − Σ_{k≠i} s'_{r_k}(x + m_k).    (6.39)

Note that a minimization of (6.38) has the same complexity as a conventional motion estimation for a single hypothesis. The only difference is that the distortion term is calculated using the modified signal s_i[x] instead of the original signal s[x]. The iterative algorithm for determining the motion parameters for K motion hypotheses can be summarized as follows:

1. Estimate an initial set of the K motion hypotheses {r_k, m_k} using a conventional independent motion search;
2. Set the hypothesis index i equal to zero;
3. Determine the signal s_i[x] according to (6.39) and refine the i-th motion hypothesis {r_i, m_i} by minimizing the cost (6.38);
4. Increment the hypothesis index i and, if i < K, go to step 3;
5. If a certain convergence criterion is not fulfilled (e.g., a maximum number of iterations), proceed with step 2.

In the initialization step (step 1), it is typically advantageous to first estimate a motion vector for each available reference picture and then select the K initial hypotheses among these combinations of reference indexes and motion vectors. For further reducing the algorithmic complexity (without a noticeable impact on coding efficiency), the search range for the motion refinement in step 3 can be chosen significantly smaller than the search range for the initial estimation in step 1.
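The iterative refinement can be outlined as follows. The Python sketch below is schematic and uses our own interfaces: estimate_single_hypothesis() stands for a conventional Lagrangian motion search that minimizes a cost of the form (6.38) against the passed target signal, and get_reference() returns the (interpolated) displaced reference block s'_{r_k}(x + m_k) for a hypothesis; both are assumed to exist and are not shown.

    def refine_hypotheses(s, hypotheses, lam, max_iter,
                          estimate_single_hypothesis, get_reference):
        """Iterative conditional refinement of K motion hypotheses.

        s          : original block samples (2D array, e.g., a NumPy array)
        hypotheses : initial list of (r_k, m_k) pairs from independent searches
        lam        : Lagrange multiplier of the joint cost (6.37)
        """
        K = len(hypotheses)
        for _ in range(max_iter):
            for i in range(K):
                # modified target signal s_i[x] of eq. (6.39)
                s_i = K * s
                for k, hyp in enumerate(hypotheses):
                    if k != i:
                        s_i = s_i - get_reference(hyp)
                # conventional single-hypothesis search against s_i[x]
                # with the scaled Lagrange multiplier lambda' = K * lambda
                hypotheses[i] = estimate_single_hypothesis(s_i, lam * K)
        return hypotheses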

Figure 6.16: B pictures in the video coding standards MPEG-1 Video [112], H.262 MPEG-2 Video [122], and MPEG-4 Visual [111]. The numbers in parentheses indicate the coding order of the pictures in a BBP or BBI group. [Illustration of forward prediction, backward prediction, and bi-prediction between the I/P pictures and the B pictures of such a group.]

Video Coding Standards. Due to the impact on the implementation complexity of encoder and decoder, today's video coding standards support only up to two motion hypotheses. MCP with two hypotheses is commonly referred to as bi-prediction. Similarly, MCP with a single motion hypothesis can also be referred to as uni-prediction. The usage of bi-prediction is not enabled in all inter pictures. Typically, the following picture types (or slice types) are supported:

I picture: All blocks are coded in an intra mode;
P picture: Each block is either coded in an intra mode or it is coded using conventional MCP (uni-prediction);
B picture: Each block can be coded using bi-prediction, uni-prediction, or intra-picture coding.

In the early video coding standards MPEG-1 Video [112], H.262 MPEG-2 Video [122], and MPEG-4 Visual [111], the usage of B pictures is restricted in several ways. If B pictures are enabled, the coding order of pictures is modified as illustrated in Figure 6.16. B pictures are only

coded as part of a B⋯BP or B⋯BI group. First, the I or P picture succeeding the group of B pictures is coded using the previous I/P picture as reference. Then, the B pictures of the group are coded in display order. The mentioned standards do not support the general concept of multiple reference pictures, but provide the following three basic MCP modes for B pictures (see Figure 6.16):

Forward prediction: Single-hypothesis prediction using the temporally preceding I/P picture as reference picture.
Backward prediction: Single-hypothesis prediction using the temporally succeeding I/P picture as reference picture.
Bi-directional prediction: Bi-prediction, where one hypothesis uses the temporally preceding I/P picture as reference picture and the other uses the temporally succeeding I/P picture.

The video coding standard H.263 [120] includes a more flexible concept for using multiple reference pictures (if Annex U is enabled), but has the same limitations regarding the coding order for B pictures. In H.264 MPEG-4 AVC [121] and H.265 MPEG-H HEVC [123], the usage of bi-prediction was generalized. These standards specify I, P, and B slices, which have similar properties as the I, P, and B pictures in prior standards. A video picture can be composed of slices of different types. The coding order of pictures can be chosen independently of the used slice types. For B slices, two reference picture lists are specified, which are referred to as list 0 and list 1. Beside intra-picture coding, the B slice syntax provides the following MCP modes:

List 0 prediction: Single-hypothesis prediction using a reference picture of list 0.
List 1 prediction: Single-hypothesis prediction using a reference picture of list 1.
Bi-prediction: Two-hypothesis prediction using one reference picture of list 0 and one reference picture of list 1.

The size of the reference picture lists and the contained pictures can be arbitrarily selected. Hence, bi-prediction can be used without any

limitations. Moreover, the decoupling of the coding order and the picture/slice types provides the possibility to choose advanced temporal coding structures, which we will discuss in Section 6.5. There are, however, some limitations regarding the block sizes for which bi-prediction can be selected. In H.264 MPEG-4 AVC, if an 8×8 sub-macroblock is split into smaller blocks, all of these blocks are coded using the same prediction type (list 0, list 1, or bi-prediction). And in H.265 MPEG-H HEVC, bi-prediction can only be selected for blocks greater than or equal to 8×8 luma samples.

Coding Efficiency. The effectiveness of multi-hypothesis prediction is demonstrated for the example of H.265 MPEG-H HEVC. As in previous experiments, we chose a prediction structure in which all pictures are coded in display order. Except for the first picture, which is coded as intra picture, a block-adaptive selection between intra-picture coding, uni-prediction, and bi-prediction is enabled by coding the video pictures using the B slice syntax (with one slice per picture). This IBBB coding is compared to a conventional IPPP coding, in which the pictures are coded using the P slice syntax. The number of reference pictures was varied between M = 1, 2, 4, 8. The reference picture lists include the preceding M pictures in reverse coding order. For B slices, list 0 and list 1 contain the same pictures in the same order.

The measured bit-rate savings are shown in Figure 6.17 for four selected test sequences. Each curve compares adaptive bi-prediction and uni-prediction for the same number of reference pictures. The support of bi-prediction improves the coding efficiency for all configurations, bit rates, and test sequences. The obtained gains depend mainly on the source material and the bit rate. In the diagrams of Figure 6.18, the average bit-rate savings for uni-prediction and adaptive bi-prediction are plotted over the number of reference pictures used. The bit-rate savings are measured relative to the IPPP coding with a single reference picture and are averaged over the sequences of the test sets specified in Appendix A. It can be seen that, for both IPPP and IBBB coding, the coding efficiency increases with the number of reference pictures.

Figure 6.17: Coding efficiency improvement of adaptive bi-prediction relative to uni-prediction for different numbers of reference pictures (the video pictures are coded in display order): (a) BQSquare; (b) Johnny; (c) Cactus; (d) Kimono. [Bit-rate savings over PSNR (Y) in dB. Average savings for 1 / 2 / 4 / 8 reference pictures: BQSquare 13.2% / 14.9% / 14.2% / 13.0%; Johnny 8.3% / 10.5% / 11.4% / 12.1%; Cactus 5.3% / 5.9% / 6.5% / 7.3%; Kimono 6.8% / 6.8% / 6.9% / 7.0%.]

Figure 6.18: Average bit-rate savings relative to uni-predictive coding with a single reference picture (the video pictures are coded in display order): (a) Entertainment-quality content (Appendix A.1); (b) Video conferencing content (Appendix A.2). [Average bit-rate saving in percent, plotted over the number of reference pictures, for uni-prediction and adaptive bi-prediction.]

Due to the limitations in the H.265 MPEG-H HEVC standard, we only compared MCP with one and two motion hypotheses. However, the experimental investigations in [66, 67] indicate that further improvements of coding efficiency could be obtained when MCP modes with three or more motion hypotheses are added.

6.3.3 Weighted Prediction

Weighted prediction describes an approach for further generalizing the concept of motion-compensated prediction. In its general form, each motion hypothesis {r_k, m_k} is associated with a scaling factor w_k and an offset o_k. The final prediction signal ŝ[x] for a block is given by

    ŝ[x] = Σ_{k=0}^{K−1} ( w_k s'_{r_k}(x + m_k) + o_k ).    (6.40)

In conventional multi-hypothesis prediction, all scaling factors w_k are set equal to 1/K and the offsets o_k are set equal to zero. By introducing adjustable scaling factors and offsets, the flexibility for constructing the prediction signal ŝ[x] is increased. It should be noted that weighted prediction can not only be applied to multi-hypothesis, but also to conventional single-hypothesis prediction.

The highest quality of the prediction signal ŝ[x] could be obtained if the scaling factors and offsets were optimized for each block. This would, however, drastically increase the side information rate, so that typically no improvement of the overall coding efficiency would be achieved. In typical applications, each picture in the reference picture lists is associated with individual weights and offsets. Hence, the scaling factor w_k and offset o_k are selected together with the reference index r_k and, except for signaling the weighted prediction parameters in the picture or slice header, the side information rate is not increased.

The concept of weighted prediction is included in the standards H.264 MPEG-4 AVC and H.265 MPEG-H HEVC. It is particularly useful for coding video material that contains fades (fade-ins, fade-outs, or crossfades). An algorithm for estimating suitable weighting parameters is described in [17]; this publication also includes experimental results that show significant coding gains for video with fades.
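A direct implementation of (6.40) is straightforward. The Python sketch below applies per-hypothesis weights and offsets to blocks that have already been motion compensated; the names are our own, and the rounding and clipping of integer sample values used in real codecs is omitted for brevity.

    import numpy as np

    def weighted_prediction(ref_blocks, weights, offsets):
        """Weighted multi-hypothesis prediction according to eq. (6.40).

        ref_blocks: list of K motion-compensated blocks s'_{r_k}(x + m_k)
        weights   : list of K scaling factors w_k
        offsets   : list of K offsets o_k
        """
        pred = np.zeros_like(ref_blocks[0], dtype=float)
        for block, w, o in zip(ref_blocks, weights, offsets):
            pred += w * block + o
        return pred

    # conventional bi-prediction as a special case: w_k = 1/2, o_k = 0
    b0 = np.full((4, 4), 100.0)
    b1 = np.full((4, 4), 110.0)
    print(weighted_prediction([b0, b1], [0.5, 0.5], [0.0, 0.0])[0, 0])  # 105.0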

6.3.4 Further Motion Compensation Techniques

The concepts of motion-compensated prediction that are most relevant for practical video codecs have been described in the previous sections. In the following, we briefly mention two techniques that gained some interest in the research community, but are rarely used in practice:

MCP using control grid interpolation: The motion between two pictures is described using a small set of so-called control points, which are typically chosen as vertices of a rectangular grid in the current image. The displacements of the control points are estimated at the encoder and transmitted to the decoder. The spatial displacements of all other image points are determined by interpolating between the displacements of the control grid points. More details can be found in [249, 190, 107, 31].

Overlapped block motion compensation (OBMC): While in conventional block motion compensation the picture to be coded is partitioned into non-overlapping blocks, in OBMC the picture is segmented into overlapping blocks. Typically, this segmentation is performed in a way that each sample is covered by four blocks. The prediction signal ŝ[x] for a sample is then given by

    ŝ[x] = Σ_{k=0}^{3} w_k(x) s'_{r_k}(x + m_k),    (6.41)

where r_k and m_k denote the reference indexes and motion vectors, respectively, that are associated with the blocks that cover the sample at location x. w_k(x) represents a window function that decays from the center to the border of the k-th block. The selected window functions fulfill the condition

    Σ_{k=0}^{3} w_k(x) = 1.    (6.42)

An advantage of OBMC is that, due to the windowing of the individual prediction signals s'_{r_k}(x + m_k), it generates a smooth prediction signal for the picture. For more details on OBMC, the reader is referred to the publications [188, 253, 193, 155, 265]; a small window-weighting sketch is given directly after this list.
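As a small illustration of the window functions used in OBMC, the Python sketch below constructs a separable triangular window for blocks that overlap by half the block size in each direction, so that every sample is covered by four blocks. The triangular window is only one possible choice satisfying condition (6.42); practical schemes may use different window shapes.

    import numpy as np

    def obmc_window_1d(n):
        """Triangular 1D window of length 2*n; windows shifted by n sum to 1."""
        ramp = (np.arange(n) + 0.5) / n
        return np.concatenate([ramp, ramp[::-1]])

    def obmc_window_2d(n):
        """Separable 2D window; the four windows covering a sample sum to 1,
        i.e., condition (6.42) is fulfilled."""
        w = obmc_window_1d(n)
        return np.outer(w, w)

    n = 8
    w2d = obmc_window_2d(n)
    # check (6.42): the four windows shifted by n in x and/or y sum to 1
    overlap = w2d[n:, n:] + w2d[:n, n:] + w2d[n:, :n] + w2d[:n, :n]
    assert np.allclose(overlap, 1.0)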

6.4 Coding of Motion Parameters

The bit rate that is required for transmitting the motion parameters of inter-picture coded blocks can form a substantial part of the overall bit rate. But since in many cases the motion does not change abruptly from block to block, the motion data of already coded neighboring blocks can be utilized for an efficient coding. Video codecs often provide two types of inter-picture coding modes, which differ in the capabilities to signal motion information. The first class shall be referred to as conventional inter-picture coding modes and allows the transmission of arbitrary motion parameters. For the second class of coding modes, the motion parameters are not transmitted, but completely inferred based on the data of already coded blocks.

6.4.1 Motion Vector Prediction

For conventional inter-picture coding modes, all parameters that are required for specifying the formation of the prediction signal have to be transmitted. In the general case, this includes the following data:

Prediction type (e.g., list 0, list 1, or bi-prediction) or, more generally, the number of motion hypotheses;
Reference picture index for each motion hypothesis;
Motion vector for each motion hypothesis.

The prediction type⁸ and the reference picture indexes are typically coded using dedicated variable length codes or binary arithmetic codes. The corresponding syntax elements are only transmitted if high-level parameters such as the slice type or the reference list size indicate the support of multi-hypothesis prediction or an adaptive selection of reference pictures. Note that for an efficient representation of the reference indexes, the pictures in the reference picture lists should be ordered according to the probabilities of their usage.

In contrast to the prediction type and the reference indexes, the motion vector components m_x and m_y are usually not directly coded.

⁸ In H.264 MPEG-4 AVC, a combined code is used, which specifies the partitioning of a macroblock as well as the prediction types of the resulting subblocks.

Instead, a motion vector predictor m̂ = (m̂_x, m̂_y) is derived and the components Δm_x and Δm_y of the resulting motion vector difference

    Δm = m − m̂    (6.43)

are transmitted. Hence, the coding efficiency highly depends on the procedure used for deriving the motion vector predictor m̂. In B slices or B pictures, the forward and backward (or list 0 and list 1) motion vectors are typically treated independently of each other. In a very simple approach, which is, for example, used in H.262 MPEG-2 Video, the motion vector predictor m̂ is set equal to the motion vector of the block left to the current block.

Median Prediction. The median predictor is derived based on the motion vectors of three already coded neighboring blocks A, B, and C. Let m_A = (m_{A,x}, m_{A,y}) be the motion vector for block A, etc. The motion vector predictor m̂ = (m̂_x, m̂_y) for the current block is set equal to the component-wise median of the motion vectors m_A, m_B, and m_C,

    m̂_x = median( m_{A,x}, m_{B,x}, m_{C,x} ),
    m̂_y = median( m_{A,y}, m_{B,y}, m_{C,y} ).    (6.44)

The neighboring blocks A, B, and C are typically chosen as illustrated in Figure 6.19(a). Let us assume that the current block is located at a motion boundary. In many cases, two of the neighboring blocks would still have similar motion. Due to that reason, the component-wise median more often represents a suitable predictor than the motion vector of a single pre-defined neighboring block (e.g., the left block A). The concept of median prediction is, for example, used in H.263, MPEG-4 Visual, and H.264 MPEG-4 AVC.⁹ If multiple reference pictures are used, it is often advantageous to prefer neighboring blocks that have the same reference index as the current block. Alternatively, the neighboring motion vectors can be scaled according to the temporal distances between the current picture and the corresponding reference pictures.

⁹ An exception are the 16×8 and 8×16 blocks in H.264 MPEG-4 AVC, for which the motion vector of a single neighboring block is used as predictor.

Figure 6.19: Neighboring blocks used for motion vector prediction: (a) Median prediction; (b) Switched motion vector prediction in H.265 MPEG-H HEVC. [In (a), the blocks A, B, and C neighbor the current block; in (b), the spatial candidate positions A_0, A_1, B_0, B_1, B_2 and the co-located positions T_0, T_1 are shown.]

For example, with t_c, t_r, and t_A representing the sampling times of the current picture, the reference picture for the current block, and the reference picture for block A, respectively, the scaled motion vector could be derived according to

    m_A^(scaled) = (t_c − t_r) / (t_c − t_A) · m_A.    (6.45)

Switched Prediction. If the motion vector coding is conducted with a single predictor, there are always situations in which a large motion vector difference Δm has to be transmitted, because the predictor significantly differs from the motion vector of the current block. In order to overcome this limitation, in switched prediction, multiple potential predictors are derived and one of them is chosen for coding the motion vector [132, 163]. This concept obviously reduces the bit rate for coding the motion vector differences, but requires the signaling of the selected predictor. For that purpose, the candidate predictors are ordered into a list (the order is known to encoder and decoder) and an index into the list is transmitted. The improvement in coding the motion vector differences typically outweighs the overhead for signaling the used predictor. In [163], average bit-rate savings of roughly 5% have been reported relative to a component-wise median prediction.

Due to its simplicity and effectiveness, switched motion vector prediction was included into the design of H.265 MPEG-H HEVC [123]. In the standard, the candidate list comprises two potential predictors. As illustrated in Figure 6.19(b), up to two predictors (A and B) are derived from five spatially neighboring blocks. When at least one of the spatial candidates is not available or when both candidates are the

same, a temporal predictor (T) is added to the list (if available). It is derived from two co-located blocks in a defined reference picture. The subscripts in Figure 6.19(b) indicate the processing order of the blocks for deriving the corresponding candidate predictors. For a detailed description, the reader is referred to the standard text [123].

6.4.2 Inferred Motion Parameters

Video codecs often include special coding modes, in which the motion parameters are not transmitted, but completely derived at the decoder side. Compared to conventional inter-picture coding modes, the energy of the prediction error signal is increased. But in many cases, this is more than compensated by saving the bit rate for the motion data. The support of coding modes with inferred motion parameters in addition to conventional inter-picture coding modes often significantly increases coding efficiency. In the following, we briefly describe three examples.

Temporal Direct Mode. The motion parameter derivation for the temporal direct mode is based on the assumption that the velocity of objects in the image plane does not change significantly during a rather short time period. As illustrated in Figure 6.20(a), we consider the derivation of bi-predictive motion parameters for a current block in a picture with sampling time t_c. Given is a co-located picture with sampling time t_1. The used picture is typically indicated by high-level syntax elements. The block in the co-located picture that has the same location as the current block is referred to as co-located block. It is associated with a motion vector m_col that references a picture with sampling time t_0. The list 0 and list 1 reference indexes for the current block are selected in a way that they refer to the pictures with sampling times t_0 and t_1, respectively. Using the assumption of linear motion, the associated motion vectors m_0 and m_1 are derived according to

    m_0 = (t_c − t_0) / (t_1 − t_0) · m_col,    m_1 = (t_c − t_1) / (t_1 − t_0) · m_col.    (6.46)

A temporal direct mode is included in the video coding standards H.263 and H.264 MPEG-4 AVC. The direct mode in MPEG-4 Visual includes the transmission of a correction vector.
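The derivation in (6.46) can be written as a small helper function. The Python sketch below uses our own function name and operates on floating-point sampling times; in standards such as H.264 MPEG-4 AVC, the scaling is typically realized with integer arithmetic based on picture order counts.

    def temporal_direct_mvs(m_col, t_c, t_0, t_1):
        """Derive the two motion vectors of the temporal direct mode, eq. (6.46).

        m_col : motion vector (x, y) of the co-located block, pointing from the
                picture at time t_1 to the picture at time t_0
        t_c   : sampling time of the current picture
        """
        scale_0 = (t_c - t_0) / (t_1 - t_0)
        scale_1 = (t_c - t_1) / (t_1 - t_0)
        m_0 = (scale_0 * m_col[0], scale_0 * m_col[1])
        m_1 = (scale_1 * m_col[0], scale_1 * m_col[1])
        return m_0, m_1

    # current picture halfway between t_0 = 0 and t_1 = 2: the co-located
    # vector is split into a forward and a backward part
    print(temporal_direct_mvs((8.0, -4.0), t_c=1, t_0=0, t_1=2))
    # ((4.0, -2.0), (-4.0, 2.0))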

Figure 6.20: Inference of motion parameters: (a) Temporal direct mode; (b) Neighboring candidate blocks used for the merge mode in H.265 MPEG-H HEVC.

Often, the bitstream syntax supports two different versions of the direct mode, which are typically referred to as Direct and Skip mode. While in the Direct mode only the motion parameters are inferred and the prediction error signal is transmitted using transform coding, the Skip mode additionally signals that all samples of the prediction error signal are equal to zero. In the Skip mode, only the coding mode has to be transmitted. Hence, it can very efficiently represent blocks that are characterized by temporally consistent translational motion.

Merge Mode. Instead of a direct mode, the video coding standard H.265 MPEG-H HEVC [123] supports a so-called merge mode [96]. The concept is similar to that of the switched motion vector prediction discussed in Section 6.4.1. If a block is coded in merge mode, a list with candidate motion parameters is constructed and an index into this list is transmitted. But in contrast to the switched motion vector prediction, each candidate specifies a complete set of motion parameters including the prediction type (number of motion hypotheses), the reference indexes, and the motion vectors. In H.265 MPEG-H HEVC, the candidate list consists of up to five parameter sets; the actual list size is signaled in the slice header. Figure 6.20(b) illustrates the locations of the blocks that are used for deriving the candidate parameter sets. Up to four spatial merge candidates are derived using the neighboring blocks labeled A to E, and one temporal candidate is derived using the co-located blocks T_0 and T_1 in a given co-located picture (specified by high-level syntax elements).

If multiple candidates are not available or have identical motion parameters, the candidate list is filled using combined parameter sets or zero motion vector candidates. For the spatial candidates, all motion parameters of the corresponding neighboring blocks are copied. The derivation of the temporal candidate includes a corresponding scaling of the co-located motion vector. For more details, the reader is referred to the standard text [123] or the description in [96]. Similarly as for the direct mode in prior standards, H.265 MPEG-H HEVC supports a Skip mode in addition to the normal merge mode. Besides the usage of the merge concept for deriving the motion data, it also signals that all samples of the residual signal are equal to zero. If we neglect the temporal candidate, coding a block in merge mode (or Skip mode) is basically the same as merging the current block with one or more of the neighboring blocks and transmitting a single motion parameter set for the resulting image region [52, 96]. For that reason, this coding mode is called merge mode. It represents a simple and very efficient method for signaling the motion of large, consistently moving image regions.
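As an illustration of this list-based signaling, the following Python sketch constructs a simplified merge candidate list: complete motion parameter sets are copied from available spatial neighbors, duplicates are pruned, a temporal candidate is appended, and the list is padded with zero-motion candidates. The class and function names as well as the candidate values are illustrative assumptions; the normative derivation in H.265 MPEG-H HEVC additionally defines the exact neighbor positions, pruning rules, and combined bi-predictive candidates.

```python
# Simplified sketch of merge candidate list construction (not the normative HEVC
# process): spatial candidates copy complete motion parameter sets, duplicates are
# removed, and the list is filled up with zero-motion candidates if necessary.

from dataclasses import dataclass
from typing import Optional, List

@dataclass(frozen=True)
class MotionParams:
    hypotheses: int                 # 1 = uni-prediction, 2 = bi-prediction
    ref_idx: tuple                  # one reference index per hypothesis
    mv: tuple                       # one motion vector (x, y) per hypothesis

def build_merge_list(spatial: List[Optional[MotionParams]],
                     temporal: Optional[MotionParams],
                     max_candidates: int = 5) -> List[MotionParams]:
    candidates: List[MotionParams] = []
    for cand in spatial + [temporal]:          # spatial neighbors first, then temporal
        if cand is not None and cand not in candidates:
            candidates.append(cand)
    zero = MotionParams(1, (0,), ((0, 0),))    # zero-motion filler candidate
    while len(candidates) < max_candidates:
        candidates.append(zero)
    return candidates[:max_candidates]

# Example: three available spatial neighbors (two identical) and one temporal candidate.
a = MotionParams(1, (0,), ((3, -1),))
b = MotionParams(2, (0, 1), ((3, -1), (-2, 1)))
merge_list = build_merge_list([a, a, b, None, None],
                              temporal=MotionParams(1, (0,), ((4, 0),)))
merge_idx = 1                                  # index signaled in the bitstream
print(merge_list[merge_idx])                   # the block inherits this parameter set
```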

Figure 6.21: Basic principle of decoder-side motion vector derivation.

Decoder-Side Motion Estimation. The above discussed approaches for deriving motion parameters utilize the motion data of already coded neighboring or co-located blocks. In an alternative concept, the motion data are derived by performing a motion estimation at the decoder side. The basic idea is illustrated in Figure 6.21. Since the samples of the current block cannot be used, the motion search is typically conducted using a template T, which consists of neighboring samples that belong to already reconstructed blocks [255, 256, 134, 135]. Let s[x] represent the already reconstructed samples of the current picture and s_ref(x) represent an interpolated reference picture. If we use the SAD as distortion measure, the derived motion vector m is given by

    m = \arg\min_{m \in \mathcal{M}} \sum_{x \in T} \left| s[x] - s_\text{ref}(x + m) \right|,    (6.47)

where \mathcal{M} specifies the set of potential motion vector candidates. In order to prevent mismatches, encoder and decoder have to use exactly the same procedure for motion estimation. Note that, in comparison to all other discussed coding modes, the decoder complexity is noticeably increased. The actual implementation complexity highly depends on the chosen set \mathcal{M} of potential motion vector candidates. An example of a fast search strategy is discussed in [133]. Conceptually, the reference picture indexes and the prediction type can be derived in a similar way.

An alternative for deriving bi-predictive motion data is suggested in [150]. As for the temporal direct mode, it is assumed that the motion is linear within a short period of time. For determining the bi-predictive motion parameters, only those motion vector pairs are considered that fulfill the linearity assumption. The used motion vector pair is derived by minimizing a distortion measure between the two associated blocks in the reference pictures.
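A compact sketch of the template-matching search in (6.47) is given below. The search range, the integer-sample accuracy, and the L-shaped template are illustrative assumptions; actual proposals operate on interpolated reference pictures and employ fast search strategies such as the one discussed in [133].

```python
import numpy as np

# Illustrative sketch of decoder-side motion estimation by template matching (6.47):
# the SAD between the template samples of the current picture and displaced samples
# of the reference picture is minimized over a small candidate set. Integer-sample
# accuracy and the exhaustive search are simplifying assumptions.

def template_sad(cur, ref, template_coords, mv):
    dx, dy = mv
    return sum(abs(int(cur[y, x]) - int(ref[y + dy, x + dx]))
               for (y, x) in template_coords)

def derive_mv(cur, ref, template_coords, search_range=2):
    # Candidate set M: all integer displacements within +/- search_range.
    candidates = [(dx, dy) for dx in range(-search_range, search_range + 1)
                           for dy in range(-search_range, search_range + 1)]
    return min(candidates, key=lambda mv: template_sad(cur, ref, template_coords, mv))

# Toy example: an L-shaped template of reconstructed samples above and left of a
# 4x4 current block located at (y, x) = (8, 8).
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
cur = np.roll(ref, shift=(0, 1), axis=(0, 1))        # content shifted right by one sample
template = [(7, x) for x in range(8, 12)] + [(y, 7) for y in range(8, 12)]
print(derive_mv(cur, ref, template))                 # expected displacement (-1, 0)
```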

6.5 Coding Structures

At first glance, it might seem appropriate to code the pictures of a video sequence in the same order in which they were captured by the camera and will eventually be displayed at the receiver side. But as mentioned in previous sections, we could also choose a coding order that deviates from the acquisition and display order. On the one hand, this would increase the encoding-decoding delay, since both encoder and decoder have to reorder pictures. But on the other hand, it opens up new possibilities for motion-compensated prediction. We could, for example, predict certain pictures using reference pictures from both the past and the future. The combination of coding order and the selection of reference pictures is commonly referred to as coding structure or prediction structure. As will be discussed in the following, the chosen coding structure has a significant impact on coding efficiency.

Figure 6.22: Conventional coding structures: (a) IPPP coding structure; (b) IBBP coding structure. The numbers specify the coding order of pictures.

Conventional Coding Structures. The conceptually simplest coding structure is the IPPP structure depicted in Figure 6.22(a). It is the only configuration supported in H.261 [119]. All pictures are coded in display order and use the temporally preceding picture as reference picture. We can also use additional reference pictures and enable multi-hypothesis prediction without changing the coding order. In case the pictures are coded using the B picture or B slice syntax, the coding structure is also referred to as IBBB coding structure.

With the goal of saving bit rate, various researchers investigated the concept of motion-compensated interpolation [185, 237, 268]. The basic idea is to skip some pictures during coding and interpolate them later using motion compensation techniques. Since there are often image regions that cannot be suitably represented using temporal interpolation, the concept was improved by incorporating the transmission of a prediction error signal [203, 204, 93, 330]. This eventually led to the introduction of B pictures (see Section 6.3.2) in the video coding standards MPEG-1 Video [112] and H.262 MPEG-2 Video [122]. In these standards, B pictures can only be used in connection with certain coding structures, which originated from the concept of motion-compensated interpolation. These coding structures consist of groups of a P picture and one or more B pictures. An example with two B pictures, known as the IBBP structure, is shown in Figure 6.22(b). The P picture is transmitted first and uses the preceding I/P picture as reference picture. Thereafter, the B pictures are transmitted in display order. They can be predicted using both the temporally preceding I/P picture and the temporally succeeding P picture. Since B pictures are not used for MCP of other pictures, the bits that are spent for coding a B picture only impact the quality of the picture itself.

Figure 6.23: Hierarchical B pictures with groups of 8 pictures (GOP8). The subscripts indicate the hierarchy levels and the numbers specify the coding order.

In contrast to that, an increase of the reconstruction quality of an I or P picture typically also improves the quality of the motion-compensated prediction signal for the pictures that use the I/P picture as reference. Due to that reason, the quantization step size for B pictures is usually increased in comparison to that of I and P pictures.

Hierarchical Coding Structures. The conventional coding structures with B pictures, such as the example in Figure 6.22(b), can also be interpreted as a partitioning of the video pictures into two hierarchy levels. The first hierarchy level is formed by the I and P pictures; it can be decoded independently of all other pictures. The B pictures represent the second hierarchy level; they use the pictures of the first hierarchy level for motion-compensated prediction. The coding efficiency can often be further increased by introducing additional hierarchy levels [224, 225]. The resulting coding structures are often referred to as hierarchical B pictures. An example with four hierarchy levels is shown in Figure 6.23. Similarly as for conventional coding structures with B pictures, the pictures of the lowest hierarchy level (I_0 and B_0 in the figure) are coded in display order. Since these pictures form the basis of the hierarchy, they are also called key pictures. Even though the key pictures only use temporally preceding pictures for MCP, they can still be coded using the B picture syntax. The most common configurations are the so-called dyadic coding structures, in which each additional hierarchy level inserts a picture between two successive pictures of the lower levels. As illustrated in Figure 6.23, the two surrounding pictures of the lower hierarchy levels are typically used as reference pictures. But it is, of course, possible to include additional pictures into the reference picture lists.
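The following sketch generates one possible coding order and the associated hierarchy levels for a dyadic GOP (here, GOP8). It processes the hierarchy levels breadth-first, which is an assumption of this illustration; the order shown in Figure 6.23 may differ in detail, but any valid order must ensure that every picture is coded before the pictures that reference it.

```python
# Sketch of a dyadic hierarchical coding order (assumption: the GOP size is a power
# of two). The key picture of the group (highest display index, hierarchy level 0)
# is coded first; every further hierarchy level then inserts the middle picture
# between two already coded pictures, mirroring the GOP8 structure of Figure 6.23.

def dyadic_coding_order(gop_size):
    """Return (display_index, hierarchy_level) pairs in coding order for one GOP
    (display indices 1..gop_size; index 0 is the preceding key picture)."""
    order = [(gop_size, 0)]                      # key picture of the GOP (level 0)
    level, step = 1, gop_size // 2
    while step >= 1:
        for start in range(step, gop_size, 2 * step):
            order.append((start, level))         # picture in the middle of an interval
        level, step = level + 1, step // 2
    return order

for pos, (display_idx, lvl) in enumerate(dyadic_coding_order(8)):
    print(f"coding position {pos}: display index {display_idx}, hierarchy level {lvl}")
```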

The coding order has to be chosen in a way that each picture is transmitted before it is used as reference picture. A picture of the lowest level and the preceding pictures of higher levels are often called a group of pictures (GOP). If a GOP consists of X pictures, the corresponding hierarchical coding structure is also referred to as GOPX structure. Due to the decoupling of coding order, picture/slice coding types, and the construction of reference lists in H.264 MPEG-4 AVC [121] and H.265 MPEG-H HEVC [123], these standards support a large variety of hierarchical coding structures. The only restriction is the maximum size of the decoded picture buffer.

Coding Efficiency. Due to the hierarchically structured prediction, the overall coding efficiency highly depends on the bit allocation among the different hierarchy levels. The reconstruction quality of the key pictures (hierarchy level 0) directly or indirectly impacts the quality of the motion-compensated prediction signal for all other pictures. In contrast to that, the impact of a picture of hierarchy level 1 is limited to the pictures between two successive key pictures (see Figure 6.23). The higher the hierarchy level of a picture, the fewer pictures are impacted by its reconstruction quality. Hence, the quantization parameter should be increased from one hierarchy level to the next. A nearly optimal selection of the picture quantization parameters could be achieved by a trellis-based rate-distortion analysis similar to the strategy suggested in [206]. In practice, however, such a computationally complex approach is infeasible. Typically, fixed relationships between the quantization parameters of different hierarchy levels are used. A simple and robust strategy was suggested in [226]. With QP_0 being the quantization parameter for the key pictures, the quantization parameter QP_k for hierarchy level k > 0 is set according to

    QP_k = QP_0 + \delta_1 + (k - 1),    (6.48)

with \delta_1 = 4. (Hereby, it is assumed that the relationship between the quantization step size and the quantization parameter QP is given by 2^{QP/6}, as it is the case in the video coding standards H.264 MPEG-4 AVC and H.265 MPEG-H HEVC.) It should be noted that, in this bit allocation strategy, the Lagrange multipliers λ for all hierarchy levels are calculated using the same rule (4.32).
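The cascaded QP assignment (6.48) and its effect on the quantization step size can be illustrated with a few lines of Python (the step size relation 2^{QP/6} is taken from the parenthetical remark above; all other values are example inputs):

```python
# Small sketch of the cascaded QP assignment (6.48) and the corresponding
# quantization step size ratio, assuming the step size is proportional to
# 2^(QP/6) as in H.264 MPEG-4 AVC and H.265 MPEG-H HEVC.

def cascade_qp(qp0, level, delta1=4):
    """QP of a picture at the given hierarchy level (level 0 = key pictures)."""
    return qp0 if level == 0 else qp0 + delta1 + (level - 1)

qp0 = 28
for level in range(4):                       # four hierarchy levels as in a GOP8 structure
    qp = cascade_qp(qp0, level)
    step_ratio = 2 ** ((qp - qp0) / 6)       # step size relative to the key pictures
    print(f"level {level}: QP = {qp}, step size x{step_ratio:.2f}")
```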

Figure 6.24: Coding efficiency for hierarchical prediction structures: (a) Illustration of tested prediction structures; (b) Average bit-rate savings relative to IBBB coding for different test sequences and a QP assignment according to (6.48) with \delta_1 = 4; (c) Average bit-rate savings relative to IBBB coding for different QP assignments.

An alternative approach is used in the suggested encoder configuration for H.265 MPEG-H HEVC [16]. Here, the QP is increased by one from one hierarchy level to the next (\delta_1 = 1), but the scaling factor in (4.32) is modified depending on the hierarchy level.

For demonstrating the advantage of hierarchical prediction structures, we run simulations with the H.265 MPEG-H HEVC reference software [126] and the test sequences listed in Appendix A.1. The tested coding structures are illustrated in Figure 6.24(a). With the exception of the first picture of a video sequence, all pictures are coded using the B slice syntax with one picture per reference picture list. The diagrams in Figure 6.24(b,c) show average bit-rate savings relative to the simple IBBB coding structure.

Figure 6.25: Luma PSNRs of individual pictures for selected configurations (Cactus sequence, 50 Hz).

While the results in Figure 6.24(b) were generated using the suggested QP cascading (6.48) with \delta_1 = 4, the diagram in Figure 6.24(c) compares different QP assignments. The results indicate that the usage of hierarchical prediction structures can considerably improve coding efficiency. With the exception of the non-dyadic structure GOP4 (simple), a coding gain relative to the simple IBBB coding could already be achieved if all hierarchy levels are coded using the same quantization parameter. By using a cascaded QP assignment, the coding efficiency could be further improved. A comparison of the two GOP4 configurations clearly demonstrates the advantage of hierarchical prediction structures over conventional coding structures with B pictures. For the tested dyadic coding structures, the coding gain increases with the number of hierarchy levels. The GOP8 structure with the suggested QP assignment (\delta_1 = 4) provides an average bit-rate saving of 27.4%. The approach specified in the common test conditions [16] for H.265 MPEG-H HEVC, which uses \delta_1 = 1 but modifies the relationship between λ and QP, yields an average bit-rate saving of 27.7%. The difference in coding efficiency is very small and indicates that both approaches for setting QP and λ lead to a very similar bit allocation among the hierarchy levels.

At this point, it should be mentioned that the cascaded QP assignment results in relatively large PSNR fluctuations inside a group of pictures. Figure 6.25 shows an example. For the IBBB coding structure and the GOP8 coding structure with QP_k = QP_0, the PSNRs are very similar for all pictures of a video sequence.

But if we apply the suggested QP cascading with \delta_1 = 4 for the GOP8 structure, the non-key pictures have a PSNR that is significantly smaller than that of the key pictures. Nonetheless, the reconstructed video typically appears smooth, without any annoying changes of the subjective quality. If we compare settings with the same key picture QP, the cascading of the quantization parameters may lead to small losses in subjective quality. But these losses are typically outweighed by the significant bit-rate savings for the non-key pictures. In fact, hierarchical coding with cascaded quantization parameters can be considered a suitable realization of the original idea of motion-compensated interpolation.

For keeping the description simple, we analyzed only coding structures with a single picture per reference list. In general, the coding efficiency can be further improved if additional reference pictures are used. As an example, the GOP8 coding structure specified in the reference configuration [16] for H.265 MPEG-H HEVC uses two reference pictures per list and provides about 3% bit-rate savings in comparison to the simple GOP8 structure discussed above. Due to their efficiency, hierarchical coding structures have become a popular choice for applications that can tolerate a moderate encoding-decoding delay. They can also be used for providing several levels of temporal scalability [226].

Random Access Points. In certain application areas such as broadcast or streaming, features such as channel switching, fast forwarding, or general access to parts of the bitstream have to be supported. Therefore, the used video bitstreams have to provide random access points at regular intervals, at which a receiver can start decoding. In most applications, a so-called clean random access is required, which means that the reconstructed video pictures must not depend on the actually used random access point. This is only possible if the picture at a random access point is coded as an intra picture and the reference lists are restricted in a way that all pictures that follow the intra picture in both coding and display order do not reference any picture that precedes the intra picture in either coding or display order. There are two different concepts for providing random access points with hierarchical coding structures.

Figure 6.26: Random access points: (a) Closed GOP; (b) Open GOP.

These are commonly referred to as closed GOP and open GOP coding and are illustrated in Figure 6.26. In closed GOP coding, a picture at a random access point is treated in the same way as the first picture of a sequence. All pictures that precede the intra picture at a random access point in display order are transmitted before this intra picture. While this concept is simple, it has the decisive disadvantage that the temporal prediction chain is interrupted at the random access point. In particular at lower bit rates, this often causes subjectively annoying flickering effects. In open GOP coding, the intra pictures are coded similarly to normal key pictures. Hence, there are pictures that follow the intra picture in coding order, but precede it in display order. These pictures can use the intra picture as well as pictures that precede the intra picture in coding order for motion-compensated prediction. However, if the decoding is started with the intra picture at the random access point, the pictures that succeed the intra picture in coding order and precede it in display order have to be discarded. The H.265 MPEG-H HEVC syntax provides special parameters that simplify the identification of these pictures. The substantial advantage of open GOP coding is that the pictures before the random access point are predicted from both the past and the future. As a consequence, the differences in coding artifacts between the intra picture and the preceding key picture are distributed over several pictures and are often imperceptible.

Low-Delay Coding Structures. In interactive video applications such as video conferencing, the increase in encoding-decoding delay that is imposed by the discussed hierarchical coding structures is not tolerable. For providing an interactive user experience, the video pictures have to be transmitted in display order. Nonetheless, it is still possible to use a cascaded QP assignment and modified reference picture lists.

Figure 6.27: Coding efficiency for low-delay coding structures: (a) IBBB coding with two reference pictures; (b) Hierarchical coding with two reference pictures; (c) Bit-rate savings of the hierarchical structure relative to IBBB coding (same number of active reference pictures) for the test sequences of Appendix A.2.

As an example, we consider the low-delay coding structures shown in Figure 6.27(a,b). In both configurations, the pictures are coded in display order and each picture uses two reference pictures for MCP. The IBBB coding structure in Figure 6.27(a) treats all pictures in the same manner: MCP is performed using the two directly preceding pictures as reference pictures and the same QP is assigned to all pictures. In contrast to that, the coding structure in Figure 6.27(b) splits the pictures into three hierarchy levels. Besides the directly preceding picture, the reference picture lists also include the preceding key picture (hierarchy level 0). This coding structure is typically combined with a non-uniform QP assignment. The key pictures (I_0 and B_0) are more often used as reference pictures than the other pictures. Consequently, the key pictures are also coded using a smaller QP. For evaluating the impact on coding efficiency, we run coding experiments using the video conferencing sequences listed in Appendix A.2. Figure 6.27(c) summarizes the measured bit-rate savings of the hierarchical low-delay coding structure in Figure 6.27(b), with a cascaded QP assignment over the hierarchy levels, relative to the conventional IBBB coding structure in Figure 6.27(a). The modifications of the reference picture lists and the cascaded QP assignment led to an average bit-rate saving of about 16% for all test sequences. Similarly as for the high-delay coding structures, the coding efficiency can be further improved by using additional reference pictures.
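The reference picture selection of such a hierarchical low-delay structure can be sketched as follows. The key picture period of four and the particular hierarchy-level pattern are read off Figure 6.27(b) and should be understood as assumptions of this illustration rather than a normative configuration.

```python
# Sketch of reference picture selection for a hierarchical low-delay structure in
# the spirit of Figure 6.27(b): pictures are coded in display order, every fourth
# picture is a key picture (hierarchy level 0), and each picture references the
# directly preceding picture plus the most recent key picture.

def low_delay_references(num_pictures, key_period=4):
    """Return {picture: (hierarchy_level, reference_list)} for a low-delay structure."""
    structure = {}
    last_key = 0
    for n in range(num_pictures):
        if n == 0:
            structure[n] = (0, [])                     # intra picture, no references
            continue
        level = 0 if n % key_period == 0 else (1 if n % 2 == 0 else 2)
        refs = [n - 1]                                 # directly preceding picture
        if last_key not in refs:
            refs.append(last_key)                      # most recent key picture
        structure[n] = (level, refs)
        if level == 0:
            last_key = n
    return structure

for pic, (level, refs) in low_delay_references(9).items():
    print(f"picture {pic}: level {level}, references {refs}")
```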

6.6 In-Loop Filters

The reconstruction quality of the decoded video pictures can often be improved by applying one or more filters. In particular at low bit rates, certain filters can reduce the most common coding artifacts and, thus, increase the perceived image quality. Conceptually, the filters could be applied as an optional post-processing step. But an application inside the motion compensation loop has the advantage that the filtered pictures are not only used for output but also for the motion-compensated prediction of following pictures. This typically leads to an increase in coding efficiency. For this reason, modern video coding standards such as H.264 MPEG-4 AVC [121] and H.265 MPEG-H HEVC [123] specify one or more filters inside the prediction loop. Additional filters may still be applied as post-processing steps. In the following, we briefly describe three examples of filters. Two of them are included in H.265 MPEG-H HEVC and the other is often considered as a candidate for future video codecs.

6.6.1 Deblocking Filter

In block-based hybrid video coding, transform coding of the prediction error signals is performed on the basis of blocks. This has the effect that samples near block boundaries are less accurately reconstructed than samples in the interior of a block. If comparably large quantization step sizes are used, this often results in visible discontinuities at the borders between transform blocks. These discontinuities are referred to as blocking artifacts and are particularly disturbing in rather smooth image areas. Another source of blocking artifacts is the block-based character of MCP: discontinuities at prediction block boundaries can be caused by using different areas of a reference picture or different reference pictures for MCP of neighboring blocks. The subjective quality of reconstructed pictures can be improved if the discontinuities at block boundaries are attenuated by a so-called deblocking filter. In-loop deblocking filters are included in the video coding standards H.263 [120], H.264 MPEG-4 AVC [121], and H.265 MPEG-H HEVC [123]. The specification of MPEG-4 Visual [111] also includes a deblocking filter, but only as an optional post filter.
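Before turning to the actual H.265 MPEG-H HEVC design, the basic idea can be illustrated with a deliberately simplified Python sketch: a 1D line segment across a block boundary is smoothed only if the boundary step is small enough to be a plausible coding artifact, whereas larger steps are preserved as likely content edges. The threshold and the filter operations are illustrative assumptions and are much simpler than the adaptive filtering described below.

```python
# Toy illustration of the deblocking idea (much simpler than the normative HEVC
# filter): for a 1D line segment across a block boundary, the step between the two
# blocks is smoothed only if it is small enough to be a likely coding artifact;
# large steps are kept, as they probably belong to real image content.

def deblock_line(segment, threshold):
    """segment: samples p1, p0 | q0, q1 across a block boundary (length 4)."""
    p1, p0, q0, q1 = [float(v) for v in segment]
    if abs(p0 - q0) >= threshold:          # strong step: treat as a content edge
        return list(segment)
    delta = (q0 - p0) / 2.0                # move the boundary samples towards each other
    return [p1, p0 + delta / 2.0, q0 - delta / 2.0, q1]

# A small artifact step (smoothed) versus a strong edge (left untouched).
print(deblock_line([100, 102, 110, 111], threshold=20))   # boundary step of 8 is reduced
print(deblock_line([100, 102, 160, 161], threshold=20))   # step of 58 is preserved
```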

Figure 6.28: Deblocking filter example: (left) Section of a reconstructed intra picture at very low bit rate; (right) Same image section after deblocking.

The effect of a deblocking filter is illustrated in Figure 6.28. The figure shows a section of an intra-coded picture before and after applying the deblocking filter specified in H.265 MPEG-H HEVC. In the following, we describe the basic operation of a deblocking filter. We use the luma deblocking filter of H.265 MPEG-H HEVC as an example, but do not describe all aspects. For more details, the reader is referred to the video coding standards [120, 121, 123] or the publications [170] and [189], which explain the deblocking filter design for H.264 MPEG-4 AVC and H.265 MPEG-H HEVC, respectively.

Adaptive Filtering. The basic idea of a deblocking filter is to apply an adaptive low-pass filter across horizontal and vertical boundaries between prediction and transform blocks. The problem can be split into the filtering of 1D line segments, which represent sets of samples across a horizontal or vertical block boundary. Figure 6.29(a) illustrates such a line segment. The original reconstructed signal shows a distinct discontinuity at the boundary between the blocks P and Q. The deblocking filter modifies the samples near the block boundary and, thus, smoothes out the edge between the considered blocks. The main dif-


White Paper. Uniform Luminance Technology. What s inside? What is non-uniformity and noise in LCDs? Why is it a problem? How is it solved? White Paper Uniform Luminance Technology What s inside? What is non-uniformity and noise in LCDs? Why is it a problem? How is it solved? Tom Kimpe Manager Technology & Innovation Group Barco Medical Imaging

More information

PERCEPTUAL QUALITY OF H.264/AVC DEBLOCKING FILTER

PERCEPTUAL QUALITY OF H.264/AVC DEBLOCKING FILTER PERCEPTUAL QUALITY OF H./AVC DEBLOCKING FILTER Y. Zhong, I. Richardson, A. Miller and Y. Zhao School of Enginnering, The Robert Gordon University, Schoolhill, Aberdeen, AB1 1FR, UK Phone: + 1, Fax: + 1,

More information

Calibration of Colour Analysers

Calibration of Colour Analysers DK-Audio A/S PM5639 Technical notes Page 1 of 6 Calibration of Colour Analysers The use of monitors instead of standard light sources, the use of light from sources generating noncontinuous spectra) Standard

More information

Introduction to Data Conversion and Processing

Introduction to Data Conversion and Processing Introduction to Data Conversion and Processing The proliferation of digital computing and signal processing in electronic systems is often described as "the world is becoming more digital every day." Compared

More information

The Art and Science of Depiction. Color. Fredo Durand MIT- Lab for Computer Science

The Art and Science of Depiction. Color. Fredo Durand MIT- Lab for Computer Science The Art and Science of Depiction Color Fredo Durand MIT- Lab for Computer Science Color Color Vision 2 Talks Abstract Issues Color Vision 3 Plan Color blindness Color Opponents, Hue-Saturation Value Perceptual

More information

Film Grain Technology

Film Grain Technology Film Grain Technology Hollywood Post Alliance February 2006 Jeff Cooper jeff.cooper@thomson.net What is Film Grain? Film grain results from the physical granularity of the photographic emulsion Film grain

More information

Slides on color vision for ee299 lecture. Prof. M. R. Gupta January 2008

Slides on color vision for ee299 lecture. Prof. M. R. Gupta January 2008 Slides on color vision for ee299 lecture Prof. M. R. Gupta January 2008 light source Color is an event??? human perceives color human cones respond: 1 w object has absorption spectra and reflectance spectra

More information

Video Over Mobile Networks

Video Over Mobile Networks Video Over Mobile Networks Professor Mohammed Ghanbari Department of Electronic systems Engineering University of Essex United Kingdom June 2005, Zadar, Croatia (Slides prepared by M. Mahdi Ghandi) INTRODUCTION

More information

Reduced complexity MPEG2 video post-processing for HD display

Reduced complexity MPEG2 video post-processing for HD display Downloaded from orbit.dtu.dk on: Dec 17, 2017 Reduced complexity MPEG2 video post-processing for HD display Virk, Kamran; Li, Huiying; Forchhammer, Søren Published in: IEEE International Conference on

More information

Higher-Order Modulation and Turbo Coding Options for the CDM-600 Satellite Modem

Higher-Order Modulation and Turbo Coding Options for the CDM-600 Satellite Modem Higher-Order Modulation and Turbo Coding Options for the CDM-600 Satellite Modem * 8-PSK Rate 3/4 Turbo * 16-QAM Rate 3/4 Turbo * 16-QAM Rate 3/4 Viterbi/Reed-Solomon * 16-QAM Rate 7/8 Viterbi/Reed-Solomon

More information