
Leopold-Franzens University of Innsbruck
Institute of Computer Science

Dissertation

Realtime Distortion Estimation based Error Control for Live Video Streaming

Author: Dipl.-Ing. Michael Schier
Supervisor: Univ.-Prof. Dr.-Ing. Michael Welzl

February 14, 2012


"Everyone should have ten megabits and then the web will be a wonderful thing."
Jon Postel, 1996


Abstract

Driven by the advent of efficient video compression techniques and the wide availability of high speed internet access, video has become the killer application of today's internet, and it is projected that it will soon account for more than 90% of the global traffic. While streaming of prestored content can be conveniently performed over TCP, time-critical video applications such as video conferencing, remote control systems, and streaming game technologies demand alternative transport protocols, as mandatory retransmissions might induce unacceptable amounts of delay. For such applications, partial delivery is to be preferred over error-free but delayed delivery, and consequently, distortion estimation mechanisms are in demand to minimize inevitable video quality degradations. In this dissertation, two novel distortion estimation algorithms are proposed, suitable for MPEG video compression standards. They operate at the macroblock level and consider important indicators such as macroblock partitioning, temporal relationships, potential scene cuts, and the extent of scene motion to obtain estimates of the loss-impact of single media units. Both approaches are computationally lightweight and can be computed in realtime, which renders them highly attractive for live video applications. The developed techniques are integrated into selected error control frameworks, encompassing proactive packet discard, selective ARQ, and unequal rateless FEC. In this context, this thesis proposes solutions for how to reasonably combine distortion estimation and error control techniques, how to additionally take timing constraints into consideration, and how potentially limited feedback can be exploited to refine estimates on the fly. It identifies pitfalls with regard to digital fountain codes and offers respective customizations. The benefits of the developed techniques are demonstrated by various experimental results, which clearly indicate that they considerably outperform existing approaches.


Declaration of Authorship

I, Dipl.-Ing. Schier Michael, declare that this thesis and the work presented in it are my own. I confirm that:

- This work was done wholly while in candidature for a doctoral degree at the Leopold-Franzens University of Innsbruck, Austria.
- No part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution.
- Where I have consulted the published work of others, this is always clearly attributed.
- Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
- I have acknowledged all main sources of help.

I certify that the work presented here is, to the best of my knowledge and belief, original and the result of my own investigations, except as acknowledged.

Date                                  Signature (Schier Michael)


Contents

Abstract
Declaration of Authorship
1. Introduction

Part I. Background

2. Video
    History
    Compression
        Terminology
        Classification of redundancies
        Elementary building blocks of a video encoder
        Intra prediction
        Inter prediction
        Coefficient transformation and quantization
        Entropy coding
    Profiles and levels
    Error resilient encoding techniques
        Network abstraction layer and parameter sets
        Flexible macroblock ordering
        Arbitrary slice ordering
        Redundant slices
        Data partitioning
        Switching I and P slices
        Intra refresh
        Reference picture selection
    Error concealment decoding techniques
        Spatial error concealment
        Temporal error concealment
    Quality metrics
        Mean opinion score
        Peak signal to noise ratio
        Structural similarity index metric
        Other approaches
3. Transport
    Background
    Timing constraints
        Video on demand streaming
        Realtime/Live video streaming
    Building blocks of a streaming system
        Rate control
        Protocols
    Error control
        Feedback based error control
        Forward error correction
        Hybrid approaches

Part II. Original Contribution

4. Related work: Video distortion estimation
    Background
    Lightweight schemes
    Heavyweight schemes
    Conclusion
5. Realtime distortion estimation
    Introduction
    Prioritization of MPEG-2 and MPEG-4 ASP
        Design of Ψ_ASP
        Macroblock type weighting: Ψ_ASP^mb
        Temporal frame dependencies: Ψ_ASP^td
        Scene cut detection: Ψ_ASP^sc
    Prioritization of H.264/MPEG-4 AVC
        Introduction
        Design of Ψ_AVC
        Macroblock type and partitioning: Ψ_AVC^type
        Temporal error tracking: Ψ_AVC^dep
        Motion vector extents: Ψ_AVC^mv
    Error tracking capabilities
    Computational complexity
6. Prioritized proactive packet discard
    Introduction
    Simulation framework
    Experiment parameters
    Results
7. Content-aware selective ARQ
    Introduction
    Streaming MPEG-4 ASP over DCCP
        DCCP implementation details
        Obtaining receiver feedback
        Transmission schedule optimization
        Experiment setup
        Results
    Streaming H.264/MPEG-4 AVC over UDP
        Derivation of send timestamps
        Feedback based priority adaptation
        Jointly optimizing timeliness and relevance of content
        Reference approaches
        Experiment setup
        Streaming under independent loss
        Streaming under bursty loss
        Low-latency streaming under loss
        Low-latency streaming over a wireless link
        Roundup of results
8. Unequal error protection fountain codes
    Digital fountain codes
    Implications of using a small block size
    Modification of the member distribution
    Modification of the degree distribution
    Multicast video delivery without feedback
    Results
9. Conclusion
    Findings
    Future directions

Glossary
List of Figures
List of Tables
Bibliography


Chapter 1. Introduction

Over the last few years there has been a strongly increasing interest in streaming video over packet switched networks, and it can be expected that this trend is set to continue. This phenomenon can mainly be attributed to the modernization of telecommunication networks, the resulting wide availability of high speed internet access, the growing market of mobile entertainment, and the increasing number of web sites that offer their users high-quality multimedia content. In many of today's packet switched networks, traffic is handled in a best-effort manner, and measures that guarantee a certain quality of service, such as guaranteed bandwidth or limited delay, are rarely in place. Consequently, video streaming systems have to cope with the drawbacks of such hardly predictable environments, which are primarily fluctuating delay and packet loss. The adoption of an appropriate combination of error resilience and robustness techniques is essential to prevent, or at least mitigate, the impact of packet loss on the video quality experienced by end users. The need for such quality preserving mechanisms becomes most evident when delivering video content to mobile clients, as wireless channels can be subject to interference from other transmitters or can be affected by multipath fading or shadowing. It is a well known fact that some parts of a video stream contribute more to the quality degradation in case of loss than others, which is mainly due to the complex structure of today's video codecs, the diversity of content to encode, and the spatio-temporal dependencies between data elements.

During the last years, several techniques have been proposed to determine the expected loss-distortion of individual media units or groups of media units. The provision of such importance estimates at the packet level allows the adoption of content-aware error handling and protection mechanisms. Application scenarios encompass, amongst others, prioritized packet drop schemes, selective automatic repeat request, unequal forward error correction, the exploitation of quality of service capabilities of wireless networks, and the assignment of packets to different service levels according to their impact on video quality. The two major scientific contributions of this dissertation are (A) the development of two realtime-capable video distortion estimation techniques that outperform related approaches, and (B) the design of content-aware error control mechanisms for selected realtime video delivery scenarios. To achieve (A), an in-depth knowledge of the internals of the video encoding and decoding processes has to be gained, the structure of a video stream of a target format has to be carefully analyzed, and the dependencies within a bitstream have to be identified. This allows the isolation of those information units that can be obtained at a reasonable computational cost, which is crucial as the realtime applicability of the developed distortion estimation techniques is a central goal. To reveal the impact of different variables such as prediction modes or motion vector lengths on the video quality of loss-affected streams and to quantify their influence, a large number of simulations have to be conducted with different encoding settings and diverse input content. Besides that, it has to be investigated which error resilience options the respective video coding standard supports, and what is actually used in practice. Moreover, related approaches have to be researched, relevant techniques have to be filtered out, and potential weak points have to be identified. This reveals that fully decoding video bitstreams or even simulating error concealment operations to obtain distortion estimates is computationally too costly to be conducted in realtime, which motivated the formulation of approaches that operate not at the pixel but at the macroblock level. Both distortion estimation approaches presented in this dissertation consider the impact of both intra and inter prediction, and are designed to specifically incorporate the special properties of the codecs they were created for.

They complement the set of existing approaches, which are mostly kept general to preserve codec independence. More importantly, related techniques are either straightforward to compute but imprecise, or depend on an in-depth analysis of the video data, thus being computationally expensive and therefore hardly realtime capable. To accomplish (B), different error control techniques have to be reviewed, being applicable to diverse delivery scenarios. In this context, it is discussed how the distortion estimation techniques of step (A) can be beneficially combined with such mechanisms, what pitfalls are to be avoided, and why the developed mechanisms provide superior results compared to existing approaches. Several scenario-specific problems have to be tackled: as an example, the key to enabling content-based selective retransmissions is to reasonably combine the knowledge of packets' send deadlines and their loss impact; both aspects need to be well balanced. Another issue is that the probability distributions used within rateless codes need to be modified to both enable unequal error protection and increase their intermediate decoding performance, so as to preserve video quality when channel conditions become worse. In addition to that, for systems that support (limited) feedback, an extension is developed that exploits incoming feedback to increase the precision of distortion estimates. The contributions are summarized as follows:

Formulation and analysis of a distortion estimation approach for MPEG-2 and MPEG-4 ASP encoded video content [171]. An approach is developed that estimates the loss-impact of media units based on the analysis of macroblocks. It takes their types and their proximity to the region of interest within the respective frame into account, considers a frame's position with respect to the current group of pictures structure, and aims to detect and protect scene cuts. The mechanism requires video streams to only be superficially parsed and does not contain computationally expensive calculation steps, which enables its execution in realtime and allows its deployment at arbitrary nodes along the delivery path.

Design and simulation study of two content-aware MPEG-4 ASP streaming systems employing proactive packet discard and selective retransmission techniques respectively [171, 172]. The developed distortion estimation approach is embedded in a test environment and its performance is compared against selected reference approaches, simulating the impact on video quality when packets are selectively discarded along the delivery path during phases in which the available bandwidth drops below the streams' bitrate. Secondly, the approach's added value is demonstrated in combination with selective retransmissions over DCCP connections, and a method is presented to jointly consider the attributes timeliness and relevance of content of single media units.

Adaptation, extension, and analysis of the distortion estimation approach to cover H.264/MPEG-4 AVC encoded video content [ ]. The previously developed approach yields poor estimates when applied to H.264/MPEG-4 AVC video data as, e.g., more macroblock types have to be distinguished, macroblocks may be highly partitioned, group of pictures structures are no longer static, and scene cuts are usually already detected by the encoder. Consequently, the mechanism is adjusted accordingly with the major focus on tracking error propagation, as the exploitation of temporal dependencies is one of the main drivers behind H.264/MPEG-4 AVC's increased compression efficiency.

Specification and experimental evaluation of a content-aware rateless FEC streaming system for H.264/MPEG-4 AVC video data [173, 174]. Inspired by the fact that rateless codes have some unique properties that make them attractive especially for video multicast, a method is developed to correlate the decoding probability of media units with their loss-impact on the video stream. An alternative symbol degree mechanism is presented, coupled with a modified decoding algorithm, to increase the intermediate decoding performance and thus prevent severe video quality degradations under poor channel conditions.

Deployment and evaluation of the distortion estimation approach in a UDP based streaming system that employs selective ARQ as error control mechanism, and design of a feedback based dynamic estimate adaptation mechanism [175]. An algorithm is specified that, based on the developed distortion estimation approach for H.264/MPEG-4 AVC video data, a stream's timing information, and incoming feedback, creates an optimized send schedule for content-aware selective retransmissions. Additionally, it is demonstrated how feedback can be exploited to adjust and refine distortion estimates, and that both the static and the feedback adaptive mechanisms are superior to reference approaches under independent and bursty loss conditions, tested in wired and wireless environments.

The remainder of this dissertation is structured as follows: Chapters 2 and 3 review related background from the video compression and video delivery perspectives, providing a solid basis for the core part of this thesis. Chapter 4 discusses related work and classifies existing video distortion estimation approaches according to precision and realtime capability. In Chapter 5, the author's distortion estimation algorithms are specified and their properties are discussed. The performance of the estimation algorithm designed for MPEG-4 ASP video content is evaluated for the selective packet discard use case in Chapter 6, and Section 7.2 demonstrates how that algorithm can be beneficially combined with a DCCP based streaming system that employs selective ARQ. Section 7.3 discusses a similar use case but focuses on the H.264/MPEG-4 AVC version of the author's distortion estimation algorithm and introduces the feedback based prioritization refinement technique. Chapter 8 is about streaming systems that use error protecting fountain codes. A method to integrate distortion estimates into the rateless coding process is presented, and further modifications that are required to obtain an improved error control mechanism are discussed. Finally, Chapter 9 concludes by recapitulating the author's achievements and discussing potential future work.


Part I. Background


Chapter 2. Video

This chapter provides a rough overview of elementary video coding mechanisms and discusses the fundamentals of video error resilience and video error concealment techniques. In order to grasp the principles behind the video-related scientific contribution of this thesis, explained in Chapter 5, and to comprehend the design of the developed distortion estimation algorithms, it is necessary to understand how generic compression, resilience, and concealment strategies work and which impact they have on visual quality. For a more comprehensive discussion and an in-depth treatment of these topics, the reader is referred to [112, 221]. Section 2.1 provides a short overview of the history of video compression up to the present. In Section 2.2 and Section 2.3, fundamental compression techniques, which can be found in all MPEG-like compression schemes, are discussed and the notions of profiles and levels are explained. The most important encoder-side error resilience and decoder-side error concealment techniques are summarized in Section 2.4 and Section 2.5. Section 2.6 concludes by presenting those video quality metrics that are used throughout the remainder of this thesis.

2.1. History

Efficient video compression techniques have become an integral component of broadcast, entertainment media, and video collaboration. Their development was mainly driven by the need for mechanisms capable of drastically reducing the bitrates of raw video streams to enable their storage on portable media as well as their transmission over bandwidth-limited channels.

MPEG-1 [82], the first video and audio compression standard of the Moving Picture Experts Group (MPEG), released in 1993, was primarily designed to compress a Video Home System (VHS) quality raw video stream and a CD-quality audio stream at bitrates of up to 1.5 Mbit/s without significant quality loss. The main goal was to make storage on video CDs and similar digital storage devices feasible. MPEG-1 builds upon the International Telecommunication Union (ITU)-T standard H.261 [91], which was defined in 1988 for low-bitrate video telephony and video conferencing over ISDN connections. H.261 achieves compression ratios of 1:100 up to 1:2000 by applying a combination of lossy discrete cosine transformations, lossless entropy coding, and motion prediction, an efficient combination that was maintained throughout all succeeding ITU-T as well as ISO/MPEG standardizations. Figure 2.1 provides a rough overview of the evolution of video codecs over the last two decades. As pointed out, MPEG-1 was primarily designed for media storage and retrieval. However, the trend towards digital television clearly demanded an advanced compression standard that supports higher bitrates, interlacing, and error resilience. MPEG-2/H.262 [83], released in 1995, fulfills these requirements as it allows bitrates of up to 80 Mbit/s. Additionally, it introduces error resilience techniques such as (adaptively sized, intra) slices followed by resynchronization markers, it provides tools for data partitioning and scalability, and it contains recommendations on how to implement effective error concealment mechanisms. Although already 16 years have passed since its specification, MPEG-2 is still heavily in use today as it is a required codec for Digital Versatile Disk Video (DVD-V), and it forms the basis for digital television broadcasts, including terrestrial (DVB-T), cable (DVB-C), and direct broadcast satellite (DVB-S). The successor of MPEG-2, MPEG-4 Part 2 Advanced Simple Profile (MPEG-4 ASP) [84], was initially designed for video telephony and multimedia streaming applications on the internet. At that time, this could only be realized with streams that were encoded at a very low bitrate and that still had an acceptable video quality (simple profile).

Figure 2.1.: History of popular video compression formats.

When the standard was finally released in 1999, it also supported high video bitrates (advanced simple profile). Besides increased compression efficiency, further goals were to enable the composition of independent video objects into scenes, interactivity, and support for 3D rendering to move closer to computer graphics applications. In 2001, the Joint Video Team (JVT) was formed to foster the collaboration of video experts from MPEG and ITU. Over a period of several years, they worked out a new video compression standard called H.264 [93]. In fact, this is the name given to it by the ITU-T, whereas MPEG called it MPEG-4 AVC. The official full name is MPEG-4 Part 10 Advanced Video Coding (H.264/MPEG-4 AVC). This new standard is computationally more demanding than its predecessors, contains more complex coding tools, and provides significant compression efficiency gains over previous standards. Coupled with the so called fidelity range extensions that were later integrated into the standard as novel high profiles to address the needs of high-end broadcast applications, it is today considered to be the state of the art of modern video compression. Besides H.264/MPEG-4 AVC, Microsoft's VC-1 and Apple's QuickTime codec collection are the main codecs used for compression and streaming today as they are shipped along with their proprietors' operating systems. Theora evolved from the codec VP3 and is a royalty-free codec developed by the Xiph community. It provides lower compression rates than H.264/MPEG-4 AVC, but the fact that it is not protected by any known patents still makes it an attractive alternative. To date, Google's Chrome, Mozilla's Firefox, and Opera are the only popular web browsers supporting that codec. Finally, VP8 is constantly gaining popularity since its acquisition and integration into the WebM project by Google. Similar to Theora, VP8 was made royalty-free because the new owner opened all patents filed by On2. Since then, Google has put a considerable effort into promoting its use and convinced popular hardware vendors like Nvidia and software companies like Skype to provide support for it.

However, it is unclear whether it will remain under the current BSD-like license: video codecs are conglomerates of numerous algorithms, and the fact that VP8's design is highly similar to that of H.264/MPEG-4 AVC makes it an attractive target for patent holders to claim license fees for some of these algorithms once the codec is well established.

2.2. Compression

This section provides an insight into the basic building blocks of video encoders with a strong focus on H.264/MPEG-4 AVC. In what follows, the coding process is briefly sketched to provide the reader with a rough overview of video compression techniques and to make him/her familiar with the terminology adopted in that standard. For a compact summary of the features provided by H.264/MPEG-4 AVC, he/she is further referred to [234].

2.2.1. Terminology

A frame from a progressive or a field from an interlaced input sequence is encoded with the result being referred to as coded picture or coded frame. In the bitstream, each coded frame is associated with a frame number, which relates to the display timestamp of that frame. This number may differ from the decoding timestamp, which defines the decoding order and reflects the reference order of frames. Previously coded frames may be used as references for temporal (inter) prediction, and they are selectively buffered in the decoded picture buffer, controlled by a reference picture management routine. Reference frames can be addressed using two lists, list0 and list1, which are usually (but not exclusively) used for forward and backward prediction respectively. Each coded frame consists of one to several slices, which are typically, but not necessarily, in raster scan order. Slices can be categorized into three groups: I slices, P slices, and B slices. The two advanced slice types SI and SP, which are only defined in the extended profile, will not be discussed here.

For the sake of completeness, it is remarked that categorizing frames as I, P, and B frames, as is the case with MPEG-2, is inappropriate in the context of H.264/MPEG-4 AVC, as frames may also contain slices of unequal types. The standard only distinguishes between instantaneous decoder refresh (IDR) and non-IDR frames. A slice consists of a set of macroblocks, which are non-overlapping, neighbouring quadratic pixel regions, usually of size 16 × 16. They contain luma samples as well as chroma samples for both color components Cb and Cr. I slices contain only macroblocks that use intra prediction, i.e., macroblocks that exclusively reference previously coded samples of the same slice. There are four different types of such macroblocks: macroblocks that predict the entire 16 × 16 area, macroblocks that are divided into either 8 × 8 or 4 × 4 blocks of which each block is separately intra predicted, and macroblocks that use no prediction at all (so called I_PCM macroblocks). P slices contain macroblocks that may either be intra or inter predicted, i.e., as references, they may either use previously coded samples of the same slice or of slices that belong to frames that have a smaller display timestamp than the current frame. Each macroblock is divided into partitions of size 16 × 16, 16 × 8, 8 × 16, or 8 × 8 luma and associated chroma samples, and, if the partition mode 8 × 8 is chosen, the block is further divided into sub-partitions of size 8 × 8, 8 × 4, 4 × 8, or 4 × 4. For each macroblock partition, a different reference picture from list0 may serve as anchor for inter predictions; however, sub-partitions have to agree on a common reference picture. Thus, each macroblock in a P slice may use up to four different temporally preceding frames as reference. B slices increase the average compression efficiency of the inter prediction process because they allow the referencing of two separate previously coded frames from both list0 and list1 per macroblock partition. As a consequence, macroblock partitions of B slices may either be intra predicted, inter predicted by using a picture from list0, inter predicted by using a picture from list1, or inter predicted by jointly using two pictures from both list0 and list1.

2.2.2. Classification of redundancies

Most of today's video compression standards such as H.264/MPEG-4 AVC and VP8 [9] are hybrid in the sense that they combine techniques that reduce different kinds of redundancies occurring in the bitstream. These redundancies can roughly be categorized as either perceptual, spatial, temporal, or statistical.

Perceptual redundancies

The details of a single video frame or changes between two consecutive frames that a human eye cannot perceive are collectively denoted as perceptual redundancies. Such redundancies can be removed from the bitstream without being noticed by the viewer. By studying the Human Visual System (HVS), and especially the distribution of rods and cones on the retina, it becomes apparent that we are more sensitive to variations in light intensity than to variations in light color. This justifies the use of the luminance and chrominance (blue-yellow, red-green) color space (YCbCr) as primary color space for video compression, as it exploits the characteristics of the HVS better than the red-green-blue color space (RGB). The conversion from RGB to YCbCr is linear and is, according to the ITU-R recommendation BT.601 [89], defined as follows:

\begin{pmatrix} Y \\ Cb \\ Cr \end{pmatrix} =
\begin{pmatrix} 0.299 & 0.587 & 0.114 \\ -0.169 & -0.331 & 0.500 \\ 0.500 & -0.419 & -0.081 \end{pmatrix}
\begin{pmatrix} R \\ G \\ B \end{pmatrix}    (2.1)

With regard to channel resolution, one distinguishes between YCbCr 4:4:4, 4:2:2, and 4:2:0 input formats. With YCbCr 4:4:4, each 2 × 2 block of Y pixels has four pixels for Cb and Cr each, whereas with 4:2:2, the same block has only two pixels per chroma channel. With 4:2:0, the pixel ratio is further decreased to 1/4. Due to the fact that reducing the resolution of the chroma channels (chroma subsampling) can hardly be perceived by the human eye, the YCbCr 4:2:0 format, which causes a 50% reduction in picture size, is predominantly used in entertainment applications such as DVD-V and High Definition Television (HDTV). YCbCr 4:2:2 is typically considered for intermediate processing such as in studio applications, whereas YCbCr 4:4:4 is hardly used at all.
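To make the conversion and the subsampling ratios concrete, the following minimal Python sketch applies the linear BT.601 matrix from Equation 2.1 and then reduces the chroma resolution to 4:2:0 by averaging each 2 × 2 chroma block. The averaging filter and the offset-free form of the matrix are illustrative assumptions, not a normative part of any standard.

```python
import numpy as np

# BT.601 RGB -> YCbCr matrix from Equation 2.1 (offset-free form).
BT601 = np.array([[ 0.299,  0.587,  0.114],
                  [-0.169, -0.331,  0.500],
                  [ 0.500, -0.419, -0.081]])

def rgb_to_ycbcr(rgb):
    """Convert an (H, W, 3) RGB image to YCbCr with the linear BT.601 matrix."""
    return rgb @ BT601.T

def subsample_420(ycbcr):
    """Keep Y at full resolution and average each 2x2 block of Cb and Cr.

    The result holds one chroma sample per four luma samples (YCbCr 4:2:0),
    i.e. 1 + 1/4 + 1/4 = 1.5 values per pixel instead of 3 -- the 50% size
    reduction mentioned above. H and W are assumed to be even.
    """
    y = ycbcr[:, :, 0]
    h, w = y.shape
    chroma = ycbcr[:, :, 1:].reshape(h // 2, 2, w // 2, 2, 2).mean(axis=(1, 3))
    return y, chroma[..., 0], chroma[..., 1]

# Example: a random 4x4 RGB frame.
frame = np.random.rand(4, 4, 3)
y, cb, cr = subsample_420(rgb_to_ycbcr(frame))  # shapes: (4, 4), (2, 2), (2, 2)
```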

An important instrument to reduce perceptual redundancies is quantization. Analogous to the process of image compression, the number of levels is reduced by quantizing the coefficients of transformations applied to n × n blocks. Thus, the compression efficiency is increased at the cost of a decreased video quality. Quantization matrices have to incorporate the unequal importance of different transform coefficients. They should be adapted according to application requirements to still retain an acceptable video quality while complying with given bitrate constraints. Pre-filtering is a further technique for perceptual redundancy reduction that decreases the (possibly source or channel characteristic) noise of the input signal. It thus improves the quality of the compressed sequences as it lowers the probability of artifact occurrences.

Temporal redundancies

A video is basically a sequence of pictures sampled at a certain frame rate. In most cases, consecutive pictures are highly similar, depending on the sampling rate (the temporal resolution, see Figure 2.2) and the amount of motion of the camera and the objects within the scene. Encoding each frame independently of the others, even when performed with an efficient image compression algorithm, would therefore not yield satisfying compression rates.

Figure 2.2.: Temporal and spatial sampling of a video sequence.

In fact, the exploitation of the temporal redundancies of a sequence accounts for the majority of compression gains in modern video encoding schemes. A straightforward approach would be to only independently encode the first frame of a sequence (the key frame), and to store the differences between consecutive frames for the remaining part. Although feasible, the energy of the residual frame (the sum of all absolute luma or chroma sample differences between the current and the reference frame) would still be high, i.e., there is a considerable amount of information left to be compressed after the temporal prediction (see Figure 2.3).

Figure 2.3.: Frames 25 and 26 of the sequence stefan, and their absolute luma differences; remaining residual energy without motion estimation and compensation.

By replacing static prediction methods with advanced techniques that try to identify the most suitable reference regions to minimize prediction residuals, the compression efficiency can be dramatically improved. Such search processes are commonly referred to as motion estimation and are key to efficient video compression. Besides selecting appropriate reference regions, a further challenge is to choose the most appropriate block size. By considering smaller n × n blocks, the likelihood of finding better reference regions (i.e. regions that yield smaller residuals) increases, but at the same time, the total number of blocks also increases, which leads to a higher amount of motion information and signaling overhead to be encoded. Because of this, selecting block sizes is a tradeoff between motion and residual data, and the partitioning mechanism has to estimate the overall expected bit savings in order to make sound decisions.
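A minimal sketch of this idea is an exhaustive integer-pixel search that measures the residual energy of each candidate region by the sum of absolute differences (SAD) and keeps the displacement with the smallest residual. The block size and search range below are arbitrary illustrative choices; practical encoders use far faster search strategies.

```python
import numpy as np

def sad(a, b):
    """Residual energy: sum of absolute sample differences between two blocks."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def full_search(cur, ref, top, left, n=16, search_range=8):
    """Find the integer-pixel motion vector (dy, dx) that minimizes the SAD
    between the n x n block of `cur` at (top, left) and a block in `ref`."""
    block = cur[top:top + n, left:left + n]
    h, w = ref.shape
    best_mv, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > h or x + n > w:
                continue  # candidate region lies outside the reference frame
            cost = sad(block, ref[y:y + n, x:x + n])
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dy, dx)
    return best_mv, best_cost
```

A smaller n lowers the per-block residual but multiplies the number of motion vectors that have to be signaled, which is exactly the tradeoff described above.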

Spatial redundancies

In natural scenes, a significant amount of correlation can often be found between neighboring pixels, and small areas of a picture usually have remarkable similarities. Several techniques exist to exploit these redundancies, minimizing the amount of information to be encoded. A well-established procedure is the discrete cosine transform (DCT), which was already used in the Joint Photographic Experts Group (JPEG) compression process [87]. It operates on n × n blocks and transforms them from the time to the frequency domain. The DCT thus renders blocks more suitable for video compression as both energy compaction and de-correlation are established. More specifically, after the transformation, the energy of an n × n pixel block is compacted into a few low-frequency coefficients, which are of higher importance with regard to the HVS than high-frequency coefficients. This fact is further exploited in the quantization process, where the quantization matrices are usually designed to apply a stronger quantization to high-frequency coefficients than to low-frequency coefficients. As a consequence, a considerable amount of information can be recovered by encoding just a small fraction of the quantized coefficients. As an alternative to the DCT, the wavelet transformation, which is widely adopted in image compression, can be used for energy compaction. Its methodology is somewhat the opposite of that of the DCT: each DCT coefficient represents the factor by which a constant pattern is applied to the whole block, while wavelet transform coefficients represent single, localized patterns applied to a specific part of that block. DCTs are generally applied to small areas and are intended to cover regions of roughly uniform patterns and complexity. On the contrary, wavelet transformations are applied to the entire picture region and aim to exploit its large-scale redundancy. At first glance, it seems as if the wavelet transformation has considerable advantages over the DCT: coupled with overlapped-block motion compensation (OBMC), a more precise inter prediction can be achieved, large-scale correlations can be taken advantage of, and there is no need for deblocking filtering as with the DCT. Indeed, the wavelet transformation is already used in modern video codecs such as Dirac [11] and Snow.

Despite the considerable amount of research dealing with wavelet transformations applied in video compression, there are still unsolved problems regarding the compatibility with H.264/MPEG-4 AVC compression tools that hinder the unfolding of their full potential. More specifically, there is a lack of mechanisms that allow for efficient intra coding, the overlap between motion partitions causes the mixing of partition sizes to lead to serious practical problems, and it is still unclear whether spatially adaptive quantization leads to a similar performance improvement when applied to wavelet transform coefficients as with the DCT. A further technique to exploit spatial redundancies (which can also occur after motion compensation) is to predict single image samples or regions from previously coded areas of the same frame, commonly referred to as intra prediction. Similar to motion compensation, when the prediction is gainful, the energy of the residual is lower than that of the original region, thus it can be represented with fewer bits. As reference, the encoder does not use the original picture material but an encoded and subsequently decoded version of it to prevent potentially cumulative mismatches at the decoder. Due to the fact that the encoding process is usually lossy, taking the original picture material as reference would lead to distorted reconstructions at the decoder as the originals are not available, and when this effect propagates, one speaks of a quality drift.

Statistical redundancies

Reducing statistical redundancies is the last step in the video encoding process. By applying a lossless compression to the outcomes of previous encoding steps, the final video's size can be further reduced. The targets of the compression are run-level (DCT) or zero-tree (wavelet transform) encoded quantized transform coefficients, motion vectors indicating the horizontal and vertical displacement of motion-compensated blocks, headers, markers, and supplementary information.

The simplest way to encode this information is to use fixed length codes. However, the values are not uniformly distributed, and encoding values of unequal likelihood with symbols of the same length would therefore lead to unnecessarily large files. Entropy coding algorithms like variable length codes (e.g. Huffman codes, exponential Golomb codes) and arithmetic codes are typically applied to exploit the non-uniform probability of the values to be encoded. Furthermore, some symbols may be highly correlated in local areas of an image. For example, collocated motion vectors may have similar horizontal and vertical displacements because object motion may extend across large regions of a frame, and DC values of neighboring intra coded blocks may be very similar. By predictively coding motion vectors, only the motion vector differences (MVDs) have to be encoded, which are typically smaller than the absolute motion vectors. Besides that, videos with certain realtime constraints often use varying quantizers to control the tradeoff between compression efficiency and video quality, e.g., to adapt the bitrate according to the present transmission channel bandwidth. Similar to the coding of MVDs, it is sufficient to only signal the relative changes of the quantizer because the variations between successively coded macroblocks are rather small.

2.2.3. Elementary building blocks of a video encoder

Modern video compression schemes try to exploit all the dimensions of redundancies mentioned in Section 2.2.2. As far as H.264/MPEG-4 AVC is concerned, the specification does not define the codec (the encoder/decoder pair) but rather the syntax and semantics of the encoded video bitstream and the methods to decode the compressed data. The internal structure of a typical H.264/MPEG-4 AVC encoder implementation and the data flow among different encoding tools is depicted in Figure 2.4. The fundamental building blocks prediction, transformation, quantization, and entropy compression can also be found in previous standards such as H.261 and H.263 as well as MPEG-1, MPEG-2, and MPEG-4 ASP. Many improvements in terms of coding efficiency that have been achieved over the last years actually happened within these blocks.

Figure 2.4.: Block diagram of an H.264/MPEG-4 AVC encoder.

To facilitate comprehension of subsequent chapters, the main encoding steps are sketched in this paragraph, and the internals of the elementary building blocks are described in detail in the succeeding sections. There exist two distinct dataflow paths in an encoder: a forward and a reconstruction dataflow path. The forward dataflow path is the primary path, whereas the reconstruction path constitutes a decoder within the encoder. The latter provides reference data for the prediction procedures. In a first step, the color-transformed and subsampled frames are divided into macroblocks. For each of those blocks, the encoder decides, depending on the content, the rate constraints, the profile used, and numerous other criteria, whether to apply intra or inter prediction. By choosing the most appropriate prediction mode for individual macroblocks, the encoder can minimize the size of residuals. In case of intra prediction, previously coded but unfiltered samples of the current slice are used to form the prediction.

In contrast to that, for the inter prediction process, one or two previously coded, filtered frames from list0 and/or list1 are used. The differences between the obtained prediction and the samples of the current block are called residuals, and they are subsequently block-transformed and quantized. After the quantization, the transform coefficients are reordered and entropy encoded. They are stored, along with motion vector information, quantization steps, prediction modes, and further side information needed to decode each macroblock, in network abstraction layer (NAL) units to be either instantaneously transmitted or stored for future use. Along the reconstruction path, the quantization of transform coefficients is reversed and the coefficients are inversely transformed, which yields an approximation of the original residuals. This approximation is required for both intra and inter predictions of future blocks. The simulation of the decoding steps within the encoding process is necessary because not the original residuals but their approximation will be available to actual decoders.

2.2.4. Intra prediction

I slices do not use any form of inter prediction and can therefore only contain intra coded macroblocks. In contrast to that, intra coded macroblocks may also occur in P as well as B slices due to various reasons (e.g., poor inter prediction performance, scene cuts, limiting temporal error propagation). An intra coded macroblock is encoded without referring to regions located outside the surrounding slice, i.e., only macroblocks of the same slice that have been previously encoded can be used for prediction. In general, there is a significant correlation between luma and chroma samples of adjacent macroblocks, which is the main reason why reconstructed macroblocks at the top, the top-left, the top-right, and the left of the current macroblock are considered as potential reference areas. For luma components, the encoder can decide whether to process a macroblock in block units of size 4 × 4, 8 × 8, or 16 × 16 (see Figure 2.5). When a smaller block size is chosen (e.g., 4 × 4), the prediction precision is increased, which consequently leads to smaller residuals.
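As an illustration of how such a prediction is formed, the sketch below implements three of the directional choices for a 4 × 4 luma block (vertical, horizontal, and DC) from the reconstructed neighbouring samples. The remaining diagonal modes and the availability rules for neighbours are omitted, so this is a simplification under stated assumptions rather than the normative process.

```python
import numpy as np

def intra_predict_4x4(mode, above, left):
    """Predict a 4x4 luma block from its reconstructed neighbours.

    `above` holds the four samples directly above the block, `left` the four
    samples directly to its left. Only three of the nine modes are sketched.
    """
    above = np.asarray(above, dtype=np.int32)
    left = np.asarray(left, dtype=np.int32)
    if mode == "vertical":     # copy the row above into every row
        return np.tile(above, (4, 1))
    if mode == "horizontal":   # copy the column on the left into every column
        return np.tile(left.reshape(4, 1), (1, 4))
    if mode == "dc":           # fill the block with the mean of all neighbours
        return np.full((4, 4), int(round(np.concatenate([above, left]).mean())))
    raise ValueError("mode not covered in this sketch")

# Only the residual (original block minus this prediction) is transformed,
# quantized, and entropy coded; the sample values here are invented.
pred = intra_predict_4x4("dc", above=[10, 12, 11, 13], left=[9, 10, 12, 11])
```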

Figure 2.5.: Luma samples of intra coded macroblocks of an intra coded frame may be arbitrarily partitioned.

However, this advantage comes at the cost of an increased signaling overhead as the prediction modes used have to be encoded for every block. When compared to a block size of 16 × 16, this leads to a signaling overhead of roughly a factor of 16. Thus, selecting an appropriate macroblock partitioning is a tradeoff between the sizes of residuals (the prediction precision) and the cost of mode signaling, and encoders aim to minimize the total number of bits needed to allow the reconstruction of macroblocks. The mode of prediction defines how the adjacent samples are used to interpolate the block. As an example, the possible prediction modes for 4 × 4 luma blocks are depicted in Figure 2.6.

Figure 2.6.: The 9 possible intra prediction modes for 4 × 4 luma blocks.

In total, there are 22 modes for luminance (9 for 4 × 4 blocks, 9 for 8 × 8 blocks, and 4 for 16 × 16 blocks) and 4 modes for 8 × 8 chrominance blocks. For 4 × 4 and 8 × 8 luma blocks, it is possible to decrease the signaling overhead by applying predictive coding. The idea is that (intra coded) collocated blocks of the same size use a prediction mode that also provides the best prediction result for the current block. Consequently, the encoder can decide to either explicitly communicate the prediction mode to be used or to signal in the bitstream that the prediction mode is to be inferred by inspecting neighboring blocks.

2.2.5. Inter prediction

Unlike intra prediction, inter prediction selects regions of previously coded frames as reference areas. Reference candidate frames are stored in the decoded picture buffer, which can be accessed via two distinct lists: list0 and list1. Both list0 and list1 contain short term as well as long term reference pictures. After encoding a frame, it is decoded again and prepended to the corresponding list after being de-blocked using a filter operation. Due to the lists' finiteness, a technique called sliding window memory control is applied that removes the short term picture with the highest index and shifts previous short term pictures accordingly. Adaptive memory control commands can be used to transform short term into long term reference pictures to avoid that they are dropped from the buffer, and to remove selected pictures from the decoded picture buffer. An IDR frame, which is a special instance of an I frame, may be inserted at strategic points in the video stream to trigger the removal of all pictures from the decoded picture buffer.

Figure 2.7.: A possible decoded picture buffer allocation along with list0 and list1 while encoding a B slice with a display timestamp of 42.

This is reasonable, e.g., after abrupt scene cuts or at the beginning of a new DVD chapter. In particular, one IDR frame is always placed at the beginning of a video sequence. Subsequently coded slices are not allowed to refer to regions located before the IDR frame. This enables decoders to start the playback from that IDR frame on and stops any error propagation that may have been caused by previous frames that could only partially be decoded. As already indicated in Section 2.2.1, when P slices (forward prediction only) are inter predicted, list0 is used, and the list's order reflects the decoding timestamps of the frames in the decoded picture buffer in descending order. In case of B slices, both list0 and list1 can be used, and the ordering is based on the display timestamps of the reference frames. More specifically, with B slices, the first entries of list0 are frames that have a display timestamp smaller than the one of the current frame, in descending order, followed by frames with a display timestamp greater than that, in ascending order. The structure of list1 is equal to that of list0 with the exception that future frames (i.e. frames with a display timestamp greater than that of the current frame) are ranked first. For clarification, Figure 2.7 depicts one possible decoded picture buffer allocation when encoding a B slice with a display timestamp of 42. In general, the encoder tends to choose small prediction list indices due to the fact that the maximum correlation typically exists among temporally collocated frames. This implies that list0-prediction usually refers to past frames and list1-prediction to future frames, which motivates the terminology of forward, backward, and bidirectional prediction.
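The ordering rule just described can be written down compactly. The sketch below reproduces it for a B slice with display timestamp 42, as in the situation of Figure 2.7; it ignores the separate handling of long term pictures, and the buffered timestamps are invented example values.

```python
def b_slice_reference_lists(current_ts, buffered_ts):
    """Order reference frames (identified by display timestamp) into list0 and
    list1 for a B slice: list0 ranks past frames first (closest first), then
    future frames (closest first); list1 swaps the two groups."""
    past = sorted((t for t in buffered_ts if t < current_ts), reverse=True)
    future = sorted(t for t in buffered_ts if t > current_ts)
    return past + future, future + past

# Example: decoded picture buffer holding frames 38, 40, 44, and 46 while a
# B slice with display timestamp 42 is encoded.
list0, list1 = b_slice_reference_lists(42, [38, 40, 44, 46])
# list0 == [40, 38, 44, 46]   list1 == [44, 46, 40, 38]
```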

Figure 2.8.: Possible block sizes for inter predicted macroblocks (macroblock partitions of 16 × 16, 16 × 8, 8 × 16, and 8 × 8 samples; sub-macroblock partitions of 8 × 8, 8 × 4, 4 × 8, and 4 × 4 samples).

The inter prediction process encompasses selecting an appropriate reference region in a frame from list0 and/or list1 and choosing a suitable block size. The latter is similar to the intra prediction process (see Section 2.2.4), although block sizes no longer need to be quadratic (see Figure 2.8). The offset between the region to interpolate in the current frame and the selected reference region is commonly referred to as motion vector, and its x and y components do not necessarily have to be integer values. More specifically, motion vectors have a 1/4-pixel resolution for luma and a 1/8-pixel resolution for chroma components. The corresponding subpixel samples are interpolated based on the surrounding samples, as exemplarily sketched in Figure 2.9. Motion vectors of spatially co-located blocks are usually strongly correlated. It therefore makes sense to differentially encode them, which is referred to as motion vector prediction. For each (sub-)macroblock partition, the motion vectors of the left, top, and top-right blocks are taken to form the motion vector predictor (MVP). The difference MVD between that predictor and the current block's motion vector is stored, along with the temporal coordinate of the reference frame, at the macroblock layer of the bitstream.
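A sketch of this predictive coding step is given below. It uses the component-wise median of the three neighbouring motion vectors as predictor, which is the basic rule in H.264/MPEG-4 AVC; the special cases for unavailable neighbours and for 16 × 8 / 8 × 16 partitions are left out, and the example vectors are invented quarter-pel values.

```python
def motion_vector_predictor(mv_left, mv_top, mv_topright):
    """Component-wise median of the neighbouring motion vectors (basic case)."""
    xs = sorted(mv[0] for mv in (mv_left, mv_top, mv_topright))
    ys = sorted(mv[1] for mv in (mv_left, mv_top, mv_topright))
    return xs[1], ys[1]

def motion_vector_difference(mv, mvp):
    """Only this difference (the MVD) is written to the bitstream."""
    return mv[0] - mvp[0], mv[1] - mvp[1]

# Quarter-pel example: neighbours (4, -2), (6, 0), (5, -1); current MV (7, -1).
mvp = motion_vector_predictor((4, -2), (6, 0), (5, -1))   # -> (5, -1)
mvd = motion_vector_difference((7, -1), mvp)              # -> (2, 0)
```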

Figure 2.9.: Motion vectors may point to positions between samples (sub-pixel resolution), which requires averaging/interpolation.

If bi-prediction is applied and both motion vectors have the same temporal direction, it is possible to further increase the coding efficiency by encoding the difference between the second (list1) motion vector MV1 and a scaled version of MV0 instead of MV1 itself. The scaling is performed based on the temporal distance of the reference frames to the current frame. For example, if the current frame had a display timestamp of 16 and MV0 and MV1 referred to the frames 14 and 12 respectively, then a scaling factor of (16 − 12)/(16 − 14) = 2 would be applied. As a special case, direct mode prediction needs no motion vector information to be encoded. The motion vectors are inferred based on previously derived motion vectors of either spatially (spatial direct mode) or temporally (temporal direct mode) collocated blocks. Based on the motion vector information, a motion compensated predictor is formed. When a block has one motion vector, the region this vector points to is taken as predictor. If there are two motion vectors, the predictor is calculated as an average of the samples of both the list0 and the list1 region. Better predictors can be obtained by optionally using weighted prediction techniques, where either implicitly (based on the temporal distance of the reference frames) or explicitly (in the slice header) signaled weighting factors are applied prior to motion compensated prediction to adjust the prediction samples.

2.2.6. Coefficient transformation and quantization

Up to now, the encoder has collected information about the macroblock partitioning, the prediction modes used, prediction parameters (e.g. motion vectors), and prediction residuals.

All this information needs to be coded in a standardized and space-saving way. The video coding techniques applied so far are reversible in the sense that the input video sequence can be reconstructed without any loss of quality. However, to obtain high compression ratios and to satisfy rate constraints, it becomes necessary to transform and quantize the residual information. In most cases, this is inevitably accompanied by a video quality degradation. Previous standards such as JPEG [87], MPEG-2 [83], and MPEG-4 ASP [84] decouple the transformation and quantization steps. They apply a two-dimensional DCT that converts the samples of blocks from the time to the frequency domain (see Equation 2.2):

F(x_f, y_f) = C(x_f, y_f) \sum_{x_t=0}^{N-1} \sum_{y_t=0}^{N-1} T(x_t, y_t) \cos\!\left(\frac{2x_t + 1}{2N}\, x_f \pi\right) \cos\!\left(\frac{2y_t + 1}{2N}\, y_f \pi\right)

\text{with } C(x_f, y_f) = \begin{cases} 1/8 & \text{if } x_f = 0 \text{ and } y_f = 0 \\ 1/\sqrt{32} & \text{if either } x_f = 0 \text{ or } y_f = 0 \\ 1/4 & \text{otherwise} \end{cases}    (2.2)

The occurrence of cosine terms requires implementations to perform approximations, which might lead to a mismatch between encoders and decoders when different approximation schemes are deployed. Despite the existence of standardized accuracy criteria, a quality drift (a cumulative distortion caused by reconstruction-error propagation due to inter prediction) is still possible. Consequently, countermeasures such as periodic intra refreshing must be implemented to mitigate the negative impact on video quality. To prevent quality drifts, the H.264/MPEG-4 AVC standard, unlike previous standards, exactly specifies the scaling and inverse transformation process to ensure that different H.264 implementations produce identical reconstructions. More precisely, a reversible integer transform on quadratic blocks is specified. It is a scaled approximation of the DCT.
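For reference, Equation 2.2 can be evaluated directly as in the following sketch. The normalization constants match the 8 × 8 case given above, and the quadruple loop is a deliberately slow reference implementation rather than the separable fast transform an encoder would use.

```python
import math

def dct2d(block):
    """Evaluate Equation 2.2 for an 8x8 block of samples T(x_t, y_t)."""
    n = len(block)  # the C(x_f, y_f) constants below assume n == 8
    coeffs = [[0.0] * n for _ in range(n)]
    for xf in range(n):
        for yf in range(n):
            if xf == 0 and yf == 0:
                c = 1 / 8
            elif xf == 0 or yf == 0:
                c = 1 / math.sqrt(32)
            else:
                c = 1 / 4
            acc = 0.0
            for xt in range(n):
                for yt in range(n):
                    acc += (block[xt][yt]
                            * math.cos((2 * xt + 1) * xf * math.pi / (2 * n))
                            * math.cos((2 * yt + 1) * yf * math.pi / (2 * n)))
            coeffs[xf][yf] = c * acc
    return coeffs

# Energy compaction: a flat block ends up with a single non-zero (DC) coefficient.
flat = [[100] * 8 for _ in range(8)]
assert abs(dct2d(flat)[0][0] - 800.0) < 1e-9   # 100 * 64 * (1/8)
```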

Figure 2.10.: Standard DCT patterns for 8 × 8 sample blocks [112].

The basic idea behind the DCT is to find a matching linear combination of the standard patterns displayed in Figure 2.10 to interpolate the residual blocks. The transform is applied to 4 × 4 or 8 × 8 (high profiles only) blocks of luma/chroma samples and can be carried out by using either (limited precision) integer or fixed-point arithmetic. With some modes, the DC (low frequency) coefficients are isolated and taken as input for an additional Hadamard transform to exploit the fact that DC coefficients are usually significantly correlated.

This is similar to the strategy in JPEG where DC coefficients are differentially encoded. A further aspect which distinguishes H.264/MPEG-4 AVC from previous standards is that the transformation and quantization steps are no longer separated. By overlapping these two stages and adding a normalization step to the latter, it is possible to considerably reduce the number and complexity of the operations required to process one block. In fact, the original DCT is split into a core transform and a scaling operation, of which the latter is combined with the quantization step size and the quantization precision constant to form the actual quantization stage. For clarification, this is illustrated by taking the example of a 4 × 4 block transform. The original DCT can be formulated as the matrix multiplication of Equation 2.3. X and Y are the coefficients in the time and frequency domain respectively, and D denotes the DCT operator with d_1 = 1/2, d_2 = \sqrt{1/2}\,\sin(3\pi/8), and d_3 = \sqrt{1/2}\,\sin(\pi/8). To avoid computations with irrational numbers, D is rounded as defined in Equation 2.4. However, to preserve the orthogonality property, C has to be scaled by S_C as given in Equation 2.5, where ⊙ denotes the entry-wise or Schur product. Plugging this approximation into the original Equation 2.3 yields the two matrices C and S (see Equation 2.6) that denote the core transform and the scaling operation respectively. The quantization step size qstep is finally integrated to form the quantization matrix Q = S · 2^15/qstep, where 2^15 is the quantization precision constant. After transformation and quantization, an additional division by that constant becomes necessary to restore the correct scale. Therefore, the constant's magnitude (specified in the standard) can be considered a compromise between accuracy and arithmetic precision. It has to be remarked that the standard does not define qstep directly. Instead, it provides a lookup table L_Q (Equation 2.7, whose numeric entries are defined in the standard), from which the quantizer scaling matrix S_Q can be derived as a function of the quantization parameter (QP) (see Equation 2.8). The quantization parameter ranges from 0 to 51, and the corresponding matrices S_Q(QP) with QP ≥ 6 can be computed via the relationship S_Q(QP) = 2^(QP div 6) · S_Q(QP mod 6). Finally, the expression S_C^{-1} ⊙ S_Q(QP) yields the quantization matrix Q(QP), where S_C^{-1} is the corresponding scaling matrix of the inverse integer transform as provided by the standard.

Y = D X D^T \quad \text{with} \quad D = \begin{pmatrix} d_1 & d_1 & d_1 & d_1 \\ d_2 & d_3 & -d_3 & -d_2 \\ d_1 & -d_1 & -d_1 & d_1 \\ d_3 & -d_2 & d_2 & -d_3 \end{pmatrix}    (2.3)

C = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{pmatrix}    (2.4)

D \approx C \odot S_C \quad \text{with} \quad S_C = \begin{pmatrix} 1/2 & 1/2 & 1/2 & 1/2 \\ 1/\sqrt{10} & 1/\sqrt{10} & 1/\sqrt{10} & 1/\sqrt{10} \\ 1/2 & 1/2 & 1/2 & 1/2 \\ 1/\sqrt{10} & 1/\sqrt{10} & 1/\sqrt{10} & 1/\sqrt{10} \end{pmatrix}    (2.5)

Y = (C \odot S_C)\, X\, (C \odot S_C)^T = \underbrace{(C X C^T)}_{\text{core transform}} \odot \underbrace{(S_C \odot S_C^T)}_{S}    (2.6)

S_Q(QP) = \begin{pmatrix} L_Q(0, QP) & L_Q(2, QP) & L_Q(0, QP) & L_Q(2, QP) \\ L_Q(2, QP) & L_Q(1, QP) & L_Q(2, QP) & L_Q(1, QP) \\ L_Q(0, QP) & L_Q(2, QP) & L_Q(0, QP) & L_Q(2, QP) \\ L_Q(2, QP) & L_Q(1, QP) & L_Q(2, QP) & L_Q(1, QP) \end{pmatrix}    (2.8)
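The doubling relationship S_Q(QP) = 2^(QP div 6) · S_Q(QP mod 6) means that the effective quantization step size roughly doubles for every increase of QP by 6. The sketch below mirrors this behaviour with base step sizes for QP 0 to 5 that are commonly cited in the H.264 literature; these numbers are an assumption for illustration and are not taken from the lookup table L_Q of this chapter.

```python
# Commonly cited base quantization step sizes for QP = 0..5 (assumed values).
QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]

def qstep(qp):
    """Quantization step size for a quantization parameter in the range 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must lie between 0 and 51")
    # The step doubles with every increase of QP by 6.
    return QSTEP_BASE[qp % 6] * (2 ** (qp // 6))

# qstep(4) == 1.0, qstep(10) == 2.0, qstep(51) == 0.875 * 2**8 == 224.0
```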

As an optional feature that is only available in the higher profiles, it is possible to exploit the unequal sensitivity of the HVS to different coefficient frequencies. As demonstrated in [76], quantization artifacts caused by missing or imprecise high frequency coefficients appear to be less disturbing to the human observer than those caused by inexact low frequency coefficients. It therefore seems reasonable to adjust the quantization step size according to the magnitude of a coefficient's frequency, i.e., its position in the coefficient matrix, and to quantize lower frequencies (located in the top-left part of the Y matrix) more moderately than higher frequencies (situated in the bottom-right part).

This strategy is commonly referred to as frequency dependent quantization. It is implemented by including one or several scaling matrices in the bitstream and by signaling quantization matrix adaptations to the decoder. Prior to entropy coding, the transformed and quantized coefficients have to be arranged as an array. A perfect constellation would be one in which the coefficients are in descending order, as it would enable the entropy coder to minimize the number of bits required to represent that array. However, additionally signaling a custom order of coefficients would eliminate the compression gain. As a compromise, a default block scan order, also known as zig-zag scan order, is used. It orders coefficients according to their frequency because low frequency coefficients tend to have higher amplitudes, whereas a higher number of zero values can usually be found in the regions where high frequency coefficients are located (see Figure 2.11). With most video sequences, this approach works sufficiently well. However, there are also videos where a more advanced block scan strategy may lead to a higher entropy compression rate. Numerous publications exist that address this optimization problem (e.g. [229, 240]). The approaches are manifold and range from inspecting the morphological representation of coefficients [33] to considering pixel similarities among them [39].

2.2.7. Entropy coding

The entropy coding process transforms symbols that represent prediction types, (differentially coded) motion vectors, transform coefficients, identifiers, parameters, etc., into bit patterns. This is performed in such a way that the decoder can still reconstruct all symbols and that the total number of bits needed is minimized. In general, and in connection with the H.264/MPEG-4 AVC standard, entropy coding is a lossless compression technique.

47 2.2. Compression Figure 2.11.: Block scan order of the quantized DCT coefficients. The brighter the block, the higher the frequency of the coefficient it represents. is a lossless compression technique. The standard supports four different coding methods: Fixed length codes (FLC) transform symbols into bit sequences of known length. The lengths and signs of fixed length coded symbols are specified in the standard. 47

Universal variable length codes (UVLC) are typically used for symbols that have small values with high probability but that may also have large values in rare situations. In such cases, using fixed length codes would only unnecessarily enlarge the bitstream as they do not consider the unequal occurrence probabilities of symbols. Variable length codes do not only have to encode the value but also the length of the symbol to enable parsers to delimit the symbol's bit representation. In case of H.264/MPEG-4 AVC, Exponential Golomb codes are used, and the construction of the first eight codewords is illustrated in Table 2.1. Each codeword starts with a series of ld(n + 1) 0-bits whose length allows the deduction of the number of remaining bits of the symbol. They are followed by a 1-bit and ld(n + 1) bits in binary code. The length of the codeword c(n) of symbol n can be calculated as len(c(n)) = 2 · ld(n + 1) + 1.

n   ld(n + 1)   c(n)
0   0           1
1   1           010
2   1           011
3   2           00100
4   2           00101
5   2           00110
6   2           00111
7   3           0001000

Table 2.1.: Construction of Exponential Golomb codes.

Both fixed length codes and variable length Exponential Golomb codes are mostly used for symbols above the slice layer because they are simpler to parse at the cost of a decreased coding efficiency.

Context adaptive variable length codes (CAVLC) are used to compress the quantized and block-scanned DCT coefficients. They are designed to efficiently code zero coefficients, trailing non-zero coefficients of magnitude 1, and non-zero coefficients of higher magnitude. Encoding a block of coefficients encompasses several steps: First, a so-called coefficient token is written, which indicates the

number of non-zero coefficients of the block and the number of (up to three) trailing ones. This token is taken from one of four lookup tables. The choice of which table to use depends on the average number of non-zero coefficients of the previously coded (left and top) blocks. This coding strategy exploits the fact that the number of non-zero coefficients in neighboring blocks is correlated, and it explains the notion of context adaptiveness. After also signaling the signs of the trailing ones, the levels of the remaining non-zero coefficients are encoded, starting with the highest frequency. During the last step, the number of zero coefficients before the last non-zero coefficient and the number of zero coefficients preceding each non-zero coefficient are encoded.

Figure 2.12.: Data flow in a context adaptive binary arithmetic coder.

Context adaptive binary arithmetic codes (CABAC) are, unlike CAVLC, an optional feature in the H.264/MPEG-4 AVC standard. They are only available in the main and high profiles. To encode symbols, a binary arithmetic coder is used that considers local statistics derived from spatial and temporal characteristics of neighboring elements to produce the codewords (see Figure 2.12). The coder processes symbols in binary mode, which implies that symbols have to be binarized prior to encoding. The context model used may vary on a per-bit basis. In some cases, the symbol values can exceed a certain threshold. As a consequence, the

arithmetic coder is bypassed and the values are encoded using an Exponential Golomb coder as fallback. Context models provide two probabilities, which gives rise to the name binary arithmetic codes: one probability that a bit in a certain context is 0 and one probability that the bit is 1. These probabilities determine the width of the two sub-ranges during the arithmetic coding process. The choice of a suitable context model depends on the collected local statistics and the frequency counts of priorly coded 0 and 1 bits. The H.264/MPEG-4 AVC standard encompasses roughly 300 such context models and specifies the binarization for various syntax elements. For an extensive discussion on arithmetic codes, the reader is referred to [169, 236]. Regarding efficiency, Marpe et al. point out that for typical test sequences, average bit-rate savings of 9% to 14%, corresponding to a range of acceptable video quality of about 30 to 38 dB, can be expected [136]. Experimental results with the H.264/MPEG-4 AVC reference software from [140] further indicate that at the same video quality, the stream size can be reduced by up to 11%, depending on the amount of motion in the respective video sequence. However, although CABAC is superior to CAVLC in terms of space efficiency, this normally comes at the price of additional encoding/decoding complexity as pointed out in [ ].

2.3. Profiles and levels

The H.264/MPEG-4 AVC standard currently encompasses roughly 600 pages and contains both normative and informative information. Appendix A [93, p. 286 ff.] defines profiles and levels to improve the interoperability of encoders and decoders. H.264/MPEG-4 AVC encompasses numerous coding tools such as CABAC, B slices, macroblock mode sets, and quantizer scaling. The idea of profiles is to limit the number of tools that a decoder has to be capable of in order to correctly decode profile-compliant videos. If no such restrictions were in place, encoders would be free to use any tools defined in the standard. This in turn would force decoders to support every single tool as they cannot know in advance which

51 2.3. Profiles and levels tools are actually used by the encoders producing their input. Consequently, this would unnecessarily enlarge implementations and significantly increase the price of hardware decoders. A profile defines a set of tools that decoders have to implement when they want to provide support for that specific profile. Encoders that want to produce a video bitstream that is compliant to that profile are only allowed to use tools from that set. Table 2.2 lists some of the most frequently used profiles together with the tools they permit. The baseline profile was originally intended for low delay applications over lossy channels and provides tools such as flexible macroblock ordering, redundant slices and arbitrary slice ordering to improve video streams robustness. Unfortunately, these features have rarely been integrated into decoders, which subsequently led to the definition of the constrained baseline profile. The main and high profiles are mainly used for both storage and transmission and provide a broad range of coding tools to achieve sufficiently high compression rates at an acceptable quality. The high intra profiles are intended for video editing applications as an alternative to raw formats or other lossless compression techniques. They allow complete random access to coded frames, high resolution storage of chroma components, and (lossless) intra prediction. In contrast to profiles, levels focus on the amount of data that has to be processed by decoders. More specifically, they define upper limits on the resolution, the number of blocks to process per time unit, the memory required to decode a sequence, the number of motion vectors per two consecutive macroblocks, etc. [198]. Table 2.3 provides an overview of some limits enforced by level definitions. Levels are frequently used (usually in combination with a suitable profile) to ensure that devices with limited or constrained capabilities such as mobile devices and Blu-ray players are able to play back the encoded content. On the other hand, manufacturers get the possibility to include the information about which profiles/levels their devices and onboard software bundles can handle in the product specification to give customers a concrete idea of what their products are capable of. 51

52 Chapter 2. Video Profiles Tools Constrained baseline Baseline I slices P slices B slices 4x4 transform 8x8 transform in-loop de-blocking filter quantizer scaling 8x8 intra predict lossless prediction color plane coding multiple reference frames 8 bit samples 9-10 bit samples bit samples 4:0:0 format 4:2:0 format 4:2:2 format 4:4:4 format interlacing separate Cb and Cr QPs CAVLC CABAC flexible MB ordering arbitrary slice ordering redundant slices SI slices SP slices data partitioning Extended Main High High422 High444Pred High422Intra Table 2.2.: Frequently used H.264/MPEG-4 AVC profiles and associated tools. High444Intra 52

53 2.3. Profiles and levels Constraints Levels max. macroblocks/s max. resolution (Pixel) max. bitrate ( Mbit/s) max. frame rate for SVGA max. frame rate for 1080p HD max. DPB for SVGA max. DPB for 1080p HD b Table 2.3.: Selected constraints enforced by different level definitions. 53
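As a simple illustration of how such level constraints are checked in practice, the following sketch derives the macroblock throughput of a given format and compares it against frame-size and macroblock-rate limits; the limit values passed in the example are placeholders chosen for illustration, the authoritative numbers being those of Table 2.3 and Annex A of the standard.

```python
# Hedged sketch: check a video format against level-style constraints. The limits
# used in the example below are illustrative placeholders, not normative values.

def macroblocks_per_frame(width, height):
    # A macroblock covers 16x16 luma samples; dimensions are padded to multiples of 16.
    return ((width + 15) // 16) * ((height + 15) // 16)

def fits_level(width, height, fps, max_frame_mbs, max_mbs_per_second):
    """Return True if both the frame-size and the throughput constraint are met."""
    mbs = macroblocks_per_frame(width, height)
    return mbs <= max_frame_mbs and mbs * fps <= max_mbs_per_second

if __name__ == "__main__":
    # 720p30 against hypothetical mid-range limits.
    print(fits_level(1280, 720, 30, max_frame_mbs=3600, max_mbs_per_second=108000))   # True
    # 1080p60 against the same hypothetical limits.
    print(fits_level(1920, 1080, 60, max_frame_mbs=3600, max_mbs_per_second=108000))  # False
```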

54 Chapter 2. Video 2.4. Error resilient encoding techniques In many application scenarios such as digital television broadcast, video conferencing, and on-demand video streaming, video data is transported over lossy and error prone channels. To prevent that transport errors cause a too severe degradation of reconstructed video streams quality, error control techniques have to be applied. They can be roughly categorized into channel related and source related countermeasures as illustrated in Figure Several source coding techniques are covered in this and the following Section 2.5 whereas approaches towards channel error control are extensively discussed in Section 3.6. With bidirectional connections, it is always possible to cancel out transmission errors with the help of transport layer protocols, e.g., by ensuring full reliability through automatic repeat request (ARQ) techniques. This may however not be desirable for delay sensitive applications as multiple retransmissions can introduce unacceptably high delays [231]. The goal of error resilient coding techniques is to produce video streams that are robust against transmission errors, i.e., the impact of such errors on the reconstructed video quality should be minimized. As a consequence, error resilient coding can only be done at the source (the encoder) or at intermediate nodes (transcoders) along the delivery path as opposed to error concealment techniques, which are integrated into decoders. There typically is a tradeoff between error resilience and compression efficiency as error resilience schemes inevitably introduce redundancy into the bitstream. On the other hand, compression techniques aim to minimize redundancies (see Section 2.2.2) by applying temporal and spatial prediction, which in turn increases the impact of error propagation and thus reduces streams robustness. It is therefore beneficial to have a certain amount of foreknowledge about the expected channel conditions to select appropriate error resilience schemes, tune them accordingly, and thus find an acceptable tradeoff [109]. As already indicated in the previous section, there are still few implementations that provide support for some error resilient coding techniques, and even fewer 54

that implement the tools required for baseline or extended profile compliance. In the following, the most popular techniques are described.

Figure 2.13.: Examples of both source related (error resilience, error concealment, scalable video coding, multiple description coding, rate distortion control, Wyner-Ziv coding) and channel related (forward error correction, automatic repeat request, proactive dropping, rateless coding, energy adaptation, rate adaptation) approaches towards transport error control in video streaming applications.

2.4.1. Network abstraction layer and parameter sets

The raw video bitstream produced by an encoder can be divided into video coding layer (VCL) and non-VCL units (see Figure 2.14). Both types are encapsulated by NAL units, which are, when prepared for storage-only scenarios, preceded by a start code prefix. This facilitates navigation and enables resynchronization to skip corrupted video data segments. The use of start code prefixes makes the insertion of emulation prevention bytes necessary (Annex B of [93]). Due to the fact that most of the syntax elements have a variable bit length, sequences of elements are aggregated and trailing bits have to be appended before NAL encapsulation to preserve byte alignment. The structure of the NAL makes it possible to customize packetization of video content to meet the requirements of specific networks [38]. The VCL encompasses data that is directly related to the content of frames such as prediction parameters and residuals, whereas non-VCL information can be

either mandatory (e.g. parameter sets) or optional/supplemental information (e.g. supplemental enhancement information (SEI), video usability information (VUI)).

Figure 2.14.: VCL encapsulation in NAL units for storage or transport.

There are two kinds of parameter sets: sequence parameter sets (SPSs) and picture parameter sets (PPSs). SPSs provide parameters for sequences of frames that are typically separated by key frames. In contrast to that, PPSs may change on a per-picture basis. SPS and PPS are needed to assign values to parameters that remain constant throughout the respective scope such as the entropy coding mode, the picture dimensions, the motion vector resolution, or the intra prediction mode. The information about which SPS/PPS combination is to be used is signaled in the slice headers by referring to the corresponding index; previously transmitted/decoded parameter sets are stored in lookup tables for later usage. This design not only makes parameter sets reusable and avoids redundant signaling, but also provides more security against intrusion. Both SPS and PPS are of integral importance for the decoding process (see Section 2.4.5) and error-free delivery should therefore be enforced. A straightforward solution is to signal each parameter set more than once, which is an acceptable compromise regarding coding efficiency

as parameter sets are relatively small compared to VCL units. Besides that, in streaming systems that support methods for error protection or partial reliability, parameter sets can be preferentially handled, using either in-band or out-of-band approaches.

2.4.2. Flexible macroblock ordering

When video content is transported over error prone networks, a straightforward way to increase a video stream's error resilience is to instruct video encoders to produce slices whose size does not exceed that of network packets. This is reasonable because intra prediction is limited by slice boundaries and slices can therefore be considered as independent units of the video stream with regard to intra prediction. Except for a few situations where low-resolution video content is extremely compressed, the size of a frame usually exceeds 1.5 kB. This can be considered as a reasonable upper bound on the maximum transfer unit (MTU) as most internet connections run over Ethernet [105, 231]. As a consequence, a frame has to be segmented into more than one slice, which raises the question of how to perform the segmentation in an error resilient way. The standard approach is to assign macroblocks in raster scan order to a slice until that slice exceeds the size threshold. When a packet that contains a slice is damaged or lost, the corresponding region in the frame will not be decodable and error concealment techniques will have to be applied. Spatial error concealment works best if neighboring regions are available to predict lost macroblocks, and consequently, the result of the concealment process is rather poor when the lost slice covers large regions of consecutive macroblocks. The flexible macroblock ordering (FMO) feature aggregates slices to slice groups and allows the definition of a macroblock allocation map, which associates macroblocks with slice groups. It therefore enables consecutive macroblocks to be assigned to different slice groups, which improves the reconstructed video quality when a slice is lost as collocated regions that tend to show considerable similarities with high probability are available for interpolation/concealment [43].

The type of the macroblock allocation map indicates the strategy of how macroblocks are assigned to slice groups. Figure 2.15 shows examples for all possible macroblock maps.

Figure 2.15.: Different FMO macroblock allocation map types (interleaved, dispersed, row-wise, column-wise, box-out, rectangular, custom).

Slices of slice groups can be either arranged row-wise interleaved, dispersed, split into regions in raster scan order, or split into regions in vertical scan order. Further possibilities are a box-out scan where the scan starts at the center macroblock, a rectangular scan where rectangular regions are defined and the last slice group contains the remaining (background) macroblocks, and a custom scan order where the group affiliation is signaled for each macroblock separately. A considerable body of work exists that focuses on FMO. As an example, it was studied how beneficial it is to combine FMO with forward error correction (FEC) schemes [4] and how such schemes can be extended to provide unequal error protection [186]. Furthermore, studies investigated to what extent performance improvements can be expected from FMO in wired as well as wireless environments [42], what strategies there are to assign macroblocks to slice groups [80, 81], and how FMO can be optimally combined with other error resilience techniques such as redundant slices [100].
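As an illustration of such an allocation map, the following minimal sketch scatters macroblocks over slice groups in a dispersed, checkerboard-like fashion; it follows the commonly used dispersed pattern, but the normative map-generation rules are those of the standard.

```python
# A minimal sketch of building a macroblock allocation map for dispersed FMO.
# The scatter pattern below reproduces the typical checkerboard-like dispersal;
# the authoritative map-generation rules are defined in the standard itself.

def dispersed_allocation_map(mbs_wide, mbs_high, num_groups):
    """Assign each macroblock (row, col) to a slice group so that neighboring
    macroblocks end up in different slice groups whenever possible."""
    return [[(col + (row * num_groups) // 2) % num_groups for col in range(mbs_wide)]
            for row in range(mbs_high)]

if __name__ == "__main__":
    # QCIF luma resolution (176x144) corresponds to 11x9 macroblocks.
    for row in dispersed_allocation_map(11, 9, 2):
        print("".join(str(group) for group in row))
```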

59 2.4. Error resilient encoding techniques Arbitrary slice ordering Similar to macroblocks in slices without FMO, slices occur in raster scan order in the bitstream. Arbitrary slice ordering (ASO), a feature that was already included in the previous H.263+ standard [222], enhances transport flexibility by removing this restriction so that slices of the same frame or even of different frames may be arbitrarily transposed. ASO is a decoder-only feature because the encoder is not involved in the decision of whether ASO is "used" or not, and consequently, no additional information has to be signaled in the bitstream. In addition to that, encoders are, independent of the selected profile, free to choose an appropriate slice size. They can assign an appropriate number of macroblocks to single slices to provide support for customized transport level packetization [38]. Primarily realtime applications that demand for very low end-to-end delay can benefit from that feature. ASO-capable decoders are able to immediately decode a slice without having to wait for previous slices of the same frame. However, all required temporal reference regions must be available to the decoder. This advantage unfolds when transport protocols are used that cannot ensure in-order delivery as it is the case e.g. with User Datagram Protocol (UDP) internet connections. Additionally, senders may intentionally transmit NAL units out of order, which makes sense in environments that are characterized by bursty loss behavior because it reduces the probability that consecutive slices are lost. This in turn increases the likelihood that decoders error concealment mechanisms satisfyingly interpolate lost regions Redundant slices As a further error resilience tool, slices may be encoded as ordinary non-redundant (primary) or as redundant (secondary) slices. The latter is an alternative representation of a frame part. When no error occurs during transit or when merely redundant data is affected, the decoder discards all received redundant slices and decodes the video by exclusively processing the non-redundant data, similar as if no redundant slices (RS) were used at all. However, if non-redundant slices are 59

60 Chapter 2. Video affected by packet loss, redundant slices can be of substantial importance to the decoder to reconstruct affected regions. In this case, the decoded video quality may still be less than that of the original encoded sequence as higher quantizers are usually used to bound the amount of redundant information in the bitstream. Another reason may be that secondary slices are only partially available for regions where primary slices are missing, and error concealment mechanisms that take both redundant and non-redundant data of collocated regions as input have to be applied. A further aspect that improves error resilience is that redundant slices may use reference regions of frames different to those used by the primary slices. Consequently, if both the primary and redundant slices of a frame are available to the decoder but the reference frames used by some of the primary slices are damaged, the redundant slices may be considered instead to obtain an improved picture quality and to mitigate error propagation. In this context, the question arises of how to optimally select the amount of redundancy in the bitstream. For example, this is investigated in [203] where the network conditions and the expected decoder error propagation are taken as decision criterion. The tradeoff between inserted redundancy and decrease in video quality when primary slices are affected by packet loss is experimentally demonstrated in [164]. Moreover, a method based on Wyner-Ziv coding is presented to reduce the coding overhead caused by redundant data. The problem of jointly optimizing mode decisions (e.g. coding parameters and rate allocations) for macroblocks of both primary and secondary slices is covered in [178]. The authors extended their formerly proposed distortion estimation scheme to optimize the tradeoff between rate and distortion at specific packet loss rates Data partitioning As summarized in Section 2.2.7, NAL units carry different types of information such as mode parameters, motion information, and residual data. With regard to error concealment mechanisms, they are of unequal importance to the decoder as e.g. the availability of motion information generally leads to a higher reconstructed 60

video quality than the availability of texture information. This motivates the data of NAL units to be categorized according to their relevance for the error concealment process. Data partitioning (DP) enables the separation of information encapsulated by NAL units. It was adapted, similar to the ASO feature, from the H.263+ standard [222]. DP writes symbols to three different bit buffers and accordingly supports three types of partitions:

Type A partitions contain the slice headers, all header related data, and macroblock types, quantizer information and motion vectors. They are of integral importance to decoders as they allow other partition types to be parsed and interpreted correctly.

Type B partitions contain the residual information of intra coded slices (I and SI slices), and successful decoding depends on the availability of the corresponding header information. They are considered to be more important than type C partitions because no temporal dependencies have to be satisfied and possible quality drifts can be effectively stopped.

Type C partitions carry residual data of inter coded slices (P, B, and SP slices) and make up (on average) the largest share of NAL unit information. Similar to type B partitions, correct decoding depends on type A partition header information plus motion related information. When a packet that contains type C partition data is lost, this in general affects the reconstructed video stream's quality the least because related motion information can still be extracted from type A partitions, which allows effective error concealment.

For transport, each partition is typically placed in a separate NAL unit, which enables error protection mechanisms to exploit the knowledge about the unequal importance of media units. Various combinations are possible including hierarchical quadrature amplitude modulation [10], fixed-rate FEC codes such as XOR parity [246] and Reed-Solomon codes [120, 247], Growth codes [165], and Raptor codes [3].
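The following sketch illustrates the general idea of exploiting this unequal importance for unequal error protection; the partition-to-redundancy mapping is a made-up example for illustration and is not taken from any of the cited schemes.

```python
# Hedged sketch of unequal error protection driven by data partitioning: more
# important partitions receive a larger share of FEC redundancy. The ratios are
# illustrative placeholders only.

from enum import Enum

class Partition(Enum):
    A = "slice headers, macroblock types, QP, motion vectors"
    B = "intra residuals"
    C = "inter residuals"

FEC_REDUNDANCY = {Partition.A: 0.50, Partition.B: 0.25, Partition.C: 0.10}

def parity_bytes(partition, payload_size):
    """Number of parity bytes to generate for a NAL unit carrying this partition."""
    return int(payload_size * FEC_REDUNDANCY[partition])

if __name__ == "__main__":
    for p in Partition:
        print(p.name, parity_bytes(p, 1400))
```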

2.4.6. Switching I and P slices

The extended profile adds two further slice types to the H.264/MPEG-4 AVC video coding standard: switching intra coded (SI) and switching predictively coded (SP) slices [97]. As the name already suggests, these slice types are primarily used for efficient switching between video streams. For example, this is beneficial when multiple versions of the same sequence encoded at different bitrates are available at streaming servers and the most appropriate stream is selected based on transmission channel conditions. When those conditions change, another version of the stream may turn out to be more suitable, and the streaming server may decide to switch to that other version (bandwidth scalability). However, with conventional (non extended profile) coding techniques, the switch cannot be performed at arbitrary frames without risking the introduction of visual artifacts. This is because prediction dependencies have to be satisfied and decoded versions of previous frames are only available for the prior stream. One solution to that problem is to insert more intra coded slices, which increases switching flexibility and does not require any reference frames to be available. However, instead of unnecessarily decreasing coding efficiency by increasing the I slice ratio, SP slices can be used to enable drift-free bitstream switching. This is established as illustrated in Figure 2.16: at switching positions, one additional (secondary) SP slice S_3^{A,B} needs to be encoded (for simplicity, a one-slice-per-frame scheme is assumed), which may use previous frames of stream A (the stream to switch from) as references. Succeeding frames of stream B (the stream to switch to) are allowed to use frames back to the primary SP slice S_3^B as references because S_3^{A,B} is constructed in a way that its decoded version is identical to S_3^B. Thus, when the server decides to switch from one stream to the other, the following slices are transmitted: [S_1^A, S_2^A, S_3^{A,B}, S_4^B, S_5^B, ...]. Switching points are unidirectional in a sense that it can only be switched from one specific stream to another. For switching back, a further switching point is necessary. SI slices can be used instead of SP slices in situations where applying inter prediction would be of limited benefit such as after scene cuts. Furthermore, they

are useful when switching between two streams with different content, e.g. when switching between streams from different cameras or when inserting commercials at certain points during playback.

Figure 2.16.: Switch between two streams using SP slices (frames).

Besides that, SI and SP slices allow for video cassette recording (VCR) functionality such as fast forward and fast reverse. Regarding error resilience and error recovery, they provide several advantages. For example, assuming that a feedback channel is available, the information about lost slices can be signaled to the streaming server. The server in turn decides whether to transmit the primary or the secondary version of the next SP slice, depending on whether the regions referred to by some version are available or not. Additionally, SI slices can be used to adaptively perform intra refreshes (i.e., refreshing the prediction chain and thus stopping temporal error propagation) in scenarios with pre-stored content. This is done by using SI slices as secondary representation of SP slices and by deciding during streaming, depending on the current channel conditions, whether to transmit the SI or SP version of switching points [183, 184].

2.4.7. Intra refresh

Intra coded information can occur at the macroblock level, at the slice level, and at the level of frames. Besides allowing random access, it is primarily placed by encoders at specific positions in the bitstream to combat video quality drifts.

Regarding intra coded macroblocks, it is interesting to note that even they can be affected by errors that occurred in previously decoded frames [28]. This is possible when intra macroblocks are part of inter slices and use neighboring inter coded macroblocks for spatial prediction. Thus, transitive error inheritance is risked. By setting the constrained intra prediction header flag, this effect can be (temporarily or permanently) prevented, which makes sense in high packet loss rate scenarios. It is further remarkable that with H.264/MPEG-4 AVC, unlike with previous video coding standards, inserting an I frame (consisting only of intra coded slices) does not necessarily stop error propagation as future frames can still refer to frames before the I frame. A so-called instantaneous decoder refresh can only be enforced by inserting an IDR frame, which flushes and invalidates the reference picture lists (see Section 2.2.5).

An interesting research question is to what extent, where, and at what frequency (periodic, random, or adaptive) to place intra coded syntax elements to counter quality degradation [244]. When a feedback channel is available, positive/negative acknowledgements can be used to select/rule out candidate reference areas for temporal prediction. When the candidate set becomes too small or no suitable candidates can be identified for the current macroblock, intra coded macroblocks usually turn out to be an acceptable compromise between increased error resilience and slightly decreased coding efficiency [193]. In [2], an adaptive intra refresh scheme combined with data partitioning and adaptive channel coding is considered, and the form as well as the rate at which intra information should occur in the bitstream is studied. In this connection, not only the implications on error propagation have to be considered but also those on packetization as packets that largely contain intra coded macroblocks usually cover smaller regions. A further intra placement strategy is to try to identify the region of interest (ROI) (see Figure 2.17) and to refresh macroblocks that belong to these regions more frequently [28]. The authors refer to this approach as attention-based adaptive intra refresh. Regarding periodic intra refresh, the authors of [180] propose a motion-aware algorithm that considers macroblock modes and motion vectors to protect the borders of refresh waves. They try to avoid that old reference material is used

in new frames (i.e. frames after intra refresh), which can happen when the refresh rows/columns do not overlap.

Figure 2.17.: ROI identification as a decision criterion for intra placement [28].

2.4.8. Reference picture selection

Reference picture selection (RPS), also referred to as frame dependency management, is the encoder-side task of choosing those reference regions from the reference picture buffer for temporal prediction that have been correctly received by decoding applications. As indicated in the previous section, one possibility is to signal lost and/or correctly received media units via a feedback channel to the encoder to avoid referencing dirty regions [59, 126, 207]. A more advanced approach is to integrate the knowledge about channel conditions and loss in a rate distortion framework to calculate probabilities for future packet loss events plus the corresponding impact of error propagation to optimize packet dependency management [65, 124]. However, besides requiring a bidirectional connection, these solutions are only possible in low delay streaming scenarios with round trip times (RTTs) below 30 ms for common 30 fps video. Additionally, a higher computational effort is to be expected from loss-adaptive RPS approaches as they require motion search to be applied to larger regions and temporal distances.
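To make the feedback-based variant concrete, the sketch below (an illustrative simplification, not the scheme of any particular cited work) tracks acknowledged frames and always predicts from the most recent frame known to have arrived, falling back to intra coding when no clean reference remains.

```python
# Minimal sketch of feedback-based reference picture selection: keep a set of
# acknowledged frames and choose the newest acknowledged one as prediction reference.

class ReferenceSelector:
    def __init__(self, buffer_size=16):
        self.buffer_size = buffer_size   # size of the reference picture buffer
        self.encoded = []                # frame ids in encoding order
        self.acked = set()               # frame ids confirmed by decoder feedback

    def on_frame_encoded(self, frame_id):
        self.encoded.append(frame_id)

    def on_feedback(self, frame_id, received):
        if received:
            self.acked.add(frame_id)
        else:
            self.acked.discard(frame_id)

    def choose_reference(self):
        """Newest acknowledged frame still in the buffer, or None to force intra coding."""
        window = [f for f in self.encoded[-self.buffer_size:] if f in self.acked]
        return max(window) if window else None
```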

For scenarios where no continuous feedback is possible such as multicast streaming applications, video redundancy coding (VRC) can be beneficial [159, 230]. The basic idea of VRC is to split up dependency chains into multiple, isolated branches. Each of those branches is typically processed by a separate thread at a loss-granularity dependent level (at NAL-granularity in the case of H.264/MPEG-4 AVC). Consequently, when one branch (a NAL unit dependency sub-tree) is affected by packet loss, the other branches will not suffer from any quality degradation (see Figure 2.18).

Figure 2.18.: Traditional versus VRC reference picture selection. Circles indicate transport-level data segments such as NAL units. When data of one branch is lost, other branches will not be affected.

As it is the case with other error resilient schemes, with RPS, encoding efficiency is traded off as motion search is restricted to selected reference regions of the respective branch.

2.5. Error concealment decoding techniques

When video streams get corrupted during transport and the lost data cannot be recovered (in time), video decoders have to interpolate affected regions to keep the visual experience impairments as low as possible. This difficult task typically involves both finding suitable reference information to predict lost macroblocks from and applying filter operations to reconstructed pictures to make potential artifacts less noticeable [192, 219]. Lost information can be roughly categorized into

texture/prediction residuals, motion information, and prediction modes, ascendingly ranked according to their loss-impact on reconstruction hardness and hence visual quality. This separation makes sense when confronted with non-recoverable bit errors, or when data partitioning (see Section 2.4.5) is involved. Even with partial loss of a NAL unit, decoders may be able to recover the data prior to the lost segment. However, succeeding successfully received segments have to be dropped as it is not possible for CAVLC/CABAC decoders to resynchronize. In scenarios where quality degradation mainly stems from packet loss and NAL units are shaped to fit into network packets, none of the above mentioned information fragments are separately available.

Error concealment mechanisms ideally exploit both temporally and spatially collocated non-impaired regions to obtain good interpolations. As temporal correlation is generally much higher than spatial correlation in real world sequences, this makes the availability of the former more valuable [110]. It is a well-known fact that pictures of natural scenes mainly consist of low frequency components, which means that luma and chroma samples of collocated pixels only slightly vary except for regions with sharp edges [222]. Hence, concealment algorithms try to find matching interpolations by considering the smoothness properties of reconstructions [34]. The same criterion, although to a lesser extent, also applies to temporal relationships such as motion vector components. When confronted with bursty loss, the number of non-impaired interpolation reference regions may be drastically reduced. Consequently, error concealment mechanisms are forced to consult already concealed regions and risk the propagation of prior interpolation errors. The quality of a macroblock concealment attempt can be estimated, e.g., by considering the side match distortion as illustrated in Figure 2.19. It is calculated as the average of absolute luma pixel value differences between in- and out-blocks [226]. Especially in scenes with smooth texture information, a low side match distortion indicates an acceptable reconstruction. Hence, this simple metric is one method among many to select the most appropriate interpolation pre-image and concealment mode.
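A small sketch of this criterion is given below; it assumes frames stored as plain two-dimensional lists of luma samples and averages the absolute differences across all macroblock borders for which a neighboring block is available.

```python
# Sketch of the side match distortion: average absolute luma difference between the
# border pixels of a (concealed) macroblock and the adjacent pixels of its neighbors.

def side_match_distortion(frame, mb_row, mb_col, mb_size=16):
    top, left = mb_row * mb_size, mb_col * mb_size
    diffs = []
    for x in range(left, left + mb_size):
        if top > 0:                                  # neighbor above available
            diffs.append(abs(frame[top][x] - frame[top - 1][x]))
        if top + mb_size < len(frame):               # neighbor below available
            diffs.append(abs(frame[top + mb_size - 1][x] - frame[top + mb_size][x]))
    for y in range(top, top + mb_size):
        if left > 0:                                 # neighbor to the left available
            diffs.append(abs(frame[y][left] - frame[y][left - 1]))
        if left + mb_size < len(frame[0]):           # neighbor to the right available
            diffs.append(abs(frame[y][left + mb_size - 1] - frame[y][left + mb_size]))
    return sum(diffs) / len(diffs) if diffs else 0.0
```

A lower value indicates a smoother transition between the concealed block and its surroundings.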

Figure 2.19.: The side matching distortion can be used as a measure for the interpolation quality of a macroblock.

As indicated in Section 2.4.1, SEI messages may be delivered along with VCL units to provide decoders with supplemental information. Some of these messages such as motion constrained slice group, scene information, and recovery point [225] can apparently be helpful for selecting suitable error concealment parameters.

2.5.1. Spatial error concealment

Spatial error concealment, also referred to as intra error concealment, guesses the texture of a lost macroblock by exclusively considering spatially collocated regions, i.e., information of the current frame only. With regard to H.264/MPEG-4 AVC, a simple but effective algorithm was proposed in [212], which has been chosen as reference mechanism throughout many publications. It interpolates each lost pixel p_l of a macroblock by taking the weighted average

p_l = \frac{\sum_{i=1}^{4} p_i \, (16 - d_i)}{\sum_{i=1}^{4} (16 - d_i)}

of the border pixels p_1, p_2, p_3, p_4 of collocated macroblocks from the top, bottom, left and right, incorporating the pixel distances d_i as weights (see Figure 2.20). To combat error propagation, [212] further recommends to only rely on clean (non-concealed) pixels,

and to take previously concealed samples into consideration only if the number of available neighboring pixels is below two.

Figure 2.20.: Simple method for interpolating lost macroblocks based on available neighboring blocks (in the depicted example, d_1 = 11, d_2 = 12, d_3 = 6, and d_4 = 5).

A common problem that occurs in connection with this algorithm is that it works quite well in smooth picture regions with dominant low frequency components, but causes blurry reconstructions when high frequencies are involved. To tackle that problem, fuzzy logic reasoning can be used to coarsely interpret high frequency details (e.g. edges). Additionally, a sliding window iteration may improve surface continuity when combining high frequency with low frequency information, e.g., obtained by surface fitting techniques [118]. In [14], a coarse-to-fine block replenishment algorithm is proposed that aims to reconstruct affected pictures at different scales. More specifically, it first tries to recover large-scale patterns such as surfaces and illumination, and subsequently reconstructs large scale structural data. The last step involves local edge reconstruction, which is a computationally expensive task. The authors of [243], and in a more advanced version those of [110], also try to guess lost high frequency parts with respect to

edge detection and retrieve edge connection points by abstracting pixels as a binary (edge/non-edge) grid. Subsequently, the interpolation direction is chosen according to the edge directions.

Figure 2.21.: Order in which lost macroblocks of a slice can be concealed.

The order in which macroblocks of a frame are repaired depends on the loss pattern and the packetization scheme. When dispersed FMO is used (see Section 2.4.2), the loss of one slice (group) is easier to conceal than without FMO as at least two neighboring non-impaired macroblocks surround each lost macroblock. Without FMO, a slice loss leads to lost macroblocks that are not surrounded by received macroblocks with high probability. It is therefore reasonable to first conceal those macroblocks that have a sufficient number of correctly received or already concealed neighbor macroblocks. Additionally, macroblocks close to the center of a frame are in general harder to conceal as they are in the region-of-interest with high probability and thus are structurally more complex. As an example, the JM reference software decoder [205] provides a concealment mechanism that processes lost macroblocks column-wise from the borders to the center and each column is processed starting from the top and bottom of the lost segment (see Figure 2.21). Hence, it both maximizes the number of available, surrounding

macroblocks per block and avoids propagation of misconcealed regions from the center.

2.5.2. Temporal error concealment

Temporal error concealment, also referred to as inter error concealment, uses, besides available macroblocks of the current frame, also non-impaired or reconstructed information of previously decoded frames. The decoder adaptively decides which concealment mode to apply, e.g., by considering the picture or slice type [226], or the coding types of adjacent macroblocks [224]. If the current frame is similar to previously decoded frames, temporal error concealment techniques will in general produce better reconstructions than spatial mechanisms. However, if the region to be concealed is part of an intra coded frame, temporal error concealment may lead to unsatisfactory reconstructions. This is because intra coded frames are not only used for providing random access points and picture refreshes but also at the beginning of sequences and at scene cuts. The order in which macroblocks are concealed is typically similar to that depicted in Figure 2.21. More advanced techniques adaptively determine the concealment order by tracking connected corrupted regions and analyzing external boundary patterns of adjacent blocks [161]. This may especially reduce noticeable artifacts in regions with sharp edges and when FMO is used.

The simplest form of temporal concealment is to copy the macroblock with the same spatial coordinates from a temporally collocated, non-impaired frame. This approach, also referred to as frame copy error concealment, is computationally cheap but only works well with low motion scenes [197]. When spatially collocated macroblocks are available, boundary matching techniques can be applied. They consider the boundary of the missing macroblock and try to find a suitable reference region in the previous frame within a certain search range by minimizing the sum of absolute differences (SAD) between the boundaries [156]. Better results, although at the cost of more operations, can be obtained by performing block matching [95] (see Figure 2.22). For the top, bottom, left, and right blocks

(gray) of the missing macroblock (dark gray), suitable reference blocks (shaded) are determined by using SAD, which yields motion vector candidates. The average of those candidates is then used to conceal the lost macroblock by copying the corresponding region (shaded dark gray) from the reference frame. As an alternative, the average of the macroblocks that the motion vector candidates point to may be used as a replacement.

Figure 2.22.: Temporal block matching error concealment.

Unfortunately, spatially collocated macroblocks are in general also lost when they are part of the same slice, which significantly decreases the efficiency of boundary and block matching approaches. In many cases, techniques that only rely on temporally collocated information provide better reconstructions. As an example, multi-frame motion vector averaging exploits motion vectors of several previously decoded frames and takes the (weighted) average of selected vectors to obtain a suitable motion vector predictor for the lost block. When the corresponding region to predict from is error-free, this technique efficiently hides the lost data with high probability as motion components of adjacent frames are generally strongly correlated. As an alternative, motion vector extrapolation (see Figure 2.23) compensates the missing macroblock (dark gray) by extrapolating spatially neighboring motion vectors of temporally adjacent frames [32, 151]. An overlap of

the corresponding projected regions (shaded gray) is then used as reconstruction for the lost macroblock.

Figure 2.23.: Temporal motion vector extrapolation error concealment.

With data partitioning (see Section 2.4.5), it may happen that header information and motion vectors of macroblocks are available to the decoder but residuals are not. A straightforward approach in such situations is to copy the region to which the motion vector points from the corresponding reference frame and to assume that the missing residuals are zero. In case of bidirectional prediction, the average of the corresponding reference regions is taken. Table 2.4, taken from [204], exemplarily depicts the effectiveness of different error concealment techniques. With all test sequences, the error concealment technique that can exploit the motion vectors (only the residual data is lost) provides the reconstructions with the least recognizable visual impairments.

A major drawback of block-based algorithms is that blocking artifacts can occur especially in combination with non-translational motion such as zooming and rotation [237]. In addition to that, the performance of decoder motion vector estimation approaches is usually better than that of boundary or block matching algorithms. This however ultimately depends on the location and pattern of the corrupted block and the availability of adjacent blocks in the current frame and in previously decoded frames. To tackle that problem, hybrid approaches try

Table 2.4.: Comparison of different error concealment techniques [204]. For the Akiyo, Foreman, and Soccer test sequences, the reconstruction quality of error-free decoding is contrasted with border pixel averaging, frame copy, boundary matching, block matching, no concealment, and data partitioning with missing residuals.

75 2.6. Quality metrics to find a suitable compromise between temporal and spatial error concealment techniques by keeping track of non-impaired and concealed regions. Based on that, the most appropriate concealment mode is selected to minimize noticeable visual impairments and counter error propagation. As an example, Hwang et al. propose an algorithm that either takes, depending on the normalized SAD, the result of the boundary matching technique, the motion vector estimation technique, or a weighted sum of both techniques to conceal lost regions [75]. Overlapped block-motion compensation is adaptively applied to counter blocking artifacts. Similar to that, Friebe et al. present a spatio-temporal fading scheme that also combines the results from spatial and temporal error concealment [56]. Fading aids in suppressing blocking artifacts while keeping the reconstructed picture area relatively sharp. Furthermore, the concealment of bidirectionally coded regions is improved by averaging the estimated image samples of prediction directions by a dynamic weighting matrix. Regarding computational complexity, temporal error concealment mechanisms are more demanding than spatial techniques because motion estimation requires larger search regions to be processed. This is a severe problem for devices with limited capabilities such as mobile phones and low-end tablets. In a recent publication, a technique to redundantly hide motion vectors in other macroblocks of intra coded frames is proposed [30]. To avoid that also the redundant information is affected when packet loss occurs, a block shuffling scheme is applied. The approach shifts computational complexity from the decoder to the encoder and additionally provides superior reconstruction quality as not the reconstructed but the original motion vectors are available for error concealment Quality metrics The visual quality of digital videos can be impaired at several stages: during acquisition, processing, compression, transmission, transcoding, storage, and decoder concealment. As video content is consumed by human beings in almost 75

all application scenarios, the most accurate way of assessing video quality is by using subjective metrics. However, subjective metrics are both expensive and time-consuming and are therefore not suitable for many situations. Their application is typically restricted, e.g., to key stages of larger projects to corroborate results from objective tests. Regarding applicability, objective video quality metrics are more flexible as they are cheap, their results are easier to compare with results of related work, and online calculation is possible. Objective metrics are mathematical models that approximate the results of subjective assessments. They either compare sequences with the reference material at a per-pixel or per-block basis or detect artifacts that are perceived as above-average disturbing by the HVS and ignore minor pixel value deviations. Especially for industry, the availability of accurate and reliable objective video quality metrics has become increasingly important with the emergence of new video applications and services such as Internet Protocol Television (IPTV), internet video, and mobile video delivery. They are beneficial in multiple use cases ranging from client-side quality measurements and in-service network monitoring to testing of equipment and codec settings.

Besides subjectivity and objectivity, video quality metrics can be divided into full-reference (FR), reduced-reference (RR), and no-reference (NR) metrics depending on the amount of reference information available. FR metrics provide the most accurate results but are not suitable for some application scenarios. As an example, some service providers may want to keep track of quality impairments at strategic points in their delivery network. Obviously, an error-free video stream will not be available as reference material, and other mechanisms have to be installed that, e.g., monitor the stream's blockiness or the fluency of motions (RR or NR), or that exploit redundant information received over a side-channel to estimate the channel-induced distortion (RR). This especially makes NR metrics more versatile and flexible as they can be deployed anywhere along the delivery path. On the other hand, certain assumptions have to be made on the types of artifacts that may occur, i.e., that may be induced by a certain compression technique or by a specific transport medium. Popular representatives are blocking metrics, blurring metrics, and metrics measuring the colorfulness of picture sequences.

2.6.1. Mean opinion score

As indicated, subjective tests are quite time consuming and encompass several steps: the video sequences to be tested and an appropriate testing methodology have to be selected, a sufficiently large number of test persons (typically 15 to 30) has to be recruited, the tests have to be executed more than once, and outliers have to be identified and removed to obtain consistent results. To quantify the opinion of test persons, the Mean Opinion Score (MOS) can be used. A common mapping is depicted in Table 2.5.

Score   Opinion     Description
5       Excellent   Imperceptible
4       Good        Slightly perceptible
3       Fair        Slightly annoying
2       Poor        Annoying
1       Bad         Very annoying

Table 2.5.: Mapping of subjective opinions to numerical scores.

The MOS of a test instance is the arithmetic mean of all individual scores and typically ranges from 1 (worst) to 5 (best). However, the subjectivity and variability of the scores cannot be completely eliminated as each human being has different expectations towards video quality; the deviations have to be minimized by training, exact instructions, and standardization of the test environment. The testing procedure is specified in detail in the ITU-T recommendations [90] and [92]. For an extensive discussion about testing methodology and test data analysis, the reader is referred to [141].

2.6.2. Peak signal to noise ratio

The Peak Signal to Noise Ratio (PSNR) is the ratio between the maximum power of the error-free signal s_correct and the power of the error introduced in the error-prone signal s_error. It is the logarithmic representation of the mean squared error (MSE) as given in Equation 2.9 and Equation 2.10.

MSE = \frac{1}{n\,w\,h} \sum_{k=0}^{n-1} \sum_{j=0}^{h-1} \sum_{i=0}^{w-1} \left( s_{correct}(i,j,k) - s_{error}(i,j,k) \right)^2   (2.9)

PSNR = 10 \cdot \lg \left( \frac{\max(s)^2}{MSE} \right)   (2.10)

w, h, and n denote the width, the height, and the number of frames of the video sequences, and max(s) is the largest possible value of the video signal. In the YCbCr color space, the luminance or a weighted sum of all components is usually taken as s. The results of the PSNR metric are expressed in decibels (dB). The use of a logarithmic scale renders the range of possible values more lucid than those of MSE. The relationship between perceived subjective quality and PSNR is considered to be roughly linear when values are in the range of 20 to 40 dB [101]. However, this relationship is only approximate because video content is compared at a per-pixel basis and the characteristics of the HVS are not taken into consideration. Therefore, it first has to be checked whether PSNR is a suitable metric for the specific application scenario as it considers, e.g., high-frequency noise (hardly perceptible by the HVS) to be equally disturbing as blocking artifacts (perceived to be annoying by the HVS). PSNR is known to accurately express additive noise but performs poorly with certain types of artifacts where it is outperformed by vision-based metrics [6]. Despite these disadvantages, PSNR is still used in the majority of scientific publications as it is simple to calculate and its results can be easily interpreted due to the familiarity of the research community with this metric. Another reason is the lack of alternative standardized metrics that are applicable not only to specific but to a broad range of use cases. Finally, the minimization of MSE is mathematically straightforward and is therefore easy to integrate into optimization models/frameworks.
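The following sketch is a direct transcription of Equations 2.9 and 2.10 for 8-bit video, assuming sequences given as nested lists indexed as sequence[frame][row][column].

```python
# Direct transcription of Equations 2.9 and 2.10 for 8-bit samples (max(s) = 255).
import math

def mse(correct, error):
    total, count = 0, 0
    for frame_c, frame_e in zip(correct, error):
        for row_c, row_e in zip(frame_c, frame_e):
            for s_c, s_e in zip(row_c, row_e):
                total += (s_c - s_e) ** 2
                count += 1
    return total / count

def psnr(correct, error, max_value=255):
    m = mse(correct, error)
    return float("inf") if m == 0 else 10 * math.log10(max_value ** 2 / m)
```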

2.6.3. Structural similarity index metric

The Structural Similarity Index Metric (SSIM) calculates the distortion based on the structural deviations between the original and the impaired image/video [228]. It pays special attention to the fact that the HVS is highly sensitive to distortion concerning the structure/contours of scene objects. It considers the luminance L (the mean intensity), the contrast C (estimated by the intensities' standard deviations), and the structural elements S (the intensities' covariance). Let \mu_{s_1}, \mu_{s_2}, \sigma_{s_1}, \sigma_{s_2}, and \sigma_{s_1 s_2} be the means, the standard deviations and the covariance of the two signals s_1 and s_2 to be compared. The three main video quality indicators can then be calculated as given in Equation 2.11, 2.12, and 2.13 where the three constants c_1, c_2, and c_3 are relatively small and depend on the dynamic range of the pixel values.

L(s_1, s_2) = \frac{2\,\mu_{s_1}\,\mu_{s_2} + c_1}{\mu_{s_1}^2 + \mu_{s_2}^2 + c_1}   (2.11)

C(s_1, s_2) = \frac{2\,\sigma_{s_1}\,\sigma_{s_2} + c_2}{\sigma_{s_1}^2 + \sigma_{s_2}^2 + c_2}   (2.12)

S(s_1, s_2) = \frac{\sigma_{s_1 s_2} + c_3}{\sigma_{s_1}\,\sigma_{s_2} + c_3}   (2.13)

The final SSIM score is obtained as an exponentially weighted product of all three indicators. For reasons of simplicity and to improve the comparability of results, most publications that use SSIM as quality metric set the weights to 1 and c_3 to c_2/2. Consequently, the SSIM can be calculated as given in Equation 2.14 where results range from -1 (worst quality) to 1 (best quality).

SSIM(s_1, s_2) = \frac{(2\,\mu_{s_1}\,\mu_{s_2} + c_1)(2\,\sigma_{s_1 s_2} + c_2)}{(\mu_{s_1}^2 + \mu_{s_2}^2 + c_1)(\sigma_{s_1}^2 + \sigma_{s_2}^2 + c_2)}   (2.14)

Due to the limitations of the HVS's foveation feature (only a relatively small region of an image can be viewed at a high resolution) and because image statistical characteristics are in general highly spatially non-stationary, the SSIM has to be locally applied. One solution is to consider small squares and to move the window

across the entire image at a per-pixel basis. This might however lead to unwanted blocking artifacts as described in [227]. One way to circumvent this problem is to consider circular regions and to apply a Gaussian weighting function to adapt the local statistics accordingly.

2.6.4. Other approaches

Searching scientific databases for publications related to image and video quality metrics reveals a significant body of work that dates back to the 1970s. This can be roughly divided into techniques that try to closely model the characteristics of the HVS and techniques that rate content according to detected artifacts, which typically occur due to error concealment or low compression rates. The former are designed by analyzing data from psycho-visual experiments and focus on indicators such as color perception, contrast sensitivity, and pattern masking. Recent examples are the Sarnoff Just Noticeable Differences (JND) [129], the Visual Differences Predictor (VDP) [40], the Perceptual Distortion Metric (PDM) [235], and the moving picture quality metric (MPQM) [210]. The SSIM metric as described in Section 2.6.3 is an example of metrics that focus on the structure and the occurrence of specific artifacts in video sequences such as blocking, blurring, ringing, and jerkiness. For a study on the noticeability of different types of artifacts on the HVS, the reader is referred to [99]. As another representative, the patent-covered Video Quality Metric (VQM) of Pinson et al. [155] analyses videos by dividing them into spatio-temporal blocks and tracking both extent and orientation of their activity. The extracted characteristics are then compared using a masking-like procedure (a linear combination). The underlying model to be used (e.g. television, video conferencing, PSNR-like, general) has to be chosen according to the target application scenario.
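Returning to the window-based evaluation described in Section 2.6.3, the sketch below computes Equation 2.14 for a single window of luma samples; the choice of the constants via k_1 = 0.01 and k_2 = 0.03 and the dynamic range 255 is a common convention and an assumption here, not something prescribed above. The overall score is typically the (possibly Gaussian-weighted) mean over all window positions.

```python
# Sketch of Equation 2.14 on one window; s1 and s2 are equally long flat lists of
# luma samples taken from the reference and the impaired picture respectively.

def ssim_window(s1, s2, dynamic_range=255, k1=0.01, k2=0.03):
    c1, c2 = (k1 * dynamic_range) ** 2, (k2 * dynamic_range) ** 2
    n = len(s1)
    mu1, mu2 = sum(s1) / n, sum(s2) / n
    var1 = sum((x - mu1) ** 2 for x in s1) / (n - 1)
    var2 = sum((x - mu2) ** 2 for x in s2) / (n - 1)
    cov = sum((x - mu1) * (y - mu2) for x, y in zip(s1, s2)) / (n - 1)
    return ((2 * mu1 * mu2 + c1) * (2 * cov + c2)) / \
           ((mu1 ** 2 + mu2 ** 2 + c1) * (var1 + var2 + c2))
```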

Chapter 3. Transport

3.1. Background

Today, video content is transported over a variety of different networks with different characteristics regarding delay, fault-tolerance, and Quality of Service (QoS). While TV shows are still predominantly broadcasted over traditional channels such as terrestrial, satellite, and cable, more and more customers are starting to consume TV content over IP-based networks. IPTV is especially booming in the hospitality sector as it offers, besides live TV, also Video on Demand (VoD) and interactive TV (iTV). The wide availability of broadband internet access is the key factor that today allows video consumption and communication at reasonable quality levels over the internet. Besides IPTV, the emergence of video platforms like YouTube, Vimeo, and MyVideo, coupled with continuously falling prices of devices that are capable of recording high quality video, is mainly responsible for the recent excessive increase in multimedia traffic. According to the forecast of Cisco's Visual Networking Index [35], this trend is set to continue: Global internet video traffic surpassed global peer-to-peer (P2P) traffic in 2010, and by 2012 internet video will account for over 50 percent of consumer internet traffic. It would take over 5 years to watch the amount of video that will cross global IP networks every second in 2015.

Internet video is now 40 percent of consumer internet traffic, and will reach 62 percent by the end of 2015, not including the amount of video exchanged through P2P file sharing. Internet video to TV tripled in 2010. Video-on-demand traffic will triple by 2015. High-definition video-on-demand will surpass standard definition by the end of 2011.

When video content is about to be transported over a network, there are multiple things to consider. The target application's requirements with regard to tolerable delay, bandwidth consumption, fault tolerance, and degree of interactivity have to be carefully analyzed. Based on that, it has to be checked whether the network fulfills the minimum requirements and whether it additionally provides guarantees, such as guaranteed bandwidth or guaranteed maximum error rates that are typically found in dedicated networks, which can considerably facilitate video delivery. Once all relevant parameters are known, the streaming system has to be adapted accordingly to sustain and compensate for unsteady conditions, of which the most important are fluctuating video bit rates, fluctuating bandwidths, unsteady transmission delays, jitter, transmission impairments, and transmission losses.

This chapter summarizes the most important properties of existing transport techniques and protocols with respect to video streaming. More specifically, Section 3.2 discusses the two paradigms on-demand streaming of prestored content
and realtime streaming, and underlines their differences from the video transport perspective. Section 3.3 describes the fundamental modules that video streaming architectures consist of, and Section 3.4 discusses how encoder rate control can be beneficially combined with adaptive streaming systems as well as the basic idea behind rate distortion optimization. In Section 3.5, transport protocols that are suitable for video streaming are described and their pros and cons are summarized. Finally, Section 3.6 provides a rough overview of popular transport error control mechanisms in preparation for the content-aware extensions discussed and evaluated in Chapters 6, 7, and 8.

3.2. Timing constraints

In video delivery, one distinguishes between live and on-demand streaming, and both streaming modes have different constraints regarding timing aspects. As an example, occasional playback interruptions of several seconds during on-demand streaming are generally perceived to be less disturbing than a one second time-shift during a video conferencing session. When designing streaming applications, it is therefore important to ensure that certain delay thresholds will not be exceeded, which possibly comes at the cost of a decrease in video quality.

3.2.1. Video on demand streaming

VoD allows the user to choose the point in time for starting the playback of a selected content. This makes it attractive as, e.g., TV broadcast networks operate on a fixed schedule and do not offer this feature. The downside of this is that one stream is typically consumed by only one user (unicast) and video platforms must have considerable upload capabilities and ideally regional mirror sites to cover the excessive bandwidth consumption. To mitigate this problem, some providers (mostly of commercial networks) offer near-VoD. As opposed to true-VoD, users can only start a stream's playback at regular intervals (e.g., every 10 minutes). True-VoD is interactive as it allows navigation (fast forward, fast rewind, searching)
during playback. However, the execution delay of navigation commands should not exceed several seconds so as not to be perceived as annoying.

Today, web-based VoD is typically implemented in Flash, as a Java applet, or handled by a native browser plugin. Previously, download and play schemes were quite popular as they allow playback to be independent of the speed of the network connection. As their name already suggests, the video content is entirely downloaded before the playback starts, which, e.g., enables high definition movie trailers to also be viewed on computers connected to the internet via a dial-up link. The disadvantages are obvious: the user has to accept longer waiting times and more storage space is required at the client machine. Progressive download and play can be seen as an intermediate delivery mode. As opposed to conventional download and play, it allows the playback to start during downloading but requires the video content to be encoded in a streaming-suitable format. It however lacks the degree of interactivity that true-VoD has because stream navigation is hardly possible.

3.2.2. Realtime/Live video streaming

Realtime video streaming systems are fundamentally different from VoD systems when considering the timeliness criticality. Time-shifts due to buffer underflows, a situation one frequently faces with VoD, have a significant impact on users' Quality of Experience (QoE), especially when rebuffering events accumulate over time. With many applications, it can be observed that the strictness of timing constraints positively correlates with the degree of interactivity; here, interactivity does not denote the possibility to control the playback but the user's ability to directly interact with the content itself. As an example, when receiving a (non-interactive) IPTV live stream, a delay of a few seconds is the normal case for many set-top boxes (STBs). This delay is required to compensate for transmission fluctuations by receiver-side pre-buffering. An interactive realtime application scenario such as video conferencing does not tolerate end-to-end delay in the range of seconds. Large delays impede natural conversation as people usually pause when they anticipate that their conversational partner wants to speak. This either results in
unwanted cross-talking or in annoyingly long idle times to prevent the former. For illustration, Table 3.1 lists the impact of one-way delay on the user's QoE for two-way conference applications as specified in [88].

  Delay                    Impact on user QoE
  0-150 milliseconds       minor or unnoticeable impairment to normal conversation; acceptable to most users.
  150-400 milliseconds     possible to notice impairment; acceptable for use; however, some user applications may be impaired.
  over 400 milliseconds    not acceptable for general use; should be avoided in network planning; may be used in exceptional circumstances.

Table 3.1.: ITU-T G.114 based one-way delay recommendations for video conferencing.

An even more time-critical video streaming scenario is telemedicine, which allows a physician to remotely practice medicine. While the application of remotely controlled surgery robots over the internet is still limited to a very small number of medical facilities, teleradiology, telepathology, and telepsychology are commonly used today. These new technological achievements substantially support the cooperation of geographically separated physicians, provide new ways of monitoring and diagnosis, and have turned out to be beneficial, e.g., for physically disabled patients.

Another field that has recently gained popularity is streaming game technology, also referred to as cloud gaming. As opposed to traditional systems where the frames of video games are rendered locally, cloud gaming shifts this computationally complex task to dedicated servers. The frames are remotely rendered and compressed, which enables the latest games to be played at a high quality level and resolution even with old and cheap hardware or on thin clients. Streaming game service providers such as OnLive, Gaikai, Otoy, and InstantAction underline the significant cost savings in terms of local hardware in their advertisements.

They additionally offer secondary services, such as a visitor mode to monitor other players, social network integration, and automated publishing of game recordings, to attract customers. In realtime applications, the time between a user's input and its respective visual feedback is commonly referred to as lag. Similar to online multiplayer games and locally rendered games, the lag has to be kept small (ideally below 100 ms) so as not to impair the gaming experience [36, 37]. The vision of streaming game technologies is to reduce the lag to the point where the player is no longer able to distinguish whether the game is running locally or whether it is rendered in a remote data center. This, coupled with latency considerations of transmission channels (typically optical fiber), requires data centers to be strategically selected within a certain radius of the customer to ensure that timing constraints are satisfied.

3.3. Building blocks of a streaming system

Figure 3.1.: Fundamental building blocks of a video streaming architecture (sender: video encoder, buffer, and transport protocol; receiver: transport protocol, buffer, and video decoder; connected through the network and complemented by supplemental payload information, connection/buffer states, coder/buffer feedback, transport feedback, and error signalling paths).

By ignoring user interfaces, capturing devices, security considerations, and in-network processing, a realtime video streaming system can be sketched from an end-to-end perspective as depicted in Figure 3.1. Video content that is continuously produced, e.g., by a video camera or a rendering engine, is encoded by the
video encoder module. The encoded content is then (optionally) buffered prior to transmission. The sender module splits the data stream into segments to ensure that network packets, which consist of the video payload plus optional control and meta information and transport protocol headers, do not exceed the connection's MTU. The segmentation is either done in a content-unaware manner or by respecting media unit boundaries such as NAL units (see Section 2.4.1). Based on a set of information such as the connection's state, the maximum currently allowed transmission rate, the media units' timestamps, and the relative importance of the payload, the sender determines whether and when to transmit which network packet. During transit, the packet may be delayed, e.g., due to queuing, damaged, e.g., due to channel interference, or lost because of congestion. At the client side, the receiver module receives the packets, extracts the payload, and puts it in the decoder buffer. Both encoder and decoder buffers are necessary to compensate for the variable bit rates produced by the encoder, the variable transmission rates, and the delay variations in order to keep the application-level end-to-end delay constant. The initial playout delay ultimately determines the offset between the time when a frame is put into the decoder buffer and when it gets displayed. It has to be carefully chosen depending on the network conditions to be expected and the encoding settings. An excessively long playout delay would unnecessarily introduce additional delay into the system, whereas a short playout delay might lead to packets being treated as lost because they are useless when they arrive after the respective decoding deadline [192]. The video decoder module selects required media units from the decoder buffer and decodes the current frame, which is subsequently displayed, put in a playout buffer, or forwarded to another encoder in case of transcoding.

Besides this primary data flow, there can be multiple secondary information paths. As already indicated, the send schedule may be derived in a more informed manner when the sender module is provided with supplemental information about the payload that cannot be (easily) deduced from the data itself. Moreover, the sender module may inform the encoder about the connection's current state and the send-buffer fill-level to enable proactive measures that reduce the risk of packet
loss, e.g., by applying adaptive rate control. Feedback can also be provided by the client at the transport level (e.g., RTT, loss rates, loss vectors) and at the application level (buffer feedback, decoder feedback). At the client side, it may be necessary that the receiver explicitly signals information about corrupted or lost data segments to the video decoder to ensure proper decoding of correctly received media units.

3.4. Rate control

Figure 3.2.: Different macroblock sizes of an H.264/MPEG-4 AVC P frame.

Macroblocks of a frame can have strongly varying sizes due to the dissimilar richness in detail in different picture regions, the use of different quantization parameters, and the varying efficiency of temporal prediction mechanisms. As an example, Figure 3.2 shows the macroblock sizes of a frame from the sequence kitchen where a higher number of bits per macroblock is indicated by brighter blocks. Larger macroblocks can be found close to the contours of the actress whereas a more efficiently compressed picture region is near the kitchen rack. In this example, the unequal macroblock sizes are mainly caused by the actress's high amount of movement and the fact that a high percentage of the area that depicts her arm consists of intra coded macroblocks. Macroblocks of unequal size cause
slices to contain a varying number of macroblocks if the packetization scheme aims to produce NAL units of equal size. This may finally lead to frames that have highly varying sizes, i.e., the video bitrate will not be constant. Bandwidth limitations are in general not a problem for on-demand streaming applications as occasional decoder buffer underflows are tolerated. In live streaming scenarios, this however can cause severe QoE degradations when the buffers are not dimensioned accordingly to compensate for bitrate fluctuations. As already indicated in Section 3.3, buffer overprovisioning is not an acceptable solution as it introduces additional delay and conflicts with the constraints of realtime applications. Modern video encoders therefore support rate control to compress frames at a constant, or at least at a constrained, bit rate. During compression, the bit rate is measured and the encoding parameters are continuously adjusted so that the difference between the output bit rate and the target bit rate is kept small. In almost all adaptation steps, the quantization parameter is involved, as its increase generally leads to smaller macroblocks and vice versa (see Section 2.2.6). Changing the quantization parameter is a non-trivial task because too frequent or too drastic changes are perceived as annoying by the HVS. Consequently, further possibilities to adjust the rate, such as selecting a different prediction mode or reducing the partitioning resolution, have to be considered. Besides that, it must be possible to estimate the macroblocks' bit size after compression in order to select appropriate encoding parameters. A common approach is to estimate the macroblocks' activities by calculating the Mean Absolute Difference (MAD) between the current and the temporally collocated block and by using an appropriate rate distortion model to compute suitable quantization parameters. Regarding rate allocation, one way to implement rate control is to allocate a certain bit budget for a set of frames such as a group of pictures (GOP). This budget is then assigned to the different slice types according to user-definable weights, a technique that can be further extended down to the level of macroblocks.

The selection of the optimal coding mode in terms of compression efficiency and video quality, also referred to as rate distortion optimization (RDO), is complementary to rate control approaches. In [123], their interplay is described as a
chicken-and-egg problem because the MADs of candidate modes are needed by rate control algorithms. They are calculated after the RDO step, which in turn requires the quantization parameter as input. As a consequence, the efficiency of rate control algorithms depends on the availability of effective estimators for block MADs.

Figure 3.3.: Rate distortion optimization coding mode selection (coding mode candidates in the rate [bit] versus distortion [PSNR] plane, together with target curves for large and small λ).

For illustration, Figure 3.3 depicts the locations of several possible coding modes for a macroblock in the two-dimensional rate distortion space. Here, distortion is, for the sake of simplicity, the inverse of picture quality. Rate distortion algorithms try to select coding modes that are close to a target curve, given by the encoder settings. The main goal of a video encoder is to jointly minimize the costs of both dimensions, i.e., to produce a video stream with a minimal distortion (a maximum quality) by using a minimum number of bits. In the context of RDO, a very popular method to jointly minimize both dimensions is to instead minimize a combined cost function J = D + λR where λ determines the impact of the rate on the overall cost. When a very small value is assigned to λ, the encoder always chooses a macroblock coding mode that maximizes the picture quality unless the number of required bits becomes excessive. In contrast to that, a large λ value causes the encoder to preferably select a coding mode that leads to a high compression, probably at the cost of a noticeable decrease in quality. Obviously, the calculation of the cost J for every possible macroblock coding mode is computationally expensive. Efficient implementations therefore skip those modes that led to bad results during previous compression steps where similar content was involved. By considering computational cost as a third dimension, the minimization problem can be extended to the three-dimensional space. This makes it even harder to solve and has been attracting researchers for the last decade, which has led to numerous publications in that field (e.g., [5, 61, 114]).
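As a minimal illustration of this Lagrangian selection, the following Python sketch picks, for each macroblock, the candidate mode with the smallest combined cost J = D + λR. The candidate list with its distortion and rate figures is purely hypothetical and only serves to show how the choice shifts with λ.

```python
def select_coding_mode(candidates, lam):
    """Return the coding mode with minimal Lagrangian cost J = D + lambda * R.

    candidates: iterable of (mode_name, distortion, rate_bits) tuples,
    e.g. obtained by trial-encoding one macroblock in several modes.
    """
    best_mode, best_cost = None, float("inf")
    for mode, distortion, rate in candidates:
        cost = distortion + lam * rate        # combined cost J
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode

# Hypothetical candidates for one macroblock: (mode, SSD distortion, bits).
modes = [("SKIP", 950.0, 1), ("INTER_16x16", 400.0, 96), ("INTRA_4x4", 120.0, 310)]
print(select_coding_mode(modes, lam=0.5))   # small lambda favors picture quality
print(select_coding_mode(modes, lam=10.0))  # large lambda favors high compression
```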

3.5. Protocols

This section briefly discusses the most important transport and application layer protocols used for video streaming today. With regard to application layer protocols, the discussion is by no means exhaustive as the number of open as well as proprietary protocols is large, and therefore only the RFC-standardized Real-Time Transport Protocol (RTP) and the formerly proprietary Microsoft Media Server Protocol (MMS) and Real-Time Messaging Protocol (RTMP) are considered.

TCP

Today, video content is mainly transported using the Transmission Control Protocol (TCP) [158] as it provides reliability, which eases application development. On top of TCP, Hypertext Transfer Protocol (HTTP) streaming has gained considerable popularity [191], which can be attributed to the ubiquity of web browsers and the success of VoD platforms. HTTP streaming still has to cope with some problems like plugin availability and compatibility, video format issues, insufficient bandwidth, and firewalls. However, there are currently strong ongoing efforts towards standardizing aspects such as the media presentation description, the resource locators, and
adaptive bitrate streaming (e.g., Dynamic Adaptive Streaming over HTTP (DASH) [86]) to improve interoperability.

With regard to live video streaming, TCP is only of limited benefit. Probably the biggest problem is that reliability is a non-optional feature that may lead to severe playback interruptions and stalls when video is sent over lossy networks. The sender does not get the chance to decide whether the retransmission of a lost packet is reasonable or not (e.g., whether the retransmitted packet will arrive before the respective playback deadline or not). More specifically, TCP imposes its own flow control and windowing schemes, which enforce in-order delivery and prohibit the adoption of tailor-made adaptive streaming solutions for specific application scenarios. Besides that, TCP does not provide any support for multi- and broadcasting, requires a connection setup which imposes some additional delay, and it does not efficiently utilize the available bandwidth when the RTT is high [18].

When certain application requirements prohibit the use of alternative transport protocols, e.g., because minimizing the probability of getting blocked by firewalls is essential for market success/dominance such as in the case of Skype, some measures can be taken to lower TCP's end-to-end delay. First of all, Nagle's algorithm should be disabled because it causes additional delay at the sender [146] when small media units are transmitted. Furthermore, the sending application should switch the sockets to non-blocking mode to avoid the TCP connection being limited by its flow control mechanism. Besides that, the use of TCP selective acknowledgement (SACK) can increase the loss recovery efficiency [27]. As shown in [18], byte counting and congestion window validation during application-limited periods (periods during which the video bit rate is lower than the maximum allowed transmission rate imposed by the TCP congestion window) should be disabled. Finally, it was proposed to use parallel connections when possible as this may significantly decrease the transmission delay.
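Two of these sender-side measures, disabling Nagle's algorithm and switching the socket to non-blocking mode, can be sketched with standard socket options as follows. This is a minimal Python sketch; the endpoint address is a placeholder.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("streaming.example.org", 5000))   # placeholder endpoint

# Disable Nagle's algorithm so that small media units are sent immediately
# instead of being held back to coalesce with later data.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Non-blocking mode: send() returns immediately when the socket buffer is
# full instead of stalling the streaming application.
sock.setblocking(False)

try:
    sock.send(b"...media unit bytes...")
except BlockingIOError:
    pass  # buffer full: the application may reschedule or discard the unit
```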

UDP

Due to the aforementioned reasons, the UDP protocol [157] is generally more suitable for live video streaming. It provides an unreliable transport of datagrams and allows (partial) reliability to be implemented at the application level. Compared to TCP, it has a lower header overhead and there is no transport level buffering caused by windowing mechanisms, which eliminates any delay at this stage and reduces the streams' end-to-end delay. UDP can also be used for unidirectional networks (e.g., satellite broadcasts) as no back channel is required. It is therefore highly suitable for (IP) multicast. The downside of this transport protocol's lightness is that rate control also has to be implemented at the application layer, which is an additional burden for application developers. Moreover, UDP traffic is generally handled more conservatively by firewalls, i.e., the chance of getting blocked is higher, and streaming protocols have to be adapted accordingly when the traversal of Network Address Translation (NAT) boxes should be possible. In summary, when the streaming framework supports (content-aware) error control and when the minimization of the end-to-end delay is crucial for providing the expected QoE, UDP is a good choice for video delivery.

DCCP

One of the primary goals of the Datagram Congestion Control Protocol (DCCP) [106] is to let the application layer control the tradeoff between delay and reliable in-order delivery. It is therefore especially intended for streaming delay sensitive video and audio data as no additional delay caused by mandatory retransmissions can occur. Regarding functionality, DCCP can be placed somewhere between TCP and UDP as it, on the one hand, does not provide reliable packet transport, but on the other hand, realizes congestion and flow control. Shifting congestion control into the transport layer makes DCCP attractive for UDP-based applications as implementing an (ideally TCP friendly, i.e., using no more bandwidth in steady state than a conforming TCP flow under comparable conditions [17]) congestion control mechanism at the application layer is complicated.
DCCP provides several congestion control mechanisms such as TCP-like rate control [52] and TCP Friendly Rate Control (TFRC) [55]. It supports explicit congestion notification (ECN) [163] and ECN nonces [190], and provides reliable connection setup/teardown and option negotiation. Besides that, it allows feedback to be encoded as acknowledgement (ACK) vectors (cumulative ACKing), which significantly increases the entropy of receiver feedback regarding lost packets. Compared to TCP and UDP, DCCP is a rather new protocol, and consequently it is rarely supported by older operating systems of connection endpoints and in the network itself, e.g., at NAT boxes. To temporarily circumvent the latter issue, a DCCP over UDP encapsulation approach has been proposed [154]. At the time of writing, it is still unclear whether DCCP will become widely available and whether it will be regarded as an attractive alternative to UDP not only by researchers but also by application developers. It has been part of the Linux kernel since version 2.6.14 (2005) and its socket model is very similar to that of TCP to facilitate protocol switching. It has also been fully supported by iptables for several years. However, there is still no support for Apple and Windows platforms except for a user space port of Tom Phelan's DCCP implementation. The author believes that the lack of support in closed source operating systems is the major reason why DCCP is still relatively unknown in the streaming community, and that its usage will significantly increase once this obstacle has been removed due to its compelling advantages. In fact, there are some well-known video streaming applications such as VLC and GStreamer, which already support it.

RTP

RTP [181] is an application layer protocol that runs on top of UDP. It can also be used in combination with TCP or with other transport technologies such as Asynchronous Transfer Mode (ATM), but these constellations are not widespread. It was designed to provide functionality to realtime multimedia applications and to improve their interoperability by defining a universal header (see Table 3.2). As RTP is intended to be primarily used by time-critical multimedia applications, packet loss is usually preferred over late delivery, which explains why it is mostly implemented over UDP rather than TCP.

  Byte 0              Byte 1          Byte 2 and Byte 3
  V | P | X | CC      M | PT          sequence number
  timestamp
  synchronization source identifier
  optional contributing source identifiers ...
  optional header extensions ...

  V: 2 bit version number, P: padding bit, X: extension bit, M: application specific marker bit, PT: 7 bit payload type

Table 3.2.: The header of the RTP protocol.

RTP assigns sequence numbers to single network packets, allows for multicasting to feed several destinations simultaneously, and uses timestamps to synchronize multiple video and audio streams. The payload type indicates whether the packet is carrying parts of an audio or of a video stream and the codec, sampling rate, etc. that was used for compression. Different payload types are mapped to different RTP packets, which, e.g., allows bandwidth-heterogeneous receivers to decide whether to only receive the audio data of a videoconferencing session or whether to receive both audio and video. In RTP terminology, the sender of a multimedia stream is referred to as synchronization source. Along the delivery path, the data packets may be changed due to transcoding, mixing, and synchronization by a mixer. The mixer appends the corresponding contributing source identifiers to the RTP header. In contrast to mixers, translators do not change data packets but forward them and aid in firewall and NAT traversal.
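As a small illustration of the header layout in Table 3.2, the following Python sketch unpacks the fixed 12-byte part of an RTP header and the optional contributing source identifiers from a received datagram; it assumes the datagram is at least 12 bytes long and ignores header extensions.

```python
import struct

def parse_rtp_header(packet: bytes):
    """Parse the fixed 12-byte RTP header laid out in Table 3.2."""
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    header = {
        "version": b0 >> 6,            # V: 2 bit version number
        "padding": (b0 >> 5) & 0x1,    # P: padding bit
        "extension": (b0 >> 4) & 0x1,  # X: extension bit
        "csrc_count": b0 & 0x0F,       # CC: number of contributing sources
        "marker": b1 >> 7,             # M: application specific marker bit
        "payload_type": b1 & 0x7F,     # PT: 7 bit payload type
        "sequence_number": seq,
        "timestamp": ts,
        "ssrc": ssrc,                  # synchronization source identifier
    }
    # Optional contributing source identifiers follow the fixed header.
    csrc_end = 12 + 4 * header["csrc_count"]
    csrcs = struct.unpack(f"!{header['csrc_count']}I", packet[12:csrc_end])
    return header, list(csrcs)
```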

The disadvantages of RTP are that it may not be supported by all networking devices and that no header fields were defined to assign content-based priorities in order to set up different service classes or to realize priority queuing. Several payload formats have been defined for different content types, of which the most important are MPEG-1/2 [71], MPEG-4 ASP [102], MPEG-4 AAC [211], and H.264/MPEG-4 AVC [232]. Moreover, RTP data transport over DCCP connections has recently been standardized [152] and major questions regarding RTP packet framing and Real-Time Streaming Protocol (RTSP) signaling have been settled.

While RTP exclusively transports data packets, the Realtime Transport Control Protocol (RTCP) is used as a side channel for control signaling and status reporting. Each RTP connection is typically associated with an RTCP connection that uses the RTP connection's port number incremented by one. RTCP provides receivers with timing information, which allows them to synchronize their clocks. This is essential when (mixed) content of multiple senders is displayed or rendered simultaneously. The receivers of RTP streams frequently report the reception quality back to the sender in terms of packet loss, jitter in packet arrival time, and the time when the last packet was received. This information is vital for the sender and intermediate nodes to perform rate control, adaptive error protection, and adaptive media encoding or transcoding. When IP multicast is used, receiver reports are not only received by the sending parties but by all session participants, which allows third-party monitors to diagnose network problems and to determine whether they have local or global causes. The periodicity at which receiving nodes send reports depends on the bandwidth demanded by the RTP traffic and the number of session participants. The latter can be determined by processing the periodically incoming reports of other receiving nodes. To guarantee scalability, the standard recommends that the RTCP fraction of the RTP traffic should be around 5%. Source nodes also send RTCP reports to enable new receivers to determine the appropriate RTP streams. Such reports contain timing information (RTP and
Network Time Protocol (NTP) timestamps), report blocks similar to those of receiver reports, and miscellaneous sender information. They contain the canonical names of senders because synchronization source (SSRC) identifiers may change over time due to senders leaving and identifier conflicts. With regard to the sender-receiver report ratio, the standard further recommends that the share of sender reports should be about 25%. This ensures that joining receivers do not have to wait an unacceptable amount of time until they receive the canonical names of the desired sending nodes.

MMS and RTMP

As already mentioned in the introductory part of this section, there exists a large number of application layer protocols, the majority of which is product specific and is only used by the respective streaming server. Probably the most widely used protocols are Microsoft's MMS and Adobe's RTMP. MMS was initially designed as a proprietary protocol but its specification was opened in 2008 [144] as parts of its design were reverse engineered and subsequently released. It works on top of UDP and TCP (using an unprivileged port or over HTTP). The appropriate transport protocol is negotiated during the so-called protocol rollover phase where UDP is usually preferred over TCP connections and TCP acts as a fallback option in case UDP gets blocked by firewalls or NAT boxes. Furthermore, the client requests the appropriate video and audio streams according to the estimated connection bandwidth. During streaming, the streams' bitrates can be adaptively adjusted by skipping non-key frames.

Similar to MMS, RTMP was also a proprietary protocol until 2009, although to date the publicly available specification [1] is still not complete and omits details that are vital for creating compliant implementations. RTMP is primarily known for its use with the browser plugin Adobe Flash Player and runs over multiple TCP connections. To increase firewall acceptance, the streams can be tunneled through HTTP or HTTP Secure (HTTPS). During transmission, media units get chunked
where the chunk size is a tradeoff between streaming agility and CPU usage. The protocol additionally provides support for dynamic bitrate adaptation (i.e., switching to another encoded version of the same content without changing the timeline) and is the common denominator of Adobe's Flash Media Server family. Besides that, it is supported by several other video-related software products that support streaming such as RealNetworks' Helix server and ffmpeg [148].

3.6. Error control

All content-aware prioritization algorithms presented in this thesis work on top of unreliable transport protocols as only they provide the flexibility needed to make streaming systems truly adaptive. In the following, the functionality of the two most important application level error control mechanisms, FEC techniques and feedback-adaptive mechanisms, is summarized and their pros and cons are discussed. Further possible application scenarios encompass, amongst others, prioritized packet drop schemes [63, 74, 171], the exploitation of QoS capabilities of wireless networks [49], and the assignment of packets to different service levels according to their impact on video quality [138, 199]. The author tries to answer the frequently posed question of which approach is generally more beneficial and discusses some hybrid techniques that aim to combine the advantages of both categories.

3.6.1. Feedback based error control

With feedback based error control techniques, a feedback channel is used to signal positive or negative acknowledgements of packet receptions. ACKs may be sent upon the arrival of each packet, upon packet timeouts, or at regular intervals. Feedback can be designed to acknowledge a single packet, a range of packets, or
a certain set of packets by containing a single sequence number, an ACK vector, or some custom format. There are several ways in which the sender may react to incoming feedback. In stop-and-wait ARQ, the sender only sends one packet at a time and sends the next packet only if the previous one was positively acknowledged. When a negative ACK comes in, the sender sends the current packet again. The stop-and-wait ARQ scheme is used in some protocols, such as in the MAC layer of the IEEE 802.11 wireless standard [77] or in the Trivial File Transfer Protocol (TFTP) [188], due to its ease of implementation. However, its major drawback is the relatively long idle times of senders while waiting for pending ACKs. In contrast to that, go-back-N schemes do not have such idle times as the transmitter continuously sends packets without waiting for the respective ACKs. When a negative ACK arrives, the sender goes back to the respective packet and continues the transmission process from that sequence number on, even if that means that some previously correctly received packets will nevertheless be retransmitted. Selective ARQ mechanisms do not retransmit all packets with a sequence number greater than that of the lost packet. Instead, they only retransmit those packets that are known to be lost, e.g., due to negative ACKs, duplicate positive ACKs, or timeouts.

Feedback based error control techniques are characterized by a relatively low transmission overhead (with selective ARQ) and are computationally cheap compared to FEC solutions. They however introduce a significant amount of delay, especially when packets get (repeatedly) lost over delay-intensive connections, because such mechanisms have to wait at least one RTT until retransmissions can be triggered. Additionally, a feedback channel is required, which renders ARQ techniques useless for unidirectional and highly asymmetric delivery networks. ARQ is applicable to multicast scenarios only to a certain degree because one sender has to handle the feedback of multiple receivers. Therefore, limitations have to be applied to the receivers' feedback frequency to protect the sender and the network from overload, i.e., to prevent ACK implosions.
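To illustrate how selective ARQ can be combined with the timing constraints discussed in Section 3.2, the following Python sketch retransmits only packets that were explicitly reported lost, and only if the retransmission can still arrive before the packet's decoding deadline; all class and parameter names are hypothetical.

```python
import time

class SelectiveArqSender:
    """Retransmit only NACKed packets that can still meet their deadline."""

    def __init__(self, rtt_estimate):
        self.rtt = rtt_estimate        # smoothed RTT estimate in seconds
        self.sent = {}                 # sequence number -> (payload, deadline)

    def transmit(self, seq, payload, decoding_deadline, send):
        self.sent[seq] = (payload, decoding_deadline)
        send(seq, payload)

    def on_ack(self, seq):
        self.sent.pop(seq, None)       # positively acknowledged: forget it

    def on_nack(self, seq, send):
        entry = self.sent.get(seq)
        if entry is None:
            return
        payload, deadline = entry
        # Retransmit only if roughly half an RTT (sender to receiver) still
        # fits before the decoding deadline; otherwise drop the packet.
        if time.monotonic() + self.rtt / 2 < deadline:
            send(seq, payload)
        else:
            self.sent.pop(seq, None)
```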

3.6.2. Forward error correction

In transport protocols, backward error correction (BEC) is used to detect errors at the receiver side (e.g., Cyclic Redundancy Check (CRC) [153]). Errors can only be detected but not corrected by BEC, and therefore, guaranteeing reliable data transport requires impaired or lost packets to be retransmitted. In contrast to that, FEC codes are able to detect and correct a certain number of bits at the cost of an increased redundancy overhead. They can be categorized into block codes (e.g., Hamming codes [67], Reed-Solomon (RS) codes [166], Low-Density Parity-Check (LDPC) codes [57]) and convolutional codes (e.g., Viterbi codes [216], Turbo codes [15]). Block codes operate on fixed-size blocks or packets of symbols and convolutional codes work on bit streams, analogous to block and stream ciphers. Today, FEC codes can typically be found at two layers of the IP stack: on the link layer (e.g., Turbo codes in the Universal Mobile Telecommunications System (UMTS) [209], LDPC codes in WiMAX [79] and Wireless LAN IEEE 802.11n [78]) and on the application layer (Raptor codes in Digital Video Broadcasting Handheld (DVB-H) [47, 62, 131]).

In the context of this thesis, the application of FEC codes is exclusively considered at the application layer. This implies that only erasure protection capabilities are of concern because most transport (except for DCCP and UDP-Lite [111]) and link layer protocols enforce some form of message integrity checking. An erasure protection code transforms a message of k plain symbols into a message of n encoded symbols with k < n. When such a message is sent over a binary erasure channel, some of the symbols (up to n - k packets) will be erased, and only a subset will arrive at the receiver without errors. Packets that are not lost during transit but that arrive at the receiver and are affected by bit errors are discarded by some layer below the application layer and can therefore not be exploited by application layer FEC mechanisms. The fraction k/n is called the code rate, and k/k', where k' denotes the number of encoded symbols that have to be received for successful decoding, is referred to as reception efficiency. Fixed rate codes such as RS codes fix the code rate prior to the transmission of the first symbol whereas rateless codes (e.g., LT codes [130], Raptor codes [187]), also referred to as fountain codes or expandable codes, have the potential to produce an infinite number of
encoded symbols. This however comes at the cost of a slightly decreased coding performance: for example, RS codes are optimal (maximum distance separable (MDS)) codes, which means that the original message can be reproduced from any set of k encoded symbols, whereas currently existing instances of rateless codes are only near-optimal, i.e., at least (1 + ε)k encoded symbols are needed for decoding. Decoding optimal codes is computationally intensive, especially for large values of n, whereas simple decoding algorithms exist for near-optimal codes [133].

From a live video streaming perspective, application layer FEC is an attractive alternative to feedback based error control schemes because it introduces no additional delay. It is highly suitable for multicast and broadcast scenarios as no return channel is required. In contrast to that, FEC codes have a higher overhead and their code rates have to be appropriately selected, which can be a challenging problem in heterogeneous environments with unsteady channel conditions. Especially in mobile networks, where transmission errors can be caused by single bit errors, burst errors, or intermittent connection losses, FEC schemes should be made adaptive to the current channel conditions to avoid redundancy under- as well as over-provisioning [96].

3.6.3. Hybrid approaches

Summarizing the previous two sections, the advantages and disadvantages of feedback based error control and FEC mechanisms are quite complementary. FEC approaches preserve flexibility while reducing end-to-end delay at the cost of an increased transmission overhead, whereas feedback based techniques minimize the transmission overhead but have unbounded delay. Hybrid approaches (sometimes referred to as hybrid ARQ (HARQ) [125]) try to combine both paradigms to maximize the error control effectiveness in bandwidth and delay constrained environments. One strategy is to primarily apply FEC codes and only use ARQ as a backup or limited feedback mechanism. As an example, FEC packets may be sent at every transmission opportunity until the source block's deadline is reached, all FEC symbols have been sent, or the receiver signals that enough symbols were correctly
received to enable successful decoding [135]. In a similar variant, the receiver does not provide positive but negative feedback to the sender to request more parity packets because the number of packets received so far is too small. Another possibility is to design a transmission policy and to adaptively decide whether to send a new, original packet or a FEC packet, or to retransmit some previous packets in order to minimize the overall delay [142]. In multicast environments, each receiver may signal information about outstanding or lost packets to the sender, and based on that, the sender can create FEC packets in such a way that the cumulative transmission cost is minimized. As a strongly simplified example, receiver 1 may be missing packet A whereas receiver 2 misses packet B. By sending a single packet A xor B instead of A and B separately, the sender can save 50% of the bandwidth. A comprehensive discussion about various strategies to combine FEC and ARQ mechanisms at the same or at different layers in the context of reliable multicast can be found in [149].

The selection of the most appropriate error control mechanism ultimately depends on the tradeoff between reliability, timeliness, and transmission overhead. It is therefore not feasible to give a general statement of which mechanism works best as this is highly situation dependent with respect to application requirements and channel conditions. Some related work exists that proposes a mechanism that decides, depending on the network conditions, which mechanism (ARQ, FEC, or HARQ) is currently suited best [145]. Other publications focus on a certain problem area and propose specific improvements such as in [189] where the authors discuss the latency issue over last-hop wireless networks. They develop a framework that incorporates reliability and ARQ-induced delay by applying delay-constrained, packet-embedded error control at the link layer.
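The simplified XOR repair described above can be sketched in a few lines of Python; the packet contents are hypothetical and both packets are assumed to have been padded to equal length.

```python
def xor_packets(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized packets."""
    return bytes(x ^ y for x, y in zip(a, b))

packet_a = b"payload of packet A "
packet_b = b"payload of packet B "

repair = xor_packets(packet_a, packet_b)    # single repair packet A xor B

# Receiver 1 holds B and is missing A; receiver 2 holds A and is missing B.
recovered_a = xor_packets(repair, packet_b)
recovered_b = xor_packets(repair, packet_a)
assert recovered_a == packet_a and recovered_b == packet_b
```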

Part II. Original Contribution


Chapter 4. Related work: Video distortion estimation

4.1. Background

When video content is about to be transferred over packet-switched networks, it first has to be packetized (see Section 2.4.1). Such packets are generally of unequal importance with regard to the impact on video quality in case they are not available to the decoder. There is a set of explanations for this fact:

- The media units that are encapsulated by packets may temporally and spatially depend on each other (see Section and Section 2.2.5). The unavailability of media units that are not referenced by other units will only distort the respective frame or slice, whereas the loss of highly referenced media units will certainly lead to a distortion of a large number of regions in multiple frames, a phenomenon that is referred to as distortion propagation.

- The location of a packet/slice within the respective frame does matter (see Figure 2.17) because a video stream usually has one to several regions that catch the eye of the observer (i.e., that cause a succession of eye fixations), also referred to as ROIs [29, 72, 94]. These regions may change over time, but they are usually somewhere near the picture's center. The distortion of
slices that cover regions close to frames' boundaries is therefore generally perceived to be less disturbing than the distortion of regions close to the center.

- As explained in Section 2.2, smooth picture areas can be more efficiently compressed than structurally complex areas. Consequently, the respective macroblocks will be smaller in terms of bit size and more of these macroblocks will fit into one network packet. Therefore, the loss of such a packet implies that a larger picture region will be affected. Nevertheless, due to the characteristics of error concealment techniques (see Section 2.5), the overall distortion to be expected is in general smaller than when a structurally complex packet is not available to the decoder.

- With scalable coded video content, media units of enhancement layers are considered to be less important than those of the base layer (cf. [64, 68, 147]). This is plausible because the availability of the base layer is a precondition to successfully decode the upper layers [182].

The unequal importance of media units can be exploited by transport mechanisms in various ways. Obviously, when the channel conditions are good enough to allow the error-free reception of a media segment before its deadline, content diversification and unequal transport handling do not yield any benefit for video streaming applications. However, when the packet error rate exceeds a certain threshold and transport mechanisms are no longer able to repair or retransmit the affected data in time, the video quality degradation can be significantly mitigated by ensuring that the most important parts of the media stream are available to the decoder. Because of this, deploying unequal error control techniques in VoD streaming solutions does not make much sense as periods of poor channel conditions can be bridged by sufficiently large decoder buffers and occasional playback interruptions due to re-buffering. As already explained in Section 3.2.2, realtime video streaming applications can only have decoder buffers of limited size and must avoid playback interruptions for reasons of synchronicity. Therefore, they
can largely benefit from unequal error control techniques. It can roughly be said that the tighter the streaming system's timing constraints are, the higher the value of such techniques becomes.

The unequal importance of media units is quantified by distortion estimation schemes, sometimes also referred to as video prioritization schemes. The former term usually denotes techniques that estimate the distortion that is caused by the loss of one or several media units in terms of an objective video quality metric such as PSNR. The latter term however encompasses a broader range of techniques and is also used for intuitive schemes such as content diversification based on the frame type. In the context of this work, the author uses both terms interchangeably because schemes from both classes can be used to build unequal error control mechanisms. One property that does matter, however, is the realtime capability of such schemes, as video content is usually produced and encoded on the fly and heavy pre-computations are therefore not possible. The computational cost of distortion estimation schemes is mainly dominated by three aspects: the depth at which the video stream is analyzed, the length of dependency chains that are tracked, and the scope of estimates (whether derived at the frame, slice, or macroblock level) [45]. In general, the finer the granularity of the evaluation, the more computationally demanding it is. By taking the computational complexity as a categorization criterion, existing approaches can be divided into simple/intuitive lightweight schemes and into advanced but computationally intensive schemes.

4.2. Lightweight schemes

Probably the simplest form of video data classification is the sole distinction between key and non-key frames. This strategy is applied in numerous publications as it is relatively easy to implement for researchers who are mainly interested in the networking aspect of video streaming. All authors share the opinion that the loss of a key frame affects a larger number of successive frames within the current GOP than the loss of a non-key frame. Therefore, key frames should in general
be favored over non-key frames. Gürses et al. [63] propose a system to improve the playback continuity of video streams over TCP by adaptively discarding low priority (non-key) frames from the sender buffer. In such a way, they ensure that high priority (key) frames are delivered to the client on time. In [25], a priority based caching algorithm is developed for relay nodes in video delivery networks that, in accordance with timing constraints, preferentially drops non-key rather than key frames when old packets have to be removed from the caches. Díaz et al. [45] unequally protect RTP video streams from loss at a frame level by considering the bitrate constraints, the channel state, and the frames' relevance. An optimization algorithm is used to decide the degree of protection. It works with so-called decision frame sets (DFSs), subsets of the entire video sequence, which is a compromise between accuracy and computational cost. Two types of DFSs are used: I-DFSs for key frames and PB-DFSs for non-key frames.

By also distinguishing between the predictively coded frame types P and B, video sequences can be partitioned into three importance classes. P frames are considered to have a more significant impact on video quality in case of loss than B frames because in most cases the latter (transitively) depend on the former. In contrast to that, P frames only depend on other P frames or on I frames. As an example from literature, Feamster et al. [48] distinguish between intra coded, inter coded, and bidirectionally coded frames and assign priorities to frames in decreasing order as just listed to perform selective retransmissions. Zhang et al. [245] apply fixed erasure coding to protect video streams and consider the frame type, the frame's sequence number within the GOP, and the per-frame bitrate to select an appropriate amount of redundancy. Similar to that, Talari et al. [201] also strive to optimally protect video streams by inspecting the frame type but propose to use rateless codes instead of fixed rate codes to prevent redundancy overprovisioning. In [98], the frame type is used as a chunk scheduling criterion in a mesh-based peer-to-peer video streaming system. The authors consider one GOP as the basic request unit and divide it into three sets according to the respective frame type. Based on this classification and the progress of playback, the node requests outstanding chunks from neighboring peers. A semi-reliable streaming
framework on top of DCCP is proposed by Yuan-Cheng et al. [242] that aims to minimize the playout buffer delay by scheduling retransmissions based on the playback buffer, the RTT, the currently allowed transmission rate, and the frame type. Korhonen et al. [108] present a flexible forward error correction scheme that enables partial recovery of video data for the case that the packet loss rate exceeds a certain threshold. They extend short block codes to provide unequal error protection capabilities based on frame types.

The inequality I > P > B in terms of importance does on average only hold in the context of MPEG-4 ASP and its predecessors, whereas with H.264/MPEG-4 AVC, this heavily depends on the encoding settings. As pointed out in Section 2.2.1, H.264/MPEG-4 AVC only distinguishes between IDR and non-IDR frames, whereas the distinction between I, P, and B is made at the slice level for the sake of improved encoding flexibility. Some publications neglect that fact, adopt a one-slice-per-frame packetization strategy, and argue that they consider only low bitrate scenarios where an entire frame fits into a network packet to reduce complexity. However, when carefully considering the compression efficiency of modern encoders and assuming an MTU of 1.5 kB, this seems only feasible for very low-resolution sequences. To provide a concrete example, a B slice (frame) from a highly compressed Common Intermediate Format (CIF) sequence fits into one network packet whereas an I slice (frame) roughly takes 6 network packets. Thus, the applicability of those approaches to real-world scenarios seems questionable. In some other publications, the frame/slice type issue is simply misunderstood as the terminology used therein is incorrect. A recent publication that considers the slice type as well as its size as importance indicators is [160], where an H.264/MPEG-4 AVC stream is encapsulated in an unequally protected MPEG-2 transport stream to be delivered over DVB or ATSC networks [176]. By presuming that clients support DP (see Table 2.2), a further approach is to distinguish between IDR and non-IDR partition A, B, and C slices in the case of H.264/MPEG-4 AVC, or between different groups of blocks (GOB) in the case of MPEG-4 ASP [31]. A link-layer solution for wireless environments that applies hierarchical quadrature amplitude modulation to provide unequal error protection
is presented in [10]. The authors underline the benefit of H.264/MPEG-4 AVC's DP feature and demonstrate a method to appropriately map different frame types onto high- and low-priority capacities.

Streaming layered media formats such as scalable video coding (SVC) or multiview video coding (MVC) provides new possibilities regarding their distribution and their loss protection due to their altered, hierarchical bitstream structure. As pointed out in Section 4.1, the error-free decoding of some parts of the stream may require other, more important parts to be correctly received. These dependencies motivate the design of unequal error control schemes such as in [68], where a layer-aware FEC mechanism is used to globally leverage the video quality in mobile broadcast scenarios. Fiandrotti et al. [49] also consider layers as prioritization classes and propose, based on that, a packet scheduling algorithm to improve video transport over wireless ad-hoc networks. They further show that by appropriately scheduling packets, a more graceful video quality degradation can be achieved during congested network conditions than when H.264/MPEG-4 AVC is used. In [127], an adaptive video streaming scheme is presented that combines layer selection and unequal error protection. A model-based algorithm is used to roughly estimate the receiver-side distortion. The amount of FEC redundancy and the number of layers to transmit is determined based on the distortion estimates.

In conclusion, there are some application scenarios where such simple prioritization techniques are sufficient, e.g., when media units have to be assigned to two different transport service levels. However, there are other situations where a less intuitive, finer-grained prioritization mechanism can be extremely beneficial, such as in systems that support adaptive reliability.

4.3. Heavyweight schemes

In contrast to computationally cheap mechanisms, pixel-based techniques estimate the expected channel-induced distortion by anticipating the receiver-side error-concealed video stream and calculating the difference between error-free
and reconstructed samples per pixel. They demand suitable models that incorporate, amongst many other factors, the error propagation between media units, the receiver's error concealment mechanism, and the channel loss behavior. The appropriate integration of channel characteristics (average packet loss rate, loss patterns, burst length, etc.) into distortion estimation models is the subject of many studies because it is a non-trivial task.

One representative is a method called Recursive Optimal per-pixel End-to-end distortion estimate (ROPE), proposed by Rose et al. [177, 179, 185, 244]. ROPE recursively calculates the expected distortion at the pixel level and defines two formulae that consider the error propagation in intra and inter coded macroblocks. The model is applicable for encoder decision optimization and its benefit has already been demonstrated for reference picture selection [119], error-resilient motion compensation [220], multiple description coding [70], and QoS selection [139]. In the context of transport error control, the model however demands close knowledge of the loss rate and assumes a Bernoulli distribution. It therefore takes no bursty loss behavior into account, which prohibits its deployment in certain streaming scenarios. Moreover, pixel-filtering/averaging and sub-pixel motion compensation operations have to be incorporated in the model. This demands cross-correlation approximation, for which it is hard to come up with an effective, low-complexity algorithm. A major drawback is the large computational effort needed to calculate estimates because two moments have to be tracked at every pixel. Especially if used in combination with long and complex GOP structures and if bidirectional prediction is applied, which demands additional terms in the recursion formulae, realtime distortion estimation becomes an infeasible task. Furthermore, memory consumption can no longer be neglected. This applies even more to models that try to mimic (possibly iterative) error concealment techniques of present decoders, as pointed out in [239], which turns the originally rather simple formulae into complex derivation sequences.

Schmidt et al. point out that ROPE only considers isolated packet loss and that it ignores correlations between lost media units [177]. The incorporation of all possible loss patterns, though limited by GOP boundaries, is computationally intractable due to the exponential number of possibilities.
112 Chapter 4. Related work: Video distortion estimation sibilities. As approximation, they propose to apply a ROPE-based first-order Taylor expansion, which yields a linear approximation of the expected distortion close to a reference packet loss ratio. A stationary linear model is described in [103] that estimates the impact of packet loss on video quality based on the average of single packet loss distortions. The authors argue that the model is accurate for isolated losses, i.e., losses that are sufficiently far apart, and for bursty losses where bursts only affect a single frame, similar to the observations of [167]. Due to its simplistic design, the model is however sometimes inappropriate, e.g., for applications in wireless environments where packet loss rates can heavily fluctuate. Furthermore, it is highly questionable in connection with modern video codecs whether expecting the same loss-impact from different media units, even under the given constraints, really yields an accurate distortion estimation. In fact, an expressive counter-example can be found in Figure 2 of [22]. Stockhammer et al. underline the difficulties that have to be faced when extending the ROPE approach to H.264/MPEG-4 AVC [193, 195]. They point out that the in-loop deblocking filter, the sub-pel motion accuracy, the increased complexity of intra prediction, and the advanced error concealment tools are hard to incorporate in the existing model. As an alternative, they propose a rather computationally intensive method to estimate the expected decoder distortion by considering a finite number of possible channel realizations. For every realization, the sample distortion is calculated in terms of MSE, assuming that the decoder side error concealment mechanism is known, and the average over all distortion estimates is taken as result. The authors argue that according to the strong law of large numbers, by considering a sufficient number of channel realizations, the expected distortion can be accurately estimated. However, the algorithm significantly increases the complexity of the sending application because the simulation of a single channel realization is roughly equivalent to decoding an entire GOP in terms of computational cost [194]. Chakareski et al. point out that most of the existing distortion estimation schemes cannot be applied in realtime due to their computational complexity [22]. 112

113 4.3. Heavyweight schemes To circumvent this problem for stored video streaming, they propose to precalculate the distortion characteristics of compressed videos as a side product of the encoding process or in a post-encoding step and to store them in so called rate-distortion hint tracks. Similar to the hint tracks of the MPEG-4 file format MP4 [85], this meta information is used during the actual streaming session to provide the streaming server with meta information. This can greatly enhance error resilience as it allows low-complexity RDO without the need to analyze the compressed media data on the fly. Moreover, the authors present a packet-loss impact estimation approach that they call distortion chains. Depending on the respective chain order, the distortion caused by isolated, binary, or n-ary packet loss is computed and stored in such hint tracks. To reduce computational cost, the distortion caused by the joint loss of packets of different GOPs is considered to be equal to the sum of distortions of the respective isolated losses. The streaming sever can extract this information to predict the distortion caused by specific loss constellations by extrapolating selected precomputed loss patterns of shorter length. This can also be done at arbitrary nodes along the delivery path to perform optimized packet dropping when the network is congested [23, 24]. A major restriction of the scheme is that only a one-frame-per-network-unit packetization is considered, and it is unclear whether the approach can be extended to incorporate the spatial dimension(s) as well. Furthermore, backward prediction is neglected as only forward error propagation is included in the distortion chain design. However, this is only a minor issue because just slight modifications would be necessary to support bidirectional prediction. Although the authors claim that the cost of pre-computation is linear in the (packet) length of the sequence, this is no longer true for video streams of higher quality as the number of packets constituting a GOP becomes the dominant variable in the cost approximation when using distortion chains of at least order one. Finally, as indicated by the authors, the approach is only beneficial for streaming of stored video content and not for realtime applications. Li et al. also strive to estimate the video distortion, but as expectation taken over all possible channel realizations, and investigate the impact of Markov-model burst 113

114 Chapter 4. Related work: Video distortion estimation packet losses on video quality [121, 122]. They argue that error propagation decays over time and use a sliding window algorithm to limit the number of possible loss patterns to be considered based on that assumption. Although the approach appears to be promising with respect to a more general view on channel loss, there are some limitations. First, Li et al. assume that each frame is packaged into one network packet, which is not realistic from a practical point of view unless only highly compressed low-resolution sequences are used. They remark that the framework can be easily extended to the case where a frame is spread over multiple packets, but they do not address the variability of spatial error propagation over multiple frame segments. Besides that, they consider the error attenuation factors of received and lost frames as constants and ignore the diversity of content with respect to temporal and spatial complexity. Finally, they assume in their model that the correct reception and decoding of I frames can always be guaranteed. This limitation additionally demands content-aware transportation to avoid severe implications on playback continuity in real implementations. In contrast to the previously discussed approach, Wang et al. do not model the temporal error attenuation as constant but as a function of the packet loss ratio and the proportion of intra coded macroblocks [223]. The intra block ratio is not globally fixed but individually evaluated for each frame. Several possible function designs are discussed that incorporate intra prediction, deblocking filtering, and sub-pel motion estimation. The appropriate choice depends on the encoder s configuration and the present receiver side error concealment mechanism. Similar to Li et al., they argue that their approach also takes slicing below the frame level into account. Yet they restrict evaluations to a one-slice-per-frame packetization scheme and calculate distortion estimates on a per-frame basis in the model verification section. They also discuss the use of single frame loss distortion estimates for unequal error protection schemes by simplifying the proposed model and assuming additive distortion behavior. However, this simplification presumes that intra rates of different frames are fairly constant and that the channel distortion in subsequent frames decays exponentially. This is only true for a limited number of sequences, encoded using simple GOP structures and short reference frame pre- 114

115 4.3. Heavyweight schemes diction lists (the authors focus on a IP* GOP structure and restrict predictions only to the previous frame). Furthermore, it is not a trivial task to find fitting model parameters for sequences, and appropriate estimation/optimization techniques are needed to obtain them. Babich et al. point out that the frequently applied idea to model the effects of multiple losses as the superposition of multiple independent losses is not always accurate, especially in low bit-rate wireless video communication when losses are not spaced sufficiently far apart [7]. They propose three models that build on each other. These models use the mean-squared error of consecutive frames as input to estimate the distortion caused by error propagation due to channel loss. Similar as in [121], they require a globally defined constant that reflects the decay of distortion over time, but they additionally incorporate the lengths of preceding bursts. This demands a precise knowledge of the loss history, and is therefore only practical e.g. in a system that supports selective retransmission. As in most of the previously discussed approaches, only entire frame-loss is considered. Masala et al. propose a mechanism called Analysis-by-Synthesis (AbS), which estimates the overall channel distortion by simulating the complete decoder behavior for the loss of each packet individually [8, 19, 20, 137, 138, 215]. Due to the approach s packet level granularity, the estimates get quite close to the actual distortion experienced at the receiver side. However, the derivation procedure is computationally expensive and is hard to carry out in realtime especially when complex concealment algorithms are in place and the propagation chain is long. If sufficiently precise estimates can be obtained from the encoder as by-product (e.g. when no temporal prediction is conducted), this drawback is negligible, but in general complexity is high and heavy precomputation is needed to render the approach suitable for stored-video scenarios. The significant computational effort can become a bottleneck especially when new content is produced (and has to be analyzed) at high rates such as with public video platforms like YouTube where thousands of videos are being uploaded every day. The co-impact of multiple packet loss is modeled in an additive way as in previously discussed approaches because an exhaustive analysis of all possible loss patterns would be computa- 115

116 Chapter 4. Related work: Video distortion estimation tionally intractable. The authors address the limited applicability caused by the large amount of computation needed with a workaround in [218]. They propose to only use AbS to evaluate the distortion introduced in the current frame, while the distortion in succeeding frames is estimated by means of an error propagation model, similar to Babich et al. Although the complexity of the modified approach is heavily reduced, its generality is questionable: the model is based on the statistical analysis of several test sequences, but only one GOP structure is considered, a fixed quantization parameter of 28 is used, and plain frame-copy error concealment is assumed. In summary, this does not guarantee the model s applicability to arbitrary settings Conclusion The discussed approaches in the previous section seem quite diverse at first glance, but most of them do actually have very much in common. They calculate the distortion in terms of mean-squared error between original and artificially reconstructed pixel values and estimate the spatio-temporal error propagation in similar ways some are even almost identical in this respect. As repeatedly pointed out, they are hardly applicable for realtime streaming scenarios without significant simplifications at the cost of estimation accuracy. Paired with the observation that existing lightweight schemes are of limited precision, this was the major incentive to come up with novel distortion estimation techniques that are capable to provide estimates of superior accuracy in realtime, described in Chapter 5. For a discussion about the algorithms computational demands, the reader is referred to Section

Chapter 5. Realtime distortion estimation

5.1. Introduction

In this and the following sections, two algorithms developed by the author are presented that quantify the relative loss impact of network packets carrying video content on the reconstructed videos' quality. The adjective relative refers to the fact that network packets are not associated, e.g., with MSE distortion values obtained by simulation as done by some approaches presented in Section 4.3, but with prioritization values from the range [0; 1], computed by the respective algorithm. With respect to estimation quality, those prioritization values should ideally be strongly correlated with the MSE values of isolated distortion simulations where the entire decoding process is incorporated. For the remainder of this thesis, distortion estimation algorithms will be denoted by Ψ. In particular, the algorithms presented in the following will be denoted by Ψ_ASP and Ψ_AVC in accordance with their applicability to MPEG-4 ASP and H.264/MPEG-4 AVC respectively. Ψ : Π → [0; 1] is a total mapping function, which assigns distortion estimates to video packets from the set Π. The following property should be satisfied as well as possible: If and only if Ψ(p_i) > Ψ(p_j), then the loss of p_i results in a higher quality degradation when decoding the partially received video than when p_j is lost.
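Where isolated-loss MSE values are available from offline simulations, the agreement between a candidate Ψ and the simulated distortion can be checked with a rank correlation. The following minimal sketch only illustrates this criterion and is not part of the thesis tool chain; it assumes SciPy is available and uses made-up numbers.

```python
# Illustrative check (not from the thesis): rank agreement between a
# prioritization function psi and per-packet MSE values obtained from
# isolated-loss decoding simulations.
from scipy.stats import spearmanr

def rank_agreement(psi_values, simulated_mse):
    """Both arguments are sequences indexed by packet; a good estimator
    assigns high psi to packets whose isolated loss causes a high MSE."""
    rho, _ = spearmanr(psi_values, simulated_mse)
    return rho

# Example with made-up numbers: a well-behaved estimator yields rho close to 1.
print(rank_agreement([0.9, 0.2, 0.5, 0.7], [40.1, 3.2, 12.5, 25.0]))
```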

5.2. Prioritization of MPEG-2 and MPEG-4 ASP

Ψ_ASP was initially designed to be deployed in wireless environments where devices are characterized by limited battery and processing capabilities [171]. Therefore, ensuring its operability in a power conserving fashion was a major goal, and accordingly, computationally intensive approaches such as the complete decoding of the bitstream or the consideration of cross-correlation issues had to be neglected. As a consequence, Ψ_ASP closely models the impact of isolated loss but does not take the side-effects of correlated loss into account.

Design of Ψ_ASP

The estimation process operates at the packet level and requires an error resilient packetization scheme that limits the effects of spatial error propagation. In MPEG-4 ASP, this can be established by instructing the encoder to dimension GOBs appropriately so that they fit into network packets. Furthermore, resynchronization markers have to be inserted into GOB headers, which enables the entropy decoder to continue the parsing process after having detected a bitstream error or having skipped a lost segment. This implies that only the data segment between resynchronization markers where the error occurred has to be concealed. In contrast to that, as depicted in Figure 5.1, the entire remainder of the current frame would have to be concealed when no resynchronization markers are used. Smart packetization schemes are generally aware of the position of resynchronization markers and packetize data accordingly to minimize the impact of packet errors. For each packet, Ψ_ASP considers three major factors: the types of single macroblocks that the packet consists of, the frame's position within the current GOP, and the temporal proximity to potential scene cuts. These three factors are quantized by the sub-functions Ψ^i_ASP, and the overall score is calculated as a weighted sum of the sub-functions as given in Equation 5.1 and Equation 5.2.

    Ψ_ASP(p) = Σ_{i ∈ {mb, td, sc}} w_i · Ψ^i_ASP(p)    (5.1)

    w_mb + w_td + w_sc = 1    (5.2)

The sub-functions are discussed in the following sections.
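The combination defined in Equations 5.1 and 5.2 can be sketched as follows. The weight values below are placeholders chosen for illustration, not the values used for the experiments in this thesis.

```python
# Sketch of the weighted combination of Equations 5.1 and 5.2.
WEIGHTS = {"mb": 0.5, "td": 0.3, "sc": 0.2}   # placeholder weights, must sum to 1

def psi_asp(sub_scores):
    """sub_scores maps 'mb', 'td' and 'sc' to the respective sub-function
    value of a packet, each in [0, 1]."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[i] * sub_scores[i] for i in ("mb", "td", "sc"))
```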

Figure 5.1.: Decodable regions of a loss-affected frame with (left) and without (right) resynchronization markers in the bitstream

Macroblock type weighting: Ψ^mb_ASP

Ψ^mb_ASP reflects the importance of a macroblock's type, weighted by its spatial position. It is defined in Equation 5.3, where mb(p) is the set of macroblocks contained in packet p, # computes the number of elements of a set, and type(m) is the type of macroblock m.

    Ψ^mb_ASP(p) = ( Σ_{m ∈ mb(p)} ω_type(type(m)) · ω_pos(pos_x(m), pos_y(m)) ) / #(mb(p))    (5.3)

Furthermore, pos_x(m) and pos_y(m) are the x and y offsets of macroblock m's center in the current frame and ω_pos(x, y) ∈ [0; 1] is a function incorporating the lower perceptual importance of edge regions. Ideally, ω_pos should be designed in such a way that it reflects the macroblock's proximity to the respective frame's regions of interest. As ROI detection

is rather costly and requires the video stream to be fully decoded, ω_pos is instead calculated based on a simple scheme by assuming that the ROI is located at the frame's center (see Figure 5.2).

Figure 5.2.: Simple ROI scheme of ω_pos to weight macroblocks according to their location.

The dimensions of the video sequence are used to scale the ROI appropriately as given in Equation 5.4, where w and h denote the frame width and height.

    ω_pos(x, y) = 1 - ( (2x/w - 1)² + (2y/h - 1)² ) / 2    (5.4)

As discussed in Section 2.2, the vast majority of existing video codecs achieve high compression ratios by detecting and reducing spatial and temporal redundancies. The basic objects of compression are macroblocks, and the encoder decides for every block how to encode it in order to meet given constraints. In the encoded bitstream, the macroblock types indicate the compression strategy that was applied. The importance of single macroblocks with respect to these types (type(m)) is reflected by the weighting function ω_type ∈ [0; 1]. It distinguishes between six different sets of encoding modes, which are briefly described in the following paragraph from the perspective of the targeted compression standards.

In MPEG-2 as well as in MPEG-4 ASP, I frames, also referred to as key frames, only exploit spatial redundancy and do not depend on other frames. As a result, they provide the least effective coding efficiency but are indispensable for error compensation and error concealment mechanisms. I frames only contain macroblocks coded in intra mode, referred to as i-macroblocks. P frames can contain i-macroblocks as well as f-macroblocks, which forward-predict information from the preceding I or P frame. By using a motion vector and only encoding the difference to the reference area, coding efficiency can be improved. Furthermore, skipped s-macroblocks can occur in P frames, which are also referred to as not coded macroblocks. They do not contain any motion and texture information, and consequently, no further data is sent when such a macroblock is signaled in the bitstream. At decoding time, their content is calculated by copying the corresponding region from the previously displayed frame. B frames go one step further and additionally allow their macroblocks to be backward predictive. Besides i-macroblocks, f-macroblocks and s-macroblocks, a B frame can also consist of backward predictive b-macroblocks, bidirectional predictive bi-macroblocks and direct bidirectional predictive d-macroblocks. bi-macroblocks predict their content using two reference areas: one located in the previous and one in the next I or P frame. d-macroblocks are similar to bi-macroblocks as they use both forward and backward references but contain only delta motion information. Their motion vectors are calculated from the forward motion vector of the temporally collocated macroblock of the succeeding I or P frame and an additional delta motion vector. As an example, Figure 5.3 depicts the macroblock types used in two frames of the sequence Foreman. The color code is as follows: red for i-, blue for f-, green for b-, orange for d-, and blue-green for bi-macroblocks. Based on these observations, the weighting function ω_type is specified as given in Equation 5.5.

    ω_type: i ↦ 1.0, f ↦ 0.50, b ↦ 0.30, bi ↦ 0.15, d ↦ 0.10, s ↦ 0.00    (5.5)
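To make the interplay of the type weighting (Equation 5.5) and the position weighting (Equation 5.4) concrete, a minimal sketch of Ψ^mb_ASP might look as follows. The macroblock representation is an assumption made for illustration; it is not the data structure of the actual implementation.

```python
# Sketch of Equation 5.3: average macroblock weight of one packet, combining
# the type weights of Equation 5.5 with the positional weighting of Equation 5.4.
OMEGA_TYPE = {"i": 1.0, "f": 0.50, "b": 0.30, "bi": 0.15, "d": 0.10, "s": 0.00}

def omega_pos(x, y, w, h):
    # Highest weight at the frame centre, decreasing towards the edges.
    return 1.0 - ((2.0 * x / w - 1.0) ** 2 + (2.0 * y / h - 1.0) ** 2) / 2.0

def psi_mb_asp(macroblocks, w, h):
    """macroblocks: list of (mb_type, center_x, center_y) tuples of one packet."""
    if not macroblocks:
        return 0.0
    total = sum(OMEGA_TYPE[t] * omega_pos(x, y, w, h) for t, x, y in macroblocks)
    return total / len(macroblocks)
```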

Figure 5.3.: Different macroblock types in a P (left) and a B frame (right).

Clearly, i-macroblocks are the most important ones since they do not depend on other macroblocks and they are not affected when temporally collocated macroblocks are lost. At the same time, the loss of i-macroblocks almost always affects depending macroblocks. In contrast to that, the loss of s-macroblocks has only a very limited effect on video quality since they carry no information. f- and b-macroblocks are considered to be the second most important ones since they only depend on macroblocks in one reference frame. f-macroblocks are favored over b-macroblocks because they can also occur in P frames, which are more important than B frames. Finally, bi-macroblocks are rated to be more important than d-macroblocks because they carry one motion vector for each prediction direction, whereas a d-macroblock only has delta motion information and depends on motion information of the corresponding macroblock in the next reference frame. When this data gets lost, the motion vectors for the d-macroblock cannot be calculated correctly. These considerations are incorporated in the weighting function's design and the weights of i, f, b, bi, d, s macroblocks are adjusted in decreasing order as just listed.

Temporal frame dependencies: Ψ^td_ASP

With MPEG-4 ASP, the GOP structure is not as flexible as it is with H.264/MPEG-4 AVC. Although there is a feature called dynamic GOP, which lets the

Figure 5.4.: MPEG-4 ASP GOP dependency graph with indices in display order.

encoder select between three hierarchical GOP structures, this has been rarely used in the past. Therefore, regular MPEG-4 ASP GOP structures are characterized by two parameters throughout this thesis: N, which is the number of frames within a GOP, and M, which describes the spacing between non-B frames, i.e., the number of consecutive B frames. Consequently, a GOP consists of one I frame, N/(M+1) - 1 P frames, and N·M/(M+1) B frames. As an example, M = 2 and N = 12 are typical values for MPEG-2-like GOP structures. It is not guaranteed that encoders stick to this structure since, for example, a P frame that follows a scene change may be badly predicted due to a low correlation with its reference frame and may be replaced with a more robust I frame. By studying the number of dependent frames (i.e., affected frames in case of data loss) in GOP dependency graphs (see Figure 5.4), it is noticeable that especially P frames can be of diverse importance. This motivates the definition of Ψ^td_ASP as given in Equation 5.6, where δ_td(p) transitively calculates the number of dependent frames of the frame that is (partially) carried by p.

    Ψ^td_ASP(p) = δ_td(p) / (N - 1)    (5.6)
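The following sketch illustrates how δ_td and Ψ^td_ASP behave for the regular GOP structure just described. It is a simplified model that derives the dependency graph from N and M and assumes a closed GOP in which trailing B frames only reference the preceding anchor; the actual implementation works on the encoded bitstream.

```python
# Sketch of delta_td and Psi_td for a regular N/M GOP structure.
def build_gop(n, m):
    """Frame types in display order, e.g. n=12, m=2 -> I B B P B B P B B P B B."""
    return ["I" if i == 0 else ("P" if i % (m + 1) == 0 else "B") for i in range(n)]

def delta_td(types, idx):
    """Number of frames that transitively depend on frame idx."""
    n = len(types)
    refs = [set() for _ in range(n)]          # refs[j] = frames that frame j references
    anchors = [i for i, t in enumerate(types) if t != "B"]
    for j, t in enumerate(types):
        if t == "P":
            refs[j].add(max(a for a in anchors if a < j))
        elif t == "B":
            refs[j].add(max(a for a in anchors if a < j))
            later = [a for a in anchors if a > j]
            if later:                          # closed-GOP simplification
                refs[j].add(min(later))
    dependants, changed = set(), True
    while changed:                             # transitive closure
        changed = False
        for j in range(n):
            if j != idx and j not in dependants and (idx in refs[j] or refs[j] & dependants):
                dependants.add(j)
                changed = True
    return len(dependants)

def psi_td_asp(types, idx):
    return delta_td(types, idx) / (len(types) - 1)

types = build_gop(12, 2)
print([psi_td_asp(types, i) for i in range(12)])   # I frame -> 1.0, B frames -> 0.0
```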

Scene cut detection: Ψ^sc_ASP

When data of frames that have a high proximity to scene changes is corrupted or lost, a higher degree of noticeable artifacts can be experienced as compared to damaged frames that are further apart from scene cuts. This difference is pronounced in the face of abrupt scene cuts. Error concealment mechanisms reconstruct and interpolate damaged areas of frames by retrofitting their partially received data with motion and texture information of their reference frames (see Section 2.5). Under most circumstances this works well, but it performs poorly when the temporal correlation between frames is exceptionally small, as is the case with scene cuts. As an explanation, encoders tend to select temporal reference frames that are on the same side of the scene cut as illustrated in Figure 5.5. Three different scene cut constellations are depicted where solid arrows indicate a strong use of the respective reference target whereas dotted arrows indicate weak connectivity due to a lack of similarity. If a scene cut occurs immediately before a sequence of consecutive B frames, encoders usually backward predict their motion and residual information due to their low correlation with the preceding I or P frame. Similarly, a scene cut that occurs after a sequence of B frames generally triggers encoders to apply forward prediction. In the third scenario of Figure 5.5, a scene cut happens between B frames. Preceding B frames will then be forward predicted and succeeding ones are backward predicted. Moreover, it was observed that if the scene cut occurs before frame f_(N-M), the next P frame is characterized by an above-average fraction of i-macroblocks, which can again be explained by the low temporal inter-frame correlation. When some GOBs of B frames are lost, error concealment mechanisms try to reconstruct affected frames by exploiting both reference frames. These techniques are usually not aware of scene cuts and therefore perform faulty predictions unless they ignore the poorly temporally correlated reference frame on the opposite side of the scene cut.

Figure 5.5.: Possible scene cut constellations: (a) scene cut before B frames, (b) scene cut after B frames, (c) scene cut between B frames.

A similar problem is addressed in [196] where the authors show that by integrating scene cut detection into an H.264/MPEG-4 AVC decoder, error concealment can be performed more effectively. To compensate for the lack of such error concealment techniques in decoders, Ψ^sc_ASP protects B frames close to scene cuts. Such frames are identified by the function δ_sc ∈ {0, 1}, which analyzes the statistical distribution of the macroblocks the frames consist of. When the fraction of forward predictive f- or backward predictive b-macroblocks is above a certain threshold t_sc, and if this is true for all frames of a sequence of consecutive B frames, a scene cut is very likely, denoted by δ_sc = 1. Otherwise, δ_sc = 0. As an example, Figure 5.6 shows three consecutive frames in display order, where each frame has a significant fraction of macroblocks of a certain type. In this particular example, a high threshold of t_sc = 0.9 would still correctly detect the scene cut. However, for the measurements in Chapter 6 and Section 7.2, t_sc was set to 0.6 as this turned out to be an acceptable compromise regarding false positives and negatives during the simulations. Based on these observations, Ψ^sc_ASP is defined as given in Equation 5.7.

    Ψ^sc_ASP(p) = δ_sc(p)    (5.7)

5.3. Prioritization of H.264/MPEG-4 AVC

Introduction

As will be demonstrated in the following two chapters, Ψ_ASP provides accurate estimates for MPEG-4 ASP sequences and leads to superior video quality when deployed in error control frameworks. However, up to this point it was unclear whether this distortion estimation scheme can also be applied to H.264/MPEG-4 AVC content. Experiments soon revealed that Ψ_ASP provides only a limited degree of accuracy, which could also be obtained by falling back on less sophisticated approaches. Moreover, it turned out that encoding settings play a crucial role as complex settings lead to larger deviations than when short GOP structures and few reference pictures are used. More specifically, the three sub-functions

Figure 5.6.: Significant fractions of f- (blue) and b-macroblocks (green) at a scene cut

of Ψ_ASP are inappropriate in the context of H.264/MPEG-4 AVC and have to be adapted or replaced due to the following reasons:

- Categorizing macroblock types into six different sets as it is done by Ψ^mb_ASP is too superficial. As H.264/MPEG-4 AVC allows macroblocks to be partitioned down to the size of 4x4 blocks where each block may be predicted from different regions of different frames, either each macroblock type should be weighted separately or a finer-grained categorization should be applied.

- GOP structures defined at encoding time are no longer static. This means that the encoder will only stick to the respective GOP structure when the content to encode is rather homogeneous with regard to the structural complexity and the amount of motion. M and N (as defined in Section 5.2.3) are only considered as upper bounds. When encoding content that, e.g., contains significant amounts of motion, the encoder may replace B slices (frames) with P or I slices (frames). Additionally, the encoder may decide to shorten the GOP when the amount of exploitable temporal correlation drops below a certain threshold. As a consequence, Ψ^td_ASP would have to be extended to support dynamic GOPs.

- Efficient H.264/MPEG-4 AVC encoder implementations such as x264 [143] incorporate scene cut detection, which enables them to place intra coded or IDR frames at the beginning of new scenes. On the one hand, this can improve the coding efficiency because succeeding frames are now provided with a suitable reference target. On the other hand, this reduces the benefit of Ψ^sc_ASP as scene cuts no longer need to be protected due to the fact that the bitstream is more error resilient. This is a direct consequence of enforcing the use of intra prediction after scene cuts.

The prioritization scheme presented in the following is to be used for video content encoded using one of the commonly used profiles such as constrained baseline, main, or high*. It is assumed that the stream's error resilience capabilities are limited as related tools of the baseline and the extended profiles are not supported

(cf. Table 2.2). This however only marginally limits the scheme's applicability due to missing decoder support.

Design of Ψ_AVC

In this section, a technique is described that prioritizes H.264/MPEG-4 AVC NAL units with respect to the distortion in video quality caused when they are lost. For the remainder of this thesis, this approach is denoted by Ψ_AVC; it belongs to the category of lightweight mechanisms and is applicable to realtime scenarios. However, its design is not as intuitive as corresponding approaches presented in Section 4.2. In simplified terms, video codecs aim to minimize the storage space needed by reducing spatial and temporal redundancies. The increased exploitation of temporal similarities is, besides the use of CABAC as entropy coding technique, the main reason for the improved coding efficiency of H.264/MPEG-4 AVC over MPEG-4 ASP. As a consequence, media units depend even more on each other, which in turn causes video streams to be more susceptible to data loss. With H.264/MPEG-4 AVC, the exploitation of spatial redundancies is limited by the size of the slice the element belongs to, which implies that similarities between macroblocks that belong to the same frame but different NAL units cannot be spatially exploited. For improved error resilience, a one-NAL-unit-per-network-unit packetization scheme is advisable because it limits the impact of a single packet loss to a subregion (a slice) of the corresponding frame. In contrast to that, when using a one-NAL-unit-per-frame scheme, one packet loss will cause the respective entire frame to be undecodable as it is not possible for the entropy decoder to resynchronize between NAL headers. Furthermore, the inspection of intra prediction applied within media units is of limited benefit due to the fact that only the loss of entire NAL units is considered as video decoders are, similarly as with MPEG-4 ASP, generally not able to cope with bit errors between resynchronization points. Ψ_AVC works at the level of NAL units and is therefore independent of the respective packetization scheme.

However, as the error resilient one-NAL-unit-per-network-unit packetization scheme provides substantial video quality improvements for streams delivered over error-prone channels, it is used for all measurements and experiments in this thesis. Because of this, the parameter p of Ψ(p) representing a packet is equivalent to a VCL NAL unit or slice with Ψ_AVC. Based on these considerations, Ψ_AVC mainly focuses on the analysis of temporal relationships to better incorporate the loss-impact with regard to temporal error propagation. It is based on the analysis of the macroblock partitioning, the spatial extents of temporal dependencies, and the length and strength of prediction chains existing among macroblocks. Analogous to the definition of Ψ_ASP in Equation 5.1, Ψ_AVC is calculated as a weighted sum over three sub-functions as given in Equation 5.8 and Equation 5.9.

    Ψ_AVC(m) = Σ_{i ∈ {type, dep, mv}} w_i · Ψ^i_AVC(m) ∈ [0; 1]    (5.8)

    w_type + w_dep + w_mv = 1    (5.9)

They are discussed in the following sections. The main difference between Ψ_ASP and Ψ_AVC is that the former inspects macroblocks but provides estimates at the packet level whereas the latter entirely estimates the relevance of content at the level of macroblocks. To obtain estimates for specific segments such as slices, the mean over all distortion estimates Ψ_AVC(m_i) of macroblocks m_i these regions contain is calculated.

Macroblock type and partitioning: Ψ^type_AVC

The value of Ψ^type_AVC(m) ∈ [0; 1] essentially reflects the partitioning of macroblock m. As defined by the H.264/MPEG-4 AVC standard, macroblocks can be divided into 16x16, 16x8, 8x16, and 8x8 partitions, where the latter can be further subdivided into four differently shaped sub-partitions. At encoding time, the encoder decides which partition mode to use based on the spatial complexity of the source frames and the position of regions suitable for inter prediction within reference frame candidates. Selecting the appropriate size of an inter prediction

131 5.3. Prioritization of H.264/MPEG-4 AVC Figure 5.7.: Macroblock partitioning at regions with increased spatial complexity. partition is a tradeoff between the quantity of data needed to represent motion vectors and the coding gain provided by using motion compensation with smaller blocks (see Section 2.2.5). By analyzing the structure of various encoded sequences, it turned out that encoders prevailingly stick to the following rule of thumb: the higher the complexity, the larger the number of blocks that single macroblocks are partitioned into. As an example, Figure 5.7 shows the partitioning of a frame of the test sequence coastguard that consists of five P slices. The regions that have a rather inhomogeneous (luma) distribution are located close to the two boats and the waves, which are at the same time the only places where macroblocks with a partitioning of 8x8 and below can be found. 131

132 Chapter 5. Realtime distortion estimation Based on these observations, a lookup-table was defined that encompasses 22 entries, which are disjoint subsets of macroblock types as defined in tables and of the H.264/MPEG-4 AVC standard [93] to quantify the selected modes in terms of introduced distortion in case of loss. These values were calculated by an iterative tuning process where they were continuously adjusted based on test results of loss simulations, measuring the videos quality distortion in the current frame only, and deliberately neglecting the error propagation over time. For decoding, ffmpeg [148] was used, and motion copy error concealment and deblocking filtering was applied. This process was carried out twice using x264 [143] and JM [205] as software encoders respectively. As expected, the two obtained tables were not identical as encoder decision making is not covered by the standard. However, the table values showed remarkable similarities, and there was a recognizable trend that distortion positively correlates with the number of partitions of a macroblock. In other words, it can be expected that the loss of a structurally simple macroblock (e.g., that contains two 8x16 partitions) causes on average a lower quality degradation than the loss of a structurally complex macroblock (e.g., one with more than four partitions), assuming similar circumstances with regard to spatio-temporal dependencies. Although by only considering two different encoders it cannot be claimed that this approach is truly generic, there is a reason to believe that the relationship between block size and loss impact also holds for other, reasonably efficient encoder implementations. Additionally, the reader is referred to [241] and [73] whose authors came to similar conclusions. Based on this assumption, Ψ type AVC (m) is defined by mapping the type of a macroblock m to the appropriately scaled mean value of the respective entries from both tables. Those mean values are listed in Table 5.1 where the properties of a row indicate the common features of the macroblocks of the respective set. L0, L1, and Bi indicate that the macroblocks are forward-, backward-, or bidirectional-predicted. A pair of these identifiers is used for 16x8 or 8x16 blocks to signify the prediction modes of the left and right or top and bottom blocks. SKIP macroblocks neither have motion nor residual information and can therefore be concealed best, which motivated setting their weight equal to zero. 132

slice   type    size          properties                     weight
I       intra   16x16
I       intra   8x8
I       intra   4x4
I       intra   16x16         PCM, lossless
P       inter   16x16         L0
P       inter   16x8, 8x16    L0
P       inter   8x8           L0
P       inter   <8x8          L0
P       inter   8x8           L0 inferred
P       inter   <8x8          L0 inferred
P       inter   16x16         SKIP
B       inter   16x16         L0, L1
B       inter   16x16         Bi
B       inter   16x8, 8x16    L0_L0, L1_L1
B       inter   16x8, 8x16    L0_L1, L1_L0
B       inter   16x8, 8x16    L0_Bi, L1_Bi, Bi_L0, Bi_L1
B       inter   16x8, 8x16    Bi_Bi
B       inter   8x8           L0, L1
B       inter   8x8           Bi
B       inter   <8x8          L0, L1, Bi
B       inter   8x8           DIRECT
B       inter   8x8           SKIP

Table 5.1.: Lookup-table of Ψ^type_AVC.

DIRECT macroblocks do have residual information but no motion information as the latter is inferred by decoders. Finally, macroblocks of type PCM are quite special as the prediction, transformation, and quantization steps are skipped and the entropy coder is bypassed. Therefore, they are lossless and usually occur in rare situations when the size of a macroblock after compression exceeds 384 bytes (the fixed size of uncompressed PCM macroblocks), which heavily depends on the current state of the CABAC encoder and the used quantization parameter. Consequently, the likelihood that error concealment mechanisms produce an acceptable interpolation of such complex regions is rather low, which justifies the choice of Ψ^type_AVC(m | type(m) = PCM) = 1.

Temporal error tracking: Ψ^dep_AVC

Ψ^dep_AVC reflects the temporal error propagation with respect to prediction dependencies between media units. Its derivation is performed by constructing a dependency graph and weighting edges according to the sizes of affected areas. The graph's edges are inverted meaning that, as opposed to motion vectors, dependencies point from the area used for reference to the dependent region. For illustration purposes, Figure 5.8 depicts an exemplary, heavily reduced instance of such a graph. Its edges have weights associated and the node labels are to be read as frame decoding timestamp, macroblock index, frame type, and macroblock type. By studying dependency graphs, some universal properties come to light. They are acyclic, which is implied by the fact that decoding timestamps are strictly decreasing along an arbitrary path. Such paths are limited by the size of the current GOP, presuming that the video sequence has a closed GOP structure. Otherwise, an additional threshold would have to be defined that limits the search depth of the algorithm in order to keep the computational cost at a reasonable level. Based on these observations, the recursive function δ_dep(m) was defined as given in Equation 5.10. It reflects the cumulative size of temporally dependent areas in one reference frame or distributed over many frames. δ_dep(v), defined in Equation 5.11, reflects the sum of areas that depend on the region that the

Figure 5.8.: Temporal dependency graph of a 48x32 test pattern sequence.

motion vector v points to. In this context, MV(m) denotes the set of forward and backward motion vectors that use parts of macroblock m as reference. Such sets do not necessarily have to be disjoint because motion vectors can point to up to 4 different macroblocks. Consequently, possible domains for i in Equation 5.11 are {1}, {1, 2}, and {1, 2, 3, 4}, independent of whether integer or sub-pel motion vector precision is used. m_i denotes the macroblocks that v refers to, and α_i(v) ∈ [0; 1] is the share of the ith macroblock in v's targeted region and is directly related to the labels of edges in Figure 5.8. In other words, α_i(v) reflects the percentage of the ith area that contributes to the prediction.

    δ_dep(m) = 1 + Σ_{v ∈ MV(m)} δ_dep(v)    (5.10)

    δ_dep(v) = Σ_i δ_dep(m_i) · α_i(v)    (5.11)
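One possible realization of the recursion in Equations 5.10 and 5.11 is sketched below in simplified form: each dependent macroblock contributes its own cumulative size, weighted by the overlap share α. The helper dependents_of is an assumed interface, and the caching corresponds to the acceleration mentioned in the text that follows.

```python
# Sketch of the dependency recursion.  dependents_of(mb) is assumed to yield
# (dependent_macroblock, alpha) pairs, i.e. the macroblocks whose motion vectors
# use parts of mb as reference together with the overlap share alpha_i(v).
# The graph is acyclic, so the recursion terminates at the end of the GOP.
def make_delta_dep(dependents_of):
    cache = {}                                  # memoize already computed macroblocks

    def delta_dep(mb):
        if mb not in cache:
            cache[mb] = 1.0 + sum(alpha * delta_dep(dep)
                                  for dep, alpha in dependents_of(mb))
        return cache[mb]

    return delta_dep
```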

Figure 5.9.: Derivation of α_i for motion vector v.

One possible prediction constellation is depicted in Figure 5.9: the values for α_1 to α_4 are 5/128, 11/128, 15/128, and 33/128 respectively, and their sum is equal to the percentage of the size of the predicted macroblock region. Therefore, the specific temporal relationship v influences δ_dep(m_4) the most whereas the influence on δ_dep(m_1) is only limited due to the negligible overlap. The recursive computation can be significantly accelerated by caching values of δ_dep(m) where m is part of a frame of the reference picture list. In addition to that, the function α can be easily extended to incorporate the optional weighted prediction feature [16]. The mapping from δ_dep to Ψ^dep_AVC is specified in Equation 5.12.

    Ψ^dep_AVC = 1 - (2 / (δ_dep(m) + 1))^κ_dep ∈ [0; 1]    (5.12)

It is a strictly increasing function because a loss of macroblock m generally causes quality degradations in all dependent regions, and the larger these regions are, the more likely it is that a viewer feels annoyed by visual artifacts. The parameter κ_dep is used to adapt Ψ^dep_AVC to the expected amount of temporal prediction, which primarily depends on the GOP structure and the maximum allowed number of reference frames.
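A minimal sketch of the mapping in Equation 5.12 and of the influence of κ_dep; the parameter values below are only examples.

```python
# Sketch of Equation 5.12: delta_dep = 1 (no dependants) maps to 0, long/strong
# dependency chains approach 1, more quickly for larger kappa_dep.
def psi_dep_avc(delta_dep_m, kappa_dep=0.8):
    return 1.0 - (2.0 / (delta_dep_m + 1.0)) ** kappa_dep

for kappa in (0.8, 1.2):
    print(kappa, [round(psi_dep_avc(d, kappa), 2) for d in (1, 3, 10, 50)])
```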

Figure 5.10.: κ_dep is selected based on the expected range of δ_dep.

Figure 5.10 shows the impact of κ_dep on Ψ^dep_AVC for different values of δ_dep. For example, when the selected encoder settings allow complex dependency chains, κ_dep ≈ 0.8 yields an appropriate mapping, whereas when temporal dependency exploitation is greatly restricted, κ_dep ≈ 1.2 is more suitable. Especially in live video streaming scenarios it may happen that not the entire GOP is available at the sender's buffer upon sending of its first frame. Consequently, Ψ^dep_AVC cannot always be completely calculated in such situations and the algorithm is forced to assume that there are no further temporal dependencies, which may lead to a decreased accuracy of estimates. Besides that, it has to be noted that Ψ^dep_AVC pessimistically models macroblock distortion by assuming successful reception of all dependency tree leaves. For a discussion on how loss feedback can be incorporated for estimate refinement, the reader is referred to Section

Motion vector extents: Ψ^mv_AVC

The sub-function Ψ^mv_AVC considers the length of motion vectors in both spatial and temporal domains. As previously mentioned, broken dependencies lead to distortion in dependent macroblocks because decoders have to guess the region used for prediction. A common trick known as motion copy is to inspect motion vectors of neighboring macroblocks and interpolate the missing motion vector by taking the mean over available motion vectors within a certain search radius (see Section 2.5.2). Such algorithms generally work quite well, but results get increasingly inaccurate with growing distance to the reference macroblock. The further apart in either dimension the initially referenced region is, the more unlikely it is that the respective region will be recognized by a decoder. Accordingly, sequences with inhomogeneous motion characteristics are more vulnerable to data loss than homogeneous or slow-motion scenes with respect to decoded video quality. This motivates the formulation of Ψ^mv_AVC, defined in Equation 5.13.

    Ψ^mv_AVC(m) = ( Σ_{v ∈ MV(m)} ( κ_mv · mvlen(v)/len_max + (1 - κ_mv) · |ref(v)|/ref_max ) ) / #(MV(m))    (5.13)

All motion vectors v_i of macroblock m are inspected and their spatial length mvlen(v_i) = √(x(v_i)² + y(v_i)²) together with their temporal extent ref(v_i) is considered. The latter is defined as the difference between the display timestamps of the current frame and the reference frame and can be negative in case of backward predictions. len_max and ref_max denote the upper bounds on the spatial and temporal prediction distances specified at encoding time, and #(MV(m)) is, analogous to Section 5.3.4, the number of motion vectors associated with macroblock m. Finally, κ_mv controls the balance between the impact of temporal and spatial distance on Ψ^mv_AVC.
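A small sketch of Equation 5.13; the motion vector representation and the default value of κ_mv are assumptions made for illustration.

```python
# Sketch of Equation 5.13: average spatial and temporal extent of a
# macroblock's motion vectors, each given as (dx, dy, ref_distance).
from math import sqrt

def psi_mv_avc(motion_vectors, len_max, ref_max, kappa_mv=0.5):
    if not motion_vectors:
        return 0.0
    total = 0.0
    for dx, dy, ref in motion_vectors:
        mvlen = sqrt(dx * dx + dy * dy)
        total += kappa_mv * mvlen / len_max + (1.0 - kappa_mv) * abs(ref) / ref_max
    return total / len(motion_vectors)
```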

The basic idea behind Ψ^mv_AVC is to consider for each motion vector of a macroblock its spatial as well as its temporal expansion. At macroblocks belonging to regions with a high amount of motion, mvlen(v) is rather large, and loss concerning such regions is much more demanding to conceal. Additionally, error concealment mechanisms more likely use temporally closely collocated data, and consequently, macroblocks that have motion vectors with high values of ref cause inadequate error masking with high probability. As an example of poor error concealment, Figure 5.11 depicts the impact of the loss of one single NAL unit on the visual quality of succeeding frames. Each sub-figure shows a distortion plot where the absolute differences between the luma values of single pixels of the source and the reconstructed/concealed sequences are mapped to grey values. The red lines are motion vectors, and each vector has a small square at one of its endings that indicates its owner, i.e., the endings without squares point to the respective reference regions. The horizontal and vertical black lines indicate slice boundaries. It is remarkable that the loss happens while the camera is panning to the right, and consequently, the misconcealed regions of the right part of the affected frame propagate in the opposite direction, from right to left, and cause severe artifacts. In contrast to that, the influence of concealed macroblocks of the left part of the lost slice disappears over time.

Error tracking capabilities

The proposed H.264/MPEG-4 AVC distortion estimation mechanism Ψ_AVC, and especially Ψ^dep_AVC, closely tracks and analyses motion within sequences to calculate estimates. It models temporal error propagation more accurately than related approaches discussed in Section 4.2 and Section 4.3. Moreover, it does not depend on the validity of the conjecture that distortion exponentially decays over time as assumed by some other models. As a simple counterexample to this assumption, the loss of a randomly chosen NAL unit that belongs to a P frame was simulated with four different test sequences. Two GOP structures, IP* and I(BP)*, were used and the resulting distortion in subsequent frames was measured. The results, depicted in Figure 5.12, suggest, except for the third sequence, that the assumption holds for the IP* GOP structure (left sub-plots). However, in the presence of B frames and when multiple references are allowed (right sub-plots), this is no longer true.

Figure 5.11.: Severe error propagation caused by rapid camera motion in the frames 177, 179, 181, 184, 187, and 190 of the sequence foreman.

Figure 5.12.: Error attenuation in succeeding frames when using different encoding settings: bframes = 0, ref = 1 (left plots) and bframes = 1, ref = 4 (right plots). The plots show the mean square error over the frame index for (a) sequence coastguard, slice 2 of frame 50 lost; (b) sequence stefan, slice 3 of frame 7 lost; (c) sequence foreman, slice 4 of frame 177 lost; (d) sequence football, slice 9 of frame 151 lost.
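Per-frame distortion curves such as those in Figure 5.12 can be obtained by comparing the original and the decoded/concealed sequences frame by frame. The following sketch computes the luma MSE for raw YUV 4:2:0 files; the file names and frame dimensions are placeholders, and this is only an illustration, not the tool used for the thesis measurements.

```python
# Luma MSE per frame between an original and a concealed YUV 4:2:0 sequence.
import numpy as np

def luma_mse_per_frame(orig_path, concealed_path, width, height):
    ysize = width * height
    frame_size = ysize * 3 // 2                 # Y plane plus subsampled U and V
    mses = []
    with open(orig_path, "rb") as fo, open(concealed_path, "rb") as fc:
        while True:
            a, b = fo.read(frame_size), fc.read(frame_size)
            if len(a) < frame_size or len(b) < frame_size:
                break
            ya = np.frombuffer(a[:ysize], dtype=np.uint8).astype(np.float64)
            yb = np.frombuffer(b[:ysize], dtype=np.uint8).astype(np.float64)
            mses.append(float(np.mean((ya - yb) ** 2)))
    return mses

# e.g. luma_mse_per_frame("foreman_cif.yuv", "foreman_concealed.yuv", 352, 288)
```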

Moreover, when a significant amount of motion is present at the time the loss occurs such as in Figure 5.12c, the distortion can even increase over time, independent of the GOP structure used. This can also be observed in Figure 5.11 where the distortion maps were captured during the same simulation run as in Figure 5.12c. Hence, it cannot be claimed that exponential decay distortion models are applicable to arbitrary encoder settings, as e.g. noted in [223].

Computational complexity

During the distortion estimation process, received, produced, or passing packets have to be analyzed. Ψ_AVC requires macroblock types, their partitioning, and their motion vectors for proper operation. These pieces of information can be extracted by superficially decoding the bitstream, which is far less computationally demanding than a full decoding process conducted by video decoders. The latter additionally includes some costly steps such as inverse quantization, inverse integer transformation, and error concealment. This is not required for Ψ_AVC as residuals and decoded/concealed samples are not considered. Memory consumption is also insignificant because the maximum number of frames to be simultaneously analyzed is limited by the respective GOP size. For cases where the video stream to be analyzed was encoded using an open-GOP structure, the memory usage can be limited by introducing an upper bound on the maximum length of prediction chains to consider. Ψ_AVC can therefore also be applied in realtime to highly compressed, high-resolution content, which, in contrast, is not true for pixel-level approaches such as ROPE and AbS. Hence, it has a wider operational range as it can additionally be used in systems where encoded video content is not prestored but created on the fly. This makes it attractive for environments that are distinguished by strict timing constraints such as video conferencing applications, remote control systems, streaming game technologies, and life-critical medical applications (e.g., tele-surgery).

143 Chapter 6. Prioritized proactive packet discard 6.1. Introduction Distortion estimation techniques turn out to be extremely beneficial when combined with error control mechanisms. One way to exploit them is to use the knowledge of the unequal importance of fragments or packets of a video stream to make intelligent discard decisions when the available bandwidth drops below the stream s bitrate. This adaptation process can happen either at the streaming source or somewhere in the delivery network. By appropriately annotating video packets, intermediate routers are able to decide which fragments to drop such that the visual video quality is minimally impacted. This is of particular interest, e.g., for the last hop in IPTV scenarios where the delivery to the consumer device is increasingly realized over wireless links. Prioritized packet discard at the wireless home router can avoid annoying quality drifts, and the annotation can even be implemented without changing existing protocols, e.g., by using the idr flag in NAL unit headers. Previous, related approaches encompass, amongst others, rate distortion hint-tracking based packet dropping [23,24] and AbS based packet marking for differentiated-services networks [41, 217]. Recently, even an extension to 3D video was proposed where two-dimensional video frames and depth map frames are 143

symmetrically discarded by considering the playback deadlines, the transmission bandwidth, and the inter-frame dependency relationships in order to adapt the video data rate to the available transmission budget [208]. This chapter demonstrates that Ψ_ASP, presented in Section 5.2, leads to considerable performance gains when combined with the packet discard policies discussed above. A small subset of the obtained results was already published in [171]. Here, an encompassing set of results is presented, focusing not only on performance gains under different discard ratios but also under different GOP settings. For the evaluation, the timing aspect with regard to queuing delays and playback deadlines is neglected, i.e., it is assumed that all packets that arrive at the receiver can be exploited by the decoding process. The joint impact of relevance of content and proximity to playback deadlines will however be subject of investigation in the following chapters.

6.2. Simulation framework

The tool-chain depicted in Figure 6.1 was used to evaluate the effectiveness of Ψ_ASP under variable packet discard rates. The source sequences were uncompressed and were stored in the YUV format with quarter chroma resolution (4:2:0). Ffmpeg [148] was used to produce a raw MPEG-4 ASP video file. The encoder module was however modified to produce an additional encoding trace file, which is needed by the prioritization mechanism. In a next step, MP4Box, which is part of the multimedia research project GPAC, was taken to attach RTP hint tracks to the video file in order to facilitate the segmentation. This is also required for popular streaming frameworks such as the Darwin streaming server or the QuickTime streaming server. After media hinting, mp4trace, which is one of the tools that EvalVid [104] provides, was used to transmit the video to a virtual network sink, realized by netcat. Additionally, mp4trace produced a sender trace file that contains the send timestamps, the packet ids, and the packet sizes.

Figure 6.1.: Test environment used for packet drop simulations (tool chain: ffmpeg encoder, MP4Box hinting, mp4trace sender, tcpdump packet trace, distortion estimation, loss simulation, etmp4 stream reconstruction, ffmpeg decoding, PSNR quality measurement).

Moreover, tcpdump was running during the transmission to produce a transmission dump, which is required by the distortion estimation mechanism, the loss simulator, and the stream reconstruction unit. Based on the encoder's video trace, the streamer's send trace, and the transmission dump, the respective distortion estimation mechanism was used to calculate the priorities for all packets. These priorities were subsequently plugged into the loss simulator together with the transmission dump and the target packet discard rate. The discard process operated in the following manner: the video stream was segmented into network packets, which were then assigned to different sets according to the GOP they belonged to. For each set, the loss simulator continuously removed the packets with the least priorities until the target discard rate was met. Then, the corresponding entries of dropped packets were removed from the transmission dump. Based on the modified transmission dump and the hinted MPEG-4 ASP video file, a loss-affected video file was produced with the help of etmp4, which is a further tool from the EvalVid suite. After that, the loss-affected video file was decoded and concealed by ffmpeg's decoder module by applying motion copy error concealment and a deblocking filter. This yielded the raw YUV stream as it would be perceived in case of an actual error-prone video transmission. Finally, the quality degradation was measured in terms of PSNR by taking the source and the generated YUV files as input.

6.3. Experiment parameters

To keep the measured results as general as possible, an encompassing set of test sequences was compiled. It contained 36 videos at CIF (352x288) and 4CIF resolution with a length of 300 to 3000 frames. The selected test sequences were characterized by a diversity of spatial and temporal complexity. Regarding encoding settings, various GOP structures were used with M ranging from 1 to 5 and N from 6 to 36. Besides that, a moderate quantization ratio was chosen at

which the quality degradation caused by the lossy encoding procedure was hardly perceivable. The encoder was further instructed to set the RTP maximum payload size to 1.2 kB to ensure proper packetization, which is a crucial precondition for Ψ_ASP to perform properly (see Section 5.2.1). To benchmark the efficiency of the content-aware packet discard mechanism, two additional schemes were implemented. The first scheme, Ψ_R, discards packets at random. This makes it roughly equivalent to the content-unaware delivery of video data where the sender or intermediate nodes arbitrarily drop packets when queues get congested. In this respect, it serves as lower bound in terms of efficiency and shall improve the comparability of results. The second scheme, which will be denoted by Ψ_FT in this chapter, prioritizes packets based on the type of the frame they are associated with. It is frequently used and cited in literature, e.g., in [202] and [214], which can be mainly attributed to its ease of implementation. Analogous to its description in Section 4.2, I frames are considered to be the most important fragments of a video stream, B frames are the least important ones, and P frames are considered to be of average importance as expressed in Equation 6.1.

    Ψ_FT(p_1) > Ψ_FT(p_2) > Ψ_FT(p_3)  ⇔  (type(f(p_1)) = I) ∧ (type(f(p_2)) = P) ∧ (type(f(p_3)) = B)    (6.1)

During loss simulation, it can happen that there is more than one packet being ranked as the least important one. This can especially be the case with Ψ_FT because this prioritization distinguishes only between three different levels. In such cases, the loss decision was performed randomly as the send mechanism is not able to identify the packet with the least loss-impact and therefore is forced to guess. To deflate the resulting randomization effects, in [171], the testing procedure was applied to all test sequences 30 times. For the results of this chapter, the number of repetitions was increased to 100 to provide smoother performance curves. Furthermore, multiple discard ratios were considered, ranging from 1% to 95%.
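The per-GOP discard policy described in Section 6.2, together with the Ψ_FT baseline and the random tie-breaking just mentioned, can be sketched as follows. The packet representation, the Ψ_FT priority constants, and the interpretation of the discard ratio as a share of the stream size are assumptions made for illustration.

```python
# Sketch of prioritized per-GOP packet discard with random tie-breaking.
import random
from collections import defaultdict

FT_PRIORITY = {"I": 1.0, "P": 0.5, "B": 0.0}        # placeholder values for Psi_FT

def psi_ft(packet):
    return FT_PRIORITY[packet["frame_type"]]

def discard(packets, target_ratio, psi):
    """Remove the lowest-priority packets of every GOP until roughly
    target_ratio of the GOP's total size is dropped; ties are broken randomly."""
    per_gop = defaultdict(list)
    for p in packets:
        per_gop[p["gop"]].append(p)
    kept = []
    for gop_packets in per_gop.values():
        target_bytes = target_ratio * sum(p["size"] for p in gop_packets)
        ranked = sorted(gop_packets, key=lambda p: (psi(p), random.random()))
        dropped_bytes, drop_count = 0, 0
        for p in ranked:
            if dropped_bytes >= target_bytes:
                break
            dropped_bytes += p["size"]
            drop_count += 1
        kept.extend(ranked[drop_count:])
    return kept
```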

Obviously, quantifying video quality beyond a certain discard ratio threshold does not make sense, as the concealed sequence will be scrambled to such an extent that it becomes unacceptable to a human viewer. That threshold depends, among other factors, mainly on the selected GOP structure and the amount of motion. In connection with the results presented in the next section, it roughly varies between 50 and 65 percent.

6.4. Results

Figure 6.2 shows the video quality at the receiver after decoding and concealment at different packet discard rates. The discard decisions were made based on one of the three prioritization approaches discussed in Section 5.2 and Section 6.3. For each discard rate, the mean PSNR value over all frames and across all repetitions was computed per approach. Moreover, a GOP structure of M = 2 and N = 12, which is quite common for both MPEG-2 and MPEG-4 ASP, was selected. Without exception, Ψ_ASP outperforms both reference techniques in all test sequences at realistic discard rates. This performance gain is especially visible with the test sequences crew, football, and mobile, where the difference between Ψ_ASP and Ψ_FT is up to 7 dB. A remarkable observation is that these two prioritization mechanisms tend to converge at two specific discard rates. These rates are roughly related to the bit shares of I, P, and B frames in the encoded sequences. As an example, the sequence foreman consists of 300 frames of which 26 are I, 75 are P, and 199 are B frames. The B frames make up 52.0%, the P frames 29.2%, and the I frames 18.8% of the total stream size. Correspondingly, the points of convergence in Figure 6.2h are 52% and 78%. This can be explained by the fact that Ψ_ASP also prefers to drop B frames over P or I frames. However, its superiority over Ψ_FT between such convergence points is mainly due to more precise and finer-grained distortion estimates for media units belonging to frames of the same type. As indicated in the previous section, concealed video streams become worthless when the loss rates exceed a certain threshold.
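The mean PSNR figures underlying Figure 6.2 average over all frames and all repetitions, as described above. A minimal sketch of that computation, assuming 8-bit samples and frames given as numpy arrays, could look as follows; the function and parameter names are hypothetical.

    import numpy as np

    def psnr(reference, degraded, peak=255.0):
        # PSNR in dB between two frames given as uint8 numpy arrays
        mse = np.mean((reference.astype(np.float64) - degraded.astype(np.float64)) ** 2)
        if mse == 0:
            return float('inf')  # identical frames; in practice excluded or capped
        return 10.0 * np.log10(peak ** 2 / mse)

    def mean_psnr(runs):
        # runs: iterable of repetitions, each a list of (reference_frame,
        # decoded_frame) pairs; frame pairing/alignment is assumed to be done
        values = [psnr(ref, dec) for frames in runs for (ref, dec) in frames]
        return float(np.mean(values))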

[Figure 6.2.: Performance of the prioritization approaches at different discard rates with M = 2 and N = 12. Panels (a)-(r) plot the mean PSNR [dB] of Ψ_R, Ψ_FT, and Ψ_ASP against the percentage of data loss for the sequences akiyo, carracing, city, coastguard, container, crew, football, foreman, harbour, highway, ice, mobile, news, paris, silent, stefan, tempete, and waterfall.]

The range of the considered discard rates in Figure 6.2 covers the entire spectrum only for the sake of completeness, i.e., unrealistic, theoretical discard rates beyond 65% were also considered. Beyond that threshold, with some test sequences it even happens that Ψ_R outperforms the content-aware mechanisms, which is due to the fact that the video decoder has too little information available to perform suitable error concealment. As a result, the outcome of the error concealment process is nearly random, and comparing video quality using the PSNR metric is therefore no longer reasonable. This also explains why, for some sequences, the measured distortion under Ψ_FT decreases when the discard rate is increased beyond the respective threshold. That phenomenon can best be seen with the sequence paris, where the video quality is 24.6 dB at 64% but 26.0 dB at 70%.

Another interesting trend is that the position of the convergence points heavily depends on the temporal and structural complexity of the respective sequence. It can be observed that the higher the diversity of the input content, the more bits are needed to represent the compressed data of B frames. Consequently, more complex sequences such as football, highway, and mobile have convergence points at higher discard rates than simple, homogeneous sequences. This also implies that the performance gain of the author's approach Ψ_ASP is more significant with heterogeneous video streams.

Up to now, the efficiency of the distortion estimation schemes Ψ_ASP and Ψ_FT was only evaluated using a single, fixed GOP structure. In a next step, the impact of the two GOP parameters M and N was evaluated. Figure 6.3 depicts the results of this experiment for the sequence football. The parameters M and N range from 1 to 5 and from 6 to 36, respectively, so a broad range of GOP structures is covered. The results reveal that the GOP length N does not influence the performance of either prioritization mechanism, which implies that the performance gain of Ψ_ASP is relatively constant across different choices of N. Regarding the number of consecutive B frames, it turns out that Ψ_ASP's performance gain positively correlates with M. This seems reasonable because increasing the number of B frames inevitably increases the proportion of B frames in terms of bits, which in turn shifts the first convergence point to the right (to a higher discard rate). Consequently, distortion estimation schemes can select discard candidates from a broader range, which promotes the efficiency of Ψ_ASP for the aforementioned reasons.
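The link between frame-type bit shares and the convergence points can be made explicit with a small helper. The sketch below is only illustrative; the trace format (a list of per-frame type/size pairs) is a hypothetical assumption, and the rule of thumb in the final comment is the approximation stated above (e.g., the first convergence point of foreman lies close to its B-frame bit share of 52%).

    def frame_type_bit_shares(frames):
        # frames: list of (frame_type, size_in_bytes) tuples taken from an
        # encoder trace (hypothetical format)
        total = sum(size for _, size in frames)
        return {t: sum(size for ftype, size in frames if ftype == t) / total
                for t in ('I', 'P', 'B')}

    # Rule of thumb from the measurements above: the first convergence point of
    # Psi_FT and Psi_ASP lies roughly at the B-frame bit share, since up to that
    # discard rate both schemes drop (almost exclusively) B-frame packets.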

[Figure 6.3.: Performance gain of Ψ_ASP over Ψ_FT at different GOP settings with sequence football. Six panels, for discard rates of 5%, 10%, 15%, 20%, 25%, and 30%, plot the delta PSNR against the GOP length N for M = 1 to 5.]

To summarize the experiments, Figure 6.4 depicts a boxplot of all obtained results. As already noted in connection with Figure 6.2, the use of Ψ_ASP leads to considerable improvements in video quality over Ψ_FT at realistic discard rates. Moreover, at moderate discard rates, the performance differences are highly scattered, which can be attributed to the diversity of the test sequences with respect to temporal and spatial complexity. Finally, the advantage of fine-grained packet-level distortion estimation over frame-level distortion estimation is demonstrated by considering the case of a single packet drop. In Figure 6.5, three discard scenarios are depicted in which a single slice of a B frame is not available to the decoder. The left and right pictures show the results of the error concealment process when Ψ_FT and Ψ_ASP, respectively, are used to determine the least important slice of the frame. As can be seen, using Ψ_ASP for the decision process leads to less perceptible artifacts, whereas with Ψ_FT the affected regions are mainly characterized by blurriness and annoying block margins.

[Figure 6.4.: Summary of all experiments with M = 2 and N = 12: average video quality improvement (delta PSNR [dB]) when using Ψ_ASP instead of Ψ_FT, plotted per discard rate [%].]

[Figure 6.5.: Discard of the least important packet determined by Ψ_FT (left pictures) and Ψ_ASP (right pictures) for the sequences (a) carracing, (b) kitchen, and (c) soccer.]
