PACKET LOSS PROTECTION FOR H.264-BASED VIDEO CONFERENCING


by

Dong Zhang
B.A.Sc., Simon Fraser University, 2007

a Thesis submitted in partial fulfillment of the requirements for the degree of Master of Applied Science in the School of Engineering Science

© Dong Zhang 2010
SIMON FRASER UNIVERSITY
Fall 2010

All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for Fair Dealing. Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.


Declaration of Partial Copyright Licence

The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.

The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the Institutional Repository link of the SFU Library website) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work.

The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies. It is understood that copying or publication of this work for financial gain shall not be allowed without the author's written permission.

Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence.

While licensing SFU to permit the above uses, the author retains copyright in the thesis, project or extended essays, including the right to change the work for subsequent purposes, including editing and publishing the work in whole or in part, and licensing other parties, as the author may desire.
The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive. Simon Fraser University Library Burnaby, BC, Canada Last revision: Spring 09

Abstract

In two-way video conferencing applications, video packets transmitted through various networks can suffer from loss. It is well known that forward error correction (FEC) coding is an effective tool for combating packet loss. However, designing an FEC code that has strong recovery capability while satisfying real-life requirements such as low complexity and low delay poses an intriguing challenge. In this project, the first objective that we achieved is the design of a practical multi-rate packet loss recovery FEC code that has linear encoding and decoding complexity and low end-to-end delay, and that is backward compatible with systems that do not support this particular FEC coding scheme. A transport protocol that is compatible with the existing RTP protocol for transmitting H.264 coded video frames was also proposed. In our application, an appropriate FEC code rate can be assigned provided that the effect of packet loss on the decoded video quality can be predicted before actual decoding occurs. Consequently, the second component of the project is the design and implementation of a real-time end-to-end distortion estimator (EEDE) for the H.264 codec, which runs on the encoder side and is capable of estimating decoder-side distortions introduced by packet loss. With these tools at our disposal, an optimization process is then performed with the objective of minimizing the decoder-side visual distortion while satisfying a user-defined FEC code rate. The end product of this project is a complete real-time embedded video conferencing system with FEC encoding and decoding capability, running the H.264 codec at up to WVGA resolution, 30 frames per second (fps) and 2 megabits per second (Mbps).

Acknowledgments

I would like to thank my senior supervisor Dr. Jie Liang and my manager and technical supervisor Dr. Inderpreet Singh for their helpful and innovative technical suggestions. I would like to thank Dr. Jiangchuan Liu for being my supervisor, Dr. Shahram Payandeh as the chair and Dr. Gary Wang as the examiner of my defense. Last but not least, I would also like to thank my parents for their loving support in my fledgling years, which cultivated precious values and a personality that appreciates life's challenges and enjoyments.

Contents

Approval
Abstract
Acknowledgments
Contents
List of Tables
List of Figures

1 Introduction
Proposed FEC Code And Existing FEC Codes
Distortion Estimator and Similar Methods
System Requirement and System Overview
Main Contributions
Source Distortion Model
Propagated Distortion
Mismatch Distortion Model
Summary
Optimization Process Integration
FEC code and transport protocol design
Code Construction
Encoding Algorithm
3.3 Decoding Algorithm
Rate Distortion Model For Optimization Integration
Transport Protocol Design
RFC3984 Protocol Integration
FEC Payload IDs
FEC Scheme Information
FEC Code Block Information
Protocol Interpretation For Non-FEC Receivers
Optimization framework
5 Results
FEC Code Design Performance
End to End Distortion Estimator Performance
Future Works and Conclusions
Future Works
Streaming Application for FEC
Data Protection and Encryption
Distortion Optimized Instantaneous Decoding Refresh (IDR) Request
Distortion Optimized Mode Decision
Distortion Estimation in Scalable Video Coding
Conclusions
A EEDE Results Group A
B EEDE Results Group B
C EEDE Results Group C
D EEDE Results Group D
References

List of Tables

3.1 Rate distortion table for K=15 and N=
NALU Octet Type field definition from RFC
FEC Payload IDs field
FEC Scheme Information field
FEC Code Block Information field
Resolution for all the test sequences
ASMR for all the test sequences. FPG stands for frames per GOP. CIF sequences are encoded at 512kbps with 300 byte slice size. 525P sequences are encoded at 1Mbps with 600 byte slice size
ASMR for football and suzie at 2Mbps, with 1000 byte slice

List of Figures

1.1 System Block Diagram
MB overlap in Inter prediction mode
FEC code Tanner graph
A special case for message passing algorithm
Encoding Cost
Decoding Cost
Loss rate is varied while keeping the same code rate and code block size
Code rate is varied while keeping the same loss rate
Asymptotic performance of the code design is tested by increasing the block size while channel loss rate and code rate are fixed
A.1 Encoder and decoder SSE comparison for Akiyo at 5% packet loss rate
A.2 Encoder and decoder PSNR comparison for Akiyo at 5% packet loss rate
A.3 Encoder and decoder SSE comparison for Coastguard at 5% packet loss rate
A.4 Encoder and decoder PSNR comparison for Coastguard at 5% packet loss rate
A.5 Encoder and decoder SSE comparison for Container at 5% packet loss rate
A.6 Encoder and decoder PSNR comparison for Container at 5% packet loss rate
A.7 Encoder and decoder SSE comparison for Foreman at 5% packet loss rate
A.8 Encoder and decoder PSNR comparison for Foreman at 5% packet loss rate
A.9 Encoder and decoder SSE comparison for News at 5% packet loss rate
A.10 Encoder and decoder PSNR comparison for News at 5% packet loss rate
A.11 Encoder and decoder SSE comparison for Silent at 5% packet loss rate
A.12 Encoder and decoder PSNR comparison for Silent at 5% packet loss rate
A.13 Encoder and decoder SSE comparison for Stefan at 5% packet loss rate
A.14 Encoder and decoder PSNR comparison for Stefan at 5% packet loss rate
A.15 Encoder and decoder SSE comparison for City at 5% packet loss rate
A.16 Encoder and decoder PSNR comparison for City at 5% packet loss rate
A.17 Encoder and decoder SSE comparison for Football at 5% packet loss rate
A.18 Encoder and decoder PSNR comparison for Football at 5% packet loss rate
A.19 Encoder and decoder SSE comparison for Suzie at 5% packet loss rate
A.20 Encoder and decoder PSNR comparison for Suzie at 5% packet loss rate
A.21 Encoder and decoder SSE comparison for Train at 5% packet loss rate
A.22 Encoder and decoder PSNR comparison for Train at 5% packet loss rate
B.1 Encoder and decoder SSE comparison for Akiyo at 10% packet loss rate
B.2 Encoder and decoder PSNR comparison for Akiyo at 10% packet loss rate
B.3 Encoder and decoder SSE comparison for Coastguard at 10% packet loss rate
B.4 Encoder and decoder PSNR comparison for Coastguard at 10% packet loss rate
B.5 Encoder and decoder SSE comparison for Container at 10% packet loss rate
B.6 Encoder and decoder PSNR comparison for Container at 10% packet loss rate
B.7 Encoder and decoder SSE comparison for Foreman at 10% packet loss rate
B.8 Encoder and decoder PSNR comparison for Foreman at 10% packet loss rate
B.9 Encoder and decoder SSE comparison for News at 10% packet loss rate
B.10 Encoder and decoder PSNR comparison for News at 10% packet loss rate
B.11 Encoder and decoder SSE comparison for Silent at 10% packet loss rate
B.12 Encoder and decoder PSNR comparison for Silent at 10% packet loss rate
B.13 Encoder and decoder SSE comparison for Stefan at 10% packet loss rate
B.14 Encoder and decoder PSNR comparison for Stefan at 10% packet loss rate
B.15 Encoder and decoder SSE comparison for City at 10% packet loss rate
B.16 Encoder and decoder PSNR comparison for City at 10% packet loss rate
B.17 Encoder and decoder SSE comparison for Football at 10% packet loss rate
B.18 Encoder and decoder PSNR comparison for Football at 10% packet loss rate
B.19 Encoder and decoder SSE comparison for Suzie at 10% packet loss rate
B.20 Encoder and decoder PSNR comparison for Suzie at 10% packet loss rate
B.21 Encoder and decoder SSE comparison for Train at 10% packet loss rate
B.22 Encoder and decoder PSNR comparison for Train at 10% packet loss rate
C.1 Encoder and decoder SSE comparison for Akiyo at 10% packet loss rate
C.2 Encoder and decoder PSNR comparison for Akiyo at 10% packet loss rate
C.3 Encoder and decoder SSE comparison for Coastguard at 10% packet loss rate
C.4 Encoder and decoder PSNR comparison for Coastguard at 10% packet loss rate
C.5 Encoder and decoder SSE comparison for Container at 10% packet loss rate
C.6 Encoder and decoder PSNR comparison for Container at 10% packet loss rate
C.7 Encoder and decoder SSE comparison for Foreman at 10% packet loss rate
C.8 Encoder and decoder PSNR comparison for Foreman at 10% packet loss rate
C.9 Encoder and decoder SSE comparison for News at 10% packet loss rate
C.10 Encoder and decoder PSNR comparison for News at 10% packet loss rate
C.11 Encoder and decoder SSE comparison for Silent at 10% packet loss rate
C.12 Encoder and decoder PSNR comparison for Silent at 10% packet loss rate
C.13 Encoder and decoder SSE comparison for Stefan at 10% packet loss rate
C.14 Encoder and decoder PSNR comparison for Stefan at 10% packet loss rate
C.15 Encoder and decoder SSE comparison for City at 10% packet loss rate
C.16 Encoder and decoder PSNR comparison for City at 10% packet loss rate
C.17 Encoder and decoder SSE comparison for Football at 10% packet loss rate
C.18 Encoder and decoder PSNR comparison for Football at 10% packet loss rate
C.19 Encoder and decoder SSE comparison for Suzie at 10% packet loss rate
C.20 Encoder and decoder PSNR comparison for Suzie at 10% packet loss rate
C.21 Encoder and decoder SSE comparison for Train at 10% packet loss rate
C.22 Encoder and decoder PSNR comparison for Train at 10% packet loss rate
D.1 Encoder and decoder SSE comparison for Football at 10% packet loss rate
D.2 Encoder and decoder PSNR comparison for Football at 10% packet loss rate
D.3 Encoder and decoder SSE comparison for Suzie at 10% packet loss rate
D.4 Encoder and decoder PSNR comparison for Suzie at 10% packet loss rate

Chapter 1 Introduction

Video conferencing over computer networks has become more and more popular. However, its quality can be seriously affected by packet loss during transmission. Currently available methods for packet loss protection include automatic repeat request (ARQ) [4], redundant picture coding specified in the H.264 video coding standard [9], and forward error correction (FEC) [15] based packet loss recovery schemes [22, 16, 19]. The ARQ approach requires the receiver to notify the sender of packet loss so that the missing packets can be resent. The disadvantage is that retransmission is provided by TCP, whereas video conferencing applications use the user datagram protocol (UDP) [20], which does not support retransmission, because of the low end-to-end delay constraint. The H.264 standard also supports redundant picture coding, where potentially each frame can be duplicated for loss tolerance. Redundancy introduced in this fashion is brute force in the sense that a large amount of bandwidth is spent on the redundant pictures. Furthermore, if packet loss is caused by network congestion, adding more data to the bitstream can only aggravate the situation. FEC coding, on the other hand, does not require any retransmission from the sender and uses the bandwidth much more efficiently.

In recent years, much research effort has been spent on mitigating the effect of packet loss to maximize the end user experience in video conferencing applications. Usually some form of FEC coding is involved in the system at the physical transmitter and receiver layer for lost packet recovery. Another area that attracts research interest is understanding how packet loss affects video coding quality. Many algorithms have been introduced for estimating, on the H.264 encoder side, the distortion that packet loss induces at the H.264 decoder. From the encoder point of view, given a powerful channel code and a means to estimate the decoder quality, one can combine the quality estimates with the channel code to perform resource allocation efficiently, with the objective of maximizing perceived decoder quality. The process of combining source coding information with channel coding for the best rate distortion tradeoff is referred to as joint channel source optimization.

FEC coding is a widely employed channel coding method in communication systems for correcting errors introduced during data transmission. Many communication standards have adopted FEC coding as an error correction method to improve reliability. In recent years, using FEC coding to correct erasures rather than errors has become an attractive research area. This trend is facilitated by the massive deployment of Wi-Fi and Ethernet network infrastructures. Consequently many FEC codes are specially designed for erasure recovery. Examples include LT code [16] and Raptor code [22]. Classical designs like Reed Solomon code [12] and LDPC code [6] have also resurfaced and been redesigned to work with erasures rather than errors. A common problem with these FEC codes is that good performance is achieved only as the code block length increases. For very short block lengths, few algorithms have been proposed for how the code structure should be designed. Furthermore, many FEC codes have high encoding and decoding complexity and high end-to-end delay, which makes them unsuitable for real-time implementation.

On the source coding front, and for the H.264 codec specifically, a number of algorithms have been proposed for estimating, on the encoder side, the decoding quality when part of a video frame is missing due to transmission loss [14, 5, 27, 29, 24]. Even though they are shown to give excellent estimation performance, many of them are either high in complexity and memory demand or do not consider useful coding features of the H.264 standard. Consequently many of the proposed algorithms are not suitable choices when real-time implementation is concerned.

Frequently a joint optimization process is carried out by combining a source coding rate distortion model with a channel coding rate distortion model to minimize the overall distortion of the system [5]. Many of the proposed joint channel source optimization methods use building blocks that have practical implementation issues, which render them unattractive for our application. The objective of this project is to design a suitable FEC code and a distortion estimation algorithm for our H.264 video application. These two modules are then used jointly in an optimization process such that rate allocation on the FEC code is performed to minimize video distortion. Most importantly, the H.264 video conferencing system that employs this joint optimization method needs to run in real time at WVGA resolution, 30 frames per second and a bitrate of 2Mbps.

The thesis is organized as follows. Existing FEC coding schemes and source coding distortion estimation algorithms are briefly reviewed in the rest of this chapter. The system configuration and requirements are also presented in this chapter to give readers an understanding of the system design criteria and the issues that need to be addressed. Chapter two introduces our end to end distortion estimation algorithm. Chapter three presents the proposed FEC code design. In chapter four, the optimization framework that combines the FEC code and the distortion estimator for distortion optimized rate allocation is explained. Performance analysis for the proposed FEC code and the end to end distortion estimator is presented in chapter five. Finally, other applications that can utilize the building blocks introduced by this project, together with some concluding remarks, are presented in chapter six.

1.1 Proposed FEC Code And Existing FEC Codes

Most FEC codes are block codes. This means that source messages containing the actual data are grouped for FEC encoding and each source message is treated as one symbol. The output of the FEC encoding process is the set of coded messages, which is usually longer than the original set of data messages. The extra symbols are generated from the source message symbols and carry recovery information. The code rate of a specific FEC code is defined as

$$\text{Code Rate} = \frac{K}{N} \qquad (1.1)$$

where K is the number of source messages and N is the total number of coded symbols.

All FEC codes can be categorized into two classes: systematic codes and non-systematic codes. In a systematic code the original data messages are embedded in the coded messages, whereas in a non-systematic code the coded messages do not contain the original source messages.
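The definitions above can be made concrete with a small sketch. The Python below is only an illustration, not the code construction proposed in this thesis: `code_rate` evaluates Equation (1.1), and `systematic_encode` builds a toy systematic block code whose first K coded symbols are the source symbols themselves. The parity grouping (`source[i::num_parity]`) and both function names are assumptions chosen for brevity.

```python
from functools import reduce


def code_rate(k: int, n: int) -> float:
    """Code rate K/N as in Equation (1.1)."""
    if not 0 < k <= n:
        raise ValueError("require 0 < K <= N")
    return k / n


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length symbols."""
    return bytes(x ^ y for x, y in zip(a, b))


def systematic_encode(source: list[bytes], n: int) -> list[bytes]:
    """Toy systematic encoder: output = K source symbols + (N - K) XOR parities.

    The grouping rule is hypothetical: parity i is the XOR of source symbols
    i, i + P, i + 2P, ... where P = N - K.
    """
    k = len(source)
    if not k < n <= 2 * k:
        raise ValueError("this sketch requires K < N <= 2K")
    if len({len(s) for s in source}) != 1:
        raise ValueError("all source symbols must have the same length")
    num_parity = n - k
    parity = [reduce(xor_bytes, source[i::num_parity]) for i in range(num_parity)]
    # Systematic property: the source symbols appear unchanged at the front.
    return list(source) + parity


if __name__ == "__main__":
    src = [bytes([i] * 4) for i in range(1, 7)]  # K = 6 dummy symbols
    coded = systematic_encode(src, 8)            # N = 8
    print(code_rate(len(src), len(coded)), coded[:6] == src)
```

Because the code is systematic, a receiver without FEC support can simply read the first K symbols and ignore the rest.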
The advantage of a systematic code is that if the message receiver does not support FEC decoding, the original message can still be recovered with the aid of an extra signaling protocol. Almost all classic FEC codes were originally invented to correct transmission errors rather than to recover from loss. One example of a channel that introduces loss is the binary erasure channel (BEC) [3]. When decoding for losses in the BEC, the most common algorithm used is called message passing [19], which will be briefly reviewed in a later section. The FEC code proposed in this project is a systematic code, and a matching transport layer protocol is also defined.

There are a number of FEC codes that are well known. Low density parity check (LDPC) code was invented by Gallager [6]. Just like Reed Solomon code, LDPC code was invented for error correction rather than erasure recovery. Recently LDPC code was rediscovered for recovering packet loss. Computationally, XOR operations can be used for encoding and decoding, which is much easier to implement than Reed Solomon code. Due to its capacity achieving nature, there have been many proposals on how to optimize the LDPC code structure [8, 21, 2, 13, 10]. The drawback of LDPC code lies in its use of a sparse parity check matrix. Consequently a long code block length is needed for good performance. In low delay applications, a very long code block length is not ideal.

Reed Solomon code [15] was introduced by Irving S. Reed and Gustave Solomon in 1960. Reed Solomon code can be constructed as a systematic FEC code. Originally it was invented for correcting errors. It has very good error correction capability and can operate at very short block lengths. For example, suppose there are K source messages and N coded messages are produced by the encoding process. Among the N coded messages, the first K are equal to the original source messages. Any K out of the N coded messages can be used to recover the original K source messages, which means N − K lost messages can be tolerated. However, the drawback of Reed Solomon code lies in its complexity. Because the coefficients are values chosen from a finite field, the encoding and decoding processes are very difficult to implement with low complexity in a software system. Blomer proposed an implementation friendly erasure code [17], which is based on the same principle as Reed Solomon code but can be implemented using XOR based operations. Even though this makes it possible to integrate Reed Solomon code in a software system with only XOR operations, the complexity of the encoding and decoding processes is still not acceptable for real-time applications.

There are also FEC codes designed specifically for erasure recovery. LT code is a good example [16]. LT code is called a rateless code because, given the source symbols, any number of coded symbols can be generated. LT code performs well when the code block length is large. Due to the required long block length, rateless codes in their unmodified form are not suitable for real-time video conferencing, for delay and complexity reasons. The decoding method used by rateless codes is message passing. The code construction process for our proposed FEC code can generate as many coded symbols as the user wants; in this sense, the proposed FEC code can also be viewed as a form of rateless code.
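To illustrate how message passing recovers erasures using only XOR operations, the following Python sketch implements a generic peeling decoder. It is not the decoding algorithm of the proposed code (that is presented in chapter three); the representation of parity checks as lists of symbol indices and the name `peel_decode` are assumptions made for this example.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length symbols."""
    return bytes(x ^ y for x, y in zip(a, b))


def peel_decode(received, checks):
    """Iteratively recover erased symbols on an erasure channel.

    received: list of symbols, with None marking an erased (lost) symbol.
    checks:   list of index lists; the XOR of all symbols in a check is zero.
    Returns the symbol list with every recoverable erasure filled in.
    """
    symbols = list(received)
    progress = True
    while progress:
        progress = False
        for check in checks:
            missing = [i for i in check if symbols[i] is None]
            known = [symbols[i] for i in check if symbols[i] is not None]
            # A check with exactly one erased symbol determines that symbol:
            # it equals the XOR of all the other symbols in the check.
            if len(missing) == 1 and known:
                symbols[missing[0]] = reduce(xor_bytes, known)
                progress = True
    return symbols


if __name__ == "__main__":
    s = [bytes([1, 2]), bytes([3, 4]), bytes([5, 6]), bytes([7, 8])]
    p0 = xor_bytes(s[0], s[1])                  # parity over s0, s1
    p1 = reduce(xor_bytes, [s[1], s[2], s[3]])  # parity over s1, s2, s3
    codeword = s + [p0, p1]
    checks = [[0, 1, 4], [1, 2, 3, 5]]
    lossy = list(codeword)
    lossy[1] = lossy[2] = None                  # two erased packets
    print(peel_decode(lossy, checks) == codeword)
```

Each pass only touches checks with a single unknown, so the work per recovered symbol is a handful of XORs. Decoding stalls when every remaining check has two or more erased symbols, which is why the structure of the checks matters at short block lengths.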

In order to perform rate distortion optimized rate allocation, a closed form rate distortion model is needed. This model needs to map a channel loss rate to a loss rate after FEC decoding, given a specific code structure and code rate. Proietti [3] provides a method for deriving a closed form solution for LDPC code performance. It was shown that the bit erasure probability of an individual code construction approaches the ensemble average of the bit erasure probabilities of all possible code constructions as the code block length increases. It is also true that the ensemble average of all code constructions approaches cycle free performance as the code block length increases. Consequently the focus is on deriving a closed form solution for the ensemble average to approximate individual code performance, and asymptotic analysis is used in the derivation. It should be noted that the derived closed form solution is based on the assumption that the code block length is long. The derived solution is also recursive and is therefore costly to evaluate.

In video conferencing applications, a video frame is composed of a number of slices. Due to the strict low delay requirement, FEC coding is applied to each frame by treating each slice as a source symbol. Therefore we are only allowed to use code blocks of extremely short length. Most of the time the code block length is less than 30, in which case the solution from asymptotic analysis can no longer be used as the actual expression. A simple but effective way to develop a rate distortion model for an FEC code is by simulation. The constructed FEC codes are simulated using dummy data for different loss rates and different code rates, and the results are stored in a specific format that can be used by the distortion optimization process during real-time operation.

1.2 Distortion Estimator and Similar Methods

A number of distortion estimation algorithms currently exist. Kenneth and Yang [27] introduced a pixel level estimation model called the recursive optimal per-pixel estimate (ROPE) of end to end distortion. Due to the amount of calculation required at each pixel, the algorithm complexity is very high and it is not practical for real-time implementation. More specifically, the model estimates the distortion of each decoded pixel with sub-pixel motion estimation taken into account, which requires tracking the first and second moments of every pixel. Consequently a very large number of arithmetic operations is involved in estimating the distortion for a single video frame. Furthermore, the algorithm assumes that intra prediction cannot happen when the available neighboring macroblocks are inter coded. The most obvious reason for this restricted intra prediction mode is that intra macroblocks (MBs) are assumed to be error resilient during the experiments. This is true for intra macroblocks in an I slice. However, if intra prediction from inter coded macroblocks is allowed, distortion propagated from inter macroblocks can be carried forward by the intra macroblock as well. One can argue that for better error resiliency it is better to use the restricted intra coding mode. In practice, however, intra prediction in P slices sometimes gives a better prediction than inter prediction, in which case using intra prediction gives better coding efficiency. Consequently it is necessary to address the case where unrestricted intra prediction mode is allowed.

Some effort has been spent on reducing the complexity of ROPE. In [29], pixel level estimation operations are simplified to the block level. However, just as in the original ROPE approach, restricted intra prediction mode is used. Consequently, intra coded macroblocks do not propagate distortion in the estimation model.

A very simple distortion model is proposed by Fallah [5]. The relationship between video distortion and bitrate is explored, and the authors simplify the distortion model by treating packet loss as a reduction in bitrate. A closed-form equation was first developed to model the relationship between video bitrate and distortion; this model was then modified by transforming packet loss into a corresponding bitrate value. The advantage of this scheme is its simplicity, which allows real-time implementation at virtually no increase in CPU usage. However, it is too simple a model to capture the dynamics of a specific video sequence during distortion estimation. It is also noted that the proposed rate distortion model has parameters that are sequence dependent. This raises other issues, such as adaptive parameter tuning during the encoding process to fit the distortion model to the video sequence being encoded.

Liu and Li [14] introduced a macroblock based estimation algorithm. Compared to the pixel level estimation algorithm (ROPE), this is much more practical. The context of the proposed model is pre-coded video transmission. The distortion estimated by the model is calculated between the encoder reconstruction and the decoder reconstruction. This is somewhat different from our objective. Traditionally, video decoder quality is calculated by comparing the raw video sequence to the decoded video sequence. Our proposed model follows this convention, namely, the distortion is defined as the deviation of the decoded sequence from the original raw sequence. Furthermore, just like the other proposed models, restricted intra prediction is assumed, which means intra coded macroblocks do not propagate distortion.
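As a concrete reading of "deviation of the decoded sequence from the original raw sequence", the sketch below computes the sum of squared errors (SSE) and the peak signal to noise ratio (PSNR), the two measures named in the appendix figure captions. It is a minimal illustration that assumes 8-bit samples passed as flat sequences of integers; the function names are hypothetical and the thesis's estimator itself is described in chapter two.

```python
import math


def sse(original, decoded):
    """Sum of squared errors between raw and decoded samples."""
    if len(original) != len(decoded):
        raise ValueError("frames must contain the same number of samples")
    return sum((o - d) ** 2 for o, d in zip(original, decoded))


def psnr(original, decoded, peak=255):
    """PSNR in dB, assuming samples in the range 0..peak (8-bit by default)."""
    err = sse(original, decoded)
    if err == 0:
        return math.inf  # identical frames: no distortion
    mse = err / len(original)
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    raw = [100, 100, 100, 100]
    dec = [100, 102, 98, 100]
    print(sse(raw, dec), round(psnr(raw, dec), 2))
```

Measuring against the raw frame, rather than against the encoder reconstruction, means the result includes both compression distortion and any distortion caused by loss.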

18 CHAPTER 1. INTRODUCTION 7 Ever since rate distortion optimization was introduced into video coding [23], many distortion models have been proposed in efforts to improve video coding quality. In error prone environment, a good end to end distortion model can also benefit many areas of the video codec. Yang [28] introduced a distortion optimized motion compensation scheme following their proposal on ROPE. Heng [7] also devised a similar frame work utilizing an end to end distortion model for mode selection. Consequently the importance of an accurate end to end distortion estimation algorithm cannot be stressed enough. In order to implement a real time embedded conferencing system with FEC protection capability a number of requirements need to be satisfied. The following general requirements apply to FEC encoding algorithm, FEC decoding algorithm, optimization algorithm and end to end distortion estimation algorithm. 1. All algorithms need to be low in complexity in terms of CPU usage such that they can be handled by embedded processors in real time. 2. All algorithms need to have a low demand for memory, which is a very precious system resource. 3. All algorithms should not incur too much end to end delay on the overall system. Because low end to end delay is the defining characteristic of a real time system, whenever end to end delay is in a trade off with other design criteria, satisfying end to end delay requirement should be placed at a higher priority. Our proposed end to end distortion estimation algorithm requires a very low complexity in order to run in real time. More specifically, the capability of our hardware based H.264 encoder is that it can support 720P real time encoding with arbitrary slice at 28 frames per second. The objective of the estimation algorithm is to have the encoder running with the estimator at WVGA resolution at 30 frames per second. 
The distortion estimation model developed in this project has to balance between accuracy and performance. Consequently, any pixel level based estimation models cannot be used due to the massive amount of operations required and thus a macroblock based scheme is chosen. Furthermore, our system has very strict bit budget requirements, any H.264 coding features that preserve quality while lowering the bitrate should be used. Unrestricted intra prediction mode offers good coding efficiency in that aspect and by default unrestricted intra prediction mode is always used. Skipped macroblocks are quite common in video conferencing applications. In most of the

19 CHAPTER 1. INTRODUCTION 8 use cases, the background of the video is stationary. Consequently the mode decision block in the encoder can introduce a significant number of skipped macroblocks to improve coding efficiency. Taking care of skipped macroblocks during distortion estimation is something that was not explicitly addressed by other algorithms. 1.3 System Requirement and System Overview There are six major modules involved in the joint optimization effort. Figure 1.1 illustrates the entire system configuration. The first module is a H.264 encoder which has the distortion estimator running in parallel as frame encoding is performed. The H.264 encoder receives RAW YUV frames as input and outputs are encoded frames with distortion information associated. The H.264 encoder is running in slice encoding mode with a predefined limit on the maximum slice size. In a system where FEC coding is not involved, each slice or network abstraction layer (NAL) [9] unit produced by the H.264 encoder is encapsulated inside a real time transport protocol (RTP) packet based on format specified in RFC3984 [26]. Then each RTP packet is further built into user datagram protocol (UDP) and Internet Figure 1.1: System Block Diagram

protocol (IP) packet for transmission. The maximum slice size is defined to be 1200 bytes. The reason for this value is that the maximum transmission unit (MTU) size is typically 1500 bytes for IPv4 networks built on Ethernet, and selecting a slice size close to the MTU size can minimize processing overhead such as parsing packets internally within the system. In this encoder configuration, if a packet is lost on the network, the decoder can still decode the other received slices without being affected by that single loss. On the other hand, if slice mode is not used, a single frame will be encapsulated inside one RTP packet, and if it is bigger than the MTU size, the IP stack will fragment this large RTP packet into packets smaller than the MTU size. This is very undesirable because if a single fragment is lost during transmission, the entire frame will be thrown away by the IP stack on the receiver side, since an incomplete RTP packet is never passed on to the higher layers of the system. Consequently, for better error resiliency, even without FEC protection, slice mode should be used.

The second module is the end to end distortion estimator. The distortion estimator runs in parallel with the H.264 encoder. After a frame is encoded, the distortion estimator is invoked to calculate the distortion perceived by the decoder when packet loss is present. The distortion estimator needs an estimate of the loss rate on the channel, which is calculated by the receiver and fed back to the transmitter using existing protocols such as the RTP control protocol (RTCP). Unlike in an ARQ scheme, where a feedback request triggers data retransmission, the loss rate feedback does not cause any data retransmission and is sent infrequently at periodic intervals. It should be noted that if FEC capability is enabled in the system, the loss rate passed to the distortion estimator should be the loss rate after FEC decoding is applied, because that is the loss rate the decoder will experience.
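To make the slice size argument above concrete, the following sketch checks that a 1200 byte slice plus its headers fits within a 1500 byte MTU, and compares a fragmented frame with slice mode. It assumes the minimum header sizes (20 bytes for IPv4 without options, 8 bytes for UDP, 12 bytes for the fixed RTP header) and independent losses with the same probability for every fragment. The function names and the example numbers are illustrative and are not part of the system described here.

```python
# Minimum header sizes in bytes (assumption: no IP options, no RTP CSRC list or extension).
IPV4_HEADER_BYTES = 20
UDP_HEADER_BYTES = 8
RTP_HEADER_BYTES = 12

MTU_BYTES = 1500        # MTU assumed in the text
MAX_SLICE_BYTES = 1200  # maximum slice size used by the encoder


def ip_packet_size(slice_bytes):
    """Size of the IP packet that carries one slice in one RTP packet."""
    return slice_bytes + RTP_HEADER_BYTES + UDP_HEADER_BYTES + IPV4_HEADER_BYTES


def fragmented_frame_loss(p, n_fragments):
    """Probability that a frame sent as one large RTP packet is discarded.

    The frame is split into n_fragments IP fragments and is lost as soon as
    any single fragment is lost (assumption: independent fragment losses).
    """
    return 1.0 - (1.0 - p) ** n_fragments


largest_packet = ip_packet_size(MAX_SLICE_BYTES)
fits_in_mtu = largest_packet <= MTU_BYTES
frame_loss = fragmented_frame_loss(0.05, 10)
```

Under these assumptions, a 5% fragment loss rate and a frame split into 10 fragments would discard roughly 40% of the frames, whereas in slice mode a lost packet costs only the slice it carries.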
The estimation algorithm uses a recursive model on a macroblock basis. Compared to pixel level estimation algorithms, this method is more practical and in fact is the only practical approach that guarantees a real time implementation.

The third module is a code rate optimizer. The code rate optimizer receives the distortion information calculated for each frame by the distortion estimator embedded in the encoder and combines it with a pre-calculated FEC code rate distortion model to perform optimization. Since our H.264 encoder runs in slice mode and a number of slices are generated for each video frame, slices belonging to the same video frame are conveniently considered as source message symbols in a single code block. The optimization process operates on a number of frames which compose an optimization group. For low delay reasons, only 2 frames are allowed to be optimized at the same time. Consequently

the end result from the optimizer is a code rate for each video frame which should maximize the visual quality after the video frame is decoded.

The fourth module is the FEC encoder. The FEC encoder receives the slices belonging to one video frame at a time as well as the code rate calculated by the optimization process. The FEC encoding process is carried out using the code rate from the optimizer by treating each slice as a source message symbol. The code structure is pre-calculated and stored on the system for efficiency reasons. The encoding process is based on XOR operations, which are computationally inexpensive. The encoded message symbols contain the original source message symbols, so that receivers without FEC decoding capability can still recover the original message. Because the video slices do not all have the same size, the FEC encoder should be able to encode message symbols with different but similar sizes. A transport protocol is also designed to encapsulate each of the encoded message symbols into one RTP packet by adding extra header information.

The fifth module is the FEC decoder. The FEC decoder uses a message passing method to decode received encoded message symbols. Just like the FEC encoder, the FEC decoder utilizes XOR operations when decoding encoded message symbols. The FEC decoder uses information embedded in the RTP packet, such as the code block size, to decide how to perform FEC decoding.

The sixth module is the H.264 decoder. The H.264 decoder receives slices from the FEC decoder and decodes each slice independently. A last frame repeat packet loss concealment (PLC) algorithm is implemented on the H.264 decoder. This is a simple and commonly used PLC scheme. If a slice is lost, all the macroblocks in that slice are copied from the co-located macroblocks of the previous frame. The distortion estimator also assumes that the decoder behaves this way when encountering missing slices.
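The XOR based, systematic encoding described for the fourth and fifth modules can be illustrated with the simplest possible stand-in: a single parity symbol over the slices of one frame. This is only a sketch under stated assumptions and is not the code structure used in this project, which is not given in this section. Shorter slices are zero padded to the longest slice, the original slice lengths are assumed to be known to the receiver (for example from header information), and the names `encode` and `recover` are hypothetical.

```python
def xor_bytes(a, b):
    """XOR two byte strings, zero padding the shorter one to the longer length."""
    n = max(len(a), len(b))
    a = a.ljust(n, b"\x00")
    b = b.ljust(n, b"\x00")
    return bytes(x ^ y for x, y in zip(a, b))


def encode(slices):
    """Systematic encoding: the source slices followed by one XOR parity symbol."""
    parity = b""
    for s in slices:
        parity = xor_bytes(parity, s)
    return list(slices) + [parity]


def recover(received, lengths):
    """Recover the source slices from the received symbols.

    received: encoded symbols in order, with None marking a lost symbol.
    lengths:  original source slice lengths (assumed to be signalled separately).
    A single lost source slice is rebuilt by XORing all received symbols.
    With more than one loss the missing slices stay None.
    """
    k = len(lengths)
    missing = [i for i, s in enumerate(received) if s is None]
    result = list(received)
    if len(missing) == 1 and missing[0] < k:
        acc = b""
        for s in received:
            if s is not None:
                acc = xor_bytes(acc, s)
        # Trim the zero padding back to the original slice length.
        result[missing[0]] = acc[:lengths[missing[0]]]
    return result[:k]
```

Because the source slices are sent unchanged, a receiver without FEC support can simply ignore the parity symbol. Recovering one missing slice from a single parity check is the smallest case of the iterative erasure recovery that a message passing decoder performs.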
1.4 Main Contributions

The main contribution of this project is the design and implementation of an FEC code and an end to end distortion estimator in an embedded system environment. Every design presented in this project is driven by real life consumer requirements which much purely theoretical research either ignores or simplifies. More specifically, in order to satisfy the low delay requirement, the proposed FEC code design operates at very short block lengths, which differs from the usual long block length use case for many FEC code designs. Furthermore, FEC code design at very short lengths is not a big focus in channel coding research because long

or infinite code block length facilitates asymptotic analysis where the code performance can be nicely bounded. In purely theoretical research, system interoperability does not receive much attention. However, it is imperative that any new design introduced in a real product should ensure forward and backward compatibility. In our case, the FEC code design can only operate properly with the aid of the proposed protocol extension. The end to end distortion estimator designed in this project also differs from many existing estimation algorithms in the sense that the full features of the H.264 codec have to be considered in order to fulfill customer requirements.

In order to protect the ideas and designs presented in this project, patents will be filed and the proposed protocol will possibly be submitted to standards bodies for potential standardization. Furthermore, our research does not stop at this point. More articles detailing further enhancements and improvements will be submitted in the future after the patent application. Currently, these advanced loss recovery algorithms have already been integrated into our video conferencing product, which targets enterprise customers as well as regular consumers.

Chapter 2 Source Distortion Model

For joint channel source optimization, a distortion model for the H.264 codec needs to be developed. The objective is to represent the distortion of a video frame as a function of the packet loss rate seen by the decoder. The notation $D_f(P(f))$ can be used, where $D_f$ denotes the distortion of frame $f$ and $P(f)$ is the packet loss rate for frame $f$. A macroblock (MB) based recursive algorithm rather than a pixel based algorithm [27] is used by the end to end distortion estimator. During the estimation process, two distortion values are calculated for each macroblock. One is the distortion if the macroblock is received. The other is the distortion if the macroblock is lost. Each of the two distortion values is then weighted by the probability of the corresponding event. Furthermore, the encoder assumes that the decoder uses the last frame repeat concealment method to deal with packet loss. This means that if a macroblock is missing, the decoder copies the co-located macroblock from the previous frame. The estimation algorithm requires an estimate of the loss rate on the channel, which is a separate subject and is not covered by this project. The estimated distortion is calculated as a sum of squared errors (SSE) on luminance values only, which is what human eyes are most sensitive to. Each macroblock has 256 pixels. The SSE for two MBs, $MB_a$ and $MB_b$, is given in equation 2.1. The SSE for a frame is calculated by summing the SSE values of all the MBs in the frame.

$$SSE_{MB} = \sum_{i=1}^{256} \left( MB_a(i) - MB_b(i) \right)^2 \tag{2.1}$$

There are four types of macroblocks encountered during frame encoding.

1. Intra coded macroblock from I slice

2. Inter coded macroblock from P slice
3. Intra coded macroblock from P slice
4. Skipped macroblock

Because of the predictive nature of the H.264 codec, motion estimation introduces propagation errors. For example, in a P slice, inter coded macroblocks depend on prediction values from the previous frame during the motion compensation process. If the prediction value from the previous frame is not correct, then the current reconstructed macroblock is distorted as well and carries the distortion forward to the next frame. On the other hand, intra coded macroblocks in an I slice will not introduce propagation errors, because intra prediction is not allowed across slice boundaries and all macroblocks within an I slice are intra coded. However, to improve coding efficiency, constrained_intra_pred_flag [9] is set to 0 for our H.264 encoder. This means that intra coded macroblocks within P slices are allowed to use neighbouring inter coded macroblocks for intra prediction. Consequently, if an intra coded macroblock depends on prediction values from neighbouring inter coded macroblocks, propagation error also occurs. Skipped macroblocks are another way to improve coding efficiency, saving bits for other, more texture rich and unpredictable macroblocks. In video conferencing applications, where a typical scene consists of a person sitting in front of the camera with a stationary background, many macroblocks can potentially be coded using skip mode.

2.1 Propagated Distortion

The distortion of a MB that gets carried forward during the prediction process is defined as propagated distortion. Let $F(f,m,i)$ denote pixel $i$ in macroblock $m$ of the original frame $f$. Let $\hat{F}(f,m,i)$ denote pixel $i$ in macroblock $m$ of the encoder reconstructed frame $f$. Let $\tilde{F}(f,m,i)$ denote pixel $i$ in macroblock $m$ of the decoder reconstructed frame $f$.
The distortion of the macroblock $m$ in frame $f$ can be expressed as

$$D_{MB}(f,m) = (1 - P(f))\, D^{R}_{MB}(f,m) + P(f)\, D^{L}_{MB}(f,m) \tag{2.2}$$

where $D^{R}_{MB}(f,m)$ is the distortion of this macroblock when it is received, $D^{L}_{MB}(f,m)$ is the distortion of this macroblock when it is lost, and $P(f)$ is the loss rate in this frame. We are assuming that the loss rate does not change within this frame.
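Equations 2.1 and 2.2 translate directly into code. The following is a minimal sketch, assuming each macroblock is given as a flat sequence of its 256 luminance samples; the function names are illustrative and do not come from the encoder implementation.

```python
def sse_mb(mb_a, mb_b):
    """Equation 2.1: sum of squared errors over the 256 luminance samples of two MBs."""
    if len(mb_a) != 256 or len(mb_b) != 256:
        raise ValueError("a macroblock must have 256 luminance samples")
    return sum((a - b) ** 2 for a, b in zip(mb_a, mb_b))


def expected_mb_distortion(p_loss, d_received, d_lost):
    """Equation 2.2: received and lost distortions weighted by their probabilities."""
    return (1.0 - p_loss) * d_received + p_loss * d_lost


def sse_frame(mbs_a, mbs_b):
    """Frame SSE: the sum of the MB SSE values over all MBs in the frame."""
    return sum(sse_mb(a, b) for a, b in zip(mbs_a, mbs_b))
```

For example, two macroblocks that differ by 2 in every luminance sample have an SSE of 256 × 4 = 1024, and with a 10% loss rate a macroblock with received distortion 100 and lost distortion 1000 has an expected distortion of 190.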

Next we will define the macroblock distortion when it is lost. When a macroblock is lost, the decoder will make use of error concealment methods to improve perceived visual quality. We are assuming that whenever a macroblock is lost, the co-located macroblock from the previous frame is copied over as a simple error concealment scheme. Consequently, the distortion due to a lost macroblock can be expressed as follows.

$$\begin{aligned}
D^{L}_{MB}(f,m) &= E\{[F(f,m,i) - \tilde{F}(f-1,m,i)]^2\} \\
&= E\{[F(f,m,i) - \hat{F}(f,m,i) + \hat{F}(f,m,i) - \tilde{F}(f-1,m,i)]^2\} \\
&= E\{[F(f,m,i) - \hat{F}(f,m,i)]^2\} + E\{[\hat{F}(f,m,i) - \tilde{F}(f-1,m,i)]^2\} \\
&= D^{Q}_{MB}(f,m) + E\{[\hat{F}(f,m,i) - \hat{F}(f-1,m,i) + \hat{F}(f-1,m,i) - \tilde{F}(f-1,m,i)]^2\} \\
&= D^{Q}_{MB}(f,m) + E\{[\hat{F}(f,m,i) - \hat{F}(f-1,m,i)]^2\} + E\{[\hat{F}(f-1,m,i) - \tilde{F}(f-1,m,i)]^2\} \\
&= D^{Q}_{MB}(f,m) + EFD(f,f-1,m) + D^{mis}_{MB}(f-1,m)
\end{aligned} \tag{2.3}$$

where $D^{Q}_{MB}(f,m)$ is the distortion due to quantization errors for macroblock $m$ in frame $f$, $EFD(f,f-1,m)$ is defined to be the difference between the encoder reconstructed frames $f$ and $f-1$ at macroblock $m$, and $D^{mis}_{MB}(f-1,m)$ is defined to be the encoder and decoder reconstruction mismatch at macroblock $m$ in frame $f-1$. It is assumed that the error terms being separated are uncorrelated, so that the cross terms in the expectations vanish.

Next, we need to find the macroblock distortion if it is received. Our encoder has three modes: inter mode, intra mode and skip mode. In intra mode, there are two cases to consider. If the intra coded MB is inside an I slice, the MB distortion is mainly caused by quantization. On the other hand, if the intra coded MB is from a P slice, there is a chance that the intra prediction is based on a neighbouring inter coded MB, which carries the distortion propagated from the previous frame. Therefore, we will have to consider both quantization distortion and propagated distortion. In inter and skip mode, we will need to consider distortion propagation from the previous frame.
It is noted that a skipped MB is the same as an inter coded MB from the distortion calculation perspective, because the motion vectors are available on the encoder side in both cases. We list the received MB distortion as follows.

If the MB is received, is Intra and is from an I slice:

$$\begin{aligned}
D^{R}_{MB}(f,m) &= E\{[F(f,m,i) - \tilde{F}(f,m,i)]^2\} \\
&= E\{[F(f,m,i) - \hat{F}(f,m,i)]^2\} \\
&= D^{Q}_{MB}(f,m)
\end{aligned} \tag{2.4}$$

When received, the decoder reconstruction is the same as the encoder reconstruction.

If the MB is received, is Intra and is from a P slice:

$$\begin{aligned}
D^{R}_{MB}(f,m) &= E\{[F(f,m,i) - \tilde{F}(f,m,i)]^2\} \\
&= E\{[F(f,m,i) - \hat{F}(f,m,i) + \hat{F}(f,m,i) - \tilde{F}(f,m,i)]^2\} \\
&= E\{[F(f,m,i) - \hat{F}(f,m,i)]^2\} + E\{[\hat{F}(f,m,i) - \tilde{F}(f,m,i)]^2\} \\
&= D^{Q}_{MB}(f,m) + E\{[\hat{e}(f,m,i) + \hat{F}(f,m',i) - \tilde{e}(f,m,i) - \tilde{F}(f,m',i)]^2\} \\
&= D^{Q}_{MB}(f,m) + E\{[\hat{F}(f,m',i) - \tilde{F}(f,m',i)]^2\} \\
&= D^{Q}_{MB}(f,m) + D^{mis}_{MB}(f,m')
\end{aligned} \tag{2.5}$$

where $\hat{e}(f,m,i)$ and $\tilde{e}(f,m,i)$ are the prediction errors on the encoder and decoder, which are the same when the MB is received. $\hat{F}(f,m',i)$ and $\tilde{F}(f,m',i)$ are the predictions from the neighbouring MB $m'$ on the encoder and decoder side, depending on the intra prediction mode. $D^{mis}_{MB}(f,m')$ is the mismatch between the neighbouring MB used by the encoder and the one used by the decoder for intra prediction. It should be calculated by assuming the neighbouring MBs are received, because they come from the same received slice. $D^{mis}_{MB}(f,m')$ is calculated based on the availability of the neighbouring MBs and the intra prediction mode. Due to the complexity of the pixel averaging operations in intra prediction mode, an approximation is used in order to save CPU usage.

$$D^{mis}_{MB}(f,m') = D^{mis}_{MB}(f,m_{top}) \tag{2.6}$$

$$D^{mis}_{MB}(f,m') = D^{mis}_{MB}(f,m_{left}) \tag{2.7}$$

$$D^{mis}_{MB}(f,m') = \frac{D^{mis}_{MB}(f,m_{top}) + D^{mis}_{MB}(f,m_{left})}{2} \tag{2.8}$$

where $m_{top}$ is the top neighbouring MB and $m_{left}$ is the left neighbouring MB. Equation 2.6 is used if only the top neighbour is used for intra prediction. Equation 2.7 is used if only the left neighbour is used for intra prediction. Equation 2.8 is used if both MBs are used for intra prediction.
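The lost case of equation 2.3 and the two received intra cases of equations 2.4 to 2.8 can be sketched as follows. This is a minimal illustration with hypothetical function names. It assumes that the quantization distortion, the encoder frame difference (EFD) and the mismatch values of the neighbours have already been computed, and it treats an intra MB that uses neither neighbour as having no propagated mismatch, a case the text does not cover.

```python
def lost_distortion(d_q, efd, d_mis_prev):
    """Equation 2.3: quantization distortion + encoder frame difference
    + mismatch of the co-located MB in the previous frame."""
    return d_q + efd + d_mis_prev


def intra_neighbour_mismatch(d_mis_top=None, d_mis_left=None):
    """Equations 2.6 to 2.8: approximate mismatch of the intra prediction source.

    Pass None for a neighbour that is not used for intra prediction.
    Returning 0.0 when no neighbour is used is an assumption of this sketch.
    """
    used = [d for d in (d_mis_top, d_mis_left) if d is not None]
    if not used:
        return 0.0
    return sum(used) / len(used)


def received_intra_distortion(d_q, in_i_slice, d_mis_top=None, d_mis_left=None):
    """Equation 2.4 for an I slice, equation 2.5 for an intra MB in a P slice."""
    if in_i_slice:
        return d_q
    return d_q + intra_neighbour_mismatch(d_mis_top, d_mis_left)
```

With only one neighbour in use the function returns that neighbour's mismatch unchanged, which matches equations 2.6 and 2.7, and with both in use it returns their average as in equation 2.8.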

Figure 2.1: MB overlap in Inter prediction mode

If the MB is received and is Inter or Skip:

$$\begin{aligned}
D^{R}_{MB}(f,m) &= E\{[F(f,m,i) - \tilde{F}(f,m,i)]^2\} \\
&= E\{[F(f,m,i) - \hat{F}(f,m,i) + \hat{F}(f,m,i) - \tilde{F}(f,m,i)]^2\} \\
&= E\{[F(f,m,i) - \hat{F}(f,m,i)]^2\} + E\{[\hat{F}(f,m,i) - \tilde{F}(f,m,i)]^2\} \\
&= D^{Q}_{MB}(f,m) + E\{[\hat{e}(f,m,i) + \hat{F}(f-1,m,i) - \tilde{e}(f,m,i) - \tilde{F}(f-1,m,i)]^2\} \\
&= D^{Q}_{MB}(f,m) + E\{[\hat{F}(f-1,m,i) - \tilde{F}(f-1,m,i)]^2\} \\
&= D^{Q}_{MB}(f,m) + D^{mis}_{MB}(f-1,a)\frac{A_a}{A_{MB}} + D^{mis}_{MB}(f-1,b)\frac{A_b}{A_{MB}} + D^{mis}_{MB}(f-1,c)\frac{A_c}{A_{MB}} + D^{mis}_{MB}(f-1,d)\frac{A_d}{A_{MB}}
\end{aligned} \tag{2.9}$$

When received, the distortion consists of propagated distortion due to motion estimation as well as quantization distortion. $\hat{e}(f,m,i)$ and $\tilde{e}(f,m,i)$ are the residual signals on the encoder and decoder side. When the MB is received, they are identical. $\hat{F}(f-1,m,i)$ and $\tilde{F}(f-1,m,i)$ are the pixels from frame $f-1$ that are used as the inter prediction for MB $m$ during motion compensation on the encoder and decoder side. Consequently the MB distortion consists of quantization distortion and encoder decoder mismatch from the previous frame. Here $A_a$, $A_b$, $A_c$ and $A_d$ are the overlapping regions of the motion compensated macroblock

with the macroblocks from the previous frame. Figure 2.1 illustrates this. It should be noted that this is a complexity saving estimation scheme for real time implementation. It is not as accurate as the ROPE approach, which operates on each pixel. If the macroblock is skipped, the motion vector may or may not be 0; consequently we can utilize the same approach as for an Inter MB.

2.2 Mismatch Distortion

The next step is to find the encoder and decoder reconstruction mismatch. We define the encoder and decoder reconstruction difference for frame $f$ as follows. The encoder and decoder reconstruction model used is very similar to what was proposed by Liu and Li [14].

$$\begin{aligned}
D^{mis}_{MB}(f,m) &= E\{[\hat{F}(f,m,i) - \tilde{F}(f,m,i)]^2\} \\
&= (1 - P(f))\, D^{R\_mis}_{MB}(f,m) + P(f)\, D^{L\_mis}_{MB}(f,m)
\end{aligned} \tag{2.10}$$

where $D^{R\_mis}_{MB}(f,m)$ and $D^{L\_mis}_{MB}(f,m)$ are the mismatch distortion when the macroblock is received and lost, respectively. Similarly, $D^{L\_mis}_{MB}(f,m)$, the mismatch distortion when the MB is lost, can be expressed as

$$\begin{aligned}
D^{L\_mis}_{MB}(f,m) &= E\{[\hat{F}(f,m,i) - \tilde{F}(f,m,i)]^2\} \\
&= E\{[\hat{F}(f,m,i) - \tilde{F}(f-1,m,i)]^2\} \\
&= E\{[\hat{F}(f,m,i) - \hat{F}(f-1,m,i)]^2\} + E\{[\hat{F}(f-1,m,i) - \tilde{F}(f-1,m,i)]^2\} \\
&= EFD(f,f-1,m) + D^{mis}_{MB}(f-1,m)
\end{aligned} \tag{2.11}$$

For the received encoder and decoder mismatch distortion, we also need to consider the coding mode of the macroblock. Similar to finding the MB distortion, we have to consider two cases for intra coded MBs. If the intra coded MB is from an I slice, the received mismatch distortion will be 0. If it is from a P slice, there is a chance that the intra MB is predicted from a neighbouring inter coded MB which carries distortion propagated from the previous frames.

If the MB is received, is Intra and is from an I slice:

$$D^{R\_mis}_{MB}(f,m) = 0 \tag{2.12}$$
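The overlap weighting of equation 2.9 and the mismatch recursion of equations 2.10 and 2.11 can be sketched together as follows. This is an illustration under simplifying assumptions that are not taken from the encoder: one integer pel motion vector per 16×16 macroblock, no clipping at frame borders, and a previous frame mismatch map indexed by macroblock offsets relative to the current MB. All function names are hypothetical.

```python
MB_SIZE = 16  # luminance macroblock width and height in pixels


def overlap_weights(mv_x, mv_y):
    """Area fractions A/A_MB of the previous frame MBs covered by the
    motion compensated block (the regions a, b, c and d of equation 2.9).

    Keys are (column offset, row offset) relative to the current MB.
    Assumes an integer pel motion vector and ignores frame borders.
    """
    col, dx = divmod(mv_x, MB_SIZE)
    row, dy = divmod(mv_y, MB_SIZE)
    weights = {}
    for c, width in ((col, MB_SIZE - dx), (col + 1, dx)):
        for r, height in ((row, MB_SIZE - dy), (row + 1, dy)):
            if width * height > 0:
                weights[(c, r)] = (width * height) / (MB_SIZE * MB_SIZE)
    return weights


def received_inter_distortion(d_q, mv_x, mv_y, d_mis_prev):
    """Equation 2.9: quantization distortion plus the area weighted mismatch
    of the overlapped MBs in the previous frame.

    d_mis_prev maps (column offset, row offset) to the previous frame mismatch.
    Missing entries count as 0.0, which is an assumption of this sketch.
    """
    propagated = sum(weight * d_mis_prev.get(offset, 0.0)
                     for offset, weight in overlap_weights(mv_x, mv_y).items())
    return d_q + propagated


def mismatch(p_loss, d_r_mis, efd, d_mis_prev_colocated):
    """Equations 2.10 and 2.11: expected encoder and decoder mismatch of one MB.

    d_r_mis is the received mismatch for the MB's coding mode
    (0 for an intra MB in an I slice, as in equation 2.12).
    """
    d_l_mis = efd + d_mis_prev_colocated
    return (1.0 - p_loss) * d_r_mis + p_loss * d_l_mis
```

A zero motion vector puts the whole weight on the co-located macroblock, while any other vector spreads the weight over up to four macroblocks whose fractions sum to one.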
