DISTORTION-AWARE RETRANSMISSION OF VIDEO PACKETS AND ERROR CONCEALMENT USING THUMBNAIL

Zhi Li
EE398 Course Project, Winter 07/08

ABSTRACT

In this project, we investigate retransmission-based robust video streaming over lossy packet networks. We propose to send a low-rate video thumbnail, which is a projected representation of the video content, along with the video packets. The receiver decodes the video and applies error concealment to compensate for packet loss. Upon receiving the thumbnail, the receiver estimates the local distortion caused by packet loss and uses this estimate to decide which packets to request for retransmission. Additional gain in video quality can be achieved by using the thumbnail to guide error concealment mode selection. Our experimental results demonstrate gains over previously proposed distortion-unaware heuristic methods.

Index Terms: Automatic retransmission request, video streaming, error concealment.

1. INTRODUCTION

Reliable transmission of video content over lossy networks has been a research topic for more than two decades. Practical solutions for robust video transmission fall into two categories: proactive methods such as forward error correction (FEC) and reactive methods such as automatic retransmission request (ARQ). In this work, we explore a new combination of the two approaches, in which a redundant bitstream generated at the server provides the basis for the receiver's retransmission requests.

In conventional ARQ-based video streaming, the receiver is unaware of the video packet content. It sends a retransmission request whenever it detects a packet loss, and the server resends the lost packet as requested. Although such a scheme is simple, it does not give the best performance. More advanced retransmission schemes take the video packet content into account when making retransmission decisions.
For example, Soft ARQ [2] avoids retransmitting late data that has passed its delivery deadline and would be useless at the decoder. In [3], retransmission is prioritized using heuristics about the perceptual importance of the different packetized syntax elements of video. A more recent work [4] estimates perceptual importance at a finer level of granularity, using an analysis-by-synthesis technique. The basic idea is to analyze each packet by simulating its loss, decoding with error concealment, and using the distortion incurred as a measure of the packet's importance. However, this simulation-based approach imposes a heavy computational burden on the server. It also requires the error concealment technique to be standardized, which may not be practical.

In this work, we explore a new possibility for robust video streaming over error-prone networks with retransmission. Instead of making a conditional retransmission decision at the encoder, we shift the complexity to the receiver side and let the receiver decide. As a result, error concealment techniques need not be standardized, since the retransmission decision takes the outcome of the receiver's own error concealment into account. We propose to send a low-rate video thumbnail, which is a projected representation of the video content, along with the video packets. At the receiver, this thumbnail helps estimate the distortion resulting from packet loss, and therefore helps the receiver decide which packets to request for retransmission. Furthermore, the thumbnail can also guide the decoder in error concealment mode selection.

This report is organized as follows. In Section 2, we give a structural overview of the proposed scheme. In Section 3, we discuss the details of realizing the proposed thumbnail-based retransmission and error concealment. Section 4 gives the detailed experimental setup and results, followed by some discussion.
We draw the conclusions in Section 5.

2. SYSTEM OVERVIEW

The principal idea is as follows. Along with the regular video packets, we send a low-rate representation of the video sequence (the video thumbnail) to help the receiver estimate the distortion caused by network packet loss. The thumbnail is essentially a projection of the video content into a lower-dimensional space; it must be small in data size, yet preserve the video properties well enough for accurate distortion estimation. At the receiver, the video sequence is decoded, with error concealment compensating for the lost packets. The thumbnail helps localize the areas in the video sequence (both spatially and temporally) where the error remains significant after error concealment. The receiver then requests retransmission of the packets that are most effective in recovering these areas. For areas where the error is concealed well, retransmission is unnecessary. Therefore, in our proposed scheme, better error concealment leads to better performance. Furthermore, the thumbnail can also help the decoder select the error concealment mode that leads to lower distortion. For example, when the previous frame is lost and badly concealed, the decoder may adaptively select the current block's neighboring blocks as the basis for intra-mode error concealment.

Fig. 1. Overview of the proposed system of using thumbnail for distortion-aware retransmission and error concealment.

Fig. 1 illustrates the proposed system of using the thumbnail for distortion-aware retransmission of video packets and adaptive error concealment. At the sender side, we encode and packetize the video content. We also pass the video sequence to the thumbnail generator, where it is projected into lower dimensions and quantized into a finite-bit representation. At the receiver side, the received packets in the main video stream are first decoded, and error concealment is applied if packets are missing. Upon receipt of the thumbnail, the decoder uses it to help select the best mode for concealing the error in the current portion of the video. With the decoded video in the pixel domain, error localization is performed. Based on this information, the receiver decides on the retransmission of lost packets in a rate-distortion-optimized manner. Section 3 gives more details of the error localization and the packet retransmission decision.

3. THUMBNAIL-AIDED RETRANSMISSION AND ERROR CONCEALMENT

In this section, we discuss the detailed implementation of thumbnail-aided retransmission and error concealment. The module "locate error packets" in Fig. 1 can be further decomposed into several stages. First, based on the decoded video and the thumbnail, the receiver estimates the slice distortion. Second, based on the GOP structure and an additive decaying model, the receiver estimates the packet distortion from the slice distortion. Third, the receiver makes a rate-distortion-optimized retransmission decision using the packet distortion and the packet size. In the sequel, we discuss how each of these steps works. Last, we describe how the thumbnail helps error concealment mode selection.

3.1. Thumbnail generation and slice distortion estimation

The video thumbnail is generated only for I and P pictures, since they influence video quality more than B pictures do. Each thumbnail pixel is generated from a square block (e.g. 3 x 3). The design constraint on the thumbnail is that it must be small in data size, yet preserve the video properties well enough for accurate distortion estimation.

Fig. 2. Subroutine of thumbnail generation and slice distortion estimation.

Fig. 2 illustrates the subroutine of thumbnail generation at the sender and slice distortion estimation at the receiver. Essentially, we need to project each image block onto a single value. In this section, we examine two candidate projection methods. The first method simply projects the block onto its mean. In the second, random-projection-based method [5], the block is first whitened by multiplying it with a pseudo-random sequence of +1 and -1. It is then orthogonally transformed, and a single coefficient is extracted as the thumbnail pixel.

To compare the two methods, we plot the estimated distortion against the actual distortion for each of them in Fig. 3. The test data are extracted from the CIF Foreman sequence. The results suggest that although the random-projection method gives an unbiased estimate, its variance is large. In contrast, the mean-based method always underestimates the distortion, yet its estimation variance is small. We give a detailed analysis of both methods in the Appendix.

Fig. 3. Scatter plot of slice distortion estimation for (a) mean-based estimator and (b) random-projection estimator. The x-axis is the oracle slice MSE obtained using the original video and the decoded video; the y-axis is the estimated slice MSE. (Test sequence: Foreman CIF)

Intuitively, the results can be explained as follows. Local distortion, especially when the block size is small, can be modeled as a random variable with non-zero mean and small variance. The mean-based estimator works well when the pixels within a block are highly correlated, regardless of whether the distortion is zero-mean or not.
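The mean-based projection and the resulting distortion estimate can be illustrated with a minimal sketch. The function names, the 2 x 2 block size, and the toy 4 x 4 frames below are assumptions chosen for illustration; thumbnail quantization is omitted, and the sketch is not the implementation used in the experiments.

```python
# Illustrative sketch only: function names, the block size, and the toy frames
# are assumptions, and thumbnail quantization is omitted.

def block_means(frame, block):
    """Return one mean value per block x block tile of a 2-D list of pixels."""
    rows, cols = len(frame), len(frame[0])
    thumbnail = []
    for r0 in range(0, rows, block):
        thumb_row = []
        for c0 in range(0, cols, block):
            tile = [frame[r][c]
                    for r in range(r0, min(r0 + block, rows))
                    for c in range(c0, min(c0 + block, cols))]
            thumb_row.append(sum(tile) / len(tile))
        thumbnail.append(thumb_row)
    return thumbnail


def estimated_mse(thumbnail, decoded, block):
    """Mean-based estimate: average squared difference between block means."""
    decoded_thumb = block_means(decoded, block)
    diffs = [(t - d) ** 2
             for t_row, d_row in zip(thumbnail, decoded_thumb)
             for t, d in zip(t_row, d_row)]
    return sum(diffs) / len(diffs)


def oracle_mse(original, decoded):
    """True per-pixel MSE; needs the original, so the receiver cannot compute it."""
    diffs = [(o - d) ** 2
             for o_row, d_row in zip(original, decoded)
             for o, d in zip(o_row, d_row)]
    return sum(diffs) / len(diffs)


original = [[10, 10, 20, 20],
            [10, 10, 20, 20],
            [30, 30, 40, 40],
            [30, 30, 40, 40]]
thumbnail = block_means(original, block=2)  # sent alongside the video packets

# Case 1: a constant (fully correlated) error of +3 in the top-left block.
flat_error = [row[:] for row in original]
for r in range(2):
    for c in range(2):
        flat_error[r][c] += 3

# Case 2: same error energy, but alternating signs leave the block mean unchanged.
zero_mean_error = [row[:] for row in original]
for r in range(2):
    for c in range(2):
        zero_mean_error[r][c] += 3 if (r + c) % 2 == 0 else -3

print(oracle_mse(original, flat_error), estimated_mse(thumbnail, flat_error, 2))
print(oracle_mse(original, zero_mean_error), estimated_mse(thumbnail, zero_mean_error, 2))
```

In the first case the error is constant within the block, and the estimate matches the oracle MSE of 2.25. In the second case the error has the same energy but zero mean within the block, and the mean-based estimate drops to 0, which is the underestimation behavior discussed above.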
The random-projection estimator, in contrast, works well for global distortion estimation (as in PSNR estimation of a whole picture or a sequence of pictures), but its estimate varies widely when the distortion is not zero-mean. From the results in Fig. 3, we infer that the local distortion in video due to packet loss (after error concealment) is highly correlated yet not zero-mean. We conclude that the mean-based estimator is better than the random-projection estimator for localized distortions.

3.2. Packet distortion estimation and retransmission decisions

The packetization scheme is as follows. Each picture is divided into multiple slices. To best characterize local distortion, we use slice groups to divide each picture into regular rectangles of equal size. For example, a CIF picture is divided into 6 slices, in each row. Each slice is then packetized into one network packet.

After obtaining the slice distortions, we are in a position to estimate the distortion contribution of each lost packet. This estimated distortion serves as the importance of the lost packet, and therefore as the basis for the retransmission decision. Due to temporal prediction, the loss of a single I or P packet affects several subsequent pictures. In this work, we assume an additive decaying model [1] for packet distortion estimation. Assume the receiver knows the GOP structure. Given a lost packet, the distortion propagates until the next I picture, where intra refresh takes place. The strength of the distortion decays as it propagates, due to the leaky bucket filtering effect. Furthermore, distortions due to different packet losses are assumed to be additive (this has been shown to be a fairly accurate model under moderate packet loss conditions).

Based on this model, we solve for the distortion contribution of each lost packet as follows (refer to Fig. 4). Assume that within a sliding window there are N slices and M lost packets. Let $d_s = [d_s(1), d_s(2), \ldots, d_s(N)]^T$ denote the distortion of the slices, and let $d_p = [d_p(1), d_p(2), \ldots, d_p(M)]^T$ denote the distortion contributions of the lost packets per slice. Let $W$ denote an $N \times M$ matrix whose entry $W(n,m)$ is the weight of the contribution of the $m$th lost packet to the $n$th slice. Then the following relationship can be established:

$$W d_p = d_s.$$

We find the distortion contributions of the lost packets by solving for $d_p$. In practice, however, since the additive model is approximate, we need to solve for $d_p$ in a least-squares sense, subject to $d_p \ge 0$. Solving for $d_p$ therefore becomes the following optimization problem:

$$\min_{d_p} \; \| W d_p - d_s \|^2 \quad \text{s.t.} \quad d_p \ge 0.$$

Fig. 4. Illustration of packet distortion estimation. An additive decaying model [1] is used. Each time, the receiver jointly estimates the distortion within a sliding window. W(n,m) denotes the weight of the distortion contribution of lost packet m on slice n.

However, in a practical causal scenario, when the receiver learns of a lost packet, it may not yet know how much distortion the loss will incur in subsequent pictures that have not arrived. As a consequence, the receiver may under- or overestimate the distortion. To mitigate this, each packet is given multiple retransmission opportunities. The receiver maintains a sliding window. At each retransmission checkpoint, it performs joint estimation within the window and decides whether to request each lost packet in the window that has not yet been retransmitted. As a consequence, if a packet is first estimated to be unimportant but later found to be important, it can still be retransmitted. In Fig. 5, we show the scatter plot of the estimated packet distortion against the oracle distortion.

After the distortion contributions of the lost packets are estimated, we can make retransmission decisions. At each retransmission checkpoint, the receiver has the option to request retransmission of some lost packets. Given a retransmission rate constraint, the receiver needs to make its retransmission decisions so as to minimize distortion. We follow a strategy similar to that of [6]. In brief, the following strategy ensures rate-distortion-optimized retransmission: arrange the packets in descending order of the ratio between packet distortion and packet size, and retransmit in this prioritized order until the rate constraint is met.

3.3. Adaptive error concealment using thumbnail

The benefit of the thumbnail is twofold.
Not only can it help the receiver make intelligent retransmission decisions, it can also guide the decoder in selecting the error concealment mode to improve video quality. Here we consider only two simple error concealment strategies. Intra mode corresponds to the case where the decoder uses neighboring blocks of the same picture to conceal the error in the lost block. Inter mode corresponds to the case where the decoder uses the previous picture to conceal the current picture. In the proposed adaptive error concealment using the thumbnail, for each lost packet (i.e. slice) the decoder conceals the current slice with both the intra and the inter mode. The resulting pictures are used for distortion estimation with the help of the thumbnail, in the same way as in Section 3.1. The decoder then chooses the concealment mode with the lower estimated distortion. In real implementations, other, more sophisticated error concealment methods may be used, but the principle is the same.

Fig. 5. Scatter plot of the packet distortion estimation. The x-axis is the oracle packet MSE obtained from offline simulation of individual packet drops; the y-axis is the estimated packet MSE. (Test sequence: Foreman CIF)

4. EXPERIMENTAL RESULTS

We conduct experiments based on the following settings. We use the standard H.264 reference software JM 3. as the video codec. The main test video sequence is Foreman in CIF format. The video is encoded using default QP (8, 8, ) for I, P and B pictures, respectively. A fixed GOP structure IBPBPBP is used. Each GOP is of size 0. Each picture is grouped into 6 slices of roughly equal rectangular shape, in each row. Each slice is packetized into one network packet. To simulate the network condition in a simple way, we randomly drop packets. For simplicity, we also assume that the feedback channel does not lose acknowledgements. Throughout the experiments, we make sure that the amount of data transmitted in the forward direction is maintained at 100% of the video data size. Therefore, each time the receiver requests retransmission of I and P packets, the sender, upon receiving the request, randomly drops subsequent B packets so that the total data rate stays at 100%.

The thumbnail is generated for I and P pictures only. For CIF video, we generate one thumbnail pixel (in -bit representation) for each 3 x 3 block. The resulting thumbnail has a rate of about 5 kbps, corresponding to % of the overall rate (550 kbps for encoded Foreman). The default error concealment uses intra mode for I packets and inter mode for P and B packets.

We compare the proposed thumbnail-aided retransmission scheme and adaptive error concealment with several other schemes. In the oracle scheme, which serves as a performance upper bound, we assume the receiver has perfect knowledge of the original video. The oracle distortion is computed offline by dropping individual packets and measuring each packet's distortion contribution. In the frame-aware scheme [3], the receiver makes retransmission decisions based on the heuristic that I packets are always more important than B packets, and hence gives retransmission priority to I packets. We also compare against the case without any retransmission.

Fig. 6. PSNR (dB) vs. packet loss rate (%) for various retransmission schemes: oracle, thumbnail-aided, frame-aware, and no retransmission. (Test sequence: Foreman CIF, rate = 550 kb/s, default error concealment)

In Fig. 6, we plot PSNR against packet loss rate (PLR), ranging from 0 to 25%, for the proposed thumbnail-aided retransmission and the other schemes. As expected, the oracle scheme with perfect distortion information has the best performance. The proposed thumbnail-aided retransmission achieves a performance gain of 0.5 to.5 dB over the distortion-unaware heuristic frame-based scheme, even though the % extra thumbnail stream needs to be transmitted.
Compared with the oracle scheme, the proposed thumbnail-aided scheme has less than dB performance loss under modest packet loss. However, as the packet loss becomes severe, the performance loss becomes larger. One explanation is that under severe packet loss, the additive decaying model used in packet distortion estimation becomes less accurate. The inaccuracy in distortion estimation leads to inaccurate retransmission decisions.

Next, we investigate how the thumbnail can help adaptive error concealment. We fix all the other settings and replace the default error concealment with the adaptive error concealment scheme. In Fig. 7, we show the performance of adaptive error concealment compared to default error concealment, both with thumbnail-aided retransmission and with no retransmission. It can be seen that with thumbnail-aided retransmission, under severe packet loss conditions, adaptive error concealment achieves a performance gain of about 0.5 dB over default error concealment. Under moderate packet loss, the gain is insignificant. An explanation is as follows. For each slice, the receiver can choose among intra error concealment, inter error concealment, and retransmission. Under moderate packet loss, when there are enough resources and the PSNR is high, the receiver would rather choose retransmission. However, when resources are tight under severe packet loss, the difference between default and adaptive error concealment becomes prominent. This is more evident in the case of no retransmission, where the receiver has only the intra and inter error concealment options. In this case, adaptive error concealment achieves a gain of about 0.5 to dB. The visual quality obtained with default error concealment and with adaptive error concealment is shown in Fig. 8.

Fig. 7. PSNR vs. packet loss rate for default error concealment and adaptive error concealment under no retransmission and thumbnail-aided retransmission conditions. (Test sequence: Foreman CIF)

5. CONCLUSIONS

In this work, we explored a new direction for robust video streaming over lossy packet networks. It has the ingredients of both proactive and reactive error protection, in the sense that a redundant bitstream generated at the server provides the basis for the receiver's retransmission requests.
Experimental results demonstrate that with the proposed thumbnail-aided retransmission, where distortions are explicitly taken into consideration, we can usually achieve a gain of 0.5 to.5 dB over other heuristic, distortion-unaware retransmission schemes. Furthermore, without any extra cost, the thumbnail can also guide the decoder in selecting error concealment modes, leading to an additional gain of 0.5 to dB under moderate to severe packet loss. We conclude that the key ingredient leading to such gains is the content-level error detection and correction that makes distortion awareness possible.

Fig. 8. Visual quality of (a) thumbnail-aided retransmission and default error concealment and (b) thumbnail-aided retransmission and adaptive error concealment. (Test sequence: Foreman CIF, 0% packet loss)

6. ACKNOWLEDGEMENT

I would like to thank Prof. Girod, Yao-Chung Lin, Xiaoqing Zhu, David Varodayan and Pierpaolo Baccichet for the extremely helpful discussions.

7. REFERENCES

[1] B. Girod and N. Färber, "Feedback-based error control for mobile video transmission," Proc. IEEE, vol. 87, no. 10, pp. 1707-1723, Oct. 1999.
[2] M. Podolsky, S. McCanne, and M. Vetterli, "Soft ARQ for layered streaming media," Tech. Rep. UCB/CSD-98-1024, Computer Science Division, University of California, Berkeley, Calif., USA, November 1998.
[3] Y. Shan and A. Zakhor, "Cross layer techniques for adaptive video streaming over wireless networks," in Proc. IEEE International Conference on Multimedia and Expo, 2002, vol. 1, pp. 277-280.
[4] P. Bucciol, E. Masala, E. Filippi, and J. C. De Martin, "Cross-layer perceptual ARQ for video communications over 802.11e wireless networks," Advances in Multimedia, Hindawi Publishing, vol. 2007, Article ID 3969, 2007.
[5] R. Kawada, O. Sugimoto, A. Koike, M. Wada, and S. Matsumoto, "Highly precise estimation scheme for remote video PSNR using spread spectrum and extraction of orthogonal transform coefficients," Electronics and Communications in Japan, Part, vol. 89, no. 6, 2006.
[6] J. Chakareski and P. Frossard, "Rate-distortion optimized distributed packet scheduling of multiple video streams over shared communication resources," IEEE Transactions on Multimedia, Special Issue on Distributed Media Technologies and Applications, vol. 8, no. 2, pp. 207-218, April 2006.

8. APPENDIX

In this section, we derive the properties of the mean-based estimator and the random-projection estimator for slice distortion estimation. Let $X = (X_1, \ldots, X_N)$ be the source, transmitted over an additive channel with noise $N = (N_1, \ldots, N_N)$, where the subscripted $N_n$ denotes the $n$th noise sample and the unsubscripted $N$ in sums and fractions denotes the block length. The received signal is $Y = X + N$. Assume $N_1, \ldots, N_N$ are i.i.d. with non-zero mean $\mu$ and standard deviation $\sigma$. The assumption that $N_n$ has non-zero mean characterizes the local distortion property. We want to estimate the power of the noise, $E[N_n^2] = \mu^2 + \sigma^2$.

Mean-based estimator. The mean-based estimator can be expressed as

$$\hat{P}_m = \left( \frac{1}{N}\sum_{n=1}^{N} Y_n - \frac{1}{N}\sum_{n=1}^{N} X_n \right)^2 = \left( \frac{1}{N}\sum_{n=1}^{N} N_n \right)^2 .$$

We now derive its expectation and variance.

$$E[\hat{P}_m] = E\left[ \left( \frac{1}{N}\sum_{n=1}^{N} N_n \right)^2 \right] = \frac{1}{N^2}\sum_{n=1}^{N}\sum_{m=1}^{N} E[N_n N_m] = \frac{1}{N^2}\sum_{n=1}^{N}\sum_{m=1}^{N} \left( \mu^2 + \sigma^2 \delta_{nm} \right) = \mu^2 + \frac{\sigma^2}{N} < \mu^2 + \sigma^2 = E[N_n^2],$$

where the strict inequality holds for $N > 1$ and $\sigma > 0$.

$$\mathrm{Var}[\hat{P}_m] = E\left[ \left( \frac{1}{N}\sum_{n=1}^{N} N_n \right)^4 \right] - \left( E[\hat{P}_m] \right)^2 = \frac{1}{N^4}\sum_{n,m,j,k} E[N_n N_m N_j N_k] - \left( \mu^2 + \frac{\sigma^2}{N} \right)^2 \to \mu^4 - (\mu^2)^2 = 0 \quad \text{as } \sigma \to 0 .$$

From the above, we can see that the mean-based estimator is biased, since it always underestimates the noise power. However, the estimation variance diminishes with the noise variance.

Random-projection estimator. The random-projection estimator first whitens the signal by multiplying it by a pseudo-random sequence $R_n \in \{\pm 1\}$. This is followed by a Walsh-Hadamard Transform (WHT) and extraction of a random pixel in the transform domain. These steps can essentially be combined and expressed in the following form, with the $1/\sqrt{N}$ normalization of an orthonormal WHT:

$$\hat{P}_{RP} = \left( \frac{1}{\sqrt{N}}\sum_{n=1}^{N} Y_n R'_n - \frac{1}{\sqrt{N}}\sum_{n=1}^{N} X_n R'_n \right)^2 = \left( \frac{1}{\sqrt{N}}\sum_{n=1}^{N} N_n R'_n \right)^2 ,$$

where $R'_n \in \{\pm 1\}$ is the product of $R_n$ and the orthogonal sequence used in the WHT. Treating the $R'_n$ as i.i.d., equiprobable, and independent of the noise, we have $E[R'_m R'_n] = \delta_{mn}$. The following derives the expectation and variance of the estimator.

$$E[\hat{P}_{RP}] = E\left[ \left( \frac{1}{\sqrt{N}}\sum_{n=1}^{N} N_n R'_n \right)^2 \right] = \frac{1}{N}\sum_{n=1}^{N}\sum_{m=1}^{N} E[N_m N_n]\, E[R'_m R'_n] = \frac{1}{N}\sum_{n=1}^{N} E[N_n N_n] = E[N_n^2] .$$

$$\mathrm{Var}[\hat{P}_{RP}] = E\left[ \left( \frac{1}{\sqrt{N}}\sum_{n=1}^{N} N_n R'_n \right)^4 \right] - \left( E[\hat{P}_{RP}] \right)^2 = \frac{1}{N^2}\sum_{n,m,j,k} E[N_n N_m N_j N_k]\, E[R'_n R'_m R'_j R'_k] - \left( \mu^2 + \sigma^2 \right)^2 .$$

The factor $E[R'_n R'_m R'_j R'_k]$ equals 1 when the four indices are all equal ($N$ terms) or form two distinct pairs ($3N(N-1)$ terms), and 0 otherwise, so

$$\mathrm{Var}[\hat{P}_{RP}] = \frac{E[N_n^4]}{N} + \left( 2 - \frac{3}{N} \right)\left( \mu^2 + \sigma^2 \right)^2 \to 2\mu^4 \left( 1 - \frac{1}{N} \right) \quad \text{as } \sigma \to 0 .$$

From the above, we can see that the random-projection estimator is unbiased. However, the estimation variance does not diminish with the noise variance.
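The bias and variance properties derived in this appendix can be checked numerically with a short Monte Carlo sketch. The Gaussian noise model, the use of independent equiprobable signs in place of the product of the whitening sequence and the WHT row, the parameter values, and all function names are assumptions made for illustration; the sketch is not the simulation used for Fig. 3.

```python
import random
import statistics


def simulate(mu, sigma, n, trials, seed=0):
    """Draw both noise-power estimates for `trials` independent blocks of n samples.

    Assumptions: Gaussian noise with mean mu and standard deviation sigma, and
    i.i.d. equiprobable +/-1 signs standing in for the combined sequence R'.
    """
    rng = random.Random(seed)
    mean_based, rand_proj = [], []
    for _ in range(trials):
        noise = [rng.gauss(mu, sigma) for _ in range(n)]
        signs = [rng.choice((-1, 1)) for _ in range(n)]
        # Mean-based estimator: squared block mean of the noise.
        mean_based.append((sum(noise) / n) ** 2)
        # Random-projection estimator: squared normalized signed sum.
        rand_proj.append(sum(e * s for e, s in zip(noise, signs)) ** 2 / n)
    return mean_based, rand_proj


def summarize(samples):
    """Return (sample mean, population variance) of a list of estimates."""
    return sum(samples) / len(samples), statistics.pvariance(samples)


mu, sigma, n = 5.0, 1.0, 16
mean_based, rand_proj = simulate(mu, sigma, n, trials=20000, seed=1)
true_power = mu ** 2 + sigma ** 2

mb_mean, mb_var = summarize(mean_based)
rp_mean, rp_var = summarize(rand_proj)
print("true noise power      :", true_power)
print("mean-based  mean, var :", mb_mean, mb_var)
print("rand.-proj. mean, var :", rp_mean, rp_var)
```

For these parameters the derivation gives $E[\hat{P}_m] = \mu^2 + \sigma^2/N = 25.0625$, below the true power of 26, while the random-projection estimate should average close to 26 but with a much larger spread.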