IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 11, NOVEMBER 2011

Channel Distortion Modeling for Multi-View Video Transmission Over Packet-Switched Networks

Yuan Zhou, Chunping Hou, Wei Xiang, Senior Member, IEEE, and Feng Wu, Senior Member, IEEE

Abstract—Channel distortion modeling for generic multi-view video transmission remains largely unexplored, despite the intensive research efforts that have been devoted to modeling traditional 2-D video transmission. This paper aims to fill this gap by developing a recursive distortion model for multi-view video transmission over lossy packet-switched networks. Based on a study of the characteristics of multi-view video coding and the propagation behavior of transmission errors due to random frame losses, a recursive mathematical model is derived to estimate the expected channel-induced distortion at both the frame and sequence levels. The model we develop explicitly considers both temporal and inter-view dependencies, induced by motion-compensated and disparity-compensated coding, respectively. The derived model is applicable to all multi-view video encoders using the classical block-based motion-/disparity-compensated prediction framework. Both objective and subjective experimental results are presented to demonstrate that the proposed model is capable of effectively modeling channel-induced distortion for multi-view video.

Index Terms—Distortion modeling, multi-view coding, multi-view video, packet-switching networks, video transmission.

I. Introduction

AN EVER increasing demand for wired and wireless access to 3-D image and video content over the last few years has sparked numerous research efforts in stereoscopic imaging technology for next-generation information technology. A multi-view video communications system acquires several video sequences of the same scene simultaneously from more than one camera angle, and transports these streams to remote users.
Owing to the massive amount of data involved and the extensive processing requirements, multi-view video processing presents research challenges that lie at the frontier of video coding, image processing, computer vision, and display technologies.

Manuscript received July 9, 2010; revised October 12, 2010 and December 31, 2010; accepted January 4, 2011. Date of publication March 28, 2011; date of current version November 2, 2011. This work was supported in part by the National Natural Science Foundation of China, and by the International Science Linkages established under the Australian Government's Innovation Statement, Backing Australia's Ability. This paper was recommended by Associate Editor A. Vetro.

Y. Zhou and C. Hou are with the School of Electronic and Information Engineering, Tianjin University, Tianjin, China (e-mail: zhouyuan@tju.edu.cn; hcp@tju.edu.cn).

W. Xiang is with the Faculty of Engineering and Surveying, University of Southern Queensland, Toowoomba, QLD 4350, Australia (e-mail: xiangwei@usq.edu.au).

F. Wu is with Microsoft Research Asia, Beijing, China (e-mail: fengwu@microsoft.com).

Color versions of one or more of the figures in this paper are available online. © 2011 IEEE

In packet-switched networks, packets may be discarded due to buffer overflow at intermediate nodes of the network, or may be considered lost due to long queuing delays. Moreover, compressed video signals, especially coded stereoscopic video sequences, are extremely vulnerable to transmission errors, since low bit-rate video coding schemes rely on inter-frame predictive coding to achieve high coding efficiency. The coding structure of motion-compensated inter-frame prediction creates strong spatio-temporal dependency among video frames [1], [2]. Consequently, unavoidable packet losses during transmission may result in catastrophic error propagation and thus severe quality degradation at the decoder side.
Modeling the effect of packet losses on the end-to-end video quality is important for jointly determining parameters for source coding (e.g., quantization and intra-rate) [3], rate-distortion optimization [4], [5], channel code rate control [6], [7], and inter-/intra-mode switching [8]. The multi-view video coding (MVC) technique exploits the extra dependency between neighboring views in 3-D video in addition to the existing spatio-temporal dependencies. MVC technology employs the so-called disparity compensation approach to reduce the amount of data between neighboring views through differential inter-view coding. Consequently, mismatch errors in MVC can propagate along both the temporal and view dimensions, which is termed 2-D error propagation. Inter-dependent coding among neighboring views may lead to decoding error propagation in several views if a packet from one view is lost. As a result, the correlations between the current frame and the frames from adjacent views must be appropriately modeled to account for channel-induced distortion. In this paper, we concentrate our efforts on developing a recursive model to estimate the distortion due to channel errors for multi-view video, one capable of effectively tackling sophisticated 2-D error propagation. To contextualize our research on distortion modeling, we first discuss a few key related distortion models in Section I-A, and then outline our contributions in Section I-B.

A. Related Work

A number of distortion models for monoscopic video transmission over lossy channels have been proposed in the literature to date. For instance, in the case of low-complexity estimation models, transmission distortion was analyzed considering intra coding and spatial loop filtering, as described in [9]. Zhang et al. [10] developed a low-complexity transmission distortion model using a piece-wise linear-fitting
approach for whole-frame losses. In [11], a simplified distortion estimation model dubbed "expected length of error propagation" was proposed, which only took temporal error propagation into account, without considering picture complexity. Note that the above low-complexity estimation models are mainly applicable to low error rate applications, and are usually not accurate enough to achieve optimal rate-distortion (R-D) performance. For accurate distortion estimation with modest complexity, the overall distortion accumulated from previous frames is estimated to determine the coding mode for the current macroblock (MB). For example, Yang and Rose proposed the well-known recursive optimal per-pixel estimation (ROPE) algorithm [8] and its extensions [12], [15], which recursively calculate the first-order and second-order moments of the decoded value for each pixel. The extensions of ROPE provided a solution for cross-correlation approximation considering sub-pixel prediction. Stuhlmüller et al. [9] presented distortion models for estimating the distortion caused by source coding and channel errors. The R-D characteristics of the source coder and the distortion due to residual channel errors were modeled empirically using test sequences. In [13], He et al. developed a model for estimating the distortion due to bit errors in motion-compensated video, and then used this model together with the source R-D model for adaptive intra-mode selection and joint source-channel rate control under time-varying channel conditions. The model introduced in [14] took into account almost all popular coding features, including intra-/inter-prediction and deblocking filtering, in order to accurately estimate the expected channel distortion in terms of the mean squared error (MSE). Sabir et al.
[3] presented a statistical distortion model for MPEG-4 coded video streams sent through fading channels with error-resilience tools, such as data partitioning and packetization, which encode a video sequence into different partitions/layers. In [10], a recursive group-of-picture (GoP)-level transmission distortion model based on the error propagation behavior of whole-frame losses was proposed. Moreover, taking loss burstiness into consideration, a more accurate distortion model was established in [16] to estimate the expected MSE distortion at both the frame and sequence levels under Gilbert channel packet losses. Although the above discussed models are able to estimate the expected value of distortion with reasonable accuracy, they are only applicable to monoscopic video compression based upon the classic block-based motion-compensated prediction coding framework, which disregards the characteristics of multi-view video signals. In contrast, very little attention has been paid to distortion modeling for stereoscopic video transmission. Most noticeably, R-D optimization for error-resilient stereoscopic video coding with inter-view refreshment was investigated in [18]. Tan et al. [19], [20] introduced an end-to-end R-D model for 3-D video that achieved optimal encoder bit rates and UEP rates. In these works, transmission distortion was only analyzed for 3-D video with left and right views. To the best of the authors' knowledge, there is hardly any published work on distortion modeling for generic multi-view video (MVV) transmission over packet-lossy networks.

B. Contributions

In this paper, we develop a recursive model to estimate the distortion caused by random packet losses for H.264/MVC coded video transmission over packet-switched networks. In an MVC-coded video streaming system, the currently decoding frame correlates not only with the previous time frame in the same view, but also with the frames in adjacent views.
MVC techniques employ disparity compensation to exploit the extra dependency between neighboring views in addition to the existing spatio-temporal dependencies. In single-view video coding, errors propagated to the current frame can only come from preceding temporal frames; this is 1-D error propagation. In multi-view video coding, however, errors propagated to the current frame come from the preceding frames not only in the same view but also in other views, which is clearly 2-D error propagation. One of our major contributions in this paper is the proposal of a recursive distortion model that is capable of handling sophisticated 2-D error propagation. In the 2-D error propagation scenario, the total number of propagation paths is calculated as a product over the previous time and view frames, which grows rapidly in a nonlinear manner as the number of preceding frames increases. Consequently, a straightforward extension of error propagation modeling from the single-view to the multi-view scenario would lead to intractable complexity, if combined errors along both the temporal and view directions were to be taken into account. Furthermore, the coefficients in the conventional models for single-view video (e.g., the advanced ROPE model in [15] and the frame-level model in [14]) are complex because they explicitly consider coding features (including integer-pel, 1/2-pel, and 1/4-pel motion estimation, intra-/inter-prediction, and deblocking filtering) to estimate the cross-correlation terms in ROPE's second-moment calculation. It would become computationally intractable to simply apply a straightforward extension of the error calculation methods in these models to the scenario of 2-D error propagation, because combinations of 2-D errors considerably increase the modeling complexity.
Consequently, previous single-view distortion modeling schemes cannot simply be extended to the view dimension, due to the overwhelming complexity of calculating the second moment. Hence, new modeling approaches must be sought that are capable of tackling sophisticated 2-D error propagation in MVV. In this paper, we analyze the channel distortion for coded MVV sequences caused by random packet losses. A frame-level distortion estimation model in terms of the MSE is derived in the form of a recursive formula. The proposed model takes into consideration both motion and disparity compensation, relates the channel-induced distortion in the current frame to that in the previous frame or the neighboring-view frame, and allows for any motion-compensated and disparity-compensated error concealment methods at the decoder. The proposed 3-D distortion model is able to effectively model sophisticated 2-D error propagation in MVV with low complexity. The remainder of this paper is organized as follows. In Section II, the preliminaries, notations, and assumptions underlying the proposed model are introduced. In addition, the
distortion modeling problem is briefly formulated. Section III develops a frame-level recursive formula which predicts the channel-induced distortion in coded MVV sequences due to random packet losses. In Section IV, we validate the proposed model by transmitting H.264/MVC coded test video sequences at different packet loss rates, and compare the modeled distortion with the actual measured distortion. Subjective evaluation tests are also carried out to validate the proposed model. Finally, concluding remarks are drawn in Section V.

II. Notations and Assumptions

A. Assumptions of Coded Multi-View Video

The reference structure of coded multi-view video under consideration in this paper is illustrated in Fig. 1, which extends the structure used in [14] from the single-view scenario to the multi-view scenario. In Fig. 1, the horizontal axis denotes the frame number t in the temporal domain, whereas the vertical axis represents the view number s. Denote by S and T the number of views and the length of each GoP, respectively. A 2-D coded image group is called a group of group-of-pictures (GGoP). At the encoder, similar to the coding structures in [12]–[14], [16], and [17] for single-view video coding and in [18] for double-view video coding, we assume an MVV sequence is partitioned into a number of GGoPs, and each GGoP starts with an I-frame followed by multiple P-frames, as illustrated in Fig. 1. Denote by M(s,t) the frame at view s and time t. Both the previous time frame M(s,t−1) and the previous view frame M(s−1,t) are taken as the reference frames for M(s,t), except for the frames in view zero and the first frame in other views. For I-frames, all MBs are coded in the INTRA mode, without using reference frame information. For P-frames, an MB is coded either in the INTRA or INTER mode.
For the latter mode, the MB is predicted from the corresponding MB in a previous view or time frame. At the decoder, we assume that a proper error concealment operation is performed on lost MBs. Generally, one can resort to either temporal or inter-view concealment. For simplicity, we assume that all MBs in a frame are grouped into one slice, and each slice is carried in a separate UDP transport packet. Consequently, the packet loss rate is equivalent to the frame (slice) loss rate. It is further assumed that packet losses occur randomly with an average packet loss rate.

B. Definition of Frame-Level Distortion

Denote by F_i(s,t) the original value of pixel i in frame M(s,t). Let F̂_i(s,t) and F̃_i(s,t) be the reconstructed values of pixel i in M(s,t) at the encoder and decoder, respectively. Let D(s,t) be the overall distortion at the receiver end, defined as the MSE between the original frame M(s,t) and its decoded version:

    D(s,t) = E{[F_i(s,t) − F̃_i(s,t)]²}    (1)

where E{·} denotes the expectation taken over all the pixels in frame M(s,t).

Fig. 1. Reference MVC coding structure for multi-view video sequences.

In a coded video transmission system, there are two major types of distortion, namely quantization distortion introduced by source encoding, and channel distortion caused by channel errors. For ease of exposition, these two types of distortion are referred to as source-induced distortion and channel-induced distortion, denoted by D_s(s,t) and D_c(s,t), respectively. They can be expressed as

    D_s(s,t) = E{[F_i(s,t) − F̂_i(s,t)]²}    (2)

    D_c(s,t) = E{[F̂_i(s,t) − F̃_i(s,t)]²}.    (3)

As can be seen from (2) and (3), source-induced distortion refers to the MSE between the original frame and the reconstructed frame at the encoder, whereas channel-induced distortion refers to the MSE between the reconstructed frame at the encoder and the decoded frame at the receiver.
It has been shown in [14] with experimental data that D_s(s,t) is largely uncorrelated with D_c(s,t). That is, F_i(s,t) − F̂_i(s,t) and F̂_i(s,t) − F̃_i(s,t) can be assumed to be uncorrelated. Hence, the total distortion of frame M(s,t) can be decomposed as

    D(s,t) = E{[F_i(s,t) − F̃_i(s,t)]²}
           = E{[F_i(s,t) − F̂_i(s,t) + F̂_i(s,t) − F̃_i(s,t)]²}
           = E{[F_i(s,t) − F̂_i(s,t)]²} + E{[F̂_i(s,t) − F̃_i(s,t)]²}
             + 2E{[F_i(s,t) − F̂_i(s,t)][F̂_i(s,t) − F̃_i(s,t)]}
           = D_s(s,t) + D_c(s,t).    (4)

In this paper, we will focus on modeling the average channel-induced distortion D_c(s,t) for each frame M(s,t).
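The decomposition in (4) can be illustrated numerically. The sketch below uses synthetic stand-ins for the pixel values and error sources (the noise levels are illustrative assumptions, not taken from the paper): when the source and channel errors are independent, the cross term is near zero and the total MSE splits into D_s + D_c.

```python
import numpy as np

# Synthetic stand-ins for the quantities in (1)-(4); noise levels are
# illustrative assumptions.
rng = np.random.default_rng(42)
F = rng.uniform(0, 255, 100_000)               # original pixel values F_i
F_hat = F + rng.normal(0, 2.0, F.size)         # encoder reconstruction
F_tilde = F_hat + rng.normal(0, 3.0, F.size)   # decoder output after channel errors

D = np.mean((F - F_tilde) ** 2)        # total distortion, (1)
D_s = np.mean((F - F_hat) ** 2)        # source-induced distortion, (2)
D_c = np.mean((F_hat - F_tilde) ** 2)  # channel-induced distortion, (3)

# With independent error sources, the cross term in (4) is near zero,
# so D is close to D_s + D_c (here roughly 4 + 9 for the chosen variances).
```

Here D_s and D_c approach the two noise variances, and D differs from their sum only by the small sampled cross term, mirroring the uncorrelatedness assumption behind (4).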

III. Proposed Transmission Distortion Model for Random Packet Losses

A. General Analysis of Channel-Induced Distortion

We encode a multi-view video sequence with H.264/MVC [24]. It is assumed that if frame M(s,t) is lost, motion-compensated or disparity-compensated concealment will be invoked, resulting in an average distortion D_L(s,t). If M(s,t) is received correctly, it may still be subject to channel distortion, D_R(s,t), due to errors propagated from the previous time and/or view frames. Thus, the expected channel distortion of frame M(s,t) becomes

    D_c(s,t) = { D_L(s,t),  if frame M(s,t) is lost
               { D_R(s,t),  if frame M(s,t) is received.    (5)

Equation (5) represents the general form of transmission-induced distortion. Our objective is to calculate D_c(s,t) based upon (5). More specifically, we aim to derive a recursive distortion model for D_c(s,t), which is able to: 1) capture the channel error propagation effect on multi-view video; and 2) analyze the impact of random packet losses on the average video quality of multi-view video. The proposed channel distortion model is a frame-level recursive model, relating the channel distortion of frame M(s,t) with those of frames M(s−1,t) and M(s,t−1) within the same GGoP. The proposed model extends channel-induced distortion models for conventional 2-D video, especially the work in [14], from the single-view to the multi-view scenario, and is suitable for block-based motion- or disparity-compensated MVC coding schemes.

B. Recursive Computation of Distortion for Received Frames

First, we compute D_R(s,t), aiming to develop a recursive formula for correctly received frames. For these frames, the reconstruction may be affected by channel distortion due to errors existing in the previous time and/or view frames. For P-frames, an MB is coded in either the INTRA or INTER mode.
In the former mode, the pixels in the MB are coded without using other predictive frame information. In the latter mode, the MB is predicted from a corresponding MB in the previous time or view frame. We assume the average percentage of MBs using the INTRA mode in each P-frame is Q. Denote by D_INTRA(s,t) and D_INTER(s,t) the average distortion in the received INTRA mode MBs and INTER mode MBs, respectively. The average channel distortion D_R(s,t) is

    D_R(s,t) = Q·D_INTRA(s,t) + (1 − Q)·D_INTER(s,t).    (6)

For received frames, if intra-prediction is employed, error propagation within the same frame is negligible, so the channel-induced distortion in received INTRA mode MBs is assumed to be zero, i.e., D_INTRA(s,t) = 0. It should be noted that the assumption that error propagation does not occur to intra-coded MBs within the same frame is not strictly true. In fact, this type of error propagation is well modeled in [14]. However, the distortion propagated to received INTRA mode MBs in the same frame is very small and often considered negligible in the literature (e.g., [10], [13], [16], [17]). Additionally, our experimental data show that the modeling results are accurate even when distortion propagation in received INTRA mode MBs is ignored. For INTER mode coded MBs, there exist two options for prediction: 1) intra-view prediction, which takes as the reference frame the previous time frame of the same view; and 2) inter-view prediction, which takes as the reference frame the same time frame in the previous view. The distortions caused by these two types of prediction are denoted as D_RT(s,t) and D_RV(s,t), respectively.

1) Recursive Formula for Inter-View Prediction Distortion D_RV(s,t): If an MB employs inter-view prediction, F_i(s,t) for each pixel i in the MB is predicted from F̂_ρ(i)(s−1,t), where ρ(i) refers to the spatial indices of the K_p matching pixels in the previous view frame M(s−1,t) that are used to predict pixel i in the current frame.
For monoscopic video coding, the pixel operation performed on all F̂_ρ(i)(n−1) to predict F_i(n) can be considered as a linear pixel filtering function [13], [16], where n denotes the frame index of a 2-D video sequence. For example, as shown in [13], if sub-pixel motion estimation is employed, the predicted value of F_i(n) is expressed as

    F̂_i(n) = Σ_{k=1..K_p} a_k F̂_ρ_k(i)(n−1)    (7)

where a_k and ρ_k(i) are the weighting coefficient and the index of the kth matching pixel in the previous frame, respectively. The values of K_p, a_k, and ρ_k(i) depend on the chosen sub-pixel motion estimation mode (e.g., half-pixel or quarter-pixel), the motion vectors (MVs) of the MB being coded, and the interpolation filter employed for sub-pixel motion estimation. For the special case of integer-pixel motion estimation, we have K_p = 1 and a_k = 1, so that (7) simplifies into F̂_i(n) = F̂_ρ(i)(n−1). Note that disparity estimation in stereoscopic video coding is similar to motion estimation in monoscopic video coding.

Let φ_i(·) be a pixel interpolation function applied to all F̂_ρ(i)(s−1,t) to predict F_i(s,t). Thus, at the encoder, the predicted value of F_i(s,t) can be expressed as φ_i(F̂_ρ(i)(s−1,t)). If sub-pixel disparity estimation is employed by the MVC algorithm, φ_i(·) will be a linear interpolation function similar to (7), whereas φ_i(F) = F if integer-pixel disparity estimation is adopted. At the decoder, when the MB is received, the prediction of F_i(s,t) is based on F̃_ρ(i)(s−1,t), which can be expressed as φ_i(F̃_ρ(i)(s−1,t)). For this inter-view predicted MB, due to the correct reception of the prediction error information, the channel distortion is solely caused by the mismatch between the prediction references at the encoder and decoder. Therefore, the distortion can be derived as

    D_RV(s,t) = E{[F̂_i(s,t) − F̃_i(s,t)]²}
              = E{[φ_i(F̂_ρ(i)(s−1,t)) − φ_i(F̃_ρ(i)(s−1,t))]²}
              = E{[φ_i(F̂_ρ(i)(s−1,t) − F̃_ρ(i)(s−1,t))]²}.    (8)

Note that the derivation of (8) utilizes the linearity property of φ_i(·). When the distortion in the previous view frame propagates to the current frame, the distortion signal tends to attenuate,
due to coding effects such as intra refreshing, non-integer disparity compensation, and deblocking filtering. Consequently, φ_i(·) can be regarded as a filter with an attenuation parameter λ_a that attenuates error propagation [9]. Hence, (8) can be approximated as

    D_RV(s,t) = E{[φ_i(F̂_ρ(i)(s−1,t) − F̃_ρ(i)(s−1,t))]²}
              = λ_a E{[F̂_ρ(i)(s−1,t) − F̃_ρ(i)(s−1,t)]²}
              = λ_a D_c(s−1,t).    (9)

For the special case of integer-pixel estimation, since φ_i(F) = F, we have

    D_RV(s,t) = E{[φ_i(F̂_ρ(i)(s−1,t)) − φ_i(F̃_ρ(i)(s−1,t))]²}
              = E{[F̂_ρ(i)(s−1,t) − F̃_ρ(i)(s−1,t)]²}
              = D_c(s−1,t).    (10)

Comparison of (9) and (10) indicates that λ_a = 1 when integer-pixel estimation is employed. For practical 3-D video applications, the parameter λ_a mainly depends on the video quality level of the MVC coding algorithm employed. More specifically, it hinges on both the accuracy of disparity estimation and the video content in question. In this paper, we estimate λ_a from empirical experimental data, as will be discussed in Section IV.

2) Recursive Formula for Intra-View Prediction Distortion D_RT(s,t): If the MB employs intra-view prediction, in a similar manner, the predicted value of F_i(s,t) can be expressed as φ_i(F̂_ρ′(i)(s,t−1)) at the encoder and φ_i(F̃_ρ′(i)(s,t−1)) at the decoder, respectively, where ρ′(i) denotes the indices of the K_p matching pixels in frame M(s,t−1) that are used to predict pixel i in the current frame. The intra-view prediction distortion can be derived as

    D_RT(s,t) = E{[F̂_i(s,t) − F̃_i(s,t)]²}
              = E{[φ_i(F̂_ρ′(i)(s,t−1)) − φ_i(F̃_ρ′(i)(s,t−1))]²}
              = E{[φ_i(F̂_ρ′(i)(s,t−1) − F̃_ρ′(i)(s,t−1))]²}
              = λ_b D_c(s,t−1)    (11)

where λ_b is the error attenuation factor for the received MBs. As in the case of λ_a, the parameter λ_b will be estimated from experimental data, as described in detail in Section IV.
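The attenuating effect behind (9) can be checked numerically. The sketch below is our own illustration, not the paper's estimation procedure: a two-tap averaging filter stands in for half-pixel interpolation (an assumed example of the linear filter in (7)), and λ is measured as the ratio of propagated error energy to the reference frame's error energy.

```python
import numpy as np

# Hypothetical half-pixel interpolation: a 2-tap average of adjacent
# reference pixels, i.e., K_p = 2 and a_k = 0.5 in the sense of (7).
def predict_half_pixel(ref_row):
    return 0.5 * (ref_row[:-1] + ref_row[1:])

rng = np.random.default_rng(0)
enc_ref = rng.normal(0.0, 1.0, 10_000)      # encoder reference pixels
channel_err = rng.normal(0.0, 1.0, 10_000)  # encoder/decoder mismatch in the reference
dec_ref = enc_ref - channel_err             # decoder reference pixels

# Prediction mismatch at the current frame: by linearity, the filter acts
# directly on the reference error, as in the last line of (8).
mismatch = predict_half_pixel(enc_ref) - predict_half_pixel(dec_ref)

D_ref = np.mean(channel_err ** 2)  # distortion in the reference frame
D_rv = np.mean(mismatch ** 2)      # distortion propagated through the filter
lam = D_rv / D_ref                 # empirical attenuation factor, cf. lambda_a
# For i.i.d. errors and a (0.5, 0.5) filter, lam comes out near 0.5 < 1:
# sub-pixel interpolation attenuates the propagated error energy,
# whereas integer-pel prediction would give lam = 1, as in (10).
```

This mirrors why λ_a ≤ 1 in practice: sub-pixel filtering averages neighboring errors, reducing their energy, while integer-pel copying passes the error through unchanged.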
Note that λ_b may be different from λ_a.

3) Computation and Discussion of D_R(s,t): As shown in Fig. 1, both intra-view prediction and inter-view prediction are applied in predicting the current frame. We assume the percentage of MBs employing intra-view prediction is V, following a similar assumption in [18]. Therefore, the average channel distortion for the INTER mode MBs in the received frame M(s,t) becomes

    D_INTER(s,t) = V·D_RT(s,t) + (1 − V)·D_RV(s,t)
                 = V·λ_b·D_c(s,t−1) + (1 − V)·λ_a·D_c(s−1,t).    (12)

We discuss two special cases of (12) as follows. For the frames (except for the first I-frame) in view 0, all the INTER mode MBs employ intra-view prediction. In this case, V = 1, and (12) simplifies into

    D_INTER(0,t) = D_RT(0,t) = λ_b·D_c(0,t−1).    (13)

In the second special case, for the first frame of each view (except for view 0), all the INTER mode MBs employ inter-view prediction. Then, V = 0, and (12) degenerates to

    D_INTER(s,0) = D_RV(s,0) = λ_a·D_c(s−1,0).    (14)

More generally, substituting (12) into (6) and noting that D_INTRA(s,t) = 0 for received INTRA mode MBs, we obtain the following recursive model for D_R(s,t):

    D_R(s,t) = (1 − Q)[V·λ_b·D_c(s,t−1) + (1 − V)·λ_a·D_c(s−1,t)].    (15)

It is evident that (15) is a recursive formula, which relates the distortion of frame M(s,t) to those of M(s,t−1) and M(s−1,t).

C. Recursive Computation of Distortion for Lost Frames

Following Section III-B, we now compute D_L(s,t), aiming to develop a recursive formula for frames lost due to packet errors. In subsequent derivations, if a frame is lost (with probability P), we assume that all of the MBs in the frame will be concealed using some temporal or inter-view error concealment strategy, resulting in an average distortion D_L(s,t), irrespective of the coding mode.
In MVV communication systems, the pixels in a lost frame may be reconstructed from either the previous time or view frame, depending on the content and the error concealment strategy under consideration [21]. In error concealment, the MV or disparity vector (DV) of a missing MB is first estimated, typically based on the MVs or DVs of the matching MBs in a previous frame. The MB is then replaced by the corresponding MB in the previous time or view frame referred to by the estimated MV or DV. A commonly used error concealment method is frame copy, which is a special case with zero MVs or DVs. Denote by D_LT(s,t) and D_LV(s,t) the distortions when motion-compensated or disparity-compensated error concealment is employed to reconstruct the lost pixels, respectively.

1) Recursive Formula for Inter-View Concealment Distortion D_LV(s,t): When disparity-compensated inter-view concealment is in use, each pixel value F_i(s,t) in the lost frame is reconstructed from the matching pixels in the previous view frame M(s−1,t). Let θ(i) be the spatial indices of the matching pixels in frame M(s−1,t) used to estimate pixel i in frame M(s,t). The total number of matching pixels participating in error concealment depends on the specific error concealment algorithm in use. Denote by F̃_VEC,θ(i)(s−1,t) the error-concealed value for pixel i at the decoder, which is obtained by applying the error concealment algorithm on all the matching pixels in frame M(s−1,t).
The distortion D_LV(s,t) can be derived as

    D_LV(s,t) = E{[F̂_i(s,t) − F̃_VEC,θ(i)(s−1,t)]²}
              = E{[F̂_i(s,t) − F̂_VEC,θ(i)(s−1,t) + F̂_VEC,θ(i)(s−1,t) − F̃_VEC,θ(i)(s−1,t)]²}
              = E{[F̂_i(s,t) − F̂_VEC,θ(i)(s−1,t)]²} + E{[F̂_VEC,θ(i)(s−1,t) − F̃_VEC,θ(i)(s−1,t)]²}    (16)

where F̂_VEC,θ(i)(s−1,t) is defined as the concealed value for F̂_i(s,t) if there were no channel-induced distortion up to frame M(s,t). Therefore, F̂_i(s,t) − F̂_VEC,θ(i)(s−1,t) can be regarded as the distortion caused by the error concealment algorithm for pixel i due only to the loss of M(s,t) (free of error propagation from previous frames). On the other hand, the second term in the last line of (16) represents the channel distortion caused by transmission errors. In other words, the distortion D_LV(s,t) in (16) is divided into two terms, with the first term being the distortion caused solely by the error concealment algorithm, whereas the second term denotes the distortion caused by transmission errors. Note that the last line of (16) is based on the assumption that these two types of distortion are uncorrelated. Let D_VEC(s,t) be the error concealment distortion

    D_VEC(s,t) = E{[F̂_i(s,t) − F̂_VEC,θ(i)(s−1,t)]²}    (17)

which is contingent on the inter-view error concealment algorithm in use. For example, if the simple frame copy concealment method is chosen, we have θ(i) = i and F̂_VEC,θ(i)(s−1,t) = F̂_i(s−1,t). Thus, (17) becomes the mean squared difference between the two corresponding frames in adjacent views. In MVC-coded video communication systems, the errors propagated to the current frame can be considered as the filtered output of the distortion in the reference frame. Thus, error concealment operations, such as interpolation, deblocking, and weighted prediction, can be regarded as a spatial filter that will attenuate error propagation.
Consequently, the following relationship holds:

    E{[F̂_VEC,θ(i)(s−1,t) − F̃_VEC,θ(i)(s−1,t)]²} = μ_a D_c(s−1,t)    (18)

where μ_a is the error attenuation factor for the lost frame, which is reconstructed through disparity-compensated inter-view concealment. μ_a is a parameter that hinges on the MVC video codec and content, and can be estimated through empirical experiments. Generally, μ_a differs from λ_a for the same pixel because of the difference between the estimated DV and the actual DV used in the encoder. For the special case of integer-pixel estimation, (18) can be rewritten as

    E{[F̂_VEC,θ(i)(s−1,t) − F̃_VEC,θ(i)(s−1,t)]²} = E{[F̂_θ(i)(s−1,t) − F̃_θ(i)(s−1,t)]²} = D_c(s−1,t).    (19)

Comparing (19) with (18) implies that μ_a = 1 when integer-pixel estimation is in use at the encoder. Substituting (17) and (18) into (16), we arrive at the following:

    D_LV(s,t) = D_VEC(s,t) + μ_a D_c(s−1,t).    (20)

2) Recursive Formula for Intra-View Concealment Distortion D_LT(s,t): In a similar manner, when motion-compensated temporal concealment is applied, the reconstructed value of F_i(s,t) can be expressed as F̃_TEC,θ′(i)(s,t−1) at the decoder, where θ′(i) represents the spatial indices of the matching pixels in the previous time frame M(s,t−1) used to estimate pixel i in M(s,t). Thus, the intra-view error concealment distortion can be derived as

    D_LT(s,t) = E{[F̂_i(s,t) − F̃_TEC,θ′(i)(s,t−1)]²}
              = E{[F̂_i(s,t) − F̂_TEC,θ′(i)(s,t−1) + F̂_TEC,θ′(i)(s,t−1) − F̃_TEC,θ′(i)(s,t−1)]²}
              = E{[F̂_i(s,t) − F̂_TEC,θ′(i)(s,t−1)]²} + E{[F̂_TEC,θ′(i)(s,t−1) − F̃_TEC,θ′(i)(s,t−1)]²}
              = D_TEC(s,t) + μ_b D_c(s,t−1)    (21)

with

    D_TEC(s,t) = E{[F̂_i(s,t) − F̂_TEC,θ′(i)(s,t−1)]²}    (22)

where μ_b is the error attenuation parameter for the lost frame. For a similar reason, μ_b may differ from λ_b.
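In code, (20)–(22) amount to adding a concealment term, computable at the encoder, to an attenuated copy of the reference frame's channel distortion. The sketch below is a minimal illustration under assumed parameter values, with a frame-copy example for D_VEC in the spirit of the θ(i) = i special case of (17); the pixel data and noise level are synthetic stand-ins.

```python
import numpy as np

def D_LV(D_vec, mu_a, Dc_prev_view):
    """Inter-view (disparity-compensated) concealment distortion, (20)."""
    return D_vec + mu_a * Dc_prev_view

def D_LT(D_tec, mu_b, Dc_prev_time):
    """Intra-view (motion-compensated temporal) concealment distortion, (21)."""
    return D_tec + mu_b * Dc_prev_time

# Frame-copy special case of (17): theta(i) = i, so D_VEC is simply the mean
# squared difference between the encoder reconstructions of adjacent views.
rng = np.random.default_rng(1)
f_hat_cur = rng.uniform(0, 255, (16, 16))                # encoder frame at (s, t)
f_hat_view = f_hat_cur + rng.normal(0, 5.0, (16, 16))    # stand-in for view s-1
D_vec = np.mean((f_hat_cur - f_hat_view) ** 2)

# Total distortion for a lost frame concealed from the previous view,
# with illustrative mu_a and reference-frame distortion values.
loss_distortion = D_LV(D_vec, mu_a=0.9, Dc_prev_view=12.0)
```

The split matches (16): the first term is concealment error that exists even on an error-free channel, while the second is propagated channel distortion scaled by the attenuation factor.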
3) Computation and Discussion of D_L(s,t): For a lost frame, some MBs will be reconstructed by motion-compensated temporal concealment, whereas other MBs may be reconstructed by disparity-compensated spatial concealment, depending on the mode selection strategy of error concealment. In accordance with a similar assumption in [22], we assume that the percentage of MBs employing motion-compensated temporal concealment is U. Thus, the average distortion for the lost frame M(s,t) can be expressed as

$$
\begin{aligned}
D_L(s,t) &= U D_{LT}(s,t) + (1-U) D_{LV}(s,t) \\
&= U\left[D_{TEC}(s,t) + \mu_b D_c(s,t-1)\right] + (1-U)\left[D_{VEC}(s,t) + \mu_a D_c(s-1,t)\right] \\
&= \left[U D_{TEC}(s,t) + (1-U) D_{VEC}(s,t)\right] + \left[U \mu_b D_c(s,t-1) + (1-U) \mu_a D_c(s-1,t)\right].
\end{aligned} \tag{23}
$$

There are two special cases of (23). First, for any P-frame in view 0 that is lost, we assume that all the MBs in the lost P-frame are recovered through motion-compensated temporal concealment, taking the previous time frame in view 0 as the reference frame, that is, U = 1. As a result, (23) becomes

$$
D_L(0,t) = D_{LT}(0,t) = D_{TEC}(0,t) + \mu_b D_c(0,t-1). \tag{24}
$$

On the other hand, for the first frame of each view (except view 0), it is reasonable to assume that all the MBs in the lost P-frame employ disparity-compensated spatial

concealment, taking the corresponding frame in the adjacent view as the reference frame, that is, U = 0. Therefore, (23) simplifies to

$$
D_L(s,0) = D_{LV}(s,0) = D_{VEC}(s,0) + \mu_a D_c(s-1,0). \tag{25}
$$

More generally, as can be observed from (23), D_L(s,t) is estimated as the sum of the average concealment and propagation distortions. D_TEC(s,t) and D_VEC(s,t) in (23) represent the average intra-view and inter-view error concealment distortions of frame M(s,t), respectively, which can be readily calculated at the encoder. If we regard the error concealment operator as a low-pass filter, the concealed frame becomes a filtered version of the original frame. From this perspective, the error concealment distortion can be modeled as the average MSE between the original and reference frames attenuated by a factor a, i.e., $a\,E\{[\hat{F}_i(s,t) - \hat{F}_i(s,t-1)]^2\}$, where the parameter a can be regarded as the energy loss ratio of the encoder filter [9]. For the aforementioned frame copy concealment method, it is apparent that a = 1. In this paper, the concealment distortion is pre-calculated by running the decoder error concealment algorithm on $\hat{F}_i(s,t)$.

D. Recursive Computation of Average Channel Distortion

Finally, substituting (15) and (23) into (5), we arrive at the following frame-level channel distortion model:

$$
D_c(s,t) =
\begin{cases}
(1-Q)\left[V \lambda_b D_c(s,t-1) + (1-V) \lambda_a D_c(s-1,t)\right], & \text{if } M(s,t) \text{ is received} \\
\left[U D_{TEC}(s,t) + (1-U) D_{VEC}(s,t)\right] + \left[U \mu_b D_c(s,t-1) + (1-U) \mu_a D_c(s-1,t)\right], & \text{if } M(s,t) \text{ is lost.}
\end{cases} \tag{26}
$$

It is evident that (26) is a recursive expression that calculates D_c(s,t) iteratively from D_c(s-1,t) and D_c(s,t-1). Let P be the average packet loss ratio; thus, a packet is either lost with a probability of P or received with a probability of 1 − P.
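As an illustration, the per-frame recursion in (26), together with the MSE-based concealment distortions it consumes under frame copy, can be sketched in a few lines of Python (a minimal sketch; the function names and example values are ours, not taken from the reference software):

```python
import numpy as np

def mse(frame_a, frame_b):
    """Per-pixel mean squared difference between two decoded frames.

    Under frame copy concealment, D_TEC(s,t) is the MSE between frame
    (s,t) and its temporal neighbor (s,t-1), and D_VEC(s,t) the MSE
    between (s,t) and its inter-view neighbor (s-1,t).
    """
    diff = frame_a.astype(np.float64) - frame_b.astype(np.float64)
    return float(np.mean(diff ** 2))

def frame_distortion(received, Dc_prev_t, Dc_prev_s, Q, V, U,
                     lam_a, lam_b, mu_a, mu_b, D_TEC=0.0, D_VEC=0.0):
    """One step of the recursion (26): D_c(s,t) from D_c(s,t-1) and D_c(s-1,t)."""
    if received:
        # Error propagation only, attenuated by the interpolation filters.
        return (1 - Q) * (V * lam_b * Dc_prev_t
                          + (1 - V) * lam_a * Dc_prev_s)
    # Lost frame: concealment distortion plus attenuated propagation.
    return (U * D_TEC + (1 - U) * D_VEC) \
        + (U * mu_b * Dc_prev_t + (1 - U) * mu_a * Dc_prev_s)
```

For example, with hypothetical values Q = 0.1, V = 0.5, and unit attenuation factors, a received frame with D_c(s,t−1) = 10 and D_c(s−1,t) = 20 yields 0.9 × (5 + 10) = 13.5.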
From (5), D_c(s,t) can also be expressed as

$$
\begin{aligned}
D_c(s,t) &= (1-P) D_R(s,t) + P D_L(s,t) \\
&= \alpha D_c(s,t-1) + \beta D_c(s-1,t) + \gamma
\end{aligned} \tag{27}
$$

with

$$
\begin{aligned}
\alpha &= (1-P)(1-Q) V \lambda_b + P U \mu_b \\
\beta &= (1-P)(1-Q)(1-V) \lambda_a + P (1-U) \mu_a \\
\gamma &= P\left[U D_{TEC}(s,t) + (1-U) D_{VEC}(s,t)\right]
\end{aligned} \tag{28}
$$

where α and β can be regarded as the error propagation coefficients. Generally speaking, we have 0 < α < 1 and 0 < β < 1. Given a specific MVC video codec, error concealment algorithm, and video content, the coefficients α and β are constant. As described in Section III-B, λ_a, λ_b, μ_a, and μ_b in (28) are regarded as the attenuation coefficients of the spatial interpolation filter that attenuates error propagation. Generally speaking, we have 0 < λ_a ≤ 1 and 0 < λ_b ≤ 1. When the encoder uses only integer-pixel MVs and deblocking filtering is disabled, λ_a = 1 and λ_b = 1 according to our analysis in Section III-B. If a frame is lost, temporal concealment will be invoked with an estimated MV. In general, either integer-pixel or sub-pixel estimation can be adopted. When the decoder uses integer-pixel MVs for error concealment without deblocking filtering after concealment, we have μ_a = 1 and μ_b = 1. Similarly, when sub-pixel MVs are in use, we have 0 < μ_a ≤ 1 and 0 < μ_b ≤ 1. When deblocking filtering is applied, the deblocking operation can be considered as a post-filter after integer-pixel or sub-pixel estimation, as described in [14]. The attenuation coefficients of the deblocking filter can be either smaller or greater than one [14]; therefore, the deblocking filter can either attenuate or exacerbate error propagation. As a result, the values of λ_a and λ_b can be either smaller or larger than in the case without deblocking, depending on the deblocking filter employed by the encoder. For lost frames, deblocking is typically not applied after concealment. If this is not the case, deblocking will impact μ_a and μ_b in a manner similar to how it affects λ_a and λ_b.

Typically, video data exhibit unequal sensitivity to channel errors.
Unequal error protection (UEP), which protects video data according to system requirements, has been widely applied to video transmission [23]. The data can be divided into two or more classes of different importance, so that unequal protection levels can be applied to these classes. This allows the receiver to recover more important data with a lower packet loss rate, and less important data with a relatively higher packet loss rate. Consequently, when UEP is applied, different packet loss rates are observed at the application layer. In order to extend the distortion model to the case of UEP transmission, the average packet loss ratio P is modified to a function of the frame indices, i.e., P(s,t). For any UEP scheme, we first calculate P(s,t) according to the channel conditions and the specific UEP scheme. Then, the distortion can be estimated through the recursive model in (27). As a result, the model can also be extended to cover the distortion under UEP transmission.

Finally, the sequence-level distortion over an entire GGoP consisting of S views with T frames in each view is defined as D_c,GGoP, which can be expressed as

$$
\begin{aligned}
D_{c,\mathrm{GGoP}} &= \frac{1}{ST} \sum_{s=0}^{S-1} \sum_{t=0}^{T-1} D_c(s,t) \\
&= \frac{1}{ST} \left[ \sum_{s=0}^{S-1} D_c(s,0) + \sum_{t=1}^{T-1} D_c(0,t) + \sum_{s=1}^{S-1} \sum_{t=1}^{T-1} D_c(s,t) \right].
\end{aligned} \tag{29}
$$

The initial condition for (29) is that frame M(0,0) is coded in the INTRA mode and has a channel distortion of D_c(0,0). Using this initial condition and recursively applying (27) to (29), we are able to obtain the cumulative expected distortion over the entire GGoP.
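Putting (27) through (29) together, the expected GGoP-level distortion can be computed with a simple double loop. The sketch below (Python; all names and values are illustrative) adds the simplifying assumption that an out-of-range neighbor contributes zero distortion, mirroring the single-reference prediction at the GGoP borders:

```python
import math

def ggop_distortion(S, T, alpha, beta, gamma, D0=0.0):
    """Expected channel distortion averaged over an S-view, T-frame GGoP.

    Iterates D_c(s,t) = alpha*D_c(s,t-1) + beta*D_c(s-1,t) + gamma as in
    (27), then averages over the grid as in (29). Simplification: a
    missing neighbor (s == 0 or t == 0) contributes zero distortion.
    """
    Dc = [[0.0] * T for _ in range(S)]
    Dc[0][0] = D0  # intra-coded frame M(0, 0)
    for s in range(S):
        for t in range(T):
            if s == 0 and t == 0:
                continue
            prev_t = Dc[s][t - 1] if t > 0 else 0.0
            prev_s = Dc[s - 1][t] if s > 0 else 0.0
            Dc[s][t] = alpha * prev_t + beta * prev_s + gamma
    return sum(map(sum, Dc)) / (S * T)

def mse_to_psnr(mse):
    """Convert MSE to PSNR (dB) for 8-bit video: 10*log10(255^2 / MSE)."""
    return 10.0 * math.log10(255.0 ** 2 / mse)
```

Since 0 < α, β < 1 and typically α + β < 1, the per-frame recursion approaches the fixed point γ/(1 − α − β) deep inside the GGoP, which offers a quick sanity check on any implementation.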

With regard to the computational complexity of the proposed model, one needs to compute the distortion for each frame, whether received or lost. As can be seen from (16) to (25), the error concealment treatment is the same regardless of the coding mode of the MBs, which helps reduce the computational complexity of the model. For each received frame, five multiplications and three additions are required to calculate D_c(s,t). On the other hand, a lost frame requires six multiplications and four additions to compute the same distortion. For an entire GGoP consisting of S views with T frames in each view, one needs (2ST + S + T + 16) multiplications and (2ST + S + T + 10) additions to calculate the sequence-level distortion D_c,GGoP. Furthermore, the error concealment algorithm at the decoder has to be run at some additional computational cost. The simple frame copy scheme used in our simulations incurs negligible complexity; more sophisticated concealment algorithms, however, would require extra computational resources. Finally, it is noted that the distortions of M(s-1,t) and M(s,t-1) need to be stored as two floating-point numbers when computing D_c(s,t). Fortunately, this additional storage requirement will not pose a significant challenge to most applications.

IV. Experimental Results

In this section, we aim to validate the derived distortion model for MVV transmission over packet-lossy channels. Computer simulations are carried out to compare the modeled results with simulation results obtained by transmitting test MVV sequences at various packet loss rates using the H.264/MVC reference software (JMVM 8.0) [24]. Numerical results are presented to validate the performance of the proposed distortion model in terms of the MSE and peak signal-to-noise ratio (PSNR).

A.
Simulation Configurations

Four MVV test sequences are selected to evaluate the performance of the proposed distortion estimation model, namely, one high-motion sequence Ballroom, two medium-motion sequences Vassar and Exit, and one low-motion sequence Lotus. The first three sequences are obtained from Mitsubishi Electric Research Laboratories, Cambridge, MA [25], and are recommended by the Joint Video Team (JVT) as standard test sequences for evaluating the performance of MVC [24]. The baseline distance between adjacent cameras is 20 cm for these three sequences. The fourth sequence, Lotus, is obtained from Tianjin 3-D Imaging Technique Co. Ltd., Tianjin, China, and is a 3-D MVV test sequence with proper parallax between adjacent views. The resolution of Lotus is , while the resolutions of the other three sequences are . For each MVV sequence, 8 views with 250 frames per view are coded, where the first frame in view 0 is coded as an I-frame and the remaining frames are coded as P-frames, in accordance with the GGoP structure illustrated in Fig. 1. The frame rate is 15 frames/s. The QP values for the sequences Lotus and Vassar, Exit, and Ballroom used in our experiments are 32, 35, and 41, respectively.

TABLE I
Intra Mode Rate Q, Inter-View Rate V, and Temporal Concealment Rate U for Test Sequences

Sequence    Ballroom    Exit      Vassar    Lotus
Q           3.59%       3.19%     0.49%     0.16%
V           89.44%      97.28%    98.22%    90.19%
U           89.81%      97.70%    98.45%    91.43%

There is only one I-frame for each MVV sequence, which is assumed to be received error-free. All the MBs in a frame are grouped into a single slice, which is carried in a separate transport packet. It is noted that the packet lengths of all the P-frames in our simulations are within the limit of the maximum transmission unit (MTU), which is 1500 bytes for Ethernet. For higher resolution MVV sequences, the MTU size limitation can easily be circumvented by partitioning compressed video frames into an integer number of packets.
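The packetization rule above (one slice per packet, frames split into an integer number of MTU-sized packets when needed) can be sketched as follows (a hypothetical helper; the names are ours):

```python
import math

MTU_BYTES = 1500  # Ethernet maximum transmission unit

def packets_per_frame(frame_bytes, mtu=MTU_BYTES):
    """Number of transport packets when a compressed frame is carried
    in MTU-sized packets: one packet if it fits, otherwise an integer
    number of packets covering the frame."""
    return max(1, math.ceil(frame_bytes / mtu))
```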
The P-frames are subject to packet loss at rates of 2%, 5%, 8%, and 10%. The fractions of intra blocks (intra mode rate Q) and intra-view predicted blocks (intra-view rate V) are obtained from the H.264/MVC encoder. Table I gives the average percentages for the four test video sequences.

Two different packet loss patterns are considered to simulate packet losses in our simulations: the random packet loss pattern, in which packet losses occur randomly at an average packet loss rate, and the JVT SVC/AVC loss pattern, which is derived from actual error measurements in packet-switching networks and recommended by the JVT. The latter loss pattern is generally considered a close approximation to real packet loss scenarios in packet-switching networks [26]. In our experiments, the error concealment method of frame copy is employed, which conceals a lost frame by copying the previous reconstructed frame at the decoder. The concealed frame is also stored in the reference frame buffer for decoding subsequent frames. D_TEC for each frame is pre-calculated as $D_{TEC}(s,t) = E\{[\hat{F}_i(s,t) - \hat{F}_i(s,t-1)]^2\}$, whereas D_VEC can be pre-calculated as $D_{VEC}(s,t) = E\{[\hat{F}_i(s,t) - \hat{F}_i(s-1,t)]^2\}$. When computing the average channel-induced distortion for the entire GGoP, D_TEC,avg and D_VEC,avg, the average concealment distortions over all the frames, are used instead of the per-frame concealment distortions. Note that since frame copy is employed as the concealment strategy, for each concealed frame we have U = 1 when the previous time frame is taken as the reference frame; conversely, U = 0 when the corresponding frame in the adjacent view is adopted as the reference frame. When computing the average channel-induced distortion over all the P-frames in a GGoP, the average value of U is used, which is also given in Table I.

B.
Parameter Estimation

In order to compute the average channel distortion of an MVV sequence using (29), certain model parameters need to be determined, i.e., λ_a, λ_b, μ_a, and μ_b. λ_a and λ_b mainly depend on the video quality level, and are thus related to the

average quantization step size in source coding. On the other hand, μ_a and μ_b are contingent on the chosen temporal and inter-view error concealment schemes. We propose to use the least squares fitting technique to find these parameters using training video data. For example, in order to estimate λ_a and λ_b, we apply least squares fitting to (15); that is, λ_a and λ_b are determined by minimizing

$$
\underset{\lambda_a,\,\lambda_b}{\arg\min} \sum_{M(s,t)\ \text{received}} \left( D_{c,\text{measured}}(s,t) - D_R(s,t) \right)^2 \tag{30}
$$

where D_c,measured(s,t) denotes the actual distortion measured in the simulations, and the sum over M(s,t) covers all received P-frames in the decoded MVV sequence. Similarly, μ_a and μ_b are estimated by applying least squares fitting to (23)

$$
\underset{\mu_a,\,\mu_b}{\arg\min} \sum_{M(s,t)\ \text{lost}} \left( D_{c,\text{measured}}(s,t) - D_L(s,t) \right)^2. \tag{31}
$$

Since the parameters λ_a, λ_b, μ_a, and μ_b are contingent on the sequence/content and the concealment strategy, they should be estimated for each MVV sequence individually. For example, when we determine the values of these parameters for Ballroom, the measured distortion D_c,measured is first obtained through simulations at different packet loss rates. Then, the model parameters are computed by applying least squares fitting to (15) and (23). Since frame copy is employed as the error concealment strategy in our experiments, we have μ_a = 1 and μ_b = 1 according to our analysis in Section III-C. Typical values of the model parameters used in our simulations are given in Table II.

TABLE II
Typical Values of the Model Parameters

Sequence    λ_a    λ_b    μ_a    μ_b
Ballroom
Exit
Vassar
Lotus

C. Results and Discussion

We compare the MSE values obtained via the proposed model with those measured in simulations for the four MVV test sequences. In the first set of experiments, the random packet loss pattern is employed to simulate packet losses. Fig.
2(a) and (b) plot the PSNRs against both the view and frame numbers for the four MVV sequences at selected packet loss rates. These curves are obtained using both our recursive model and test simulations. Note that the MSE is converted to the PSNR via the relationship PSNR = 10 log₁₀(255²/MSE). It can be observed from Fig. 2 that error propagation occurs along both the intra-view and inter-view directions. For a closer look, as an example, the distortion curves for each view of Ballroom are plotted in Fig. 3, showing the frame-level distortion of each view versus the frame number.

Fig. 2. Modeled and measured distortion versus both the view and frame numbers at the packet loss rate of 5%. (a) Modeled distortion. (b) Measured distortion.

Next, the JVT SVC/AVC loss pattern is adopted to simulate packet losses. Unlike the random loss pattern, the JVT pattern takes into account realistic packet loss scenarios (e.g., bursty losses), as derived from actual error measurements in packet-switching networks. Fig. 4 compares the modeled and simulated results; the distortion versus the frame number for each view of Ballroom is shown in the figure. For both loss patterns, the distortions estimated with the proposed model are in close agreement with their measured counterparts.

Since UEP, which applies stronger protection to more important video bits, has been in wide use in video communications systems, unequal packet loss rates are also considered in our simulations. It is well known that earlier frames in a GGoP are more critical than the remaining frames. Therefore, in our experiments we assume that all the frames in view 0 and the first 20 frames in the other views receive better protection than the rest of the frames in the GGoP. The frame-level distortion curves for each view of Exit versus the frame number are plotted in Fig. 5.
In the experiment, the packet loss rate for the better protected frames is set to 5%, whereas the loss rate is 10% for the rest of the frames.
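Returning to the parameter fitting of Section IV-B: since D_R in (15) is linear in (λ_a, λ_b), the minimization in (30) is an ordinary linear least squares problem, and likewise (31) for (μ_a, μ_b). A sketch of the received-frame fit, assuming the per-frame quantities D_c(s,t−1), D_c(s−1,t), and D_c,measured have already been collected from simulation runs such as those above (all names and values here are illustrative):

```python
import numpy as np

def fit_lambdas(Q, V, Dc_prev_t, Dc_prev_s, D_measured):
    """Least squares fit of (lambda_a, lambda_b) over received frames.

    For each received frame, (15) predicts
        D_R = (1-Q) * [V*lambda_b*D_c(s,t-1) + (1-V)*lambda_a*D_c(s-1,t)]
    which is linear in (lambda_a, lambda_b), so (30) reduces to an
    ordinary linear least squares problem.
    """
    x_a = (1 - Q) * (1 - V) * np.asarray(Dc_prev_s)  # multiplies lambda_a
    x_b = (1 - Q) * V * np.asarray(Dc_prev_t)        # multiplies lambda_b
    A = np.column_stack([x_a, x_b])
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(D_measured), rcond=None)
    lam_a, lam_b = coeffs
    return lam_a, lam_b
```

The lost-frame fit in (31) has the same shape, with regressors U·D_c(s,t−1) and (1−U)·D_c(s−1,t) and the per-frame concealment distortions moved to the left-hand side.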

TABLE III
Comparison of measured and modeled distortion: PSNR (dB) of each view (views 0-7) of Ballroom, Exit, Vassar, and Lotus, measured versus modeled, at packet loss rates of 2%, 5%, 8%, and 10%.

We now discuss the accuracy of the proposed model. For the tth frame of a single-view video, the errors in the preceding t frames may propagate to the current frame; thus, the length of distortion propagation can be defined as t. Likewise, the length of distortion propagation for frame M(s,t) in MVV is st, which is s times longer than the distortion propagation length of single-view video. As a result, the accumulated prediction error for the MVV frame M(s,t) could potentially be s times larger than that of the tth frame in the single-view scenario. As shown in Fig. 4 of [8] (ROPE), Figs. 2 and 8 of [14], and Figs. 1 and 2 of [17], the maximum difference between the predicted and experimental results for single-view video is around 100 in MSE and 2 dB in PSNR. As can be observed from the curves in Figs. 3, 4, and 5, the maximum MSE difference between the measured and predicted results with the proposed model is likewise no larger than 100, which also corresponds to around 2 dB in PSNR. It can, therefore, be concluded that the proposed model achieves the same level of accuracy as the distortion models for single-view video, even though 2-D error propagation in MVV along both the temporal and view dimensions could potentially cause much larger modeling inaccuracy than in the single-view case.

We then estimate the average PSNR per view for the four MVV test sequences.
Table III lists the PSNR values of both the modeled and measured average distortion of each view of the four MVV test sequences. These comparative results clearly demonstrate that the modeled distortion is very close to its measured counterpart over a broad range of packet loss rates.

Furthermore, we compute the average channel-induced distortion over all the P-frames in a GGoP at a variety of packet loss rates. Fig. 6 depicts the PSNR of the whole GGoP versus the packet loss rate for the four MVV test sequences. As can be seen from Fig. 6, the proposed model accurately predicts the expected distortions over the large range of loss rates examined. The average distortion of the GGoP is computed using the average concealment distortions over all the frames, i.e., D_TEC,avg and D_VEC,avg. The results match the actual measured distortion curves well. The close agreement between the modeled and measured distortion results demonstrates that the proposed recursive model can be used to estimate and analyze the impact of random packet losses on the average multi-view video quality.

D. Subjective Evaluations

In this section, subjective evaluations are performed to verify the validity of the proposed distortion model by comparing the modeled distortion results with subjective assessment results. All four MVV sequences are evaluated for subjective quality; however, only Lotus is evaluated for stereo (depth) sensation in our subjective experiments. This is because the

distance between the views of the other three sequences is 20 cm, far greater than the standard parallax of the human eyes (65 mm) [25], whereas Lotus possesses proper parallax between adjacent views.

Ten observers with normal vision are selected to participate in our subjective experiments. Half of the participants have prior experience, while the other half do not. A 19-inch LCD display is used to display the views of the MVV sequences one by one. The ten observers evaluate and rate each test MVV sequence in terms of its video quality grade, which is classified into five grades, i.e., bad (Grade 1), poor (Grade 2), fair (Grade 3), good (Grade 4), and excellent (Grade 5), according to ITU-R Recommendation BT [27] on subjective quality assessment. The average channel-induced distortion over the entire GGoP of each test sequence is estimated with the proposed model. The mean ratings of the video quality and the modeled results at packet loss rates of 2%, 5%, 8%, and 10% are listed in Table IV.

Fig. 3. Distortion (PSNR) versus frame number for each view of Ballroom at the packet loss rate of 5% under random packet loss. (a)-(h) PSNR of views 0-7 in Ballroom.

Fig. 4. Distortion (MSE) versus frame number for each view of Ballroom at the packet loss rate of 5% under the JVT SVC/AVC loss pattern. (a)-(h) MSE of views 0-7 in Ballroom.
For the evaluation of the stereo (depth) sensation of Lotus, a 42-inch multi-view auto-stereoscopic display (3DFreeEye 42HD, provided by Tianjin 3-D Imaging Technique Co. Ltd.) is used to display the 8-view 3-D video. The viewing distance of the observers is 2.5 m. Eight different views of the sequence are integrated in the multi-view auto-stereoscopic display. The 3-D MVV is evaluated by three different subjective criteria, namely, image quality, stereo (depth) sensation, and visual comfort [28]. A 5-point rating scale, i.e., [bad]-[poor]-[fair]-[good]-[excellent], is used for each subjective criterion in accordance with ITU-R Recommendation BT [27]. An image with no perceived depth is rated bad (Grade 1), while an image with the most perceived depth is rated excellent (Grade 5). Visual comfort is described by [very annoying]-[annoying]-[slightly annoying]-[perceptible, but not annoying]-[imperceptible], which correspond to the quality grades from bad (Grade 1) to excellent (Grade 5). Table V presents the mean ratings of the three subjective criteria and

the modeled results for Lotus at packet loss rates ranging from 2% to 20%.

TABLE IV
Comparison of Subjective Assessment and Modeled Distortion Results for Ballroom, Vassar, and Exit Sequences

Packet loss rate                          No loss     2%       5%       8%       10%
Ballroom   Modeled (PSNR)
           Subjective quality grade       Excellent   Poor     Poor     Bad      Bad
Exit       Modeled (PSNR)
           Subjective quality grade       Excellent   Fair     Poor     Poor     Poor
Vassar     Modeled (PSNR)
           Subjective quality grade       Excellent   Good     Fair     Fair     Poor

Fig. 5. Distortion (PSNR) versus frame number for each view of Exit under unequal error protection. (a)-(h) PSNR of views 0-7 in Exit.

Fig. 6. Average distortion over all P-frames in a GGoP versus the packet loss rate. (a) PSNR of Ballroom. (b) PSNR of Exit. (c) PSNR of Vassar. (d) PSNR of Lotus.

Fig. 7. Correlation between modeled results and MOS.

In Fig. 7, the subjective MOS and the objective estimation results are plotted for all four MVV sequences at each packet loss ratio. The different subjective metrics, i.e., image quality, stereo (depth) sensation, and visual comfort, are shown in the plot. Since the MSE and PSNR are mathematical statistics measuring differences between two images, they reflect visual perceptual quality more effectively when errors occur uniformly at random in the images/video [29]. Note that in our simulations the channel errors are distributed randomly in the video after random packet losses and error concealment. In our experiments, a large coefficient of determination is found between the image quality ratings and the modeled distortion measure (R² = 0.85), and medium coefficients of determination between stereo (depth)

sensation, visual comfort, and the modeled PSNR (R² = 0.56 and R² = 0.75, respectively). As shown in Table IV and Fig. 6, the modeled results are in close agreement with the subjective assessment results for the sequences Ballroom, Vassar, and Exit. For Lotus, as can be observed from Table V, the modeled results correlate closely with the subjective results in terms of image quality. The modeled results also correlate to some extent with the subjective results in terms of stereo (depth) sensation and visual comfort. Our subjective evaluation results in Tables IV and V clearly demonstrate that the proposed model is able to effectively model the channel-induced distortion for multi-view video.

TABLE V
Comparison of Subjective Assessment and Modeled Distortion Results for Lotus

Packet loss rate                          No loss     2%      5%      8%      10%     15%     20%
Modeled (PSNR)
Subjective: Image quality                 Excellent   Good    Good    Good    Fair    Fair    Poor
Subjective: Stereo (depth) sensation      Good        Good    Good    Good    Good    Fair    Fair
Subjective: Visual comfort                Excellent   Good    Fair    Fair    Fair    Poor    Poor

V. Conclusion

Despite the fact that many researchers have modeled the channel distortion behavior of monoscopic video transmission under various error-prone channel models, there is still a lack of work on distortion modeling for multi-view video communications systems. Traditional models for single-view video only consider error propagation across temporal frames, which is 1-D. In multi-view video coding, however, mismatch errors can propagate along both the temporal and view dimensions. In this paper, we developed a recursive distortion model that properly models this 2-D error propagation in multi-view video and accurately estimates the distortion of lossy multi-view video transmission due to both random and unequal packet losses.
The proposed model is applicable to any MVC scheme based upon the classic motion-compensated and disparity-compensated predictive video coding paradigm. An important feature of the model is that it is recursive: it iteratively computes the distortion of the current frame based upon those of the previous time and view frames. In addition to theoretical analyses, both objective and subjective evaluation results are presented to compare the modeled and measured distortions, which clearly demonstrate that the developed model is able to estimate the expected channel distortion at both the frame and GGoP levels with high accuracy. Therefore, in practical applications, our distortion model can be employed to study the performance of frame-based error resilient techniques for MVV transmission over packet-switched networks.

Acknowledgment

The authors would like to thank Dr. Y. Chen for many helpful discussions and suggestions.

References

[1] Q. Zhang, W. Zhu, and Y. Q. Zhang, "End-to-end QoS for video delivery over wireless Internet," Proc. IEEE, vol. 93, no. 1, pp , Jan
[2] B. Girod and N. Farber, "Feedback-based error control for mobile video transmission," Proc. IEEE, vol. 87, no. 10, pp , Oct
[3] M. F. Sabir, R. W. Heath, and A. C. Bovik, "Joint source-channel distortion modeling for MPEG-4 video," IEEE Trans. Image Process., vol. 18, no. 1, pp , Jan
[4] P. A. Chou and Z. Miao, "Rate-distortion optimized streaming of packetized media," IEEE Trans. Multimedia, vol. 8, no. 2, pp , Apr
[5] G. Cote, S. Shirani, and F. Kossentini, "Optimal mode selection and synchronization for robust video communications over error-prone networks," IEEE J. Sel. Areas Commun., vol. 18, no. 6, pp , Jun
[6] Y. M. Zhou, Y. Sun, Z. D. Feng, and S. X. Sun, "New rate-distortion modeling and efficient rate control for H.264/AVC video coding," Signal Process. Image Commun., vol. 24, pp , May
[7] S.-R. Kang and D.
Loguinov, "Modeling best-effort and FEC streaming of scalable video in lossy network channels," IEEE/ACM Trans. Netw., vol. 15, no. 1, pp , Feb
[8] R. Zhang, S. L. Regunathan, and K. Rose, "Video coding with optimal inter/intra-mode switching for packet loss resilience," IEEE J. Sel. Areas Commun., vol. 18, no. 6, pp , Jun
[9] K. Stuhlmuller, N. Farber, M. Link, and B. Girod, "Analysis of video transmission over lossy channels," IEEE J. Sel. Areas Commun., vol. 18, no. 6, pp , Jun
[10] C. Y. Zhang, H. Yang, S. Y. Yu, and X. K. Yang, "GOP-level transmission distortion modeling for mobile streaming video," Signal Process. Image Commun., vol. 23, pp , Feb
[11] X. K. Yang, C. Zhu, Z. G. Li, X. Lin, G. N. Feng, S. Wu, and N. Ling, "Unequal loss protection for robust transmission of motion compensated video over the Internet," Signal Process. Image Commun., vol. 18, no. 3, pp , Mar
[12] H. Yang and K. Rose, "Recursive end-to-end distortion estimation with model-based cross-correlation approximation," in Proc. IEEE Conf. Image Process., Sep. 2003, pp
[13] Z. He, J. Cai, and C. W. Chen, "Joint source channel rate-distortion analysis for adaptive mode selection and rate control in wireless video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 6, pp , Jun
[14] Y. Wang, Z. Wu, and J. M. Boyce, "Modeling of transmission-loss induced distortion in decoded video," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 6, pp , Jun
[15] H. Yang and K. Rose, "Advances in recursive per-pixel end-to-end distortion estimation for robust video coding in H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 7, pp , Jul
[16] Z. Li, J. Chakareski, X. Niu, Y. Zhang, and W. Gu, "Modeling and analysis of distortion caused by Markov-model burst packet losses in video transmission," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 7, pp , Jul
[17] Y. Zhang, W. Gao, Y. Lu, Q. M. Huang, and D. B.
Zhao, "Joint source-channel rate-distortion optimization for H.264 video coding over error-prone networks," IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 3, pp , Apr
[18] X. Xiang, D. Zhao, Q. Wang, S. Ma, and W. Gao, "Rate-distortion optimization with inter-view refreshment for stereoscopic video coding over error-prone networks," Proc. SPIE, vol. 7257, pp. 1-7, Jan
[19] A. S. Tan, A. Aksay, G. B. Akar, and E. Arikan, "Rate-distortion optimization for stereoscopic video streaming with unequal error protection," EURASIP J. Adv. Signal Process., vol. 2009, pp. 1-14, Jan

[20] A. S. Tan, A. Aksay, C. Bilen, G. B. Akar, and E. Arikan, "Rate-distortion optimized layered stereoscopic video streaming with raptor codes," in Proc. 16th Int. Packet Video Workshop, Nov. 2007, pp
[21] Y. Liu, J. Wang, and H. H. Zhang, "Depth image-based temporal error concealment for 3-D video transmission," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 4, pp , Apr
[22] K. Song, T. Chung, Y. Oh, and C. S. Kim, "Error concealment of multiview video sequences using inter-view and intra-view correlations," J. Vis. Commun. Image Represent., vol. 20, pp , May
[23] J. Goshi, A. E. Mohr, R. E. Ladner, E. A. Riskin, and A. Lippman, "Unequal loss protection for H.263 compressed video," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 3, pp , Mar
[24] P. Pandit, A. Vetro, H. Kimata, and Y. Chen, Joint Multiview Video Model (JMVM) 8.0, document JVT-AA208, ISO/IEC JTC1/SC29/WG11 and ITU-T Q6/SG16, Apr
[25] Multiview Video Test Sequences from MERL, ISO/IEC JTC1/SC29/WG11, MPEG05/m12077, Apr
[26] Y. Guo, H. Q. Li, and Y. K. Wang, SVC/AVC Loss Simulator Donation, document JVT-P069, ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, Oct
[27] Methodology for the Subjective Assessment of the Quality of Television Pictures, Rec. ITU-R BT , ITU
[28] P. J. H. Seuntiens, "Visual experience of 3-D TV," Ph.D. dissertation, Philips Research Eindhoven/Eindhoven Univ. Technol., Eindhoven, The Netherlands
[29] W. Osberger, "Perceptual vision models for picture quality assessment and compression applications," Ph.D. dissertation, School Electr. Electron. Syst. Eng., Queensland Univ. Technol., Brisbane, Australia

Yuan Zhou received the B.Eng. and M.Eng. degrees in electronic engineering and communication engineering from Tianjin University, Tianjin, China, in 2006 and 2008, respectively. She is currently pursuing the Ph.D.
degree in communication engineering with the School of Electronic Information Engineering, Tianjin University. Her current research interests include image/video transmission and 3-D image/video processing.

Chunping Hou received the M.Eng. and Ph.D. degrees, both in electronic engineering, from Tianjin University, Tianjin, China, in 1986 and 1998, respectively. She was a Post-Doctoral Researcher with the Beijing University of Posts and Telecommunications, Beijing, China, beginning in 1999. Since 1986, she has been with the faculty of the School of Electronic and Information Engineering, Tianjin University, where she is currently a Full Professor and the Director of the Broadband Wireless Communications and 3-D Imaging Institute. Her current research interests include wireless communication, 3-D image processing, and the design and applications of communication systems.

Wei Xiang (S'00–M'04–SM'10) received the B.Eng. and M.Eng. degrees, both in electronic engineering, from the University of Electronic Science and Technology of China, Chengdu, China, in 1997 and 2000, respectively, and the Ph.D. degree in telecommunications engineering from the University of South Australia, Adelaide, Australia. Since January 2004, he has been with the Faculty of Engineering and Surveying, University of Southern Queensland, Toowoomba, Australia, where he was an Associate Lecturer in Computer Systems Engineering from 2004 to 2006, a Lecturer from 2007 to 2008, and currently holds a faculty post of Senior Lecturer. He was a Visiting Scholar with Nanyang Technological University, Singapore, from January 2008 to June 2008, and with the University of Mississippi, Oxford, from October 2010 to March 2011. His current research interests include the broad area of communications and information theory, particularly coding and signal processing for multimedia communications systems. Dr.
Xiang received the prestigious Queensland International Fellowship awarded by the State Government of Queensland, Commonwealth of Australia.

Feng Wu (M'99–SM'06) received the B.S. degree in electrical engineering from Xidian University, Xi'an, China, in 1992, and the M.S. and Ph.D. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 1996 and 1999, respectively. He joined Microsoft Research Asia (formerly Microsoft Research China), Beijing, China, as an Associate Researcher, has been a Researcher there since 2001, and is currently a Senior Researcher/Research Manager. He has authored or co-authored over 200 papers published in journals such as the IEEE Transactions on Image Processing and the IEEE Transactions on Circuits and Systems for Video Technology, and in international conferences and forums, e.g., MOBICOM, SIGIR, INFOCOM, CVPR, DCC, and ICIP. His current research interests include image and video representation, media compression and communication, and computer vision. Dr. Wu received the Best Paper Award of the IEEE Transactions on Circuits and Systems for Video Technology in 2009, and of PCM 2008 and SPIE VCIP 2007, as a co-author.


More information

International Journal for Research in Applied Science & Engineering Technology (IJRASET) Motion Compensation Techniques Adopted In HEVC

International Journal for Research in Applied Science & Engineering Technology (IJRASET) Motion Compensation Techniques Adopted In HEVC Motion Compensation Techniques Adopted In HEVC S.Mahesh 1, K.Balavani 2 M.Tech student in Bapatla Engineering College, Bapatla, Andahra Pradesh Assistant professor in Bapatla Engineering College, Bapatla,

More information

ERROR CONCEALMENT TECHNIQUES IN H.264

ERROR CONCEALMENT TECHNIQUES IN H.264 Final Report Multimedia Processing Term project on ERROR CONCEALMENT TECHNIQUES IN H.264 Spring 2016 Under Dr. K. R. Rao by Moiz Mustafa Zaveri (1001115920) moiz.mustafazaveri@mavs.uta.edu 1 Acknowledgement

More information

An Overview of Video Coding Algorithms

An Overview of Video Coding Algorithms An Overview of Video Coding Algorithms Prof. Ja-Ling Wu Department of Computer Science and Information Engineering National Taiwan University Video coding can be viewed as image compression with a temporal

More information

Popularity-Aware Rate Allocation in Multi-View Video

Popularity-Aware Rate Allocation in Multi-View Video Popularity-Aware Rate Allocation in Multi-View Video Attilio Fiandrotti a, Jacob Chakareski b, Pascal Frossard b a Computer and Control Engineering Department, Politecnico di Torino, Turin, Italy b Signal

More information

Study of AVS China Part 7 for Mobile Applications. By Jay Mehta EE 5359 Multimedia Processing Spring 2010

Study of AVS China Part 7 for Mobile Applications. By Jay Mehta EE 5359 Multimedia Processing Spring 2010 Study of AVS China Part 7 for Mobile Applications By Jay Mehta EE 5359 Multimedia Processing Spring 2010 1 Contents Parts and profiles of AVS Standard Introduction to Audio Video Standard for Mobile Applications

More information

COMP 9519: Tutorial 1

COMP 9519: Tutorial 1 COMP 9519: Tutorial 1 1. An RGB image is converted to YUV 4:2:2 format. The YUV 4:2:2 version of the image is of lower quality than the RGB version of the image. Is this statement TRUE or FALSE? Give reasons

More information

Robust Transmission of H.264/AVC Video using 64-QAM and unequal error protection

Robust Transmission of H.264/AVC Video using 64-QAM and unequal error protection Robust Transmission of H.264/AVC Video using 64-QAM and unequal error protection Ahmed B. Abdurrhman 1, Michael E. Woodward 1 and Vasileios Theodorakopoulos 2 1 School of Informatics, Department of Computing,

More information

Project Proposal: Sub pixel motion estimation for side information generation in Wyner- Ziv decoder.

Project Proposal: Sub pixel motion estimation for side information generation in Wyner- Ziv decoder. EE 5359 MULTIMEDIA PROCESSING Subrahmanya Maira Venkatrav 1000615952 Project Proposal: Sub pixel motion estimation for side information generation in Wyner- Ziv decoder. Wyner-Ziv(WZ) encoder is a low

More information

TERRESTRIAL broadcasting of digital television (DTV)

TERRESTRIAL broadcasting of digital television (DTV) IEEE TRANSACTIONS ON BROADCASTING, VOL 51, NO 1, MARCH 2005 133 Fast Initialization of Equalizers for VSB-Based DTV Transceivers in Multipath Channel Jong-Moon Kim and Yong-Hwan Lee Abstract This paper

More information

Rate-distortion optimized mode selection method for multiple description video coding

Rate-distortion optimized mode selection method for multiple description video coding Multimed Tools Appl (2014) 72:1411 14 DOI 10.1007/s11042-013-14-8 Rate-distortion optimized mode selection method for multiple description video coding Yu-Chen Sun & Wen-Jiin Tsai Published online: 19

More information

Technical report on validation of error models for n.

Technical report on validation of error models for n. Technical report on validation of error models for 802.11n. Rohan Patidar, Sumit Roy, Thomas R. Henderson Department of Electrical Engineering, University of Washington Seattle Abstract This technical

More information

A two-stage approach for robust HEVC coding and streaming

A two-stage approach for robust HEVC coding and streaming Loughborough University Institutional Repository A two-stage approach for robust HEVC coding and streaming This item was submitted to Loughborough University's Institutional Repository by the/an author.

More information

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 6, JUNE

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 6, JUNE IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 9, NO., JUNE 9 8 Error Resilient Coding and Error Concealment in Scalable Video Coding Yi Guo, Ying Chen, Member, IEEE, Ye-KuiWang,

More information

Reduced complexity MPEG2 video post-processing for HD display

Reduced complexity MPEG2 video post-processing for HD display Downloaded from orbit.dtu.dk on: Dec 17, 2017 Reduced complexity MPEG2 video post-processing for HD display Virk, Kamran; Li, Huiying; Forchhammer, Søren Published in: IEEE International Conference on

More information

P SNR r,f -MOS r : An Easy-To-Compute Multiuser

P SNR r,f -MOS r : An Easy-To-Compute Multiuser P SNR r,f -MOS r : An Easy-To-Compute Multiuser Perceptual Video Quality Measure Jing Hu, Sayantan Choudhury, and Jerry D. Gibson Abstract In this paper, we propose a new statistical objective perceptual

More information

Fast Mode Decision Algorithm for Intra prediction in H.264/AVC Video Coding

Fast Mode Decision Algorithm for Intra prediction in H.264/AVC Video Coding 356 IJCSNS International Journal of Computer Science and Network Security, VOL.7 No.1, January 27 Fast Mode Decision Algorithm for Intra prediction in H.264/AVC Video Coding Abderrahmane Elyousfi 12, Ahmed

More information