A multiview sequence CODEC with view scalability

Signal Processing: Image Communication 19 (2004) 239–256

JeongEun Lim a, King N. Ngan b, Wenxian Yang b, Kwanghoon Sohn a,*

a Department of Electrical and Electronic Engineering, Yonsei University, 134 Shinchon-dong, Seodaemun-gu, Seoul 120-749, South Korea
b School of Computer Engineering, Nanyang Technological University, Singapore 639798, Singapore

Received 3 March 2003; received in revised form 25 July 2003; accepted 2 October 2003

Abstract

A multiview sequence CODEC with flexibility, MPEG-2 compatibility and view scalability is proposed. We define a GGOP (Group of GOP) structure as the basic coding unit for efficiently coding multiview sequences. The proposed CODEC provides flexible GGOP structures based on the number of views and the baseline distances among cameras. The encoder generates two types of bitstreams: a main bitstream and an auxiliary one. The main bitstream is the same as an MPEG-2 mono-sequence bitstream, for MPEG-2 compatibility. The auxiliary bitstream contains information concerning the remaining multiview sequences, that is, all sequences except the reference sequences. The proposed CODEC with view scalability provides several viewers with a sense of reality, or a single viewer with motion parallax, whereby changes in the viewer's position result in changes in what is seen. The important point is that the number of viewpoints is selectively determined at the receiver according to the type of display mode. Viewers can choose an arbitrary number of views by checking the view information, so that only the selected views are decoded and displayed. The proposed multiview sequence CODEC is tested with several multiview sequences to assess its flexibility, compatibility and view scalability. In addition, we subjectively confirm that the decoded bitstreams with view scalability can be properly displayed in several types of display modes, including 3D monitors.

© 2003 Elsevier B.V. All rights reserved.
Keywords: Multiview sequence CODEC; MPEG-2 compatibility; View scalability

*Corresponding author. E-mail addresses: asknngan@ntu.edu.sg (K.N. Ngan), khsohn@yonsei.ac.kr (K. Sohn).

0923-5965/$ - see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/j.image.2003.10.002

1. Introduction

One of the most desired features for realizing high quality information and telecommunication services in the near future is "the sensation of reality". This can be achieved by visual communication based on three-dimensional (3D) images. The 3D imaging system has many potential applications in education, entertainment, medical surgery, videoconferencing, etc. To provide many viewers with more vivid and accurate information about the remote scene, three or more cameras are placed at slightly different viewpoints to produce multiview sequences. Because of the current interest in 3D images, a number of research groups have reported on 3D image processing and display systems. In particular, several 3DTV projects combine both signal processing concepts and human factors. In Europe, research on 3DTV has been initiated

by several projects such as DISTIMA, the objective of which is to develop a system for capturing, coding, transmitting and presenting digital stereoscopic image sequences [4,9,18]. These projects led to another project, PANORAMA, whose goal is to enhance visual information in telecommunications with 3D telepresence [2,15]. The IST project VIRTUE is developing a videoconferencing system aimed at a convincing impression of immersive telepresence [10]. It is being designed and constructed to achieve a three-way video conference supporting life-size, upper-body video images in a shared virtual environment. In addition, 3D broadcasting with a stereo display was proposed for the FIFA 2002 World Cup in Korea [11]. Other noteworthy efforts have been made by the 3D HDTV project of NHK in Japan [1,3].

At the receiver side, 3D displays are required to decode and display multiview video sequences. Many of the 3D-LCD systems under development are single-user systems that permit only one person at a time to experience the full 3D effect. However, we believe that for applications in the entertainment industry and consumer electronics, a social context exists in which multiple users would want to enjoy 3D on the screen simultaneously. Recently, some multi-user 3D display monitors have been developed to provide multiple viewers with more vivid and accurate information [6,8,19]. However, they still pose problems in several respects. Three or more cameras may be used to form a multiview video system that produces multiview image sequences. A substantial amount of data is produced, and the processing complexity grows as the number of viewpoints increases. Thus, a multiview sequence CODEC is needed in order to transmit or store these data efficiently.
In addition, the number of viewpoints should be selectively determined at the receiver according to the type of display mode, such as conventional 2D or 3D display monitors.

In this paper, we propose a block-based multiview sequence CODEC with flexibility, MPEG-2 compatibility and view scalability. Our main goals in developing the CODEC are as follows. Firstly, a coding structure and efficient compression techniques that are not computationally expensive are required to encode the substantial amount of data for transmission. We define GGOP structures that take into account correlations in both the view and time domains. They provide flexible structures according to the number of views and the baseline distance among the cameras. Secondly, the encoding/decoding should be compatible with an existing standard such as MPEG-2 or MPEG-4. Our CODEC includes an MPEG-2 structure and generates a main bitstream identical to an MPEG-2 bitstream, while the other bitstream contains information concerning the other sequences. Thirdly, the receiver side should be able to provide multiple viewers with selected views on various types of 3D displays. With view scalability, we can selectively decode the appropriate number of views according to the type of display mode.

This paper is organized as follows. Section 2 provides a brief review of existing multiview sequence coding methods, in particular the MPEG-2 Multi-View Profile (MVP). Other current methods for multiview sequence coding and their problems are also discussed. In Section 3, a new multiview sequence CODEC with flexibility, MPEG-2 compatibility and view scalability is proposed. Simulation results and discussions are presented in Section 4 and conclusions are outlined in Section 5.

2. Background

The MPEG-2 Multi-View Profile (MVP) was defined in 1996 as an amendment to the MPEG-2 standard. Its main new elements are the definition of how the Temporal Scalability (TS) mode is used for multi-camera sequences and the definition of acquisition camera parameters in the MPEG-2 syntax [16]. With temporal scalability, it is possible to encode a base layer stream representing a signal with a reduced frame rate, and to define an enhancement layer stream that inserts additional frames in between, allowing reproduction at the full frame rate if both streams are available. A very efficient way to encode the enhancement layer allows decisions to be made concerning the best motion-compensated

prediction for each macroblock in the enhancement layer frame: either from the base layer frame or from the recently reconstructed enhancement layer frame. For such a signal, performing stereo and multiview channel encoding using the temporal scalability syntax is straightforward. For this purpose, frames from one camera view (usually the left) are defined as the base layer, and frames from the other as the enhancement layer. The enhancement-from-base-layer prediction then turns out to be a disparity-compensated prediction instead of a motion-compensated prediction. The base layer represents a simultaneous monoscopic sequence. For the enhancement layer, although the disparity-compensated prediction might fail in occluded regions, it is still possible to maintain the reconstructed image quality by motion-compensated prediction within the same channel. Since MPEG-2 MVP was mainly defined for stereo sequences, it does not support multiview sequences and is inherently difficult to extend.

Some other methods for coding stereo sequences have recently been proposed, one of which was by the Electronics and Telecommunications Research Institute (ETRI) in Korea for experimental broadcasting of the FIFA 2002 World Cup. It converts the stereo sequence into a mono-sequence by decimation in the horizontal direction [11]. A similar method was proposed by the Communications Research Centre (CRC) in Canada, but decimation in the vertical direction was used [5]. These two methods are useful in saving bandwidth and can be implemented easily in terms of hardware and software. However, they cannot be applied to multiview sequences because of the resulting low resolution. For coding multiview sequences, compatible resolution constrained multiview coding has been
Fig. 1. Compatible resolution constrained multiview coding: (a) Block diagram of the compatible multiview sequence CODEC. (b) Multiplexing of multiviews to form super views.

proposed [7,17]. This method is shown in Fig. 1(a), where four input views are spatially decimated to generate four half-resolution views, such that each pair of half-resolution views forms a super view and the problem of coding four views is reduced to that of coding a pair of super views. A compatible stereoscopic encoder that uses Temporal Scalability can then be used, such that one of the super views is coded independently and the other super view is coded with disparity- and motion-compensated prediction. Fig. 1(b) shows examples of how four or eight reduced-resolution views may be multiplexed to form a pair of super views. To obtain the reduced spatial resolution, spatial decimation consisting of filtering and subsampling is employed. However, these methods lower the resolution of 3D multiview sequences because they use spatial decimation. Thus, a multiview sequence CODEC is required that provides many viewers with realistic and vivid images, or one viewer with a wide range of viewpoints. In addition to the compression rate and image quality, compatibility with existing standards and view scalability need to be considered.

3. Proposed system with view scalability

We propose a multiview sequence CODEC that has the properties of flexibility, MPEG-2 compatibility and view scalability. We propose a flexible GGOP structure that adapts to the baseline distance among cameras. For compatibility with MPEG-2, we code the reference sequences so that they are the same as MPEG-2 bitstreams. Fig. 2 shows the concept of the multiview sequence CODEC with view scalability. We define view scalability so that viewers can arbitrarily choose the number of views. In other words, only the desired number of views is decoded at the receiver side. The view information is inserted in each picture header at the encoder so that the decoder detects it and decodes only the selected views.
Fig. 2. The proposed block-based multiview sequence CODEC with view scalability.

We confirm that the decoded multiview sequences can be properly displayed on several types of display

modes including conventional 2D, stereo and 3D monitors.

3.1. Multiview sequence encoder

The multiview sequence encoder with view scalability is shown in Fig. 3; it allows an arbitrary number of views to be chosen at the decoder. The encoder consists of preprocessing, disparity estimation/compensation, motion estimation/compensation, rate control and residual coding stages. The encoder generates two types of bitstreams: the main bitstream, which is compatible with an MPEG-2 video stream, and the auxiliary one, which provides additional information for multiview sequences. The main bitstream contains information concerning the sequences that include I pictures and thus maintains MPEG-2 compatibility. The auxiliary bitstream contains the remaining sequences. To support view scalability, we insert the view information so that the decoder detects it and decodes only the selected views. As shown in Fig. 4, the view information is inserted at the encoder in each picture header using n bits, which supports encoding and decoding of sequences with up to 2^n views.

In the case of MPEG-2, the output of the encoder is a deterministic periodic sequence, in which the period is a Group of Pictures (GOP) realized with three types of encoded frames. I frames are coded using only the information present in the picture itself, in order to provide potential random access points for the compressed video sequence. P frames are coded using a similar coding algorithm to I frames, but with the addition of motion compensation with respect to the previous I or P frame. B frames are coded with motion compensation with respect to the adjacent I frame, P frame, or an interpolation between them. The length (N) of a GOP is normally defined as the distance between I pictures. The distance between an anchor I or P picture and the next P picture is denoted by M.
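The roles of N and M can be made concrete with a minimal sketch that lists the picture types of one GOP in display order. The function name `gop_picture_types` is hypothetical, and the sketch assumes that the GOP is listed starting from its I picture and that N is a multiple of M; neither assumption comes from the paper.

```python
from typing import List


def gop_picture_types(n: int, m: int) -> List[str]:
    """Return display-order picture types for one GOP, starting at the I picture.

    n: GOP length (distance between I pictures).
    m: distance between consecutive anchor (I or P) pictures.
    Assumption for this sketch: n is a positive multiple of m.
    """
    if n <= 0 or m <= 0 or n % m != 0:
        raise ValueError("this sketch assumes N is a positive multiple of M")
    types = []
    for k in range(n):
        if k == 0:
            types.append("I")   # intra-coded random access point
        elif k % m == 0:
            types.append("P")   # predicted from the previous anchor
        else:
            types.append("B")   # predicted from the surrounding anchors
    return types


print(gop_picture_types(6, 3))  # ['I', 'B', 'B', 'P', 'B', 'B']
```

For N = 6 and M = 3 this gives the repeating pattern I, B, B, P, B, B.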
Fig. 3. Block diagram of the proposed multiview sequence encoder.

Fig. 4. The view information in an auxiliary bitstream. Header fields, in order: PICTURE START CODE (32 bits), VIEW INFORMATION (n bits), TEMPORAL REF (10 bits), PICTURE CODING TYPE (3 bits), VBV DELAY (16 bits).

For the multiview sequence encoder, we proposed a new coding structure, referred to as a group of GOP (GGOP), as the basic unit for coding and rate control; it contains pictures in the time domain as well as in the view domain [14]. We explain the detailed structure of the GGOP using a 5-view sequence encoder. The basic concept behind the proposed GGOP structure is to remove the spatial redundancy within a frame, the temporal redundancy in the time domain and the view redundancy in the view domain. The GGOP can reduce temporal and view redundancy simultaneously by estimating both disparity and motion vectors. For a 5-view sequence encoder, the GGOP structure can take three possible forms: one-I type, two-I type and five-I type. The proposed structure for one-I type has six picture types: I, P_t and B_t frames, which remove temporal redundancy using motion vectors; P_s and B_s frames, which remove view redundancy using disparity vectors; and B_s,t frames, which remove both temporal and view redundancy. One-I type has only one reference sequence in the GGOP. Fig. 5(a) shows one-I type for the case of N = 6 and M = 3, where the arrows represent the directions for predicting the disparity and motion vectors. In one-I type there are one I frame, one P_t frame, four P_s frames, four B_t frames and 20 B_s,t frames. The reference sequence that includes the I frame (..., B_t, B_t, I, B_t, B_t, P_t, ...) is encoded as an MPEG-2 video bitstream for compatibility with MPEG-2. We regard this bitstream as the main bitstream, which is compatible with MPEG-2, and the remaining one as an auxiliary bitstream, which contains information for the other view sequences.
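The picture counts quoted for one-I type can be checked with a small sketch that lays out a GGOP as a view-by-time grid. The layout below is an inference from the stated counts rather than a transcription of Fig. 5(a): it assumes the reference sequence is the centre view, that the non-reference views carry a P_s frame at the time instant of the I frame, and that all their other frames are B_s,t. The function name `one_i_ggop` is hypothetical.

```python
from collections import Counter


def one_i_ggop(num_views=5, n=6, m=3, ref_view=None):
    """Build a views x time grid of picture types for a one-I type GGOP.

    Assumptions for this sketch (not taken from the paper's figure):
    the reference view defaults to the centre view, and every
    non-reference frame away from the I time instant is a B_s,t frame.
    """
    if ref_view is None:
        ref_view = num_views // 2
    grid = []
    for view in range(num_views):
        row = []
        for t in range(n):
            if view == ref_view:
                # Reference sequence: an ordinary MPEG-2 style GOP.
                if t == 0:
                    row.append("I")
                elif t % m == 0:
                    row.append("P_t")
                else:
                    row.append("B_t")
            else:
                # Other views: disparity-predicted at the I instant,
                # disparity- and/or motion-predicted elsewhere.
                row.append("P_s" if t == 0 else "B_s,t")
        grid.append(row)
    return grid


counts = Counter(p for row in one_i_ggop() for p in row)
print(dict(counts))
```

With 5 views, N = 6 and M = 3, the grid holds 30 pictures, and the tally matches the counts given in the text: one I, one P_t, four B_t, four P_s and 20 B_s,t frames.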
One MPEG-2 bitstream and one auxiliary bitstream are generated for the case of Fig. 5(a). The other view sequences, which have no I frame, are coded using disparity and motion vectors based on the reference sequence. P_s frames use different semantics from the P_t frames used in the main bitstream: they are predicted from a spatially adjacent I frame. B_s,t frames are predicted from a spatially adjacent frame, a temporally adjacent frame, or an interpolation between them. Thus, B_s,t frames can be selectively reconstructed by either disparity vectors or motion vectors. If a frame contains large motion vectors, it is reconstructed not by motion vectors but by disparity vectors, in order to reduce error and improve the coding efficiency. If a frame has large disparity vectors, it is reconstructed only by motion vectors. Thus, this type of encoder is able to avoid errors caused by large motion vectors or large disparity vectors.

We also designed two-I type, which contains two reference sequences in the GGOP, for multiview sequences captured with a large baseline distance, as shown in Fig. 5(b). Two MPEG-2 bitstreams and one auxiliary bitstream are generated in this type of GGOP. B_s frames in the third view sequence are predicted from the adjacent left frame, the adjacent right frame, or both. The third type of GGOP for the 5-view sequence encoder contains five reference sequences in each GGOP, as shown in Fig. 5(c). Only motion estimation is needed, as in the case of MPEG-2, since each view sequence is encoded independently. If the receiver has only an MPEG-2 decoder and no decoder for multiview sequences, this type of GGOP can be adopted.

The above concept of the GGOP structure can be extended to 7-view, 9-view and larger sequences. Fig. 6 shows a possible extension to a 9-view sequence encoder as an example. Its structure is also compatible with the monoscopic MPEG-2 CODEC. Fig. 6 shows examples of two-I type and three-I type for 9-view sequences.
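The selective reconstruction of B_s,t frames described above can be illustrated per block. The sketch below is a hedged example rather than the authors' procedure: it assumes the sum of absolute differences (SAD) as the error measure, a rounded average as the interpolation, and blocks given as plain lists of pixel rows. The names `sad`, `average` and `choose_prediction` are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def average(block_a, block_b):
    """Rounded pixel-wise average, standing in for the interpolated prediction."""
    return [[(a + b + 1) // 2 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(block_a, block_b)]


def choose_prediction(current, motion_pred, disparity_pred):
    """Pick the candidate prediction with the smallest SAD for one block.

    motion_pred: block fetched from a temporally adjacent frame.
    disparity_pred: block fetched from a spatially adjacent view.
    Returns (mode name, SAD of the chosen mode).
    """
    candidates = {
        "motion": motion_pred,
        "disparity": disparity_pred,
        "interpolated": average(motion_pred, disparity_pred),
    }
    costs = {name: sad(current, pred) for name, pred in candidates.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]


# Toy 2x2 block: the temporal prediction is poor (e.g. fast motion),
# while the neighbouring view matches almost exactly.
current = [[10, 12], [14, 16]]
motion_pred = [[30, 30], [30, 30]]
disparity_pred = [[10, 12], [14, 17]]
print(choose_prediction(current, motion_pred, disparity_pred))  # ('disparity', 1)
```

In this toy case the motion-compensated candidate has a SAD of 68 and the interpolated one 35, so the disparity-compensated candidate is chosen, which mirrors the behaviour described for frames with large motion vectors.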
For compatibility with MPEG-2, the sequences that include I frames are encoded as an MPEG-2 video bitstream. There are two I frames, two P_t frames, six P_s frames, six B_s frames and 38 B_s,t frames in two-I type for N = 6 and M = 3, as shown in Fig. 6(a). The arrows represent the direction of prediction. Fig. 6(b) shows another type of GGOP, which has three reference sequences, for a large baseline distance between the cameras. Three main bitstreams and one auxiliary bitstream can be generated from this type of GGOP. In addition, it is possible to encode multiview sequences independently using only motion estimation/compensation. Since our proposed multiview sequence encoder includes a view scalability property, the decoder can decode the corresponding bitstreams only for the selected views. If the receiver is capable of displaying a stereo sequence, the bitstreams for any two selected views are decoded.

Fig. 5. GGOP structures for 5-view sequence encoder: (a) One-I type; (b) Two-I type; (c) Five-I type.

Fig. 6. GGOP structures for 9-view sequence encoder: (a) Two-I type and (b) Three-I type.

3.2. Multiview sequence decoder

We define a new concept of view scalability for selecting the number of viewpoints according to the type of display mode. In MPEG-4, view scalability is defined in the Synthetic/Natural Hybrid Coding (SNHC) part [12]. There, the property of view scalability enables the viewer to control the size and location of virtual objects. However, we define it as a different

concept: it enables the viewer to control the number of decoded views and to select the desired views. The concept of view scalability is described in Fig. 7(a). There are two salient features in this view scalability. First, 3D displays with view scalability are not affected by the number of encoded views. For example, if a 5-view sequence is encoded without considering view scalability and the multiview display is only capable of displaying 3 views, the display cannot recognize the encoded bitstream. The concept of view scalability, however, enables multiview sequences to be displayed regardless of the type of display. In addition, if the encoder does not have the property of view scalability, the decoder must decode the entire auxiliary bitstream even for display on a conventional TV/HDTV, which causes unnecessary decoding effort. Second, view scalability in the multiview sequence CODEC provides flexibility to the viewers and reduces computing time. We insert the view information in the picture header using n bits, which supports sequences with up to 2^n views, so that the decoder selectively decodes the sequences in the auxiliary bitstream and the main bitstream, as shown in Fig. 7(b). As a result, decoding time and computational complexity are reduced accordingly. If the number of views increases, the number of header bits for the view information must increase to accommodate it.

Fig. 7. The proposed view scalability: (a) The concept of view scalability and (b) picture header bits for view scalability.
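The header mechanism can be sketched as follows, using the field order and widths listed in Fig. 4. This is only an illustration under stated assumptions: view identifiers are taken to be zero-based integers, the header is modelled as a single Python integer rather than a byte stream, and the helper names (`view_bits`, `pack_header`, `unpack_view_id`, `select_headers`) are hypothetical. The start code value 0x00000100 is the MPEG-2 picture start code.

```python
PICTURE_START_CODE = 0x00000100  # MPEG-2 picture start code (32 bits)


def view_bits(num_views):
    """Smallest n such that 2**n >= num_views (at least 1 bit)."""
    return max(1, (num_views - 1).bit_length())


def pack_header(view_id, temporal_ref, coding_type, vbv_delay, n_bits):
    """Pack the Fig. 4 fields, most significant field first.

    Returns (header value as int, total number of header bits).
    """
    fields = [
        (PICTURE_START_CODE, 32),
        (view_id, n_bits),      # view information (assumed zero-based id)
        (temporal_ref, 10),
        (coding_type, 3),
        (vbv_delay, 16),
    ]
    value, total = 0, 0
    for field, width in fields:
        if not 0 <= field < (1 << width):
            raise ValueError("field does not fit in its bit width")
        value = (value << width) | field
        total += width
    return value, total


def unpack_view_id(value, total_bits, n_bits):
    """Read back the n-bit view information that follows the start code."""
    shift = total_bits - 32 - n_bits
    return (value >> shift) & ((1 << n_bits) - 1)


def select_headers(headers, wanted_views, n_bits):
    """Keep only the pictures whose view id was selected by the viewer."""
    return [(value, total) for value, total in headers
            if unpack_view_id(value, total, n_bits) in wanted_views]


n = view_bits(9)  # 9 views need 4 bits, since 2**3 = 8 < 9 <= 2**4
headers = [pack_header(v, 0, 1, 0xFFFF, n) for v in range(9)]
stereo = select_headers(headers, {1, 3}, n)
print(n, [unpack_view_id(h, t, n) for h, t in stereo])  # 4 [1, 3]
```

A decoder built along these lines would skip every picture whose view identifier is not in the selected set, which is where the saving in decoding time comes from.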

Our proposed multiview sequence decoder with view scalability is described in Fig. 8. It consists of two parts: one for the main bitstream and the other for the auxiliary bitstream. Using the concept of view scalability, it can adaptively decode bitstreams for various types of display modes.

Fig. 8. Block diagram of the proposed multiview sequence decoder.

Fig. 9. Multiview camera system.

Only selected views are decoded and appropriately

displayed at the receiver. Views that are not selected are not decoded, so this property of view scalability saves decoding time.

Fig. 10. The acquisition of multiview sequences in 3D MAX.

Fig. 11. 720×576 Train and tunnel 5-view sequences at the first frame: (a) first view, (b) second view, (c) third view, (d) fourth view, (e) fifth view.

4. Simulation results and discussion

To confirm the performance of our proposed multiview sequence CODEC with flexibility,

Fig. 12. 640×480 Top and train 5-view sequences at the first frame: (a) first view, (b) second view, (c) third view, (d) fourth view, (e) fifth view.

Fig. 13. 640×480 Robotics 9-view sequences at the first frame: (a) first view, (b) second view, (c) third view, (d) fourth view, (e) fifth view, (f) sixth view, (g) seventh view, (h) eighth view, (i) ninth view.

Fig. 14. The 5-view sequences parameters.

Fig. 15. Performance comparison for several types of GGOP at various bit rates (Train and tunnel 5-view sequence). [Plot: PSNR versus bit rate in Mbps for one-I, two-I and five-I types.]

MPEG-2 compatibility and view scalability, we used several multiview sequences. It is nearly impossible to find multiview sequence data taken by multiview cameras at this time. Even though a few multiview sequence data sets exist, they are not guaranteed to have been captured by well-aligned multiview cameras. We developed our own multiview camera system, as shown in Fig. 9. However, only three cameras are currently mounted on the system because of the technical limits of current data acquisition boards. For this reason, we used our previously developed IVS

Fig. 16. Performance comparison for several types of GGOP at various bit rates: (a) Top and train first view sequences with a close baseline distance and (b) Top and train first view sequences with a large baseline distance. [Plots: PSNR versus bit rate in Mbps for one-I and two-I types.]

(Intermediate View Synthesis) algorithm to make 5-view sequence data for the simulation. In addition, we generated graphic sequences with more than five views by using 3D MAX, to test the generality of our proposed multiview sequence CODEC. We generated 3 intermediate views between a left image sequence and a right image sequence using our own intermediate view synthesis algorithm [13]. The 720×576 Train and tunnel sequences were originally stereo sequences acquired by a fixed-focus stereoscopic camera with a focal length of 40 mm and a baseline distance of 87.5 mm. We generated 5-view sequences from the stereo sequences and then used the resulting 5-view sequences for the simulation. Moreover, we used a computer graphics tool, 3D MAX 3.0, to generate multiview sequences for the simulation. We assume that the virtual cameras in 3D MAX 3.0 are parallel, as shown in Fig. 10, since the parallel setup simplifies the correspondence problem and leads to simpler mathematical expressions. We generated the 640×480 Top and Train 5-view sequences and the 640×480 Robotics 9-view sequences using 3D MAX, with variable baseline distances, to test the flexibility of our proposed encoder. The baseline distances for the Top and Train 5-view sequences are 5 pixels and 20 pixels. Figs. 11–13 show the first frames for the Train and tunnel 5-view sequences, the Top and Train 5-view sequences having a baseline distance, and the Robotics 9-view sequences, respectively.

In our simulation, the block size for disparity and motion estimation was 16×16, and the search ranges were -16 to 16 for disparity estimation and -16 to 16 for motion estimation. The color format for each multiview sequence is Y:U:V, 4:2:0. The simulation was performed with Visual C++ 6.0 on a Pentium-IV PC with a 1.80 GHz CPU. We encoded the multiview sequences with the information shown in Fig. 14. The input parameter file includes profile ID, aspect ratio

Fig. 17. Comparison of resulting images with a large baseline distance: (a) Resulting image by one-I type and (b) resulting image by two-I type.

information, display horizontal/vertical size, etc. The output parameter file reports the simulation results for the multiview sequences. The output bitstream files consist of the main and the auxiliary bitstreams. The main bitstream contains information concerning the reference sequences that include I pictures, and the auxiliary bitstream contains the remaining sequences.

Fig. 15 shows the results of the proposed multiview sequence encoder for the Train and tunnel 5-view sequences at various bit rates. We confirmed that one-I type and two-I type show better results than five-I type, in which each view is coded independently by an MPEG-2 encoder. At 20 Mbps, one-I type outperforms two-I type and five-I type by 1 dB and 5 dB in average PSNR, respectively. Two-I type shows worse results than one-I type because the additional I frames leave relatively fewer bits for the P_s, P_t, B_t, B_s and B_s,t frames.

We tested the Top and Train 5-view sequences with various baseline distances. Fig. 16 shows the average PSNR for the first view sequences, taken from the results for the 5-view sequences, at various bit rates. We found that two-I type shows a better performance than one-I type with a large baseline distance, as shown in Fig. 16(b). Fig. 17 shows the resulting images for the first view with a large baseline distance at the first frame. Figs. 17(a) and (b) are reconstructed by one-I type and two-I type, respectively. Fig. 17(b) clearly shows a better image quality than Fig. 17(a), since two-I type contains more I frames than one-I type. Thus, we subjectively and objectively confirmed that two-I type shows a better performance than one-I type in the case of a large baseline distance.
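The comparisons above are reported as average PSNR. The paper does not spell out the formula, so the sketch below uses the standard definition for 8-bit samples, PSNR = 10 log10(255^2 / MSE), and averages the per-frame values; the function names `psnr` and `average_psnr` are hypothetical, and frames are represented as plain lists of pixel rows.

```python
import math


def psnr(original, decoded, peak=255):
    """PSNR in dB between two equally sized frames (lists of pixel rows)."""
    squared = [(a - b) ** 2
               for row_a, row_b in zip(original, decoded)
               for a, b in zip(row_a, row_b)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(peak ** 2 / mse)


def average_psnr(originals, decoded_frames):
    """Mean of per-frame PSNR values over a sequence."""
    values = [psnr(o, d) for o, d in zip(originals, decoded_frames)]
    return sum(values) / len(values)


# Toy frame in which every decoded sample is off by one grey level.
frame = [[100, 101], [102, 103]]
noisy = [[101, 102], [103, 104]]
print(round(psnr(frame, noisy), 2))  # 48.13
```

A uniform error of one grey level gives an MSE of 1 and therefore about 48.13 dB, which helps put the 1 dB and 5 dB differences quoted above into perspective.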
Our proposed encoder is capable of generating various bit rates, from 8 Mbps to above 50 Mbps. If we encode 5-view sequences using five-I type at a bit rate of 10 Mbps, each view sequence has a bit rate of 2 Mbps. This indicates that it works well in comparison with MPEG-2 MP@ML, which has a maximum target bit rate of 15 Mbps. Thus, our encoder can be applied to a variety of applications with different bit rates.

In our proposed multiview sequence encoder, B_s,t frames can be selectively reconstructed by either disparity vectors or motion vectors, as previously described. If a frame contains large motion vectors, it is reconstructed not by motion vectors but by disparity vectors, in order to reduce error and improve coding efficiency. If a frame has large disparity vectors, it is reconstructed only by motion vectors. Fig. 18(a) was reconstructed using only motion vectors, as in MPEG-2, and Fig. 18(b) was reconstructed using motion vectors or disparity vectors, whichever gave the minimum error. Fig. 18(b) clearly shows a better image quality than Fig. 18(a). Thus, our encoder has a sufficiently flexible structure to choose blocks having

Fig. 18. Comparison of resulting images with large motion vectors: (a) Resulting image by MPEG-2 encoder and (b) resulting image by proposed 5-view sequence encoder.

Fig. 19. Performance comparison for several types of GGOP for 9-view sequence (Robotics 9-view sequence). [Plot: PSNR versus bit rate in Mbps for two-I, three-I and nine-I types.]

Fig. 20. Comparison of resulting images with view scalability at the second frame: (a) second view decoded by MPEG-2 decoder; (b) fourth view decoded by MPEG-2 decoder; (c) second view decoded by proposed 9-view sequence decoder and (d) fourth view decoded by proposed 9-view sequence decoder.

large motion vectors or small disparity vectors reconstructed by disparity vectors, and blocks having large disparity vectors or small motion vectors reconstructed by motion vectors.

Fig. 19 shows the results of the proposed multiview sequence encoder for the Robotics 9-view sequences at various bit rates. Two-I type and three-I type show better performance than nine-I type, which encodes the multiview sequences independently using an MPEG-2 encoder.

The proposed multiview sequence decoder is capable of selectively decoding output bitstreams based on the concept of view scalability. Fig. 20 shows the resulting images at 20 Mbps for the case in which the receiver has only a stereo display and the user selects the second view and the fourth view among the nine views. Figs. 20(a)–(d) show the resulting images with the MPEG-2 decoder and with the proposed multiview sequence decoder, respectively. In our proposed multiview sequence encoder, B_s,t frames can be selectively reconstructed by either disparity vectors or motion vectors. Figs. 20(a) and (b) were reconstructed using only motion vectors, as in MPEG-2, while Figs. 20(c) and (d) were reconstructed using motion vectors or disparity vectors, whichever gave the minimum error. Figs. 20(c) and (d) clearly show a better image quality than Figs. 20(a) and (b). Thus, our multiview sequence CODEC has view scalability, which takes the type of display mode into account, and the flexibility to choose between motion vectors and disparity vectors efficiently. We subjectively confirmed that the decoded bitstreams can be appropriately displayed using several types of display modes, including 3D monitors.
At the receiver side, only the main bitstream is decoded for display on a conventional 2D monitor, and the other data from the transmitter is discarded. For stereoscopic display systems, either one main bitstream and one auxiliary bitstream or two main bitstreams can be decoded for display on the stereo monitor. We also confirmed that the decoded 9-view sequences were appropriately displayed on the 9-view 3D monitor. Fig. 21 shows an example scene of 9-view computer graphics data ("Robotics" sequences) displayed on the 9-view 3D monitor after decoding. We perceived a realistic scene with sufficient depth information without wearing any special-purpose glasses.

5. Conclusion

A block-based multiview sequence CODEC with flexibility, MPEG-2 compatibility and view scalability is proposed. The GGOP, which is a coding unit for multiview sequences, was newly defined for the efficient

coding of multiview sequences. Our proposed CODEC provides flexible GGOP structures according to the number of views and the baseline distance between cameras. For compatibility with MPEG-2, the reference sequences, which include I frames in the time domain, are encoded in the same way as in MPEG-2. Thus, the bitstreams corresponding to the reference sequences are the same as MPEG-2 bitstreams, while the bitstreams corresponding to the remaining sequences include disparity vectors, motion vectors, residual image data, etc.

View scalability is one of the main concerns for multiview sequence encoding. The major innovations of view scalability are the flexible decoding capability, which satisfies the needs of several types of 3D display modes, and the reduction in decoding time. We subjectively confirmed that the decoded bitstreams can be appropriately displayed in several types of display modes, such as conventional 2D and 3D display monitors. We tested the proposed multiview sequence CODEC with several multiview sequences to evaluate its flexibility, view scalability and compatibility with MPEG-2. We confirmed that it worked well for multiview sequences as well as mono sequences, with reduced computational complexity.

For further work, an optimal rate control algorithm for the multiview sequence CODEC is needed to improve coding efficiency.

Fig. 21. Displayed multiview sequences on 9-view 3D-LCD monitor.

Acknowledgements

This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Biometrics Engineering Research Center (BERC) at Yonsei University.

References

[1] C.V. Berkel, D.W. Parker, Multiview 3D-LCD, Proc. SPIE 2653 (1996) 32–39.
[2] Berlin, PANORAMA Final Demonstrations, AC092/SIE/FinalDemo/DS/P/032/b1, October 1998.
[3] R. Borner, Autostereoscopic direct-view displays and rear projection for short viewing distances by lenticular method, Proceedings of the First International Symposium on Three Dimensional Image Communication Technologies, Tokyo, December 1993, pp. 1–14.
[4] R. Franich, R. Lagendijk, R. Horst, Reference model for hardware demonstrator implementation, RACE DISTIMA deliverable 45/TUD/IT/DS/B/003/b1, October 1992.
[5] G. Gagnon, S. Subramaniam, A. Vincent, 3D MPEG-2 video transmission over broadband network and broadcast channels, Proc. SPIE 4297 (2001) 290–298.
[6] P. Harman, Autostereoscopic display system, Proc. SPIE 2653 (1996) 56–64.
[7] B.G. Haskell, A. Puri, A.N. Netravali, Digital Video: An Introduction to MPEG, Kluwer Academic Publishers, Dordrecht, December 1996.
[8] K. Hopf, An autostereoscopic display providing comfortable viewing conditions and a high degree of telepresence, IEEE Trans. Circuits Systems Video Technol. 10 (3) (April 2000) 359–365.
[9] http://www.tnt.uni-hannover.de/plain/project/eu/distima/.
[10] http://www.virtue.eu.com/.
[11] N. Hur, C. Ahn, Experimental service of 3DTV broadcasting relay in Korea, Proc. SPIE 4864 (2002) 1–13.
[12] J. Katto, M. Ohta, System architecture for synthetic/natural hybrid coding and some experiments, IEEE Trans. Circuits Systems Video Technol. 9 (2) (March 1999) 325–335.
[13] H.S. Kim, K. Sohn, Feature-based disparity estimation for intermediate view reconstruction of multiview images, Proc. CISST 2 (June 2001) 1–8.
[14] J.E. Lim, K.H. Sohn, MPEG-2 compatible multiview sequence encoder, Proc. CISST 1 (2002) 379–385.
[15] S. Malassiotis, M.G. Strintzis, Coding of video-conference stereo image sequences using 3D models, Signal Processing: Image Communication 9 (1) (January 1997) 125–135.

[16] S. Okubo, K. McCann, A. Lippman, MPEG-2 requirements, profiles and performance verification framework for developing a generic video coding standard, Signal Processing: Image Communication 7 (3) (September 1995) 201–209.
[17] A. Puri, B.G. Haskell, A revised proposal for multi-view coding and multi-view profile, ISO/IEC JTC1/SC29/WG11 Doc. MPEG95/249, July 1995.
[18] A. Rauol, State of the art of autostereoscopic displays, RACE DISTIMA deliverable 45/THO/WP4.2/DS/R/57/01, December 1995.
[19] G.J. Woodgate, D. Ezra, J. Harrold, N.S. Holliman, G.R. Jones, R.R. Moseley, Autostereoscopic 3D display systems with observer tracking, Signal Processing: Image Communication 14 (6) (November 1998) 131–145.