IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 5, MAY


Spatial-Random-Access-Enabled Video Coding for Interactive Virtual Pan/Tilt/Zoom Functionality

Aditya Mavlankar, Member, IEEE, and Bernd Girod, Fellow, IEEE

Abstract: High-spatial-resolution videos offer the possibility of viewing an arbitrary region-of-interest (RoI) interactively. Zoom functionality enables watching high-resolution content even on displays of lower spatial resolution. If arbitrary regions corresponding to arbitrary zoom factors can be served to the user, the transmission and/or decoding of the entire high-spatial-resolution video can be avoided. Moreover, if the video content can be encoded such that arbitrary RoIs corresponding to different zoom factors can be simply extracted from the compressed bitstream, we can avoid dedicated video encoding for each user. We propose such a video coding scheme that is vital in allowing the system to scale to large numbers of remote users as well as to encode and store the content for subsequent repeated playback. Apart from generating a multi-resolution representation, our coding scheme uses P slices from H.264/AVC. We study the tradeoff in the choice of slice size. A larger slice size enables higher coding efficiency for representing the entire scene but increases the number of pixels that have to be transmitted. The optimal slice size achieves the best tradeoff and minimizes the expected transmission bitrate. Experimental results confirm the optimality of our predicted slice size for various test cases. Furthermore, we propose an improvement based on background extraction and long-term memory motion-compensated prediction. Experiments indicate up to 85% bitrate reduction while retaining efficient random access capability.

Index Terms: Interactive video streaming, pan/tilt/zoom, region-of-interest.

I. Introduction

High-spatial-resolution digital video will be widely available at low cost in the near future.
This development is driven by increasing spatial resolution offered by digital imaging sensors and increasing capacities of storage devices. Furthermore, there exist algorithms for stitching a comprehensive high-resolution view from multiple cameras [1], [2]. Certain current products stitch a large panoramic view in real time [3]. Also, image acquisition on spherical, cylindrical, or hyperbolic image planes via multiple cameras can record scenes with a wide field-of-view while the recorded data can be warped later to the desired viewing format [4]. An example of such an acquisition device is [5]. Despite the availability of high-resolution video, challenges in delivering this high-resolution content to the client are posed by the limited resolution of displays and/or limited data rate for communications. If the user were made to watch a spatially downsampled version of the entire video scene, then he might not be able to watch a local region-of-interest (RoI) with the recorded high resolution. To overcome this problem, we propose interactive virtual pan/tilt/zoom functionality while viewing the video.

Manuscript received May 21, 2009; revised October 22, 2009 and July 30, 2010; accepted October 18, [...]. Date of publication March 17, 2011; date of current version May 4, [...]. This paper was recommended by Associate Editor I. Ahmad. A. Mavlankar was with Stanford University, Stanford, CA, USA. He is now with Tely Labs, Inc., Menlo Park, CA, USA (aditya.mavlankar@ieee.org). B. Girod is with the Department of Electrical Engineering, Stanford University, Stanford, CA, USA (bgirod@stanford.edu). Color versions of one or more of the figures in this paper are available online at [...]. Digital Object Identifier [...]/TCSVT[...]. [...]/$26.00 © 2011 IEEE
Some practical scenarios where this kind of interactivity is well-suited are: interactive playback of a high-resolution video from a locally stored file, interactive TV for watching content captured with very high detail (e.g., interactive viewing of sports events), providing virtual pan/tilt/zoom within a wide-angle and high-resolution scene from a surveillance camera, and streaming instructional videos captured with high spatial resolution (e.g., panel discussions, lecture videos). A video clip that showcases interactive viewing of soccer in a TV-like setting can be seen here [6].

In a streaming scenario, our proposed video coding scheme allows transmitting user-selected RoIs, thus eliminating the need to transmit the entire spatial extent of the scene in full resolution. The encoding can either take place live or offline beforehand. Additionally, our scheme allows limiting the load of encoding irrespective of the number of users. The entire recorded field-of-view can be encoded once, possibly with multiple resolution layers to support different zoom factors. Spatial resolution layers are coded using P slices¹ of H.264/AVC. This one-time encoding generates a repository of slices, and relevant slices can be served to several users depending on their individual RoIs. Thus, the coding scheme allows the system to scale to large numbers of users; it avoids a dedicated encoder for each user's individual RoI sequence. Another benefit is that requested RoIs can be extracted from the bitstream even inside or at the edge of the network, closer to the client-node.

Ideally, the video delivery system should be able to react to the user's changing RoI with as little latency as possible. The proposed coding scheme enables access to a new region, with an arbitrary zoom factor, during any frame interval instead of having to wait for the end of a group of pictures (GoP) or having to transmit extra slices from previous frames.
¹ We employ the following terminology: slice refers to a rectangular portion of a video frame, whereas tile refers to the sequence of slices from the same resolution layer and at the same position in each video frame.

The spatial random access approach developed in this paper is also relevant for the design of systems that employ image-based rendering (IBR) [7], [8] and manipulate the transmitted imagery further to yield a novel view, e.g., teleimmersive systems [9] and free viewpoint TV [10].

This paper is structured as follows. Section II reviews related work and discusses the challenges in providing random access. Section III presents the coding scheme and discusses how to optimize the slice size. The optimal slice size minimizes transmission bitrate by striking the best compromise between compression efficiency and superfluous pixel transmission. Section IV presents an improvement of the coding scheme based on background extraction and long-term memory motion-compensated prediction. Experiments indicate that the proposed improvement can reduce bitrate by up to 85% while retaining efficient random access capability.

II. Related Work

Taubman et al. [11] proposed a solution for interactive browsing of images using JPEG2000. The multi-resolution representation of an image using wavelets is leveraged to provide pan/tilt/zoom. JPEG2000 encodes blocks of wavelet transform coefficients independently. Consequently, every coded block has influence on the reconstruction of a limited number of pixels of the image. Moreover, the coding of each block results in an independent, embedded bitstream, which allows streaming any given block with a desired degree of fidelity. Taubman et al. also developed the JPEG2000 over Internet Protocol, for communication between client and server that supports remote interactive browsing of JPEG2000 coded images [12]. The server can keep track of the RoI trajectory of the client as well as the parts of the bitstream that have already been streamed to the client. Given a rate of transmission for the current time interval, the server solves an optimization problem to determine which parts of the bitstream should be sent in order to maximize the quality of the current RoI.
This is similar to packet scheduling algorithms proposed in [13] for streaming of video. It should be noted, however, that an accurate model for the distortion reduction due to successful delivery of any particular packet is necessary.

Video coding for spatial random access presents a special challenge. To achieve good compression efficiency, video compression schemes typically employ motion-compensated interframe prediction for exploiting correlation among successive frames [14]-[16]. However, the coding dependencies among successive frames make it difficult to provide random access for spatial browsing within the scene. The decoding of a block of pixels requires that other reference frame blocks used by the predictor have previously been decoded. These reference frame blocks might lie outside the RoI and might not have been transmitted or decoded earlier.

Coding, transmission, and rendering of high-resolution panoramic videos using MPEG-4 is proposed in [17]. A limited part of the entire scene is transmitted to the client depending on the chosen viewpoint. In [17], only intraframe coding is used to allow random access. The scene is subdivided into slices which are coded independently. The authors also considered interframe coding to improve compression efficiency. However, they noted that this involves transmitting slices from the past if the current slice requires those for its decoding. A longer intraframe period entails significant transmission overhead for slices from the latter frames in the GoP, as this dependency chain grows. Besides the transmission overhead, the reference frame blocks also entail growing overhead of decoding.

Coding and streaming of images from an IBR representation also entails the random access issues associated with interframe coding. This applies both when the captured scene is static and when it evolves in time. Interactive streaming of static light fields has been studied by Ramanathan et al. in [18] and [19].
The abovementioned growing dependency chain is avoided by using multiple representations coding based on two new picture types defined in the H.264/AVC standard, the SP and SI picture types [20]. Ramanathan et al. also extended rate-distortion optimized packet scheduling, based on the framework in [13], to multiple representations coding for light fields. However, in their setup, only entire pictures from the light field data-set are streamed and there is no provision of spatial random access within a picture.

Compression and streaming of static light fields using distributed source coding has been investigated in [21] and [22]. If adequate rate is spent for signaling the non-key frames then identical reconstruction is guaranteed independent of the reference block used as side information at the receiver. Although this simplifies random access, the coding efficiency is lower than hybrid video coding and the problem of rate estimation while streaming is challenging.

Bauermann et al. conducted a detailed analysis of the decoding complexity and the mean transmission bitrate for remote access to arbitrary parts of compressed image-based scene representations encoded using hybrid video coding [23], [24]. Their work, however, does not include a multi-resolution representation of the image data-set and is restricted to static imagery. Also, for saving transmission bitrate, apart from knowing which pixel blocks are currently required, the server also needs to know which pixel blocks have already been transmitted to the user. The server uses this information to stream a burst of reference pixel blocks. The variations of instantaneous bitrate and decoding load are undesirable.

Recently, Kurutepe et al. [25] proposed live interactive 3DTV based on dynamic light fields. They employed application-layer peer-to-peer (P2P) multicast and delivered a subset of views to a peer from a set of multiview videos of the scene. Similar to [18] and [19], entire views are either selected or dropped according to the peer's viewpoint.
Random access to arbitrary views is provided by encoding the views independently. Multicasting lowers the bandwidth requirement at the server; however, the coded representation should consist of logical substreams for which multicast groups can be formed. Efficient random access is highly desirable since it simplifies the peer's task of deciding which multicast groups to subscribe to. Similar to [18] and [19], entire frames from the data-set are streamed or not, and there is no provision of spatial random access within a picture.

Background extraction for motion-compensated prediction has been proposed in [26]. Sprite coding defined in MPEG-4


Visual (MPEG-4 Part 2) allows coding the background either fully or partly for subsequent use as reference in predictive coding. The term sprite is more general and covers any transmitted video object that can be warped and/or cropped in certain ways for use by the motion predictor. However, unlike our proposed scheme, the compression schemes in the literature employing background extraction are not designed to provide virtual pan/tilt/zoom functionality.

Fig. 1. Graphical user interface. The client display shows the thumbnail and the RoI. The effect of changing the zoom factor can be seen by comparing the two screenshots. Each screenshot shows a frame of the panoramic Cardgame video sequence used in our experiments.

Fig. 2. Video coding scheme. The thumbnail video constitutes a base layer and is coded with H.264/AVC using I, P, and B pictures. The reconstructed base layer video frames are upsampled by a suitable factor and used as prediction signal for encoding video corresponding to the higher resolution layers. Higher resolution layers are coded using P slices.

III. Spatial-Random-Access-Enabled Video Coding

We have developed a graphical user interface which allows the user to select the RoI while watching the video. The RoI location and zoom factor are controlled by operating the mouse. The application supports continuous zoom to provide smooth control of the zoom factor. In addition to the RoI, we also display a thumbnail overview with an overlaid rectangle indicating the location of the RoI. Screenshots of the client display are shown in Fig. 1.

A. Coding Scheme Based on Upward Prediction and Slices

Fig. 2 shows the video coding scheme. The thumbnail overview constitutes a base layer video and is coded with H.264/AVC using I, P, and B pictures. The reconstructed base layer video frames are upsampled by a suitable factor and used as prediction signal for encoding video corresponding to the higher resolution layers.
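The upsampling step that produces the prediction signal can be illustrated with a minimal Python sketch. It assumes dyadically spaced layers and the H.264/AVC six-tap luma filter with coefficients (1, -5, 20, 20, -5, 1)/32, which is what the experiments in this section use. The function names, the sample-phase alignment (original samples kept at even positions), the border handling by edge repetition, and the rounding are assumptions made for illustration, so the output is not claimed to be bit-exact with any H.264/AVC implementation.

```python
# Six-tap half-sample filter, (1, -5, 20, 20, -5, 1) / 32.
TAPS = (1, -5, 20, 20, -5, 1)


def upsample_row_2x(row):
    """Double the length of a 1-D list of 8-bit luma samples.

    Assumption: original samples are copied to even output positions and
    each odd position holds a six-tap interpolated half-sample. Borders
    are handled by repeating the edge sample.
    """
    n = len(row)
    out = []
    for i in range(n):
        out.append(row[i])
        acc = 0
        for k, tap in enumerate(TAPS):
            j = min(max(i - 2 + k, 0), n - 1)  # clamp index at the borders
            acc += tap * row[j]
        half = (acc + 16) >> 5                 # divide by 32 with rounding
        out.append(min(max(half, 0), 255))     # clip to the 8-bit range
    return out


def upsample_frame_2x(frame):
    """Separable 2x upsampling: filter the rows first, then the columns."""
    rows = [upsample_row_2x(r) for r in frame]
    cols = [upsample_row_2x(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

Calling `upsample_frame_2x` once takes the thumbnail to the next dyadic layer; calling it again on the result reaches the layer after that, mirroring the repeated application described for higher resolution layers.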
Each frame belonging to a higher resolution layer is coded using a grid of rectangular P slices. Employing upward prediction from only the thumbnail enables efficient random access to local regions within any spatial resolution. For a given frame interval, the display of the client is rendered by transmitting the corresponding frame from the base layer and a few P slices from exactly one higher resolution layer. We transmit slices from that resolution layer which corresponds closest to the user's current zoom factor. At the client side, the corresponding RoI from this resolution layer is resampled to correspond to the user's zoom factor. We may store only a few spatial resolution layers at the server but can still render smooth zoom control. If a required enhancement layer P slice is unavailable at the client, for example, due to loss in the network, we perform error concealment by upsampling portions of the thumbnail video.

In our experiments, the spatial resolution layers stored at the server are dyadically spaced. Hence, the reconstructed thumbnail frame needs to be upsampled by powers of two horizontally and vertically to generate the corresponding prediction signal. For upsampling the luminance component, we employ the six-tap filter having the coefficients $(1, -5, 20, 20, -5, 1)/32$ as defined in H.264/AVC. For chroma, we employ a simple two-tap filter with equal coefficients. The upsampling procedure is repeated an appropriate number of times depending on the resolution layer. Although

we choose these parameters for our experiments, our design can incorporate arbitrarily spaced resolution layers and also arbitrary procedures for upsampling the reconstructed base layer. Also, at the client side, for resampling the corresponding RoI from the chosen resolution layer, any technique can be accommodated. In our experiments, we use bilinear interpolation.

B. Comparison with Current Video Compression Standards

The coding scheme proposed above uses H.264/AVC building blocks but itself is not standard compliant. State-of-the-art video compression standards, H.264/AVC and SVC, provide tools like slices but no straightforward method for spatial random access since their main focus has been compression efficiency of full-frame video and resilience to losses. SVC supports both slices as well as spatial resolution layers. Alas, SVC allows only single-loop decoding whereas upward prediction from intercoded base-layer frames implies multiple-loop decoding, and hence is not supported by the standard. If the base layer frame is intercoded, then SVC allows predicting the motion-compensation residual at the higher-resolution layer from the residual at the base layer. However, interframe prediction dependencies across tiles belonging to a high-resolution layer hamper spatial random access. Note that for employing SVC, the motion vectors (MVs) can be chosen to avoid inter-tile dependencies. Also note that instead of SVC, AVC could be employed separately for the high-resolution layers with the MVs similarly restricted to eliminate inter-tile dependencies. This is very similar to treating the tiles as separate video sequences. An obvious drawback is the redundancy between the high-resolution tiles and the base layer. A second drawback is that after an RoI change, a newly needed tile can only be decoded starting from an intracoded slice. However, note that B slices could also be employed for the high-resolution layers.
Prior work on view random access, discussed in Section II, employs multiple representations for coding an image. Similarly, we can use multiple representations for coding a high-resolution slice. This will allow us to use interframe coding among successive high-resolution layer frames and to transmit the appropriate representation for a slice depending on the slices that have been transmitted earlier. Some representations will exploit inter-tile correlation, thus lowering the transmission bitrate. However, more storage will be required for multiple representations. The benefit of the scheme in Fig. 2 is that knowing the current RoI is enough to decide which data need to be transmitted, unlike the case of multiple representations where the decision is conditional on prior transmitted data.

In our proposed scheme, motion compensation among successive frames is performed at the base layer. We also employ displacement compensation with a small search range of about four pixels to find the best match relative to the upsampled base layer frame while coding the high-resolution P slices. The total encoding load is determined by the maximum resolution and the number of layers and can be estimated to be roughly 1.3 times the load of encoding just the highest resolution layer using standard motion-compensated hybrid video coding.

Fig. 3. Depending on the slice size and the location of the RoI within the given resolution layer, there is an overhead of pixels that are transmitted but not used for rendering the client display. The shaded portion depicts the pixel overhead in this example.

Fig. 4. Sequence of pixels is divided into 1-D slices. In this example, the length of each slice is $s = 4$. The length of the 1-D region-of-interest is $R = 3$.

C. Minimization of Mean Transmission Bitrate

For the coding scheme shown in Fig. 2, the slice size for each resolution layer can be independently optimized given the prediction residual for that layer. The strategy proposed here can be independently used for all layers.
Given a resolution layer, we assume that the slices form a regular rectangular grid, so that every slice is $s_w$ pixels wide and $s_h$ pixels tall. The slices on the boundaries can have smaller dimensions due to the layer dimensions not being integer multiples of the slice dimensions. The number of bits transmitted to the client, or decoded for local playback, depends on the slice size as well as the user's RoI trajectory over the interactive viewing session. The quality of the decoded video depends on the quantization parameter (QP) used for encoding the slices. However, it should be noted that for the same QP, almost the same quality is obtained for different slice sizes, even though the number of bits is different. Hence, given the QP, our goal is to choose the slice size that minimizes the expected number of bits transmitted and/or decoded per rendered pixel.

The smaller the slice size, the worse is the coding efficiency. This is because of the increased number of slice headers, lack of context continuation across slices for context adaptive coding, and inability to exploit interpixel correlation across slices. On the other hand, a smaller slice size entails lower pixel overhead. The pixel overhead consists of pixels that have to be transmitted and/or decoded because of the coarse slice division, but are not used to render the client display. For example, the shaded pixels in Fig. 3 show the pixel overhead for the shown slice grid and location of the RoI.

In the following analysis, we assume that the RoI location can be changed with a granularity of one pixel both horizontally and vertically. Also, every location is equally likely to be selected. Depending on the application scenario, the slices might be put in different transport layer packets. The packetization overhead of layers below the application layer, for example RTP/UDP/IP, has not been taken into account but can be easily incorporated into the proposed optimization framework.

1) Pixel Overhead: To simplify the analysis, we first consider the 1-D case and then extend it to 2-D.
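The notion of pixel overhead for one particular RoI position, as depicted in Fig. 3, can be made concrete with a short Python sketch. The function names and the coordinate convention (top-left pixel of the RoI, zero-based slice indices) are assumptions for illustration, and the smaller slices at the layer boundaries are ignored for simplicity.

```python
def covering_slices(x, y, r_w, r_h, s_w, s_h):
    """Return (column range, row range) of the grid slices that overlap the RoI.

    x, y     : top-left pixel of the RoI inside the resolution layer
    r_w, r_h : RoI width and height in pixels
    s_w, s_h : slice width and height in pixels
    """
    first_col, last_col = x // s_w, (x + r_w - 1) // s_w
    first_row, last_row = y // s_h, (y + r_h - 1) // s_h
    return range(first_col, last_col + 1), range(first_row, last_row + 1)


def pixel_overhead(x, y, r_w, r_h, s_w, s_h):
    """Pixels transmitted but not rendered (boundary slices ignored)."""
    cols, rows = covering_slices(x, y, r_w, r_h, s_w, s_h)
    transmitted = len(cols) * s_w * len(rows) * s_h
    return transmitted - r_w * r_h
```

For example, a 100 x 50 RoI at position (70, 10) on a grid of 64 x 64 slices touches two slice columns and one slice row, so 8192 pixels are sent for 5000 rendered ones.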

a) Analysis in 1-D: Imagine an infinitely long sequence of pixels. This sequence is divided into slices of length $s$. For example, in Fig. 4, $s = 4$. Also given is the length of the region-of-interest, denoted by $R$. Assume $R = 3$ in this example. To calculate the pixel overhead, we are interested in the probability distribution of the number of 1-D slices that need to be transmitted. This can be obtained by testing for locations within one slice, since the pattern repeats every slice. For RoI locations w and x, we would need to transmit a single slice, whereas for locations y and z, we would need to transmit two slices. Let $N$ be the random variable representing the number of slices to be transmitted. Given $s$ and $R$, we can uniquely choose $m, R^{*} \in \mathbb{N}$ such that $m \geq 0$ and $1 \leq R^{*} \leq s$ and also the following relationship holds:

$$R = m s + R^{*}. \tag{1}$$

By inspection, we find the p.m.f. of random variable $N$

$$\Pr\{N = m+1\} = \frac{s - (R^{*} - 1)}{s}, \qquad \Pr\{N = m+2\} = \frac{R^{*} - 1}{s}$$

and zero everywhere else. From the p.m.f. of $N$

$$E\{N\} = (m+1)\,\frac{s - (R^{*} - 1)}{s} + (m+2)\,\frac{R^{*} - 1}{s} = (m+1) + \frac{R^{*} - 1}{s}. \tag{2}$$

Let $P$ be the random variable which denotes the number of pixels that need to be transmitted

$$E\{P\} = s\,E\{N\} = (m+1)\,s + R^{*} - 1 = R + s - 1. \tag{3}$$

The expected pixel overhead is $s - 1$. It increases monotonically with slice length $s$ and surprisingly is independent of the length $R$ of the region-of-interest. Alas, the result is that simple only for 1-D. If $R$ itself is a random variable, then for a given value of $R = r$, (3) can be rewritten as

$$E\{P \mid R = r\} = r + s - 1. \tag{4}$$

b) Analysis in 2-D: We define two new random variables, $P_w$, the number of columns to be transmitted, and $P_h$, the number of rows to be transmitted. Similarly, $R_w$ and $R_h$ are random variables denoting the number of columns and rows (among those transmitted) required to render the RoI, respectively. From the 1-D analysis, we obtain

$$E\{P_w \mid R_w = r_w\} = r_w + s_w - 1$$
$$E\{P_h \mid R_h = r_h\} = r_h + s_h - 1.$$

The number of transmitted pixels is also a random variable, $P = P_w P_h$.
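The 1-D result in (3) is easy to check by brute force, because the slice pattern repeats every $s$ pixels and only $s$ start offsets need to be tested. The following Python sketch does this with exact rational arithmetic. The function names are hypothetical, and the uniform distribution over start offsets is the assumption stated in the analysis.

```python
from fractions import Fraction


def slices_needed(offset, R, s):
    """Number of length-s slices touched by an RoI of length R starting at offset."""
    return (offset + R - 1) // s - offset // s + 1


def expected_slices(R, s):
    """E{N}: average over the s equally likely start offsets within one slice."""
    return Fraction(sum(slices_needed(o, R, s) for o in range(s)), s)


def expected_pixels(R, s):
    """E{P} = s * E{N}; the analysis predicts R + s - 1."""
    return s * expected_slices(R, s)
```

With the values of Fig. 4 ($s = 4$, $R = 3$), two of the four offsets need one slice and the other two need two slices, so `expected_slices(3, 4)` is 3/2 and `expected_pixels(3, 4)` is 6, which equals $R + s - 1$.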
Since $P_w$ and $P_h$ can be assumed to be conditionally independent given $R_w, R_h$, we can write

$$E\{P \mid R_w = r_w, R_h = r_h\} = (r_w + s_w - 1)(r_h + s_h - 1). \tag{5}$$

While $R_w R_h$ is the number of pixels among those transmitted which are rendered in the RoI window, it is not the size of the RoI window. The array of $R_w \times R_h$ pixels is resampled to fit the fixed size $d_w \times d_h$ of the RoI display window. Recall that this allows us to support arbitrary zoom factors with a small number of discretely spaced resolution layers. Random variable $Z_C$ denotes the continuous zoom factor controlled by user input. Its value determines the value of the discrete random variable $Z_D$, which is the zoom factor rounded to a power of two. For example

$$Z_D = \begin{cases} 1, & \text{if } 1 \leq Z_C < 1.5 \\ 2, & \text{if } 1.5 \leq Z_C < 4. \end{cases} \tag{6}$$

To render the RoI at some zoom factor $Z_C$, we round to the discrete zoom factor $Z_D$ and retrieve the resolution layer $\log_2(Z_D) + 1$. The mismatch $Z_C / Z_D$ is made up by resizing the transmitted video after decoding. For our analysis, we need to model the conditional pdf of $Z_C$ given the layer number. In our modeling below, we assume that, given the layer number, $Z_C$ is uniformly distributed. For example, if the optimization is being carried out for the second layer in the example above, then we assume that $Z_C$ is uniformly distributed between 1.5 and 4. Note that the distribution of the user-selected zoom factor in practice might depend on the size of certain salient objects in the video. Nevertheless, we make the assumption about $Z_C$ without performing any video content analysis.

Let $d_w$ and $d_h$ be constants denoting the width and height of the RoI display portion on the client's display, respectively. The random variables $R_w$ and $R_h$ are determined by $Z_C$ as follows:

$$R_w = d_w \frac{Z_D}{Z_C}, \qquad R_h = d_h \frac{Z_D}{Z_C}. \tag{7}$$

The expected values of $R_w$ and $R_h$ are given by

$$E\{R_w\} = d_w Z_D\, E\left\{\frac{1}{Z_C}\right\}, \qquad E\{R_h\} = d_h Z_D\, E\left\{\frac{1}{Z_C}\right\}$$

since the analysis is carried out given the layer number and hence the discrete zoom factor, $Z_D$.
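As an illustration of (6) and (7), the following Python sketch maps a continuous zoom factor to the discrete zoom factor and layer number of the example above, and evaluates $E\{R_w\}$ under the uniform model. For $Z_C$ uniform on an interval $[a, b)$, the standard closed form $E\{1/Z_C\} = \ln(b/a)/(b - a)$ applies. The function names are hypothetical, the thresholds are only those of the example in (6), and the display width of 480 pixels in the usage note is an assumed value, not one taken from the experiments.

```python
import math


def discrete_zoom(z_c):
    """Z_D for the example thresholds in (6); other ranges are not modeled here."""
    if 1.0 <= z_c < 1.5:
        return 1
    if 1.5 <= z_c < 4.0:
        return 2
    raise ValueError("zoom factor outside the example range [1, 4)")


def layer_number(z_d):
    """Resolution layer log2(Z_D) + 1 retrieved for discrete zoom factor Z_D."""
    return int(math.log2(z_d)) + 1


def expected_inverse_zoom(z_lo, z_hi):
    """E{1/Z_C} for Z_C uniform on [z_lo, z_hi): ln(z_hi / z_lo) / (z_hi - z_lo)."""
    return math.log(z_hi / z_lo) / (z_hi - z_lo)


def expected_roi_extent(d, z_d, z_lo, z_hi):
    """E{R} = d * Z_D * E{1/Z_C} along one dimension of the RoI."""
    return d * z_d * expected_inverse_zoom(z_lo, z_hi)
```

For instance, `expected_roi_extent(480, 2, 1.5, 4.0)` gives the expected number of layer-2 columns needed for an assumed 480-pixel-wide RoI display when the zoom factor is uniform between 1.5 and 4.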
Now, we can apply iterated expectation on (5) to yield

$$E\{P\} = (E\{R_w\} + s_w - 1)(E\{R_h\} + s_h - 1). \tag{8}$$

2) Optimal Slice Size: The average number of bits per pixel for coding the prediction residual of a given resolution layer, denoted by $\eta(s_w, s_h)$, is a function of the slice size $(s_w, s_h)$. We also define the number of pixels transmitted per rendered pixel as the relative pixel overhead $\psi(s_w, s_h) = \frac{E\{P\}}{d_w d_h}$, where $E\{P\}$ is given by (8). The optimal slice size minimizes the expected number of bits transmitted per rendered pixel and is given by

$$(s_w^{\mathrm{opt}}, s_h^{\mathrm{opt}}) = \arg\min_{(s_w, s_h)} \eta(s_w, s_h)\, \psi(s_w, s_h). \tag{9}$$

One way to obtain the function $\eta(s_w, s_h)$ is through sample encoding of the prediction residual by varying the slice

size. Alternatively, $\eta(s_w, s_h)$ could also be predicted by an analytical model to reduce the number of sample encodings. Either way, (9) can be used to find the optimal slice size.

Fig. 5. Model prediction versus empirical values for pixels transmitted per rendered pixel, $\psi(s_w, s_h)$, shown for the three sequences, Cardgame, Making Sense, and Soccer. The empirical values are obtained by averaging over 100 user-interaction trajectories for each sequence. The second y-axis shows the bits per pixel for coding the residual of the high-resolution layer, $\eta(s_w, s_h)$. The slice width and slice height in number of pixels are denoted by $s_w$ and $s_h$, respectively. (a) Cardgame sequence, layer 1 (PSNR 38.7 dB). (b) Cardgame sequence, layer 2 (PSNR 39.2 dB). (c) Making Sense sequence, layer 1 (PSNR 39.0 dB). (d) Making Sense sequence, layer 2 (PSNR 39.6 dB). (e) Soccer sequence, layer 1 (PSNR 35.5 dB). (f) Soccer sequence, layer 2 (PSNR 37.0 dB).

We now present experimental results to demonstrate that our model predicts the optimal slice size accurately without requiring the capture of user-interaction trajectories. In our experiments, we obtain $\eta(s_w, s_h)$ through a sample encoding of about 30 frames for each tested slice size configuration $(s_w, s_h)$. We use three video sequences for our experiments. The width × height of the Cardgame² and Making Sense² sequences is [...] pixels. For the Soccer³ sequence, it is [...] pixels. The RoI display is [...] pixels. For all three sequences, the thumbnail video is obtained by spatially downsampling the original by 4 both horizontally and vertically. There are two high-resolution layers; the first layer sequence is obtained by downsampling the original by 2 both horizontally and vertically, while the second layer sequence is simply the original video. All sequences are 25 frames/s. Cardgame and Making Sense have 298 frames and Soccer has 598 frames.

² Stanford Center for Innovation and Learning, Stanford, CA, generously provided these sequences.
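Once $\eta(s_w, s_h)$ is available for a handful of candidate slice sizes, the minimization in (9) reduces to a small grid search. The Python sketch below shows one way to organize it. The function names, the display size of 480 x 240 pixels, the expected RoI extents, and in particular the bits-per-pixel values in `eta` are hypothetical placeholders chosen for illustration. They are not measurements from the experiments.

```python
def relative_pixel_overhead(s_w, s_h, e_rw, e_rh, d_w, d_h):
    """psi(s_w, s_h) = E{P} / (d_w * d_h), with E{P} taken from (8)."""
    return (e_rw + s_w - 1) * (e_rh + s_h - 1) / (d_w * d_h)


def optimal_slice_size(eta, e_rw, e_rh, d_w, d_h):
    """Grid search for (9); eta maps (s_w, s_h) to bits per pixel."""
    def bits_per_rendered_pixel(size):
        s_w, s_h = size
        return eta[size] * relative_pixel_overhead(s_w, s_h, e_rw, e_rh, d_w, d_h)
    return min(eta, key=bits_per_rendered_pixel)


# Hypothetical bits-per-pixel values for four square slice sizes.
eta = {(32, 32): 0.34, (64, 64): 0.24, (128, 128): 0.21, (256, 256): 0.20}

# Hypothetical display size and expected RoI extents for one layer.
best = optimal_slice_size(eta, e_rw=376.6, e_rh=188.3, d_w=480, d_h=240)
```

With these placeholder numbers the search settles on 64 x 64: the smallest slices pay too much in coding efficiency, while the larger ones pay too much in pixel overhead.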
³ Fraunhofer Heinrich-Hertz Institute, Berlin, Germany, generously provided this sequence.

We encode the thumbnail video with an

Fig. 6. Model prediction versus empirical values for bits transmitted per rendered pixel, shown for the three sequences, Cardgame, Making Sense, and Soccer. The empirical values are obtained by averaging over 100 user-interaction trajectories for each sequence. The slice width and slice height in number of pixels are denoted by $s_w$ and $s_h$, respectively. (a) Cardgame sequence, layer 1 (PSNR 38.7 dB). (b) Cardgame sequence, layer 2 (PSNR 39.2 dB). (c) Making Sense sequence, layer 1 (PSNR 39.0 dB). (d) Making Sense sequence, layer 2 (PSNR 39.6 dB). (e) Soccer sequence, layer 1 (PSNR 35.5 dB). (f) Soccer sequence, layer 2 (PSNR 37.0 dB).

Fig. 7. Model prediction versus empirical values for zoom-adjusted relative pixel overhead, $\phi(s_w, s_h)$, shown for the Making Sense sequence. The empirical values are obtained by averaging over 100 user-interaction trajectories. The second y-axis shows the bits per pixel for coding the residual of the high-resolution layer, $\eta(s_w, s_h)$. The slice width and slice height in number of pixels are denoted by $s_w$ and $s_h$, respectively. (a) Making Sense sequence, layer 1 (PSNR 39.0 dB). (b) Making Sense sequence, layer 2 (PSNR 39.6 dB).


intraframe period of 15 frames using two consecutive B frames between anchor frames. The resulting PSNR and bitrate for Cardgame, Making Sense, and Soccer are 39.1 dB at 162 kb/s, 39.6 dB at 201 kb/s, and 35.3 dB at 355 kb/s, respectively. For Cardgame and Making Sense, we choose the QP to yield a PSNR of [...] dB for the high-resolution layers. For Soccer, the QP yields a PSNR of [...] dB.

Fig. 5 shows the relative pixel overhead, $\psi(s_w, s_h)$, for the three sequences. We compare the model prediction against empirical values averaged over 100 user-interaction trajectories for each sequence. The trajectories were recorded while interactively viewing the sequences using the graphical user interface described in Section III. Each trajectory starts at a random location with a random zoom factor and is 1 min long; the frames of the original sequence are looped to play for 1 min. The user's zoom factor, $Z_C$, is allowed to vary between 1 and 6. The thresholds given by (6) determine the high-resolution layer for rendering the RoI.

Fig. 6 shows the bits transmitted per rendered pixel for the three sequences. For a given sequence and resolution layer, the comparison in Figs. 5 and 6 for different slice sizes is made for the same QP and hence similar PSNR. Although the model predicts the optimal slice size fairly accurately, it can underestimate or overestimate the transmitted bitrate. This is because the popular slices that constitute the salient objects in the video could entail high or low bitrate compared to the average. Also, the location of the objects can bias the pixel overhead to the high or low side, whereas the model uses the average overhead.

For certain zoom factors chosen by the user, the accessed/transmitted pixels could be fewer than the number of rendered pixels. This can be seen in Fig. 5 where the relative pixel overhead, $\psi(s_w, s_h)$, goes below one. Hence, we also compute the zoom-adjusted relative pixel overhead, $\phi(s_w, s_h) = E\left\{\frac{P_w P_h}{R_w R_h}\right\}$. This quantity is always greater than one, where

$$\phi(s_w, s_h) = \left[(s_w - 1)\,E\left\{\frac{1}{R_w}\right\} + 1\right]\left[(s_h - 1)\,E\left\{\frac{1}{R_h}\right\} + 1\right]$$

$$E\left\{\frac{1}{R_w}\right\} = \frac{E\{Z_C\}}{d_w Z_D}, \qquad E\left\{\frac{1}{R_h}\right\} = \frac{E\{Z_C\}}{d_h Z_D}.$$

Fig. 7 shows the zoom-adjusted relative pixel overhead, $\phi(s_w, s_h)$, for the Making Sense sequence. We observed that the model prediction is close to the empirical values for all three sequences. Thus, the analysis presented in this section enables estimating various quantities related to accessed portions from the scene representation without recording user-interaction trajectories and measuring these quantities from long bitstreams encoded for various slice sizes. This helps system dimensioning of an interactive video transmission system.

Fig. 8. Improvement based on background extraction. Each high-resolution layer frame has two references to choose from, the frame obtained by upsampling the reconstructed thumbnail frame and the background frame from the same layer in the background pyramid.

IV. Background Extraction and Long-Term Memory Motion-Compensated Prediction

The coding scheme proposed in Section III exploits temporal correlation by performing motion compensation among successive frames of the thumbnail video. Temporal prediction among successive frames of the high-resolution layers is avoided to enable efficient random access. Although it enables efficient random access, upward prediction using the reconstructed thumbnail frames might result in substantial residual energy for high spatial frequencies. In this section, we propose creating a background frame [27], [28] for each high-resolution layer and employing long-term memory motion-compensated prediction (LTM MCP) [29] to exploit the correlation between this frame and each high-resolution frame to be encoded. The background frame is intracoded. As shown in Fig. 8, high-resolution P slices have two references to choose from, upward prediction and the background frame.
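For a static camera, a background frame of this kind can be obtained with a per-pixel temporal median over a subsample of frames. The Python sketch below is one minimal way to do it. The defaults of 150 frames and a stride of five follow the experimental setup reported in this section, whereas the function name, the frame layout (a list of rows of integer samples), and the use of the lower median for an even number of frames are assumptions made for illustration.

```python
from statistics import median_low


def background_frame(frames, first_n=150, step=5):
    """Per-pixel temporal median over every `step`-th of the first `first_n` frames.

    frames: list of frames, each a list of rows of integer sample values.
    median_low is used so that the result is always an observed sample
    value, even when the number of selected frames is even.
    """
    picked = frames[:first_n:step]
    height, width = len(picked[0]), len(picked[0][0])
    return [
        [median_low(f[y][x] for f in picked) for x in range(width)]
        for y in range(height)
    ]
```

A moving object that covers a pixel in only a minority of the selected frames does not affect the median at that pixel, which is why stationary content dominates the resulting frame.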
If a transmitted high-resolution P slice refers to the background frame, then relevant I slices from the background frame are transmitted only if they have not been transmitted earlier. This is different from [26], in which the encoder uses only those parts of the background for prediction that exist in the decoder's multi-resolution background pyramid. The encoder mimics the decoder in [26], which builds a background pyramid out of all previously received frames.

Background extraction algorithms as well as detection and update of changed background portions have been previously studied, for example in [30], and are not the focus of this paper. Since a moving camera might hamper the spatial browsing experience, the camera is static in our sequences. A simple temporal median operator [27] yields a plausible background frame. Out of the first 150 frames, we include every fifth frame for the median operation. Fig. 9 shows the result for Cardgame, Making Sense, and Soccer. Although some stationary objects remain in the background frame, this helps the coding efficiency. In our experiments, the background frame is not updated after its creation at the start. This is typical with a static camera. For example, in a soccer game, the background typically changes due to illumination changes, which happen infrequently. The background frame is intracoded with the same slice structure as the other frames from the layer.

Fig. 9. Sample frame and background frame for layer 1 of the Cardgame, Making Sense, and Soccer sequences.

Fig. 10 shows the coding bitrate reduction due to this approach. The figure is shown for slice sizes of $(s_w/16 \times s_h/16) = 4 \times 16$ for layer 1 and $(s_w/16 \times s_h/16) = 6 \times 4$ for layer 2 of Cardgame and Making Sense. For Soccer, the slice size is $(s_w/16 \times s_h/16) = 4 \times 4$ for both layers. For Cardgame, Fig. 11 shows the resulting transmission bitrate reduction. For Fig. 10, the slice sizes chosen are either optimal or close to optimal. If the mean transmission bitrates corresponding to two slice sizes are close, we prefer the larger slice size for reasons noted in Section V.

Fig. 10. Bitrate reduction through background extraction (BE) and long-term memory motion-compensated prediction (LTM MCP), shown for the Cardgame, Making Sense, and Soccer sequences. For both Cardgame and Making Sense, the slice size is $(s_w/16 \times s_h/16) = 4 \times 16$ for layer 1 and $(s_w/16 \times s_h/16) = 6 \times 4$ for layer 2. For Soccer, the slice size is $(s_w/16 \times s_h/16) = 4 \times 4$ for both layers. (a) Cardgame sequence, layer 1. (b) Making Sense sequence, layer 1. (c) Soccer sequence, layer 1. (d) Cardgame sequence, layer 2. (e) Making Sense sequence, layer 2. (f) Soccer sequence, layer 2.

Fig. 11. Transmission bitrate is reduced after employing background extraction (BE) and long-term memory motion-compensated prediction (LTM MCP), here shown for the two layers of Cardgame. The slice width and slice height in number of pixels are denoted by $s_w$ and $s_h$, respectively. Transmission bitrate values are obtained by counting the bits required to transmit relevant high-resolution slices. The values are averaged over 100 user-interaction trajectories. (a) Cardgame sequence, layer 1. (b) Cardgame sequence, layer 2.

Fig. 12. Number of I and P slices transmitted over the streaming session, when background extraction (BE) and long-term memory motion-compensated prediction (LTM MCP) are employed. The data are plotted for a single user-interaction trajectory. Slice sizes are as in Fig. 10. For Cardgame and Making Sense, we choose the QP to yield around 40.6 dB PSNR for both layers. For Soccer, the PSNR is around 37.3 dB for layer 1 and 38.5 dB for layer 2. (a) Cardgame sequence. (b) Making Sense sequence. (c) Soccer sequence.

Fig. 13. Model prediction versus empirical values for bits transmitted per rendered pixel, shown for the Making Sense sequence, encoded using background extraction (BE) and long-term memory motion-compensated prediction (LTM MCP). The empirical values are obtained by averaging over 100 user-interaction trajectories. The slice width and slice height in number of pixels are denoted by $s_w$ and $s_h$, respectively. (a) Making Sense sequence, layer 1 (PSNR 40.6 dB). (b) Making Sense sequence, layer 2 (PSNR 40.6 dB).

For the high-resolution layers, Fig. 12 shows the number of transmitted I slices from the background pyramid and the number of transmitted P slices. It shows the numbers for a single user-interaction trajectory.
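Counts of the kind plotted in Fig. 12 follow from simple bookkeeping at the sender: a background I slice is sent only the first time a transmitted P slice refers to it. The sketch below assumes, for illustration only, that a P slice referring to the background needs just the co-located background slice (plausible because the background frame uses the same slice structure, although motion vectors could in principle reach neighboring slices) and that the background is never updated. The function name `count_slices` and the trajectory format are hypothetical.

```python
def count_slices(trajectory):
    """Count transmitted I and P slices per frame of one streaming session.

    trajectory: list of frames; each frame is a list of
                (slice_id, uses_background) pairs for the P slices
                that cover the RoI in that frame.
    Returns a list of (num_i_slices, num_p_slices), one entry per frame.
    """
    sent_background = set()  # background I slices the client already holds
    counts = []
    for frame in trajectory:
        num_i = 0
        for slice_id, uses_background in frame:
            # Send the co-located background I slice only on first use.
            if uses_background and slice_id not in sent_background:
                sent_background.add(slice_id)
                num_i += 1
        counts.append((num_i, len(frame)))
    return counts
```

For a toy trajectory in which the RoI covers four slices, stays put for one frame, and then pans by one slice column, the first frame needs as many I slices as P slices, the second needs none, and the pan brings in only the newly covered background slices that are actually referenced.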
For the first frame of the streaming session, roughly equal numbers of I and P slices are transmitted. Subsequently, I slices need to be transmitted only sporadically in time and generally in fewer numbers than at the start. Although not shown here, when averaged over 100 trajectories, the profiles of the transmitted I and P slices appear smoother; the number of P slices is almost constant and matches the expected number of transmitted P slices that can be computed from analysis similar to Section III-C. The average number of transmitted I slices is highest at the start and is about 1% of the number of transmitted P slices thereafter.

We model the bits transmitted per rendered pixel as before. However, for simplicity, the cost of transmitting I slices is counted in the coding bitrate, $\eta(s_w, s_h)$, but not in the number of pixels transmitted per rendered pixel, $\psi(s_w, s_h)$. As shown in Fig. 13, the model matches closely with the empirical values for the Making Sense sequence. The model matches well for the other two sequences as well. It should be noted that the change in the optimal slice size after employing the background frame is small, and the slice size that is optimal for the earlier scheme still yields a mean transmission bitrate very close to that corresponding to the new optimal slice size. Hence, we choose the same slice size for comparing the coding bitrate of the two schemes in Fig. 10.

V. Conclusion and Further Work

We proposed a spatial-random-access-enabled video coding scheme that eliminates the need to transmit and/or decode the entire video scene in high spatial resolution. The RoI can be switched during any frame interval without waiting for the end of the GoP or having to transmit extra slices from the past. The coding scheme allows the system to scale with the number of clients; it avoids encoding each client's RoI sequence individually. Another benefit is that requested RoIs can be extracted from the bitstream even inside or at the edge of the network, closer to the client node. The random access aspects presented in this paper also apply to the design of other IBR-based interactive streaming systems.

We optimized the slice size to minimize the transmission bitrate. Our model accurately predicts the optimal slice size without requiring the capture of user-interaction trajectories. We proposed an improvement of the coding scheme based on background extraction and long-term memory motion-compensated prediction. Experiments indicate that both the coding bitrate and the transmission bitrate can be reduced by up to 85% while retaining efficient random access capability. This improvement, however, entails transmitting some I slices from the background pyramid that might be required for decoding the current high-resolution P slices. Nevertheless, the cost of doing this is amortized over the streaming session.
For reducing latency in a streaming scenario, we proposed predicting the user's RoI in advance [31], [32] and pre-fetching relevant data. A bigger slice size adds robustness against inaccurate RoI prediction, although it might increase the transmission bitrate. Also, if the packetization overhead associated with layers below the application layer is considered, for example when each slice needs to be put in a different transport layer packet, then a bigger slice size might be optimal. A sample scenario is application-layer P2P multicast to a population of peers where each peer can subscribe/unsubscribe requisite tiles according to its RoI. In [33] and [34], we proposed forming a multicast group for each slice. In this scenario, data from disjoint slices are preferably transmitted/forwarded in different transport layer packets. In the RoI P2P system, the peer's task of deciding which multicast groups to subscribe to is simplified thanks to the efficient random access of the underlying video coding scheme.

Acknowledgment

The authors would like to thank Dr. P. Baccichet, Dr. D. Varodayan, and K. Chono for useful discussions.

References

[1] C. Fehn, C. Weissig, I. Feldmann, M. Mueller, P. Eisert, P. Kauff, and H. Bloss, "Creation of high-resolution video panoramas of sport events," in Proc. 8th IEEE ISM, Dec. 2006, pp.
[2] J. Kopf, M. Uyttendaele, O. Deussen, and M. F. Cohen, "Capturing and viewing gigapixel images," in Proc. ACM SIGGRAPH, vol. 26, no. 3, Aug. 2007, pp.
[3] Hewlett-Packard. (2009, Sep. 16). Halo: Video Conferencing Product by Hewlett-Packard [Online]. Available: html
[4] A. Smolic and D. McCutchen, "3DAV exploration of video-based rendering technology in MPEG," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 3, pp. , Mar.
[5] Immersive Media. (2009, Sep. 16). Dodeca 2360: An Omni-Directional Video Camera Providing Over 100 Million Pixels per Second by Immersive Media [Online]. Available:
[6] Video Clip Showcasing Interactive TV with Pan/Tilt/Zoom (2009, Sep. 26) [Online]. Available: Ko9jcIjBXnk
[7] H.-Y. Shum, S. B. Kang, and S.-C. Chan, "Survey of image-based representations and compression techniques," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 11, pp. , Nov.
[8] M. Levoy and P. Hanrahan, "Light field rendering," in Proc. ACM SIGGRAPH, Aug. 1996, pp.
[9] P. Kauff and O. Schreer, "Virtual team user environments: A step from tele-cubicles toward distributed tele-collaboration in mediated workspaces," in Proc. IEEE ICME, vol. 2, Aug. 2002, pp.
[10] M. Tanimoto, "Overview of FTV (free-viewpoint television)," in Proc. ICME, Jul. 2009, pp.
[11] D. Taubman and R. Rosenbaum, "Rate-distortion optimized interactive browsing of JPEG2000 images," in Proc. IEEE ICIP, Sep. 2000, pp.
[12] D. Taubman and R. Prandolini, "Architecture, philosophy and performance of JPIP: Internet protocol standard for JPEG2000," Proc. SPIE Intl. Symp. VCIP, vol. 5150, no. 1, pp. , Jul.
[13] P. Chou and Z. Miao, "Rate-distortion optimized streaming of packetized media," IEEE Trans. Multimedia, vol. 8, no. 2, pp. , Apr.
[14] B. Girod, "The efficiency of motion-compensating prediction for hybrid coding of video sequences," IEEE J. Sel. Areas Commun., vol. 5, no. 7, pp. , Aug.
[15] B. Girod, "Motion-compensating prediction with fractional-pel accuracy," IEEE Trans. Commun., vol. 41, no. 4, pp. , Apr.
[16] B. Girod, "Efficiency analysis of multihypothesis motion-compensated prediction for video coding," IEEE Trans. Image Process., vol. 9, no. 2, pp. , Feb.
[17] S. Heymann, A. Smolic, K. Mueller, Y. Guo, J. Rurainsky, P. Eisert, and T. Wiegand, "Representation, coding and interactive rendering of high-resolution panoramic images and video using MPEG-4," in Proc. PPW, Feb.
[18] P. Ramanathan and B. Girod, "Rate-distortion optimized streaming of compressed light fields with multiple representations," in Proc. 14th Packet Video Workshop, Dec.
[19] P. Ramanathan and B. Girod, "Random access for compressed light fields using multiple representations," in Proc. IEEE 6th Int. Workshop MMSP, Sep. 2004, pp.
[20] M. Karczewicz and R. Kurceren, "The SP- and SI-frames design for H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. , Jul.
[21] X. Zhu, A. Aaron, and B. Girod, "Distributed compression for large camera arrays," in Proc. IEEE Workshop Statist. Signal Process., Sep. 2003, pp.
[22] A. Aaron, P. Ramanathan, and B. Girod, "Wyner-Ziv coding of light fields for random access," in Proc. IEEE 6th Workshop MMSP, Sep. 2004, pp.
[23] I. Bauermann and E. Steinbach, "RDTC optimized compression of image-based scene representations (part I): Modeling and theoretical analysis," IEEE Trans. Image Process., vol. 17, no. 5, pp. , May
[24] I. Bauermann and E. Steinbach, "RDTC optimized compression of image-based scene representations (part II): Practical coding," IEEE Trans. Image Process., vol. 17, no. 5, pp. , May
[25] E. Kurutepe, M. R. Civanlar, and A. M. Tekalp, "Interactive transport of multi-view videos for 3DTV applications," J. Zhejiang Univ. Sci. A, vol. 7, no. 5, pp. , May
[26] J. Bernstein, B. Girod, and X. Yuan, "Hierarchical encoding method and apparatus employing background references for efficiently communicating image sequences," U.S. Patent 5 155 594, Oct. 1992.
[27] M. Massey and W. Bender, "Salient stills: Process and practice," IBM Syst. J., vol. 35, no. 3–4, pp. ,
[28] D. Farin, P. de With, and W. Effelsberg, "Robust background estimation for complex video sequences," in Proc. IEEE ICIP, vol. 1, Sep. 2003, pp. 145–148.
[29] T. Wiegand, X. Zhang, and B. Girod, "Long-term memory motion-compensated prediction," IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 1, pp. 70–84, Feb.
[30] D. Hepper, "Efficiency analysis and application of uncovered background prediction in a low bit rate image coder," IEEE Trans. Commun., vol. 38, no. 9, pp. , Sep.
[31] A. Mavlankar, D. Varodayan, and B. Girod, "Region-of-interest prediction for interactively streaming regions of high resolution video," in Proc. IEEE 16th Packet Video Workshop, Nov. 2007, pp.
[32] A. Mavlankar and B. Girod, "Pre-fetching based on video analysis for interactive region-of-interest streaming of soccer sequences," in Proc. IEEE ICIP, Nov. 2009, pp.
[33] A. Mavlankar, J. Noh, P. Baccichet, and B. Girod, "Peer-to-peer multicast live video streaming with interactive virtual pan/tilt/zoom functionality," in Proc. IEEE ICIP, Oct. 2008, pp.
[34] A. Mavlankar, J. Noh, P. Baccichet, and B. Girod, "Optimal server bandwidth allocation for streaming multiple streams via P2P multicast," in Proc. IEEE 10th Workshop MMSP, Oct. 2008, pp.

Aditya Mavlankar (S'99–M'09) received the B.E. degree in electronics and telecommunication from the University of Pune, Pune, India, the M.S. degree in communications engineering from the Technical University of Munich, Munich, Germany, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA. He is currently with Tely Labs, Inc., Menlo Park, CA. He has published over 30 conference and journal papers, book chapters, and patents. His current research interests include scalable video coding, interactive video delivery, and peer-to-peer video streaming. Dr. Mavlankar was the recipient of the Edison Prize Bronze Medal awarded by IIE Europe in conjunction with the GE Foundation for his Master's thesis in 2006, was a co-recipient of the Best Student Paper Award at the IEEE Workshop on Multimedia Signal Processing, Victoria, BC, Canada, and a co-recipient of the Best Student Paper Award at the European Signal Processing Conference, Poznan, Poland. He won the Student Travel Grant Award for his paper at the 16th International Packet Video Workshop, Lausanne, Switzerland. Papers co-authored by him have been nominated multiple times for best paper awards at international conferences.

Bernd Girod (M'80–SM'97–F'98) received the M.S. degree from the Georgia Institute of Technology, Atlanta, and the Engineering Doctorate degree from the University of Hannover, Hannover, Germany. He has been a Professor of electrical engineering and (by courtesy) computer science with the Information Systems Laboratory, Stanford University, Stanford, CA, since . Previously, he was a Professor of telecommunications with the Department of Electrical Engineering, University of Erlangen-Nuremberg, Erlangen/Nuremberg, Germany. He has published over 400 conference and journal papers, as well as 5 books. His current research interests include the areas of video compression and networked media systems. Prof. Girod received the EURASIP Signal Processing Best Paper Award in 2002, the IEEE Multimedia Communication Best Paper Award in 2007, the EURASIP Image Communication Best Paper Award in 2008, as well as the EURASIP Technical Achievement Award in . As an entrepreneur, he has been involved with several startup ventures as the founder, director, investor, or advisor, among them Polycom (Nasdaq: PLCM), Vivo Software, 8x8 (Nasdaq: EGHT), and RealNetworks (Nasdaq: RNWK). He is a EURASIP fellow and a member of the German National Academy of Sciences (Leopoldina).
