NAIVE - Network Aware Internet Video Encoding

Hector M. Briceño, MIT, hbriceno@cs.mit.edu
Steven Gortler, Harvard University, sjg@cs.harvard.edu
Leonard McMillan, MIT, mcmillan@cs.mit.edu

Abstract

The distribution of digital video content over computer networks has become commonplace. Unfortunately, most digital video encoding standards do not degrade gracefully in the face of packet losses, which often occur in a bursty fashion. We propose a new video encoding system that scales well with respect to the network's performance and degrades gracefully under packet loss. Our encoder sends packets that consist of a small random subset of pixels distributed throughout a video frame. The receiver places samples in their proper location (through a previously agreed ordering), and applies a reconstruction algorithm on the received samples to produce an image. Each of the packets is independent, and does not depend on the successful transmission of any other packets. Additionally, each packet contains information that is distributed over the entire image. We also apply spatial and temporal optimization to achieve better compression.

1 Introduction

With the advent of the Internet, the distribution of digital video content over computer networks has become commonplace. Unfortunately, digital video standards were not designed to be used on computer networks. Instead, they generally assume a fixed bandwidth and reliable transport from the sender to the receiver. However, for the typical user, the Internet does not make any such guarantees about bandwidth, latency or errors. This has led to the adaptation or repackaging of existing video encoding standards to meet these constraints. These attempts have met with varying levels of success. In this paper we propose to design a new video encoding algorithm specifically for computer networks from the ground up.

The Internet is a heterogeneous network whose basic unit of transmission is a packet. In order to assure scalability, the Internet was designed as a best effort network - i.e.
it makes no guarantees that a packet sent by a host will arrive at the receiver or that it will be delivered in the order that it was sent. This also implies that it makes no guarantees on the latency of the delivery.

A video encoding system designed for computer networks would ideally satisfy the following requirements. The transmitted data stream should be tolerant to variations in bandwidth and error rates along various network routing paths. A given data stream should also be capable of supporting different qualities of service, where the quality of service might be dictated by local resources (such as CPU performance) or by other user requirements. These requirements are only partially satisfied by existing video encoding systems.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM Multimedia '99 10/99 Orlando, FL, USA. © 1999 ACM 1-58113-151-8/99/0010...$5.00

In this paper we propose a flexible video encoding system that satisfies the following design goals:

- The system must allow for broadcast. We would like a system where video can be transmitted to a large audience in real time with no feedback to the source. This allows for arbitrary scalability.
- The network can arbitrarily drop packets due to congestion or differences in bandwidth between networks or receivers. Since this system is targeted at error prone networks, it must perform well under packet losses.
- The sender should be able to dynamically vary the bandwidth and CPU requirements of the encoding algorithm. Variations in bandwidth may be necessary in order to guarantee a quality of service, for instance at scene changes or during a complex sequence. Variations in bandwidth could also occur due to resource limitations at the source, such as channel capacity and CPU utilization, or by a policy decision.
- The receiver should be able to construct a reasonable approximation of the desired stream using a subset of the data transmitted. Furthermore, the receiver may also intentionally ignore part of the data received to free up resources in exchange for reduced quality.
- The quality of the video should degrade gracefully under packet loss by the network or throttling by the sender or the receiver.
- Variations in the algorithm should support a wide range of performance levels, from small personal appliances to high-end workstations.
- Users should be able to quickly join a session in progress.

These goals place severe constraints on how the system can be built. We consider packets as the basic unit of network transmission [13]. A video frame generally spans many packets. System throughput and quality are affected by throttling packets at the sender,

packet loss in the network, and ignoring of packets at the receiver. Therefore, we choose to regard packets as atomic in our system design. For scalability and error handling we avoid packets that contain prioritized data or interdependencies, such as the clustering of data or differential encoding. These goals motivate our design principles:

- Globalness - Individual packets should contain enough information to reconstruct the whole image. They also should be additive: each additional packet increases the reconstructed image quality. Conversely, for each packet that is dropped by the sender, network or receiver, the image quality degrades.
- Independence - All packets are independent of each other; any one of them can be dropped without abrupt changes in quality, and in many cases we can process them out of order.

These principles are quite different from those of current video encoding systems. Typical video encoding algorithms (e.g. H.263 [] or ISO MPEG) use compression and encoding techniques that make packets interdependent; when one packet is lost, all other packets that are related to it lose their usefulness.

We propose an encoding system that scales well with respect to the sender's performance, the number of receivers, and the network's performance. This system degrades gracefully under packet loss. Briefly stated: the encoder sends packets that consist of a small random subset of pixels distributed throughout a video frame. The receiver places samples in their proper location (through a previously agreed ordering), and applies a reconstruction algorithm on these samples to produce an image. Notice that since each packet contains a small random subset of the image, there is no ordering or priority for packets. We also apply spatial and temporal optimization to achieve better compression without compromising our globalness and independence principles.

Many other researchers have shown that there is an inherent tradeoff between the amount of compression and the degree of robustness to data loss [14].
Our work is no exception; our achieved image quality at a given level of compression is below the best known channel encoders. For this price, we obtain the ability to reconstruct images even when receiving one packet per frame. Finding fair ways to measure this tradeoff remains as future work.

2 Previous Work

Video encoding algorithms specifically tailored for the Internet have been previously proposed. ISO MPEG-1 provides high compression ratios, and it allows for bitstream resynchronization using slices. Generally slices span multiple packets, and few encoders make an effort to align slices within packet boundaries. The variable length encoding and difference encoding used by MPEG-1 are very effective in reducing the bitrate, but both techniques make assumptions about what has been previously received. If these assumptions are wrong (caused by packet loss) [8], artifacts will develop in the new frame.

Other discrete cosine transform (DCT) based algorithms like H.261 have been successfully adapted for use in computer networks by using a technique sometimes called conditional replenishment [21]. The idea is that instead of encoding the differences from previous frames, they either keep old blocks or entirely replenish new blocks, independently encoded. These techniques require that all blocks are replenished within a specified period of time. During heavy packet losses, important areas may not be updated until the losses subside. This is an all or nothing approach: a block will completely reach its new state or not change at all.

Layering approaches have partly alleviated this last problem. Algorithms like L-DCT [2] and PVH [21] use a base channel to encode a low quality representation of the block, and use additional channels to encode enhancement information to reproduce a more faithful block. Because enhancement layers usually depend on the base layer being received, when the base layer packets are lost, the block cannot be updated at all.

Error handling can also be incorporated into the network layer. By using error correcting codes or retransmission based schemes, errors can be minimized or eliminated, so as to create the illusion of a reliable network stream. Open-loop approaches [32] (i.e. those that don't require feedback), such as Forward Error Correction (FEC), eliminate errors when they are well characterized. Unfortunately, these systems must include enough redundancy in advance to deal with the worst-case packet loss rate scenario. This leads to inefficiencies. The overhead for error correction also increases total network load. Thus the entire network is taxed due to the worst performing route [26, 12]. The alternative is to use a closed-loop approach. Closed-loop approaches [28, 25, 7, 33], where the receivers request the retransmission of lost packets, have the drawback of higher latency and are difficult to scale [6, 4]. Additionally, since packet losses generally occur during congestion, these requests and subsequent retransmissions can make matters worse.

Robustness to data loss can be achieved using multiple description coding (MDC) [23, 29, 16]. MDC coders build correlation between the symbols, allowing for good reconstruction from subsets of the data. Much of the previous work has dealt with two-channel coding [23], which can withstand the loss of half of the transmitted data. There has also been some preliminary work on many-channel coding [16, 29]. One can think of the NAIVE encoding as an extreme example of MDC, where no decorrelating transform is applied to the original pixel data, and pictures can be reconstructed from any received data.

The algorithm we propose bears many resemblances to work in error concealment [3, 11, 34, 31]. While most error concealment techniques are built upon existing standards, our technique proposes an entirely novel encoding scheme. Our encoding scheme is tolerant to bursty errors, and does not require resynchronization. Our reconstruction algorithm is fast, and makes no a-priori assumptions about the existence of specific nearby blocks or pixels.

3 The Algorithm

The Network Aware Internet Video Encoding (NAIVE) system sends a random subset of samples for each video frame and reconstructs the frame at the receiver. The random samples are distributed across one or more network packets. Given a sufficiently uniform sampling distribution, each packet can be considered as a subsampled version of the original image. Thus, each packet satisfies our globalness objective.

Samples are selected in a random sequence in order to hide errors caused by packet loss and to reduce aliasing artifacts such as blockiness at low sampling densities [22]. If packets of samples are lost, the degradation is distributed evenly throughout the reconstruction instead of being localized, as is typical of the sequentially encoded blocks used in other compression methods. Furthermore, the reconstruction artifacts due to packet loss should lead to an apparent loss in resolution (blurring) rather than introduce spurious structure, as would be expected from a uniform subsampling. Such structure is generally visible even when using higher order reconstruction filters.

Following our design principles, each packet contains samples

uniformly distributed throughout the whole image, and independent of any previous packet sent. Our encoding system allows for arbitrary packet loss, thus there is no guarantee that the client has received any particular set of image information. This presents us with the problem of reconstructing an image from irregularly spaced samples.

Figure 1: Grayscale Susie image pyramid reconstruction. The input samples are located in multiple levels of the pyramid. Notice that input samples in level 1 and 2 correspond to the background and smooth regions of the image.

Figure 2: Grayscale Lenna image samples and reconstruction, using 22% of the original pixels (a) and 5% of the original pixels (b). The images in the left column show the input pixels. The right column shows our reconstruction.

3.1 Image Reconstruction

A viable solution to this image reconstruction problem must have the following features:

- The method must run at frame rate. Thus, it is too expensive to solve systems of equations (as is done when using global spline methods [30, 19]) or to build spatial data structures (such as a Delaunay triangulation [24]).
- The method must deal with spatially scattered samples. Thus we are unable to use standard interpolation methods, or Fourier-based sampling theory.
- The method must create reconstructions of acceptable quality.

In this paper we adapt the pull-push algorithm of Gortler et al. [15]. This algorithm is based on concepts from image pyramids [9], wavelets [20] and subband coding [18], and it extends earlier ideas found in [10] and [22]. The algorithm proceeds in two phases called pull and push. During the first phase, pull, a hierarchical set of lower resolution data sets is created in an image pyramid. Each of these lower resolution images represents a blurred version of the input data; at lower resolutions, the gaps in the data become smaller (see the pull column in figure 1). During the second phase, push, this low resolution data is used to fill in the gaps at the higher resolutions (compare level 2 pull and push in figure 1).
Care is taken not to destroy high resolution information where it is available. Figure 2 shows the reconstruction of the Lenna grayscale image from 5% and 22% of the original pixels.

3.1.1 Organization

The algorithm uses a hierarchical set of image pixels with the highest resolution labeled 0, and lower resolutions having higher indices. Each resolution has 1/2 the resolution of the previous one in both the horizontal and vertical dimensions. For our 320 by 240 images, we use a 5 level pyramid. Associated with the ij-th pixel value $p^r_{i,j}$ at resolution $r$ is a weight $w^r_{i,j}$. These weights, representing pixel confidence, determine how the pixels at different resolution levels are eventually combined.

3.1.2 Initialize

During initialization, each of the received pixels is used to set the associated pixel value $p^0_{i,j}$ in the high resolution image, and the associated weight $w^0_{i,j}$ for this pixel is set to $f$, the value chosen to represent full confidence. The meaning of $f$ is discussed below. All other weights at the high resolution are set to 0.

3.1.3 Pull

The pull phase is applied hierarchically, starting from the highest resolution and continuing to the lowest resolution in the image pyramid. In this pull phase, successive lower resolution approximations of the image are derived from the adjacent higher resolution by performing a convolution with a discrete low pass filter $\tilde{h}$. In our system, we use the tent sequence

$$\tilde{h}[-1..1] \times [-1..1] = \begin{pmatrix} 1/16 & 1/8 & 1/16 \\ 1/8 & 1/4 & 1/8 \\ 1/16 & 1/8 & 1/16 \end{pmatrix}$$

The lower resolution pixels are computed by combining the higher resolution pixels using $\tilde{h}$. One way to do this would be to compute

$$w^{r+1}_{i,j} := \sum_{k,l} \tilde{h}_{k-2i,\,l-2j}\, w^r_{k,l}, \qquad p^{r+1}_{i,j} := \frac{1}{w^{r+1}_{i,j}} \sum_{k,l} \tilde{h}_{k-2i,\,l-2j}\, w^r_{k,l}\, p^r_{k,l} \tag{1}$$

This is equivalent to convolving with $\tilde{h}$ and then downsampling by a factor of two.

This computation can be interpreted as follows. Suppose we have a set of continuous tent filter functions associated with each pixel in the image pyramid. Suppose $\tilde{B}^0_{i,j}(u,v)$ is a continuous piecewise bilinear tent function centered at $(i, j)$ and two units (high resolution pixels) wide, $\tilde{B}^1_{i,j}(u,v)$ at the next lower resolution is a tent function centered at $(2i, 2j)$ and four units wide, $\tilde{B}^2_{i,j}(u,v)$ at the next lower resolution is a tent function centered at $(4i, 4j)$ and 8 units wide, and so on. These continuous functions are related using the discrete sequence $\tilde{h}$:

$$\tilde{B}^{r+1}_{i,j}(u,v) = \sum_{k,l} \tilde{h}_{k-2i,\,l-2j}\, \tilde{B}^r_{k,l}(u,v)$$

This means that one can linearly combine finer tents to obtain a lower resolution tent. The desired multiresolution pixel values can be expressed as an integral over an original continuous image $P(u,v)$ using the $\tilde{B}^r_{i,j}(u,v)$ as weighting functions:

$$p^r_{i,j} := \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} du\, dv\; \tilde{B}^r_{i,j}(u,v)\, P(u,v) \tag{2}$$

If one approximates this integral with a discrete sum over the received pixel values, one obtains

$$p^r_{i,j} := \frac{1}{w^r_{i,j}} \sum_{k,l} \tilde{B}^r_{i,j}(k,l)\, w^0_{k,l}\, p^0_{k,l} \tag{3}$$

where

$$w^r_{i,j} = \sum_{k,l} \tilde{B}^r_{i,j}(k,l)\, w^0_{k,l}.$$

It is easy to show that the values computed by Equation 3 can be exactly and efficiently obtained by applying Equation 1 hierarchically.

This method creates good low resolution images when the original samples are uniformly distributed. But when the original samples are unevenly distributed, Equation 3 becomes a biased estimator of the desired low resolution value defined by Equation 2, for it overly emphasizes the oversampled regions. Our solution to this problem is to replace Equation 1 with:

$$w^{r+1}_{i,j} := \sum_{k,l} \tilde{h}_{k-2i,\,l-2j}\, \min(w^r_{k,l}, f), \qquad p^{r+1}_{i,j} := \frac{1}{w^{r+1}_{i,j}} \sum_{k,l} \tilde{h}_{k-2i,\,l-2j}\, \min(w^r_{k,l}, f)\, p^r_{k,l} \tag{4}$$

The value $f$ represents full confidence, and the min operator is used to place an upper bound on the degree to which one image pyramid pixel, corresponding to a highly sampled region, can influence the total sum. Any value of $1/16 \le f \le 1$ creates a well defined algorithm. If $f$ is set to one, then no saturation is applied, and this equation is equivalent to Equation 1. If $f$ is set to 1/16, then even a single sample under the sum is enough to saturate the computation for the next lower resolution. We have experimented with many values in our system, and have obtained the best results with $f = 1/8$. Although a complete theoretical analysis of the estimator in Equation 4 has yet to be completed, our experiments show it to be far superior to Equation 1. Figure 3 shows the Lenna grayscale image reconstructed from 10% of its samples using (a) $f = 1$ and (b) $f = 1/8$.

Figure 3: Grayscale Lenna test image reconstruction with 10% of samples: (a) using f = 1, (b) using f = 1/8.

The pull stage runs in time linear in the number of pixels summed over all of the resolutions. Because each lower resolution has half the density of pixels, the computation time can be expressed as a geometric series, and thus this stage runs in time linear in the number of high resolution pixels at resolution 0.

3.1.4 Push

The push phase is also applied hierarchically, starting from the lowest resolution in the image pyramid and working to the highest resolution. During the push stage, low resolution approximations are used to fill in the regions that have low confidence in the higher resolution images. If a higher resolution pixel has a high associated confidence (i.e., has weight greater than or equal to $f$), we disregard the lower resolution information for that high resolution pixel. If the higher resolution pixel does not have sufficient weight, we blend in the information from the lower resolution. To blend this information, the low resolution approximation of the function must be expressed in the higher resolution.
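As an illustration of the pull phase of subsection 3.1.3, the saturated update of Equation 4 can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: the names `pull_step` and `H_TILDE` are hypothetical, samples outside the image are simply ignored, and an output pixel whose weight is zero is given the value 0. Both of these boundary choices are assumptions, since the text does not specify them.

```python
# 3x3 tent sequence of the pull phase, indexed by offsets -1..1 in each dimension.
H_TILDE = [[1 / 16, 1 / 8, 1 / 16],
           [1 / 8,  1 / 4, 1 / 8],
           [1 / 16, 1 / 8, 1 / 16]]


def pull_step(p, w, f):
    """One pull step from level r to level r+1, following Equation 4.

    p, w: 2D lists (rows of floats) with pixel values and weights at level r.
    f:    the full-confidence value used to saturate the weights.
    Returns (p_next, w_next) at half the resolution in each dimension.
    """
    rows, cols = len(p), len(p[0])
    out_rows, out_cols = (rows + 1) // 2, (cols + 1) // 2
    p_next = [[0.0] * out_cols for _ in range(out_rows)]
    w_next = [[0.0] * out_cols for _ in range(out_rows)]
    for i in range(out_rows):
        for j in range(out_cols):
            w_sum = 0.0   # sum of h * min(w, f)
            pw_sum = 0.0  # sum of h * min(w, f) * p
            for dk in (-1, 0, 1):
                for dl in (-1, 0, 1):
                    k, l = 2 * i + dk, 2 * j + dl
                    # Assumption: contributions from outside the image are skipped.
                    if 0 <= k < rows and 0 <= l < cols:
                        c = H_TILDE[dk + 1][dl + 1] * min(w[k][l], f)
                        w_sum += c
                        pw_sum += c * p[k][l]
            w_next[i][j] = w_sum
            # Assumption: a pixel with no support keeps value 0 and weight 0.
            p_next[i][j] = pw_sum / w_sum if w_sum > 0 else 0.0
    return p_next, w_next
```

Applying `pull_step` repeatedly, starting from level 0, builds the whole pyramid. Passing `f = 1` with weights no larger than 1 disables the saturation, which corresponds to Equation 1.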
The conversion uses an interpolation sequence also based on the tent sequence but with a different normalization:

$$h[-1..1] \times [-1..1] = \begin{pmatrix} 1/4 & 1/2 & 1/4 \\ 1/2 & 1 & 1/2 \\ 1/4 & 1/2 & 1/4 \end{pmatrix}$$

Push is done in two steps. We first compute temporary values

$$tp^r_{i,j} := \sum_{k,l} h_{i-2k,\,j-2l}\; p^{r+1}_{k,l}$$

This computation is equivalent to upsampling by a factor of 2 (adding 0 values), and then convolving with $h$. These temporary values are

now ready to be blended with the $p$ values already at level $r$, using the $w^r$ as the blending factors:

$$p^r_{i,j} := \left(1 - \frac{w^r_{i,j}}{f}\right) tp^r_{i,j} + \frac{w^r_{i,j}}{f}\; p^r_{i,j}$$

This is analogous to the over blending performed in image compositing [27].

3.1.5 Compression in the NAIVE Framework

To some extent, NAIVE achieves both compression and resiliency by relying on a random subset of samples from an image to reconstruct the missing information. However, neither the selection nor the reception of the samples is related to the specific content of the transmitted image. Since the goal of any compression algorithm is the elimination of redundancy in the target signal, we have also developed techniques to exploit the specific contents of a given video stream to achieve greater compression. In particular, video sequences commonly exhibit significant spatial and temporal correlations that are generally concentrated in lower frequency ranges.

At first glance it would appear that a random sampling strategy, like the one used in NAIVE, runs counter to any effort to reduce spatial and temporal correlation (since randomizing a correlated function tends to decorrelate it). However, if the notion of a sample is expanded to include not only pixels from the highest resolution level of the pyramid hierarchy, but also the subsequent lower resolution levels, significant reductions in spatial correlation can still be achieved. Likewise, if the persistence of a given sample from the reconstruction pyramid is lengthened from a single frame period to multiple frame intervals, similar temporal reductions are also possible.

Often there are cases when an image encoder benefits from transmitting only low-resolution information about some region. Perhaps that region contains little or no high frequency detail, or perhaps the region is considered insignificant and the current instantaneous bandwidth available does not support the transmission of a full resolution image.
To accommodate this ability our algorithm allows the encoder to insert lower resolution samples directly into an appropriate level of the pull-push image pyramid, $p^r_{i,j}$ for $r > 0$. When low-resolution samples are received they are placed directly into the reconstruction pyramid at the appropriate resolution. Also, the pulling of higher resolution samples onto a lower-resolution sample is suppressed. In order to apply this capability effectively, both perceptual and information theoretic concerns should be considered. Thus, as is typical of most digital video compression methods, there is a considerable art to making the best use of this capability. More details about how multi-resolution samples are encoded are given in subsection 4.1.

In video sequences image regions can change slowly. Our system takes advantage of this temporal coherence by allowing pixels from previous frames to be included in the pull-push reconstruction process. The persistence of a given sample is controlled by two mechanisms. First, all samples are aged at a constant rate, with newer samples superseding older ones. After a sample's age limit is reached, it no longer takes part in the image reconstruction process. Secondly, entire regions, or blocks, of old samples can be invalidated. This invalidation is typically used in areas of rapid motion or at scene changes. There are many tradeoffs to be considered when using these methods. More information about the aging and invalidation of samples is given in subsection 4.2.

Figure 4: Offset Table. There are N 16x16 blocks in the image. The i-th entry points to a sample in block number i modulo N. In any selection of N consecutive entries, there is a sample from every block.

3.2 Packetization

The pull-push algorithm provides a means of reconstructing an image from non-uniform samples. From our principle of globalness we need to pick samples from the whole image. These have to be selected at random to avoid visible artifacts and to allow the appearance of simultaneous update everywhere in the image [5].
We guarantee coverage of the whole image by dividing it into 16x16 blocks and making successive passes over the image, selecting one random sample from each block on each pass.

In order to minimize the information transmitted, the sender and the receiver agree on the ordering of samples, such that the sender only needs to send the location of the first sample in a packet. This is done as follows. The image is split into 16x16 blocks, so there are 256 samples per block. Say there are N blocks in an image. We generate a table, called the offset table, that has 256*N entries. The i-th entry in the table points to a sample in block number i mod N. The first entry contains the coordinate of a random sample in the first block; the second entry contains the coordinate of a sample in the second block; the (N+1)-th entry contains the location of a sample again in the first block.

The random ordering of the samples within a block is established by assigning a pseudo-random number to each pixel. The pixels are then sorted into a list according to this random number. The offset table can then be constructed by selecting a pixel from each of the N lists in turn. The sender and receiver are synchronized through the transmission of a seed for the random number generator. With the seed and frame size information the receiver can construct the offset table. This is the only information that must be transmitted via a reliable protocol such as TCP/IP. This ordering guarantees that if we pick N consecutive samples, they will span the whole image without large clusters. Additionally, we can compute the block that a sample belongs to from its table offset modulo N. See figure 4.

The reconstruction explained so far applies to a grayscale image. The same idea can be extended to the chrominance components of color images. We encode color images by sampling the chrominance components at 1/4 the resolution of the luminance image, similar to MPEG.
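The offset table construction described in this subsection can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, blocks are numbered in raster order, the frame dimensions are assumed to be multiples of the block size, and Python's `random.Random` stands in for whatever pseudo-random generator the sender and receiver actually share.

```python
import random


def block_index(x, y, width, block=16):
    """Raster-order index of the block containing pixel (x, y) (assumed numbering)."""
    return (y // block) * (width // block) + (x // block)


def build_offset_table(width, height, seed, block=16):
    """Build the offset table from a shared seed and the frame size.

    Returns (table, n_blocks), where table[i] is the (x, y) coordinate of a
    sample that lies in block number i mod n_blocks.
    """
    rng = random.Random(seed)  # stand-in PRNG; both ends must use the same one
    blocks_x, blocks_y = width // block, height // block
    n_blocks = blocks_x * blocks_y

    # For each block, tag every pixel with a pseudo-random number and sort.
    per_block = []
    for b in range(n_blocks):
        x0, y0 = (b % blocks_x) * block, (b // blocks_x) * block
        keyed = [(rng.random(), (x0 + dx, y0 + dy))
                 for dy in range(block) for dx in range(block)]
        keyed.sort()
        per_block.append([xy for _, xy in keyed])

    # Interleave the lists: each pass takes one pixel from every block.
    table = []
    for pass_no in range(block * block):
        for b in range(n_blocks):
            table.append(per_block[b][pass_no])
    return table, n_blocks
```

For a 320 by 240 frame this gives N = 300 blocks and 256 * 300 = 76800 entries, one per pixel. A receiver that knows only the seed and the frame size can call the same function and obtain an identical table, provided it uses the same generator.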
To encode the chrominance components, we maintain another offset table with 8x8 blocks, corresponding to the 16x16 blocks of the luminance components. We encode the chrominance samples independently of the luminance samples.

Figure 5: Packet Format.

We need to send very little overhead information with each packet. Each packet consists of: the frame number; the table offset of the first chrominance sample, the number of chrominance samples, and the samples themselves; and the table offset of the first luminance sample, with the remainder of the packet filled with luminance samples (see figure 5). We use 1024 bytes as our default packet size. This structure satisfies our globalness and independence properties. If a packet has more than N luminance samples (where N is the number of blocks in a frame), then there will be one sample in every block of the image, guaranteed by the way we traverse the offset table.

4 Enhancements

The baseline approach described above works well for images whose details are uniformly distributed throughout the whole image. Most images, though, have localized regions of detail, and most sequences bear a high level of temporal coherency across frames. We can take advantage of these characteristics to produce better quality video with the same or a smaller amount of data.

4.1 Spatial Locality

In image regions with mostly low frequency content, our encoding system allows us to directly transmit lower resolution samples, and the receiver can insert these directly into lower resolution pyramid levels. In our encoding system, we encode the sample value and resolution level in the same byte. We use 7 bits of precision for level 0 samples, and 6 bits of precision for level 1 and level 2 samples. If the least significant bit is 0, the sample is a level 0 sample; if the least significant bits are 01 or 11, the sample is a level 1 or level 2 sample respectively. With this change we keep the packet structure unchanged, except for how sample values are interpreted.

Samples that are inserted at lower resolution levels correspond spatially to many more samples at finer levels. Thus, when a low resolution sample is sent, fewer higher resolution samples are needed for that block.
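The one-byte sample encoding of subsection 4.1 can be illustrated with a short sketch. The tag bits follow the text (least significant bit 0 for level 0, low bits 01 for level 1, 11 for level 2). Everything else is an assumption made for illustration: the function names are hypothetical, the input is taken to be an 8-bit intensity, and the reduced-precision value is stored in the high bits by truncating the low bits.

```python
def pack_sample(value, level):
    """Pack an 8-bit intensity and a pyramid level (0, 1 or 2) into one byte."""
    if not 0 <= value <= 255:
        raise ValueError("value must fit in 8 bits")
    if level == 0:
        return value & 0xFE            # 7 bits of precision, tag bit 0
    if level == 1:
        return (value & 0xFC) | 0b01   # 6 bits of precision, tag bits 01
    if level == 2:
        return (value & 0xFC) | 0b11   # 6 bits of precision, tag bits 11
    raise ValueError("level must be 0, 1 or 2")


def unpack_sample(byte):
    """Return (value, level) for a byte produced by pack_sample."""
    if byte & 0b1 == 0:
        return byte & 0xFE, 0
    if byte & 0b11 == 0b01:
        return byte & 0xFC, 1
    return byte & 0xFC, 2
```

Under these assumptions a level 0 sample loses only its lowest bit and a lower resolution sample loses its two lowest bits, so the receiver can tell the level of every sample without any extra header fields.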
To manage the bookkeeping for skipped samples, we use a special table, called the SKIP TABLE. There is a SKIP TABLE entry for each block. The SKIP TABLE contains the number of samples, agreed upon by the encoder and decoder, that will be skipped for this block. When a packet is received, all entries in the SKIP TABLE are initialized to 0; thus each block is guaranteed to have one sample. When a sample is inserted into a lower resolution level, we load the SKIP TABLE entry for that block with a predefined constant, agreed upon by the sender and the receiver. In our system, when a sample is sent for level 1, we skip the next 3 samples for this block. When a sample is sent for level 2, we skip the next 15 samples for this block. Each time that block occurs in the sequence we inspect its SKIP TABLE entry. If the entry is non-zero, we decrement it and go to the next block without reading a sample from the packet. Otherwise, we insert the current sample into the block according to the offset table entry.

4.2 Temporal Locality

Temporal locality can be exploited even when packets are independent of each other. MPEG and H.261 exploit temporal locality by reusing blocks of pixels that are closely located in the previous frame, encoding this location and their difference. In our approach, we don't make any assumptions about the previous frame or what packets the receiver has processed. We simply take advantage of the fact that pixels in a block may not change significantly across many frames, in which case we reuse them to reconstruct a higher quality image. In NAIVE, pixels from previous frames can be kept around for up to 20 frames, and used as equal participants in the pull-push algorithm. When a block has changed significantly, a KILL-BLOCK signal is encoded for that block, and all pixels for that block from previous frames are discarded. For scene changes, a KILLALLBLOCKS signal will discard all pixels from previous frames.
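The sample persistence rules just described (aging, per-block invalidation and invalidation of the whole frame) can be sketched as a small receiver-side store. This is a hypothetical illustration: the class and method names are not from the paper, and the exact age test is an assumption, here taken to mean that a sample received in frame n takes part in frames n through n+19.

```python
MAX_AGE = 20  # frames a sample may persist, per the text
BLOCK = 16    # luminance block size, per the text


class SampleStore:
    """Keeps received samples and decides which ones feed the reconstruction."""

    def __init__(self):
        self.samples = {}  # (x, y) -> (value, frame number when received)

    def add(self, x, y, value, frame):
        # A newer sample at the same pixel supersedes the older one.
        self.samples[(x, y)] = (value, frame)

    def kill_block(self, bx, by, frame):
        # KILL-BLOCK: drop samples of block (bx, by) that predate `frame`.
        self.samples = {
            (x, y): (v, fr) for (x, y), (v, fr) in self.samples.items()
            if (x // BLOCK, y // BLOCK) != (bx, by) or fr == frame
        }

    def kill_all_blocks(self, frame):
        # Scene change: drop every sample that predates `frame`.
        self.samples = {xy: s for xy, s in self.samples.items() if s[1] == frame}

    def live(self, frame):
        # Samples young enough to take part in reconstructing `frame`.
        return {xy: v for xy, (v, fr) in self.samples.items()
                if 0 <= frame - fr < MAX_AGE}
```

The dictionary returned by `live` is what would be written into level 0 of the pyramid before running pull-push for the current frame.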
We flush the previous frame samples for a given block by using a special word (KILL-BLOCK) instead of encoding the sample. When this code is seen, the block that corresponds to the offset for that sample is marked, and all corresponding samples from previous frames are flushed. Additionally, we do not increment the pointer into the offset table, so that the next sample in the stream falls in the current block. We encode the KILL-BLOCK signals for new blocks in all the packets of a given frame. Currently, there exists a possibility of reusing samples from a wrong frame under a few error scenarios, but this condition can be remedied by encoding a sequence number with the KILL-BLOCK signal (analogous to MPEG-2 slice id information).

Blocks that do not change will slowly improve in quality because they are reusing samples from previous frames; therefore we wish to add more samples to the blocks which are changing more rapidly and are not reusing samples. We accomplish this by inserting negative values in the SKIP TABLE in the following way. When a block is killed, we set its corresponding SKIP TABLE entry to a negative value (currently -10). After we have gone around once for all blocks in the image, we only visit blocks that have a negative SKIP TABLE entry, and we increment the entry for each sample received. This continues until there are no more negative SKIP TABLE entries left. This increases the reconstructed quality of blocks that are not reusing previous samples. This does not violate our globalness principle, since we still have at least one sample for every block if they fit in a packet.

5 Results

In this section we evaluate the performance of our compression system. Before we proceed it is important to note two caveats. First, the policies of the encoder will greatly determine the quality of the decompressed stream. The encoder can make many decisions.
For example, it can decide which blocks to flush or keep, what offset to start sending samples from, from which levels samples should be drawn, and what proportion of luminance and chrominance samples to use, among other decisions. We have manually found reasonable settings for our video streams. In the optimal case, the encoder would make these decisions automatically. Secondly, we have used the signal-to-noise ratio (SNR) metric for evaluating our results. It is well known that SNR is not an optimal measurement for image quality. It is acceptable for comparing the algorithms

based on the same transform with different settings [17]. A better measurement would be based on models of the human visual system, but these are usually harder to implement or compute than the SNR.

Figure 6: Rate-distortion curve on the grayscale 512x512 Lena test image, for packet sizes from 512 to 8192 bytes.

Figure 7: Average SNR of 3 color sequences with 100 frames, encoded at 1 bpp and decoded with different packet drop rates, yielding different receive bpp rates.

Figure 6 shows the rate distortion curve for a 512x512 grayscale image, compressed for different target bits per pixel (bpp) and different packet sizes. Large packet sizes are important for large images. If the packet is not larger than the number of blocks in an image, then there will not be enough space to go around all the blocks once and, more importantly, the algorithm will not make use of the SKIP TABLE, which allows it to get more samples in needed areas. The drawback of using large packets is that they are more likely to be fragmented and lost. When a packet is fragmented and one of its fragments gets lost, the whole packet is lost. For small images, a packet size of 1024 bytes is adequate. For our experiments we used a packet size of 1024 bytes because it is compatible with the maximum packet size of most networks.

Figure 7 shows how the quality degrades gracefully for different kinds of video sequences. For these sequences, temporal and spatial locality has been used. The first sequence, Walk, shows men in suits walking from a car; the scene has high detail and motion. The second sequence, Claire, is a standard head and shoulders shot. Lastly, the Interview sequence consists of three scenes: a person walking into a room, a head and shoulders shot of the person talking inside the room, and a close up of her face. All three sequences contain 100 frames, and were encoded at 1 bpp. To generate all the data, the sequences were decoded with different packet drop rates, calculating the average SNR of all frames.
The packet drop rate determines the independent probability that a packet will be dropped. Over a whole sequence, a video encoded at 1 bpp and decoded with a packet drop rate of 30% will have a receive rate of 0.7 bpp. The slope of all three curves is very similar, showing that quality degrades slowly regardless of the kind of video.

The algorithm handles bursty packet losses well. Figure 8 shows the frame-by-frame SNR for the 10-second Interview (320x240 color) sequence compressed at 0.33 bpp. This sequence is composed of three shots. The first 22 frames form a shot of the person walking into an office. The stride of the person and the camera angle make the shot alternate between a slow-motion frame and a fast-motion frame, which gives the SNR its wave-like shape during that shot. The second shot is a head-and-shoulders shot of the person being interviewed in her office. This shot lasts until frame 77. The last shot is a close-up of the person. The quality of the image is above 30 dB for most of the sequence; there is a short dip between frame 77 and frame 78, but it does not take long to recover.

Figure 9 shows the same sequence under bursty packet loss. The dashed line represents the actual bit rate during the reception of each frame. This figure shows that even under heavy loss (receiving less than 0.1 bpp), the quality does not degrade significantly. At the end of the first burst, in frame 28, the quality level recovers rapidly. Additionally, the quality hardly degrades during the second burst, between frames 37 and 47.

The algorithm is simple enough to allow a software-only implementation. Table 1 shows the decoding frame rate for different sequences. The algorithm was run on a common Intel Pentium Pro 200 MHz processor running Linux and the X Window System. The frame rate is not very sensitive to the amount of data received. The decoding time is dominated by the pull-push algorithm, which runs after all the samples received from the network have been placed in the image.
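
To make the decoding cost concrete, here is a minimal sketch of a pull-push style reconstruction. It is a simplified illustration, not the decoder described in this paper. It assumes a square grayscale image whose side is a power of two, binary weights (1 for a received sample, 0 for a missing one), a 2x2 box filter in the pull phase, and nearest-neighbor replication in the push phase. The names `pull` and `pull_push` are hypothetical.

```python
def pull(values, weights):
    """Pull step: halve the resolution by weighted 2x2 averaging."""
    half_h, half_w = len(values) // 2, len(values[0]) // 2
    coarse_v = [[0.0] * half_w for _ in range(half_h)]
    coarse_w = [[0.0] * half_w for _ in range(half_h)]
    for y in range(half_h):
        for x in range(half_w):
            total_w = 0.0
            total_v = 0.0
            for dy in (0, 1):
                for dx in (0, 1):
                    wt = min(weights[2 * y + dy][2 * x + dx], 1.0)
                    total_w += wt
                    total_v += wt * values[2 * y + dy][2 * x + dx]
            if total_w > 0.0:
                coarse_v[y][x] = total_v / total_w
            # Clamp so a well-covered coarse pixel counts as fully known.
            coarse_w[y][x] = min(total_w, 1.0)
    return coarse_v, coarse_w


def pull_push(values, weights):
    """Fill pixels whose weight is 0 using coarser levels of the pyramid."""
    if len(values) == 1 or len(values[0]) == 1:
        return [row[:] for row in values]  # coarsest level: nothing to blend
    coarse_v, coarse_w = pull(values, weights)
    coarse_filled = pull_push(coarse_v, coarse_w)
    # Push step: blend each pixel with its nearest-neighbor coarse parent.
    filled = []
    for y, row in enumerate(values):
        out_row = []
        for x, v in enumerate(row):
            wt = min(weights[y][x], 1.0)
            out_row.append(wt * v + (1.0 - wt) * coarse_filled[y // 2][x // 2])
        filled.append(out_row)
    return filled


if __name__ == "__main__":
    # 4x4 frame with only two received samples; all other pixels are unknown.
    values = [[0.0] * 4 for _ in range(4)]
    weights = [[0.0] * 4 for _ in range(4)]
    values[0][0], weights[0][0] = 10.0, 1.0
    values[3][3], weights[3][3] = 20.0, 1.0
    for row in pull_push(values, weights):
        print(row)
```

In this sketch every pyramid level is visited no matter how many samples arrived, so the work depends on the image size rather than on the received data, which is consistent with a frame rate that is not very sensitive to the amount of data received.
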
The color sequence ran at a lower frame rate than the comparable grayscale sequence, taking roughly 50% more time per frame (Table 1). This makes sense, since for color sequences we also have to reconstruct the chrominance data, which is half the size of the luminance data. Displaying QCIF sequences in real time would not be a problem, and with a faster machine and an efficient display system, the same might be possible for CIF sequences.

Figure 8: Base: SNR for each frame vs. the bpp received per frame, with a constant receive rate of 0.33 bpp. (Horizontal axis: frame number; solid line: SNR; dashed line: bpp.)

Figure 9: Bursty: SNR for each frame vs. the bpp received per frame; there are bursty errors, so the receive rate drops sporadically. (Horizontal axis: frame number; solid line: SNR; dashed line: bpp.)

Test Sequence                fps at 1 bpp    fps at 0.5 bpp
interview (color 320x240)    23.5            25.3
susie (gray 352x240)         34.81           36.32
claire (gray 176x144)        76.7            84.9

Table 1: Decoding frame rates (without displaying) for different sequences.

6 Conclusions

The NAIVE system that we have presented is an initial step towards a video compression system tailored specifically for computer networking environments. NAIVE satisfies our initial design goals. It supports broadcast over large-area networks and maintains scalability. NAIVE is tolerant to packet loss at any point along the network from the sender to the receiver. In fact, the intentional dropping of packets at the source is one method of increasing the effective compression of the bit stream. Similarly, the selective dropping of packets at the receiver effectively sheds CPU load. A NAIVE sender can also dynamically vary its transmission bandwidth when required by the video sequence in order to maintain a given quality level. In all cases, the receiver of a NAIVE video stream is able to reconstruct a reasonable approximation of an entire frame using a minimum of information (i.e., a single packet). The reception of additional packets further enhances the quality of the frame. Finally, our system degrades gracefully under severe packet losses.

Fundamentally, the randomizing of samples used in our NAIVE method has the effect of decorrelating the input signal, whereas effective compression methods essentially depend on highly correlated input signals. Thus, our NAIVE algorithm sacrifices compression ratio, as compared to other video compression techniques, in order to achieve our design goals. We believe that other compression techniques can be layered onto our NAIVE methods to achieve substantially improved compression.
For instance, differential encoding methods could be applied to all samples in a packet following the initial sample. Variable-length encoding techniques can be applied within individual packets to reduce redundancy in the transmitted symbols. We are also hopeful that motion compensation techniques can be applied within our framework by encoding a motion vector for each block. These motion vectors would imply that a block of samples in all pyramid levels would be copied to the current block. Thus, the sender would make no specific assumption concerning which samples are available at the receiver, only that those samples within the transferred block would form the best basis for reconstructing the desired block. It is also possible to apply embedded coding techniques to the samples within each packet. This would potentially allow for trading off the quantization of samples for increased sampling density.

Another shortcoming of our NAIVE method is that the sender is fundamentally unable to make any quality guarantees to any particular receiver. The need for such a guarantee might arise from an economics-driven approach where particular receivers pay a premium for assurances of a given quality level. Layering is an effective technique for satisfying such requirements. We believe that our NAIVE method could be extended to provide layering. Finally, we plan to integrate audio into our framework in the near future. We will either adapt the NAIVE mechanisms to audio or use one of the standard protocols for audio distribution.

In summary, we view our NAIVE algorithm as a starting point for the development of a new class of video compression methods that are well suited for computer networks. By considering the realities of real networks, we believe that it is possible to define new classes of algorithms that are scalable in broadcast applications and degrade gracefully under variations in network activity.

Acknowledgements

We would like to thank Aaron Isaksen for his help in preparing our videos.
Support for this research was provided by DARPA contract N30602-97-1-0283 and the Massachusetts Institute of Technology's Laboratory for Computer Science.

References

[1] H.263: Video coding for low bitrate communication. Draft ITU-T Recommendation H.263, May 1996.
[2] Elan Amir, Steven McCanne, and Martin Vetterli. A layered DCT coder for internet video. In IEEE International Conference on Image Processing, pages 13-16, Lausanne, Switzerland, September 1996.
[3] E. Asbun and E. Delp. Real-time error concealment in compressed digital video streams. Proceedings of the Picture Coding Symposium 1999, April 1999.
[4] Ernst W. Biersack. A performance study of forward error correction in ATM networks. In International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV) 1993, pages 391-399, Heidelberg, Germany, November 1993.
[5] G. Bishop, H. Fuchs, L. McMillan, and E. Scher Zagier. Frameless rendering: Double buffering considered harmful. Computer Graphics (SIGGRAPH 94), pages 175-176, 1994.
[6] Jean-Chrysostome Bolot, Hugues Crepin, and Andres Vega Garcia. Analysis of audio packet loss in the internet. In NOSSDAV, pages 154-165, Durham, NH, 1995.
[7] Jean-Chrysostome Bolot, Thierry Turletti, and Ian Wakeman. Scalable feedback control for multicast video distribution in the internet. In ACM Communication Architectures, Protocols, and Applications (SIGCOMM) 1994, pages 58-67, London, UK, 1994.
[8] J. Boyce and R. Gaglianello. Packet loss effects on MPEG video sent over the public internet. ACM Multimedia 1998, 1998.
[9] P. Burt and E. Adelson. Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4), April 1983.
[10] P. J. Burt. Moment images, polynomial fit filters, and the problem of surface interpolation. In Proceedings of Computer Vision and Pattern Recognition, pages 144-152. IEEE Computer Society Press, June 1988.
[11] Y. Chung, J. Kim, and C. Kuo. DCT based error concealment for RTSP video over a modem internet connection. International Symposium on Circuits and Systems 98, May 1998.
[12] I. Cidon, A. Khamisy, and M. Sidi. Analysis of packet loss processes in high-speed networks. IEEE Trans. Info. Theory, 39(1), January 1993.
[13] D. D. Clark and D. L. Tennenhouse. Architectural considerations for a new generation of protocols. In ACM Communication Architectures, Protocols, and Applications (SIGCOMM) 1990, September 1990.
[14] A. A. El-Gamal and T. M. Cover. Achievable rates for multiple descriptions. IEEE Trans. Information Theory, 28:851-857, 1982.
[15] S. Gortler, R. Grzeszczuk, R. Szeliski, and M. Cohen. The lumigraph. Computer Graphics (SIGGRAPH 96), pages 43-54, 1996.
[16] V. K. Goyal, J. Kovacevic, R. Arean, and M. Vetterli. Multiple description transform coding of images. Proc. IEEE Int. Conf. Image Processing, October 1998.
[17] Yung-Kai Lai, Jin Li, and C.-C. Jay Kuo. A wavelet approach to compressed image quality measurement. 30th Annual Asilomar Conference on Signals, Systems, and Computers, November 1996.
[18] A. Lippman and W. Butera. Coding image sequences for interactive retrieval. Communications of the ACM, 32(7):852-860, July 1989.
[19] Peter Litwinowicz and Lance Williams. Animating images with drawings. In Computer Graphics (SIGGRAPH 94), pages 409-412, 1994.
[20] S. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE PAMI, 11, July 1989.
[21] Steven R. McCanne. Scalable Video Coding and Transmission for Internet Multicast Video. PhD thesis, University of California, Berkeley, December 1996.
[22] D. P. Mitchell. Generating antialiased images at low sampling densities. Computer Graphics (SIGGRAPH 87), 21(4):65-72, July 1987.
[23] M. T. Orchard, Y. Wang, V. Vaishampayan, and A. R. Reibman. Redundancy rate-distortion analysis of multiple description coding using pairwise correlating transforms. Proc. IEEE Int. Conf. Image Processing, October 1997.
[24] J. O'Rourke. Computational Geometry in C. Cambridge University Press, 1993.
[25] Sassan Pejhan, Mischa Schwartz, and Dimitris Anastassiou. Error control using retransmission schemes in multicast transport protocols for real-time media. IEEE/ACM Transactions on Networking, 4(3):413-427, June 1996.
[26] M. Podolsky, C. Romer, and S. McCanne. Simulation of FEC-based error control for packet audio on the internet. INFOCOM 98, March 1998.
[27] Thomas Porter and Tom Duff. Compositing digital images. In Hank Christiansen, editor, Computer Graphics (SIGGRAPH 84 Proceedings), volume 18, pages 253-259, July 1984.
[28] Injong Rhee. Error control techniques for interactive low-bit rate video transmission over the internet. In ACM Communication Architectures, Protocols, and Applications (SIGCOMM) 1998, pages 290-301, Vancouver, B.C., 1998.
[29] S. Servetto, K. Ramchandran, V. Vaishampayan, and K. Nahrstedt. Multiple description wavelet based image coding. In Proceedings of the IEEE International Conference on Image Processing (ICIP), October 1998.
[30] D. Terzopoulos. Regularization of inverse visual problems involving discontinuities. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(4):413-424, July 1986.
[31] S. Tsekeridou, I. Pitas, and C. LeBuhan. An error concealment scheme for MPEG-2 coded video sequences. ISCAS 97, pages 1289-1292, June 1997.
[32] P. H. Westerink, J. H. Weber, and J. W. Limpers. Adaptive channel error protection of subband encoded images. IEEE Transactions on Communications, 41(3):454-459, March 1993.
[33] X. Rex Xu, Andrew C. Myers, Hui Zhang, and Raj Yavatkar. Resilient multicast support for continuous-media applications. In International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV) 1997, pages 183-193, St. Louis, MO, May 1997.
[34] W. Zeng and B. Liu. Geometric structure based directional filtering for error concealment in image/video transmission. SPIE vol. 2601, Wireless Data Transmission, Photonics East 95, October 1995.