CHAPTER INTRODUCTION:

The color variations among different viewpoints in multi-view video sequences may deteriorate the visual quality and coding efficiency. Various color correction methods have been proposed; however, the color appearance and histograms of the corrected target frames are still not similar enough to the reference frames in their details. Focusing on restoring more similar color, a block-based color correction algorithm is proposed. Blocks in the reference frames are matched to the target frames through spatial prediction, and a colorization scheme is then adopted to propagate color as a coarse correction. Finally, blending this result with a global color transfer yields the fine correction.

A multi-view video sequence is captured using a set of synchronized cameras placed at different positions. Although the cameras are adjusted to the same configuration as precisely as possible, it is still difficult to avoid luminance and chrominance discrepancies among the different viewpoints due to scene illumination, camera calibration and shutter speed. These variations not only deteriorate the visual quality, but also degrade the coding efficiency. In earlier approaches, a color correction matrix was used as a mapping function and image segmentation was adopted to relate the corresponding color regions from different viewpoints. The proposed method restores better color similarity by borrowing corresponding blocks directly from the reference frames into the target frames. Furthermore, spatial prediction is used to match the blocks; this avoids the heavy computation of segmentation and is also easily integrated into a multi-view video coding system.

To stream a media type means to deliver it to a user across the network in such a way that the user may begin to view or hear the file as it downloads. In order for time-based media to be streamed across a network, it must be compressed. The data rate with which the file has been saved must not exceed the speed of the network over which it will be delivered, or playback will be choppy and frames will be dropped. Generally speaking, unless users are satisfied with small frame sizes and somewhat pixelated and jerky playback, streaming video to users with modems is not yet a possibility. The Ethernet network is reliable and fast enough that some video streaming is supportable, but it is probably best to stick with frame sizes of no larger than 240x180.
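The data-rate constraint above can be checked with back-of-the-envelope arithmetic. A minimal Python sketch, where the frame size, frame rate, compression ratio and link speed are illustrative assumptions rather than measured values:

    # Rough feasibility check: can a clip be streamed over a given link?
    def required_kbps(width, height, fps, bits_per_pixel=24, compression_ratio=50):
        """Approximate data rate of a compressed video stream in kbit/s."""
        raw_bps = width * height * bits_per_pixel * fps   # uncompressed bit rate
        return raw_bps / compression_ratio / 1000          # after compression

    link_kbps = 10_000   # assumed available bandwidth on the network, in kbit/s
    rate = required_kbps(240, 180, 15)
    print(f"{rate:.0f} kbit/s needed, {link_kbps} kbit/s available:",
          "OK" if rate <= link_kbps else "playback will be choppy")

A 240x180 clip at 15 frames per second with roughly 50:1 compression needs on the order of 300 kbit/s, comfortably below a typical Ethernet budget but far above what a dial-up modem can sustain.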

Examples of video streaming technologies include RealMedia, VDOLive and VDOPhone, Vivo, and IBM's VideoCharger MPEG streaming system. It is possible to use the QuickTime Fast Start movie technology to synthesize video streaming. A regular QuickTime movie, saved with a standard compressor (Cinepak, Sorenson, etc.), may be re-saved as a Fast Start movie, which means that the movie clip puts all information about the length of the clip and the type of compression used at the beginning of the file. The QuickTime plugin or viewer (MoviePlayer) can use this information to decide when enough of the file has been received for the user to begin viewing the movie as it downloads with more or less smooth playback. The difference between this technology and other streaming solutions is that the QuickTime movie is, in actuality, being downloaded and cached locally as with a normal file download. Any intra-file browsing must be done with the file stored locally, and the user can only jump back in the movie as it travels to her machine; she may not scan forward, since there is no connection with a streaming server to perform buffering.

PROBLEM DEFINITION: Video motion analysis is the technique used to get information about moving objects from video. Examples of this include gait analysis, sport replays, speed/acceleration calculations and, in the case of team/individual sports, task performance analysis. The motion analysis technique usually involves a digital movie camera and a computer with software allowing frame-by-frame playback of the video.

Analog video is a video signal transferred by an analog signal. An analog color video signal contains the luminance/brightness (Y) and chrominance (C) of an analog television image. When combined into one channel, it is called composite video, as is the case with NTSC, PAL and SECAM, among others. Analog video may also be carried in separate channels, as in two-channel S-video (YC) and multi-channel component video formats. Analog video is used in both consumer and professional television production applications. However, digital video signal formats with higher quality have been adopted, including Serial Digital Interface (SDI), FireWire (IEEE 1394), Digital Visual Interface (DVI) and High-Definition Multimedia Interface (HDMI).

Digital video is a type of digital recording system that works by using a digital rather than an analog video signal. A digital system is a data technology that uses discrete (discontinuous) values.

By contrast, non-digital (or analog) systems use a continuous range of values to represent information. Although digital representations are discrete, the information represented can be either discrete, such as numbers, letters or computer icons, or continuous, such as sounds, images, and other measurements of continuous systems.

Video materials: Video has been the primary concern of the movie and television industry, which has developed detailed and complete procedures and techniques to index, store, edit, retrieve, sequence and present video material. Video materials should be modelled and stored in a similar way for effective retrieval. Shot change detection is the procedure for identifying changes in the scene content of a video sequence so that alternate representations may be derived for the purposes of browsing and retrieval. A shot is defined as a part of the video that results from one continuous recording by a single camera. A scene is composed of a number of shots, while a television broadcast consists of a collection of scenes. The gap between two shots is called a shot boundary. There are four common types of shot boundaries:

A cut: a hard, clear-cut boundary where the content changes completely over a span of two consecutive frames. It is mainly used in live transmissions.

A fade: two kinds of fades are used, the fade-in and the fade-out. The fade-out occurs when the image fades to a black screen or a dot; the fade-in occurs when the image emerges from a black image. Both effects last a few frames.

A dissolve: the simultaneous occurrence of a fade-in and a fade-out. The two effects are layered for a fixed period of time, e.g. 0.5 seconds (12 frames). It is mainly used in live in-studio transmissions.

A wipe: a virtual line goes across the screen, clearing the old scene and displaying the new scene. It also extends over several frames. It is commonly used in films such as Star Wars and in TV shows.

Because of these effects, shot boundary detection is a nontrivial task.

Video Coding

Moving images contain significant temporal redundancy: successive frames are very similar, so video coders add a motion model at the front end of the image encoder to exploit this redundancy.
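Both shot boundary detection and the temporal redundancy exploited by video coders rest on measuring how similar consecutive frames are. A minimal sketch of a hard-cut detector using gray-level histogram differences; the bin count and threshold are assumed values, and real detectors for fades, dissolves and wipes need considerably more than this:

    import numpy as np

    def histogram(frame, bins=64):
        """Normalized gray-level histogram of one frame (2-D uint8 array)."""
        h, _ = np.histogram(frame, bins=bins, range=(0, 255))
        return h / h.sum()

    def detect_cuts(frames, threshold=0.4):
        """Return indices i where a hard cut is suspected between frame i-1 and i."""
        cuts = []
        prev = histogram(frames[0])
        for i in range(1, len(frames)):
            cur = histogram(frames[i])
            # A large jump in the histogram suggests a cut rather than normal motion.
            if np.abs(cur - prev).sum() > threshold:
                cuts.append(i)
            prev = cur
        return cuts

    # Synthetic example: five dark frames followed by five bright frames.
    frames = [np.full((120, 160), 40, np.uint8)] * 5 + [np.full((120, 160), 200, np.uint8)] * 5
    print(detect_cuts(frames))   # -> [5]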

Coding Standards

JPEG (Joint Photographic Experts Group): still image compression
MPEG-1 (Moving Picture Experts Group): video compression for CD storage / Internet
MPEG-2: video compression for digital TV
MPEG-4: general-purpose video compression
H.261, H.263: video compression for video conferencing

Video denoising methods can be divided into:
- Spatial video denoising methods, where image noise reduction is applied to each frame individually.
- Temporal video denoising methods, where noise between frames is reduced. Motion compensation may be used to avoid ghosting artifacts when blending together pixels from several frames.
- Spatial-temporal video denoising methods, which use a combination of spatial and temporal denoising.

OBJECTIVES: When data is transmitted, or indeed handled at all, a certain amount of noise enters into the signal. Noise can have several causes: data transmitted wirelessly, such as by radio, may be received inaccurately, suffer interference from other wireless sources, or pick up background noise from the rest of the universe. Microphones pick up both the intended signal as well as background noise without discriminating between signal and noise, so when audio is encoded digitally, it typically already includes noise. Electric pulses transmitted via wires are typically attenuated by the resistance of the wire, and changed by its capacitance or inductance.

Temperature variations can increase or reduce these effects. While digital transmissions are also degraded, slight variations do not matter, since they are ignored when the signal is received. With an analog signal, variances cannot be distinguished from the signal and so produce a kind of distortion. In a digital signal, similar variances will not matter, as any signal close enough to a particular value will be interpreted as that value. Care must be taken to avoid noise and distortion when connecting digital and analog systems, but even more so when using analog systems.

Video Sources: Although there are many sources of video, all fall into two categories: electronically generated sources and optical sources. Electronic sources include all of the standard test signals, character generators, computers, and background generators. Optically generated video is produced by television cameras, either directly or pre-recorded on videotape. The electronic test signals are used primarily to set up monitoring equipment and are rarely included in program content. They include color bars, gray scale, dots, crosshatch, and multiburst.

Color Bars: While color bars have several formats, all share a sequence of colors and gray scale values that are fixed and can be used as a standard reference in adjusting equipment. The color sequence is always white, yellow, cyan, green, magenta, red, blue, and black, reading left to right. Each color, going from left to right, has ten per cent less luminance than the preceding color. Using this signal it's possible to set a monitor for the correct contrast and brightness, as well as for the correct color saturation and hue, guaranteeing an optimum picture display. Color bars are often used at the beginning of a program to allow setup of tape processing equipment and playback receivers and monitors.

Gray Scale: Gray scale is a five- or ten-step scale plus black used to set up monitors for proper brightness and contrast range.
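The descending luminance of the bar sequence described above can be seen directly from the Rec. 601 luma weights. A small Python sketch that builds a frame of the eight full-amplitude bars and prints the luma of each; the frame size is arbitrary and this is an illustration rather than the exact broadcast bar specification:

    import numpy as np

    # The eight bars, left to right, as full-amplitude RGB triples (0..1).
    BARS = {
        "white": (1, 1, 1), "yellow": (1, 1, 0), "cyan": (0, 1, 1), "green": (0, 1, 0),
        "magenta": (1, 0, 1), "red": (1, 0, 0), "blue": (0, 0, 1), "black": (0, 0, 0),
    }

    def color_bars(height=180, width=240):
        """Build an RGB test frame of eight vertical bars in the standard order."""
        frame = np.zeros((height, width, 3))
        bar_w = width // len(BARS)
        for i, rgb in enumerate(BARS.values()):
            frame[:, i * bar_w:(i + 1) * bar_w] = rgb
        return frame

    # The luma of each bar (Rec. 601 weights) decreases from left to right.
    for name, (r, g, b) in BARS.items():
        print(f"{name:8s} luma = {0.299 * r + 0.587 * g + 0.114 * b:.3f}")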

Dots and Crosshatch: The dots signal is a matrix of white dots on a black background. Crosshatch may be shown as a series of vertical or horizontal lines, or as a combination of both. Dots and crosshatch are used to adjust the electron beams in color monitors so that the red, green, and blue signals overlap properly. Neither signal has any burst or chroma content. Crosshatch can also be used to make sure graphics shown by cameras are straight.

Multiburst: Multiburst is a series of square waves with increasing frequency from left to right. It's used to check the overall frequency response of various pieces of video equipment and is used primarily as an engineering tool. The remaining electronically generated signals are used in the actual production of programs.

Color Black: The most basic is called "color black." This signal has a luminance value of 7.5 IRE and no chroma in the picture. It does have sync and color burst. It's the signal used every time the picture "fades to black."

Color Background: The color background generator provides a solid color background in which the luminance, chroma, and hue can be adjusted. It's normally used as a background for text or graphics displayed on the screen. It's possible to adjust the background generator to create color fields that exceed the technical specifications for video. This results in improper operation of home receivers and should be avoided if at all possible.

CHAPTER COLOR STANDARDS AND FORMATS

This chapter starts with a description of the color specifications in various video standards, including color primaries and color coding standards. While the focus of this research is color and contrast enhancement in video (and also still images), various other processes not directly related to color or contrast significantly affect the overall picture quality. Thus, this thesis includes a review of various processes typical of a video processing chain in consumer video applications. These processes are independent of any specific display technology, and so have been classified as display-independent video processing. Various modern display technologies necessitate additional color and image processing, which is essentially display-dependent processing. To appreciate the state of the art, the working principles of various modern display devices, as well as special video processing techniques employed in some of these devices, have been briefly reviewed along with appropriate references for a detailed discussion. The concluding section takes a fresh look at the way color is handled in video processing, and how color science can be used to improve color quality in consumer video applications.

Color Specifications in Video Standards: Three potential areas were identified for colorimetric improvement in television systems: i) defining color characteristics, ii) extending color gamuts to take advantage of the newer display technologies, and iii) inclusion of constant luminance operation, which ensures conveying luminance information to the fullest extent in a television system. Accordingly, color specifications in video standards can be classified into three parts, namely, specification of color primaries and the white point, specification of the Opto-Electronic Transfer Functions (OETFs), and specification of color coding for the compression and transmission of color information.

Before describing these color specifications, it is important to differentiate between standard-definition television (SDTV) and high-definition television (HDTV). This classification is based on resolution and scanning format. There are two main scanning formats in SDTV: formats with 480 active picture lines with interlacing and a frame rate of 29.97 Hz (denoted as 480i or 480i29.97), and 576 active picture lines with interlacing and a frame rate of 25 Hz (denoted as 576i or 576i25).

480i systems with 4:3 aspect ratio (the ratio of width to height of the displayed image) can have a resolution of 640x480, 704x480 or 720x480. 576i systems with 4:3 aspect ratio can have a resolution of 768x576, 720x576 or 948x576. The widescreen 16:9 format supports resolutions of 720x483 and 720x576. On the other hand, HDTV has higher resolution, typically 0.75 million pixels or more. The most common HDTV formats are 1280x720 with progressive scanning at a frame rate of 60 Hz (denoted as 720p or 720p60) and 1920x1080 with progressive or interlaced scanning at a frame rate of 24 Hz (progressive only) or 30 Hz (1080p24, 1080p30 or 1080i30). 1080p60 and 1080p120 formats are also possible. The aspect ratio in HDTV is 16:9.

Color Primaries: The set of colorants used for a particular coloration process is referred to as a primary set. Except for some state-of-the-art technologies, most displays use three primaries, namely red, green and blue. All video standards describe color primaries in terms of the chromaticities of RGB and the white point (an achromatic color with the highest luminance achievable in a given system), and thereby the color gamut of any device that complies with a given standard. Color gamut describes the range of colors produced by a coloration system, including displays. Color primaries defined in various widely known standards are discussed below.

CIE: The CIE color primaries were defined for the CIE 1931 standard observer. CIE Illuminant B was defined as the white point. The CIE primaries are no longer used in video coding or reproduction.

NTSC: In 1958, the National Television System Committee (NTSC) standardized color primaries, primarily to be used for color Cathode-Ray Tube (CRT) displays. The white point has the chromaticities of CIE Illuminant C. The NTSC primaries were chosen such that the largest color gamut could be achieved with the commercially available phosphors for CRT monitors. These primaries and the white point are no longer used in displays, but the NTSC specification is still used as the industry benchmark. Compared to the NTSC-compliant displays of the past, modern CRTs have brighter and more efficient phosphors, even though NTSC displays produced more saturated red and yellow.

EBU Tech 3213: In 1975, the European Broadcasting Union (EBU) published a standard for chromaticity tolerances for studio monitors conforming to 576i SDTV systems (standard-definition televisions with 576 active picture lines and interlaced scanning), known as EBU Tech 3213. D65 was used as the white point.

SMPTE RP 145: The Society of Motion Picture and Television Engineers (SMPTE) set color standards for 480i SDTV systems (standard-definition televisions with 480 active picture lines and interlaced scanning) and early 1035i30 HDTV systems (high-definition televisions with 1035 active picture lines, a frame rate of 30 Hz and interlaced scanning). This standard also uses D65 as the white point.

ITU-R BT 709/sRGB: In 1990, the International Telecommunication Union's Radiocommunication Sector (ITU-R) recommended standard primaries for high-definition television (HDTV), formally known as ITU-R BT 709, or simply Rec. 709. The Rec. 709 primaries are incorporated into the sRGB specification widely used in the computing and computer graphics community, but sRGB uses a D50 white point, while Rec. 709 uses D65. These primaries are the most widely used color primaries for studio video and modern display systems. Note that displays using Rec. 709 primaries have a color gamut that is 71% of the standard NTSC color gamut obtained from conventional CRT displays.

Adobe RGB: These primaries were designed to provide a large color gamut with RGB as the working space; they were based on the SMPTE-240M standard and later renamed Adobe RGB 98. These primaries have been adopted in some of the modern wide-gamut CRT and LCD displays.

Table: Chromaticity coordinates (x, y) of the red, green and blue primaries and of the white point (illuminant) for the CIE, NTSC, EBU Tech 3213, SMPTE RP 145, ITU-R BT 709, sRGB and Adobe RGB standards.

Fig.: The Adobe RGB color gamut.

2.3. Color Coding Standards

This section outlines various color coding standards followed in the industry, details of which are beyond the scope of this thesis but are available in the literature. Signal coding in video systems involves three steps, as described below:

Step 1 - Gamma correction: A nonlinear transfer function is applied to each of the linear R, G and B signals. This function, often called gamma correction, is comparable to a square root and takes care of the nonlinearity in the conventional CRT display. Gamma correction results in nonlinear signals denoted as R', G' and B'. Since the human visual system is sensitive to luminance changes over a wide range of luminance values, nonlinear image coding needs to be used to achieve perceptual uniformity. The nonlinear transfer function gamma is used to approximate our lightness perception. Note that in encoding, gamma correction is applied before converting RGB to an opponent-based color space. This is important from an engineering standpoint, to reduce computational complexity in the decoding stage.

Step 2 - Formation of luma and chroma signals: From the nonlinear signals R', G' and B', the luma component Y' and the color difference components (B' - Y') and (R' - Y') are formed.
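A compact sketch of Steps 1 and 2 for a single pixel. The pure power-law gamma below is an approximation of the piecewise transfer functions actually specified in the standards, and the Rec. 601 luma weights are used:

    def gamma_correct(c, gamma=1 / 2.2):
        """Step 1: power-law transfer function applied to a linear component in [0, 1].
        (Rec. 601/709 actually use a piecewise curve with a small linear segment
        near black; a pure power law keeps this sketch short.)"""
        return c ** gamma

    def luma_and_color_differences(r, g, b):
        """Step 2: form luma Y' and the color-difference components from R'G'B'."""
        rp, gp, bp = gamma_correct(r), gamma_correct(g), gamma_correct(b)
        y = 0.299 * rp + 0.587 * gp + 0.114 * bp   # Rec. 601 luma
        return y, bp - y, rp - y                   # Y', (B' - Y'), (R' - Y')

    print(luma_and_color_differences(0.2, 0.5, 0.8))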

Note that the term luma is used to differentiate this component from luminance. Y' computations for SDTV (as per Rec. 601) and HDTV (as per Rec. 709) are as follows:

Rec. 601: Y' = 0.299 R' + 0.587 G' + 0.114 B'
Rec. 709: Y' = 0.2126 R' + 0.7152 G' + 0.0722 B'

In component digital video, MPEG and Motion-JPEG, the color difference components are scaled to form Cb and Cr respectively.

Step 3 - Chroma subsampling: The color difference components are subsampled. Chroma subsampling is the process of reducing the data capacity needed to transmit color information, while maintaining full luma information. This takes advantage of the relatively low sensitivity of our visual system to color differences compared to luminance. Chroma subsampling does not typically result in a perceptual loss of chromatic detail in video, but it is the key source of artifacts resulting from color processing in video. Different schemes are available for chroma subsampling. If we consider a 2x2 pixel array of R'G'B' components, converting the nonlinear RGB to Y'CbCr will result in 12 bytes of data in 8-bit systems. This is denoted as 4:4:4 Y'CbCr. In the 4:2:2 sampling included in Rec. 601 for studio digital video, the color difference components are subsampled horizontally by a factor of 2, with Cb and Cr being coincident with even-numbered Y' samples. This consumes 8 bytes instead of 12. In the 4:1:1 scheme, Cb and Cr are subsampled horizontally by a factor of 4, and coincide with every fourth Y' sample. This scheme requires only 6 bytes. In the 4:2:0 sampling scheme used in JPEG, H.261, MPEG-1, MPEG-2, etc., Cb and Cr are subsampled both horizontally and vertically, each by a factor of 2. Thus, there is one set of Cb and Cr components for the four Y' samples. The number of bytes used in this case is also 6.

SDTV color coding: Studio applications require that the full 8-bit range (0-255) not be used for luma scaling, so as to leave headroom and footroom to accommodate higher outputs resulting from filter operations and misadjusted equipment. In an 8-bit system, offsets of 16 and 128 are added to the luma and chroma signals respectively. Luma reference levels are 16 and 235, and chroma reference levels are 16 and 240. The digital values 0 and 255 are used in the video data only for synchronization purposes. The following equations are used for computing 8-bit Rec. 601 Y'CbCr from gamma-corrected digital counts normalized between 0 and 1:

Y' = 16 + 65.481 R' + 128.553 G' + 24.966 B'
Cb = 128 - 37.797 R' - 74.203 G' + 112.000 B'
Cr = 128 + 112.000 R' - 93.786 G' - 18.214 B'
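A sketch of this 8-bit Rec. 601 conversion followed by the 4:2:0 decimation described in Step 3. The inputs are assumed to be gamma-corrected R'G'B' planes normalized to [0, 1], and the simple 2x2 picking used for Cb/Cr stands in for the filtered subsampling used in practice:

    import numpy as np

    def ycbcr_601_8bit(rp, gp, bp):
        """8-bit Rec. 601 Y'CbCr from gamma-corrected R'G'B' arrays in [0, 1]."""
        y  =  16 +  65.481 * rp + 128.553 * gp +  24.966 * bp
        cb = 128 -  37.797 * rp -  74.203 * gp + 112.000 * bp
        cr = 128 + 112.000 * rp -  93.786 * gp -  18.214 * bp
        return (np.round(y).astype(np.uint8),
                np.round(cb).astype(np.uint8),
                np.round(cr).astype(np.uint8))

    rp, gp, bp = (np.random.rand(2, 2) for _ in range(3))
    y, cb, cr = ycbcr_601_8bit(rp, gp, bp)

    # 4:2:0 subsampling: full-resolution Y', one Cb and one Cr per 2x2 block.
    cb420, cr420 = cb[::2, ::2], cr[::2, ::2]
    print("bytes per 2x2 block:", y.size + cb420.size + cr420.size)   # 4 + 1 + 1 = 6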

HDTV color coding: As mentioned before, the ITU-R Rec. BT.709 standard is the most commonly used standard for HDTV. Rec. 709 Y'CbCr can be computed from R'G'B' using the following equations:

Y' = 16 + 46.559 R' + 156.629 G' + 15.812 B'
Cb = 128 - 25.664 R' - 86.336 G' + 112.000 B'
Cr = 128 + 112.000 R' - 101.730 G' - 10.270 B'

NTSC and PAL color coding: NTSC (acronym for National Television System Committee) and PAL (acronym for Phase Alternating Line) coding are also known as composite coding, where quadrature modulation is applied to combine the two color difference components into a modulated chroma signal, which is then added to the luma signal through frequency interleaving to form a composite signal. Composite decoding breaks the composite signal into its constituent luma and chroma signals. The composite signal provided the backward compatibility required when color broadcasting was introduced in the early sixties: black-and-white receivers could still display color broadcasts, and color receivers could display black-and-white broadcasts. However, these coding schemes are not generally used anymore, due to the resulting artifacts and because of the availability of adequate bandwidth to carry component signals as in Rec. 601 and Rec. 709.
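A conceptual sketch of the quadrature modulation used in composite coding: the two scaled color-difference signals modulate the color subcarrier in quadrature, and the result is added to the luma signal. The subcarrier frequency is the approximate NTSC value, the sample rate and the toy signals are assumptions, and a real encoder also band-limits the chroma before modulation:

    import numpy as np

    fsc = 3_579_545.0            # NTSC color subcarrier frequency in Hz (approx.)
    fs = 4 * fsc                 # assumed sampling rate for this sketch
    t = np.arange(1000) / fs     # a short stretch of one scan line

    # Toy luma and scaled color-difference signals along the line.
    y = 0.5 + 0.2 * np.sin(2 * np.pi * 15e3 * t)
    u = 0.1 * np.ones_like(t)    # scaled (B' - Y')
    v = -0.05 * np.ones_like(t)  # scaled (R' - Y')

    # Quadrature amplitude modulation of the two chroma components, added to luma.
    chroma = u * np.sin(2 * np.pi * fsc * t) + v * np.cos(2 * np.pi * fsc * t)
    composite = y + chroma
    print(composite[:5])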

Display-Independent Video Processing: The components of a typical video processing pipeline in consumer video applications are shown in the figure below. The encoded signal is transmitted from a broadcasting station and is received by the signal receiver, which then passes it to the decoding module. The video stream then passes through various post-processing routines for artifact removal, format conversion and enhancement. Next, a color space transformation is applied in case the source and display formats use different primaries. A gamma correction ensures correct tone reproduction on the display, while quantization is required to obtain discrete digitized values for each display channel signal. The processed video is then ready to be displayed on a designated display device.

Fig.: A typical video processing pipeline in consumer video systems (video source format -> decode -> artifact removal: coding artifact removal, noise reduction -> spatio-temporal format conversion: spatial scaling, de-interlacing, frame-rate conversion -> enhancement: sharpness, contrast, color -> color space conversion -> gamma correction (linearization) -> quantization -> display format).

2.4. RGB analog component video

The various RGB (red, green, blue) analog component video standards (e.g., RGBS, RGBHV, RGsB) use no compression and impose no real limit on color depth or resolution, but they require large bandwidth to carry the signal and contain much redundant data, since each channel typically includes the same black-and-white image. Most modern computers offer this signal via the VGA port. Many televisions, especially in Europe, utilize RGB via the SCART connector. All arcade games, excepting early vector and black-and-white games, use RGB monitors. Analog RGB is slowly falling out of favour as computers obtain better clarity using DisplayPort or Digital Visual Interface (DVI) digital connections, while home theatre moves towards High-Definition Multimedia Interface (HDMI). Analog RGB has been largely ignored, despite its quality and suitability, as it cannot easily be made to support digital rights management. RGB was never popular in North America for consumer electronics, as S-Video was considered sufficient for consumer use, although RGB was used extensively in commercial, professional and high-end installations.

RGB requires an additional signal for synchronizing the video display. Several methods are used:
- composite sync, where the horizontal and vertical signals are mixed together on a separate wire (the S in RGBS)
- separate sync, where the horizontal and vertical signals are each on their own wire (the H and V in RGBHV)

- sync on green, where a composite sync signal is overlaid on the green wire (SoG or RGsB)
- sync on red or sync on blue, where a composite sync signal is overlaid on either the red or the blue wire

Composite sync is common in the European SCART connection scheme (using pin 17 [gnd] and pin 19 [out] or 20 [in]). Sometimes a full composite video signal may also serve as the sync signal, though often computer monitors will be unable to handle the extra video data. A full composite-sync video signal requires four wires: red, green, blue, and sync. If separate cables are used, the sync cable is usually colored white (or yellow, as is the standard for composite video).

Separate sync is most common with VGA, used worldwide for analog computer monitors. This is sometimes known as RGBHV, as the horizontal and vertical synchronization pulses are sent in separate channels. This mode requires five conductors. If separate cables are used, the sync lines are usually yellow (H) and white (V), or yellow (H) and black (V), or gray (H) and black (V).

Sync on Green (SoG) is less common, and while some VGA monitors support it, most do not. Sony is a big proponent of SoG, and most of their monitors (and their PlayStation 2 video game console) use it. Like devices that use composite video or S-video, SoG devices require additional circuitry to remove the sync signal from the green line. A monitor that is not equipped to handle SoG will display an image with an extreme green tint, if any image at all, when given an SoG input. Sync on red and sync on blue are even rarer than sync on green, and are typically used only in certain specialized equipment.

YPbPr analog component video: Further types of component analog video signals do not use R, G, and B components but rather a colorless component, termed luma, which provides brightness information (as in black-and-white video). This combines with one or more color-carrying components, termed chroma, that give only color information. Both the S-Video component video output (two separate signals) and the YPbPr component video output (three separate signals) seen on DVD players are examples of this method.

Converting video into luma and chroma allows for chroma subsampling, a method used by JPEG images and DVD players to reduce the storage requirements for images and video. The YPbPr scheme is usually what is meant when people talk of component video today. Many consumer DVD players, high-definition displays, video projectors and other video devices use this form of color coding. These connections are commonly and mistakenly labeled with terms like "YUV", "Y/R-Y/B-Y" and "Y, B-Y, R-Y". This is inaccurate, since YUV, YPbPr, and Y, B-Y, R-Y differ in their scale factors. When used for connecting a video source to a video display where both support 4:3 and 16:9 display formats, the PAL television standard provides for signaling pulses that will automatically switch the display from one format to the other.

Color Format: The color format is considered one of the main factors in the processing of an image. Grey scale and color representations play a major role in determining the resolution of an image. Color formats are classified into various models such as RGB, CMY & CMYK, YIQ, HSB or HSI, and YCbCr.

Color model    Description                                      Usage
RGB            Primary colors: Red, Blue, Green
CMY & CMYK     Secondary colors: Cyan, Magenta, Yellow
YIQ            Luminance and chrominance                        Used in NTSC color TV
HSB or HSI     Hue, saturation, and brightness or intensity     Color space values vary from 0 to 1.0
YCbCr          Luminance and chrominance                        Used in digital images

Table: Description of the color models

Images represented in the RGB color model consist of three component images, one for each primary color. The number of bits used to represent each pixel in RGB space is called the pixel depth. Consider an RGB image in which each of the red, green, and blue images is an 8-bit image. Under these conditions each RGB color pixel is said to have a depth of 24 bits.

The term full-color image is often used to denote a 24-bit RGB color image. The total number of colors in a 24-bit RGB image is (2^8)^3 = 16,777,216. Cyan, magenta, and yellow are the secondary colors of light or, alternatively, the primary colors of pigments. RGB values can be obtained easily from a set of CMY values by subtracting the individual CMY values from 1. The HSB or HSI (hue, saturation, and brightness or intensity) color model decouples the brightness or intensity component from the color-carrying information in a color image. In the YIQ (NTSC) color space, image data consists of three components: luminance (Y), hue (I), and saturation (Q). The first component, luminance, represents gray-scale information, while the last two components make up chrominance (color information). The YCbCr color space is widely used for digital video. In this format, luminance information is stored as a single component (Y), and chrominance information is stored as two color-difference components (Cb and Cr). Cb represents the difference between the blue component and a reference value. Cr represents the difference between the red component and a reference value. (YUV, another color space widely used for digital video, is very similar to YCbCr but not identical.)

Color Space Conversion: The Color Space Conversion block converts color information between color spaces. Use the Conversion parameter to specify the color spaces we are converting between. The choices are R'G'B' to Y'CbCr, Y'CbCr to R'G'B', R'G'B' to intensity, R'G'B' to HSV, HSV to R'G'B', sR'G'B' to XYZ, XYZ to sR'G'B', sR'G'B' to L*a*b*, and L*a*b* to sR'G'B'. The ports and supported data types are listed below; complex values are not supported on any port.

- Input/Output: M-by-N-by-P color image signal, where P is the number of color planes (double-precision floating point, single-precision floating point, or 8-bit unsigned integer)
- R, G, B: matrices that each represent one plane of the input RGB image stream (double, single, or 8-bit unsigned integer)
- Y: matrix that represents the luma portion of an image (double, single, or 8-bit unsigned integer)
- Cb, Cr: matrices that represent the two chrominance components of an image (double, single, or 8-bit unsigned integer)
- I: matrix of intensity values (double, single, or 8-bit unsigned integer)
- H, S, V: matrices that represent the hue, saturation and value (brightness) components of an image (double or single)
- X, Y, Z: matrices that represent the X, Y and Z components of an image (double or single)
- L*, a*, b*: matrices that represent the luminance (L*), a* and b* components of an image (double or single)

Table: Device-independent color spaces

The data type of the output signal is the same as the data type of the input signal. Use the Image signal parameter to specify how to input and output a color image signal. If we select one multidimensional signal, the block accepts an M-by-N-by-P color image signal, where P is the number of color planes, at one port. If we select Separate color signals, additional ports appear on the block. Each port accepts one M-by-N plane of an RGB image stream.
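For comparison, the kind of conversion the block performs can be sketched on an M-by-N-by-3 array in a few lines of Python. This is only an illustrative analogue of two of the conversions listed above (R'G'B' to intensity and R'G'B' to HSV), not the block's implementation, and the luma weights used are the Rec. 601 values:

    import numpy as np
    from matplotlib.colors import rgb_to_hsv

    rgb = np.random.rand(120, 160, 3)                  # M-by-N-by-P image, P = 3, values in [0, 1]

    intensity = rgb @ np.array([0.299, 0.587, 0.114])  # R'G'B' to intensity (single plane)
    hsv = rgb_to_hsv(rgb)                              # R'G'B' to HSV, same M-by-N-by-3 shape

    print(intensity.shape, hsv.shape)                  # (120, 160) (120, 160, 3)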

CHAPTER VIDEO ANALYSIS AND STANDARDS

All popular video formats, including AVI, MPEG, MP4, VOB, DVD, MOV, WMV and ASF, are supported. Input formats: AVI, MPEG, MP4, VOB, DVD, MOV, WMV, ASF. Output formats: AVI, MPEG, WMV.

Video Filters and Effects

Filters: The software includes the most popular video filters used for improving the quality of video files. Every filter is intended to solve one of the problems of video files, such as interlaced video, a picture that is out of balance or blurred, insufficient brightness and contrast, and others. Sometimes it makes sense to use several filters at the same time when enhancing your videos.

Magic Enhance: The Magic Enhance filter automatically removes any color casts by calculating the best balance of colors in the frame. This filter is real magic: it fixes almost anything wrong and removes any errors. It not only repairs color balance, but also brightness, contrast and levels. Try it and see the power of Magic Enhance!

Auto Contrast: The Auto Contrast filter calculates the best contrast in the frame automatically. It enhances highlights and shadows, making the dark parts of the frames darker and the light ones lighter. It adjusts the entire movie using histograms of each individual frame. Auto Contrast can drastically improve the appearance of the movie, especially if it contains many dark frames.

Auto Saturation: Saturation, sometimes called chroma, is the strength or purity of the color. The Auto Saturation filter automatically adjusts the saturation of the entire image.

Auto White Balance: The Auto White Balance filter automatically calculates the best white balance values according to the lighting conditions in the video. It fixes the error where white appears gray or blue.

De-interlace: The De-interlace filter removes the interlacing effect in the movie. Many digital camcorders divide each frame into a set of even and odd horizontal rows: the even rows are recorded first, and then the odd rows. So each frame you see consists of two consecutive pictures. This mixing is called interlacing. It allows compressing video without using digital compression methods.

But if you record a fast-moving object (a sports competition, racing, pets, children, etc.), it can be at one place in the "even" picture and at another in the "odd" one. As a result, you get a "striped" picture. The De-interlace filter transforms the mixed frames into normal video frames and removes the interlaced horizontal rows.

Brightness/Contrast: The Brightness filter provides advanced options for adjusting brightness and contrast. The filter allows manual enhancement of highlights and shadows, making them darker or lighter. Use this filter to repair very dark videos.

Blur: The Blur filter smoothes the edges of low-bandwidth movies. It also eliminates the excessive high-frequency noise that appears when using digital zoom or after enlarging small-size clips.

Sharpen: The Sharpen filter focuses blurry frames by increasing the contrast of adjacent pixels. It makes the image more detailed; however, it may increase the amount of noise.

Color Temperature: The Color Temperature filter "warms up" or "cools down" the colors in an image. Videos shot in incandescent light often appear too warm, and videos shot outdoors in sunlight may have a blue hue. Use this filter to fix both. Values can range from -100 (very cold colors) to +100 (very warm colors).

Hue/Saturation: The Hue/Saturation filter is intended for adjusting the hue and saturation of a video clip. It is helpful, for example, when a video clip was captured with too little saturation and there is a need to increase it. The Hue control lets you adjust the hue of the entire image. Adjusting the hue, or color, represents a move around the color wheel. Hue is the color reflected from or transmitted through an object, measured as a location on the standard color wheel. A positive value indicates clockwise rotation; a negative value, counterclockwise rotation. Values can range from -180 to +180. Saturation, sometimes called chroma, is the strength or purity of the color. Saturation represents the amount of gray in proportion to the hue. Values can range from -100 (percentage of desaturation, duller colors) to +100 (percentage of saturation increase). On the standard color wheel, saturation increases from the center to the edge. The Saturation control lets you adjust the saturation of the entire image.

Gamma: Gamma is the level of brightness and contrast of the midtone values. The Gamma filter is used to correct gamma. It can greatly increase or decrease the brightness and contrast of your video.
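A sketch of what a per-frame auto-contrast and a gamma adjustment amount to on an 8-bit frame. The percentile clipping points and the gamma value are assumed example choices; the filters in an actual editor use their own heuristics:

    import numpy as np

    def auto_contrast(frame, low_pct=1, high_pct=99):
        """Stretch the frame's histogram so the chosen percentiles map to 0 and 255."""
        lo, hi = np.percentile(frame, [low_pct, high_pct])
        stretched = (frame.astype(float) - lo) / max(hi - lo, 1e-6)
        return np.clip(stretched * 255, 0, 255).astype(np.uint8)

    def adjust_gamma(frame, gamma=0.8):
        """Brighten (gamma < 1) or darken (gamma > 1) the midtones of an 8-bit frame."""
        return (255 * (frame / 255.0) ** gamma).astype(np.uint8)

    frame = np.random.randint(40, 180, (120, 160), dtype=np.uint8)  # a dull, low-contrast frame
    print(auto_contrast(frame).min(), auto_contrast(frame).max())   # spread towards 0 and 255
    print(adjust_gamma(frame).mean() > frame.mean())                # gamma 0.8 brightens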

Color Balance: The Color Balance filter allows you to adjust the overall mixture of the colors in your video.

Chroma Balance: The Chroma Balance filter is one more filter for correcting colors in your video. Chroma is another word for saturation. Unlike the Hue/Saturation filter, which adjusts the overall saturation level, Chroma Balance allows more precise corrections that depend on the colors you want to add to or reduce in your video.

De-noise: The De-noise filter detects noise in the video and removes it from the picture. It greatly increases the overall quality of the video. Another very important function of this filter is removing noise that may appear after applying some video filters. For instance, the Brightness filter helps remove dark areas but may cause some noise (especially with settings close to maximum). The De-noise filter solves this problem by removing such noise. For this reason, we recommend putting the De-noise filter at the bottom of the list of filters.

De-blocking: The De-blocking filter removes the block-like artifacts from low-quality, highly compressed videos or videos that were ripped from DVD or decompressed. It greatly increases the overall quality of the video.

Mosaic: The Mosaic effect creates square blocks of a specified size. The blocks represent the colors of pixels in the video.

Add Noise: The Add Noise effect applies random pixels to the video, creating the effect of low-quality shooting. It is an interesting effect for simulating old movies (especially combined with the Grayscale effect).

Posterize: The Posterize effect lets you adjust the number of brightness levels in an image. It creates an interesting special effect of large flat areas.

Diffuse: The Diffuse effect makes the video look less focused. It moves pixels randomly according to the specified settings.

Grayscale: The Grayscale effect converts the original image to grayscale (black and white). It is an interesting effect for simulating old movies (especially combined with the Add Noise effect).

Invert: The Invert effect inverts the colors in the video, making it look like a negative tape.
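The Invert and Posterize effects described above are simple per-pixel operations. A minimal sketch on an 8-bit frame, where the number of levels is an example value:

    import numpy as np

    def invert(frame):
        """Invert an 8-bit frame, producing a negative image."""
        return 255 - frame

    def posterize(frame, levels=4):
        """Reduce the number of brightness levels, creating flat, banded areas."""
        step = 256 // levels
        return (frame // step) * step

    frame = np.arange(256, dtype=np.uint8).reshape(16, 16)
    print(np.unique(posterize(frame)))   # only `levels` distinct values remain
    print(invert(frame)[0, 0])           # 255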

3.1. Videotape Formats

Videotape recorders are available for analog and digital recording in various formats. They are further classified by performance level or market: consumer, professional, and broadcast. In addition, during postproduction (editing, including the addition of graphics), the original footage can be transferred to digital media; digital videotape formats are available for composite and component video formats. There are no official standards for videotape classifications.

Table 3-1: Tape formats and video formats
- Analog, consumer: VHS cassette (composite); S-VHS (YC, composite); S-Video (YC-358); Beta (composite); 8 mm (composite); Hi-8mm (YC, composite)
- Analog, professional: U-Matic (SP) cassette, 3/4-inch (composite); S-Video (YC-358); Hi-8mm (YC)
- Analog, broadcast: Type C reel-to-reel, 1-inch (composite); Type B (Europe) (composite); Betacam (component); Betacam SP (YUV, YIQ, composite); MII (YUV, YIQ, composite)
- Digital, postproduction: D1 525 (YUV); D1 625 (YUV); D2 525 (NTSC); D2 625 (PAL)

Table 3-1 summarizes the formats. Although the VL and other software for Silicon Graphics video options do not distinguish between videotape formats, you need to know what kind of connector your video equipment uses. For example, the Galileo board has composite and S-Video connectors.

Most home VCRs use composite connectors. S-Video, on the other hand, carries the color and brightness components of the picture on separate wires; hence, S-Video connectors are also called Y/C connectors. Most S-VHS and Hi-8mm VCRs feature S-Video connectors.

Video formats are confusing because most video files have at least two different types: the container, and the codec(s) used inside that container. The container describes the structure of the file: where the various pieces are stored, how they are interleaved, and which codecs are used by which pieces. It may specify an audio codec as well as video. A codec ("coder/decoder") is a way of encoding audio or video into a stream of bytes. To make life even more confusing, some names, such as "mpeg-4", describe both a codec and a container, so it's not always clear from context which is being used. You could have a movie encoded with an mpeg-4 codec inside an avi container, for example, or a movie encoded with the Sorenson codec inside an mpeg-4 container.

The Linux file program is a fast way to find out the container format of a video file. You can use the mencoder program (part of the MPlayer package) to tell you the container and video codec of a file (you'll have to wade through a lot of other output). For mpeg files, you can find out the audio codec with mpginfo, part of the mpgtx package. For other formats, try:

    mplayer -identify -frames 0 filename | grep ID_

3.2. Common Container Formats

AVI (.avi): Most commonly contains M-JPEG (especially from digital cameras?) or DivX (for whole movies), but can contain nearly any format (not Sorenson). Sometimes you'll see a reference to the "fourcc": this is a four-character code (such as "divx" or "mjpg") inside the AVI container which specifies which video codec is being used.

Quicktime: Most often used for the locked Apple Sorenson codec, or for Cinepak (free), but can also hold other codecs such as mjpeg, etc.

WMV (.wmv): More or less MPEG-4; can contain nearly any codec, including several Microsoft spinoffs of MPEG-4 which vary in their freedom and licensing requirements.

ASF ("Advanced Streaming Format", .asf): a subset of wmv, intended primarily for streaming; an early Microsoft implementation of an MPEG-4 codec.

3.3. Common Codecs

MPEG ("Moving Picture Experts Group"): three video formats, MPEG-1, 2, and 4.

MPEG-1: Old, supported by everything (at least up to 352x240), reasonably efficient. A good format for the web.

MPEG-2: A souped-up version of MPEG-1, with better compression, at 720x480. Used in HDTV, DVD, and SVCD.

MPEG-4: A family of codecs, some of which are open, others Microsoft proprietary. MPEG spinoffs: mp3 (for music) and VideoCD.

MJPEG ("Motion JPEG"): A codec consisting of a stream of JPEG images. Common in video from digital cameras, and a reasonable format for editing videos, but it doesn't compress well, so it's not good for web distribution.

DV ("Digital Video"): Usually used for video grabbed via FireWire off a video camera. Fixed at 29.97 FPS or 25 FPS. Not very highly compressed.

WMV ("Windows Media Video"): A collection of Microsoft proprietary video codecs. Since version 7, it has used a special version of MPEG-4.

RM ("Real Media"): a closed codec developed by Real Networks for streaming video and audio.

DivX: in early versions, essentially an ASF (incomplete early MPEG-4) codec inside an AVI container; DivX 4 and later are a fuller MPEG-4 codec. No resolution limit. Requires more horsepower to play than MPEG-1, but less than MPEG-2. Mac and Windows players are hard to find.

Sorenson 3: Apple's proprietary codec, commonly used for distributing movie trailers (inside a Quicktime container).

Quicktime 6: Apple's implementation of an MPEG-4 codec.

RP9: a very efficient proprietary streaming codec from Real (not MPEG-4).

WMV9: a proprietary, non-MPEG-4 codec from Microsoft.

Ogg Theora: A relatively new open format from Xiph.org.

Dirac: A very new open format under development by the BBC.

MPEG

MPEG stands for Moving Picture Experts Group, a committee that sets international standards for the digital encoding of movies and sound. There are several audio/video formats which bear this group's name.

In addition to their popularity on the Internet, several MPEG formats are used with different kinds of A/V gear:

MPEG1. This format is often used in digital cameras and camcorders to capture small, easily transferable video clips. It's also the compression format used to create Video CDs, and it is commonly used for posting clips on the Internet. The well-known MP3 audio format (see definition below) is part of the MPEG1 codec.

MPEG2. Commercially produced DVD movies, home-recorded DVD discs, and most digital satellite TV broadcasts employ MPEG2 video compression to deliver their high-quality picture. MPEG2 is also the form of lossy compression used by TiVo-based hard disk video recorders. It can rival the DV format when it comes to picture quality. Because MPEG2 is a "heavier" form of compression that removes a larger portion of the original video signal than DV, however, it's more difficult to edit with precision. The MPEG2 codec allows for selectable amounts of compression to be applied, which is how home DVD recorders and hard disk video recorders can offer a range of recording speeds. MPEG2 is considered a container format.

MPEG4. A flexible MPEG container format used for both streaming and downloadable Web content. It's the video format employed by a growing number of camcorders and cameras.

MP3 (MPEG1, Audio Layer 3): The most popular codec for storing and transferring music. Though it employs a lossy compression system which removes frequencies judged to be essentially inaudible, MP3 still manages to deliver near-CD sound quality in a file that's only about a tenth or a twelfth the size of a corresponding uncompressed WAV file. When creating an MP3 file, you can select varying amounts of compression depending on the desired file size and sound quality. For more info, see our article on the MP3 format.

Mp3pro: An updated version of the original MP3 codec.

Small, low-bitrate mp3pro files contain much more high-frequency detail than standard MP3 files encoded at similar low bitrates. The high-frequency portion of the audio signal is handled by an advanced and extremely efficient coding process known as Spectral Band Replication (SBR), while the rest of the signal is encoded as a regular MP3. That means that when you play an mp3pro file on non-mp3pro-compatible software, you'll only hear the non-SBR-encoded portions (so you'll lose the highs altogether). However, when encoded and played back using a fully compatible audio program, such as Windows Media Player, mp3pro files can deliver very good sound quality at low bitrates.

YUV video format: YUV is a color space typically used as part of a color image pipeline. It encodes a color image or video taking human perception into account, allowing reduced bandwidth for the chrominance components and thereby typically enabling transmission errors or compression artifacts to be masked more efficiently by human perception than with a direct RGB representation. Other color spaces have similar properties, and the main reason to implement or investigate the properties of Y'UV would be for interfacing with analog or digital television or photographic equipment that conforms to certain Y'UV standards.

More about the YUV video format: The scope of the terms Y'UV, YUV, YCbCr, YPbPr, etc., is sometimes ambiguous and overlapping. Historically, the terms YUV and Y'UV were used for a specific analog encoding of color information in television systems, while YCbCr was used for digital encoding of color information suited to video and still-image compression and transmission such as MPEG and JPEG. Today, the term YUV video format is commonly used in the computer industry to describe file formats that are encoded using YCbCr.

The Y'UV model defines a color space in terms of one luma (Y') and two chrominance (UV) components. The Y'UV color model is used in the NTSC, PAL, and SECAM composite color video standards. Previous black-and-white systems used only luma (Y') information. Color information (U and V) was added separately via a sub-carrier so that a black-and-white receiver would still be able to receive and display a color picture transmission in the receiver's native black-and-white format.

3.7. Advantages of the YUV video format

The primary advantage of the YUV video format is that it remains compatible with black-and-white analog television. The Y signal is essentially the same signal that would be broadcast from a normal black-and-white camera (with some subtle changes), and the U and V signals can simply be ignored. When used in a color setting, the subtraction process is reversed, yielding the original RGB color space.

Another advantage is that the signal in YUV format can easily be manipulated to deliberately discard some information in order to reduce bandwidth. The human eye actually has fairly low color resolution: the high-resolution color images we see are produced by the visual system combining a high-resolution black-and-white image with a low-resolution color image. Using this to their advantage, standards such as NTSC reduce the amount of signal in U and V considerably, leaving the eye to recombine them. For instance, NTSC saves only 11% of the original blue and 30% of the original red, throwing out the rest. Since green is already encoded in the Y signal, the resulting U and V signals are substantially smaller than they would otherwise be if the original RGB or YUV signals were sent. This filtering out of the blue and red signals is trivial to accomplish once the signal is in a YUV video file format.

Without video compression, Internet video would simply not be possible. It is that important. Video files are HUGE. Download times for raw, uncompressed video would be so high that no one would bother, even at broadband speeds. Compression is the process of taking the huge file and knocking the file size down while preserving a decent picture. Compression works by first analyzing the complex and detailed video signal. Portions of the signal not noticeable to the human eye are dropped. Detail and resolution are lost, but hopefully not too much. Different compression methods select different data to lose, and this causes variable results. Often, a video signal is compressed, and then, when opened and played, it is decompressed and the picture is reconstructed from the remaining information.

3.8. Codec

Codec is a term that combines the root words compression and decompression. What is compressed must be decompressed, and to do so properly requires that the machine reading the file be able to reverse the process used for compression. Hence the two go together like a lock and key, forming the word codec. Compatibility issues come up with editing software or playing devices if the compression and decompression methods are incompatible.

Table 3.2: Codecs

Component video
o Three color components stored/transmitted separately
o Uses either RGB or YIQ (YUV) coordinates
o New digital video format (YCbCr)
o Betacam (professional tape recorders) uses this format

Composite video
o Converts RGB to YIQ (YUV)
o Multiplexes YIQ into a single signal
o Used in most consumer analog video devices

S-video
o Y and C (QAM of I and Q) are stored separately
o Used in high-end consumer video devices

High-end monitors can take input from all three.

The following characteristics of digital video tend to encourage broader use in educational settings:
- The compact size and low cost of high-quality, full-featured digital cameras make the acquisition of instructional video accessible to most students and faculty.
- Features such as inherent low-light recording capability, built-in titling, and electronic image stabilization largely eliminate the necessity for the large camera crews typical of analog video production.
- Relatively inexpensive non-linear editing systems built into common computer platforms, with powerful editing software, allow much of the post-production process to be controlled by faculty and students themselves.

More importantly, digital video has the following characteristics that allow more flexible use of video for instruction:
- The ability to produce, edit, and display video in non-sequential form.
- Tools that allow for relatively easy distribution of limited forms of video over the internet for training and instruction (e.g. Real Video, Quicktime).
- The ability to interactively combine video with other forms of digital media, such as the creation of a single web site with text, graphics, audio, and video.

The current inability to send full-length broadcast-quality video over the internet encourages the use of shorter program lengths of lesser quality. Ironically, these limitations in digital video quality and length seem to encourage greater use of video in instructional materials.

Video Formats: Today there are many different formats available to shoot your video on. This list describes how each differs from the others, along with some advantages and disadvantages. The list is divided into two types - analog and digital.

Analog: Analog recorders record video and audio signals as an analog track on video tape. A major disadvantage is that each time you make a copy of a tape, it loses some image and audio quality. The main difference between the available analog formats is the kind of video tape the recorder uses and the resolution.

VHS: Standard VHS cameras use the same type of video tapes as a regular VCR. One advantage of this is that after you've recorded something, you can easily play it on most VCRs. Because of their widespread use, VHS tapes are a lot less expensive than the tapes used in other formats. Another advantage is that they give you a longer recording time than the tapes used in other formats.

VHS-C: VHS-C camcorders record on standard VHS tape that is in a more compact cassette. You can play VHS-C cassettes in a standard VCR with an adaptor that runs the tape through a full-size cassette. Basically, the VHS-C format offers the same compatibility as the standard VHS format. The smaller tape size allows for more compact designs, making VHS-C camcorders more portable. But the reduced tape size means VHS-C tapes have a shorter running time than standard VHS tapes. In short play mode, the tapes can hold 30 to 45 minutes of video. They can hold 60 to 90 minutes of material if you record in extended play (EP) mode, but this sacrifices image and sound quality considerably.

Super VHS: Super VHS camcorders are generally the same size as standard VHS cameras. The main difference between the two formats is that S-VHS tape records an image with 380 to 400 horizontal lines, a much higher resolution than standard VHS tape. You cannot play Super VHS tapes on a standard VCR, but, as with all formats, the camcorder itself is a VCR and can be hooked up directly to your television or to your VCR to dub standard VHS copies. However, you can record on an S-VHS tape in a VHS recorder, but the signal on the tape will then be only VHS quality. Yes, this can get confusing.

8 mm
These camcorders use small 8-millimeter tapes (about the size of an audio cassette). One advantage of this format is that manufacturers can produce more compact camcorders (similar in design to VHS-C). The format offers about the same resolution as standard VHS, with slightly better sound quality. Like standard VHS tapes, 8 mm tapes hold about two hours of footage, but they are more expensive. To watch 8 mm tapes on your television, you have to attach your camcorder and use it as a VCR. 8 mm player/recorders are available but are expensive.

Hi-8
Hi-8 camcorders are very similar to 8 mm camcorders, but they have a much higher resolution (about 400 lines). Hi-8 tapes are more expensive than ordinary 8 mm tapes.

Digital
The increasingly popular format is digital. Consumer prices have continued to drop, which has led to a larger number of digital camcorder sales. Digital camcorders differ from analog camcorders in a few very important ways. They record information digitally, as bits and bytes (1s and 0s), which means that the image can be reproduced without losing any image or audio quality. Digital video can also be downloaded to a computer, where you can edit it or post it on the Web. Another distinction is that digital video has a much better resolution than analog video, typically 500 lines.

Digital Video (DV)
DV camcorders can record on compact MiniDV cassettes, which hold 60 to 90 minutes of footage. The video has up to 500 lines of resolution and can be easily transferred to a personal computer. DV camcorders can be extremely lightweight and compact -- many are about the size of a small book. Another interesting feature is the ability to capture still pictures, just as digital cameras do. If you've already been recording analog video (VHS, S-VHS), a digital camcorder can help you finally edit it into something you'll want to watch. Your original camcorder and a digital

camcorder with audio-video input jacks give you everything you need to convert analog footage to digital. All you need to do is connect a VCR or your old camcorder to your new camcorder, using standard audio-video cables (commonly called RCA cables). Hit Play on the analog VCR and Record on the digital camcorder, and you'll have a digital copy in no time. You will, of course, lose one generation in the conversion process, but after that the footage is in digital form and won't degrade any further. You can then edit it just as you would any other digital footage. The other main advantage of digital video is that once you download it to your computer, it is stored as a basic computer file. This means you can share your movies, post them on the internet, or simply store them on your hard drive. However, keep in mind that digital video files are quite large; you'll probably need to transfer small portions of your footage at a time and then record your finished movies back to tape for permanent storage.

Analog vs. Digital
Create the media you want to place on the web. This means that you must either:
o Transfer pre-recorded media to a computer, for example, music from a CD or cassette tape, or video from a VHS tape or captured from a television show.
o Record your own media from scratch, for example, videotape yourself with a camcorder or record yourself speaking into a microphone.
Your source recording will either be in digital or analog format. If it is analog, like a VHS video tape or sound recorded on a cassette tape, it must be converted to a digital format before it can be placed on the web. Special hardware is required to make this conversion. In the case of audio files, the sound card already installed in your computer is probably capable of doing the job. In the case of video, a special video capture card is required. In addition to hardware, software is also needed to convert analog to digital. With audio files, there is a good, free alternative, but with video, the required software will usually come with your video capture card, or can be purchased separately. On the other hand, if your media is recorded digitally in the first place, like from a digital camcorder or audio spoken into a media recorder on your computer, it is not necessary to convert it further, though it may be necessary to use software to edit it. In the case of audio, an original

33 digital recording will yield far better results, and generally be much easier to work with, than a recording converted from analog to digital. With video this is also true, but to a lesser extent. The idea with capturing or recording media files is to get them into an uncompressed, or raw, digital file format, which can then be compressed--in order to reduce their size--for the web. For video files, this will either be an AVI or MOV format; for audio files, either a WAV or AIFF format. This discussion will concentrate on the PC world, so it will discuss Microsoft AVI and WAV formats. Remember, AVI files store uncompressed video data (usually comprised of both video and audio tracks), and WAV files store uncompressed audio data. These files can be enormous, especially in the case of video files. Creating these initial files is by far the most difficult part of the process, especially if editing is required and you are combining multiple files into a single file. There is no substitute for practice with your software of choice when doing this. The question to keep in mind when picking a video editing program is: can the program read and write uncompressed AVI files? If so, you won't have any trouble preparing files for the web. You will also need a large hard drive to work with video. Uncompressed video files are very, very large. File formats for digital video.mov signifies Quicktime, an Apple standard. It is playable on Macintosh and Windows machines..avi is a Microsoft standard that is playable on Windows and Macintosh machines..mpeg (.MPG) is playable on Unix and Windows machines. Macintoshes can play MPEG, but may have trouble with the audio track..rm files are used by RealNetworks streaming. They are playable on Windows, MacOS, and Unix computers..asf files are a Microsoft streaming format, and play on Windows, MacOS, and Solaris Compressors/decompressors A note is needed about compressors because all of the above technologies use them and the first three technologies in the list can use several. Some compressors are good at keeping picture quality high, while others are good at reducing the data rate needed for playback. But 33

34 a trade off is always needed--the compressor that produces low data rates and high picture quality doesn't exist yet..rm and.asf compression is intended to produce a low data rate, and may discard data as well as compress it. This is appropriate for streaming purposes, but these formats should be used for distribution only. Archive copies of the digital file need to be maintained using other formats in order to keep the quality high enough. The first three technologies support several compressors, and may overlap in the ones that they use. When a movie is compressed, the viewer must have that same compressor on their computer in order to play it back Playing video files Most of the movie-playing applications and web plug-ins overlap in the files that they can play. A tug-of-war is currently going on with one technology hijacking another's file type and vice-versa. So even though a movie uses a certain technology, another technology's player or plug-in may play it back Digitizing video: definitions and quality considerations Moving pictures are an illusion. Motion pictures are a filmstrip whose individual frames are flashed on the projection screen at the rate of 24 per second. Television does the same thing electronically at the rate of 30 still images a second. The basic idea in digitizing motion footage is to digitize each image and show them to the viewer at a quick enough rate to simulate motion. Generally, at least 12 to 15 frames per second are needed for smooth motion. This approach quickly runs into problems because of the huge amount of data that results. If each full frame image is about 1 MB in size: 1,000,000 bytes/frame * 30 frames/ second * 8 bits/byte = 240,000,000 bits/second At this stage of desktop computer technology, only specialized hardware can deal with this data rate. Compression helps reduce the data, but it by itself is not enough. So the task of digitizing video requires making decisions about how to reduce the data rate. 34
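To make the back-of-the-envelope arithmetic above easy to repeat for other settings, here is a small illustrative sketch (my own addition, not part of the original text); the 24 bits-per-pixel figure is an assumption corresponding to full-color frames before any compression.

```python
# Sketch: raw (uncompressed) video data rates for a few window sizes and
# frame rates, assuming 24 bits per pixel. Purely illustrative.

def raw_data_rate_bps(width, height, fps, bits_per_pixel=24):
    """Uncompressed data rate in bits per second."""
    return width * height * bits_per_pixel * fps

examples = [
    (640, 480, 30),   # near full-screen at full frame rate
    (320, 240, 15),   # quarter screen, a typical web frame rate
    (240, 180, 10),   # small streaming window
]

for w, h, fps in examples:
    bps = raw_data_rate_bps(w, h, fps)
    print(f"{w}x{h} @ {fps:2d} fps -> {bps / 1e6:7.1f} Mbit/s "
          f"({bps / 8 / 1e6:5.2f} MB/s)")
```

Even the small streaming window comes out at several Mbit/s uncompressed, which is why the compression and quality trade-offs below matter.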

Three factors determine the data rate:
o The size of the video window. 320 x 240 pixels (1/4 screen) is usually the largest window commonly used. The smaller the window, the lower the data rate needed.
o The frame rate of the digital movie. The majority of movies on the Web are either 10 fps or 15 fps. Choose a frame rate that is an even multiple of the frame rate of the source footage.
o The picture quality (how much compression is used).
The data rate that one aims for is determined by how the movie is stored and delivered. For example, a hard disk can deliver data quicker than the Web can, so better quality, a faster frame rate, or a larger window might be possible if the movie is played from it.

QuickTime compression quality comparison: two versions of the same 240 x 180 clip, one with a file size of 375 KB and a data rate of 125 KB/second, the other with a file size of 10.5 MB and a data rate of 3.5 MB/second.

NTSC (National Television System Committee)
  Lines / Field:                525 / 60
  Horizontal Frequency:         15.734 kHz
  Vertical Frequency:           60 Hz
  Color Subcarrier Frequency:   3.58 MHz
  Video Bandwidth:              4.2 MHz
  Sound Carrier:                4.5 MHz
Table a, NTSC system

PAL (Phase Alternating Line)
  SYSTEM                        PAL          PAL N        PAL M
  Lines / Field                 625/50       625/50       525/60
  Horizontal Frequency          15.625 kHz   15.625 kHz   15.734 kHz
  Vertical Frequency            50 Hz        50 Hz        60 Hz
  Color Subcarrier Frequency    4.43 MHz     3.58 MHz     3.58 MHz
  Video Bandwidth               5.0 MHz      4.2 MHz      4.2 MHz
  Sound Carrier                 5.5 MHz      4.5 MHz      4.5 MHz
Table b, PAL system

Image and video compression cope with this effect through the use of lossy compression techniques. The main idea is that, for each image or video frame, one can exploit known properties of human vision with the use of signal transforms, in order to produce a hierarchical or layered representation of the input content in space and time. In this representation, the significant visual information tends to be clustered in a small percentage of transform coefficients, while the remaining coefficients tend to constitute a sparse representation. Such transform representations can be efficiently quantized and coded with a variety of techniques; moreover, depending on the available bandwidth, a percentage of the transform-coefficient information is ignored. It is important to note that, in the case of video, the sparseness of the transform-domain representation is significantly increased with the use of motion estimation and compensation techniques that exploit temporal similarities among neighbouring frames. In many cases, depending on the performance of the motion estimation model used, as well as the transform and coding techniques, a visually near-lossless representation of the input video can be obtained after decoding.
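The transform-quantize-discard idea described above can be illustrated with a tiny numerical experiment. The sketch below is my own illustration (it assumes NumPy and SciPy are available and is not the coding chain of any particular standard): it applies a 2D DCT to an 8x8 block, quantizes the coefficients with a uniform step, and counts how many remain significant.

```python
# Illustration: 2D DCT of an 8x8 block, uniform quantization, and a count
# of the coefficients that remain non-zero. Assumes NumPy and SciPy.
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
x, y = np.meshgrid(np.arange(8), np.arange(8))
block = 16.0 * x + 8.0 * y + rng.normal(0, 2, (8, 8))   # a smooth synthetic block

coeffs = dctn(block, norm="ortho")        # energy compacts into a few coefficients
step = 16.0                               # coarser step -> more zeros, lower quality
quantized = np.round(coeffs / step)       # scale and round to the nearest integer

reconstructed = idctn(quantized * step, norm="ortho")
print("non-zero coefficients:", int(np.count_nonzero(quantized)), "/ 64")
print("mean abs. error:", round(float(np.abs(block - reconstructed).mean()), 2))
```

For smooth content, only a handful of the 64 coefficients survive quantization, yet the reconstruction error stays small -- the clustering of significant information described above.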

37 CHAPTER COLOR COMPRESSION FOR VIDEO Compression is performed when an input video stream is analyzed and information that is indiscernible to the viewer is discarded. Each event is then assigned a code commonly occurring events are assigned few bits and rare events will have codes more bits. These steps are commonly called signal analysis, quantization and variable length encoding respectively. There are four methods for compression; discrete cosine transforms (DCT), vector quantization (VQ), fractal compression, and discrete wavelet transform (DWT). Discrete cosine transform is a lossy compression algorithm that samples an image at regular intervals, analyzes the frequency components present in the sample, and discards those frequencies which do not affect the image as the human eye perceives it. DCT is the basis of standards such as JPEG, MPEG, H.261, and H.263. Vector quantization is a lossy compression that looks at an array of data, instead of individual values. It can then generalize what it sees, compressing redundant data, while at the same time retaining the desired object or data stream s original intent. Fractal compression is a form of VQ and is also a lossy compression. Compression is performed by locating self-similar sections of an image, then using a fractal algorithm to generate the sections. Like DCT, discrete wavelet transform mathematically transforms an image into frequency components. The process is performed on the entire image, which differs from the other methods (DCT), that work on smaller pieces of the desired data. The result is a hierarchical representation of an image, where each layer represents a frequency band. Video that's stored in the standard uncompressed digital format must have a value of each primary additive color for each pixel. That is, each pixel is defined by three numbers; one representing Red, one representing Green, and another representing Blue. When applied to colors it's called color depth. Technically this means it's not uncompressed as a nearly infinite number of colors can be created from variations on these three colors, but given the limitations of the human eye it's possible to narrow them down to those the human eye and brain can distinguish from one another. 37

4.1. Achieving Compression
Compression is achieved by reducing redundancy and irrelevancy.
Sources of redundancy:
o Temporal: adjacent frames are highly correlated.
o Spatial: nearby pixels are often correlated with each other.
o Color space: RGB components are correlated among themselves.
Redundancy is relatively straightforward to exploit. Irrelevancy -- perceptually unimportant information -- is difficult to model and exploit.
Compression is achieved by exploiting the spatial and temporal redundancy inherent to video. Video is a sequence of frames (images) that are related along the temporal dimension; therefore temporal redundancy exists, and exploiting it is the main addition over image compression.

4.2. Main addition over image compression: exploit the temporal redundancy
Predict the current frame based on previously coded frames. Types of coded frames:
o I-frame: intra-coded frame, coded independently of all other frames.
o P-frame: predictively coded frame, coded based on a previously coded frame.
o B-frame: bidirectionally predicted frame, coded based on both previous and future coded frames.

4.3. Temporal Processing: Motion-Compensated Prediction
Simple frame differencing fails when there is motion;

the encoder must account for motion, and this is the role of motion-compensated (MC) prediction. MC-prediction generally provides significant improvements. Two questions arise: how can we estimate motion, and how can we form the MC-prediction?
o Ideal situation: partition the video into moving objects and describe the motion of each object. This is very difficult.
o Practical approach: block-matching motion estimation (ME). Partition each frame into blocks and describe the motion of each block. No object identification is required, and performance is good and robust.

4.4. Basic Video Compression Architecture
Exploiting the redundancies:
o Temporal: MC-prediction and MC-interpolation.
o Spatial: block DCT.
o Color: color space conversion.
This is followed by scalar quantization of the DCT coefficients, then zigzag scanning, run-length and Huffman coding of the nonzero quantized DCT coefficients.
A fundamental approach towards compression of media signals is to remove redundancy via signal prediction or linear transforms (or a combination of both), followed by quantization (scaling and rounding to the nearest integer) and entropy coding (representing those integers with a small number of bits by exploiting their joint statistics). The color space mapper is a first step of redundancy reduction, usually converting the pixels from an R-G-B color space to a luminance and chrominance space, such as Y-Cb-Cr (luma, blue-difference chroma, and red-difference chroma),

40 with the luma and chroma images typically being encoded independently. For video coding, pixel prediction is usually nonlinear, through motion compensation a motion field applied to a previously-encoded frame. In image coding, most codecs do not use pixel prediction, so that a linear transform is applied directly to the image pixels. Color is essentially how the brain interprets different frequencies (wavelengths) of light that reach the eye. That light can either come directly from a light source, such as the sun or a light bulb, or as a reflection off of some other object. Light that contains equal amounts of every visible wavelength is white. When white light reaches an object, whether it's a painting, a car, or your skin, some frequencies will be absorbed and others will be reflected. The color of the object is equivalent to white light minus the absorbed frequencies. If all wavelengths of visible light are absorbed, no light is reflected and the object appears black. This is called subtractive color, and if you're like me you learned about this as a small child. Reflected light can be divided into three primary colors (colors that can be combined to make any other color) - Red, Yellow, and Blue. When dealing with light coming directly from the source, or reflected from a white screen (meaning no visible wavelengths are absorbed) color isn't determined by subtracting, but rather by adding different frequencies together. Not surprisingly, this is called additive color. Like subtractive color, it can be divided into three primary colors, but instead of Red, Yellow, and Blue the primary additive colors are Red, Green, and Blue (RGB). This is how televisions, computer monitors, and film projectors all work Representing Color Digitally Video that's stored in the standard uncompressed digital format must have a value of each primary additive color for each pixel. That is, each pixel is defined by three numbers; one representing Red, one representing Green, and another representing Blue. Like all representations of analog data in the digital domain, each one of these numbers is a sample. This means that there are a number of possible values for each one based on the size of number used to represent it, also known as bit depth (bits per sample). When applied to colors it's called color depth. Technically this means it's not uncompressed as a nearly infinite number of colors can be created from variations on these three colors, but given the 40

41 limitations of the human eye it's possible to narrow them down to those the human eye and brain can distinguish from one another Bitrate vs. Bandwidth If you know a little bit about digital video you should recognize the term bitrate, meaning how many bits are read per second. This is the correct way to measure a digital signal, but is completely meaningless in the analog domain. This doesn't mean there isn't a way to measure the "size" of an analog signal, but instead of a stream of bits it's a range of frequencies. This is the bandwidth of the signal. Every wave, whether it's made up of light, sound, electricity, or radio has a certain frequency (how many times a complete wave repeats per second) and one cycle per second equates to 1Hz (Hertz). Just as lowering the bitrate of digital bitstream would allow frames to pass more rapidly on a wire, reducing the bandwidth of analog video allows more signals to be included on the same cable or in the same satellite transmission Gamma Correction Analog video uses a linear scale to represent Red, Green, and Blue values. This means that the difference between a value of 1 and 2 is the same as a difference between 128 and 129, which is also the same as the difference between 254 and 255. This is fine if you're simply measuring the light instead of representing a picture with it. Unfortunately, since the human eye can distinguish between smaller variations in wavelength at some frequencies than others, this representation either gives extra information that isn't perceptible by humans, or eliminates perceptible information to avoid the wasted storage or excessive transmission bandwidth. When this system was designed it wasn't much of an issue because any bandwidth concerns were overshadowed by the quality of technology available to reproduce the signal. In the digital world that extra bandwidth translates to bits, meaning we can either choose to waste bits for details we can't see or avoid wasted bits by sacrificing detail in the most important frequencies. In reality, digital video uses a third option called Gamma correction. Gamma correction fixes the problem by using a non-linear scale for each value. By devoting more bits to describe frequencies that the eye can perceive the most detail in, equal quality in all frequencies of visible light can be achieved without wasting precious storage space. In general, when you see what looks like an analog notation, such as RGB, with a single quote (or apostrophe) added afterward, it refers to gamma corrected digital video. For example, 41

42 gamma corrected RGB uses the notation R'G'B'. This is normally only an important distinction if you work with both analog and digital signals, but is also important to understand if you're trying to understand a technically accurate text. For the purposes of this guide, unless analog video is specifically mentioned, you should assume that all values refer to gamma corrected digital video. Most of the information available on the internet intended for hobbyists uses analog notation to refer to digital video RGB For our purposes we'll consider 24 bit RGB (R'G'B') to be uncompressed color. The number 24 refers to the total number of bpp (bits per pixel) used to describe color. A 24 bit number has a range of 0 (no color or black) to 16,777,215 for a total of 16,777,216 different colors. Since Red, Green, and Blue are each described by 8 bits (24/3), which gives a range of 0 (no Red, no Green, or No Blue) to 255 for a total of 256 possible variations of each primary color. From this you can start to get an idea of why using RGB color presents storage challenges. If you were to use the lowest resolution allowed for an NTSC DVD frame (352x240) with RGB24 color (the common name for 24 bit RGB) you'd end up with 352x240x24 = 2,027,520 bits or 253,440 Bytes for each frame. And that's without any of the additional information required for correct playback. At that bitrate, a single second of video is over 7 Megabytes, a minute is nearly 2/3 of a CD, and a full hour is over 25 Gigabytes - more than five DVD-9 discs. At the highest DVD resolution, which most DVDs are encoded at, it would be over 100GB per hour. Fig RGB 42
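As a small, purely illustrative sketch of the 24-bit RGB representation described above (the 0xRRGGBB byte layout is my assumption; it is only one common convention), the three 8-bit channels can be separated with shifts and masks, and the per-frame storage figure quoted in the text can be recomputed:

```python
# Sketch: unpack a 24-bit RGB value and recompute the storage figure for a
# 352 x 240 frame quoted above. The 0xRRGGBB ordering is one common convention.

def unpack_rgb24(value):
    """Split a packed 0xRRGGBB integer into (R, G, B) components."""
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF

print(unpack_rgb24(0xFF8000))        # -> (255, 128, 0)

width, height, bpp = 352, 240, 24
bits_per_frame = width * height * bpp
print(bits_per_frame, "bits =", bits_per_frame // 8, "bytes per frame")
# 2,027,520 bits = 253,440 bytes, matching the figure in the text
```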

4.9. Color Space
The term color space refers to a method of describing color. RGB is one color space, but not the only one. Just as the human eye is more sensitive to certain wavelengths of light than others, it's also more sensitive to changes in luminosity (how bright something is) than to chromaticity (changes in hue and saturation of each primary color). Therefore it makes sense to use a color space that handles luminosity and chromaticity separately. The analog color space that does this is called YUV or YPbPr. Since both are technically analog terms, you may also see this referred to with the correct terminology of Y'U'V' or Y'CbCr. As the notation suggests, the difference is that YUV values are calculated from linear-scaled analog RGB, while Y'U'V' values are calculated from gamma-corrected R'G'B'.

Fig. 4.2. YIQ color model (NTSC) and YUV color space (PAL)

YCbCr Color model
The YCbCr color space is widely used for digital video. In this format, luminance information is stored as a single component (Y), and chrominance information is stored as two color-difference components (Cb and Cr). Cb represents the difference between the blue component and a reference value. Cr represents the difference between the red component and a reference value. (YUV, another color space widely used for digital video, is very similar to YCbCr but not identical.)
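For readers who want to see the Y'CbCr idea in numbers, here is a minimal sketch of an R'G'B'-to-Y'CbCr conversion using commonly published BT.601 constants for 8-bit studio-range video. The exact coefficients are an assumption on my part and differ between standards (see the colorimetry discussion later in this chapter).

```python
# Sketch: convert an 8-bit R'G'B' pixel to studio-range Y'CbCr with
# BT.601-style constants (an assumption; other standards use other values).

def rgb_to_ycbcr_601(r, g, b):
    """R', G', B' in 0..255 -> (Y', Cb, Cr): Y' in 16..235, Cb/Cr in 16..240."""
    y  =  16 + ( 65.481 * r + 128.553 * g +  24.966 * b) / 255
    cb = 128 + (-37.797 * r -  74.203 * g + 112.000 * b) / 255
    cr = 128 + (112.000 * r -  93.786 * g -  18.214 * b) / 255
    return round(y), round(cb), round(cr)

print(rgb_to_ycbcr_601(255, 255, 255))   # white -> (235, 128, 128)
print(rgb_to_ycbcr_601(0, 0, 0))         # black -> (16, 128, 128)
print(rgb_to_ycbcr_601(255, 0, 0))       # red   -> high Cr, Cb below mid-grey
```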

44 Fig.4.3. YCbCr color model YCbCr data can be double precision, but the color space is particularly well suited to uint8 data. For uint8 images, the data range for Y is [16, 235], and the range for Cb and Cr is [16, 240]. YCbCr leaves room at the top and bottom of the full uint8 range so that additional (non image) information can be included in a video stream Chroma Subsampling The human eye is far less sensitive to changes in chrominance than to changes in luminance. Since YUV already has a separate luma channel the next logical step is to lower the resolution of the chroma samples. In other words, a single chroma sample can be used for multiple pixels. In fact, some YUV implementations sample chroma at 1/4 the resolution of luma. Since the bpp for chroma is 2/3 of the total bpp this can save a lot of space. DVD, for example, uses YV12, which has only one U and one V value per block of four pixels. This cuts the bitrate in half - from 24bpp to 12bpp. Each variation on chroma sampling is considered a separate color space, but they can all be described as being in the YUV color space. 44
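To see how much data chroma subsampling removes, here is a small sketch (my own, not from the text) that reduces a chroma plane to 4:2:0 by averaging each 2x2 block; real encoders may site the chroma sample differently, as discussed under sample locations below.

```python
# Sketch: 4:2:0 chroma subsampling by averaging each 2x2 block of a chroma
# plane. One sample then serves four pixels, as described above.
import numpy as np

def subsample_420(chroma):
    h, w = chroma.shape
    assert h % 2 == 0 and w % 2 == 0, "plane dimensions must be even"
    return chroma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

cb = np.arange(16, dtype=np.float64).reshape(4, 4)   # toy 4x4 chroma plane
print(subsample_420(cb))                              # 2x2 plane: 1 sample per 2x2 block
# With a full-resolution luma plane plus two such planes, the total drops
# from 24 bpp to 12 bpp, the YV12-style layout mentioned above.
```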

45 Each YUV color space can be described with a series of three numbers, representing the relationship between the number of Y samples and the number of U and V samples. The primary YUV color spaces you're likely to see are 4:2:0, 4:2:2, and 4:1:1. While I've seen explanations for a supposed system that can be used to understand what these numbers mean, I can't verify the accuracy of these explanations, and in the end it's easier to simply memorize what each standard set of numbers means. 4:2:2 In 4:2:2 YUV each pair of chroma samples is shared across a pair of pixels horizontally adjacent to each other. The vertical luma and chroma resolution are identical. MPEG-2 and MPEG-4 both technically support 4:2:2 YUV, but I'm not aware of any encoder or standalone player format that takes advantage of this. 4:2:0 YUV with a chroma subsampling of 4:2:0 shares chroma samples across both horizontally adjacent and vertically adjacent pixels. Each 2x2 block of pixels contains only a single U/V chroma sample. 4:1:1 Much like 4:2:2, 4:1:1 YUV uses the same vertical resolution for chroma as luma. Unlike 4:2:2, the horizontal resolution is only 1/4 of the luma, meaning each group of four horizontally adjacent pixels shares a single piece of U/V information Sample Locations Although it would be possible to simply repeat the same chroma information for each pixel that shares a pair of chroma samples, the quality is much better if the chroma is assigned to a specific point and then values between any two points (samples) is interpolated. Interpolation means simply taking two points and mathematically determining the value of the points in between. Depending on the algorithm involved, more points can be considered for greater accuracy. For example, consider three points in a line. If the first point has a red value of 0, 45

46 the last has a red value of 255, and the middle a value of 127, you can be fairly certain that colors in between them are even graduations. In reality, the logic required to interpolate the additional pixels is much more complicated than that, but the basic idea is the same. Packed vs. Planar There are two ways for a YUV encoded picture to be stored in a file. The most common is using a packed format. This simply means that the luma and chroma samples are stored next to each other in the file. For example, a 4:2:2 encoded frames could have the following order: U Y V Y U Y V Y U Y V Y U Y V Y Each chroma sample is located in the exact same coordinate of the frame as the corresponding Y sample. Each group of pixels that share a single chroma sample is called a macro pixel. The alternative, planar format, changes the order so that all the Y information for a given frame is followed by all the U information for that frame, which is followed by all the V information for the frame. Think of it as starting with a surface that's backlit with a single white lamp for each pixel in a frame. Then a sheet of translucent blue material with the chroma samples and interpolated points in between is put over the top. Finally a translucent red sheet, with chroma samples and interpolated points, is put on top of the blue. When all three layers are in place you have your final picture. It has one major advantage over packed formats. Since luma is never sub sampled, each luma point is always at an exact pixel location. Chroma however doesn't necessarily need to be lined up perfectly with a single pixel. With a planar format pixels can be placed on a grid that centers them between pixels. For example, a 4:2:0 encoded frames could make use of this by putting the chroma samples in the center, where all four pixels a sample applies to meet. Once you understand the basics, it's easiest to use more graphic representations of each one to understand them. I've also included a table at the end that details which color space is used for some standard video formats. 46
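The packed-versus-planar distinction is easiest to see as raw byte order. The sketch below is illustrative only (the sample values are placeholders): it builds a packed 4:2:2 buffer in the YUY2 style and a planar 4:2:0 buffer in the YV12 style for a tiny 4x2 frame.

```python
# Sketch: byte ordering of a packed 4:2:2 (YUY2-style) buffer versus a
# planar 4:2:0 (YV12-style) buffer for a tiny 4x2 frame. Values are arbitrary.
import numpy as np

w, h = 4, 2
y = np.arange(w * h, dtype=np.uint8).reshape(h, w)    # one luma sample per pixel
u = np.full((h, w // 2), 100, dtype=np.uint8)         # 4:2:2 chroma, one per pixel pair
v = np.full((h, w // 2), 200, dtype=np.uint8)

# Packed YUY2: Y0 U0 Y1 V0  Y2 U1 Y3 V1 ... luma and chroma interleaved.
packed = bytearray()
for row in range(h):
    for pair in range(w // 2):
        packed += bytes([y[row, 2 * pair], u[row, pair],
                         y[row, 2 * pair + 1], v[row, pair]])

# Planar YV12: the whole Y plane, then the 2x2-subsampled V plane, then U.
u420 = u.reshape(h // 2, 2, w // 2).mean(axis=1).astype(np.uint8)
v420 = v.reshape(h // 2, 2, w // 2).mean(axis=1).astype(np.uint8)
planar = y.tobytes() + v420.tobytes() + u420.tobytes()

print("packed (16 bytes):", list(packed))
print("planar (12 bytes):", list(planar))
```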

47 4.10. YUY2 YUY2 is a packed color format that uses 4:2:2 chroma subsampling. Each pixel has either a U or V sample, but not both. Fig YUY YV12 YV12 is a planar color format that uses 4:2:0 chroma subsampling. Each chroma sample is located at the nexus of 4 pixels. In the diagram below the chroma layers are shown relative to pixel positions. Fig YV DV (NTSC) NTSC DV doesn't have a name for its color format, but always uses the same packed 4:1:1 chroma subsampling. Some PAL DV formats (but not consumer grade DV equipment) also uses this subsampling. Notice that it uses the same number of chroma samples as YV12, but 47

48 only subsamples horizontally, and locates both the U and V sample at the left most pixel in the group instead of centering like YV12 or locating the samples at different pixels like YUY2. Fig DV (NTSC) How YUV Works The most important difference between RGB and YUV is separation of luminosity and chromaticity characteristics, commonly referred to as luminance and chrominance in the analog (ie linear scaled) domain. For digital video they're actually given the names luma and chroma to denote the use of Y', Cb, and Cr calculations vs. Y, Pb, and Pr. Since most luma is actually perceived in green light waves, and conversely most information the human eye perceives from the color green is luminance, green can be omitted from the chroma information completely. That leaves us with chroma components of only Red and Blue. Each of the three components may also be referred to as a channel, as in the Y (Luma) Channel, U (Blue) Channel, and V (Red) Channel HSV Color model Fig.4.13.a. HSV color space Fig.4.13.b. HSV color space, Hue is measured by an angle with Red 48

The HSV color space (Hue, Saturation, Value) is often used by people who are selecting colors (e.g., of paints or inks) from a color wheel or palette, because it corresponds better to how people experience color than the RGB color space does. The functions rgb2hsv and hsv2rgb convert images between the RGB and HSV color spaces.

Colorimetry
Regardless of what color space is used to store video, it always starts out and is displayed in RGB format. In order to make sure the RGB values used for display are as close as possible to the ones originally encoded to YUV, both the encoder and decoder need to use the same calculations. Those calculations are called colorimetry. Fortunately there are specific standards that cover this. For the most part, the only two standards you need to know about are Rec. 601 (aka ITU.601, BT.601, or SMPTE 170M) and Rec. 709 (aka ITU.709 or BT.709). Rec. 601 is used for MPEG-1, MPEG-4 ASP (DivX, XviD, and the like), and DV. MPEG-2 may use Rec. 601, Rec. 709, or SMPTE 240M (almost the same as Rec. 709). HDTV and DVD video are always supposed to use Rec. 709. Just as with nearly everything related to digital video, this isn't always as simple as it should be. Some HDTV signals and DVDs are encoded with Rec. 601, and sometimes the colorimetry even changes in the middle of a video stream. In the case of MPEG-2, although the colorimetry used is stored in the file (since it supports multiple standards), sometimes it's missing, so Rec. 709 is assumed. Since there's no way to confirm this from the actual frame data, this may or may not be a correct assumption.

RGB vs. YUV
There are many possible color space definitions, but I will address those commonly used in video production. Digital imagery often uses the red/green/blue color space, known simply as RGB. Intensity-chromaticity color spaces, YUV and YIQ, are used for television broadcast. In addition to kYUV219, Final Cut Pro contains two different RGB color spaces, kRGB219 and kRGB255. kRGB255 is the standard computer color space, the same as used by Photoshop, After Effects and most other computer graphics applications. kRGB219 is a restricted color space, designed to clamp values to match YUV broadcast standards. Systems keep the signal in YUV throughout, except for the Symphony Color Correction, which uses 10-bit RGB to avoid clipping and rounding errors. A common problem solved with color correction is a color shift that happens during the processing of film stock. This

50 problem is typically a non-linear shift in the CMY color space that film stock uses. It can only be solved in RGB space, the inverse of CMY. You can't fix these problems in YUV space. Older Avid products that use the ABVB take a YUV input and efficiently convert it to RGB, then convert again to YUV on output to tape. Most graphics applications work in RGB and many which work with broadcast video regularly move between color spaces without any degradation of the image. The bottom line is that when the image goes to broadcast, it will be at the most restrictive of the possible color spaces, YUV or YIQ Red, Green, Blue (RGB): RGB is an additive color model that is used for lightemitting devices, such as CRT displays (note: CMY is a subtractive model that is used often for printers). RGB can be thought of as three grayscale images (usually referred to as channels) representing the light values of Red, Green and Blue. Combining these three channels of light produces a wide range of visible colors. All color spaces are threedimensional orthogonal coordinate systems, meaning that there are three axes (in this case the red, green, and blue color intensities) that are perpendicular to one another. The red intensity starts at zero at the origin and increases along one of the axes. Similarly, green and blue intensities start at the origin and increase along their axes. Because each color can only have values between zero and some maximum intensity (255 for 8-bit depth), the resulting structure is a cube. We can define any color simply by giving its red, green, and blue values, or coordinates, within the color cube. These coordinates are usually represented as an ordered triplet - the red, green, and blue intensity values enclosed within parentheses as (red, green, blue). Where the best video formats are sampled at 4:2:2, RGB graphics are sampled at 4:4:4, but it is not practical to broadcast the much information, particularly when much of it is beyond our ability to perceive. YUV (Y C B C R) reduces the amount of information required to reproduce an acceptable video image. Fig Red, Green, Blue 50

51 Luminance-Chrominance (YUV & YIQ): Broadcast TV uses color spaces based on luminance and chrominance, which correspond to brightness and color. These color spaces are denoted as YUV and YIQ. The YUV space is used for the PAL broadcast television system used in Europe and the YIQ color space is used for the NTSC broadcast standard in North America. The two methods are nearly identical, using slightly different equations to transform to and from RGB color space. In both systems, Y is the luminance (brightness) component and the I and Q (or U and V) are the chrominance (color) components. These are the variables that are changed by the brightness, color, and tint controls on a television. The principle advantage of using YUV or YIQ for broadcast is that the amount of information needed to define a color television image is greatly reduced. However, this compression restricts the color range in these images. Many colors that appear on a computer display cannot be recreated on a television screen due to the limits of the YUV and YIQ standards HSB (Hue, Saturation, and Brightness) - this color space is convenient for doing color correction work. HSB is also called HLS (Hue, Luminance, and Saturation ). Hue is the position of a color along the spectrum. Hue can be described as any particular color like red, or green. Luminance is the brightness of the image on a gradient scale from white to black. Raising the luminance channel can turn dark maroon into a bright red. Saturation is the amount of color (chrominance) which also moves on the gradient scale from white to black. Lowering the saturation can turn your bright red into a pink color YIQ - is a color space derived from the NTSC television color standard. Y is the luminance channel, I is the red-cyan channel (in-phase), while Q is the magenta-green channel (quadrature). The National Television Systems Committee (NTSC) defines a color space known as YIQ. This color space is used in televisions in the United States. One of the main advantages of this format is that grayscale information is separated from color data, so the same signal can be used for both color and black and white sets. In the NTSC color space, image data consists of three components: luminance (Y), hue (I), and saturation (Q). The first component, luminance, represents grayscale information, while the last two components make up chrominance (color information) YUV - is a color space derived from the PAL television color standard. Y is the luminance channel, U is the blue channel, and V is the red channel. 51
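The kind of hue/saturation/brightness adjustment described above can be sketched with Python's standard-library colorsys module (this is my illustration, not a tool mentioned in the text); colorsys works on red/green/blue and hue/saturation/value triples in the 0..1 range.

```python
# Sketch: desaturating and brightening a color by round-tripping through HSV.
import colorsys

def adjust(rgb, sat_scale=1.0, val_scale=1.0):
    """Scale saturation and value of an (R, G, B) tuple given in 0..255."""
    h, s, v = colorsys.rgb_to_hsv(*(c / 255 for c in rgb))
    s = min(1.0, s * sat_scale)
    v = min(1.0, v * val_scale)
    return tuple(round(c * 255) for c in colorsys.hsv_to_rgb(h, s, v))

print(adjust((230, 30, 30), sat_scale=0.4))   # bright red pushed toward pink
print(adjust((96, 16, 16), val_scale=2.0))    # dark maroon raised toward bright red
```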

52 CMYK - is a color space derived from the color separations used in commercial print shops. The standard inks for printing are Cyan, Magenta, Yellow, and Black. Thus, CMYK. RGB - is the color space, which uses a number for each color. One for red, one for green, and one for blue. This number can be used to describe a precise color in the spectrum. An example of this would be, 8 bits defines 256 color scale, 16 bits defines 65,536 colors, and 24 bit defines 16.7 million colors. If you add the bits together you get the color depth of that image. There are a few image formats that include yet another number which defines an alpha channel. This is known as an RGBA file. There is another important factor in all this as well, the color depth of an image can be described by the total number of bits, and by the number of bits in each channel. Most common DV formats use 8 bits per sample, which results in 256 levels of image brightness. This is quite satisfactory, but a few of the higher end formats use 10 bits per sample, permitting far more gradations in brightness yielding smoother pictures with a greater signal-to-noise ratio. A 10 bit image has four times the information of an 8 bit image Color Encoding Color-encoding methods are: RGB (component) YUV (component) YIQ (component) YC (separate luminance (Y) and chrominance (C)), YC-358, YC-443, S-Video composite video RGB RGB is the color-encoding method used by most graphics computers, as well as some professional-quality video cameras. The three colors red, green, and blue are generated separately; each is carried on a separate wire. 52

YUV
YUV, a form of which is used by the PAL video standard and by Betacam and D1 cameras and VCRs, is also a component color-encoding method, but in a different way from RGB. In this case, brightness, or luminance, is carried on a signal known as Y. Color is carried on the color difference signals, U and V, which are B-Y and R-Y respectively. The YUV matrix multiplier derives colors from RGB via the following formulas:

  Y = 0.299 R + 0.587 G + 0.114 B
  CR = R - Y
  CB = B - Y

in which Y represents luminance and R-Y and B-Y represent the color difference signals used by this format. In this system, which is sometimes referred to as Y/R-Y/B-Y, R-Y corresponds to CR and V, and B-Y corresponds to CB and U. R-Y and B-Y are obtained by subtracting luminance (Y) from the red (R) and blue (B) camera signals, respectively. CR, CB, V, and U are derived through different normalization methods, depending on the video format used. The U and V signals are sometimes subsampled by a factor of 2 and then carried on the same signal, which is known as 4:2:2. YUV component color encoding can be recorded digitally, according to the CCIR 601 standard; this recording technique is referred to as D1.

YIQ
YIQ color encoding, which is typically used by the NTSC video format, encodes color onto two signals called I and Q (for in-phase and quadrature, respectively). These two signals have different phase modulation in NTSC transmission. Unlike the U and V components of YUV, I and Q are carried on different bandwidths. The YIQ formulas are as follows:

  Y = 0.299 R + 0.587 G + 0.114 B (the same as for YUV)
  I = 0.596 R - 0.274 G - 0.322 B
  Q = 0.212 R - 0.523 G + 0.311 B
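As a quick check of the formulas above, the sketch below evaluates them for a few colors. Note that the green and blue weights in the I and Q rows follow commonly cited NTSC values and should be read as assumptions rather than the text's own figures.

```python
# Sketch: evaluate the YIQ formulas above for a few RGB values in the 0..1 range.
# Python's standard library also offers colorsys.rgb_to_yiq with closely
# related (though not identical) weights.

def rgb_to_yiq(r, g, b):
    y = 0.299 * r + 0.587 * g + 0.114 * b
    i = 0.596 * r - 0.274 * g - 0.322 * b
    q = 0.212 * r - 0.523 * g + 0.311 * b
    return y, i, q

for name, rgb in [("white", (1.0, 1.0, 1.0)),
                  ("red",   (1.0, 0.0, 0.0)),
                  ("blue",  (0.0, 0.0, 1.0))]:
    y, i, q = rgb_to_yiq(*rgb)
    print(f"{name:5s}  Y={y:+.3f}  I={i:+.3f}  Q={q:+.3f}")
# White carries no chrominance (I = Q = 0); saturated colors put most of
# their energy into the I and Q components.
```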

YC, YC-358, YC-443, or S-Video
YC, a two-wire signal, results when I and Q are combined into one signal, called chrominance (C). Chrominance is a quadrature phase amplitude-modulated signal. In the NTSC broadcast standard, U is the 0-degree modulation and V is at 90 degrees. In the PAL broadcast standard, the V component is modulated at +/- 90 degrees line-to-line for the active picture and +/- 135 degrees for the reference burst. YC-358 is the most common NTSC version of this luminance/chrominance format; YC-443 is the most common PAL version. These formats are also known as S-Video; S-Video is one of the formats used for S-VHS videotape recorders.

Composite Video
The composite color-encoding schemes combine the brightness and color signals into one signal for broadcast. NTSC and PAL both combine brightness and color but use different methods.

Fig. 4-2. Relationships between color-encoding methods and video formats: RGB, YUV, YIQ, YC, and YC-358 mapped against the NTSC, PAL, D1 525, D1 625, and D2 625 formats.

Video Signals
The video signal, whatever the broadcast standard being used, carries other information besides video (luminance and chrominance) and audio. For example, horizontal and vertical synchronization information is required, as well as a color phase reference, which is called color sync burst. Figure 4.11 shows a composite video signal waveform.

Fig. 4.11. Composite video waveform

56 CHAPTER SCALABILITY FOR VIDEO COMPRESSION Scalable video coding technology should support at least four spatial resolution levels, which could range (dyadically) from to pixels. This example includes the commonly used QCIF and CIF formats at the low-end of the scalability regime. The ability to provide multiple resolutions from one bit stream is important in a variety of applications. For example, multi-channel content production and distribution may involve the usage of different video resolutions, from studio-profile down to PDA resolution. In this way, different clients can be served by a single scalable bit stream. Another important application of resolution scalability is in the area of video surveillance and industrial monitoring systems, where the usage of this functionality is two-fold: In a typical scenario, multiple views from different locations are received at each monitoring station. Based on spatial scalability, each view can be enlarged on-the-fly, if necessary. This limits the overall communication bandwidth between the camera network and the monitoring stations, since higher-resolution content is transmitted only following an alarm or a request from the user. Support for video storage with erosion functionality. In surveillance and industrial monitoring systems, the importance of the recorded data, and thus the required resolution of the recorded video sequences, decreases over time. For example, the full resolution video needs only to be stored for three days, and sequences older than that are only required in a medium resolution. If the sequences are older than one week, only a minimal resolution may be required for medium-to-long-term archiving Requirements for Bitrate (Quality) Scalability The representation of video into multiple qualities through scalable coding permits the video transmission to occur in a fine-granular set of bitrates. Hence, in practice, quality scalability is synonymous to bitrate scalability. According to MPEG requirements, practical qualityscalable coding schemes should provide a large variety of bitrates, which correspond to transmission rates provided by a variety of today s networks. It should be mentioned that for certain video archiving applications, such as the Digital Cinema industry, military surveillance and space exploration, or some medical applications, the possibility for lossless 56

57 coding should be supported. Obviously, the main application domains of quality scalability involve the broad area of video transmission over unreliable networks, where large bandwidth fluctuations can be efficiently handled by on-the-fly adapting the quality of the transmitted video. Another significant application is in content distribution, where in order to charge a different fee for higher quality content (requiring more bandwidth/storage), different quality levels should be provided by one bit stream Requirements for Frame-rate (Temporal) Scalability Adapting the transmission and replay frame-rate consists of an efficient way for changing the video quality perceived by human observers. In addition, different frame-rates correspond to different complexity profiles for decoding the same video sequence. As a result, it is established that scalable video coding should support at least four levels of temporal scalability, which (in the majority of cases) can vary the decoding frame-rate dynamically. Moreover, in order to satisfy a broad range of video applications, decoding of moving pictures with frame rates up to 60 Hz should be supported. The main application domains demanding temporal scalability evolve around multi-channel content production and distribution, where the same bit stream will be viewed on a variety of devices supporting different temporal resolutions. For example, 7.5 Hz, 15 Hz, 30 Hz and 60 Hz should be supported in order to accommodate a broad range of clients, ranging from studio-level devices down to video-on-demand on a cell phone with limited processing capabilities. In addition, in this scenario, temporal scalability may be used to simulate fast forward/backward capabilities found in analog video playback devices, such as VCRs. Finally, temporal scalability is very useful in the application areas that involve resolution scalability, such as surveillance and monitoring applications, where cameras are usually monitoring static areas and high frame-rate video is required only after an alarm is activated Requirements for Complexity Scalability It is generally considered important that decoding complexity scales proportionally to the decoded temporal and spatial resolution [14]. Nevertheless, an equally important aspect of complexity scalability involves the establishment of a hierarchy of the video compression 57

58 tools in terms of their average complexity profile. In this way, depending on the available resources at the decoding platform, on-the-fly adaptation of the compressed bitstream can occur so that real-time decoding is guaranteed, by selecting the sub-stream that leads to lowcomplexity decoding. To this end, since complexity is always relative to the algorithmic features as well as the implementation platform, in order to achieve reliable results, applicable complexity models should be used. In the context of multimedia algorithms, complexity modeling can be broadly defined as the procedure through which one obtains relative performance metrics for different algorithms (or different instantiations of one algorithm) with respect to: (a) the algorithm realization in terms of software or hardware; (b) the implementation platform. Important applications of complexity scalability exist in the area of wireless video streaming to mobile computing devices such as cellphones and PDAs. In addition, depending on the target application, it may be required that encoding meets certain complexity bounds. This can be important in scenarios where the encoding devices are distributed over the network, as in the case of surveillance and monitoring applications for example Frame types The basic principle for video compression is the image-to-image prediction. The first image is called an I-frame and is self-contained, having no dependency outside of that image. The following frames may use part of the first image as a reference. An image that is predicted from one reference image is called a P-frame and an image that is bidirectionally predicted from two reference images is called a B-frame. > I-frames: Intra predicted, self-contained > P-frames: Predicted from last I or P reference frame > B-frames: Bidirectional; predicted from two references one in the past and one in the future, and thus out of order decoding is needed. The video decoder restores the video by decoding the bit stream frame by frame. Decoding must always start with an I-frame, which can be decoded independently, while P- and B- frames must be decoded together with current reference image(s). 58
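Because a B-frame needs both its past and its future reference before it can be decoded, the transmission/decode order differs from the display order, as noted above. The following sketch illustrates the general idea (it is my own simplification, not any particular codec's reordering rules): each anchor frame is moved ahead of the B-frames that depend on it.

```python
# Sketch: display order vs. decode order when B-frames are present.
# Each I- or P-frame (an anchor) must be decoded before the B-frames
# that reference it, so anchors are moved ahead of the preceding B run.

def decode_order(display_order):
    out, pending_b = [], []
    for index, ftype in enumerate(display_order):
        if ftype == "B":
            pending_b.append((index, ftype))   # hold until the next anchor arrives
        else:
            out.append((index, ftype))         # anchor: decode first
            out.extend(pending_b)
            pending_b = []
    return out + pending_b

display = ["I", "B", "B", "P", "B", "B", "P"]
print("display:", " ".join(f"{t}{i}" for i, t in enumerate(display)))
print("decode :", " ".join(f"{t}{i}" for i, t in decode_order(display)))
# decode : I0 P3 B1 B2 P6 B4 B5
```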

59 I-frames: Intra (I) frames, also known as reference or key frames, contain all the necessary data to re-create a complete image. An I-frame stands by itself without requiring data from other frames in the group of pictures. Every group of picture contains one I-frame, although it does not have to be the first frame of the group of picture. I-frames are the largest type of MPEG frame, but they are faster to decompress than other kinds of MPEG frames. P-frames: Predicted (P) frames are encoded from a predicted picture based on the closest preceding I- or P-frame. P-frames are also known as reference frames, because neighbouring B- and P-frames can refer to them. P-frames are typically much smaller than I-frames. B-frames: Bi-directional (B) frames are encoded based on an interpolation from I- and P- frames that come before and after them. B-frames require very little space, but they can take longer to decompress because they are reliant on frames that may be reliant on other frames Group of Picture Pattern A group of picture pattern is defined by the ratio of P- to B-frames within a group of picture. Common patterns used for DVD are IBP and IBBP. All three frame types do not have to be used in a pattern. For example, an IP pattern can be used. IBP and IBBP group of picture patterns, in conjunction with longer group of picture lengths, encode video very efficiently. Smaller group of picture patterns with shorter group of picture lengths work better with video that has quick movements, but they don t compress the data rate as much. Some encoders can force I-frames to be added sporadically throughout a stream s group of pictures. These I-frames can be placed manually during editing or automatically by an encoder detecting abrupt visual changes such as cuts, transitions, and fast camera movements Group of Picture Length Longer group of picture lengths encode video more efficiently by reducing the number of I- frames but are less desirable during short-duration effects such as fast transitions or quick camera pans. MPEG video may be classified as long-group of picture or short-group of picture. The term long-group of picture refers to the fact that several P- and B-frames are 59

used between I-frame intervals. At the other end of the spectrum, short-group of picture MPEG is synonymous with I-frame only MPEG. Formats such as IMX use I-frame only MPEG-2, which reduces temporal artifacts and improves editing performance. However, I-frame only formats have a significantly higher data rate because each frame must store enough data to be completely self-contained. Therefore, although the decoding demands on your computer are decreased, there is a greater demand for scratch disk speed and capacity. Maximum group of picture length depends on the specifications of the playback device. The minimum group of picture length depends on the group of picture pattern. For example, an IP pattern can have a length as short as two frames. Here are several examples of group of picture length used in common MPEG formats:
o MPEG-2 for DVD: Maximum group of picture length is 18 frames for NTSC or 15 frames for PAL. These group of picture lengths can be doubled for progressive footage.
o 1080-line HDV: Uses a long-group of picture structure that is 15 frames in length.
o 720-line HDV: Uses a six-frame group of picture structure.
o IMX: Uses only I-frames.

Open and Closed group of pictures
An open group of picture allows the B-frames from one group of picture to refer to an I- or P-frame in an adjacent group of picture. Open group of pictures are very efficient but cannot be used for features such as multiplexed multi-angle DVD video. A closed group of picture format uses only self-contained group of pictures that do not rely on frames outside the group of picture. The same group of picture pattern can produce different results when used with an open or closed group of picture. For example, a closed group of picture would start an IBBP pattern with an I-frame, whereas an open group of picture with the same pattern might start with a B-frame. In this example, starting with a B-frame is a little more efficient because starting with an I-frame means that an extra P-frame must be added to the end (a group of picture cannot end with a B-frame).
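Because B-frames in this classic model are never used as references, a decoder or streaming server can drop them to reduce the frame rate, which is one simple route to the temporal scalability discussed earlier in this chapter. The sketch below is my own illustration (the GOP builder is a simplification): it expands a GOP into display-order frame types and shows the reduced-rate sub-stream.

```python
# Sketch: build the display-order frame types of one GOP and derive a
# lower-frame-rate sub-stream by dropping B-frames (which, in this classic
# model, are never referenced by other frames). Simplified illustration.

def gop_frame_types(gop_length=12, b_frames_per_anchor=2):
    """e.g. gop_length=12, 2 B-frames per anchor -> I B B P B B P B B P B B."""
    types = ["I"]
    while len(types) < gop_length:
        for _ in range(b_frames_per_anchor):
            if len(types) < gop_length:
                types.append("B")
        if len(types) < gop_length:
            types.append("P")
    return types

full = gop_frame_types()
reduced = [t for t in full if t != "B"]          # I- and P-frames only

print("full rate   :", " ".join(full))
print("reduced rate:", " ".join(reduced), f"({len(reduced)}/{len(full)} frames kept)")
```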

Fig 5.7. Open and Closed group of pictures

5.8. MPEG Containers and Streams
MPEG video and audio data are packaged into discrete data containers known as streams. Keeping video and audio streams discrete makes it possible for playback applications to easily switch between streams on the fly. For example, DVDs that use MPEG-2 video can switch between multiple audio tracks and video angles as the DVD plays. Each MPEG standard has variations, but in general, MPEG formats support two basic kinds of streams:
o Elementary streams: These are individual video and audio data streams.
o System streams: These streams combine, or multiplex, video and audio elementary streams together. They are also known as multiplexed streams. To play back these streams, applications must be able to demultiplex the streams back into their elementary streams. Some applications only have the ability to play elementary streams.

MPEG-1
MPEG-1 is the earliest format specification in the family of MPEG formats. Because of its low bit rate, MPEG-1 has been popular for online distribution and in formats such as Video CD (VCD). DVDs can also store MPEG-1 video, though MPEG-2 is more commonly used. Although the MPEG-1 standard actually allows high resolutions, almost all applications use NTSC- or PAL-compatible image dimensions at quarter resolution or lower. Common MPEG-1 formats include 320 x 240, 352 x 240 at 29.97 fps (NTSC), and 352 x 288 at 25 fps (PAL). Maximum data rates are often limited to around 1.5 Mbps. MPEG-1 only supports progressive-scan video. MPEG-1 supports three layers of audio compression, called MPEG-1 Layers 1, 2, and 3. MPEG-1 Layer 2 audio is used in some formats such as HDV and DVD, but MPEG-1 Layer

62 3 (also known as MP3) is by far the most common. In fact, MP3 audio compression has become so popular that it is usually used independently of video. MPEG-1 elementary stream files often have extensions such as.m1v and.m1a, for video and audio, respectively MPEG-2 The MPEG-2 standard made many improvements to the MPEG-1 standard, including: Support for interlaced video Higher data rates and larger frame sizes, including internationally accepted standard definition and high definition profiles Two kinds of multiplexed system streams Transport Streams (TS) for unreliable network transmission such as broadcast digital television, and Program Streams (PS) for local, reliable media access (such as DVD playback) MPEG-2 categorizes video standards into MPEG-2 Profiles and MPEG-2 Levels. Profiles define the type of MPEG encoding supported (I-, P-, and B-frames) and the color sampling method used (4:2:0 or 4:2:2 Y CBCR). For example, the MPEG-2 Simple Profile (SP) supports only I and P progressive frames using 4:2:0 color sampling, whereas the High Profile (HP) supports I, P, and B interlaced frames with 4:2:2 color sampling. Levels define the resolution, frame rate, and bit rate of MPEG-2 video. For example, MPEG- 2 Low Level (LL) is limited to MPEG-1 resolution, whereas High Level (HL) supports 1920 x 1080 HD video. MPEG-2 formats are often described as a combination of Profiles and Levels. For example, DVD video uses Main Profile at Main Level ML), which defines SD NTSC and PAL video at a maximum bit rate of 15 Mbps (though DVD limits this to 9.8 Mbps). MPEG-2 supports the same audio layers as MPEG-1 but also includes support for multichannel audio. MPEG-2 Part 7 also supports a more efficient audio compression algorithm called Advanced Audio Coding, or AAC. MPEG-2 elementary stream files often have extensions such as.m2v and.m2a, for video and audio, respectively. 62

MPEG-4
MPEG-4 inherited many of the features in MPEG-1 and MPEG-2 and then added a rich set of multimedia features such as discrete object encoding, scene description, rich metadata, and digital rights management (DRM). Most applications support only a subset of all the features available in MPEG-4. Compared to MPEG-1 and MPEG-2, MPEG-4 video compression (known as MPEG-4 Part 2) provides superior quality at low bit rates. However, MPEG-4 supports high-resolution video as well. For example, Sony HDCAM SR uses a form of MPEG-4 compression. MPEG-4 Part 3 defines and enhances the AAC audio originally defined in MPEG-2 Part 7. Most applications today use the terms AAC audio and MPEG-4 audio interchangeably.

MPEG-4 Part 10, or H.264
MPEG-4 Part 10 defines a high-quality video compression algorithm called Advanced Video Coding (AVC). This is more commonly referred to as H.264. H.264 video compression works similarly to MPEG-1 and MPEG-2 encoding but adds many additional features to decrease data rate while maintaining quality. Compared to MPEG-1 and MPEG-2, H.264 compression and decompression require significant processing overhead, so this format may tax older computer systems.

Fig. MPEG-4
The illustration above shows how a typical sequence with I-, B-, and P-frames may look. Note that a P-frame may only reference a preceding I- or P-frame, while a B-frame may reference both preceding and succeeding I- and P-frames.
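The referencing rules just described can be made concrete with a short sketch. Given a display-order pattern of frame types, it lists which anchor frames each P- or B-frame may reference. This is a simplification (real encoders work in coded order, and H.264 additionally allows multiple reference frames), and the pattern string is an arbitrary example.

# For each frame in a display-order pattern such as "IBBPBBPBB" (which must start
# with an I-frame), report which neighbouring I/P frames it may use as references
# under the classic MPEG rules: P-frames reference the nearest preceding I or P
# frame, B-frames reference the nearest preceding and nearest succeeding I or P frame.
def reference_frames(pattern):
    anchors = [i for i, t in enumerate(pattern) if t in "IP"]
    refs = {}
    for i, t in enumerate(pattern):
        if t == "I":
            refs[i] = []                                    # intra-coded, no references
        elif t == "P":
            refs[i] = [max(a for a in anchors if a < i)]    # nearest preceding anchor
        else:  # "B"
            prev = max(a for a in anchors if a < i)
            nxt = min((a for a in anchors if a > i), default=None)
            refs[i] = [prev] + ([nxt] if nxt is not None else [])
    return refs

for frame, refs in reference_frames("IBBPBBPBB").items():
    print(frame, refs)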

Since the H.261/H.263 recommendations are neither international standards nor offer any compression enhancements compared to MPEG, they are of little real interest and are not recommended as suitable techniques for video surveillance. Due to its simplicity, the widely used Motion JPEG, a standard in many systems, is often a good choice. There is limited delay between image capture in a camera, encoding, transfer over the network, decoding, and finally display at the viewing station. In other words, Motion JPEG provides low latency due to its simplicity (image compression and complete individual images), and is therefore also suitable for image processing, such as video motion detection or object tracking. Any practical image resolution, from mobile phone display size (QVGA) up to full video (4CIF) image size and above (megapixel), is available in Motion JPEG. However, Motion JPEG generates a relatively large volume of image data to be sent across the network. In comparison, all MPEG standards have the advantage of sending a lower volume of data per time unit across the network (bit-rate) compared to Motion JPEG, except at low frame rates. At low frame rates, where the MPEG compression cannot make much use of similarities between neighbouring frames, and due to the overhead generated by the MPEG streaming format, the bandwidth consumption for MPEG is similar to Motion JPEG. MPEG-1 is thus in most cases more effective than Motion JPEG. However, for just a slightly higher cost, MPEG-2 provides even more advantages and supports better image quality in terms of frame rate and resolution. On the other hand, MPEG-2 requires more network bandwidth and is a technique of greater complexity. MPEG-4 was developed to offer a compression technique for applications demanding lower image quality and bandwidth. It is also able to deliver video compression similar to MPEG-1 and MPEG-2, i.e. higher image quality at higher bandwidth consumption. If the available network bandwidth is limited, or if video is to be recorded at a high frame rate and there are storage space restraints, MPEG may be the preferred option. It provides a relatively high image quality at a lower bit-rate (bandwidth usage). Still, the lower bandwidth demands come at the cost of higher complexity in encoding and decoding, which in turn contributes to a higher latency when compared to Motion JPEG.
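A back-of-the-envelope comparison of the kind described above can make the frame-rate effect concrete. The per-frame sizes and the one-second GOP in the sketch below are purely illustrative assumptions, not measured values.

# Rough per-camera bandwidth estimate for Motion JPEG versus an MPEG-style codec.
# Real figures depend on scene content, resolution and encoder settings.
def mjpeg_mbps(fps, jpeg_bytes=25_000):
    return fps * jpeg_bytes * 8 / 1e6

def mpeg_mbps(fps, jpeg_bytes=25_000, p_bytes=3_000, gop_seconds=1.0):
    # Assume one roughly JPEG-sized I-frame per GOP and small P-frames otherwise.
    i_per_sec = 1.0 / gop_seconds
    p_per_sec = max(fps - i_per_sec, 0)
    return (i_per_sec * jpeg_bytes + p_per_sec * p_bytes) * 8 / 1e6

for fps in (1, 5, 25):
    print(f"{fps:2d} fps   MJPEG {mjpeg_mbps(fps):5.2f} Mbit/s   MPEG {mpeg_mbps(fps):5.2f} Mbit/s")

At 1 fps the two estimates coincide, because the MPEG stream is then effectively all I-frames plus overhead, while at 25 fps the inter-frame coding pays off, mirroring the behaviour described in the text.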

Looking ahead, it is not a bold prediction that H.264 will be a key technique for compression of motion pictures in many application areas, including video surveillance. As mentioned above, it has already been implemented in areas as diverse as high-definition DVD (HD DVD and Blu-ray), digital video broadcasting including high-definition TV, the 3GPP standard for third-generation mobile telephony, and software such as QuickTime and Apple Computer's Mac OS X operating system. H.264 is now a widely adopted standard, and represents the first time that the ITU, ISO and IEC have come together on a common, international standard for video compression. H.264 entails significant improvements in coding efficiency, latency, complexity and robustness. It provides new possibilities for creating better video encoders and decoders that provide higher quality video streams at a maintained bit-rate (compared to previous standards), or, conversely, the same quality video at a lower bit-rate. There will always be a market need for better image quality, higher frame rates and higher resolutions with minimized bandwidth consumption. H.264 offers this, and as the H.264 format becomes more broadly available in network cameras, video encoders and video management software, system designers and integrators will need to make sure that the products and vendors they choose support this new open standard. And for the time being, network video products that support several compression formats are ideal for maximum flexibility and integration possibilities.

Principles of Perceptual Encoding
Systems which are optimised for audio and image compression often deploy specialised algorithms which exploit weaknesses in human perception. Ever since people attempted to capture and replay moving images and sound, the constraints of mechanics and electronics have required economies and adaptation of content to suit the recording and transmission systems. In an obvious example, film and video cameras typically capture only a limited number of full images over time, typically 18, 24, 25 or roughly 30 frames or still-images per second. When these are screened, and conveyed to the human brain, we process these image sequences as if they were continuous. Our ability to gloss over these temporal aberrations, or imperfections in the time-domain, underpins an important set of principles exploited in image encoding. Similarly, video signals normally do not convey the full resolution of color information that we may see in real life. Some video systems used in domestic cameras and computers carry the colour information at only half the resolution in each direction, a scheme typically denoted 4:2:0. For many years, analog television has also fooled our perception by capturing and conveying images in a series of lines drawn across the screen of a television receiver or monitor.

Rather than presenting a whole frame at a time, as a film projector does, the video signal traces a series of lines across the inside face of a phosphor-coated cathode-ray tube [CRT], which glows momentarily in response to the applied voltage. Because the light emitted from the phosphor coating decays rapidly, each frame is divided into halves known as fields, carrying alternate lines which are sent and displayed in succession. This interlaced signal reduces the bandwidth needed in electronic processing, and minimises flickering in CRTs. Color television takes this a step further, with three electron guns firing three separate Red, Green and Blue traces onto the inner front surface of the CRT, with each color separated by a mask. In digital video, and on computer monitors, images are represented by arrays or mosaics of individual squares or rectangles, each with their own RGB or YUV values. Typical sizes are: 640W x 480H [VGA], 800 x 600 [SVGA], 1024 x 768 [XGA], 720 x 480 [ATSC SDTV], 720 x 576 [PAL-DVB SDTV], 1280 x 720 [progressive-scan HDTV] and 1920 x 1080 [interlaced HDTV], among others. Although the basic technology has limitations, and is a long way from providing a seamless presentation of images in the horizontal, vertical and temporal dimensions, our visual perception is good at gluing together all these visual cues into a sense of reality. Just as importantly, global standards ensure that all the basic parameters are agreed, so that content can be exchanged without loss or corruption. A film which is shot at 24 frames each second can be projected at the same speed anywhere in the world, even if converted into digital video. An important principle has been adopted within archives around the world: images and audio should not undergo further degradation each time they are stored and recalled for reuse, otherwise the future portrayal of history would eventually disappear.

Intra-frame Compression
The basis for traditional JPEG and MPEG encoding of data within individual frames of video is the application of a Discrete Cosine Transform [DCT] to shape individual blocks of image data into similar patterns or sequences of pixels, followed by entropy encoding, which describes these areas in mathematical short-hand. This intra-frame compression saves large amounts of data when compared to the task of storing or transmitting a full description of every pixel in the original image. The MPEG-2 and MPEG-4 standards have enabled increasingly adaptive macroblock boundaries to match more closely the transitions in image complexity.
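A minimal sketch of this intra-frame step, assuming an orthonormal 8 x 8 DCT and a single flat quantization step; real JPEG/MPEG encoders use perceptually weighted quantization tables and follow this with entropy coding.

import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix, so that coeffs = D @ block @ D.T
    k = np.arange(n)
    D = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    D[0, :] *= 1 / np.sqrt(n)
    D[1:, :] *= np.sqrt(2 / n)
    return D

def encode_block(block, q_step=16):
    D = dct_matrix(block.shape[0])
    coeffs = D @ (block - 128.0) @ D.T      # 2-D transform (level-shifted as in JPEG)
    return np.round(coeffs / q_step)        # quantization: the lossy step

def decode_block(q_coeffs, q_step=16):
    D = dct_matrix(q_coeffs.shape[0])
    return D.T @ (q_coeffs * q_step) @ D + 128.0

block = np.add.outer(np.arange(8.0), np.arange(8.0)) * 8   # a smooth 8x8 gradient
q = encode_block(block)
print("non-zero coefficients:", np.count_nonzero(q), "of", q.size)
print("max reconstruction error:", np.abs(decode_block(q) - block).max())

Because the block is smooth, almost all of the quantized coefficients are zero, which is exactly what the entropy coder then exploits.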

MPEG-4 Part 10/H.264 AVC includes an additional post-filtering stage to smooth the edges of macroblocks, which are typically visible at higher compression ratios. JPEG-2000 includes a more advanced intra-frame compression technique known as wavelet coding, which minimises blocky artefacts when compared to DCT compression.

Inter-frame Compression
MPEG video compression supports a hybrid encoding regime, in which not only individual frames may be compressed, but entire sequences of frames with similarities [usually within a shot or scene] may be encoded with data specifying only the differences between adjacent frames. Under certain conditions, typically at lower bit-rates, these differences over time, known as temporal or inter-frame encoding techniques, require less data for a given picture quality than does encoding every frame. Inter-frame encoding typically maps groups of pixels within macroblocks which stay the same from one frame to the next [i.e. fixed backgrounds] or which move in the same direction [e.g. moving objects or panned backgrounds]. Rather than recoding these image regions, their relative positions are tracked using motion vectors, which require much less coding. A typical group of pictures (or "GOP") would include one I-frame which is intra-coded [so that it can be represented on its own], along with P-frames which are predicted using such tools as motion prediction and motion compensation. Additional B-frames ("in-between" frames) are then interpolated from I- and P-frames. A typical long-GOP MPEG structure for a 25-frame sequence would be: I-B-B-P-B-B-I-B-B-P-B-B-I-B-B-P-B-B-I-B-B-P-B-B-I.
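A small helper can generate that display-order pattern from two parameters, the I-frame interval and the anchor spacing. The defaults below simply reproduce the 25-frame example given above; they are illustrative and not mandated by any MPEG standard.

def gop_pattern(num_frames, gop_length=6, anchor_spacing=3):
    """Return a display-order frame-type string such as 'IBBPBBIBB...'.

    gop_length     : distance between successive I-frames
    anchor_spacing : distance between successive anchor (I or P) frames
    """
    types = []
    for i in range(num_frames):
        if i % gop_length == 0:
            types.append("I")
        elif i % anchor_spacing == 0:
            types.append("P")
        else:
            types.append("B")
    return "".join(types)

print(gop_pattern(25))   # IBBPBBIBBPBB... matching the 25-frame example above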

Two Complementary Image Compression Schemes
Two main families of image compression are now in widespread use, JPEG and MPEG. The JPEG standard has been developed by the Joint Photographic Experts Group [Working Group 1 under Sub-Committee 29 of the ISO/IEC Joint Technical Committee-1]. JPEG sits alongside MPEG [WG-11], under SC-29 of Joint Technical Committee 1 in the ISO standards framework.

5.13. Compression Performance
For MPEG-4, compression performance is defined in terms of subjective, visually lossless quality; mathematically lossless compression is not a requirement of the standard. This original scope statement highlights an important distinction between true, mathematically lossless compression, and lossy compression schemes in which significant amounts of picture information are discarded but whose effects are visually difficult to perceive. These systems are also known as pseudo-lossless. In true, mathematically lossless compression systems, every pixel of every image is transmitted and rendered in exactly the same place within the image frame, with exactly the same value as the original. ITU-T Rec. H.264 / ISO/IEC 14496-10 version 3, approved in July 2005, includes a set of Fidelity Range Extensions [FRExt] which are intended to expand the range of applications of the original standard.

Temporal and spatial compression
Since digital video is both a sequence of images over time and (sometimes) an audio track, it is not surprising that file sizes can grow very rapidly. Compression is a must. Most video capture and editing software allows the user to choose to compress more than once: while digitizing, perhaps also while editing, and again when saving the final result. Since each compression potentially could bring a loss of quality, it is best to save compression for the last step. If possible, digitize and edit your video with little or no compression. Digital video compression uses two techniques: spatial and temporal compression. Spatial compression was introduced in the section on digital images. It involves removing or reordering information about a field of colored pixels to conserve file space. Temporal compression, as the name might suggest, operates across time. It compares one still frame with an adjoining frame and, instead of saving all the information about each frame into the digital video file, only saves information about the differences between frames (frame differencing). This type of compression relies on periodic keyframes. At each keyframe, the entire still image is saved, and these complete pictures are used as the comparison frames for frame differencing, as in the sketch below.
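A minimal sketch of frame differencing with periodic keyframes, assuming plain pixel differences and an arbitrary keyframe interval; real codecs combine this with motion compensation and spatial coding.

import numpy as np

def encode(frames, keyframe_interval=10):
    """Store a full frame periodically and only differences in between."""
    encoded = []
    for i, frame in enumerate(frames):
        if i % keyframe_interval == 0:
            encoded.append(("key", frame.copy()))
        else:
            encoded.append(("diff", frame - frames[i - 1]))
    return encoded

def decode(encoded):
    frames, previous = [], None
    for kind, data in encoded:
        previous = data.copy() if kind == "key" else previous + data
        frames.append(previous)
    return frames

rng = np.random.default_rng(1)
video = [rng.integers(0, 256, (4, 4)).astype(np.int16)]
for _ in range(19):                                   # mostly-static scene: small changes per frame
    video.append(video[-1] + rng.integers(-2, 3, (4, 4)))

restored = decode(encode(video))
print(all(np.array_equal(a, b) for a, b in zip(video, restored)))   # True: differencing alone is lossless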

Another key concept introduced when compressing digital video (but also used with other time-based media, such as audio) is data rate. Data rate refers to the speed at which information must be read from the delivery source (a file off of a hard drive, a file from a CD-ROM, a file over the network, etc.). When compressing video, most applications allow the user to specify a data rate, so that playback will be smooth for users on, for example, a 56 Kb/second modem.

Other Lossless Video Candidates
In addition to the JPEG group and the MPEG group, there are a number of additional lossless video candidates. In the interest of providing a comprehensive list of all lossless candidates, we will review them all here. None of these should be considered to be of the same caliber as JPEG and MPEG in terms of broad and well-published support.

HuffYUV
HuffYUV is a lossless codec that relies on the Huffman code for compression, a popular and memory-efficient code set useful for many compression applications. The HuffYUV video compression algorithm is credited to Ben Rudiak-Gould's application of this compression technique to video. HuffYUV is intended to replace uncompressed YUV as a video capture format. According to its author, it is fast enough to compress full-resolution CCIR 601 video (720 x 480 at 30 fps) in real time as it is captured. HuffYUV also supports lossless compression of RGB data. The criteria for a viable candidate for a long-term preservation video format require that the format be a publicly published format with a reasonable level of commercial support. This is one area where HuffYUV falls short: while it is based on well-published techniques, it is a project supported by a limited staff, namely Ben Rudiak-Gould.

Color Channel Downsampling (lossy): Humans are much better at discerning changes in luminance than changes in chrominance. Luminance, or brightness, is a measure of color intensity. Chrominance is a measure of color hue. Pictures are separated into one luminance and two chrominance channels, called the YCbCr color space. The chrominance channels are typically downsampled horizontally and vertically, as in the sketch below.
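A minimal sketch of that downsampling step, assuming the common BT.601 RGB-to-YCbCr conversion and simple 2 x 2 block averaging for 4:2:0; real encoders may use other filters and chroma siting.

import numpy as np

def rgb_to_ycbcr(rgb):
    """BT.601 full-range RGB (floats 0..255) to Y, Cb, Cr planes."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128
    return y, cb, cr

def subsample_420(plane):
    """Average each 2x2 block: half resolution horizontally and vertically."""
    h, w = plane.shape
    return plane[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, (8, 8, 3)).astype(float)
y, cb, cr = rgb_to_ycbcr(rgb)
print(y.shape, subsample_420(cb).shape, subsample_420(cr).shape)   # (8, 8) (4, 4) (4, 4)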

Frequency Quantization (lossy): An image can be expressed as a linear combination of horizontal and vertical frequencies. Humans are much more sensitive to low frequency image components, such as a blue sky, than to high frequency image components, such as a plaid shirt. Unless a high frequency component has a strong presence in an image, it can be discarded. Frequencies which must be coded are stored approximately (by rounding) rather than encoded precisely. This approximation process is called quantization. How the different horizontal and vertical frequencies are quantized is determined by empirical data on human perception.

Motion Prediction (lossless): Frames of a video contain a great deal of temporal redundancy because much of a scene is duplicated between sequential frames. Motion estimation is used to produce motion predictions with respect to one or more reference frames. Predictions indicate what any given frame should look like. For similar frames, only the motion estimate and any error between the predicted values and the actual values must be encoded. A sketch of a simple block-based motion search is given after this list.

Difference Coding (lossless): Over any given region in an image, the average color value is likely to be similar or identical to the average color in surrounding regions. Thus the average colors of regions are coded differentially with respect to their neighbours. Motion information at neighbouring regions is also likely to be similar or identical and is therefore coded differentially with respect to motion at neighbouring regions.
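A minimal sketch of the block-based motion search mentioned under Motion Prediction, assuming an exhaustive sum-of-absolute-differences (SAD) search over a small window; the block size, search range and test frames are arbitrary.

import numpy as np

def best_motion_vector(reference, current, top, left, block=8, search=4):
    """Exhaustive SAD search for the block at (top, left) of the current frame."""
    target = current[top:top + block, left:left + block]
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > reference.shape[0] or x + block > reference.shape[1]:
                continue
            candidate = reference[y:y + block, x:x + block]
            sad = np.abs(candidate.astype(int) - target.astype(int)).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad

rng = np.random.default_rng(2)
ref = rng.integers(0, 256, (32, 32))
cur = np.roll(ref, shift=(2, -3), axis=(0, 1))        # simulate a global shift of the scene
print(best_motion_vector(ref, cur, top=8, left=8))    # expect a vector of (-2, 3) with zero error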

CHAPTER 6: COLOR INITIALIZATION IN VIDEO
Color Constancy. Most color constancy algorithms make the assumption that only one light source is used to illuminate the scene. Given the image pixel values I(x), the goal of color constancy is to estimate the color of the light source, assuming there is a single constant source for the entire image. Alternatively, as shown in [2], the illumination is considered to be a spectrum: assuming that the color of the illuminant l depends on the illuminant spectral power distribution L(λ) and the camera sensitivity c(λ), l is estimated as

l = \int_{\omega} L(\lambda)\, c(\lambda)\, d\lambda

where l is the estimated color of the illuminant, ω is the visible spectrum and λ is the wavelength. Therefore, estimating l is equivalent to estimating its color coordinates, where l has three components, one value for each color channel.

6.1 Color constancy
Many color constancy algorithms have been proposed to estimate the color of illumination of a scene. These methods can be broadly classified into four categories: gamut-based/learning methods, probabilistic methods, methods based on low-level features, and methods based on combinations of different methods. The gamut mapping method proposed by Forsyth is based on the observation that, given an illuminant, the range of RGB values present in a scene is limited. Under a known illuminant (typically, white), the set of all RGB colors is inscribed inside a convex hull and is called the canonical gamut. This method tries to estimate the illuminant color by finding an appropriate mapping from an image gamut to the canonical gamut. Since this method could result in infeasible solutions, Finlayson et al. improve the above algorithm by constraining the transformations so that the illuminant estimation corresponds to a predefined set of illuminants. Finlayson et al. use the knowledge about the appearance of colors under a certain illumination as a prior to estimate the probability of an illuminant from a set of illuminations. The disadvantage of this method is that the estimation of the illuminant depends on a good model of the lights and surfaces, which is not easily available.

Some methods consider the spatial dependencies between pixels to estimate the illuminant color. Another learning-based approach, by Cardei et al., uses a neural network to learn the illumination of a scene from a large amount of training data. The disadvantage with neural networks is that the choice of training dataset heavily influences the estimation of the illuminant color. A nonparametric linear regression tool called kernel regression has also been used to estimate the illuminant chromaticity. Probabilistic methods include Bayesian approaches that estimate the illuminant color depending on the posterior distribution of the image data. These methods first model the relation between the illuminants and surfaces in a scene. They create a prior distribution depending on the probability of the existence of a particular illuminant or surface in a scene, and then, using Bayes's rule, compute the posterior distribution. Rosenberg et al. incorporate the correlation between neighbouring pixels within the Bayesian framework. A disadvantage of the above-mentioned algorithms is that they are quite complex and all the methods require large datasets of images with known values of illumination for training or as a prior. Also, the performance may be influenced by the choice of training dataset. Another set of methods uses low-level features of the image. The white-patch assumption is a simple method that estimates the illuminant value by measuring the maximum value in each of the color channels. The grey-world algorithm assumes that the average pixel value of a scene is grey. The grey-edge algorithm measures the derivative of the image and assumes the average edge difference to be achromatic. All the above low-level color constancy methods can be expressed as

\left( \int \left| \frac{\partial^n I_s(x)}{\partial x^n} \right|^p dx \right)^{1/p} = k\, l^{n,p,s}    (3)

where n is the order of the derivative, p is the Minkowski norm and s is the parameter for smoothing the image I(x) with a Gaussian filter. The original formulation of (3) for the white-patch assumption and the grey-world algorithm can be found in [12]. As expressed in (3), the white-patch assumption can be written as l^{0,∞,0}, the grey-world algorithm as l^{0,1,0} and the n-th order grey-edge algorithm as l^{n,p,s}. Van de Weijer et al. [2] have shown results using values of n = 1 and n = 2.
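A minimal sketch of the zero-order instances of (3), assuming a linear RGB image: grey-world is the Minkowski mean with p = 1, and white-patch is the limit p → ∞ (the per-channel maximum). The first-order grey-edge case would additionally differentiate the Gaussian-smoothed image, which is omitted here, and the synthetic scene below is an illustrative assumption rather than data from the cited work.

import numpy as np

def illuminant_minkowski(image, p=1.0):
    """Zero-order estimate l^(0,p,0): per-channel Minkowski mean of pixel values."""
    flat = image.reshape(-1, 3).astype(float)
    if np.isinf(p):
        l = flat.max(axis=0)                     # white-patch: per-channel maximum
    else:
        l = np.mean(flat ** p, axis=0) ** (1.0 / p)
    return l / np.linalg.norm(l)                 # keep only the colour (direction) of l

rng = np.random.default_rng(0)
scene = rng.random((64, 64, 3))
scene *= np.array([1.0, 0.8, 0.6])               # cast from a reddish illuminant

print("grey-world :", illuminant_minkowski(scene, p=1.0))
print("white-patch:", illuminant_minkowski(scene, p=np.inf))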

More recent techniques to achieve color constancy use higher-level information and also use combinations of different existing color constancy methods. Reference [13] estimates the illuminant by taking a weighted average of different methods; the weights are predetermined depending on the choice of dataset. Gijsenij and Gevers [14] use Weibull parameterization to obtain the characteristics of the image and, depending on those values, divide the image space into clusters using the k-means algorithm and then use the best color constancy algorithm corresponding to that cluster. The best algorithm for a cluster is learnt from the training dataset. Van de Weijer et al. [15] model an image as a combination of various semantic classes such as sky, grass, road and buildings. Depending on the likelihood of semantic content, the illuminant color is estimated corresponding to the different classes. Similarly, information about images being indoor or outdoor is also used to select a color constancy algorithm and consequently estimate the color of the illuminant [16]. More recently, 3D scene geometry has been used to classify images, and a color constancy algorithm is chosen according to the classification results to estimate the illuminant color [17].

6.2 Enhancement Techniques
The most common technique to enhance images is to equalize the global histogram or to perform a global stretch. Since this does not always give good results, local approaches to histogram equalization have been proposed. One such technique uses adaptive histogram equalization, which computes the intensity value for a pixel based on the local histogram of a local window. Another technique to obtain image enhancement uses curvelet transformations. A recent technique has been proposed by Palma-Amestoy et al. that uses a variational framework based on human perception for achieving enhancement. This method also removes the color cast from the images during image enhancement, unlike most other methods, and achieves the same goal as our algorithm. Inspired by the grey-world algorithm, Rizzi et al. introduced a technique called automatic color equalization (ACE) that uses an odd function of differences between pixel intensities to enhance an image. This is a two-step process: the first step computes the chromatic spatial adjustment by considering the differences between pixels, weighted by a distance function.
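For comparison with the local and variational methods discussed above, the two global operations mentioned at the start of this section can be sketched directly on 8-bit data; the low-contrast test image below is synthetic and purely illustrative.

import numpy as np

def global_stretch(channel):
    """Linearly stretch one 8-bit channel to the full 0..255 range."""
    lo, hi = channel.min(), channel.max()
    return ((channel - lo) * 255.0 / max(hi - lo, 1)).astype(np.uint8)

def global_histogram_equalization(channel):
    """Map grey levels through the normalised cumulative histogram."""
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum() / channel.size
    return (cdf[channel] * 255).astype(np.uint8)

rng = np.random.default_rng(0)
dull = rng.integers(90, 160, (32, 32)).astype(np.uint8)   # low-contrast test image
print(dull.min(), dull.max(), "->", global_stretch(dull).min(), global_stretch(dull).max())
print(np.unique(global_histogram_equalization(dull)).size, "distinct output levels")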
