Lecture 2 Video Formation and Representation


2013 Spring Term 1 Lecture 2 Video Formation and Representation Wen-Hsiao Peng ( 彭文孝 ) Multimedia Architecture and Processing Lab (MAPL) Department of Computer Science National Chiao Tung University 1 Mar. 2013 Hsinchu, Taiwan

Preface 2 The previous lecture talked about what light is and how it is perceived by our visual system to initiate color vision. In this lecture, we shall look at methods for capturing and representing video signals.

Video Signal 3 When we refer to a video, we are actually referring to a sequence of moving images, each of which is the perspective projection of a 3-D scene onto a 2-D image plane. This drawing by Dürer clearly conveys the idea of perspective projection. Normally, we refer to a point in the image plane as a pixel or a pel, especially when we talk about digital imagery.

Color Video Camera 4 This block diagram shows the typical imaging pipeline in a video camera. As can be seen, to capture color information there are three types of sensors, each with a frequency response determined by the color matching functions of the chosen primary.

Color Video Camera 5 Most cameras nowadays use CCD or CMOS sensors for digital color imaging. Normally, with these sensors, only one color value can be sampled at each point, and the sampling pattern is usually 50% green, 25% red, and 25% blue. Green has a higher sampling rate because, as we saw in the first lecture, it carries most of the brightness information. To get a complete set of RGB values for each point, interpolation is required. Recently, advanced sensors have appeared that can acquire all three color values at a single point without interpolation.
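
The sampling pattern described here is the familiar Bayer pattern. As a minimal sketch (Python/NumPy; not part of the lecture material), tiling a 2x2 Bayer cell over the sensor confirms the 50/25/25 split:

```python
import numpy as np

# A 2x2 Bayer tile: green at two diagonal sites, red and blue at the others.
# Tiling it over the sensor yields 50% green, 25% red, 25% blue samples.
BAYER_TILE = np.array([["G", "R"],
                       ["B", "G"]])

def bayer_fractions(height, width):
    """Fraction of sensor sites assigned to each color for an HxW sensor
    (H and W assumed even)."""
    pattern = np.tile(BAYER_TILE, (height // 2, width // 2))
    total = pattern.size
    return {c: np.count_nonzero(pattern == c) / total for c in "RGB"}

print(bayer_fractions(480, 640))  # {'R': 0.25, 'G': 0.5, 'B': 0.25}
```

Recovering full RGB at every site is then an interpolation (demosaicing) step, e.g. bilinear filtering of each color plane.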

Color Video Camera 6 For more efficient processing and transmission, most cameras will further convert the captured RGB values into more independent luminance and chrominance information.

Progressive and Interlaced Scan 7 This slide presents two different ways of sampling a video signal: progressive sampling and interlaced sampling. With progressive sampling, the video signal is sampled as a sequence of complete video frames, just as you would normally expect from sampling. With interlaced sampling, we keep only half of the information in a complete frame each time; that is, we sample only the even-numbered lines at one time instant and then the odd-numbered lines at the next. The pictures so obtained are called field pictures. The field containing the first and following alternating lines is referred to as the top field, and that containing the second and following alternating lines as the bottom field.

Progressive and Interlaced Scan 8 Since field pictures have a lower vertical resolution, they are normally sampled twice as frequently as frame pictures along the temporal dimension. That is, at the same data rate, we can send twice as many field pictures as the number of frame pictures in a progressive sequence. As a result, an interlaced sequence tends to have smoother motion when played back. This is the motivation for using interlaced sampling.

Progressive and Interlaced Scan 9 However, the downside of interlaced sampling is that visual artifacts may appear when the scene contains fast-moving objects. In this case, you can observe zig-zag or feather-like artifacts along the vertical edges of objects. These arise because, when the top field and the bottom field are displayed together in the form of a complete video frame, images captured at different time instants are blended together. It is important to remember that these field pictures are actually separated in time.

Progressive and Interlaced Scan 10 To alleviate the artifacts, a de-interlacing algorithm is usually employed to convert field pictures into frame pictures before playback.

Analog Video Raster 11 This slide describes the mechanism for video capture and display in the early days, when analog cameras were in use. As illustrated by this figure, analog cameras capture a video signal by continuously and periodically scanning an image region from top to bottom. Different lines are scanned at slightly different times, and the scan format can be either progressive or interlaced. Along contiguous scan lines, the intensity values are recorded as a 1-D waveform, which is known as a raster scan. This figure shows a typical waveform of such a raster scan signal.

Analog Video Raster 12 In general, a raster is characterized by two basic parameters: the frame rate (frames/second) and the line number. The frame rate defines the temporal sampling rate of a raster, while the line number indicates the vertical sampling rate. From these parameters, we can further derive other parameters, such as the line rate (lines/second), the line interval, and the frame interval. Notice that the 1-D raster signal is periodically set to a constant level to indicate when the display device should retrace its beam horizontally or vertically to begin displaying a new line or a new field.

Spectrum & Signal Bandwidth 13 This and the following slides talk about the spectrum of the 1-D raster signal and its bandwidth estimation. I will skip this part; for details, please refer to Wang's book.

Analog Color TV Systems 14 This table compares the three major analog TV systems used worldwide. Please refer to Wang's book for a more detailed exposition. [Note: Taiwan's over-the-air TV networks have gone digital since May 2012, but most households subscribe to cable TV, whose signals remain analog.]

Digital Video (1/2) 15 A digital video can be obtained either by sampling a raster scan or by sampling the scene directly with a digital video camera. Like an analog video, a digital video is defined by a few parameters, such as the frame rate, the number of lines per frame, the number of samples per line, and the bit depth, which denotes the number of bits used to represent a pixel value. The raw data rate of a digital video can be computed as the product of these parameters, in units of bits per second.

Digital Video (2/2) 16 Conventionally, the luminance or each of the three color values is specified with 8 bits; so Nb is equal to 8 for a monochrome video and 24 for a color video. However, in cases where the chrominance components have a different sampling resolution (spatial and temporal) from that of the luminance, Nb should reflect the equivalent number of bits used for each pixel at the luminance resolution. Two other important parameters are the image aspect ratio and the pixel aspect ratio. The pixel aspect ratio indicates the ratio of the width to the height of the physical rectangular area used for rendering a pixel.
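
This bookkeeping is easy to sketch in code (Python; the parameter names are my own, not from the lecture). With each chrominance plane subsampled by some factor in each direction, the equivalent bits per pixel at the luminance resolution is:

```python
def equivalent_bit_depth(bits=8, chroma_h=1, chroma_v=1):
    """Equivalent bits per pixel at the luminance resolution, when each of
    the two chrominance planes is subsampled by chroma_h horizontally and
    chroma_v vertically (chroma_h = chroma_v = 1 means no subsampling)."""
    return bits + 2 * bits / (chroma_h * chroma_v)

print(equivalent_bit_depth())                        # 24.0 (full-resolution color)
print(equivalent_bit_depth(chroma_h=2))              # 16.0 (4:2:2)
print(equivalent_bit_depth(chroma_h=2, chroma_v=2))  # 12.0 (4:2:0)
```

Multiplying this Nb by the frame rate, lines per frame, and samples per line then gives the raw data rate in bits per second.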

ITU-R BT.601 (1/2) 17 ITU-R BT.601 is a standard format used to represent different analog TV video signals (NTSC, PAL, SECAM). It specifies how to convert a 1-D raster scan into a digital video by sampling. The sampling rate is chosen to meet two constraints: (1) the horizontal and vertical sampling intervals should match, and the same rate should be used for NTSC and PAL/SECAM; (2) the rate should be a multiple of their respective line rates, so that each line has an integer number of samples. Constraint (1) leads to 11 MHz for NTSC and 13 MHz for PAL/SECAM; constraint (2) requires a multiple of the least common multiple of the line rates (15,750 and 15,625 lines/s). A number that satisfies both constraints is 13.5 MHz.

ITU-R BT.601 (2/2) 18 With this sampling rate, we will have 858 samples per line for NTSC and 864 samples per line for PAL/SECAM. The resulting formats are shown in these figures. It is noteworthy that some pixels fall in the so-called non-active area; they correspond to signal samples for the horizontal or vertical retrace and are thus not intended for display. So the true display resolution is 480 or 576 lines per frame, depending on whether the signal is NTSC or PAL, and both have 720 pixels per line. A digital video with either of these resolutions is often called a Standard-Definition (SD) video.
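
These samples-per-line figures can be checked directly. A sketch (Python; note my assumption that the NTSC color line rate is 525 lines x 30000/1001 frames/s, i.e. about 15,734 lines/s, rather than the nominal 15,750 quoted above):

```python
FS = 13.5e6  # BT.601 luminance sampling rate in Hz

# Line rates: NTSC color uses 525 lines at 30000/1001 frames/s,
# PAL/SECAM uses 625 lines at 25 frames/s.
ntsc_line_rate = 525 * 30000 / 1001   # ~15734.27 lines/s
pal_line_rate = 625 * 25              # 15625 lines/s

print(round(FS / ntsc_line_rate))     # 858 samples per line
print(FS / pal_line_rate)             # 864.0 samples per line
```

Of these, only 720 samples per line fall in the active area in both systems; the rest lie in the retrace intervals.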

Digital Video Formats (1/2) 19 This table summarizes some common digital video formats, along with their main applications and compression methods. The rightmost column gives their raw data rates, to indicate how much bandwidth each would take if transmitted without any compression. As an example, an SD video with 4:2:0 color sampling has a raw data rate of 124 Mbps, which is roughly the bandwidth limit that can be supported by the best Wi-Fi technology we have today. With MPEG-2 compression, it is possible to reduce the bit rate to 4-8 Mbps, which is equivalent to a 10-30x compression ratio.
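
The 124 Mbps figure can be reproduced directly. A sketch, assuming 720x480 luminance samples at 30 frames/s and 12 equivalent bits per pixel for 4:2:0:

```python
# Raw data rate of SD (BT.601-resolution) video with 4:2:0 subsampling.
samples_per_line = 720
lines_per_frame = 480
frame_rate = 30
bits_per_pixel = 12  # 8 (Y) + 2 * 8/4 (Cb, Cr each at quarter resolution)

raw_bps = samples_per_line * lines_per_frame * frame_rate * bits_per_pixel
print(raw_bps / 1e6)  # 124.416 Mbps

# Compression ratios implied by MPEG-2 broadcast rates of 4-8 Mbps:
for coded_mbps in (8, 4):
    print(f"{coded_mbps} Mbps -> {raw_bps / (coded_mbps * 1e6):.0f}x")
```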

Digital Video Formats (2/2) 20 At the top of this table are the two popular HD formats, which have been widely used for HDTV as well as smartphone video. They are usually referred to as 720p or 1080p video, according to the number of lines in height. The suffix p means progressive sampling; we use the suffix i when referring to interlaced sampling. 1080p video is also known as Full HD video. The SIF/CIF/QCIF formats were quite popular 10 years ago but are gradually being phased out.

High Definition and Ultra High Definition 21 This chart compares the resolutions of different video formats. In particular, the green/purple and dark blue areas show the sizes of the so-called Ultra High Definition formats, which are going to be the formats for next-generation digital video. Roughly, UHD video has a spatial resolution that is 4 or 16 times the 1080p resolution. UHD video is the target application for the newly developed H.265/HEVC codec.

Color Coordinates 22 From the previous lecture, we learned that for video capture and display we would mostly choose the RGB primary. One disadvantage of the RGB primary, however, is that it mixes the luminance and chrominance attributes of a light. In many applications, it is desirable to separate this information. For example, our visual system has been found to be less sensitive to color than to brightness, so it is possible to represent a color image more efficiently by representing the chrominance at a lower resolution than the luminance. This calls for color space conversion.

Color Space Conversion (1/2) 23 This slide presents the YUV primary, where Y represents the brightness information and U, V collectively characterize the color information (represented as color differences). The conversion between YUV and RGB values is given by this matrix multiplication, which should be of no surprise to you: we learned in the first lecture that the tristimulus values for different primaries must be linearly related. You can see that the green value dominates the brightness computation, which agrees with our previous observation that green carries most of the brightness information.

Color Space Conversion (2/2) 24 Another observation is that the coefficients of the first row add up to 1 whereas those of the other two rows add up to 0. So, if the RGB values are all identical, both U and V components will be zero; in this case, we get a gray image with NO color. YCbCr is another widely used primary; it is part of the BT.601 standard. YCbCr values are scaled and shifted versions of YUV values.
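
These row-sum properties are easy to verify numerically. A sketch using one commonly quoted RGB-to-YUV matrix (the exact coefficients on the slide may differ slightly):

```python
import numpy as np

# RGB -> YUV conversion matrix (commonly quoted analog-video values).
M = np.array([[ 0.299,  0.587,  0.114],   # Y: brightness, G dominates
              [-0.147, -0.289,  0.436],   # U: scaled version of B - Y
              [ 0.615, -0.515, -0.100]])  # V: scaled version of R - Y

# The first row sums to 1, the two chrominance rows sum to 0 ...
assert abs(M[0].sum() - 1.0) < 1e-6
assert abs(M[1].sum()) < 1e-3 and abs(M[2].sum()) < 1e-3

# ... so any gray input (R = G = B) has zero chrominance.
gray = np.array([0.5, 0.5, 0.5])
y, u, v = M @ gray
print(y, u, v)  # 0.5, ~0, ~0
```

YCbCr values would then be obtained by scaling and shifting Y, U, V into the integer ranges defined by BT.601.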

RGB vs. YCbCr 25 This slide compares the RGB and YCbCr representations of a color image by showing each of these components as a separate image. It can be seen that, with the RGB representation, all three components appear equally important; that is, they all contain rich textural information. By contrast, the signals of the Cb and Cr components are much smoother than that of the Y component.

Chrominance Subsampling (1/2) 26 This result motivates the use of chrominance subsampling; that is, representing the Cb and Cr components at a lower resolution than the Y component. This and the following slides present the three commonly used subsampling formats. The first one is called 4:4:4, with which both Cb and Cr have exactly the same resolution as Y; in other words, there is no chrominance subsampling.

Chrominance Subsampling (2/2) 27 The second one is 4:2:2, which halves the number of chrominance samples horizontally. The code 4:2:2 suggests that for every FOUR Y samples, there are TWO Cb samples and TWO Cr samples. The last format is 4:2:0, with which Cb and Cr each have half the horizontal and vertical resolution of Y. The name 4:2:0 was chosen historically as a particular code to identify this format; logically, it might make more sense to call it 4:1:1, i.e., for every FOUR Y samples there are ONE Cb sample and ONE Cr sample. This last slide gives a visual comparison between the 4:4:4 and 4:2:0 formats; as you can tell, the difference in color appearance actually looks quite small.
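
A minimal sketch of 4:2:0 subsampling (averaging each 2x2 chrominance block is one simple choice of downsampling filter; real systems allow various filters and sample positions):

```python
import numpy as np

def subsample_420(y, cb, cr):
    """4:2:0: keep Y at full size; halve Cb and Cr both horizontally and
    vertically by averaging each 2x2 block (dimensions assumed even)."""
    def down2(c):
        h, w = c.shape
        return c.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, down2(cb), down2(cr)

y = np.zeros((480, 720))
cb = np.ones((480, 720))
cr = np.ones((480, 720))
y2, cb2, cr2 = subsample_420(y, cb, cr)
print(cb2.shape)  # (240, 360): each chroma plane keeps a quarter of its samples
```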

Objective Quality Measure (1/4) 28 Now let us talk about video quality measures. In designing a video compression algorithm, one often needs to measure the distortion caused by the compression. One commonly used distortion measure is the mean squared error (MSE) between the original and the processed video sequences. Its value is quite easy to obtain: simply compute the squared error between every pair of corresponding pixels and then take the average over all pixels. For a color video, we compute the MSE separately for each color component.

Objective Quality Measure (2/4) 29 Another popular distortion measure, which is actually used more often than MSE, is the Peak Signal-to-Noise Ratio (PSNR) in decibels (dB). It converts the MSE into another number by taking into account the peak value of the video signal. The exact formula is given here, where psi_max denotes the maximum value of the video signal and is equal to 255 for the most common 8-bit video. The reason PSNR is preferred to MSE is that people tend to associate image quality with a certain range of PSNR. As a rule of thumb, a PSNR higher than 40 dB typically indicates an excellent image, 30-40 dB usually means a good image, and 20-30 dB suggests a poor-quality image.
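
Both measures take only a few lines of code. A sketch for 8-bit images (psi_max = 255):

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images (or whole sequences)."""
    return np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB; peak = 255 for 8-bit video."""
    return 10.0 * np.log10(peak ** 2 / mse(a, b))

orig = np.full((64, 64), 128, dtype=np.uint8)
noisy = orig.copy()
noisy[::2, ::2] += 4           # perturb a quarter of the pixels by 4
print(mse(orig, noisy))        # 4.0
print(round(psnr(orig, noisy), 1))
```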

Objective Quality Measure (3/4) 30 We also very often compute the average per-frame PSNR. Unlike the previous definition, the average per-frame PSNR computes the PSNR between every pair of corresponding frames and then averages the obtained values over the individual frames. In general, the PSNR so computed is different from the result obtained with the previous definition.
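
The two definitions indeed give different numbers. A sketch with a hypothetical two-frame sequence, one frame distorted lightly and the other heavily (since PSNR is a concave function of MSE, the per-frame average is never smaller than the whole-sequence value):

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """PSNR in dB over all the samples of a and b."""
    err = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / err)

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(2, 32, 32))   # a two-frame "sequence"
test = ref.copy()
test[0] = np.clip(ref[0] + 2, 0, 255)          # lightly distorted frame
test[1] = np.clip(ref[1] + 20, 0, 255)         # heavily distorted frame

per_frame = np.mean([psnr(ref[i], test[i]) for i in range(2)])
whole_seq = psnr(ref, test)                    # PSNR of the pooled MSE
print(per_frame, whole_seq)                    # the two values differ
```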

Objective Quality Measure (4/4) 31 Although these measures have been used almost exclusively, it is well known that they do not correlate very well with our perception. This slide gives an example. You can probably see that the left picture looks visually more acceptable, but it in fact has a much larger MSE value than the picture on the right, which exhibits obvious blocking artifacts.

Subjective Quality Measure (1/4) 32 Thus, on some serious occasions, e.g., when choosing an initial technology for a new standard, we do rely on human observers to rate the video quality. One person's judgment of visual quality can be very subjective and is influenced by many factors. The standard ITU-R BT.500-11 therefore defines several procedures for subjective quality evaluation.

Subjective Quality Measure (2/4) 33 The first method is known as the Double Stimulus Continuous Quality Scale. With this method, an observer is presented with a pair of video sequences, one being the original sequence (the reference) and the other its impaired version; the order is randomized. He/she is then asked to grade each of these sequences by marking on a continuous line ranging from Excellent to Bad. The results are converted into a single mean opinion score indicating the relative quality between the reference and the test sequences.

Subjective Quality Measure (3/4) 34 The second method is known as the Double Stimulus Impairment Scale. In this method, the reference sequence is presented prior to its impaired version, and the observer is asked to grade the impaired sequence as compared to its reference. He/she needs to indicate whether the difference between the two sequences is imperceptible or perceptible and, if perceptible, the degree to which the difference is annoying. This method is more suitable for cases where the impairments in the test sequences are small.

Subjective Quality Measure (4/4) 35 The last method is called Single Stimulus Continuous Quality Evaluation. It is designed to address situations where the quality of the test sequence fluctuates widely. With this method, the observer is required to grade the test sequence continuously, without a source reference, and a device is used to record the continuous quality assessment from the observer.

Bjontegaard Metric 36 The Bjontegaard metric compares the performance of two coding algorithms by computing numerical averages between their RD curves: the average PSNR improvement, or the average percent bitrate saving.

Average PSNR improvement (BD-PSNR) 37 Intuitively, we would obtain the average PSNR improvement by computing the area between the two curves over the bit rate interval where they overlap and dividing it by the length of that interval. However, it was found more appropriate to do the integration on a logarithmic scale of bit rate. This has the effect of weighting the PSNR differences at higher bit rates more lightly.

Average PSNR improvement (BD-PSNR) 38 A polynomial function of order 3 is used to fit each RD curve, so that the area between them can be approximated with a closed-form formula. [Figure: the two interpolated RD curves, PSNR (p) versus bit rate on a log scale (r), integrated between r_low and r_high.]
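
A sketch of this computation (Python/NumPy; the RD points below are hypothetical, and real implementations typically follow Bjontegaard's original formulation rather than this simplified version):

```python
import numpy as np

def bd_psnr(rate1, psnr1, rate2, psnr2):
    """Bjontegaard delta-PSNR: fit each RD curve with a cubic polynomial
    in log10(bitrate), integrate both fits over the overlapping rate
    interval, and return the average vertical gap (curve 2 minus curve 1)
    in dB."""
    lr1, lr2 = np.log10(rate1), np.log10(rate2)
    p1 = np.polyfit(lr1, psnr1, 3)
    p2 = np.polyfit(lr2, psnr2, 3)
    lo = max(lr1.min(), lr2.min())
    hi = min(lr1.max(), lr2.max())
    # Closed-form integration via the polynomial antiderivatives.
    int1 = np.polyval(np.polyint(p1), hi) - np.polyval(np.polyint(p1), lo)
    int2 = np.polyval(np.polyint(p2), hi) - np.polyval(np.polyint(p2), lo)
    return (int2 - int1) / (hi - lo)

# Hypothetical RD points (kbps, dB) for two codecs; codec 2 is uniformly
# 1 dB better, so the metric should report about +1 dB.
r1 = np.array([200, 400, 800, 1600]); q1 = np.array([30.0, 33.0, 36.0, 39.0])
r2 = np.array([200, 400, 800, 1600]); q2 = np.array([31.0, 34.0, 37.0, 40.0])
print(bd_psnr(r1, q1, r2, q2))  # ~1.0 dB
```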

Average percent bitrate saving (BD-Rate) 39 The average percent bitrate saving is obtained analogously, by doing the integration along the PSNR axis.