A Big Umbrella
- Content creation: produce the media; compress it to a format that is portable and deliverable
- Distribution: how the message arrives is often as important as what the message is
- Search: finding the information you need
- Protection: we care about privacy and security, ownership and digital rights
The four are tangled together.
Goal of This Course
Understand various aspects of a modern multimedia pipeline:
- Content creation and editing
- Distribution
- Search & mining
- Protection
Hands-on experience with hot media trends.

A Multimedia System
Digital Data Acquisition
Source: analog → Output: digital
Two steps:
- Sampling: take samples at times nT, where T is the sampling period and f_s = 1/T is the sampling frequency (e.g., f_s = 10 Hz → T = 0.1 second)
- Quantization: map the amplitude values into a set of discrete values
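As a concrete illustration of the quantization step, here is a minimal sketch of a uniform scalar quantizer (the function name and the 8-level, [-1, 1] setup are illustrative choices, not from the lecture):

```python
def quantize(x, levels=8, lo=-1.0, hi=1.0):
    """Map an amplitude x in [lo, hi] to the nearest of `levels` evenly
    spaced discrete values (uniform scalar quantization)."""
    step = (hi - lo) / (levels - 1)
    index = round((x - lo) / step)   # nearest reconstruction level
    return lo + index * step

# Continuous amplitudes collapse onto a small discrete set:
print([round(quantize(s), 4) for s in [0.03, -0.41, 0.97, 0.5]])
```

With 8 levels, every input lands on one of the values -1, -5/7, ..., 5/7, 1; more levels mean less quantization error but more bits per sample.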
Sampling Theorem
A signal can be reconstructed from its samples if the original signal has no frequencies above 1/2 the sampling frequency. The minimum sampling rate for a band-limited function is called the Nyquist rate.
This means T (or f_s) depends on the signal's frequency range: a fast-varying signal should be sampled more frequently. Speech: f_s > 8 kHz; music: f_s > 44 kHz.

Before and After Sampling
[Figure: time domain vs. frequency domain. The original spectrum is band-limited to ±f_M; sampling at f_s = 1/T duplicates the spectrum at every multiple k·f_s, scaled by 1/T.]
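The aliasing consequence of the theorem can be checked numerically; this is a hedged sketch (`alias_of` is an illustrative helper, not lecture notation) that folds a tone's frequency into the representable band [0, f_s/2]:

```python
def alias_of(f, fs):
    """Apparent frequency (Hz) of a pure tone f when sampled at rate fs:
    sampling folds every frequency into the band [0, fs/2]."""
    f_mod = f % fs
    return min(f_mod, fs - f_mod)

print(alias_of(3, 10))   # 3: fs = 10 Hz >= 2*3 Hz, no aliasing
print(alias_of(7, 10))   # 3: fs < 2*7 Hz, so the 7 Hz tone aliases to 3 Hz
```

A 7 Hz tone and a 3 Hz tone produce identical samples at f_s = 10 Hz, which is exactly the overlap of spectral copies described above.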
Reconstruction (Frequency-Domain View)
Case f_s >= 2·f_M: the spectral copies of the sampled signal (at multiples of f_s, scaled by 1/T) do not overlap. Ideal reconstruction multiplies the sampled spectrum by an ideal low-pass filter of gain T over [-f_s/2, f_s/2], and the reconstructed signal equals the original signal.
Case f_s < 2·f_M: the spectral copies overlap; after the same low-pass filtering, the reconstructed signal does not equal the original signal. This is aliasing, due to an insufficient sampling rate.
[Figures: original spectrum on ±f_M; sampled spectrum with copies at multiples of f_s; low-pass reconstruction filter; reconstructed spectrum.]
Definition of an Image
Think of an image as a function f from R^2 to R: f(x, y) gives the intensity at position (x, y). Realistically, we expect the image to be defined only over a rectangle, with a finite range: f: [a,b] × [c,d] → [0,1].
A color image is just three such functions pasted together: the (R, G, B) components.
[Figure: a grayscale image viewed as a surface f(x, y).]
24-bit Color Images
Each pixel is represented by three bytes, usually the RGB components (one byte for each of R, G, B). That gives 256 × 256 × 256 combined values, a total of 16,777,216 possible colors. However, such flexibility does come with a storage penalty: a 640 × 480 24-bit color image requires 921.6 kB of storage without any compression.

Defining Colors via RGB
Trichromatic color-mixing theory: any color can be obtained by mixing three primary colors in the right proportion. The primary colors for illuminating sources are red, green, and blue (RGB); a CRT works by exciting red, green, and blue phosphors with separate electron guns. R + G + B = white. Used in digital images.
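The storage and color-count figures quoted above are easy to verify:

```python
# Verify the numbers quoted above for an uncompressed 640x480 24-bit image.
width, height, bytes_per_pixel = 640, 480, 3
print(width * height * bytes_per_pixel)   # 921600 bytes = 921.6 kB
print(256 ** 3)                           # 16777216 possible colors
```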
A Multimedia System

Redundancy in Media Data
Media (speech, audio, image, video) are not random collections of signals; they exhibit similar structure in local neighborhoods.
- Temporal redundancy: the current and the next signal values are very similar (smooth media: speech, audio, video)
- Spatial redundancy: the pixel intensities and colors in local regions are very similar
- Spectral redundancy: when the data is mapped into the frequency domain, a few frequencies dominate over the others
Lossless Compression
Compresses the signal while still being able to reproduce the exact original. Used for archival purposes, and often for medical imaging and technical drawings. New binary codes are assigned to symbols based on the frequency of occurrence of the symbols in the message.
- Example 1: Run-length encoding (BMP, PCX): BBBBEEEEEEEECCCCDAAAAA → 4B8E4C1D5A
- Example 2: Lempel-Ziv-Welch (LZW): adaptive dictionary; dynamically creates a dictionary of strings to represent messages efficiently. Used in GIF & TIFF.
- Example 3: Huffman coding: the length of the codeword representing a symbol (or value) scales inversely with the probability of the symbol's appearance. Used in PNG, MNG, TIFF.

Lossy Compression
The compressed signal, after decompression, does not match the original signal: compression introduces some distortion. Suitable for natural images such as photos, in applications where minor (sometimes imperceptible) loss of fidelity is acceptable in exchange for a substantial reduction in bit rate.
Types:
- Color-space reduction: reduce 24 → 8 bits via a color lookup table
- Chrominance subsampling: from 4:4:4 to 4:2:2, 4:1:1, or 4:2:0; the eye perceives spatial changes in brightness more sharply than changes in color, so some of the chrominance information can be averaged or dropped
- Transform coding (or perceptual coding): a transform (DCT, wavelet) followed by quantization and entropy coding. Today's focus.
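The run-length example above can be reproduced in a few lines (a sketch of the idea only; real RLE file formats such as PCX encode runs differently):

```python
from itertools import groupby

def rle_encode(s):
    """Replace each run of identical symbols with (count, symbol)."""
    return "".join(f"{len(list(group))}{symbol}" for symbol, group in groupby(s))

print(rle_encode("BBBBEEEEEEEECCCCDAAAAA"))   # 4B8E4C1D5A
```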
A Typical Image Compression System
- Transformation: transform the original data into a new representation that is easier to compress (for images: DCT plus zigzag ordering)
- Quantization: use a limited number of levels to represent the signal values (scalar quantization)
- Binary encoding: find an efficient way to represent these levels using binary bits (run-length and Huffman coding; DC coefficients: prediction + Huffman; AC coefficients: run-length + Huffman)

Coding Color Images
Color images are typically stored in (R,G,B) format, and the JPEG standard can be applied to each component separately. But this makes use of neither the correlation between color components nor the lower sensitivity of the human eye to chrominance samples.
Alternate approach:
- Convert the (R,G,B) representation to a YCbCr representation (Y: luminance; Cb, Cr: chrominance)
- Down-sample the two chrominance components, because the peak response of the eye to the luminance component occurs at a higher frequency than its response to the chrominance components
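The RGB-to-YCbCr conversion step can be sketched as follows, assuming ITU-R BT.601-style full-range coefficients (the exact constants vary between standards, so treat these as illustrative):

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to luminance Y and chrominance Cb, Cr."""
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b + 128
    cr =  0.500 * r - 0.419 * g - 0.081 * b + 128
    return y, cb, cr

# Pure white: maximum luminance, neutral chrominance.
y, cb, cr = rgb_to_ycbcr(255, 255, 255)
print(round(y), round(cb), round(cr))   # 255 128 128
```

Because the chrominance planes carry less perceptually important detail, they are the ones down-sampled in the next step.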
Chrominance Subsampling

Key Concepts of Video Compression
- Temporal prediction (INTER mode): predict a new frame from a previous frame and specify only the prediction error, which is coded with an image coding method (e.g., DCT-based, as in JPEG). Prediction errors have smaller energy than the original pixel values and can be coded with fewer bits.
- Motion compensation to improve prediction: use motion-compensated temporal prediction to account for object motion.
- INTRA-frame coding (INTRA mode): regions that cannot be predicted well are coded directly using a DCT-based method.
- Spatial prediction: use spatial directional prediction to exploit spatial correlation (H.264).
- Work on each macroblock (MB, 16×16 pixels) independently for reduced complexity: motion compensation is done at the MB level, and DCT coding of the error at the block level (8×8 pixels or smaller).
Together, this is block-based hybrid video coding.
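A minimal sketch of the 4:2:0 chrominance subsampling named above, assuming we average each 2×2 block of a chroma plane (averaging is one common choice; codecs may instead simply drop samples):

```python
def subsample_420(chroma):
    """Halve a chroma plane in both dimensions by averaging 2x2 blocks.
    `chroma` is a 2D list of sample values with even width and height."""
    h, w = len(chroma), len(chroma[0])
    return [[(chroma[y][x] + chroma[y][x + 1]
              + chroma[y + 1][x] + chroma[y + 1][x + 1]) // 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

cb_plane = [[100, 102, 50, 52],
            [104, 106, 54, 56]]
print(subsample_420(cb_plane))   # [[103, 53]]: 4x fewer chroma samples
```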
Different Prediction Modes
- Intra: coded directly
- Predictive: predicted from a previous frame
- Bidirectional: predicted from a previous frame and a following frame
Prediction can be done at the frame or block level.
MPEG Frame Arrangement

A Typical Video Compression System
- Transformation: transform the original data into a new representation that is easier to compress (temporal prediction with motion compensation for P and B frames; spatial prediction for I frames)
- Quantization: use a limited number of levels to represent the signal values (scalar or vector quantization)
- Binary encoding: find an efficient way to represent these levels using binary bits (fixed-length or variable-length codes: run-length coding, Huffman coding)
A Typical Speech Compression System
- Transformation: temporal prediction (transform the original data into a representation that is easier to compress)
- Quantization: scalar or vector quantization (a limited number of levels to represent the signal values)
- Binary encoding: fixed-length or variable-length codes (run-length coding, Huffman coding)

Compressing Speech via Temporal Prediction
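The temporal-prediction step can be sketched with first-order prediction: predict each sample by the previous one and keep only the error (the sample values below are invented for illustration):

```python
def difference_code(samples):
    """First-order temporal prediction: send the first sample as-is,
    then only the prediction error against the previous sample."""
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - prev)
    return out

speech = [1000, 1003, 1006, 1004, 1001, 998, 995, 997]   # smooth signal
diff = difference_code(speech)
print(diff)   # [1000, 3, 3, -2, -3, -3, -3, 2]
# The errors stay within +/-3 while the raw samples sit near 1000,
# so the errors need far fewer bits per sample to encode.
```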
Demo Results
[Figures: the original signal and its histogram; the difference signal and its histogram.]
The difference signal has a much smaller range → easier to encode.

Your Ear as a Filterbank
The auditory system can be roughly modeled as a filterbank consisting of 25 overlapping bandpass filters spanning 0 to 20 kHz. The ear cannot distinguish sounds within the same band that occur simultaneously; each band is called a critical band. The bandwidth of a critical band is about 100 Hz for signals below 500 Hz, and increases linearly above 500 Hz, up to 5000 Hz. 1 bark = the width of 1 critical band.
Threshold in Quiet
The audible level at each frequency: the minimum sound level that an average ear with normal hearing can detect with no other sound present. A frequency band needs to be coded only if its sound level is above the corresponding threshold.
[Figure: sound level (dB) vs. frequency, showing the threshold-in-quiet curve.]

Frequency Masking
When two sound frequencies are present in the signal simultaneously, the presence of one might hide the perception of the other; this is also known as simultaneous masking. A weak noise (the maskee) can be made inaudible by a simultaneously occurring stronger signal (the masker), e.g., a pure tone, if the masker and maskee are close enough to each other in frequency.
[Figure: a 1 kHz tone at 60 dB raises a masking threshold well above the threshold in quiet near 1 kHz.]
A Multimedia System

Application Architectures (2: Application Layer)
- Client-server, including data centers / cloud computing
- Peer-to-peer (P2P)
- Hybrid of client-server and P2P
Ways to Distribute Videos
- Single server, single (or many) clients: not scalable
- IP multicast: requires uniform router hardware
- Content delivery networks (CDNs): expensive; serve small-size, highly popular data
- Application end points (pure/hybrid P2P): unstable, popularity-driven

Client-Server Architecture
Server: an always-on host with a permanent IP address; server farms for scaling.
Clients: communicate with the server; may be intermittently connected; may have dynamic IP addresses; do not communicate directly with each other.
Pure P2P Architecture
No always-on server; arbitrary end systems communicate directly; peers are intermittently connected and change IP addresses. Highly scalable, but difficult to manage.

Hybrid of Client-Server and P2P
Skype, a voice-over-IP P2P application: a centralized server finds the address of the remote party; the client-client connection is direct (not through the server).
Instant messaging: chatting between two users is P2P, while presence detection/location is a centralized service. A user registers its IP address with the central server when it comes online, and contacts the central server to find the IP addresses of buddies.
Media over IP (Internet): Making It Work
- Use UDP to avoid TCP congestion control and the delay associated with it; this is required for time-sensitive media traffic.
- Use RTP/UDP to enable QoS monitoring: the sender and receiver can record the number of packets sent/received and adjust their operations accordingly.
- The client side uses an adaptive playout delay to compensate for the delay (and the jitter).
- The server side matches the stream bandwidth to the available client-to-server path bandwidth, either by choosing among pre-encoded stream rates or by encoding at a dynamic rate.
- Error recovery (on top of UDP): FEC and/or interleaving; retransmissions (time permitting); unequal error protection (duplicate important parts); error concealment (interpolate from nearby data).

Image and Video Are Vulnerable to Losses
Assuming a conventional MPEG-like system (MC prediction, block DCT, run-length and Huffman coding), losses create two types of problems:
- Loss of bit-stream synchronization: the decoder does not know which bits correspond to which parameters (e.g., an error in a Huffman codeword).
- Incorrect state and error propagation: the decoder's state differs from the encoder's, leading to incorrect predictions and error propagation (e.g., an error in MC prediction or DC-coefficient prediction).
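One of the simplest FEC schemes alluded to above can be sketched with XOR parity: send one parity packet per group of media packets, so any single lost packet can be rebuilt from the survivors (the packet contents here are made up for illustration):

```python
def xor_parity(packets):
    """XOR a list of equal-length packets byte-by-byte into one packet."""
    parity = bytes(len(packets[0]))
    for p in packets:
        parity = bytes(a ^ b for a, b in zip(parity, p))
    return parity

group = [b"med", b"ia!", b"udp"]
parity = xor_parity(group)             # transmitted alongside the group
# If exactly one packet (here group[1]) is lost, XORing the survivors
# with the parity packet recovers it:
recovered = xor_parity([group[0], group[2], parity])
print(recovered)   # b'ia!'
```

The trade-off is bandwidth (one extra packet per group) against the ability to repair a loss without waiting for a retransmission.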
Layered Solution
Use a layered representation; receivers decide. Layers are added and dropped to adjust to the appropriate target rate.
[Figure: server S streams to receivers at rates R1, R2, R3.]

Error Concealment for Video
- Repeat pixels from the previous frame: effective when there is no motion; potential problems when there is motion.
- Interpolate pixels from the neighboring region: correctly recovering missing pixels is extremely difficult, but even correctly estimating the DC (average) value is very helpful.
- Interpolate motion vectors from the previous frame: can use the coded motion vector, a neighboring motion vector, or compute a new motion vector.
A Multimedia System

What Is a Watermark?
A watermark is a secret message that is embedded into a cover message. Usually, only knowledge of a secret key allows us to extract the watermark. It has a mathematical property that allows us to argue that its presence is the result of deliberate action.
The effectiveness of a watermark is a function of its stealth, resilience, and capacity.
Watermark Encoding and Decoding
[Figure, encoding: the original image, the watermark S, and the user key K enter the encoder, which produces the watermarked image. Figure, decoding: the watermarked image, the original image, and the user key K enter the decoder, which produces a watermark X; test whether S = X.]
Various Categories of Watermarks
- By method of insertion: additive; quantize-and-replace
- By domain of insertion: transform domain; spatial domain
- By method of detection: private (requires the original image); public, or oblivious (does not require the original)
- By security type: robust (survives image manipulation); fragile (detects manipulation, for authentication)

Embedding Watermarks, Method 1: Spatial-Domain Least-Significant-Bit (LSB) Modification
Simple but not robust: replace the least significant bit of an image pixel's value with your watermark pixel value (0 or 1).
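Method 1 is short enough to sketch directly (the pixel and bit values below are invented for illustration):

```python
def embed_lsb(pixels, bits):
    """Overwrite each pixel's least significant bit with a watermark bit."""
    return [(p & ~1) | b for p, b in zip(pixels, bits)]

def extract_lsb(pixels):
    """Read the watermark back out of the least significant bits."""
    return [p & 1 for p in pixels]

cover = [154, 200, 61, 92]
mark = [1, 0, 1, 1]
stego = embed_lsb(cover, mark)
print(stego)                # [155, 200, 61, 93]: each pixel changes by at most 1
print(extract_lsb(stego))   # [1, 0, 1, 1]
```

The imperceptibility comes from changing each pixel by at most 1; the fragility comes from the fact that any re-quantization or compression scrambles exactly those low-order bits.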
Spatial-Domain Robust Watermarking
Pseudo-randomly (based on a secret key) select n pairs of pixels; in pair i, let a_i and b_i be the values of the two pixels. The expected value of sum_i (a_i - b_i) is 0. To embed, increase each a_i by 1 and decrease each b_i by 1; the expected value of sum_i (a_i - b_i) is now 2n. To detect the watermark, check sum_i (a_i - b_i) on the watermarked image.

Frequency-Domain Robust Watermark: Spread-Spectrum Watermark
Spread spectrum transmits a narrowband signal over a much larger bandwidth, so the signal energy present in any single frequency is much smaller. Applied to watermarking: the watermark is spread over many frequency bins so that the change of energy in any one bin is very small and almost undetectable. Watermark extraction combines these many weak signals into a single but stronger output; this works because the watermark verification process knows the location and content of the watermark. Destroying such a watermark would require adding high-amplitude noise to all frequency bins.
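The spatial-domain scheme above can be sketched as follows; the secret key seeds the pseudo-random pair selection, and clipping of pixel values to [0, 255] is ignored for simplicity (a sketch, not a production embedder):

```python
import random

def pick_pairs(key, num_pixels, n):
    """Pseudo-randomly pick 2n distinct pixel indices from the key."""
    return random.Random(key).sample(range(num_pixels), 2 * n)

def patchwork_embed(img, key, n):
    idx = pick_pairs(key, len(img), n)
    out = list(img)
    for i in range(n):
        out[idx[2 * i]] += 1       # a_i
        out[idx[2 * i + 1]] -= 1   # b_i
    return out

def patchwork_stat(img, key, n):
    idx = pick_pairs(key, len(img), n)
    return sum(img[idx[2 * i]] - img[idx[2 * i + 1]] for i in range(n))

rng = random.Random(0)
img = [rng.randrange(256) for _ in range(10000)]   # stand-in "image"
n = 1000
marked = patchwork_embed(img, key=42, n=n)
# Embedding shifts the pair-difference statistic by exactly 2n:
print(patchwork_stat(marked, 42, n) - patchwork_stat(img, 42, n))   # 2000
```

Without the key, an attacker does not know which pairs carry the shift, so the statistic looks like zero-mean noise to them.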
UMCP ENEE631 slides (created by M. Wu, based on research talks 98-04)

Spread-Spectrum Watermark: Cox et al.
What to use as the watermark, and where to put it?
- Place the watermark in the perceptually significant spectrum (for robustness), modifying coefficients by a small amount below the just-noticeable difference (JND).
- Use a long, random, noise-like vector as the watermark, for robustness/security against jamming and removal, and for imperceptibility.
Embedding: v'_i = v_i + α·v_i·w_i = v_i·(1 + α·w_i). Perform a DCT on the entire image, choose the N = 1000 largest AC coefficients {v_i}, and scale them by the random watermark as above.
[Figure, embedding: original image → full-frame 2D DCT → sort → N largest coefficients → v' = v(1 + α·w), with w from a seeded random vector generator → full-frame IDCT together with the other coefficients → marked image.]
Detection:
- Subtract the original image from the test image before feeding it to the detector ("non-blind detection"). Is the test image X' = X + W + N, or X' = X + N?
- Correlation-based detection: a correlator normalized by Y in the Cox et al. paper.
[Figure, detection: DCT both the original unmarked image and the test image, select the N largest coefficients of each, compute the similarity with the watermark, and threshold for a decision.]
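A toy version of this embed-and-correlate loop, skipping the DCT step and operating directly on a made-up list of "large coefficients" (all names, the α value, and the uniform coefficient model are illustrative assumptions, not the paper's implementation):

```python
import math
import random

ALPHA = 0.1   # embedding strength; kept small (below the JND in a real system)

def embed(coeffs, w, alpha=ALPHA):
    """Cox-style multiplicative embedding: v' = v * (1 + alpha * w)."""
    return [v * (1 + alpha * wi) for v, wi in zip(coeffs, w)]

def similarity(extracted, w):
    """Correlation of the extracted signal with a candidate watermark,
    normalized by the extracted signal's magnitude."""
    dot = sum(e * wi for e, wi in zip(extracted, w))
    return dot / math.sqrt(sum(e * e for e in extracted))

rng = random.Random(1)
coeffs = [rng.uniform(10.0, 100.0) for _ in range(1000)]   # the "N largest"
w = [rng.choice((-1, 1)) for _ in range(1000)]             # true watermark
other = [rng.choice((-1, 1)) for _ in range(1000)]         # wrong watermark
marked = embed(coeffs, w)
# Non-blind detection: subtract the original before correlating.
extracted = [m - v for m, v in zip(marked, coeffs)]
print(round(similarity(extracted, w), 1), round(similarity(extracted, other), 1))
```

The similarity with the true watermark comes out much larger than with a wrong watermark, which correlates near zero; thresholding that gap is what makes the many tiny per-coefficient changes add up to a reliable detection.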
A Multimedia System

Final Exam
- Covers everything up to and including this lecture; use the lecture slides and book readings.
- Place and time: June 9th, 9-11 am (rather than 8-11 am). Closed book, closed notes.
- Two more office hours: Friday, May 4th, 3-5 pm at HFH 1121; next Monday, May 7th, 11 am-noon at HFH 1121.