LN #8 (2 Hrs) Data Encoding CTPS 2018
Objectives To understand positional numeral systems. To depict how complex information such as text, colors, pictures, and sound can be encoded as bit strings.
Positional Number System A number system defines how a number can be represented using distinct symbols. A number can be represented differently in different systems. For example, the two numbers (2A)_16 and (52)_8 both refer to the same quantity, (42)_10, but their representations are different.
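The example above can be checked with a short Python sketch. The helper name `to_base` is hypothetical and introduced only for illustration; `int(text, base)` is Python's built-in way of reading a digit string in a given base.

```python
# Read the same quantity written in three positional systems.
hex_value = int("2A", 16)   # (2A) in base 16
oct_value = int("52", 8)    # (52) in base 8
dec_value = int("42", 10)   # (42) in base 10

def to_base(n, base):
    """Convert a non-negative integer to a digit string in base 2..16."""
    digits = "0123456789ABCDEF"
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, remainder = divmod(n, base)  # peel off the lowest-order digit
        out.append(digits[remainder])
    return "".join(reversed(out))

print(hex_value, oct_value, dec_value)
print(to_base(42, 2), to_base(42, 8), to_base(42, 16))
```

All three values are the same integer; only the written representation changes with the base.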
Common Number Systems

System       Base  Symbols                      Used by humans?  Used in computers?
Decimal      10    0, 1, ..., 9                 Yes              No
Binary       2     0, 1                         No               Yes
Octal        8     0, 1, ..., 7                 No               No
Hexadecimal  16    0, 1, ..., 9, A, B, ..., F   No               No
Bits and binary All computer data is represented using binary, a number system that uses only the digits 0 and 1. A binary digit, or bit, is the smallest unit of data in computing. It is represented by a 0 or a 1. Binary numbers are made up of binary digits (bits), eg the binary number 1001. Binary digits can be grouped together into bytes. Department of CSE
S = {0, 1}
1. What is the biggest binary number one can write with n bits?
2. How many unique patterns does a sequence of 5 bits generate?
3. Write all the patterns of a sequence of 5 bits.
Department of CSE, Coimbatore
1. What is the biggest binary number one can write with n bits? A string of n 1s, which equals 2^n - 1.
2. How many unique patterns does a sequence of 5 bits generate? 2^5 = 32
3. Write all the patterns of a sequence of 5 bits. 00000, 00001, 00010, ..., 11111
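The answers above can be reproduced with a minimal Python sketch. The function name `bit_patterns` is an assumption made for this illustration.

```python
from itertools import product

def bit_patterns(n):
    """Return every n-bit pattern as a string, in counting order."""
    return ["".join(bits) for bits in product("01", repeat=n)]

patterns = bit_patterns(5)
largest = int("1" * 5, 2)   # n ones is the biggest n-bit number, 2**n - 1

print(len(patterns))              # number of unique 5-bit patterns
print(patterns[0], patterns[-1])  # first and last pattern
print(largest)
```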
Table : Four positional number systems
Bits and binary The circuits in a computer's processor are made up of billions of transistors. A transistor is a tiny switch that is activated by the electronic signals it receives. The digits 1 and 0 used in binary reflect the on and off states of a transistor. Computer programs are sets of instructions. Each instruction is translated into machine code - simple binary codes that activate the CPU. Programmers write computer code and this is converted by a translator into binary instructions that the processor can execute.
Byte to Terabyte
Bits can be grouped together to make them easier to work with. A group of 8 bits is called a byte. Other groupings include:
- Nibble - 4 bits (half a byte)
- Byte - 8 bits
- Kilobyte (KB) - 1024 bytes (or 1024 x 8 bits)
- Megabyte (MB) - 1024 kilobytes (or 1048576 bytes)
- Gigabyte (GB) - 1024 megabytes
- Terabyte (TB) - 1024 gigabytes
These 1024-based multiples follow the binary convention. Strictly, the SI prefixes kilo, mega, giga and tera denote powers of 1000, and the IEC names for the 1024-based units are kibibyte (KiB), mebibyte (MiB), gibibyte (GiB) and tebibyte (TiB).
Most computers can process millions of bits every second. A hard drive's storage capacity is measured in gigabytes or terabytes. RAM is often measured in megabytes or gigabytes.
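As a rough illustration of the groupings above, the following Python sketch converts a raw byte count into the largest convenient unit, assuming the 1024-based convention used on this slide. The names `UNITS` and `human_size` are hypothetical.

```python
UNITS = ["bytes", "KB", "MB", "GB", "TB"]

def human_size(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1024."""
    size = float(num_bytes)
    for unit in UNITS:
        # Stop when the value fits in this unit, or when no larger unit is left.
        if size < 1024 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024

print(human_size(1024))
print(human_size(1048576))
print(human_size(3 * 1024 ** 3))
```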
Big Data: Volume
On this slide each unit is 1000 times the previous one (the decimal convention).

Unit       Abbreviation  Size        Example
Byte       -             -           -
Kilobyte   KB            1000 bytes  One page of text: 30 KB
Megabyte   MB            1000 KB     One song: 5 MB
Gigabyte   GB            1000 MB     One movie: 5 GB
Terabyte   TB            1000 GB     6 million books: 1 TB
Petabyte   PB            1000 TB     55 storeys of DVD: 1 PB
Exabyte    EB            1000 PB     Data up to 2003: 5 EB
Zettabyte  ZB            1000 EB     Data in 2011: 1.8 ZB
Yottabyte  YB            1000 ZB     NSA data center: 1 YB
Using Hexadecimal
Hex codes are used in many areas of computing to simplify binary codes. It is important to note that computers do not use hexadecimal - it is used by humans to shorten binary to a more easily understandable form. Hexadecimal is translated into binary for computer use. Some examples of where hex is used include:
- Colour references
- Error messages
- Assembly language programs
Color References: Hex colour model Hex can be used to represent colours on web pages and image-editing programs using the format #RRGGBB (RR = red, GG = green, BB = blue). The # symbol indicates that the number has been written in hex format, eg #FF6600. The hex colour model uses two hex digits for each colour.
#FF 66 00 As one hex digit represents 4 bits, two hex digits together make 8 bits (1 byte).
The values for each colour run between 00 and FF. In binary, 00 is 0000 0000 and FF is 1111 1111. That provides 2^8 = 256 possible values for each of the three colours. That gives a total spectrum of 256 reds x 256 greens x 256 blues, which is over 16 million colours (16,777,216) in total.
#FF0000 will be the purest red - red only, no green or blue. Black is #000000 - no red, no green and no blue. White is #FFFFFF. An orange colour can be represented by the code #FF6600. The hex code is much easier to read than the binary equivalent 1111 1111 0110 0110 0000 0000.
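To see why the hex form is shorter, here is a minimal Python sketch that expands each hex digit of a colour code into its 4 bits. The function name `hex_colour_to_binary` is an assumption for this illustration.

```python
def hex_colour_to_binary(code):
    """Expand each hex digit of a colour code such as '#FF6600' into 4 bits."""
    digits = code.lstrip("#")
    return " ".join(format(int(digit, 16), "04b") for digit in digits)

print(hex_colour_to_binary("#FF6600"))
print(256 ** 3)  # total number of colours with 8 bits per channel
```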
Colour model The figure shows the additive mixing of red, green and blue primaries to form the three secondary colors yellow (red + green), cyan (blue + green) and magenta (red + blue), and white (red + green + blue). RGB model - used in computer displays.
Colour models The figure shows the three subtractive primaries (cyan, magenta and yellow), their pairwise combinations to form red, green and blue, and finally black by subtracting all three primaries from white. CMYK model - used in printing.
If you are making a web page with HTML or CSS you can use hex codes to choose the colours. The RGB model (additive) is used for color monitors and most video cameras. Hex values have equivalents in the RGB colour model. The RGB model is very similar to the hex colour model: you use a value between 0 and 255 for each colour. So an orange colour that is #FF 66 00 in hex would be 255, 102, 0 in RGB. Cyan is 0, 255, 255 and teal is 0, 128, 128.
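The correspondence between the two notations can be sketched in Python as follows. The function names `hex_to_rgb` and `rgb_to_hex` are hypothetical helpers, not part of any library.

```python
def hex_to_rgb(code):
    """Split '#RRGGBB' into three decimal values between 0 and 255."""
    digits = code.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

def rgb_to_hex(r, g, b):
    """Format three 0..255 values as an upper-case '#RRGGBB' code."""
    return f"#{r:02X}{g:02X}{b:02X}"

print(hex_to_rgb("#FF6600"))    # the orange from the slide
print(rgb_to_hex(0, 255, 255))  # cyan
print(rgb_to_hex(0, 128, 128))  # teal
```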
Color, Hex and RGB color codes

Color       Hex Code  Decimal Code (R,G,B)
Red         #FF0000   (255,0,0)
Tomato      #FF6347   (255,99,71)
Coral       #FF7F50   (255,127,80)
Indian red  #CD5C5C   (205,92,92)

HTML / CSS Name    Hex Code #RRGGBB  Decimal Code (R,G,B)
Black              #000000           (0,0,0)
White              #FFFFFF           (255,255,255)
Red                #FF0000           (255,0,0)
Lime               #00FF00           (0,255,0)
Blue               #0000FF           (0,0,255)
Yellow             #FFFF00           (255,255,0)
Cyan / Aqua        #00FFFF           (0,255,255)
Magenta / Fuchsia  #FF00FF           (255,0,255)
Silver             #C0C0C0           (192,192,192)
Gray               #808080           (128,128,128)
Maroon             #800000           (128,0,0)
Olive              #808000           (128,128,0)
Green              #008000           (0,128,0)
Purple             #800080           (128,0,128)
Teal               #008080           (0,128,128)
Navy               #000080           (0,0,128)
Error messages using Hex Hex is often used in error messages on your computer. The hex number typically refers to the memory location (address) of the error. This helps programmers to find and then fix problems.
Different forms of data: text, image, audio and video. Data in any form is represented in computers in binary form only.
All software, music, documents, and any other information that is processed by a computer, is stored using binary.
Different Forms of Data
Inside a computer, all data is stored as numbers (binary):
- Numbers are stored as numbers, obviously!
- Text characters are stored as a code that represents each character, e.g. ASCII.
- Images are stored as numbers representing the amounts of red, green and blue for each pixel.
- Sounds are stored as numbers representing the loudness at given intervals.
Storage space for data
Different types of data require different amounts of storage space.

Data                                                   Storage
One extended-ASCII character in a text file (eg 'A')   1 byte
The word 'Monday' in a document                        6 bytes
A plain-text email                                     2 KB
64 pixel x 64 pixel GIF                                12 KB
Hi-res 2000 x 2000 pixel RAW photo                     11.4 MB
Three minute MP3 audio file                            3 MB
One minute uncompressed WAV audio file                 15 MB
One hour film compressed as MPEG4                      4 GB
Bit number patterns Computer systems and files have limits that are measured in bits. For example, image and audio files have bit depth. The bit depth reflects the number of binary numbers available. This is similar to the number of combinations available on a padlock. The more wheels of numbers on a padlock, the more combinations of numbers are possible. The greater the bit depth, the more combinations of binary numbers are possible.
Bit number patterns Every time the bit depth increases by one, the number of binary combinations is doubled. A 1-bit system uses combinations of numbers up to one place value (1). There are just two options: 0 or 1. A 2-bit system uses combinations of numbers up to two place values (11). There are four options: 00, 01, 10 and 11.
Binary combinations
One bit: Maximum binary number = 1, Maximum denary number = 1, Binary combinations = 2
Two bit: Maximum binary number = 11, Maximum denary number = 3, Binary combinations = 4
Three bit: Maximum binary number = 111, Maximum denary number = 7, Binary combinations = 8
Bit depth  Max (binary)  Max (denary)  Combinations available
1          1             1             2
2          11            3             4
3          111           7             8
4          1111          15            16
5          11111         31            32

A 1-bit image can have 2 colours, a 4-bit image can have 16 colours, an 8-bit image can have 256 colours, and a 16-bit image can have 65,536 colours.
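The table follows directly from the doubling rule, as this small Python sketch shows. The function name `bit_depth_row` is an assumption for illustration.

```python
def bit_depth_row(bits):
    """Return (max binary string, max denary value, combinations) for a bit depth."""
    max_binary = "1" * bits     # all bits set
    max_denary = 2 ** bits - 1  # value of that all-ones pattern
    combinations = 2 ** bits    # doubles with every extra bit
    return max_binary, max_denary, combinations

for bits in (1, 2, 3, 4, 5):
    print(bits, *bit_depth_row(bits))

# Number of colours available at the image bit depths mentioned above.
print([bit_depth_row(bits)[2] for bits in (1, 4, 8, 16)])
```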
Encoding and Decoding Encoding is the process of putting a sequence of characters (letters, numbers, punctuation, and certain symbols) into a specialized digital format for efficient transmission or transfer. Decoding is the opposite process -- the conversion of a digital signal into a sequence of characters. Encoding and decoding are used in data communications, networking, and storage.
Everything on a computer is represented as streams of binary numbers. Audio, images and characters all look like binary numbers in machine code. These numbers are encoded in different data formats to give them meaning, eg the 8-bit pattern 01000001 could be the number 65, the character 'A', or a colour in an image.
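A minimal Python sketch of the same idea: one bit pattern, three readings. The greyscale interpretation at the end is a hypothetical example of "a colour in an image", not a format defined in these notes.

```python
pattern = "01000001"

as_number = int(pattern, 2)     # read the bits as an unsigned integer
as_character = chr(as_number)   # read the same value as a character code
# Hypothetical reading as a colour: use the value for red, green and blue alike.
as_grey_pixel = (as_number, as_number, as_number)

print(as_number, as_character, as_grey_pixel)
```

The bits never change; only the format chosen to interpret them does.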
Encoding formats
Encoding formats have been standardised to help compatibility across different platforms.
- Audio is encoded as audio file formats, eg MP3, WAV, AAC
- Video is encoded as video file formats, eg MPEG4, H264
- Text is encoded in character sets, eg ASCII, Unicode
- Images are encoded as file formats, eg BMP, JPEG, PNG
The more bits used in a pattern, the more combinations of values become available. This larger number of combinations can be used to represent many more things, eg a greater number of different symbols, or more colours in a picture.
Character sets
Different languages use different keyboard layouts.
[Figures: a QWERTY keyboard; a keyboard with Japanese characters]
A French keyboard has an é. If we were writing in Japanese or Arabic, we would need even more choices of characters. In theory, anyone can create a character set. But it is important that computers can communicate, so we use global standards for character sets.
Every word is made up of symbols or characters. When you press a key on a keyboard, a number is generated that represents the symbol for that key. This is called a character code. A complete collection of characters is a character set.
Representing Character
You can check what character encoding your web browser is using by looking in your browser settings:
- Mozilla Firefox > Tools > Page Info: Encoding
- Microsoft Internet Explorer > View > Encoding
- Google Chrome > Tools > Encoding
If all our messages are made up of the eight symbols A, B, C, D, E, F, G, and H, we can choose a code with ----------------- bits per character.
If all our messages are made up of the eight symbols A, B, C, D, E, F, G, and H, we can choose a code with three bits per character:

A 000    C 010    E 100    G 110
B 001    D 011    F 101    H 111
With this code, the message BACADAEAFABBAAAGAH is encoded as the string of ---------------- bits
With this code, the message BACADAEAFABBAAAGAH is encoded as the string of 54 bits:
001000010000011000100000101000001001000000000110000111
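The fixed-length code can be sketched in Python as a lookup table. The names `CODE`, `encode` and `decode` are chosen for this illustration only.

```python
# Three bits are enough for eight symbols because 2**3 = 8.
CODE = {
    "A": "000", "B": "001", "C": "010", "D": "011",
    "E": "100", "F": "101", "G": "110", "H": "111",
}
DECODE = {bits: symbol for symbol, bits in CODE.items()}

def encode(message):
    """Replace every symbol by its 3-bit code."""
    return "".join(CODE[symbol] for symbol in message)

def decode(bits):
    """Cut the bit string into 3-bit groups and look each one up."""
    return "".join(DECODE[bits[i:i + 3]] for i in range(0, len(bits), 3))

encoded = encode("BACADAEAFABBAAAGAH")
print(encoded, len(encoded))
print(decode(encoded))
```

Because every symbol uses the same number of bits, the decoder needs no separators: 18 symbols always give 18 x 3 = 54 bits.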
Text Encoding Characters are usually encoded as integer values using encoding schemes. The associations between numbers and text are known collectively as a character encoding scheme.
ASCII - American Standard Code for Information Interchange
- Covers unaccented English letters.
- Every letter (upper and lower case), digit, punctuation mark, etc. is represented by a code in the range 0-127. Eg: Space, 32; A, 65; a, 97.
- Only the 7-bit patterns were standardized under ASCII.
- Standard 8-bit ASCII codes start with a zero-valued bit (followed by the 7-bit ASCII code).
- Extended ASCII codes start with a one-valued bit. These codes are not standard and vary in meaning among different manufacturers and equipment.
- The first 32 patterns are control codes: the most common of these are 0Ah (Line Feed) and 0Dh (Carriage Return).
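The codes quoted above can be inspected with Python's built-in `ord` and `chr`. The helper `is_standard_ascii` is a hypothetical name used only in this sketch.

```python
def is_standard_ascii(byte_value):
    """A byte is standard ASCII when its leading (eighth) bit is zero."""
    return byte_value < 128

for ch in [" ", "A", "a"]:
    code = ord(ch)  # character -> code
    print(repr(ch), code, format(code, "08b"))

# The two most common control codes, written in hex.
print(format(ord("\n"), "02X"), format(ord("\r"), "02X"))
print(chr(65), is_standard_ascii(65), is_standard_ascii(200))
```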
Table : ASCII Chart
EBCDIC (Extended Binary Coded Decimal Interchange Code)
- Developed by IBM.
- Restricted mainly to IBM or IBM-compatible mainframes.
- Conversion software to/from ASCII is available.
- Common in archival data.
- Character codes differ from ASCII.

Character  ASCII (hex)  EBCDIC (hex)
Space      20           40
A          41           C1
b          62           82
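Python ships a codec named `cp037`, which is one EBCDIC variant (IBM code page 037). Assuming the table above refers to codes shared with that variant, the comparison can be sketched as:

```python
for ch in [" ", "A", "b"]:
    ascii_byte = ch.encode("ascii")[0]    # code under ASCII
    ebcdic_byte = ch.encode("cp037")[0]   # code under EBCDIC code page 037
    print(repr(ch), format(ascii_byte, "02X"), format(ebcdic_byte, "02X"))

# Converting EBCDIC bytes back to text, as conversion software would.
print(bytes([0xC1, 0x82]).decode("cp037"))
```

Other EBCDIC code pages exist and differ in some positions, so the codec must match the data's origin.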
Unicode
- Unicode encodings such as UTF-8, UTF-16 and UTF-32 use between 8 and 32 bits per character.
- It can represent characters from languages all around the world.
- It is commonly used across the internet.
- As it is larger than ASCII, it might take up more storage space when saving documents.
- Global companies, like Facebook and Google, would not use the ASCII character set because their users communicate in many different languages.
Multilingual: Unicode defines codes for
- Nearly every character-based alphabet.
- A large set of ideographs for Chinese, Japanese and Korean.
- Composite characters for vowels and syllabic clusters required by some languages.
It also allows software modifications for local languages.
ASCII contains only 128 characters (codes 0-127). Extended versions of ASCII exist with 256 characters. This is far from enough, as it is largely restricted to the English language. UNICODE was developed to alleviate this problem: UNICODE 5.1.0, for example, contains more than 100,000 characters, covering most existing languages. For more information, see: http://www.unicode.org/versions/unicode5.1.0/
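The storage cost mentioned above depends on the encoding chosen. This Python sketch, using sample characters picked only for illustration, counts the bytes each one needs in three common Unicode encodings.

```python
def encoded_sizes(ch):
    """Return the byte counts of one character in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # '-le' avoids counting a byte-order mark
        len(ch.encode("utf-32-le")),
    )

samples = [
    "A",           # Latin capital letter A
    "\u00e9",      # e with acute accent
    "\u65e5",      # a CJK ideograph
    "\U0001F600",  # an emoji outside the 16-bit range
]
for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Plain English text stays at one byte per character in UTF-8, which is one reason that encoding is so widely used on the web.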
Image Encoding Binary representation of bitmap images: All bitmap images are stored as an array of pixels. A monochrome image stores 1 for a black pixel and 0 for a white pixel (or vice versa, depending on the encoding protocol). It may also be necessary to store the dimensions of the image.
Bitmap
What is the bitmap?
Bitmap
Show how to encode
Answer This image could be represented as the following 35 binary digits, one group of 5 bits per row (stored in 5 bytes once padded): 00100 01010 01010 10001 11111 10001 00000
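The following Python sketch draws the 5 x 7 bitmap from the answer and packs its 35 bits into whole bytes. The names `ROWS`, `render` and `pack` are hypothetical, and padding the last byte with zeros is an assumption, since the slides do not specify a padding rule.

```python
ROWS = ["00100", "01010", "01010", "10001", "11111", "10001", "00000"]

def render(rows):
    """Draw 1 as '#' (black) and 0 as '.' (white), one text line per row."""
    return "\n".join(row.replace("1", "#").replace("0", ".") for row in rows)

def pack(rows):
    """Join all rows and pack the bits into bytes, zero-padding the last byte."""
    bits = "".join(rows)
    padded_length = -(-len(bits) // 8) * 8  # round up to a multiple of 8
    padded = bits.ljust(padded_length, "0")
    return bytes(int(padded[i:i + 8], 2) for i in range(0, padded_length, 8))

print(render(ROWS))
print(len("".join(ROWS)), "bits ->", len(pack(ROWS)), "bytes")
```

A decoder would also need the width (5) and height (7) to rebuild the rows, which is why image files store their dimensions.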
Color Images
Representing Color
Representing Color Each pixel of the rose flower image is defined using 24 bits (8 bits per RGB colour channel):
- The first 8 bits specify the shade of red.
- The next 8 bits specify the shade of green.
- The last 8 bits specify the shade of blue.
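A minimal Python sketch of this 24-bit layout, using bit shifts to place red in the first 8 bits, green in the next 8 and blue in the last 8. The function names are assumptions for illustration.

```python
def pack_pixel(r, g, b):
    """Combine three 8-bit channel values into one 24-bit number."""
    return (r << 16) | (g << 8) | b

def unpack_pixel(pixel):
    """Split a 24-bit pixel back into its red, green and blue values."""
    return (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF

pixel = pack_pixel(255, 102, 0)
print(format(pixel, "024b"))  # the 24 bits stored for this pixel
print(format(pixel, "06X"))   # the same value as a hex colour
print(unpack_pixel(pixel))
```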
Color Images
Representing Sound Sound is produced by the vibration of a medium such as air or water. Audio refers to sound within the range of human hearing. Sound is stored in a computer as binary codes.
A microphone translates the change in air pressure and converts it to a waveform. A converter within the sound card of the computer takes many readings each second. These readings are positions (voltages, actually) on the wave in relation to the zero line. They are recorded and converted from decimal to binary numbers.
Sound Data As Bytes: The data is represented as a pair of numbers. The first part represents the time and the second part represents the voltage value (0000 = low, 1111 = high).
A sound signal is analog, i.e. continuous in both time and amplitude. To store and process sound information in a computer or to transmit it through a computer network, we must first convert the analog signal to digital form using an analog-to-digital converter (ADC). The conversion involves two steps: (1) sampling, and (2) quantization.
Sampling is the process of examining the value of a continuous function at regular intervals. Sampling usually occurs at uniform intervals, which are referred to as sampling intervals. The number of samples taken in a second is called the sampling rate.
To represent the varying values of a sound wave, its height (amplitude) must be measured at regular intervals and the measurements given binary codes. The sampled measurements make up the digital sound file.
[Figure: analogue signal, amplitude against time, sampled at the sampling rate]
Quantization is the process of limiting the value of a sample of a continuous function to one of a predetermined number of allowed values, which can then be represented by a finite number of bits. The number of bits used to store each intensity defines the accuracy of the digital sound.
Using 2-bit sampling to represent the audio signal
[Figure: the waveform plotted against the sample times t1, t2, t3, t4, ..., t10, with the amplitude axis divided into four regions labelled 00, 01, 10 and 11]
At t1 the wave is in region 01.
At t2 it is in region 00, so we have 01 00.
At t3 it is in region 01, so we have 01 00 01.
The complete wave is represented by specifying the region to which it belongs at each sample time, i.e. at time 1 it is in region 01, at time 2 it is in 00, and so on. Time itself is not stored, because we are sampling at regular intervals (time = 1, 2, 3, ...).
The complete representation of the signal is: 01 00 01 01 11 01 10 01 11 01
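The 2-bit quantization step can be sketched in Python. The sample amplitudes below are hypothetical values on a 0-to-1 scale, chosen only so that they fall in the regions read off the figure; the original waveform values are not given in these notes.

```python
def quantize(value, bits, low=0.0, high=1.0):
    """Map an amplitude in [low, high) to the code of the region containing it."""
    levels = 2 ** bits                                  # 2 bits -> 4 regions
    index = int((value - low) / (high - low) * levels)
    index = min(max(index, 0), levels - 1)              # clamp out-of-range input
    return format(index, f"0{bits}b")

# Hypothetical amplitudes at t1..t10 (assumed, not taken from the slides).
samples = [0.30, 0.10, 0.40, 0.45, 0.90, 0.35, 0.60, 0.30, 0.80, 0.40]
codes = [quantize(sample, 2) for sample in samples]
print(" ".join(codes))
```

With only four regions, samples as different as 0.30 and 0.45 receive the same code, which is the information lost to quantization.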
Adding one bit doubles the number of quantization levels, which makes each sample roughly twice as accurate.
How much space do we need to store one minute of music?
- 60 seconds
- 44,100 samples per second
- 16 bits (2 bytes) per sample
- 2 channels (stereo)
S = 60 x 44100 x 2 x 2 = 10,584,000 bytes, i.e. about 10 MB!
1 hour of music would be more than 600 MB!
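The same arithmetic as a small Python sketch. The function name `audio_bytes` and its default parameters are assumptions that mirror the figures on the slide.

```python
def audio_bytes(seconds, sample_rate=44_100, bytes_per_sample=2, channels=2):
    """Size of uncompressed audio: duration x rate x sample size x channels."""
    return seconds * sample_rate * bytes_per_sample * channels

one_minute = audio_bytes(60)
one_hour = audio_bytes(60 * 60)
print(one_minute, "bytes per minute")
print(one_hour, "bytes per hour")
print(round(one_hour / 1_000_000), "MB per hour (decimal megabytes)")
```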
Data vs. Information Data and information are not synonymous terms! Data is the means by which information is conveyed. Data compression aims to reduce the amount of data while preserving as much information as possible.
DATA = INFORMATION + REDUNDANT DATA (H.R. Pourreza)
The same information can be represented by different amounts of data:
1. Your wife, Helen, will meet you at Logan Airport in Boston at 5 minutes past 6:00 pm tomorrow night
2. Your wife will meet you at Logan Airport at 5 minutes past 6:00 pm tomorrow night
3. Helen will meet you at Logan at 6:00 pm tomorrow night
Data Compression
Data compression is the art of reducing the number of bits needed to store or transmit data. Its aims are:
- To reduce the volume of data to be transmitted (text, fax, images).
- To reduce the bandwidth required for transmission and to reduce storage requirements (speech, audio, video).
Classification
Lossless compression
- Used for legal and medical documents and computer programs.
- Information preserving.
- Low compression ratios.
Lossy compression
- Used for digital audio, image and video, where some errors or loss can be tolerated.
- Not information preserving.
- High compression ratios.
Trade-off: information loss vs compression ratio
Video and Audio Compression Video and audio files are very large. Unless we develop and maintain very high bandwidth networks (gigabytes per second or more), we have to compress the data. Relying on higher bandwidths is not a good option. Compression therefore becomes part of the representation or coding scheme, and such schemes have become the popular audio, image and video formats.
Run-length Encoding This encoding method is frequently applied to images (or pixels in a scan line). It is a small compression component used in JPEG compression. In this instance, sequences of image elements $X_1, X_2, \ldots, X_n$ are mapped to pairs $(c_1, l_1), (c_2, l_2), \ldots, (c_n, l_n)$, where $c_i$ represents the image intensity or colour and $l_i$ the length of the $i$th run of pixels.
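The mapping to (colour, run length) pairs can be sketched in Python with `itertools.groupby`, which groups consecutive equal items. The function names and the sample scan line are assumptions for illustration; real image formats add their own details on top of this idea.

```python
from itertools import groupby

def rle_encode(pixels):
    """Map a sequence of pixel values to (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]

def rle_decode(pairs):
    """Expand (value, run length) pairs back into the original sequence."""
    return [value for value, length in pairs for _ in range(length)]

scan_line = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]  # hypothetical black and white row
pairs = rle_encode(scan_line)
print(pairs)
print(rle_decode(pairs) == scan_line)
```

Ten pixels shrink to four pairs here because the row contains long runs; data without runs would not shrink at all.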
Black and White Image
Black and White
Improve Efficiency
Color Images
Run-length Encoding Figure: An encoded figure
Run Length encode the image
Run Length Code the image.
Run-length encoding isn't a good approach for text compression. Why?
Run-length encoding isn't a good approach for text compression. Why? Long runs rarely appear in a natural language.
Data compression ratio Data compression ratio, also known as compression power, is a computer science term used to quantify the reduction in data-representation size produced by a data compression algorithm. The data compression ratio is analogous to the physical compression ratio used to measure physical compression of substances. Data compression ratio is defined as the ratio between the uncompressed size and compressed size:
$$\text{Compression Ratio} = \frac{\text{Uncompressed Size}}{\text{Compressed Size}}$$
A representation that compresses a 10 MB file to 2 MB has a compression ratio of 10/2 = 5, often notated as an explicit ratio, 5:1, or as an implicit ratio, 5/1. Sometimes the space savings is given instead, which is defined as the reduction in size relative to the uncompressed size:
$$\text{Space Savings} = 1 - \frac{\text{Compressed Size}}{\text{Uncompressed Size}}$$
A representation that compresses a 10 MB file to 2 MB would yield a space savings of 1 - 2/10 = 0.8, often notated as a percentage, 80%.
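Both measures are simple to compute, as this Python sketch shows. The function names are hypothetical, and both sizes must be given in the same unit.

```python
def compression_ratio(uncompressed_size, compressed_size):
    """Uncompressed size divided by compressed size."""
    return uncompressed_size / compressed_size

def space_savings(uncompressed_size, compressed_size):
    """Reduction in size relative to the uncompressed size."""
    return 1 - compressed_size / uncompressed_size

ratio = compression_ratio(10, 2)  # the 10 MB -> 2 MB example
savings = space_savings(10, 2)
print(f"{ratio:g}:1 ratio, {savings:.0%} space savings")
```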
What has been described?
- Positional number systems.
- Binary representation.
- The data encoding schemes for text, color, image and sound.
- Compression techniques and how data can be compressed using the RLE method.
Credits
- Foundations of Computer Science, Behrouz Forouzan and Firouz Mosharraf
- www.bbc.co.uk (KS3 Computing, Data representation)
- Google Images