Slide Set Overview. Special Topics in Advanced Digital System Design. Embedded System Design. What does a digital camera do?


Special Topics in Advanced Digital System Design
by Dr. Lesley Shannon
Email: lshannon@ensc.sfu.ca
Course Website: http://www.ensc.sfu.ca/~lshannon/
Simon Fraser University
Slide Set:
Date: January, 007
ENSC 0/: Lecture Set

Slide Set Overview
- An implementation study example
- Demonstrates why we have to make choices in our implementation platform to meet application specification requirements
- Example application: a digital camera

Embedded System Design
- All real systems contain both hardware and software
- There is no such thing as a software-only system
- What do these systems look like?
- How do we design these systems? Recall the embedded system design methodology handout.

Embedded System Design
This design study:
- Shows four implementations (with varying degrees of hardware)
- Illustrates the tradeoffs between hardware and software implementations
- Demonstrates how fixed-point vs. floating-point can be used as an optimization for hardware and software
This example is from Embedded System Design: A Unified Hardware/Software Introduction. Vahid, Givargis 000. (You can find more details in chapter 7 of this book.)

What does a digital camera do?
Recall our iPod Use Case Diagram:
1. Capture images, process them, store them in memory
2. Upload images to a PC
3. Vary image size, delete images, digital stretching, zoom in and out, etc.
This example focuses on the first use case. When the shutter is pressed:
- The image is captured
- It is converted to digital form by the charge-coupled device (CCD)
- It is compressed and archived
The compression part is discussed in detail.

Specify System Requirements
- Performance: We want to process a picture in one second
  - Slower would be annoying
  - Faster is not necessary for a low-end camera
- Size: Must fit on a low-cost chip
  - Let's say 00,000 gates, including the processor
- Power and energy:
  - We don't want a fan
  - We want the battery to last as long as possible

Overview of Image Capture
[Figure: CCD layout showing the lens area, the covered columns, the electromechanical shutter, the pixel rows and columns, and the electronic circuitry]
- When exposed to light, each cell of the charge-coupled device (CCD) becomes electrically charged. This charge can then be converted to a digital value, where 0 represents no exposure and the maximum value represents very intense exposure of that cell to light.
- The electromechanical shutter is activated to expose the cells to light for a brief moment.
- Some of the columns are covered with a black strip of paint. The light intensity of these pixels is used for zero-bias adjustments of all the cells.
- The electronic circuitry, when commanded, discharges the cells, activates the electromechanical shutter, and then reads the charge value of each cell. These values can be clocked out of the CCD by external logic through a standard parallel bus interface.

Overview of Image Capture
[Flowchart: CCD input -> Zero-bias adjust -> DCT -> Quantize -> Archive in memory -> More blocks? (yes: repeat from DCT for the next block) -> Done? (yes: Transmit serially) -> serial output, e.g., 01...]

Zero-Bias Error
- Manufacturing errors cause cells to measure slightly above or slightly below the actual light intensity
- The error is typically the same along columns, but different across rows
- Some of the left-most columns are blocked by black paint
  - If you get anything but 0, you have a zero-bias error
- Each row is corrected by subtracting the average error in all the blocked cells for that row
[Figure: an example block of pixel values before and after zero-bias adjustment, showing the covered cells and the adjustment applied to each row; the values are not recoverable here]

CCDPP (CCD PreProcessing) module
- Performs zero-bias adjustment
- CcdppCapture uses CcdCapture and CcdPop to obtain the image
- Performs zero-bias adjustment after each row is read in

    void CcdppCapture(void) {
        CcdCapture();
        for (rowindex = 0; rowindex < sz_row; rowindex++) {
            /* read one row of image pixels from the CCD */
            for (colindex = 0; colindex < sz_col; colindex++) {
                buffer[rowindex][colindex] = CcdPop();
            }
            /* the next two values popped are averaged to give the row's bias */
            bias = (CcdPop() + CcdPop()) / 2;
            /* subtract the bias from every pixel in the row */
            for (colindex = 0; colindex < sz_col; colindex++) {
                buffer[rowindex][colindex] -= bias;
            }
        }
    }

Compression
- Store more images
- Transmit image to PC in less time
- JPEG (Joint Photographic Experts Group)
  - Popular standard format for representing digital images in a compressed form
  - Provides for a number of different modes of operation
  - Based on the discrete cosine transform (DCT)
- Image data is divided into 8 x 8 blocks of pixels
- Steps performed on each block:
  1. DCT
  2. Quantization
  3. Huffman encoding
Discrete Cosine Transform (DCT)
- Transforms an 8 x 8 block of pixels into the frequency domain
- We produce a new 8 x 8 block of values such that:
  - The upper-left corner represents the low-frequency components (the essence of the image)
  - The lower-right corner represents the high-frequency components (the finer details)
- We can reduce the precision of the higher-frequency components and retain reasonable image quality

Equation to perform the DCT:

\[ F(u,v) = \frac{1}{4}\, C(u)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} D_{xy} \cos\frac{\pi (2x+1) u}{16} \cos\frac{\pi (2y+1) v}{16} \]

where

\[ C(h) = \begin{cases} \frac{1}{\sqrt{2}} & \text{if } h = 0 \\ 1 & \text{otherwise} \end{cases} \]
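As a rough illustration of the DCT equation, the following C sketch computes a single coefficient F(u,v) of one 8 x 8 block. It is not the slides' FDCT routine: the names `COS16`, `cos16`, `c_factor`, and `dct_coeff` are hypothetical, and the cosine values are a small hand-entered table (rounded to six decimals) indexed through the symmetry of the cosine, so that no run-time cosine call is needed.

```c
/* cos(k*pi/16) for k = 0..8, rounded to six decimals (hypothetical table). */
static const double COS16[9] = {
    1.000000, 0.980785, 0.923880, 0.831470, 0.707107,
    0.555570, 0.382683, 0.195090, 0.000000
};

/* Return cos(k*pi/16) for any non-negative k using the symmetry of cosine. */
static double cos16(int k)
{
    k %= 32;                          /* cosine has period 2*pi = 32 steps */
    if (k > 16) k = 32 - k;           /* cos(2*pi - t) = cos(t)            */
    if (k > 8) return -COS16[16 - k]; /* cos(pi - t)   = -cos(t)           */
    return COS16[k];
}

/* C(h) from the equation: 1/sqrt(2) when h is 0, otherwise 1. */
static double c_factor(int h)
{
    return h == 0 ? COS16[4] : 1.0;   /* cos(pi/4) = 1/sqrt(2) */
}

/* One DCT coefficient F(u,v) of the 8 x 8 block d. */
double dct_coeff(short d[8][8], int u, int v)
{
    double sum = 0.0;
    for (int x = 0; x < 8; x++)
        for (int y = 0; y < 8; y++)
            sum += d[x][y] * cos16((2 * x + 1) * u) * cos16((2 * y + 1) * v);
    return 0.25 * c_factor(u) * c_factor(v) * sum;
}
```

For a block in which every pixel has the same value, only F(0,0) is non-zero, which matches the statement that the upper-left corner carries the essence of the image.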

Quantization
- Achieve a high compression ratio by reducing image quality
  - Reduce the bit precision of the encoded data
  - Fewer bits are needed for encoding
- Let's reduce the precision equally across all frequency values
  - Even better, divide by a power of two: simple right shifts can do this
[Figure: an example block after the DCT and the same block after quantization, where each cell's value is divided by a constant; the values are not recoverable here]

CODEC

    void CodecDoFdct(void) {
        int i, j;
        for (i = 0; i < num_row_blocks; i++)
            for (j = 0; j < num_col_blocks; j++)
                CodecDoFdct_for_one_block(i, j);
    }

    void CodecDoFdct_for_one_block(int i, int j) {
        int x, y;
        for (x = 0; x < 8; x++)
            for (y = 0; y < 8; y++)
                obuffer[i * 8 + x][j * 8 + y] = FDCT(i, j, x, y, ibuffer);
    }

What about Fixed-Point Number Representation?
- Rather than computing the floating-point cosine function at run time, notice that only a limited number of distinct values are needed for the cosine terms of the DCT equation
- So let's pre-compute them

What about Fixed-Point Number Representation?
- The result of the cosine is floating point
- It would be better if we could store the table in less memory
- Example: Suppose we want to represent to 1 using 1 bits
[Table: example floating-point values and their fixed-point equivalents; the values are not recoverable here]
- So if x is the floating-point number, the fixed-point number is round( 7 x)
- We can generate a fixed-point table now.
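A minimal sketch of how one table entry could be converted, assuming a 16-bit signed format with 15 fractional bits (a scale factor of 32768). That scale, and the names `Q15_ONE`, `to_q15`, and `from_q15`, are assumptions made for this illustration and are not taken from the slides.

```c
#include <stdint.h>

#define Q15_ONE 32768  /* assumed scale: 2^15 steps per unit */

/* Convert x in [-1, 1] to 16-bit fixed point, rounding to the nearest
 * integer and saturating at the limits of int16_t. */
int16_t to_q15(double x)
{
    double scaled = x * Q15_ONE;
    long r = (long)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
    if (r > INT16_MAX) r = INT16_MAX;   /* +1.0 itself does not fit */
    if (r < INT16_MIN) r = INT16_MIN;
    return (int16_t)r;
}

/* Recover the approximate real value from the fixed-point value. */
double from_q15(int16_t q)
{
    return (double)q / Q15_ONE;
}
```

Under these assumptions, `to_q15(0.5)` gives 16384 and `from_q15(16384)` gives back 0.5. Each table entry then occupies two bytes instead of the larger footprint of a floating-point value.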
CODEC

    static const short COS_TABLE[8][8] = {
        /* pre-computed cosine table; the entries are not recoverable here */
    };

    /* look up a pre-computed cosine value and scale it back to floating point */
    static double COS(int xy, int uv) {
        return (COS_TABLE[xy][uv] / 7.0);
    }

    static int FDCT(int base_x, base_y, offset_x, offset_y, short **img) {
        r = 0;
        u = base_x * 8 + offset_x;
        v = base_y * 8 + offset_y;
        /* inner sum over y for each row x of the block */
        for (x = 0; x < 8; x++) {
            s[x] = img[x][0] * COS(0, v) + img[x][1] * COS(1, v) +
                   img[x][2] * COS(2, v) + img[x][3] * COS(3, v) +
                   img[x][4] * COS(4, v) + img[x][5] * COS(5, v) +
                   img[x][6] * COS(6, v) + img[x][7] * COS(7, v);
        }
        /* outer sum over x */
        for (x = 0; x < 8; x++) r += s[x] * COS(x, u);
        return (r * .25 * C(u) * C(v));
    }
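Before a block is serialized, it goes through the quantization step described earlier, which divides every coefficient by a power of two. The sketch below shows one way to do that with right shifts. The shift amount is left as a parameter because the factor used in the slides did not survive, and the names `quantize_coeff` and `quantize_block` are hypothetical. The shift is applied to the magnitude because right-shifting a negative signed value is implementation-defined in C.

```c
/* Divide one DCT coefficient by 2^shift. The shift is applied to the
 * magnitude, so negative values round toward zero on any compiler. */
short quantize_coeff(short value, unsigned shift)
{
    int v = value;
    return (short)(v < 0 ? -((-v) >> shift) : (v >> shift));
}

/* Quantize every coefficient of an 8 x 8 block in place. */
void quantize_block(short block[8][8], unsigned shift)
{
    for (int x = 0; x < 8; x++)
        for (int y = 0; y < 8; y++)
            block[x][y] = quantize_coeff(block[x][y], shift);
}
```

With a shift of 3, for example, a coefficient of 1150 becomes 143 and a coefficient of -7 becomes 0, so small high-frequency values tend to collapse to zero.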

- Serialize the 8 x 8 block of pixels
  - Values are converted into a single list using a zigzag pattern
- Perform Huffman encoding
  - More frequently occurring pixels are assigned short binary codes
  - Longer binary codes are left for less frequently occurring pixels

[Table: how often each pixel value occurs in the example block; the values and counts are not recoverable here]

We are going to construct a tree.
- Each leaf represents one pixel value.
- The number inside the node is the frequency.

We start adding internal nodes.
- The value inside each internal node is the sum of the two nodes beneath it (initially, two leaves).
- Choose two leaves such that the sum will be as small as possible (in this case, the leaves are both 1, and the sum is 2).

[Figure: the tree after the first merge steps]

[Figure: the remaining tree-construction steps, merging the two lowest-frequency nodes each time until a single root remains]

Now let's construct the Huffman table.

Huffman codes
- Find a path from the root to each table entry: 0 means go left, 1 means go right
[Figure: the tree annotated step by step with the resulting code for each pixel value; the codes are not recoverable here]
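The tree construction above can be sketched in C as follows. This is an illustrative version rather than the slides' code: it repeatedly merges the two lowest-frequency nodes and reports only the length of each symbol's code (its depth in the tree), since the actual bit pattern depends on which child is drawn on the left. The names `MAX_SYMBOLS` and `huffman_code_lengths` are hypothetical.

```c
#define MAX_SYMBOLS 64

/* Compute Huffman code lengths for n symbols (1 <= n <= MAX_SYMBOLS) from
 * their frequencies by repeatedly merging the two lowest-weight nodes. */
void huffman_code_lengths(const unsigned freq[], int n, int length[])
{
    unsigned weight[2 * MAX_SYMBOLS]; /* leaves first, then internal nodes */
    int parent[2 * MAX_SYMBOLS];
    int alive[2 * MAX_SYMBOLS];       /* 1 while a node has no parent yet  */
    int count = n;

    for (int i = 0; i < n; i++) {
        weight[i] = freq[i];
        parent[i] = -1;
        alive[i] = 1;
    }

    /* n - 1 merges turn n leaves into a single tree. */
    for (int merge = 0; merge < n - 1; merge++) {
        int a = -1, b = -1;           /* a = smallest, b = second smallest */
        for (int i = 0; i < count; i++) {
            if (!alive[i]) continue;
            if (a < 0 || weight[i] < weight[a]) { b = a; a = i; }
            else if (b < 0 || weight[i] < weight[b]) { b = i; }
        }
        weight[count] = weight[a] + weight[b];  /* new internal node */
        parent[count] = -1;
        alive[count] = 1;
        parent[a] = parent[b] = count;
        alive[a] = alive[b] = 0;
        count++;
    }

    /* A symbol's code length is the number of edges from its leaf to the root. */
    for (int i = 0; i < n; i++) {
        length[i] = 0;
        for (int p = parent[i]; p >= 0; p = parent[p]) length[i]++;
    }
}
```

With the made-up frequencies 5, 9, 12, 13, 16, and 45, the most frequent symbol gets a 1-bit code and the two rarest get 4-bit codes.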

Huffman codes
- Codes for common pixel values are short
- No code is a prefix of another code, which makes decoding easy!

Archiving Images
- Here's a really simple memory map
[Figure: memory map]
- The amount of memory required depends on N and the compression ratio

Uploading to the PC
- When connected to the PC and an upload command is received:
  - Read images from memory
  - Transmit serially using the UART
  - While transmitting, reset pointers, image-size variables, and the global memory pointer accordingly

Implementing the Design
We are going to talk about four potential implementations:
1. Microcontroller alone (everything in software)
2. Microcontroller and CCDPP
3. Microcontroller and CCDPP / Fixed-Point DCT
4. Microcontroller and CCDPP / DCT

Recall our System Requirements
- Performance: process a picture in one second
- Size: must fit on a low-cost chip (let's say 00,000 gates, including the processor)
- Power and energy: no fan, and the battery should last as long as possible

Implementation 1: Microprocessor Alone
- Suppose we use an Intel 01 microcontroller
  - Total IC cost about $
  - Well below 00 mW power
  - We figure it will take months to get the product done
  - 1 MHz, 1 cycles per instruction: one million instructions per second
- Can we get the required performance? (let's say our grid is x)

    void CcdppCapture(void) {
        CcdCapture();
        for (rowindex = 0; rowindex < sz_row; rowindex++) {
            for (colindex = 0; colindex < sz_col; colindex++) {
                buffer[rowindex][colindex] = CcdPop();
            }
            bias = (CcdPop() + CcdPop()) / 2;
            for (colindex = 0; colindex < sz_col; colindex++) {
                buffer[rowindex][colindex] -= bias;
            }
        }
    }

- Nested loops: sz_row x (sz_col + sz_col) iterations
- If each iteration is 0 assembly language instructions, that is 1 x 0 instructions = 0,00 instructions per image
- This is half our budget and we haven't even done DCT or Huffman yet!
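The budget argument can be checked with a few lines of C. The grid size and the instruction count per iteration did not survive in the text above, so the usage note below assumes a hypothetical 64 x 64 grid and 50 instructions per iteration. Only the rate of one million instructions per second and the one-second target come from the slides. The function names are also hypothetical.

```c
/* Instructions spent in CcdppCapture: each row runs one read loop and one
 * bias-correction loop over its columns. */
unsigned long ccdpp_instructions(unsigned long rows, unsigned long cols,
                                 unsigned long instr_per_iteration)
{
    return rows * (cols + cols) * instr_per_iteration;
}

/* Whole-number percentage of a one-second budget consumed, given the
 * processor's rate in instructions per second. */
unsigned long budget_percent(unsigned long instructions,
                             unsigned long instr_per_second)
{
    return instructions * 100UL / instr_per_second;
}
```

Under those assumed values, `ccdpp_instructions(64, 64, 50)` is 409,600 instructions, and `budget_percent(409600, 1000000)` is 40, so zero-bias adjustment alone would use a large share of the one-second budget.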
Implementation 2: Microcontroller and CCDPP
[Figure: SOC block diagram containing the 01 microcontroller, EEPROM, RAM, UART, and CCDPP]
- CCDPP function implemented on a custom hardware unit
  - Improves performance: fewer microcontroller cycles
  - Increases engineering cost and time-to-market
  - Easy to implement
    - Simple datapath
    - Few states in the controller
- The simple UART is easy to implement as a custom hardware unit also
- EEPROM for program memory and RAM for data memory are added as well

Microcontroller
- A synthesizable version of the Intel 01 is available
  - Written in VHDL
  - Captured at register transfer level (RTL)
- Fetches instructions from ROM
- Decodes using the instruction decoder
- ALU executes arithmetic operations
  - Source and destination registers reside in RAM
- Special data movement instructions are used to load and store externally
- A special program generates the VHDL description of the ROM from the output of the C compiler
[Figure: block diagram of the Intel 01 processor core, showing the instruction decoder, controller, ALU, ROM, RAM, and the connection to the external memory bus]

UART
- The UART is invoked when the 01 executes a store instruction with the UART's enable register as the target address
  - Memory-mapped communication between the 01 and the UART
- The Start state transmits 0, indicating the start of a byte transmission, then transitions to the Data state
- The Data state sends the bits serially, then transitions to the Stop state
- The Stop state transmits 1, indicating the transmission is done, then transitions back to idle mode
[Figure: UART state diagram. Idle: I = 0, left when invoked. Start: transmit LOW. Data: transmit data(I), then I++, repeated until all bits are sent. Stop: transmit HIGH.]

CCDPP
- Hardware implementation of the zero-bias operations
- Internal buffer, B, memory-mapped to the 01
- The GetRow state reads in one row from the CCD to B
  - Each row holds the image pixels followed by the blacked-out pixels
- The ComputeBias state computes the bias for that row and stores it in the variable Bias
- The FixBias state iterates over the same row, subtracting Bias from each element
- NextRow transitions to GetRow to repeat the process on the next row, or to the Idle state when all rows are completed
[Figure: CCDPP state diagram. Idle: R = 0, C = 0, left when invoked. GetRow: B[R][C] = Pxl, C = C + 1. ComputeBias: Bias = average of the blacked-out entries of row R, C = 0. FixBias: B[R][C] = B[R][C] - Bias. NextRow: R++, C = 0.]

Connecting SoC Components
- Memory-mapped
  - All single-purpose processors and RAM are connected to the 01's memory bus
- Read
  - The processor places the address on the address bus
  - Asserts the read control signal for 1 cycle
  - Reads data from the data bus 1 cycle later
  - The device (RAM or custom circuit) detects the asserted read control signal
    - Checks the address
    - Places and holds the requested data on the data bus for 1 cycle
- Write
  - The processor places the address and data on the address and data buses
  - Asserts the write control signal for 1 clock cycle
  - The device (RAM or custom circuit) detects the asserted write control signal
    - Checks the address bus
    - Reads and stores the data from the data bus

Analysis
[Figure: analysis flow. The VHDL descriptions feed a VHDL simulator, which gives the execution time, and a synthesis tool, which gives gate counts. The gate counts are summed to give the chip area, and a gate-level simulator together with a power equation gives the power.]
- Entire SOC tested on a VHDL simulator
  - Interprets the VHDL descriptions and functionally simulates execution of the system
    - Recall that the program code is translated to a VHDL description of the ROM
  - Tests for correct functionality
  - Measures clock cycles to process one image (performance)
- Gate-level description obtained through synthesis
  - A synthesis tool is like a compiler for hardware
  - Simulate gate-level models to obtain data for power analysis
    - Number of times gates switch from 1 to 0 or 0 to 1
  - Count the number of gates for chip area

Analysis of Implementation 2
- Total execution time for processing one image: .1 seconds
- Power consumption: 0.0 watt
- Energy consumption: 0.0 joule (.1 s x 0.0 watt)
- Total chip area: ,000 gates
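The UART transmitter described earlier (Idle, Start, Data, Stop) maps naturally onto a small state machine. The C model below is a software sketch, not the VHDL used in the design. It assumes 8 data bits sent least-significant bit first and a line that idles HIGH, and all type and function names are hypothetical.

```c
typedef enum { UART_IDLE, UART_START, UART_DATA, UART_STOP } uart_state;

typedef struct {
    uart_state state;
    unsigned char data;  /* byte being transmitted   */
    int i;               /* index of the next data bit */
} uart_tx;

#define UART_DATA_BITS 8 /* assumed word length */

void uart_init(uart_tx *u)
{
    u->state = UART_IDLE;
    u->data = 0;
    u->i = 0;
}

/* Models the store to the enable register: latch the byte and leave Idle. */
void uart_invoke(uart_tx *u, unsigned char byte)
{
    if (u->state == UART_IDLE) {
        u->data = byte;
        u->i = 0;
        u->state = UART_START;
    }
}

/* Advance one bit time and return the level driven on the line (0 or 1). */
int uart_step(uart_tx *u)
{
    int line = 1;                       /* assumed: the line idles HIGH */
    switch (u->state) {
    case UART_IDLE:
        break;
    case UART_START:
        line = 0;                       /* start bit: transmit LOW */
        u->state = UART_DATA;
        break;
    case UART_DATA:
        line = (u->data >> u->i) & 1;   /* transmit data(I), then I++ */
        u->i++;
        if (u->i == UART_DATA_BITS) u->state = UART_STOP;
        break;
    case UART_STOP:
        line = 1;                       /* stop bit: transmit HIGH */
        u->state = UART_IDLE;
        break;
    }
    return line;
}
```

After `uart_invoke(&u, 0xA5)`, ten calls to `uart_step` yield the start bit, the eight data bits, and the stop bit, and the machine is back in the Idle state.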

Implementation 3: Fixed-Point DCT
- Most of the execution time is now spent in the DCT
- We could design custom hardware like we did for CCDPP
  - More complex, so more design effort
- Let's see if we can speed up the DCT by modifying the number representation (but still do it in software)

DCT Floating-Point Cost
- The DCT uses ~0 floating-point operations per pixel transformation
- 0 ( x ) pixels per image
- 1 million floating-point operations per image
- No floating-point support with the Intel 01
  - The compiler must emulate it
    - Generates procedures for each floating-point operation (mult, add)
    - Each procedure uses tens of integer operations
- Thus, > million integer operations per image
- More procedures increase code size
- Fixed-point arithmetic can improve on this
  - Shrinks code size

Fixed-Point Arithmetic
- Integers are used to represent a real number
  - Some bits represent the fraction, some bits represent the whole number
[Figure: an example bit pattern split into an integer part and a fractional part, with the value each part represents; the digits are not recoverable here]
- There are 16 possible values ("codes") of the fractional part. If we quantize the fractional value over these 16 possible codes:
  - 0: encode with 0000
  - 1/16: encode with 0001
  - and so on

Fixed-Point Arithmetic
- How do you convert a real constant to fixed point?
  - Multiply the real value by 2^(number of bits used for the fractional part)
  - Round to the nearest integer
- Example: [worked conversion of a real constant to a fixed-point integer, then back to the real value it actually represents; the digits are not recoverable here]
- To get a more accurate representation, use more bits to represent the fraction

Fixed-Point Arithmetic
- Addition: a good approximation is to simply add the fixed-point representations
- Example: [two real values are converted, their representations are added, and the result is close to, but not exactly, the true sum; the digits are not recoverable here]
- To get a more accurate representation, use more bits to represent the fraction
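The conversion and addition rules can be tried out with a short C sketch. It assumes 4 fractional bits (so one step is 1/16) and uses 3.14 and 2.71 as hypothetical example values. A multiply helper is included as well, following the standard rule of multiplying the raw integers and shifting right by the number of fractional bits. The names `FRAC_BITS`, `to_fixed`, `to_real`, `fixed_add`, and `fixed_mul` are made up for this illustration.

```c
#define FRAC_BITS 4  /* assumed: 4 fractional bits, so one step is 1/16 */

/* Real constant -> fixed point: multiply by 2^FRAC_BITS, round to nearest. */
int to_fixed(double x)
{
    double scaled = x * (1 << FRAC_BITS);
    return (int)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Fixed point -> the real value the integer actually represents. */
double to_real(int f)
{
    return (double)f / (1 << FRAC_BITS);
}

/* Addition works directly on the integer representations. */
int fixed_add(int a, int b)
{
    return a + b;
}

/* The raw product carries 2*FRAC_BITS fractional bits, so shift back by
 * FRAC_BITS. Intended for non-negative operands in this sketch. */
int fixed_mul(int a, int b)
{
    return (a * b) >> FRAC_BITS;
}
```

With these assumptions, 3.14 becomes 50 and 2.71 becomes 43. Their sum, 93, represents 5.8125 (the exact sum is 5.85), and their product, 134, represents 8.375 (the exact product is about 8.51). Adding fractional bits narrows both gaps.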

Fixed-Point Arithmetic
- Multiply:
  - Multiply the representations
  - Shift right by the number of bits in the fractional part
- Example: [two real values are converted, their representations are multiplied and shifted right, and the result is close to, but not exactly, the true product; the digits are not recoverable here]
- Moral: we can add and multiply easily. This is faster and smaller than floating point.

New CODEC

    static const char code COS_TABLE[8][8] = {
        /* fixed-point cosine table; the entries are not recoverable here */
    };

    static int FDCT(int base_x, base_y, offset_x, offset_y, short **img) {
        r = 0;
        u = base_x * 8 + offset_x;
        v = base_y * 8 + offset_y;
        for (x = 0; x < 8; x++) {
            s[x] = 0;
            /* multiply, then shift right to drop the extra fractional bits */
            for (j = 0; j < 8; j++)
                s[x] += (img[x][j] * COS_TABLE[j][v]) >> /* ... */;
        }
        for (x = 0; x < 8; x++) r += (s[x] * COS_TABLE[x][u]) >> /* ... */;
        return (short)((((r * (((1 * C(u)) >> /* ... */) * C(v)) >> /* ... */)) >> /* ... */) >> /* ... */);
    }

Analysis of Implementation 3
- Total execution time for processing one image: 1. seconds
- Power consumption: 0.0 watt (same as implementation 2)
- Energy consumption: 0.00 joule (1. s x 0.0 watt)
  - Battery life x longer!!
- Total chip area: 0,000 gates
  - ,000 fewer gates (less memory needed for code)

Implementation 4: Implement the CODEC in H/W
[Figure: SOC block diagram containing the 01 microcontroller, EEPROM, RAM, UART, CCDPP, and CODEC]

CODEC Design
- Four memory-mapped registers
  - C_DATAI_REG: used to push an 8 x 8 block into the CODEC
  - C_DATAO_REG: used to pop an 8 x 8 block out of the CODEC
  - C_CMND_REG: used to command the CODEC
    - Writing 1 to this register invokes the CODEC
  - C_STAT_REG: indicates the CODEC is done and ready for the next block
    - Polled in software
- Direct translation of the C code to VHDL for the actual hardware implementation
  - Fixed-point version used

Analysis of Implementation 4
- Total execution time for processing one image: 0.0 seconds (well under 1 sec)
- Power consumption: 0.00 watt
  - Increase over implementations 2 and 3 because the chip has more hardware
- Energy consumption: 0.0000 joule (0.0 s x 0.00 watt)
  - Battery life 1x longer than the previous implementation!!
- Total chip area: 1,000 gates
  - Significant increase over previous implementations

So, what do you tell your boss?

                       Implementation 2   Implementation 3   Implementation 4
Performance (second)   .1                 1.                 0.0
Power (watt)           0.0                0.0                0.00
Size (gate)            ,000               0,000              1,000
Energy (joule)         0.0                0.00               0.000

Implementation 3
- Close in performance
- Cheaper
- Less time to build and fewer gates
Implementation 4
- Great performance and energy consumption
- More expensive and may miss the time-to-market window
  - If the DCT is designed ourselves, then increased engineering cost and time-to-market
  - If an existing DCT is purchased, then increased IC cost
Which is better?
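The energy figures in the comparison follow from multiplying power by execution time, and the battery-life claims follow from the ratio of two energies. A small C sketch of that arithmetic is given here. The function names are hypothetical, and the numbers in the usage note are invented for illustration rather than taken from the table.

```c
/* Energy in joules to process one image: average power times time. */
double energy_joules(double power_watts, double seconds)
{
    return power_watts * seconds;
}

/* How many times more images a battery can process with design B than with
 * design A, assuming battery life is inversely proportional to the energy
 * used per image. */
double battery_life_ratio(double energy_a, double energy_b)
{
    return energy_a / energy_b;
}
```

For instance, a design drawing 0.5 W for 4 s uses 2 J per image, while one drawing 0.25 W for 2 s uses 0.5 J, so the second would process four times as many images per charge. A faster design can therefore win on energy even when its power draw is higher.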
Highlights of this slide set:
- We saw an example / case study that illustrates some of the tradeoffs
  - Hardware takes longer to design
  - Hardware will be faster
  - Sometimes you can optimize the software instead
  - There is always a tradeoff between performance, cost, and time
- This was just one example. However, these concepts can be applied to the general problem of designing embedded systems and SoCs.