Tutorial Outline. Typical Memory Hierarchy


Tutorial Outline

8:30-8:45    Introduction and motivation
8:45-9:05    Sources of power in CMOS designs
9:05-9:30    Power analysis tools and techniques
9:30-10:30   Gate & functional unit design issues & techniques
10:30-10:50  BREAK
10:50-12:15  Architectural level issues and techniques
12:15-1:30   LUNCH
1:30-2:30    Low power memory system design
2:30-3:30    Software level issues and techniques
3:30-3:50    BREAK
3:50-4:30    Software level issues and techniques, con't
4:30-4:45    Future challenges

ISCA Tutorial: Low Power Design, Memories

Typical Memory Hierarchy

[Figure: on-chip components (Control, Datapath, RegFile, ITLB, DTLB, Instr Cache, Data Cache, eDRAM), followed by Second Level Cache (SRAM), Main Memory (DRAM), and Secondary Storage (Disk)]

DEC 21164a (2.0V $V_{dd}$, 0.35µ, 400MHz, 30W max)
» caches dissipate 25% of the total chip power
DEC SA-110 (2.0V $V_{dd}$, 0.35µ, 233MHz, 1W typ)
» no L2 on-chip
» I$ (D$) dissipate 27% (16%) of the total chip power

Importance of Optimizing Memory System Energy

Many emerging applications are data-intensive
For ASICs and embedded systems, the memory system can contribute up to 90% of the energy
Multiple memories in future system-on-chip designs

2D Memory Architecture

[Figure: array of storage (RAM) cells with $2^{k-j}$ rows (word lines) and $m \cdot 2^j$ columns (bit lines)]
Row address $A_j, A_{j+1}, \ldots, A_{k-1}$ drives the row decoder
Column address $A_0, A_1, \ldots, A_{j-1}$ drives the column decoder, which selects the appropriate word from the memory row
Sense amplifiers amplify the bit line swing
Read/write circuits connect the array to the input/output (m bits)

2D Memory Configuration

[Figure: memory array with row decoder and sense amps]

Sources of Power Dissipation

Active power sources: $P = V_{dd} \cdot I_{dd}$

$I_{dd} = m \cdot I_{act} + m(n-1) \cdot I_{ret} + (n+m) \cdot C_{de} \cdot V_{int} \cdot f + C_{pt} \cdot V_{int} \cdot f + I_{dcp}$

m - number of columns
n - number of rows
$V_{dd}$ - external power supply
$I_{act}$ - effective current of active cells
$I_{ret}$ - data retention current of inactive cells
$C_{de}$ - output node capacitance of each decoder
$V_{int}$ - internal supply voltage
$C_{pt}$ - total capacitance in periphery
$I_{dcp}$ - static current of column circuitry, diff amps; virtually independent of operating frequency

Slide annotations: "Negligible at high frequencies"; "(n+m) = 2 for CMOS NAND decoders"
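The active current expression on the Sources of Power Dissipation slide can be evaluated term by term. The following is a minimal sketch, not part of the tutorial: the function name and the example values are hypothetical, and `f` is taken to be the operating frequency in Hz.

```python
def active_current(m, n, i_act, i_ret, c_de, v_int, c_pt, i_dcp, f):
    """Evaluate each term of
    I_dd = m*I_act + m*(n-1)*I_ret + (n+m)*C_de*V_int*f + C_pt*V_int*f + I_dcp
    and return the terms plus their sum (all in amperes)."""
    terms = {
        "active_cells": m * i_act,               # m cells on the selected row
        "retention": m * (n - 1) * i_ret,        # all other (inactive) cells
        "decoders": (n + m) * c_de * v_int * f,  # decoder output nodes
        "periphery": c_pt * v_int * f,           # peripheral capacitance
        "static": i_dcp,                         # column circuitry, diff amps
    }
    terms["total"] = sum(terms.values())
    return terms


if __name__ == "__main__":
    # Illustrative values only; none of these numbers come from the slides.
    v_dd = 2.0
    t = active_current(m=128, n=128, i_act=100e-6, i_ret=1e-9,
                       c_de=50e-15, v_int=2.0, c_pt=20e-12,
                       i_dcp=1e-3, f=100e6)
    for name, amps in t.items():
        print(f"{name:>12}: {amps:.3e} A")
    print(f"P = V_dd * I_dd = {v_dd * t['total']:.3e} W")
```

Printing the individual terms makes it easy to see which contribution dominates for a given array shape and frequency.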

DRAM Energy Consumption

$I_{dd}$ increases with m and n
The destructive readout characteristic of DRAM requires the bit line to be charged and discharged with a large voltage swing, $V_{swing}$ (1.5-2.5 V)

$I_{dd} = [m \cdot C_{BL} \cdot V_{swing} + C_{pt} \cdot V_{int}] \cdot f + I_{dcp}$

Reduce charging capacitance - $C_{pt}$, $m \cdot C_{BL}$
Reduce external and internal voltages - $V_{dd}$, $V_{int}$, $V_{swing}$
Reduce static current - $I_{dcp}$

DRAM Reliability Concerns

Signal-to-noise characteristics require the bit line capacitance to be small
Signal: $V_s = (C_s / C_{BL}) \cdot V_{swing}$
$C_s$ - cell capacitance
Reducing $C_{BL}$ is beneficial
Reducing $V_{swing}$ is detrimental
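The tension between the two DRAM slides above (lower swing saves current but shrinks the sensed signal, while lower bit line capacitance helps both) can be checked numerically. This is a small sketch with hypothetical function names and made-up component values, not figures from the tutorial.

```python
def dram_active_current(m, c_bl, v_swing, c_pt, v_int, f, i_dcp):
    """I_dd = [m*C_BL*V_swing + C_pt*V_int]*f + I_dcp (amperes)."""
    return (m * c_bl * v_swing + c_pt * v_int) * f + i_dcp


def sense_signal(c_s, c_bl, v_swing):
    """V_s = (C_s / C_BL) * V_swing (volts)."""
    return (c_s / c_bl) * v_swing


if __name__ == "__main__":
    # Illustrative values only (assumptions, not from the slides).
    base = dict(m=1024, c_bl=300e-15, v_swing=2.0,
                c_pt=50e-12, v_int=2.0, f=10e6, i_dcp=0.5e-3)
    c_s = 30e-15

    def report(label, **changes):
        p = {**base, **changes}
        i = dram_active_current(**p)
        v = sense_signal(c_s, p["c_bl"], p["v_swing"])
        print(f"{label:<16} I_dd = {i:.3e} A, V_s = {v:.3f} V")

    report("baseline")
    report("half V_swing", v_swing=1.0)   # less current, smaller signal
    report("half C_BL", c_bl=150e-15)     # less current, larger signal
```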

SRAM Design

$I_{dd} = [m \cdot I_{DC} \cdot \Delta t + C_{pt} \cdot V_{int}] \cdot f + I_{dcp}$

Signal to noise is not so serious
Both SRAM and DRAM have evolved to use similar techniques

Data Retention Power

In data retention mode, the memory has no access from outside and data are retained by the refresh operation (for DRAMs)

$I_{dd} = [m \cdot C_{BL} \cdot V_{swing} + C_{pt} \cdot V_{int}] \cdot (n / t_{ref}) + I_{dcp}$

$t_{ref}$ is the refresh time and increases with reducing junction temperature
$I_{dcp}$ can be significant in this mode
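To see why $I_{dcp}$ can become significant in data retention mode, the retention expression can be split into its refresh (dynamic) part and its static part. The sketch below uses a hypothetical function name and illustrative values that are assumptions, not data from the tutorial.

```python
def retention_current(m, n, c_bl, v_swing, c_pt, v_int, t_ref, i_dcp):
    """I_dd = [m*C_BL*V_swing + C_pt*V_int]*(n / t_ref) + I_dcp.

    Returns (refresh_part, static_part, total) in amperes.
    """
    refresh_part = (m * c_bl * v_swing + c_pt * v_int) * (n / t_ref)
    return refresh_part, i_dcp, refresh_part + i_dcp


if __name__ == "__main__":
    # Illustrative values only (assumptions, not from the slides).
    params = dict(m=1024, n=1024, c_bl=300e-15, v_swing=2.0,
                  c_pt=50e-12, v_int=2.0, i_dcp=20e-6)
    for t_ref in (16e-3, 64e-3, 256e-3):   # longer t_ref at lower temperature
        dyn, stat, total = retention_current(t_ref=t_ref, **params)
        print(f"t_ref = {t_ref * 1e3:5.0f} ms: refresh {dyn:.2e} A, "
              f"static {stat:.2e} A, static share {stat / total:.0%}")
```

As $t_{ref}$ grows, the refresh part shrinks in proportion while the static part stays fixed, so its share of the total rises.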

SRAM Power Budget

[Figure: average power (mW, axis 0 to 60) by component (Decoders, Word line, BL+SA+Cell, Write ckt, Read ckt) for array sizes 128x128, 256x64, and 64x256]
16K bits, 0.5µ technology, 10ns cycle time, 4.05ns access time, 3.3V $V_{dd}$
From Chang, 1997

Low Power SRAM Techniques

Standby power reduction
Operating power reduction
» memory bank partitioning
» SRAM cell design
» reduced bit line swing (pulsed word line and bit line isolation)
» divided word line
» bit line segmentation
Can use the above in combination!

Memory Bank Partitioning

Partition the memory array into smaller banks so that only the addressed bank is activated
» improves speed and lowers power
» word line capacitance reduced
» number of bit cells activated reduced
At some point the delay and power overhead associated with the bank decoding circuit dominates (2 to 8 banks typical)

Partitioned Memory Structure

[Figure: memory blocks addressed by Row Addr, Block Addr, and Column Addr, with Input/Output (m bits)]
Advantages:
1. Shorter word and/or bit lines
2. Block addr activates only 1 block, saving power
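The partitioned structure above decodes one address into row, block, and column fields so that only one block is enabled per access. A minimal sketch of that field split follows; the field order (row in the high bits, then block, then column) and the bit widths are assumptions chosen for illustration, since the slide does not specify them.

```python
def split_address(addr, row_bits, block_bits, col_bits):
    """Split an address into (row, block, column) fields.

    Assumed layout, most significant bits first: row | block | column.
    """
    total_bits = row_bits + block_bits + col_bits
    if not 0 <= addr < (1 << total_bits):
        raise ValueError("address does not fit in the given field widths")
    col = addr & ((1 << col_bits) - 1)
    block = (addr >> col_bits) & ((1 << block_bits) - 1)
    row = addr >> (col_bits + block_bits)
    return row, block, col


def block_enables(block, num_blocks):
    """One-hot enable vector: only the addressed block is activated."""
    return [i == block for i in range(num_blocks)]


if __name__ == "__main__":
    row, block, col = split_address(0b10110011, row_bits=3, block_bits=2, col_bits=3)
    print(row, block, col)
    print(block_enables(block, num_blocks=4))
```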

SRAM Cell

[Figure: SRAM cell with word line WL, bit lines BL, and storage nodes Q]
A 6-T SRAM cell reduces static current (leakage) but takes more area
Very low $V_{dd}$ SRAMs with reduced $V_{th}$ suffer from large leakage currents
» use multiple threshold devices (memory cells with higher $V_{th}$ to reduce leakage, while peripheral circuits use low $V_{th}$ to improve speed)

Switched Power Supply with Level Holding

[Figure: low-Vt logic with output Q connected to Vdd through high-Vt switches, plus a level holder circuit; one switch control is 0 in normal operation (1 not used), the other is 1 in normal operation (0 not used)]
Multi-Vt device by changing well voltages; Vt high during standby and low otherwise

Reduced Bit Line Swing

Limit voltage swing on bit lines to improve both speed and power
» need sense amp for each column to sense/restore signal
» isolate memory cells from the bit lines after sensing (to prevent the cells from changing the bit line voltage further) - pulsed word line
» isolate sense amps from bit lines after sensing (to prevent bit lines from having large voltage swings) - bit line isolation

Pulsed Word Line

Generation of word line pulses is very critical
» too short - sense amp operation may fail
» too long - power efficiency degraded (because bit line swing size depends on duration of the word line pulse)
Word line pulse generation
» delay lines (susceptible to process, temp, etc.)
» use feedback from bit lines

Pulsed Word Line Structure

[Figure: Read signal, word line, bit lines, and a 10% populated dummy column whose dummy bit lines generate the Complete signal]
Dummy column
» height set to 10% of a regular column and its cells are tied to a fixed value
» capacitance is only 10% of a regular column

Pulsed Word Line Timing

[Figure: waveforms for Read, Complete, word line, bit line (swing of $0.1 V_{dd}$), and dummy bit line (swing of $V_{dd}$)]
Dummy bit lines have reached full swing and trigger pulse shut-off when regular bit lines reach 10% swing

Bit Line Isolation

[Figure: bit lines (swing of $0.1 V_{dd}$) connected through isolation devices to the sense amplifier, with isolate, sense, and Read control signals; sense amplifier outputs swing to $V_{dd}$]

Divided Word Line

RAM cells in each row are organized into blocks; memory cells in each block are accessed by a local decoder
Only the memory cells in the activated block have their bit line pairs driven
» improves speed (by decreasing word line delay)
» lowers power dissipation (by decreasing the number of BL pairs activated)

Divided Word Line Structure

[Figure: row block with word lines WL i and WL i+1, local decoders (LD) driving local word lines LWL i and LWL i+1, RAM cells on bit lines BL j to BL j+m, and a block select line (BSL)]
Load capacitance on the word line is determined by the number/size of local decoders
» faster word line (since smaller capacitance)
» now have to wait for local decoder delay

Cells/Block

How many cells to put in one block?
» Power savings best with 2 cells/block: fewest number of bit lines activated
» Area penalty worst with 2 cells/block: more local decoders and BSL buffers
» BSL logic: need buffers to drive each BSL
  4 and 16 cells/block: BSLs are the enable inputs of the column decoder's last stage of 2x4 decoders
  2 (8) cells/block: need a NOR gate with 2 (8) inputs from the output of the column decoder

DWL Power Reduction

              Write Operations              Read Operations
Cells/block   128x128   256x64   64x256     128x128   256x64   64x256
2             77.0%     68.5%    78.4%      80.1%     71.6%    82.9%
4             75.5%     65.5%    77.2%      79.1%     68.3%    82.0%
8             73.1%     60.3%    75.8%      76.6%     62.9%    80.3%
16            67.2%     49.8%    72.6%      70.2%     51.9%    76.7%

From Chang, 1997

DWL Area Penalty

Cells/block   128x128   256x64   64x256
2             25.5%     24.6%    24.8%
4             19.2%     18.5%    18.4%
8             17.0%     16.5%    16.2%
16            15.4%     14.8%    14.5%

Bit Line Segmentation

RAM cells in each column are organized into blocks selected by word lines
Only the memory cells in the activated block present a load on the bit line
» lowers power dissipation (by decreasing bit line capacitance)
» can use smaller sense amps

Bit Line Segmented Structure

[Figure: common bit line BL j with local bit lines LBL i,j to LBL i+n,j, each connected through a switch (SWL i,j to SWL i+n,j) that isolates the segment; word line WL i]
The address decoder identifies the segment targeted by the row address and isolates all but the targeted segment from the common bit line
Has minimal effect on performance

Cache Power

On-chip I$ and D$ (high speed SRAM)
» DEC 21164a (2.0V $V_{dd}$, 0.35µ, 400MHz, 30W max)
  I/D/L2 of 8/8/96KB and 1/1/3 associativity
  caches dissipate 25% of the total chip power
» DEC SA-110 (2.0V $V_{dd}$, 0.35µ, 233MHz, 1W typ)
  I/D of 16/16KB and 32/32 associativity (no L2 on-chip)
  I$ (D$) dissipate 27% (16%) of the total chip power
Improving the power efficiency of caches is critical to the overall system power

Cache Energy Consumption

Energy dissipated by bit lines: precharge, read, and write cycles
Energy dissipated by word lines: when a particular row is being read or written
Energy dissipated by address decoders
Energy dissipated by peripheral circuits - comparators, cache control logic, etc.
Off-chip main memory energy is based on a per-access cost

Analytical Energy Model Example

On-chip cache

$Energy = E_{bus} + E_{cell} + E_{pad} + E_{main}$
$E_{cell} = \beta \cdot wl\_length \cdot (bl\_length + 4.8) \cdot (N_{hit} + 2 N_{miss})$
$wl\_length = m \cdot (T + 8L + St)$
$bl\_length = C / (m \cdot L)$

$N_{hit}$ = number of hits; $N_{miss}$ = number of misses; C = cache size; L = cache line size in bytes; m = set associativity; T = tag size in bits; St = number of status bits per line; $\beta$ = 1.44e-14 (technology parameter)

Cache Power Distribution

[Figure: power in milliwatts (axis 0 to 1800) for L1 I$, L1 D$, and L2 on benchmarks ijpeg, perl, fppp, and avg]
Base configuration: 4-way superscalar
» 32KB DM L1 I$
» 32KB, 4-way SA L1 D$, 32B blocks, write back
» 128KB, 4-way SA L2, 64B blocks, write back
» 1MB, 8-way SA off-chip L3, 128B blocks, write thru
Interconnect widths
» 16B between L1 and L2
» 32B between L2 and L3
» 64B between L3 and MM
From Ghose, 1999
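The cell-array term of the analytical energy model above is simple enough to evaluate directly. The sketch below implements only $E_{cell}$ with the slide's formulas; the function names, the tag and status bit counts, and the hit/miss counts in the example are assumptions made for illustration. The slide does not state units, so the result is reported without one.

```python
BETA = 1.44e-14  # technology parameter given on the slide


def wl_length(m, T, L, St):
    """wl_length = m * (T + 8L + St): bits driven along one word line."""
    return m * (T + 8 * L + St)


def bl_length(C, m, L):
    """bl_length = C / (m * L): number of rows (sets) along a bit line."""
    return C / (m * L)


def e_cell(C, L, m, T, St, n_hit, n_miss, beta=BETA):
    """E_cell = beta * wl_length * (bl_length + 4.8) * (N_hit + 2*N_miss)."""
    return (beta * wl_length(m, T, L, St)
            * (bl_length(C, m, L) + 4.8)
            * (n_hit + 2 * n_miss))


if __name__ == "__main__":
    # Hypothetical example: 32KB cache, 32B lines; T and St are assumed.
    for assoc in (1, 2, 4):
        e = e_cell(C=32 * 1024, L=32, m=assoc, T=20, St=2,
                   n_hit=950_000, n_miss=50_000)
        print(f"{assoc}-way: E_cell = {e:.4e}")
```

Sweeping `m` with fixed hit and miss counts isolates the geometric effect of associativity; in practice the miss count also changes with associativity, which this sketch does not model.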

Low Power Cache Techniques

SRAM power reduction
Cache block buffering
Cache subbanking
Divided word line
Multidivided module (MDM)
Modifications to CAM cell (for FA cache and FA TLB)

Cache Block Buffering

Check to see if the desired data is in the data output latch from the last cache access (i.e., in the same cache block)
Saves energy since the tag and data arrays are not accessed
» minimal overhead hardware
Can maintain performance of a normal set associative cache
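The effect of block buffering can be estimated by counting how many accesses fall in the same block as the previous one, since those are served from the output latch without touching the tag and data arrays. This is a simplified sketch with a single buffer and a hypothetical function name; it compares whole block addresses and does not model the cache's sets, tags, or misses.

```python
def count_array_accesses(addresses, block_size):
    """Return (array_accesses, buffer_hits) for a single block buffer.

    An access hits the buffer when it falls in the same block as the
    previous access; otherwise the tag and data arrays are read and the
    buffer is reloaded with the new block.
    """
    last_block = None
    array_accesses = 0
    buffer_hits = 0
    for addr in addresses:
        block = addr // block_size
        if block == last_block:
            buffer_hits += 1
        else:
            array_accesses += 1
            last_block = block
    return array_accesses, buffer_hits


if __name__ == "__main__":
    # Sequential instruction fetch: 4-byte instructions, 32-byte blocks.
    trace = list(range(0, 256, 4))
    arrays, hits = count_array_accesses(trace, block_size=32)
    print(f"array accesses: {arrays}, served from buffer: {hits}")
```

For a purely sequential fetch stream like the one in the example, only the first access to each block reaches the arrays.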

Block Buffer Cache Structure

[Figure: address issued by CPU indexes the tag and data arrays; tag comparators produce Hit and the desired word; a comparison against last_set_# disables sensing]

Block Buffering Performance

[Figure: power in milliwatts (axis 0 to 3000) for L1 I$, L1 D$, L2, and Total with 0 buffers, 1 buffer, and 2 buffers; same base configuration (4-way superscalar, 32KB DM L1 I$, ...)]
From Ghose, 1999

Cache Subbanking

[Figure: address issued by CPU; tag and data arrays split into subbank 0 and subbank 1; tag comparators produce Hit and the desired word]
Only read from one subbank
Similar to column multiplexing in SRAMs
» columns can share precharge and sense amps
» each subbank has its own decoder

Subbanking Performance

[Figure: power in milliwatts (axis 0 to 3600) for L1 I$, L1 D$, L2, and Total, comparing 4B subbank width, conventional 16B vs. subbanked 16B, and conventional 32B vs. subbanked 32B; same base configuration (4-way superscalar, 32KB DM L1 I$)]
From Ghose, 1999

Divided Word Line Cache

[Figure: word lines WL i and WL i+1 with local decoders (LD) enabled from byte select bit<0>, selecting word<0> or word<1>]
Same goals as subbanking
» reduce number of active bit lines
» reduce capacitive loading on word and bit lines

Multidivided Module Cache

[Figure: address issued by CPU selects one of the modules, labeled s0-s15 and s16-s31]
With M modules and only one module activated per cycle, load capacitance is reduced by a factor of M (reduces both latency and power)
Can combine multidivided module, buffering, and subbanking or divided word line to get the savings of all three

Translation Lookaside Buffers

Small caches to speed up address translation in processors with virtual memory
All addresses have to be translated before cache access
» DEC SA-110 (2.0V $V_{dd}$, 0.35µ, 233MHz, 1W typ)
  I$ (D$) dissipate 27% (16%) of the total chip power
  TLB: 17% of total chip power
I$ can be virtually indexed/virtually tagged

TLB Structure

[Figure: address issued by CPU (page size = index bits + byte select bits); the TLB maps VA Tag to PA, which is compared against the cache tags to produce Hit and the desired word]
Most TLBs are small (<= 256 entries) and thus fully associative

TLB Power

[Figure: power in milliwatts (axis 0 to 80) vs. number of entries (32, 64, 128, 256) for DM, 2-way SA, 4-way SA, 8-way SA, and FA organizations]
From Juan, 1997

CAM Design

[Figure: CAM array with word lines WL<0> to WL<3> and match lines match<0> to match<3>; each match line drives a word line of the data array (match<0> drives word line<0>) and contributes to Hit; bit lines connect to read/write circuitry carrying match/write data; precharge/match control]

Low Power CAM Cell

[Figure: two CAM cell schematics with bit lines, word line WL, and match line; the low power version adds a match control signal]

Low Power DRAMs

Conventional DRAMs refresh all rows with a fixed single time interval
» read/write stalled while refreshing
» refresh period -> $t_{ref}$
» DRAM power = k * (#read/writes + #ref)
So have to worry about optimizing the refresh operation as well

Optimizing Refresh

Selective refresh architecture (SRA)
» add a valid bit to each memory row and only refresh rows with the valid bit set
» reduces refresh 5% to 80%
Variable refresh architecture (VRA)
» data retention time of each cell is different
» add a refresh period table and refresh counter to each row and refresh each row with the appropriate period
» reduces refresh about 75%
From Ohsawa, 1995
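The two refresh optimizations above both reduce the #ref term in the slide's power expression, DRAM power = k * (#read/writes + #ref). The following sketch counts refresh operations over a time window for the conventional, selective (SRA), and variable (VRA) schemes; the function names, the row data, and the time units are hypothetical.

```python
def conventional_refreshes(num_rows, window, base_period):
    """Every row is refreshed once per base period."""
    return num_rows * (window // base_period)


def selective_refreshes(valid_bits, window, base_period):
    """SRA: only rows whose valid bit is set are refreshed."""
    return sum(1 for v in valid_bits if v) * (window // base_period)


def variable_refreshes(row_periods, window):
    """VRA: each row is refreshed with its own period from a table."""
    return sum(window // period for period in row_periods)


def dram_power(k, num_read_writes, num_refreshes):
    """DRAM power = k * (#read/writes + #ref), as on the slide."""
    return k * (num_read_writes + num_refreshes)


if __name__ == "__main__":
    # Illustrative 4-row memory, time in arbitrary units.
    window, base = 8, 1
    valid = [True, False, True, False]
    periods = [1, 2, 4, 4]          # per-row retention-based periods
    accesses = 10
    for name, refs in [
        ("conventional", conventional_refreshes(4, window, base)),
        ("selective", selective_refreshes(valid, window, base)),
        ("variable", variable_refreshes(periods, window)),
    ]:
        print(f"{name:>12}: #ref = {refs}, power = {dram_power(1.0, accesses, refs)}")
```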

Application-Specific Memories

Data and code compression
» Custom instruction sets: ARM Thumb code, interleaving of 32-bit and 16-bit Thumb codes
» Reduces memory size
» Reduces width of off-chip buses
» Location of compression unit is important
» Compress only selected blocks

Hardware Code Compression

Assuming only a subset of instructions is used, replace them with a shorter encoding to reduce memory bandwidth
[Figure: Core, memory, and IDT blocks; labels: addresses, instructions, log n bits, k bits]
IDT: instruction decompression table (restores original format)
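The decompression-table idea can be sketched in a few lines: each distinct instruction word used by a program gets a short index, memory stores the indices, and the IDT maps an index back to the original word. This is an illustration under stated assumptions, not the scheme from the tutorial's source: the function names are hypothetical, the 32-bit words in the example are arbitrary values rather than real opcodes, and the index width is assumed to be the ceiling of log2 of the number of distinct instructions.

```python
def build_idt(program):
    """Return (idt, encoded): the table of distinct instruction words
    and the program rewritten as indices into that table."""
    idt = sorted(set(program))                   # deterministic table order
    index_of = {word: i for i, word in enumerate(idt)}
    return idt, [index_of[word] for word in program]


def index_bits(idt):
    """Bits needed per compressed instruction: ceil(log2(len(idt)))."""
    return max(1, (len(idt) - 1).bit_length())


def decompress(idt, encoded):
    """IDT lookup: restore the original instruction words."""
    return [idt[i] for i in encoded]


if __name__ == "__main__":
    k = 32                                        # original instruction width
    program = [0x10000001, 0x20000002, 0x10000001, 0x30000003,
               0x10000001, 0x20000002]            # arbitrary example words
    idt, encoded = build_idt(program)
    bits = index_bits(idt)
    assert decompress(idt, encoded) == program
    print(f"{len(idt)} distinct instructions, {bits} bits each instead of {k}")
    print(f"program storage: {bits * len(encoded)} bits vs {k * len(program)} bits "
          f"(plus {k * len(idt)} bits for the table)")
```

The table itself costs storage, so the scheme pays off only when the program is long relative to the number of distinct instructions it uses.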

Other Techniques

Customizing the memory hierarchy
» Close vs. far memory accesses
» Close - faster, less energy consuming, smaller caches
» Energy per access increases monotonically with memory size
» Automatic memory partitioning

Memory Partitioning

A memory partition is a set of memory banks that can be independently selected
Any address is stored in one and only one bank
The total energy consumed by a partitioned memory is the sum of the energy consumed by all its banks
Adding partitions increases the selection logic cost
Macii, 2000
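The partitioning trade-off above can be explored with a toy model: each access goes to exactly one bank, a bank's energy per access grows with its size, and the total is the sum over banks plus a selection overhead. In the sketch below everything beyond those three statements is an assumption: the function names, the linear energy model, the per-access selection cost, and the access trace are all hypothetical.

```python
from collections import Counter


def bank_of(addr, banks):
    """Return the index of the single bank whose [start, end) range holds addr."""
    for i, (start, end) in enumerate(banks):
        if start <= addr < end:
            return i
    raise ValueError(f"address {addr} is not covered by any bank")


def partition_energy(trace, banks, energy_per_access, select_cost=0.0):
    """Total energy = sum over banks of (accesses * energy per access)
    plus a per-access selection logic cost."""
    counts = Counter(bank_of(addr, banks) for addr in trace)
    total = 0.0
    for i, (start, end) in enumerate(banks):
        total += counts[i] * energy_per_access(end - start)
    return total + select_cost * len(trace)


def linear_model(size_words):
    """Hypothetical model: energy per access proportional to bank size."""
    return float(size_words)


if __name__ == "__main__":
    trace = [0, 1, 1, 2, 100]                     # mostly hits a small hot region
    single = partition_energy(trace, [(0, 128)], linear_model)
    split = partition_energy(trace, [(0, 4), (4, 128)], linear_model,
                             select_cost=1.0)
    print(f"monolithic: {single}, two banks: {split}")
```

Placing the frequently accessed addresses in a small bank lowers the total even after the selection cost is added, which is the behavior the slide describes.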

Scratch Pad Memory

Use of scratch pad memory instead of caches for locality
» Memory accesses of embedded software are usually very localized
» Map the most frequently accessed locations onto a small on-chip memory
» Caches have tag overhead - eliminate it with application-specific decode logic
» Map a small set of the most frequently accessed addresses to consecutive locations in a small memory
Benini, 2000

Key References, Memories

Amrutur, Techniques to Reduce Power in Fast Wide Memories, Proc. of SLPE, pp. 92-93, 1994.
Angel, Survey of Low Power Techniques for ROMs, Proc. of SLPED, pp. 7-11, Aug. 1997.
Chang, Power-Area Trade-Offs in Divided Word Line Memory Arrays, Journal of Circuits, Systems, Computers, 7(1):49-57, 1997.
Evans, Energy Consumption Modeling and Optimization for SRAMs, IEEE Journal of SSC, 30(5):571-579, May 1995.
Itoh, Low Power Memory Design, in Low Power Design Methodologies, pp. 201-251, KAP, 1996.
Ohsawa, Optimizing the DRAM Refresh Count, Proc. of SLPED, pp. 82-87, Aug. 1998.
Shimazaki, An Automatic Power-Save Cache Memory, Proc. of SLPE, pp. 58-59, 1995.
Yoshimoto, A Divided Word Line Structure in SRAMs, IEEE Journal of SSC, 18:479-485, 1983.

Key References, Caches

Ghose, Reducing Power in SuperScalar Processor Caches Using Subbanking, Multiple Line Buffers and Bit-Line Segmentation, Proc. of ISLPED, pp. 70-75, 1999.
Juan, Reducing TLB Power Requirements, Proc. of ISLPED, pp. 196-201, Aug. 1997.
Kin, The Filter Cache: An Energy-Efficient Memory Structure, Proc. of MICRO, pp. 184-193, Dec. 1997.
Ko, Energy Optimization of Multilevel Cache Architectures, IEEE Trans. on VLSI Systems, 6(2):299-308, June 1998.
Panwar, Reducing the Frequency of Tag Compares for Low Power I$ Designs, Proc. of ISLPD, pp. 57-62, 1995.
Shimazaki, An Automatic Power-Save Cache Memory, Proc. of SLPE, pp. 58-59, 1995.
Su, Cache Design Tradeoffs for Power and Performance Optimization, Proc. of ISLPD, pp. 63-68, 1995.