Tutorial Outline
  8:30-8:45     Introduction and motivation
  8:45-9:05     Sources of power in CMOS designs
  9:05-9:30     Power analysis tools and techniques
  9:30-10:30    Gate & functional unit design issues & techniques
  10:30-10:50   BREAK
  10:50-12:15   Architectural level issues and techniques
  12:15-1:30    LUNCH
  1:30-2:30     Low power memory system design
  2:30-3:30     Software level issues and techniques
  3:30-3:50     BREAK
  3:50-4:30     Software level issues and techniques, con't
  4:30-4:45     Future challenges

ISCA Tutorial: Low Power Design, Memories

Typical Memory Hierarchy
  [Figure: memory hierarchy. Labels: On-Chip Components, Control, eDRAM, Datapath, RegFile, ITLB, DTLB, Instr Cache, Data Cache, Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk)]
  DEC 21164a ($V_{dd}$ = 2.0 V, 0.35 µ, 400 MHz, 30 W max)
    » caches dissipate 25% of the total chip power
  DEC SA-110 ($V_{dd}$ = 2.0 V, 0.35 µ, 233 MHz, 1 W typ)
    » no L2 on-chip
    » I$ (D$) dissipate 27% (16%) of the total chip power

Importance of Optimizing Memory System Energy
  Many emerging applications are data-intensive
  For ASICs and embedded systems, the memory system can contribute up to 90% of the energy
  Multiple memories in future system-on-chip designs

2D Memory Architecture
  [Figure: 2D memory array of storage (RAM) cells, $2^{k-j}$ rows by $m \cdot 2^{j}$ columns, crossed by word lines and bit lines]
  Row address ($A_j, A_{j+1}, \ldots, A_{k-1}$) feeds the row decoder
  Sense amplifiers amplify the bit line swing
  Read/write circuits
  Column address ($A_0, A_1, \ldots, A_{j-1}$) feeds the column decoder, which selects the appropriate word from the memory row
  Input/Output (m bits)

2D Memory Configuration
  [Figure: memory array configuration. Labels: Sense Amps, Row Decoder, Sense Amps]

Sources of Power Dissipation
  Active power sources: $P = V_{dd} \cdot I_{dd}$

  $I_{dd} = m \cdot i_{act} + m(n-1) \cdot i_{ret} + (n+m) \cdot C_{de} \cdot V_{int} \cdot f + C_{pt} \cdot V_{int} \cdot f + I_{dcp}$

  m - number of columns
  n - number of rows
  $V_{dd}$ - external power supply
  $i_{act}$ - effective current of active cells
  $i_{ret}$ - data retention current of inactive cells
  $C_{de}$ - output node capacitance of each decoder
  $V_{int}$ - internal supply voltage
  $C_{pt}$ - total capacitance in periphery
  $I_{dcp}$ - static current of column circuitry, diff amps

  Annotations on the equation:
    » retention term $m(n-1) \cdot i_{ret}$: negligible at high frequencies
    » decoder term: (n+m) = 2 for CMOS NAND decoders
    » $I_{dcp}$: virtually independent of operating frequency

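As a rough illustration of the active-current equation on the slide above, the following Python sketch evaluates each term separately so their relative sizes can be compared. The function and key names are illustrative, and every numeric value is a made-up placeholder, not data from the tutorial.

```python
def active_current(m, n, i_act, i_ret, c_de, v_int, f, c_pt, i_dcp):
    """Evaluate I_dd term by term (amps, farads, volts, hertz)."""
    terms = {
        "active_cells": m * i_act,                # m * i_act
        "retention_cells": m * (n - 1) * i_ret,   # m * (n-1) * i_ret
        "decoders": (n + m) * c_de * v_int * f,   # (n+m) * C_de * V_int * f
        "periphery": c_pt * v_int * f,            # C_pt * V_int * f
        "static": i_dcp,                          # I_dcp
    }
    terms["total"] = sum(terms.values())
    return terms


def active_power(v_dd, i_dd):
    """P = V_dd * I_dd."""
    return v_dd * i_dd


if __name__ == "__main__":
    # Placeholder values chosen only to exercise the formula.
    result = active_current(m=128, n=128, i_act=50e-6, i_ret=1e-9,
                            c_de=0.1e-12, v_int=2.0, f=100e6,
                            c_pt=50e-12, i_dcp=1e-3)
    for name, amps in result.items():
        print(f"{name:16s} {amps * 1e3:9.4f} mA")
    print(f"{'power':16s} {active_power(3.3, result['total']) * 1e3:9.4f} mW")
```

Keeping the terms separate makes it easy to see which contribution dominates for a given set of assumed parameters.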
DRAM Energy Consumption
  $I_{dd}$ increases with m and n
  The destructive readout characteristic of DRAM requires the bit lines to be charged and discharged with a large voltage swing, $V_{swing}$ (1.5-2.5 V)

  $I_{dd} = [m \cdot C_{BL} \cdot V_{swing} + C_{pt} \cdot V_{int}] \cdot f + I_{dcp}$

  Reduce charging capacitance: $C_{pt}$, $m \cdot C_{BL}$
  Reduce external and internal voltages: $V_{dd}$, $V_{int}$, $V_{swing}$
  Reduce static current: $I_{dcp}$

DRAM Reliability Concerns
  Signal-to-noise characteristics require the bit line capacitance to be small
  Signal: $V_s = (C_s / C_{BL}) \cdot V_{swing}$
    » $C_s$ - cell capacitance
  Reducing $C_{BL}$ is beneficial
  Reducing $V_{swing}$ is detrimental

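The tension between the two DRAM slides above can be sketched numerically: lowering C_BL cuts current and raises the sense signal, while lowering V_swing cuts current but also shrinks the signal. The Python below is a minimal sketch of the two formulas; the function names and all parameter values are assumptions made for illustration.

```python
def dram_active_current(m, c_bl, v_swing, c_pt, v_int, f, i_dcp):
    """I_dd = [m*C_BL*V_swing + C_pt*V_int] * f + I_dcp."""
    return (m * c_bl * v_swing + c_pt * v_int) * f + i_dcp


def sense_signal(c_s, c_bl, v_swing):
    """V_s = (C_s / C_BL) * V_swing."""
    return (c_s / c_bl) * v_swing


if __name__ == "__main__":
    # Placeholder device values, not taken from the tutorial.
    base = dict(m=1024, c_bl=250e-15, v_swing=2.0,
                c_pt=100e-12, v_int=2.5, f=10e6, i_dcp=1e-3)
    c_s = 30e-15
    variants = {
        "baseline": base,
        "half C_BL": {**base, "c_bl": base["c_bl"] / 2},
        "half V_swing": {**base, "v_swing": base["v_swing"] / 2},
    }
    for name, p in variants.items():
        i_dd = dram_active_current(**p)
        v_s = sense_signal(c_s, p["c_bl"], p["v_swing"])
        print(f"{name:13s} I_dd = {i_dd * 1e3:.3f} mA   V_s = {v_s * 1e3:.1f} mV")
```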
SRAM Design
  $I_{dd} = [m \cdot i_{DC} \cdot t + C_{pt} \cdot V_{int}] \cdot f + I_{dcp}$
  Signal-to-noise is not as serious a concern
  Both SRAM and DRAM have evolved to use similar techniques

Data Retention Power
  In data retention mode, the memory is not accessed from outside and data are retained by the refresh operation (for DRAMs)

  $I_{dd} = [m \cdot C_{BL} \cdot V_{swing} + C_{pt} \cdot V_{int}] \cdot (n / t_{ref}) + I_{dcp}$

  $t_{ref}$ is the refresh time; it increases as the junction temperature decreases
  $I_{dcp}$ can be significant in this mode

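The data retention equation above can be explored with a short Python sketch that shows how a longer refresh time lowers the refresh component and leaves the static current as a growing share of the total. The function names and the example values are assumptions for illustration only.

```python
def retention_current(m, n, c_bl, v_swing, c_pt, v_int, t_ref, i_dcp):
    """I_dd = [m*C_BL*V_swing + C_pt*V_int] * (n / t_ref) + I_dcp."""
    refresh = (m * c_bl * v_swing + c_pt * v_int) * (n / t_ref)
    return refresh + i_dcp


def static_share(i_dcp, i_total):
    """Fraction of the retention current that is static (I_dcp)."""
    return i_dcp / i_total


if __name__ == "__main__":
    # Placeholder values, not taken from the tutorial.
    params = dict(m=1024, n=4096, c_bl=250e-15, v_swing=2.0,
                  c_pt=100e-12, v_int=2.5, i_dcp=20e-6)
    for t_ref in (16e-3, 64e-3, 256e-3):
        total = retention_current(t_ref=t_ref, **params)
        share = static_share(params["i_dcp"], total)
        print(f"t_ref = {t_ref * 1e3:5.0f} ms   I_dd = {total * 1e6:7.2f} uA   "
              f"static share = {share:.0%}")
```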
SRAM Power Budget
  [Figure: bar chart of average power (mW, 0 to 60) by component (decoders, word line, BL+SA+cell, write ckt, read ckt) for array sizes 128x128, 256x64, and 64x256]
  16K bits, 0.5 µ technology, 10 ns cycle time, 4.05 ns access time, 3.3 V $V_{dd}$
  From Chang, 1997

Low Power SRAM Techniques
  Standby power reduction
  Operating power reduction
    » memory bank partitioning
    » SRAM cell design
    » reduced bit line swing (pulsed word line and bit line isolation)
    » divided word line
    » bit line segmentation
  Can use the above in combination!

Memory Bank Partitioning
  Partition the memory array into smaller banks so that only the addressed bank is activated
    » improves speed and lowers power
    » word line capacitance reduced
    » number of bit cells activated reduced
  At some point the delay and power overhead of the bank decoding circuit dominates (2 to 8 banks typical)

Partitioned Memory Structure
  [Figure: partitioned memory. Labels: Row Addr, Block Addr, Column Addr, Input/Output (m bits)]
  Advantages:
    1. Shorter word and/or bit lines
    2. Block addr activates only 1 block, saving power

SRAM Cell
  [Figure: 6-T SRAM cell. Labels: WL, BL, Q, Q, BL]
  The 6-T SRAM cell reduces static current (leakage) but takes more area
  Reducing $V_{th}$ in very low $V_{dd}$ SRAMs leads to large leakage currents
    » use multiple-threshold devices (memory cells with higher $V_{th}$ to reduce leakage, peripheral circuits with low $V_{th}$ to improve speed)

Switched Power Supply with Level Holding
  [Figure: switched power supply circuit. Labels: High Vt (0 - normal, 1 - not used), Vdd, Q, Low Vt, Level Holder Circuit, High Vt (1 - normal, 0 - not used)]
  Multi-Vt device obtained by changing well voltages: Vt high during standby and low otherwise

Reduced Bit Line Swing
  Limit the voltage swing on bit lines to improve both speed and power
    » need a sense amp for each column to sense/restore the signal
    » isolate memory cells from the bit lines after sensing (to prevent the cells from changing the bit line voltage further): pulsed word line
    » isolate sense amps from bit lines after sensing (to prevent bit lines from having large voltage swings): bit line isolation

Pulsed Word Line
  Generation of word line pulses is very critical
    » too short: sense amp operation may fail
    » too long: power efficiency degraded (because bit line swing size depends on the duration of the word line pulse)
  Word line pulse generation
    » delay lines (susceptible to process, temperature, etc.)
    » use feedback from bit lines

Pulsed Word Line Structure
  [Figure: pulsed word line circuit. Labels: Read, Word line, Dummy bit lines, Bit lines, Complete, 10% populated]
  Dummy column
    » height set to 10% of a regular column and its cells are tied to a fixed value
    » capacitance is only 10% of a regular column

Pulsed Word Line Timing
  [Figure: timing diagram. Signals: Read, Complete, Word line, Bit line (swing 0.1 $V_{dd}$), Dummy bit line (swing $V_{dd}$)]
  Dummy bit lines have reached full swing and trigger pulse shut-off when regular bit lines reach 10% swing

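Why a dummy column with 10% of the capacitance yields a 10% swing on the regular bit lines can be seen with a first-order model. The sketch below assumes, purely for illustration, that every cell discharges its bit line with the same constant current, so the swing grows linearly as I*t/C; real bit lines are not this ideal, and the function names and values are hypothetical.

```python
def bit_line_swing(i_cell, c_line, t, v_dd):
    """Swing after time t under a constant discharge current, capped at V_dd."""
    return min(v_dd, i_cell * t / c_line)


def pulse_shutoff_time(i_cell, c_bitline, v_dd, dummy_fraction=0.1):
    """Time at which the dummy bit line (dummy_fraction of the regular
    capacitance) reaches full swing and can shut off the word line pulse."""
    c_dummy = dummy_fraction * c_bitline
    return c_dummy * v_dd / i_cell


if __name__ == "__main__":
    # Placeholder values: 100 uA cell current, 500 fF bit line, 3.3 V supply.
    i_cell, c_bl, v_dd = 100e-6, 500e-15, 3.3
    t_off = pulse_shutoff_time(i_cell, c_bl, v_dd)
    regular = bit_line_swing(i_cell, c_bl, t_off, v_dd)
    dummy = bit_line_swing(i_cell, 0.1 * c_bl, t_off, v_dd)
    print(f"pulse shut-off after {t_off * 1e9:.2f} ns")
    print(f"dummy swing   = {dummy:.2f} V")
    print(f"regular swing = {regular:.2f} V ({regular / v_dd:.0%} of V_dd)")
```

Because the shut-off is derived from the dummy bit line itself, the pulse width tracks the same process and temperature variations as the regular columns, which is the advantage over a fixed delay line noted on the slides.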
Bit Line Isolation
  [Figure: bit line isolation circuit and timing. Labels: bit lines (swing 0.1 $V_{dd}$), isolate, sense, Read, sense amplifier, sense amplifier outputs (swing $V_{dd}$)]

Divided Word Line
  RAM cells in each row are organized into blocks; memory cells in each block are accessed by a local decoder
  Only the memory cells in the activated block have their bit line pairs driven
    » improves speed (by decreasing word line delay)
    » lowers power dissipation (by decreasing the number of BL pairs activated)

Divided Word Line Structure
  [Figure: divided word line array. Labels: Row, block, WL i, WL i+1, Local decoder (LD), LWL i, LWL i+1, RAM cell, BL j, BL j+1, BL j+m, BSL (block select line)]
  Load capacitance on the word line is determined by the number/size of the local decoders
    » faster word line (since smaller capacitance)
    » now have to wait for the local decoder delay

Cells/Block
  How many cells to put in one block?
    » Power savings are best with 2 cells/block: fewest number of bit lines activated
    » Area penalty is worst with 2 cells/block: more local decoders and BSL buffers
    » BSL logic: need buffers to drive each BSL
        4 and 16 cells/block: BSLs are the enable inputs of the column decoder's last stage of 2x4 decoders
        2 (8) cells/block: need a NOR gate with 2 (8) inputs from the output of the column decoder

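The cells/block trade-off above comes down to simple counting. The following Python sketch tabulates, for one row, how many bit line pairs are driven and how many local decoders are needed. It assumes one local decoder and one block select line per block, which is how the structure figure reads but is not stated explicitly; the function and key names are illustrative.

```python
def dwl_row_organization(cells_per_row, cells_per_block):
    """Count per-row resources for a divided word line array."""
    if cells_per_row % cells_per_block != 0:
        raise ValueError("cells_per_row must be a multiple of cells_per_block")
    blocks = cells_per_row // cells_per_block
    return {
        "blocks_per_row": blocks,
        "local_decoders_per_row": blocks,             # assumed: one LD per block
        "block_select_lines": blocks,                 # assumed: one BSL per block
        "bit_line_pairs_activated": cells_per_block,  # only the selected block
    }


if __name__ == "__main__":
    for cpb in (2, 4, 8, 16):
        org = dwl_row_organization(cells_per_row=128, cells_per_block=cpb)
        print(f"{cpb:2d} cells/block: "
              f"{org['bit_line_pairs_activated']:2d} BL pairs driven, "
              f"{org['local_decoders_per_row']:2d} local decoders per row")
```

Fewer cells per block means fewer bit line pairs switching on each access, at the cost of more local decoders and select buffers, which matches the direction of the power and area trends reported on the slides.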
DWL Power Reduction
                  Write Operations              Read Operations
  Cells/block     128x128  256x64  64x256       128x128  256x64  64x256
  2               77.0%    68.5%   78.4%        80.1%    71.6%   82.9%
  4               75.5%    65.5%   77.2%        79.1%    68.3%   82.0%
  8               73.1%    60.3%   75.8%        76.6%    62.9%   80.3%
  16              67.2%    49.8%   72.6%        70.2%    51.9%   76.7%
  From Chang, 1997

DWL Area Penalty
  Cells/block     128x128  256x64  64x256
  2               25.5%    24.6%   24.8%
  4               19.2%    18.5%   18.4%
  8               17.0%    16.5%   16.2%
  16              15.4%    14.8%   14.5%

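One possible way to read the two tables together is to ask how much extra power reduction each step toward smaller blocks buys and what it costs in area. The Python sketch below transcribes the table values and computes those differences in percentage points; the helper names are illustrative and the comparison itself is only one of several reasonable readings.

```python
ARRAYS = ("128x128", "256x64", "64x256")

# Percent values transcribed from the DWL tables, keyed by cells/block.
WRITE_REDUCTION = {2: (77.0, 68.5, 78.4), 4: (75.5, 65.5, 77.2),
                   8: (73.1, 60.3, 75.8), 16: (67.2, 49.8, 72.6)}
READ_REDUCTION = {2: (80.1, 71.6, 82.9), 4: (79.1, 68.3, 82.0),
                  8: (76.6, 62.9, 80.3), 16: (70.2, 51.9, 76.7)}
AREA_PENALTY = {2: (25.5, 24.6, 24.8), 4: (19.2, 18.5, 18.4),
                8: (17.0, 16.5, 16.2), 16: (15.4, 14.8, 14.5)}


def summary(cells_per_block, array):
    """Return (write reduction, read reduction, area penalty) in percent."""
    col = ARRAYS.index(array)
    return (WRITE_REDUCTION[cells_per_block][col],
            READ_REDUCTION[cells_per_block][col],
            AREA_PENALTY[cells_per_block][col])


def marginal(array, coarse, fine):
    """Extra read-power reduction and extra area (percentage points)
    when moving from `coarse` to `fine` cells/block."""
    _, read_c, area_c = summary(coarse, array)
    _, read_f, area_f = summary(fine, array)
    return read_f - read_c, area_f - area_c


if __name__ == "__main__":
    for coarse, fine in ((16, 8), (8, 4), (4, 2)):
        gain, cost = marginal("128x128", coarse, fine)
        print(f"{coarse:2d} -> {fine} cells/block: "
              f"+{gain:.1f} pts read power reduction, +{cost:.1f} pts area")
```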
Bit Line Segmentation
  RAM cells in each column are organized into blocks selected by word lines
  Only the memory cells in the activated block present a load on the bit line
    » lowers power dissipation (by decreasing bit line capacitance)
    » can use smaller sense amps

Bit Line Segmented Structure
  [Figure: segmented bit line. Labels: SWL i,j, switch to isolate segment, SWL i+n,j, BL j, LBL i,j, LBL i+n,j, WL i]
  The address decoder identifies the segment targeted by the row address and isolates all but the targeted segment from the common bit line
  Has minimal effect on performance

Cache Power
  On-chip I$ and D$ (high speed SRAM)
    » DEC 21164a ($V_{dd}$ = 2.0 V, 0.35 µ, 400 MHz, 30 W max)
        I/D/L2 of 8/8/96KB and 1/1/3 associativity
        caches dissipate 25% of the total chip power
    » DEC SA-110 ($V_{dd}$ = 2.0 V, 0.35 µ, 233 MHz, 1 W typ)
        I/D of 16/16KB and 32/32 associativity (no L2 on-chip)
        I$ (D$) dissipate 27% (16%) of the total chip power
  Improving the power efficiency of caches is critical to the overall system power

Cache Energy Consumption
  Energy dissipated by bit lines: precharge, read, and write cycles
  Energy dissipated by word lines: when a particular row is being read or written
  Energy dissipated by address decoders
  Energy dissipated by peripheral circuits: comparators, cache control logic, etc.
  Off-chip main memory energy is based on a per-access cost

Analytical Energy Model Example
  On-chip cache
  Energy = Ebus + Ecell + Epad + Emain
  Ecell = β*(wl_length)*(bl_length + 4.8)*(Nhit + 2*Nmiss)
  wl_length = m*(T + 8L + St)
  bl_length = C/(m*L)
    » Nhit = number of hits
    » Nmiss = number of misses
    » C = cache size
    » L = cache line size in bytes
    » m = set associativity
    » T = tag size in bits
    » St = number of status bits per line
    » β = 1.44e-14 (technology parameter)

Cache Power Distribution
  [Figure: bar chart of power in milliwatts (0 to 1800) for ijpeg, perl, fppp, and avg; series L1 I$, L1 D$, L2]
  Base configuration:
    » 4-way superscalar
    » 32KB DM L1 I$
    » 32KB, 4-way SA L1 D$; 32B blocks, write back
    » 128KB, 4-way SA L2; 64B blocks, write back
    » 1MB, 8-way SA off-chip L3; 128B blocks, write thru
  Interconnect widths:
    » 16B between L1 and L2
    » 32B between L2 and L3
    » 64B between L3 and MM
  From Ghose, 1999

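The Ecell term of the analytical energy model above translates directly into code. The Python sketch below evaluates wl_length, bl_length, and Ecell as written on the slide; it assumes the cache size C is given in bytes (consistent with L in bytes), and the example cache and hit/miss counts are made-up inputs. The slide does not define Ebus, Epad, or Emain, so they are left out rather than guessed.

```python
BETA = 1.44e-14  # technology parameter from the slide


def wl_length(m, T, L, St):
    """Word line length: m * (T + 8L + St), i.e. bits per row."""
    return m * (T + 8 * L + St)


def bl_length(C, m, L):
    """Bit line length: C / (m * L), i.e. number of rows (sets)."""
    return C / (m * L)


def e_cell(C, L, m, T, St, n_hit, n_miss, beta=BETA):
    """Ecell = beta * wl_length * (bl_length + 4.8) * (Nhit + 2*Nmiss)."""
    return (beta * wl_length(m, T, L, St)
            * (bl_length(C, m, L) + 4.8)
            * (n_hit + 2 * n_miss))


if __name__ == "__main__":
    # Hypothetical 8 KB cache, 16 B lines, 20-bit tags, 2 status bits,
    # and an assumed trace with 95,000 hits and 5,000 misses.
    for assoc in (1, 2, 4):
        energy = e_cell(C=8192, L=16, m=assoc, T=20, St=2,
                        n_hit=95_000, n_miss=5_000)
        print(f"{assoc}-way: Ecell = {energy:.3e}")
```

In a real study the hit and miss counts change with associativity, so a fair comparison would take them from a cache simulation for each configuration.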
Low Power Cache Techniques
  SRAM power reduction
  Cache block buffering
  Cache subbanking
  Divided word line
  Multidivided module (MDM)
  Modifications to CAM cell (for FA cache and FA TLB)

Cache Block Buffering
  Check whether the desired data is in the data output latch from the last cache access (i.e., in the same cache block)
  Saves energy since the tag and data arrays are not accessed
    » minimal overhead hardware
  Can maintain the performance of a normal set associative cache

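The effect of block buffering can be estimated from an address trace by counting how many accesses fall in a recently latched block and therefore skip the tag and data arrays. The Python sketch below is a minimal model, not the hardware described in the cited work: it ignores cache misses, and the least-recently-used replacement among multiple buffers is an assumption.

```python
def simulate_block_buffer(addresses, block_size, num_buffers=1):
    """Return (buffered, array) access counts for a byte-address trace.

    `buffered` accesses hit a block buffer; `array` accesses must read
    the tag and data arrays. Buffers are replaced LRU (assumed policy).
    """
    buffers = []  # block numbers, most recently used first
    buffered = array = 0
    for addr in addresses:
        block = addr // block_size
        if block in buffers:
            buffered += 1
            buffers.remove(block)
            buffers.insert(0, block)
        else:
            array += 1
            buffers.insert(0, block)
            del buffers[num_buffers:]  # keep at most num_buffers entries
    return buffered, array


if __name__ == "__main__":
    # Hypothetical trace: sequential fetches interleaved with a second stream.
    trace = [0, 4, 8, 12, 256, 16, 20, 260, 24, 28, 32, 36]
    for n in (0, 1, 2):
        hits, full = simulate_block_buffer(trace, block_size=32, num_buffers=n)
        print(f"{n} buffer(s): {hits} buffered accesses, {full} array accesses")
```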
Block Buffer Cache Structure
  [Figure: set associative cache with block buffer. Labels: disable sensing, Address issued by CPU, Tag, Data, Tag, Data, comparators (=), last_set_#, Hit, Desired word]

Block Buffering Performance
  [Figure: bar chart of power in milliwatts (0 to 3000) for L1 I$, L1 D$, L2, and Total; series 0 buffers, 1 buffer, 2 buffers]
  Same base configuration: 4-way superscalar, 32KB DM L1 I$, ...
  From Ghose, 1999

Cache Subbanking
  [Figure: subbanked cache. Labels: Address issued by CPU, Tag, Tag, Data (subbank 0), Tag, Tag, Data (subbank 1), comparators (=), Hit, Desired word]
  Only read from one subbank
  Similar to column multiplexing in SRAMs
    » columns can share precharge and sense amps
    » each subbank has its own decoder

Subbanking Performance
  [Figure: bar chart of power in milliwatts (0 to 3600) for L1 I$, L1 D$, L2, and Total; series conv 16B, subbank 16B, conv 32B, subbank 32B]
  Same base configuration: 4-way superscalar, 32KB DM L1 I$
  4B subbank width
  From Ghose, 1999

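How an address picks a single subbank can be sketched in a few lines. The Python below assumes each cache block is split into equal-width subbanks laid out in address order, so the byte offset within the block selects the subbank; this layout and the function names are assumptions for illustration, not details taken from the slides.

```python
def subbank_select(addr, block_size, subbank_width):
    """Index of the subbank holding byte address `addr` within its block."""
    offset = addr % block_size
    return offset // subbank_width


def data_bitlines_active_fraction(block_size, subbank_width):
    """Fraction of the data array's bit lines driven when only one
    subbank is read (the whole block would be 1.0)."""
    return subbank_width / block_size


if __name__ == "__main__":
    for block in (16, 32):
        frac = data_bitlines_active_fraction(block, subbank_width=4)
        sel = subbank_select(0x1234, block, subbank_width=4)
        print(f"{block}B block, 4B subbanks: address 0x1234 -> subbank {sel}, "
              f"{frac:.1%} of data bit lines active")
```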
Divided Word Line Cache
  Same goals as subbanking
    » reduce # of active bit lines
    » reduce capacitive loading on word and bit lines
  [Figure: divided word line cache rows. Labels: from byte select bit<0>, WL i, WL i+1, LD, word<1>, word<0>]

Multidivided Module Cache
  [Figure: cache divided into modules. Labels: Address issued by CPU, s0-s15, s16-s31]
  With M modules and only one module activated per cycle, load capacitance is reduced by a factor of M (reduces both latency and power)
  Can combine multidivided module, buffering, and subbanking or divided word line to get the savings of all three

Translation Lookaside Buffers
  Small caches to speed up address translation in processors with virtual memory
  All addresses have to be translated before cache access
    » DEC SA-110 ($V_{dd}$ = 2.0 V, 0.35 µ, 233 MHz, 1 W typ)
        I$ (D$) dissipate 27% (16%) of the total chip power
        TLB dissipates 17% of the total chip power
  I$ can be virtually indexed/virtually tagged

TLB Structure
  [Figure: TLB feeding a set associative cache. Labels: Address issued by CPU, VA Tag, PA, Tag, Data, Tag, Data, comparators (=), Hit, Hit, Desired word]
  Page size = index bits + byte select bits
  Most TLBs are small (<= 256 entries) and thus fully associative

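A fully associative TLB compares the incoming virtual page number against every stored tag on each lookup, which is one reason it shows up in the power budget. The Python sketch below models that behavior and counts tag comparisons; the class name, the FIFO replacement policy, and the compare counter are illustrative assumptions rather than details from the slides.

```python
class FullyAssociativeTLB:
    """Toy fully associative TLB with FIFO replacement (assumed policy)."""

    def __init__(self, num_entries, page_size):
        self.num_entries = num_entries
        self.page_size = page_size
        self.entries = {}       # vpn -> ppn; dict order doubles as FIFO age
        self.tag_compares = 0   # proxy for CAM match activity

    def insert(self, vpn, ppn):
        """Install a translation, evicting the oldest entry if full."""
        if vpn not in self.entries and len(self.entries) >= self.num_entries:
            oldest = next(iter(self.entries))
            del self.entries[oldest]
        self.entries[vpn] = ppn

    def translate(self, vaddr):
        """Return the physical address, or None on a TLB miss."""
        vpn, offset = divmod(vaddr, self.page_size)
        # A CAM matches the tag against every valid entry in parallel.
        self.tag_compares += len(self.entries)
        ppn = self.entries.get(vpn)
        if ppn is None:
            return None
        return ppn * self.page_size + offset


if __name__ == "__main__":
    tlb = FullyAssociativeTLB(num_entries=32, page_size=4096)
    tlb.insert(vpn=0x12, ppn=0x7A)
    print(hex(tlb.translate(0x12ABC)))   # hit: same offset, new page number
    print(tlb.translate(0x99000))        # miss
    print("tag compares so far:", tlb.tag_compares)
```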
TLB Power
  [Figure: power in milliwatts (0 to 80) for DM, 2 SA, 4 SA, 8 SA, and FA organizations at sizes 32, 64, 128, and 256]
  From Juan, 1997

CAM Design
  [Figure: CAM array and cell. Labels: WL<0> to WL<3>, match<0> to match<3>, Hit, word line<0> of data array, bit, WL, bit, Read/Write Circuitry, match/write data, precharge/match, match]

Low Power CAM Cell
  [Figure: two CAM cell circuits. Labels: bit, WL, bit, match (first cell); bit, WL, bit, match, control (second cell)]

Low Power DRAMs
  Conventional DRAMs refresh all rows with a single fixed time interval
    » read/write stalled while refreshing
    » refresh period -> $t_{ref}$
    » DRAM power = k * (#read/writes + #ref)
  So the refresh operation has to be optimized as well

Optimizing Refresh
  Selective refresh architecture (SRA)
    » add a valid bit to each memory row and only refresh rows with the valid bit set
    » reduces refresh by 5% to 80%
  Variable refresh architecture (VRA)
    » data retention time of each cell is different
    » add a refresh period table and refresh counter to each row and refresh each row with its appropriate period
    » reduces refresh by about 75%
  From Ohsawa, 1998

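The refresh counts behind the three schemes above can be compared with a small Python sketch. It counts row refreshes over a time window for conventional, selective (SRA), and variable (VRA) refresh and feeds them into the slide's power expression k * (#read/writes + #ref). The row counts, valid bits, periods, and k below are invented inputs, so the printed savings are not the figures reported in the cited work.

```python
def refreshes_conventional(num_rows, t_ref, duration):
    """Every row is refreshed once per t_ref."""
    return num_rows * (duration // t_ref)


def refreshes_selective(valid_bits, t_ref, duration):
    """SRA: only rows whose valid bit is set are refreshed."""
    return sum(1 for v in valid_bits if v) * (duration // t_ref)


def refreshes_variable(row_periods, duration):
    """VRA: each row is refreshed with its own period."""
    return sum(duration // period for period in row_periods)


def dram_power(k, accesses, refreshes):
    """DRAM power = k * (#read/writes + #ref)."""
    return k * (accesses + refreshes)


if __name__ == "__main__":
    # Hypothetical 8-row memory observed for 1024 ms (times in ms).
    duration, t_ref = 1024, 64
    valid = [1, 1, 0, 0, 1, 0, 0, 0]                # 3 rows hold live data
    periods = [64, 128, 128, 256, 256, 256, 512, 512]
    counts = {
        "conventional": refreshes_conventional(len(valid), t_ref, duration),
        "selective": refreshes_selective(valid, t_ref, duration),
        "variable": refreshes_variable(periods, duration),
    }
    for name, ref in counts.items():
        print(f"{name:12s} {ref:4d} refreshes, "
              f"power = {dram_power(k=1.0, accesses=100, refreshes=ref):.0f}")
```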
Application-Specific Memories
  Data and code compression
    » Custom instruction sets, e.g. ARM Thumb: interleaving of 32-bit ARM and 16-bit Thumb code
    » Reduces memory size
    » Reduces width of off-chip buses
    » Location of the compression unit is important
    » Compress only selected blocks

Hardware Code Compression
  Assuming only a subset of instructions is used, replace them with a shorter encoding to reduce memory bandwidth
  [Figure: the core sends addresses to memory; memory returns log n-bit compressed instructions; the instruction decompression table (IDT) restores the original k-bit format]

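The instruction decompression table idea above can be mimicked in software: each distinct k-bit instruction gets a short index, memory holds the indices, and the table restores the original words. The Python sketch below is only a functional model under the assumption that every instruction in the program goes through the table; the function names and the sample program words are hypothetical.

```python
def build_idt(program):
    """Return (table, compressed): the distinct instructions in order of
    first use, and the program rewritten as indices into that table."""
    index = {}
    for instr in program:
        index.setdefault(instr, len(index))
    table = list(index)
    compressed = [index[instr] for instr in program]
    return table, compressed


def decompress(table, compressed):
    """Restore the original instruction stream (the IDT lookup)."""
    return [table[i] for i in compressed]


def index_bits(n):
    """Bits needed to address n table entries (ceil(log2 n), at least 1)."""
    return max(1, (n - 1).bit_length())


def memory_bits(program, k):
    """Return (original, compressed stream, table) sizes in bits."""
    table, compressed = build_idt(program)
    return (len(program) * k,
            len(compressed) * index_bits(len(table)),
            len(table) * k)


if __name__ == "__main__":
    # Made-up 32-bit instruction words with heavy reuse.
    program = [0xE1A00000, 0xE3A01001, 0xE1A00000, 0xE0800001,
               0xE1A00000, 0xE3A01001, 0xE0800001, 0xE1A00000]
    table, compressed = build_idt(program)
    assert decompress(table, compressed) == program
    original, stream, idt = memory_bits(program, k=32)
    print(f"{len(table)} distinct instructions, "
          f"{index_bits(len(table))} bits per fetched instruction")
    print(f"original {original} bits, stream {stream} bits + table {idt} bits")
```

The per-fetch bandwidth drops from k bits to the index width, while the table adds a fixed storage cost, so the scheme pays off when the program reuses a small set of instructions many times.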
Other Techniques
  Customizing the memory hierarchy
    » Close vs. far memory accesses
    » Close: faster, less energy consuming, smaller caches
    » Energy per access increases monotonically with memory size
    » Automatic memory partitioning

Memory Partitioning
  A memory partition is a set of memory banks that can be independently selected
  Any address is stored in one and only one bank
  The total energy consumed by a partition is the sum of the energy consumed by all its banks
  Partitioning increases the selection logic cost
  Macii, 2000

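The partitioning argument above can be made concrete with a toy energy model. In the Python sketch below the total energy is the sum over banks, as the slide states, plus a selection overhead that grows with the number of banks. The square-root cost per access and the per-bank selection cost are hypothetical stand-ins chosen only because they increase monotonically with size; they are not the model from the cited work.

```python
import math


def energy_per_access(size_words):
    """Hypothetical monotonic cost model: grows with memory size."""
    return math.sqrt(size_words)


def partition_energy(bank_sizes, bank_accesses, select_cost_per_bank=0.0):
    """Total energy = sum of bank energies + assumed selection overhead.

    The overhead is charged on every access for each bank beyond the
    first, so a monolithic memory pays none.
    """
    if len(bank_sizes) != len(bank_accesses):
        raise ValueError("one access count per bank is required")
    banks = sum(energy_per_access(size) * count
                for size, count in zip(bank_sizes, bank_accesses))
    selection = (select_cost_per_bank * (len(bank_sizes) - 1)
                 * sum(bank_accesses))
    return banks + selection


if __name__ == "__main__":
    # Made-up profile: 90% of accesses touch a small hot region.
    mono = partition_energy([1024], [1000])
    split = partition_energy([64, 960], [900, 100], select_cost_per_bank=2.0)
    print(f"monolithic 1024 words: {mono:.0f}")
    print(f"64 + 960 word banks:   {split:.0f}")
```

Placing the frequently accessed addresses in a small bank lowers the energy of most accesses, while the selection term captures why adding banks eventually stops paying off.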
Scratch Pad Memory
  Use scratch pad memory instead of caches to exploit locality
    » Memory accesses of embedded software are usually very localized
    » Map the most frequently accessed locations onto a small on-chip memory
    » Caches have tag overhead: eliminate it with application-specific decode logic
    » Map a small set of the most frequently accessed addresses to consecutive locations in the small memory
  Benini, 2000

Key References, Memories
  Amrutur, Techniques to Reduce Power in Fast Wide Memories, Proc. of SLPE, pp. 92-93, 1994.
  Angel, Survey of Low Power Techniques for ROMs, Proc. of SLPED, pp. 7-11, Aug. 1997.
  Chang, Power-Area Trade-Offs in Divided Word Line Memory Arrays, Journal of Circuits, Systems, Computers, 7(1):49-57, 1997.
  Evans, Energy Consumption Modeling and Optimization for SRAMs, IEEE Journal of SSC, 30(5):571-579, May 1995.
  Itoh, Low Power Memory Design, in Low Power Design Methodologies, pp. 201-251, KAP, 1996.
  Ohsawa, Optimizing the DRAM Refresh Count, Proc. of SLPED, pp. 82-87, Aug. 1998.
  Shimazaki, An Automatic Power-Save Cache Memory, Proc. of SLPE, pp. 58-59, 1995.
  Yoshimoto, A Divided Word Line Structure in SRAMs, IEEE Journal of SSC, 18:479-485, 1983.

Key References, Caches
  Ghose, Reducing Power in SuperScalar Processor Caches Using Subbanking, Multiple Line Buffers and Bit-Line Segmentation, Proc. of ISLPED, pp. 70-75, 1999.
  Juan, Reducing TLB Power Requirements, Proc. of ISLPED, pp. 196-201, Aug. 1997.
  Kin, The Filter Cache: An Energy-Efficient Memory Structure, Proc. of MICRO, pp. 184-193, Dec. 1997.
  Ko, Energy Optimization of Multilevel Cache Architectures, IEEE Trans. on VLSI Systems, 6(2):299-308, June 1998.
  Panwar, Reducing the Frequency of Tag Compares for Low Power I$ Designs, Proc. of ISLPD, pp. 57-62, 1995.
  Shimazaki, An Automatic Power-Save Cache Memory, Proc. of SLPE, pp. 58-59, 1995.
  Su, Cache Design Tradeoffs for Power and Performance Optimization, Proc. of ISLPD, pp. 63-68, 1995.