IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 9, SEPTEMBER

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 9, SEPTEMBER 2017 2625

SPIDER: Sizing-Priority-Based Application-Driven Memory for Mobile Video Applications

Na Gong, Member, IEEE, Seyed Alireza Pourbakhsh, Xiaowei Chen, Xin Wang, Dongliang Chen, Student Member, IEEE, and Jinhui Wang, Member, IEEE

Abstract: Recently, mobile devices such as smartphones and tablets have become the most important medium for delivering internet traffic, especially multimedia content, to end users. However, mobile embedded memory incurs large power consumption owing to highly frequent accesses and extensive computation. This paper presents a sizing-priority-based application-driven memory (SPIDER) design methodology for low-power mobile video applications. We investigate the size dependent memory failure characteristics and effectively reduce the memory failure rate with low area overhead. Also, we develop a model for the influence of memory failure on video output, connecting the hardware design process and the application requirement. Based on this, we design the SPIDER algorithms for area-priority and quality-priority mobile video applications. During this process, we also consider the contribution of both Luma and Chroma to output quality, avoiding the over-optimization issue. We also develop a hardware-based, Python-assisted SPIDER simulator to apply our proposed design in a leading-edge video compression system, the H.264 decoder. Our simulation results in 45-nm CMOS technology show that SPIDER supports mobile videos successfully as the supply voltage is scaled down from 1 V to 500 mV, enabling over 70% power savings in memory arrays.

Index Terms: Application, area, embedded memory, mobile video, power consumption.

I. INTRODUCTION

RECENTLY, mobile devices such as smartphones have become the most important medium for delivering end-user internet traffic, especially multimedia content.
According to research from Cisco in February 2013, two-thirds of global mobile data traffic will be driven by video in 2017 [1]. Fig. 1 shows an example of a video streaming system. The original video is compressed to reduce the number of data bits and then transmitted to mobile devices over a communication channel based on a specific protocol, such as Apple's hypertext transfer protocol live streaming [49]. However, video decoding has become the most important energy-intensive application used in mobile devices [2]. In particular, the major signal processing units in video decoders, such as motion estimation, require a significant number of calculations and need frequent embedded memory accesses. Embedded static random access memory (SRAM) occupies over 65% of the core area of a video decoder chip [8] and contributes to over 30% of the system power consumption of a mobile device [9]–[12].

Fig. 1. Mobile video streaming.

Manuscript received August 25, 2016; revised January 17, 2017 and March 28, 2017; accepted April 30, 2017. Date of publication June 28, 2017; date of current version August 23, 2017. This work was supported in part by the National Science Foundation under Grants CCF-1514780 and CNS-1628961, in part by the ND NASA EPSCoR, in part by the ND Venture Grant, in part by the NDSU-RCA funding, and in part by the Offerdahl Foundation. (Corresponding authors: Na Gong; Jinhui Wang.) The authors are with the Department of Electrical and Computer Engineering, North Dakota State University, Fargo, ND 58108 USA (e-mail: na.gong@ndsu.edu; seyedalireza.pourbak@ndsu.edu; xiaowei.chen@ndsu.edu; xin.wang.2@ndsu.edu; dongliang.chen@ndsu.edu; jinhui.wang.1@ndsu.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2017.2715002
This situation is only expected to grow for the next-generation mobile video format H.265/high efficiency video coding (HEVC), which has 2–3× higher memory demand compared to that of H.264 [30]. First, the increased pixel complexity (10 b/pixel and 7680 × 4320 pixels/frame) and ultrahigh throughput requirement (120 fps) require much larger on-chip memories as data memories and pipeline buffers [31]. Second, to meet the high bandwidth requirement, designers increase on-chip memories to reduce off-chip memory traffic [32]–[34]. References [31], [34], and [35] describe three recently published mobile video decoders featuring 396, 154, and 308 kb of on-chip SRAM, respectively. Consequently, enhancing the energy efficiency of on-chip SRAM is of paramount importance to enable efficient mobile video systems.

Supply voltage scaling is one of the most effective techniques to reduce the power consumption of memory [1]–[7]. However, there are three main considerations for low-voltage memory designers.
1) The noise margin of conventional SRAM deteriorates significantly due to process variation at low voltage.
2) Reducing the area overhead of low-power embedded SRAM is another major design concern.
3) Various mobile video applications have different requirements, from area-priority applications such as healthcare video streaming to quality-priority applications such as ultra-high-definition videos and 3-D gaming.

1063-8210 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

In this paper, we present a sizing-priority-based application-driven memory (SPIDER) design methodology for power-efficient mobile video applications. We make the following contributions in this paper.

1) Based on a detailed analysis of memory failure characteristics, we propose a novel priority-based SRAM sizing methodology to enhance SRAM fault tolerance. Different from previous sizing approaches, our technique increases the size of only the transistors most sensitive to failure, thereby reducing the area overhead (Section III).
2) We develop a new model that connects memory failure and video output, considering both Luma and Chroma. To the best of our knowledge, none of the existing hardware design techniques consider application output directly while designing hardware. Also, our model includes both Luma and Chroma contributions to output quality, avoiding over-optimization (Section IV-A).
3) We design SPIDER algorithms for area-priority and quality-priority applications, maximizing the power efficiency (Section IV-B).
4) We design a hardware-based evaluation flow, directly injecting memory failure into the application process with high control precision. We also present a Python-assisted controlling scheme to achieve an automatic evaluation process. We apply our proposed design in a leading-edge video compression system, the H.264 decoder. Our evaluation results show that SPIDER achieves significant power savings for different video applications (Section V).

The rest of this paper is organized as follows. In Section II, we provide a review of related low-power mobile video techniques. In Section III, we present the sizing dependent memory failure characteristics and their impact on the video system. In Section IV, we develop new algorithms to achieve optimized embedded memory for different video applications. Section V describes the simulation methodology and results, followed by the conclusion in Section VI.
In our analysis, we use a high-performance 45-nm CMOS process to meet the multimegahertz performance requirement of the video decoder in modern mobile devices.

II. RELATED WORK

A significant amount of research targeting low-power memory has been reported in the literature. In this section, we briefly review existing work related to the proposed technique. Low-power mobile video techniques can broadly be classified into two categories.

A. General-Purpose Memory Used for Mobile Video Applications

Many solutions lower the power consumption of memory using assist schemes such as adjustment of cell voltage [13], boosted wordline voltage [14], [15], dual-rail supply schemes [16], negative bitline schemes [17], [18], and read-modify-write or write-back schemes [19], [20]. The improvements in power efficiency are often achieved with significant design complexity and a power penalty for voltage regulation or boosting circuits.

Most existing solutions adopt cells with more than 6T to achieve low-power operation, such as the asymmetric 7T cell [21], single-ended read-decoupled 8T cells [22], [23], Zigzag 8T cells [24], read-disturb-free 9T [25] and 10T SRAM cells [26], and bit-interleaving 12T cells [27]. However, these memory cells still suffer from the write half-select disturb problem, limiting the power efficiency that can be achieved. Most importantly, all of these general-purpose memory designs fail to consider the context of the target video applications, thereby losing potential power saving opportunities.

B. Mobile Video Specific Memory

Several recent efforts have explored mobile video memory design with attempts to consider simple application-specific properties, such as data patterns [2] and contributions of different data bits [4], [5], [19]. Many mobile video SRAM designs have been presented for low power consumption. In [3] and [7], hybrid 6T + 8T and 8T + 10T SRAM structures were presented to achieve quality-area optimization.
However, such hybrid structures increase the implementation complexity of peripheral circuitry such as memory decoders. In [4], a heterogeneous sizing scheme was presented to reduce the failure probability of conventional 6T bit-cells, but it suffers from large area overhead and can only operate down to a 0.9-V supply, limiting the power efficiency. In [5], an error-correction-code (ECC) approach is proposed to reduce the area overhead of 8T bit-cells, but it suffers from a performance penalty for data encoding/decoding and area overhead for both ECC circuitry and redundancy data. Also, all of those techniques ignore Chroma data and may therefore lose optimization opportunities.

The common feature of the above existing techniques is that the power savings come at the cost of large area overhead. In contrast, the proposed SPIDER realizes significant power savings with reduced area overhead while considering area-priority and quality-priority mobile video applications.

Note that we had earlier presented the basic idea of SPIDER in [29] with some preliminary results. In this paper, we extend our original work and make the following additional contributions.
1) Novel SPIDER algorithms are designed for both area-priority and quality-priority video applications in Section IV-B, and the detailed evaluation results are presented and discussed in Section V.
2) A memory sizing methodology considering both Luma and Chroma data is presented in Section IV, and the evaluation results are discussed in Section V-C. It shows that the proposed Luma + Chroma technique solves the over-optimization problem caused by the traditional design, which only includes Luma data, and therefore, the proposed technique is effective in saving silicon area.
3) A hardware-based SPIDER simulator based on Verilog, MATLAB, HSPICE, and Python is detailed in Section V-A to enable higher simulation precision as compared to traditional software-based simulators.
4) A video memory power consumption model is presented, and based on it, the power efficiency of the proposed technique is compared with the traditional design and existing techniques.
5) The video outputs using the proposed technique for different applications are included.

TABLE I SIZING DEPENDENT SRAM FAILURE CHARACTERISTICS, 45-nm TECHNOLOGY

Fig. 2. (a) Standard 6T SRAM (WPU:WPD:WAX = 1:2:1.5). (b) Failure rate of CASE I with 20% increase. (c) Failure rate of CASE III with 20% increase.

III. SIZING DEPENDENT SRAM FAILURE CHARACTERISTICS AND FAILURE-INDUCED VIDEO DEGRADATION

In this section, we first analyze the sizing dependent SRAM failure characteristics. Then, the impact of memory failure on the H.264 video system is discussed.

A. Sizing Dependent SRAM Failure Characteristics

Fig. 2(a) shows a schematic of a 6T SRAM bit-cell. In low-voltage operation with process variation, the worst process corners for the 6T SRAM read operation and write operation are fast-nmos and slow-pmos (FS) and slow-nmos and fast-pmos (SF), respectively [3], [4], [7]. Since the read failure rate at the FS corner [$P_{RF}(\mathrm{FS})$] is much larger than the write failure rate at the SF corner [$P_{WF}(\mathrm{SF})$], the overall 6T SRAM cell failure rate ($P_F$) can be estimated as the read failure rate in the FS process corner, expressed as

$$P_F = P_{RF}(\mathrm{FS}) + P_{WF}(\mathrm{SF}) \approx P_{RF}(\mathrm{FS}). \tag{1}$$

Researchers have shown that the failure rate of SRAM bit-cells decreases with larger transistor size, and they increased the size of all 6T transistors to reduce the failure rate [4]. In this paper, we explore the sizing dependent memory failure characteristics based on extensive SPICE Monte Carlo simulations. To keep the sizing ratio of different devices in the SRAM, we increase the width and the length simultaneously. We consider four different sizing cases.
1) CASE I: The size of all 6T transistors is increased.
2) CASE II: The size of only the two pull-down nmos transistors (PD) is increased.
3) CASE III: The size of only the two access nmos transistors (AX) is increased.
4) CASE IV: The size of only the two pull-up pmos transistors (PU) is increased.

The results are shown in Table I. As observed, if only the two pmos transistors (PU) are enlarged, the failure rate increases.
This is because larger pull-up transistors make the reading process even more difficult. Note from Table I that increasing the size of all 6T transistors, as in prior work, does not minimize the failure rate, and it also induces large area overhead. The failure rate is minimized by enlarging only the two nmos access transistors AX [see Fig. 2(a)]. As the size of the access transistors is increased by 50%, the failure rate is reduced from 1335/10 000 to 3/10 000.

We further analyzed the butterfly curves of CASE I and CASE III from Monte Carlo simulations, as shown in Fig. 2(b) and (c). We define a fail area based on the curves, where three intersection points exist in the butterfly curves, indicating a read failure. As shown, more butterfly curves of CASE I have three intersection points, and therefore the fail area of CASE I is much larger than that of CASE III. Accordingly, in our SPIDER design, we adopt the CASE III sizing methodology to reduce memory failure with reduced area overhead or better video quality, depending on the requirement of applications, which will be discussed in Section IV.

B. Video Pixel Data Characteristics

In this paper, we apply SPIDER to the H.264 video system, which is one of the most popular video codec standards in mobile multimedia communications. Fig. 3 shows the general block diagram of the H.264 decoder. After decoding, inverse quantization, and inverse transformation, the residual error of frames can be reconstructed based on the compressed video streams. The motion compensator uses the previous reconstructed frames stored in the reference frame buffer and the transmitted motion vectors to construct new frames. Due to the frequent accesses, embedded SRAM consumes a large amount of power and is the dominant contributor to the power of the entire H.264 decoder [4]. Accordingly, ultralow voltage embedded SRAM design is extremely important for power-efficient mobile video applications.

Fig. 3. H.264 decoder and video data stored in memory.

Fig. 4. Impact of memory failure on video output. (a) 4 LOB (bit3–bit0) with 4/10 000 failure rate. (b) 4 HOB (bit7–bit4) with 4/10 000 failure rate. (c) All Chroma 8-b data with 4/10 000 failure rate. (d) All Luma 8-b data with 4/10 000 failure rate.

As shown in Fig. 3, in a video system supporting the common intermediate format (CIF) (30 fps, 4:2:0), embedded memory stores 8-b Luma data ($Y$) and 8-b Chroma data ($C_b$ and $C_r$). Luma data represent the brightness in a frame, while the Chroma data represent the color information. Since human vision is more sensitive to luminance differences, previous memory researchers only consider Luma in the design process [3], [4]. However, ignoring the impact of memory failure on Chroma may induce over-optimization, losing power saving opportunities. In SPIDER, we consider the contribution of both Luma and Chroma to the output quality while optimizing the application-driven memory, as discussed in Section IV.

Here, we use the peak signal-to-noise ratio (PSNR) as the output quality metric, which is defined as [7]

$$\mathrm{PSNR} = 20 \log_{10}\left(\frac{255}{\sqrt{\mathrm{MSE}}}\right) \tag{2}$$

where MSE is the mean square error between the original video ($Y_{Org}$) and the degraded video ($Y_{Deg}$) over an $m \times n$ frame, expressed as

$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \underbrace{[Y_{Org}(i,j) - Y_{Deg}(i,j)]^2}_{\text{pixel}}. \tag{3}$$

Fig. 4 shows the impact of memory failure on output quality.
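The PSNR definition in (2) and (3) and the bit-group experiment of Fig. 4 can be illustrated with a minimal Python sketch. It is not the authors' simulator: the frame is synthetic random data, a memory failure is modeled as an independent bit flip (an assumption), and the helper names `psnr` and `inject_faults` are hypothetical. It therefore shows only the qualitative trend, not the PSNR values reported in this paper, which come from a full H.264 decoding flow.

```python
import math
import random

def psnr(original, degraded):
    """PSNR in dB between two equal-length sequences of 8-bit pixel values."""
    mse = sum((o - d) ** 2 for o, d in zip(original, degraded)) / len(original)
    return math.inf if mse == 0 else 20 * math.log10(255 / math.sqrt(mse))

def inject_faults(pixels, bits, failure_rate, rng):
    """Flip each listed bit of each pixel independently with probability failure_rate."""
    out = []
    for p in pixels:
        for k in bits:
            if rng.random() < failure_rate:
                p ^= 1 << k  # assumption: a failing cell returns the inverted bit
        out.append(p)
    return out

rng = random.Random(0)
# Synthetic 352 x 288 plane of random 8-bit values (not real video data).
frame = [rng.randrange(256) for _ in range(352 * 288)]
lob = inject_faults(frame, bits=range(0, 4), failure_rate=4 / 10000, rng=rng)
hob = inject_faults(frame, bits=range(4, 8), failure_rate=4 / 10000, rng=rng)
print(f"faults in bit3-bit0: {psnr(frame, lob):.1f} dB")
print(f"faults in bit7-bit4: {psnr(frame, hob):.1f} dB")
```

Under these assumptions, faults in the four high-order bits yield a lower PSNR than the same fault rate in the four low-order bits, because each flipped bit k changes the pixel value by 2^k.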
As expected, compared to the low-order bits (LOBs), the high-order bits (HOBs) contribute more to the output quality. If the four LOBs (bit3–bit0) are stored in memory with a 4/10 000 fault rate, the PSNR is as high as 38.507 dB. Alternatively, if the four HOBs (bit7–bit4) are stored in the same memory, the PSNR is reduced to 27.607 dB. Another observation from Fig. 4 is that memory failures in Chroma data also play an important role in output quality, reducing the PSNR to 35.481 dB, which is only 4.481 dB higher than the PSNR with memory failures in Luma data. Accordingly, ignoring Chroma data may not achieve power-optimized video hardware.

IV. PROPOSED SPIDER ALGORITHM

To implement SPIDER, a model that connects memory failure and video output quality is required. In this section, we first develop such a model. Based on this, we develop SPIDER algorithms for area-priority and quality-priority mobile video applications.

A. Modeling Memory Failure and Video Output

Based on the MSE expressed in (3), we define the MSE of an 8-b pixel due to memory failure as

$$\mathrm{MSE}_{pixel} = (Y_{Org} - Y_{Deg})^2. \tag{4}$$

To estimate the impact of memory failure on the output quality, we get

$$\mathrm{MSE}_{pixel} = \left[\sum_{k=0}^{7} (2^k Y_k)\right]^2 \tag{5}$$

where $Y_k$ is the memory failure coefficient for each bit

$$Y_k = \begin{cases} 1 & \text{if memory bit } k \text{ fails} \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$

Assume that $Y_k$ takes a value between 0 and 1. If the failure rate for each bit is the same, denoted $f$, then (5) becomes

$$\mathrm{MSE}_{pixel} = \left[\sum_{k=0}^{7} (2^k Y_f)\right]^2 = Y_f^2 \left[\sum_{k=0}^{7} 2^k\right]^2 = Y_f^2 (1 + 2 + 2^2 + \cdots + 2^7)^2 = 255^2\, Y_f^2. \tag{7}$$
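As a quick numerical check of (5) and (7), the following Python sketch evaluates the squared weighted sum for a uniform coefficient and inverts (7) to recover it. The function names and the value 0.01 are illustrative assumptions, not values from this paper.

```python
import math

def mse_pixel(coeffs):
    """Squared weighted sum of per-bit failure coefficients, as in (5).

    coeffs[k] is the coefficient of bit k (bit 0 is the least significant bit).
    """
    return sum((2 ** k) * y for k, y in enumerate(coeffs)) ** 2

def uniform_coefficient(mse):
    """Invert the uniform case of (7), MSE = 255^2 * Y_f^2, for Y_f."""
    return math.sqrt(mse) / 255

y_f = 0.01  # illustrative value only
mse = mse_pixel([y_f] * 8)
print(mse, 255 ** 2 * y_f ** 2)   # both sides of (7) agree up to rounding
print(uniform_coefficient(mse))   # recovers y_f up to rounding
```

With all eight coefficients equal, the weights 1 + 2 + ... + 128 sum to 255, which is where the factor 255^2 in (7) comes from.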

Therefore, we get

$$Y_f = \frac{\sqrt{\mathrm{MSE}_{pixel,f}}}{255}, \qquad \mathrm{MSE}_{pixel} = \left[\sum_{k=0}^{7} (2^k Y_{f,k})\right]^2 \tag{8}$$

where $Y_{f,k}$ represents the memory failure coefficient for bit $k$. Therefore, (8) captures the approximately square relationship between the memory failure coefficients $Y_f$ and the output quality MSE. The obtained memory failure coefficients are listed in Table II.

TABLE II COEFFICIENTS FOR THE SPIDER MODEL

Fig. 5 compares the derived model against the video decoding output obtained from the H.264 simulator. We randomly pick ten memory failure combinations and then evaluate the video output quality. As shown, the error rate is less than 6%, demonstrating acceptable accuracy of the developed model.

Fig. 5. Memory failure and video output quality PSNR for ten random memory combinations.

B. SPIDER Algorithm

Based on the developed model between memory failure and output quality, we design SPIDER algorithms for area-priority and quality-priority applications.

1) Problem Definition: The SPIDER sizing optimization problem can be formulated as follows: 1) given an application constraint and a target supply voltage and 2) determine the memory bit-cell size so that the target performance parameter is optimized. For mobile video embedded memory storing 8-b Luma/Chroma data (e.g., H.264), the bit-cell size set can be represented as $D_{SPIDER} = \{d_7, d_6, d_5, \ldots, d_0\}$. Note that, for emerging HEVC videos with 10 b/pixel [31], $D_{SPIDER} = \{d_9, d_8, d_7, \ldots, d_0\}$. In our experiment with a target voltage of 500 mV, the minimum size of the bit-cell AX is $L_{min}/W_{min}$ = 50 nm/100 nm and the maximum size is $L_{max}/W_{max}$ = 165 nm/495 nm. We use steps of 5 nm, which is the minimum permissible grid size for 45-nm technology. To implement SPIDER, we use a lookup-table-based approach similar to that in [4], which provides the failure rate and silicon area for a specified SRAM size.

Algorithm 1 Area-Priority SPIDER
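The sizing problem defined above can be illustrated with a small search sketch in Python. This is not the authors' Algorithm 1 or Algorithm 2, whose pseudocode is not reproduced in this text. It is a simple greedy heuristic under stated assumptions: the lookup table `LUT` holds made-up (area, failure coefficient) pairs, the quality is predicted with the squared weighted-sum model of (8), and all function names are hypothetical.

```python
import math

# Hypothetical lookup table: size index -> (relative area, failure coefficient).
# The numbers are invented for illustration; a real table would come from
# Monte Carlo circuit simulation of each bit-cell size.
LUT = [(1.00, 1e-1), (1.25, 1e-2), (1.50, 1e-3), (1.75, 1e-4), (2.00, 1e-5)]
N_BITS = 8

def psnr(sizes):
    """PSNR predicted by the model; sizes[k] is the size index of bit k."""
    weighted = sum((2 ** k) * LUT[s][1] for k, s in enumerate(sizes))
    return 20 * math.log10(255 / weighted)  # sqrt(MSE) equals the weighted sum

def area(sizes):
    """Total relative area of the eight bit-cells."""
    return sum(LUT[s][0] for s in sizes)

def upsized(sizes, k):
    """Copy of sizes with bit k moved to the next larger table entry."""
    return sizes[:k] + [sizes[k] + 1] + sizes[k + 1:]

def area_priority(target_psnr):
    """Greedy: upsize the bit with the best PSNR gain per added area until the target is met."""
    sizes = [0] * N_BITS
    while psnr(sizes) < target_psnr:
        trials = [upsized(sizes, k) for k in range(N_BITS) if sizes[k] + 1 < len(LUT)]
        if not trials:
            raise ValueError("target PSNR not reachable with this table")
        sizes = max(trials, key=lambda t: (psnr(t) - psnr(sizes)) / (area(t) - area(sizes)))
    return sizes

def quality_priority(max_area):
    """Greedy: keep upsizing the most beneficial bit while the area budget allows."""
    sizes = [0] * N_BITS
    while True:
        trials = [upsized(sizes, k) for k in range(N_BITS) if sizes[k] + 1 < len(LUT)]
        trials = [t for t in trials if area(t) <= max_area]
        if not trials:
            return sizes
        sizes = max(trials, key=psnr)

print(area_priority(30.5))     # size indices for bit 0 ... bit 7
print(quality_priority(10.0))  # best greedy size set within the area budget
```

With this toy table, both searches spend area on the high-order bits first, since enlarging bit k removes an error term weighted by 2^k. A greedy search is not guaranteed to be optimal; it is used here only because it is short.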
In the following sections, we consider two applications: area-priority and quality-priority.

Area-Priority SPIDER Algorithm: The SRAM bit-cell size search problem can be considered as a problem of finding a bit-cell size set, $D = \{d_7, d_6, d_5, \ldots, d_0\}$, which gives rise to the minimum area overhead while achieving the target PSNR. The procedure for area-priority SPIDER sizing is described in Algorithm 1. Note that, to speed up this process, other algorithms such as a dynamic programming approach can be used.

2) Quality-Priority SPIDER Algorithm: The SRAM bit-cell size search problem can be considered as a problem of finding a bit-cell size set $D$ that gives rise to the best output quality under a specific area constraint. Algorithm 2 shows the algorithm pseudocode.

Algorithm 2 Quality-Priority SPIDER

V. EXPERIMENTAL RESULTS

A. Experimental Methodology

We use 300 frames of the Akiyo color CIF video sequence to verify the output quality based on the proposed SRAM scheme. The frame size in our simulation is 176 × 144. In order to inject memory failure-induced faults into the decoding process, we implement a hardware-based SPIDER simulator, as shown in Fig. 6. Compared to software-based video coding simulators, such as the JM simulator [28], the advantage of the SPIDER simulator is that it can specifically identify the memory modules and directly inject memory faults, achieving higher precision. As shown in Fig. 6, the SPIDER simulator consists of three components: 1) a Python-based controller; 2) an HSPICE/MATLAB-based memory failure analyzer; and 3) a Verilog-based H.264 mobile video decoder.

The working process is detailed as follows. In order to observe the video quality degradation caused by memory failures during low-voltage operation, we first performed 100 000 HSPICE Monte Carlo simulations to obtain the failure probabilities of SRAM bit-cells with local threshold voltage ($V_{th}$) variation in the worst global process corner. As discussed in Section III-A, we measured the read failure rate in the FS process corner for 6T SRAM cells with a failure rate analyzer, which was implemented using MATLAB. During the failure rate estimation process, a Python program is used to control HSPICE and MATLAB, automatically changing design parameters such as voltage values and bit-cell sizing parameters. The obtained memory failure rates are used for fault injection. Then, we implement an H.264 decoder in Verilog and randomly inject the memory faults across the reference frame buffer based on the obtained failure probabilities. Finally, we capture the video frames on the H.264 decoder side to evaluate the video quality with calculated PSNR values.

B. Video Memory Power Consumption Model

We use the following model to estimate the overall active power consumption, including both dynamic and leakage power, of embedded SRAM:

$$P = P_w + P_r \tag{9}$$

where $P_w$ and $P_r$ are the power consumption of write and read operations, respectively. For 8-b pixel data, the power consumption can be expressed as

$$P_w = \sum_{k=0}^{7} \sum_{i=0,1} \sum_{j=0,1} [F_k(i,j) \cdot P_{wk}(i,j)] \tag{10}$$

$$P_r = \sum_{k=0}^{7} \sum_{i=0,1} [F_k(i) \cdot P_{rk}(i)] \tag{11}$$

where $k$ is the bit number, and $i$ and $j$ are the old and new values stored in an SRAM cell. $F(i)$ represents the probability of a bit being $i$, where $i$ = 0 or 1. $F(i,j)$ indicates the bit change (switching) probability from $i$ to $j$, where both $i$ and $j$ are 0 or 1. It is worth noting that, for the most significant bits in 8-b pixel data, $F(0,0)$ and $F(1,1)$ are much higher than $F(0,1)$ and $F(1,0)$ due to their strong correlation [7]. In our analysis, $F(i)$ and $F(i,j)$ are extracted from the real video frames in the decoding process.

C. SPIDER for Area-Priority Applications

In our implementation, the target PSNR is set to 30.5 dB. Table III presents the optimal SRAM bit-cell sizes and failure rates obtained with Luma-only optimization and with Luma- and Chroma-based optimization ("Luma + Chroma"). It shows that considering only Luma during the optimization process would induce more area overhead, as shown in Fig. 7.
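The power model of (9)–(11) can be sketched in a few lines of Python. This is an illustrative sketch only: the per-bit power tables `P_w` and `P_r` are placeholders that would normally come from circuit simulation, the read statistic $F_k(i)$ is assumed to be taken over the newly stored values, and the function names are hypothetical.

```python
def bit_statistics(old_pixels, new_pixels, n_bits=8):
    """Estimate F_k(i) and F_k(i, j) from paired old/new 8-bit pixel values."""
    n = len(new_pixels)
    f_state = [[0.0, 0.0] for _ in range(n_bits)]
    f_trans = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(n_bits)]
    for old, new in zip(old_pixels, new_pixels):
        for k in range(n_bits):
            i, j = (old >> k) & 1, (new >> k) & 1
            f_state[k][j] += 1 / n     # assumption: reads see the new value
            f_trans[k][i][j] += 1 / n  # write changes bit k from i to j
    return f_state, f_trans

def write_power(f_trans, p_w):
    """Equation (10): sum over bits k and transitions (i, j)."""
    return sum(f_trans[k][i][j] * p_w[k][i][j]
               for k in range(len(f_trans)) for i in (0, 1) for j in (0, 1))

def read_power(f_state, p_r):
    """Equation (11): sum over bits k and stored values i."""
    return sum(f_state[k][i] * p_r[k][i]
               for k in range(len(f_state)) for i in (0, 1))

def total_power(f_state, f_trans, p_w, p_r):
    """Equation (9): P = P_w + P_r."""
    return write_power(f_trans, p_w) + read_power(f_state, p_r)

# Placeholder per-bit power tables in arbitrary units (not measured data).
P_w = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(8)]
P_r = [[2.0, 2.0] for _ in range(8)]
f_state, f_trans = bit_statistics([0, 255], [0, 0])
print(total_power(f_state, f_trans, P_w, P_r))
```

Because the statistics are kept per bit, the same functions can weight each bit position with its own power table, which is what a per-bit sizing scheme requires.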
Although the proposed Luma + Chroma technique only reduces the area of bit 2 from 0.935 to 0.906 μm² for this video format (4:2:0 YUV video), additional area savings are expected for other video formats (such as 4:4:4 or uncompressed video) because of the increased contribution of chroma data.

Fig. 8 shows the power savings of the proposed memory compared with the basic memory. With the proposed SPIDER, which can work at 500 mV, an 8-b memory array achieves over 70% power savings compared with the traditional memory at 1 V. Using a modified version of CACTI 5.3 [36], we model a typical mobile video on-chip SRAM [3], [37] in 45-nm technology. The total memory size is 32 kb with four banks, and each bank has 32 × 256 b. The model shows that the bit-lines and bit-cells account for 11%–20% of the power consumption during the reading and writing processes. Accordingly, SPIDER can enable 8%–16% power savings for the entire memory. Note that no peripheral circuit modification is needed to implement SPIDER as compared to traditional SRAM. The proposed SPIDER technique, which aims to reduce the power consumption of the memory bit-cells, can be applied in conjunction with low-power peripheral techniques, such as the multiple sleep modes zig-zag horizontal and vertical sleep transistor sharing approach [38], to achieve additional power savings.

Fig. 6. SPIDER simulator and Python controller.

TABLE III OPTIMAL SRAM BIT-CELL SIZES AND CORRESPONDING FAILURE PROBABILITIES

Fig. 7. Layout of application-driven memory for an 8-b pixel. (Top) Designed memory considering both Luma and Chroma. (Bottom) Designed memory considering only the contribution of Luma.

Fig. 8. Power savings of SPIDER. Conventional: basic memory at 1 V; area-priority: area-priority SPIDER with PSNR target = 30.5 dB; quality-priority (50% area): quality-priority SPIDER with a 50% area overhead constraint; quality-priority (75% area): quality-priority SPIDER with a 75% area overhead constraint.

We further evaluate the performance of the SPIDER memory. Although SPIDER brings a performance penalty, the delay time is smaller than 1.271 ns, which is fast enough to support various mobile videos, including high-quality videos.

Finally, we evaluate the video output quality based on SPIDER. Fig. 9 shows the results for the Akiyo clip with different memory designs. The conventional 6T SRAM results in significant degradation of frame quality at 500 mV, and the PSNR is only 9.166 dB. In contrast, the proposed SPIDER scheme delivers output quality with no significant degradation.

Fig. 9. Output quality for area-priority applications. (a) Conventional memory. (b) SPIDER considering both Luma and Chroma. (c) SPIDER considering only Luma.

Fig. 10. Output quality for quality-priority applications with 50% area constraint. (a) All-6T [4]. (b) SPIDER.

Fig. 11. Output quality for quality-priority applications with 75% area constraint. (a) All-6T [4]. (b) SPIDER.

D. SPIDER for Quality-Priority Applications

Based on the developed SPIDER Algorithm 2, we also implement mobile video memory for quality-priority applications. Fig. 10 compares the video output of the all-6T sizing methodology [4] and SPIDER with a 50% area constraint. With the same area constraint, the proposed SPIDER methodology significantly improves the output quality: compared with the all-6T sizing approach in [4], the PSNR increases from 9.372 dB to 32.203 dB. Fig. 11 shows the video outputs with a 75% area constraint. In this case, the PSNR based on the developed SPIDER
methodology is improved by 1.067 dB. In contrast, the PSNR based on the all-6T methodology [4] does not show significant improvement. Accordingly, SPIDER achieves a larger quality improvement as the area constraint increases.

VI. CONCLUSION

In this paper, we have presented a new SPIDER design technique for power-efficient mobile video applications. The technique adopts a priority-based SRAM sizing methodology to mitigate memory failures at low operating voltage while reducing the area overhead. We developed a model relating memory failures to output quality, thereby introducing application output into the hardware design process. Based on this model, we designed memory optimization algorithms for area-priority and quality-priority mobile video applications. Finally, we developed a hardware-based, Python-assisted simulator that enables precise memory fault injection into the mobile video decoding process. The simulation results demonstrate that the proposed design achieves over 70% power savings compared with the conventional SRAM array.

REFERENCES

[1] Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2012–2017, Cisco Syst., San Jose, CA, USA, Feb. 2013.
[2] M. E. Sinangil and A. P. Chandrakasan, "Application-specific SRAM design using output prediction to reduce bit-line switching activity and statistically gated sense amplifiers for up to 1.9× lower energy/access," IEEE J. Solid-State Circuits, vol. 49, no. 1, pp. 107–117, Jan. 2014.
[3] I. J. Chang, D. Mohapatra, and K. Roy, "A priority-based 6T/8T hybrid SRAM architecture for aggressive voltage scaling in video applications," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 2, pp. 101–112, Feb. 2011.
[4] J. Kwon, I. Lee, and J. Park, "Heterogeneous SRAM cell sizing for low power H.264 applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 99, no. 2, pp. 1–10, Feb. 2012.
[5] J. Park, J. Park, and S. Bhunia, "VL-ECC: Variable data-length error correction code for embedded memory in DSP applications," IEEE Trans. Circuits Syst. II, Express Briefs, vol. 61, no. 2, pp. 120–124, Feb. 2014.
[6] M. Cho, J. Schlessman, W. Wolf, and S. Mukhopadhyay, "Reconfigurable SRAM architecture with spatial voltage scaling for low power mobile multimedia applications," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 1, pp. 161–165, Jan. 2011.
[7] N. Gong, S. Jiang, A. Challapalli, S. Fernandes, and R. Sridhar, "Ultra-low voltage split-data-aware embedded SRAM for mobile video applications," IEEE Trans. Circuits Syst. II, Express Briefs, vol. 59, no. 12, pp. 883–887, Dec. 2012.
[8] J. S. Wang, P. Y. Chang, T. S. Tang, J. W. Chen, and J. I. Guo, "Design of subthreshold SRAMs for energy-efficient quality-scalable video applications," IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. 1, no. 2, pp. 183–192, Jun. 2011.
[9] M. A. Hoque, M. Siekkinen, and J. K. Nurminen, "Energy efficient multimedia streaming to mobile devices – A survey," IEEE Commun. Surveys Tuts., vol. 16, no. 1, pp. 579–597, 1st Quart., 2014.
[10] Y. Benmoussa, J. Boukhobza, E. Senn, and D. Benazzouz, "Energy consumption modeling of H.264/AVC video decoding for GPP and DSP," in Proc. 16th Euromicro Conf. Digit. Syst. Design, 2013, pp. 890–896.
[11] A. Carroll and G. Heiser, "An analysis of power consumption in a smartphone," in Proc. USENIX Annu. Tech. Conf., 2010, pp. 1–14.
[12] T. Liu et al., "A 125 μW, fully scalable MPEG-2 and H.264/AVC video decoder for mobile applications," IEEE J. Solid-State Circuits, vol. 42, no. 1, pp. 161–169, Jan. 2007.
[13] K. Nii et al., "A 45-nm bulk CMOS embedded SRAM with improved immunity against process and temperature variations," IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 180–191, Jan. 2008.
[14] O. Hirabayashi et al., "A process-variation-tolerant dual-power-supply SRAM with 0.179 μm² cell in 40 nm CMOS using level-programmable wordline driver," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2009, pp. 458–459.
[15] T. Suzuki, H. Yamauchi, Y. Yamagami, K. Satomi, and H. Akamatsu, "A stable 2-port SRAM cell design against simultaneously read/write-disturbed accesses," IEEE J. Solid-State Circuits, vol. 43, no. 9, pp. 2109–2119, Sep. 2008.
[16] F. Tachibana et al., "A 27% active and 85% standby power reduction in dual-power-supply SRAM using BL power calculator and digitally controllable retention circuit," IEEE J. Solid-State Circuits, vol. 49, no. 1, pp. 118–126, Jan. 2014.
[17] N. Shibata, H. Kiya, S. Kurita, H. Okamoto, M. Tan-no, and T. Douseki, "A 0.5-V 25-MHz 1-mW 256-kb MTCMOS/SOI SRAM for solar-power-operated portable personal digital equipment – Sure write operation by using step-down negatively overdriven bitline scheme," IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 728–742, Mar. 2006.
[18] D. P. Wang et al., "A 45nm dual-port SRAM with write and read capability enhancement at low voltage," in Proc. IEEE Int. SoC Conf., Sep. 2007, pp. 211–214.
[19] M. Khellah et al., "Wordline & bitline pulsing schemes for improving SRAM cell stability in low-Vcc 65nm CMOS designs," in Proc. Symp. VLSI Circuits, Jun. 2006, pp. 9–10.
[20] K. Kushida et al., "A 0.7 V single-supply SRAM with 0.495 μm² cell in 65 nm technology utilizing self-write-back sense amplifier and cascaded bit line scheme," IEEE J. Solid-State Circuits, vol. 44, no. 4, pp. 1192–1198, Apr. 2009.
[21] K. Takeda et al., "A read-static-noise-margin-free SRAM cell for low-VDD and high-speed applications," IEEE J. Solid-State Circuits, vol. 41, no. 1, pp. 113–121, Jan. 2006.
[22] T.-H. Kim, J. Liu, and C. H. Kim, "A voltage scalable 0.26 V, 64 kb 8T SRAM with Vmin lowering techniques and deep sleep mode," IEEE J. Solid-State Circuits, vol. 44, no. 6, pp. 1785–1795, Jun. 2009.
[23] R. Saeidi, M. Sharifkhani, and K. Hajsadeghi, "A subthreshold symmetric SRAM cell with high read stability," IEEE Trans. Circuits Syst. II, Express Briefs, vol. 61, no. 1, pp. 26–30, Jan. 2014.
[24] J.-J. Wu et al., "A large σVTH/VDD tolerant zigzag 8T SRAM with area-efficient decoupled differential sensing and fast write-back scheme," IEEE J. Solid-State Circuits, vol. 46, no. 4, pp. 815–827, Apr. 2011.
[25] S. A. Verkila, S. K. Bondada, and B. S. Amrutur, "A 100MHz to 1GHz, 0.35V to 1.5V supply 256 × 64 SRAM block using symmetrized 9T SRAM cell with controlled read," in Proc. Conf. VLSI Design, Jan. 2008, pp. 560–565.
[26] F. Abouzeid et al., "Scalable 0.35 V to 1.2 V SRAM bitcell design from 65 nm CMOS to 28 nm FDSOI," IEEE J. Solid-State Circuits, vol. 49, no. 7, pp. 1499–1505, Jul. 2014.
[27] Y.-W. Chiu et al., "40 nm bit-interleaving 12T subthreshold SRAM with data-aware write-assist," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 9, pp. 2578–2585, Sep. 2014.
[28] H.264/AVC JM Simulator, accessed on Feb. 2014. [Online]. Available: http://iphome.hhi.de/suehring/tml/
[29] S. A. Pourbakhsh, X. Chen, D. Chen, X. Wang, N. Gong, and J. Wang, "Sizing-priority based low-power embedded memory for mobile video applications," in Proc. Int. Symp. Quality Electron. Design (ISQED), Mar. 2016, pp. 1–5.
[30] F. Sampaio, M. Shafique, B. Zatt, S. Bampi, and J. Henkel, "Energy-efficient architecture for advanced video memory," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2014, pp. 132–139.
[31] C.-C. Ju et al., "A 0.5 nJ/pixel 4K H.265/HEVC codec LSI for multiformat smartphone applications," IEEE J. Solid-State Circuits, vol. 51, no. 1, pp. 56–67, Jan. 2016.
[32] V. Sze, D. F. Finchelstein, M. E. Sinangil, and A. P. Chandrakasan, "A 0.7-V 1.8-mW H.264/AVC 720p video decoder," IEEE J. Solid-State Circuits, vol. 44, no. 11, pp. 2943–2956, Nov. 2009.
[33] C.-D. Chien et al., "A 252kgate/71mW multi-standard multi-channel video decoder for high definition video applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2007, pp. 282–283.
[34] M. Tikekar, C.-T. Huang, C. Juvekar, V. Sze, and A. P. Chandrakasan, "A 249-Mpixel/s HEVC video-decoder chip for 4K Ultra-HD applications," IEEE J. Solid-State Circuits, vol. 49, no. 1, pp. 61–72, Jan. 2014.
[35] D. Zhou et al., "A 4Gpixel/s 8/10b H.265/HEVC video decoder chip for 8K Ultra HD applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2016, pp. 266–267.
[36] Hewlett-Packard Company, Palo Alto, CA, USA. CACTI, accessed on Jan. 2017. [Online]. Available: http://quid.hpl.hp.com:9081/cacti/sram.y?new
[37] D. Chen, J. Edstrom, X. Chen, W. Jin, J. Wang, and N. Gong, "Data-driven low-cost on-chip memory with adaptive power-quality trade-off for mobile video streaming," in Proc. Int. Symp. Low Power Electron. Design (ISLPED), Aug. 2016, pp. 188–193.
[38] H. Homayoun, A. Sasan, A. Veidenbaum, H.-C. Yao, S. Golshan, and P. Heydari, "MZZ-HVS: Multiple sleep modes zig-zag horizontal and vertical sleep transistor sharing to reduce leakage power in on-chip SRAM peripheral circuits," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 12, pp. 2303–2316, Dec. 2011.

Na Gong (M'13) received the B.E. degree in electrical engineering and the M.E. degree in microelectronics from Hebei University, Hebei, China, in 2004 and 2007, respectively, and the Ph.D. degree in computer science and engineering from the State University of New York, Buffalo, NY, USA, in 2013. She is currently an Assistant Professor of Electrical and Computer Engineering with North Dakota State University, Fargo, ND, USA. Her current research interests include energy-efficient storage systems, embedded systems, and approximate computing.

Seyed Alireza Pourbakhsh received the B.E. degree in electrical engineering from the Sharif University of Technology, Tehran, Iran, in 2007. He is currently pursuing the master's degree in electrical engineering with North Dakota State University, Fargo, ND, USA. His current research interests include emerging memory technologies and 3-D IC.

Xiaowei Chen received the B.E. degree in electrical engineering from the First Aviation Academy of Chinese Air Force, Changchun, China, in 2008. He is currently pursuing the Ph.D. degree in electrical and computer engineering with North Dakota State University, Fargo, ND, USA.
His current research interests include emerging memory technologies.

Xin Wang received the B.E. degree in electrical engineering from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2005. She is currently pursuing the master's degree in electrical engineering with North Dakota State University, Fargo, ND, USA. Her current research interests include video processing and energy-efficient very large scale integration design.

Dongliang Chen (S'15) received the B.S. degree in electrical engineering from the Dalian University of Technology, Dalian, China, in 2010. He is currently pursuing the Ph.D. degree in electrical and computer engineering with North Dakota State University, Fargo, ND, USA. His current research interests include power-efficient mobile video streaming and embedded vision.

Jinhui Wang (M'13) received the B.E. degree in electrical engineering from Hebei University, Hebei, China, in 2004, and the Ph.D. degree in electrical engineering through a joint USA/China program between the University of Rochester, Rochester, NY, USA, and the Beijing University of Technology, Beijing, China, in 2010. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, North Dakota State University, Fargo, ND, USA. He has authored more than 80 publications and holds 20 patents in emerging semiconductor technologies. His current research interests include low-power, high-performance, and reliable integrated circuit design, 3-D IC and EDA methodologies, and thermal issue solutions in very large scale integration.