Data-Pattern Enabled Self-Recovery Low-Power Storage System for Big Video Data



IEEE TRANSACTIONS ON BIG DATA, UNDER REVIEW

Jonathon Edstrom, Dongliang Chen, Yifu Gong, Jinhui Wang, Member, IEEE, and Na Gong, Member, IEEE

Abstract: The growing popularity of powerful mobile devices such as smartphones and tablets has resulted in exponential growth in the demand for video applications. However, due to the large video data size and intensive computation, mobile video applications require frequent embedded memory access, which consumes a large amount of power and limits battery life. In this paper, we present a low-cost self-recovery video storage system by investigating meaningful data patterns hidden in big video data and introducing data mining techniques to the hardware design process. We propose a two-dimensional data-pattern approach to explore horizontal data-association and vertical data-correlation characteristics. Such data relationship discovery and pattern identification enable a new dimension for the hardware design space and bring self-recovery ability to memories in the presence of bitcell failures. Based on the identified optimal data patterns, we present a low-cost and efficient SRAM design to enable data self-recovery at low voltages. A 45nm 32kb SRAM is implemented that delivers good video quality at near-threshold voltage (0.5 V) with negligible area overhead (7.94%).

Index Terms: videos; data mining; data pattern; low-power; self-recovery; on-chip memory

1 INTRODUCTION

Information has driven the remarkable evolution of human society. According to market research, by 2020 the amount of data that is created, replicated, and consumed will be as large as 40 ZB (zettabytes; 1 ZB = 10^21 B) [1, 2], and more than half of the data traffic will be video data [3]. Traditional plain TV sets are losing ground to hybrid TVs, PCs, game consoles, and, more recently, mobile devices such as tablets and smartphones.
In this new, mobile, big-video age, one of the biggest remaining contributors to user dissatisfaction is short battery life [3]. In particular, due to the intensive computation and large data size, video applications demand continuously increasing storage space. As a result, embedded static random-access memory (SRAM) occupies over 65% of the mobile video decoder chip area and is also a major contributor to mobile battery consumption (>92% of the motion compensation energy [32]). This situation is only expected to grow for the next-generation mobile video format, H.265/HEVC, which has 2x-3x higher memory demand than H.264 [32].

Voltage scaling techniques have been widely applied to reduce the power consumption of memory systems. Researchers have shown that SRAM achieves maximum efficiency at near-threshold voltage [14]. However, as voltage scales, SRAMs become susceptible to failure due to significant process variation. Various techniques have been developed to correct or eliminate these memory failures as voltage is scaled. Traditional low-power memory techniques can be divided into three general categories: (i) assist schemes, such as adjustment of cell voltage [5] and boosted wordline voltage [6]; (ii) large bitcells, such as the upsized 6T cell [24], asymmetric 7T cell [7], single-ended read-decoupled 8T cells [8], read-disturb-free 9T cell [9], and bit-interleaving 12T cells [10]; and (iii) error correction techniques, spanning from the use of error correction codes [11] to data remapping [12].

Fig. 1. Big-data enabled intelligent efficient hardware. (Diagram: big-data enabled data knowledge feeds the traditional isolated hardware design process. Positive feedback loop: big-data enabled better hardware will support big-data applications better.)

The authors are with the Department of Electrical and Computer Engineering, North Dakota State University, ND. E-mail: {jonathon.edstrom, dongliang.chen, yifu.gong, jinhui.wang.1, na.gong}@ndsu.edu.
Unfortunately, almost all existing solutions add considerable implementation overhead to the original memory design. For example, the area overhead penalty is as high as %. Such large overhead leads to increased layout area, higher design complexity, and reduced performance of the entire system. Recently, a new branch of low-voltage embedded memory techniques has been developed to embrace memory faults, instead of avoiding them (assist techniques or bitcells larger than 6T) or correcting them (e.g. ECC). Those techniques aim to mitigate the impact of memory faults by minimizing the magnitude of the error caused by a faulty cell, based on memory fault positions determined by run-time testing (e.g. built-in self-test (BIST)). We refer to those techniques as fault-position aware mitigation techniques. For example, Ganapathy et al. [13] developed a shifting technique that always stores the least-significant bits (LSBs) in the faulty cells, which may lead to a tolerable output quality. Ferreron et al. [14] presented a squeezing technique that compresses zeros and stores them in less memory space, thereby avoiding the memory failures present at low voltage. However, because they rely on predetermined memory fault positions, the existing techniques still involve complex operations (e.g. shifting value calculation and storage), and the overhead incurred is still significant (e.g. 65% in [13]).

Manuscript received 14 Oct

To address the storage challenge of videos as well as other big things, we propose to leverage the assets of big data to extract useful knowledge and actionable information for hardware design. Recently, it has been observed that today's big-data applications, including videos, have three common data characteristics [33]: (i) redundant inputs, (ii) multiple acceptable outputs, and (iii) statistical computations. Those intrinsic characteristics provide substantial opportunities for data relationship discovery and pattern identification, which enable a new dimension for the hardware design space and bring opportunities for multi-dimensional innovations in circuits and systems, as illustrated in Fig. 1.

Specifically, in this paper, we present a novel Data Pattern enabled Self-Recovery video SRAM (DPSR) to achieve efficient near-threshold voltage computing while delivering good video output quality. By introducing advanced data mining techniques, we investigate meaningful data patterns hidden in video data and use them to enable self-recovery in SRAM. We propose a two-dimensional (2D) data pattern approach to explore horizontal data-association and vertical data-correlation characteristics and determine the optimal data patterns for self-recovery. Based on this, we develop an efficient SRAM design technique to implement DPSR with negligible area overhead (7.94%) and performance penalty. Earlier, in [30], we presented a basic DPSR design storing only chroma data associations, including some preliminary results.
We extend our original work with the following additional contributions:

- We investigate data associations between bits of luma data in various videos to enable additional power savings by implementing the design across both luma and chroma data in the video memory (Sections 3.1, 3.2 and 3.3).

- We propose a new hardware design that realizes near-threshold voltage storage for the luma data based on the discovered optimal data patterns. We analyze different hardware bit prediction schemes and implement the optimal wordline architecture for the highest bit prediction percentage. Since there is twice as much luma data as chroma data in typical mobile videos, our additions allow for triple the power savings as compared to our previous design [30] (Sections 4 and 5.2).

- To analyze the quality of video output, we add the structural similarity (SSIM) metric proposed in [23], which is aware of the user's perception by including calculations for luminance, contrast, and structural changes in the video (Section 4.3). Also, to verify the power efficiency of the proposed technique, we develop memory power consumption models for both active and leakage power consumption and perform a detailed analysis (Section 5.3).

- Most importantly, in order to emulate the use of mobile devices in the environment of big data, we expand the video data to larger-scale, real videos using the recently released YouTube-8M dataset [31]. Specifically, 10,000 unique YouTube-8M videos, with 57.6 GB total data size, representing 500,000 individual frames, have been analyzed using data mining techniques to identify the general data patterns existing in various videos (Sections 3.1 and 3.2). Additionally, 25 videos from YouTube-8M, separate from the 10,000 videos used in the training dataset, are used to verify the correct prediction percentage and video output qualities (Sections 3.3 and 5.4).
It should be emphasized that the biggest challenge in achieving data-enabled hardware is that it is difficult for hardware designers to directly observe the inherent data behaviors in a large volume of video data. To realize the proposed data-pattern enabled efficient video storage, hardware designers require a deep understanding and systematic study of the inherent data relationships in massive video data. This cannot be solved by traditional data techniques, due to the increased complexity and the growing amount of video data. Also, the larger the data size, the more general the data patterns that can be identified and the more power saving opportunities that can be enabled. Accordingly, big data may be the only effective way to realize the full potential of the proposed intelligent hardware. Note that the data pattern identification process is conducted at design time (off-line), thereby avoiding the run-time performance overhead caused by big data algorithms.

The rest of the paper is organized as follows. Section 2 presents SRAM failure at near-threshold voltage. In Section 3, the data-mining enabled mobile video data patterns are analyzed for self-recovery. In Section 4, we present DPSR. The evaluation results are provided in Section 5. Finally, the conclusion is drawn in Section 6.

2 EMBEDDED MEMORY FAILURE ANALYSIS AT NEAR-THRESHOLD VOLTAGE

It has been shown that computing efficiency is maximized when a circuit operates at near-threshold voltage [14]. However, at 0.5 V (our target near-threshold voltage), SRAM failures become more severe with increasing process variation. In particular, the random dopant fluctuation (RDF) effect leads to threshold voltage (Vth) variation and SRAM cell failures [15]. For current manufacturing technologies, the failure probability of an SRAM cell (Pfail) typically ranges between 10^-3 and 10^-2, depending on the bitcell area [14, 16].
The minimum-sized SRAM has the highest failure rate of 10^-2, and larger bitcells have a lower failure probability. With 58% area overhead, the failure rate can be reduced from 10^-2 to 10^-3 [16]. In our analysis, we consider both the 10^-2 (minimum-sized SRAM) and 10^-3 (upsized SRAM) conditions. It should be noted that the failure rate can be further optimized using a recently developed priority-based sizing technique [26].

To further study the SRAM failure characteristics at low voltage, we investigated error maps for a 512-word x 64-bit SRAM with Pfail equal to 10^-2 and 10^-3. During the fault injection process, we assumed the failed bits to be located across the memory cells according to a uniform distribution with the given failure probabilities, introducing embedded memory failures to the decoding process. The use of a uniform distribution for the errors is supported by the memory failure measurements in [29]. The results are shown in Fig. 2: SRAM faults are uniformly distributed in the array.

Fig. 2. Error maps in the SRAM array at 0.5 V. (a) Failure rate is 10^-3 (0.001) and (b) failure rate is 10^-2 (0.01). Each dot on the maps marks a bitcell failure location with row number (y axis) and column number (x axis) in the SRAM array.

We also analyzed the probability of different numbers of faults in the same wordline (32-bit word), and the results are listed in Table 1. It can be seen that a wordline has a low number of faulty cells. The probability of two faults existing in the same wordline is only 3.6% when the SRAM bitcell failure rate is 10^-2. Accordingly, in the presence of a memory fault, SRAM may achieve self-recovery based on other bits in the same wordline if meaningful bit-level data patterns exist.

TABLE 1
FAULT PROBABILITY IN A 32-BIT SRAM WORD

Faults per wordline    Failure rate 10^-3 (0.001)    Failure rate 10^-2 (0.01)
0                      96.8523477%                   72.7279953%
1                      3.0992274%                    23.2812509%
2                      0.0479198%                    3.6012385%
3                      0.0005023%                    0.3611914%
4                      0.0000028%                    0.0267011%
5                      0%
6                      0%
7                      0%

*Calculations based on Monte Carlo simulation (10^9 trials)

3 DATA PATTERN INVESTIGATION FOR SELF-RECOVERY

This section presents our data-mining methodology to discover data patterns hidden in video data to enable reliable self-recovery.
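As a rough cross-check on the Table 1 figures from Section 2, the number of faulty cells in a word can be approximated in closed form. The sketch below assumes independent bitcell failures, so the fault count in a 32-bit word follows a binomial distribution; the function name `fault_count_probability` is hypothetical. The paper's numbers come from Monte Carlo simulation, so small differences from this model are to be expected.

```python
from math import comb

def fault_count_probability(k, p_fail, word_bits=32):
    """Probability of exactly k faulty cells in one word, assuming
    independent failures with probability p_fail per bitcell."""
    return comb(word_bits, k) * p_fail ** k * (1 - p_fail) ** (word_bits - k)

if __name__ == "__main__":
    for p_fail in (1e-3, 1e-2):
        print(f"Pfail = {p_fail}")
        for k in range(5):
            print(f"  {k} faults: {100 * fault_count_probability(k, p_fail):.7f}%")
```

Under this model, the double-fault probability at Pfail = 10^-2 stays in the few-percent range, which is the property the self-recovery argument relies on.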
Specifically, we propose a new two-dimensional (2D) data pattern approach to explore horizontal data-association and vertical data-correlation characteristics, thereby achieving optimal data patterns.

3.1 Rule Mining Enabled Horizontal Data Pattern

Today's mobile video frames are typically stored and processed in YUV format. The YUV format includes one luma (Y) component, which contains the brightness information of the image, and two chroma components, which contain the blue-difference (Cb) and red-difference (Cr) color information. Fig. 3 shows a typical frame of video data stored in embedded memory, using a YUV 4:2:0 video as an example. As shown, each pixel has 8-bit luma data and 8-bit subsampled chroma data. Since video data is stored in on-chip memory as binary bits, we utilize an association data mining technique to identify horizontal bit-level data patterns.

Fig. 3. 2D data-pattern enabled self-correction and data pattern analysis dataset. (Diagram: a 16x16-pixel 4:2:0 YUV video frame with luma (Y) data at 8 bits/pixel and chroma (Cb, Cr) data at 8 bits/4 pixels each, bits ordered from MSB to LSB; the vertical data pattern relates the same bit of neighboring pixels; the horizontal data pattern treats the dataset as transactions whose items take values in {0,1}. Data-pattern analysis dataset: Akiyo, Container, Flower, Foreman, Coastguard, Hall, Mobile, Mother-Daughter, News, Silent, Tempete, and Waterfall, with per-video sizes between 304,128,000 bits (250 frames) and 364,953,600 bits (300 frames). Total: 4,221,296,640 bits (3470 frames).)

Association rule mining was introduced in 1993 to discover relationships between different variables, called items, in a dataset or database [17]. A complete dataset is made up of many transactions, where each transaction contains a set of items. Each item can be associated with a binary attribute, 0 or 1, that indicates whether the item is present in its corresponding transaction. This type of data organization is illustrated in Fig. 3. Each resulting rule generated by the association rule mining process is an implication of the form X → Y, where X and Y are disjoint sets of items or individual items. Each rule is also accompanied by statistics collected from the dataset, called support and confidence values. The support value for a set of items is the proportion of transactions in the dataset that contain that set of items. The confidence value for an association rule, X → Y, is the proportion of transactions containing X that also contain Y, or the conditional probability P(Y | X).

To enable association data mining, we first use 12 different video benchmarks to build a dataset [27, 28]. In total, the video data size is 4,221,296,640 bits from 3470 frames, as shown in Fig. 3. Each video data bit is defined as an individual item, and we used Weka [18] to run the well-known association rule mining algorithm, Apriori, on our large video dataset. Table 2 lists the horizontal data patterns we obtained for chroma data based on the video benchmarks.

We further expand the video data to larger-scale, real video datasets in order to emulate the use of mobile devices in the environment of big data. We use Google's recently released YouTube-8M dataset [31], which is the largest video dataset to date. Specifically, 10,000 unique videos from the YouTube-8M dataset, with 57.6 GB total data size, representing 500,000 individual frames, were analyzed using data-mining methods. A script was written to download these 10,000 videos from the ~7 million available URLs provided in the YouTube-8M dataset. After each video was downloaded, 50 contiguous frames were randomly selected from the video and converted from the MP4 format to the raw YUV format using the FFmpeg decoder [32] for data-mining analysis. To support large-scale video data processing, our experiments were performed on the Thunder cluster at the Center for Computationally Assisted Science and Technology (CCAST) of North Dakota State University, which consists of 53 compute nodes. Each node has dual-socket Intel Xeon 2670v2 Ivy Bridge processors (10 cores per socket, 2.5 GHz) with 64 GB DDR3 RAM, and all nodes are interconnected with FDR InfiniBand at a 56 Gbit/s transfer rate. As illustrated in Fig. 3, each video data bit is defined as an individual item, and the well-known association rule mining algorithm, Apriori, was used on our large video dataset to gather confidence and support metric calculations. The average results obtained for the horizontal data patterns are also listed in Table 2. We can see that the association rules obtained from the video benchmarks are very general and also exist in large-scale videos.

TABLE 2
DISCOVERED ASSOCIATION RULES
(Columns: rule; confidence and support from 12 video benchmarks [27, 28]; confidence and support from 10,000 YouTube-8M videos [31]. Rules listed: Y2=1 → Y1; Y3=1 → Y1; Y2=0 → Y1; Y1=1 → Y2; Y1=1 → Y3; Cb2=0 → Cb1; Cb2=1 → Cb1; Cb1=0 → Cb2; Cb1=1 → Cb2; Cr2=1 → Cr1; Cr2=0 → Cr1; Cr1=1 → Cr2; Cr1=0 → Cr2; Cr1=1 → Cr3; Cr1=0 → Cr3. Consequent values and numeric entries are not preserved.)
*Bit 1 (i.e. Y1, Cb1, Cr1) is the MSB. Bit 8 (i.e. Y8, Cb8, Cr8) is the LSB.

3.2 Vertical Data Pattern

Vertical data correlation characteristics of multimedia applications have been studied by many researchers [19, 20]. These works show that the most-significant bits (MSBs) of video data have strong correlation with neighboring pixels and that the switching probability is very low. As listed in Table 3, for the video benchmarks, the correlation probability of the MSBs (Y1, Cb1, Cr1) in neighboring pixels is over 93%, while it is reduced to 53% for the LSB of Cb (Cb8). Similar correlation characteristics can be observed in YouTube-8M videos. The MSBs in neighboring pixels have very strong correlation (with probability over 90%), but the LSBs are more random and have little correlation with neighboring pixels.

TABLE 3
VERTICAL CORRELATION PROBABILITY
(Correlation probability of each Y, Cb, and Cr bit position, from 12 video benchmarks and from 10,000 YouTube-8M videos. Numeric entries are not preserved.)

TABLE 4
OPTIMAL LUMA DATA PATTERNS FROM 25 YOUTUBE-8M VIDEOS, SEPARATE FROM THE 10,000 VIDEOS USED IN THE TRAINING DATASET
(Y bits and optimal data patterns: Y1: (Y1 previous); Y2: (Y2 previous); Y3: (Y3 previous); Y4: (Y4 previous); Y5: (Y5 previous); Y6: (Y6 previous); Y7: (Y7 previous); Y8: (Y8 previous). Correct prediction (%) entries are not preserved.)

Power saving techniques exploiting this correlation have been used in previous works: for bit prediction, where the absence of transistor switching results in power savings [4], and for loading the same value repeatedly (reading continuous 0s or 1s) from a memory bitcell, which eliminates the cost of precharging if the value read out matches the previous bitline read [19]. This work uses the correlation property of YUV data in a novel bit correction technique that attempts to correct memory faults with high precision.
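The two statistics used in Sections 3.1 and 3.2 can be illustrated on a handful of 8-bit samples. The following is a minimal sketch, not the Weka/Apriori flow used in the paper: it computes the support and confidence of a single bit-level rule inside one sample, and the non-switching (correlation) probability of one bit position between neighboring samples. The helper names and the toy sample values are hypothetical.

```python
def bit(value, k):
    """Bit k of an 8-bit sample, with k = 1 the MSB and k = 8 the LSB."""
    return (value >> (8 - k)) & 1

def rule_stats(samples, x_bit, x_val, y_bit, y_val):
    """Support and confidence of the rule (bit x_bit = x_val) -> (bit y_bit = y_val)."""
    has_x = [s for s in samples if bit(s, x_bit) == x_val]
    both = [s for s in has_x if bit(s, y_bit) == y_val]
    support = len(both) / len(samples)            # share of samples with X and Y
    confidence = len(both) / len(has_x) if has_x else 0.0  # P(Y | X)
    return support, confidence

def vertical_correlation(samples, k):
    """Fraction of neighboring sample pairs whose bit k does not switch."""
    pairs = list(zip(samples, samples[1:]))
    return sum(bit(a, k) == bit(b, k) for a, b in pairs) / len(pairs)

if __name__ == "__main__":
    chroma = [128, 129, 130, 127, 126, 125]   # toy values near mid-scale
    print(rule_stats(chroma, 1, 0, 2, 1))     # rule: bit1 = 0 -> bit2 = 1
    print(vertical_correlation(chroma, 1))    # MSB rarely switches
    print(vertical_correlation(chroma, 8))    # LSB switches often
```

Values clustered around mid-scale, as in the toy list, show why the two top bits of chroma data tend to be complementary while the LSB behaves almost randomly.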
By comparing the correlation percentages and the association rules, we can identify the best combination of association rules and correlation between bits to construct an optimal pattern for data self-recovery.

3.3 Optimal Data Patterns for Self-Recovery

In order to select an optimal data pattern from association and correlation, we define the Weighted Confidence based on the support and confidence of a particular rule as follows:

$$\text{Weighted Confidence} = \text{Confidence}(\text{Rule}) \times \text{Support}(\text{Rule}) + \text{Confidence}(\text{Complement Rule}) \times \text{Support}(\text{Complement Rule}) \quad (1)$$

For example, the Weighted Confidence of the association rule between Cr1 and Cr2 can be expressed as

$$\text{Weighted Confidence}(Cr_1 \rightarrow Cr_2) = \text{Confidence}(Cr_1{=}0 \rightarrow Cr_2{=}1) \times \text{Support}(Cr_1{=}0 \rightarrow Cr_2{=}1) + \text{Confidence}(Cr_1{=}1 \rightarrow Cr_2{=}0) \times \text{Support}(Cr_1{=}1 \rightarrow Cr_2{=}0) \quad (2)$$

We then compare this parameter with the correlation, which is the sum of the non-switching values for 0 and 1. This is equivalent to the Weighted Confidence calculation but instead uses the individual bit value (0 or 1) correlation percentages, and is calculated as follows:

$$\text{Correlation} = \text{Confidence}(Bit_{previous}{=}0 \rightarrow Bit_{current}{=}0) + \text{Confidence}(Bit_{previous}{=}1 \rightarrow Bit_{current}{=}1) \quad (3)$$

where Bit_previous and Bit_current represent the video data bits in the same position of two neighboring pixels. As an example, the correlation of Cr2 is

$$\text{Correlation}(Cr_2) = \text{Confidence}(Cr_{2,previous}{=}0 \rightarrow Cr_{2,current}{=}0) + \text{Confidence}(Cr_{2,previous}{=}1 \rightarrow Cr_{2,current}{=}1) \quad (4)$$

TABLE 5
OPTIMAL CHROMA DATA PATTERNS FROM 25 YOUTUBE-8M VIDEOS, SEPARATE FROM 10,000 VIDEOS USED BEFORE
(Cb bits and optimal data patterns: Cb1: Cb2 → Cb1; Cb2: Cb1 → Cb2; Cb3: (Cb3 previous); Cb4: (Cb4 previous); Cb5: (Cb5 previous); Cb6: (Cb6 previous); Cb7: (Cb7 previous); Cb8: (Cb8 previous). Correct prediction (%) entries are not preserved.)

Accordingly, we obtain the optimal bit-level data patterns with high prediction rates to enable self-recovery, as listed in Table 4 for luma (Y) and Table 5 for chroma (Cr and Cb).
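The selection step of this section can be sketched as a small comparison routine. This is an illustration under stated assumptions rather than the authors' tooling: the operator between confidence and support in Eq. (1) is taken to be multiplication, the correlation score is passed in as a single per-bit probability of the kind reported in Table 3, and all numeric inputs below are made-up placeholders, not values from Table 2 or Table 3.

```python
def weighted_confidence(rule, complement_rule):
    """Eq. (1), assuming a product: confidence x support of a rule
    plus confidence x support of its complement rule.
    Each argument is a (confidence, support) pair."""
    (conf_r, supp_r), (conf_c, supp_c) = rule, complement_rule
    return conf_r * supp_r + conf_c * supp_c

def pick_pattern(association_score, correlation_score):
    """Choose the recovery source with the higher score for one bit."""
    if association_score > correlation_score:
        return "association"
    return "correlation"

if __name__ == "__main__":
    # Hypothetical statistics for the pair (Cr1 = 0 -> Cr2 = 1, Cr1 = 1 -> Cr2 = 0).
    wc = weighted_confidence((0.98, 0.45), (0.97, 0.55))
    # Hypothetical non-switching probability of Cr2 between neighboring pixels.
    corr = 0.93
    print(round(wc, 4), pick_pattern(wc, corr))
```

With these placeholder numbers the association rule wins for Cr2, which matches the shape of Table 5, where the top chroma bits use association rules and the lower bits fall back to the previous pixel.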
25 videos from YouTube-8M, separate from the 10,000 videos used in the training dataset, are used to verify the correct prediction percentages shown in Table 4 and Table 5. These videos are obtained using the same method as before, but are distinct from the previous 10,000 videos to ensure that our technique works properly for correction. Our analysis shows that luma data is more random and has less association with other bits in the same pixel, so the optimal luma data patterns are all from correlation.

TABLE 5 (continued)
(Cr bits and optimal data patterns: Cr1: Cr2 → Cr1; Cr2: Cr1 → Cr2; Cr3: Cr1 → Cr3; Cr4: (Cr4 previous); Cr5: (Cr5 previous); Cr6: (Cr6 previous); Cr7: (Cr7 previous); Cr8: (Cr8 previous). Correct prediction (%) entries are not preserved.)

3.4 Recovery Failure Caused by Double Faults in Data Pattern

Since the discovered optimal data patterns used for self-recovery relate two bits in the same wordline, recovery may fail if both bits in a pattern fail simultaneously. Table 6 lists the recovery failure rate. It shows that DPSR has good reliability, with an extremely low recovery failure rate (less than 0.2%). This is due to the low probability of having multiple faults in the same wordline, as discussed in Section 2.

TABLE 6
DPSR RECOVERY FAILURE RATE
(Columns: SRAM Pfail 10^-3 (0.001) and SRAM Pfail 10^-2 (0.01). Rows: Double Faults, Correction Faults, DPSR Recovery Failure. Numeric entries are not preserved.)

4 DPSR HARDWARE IMPLEMENTATION

Utilizing the obtained optimal bit-level data patterns, we present a simple but efficient DPSR hardware design with low implementation cost. Fig. 4 shows the array architecture of the proposed DPSR, where the total array size is 32 kbits and there are four blocks of 256 words x 32 bits. In this design, both luma data and chroma data are stored in the same SRAM but in different blocks. Block 1 and block 2 are used to store the luma data, and each wordline stores the luma data of 4 pixels. Block 3 and block 4 are used to store the chroma data, and each wordline stores the chroma data of 2 pixels. For the luma data stored in blocks 1 and 2, the optimal luma patterns obtained in Section 3 are vertical correlation rules; that is, the luma data of the previous pixel is used to recover the data of the current pixel.
Since SRAM reads are row-wise, reading two physical rows would cause a considerable performance penalty. Accordingly, we adapt the vertical correlation based luma self-recovery to a hardware-friendly design scheme. Because each wordline stores the luma data of 4 pixels, we use the neighboring pixel in the same row for data correction in the current pixel. For example, if a data bit in pixel 1 fails (see Fig. 4), we use the corresponding bit in pixel 2 for recovery; if a bit in pixel 4 fails, we use the corresponding bit in pixel 3 (the neighboring pixel in the same row) for recovery. To verify that our design maximizes the correct bit predictions, we calculated the correct prediction percentage for predicting each bit using both the previous and the next pixel's corresponding bit. Our analysis shows that the two values were approximately equal (both ~79.4% average correct prediction), based on calculations over all samples in our training and verification video benchmarks [27, 28].

Chroma data self-recovery is implemented in SRAM block 3 and block 4 using the optimal chroma patterns. As shown in Fig. 4, each wordline stores two pixels of chroma data with both Cr and Cb. Both vertical correlation rules and horizontal data pattern rules are used for chroma data self-recovery (see Table 5). Similarly, for recovery based on vertical correlation rules, we use the neighboring pixel stored in the same row for data correction, thereby avoiding a performance penalty. For example, if Cb1 in pixel 1 fails (see Fig. 4), the inverted value of Cb2 in the same pixel is used for recovery; if Cr4 in pixel 1 fails, Cr4 in pixel 2, stored in the same row, is used for recovery.

As shown in Fig. 4, a hierarchical readout bitline (RBL) scheme (local RBL and global RBL) is applied to reduce the access time of the memory. The self-recovery logic of DPSR can be implemented simply by connecting multiplexers (MUXes) to the readout bitlines of a conventional SRAM. Each global bitline (gbl) is connected to a MUX that is controlled by the received fault positions. If a fault is indicated, self-recovery is enabled by selecting the data pattern. The fault position information is used as the select signal of the MUX to control which bit is output. Similar to other existing fault-position aware mitigation techniques, DPSR receives pre-determined locations of the faulty bits, which are usually identified either during post-fabrication testing or during power-on self-test (POST) [14, 21, 22]. Such a testing process can also be used to track memory failures caused by temporal degradation, such as aging. The evaluation results in the following sections show that DPSR also achieves smaller silicon area overhead, while delivering good output quality at near-threshold voltage.

Fig. 4. Proposed DPSR with data self-recovery ability. (Diagram: four 256 x 32 SRAM blocks with read decoders, write drivers, and a self-recovery MUX and readout stage; each block comprises eight 32 x 32 sub-arrays. Blocks 1 and 2 hold four 8-bit luma pixels per 32-bit word, and blocks 3 and 4 hold Cr and Cb for two pixels per word. Signals: wbl[31:0], wblb[31:0], lbl1[31:0], lbl2[31:0], PRE, gbl[31:0], gblx[31:0], and select signals S31 to S0 for the luma and chroma self-recovery MUXes.)
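The luma read path described in this section can be modeled behaviorally in a few lines. This is a sketch under assumptions, not the circuit itself: pixels 1 to 4 are assumed to occupy bits [31:24], [23:16], [15:8], and [7:0] of the word, pixel 1 borrows from pixel 2 and pixel 4 from pixel 3 as stated in the text, and the middle pixels are assumed to borrow from the pixel to their left. The names `luma_partner` and `recover_luma_word` are hypothetical.

```python
WORD_BITS = 32
PIXEL_BITS = 8

def luma_partner(i):
    """Bit position that supplies the replacement for a faulty bit i.
    Pixel 1 (bits 31..24) uses pixel 2; the other pixels use the pixel
    stored 8 bits higher (assumed for pixels 2 and 3)."""
    if i >= WORD_BITS - PIXEL_BITS:
        return i - PIXEL_BITS
    return i + PIXEL_BITS

def recover_luma_word(raw, fault_mask):
    """Model of the per-bit MUX: a set bit in fault_mask selects the
    partner bitline instead of the faulty cell's own bitline."""
    out = 0
    for i in range(WORD_BITS):
        src = luma_partner(i) if (fault_mask >> i) & 1 else i
        out |= ((raw >> src) & 1) << i
    return out

if __name__ == "__main__":
    stored = 0xC8C9CACB             # four neighboring luma values: 200..203
    faulty = stored & ~(1 << 31)    # MSB of pixel 1 reads back as 0
    print(hex(recover_luma_word(faulty, 1 << 31)))
```

Because the fault map is fixed after testing, the select signals are static per word, which is why the added logic reduces to one 2:1 MUX per global bitline in this model.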

Fig. 5. Layout design of DPSR (with 7.94% area overhead). (Layout labels: read decoder, write decoder, luma MUX, chroma MUX, and SRAM blocks 1 to 4 (32 x 256 each); dimension labels: 18.79 x 43.57 µm (each), 125.56 µm, and 500.80 µm.)

5 EVALUATION METHODOLOGY AND RESULTS

To evaluate the effectiveness of the proposed technique, a 32kb SRAM is implemented using a high-performance 45-nm FreePDK CMOS process [31] to meet the multi-megahertz performance requirement of today's mobile video decoders.

5.1 Performance

We first evaluate the performance of the proposed DPSR. Due to the added multiplexers, the read access time of DPSR increases from 0.27 ns to 0.31 ns, which is fast enough to deliver high-quality video formats such as 8K Ultra HD [25].

5.2 Layout

As discussed before, embedded SRAMs usually occupy a large portion of the area of a video chip, and therefore the area cost of the embedded SRAM is an important design concern. Fig. 5 shows the layout of DPSR. Each added self-recovery logic block (MUX) occupies an area of µm x µm, resulting in 7.94% area overhead. It should be noted that the self-recovery logic is added to the readout bitlines, so increasing the number of words in a memory reduces the relative area overhead.

5.3 Power Efficiency

To evaluate the power effectiveness of the proposed technique, we model the power consumption of the memory as:

$$P_{Dynamic} = \sum_{j=0}^{31} \sum_{i=0}^{1} P_j(i) \cdot \frac{R(i) + W(i)}{2} \quad (5)$$

$$P_{Leak} = \sum_{j=0}^{31} \sum_{i=0}^{1} P_j(i) \cdot L(i) \quad (6)$$

where P_Dynamic and P_Leak are the dynamic and leakage power consumption, respectively; i is the value stored in the SRAM; j is the bit number in a word, which ranges from 0 to 31; P_j(i) is the probability of data bit j being 0 or 1, which is extracted from various video benchmarks; and R(i), W(i), and L(i) are the read power, write power, and leakage power consumption, respectively, of a single SRAM bitcell storing the data bit i.
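The probability-weighted form of the power model can be sketched as an expectation over the stored bit values. The helper below evaluates the sum over bits j and values i of P_j(i) times a per-bitcell figure; it is applied separately to read, write, and leakage so that it does not depend on how the read and write terms are combined. The function name and every number in the example are hypothetical placeholders, not measurements from the paper.

```python
def expected_word_power(p_one, cell_power):
    """Sum over the bits of a word of P_j(0)*cell_power[0] + P_j(1)*cell_power[1].
    p_one[j] is the probability that bit j stores a 1;
    cell_power is a (power_storing_0, power_storing_1) pair."""
    total = 0.0
    for p1 in p_one:
        total += (1.0 - p1) * cell_power[0] + p1 * cell_power[1]
    return total

if __name__ == "__main__":
    p_one = [0.5] * 32                  # placeholder bit statistics
    leak = (2.0, 4.0)                   # placeholder L(0), L(1) per bitcell
    read = (10.0, 12.0)                 # placeholder R(0), R(1)
    write = (14.0, 16.0)                # placeholder W(0), W(1)
    print("leakage:", expected_word_power(p_one, leak))
    print("read:   ", expected_word_power(p_one, read))
    print("write:  ", expected_word_power(p_one, write))
```

Feeding in bit statistics extracted from real video data, rather than a flat 0.5, is what makes the model sensitive to the data patterns discussed in Section 3.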
6 compares the power consumption in different memory operations. As expected, all power components decreases as the voltage scales from 1 V to 0.5 V. It should be noted that the power consumption overhead caused by the self-correction logic in the proposed technique is negligible as compared to the power reduction enabled by reducing voltage to near-threshold voltage, since the dynamic and leakage power consumption scale quadratically and linearly with voltage, respectively. The proposed memory at 0.5 V consumes 219 µw leakage power and 193 µw leakage power. As compared to the conventional memory at 1 V, 81.52% dynamic power savings and 82.45% leakage power savings can be achieved by the proposed technique. 5.4 Video Output Quality Analysis Different from the video benchmark sets used in Section 3, we use a new video benchmark set for verification: 3 videos from [26] and another 5 videos from [27]. We adopt the well-known peak signal-to-noise ratio (PSNR) metric to evaluate the video quality, which is defined as [19] PSNR 10 log (7) 10 MSE where MSE is the mean square error between the original videos (Org) and the degraded videos (Deg), expressed as m n Org( i, j) Deg ( i, j) MSE (8) mn i0 j 0 Memory Operations Fig.6. Power consumption of different memory operations (W(0): write 0, W(1): write 1, R(0): read 0, R(1): read 1, L(0): leakage power of storing 0, L(1): leakage power of storing 1). Researchers have shown that PSNR with 30 db or higher for a video would be acceptable [14]. Table 7 compares PSNR values using different techniques as Pfail are 10-2 (for minimum-sized SRAM) and 10-3 (for upsized SRAM with 58% area overhead [16]). In addition to video benchmarks, 10 YouTube-8M videos from the 25 videos used for verification earlier in Table 4 and Table 5 (separate from the 10,000 videos used in the training dataset) are used for calculating the video metrics presented. Due to the limited space, Fig. 
7 shows six video output images with failures injected at a memory failure rate of 10^-2. It can be seen that DPSR has good recovery precision and delivers good video quality, with PSNR over 35 dB, even for minimum-sized SRAM. Accordingly, DPSR achieves good video output quality at near-threshold voltage.
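The failure-injection experiment and the PSNR metric of Eqs. (7) and (8) can be illustrated with a minimal sketch. It assumes that every stored bit fails independently with probability Pfail and that a failure shows up as a bit flip; the injection model behind Fig. 7 is not detailed here, so both are assumptions, as are the synthetic 64 × 64 frame and the hypothetical function names `inject_bit_failures`, `mse`, and `psnr`. The peak value of 255 corresponds to 8-bit pixel data.

```python
import math
import random


def inject_bit_failures(pixels, p_fail, rng):
    """Flip each of the 8 bits of every pixel independently with probability p_fail.

    Assumption: a failing bitcell returns the inverted bit (simple flip model).
    """
    degraded = []
    for value in pixels:
        for bit in range(8):
            if rng.random() < p_fail:
                value ^= 1 << bit
        degraded.append(value)
    return degraded


def mse(org, deg):
    """Mean square error between two equally sized pixel sequences, as in Eq. (8)."""
    if len(org) != len(deg) or not org:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((o - d) ** 2 for o, d in zip(org, deg)) / len(org)


def psnr(org, deg, peak=255):
    """PSNR in dB as in Eq. (7); infinite when the two sequences are identical."""
    err = mse(org, deg)
    if err == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / err)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic 64x64 luma frame (a smooth gradient), flattened row by row.
    frame = [((x + y) * 2) % 256 for y in range(64) for x in range(64)]
    for p_fail in (1e-3, 1e-2):
        degraded = inject_bit_failures(frame, p_fail, rng)
        print(f"Pfail={p_fail:g}: PSNR = {psnr(frame, degraded):.2f} dB")
```

Under this flip model, a failure in the most significant bit changes a pixel by 128 while a failure in the least significant bit changes it by 1, so the MSE is dominated by failures in the upper bits.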

Fig. 7. Video output using different video storage techniques. Columns: Original Video, Conventional (Pfail = 0.01), DPSR (Pfail = 0.01), and Shift (Pfail = 0.01) [13]. Rows: city, crew, football, Concert, Game, and Electric Guitar. Output images are annotated with their PSNR and SSIM values.
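The SSIM values annotated in Fig. 7 follow the structural similarity index of [23]. As a rough illustration, the sketch below evaluates the closed form with α = β = γ = 1 over a single window covering the whole pixel sequence. This is a simplification: [23] averages the index over local windows, so the numbers will differ from a full implementation. The function name `ssim_global` is hypothetical, and the constants C1 = (0.01 · 255)^2 and C2 = (0.03 · 255)^2 are the defaults suggested in [23] for 8-bit data.

```python
def ssim_global(x, y, peak=255, k1=0.01, k2=0.03):
    """Single-window SSIM between two equally sized pixel sequences.

    Simplification: statistics are computed once over the whole input
    instead of being averaged over local windows as in [23].
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Unbiased variance and covariance estimates.
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    # Stabilizing constants for near-zero means or variances.
    c1 = (k1 * peak) ** 2
    c2 = (k2 * peak) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    original = [10, 20, 30, 40, 50, 60, 70, 80]
    degraded = [10, 20, 158, 40, 50, 60, 70, 80]  # one pixel with its MSB flipped
    print(f"SSIM(original, original) = {ssim_global(original, original):.4f}")
    print(f"SSIM(original, degraded) = {ssim_global(original, degraded):.4f}")
```

The index equals 1 only when the two inputs are identical, which is why it is read as a similarity score between the reference video and the degraded output.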

TABLE 7
VIDEO PSNR METRIC COMPARISON
Columns: Conventional (Pfail = 0.001), DPSR (Pfail = 0.001), Conventional (Pfail = 0.01), DPSR (Pfail = 0.01), Ref. [13] (Pfail = 0.001), Ref. [13] (Pfail = 0.01).
Video benchmarks: akiyo, bus, city, coastguard, crew, football, foreman, sign_irene.
YouTube-8M dataset: Running, Concert, Music Video, Festival, Game, Electric Guitar, Snow, Flute, Vehicle, Planet.

TABLE 8
VIDEO SSIM METRIC COMPARISON
Columns: Conventional (Pfail = 0.001), DPSR (Pfail = 0.001), Conventional (Pfail = 0.01), DPSR (Pfail = 0.01), Ref. [13] (Pfail = 0.001), Ref. [13] (Pfail = 0.01).
Video benchmarks: akiyo, bus, city, coastguard, crew, football, foreman, sign_irene.
YouTube-8M dataset: Running, Concert, Music Video, Festival, Game, Electric Guitar, Snow, Flute, Vehicle, Planet.

The PSNR metric has been used extensively to describe video output quality quantitatively, but recent efforts to capture true human perception show that it may not accurately describe the video quality a human actually perceives [23]. This is because PSNR is based on the summation of the error in every pixel's chrominance and luminance component values, which alone is not necessarily a good estimate of the user's perception of the video. The structural similarity (SSIM) metric is more aware of the user's perception, since it includes calculations for luminance, contrast, and structural changes in the video. The general form of the SSIM equation is defined as [23]

$$SSIM(x,y) = [l(x,y)]^{\alpha}\,[c(x,y)]^{\beta}\,[s(x,y)]^{\gamma},$$

$$SSIM(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \qquad (9)$$

where l(x,y), the luminance comparison, is a function of the mean intensities, μx and μy; c(x,y), the contrast comparison, is a function of the standard deviations, σx and σy; and s(x,y), the structural comparison, is a function of the correlation between x and y, or σxy. Setting α = β = γ = 1 in the first equation results in the second equation. C1 (C2) is a constant included to avoid instability when the sum of the squared means (squared standard deviations) is close to zero. The value of the SSIM is in the range 0 to 1. As the value SSIM(x,y) gets closer to 1, the quality of video y more closely matches the quality of video x. For our testing purposes, video x is the raw, uncompressed YUV video, before the decoding process, and video y is the post-decoded YUV video that may or may not have other bit shifting or correction changes performed on it. The results of these SSIM calculations for the conventional memory and DPSR are listed in Table 8. The video output of our DPSR method shows a significant increase in SSIM value over the conventional memory, which has no failure correction.

5.5 Comparison with Prior Work

Table 9 compares the performance of DPSR with the state of the art. With data-pattern enabled self-recovery ability, DPSR exhibits low implementation cost (7.94%) and reliable operation at near-threshold voltage to achieve maximum energy efficiency.

Comparing with State-of-the-Art Data-Shifting [13]: Tables 7 and 8 also compare the video output quality of the proposed DPSR and the data-shifting technique presented in [13]. As shown, the data-shifting technique [13] has slightly better quality in terms of the PSNR and SSIM metrics than our proposed technique, but is realized with large area overhead (~14%). This is because the shifting scheme needs to calculate the shift values based on the received fault positions and then perform shifting to store the least-significant bits in the identified faulty bitcells.

Comparing with State-of-the-Art Data-Squeezing [14]: The data-squeezing technique presented in [14] is another recently developed memory failure mitigation technique. Based on the observation that, for many general-purpose applications, the last-level cache contains large amounts of null data, this technique compresses null subblocks so that they can be allocated to memory entries with faulty cells.
This technique works well for register files and caches in general-purpose applications, which store as many as 79.23% zeros, as discussed in [34]. However, it is not suitable for videos, because 8-bit video pixel data vary widely between 0 and 255 and therefore offer little null data for zero compression.

Comparing with State-of-the-Art Error Correction Code (ECC): ECCs have also been studied in ultra-low voltage contexts to protect against memory failures [35]. Similar to redundancy-based repair mechanisms, implementing ECC requires either increasing the capacity of the memory or sacrificing part of its effective capacity to store check bits. In addition to the memory space overhead, complex logic for ECC encoding and decoding must be added, which brings a significant implementation penalty. For example, with the orthogonal Latin square codes discussed in [35], half of the memory capacity is used to store ECC bits.

In our big-data enabled memory technique, the general data patterns existing in large-scale videos have been identified and are used to achieve self-correction in the presence of memory failures. The overhead of the developed self-correction logic is significantly reduced as compared to existing techniques. As a result, the developed DPSR is capable of delivering the best video quality for the least area overhead.

TABLE 9
COMPARISON WITH PRIOR WORK
Columns, in order: TCASI'12 [24] | DAC'15 [13] | TC'16 [14] | This Work
Fault-position awareness: No | Yes | Yes | Yes
Low-power technique: bitcell sizing | data-shifting | data-squeezing | data-pattern enabled self-recovery
Bitcell modified: Yes | No | No | No
Near-threshold operation: No (0.9 V) | Yes (-) | Yes (0.5 V) | Yes (0.5 V)
Power efficiency: Bad | Good | Good | Good
Additional logic needed: No | LUTs and shifter | Rearrangement logic and tag array, comparator, MUX | MUX
Performance overhead: - | - | extra clock (for decompression) | 0.04 ns
Video quality: acceptable | good | does not apply | good
Area overhead: 11-65% | 14% (1) | 6.3% | 7.94%
(1) Depending on the number of shifting bits.

6 CONCLUSION

In this paper, we have presented a data-pattern enabled SRAM with self-recovery ability for big video data. Based on the data patterns obtained by data-mining techniques, a simple circuit-level design technique is applied to enable self-recovery with low area overhead (7.94%). Our design successfully delivers good video quality for minimum-sized SRAM at near-threshold voltage (with a failure rate of 10^-2).

ACKNOWLEDGMENT

This work was supported in part by the National Science Foundation under Grant CCF and CNS , the ND NASA EPSCoR, the ND Venture grant, the NDSU-RCA funding, and the Offerdahl Foundation. Na Gong and Jinhui Wang are the corresponding authors.

REFERENCES

[1] IDC (2012). The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. December [Online]. Available:
[2] K. Kim, "Silicon Technologies and Solutions for the Data-Driven World," in Proc. IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2015, pp
[3] N. Rastogi, "You Charged Me All Night Long," [Online]. Available:
ntern/2009/10/you_charged_me_all_night_long.html.
[4] M. E. Sinangil and A. P. Chandrakasan, "Application-Specific SRAM Design Using Output Prediction to Reduce Bit-Line Switching Activity and Statistically Gated Sense Amplifiers for Up to 1.9× Lower Energy/Access," IEEE Journal of Solid-State Circuits, vol. 49, no. 1, pp , Jan
[5] K. Nii, M. Yabuuchi, Y. Tsukamoto, S. Ohbayashi, S. Imaoka, H. Makino, Y. Yamagami, S. Ishikura, T. Terano, T. Oashi, K. Hashimoto, A. Sebe, G. Okazaki, K. Satomi, H. Akamatsu, and H. Shinohara, "A 45-nm Bulk CMOS Embedded SRAM With Improved Immunity Against Process and Temperature Variations," IEEE J. Solid-State Circuits, vol. 43, no. 1, pp , Jan
[6] O. Hirabayashi, A. Kawasumi, A. Suzuki, Y. Takeyama, K. Kushida, T. Sasaki, A. Katayama, G. Fukano, Y. Fujimura, T. Nakazato, Y. Shizuki, N. Kushiyama, and T. Yabe, "A Process-Variation-Tolerant Dual-Power-Supply SRAM With Cell in 40 nm CMOS Using Level-Programmable Wordline Driver," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2009, pp
[7] K. Takeda, Y. Hagihara, Y. Aimoto, M. Nomura, Y. Nakazawa, T. Ishii, and H. Kobatake, "A Read-static-noise-margin-free SRAM Cell for Low-VDD and High-speed Applications," IEEE J. Solid-State Circuits, vol. 41, no. 1, pp , Jan
[8] T.-H. Kim, J. Liu, and C. H. Kim, "A Voltage Scalable 0.26 V, 64 kb 8T SRAM with Vmin Lowering Techniques and Deep Sleep Mode," IEEE J. Solid-State Circuits, vol. 44, no. 6, pp ,
[9] M.-F. Chang, S.-W. Chang, P.-W. Chou, and W.-C. Wu, "A 130 mV SRAM with Expanded Write and Read Margins for Subthreshold Applications," IEEE J. Solid-State Circuits, vol. 46, no. 2, pp , Feb
[10] Y.-W. Chiu, Y.-H. Hu, M.-H. Tu, J.-K. Zhao, Y.-H. Chu, S.-J. Jou, and C.-T. Chuang, "40 nm Bit-Interleaving 12T Subthreshold SRAM With Data-Aware Write-Assist," IEEE Trans. Circuits Syst. I, vol. 61, no. 9, pp , Sep
[11] M. K. Qureshi and Z. Chishti, "Operating Secded-based Caches at Ultralow Voltage with Flair," in Proc rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2013, pp
[12] A. Ansari, S. Feng, S. Gupta, and S. A. Mahlke, "Archipelago: A Polymorphic Cache Design for Enabling Robust Near-threshold Operation," in Proc. IEEE Symp. on High Performance Computer Architecture (HPCA), 2011, pp
[13] S. Ganapathy, G. Karakonstantis, A. Teman, and A. Burg, "Mitigating the Impact of Faults in Unreliable Memories for Error-Resilient Applications," in Proc. Design Automation Conf. (DAC), 2015, pp
[14] A. Ferreron, D. S. Gracia, J. Alastruey-Benedé, T. Monreal-Arnal, and P. E. Ibáñez, "Concertina: Squeezing in Cache Content to Operate at Near-Threshold Voltage," IEEE Trans. on Computers, vol. 65, no. 3, Mar
[15] N. Gong, S. Jiang, A. Challapalli, M. Panesar, and R. Sridhar, "Variation-and-Aging Aware Low Power embedded SRAM for Multimedia Applications," in Proc. 25th IEEE International SoC Conference (SoCC'12), 2012, pp
[16] S. Zhou, S. Katariya, H. Ghasemi, S. Draper, and N. S. Kim, "Minimizing Total Area of Low-Voltage SRAM Arrays Through Joint Optimization of Cell Size, Redundancy, and ECC," in Proc. Int. Conf. on Computer Design (ICCD), 2010, pp
[17] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," in Proc. Association for Computing Machinery's Special Interest Group on Management of Data (ACM SIGMOD) Conf., Washington DC, pp , May
[18] Weka 3: Data Mining Software in Java, [Online]. Available:
[19] N. Gong, S. Jiang, A. Challapalli, S. Fernandes, and R. Sridhar, "Ultra-Low Voltage Split-data-aware Embedded SRAM for Mobile Video Applications," IEEE Trans. on Circuits and Systems II, vol. 59, no. 12, pp , Dec
[20] H. Noguchi, Y. Iguchi, H. Fujiwara, Y. Morita, K. Nii, H. Kawaguchi, and M. Yoshimoto, "A 10T Non-precharge Two-port SRAM for 74% Power Reduction in Video Processing," in Proc. IEEE Computer Society Annual Symp. VLSI Circuits, Mar. 2007, pp
[21] A. R. Alameldeen, I. Wagner, Z. Chishti, W. Wu, C. Wilkerson, S. L. Lu, "Energy-Efficient Cache Design Using Variable-Strength Error-Correcting Codes," in Proc. ISCA, 2011, pp
[22] J. Chang, M. Huang, J. Shoemaker, J. Benoit, S.-L. Chen, W. Chen, S. Chiu, R. Ganesan, G. Leong, V. Lukka, S. Rusu, and D. Srivastava, "The 65-nm 16-MB shared on-die L3 cache for the dual-core Intel Xeon Processor 7100 Series," IEEE J. Solid-State Circuits, vol. 42, no. 4, pp , Apr
[23] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Trans. on Image Processing, vol. 13, no. 4, pp , Apr
[24] J. Kwon, I. Lee, and J. Park, "Heterogeneous SRAM Cell Sizing for Low Power H.264 Applications," IEEE Trans. on Circuits and Systems I, vol. 99, no. 2, pp. 1-10, Feb
[25] D. Zhou, S. Wang, H. Sun, J. Zhou, J. Zhu, Y. Zhao, J. Zhou, S. Zhang, and S. Goto, "A 4Gpixel/s 8/10b H.265/HEVC Video Decoder Chip for 8K Ultra HD Applications," in Proc. Int. Solid-State Circuits Conf. (ISSCC), Feb. 2016, San Francisco, CA, pp
[26] S. A. Pourbakhsh, X. Chen, D. Chen, X. Wang, N. Gong, and J. Wang, "Sizing-Priority Based Low-Power Embedded Memory for Mobile Video Applications," in Proc. International Symposium on Quality Electronic Design (ISQED), 2016, Santa Clara, CA, pp
[27] YUV Video Sequences, [Online]. Available:
[28] Xiph.org Video Test Media [derf's collection], [Online]. Available:
[29] F. Frustaci, D. Blaauw, D. Sylvester and M. Alioto, "Better-than-voltage scaling energy reduction in approximate SRAMs via bit dropping and bit reuse," in Power and Timing Modeling, Optimization and Simulation (PATMOS), th International Workshop on, Salvador, 2015, pp
[30] N. Gong, J. Edstrom, D. Chen, and J. Wang, "Data-Pattern Enabled Self-Recovery Multimedia Storage System for Near-Threshold Computing," in Proc. International Conference on Computer Design (ICCD), 2016, Phoenix, Arizona, accepted.
[31] 45-nm FreePDK. [Online]. Available:
[32] F. Sampaio, M. Shafique, B. Zatt, S. Bampi, and J. Henkel, "Energy-Efficient Architecture for Advanced Video Memory," in
Proc. 2014 IEEE/ACM International Conf. on Computer-Aided Design (ICCAD), Nov. 2014, pp. 132-139.
[33] S. Venkataramani, S. T. Chakradhar, K. Roy, and A. Raghunathan, "Approximate computing and the quest for computing efficiency," in Proc. the 52nd Annual Design Automation Conf. (DAC'15), Jun. 2015, pp
[34] N. Gong, J. Wang, S. Jiang, and R. Sridhar, "TM-RF: Aging Aware Power Efficient Register File Design for Modern Microprocessors," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 7, pp. 1196-1209, Jul. 2015.
[35] Z. Chishti, A. R. Alameldeen, C. Wilkerson, W. Wu, and S.-L. Lu, "Improving cache lifetime reliability at ultra-low voltages," in Proc. 42nd IEEE/ACM Int. Symp. Microarchit., 2009, pp

Jonathon Edstrom received the B.S. degree in computer engineering and the M.S. degree in electrical and computer engineering from North Dakota State University, Fargo, ND, in 2015 and 2017, respectively. Currently, he is pursuing his Ph.D. degree in electrical and computer engineering at North Dakota State University. His research focuses on data-driven intelligent energy-efficient hardware design.

Dongliang Chen received the B.S. degree in electrical engineering from Dalian University of Technology (DUT), Dalian, China, in 2010. Currently, he is pursuing his Ph.D. degree in electrical and computer engineering at North Dakota State University, Fargo, ND. His research focuses on data-driven power-efficient mobile computing.

Yifu Gong received the B.S. degree in electrical engineering at North Dakota State University. He is currently working towards his Ph.D. degree in electrical and computer engineering at North Dakota State University, Fargo, ND. His research focuses on low-power embedded vision systems.

Jinhui Wang (M'13) received the B.E. degree in electrical engineering from Hebei University, Hebei, China, in 2004, and the Ph.D. degree in electrical engineering through a joint USA/China program between University of Rochester and Beijing University of Technology. Dr. Wang is currently an Assistant Professor with the Department of Electrical and Computer Engineering at North Dakota State University, Fargo, ND, USA. His research interests include low-power, high-performance, and variation-tolerant integrated circuit design, 3-D IC and EDA methodologies, and thermal solutions in VLSI. He has over 100 publications and 20 patents in emerging semiconductor technologies.

Na Gong (M'13) received the B.E. degree in electrical engineering and the M.E. degree in microelectronics from Hebei University, Hebei, China, and the Ph.D. degree in computer science and engineering from the State University of New York, Buffalo, in 2004, 2007, and 2013, respectively. Currently, Dr. Gong is an Assistant Professor of Electrical and Computer Engineering at North Dakota State University, Fargo, ND, USA. Her research interests include data-driven energy-efficient VLSI circuits and systems and viewer-aware mobile systems, with an emphasis on memories.
