Data-Pattern Enabled Self-Recovery Low-Power Storage System for Big Video Data

IEEE TRANSACTIONS ON BIG DATA, UNDER REVIEW

Jonathon Edstrom, Dongliang Chen, Yifu Gong, Jinhui Wang, Member, IEEE, and Na Gong, Member, IEEE

Abstract

The growing popularity of powerful mobile devices such as smartphones and tablets has driven exponential growth in the demand for video applications. However, because of the large video data size and intensive computation, mobile video applications require frequent embedded memory access, which consumes a large amount of power and limits battery life. In this paper, we present a low-cost self-recovery video storage system that exploits meaningful data patterns hidden in big video data by introducing data mining techniques into the hardware design process. We propose a two-dimensional data-pattern approach that explores horizontal data-association and vertical data-correlation characteristics. Such data relationship discovery and pattern identification open a new dimension in the hardware design space and give memories the ability to self-recover in the presence of bitcell failures. Based on the identified optimal data patterns, we present a low-cost and efficient SRAM design that enables data self-recovery at low voltages. A 45 nm 32 kb SRAM is implemented that delivers good video quality at near-threshold voltage (0.5 V) with low area overhead (7.94%).

Index Terms: videos; data mining; data pattern; low-power; self-recovery; on-chip memory

1 INTRODUCTION

Information has driven the remarkable evolution of human society. According to market research, by 2020 the amount of data that is created, replicated, and consumed will be as large as 40 ZB (zettabytes; 1 ZB = 10^21 B) [1, 2], and more than half of the data traffic will be video data [3]. Traditional plain TV sets are losing ground to hybrid TVs, PCs, game consoles, and, more recently, mobile devices such as tablets and smartphones.
In this new, mobile, big-video age, short battery life remains one of the biggest contributors to user dissatisfaction [3]. In particular, because of their intensive computation and large data size, video applications demand ever more storage space. Embedded static random-access memory (SRAM) occupies over 65% of the mobile video decoder chip area and is also a major contributor to mobile battery consumption (>92% of the motion compensation energy [32]). This situation is expected to worsen with the next-generation mobile video format, H.265/HEVC, which has 2x-3x higher memory demand than H.264 [32].

Voltage scaling techniques have been widely applied to reduce the power consumption of memory systems. Researchers have shown that SRAM achieves maximum efficiency at near-threshold voltage [14]. However, as voltage scales down, SRAMs become susceptible to failure due to significant process variation. Various techniques have been developed to correct or eliminate these memory failures as voltage is scaled. Traditional low-power memory techniques can be divided into three general categories: (i) assist schemes, such as adjustment of cell voltage [5] and boosted wordline voltage [6]; (ii) large bitcells, such as the upsized 6T cell [24], asymmetric 7T cell [7], single-ended read-decoupled 8T cell [8], read-disturb-free 9T cell [9], and bit-interleaving 12T cell [10]; and (iii) error correction techniques, spanning from error correction codes (ECC) [11] to data remapping [12]. Unfortunately, almost all existing solutions add considerable implementation overhead to the original memory design. For example, the area overhead can be as high as 50-100%. Such large overhead leads to increased layout area, higher design complexity, and reduced performance of the entire system.

Recently, a new branch of low-voltage embedded memory techniques has been developed that embraces memory faults instead of avoiding them (assist techniques or cells with more than 6 transistors) or correcting them (e.g., ECC). These techniques aim to mitigate the impact of memory faults by minimizing the magnitude of the error caused by a faulty cell, based on memory fault positions determined by run-time testing (e.g., built-in self-test (BIST)). We refer to these techniques as fault-position aware mitigation techniques. For example, Ganapathy et al. [13] developed a shifting technique that always stores the least-significant bits (LSBs) in the faulty cells, which may lead to a tolerable output quality. Ferreron et al. [14] presented a squeezing technique that compresses zeros and stores them in less memory space, thereby avoiding the memory failures present at low voltage. However, because they rely on predetermined memory fault positions, the existing techniques still involve complex operations (e.g., shifting value calculation and storage), and the overhead incurred is still significant (e.g., 65% in [13]).

The authors are with the Department of Electrical and Computer Engineering, North Dakota State University, ND 58108. E-mail: {jonathon.edstrom, dongliang.chen, yifu.gong, jinhui.wang.1, na.gong}@ndsu.edu.
Manuscript received 14 Oct. 2016.

To address the storage challenge of videos as well as other big data, we propose to leverage the assets of big data to extract useful knowledge and actionable information for hardware design. Recently, it has been observed that today's big-data applications, including videos, have three common data characteristics [33]: (i) redundant inputs, (ii) multiple acceptable outputs, and (iii) statistical computations. These intrinsic characteristics provide substantial opportunities for data relationship discovery and pattern identification, which open a new dimension in the hardware design space and create opportunities for multi-dimensional innovation in circuits and systems, as illustrated in Fig. 1.

Fig. 1. Big-data enabled intelligent efficient hardware. (The figure contrasts the traditional isolated hardware design process with big-data enabled data knowledge, linked by a positive feedback loop: big-data enabled better hardware will support big-data applications better.)

Specifically, in this paper, we present a novel Data Pattern enabled Self-Recovery video SRAM (DPSR) that achieves efficient near-threshold voltage computing while delivering good video output quality. By introducing advanced data mining techniques, we investigate meaningful data patterns hidden in video data and use them to enable self-recovery in SRAM. We propose a two-dimensional (2D) data pattern approach that explores horizontal data-association and vertical data-correlation characteristics to determine the optimal data patterns for self-recovery. Based on this, we develop an efficient SRAM design technique to implement DPSR with low area overhead (7.94%) and performance penalty. Earlier in [30], we presented a basic DPSR design storing only chroma data associations, including some preliminary results.
We extend our original work with the following additional contributions:

• We investigate data associations between bits of luma data in various videos to enable additional power savings by implementing the design across both luma and chroma data in the video memory (Sections 3.1, 3.2 and 3.3).

• We propose a new hardware design that realizes near-threshold voltage storage for the luma data based on the discovered optimal data patterns. We analyze different hardware bit prediction schemes and implement the optimal wordline architecture for the highest bit prediction percentage. Since there is twice as much luma data as chroma data in typical mobile videos, our additions allow for triple the power savings as compared to our previous design [30] (Sections 4 and 5.2).

• To analyze the quality of video output, we add a new structural similarity (SSIM) metric proposed in [23], which is aware of the user's perception by including calculations for luminance, contrast, and structural changes in the video (Section 4.3). Also, to verify the power efficiency of the proposed technique, we develop memory power consumption models for both active and leakage power consumption and perform a detailed analysis (Section 5.3).

• Most importantly, in order to emulate the use of mobile devices in the environment of big data, we expand the video data to larger-scale, real videos using the recently released YouTube-8M dataset [31]. Specifically, 10,000 unique YouTube-8M videos, with 57.6 GB total data size, representing 500,000 individual frames, have been analyzed using data mining techniques to identify the general data patterns existing in various videos (Sections 3.1 and 3.2). Additionally, 25 videos from YouTube-8M, separate from the 10,000 videos used in the training dataset, are used to verify the correct prediction percentage and video output qualities (Sections 3.3 and 5.4).

It should be emphasized that the biggest challenge in achieving data-enabled hardware is that it is difficult for hardware designers to directly observe the inherent data behaviors in a large volume of video data. To realize the proposed data-pattern enabled efficient video storage, hardware designers require a deep understanding and systematic study of the inherent data relationships in massive video data. This cannot be solved by traditional data techniques, because of the increased complexity and the growing amount of video data. Also, the larger the data size, the more general the data patterns that can be identified and the more power saving opportunities that can be enabled. Accordingly, big data may be the only effective way to realize the full potential of the proposed intelligent hardware. Note that the data pattern identification process is conducted at design time (off-line), thereby avoiding the run-time performance overhead caused by big data algorithms.

The rest of the paper is organized as follows. Section 2 presents SRAM failure at near-threshold voltage. In Section 3, the data-mining enabled mobile video data patterns are analyzed for self-recovery. In Section 4, we present DPSR. The evaluation results are provided in Section 5. Finally, the conclusion is drawn in Section 6.

2 EMBEDDED MEMORY FAILURE ANALYSIS AT NEAR-THRESHOLD VOLTAGE

It has been shown that computing efficiency is maximized when a circuit operates at near-threshold voltage [14]. However, at 0.5 V (our target near-threshold voltage), SRAM failures become more severe with increasing process variation. In particular, the random dopant fluctuation (RDF) effect leads to threshold voltage (Vth) variation and SRAM cell failures [15]. For current manufacturing technologies, the failure probability of an SRAM cell (Pfail) typically ranges between 10^-3 and 10^-2, depending on the bitcell area [14, 16]. The minimum-sized SRAM has the highest failure rate of 10^-2, and larger bitcells have a lower failure probability. With 58% area overhead, the failure rate can be reduced from 10^-2 to 10^-3 [16]. In our analysis, we consider both the 10^-2 (minimum-sized SRAM) and 10^-3 (upsized SRAM) conditions. It should be noted that the failure rate can be further optimized using a recently developed priority-based sizing technique [26].

To further study the SRAM failure characteristics at low voltage, we investigated error maps for a 512 word × 64 bit SRAM with Pfail equal to 10^-2 and 10^-3. During the fault injection process, we assumed that the failed bits are located across the memory cells, based on the failure probabilities, according to a uniform distribution, introducing embedded memory failures to the decoding process. The use of a uniform distribution for the errors is confirmed by the memory failure measurements in [29]. The results are shown in Fig. 2. SRAM faults are uniformly distributed in the array.

Fig. 2. Error maps in the SRAM array at 0.5 V. (a) Failure rate is 10^-3 (0.001) and (b) failure rate is 10^-2 (0.01). Each dot on the maps illustrates a bitcell failure location, with row number (y axis) and column number (x axis) in the SRAM array.

We also analyzed the probability of different numbers of faults in the same wordline (32-bit word), and the results are listed in Table 1. It can be seen that a wordline has a low number of faulty cells. The probability of two faults existing in the same wordline is only 3.6% when the SRAM bitcell failure rate is 10^-2. Accordingly, in the presence of a memory fault, SRAM may achieve self-recovery based on other bits in the same wordline if meaningful bit-level data patterns exist.

TABLE 1
FAULT PROBABILITY IN A 32-BIT SRAM WORD

Number of faults per wordline    SRAM failure rate: 10^-3 (0.001)    SRAM failure rate: 10^-2 (0.01)
0                                96.8523477%                         72.7279953%
1                                3.0992274%                          23.2812509%
2                                0.0479198%                          3.6012385%
3                                0.0005023%                          0.3611914%
4                                0.0000028%                          0.0267011%
5                                0%                                  0.0015432%
6                                0%                                  0.0000756%
7                                0%                                  0.000004%
*Calculations based on Monte Carlo simulation (10^9 trials)

3 DATA PATTERN INVESTIGATION FOR SELF-RECOVERY

This section presents our data-mining methodology to discover data patterns hidden in video data to enable reliable self-recovery.
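As a cross-check on Table 1 in Section 2, the number of faulty cells in a word can also be estimated in closed form. The Python sketch below assumes that every bitcell fails independently with probability Pfail, so the fault count in a 32-bit word follows a binomial distribution. This independence assumption and the function name are introduced here for illustration only; the values it produces are close to, but not identical with, the Monte Carlo figures reported in Table 1.

```python
from math import comb


def fault_count_probability(k: int, p_fail: float, word_bits: int = 32) -> float:
    """Probability that exactly k of word_bits cells fail.

    Assumes independent cell failures (binomial model); this is an
    illustrative assumption, not the Monte Carlo procedure of Table 1.
    """
    return comb(word_bits, k) * p_fail ** k * (1.0 - p_fail) ** (word_bits - k)


if __name__ == "__main__":
    for p_fail in (1e-3, 1e-2):
        print(f"Pfail = {p_fail}")
        for k in range(4):
            prob = fault_count_probability(k, p_fail)
            print(f"  {k} faults per word: {100 * prob:.4f}%")
```

Under this model, fault-free words dominate at both failure rates and the probability drops quickly with each additional fault, which is the property the self-recovery scheme relies on.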
Specifically, we propose a new two-dimensional (2D) data pattern approach that explores horizontal data-association and vertical data-correlation characteristics, thereby achieving optimal data patterns.

3.1 Association Rule Mining Enabled Horizontal Association

Today's mobile video frames are typically stored and processed in YUV format. The YUV format includes one luma (Y) component, which contains the brightness information of the image, and two chroma components, which contain the blue-difference (Cb) and red-difference (Cr) color components. Fig. 3 shows a typical frame of video data stored in embedded memory, using a 352 × 288 resolution YUV 4:2:0 video as an example. As shown, each pixel has 8-bit luma data and 8-bit subsampled chroma data. Since video data is stored in on-chip memory as binary bits, we utilize an association data mining technique to identify horizontal bit-level data patterns.

Fig. 3. 2D data-pattern enabled self-correction and data pattern analysis dataset. (The figure shows a 16x16-pixel block of a 4:2:0 YUV video frame with luma (Y) data at 8 bits/pixel and chroma (Cb, Cr) data at 8 bits per 4 pixels each, bits numbered 1 (MSB) to 8 (LSB). The horizontal data pattern is obtained by association rule mining over a dataset of transactions whose items take values in {0, 1}; the vertical data pattern relates the same bit position in neighboring samples.)

Data-pattern analysis dataset (http://trace.eas.asu.edu/yuv/):

Dataset            No. of bits (no. of frames)
Akiyo              364,953,600 (300)
Coastguard         364,953,600 (300)
Container          364,953,600 (300)
Flower             304,128,000 (250)
Foreman            304,128,000 (250)
Hall               304,128,000 (250)
Mobile             304,128,000 (250)
Mother-Daughter    304,128,000 (250)
News               304,128,000 (250)
Silent             304,128,000 (250)
Tempete            316,293,120 (260)
Waterfall          316,293,120 (260)
Total              4,221,296,640 bits (3470 frames)

Association rule mining was introduced in 1993 to discover relationships between different variables, called items, in a dataset or database [17]. A complete dataset is made up of many transactions, where each transaction contains a set of items. Each item can be associated with a binary attribute, 0 or 1, that indicates whether the item is present in the corresponding transaction. This type of data organization is illustrated in Fig. 3. Each resulting rule generated by the association rule mining process is an implication of the form X → Y, where X and Y are disjoint sets of items or individual items. Each rule is also accompanied by statistics collected from the dataset, called support and confidence values. The support value for a set of items is the proportion of transactions in the dataset that contain that set of items. The confidence value for an association rule X → Y is the proportion of transactions containing X that also contain Y, that is, the conditional probability P(Y | X).

To enable association data mining, we first use 12 different video benchmarks to build a dataset [27, 28]. In total, the video data size is 4,221,296,640 bits from 3470 frames, as shown in Fig. 3. Each video data bit is defined as an individual item, and we used Weka [18] to run the well-known association rule mining algorithm, Apriori, on our large video dataset. Table 2 lists the horizontal data patterns we obtained from the video benchmarks.

We further expand the video data to larger-scale, real video datasets in order to emulate the use of mobile devices in the environment of big data. We use Google's recently released YouTube-8M dataset [31], which is the largest video dataset to date. Specifically, 10,000 unique videos from the YouTube-8M dataset, with 57.6 GB total data size, representing 500,000 individual frames, were analyzed using data-mining methods. A script was written to download these 10,000 videos from the ~7 million available URLs provided in the YouTube-8M dataset. After each video was downloaded, 50 contiguous frames were randomly selected from the video and converted from the MP4 format to the raw YUV format using the FFmpeg decoder [32] for data-mining analysis. To support large-scale video data processing, our experiments were performed on the Thunder cluster at the Center for Computationally Assisted Science and Technology (CCAST) of North Dakota State University, which consists of 53 compute nodes. Each node has dual-socket Intel Xeon 2670v2 Ivy Bridge processors (10 cores per socket) at 2.5 GHz with 64 GB DDR3 RAM, and all nodes are interconnected with FDR InfiniBand at a 56 Gbit/s transfer rate. As illustrated in Fig. 3, each video data bit is defined as an individual item, and the well-known association rule mining algorithm, Apriori, was used on our large video dataset to gather the confidence and support metrics. The average results obtained for the horizontal data patterns are also listed in Table 2. We can see that the association rules obtained from video benchmarks are very general and that they also exist in large-scale videos.

TABLE 2
DISCOVERED ASSOCIATION RULES

                   From 12 video benchmarks [27, 28]                  From 10,000 YouTube-8M videos [31]
Rules              Confidence    Support     Confidence × Support     Confidence    Support     Confidence × Support
Y2=1 → Y1=0        74.1636%      57.1506%    42.3849%                 76.9247%      50.5684%    38.8996%
Y3=1 → Y1=0        74.2830%      52.2311%    38.0399%                 72.1678%      51.5167%    37.1785%
Y2=0 → Y1=1        22.6535%      42.8494%    9.70688%                 41.5092%      71.1837%    29.5478%
Y1=1 → Y2=0        39.8522%      24.2221%    9.6530%                  71.1837%      41.5092%    29.5478%
Y1=1 → Y3=0        42.8576%      24.2221%    10.3810%                 61.8217%      37.0087%    22.8794%
Cb2=0 → Cb1=1      94.7547%      23.2298%    22.0113%                 98.5620%      46.5333%    45.8642%
Cb2=1 → Cb1=0      97.5574%      73.6405%    71.8417%                 98.1982%      52.6542%    51.7055%
Cb1=0 → Cb2=1      98.2838%      73.6405%    72.3766%                 99.2605%      52.6901%    52.3005%
Cb1=1 → Cb2=0      92.6464%      23.2298%    21.5215%                 99.4222%      46.5599%    46.2909%
Cr2=1 → Cr1=0      97.5149%      23.5168%    22.9323%                 95.7486%      33.6640%    32.2328%
Cr2=0 → Cr1=1      99.9963%      75.8811%    75.8782%                 99.3380%      65.3801%    64.9473%
Cr1=1 → Cr2=0      99.2164%      75.8811%    75.2864%                 99.0114%      65.4258%    64.7790%
Cr1=0 → Cr2=1      99.9881%      23.5168%    23.5140%                 99.4412%      33.7054%    33.5171%
Cr1=1 → Cr3=0      97.8025%      74.7997%    73.1559%                 95.2082%      62.6061%    59.6061%
Cr1=0 → Cr3=1      92.2269%      21.6914%    20.0053%                 97.0418%      32.6198%    31.6548%
*Bit 1 (i.e. Y1, Cb1, Cr1) is the MSB. Bit 8 (i.e. Y8, Cb8, Cr8) is the LSB.

3.2 Vertical Correlation

Vertical data correlation characteristics of multimedia applications have been studied by many researchers [19, 20]. These works show that the most-significant bits (MSBs) of video data have strong correlation with neighboring pixels and that their switching probability is very low. As listed in Table 3, for the video benchmarks, the correlation probability of the MSBs (Y1, Cb1, Cr1) in neighboring pixels is over 93%, while it is reduced to 53% for the LSB of Cb (Cb8). Similar correlation characteristics can be observed in the YouTube-8M videos. The MSBs in neighboring pixels have very strong correlation (with probability over 90%), but the LSBs are more random and have little correlation with neighboring pixels.

TABLE 3
VERTICAL CORRELATION PROBABILITY

Correlation probability from 12 video benchmarks
Rules    Probability     Rules    Probability     Rules    Probability
Y1       96.721862%      Cb1      93.786775%      Cr1      93.775505%
Y2       93.218747%      Cb2      92.865158%      Cr2      93.584600%
Y3       87.786258%      Cb3      90.774607%      Cr3      92.335457%
Y4       80.856551%      Cb4      85.450795%      Cr4      88.349737%
Y5       73.971623%      Cb5      77.947842%      Cr5      81.559365%
Y6       67.416638%      Cb6      69.304415%      Cr6      73.180250%
Y7       61.777633%      Cb7      59.986183%      Cr7      63.496048%
Y8       58.520118%      Cb8      53.245386%      Cr8      55.157592%

Correlation probability from 10,000 YouTube-8M videos
Rules    Probability     Rules    Probability     Rules    Probability
Y1       91.5541%        Cb1      90.1024%        Cr1      90.4118%
Y2       84.6401%        Cb2      89.8718%        Cr2      90.1226%
Y3       77.2344%        Cb3      88.6654%        Cr3      88.7543%
Y4       68.8737%        Cb4      84.8641%        Cr4      85.1078%
Y5       60.0483%        Cb5      78.1086%        Cr5      78.7188%
Y6       51.4871%        Cb6      68.8737%        Cr6      70.1113%
Y7       44.1374%        Cb7      58.8108%        Cr7      60.5647%
Y8       38.7802%        Cb8      50.8419%        Cr8      52.6225%

TABLE 4
OPTIMAL LUMA DATA PATTERNS FROM 25 YOUTUBE-8M VIDEOS, SEPARATE FROM THE 10,000 VIDEOS USED IN THE TRAINING DATASET

Y bits    Optimal Data Patterns    Correct Prediction (%)
Y1        (Y1 previous)            91.5290
Y2        (Y2 previous)            82.6719
Y3        (Y3 previous)            76.2655
Y4        (Y4 previous)            67.6406
Y5        (Y5 previous)            59.2428
Y6        (Y6 previous)            51.7514
Y7        (Y7 previous)            44.4694
Y8        (Y8 previous)            38.4120

Previous works have exploited this correlation to save power: through bit prediction, where the absence of transistor switching results in power savings [4], and by attempting to load the same value (reading continuous 0s or 1s) from a memory bitcell in order to eliminate the cost of precharging when the correct value is read out from the previous bitline read [19]. This work uses the correlation property of YUV data in a novel bit correction technique that attempts to correct memory faults with high precision.
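To make the two kinds of statistics used in this section concrete, the Python sketch below computes them by direct counting on a short list of 8-bit samples. It is a simplified illustration, not the Weka/Apriori flow used for the reported results: the function names and the sample values are hypothetical, and support is computed for the rule's antecedent, which is an assumption about how the Support column of Table 2 is tabulated.

```python
from typing import Sequence, Tuple


def bit(value: int, n: int) -> int:
    """Return bit n of an 8-bit sample, where n = 1 is the MSB and n = 8 the LSB."""
    return (value >> (8 - n)) & 1


def rule_stats(samples: Sequence[int], x_bit: int, x_val: int,
               y_bit: int, y_val: int) -> Tuple[float, float]:
    """Antecedent support P(X) and confidence P(Y | X) of the rule
    (bit x_bit == x_val) -> (bit y_bit == y_val), by direct counting."""
    n_x = sum(1 for s in samples if bit(s, x_bit) == x_val)
    n_xy = sum(1 for s in samples
               if bit(s, x_bit) == x_val and bit(s, y_bit) == y_val)
    support = n_x / len(samples)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


def vertical_correlation(samples: Sequence[int], n: int) -> float:
    """Fraction of neighboring sample pairs whose bit n is identical."""
    pairs = list(zip(samples, samples[1:]))
    same = sum(1 for prev, cur in pairs if bit(prev, n) == bit(cur, n))
    return same / len(pairs)


if __name__ == "__main__":
    chroma = [0x80, 0x7F, 0x81, 0x7E, 0xC0, 0x40]  # made-up 8-bit samples
    print(rule_stats(chroma, 1, 0, 2, 1))          # rule: bit1=0 -> bit2=1
    print(vertical_correlation([0x80, 0x81, 0x83, 0x7F, 0x7E], 1))
```

On real data, the sample list would hold the Y, Cb, or Cr values of one frame in storage order, and the same counting would be repeated for every bit pair and bit position of interest.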
By comparing the correlation percentages and the association rules, we can identify the best combination of association rules and correlation between bits to construct an optimal pattern for data self-recovery.

3.3 Optimal Data Patterns for Self-Recovery

In order to select an optimal data pattern from association and correlation, we define the Weighted Confidence based on the support and confidence of a particular rule as follows:

$$\text{Weighted Confidence} = \text{Confidence}(\text{Rule}) \times \text{Support}(\text{Rule}) + \text{Confidence}(\text{Complement Rule}) \times \text{Support}(\text{Complement Rule}) \tag{1}$$

For example, the Weighted Confidence of the association rule Cr1 → Cr2 can be expressed as

$$\begin{aligned}
\text{Weighted Confidence of } Cr1 \rightarrow Cr2 &= \text{Confidence}(Cr1{=}0 \rightarrow Cr2{=}1) \times \text{Support}(Cr1{=}0 \rightarrow Cr2{=}1) \\
&\quad + \text{Confidence}(Cr1{=}1 \rightarrow Cr2{=}0) \times \text{Support}(Cr1{=}1 \rightarrow Cr2{=}0) \\
&= 0.999881 \times 0.235168 + 0.992164 \times 0.758811 \\
&= 0.988004972
\end{aligned} \tag{2}$$

We then compare this parameter with the sum of the correlation values for 0 and 1 non-switching, which is equal to the Correlation. This is equivalent to the Weighted Confidence calculation but uses the correlation percentages of the individual bit values (0 or 1), and it is calculated as follows:

$$\text{Correlation} = \text{Confidence}(Bit_{previous}{=}0 \rightarrow Bit_{current}{=}0) + \text{Confidence}(Bit_{previous}{=}1 \rightarrow Bit_{current}{=}1) \tag{3}$$

where Bit_previous and Bit_current represent the video data bits in the same position of two neighboring pixels. As an example, the Correlation of Cr2 is

$$\begin{aligned}
\text{Correlation of } Cr2 &= \text{Confidence}(Cr2_{previous}{=}0 \rightarrow Cr2_{current}{=}0) + \text{Confidence}(Cr2_{previous}{=}1 \rightarrow Cr2_{current}{=}1) \\
&= 0.20908658 + 0.72675942 = 0.935846
\end{aligned} \tag{4}$$

Accordingly, we obtain the optimal bit-level data patterns with a high prediction rate to enable self-recovery, as listed in Table 4 for luma (Y) and Table 5 for chroma (Cr and Cb). 25 videos from YouTube-8M, separate from the 10,000 videos used in the training dataset, are used to verify the correct prediction percentages shown in Table 4 and Table 5. These videos are obtained using the same method as before, but are distinct from the previous 10,000 videos to ensure that our technique works properly for correction. Our analysis shows that luma data is more random and has less association with other bits in the same pixel, and the optimal luma data patterns are all from correlation.

TABLE 5
OPTIMAL CHROMA DATA PATTERNS FROM 25 YOUTUBE-8M VIDEOS, SEPARATE FROM 10,000 VIDEOS USED BEFORE

Cb bits    Optimal Data Patterns    Correct Prediction (%)    Cr bits    Optimal Data Patterns    Correct Prediction (%)
Cb1        Cb2 → Cb1                98.5965                   Cr1        Cr2 → Cr1                96.7237
Cb2        Cb1 → Cb2                99.7935                   Cr2        Cr1 → Cr2                97.7735
Cb3        (Cb3previous)            88.4593                   Cr3        Cr1 → Cr3                93.8576
Cb4        (Cb4previous)            84.3113                   Cr4        (Cr4previous)            83.6360
Cb5        (Cb5previous)            78.5307                   Cr5        (Cr5previous)            78.3486
Cb6        (Cb6previous)            69.3991                   Cr6        (Cr6previous)            68.8025
Cb7        (Cb7previous)            59.3976                   Cr7        (Cr7previous)            59.7336
Cb8        (Cb8previous)            51.1264                   Cr8        (Cr8previous)            52.9571

3.4 Recovery Failure Caused by Double Faults in Data Pattern

Since the discovered optimal data patterns used for self-recovery exist between two bits in the same wordline, recovery may fail if both bits in a pattern fail simultaneously. Table 6 lists the recovery failure rate. It shows that DPSR has good reliability, with an extremely low recovery failure rate (less than 0.2%). This is due to the fact that there is a low probability of having multiple faults in the same wordline, as discussed in Section 2.

TABLE 6
DPSR RECOVERY FAILURE RATE

Double Faults            SRAM Pfail: 10^-3 (0.001)    SRAM Pfail: 10^-2 (0.01)
Correction Faults        0.0010899%                   0.1077362%
Faults                   0.0005957%                   0.0587964%
DPSR Recovery Failure    0.0016856%                   0.1665326%

4 DPSR HARDWARE IMPLEMENTATION

Utilizing the obtained optimal bit-level data patterns, we present a simple but efficient DPSR hardware design with low implementation cost. Fig. 4 shows the array architecture of the proposed DPSR, where the total array size is 32 kbits and there are four blocks of 256 words × 32 bits. In this design, both luma data and chroma data are stored in the same SRAM but in different blocks. Blocks 1 and 2 store the luma data, and each wordline stores the luma data of 4 pixels. Blocks 3 and 4 store the chroma data, and each wordline stores the chroma data of 2 pixels.

For the luma data stored in blocks 1 and 2, the optimal luma patterns obtained in Section 3 are vertical correlation rules; that is, the luma data of the previous pixel is used to recover the data of the current pixel.
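As a behavioral illustration of this luma recovery idea, the Python sketch below packs four 8-bit luma samples into one 32-bit word and repairs bits flagged as faulty by copying the same bit position from a neighboring pixel stored in that word. It models the data movement only, not the circuit. The function names, the bit ordering (pixel 1 in bits 31 to 24), and the choice of neighbor for pixels 2 and 3 are assumptions made for this example, and the fault positions are taken as already known.

```python
from typing import Iterable, Sequence

# Index of the pixel whose bit is copied when a pixel's bit is faulty.
# Pixel 1 <-> pixel 2 and pixel 3 <-> pixel 4 (0-based indices below);
# the direction used for pixels 2 and 3 is an assumption of this sketch.
NEIGHBOR = {0: 1, 1: 0, 2: 3, 3: 2}


def pack_luma_word(pixels: Sequence[int]) -> int:
    """Pack four 8-bit luma samples into a 32-bit word, pixel 1 in bits 31..24."""
    if len(pixels) != 4:
        raise ValueError("expected exactly four luma samples")
    word = 0
    for p in pixels:
        word = (word << 8) | (p & 0xFF)
    return word


def recover_luma_word(word: int, faulty_bits: Iterable[int]) -> int:
    """Overwrite each faulty bit (positions 0..31) with the bit at the same
    position inside the neighboring pixel of the same word."""
    for pos in faulty_bits:
        pixel = 3 - pos // 8              # 0-based pixel index holding this bit
        offset = pos % 8                  # bit position inside the pixel
        src = (3 - NEIGHBOR[pixel]) * 8 + offset
        src_bit = (word >> src) & 1
        word = (word & ~(1 << pos)) | (src_bit << pos)
    return word


if __name__ == "__main__":
    good = pack_luma_word([200, 198, 50, 52])
    read_back = good ^ (1 << 31)          # MSB of pixel 1 read incorrectly
    fixed = recover_luma_word(read_back, [31])
    print(hex(good), hex(read_back), hex(fixed))
```

Because neighboring luma samples usually share their most-significant bits, the copied bit is correct most of the time for high-order positions and much less often for low-order positions, in line with the prediction percentages of Table 4.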
Since SRAM reads are row-wise, reading two physical rows would cause a considerable performance penalty. Accordingly, we adapt the vertical correlation based luma self-recovery to a hardware-friendly design scheme. Each wordline stores the luma data of 4 pixels, and we use the neighboring pixel in the same row for data correction in the current pixel. For example, if a data bit in pixel 1 fails (see Fig. 4), we use the corresponding bit in pixel 2 for recovery; if a bit in pixel 4 fails, we use the corresponding bit in pixel 3 (which is the neighboring pixel in the same row) for recovery. To verify that our design maximizes the correct bit predictions, we calculated the correct prediction percentage for predicting each bit using both the previous and the next pixel's corresponding bit. Our analysis shows that they were approximately equal values (both ~79.4% average correct prediction) based on calculations from all samples in our training and verification testing video benchmarks [27, 28].

Chroma data self-recovery is implemented in SRAM blocks 3 and 4 using the optimal chroma patterns. As shown in Fig. 4, each wordline stores two pixels of chroma data, with both Cr and Cb. Both vertical correlation rules and horizontal data pattern rules are used for the chroma data self-recovery (see Table 5). Similarly, for recovery based on vertical correlation rules, we use the neighboring pixel stored in the same row for data correction, thereby avoiding a performance penalty. For example, if Cb1 in Pixel 1 fails (see Fig. 4), the inverted value of Cb2 in the same pixel is used for recovery; if Cr4 in Pixel 1 fails, Cr4 in Pixel 2, stored in the same row, is used for recovery.

As shown in Fig. 4, a hierarchical readout bitline (RBL) scheme (local RBL and global RBL) is applied to reduce the access time of the memory. The self-recovery logic of DPSR can be implemented simply by connecting multiplexers (MUX) to the readout bitlines of a conventional SRAM. Each global bitline (gbl) is connected to a MUX that is controlled by the received fault positions. If a fault is indicated, self-recovery is enabled by selecting the data pattern. The fault position information is used as the select signal of the MUX to control which bit is output. Similar to other existing fault-position aware mitigation techniques, DPSR receives pre-determined locations of the faulty bits, a determination that is usually executed either during post-fabrication testing or during power-on self-test (POST) [14, 21, 22]. Such a testing process can also be used to track memory failures caused by temporal degradation, such as aging effects. The evaluation results in the following sections show that DPSR also achieves smaller silicon area overhead, while delivering good output quality at near-threshold voltage.

Fig. 4. Proposed DPSR with data self-recovery ability. (The figure shows four 256 × 32 SRAM blocks with read decoders, write drivers, and the self-recovery MUX and readout. Each block consists of eight 32 × 32 sub-arrays whose local bitlines lbl[31:0] are precharged and merged onto global bitlines gbl[31:0]; write bitlines are wbl[31:0] and wblb[31:0]. In blocks 1 and 2, a word holds the luma data of pixels 1 to 4 in bits 31-24, 23-16, 15-8, and 7-0. In blocks 3 and 4, a word holds Cr and Cb of pixel 1 in bits 31-24 and 23-16, and Cr and Cb of pixel 2 in bits 15-8 and 7-0. The luma and chroma self-recovery MUXes select, under select signals S31 to S0, between each global bitline and an alternate line gblx.)

Fig. 5. Layout design of DPSR (with 7.94% area overhead). (Labels: read decoder, write decoder, luma MUX and chroma MUX (18.79 × 43.57 µm each), SRAM blocks 1 to 4 (32 × 256 each); overall dimensions 125.56 µm and 500.80 µm.)

5 EVALUATION METHODOLOGY AND RESULTS

To evaluate the effectiveness of the proposed technique, a 32 kb SRAM is implemented using a high-performance 45 nm FreePDK CMOS process [31] to meet the multi-megahertz performance requirement of today's mobile video decoders.

5.1 Performance

We first evaluate the performance of the proposed DPSR. Due to the added multiplexers, the read access time of DPSR increases from 0.27 ns to 0.31 ns, which is fast enough to deliver high-quality video formats such as 8K Ultra HD [25].

5.2 Layout

As discussed before, embedded SRAMs usually occupy a large portion of the area of a video chip, and therefore the area cost of the embedded SRAM is an important design concern. Fig. 5 shows the layout of DPSR. Each added self-recovery logic block (MUX) occupies an area of 18.79 µm × 43.47 µm, resulting in 7.94% area overhead. It should be noted that the self-recovery logic is added to the readout bitlines, so increasing the number of words in a memory is beneficial in reducing the area overhead.

5.3 Power Efficiency

To evaluate the power effectiveness of the proposed technique, we model the power consumption of the memory as:

$$P_{Dynamic} = \sum_{j=0}^{31} \sum_{i=0,1} \frac{P_j(i)\,\left[R(i) + W(i)\right]}{2} \tag{5}$$

$$P_{Leak} = \sum_{k=0}^{1022} \sum_{j=0}^{31} \sum_{i=0}^{1} P_j(i)\,L(i) \tag{6}$$

where P_Dynamic and P_Leak are the dynamic and leakage power consumption, respectively; i is the value stored in the SRAM; and j is the bit number in a word, which ranges from 0 to 31. P_j(i) is the probability of data bit j being 0 or 1, which is extracted from various video benchmarks.
R(i), W(i), and L(i) are the read power, write power, and leakage power consumption, respectively, of a single SRAM bitcell storing the data bit i. Fig. 6 compares the power consumption of different memory operations. As expected, all power components decrease as the voltage scales from 1 V to 0.5 V. It should be noted that the power consumption overhead caused by the self-correction logic in the proposed technique is negligible compared to the power reduction enabled by reducing the voltage to near-threshold, since the dynamic and leakage power consumption scale quadratically and linearly with voltage, respectively. The proposed memory at 0.5 V consumes 219 µW dynamic power and 193 µW leakage power. Compared to the conventional memory at 1 V, the proposed technique achieves 81.52% dynamic power savings and 82.45% leakage power savings.

Fig. 6. Power consumption of different memory operations (W(0): write 0, W(1): write 1, R(0): read 0, R(1): read 1, L(0): leakage power of storing 0, L(1): leakage power of storing 1).

5.4 Video Output Quality Analysis

Different from the video benchmark sets used in Section 3, we use a new video benchmark set for verification: 3 videos from [26] and another 5 videos from [27]. We adopt the well-known peak signal-to-noise ratio (PSNR) metric to evaluate the video quality, which is defined as [19]

$$PSNR = 10 \log_{10} \frac{255^2}{MSE} \tag{7}$$

where MSE is the mean square error between the original videos (Org) and the degraded videos (Deg), expressed as

$$MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ Org(i, j) - Deg(i, j) \right]^2 \tag{8}$$

Researchers have shown that a PSNR of 30 dB or higher is acceptable for a video [14]. Table 7 compares the PSNR values obtained with different techniques when Pfail is 10^-2 (for minimum-sized SRAM) and 10^-3 (for upsized SRAM with 58% area overhead [16]).
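The PSNR and MSE definitions in Equations (7) and (8) translate directly into a few lines of code. The Python sketch below is a minimal version for a single 8-bit frame stored as a list of rows; the function names are hypothetical, and returning infinity for identical frames is a convention chosen for this example rather than something specified in the text.

```python
import math
from typing import Sequence


def mse(org: Sequence[Sequence[int]], deg: Sequence[Sequence[int]]) -> float:
    """Mean square error between two m x n frames of 8-bit samples (Eq. 8)."""
    m, n = len(org), len(org[0])
    total = sum((org[i][j] - deg[i][j]) ** 2
                for i in range(m) for j in range(n))
    return total / (m * n)


def psnr(org: Sequence[Sequence[int]], deg: Sequence[Sequence[int]]) -> float:
    """Peak signal-to-noise ratio in dB for 8-bit data (Eq. 7)."""
    err = mse(org, deg)
    if err == 0:
        return math.inf  # identical frames; convention chosen for this sketch
    return 10.0 * math.log10(255.0 ** 2 / err)


if __name__ == "__main__":
    original = [[16, 128], [200, 255]]
    degraded = [[16, 128], [200, 127]]  # MSB of the last sample flipped
    print(f"MSE  = {mse(original, degraded):.1f}")
    print(f"PSNR = {psnr(original, degraded):.2f} dB")
```

For a full video, the per-frame values would typically be averaged over all frames and computed separately or jointly for the Y, Cb, and Cr planes; the text does not state which of these conventions is used for Table 7.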
In addition to the video benchmarks, 10 YouTube-8M videos, drawn from the 25 videos used for verification earlier in Table 4 and Table 5 (and separate from the 10,000 videos used in the training dataset), are used to calculate the video metrics presented. Due to limited space, Fig. 7 shows the output images of six videos when failures are injected at a memory failure rate of $10^{-2}$. It can be seen that DPSR has good recovery precision and delivers good video quality, with PSNR over 35 dB, even for minimum-sized SRAM. Accordingly, DPSR achieves good video output quality at near-threshold voltage.
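To make the PSNR and MSE definitions in (7) and (8) concrete, the following Python sketch computes both for 8-bit frames and injects random bit failures. It is an illustration under stated assumptions rather than the evaluation flow used in this work: failures are modeled as independent bit flips with probability `p_fail`, and the helper names `mse`, `psnr`, `inject_bit_failures`, and `expected_psnr` are hypothetical. Under that assumed model, and for roughly random pixel data, the expected MSE is approximately `p_fail` times the sum of 4^b over the 8 bit positions, which gives about 24.7 dB at `p_fail` = 0.01, close to most of the conventional-memory values in Table 7.

```python
import math
import random


def mse(org, deg):
    """Mean square error between two equally sized frames (lists of rows), as in (8)."""
    m, n = len(org), len(org[0])
    total = sum((org[i][j] - deg[i][j]) ** 2 for i in range(m) for j in range(n))
    return total / (m * n)


def psnr(org, deg, peak=255):
    """Peak signal-to-noise ratio in dB, as in (7); infinite for identical frames."""
    err = mse(org, deg)
    return math.inf if err == 0 else 10 * math.log10(peak ** 2 / err)


def inject_bit_failures(frame, p_fail, rng, bits=8):
    """Assumed failure model: flip each stored bit independently with probability p_fail."""
    noisy = []
    for row in frame:
        new_row = []
        for pixel in row:
            for b in range(bits):
                if rng.random() < p_fail:
                    pixel ^= 1 << b  # flip bit b of this pixel
            new_row.append(pixel)
        noisy.append(new_row)
    return noisy


def expected_psnr(p_fail, bits=8, peak=255):
    """First-order analytic PSNR under the independent bit-flip assumption."""
    expected_mse = p_fail * sum(4 ** b for b in range(bits))
    return 10 * math.log10(peak ** 2 / expected_mse)


if __name__ == "__main__":
    rng = random.Random(0)
    frame = [[rng.randrange(256) for _ in range(64)] for _ in range(64)]
    noisy = inject_bit_failures(frame, 0.01, rng)
    print("simulated PSNR (dB):", psnr(frame, noisy))
    print("analytic PSNR (dB): ", expected_psnr(0.01))
```

The analytic estimate makes clear why unprotected most-significant bits dominate the quality loss: a flip in bit b contributes 4^b to the squared error, so the top bit alone accounts for about three quarters of the expected MSE.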

[Figure: output frames of six videos; each cell below lists PSNR (dB) / SSIM for the corresponding image.]

| Video | Original video | Conventional (Pfail = 0.01) | DPSR (Pfail = 0.01) | Shift (Pfail = 0.01) [13] |
|---|---|---|---|---|
| city | 36.8039 / 0.9306 | 24.5045 / 0.5739 | 33.7729 / 0.9095 | 36.7801 / 0.9290 |
| crew | 37.1454 / 0.9078 | 24.5212 / 0.5142 | 35.5632 / 0.8901 | 37.1197 / 0.9060 |
| football | 36.5037 / 0.9163 | 24.4878 / 0.5542 | 34.6731 / 0.9046 | 36.4816 / 0.9148 |
| Concert | - | 24.7459 / 0.6161 | 39.8358 / 0.9887 | 59.4008 / 0.9988 |
| Game | - | 24.7593 / 0.5834 | 39.6992 / 0.9839 | 59.4008 / 0.9987 |
| Electric Guitar | - | 24.7528 / 0.5580 | 42.5844 / 0.9884 | 59.4008 / 0.9985 |

Fig. 7. Video output using different video storage techniques.

TABLE 7
VIDEO PSNR METRIC COMPARISON

| Dataset | Video | Conventional (Pfail = 0.001) | DPSR (Pfail = 0.001) | Conventional (Pfail = 0.01) | DPSR (Pfail = 0.01) | Ref. [13] (Pfail = 0.001) | Ref. [13] (Pfail = 0.01) |
|---|---|---|---|---|---|---|---|
| Video benchmarks | akiyo | 33.762219 | 40.641272 | 24.676287 | 36.639433 | 41.248639 | 41.185088 |
| Video benchmarks | bus | 32.102969 | 35.405863 | 24.410622 | 33.373556 | 35.706569 | 35.689801 |
| Video benchmarks | city | 32.550805 | 36.360825 | 24.426879 | 33.772872 | 36.801408 | 36.780126 |
| Video benchmarks | coastguard | 32.090258 | 35.489524 | 24.426879 | 34.265358 | 35.667736 | 35.650842 |
| Video benchmarks | crew | 32.680147 | 36.928508 | 24.521212 | 35.563219 | 37.142670 | 37.119667 |
| Video benchmarks | football | 32.439115 | 36.255904 | 24.487795 | 34.673071 | 36.501345 | 36.481558 |
| Video benchmarks | foreman | 32.71063 | 36.878115 | 24.529656 | 35.022773 | 37.211848 | 37.188112 |
| Video benchmarks | sign_irene | 33.253495 | 38.980559 | 24.590776 | 36.573802 | 38.976183 | 38.940649 |
| YouTube-8M | Running | 34.843802 | 47.751093 | 27.751663 | 37.896356 | 69.178843 | 59.400849 |
| YouTube-8M | Concert | 34.843123 | 50.617823 | 24.745933 | 39.835772 | 69.178843 | 59.400849 |
| YouTube-8M | Music Video | 34.842942 | 48.993861 | 24.765553 | 37.908828 | 69.178843 | 59.400849 |
| YouTube-8M | Festival | 34.843240 | 45.838237 | 24.892104 | 35.958557 | 69.178843 | 59.400849 |
| YouTube-8M | Game | 34.843259 | 49.286247 | 24.759353 | 39.699233 | 69.178843 | 59.400849 |
| YouTube-8M | Electric Guitar | 34.843014 | 51.566521 | 24.752845 | 42.584377 | 69.178843 | 59.400849 |
| YouTube-8M | Snow | 34.844445 | 50.725480 | 24.761392 | 40.861991 | 69.178843 | 59.400849 |
| YouTube-8M | Flute | 34.842227 | 53.769387 | 24.755972 | 44.158630 | 69.178843 | 59.400849 |
| YouTube-8M | Vehicle | 34.843032 | 50.015065 | 24.741031 | 42.251862 | 69.178843 | 59.400849 |
| YouTube-8M | Planet | 34.843295 | 53.306924 | 24.760113 | 44.022668 | 69.178843 | 59.400849 |

TABLE 8
VIDEO SSIM METRIC COMPARISON

| Dataset | Video | Conventional (Pfail = 0.001) | DPSR (Pfail = 0.001) | Conventional (Pfail = 0.01) | DPSR (Pfail = 0.01) | Ref. [13] (Pfail = 0.001) | Ref. [13] (Pfail = 0.01) |
|---|---|---|---|---|---|---|---|
| Video benchmarks | akiyo | 0.895568 | 0.960037 | 0.524455 | 0.943273 | 0.961509 | 0.959164 |
| Video benchmarks | bus | 0.884369 | 0.928646 | 0.615363 | 0.917352 | 0.929814 | 0.928514 |
| Video benchmarks | city | 0.879284 | 0.928319 | 0.573905 | 0.909453 | 0.93045 | 0.929045 |
| Video benchmarks | coastguard | 0.872335 | 0.919269 | 0.587307 | 0.910952 | 0.920116 | 0.918561 |
| Video benchmarks | crew | 0.850047 | 0.905746 | 0.514167 | 0.890076 | 0.907585 | 0.906039 |
| Video benchmarks | football | 0.863992 | 0.914951 | 0.554214 | 0.904554 | 0.91613 | 0.914782 |
| Video benchmarks | foreman | 0.865366 | 0.920302 | 0.539163 | 0.908825 | 0.921568 | 0.919849 |
| Video benchmarks | sign_irene | 0.879546 | 0.940751 | 0.521892 | 0.928188 | 0.942161 | 0.940099 |
| YouTube-8M | Running | 0.949418 | 0.997716 | 0.631449 | 0.979334 | 0.999886 | 0.998972 |
| YouTube-8M | Concert | 0.945931 | 0.998801 | 0.616055 | 0.988679 | 0.999874 | 0.998849 |
| YouTube-8M | Music Video | 0.945398 | 0.998251 | 0.607637 | 0.980091 | 0.999876 | 0.998857 |
| YouTube-8M | Festival | 0.953149 | 0.998222 | 0.660847 | 0.983623 | 0.999892 | 0.998987 |
| YouTube-8M | Game | 0.941848 | 0.998281 | 0.583433 | 0.983909 | 0.999855 | 0.998658 |
| YouTube-8M | Electric Guitar | 0.937000 | 0.998636 | 0.558013 | 0.988400 | 0.999842 | 0.998543 |
| YouTube-8M | Snow | 0.939508 | 0.998665 | 0.573617 | 0.987401 | 0.999850 | 0.998656 |
| YouTube-8M | Flute | 0.936771 | 0.999146 | 0.551228 | 0.992497 | 0.999848 | 0.998629 |
| YouTube-8M | Vehicle | 0.942345 | 0.999002 | 0.588818 | 0.992204 | 0.999855 | 0.998681 |
| YouTube-8M | Planet | 0.931941 | 0.998594 | 0.533084 | 0.991441 | 0.999818 | 0.998322 |

The PSNR metric has been used extensively to describe video output quality quantitatively, but recent efforts to capture true human perception show that it may not accurately describe the video quality a human actually perceives [23]. This is because PSNR is based on summing the error of every pixel's chrominance and luminance component values, which alone is not necessarily a good estimate of the user's perception of the video. The structural similarity (SSIM) metric is more aware of the user's perception, since it includes calculations for luminance, contrast, and structural changes in the video.
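As a rough illustration of how the three ingredients just named (luminance, contrast, and structure) reduce to means, variances, and a covariance, the sketch below computes a single-window SSIM over two flattened 8-bit frames. It is a simplification under stated assumptions: the reference SSIM of [23] averages the index over local windows, whereas this version treats the whole frame as one window, and the function name `ssim_global` is hypothetical. The stabilizing constants use the values commonly taken from [23], $C_1 = (0.01 \times 255)^2$ and $C_2 = (0.03 \times 255)^2$; whether the results in Table 8 use exactly these settings is not stated in the text.

```python
def ssim_global(x, y, peak=255, k1=0.01, k2=0.03):
    """Single-window SSIM of two equally long pixel sequences (assumed simplification)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Unbiased variance and covariance estimates (requires n >= 2).
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    # Constants that keep the ratio stable when means or variances are near zero.
    c1 = (k1 * peak) ** 2
    c2 = (k2 * peak) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    reference = [10, 20, 30, 40]
    slightly_degraded = [12, 18, 33, 39]
    reversed_frame = [40, 30, 20, 10]
    print(ssim_global(reference, reference))          # identical frames give 1
    print(ssim_global(reference, slightly_degraded))  # small errors stay close to 1
    print(ssim_global(reference, reversed_frame))     # inverted structure scores much lower
```

Unlike PSNR, the score depends on how errors change the relationship between the two frames (the covariance term), not only on their magnitude.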
The general form of the SSIM equation is defined as [23]

$$\mathrm{SSIM}(x,y) = [l(x,y)]^{\alpha} \cdot [c(x,y)]^{\beta} \cdot [s(x,y)]^{\gamma} = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{9}$$

where $l(x,y)$, the luminance comparison, is a function of the mean intensities $\mu_x$ and $\mu_y$; $c(x,y)$, the contrast comparison, is a function of the standard deviations $\sigma_x$ and $\sigma_y$; and $s(x,y)$, the structural comparison, is a function of the correlation between $x$ and $y$, or $\sigma_{xy}$. Setting $\alpha = \beta = \gamma = 1$ in the first form results in the second form. $C_1$ ($C_2$) is a constant that is included to avoid instabilities when the sum of the squared means (standard deviations) is near zero. The value of the SSIM is in the range 0 to 1. As the value of $\mathrm{SSIM}(x,y)$ gets closer to 1, the quality of video $y$ more closely matches the quality of video $x$. For our testing purposes, video $x$ is the raw, uncompressed YUV video before the decoding process, and video $y$ is the decoded YUV video, which may or may not have had bit-shifting or correction changes applied to it. The results of these SSIM calculations for the conventional memory and DPSR are listed in Table 8. DPSR significantly increases the SSIM value over the conventional memory, which applies no failure correction.

5.5 Comparison with Prior Work

Table 9 compares DPSR with the state of the art. With its data-pattern enabled self-recovery ability, DPSR exhibits low implementation cost (7.94%) and reliable operation at near-threshold voltage to achieve maximum energy efficiency.

Comparing with State-of-the-Art Data-Shifting [13]: Tables 7 and 8 also compare the video output quality of the proposed DPSR and the data-shifting technique presented in [13]. As shown, the data-shifting technique [13] has slightly better quality in terms of the PSNR and SSIM metrics than our proposed technique, but it is realized with a large area overhead (~14%). This is because the shifting scheme needs to calculate the shift values based on the received fault positions and then perform shifting to store the least-significant bits in the identified faulty bitcells.

Comparing with State-of-the-Art Data-Squeezing [14]: The data-squeezing technique presented in [14] is another recently developed memory failure mitigation technique. Based on the observation that, for many general-purpose applications, the last-level cache contains large amounts of null data, this technique compresses null subblocks so that they can be allocated to memory entries with faulty cells.
This technique works well for register files and caches in general-purpose applications, which store as many as 79.23% zeros, as discussed in [34]. However, it is not suitable for videos because 8-bit video pixel data vary widely between 0 and 255, which makes zero compression difficult.

Comparing with State-of-the-Art Error Correction Code (ECC): ECCs have also been studied in ultra-low voltage contexts to protect against memory failures [35]. As with redundancy-based repair mechanisms, implementing ECC requires either increasing the capacity of the memory or sacrificing part of its effective capacity to store check bits. In addition to the memory space overhead, complex logic for ECC encoding and decoding must be added, which brings a significant implementation penalty. For example, with the orthogonal Latin square codes discussed in [35], half of the memory capacity is used to store ECC bits.

In our developed big-data enabled memory technique, the general data patterns existing in large-scale videos have been identified and are used to achieve self-correction in the presence of memory failures. The overhead of the developed self-correction logic is significantly reduced compared to existing techniques. As a result, the developed DPSR delivers good video quality with low area overhead.

TABLE 9
COMPARISON WITH PRIOR WORK

| | TCASI'12 [24] | DAC'15 [13] | TC'16 [14] | This Work |
|---|---|---|---|---|
| Fault-position awareness | No | Yes | Yes | Yes |
| Low-power technique | bitcell sizing | data-shifting | data-squeezing | data-pattern enabled self-recovery |
| Bitcell modified | Yes | No | No | No |
| Near-threshold operation | No (0.9 V) | Yes (-) | Yes (0.5 V) | Yes (0.5 V) |
| Power efficiency | Bad | Good | Good | Good |
| Additional logic needed | No | LUTs and shifter | Rearrangement logic and tag array, comparator, MUX | MUX |
| Performance overhead | - | - | extra clock (for decompression) | 0.04 ns |
| Video quality | acceptable | good | does not apply | good |
| Area overhead | 11-65% | 14%¹ | 6.3% | 7.94% |

¹ Depending on the number of shifting bits.

6 CONCLUSION

In this paper, we have presented a data-pattern enabled SRAM with self-recovery ability for big video data. Based on the data patterns obtained by data-mining techniques, a simple circuit-level design technique is applied to enable self-recovery with low area overhead (7.94%). Our design successfully delivers good video quality for minimum-sized SRAM at near-threshold voltage (with failure rate $10^{-2}$).

ACKNOWLEDGMENT

This work was supported in part by the National Science Foundation under Grants CCF-1514780 and CNS-1628961, the ND NASA EPSCoR, the ND Venture grant, the NDSU-RCA funding, and the Offerdahl Foundation. Na Gong and Jinhui Wang are the corresponding authors.

REFERENCES

[1] IDC, "The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East," Dec. 2012. [Online]. Available: https://www.emc.com/collateral/analyst-reports/idc-digitaluniverse-united-states.pdf
[2] K. Kim, "Silicon Technologies and Solutions for the Data-Driven World," in Proc. IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2015, pp. 22-26.
[3] N. Rastogi, "You Charged Me All Night Long," [Online]. Available:

http://www.slate.com/articles/health_and_science/the_green_lantern/2009/10/you_charged_me_all_night_long.html.
[4] M. E. Sinangil and A. P. Chandrakasan, "Application-Specific SRAM Design Using Output Prediction to Reduce Bit-Line Switching Activity and Statistically Gated Sense Amplifiers for Up to 1.9× Lower Energy/Access," IEEE J. Solid-State Circuits, vol. 49, no. 1, pp. 107-117, Jan. 2014.
[5] K. Nii, M. Yabuuchi, Y. Tsukamoto, S. Ohbayashi, S. Imaoka, H. Makino, Y. Yamagami, S. Ishikura, T. Terano, T. Oashi, K. Hashimoto, A. Sebe, G. Okazaki, K. Satomi, H. Akamatsu, and H. Shinohara, "A 45-nm Bulk CMOS Embedded SRAM With Improved Immunity Against Process and Temperature Variations," IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 180-191, Jan. 2008.
[6] O. Hirabayashi, A. Kawasumi, A. Suzuki, Y. Takeyama, K. Kushida, T. Sasaki, A. Katayama, G. Fukano, Y. Fujimura, T. Nakazato, Y. Shizuki, N. Kushiyama, and T. Yabe, "A Process-Variation-Tolerant Dual-Power-Supply SRAM With 0.179 µm² Cell in 40 nm CMOS Using Level-Programmable Wordline Driver," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2009, pp. 458-459.
[7] K. Takeda, Y. Hagihara, Y. Aimoto, M. Nomura, Y. Nakazawa, T. Ishii, and H. Kobatake, "A Read-static-noise-margin-free SRAM Cell for Low-VDD and High-speed Applications," IEEE J. Solid-State Circuits, vol. 41, no. 1, pp. 113-121, Jan. 2006.
[8] T.-H. Kim, J. Liu, and C. H. Kim, "A Voltage Scalable 0.26 V, 64 kb 8T SRAM with Vmin Lowering Techniques and Deep Sleep Mode," IEEE J. Solid-State Circuits, vol. 44, no. 6, pp. 1785-1795, 2009.
[9] M.-F. Chang, S.-W. Chang, P.-W. Chou, and W.-C. Wu, "A 130 mV SRAM with Expanded Write and Read Margins for Subthreshold Applications," IEEE J. Solid-State Circuits, vol. 46, no. 2, pp. 520-529, Feb. 2011.
[10] Y.-W. Chiu, Y.-H. Hu, M.-H. Tu, J.-K. Zhao, Y.-H. Chu, S.-J. Jou, and C.-T. Chuang, "40 nm Bit-Interleaving 12T Subthreshold SRAM With Data-Aware Write-Assist," IEEE Trans. Circuits Syst. I, vol. 61, no. 9, pp. 2578-2585, Sep. 2014.
[11] M. K. Qureshi and Z. Chishti, "Operating SECDED-based Caches at Ultralow Voltage with FLAIR," in Proc. 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2013, pp. 1-11.
[12] A. Ansari, S. Feng, S. Gupta, and S. A. Mahlke, "Archipelago: A Polymorphic Cache Design for Enabling Robust Near-threshold Operation," in Proc. IEEE Symp. on High Performance Computer Architecture (HPCA), 2011, pp. 539-550.
[13] S. Ganapathy, G. Karakonstantis, A. Teman, and A. Burg, "Mitigating the Impact of Faults in Unreliable Memories for Error-Resilient Applications," in Proc. Design Automation Conf. (DAC), 2015, pp. 1-6.
[14] A. Ferreron, D. S. Gracia, J. Alastruey-Benedé, T. Monreal-Arnal, and P. E. Ibáñez, "Concertina: Squeezing in Cache Content to Operate at Near-Threshold Voltage," IEEE Trans. on Computers, vol. 65, no. 3, Mar. 2016.
[15] N. Gong, S. Jiang, A. Challapalli, M. Panesar, and R. Sridhar, "Variation-and-Aging Aware Low Power embedded SRAM for Multimedia Applications," in Proc. 25th IEEE International SoC Conference (SoCC '12), 2012, pp. 21-26.
[16] S. Zhou, S. Katariya, H. Ghasemi, S. Draper, and N. S. Kim, "Minimizing Total Area of Low-Voltage SRAM Arrays Through Joint Optimization of Cell Size, Redundancy, and ECC," in Proc. Int. Conf. on Computer Design (ICCD), 2010, pp. 112-117.
[17] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," in Proc. Association for Computing Machinery's Special Interest Group on Management of Data (ACM SIGMOD) Conf., Washington DC, pp. 207-216, May 1993.
[18] "Weka 3: Data Mining Software in Java," [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/
[19] N. Gong, S. Jiang, A. Challapalli, S. Fernandes, and R. Sridhar, "Ultra-Low Voltage Split-data-aware Embedded SRAM for Mobile Video Applications," IEEE Trans. on Circuits and Systems II, vol. 59, no. 12, pp. 883-887, Dec. 2011.
[20] H. Noguchi, Y. Iguchi, H. Fujiwara, Y. Morita, K. Nii, H. Kawaguchi, and M. Yoshimoto, "A 10T Non-precharge Two-port SRAM for 74% Power Reduction in Video Processing," in Proc. IEEE Computer Society Annual Symp. VLSI Circuits, Mar. 2007, pp. 107-112.
[21] A. R. Alameldeen, I. Wagner, Z. Chishti, W. Wu, C. Wilkerson, and S. L. Lu, "Energy-Efficient Cache Design Using Variable-Strength Error-Correcting Codes," in Proc. ISCA, 2011, pp. 1-11.
[22] J. Chang, M. Huang, J. Shoemaker, J. Benoit, S.-L. Chen, W. Chen, S. Chiu, R. Ganesan, G. Leong, V. Lukka, S. Rusu, and D. Srivastava, "The 65-nm 16-MB shared on-die L3 cache for the dual-core Intel Xeon Processor 7100 Series," IEEE J. Solid-State Circuits, vol. 42, no. 4, pp. 846-852, Apr. 2007.
[23] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Trans. on Image Processing, vol. 13, no. 4, pp. 600-612, Apr. 2004.
[24] J. Kwon, I. Lee, and J. Park, "Heterogeneous SRAM Cell Sizing for Low Power H.264 Applications," IEEE Trans. on Circuits and Systems I, vol. 99, no. 2, pp. 1-10, Feb. 2012.
[25] D. Zhou, S. Wang, H. Sun, J. Zhou, J. Zhu, Y. Zhao, J. Zhou, S. Zhang, and S. Goto, "A 4Gpixel/s 8/10b H.265/HEVC Video Decoder Chip for 8K Ultra HD Applications," in Proc. Int. Solid-State Circuits Conf. (ISSCC), Feb. 2016, San Francisco, CA, pp. 266-267.
[26] S. A. Pourbakhsh, X. Chen, D. Chen, X. Wang, N. Gong, and J. Wang, "Sizing-Priority Based Low-Power Embedded Memory for Mobile Video Applications," in Proc. International Symposium on Quality Electronic Design (ISQED), 2016, Santa Clara, CA, pp. 1-6.
[27] "YUV Video Sequences," [Online]. Available: http://trace.eas.asu.edu/yuv/
[28] "Xiph.org Video Test Media [derf's collection]," [Online]. Available: https://media.xiph.org/video/derf/
[29] F. Frustaci, D. Blaauw, D. Sylvester, and M. Alioto, "Better-than-voltage scaling energy reduction in approximate SRAMs via bit dropping and bit reuse," in Power and Timing Modeling, Optimization and Simulation (PATMOS), 2015 25th International Workshop on, Salvador, 2015, pp. 132-139.
[30] N. Gong, J. Edstrom, D. Chen, and J. Wang, "Data-Pattern Enabled Self-Recovery Multimedia Storage System for Near-Threshold Computing," in Proc. International Conference on Computer Design (ICCD), 2016, Phoenix, Arizona, accepted.
[31] "45-nm FreePDK," [Online]. Available: http://www.eda.ncsu.edu/wiki/freepdk.
[32] F. Sampaio, M. Shafique, B. Zatt, S. Bampi, and J. Henkel, "Energy-Efficient Architecture for Advanced Video Memory," in Proc. 2014 IEEE/ACM International Conf. on Computer-Aided Design (ICCAD), Nov. 2014, pp. 132-139.
[33] S. Venkataramani, S. T. Chakradhar, K. Roy, and A. Raghunathan, "Approximate computing and the quest for computing efficiency," in Proc. the 52nd Annual Design Automation Conf. (DAC '15), Jun. 2015, pp. 1-6.
[34] N. Gong, J. Wang, S. Jiang, and R. Sridhar, "TM-RF: Aging Aware Power Efficient Register File Design for Modern Microprocessors," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 7, pp. 1196-1209, Jul. 2015.
[35] Z. Chishti, A. R. Alameldeen, C. Wilkerson, W. Wu, and S.-L. Lu, "Improving cache lifetime reliability at ultra-low voltages," in Proc. 42nd IEEE/ACM Int. Symp. Microarchit., 2009, pp. 89-99.

Jonathon Edstrom received the B.S. degree in computer engineering and the M.S. degree in electrical and computer engineering from North Dakota State University, Fargo, ND, in 2015 and 2017, respectively. He is currently pursuing the Ph.D. degree in electrical and computer engineering at North Dakota State University. His research focuses on data-driven intelligent energy-efficient hardware design.

Dongliang Chen received the B.S. degree in electrical engineering from Dalian University of Technology (DUT), Dalian, China, in 2010. He is currently pursuing the Ph.D. degree in electrical and computer engineering at North Dakota State University, Fargo, ND. His research focuses on data-driven power-efficient mobile computing.

Yifu Gong received the B.S. degree in electrical engineering from North Dakota State University in 2015. He is currently working towards the Ph.D. degree in electrical and computer engineering at North Dakota State University, Fargo, ND. His research focuses on low-power embedded vision systems.

Jinhui Wang (M'13) received the B.E. degree in electrical engineering from Hebei University, Hebei, China, in 2004, and the Ph.D. degree in electrical engineering through a joint USA/China program between the University of Rochester and Beijing University of Technology, in 2010. Dr. Wang is currently an Assistant Professor with the Department of Electrical and Computer Engineering at North Dakota State University, Fargo, ND, USA. His research interests include low-power, high-performance, and variation-tolerant integrated circuit design, 3-D IC and EDA methodologies, and thermal solutions in VLSI. He has over 100 publications and 20 patents in emerging semiconductor technologies.

Na Gong (M'13) received the B.E. degree in electrical engineering and the M.E. degree in microelectronics from Hebei University, Hebei, China, and the Ph.D. degree in computer science and engineering from the State University of New York, Buffalo, in 2004, 2007, and 2013, respectively. Dr. Gong is currently an Assistant Professor of Electrical and Computer Engineering at North Dakota State University, Fargo, ND, USA. Her research interests include data-driven energy-efficient VLSI circuits and systems and viewer-aware mobile systems, with an emphasis on memories.