Multicore Processing and Efficient On-Chip Caching for H.264 and Future Video Decoders

The MIT Faculty has made this article openly available.

Citation: Finchelstein, D.F., V. Sze, and A.P. Chandrakasan. "Multicore Processing and Efficient On-Chip Caching for H.264 and Future Video Decoders." Circuits and Systems for Video Technology, IEEE Transactions on (2009): IEEE
Publisher: Institute of Electrical and Electronics Engineers
Version: Final published version
Accessed: Tue Oct 23 19:03:00 EDT 2018
Terms of Use: Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.

1704 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 11, NOVEMBER 2009

Multicore Processing and Efficient On-Chip Caching for H.264 and Future Video Decoders

Daniel F. Finchelstein, Member, IEEE, Vivienne Sze, Student Member, IEEE, and Anantha P. Chandrakasan, Fellow, IEEE

Abstract: Performance requirements for video decoding will continue to rise in the future due to the adoption of higher resolutions and faster frame rates. Multicore processing is an effective way to handle the resulting increase in computation. For power-constrained applications such as mobile devices, extra performance can be traded off for lower power consumption via voltage scaling. As memory power is a significant part of system power, it is also important to reduce unnecessary on-chip and off-chip memory accesses. This paper proposes several techniques that enable multiple parallel decoders to process a single video sequence; the paper also demonstrates several on-chip caching schemes. First, we describe techniques that can be applied to the existing H.264 standard, such as multiframe processing. Second, with an eye toward future video standards, we propose replacing the traditional raster-scan processing with an interleaved macroblock ordering; this can increase parallelism with minimal impact on coding efficiency and latency. The proposed architectures allow N parallel hardware decoders to achieve a speedup of up to a factor of N. For example, if N = 3, the proposed multiple frame and interleaved entropy slice multicore processing techniques can achieve performance improvements of 2.64× and 2.91×, respectively. This extra hardware performance can be used to decode higher definition videos. Alternatively, it can be traded off for dynamic power savings of 60% relative to a single nominal-voltage decoder.
Finally, on-chip caching methods are presented that significantly reduce off-chip memory bandwidth, leading to a further increase in performance and energy efficiency. Data-forwarding caches can reduce off-chip memory reads by 53%, while using a last-frame cache can eliminate 80% of the off-chip reads. The proposed techniques were validated and benchmarked using full-system Verilog hardware simulations based on an existing decoder; they should also be applicable to most other decoder architectures. The metrics used to evaluate the ideas in this paper are performance, power, area, memory efficiency, coding efficiency, and input latency.

Index Terms: H.264, low-power, multicore, parallelism, video decoders.

Manuscript received January 31, 2009; revised May 21, First version published September 1, 2009; current version published October 30, This work was funded by Nokia, and Texas Instruments Incorporated. Chip fabrication was provided by Texas Instruments Incorporated. The work of V. Sze was supported by the Texas Instruments Graduate Women's Fellowship for Leadership Activities in Microelectronics and the Natural Sciences and Engineering Research Council. This paper was recommended by Associate Editor M. Mattavelli.
D. Finchelstein is with Nvidia Corporation, Santa Clara, CA USA (dfinchel@alum.mit.edu).
V. Sze and A. P. Chandrakasan are with the Microsystems Technology Laboratories, Massachusetts Institute of Technology, Cambridge, MA USA (sze@alum.mit.edu; anantha@mtl.mit.edu).
Color versions of one or more of the figures in this paper are available online at Digital Object Identifier /TCSVT

I. Introduction

The increasing demand for higher definition and faster frame rate video is making high performance and low power critical in hardware video decoders. Mobile multimedia devices such as smart phones are energy-constrained, so reducing their power is critical for extending video playback times.
For wired devices such as set-top boxes, speed is critical for high-quality video playback. Ideally, a video decoder would not require a different architecture for these two types of applications. This would help reduce design time and lower implementation costs.

Pipelining and parallelism, two well-known hardware architecture techniques, can be used to achieve these high-performance requirements. Pipelining increases computation concurrency by reducing the datapath between registers. This allows a circuit to be clocked at a higher frequency, and thus process data faster. One disadvantage of pipelining is the increase in pipeline registers and control complexity. Parallelism increases concurrency by distributing computation amongst several identical hardware units. The main cost of parallelism is an increase in chip area and muxing/demuxing logic to feed all the units and collect their results.

The power of a given video decoder architecture can be minimized by lowering the supply voltage. First, the decoder's clock frequency is set to the lowest value that still guarantees the current computation workload can be met. Next, the supply voltage is reduced to the minimal value that still allows the circuit to operate at the chosen frequency. Voltage scaling reduces dynamic energy consumption by a quadratic factor. This comes at a cost of increased circuit delay, as the currents decrease with supply voltage. Specifically, the circuit suffers a linear increase in delay above the threshold voltage; as the supply voltage approaches the sub-threshold region, the circuit begins to suffer an exponential increase in delay [1]. This decreased speed can be a challenge for real-time applications such as video decoding where on average a new frame must be computed every 33 ms for a frame rate of 30 frames per second (fps).

Video decoding also requires a significant amount of on-chip and off-chip memory bandwidth, for both motion compensation (MC) and last-line accessing.
Therefore, memory system optimization can reduce total power in the decoder system, which includes both the decoder application-specific integrated circuit (ASIC) and the off-chip frame buffer memory. One effective way to reduce memory power is the use of on-chip caching. This technique trades off an increase in chip area for a reduction in more power-hungry off-chip accesses.

© 2009 IEEE

A. Related Work

State-of-the-art H.264 ASIC video decoders have used microarchitectural techniques such as pipelining and parallelism to increase throughput and thus reduce power consumption of digital logic [2]-[6]. In this paper, parallelism is applied at the system level by replicating the entire decoder pipeline.

In [2], the performance bottleneck was identified to be the entropy decoding (ED) unit. This is because context-adaptive variable-length coding (CAVLC) processes an inherently serial bitstream and cannot be easily parallelized. This is also seen in [7], where everything but the ED unit was replicated by a factor of 8. One way to overcome the ED performance bottleneck is to run it at a faster frequency, as suggested in [8] and [9]. However, the ED unit must then be run at a higher voltage than the rest of the system, which lowers the overall energy efficiency. Also, even at the maximum frequency allowed by the underlying transistor technology, the ED unit might not be able to run fast enough to meet the highest performance demands.

A system-level approach to increasing decoder throughput is to break the input stream into slices that can be processed in parallel by multiple cores, as proposed in [10] and [11]. The work in [10] proposes breaking up each frame into completely independent slices; this method was described for MPEG-2 but is also applicable within the H.264 standard at the cost of lower coding efficiency, as will be shown in Section II-A. The work in [11] proposes breaking up each frame into entropy slices where only the ED portion is independent; this method is not H.264 compliant.

State-of-the-art H.264 decoders have also used last-line and MC caching to reduce power consumption of the memory subsystem [2]-[6].
Another technique to reduce off-chip frame buffer (OCFB) bandwidth (BW) is to compress the reference frames [12]. The idea of [12] is not H.264 compliant and must be performed in the same way at the encoder and decoder.

B. Main Contributions of This Paper

The main contributions of this paper fall in two categories: 1) multicore parallelism (Sections II and III) and 2) on-chip memory caching (Section IV). Multicore parallelism consists of replicating an existing video decoder (DEC) architecture, as shown in Fig. 1. Each of the parallel DEC processes a different part of the bitstream, and together they produce an output video. First, we describe a way of decoding multiple H.264 frames simultaneously, while achieving a linear improvement in performance with no loss in coding efficiency. Second, through coexploration of algorithm and architecture, we develop interleaved entropy slice (IES) processing; this method also achieves a linear increase in throughput with negligible impact on coding efficiency, and has lower input latency than the multiple frame technique. Third, we show how using either a last-frame cache (LFC) or data-forwarding cache (DFC) can drastically reduce OCFB read BW. Fourth, we show how IES processing increases data locality and reduces the BW of a full-last-line cache (FLLC). Finally, the different techniques are evaluated for speed, area, power, latency, and coding loss, and the results are summarized in Section V.

Fig. 1. Parallel video decoder architecture.

The proposed architectures were implemented using Verilog and the coding loss was simulated using the H.264 reference software [13]. The underlying DEC architecture used for all the analysis is based on [2] and [8].

II. Video Decoder Replication for H.264

Previously published H.264 DEC have used parallelism within the DEC units [for example, deblocking filter (DB) or MC] to increase system performance.
In this section, we will describe different ways in which two or more DEC can process an H.264 video in parallel and therefore increase system performance. The goal of these techniques is to enable N DEC to execute concurrently, in order to achieve a performance improvement of up to N. These techniques are also cumulative, so they could be used together to expose even more parallelism. While this section deals only with H.264-compliant video processing, Section III will describe other ways to expose the desired parallelism by slightly modifying the H.264 algorithm. The metrics used to evaluate the ideas are performance, power, area, memory efficiency, coding efficiency, and input latency.

A. Slice Parallelism

A scheme that enables high-level DEC parallelism is the division of a frame into two or more H.264 slices at the video encoder (ENC). Each slice can then be processed by a separate DEC core. Parallel slice processing relies on the ability of the DEC's ED to parse two or more slices simultaneously, and also assumes that the ENC divides each frame into enough slices to exploit parallelism at the DEC. In the H.264 standard [14], each slice is preceded by a small 32-bit delimiter code, as shown in Fig. 2. If the DEC can afford to buffer an entire encoded frame of the input stream and quickly parse for the start code of all slices, then it can simultaneously read all the slices from this input buffer. This idea is similar to the parallel MPEG-2 decoder described in [10].

Fig. 2. Start of slices can be found by parsing for headers.

This scheme trades off increased parallelism for a decrease in coding efficiency. We evaluated the impact of slice parallelism by encoding 150 frames of four different 720p video sequences and separating each frame into a fixed number of slices, using the JM reference software [13] with QP = 27. The result is shown in Fig. 3. Relative to having single-slice frames, the coding efficiency decreases because the redundancy across the slice borders is not exploited by the ENC. Furthermore, the size of the slice header information is constant while the size of the slice body decreases because it contains fewer macroblocks (MBs), which are blocks of pixels. For example, when dividing a 720p frame into three slices, the CAVLC coding method suffers an average 0.41% coding loss, when measured under common conditions [15].

Fig. 3. CAVLC coding loss increases with number of H.264 slices in a frame.

Fig. 4. Three parallel video decoders processing three consecutive frames of the same video.

Besides the loss in coding efficiency, another disadvantage of the slice partitioning scheme is that the FLLCs need to be replicated together with each DEC, since they operate on completely different regions of the frame. This causes the area overhead of parallelism to be nearly proportional to the degree of parallelism, since the DEC pipeline is replicated, but not the memory controller logic. In some DEC implementations the on-chip cache dominates the active area (75% in [2]), so replicating the FLLC incurs a large area cost.

The performance improvement of H.264 slice multicore parallelism is shown in Fig. 16. Ideally, the performance improvement of slice parallelism with N decoders is at most N. However, there are two reasons why the performance does not reach this peak. First, the workload is not evenly distributed amongst the parallel slices, especially since they operate on disjoint regions of the frame which could have different coding characteristics.
Second, the increase in total bits per MB due to loss in coding efficiency (more nonzero coefficients, for example) leads to an increase in ED computation cycles.

B. Frame Parallelism

In this section, we show how to process N consecutive H.264 frames in parallel, without requiring the ENC to perform any special operations, such as splitting up frames into N slices. The simultaneous parsing of several frames relies on input buffering and searching for delimiters, similar to the discussion of Section II-A. However, note that this technique requires buffering N frames, so it will incur a higher input latency than the buffering of N slices.

Several consecutive frames can be processed in parallel by N different DEC, as shown in Fig. 4. The main cost of multiframe processing is the area overhead of parallelism, which is proportional to the degree of parallelism, just as in Section II-A. If these frames are all I-frames (spatially predicted), then they can be processed independently from each other. However, when these frames are P-frames (temporally predicted), DEC_i requires data from frame buffer (FB) location FB_{i-1}, which was produced by DEC_{i-1}. If we synchronize all the parallel DEC, such that DEC_i lags sufficiently behind DEC_{i-1}, then the data from FB_{i-1} is usually valid.

If the motion vector (MV) in DEC_i requires pixels not yet decoded by DEC_{i-1}, then concurrency suffers and we must stall DEC_i. This could happen if the y-component of the MV is a large positive number. The performance decrease due to these stalls was simulated to be less than 1% for N = 3, across 100 frames of a 720p Mobcal video sequence. The relatively small number of stalls for the simulated videos can be understood by examining the statistics of their vertical motion vectors. As shown in Fig. 5(b), the y-motion vectors for various videos are typically small and have a very tight spread, which minimizes stalling.
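The stall condition described above can be illustrated with a minimal Python sketch. It is not the Verilog implementation used in this paper; the function name, the row-granularity progress counter, and the default interpolation margin of 3 extra rows are assumptions made for illustration only.

```python
def must_stall(block_y, block_h, mv_y, rows_done_prev, frame_h, interp_margin=3):
    """Return True if DEC_i has to wait for DEC_{i-1}.

    block_y        -- top pixel row of the current block in frame i
    block_h        -- block height in pixels
    mv_y           -- vertical motion vector in whole pixels (positive = downward)
    rows_done_prev -- pixel rows of frame i-1 already written to FB_{i-1}
    frame_h        -- frame height in pixels
    interp_margin  -- extra rows read below the block for sub-pixel
                      interpolation (hypothetical default, an assumption)
    """
    # Lowest reference row that motion compensation will read, clamped to the frame.
    last_row_needed = min(block_y + block_h - 1 + mv_y + interp_margin, frame_h - 1)
    # Rows 0 .. rows_done_prev-1 are valid; anything at or beyond that is not ready.
    return last_row_needed >= rows_done_prev


# A small vertical MV is served from already-decoded rows; a large positive one stalls.
print(must_stall(block_y=0, block_h=16, mv_y=2, rows_done_prev=32, frame_h=720))
print(must_stall(block_y=0, block_h=16, mv_y=40, rows_done_prev=32, frame_h=720))
```

Because the check depends only on the vertical MV component and on how far DEC_{i-1} has progressed, a tight distribution of y-motion vectors keeps the condition false most of the time.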
Frame multicore decoding can increase the DEC performance by up to a factor of N. The parallel frame processing architecture was implemented in Verilog using the core of [2] for each of the DEC. The architecture was then verified for different video sequences and varying degrees of parallelism. Fig. 16 shows how the maximum clock period increases for a given resolution as we process more frames in parallel. For a given resolution, a larger clock period is made possible by an increase in performance, since fewer cycles are needed to decode a given workload. This increase is nearly linear, but is limited by the workload imbalance across the various sets of frames running on each of the parallel DEC.
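The effect of workload imbalance on the achievable speedup can be sketched as follows. This is a simplified model under stated assumptions: frames are assigned to the N DEC in round-robin order, inter-frame synchronization stalls are ignored, and the per-frame cycle counts are hypothetical values rather than measurements from this paper.

```python
def multiframe_speedup(frame_cycles, n_cores):
    """Speedup of N-way frame parallelism over a single decoder.

    frame_cycles -- decode cycles per frame (hypothetical inputs)
    n_cores      -- number of parallel DEC; frame k goes to core k % n_cores
    """
    per_core = [0] * n_cores
    for idx, cycles in enumerate(frame_cycles):
        per_core[idx % n_cores] += cycles
    # The busiest core determines when the whole set of frames is finished.
    return sum(frame_cycles) / max(per_core)


# Balanced workload reaches the ideal factor of N; an imbalanced one does not.
print(multiframe_speedup([100] * 6, n_cores=3))
print(multiframe_speedup([100, 200, 300], n_cores=3))
```

In this model the speedup equals N only when every core receives the same total number of cycles, which is why the measured improvement stays slightly below linear.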

Fig. 5. Distribution of vertical motion vectors for several conformance videos showing a tight spread.

III. Decoder Replication for Future Video Standards

In this section, we propose some changes to the H.264 algorithm that allow two or more DEC to process a video in parallel and therefore increase system performance. As in Section II, the goal of these techniques is to enable N DEC to execute concurrently, in order to achieve a performance improvement of up to N. The multicore ideas in this section offer several advantages over the techniques in Section II, such as better workload distribution among the parallel cores, lower cache area requirements, smaller loss in coding efficiency, and a much smaller input buffering latency.

A. Diagonal Macroblock Processing

The H.264 coding standard processes the MB of video frames in raster-scan order. In order to exploit spatial redundancy, each MB is coded differentially with respect to its already decoded neighbors to the left, top-left, top, and top-right. The redundancy between neighbors is present in both pixel values and prediction information (motion vectors, number of coded coefficients, etc.).

Fig. 6. Processing order is on a 2:1 diagonal.

We could instantiate multiple DEC to process the MB on a 2:1 diagonal as shown in Fig. 6. This is similar to the processing order described in [7]. The diagonal height D could be set to anywhere from 1 to H (frame height). The different diagonals are ordered from left to right. Setting D = 1 corresponds to the typical raster-scan processing order. If diagonal processing is used, all the MB on a diagonal can be decoded concurrently since there are no dependencies between them. If all MB had similar processing workloads, the scheme described in this section could speed up the DEC by N (degree of DEC replication).
In reality, the workload per MB does vary, so the performance improvement is lower than the increase in area. The diagonal height D of each region of diagonals can be set to N, since no further parallel DEC hardware is available. Note that the top line of MB in each region of diagonals is still coded with respect to the MB in the region of diagonals just above, in order to maintain good coding efficiency.

A limitation to implementing this scheme is that the coded MB in H.264 arrive in raster-scan order from the bitstream. One solution would be to modify the algorithm and reorder the MB in a 2:1 diagonal order at the ENC. For example, the MB in each diagonal could be transmitted from top-right to bottom-left in the bitstream. Another ordering could transmit MB on even diagonals from top-right to bottom-left and MB on odd diagonals from bottom-left to top-right. This reordering would require a change in the H.264 standard, so that both the ENC and DEC process MB in a diagonal order. The CAVLC entropy coding efficiency would not suffer, since each MB can be coded in the same way as for the raster-scan ordering of H.264. Therefore, the reordered CAVLC bitstream would contain the same bits within the MB, but the MB would just be rearranged in a different order. However, even if diagonal reordering is used, we still cannot scan ahead to the next MB since the current MB has variable length and there are no MB delimiters. This critical challenge is addressed in the next section.

B. Interleaved Entropy Slice (IES) Parallelism

In order to enable DEC parallelism when using the diagonal scanning order of Section III-A, we propose the following solution. The bitstream can be split into N different IES; for example, when N = 2, IES A and B in Fig. 7. Therefore, each of N parallel DEC would be assigned to an entire MB line, and the IES would be interleaved amongst these lines. Just as slices are separated for H.264, the bitstream could be split into different IES.

Fig. 7. Interleaved entropy slices (IESs) with diagonal dependencies.

There are several key differences between IES and the entropy slices of [11]. First, IES allows for context selection across slice boundaries, which improves coding efficiency relative to [11]. Second, IES are interleaved to enable better parallel processing and memory locality. Finally, IES allows the full decoding path to be parallelized; the decoded syntax elements can be fed directly to the rest of the decoder which can begin processing immediately, avoiding the need for intermediate buffering which is necessary for traditional slice partitioning.

The processing of IES would be synchronized to ensure that the 2:1 diagonal order is maintained. As a result, each DEC must trail the DEC above. However, if one IES has a higher instantaneous processing workload than the IES above it, the DEC above can move forward and proceed further ahead, so that stalling is minimized.

This approach is different from the one used in [7]. In that paper, the ED processing was done in the usual raster-scan order and all the syntax elements were buffered for one frame. The diagonal processing could only start after the entire frame was processed by ED. In the IES approach, which would be enabled by a change in the H.264 algorithm, even the ED processing is done in parallel, which speeds up the ED operation and does not require buffering any syntax elements.

This technique is similar to the dual macroblock pipeline of [9]. In that paper, the authors duplicate the MB processing hardware at the encoder, whereas in this paper we replicate the DEC at the decoder.
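The synchronization between interleaved slices can be illustrated with a small scheduling model in Python. This is a sketch under stated assumptions, not the Verilog system evaluated in this paper: MB row y is assigned to core y mod N, each MB waits for its left neighbor on the same core and for the top-right MB in the row above, and the per-MB cost function is a hypothetical input (one time unit by default).

```python
def ies_schedule(width, height, n_cores, cost=lambda x, y: 1):
    """Finish time of every MB when N DEC process interleaved MB rows.

    width, height -- frame size in macroblocks
    n_cores       -- number of parallel DEC (row y is handled by core y % n_cores)
    cost          -- hypothetical per-MB decode time, defaults to one unit
    """
    finish = [[0] * width for _ in range(height)]
    core_free = [0] * n_cores            # time at which each core becomes idle
    for y in range(height):
        core = y % n_cores
        t = core_free[core]              # core must first finish its previous row
        for x in range(width):
            ready = t                    # left neighbor (same core) is done at t
            if y > 0:
                # 2:1 diagonal: wait for the top-right MB of the row above
                # (clamped at the right frame edge).
                ready = max(ready, finish[y - 1][min(x + 1, width - 1)])
            t = ready + cost(x, y)
            finish[y][x] = t
        core_free[core] = t
    return finish


# 720p frame: 80 x 45 macroblocks, uniform workload, three cores.
total = ies_schedule(80, 45, 3)[-1][-1]
print("speedup over one core:", (80 * 45) / total)
```

With a uniform workload the model approaches the ideal factor of N; passing a non-uniform `cost` function shows how a slow row delays the rows that trail it.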
Parallel processing at the decoder is more challenging than at the encoder since the input is a variable-length bitstream and the MB are transmitted in a fixed raster-scan order. The encoder has the flexibility to process MB in any order, whereas interleaved processing at the decoder requires a change in the H.264 standard.

It is worth considering how the use of IES affects the entropy coding efficiency. Once again, if the video uses CAVLC, the bitstream size will only be slightly affected, since the macroblocks are coded in the same way as the raster-scan order of H.264. The only coding overhead is the 32 bits used for the slice header and at most 7 extra bits for byte alignment between slices. As we see in Fig. 8, this scheme offers much better coding efficiency than using CAVLC with H.264 slices, since there is no loss in coding efficiency at the borders between IES.

Fig. 8. Average CAVLC coding efficiency of IES relative to parallel H.264 slice processing of Section II-A for 150 frames of four different videos: Bigships, Mobcal, Shields, and Parkrun. The coding efficiency curve for H.264 slices is the average of the curves shown in Fig. 3.

To evaluate the actual performance of a real system, we implemented the IES parallelism scheme in Verilog and evaluated it for several videos and degrees of parallelism. The performance of IES multicore decoding is shown in Fig. 9 for varying N. Ideally, the IES parallelism technique can speed up the DEC performance by up to a factor of N. This was also shown in [7], where an 8-core decoder achieved a 7.5× improvement in throughput. In reality, an exact linear increase in performance cannot be achieved due to varying slice workloads (as discussed in Section II-A) and stalls due to synchronization between the DEC. The increased performance of multicore IES can be mapped to a larger clock period for a given workload. This allows lower voltage operation and the corresponding power savings are shown in Fig. 9.
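The link between lower voltage operation and power savings can be sketched with the quadratic dynamic-energy relation mentioned in Section I. The model below is an illustrative assumption, not the power model behind Fig. 9: it ignores leakage and the extra switching of the replicated control logic, and it assumes the total work per frame is unchanged so that power at a fixed frame rate scales with the square of the supply voltage.

```python
import math


def dynamic_power_ratio(v_ratio):
    """Dynamic power relative to nominal at fixed throughput, assuming P ~ V^2."""
    return v_ratio ** 2


def supply_ratio_for_saving(power_saving):
    """Supply voltage (as a fraction of nominal) that yields a given saving.

    power_saving -- fraction of dynamic power saved, e.g. 0.60 for 60%
    """
    if not 0.0 <= power_saving < 1.0:
        raise ValueError("power_saving must be in [0, 1)")
    return math.sqrt(1.0 - power_saving)


# Under this simplified model, a 60% saving corresponds to roughly 63% of nominal supply.
print(round(supply_ratio_for_saving(0.60), 3))
```

The extra performance from parallelism is what makes the slower, lower-voltage operating point usable while still meeting the frame-rate deadline.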
The area costs associated with multicore IES are also shown in Fig. 9. Unlike the multicore ideas of Section II, only one last-line cache is needed independent of the number of cores, as will be shown in Section IV-D. As a result, the total area costs of this scheme do not increase linearly with the degree of parallelism. Fig. 9 shows that the rate of power reduction decreases as we add more and more DEC cores, eventually leveling off. For example, we need 3 cores and 39% more area to save 60% of the power relative to a single-core DEC. However, a 25-core implementation only saves 38% of the power relative to a 3-core DEC, but uses 4.1× the area.

Fig. 9. Performance of IES multicore decoding. The power is normalized relative to a single decoder running at the nominal supply voltage. The area increase assumes caches make up 75% of the area of a single DEC and the memory controller takes 23% of the logic [2].

Fig. 16 shows that IES perform better than regular H.264 slices, and there are several reasons for that. First, the workload variation is not as large between interleaved slices since they cover similar regions of a frame; as N increases, however, the variation in interleaved slice workloads also gets larger. Second, IES parallelism does not suffer from a large coding penalty, so the ED performance does not suffer as a result.

IV. Memory Optimization

The memory subsystem is critical to both the performance and power of a DEC. A state-of-the-art OCFB dynamic random access memory (DRAM) such as the one in [16] can consume as much power as the DEC processing itself, as estimated in [8]. Additionally, off-chip accesses require further power to charge high-capacitance bondwires and PCB traces. Therefore, reducing off-chip memory accesses is important for minimizing system power.

The off-chip write BW of a DEC is bounded on the low side by the number of pixels output by the decoding process. However, the off-chip read BW of the MC block is typically higher than the write BW due to the use of a six-tap interpolating filter. The following sections propose different ways to reduce the MC off-chip read BW with the help of on-chip caches.

A. Existing Memory Optimization Techniques

The work in [8] discusses two main caching techniques for reducing off-chip BW of a hardware DEC: FLLC and MC caches. The top-neighbor dependency requires each MB to refer to the MB above in the last line. The use of a FLLC allows us to fetch this data from on-chip SRAMs rather than getting the previously processed data from a large off-chip memory. As a result, the off-chip memory BW is reduced by 26%. For 720p resolutions, the area cost of this technique is 138 kbits of on-chip SRAM [8], whereas for 1080p the FLLC size increases to 207 kbits.

To further reduce off-chip memory BW, the MC data read from the previous frame can also be cached on-chip. In [2], the MC block identifies the horizontal and vertical overlap of interpolation area between two adjacent 4×4 blocks of pixels with identical MV. This data is cached on-chip in flip-flops, so as not to be read twice from the OCFB. This technique was shown to reduce the total off-chip BW by 19%. If the more general MC cache of [8] is used, a further 33% of OCFB read BW can be saved with a moderately sized cache (512 Bytes). A much larger MC cache (32 kbytes for 720p resolutions) is needed to increase the read BW savings from 33% to 56% over the simple MC caching of [2]. The following sections present new caching ideas that can be used independently or together with existing techniques.

B. Last-Frame Cache (LFC) for Motion Compensation

During MC, most of the pixels are read from the previous frame, as opposed to being read from even earlier frames. If we can store the last reference frame in an on-chip LFC, we can avoid going off-chip for the majority of MC reads. This caching architecture is described in Fig. 10(a), which shows how reads from FB 1 are replaced with reads from the LFC. This scheme requires a writeback cache (WB) in order to not overwrite the data at the current location in the LFC, which is needed for MC. To understand the need for a WB, let us assume that there is no WB and the MV for the current block at location (x, y) is (-10, -10). In this case, the data from the last frame at location (x-10, y-10) would no longer be found in the LFC, since it would have already been overwritten by the block at location (x-10, y-10) from the current frame.

Fig. 10. Caching an entire frame on chip for MC.

One overhead of this LFC scheme is the significant additional area of the WB and the LFC. There is also a power overhead, since each decoded pixel is now written to the LFC (as well as to the OCFB), and written to and read from the WB, all of this just to avoid reading it back from the OCFB. For 720p resolutions, the size of the LFC would be 1.4 MBytes, with an area of 2.7 mm² if implemented with high-density embedded DRAM (eDRAM) [16]. The size of the WB depends on how many misses we are willing to tolerate in the LFC. Fig. 11 shows how the hit rate of the LFC varies with the size of the WB, as simulated in Verilog.
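The quoted LFC size can be checked with a short calculation. The sketch below assumes 8-bit samples with 4:2:0 chroma subsampling (1.5 bytes per pixel) and 16×16 macroblocks; the pixel format is an assumption, since it is not stated explicitly in this section, and the function names are hypothetical.

```python
def frame_bytes(width, height, bytes_per_pixel=1.5):
    """Storage for one decoded frame.

    bytes_per_pixel -- 1.5 corresponds to 8-bit 4:2:0 video (an assumption)
    """
    return int(width * height * bytes_per_pixel)


def mbs_per_row(width, mb_size=16):
    """Number of macroblocks in one MB row, assuming 16x16 macroblocks."""
    return width // mb_size


size_720p = frame_bytes(1280, 720)
print(size_720p, "bytes, about", round(size_720p / 1e6, 2), "MBytes")
print(mbs_per_row(1280), "macroblocks per row at 720p")
```

Under these assumptions a 720p frame occupies about 1.38 × 10^6 bytes, consistent with the 1.4 MBytes given for the LFC, and one MB row contains 80 macroblocks.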
If there is no WB, the videos with more movement from left to right or from top to bottom will have more LFC misses. For example, for the Shields video, the LFC hit rate with no WB is 65% because the movement is from left to right. A small WB with a size of 1 MB can improve the hit rate up to about 93%. To eliminate the remaining misses, the entire row of MB above must be buffered by the WB, which explains the last jump up to 100% when the WB size is 80 MB.

Fig. 11. Hit rate of last-frame cache versus size of writeback cache for different 720p videos. For each video, the type of motion is described, in order to help explain the differences in hit rates.

A miss occurs in the LFC when the block being fetched has a much smaller y-coordinate than the current block being processed. In the case of this miss, the MC data was already overwritten by a recently decoded block which was evicted from the WB. If this happens, the data must be fetched from FB 1. If the reference frame is not the last frame, the LFC is also bypassed and the data is fetched from the OCFB. The work in [17] shows that the previous frame is chosen 80% of the time as the reference frame, as averaged over 10 different videos of common intermediate format (CIF) resolution.

C. Motion Compensation Data-Forwarding Caches (DFCs)

If we allow N parallel DEC to operate concurrently on N consecutive frames, as in Section II-B, we can forward MC data between them using on-chip DFC, as shown in Fig. 12. This will avoid most off-chip MC reads for all DEC but DEC_0, greatly reducing off-chip read BW (up to 67% for N = 3) and power. The OCFB BW only depends on the video frame rate and resolution, and is independent of the number of DEC used.

Fig. 12. Motion compensation (MC) DFC for N = 3.

In general, DEC_i and DEC_{i-1} need to be synchronized, such that DEC_i lags sufficiently behind DEC_{i-1}, similar to the discussion in Section II-B. Conversely, if DEC_{i-1} gets too far ahead of DEC_i, the temporal locality is lost, and the MC data will be read from the OCFB instead of from DFC_{i-1,i}. In that case, we can stall DEC_{i-1} in order to maximize the hit rate of the DFC. These two constraints can be handled with the help of low and high watermarks.

In order to evaluate the performance impact and hit rate of these DFC, we implemented the DFC in Verilog and placed them between the DEC described in Section II-B. The performance impact of stalling at these watermarks was simulated for a Mobcal video sequence of 100 frames. The overall loss in throughput for N = 3 was less than 8%. The DFC need to store about lines of pixels to minimize the cache miss rate, so their on-chip area can be quite large for high-resolution, highly parallel DEC.

Fig. 13. Reduction in off-chip reads versus size of MC DFC for N = 3.
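The low and high watermark constraints can be sketched as a simple control decision. This Python fragment is illustrative only: the function name, the use of decoded pixel rows as the progress measure, and the example watermark values are assumptions, and frame-boundary handling is omitted.

```python
def watermark_control(rows_done_prev, rows_done_cur, low_wm, high_wm):
    """Decide which decoder, if any, should stall.

    rows_done_prev -- pixel rows completed by DEC_{i-1} in its frame
    rows_done_cur  -- pixel rows completed by DEC_i in the following frame
    low_wm         -- minimum lead DEC_{i-1} must keep so MC data is valid
    high_wm        -- maximum lead before forwarded data leaves the DFC
    """
    lead = rows_done_prev - rows_done_cur
    if lead < low_wm:
        return "stall_cur"    # DEC_i would read pixels not yet decoded
    if lead > high_wm:
        return "stall_prev"   # DEC_{i-1} would push needed rows out of the DFC
    return "run_both"


# Hypothetical watermarks: keep the lead between 32 and 64 rows.
for prev, cur in [(40, 20), (100, 20), (60, 10)]:
    print(prev, cur, watermark_control(prev, cur, low_wm=32, high_wm=64))
```

The gap between the two watermarks bounds how many lines the DFC must hold, which is the trade-off between cache size and hit rate.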
To understand the trade-off between the size of the DFC and the hit rate, we simulated the DFC system for 100 frames of the Mobcal video. The result is shown in Fig. 13. As expected, a very large cache has a hit rate near 100%, leading to a 67% reduction in off-chip MC reads for N = 3. The hit rate drops off significantly for DFC sizes of less than 32 lines, since the vertical MV can easily fall outside this range. For N = 3 and 720p resolution, the total area of the two 64-line DFC is about 1 mm^2, assuming high-density 65 nm SRAMs.

D. Last-Line Caching for Interleaved Entropy Slices (IESs)

In addition to enabling parallel processing, the IES of Section III-B also allow for better memory efficiency than the raster-scan processing in H.264. This section shows how the IES processing order can reduce accesses to the large FLLC discussed in Section IV-A. For example, when decoding B_i in Fig. 14, the data from MB A_{i-1}, A_{i-2}, and A_{i-3} can be kept in a much smaller cache, since those MB were recently processed by DEC A and have high temporal locality.

The caches that pass data vertically between decoders, such as from DEC A to DEC B in Fig. 14, are implemented as FIFOs. A deeper FIFO can better handle workload variation between the IES by allowing DEC A to advance several MB ahead of DEC B, which reduces stall cycles and increases throughput. The caches that pass data horizontally within each decoder only need to hold the information for 1 MB, and are unchanged from the H.264 raster-scan implementation. However, when we process A_i, the FLLC is still needed to hold the data that is passed from DEC C to DEC A, since DEC C writes this data long before DEC A can process it. The depth of the FLLC FIFO should therefore be about as large as the frame width in order to prevent deadlock. The caching of data for IES processing is similar to the one used in the encoder of [9].

To evaluate the performance impact of sizing the FIFOs of Fig. 14, we implemented the IES caches in Verilog and placed them together with the system of Section III-B. When simulating intra frames for N = 3, we found that a FIFO depth of four 4×4 edges (one MB edge) has only a 3% performance penalty, whereas a minimally sized FIFO can reduce system performance by almost 25%. This trade-off is illustrated in Fig. 15. The FLLC FIFO is read by DEC_0 and written to by DEC_{N-1}, so if a single-ported memory is used, the accesses will need to be shared.

The total size of the IES inter-slice FIFOs is independent of the degree of parallelism N, since the FLLC is not replicated with each DEC. This implies that the total area overhead of DEC parallelism with diagonal

processing is not a factor of N, as was the case for the techniques in Section II. For the DEC of [2], the area of the FLLC SRAMs was three times larger than the rest of the DEC logic. As a result, for N = 3, the area increase due to parallelism would be about 50% and not 200%. This analysis only includes the DEC of Fig. 1, and does not include the area of the frame buffer and bitstream memory controller.

Fig. 14. Caches used for IES processing with three DEC.

Fig. 15. Impact of FIFO sizing on parallel IES performance.

Fig. 16. Three different multicore architectures show nearly linear performance gains. The multicore performance of H.264 slices is slightly lower because of the extra processing required by the CAVLC and also the unbalanced slice workload due to uneven image characteristics across the slices.

If N is the number of parallel IES DEC, the number of accesses to the large FLLC is reduced to 1/N of the original. These accesses are replaced with accesses to much smaller FIFOs that hold the information for about 1 MB. This uses less energy than accessing a large memory that stores 80 MB for 720p, or 120 MB for 1080p. This reduction in FLLC accesses even allows the designer to eliminate the area-hungry FLLC and just use the large off-chip memory where the frame buffer is stored.

Notably, the diagonal processing enabled by IES can reduce FLLC accesses even when only one DEC is used (no DEC replication). This would require the single DEC to alternate between different IES whenever one of the FIFOs in Fig. 14 stalls.

The improved data locality of IES processing can also benefit the MC cache described in Section IV-A, enabling a higher hit rate. Specifically, IES enables vertically neighboring MB to be processed simultaneously. The read areas of these MB overlap, which improves the hit rate and consequently reduces OCFB BW.
For example, the hit rate of a 2 kb MC cache was simulated to be 5% higher than that of an equally sized MC cache in a DEC that uses regular raster-scan MB ordering.

V. Results

The high-level parallelism in this paper was achieved by replicating a full decoder of an arbitrary architecture. The different architectures we proposed were implemented, verified, and benchmarked in Verilog. Fig. 16 shows that all multicore architectures achieve a near-linear speedup and a corresponding clock frequency reduction for a given resolution. However, as was shown in Fig. 9, extending the level of multicore parallelism much beyond 3 achieves relatively small power savings at the cost of a much larger area. As a result, we compare these different multicore architectures for N = 3, as shown in Table I. Their relative coding efficiency was quantified by running several experiments using a modified version of the reference H.264 software.

The results are summarized and compared in Table I. The table lists the performance achieved when the decoder is replicated three times, relative to the performance of a single decoder. A 3-core implementation was found in Section III-B to be a good trade-off of power savings versus area. As discussed previously, this performance can be traded off for a slower clock and lower voltage, and the equivalent power savings are also shown in Table I. These dynamic power savings assume the original single decoder runs at full voltage and is voltage-scaled when parallelism is enabled.

In addition to video decoder parallelism, several on-chip caching techniques were introduced that significantly reduce the off-chip memory bandwidth requirements. The different caching ideas were implemented, verified, and benchmarked in Verilog. They are summarized and compared in Table II. The first two techniques reduce off-chip frame buffer bandwidth by using large on-chip caches.
The third technique takes advantage of IES processing to provide better data locality and thus minimize accesses to the FLLC.
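
The area and FLLC-access figures quoted for IES processing in Section IV-D follow from simple arithmetic, which the sketch below reproduces. It assumes that only the DEC logic is replicated N times while a single FLLC is shared, and it neglects the area of the small inter-slice FIFOs; the function names and the normalized area units are hypothetical and chosen only for illustration.

```python
def area_increase(n: int, fllc_area: float, logic_area: float, share_fllc: bool) -> float:
    """Fractional area increase of n parallel DEC relative to a single DEC.

    share_fllc=True  models IES processing: one FLLC, n copies of the logic.
    share_fllc=False models full replication: n copies of FLLC and logic.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    single = fllc_area + logic_area
    if share_fllc:
        parallel = fllc_area + n * logic_area
    else:
        parallel = n * single
    return (parallel - single) / single


def fllc_access_reduction(n: int) -> float:
    """Fraction of FLLC accesses removed when accesses drop to 1/n of the original."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - 1.0 / n


if __name__ == "__main__":
    # FLLC SRAM area is three times the rest of the DEC logic (normalized units).
    print(area_increase(3, fllc_area=3.0, logic_area=1.0, share_fllc=True))   # 0.5
    print(area_increase(3, fllc_area=3.0, logic_area=1.0, share_fllc=False))  # 2.0
    print(round(fllc_access_reduction(3), 2))                                 # 0.67
```

With N = 3 this gives the 50% (shared FLLC) versus 200% (fully replicated) area increase and the 67% reduction in FLLC accesses.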

TABLE I. Video Decoder Parallelism (N = 3, 720p). Comparison Done for a Fixed Workload for Different Techniques Relative to a Single-Core Decoder at Full Voltage

| Parallelism Technique | H.264 Slices | H.264 Frames | Interleaved Entropy Slices |
|---|---|---|---|
| Paper Section | II-A | II-B | III-B |
| Degree of Parallelism | | | |
| Relative Performance | | | |
| Equivalent Dynamic Power Savings | 58% | 59% | 60% |
| CAVLC Coding Loss | 0.41% | 0% | 0.05% |
| Relative Last-Line Size | | | |
| Relative Logic Area | | | |
| Input Buffering Latency (ms) | | | |
| H.264 Compliance | Yes | Yes | No |

TABLE II. Summary and Comparison of Different DEC Caching Techniques. Results Shown for 720p Are Relative to a Decoder With Only a FLLC

| Caching Technique | LFC with 48-line WB | 32-line DFC, N = 3 | 1-MB FIFOs for IES, N = 3 |
|---|---|---|---|
| Paper Section | IV-B | IV-C | IV-D |
| Cache Type | eDRAM | SRAM FIFO | Flip-Flops |
| Cache Size (kb) | | | |
| Silicon Area in 65 nm (mm^2) | | | |
| OCFB MC BW Reduction | 80% | 53% | >0%* |
| FLLC BW Reduction | 0% | 0% | 67% |
| Memory Access Power Savings | 60% | 44% | 65% |
| H.264 Compliance | Yes | Yes | No |

*OCFB BW can be reduced further if IES processing is combined with a MC cache.

The memory power savings listed in Table II are with respect to the power of the accesses which the cache helps to reduce. To calculate the exact energy savings based on the different cache hit rates, we simulated and estimated the energies for the different types of memories involved. The normalized energy per bit for each type of memory is as follows: 1) 1 for a FIFO flip-flop from the synthesis library; 2) 19 for a large eDRAM [16]; 3) 51 for a large static random access memory (SRAM) [2]; and 4) 672 for an off-chip synchronous dynamic random access memory (SDRAM) [18], and 10 pF/pin.

VI. Conclusion

In order to handle the high computation load of modern high-definition hardware decoders, parallelism needs to be exposed wherever possible.
In this paper, we presented several ways to enable high-level parallelism and provide a clear performance-area trade-off. If performance, power, area, memory efficiency, coding efficiency, and input latency are key concerns for the video decoder designer, we recommend choosing the proposed IES architecture. In all of these metrics, IES processing provides comparable or better results relative to the other techniques, though it requires a slight change in the video standard. To optimize the memory system, a larger cache (LFC) can reduce more off-chip bandwidth but requires the most area; a good power-area compromise is provided by the DFC of Section IV-C.

Acknowledgments

The authors would like to thank M. Budagavi and T. Ono for their valuable feedback on this paper.

References

[1] J. M. Rabaey, A. P. Chandrakasan, and B. Nikolic, Digital Integrated Circuits, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, Dec
[2] D. F. Finchelstein, V. Sze, M. E. Sinangil, Y. Koken, and A. P. Chandrakasan, "A low-power 0.7-V H.264 720p video decoder," in Proc. IEEE Asian Solid State Circuits Conf., Fukuoka, Japan, Nov. 3-5, 2008, pp
[3] C.-D. Chien, C.-C. Lin, Y.-H. Shih, H.-C. Chen, C.-J. Huang, C.-Y. Yu, C.-L. Chen, C.-H. Cheng, and J.-I. Guo, "A 252 kgate/71 mW multi-standard multi-channel video decoder for high definition video applications," in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, San Francisco, CA, Feb , 2007, pp
[4] S. Na, W. Hwangbo, J. Kim, S. Lee, and C.-M. Kyung, "1.8 mW, hybrid-pipelined H.264/AVC decoder for mobile devices," in Proc. IEEE Asian Solid State Circuits Conf., Jeju, South Korea, Nov , 2007, pp
[5] T.-M. Liu, T. Lin, S. Wang, W. P. Lee, K. Hou, J. Yang, and C. Lee, "A 125 µW, fully scalable MPEG-2 and H.264/AVC video decoder for mobile applications," IEEE J. Solid-State Circuits, vol. 42, no. 1, pp , Jan
[6] C.-C. Lin, J.-W. Chen, H.-C. Chang, Y.-C. Yang, Y.-H. O. Yang, M.-C. Tsai, J.-I. Guo, and J.-S. Wang, "A 160 K gates/4.5 KB SRAM H.264 video decoder for HDTV applications," IEEE J. Solid-State Circuits, vol. 42, no. 1, pp , Jan
[7] S. Nomura, F. Tachibana, T. Fujita, H. C. K. T. Usui, F. Yamane, Y. Miyamoto, C. Kumtornkittikul, H. Hara, T. Yamashita, J. Tanabe, M. Uchiyama, Y. Tsuboi, T. Miyamori, T. Kitahara, H. Sato, Y. Homma, S. Matsumoto, K. Seki, Y. Watanabe, M. Hamada, and M. Takahashi, "A 9.7 mW AAC-decoding, 620 mW H.264 720p 60 fps decoding, 8-core media processor with embedded forward-body-biasing and power-gating circuit in 65 nm CMOS technology," in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, San Francisco, CA, Feb. 2008, pp
[8] V. Sze, D. F. Finchelstein, M. E. Sinangil, and A. P. Chandrakasan, "A 0.7-V 1.8 mW H.264/AVC 720p video decoder," IEEE J. Solid-State Circuits, vol. 44, no. 11, Nov
[9] K. Iwata, S. Mochizuki, T. Shibayama, F. Izuhara, H. Ueda, K. Hosogi, H. Nakata, M. Ehama, T. Kengaku, T. Nakazawa, and H. Watanabe, "A 256 mW full-HD H.264 high-profile CODEC featuring dual macroblock-pipeline architecture in 65 nm CMOS," in Proc. Symp. Very-Large-Scale Integr. Circuits Dig. Tech. Papers, Honolulu, HI, Jun , 2008, pp
[10] L. Phillips, S. V. Naimpally, R. Meyer, and S. Inoue, "Parallel architecture for a high definition television video decoder having multiple independent frame memories," U.S. Patent No , Matsushita Electric Corporation of America, Apr. 23, 1996.

[11] J. Zhao and A. Segall, "VCEG-AI32: New results using entropy slices for parallel decoding," in Proc. 35th Meet. ITU-T Study Group 16 Question 6, Video Coding Experts Group (VCEG), Berlin, Germany, Jul
[12] M. Budagavi and M. Zhou, "Video coding using compressed reference frames," in Proc. IEEE Int. Conf. Acoustics, Speech Signal Process., Las Vegas, NV, 2008, pp
[13] H.264/AVC Reference Software, JM 12.0 [Online]. Available:
[14] Advanced Video Coding for Generic Audiovisual Services, document Rec. ITU-T H.264.doc, ITU-T,
[15] T. K. Tan, G. Sullivan, and T. Wedi, Recommended Simulation Common Conditions for Coding Efficiency Experiments Revision 1, document ITU-T SG16/Q6.doc, ITU-T VCEG-AE010, Jan
[16] K. Hardee, F. Jones, D. Butler, M. Parris, M. Mound, H. Calendar, G. Jones, L. Aldrich, C. Gruenschlaeger, M. Miyabayashil, K. Taniguchi, and I. Arakawa, "A 0.6-V 205 MHz 19.5 ns tRC 16 Mb embedded DRAM," in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb , 2004, pp
[17] Y.-W. Huang, B.-Y. Hsieh, T.-C. Wang, S.-Y. Chien, S.-Y. Ma, C.-F. Shen, and L.-G. Chen, "Analysis and reduction of reference frames for motion estimation in MPEG-4 AVC/JVT/H.264," in Proc. IEEE Int. Conf. Acoustics, Speech Signal Process., vol. 3. Apr. 6-10, 2003, pp
[18] Micron Mobile DRAM, The Secret to Longer Life [Online]. Available:

Daniel F. Finchelstein (M'09) received the Ph.D. and M.S. degrees in 2009 and 2005, respectively, from the Massachusetts Institute of Technology, and the B.A.Sc. degree in 2003 from the University of Waterloo. His doctoral thesis focused on energy and memory efficient parallel video processing. He is currently working in the 3-D graphics performance group at Nvidia Corporation. His research interests include energy-efficient and high-performance digital circuits and systems.
His recent focus has been on parallel processors and memory architectures for video and graphics processing. He has been an engineering intern on eight different occasions at companies like IBM, ATI, Sun, and Nvidia. Dr. Finchelstein received student design contest awards at both A-SSCC (2008) and DAC/ISSCC (2006), and has several conference and journal publications. He was awarded the Natural Sciences and Engineering Research Council of Canada (NSERC) graduate scholarship from 2003 to He also received the University of Waterloo Sanford Fleming Medal for ranking first in the graduating class. He holds one U.S. patent related to cryptographic hardware.

Anantha P. Chandrakasan (S'87-M'95-SM'01-F'04) received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer sciences from the University of California, Berkeley, in 1989, 1990, and 1994, respectively. Since September 1994, he has been with the Massachusetts Institute of Technology, Cambridge, where he is currently the Joseph F. and Nancy P. Keithley Professor of Electrical Engineering. His research interests include low power digital integrated circuit design, wireless microsensors, ultrawideband radios, and emerging technologies. He is a co-author of Low Power Digital CMOS Design (Kluwer Academic Publishers, 1995), Digital Integrated Circuits (Pearson Prentice-Hall, 2003, 2nd edition), and Sub-threshold Design for Ultralow Power Systems (Springer 2006). He is also a co-editor of Low Power CMOS Design (IEEE Press, 1998), Design of High-Performance Microprocessor Circuits (IEEE Press, 2000), and Leakage in Nanometer CMOS Technologies (Springer, 2005). Dr. Chandrakasan was a co-recipient of several awards including the 1993 IEEE Communications Society's Best Tutorial Paper Award, the IEEE Electron Devices Society's 1997 Paul Rappaport Award for the Best Paper in an EDS publication during 1997, the 1999 DAC Design Contest Award, the 2004 DAC/ISSCC Student Design Contest Award, the 2007 ISSCC Beatrice Winner Award for Editorial Excellence and the 2007 ISSCC Jack Kilby Award for Outstanding Student Paper. He has served as a technical program cochair for the 1997 International Symposium on Low Power Electronics and Design (ISLPED), VLSI Design '98, and the 1998 IEEE Workshop on Signal Processing Systems. He was the Signal Processing Sub-committee Chair for ISSCC , the Program Vice-Chair for ISSCC 2002, the Program Chair for ISSCC 2003, and the Technology Directions Sub-committee Chair for ISSCC He was an Associate Editor for the IEEE Journal of Solid-State Circuits from 1998 to He served on SSCS AdCom from 2000 to 2007 and he was the meetings committee chair from 2004 to He is the Conference Chair for ISSCC He is the Director of the MIT Microsystems Technology Laboratories.

Vivienne Sze (S'04) received the B.A.Sc. (Hons.) degree in electrical engineering from the University of Toronto, Toronto, ON, Canada, in 2004, and the S.M. degree from the Massachusetts Institute of Technology (MIT), Cambridge, in 2006, where she is currently a doctoral candidate. From May 2007 to August 2007, she worked in the DSP Solutions Research and Development Center at Texas Instruments, Dallas, TX, designing low-power video coding algorithms. From May 2002 to August 2003, she worked at Snowbush Microelectronics, Toronto, ON, Canada, as an IC Design Engineer. Her research interests include low-power circuit and system design, and low-power algorithms for video compression. Her work has resulted in several contributions to the ITU-T Video Coding Experts Group (VCEG) for the next generation video coding standard. Ms. Sze was a recipient of the 2007 DAC/ISSCC Student Design Contest Award and a co-recipient of the 2008 A-SSCC Outstanding Design Award. She received the Julie Payette - Natural Sciences and Engineering Research Council of Canada (NSERC) Research Scholarship in 2004, the NSERC Postgraduate Scholarship in 2007, and the Texas Instruments Graduate Woman's Fellowship for Leadership in Microelectronics in She was awarded the University of Toronto Adel S. Sedra Gold Medal and W.S. Wilson Medal in 2004.


More information

Design of a Low Power and Area Efficient Flip Flop With Embedded Logic Module

Design of a Low Power and Area Efficient Flip Flop With Embedded Logic Module IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) e-issn: 2278-2834,p- ISSN: 2278-8735.Volume 10, Issue 6, Ver. II (Nov - Dec.2015), PP 40-50 www.iosrjournals.org Design of a Low Power

More information

Motion Video Compression

Motion Video Compression 7 Motion Video Compression 7.1 Motion video Motion video contains massive amounts of redundant information. This is because each image has redundant information and also because there are very few changes

More information

Use of Low Power DET Address Pointer Circuit for FIFO Memory Design

Use of Low Power DET Address Pointer Circuit for FIFO Memory Design International Journal of Education and Science Research Review Use of Low Power DET Address Pointer Circuit for FIFO Memory Design Harpreet M.Tech Scholar PPIMT Hisar Supriya Bhutani Assistant Professor

More information

A VLSI Architecture for Variable Block Size Video Motion Estimation

A VLSI Architecture for Variable Block Size Video Motion Estimation A VLSI Architecture for Variable Block Size Video Motion Estimation Yap, S. Y., & McCanny, J. (2004). A VLSI Architecture for Variable Block Size Video Motion Estimation. IEEE Transactions on Circuits

More information

Adaptive Key Frame Selection for Efficient Video Coding

Adaptive Key Frame Selection for Efficient Video Coding Adaptive Key Frame Selection for Efficient Video Coding Jaebum Jun, Sunyoung Lee, Zanming He, Myungjung Lee, and Euee S. Jang Digital Media Lab., Hanyang University 17 Haengdang-dong, Seongdong-gu, Seoul,

More information
