An Efficient Reduction of Area in Multistandard Transform Core

A. Shanmuga Priya (1), Dr. T. K. Shanthi (2)
(1) PG Scholar, Applied Electronics, Department of ECE; (2) Associate Professor, Department of ECE
Thanthai Periyar Government Institute of Technology, Vellore, Tamilnadu, India.

Abstract: This paper describes a reduction in the area of a multistandard transform core. Replacing the D flip-flops in the pipeline registers of the architecture with buffer circuits reduces the gate count, and with it the area of the hardware design. The hardware cost of the core is therefore reduced and its hardware efficiency is improved; the improved efficiency is reported as 45%.

I. INTRODUCTION
This section gives an overview of the transform techniques used in video applications.

A. Distributed Arithmetic:
Distributed arithmetic (DA) is an efficient, bit-serial technique for computing a sum of products (vector dot product). Its main advantage lies in data path circuit design: it uses look-up tables and accumulators instead of multipliers to compute inner products, and it has been used in many DSP applications such as the DCT, the DFT, and digital filters.

B. New Distributed Arithmetic:
Reducing cost metrics is an important concern in ASIC design. There is increasing demand for more efficient DA paradigms that eliminate the need for ROM. The New Distributed Arithmetic (NEDA) scheme was proposed to realize the inner product of vectors with optimized hardware requirements.

C. Discrete Cosine Transforms:
The DCT expresses a finite sequence of data points as a sum of cosine functions oscillating at different frequencies. It is a Fourier-related transform similar to the DFT. DCTs are equivalent to DFTs of roughly twice the length operating on real data with even symmetry, where in some variants the input and/or output data are shifted by half a sample. Useful properties of the DCT in image and video applications:
1. Energy compaction: concentrating the energy into a small number of coefficients.
2. De-correlation: minimizing the interdependencies between coefficients.

D. Factor Sharing:
The factor sharing (FS) method shares the same factor in different coefficients among the same input. The 2D DCT/IDCT technique is used to find the coefficient matrix. From the IDCT coefficients, the delta matrix and an area-efficient architecture are found for the unified IDCT circuits.

Section II discusses previous work on the common sharing distributed arithmetic algorithm. Section III discusses the architecture of the multistandard transform core. Section IV discusses the important FPGA capacity metrics. Section V compares the buffer and D flip-flop circuits. Section VI gives the performance analysis of the design.

II. RELATED WORK
The common sharing distributed arithmetic (CSDA) algorithm combines factor sharing and distributed arithmetic. The factor sharing method first finds the shared factors in the coefficient matrix. The DA method is then applied to share the combinations of the inputs among the coefficients. The search flow of the CSDA algorithm is used to obtain better hardware resource sharing in the core design, so CSDA achieves a higher capability in hardware resource sharing.

The coefficient matrix for CSDA is formed from canonic signed digit (CSD) coefficients. Canonical signed digit is a signed-digit number system that encodes a value with the digits -1, 0, and +1 such that no two adjacent digits are nonzero. On average this encoding contains 33% fewer nonzero digits than the two's complement form, which leads to efficient implementation of add/subtract networks in hardware digital signal processing. The main aim of CSDA is to reduce the number of nonzero elements in the coefficient matrix, which is why CSD is used.

For the mathematical part of the CSDA, the general 8-point transform is used. By the symmetry property, the 8-point transform can be divided into two 4-point transforms, one for the even part and one for the odd part. The 4-point transforms can again be divided into two 2-point transforms.

III. ARCHITECTURE OF MULTISTANDARD TRANSFORM CORE:
The 1D multistandard transform (MST) core consists of the Selected Butterfly Architecture (SBF), the Even Part CSDA, the Odd Part CSDA, the Error Correction Adder Tree (ECAT), and Permutation.

A. SBF
The selected butterfly module performs the 8-point butterfly with 8 multiplexers. The 8-point transform is divided into two 4-point transforms, handled by the even part CSDA and the odd part CSDA.

B. EVEN PART CSDA
The CSDA even part calculates the even part of the 8-point transform. It consists of a two-stage pipeline architecture that uses D flip-flops as pipeline registers. The first stage executes the 4-point input butterfly matrix circuit, and the second stage shares the hardware resources across the various video standards.

Fig. 1. 1D CSDA MST core

C. ODD PART CSDA
Like the even part CSDA, the odd part CSDA consists of a two-stage pipeline with registers, and it efficiently shares the hardware resources of the odd part of the architecture across the various video standards.

D. ECAT & PERMUTATION
The error correction adder trees follow the CSDA even part and the CSDA odd part and add the nonzero coefficients of the CSDA with corresponding tree-like structures. Permutation is used to minimize the truncation errors.
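The even/odd split used by the SBF and the two CSDA parts can be illustrated numerically. The sketch below is not the paper's integer multistandard kernel: it assumes an unnormalized floating-point 8-point DCT-II purely for illustration, and the function names `dct8_direct` and `dct8_even_odd` are hypothetical. It shows that an input butterfly (sums and differences of mirrored samples) followed by two 4-point products gives the same result as the direct 8-point computation.

```python
import math

N = 8


def basis(n, k):
    """Unnormalized DCT-II basis value for sample n and frequency k."""
    return math.cos(math.pi * (2 * n + 1) * k / (2 * N))


def dct8_direct(x):
    """8-point DCT-II straight from the definition (8 products per output)."""
    return [sum(x[n] * basis(n, k) for n in range(N)) for k in range(N)]


def dct8_even_odd(x):
    """Same transform via an input butterfly and two 4-point products."""
    half = N // 2
    # Input butterfly: sums feed the even part, differences feed the odd part.
    sums = [x[n] + x[N - 1 - n] for n in range(half)]
    diffs = [x[n] - x[N - 1 - n] for n in range(half)]
    out = []
    for k in range(N):
        source = sums if k % 2 == 0 else diffs
        # Only 4 products per output because the basis is (anti)symmetric.
        out.append(sum(source[n] * basis(n, k) for n in range(half)))
    return out


samples = [1, 2, 3, 4, 5, 6, 7, 8]
direct = dct8_direct(samples)
split = dct8_even_odd(samples)
max_error = max(abs(a - b) for a, b in zip(direct, split))
print("largest difference between the two methods:", max_error)
```

In this sketch the split halves the multiplications per 8-point transform from 64 to 32, since each output needs four products instead of eight.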

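As a hedged illustration of the canonical signed digit coding mentioned in Section II, the sketch below converts non-negative integers to CSD digits in {-1, 0, +1} and compares the number of nonzero digits with plain binary. The function names (`to_csd`, `from_digits`, `nonzero_saving`) and the 12-bit test range are assumptions made for this example and are not taken from the paper.

```python
def to_csd(value):
    """Return the CSD digits of a non-negative integer, least significant first."""
    digits = []
    while value:
        if value & 1:
            # +1 when value % 4 == 1, -1 when value % 4 == 3; this choice
            # guarantees that the next digit is zero.
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def from_digits(digits):
    """Rebuild the integer from least-significant-first signed digits."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def nonzero_saving(bits):
    """Fraction of nonzero digits saved by CSD over all unsigned `bits`-bit values."""
    binary_total = 0
    csd_total = 0
    for value in range(1 << bits):
        binary_total += bin(value).count("1")
        csd_total += sum(1 for d in to_csd(value) if d)
    return 1 - csd_total / binary_total


print(to_csd(7))  # least significant digit first
print(f"saving over 12-bit values: {nonzero_saving(12):.1%}")
```

For example, 7 is 111 in binary (three nonzero digits) but 8 - 1 in CSD (two nonzero digits), so a constant multiplication by 7 needs one subtraction instead of two additions. Over the 12-bit unsigned range used here the saving comes out near 26%; it moves toward the roughly one-third figure as the word length grows.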
The 2D multistandard transform core consists of two 1D MST cores and a transposed memory. The transposed memory is made up of pipeline registers built from D flip-flops, so a larger gate count is required when the design is synthesized.

Fig. 2. 2D CSDA MST core

IV. IMPORTANCE OF FPGA
A field programmable gate array (FPGA) mainly consists of look-up-table-based logic elements in its configurable logic blocks. Interconnects connect the logic elements between the blocks. Device capacity in an FPGA is generally described by three metrics: maximum logic gates, maximum memory bits, and typical gate range. FPGA manufacturers mainly express device capacity in gate counts because of the logic elements. The maximum logic gates metric estimates the maximum number of gates that can be used to implement a particular logic function.

Table 1: Gate counts for common logic functions

FUNCTIONS                                   GATE COUNTS
COMBINATIONAL FUNCTIONS
2-input NAND gate                           2
2-to-1 MUX                                  4
3-input XOR                                 6
4-input XOR                                 9
2-bit carry-save full adder                 9
REGISTER FUNCTIONS
D flip-flop                                 6
D flip-flop with set or reset               8
D flip-flop with reset and clock enable     12

V. BUFFER AND D FLIP-FLOP:
Both flip-flops and buffers are used for the same purpose, i.e. holding the circuit data for a specific clock period. For area reduction, however, buffers can be used in place of flip-flops because they require fewer gates.

Fig. 3. Buffer using NAND gates
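The area argument of this section amounts to simple bookkeeping, sketched below. The per-bit figures are the ones the paper gives (six gates for a D flip-flop in Table 1, two NAND gates for the buffer); treating one NAND gate as one gate for the comparison, the helper name `register_bank_gates`, and the 8 x 8 bank of 16-bit words are assumptions made only for this example.

```python
D_FLIP_FLOP_GATES = 6  # Table 1, plain D flip-flop
BUFFER_GATES = 2       # the paper's figure for the NAND-gate buffer


def register_bank_gates(bits, gates_per_bit):
    """Gate count of a bank that stores `bits` bits at `gates_per_bit` each."""
    return bits * gates_per_bit


# Hypothetical storage: an 8 x 8 transposed memory of 16-bit words.
BITS = 8 * 8 * 16

dff_total = register_bank_gates(BITS, D_FLIP_FLOP_GATES)
buffer_total = register_bank_gates(BITS, BUFFER_GATES)
saved = dff_total - buffer_total

print(f"D flip-flop bank: {dff_total} gates")
print(f"buffer bank:      {buffer_total} gates")
print(f"saved:            {saved} gates ({saved / dff_total:.1%} of the bank)")
```

The saving scales linearly with the number of stored bits, so it matters most in register-heavy blocks such as the transposed memory. The whole-design figures depend on how much of the core is storage and are reported by the authors in Table 2.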

Fig. 4. D flip-flop using NAND gates

From the two figures above, the buffer requires only two NAND gates, whereas the D flip-flop requires six NAND gates.

VI. PERFORMANCE ANALYSIS:

Table 2: Device utilization summary

UTILIZATION                                       USING D FLIP-FLOP   USING BUFFER
Logic utilization:
Number of slice flip-flops used                   1259                704
Number of 4-input LUTs                            2095                2048
Logic distribution:
Number of occupied slices                         1310                1239
Number of slices containing only related logic    1310                1239
Number of slices containing unrelated logic       0                   0
Total number of 4-input LUTs                      2279                2158
Number used as logic                              2095                2048
Number used as route-thru                         184                 110
Number of bonded IOBs                             188                 188
Number of GCLKs                                   1                   1
Total number of gate counts for design            28606               23799

Table 2 shows that using the buffer circuit reduces the total gate count. With the CSDA technique the gate count is stated as 30k (Table 2 lists 28606 for the D flip-flop design); when the buffer circuit is used in the design, the gate count is reduced to 23k (23799 in Table 2). The area of the core design is therefore reduced, and the hardware efficiency, which is inversely proportional to the gate count, is also improved to 45%.

VII. REFERENCES:
1. Y. H. Chen, J. N. Chen, T. Y. Chang, and C. W. Lu, "High-throughput multi standard transform core supporting MPEG/H.264/VC-1 using common sharing distributed arithmetic," vlsi.2013.2251021.
2. "Gate Count Capacity Metrics for FPGAs," XAPP 059, Feb. 1, 1997, version 1.1.
3. A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, "NEDA: A low-power high performance DCT architecture," IEEE Trans. Signal Process., vol. 54, no. 3, pp. 955-964, Mar. 2006.
4. C. Peng, X. Cao, D. Yu, and X. Zhang, "A 250 MHz optimized distributed architecture of 2D 8×8 DCT," in Proc. 7th Int. Conf. ASIC, Oct. 2007, pp. 189-192.
5. C. Y. Huang, L. F. Chen, and Y. K. Lai, "A high-speed 2-D transform architecture with unique kernel for multi-standard video applications," in Proc. IEEE Int. Symp. Circuits Syst., May 2008, pp. 21-24.
6. Y. H. Chen, T. Y. Chang, and C. Y. Li, "High throughput DA-based DCT with high accuracy error-compensated adder tree," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 4, pp. 709-714, Apr. 2011.
7. Y. H. Chen, T. Y. Chang, and C. Y. Li, "A high performance video transform engine by using space-time scheduling strategy," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 4, pp. 655-664, Apr. 2012.
8. Y. K. Lai and Y. F. Lai, "A reconfigurable IDCT architecture for universal video decoders," IEEE Trans. Consum. Electron., vol. 56, no. 3, pp. 1872-1879, Aug. 2010.
9. H. Chang, S. Kim, S. Lee, and K. Cho, "Design of area-efficient unified transform circuit for multi-standard video decoder," in Proc. IEEE Int. SoC Design Conf., Nov. 2009, pp. 369-372.
10. S. Lee and K. Cho, "Circuit implementation for transform and quantization operations of H.264/MPEG-4/VC-1 video decoder," in Proc. Int. Conf. Design Technol. Integr. Syst. Nanosc., Sep. 2007, pp. 102-107.
11. H. Qi, Q. Huang, and W. Gao, "A low-cost very large scale integration architecture for multistandard inverse transform," IEEE Trans. Circuits Syst., vol. 57, no. 7, pp. 551-555, Jul. 2010.
12. T. S. Chang, C. S. Kung, and C. W. Jen, "A simple processor core design for DCT/IDCT," IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 3, pp. 439-447, Apr. 2000.
13. K. H. Chen, J. I. Guo, J. S. Wang, C. W. Yeh, and T. F. Chen, "A power-aware IP core design for the variable-length DCT/IDCT targeting at MPEG4 shape-adaptive transforms," in Proc. IEEE Int. Symp. Circuits Syst., vol. 2, May 2004, pp. 141-144.
14. S. Lee and K. Cho, "Design of high-performance transform and quantization circuit for unified video CODEC," in Proc. IEEE Asia Pacific Conf. Circuits Syst., Nov. 2008, pp. 1450-1453.
15. C. P. Fan and G. A. Su, "Fast algorithm and low-cost hardware-sharing design of multiple integer transforms for VC-1," IEEE Trans. Circuits Syst., vol. 56, no. 10, pp. 788-792, Oct. 2009.
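The reductions reported in Table 2 can be cross-checked with a few lines of arithmetic. The sketch below takes the counts directly from Table 2; the dictionary layout and the helper name `percent_reduction` are assumptions introduced for this example.

```python
# (count with D flip-flops, count with buffers), copied from Table 2.
TABLE_2 = {
    "slice flip-flops": (1259, 704),
    "4-input LUTs (total)": (2279, 2158),
    "occupied slices": (1310, 1239),
    "total gate count": (28606, 23799),
}


def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


for name, (with_dff, with_buffer) in TABLE_2.items():
    change = percent_reduction(with_dff, with_buffer)
    print(f"{name}: {with_dff} -> {with_buffer} ({change:.1f}% lower)")
```

With these numbers the total gate count falls by about 16.8%, while the slice flip-flop count falls by about 44%, which is the Table 2 entry closest to the 45% figure quoted in the text. The paper does not state which quantity its hardware-efficiency figure is computed from, so this reading is only a plausible one.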