Fully Pipelined High Speed SB and MC of AES Based on FPGA


S. Sankar Ganesh #1, J. Jean Jenifer Nesam 2
1 Assistant Professor, VIT University, Tamil Nadu, India.
1 s.sankarganesh@vit.ac.in
2 jeanjenifer@rediffmail.com

Abstract: This paper introduces a new implementation scheme for a high-speed MixColumns stage that shares a single S-box. A single MC (MixColumns) module shares a single SB (S-box, SubBytes) module on a time-slot basis, and in each time slot SB and MC operate in parallel. Earlier designs use 16 individual S-boxes, one per input byte. The sharing concept introduced here eliminates the 16 individual S-boxes and reduces delay and cost. In standard AES, the ShiftRows stage that follows the S-box needs all 128 bits before it can operate, which costs considerable time. Eliminating ShiftRows as a separate stage increases the speed of the AES operation. A LUT-based S-box consumes more than 75% of the power of the AES architecture. This paper uses a composite field S-box, which reduces the power consumption. The S-box is also the main source of information leakage because its values are fixed. Here the S-box values are masked with a particular fixed value, which increases system security.

Keywords: AES, high speed MC, time slot, composite field sbox, VHDL, shiftrow elimination.

I. INTRODUCTION

Communication is central to the modern world, and the Internet and satellites have made it much easier than in past decades. As communication becomes easier, the secure transmission of messages becomes more important, and transmitting and receiving secured data is increasingly difficult. Encryption algorithms are used to protect data from attackers. Many algorithms exist, such as Triple DES and AES. Among them, AES is a very strong encryption standard that gives well-protected encrypted data [4]. Even though AES is a very strong algorithm, its hardware implementation sometimes leaks information.
Attackers approach the data in different ways to recover the key or the plaintext. In AES the S-box is the main component that leaks message or key information. In this paper the S-box values are masked with a fixed value, which increases data security and reduces exposure to side-channel attacks. The paper also introduces an S-box time-sharing method that eliminates the 16 individual S-boxes. In each time slot the S-box receives one input and performs the SubBytes operation on it within that slot. When the AddRoundKey output is fed to the S-box in a suitably rearranged order, the S-box output is already equal to the ShiftRows output. ShiftRows can therefore be eliminated, which increases speed, system clock frequency and throughput, because the S-box output is always 8 bits wide whereas shifting the rows needs all 128 bits. The S-box output is then given directly to MixColumns, and MC and SB operate in parallel.

II. ADVANCED ENCRYPTION STANDARD

A. Brief Explanation of Rijndael Algorithm

Rijndael was published as the Advanced Encryption Standard (AES) by NIST (National Institute of Standards and Technology) in 2001 [3]. AES became effective on May 26, 2002 as the replacement for DES. AES uses a 128-bit input block and a key length of 128, 192 or 256 bits. It can be implemented easily in both software and hardware. The Rijndael algorithm consists of encryption, decryption and a key schedule. These three parts are built from four main operations:
a) Byte substitution (SubBytes)
b) ShiftRows
c) MixColumns
d) Round key addition (AddRoundKey)

ISSN : 0975-4024 Vol 5 No 4 Aug-Sep 2013 3184

Fig. 1. AES Encryption Structure

AES-128 encryption consists of 10 rounds of transformation that turn the input plaintext into the ciphertext. For AES-128 the key length is 128 bits. Only the AES-128 encryption scheme with a 128-bit key is considered in this paper.

III. COMPOSITE FIELD SBOX

A hardware implementation of a LUT-based S-box uses ROM or RAM to store the S-box values. This memory consumes more than 75% of the power of the overall AES architecture. This paper uses a composite field S-box design, which removes the need for memory. The composite field S-box and inverse S-box each consist of two operations:
1. SubBytes: multiplicative inverse in GF(2^8), followed by the affine transformation.
2. InvSubBytes: inverse affine transformation, followed by the multiplicative inverse in GF(2^8).

Fig. 2. Two-stage pipelined composite field S-box

Both operations need the multiplicative inverse module, so this module can be shared and the two calculations selected by an enable signal. The composite field S-box consumes less area and power than a ROM-based S-box and also reduces implementation cost. Its main drawback is logic delay, which can be removed in hardware by inserting registers between the multiplicative inverse and the affine operation.

IV. SPEED AND SECURITY IMPROVED SBOX

The S-box input and output are always 8 bits wide, whereas the MixColumns operation needs at least 32 bits at a time. The 8-bit S-box output and the use of registers slow down the MixColumns operation. Instead of a separate S-box for each input, a single S-box is shared by all inputs on a time-slot basis. The inputs are separated into time slots, and in each slot the S-box takes one value as its input. Because a conventional S-box uses a fixed ROM, an attacker can easily track information by adding simple resistors in parallel with that ROM. By reading the power consumed for each S-box input, the key information can be traced. This is known as a power analysis side-channel attack.
Many side-channel attacks exist, and they can extract information at any stage of AES, but the S-box is the main target because it uses fixed, known values. In this paper other values are used in place of the original ones: the S-box values are XORed with a particular fixed value, and the resulting masked values reduce exposure to side-channel attacks [1].
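The S-box construction of Section III (multiplicative inverse followed by the affine transformation) and the fixed-value masking described above can be modelled in software. The following is a minimal Python sketch, not the paper's VHDL. It computes the inverse by plain exponentiation in GF(2^8) rather than by the composite field decomposition, and the function names and the mask constant `MASK` are hypothetical choices made only for illustration.

```python
# Software reference model of the AES S-box with a fixed XOR mask.
# Assumptions: the inverse is computed as a^254 (not the composite-field
# method used in hardware), and MASK is an arbitrary illustrative value.

AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, the AES field polynomial
MASK = 0xA5       # hypothetical fixed mask value


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo the AES polynomial."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse as a^254; 0 maps to 0 as AES requires."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x: int, k: int) -> int:
    """Rotate an 8-bit value left by k positions."""
    return ((x << k) | (x >> (8 - k))) & 0xFF


def sbox(x: int) -> int:
    """SubBytes: inverse in GF(2^8) followed by the affine transformation."""
    inv = gf_inv(x)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


def masked_sbox(x: int) -> int:
    """S-box output XORed with the fixed mask, so the true value never appears."""
    return sbox(x) ^ MASK
```

A later stage that needs the true value can remove the mask by XORing with the same constant, for example `masked_sbox(0x19) ^ MASK`.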

Because a masked value rather than the original value leaves the S-box, system security increases. To further increase the speed of the AES algorithm, the ShiftRows stage is eliminated. The output of AddRoundKey is rearranged to match the ShiftRows ordering and then given to the S-box as input, so the S-box output is equal to the ShiftRows output. The normal AddRoundKey output and the proposed rearranged output are shown in Fig. 3 and Fig. 4.

Fig. 3. Original Add Round Key output

Fig. 4. Rearranged Add Round Key output

V. PIPELINING MIXCOLUMN OPERATION

A normal MixColumns operation is performed on each column, and each module produces one column of output. A normal implementation therefore needs three additional copies of the MC module (four in total). Replicating modules increases speed, but it also increases the cost and area of the architecture. This paper introduces a single MC based on time sharing. The time-shared MC receives an input in each time slot and performs all operations required for that input within the same slot. The MC has more complicated calculations than the other AES stages. As soon as the MC receives an output from the S-box it starts its calculations, and these run in parallel with the S-box operations. The basic idea is that every value entering the MC is multiplied over GF(2^8) by all of the fixed coefficients. A normal MC works column-wise, which means it reads an input up to four times, whereas here each input is read once and all four of its operations are performed within the same time period. The naming of signals plays an important role in this concept. Naming and rearranging signals is done in software, so it needs no extra hardware and does not increase system area or cost. At the same time, reducing the number of MC modules from 4 to 1 reduces implementation cost and area. The stages of one normal AES round are shown in Fig. 5.

Fig. 5. Normal operation of SubBytes and MixColumns

In the proposed model all AddRoundKey output bytes share the same S-box, and each is applied to the S-box in its own time slot. A 16 x 1 mux separates the 16 eight-bit outputs, and the selection pins of the mux are controlled by an internal clock signal [7]. The S-box output is then given directly to the MixColumns stage, whose operation is as follows. The MixColumns transformation operates on the State column by column, treating each column as a four-term polynomial [9]. The columns are considered as polynomials over GF(2^8) and multiplied modulo x^4 + 1 with a fixed polynomial a(x), given by

\[
a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}
\]

This can be written as a matrix multiplication. Let s'(x) = a(x) ⊗ s(x):

\[
\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix}
=
\begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix}
\begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix}
\quad \text{for } 0 \le c < Nb.
\]

As a result of this multiplication, the four bytes in a column are replaced by the following:

\[
\begin{aligned}
s'_{0,c} &= (\{02\} \bullet s_{0,c}) \oplus (\{03\} \bullet s_{1,c}) \oplus s_{2,c} \oplus s_{3,c} \\
s'_{1,c} &= s_{0,c} \oplus (\{02\} \bullet s_{1,c}) \oplus (\{03\} \bullet s_{2,c}) \oplus s_{3,c} \\
s'_{2,c} &= s_{0,c} \oplus s_{1,c} \oplus (\{02\} \bullet s_{2,c}) \oplus (\{03\} \bullet s_{3,c}) \\
s'_{3,c} &= (\{03\} \bullet s_{0,c}) \oplus s_{1,c} \oplus s_{2,c} \oplus (\{02\} \bullet s_{3,c})
\end{aligned}
\]

The S-box receives an input every 5 ns. In each 5 ns slot it takes one input and produces an 8-bit output, and MixColumns operates on that output in parallel. For example, 11010100 is the first S-box output, available at 5 ns, and MixColumns starts working on it. The next output, 10111111, is available at 10 ns, and the MC processes it in parallel. Because the inputs are pipelined by time slot [2], the MC also produces pipelined output [6], which increases the speed of the AES architecture. Combining the signals properly yields the required result.
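The equations above can be evaluated one byte per time slot: each S-box output is multiplied by {01}, {02} and {03} once, and the three products are XORed into the four partial results of the column. The Python sketch below is a software model of that idea under stated assumptions. The names `xtime` and `mix_column_streaming` and the accumulator layout are illustrative and are not taken from the paper's VHDL.

```python
# Software model of a time-shared MixColumns: one S-box byte arrives per slot
# and is folded into all four output bytes of its column immediately.
# Assumption: the accumulator organisation is illustrative only.

def xtime(b: int) -> int:
    """Multiply a byte by {02} in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF


# First row of the circulant MixColumns matrix: {02} {03} {01} {01}.
COEFFS = (2, 3, 1, 1)


def mix_column_streaming(column_bytes):
    """Consume s0..s3 one at a time and return [s'0, s'1, s'2, s'3]."""
    acc = [0, 0, 0, 0]
    for j, s in enumerate(column_bytes):      # j = time slot within the column
        two = xtime(s)                        # {02} * s
        products = {1: s, 2: two, 3: two ^ s} # {03} * s = {02} * s XOR s
        for r in range(4):
            # Matrix entry in row r, column j of the circulant matrix.
            acc[r] ^= products[COEFFS[(j - r) % 4]]
    return acc
```

With the two S-box outputs quoted in the text as the first bytes of a column, `mix_column_streaming([0xD4, 0xBF, 0x5D, 0x30])` gives `[0x04, 0x66, 0x81, 0xE5]`, which matches the MixColumns example in the AES specification (FIPS 197).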

Fig. 6. Sharing SubBytes and high speed MixColumns

The proposed architecture eliminates ShiftRows, and the input to the MC is based on time slots, so every column shares the same MixColumns module. The timing of these operations is controlled by the clock period.

VI. ELIMINATION OF SHIFTROWS

The S-box produces 8 bits of output at a time, but ShiftRows needs all 128 bits because each row is shifted according to its row number:
1. row0: no change
2. row1: one left shift
3. row2: two left shifts
4. row3: three left shifts

The stage between SubBytes and MixColumns therefore consumes a large amount of time. Shifting a row is nothing more than a left rotation by its row number. Because no arithmetic is involved, this stage can be eliminated [5] by rearranging the AddRoundKey output so that the same result as ShiftRows is produced. The AddRoundKey output is always 128 bits wide, so rearranging the bytes at this stage also removes the need for internal registers between the S-box and ShiftRows. This is illustrated by the following example.

Fig. 7. Normal operation of AES algorithm

The example shows that ShiftRows can start shifting only after all 128 bits have arrived. This is time-consuming even though no arithmetic is performed. The ShiftRows reordering is therefore moved in front of the SubBytes operation, where all 128 bits are available at any time. Fig. 8 shows how the AddRoundKey output is rearranged and how ShiftRows is eliminated. Both arrangements give the same result.
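The reordering works because SubBytes acts on each byte independently, so permuting the bytes before or after substitution gives the same state. The Python sketch below illustrates this. It assumes the usual column-major byte numbering of the AES state (index = row + 4 * column), and `READ_ORDER`, `stream_through_sbox` and the stand-in substitution passed to it are hypothetical names used only for this illustration.

```python
# Software check that feeding the S-box in ShiftRows order removes the need
# for a separate ShiftRows stage.
# Assumption: the state is a list of 16 bytes in column-major order,
# so the byte in row r, column c sits at index r + 4 * c.

def shift_rows(state):
    """Standard ShiftRows: row r is rotated left by r positions."""
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


# Index of the AddRoundKey byte that must enter the S-box in each time slot.
READ_ORDER = [r + 4 * ((c + r) % 4) for c in range(4) for r in range(4)]


def stream_through_sbox(ark_output, substitute):
    """Apply a byte substitution to the AddRoundKey bytes in READ_ORDER."""
    return [substitute(ark_output[i]) for i in READ_ORDER]
```

For any byte substitution `f`, `stream_through_sbox(state, f)` equals `shift_rows([f(b) for b in state])`, so the bytes leave the shared S-box already in the order MixColumns expects.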

Fig. 8. Improved AES algorithm

VII. FUNCTIONAL SIMULATION AND SYNTHESIS RESULTS

A. Simulation result using Modelsim Altera 6.6c:

The design is written in VHDL and simulated using ModelSim Altera version 6.6c. The waveforms obtained are shown in Fig. 9 and Fig. 10.

Fig. 9. Simulation results of Improved SB and MC stage

Fig. 10. Pipelined MC output

The simulation results show that the VHDL code simulates properly and that the test vector fed to the simulation gives the correct output.

B. Synthesis result of high speed SB and MC:

The simulated design was then synthesized using ISE 9.2i. The target device is a Virtex XCV600 in the BG560 package, speed grade -6 [8]. The synthesis results show that all inputs were fitted correctly and that mapping and routing completed successfully.

Fig. 11. Synthesis results of time sharing SB and MC

The synthesis and mapping results of the AES design are summarized in Table I.

Table I. Synthesis summary of time sharing SB and MC

Target Device              | Virtex XCV600 BG560 -6
Optimization Goal          | Speed
Number of slices           | 270/6912 (3%)
Number of 4 input LUTs     | 467/13824 (3%)
Number of bonded IOBs      | 260/404 (64%)
Total memory usage         | 151304 kilobytes

VIII. CONCLUSION

This paper presented a fully pipelined implementation of the AES S-box and MixColumns based on a time-sharing concept. The composite field S-box is very compact and dissipates little power. A simple XOR with a fixed value was used to improve the security of the S-box of Wolkerstorfer et al. Time sharing reduces the resource requirement, and the elimination of ShiftRows reduces the delay. The presented S-box and MC were simulated together using ModelSim Altera 6.6c. The simulation and synthesis results indicate that the design is a suitable choice for applications requiring small silicon area, low power consumption and high security.

REFERENCES

[1] Abdel Alim Kamal and Amr M. Youssef, An Area-Optimized Implementation for AES with Hybrid Countermeasures against Power Analysis, IEEE, 978-1-4244-3786-3/09.
[2] Ahmed Rady, Ehab EL Shehely, A. M. EL Hennawy, Design and Implementation of Area Optimized AES Algorithm on Reconfigurable FPGA, IEEE, vol. 3, 2007.
[3] Announcing the Advanced Encryption Standard (AES), Federal Information Processing Standards Publication, 2001.
[4] J. Daemen and Vincent Rijmen, A Specification for the AES Algorithm Rijndael.
[5] Krishnamurthy G. N., V. Ramaswamy, Study of Removal of Shiftrows and Mixcolumn Stage of AES and AES-KDS on their Encryption and hence Security, World Academy of Science and Technology 50, 2011.
[6] M. M. Wong and M. L. D. Wong, A High Throughput Low Power Composite Field Arithmetic and Algebraic Normal Form Representation, IEEE, vol. 8, 2010.
[7] M. R. M. Rizk, M. Morsy, Optimized Area and Optimized Speed Hardware Implementation of AES on FPGA, IEEE, vol. 1, 2007.
[8] Pravin B. Ghewari et al., Efficient Hardware Design and Implementation of AES Cryptosystem, IJEST, vol. 2(3), 2010.
[9] W. Stallings, Cryptography and Network Security, Prentice Hall, 3rd ed., 2003.