EFFICIENT IMPLEMENTATION OF RECENT STREAM CIPHERS ON RECONFIGURABLE HARDWARE DEVICES

Philippe Léglise, François-Xavier Standaert, Gaël Rouvroy, Jean-Jacques Quisquater
UCL Crypto Group, Microelectronics Laboratory, Université catholique de Louvain, Place du Levant, 3-1348 Louvain-la-Neuve, Belgium
e-mails: {leglise, fstandae, rouvroy, jjq}@dice.ucl.ac.be

Stream ciphers have a reputation for being very efficient when implemented in hardware, much more efficient than any block cipher. However, although plenty of papers and books claim this, few hardware implementation results for stream ciphers are available. In this paper, we provide FPGA implementation results for recent stream ciphers in order to evaluate their actual hardware efficiency. In addition, we compare these results with those of standard block ciphers (AES, 3DES, Rijndael, Misty1, ...). The selected stream ciphers are LILI-II, Helix and SNOW 2.0, and the implementation platform is a Virtex-II FPGA from Xilinx. On the basis of these results, it may be argued that while present stream ciphers allow efficient implementations, they are not overwhelmingly more efficient than block ciphers. In general, their efficiency is comparable. However, stream ciphers are built from arguably low-cost primitives that could provide really compact designs if correctly combined.

INTRODUCTION

Stream ciphers are an important class of symmetric encryption algorithms. They maintain an internal state that varies with time and generate pseudo-random key bits, the keystream. The keystream is then bitwise XORed with the message to encrypt or decrypt it. By contrast, block ciphers encrypt or decrypt blocks of message bits simultaneously using a fixed encryption transformation. Stream ciphers are also more appropriate, and in some cases mandatory (e.g. in some telecommunications applications), when buffering is limited or when characters must be processed individually as they are received. Finally, because they have limited or no error propagation, stream ciphers may be advantageous in situations where transmission errors are highly probable.
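The keystream XOR and the absence of error propagation described above can be illustrated with a minimal Python sketch. The function name and the fixed keystream bytes are assumptions made for illustration only; the bytes are placeholders, not the output of any cipher discussed in this paper.

```python
def xor_keystream(data: bytes, keystream: bytes) -> bytes:
    """XOR each data byte with the matching keystream byte.

    The same function encrypts and decrypts, since (m ^ k) ^ k == m.
    """
    if len(keystream) < len(data):
        raise ValueError("keystream shorter than data")
    return bytes(d ^ k for d, k in zip(data, keystream))


message = b"stream"
# Placeholder keystream (hypothetical values, not produced by a real cipher).
keystream = bytes([0x3A, 0xC5, 0x11, 0x7E, 0x90, 0x04])

ciphertext = xor_keystream(message, keystream)
recovered = xor_keystream(ciphertext, keystream)

# Flip one bit of the first ciphertext byte to model a transmission error.
corrupted = bytes([ciphertext[0] ^ 0x01]) + ciphertext[1:]
damaged = xor_keystream(corrupted, keystream)
```

In this sketch the single flipped ciphertext bit changes only the corresponding plaintext bit; all other bytes decrypt correctly, which is the property that makes this mode attractive on noisy channels.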

Since 1999 and the creation of the European NESSIE project [10], stream ciphers have attracted growing interest, but they remain fewer and less investigated (at least from the hardware implementation point of view) than block ciphers. Given the usual efficiency claim ("stream ciphers can be designed to be very efficient, more efficient than block ciphers, particularly in hardware" [9]) and the commercial applications that could result from this efficiency, this situation seems astonishing. The goal of this contribution is to quantify this claim ("Is it true?"). Since few hardware implementation results are available so far [1, 8], various types of stream ciphers are implemented. The selected stream ciphers are LILI-II [3], Helix [6] and SNOW 2.0 [4], and the implementation platform is a Virtex-II FPGA. These designs are based on different principles and were proposed recently (2002 or later). These implementation results are then compared with those of standard block ciphers. To this end, we define hardware efficiency as the ratio throughput/area, where the area is the number of slices used on the FPGA.

HARDWARE DESCRIPTION

All our implementations were carried out on a Xilinx Virtex-II XC2V6000ff1152-6 FPGA [13], which contains 33792 slices and 144 RAM blocks, i.e. 67584 LUTs and 67584 flip-flops. In the next sections, we compare the number of slices and RAM blocks used by the different implementations. We also evaluated the delays and frequencies after place and route with our implementation tool (Xilinx ISE-6).

LILI-II

Most of the time, modern stream ciphers are built with only one LFSR (Linear Feedback Shift Register) as basic element. LILI-II [3] is based on two bitwise LFSRs. Non-linearity is introduced into LILI-II in two ways. First, the second LFSR (127 bits long) is clocked irregularly: it is clocked at least once and at most 4 times between the production of two consecutive keystream bits. This LFSR is controlled by the first LFSR (128 bits long) (see Figure 1). Second, a non-linear filter is used, which reduces to a 12:1 truth table. The presence of these two LFSRs, combined with a relatively simple introduction of non-linearity, makes LILI-II an intuitive stream cipher. For these reasons, it was a logical choice for a hardware implementation. LILI-II uses a 128-bit key and a 128-bit IV (Initialization Vector). For further details about the algorithm, see [3].

[Figure 1: Overall view of LILI-II. The clock-control subsystem (LFSR_c, f_c) produces c(t), which clocks the data-generation subsystem (LFSR_d, f_d) producing the keystream z(t).]

HELIX

Helix [6] was retained for a hardware implementation because it is fundamentally different from traditional stream ciphers: it uses no LFSR. All operations in Helix are on 32-bit words. These operations are addition modulo 2^32, XOR and left rotation by fixed numbers of bits, all of which are efficient in hardware. Helix combines the functionalities of a stream cipher and a MAC (Message Authentication Code), and its design philosophy can be summarized as "many simple rounds". The Helix state is composed of 5 words (Z0 to Z4) of 32 bits each. Figure 2 illustrates half a block of Helix.

[Figure 2: Half block of Helix]

Helix uses a 256-bit key and a 128-bit IV. Its key scheduling is complex and cannot be explained within this paper. For further details, see [6].
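As a rough illustration of the three 32-bit word operations that Helix relies on, the following Python sketch models them on integers. The helper names are assumptions for illustration, and the sketch deliberately does not reproduce the actual Helix round structure, for which [6] should be consulted.

```python
MASK32 = 0xFFFFFFFF  # keeps results within a 32-bit word


def add32(a: int, b: int) -> int:
    """Addition modulo 2^32."""
    return (a + b) & MASK32


def xor32(a: int, b: int) -> int:
    """Bitwise XOR of two 32-bit words."""
    return (a ^ b) & MASK32


def rotl32(x: int, r: int) -> int:
    """Left rotation of a 32-bit word by a fixed amount r."""
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK32
```

In hardware, a rotation by a fixed amount is pure wiring and consumes no logic, the XOR is one gate level per bit, and the modular addition is the only operation with a carry chain, which is typically what limits the clock frequency of such a design.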

SNOW 2.0

SNOW 2.0 [4] is the successor of SNOW 1.0 [5] and was designed to improve performance and security. SNOW 2.0 (see Figure 3) is based on only one LFSR (unlike LILI-II, which uses two). It operates on 32-bit words rather than on single bits. It is thus interesting to verify whether this leads to an efficient hardware implementation. This LFSR is 512 bits long. The non-linearity is provided by an FSM (Finite State Machine) based on the Rijndael S-box. SNOW 2.0 uses a 128-bit or a 256-bit key and a 128-bit IV. For further details about the LFSR, the FSM and the key scheduling, see [4].

[Figure 3: Design of SNOW 2.0]

DESIGN ISSUES

Of the two styles of LFSR, the usual one is called a Fibonacci LFSR. To shift a Fibonacci LFSR, each bit is simply copied to its neighbor on the right. The original rightmost bit is the output. The bit shifted in at the left is the parity of a specific subset of the bits of the register (the taps).

The other style is called a Galois LFSR. It has the same properties as the Fibonacci LFSR but is shifted differently. To shift a Galois LFSR, each bit is copied to its neighbor on the right, except at the taps, where the rightmost bit of the register is XORed in before the copy is done. The bit shifted in at the left is the original rightmost bit, which is also the output [7].

We now briefly describe our implementations of LILI-II, Helix and SNOW 2.0.
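The two shifting styles described above can be sketched in Python on a toy register. The 4-bit width and the feedback polynomial x^4 + x^3 + 1 are assumptions chosen only to keep the example small; they are not the registers of LILI-II or SNOW 2.0. Bit 0 of the integer plays the role of the rightmost bit.

```python
def fibonacci_step(state, nbits=4, taps=(0, 1)):
    """One right shift of a Fibonacci LFSR.

    The bit shifted in at the left is the parity (XOR) of the tap bits.
    Returns (new_state, output_bit).
    """
    out = state & 1
    feedback = 0
    for t in taps:
        feedback ^= (state >> t) & 1
    return (state >> 1) | (feedback << (nbits - 1)), out


def galois_step(state, tap_mask=0b1100):
    """One right shift of a Galois LFSR.

    The rightmost bit is XORed into every tap position while shifting;
    the top bit of tap_mask also shifts that bit back in at the left.
    Returns (new_state, output_bit).
    """
    out = state & 1
    state >>= 1
    if out:
        state ^= tap_mask
    return state, out


def period(step, seed=1):
    """Count the shifts needed before the register returns to its seed."""
    state, count = seed, 0
    while True:
        state, _ = step(state)
        count += 1
        if state == seed:
            return count
```

With these assumed parameters, `period(fibonacci_step)` and `period(galois_step)` both give 15, the maximal length 2^4 - 1. The structural difference is visible in the code: the Fibonacci feedback is one XOR over all taps, while the Galois update applies an independent two-input XOR at each tap.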

For LILI-II, both styles of LFSR have been implemented. For the regularly clocked LFSR, the Galois LFSR needs 59 slices for a throughput of 384.6 Mbits/sec, compared to 44 slices and a throughput of 273.6 Mbits/sec for the Fibonacci LFSR. Their efficiency ratios are 6.52 and 6.22 respectively, so the Galois LFSR is the more efficient. The advantage of a Galois LFSR over a Fibonacci LFSR in hardware is that it usually has a lower gate delay, since each tap needs only one two-input XOR whereas the Fibonacci feedback computes the parity of all taps. This results in a potentially shorter clock cycle.

For the second LFSR of LILI-II, 1 to 4 shifts have to be performed between the production of two consecutive keystream bits. To do this in a single clock cycle, each of its 127 registers has 4 possible inputs, corresponding to the values this register would hold if the LFSR were clocked 1, 2, 3 or 4 times. The value is selected by the output of the first LFSR. The resulting implementation uses 127 flip-flops and 127 multiplexers (4:1). In this case, the Galois LFSR needs 385 slices for a throughput of 238.1 Mbits/sec, compared to 245 slices and a throughput of 203.5 Mbits/sec for the Fibonacci LFSR. Their efficiency ratios are 0.62 and 0.83 respectively. The Fibonacci LFSR is now more efficient because, for this style of LFSR, only four equations have to be stored for the leftmost taps, whereas the Galois LFSR needs to store 4 equations per tap. The truth table has been implemented in a Virtex RAM block. The key scheduling requires 963 cycles of latency and uses a large memory for storing its intermediate states.

The hardware implementation of one Helix block is straightforward. It requires 329 slices for 2009 Mbits/sec, which gives a ratio of 6.1. On the other hand, although efficient in software, the generation of the key words is cumbersome to deal with in hardware. For this reason, they are sometimes assumed to be precomputed, as in [8].
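The single-cycle irregular clocking used for the second LFSR of LILI-II (four candidate next states and a 4:1 selection) can be modelled in software as follows. This is only a sketch under stated assumptions: the 4-bit toy register, its taps, and the mapping from two control bits to a value in 1..4 are illustrative choices, not the parameters of the implementation reported here.

```python
NBITS = 4       # toy register width (assumption; LILI-II uses 127 bits)
TAPS = (0, 1)   # toy Fibonacci taps (assumption)


def fib_step(state):
    """One regular right shift of the toy Fibonacci LFSR."""
    feedback = 0
    for t in TAPS:
        feedback ^= (state >> t) & 1
    return (state >> 1) | (feedback << (NBITS - 1))


def mux_inputs(state):
    """States after 1, 2, 3 and 4 clocks.

    Hardware computes these four candidates in parallel with
    combinational logic; the loop here only models their values.
    """
    candidates = []
    for _ in range(4):
        state = fib_step(state)
        candidates.append(state)
    return candidates


def control_value(bit_hi, bit_lo):
    """Map two control bits to a step count in 1..4 (illustrative mapping)."""
    return 2 * bit_hi + bit_lo + 1


def irregular_step(state, control):
    """Model of the 4:1 multiplexers: pick the candidate for `control`."""
    if control not in (1, 2, 3, 4):
        raise ValueError("control must be in 1..4")
    return mux_inputs(state)[control - 1]
```

For example, starting from state 1, `mux_inputs(1)` yields the four candidates `[8, 4, 2, 9]`, and `irregular_step(1, control_value(1, 1))` selects the last one. The cost noted in the text comes from having to generate all four candidates for every register bit in each cycle.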
We therefore provide the results of both a single block of Helix and the complete cipher with embedded generation of the key words. Helix then requires a latency of 26 cycles for encrypting/decrypting the first 32 bits of the message.

Finally, two versions of SNOW 2.0 have been implemented: one using the Virtex RAM blocks, the other not. The multiplication of a tap of the LFSR by α or α^-1 can be done with the help of one dual-port RAM block, as explained in the original paper [4]. The only difference is that the RAM blocks of the Virtex are synchronous, so we have to take the taps one state earlier in order to get the correct values at the appropriate moment (see Figure 4). The implementation of the whole LFSR requires 488 slices to reach 7,990 Mbits/sec. For the FSM, the synchronization problem is resolved as shown in Figure 4, where ⊕ is a XOR, ⊞ an addition modulo 2^32, and R1 and R2 are two 32-bit registers. The FSM requires 90 slices and two RAM blocks for a throughput of 5,970 Mbits/sec.

[Figure 4: Multiplication by α or α^-1 and FSM implementation]

The version of SNOW 2.0 implemented without RAM blocks stores the tables in the Virtex look-up tables, which are used as ROMs. This version of the LFSR needs 795 slices for a throughput of 13,781 Mbits/sec. The FSM has been implemented as shown in Figure 3. It requires 2,420 slices for a throughput of 5,351 Mbits/sec.

CONCLUSIONS

In this paper, four representative stream ciphers have been implemented. Table 1 summarizes our results and compares them with certain recent block ciphers on Xilinx FPGAs. Note that strict comparisons are difficult since these designs relate to different contexts (e.g. encryption/decryption designs, loop architectures or unrolled architectures for block ciphers). Looking at these results, the most efficient of all the ciphers is A5/1, which is also one of the weakest. Among the other stream ciphers, Helix appears to be efficient as well, but requires some software precomputations, which may not be practical in contexts where the complete cipher has to be embedded on a single platform. LILI-II is not competitive with modern block ciphers, and its efficiency is mainly limited by its expensive synchronization process. Finally, SNOW 2.0 allows the best implementation opportunities and offers better efficiency than most recent block ciphers (except ICEBERG [12], which was specifically designed for FPGA implementations). As SNOW was originally software-oriented, we may expect the future design of an even better stream cipher dedicated to hardware. Note that most stream ciphers have limited area requirements compared to block ciphers. Therefore, the main difference between block and stream ciphers may not lie in their respective efficiency, but rather in their ability to provide compact solutions for constrained contexts.

Table 1: Performances of block and stream ciphers on Xilinx FPGAs

Algorithm                     Nbr. of   Nbr. of   Throughput    Efficiency
                              slices    RAMs      (Mbits/sec)   (Mbits/(sec.slice))

STREAM CIPHERS - Virtex-II
A5/1 [8]                      32        0         188.3         5.88
E0 [8]                        895       0         189           0.21
LILI-II                       866       1         243           0.28
Helix (prec. key words) [8]   418       0         1,024         2.45
Helix block                   329       0         2,009         6.1
Helix complete                3,367     0         1,707         0.51
RC4 [8]                       140       3         120.8         0.86
SNOW 1.0 [1]                  752       3         2,128         2.83
SNOW 2.0                      1,015     3         5,659         5.57
SNOW 2.0                      2,420     0         5,351         2.21

BLOCK CIPHERS - Virtex
Twofish [12]                  21,000    0         15,200        0.72
Serpent [12]                  19,700    0         16,800        0.85

BLOCK CIPHERS - Virtex-E
Camellia [12]                 9,692     0         6,750         0.7
Khazad [12]                   7,175     0         7,872         1.10
Misty1 [12]                   6,322     0         10,176        1.61
Rijndael [12]                 2,524     0         2,085         1.17

BLOCK CIPHERS - Virtex-II
RC6 [12]                      7,456     0         4,800         0.64
IDEA [12]                     9,793     0         6,800         0.69
SHACAL-1 [12]                 13,729    0         17,021        1.24
3DES [12]                     604       0         917           1.51
ICEBERG [12]                  4,946     0         17,344        3.51

BLOCK CIPHERS - Virtex-II + RAMBs
Rijndael [12]                 146       3         358           2.45
ICEBERG [12]                  3,132     64        13,440        4.29
AES [11]                      146       3         358           2.45

REFERENCES

[1] K. Alexander, R. Karri, I. Minkin, K. Wu, P. Mishra, X. Li, Towards 10-100 Gbps Cryptographic Architectures, in CATT/WICAT Annual Research Review, available from http://wicat.poly.edu/tech report/tr/02-005.pdf, 2003.
[2] L. Batina, J. Lano, N. Mentens, B. Preneel, I. Verbauwhede, S. B. Örs, Energy, Performance, Area versus Security Trade-offs for Stream Ciphers, in ECRYPT Workshop, SASC - The State of the Art of Stream Ciphers, pp. 302-310, 2004.
[3] A. Clark, E. Dawson, J. Fuller, J. Golic, H-J. Lee, W. Millan, S-J. Moon, L. Simpson, The LILI-II Keystream Generator, ACISP 2002, 2002.
[4] P. Ekdahl, T. Johansson, A new version of the stream cipher SNOW, available from http://www.it.lth.se/cryptology/snow/, 2002.
[5] P. Ekdahl, T. Johansson, SNOW - a new stream cipher, available from http://www.it.lth.se/cryptology/snow/, 2000.
[6] N. Ferguson, D. Whiting, B. Schneier, J. Kelsey, S. Lucks, T. Kohno, Helix: Fast Encryption and Authentication in a Single Cryptographic Primitive, in FSE 2003, 2003.
[7] I. Goldberg, D. Wagner, Architectural Considerations for Cryptanalytic Hardware, CS252 technical report, Berkeley, May 1996.
[8] M. D. Galanis, P. Kitsos, G. Kostopoulos, O. Koufopavlou, Comparison of the Performance of Stream Ciphers for Wireless Communications, proceedings of CCCT 04, Austin, Texas, USA, August 14-17, 2004.
[9] A. Menezes, P. van Oorschot, S. Vanstone, Handbook of Applied Cryptography, CRC Press, 1996.
[10] NESSIE: New European Schemes for Signatures, Integrity, and Encryption, available from http://www.cryptonessie.org, 2004.
[11] G. Rouvroy, Secure and Reconfigurable Hardware Decoder for Digital Cinema Images, PhD Thesis, UCL, June 2004.
[12] F.-X. Standaert, Secure and efficient use of reconfigurable hardware devices in symmetric cryptography, PhD Thesis, UCL, June 2004.
[13] Xilinx, Virtex-II Data sheets, available from http://www.xilinx.com, 2003.