PAPER High-Throughput Low-Complexity Four-Parallel Reed-Solomon Decoder Architecture for High-Rate WPAN Systems

Chang-Seok CHOI, Hyo-Jin AHN, Nonmembers, and Hanho LEE a), Member

SUMMARY This paper presents a high-throughput low-complexity four-parallel Reed-Solomon (RS) decoder for high-rate WPAN systems. Four-parallel processing is used to achieve 12-Gbps data throughput with low hardware complexity. In addition, the proposed pipelined folded Degree-Computationless Modified Euclidean (fDCME) algorithm is used to implement the key equation solver (KES) block, which keeps the hardware complexity of the RS decoder low. The proposed four-parallel RS decoder is implemented in 90-nm CMOS technology optimized for a 1.2 V supply voltage. The implementation results show that the proposed RS decoder operates at a clock frequency of 400 MHz and has a data throughput of 12.8-Gbps. Because the proposed four-parallel RS decoder architecture combines a high data processing rate with low hardware complexity, it can be applied in FEC devices for next-generation high-rate WPAN systems with data rates of 10-Gbps and beyond.

key words: forward error correction (FEC), Reed-Solomon (RS), decoder, mmWave, WPAN

1. Introduction

The emergence of a multitude of bandwidth-hungry multimedia applications has exacerbated the need for multi-gigabit wireless solutions, which are beyond the reach of conventional WLAN technology (802.11a, b and g). Uncompressed high-definition video distribution and massive data synchronization are driving data-throughput requirements well beyond gigabits per second (Gbps), and already demand up to 10-Gbps with the introduction of, for example, the HDMI 1.3 video standard [1].
The strong commercial interest in using the 57-66 GHz band, known as the millimeter-wave band, for indoor wireless communications is evidenced by recent industrial and standards development efforts in several international standards groups, including ECMA TC-387, IEEE 802.15.3c and the 802.11 VHT60 [2]. These task groups are developing a millimeter-wave (mmWave) based alternative physical layer (PHY) for the high-rate Wireless Personal Area Network (WPAN) standard [3], [4]. This mmWave WPAN system will allow high coexistence with all other microwave systems in the 802.15 family of WPANs. In addition, the mmWave WPAN will support high-data-rate applications such as high-speed internet access and streaming content download (video on demand, home theater, 3D TV, etc.). Very high data rates of 10-Gbps and beyond will be provided for simultaneous time-dependent applications such as real-time multiple HDTV video streams and a wireless data bus for cable replacement.

Manuscript received September 4, 2010. Manuscript revised January 12, 2011. The authors are with the School of Information and Communication Engineering, Inha University, Incheon, 402-751, Republic of Korea. a) E-mail: hhlee@inha.ac.kr DOI: 10.1587/transcom.E94.B.1332

This demand for ever higher data rates makes it necessary to devise very high-speed Forward Error Correction (FEC) architectures. Reed-Solomon (RS) codes have been adopted in WPAN systems as an FEC scheme [3], [4], and several multi-gigabit RS decoders have been reported. Parallel processing is an effective way to obtain high throughput in a hardware design. The one-shot Reed-Solomon encoder/decoder scheme [5], [6], which is based on a parallel combinational circuit, is a representative example of a high-throughput RS decoder. In this paper, we present a four-parallel RS(240,224) encoder/decoder architecture for mmWave WPAN systems, especially the ECMA standard.
Four-parallel processing is used to achieve 12-Gbps data throughput. In addition, a folded Degree-Computationless Modified Euclidean (fDCME) architecture is applied to the key equation solver (KES) block to reduce hardware complexity.

This paper is organized as follows. Section 2 presents the proposed four-parallel RS encoder architecture. Section 3 describes the key ideas applied to the four-parallel RS decoder design, especially those for achieving high throughput and reduced hardware complexity: a four-parallel syndrome computation block, a Chien search and error correction block, and a pipelined fDCME architecture. Section 4 gives implementation results and a performance comparison. Finally, conclusions are provided in Sect. 5.

2. Four-Parallel Reed-Solomon Encoder

Systematic RS encoding produces the codeword polynomial in Eq. (1), which comprises message symbols followed by parity symbols. The message polynomial $M(x)$ is multiplied by $x^{n-k}$ and then added to the parity polynomial $P(x)$. Given the generator polynomial $G(x)$ in Eq. (2), the parity polynomial $P(x)$ can be written as Eq. (3).

$$U(x) = x^{n-k} M(x) + P(x) \tag{1}$$

$$G(x) = (x - \alpha^{0})(x - \alpha^{1}) \cdots (x - \alpha^{14})(x - \alpha^{15}) \tag{2}$$

$$P(x) = x^{n-k} M(x) \bmod G(x) \tag{3}$$

To apply the four-parallel structure, Eq. (3) must be reformulated. $M(x)$ consists of 224 symbols, which is a multiple of four. As a result, the four-parallel form of $P(x)$ can be rewritten as Eq. (4), from which we derive the partial generator polynomials shown in Eq. (5).

$$\begin{aligned} P(x) = {} & \Big[\big\{\big[(m_{223}x^{19} + m_{222}x^{18} + m_{221}x^{17} + m_{220}x^{16}) \bmod G(x)\big]\,x^{4} \\ & + \big[(m_{219}x^{19} + m_{218}x^{18} + m_{217}x^{17} + m_{216}x^{16}) \bmod G(x)\big]\big\}\,x^{4} + \cdots\Big]\,x^{4} \\ & + (m_{3}x^{19} + m_{2}x^{18} + m_{1}x^{17} + m_{0}x^{16}) \bmod G(x) \end{aligned} \tag{4}$$

$$g_0(x) = x^{16} \bmod G(x), \quad g_1(x) = x^{17} \bmod G(x), \quad g_2(x) = x^{18} \bmod G(x), \quad g_3(x) = x^{19} \bmod G(x) \tag{5}$$

The proposed four-parallel RS encoder is shown in Fig. 1. Four-parallel message symbols are input from ports [M3, M2, M1, M0] during 56 clock cycles and multiplied by the partial generator polynomials $g_0(x)$ to $g_3(x)$ in Eq. (5). Finally, the parity symbols are generated through a Linear Feedback Shift Register (LFSR).

Fig. 1 Four-parallel Reed-Solomon encoder.

Copyright © 2011 The Institute of Electronics, Information and Communication Engineers

3. Four-Parallel Reed-Solomon Decoder

Generally, an RS decoder consists of three blocks: a syndrome computation block, a KES block, and a Chien search and error correction block. The RS decoder can be implemented using the modified Euclidean (ME) algorithm to solve the key equation. In this paper, we propose the fDCME algorithm, which is a reformulated version of our previous pipelined Degree-Computationless Modified Euclidean (pDCME) algorithm in [16]. While the pDCME algorithm is implemented with a systolic array architecture, the fDCME algorithm suits a folded architecture. Therefore, the proposed fDCME architecture provides much lower hardware complexity for the KES block. Both the syndrome computation block and the Chien search and error correction block are reformulated for high-throughput four-parallel processing. The proposed four-parallel RS decoder architecture is shown in Fig. 2. It includes a four-parallel syndrome computation block, an fDCME block, and a four-parallel Chien search and error correction block. This section explains each sub-block.

Fig. 2 Four-parallel Reed-Solomon decoder.

3.1 Four Parallel Syndrome Computation Block

The syndrome computation block calculates all syndromes $S_i$ ($0 \le i \le 15$) by substituting the roots of the generator polynomial $G(x)$ into the received codeword polynomial $R(x)$ in Eq. (6). As shown in Fig. 3, the proposed four-parallel syndrome computation block implements Eq. (7).

$$R(x) = r_{239}x^{239} + r_{238}x^{238} + \cdots + r_{1}x + r_{0} \tag{6}$$

$$\begin{aligned} S_i = R(\alpha^{i}) = {} & \big(\cdots\big(\big(r_{239}(\alpha^{i})^{3} + r_{238}(\alpha^{i})^{2} + r_{237}(\alpha^{i})^{1} + r_{236}\big)(\alpha^{i})^{4} \\ & + r_{235}(\alpha^{i})^{3} + r_{234}(\alpha^{i})^{2} + r_{233}(\alpha^{i})^{1} + r_{232}\big)(\alpha^{i})^{4} + \cdots\big)(\alpha^{i})^{4} \\ & + r_{3}(\alpha^{i})^{3} + r_{2}(\alpha^{i})^{2} + r_{1}(\alpha^{i})^{1} + r_{0} \end{aligned} \tag{7}$$

Fig. 3 Four parallel syndrome computation block.

The received codeword consists of 240 symbols, which is a multiple of four, so the proposed syndrome computation block calculates the syndromes in 60 clock cycles. At the first clock, the received symbols $(r_{239}, r_{238}, r_{237}, r_{236})$ are input in parallel, and the partial syndrome $r_{239}(\alpha^{i})^{3} + r_{238}(\alpha^{i})^{2} + r_{237}(\alpha^{i})^{1} + r_{236}$ is computed and stored in flip-flop (1). At the next clock cycle, the content of flip-flop (1) is multiplied by $(\alpha^{i})^{4}$ and then added to $r_{235}(\alpha^{i})^{3} + r_{234}(\alpha^{i})^{2} + r_{233}(\alpha^{i})^{1} + r_{232}$. This iterative process runs for 60 clock cycles, after which the syndromes $S_i$ are stored in flip-flop (1). Multiplexers (3) and (4) select input 1 at every 60th clock cycle, and the syndromes $S_i$ are shifted to flip-flop (2). Finally, the syndromes $S_0, S_1, \ldots, S_{15}$ are output serially to the KES block, and new syndromes can be computed in the syndrome cells.

3.2 Key Equation Solver Block

The KES block obtains the error locator polynomial $\sigma(x)$ and the error value polynomial $\omega(x)$ by solving the key equation $\omega(x) = S(x)\sigma(x) \bmod x^{2t}$. The KES block is the most critical part in the design of RS decoders. KES architectures based on the modified Euclidean (ME) algorithm [7]-[9], [17], [18] or the Berlekamp-Massey (BM) algorithm [11], [12] have regular structures, but their hardware cost is very high because they require both a systolic-array structure and degree computation units. The pDCME algorithm was therefore suggested as an alternative in [16], but the pDCME architecture still has high hardware complexity. While the pDCME architecture is implemented with $2t$ processing elements (PEs), the proposed fDCME algorithm, which employs the folding technique, consists of only 2 PEs with shift registers.

The proposed fDCME algorithm is described by pseudo-code. An array of two PEs performs the DCME algorithm repeatedly, after which the error locator polynomial $\sigma(x)$ and the error value polynomial $\omega(x)$ are obtained. Until the stage index reaches $t$, $a_{i-1}$ and $b_{i-1}$ are the leading coefficients of the polynomials $F_{i-1}(x)$ and $G_{i-1}(x)$, respectively. Either Step 2 (swap operation) or Step 3 (delaying the previous coefficients) is executed repeatedly until the loop index of Step 1 reaches 2. Step 2 is controlled by the stop signal (stop), the swap signal (sw) and the shift signal (sht).

$$G_0(x) = x^{2t}, \quad F_0(x) = \sum_{i=0}^{2t-1} S_i x^{i} \quad (S_i \neq 0,\ 0 \le i \le 2t-1) \tag{8}$$

$$G_0(x) = x^{2t}, \quad F_0(x) = \sum_{i=0}^{2t-1} S_i x^{i}, \quad \begin{cases} S_i = 0, & 2t-1-k \le i \le 2t-1,\ k \ge 0 \\ S_i \neq 0, & \text{otherwise} \end{cases} \tag{9}$$

$$G_0(x) = x^{2t}, \quad F_0(x) = \sum_{i=0}^{2t-1} S_i x^{i}, \quad \begin{cases} S_i = 0, & 2t-1-m \le i \le 2t-2,\ m \ge 1 \\ S_i \neq 0, & \text{otherwise} \end{cases} \tag{10}$$

The inputs of PE (1) and (2) follow several patterns, which correspond to Eqs. (8)-(10). These patterns are used to generate two control signals, sw and sht. The sw signal determines whether the two polynomial pairs $F_{i-1}(x), G_{i-1}(x)$ and $H_{i-1}(x), I_{i-1}(x)$ should be swapped. The sht signal selects either the polynomial arithmetic operation or the shift operation. In Eq. (8), $G_0(x)$ is $x^{2t}$, $F_0(x)$ is $S(x)$ multiplied by $x$, and the coefficient $S_{2t-1}$ is nonzero. Since both polynomials have degree 16, PE (1) executes the arithmetic operation. After the operation of PE (1), $G_1(x)$ and $F_1(x)$ both have degree 15. In Eq. (9), the coefficient $S_{2t-1}$ is zero, which means that the degree of the $F_0$ input is 15.
So PE (1) executes only the delay operation on the $G_0$ output so that the two inputs have the same degree, and PE (2) then executes the arithmetic operation because the degrees of its two inputs are equal. In Eq. (10), the coefficient $S_{2t-1}$ is nonzero but $S_{2t-2}$ is zero. In this case, PE (1) performs the same operation as for Eq. (8), but the degree of the output from $F_0$ is 14. Thus, PE (2) executes only the delay operation on the output $G_1$. Since the degree of $F_1(x)$ is less than the degree of $G_1(x)$, the two inputs are swapped before the PE (1) operation. When the stage index reaches $t$, the fDCME algorithm stops. The output $F_{16}(x)$ of PE (2) becomes the error value polynomial $\omega(x)$, and the output $H_{16}(x)$ becomes the error locator polynomial $\sigma(x)$.

Figure 4 shows a block diagram of the proposed fDCME architecture, which consists of two PEs and shift registers connected in a recursive loop. $F_{i-1}(x)$, $G_{i-1}(x)$, $H_{i-1}(x)$ and $I_{i-1}(x)$ generate the updated coefficients of each polynomial serially. The output of PE (2) is fed back into PE (1) in descending order. PE (1) and (2) each consist of a polynomial arithmetic structure, a control-signal generation block and a stop-signal generation block. One PE consists of four Galois-field (GF) multipliers, two GF adders and ten multiplexers. Each PE has three pipeline stages to improve the clock frequency significantly. Twelve-stage shift registers store the output of PE (2) at each recursive iteration step. Therefore, the fDCME block has eighteen pipeline stages (two PEs of three stages each plus twelve shift-register stages). PE (1) and (2) use pipelined fully-parallel GF multipliers to reduce the critical path delay and raise the clock frequency. The critical path delay of a PE is therefore $T_{inv} + T_{and2} + 3T_{mux2} + T_{ff}$, where $T_{inv}$, $T_{and2}$, and $T_{mux2}$ are the delays of the inverter, the 2-input AND gate, and the 2-to-1 multiplexer.
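To make the interface between the syndrome block and the KES block concrete, the following Python sketch models the four-parallel recursion of Eq. (7) and then checks the key equation $\omega(x) = S(x)\sigma(x) \bmod x^{2t}$ for a known error pattern. It is a behavioral software model under stated assumptions, not the fDCME hardware algorithm: the field polynomial 0x11D, the helper names (`gf_mul`, `syndromes_four_parallel`, `key_equation_from_errors`) and the idea of building $\sigma(x)$ directly from known error locations are illustrative choices that the paper does not specify.

```python
# Behavioral model only. GF(2^8) with primitive polynomial 0x11D is an
# ASSUMPTION; the paper does not state the field polynomial.
PRIM_POLY = 0x11D
EXP = [0] * 510          # EXP[k] = alpha^k, doubled so index sums need no modulo
LOG = [0] * 256
_v = 1
for _k in range(255):
    EXP[_k] = _v
    LOG[_v] = _k
    _v *= 2
    if _v & 0x100:
        _v ^= PRIM_POLY
for _k in range(255, 510):
    EXP[_k] = EXP[_k - 255]

N, T = 240, 8            # RS(240,224): 2t = 16 syndromes


def gf_mul(a, b):
    """Multiply two GF(2^8) elements using log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def syndromes_four_parallel(r):
    """r[j] is the coefficient r_j of R(x). Each inner step models one clock
    of Eq. (7): four symbols enter and the accumulator is scaled by (alpha^i)^4."""
    out = []
    for i in range(2 * T):
        a1, a2, a3, a4 = (EXP[(p * i) % 255] for p in (1, 2, 3, 4))
        acc = 0
        for clk in range(N // 4):                 # 60 clock cycles
            hi = N - 1 - 4 * clk                  # highest-order symbol this clock
            partial = (gf_mul(r[hi], a3) ^ gf_mul(r[hi - 1], a2)
                       ^ gf_mul(r[hi - 2], a1) ^ r[hi - 3])
            acc = gf_mul(acc, a4) ^ partial
        out.append(acc)
    return out


def syndromes_direct(r):
    """Reference: evaluate R(alpha^i) term by term."""
    out = []
    for i in range(2 * T):
        s = 0
        for j, c in enumerate(r):
            s ^= gf_mul(c, EXP[(i * j) % 255])
        out.append(s)
    return out


def poly_mul(p, q):
    """Polynomial product over GF(2^8); coefficient lists are low order first."""
    res = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            res[i + j] ^= gf_mul(a, b)
    return res


def key_equation_from_errors(errors):
    """errors maps position m_l -> value e_ml (hypothetical test input).
    Returns (S, sigma, omega) with sigma built from the KNOWN locations and
    omega = S * sigma mod x^(2t)."""
    r = [0] * N                                   # all-zero codeword plus errors
    for pos, val in errors.items():
        r[pos] = val
    S = syndromes_four_parallel(r)
    sigma = [1]
    for pos in errors:
        sigma = poly_mul(sigma, [1, EXP[pos % 255]])   # (1 - x X_l); minus equals plus
    omega = poly_mul(S, sigma)[:2 * T]
    return S, sigma, omega
```

With $v$ errors, every coefficient of `omega` at index $v$ or above should vanish, which is the degree property a KES implementation has to reproduce without knowing the error locations in advance.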

Fig. 4 Block diagram of folded degree-computationless modified Euclidean (fDCME) architecture.

Fig. 5 Block diagram of Chien search and error correction block, (a) Chien search block, and (b) Chien search cell.
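The four-parallel Chien search cell of Fig. 5(b) can be modeled in software along the following lines. This is a minimal Python sketch, not the authors' RTL: it assumes GF(2^8) with primitive polynomial 0x11D (not stated in the paper), uses hypothetical names (`chien_search_four_parallel`, `forney_value`, `polys_from_known_errors`), and builds $\sigma(x)$ and $\omega(x)$ from a known test error pattern instead of running a KES block. Each of the 60 modeled clock cycles tests four consecutive evaluation points, starting at $\alpha^{16}$ for $r_{239}$, and the error value is computed with $\sigma_{odd}(x) = x\sigma'(x)$ so that no explicit derivative is needed.

```python
# Behavioral model only. GF(2^8) with primitive polynomial 0x11D is an
# ASSUMPTION; the paper does not state the field polynomial.
PRIM_POLY = 0x11D
EXP = [0] * 510          # EXP[k] = alpha^k, doubled so index sums need no modulo
LOG = [0] * 256
_v = 1
for _k in range(255):
    EXP[_k] = _v
    LOG[_v] = _k
    _v *= 2
    if _v & 0x100:
        _v ^= PRIM_POLY
for _k in range(255, 510):
    EXP[_k] = EXP[_k - 255]


def gf_mul(a, b):
    """Multiply two GF(2^8) elements using log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_inv(a):
    """Inverse of a nonzero element (the paper stores inverses in a ROM)."""
    return EXP[255 - LOG[a]]


def poly_eval(p, x):
    """Horner evaluation; coefficient list is low order first."""
    acc = 0
    for c in reversed(p):
        acc = gf_mul(acc, x) ^ c
    return acc


def poly_mul(p, q):
    """Polynomial product over GF(2^8)."""
    res = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            res[i + j] ^= gf_mul(a, b)
    return res


def chien_search_four_parallel(sigma):
    """Return positions m (239 down to 0) with sigma(alpha^(-m)) == 0.
    reg[k] holds sigma_k * alpha^(k*p), where p is the base exponent of the clock."""
    reg = [gf_mul(c, EXP[(16 * k) % 255]) for k, c in enumerate(sigma)]
    found = []
    for clk in range(60):
        for j in range(4):                        # four locations per clock
            val = 0
            for k, term in enumerate(reg):
                val ^= gf_mul(term, EXP[(j * k) % 255])
            if val == 0:
                found.append(239 - 4 * clk - j)
        # advance every term by alpha^(4k) for the next clock
        reg = [gf_mul(term, EXP[(4 * k) % 255]) for k, term in enumerate(reg)]
    return found


def forney_value(sigma, omega, pos):
    """Error value Y_l = omega(X^-1) / sigma_odd(X^-1), per Eq. (15)."""
    x_inv = EXP[(255 - pos) % 255]
    sigma_odd = [c if k % 2 else 0 for k, c in enumerate(sigma)]
    return gf_mul(poly_eval(omega, x_inv), gf_inv(poly_eval(sigma_odd, x_inv)))


def polys_from_known_errors(errors, two_t=16):
    """Test helper (hypothetical): sigma and omega for a known error pattern."""
    S = [0] * two_t
    sigma = [1]
    for pos, val in errors.items():
        for i in range(two_t):
            S[i] ^= gf_mul(val, EXP[(pos * i) % 255])
        sigma = poly_mul(sigma, [1, EXP[pos % 255]])
    return sigma, poly_mul(S, sigma)[:two_t]
```

A quick check is to pick a few error positions and values, call `polys_from_known_errors`, and confirm that `chien_search_four_parallel` returns the same positions and `forney_value` the same values.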

3.3 Four-Parallel Chien Search and Error Correction Block

After the KES block operation, the error locator polynomial $\sigma(x)$ and the error value polynomial $\omega(x)$ are obtained. Letting $X_l = \alpha^{m_l}$ and $Y_l = e_{m_l}$, Eq. (11) can be transformed into Eq. (12), where $X_l$ and $Y_l$ are the possible error location and the possible error value, respectively. The Chien search algorithm can be implemented using Eq. (13). The roots of $\sigma(x)$ are the inverses of the error locations. For the RS(240,224) code, $\sigma(\alpha^{16}) = 0$ means that $r_{239}$ was corrupted by an error. $\alpha^{16}$ is substituted into $\sigma(x)$ first because the first symbol of the received codeword is $r_{239}$ in the RS(240,224) code. The error value polynomial can be derived as Eq. (14). Finally, the error value can be computed using Eq. (15), where $\sigma'(x)$ is the derivative of $\sigma(x)$. Rewriting $\sigma(x)$ as the sum of the even terms $\sigma_{even}(x)$ and the odd terms $\sigma_{odd}(x)$, we have $\sigma_{odd}(x) = x\,\sigma'(x)$. Therefore, the Chien search and error correction block is implemented as shown in Fig. 5.

$$S_i = r(\alpha^{i}) = e(\alpha^{i}) = \sum_{l=1}^{v} e_{m_l}\,\alpha^{m_l i} \tag{11}$$

$$S(x) = \sum_{i=0}^{15} S_i x^{i} = \sum_{i=0}^{15} \sum_{l=1}^{v} Y_l X_l^{i} x^{i} \tag{12}$$

$$\sigma(x) = (1 - xX_1)(1 - xX_2) \cdots (1 - xX_v) \tag{13}$$

$$\omega(x) = S(x)\,\sigma(x) \bmod x^{2t} \;(t = 8) = \sum_{l=1}^{v} Y_l \prod_{n=1,\,n \neq l}^{v} (1 - xX_n) \tag{14}$$

$$Y_l = \frac{\omega(X_l^{-1})}{(X_l^{-1})\,\sigma'(X_l^{-1})} \tag{15}$$

The division is implemented with a 256 × 8 ROM in which the inverses of the field elements are stored. As shown in Fig. 5(b), the serial Chien search cell is expanded into a four-parallel Chien search cell, because the Chien search and Forney algorithm block must evaluate four error locations at each clock cycle. Because the RS(240,224) code is a shortened version of the RS(255,239) code, the first 16 symbols do not have to be computed. Thus, at the first clock cycle, $\sigma(\alpha^{16})$, $\sigma(\alpha^{17})$, $\sigma(\alpha^{18})$, $\sigma(\alpha^{19})$ are calculated, and at the last clock cycle, $\sigma(\alpha^{252})$, $\sigma(\alpha^{253})$, $\sigma(\alpha^{254})$, $\sigma(\alpha^{255})$ are calculated.

4. Timing Chart and Performance Comparison

4.1 Timing Chart

The timing charts of the proposed RS encoder and decoder are shown in Figs. 6(a) and (b), respectively. The proposed RS encoder has a latency of only one clock cycle and generates encoded codewords (ECWD A to ECWD D) continuously. The encoder start signal (RSEST) is asserted for one clock cycle, after which the effective codeword symbols (UCWD A to UCWD D) are entered. As shown in Fig. 6(b), when the start signal of the RS decoder (RSDST) is input, the decoder accepts a received codeword (RCWD A to RCWD D) at the same time. The proposed RS decoder has a latency of 209 clock cycles. When the proposed RS decoder outputs the corrected codeword (CCWD A to CCWD D), the error count signal (ERRCNT) for the first codeword is output at the 266th clock if there are errors; otherwise the (FAIL) signal is output.

Fig. 6 Timing chart of (a) four-parallel RS encoder, and (b) four-parallel RS decoder.

4.2 Performance Comparison

The proposed four-parallel RS encoder/decoder architecture was modeled in Verilog HDL and simulated to verify its functionality. After complete verification of the design functionality, it was synthesized using appropriate time and area constraints. Both the simulation and synthesis steps were carried out using the SYNOPSYS synthesis tool and 90 nm CMOS technology optimized for a 1.2 V supply voltage. According to the synthesis results, the proposed four-parallel RS decoder requires a total of 23,920 gates, including the memory block. The post-layout simulation shows that the proposed four-parallel RS decoder architecture can operate at a clock frequency of 400 MHz and has a data processing rate of 12.8-Gbps.

Table 1 Comparison results of various RS decoder architectures.

Table 1 shows the comparison results of various RS decoder architectures. For the KES block, the proposed fDCME architecture provides much lower hardware complexity than other KES architectures based on the ME algorithm.
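The reported 12.8-Gbps figure is consistent with simple arithmetic on the stated clock frequency and parallelism. The short Python sketch below reproduces it; the 8-bit symbol width is an assumption inferred from the RS(255,239) parent code and the 256 × 8 inverse ROM, and the function names are hypothetical.

```python
CLOCK_HZ = 400_000_000   # clock frequency reported from post-layout simulation
PARALLELISM = 4          # symbols processed per clock cycle
SYMBOL_BITS = 8          # ASSUMPTION: GF(2^8) symbols, inferred rather than stated
N, K = 240, 224          # RS(240,224) codeword and message lengths in symbols


def throughput_bps(clock_hz, parallelism, symbol_bits):
    """Raw coded-data rate accepted by the decoder, in bits per second."""
    return clock_hz * parallelism * symbol_bits


def clocks_per_block(symbols, parallelism):
    """Clock cycles needed to stream one block of symbols."""
    return symbols // parallelism


print(throughput_bps(CLOCK_HZ, PARALLELISM, SYMBOL_BITS) / 1e9, "Gbps")
print(clocks_per_block(N, PARALLELISM), "clocks per received codeword")
print(clocks_per_block(K, PARALLELISM), "clocks per message block")
```

The two block lengths correspond to the 60-clock syndrome computation of Sect. 3.1 and the 56-clock message input of the encoder in Sect. 2.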
For the purpose of comparison, we used the Technology-Scaled Normalized Throughput (TSNT) in [13]. The TSNT is the throughput per gate normalized to a 0.13 μm technology:

$$\mathrm{TSNT} = \frac{\text{Throughput Rate}}{\text{\# of Total Gates}} \times \frac{\text{Tech.}}{0.13\ \mu\mathrm{m}}$$

The throughput rate and the TSNT index of our design are the highest among all the compared architectures. The implementation results show that the proposed four-parallel RS decoder architecture has a much higher data processing rate and lower hardware complexity than the conventional ME algorithm based RS decoder architectures.

5. Conclusions

This paper presented the design and implementation of a four-parallel RS encoder/decoder for high-rate WPAN systems. Four-parallel processing is used to achieve 12-Gbps data throughput. A high-speed low-complexity fDCME block is applied in the KES block. Four-way parallelization of the syndrome computation and Chien search blocks allows the inputs to be received at very high data rates and the outputs to be delivered at correspondingly high rates with minimum delay. As a result, the proposed four-parallel RS decoder architecture has a much higher data processing rate and lower hardware complexity than the conventional RS decoder architectures. The proposed RS decoder can be applied in FEC devices for next-generation high-rate WPAN systems.

Acknowledgments

This work was supported by Inha University.

References

[1] J. Laskar, "60 GHz CMOS low power single chip radio: The intersection of gaming and connectivity," Radio and Wireless Symposium 2009 (RWS 2009), pp.654-657, Jan. 2009.
[2] L.L. Yang, "60 GHz: Opportunity for gigabit WPAN and WLAN convergence," ACM SIGCOMM Computer Communication Review, vol.39, no.1, pp.56-61, Jan. 2009.
[3] Wireless medium access control (MAC) and physical layer (PHY) specifications for high rate wireless personal area networks (WPANs): Amendment 2: Millimeter-wave based alternative physical layer extension, IEEE P802.15.3c/D00, 2008.
[4] High rate 60 GHz PHY, MAC and HDMI PAL, Standard ECMA-387, 1st ed., Dec. 2008.
[5] S. Morioka and Y. Katayama, "Design methodology for a one-shot Reed-Solomon encoder and decoder," IEEE International Conference on Computer Design (ICCD '99), pp.60-67, Oct. 1999.
[6] T. Yamane and Y. Katayama, "An ultra-fast Reed-Solomon decoder soft-IP with 8-error correcting capability," IEEE Conference on Multimedia & Expo (ICME '03), vol.3, pp.445-448, July 2003.
[7] H.M. Shao, T.K. Truong, L.J. Deutsch, J.H. Yuen, and I.S. Reed, "A VLSI design of a pipeline Reed-Solomon decoder," IEEE Trans. Comput., vol.C-34, no.5, pp.393-403, May 1985.
[8] L. Song, M.-L. Yu, and M.S. Shaffer, "10 and 40-Gb/s forward error correction devices for optical communications," IEEE J. Solid-State Circuits, vol.37, no.11, pp.1565-1573, Nov. 2002.
[9] H. Lee, "High-speed VLSI architecture for parallel Reed-Solomon decoder," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.11, no.2, pp.288-294, April 2003.
[10] Q. Hu, Z. Wang, J. Zhang, and J. Xiao, "Low complexity parallel Chien search architecture for RS decoder," International Symposium on Circuits and Systems 2005 (ISCAS 2005), pp.340-343, May 2005.
[11] D.V. Sarwate and N.R. Shanbhag, "High-speed architectures for Reed-Solomon decoders," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.9, no.5, pp.641-655, Oct. 2001.
[12] M.D. Shieh, Y.K. Lu, S.M. Chung, and J.H. Chen, "Design and implementation of efficient Reed-Solomon decoders for multi-mode applications," International Symposium on Circuits and Systems 2006 (ISCAS 2006), pp.289-292, May 2006.
[13] H.Y. Hsu, A.Y. Wu, and J.C. Yeo, "Area-efficient VLSI design of Reed-Solomon decoder for 10 GBase-LX4 optical communication systems," IEEE Trans. Circuits Syst. II, vol.53, no.11, pp.1245-1249, Nov. 2006.
[14] J.H. Baek and M.H. Sunwoo, "New degree computationless modified Euclid algorithm and architecture for Reed-Solomon decoder," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.14, no.8, pp.915-920, Aug. 2006.
[15] J.H. Baek and M.H. Sunwoo, "Simplified degree computationless modified Euclid's algorithm and its architecture," IEEE International Symposium on Circuits and Systems (ISCAS 2007), pp.905-908, May 2007.
[16] S. Lee and H. Lee, "A high-speed pipelined degree-computationless modified Euclidean algorithm architecture for Reed-Solomon decoders," IEICE Trans. Fundamentals, vol.E91-A, no.3, pp.830-835, March 2008.
[17] S. Lee, C.-S. Choi, and H. Lee, "Two-parallel Reed-Solomon based FEC architecture for optical communications," IEICE Electronics Express, vol.5, no.10, pp.374-380, May 2008.
[18] C.-S. Choi and H. Lee, "High-speed low-complexity three-parallel Reed-Solomon decoder for 6-Gbps mmWave WPAN systems," European Conference on Circuit Theory and Design 2009 (ECCTD '09), pp.515-518, Aug. 2009.

Chang-Seok Choi received the B.S. degree in information & communication engineering from Hanshin University in 2005 and the M.S. degree in information & communication engineering from Inha University, Incheon, Korea, in 2007. He is currently working toward the Ph.D. degree at Inha University. His research interests include VLSI design and implementation for communication systems, especially forward error correction architectures.

Hyo-Jin Ahn received the B.S. degree in IT electronic engineering from Daejeon University in 2005 and the M.S. degree in information & communication engineering from Inha University, Incheon, Korea, in 2010. His research interests include VLSI design and implementation for communication systems, especially forward error correction architectures.

Hanho Lee received the Ph.D. and M.S. degrees, both in Electrical & Computer Engineering, from the University of Minnesota, Minneapolis, in 2000 and 1996, respectively. In 1999, he was a Member of Technical Staff-1 at Lucent Technologies, Bell Labs, Holmdel, NJ. From April 2000 to August 2002, he was a Member of Technical Staff at Lucent Technologies (Bell Labs Innovations), Allentown, where he was responsible for the development of VLSI architectures and the implementation of high-performance DSP multiprocessor SoCs for wireless infrastructure systems. From August 2002 to August 2004, he was an assistant professor at the Department of Electrical & Computer Engineering, University of Connecticut, Storrs. Since August 2004, he has been with the School of Information and Communication Engineering, Inha University, Incheon, Korea, where he is presently a Professor. His research interests include design of VLSI circuits and systems for communications, System-on-a-Chip (SoC) design, reconfigurable architecture, and forward error correction coding.