PAPER A High-Speed Low-Complexity Time-Multiplexing Reed-Solomon-Based FEC Architecture for Optical Communications


IEICE TRANS. FUNDAMENTALS, VOL.E95-A, NO.12 DECEMBER 2012, p.2424

Jeong-In PARK, Nonmember and Hanho LEE a), Member

SUMMARY A high-speed low-complexity time-multiplexing Reed-Solomon-based forward error correction architecture based on the pipelined truncated inversionless Berlekamp-Massey algorithm is presented in this paper. The proposed architecture has very high speed and very low hardware complexity compared with conventional Reed-Solomon-based forward error correction architectures. Hardware complexity is improved by employing a truncated inversionless Berlekamp-Massey algorithm. A high-speed and high-throughput data rate is facilitated by employing a three-parallel processing pipelining technique and a modified syndrome computation block. The time-multiplexing method for the pipelined truncated inversionless Berlekamp-Massey architecture is used in the parallel Reed-Solomon decoder to reduce hardware complexity. The proposed architecture has been designed and implemented with 90-nm CMOS technology. Synthesis results show that the proposed 16-channel Reed-Solomon-based forward error correction architecture requires 417,600 gates and can operate at 640 MHz to achieve a throughput of 240 Gb/s. The proposed architecture can be readily applied to Reed-Solomon-based forward error correction devices for next-generation short-reach optical communications.
key words: Reed-Solomon, forward error correction, time-multiplexing, truncated inversionless Berlekamp-Massey, optical communications

1. Introduction

Demand for 100 Gigabit Ethernet (GbE) devices is increasing dramatically where data traffic converges, such as in high-performance computing, servers, data centers, and enterprise networks. In the future, demand will grow for bandwidth well beyond 100 GbE.
For this reason, the IEEE 802.3ba task force approved IEEE Std 802.3ba-2010 for 40 Gb/s and 100 Gb/s Ethernet [1]. These very high-speed data transmission techniques, developed for fiber optic networking systems, have necessitated high-speed forward error correction (FEC) architectures to meet the continuing demand for ever higher data rates. Also, high-speed (40 Gb/s and beyond) short-reach optical communication systems commonly use the Reed-Solomon (RS) (255,239) code. Specifically, the ITU-T has discussed standardization of a hard-decision FEC for a 100-Gb/s optical transport network (OTN) [2]. As a result, the RS(255,239) code has become one of the candidates for 100-Gb/s short-reach optical communication systems.

The very high-speed data transmission techniques for optical communications have necessitated the implementation of high-speed low-complexity RS-based FEC architectures to meet the continuing demand for ever higher data rates (100 Gb/s and beyond). Typical high-speed parallel RS-based FEC architectures have adopted the modified Euclidean (ME) architecture to achieve the required high throughput rate [3]-[7]. However, their hardware utilization is inefficient, and they incur a huge hardware cost to reach the very high data rates of optical systems. RS decoder architectures using a folded ME architecture were also proposed to achieve efficient hardware utilization and low hardware complexity [7], [8]. However, they require very long latency.

Manuscript received June 1, 2011. Manuscript revised June 19, 2012.
The authors are with the School of Information and Communication Engineering, Inha University, Incheon, 402-751, Korea.
a) E-mail: hhlee@inha.ac.kr
DOI: 10.1587/transfun.E95.A.2424

In this paper, we present a three-parallel RS decoder architecture and a high-speed low-complexity time-multiplexing RS-based FEC architecture using a truncated inversionless Berlekamp-Massey (TiBM) algorithm for next-generation short-reach optical systems. We describe the key ideas applied to the 16-channel time-multiplexing RS-based FEC architecture design, especially those related to achieving high throughput, low complexity, and low latency. The synthesis results show that, compared with related work, the proposed RS-based FEC architecture has very low hardware complexity and delivers a very high throughput rate.

The rest of this paper is organized as follows. Section 2 presents the three-parallel RS decoder with a modified syndrome computation block and the pipelined TiBM (pTiBM) architecture. Section 3 presents the high-speed and low-complexity 16-channel time-multiplexing RS-based FEC architecture. The performance evaluation and comparisons with related work are described in Sect. 4. Finally, conclusions are provided in Sect. 5.

Copyright (c) 2012 The Institute of Electronics, Information and Communication Engineers

2. Three-Parallel Reed-Solomon Decoder

The RS decoder consists of three main blocks: the syndrome computation block, the key equation solver (KES) block, and the Chien search and error evaluation (CSEE) block, as shown in Fig. 1. Generally, the RS decoder can be implemented with a Berlekamp-Massey (BM) algorithm or an ME algorithm to solve the key equation. In this section, we propose a three-parallel RS decoder using a modified syndrome computation block and the pTiBM architecture, which provides high speed and low hardware complexity. The modified syndrome computation block and CSEE block are reformulated to minimize the critical path delay.

Fig. 1 Three-parallel Reed-Solomon decoder.

Fig. 2 Modified three-parallel syndrome computation block.

2.1 Modified Three-Parallel Syndrome Computation Block

Let $C(x)$ and $R(x)$ be the codeword polynomial and the received polynomial, respectively. The transmitted polynomial can be corrupted by channel noise during transmission. Therefore, the received polynomial can be described as $R(x) = C(x) + E(x) = R_{n-1}x^{n-1} + \cdots + R_1 x + R_0$, where $E(x)$ is the error polynomial. The first step in the decoding algorithm is to calculate the $2t$ syndromes $S_i$ $(0 \le i \le 2t-1)$, which are used to correct the correctable errors. Here $t$ is the error-correction capability. If all $2t$ syndromes $S_i$ $(0 \le i \le 2t-1)$ are zero, then the received polynomial $R(x)$ is a valid codeword $C(x)$, that is, no errors have occurred. The syndrome polynomial $S(x)$ is defined by (1) and (2). Equation (3) gives the syndrome in the form used for three-parallel processing:

$$S(x) = S_{15}x^{15} + S_{14}x^{14} + \cdots + S_1 x + S_0 \tag{1}$$

$$S_i = R(\alpha^i) = R_{254}\alpha^{254i} + R_{253}\alpha^{253i} + \cdots + R_1\alpha^{i} + R_0, \quad (i = 0, 1, 2, \ldots, 15) \tag{2}$$

$$S_i = R(\alpha^i) = ((\cdots(R_{254}\alpha^{2i} + R_{253}\alpha^{i} + R_{252})\alpha^{3i} + R_{251}\alpha^{2i} + R_{250}\alpha^{i} + R_{249})\alpha^{3i} + \cdots)\alpha^{3i} + (R_2\alpha^{2i} + R_1\alpha^{i} + R_0) \tag{3}$$

The conventional three-parallel syndrome computation block consists of $2t$ syndrome cells, which compute the $S_i$ values during 85 clock cycles. However, the critical path of the syndrome cell is increased if the syndrome computation block is implemented for three-parallel processing as shown in (3).
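As a rough software check of Eqs. (2) and (3), the sketch below evaluates one syndrome both directly and with the three-symbols-per-step Horner recursion, and counts the recursion steps. It is an illustration only, not the hardware described here: the field polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D) is an assumption, since the text does not state it, and the names `gf_mul`, `alpha_pow`, `syndrome_direct`, and `syndrome_three_parallel` are hypothetical.

```python
# Sketch under stated assumptions: GF(2^8) generated by 0x11D (assumed, not
# given in the paper); RS(255,239), so n = 255 and t = 8.
PRIM_POLY = 0x11D
N, T = 255, 8

# Antilog/log tables: EXP[k] = alpha^k and LOG[alpha^k] = k.
EXP = [0] * 510
LOG = [0] * 256
x = 1
for k in range(255):
    EXP[k] = x
    LOG[x] = k
    x <<= 1
    if x & 0x100:
        x ^= PRIM_POLY
for k in range(255, 510):
    EXP[k] = EXP[k - 255]


def gf_mul(a, b):
    """Multiply two GF(2^8) elements with the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def alpha_pow(e):
    """Return alpha^e for any non-negative integer exponent e."""
    return EXP[e % 255]


def syndrome_direct(r, i):
    """Eq. (2): S_i = sum over j of R_j * alpha^(i*j), with r[j] = R_j."""
    s = 0
    for j, rj in enumerate(r):
        s ^= gf_mul(rj, alpha_pow(i * j))
    return s


def syndrome_three_parallel(r, i):
    """Eq. (3): Horner recursion in alpha^(3i), three symbols per step.

    Returns (S_i, number_of_steps); each step models one clock cycle.
    """
    a1, a2, a3 = alpha_pow(i), alpha_pow(2 * i), alpha_pow(3 * i)
    acc, steps = 0, 0
    for j in range(N - 1, 1, -3):          # j = 254, 251, ..., 2
        group = gf_mul(r[j], a2) ^ gf_mul(r[j - 1], a1) ^ r[j - 2]
        acc = gf_mul(acc, a3) ^ group
        steps += 1
    return acc, steps
```

Calling `syndrome_three_parallel(r, i)` on any 255-symbol list `r` should return the same value as `syndrome_direct(r, i)` together with a step count of 85, matching the 85 clock cycles quoted above. The extra multiply-and-add inside each step is what lengthens the critical path of the conventional cell.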
To reduce the critical path, the syndrome polynomial can be separated into even terms and odd terms as follows:

$$S_i(\alpha^i) = R_{even}(\alpha^i) + R_{odd}(\alpha^i) \tag{4}$$

$$= (R_{254}\alpha^{254i} + R_{252}\alpha^{252i} + \cdots + R_2\alpha^{2i} + R_0) + (R_{253}\alpha^{253i} + R_{251}\alpha^{251i} + \cdots + R_1\alpha^{i}) \tag{5}$$

$$= (R_{254}\alpha^{2i \cdot 127} + R_{252}\alpha^{2i \cdot 126} + \cdots + R_2\alpha^{2i} + R_0) + (R_{253}\alpha^{2i \cdot 126} + R_{251}\alpha^{2i \cdot 125} + \cdots + R_1)\alpha^{i} \tag{6}$$

$$= [((\cdots(R_{254}\alpha^{2i} + R_{253}\alpha^{i} + R_{252})\alpha^{6i} + R_{248}\alpha^{2i} + R_{247}\alpha^{i} + R_{246})\alpha^{6i} + \cdots + (R_8\alpha^{2i} + R_7\alpha^{i} + R_6))\alpha^{6i} + (R_2\alpha^{2i} + R_1\alpha^{i} + R_0)] + [((\cdots(R_{251}\alpha^{2i} + R_{250}\alpha^{i} + R_{249})\alpha^{6i} + R_{245}\alpha^{2i} + R_{244}\alpha^{i} + R_{243})\alpha^{6i} + \cdots)\alpha^{6i} + (R_5\alpha^{2i} + R_4\alpha^{i} + R_3)]\alpha^{3i} \tag{7}$$

If the three-parallel syndrome computation block is reformulated according to (7), pipelining is possible without any additional latency. Figure 2 shows the modified three-parallel syndrome computation block. The even and odd terms are computed alternately during 84 clock cycles. At the final 85th clock cycle, the syndrome is obtained by multiplying the odd term by $\alpha^{3i}$. The critical path of the proposed syndrome computation block is reduced to $3T_{xor} + T_{ff}$ from the $6T_{xor} + T_{mux} + T_{ff}$ of the conventional syndrome computation block, where $3T_{xor}$ is the critical path delay of the constant Galois-field (GF) multiplier.

2.2 pTiBM Architecture

The low-complexity TiBM architecture for the KES block was presented in our previous paper [9]; it removes the unnecessary $t-1$ PEs of the conventional RiBM architecture [10]. The TiBM algorithm can be described by pseudocode as follows:

The TiBM Algorithm
Initialization: $\delta_{2t+1}(0) = 1$; $\delta_{2t}(0) = 0$; $k(0) = 0$; $\gamma(0) = 1$;
Input: $\delta_i(0) = \theta_i(0) = S_i$, $(i = 0, \ldots, 2t-1)$.
for ($r = 0$, $n = 0$; $r < 2t$; $r$++)
    Step TiBM.1
        if $r = 2m$ $(m = 0, 1, \ldots, t-1)$ or $r = 2t-1$ then
            $A_i(r) = \delta_{i+1}(r)$ $(i = 0, 1, \ldots, 2t+1)$
            $B_i(r) = \theta_i(r)$ $(i = 0, 1, \ldots, 2t+1)$
        else
            $A_i(r) = \delta_i(r)$ $(i = 2t+1, 2t, \ldots, 2t-n)$
            $A_i(r) = 0$ $(i = 2t-1-n)$
            $A_i(r) = \delta_{i+1}(r)$ $(i = 0, 1, \ldots, 2t-2-n)$
            $B_i(r) = \theta_{i-1}(r)$ $(i = 2t+1, 2t, \ldots, 2t-n)$
            $B_i(r) = \theta_i(r)$ $(i = 0, 1, \ldots, 2t-1-n)$
            $n = n + 1$
    Step TiBM.2
        $\delta_i(r+1) = \gamma(r) \cdot A_i(r) - \delta_0(r) \cdot B_i(r)$, $(i = 0, \ldots, 2t+1)$
    Step TiBM.3
        if $\delta_0(r) \ne 0$ and $k(r) \ge 0$ then
            $\theta_i(r+1) = A_i(r)$, $(i = 0, 1, \ldots, 2t+1)$
            $\gamma(r+1) = \delta_0(r)$
            $k(r+1) = -k(r) - 1$
        else
            $\theta_i(r+1) = B_i(r)$, $(i = 0, 1, \ldots, 2t+1)$
            $\gamma(r+1) = \gamma(r)$
            $k(r+1) = k(r) + 1$
Output: $\lambda_i(2t) = \delta_{t+i}(2t)$, $(i = 0, 1, \ldots, t)$; $\omega_i(2t) = \delta_i(2t)$, $(i = 0, 1, \ldots, t-1)$.

Figure 3 shows the block diagram of the proposed pTiBM architecture. In the pTiBM architecture, the original $t+1$ PE1s employed in the conventional RiBM architecture are used as PE1$_0$ to PE1$_t$, and $t+1$ modified PE2s are used as PE2$_{t+1}$ to PE2$_{2t+1}$. Some zero values are lost because of the truncated $t-1$ PE1s. Thus, MUX(1) and MUX(2) were added to the modified PE2s to supply zero values at the appropriate time. Also, the proposed pTiBM architecture can be pipelined for high speed, which means that a time-multiplexing method can be used efficiently in the multi-channel RS-based FEC architecture. The time-multiplexing method is described in Sect. 3.

The pTiBM architecture consists of PE1, PE2, and Control Units 1 and 2. Because $t-1$ PE1s are removed, control circuits are needed to adjust MUX(1) and MUX(2) in PE2 and to propagate $\delta_i(r)$ and $\theta_i(r)$ correctly. Control Unit 1 generates control signals such as MC(r), $\gamma(r)$, and $\delta_0(r)$. Control Unit 2 generates the selection signals of MUX(1) and MUX(2) in the PE2s and can be implemented as a finite state machine (FSM). Each selection signal of the 9 MUX(1)s is represented by 2 bits, taking the values 0 (00), 1 (01), and 2 (10), so the selection signals of the 9 MUX(1)s total 18 bits. Each selection signal of the 9 MUX(2)s is represented by 1 bit, either 0 or 1, for a total of 9 bits. Therefore, the total selection signal for the MUX(1)s and MUX(2)s is 27 bits, as shown in Fig. 3.

The FSM starts its operation with a reset signal, and the input w repeats periodically with 0, x, 0, x, 0, x, 0, x, 0, x, 0, x, 0, x, 1, where x is don't care. MUX signal Gen. 1 and MUX signal Gen. 2 generate 27-bit selection signals. The output of MUX signal Gen. 1 is formed by concatenating 18 bits for MUX(1) and 9 bits for MUX(2). The former 18 bits shift to the right by one entry every 2 clock cycles, and 2 is inserted at the very left of Control Unit 2, as shown in Fig. 3. Likewise, the latter 9 bits shift to the right every 2 clock cycles, and 1 is inserted at the very left. For instance, the 27-bit initial selection signals (2, 2, 0, 1, 1, 1, 1, 1, 1) and (1, 1, 0, 0, 0, 0, 0, 0, 0) are updated to (2, 2, 2, 0, 1, 1, 1, 1, 1) and (1, 1, 1, 0, 0, 0, 0, 0, 0) after 2 clock cycles. The next selection signals are then (2, 2, 2, 2, 0, 1, 1, 1, 1) and (1, 1, 1, 1, 0, 0, 0, 0, 0). MUX signal Gen. 2 always outputs fixed values. Finally, the final 27-bit selection signals are selected by the FSM. If the selection signals of MUX(1) and MUX(2) are adjusted using this method, the error locator polynomial $\lambda(x)$ and error evaluator polynomial $\omega(x)$ can be obtained correctly using only $2t+2$ PEs after $2t$ iterations. The PE architecture consists of 3-stage pipelined GF multipliers, adders, and D-FFs. The critical path delay of the proposed KES block is $2T_{xor} + T_{ff}$.

Fig. 3 Proposed pTiBM architecture and its sub-blocks such as original PEs, modified PE2s, and control units.

Fig. 4 Pipelined three-parallel Chien search block and cell.
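For orientation, the recursion that the TiBM listing truncates can be modeled in software. The sketch below implements the baseline RiBM recursion of [10] with all 3t+1 coefficients, not the truncated TiBM data flow or its MUX control, and then locates the errors by exhaustive root search. The field polynomial 0x11D and every function name (`gf_mul`, `syndromes`, `ribm`, `poly_eval`, `error_positions`) are assumptions made for illustration.

```python
# Sketch under stated assumptions: GF(2^8) generated by 0x11D (assumed, not
# given in the paper); t = 8 as in RS(255,239).
PRIM_POLY = 0x11D
T = 8

EXP, LOG = [0] * 510, [0] * 256
x = 1
for k in range(255):
    EXP[k], LOG[x] = x, k
    x <<= 1
    if x & 0x100:
        x ^= PRIM_POLY
for k in range(255, 510):
    EXP[k] = EXP[k - 255]


def gf_mul(a, b):
    """Multiply two GF(2^8) elements with the log tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def syndromes(errors, t=T):
    """S_i = sum of e * alpha^(i*p) over errors {position p: value e}, i = 0..2t-1."""
    out = []
    for i in range(2 * t):
        s = 0
        for pos, val in errors.items():
            s ^= gf_mul(val, EXP[(i * pos) % 255])
        out.append(s)
    return out


def ribm(synd, t=T):
    """Baseline RiBM recursion: returns (lambda coefficients, high-order omega)."""
    delta = list(synd) + [0] * t + [1]        # delta_0 .. delta_3t
    theta = list(delta)
    gamma, k = 1, 0
    for _ in range(2 * t):
        d0 = delta[0]
        shifted = delta[1:] + [0]             # delta_{i+1}(r), delta_{3t+1} = 0
        # In GF(2^m) subtraction is XOR.
        new_delta = [gf_mul(gamma, shifted[i]) ^ gf_mul(d0, theta[i])
                     for i in range(3 * t + 1)]
        if d0 != 0 and k >= 0:
            theta, gamma, k = shifted, d0, -k - 1
        else:
            k += 1
        delta = new_delta
    return delta[t:2 * t + 1], delta[:t]


def poly_eval(coeffs, point):
    """Evaluate sum of coeffs[i] * point^i by Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, point) ^ c
    return acc


def error_positions(lam):
    """Position p is in error when lambda(alpha^-p) = 0 (exhaustive search)."""
    return [p for p in range(255) if poly_eval(lam, EXP[(255 - p) % 255]) == 0]
```

For an error pattern such as `{3: 0x51, 77: 0x0A, 200: 0xFF}`, `error_positions(ribm(syndromes(errors))[0])` should return the three error positions. Because the recursion is inversionless, the returned locator is a scalar multiple of the usual error locator polynomial, which leaves its roots unchanged.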

2.3 Pipelined Three-Parallel CSEE Block

The CSEE block finds error locations and error values. Figure 4 shows the three-parallel Chien search block and its cells. The Forney algorithm block has almost the same structure as the Chien search block, except that the C8 cell is eliminated. The dotted line in Fig. 4 is a cutline for pipelining. With this pipelining, the critical path delay of the Chien search block is reduced from $7T_{xor} + T_{mux} + T_{ff}$ to $3T_{xor} + T_{mux} + T_{ff}$. Detailed information on the parallel Chien search block is given in [11].

3. 16-Channel Time-Multiplexing RS-Based FEC Architecture

Figure 5 shows the proposed 16-channel time-multiplexing RS-based FEC architecture, which is made up of four-channel three-parallel RS decoders. The syndrome computation block provides the $2t$ syndromes after the 85 clock cycles required for computing the syndrome polynomial. Since four syndrome computation blocks are connected to only one KES block, the syndrome values enter the KES block alternately. The KES block outputs four error locator polynomials $\lambda(x)$ and four error evaluator polynomials $\omega(x)$ after 64 clock cycles. Finally, a CSEE block completes the error correction.

Most conventional high-speed RS decoders have used the ME algorithm in the KES block, because the ME algorithm can be easily implemented by a fully pipelined systolic-array structure. On the other hand, the systolic-array ME architecture has very high hardware complexity compared to the BM architecture. In general, the BM algorithm is difficult to pipeline because of its feedback loops. But if many channels are used in the TiBM architecture, pipelining can be used efficiently with a time-multiplexing method.
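The interleaving idea can be illustrated with a small cycle-level model: a feedback recursion whose update takes several pipeline stages leaves the unit idle when it serves one channel, but stays busy every cycle when independent channels are served round-robin. This is a sketch under assumptions, not the pTiBM datapath. The function `run_interleaved`, the 4-stage depth, and the stand-in update `step` are all hypothetical; only the four channels and the 2t = 16 iterations come from the text.

```python
from collections import deque


def run_interleaved(update, initial_states, iterations, depth):
    """Share one `depth`-stage pipelined update unit among several channels.

    One operation is issued per cycle, round-robin over the channels, and
    each result is written back `depth` cycles after it was issued.
    Returns (final_states, cycles_in_which_an_operation_was_issued).
    """
    channels = len(initial_states)
    if channels < depth:
        raise ValueError("a result must leave the pipe before its channel returns")
    state = list(initial_states)
    pipe = deque()                        # entries: (ready_cycle, channel, value)
    issue_cycles = channels * iterations
    for cycle in range(issue_cycles + depth):
        # Retire the oldest result once its latency has elapsed.
        if pipe and pipe[0][0] == cycle:
            _, ch, val = pipe.popleft()
            state[ch] = val
        # Issue the next operation; channel ch is revisited every `channels` cycles.
        if cycle < issue_cycles:
            ch = cycle % channels
            pipe.append((cycle + depth, ch, update(state[ch])))
    return state, issue_cycles


def step(v):
    """Arbitrary stand-in for one KES iteration (not a real TiBM update)."""
    return (5 * v + 3) % 257


final, busy_cycles = run_interleaved(step, [1, 2, 3, 4], iterations=16, depth=4)
```

With four channels and 16 iterations each, the shared unit issues operations for 4 x 16 = 64 cycles, which is the KES time quoted above and fits inside the 85 cycles each syndrome block needs per codeword. Each channel ends in the same state it would reach if it had the unit to itself.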
Therefore, the proposed pTiBM architecture is able to process a maximum of four independent syndrome values, because the iteration period for obtaining $\lambda(x)$ and $\omega(x)$ in the KES block is 16 clock cycles and the syndrome computation block uses 85 clock cycles for its computation. Figures 6(a) and (b) show the timing charts of four independent syndrome values for the conventional ME architecture and the proposed pTiBM architecture using time-multiplexing. The proposed pTiBM block is initialized with four independent syndrome values during 4 clock cycles, as shown in Fig. 6(b). After 60 clock cycles, the computation of the pTiBM architecture is completed, and the outputs $\lambda(x)$ and $\omega(x)$ are generated during clock cycles 61 to 64. In the pTiBM architecture, a total of 18 processing elements (PEs) are connected serially, and every PE accepts the values $\delta_0$ and $\gamma$ and the MC control signal from a control unit. After 64 clock cycles, the D-FFs in PE 0 to PE 7 hold four independent values of $\omega(x)$. The values of $\lambda(x)$ are likewise in PE 8 to PE 16.

Figure 7 shows a timing chart of the proposed 4-channel RS decoder. This architecture has a latency of 161 clock cycles. Of these, 85 clock cycles are used in the syndrome computation block because of its three-parallel architecture, and 64 clock cycles are used in the KES block using the time-multiplexing method. The rest of the latency is a delay used to adjust the timing sequence.

Fig. 5 Proposed 16-channel time-multiplexing RS-based FEC architecture.

Fig. 6 Timing chart of (a) conventional ME architecture [6], and (b) proposed pTiBM architecture using time-multiplexing for 4-channel RS decoder architecture.

Fig. 7 Timing chart of proposed 4-channel RS decoder.

4. Result and Comparison

The proposed 16-channel time-multiplexing RS-based FEC architecture and the conventional architectures [5], [6] were modeled in Verilog HDL and simulated to verify their functionality. After complete verification of the design functionality, the designs were synthesized using appropriate timing and area constraints. Both simulation and synthesis steps were carried out using Synopsys design tools and 90-nm CMOS technology optimized for a 1.2 V supply voltage. For fair comparison, the conventional RS decoders in [5], [6] were synthesized using the same 90-nm CMOS technology. Table 1 shows the critical path of each sub-block for the proposed and conventional decoder architectures. As shown in Table 1, the critical path delay of the proposed architecture is reduced significantly. Table 2 shows the implementation results of the proposed 16-channel time-multiplexing RS-based FEC architecture and the other existing RS-based FEC architectures. The total number of gates for the proposed architecture is 417,600 from the synthesized results (excluding the FIFO memory), and the clock frequency is 625 MHz. The proposed time-multiplexing architecture has a higher throughput rate and lower hardware complexity than the parallel architectures in [4]-[6]. Compared to the design in [3], the proposed design can operate much faster with comparable hardware requirements. Note that the proposed architecture uses a highly pipelined GF multiplier, whereas the design in [3] cannot use a pipelined GF multiplier in its KES block. As a result, the proposed time-multiplexing RS-based FEC architecture has a higher throughput rate, lower hardware complexity, and lower latency than previous architectures.

Table 1 Comparison of critical path delay.

Table 2 Implementation results of the 16-channel RS-FEC architectures.

5. Conclusion

This paper presented a high-speed, low-complexity VLSI architecture of a 16-channel time-multiplexing RS-based FEC for next-generation short-reach optical communication applications. The three-parallel processing for syndrome computation and error correction allows the inputs to be received at very high fiber optic rates and the outputs to be delivered at correspondingly high rates with a minimum delay. A high-speed and high-throughput rate is facilitated by employing a three-parallel processing pipelining technique and a modified syndrome computation block. In particular, the syndrome computation block is reformulated for pipelining to obtain a high clock speed. The time-multiplexing method for resource sharing of the pTiBM architecture is used in the parallel RS decoder to reduce hardware complexity. As a result, the proposed RS-based FEC architecture has a much higher throughput rate and lower hardware complexity than conventional RS-based FEC architectures. The proposed architecture has potential applications in RS-based FEC devices for short-reach optical communications with a data rate of 100 Gb/s and beyond.

Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2012R1A1A2007740).

References

[1] IEEE P802.3ba 40 Gb/s and 100 Gb/s Ethernet Task Force.
[2] ITU-T Manual 2009, Optical fibers, cables and systems, pp.133-158.
[3] L. Song, M.-L. Yu, and M.S. Shaffer, "10 and 40-Gb/s forward error correction devices for optical communications," IEEE J. Solid-State Circuits, vol.37, no.11, pp.1565-1573, Nov. 2002.
[4] H. Lee, "High-speed VLSI architecture for parallel Reed-Solomon decoder," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.11, no.2, pp.288-294, April 2003.
[5] H. Lee, C.-S. Choi, J. Shin, and J.-S. Ko, "100 Gb/s three-parallel Reed-Solomon based forward error correction architecture for optical communications," 2008 International SoC Design Conference, pp.265-268, Nov. 2008.
[6] S. Lee, C.-S. Choi, and H. Lee, "Two-parallel Reed-Solomon based FEC architecture for optical communications," IEICE Electron. Express, vol.5, no.10, pp.374-380, May 2008.
[7] H.Y. Hsu, A.Y. Wu, and J.I. Yeo, "Area-efficient VLSI design of Reed-Solomon decoder for 10 GBase-LX4 optical communication systems," IEEE Trans. Circuits Syst. II, Express Briefs, vol.53, no.11, pp.1245-1249, Nov. 2006.
[8] B. Yuan, Z. Wang, L. Li, M. Gao, J. Sha, and C. Zhang, "Area-efficient Reed-Solomon decoder design for optical communications," IEEE Trans. Circuits Syst. II, vol.56, no.6, pp.469-473, June 2009.
[9] J.-I. Park and H. Lee, "Area-efficient truncated Berlekamp-Massey architecture for Reed-Solomon decoders," IET Electron. Lett., vol.47, no.4, pp.241-243, Feb. 17, 2011.
[10] D.V. Sarwate and N.R. Shanbhag, "High-speed architecture for Reed-Solomon decoders," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.9, no.5, pp.641-655, Oct. 2001.
[11] Y. Chen and K.K. Parhi, "Small area parallel Chien search architectures for long BCH codes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.12, no.5, pp.545-549, May 2004.

Hanho Lee received Ph.D. and M.S. degrees, both in Electrical & Computer Engineering, from the University of Minnesota, Minneapolis, in 2000 and 1996 respectively, and a B.S. degree in Electronics Engineering from Chungbuk National University, Korea, in 1993. In 1999, he was a Member of Technical Staff-1 at Lucent Technologies, Bell Labs, Holmdel, NJ. From April 2000 to August 2002, he was a Member of Technical Staff at Lucent Technologies (Bell Labs Innovations), Allentown, where he was responsible for the development of VLSI architectures and the implementation of high-performance DSP multiprocessors for wireless infrastructure systems. From August 2002 to August 2004, he was an Assistant Professor at the Department of Electrical and Computer Engineering, University of Connecticut. Since August 2004, he has been with the School of Information and Communication Engineering, Inha University, where he is presently a Professor. He was a visiting researcher at the Electronics and Telecommunications Research Institute (ETRI) in 2005. From August 2010 to August 2011, he was a visiting scholar at Bell Labs, Alcatel-Lucent, Murray Hill, USA. His research interests include VLSI architecture design for digital signal processing and communications, System-on-a-Chip (SoC) design, and forward error correction architectures.

Jeong-In Park received a B.S. degree in Information and Communication Engineering in 2009 from Inha University in Korea, where he is currently working toward his M.S. degree. His research interests include VLSI architecture design and implementation for communications, and forward error correction architecture design.