PAPER A High-Speed Low-Complexity Time-Multiplexing Reed-Solomon-Based FEC Architecture for Optical Communications

IEICE TRANS. FUNDAMENTALS, VOL.E95-A, NO.12 DECEMBER 2012, p.2424

Jeong-In PARK, Nonmember and Hanho LEE a), Member

SUMMARY A high-speed low-complexity time-multiplexing Reed-Solomon-based forward error correction architecture based on the pipelined truncated inversionless Berlekamp-Massey algorithm is presented in this paper. The proposed architecture has very high speed and very low hardware complexity compared with conventional Reed-Solomon-based forward error correction architectures. Hardware complexity is reduced by employing a truncated inversionless Berlekamp-Massey algorithm. A high-speed and high-throughput data rate is facilitated by employing a three-parallel processing pipelining technique and a modified syndrome computation block. The time-multiplexing method for the pipelined truncated inversionless Berlekamp-Massey architecture is used in the parallel Reed-Solomon decoder to reduce hardware complexity. The proposed architecture has been designed and implemented with 90-nm CMOS technology. Synthesis results show that the proposed 16-channel Reed-Solomon-based forward error correction architecture requires 417,600 gates and can operate at 640 MHz to achieve a throughput of 240 Gb/s. The proposed architecture can be readily applied to Reed-Solomon-based forward error correction devices for next-generation short-reach optical communications.

key words: Reed-Solomon, forward error correction, time-multiplexing, truncated inversionless Berlekamp-Massey, optical communications

1. Introduction

Demand for 100 Gigabit Ethernet (GbE) devices is increasing dramatically where data traffic converges, such as in high-performance computing, servers, data centers, and enterprise networks. In the future, the bandwidth demanded will far exceed 100 GbE.
For this reason, the IEEE 802.3ba task force approved IEEE Std 802.3ba-2010 for 40 Gb/s and 100 Gb/s Ethernet [1]. These very high-speed data transmission techniques, developed for fiber optic networking systems, have necessitated the implementation of high-speed Forward Error Correction (FEC) architectures to meet the continuing demand for ever higher data rates. Also, high-speed (40 Gb/s and beyond) short-reach optical communication systems commonly use the Reed-Solomon (RS)(255,239) code. Specifically, the ITU-T has discussed standardization of a hard-decision FEC for a 100 Gb/s optical transport network (OTN) [2]. As a result, the RS(255,239) code has become one of the candidates for 100-Gb/s short-reach optical communication systems.

The very high-speed data transmission techniques for optical communications have necessitated the implementation of high-speed low-complexity RS-based FEC architectures to meet the continuing demand for ever higher data rates (100 Gb/s and beyond). Typical high-speed parallel RS-based FEC architectures have adopted the modified Euclidean (ME) architecture to achieve the required high throughput rate [3]–[7]. However, their hardware utilization is not efficient, and they require a huge hardware cost to achieve very high-speed transmission data rates for optical systems. RS decoder architectures using a folded ME architecture were also proposed to achieve efficient hardware utilization and low hardware complexity [7], [8]. However, they require very long latency.

Manuscript received June 1, 2011. Manuscript revised June 19, 2012.
The authors are with School of Information and Communication Engineering, Inha University, Incheon, 402-751, Korea.
a) E-mail: hhlee@inha.ac.kr
DOI: 10.1587/transfun.E95.A.2424
In this paper, we present a three-parallel RS decoder architecture and a high-speed low-complexity time-multiplexing RS-based FEC architecture using a truncated inversionless Berlekamp-Massey (TiBM) algorithm for next-generation short-reach optical systems. We describe the key ideas applied to the 16-channel time-multiplexing RS-based FEC architecture design, especially those related to achieving high throughput, low complexity, and low latency. The synthesis results show that, compared with related research, the proposed RS-based FEC architecture has very low hardware complexity and delivers a very high throughput rate.

The rest of this paper is organized as follows. Section 2 presents the three-parallel RS decoder with a modified syndrome computation block and pipelined TiBM (pTiBM) architecture. Section 3 presents the high-speed and low-complexity 16-channel time-multiplexing RS-based FEC architecture. The performance evaluation and comparisons with related work are described in Sect. 4. Finally, conclusions are provided in Sect. 5.

2. Three-Parallel Reed-Solomon Decoder

The RS decoder consists of three main blocks: the syndrome computation block, the key equation solver (KES) block, and the Chien search and error evaluation (CSEE) block, as shown in Fig. 1. Generally, the RS decoder can be implemented with a Berlekamp-Massey (BM) algorithm or an ME algorithm to solve the key equation. In this section, we propose a three-parallel RS decoder using a modified syndrome computation block and the pTiBM architecture, which provides high speed and low hardware complexity. The modified syndrome computation block and CSEE block are reformulated to minimize the critical path delay.

Copyright © 2012 The Institute of Electronics, Information and Communication Engineers

PARK and LEE: A HIGH-SPEED LOW-COMPLEXITY TIME-MULTIPLEXING REED-SOLOMON-BASED FEC ARCHITECTURE 2425

Fig. 1 Three-parallel Reed-Solomon decoder.

2.1 Modified Three-Parallel Syndrome Computation Block

Fig. 2 Modified three-parallel syndrome computation block.

Let $C(x)$ and $R(x)$ be the codeword polynomial and the received polynomial, respectively. The transmitted polynomial can be corrupted by channel noise during transmission. Therefore, the received polynomial can be described as $R(x) = C(x) + E(x) = R_{n-1}x^{n-1} + \cdots + R_1 x + R_0$, where $E(x)$ is the error polynomial. The first step in the decoding algorithm is to calculate the $2t$ syndromes $S_i$ ($0 \le i \le 2t-1$), which are used to correct the correctable errors, where $t$ is the error-correction capability. If all $2t$ syndromes $S_i$ ($0 \le i \le 2t-1$) are zero, then the received polynomial $R(x)$ is a valid codeword $C(x)$, that is, no errors have occurred. The syndrome polynomial $S(x)$ is defined by (1) and (2), and (3) expresses the syndrome in the form used for three-parallel processing:

$$S(x) = S_{15}x^{15} + S_{14}x^{14} + \cdots + S_1 x + S_0 \tag{1}$$

$$S_i = R(\alpha^i) = R_{254}\alpha^{254i} + R_{253}\alpha^{253i} + \cdots + R_1\alpha^{i} + R_0, \quad (i = 0, 1, 2, \ldots, 15) \tag{2}$$

$$S_i = R(\alpha^i) = \bigl(\cdots\bigl((R_{254}\alpha^{2i} + R_{253}\alpha^{i} + R_{252})\alpha^{3i} + R_{251}\alpha^{2i} + R_{250}\alpha^{i} + R_{249}\bigr)\alpha^{3i} + \cdots\bigr)\alpha^{3i} + (R_{2}\alpha^{2i} + R_{1}\alpha^{i} + R_{0}) \tag{3}$$

The conventional three-parallel syndrome computation block consists of $2t$ syndrome cells, each of which computes one $S_i$ value in 85 clock cycles. However, the critical path of the syndrome cell increases if the syndrome computation block is implemented for three-parallel processing as shown in (3).
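The equivalence between the direct evaluation in (2) and the three-parallel Horner form in (3) can be checked numerically. The following is a minimal sketch, not the authors' hardware model. It assumes GF(2^8) is generated by the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D), which the paper does not state, and the function names `syndrome_direct` and `syndrome_three_parallel` are hypothetical.

```python
# GF(2^8) antilog/log tables.
# ASSUMPTION: primitive polynomial x^8+x^4+x^3+x^2+1 (0x11D); the paper
# does not name the field polynomial.
PRIM_POLY = 0x11D
EXP = [0] * 510
LOG = [0] * 256
x = 1
for e in range(255):
    EXP[e] = x
    LOG[x] = e
    x <<= 1
    if x & 0x100:
        x ^= PRIM_POLY
for e in range(255, 510):
    EXP[e] = EXP[e - 255]


def gf_mul(a, b):
    """Multiply two GF(2^8) elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def alpha_pow(e):
    """Return alpha^e, with the exponent reduced modulo 255."""
    return EXP[e % 255]


def syndrome_direct(r, i):
    """Eq. (2): S_i = sum_j R_j * alpha^(j*i), where r[j] holds R_j."""
    s = 0
    for j, rj in enumerate(r):
        s ^= gf_mul(rj, alpha_pow(j * i))
    return s


def syndrome_three_parallel(r, i):
    """Eq. (3): Horner evaluation over groups of three symbols.

    Each loop pass models one clock cycle of a syndrome cell that
    consumes three symbols, highest-order group first.
    """
    a1, a2, a3 = alpha_pow(i), alpha_pow(2 * i), alpha_pow(3 * i)
    acc = 0
    cycles = 0
    for hi in range(len(r) - 1, -1, -3):  # hi = 254, 251, ..., 2
        group = gf_mul(r[hi], a2) ^ gf_mul(r[hi - 1], a1) ^ r[hi - 2]
        acc = gf_mul(acc, a3) ^ group
        cycles += 1
    return acc, cycles


# Arbitrary deterministic received word of 255 symbols.
received = [(7 * j + 3) % 256 for j in range(255)]
results = [syndrome_three_parallel(received, i) for i in range(16)]
```

For a 255-symbol word the loop runs once per group of three symbols, which matches the 85 clock cycles quoted above. The chain inside one pass (two constant multiplications, an accumulator multiplication, and the XORs that combine them) is the part of the cell that lengthens the critical path.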
To reduce the critical path, the syndrome polynomial can be separated into even terms and odd terms as follows:

$$S_i = R_{even}(\alpha^i) + R_{odd}(\alpha^i) \tag{4}$$

$$= (R_{254}\alpha^{254i} + R_{252}\alpha^{252i} + \cdots + R_{2}\alpha^{2i} + R_{0}) + (R_{253}\alpha^{253i} + R_{251}\alpha^{251i} + \cdots + R_{1}\alpha^{i}) \tag{5}$$

$$= (R_{254}\alpha^{2i \cdot 127} + R_{252}\alpha^{2i \cdot 126} + \cdots + R_{2}\alpha^{2i} + R_{0}) + (R_{253}\alpha^{2i \cdot 126} + R_{251}\alpha^{2i \cdot 125} + \cdots + R_{1})\alpha^{i} \tag{6}$$

$$= \Bigl[\bigl(\cdots\bigl((R_{254}\alpha^{2i} + R_{253}\alpha^{i} + R_{252})\alpha^{6i} + (R_{248}\alpha^{2i} + R_{247}\alpha^{i} + R_{246})\bigr)\alpha^{6i} + \cdots + (R_{8}\alpha^{2i} + R_{7}\alpha^{i} + R_{6})\bigr)\alpha^{6i} + (R_{2}\alpha^{2i} + R_{1}\alpha^{i} + R_{0})\Bigr] + \Bigl[\bigl(\cdots\bigl((R_{251}\alpha^{2i} + R_{250}\alpha^{i} + R_{249})\alpha^{6i} + (R_{245}\alpha^{2i} + R_{244}\alpha^{i} + R_{243})\bigr)\alpha^{6i} + \cdots\bigr)\alpha^{6i} + (R_{5}\alpha^{2i} + R_{4}\alpha^{i} + R_{3})\Bigr]\alpha^{3i} \tag{7}$$

In the three-parallel form (7), the even and odd terms are formed from alternate groups of three symbols. If the three-parallel syndrome computation block is reformulated according to (7), pipelining is possible without any additional latency. Figure 2 shows the modified three-parallel syndrome computation block. The even and odd terms are computed alternately during 84 clock cycles. At the final 85th clock cycle, the syndrome is obtained by multiplying the odd term by $\alpha^{3i}$. The critical path of the proposed syndrome computation block is reduced to $3T_{xor} + T_{ff}$ from the critical path $6T_{xor} + T_{mux} + T_{ff}$ of the conventional syndrome computation block, where $3T_{xor}$ is the critical path delay of the constant Galois-field (GF) multiplier.

2.2 pTiBM Architecture

The low-complexity TiBM architecture for a KES block was presented in our previous paper [9]; it removes the unnecessary $t-1$ PEs of the conventional RiBM architecture [10]. The TiBM algorithm can be described by pseudocode as follows:

The TiBM Algorithm
Initialization: $\delta_{2t+1}(0) = 1$; $\delta_{2t}(0) = 0$; $k(0) = 0$; $\gamma(0) = 1$;
Input: $\delta_i(0) = \theta_i(0) = S_i$, $(i = 0, \ldots, 2t-1)$.
for ($r = 0$, $n = 0$; $r < 2t$; $r$++)
  Step TiBM.1
    if $r = 2m$ $(m = 0, 1, \ldots, t-1)$ or $r = 2t-1$ then
      $A_i(r) = \delta_{i+1}(r)$, $(i = 0, 1, \ldots, 2t+1)$
      $B_i(r) = \theta_i(r)$, $(i = 0, 1, \ldots, 2t+1)$
    else
      $A_i(r) = \delta_i(r)$, $(i = 2t+1, 2t, \ldots, 2t-n)$
      $A_i(r) = 0$, $(i = 2t-1-n)$
      $A_i(r) = \delta_{i+1}(r)$, $(i = 0, 1, \ldots, 2t-2-n)$
      $B_i(r) = \theta_{i-1}(r)$, $(i = 2t+1, 2t, \ldots, 2t-n)$
      $B_i(r) = \theta_i(r)$, $(i = 0, 1, \ldots, 2t-1-n)$
      $n = n + 1$
  Step TiBM.2
    $\delta_i(r+1) = \gamma(r) \cdot A_i(r) - \delta_0(r) \cdot B_i(r)$, $(i = 0, \ldots, 2t+1)$
  Step TiBM.3
    if $\delta_0(r) \ne 0$ and $k(r) \ge 0$ then
      $\theta_i(r+1) = A_i(r)$, $(i = 0, 1, \ldots, 2t+1)$
      $\gamma(r+1) = \delta_0(r)$
      $k(r+1) = -k(r) - 1$
    else
      $\theta_i(r+1) = B_i(r)$, $(i = 0, 1, \ldots, 2t+1)$
      $\gamma(r+1) = \gamma(r)$
      $k(r+1) = k(r) + 1$
Output: $\lambda_i(2t) = \delta_{t+i}(2t)$, $(i = 0, 1, \ldots, t)$; $\omega_i(2t) = \delta_i(2t)$, $(i = 0, 1, \ldots, t-1)$.

Figure 3 shows the block diagram of the proposed pTiBM architecture. In the pTiBM architecture, the original $t+1$ PE1s employed in the conventional RiBM architecture are used as PE1$_0$ to PE1$_t$, and $t+1$ modified PE2s are used as PE2$_{t+1}$ to PE2$_{2t+1}$. Some zero values are lost because the $t-1$ PE1s are truncated. Thus, MUX(1) and MUX(2) were added to the modified PE2s to supply zero values at the appropriate time. The proposed pTiBM architecture can also be pipelined for high speed, which means that a time-multiplexing method can be used efficiently in the multi-channel RS-based FEC architecture. The time-multiplexing method is described in Sect. 3.

The pTiBM architecture consists of PE1, PE2, and Control Units 1 and 2. Because the $t-1$ PE1s are removed, control circuits are needed to adjust MUX(1) and MUX(2) in PE2 and to propagate $\delta_i(r)$ and $\theta_i(r)$ correctly. Control Unit 1 generates control signals such as MC($r$), $\gamma(r)$, and $\delta_0(r)$. Control Unit 2 generates the selection signals of MUX(1) and MUX(2) in PE2 and can be implemented as a finite state machine (FSM). Each selection signal of the 9 MUX(1)s is represented by 2 bits, taking the values 0 (00), 1 (01), and 2 (10), so the selection signals of the 9 MUX(1)s total 18 bits. Each selection signal of the 9 MUX(2)s is represented by 1 bit, either 0 or 1, so these selection signals total 9 bits. Therefore, the total selection signal for the MUX(1)s and MUX(2)s is 27 bits, as shown in Fig. 3.

The FSM starts its operation with a reset signal, and the input w repeats periodically with the sequence 0, x, 0, x, 0, x, 0, x, 0, x, 0, x, 0, x, 1, where x denotes don't care. MUX signal Gen. 1 and MUX signal Gen. 2 generate 27-bit selection signals. The output of MUX signal Gen. 1 is formed by concatenating the 18 bits for MUX(1) and the 9 bits for MUX(2). The former 18 bits move to the right every 2 clock cycles and a 2 is inserted at the very left of Control Unit 2, as shown in Fig. 3. Likewise, the latter 9 bits move to the right every 2 clock cycles and a 1 is inserted at the very left. For instance, the 27-bit initial selection signals (2, 2, 0, 1, 1, 1, 1, 1, 1) and (1, 1, 0, 0, 0, 0, 0, 0, 0) are updated to (2, 2, 2, 0, 1, 1, 1, 1, 1) and (1, 1, 1, 0, 0, 0, 0, 0, 0) after 2 clock cycles. The next selection signals are then updated to (2, 2, 2, 2, 0, 1, 1, 1, 1) and (1, 1, 1, 1, 0, 0, 0, 0, 0). MUX signal Gen. 2 always outputs fixed values. Finally, the 27-bit selection signals to be applied are selected by the FSM. If the selection signals of MUX(1) and MUX(2) are adjusted using this method, the error locator polynomial $\lambda(x)$ and the error evaluator polynomial $\omega(x)$ can be obtained correctly using only $2t+2$ PEs after $2t$ iterations. The PE architecture consists of 3-stage pipelined GF multipliers, adders, and D-FFs. The critical path delay of the proposed KES block is $2T_{xor} + T_{ff}$.

Fig. 3 Proposed pTiBM architecture and its sub-blocks such as original PEs, modified PE2s, and control units.

Fig. 4 Pipelined three-parallel Chien search block and cell.
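As a software reference for the key-equation recursion, the sketch below models the conventional RiBM-style update of [10], that is, the untruncated recursion with $3t+1$ registers from which TiBM removes $t-1$ PEs. It is not the TiBM datapath and does not model MUX(1), MUX(2), or the control units. The field polynomial 0x11D, the register initialization with $\delta_{3t}(0) = \theta_{3t}(0) = 1$, and the names `ribm` and `poly_eval` are assumptions made for illustration.

```python
# GF(2^8) tables. ASSUMPTION: primitive polynomial 0x11D (not given in the paper).
PRIM_POLY = 0x11D
EXP, LOG = [0] * 510, [0] * 256
x = 1
for e in range(255):
    EXP[e], LOG[x] = x, e
    x <<= 1
    if x & 0x100:
        x ^= PRIM_POLY
for e in range(255, 510):
    EXP[e] = EXP[e - 255]


def gf_mul(a, b):
    """Multiply two GF(2^8) elements."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def ribm(synd, t):
    """Inversionless BM recursion with 3t+1 registers (RiBM-style model).

    synd holds S_0..S_{2t-1}. Returns (lam, omega) as coefficient lists,
    lowest order first. Subtraction in GF(2^m) is XOR.
    """
    n = 3 * t + 1
    delta = list(synd) + [0] * t + [1]   # delta_i(0); top register is 1
    theta = list(delta)                  # theta_i(0) = delta_i(0)
    gamma, k = 1, 0
    for _ in range(2 * t):
        d0 = delta[0]
        shifted = delta[1:] + [0]        # delta_{i+1}(r)
        new_delta = [gf_mul(gamma, shifted[i]) ^ gf_mul(d0, theta[i])
                     for i in range(n)]
        if d0 != 0 and k >= 0:
            theta, gamma, k = shifted, d0, -k - 1
        else:
            k += 1                       # theta and gamma are held
        delta = new_delta
    return delta[t:2 * t + 1], delta[:t]


def poly_eval(p, point):
    """Evaluate polynomial p (lowest order first) at a field element."""
    acc = 0
    for c in reversed(p):
        acc = gf_mul(acc, point) ^ c
    return acc


# Illustrative error pattern on the all-zero codeword: position -> value.
t = 8
errors = {3: 0x57, 100: 0x01, 254: 0xA5}
synd = [0] * (2 * t)
for i in range(2 * t):
    for pos, val in errors.items():
        synd[i] ^= gf_mul(val, EXP[(pos * i) % 255])
lam, omega = ribm(synd, t)
# Positions p where lambda(alpha^-p) = 0, as a Chien search would find them.
found = [p for p in range(255) if poly_eval(lam, EXP[(255 - p) % 255]) == 0]
```

If the recursion is modeled correctly, `found` coincides with the injected error positions. A model of this kind can serve as a golden reference when checking that a truncated implementation produces the same $\lambda(x)$ and $\omega(x)$ with only $2t+2$ PEs.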

2.3 Pipelined Three-Parallel CSEE Block

The CSEE block finds error locations and error values. Figure 4 shows the three-parallel Chien search block and its cells. The Forney algorithm block has almost the same structure as the Chien search block, except that the C8 cell is eliminated. The dotted line in Fig. 4 is a cutline for pipelining. With this pipelining, the critical path delay of the Chien search block is reduced from $7T_{xor} + T_{mux} + T_{ff}$ to $3T_{xor} + T_{mux} + T_{ff}$. Detailed information on the parallel Chien search block is given in [11].

3. 16-Channel Time-Multiplexing RS-Based FEC Architecture

Figure 5 shows the proposed 16-channel time-multiplexing RS-based FEC architecture, which is made up of four-channel three-parallel RS decoders. The syndrome computation block provides the $2t$ syndromes after the 85 clock cycles required for computing the syndrome polynomial. Since four syndrome computation blocks are connected to only one KES block, the syndrome values are entered into the KES block alternately. The KES block outputs four error locator polynomials $\lambda(x)$ and four error evaluator polynomials $\omega(x)$ in parallel after 64 clock cycles. Finally, a CSEE block completes the error correction.

Most conventional high-speed RS decoders have used ME algorithms to implement the KES block, because the ME algorithm can be easily implemented with a fully pipelined systolic-array structure. On the other hand, the systolic-array ME architecture has very high hardware complexity compared to the BM architecture. In general, pipelining is difficult to apply to the BM algorithm because of its feedback loops. However, if many channels share the TiBM architecture, pipelining can be used efficiently together with a time-multiplexing method.
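The cycle budget behind this sharing can be sketched as a simple schedule. The example below assumes a round-robin slot order and $2t = 16$ KES iterations per channel. The slot order and all names (`interleaved_schedule`, `feedback_gap`) are illustrative assumptions rather than details taken from the design.

```python
T = 8                    # RS(255,239) corrects t = 8 symbol errors
KES_ITERATIONS = 2 * T   # 2t iterations per channel
SYNDROME_CYCLES = 85     # cycles before a new set of syndromes is ready
CHANNELS = 4             # syndrome blocks sharing one KES block


def interleaved_schedule(channels=CHANNELS, iterations=KES_ITERATIONS):
    """Return (cycle, channel, iteration) slots for a round-robin shared KES.

    ASSUMPTION: channels take turns one clock cycle at a time.
    """
    return [(c, c % channels, c // channels)
            for c in range(channels * iterations)]


def feedback_gap(schedule, channel):
    """Set of cycle distances between consecutive iterations of one channel."""
    cycles = [c for c, ch, _ in schedule if ch == channel]
    return {b - a for a, b in zip(cycles, cycles[1:])}


schedule = interleaved_schedule()
kes_cycles = len(schedule)               # 4 channels x 16 iterations = 64
fits = kes_cycles <= SYNDROME_CYCLES     # KES finishes before new syndromes
gap = feedback_gap(schedule, 0)          # cycles available per feedback loop
```

Under these assumptions each channel returns to the KES block every four cycles, so a multi-stage pipelined multiplier can finish one iteration of a channel's feedback loop while the other channels occupy the intervening slots, and the 64 busy cycles fit within the 85-cycle syndrome period.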
Therefore, the proposed pTiBM architecture is able to process a maximum of four independent syndrome values, because the iteration period for obtaining $\lambda(x)$ and $\omega(x)$ in the KES block is 16 clock cycles and the syndrome computation block uses 85 clock cycles for its computation. Figures 6(a) and (b) show the timing charts of four independent syndrome values for the conventional ME architecture and for the proposed pTiBM architecture using time-multiplexing. The proposed pTiBM block is initialized with four independent syndrome values during 4 clock cycles, as shown in Fig. 6(b). After 60 clock cycles, the computation of the pTiBM architecture is completed, and the outputs $\lambda(x)$ and $\omega(x)$ are generated during clock cycles 61 to 64. In the pTiBM architecture, a total of 18 processing elements (PEs) are connected serially, and every PE accepts the values $\delta_0$ and $\gamma$ and the MC control signal from a control unit. After 64 clock cycles, the D-FFs in PE$_0$ to PE$_7$ hold four independent values of $\omega(x)$. The values of $\lambda(x)$ are likewise held in PE$_8$ to PE$_{16}$.

Fig. 5 Proposed 16-channel time-multiplexing RS-based FEC architecture.

Fig. 6 Timing chart of (a) conventional ME architecture [6], and (b) proposed pTiBM architecture using time-multiplexing for 4-channel RS decoder architecture.

Fig. 7 Timing chart of proposed 4-channel RS decoder.

Figure 7 shows a timing chart of the proposed 4-channel RS decoder. This architecture has a latency of 161 clock cycles. Of these, 85 clock cycles are used in the syndrome computation block because of its three-parallel architecture, and 64 clock cycles are used in the KES block using the time-multiplexing method. The rest of the latency is a delay used to adjust the timing sequence.

4. Result and Comparison

The proposed 16-channel time-multiplexing RS-based FEC architecture and the conventional architectures [5], [6] were modeled in Verilog HDL and simulated to verify their functionality. After complete verification of the design functionality, the design was synthesized using appropriate time and area constraints. Both the simulation and synthesis steps were carried out using SYNOPSYS design tools and 90-nm CMOS technology optimized for a 1.2 V supply voltage. For a fair comparison, the conventional RS decoders in [5], [6] were synthesized using the same 90-nm CMOS technology. Table 1 shows the critical path of each sub-block for the proposed and conventional decoder architectures. As shown in Table 1, the critical path delay of the proposed architecture is reduced significantly.

Table 1 Comparison of critical path delay.

Table 2 Implementation results of the 16-channel RS-FEC architectures.

Table 2 shows the implementation results of the proposed 16-channel time-multiplexing RS-based FEC architecture and the other existing RS-based FEC architectures. According to the synthesis results, the total number of gates for the proposed architecture is 417,600 (excluding the FIFO memory) and the clock frequency is 625 MHz. The proposed time-multiplexing architecture has a higher throughput rate and lower hardware complexity than the parallel architectures in [4]–[6]. Compared to the design in [3], the proposed design can operate much faster with comparable hardware requirements. Note that the proposed architecture uses a highly pipelined GF multiplier, whereas the design in [3] cannot use a pipelined GF multiplier in its KES block. As a result, the proposed time-multiplexing RS-based FEC architecture has a higher throughput rate, lower hardware complexity, and lower latency than previous architectures.

5. Conclusion

This paper presented a high-speed, low-complexity VLSI architecture of a 16-channel time-multiplexing RS-based FEC for next-generation short-reach optical communication applications. The three-parallel processing for syndrome computation and error correction allows the inputs to be received at very high fiber optic rates and the outputs to be delivered at correspondingly high rates with minimum delay. A high-speed and high-throughput rate is facilitated by employing a three-parallel processing pipelining technique and a modified syndrome computation block. In particular, the syndrome computation block is reformulated for pipelining to obtain a high clock speed. The time-multiplexing method for resource sharing of the pTiBM architecture is used in the parallel RS decoder to reduce hardware complexity. As a result, the proposed RS-based FEC architecture has a much higher throughput rate and lower hardware complexity than conventional RS-based FEC architectures. The proposed architecture has potential applications in RS-based FEC devices for short-reach optical communications with a data rate of 100 Gb/s and beyond.

Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2012R1A1A2007740).

References

[1] IEEE P802.3ba 40 Gb/s and 100 Gb/s Ethernet Task Force.
[2] ITU-T Manual 2009, Optical fibers, cables and systems, pp.133–158.
[3] L. Song, M.-L. Yu, and M.S. Shaffer, "10 and 40-Gb/s forward error correction devices for optical communications," IEEE J. Solid-State Circuits, vol.37, no.11, pp.1565–1573, Nov. 2002.
[4] H. Lee, "High-speed VLSI architecture for parallel Reed-Solomon decoder," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.11, no.2, pp.288–294, April 2003.
[5] H. Lee, C.-S. Choi, J. Shin, and J.-S. Ko, "100 Gb/s three-parallel Reed-Solomon based forward error correction architecture for optical communications," 2008 International SoC Design Conference, pp.265–268, Nov. 2008.
[6] S. Lee, C.-S. Choi, and H. Lee, "Two-parallel Reed-Solomon based FEC architecture for optical communications," IEICE Electron. Express, vol.5, no.10, pp.374–380, May 2008.
[7] H.Y. Hsu, A.Y. Wu, and J.I. Yeo, "Area-efficient VLSI design of Reed-Solomon decoder for 10 GBase-LX4 optical communication systems," IEEE Trans. Circuits Syst. II, Express Briefs, vol.53, no.11, pp.1245–1249, Nov. 2006.
[8] B. Yuan, Z. Wang, L. Li, M. Gao, J. Sha, and C. Zhang, "Area-efficient Reed-Solomon decoder design for optical communications," IEEE Trans. Circuits Syst. II, vol.56, no.6, pp.469–473, June 2009.
[9] J.-I. Park and H. Lee, "Area-efficient truncated Berlekamp-Massey architecture for Reed-Solomon decoders," IET Electron. Lett., vol.47, no.4, pp.241–243, Feb. 17, 2011.
[10] D.V. Sarwate and N.R. Shanbhag, "High-speed architecture for Reed-Solomon decoders," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.9, no.5, pp.641–655, Oct. 2001.
[11] Y. Chen and K.K. Parhi, "Small area parallel Chien search architectures for long BCH codes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.12, no.5, pp.545–549, May 2004.

Hanho Lee received Ph.D. and M.S. degrees, both in Electrical & Computer Engineering, from the University of Minnesota, Minneapolis, in 2000 and 1996, respectively, and a B.S. degree in Electronics Engineering from Chungbuk National University, Korea, in 1993. In 1999, he was a Member of Technical-Staff-1 at Lucent Technologies, Bell Labs, Holmdel, NJ. From April 2000 to August 2002, he was a Member of Technical Staff at Lucent Technologies (Bell Labs Innovations), Allentown, where he was responsible for the development of VLSI architectures and the implementation of high-performance DSP multiprocessors for wireless infrastructure systems. From August 2002 to August 2004, he was an Assistant Professor at the Department of Electrical and Computer Engineering, University of Connecticut. Since August 2004, he has been with the School of Information and Communication Engineering, Inha University, where he is presently a Professor. He was a visiting researcher at the Electronics and Telecommunications Research Institute (ETRI) in 2005. From August 2010 to August 2011, he was a visiting scholar at Bell Labs, Alcatel-Lucent, Murray Hill, USA. His research interests include VLSI architecture design for digital signal processing and communications, System-on-a-Chip (SoC) design, and forward error correction architectures.

Jeong-In Park received a B.S. degree in Information and Communication Engineering in 2009 from Inha University in Korea, where he is currently working toward his M.S. degree. His research interests include VLSI architecture design and implementation for communications, and forward error correction architecture design.