Area-efficient high-throughput parallel scramblers using generalized algorithms

LETTER
IEICE Electronics Express, Vol.10, No.23, 1-9

Yun-Ching Tang 1, 2, JianWei Chen 1, and Hongchin Lin 1a)
1 Department of Electrical Engineering, National Chung Hsing University, Taichung, Taiwan
2 Department of Electronics Engineering, Hsiuping Institute of Technology, Taichung, Taiwan
a) hclin nchu edu tw

Abstract: This paper presents generalized algorithms for high-throughput parallel scramblers for digital communication circuits. The proposed algorithms can be applied to any three-term scrambler polynomial, achieving a critical path of one register and one XOR gate with the smallest number of registers. The fan-outs of each register can also be determined by calculation. The test chip reveals that the chip area can be reduced by more than 50% compared with that in the literature, and the power dissipation, including the clock buffers, is only 17.33 mW at 1.6 GHz with 16 parallel outputs, which is equivalent to 25.6 Gbps, using the TSMC 0.18 μm CMOS process.

Keywords: high-throughput, parallel scrambler, clock buffer, register

Classification: Electron devices, circuits, and systems

References
[1] C. Cheng and K. K. Parhi: IEEE Trans. Circuits Syst. II, Exp. Briefs 53 [10] (2006) 1017.
[2] K. K. Parhi: IEEE Trans. Circuits Syst. I, Reg. Papers 51 [3] (2004) 512.
[3] S. W. Seetharam, G. J. Minden and J. B. Evans: IEEE Int. Symp. Circuits and Systems 3 (1993) 2011.
[4] C.-H. Lin, C.-N. Chen, Y.-J. Wang, J.-Y. Hsiao and S.-J. Jou: IEEE Trans. Circuits Syst. II, Exp. Briefs 53 [7] (2006) 558.
[5] J. Chen, H. Lin and Y.-C. Tang: IEEE Int. Symp. Circuits and Systems (2010) 441.
[6] H. Lin, Y.-F. Chen and H.-C. She: IEEE Int. Symp. Circuits and Systems 4 (2001) 148.

1 Introduction

Many types of noise affect the quality of communications, and a single coding technique may not provide sufficient error-correcting capability. To improve performance, scramblers and interleavers are usually adopted in addition to the Reed-Solomon code, the turbo code or the low-density parity-check (LDPC) code. However, serial operation with a single-bit output may not be feasible for future high-speed transmission, especially in optical communications. Therefore, high-throughput parallel scramblers are required.

Various scrambler polynomials have been adopted. Figure 1 shows the synchronous frame scrambler in IEEE 802.11a/g/n, which utilizes the modulo-two polynomial $x^7 + x^4 + 1$. The generated pattern is a sequence with a maximum length of 127. For the physical coding sub-layer (PCS) in 802.3ae, the polynomial is $x^{58} + x^{39} + 1$. For the WAN interface sub-layer (WIS), it is $x^7 + x^6 + 1$. In IEEE 1394b, the polynomial is $x^{11} + x^9 + 1$. The purpose of these scramblers is to disperse the binary signals so that zeros and ones occur with nearly equal probabilities.

Fig. 1. Synchronous frame scrambler

Pipelining and unfolding have been discussed for parallel processing of linear feedback shift registers (LFSR) and cyclic redundancy checks (CRC) [1, 2]. Since those algorithms are intended for polynomials with an arbitrary number of terms, they are not optimized for specific polynomials such as scramblers in terms of critical paths, number of registers and fan-outs. The first parallel scrambler was applied to optical communications [3]. It generates parallel outputs by calculating the polynomials using several XOR operations, with numerous fan-outs for each register. The work [4] utilizes the look-ahead method with more registers, and the critical path is reduced to one XOR gate and one register. However, for certain polynomials, that algorithm requires numerous registers.

This paper proposes a generalized parallel architecture for any three-term scrambling polynomial. The critical path is minimized to one XOR gate and one register for high speed. Furthermore, it reduces the number of registers to save chip area and power consumption.
To verify the proposed architecture, dynamic differential circuits were used to design and implement the parallel scrambler for Fig. 1 using the TSMC 0.18 μm CMOS technology. Then, the experimental results and comparison are presented. The final section is the conclusion.

2 Generalized parallel architectures

To derive the generalized parallel scrambler algorithm for the aforementioned scramblers, the generating polynomial P(x) is assumed to have three terms, given by Eq. (1), where A is the number of shift registers in the conventional serial scrambler and B is the other integer such that A > B.

$P(x) = x^A + x^B + 1$ (1)
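As a software reference point for the serial scrambler described by Eq. (1), the following is a minimal Python sketch, not the authors' implementation. The function names `serial_scrambler_bits` and `smallest_period` and the seed value are illustrative assumptions. The sketch also assumes the bit recurrence $q_k = q_{k-A} \oplus q_{k-B}$, which is how the $x^7 + x^4 + 1$ example in this section is read here.

```python
def serial_scrambler_bits(A, B, seed, n):
    """Return q_0 .. q_{n-1} for the three-term polynomial x^A + x^B + 1.

    Assumption of this sketch: seed = [q_0, ..., q_{A-1}] and every later
    bit follows q_k = q_{k-A} XOR q_{k-B}.
    """
    if len(seed) != A or not 0 < B < A:
        raise ValueError("need len(seed) == A and 0 < B < A")
    q = list(seed)
    for k in range(A, n):
        q.append(q[k - A] ^ q[k - B])
    return q[:n]


def smallest_period(bits, max_period):
    """Smallest p <= max_period such that bits repeats with period p, else None."""
    for p in range(1, max_period + 1):
        if all(bits[i] == bits[i + p] for i in range(len(bits) - p)):
            return p
    return None


if __name__ == "__main__":
    seed = [1, 0, 0, 0, 0, 0, 0]          # hypothetical nonzero seed
    bits = serial_scrambler_bits(7, 4, seed, 3 * 127)
    print(smallest_period(bits, 200))     # period of the generated sequence
```

For $x^7 + x^4 + 1$ and any nonzero seed, the reported period should agree with the maximum sequence length of 127 stated in the introduction. The all-zero seed is excluded because it produces only zeros.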

For convenience in the following derivation, D is defined as the difference between A and B. If $q_k$ is the (k+1)th generated bit in the serial scrambler, then Q(j), given below, represents the M parallel output sequence in the jth cycle of the parallel scrambler.

$Q(j) = \{q_{Mj}, q_{Mj+1}, \ldots, q_{Mj+M-2}, q_{Mj+M-1}\}$ (2)

For example, consider $P(x) = x^7 + x^4 + 1$; the initial values $\{q_0 \sim q_6\}$ are stored in the registers $\{x^7 \sim x^1\}$, respectively. Since the generating function depends on the generating polynomial P(x), the output bits $q_i$ for $i \ge 7$ can be determined using the following three tips.

(1) Looking ahead
For a serial scrambler, the data bit of the next state can be determined using looking ahead [4], which produces the next data bit using the previous state. For example, if the first data bit generated by P(x) is composed of the function $x^7 + x^4$, then the next function is $x^6 + x^3$, and the following is $x^5 + x^2$.

(2) Substitution
During looking ahead, if any terms in P(x) are not available, they have to be converted to other equivalent forms. For example, if the current function is $x^4 + x^1$, then the next becomes $x^3 + x^0 = x^3 + x^7 + x^4$ based on looking ahead and substitution, where $x^0$ is replaced by $x^7 + x^4$.

(3) Merging
For the XOR operation, any two identical terms in the functions cancel each other due to the modulo-2 operation. For instance, if the current function is $x^5 + x^2 + x^1$, then the next function will be $x^4 + x^1 + x^7 + x^4$. Here, $x^4$ appears twice, so the function can be merged into $x^7 + x^1$.

Based on the above simple tips, the following parallel scrambler algorithm can minimize the register number and keep the shortest critical path of one register and one XOR gate for high speed and high throughput. With the parameters M, A, B, and $D = A - B$, the following relations can be derived.

A. $q_k$: output term with index k

For a given k, h is the smallest integer that satisfies the following condition.
$k < A \cdot 2^h$ (3)

and $q_k$ can be expressed as

$q_k = q_{k - A \cdot 2^{h-1}} + q_{k - B \cdot 2^{h-1}}$ (4)

where k must exceed $A - 1$.

B. Register upper bound (R) for the case of M parallel output bits

If the sequence Q from $q_0$ to $q_k$, in which $k = R + M - 1$, is divided into two parts, then R registers are required to store the data bits that are associated with the first part $\{q_0, q_1, \ldots, q_{R-1}\}$; the second part $\{q_R, q_{R+1}, \ldots, q_{R+M-1}\}$ represents the M parallel output bits. However, from Eq. (4), the index of $q_{k - B \cdot 2^{h-1}}$ must exceed that of $q_{k - A \cdot 2^{h-1}}$, so $q_{k - B \cdot 2^{h-1}}$ is the term that limits the number of registers. Substituting the output $q_k$ with the maximum index $k = R + M - 1$ into Eq. (4) makes the index of the second term $R + M - 1 - B \cdot 2^{h-1}$, which is less than or equal to $R - 1$. Thus, we have $M \le B \cdot 2^{h-1}$. Notably, h may change owing to different values of k, so h for $k = R - 1$ is not equal to h for $k = R + M - 1$. After selecting a new parameter g equal to $h - 1$, we have g as the smallest integer that satisfies Eq. (5).

$M \le B \cdot 2^g$ (5)

Another term that limits the number of registers is the maximum $k = R - 1$ for the same g. Substituting the upper limit $k = A \cdot 2^g - 1$ into $k - B \cdot 2^{g-1}$ yields $(A + D) \cdot 2^{g-1} - 1$, which is equal to $R - 1$. Hence, the upper bound (R) is

$R = (A + D) \cdot 2^{g-1}$ (6)

Note that R is the register upper bound for M parallel output bits, but it could be further reduced in some situations, as explained later.

C. Number of fan-outs of each register to XOR gates

After R is determined by Eq. (6) for the given M, the sequence $q_R$ to $q_{R+M-1}$ is divided into two parts, $q_R$ to $q_C$ and $q_{C+1}$ to $q_{R+M-1}$, in which C is the upper limit $A \cdot 2^g - 1$. The h values for those two groups are g and $g + 1$. Using Eq. (4) for $q_R$ and $q_C$ yields

$q_R = q_{(A+D) 2^{g-1}} = q_{(A+D) 2^{g-1} - A \cdot 2^{g-1}} + q_{(A+D) 2^{g-1} - B \cdot 2^{g-1}}$

and

$q_C = q_{A \cdot 2^g - 1} = q_{A \cdot 2^g - 1 - A \cdot 2^{g-1}} + q_{A \cdot 2^g - 1 - B \cdot 2^{g-1}}$.

Any k in $q_k$ is constrained within the ranges $(A+D) 2^{g-1} - A \cdot 2^{g-1} \le k \le A \cdot 2^g - 1 - A \cdot 2^{g-1}$ and $(A+D) 2^{g-1} - B \cdot 2^{g-1} \le k \le A \cdot 2^g - 1 - B \cdot 2^{g-1}$. Those are

$D \cdot 2^{g-1} \le k < A \cdot 2^{g-1}$ (7.1)

$D \cdot 2^g \le k < (A + D) \cdot 2^{g-1}$ (7.2)

Similarly, $q_{C+1}$ and $q_{R+M-1}$ are expressed as

$q_{C+1} = q_{A \cdot 2^g} = q_{A \cdot 2^g - A \cdot 2^g} + q_{A \cdot 2^g - B \cdot 2^g} = q_0 + q_{D \cdot 2^g}$

and

$q_{R+M-1} = q_{(A+D) 2^{g-1} + M - 1} = q_{(A+D) 2^{g-1} + M - 1 - A \cdot 2^g} + q_{(A+D) 2^{g-1} + M - 1 - B \cdot 2^g} = q_{M - 1 - B \cdot 2^{g-1}} + q_{M - 1 + D \cdot 2^g - B \cdot 2^{g-1}}$.

Thus, the other two relations are

$0 \le k < M - B \cdot 2^{g-1}$ (7.3)

$D \cdot 2^g \le k < M + D \cdot 2^g - B \cdot 2^{g-1}$ (7.4)

The fan-outs of each register can be determined in two steps.
Step 1 is based on Eqs. (7.1) to (7.4). If k satisfies one, two, three, or all four of these relations, then register k has one, two, three, or four fan-outs to the XOR gates, respectively. In Step 2, we check whether the outputs of registers feed back to other registers by examining the fan-outs of register k for $k = 0$ to $R - M - 1$. If register k has a nonzero fan-out, the fan-out of register $k + M$ is increased by one. Note that the fan-out of a register can be zero, which indicates that the register can be removed after all of the feedback paths of the registers have been checked.
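The parameter and fan-out calculation of Eqs. (5), (6) and (7.1) to (7.4) can be sketched as follows. This is a minimal Python sketch under stated assumptions rather than the authors' tool. The function name `register_fanouts` is hypothetical, the sketch assumes M > B so that g is at least 1 (the source does not spell out the g = 0 case), and Step 2 is read as "a needed register k, with k from 0 to R - M - 1, adds one fan-out to register k + M".

```python
def register_fanouts(A, B, M):
    """Return (g, R, fan) for P(x) = x^A + x^B + 1 with M parallel outputs.

    fan[k] is the fan-out count of register k after Step 1 and Step 2.
    """
    if not 0 < B < A:
        raise ValueError("need 0 < B < A")
    D = A - B

    # Eq. (5): g is the smallest integer with M <= B * 2^g.
    g = 0
    while B * 2**g < M:
        g += 1
    if g < 1:
        raise ValueError("this sketch assumes M > B so that g >= 1")

    half = 2**(g - 1)
    R = (A + D) * half                      # Eq. (6): register upper bound

    # Half-open index ranges [lo, hi) taken from Eqs. (7.1) to (7.4).
    ranges = [
        (D * half, A * half),                               # (7.1)
        (D * 2 * half, (A + D) * half),                     # (7.2)
        (0, M - B * half),                                  # (7.3)
        (D * 2 * half, M + D * 2 * half - B * half),        # (7.4)
    ]

    # Step 1: one fan-out to an XOR gate per relation that k satisfies.
    fan = [sum(lo <= k < hi for lo, hi in ranges) for k in range(R)]

    # Step 2 (as read by this sketch): register k is loaded from register
    # k + M, so every needed register k adds one fan-out to register k + M.
    for k in range(max(R - M, 0)):
        if fan[k] > 0:
            fan[k + M] += 1
    return g, R, fan


if __name__ == "__main__":
    g, R, fan = register_fanouts(7, 4, 16)
    print(g, R, fan)
```

A quick sanity check on Step 1 is that the fan-outs to XOR gates sum to 2M, since each of the M two-input XOR gates reads exactly two registers.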

D. Minimum number of registers

Based on the above algorithm, two parallel scrambler architectures, register-last and register-first, can be constructed, as illustrated in Figs. 2 and 3. In Fig. 2, the feedback data enter the XOR gates, and the registers (REG) are placed after the XOR gates. The REG M register block is composed of M registers with M XOR gates in front of them. Each XOR gate can be merged with one register, as indicated by the dashed lines, which facilitates the design of a special register with an XOR function [4, 5]. The other register block has $R_{ml} - M$ registers without XOR gates before them.

Fig. 2. Architecture of register-last parallel scrambler

In Fig. 3, the feedback data enter the registers, and then go to the XOR gates. Some of the XOR gates may have one more fan-out, so some registers may have one less fan-out.

Fig. 3. Architecture of register-first parallel scrambler

The minimum numbers of registers ($R_{ml}$ and $R_{mf}$) in the register-last and register-first parallel scramblers can be determined by

Register-last: for $0 \le k \le R - M - 1$, $R_{ml}$ = (number of nonzero fan-outs) + M
Register-first: for $0 \le k \le R - 1$, $R_{mf}$ = number of nonzero fan-outs

Table I gives examples of the output $q_k$ for several scrambler polynomials with M = 8 and M = 16 parallel outputs. Table II lists examples for three polynomials with the parameters M, R, $R_{mf}$ and $R_{ml}$. The upper bound R on the number of registers is determined first. From the fan-outs of all registers, the minimum number of registers can then be determined.

Table I. Examples of the output $q_k$
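To illustrate the register-first idea, the following minimal Python sketch models one cycle as "XOR pairs of registers, then shift the window by M" and checks the result against a serial reference. It is a behavioral model under stated assumptions, not the authors' circuit. All function names are hypothetical, the serial recurrence $q_k = q_{k-A} \oplus q_{k-B}$ and the condition M > B (so that g is at least 1) are assumptions of the sketch, and the XOR outputs are compared with the serial bits starting at index R.

```python
def serial_bits(A, B, seed, n):
    """Serial reference: q_k = q_{k-A} XOR q_{k-B}, seed = [q_0 .. q_{A-1}]."""
    if len(seed) != A:
        raise ValueError("seed must hold exactly A bits")
    q = list(seed)
    for k in range(A, n):
        q.append(q[k - A] ^ q[k - B])
    return q[:n]


def design(A, B, M):
    """Return (R, taps); taps[i] is the register pair XORed to give q_{R+i}."""
    D = A - B
    g = 0
    while B * 2**g < M:                      # Eq. (5)
        g += 1
    if g < 1:
        raise ValueError("this sketch assumes M > B so that g >= 1")
    R = (A + D) * 2**(g - 1)                 # Eq. (6)
    C = A * 2**g - 1
    taps = []
    for k in range(R, R + M):
        step = 2**(g - 1) if k <= C else 2**g    # Eq. (4) with h = g or g + 1
        taps.append((k - A * step, k - B * step))
    return R, taps


def parallel_cycle(regs, taps, M):
    """One clock cycle: compute M output bits, then shift the window by M."""
    outs = [regs[a] ^ regs[b] for a, b in taps]
    window = regs + outs
    return window[M:], outs


def matches_serial(A, B, M, seed, cycles=20):
    """True if the parallel model reproduces the serial sequence."""
    R, taps = design(A, B, M)
    ref = serial_bits(A, B, seed, R + M * cycles)
    regs = ref[:R]
    for j in range(cycles):
        regs, outs = parallel_cycle(regs, taps, M)
        if outs != ref[R + M * j: R + M * (j + 1)]:
            return False
    return True


def min_registers(A, B, M):
    """Return (R_mf, R_ml) by counting registers with nonzero fan-out."""
    R, taps = design(A, B, M)
    fan = [0] * R
    for a, b in taps:                        # fan-outs to XOR gates
        fan[a] += 1
        fan[b] += 1
    for k in range(max(R - M, 0)):           # register-to-register feedback
        if fan[k]:
            fan[k + M] += 1
    r_mf = sum(f > 0 for f in fan)
    r_ml = sum(f > 0 for f in fan[:max(R - M, 0)]) + M
    return r_mf, r_ml


if __name__ == "__main__":
    print(matches_serial(7, 4, 16, [1, 0, 0, 0, 0, 0, 0]))
    print(min_registers(7, 4, 16))
```

Because every XOR reads registers directly, each output in this model depends on exactly one two-input XOR, which mirrors the critical path of one register and one XOR gate described in the text.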

Table II. Examples to find R, $R_{mf}$, and $R_{ml}$ for the different polynomials

3 Comparison and analysis

Table III compares the number of registers and the maximum fan-outs of registers and XOR gates between the proposed architectures and those in the work [4] for two scrambler polynomials with 16 parallel outputs. All of them have 16 two-input XOR gates with critical paths of one register and one XOR gate. Lin et al. [4] developed a systematic parallel scrambler design methodology, but it needs more registers than our proposed architectures when $A - B$ in Eq. (1) becomes larger. Our register-last and register-first algorithms are denoted as RL and RF, respectively. The proposed register-first scrambler can reduce the number of registers by more than 40%.

Table III. Comparison of architectures for M = 16

For some polynomials, such as $x^7 + x^6 + 1$, the proposed algorithm yields an architecture identical to that described elsewhere [4]. However, for the polynomial $x^7 + x^4 + 1$, the proposed schemes utilize fewer registers. Table IV gives the gate-level comparison of the proposed architecture and that in the cited work [4] for 16 parallel outputs. Both methods have the same critical path, which consists of one register plus a two-input XOR operation. The proposed architecture has fewer registers. Even though some registers may have one more fan-out than those in [4], the influence on speed should be insignificant.

Table IV. Comparison of architectures for M = 16

4 Circuit design for the parallel scrambler

To design the parallel scrambler using the proposed algorithm, only two logic blocks are required. One is a register. The other is a register with a two-input XOR. Since these two logic blocks perform mutually symmetrical operations, differential logic [6] was adopted for the transistor-level design. Here, modified dynamic differential logic gates are used to enhance speed and reduce power. Figure 4 shows the differential logic gates for a register and a register with a two-input XOR, where the tri-state inverter is illustrated in Fig. 4 (c). A and B are the inputs, while OUT is the output.

Figure 5 illustrates the 3-phase clock patterns, CLK1, CLK2 and CLK3, used in the circuits. One clock cycle is divided into six small time slots $t_1$ to $t_6$. During $t_1$, since CLK1 is high and turns on the two NMOS switches whose gates are tied to CLK1, the signals are passed to the two PMOS cross-coupled nodes. At $t_2$, CLK3 goes low to switch on the PMOS whose gate is tied to CLK3, pulling one of the two cross-coupled nodes to $V_{DD}$. The time $t_3$ is a buffering time that prevents signals from leaking from the inputs to the outputs. During $t_4$, CLK2 becomes high to activate the tri-state inverters, and the data are transferred to the outputs and latched. Next, CLK3 goes high to turn off the PMOS at $t_5$, which can be short, in preparation for receiving the next data. Finally, $t_6$ is also short, like $t_3$. Since all of the clock-controlled transistors are off, the circuit is ready for the next data bit.

Fig. 4. (a) The differential register (b) The differential register with 2-input XOR (c) The tri-state inverter

Fig. 5. The clock patterns of CLK1, CLK2, and CLK3

5 Implementation and results

Figure 6 illustrates the proposed parallel scrambler for $x^7 + x^4 + 1$ with M = 16, where P0 to P15 refer to the output nodes. The blocks R and X-R correspond to the circuits in Figs. 4 (a) and 4 (b), respectively. It was designed and fabricated using TSMC 0.18 μm CMOS technology in an area of 95 μm × 70 μm. Figure 7 (a) shows the layout of the parallel scrambler, which corresponds to the block marked by the red lines in Fig. 7 (b). The rest of the chip contains the clock generator, detection circuit, clock buffers, I/O buffers and power rings.

Fig. 6. The proposed parallel scrambler of $x^7 + x^4 + 1$ with M = 16

Fig. 7. The top view of proposed parallel scrambler

Fig. 8. The measured waveform of the output signal toggled every 127 clocks

A detection circuit was also designed to verify the correctness of the parallel scrambler: it inverts the output signal whenever the expected pattern is matched, which occurs after every 127 cycles. Figure 8 shows the measured waveform, in which the output signal is inverted with a period of 158.75 ns at a clock frequency of 1.6 GHz. The power consumption of the parallel scrambler (including the clock buffers) is 17.33 mW.

Table V compares the implementation results of the proposed parallel scrambler and Ref. [4]. The proposed circuit has higher throughput per unit area and lower power consumption per unit throughput. Therefore, the proposed differential logic circuit is area-efficient, while the double-edge triggered registers used in the work [4] may have higher parasitic effects.

Table V. Performance comparison of different architectures after implementation

6 Conclusion

The generalized systematic parallel scrambler algorithm and architecture, with a critical path of one register and one XOR gate, are developed with the minimum register number for high throughput. The proposed algorithm can be applied to any scrambler polynomial with three terms to achieve small numbers of registers and fan-outs. Specifically, the proposed dynamic differential registers for the polynomial $x^7 + x^4 + 1$ with M = 16 meet the requirements of high speed, high throughput, low power and small chip area. The results reveal that the power dissipation, including that of the clock buffers, is 17.33 mW at 1.6 GHz with a throughput of 25.6 Gbps in a chip area of 95 μm × 70 μm using TSMC 0.18 μm CMOS technology. The throughput per unit area is up to 3.85 Mbps/μm².
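The headline figures quoted above can be cross-checked with simple arithmetic. The short Python sketch below is only a consistency check on the numbers stated in the text. The variable names are illustrative, the energy-per-bit figure is a derived quantity that the text does not state, and the reading of 158.75 ns as a full high-plus-low period (two inversions, 2 × 127 clock cycles) is an assumption of the sketch.

```python
# Values quoted in the text.
M = 16                      # parallel outputs
f_clk_hz = 1.6e9            # clock frequency
power_w = 17.33e-3          # power including clock buffers
area_um2 = 95 * 70          # chip area of the scrambler in um^2

throughput_bps = M * f_clk_hz                            # bits per second
throughput_per_area = throughput_bps / area_um2 / 1e6    # Mbps per um^2
energy_per_bit_pj = power_w / throughput_bps * 1e12      # derived, not quoted

# Assumption: the output inverts every 127 cycles, so one full period of the
# measured waveform spans 2 * 127 clock cycles.
waveform_period_ns = 2 * 127 / f_clk_hz * 1e9

print(f"throughput      = {throughput_bps / 1e9:.1f} Gbps")
print(f"per unit area   = {throughput_per_area:.2f} Mbps/um^2")
print(f"energy per bit  = {energy_per_bit_pj:.3f} pJ")
print(f"waveform period = {waveform_period_ns:.2f} ns")
```

Under these assumptions the computed throughput, throughput per unit area and waveform period agree with the 25.6 Gbps, 3.85 Mbps/μm² and 158.75 ns quoted in the text.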
Acknowledgments

The authors would like to acknowledge the Chip Implementation Center (CIC) of the National Applied Research Laboratories of Taiwan for the support in chip fabrication. This work was supported by the National Science Council of Taiwan (NSC 101-2220-E-005-013) and was supported in part by the Ministry of Education, Taiwan under the ATU plan.