Low Latency Synchronization Scheme Using Prediction and Avoidance of Synchronization Failure in Heterochronous Clock Domains

JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.15, NO.2, APRIL, 2015 ISSN(Print) ISSN(Online)

Sung-Gun Song, Seong-Mo Park, Jeong-Gun Lee, and Myeong-Hoon Oh

Abstract

For the performance-efficient integration of IPs on an SoC utilizing heterochronous multi-clock domains, we propose a synchronization scheme that incurs low latency overhead when data cross clock boundaries. The proposed synchronization scheme is composed of a clock predictor and a synchronizer. The clock predictor of a sender clock domain produces a predicted clock that is used in a receiver clock domain to detect possible synchronization failures in advance. When a possible synchronization failure is detected, a synchronizer at the receiver delays the data-capture time to avoid it. By simulating the proposed scheme through SPICE modeling using a Chartered 0.18 μm CMOS process, we verified the functionality and timing behavior of the clock predictor and the synchronizer. The simulation results show that the clock predictor produces a predicted clock before a synchronization failure, and that the synchronizer samples data correctly using the predicted clock.

Index Terms: Synchronization, heterochronous clock domain, avoidance of synchronization failure, prediction of synchronization failure, clock predictor

Manuscript received Jul. 28, 2014; accepted Jan. 23,
School of Electronics and Computer Engineering, Chonnam National University, 77 Yongbong-ro, Buk-gu, Gwang-Ju, Korea
Dept. of Computer Engineering, Hallym University, 1 Hallymdaehakgil, Chuncheon, Gangwon-do, Korea
Corresponding Author, Electronics and Telecommunications Research Institute (ETRI), 218 Gajeong-ro, Yuseong-gu, Daejeon, Korea
mhoonoh@etri.re.kr

I. INTRODUCTION

Nowadays, high-performance and low-power VLSI circuits are designed using deep sub-micrometer or nanometer semiconductor technologies. Thanks to rapidly shrinking process technologies, highly transistor-integrated system designs with high complexity and multiple functions can now fit in a single system on chip (SoC). The design method using a traditional global clock distribution imposes restrictions on clock timing, skew, and jitter to keep operation safe. These restrictions become more difficult to satisfy as clock speed increases. As product life cycles shorten in the modern electronics industry and a speedy release of products becomes essential due to rapid market changes, it is impractical to deal with the problems of global clock distribution from the early stage of a large-scale SoC design. Accordingly, current SoCs are implemented with a reuse-centric design in which predeveloped and verified IP modules are combined onto an SoC. This trend is expected to accelerate further. In a reuse-centric design, each IP contained in an SoC can have an independent clock source and frequency in a hard or soft macro form, and it may be offered by various vendors [1]. Thus, in order to integrate various IPs with heterogeneous timing behaviors onto an SoC, we have to design and provide a synchronizer that ensures safe data transmission between IPs that work with different clock sources and frequencies [2].

Table 1. Classification for Data-Clock Synchronization

Classification    Δf            Δφ                Periodic   Number of clocks   Synchronizer
Synchronous       0             0                 Yes        Single             None
Mesochronous      0             Constant          Yes        Single             Phase Compensation
Plesiochronous    f_d < ε       Varies            Yes        Single             Adaptive Phase Compensation
Related           Fixed Ratio   Constant/Varies   Yes        Multi              Adaptive Phase Compensation
Heterochronous    f_d > ε       Varies            Yes        Multi              Prediction
Asynchronous      Don't care    Unknown           No         Multi              Full Synchronizer

The working environment of a synchronizer can be classified according to the frequency and phase difference, the periodic nature of the clocks, and the number of clock sources [3, 4]. Table 1 summarizes the classification of working environments and the corresponding synchronization methods. Given knowledge of the synchronizer's working environment, an optimal synchronizer can be selected and used with minimal timing overhead.

A synchronous environment is an ideal case where a single clock source with a single fixed frequency and no appreciable phase difference is distributed across the whole system. Such a system is generally called a synchronous system, and since all parts of the system operate with the same clock, it does not need any synchronization. In a mesochronous environment, clocking is also driven by a single clock source and frequency, but clock skew or parasitic elements produce a phase difference of the clock. Synchronization can be achieved through a phase compensation circuit that adopts delay elements [5, 6]. In a plesiochronous environment, a single clock source is used, but a small frequency difference of the clock is assumed. Hence, the phase difference of the clock keeps changing. Synchronization in this environment can be conducted by repeating adaptive phase compensation [7, 8]. A related environment allows multiple clock sources whose frequencies have an integer ratio to each other. 
Synchronization in this environment can be easily implemented using circuits such as a divider or counter [9]. A heterochronous environment has multiple clock sources with various frequencies. There is no correlation between the frequencies of the clocks, but the frequency of each clock is maintained. Such clocks can be synchronized through the prediction of a synchronization failure [10, 11]. Lastly, an asynchronous environment places no restrictions on the clock, but an asynchronous design technique needs extra handshaking circuits for its fine-grained and localized data synchronization [4, 12].

Because synchronous, mesochronous, and plesiochronous environments run with a single clock, and a related environment runs with multiple clocks but requires a correlation between clock frequencies, these environments are not appropriate for the design of a large-scale SoC based on various IPs. Note that IPs can have different clock sources and diverse clock frequencies. In addition, the synchronizer in an asynchronous environment that deals with a synchronization failure is not easy to implement, which increases design expense. These restrictions also influence the performance of the entire system and the design time. On the other hand, the heterochronous environment is the most realistic and practical type of environment for generic IP-based SoC designs, because it assumes clock domains with no correlation and therefore does not need to consider the restrictions mentioned above.

Focusing on IP-based SoC designs, this paper examines synchronization techniques in a heterochronous environment. In addition, this paper proposes and describes a synchronization scheme that can transmit data with low latency by detecting and avoiding synchronization failures through a prediction mechanism. This paper consists of six sections. Section II covers previous synchronization-related works. 
Section III explains the principle of prediction of a synchronization failure. We describe how to realize the proposed synchronization scheme in Section IV. Section V shows a detailed evaluation of the proposed synchronization and

the results of its simulation. Finally, Section VI concludes this paper with a brief summary.

II. RELATED WORK

1. Synchronization Failure and Metastable State

In a digital circuit using a clock, data are transmitted through combinational circuits and preserved in a flip-flop. A flip-flop is a bi-stable circuit that memorizes one of two stable states, 0 or 1. The flip-flop captures data by referencing clock timing. The times during which data should be stable before and after the clock edge of a flip-flop are called the setup time and the hold time, respectively. When data violate the setup or hold time, the flip-flop can enter a metastable state in which the output stays at an intermediate value between 0 and 1. In this situation, the flip-flop can act in unpredictable ways, which can lead to a synchronization failure [13].

Since circuits in the synchronous environment are designed according to the precise timing constraints of the clock, it is possible to design them with setup and hold time violations taken into account. However, in the heterochronous environment, where multiple clocks exist without correlation between them, data enter the receiver clock domain asynchronously. Consequently, synchronization failures are fundamentally unavoidable. To guarantee the safe and correct operation of a target SoC, robust and low-overhead synchronization methods should be provided in the heterochronous environment.

2. Synchronization Methods in a Heterochronous Environment

In a system composed of multiple clocks with various frequencies, a two-flop synchronizer is widely employed for data transfers [13]. The main drawback of the two-flop synchronizer is its multi-cycle latency, which can limit the throughput. 
A complete data transfer must wait about one or two more clock cycles at each transfer through the two serialized flip-flops, and the next data transfer cannot start before the previous data transfer is completed [10, 14]. To prevent the overhead of synchronizing every data bit on wide data paths with a two-flop synchronizer, an asynchronous handshake protocol bundled with such data paths is typically used [12, 15]. A speculative synchronizer [16] and a parallel flop synchronizer [17] have been presented to alleviate the inherent latency, but they do not completely eliminate it.

Another commonly used synchronizer is a mixed-clock FIFO (First-In First-Out). A FIFO has a full state and an empty state, which occur when data cannot be written or read, respectively. When a full or empty state occurs, a FIFO faces a delay of one to two more cycles when writing or reading data, as in a two-flop synchronizer. These full and empty states occur frequently if the data production and consumption rates are different [18, 19]. Many researchers have proposed synchronization methods to increase the data throughput and reduce the latency [19, 20], but these methods still require a special circuit such as an arbitration circuit and cannot completely eliminate the latency.

As a more fundamental solution, there is a stoppable clock synchronizer that controls the clocks themselves [4, 21, 22]. Although there is no possibility of a synchronization failure whenever data are transmitted, the clock domain consumes many clock cycles to store the current state and restart clock operations. Thus a stoppable clock synchronizer is not suitable for fast data transfer. Furthermore, this synchronizer must be accompanied by a relatively complex control circuit that manages the external asynchronous handshake signals and stops the internal clock [22, 23]. 
The synchronizers mentioned above can be evaluated from two design aspects of synchronization, latency (and the resulting data throughput) and design complexity, as follows.

In terms of latency: A two-flop synchronizer and a mixed-clock FIFO inevitably cause multi-cycle latency and limit data throughput. On the other hand, a stoppable clock synchronizer incurs additional latency due to clock tree delays and the extra clock cycles needed to pause and restart a system. Besides, it is not suitable for fast data transfer because it involves asynchronous handshake signals.

In terms of design complexity: Since a two-flop synchronizer and a stoppable clock synchronizer involve an asynchronous design technique, they can increase design complexity and lengthen development time. This is because designing and testing an asynchronous control circuit is not easy, and designers are still

unfamiliar with supporting design tools and methodologies [9, 12, 24], in spite of a meaningful case [26]. As for a mixed-clock FIFO, it is also not simple to design a control circuit that deals with the full/empty states.

This paper proposes a synchronization scheme for a heterochronous environment that neither causes synchronizing latency nor depends on an asynchronous design technique, which complicates the design. As explained in Section I, the heterochronous environment consists of many clock sources with various uncorrelated frequencies, but each clock is periodic. In other words, the frequency of each clock does not change. To better exploit the periodic nature of the clocks, the relation between the frequencies of two clock domains, a sender and a receiver, can be defined [10, 11]. Once this relation has been defined, a possible synchronization failure can be predicted and avoided so that data can be transmitted without added latency.

III. PROPOSED SYNCHRONIZATION SCHEME

Fig. 1. General heterochronous communication.

Fig. 2. (a) Clock generator for synchronization failure detection, (b) Synchronization failure detection clocks.

1. Detection of Synchronization Failure

Fig. 1 shows general data transfer in a heterochronous environment. The receiver operates with lclk (local clock), and the sender operates with the external clock. Since data from the sender are synchronized with the external clock and there is no correlation between lclk and the external clock, the receiver, which operates with lclk, may experience a synchronization failure when data are input. Fig. 
2(a) shows a clock generator for synchronization failure detection. This clock generator produces two samples, S_setup and S_hold, by advancing lclk by the setup time of the external clock (T_setup) and by delaying lclk by the hold time of the external clock (T_hold). As shown in Fig. 2(b), slclk and hlclk are versions of lclk advanced by T_setup and delayed by T_hold, respectively. Together they virtually create the metastability window of the external clock on lclk. When S_setup and S_hold are produced by sampling the incoming external clock with slclk and hlclk, comparing S_setup and S_hold reveals whether the rising edge of the external clock falls within this metastability window.

Fig. 3. (a) Synchronization, (b) Synchronization, (c) Synchronization failure, (d) Synchronization failure detector.

Fig. 3 describes the scenarios of detecting a synchronization failure in more detail. The S_setup and S_hold signals produced by sampling with slclk and hlclk in the detection circuit of Fig. 2(a) can take three states, as in Figs. 3(a)-(c). Figs. 3(a) and (b) show states where S_setup and S_hold have the same value, both 1 or both 0. This means that a synchronization failure does not occur, because the external clock rises outside the metastability window. However, Fig. 3(c), in which S_setup is 0 and S_hold is 1, indicates that a synchronization failure occurs, because the external clock rises within the metastability window on lclk. Accordingly, the inverted S_setup signal and the S_hold signal are used as the inputs of an AND gate to produce a fail signal at the synchronization failure detector, as shown in Fig. 3(d). The synchronization failure detector is the clock generator of Fig. 2(a) with an AND gate added.

The principle of Fig. 3 only means that a possible synchronization failure can be detected; it does not imply that the circuit in Fig. 3 provides a full synchronization solution. To transmit the data correctly, a synchronization failure must be predicted and avoided before it occurs.

2. Prediction of Synchronization Failure

Prediction of a synchronization failure can be achieved by producing a predicted clock in a sender and transmitting it to a receiver. A predicted clock is defined as a clock with which the receiver can detect a possible synchronization failure before it happens.

Fig. 4. Example of synchronization failure prediction.

Fig. 4 illustrates an example of the prediction of a synchronization failure. The clock periods of the sender (external clock) and the receiver (lclk) are assumed to be 40 ns and 50 ns, respectively. The two clocks are expected to fail to synchronize every 200 ns. If we delay the external clock by T_D (30 ns in this example), as in p, the synchronization failure that will occur at 200 ns can be detected at the prediction time (150 ns), which is one lclk cycle before the synchronization failure. T_D is the delay that, when applied to the external clock, causes a synchronization failure at the prediction time (150 ns), as shown in Fig. 4. That is, T_D is the time difference between the prediction time (150 ns) on lclk and the nearest rising edge of the external clock before the prediction time. In this example, since the last rising edge of the external clock before the prediction time (150 ns) is at 120 ns, T_D is 30 ns (150 ns - 120 ns). 
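The 40 ns / 50 ns example and the detector of Fig. 3(d) can be checked numerically with a minimal behavioral sketch. It assumes ideal 50%-duty clocks that both rise at t = 0 and setup and hold times of 500 ps each (an assumption made here for illustration, the example itself does not state them). The function names are hypothetical and are not taken from the paper.

```python
from math import gcd

def prediction_delay(t_ext_ps, t_lclk_ps, n=1):
    """Return (T_failure, T_predict, T_D) in ps for clocks that both rise at t = 0."""
    t_failure = t_ext_ps * t_lclk_ps // gcd(t_ext_ps, t_lclk_ps)  # next coincident rising edge
    t_predict = t_failure - n * t_lclk_ps                         # n lclk cycles before the failure
    t_latest = (t_predict // t_ext_ps) * t_ext_ps                 # last external edge at or before T_predict
    return t_failure, t_predict, t_predict - t_latest

def clock_level(t_ps, period_ps, delay_ps=0):
    """Level (0/1) of an ideal 50%-duty clock with rising edges at delay + k * period."""
    return 1 if (t_ps - delay_ps) % period_ps < period_ps // 2 else 0

def fail_at(lclk_edge_ps, period_ps, delay_ps, t_setup_ps, t_hold_ps):
    """Fig. 3(d) behavior: fail = NOT(S_setup) AND S_hold around one lclk rising edge."""
    s_setup = clock_level(lclk_edge_ps - t_setup_ps, period_ps, delay_ps)  # sampled by slclk
    s_hold = clock_level(lclk_edge_ps + t_hold_ps, period_ps, delay_ps)    # sampled by hlclk
    return s_setup == 0 and s_hold == 1

T_EXT, T_LCLK = 40_000, 50_000   # clock periods from the example, in ps
T_SETUP = T_HOLD = 500           # assumed values, in ps

t_failure, t_predict, t_d = prediction_delay(T_EXT, T_LCLK)
print("T_failure, T_predict, T_D (ps):", t_failure, t_predict, t_d)
print("undelayed clock fails at T_failure:", fail_at(t_failure, T_EXT, 0, T_SETUP, T_HOLD))
print("clock delayed by T_D fails at T_predict:", fail_at(t_predict, T_EXT, t_d, T_SETUP, T_HOLD))
```

Under these assumptions the sketch gives T_D = 30 ns, and the detector fires for the undelayed clock at 200 ns and for the clock delayed by T_D at 150 ns, one lclk cycle earlier, which is the behavior the example describes.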
In addition, to predict a synchronization failure two and three cycles of lclk ahead of the real synchronization failure (200 ns), that is, at 100 ns and 50 ns, the external clock can be delayed by 2T_D and 3T_D, as in p2 and p3. In this regard, the prediction time n cycles of lclk before a synchronization failure can be defined as T_predict(n) and expressed as in Eq. (1).

$$T_{predict}(n) = T_{failure} - n \cdot T_{lclk} \qquad (1)$$

T_D can then be expressed as in Eq. (2). Here, T_latest is the time of the latest rising edge of the external clock before T_predict(n).

$$T_{D} = T_{predict}(1) - T_{latest} \qquad (2)$$

From Eqs. (1) and (2), we get n·T_D = T_predict(n) - T_latest(n). This means that if we obtain the value of T_D, the prediction time can be controlled to avoid a synchronization failure. In fact, a sender cannot observe a receiver's clock, and thus it is not possible to measure T_D directly at the sender side. However, if the sender knows minimal information about the receiver's clock (e.g. the period of lclk, T_lclk), T_D can be calculated as in Fig. 5. The clock periods of the sender and the receiver (T_lclk) are still assumed to be 40 ns and 50 ns, respectively. In this figure, d is a version of the external clock delayed by T_lclk. The time difference between the rising edges of d and the external clock, T_X (from D to A), is 30 ns, which is the same as T_D. In addition, T_X can be identified as the phase difference between the external clock and d. As a result, if the sender knows the period of lclk, it can find T_D using only its own clock. Eventually, the sender can generate the predicted clock (p) by delaying the external clock by T_D. The proof that T_X and T_D are the same is given in the Appendix.

Fig. 5. Principles of predicted clock generation.

3. Optimization of the Delay Time of Multiple T_D s

It is confirmed that delaying the external clock by as much as T_D, 2T_D, and 3T_D makes it possible to predict a

synchronization failure one, two, and three cycles prior to the real synchronization failure, respectively, as shown in Fig. 4. From the viewpoint of circuit implementation, delay elements are usually used to delay a signal by a specific amount of time. In general, the longer the delay is, the more circuit area the delay elements consume. Consequently, a long delay element can bring side effects such as higher power consumption. To reduce the area and power/energy overhead of delay elements, we present a design scheme that reduces the delay used for generating a predicted clock.

Let F be the set of the rising edge times of a free running signal with period T, and let del_F(t) be the set of the rising edge times of the signal obtained by delaying F by t. Since a free running signal with period T has the same set of rising edge times even if F is delayed or advanced by a multiple of T, the relation between F and del_F(t) can be expressed as in Eqs. (3) and (4).

$$F = \mathrm{del}_F(nT), \quad n = \ldots, -2, -1, 0, 1, 2, \ldots \qquad (3)$$

$$\mathrm{del}_F(t) = \mathrm{del}_F(nT + t), \quad n = \ldots, -2, -1, 0, 1, 2, \ldots \ \text{and any } t \qquad (4)$$

In our low latency synchronizer, we need to predict the synchronization failure three cycles before a possible synchronization failure time. The detailed reasoning for the use of 3T_D is discussed in Section IV. Applying Eq. (4) with t = 3T_D and with T taken as the period of the external clock, to find the minimal delay time of p3 in Fig. 4, gives Eq. (5).

$$\mathrm{del}_F(3T_D) = \mathrm{del}_F(nT + 3T_D), \quad n = \ldots, -2, -1, 0, 1, 2, \ldots \qquad (5)$$

The minimum n satisfying Eq. (6) can be calculated, and the corresponding value of nT + 3T_D is the minimum delay of the free running signal that yields the same signal as a delay of 3T_D.

$$nT + 3T_D \geq 0 \qquad (6)$$

In the example of Fig. 4, the minimum n that satisfies Eq. (6) is -2, because T is 40 ns and T_D is 30 ns. As a result, the
As a result, the D Fig. 6. Flowchart of production process of the predicted clock. minimum delay time for p3 is 3T D -2T =3 30 ns ns=10 ns. This means that even if T is delayed by as much as 10 ns, it has the same effect as delaying 3T D (150 ns). 4. Process of Producing a Predicted Clock To summarize clock predictor s operations explained in section III.2 and III.3, Fig. 6 provides a flowchart for the process of producing a predicted clock, p3 in Fig. 4. Once is input to a clock predictor, is delayed by T lclk, and d is then generated. Then, the clock predictor detects the time difference between the rising edges of d and, which is T D. As is delayed by T D, an initial predicted clock, p, is then generated. In addition, to predict a synchronization failure before three cycles ahead, multiple T D s pass through an optimization process. Finally, the clock predictor can create p3, and transmits it with data to a receiver for synchronization. 5. Avoidance of a Synchronization Failure The receiver samples data by its own clock s event (lclk) as in Fig. 2. However, when a synchronization failure is detected through a predicted clock and a synchronization failure detector shown in Fig. 3(d), lclk should not be used to capture data. Instead, the receiver performs the avoidance of the synchronization failure. To allow the receiver to correctly sample the data, as shown in Fig. 7, the avoidance is achieved via delaying lclk by T keep-out. Here, T keep-out is the delay used to move lclk, so the capturing can safely escape from the region of a metastability window of on lclk. The value of T keep-

7 214 SUNG-GUN SONG et al : LOW LATENCY SYNCHRONIZATION SCHEME USING PREDICTION AND AVOIDANCE OF (a) (b) (a) Fig. 7. The method of synchronization failure avoidance (a) Circuit for Synchronization failure avoidance, (b) Waveform of Synchronization failure avoidance. out should be sum of T setup and T hold of. Note that the data is synchronized by. IV. IMPLEMENTATION (b) 1. Overall Structure of the Proposed Scheme Fig. 8(a) is a top-level structural view of the proposed scheme for synchronization through a clock prediction in this paper. A sender and receiver operate with and lclk, respectively. The clock predictor in the sender creates p3, which can predict a synchronization failure at three cycles earlier than the real synchronization failure. The synchronizer in the receiver takes p3 along with the data. With p3, the synchronizer detects and avoids a synchronization failure through a synchronization failure detector in Fig. 3(d) and an avoidance circuit in Fig. 7. The synchronizer in the receiver detects a synchronization failure using p3, and then performs the synchronization. Fig. 8(c) is the proposed structure of a synchronizer. Basically, the synchronizer samples data using lclk. If a synchronization failure is detected, the synchronizer samples data through the other clock that is a version of lclk delayed by T keep-out, which can avoid a synchronization failure. The receiver exploits double two-flop synchronizers to seamlessly transmit data among the input clock domain (M3) for the synchronization failure detector, the sender clock domain (M1) that generates p3, and the receiver clock domain (M2) which uses the outputs of the detector. As a result, three cycles are needed owing to the additional double two-flops, thus p3 is needed. The proposed structure of a clock predictor is depicted in Fig. 8(b). A clock predictor is divided into three main parts; time to digital converter (TDC), variable delay element (VDE), and cycle multiplier (CM). 
The TDC detects T_D by comparing the phase difference between the external clock and d, which is the external clock delayed by the period of lclk. The VDE generates p by delaying the external clock by the detected T_D. The CM multiplies the detected T_D by two, and then delays the generated p by 2T_D. In summary, once the TDC has detected T_D, the external clock is delayed in series by T_D in the VDE and by 2T_D in the CM, and thus p3 is created for predicting a synchronization failure three cycles in advance.

Fig. 8. The proposed synchronization scheme (a) Top-level structural view, (b) Structure of the Clock Predictor, (c) Structure of the Synchronizer.

2. Clock Predictor

Fig. 9 shows the structure of the clock predictor implementation. The clock predictor works as follows. The signal d, which is the external clock delayed by the period of lclk (T_lclk), is

passed into SR latch I0. The SR latch I0 captures the first rising edge event of d and holds its own output, the d_r signal, high. After d_r goes high, its value goes to the input of DFF I1, and the _r signal becomes high at the first rising edge event of the external clock in DFF I1. Then, the d_r and _r signals are used as inputs to the TDC. The TDC detects T_D, the time difference between the rising edge of d_r and the rising edge of _r. The digital code corresponding to the detected T_D is stored in DFFs F0 to Fn in sequence. For this reason, the maximum detection resolution of the TDC must exceed the sum of the setup, hold, and propagation delay times of a DFF. The DFFs are updated by the Final signal, which is the d signal after it has passed through all delay cells of the TDC. The VDE and CM delay the external clock by T_D and 2T_D, respectively. Finally, we can create p3, which is the external clock delayed by 3T_D.

Fig. 9. The Structure of the implemented Clock Predictor.

A. Time to Digital Converter

A TDC measures the time difference between two input signals and then generates a digital code proportional to the time difference. The structure of the implemented TDC is shown in Fig. 10. It employs the general structure of a Vernier delay line, which consists of a pair of tapped delay lines with a MUTEX [25] at each corresponding pair of taps. The MUTEX is a mutual exclusion element whose main function is to resolve the contention between two input signals from independent sources. If both input signals of the MUTEX arrive at the same time, it goes into metastability. To avoid metastability at the output, a metastability filter is used [12].

Fig. 10. The Structure of TDC.

Fig. 11. The Structure of VDE.
The operation is as follows. While the TDC delays the Lead and Lag signals by T_a and T_b per stage, respectively, the two signals are compared by the MUTEXs to determine which signal arrives first at each stage. T_a and T_b are the propagation delays of the two delay lines. Since the propagation delay T_a is longer than T_b, the Lag signal overtakes the Lead signal at some stage, and from that stage on the output port Q becomes 0. In the opposite case, Q becomes 1. The TDC can detect the phase difference with a resolution of T_a - T_b.

B. Variable Delay Element and Cycle Multiplier

Fig. 11 shows the structure of the implemented VDE. The implemented VDE allows the selection of multiple signal propagation paths and consists of a number of delay cells. A delay cell consists of two inverters and one multiplexer, and has a propagation delay of T_m. The two inverters simply act as a buffer that delays an input signal, and the multiplexer connects the delay cell with those in the adjacent stages to form a delay path. The propagation delay T_m has the same value as the resolution of the TDC (T_a - T_b). The VDE forms a delay path according to the sel_k bits, and its delay time is controlled in units of T_m, giving T_m × k. Here, k indicates the number of delay cells included in the delay path to realize the delay T_D. A cycle multiplier (CM) is built with the same structure as the VDE in Fig. 11. In comparison with the VDE, since the delay cell of the CM has double the delay time, the delay time of the CM is 2 × T_m × k (= 2T_D).
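The interplay of the Vernier TDC, the VDE, and the CM can be illustrated with a small idealized model. This is only a sketch under stated assumptions: the stage delays T_a = 1500 ps and T_b = 1000 ps are hypothetical values chosen so that the resolution T_a - T_b is 500 ps, delays are ideal, and a tie at a MUTEX is arbitrarily resolved as "Lag first" (a real MUTEX could resolve it either way). The function names are illustrative only.

```python
def vernier_tdc(dt_ps, t_a_ps, t_b_ps, stages):
    """Thermometer code of an ideal Vernier delay line.

    Lead rises at t = 0 and is delayed by t_a per stage; Lag rises dt later and is
    delayed by t_b per stage (t_a > t_b). Q is 1 while Lead still arrives first.
    """
    code = []
    for i in range(1, stages + 1):
        lead_arrival = i * t_a_ps
        lag_arrival = dt_ps + i * t_b_ps
        code.append(1 if lead_arrival < lag_arrival else 0)
    return code

def vde_and_cm_delays(code, t_m_ps):
    """Delays selected by the code: k cells of T_m in the VDE, k cells of 2*T_m in the CM."""
    k = sum(code)                       # number of stages in which Lead was still ahead
    return k * t_m_ps, 2 * k * t_m_ps   # (approx. T_D, approx. 2*T_D)

# Hypothetical stage delays giving a 500 ps resolution; 112 stages cover 56 ns.
T_A, T_B, STAGES = 1500, 1000, 112
code = vernier_tdc(30_200, T_A, T_B, STAGES)   # true time difference: 30.2 ns
vde_delay, cm_delay = vde_and_cm_delays(code, T_A - T_B)
print(sum(code), vde_delay, cm_delay)
```

In this model a 30.2 ns difference is quantized to 60 steps of 500 ps, so the VDE contributes 30 ns and the CM 60 ns. The measured value is always rounded to the TDC resolution, which is why T_m is chosen equal to T_a - T_b.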

Fig. 12. The Structure of the implemented Synchronizer.

3. Synchronizer

Fig. 12 shows the structure of the synchronizer proposed in this work, which corresponds to Fig. 8(c). The receiver receives p3 with data from the sender. To detect a synchronization failure, the receiver samples p3 using the clocks lclk - T_setup and lclk + T_hold, corresponding to slclk and hlclk in Fig. 2(b). Since the receiver and sender belong to different clock domains, the receiver takes p3 using the two-flops (Fs_A1 and Fs_A2, Fh_A1 and Fh_A2) (the clock domain M3). These sampled signals of p3 are stored once more through the two-flops (Fs_B1 and Fs_B2, Fh_B1 and Fh_B2) clocked by lclk, because a synchronization failure is detected in the lclk domain (the clock domain M2). Because of the double two-flop synchronizers, a p3 that can predict a synchronization failure three cycles in advance is required to compensate for the latency they cause. A fail signal of 1 indicates a synchronization failure, and the synchronizer then conveys delayed_data, which is the data captured in F1 using lclk delayed by T_keep-out, so that the synchronization failure is avoided. Otherwise, when the fail signal is 0, the synchronizer conveys nondelayed_data, which is the data stored in F0 using lclk.

V. ANALYSIS

The proposed scheme is designed using a Chartered 0.18 μm CMOS process through SPICE modeling. For a simple design, static D flip-flops and delay cells are implemented using only the basic logic gates offered by the standard cell library of the given CMOS process. The propagation delays of the static D flip-flop and the delay cell are both 500 ps, and they are built of 26 and 8 transistors, respectively. In this section, we analyze the area overhead of the implemented clock predictor and synchronizer, and verify their functionality through simulations.

1. Trade-off Analysis between Performance and Area Overhead

In the proposed scheme, the clock predictor detects a phase difference between two clock domains to produce a predicted clock, and the synchronizer avoids a synchronization failure using the predicted clock. Because the clock predictor detects a phase difference, its area is highly influenced by the detection range. The wider the detection range is, the more delay cells and registers are needed and the larger the area of the clock predictor is. Conversely, the narrower the detection range is, the fewer delay cells and registers are needed and the smaller the area of the clock predictor is. On the other hand, since the synchronizer is not involved in the detection of a phase difference, its area is not affected by the detection range.

To identify the area overhead of the implemented clock predictor and synchronizer, the variations in the number of circuit elements with the detection range at maximum resolution were evaluated. As explained in Section IV.2, the maximum resolution of the clock predictor should be above the sum of the setup, hold, and propagation delay times of the D flip-flop, because the phase code detected at the TDC is stored into the D flip-flops in sequence. In this paper, the maximum resolution is assumed to be 500 ps, and the delay cells were also designed in units of 500 ps.

Table 2 shows the area overheads of the clock predictor and the synchronizer when the maximum detection ranges are 4 ns, 8 ns, 16 ns, 32 ns, and 56 ns at a resolution of 500 ps. As the maximum detection range increases, the number of circuit elements and the area of the clock predictor increase proportionally, while the area of the synchronizer is constant. Our final version of the implemented clock predictor can detect a phase difference of 0.5 to 56 ns with a resolution of 500 ps. It was made up of 113 registers, 448 delay cells, and 10K transistors. 
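The proportional growth of the clock predictor can be made concrete with a short sketch. The two formulas below are not stated by the authors; they are a pattern fitted here to the element counts listed in Table 2 (one register per 500 ps step plus one, and four delay cells per step), and the function name is hypothetical.

```python
def predictor_elements(max_range_ps, resolution_ps=500):
    """Element counts of the clock predictor, using a pattern fitted to Table 2.

    Assumption (not from the paper): with N = max_range / resolution steps,
    the table's counts match N + 1 D flip-flops and 4 * N delay cells.
    """
    steps = max_range_ps // resolution_ps
    return {"d_flip_flops": steps + 1, "delay_cells": 4 * steps}

for range_ns in (4, 8, 16, 32, 56):
    print(range_ns, "ns:", predictor_elements(range_ns * 1000))
```

Under this reading, doubling the detection range roughly doubles both counts, which matches the trend described above, whereas the synchronizer's element counts do not depend on the range at all.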
The implemented synchronizer was made up of 10 registers, 3 delay cells, and 0.3K transistors. The area overhead of the proposed scheme can be reduced by moving to a more advanced CMOS process and by using register and delay-cell types that need fewer and smaller transistors.

Table 2. Area overhead of the implemented circuits (number of circuit elements according to the detection range at a resolution of 500 ps)

| Implemented circuit | Circuit element | 0.5 to 4 ns | 0.5 to 8 ns | 0.5 to 16 ns | 0.5 to 32 ns | 0.5 to 56 ns |
|---|---|---|---|---|---|---|
| The Clock Predictor | D flip-flops | 9 | 17 | 33 | 65 | 113 |
| The Clock Predictor | Delay cells (per 500 ps) | 32 | 64 | 128 | 256 | 448 |
| The Clock Predictor | Transistors | 0.7 K | 1.4 K | 2.7 K | 5.3 K | 10 K |
| The Synchronizer | D flip-flops | 10 | 10 | 10 | 10 | 10 |
| The Synchronizer | Delay cells (per 500 ps) | 3 | 3 | 3 | 3 | 3 |
| The Synchronizer | Transistors | 0.3 K | 0.3 K | 0.3 K | 0.3 K | 0.3 K |

Fig. 13. The simulation waveforms of the clock predictor.

2. Simulation

We performed simulations to verify the functionality of the clock predictor and the synchronizer. We checked whether the clock predictor of the sender produced a predicted clock three cycles ahead of a synchronization failure, and whether the synchronizer of the receiver sampled data correctly using the predicted clock.

A. Operation of the Clock Predictor

The simulation environment was heterochronous: the period of the sender clock was 60 ns, and the period of lclk (the receiver clock) was 70 ns. The data transfer was simulated after the two clocks rose simultaneously at time 0. In addition, since the clock predictor needs an initial stabilizing time to apply the delay of $3T_D$ that produces p3, it was assumed that data are transmitted to the receiver only after p3 has stabilized. The setup and hold times were each set to 500 ps, taken from a flip-flop of the CMOS process. The sender clock and lclk are expected to face a synchronization failure at every multiple of the least common multiple of the two clock periods. Fig. 13 shows the waveforms of the clock predictor in Fig. 9.
It shows the process used to find $T_D$ (point C) and the process used to confirm that the predicted clock is produced properly (points A, B, D, and E).

$T_D$ is found as follows. Here, d is the sender clock delayed by $T_{lclk}$ (70 ns). At the first rising edge of d, the signal d_r goes high; the signal _r then goes high when the next rising edge of the sender clock samples d_r. The time difference between the rising edges of d_r and _r is 50 ns (120 ns - 70 ns), which is $T_D$.

The predicted clock is produced as follows. The sender clock and lclk are expected to face a synchronization failure at 420 ns (point A), the least common multiple of their periods. As explained in section IV.3, the synchronization failure should be detected at 210 ns (point B), which is three lclk cycles before the failure. For this, the predicted clock p3 is delayed by 150 ns ($3T_D$) at point C. Note that the first synchronization failure is not detected at point B, where it should be. This indicates that the clock predictor needs an initial stabilizing time to produce p3 correctly. The next synchronization failure occurs at 840 ns (point D), and thus it should be predicted at 630 ns (point E). At point E, p3 rises at exactly the required time.

Fig. 14. The simulation waveforms of the synchronizer.

B. Operation of the Synchronizer

Fig. 14 shows the waveforms of the synchronizer in Fig. 12. In this figure, a synchronization failure is detected at point B and the data are synchronized at point A in the synchronizer of the receiver.

The synchronization failure is detected as follows. The synchronizer takes p3 from the clock predictor along with data_in. At 630 ns (point B), where the synchronization failure is predicted, p3 and lclk rise at the same time. Thus, a synchronization failure is detected: $S_{setup}$ becomes 0 and $S_{hold}$ becomes 1. Owing to the double two-flop synchronizers, $S_{setup}$ and $S_{hold}$ appear three cycles later, at 840 ns (point A), where they are combined through an AND gate to generate the fail signal. At point A, fail becomes 1, indicating the detection of a synchronization failure.

The data are synchronized as follows. At 840 ns (point A), where a synchronization failure occurs, data_in changes from 0 to 1. data_in is sampled into nondelayed_data, captured by lclk, and into delayed_data, captured by lclk + $T_{\text{keep-out}}$. data_out is the final synchronized data. When the fail signal is 0, nondelayed_data is passed through a MUX to data_out; when the fail signal is 1, delayed_data is propagated to data_out instead. At point A in Fig. 14, delayed_data rather than nondelayed_data propagates to data_out, since a synchronization failure was detected at point B, three clock cycles ahead of point A.
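The detection and selection steps described in this subsection can be mimicked with a small behavioral model. The sketch below is only an illustration under stated assumptions, not the SPICE model used in the paper: p3 is modeled as an ideal 50% duty-cycle copy of the sender clock delayed by $3T_D$ (150 ns), the fail condition is taken to be $S_{setup}$ = 0 together with $S_{hold}$ = 1, the initial stabilizing time of the clock predictor is ignored, and all function names are hypothetical.

```python
T_SENDER = 60_000   # sender clock period in ps (60 ns)
T_LCLK = 70_000     # receiver clock (lclk) period in ps (70 ns)
T_SETUP = 500       # setup time in ps
T_HOLD = 500        # hold time in ps
P3_DELAY = 150_000  # 3 * T_D in ps (T_D = 50 ns)
SYNC_CYCLES = 3     # lclk cycles between prediction and the fail signal


def p3_level(t_ps):
    """Level of an idealized p3: sender clock delayed by P3_DELAY, 50% duty."""
    if t_ps < P3_DELAY:
        return 0  # assumption: p3 is low until the delayed clock starts
    return 1 if (t_ps - P3_DELAY) % T_SENDER < T_SENDER // 2 else 0


def predicted_failures(t_end_ps):
    """Return the lclk edges at which p3 rises inside the setup/hold window."""
    hits = []
    for edge in range(T_LCLK, t_end_ps + 1, T_LCLK):
        s_setup = p3_level(edge - T_SETUP)  # p3 sampled by lclk - T_setup
        s_hold = p3_level(edge + T_HOLD)    # p3 sampled by lclk + T_hold
        if s_setup == 0 and s_hold == 1:    # assumed fail condition
            hits.append(edge)
    return hits


def select_output(fail, nondelayed_data, delayed_data):
    """Output MUX: pass delayed_data when fail is 1, else nondelayed_data."""
    return delayed_data if fail else nondelayed_data


predictions = predicted_failures(900_000)
fail_times = [t + SYNC_CYCLES * T_LCLK for t in predictions]
print([t // 1000 for t in predictions])  # prediction instants in ns
print([t // 1000 for t in fail_times])   # instants where fail is asserted, in ns
```

In this idealized model the coincidence at 210 ns is flagged as well as the one at 630 ns; in the reported simulation the first one is missed because p3 has not yet stabilized.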


Fig. 15. The simulation waveforms of the implemented scheme.

C. Another Case

Fig. 15 shows simulation waveforms for the example mentioned in Figs. 4 and 5, where the periods of the sender clock and lclk are 40 ns and 50 ns, respectively. The sender clock and lclk are expected to face synchronization failures at the common multiples of 40 ns and 50 ns, such as 400 ns, 600 ns, and 800 ns. Each of these failures is detected three lclk cycles in advance. At 250 ns, 450 ns, and 650 ns (points B, D, and F in Fig. 15), lclk and p, the sender clock delayed by $3T_D$, rise at the same time. Consequently, the fail signal is asserted three cycles after points B, D, and F because of the double two-flop synchronizers. At 400 ns, 600 ns, and 800 ns (points A, C, and E), data_out conveys correct data values despite the synchronization failure, unlike nondelayed_data within the marked circles. This is because data_out takes data_in using the fail signal and lclk delayed by $T_{\text{keep-out}}$, as described for Fig. 12.

The simulation results above demonstrate that the proposed scheme operates without synchronization failure. With regard to latency, the proposed scheme incurs no synchronization latency. In terms of data transfer latency, however, the technique used to avoid synchronization failures incurs two latencies. One is the delay of the receiver clock when the receiver captures data without a synchronization failure. The other is the propagation delay of the flip-flops and MUXs used to avoid a synchronization failure in advance. Nevertheless, the data transfer latency of the proposed scheme is negligible compared with that of two-flop or FIFO synchronizers.

VI. CONCLUSIONS

This paper proposed a synchronization scheme for a heterochronous environment.
In the proposed scheme, the clock predictor in the sender produces a predicted clock that makes it possible to detect a synchronization failure in advance, and the synchronizer in the receiver detects and avoids the synchronization failure using the predicted clock. The proposed scheme has low latency, whereas high latency is regarded as a weak point of traditional methods. In addition, it can be designed without an asynchronous method, which would increase the complexity of the design. The proposed scheme is implemented using a Chartered 0.18 μm CMOS process through SPICE modeling. To generate the predicted clock, the implemented clock predictor can measure the phase difference between the two clock domains from 0.5 to 56 ns with a resolution of 500 ps. Building on this prediction mechanism, the proposed synchronizer can sample data correctly by detecting and avoiding synchronization failures without latency. The functionality of the implemented scheme was verified through simulation, and its area overhead was analyzed. The new synchronization scheme proposed in this paper offers SoC designers an additional choice in the design of a synchronizer.


APPENDIX. IDENTITY OF $T_D$ AND $T_X$

When the proposed scheme is implemented, the sender must know $T_D$ to produce p in Fig. 4. However, since the clock of the receiver (lclk) is not available to the sender, the sender instead uses the phase difference ($T_X$) with d, the sender clock delayed by the period of lclk ($T_{lclk}$). This section describes the relationship between $T_D$ and $T_X$.

Fig. 16 assumes that lclk and the sender clock rise at 0 ns at the same time. The time of a synchronization failure, $T_{failure}$, can be expressed as in Eq. (7): it falls on an integer number of lclk cycles and, at the same instant, on an integer number of sender clock cycles. Here, $N$ is the number of clock cycles of the receiver at the time of the synchronization failure, $K$ is the number of clock cycles of the sender at that time, and $T$ is the period of the sender clock.

$$T_{failure} = N \times T_{lclk} = K \times T \qquad (7)$$

Eq. (8) gives the prediction time, $T_{predict}$, one lclk cycle before the synchronization failure, expressed through its relationship with $T_D$. Here, $M$ is the number of clock cycles of the sender before the prediction time.

$$T_{predict} = M \times T + T_D = (N - 1) \times T_{lclk} \qquad (8)$$

Fig. 16. Example of clock prediction (timelines of $N \times T_{lclk}$, $K \times T$, and $M \times T$ up to the synchronization failure at 420 ns, with $T_{lclk}$, $T_D$, and $T_X$ marked).

Since d is the sender clock delayed by $T_{lclk}$, $T_{failure}$ can also be represented as in Eq. (9).
$$T_{failure} = M \times T + T_{lclk} + T_X = K \times T \qquad (9)$$

That is, $T_X$ and $T_D$ can be shown to be equal from Eqs. (8) and (9): adding $T_{lclk}$ to both sides of Eq. (8) gives $M \times T + T_D + T_{lclk} = N \times T_{lclk} = T_{failure}$ by Eq. (7), and comparing this with Eq. (9) yields $T_D = T_X$.

As an example, the values of Fig. 16 can be applied to Eqs. (8) and (9). $T_{lclk}$ and the sender clock period $T$ are 70 ns and 60 ns, respectively, the time of the synchronization failure is 420 ns, and the prediction time is 350 ns, one lclk cycle earlier. $T_D$ then becomes 50 ns, from 350 ns = (5 × 60 ns) + $T_D$ in Eq. (8). Likewise, $T_X$ becomes 50 ns, from (5 × 60 ns) + 70 ns + $T_X$ = 420 ns in Eq. (9). As a result, the two values are the same.

Sung-Gun Song received a Ph.D. degree in Electronics and Computer Engineering from Chonnam National University, Gwangju, Korea. He received the B.S. and M.S. degrees in Information and Communication Engineering from Honam University, Gwangju, Korea, in 2004 and 2006, respectively. His research interests include embedded system and SoC design, communication protocols, and VLSI architectures for asynchronous and synchronous circuits.

Seong-Mo Park is a professor in the School of Electronics and Computer Engineering, Chonnam National University. He received his B.S. degree in Electronics Engineering from Seoul National University, Seoul, Korea, in 1977, and his M.S. degree in Electrical and Electronics Engineering from Korea Advanced Institute of Science and Technology, Seoul, Korea. He received his Ph.D. degree in Electrical and Computer Engineering from North Carolina State University, Raleigh, NC. He worked as an IC design engineer at Korea Institute of Electronics Technology, Korea, from 1979. He was with Old Dominion University, Norfolk, VA, from 1988 to 1992 as an assistant professor. His research interests include digital signal processing, algorithm-specific architecture, and embedded system and SoC design. Dr. Park is a member of IEEE, IEEK, and KMMS.


Jeong-Gun Lee received the B.S. degree in computer engineering from Hallym University in 1996, and the M.S. degree (1998) and Ph.D. degree from Gwangju Institute of Science and Technology (GIST), Korea. He is currently an associate professor in the Computer Engineering department at Hallym University. Prior to joining the faculty of Hallym University in 2008, he was a postdoctoral researcher at the Computer Lab. of the University of Cambridge, UK. During his sabbatical leave (from Jan. to Dec. of 2014), he was a visiting scholar at the Computer Lab. of the University of Cambridge. His research interests focus on asynchronous circuit design, high performance computing with GPU and FPGA, and multi-core systems for multimedia applications.

Myeong-Hoon Oh received his Ph.D. in information and communications engineering from Gwangju Institute of Science and Technology (GIST), Gwangju, Korea. He has been with Electronics and Telecommunications Research Institute (ETRI), Daejeon, Korea, since 2005 as a senior engineer. Since 2006, he has been an associate professor at the University of Science and Technology (UST), Daejeon, Korea. His current research focuses on digital circuit design, related embedded systems, cloud computing hardware design, and cloud computing standardization. He has also been an editor developing a Recommendation on cloud Desktop as a Service in ITU-T SG13.
