Clock Control Architecture and ATPG for Reducing Pattern Count in SoC Designs with Multiple Clock Domains

Tom Waayers, Richard Morren
NXP Semiconductors, High Tech Campus 46, 5656 AE Eindhoven, The Netherlands
{tom.waayers, richard.morren}@nxp.com

Xijiang Lin, Mark Kassab
Mentor Graphics Corp., 8005 SW Boeckman Rd., Wilsonville, OR 97068
{xijiang_lin, mark_kassab}@mentor.com

Abstract

This paper presents a clock control architecture for designs with multiple clock domains, and a novel mix of existing ATPG techniques as well as novel ATPG enhancements. The combination of the ATPG techniques and the clock control hardware lowers the number of test patterns in a fully automated flow, while maintaining the high coverage that is required nowadays by production test. Experimental results are shown for two industrial designs.

1. Introduction

In high performance IC designs, we find a growing number of chip-internal clock domains that interact. Clock distribution and balancing, while taking all constraints from embedded modules into account, is an iterative process that is time consuming and error prone for large Systems-on-Chip (SoCs). One option is to build smaller synchronous islands that communicate in a non-synchronous manner [1]. Within the islands, the design is synchronous and hence that is where the clock trees or clock domains are found. This is also where additional control for test is preferred such that internal clocks can be suppressed by scan-based ATPG to prevent skew problems during manufacturing test.

At-speed test was the main driver for moving control of test clocks from the tester to the silicon under test. As functional frequencies and the number of clock domains increased, it became increasingly difficult, if not impossible, to provide the clocking directly from the tester. Several designs have been reported over the past few years that apply on-chip clock controllers to generate high speed clock signals on chip using the internal functional PLL.
In [2], PLL clock chopping was introduced, which allowed at-speed scan to occur while inputting a slower clock from the tester. In [3], a programmable clock control circuit for issuing a given number of at-speed clocks is presented. Any complex at-speed functional clock waveform for 6 cycles can be applied by the method presented in [4]. This method allows asynchronous clock domains to be tested simultaneously. A special test waveform generator circuit fed by a PLL creates specific functional clock waveforms for each pattern. As a result, pattern generation becomes harder because of the increase of the sequential depth for ATPG. The Clock Pulse Filter blocks that are introduced in [5] provide two, three, or four pulses to generate delay tests for signals that cross the boundaries of synchronous clock domains. These tests apply a launch pulse in one clock domain and a capture pulse in the other clock domain. The Clock Control Macro that is discussed in [6] delivers a programmable number of launch clock pulses, and a number of capture pulses. Only one clock per domain can toggle during a launch or capture cycle. The concept requires foreknowledge of the capture sequences that the ATPG tool will require during the ATPG process.

To facilitate scan-based structural testing and provide ATPG with the ability to generate safe, skew-free patterns, the on-chip clock controllers are often implemented to be programmable such that each internal clock is controlled by the logic values loaded at a set of dedicated scan cells. The loaded control values determine the operation of the internal clocks during capture. During scan load/unload, scan data is shifted through scan chains using a slow clock signal generated by the tester. This approach enables the use of low-speed testers to test high performance devices, reducing test cost.

Paper 4. INTERNATIONAL TEST CONFERENCE 978--4244-7207-9/0/$26.00 200 IEEE
To the best of our knowledge, the prior work in this field does not address the design of clock controllers to target pattern count reduction for stuck-at ATPG. Test data volume and test application time are also dominant factors in determining the test cost. To achieve high fault coverage while minimizing the test pattern count, significant research and industrial work has been done in the past several decades. In our work, we do not focus on test data compression techniques in which compressed test stimuli are expanded into the actual data loaded into scan chains through an on-chip de-compressor, and the test responses are compacted by an on-chip compactor [7][8]. We target pattern reduction on top of test data compression. In ATPG-based solutions, dynamic compaction and static compaction are two basic approaches to reduce test pattern count. When designs include multiple clocks, clock grouping and staggering [9] are two effective and complementary techniques to dramatically reduce the test pattern count further. We will describe these two techniques in Section 7, and extend them in Section 8.

In this paper, we present a clock control architecture for designs with multiple clock domains, and a novel mix of existing ATPG techniques as well as an improvement to one of those ATPG techniques. The combination of the ATPG techniques and the clock control hardware lowers the number of test patterns while maintaining the high test coverage that is required nowadays by production test. Our focus includes implementing hardware that targets pattern reduction for stuck-at test, rather than at-speed test only. In addition, a novel technique for ATPG is presented to derive clock control information from a design description instead of external input. This results in a test development process that avoids human mistakes in pattern creation, mistakes that are otherwise only found by timing simulations very late in the design phase. Experimental results with the clock control architecture and novel ATPG techniques on two industrial designs show a substantial reduction in pattern count, in a fully automated flow.

The remainder of the paper is organized as follows. Section 2 briefly describes the clock related aspects of our design style, Islands of Synchronicity. In Section 3, on-chip clock generation is introduced that uses functional clock dividers to support at-speed inter-clock domain test. The clock control architecture, scan chain architecture, and automated clock control identification are presented in Sections 4 to 6. ATPG techniques to utilize the on-chip clock control for test pattern count reduction are addressed in Sections 7 and 8. Experimental results on two industrial designs are reported in Section 9. Section 10 concludes the paper.

2. Islands of Synchronicity

In modern designs, IP is re-used as much as possible and integrated with new functionality. SoC integration is more and more a critical aspect, both from a functional silicon design (timing closure in particular) and test point of view.

Figure 1. Islands of synchronicity

Clock distribution and clock tree balancing in large SoC designs is difficult to achieve at any operating point. From a clocking perspective, our SoCs are typically built as a collection of 'Islands of synchronicity'. As synchronous design is no longer feasible at the SoC level, smaller synchronous islands are made which communicate in either a source synchronous or asynchronous manner. At the SoC level, there is only clock routing from the central Clock Generator Unit (CGU) to the islands. All clock generation is preferably centralized into the CGU. In the case of an SoC with distinct subsystems and multiple PLLs, multiple CGUs can be present.

Clocks can originate from the same clock source, but due to buffering and insertion delays, the end delay can be different between clock trees (e.g., islands 3 and 4 in Figure 1). It implies that different islands have to be considered as different clock domains during test, even if the clock sources are identical.

Another important test aspect is that a clock source in the CGU can branch to several hardened IP blocks embedded inside an island. Since this IP typically requires the clock control features of ATPG, dedicated control structures shall be implemented in its test wrapper (see Section 4) to prevent tedious clock alignment and to enable parallel testing with other wrapped IP. Islands can be isolated for test by a wrapper and contain embedded IP that is isolated by a wrapper recursively. This means that cascaded clock control structures can be present and the automated clock control identification has to cope with it (see Section 6).

In a single-frequency island, all clocks are identical in phase and frequency. Identical here means that the skew in the clock tree end points is sufficiently smaller than the smallest data path propagation delay. Whether a clock domain is synchronous by a single clock tree, or forced to be synchronous by clock tree alignment in the back-end, obviously does not matter.
In both cases, the domain is synchronous and standard ATPG without clock control features will be successful.

Some single-frequency islands can have clocks that only differ in phase. For stuck-at test, this phase difference can affect the capture cycle, whereas for transition test, it likely affects both the launch and capture cycles. Clock control features of ATPG can be used to guarantee that data transfer between clock domains that differ in phase is safe during test. Skew-safe pattern creation by ATPG is also required for asynchronous interfaces that in functional mode by definition do not have a defined phase or frequency relationship. It is noted that transition test requires launch events to happen before capture events and that hence the clock pulses in different domains must respect an adequate order provided by the CGU.

3. On-Chip Clock Generation

For designs that require testing for transition faults, it is important to derive clocks from the on-chip PLLs during capture cycles of a scan test. With clocks that are derived from one PLL, it is possible to test multiple clock domains at different frequencies in parallel. We typically implement divider settings that apply frequencies up to application speed, or even above. With the appropriate test structures inside the CGU, a clocking scheme can be created that allows testing for transition faults across clock domains.

Figure 2. At speed transition across clock domains (inputs clk_xx, clk_yy, clk_zz, TCK, se, clockdiv_rst; outputs clk_g_xx, clk_g_yy, clk_g_z, clk_g_z2; at-speed clock switch with simultaneous launch and unaligned capture)

One of the basic functions of the CGU during test is switching all internal clocks to a primary test clock pin. For scan tests that target static fault models, a tester-driven clock is used during both shift and capture cycles. For transition test, the switching mechanism is used to select between a tester-driven clock during the shift operation and a PLL-driven clock during capture cycles as shown in Figure 2. A high value at the scan enable input se defines the shift operation to be active. As a result, all internal clocks clk_g_<> follow the primary clock input TCK. A low value results in internal clocks that are driven from their functional PLL divider outputs. To prevent any race condition, the dividers are kept in reset by a high value on signal clockdiv_rst during the shift operation. Releasing the divider resets by forcing a low value causes at-speed pulses to be generated. The clockdiv_rst signal can be derived from a scan input terminal since it does not require direct control during the shift cycles. The number of pulses for transition test is limited to a maximum of two cycles by the sequencer structure that will be introduced in Section 4.

The clock dividers in the CGU provide a reset function such that all internally generated clocks start with an aligned launch edge once the reset is released. This enables transition test across clock domains (see Figure 2). In this example, domain clk_g_yy interfaces to clk_g_z. ATPG guarantees a safe capture in the latter domain by masking the capture of the first domain. Masking is done by the clock control hardware introduced in Section 4. Due to differences in clock tree insertion delays, the actual edges at the flip-flops might not be aligned, but it shall be guaranteed that all launch edges happen before any capture event. Note that the clock skew between unbalanced clock trees must be sufficiently less than the shortest clock period applied in order to satisfy this condition for the launching edges. At the assertion of the reset, all clock dividers have to output a low value since AND-type gating is used in the clock control architecture. Resetting the internal dividers can be done synchronously since stopping the clocks is not timing critical.

4. Clock control architecture

The Islands of synchronicity design style forms the basis for our clocking concept that uses both centralized clock control for test in the CGU and de-centralized clock control at test wrapper boundaries of synchronous modules.

Figure 3. Clock control architecture (CGU, sequencers at test wrapper boundaries, functional clock gating)

The clocking concept allows standard application-related clock gating for low power purposes, which is mandated to be part of clock trees such that no destructive skew is introduced. The basic principle of the clock control concept is that each clock domain's source clock can be suppressed via scan chain elements independent of other clock domain source clocks. To guarantee full controllability over internal clocks during multiple capture cycles, it has been decided not to reuse functional clock gating structures for clock control during test. Reuse of functional clock gating generally means that multiple gating structures cannot stop or allow clocks independently of each other due to dependencies in their control logic. This may lead to untestable faults and an increase in pattern count. Figure 3 shows the different components of the clocking architecture and their hierarchical position in the overall design.
The functional clock gating, also known as power gating, is implemented at the register level and is highly distributed.

Figure 4. Clock gate with test control (inputs: functional hold_clock_n, wir_local_clock_active_se, clk; clock gate latch with pins TE, E, GN; output gated_clk)

In the architecture, the AND-type gating is used by default (see Figure 4). It implements glitch-free gating by use of a latch and a mechanism for test to overrule the functional control hold_clock_n via the wir_local_clock_active_se signal. During application mode, this signal is constantly driven low from a test Wrapper Instruction Register (WIR) as defined in IEEE Std. 1500 [10] to enable the functional control via the hold_clock_n signal. Running clocks during scan shifting are guaranteed by feeding the global chip scan enable into the wir_local_clock_active_se signal. During scan test capture cycles, it is controlled by a scan element to achieve full test coverage. For tests that do use the scan path but require a running clock for proper operation, like memory BIST or mixed signal test, a continuous high value can be driven to force this running clock.

Figure 5. Four bit sequencer clock control (inputs: clk, wir_clock_active_se, wir_seq_two_cycle_n, si, se, wir_seq_en; outputs: so, gated_clk)

Figure 5 shows our standard clock control structure at the synchronous island boundary. By definition, this structure sits at the beginning of each clock tree and can be treated as a clock isolation cell that is part of the test wrapper. We call the four scan elements that form a shift register during the capture cycles a 4-bit sequencer. The sequencer size is limited to 2 stages by a low value on wir_seq_two_cycle_n. This is typically used for at-speed transition test as explained in Section 3, but can also serve as a fallback scenario when too many clock control constraints hamper the equation solving efficiency of test data compression.

The clock control architecture does allow sequencer sizes of two or more. The minimum length guarantees that efficient transition tests can be generated by ATPG while using standard clock control features. We typically use larger sequencers to enable more efficient pattern creation due to clock staggering, as shown later in Section 7, and to enable other advanced ATPG features that require multiple capture cycles (e.g., testing through memories). Once the sequencer is enabled by driving a high value on wir_seq_en from the WIR, the scan elements provide the clock control for ATPG during capture cycles.
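The capture-cycle behavior of the sequencer can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the authors' hardware description: it assumes that the first scan element drives the clock gate enable, that a loaded 1 means "pulse", that the sequencer shifts once per cycle of the ungated clk, and that the clock-off value shifted in at the sequencer input is 0. The function name capture_pulses is hypothetical.

```python
from typing import List

CLOCK_OFF = 0  # assumed clock-off value forced at the sequencer input during capture


def capture_pulses(loaded_bits: List[int], capture_cycles: int) -> List[int]:
    """Return, per capture cycle, 1 if gated_clk pulses and 0 if it is suppressed.

    loaded_bits[0] models the scan element assumed to drive the clock gate enable;
    loaded_bits[k] reaches that position after k cycles of the ungated clk.
    """
    state = list(loaded_bits)
    pulses = []
    for _ in range(capture_cycles):
        pulses.append(state[0])          # enable value seen by the clock gate this cycle
        state = state[1:] + [CLOCK_OFF]  # shift one stage, clock-off value enters at the end
    return pulses
```

For a 4-bit sequencer loaded with [1, 0, 1, 1], capture_pulses([1, 0, 1, 1], 6) gives [1, 0, 1, 1, 0, 0]: the clock pulses in cycles 0, 2 and 3, and no further pulse can occur once the loaded bits have been shifted out, so the number of pulses never exceeds the sequencer size.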
During scan shifting, the se and wir_seq_two_cycle_n signals receive a high logic value driven by the global scan enable. As a result, the sequencer is scanned as part of the overall scan infrastructure, of which details are explained in Section 5. Note that scan insertion of the sequencer only needs to replace the first element since the structure already implements a shift register by definition. The signal wir_clock_active_se resembles wir_local_clock_active_se except that the former requires a low value during scan test capture cycles to enable control of gated_clk from the sequencer flip-flops and to force the clock-off value at the input of the sequencer. The latter guarantees that the maximum number of gated_clk pulses is defined by the sequencer size. The sequencer is clocked by a functional clock clk that is not actively gated or derived from a test clock at any stage outside the CGU.

In practice, designs do not follow the Islands of synchronicity design style literally in all their components. As a result, clouds of logic can be fed by clock trees that have their root at the CGU clock outputs, which in this case require the same clock control features as described above for the island boundary. In summary, all clock trees in the design are mandated to have a clock control structure that is dedicated to test, with a sequencer of minimal length two.

5. Scan Chain Architecture

Our SoC architecture implements a single Test Access Mechanism (TAM) for scan access as proposed in []. Wrapped modules get scan access via the full available TAM width N that is typically connected to test data compression logic (EDT) at the SoC level (see Figure 6). In the daisy-chain scan architecture, multiple wrapped modules can be accessed simultaneously. Each module implements multiple scan chain configurations to guarantee efficient test access.
Efficient interconnect test between modules requires scan access to the wrapper cells and the accompanying sequencer cells that control their clocks, and bypasses module-internal scan chains. In addition, wrapped modules implement a single bit scan bypass that allows any subset of wrappers to be accessed simultaneously.

For several reasons, it is not uncommon in practice to have a considerable amount of glue logic at the SoC level that is not isolated for test by a wrapper. The flip-flops of this glue logic are grouped into N balanced scan chains that are provided with a combinational bypass path. This bypass path implements our transport mechanism and enables the glue logic clocks to be switched off when this logic is not under test.

Figure 6. Single TAM daisy-chain scan architecture (EDT logic at both ends; Module 1, Module 2 and Module 3, each with N scan chains and an internal bypass; glue logic with N chains and a bypass)

Because of the scan chain architecture and its multitude of configurations, the position of sequencer cells during scan access can differ per configuration. In case a module is configured in bypass, all internal and wrapper cells including clock control sequencers are bypassed. At the same time, this causes sequencer cells of neighboring islands further down the daisy-chain to change their chain position relative to their chain input. For automated clock control identification, we need to take this behavior into account and guarantee that only accessible clock control structures are identified, as explained in the next section.

6. Clock Control Identification

The clock control sequencer structure presented in Section 4 on the one hand is used to enable at-speed transition test, and on the other provides a means for efficient skew-robust pattern generation. To generate valid clock sequences at the internal clocks, the clock control schemes for each internal clock must be provided to the ATPG tool as presented in [2].

    clock_control gated_clk_ =
      atpg_cycle 0 =
        condition u_top/u_core/ccb_yp/sff0_q_reg/qn 0;
      end;
      atpg_cycle =
        condition u_top/u_core/ccb_yp/sff_q_reg/q ;
      end;
    end;

Figure 7. Example clock control definition

Figure 7 gives an example definition that describes the conditions for internal clock gated_clk_ to pulse. The keyword atpg_cycle defines the clock cycle in which to pulse gated_clk_. The accompanying clock-on conditions are all defined by a value at the output of a scan cell. The example shows the clock control scheme for a sequencer of length two that has an inverting output QN connected to the clock gate. The clock-off conditions are derived directly from the clock-on conditions by the ATPG tool.

Despite the fact that the schemes can be limited to the clock enable conditions per ATPG cycle, manually defining clock control details can be a tedious and error-prone task in practice. This process can be fully automated by identifying all sequencer cells and their relation to the internal clocks per scan configuration. Remember that all clock trees in our architecture have this standard clock control structure that is dedicated to test. We perform the clock control identification in the following steps:
1. Read the complete design description
2. Build a list of internal clocks driven by a clock gater that is controlled from a sequencer
3.
Build a list of valid internal clocks by removing those driven from functional sequencer structures
4. Per test mode, write clock control definitions for valid internal clocks that are controlled from a sequencer that has scan access

The automated identification of valid clock control makes use of two techniques. The first is symbolic simulation of clocked initialization sequences followed by a state stability analysis to calculate test mode values on internal nodes of the design description. The second is tracing by backward traversal of nets, guided by symbolic values on internal nodes and logic expressions of structures.

Figure 8. Test mode values for enabled sequencer (running TCK at the clock input, 0 on wir_clock_active_se and se, clock-off value 0 at the sequencer input, undefined value U at the clock gate enable, output gated_clk_)

After reading the design, a first simulation is performed to set up a mode in which all sequencers are enabled. Any sequencer not enabled in this mode is not considered to drive a clock tree in subsequent steps. The mode typically stems from an initialization that enables all wrapped modules and further logic for stuck-at test, disables asynchronous resets, and forces running test clocks from primary terminals. In addition, the scan enable condition for capture cycles is propagated to complete the mode. The following steps locate the clock gating and sequencer structures under capture mode conditions (see Figure 8):
1. Locate candidate clock gates: a 2-input AND gate embedded in a library cell with a running clock value at one input (the clock) and an undefined value at the other input (the enable). The internal clock will be the net name of the library cell instance output port.
2. Locate the first sequencer element by tracing the clock gate enable to an output of a flip-flop.
3. Locate the further sequencer elements: the active data input of flip-flop N-1 must stem from the output of another flip-flop, which is then the Nth sequencer element. The sequencer ends at a flip-flop whose data input has a clock-off value.
4. All sequencer flip-flops must receive a running clock and have disabled asynchronous set and/or reset ports.

In application mode, the sequencer structures cannot control internal clock gating by definition since we do not allow functional clock gating structures to be used for clock control during test (as explained in Section 4). This means that under functional operating conditions, either the clock gater enable is not controlled from the sequencer flip-flop output, or the enable is controlled from a sequencer flip-flop output that is inactive because of an active asynchronous set or reset value.

To be able to compile a list of valid internal clocks by removing those driven from functional sequencer structures, symbolic simulation of the device reset sequence (e.g., asynchronous reset via primary pin) is performed to set up a mode in which application values are present at internal nodes of the design description. For all internal clocks that were identified to be controlled from a sequencer, the clock gating enable source is traced. The accompanying internal clock is removed from the list of valid internal clocks in case it is driven from the sequencer flip-flop that was found before and its

asynchronous set or reset port is inactive.

Upon ATPG input generation for all test modes, the clock gate and sequencer structures are identified using the mechanism described, but now applied on the list of valid internal clocks only. This is done to guarantee correct clock control descriptions. Specific test modes typically select a subset of sequencers and as such differ from the one that enabled all sequencers in the first steps of the clock control identification. In addition, it is checked that each sequencer element exists as an element of one of the scan chains in the scan chain configuration that is enabled by the test mode. Clock control definitions for a specific test mode are created for all valid internal clocks that are controlled by a sequencer that has scan access in this test mode. The example listed in Figure 7 describes the structure shown in Figure 8. Calculating scan chains is done in a separate step for each test mode.

7. Clock Grouping and Staggering to Reduce Pattern Count

To reduce the test pattern count generated by ATPG in the presence of multiple clock domains, two complementary approaches, clock grouping and clock concatenation (or staggering), have been proposed in [9]. Those will be briefly explained in this section since the next section will describe how to further improve the results possible using this combination.

The clock grouping technique analyzes the interaction between every pair of clocks in the design and classifies non-interacting clocks into groups. A pair of clocks is said to be interacting if the data captured by one clock can impact the data captured by another clock [][9][3]. All clocks in the same group may be pulsed simultaneously during test generation without causing capture of uncertain values into state elements. However, if interacting clocks are pulsed simultaneously, it may not be possible to predict the values captured by the state elements at the boundary of the clock domains due to the unknown skew between the clock domains. In order to increase the number of clocks in the same group and consequently reduce pattern count, one DFT technique is to insert isolation logic at either the sources or the sinks of all inter-clock-domain paths [4][5]. The additional logic blocks the interaction between the interacting clocks during capture such that these clocks can be grouped together. In our designs, this DFT is not used to facilitate pulsing of interacting clocks.

The clock staggering technique pulses different clocks sequentially between scan load and unload. Since there is no overlap between the pulsed clocks, the test pattern count reduction is achieved by pulsing the interacting clocks sequentially. This technique requires that there be adequate time between the pulses of the interacting clocks in order to avoid clock skew affecting the captured responses. This is straightforward when generating static tests, such as stuck-at tests, since slow clocks are used and there exists adequate time between clock edges. For example, a stuck-at test pattern waveform is shown in Figure 9. In this figure, clk1 and clk2 are pulsed simultaneously in the first capture cycle since there is no interaction between them. In the following cycle, the non-interacting clock pair clk3 and clk4 is pulsed simultaneously. Although clk1 and clk3 interact, the skew is not an issue since the order of events is predictable. When performing at-speed testing where the capture clocks are derived from an on-chip PLL, hardware support should be implemented if clock staggering is to be used.

Figure 9. Stuck-at test waveform with both clock grouping and clock staggering

Figure 10. At-speed test waveform with clock staggering (signals AC_start, clk1, clk2; time intervals t1 to t5)

Consider the clock waveforms shown in Figure 10. The on-chip PLL based at-speed test uses a signal such as AC_start as a trigger.
When AC_start is asserted, double pulses are created at the internal clock clk1 derived from PLL outputs after a certain time period t1. After the double pulse at clk1, another double pulse is created at clk2. If clk1 and clk2 are interacting clocks, the time interval t3 between the last pulse of clk1 and the first pulse of clk2 must be long enough to allow the signal changes at the state elements controlled by clk1 to propagate to the inputs of the state elements controlled by clk2. Otherwise, uncertain values will be captured in the state elements located in the inter-clock domain between clk1 and clk2. If those clocks vary significantly in frequency, meeting this timing requirement can require staggered trigger signals or significantly larger clock control circuitry. Due to the additional design complexity of supporting clock staggering for at-speed transition tests, clock staggering will only be used for static tests in this paper.

In order to maximize the test pattern count reduction, the clock grouping and clock staggering techniques can be used concurrently. Compatible clocks can be pulsed simultaneously, while groups of incompatible clocks are staggered. The combination allows many clock domains to be exercised in each test pattern, but for designs with a large number of clock domain groups, this may still be insufficient for optimal results.

8. Pulsing Interacting Clocks Simultaneously to Reduce Pattern Count

This section presents an enhancement to the ATPG method presented in [3] where interacting clocks are allowed to pulse simultaneously to reduce pattern count. In addition, this method is combined with the clock grouping and staggering methods explained in the previous section to further improve compaction for designs with many clock domain groups.

Clock staggering, when used, is usually applied for a relatively small number of cycles (usually 4 cycles or less). To support clock staggering for many cycles, the clock controllers shall allow controllability of the clock domains for the required number of cycles. This correspondingly increases the number of bits that ATPG has to specify to control each clock in each of those cycles. This can have a negative impact on compression for high chain-to-channel ratios. Attempting to stagger a large number of clock groups leads to high effort and slow run time by ATPG.

Figure 11. Pulse interacting clocks to detect faults simultaneously (clocks clk1 and clk2; flip-flops DFF1 to DFF6; gates g1, g2, g3; faults g1 s-a-0 and g3 s-a-0)

In our designs, the number of clock domains is increasing into the 100s, and the number of clock boundary state elements for a clock pair is typically much less than the number of state elements in those clock domains. Pulsing only compatible clock groups in each cycle, even when combined with clock staggering, may still leave many clock domains unexercised in a given pattern. Preventing clock pairs with a small amount of interaction from pulsing simultaneously results in sub-optimal pattern count.
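Why compatible grouping plus a limited number of staggered cycles can leave domains unexercised can be seen in a small sketch. It is illustrative only: the greedy first-fit grouping is an assumed stand-in, not the grouping algorithm of [9], and the function names group_clocks and stagger, the clock names, and the interaction table are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def group_clocks(clocks: List[str], interacts: Dict[str, Set[str]]) -> List[List[str]]:
    """Greedy first-fit: place each clock in the first group with no interacting member."""
    groups: List[List[str]] = []
    for clk in clocks:
        partners = interacts.get(clk, set())
        for group in groups:
            if not any(member in partners for member in group):
                group.append(clk)
                break
        else:
            groups.append([clk])  # clk interacts with every existing group
    return groups


def stagger(groups: List[List[str]], max_cycles: int) -> Tuple[List[List[str]], List[str]]:
    """Pulse one group per capture cycle; clocks in groups beyond max_cycles stay unpulsed."""
    schedule = groups[:max_cycles]
    unexercised = [clk for group in groups[max_cycles:] for clk in group]
    return schedule, unexercised


# Hypothetical interaction table (symmetric): clk1/clk3 and clk2/clk4 interact.
interacts = {"clk1": {"clk3"}, "clk3": {"clk1"}, "clk2": {"clk4"}, "clk4": {"clk2"}}
groups = group_clocks(["clk1", "clk2", "clk3", "clk4"], interacts)
schedule, unexercised = stagger(groups, max_cycles=1)
```

With this table the grouping yields [["clk1", "clk2"], ["clk3", "clk4"]]. With two capture cycles both groups are pulsed in one pattern, but with a single cycle clk3 and clk4 remain unexercised, which is the situation that grows worse as the number of groups exceeds the available sequencer depth.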
It is more effective to allow some of those interacting clocks to pulse simultaneously, even at the expense of having to X out boundary state elements when needed. ATPG has to be performed in such a way that no coverage is lost as a result.

Consider the circuit shown in the figure above. Clocks clk1 and clk2 are interacting clocks since the data captured at DFF3 by clk2 can reach the data input of DFF4, which is controlled by clk1. If clk1 and clk2 are not allowed to be pulsed simultaneously, two test patterns are required to detect g1 s-a-0 and g3 s-a-0. The first test pattern detects g1 s-a-0 at DFF2 by assigning DFF1=1 and pulsing clk1. The second test pattern detects g3 s-a-0 at DFF6 by assigning DFF5=1 and pulsing clk2. Since DFF1, DFF2, DFF5, and DFF6 are not located at clock domain crossings, those two test patterns can be merged into one test pattern that detects both faults. However, if the interacting clocks clk1 and clk2 are pulsed simultaneously, DFF4 needs to be masked out when the transition launched from DFF3 can impact the value captured into DFF4. This means assigning a capture value of X to DFF4 during simulation.

To implement accurate masking for the state elements at the clock domain crossing, one can use the same technique as for false/multi-cycle path simulation [16]. However, that method is designed for fully specified tests and can be used during fault simulation. During ATPG, only a test cube is available. (A test cube is a partially specified test pattern that detects one or more target faults.) Since it is very difficult to support false path handling for partially specified test cubes during test cube generation for target faults, an iterative method was proposed in [13].
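The merging and masking steps of this example can be sketched as follows. This is a minimal illustration and not the data structure of any ATPG tool: the dictionary representation of a test cube, the `pulse_clk1`/`pulse_clk2` entries, and the captured values are assumptions chosen for the example.

```python
X = "X"  # unspecified (don't care) or unknown value


def merge_cubes(cube_a, cube_b):
    """Merge two partially specified test cubes.
    Returns None when they assign conflicting values to the same entry."""
    merged = dict(cube_a)
    for name, value in cube_b.items():
        if value == X:
            continue  # an unspecified entry never conflicts
        current = merged.get(name, X)
        if current != X and current != value:
            return None
        merged[name] = value
    return merged


def mask_boundary(captured, boundary_cells):
    """Replace the captured value of clock-domain-crossing cells by X."""
    return {cell: (X if cell in boundary_cells else value)
            for cell, value in captured.items()}


# Clocks may not pulse together: each cube forces the other clock off.
exclusive_g1 = {"DFF1": 1, "pulse_clk1": 1, "pulse_clk2": 0}
exclusive_g3 = {"DFF5": 1, "pulse_clk1": 0, "pulse_clk2": 1}
print(merge_cubes(exclusive_g1, exclusive_g3))  # conflict on the clock entries

# Simultaneous pulsing allowed: the cubes merge into a single pattern.
cube_g1 = {"DFF1": 1, "pulse_clk1": 1}
cube_g3 = {"DFF5": 1, "pulse_clk2": 1}
merged = merge_cubes(cube_g1, cube_g3)
print(merged)

# Hypothetical captured responses; DFF4 sits at the crossing and is masked.
captured = {"DFF2": 1, "DFF4": 0, "DFF6": 1}
masked = mask_boundary(captured, boundary_cells={"DFF4"})
print(masked)
```

The observation points DFF2 and DFF6 keep their values, so both faults remain detectable, while the unreliable value in DFF4 is excluded from the expected response.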
In the first iteration of the method proposed in [13], the test generator ignores clock interaction when generating test cubes for target faults, and the masking operation is implemented in the fault simulation step. As a result, the generated test cubes may fail to detect their target faults during fault simulation. The unsuccessfully tested faults are retargeted in the second iteration, in which all interacting clocks are prohibited from simultaneous pulsing during test cube generation in order to avoid fault coverage loss. The test pattern reduction ratio achieved with this method depends on an empirical parameter that controls when to switch from the first iteration to the second iteration. The best control parameter is design dependent.

In our implementation, we enhanced the test generator to take the masking operation into account in order to avoid multiple test generation iterations. For example, if the partial test cube shown in the figure above has been generated and the next secondary target fault during compaction is g2 s-a-0, we skip this fault, since it is possible that the fault would no longer be detected during fault simulation when masking is applied. On the other hand, when there are too many state elements located at the clock domain crossing of an interacting clock pair, the masking operation for those state elements may have a significant impact on dynamic compaction if this pair of clocks is allowed to be pulsed simultaneously. In our implementation, we allow a pair of clocks $clk_i$ and $clk_j$ to be pulsed simultaneously if it satisfies the following condition:

$$\max\left(\frac{IS_i}{CS_i},\ \frac{IS_j}{CS_j}\right) \le \text{threshold}$$
where $CS_i$ and $CS_j$ are the numbers of state elements controlled by $clk_i$ and $clk_j$, respectively; $IS_i$ ($IS_j$) is the number of state elements that are controlled by $clk_i$ ($clk_j$) and reachable from the state elements controlled by $clk_j$ ($clk_i$) through combinational paths; and the threshold is a real number between 0 and 1. A threshold of 0 means that no interacting clock pair is allowed to be pulsed simultaneously, and a threshold of 1 means that any clock pair is allowed to be pulsed simultaneously.

During at-speed test, the relative order of clocks is usually unpredictable, especially when they are derived from different on-chip PLL sources, or when multi-frequency clocks interact and include state elements that operate on either edge of the clock. To accommodate all those cases safely in the general case, when two interacting clocks clk1 and clk2 are double-pulsed in a test pattern, all the state elements at the receiving boundary of either clock domain must be masked out in every capture cycle because of the unknown phase relationship between clk1 and clk2. However, when using the clock control method presented in this paper, the clocks are launch-aligned and only control state elements on the leading edge of the clock. In addition, only one PLL source is active at any time. This means that for any pair of clocks, all the launch events occur prior to the capture events, and there is no need for this pessimistic assumption. This also allows testing of inter-domain faults.

9. Experimental Results

The clock control architecture presented in this paper has been implemented on two silicon devices, each of which targets a different application domain. As such, they differ considerably in design characteristics and implementation details, as listed in Table 1.

Table 1: Design characteristics

Design   # Scan Cells   # Internal Clocks   # Clock Groups   # Faults   Application area
A        80k            46                  9                3.4M       Radio
B        .4M            89                  56               62.M       TV

Design A uses c065 technology and embeds a lot of analog IP that was excluded for our experiments.
This design has a substantial number of clocks grouped into a relatively small number of clock groups. The on-chip test data compression logic implements a ratio of 40 between primary pin scan channels and internal scan chains. In the scan chains, 46 clock control sequencers are accessible. Design B uses c045 technology and has a lot of digital IP embedded, which results in significant clock domain interaction, as reflected in the relatively high number of clock groups. The on-chip test data compression logic implements a ratio of 32 between channels and chains. In the scan chains, 89 clock control sequencers are accessible.

The final design descriptions of both devices were used to experiment with the novel features presented in this paper. The clock control identification has been built in a scripting environment that links a design description reader, a symbolic simulator, and tracing capabilities. The use of clock grouping, pulsing of compatible or incompatible clocks, and staggering of clocks was implemented in an evaluation version of a commercial ATPG solution.

The experiments that were performed can be grouped into two parts. The first targets pattern reduction for single stuck-at pattern generation, for which results are shown in Figure 2 to Figure 4. The second part targets pattern reduction for transition pattern generation, for which results can be found in Table 2. As explained before, we always apply two cycles for transition test. Within all figures, the ATPG set-up applied in the experiments is shown. The so-called lower bound, listed for stuck-at fault experiments, is an optimistic lower bound generated under the assumption that all clocks are skew-balanced and may be pulsed simultaneously. The classes "compatible clocks only" and "interacting clocks allowed" refer respectively to whether only non-interacting (compatible) clocks are allowed to pulse simultaneously, or whether some interacting clocks are allowed to pulse simultaneously.
The main target of our work was to reduce pattern count through an adequate combination of clock control hardware and our novel ATPG technique. Results presented in Figure 2 to Figure 4 show the relation between sequencer size and pattern reduction for stuck-at. For completeness, we included experiments for Design A with test data compression logic bypassed in Figure 3. In the experiments that allow simultaneous pulsing of incompatible clocks, only clock domains with minimal interaction were allowed to pulse simultaneously, to reduce the number of cells capturing unreliable values and to limit the number of additional Xs in the captured responses, which can lower the effectiveness of on-chip compression. A threshold of 0.02 (2%) was used in the reported experiments. Using a threshold of 0.02 or 0.0, for example, produced almost the same results. On the other hand, using a very large number close to 1 resulted in pattern count increases, especially when using on-chip compression.

Before going into further detailed observations, we first list some global results. The clock control identification is a fast and robust mechanism; none of the experiments gave unexpected results. The experiments could easily be automated by an iterative process on different design descriptions. The process simply calls the automated identification prior to running ATPG.

The fault coverage figures of all stuck-at runs were captured and compared. The absolute difference between the smallest and largest coverage figure reported by the ATPG tool was 0.03%. This confirms that the novel functionality maintains the pattern quality, as expected.
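The condition from Section 8 that decides whether an interacting clock pair may be pulsed simultaneously, evaluated with the 0.02 threshold used in the experiments, can be sketched as follows. The function name, the use of a less-than-or-equal comparison, and the element counts in the example are assumptions for illustration; only the ratios IS/CS and the meaning of the threshold values 0 and 1 are taken from the text.

```python
def may_pulse_together(is_i, cs_i, is_j, cs_j, threshold=0.02):
    """Return True when the larger share of boundary state elements
    (IS / CS) of the two clock domains does not exceed the threshold."""
    if cs_i <= 0 or cs_j <= 0:
        raise ValueError("each clock must control at least one state element")
    return max(is_i / cs_i, is_j / cs_j) <= threshold


# Hypothetical counts, not taken from Design A or Design B.
low_interaction = may_pulse_together(is_i=150, cs_i=10000, is_j=50, cs_j=5000)
high_interaction = may_pulse_together(is_i=150, cs_i=10000, is_j=400, cs_j=5000)
print(low_interaction)   # 1.5% and 1.0% of the cells are boundary cells
print(high_interaction)  # 8.0% of the cells of the second domain are boundary cells

# Limiting cases described in the text.
never = may_pulse_together(150, 10000, 50, 5000, threshold=0.0)
always = may_pulse_together(150, 10000, 400, 5000, threshold=1.0)
print(never, always)
```

A small threshold keeps the number of masked cells per pattern low, which is consistent with the observation that values close to 1 increase pattern count when on-chip compression is used.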

When only compatible clocks are allowed to pulse simultaneously, it can be seen that clock staggering can be very effective at reducing pattern count. The higher the number of capture cycles, the more incompatible clock groups can be pulsed in each pattern, and consequently the lower the pattern count. For Design A, most of the benefits are realized when 3 cycles are used. Beyond this point, the pattern count reduction with increasing capture cycles is minimal (see Figure 2, Figure 3). In Design B, pattern count is reduced through 6 capture cycles (see Figure 4). This is consistent with the clock architecture of the designs. Design B has many more clock groups than Design A, so that a smaller percentage of FFs can be pulsed in each cycle, and therefore the increase in the number of capture cycles affects Design B more.

Figure 2. Stuck-at patterns Design A with EDT active

Figure 3. Stuck-at patterns Design A with EDT bypassed

When simultaneous pulsing of incompatible clocks is allowed, several observations can be made. First, the pattern count is reduced even without clock staggering (1 capture cycle), as expected. As the number of capture cycles increases, the two compaction methods work constructively to further decrease pattern count. Second, since more clock groups can pulse in the same cycle, the compaction benefits are achieved with fewer capture cycles, as if the design had fewer clock groups. For instance, the pattern count reduction for Design B plateaus after 3 capture cycles, versus 6 capture cycles when pulsing of incompatible clocks is not allowed (see Figure 4). The pattern count with the combined methods is lower than what was achieved with clock staggering alone for Design B. Finally, the combination of those two methods is most effective for designs with many clock groups. This is why they are more effective for Design B than for Design A. Run time generally increases when using clock staggering, but modestly.
Note that even though the reported lower bound is optimistic, the pattern counts achieved with the combined compaction methods are close to this lower bound. All the observations made are consistent whether or not patterns are generated with on-chip compression. There is no fault coverage loss due to pulsing of incompatible clocks, since ATPG accounts for this interaction and selects which incompatible clocks are allowed to pulse simultaneously based on the faults targeted by a given test pattern.

Figure 4. Stuck-at patterns Design B with EDT active

Table 2 reports results for two-cycle transition patterns in which we target both intra- and inter-domain faults. Similar to the stuck-at experiments, it shows an optimistic lower bound under the assumption that all clocks are skew-balanced and may be pulsed simultaneously. In addition, it shows the classes "compatible clocks only" and "interacting clocks allowed" for two cycles only.

Table 2: Pattern count for two cycle transition test (launch/capture)

                               Design A      Design A       Design B
                               EDT active    EDT bypassed   EDT active
compatible clocks only         2298          586            574
  CPU run-time (secs)          2486          200            3585
interacting clocks allowed     978           8570           85455
  CPU run-time (secs)          2489          209            28072
lower bound                    3025          930            4009
  CPU run-time (secs)          373           944            68634

Similar observations can be made for transition ATPG as were made for stuck-at. Allowing pulsing of incompatible clocks is effective for both designs, in particular since clock staggering is not used. In addition, pulsing of incompatible clocks had a greater impact on Design B, where pattern reduction exceeded 45%, than it did on Design A, because of the higher number of clock groups in Design B.

The lower bound transition pattern count is significantly lower than the pattern counts when only pulsing compatible clocks, or even when pulsing incompatible clocks. The reason is that even when pulsing of incompatible clocks is allowed, only domains with limited interaction are allowed to pulse simultaneously. The lower bound experiment imposes no limits, and allows all clocks to pulse simultaneously without having to mask any scan cells because of unsafe data caused by race conditions.

The fault coverage figures of all transition runs were captured and compared. The absolute difference between the smallest and largest coverage figure reported by the ATPG tool for Design B was 0.09%. It was observed that the transition test coverage of the lower bound of Design A is 2.4% higher than the other transition test coverages. A first analysis showed that faults were aborted by ATPG when transition test clock restrictions were imposed. Other faults became untestable when some interacting clocks could not be pulsed simultaneously.

Pattern count may be reduced further in the future by allowing staggering of cycles where different clock domains are double-pulsed successively; i.e., by applying the same concept presented for stuck-at tests to transition tests.
The launch clocks (every other ATPG cycle) may be launch-aligned. A delay needs to be added between successive pairs of ATPG cycles to ensure safe timing between asynchronous clock domains.

10. Conclusion

We presented a combination of clock control hardware and novel ATPG techniques that significantly reduces pattern count for designs with many clock domains. We introduced multiple stages in clock control sequencers to optimize stuck-at pattern generation, and enhanced ATPG support for simultaneous clocking of interacting clocks. Using our novel methods, we achieved pattern count reduction on two industrial designs that was up to 3.9x for stuck-at patterns and 1.9x for transition patterns, compared to conventional ATPG compaction. This is in addition to the benefits of on-chip compression using EDT. Pattern count for transition patterns may be reduced further in the future by allowing staggering of cycles.

Part of this work is carried out as part of the Catrene project "TOETS" [CT302] and is supported by the Dutch Government of Economical Affairs.

References

[1] A.P. Niranjan, P. Wiscombe, Islands of Synchronicity, a Design Methodology for SoC Design, Design Automation and Test in Europe, 2004.
[2] T. McLaurin, F. Frederick, The Testability Features of the MCF5407 Containing the 4th Generation ColdFire, Proc. Intl. Test Conf., pp. 5-59, 2000.
[3] N. Tendolkar, R. Molyneaux, C. Pyron, and R. Raina, At-speed testing of delay faults for Motorola's MPC7400, a PowerPC(TM) microprocessor, IEEE VLSI Test Symposium, pp. 3-8, 2000.
[4] V. Iyengar, et al., At-Speed Structural Test for High-Performance ASICs, Proc. Int. Test Conf., paper 2.4, 2006.
[5] M. Beck, O. Barondeau, M. Kaibel, F. Poehl, X. Lin, and R. Press, Logic design for on-chip test clock generation - implementation details and impact on delay test quality, Proc. Design, Automation, and Test in Europe, 2005, pp. 56-6.
[6] F. Frederick, T. McLaurin, Design for Test Features of the ARM Clock Control Macro, Proc. Int. Test Conf., paper 9.3, 2007.
[7] N.A. Touba, Survey of Test Vector Compression Techniques, in IEEE D&T of Computer, Vol. 23, Issue 4, pp. 294-303, 2006.
[8] J. Rajski, J. Tyszer, M. Kassab, and N. Mukherjee, Embedded deterministic test, IEEE Trans. CAD, vol. 23, pp. 776-792, May 2004.
[9] X. Lin and R. Thompson, Test Generation for Designs with Multiple Clocks, Int. Conf. on Design Auto. Conf., pp. 662-667, 2003.
[10] 1500-2005 IEEE Standard Testability Method for Embedded Core-based Integrated Circuits, Standard IEEE Product No: SH95335, ISBN: 0-738-4693-5.
[11] T. Waayers, R. Morren, R. Grandi, Definition of a robust Modular SOC Test Architecture; Resurrection of the single TAM daisy-chain, in Int. Test Conference, 2005.
[12] X. Lin, M. Kassab, Test Generation for Designs with On-Chip Clock Generators, in Asian Test Symposium, 2009.
[13] X. Lin, S.M. Reddy, and I. Pomeranz, Test Pattern Reduction by Simultaneously Pulsing Interacting Clocks, in VLSI Design and Test Symposium, 2008.
[14] L.-T. Wang, X. Wen, S. Wu, H. Furukawa, H.J. Chao, and B. Sheu, Using launch-on-capture for Testing BIST Designs Containing Synchronous and Asynchronous Clock Domains, in IEEE Tran. On CAD, Vol. 29, No. 2, pp. 299-32, Feb. 200.
[15] B. Nadeau-Dostie, K. Takeshita, J.-F. Cote, Power-Aware At-Speed Scan Test Methodology for Circuits with Synchronous Clocks, in Int. Test Conference, Paper 9.3, 2008.
[16] D. Goswami, K.-H. Tsai, et al., At-Speed Testing with Timing Exception and Constraints Case Studies, in Asian Test Symposium, pp. 53-59, 2006.