CacheCompress: A Novel Approach for Test Data Compression with Cache for IP Cores
Hao Fang (方昊), fanghao@mprc.pku.edu.cn
Rizhao, ICDFN 07, 20/08/2007
To appear in ICCAD 07
Sections
- Introduction
- Our proposed methods
- Experimentation
- Conclusion
Test Challenges
- Large test data size: the stimulus for a present-day CUT (Circuit Under Test) is 100M-1000M bits; in the next decade, 10G-100G bits.
- ATE (Automatic Test Equipment) has limited memory and cannot store all the test data.
- Long test time: test cost is directly proportional to test time, roughly $1-2 per minute.
- Relatively small number of pins between the ATE and the CUT (stimulus in, response out).
Test Data Compression
[Diagram: the ATE (memory, clock generator, input/output channels) holds pre-computed, compressed test data; an on-chip decoder expands it into the original stimulus for the circuit under test (CUT), and an on-chip compactor compacts the response back to the ATE.]
- Stimulus side (the current research hot spot): compress in software, decompress in hardware with an on-chip decoder.
- Response side (well researched): compact in hardware with an on-chip compactor.
Test Data Compression Architecture, Based on Scan Slices
[Diagram: codewords from the ATE enter the on-chip decoder through a narrow input; the decoder drives c scan chains, producing one broad scan slice (one bit per chain) at a time; the slices feed the circuit under test, and the on-chip compactor collects the response.]
- Narrow input (codewords): the data after compression; slice i may take one codeword, slice i+1 several.
- Broad output (slices): the original test data to be compressed.
Hardware On-chip Decoder Features (vs. conventional software compression such as gzip or rar)
- Small area overhead: cannot use overly complex methods or buffer all previously received data.
- Continuous-flow input: receives one codeword every cycle, since the ATE clock is difficult to stop; usually every slice needs at least one codeword.
- Original stimulus content: few specified 0s or 1s, many don't-care bits (X).
- Must be information-lossless on the specified bits through the narrow input, unlike some image algorithms that reach very high compression ratios by discarding information.
Compression Methods
- LFSR-based: linear expansion
- Dictionary-based: on-chip memory
- Selective encoding: encode only the specified bits
- Others: hybrid methods, ...
Dictionary Methods
- Basic method: store all the slices in a static on-chip memory (ROM), loading the test data into it at the beginning; a short memory address then replaces each much larger slice. Needs a very large memory.
- Advanced method, Dictionary with Correction (ITC 04): needs a smaller memory, but still a large one. Each codeword carries a memory address plus a correction index; the RAM output is corrected before driving chains 1..n.
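As a rough software illustration of the basic dictionary idea (replacing repeated slices with short addresses; the function and variable names are mine, not the paper's):

```python
def dictionary_compress(slices):
    """Basic (static) dictionary idea: store each distinct slice once
    and replace the test-data stream with short memory addresses.
    With many distinct slices the dictionary itself becomes huge,
    which is this method's main drawback."""
    dictionary = {}          # slice content -> memory address
    addresses = []
    for s in slices:
        if s not in dictionary:
            dictionary[s] = len(dictionary)   # allocate the next address
        addresses.append(dictionary[s])
    return dictionary, addresses

# Three slices, two distinct: the stream becomes addresses [0, 1, 0].
d, addrs = dictionary_compress(['1100', '0011', '1100'])
print(addrs)
```

Running it on three slices with two distinct values yields the address stream [0, 1, 0] and a two-entry dictionary, showing how the memory height grows with the number of distinct slices.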
Selective Encoding (ITC 05)
For every slice:
- Count the number of 0s and the number of 1s.
- The majority value becomes the default bit; the minority bits are the specified bits.
- Fill the Xs with the default value and encode only the specified bits.
Slices are classified by the number of specified bits:
- 0: N-type (non-specified), e.g. 0000xxxx (four 0s, no 1s, default 0)
- 1: S-type (single-specified), e.g. 00xx001x (four 0s, one 1, default 0)
- >1: M-type (multiple-specified), e.g. 01101110 (three 0s, five 1s, default 1)
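The classification rule above can be sketched in Python (a software model only; the tie-break of defaulting to 0 when the 0/1 counts are equal is my assumption):

```python
def classify_slice(slice_bits):
    """Classify a scan slice per selective encoding (ITC 05).

    slice_bits: string over {'0', '1', 'x'}; 'x' is a don't-care.
    Returns (default_value, slice_type), where the default is the
    majority specified value and the type depends on the number of
    minority ("specified") bits: 0 -> 'N', 1 -> 'S', >1 -> 'M'.
    """
    zeros = slice_bits.count('0')
    ones = slice_bits.count('1')
    default = '0' if zeros >= ones else '1'   # ties default to 0 (assumption)
    minority = ones if default == '0' else zeros
    if minority == 0:
        kind = 'N'
    elif minority == 1:
        kind = 'S'
    else:
        kind = 'M'
    return default, kind

print(classify_slice('0000xxxx'))  # ('0', 'N')
print(classify_slice('00xx001x'))  # ('0', 'S')
print(classify_slice('01101110'))  # ('1', 'M')
```

Each of the three slide examples maps to its stated type and default value.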
Encode Slices with Two Code Types
[Example layout: a slice of C=15 bits (indices 0-14) divided into four groups of G=4 bits; each W=6 codeword is split into a control code and a data code.]
- Single-mode code: only one codeword for an N-type or S-type slice, composed of the default value and the index of the specified bit (index = n means no specified bit).
- Group-mode code: for M-type slices. The first codeword carries the group index; subsequent codewords carry the group content.
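A simplified model of the two code types, emitting codewords as abstract tuples instead of the paper's bit-level control/data fields (the field layout is omitted; the group size of 4 follows the slide example, and the 0-tie default is my assumption):

```python
def encode_slice(slice_bits, group_size=4):
    """Abstract sketch of selective encoding's two code types.

    N/S-type: one single-mode codeword (default value, index of the
    single specified bit, or None when there is none).
    M-type: one single-mode codeword carrying the default value, then
    one group-mode codeword per group that contains specified bits.
    """
    zeros, ones = slice_bits.count('0'), slice_bits.count('1')
    default = '0' if zeros >= ones else '1'
    minority = ones if default == '0' else zeros
    specified = [i for i, b in enumerate(slice_bits)
                 if b not in ('x', default)]
    if minority <= 1:  # N-type or S-type: a single codeword suffices
        index = specified[0] if specified else None
        return [('single', default, index)]
    # M-type: default first, then only the groups needing correction
    codewords = [('single', default, None)]
    filled = slice_bits.replace('x', default)
    for g in range(0, len(slice_bits), group_size):
        if any(g <= i < g + group_size for i in specified):
            codewords.append(('group', g // group_size,
                              filled[g:g + group_size]))
    return codewords

print(encode_slice('00xx001x'))  # [('single', '0', 6)]
```

For the M-type slide example 01101110, the sketch emits a default codeword plus two group codewords, one per group containing minority bits.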
Selective Encoding Decoder Structure
[Block diagram: the W-bit codeword feeds a controller, a single sub-decoder, and a group sub-decoder. The single sub-decoder outputs the default value and the bit index; the group sub-decoder outputs the group index and the G-bit group content. A scan-slice selector assembles the C-bit scan slice and shifts it into the scan chains.]
Selective Coding -- Example
Slices (8 bits each): xx00 01xx | xxxx 1xx1 | 1100 0001

Control code  Data code  Interpretation                                    Decompressed slice
00            0101       Start new slice, default=0, index=5: set bit 5    0000 0100
01            1000       Start new slice, default=1, index=8 (no bit set)  1111 1111
00            0111       Start new slice, default=0, index=7: set bit 7    0000 0001
11            0000       Enter group mode, group pointer=0                 0000 0001
11            1100       Group data is 1100: overwrite group 0             1100 0001
Sections
- Introduction
- Our proposed methods
- Experimentation
- Conclusion
Disadvantages of Dictionary Code
- Stores entire slices, which fixes the memory width.
- Must store as many slices as possible, so a great many words are needed: large memory height.
- Contents are constant (static): an additional initialization step is needed, and the large initialization data must be stored on the ATE.
Improvements in CacheCompress
- Store only part of each slice (smaller memory width): a default compression method, the single-mode code of selective encoding, compresses the many simple slices, and the memory is used only for the remaining slices.
- Cache-like memory (smaller memory height): updated dynamically during testing, keeping the most recently used words (LRU replacement), which also eliminates memory initialization.
Disadvantages of Selective Encoding
- Many specified bits in one slice cause many group codewords.
  E.g. slice 1010 1000 1001 100 needs the codewords 00 0010, 11 0000, 11 1010, 10 0010, 11 1001, 11 100x.
- A slice with no specified bits (N-type) still costs a whole codeword.
  E.g. slice 0000 0000 0000 000 needs codeword 00 1111, although a single bit would suffice: wasted bandwidth.
Improving Selective Coding
Key idea: use the wasted bandwidth to write the dictionary in advance for future reading.
- Many specified bits in a slice: read a much wider segment (instead of a group) from the dictionary, so only one codeword is needed for several groups.
- No specified bits in the slice (N-type): one bit suffices for the default value; the otherwise wasted bits are used for writing the dictionary.
Compression Parameters & Slice Division
Given parameters:
- C: number of bits per slice
- W: word width
A segment (the unit read from and written to the dictionary) is W bits wide, so each slice contains ceil(C/W) segments; e.g. a 16-bit slice (bits 0-15) with W=8 splits into Segment 0 and Segment 1.
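Slice division itself is simple; a minimal sketch (when W does not divide C, the last segment is just shorter):

```python
def split_into_segments(slice_bits, w):
    """Divide a C-bit slice into ceil(C/W) segments of width W;
    each segment is the unit read from or written to the dictionary."""
    return [slice_bits[i:i + w] for i in range(0, len(slice_bits), w)]

# The 16-bit, W=8 slide example yields Segment 0 and Segment 1.
print(split_into_segments('0000111100001111', 8))
```

A C=15, W=8 slice likewise yields two segments, the second only 7 bits wide.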
CacheCompress Decoder Structure
[Block diagram: the W-bit codeword feeds a controller and three sub-decoders. The single sub-decoder outputs the default value and the bit index. The read sub-decoder sends an address to the dictionary and receives the S*G-bit segment content. The write sub-decoder drives the write enable, address, and data for dictionary updates. A scan-slice selector assembles the C-bit scan slice and shifts it into the chains.]
An Example (C=15, W=8)

Original scan slice   Codeword   Decompressed slice   Details
00xxx0xx xxxxxxx      110 0010   000000000000000      Write-mode: default=0, write address=2
xxxxxx00 0xxxxxx      110 0101   000000000000000      Write-mode: default=0, write content=0101
11xxx011 xxxxxxx      011 0101   111110111111111      Single-mode: default=1, specified bit index=5
11111xxx xxxxxxx      111 0011   111111111111111      Write-mode: default=1, write content=0011; now do the write: Mem[2]=01010011
01010011 1111111      011 1111   111111111111111      Single-mode: default=1, no specified bit
                      100 0010   010100111111111      Read-mode: segment index=0, read address=2; content 01010011 overwrites segment 0
Encoding Algorithm
- S-type: one single-mode codeword.
- N-type: one single-mode codeword; record it for possible future writes.
- M-type:
  - One single-mode codeword for the default value.
  - For each specified segment (more than one specified bit): search for a suitable word in the dictionary. If none is found, use the LRU replacement policy to overwrite one entry, converting recorded single-mode codewords for N-type slices into write-mode codewords (a read-after-write hazard must be avoided). Then encode the segment as a read-mode codeword.
  - For the remaining uncovered bits: encode single-mode codewords as corrections.
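The cache-like dictionary behavior inside the encoder can be modeled in Python; the entry count, address assignment, and the don't-care matching rule here are illustrative assumptions, not the paper's exact implementation:

```python
from collections import OrderedDict

class LRUDictionary:
    """Software model of the encoder's view of the on-chip dictionary:
    a fixed number of word entries, replaced with an LRU policy."""

    def __init__(self, entries):
        self.entries = entries
        self.words = OrderedDict()   # address -> stored word

    def lookup(self, segment):
        """Return the address of a word covering the segment, i.e.
        agreeing on every specified bit ('x' matches anything)."""
        for addr, word in self.words.items():
            if all(s == 'x' or s == w for s, w in zip(segment, word)):
                self.words.move_to_end(addr)   # mark as recently used
                return addr
        return None

    def insert(self, segment):
        """Store a new word, evicting the LRU entry when full,
        and return the address it was written to."""
        if len(self.words) >= self.entries:
            addr, _ = self.words.popitem(last=False)   # evict LRU entry
        else:
            addr = len(self.words)
        self.words[addr] = segment
        return addr

d = LRUDictionary(entries=2)
d.insert('01010011')
d.insert('00111100')
print(d.lookup('0x01xx11'))  # covered by the first stored word
```

A hit both supplies the segment (one read-mode codeword instead of several group codewords) and refreshes the entry's recency, so frequently reused words stay resident.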
Sections
- Introduction
- Our proposed methods
- Experimentation
- Conclusion
Hardware Implementation
- Implemented in Verilog HDL, parameterized in C and W; passes simulation with VCS.
- Hardware overhead estimated with Design Compiler under the TSMC13 process: about 3-4 gates per scan chain with memory reuse, roughly 1% area overhead.

Gate/register count   C=255, W=64             C=1023, W=64   C=1023, W=128
Reusing memory        1050 gates / 330 regs   3472 / 1097    3752 / 1163
Not reusing memory    3974 gates / 2378 regs  6396 / 3145    16284 / 9355
Benchmark Circuits

Circuit  Cells    Scan cells  Vectors  TD (bits)     X ratio (%)
CCT1     94757    11675       634      7,401,950     96.11
CCT2     183743   20652       2396     49,482,192    96.45
CCT3     248518   29057       2355     68,429,235    98.17
CCT4     293265   31239       2439     76,191,921    98.41
CCT5     286205   33465       3658     122,414,970   99.09
CCT6     520342   67569       6385     431,428,065   99.42

- Industrial SoCs from MPRC, all silicon-proven.
- Test patterns generated by Synopsys TetraMAX after dynamic compaction.
Compression Results

               SE (ITC 05)             DC (ITC 04)                       CacheCompress
Circuit  C     TE          Cycles      TE          Cycles   Size         W     TE          Cycles     Size   Hit rate (%)
CCT1     255   1,053,680   106,002     893,776     29,798   339,660      64    778,393     71,937     2048   84.47
CCT2     1023  3,035,352   255,342     3,963,168   52,712   2,856,216    64    2,345,096   182,788    2048   83.02
CCT3     1023  4,024,536   337,733     3,935,184   70,650   2,432,694    64    3,305,809   256,648    2048   81.90
CCT4     1023  5,338,008   447,273     4,613,730   78,048   2,950,332    64    3,917,108   303,755    2048   87.68
CCT5     1023  5,123,880   430,648     5,074,080   124,372  2,418,372    64    4,165,993   324,119    2048   85.56
CCT6     1023  12,096,432  1,014,421   14,786,513  434,180  4,947,228    128   9,931,935   770,380    8192   83.61

- Test data size: reduced by 30% in all cases.
- Testing time: 27% faster than SE.
- Memory size: 100 to 1000+ times smaller than DC.
- Average hit rate: 84%.
Compression Ratio Improvement Analysis
- Codewords for N-type slices become write-mode codewords, making full use of the codeword bandwidth.
- Word reuse: an 84% hit rate means each word is read about 7 times on average; the data is provided only the first time, and the subsequent complex segments benefit from reading it back.
Sections
- Introduction
- Our proposed methods
- Experimentation
- Conclusion
Conclusion
Contributions:
- Combines another technique (selective encoding) with dictionary coding.
- Dynamic, cache-like dictionary; eliminates memory initialization.
Results:
- Much smaller dictionary.
- Higher compression ratio.