Lossless Compression Algorithms for Direct-Write Lithography Systems


Hsin-I Liu
Video and Image Processing Lab
Department of Electrical Engineering and Computer Science
University of California at Berkeley

Optical Lithography
- Lithography creates patterns on the wafer in semiconductor manufacturing.
- Current approach: optical lithography systems apply a mask.
- The cost of masks is increasing.

From Mask to Maskless Lithography
High-volume manufacturing roadmap (source: ITRS 2004):

Year                  2004  2006  2008  2010  2012  2014  2016  2018
Technology node (nm)    90    65    45    32    22    16    11     8

Cost of Masks in Optical Lithography
[Figure: cost of masks in optical lithography. Source: ITRS 2009]

Maskless Lithography
[Diagram labels: optical source, optics, data, mirror array, writer chip, writer system, wafer, stage] [Y. Shroff et al. 00]
- A micromirror array replaces the optical mask.
- Reduces the cost of masks by x times.
- Increases patterning flexibility.
- Focus of research:
  - Fabricating the micromirror array
  - Modifying the layout pattern for proximity effect correction (OPC or EPC)
- However...

Maskless Lithography: Practical Issues
[Diagram labels: optical source, optics, data, mirror array, writer chip, writer system, wafer, stage] [Y. Shroff et al. 00]
- Each micromirror is controlled individually and dynamically.
- The layout image is rasterized into pixels.
- Data delivery is a problem for real-time manufacturing. Pixel values must be updated to:
  - Write different portions of the layout image
  - Overcome the voltage attenuation problem

Data Delivery Issue
- Data rate for a 45 nm minimum feature at a throughput of 1 wafer layer per minute:

  $$\frac{\pi}{4}\,\frac{(300\ \text{mm})^2}{\text{wafer layer}} \times \frac{\text{pixel}}{(22\ \text{nm})^2} \times \frac{5\ \text{bits}}{\text{pixel}} \times \frac{\text{wafer layer}}{60\ \text{s}} \approx 12\ \text{Tb/s}$$

- Board-to-chip communication: 1.2 Tb/s (e.g. 128 pins @ 6.4 GHz)
- Estimated needed compression: 12 Tb/s / 1.2 Tb/s = 10
- [Diagram: storage disks (20 Tb), 10 Gb/s link, processor board (500 Gb memory), 1.2 Tb/s link, writer chip with decode stage, 12 Tb/s to the writers]
- The throughput requirement can be reduced to 3-5 wafer layers per hour, which still needs compression.
- Lossless compression is applied to:
  - Reduce storage space
  - Lower I/O throughput overhead
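The data-rate estimate on this slide can be checked with a few lines of arithmetic. The sketch below assumes the figures quoted on the slide (300 mm wafer, 22 nm pixel, 5 bits per pixel, one wafer layer every 60 s, 1.2 Tb/s board-to-chip link); the variable names are illustrative only.

```python
import math

# Figures taken from the slide; names are illustrative.
WAFER_DIAMETER_M = 300e-3    # 300 mm wafer
PIXEL_SIZE_M = 22e-9         # 22 nm pixel for a 45 nm minimum feature
BITS_PER_PIXEL = 5
SECONDS_PER_LAYER = 60       # 1 wafer layer per minute
BOARD_TO_CHIP_TBPS = 1.2     # board-to-chip link

wafer_area_m2 = math.pi / 4 * WAFER_DIAMETER_M ** 2
pixels_per_layer = wafer_area_m2 / PIXEL_SIZE_M ** 2
bits_per_layer = pixels_per_layer * BITS_PER_PIXEL

# Raw writer data rate in Tb/s and the compression ratio needed to fit the link.
rate_tbps = bits_per_layer / SECONDS_PER_LAYER / 1e12
required_ratio = rate_tbps / BOARD_TO_CHIP_TBPS

print(f"raw rate: {rate_tbps:.1f} Tb/s, required compression: {required_ratio:.1f}")
```

The result is slightly above 12 Tb/s, which the slide rounds to 12 Tb/s and a required compression ratio of 10.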

Data Compression Requirements
- Lossless compression
- Achieve a compression efficiency of about 10
- Asymmetric compression algorithm:
  - Offline encoding
  - Real-time decoding: the decoder is implemented in hardware and integrated into the writer system

Block GC3 - Compression Algorithm for Rasterized, Flattened Layout
- Block Golomb context copy code (Block GC3) [H. Liu 06]
- Prediction from context (as in JBIG):
  1. Predict a pixel value from neighboring pixels (P)
  2. Good for non-repetitive layouts

Block GC3 - Context Predict
[Figure: neighboring pixels a, b, c and the predicted pixel z]
- Prediction: x = b - a + c
  - if (x < 0) then z = 0
  - if (x > max) then z = max
  - otherwise z = x
- Empirical prediction error probabilities shown in the figure: 0.6%, 7.1%, 3.9%, 0.0%, 0.0%, 2.2%, 3.7%, 0.3%
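The prediction rule on this slide is a clamped linear predictor. A minimal sketch follows, reading the rule as x = b - a + c (the minus sign is an assumption about the flattened slide text) and taking max = 31 for 5-bit pixels; the function name is hypothetical.

```python
def predict_pixel(a, b, c, max_value=31):
    """Clamped linear prediction of pixel z from neighbors a, b, c.

    Assumes x = b - a + c, limited to the valid range [0, max_value].
    max_value=31 corresponds to 5-bit pixels (an assumption).
    """
    x = b - a + c
    if x < 0:
        return 0
    if x > max_value:
        return max_value
    return x


print(predict_pixel(10, 10, 20))  # smooth region: x = 20, no clamping
print(predict_pixel(0, 31, 31))   # x = 62, clamped to 31
print(predict_pixel(31, 0, 0))    # x = -31, clamped to 0
```

A pixel is reported as an error only when the decoder's prediction z differs from the true value, so a predictor that is right almost everywhere leaves a sparse error map.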

Block GC3 - Copy
- Copying (as in ZIP, 2D-LZ):
  1. Copy from left or above
  2. Good for repetitive layouts

Block GC3 - Segmentation
[Figure: Block GC3 segmentation map on 8 x 8 blocks; blocks carry labels such as P, "L,8", and "A,8"]
- Layout images are divided into prediction and copy regions, determined within each 8 x 8 block.
- Errors from prediction and copy are transmitted from encoder to decoder.
- All the information is further compressed.

Block GC3 Encoder/Decoder Architecture
[Block diagram. Encoder blocks: find best copy distance, predict/copy, compare, region encoder, Golomb RLE (two instances), Huffman encoder. Decoder blocks: region decoder, Golomb RLD (two instances), Huffman decoder, predict/copy, merge, layout buffer. Signals: segmentation values, seg. error values, seg. error map, image error values, image error map.]
- Outperforms the existing techniques
- Simple decoder design [V. Dai 05]

Golomb Run-Length Code
- A simple code for a binary stream, e.g. 000100000000001100101
- Bucket size (B): maximum number of zeros in a row; here B = 4
- Two kinds of codes:
  - (0): B zeros in a row
  - (1, n): n zeros in a row followed by a one
- Encoded stream: (1,3) (0) (0) (1,2) (1,0) (1,2) (1,1)
- [Figure annotations on the encoded stream: "Compression achieved" and "Additional information introduced"]

Golomb Run-Length Code (continued)
- Achieves its best compression efficiency on i.i.d. random variables
- Achieves efficient compression on highly skewed bitstreams such as error locations
- Simple decoder design
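The Golomb run-length example above can be reproduced with a short sketch. The function names are hypothetical, and the bit-cost model (1 bit for a (0) code, 1 + ceil(log2 B) bits for a (1, n) code) is an assumption, since the slides do not state the codeword lengths.

```python
def golomb_rle_encode(bits, bucket):
    """Encode a string of '0'/'1' characters into (0,) and (1, n) codes.

    Returns the code list plus the number of trailing zeros that did not
    fill a bucket or end in a one.
    """
    codes = []
    run = 0
    for bit in bits:
        if bit == "1":
            codes.append((1, run))   # n zeros followed by a one
            run = 0
        else:
            run += 1
            if run == bucket:        # a full bucket of B zeros
                codes.append((0,))
                run = 0
    return codes, run


def golomb_rle_decode(codes, bucket, tail=0):
    """Invert golomb_rle_encode."""
    parts = []
    for code in codes:
        parts.append("0" * bucket if code[0] == 0 else "0" * code[1] + "1")
    parts.append("0" * tail)
    return "".join(parts)


def encoded_bits(codes, bucket):
    """Bit cost under the assumed codeword lengths."""
    n_bits = (bucket - 1).bit_length()   # bits needed for n in 0..B-1
    return sum(1 if code == (0,) else 1 + n_bits for code in codes)


stream = "000100000000001100101"
codes, tail = golomb_rle_encode(stream, bucket=4)
print(codes)   # [(1, 3), (0,), (0,), (1, 2), (1, 0), (1, 2), (1, 1)]
print(len(stream), "input bits ->", encoded_bits(codes, 4), "encoded bits")
```

Under the assumed bit costs, the long run of zeros is where the saving comes from, while isolated ones such as (1,0) cost more bits than the single input bit they represent.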

Complexity vs. Compression Ratio of Compression Schemes
[Figure: minimum compression ratio on the poly layer (1 to 15) versus decoder buffer size (1 to 1,000,000 bytes, log scale) for RLE, Huffman, LZ77, ZIP, BZIP2, C4, Block C4, and Block GC3, with the desirable operating point marked] [H. Liu 06]

Full-Chip Test
[Figure: full-chip test, AMD CPU, 65 nm, Metal-1] [A. Zakhor 09]
- 24% of the images have CR < 5


Full-Chip Test
[Figure: full-chip test, ST ASIC, 65 nm] [A. Zakhor 09]

Block Diagram of Block GC3 Decoder
[Block diagram: the region decoder converts the segmentation into (l/a, d) for the address generator, which addresses the history buffer to fetch the copy value; linear prediction supplies the predict value; the Golomb decoder produces the error location from the compressed error location; the Huffman decoder produces the error value from the compressed error values; control/merge combines them]
- High parallelism for hardware implementation
- Data flow architecture

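Data Flow of Decoder
[Diagram: segmentation feeds the region decoder, which sends (l/a, d) to the address generator; the address reads the history buffer; a first mux selects between the linear prediction and the history buffer value (predict/copy); a second mux, controlled by the Golomb-decoded error location, selects between that pixel value and the Huffman-decoded error value for the output]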

Data Flow of Decoder - Predict
[Diagram: the linear prediction path is selected; the error location is 0, so the predicted pixel value goes to the output]
- After decoding, the pixel value is stored back to the history buffer.

Data Flow of Decoder - Copy
[Diagram: the history buffer (copy) path is selected; the error location is 0, so the copied pixel value goes to the output]
- After decoding, the pixel value is stored back to the history buffer.

Data Flow of Decoder - Error
[Diagram: the error location is 1, so the Huffman-decoded error value goes to the output instead of the predicted or copied pixel value]
- After decoding, the pixel value is stored back to the history buffer.
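The three data-flow cases (predict, copy, error) can be put together in a toy software model. This is a sketch under stated assumptions, not the hardware design: the per-pixel segmentation encoding ("P", ("L", d), ("A", d)), the neighbor positions used for prediction, and the treatment of out-of-image pixels as 0 are all hypothetical choices made for illustration.

```python
def clamp_predict(a, b, c, max_value):
    """Clamped linear prediction, assuming x = b - a + c."""
    return min(max(b - a + c, 0), max_value)


def decode_image(height, width, segmentation, error_flags, error_values, max_value=31):
    """Toy raster-order decoder mirroring the predict/copy/error muxes.

    segmentation[r][c] is ("P",) for predict, ("L", d) for copy from d pixels
    to the left, or ("A", d) for copy from d pixels above (hypothetical format).
    error_flags[r][c] is the decoded error location bit for that pixel.
    error_values lists the replacement values, in raster order of the errors.
    """
    image = [[0] * width for _ in range(height)]   # doubles as the history buffer
    errors = iter(error_values)

    def pixel(r, c):
        # Assumption: pixels outside the image read as 0.
        return image[r][c] if r >= 0 and c >= 0 else 0

    for r in range(height):
        for c in range(width):
            mode = segmentation[r][c]
            if mode[0] == "P":        # first mux: predicted value
                value = clamp_predict(pixel(r - 1, c - 1), pixel(r - 1, c),
                                      pixel(r, c - 1), max_value)
            elif mode[0] == "L":      # first mux: copy from the left
                value = pixel(r, c - mode[1])
            else:                     # first mux: copy from above
                value = pixel(r - mode[1], c)
            if error_flags[r][c]:     # second mux: error value overrides
                value = next(errors)
            image[r][c] = value       # store back to the history buffer
    return image


seg = [[("P",)] * 4, [("A", 1)] * 4]
flags = [[1, 0, 0, 1], [0, 0, 1, 0]]
decoded = decode_image(2, 4, seg, flags, error_values=[5, 9, 7])
print(decoded)   # [[5, 5, 5, 9], [5, 5, 7, 9]]
```

Only three error values are needed for eight pixels here because prediction and copying reproduce the rest, which is the asymmetry the decoder relies on.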

Decoder Performance - FPGA

Device                        Xilinx Virtex II Pro 70
Number of slice flip-flops    3,233 (4%)
Number of 4-input LUTs        3,086 (4%)
Number of block RAMs          36 (10%)
System clock rate             100 MHz
System throughput rate        0.99 pixels/clock cycle
System output data rate       495 Mb/s

- The hardware performance can be improved:
  - Update FPGA devices
  - Apply ASIC implementation

Decoder Performance - ASIC

Block                      Area (um^2)   Throughput (output/cycle)   Power (mW)
Golomb                           1,136   1                            0.2
Huffman                            848   1/codeword+2                 0.21
Linear Prediction                  455   1                            0.16
Address Generator                  362   0.99                         0.03
Region Decoder                  18,370   1                            7.26
Control/Merge                      749   1                            0.22
Memory                          46,960   1                           13.27
Block GC3 single decoder        69,288   0.99                        21.48

- 85% of area results from 1.7 KB of memory
- System clock rate: up to 500 MHz
- System throughput: 0.99
- System output rate: up to 2.47 Gb/s
- 200 decoders to achieve 500 Gb/s, i.e. 3 wafer layers per hour
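The output rates quoted for the FPGA and ASIC decoders follow from clock rate, throughput per cycle, and bits per pixel. The sketch below assumes 5 bits per pixel (the figure used in the data-rate estimate earlier in the talk); the helper name is hypothetical.

```python
BITS_PER_PIXEL = 5   # assumption: same pixel depth as in the data-rate estimate


def output_rate_gbps(clock_mhz, pixels_per_cycle, bits_per_pixel=BITS_PER_PIXEL):
    """Decoder output data rate in Gb/s."""
    return clock_mhz * 1e6 * pixels_per_cycle * bits_per_pixel / 1e9


fpga_gbps = output_rate_gbps(100, 0.99)    # FPGA: 100 MHz clock
asic_gbps = output_rate_gbps(500, 0.99)    # ASIC: up to 500 MHz clock
array_gbps = 200 * asic_gbps               # 200 decoders in parallel

print(f"FPGA: {fpga_gbps * 1000:.0f} Mb/s")
print(f"ASIC: {asic_gbps:.3f} Gb/s per decoder, {array_gbps:.0f} Gb/s for 200 decoders")
```

This gives 495 Mb/s for the FPGA and about 2.475 Gb/s per ASIC decoder, so 200 decoders deliver about 495 Gb/s, which the slide rounds to 500 Gb/s.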

Apply Block GC3 to reduce I/O overhead

I/O type                 Data rate   # of links for 500 Gb/s   # of links with Block GC3
Cell I/O                 6.4 Gb/s     80                        12
HyperTransport 3.1       6.4 Gb/s     80                        12
Optical link             3 Gb/s      167                        26
Intel 65 nm interface    10 Gb/s      50                         8
Intel 45 nm interface    25 Gb/s      20                         3

- 200 Block GC3 decoders occupy 14 mm^2.
- The reduced I/O interface is more practical for direct-write applications.
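Two of the figures on this slide can be cross-checked against the ASIC numbers: the total area of 200 decoders, and the link reduction implied by each table row. The sketch below only re-derives values from numbers already quoted in the talk; the variable names are illustrative.

```python
DECODER_AREA_UM2 = 69_288   # single Block GC3 decoder, from the ASIC table
NUM_DECODERS = 200

# 1 mm^2 = 1e6 um^2
total_area_mm2 = NUM_DECODERS * DECODER_AREA_UM2 / 1e6

# (links needed for 500 Gb/s, links needed with Block GC3), from the table
links = {
    "Cell I/O": (80, 12),
    "HyperTransport 3.1": (80, 12),
    "Optical link": (167, 26),
    "Intel 65 nm interface": (50, 8),
    "Intel 45 nm interface": (20, 3),
}
reduction = {name: without / with_gc3 for name, (without, with_gc3) in links.items()}

print(f"area of {NUM_DECODERS} decoders: {total_area_mm2:.2f} mm^2")
for name, factor in reduction.items():
    print(f"{name}: {factor:.2f}x fewer links")
```

The area comes out just under 14 mm^2, and every row implies a link reduction between 6 and 7.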

Writer Chip Architecture
[Diagram labels: address, demux, I/O, decoders, DACs, DRAM array]
- The DRAM array directly controls the micromirror array above it.
- Throughput of the chip: 3 wafer layers/hour (500 Gb/s)

Encoding Complexity of Block GC3
[Encoder block diagram: find best copy distance, predict/copy, compare, region encoder, Golomb RLE (two instances), Huffman encoder]
- Finding the best copy distance is the most computationally challenging part of encoding.

Find the Best Copy Distance
[Figure: current block and the allowed copying range of $d_x$ by $d_y$]
- For an m x n image with block size M, the complexity is $O\!\left(\frac{mn}{M^2}\,(d_x + d_y)\right)$.
- Memory size = $d_x \times d_y$
- Block segmentation reduces the complexity by $M^2$.
- For a linear writing system, horizontal/vertical copy is sufficient.
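A brute-force version of the horizontal/vertical search can be sketched as follows. It is an illustrative assumption of how such a search might look, not the author's encoder: the function names, the (mismatches, direction, distance) return format, and the tie-breaking rule (first candidate found wins) are all hypothetical.

```python
def count_mismatches(image, top, left, size, dy, dx):
    """Count block pixels that differ from the pixels dy rows up and dx columns left."""
    return sum(
        image[r][c] != image[r - dy][c - dx]
        for r in range(top, top + size)
        for c in range(left, left + size)
    )


def best_copy_distance(image, top, left, size, max_dx, max_dy):
    """Search left copies (1..max_dx) and above copies (1..max_dy) for one block.

    Returns (mismatches, direction, distance) with direction "L" or "A",
    or None if no copy source lies inside the image.
    """
    best = None
    for d in range(1, max_dx + 1):          # horizontal candidates
        if left - d < 0:
            break
        m = count_mismatches(image, top, left, size, 0, d)
        if best is None or m < best[0]:
            best = (m, "L", d)
    for d in range(1, max_dy + 1):          # vertical candidates
        if top - d < 0:
            break
        m = count_mismatches(image, top, left, size, d, 0)
        if best is None or m < best[0]:
            best = (m, "A", d)
    return best


# A diagonal stripe pattern with horizontal period 3.
image = [[(c + r) % 3 for c in range(9)] for r in range(4)]
result = best_copy_distance(image, top=2, left=6, size=2, max_dx=6, max_dy=2)
print(result)   # (0, 'L', 3)
```

Each block tests d_x + d_y candidates here, which is why restricting the search to horizontal and vertical copies keeps the encoder cost linear in the copy range rather than proportional to its area.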

Find the Best Copy Distance - Multiple Candidates
[Figure: segmentation map]
- Every block may have more than one candidate copy distance with the fewest mismatches.
- Enforce spatial coherency for better compression.
- Region growing: use the fewest regions to represent the segmentation map.

Region Growing
- 2-D region growing is an NP-complete problem.
  - Use left/above segmentation info as preferences.
  - [Figure: neighboring blocks a, b, c and the current block "?"] If (a = c) then ? = b, else ? = c.
- 1-D region growing can be solved in polynomial time.
  - A better solution for complex segmentation maps.
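To illustrate why the 1-D case is tractable, here is one polynomial-time approach: a greedy left-to-right scan that extends the current region for as long as the blocks still share at least one candidate copy distance. This is a sketch of a standard greedy strategy, not necessarily the method used in Block RGC3; the function name and the use of integer distances are assumptions.

```python
def grow_regions_1d(candidates):
    """Split a row of blocks into the fewest regions sharing a copy distance.

    candidates[i] is the set of copy distances with the fewest mismatches for
    block i. Returns a list of (first_block, last_block, distance) tuples.
    """
    if not candidates:
        return []
    regions = []
    start = 0
    common = set(candidates[0])
    for i in range(1, len(candidates)):
        shared = common & candidates[i]
        if shared:
            common = shared              # block i joins the current region
        else:
            # No distance fits every block any more: close the region.
            regions.append((start, i - 1, min(common)))
            start = i
            common = set(candidates[i])
    regions.append((start, len(candidates) - 1, min(common)))
    return regions


row = [{8, 4}, {8}, {8, 2}, {2}, {2, 5}, {5}]
regions = grow_regions_1d(row)
print(regions)   # [(0, 2, 8), (3, 4, 2), (5, 5, 5)]
```

The scan touches each block once, and delaying each cut until it is forced yields the minimum number of regions along a single row; no such simple ordering exists in 2-D, which is consistent with the slide's remark about NP-completeness.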

Improve Compression Efficiency
- For a linear writing system and ASIC layout images, average CR > 10.
- For a different writing system or a compact layout, modify the encoding scheme to improve compression efficiency.
  - Example: the REBL system

REBL Direct-Write Lithography System
[Figure: REBL system with a 45° angle marked] [P. Petric et al., KLA-Tencor, 08]
- Rotary writer: spiral writing
- 45° between the radius of the stage and the die

REBL Layout Image
[Figure labels: one DPG, 4096 rows, 256 rows, wafer, direction of scan, one column]
- Layout pattern created by the digital pattern generator (DPG)
- 256 rows per DPG, 16 DPGs in total
- Column-by-column writing mechanism
- Layout angle orientation: 15° to 75° (45° ± 30°)
- E-beam proximity corrected

Lossless Compression Algorithm for REBL - Block RGC3
[Figure: current block, diagonal copying, and the allowed copy range of $d_x$ by $d_y$]
- Allow diagonal copying
- Reduce block size and dimension
- Apply 1-D region growing to reduce the number of regions
- Increase memory size
- Encoding complexity: $O\!\left(\frac{mn}{HW}\,d_x d_y\right)$
- Memory size = $d_x \times d_y$

Compression Results

               Block GC3               Block RGC3              ZIP    BZip2   JPEG-LS
Buffer size    1.6KB   20KB   40KB     1.6KB   20KB   40KB     32KB   900KB   2.2KB
Block size     4x4     4x4    4x4      5x3     5x3    5x3
Layout size
2048x64        3.13    3.37   3.44     4.92    6.54   6.60     3.23   3.95    0.95
1024x256       3.19    3.30   3.36     5.09    6.91   7.12     3.37   4.48    0.96
2048x256       3.19    3.30   3.37     5.10    7.01   7.29     3.43   4.68    0.97

- Block RGC3 outperforms Block GC3 and the other schemes.
- Larger buffer size and larger image size give a better compression ratio.
- 50-69% of the improvement is due to diagonal copying, which is more effective as buffer size increases.

Block RGC3, 4x4 block, 40 KB buffer, 25° Metal 1 layout:

Image size   H/V copying   Diagonal copying
64 x 2048    3.44          5.22
256 x 2048   3.37          5.71

Results for Various Wafer Layers
(Metal 1 columns cover the memory and logic layouts.)

Image size   Buffer size   Metal 1                   Poly             Via
                           25°     35°     38°       25°     35°      25°     35°
64 x 2048    1.7KB         4.92    5.37    5.14      --      8.49     13.14   12.67
256 x 1024   1.7KB         5.09    5.43    5.33      8.55    8.47     13.58   13.17
256 x 2048   1.7KB         5.10    5.45    5.35      --      8.51     13.62   13.22
64 x 2048    20KB          6.54    6.68    6.63      --      11.17    15.31   15.40
256 x 1024   20KB          6.91    7.08    7.11      14.06   12.50    16.14   16.00
256 x 2048   20KB          7.01    7.20    7.22      --      12.77    16.35   16.22
64 x 2048    40KB          6.60    6.79    6.71      --      11.91    15.86   16.11
256 x 1024   40KB          7.12    7.23    7.34      14.87   12.80    17.05   17.27
256 x 2048   40KB          7.29    7.41    7.50      --      13.17    17.45   17.79

- Higher compression ratio for via than for metal 1
- Larger buffer size and larger image size give a better compression ratio.

Encoding Time
[Figure: current block and allowed copy range]
(1) Diagonal copying
- Must compare each image block with each copy distance
- [Complexity expression: O(...) as a function of buffer_size and block_size with parameter β]
(2) Growing regions
- Proportional to the average number of optimal copy distances per block
- $O\!\left(\frac{d_{\text{matches,block}}}{\text{block\_size}}\right)$
(3) Combining regions
- Proportional to the average number of optimal copy distances per region
- Inversely proportional to the number of blocks per region, N
- $O\!\left(\frac{d_{\text{matches,region}}}{N \cdot \text{block\_size}}\right) = O\!\left(\frac{d_{\text{matches,region}}}{\text{region\_size}}\right)$

Encoding Times

Image size   Buffer   Diagonal copying     Region growing       Combining regions    Total encoding time (s)
             size     Metal1 25°  Via 25°  Metal1 25°  Via 25°  Metal1 25°  Via 25°  Metal1 25°  Via 25°
64 x 2048    20KB     95.4%       85.5%    4.3%        13.0%    0.5%        1.4%     37.0        41.4
256 x 1024   20KB     95.2%       85.1%    4.2%        13.8%    0.4%        1.1%     92.1        109.2
64 x 2048    40KB     96.1%       84.9%    3.6%        14.0%    0.03%       1.1%     66.2        78.7
256 x 1024   40KB     95.6%       81.1%    4.0%        18.0%    0.02%       0.9%     173.9       226.9

- Dominant factor: diagonal copying to find the best copy distance
- Encoding time is proportional to buffer size and image size.

Encoding Time vs. Compression Ratio
[Figure: three plots of compression ratio versus encoding time (0.1 to 1000, log scale) for Metal 1, Poly, and Via]
- Encoding time is normalized to microseconds per pixel.
- Smaller buffer size gives lower CR and lower encoding time.
- The additional encoding complexity of Block RGC3 is justifiable if higher compression ratios are needed:
  - Metal 1: higher than 3.5
  - Poly: higher than 6
  - Via: higher than 12

Integrating Block GC3 with Writer Systems
- The algorithm needs to be modified to achieve the best compression efficiency.
- This may increase encoding complexity.
- The decoding structure remains the same.
- The compression algorithm remains asymmetric.

Summary
- Block GC3 solves the data delivery problem for direct-write lithography systems.
  - Implement Block GC3
  - Block GC3 reduces:
    - I/O data rate
    - System power
- Block RGC3 improves the compression ratio for the REBL system.
  - Encoder complexity increases.
  - Decoder complexity remains low, which is the goal.