FPGA IMPLEMENTATION OF THE JPEG2000 MQ DECODER

FPGA IMPLEMENTATION OF THE JPEG2000 MQ DECODER Thesis Submitted to The School of Engineering of the UNIVERSITY OF DAYTON In Partial Fulfillment of the Requirements for The Degree of Master of Science in Electrical Engineering By David Joseph Lucking, B.S. Dayton, Ohio May, 2010

FPGA IMPLEMENTATION OF THE JPEG2000 MQ DECODER

APPROVED BY:

Eric Balster, Ph.D., Adviser, Committee Chairman, Electrical & Computer Engineering
Frank Scarpino, Ph.D., Committee Member, Electrical & Computer Engineering
Tarek Taha, Ph.D., Committee Member, Electrical & Computer Engineering
Malcolm Daniels, Ph.D., Associate Dean, School of Engineering
Tony Saliba, Ph.D., Dean, School of Engineering

ABSTRACT

FPGA IMPLEMENTATION OF THE JPEG2000 MQ DECODER

Name: Lucking, David Joseph
University of Dayton, 2010
Adviser: Eric Balster

As digital imaging techniques continue to advance, new image compression standards are needed to keep transmission time and storage space low for increasing image sizes. The Joint Photographic Experts Group (JPEG) fulfilled this need with the ratification of the JPEG2000 standard in December of 2000. JPEG2000 adds many features to image compression technology but also increases the computational complexity of traditional decoders. To mitigate the added computational complexity, the committee developed the JPEG2000 algorithm so that parts of it can be processed in parallel, increasing the benefits of implementing the algorithm in application specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs). This paper presents a flexible FPGA implementation of the MQ decoder, the core component of the JPEG2000 decoding algorithm, that increases throughput beyond previous designs.

To my parents for all your support and love.

ACKNOWLEDGMENTS

Thank you to everyone that helped in making this opportunity possible. Thank you to Eric Balster for being my adviser and continuously pestering me to stay on focus to complete my thesis on time. At many points throughout this process, you were more motivated than I in finishing my thesis. Thank you to Dr. Scarpino for all of your advice and information in both engineering subjects and life matters. Thank you for the opportunity to work in the Reconfigurable Computing Laboratory and I will always remember the great experiences I have had in this graduate program.

Thank you to Kerry Hill, Al Scarpelli, and the Air Force Research Laboratory for all of your lab space, resources, and financial support. Thank you for the opportunity to work with AFRL as a graduate student and for helping me acquire a job in the multi-chip integration branch of the AFRL family. Thank you to Ben Fortener for helping me learn JPEG2000 in the beginning and always being there to answer the questions I felt too stupid to ask anyone else. Thank you to David Walker and Luke Hogrebe for your patience in answering all my questions about HDL, the GiDEL interface, and the algorithm for the coding passes and the MQ encoder. Thank you to Thaddeus Marrara, Nick Vicen, Ken Simone, Bill Turri, and UDRI for the faith that you placed in me by assigning me the MQ decoder project while I was a graduate student with barely enough experience to decide to work on such a project. Thank you to David Mundy for your software implementation of the MQ encoder because it helped considerably when designing the algorithm in this thesis. Thank you to Dr. Taha for serving on my thesis committee.

Thank you to my friends and roommates for putting up with me throughout this process. Finally, thank you to my family (especially my parents) for supporting and helping me both emotionally and financially throughout my life. You have helped me learn and grow in every aspect of my life. This thesis is a direct result of everything you have taught me.

TABLE OF CONTENTS

APPROVAL
ABSTRACT
DEDICATION
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTERS:

1. Introduction
2. JPEG2000 Decoder Overview
   2.1 Bit Stream Parser
   2.2 Entropy Block Decoder
   2.3 De-Quantization
   2.4 Inverse Discrete Wavelet Transform
   2.5 Color Transform
   2.6 Tile Combiner
3. Entropy Block Decoder
   3.1 Shannon's Limit
   3.2 Huffman Coding
   3.3 Arithmetic Coding
   3.4 Decoding Passes
   3.5 MQ Decoder
4. MQ Decoder Design
   4.1 Design Algorithm
   4.2 Lookup Tables
   4.3 Arithmetic and Comparator Modules
   4.4 State Machine
   4.5 Controller
5. Results
   5.1 Verification
   5.2 Theoretical Results
   5.3 Empirical Results
6. Future Work and Conclusions
   6.1 Future Work
   6.2 Conclusion

BIBLIOGRAPHY

LIST OF TABLES

2.1 JPEG2000 Main Headers
2.2 JPEG2000 Tile Headers
2.3 CDF 9/7 Synthesis Filter Coefficients
2.4 Spline 5/3 Synthesis Filter Coefficients
2.5 Lifting Equation Variables
3.1 LL and LH Significance Decoding Look Up Table Context Values
3.2 HL Significance Decoding Look Up Table Context Values
3.3 HH Significance Decoding Look Up Table Context Values
3.4 Sign Decoding Context LUT
3.5 Sign Decoding Sign LUT
3.6 Magnitude Refinement Pass LUT
3.7 Initial Context State Table
3.8 Probability State Table
3.9 Probability State Table (Continued)
4.1 Operations Performed by the MQ Decoder
4.2 Operations Performed by the MQ Controller
5.1 MQ Decoder Logic Utilization on an XC2V6000-6
5.2 Decoding Passes Logic Utilization from [9] on an XC2V6000-6
5.3 Tier 1 Logic Utilization on an XC2V6000-6
5.4 MQ Decoder Theoretical Throughput
5.5 Logic Utilization on an Altera Stratix III SL150
5.6 Kakadu Pentagon Timing Analysis, in seconds
5.7 Intel Pentagon Timing Analysis, in seconds
5.8 JasPer Pentagon Timing Analysis, in seconds
5.9 Kakadu Airport Timing Analysis, in seconds
5.10 Intel Airport Timing Analysis, in seconds
5.11 JasPer Airport Timing Analysis, in seconds
5.12 Average Entropy Decoder Timing Analysis, in seconds
5.13 Entropy Decoder Throughput (MB/s)

LIST OF FIGURES

2.1 JPEG2000 Decoder Block Diagram
2.2 JPEG2000 Bit Stream Parser Hierarchy
2.3 Inverse Discrete Wavelet Transform Block Diagram
2.4 Two Dimensional Inverse Discrete Wavelet Transform
2.5 Discrete Wavelet Transform Acronyms
2.6 Inverse Discrete Wavelet Transform Lifting Block Diagram
2.7 Actual Image Transformed by Two DWTs
3.1 Direction of Processing Codeblock Bitplanes
3.2 Direction of Processing Bitplane Stripes
3.3 Representation of the MQ Decoder Internal Registers
3.4 The Standard MQ Decoder Algorithm
3.5 The Standard MQ Decode Function
3.6 The Standard MQ MPS Exchange Function
3.7 The Standard MQ LPS Exchange Function
3.8 The Standard MQ RenormD Function
3.9 The Standard MQ Byte In Function
4.1 MQ Decoder Block Diagram from [9]
4.2 MQ Decoder Block Diagram
4.3 The MQ Decoder Algorithm of the Proposed Design
4.4 The MQ Decode Function of the Proposed Design
4.5 The MQ Set LUTs Function of the Proposed Design
4.6 The MQ Renorm Function of the Proposed Design
4.7 The MQ Load Byte Function of the Proposed Design
4.8 MQ Controller Assignment Optimization
4.9 MQ Decoder State Machine from [9]
4.10 Number of Shifts Performed During Renormalization
4.11 Proposed MQ Decoder State Machine
5.1 Verification Images
5.2 2048x2048 Pixel Pentagon Variation
5.3 4096x4096 Pixel Airport Variation
5.4 Total Decompression Time for Pentagon Image Encoded by Kakadu
5.5 Total Decompression Time for Pentagon Image Encoded by Intel
5.6 Total Decompression Time for Pentagon Image Encoded by JasPer
5.7 Total Decompression Time for Airport Image Encoded by Kakadu
5.8 Total Decompression Time for Airport Image Encoded by Intel
5.9 Total Decompression Time for Airport Image Encoded by JasPer
5.10 Average Entropy Decoder Processing Time

CHAPTER 1 Introduction

JPEG2000 is the latest image compression standard from the Joint Photographic Experts Group (JPEG), ratified in December of 2000 [15]. According to [12], the Geospatial Intelligence Standards Working Group (GWG) mandated JPEG2000 in September of 2005 as the preferred still imagery compression standard for use with the National Imagery Transmission Format Standard (NITF). According to [10], the United States Department of Defense (DoD), the International Standards Organization (ISO), and the American National Standards Institute (ANSI) have all adopted NITF as the common imagery format for transmission between systems. The requirement to use the JPEG2000 standard in both government and commercial agencies demands the development of a near real time (NRT) implementation to replace the previous image compression systems.

The adoption of the JPEG2000 standard is due to the many advantages introduced by the compression algorithm, including compression performance 30% higher than that of the previous JPEG standard [18-20]. An embedded bit stream and region of interest coding are two other advantages that are inherent in the algorithm. The bit stream organization increases the options for imagery systems to efficiently parse and decode JPEG2000 images based on the platform targeted by the decoder. The standard committee determined that the multitude of advantages far outweighs the major disadvantage: an increase in computational complexity. According to [4], JPEG2000 is 30 times more computationally complex than the original JPEG standard when encoding an image and 10 times more complex when decoding an image. The computational difference between the two standards is further magnified by the continuous increase in pixel densities of modern focal plane arrays (FPAs).

To achieve the NRT performance required by government and commercial systems, innovative designs must be devised to offset the increased complexity of the JPEG2000 standard over the original JPEG standard. An application specific integrated circuit (ASIC) or field programmable gate array (FPGA) implementation of a JPEG2000 encoder/decoder is well suited to this problem because of the parallel nature of the JPEG2000 standard and the ability of both platforms to perform parallel processing.

Several papers have proposed hardware description language (HDL) implementations of the JPEG2000 decoder for use in FPGAs or in developing ASICs [6, 7, 9]. In [7], a column and sample skipping technique is used to speed up decoding at the cost of decreased quality of the decoded image. The skipping technique lowers the flexibility of the design because it cannot be used when decoding lossless images. In [6], Handel-C is used to shorten hardware development time at the cost of a less efficient implementation. Handel-C synthesizes C code into HDL, which requires more logic and more clock cycles than hand designed implementations. In [9], a hardware decoder for digital cinema is developed that requires the images to be previously encoded using parallel mode, an inefficient coding technique described in the standard to decrease complexity [15]. The design in [9] trades a high logic utilization for a low number of clock cycles to complete the MQ decoding process.

The JPEG2000 decoding algorithm is partitioned into six parts: the bit stream parser, entropy block decoder, de-quantization, inverse discrete wavelet transform, tile combiner, and inverse color transform. The algorithm is profiled in [18, 20], both of which conclude that the entropy decoding operation (i.e. the MQ decoder) uses the largest percentage of the processing time and the most memory. Thus the entropy decoder is the best candidate for hardware acceleration. This paper describes a new MQ decoder design in the Very High Speed Integrated Circuit Hardware Description Language (VHDL). It is shown that this new design increases the speed of the JPEG2000 process while decreasing the size when compared specifically to the design in [9]. It is also shown through timing measurements that the HDL implementation takes less time to decode the same images than the Intel Primitives software.

This thesis is organized into six chapters starting with the introduction. After the introduction, Chapter 2 provides a brief overview of the JPEG2000 algorithm to set the terminology for the rest of the paper. Chapter 3 gives an in-depth analysis of the entropy block decoder. Chapter 4 gives a description of a quick and efficient MQ decoder in VHDL. Chapter 5 gives results of the MQ decoder design implemented on an FPGA. Chapter 6 concludes the paper.

CHAPTER 2 JPEG2000 Decoder Overview

The JPEG2000 decoder algorithm is briefly described in this chapter for completeness and to introduce terminology for the rest of the paper. Figure 2.1 gives the processing architecture of the JPEG2000 decoding process. As shown in Figure 2.1, a JPEG2000 decoder is composed of a bit stream parser, an entropy block decoder, a de-quantization module, an inverse discrete wavelet transform module, a color transform module, and a tile combiner. The bit stream parser, also known as Tier 2, extracts the required data from the headers and locates the compressed codestream to pass to the entropy block decoder. The entropy block decoder decodes the codestream into the wavelet domain. The de-quantization module increases the number of bits representing each coefficient. The inverse discrete wavelet transform converts the wavelet coefficients into pixels. The color transform moves the pixels from the luminance and chrominance color space, YCbCr, into the red, green, and blue color space, RGB. The tile combiner rearranges the rows of each tile to align the image in raster scan order.

Figure 2.1: JPEG2000 Decoder Block Diagram

2.1 Bit Stream Parser

The bit stream parser is the part of the JPEG2000 algorithm commonly referred to as Tier 2. The parser extracts the information from the image, tile, and packet headers required to correctly decode the image. The JPEG2000 main header descriptions are shown in Table 2.1. These values are taken from [22] and presented here for completeness. The start of codestream (SOC), image and tile size (SIZ), coding style default (COD), and quantization default (QCD) headers are the only required main headers because together they contain the minimum information needed to completely decode the image. The optional headers signify either a change in the values already set by the required headers or non-essential information that speeds up the processing of the image.

Table 2.1: JPEG2000 Main Headers

Mnemonic  Value   Description                         Required
SOC       0xFF4F  Start of code-stream                Y
SIZ       0xFF51  Image and tile size                 Y
COD       0xFF52  Coding style default                Y
QCD       0xFF5C  Quantization default                Y
COC       0xFF53  Coding style component              N
QCC       0xFF5D  Quantization component              N
RGN       0xFF5E  Region of interest                  N
POC       0xFF5F  Progression order change            N
PPM       0xFF60  Packed packet headers: main header  N
PLM       0xFF57  Packet lengths: main header         N
TLM       0xFF55  Tile-part lengths: main header      N
CRG       0xFF63  Component registration              N
COM       0xFF64  Comment                             N

The SIZ header includes the height and width of the image and tiles in pixels, the number of components, the sub-sampling for each component, and the bit depth for each component. The COD header contains the coding style (which states whether the packets are of maximum size and whether start of packet and end of packet markers are used), the progression order, the number of quality layers, the number of wavelet transform levels, which wavelet transform is performed, and the packet and codeblock sizes. A packet is a group of codeblocks, and a codeblock is an array of coded wavelet coefficients that are processed by the arithmetic decoder. The QCD header includes all the information pertaining to the quantization of the image. The quantization style specifies whether only one quantization value is given and the rest are derived or whether all the quantization values are defined in the header. The second part of the QCD header then defines the required quantization values.

In addition to the main headers, each tile is self-contained, with its own header information and the packets containing its data. The tile headers, given in [22], are shown in Table 2.2. The start of tile (SOT) header specifies the required information about the tile, including the tile index and the size of the tile in bytes. The SOT header and the start of data (SOD) header, which signifies the end of the tile headers and the beginning of the packets, are the only required tile headers. The rest of the tile headers are not required but can be used to override, for that tile, the parameters set in the main header.
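As an illustration of how a parser might walk the main header, the following Python sketch scans marker segments until the first SOT marker and reports which required headers are missing. It is a minimal sketch under stated assumptions, not the parser used in this work: the function names `scan_main_header` and `missing_required` are hypothetical, and it assumes that every main header marker other than SOC is followed by a 16-bit big-endian length field that counts itself but not the marker. The marker values are those of Table 2.1.

```python
import struct

# Marker values from Table 2.1 (main header markers).
MAIN_MARKERS = {
    0xFF4F: "SOC", 0xFF51: "SIZ", 0xFF52: "COD", 0xFF5C: "QCD",
    0xFF53: "COC", 0xFF5D: "QCC", 0xFF5E: "RGN", 0xFF5F: "POC",
    0xFF60: "PPM", 0xFF57: "PLM", 0xFF55: "TLM", 0xFF63: "CRG",
    0xFF64: "COM",
}
SOT = 0xFF90  # start of tile: marks the end of the main header
REQUIRED = {"SOC", "SIZ", "COD", "QCD"}


def scan_main_header(data):
    """Return a list of (mnemonic, payload) pairs found before the first SOT.

    Hypothetical helper. Assumes each segment except SOC carries a 16-bit
    big-endian length that includes the length field but not the marker.
    """
    segments = []
    pos = 0
    while pos + 2 <= len(data):
        (marker,) = struct.unpack_from(">H", data, pos)
        if marker == SOT:
            break
        name = MAIN_MARKERS.get(marker)
        if name is None:
            raise ValueError(f"unexpected marker 0x{marker:04X} at offset {pos}")
        pos += 2
        if name == "SOC":  # SOC is a bare marker with no length or payload
            segments.append((name, b""))
            continue
        (length,) = struct.unpack_from(">H", data, pos)
        segments.append((name, data[pos + 2:pos + length]))
        pos += length
    return segments


def missing_required(segments):
    """Return the set of required main headers that were not found."""
    return REQUIRED - {name for name, _ in segments}


# Toy codestream: the payload bytes are placeholders, not valid SIZ/COD/QCD content.
toy = bytes.fromhex("FF4F" "FF51" "0004" "AABB" "FF52" "0003" "CC" "FF5C" "0002" "FF90")
found = scan_main_header(toy)
```

A real parser would go on to interpret each payload (image size, progression order, quantization values); the sketch only shows the marker walk that precedes that step.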

Table 2.2: JPEG2000 Tile Headers

Mnemonic  Value   Description                              Required
SOT       0xFF90  Start of tile                            Y
SOD       0xFF93  Start of data                            Y
COD       0xFF52  Coding style default                     N
QCD       0xFF5C  Quantization default                     N
COC       0xFF53  Coding style component                   N
QCC       0xFF5D  Quantization component                   N
RGN       0xFF5E  Region of interest                       N
POC       0xFF5F  Progression order change                 N
PPT       0xFF61  Packed packet headers: tile-part header  N
PLT       0xFF58  Packet lengths: tile-part header         N
COM       0xFF64  Comment                                  N

Each tile contains packets, also called precincts, that group the codeblocks together. Each packet has a header and a body. The packet header specifies which codeblocks actually have data, which quality layer each codeblock belongs to, the bit depth of each codeblock, the number of coding passes (explained in Chapter 3) performed on each codeblock, and the length in bytes of each codeblock. After all the needed information is obtained from the headers, the parser locates the bit stream for each codeblock and passes each one to the entropy decoder. Figure 2.2 displays the hierarchy of the image, tile, precinct, and codeblock partitions. In the figure, the image contains X tiles and each tile contains Y precincts. Each precinct inside a tile contains Z codeblocks.
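The image, tile, precinct, and codeblock hierarchy described above can be modeled with a few nested containers. The Python sketch below is purely illustrative: the class and field names are hypothetical and carry only the packet header fields mentioned in the text. It shows one way a parser could hand codeblocks to the entropy decoder in order while skipping those that the packet header marks as empty.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Codeblock:
    length_bytes: int      # length of the codeblock bit stream, from the packet header
    coding_passes: int     # number of coding passes, from the packet header
    data: bytes = b""      # compressed bit stream located by the parser


@dataclass
class Precinct:
    codeblocks: List[Codeblock] = field(default_factory=list)


@dataclass
class Tile:
    index: int             # tile index, from the SOT header
    precincts: List[Precinct] = field(default_factory=list)


@dataclass
class Image:
    tiles: List[Tile] = field(default_factory=list)


def codeblocks_for_entropy_decoder(image: Image) -> Iterator[Tuple[int, Codeblock]]:
    """Yield (tile index, codeblock) for every codeblock that carries data."""
    for tile in image.tiles:
        for precinct in tile.precincts:
            for codeblock in precinct.codeblocks:
                if codeblock.length_bytes > 0:  # empty codeblocks are skipped
                    yield tile.index, codeblock


# Toy hierarchy: one tile, two precincts, three codeblocks (one of them empty).
image = Image(tiles=[Tile(index=0, precincts=[
    Precinct([Codeblock(3, 2, b"\x01\x02\x03"), Codeblock(0, 0)]),
    Precinct([Codeblock(1, 1, b"\x7f")]),
])])
work_items = list(codeblocks_for_entropy_decoder(image))
```

Because each codeblock is decoded independently of the others, a list such as `work_items` is also a natural unit for distributing work across parallel decoder instances.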

Figure 2.2: JPEG2000 Bit Stream Parser Hierarchy

2.2 Entropy Block Decoder

The entropy block decoder is subdivided into the decoding passes and the binary arithmetic decoder (MQ decoder). The decoding passes determine a context for each bit in the image based on the significance of the bits surrounding the current bit in the image. The binary arithmetic decoder uses the context generated by the decoding passes along with probability look up tables to determine the output bit value. The entropy block decoder is described in more detail in Chapter 3.

2.3 De-Quantization

De-quantization uses the epsilon and mantissa extracted from the QCD header to increase the number of bits used to represent the coefficients output by the arithmetic decoder. The epsilon and mantissa can be explicitly specified for each wavelet subband or specified only for the lowest subband. A subband is the part of the image that is processed by the same type and number of filters during the wavelet transform. If only the lowest subband is specified, then the mantissa and epsilon values of the other subbands are derived. The step size for each subband is then found by using the mantissa and epsilon in Equation 2.1.

$$\Delta_b = 2^{R_b - \varepsilon_b} \left(1 + \frac{\mu_b}{2^{11}}\right) \tag{2.1}$$

Along with the mantissa, μ_b, and epsilon, ε_b, Equation 2.1 also requires the bit depth, R_b, to calculate the step size Δ_b. De-quantization is performed by multiplying each coefficient by the step size, as shown by Equation 2.2. If the step size is equal to 1 or if the image is encoded losslessly, then the de-quantization step is skipped.

$$x_d[w] = \Delta_b \, x_L[w] \tag{2.2}$$

2.4 Inverse Discrete Wavelet Transform

After the data has been decoded and de-quantized, the output information is in the form of decoded wavelet coefficients. The inverse discrete wavelet transform (IDWT) converts these wavelet coefficients into raw image data. The IDWT applies a low pass filter, h[n], and a high pass filter, g[n], once for each level of the discrete wavelet transform that was performed when encoding the image. Figure 2.3 demonstrates a three level IDWT because there are three pairs of filters to process the coefficients.

Figure 2.3: Inverse Discrete Wavelet Transform Block Diagram

Although Figure 2.3 demonstrates the one dimensional implementation of the IDWT, the JPEG2000 algorithm requires a two dimensional implementation that filters the coefficients in each column and then the coefficients in each row. The process of filtering each column and then each row is shown in Figure 2.4. Each subband, the part of the image processed by the same filters, is referred to by the type of filters that processed the coefficients. To distinguish between the coefficients filtered by each wavelet transform, the subbands also have a subscript specifying the number of times they have been transformed. For example, in Figure 2.4, the part designated as Low Pass + Low Pass would be described as LL_1. After one wavelet transform is performed, the image consists of the subbands shown in Figure 2.5: LL, LH, HL, and HH. For each wavelet transform performed after the first, only the LH, HL, and HH subbands are replicated, so there is always only one LL subband.

Figure 2.4: Two Dimensional Inverse Discrete Wavelet Transform

Figure 2.5: Discrete Wavelet Transform Acronyms
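Returning to the de-quantization step of Section 2.3, the step size calculation and the coefficient scaling can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used in this work: the function names `step_size` and `dequantize` are hypothetical, and the formulas are Equations 2.1 and 2.2 as read from the text, that is, a step size of 2^(R_b - ε_b) multiplied by (1 + μ_b / 2^11), followed by one multiplication per coefficient.

```python
def step_size(bit_depth, epsilon, mantissa):
    """Equation 2.1: step size of one subband.

    bit_depth is R_b, epsilon is the exponent from the QCD header,
    and mantissa is the 11-bit mantissa from the QCD header.
    """
    return 2.0 ** (bit_depth - epsilon) * (1.0 + mantissa / 2.0 ** 11)


def dequantize(coefficients, bit_depth, epsilon, mantissa):
    """Equation 2.2: scale every decoded coefficient by the step size."""
    delta = step_size(bit_depth, epsilon, mantissa)
    if delta == 1.0:
        # A unit step size means de-quantization can be skipped.
        return list(coefficients)
    return [delta * c for c in coefficients]


# Example values chosen for illustration only.
unit_step = step_size(8, 8, 0)
half_plus = step_size(8, 9, 1024)
scaled = dequantize([4, -8], 8, 9, 1024)
```

With an epsilon equal to the bit depth and a zero mantissa the step size is 1, which is the case in which the text notes that de-quantization is skipped.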

The filters used for the JPEG2000 algorithm are the CDF 9/7 for the irreversible transform and a derivation of the spline 5/3 for the reversible transform. The numbers 9/7 and 5/3 refer to the lengths of the corresponding analysis filters: 9 (5 for the reversible) filter coefficients for the low pass filter and 7 (3 for the reversible) filter coefficients for the high pass filter. For the synthesis filters listed below, the two lengths are swapped. The transforms are labeled irreversible and reversible because the 9/7 coefficients are non-terminating decimals while the 5/3 coefficients are finite. From [8], the 9/7 filter coefficients and 5/3 filter coefficients are shown in Tables 2.3 and 2.4 respectively.

Table 2.3: CDF 9/7 Synthesis Filter Coefficients

Position n   Low Pass                 High Pass
0             1.115087052456994        0.6029490182363579
-1, 1         0.5912717631142470      -0.2668641184428723
-2, 2        -0.05754352622849957     -0.07822326652898785
-3, 3        -0.09127176311424948      0.01686411844287495
-4, 4                                  0.02674875741080976

Table 2.4: Spline 5/3 Synthesis Filter Coefficients

Position n   Low Pass   High Pass
0            1           6/8
-1, 1        1/2        -2/8
-2, 2                   -1/8

Due to the inefficiency of upsampling the signal and then filtering the increased data set, a lifting scheme can be used to switch the filtering and upsampling steps. By switching the two steps, a lifting implementation halves both the amount of data being processed and the number of computations performed. To build a lifting implementation, the predict and update filter coefficients must be calculated. Once the new filter coefficients are found, the predict and update filters are applied according to Figure 2.6. A tutorial on converting the implementation shown in Figure 2.3 into the one shown in Figure 2.6 can be found in [13]. According to [13], the predict and update filters are given by Equations 2.3 through 2.8. The variables required in the equations are shown in Table 2.5.

\[ p_1(z) = \alpha (z + 1) \tag{2.3} \]
\[ u_1(z) = \beta (1 + z^{-1}) \tag{2.4} \]
\[ p_2(z) = \gamma (z + 1) \tag{2.5} \]
\[ u_2(z) = \delta (1 + z^{-1}) \tag{2.6} \]
\[ K_s = \zeta \tag{2.7} \]
\[ K_d = \frac{1}{\zeta} \tag{2.8} \]

Table 2.5: Lifting Equation Variables

Variable   Value
α          -1.586134342060
β          -0.052980118573
γ           0.882911075531
δ           0.443506852044
ζ           1.149604398860

Figure 2.6: Inverse Discrete Wavelet Transform Lifting Block Diagram

The DWT adds the advantage of inherent resolution levels to the JPEG2000 algorithm without the blocking artifacts of the discrete cosine transform in the original JPEG standard. Figure 2.7 has been processed by two wavelet transforms. The highlighted part of Figure 2.7 is the LL subband after the first wavelet transform. The inherent resolutions are created because the LL subband of any wavelet transform can be extracted and processed by the wavelet transform to generate a lower resolution version of the image. The resolution of the LL portion of the image is \(1/2^{2m}\) of the original image resolution, where m is the number of wavelet transforms performed on the image. For example, to get the first resolution, the LL_2 subband would be extracted from Figure 2.7 and inverse wavelet transformed to generate an image with \(1/2^4\), or one sixteenth, of the original resolution. To get a higher resolution, the inverse wavelet transform is performed on all the subbands with the same subscript. Performing the inverse wavelet transform produces the LL subband with a subscript that is one less than that of the processed subbands. This iterative process can be continued until the needed resolution is generated or the original image is reconstructed.
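The predict and update steps of Equations 2.3 through 2.8 can be illustrated with a short one dimensional sketch. It is not the implementation of Figure 2.6: the function names are hypothetical, the boundary handling is a simple periodic wrap (an assumption made to keep the example short, where JPEG2000 uses symmetric extension), and the signs of α and β follow the usual 9/7 lifting factorization.

```python
# Lifting constants of Table 2.5 (negative alpha and beta are assumed here).
ALPHA = -1.586134342060
BETA = -0.052980118573
GAMMA = 0.882911075531
DELTA = 0.443506852044
ZETA = 1.149604398860


def _lift(target, source, coeff, shift):
    """Add coeff * (source[i] + source[i + shift]) to target[i], wrapping periodically."""
    n = len(source)
    return [t + coeff * (source[i] + source[(i + shift) % n])
            for i, t in enumerate(target)]


def forward_97(x):
    """One forward lifting level on an even-length sequence; returns (low, high)."""
    s, d = list(x[0::2]), list(x[1::2])
    d = _lift(d, s, ALPHA, +1)   # predict, p1(z) = alpha (z + 1)
    s = _lift(s, d, BETA, -1)    # update,  u1(z) = beta (1 + z^-1)
    d = _lift(d, s, GAMMA, +1)   # predict, p2(z)
    s = _lift(s, d, DELTA, -1)   # update,  u2(z)
    return [v * ZETA for v in s], [v / ZETA for v in d]


def inverse_97(low, high):
    """Undo forward_97 by running the lifting steps backwards with negated coefficients."""
    s = [v / ZETA for v in low]
    d = [v * ZETA for v in high]
    s = _lift(s, d, -DELTA, -1)
    d = _lift(d, s, -GAMMA, +1)
    s = _lift(s, d, -BETA, -1)
    d = _lift(d, s, -ALPHA, +1)
    x = [0.0] * (2 * len(s))
    x[0::2] = s
    x[1::2] = d
    return x
```

Because every lifting step is undone by subtracting exactly what was added, the inverse reconstructs the input up to floating point rounding, and a constant input produces a high pass band that is numerically zero.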

Figure 2.7: Actual Image Transformed by Two DWTs

2.5 Color Transform

The color transform is only performed when the image has more than one color plane. The color transform changes the data from the YCbCr (luminance, blue chrominance, and red chrominance) color space to the RGB (red, green, and blue) color space. The YCbCr color space concentrates the data with the changes most perceptible to the human eye in one component so that the other components can be heavily quantized. The matrix used to transform the image from the YCbCr color space to the RGB color space is shown in Equation 2.9. If the image is grayscale, then the data is not inverse color transformed.

\[ \begin{bmatrix} R \\ G \\ B \end{bmatrix} = \begin{bmatrix} 1.0 & 0.0 & 1.40210 \\ 1.0 & -0.34414 & -0.71414 \\ 1.0 & 1.77180 & 0.0 \end{bmatrix} \left( \begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix} - \begin{bmatrix} 0 \\ 128 \\ 128 \end{bmatrix} \right) \tag{2.9} \]
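As a minimal sketch, Equation 2.9 can be applied to a single pixel as follows. The function name is hypothetical, and the final rounding and clamping to the 8-bit range is an assumption that the equation itself does not state.

```python
def ycbcr_to_rgb(y, cb, cr):
    """Apply Equation 2.9 to one 8-bit pixel and return (R, G, B)."""
    cb -= 128  # remove the chrominance offset
    cr -= 128
    r = y + 1.40210 * cr
    g = y - 0.34414 * cb - 0.71414 * cr
    b = y + 1.77180 * cb
    # Rounding and clamping to 0..255 is an assumption of this sketch.
    return tuple(max(0, min(255, round(v))) for v in (r, g, b))
```

A pixel with neutral chrominance (Cb = Cr = 128) maps to a gray value with R = G = B = Y, which is a convenient sanity check for the matrix.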

2.6 Tile Combiner

The JPEG2000 standard allows the encoder to break the original image up into multiple tiles. If multiple tiles are used in the encoder, then the entropy decoding, dequantization, inverse DWT, and color transform are performed on each tile separately. After the image data is decoded, the tile combiner arranges the image data from each tile in raster order. If only one tile is used, then the tile combiner is not invoked.
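The raster arrangement performed by the tile combiner can be sketched as below. The sketch assumes that the decoded tiles arrive in raster order as lists of pixel rows and that all tiles in one row of tiles have the same height. The function and parameter names are hypothetical.

```python
def combine_tiles(tiles, tiles_across):
    """Stitch decoded tiles (each a list of pixel rows) into one raster-ordered image."""
    image = []
    for start in range(0, len(tiles), tiles_across):
        row_of_tiles = tiles[start:start + tiles_across]
        # Emit each pixel row by concatenating that row from every tile in this band.
        for r in range(len(row_of_tiles[0])):
            image.append([px for tile in row_of_tiles for px in tile[r]])
    return image
```

For four 1x2 tiles arranged two across, the result is a 2x4 image whose rows interleave the tiles from left to right.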

CHAPTER 3

Entropy Block Decoder

The entropy decoder processes the compressed bitstream and outputs quantized wavelet coefficients, which make up the codeblock. The bit plane decoder can be separated into the decoding passes, which perform three passes on each bit plane to select the context for each bit, and the MQ decoder, which produces the decoded bit. A bit plane is a two dimensional array of bits that all occupy the same bit position in their respective coefficients.

3.1 Shannon's Limit

In information theory, Claude Shannon introduced the compression limit of a sequence of independent identically distributed tokens with the possibility of fully recovering the original data. Shannon's source coding theorem [21] proves that the least number of symbols needed to represent a sequence of tokens in a given alphabet is on average equal to the entropy of the original sequence divided by the logarithm of the number of symbols in the coding alphabet, as shown in Equation 3.1,

\[ \frac{H(X)}{\log_2 L_{A_2}} \le L_m < \frac{H(X)}{\log_2 L_{A_2}} + 1, \tag{3.1} \]

where X is a random variable from the sequence of tokens and H(X) is the entropy of that sequence of random variables. The resulting message after encoding is given

by m with A_2 as the alphabet of its possible symbols. L_{A_2} is the length of the given alphabet. L_m is the length of the resultant message after the encoding process. For binary coders, such as the MQ coder, the coding alphabet consists of either one or zero. The binary alphabet has a length of two, which simplifies Equation 3.1 to Equation 3.2. The JPEG2000 compression engine employs the MQ coder to strive for an encoded bit stream length close to Shannon's limit. According to [16], the arithmetic coder encodes the message within 3 bits of its entropy limit.

\[ H(X) \le L_{ES} < H(X) + 1 \tag{3.2} \]

3.2 Huffman Coding

In 1952, David Huffman described new restrictions to generate a code aimed at reducing the average number of code words required to represent a sequence [14] to the entropy of that sequence. Huffman's method of encoding is based on the probability estimate of a word's occurrence in the original sequence. Huffman imposed the following restrictions on his coding alphabet:

- No two code words will consist of identical symbols in the same order.
- There is no indication of when a code word begins or ends other than the beginning of the coded message.
- The more probable code word must have a shorter length than code words that are less probable.
- At least two of the code words must have identical symbols except for the last symbol.
- Every combination of symbols less than the length of the least probable code word must be used without conflicting with the previous restrictions.

The second restriction requires that all messages be coded in such a way that no message starts the same as any other message with a longer length. For example, if the coding alphabet includes aa and aaa as symbols and a message starts with aa, it is not immediately clear whether aa or the start of aaa has been received. The aa symbol must therefore have another character at the end to distinguish it from aaa. To achieve the entropy limit as the message length, he proposed that the most probable sequence of tokens in the original data must have the shortest code word, with the least probable and second least probable sequences having the same code word length. If more than one sequence has the same probability, then it is possible that their coded lengths might be different. Finally, the coding alphabet must use all the combinations of digits that have lengths less than the least probable code word without duplicating the exact combination of another longer code word.

The limitations of Huffman coding appear when the exact probability distribution of the original sequence is not known. In real life applications, such as JPEG2000, where the source is ever changing, a Huffman code is usually generated by capturing a specific set of sources to use as a test bench. A predefined Huffman code causes the coding to be inefficient when it is applied to a source that differs significantly from the original test bench. This limitation causes the real life implementation to lose the minimum redundancy when the sequence varies considerably from the original set of sources.

Another limitation of Huffman coding appears when the probability of the occurrence of a sequence is much greater than the other probabilities. In [11], Gallager

shows that the redundancy of Huffman coding is bounded by the probability of the most likely sequence plus 0.086. Therefore, if the probability of the most probable sequence is high, then the redundancy of Huffman coding is also very high.

3.3 Arithmetic Coding

The MQ coder belongs to a category known as context adaptive arithmetic coders. Arithmetic coding builds on the idea from Huffman coding of replacing the highest probability token sequences with the fewest symbols from the given alphabet [4]. But while Huffman coding performs replacements with a static integer length code word, arithmetic coding performs variable length source encoding to produce a message with the least amount of redundancy. The input sequences for arithmetic coding are represented by a real number interval to produce a resulting message whose length is determined by the probability of the original sequence, approximately the negative base 2 logarithm of that probability in bits. The interval starts at [0.0,1.0] and, as the message becomes longer, the interval becomes smaller based on the tokens in the original sequence. The most probable sequences have a larger interval and therefore reduce the interval less than the least probable sequences, causing the resulting message to be smaller.

While arithmetic coding offers much better coding efficiency and flexibility compared to Huffman coding, it is also much more complex [4]. The variable length encoding is useful when the alphabet has a large difference in the probabilities between code words. Arithmetic coding is based on the histogram as a probability model, which allows this approach to theoretically achieve the entropy of the original sequence [24]. The higher complexity is caused by the real number interval and the

smaller interval produced by the larger original sequence of tokens. In image processing implementations, the image is subdivided into smaller pieces to increase the interval and lower the complexity. According to [23], the requirement for an end of message marker and the finite precision of the real number representation lower the theoretical efficiency of arithmetic coding. But in real life implementations, rounding and scaling techniques are used to defer the problems caused by the finite representation.

3.4 Decoding Passes

The decoding passes determine the significance of each bit in the image by using the decoded values and the significance of the surrounding bits. The significance state is an array of binary labels, one per coefficient, that signify whether the coefficient has had a 1 decoded in a previous bit plane. The passes traverse the codeblock starting with the most significant bitplane (MSB) and proceeding to the next lower bitplane until the least significant bitplane (LSB) is reached. The order of the bitplanes being processed is shown in Figure 3.1.

Figure 3.1: Direction of Processing Codeblock Bitplanes
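The bitplane ordering of Figure 3.1 can be illustrated by slicing a small block of coefficient magnitudes into its bitplanes, most significant first. This is a sketch only: the function name is hypothetical, and it works on already known magnitudes, whereas the decoder builds the planes up one decoded bit at a time.

```python
def bitplanes_msb_first(magnitudes, num_planes):
    """Yield (plane_number, bitplane) pairs from the MSB plane down to the LSB plane."""
    for plane in range(num_planes - 1, -1, -1):
        # Each bitplane holds the bit at position `plane` of every coefficient.
        yield plane, [[(value >> plane) & 1 for value in row] for row in magnitudes]
```

For the block [[5, 2], [0, 7]] with three planes, the first plane produced is plane 2, in which only the coefficients 5 and 7 contribute a one.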

In each bitplane, a stripe of four vertical bits is processed at a time, starting with the upper left most bit stripe. The passes visit each bit in the stripe starting with the top bit and then the second, third, and fourth bit. The stripes are processed horizontally from left to right until the end of the bitplane row is reached. The scanning procedure is described in Figure 3.2.

Figure 3.2: Direction of Processing Bitplane Stripes

Once a decoding pass finds a bit to be significant, the pass provides the arithmetic decoder with a context for that bit based on the status of the surrounding bits. The arithmetic decoder uses the context to correctly calculate the value of that bit. The decoding pass must wait for the bit value returned by the arithmetic decoder before the surrounding bits can be processed. Since each bit must be decoded and a context is required to decode a bit, at least one decoding pass is required to operate on each bit in an image. However, after a bit is decoded by one of the passes, the bit is not processed by another decoding pass. The decoder consists of three different passes: the significance propagation pass, the magnitude refinement pass, and the cleanup pass.
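The stripe oriented scan described above can be written as a short generator. The sketch below is an illustration under the assumption that a final stripe shorter than four rows is simply truncated. The function name is hypothetical.

```python
def stripe_scan(width, height, stripe_height=4):
    """Yield (row, column) positions of a bitplane in stripe scan order."""
    for top in range(0, height, stripe_height):             # one stripe row at a time
        for col in range(width):                            # stripes from left to right
            for row in range(top, min(top + stripe_height, height)):
                yield row, col                              # bits from top to bottom
```

For a bitplane two columns wide and five rows high, the scan visits the four bits of column 0, then the four bits of column 1, and finally the single remaining row from left to right.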

The significance propagation pass (SPP) finds the coefficients that have not been marked as significant and have at least one significant coefficient in the bit's neighborhood. A bit's neighborhood is defined as the horizontally, vertically, and diagonally adjacent bits in the corresponding bitplane. If an insignificant coefficient is found with significant neighbors, significance decoding decodes the bit of the current bit plane and, if the decoded bit is one, the coefficient is marked as significant. Significance decoding first calculates the horizontal, vertical, and diagonal significance of the bit's neighborhood. Then significance decoding inputs those significance values to a LUT based on the current subband to get the context. The significance decoding LUT is shown in Table 3.1 for the LL and LH subbands, Table 3.2 for the HL subband, and Table 3.3 for the HH subband. The LUTs are taken from page 355 of [22]. The context value received from the correct LUT is then used in the MQ decoder to determine the value of the current bit.

Table 3.1: LL and LH Significance Decoding Look Up Table Context Values

Horizontal     Vertical       Diagonal
Significance   Significance   Significance   Context
0              0              0              0
0              0              1              1
0              0              2-4            2
0              1              0-4            3
0              2              0-4            4
1              0              0              5
1              0              1-4            6
1              1-2            0-4            7
2              0-2            0-4            8

Table 3.2: HL Significance Decoding Look Up Table Context Values

Vertical       Horizontal     Diagonal
Significance   Significance   Significance   Context
0              0              0              0
0              0              1              1
0              0              2-4            2
0              1              0-4            3
0              2              0-4            4
1              0              0              5
1              0              1-4            6
1              1-2            0-4            7
2              0-2            0-4            8

Table 3.3: HH Significance Decoding Look Up Table Context Values

Diagonal       Horizontal + Vertical
Significance   Significance            Context
0              0                       0
0              1                       1
0              2-4                     2
1              0                       3
1              1                       4
1              2-4                     5
2              0                       6
2              1-4                     7
3-4            0-4                     8
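Tables 3.1, 3.2, and 3.3 can be expressed compactly as a single function. The sketch below is a transcription of the three tables, not the LUT structure of the hardware. The function name and the string subband labels are hypothetical, and the inputs are assumed to be counts in the ranges given in the tables (0 to 2 for horizontal and vertical, 0 to 4 for diagonal).

```python
def significance_context(subband, h, v, d):
    """Return the significance decoding context for neighbor counts h, v, and d."""
    if subband == "HH":
        hv = h + v
        if d >= 3:
            return 8
        if d == 2:
            return 7 if hv >= 1 else 6
        if d == 1:
            return 5 if hv >= 2 else 3 + hv
        return min(hv, 2)
    if subband == "HL":
        h, v = v, h  # Table 3.2 is Table 3.1 with the first two columns exchanged
    elif subband not in ("LL", "LH"):
        raise ValueError("unknown subband: " + subband)
    if h == 2:
        return 8
    if h == 1:
        if v >= 1:
            return 7
        return 6 if d >= 1 else 5
    if v == 2:
        return 4
    if v == 1:
        return 3
    return min(d, 2)
```

Writing the tables as comparisons makes the structure visible: in the LL, LH, and HL subbands the diagonal count only matters when no more than one horizontal and no vertical neighbor (after the HL exchange) is significant.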

If the significance decoding returns a one for the decoded bit, then sign decoding is invoked to determine the sign of that coefficient. Sign decoding inputs the sums of the horizontal and vertical signs into Table 3.4, given on page 359 of [22], to determine the context. The output of the LUT is then sent to the MQ decoder. The output of the MQ decoder, along with the summation of the surrounding signs from above, is then input into Table 3.5, produced using the table on page 359 and the Decode-Sign algorithm on pages 493 and 494 of [22].

Table 3.4: Sign Decoding Context LUT

Vertical   Horizontal
Signs      Signs        Context
<0         <0           14
<0         0            13
<0         >0           12
0          <0           11
0          0            10
0          >0           11
>0         <0           12
>0         0            13
>0         >0           14

Table 3.5: Sign Decoding Sign LUT

Decoded   Vertical   Horizontal
Value     Sign       Sign         Sign
0         <0         x            1
0         0          <0           1
0         0          0            1
0         1          x            1
1         <0         x            1
1         0          <0           1
1         0          0            1
1         1          x            1

The magnitude refinement pass (MRP) locates the coefficients that have already been deemed significant in a previous bit plane and decodes the remaining bits of those coefficients. The MRP then initiates significance decoding like the SPP. But instead of feeding only the neighborhood significance to the LUT, the MRP plugs the delayed significance and the neighborhood significance into Table 3.6, given on page 360 of [22]. Then the output of the MRP LUT is sent to the MQ decoder as the context to decode the bit. The delayed significance of a coefficient is set when any bit of that coefficient is decoded by the MRP.

Table 3.6: Magnitude Refinement Pass LUT

Delayed        Significance
Significance   Decoding Output   Context
0              0                 15
0              1-8               16
1              x                 17

The cleanup pass (CUP) decodes the bits that were not decoded by the other two passes. If the bit plane parsing is at the beginning of a full four bit stripe and none of the four bits have a significant neighbor, then run length decoding is performed in the cleanup pass. Run length decoding has two different modes: either the first bit is equal to zero, indicating that all the bits in the stripe are equal to zero, or the first bit is equal to one, indicating that the next two bits specify the location of the first one bit in the stripe. If one bit in the stripe has a significant neighbor, then significance

decoding is performed on each bit in the stripe instead of run length decoding. The significance decoding for the CUP executes the same operations as the significance decoding performed in the SPP, including sign decoding if the decoded bit is equal to one.

3.5 MQ Decoder

The MQ decoder is a binary arithmetic entropy decoder. It accepts contexts from the three decoding passes and uses the context to look up the current most probable symbol (MPS) and the probability estimate of the least probable symbol (LPS). Based on the probability estimate and the current state of the decoder, the output bit is either the MPS or the LPS. This output bit is returned to the decoding passes to be placed in the correct location for the final decoded image and to be used in determining the significance of surrounding bits.

The MQ decoder can be separated into two components: the internal state (given as A, C, t, T, and L) and two lookup tables, the context state table and the probability state table. The A register represents a normalized length of the decoding interval. Bytes from the encoded code stream are loaded into the least significant bits of the C register, which is then shifted to keep the lower bound of the decoding interval in the most significant bits. The t register keeps track of the number of times C can still be shifted before new data is needed. When t reaches zero, a new byte from the bit stream is loaded into the C register. T holds the last byte loaded. This register is used to reverse the bit stuffing mode of the MQ encoder. The bit stuffing mode inserts an extra bit after each 0xFF byte to make sure that no two consecutive bytes in the encoded stream form a value greater than 0xFF8F. Finally, L is the number of bytes that have been loaded from the bit stream. If the decoder has

reached the end of the codeblock but still requires more bytes, then the value 0xFF is loaded into the C register. Figure 3.3 shows the representation of the MQ decoder internal registers. When a new byte is loaded, the MQ decoder must keep the A register at or above 0x8000. To make sure this requirement is met, the MQ decoder shifts both the A and C registers until the A register is at least 0x8000. If the A register is already at least 0x8000, then no shifting is performed. In Figure 3.3, t is equal to eight because a new byte has been loaded and the C register has not yet been shifted. The active region of C is the only part of that register used in decoding the value, which is why the first two bytes must initially be shifted into the most significant bits of the C register. The length register points to the next byte of the compressed bit stream to be loaded into the eight least significant bits of the C register.

Figure 3.3: Representation of the MQ Decoder Internal Registers

The context state table (CST) uses the context from the decoding passes to determine the probability symbol and the index for the probability state table (PST). The CST accepts the probability symbol and the index as an input to change the table values. This feature allows the probability of a symbol appearing to change based on the surrounding byte values. The PST requires the index from the CST to obtain the four probability mapping rules for the current decoding interval. The first two rules are replacement indices for the CST based on whether the MPS or LPS is decoded. The next rule specifies whether the probability symbol from the context state table should be inverted when the symbol is decoded. The final rule is the relationship between the context and the LPS probability estimate. The initial CST (the CST is modified during the decoding process) and the PST are shown in Table 3.7 and Tables 3.8 and 3.9 respectively. The CST is given on page 488 and the PST is given on page 75 of [22].

Table 3.7: Initial Context State Table

          Probability   Probability State
Context   Symbol        Table Index
0         0             4
1         0             0
2         0             0
3         0             0
4         0             0
5         0             0
6         0             0
7         0             0
8         0             0
9         0             3
10        0             0
11        0             0
12        0             0
13        0             0
14        0             0
15        0             0
16        0             0
17        0             0
18        0             46

Table 3.8: Probability State Table

        Next Most         Next Least        Toggle Probability   Probability
Index   Probable Symbol   Probable Symbol   Symbol               Estimate
0       1                 1                 1                    0x5601
1       2                 6                 0                    0x3401
2       3                 9                 0                    0x1801
3       4                 12                0                    0x0AC1
4       5                 29                0                    0x0521
5       38                33                0                    0x0221
6       7                 6                 1                    0x5601
7       8                 14                0                    0x5401
8       9                 14                0                    0x4801
9       10                14                0                    0x3801
10      11                17                0                    0x3001
11      12                18                0                    0x2401
12      13                20                0                    0x1C01
13      29                21                0                    0x1601
14      15                14                1                    0x5601
15      16                14                0                    0x5401
16      17                15                0                    0x5101
17      18                16                0                    0x4801
18      19                17                0                    0x3801
19      20                18                0                    0x3401
20      21                19                0                    0x3001
21      22                19                0                    0x2801
22      23                20                0                    0x2401
23      24                21                0                    0x2201
24      25                22                0                    0x1C01
25      26                23                0                    0x1801
26      27                24                0                    0x1601
27      28                25                0                    0x1401
28      29                26                0                    0x1201
29      30                27                0                    0x1101
30      31                28                0                    0x0AC1
31      32                29                0                    0x09C1
32      33                30                0                    0x08A1
33      34                31                0                    0x0521

Table 3.9: Probability State Table (Continued)

        Next Most         Next Least        Toggle Probability   Probability
Index   Probable Symbol   Probable Symbol   Symbol               Estimate
34      35                32                0                    0x0441
35      36                33                0                    0x02A1
36      37                34                0                    0x0221
37      38                35                0                    0x0141
38      39                36                0                    0x0111
39      40                37                0                    0x0085
40      41                38                0                    0x0049
41      42                39                0                    0x0025
42      43                40                0                    0x0015
43      44                41                0                    0x0009
44      45                42                0                    0x0005
45      45                43                0                    0x0001
46      46                46                0                    0x5601

The general algorithm of the MQ decoder can be broken down into seven steps: initialization, loading context, calculating internal variables, decoding bit, look up

table modification, renormalization, and byte loading. The first step is the initialization process, which loads the first two bytes of the code stream and only occurs once per codeblock. After the MQ decoder is initialized, the decoder needs to receive the context and load the required values from the look up tables. Once all the needed values have been loaded, the MQ decoder's main steps (calculating internal variables, decoding bit, and look up table modification) can be performed in any order but must produce the same results as the standard algorithm. The internal variables modified are the A and C variables. The decoding bit step determines if the output bit is the MPS or the LPS. The look up table modification is performed using the values calculated in the decoding step. The renormalization shifts A and C until A is at least 0x8000. Finally, the byte loading step is only performed if the last eight bits of the C register are empty. The JPEG2000 standard algorithm, given in [15], detailing these steps is shown in Figures 3.4, 3.5, 3.6, 3.7, 3.8, and 3.9. The variable names are modified to ease comparison with the algorithm of the proposed implementation.
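The look up table modification step can be sketched with the initial CST of Table 3.7 and a few rows of Table 3.8. The sketch below is illustrative only: the function names are hypothetical, only an excerpt of the PST is included, and the update is assumed to be called only for bits whose decoding leads to renormalization (in the standard algorithm a most probable symbol decoded without renormalization leaves the table unchanged).

```python
# Excerpt of Table 3.8: index -> (next index on MPS, next index on LPS, toggle, estimate).
PST_EXCERPT = {
    0: (1, 1, 1, 0x5601),
    1: (2, 6, 0, 0x3401),
    2: (3, 9, 0, 0x1801),
    3: (4, 12, 0, 0x0AC1),
    4: (5, 29, 0, 0x0521),
    6: (7, 6, 1, 0x5601),
}


def initial_cst(num_contexts=19):
    """Build the initial context state table of Table 3.7 as [mps, index] pairs."""
    cst = [[0, 0] for _ in range(num_contexts)]
    cst[0][1] = 4
    cst[9][1] = 3
    cst[18][1] = 46
    return cst


def update_context(cst, context, decoded_bit):
    """Update one CST entry in place after a decoded bit that triggers renormalization."""
    mps, index = cst[context]
    next_mps, next_lps, toggle, _estimate = PST_EXCERPT[index]
    if decoded_bit == mps:
        cst[context][1] = next_mps
    else:
        if toggle:
            cst[context][0] = 1 - mps  # the LPS has become the more probable symbol
        cst[context][1] = next_lps
```

Starting from the initial table, decoding a one in context 1 (index 0, MPS 0) is an LPS event with the toggle set, so the entry becomes MPS 1 with index 1. A second one in the same context is then an MPS event and advances the index to 2.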

Figure 3.4: The Standard MQ Decoder Algorithm

Figure 3.5: The Standard MQ Decode Function

Figure 3.6: The Standard MQ MPS Exchange Function

Figure 3.7: The Standard MQ LPS Exchange Function

Figure 3.8: The Standard MQ RenormD Function

Figure 3.9: The Standard MQ Byte In Function

CHAPTER 4

MQ Decoder Design

Due to the requirement that codeblocks be encoded and decoded completely independently without the introduction of blocking artifacts, the MQ decoder is an ideal candidate for acceleration through an FPGA implementation. An FPGA implementation exploits the ability to process codeblocks in parallel to increase the throughput of the MQ decoder, the most time consuming part of the JPEG2000 algorithm [18, 20]. The design of the JPEG2000 decoder algorithm requires the decoding passes to wait for the decoded bit value to be returned by the MQ decoder. This requirement demands an efficient MQ decoder that returns the value quickly. This chapter details an FPGA MQ decoder implementation that is more efficient than [9] in both area consumed and clock speed.

The MQ decoder design from [9] is shown in Figure 4.1. The notable parts of the block diagram are the number of shifts being stored in the RAM, the variable shifting component, the three sections used to decode a bit, and the lack of a feedback path from the renormalization block. The Load, Compute, and Decide sections are required to decode each bit, while the Renormalize section is only utilized when the A register must be renormalized.

The block diagram of the proposed design, shown in Figure 4.2, separates the mathematical and logical operations from the logic that determines the values of the internal registers and the output value. The proposed design also differs from the block diagram from [9] because the amount to shift the A register is not stored in the LUT logic and a variable shifting component is not included in the computations module. The proposed MQ decoder design diagram consists of two lookup tables (the context state table and the probability state table), a state machine module, an arithmetic module, a comparator module, and a controller module.

Figure 4.1: MQ Decoder Block Diagram from [9]

Figure 4.2: MQ Decoder Block Diagram

4.1 Design Algorithm

The algorithm used in this thesis is shown in Figures 4.3, 4.4, 4.5, 4.6, and 4.7. Compared to the standard algorithm, the overall algorithm is the same but the steps are rearranged and optimized for the HDL implementation. The top level algorithm, in Figure 4.3, is exactly the same as the standard top level algorithm shown in Figure 3.4. In the Decode function, the algorithm is modified to remove dependencies between previous assignments and comparisons. For example, the initial operation in the standard Decode function, A = A - p, is removed and the subsequent logical and arithmetic operations that use the A register take this into account. Nested IF statements are also removed from the original algorithm to be more efficient for an HDL implementation. The calculation of the output register is an example where nested IF statements are replaced with an XOR gate. The calculations of the A and C registers in the Decode function do not rely on the previous calculation of the output register, so those operations can be performed in parallel.

As shown in the Decode function and the Set LUTs function, the calculation of the LUT values does rely on the previous assignment of the A register. But since the A

register is the only assignment that must be performed before the LUT values can be determined, the HDL implementation exploits the ability to foresee the next state's values to set the values of the LUTs in parallel with the assignments of the A, C, and output registers. These calculations are performed in the decode state of the controller as shown in Figure 4.2.

Figure 4.3: The MQ Decoder Algorithm of the Proposed Design

Figure 4.4: The MQ Decode Function of the Proposed Design

Figure 4.5: The MQ Set LUTs Function of the Proposed Design

The Renorm function in the proposed algorithm is very similar to the standard RenormD function with a slight difference in the operations performed after the byte

loading. In the standard's RenormD function, the byte is loaded and then the A and C registers are shifted, but in the current design's Renorm function, the A and C registers are shifted while the byte is loaded.

Figure 4.6: The MQ Renorm Function of the Proposed Design

The LoadByte function modifies the standard's Byte In function by shifting the byte while loading it into the C register.

Figure 4.7: The MQ Load Byte Function of the Proposed Design

4.2 Lookup Tables

The lookup tables include both the context state table (CST) and the probability state table (PST). The CST uses the context received from the decoding passes as the table index. The PST uses the CST's values to determine which table value to access. Both table implementations use registers to store the values, but the CST requires the ability to read and write values, similar to a RAM module, while the PST only requires read access to the table information, similar to a ROM module.

The LUTs are implemented asynchronously when information is read from the tables to allow the context to be loaded and the data extracted from the tables in one clock cycle. This helps lower the number of clock cycles required to decode a bit but also requires that the context be synchronized to lower the latency caused by asynchronous logic. When writing to the CST, the only table that requires the ability to modify the data, the internal registers are synchronized to the clock when assigning the input register. Synchronously assigning the CST's values causes the table to be more deterministic and removes the possibility of erroneous values being assigned to the table entries.

4.3 Arithmetic and Comparator Modules

In the proposed design, the arithmetic and comparison operations, shown in Table 4.1, are performed during each clock cycle. This design technique permits the state machine and controller to look ahead and use the next value of the internal registers to remove unnecessary states. Performing every operation during each clock cycle increases the amount of routing logic required to implement the design but has the capability to lower the overall logic depending on the reuse of each operation in the design. The technique is also beneficial because it causes the design to be more deterministic. Since the same calculations are performed for each clock, the system produces the same internal values and the controller acts as a multiplexer selecting the correct value to be used during the current state.

Table 4.1: Operations Performed by the MQ Decoder

Operation   Arithmetic                 Logical
1           A - p                      T == 0xFF
2           A << 1                     Byte > 0x8F
3           (C + 255) << 1             t == 0
4           (C + 255) << 7             A < p + p
5           C[8:23] - p                C[8:23] < p
6           (C + Byte) << 7            A[15] == 0
7           (C << 7) + (Byte << 8)     Output == MPS
8           C + (Byte << 1)            Toggle_MPS == 1
9           C + (Byte << 2)
10          C << 1
11          t - 1
12          MPS ⊕ 1
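The look ahead idea can be illustrated by comparing a nested IF form of the decode step with a form that never waits for A = A - p. The sketch below is derived from the standard decode procedure rather than transcribed from Figure 4.4. The function names are hypothetical, `c_active` stands for C[8:23], and renormalization and table updates are left out.

```python
def decode_standard(a, c_active, p, mps):
    """Nested IF form: subtract p from A first, then compare."""
    a -= p
    if c_active < p:
        bit = mps if a < p else 1 - mps    # conditional exchange in the lower interval
        a = p
    else:
        c_active -= p
        bit = 1 - mps if a < p else mps    # conditional exchange in the upper interval
    return bit, a, c_active


def decode_lookahead(a, c_active, p, mps):
    """Look ahead form: every comparison uses the register values before the update."""
    exchange = a < p + p                   # same condition as (A - p) < p
    lower = c_active < p
    bit = mps ^ (exchange ^ lower)         # one XOR chain replaces the nested IFs
    a_next = p if lower else a - p
    c_next = c_active if lower else c_active - p
    return bit, a_next, c_next
```

Because both comparisons are taken on the values available at the start of the cycle, the output bit, the next A, and the next C can all be computed side by side and then selected, which is the multiplexer style behavior described above.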

An example to demonstrate the change in both components is shown in Figure 4.8. In this example, the LUT index assignment is dependent on the value of the A register, which is assigned in the decode state. So to assign the LUT index register, the design had to wait until after the decode state to be able to use the A register in a logical operation. But in the proposed design, the logical operation C_active < p along with the next value of the A register is used to determine the value for the LUT index register in the same state. The ability to foresee the next value has been used to remove states in the overall system and ultimately lowers the number of clock cycles required to decode a codeblock.

Figure 4.8: MQ Controller Assignment Optimization

4.4 State Machine

The state machine module makes use of the current state of the module and the output of the comparator module to determine the next state. The state machine from [9], shown in Figure 4.9, consists of five states where the InitBuf and Renorme states initialize the decoder each time a codeblock is loaded and the decoding passes are ready. The initialization process populates the C register with the first two compressed bytes of the codeblock. After the current codeblock is initialized, the WaitCX state receives the context and loads the values from the lookup tables. Once the lookup table values are loaded, the Compute state calculates the value for the internal registers, the new table values, and the output bit. Then the Decide state uses multiplexers to assign the calculated values to the correct memory locations. The state machine reaches the Renorme state when the next value of the A register is less than 0x8000.

Figure 4.9: MQ Decoder State Machine from [9]

Figure 4.10: Number of Shifts Performed During Renormalization
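The quantity plotted in Figure 4.10 is the number of left shifts needed to bring the A register back to at least 0x8000. A minimal sketch of that count is given below. The function name is hypothetical and the register is modeled as a plain positive integer.

```python
def renorm_shifts(a):
    """Count the left shifts needed until the A register is at least 0x8000."""
    if a <= 0:
        raise ValueError("the A register must be positive")
    shifts = 0
    while a < 0x8000:
        a <<= 1
        shifts += 1
    return shifts
```

When A is set to a large probability estimate such as 0x5601, a single shift is enough, while the smallest estimate, 0x0001, needs fifteen shifts.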

The design from [9] utilizes the same technique proposed in [17] to lower the number of clock cycles for the renormalization process. While focusing on lowering the clock cycles, this implementation consumes a large amount of resources on the FPGA by storing the number of shifts required for each renormalization in a ROM. The histogram given in Figure 4.10 displays the number of shifts required to normalize the A register after decoding each bit for the Pentagon, Fishing Boat, Peppers, Lena, House, Chemical Plant, and Mandrill images from [3]. Figure 4.10 demonstrates that the JPEG2000 probability table produces the largest coding interval for most bits. So for the majority of bits decoded on non-random images, the A register is required to shift only 1 or 0 times. Based on this data, the proposed design implements the MQ decoder using the advantages of JPEG2000 to minimize the total number of clock cycles and logic required to maximize the overall throughput. Figure 4.11: Proposed MQ Decoder State Machine The state machine for the proposed design is shown in Figure 4.11. Comparing the proposed state machine to the state machine from [9] demonstrates major differences 48

that cause the proposed design to produce a higher throughput. These differences include the addition of the Initialize state, the removal of the Compute state, and the addition of a feedback loop to remain in the Renorme state until the A register is normalized. The addition of the Initialize state does not increase the number of clock cycles required to initialize a codeblock because both designs require two cycles; the state machine in [9] uses the Renorme state to perform the second initialization cycle. The extra initialization state only minimally increases the amount of logic required by the proposed decoder. The removal of the Compute state saves one clock cycle for each decoded bit in the codeblock. The addition of the feedback path from the Renorme state to the previous state minimally increases the number of clock cycles required per codeblock while removing much of the logic required to store the number of shifts in a ROM and to perform the variable left shifts in one clock cycle.

4.5 Controller

The controller module also uses the current state and the output of the comparator module to determine which arithmetic operation is assigned to the internal registers (A, C, t, T and L). The operations performed in each state, together with the condition required for each operation, are listed in Table 4.2 and described below.

Table 4.2: Operations Performed by the MQ Controller

State       Operation                     Condition
InitBuf     A = 0x8000
            T = 0 and C = 0x00FF00        ReadyByte = 0
            T = Byte and C = Byte << 8    ReadyByte = 1
Initialize  t = 0                         ReadyByte = 1 and T = 0xFF
            t = 1                         ReadyByte = 0 or T ≠ 0xFF
            T = Byte                      ReadyByte = 1
            C = (C + 0xFF) << 7           ReadyByte = 0 or (T = 0xFF and Byte > 0x8F)
            C = (C + (Byte << 1)) << 7    ReadyByte = 1 and T = 0xFF and Byte ≤ 0x8F
            C = (C + Byte) << 7           ReadyByte = 1 and T ≠ 0xFF
WaitCX      None                          None
Decide      Output = 1 − MPS              A < p + p and C[8:23] < p
            Output = MPS                  A ≥ p + p or C[8:23] ≥ p
            A = p                         C[8:23] < p
            A = A − p                     C[8:23] ≥ p
            C[8:23] = C[8:23] − p         C[8:23] ≥ p
            Index_context = Index_NMPS    A_{i+1} < 0x8000 and Output_{i+1} = MPS
            Index_context = Index_NLPS    A_{i+1} < 0x8000 and Output_{i+1} = 1 − MPS
            MPS = 1 − MPS                 A_{i+1} < 0x8000 and Output_{i+1} = 1 − MPS and Toggle_MPS = 1
Renorme     A = A << 1
            t = t − 1                     t ≠ 0
            t = 6                         t = 0 and ReadyByte = 1 and T = 0xFF
            t = 7                         t = 0 and (ReadyByte = 0 or T ≠ 0xFF)
            T = Byte                      t = 0 and ReadyByte = 1
            C = C << 1                    t ≠ 0
            C = (C + 0xFF) << 1           t = 0 and (ReadyByte = 0 or (T = 0xFF and Byte > 0x8F))
            C = (C + (Byte << 1)) << 1    t = 0 and ReadyByte = 1 and T = 0xFF and Byte ≤ 0x8F
            C = (C + Byte) << 1           t = 0 and ReadyByte = 1 and T ≠ 0xFF

The InitBuf state loads the first byte into the middle eight bits of the C register and loads the initial values for the other internal registers. If the codestream is empty, then 0xFF is loaded instead of a byte from the codestream. The Initialize state loads the second byte into the C register. The second byte is placed into bits six through fourteen of the C register while shifting the first byte accordingly. If the codestream has only one byte, then 0xFF is loaded into the C register instead of a second byte. The t register is set to zero if the first byte is equal to 0xFF and the byte first in, first out (FIFO) buffer still has data. If either condition is false, then the t register is set to one. After the Initialize state, the WaitCX state holds the registers at their current values while waiting for the LUT values to be loaded.

The Decide state determines the decoded bit. The bit is equal to the most probable symbol (MPS) if exactly one of the following is true: the A register is less than two times the probability estimate, or the active region of the C register (bits 8 through 23) is less than the probability estimate. If both conditions are true or both are false, then the output bit is equal to the least probable symbol (LPS), or one minus the most probable symbol. The Decide state also updates the A and C registers and the context state table values. If the active region of the C register is less than the probability estimate, then the A register is set equal to the probability estimate and the C register stays the same. If the active region of the C register is greater than or equal to the probability estimate, then the probability estimate is subtracted from both the A register and the active region of the C register. The most probable symbol for this context is toggled only when all three of the following are true: a renormalization is needed, the decoded bit is the LPS, and the Toggle register from the probability state table is equal to one. The probability state table index is updated only if a renormalization is needed, that is, when the next value of the A register is less than 0x8000. A renormalization is not needed when the most significant bit of the next value of the A register is equal to one. The index is updated to the next most probable symbol index if the decoded bit is set to the most

probable symbol. If the decoded bit is the least probable symbol, then the index is set to the next least probable symbol index.

The Renorme state repeatedly shifts the A register and the C register left by one and subtracts one from the t register until either the C register needs a new byte (the t value equals zero) or the most significant bit of the A register equals one. If the nonactive part of the C register is empty, then the Renorme state loads the next byte, or 0xFF if the byte FIFO is empty, into the C register and shifts by two if the previous byte equals 0xFF or by one if the previous byte is any other value. The A register is still shifted by one, and the t value is set to seven or six depending on whether the C register is shifted by one or two.
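The Renorme behavior described above can be modeled in software one clock cycle at a time. The following Python sketch transcribes the Renorme rows of Table 4.2 as a reading aid; it is not the RTL of the proposed design. The names `Regs` and `renorme_step` are hypothetical, and the register widths (16 bits for A, 24 bits for C) are assumptions made for this model.

```python
from typing import NamedTuple

A_MASK = 0xFFFF    # assumption: A is modeled as 16 bits wide
C_MASK = 0xFFFFFF  # assumption: C is modeled as 24 bits wide (active bits 8..23)


class Regs(NamedTuple):
    A: int  # interval register
    C: int  # code register
    t: int  # shifts remaining before a new byte is needed
    T: int  # most recently loaded byte


def renorme_step(r: Regs, byte: int, ready_byte: bool) -> Regs:
    """Apply one clock cycle of the Renorme rows of Table 4.2."""
    a_next = (r.A << 1) & A_MASK           # A = A << 1 on every Renorme cycle
    if r.t != 0:
        # No byte needed yet: plain shift and countdown.
        return Regs(a_next, (r.C << 1) & C_MASK, r.t - 1, r.T)

    # t == 0: fold a new byte (or 0xFF filler) into C before shifting.
    prev_is_ff = r.T == 0xFF
    if not ready_byte or (prev_is_ff and byte > 0x8F):
        c_in = 0xFF                        # FIFO empty, or 0xFF followed by a byte > 0x8F
    elif prev_is_ff:
        c_in = byte << 1                   # extra shift after a 0xFF byte
    else:
        c_in = byte
    c_next = ((r.C + c_in) << 1) & C_MASK
    t_next = 6 if (ready_byte and prev_is_ff) else 7
    T_next = byte if ready_byte else r.T   # T = Byte only when a byte is available
    return Regs(a_next, c_next, t_next, T_next)
```

Calling `renorme_step` repeatedly until `A >= 0x8000` mirrors the feedback loop on the Renorme state: each call corresponds to one clock cycle in which A is shifted by exactly one bit.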

CHAPTER 5 Results

The results of the proposed design are given in three areas: verification, logic utilization, and achieved throughput. The design is verified and compared to the software implementation [1] on an Altera Stratix III SL150 FPGA, but the theoretical comparison to the design in [9] is done on a Xilinx Virtex 2 XC2V6000-6. The design uses no proprietary parts and is easily portable between the Altera and Xilinx architectures.

5.1 Verification

The verification process starts by encoding four variations of two different images, shown in Figure 5.1, using three encoder programs: Kakadu [2], JasPer [5], and a C++ program using the Intel Performance Primitives [1]. The images are Airport on the left and Pentagon on the right. Airport is 1024x1024 pixels and grayscale (8 bits/pixel) [3]. Pentagon is also 1024x1024 pixels and grayscale (8 bits/pixel) [3]. The images are replicated along both the horizontal and vertical axes to create larger images of 2048x2048 and 4096x4096 pixels. These new images are used along with the original images to make up the test images. An example of the 2048x2048 pixel variation is shown in Figure 5.2, where the Pentagon image is replicated. Figure 5.3 displays the Airport image replicated to create the 4096x4096 pixel picture.

Figure 5.1: Verification Images

The parameters used when encoding include no quantization, the 9/7 wavelet transform, five wavelet levels, 32x32 codeblocks, and only one tile for the entire image. The images are then decoded using a software program that uses the Intel Performance Primitives [1] to perform the JPEG2000 components excluding the MQ decoder, which is completed on a PCI Express board populated with four Stratix III devices. After the images are decoded, the output files are verified to be the same as the original image.

5.2 Theoretical Results

The theoretical results include the logic utilization synthesized on the Xilinx Virtex 2 XC2V6000-6 for comparison purposes and a throughput calculation using the clock speed reported by the tools and the number of clock cycles counted by implementing both designs. The logic utilization of the proposed design and the design in [9] is shown in Table 5.1.
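The text at this point does not spell out the throughput formula. One plausible reading, offered here as an assumption and not as the calculation actually used, is that throughput is the number of decoded bits divided by the decoding time, where the decoding time is the counted clock cycles divided by the clock frequency reported by the synthesis tools. The Python sketch below expresses that reading; the function and parameter names are hypothetical.

```python
def throughput_bits_per_second(clock_hz: float, decoded_bits: int, clock_cycles: int) -> float:
    """Decoded bits per second, assuming throughput = bits / (cycles / f_clk)."""
    if clock_cycles <= 0:
        raise ValueError("clock_cycles must be positive")
    return decoded_bits * clock_hz / clock_cycles
```

Under this reading, removing one clock cycle per decoded bit raises throughput even if the achievable clock frequency stays the same, because the same number of bits is produced in fewer cycles.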

Figure 5.2: 2048x2048 Pixel Pentagon Variation

Figure 5.3: 4096x4096 Pixel Airport Variation