Energy-Efficient FPGA-Based Parallel Quasi-Stochastic Computing


Article

Ramu Seva, Prashanthi Metku * and Minsu Choi

Department of Computer Engineering, Missouri University of Science & Technology, 141 Emerson Electric Co. Hall, 301 W. 16th St., Rolla, MO 65409-0040, USA; rs2k6@mst.edu (R.S.); choim@mst.edu (M.C.)
* Correspondence: pmcc@mst.edu; Tel.: +1-573-202-553

Received: 6 September 2017; Accepted: 3 November 2017; Published: 7 November 2017

J. Low Power Electron. Appl. 2017, 7, 29; doi:10.3390/jlpea7040029; www.mdpi.com/journal/jlpea

Abstract: The high performance of FPGA (Field Programmable Gate Array) in image processing applications is justified by its flexible reconfigurability, its inherent parallel nature and the availability of a large amount of internal memories. Lately, the Stochastic Computing (SC) paradigm has been found to be significantly advantageous in certain application domains including image processing because of its lower hardware complexity and power consumption. However, its viability is deemed to be limited due to its serial bitstream processing and excessive run-time requirement for convergence. To address these issues, a novel approach is proposed in this work where an energy-efficient implementation of SC is accomplished by introducing fast-converging Quasi-Stochastic Number Generators (QSNGs) and parallel stochastic bitstream processing, which are well suited to leverage the FPGA's reconfigurability and abundant internal memory resources. The proposed approach has been tested on the Virtex-4 FPGA, and results have been compared with the serial and parallel implementations of conventional stochastic computation using the well-known SC edge detection and multiplication circuits. Results show that by using this approach, execution time, as well as the power consumption, are decreased by a factor of 3.5 and 4.5 for the edge detection circuit and multiplication circuit, respectively.

Keywords: stochastic computing; FPGA; edge detection; quasi-stochastic number generator; reconfigurability

1. Introduction

Stochastic Computing (SC) is an alternative computing style, which has recently proven to be advantageous in image processing applications, because of its potential area and power benefits compared to binary implementations. The performance benefits of the parallel implementation of a stochastic circuit using FPGAs for an image processing application have not been analyzed in the prior literature. Taking advantage of the parallel implementation of stochastic circuits is possible by using the distributed memory elements of FPGAs. In this paper, new Stochastic Number Generators (SNGs) are designed to utilize quasi-random numbers, making use of the distributed memory elements. While it is possible to use Linear Feedback Shift Registers (LFSRs) as random number generators in SNGs, making use of Low-Discrepancy Sequences (LDS) or quasi-random numbers is advantageous, because they do not suffer from random fluctuations and converge faster [1]. Though the design alternative selected here is certainly not new [2], this paper mainly focuses on the possibility of parallel stochastic computation for image processing applications using FPGAs. The main contributions of this paper are as follows:

1. Reduction of the random fluctuation errors present in the traditional pseudo-random numbers by adopting a new way of constructing SNGs by using the Look-Up Table (LUT)-based approach and Low-Discrepancy (LD) sub-random sequences.

2. Reduction in the power consumed as compared to a conventional SC and a successful parallel implementation to decrease the execution time.

2. Background

SC has its roots in the 1960s, and it is used for probability representation using digital bit streams [3,4]. SC has been successfully applied to many applications like image processing, neural networks, LDPC codes, factor graphs, fault-tree analysis and filters [5–10]. However, the extensive use of stochastic computation is still limited, because of its long run-time and inaccuracy. Recent improvements have mainly focused on improving the accuracy and performance of the stochastic circuits by sharing consecutive bit streams, sharing the stochastic number generators, using true random generators, exploiting the correlation and using the spectral transform approach for stochastic circuit synthesis [11–17]. This paper also explores new methods to improve the accuracy and performance of stochastic circuits.

Figure 1 shows the basic SC circuits. The function implemented by these circuits varies with the number interpretation, i.e., unipolar, bipolar or inverted bipolar (UP, BP, IBP). The unipolar format is used to represent a real number x in the range [0, 1], and the bipolar format is used to represent a real number x in the range [−1, 1]. IBP is the inverted bipolar format, which is an inversion of BP over the same range [−1, 1], where the Boolean values zero and one in the Stochastic Number (SN) represent one and −1, respectively, rather than −1 and one as in the BP format. One can refer to [3] for more details on the different SN formats.

Figure 1. Basic circuits used in stochastic computation [8]. UP, unipolar; BP, bipolar.

In SC, a probability value is represented by a binary bit stream of zeros and ones with a specific length L. To represent a probability value of 0.5, half of the bits in the bit stream of length L are ones. For example, if 0.5 is to be represented by a bit stream of 10 bits, then a bit stream with five ones and five zeros is one way of representing it. The representation of a probability value in SC is not unique, and not all real numbers in the interval [0, 1] can be exactly represented for a

fixed value of L. Another important factor when representing a stochastic number is the dependency or correlation between the inputs [9]. This is an inherent property of stochastic circuits that limits their performance in certain applications when compared to conventional binary implementations [20]. Figure 2 shows two examples where inaccurate results are caused by correlated inputs in the multiplication circuit. This correlation comes from the SNGs, where the SNs generated by the SNGs happen to have the same sequences of ones and zeros or have some relation among them, as shown in Figure 2. This causes inaccuracy in the output generated, so SNGs are always chosen in such a way that they produce uncorrelated SNs. LFSRs are known to be best-suited for SC and have been used for number generation in many SC designs [2]. However, their main disadvantages are that the number of SNGs must be higher for uncorrelated inputs (i.e., for every independent input, the number of SNGs used increases by one) and that they need a longer time to operate for accurate and efficient SC [9].

Figure 2. (a) Correlation effect in an AND gate (multiplier circuit in Stochastic Computing (SC)) when both bit streams are the same: inputs x = y = 1/2 give the output 1/2; (b) correlation effect when both bit streams are the inverse of each other: inputs x = y = 1/2 give the output 0.

When the circuit size, power and computation time of SC are considered, the SNGs are the main contributors to the variation in these factors. The number of SNGs is proportional to the area of a stochastic circuit, contributing to about 80% of the circuit area. The power consumed by SC mainly depends on the number of clock cycles the circuit uses for computation, which in turn depends on the SNG properties. The computation time can be limited by the SNGs due to their inherent properties such as random number fluctuations. The computation time increases exponentially with a linear increase in accuracy, hence the need to address basic questions such as: What is the minimum number of clock cycles needed to run, so the probability value is represented correctly? What is the effect of random noise fluctuations in a sequence of stochastic operations? Answers to these questions may help in decreasing the computation time drastically. SC has another disadvantage compared to the binary implementation, as all the operations in SC are single staged; therefore, conventional techniques such as pipelining to improve the throughput cannot be applied [8].

This paper is organized as follows. Section 3 gives the background of the LD sequences and the LUT-based method implemented. Section 4 discusses the parallel implementation of SNGs and the different stages used in the parallel implementation of SNGs for parallel stochastic computing. Section 5 discusses the simulation results comparing the proposed SNG with the pseudo-random number generators (LFSRs). An analysis of the convergence rate of the proposed SNG with that of the LFSR is given. A discussion of the application of the proposed serial and parallel SNG in edge detection and the multiplication circuit and the specification of the advantages over the LFSR implementation is given. Finally, Section 6 provides the conclusion.
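The correlation effect of Figure 2 can be reproduced in a few lines of Python. This is only a minimal sketch: the 8-bit streams below are hypothetical examples chosen for illustration (they are not the streams drawn in the figure), and the helper names `to_value` and `sc_and` are assumptions of this sketch.

```python
def to_value(bits: str) -> float:
    """Unipolar value of a stochastic number: the fraction of ones."""
    return bits.count("1") / len(bits)


def sc_and(x: str, y: str) -> str:
    """Bit-wise AND of two equal-length streams (the unipolar SC multiplier)."""
    return "".join("1" if a == "1" and b == "1" else "0" for a, b in zip(x, y))


# Hypothetical 8-bit streams; each one encodes the value 1/2.
x = "10101010"
y_same = "10101010"          # identical to x (fully correlated)
y_inverse = "01010101"       # bit-wise inverse of x
y_uncorrelated = "11001100"  # no simple relation to x

print(to_value(sc_and(x, y_same)))          # 0.5 instead of 0.25
print(to_value(sc_and(x, y_inverse)))       # 0.0 instead of 0.25
print(to_value(sc_and(x, y_uncorrelated)))  # 0.25, the expected product
```

Only the uncorrelated pair yields the product 1/2 × 1/2 = 1/4, which is why each independent input normally needs its own number source.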

3. Proposed LUT-Based Method for LD Sequence Generation

In the proposed approach, SNGs are designed to leverage LUTs, which are the distributed memory elements of the FPGAs. The primary focuses of this paper are on decreasing the power consumed, improving the accuracy, reducing the number of random fluctuations and reducing the execution time by parallel implementation using the proposed LUT-based Quasi-SNG (QSNG). FPGAs are the target hardware in this design because of their abundant LUTs and their inherent parallel nature. Among various applications, LUTs are used in digital signal processing algorithms, where multiplication is done with a fixed set of coefficients that are pre-computed and stored in the LUT, so they can be used without computing them each time [22]. The same concept is used in this paper, where pre-computed fixed direction vectors are multiplied with a binary number to get the desired sequence. The LUT-based method is used to develop stochastic bit numbers by using Quasi-Monte Carlo (QMC) methods [23]. The LD sequences in the literature are used to develop these stochastic numbers. The main advantage gained over the use of LFSRs is that LD sequences do not suffer from random fluctuations, as the zeros and ones are uniformly spaced [2]. This is unlike LFSRs, where the zeros and ones are non-uniformly spaced. The idea behind the LD sequences is to let the fraction of the points within any subset of [0, 1) be as close as possible to the size of that subset, such that the low-discrepancy points spread over [0, 1) as uniformly as possible, reducing gaps and clustering of points. Figure 3 shows the comparison of pseudo-random points (LFSR implementation) and LD points (Sobol sequence) in the unit square. The LD points show even and uniform coverage of the area of interest and are expected to converge faster when applied to SC.

Figure 3. Distribution of pseudo-random points (top) and Low-Discrepancy (LD) points (bottom) in the unit square [24].

The widely-used sequences that fall under the LD sequence category are the Halton sequence, Sobol sequence, Faure sequence and Niederreiter sequence [23,25,26]. Generating these sequences is usually software based because the hardware implementation of all these sequences is not suited for SC due to their complexity in construction [2]. This disadvantage of LD sequences is mitigated by the proposed LUT-based approach. The main difference in generating the LD sequences lies in the construction of their direction vectors [23]. Each sequence has a specific type of algorithm to compute these, and the uniformity of the sequence depends on the way these direction vectors are computed. In this paper, LUT-based SNGs were designed using three LD sequences: Halton, Sobol and Niederreiter. The digital method was chosen to design these sequences, restricting the base value to binary base two. For a detailed explanation of the sequences mentioned above, refer to [23].

The general structure used for generation of the LD sequence using binary base two is shown in Figure 4. It contains RAM to store the direction vectors, a multiplication circuit and bit-wise XOR gates. In the multiplication circuit, every bit from the counter output is multiplied by each n-bit direction

vector, stored in the RAM, to generate n-bit intermediate direction vectors. These n-bit intermediate direction vectors are then bit-wise XORed (i.e., modulo-two addition) to generate an n-bit LD sequence number.

Figure 4. Basic block diagram of the proposed Quasi-Stochastic Number Generator (QSNG): LUTs containing the pre-computed direction vectors $V_1, V_2, \ldots, V_n$; an n-bit binary counter with outputs $X_1, X_2, \ldots, X_n$; bit-wise XOR producing the n-bit LD sequence number L.

This can be expressed by using a mathematical expression as shown in the equation below [23]:

$$L_N = x_1(N-1)\,V_1 \oplus x_2(N-1)\,V_2 \oplus \cdots \oplus x_n(N-1)\,V_n, \qquad (1)$$

where $\oplus$ denotes binary addition or the XOR operation, $x_n(N-1) \cdots x_2(N-1)\,x_1(N-1)$ is the binary representation of $(N-1)$, $V_1, V_2, \ldots, V_n$ represent the direction vectors and $L_N$ represents the N-th number in the respective LD sequence; for example, N = 8 gives the eighth number in a Sobol sequence, which is computed by using n direction vectors and an n-bit counter, when Sobol sequence direction vectors are used [23]. Sobol and Niederreiter sequences belong to the general class of digital sequences, and their LD sequence generation can be expressed by the above digital method [25,27]. The Halton sequences belong to the simplest form of LD sequences, and their construction does not have a general form as in Equation (1). In Equation (1), $V_1, V_2, \ldots, V_n$ are called the direction vectors and are defined as the constant values that have to be multiplied with the counter output to generate the desired LD sequence, as shown in Figure 4. These values do not change throughout the operation of the circuit; hence, for the generation of Halton LD sequences, defining direction vectors that fit into the above equation of the general digital method is necessary to generalize the hardware structure for the LD sequences mentioned in the paper. Halton sequences are defined as the generalized form of van der Corput sequences, which use a distinct prime base for every dimension. The k-th Halton point H(k) is defined as [26]

$$H(k) = \sum_{i \ge 0} a_i(k-1)\, b^{-i-1}.$$

Upon closer inspection of the summation, $a_i(k-1)$ is nothing but the i-th digit of the base-b representation of $k-1$, and $b^{-i-1}$ is the base-b term, which has to be multiplied with $a_i(k-1)$ for the generation of each sequence number depending on the value of k. The term $b^{-i-1}$ is a constant term, and its value does not change with the value of k; hence, these terms are defined as direction vectors and fit into the general form represented above to generate an LD sequence by choosing the base b = 2. Sobol and Niederreiter sequences have specific algorithms to calculate the direction vectors that fit into Equation (1) to generate the LD sequence. In this paper, the algorithms reported in [25] and [23] are used to pre-calculate the direction vectors. An important point to note in this implementation is that the number of sequence numbers generated is limited by using only R base-b direction vectors of R bits, which are capable of representing a value of $b^R - 1$ in base b [23]. For example, to generate a stochastic bit length of 256, the generation of only the initial 256 LD sequence numbers is required. For this process, eight-bit length direction vectors, which are capable of generating an eight-bit length LD sequence number every clock cycle, are needed. The maximum value they

can represent is $b^8 - 1 = 255$, limiting the size of the counter. For the above 256 initial sequence numbers, an eight-bit counter is needed to count from zero to 255. After generation, the LD sequence numbers are sent to the comparator, where they are compared with the input value to generate an equivalent stochastic number. The size of the proposed SNG depends on the stochastic bit length L of the circuit, as well as the number of inputs to the stochastic circuit. For a stochastic bit length of 256, it is necessary to use an eight-bit binary counter and a memory space of 64 bits to store eight direction vectors, each of an eight-bit length. Independent stochastic inputs require different direction vectors; as the number of independent stochastic inputs increases, the memory space required to store these direction vectors increases. LUT-based SNGs were implemented for 256-, 512-, 1024- and 2048-bit lengths on the Xilinx Virtex 4 SF FPGA (XC4VLX15) device and synthesized using the Xilinx ISE tool. In this paper, a general form of implementation was presented, and further optimization of the circuit has been left for a future study. Table 2 shows that the LD sequence generators make use of more hardware when compared to the LFSRs, but the convergence and the accuracy obtained from LD sequences are superior enough to justify this extra hardware utilization (explained in the following sections).

4. Parallel Implementation of Proposed SNGs for SC

The proposed parallel implementation of the SNGs was designed to generate LD sequence numbers in parallel. These LD sequence numbers generated in parallel were used to generate stochastic bits in parallel. These stochastic bits, generated in parallel, are termed Stochastic Bit Vectors (SBVs), and the parallel processing used to generate the sequence is termed Stochastic Bit Matrix (SBM) processing. Consider a 256-bit length stochastic bit matrix: this design generates p bits of the SBM every clock cycle, instead of generating one bit. This is shown in vector form in Figure 5, which shows that when one stochastic bit is generated per cycle using a single SBM Processing Unit (SBMPU), 256 clock cycles are needed to generate a 256-bit length SBM. By duplicating p SBMPUs in parallel, it is possible to generate p stochastic bits of the SBM in just one clock cycle. Hence, 256/p clock cycles are needed for generating a 256-bit length SBM, thus reducing the execution time of the operation as p increases.

Figure 5. Parallel stochastic bit matrix processing: a single SBMPU produces a 1 × 256 bit stream, whereas p parallel SBMPUs produce a p × 256/p bit matrix. SBMPU, Stochastic Bit Matrix Processing Unit.
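As a rough illustration of the digital method of Equation (1) followed by the comparator, the sketch below generates a 256-bit stochastic number serially. It assumes the simplest direction vectors, those of the base-2 van der Corput (Halton, b = 2) sequence, and assumes that the comparator emits a one when the LD number is smaller than the input; the Sobol and Niederreiter vectors used in the paper are not reproduced here. The names `ld_number` and `qsng_stream` are hypothetical.

```python
R = 8  # counter width and direction-vector width for a 256-bit stream

# Assumed direction vectors (base-2 van der Corput): the vector paired with
# counter bit i (least significant bit first) is 2**(R-1-i).
# Eight vectors of eight bits each occupy 64 bits of LUT memory.
V = [1 << (R - 1 - i) for i in range(R)]


def ld_number(count: int) -> int:
    """Equation (1): XOR the direction vectors selected by the counter bits."""
    out = 0
    for i in range(R):
        if (count >> i) & 1:
            out ^= V[i]
    return out


def qsng_stream(b: int, length: int = 1 << R):
    """Comparator stage: emit 1 whenever the LD number is below the input b."""
    return [1 if ld_number(c) < b else 0 for c in range(length)]


print([ld_number(c) for c in range(8)])  # [0, 128, 64, 192, 32, 160, 96, 224]
print(sum(qsng_stream(64)))              # 64 ones in 256 bits, i.e., 64/256
print(sum(qsng_stream(128)[:8]))         # 4 ones in the first 8 bits
```

Because the 256 LD numbers visit every eight-bit value exactly once, the full stream for an input b contains exactly b ones, and short prefixes are already well balanced.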

The structure of the parallel implementation of the circuit is shown in Figure 6. The parallel implementation of the proposed SNGs is done in three stages. The first stage is where the LD sequence numbers are generated in parallel. The second stage is where the stochastic bit streams are generated in parallel using comparators and sent to the stochastic circuit for SC. Finally, the third stage is where the stochastic output is converted back to a binary output by counting the number of ones. A combinational circuit is implemented for the conversion of a stochastic number to a binary number by counting the number of ones in the SN, making use of the Hamming weight counter principle [28].

Figure 6. Three stages of parallel implementation: Stage 1, LD sequence generation; Stage 2, LD sequence to stochastic conversion and stochastic operation; Stage 3, stochastic to binary conversion.

4.1. First Stage

The first p LD sequence numbers are generated in parallel depending on the degree of parallelism. The general structure of the implementation is shown in Figure 7. Here, the entire structure is not duplicated; only the part of the SNG that generates the LD sequence number is duplicated to reduce the area overhead. The degree of parallelism determines the amount of hardware utilized. Counters, which follow a specific sequence of counting, are used to implement the SNGs in parallel. For example, to generate the first eight sub-sequences in parallel for a 256-bit stream length, use eight eight-bit counters, which count by eight. The first counter follows the sequence 0, 8, 16, 24, ...; the second counter follows the sequence 1, 9, 17, 25, ...; and, in the same way, the eighth counter follows the sequence 7, 15, 23, 31, .... Therefore, in the first clock cycle, the eight counters hold the values from zero to seven, which means that the first eight LD sequence numbers are generated in parallel. In the second clock cycle, each counter advances by one step so that together they hold the values eight to 15, and the next eight LD sequence numbers are generated. These generated sequence numbers are then sent to the parallel comparator units, where they are compared with the input probability value to generate the stochastic bits in parallel. This implementation generates a sequence for a single input in parallel. For multiple inputs, different direction vectors can be used, while the circuit for the generation of the LD sequence is the same.
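The counting scheme of the first stage can be sketched as follows. This is a behavioural model only, not the hardware description used in the paper; the generator name `parallel_counts` and its arguments are assumptions of this sketch.

```python
def parallel_counts(p: int, length: int):
    """Yield, for every clock cycle, the p counter values held in parallel."""
    counters = list(range(p))                  # first cycle: 0 .. p-1
    for _ in range(length // p):
        yield list(counters)
        counters = [c + p for c in counters]   # every counter counts by p


cycles = list(parallel_counts(8, 256))
print(len(cycles))                   # 32 clock cycles, i.e., 256/8
print(cycles[0])                     # [0, 1, 2, 3, 4, 5, 6, 7]
print(cycles[1])                     # [8, 9, 10, 11, 12, 13, 14, 15]
print([c[0] for c in cycles[:4]])    # first counter: [0, 8, 16, 24]
print([c[7] for c in cycles[:4]])    # eighth counter: [7, 15, 23, 31]
```

Taken together, the p counters visit every value from 0 to 255 exactly once, so the parallel design produces the same LD sequence numbers as the serial one in 256/p cycles.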

Figure 7. First stage of LD sequence generation (LUTs containing the pre-computed direction vectors $V_1, V_2, \ldots, V_n$; n-bit binary counter; bit-wise XOR; outputs $L_1, L_2, \ldots, L_p$).

4.2. Second Stage

The second stage consists of the generation of the stochastic bit stream and the stochastic operation. For the generation of the stochastic bit stream, the LD sequence numbers generated in parallel are sent to the comparators, which are also in parallel, such that multiple sequence numbers are compared simultaneously to generate a stochastic bit stream in parallel. For example, to generate eight bits of the stochastic bit stream, use eight comparators, where each sequence number is compared with the binary probability value to generate the first eight bits of the SN at the same time. This is termed eight-bit SBV generation using eight-SBM processing in one clock cycle by replicating the SBM circuit eight times. Similarly, to generate a 16-bit SBV, 16-SBM processing is done in one clock cycle by replicating the SBM circuit 16 times. The SBM processing circuit involves only that part of the LD sequence generator capable of generating the sequence (i.e., the multiplication and the bit-wise XOR structure) and an LD sequence to stochastic conversion unit (i.e., comparator); the LUTs used are shared among the parallel SBM processing units, as they hold constant values that do not change during the execution cycle. The stochastic bits generated by SBM processing are then sent for computation and then to the stochastic to binary conversion stage for the final output. See Figure 8 for the parallel structure of the stochastic bit stream generators. The number of comparators used depends on the degree of parallelism implemented. Hence, the degree of parallelism determines the hardware utilized.

Figure 8. Second stage: LD to stochastic bit conversion. Each of the p comparators takes an LD sequence number ($L_1, L_2, \ldots, L_p$) and the binary number $b_0 b_1 b_2 b_3 \ldots b_n$ as inputs A and B and produces one stochastic bit ($X_1, X_2, \ldots, X_p$).

4.3. Third Stage

The final stage in a stochastic computation is to create the binary output, which is generated by using stochastic-to-binary (STB) conversion units, which are comprised of a counter that counts the number of ones in the stochastic bit stream. If the output stochastic bits generated are eight bits per clock cycle, it is necessary to count the number of ones in these eight bits within one clock cycle; this is not possible by using a single counter circuit with the same clock period. In this paper, an STB conversion unit is used that converts the parallel stochastic output into a binary number by using simple adder circuits. This circuit can count the number of ones in a parallel bit stream by using the Hamming weight counter principle [28]. The structure of the STB conversion unit for counting the number of ones in eight parallel stochastic bits of a 256-bit stream length consists of four half adders, two two-bit adders, a three-bit adder, an eight-bit register and a four-bit adder. The eight-bit register is used to store the previous count value, and it is updated every clock cycle with the new value (i.e., the number of ones in the stochastic bit stream). To count the number of ones in 16 stochastic bits of a 256-bit stream length, eight half adders, four two-bit adders, two three-bit adders, an eight-bit register and an eight-bit adder are required. Therefore, the size of the STB conversion unit increases with the number of parallel bits generated. The scalability issue of the STB conversion unit may not be a major concern, as the proposed approach mainly targets image processing applications where the word length for many operations is less than 16 bits. The next section presents the simulation results of both the parallel and the serial implementation of the LUT-based LD sequence SNGs.

5. Simulation Results

Simulation results are comprised of both serial and parallel implementations of the LUT-based LD sequence SNGs. All the circuits have been implemented on a Xilinx Virtex 4 SF FPGA (XC4VLX15) device and synthesized using the Xilinx ISE 2 design suite, with the minimum area and power reduction as optimization goals. The accuracy of the sequence generator proposed in this work is compared with that of the pseudo-random number generator (LFSR implementation). The results demonstrate that the proposed SNGs have better convergence when compared with LFSRs.
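The stochastic-to-binary conversion of the third stage (Section 4.3) can be modelled in software as a pairwise adder tree followed by an accumulating register. The sketch below is a behavioural illustration under that assumption; it does not model the exact adder widths listed in the text, and the names `hamming_weight_tree` and `stb_convert` are hypothetical.

```python
def hamming_weight_tree(bits):
    """Count the ones in one parallel bit vector by pairwise addition,
    level by level, in the manner of an adder tree."""
    sums = list(bits)
    if not sums:
        return 0
    while len(sums) > 1:
        if len(sums) % 2:        # pad an odd level with a zero
            sums.append(0)
        sums = [sums[i] + sums[i + 1] for i in range(0, len(sums), 2)]
    return sums[0]


def stb_convert(sbvs):
    """Accumulate the per-cycle weights, as the register does each clock."""
    register = 0
    for sbv in sbvs:
        register += hamming_weight_tree(sbv)
    return register


one_cycle = [1, 0, 1, 1, 0, 0, 1, 0]
print(hamming_weight_tree(one_cycle))               # 4
print(stb_convert([[1, 0, 1, 0, 1, 0, 1, 0]] * 32)) # 128 ones in 256 bits
```

For eight parallel bits, the first level of the tree corresponds to the four half adders, and each further level halves the number of partial sums.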

J. Low Power Electron. Appl. 2017, 7, 29

The LUT-based SNGs are used in image processing and arithmetic applications of SC (i.e., edge detection and multiplication) to compare the results with the LFSR-based SNGs.

5.1. Edge Detection

The proposed SNGs were tested with a stochastic edge detection circuit for eight-bit grayscale images, i.e., each pixel was represented using a stochastic bit-length of 256 bits. In this work, an edge detection circuit has been implemented using the stochastic circuit described in [5], as shown in Figure 1g. This circuit is based on the Roberts cross algorithm. The test images selected for the edge detection algorithm are shown in Figure 9. The pixel values of the images were extracted using MATLAB and were given as the eight-bit binary input to the proposed SNGs. The outputs from the proposed SNGs were given to the stochastic edge detection circuit, and the outputs extracted from the post-synthesis simulation results were processed in MATLAB. To evaluate the convergence rate and the quality of the image generated by the proposed SNGs, the MAE (Mean Absolute Error) and PSNR (Peak Signal-to-Noise Ratio) were calculated for the output edge-detected image every clock cycle, and the results were compared to the output generated by using pseudo-random number generators (LFSRs).

Figure 9. Test images for edge detection: (a) Camera-man; (b) Pepper; (c) Baboon.

PSNR is commonly adopted in the image processing field to quantify the acceptability of erroneous or noisy images [29]. The PSNR value (usually in the unit of dB) can indicate the similarity of two different images. Here, the edge detection image generated by running for 256 clock cycles and the edge detection image generated by running for q clock cycles are used to compute the corresponding PSNR value. The q value is defined as the number of clock cycles needed to output a satisfactory result, i.e., one with an absolute error of less than 0.01 between the actual and the predicted value. The PSNR value can be calculated by the equation below [29]:

$$\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}, \qquad (2)$$

where

$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i, j) - K(i, j) \right]^2$$

is the mean square error between the error-free and the erroneous image, $\mathrm{MAX}_I$ is the maximum image pixel value (e.g., 255 in the eight-bit grayscale image), m and n represent the width and height of the target image in terms of the number of pixels, and I(i, j) and K(i, j) represent the pixel values of the error-free image and the erroneous/noisy image, respectively. From Equation (2), when the erroneous image is more similar to the original one, a smaller MSE and a higher PSNR value will be obtained.
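As a rough illustration of Equation (2), the following Python sketch computes the MSE and the PSNR for two small images stored as nested lists. The function names `mse` and `psnr` and the nested-list image format are assumptions made for this example only; the original evaluation was carried out in MATLAB.

```python
import math


def mse(reference, test):
    """Mean square error between two equally sized 2-D images (lists of rows)."""
    rows = len(reference)
    cols = len(reference[0])
    total = 0.0
    for row_ref, row_test in zip(reference, test):
        for pixel_ref, pixel_test in zip(row_ref, row_test):
            total += (pixel_ref - pixel_test) ** 2
    return total / (rows * cols)


def psnr(reference, test, max_value=255):
    """PSNR in dB as in Equation (2); infinite when the images are identical."""
    error = mse(reference, test)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


# Hypothetical 2x2 eight-bit images, used only to exercise the functions.
reference_image = [[0, 255], [255, 0]]
noisy_image = [[0, 250], [255, 5]]
print(mse(reference_image, noisy_image), psnr(reference_image, noisy_image))
```

A smaller MSE gives a larger PSNR, which matches the observation made after Equation (2).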

The MAE value is calculated every clock cycle to determine the q clock cycles needed to output a satisfactory result with an absolute error of less than 0.01. It is defined as the average of the absolute difference between the actual value and the predicted value, and can be calculated by using the equation below [29]:

$$\mathrm{MAE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left| I(i, j) - K(i, j) \right|, \qquad (3)$$

where m and n represent the width and height of the target image in terms of the number of pixels, I(i, j) represents the pixel values of the edge detection image generated by running for 256 clock cycles (actual value), and K(i, j) represents the pixel values of the edge detection image generated by running for r clock cycles (predicted value), where r varies from 1 to 255. When the MAE equals 0.01, r equals q.

Initial analysis was done on the open-source benchmark image Camera-man shown in Figure 9a. It was found that the edge detection circuit using the pseudo-random generator as an SNG took 128 clock cycles to reach an absolute error of less than 0.01, as compared to the 22 clock cycles required for a LUT-based QSNG. A similar analysis on the test images Pepper and Baboon gave the same results. Figures 10 to 12 show the edge detection results for the test images. To save space, Figure 12 does not show the results at different clock cycles. Table 1 shows the PSNR and MAE values for the test images at different clock cycles. From the table, it can be concluded that the proposed SNGs converge faster than the pseudo-random generators while giving an acceptable image quality [29].

Figure 10. Edge detection using Linear Feedback Shift Registers (LFSRs): (a) eight clock cycles; (b) 22 clock cycles; (c) 64 clock cycles; (d) 128 clock cycles; (e) 256 clock cycles.

Figure 11. Edge detection using LD sequence generators: (a) eight clock cycles; (b) 22 clock cycles; (c) 64 clock cycles; (d) 128 clock cycles; (e) 256 clock cycles.
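The search for q described around Equation (3) can be sketched as follows. This is a minimal Python illustration under stated assumptions: pixel values are taken to be normalized to [0, 1], `frames[r - 1]` is assumed to hold the image obtained after r clock cycles, and the names `mae` and `cycles_to_converge` are hypothetical helpers, not part of the original flow.

```python
def mae(reference, test):
    """Mean absolute error between two equally sized 2-D images (lists of rows)."""
    rows = len(reference)
    cols = len(reference[0])
    total = 0.0
    for row_ref, row_test in zip(reference, test):
        for pixel_ref, pixel_test in zip(row_ref, row_test):
            total += abs(pixel_ref - pixel_test)
    return total / (rows * cols)


def cycles_to_converge(frames, reference, threshold=0.01):
    """Return the first cycle count r whose image has an MAE below the threshold.

    frames[r - 1] is assumed to be the image produced after r clock cycles.
    Returns None if no frame meets the threshold.
    """
    for r, frame in enumerate(frames, start=1):
        if mae(reference, frame) < threshold:
            return r
    return None


# Hypothetical 1x2 images standing in for the 256-cycle reference and
# for the intermediate results after 1, 2 and 3 clock cycles.
reference_image = [[0.5, 0.25]]
intermediate_frames = [
    [[0.0, 0.0]],
    [[0.4, 0.2]],
    [[0.5, 0.24]],
]
print(cycles_to_converge(intermediate_frames, reference_image))
```

In the paper's setting, the reference would be the 256-cycle edge detection image and the frames would come from the post-synthesis simulation at each clock cycle.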

Figure 12. Edge detection results from LFSR and QSNG: (a) 128 clock cycles, LFSR; (b) 22 clock cycles, QSNG; (c) 128 clock cycles, LFSR; (d) 22 clock cycles, QSNG. Note: the 22 clock-cycle QSNG results are comparable to the 128 clock-cycle LFSR results.

Table 1. PSNR and MAE values for the test images.

Test image    Sequence                  Clock cycles   PSNR (dB)   MAE
Camera-man    LD sequence               8              2299        0375
Camera-man    LD sequence               22             3086        000
Camera-man    LD sequence               64             343         0005
Camera-man    LD sequence               128            420         0002
Camera-man    Pseudo-random sequence    8              968         0466
Camera-man    Pseudo-random sequence    22             223         064
Camera-man    Pseudo-random sequence    64             263         00394
Camera-man    Pseudo-random sequence    128            362         000
Baboon        LD sequence               8              2484        004
Baboon        LD sequence               22             296         00084
Baboon        LD sequence               64             3484        00046
Baboon        LD sequence               128            400         0002
Baboon        Pseudo-random sequence    8              2068        0383
Baboon        Pseudo-random sequence    22             236         0264
Baboon        Pseudo-random sequence    64             283         002
Baboon        Pseudo-random sequence    128            332         000
Pepper        LD sequence               8              267         0032
Pepper        LD sequence               22             306         0002
Pepper        LD sequence               64             3500        00046
Pepper        LD sequence               128            4222        00025
Pepper        Pseudo-random sequence    8              868         0766
Pepper        Pseudo-random sequence    22             2046        0464
Pepper        Pseudo-random sequence    64             2435        00466
Pepper        Pseudo-random sequence    128            3222        00079

5.2. Multiplication

The stochastic multiplication circuit shown in Figure 1a is implemented for eight-bit binary input numbers using both pseudo-random and LD sequence generators. The average number of clock cycles needed to generate a satisfactory result per input is calculated by applying a random set of 256 input values (x and y) and averaging the number of clock cycles needed to generate an output with an absolute error of less than 0.01. For the multiplication circuit, the absolute error is mathematically represented as $|P_z - \hat{P}_z|$, where $P_z$ is the actual output value and $\hat{P}_z$ is the predicted output value after stochastic computation. With an LFSR (pseudo-random number generator) as the SNG, the average number of clock cycles needed is 512, whereas the LUT-based QSNG needs only 54 clock cycles.
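The experiment above can be mimicked in software. The sketch below is only an illustration under several assumptions: the two LD streams are Halton sequences in bases 2 and 3 (standing in for the LUT-stored LD sequences of this work), the LFSR is an eight-bit register with feedback taps at bits 8, 6, 5 and 4 (an assumed polynomial), multiplication is modeled as a bitwise AND of two comparator outputs, and all function names are hypothetical.

```python
def van_der_corput(index, base):
    """Radical inverse of a non-negative integer in the given base, in [0, 1)."""
    value, denominator = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denominator *= base
        value += digit / denominator
    return value


def lfsr8_states(seed, length):
    """First `length` states of an 8-bit Fibonacci LFSR (assumed taps 8, 6, 5, 4)."""
    state = seed & 0xFF
    if state == 0:
        raise ValueError("seed must be non-zero")
    states = []
    for _ in range(length):
        states.append(state)
        feedback = ((state >> 7) ^ (state >> 5) ^ (state >> 4) ^ (state >> 3)) & 1
        state = ((state << 1) | feedback) & 0xFF
    return states


def stochastic_multiply(x, y, seq_x, seq_y, cycles):
    """Estimate x * y by ANDing two comparator-generated bit streams."""
    ones = 0
    for t in range(cycles):
        if seq_x[t] < x and seq_y[t] < y:
            ones += 1
    return ones / cycles


def cycles_needed(x, y, seq_x, seq_y, tolerance=0.01):
    """First cycle at which the running estimate is within tolerance of x * y."""
    ones = 0
    for t in range(len(seq_x)):
        if seq_x[t] < x and seq_y[t] < y:
            ones += 1
        if abs(ones / (t + 1) - x * y) < tolerance:
            return t + 1
    return None


# Sequence values scaled to [0, 1); the seeds are arbitrary choices.
LD_X = [van_der_corput(i, 2) for i in range(256)]
LD_Y = [van_der_corput(i, 3) for i in range(256)]
LFSR_X = [s / 256 for s in lfsr8_states(0x01, 256)]
LFSR_Y = [s / 256 for s in lfsr8_states(0xB5, 256)]

print(stochastic_multiply(0.5, 0.5, LD_X, LD_Y, 256))
print(cycles_needed(0.5, 0.5, LD_X, LD_Y), cycles_needed(0.5, 0.5, LFSR_X, LFSR_Y))
```

Averaging `cycles_needed` over a random set of input pairs would give a software analogue of the average cycle counts reported above, although the numbers will differ from the hardware results because the sequences and seeds here are assumptions.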

5.3. Hardware Utilization

Table 2 compares the LUT-based QSNG (LD sequence) and the LFSR-based SNG (pseudo-random sequence) in terms of resource utilization, average run-time and average power consumed for the edge detection and multiplication circuits. In this paper, the average run-time is calculated by multiplying the clock period of the circuit by the average number of clock cycles needed to generate a satisfactory output, which is defined as an output with an MAE of less than 0.01. The average power is calculated over the average run-time in both approaches. The numbers are estimated values based on the implementation results on the Xilinx Virtex 4 SF FPGA (XC4VLX15) device. From Table 2, the average run-time and the power consumed are reduced by 4.5 times for the multiplication circuit and 3.5 times for the edge detection circuit. Although the LUT-based QSNG occupies more area than the LFSR-based SNG, the faster convergence of the LD sequence means that acceptable results can be achieved much sooner and with a considerable power reduction, as shown in Table 2. For applications that can trade area for speed and power, the proposed LUT-based QSNG would be very beneficial. The proposed approach is a better low-power design than the conventional stochastic approach.

Table 2. Resource utilization for the serial implementation.

Sequence        Application      # of occupied slices   Max. freq. (MHz)   Average run-time (ns)   Average power (µW)
LD sequence     Multiplication   70                     224.30             240.7                   03
LD sequence     Edge detection   40                     222.30             100                     02
Pseudo-random   Multiplication   7                      458.32             1117                    34
Pseudo-random   Edge detection   48                     374.52             341                     04

The hardware utilization of the parallel implementation of the edge detection and multiplication circuits is shown in Table 3. Since the previous results show that LD sequence generators converge better than LFSRs for the same circuit implementation, the hardware utilization of the LD sequence generators can be reduced by restricting them to the generation of initial sub-sequences rather than complete sets of sequences. The throughput of the system can be increased drastically by using the proposed parallel implementation. For instance, the proposed serial implementation of the edge detection circuit reduces the computation time by a factor of 3.5. If the same circuit is realized in parallel, say with a degree of parallelism of four, it needs to run for just six clock cycles to generate the initial sub-sequence of 24 LD sequences, which is enough to achieve an acceptable output. For the LFSR-based SNG, on the other hand, it needs to run for 32 clock cycles to generate the initial 128 sequences that are needed to output an acceptable image. The computation time is decreased by a factor of four here, which means that the throughput of the system has increased. Therefore, by using the proposed SNGs, both the execution time and the power consumed can be reduced, while a higher throughput is achieved by implementing them in parallel.
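The run-time and parallel cycle-count arithmetic used in this subsection can be checked with a short Python sketch. The helper names `run_time_ns` and `parallel_cycles` are assumptions introduced for illustration; the inputs are the cycle counts quoted in the text and the maximum frequencies of Table 2.

```python
import math


def run_time_ns(cycles, max_freq_mhz):
    """Average run-time in ns: clock period (1000 / f in MHz) times cycle count."""
    return cycles * 1000.0 / max_freq_mhz


def parallel_cycles(bits_needed, parallelism):
    """Clock cycles needed when `parallelism` stream bits are produced per cycle."""
    return math.ceil(bits_needed / parallelism)


# Multiplication: 54 cycles (QSNG) versus 512 cycles (LFSR).
qsng_mult = run_time_ns(54, 224.30)
lfsr_mult = run_time_ns(512, 458.32)
print(round(qsng_mult, 1), round(lfsr_mult, 1), round(lfsr_mult / qsng_mult, 2))

# Edge detection with a degree of parallelism of four:
# 22 bits suffice for the QSNG, 128 bits for the LFSR.
print(parallel_cycles(22, 4), parallel_cycles(128, 4))
```

With a degree of parallelism of four, the sketch reproduces the six and 32 clock cycles quoted above for the QSNG and the LFSR, respectively.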

Table 3. Resource utilization comparison for the parallel implementation.

Sequence        Application      Degree of parallelism   Slices
LD sequence     Edge detection   4                       537
LD sequence     Edge detection   8                       069
LD sequence     Edge detection   16                      225
LD sequence     Multiplication   4                       04
LD sequence     Multiplication   8                       99
LD sequence     Multiplication   16                      394
Pseudo-random   Edge detection   4                       90
Pseudo-random   Edge detection   8                       382
Pseudo-random   Edge detection   16                      767
Pseudo-random   Multiplication   4                       67
Pseudo-random   Multiplication   8                       34
Pseudo-random   Multiplication   16                      262

6. Conclusions

This paper has introduced a novel construction method to realize QSNGs on FPGAs using LUTs. The FPGA's superior reconfigurability was leveraged for the parallel implementation of a stochastic circuit that outperforms the conventional LFSR-based stochastic circuit approach in terms of convergence, power consumed and accuracy. Simulation results suggest that both the computation time and the power consumed can be reduced by up to 3.5 times in the edge detection circuit and by up to 4.5 times in the multiplication circuit as compared to the conventional stochastic approach. Further, extensive simulation results indicate that FPGA-based parallel quasi-stochastic computing is a better option for faster (higher throughput) and more accurate computation with low power consumption. The future scope of this work is to reduce the area occupied by generating the LD sequence with combinational logic.

Author Contributions: All authors contributed equally to this work.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Sobol, I.M. Uniformly distributed sequences with an additional uniform property. USSR Comput. Math. Math. Phys. 1976, 16, 236–242.
2. Alaghi, A.; Hayes, J.P. Fast and accurate computation using stochastic circuits. In Proceedings of the Conference on Design, Automation & Test in Europe Conference and Exhibition (DATE), European Design and Automation Association, Dresden, Germany, 24–28 March 2014; p. 76.
3. Gaines, B.R. Stochastic computing. In Proceedings of the 18–20 April 1967, Spring Joint Computer Conference, Atlantic City, NJ, USA, 18–20 April 1967; pp. 149–156.
4. Gaines, B. Stochastic computing systems. In Advances in Information Systems Science; Springer: Berlin, Germany, 1969; pp. 37–172.
5. Alaghi, A.; Li, C.; Hayes, J.P. Stochastic circuits for real-time image-processing applications. In Proceedings of the 50th Annual Design Automation Conference, Austin, TX, USA, 29 May–7 June 2013; p. 136.
6. Naderi, A.; Mannor, S.; Sawan, M.; Gross, W.J. Delayed stochastic decoding of LDPC codes. IEEE Trans. Signal Process. 2011, 59, 5617–5626.
7. Aliee, H.; Zarandi, H.R. Fault tree analysis using stochastic logic: A reliable and high speed computing. In Proceedings of the 2011 Proceedings-Annual Reliability and Maintainability Symposium (RAMS), Lake Buena Vista, FL, USA, 24–27 January 2011; pp. 1–6.
8. Li, P.; Lilja, D.J. Using stochastic computing to implement digital image processing algorithms. In Proceedings of the 2011 IEEE 29th International Conference on Computer Design (ICCD), Amherst, MA, USA, 9–12 October 2011; pp. 154–161.

9. Chang, Y.N.; Parhi, K. Architectures for digital filters using stochastic computing. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013; pp. 2697–2701.
10. Saraf, N.; Bazargan, K.; Lilja, D.J.; Riedel, M.D. IIR filters using stochastic arithmetic. In Proceedings of the 2014 Design, Automation and Test in Europe Conference and Exhibition (DATE), Dresden, Germany, 24–28 March 2014; pp. 1–6.
11. Li, P.; Lilja, D.J. Accelerating the performance of stochastic encoding-based computations by sharing bits in consecutive bit streams. In Proceedings of the 2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Washington, DC, USA, 5–7 June 2013; pp. 257–260.
12. Ichihara, H.; Ishii, S.; Sunamori, D.; Iwagaki, T.; Inoue, T. Compact and accurate stochastic circuits with shared random number sources. In Proceedings of the 2014 32nd IEEE International Conference on Computer Design (ICCD), Seoul, Korea, 19–22 October 2014; pp. 361–366.
13. Alaghi, A.; Hayes, J.P. A spectral transform approach to stochastic circuits. In Proceedings of the 2012 IEEE 30th International Conference on Computer Design (ICCD), Montreal, QC, Canada, 30 September–3 October 2012; pp. 315–321.
14. Alaghi, A.; Hayes, J. STRAUSS: Spectral Transform Use in Stochastic Circuit Synthesis. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2012, 34, 1770–1783.
15. Chen, T.H.; Hayes, J.P. Equivalence among Stochastic Logic Circuits and its Application to Synthesis. IEEE Trans. Emerg. Top. Comput. 2016, doi:10.1109/tetc.2016.2623796.
16. Kwok, S.H.; Lam, E.Y. FPGA-based high-speed true random number generator for cryptographic applications. In Proceedings of the 2006 IEEE Region 10 Conference on TENCON 2006, Hong Kong, China, 14–17 November 2006; pp. 1–4.
17. Majzoobi, M.; Koushanfar, F.; Devadas, S. FPGA-Based True Random Number Generation Using Circuit Metastability with Adaptive Feedback Control. Cryptogr. Hardw. Embed. Syst. 2011, 6917, 17–32.
18. Moons, B.; Verhelst, M. Energy-Efficiency and Accuracy of Stochastic Computing Circuits in Emerging Technologies. IEEE J. Emerg. Sel. Top. Circuits Syst. 2014, 4, 475–486.
19. Alaghi, A.; Hayes, J.P. Survey of stochastic computing. ACM Trans. Embed. Comput. Syst. 2013, 12, doi:10.1145/2465787.2465794.
20. Manohar, R. Comparing Stochastic and Deterministic Computing. IEEE Comput. Archit. Lett. 2015, 14, 119–122.
21. Gupta, P.; Kumaresan, R. Binary multiplication with PN sequences. IEEE Trans. Acoust. Speech Signal Process. 1988, 36, 603–606.
22. Thomas, D.B.; Luk, W. The LUT-SR family of uniform random number generators for FPGA architectures. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2013, 21, 761–770.
23. Bratley, P.; Fox, B.L.; Niederreiter, H. Implementation and tests of low-discrepancy sequences. ACM Trans. Model. Comput. Simul. 1992, 2, 195–213.
24. Wikipedia. Low-Discrepancy Sequence. Available online: www.en.wikipedia.org/wiki/low-discrepancy_sequence (accessed on 9 April 2017).
25. Sobol, I. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Comput. Math. Math. Phys. 1967, 7, 784–802, doi:10.1016/0041-5553(67)90144-9.
26. Halton, J.H. On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numer. Math. 1960, 2, 84–90.
27. Niederreiter, H. Point sets and sequences with small discrepancy. Mon. Math. 1987, 104, 273–337.
28. Parhami, B. Efficient hamming weight comparators for binary vectors based on accumulative and up/down parallel counters. IEEE Trans. Circuits Syst. II Express Briefs 2009, 56, 167–171.
29. Hsieh, T.Y.; Peng, Y.H.; Ku, C.C. An Efficient Test Methodology for Image Processing Applications Based on Error-Tolerance. In Proceedings of the 2013 22nd Asian Test Symposium, Jiaosi Township, Taiwan, 18–21 November 2013; pp. 289–294.

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).