An investigation of memory latency reduction using an address prediction buffer

Calhoun: The NPS Institutional Archive
Theses and Dissertations, Thesis Collection
1992-12
An investigation of memory latency reduction using an address prediction buffer
Billingsley, Arthur Brooks, Jr.
Monterey, California. Naval Postgraduate School
http://hdl.handle.net/10945/23712


REPORT DOCUMENTATION PAGE

Report security classification: UNCLASSIFIED
Distribution/availability of report: Approved for public release; distribution is unlimited.
Name of performing organization: Naval Postgraduate School
Office symbol: ECE
Address: Monterey, CA 93943-5000
Name of monitoring organization: Naval Postgraduate School
Address: Monterey, CA 93943-5000
Title (include security classification): An Investigation of Memory Latency Reduction Using an Address Prediction Buffer (U)
Personal author(s): Arthur Brooks Billingsley, Jr.
Type of report: Master's Thesis
Time covered: from 05/92 to 12/92
Date of report: December 1992
Page count: 39
Supplementary notation: The views expressed in this thesis are those of the author and do not reflect the official policy or position of the Department of Defense or the United States Government.
Subject terms: Memory latency, Computer Architecture, Cache Memory, Computer Performance, Latency Reduction, Cache Hierarchy
Abstract: Developing memory systems to support high-speed processors is a major challenge to computer architects. Cache memories can improve system performance but the latency of main memory remains a major penalty for a cache miss. A novel approach to improve system performance is the use of a memory prediction buffer. The memory prediction buffer (MPB) is inserted between the cache and main memory. The MPB predicts the next cache-miss address and pre-fetches the data. The use of an MPB in a computer system is shown to decrease main-memory latency and increase system performance.
Abstract security classification: UNCLASSIFIED
Name of responsible individual: Douglas Fouts
Telephone (include area code): (408) 646-2852
DD Form 1473, 84 MAR. 83 APR edition may be used until exhausted. All other editions are obsolete.
Security classification of this page: UNCLASSIFIED

Approved for public release; distribution is unlimited

AN INVESTIGATION OF MEMORY LATENCY REDUCTION USING AN ADDRESS PREDICTION BUFFER

by
Arthur Brooks Billingsley Jr.
Lieutenant, United States Navy
B.S.E.E., Auburn University, 1985

Submitted in partial fulfillment of the requirements for the degree of
MASTER OF ELECTRICAL ENGINEERING
from the
NAVAL POSTGRADUATE SCHOOL
December 1992

ABSTRACT

Developing memory systems to support high-speed processors is a major challenge to computer architects. Cache memories can improve system performance but the latency of main memory remains a major penalty for a cache miss. A novel approach to improve system performance is the use of a memory prediction buffer. The memory prediction buffer (MPB) is inserted between the cache and main memory. The MPB predicts the next cache-miss address and pre-fetches the data. The use of an MPB in a computer system is shown to decrease main-memory latency and increase system performance.

TABLE OF CONTENTS

I. INTRODUCTION
II. MEMORY HIERARCHY AND LATENCY REDUCTION
III. PERFORMANCE METRICS
IV. MEMORY PREDICTION BUFFER
V. MEMORY PREDICTION BUFFER PERFORMANCE
   A. MPB THEORETICAL PERFORMANCE
   B. BASELINE SYSTEM PERFORMANCE
   C. MPB SIMULATION PERFORMANCE
VI. CONCLUSIONS
VII. RECOMMENDATIONS FOR FUTURE RESEARCH
APPENDIX
LIST OF REFERENCES
INITIAL DISTRIBUTION LIST

TABLE OF SYMBOLS

E              efficiency
I              instructions
CPI_eff        effective cycles per instruction
S              speedup
CPI_eff(MPB)   effective cycles per instruction with memory prediction buffer
T_EA           effective access time
T_CS           cache search time
C_HR           cache hit ratio
T_CF           cache fetch time
T_MR           memory read time
T_MW           memory write time
f              clock frequency in Hertz
HR_L           local cache hit rate
HR_C           first level cache hit rate
HR_SYS         overall system hierarchy hit rate
HR_MPB         MPB local hit rate
MPB            memory prediction buffer

ACKNOWLEDGEMENTS

The undertaking of any research project of this nature is not performed in isolation. It is therefore desirable to recognize the contributions of others who aided the completion of this research. Anita Borg of Digital (DEC) was extremely helpful in obtaining address traces for the simulation of design concepts. Mark Hill of the University of Wisconsin provided his cache simulators, DINEROIII and TYCHO, for the simulation of the design concept. Thanks to Richard Hamming of the Naval Postgraduate School for his guidance in the efficient progression of research and for invaluable statistical insight. A special thanks to John Powers of the Naval Postgraduate School for his professional and personal assistance to the author. The simulation for this research was accomplished using a network of Sun SPARCs and the efforts of three fine system administrators, Robert Limes, Elaine Kodres and Brad Polk. In addition, Douglas Fouts provided the initial inspiration for the development of the concept and provided support, financial and professional, to the author. This research was funded by a Naval Postgraduate School Research Initiation Grant.

I. INTRODUCTION

The technological advances in high-speed, general purpose processors have outpaced the support provided by main memory systems. In addition, software applications continue to grow in processor and memory requirements. The major factors in the design of memory systems are size of address space, bandwidth required, main-memory latency, and memory subsystem cost. Large memory subsystems use dynamic random-access memories because of their low cost per bit. Caching schemes, which employ high-cost, high-speed memories, are used to overcome main-memory latency and increase bandwidth. However, main memory latency, which is the time (in processor cycles) between the start of a memory fetch and the start of the transfer of requested data, is significant and increasing [PRZYBY90]. Further gains in memory system performance are possible through the use of different manufacturing processes (CMOS, BiCMOS, ECL and GaAs) [VAGTS92] and stringent design of the memory hierarchy. One such memory performance enhancement is the prediction of a cache-miss read address request to main memory. If the read address is predicted and the data made available, then the overall system performance is improved.

Since current RISC processors far exceed the capability of main memory systems, the focus for the computer systems architect is how to improve the performance of the memory hierarchy. Large, fully-associative caches are cost prohibitive, and direct-mapped caches offer an excellent alternative [HILL88]. Direct-mapped caches have a higher miss rate than fully-associative or set-associative caches. A disadvantage of cache memories, in general, is the miss penalty [PATHEN90],[PRZYBY90]. The reduction of the miss rate and subsequent miss penalty is the motivation for the memory prediction buffer (MPB).

Conceptually, the MPB is an enhancement for the data cache. The behavior of processors utilizing separate data and instruction caches is noted in this research and others [JOUPPI90],[PRZYBY90]. Examination of this behavior shows that instruction caches and data caches behave differently. Instruction caches can improve effectiveness by simply prefetching the next instruction. This approach is shown to be less effective for data caches [PATHEN90],[JOUPPI90]. If this approach is used for data cache management, it contributes to pollution of the cache and increases the number of capacity misses. Since most modern RISC processors have separate instruction and data caches, and employ some prefetch mechanism for the instruction cache, this research will focus on improving the effectiveness of the data cache by inserting an MPB between the cache and its refill line (main memory, in most cases). Although this organization is the focus for this research, it is not the only implementation possible for the MPB [NOWICKI92].

II. MEMORY HIERARCHY AND LATENCY REDUCTION

The von Neumann architecture, used by most single-instruction-single-data¹ (SISD) and single-instruction-multiple-data (SIMD) machines, has some baseline behavioral characteristics to consider [HWANG84]. The characteristics of the memory subsystem provide the parameters for optimization of the operational behavior of the memory subsystem in conjunction with the processor and secondary storage. First, stored programs obey the principle of locality [PATHEN90]. This principle has two components which state that programs, while executing, favor only a portion of their address space at a given instant. The two components are:

Spatial Locality - Programs tend to request data and instructions that have memory addresses near the instructions and data currently being used. The von Neumann architecture provides for the execution of sequential program instructions and programs use related data items which are likely to be adjacently stored.

Temporal Locality - Programs tend to use current information and data. That is, if an item is referenced, it will probably be referenced again soon. The older the information, the less likely it is that the program will again reference it. Temporal locality is especially evident in the execution of program loops where instruction and data are used several times within a short period of time.

With reference to these principles, high-speed buffers are inserted between the main memory and the processor. These buffers are known as caches. The caches store portions of main memory which are currently in use by the executing program. This allows rapid access by the processor of the instructions and data needed to continue processing. Although the cache does a great job of hiding main memory latency, a disadvantage of its use is the penalty for a cache miss. The construction of the cache gives the following behavioral characteristics for a cache miss.

Compulsory - cache misses that occur when a block is first accessed and the program is just starting. These are sometimes called cold start misses since the cache has never held the information requested.

Capacity - cache misses that occur when discarded blocks are again referenced by the executing program. These misses are inevitable since the cache size is less than main memory size.

Conflict - the block placement strategy dictates conflict misses. Conflict misses occur when a block is discarded because too many incoming blocks map to the same set and the discarded block is soon needed. This characteristic is evident in both set-associative mapped and direct-mapped caches.

The structure of the memory subsystem is given in Figure 1. Traversing down the hierarchy, access time increases and the storage size increases. However, bandwidth decreases significantly while traversing the hierarchy, top to bottom. Some nominal figures for size and bandwidth are also given in Figure 1. It is worth noting that each level is a subset of the next lower level. That is, each level contains only a subset of the information contained in the next lower level. This presents a constraint of maintaining coherency (correct information) throughout the hierarchy. The MPB receives its information from the next lower level of the hierarchy. In this research, the next level of the hierarchy is the main memory. For the development of the concept of the MPB and for most of the simulation described here, the MPB is not involved in the write policy of the cache. The MPB always gets its data from the main memory which is kept up to date. Further research of the MPB will study the implementation of a write-through policy for coherency. Write-back performance will also be examined in follow-on research.

¹ Flynn's classification (1966) is based on the multiplicity of instruction streams and data streams in a computer system [HWANG84].
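The spatial-locality and conflict-miss behaviour described above can be illustrated with a very small direct-mapped cache model. This is only a sketch: the function name is hypothetical, and the 8K size and 32-byte line are borrowed from the model cache the thesis describes in its later sections, not from this section.

```python
def simulate_direct_mapped(addresses, cache_bytes=8192, line_bytes=32):
    """Count (hits, misses) for a direct-mapped cache on a list of byte addresses."""
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # block number currently held by each cache line
    hits = misses = 0
    for addr in addresses:
        block = addr // line_bytes     # block number of the requested byte
        index = block % num_lines      # direct-mapped: exactly one candidate line
        if tags[index] == block:
            hits += 1
        else:
            misses += 1
            tags[index] = block        # discard whatever block was there
    return hits, misses


# Spatial locality: a sequential sweep of words misses once per 32-byte line.
sequential = list(range(0, 1024, 4))
print(simulate_direct_mapped(sequential))

# Conflict misses: two blocks 8192 bytes apart map to the same line
# and keep discarding each other.
ping_pong = [0, 8192] * 8
print(simulate_direct_mapped(ping_pong))
```

In the sweep, only the first access to each line misses. In the second pattern every access misses even though the cache is almost empty, which is the conflict behaviour a direct-mapped placement strategy can produce.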

Figure 1: Memory Hierarchy (block diagram). Central processing unit: processor registers, <2K bytes. Memory subsystem: cache, 1KB-512KB; main memory, 512KB-512MB; mass storage, >100Mb. Nominal transfer rates between successive levels, top to bottom: 200Mb/s, 100Mb/s, 4MB/s.

III. PERFORMANCE METRICS

In order to investigate the performance of the memory subsystem, characteristics of the memory subsystem must be developed. From the system perspective, work completed in time defines system performance. Hence, system performance can be described analytically as Equation 1.

$$\text{System Performance} = \frac{\text{Instructions Completed}}{\text{Elapsed Time}} \qquad (1)$$

This definition of system performance does derive the ubiquitous MIPS units. This unit of measurement should not be used in comparison of different systems performing the same task [PATHEN90]. However, for characterization of a specific system performing the same task, this unit of measure is useful. This measure of performance can be focused in terms of processor cycles. Efficiency is a product of the number of instructions executed, the number of clock cycles per instruction and the clock speed (Equation 2).

$$E = I \cdot CPI \cdot f \qquad (2)$$

Expanding this model, the number of cycles per instruction executed is the metric that is directly influenced by the memory subsystem. Statistically, a more stable metric is the effective CPI. The effective CPI is the statistical average of several measurements. For $n$ measurements, the effective CPI is

$$CPI_{eff} = \frac{1}{n}\sum_{i=1}^{n} CPI_i \qquad (3)$$

The number of cycles per instruction is largely determined by processor architecture and register/cache structure (effectiveness). With a focus toward the memory structure, the effective access time of the memory subsystem is the best metric to indicate memory subsystem performance. This parameter depends on the cache access time and the main memory access time. By decreasing the number of cycles per instruction, the system performance is improved. The speedup in system performance is modelled by Equation 4.

$$S = \frac{CPI_{eff} - CPI_{eff(MPB)}}{CPI_{eff}} = 1 - \frac{CPI_{eff(MPB)}}{CPI_{eff}} \qquad (4)$$

The nominal figure for the number of cycles per instruction in high performance processors is 1.2-2.0 CPI. If we assume that the processor can execute instructions at the bandwidth of the memory subsystem, the speedup becomes a function of the effective access time of the memory subsystem. Equation 5 determines the speedup of a given system by reference to the effective access time with the MPB and without the MPB.

$$S = 1 - \frac{T_{EA(MPB)}}{T_{EA}} \qquad (5)$$

The effective access time measures the memory hierarchy performance. The effective access time is, therefore, a function of the cache performance and main memory performance as noted in Equation 6.

$$T_{EA} = T_{CS} + C_{HR}\,T_{CF} + (1 - C_{HR})\,(T_{CS} + T_{MR} + T_{CF}) \qquad (6)$$

This relationship can be simplified by noting that the time for a cache tag search, $T_{CS}$, is very small. In addition, the cache tag search and cache fetch are much smaller than the time to read/fetch data from main memory, $T_{MR}$. The effective access time can then be approximated as in Equation 7.

$$T_{EA} \approx C_{HR}\,T_{CF} + (1 - C_{HR})\,T_{MR} \qquad (7)$$

This approximation can be used only for comparison between simulation models. The description given by Equation 6 must be used for evaluation of the simulation model with respect to implementation performance.
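To see how far the approximation of Equation 7 sits from the fuller model of Equation 6, both can be evaluated side by side. The sketch below is illustrative only: the function names are hypothetical, and the 90% hit ratio and 1 ns tag search are assumed values chosen for the example (the 10 ns cache and 80 ns main memory match the model system quoted later in the thesis).

```python
def effective_access_time(c_hr, t_cs, t_cf, t_mr):
    """Equation 6: effective access time, in the same time unit as the inputs."""
    return t_cs + c_hr * t_cf + (1.0 - c_hr) * (t_cs + t_mr + t_cf)


def effective_access_time_approx(c_hr, t_cf, t_mr):
    """Equation 7: approximation that neglects the tag search and the
    cache fetch that follows a miss."""
    return c_hr * t_cf + (1.0 - c_hr) * t_mr


# Assumed example values: 90% hit ratio, 1 ns tag search,
# 10 ns cache fetch, 80 ns main-memory read.
exact = effective_access_time(c_hr=0.90, t_cs=1.0, t_cf=10.0, t_mr=80.0)
approx = effective_access_time_approx(c_hr=0.90, t_cf=10.0, t_mr=80.0)
print(f"Equation 6: {exact:.1f} ns")
print(f"Equation 7: {approx:.1f} ns")
```

The gap between the two results comes entirely from the terms Equation 7 drops, which is why the approximation is suitable for comparing simulation models against each other but not for predicting the access time of an implementation.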

IV. MEMORY PREDICTION BUFFER

The memory prediction buffer (MPB) was conceived to predict the next cache-miss address and prefetch the data before the request is made by the processor. The MPB can be inserted between the cache and its refill line as depicted in Figure 2. Another possible configuration could be the use of smaller MPBs attached to individual memory chips (DRAMs). This implementation is realized in recent work by Nowicki [NOWICKI92]. A block diagram of this approach is given in Figure 3.

Figure 2: MPB With Cache Implementation (block diagram; legible labels: central processing unit, main memory, memory subsystem).

Figure 3: MPB With Main Memory Implementation (block diagram; legible labels: central processing unit, processor register file, four MPBs, main memory, memory subsystem).

In the early research of this idea, efforts turned instinctively toward statistical methods for prediction. The area of digital signal processing was explored for possible solutions to the prediction requirement [HAMMIN83],[THERRI92]. Kalman filters, Wiener filters and other adaptive techniques for prediction were proposed and investigated. However, further characterization of the problem provided more specifications for possible solutions.

Cache simulation was achieved using Mark Hill's DINEROIII cache simulator. The model cache is direct-mapped, 8K data and 8K instruction, with a 32 byte line size. Using various ATUM traces [GRIMSR92] and DEC traces [BORG90], cache miss addresses were investigated [AGARWL86]. Review of the traces shows that spatial locality and temporal locality are valid for all processes. Since no curves are noted in the traces, prediction should employ linear methods. The physical construction of the memory prediction buffer is given in Figure 4.

Figure 4: Memory Prediction Buffer (block diagram; lines 1 through m, each with a comparator, address tags and bytes 1 through n; connections labelled "address from cache", "to cache" and "from main memory").

The simulation was configured to give the number of cache hits before a miss is encountered. The average of these miss events gives the constraint of time available to predict and prefetch a miss address. Since the average number of cache hits before a cache miss is 4-6, it is possible that some 6-10 cycles are available for prediction and prefetch. In addition, the system bus bandwidth must be considered for the prefetch solution. These constraints were responsible for the development of a simpler prediction algorithm. The prediction algorithm yields a bias for the ensuing prefetch. The algorithm is implemented in C for simulation. If the current address is larger than the past address, then the bias is positive (negative otherwise). The algorithm for the MPB is given in Figure 5.

Figure 5: Memory Prediction Buffer Algorithm
    receive address request from processor
    determine block address (boundary)
    fetch address requested from main memory
    send requested data to processor
    compare address requested with previous address request and calculate bias
    apply bias to last address to obtain predicted address
    fetch data at predicted address

The determination and application of the bias is central to the algorithm. The bias is simply the difference in address boundaries (if word aligned) of the previous address and the current address. If the address requested is greater than 32K away, another address stream bias is established. The corresponding address stream bias is used to predict the next requested address. The bias may be positive or negative, that is, ascending or descending in memory. The correct address stream bias is determined using a simple but fast binary search. The search time can be reduced further using a fully associative algorithm.

The structure of the memory prediction buffer is similar to a conventional fully-associative cache. The MPB is composed of m lines of n byte blocks. For the cache used in this research, the MPB has 16-256 lines of 32 byte blocks. The blocks are aligned on the same address (word) boundaries as the first level cache. The block size is dependent on the block size of the first level cache. The optimal size of the MPB is 64-256 lines. This size is due to the fan-out requirements (and costs) for the construction of a fully associative cache and the number of lines (sets) needed to allow effective use of the replacement policy used (random replacement rather than LRU, FIFO, etc.). If a LRU replacement policy is used instead of random replacement, a smaller MPB can be used to give the same performance improvement.
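The steps of Figure 5 can be sketched in a few lines of Python. This is not the author's C simulator: the class and method names are hypothetical, the stream lookup is a plain linear scan rather than the binary search mentioned above, and the bookkeeping choices (a matched block stays in the buffer, a new stream starts with no bias) are assumptions made to keep the example small.

```python
import random
from collections import OrderedDict

LINE_BYTES = 32               # block size, matching the first-level cache
STREAM_WINDOW = 32 * 1024     # a request farther away than this starts a new stream


class MemoryPredictionBuffer:
    """Illustrative MPB model: per-stream bias prediction over a small
    fully associative buffer of prefetched blocks."""

    def __init__(self, num_lines=128, policy="random", seed=0):
        self.num_lines = num_lines
        self.policy = policy              # "random" or "lru"
        self.lines = OrderedDict()        # prefetched blocks, least recently used first
        self.streams = []                 # each entry: [last block address, last bias]
        self.rng = random.Random(seed)

    def _prefetch(self, block):
        """Place a predicted block in the buffer, evicting a line if it is full."""
        if block in self.lines:
            self.lines.move_to_end(block)
            return
        if len(self.lines) >= self.num_lines:
            if self.policy == "lru":
                self.lines.popitem(last=False)            # evict least recently used
            else:
                victim = self.rng.choice(list(self.lines))
                del self.lines[victim]                    # evict a random line
        self.lines[block] = True

    def miss_request(self, address):
        """Handle one cache-miss read; return True if the MPB already held the block."""
        block = address - (address % LINE_BYTES)          # block (boundary) address
        hit = block in self.lines
        if hit:
            self.lines.move_to_end(block)                 # refresh recency for LRU
        # Find the address stream this request belongs to (linear scan here).
        stream = next((s for s in self.streams
                       if abs(block - s[0]) <= STREAM_WINDOW), None)
        if stream is None:
            self.streams.append([block, 0])               # new stream, no bias yet
            return hit
        bias = block - stream[0]                          # signed block-address difference
        stream[0], stream[1] = block, bias
        predicted = block + bias                          # apply bias to the last address
        if bias != 0 and predicted >= 0:
            self._prefetch(predicted)
        return hit


mpb = MemoryPredictionBuffer(num_lines=16, policy="lru")
ascending = [0x1000 + LINE_BYTES * i for i in range(10)]
hits = sum(mpb.miss_request(a) for a in ascending)
print(f"{hits} of {len(ascending)} misses served from the MPB")
```

On this steadily ascending miss stream the first two requests only establish the bias, and each of the remaining eight is found in the buffer. Because the bias is signed, a descending stream behaves the same way, and a request more than 32K from every known stream starts a stream of its own.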

V. MEMORY PREDICTION BUFFER PERFORMANCE

A. MPB THEORETICAL PERFORMANCE

The memory prediction buffer determines the future cache miss address using previous cache miss addresses. For this analysis, only the data cache is given a MPB. The instruction cache is set to prefetch instructions. Given a model cache with a hit ratio of 93.2%, if the MPB is found to be correct in 33% of its predictions, an increase of 2.1% is realized for the cache hit rate. The effective cache hit ratio is improved from 93.2% to 95.3%. The graph of Figure 6 gives the effective cache hit rate as a function of MPB effectiveness. There are four cache models that are compared. One model has an 80% initial hit rate, another model has an 85% hit rate and so on. A sample reading is shown for a base cache hit ratio of 80% with an MPB effectiveness rating of 20%. The resulting effective cache hit ratio for this sample is 84%. This is an increase of 4% in the effective cache hit ratio. The resulting system performance achieves a speedup of 9%. The model system for this investigation has 10 ns cache memory and 80 ns main memory. This model memory hierarchy is used by the simulation study also. The cycle time of the main memory is not considered but would add to the effectiveness of the MPB.

Figure 6: MPB Performance Graph (effective cache hit rate plotted against memory prediction buffer effectiveness, 0 to 100).
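The sample reading is consistent with a simple relation in which the MPB turns a fraction of the cache misses into hits. As a sketch, let $e$ denote the MPB effectiveness, taken here to mean the fraction of cache misses the MPB predicts correctly; the symbols $e$ and $HR_{eff}$ are introduced for illustration and do not appear in the thesis.

$$HR_{eff} = C_{HR} + (1 - C_{HR})\,e$$

For the sample reading, $C_{HR} = 0.80$ and $e = 0.20$ give

$$HR_{eff} = 0.80 + (1 - 0.80)(0.20) = 0.84$$

which matches the 84% effective hit ratio quoted for that sample.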

B. BASELINE SYSTEM PERFORMANCE

In order to compare the performance of the MPB to existing latency reduction strategies, several measurements of the baseline system had to be collected and examined. This baseline system was constructed using the cache simulator, DINEROIII. The system simulates separate 8K direct-mapped data and 8K direct-mapped instruction caches.

Table 1: BASELINE SYSTEM PERFORMANCE

Process    Cache Size (bytes)   HR_L (%)   HR_C (%)   HR_SYS (%)   Speedup (%)

8K FIRST LEVEL CACHE BASE-SYSTEM PERFORMANCE
SPICE      8192                 96.51      96.51      96.51        -0-
Pascal     8192                 91.57      91.57      91.57        -0-
LISP       8192                 92.44      92.44      92.44        -0-
FORTRAN    8192                 93.88      93.88      93.88        -0-
Tree       8192                 98.66      98.66      98.66        -0-
SOR        8192                 90.50      90.50      90.50        -0-

12K FIRST LEVEL CACHE PERFORMANCE
SPICE      12288                97.16      97.16      97.16        3.66
Pascal     12288                94.40      94.40      94.40        12.46
LISP       12288                96.32      96.32      96.32        17.76
FORTRAN    12288                95.11      95.11      95.11        6.03
Tree       12288                97.43      97.43      97.43        (-7.87)
SOR        12288                91.16      91.16      91.16        2.77

8K FIRST LEVEL CACHE (DM) WITH 4K SECOND LEVEL CACHE (FA)
SPICE      4096                 24.46      96.51      97.37        4.84
Pascal     4096                 36.91      91.57      94.68        13.69
LISP       4096                 75.59      92.44      98.16        26.18
FORTRAN    4096                 32.58      93.88      95.81        9.46
Tree       4096                 68.32      98.56      99.44        4.99
SOR        4096                 23.84      90.50      92.77        9.54

C. MPB SIMULATION PERFORMANCE

The theoretical study of the MPB was realized when implemented using trace-driven simulation (TDS) [GRIMSR92] with the DINEROIII cache simulator (provided by Mark Hill). As with any TDS research, address traces and their accuracy are critical to proper simulation. For this research, ATUM traces [AGARWL86] and DEC Titan [BORG90] traces were used. Some behavioral characteristics of the simulation are graphically illustrated in the appendix. Table 2 gives a summary of MPB performance for two processes and two runs of each. SOR is Renato De Leone's successive over-relaxation algorithm that uses sparse matrices. TREE is Joel Bartlett's program which builds a tree data structure and searches for the largest element in the tree. His program is a variant of LISP. Both of these process traces were provided by DEC WRL.

Table 2: MEMORY PREDICTION BUFFER PERFORMANCE (DEC)

Process    MPB Lines   Bytes per line   HR_MPB (%)   HR_C (%)   HR_SYS (%)   Speedup (%)
TREE 1     128         32               69.89        97.87      99.37        9.14
TREE 2     128         32               59.57        98.01      99.20        7.31
SOR 1      128         32               12.77        90.51      91.79        5.38
SOR 2      128         32               10.20        90.29      91.35        4.42

The model system is a RISC processor with separate 8K instruction and 8K data caches. There are 32-byte blocks in the cache and in the MPB. The cache is direct-mapped for reasons given by [HILL88]. The initial cache hit rate, measured before the insertion of the MPB, is listed under HR_C. The local hit rate for the MPB is given under HR_MPB. The overall hit rate for the cache and MPB combined is listed under HR_SYS. The speedup is listed for the overall system. For these examples, the MPB consists of 128 lines, each holding one 32-byte block. Each line is boundary aligned in the same way as the cache. That is, just as the cache may use word aligned blocks, so does the MPB. This MPB simulation used a random replacement policy for the removal of lines. Toward the end of this research effort, a MPB was simulated using a least-recently used (LRU) replacement policy. Several simulations using this replacement policy showed that the number of lines in the MPB could be reduced while maintaining the effectiveness of the MPB. In particular, 64 lines were shown to perform nearly as well as 128 lines. For the simulation results of Table 2, the speedup numbers are modest but the cost of this implementation is minimal when compared to a 256K next level cache [PATHEN90].

In addition to the simulations using the DEC traces, simulations were also done using ATUM traces. Table 3 lists results of simulation using ATUM traces. The model system is the same as used in the DEC trace simulation. These simulation results can be used to motivate further research. ATUM traces are relatively short for cache modelling and behavior analysis. Each trace is approximately 400,000 addresses. This number of addresses is marginally adequate for a 32K cache simulation, and a larger cache-size simulation would require a larger number of addresses for proper and accurate simulation.

Table 3: MEMORY PREDICTION BUFFER PERFORMANCE (ATUM)

Process    MPB Lines   Bytes per line   HR_MPB (%)   HR_C (%)   HR_SYS (%)   Speedup (%)
Spice      128         32               33.50        93.22      95.27        6.75
Pascal     128         32               47.35        95.62      97.45        9.80
LISP       128         32               69.75        92.68      97.72        23.33
FORTRAN    128         32               40.11        94.22      96.90        13.36

For the preceding research, a random-replacement policy was used by the MPB. An early implementation of the MPB using a least-recently-used (LRU) policy shows improved performance over the random-replacement algorithm. Table 4 lists the results of this research using the process "tree". Results of this implementation using other processes were not yet accomplished at the time of the report.

Table 4: MEMORY PREDICTION BUFFER PERFORMANCE (LRU)

Process    MPB Lines   Bytes per line   HR_MPB (%)   HR_C (%)   HR_SYS (%)   Speedup (%)
TREE       128         32               79.11        97.91      99.98        12.64

As evidenced by all these simulation studies, the MPB is shown to be a favorable architectural concept for consideration in systems where the highest possible performance is desired and systems costs are constrained.

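The speedup column of Table 2 can be cross-checked against Equations 5 and 7. The sketch below assumes that a hit costs the 10 ns cache time and a miss costs the 80 ns main-memory time of the model system, and that the tabulated speedups were computed from the tabulated HR_C and HR_SYS values; the function and constant names are hypothetical.

```python
T_CACHE_NS = 10.0     # cache access time of the model system
T_MEMORY_NS = 80.0    # main-memory access time of the model system


def access_time_ns(hit_rate_percent):
    """Equation 7 with the hit rate given in percent."""
    h = hit_rate_percent / 100.0
    return h * T_CACHE_NS + (1.0 - h) * T_MEMORY_NS


def speedup_percent(hr_c, hr_sys):
    """Equation 5, comparing the cache alone (HR_C) with cache plus MPB (HR_SYS)."""
    return 100.0 * (1.0 - access_time_ns(hr_sys) / access_time_ns(hr_c))


# (process, HR_C, HR_SYS, reported speedup) copied from Table 2.
TABLE_2 = [
    ("TREE 1", 97.87, 99.37, 9.14),
    ("TREE 2", 98.01, 99.20, 7.31),
    ("SOR 1", 90.51, 91.79, 5.38),
    ("SOR 2", 90.29, 91.35, 4.42),
]

for name, hr_c, hr_sys, reported in TABLE_2:
    computed = speedup_percent(hr_c, hr_sys)
    print(f"{name}: computed {computed:.2f}%, reported {reported:.2f}%")
```

Under these assumptions the computed values round to the reported speedups for all four rows of Table 2, which suggests the simulation results were summarized with the approximate model of Equation 7.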
VI. CONCLUSIONS

The memory prediction buffer is proposed as a component for high performance computer systems. The widening gap between processor speed and memory subsystems requires the investigation of alternative architectures for reducing main memory latency while restraining costs. The MPB outperforms prefetch-always strategies by allowing addressing in the up and down direction. In addition, the MPB does not contribute to pollution of the cache. Effective memory latency reduction must be addressed at the time of system design. In addition, as the requirements for a larger address space grow, memory hierarchy design and implementation will continue to increase in complexity. The implementation of a MPB is less expensive than a next-level cache and delivers a comparable performance enhancement. In addition, the algorithm used can be tailored to the proposed system environment to provide a more effective latency reduction structure. The MPB is shown to improve overall system performance and provide reasonable gains in speedup.

VII. RECOMMENDATIONS FOR FUTURE RESEARCH

The memory prediction buffer is studied and simulated for enhancement of the data cache of a uniprocessor. Its use or enhancement in a multiprocessor environment is not yet known. In addition, the question of whether the MPB can be used to significantly enhance the performance of the instruction cache has not been fully explored.

The algorithm for the MPB of this research focused on a random replacement policy for discarding lines. The LRU replacement policy showed an improvement over random; however, the effect of other replacement policies remains open for discussion.

Simulation and study of the memory bandwidth required to support an architecture with an MPB and without an MPB is needed. A comparison of the amount of bandwidth required by the base architecture (cache and processor) with the bandwidth required by the architecture with an MPB installed would be useful.

The cache write-back policy and its effect on system performance with and without an MPB is an area open for study.

APPENDIX

[Figures: plots with the axis label "Memory Address Value (decimal)". The remaining plot content is not recoverable from the extracted text.]

LIST OF REFERENCES

[AGARWL86] Agarwal, A., et al., "ATUM: A New Technique for Capturing Address Traces Using Microcode", The 13th Annual International Symposium on Computer Architecture, IEEE Computer Society Press, Los Alamitos, California, (vol 14, no 3), 1986.
[AZIMI92] Azimi, M. et al., "Two Level Cache Architectures", COMPCON '92, IEEE Computer Society Press, Los Alamitos, California, 1992, pg 344-349.
[BORG90] Borg, A., Kessler, R.E., Wall, D.W., "Generation and Analysis of Very Long Address Traces", The 17th Annual International Symposium on Computer Architecture, IEEE Computer Society Press, Los Alamitos, California, (vol 18, no 2), 1990.
[BUGGE90] Bugge, H.O. et al., "Trace Driven Simulations for a Two-Level Cache Design in Open Bus Systems", IEEE Computer Society Press, Los Alamitos, California, (vol 18, no 2), 1990.
[BURSKY92] Bursky, D., "Combination DRAM-SRAM", Electronic Design, Penton Publishing, Cleveland, Ohio, January 1992, (vol 40, no 2), pg 39.
[CLEMEN91] Clements, A., Microprocessor Support Chips Sourcebook, McGraw-Hill Inc., London, England, 1991.
[GAJSKI87] Gajski, D.D. et al., Computer Architecture, IEEE Computer Society Press, Washington, D.C., 1987.
[GRIMSR92] Grimsrud, K. et al., "Estimation of Simulation Error Due to Trace Inaccuracies", Brigham Young University, November 1992, unpublished.
[HAMMIN83] Hamming, R.W., Digital Filters, Prentice-Hall, Englewood Cliffs, New Jersey, 1983.
[HILL88] Hill, M.D., "A Case for Direct-Mapped Caches", IEEE Computer, IEEE Computer Society, Los Alamitos, California, December 1988.
[HWANG84] Hwang, K., Briggs, F., Computer Architecture and Parallel Processing, McGraw-Hill, New York, New York, 1984.
[JAIN91] Jain, Raj, The Art of Computer Systems Performance Analysis, John Wiley and Sons, New York, New York, 1991.
[JOUPPI90] Jouppi, N.P., "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers", The 17th Annual International Symposium on Computer Architecture, IEEE Computer Society Press, Los Alamitos, California, (vol 18, no 2), 1990.
[KURIAN91] Kurian, L. et al., "Classification and Performance Evaluation of Instruction Buffering Techniques", IEEE Computer Society Press, Los Alamitos, California, (vol 19, no 3), 1991.
[NOWICK92] Nowicki, G., "Design and Implementation of a Read Prediction Buffer", Master's Thesis, Naval Postgraduate School, Monterey, California, December 1992.
[PATHEN90] Patterson, D.A. & Hennessy, J.L., Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Mateo, California, 1990.
[POHM83] Pohm, A.V., High-Speed Memory Systems, Reston Publishing Company, Reston, Virginia, 1983.
[POLLAR90] Pollard, L.H., Computer Architecture and Design, Prentice Hall, Englewood Cliffs, New Jersey, 1990.
[PRZYBY90] Przybylski, S.A., Cache and Memory Hierarchy Design: A Performance-Directed Approach, Morgan Kaufmann Publishers, San Mateo, California, 1990.
[PRZYBY88] Przybylski, S. et al., "Performance Trade-offs in Cache Design", IEEE Computer Society Press, Los Alamitos, California, (vol 16, no 2), 1988.
[PRZYBY90] Przybylski, S.A., "The Performance Impact of Block Sizes and Fetch Strategies", IEEE Computer Society Press, Los Alamitos, California, (vol 18, no 2), 1990.
[SHORT88] Short, R.T. and Levy, H.M., "A Simulation Study of Two-Level Caches", The 17th Annual International Symposium on Computer Architecture, IEEE Computer Society Press, Los Alamitos, California, (vol 16, no 2), 1988.
[SMITH85] Smith, A.J., "Cache Evaluation and the Impact of Workload Choice", IEEE Computer Society Press, Los Alamitos, California, (vol 18, issue 3), 1985.
[SMITH82] Smith, A.J., "Cache Memories", ACM Computing Surveys, New York, New York, (vol 14, no 3), September 1982.
[THIEBT92] Thiebaut, D., Wolf, J.L., Stone, S.S., "Synthetic Traces for Trace-Driven Simulation of Cache Memories", IEEE Transactions on Computers, (vol 41, no 4), April 1992.
[THERRI92] Therrien, C.W., Discrete Random Signals and Statistical Signal Processing, Prentice-Hall, Englewood Cliffs, New Jersey, 1992.
[VAGTS92] Vagts, C., "A Single Transistor Cell For GaAs Dynamic RAM", Master's Thesis, Naval Postgraduate School, Monterey, California, 1992.

INITIAL DISTRIBUTION LIST

Defense Technical Information Center
Cameron Station
Alexandria, Virginia 22304-6145

Library, Code 52
Naval Postgraduate School
Monterey, California 93943-5000

Chairman, Code EC
Department of Electrical and Computer Engineering
Naval Postgraduate School
Monterey, California 93943-5000

Prof. Douglas J. Fouts, Code EC/FS
Department of Electrical and Computer Engineering
Naval Postgraduate School
Monterey, California 93943-5000

Prof. Richard W. Hamming, Code CS/HG
Department of Computer Science
Naval Postgraduate School
Monterey, California 93943-5000

Arthur Billingsley, LT, USN
Space and Naval Warfare Systems Command
Department of the Navy
SPAWAR (PMW-156-1) UHF SATCOMM
Washington, D.C. 20363-5100
