PRACE Autumn School GPU Programming
Transcription
1 PRACE Autumn School 2010 GPU Programming October 25-29, 2010 PRACE Autumn School, Oct
2 Outline: GPU Programming Track. Tuesday 26th: GPGPU: General-purpose GPU Programming; CUDA Architecture, Threading and Memory model; CUDA Programming, Runtimes and Environments; Hands-on Lab 1: CUDA Environment Setup, Compilation and Execution Examples. Wednesday 27th: CUDA Optimizations. Debugging and Profiling; GPU Multiprocessing. Deploying Multi-GPU Applications; The GPU on Heterogeneous and High-Performance Computing; Hands-on Lab 2: Advanced Tools and Exercises. HPC Codes and Performance Evaluation.
3 Instructors: Manuel Ujaldón, Associate Professor, Computer Architecture Department, University of Malaga. Nacho Navarro, Associate Professor, Computer Architecture Department, Universitat Politecnica de Catalunya (UPC); Researcher at Barcelona Supercomputing Center (BSC); Visiting Research Professor at University of Illinois (UIUC). Javier Cabezas, Ph.D. student at the Computer Architecture Department, UPC; Researcher at the Barcelona Supercomputing Center; Visiting Ph.D. student at UIUC.
4 CUDA is Popular
5 PUMPS Summer School: Programming and Tuning Massively Parallel Systems Summer School (PUMPS). Teachers: Wen-mei W. Hwu, University of Illinois; David B. Kirk, NVIDIA.
6 BSC named first CUDA Research Center in Spain: The Barcelona Supercomputing Center (BSC) has been named by NVIDIA as a 2010 CUDA Research Center, the first in Spain. The CUDA Research Center Program recognizes and fosters collaboration with research groups at universities and research institutes that are expanding the frontier of massively parallel computing. Institutions identified as CUDA Research Centers are doing world-changing research by leveraging CUDA and NVIDIA GPUs.
7 Hands-on Labs: Labs will be done at the AC GPU Cluster at NCSA (AC.NCSA.UIUC.EDU). Experimental system available for exploring GPU computing.
8 HP xw9400 workstation: 2216 AMD Opteron 2.4 GHz, dual socket, dual core; 8 GB DDR2; Infiniband QDR. Tesla S1070 1U GPU Computing Server: 1.3 GHz Tesla T10 processors; 4x4 GB GDDR3 SDRAM. Cluster Servers: 32 (128 CPUs). Accelerator Units: 32 (128 GPUs, 128 TF SP, 10 TF DP). Compute Node.
9 Course Wiki: Course Material; Hands-on Lab; Info on Textbooks; Links to interesting educational material. Register and log in to get access to the content.
10 PRACE Autumn School 2010 GPU Programming. Nacho Navarro, Associate Professor, Universitat Politecnica Catalunya / Barcelona Supercomputing Center; Visiting Research Professor, UIUC, CSL.
11 Outline: Multicore: Dual/Quad, Cell, GPU, FPGA,?; Current and future systems; Graphics beyond games; Programmability experiences and trends; Supercomputing anywhere. Acknowledgements: Prof. Wen-mei Hwu, UIUC; David Kirk, NVIDIA; NCSA Summer School.
12 Current Trend: Multi-core Processors. (Figure: multi-core chip diagrams with cores C1-C4 and caches; Intel Core 2 Duo.) Past trend: increasing number of transistors on a chip and increasing clock speed. Heat is an unmanageable problem: Intel processors > 100 Watts. We will not see dramatic increases in clock speeds in the future. However, the number of transistors on a chip will continue to increase. Do we have some free space? Put more cores. What's left over? Put cache memory.
13 Multicores: Just Cores? How many cores? Intel/AMD cores; IBM Cell: 8-16 SPU; NVIDIA: 480 cores. Multicore is hardware and software together (they challenge and inspire each other). More transistors, worse reliability: error/fault detection, correction and recovery; dynamic reconfiguration. Memory: memory wall due to bandwidth (scalability?); memory wall due to power (interconnect needs power); memory size grows, but data always grows more. On-chip locality, communication.
14 IBM, SONY, TOSHIBA Cell BE: Heterogeneous. Mickey mouse. Power PC, 8 SPU. Local memory, local address space. Lots of memory copies: DMAs. Always short of memory space: cannot host all data; software cache. Two unrelated thread schedulers. Reliability: if all cores are fine, IBM supercomputer; if SPE error, sell it as PS3.
15 NVIDIA GPU
16 GPU: How Many cores? (240 in chunks of 16 way MP)
17 Is GPU driving the parallelism revolution? (Based on slide 7 of S. Green, GPU Physics, SIGGRAPH 2007 GPGPU Course.)
18 GPU performance in recent history: Performance of NVIDIA GPUs over time. (Figure: peak GFLOPS and memory bandwidth (GB/s) over time; annotations: CUDA, Fermi.)
19 CPU vs. GPU, approaching each other
20 ILP vs. Massive Data Parallelism
23 Graphics and Games: NVIDIA purchased AGEIA PhysX middleware.
24 Massive Parallelism
25 GPU: Supercomputing at Home
28 CUDA: Widely Adopted Parallel Programming Model
31 Performance of Advanced MRI Reconstruction (Wen-mei Hwu, IMPACT, UIUC)
32 GPU Speedup: The GPU gives us 100x on massively parallel algorithms (after one month of understanding the architecture). Faster is not just faster. 2-3x faster is just faster: do a little more, wait a little less; doesn't change how you work. 5-10x faster is significant: worth upgrading; worth rewriting (parts of) the application. 100x+ faster is fundamentally different: worth considering a new platform; worth re-architecting the application; makes new applications possible; drives time to discovery and creates fundamental changes in science.
34 CUDA Features (Threading): Physical partitioning in SMs. Virtual partitioning: the problem is divided into a grid of Thread Blocks (TBs); each Thread Block is composed of <= 512 threads. Threads are very lightweight. Scheduling of threads on physical cores is performed by the HW (in groups called "warps"). New warps are scheduled on memory stalls (hides latency). Many TBs can be executed on the same SM (1024 threads max), depending on the (memory) resources used. SIMD: divergent branches significantly reduce performance.
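The limits on this slide can be turned into a quick launch-configuration calculation. The following is a minimal sketch in Python (used here only for the arithmetic; it is not course material). The 512-threads-per-block and 1024-threads-per-SM limits come from the slide; the warp width of 32 and the helper names `launch_config` and `global_index` are assumptions introduced for illustration.

```python
# Limits quoted on the slide (2010-era GPU).
MAX_THREADS_PER_BLOCK = 512
MAX_THREADS_PER_SM = 1024
# Assumption (not on the slide): a warp groups 32 threads.
WARP_SIZE = 32


def launch_config(n_elements, threads_per_block):
    """Return (blocks in the grid, warps per block, blocks that fit on one SM).

    The last value only accounts for the per-SM thread limit; registers and
    shared memory can lower it further.
    """
    if not 0 < threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be between 1 and 512")
    # Round up so that every element is covered by some thread.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    warps_per_block = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    resident_blocks = MAX_THREADS_PER_SM // threads_per_block
    return blocks, warps_per_block, resident_blocks


def global_index(block_idx, thread_idx, threads_per_block):
    """Element handled by one thread of a 1D grid.

    Mirrors the usual CUDA expression blockIdx.x * blockDim.x + threadIdx.x.
    """
    return block_idx * threads_per_block + thread_idx


print(launch_config(1000, 256))   # grid size, warps per block, blocks per SM
print(global_index(3, 231, 256))  # last valid element of a 1000-element array
```

With 1000 elements and 256 threads per block, the last block is only partly used, which is why CUDA kernels normally guard the body with a bounds check on the global index.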
35 CUDA Features (Memory): Global memory (up to 4GB per card): very slow ( cycles). Texture memory (64KB per card) <cache>: read-only; useful for some kinds of access patterns. Constant memory (64KB per card) <cache>: read-only; 2 cycles (when all threads in a warp read the Shared memory (16KB per SM): 8 banks (4 bytes stride); 2 cycles if no bank conflict (consecutive accesses). Register memory: registers/SM (16 per thread if 1024 threads, 32 if 512 threads).
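The register figures on this slide (16 per thread at 1024 threads, 32 per thread at 512 threads) both imply the same per-SM register budget, and together with the 16KB of shared memory they bound how many threads can be resident on an SM. A minimal sketch of that arithmetic in Python follows. The function names are hypothetical, and the model is deliberately simplified: it assumes resources are divided exactly, whereas real hardware allocates them in coarser granules.

```python
# Per-SM register budget derived from the slide's two data points.
REGISTERS_PER_SM = 16 * 1024
assert REGISTERS_PER_SM == 32 * 512  # both data points agree

# Other per-SM limits quoted on the slides.
MAX_THREADS_PER_SM = 1024
SHARED_MEM_PER_SM = 16 * 1024  # bytes


def registers_per_thread(resident_threads):
    """Registers available to each thread when this many threads share an SM."""
    return REGISTERS_PER_SM // resident_threads


def max_resident_threads(regs_per_thread, smem_per_block, threads_per_block):
    """Threads that fit on one SM under the register, shared-memory and thread limits.

    Simplified model: whole thread blocks only, no allocation granularity.
    """
    by_threads = MAX_THREADS_PER_SM // threads_per_block
    by_regs = REGISTERS_PER_SM // (regs_per_thread * threads_per_block)
    # A block that uses no shared memory is not limited by it.
    by_smem = SHARED_MEM_PER_SM // smem_per_block if smem_per_block else by_threads
    return min(by_threads, by_regs, by_smem) * threads_per_block


print(registers_per_thread(1024), registers_per_thread(512))
print(max_resident_threads(20, 4096, 256))  # limited by registers
print(max_resident_threads(16, 8192, 256))  # limited by shared memory
```

In the second call a kernel needing 20 registers per thread can no longer fill the SM; in the third, 8KB of shared memory per block halves the resident thread count even though registers would allow more.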
36 Data Movements and Kernel Launch
37 Oil and Gas Prospection
38 RTM on GPU: Experience on Mapping. Forward: Stencil + Hessian (GPU); Boundary Conditions (GPU); Shot insertion (GPU); Receivers (GPU), for synthetic traces; Write to disk (CPU). Backward: Stencil (GPU); Boundary Conditions (GPU); Receivers shots insertions (GPU); Read from disk; Correlation.
39 RTM Port to GPU Timeline: Three months' progress for a new CUDA developer
40 RTM kernel on GPUs, Current Results: Three months' progress for a new CUDA developer
41 RTM on GPU: Kernel bottlenecks. Naïve (uses global memory only): store all the matrices in global memory; unroll the loops and create as many TBs as necessary. Bottleneck: global accesses are very slow. Shared memory: use shared memory to store the values of the previous time step. Drawback: divergent branches to load the ghost area. Bottleneck: shared memory usage; bad useful/total reads ratio due to the big stencil. 2D sliding window (proposed by Paulius Micikevicius (NVIDIA Total)): store the Y (geophysical) stencil dimension in registers; only store the ZX plane in shared memory; better useful/total reads ratio; slide the plane to the end of the cube. Bottleneck: register usage.
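One plausible reading of the "useful/total reads ratio" is the fraction of values staged in shared memory that are interior points of the tile, as opposed to ghost (halo) cells loaded only to serve the stencil. The sketch below, in Python, compares a 3D tile with a 2D plane under that reading. The stencil radius of 4, the 4-byte element size, the square/cubic tile shapes and the function names are all assumptions made for illustration; the slides do not give these values.

```python
SHARED_MEM_PER_SM = 16 * 1024  # bytes, from the CUDA Features (Memory) slide
ELEMENT_BYTES = 4              # assumption: single-precision values


def tile_bytes(tile, radius, dims):
    """Shared memory needed for a tile of `tile` interior points per side plus its halo."""
    return (tile + 2 * radius) ** dims * ELEMENT_BYTES


def useful_read_ratio(tile, radius, dims):
    """Interior points divided by all points staged (interior plus halo)."""
    return tile ** dims / (tile + 2 * radius) ** dims


RADIUS = 4  # hypothetical stencil radius

# 3D tile: 8 interior points per side already fills the 16KB of shared memory.
print(tile_bytes(8, RADIUS, 3), useful_read_ratio(8, RADIUS, 3))

# 2D plane (the sliding-window scheme keeps the third dimension in registers).
print(tile_bytes(32, RADIUS, 2), useful_read_ratio(32, RADIUS, 2))
```

Under these assumptions the 3D tile spends most of its shared memory on halo cells, while the 2D plane fits a much larger interior in less space, which is consistent with the slide's claim of a better ratio for the sliding window.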
42 Benchmarks and Lessons Learned
App. | Archit. Bottleneck | Simult. T | Kernel X | App X
H.264 | Registers, global memory latency | 3,
LBM | Shared memory capacity | 3,
RC5-72 | Registers | 3,
FEM | Global memory bandwidth | 4,
RPES | Instruction issue rate | 4,
PNS | Global memory capacity | 2,
LINPACK | Global memory bandwidth, CPU-GPU data transfer | 12,
TRACF | Shared memory capacity | 4,
FDTD | Global memory bandwidth | 1,
MRI-Q | Instruction issue rate | 8,
[HKR HotChips-2007]
More information2 MHz Lock-In Amplifier
2 MHz Lock-In Amplifier SR865 2 MHz dual phase lock-in amplifier SR865 2 MHz Lock-In Amplifier 1 mhz to 2 MHz frequency range Dual reference mode Low-noise current and voltage inputs Touchscreen data display
More informationUsing SignalTap II in the Quartus II Software
White Paper Using SignalTap II in the Quartus II Software Introduction The SignalTap II embedded logic analyzer, available exclusively in the Altera Quartus II software version 2.1, helps reduce verification
More informationChapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)
Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise
More informationQuickSpecs. NVIDIA Graphics SUPPORTED SOLUTIONS. NVIDIA Graphics. Overview QUADRO NVIDIA QUADRO K2200 L2K02AA NVIDIA QUADRO M2000M (12GB)
Overview SUPPORTED SOLUTIONS Category Part number QUADRO NVIDIA QUADRO K420 NVIDIA QUADRO K620 NVIDIA QUADRO K1200 NVIDIA QUADRO K2200 NVIDIA QUADRO M2000 NVIDIA QUADRO M4000 NVIDIA QUADRO M5000 NVIDIA
More informationMasters of Science in COMPUTER ENGINEERING
PICSEL: Measuring User-Perceived Performance to Control Dynamic Frequency Scaling IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Masters of Science in COMPUTER ENGINEERING By Jack Cosgrove
More informationSlide Set 9. for ENCM 501 in Winter Steve Norman, PhD, PEng
Slide Set 9 for ENCM 501 in Winter 2018 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary March 2018 ENCM 501 Winter 2018 Slide Set 9 slide
More informationTelephony Training Systems
Telephony Training Systems LabVolt Series Datasheet Festo Didactic en 240 V - 50 Hz 04/2018 Table of Contents General Description 2 Topic Coverage 6 Features & Benefits 6 List of Available Training Systems
More informationCritical C-RAN Technologies Speaker: Lin Wang
Critical C-RAN Technologies Speaker: Lin Wang Research Advisor: Biswanath Mukherjee Three key technologies to realize C-RAN Function split solutions for fronthaul design Goal: reduce the fronthaul bandwidth
More information3/5/2017. A Register Stores a Set of Bits. ECE 120: Introduction to Computing. Add an Input to Control Changing a Register s Bits
University of Illinois at Urbana-Champaign Dept. of Electrical and Computer Engineering ECE 120: Introduction to Computing Registers A Register Stores a Set of Bits Most of our representations use sets
More information5620 SAM SERVICE AWARE MANAGER 14.0 R7. Planning Guide
5620 SAM SERVICE AWARE MANAGER 14.0 R7 Planning Guide 3HE-10698-AAAE-TQZZA December 2016 5620 SAM Legal notice Nokia is a registered trademark of Nokia Corporation. Other products and company names mentioned
More informationLab2: Cache Memories. Dimitar Nikolov
Lab2: Cache Memories Dimitar Nikolov Goal Understand how cache memories work Learn how different cache-mappings impact CPU time Leran how different cache-sizes impact CPU time Lund University / Electrical
More informationThe AuroraScience Project
The AuroraScience Project F. S. Schifano 1 1 University of Ferrara and INFN-Ferrara November 25-26, 2009 F. S. Schifano (Univ. and INFN of Ferrara) The AuroraScience Project November 25-26, 2009 1 / 24
More informationSIGGRAPH 2013 Shaping the Future of Visual Computing
SIGGRAPH 2013 Shaping the Future of Visual Computing High Performance Graphics for 4K & Ultra High Resolution Displays Doug Traill, Senior Solutions Architect QuadroSVS@nvidia.com Things I want you to
More information