PRACE Autumn School GPU Programming

PRACE Autumn School 2010: GPU Programming. October 25-29, 2010.

Outline: GPU Programming Track.
Tuesday 26th: GPGPU: General-Purpose GPU Programming; CUDA Architecture, Threading and Memory Model; CUDA Programming, Runtimes and Environments; Hands-on Lab 1: CUDA Environment Setup, Compilation and Execution Examples.
Wednesday 27th: CUDA Optimizations; Debugging and Profiling; GPU Multiprocessing; Deploying Multi-GPU Applications; The GPU in Heterogeneous and High-Performance Computing; Hands-on Lab 2: Advanced Tools and Exercises, HPC Codes and Performance Evaluation.

Instructors: Manuel Ujaldón, Associate Professor, Computer Architecture Department, University of Malaga. Nacho Navarro, Associate Professor, Computer Architecture Department, Universitat Politecnica de Catalunya (UPC); researcher at the Barcelona Supercomputing Center (BSC); Visiting Research Professor at the University of Illinois (UIUC). Javier Cabezas, Ph.D. student at the Computer Architecture Department, UPC; researcher at the Barcelona Supercomputing Center; visiting Ph.D. student at UIUC.

CUDA is Popular

PUMPS: Programming and Tuning Massively Parallel Systems Summer School. Teachers: Wen-mei W. Hwu (University of Illinois) and David B. Kirk (NVIDIA).

BSC named first CUDA Research Center in Spain. The Barcelona Supercomputing Center (BSC) has been named by NVIDIA as a 2010 CUDA Research Center, the first in Spain. The CUDA Research Center Program recognizes and fosters collaboration with research groups at universities and research institutes that are expanding the frontier of massively parallel computing. Institutions identified as CUDA Research Centers are doing world-changing research by leveraging CUDA and NVIDIA GPUs.

Hands-on Labs: labs will be done on the AC GPU cluster at NCSA (AC.NCSA.UIUC.EDU), an experimental system available for exploring GPU computing.

Compute node: HP xw9400 workstation with a dual-socket, dual-core 2.4 GHz AMD Opteron 2216, 8 GB DDR2 and InfiniBand QDR, plus a Tesla S1070 1U GPU Computing Server with 1.3 GHz Tesla T10 processors and 4x4 GB GDDR3 SDRAM. Cluster: 32 servers (128 CPU cores) and 32 accelerator units (128 GPUs, 128 TF single precision, 10 TF double precision).

Course Wiki: course material, hands-on lab information, textbooks, and links to interesting educational material. http://marsa.ac.upc.edu/prace-gpu (register and log in to get access to the content).

PRACE Autumn School 2010: GPU Programming. Nacho Navarro (nacho@bsc.es), Associate Professor, Universitat Politecnica de Catalunya / Barcelona Supercomputing Center; Visiting Research Professor, UIUC CSL.

Outline: multicore (dual/quad-core, Cell, GPU, FPGA, ...?); current and future systems; graphics beyond games; programmability experiences and trends; supercomputing anywhere. Acknowledgements: Prof. Wen-mei Hwu (UIUC), David Kirk (NVIDIA), and the NCSA Summer School.

Current Trend: Multi-core Processors. [Diagram: chips evolving from a single core with cache to many cores (C1-C4) sharing caches.] The past trend was an increasing number of transistors on a chip together with increasing clock speed. Heat is now an unmanageable problem (Intel processors exceed 100 watts), so we will not see dramatic increases in clock speed in the future; however, the number of transistors on a chip will continue to increase (e.g., Intel Core 2 Duo). Do we have some free space? Put in more cores. What's left over? Fill it with cache memory.

Multicores: Just Cores? How many cores? Intel/AMD: 2-4-8-16 cores; IBM Cell: 8-16 SPUs; NVIDIA: 480 cores. Multicore is hardware and software together (they challenge and inspire each other). More transistors means worse reliability: error/fault detection, correction and recovery, and dynamic reconfiguration. Memory: a memory wall due to bandwidth (scalability?) and a memory wall due to power (the interconnect needs power); memory size grows, but data always grows even faster; on-chip locality and communication matter.

IBM, Sony, Toshiba Cell BE: a heterogeneous ("Mickey Mouse") design with one PowerPC core and 8 SPUs. Each SPU has local memory and a local address space, so there are lots of memory copies (DMAs) and it is always short of memory space: it cannot host all the data, hence a software cache. Two unrelated thread schedulers. Reliability: if all cores are fine, it goes into an IBM supercomputer; if an SPE has an error, sell it as a PS3.

NVIDIA GPU

GPU: How Many Cores? (240, in chunks of 16-way MPs.)
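As an aside (not from the slides), the multiprocessor count of the installed GPU can be queried through the CUDA runtime API; a minimal sketch:

```cuda
// Minimal sketch: query the multiprocessor count of device 0 with the
// CUDA runtime API. On the GT200-class Tesla T10 boards used in the
// hands-on labs this reports 30 multiprocessors (240 scalar cores in total).
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %d multiprocessors, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}
```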

Is the GPU driving the parallelism revolution? [1] Based on slide 7 of S. Green, GPU Physics, SIGGRAPH 2007 GPGPU Course, http://www.gpgpu.org/s2007/slides/15-gpgpu-physics.pdf

GPU performance in recent history. [Chart: performance of NVIDIA GPUs over time through Fermi, showing peak GFLOPS and memory bandwidth (GB/s), with the introduction of CUDA marked.]

CPU vs. GPU, approaching each other.

ILP vs. Massive Data Parallelism.

Graphics and Games: NVIDIA purchased the AGEIA PhysX middleware.

Massive Parallelism

GPU: Supercomputing at Home

CUDA: Widely Adopted Parallel Programming Model.

Performance of Advanced MRI Reconstruction. Wen-mei Hwu, IMPACT, UIUC.

GPU Speedup. The GPU gives us 100x speedups on massively parallel algorithms (after a month spent understanding the architecture). Faster is not just faster: 2-3x faster is just faster (do a little more, wait a little less; it doesn't change how you work). 5-10x faster is significant (worth upgrading; worth rewriting parts of the application). 100x+ faster is fundamentally different (worth considering a new platform; worth re-architecting the application; makes new applications possible; drives time to discovery and creates fundamental changes in science).

CUDA Features (Threading). Physical partitioning into SMs; virtual partitioning of the problem into a grid of thread blocks (TBs), each composed of up to 512 threads. Threads are very lightweight. Scheduling of threads onto the physical cores is performed by the hardware, in groups called warps; new warps are scheduled on memory stalls, which hides latency. Many TBs can execute on the same SM (1024 threads max), depending on the (memory) resources used. SIMD: divergent branches significantly reduce performance.
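A minimal sketch of the launch configuration described above; the kernel and variable names (scaleKernel, N) are illustrative, not taken from the course material:

```cuda
// Hypothetical sketch: a 1D grid of thread blocks with 512 threads per
// block (the per-block limit quoted on the slide). The hardware assigns
// blocks to SMs and interleaves warps to hide memory latency.
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, float alpha, int n)
{
    // Global thread index: block index times block size plus thread index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may be larger than n
        data[i] *= alpha;      // threads of a warp execute this in SIMD fashion
}

int main(void)
{
    const int N = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    dim3 block(512);                          // threads per block
    dim3 grid((N + block.x - 1) / block.x);   // enough blocks to cover N elements
    scaleKernel<<<grid, block>>>(d_data, 2.0f, N);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```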

CUDA Features (Memory). Global memory (up to 4 GB per card): very slow (400-600 cycles). Texture memory (64 KB per card, cached): read-only; useful for some kinds of access patterns. Constant memory (64 KB per card, cached): read-only; 2 cycles when all threads in a warp read the same address. Shared memory (16 KB per SM): 16 banks with a 4-byte stride; 2 cycles if there are no bank conflicts (consecutive accesses). Registers: 16,384 per SM (16 per thread with 1024 threads, 32 with 512 threads).
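A minimal sketch of how these memory spaces appear in CUDA C, assuming a compute-capability-1.x card like those above; the kernel and array names are illustrative:

```cuda
// Hypothetical sketch of the memory spaces listed on the slide.
#include <cuda_runtime.h>

__constant__ float coeff[3];   // constant memory: read-only, cached, fast when a
                               // whole warp reads the same element
                               // (filled from the host with cudaMemcpyToSymbol)

__global__ void blur1D(const float *in, float *out, int n)
{
    // Shared memory: one tile per thread block (launch with <= 512 threads).
    __shared__ float tile[512 + 2];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    // Stage data from slow global memory (hundreds of cycles) into fast
    // shared memory; consecutive threads touch consecutive addresses, which
    // gives coalesced global accesses and conflict-free bank accesses.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();

    // r lives in a register: the fastest, but most limited, resource.
    if (gid < n) {
        float r = coeff[0] * tile[lid - 1]
                + coeff[1] * tile[lid]
                + coeff[2] * tile[lid + 1];
        out[gid] = r;
    }
}
// Launch example: blur1D<<<(n + 511) / 512, 512>>>(d_in, d_out, n);
```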

Data Movements and Kernel Launch.
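The slide body is a diagram; as a hedged, self-contained sketch of the flow it names (allocate on the device, copy host to device, launch the kernel, copy the result back), using an illustrative saxpy kernel:

```cuda
// Hypothetical end-to-end sketch of the data-movement pattern.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes), *h_y = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);

    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // host -> device
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, d_x, d_y);   // kernel launch

    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);  // device -> host
    printf("y[0] = %f\n", h_y[0]);                        // expect 5.0

    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}
```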

Oil and Gas Prospection.

RTM on GPU: Experience on Mapping. Forward pass: stencil + Hessian (GPU), boundary conditions (GPU), shot insertion (GPU), receivers for synthetic traces (GPU), write to disk (CPU). Backward pass: stencil (GPU), boundary conditions (GPU), receiver and shot insertions (GPU), read from disk, correlation.

RTM Port to GPU: Timeline. Three months of progress for a new CUDA developer.

RTM Kernel on GPUs: Current Results. Three months of progress for a new CUDA developer.

RTM on GPU: Kernel Bottlenecks. Naive version: uses global memory only; store all the matrices in global memory, unroll the loops and create as many TBs as necessary. Bottleneck: global accesses are very slow. Shared-memory version: use shared memory to store the values of the previous time step. Drawback: divergent branches to load the ghost area. Bottleneck: shared memory usage, with a poor useful-to-total reads ratio due to the big stencil. 2D sliding window (proposed by Paulius Micikevicius, NVIDIA): store the Y (geophysical) stencil dimension in registers and only the ZX plane in shared memory, which gives a better useful-to-total reads ratio; slide the plane to the end of the cube. Bottleneck: register usage. A simplified sketch of the sliding-window pattern follows.
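A simplified, hypothetical sketch of the 2D sliding-window pattern (a radius-1, 7-point stencil rather than the wide RTM stencil; tile sizes and names are illustrative and not taken from the BSC code):

```cuda
// Keep the current ZX plane in shared memory, keep the Y neighbours in
// registers, and slide the window along Y. Assumes data laid out as
// in[y][z][x] (x fastest) and nx, nz multiples of the tile sizes.
#define TILE_X 16
#define TILE_Z 16

__global__ void stencil3d_sliding(const float *in, float *out,
                                  int nx, int ny, int nz)
{
    __shared__ float plane[TILE_Z + 2][TILE_X + 2];   // current ZX plane + halo

    int x  = blockIdx.x * TILE_X + threadIdx.x;       // global X
    int z  = blockIdx.y * TILE_Z + threadIdx.y;       // global Z
    int tx = threadIdx.x + 1, tz = threadIdx.y + 1;   // position inside the tile

    long stride_y = (long)nx * nz;                    // distance between ZX planes
    long idx = (long)z * nx + x;                      // element at y = 0

    // Registers hold the stencil column along Y: behind, current, in front.
    float behind  = 0.0f;
    float current = in[idx];
    float infront = in[idx + stride_y];

    for (int y = 1; y < ny - 1; ++y) {
        // Slide the window one plane along Y using only register moves.
        behind  = current;
        current = infront;
        idx    += stride_y;
        infront = in[idx + stride_y];

        __syncthreads();                              // previous plane no longer needed
        plane[tz][tx] = current;                      // only the ZX plane uses shared memory
        if (threadIdx.x == 0          && x > 0)      plane[tz][0]      = in[idx - 1];
        if (threadIdx.x == TILE_X - 1 && x < nx - 1) plane[tz][tx + 1] = in[idx + 1];
        if (threadIdx.y == 0          && z > 0)      plane[0][tx]      = in[idx - nx];
        if (threadIdx.y == TILE_Z - 1 && z < nz - 1) plane[tz + 1][tx] = in[idx + nx];
        __syncthreads();

        if (x > 0 && x < nx - 1 && z > 0 && z < nz - 1)
            out[idx] = -6.0f * current
                     + behind + infront                        // Y neighbours (registers)
                     + plane[tz][tx - 1] + plane[tz][tx + 1]   // X neighbours (shared)
                     + plane[tz - 1][tx] + plane[tz + 1][tx];  // Z neighbours (shared)
    }
}
```

Launched with a dim3 block of (TILE_X, TILE_Z) threads and a 2D grid covering the ZX plane, each block sweeps the whole Y dimension; widening the stencil requires more registers to hold the extra Y neighbours, which is exactly the register-pressure bottleneck noted above.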

Benchmarks and Lessons Learned [HKR HotChips-2007]

App.      Architectural bottleneck                         Simult. threads   Kernel speedup   App speedup
H.264     Registers, global memory latency                 3,936             20.2x            1.5x
LBM       Shared memory capacity                           3,200             12.5x            12.3x
RC5-72    Registers                                        3,072             17.1x            11.0x
FEM       Global memory bandwidth                          4,096             11.0x            10.1x
RPES      Instruction issue rate                           4,096             210.0x           79.4x
PNS       Global memory capacity                           2,048             24.0x            23.7x
LINPACK   Global memory bandwidth, CPU-GPU data transfer   12,288            19.4x            11.8x
TRACF     Shared memory capacity                           4,096             60.2x            21.6x
FDTD      Global memory bandwidth                          1,365             10.5x            1.2x
MRI-Q     Instruction issue rate                           8,192             457.0x           431.0x