PRACE Autumn School 2010: GPU Programming
October 25-29, 2010
PRACE Autumn School, Oct 2010
Outline: GPU Programming Track

Tuesday 26th
- GPGPU: General-Purpose GPU Programming
- CUDA Architecture, Threading and Memory Model
- CUDA Programming, Runtimes and Environments
- Hands-on Lab 1: CUDA Environment Setup, Compilation and Execution Examples

Wednesday 27th
- CUDA Optimizations, Debugging and Profiling
- GPU Multiprocessing: Deploying Multi-GPU Applications
- The GPU in Heterogeneous and High-Performance Computing
- Hands-on Lab 2: Advanced Tools and Exercises; HPC Codes and Performance Evaluation
Instructors

- Manuel Ujaldón: Associate Professor, Computer Architecture Department, University of Malaga
- Nacho Navarro: Associate Professor, Computer Architecture Department, Universitat Politecnica de Catalunya (UPC); Researcher at the Barcelona Supercomputing Center (BSC); Visiting Research Professor at the University of Illinois (UIUC)
- Javier Cabezas: Ph.D. student, Computer Architecture Department, UPC; Researcher at the Barcelona Supercomputing Center; Visiting Ph.D. student at UIUC
CUDA is Popular
PUMPS Summer School
Programming and Tuning Massively Parallel Systems Summer School (PUMPS)
Teachers: Wen-mei W. Hwu, University of Illinois; David B. Kirk, NVIDIA
BSC named first CUDA Research Center in Spain

The Barcelona Supercomputing Center (BSC) has been named by NVIDIA as a 2010 CUDA Research Center, the first in Spain. The CUDA Research Center Program recognizes and fosters collaboration with research groups at universities and research institutes that are expanding the frontier of massively parallel computing. Institutions identified as CUDA Research Centers are doing world-changing research by leveraging CUDA and NVIDIA GPUs.
Hands-on Labs
Labs will be done on the AC GPU Cluster at NCSA (AC.NCSA.UIUC.EDU), an experimental system available for exploring GPU computing.
Compute Node

- HP xw9400 workstation: 2216 AMD Opteron, 2.4 GHz, dual socket, dual core; 8 GB DDR2; InfiniBand QDR
- Tesla S1070 1U GPU Computing Server: 1.3 GHz Tesla T10 processors; 4x4 GB GDDR3 SDRAM

Cluster: 32 servers (128 CPUs), 32 accelerator units (128 GPUs; 128 TF SP, 10 TF DP)
Course Wiki: http://marsa.ac.upc.edu/prace-gpu

- Course material
- Hands-on lab
- Info on textbooks
- Links to interesting educational material

Register and log in to get access to the content.
PRACE Autumn School 2010: GPU Programming
Nacho Navarro (nacho@bsc.es)
Associate Professor, Universitat Politecnica de Catalunya / Barcelona Supercomputing Center
Visiting Research Professor, UIUC, CSL
Outline

- Multicore: Dual/Quad, Cell, GPU, FPGA, ...?
- Current and future systems
- Graphics beyond games
- Programmability experiences and trends
- Supercomputing anywhere

Acknowledgements: Prof. Wen-mei Hwu, UIUC; David Kirk, NVIDIA; NCSA Summer School
Current Trend: Multi-core Processors

[Diagram: dies evolving from a single core with cache to multiple cores (C1-C4) sharing caches, e.g. the Intel Core 2 Duo]

- Past trend: increasing the number of transistors on a chip and increasing clock speed
- Heat is an unmanageable problem; Intel processors exceed 100 Watts
- We will not see the dramatic increases in clock speed in the future; however, the number of transistors on a chip will continue to increase
- Do we have some free space? Put in more cores. What's left over? Put in cache memory.
Multicores: Just Cores?

How many cores?
- Intel/AMD: 2-4-8-16 cores
- IBM Cell: 8-16 SPUs
- NVIDIA: 480 cores

Multicore is hardware and software together (they challenge and inspire each other).

- More transistors, worse reliability: error/fault detection, correction and recovery; dynamic reconfiguration
- Memory wall due to bandwidth (scalability?)
- Memory wall due to power (the interconnect needs power)
- Memory size grows, but data always grows even more
- On-chip locality and communication
IBM, Sony, Toshiba: Cell BE

- Heterogeneous: a modest ("Mickey Mouse") PowerPC core plus 8 SPUs
- Local memories with local address spaces: lots of memory copies via DMAs
- Always short of memory space: cannot host all data; software cache
- Two unrelated thread schedulers
- Reliability: if all cores are fine, it goes into an IBM supercomputer; if an SPE has an error, sell it as a PS3
NVIDIA GPU
GPU: How Many Cores? (240, in chunks of 16-way MPs)
Is the GPU driving the parallelism revolution?

Based on slide 7 of S. Green, "GPU Physics", SIGGRAPH 2007 GPGPU Course. http://www.gpgpu.org/s2007/slides/15-gpgpu-physics.pdf
GPU performance in recent history

[Chart: performance of NVIDIA GPUs over time, up to Fermi: peak GFLOPS of CUDA-era GPUs and memory bandwidth (GB/s)]
CPU vs. GPU, approaching each other
ILP vs. Massive Data Parallelism
Graphics and Games: NVIDIA purchased AGEIA's PhysX middleware.
Massive Parallelism
GPU: Supercomputing at Home
CUDA: Widely Adopted Parallel Programming Model
2007-2009
Performance of Advanced MRI Reconstruction
Wen-mei Hwu, IMPACT, UIUC
GPU Speedup

The GPU can give us 100x speedups on massively parallel algorithms (after one month of understanding the architecture).

Faster is not just faster:
- 2-3x faster is just faster: do a little more, wait a little less; doesn't change how you work
- 5-10x faster is significant: worth upgrading; worth re-writing (parts of) the application
- 100x+ faster is fundamentally different: worth considering a new platform; worth re-architecting the application; makes new applications possible; drives time to discovery and creates fundamental changes in science
CUDA Features (Threading)

- Physical partitioning into SMs; virtual partitioning of the problem
- The problem is divided into a grid of Thread Blocks (TBs)
- Each Thread Block is composed of <= 512 threads
- Threads are very lightweight
- Scheduling of threads onto physical cores is performed by the hardware, in groups called "warps"
- New warps are scheduled on memory stalls (hides latency)
- Many TBs can execute on the same SM (1024 threads max), depending on the (memory) resources used
- SIMD: divergent branches significantly reduce performance
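As a minimal sketch of this model (the kernel `scale` and its parameters are illustrative, not from the course material), each thread in the grid handles one element:

```cuda
// Minimal sketch of the CUDA threading model: the problem becomes a grid
// of thread blocks, each with <= 512 lightweight threads.
__global__ void scale(float *data, float alpha, int n)
{
    // Each thread computes one element: blockDim.x threads per block,
    // blockIdx.x selecting the block within the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard threads of a partial last block
        data[i] *= alpha;
}

// Host side: launch enough 256-thread blocks to cover n elements.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
```

The hardware splits each block into warps and swaps warps on memory stalls, which is how the latency hiding described above happens.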
CUDA Features (Memory)

- Global memory (up to 4 GB per card): very slow (400-600 cycles)
- Texture memory (64 KB per card, cached): read-only; useful for some kinds of access patterns
- Constant memory (64 KB per card, cached): read-only; 2 cycles when all threads in a warp read the same address
- Shared memory (16 KB per SM): 16 banks (4-byte stride); 2 cycles if there is no bank conflict (consecutive accesses)
- Register memory: 16,384 registers per SM (16 per thread with 1024 threads, 32 with 512 threads)
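A hedged sketch of how these memory spaces appear in a kernel (the names `coeff`, `smooth` and the tile size are made up for illustration):

```cuda
__constant__ float coeff;             // constant memory: cached, fast broadcast reads

__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float tile[256];       // shared memory: on-chip, per-SM

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage data from slow global memory (400-600 cycles) into shared memory.
    if (i < n)
        tile[threadIdx.x] = in[i];
    __syncthreads();                  // wait until the whole block has loaded

    // Consecutive threadIdx.x values hit consecutive banks: no conflicts.
    if (i < n)
        out[i] = coeff * tile[threadIdx.x];
}
```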
Data Movements and Kernel Launch
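The typical host-side pattern behind this slide, sketched with a made-up `vecScale` kernel (error checking elided):

```cuda
__global__ void vecScale(float *a, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

void run(float *h_a, int n)
{
    float *d_a;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&d_a, bytes);                     // allocate on the device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // host -> device

    vecScale<<<(n + 255) / 256, 256>>>(d_a, 2.0f, n);     // asynchronous launch

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // device -> host (waits for the kernel)
    cudaFree(d_a);
}
```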
Oil and Gas Prospection
RTM on GPU: Experience on Mapping

Forward:
- Stencil + Hessian (GPU)
- Boundary conditions (GPU)
- Shot insertion (GPU)
- Receivers (GPU), for synthetic traces
- Write to disk (CPU)

Backward:
- Stencil (GPU)
- Boundary conditions (GPU)
- Receivers' shot insertions (GPU)
- Read from disk
- Correlation
RTM Port to GPU: Timeline
Three months' progress for a new CUDA developer
RTM Kernel on GPUs: Current Results
Three months' progress for a new CUDA developer
RTM on GPU: Kernel Bottlenecks

Naïve: uses global memory only
- Store all the matrices in global memory
- Unroll the loops and create as many TBs as necessary
- Bottleneck: global accesses are very slow

Shared memory:
- Use shared memory to store the values of the previous time step
- Drawback: divergent branches to load the ghost area
- Bottleneck: shared memory usage; bad useful/total reads ratio due to the big stencil

2D sliding window (proposed by Paulius Micikevicius, NVIDIA):
- Store the Y (geophysical) stencil dimension in registers
- Only store the ZX plane in shared memory
- Better useful/total reads ratio
- Slide the plane to the end of the cube
- Bottleneck: register usage
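A rough sketch of the 2D sliding-window structure (the tile size, radius and names are illustrative assumptions, not the production RTM kernel):

```cuda
#define RADIUS 4                      // stencil half-length, illustrative

__global__ void rtm_sweep(const float *in, float *out,
                          int nz, int nx, int ny)
{
    // Only the current ZX plane (plus ghost area) lives in shared memory.
    __shared__ float plane[16 + 2 * RADIUS][16 + 2 * RADIUS];

    // The Y stencil dimension is kept in per-thread registers.
    float behind[RADIUS], current, ahead[RADIUS];

    for (int y = 0; y < ny; ++y) {
        // 1. Shift the register queue: behind <- current <- ahead[0] <- ...
        //    and read the next Y plane from global memory into ahead[RADIUS-1].
        // 2. Store this thread's value (and any ghost cells) into 'plane'.
        __syncthreads();
        // 3. Apply the stencil: combine 'plane' (Z and X neighbours) with
        //    'behind'/'ahead' (Y neighbours); write the result to 'out'.
        __syncthreads();
    }
}
```

Keeping the Y neighbours in registers is what improves the useful/total reads ratio, at the cost of register pressure, the bottleneck noted above.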
Benchmarks and Lessons Learned [HKR HotChips-2007]

App     | Architectural Bottleneck                       | Simult. Threads | Kernel X | App X
--------|------------------------------------------------|-----------------|----------|------
H.264   | Registers, global memory latency               | 3,936           | 20.2     | 1.5
LBM     | Shared memory capacity                         | 3,200           | 12.5     | 12.3
RC5-72  | Registers                                      | 3,072           | 17.1     | 11.0
FEM     | Global memory bandwidth                        | 4,096           | 11.0     | 10.1
RPES    | Instruction issue rate                         | 4,096           | 210.0    | 79.4
PNS     | Global memory capacity                         | 2,048           | 24.0     | 23.7
LINPACK | Global memory bandwidth, CPU-GPU data transfer | 12,288          | 19.4     | 11.8
TRACF   | Shared memory capacity                         | 4,096           | 60.2     | 21.6
FDTD    | Global memory bandwidth                        | 1,365           | 10.5     | 1.2
MRI-Q   | Instruction issue rate                         | 8,192           | 457.0    | 431.0

(Kernel X and App X are kernel-only and whole-application speedups.)