PRACE Autumn School GPU Programming


1 PRACE Autumn School 2010 GPU Programming October 25-29, 2010 PRACE Autumn School, Oct

2 Outline: GPU Programming Track
Tuesday 26th
- GPGPU: General-purpose GPU Programming
- CUDA Architecture, Threading and Memory Model
- CUDA Programming, Runtimes and Environments
- Hands-on Lab 1: CUDA Environment Setup, Compilation and Execution Examples
Wednesday 27th
- CUDA Optimizations. Debugging and Profiling
- GPU Multiprocessing. Deploying Multi-GPU Applications
- The GPU on Heterogeneous and High-Performance Computing
- Hands-on Lab 2: Advanced Tools and Exercises. HPC Codes and Performance Evaluation

3 Instructors
- Manuel Ujaldón: Associate Professor, Computer Architecture Department, University of Malaga
- Nacho Navarro: Associate Professor, Computer Architecture Department, Universitat Politecnica de Catalunya (UPC); Researcher at Barcelona Supercomputing Center (BSC); Visiting Research Professor at University of Illinois (UIUC)
- Javier Cabezas: Ph.D. student at the Computer Architecture Department, UPC; Researcher at the Barcelona Supercomputing Center; Visiting Ph.D. student at UIUC

4 CUDA is Popular

5 PUMPS Summer School
Programming and Tuning Massively Parallel Systems Summer School (PUMPS)
Teachers: Wen-mei W. Hwu (University of Illinois), David B. Kirk (NVIDIA)

6 BSC named first CUDA Research Center in Spain
The Barcelona Supercomputing Center (BSC) has been named by NVIDIA as a 2010 CUDA Research Center, the first in Spain. The CUDA Research Center Program recognizes and fosters collaboration with research groups at universities and research institutes that are expanding the frontier of massively parallel computing. Institutions identified as CUDA Research Centers are doing world-changing research by leveraging CUDA and NVIDIA GPUs.

7 Hands-on Labs
Labs will be done on the AC GPU Cluster at NCSA (AC.NCSA.UIUC.EDU), an experimental system available for exploring GPU computing.

8 Compute Node
HP xw9400 workstation
- AMD Opteron 2216, 2.4 GHz, dual socket, dual core
- 8 GB DDR2
- InfiniBand QDR
Tesla S1070 1U GPU Computing Server
- 1.3 GHz Tesla T10 processors
- 4x4 GB GDDR3 SDRAM
Cluster
- Servers: 32 (128 CPUs)
- Accelerator Units: 32 (128 GPUs, 128 TF SP, 10 TF DP)
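The cluster totals on this slide can be cross-checked with a few lines of arithmetic. The sketch below only uses the figures listed above; the one assumption is that each S1070 unit holds four GPUs, which is how the "4x4 GB" memory figure is read here.

```python
# Figures taken from the slide.
servers = 32
sockets_per_server = 2       # "dual socket"
cores_per_socket = 2         # "dual core"
accelerator_units = 32
total_sp_tflops = 128.0      # single precision, whole cluster
total_dp_tflops = 10.0       # double precision, whole cluster

# Assumption: four GPUs per S1070 unit, read off "4x4 GB GDDR3".
gpus_per_unit = 4

cpu_cores = servers * sockets_per_server * cores_per_socket
gpus = accelerator_units * gpus_per_unit
sp_tflops_per_gpu = total_sp_tflops / gpus
dp_gflops_per_gpu = total_dp_tflops / gpus * 1000.0

print(cpu_cores, gpus, sp_tflops_per_gpu, dp_gflops_per_gpu)
```

Under that reading, the "128 CPUs" on the slide counts cores rather than sockets, and the peak figures work out to about 1 TF single precision and about 78 GF double precision per GPU.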

9 Course Wiki
- Course Material
- Hands-on Lab
- Info on Textbooks
- Links to interesting educational material
- Register and log in to get access to the content

10 PRACE Autumn School 2010 GPU Programming
Nacho Navarro, Associate Professor, Universitat Politecnica Catalunya / Barcelona Supercomputing Center; Visiting Research Professor, UIUC, CSL

11 Outline
- Multicore: Dual/Quad, Cell, GPU, FPGA, ?
- Current and future systems
- Graphics beyond games
- Programmability experiences and trends
- Supercomputing anywhere
Acknowledgements: Prof. Wen-mei Hwu, UIUC; David Kirk, NVIDIA; NCSA Summer School

12 Current Trend: Multi-core Processors
[Figure: multi-core chip layouts with cores C1-C4 and caches; example: Intel Core 2 Duo]
- Past trend: increasing number of transistors on a chip and increasing clock speed
- Heat is an unmanageable problem: Intel processors > 100 Watts
- We will not see dramatic increases in clock speed in the future
- However, the number of transistors on a chip will continue to increase
- Do we have some free space? Put in more cores
- What's left over? Put in cache memory

13 Multicores: Just Cores?
- How many cores? Intel/AMD cores; IBM Cell: 8-16 SPU; NVIDIA: 480 cores
- Multicore is hardware and software together (they challenge and inspire each other)
- More transistors, worse reliability: error/fault detection, correction and recovery; dynamic reconfiguration
- Memory
  - Memory wall due to bandwidth (scalability?)
  - Memory wall due to power (interconnect needs power)
  - Memory size grows, but data always grows more
- On-chip locality, communication

14 IBM, SONY, TOSHIBA Cell BE
- Heterogeneous "Mickey mouse": PowerPC, 8 SPU
- Local memory, local address space
- Lots of memory copies: DMAs
- Always short of memory space: cannot host all data; software cache
- Two unrelated thread schedulers
- Reliability: if all cores are fine, IBM supercomputer; if SPE error, sell it as PS3

15 NVIDIA GPU

16 GPU: How Many cores? (240 in chunks of 16 way MP)

17 Is GPU driving the parallelism revolution?
Based on slide 7 of S. Green, "GPU Physics", SIGGRAPH 2007 GPGPU Course.

18 GPU performance in recent history
[Figure: performance of NVIDIA GPUs over time; series: peak GFLOPS and memory bandwidth (GB/s); annotations: CUDA, Fermi]

19 CPU vs. GPU, approaching each other

20 ILP vs. Massive Data Parallelism

23 Graphics and Games: Nvidia purchased AGEIA PhysX middleware.

24 Massive Parallelism

25 GPU: Supercomputing at Home

28 CUDA: Widely Adopted Parallel Programming Model

31 Performance of Advanced MRI Reconstruction
Wen-mei Hwu, IMPACT, UIUC

32 GPU Speedup
The GPU gives us 100x (after one month of understanding the architecture) on massively parallel algorithms.
Faster is not "just faster":
- 2-3x faster is "just faster": do a little more, wait a little less; doesn't change how you work
- 5-10x faster is "significant": worth upgrading; worth re-writing (parts of) the application
- 100x+ faster is "fundamentally different": worth considering a new platform; worth re-architecting the application; makes new applications possible; drives "time to discovery" and creates fundamental changes in science


34 CUDA Features (Threading)
- Physical partitioning in SMs
- Virtual partitioning
  - The problem is divided into a grid of Thread Blocks (TBs)
  - Each Thread Block is composed of <= 512 threads
- Threads are very lightweight
- Scheduling of threads on physical cores is performed by the hardware (in groups called "warps")
- New warps are scheduled on memory stalls (hides latency)
- Many TBs can be executed on the same SM (1024 threads max), depending on the (memory) resources used
- SIMD: divergent branches significantly reduce performance
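The limits on this slide fix the launch geometry of a kernel. The following is a minimal host-side model in Python (not CUDA code) of how a one-dimensional problem is split into thread blocks and how a thread's global element index is derived. The 512-thread block limit and 1024-thread SM limit come from the slide; the warp size of 32 and all function names are assumptions made for illustration.

```python
MAX_THREADS_PER_BLOCK = 512   # from the slide
MAX_THREADS_PER_SM = 1024     # from the slide
WARP_SIZE = 32                # assumption: not stated on the slide


def grid_size(n_elements, block_size):
    """Thread blocks needed to cover n_elements (ceiling division)."""
    if not 0 < block_size <= MAX_THREADS_PER_BLOCK:
        raise ValueError("block_size must be in 1..512")
    return (n_elements + block_size - 1) // block_size


def global_index(block_idx, block_size, thread_idx):
    """Element handled by a thread, mirroring blockIdx * blockDim + threadIdx."""
    return block_idx * block_size + thread_idx


def warps_per_block(block_size):
    """Warps the hardware schedules for one block (last warp may be partial)."""
    return (block_size + WARP_SIZE - 1) // WARP_SIZE


def max_resident_blocks(block_size):
    """Upper bound from the thread limit alone; shared memory and register
    usage can lower the real number."""
    return MAX_THREADS_PER_SM // block_size


print(grid_size(1000, 256), warps_per_block(256), max_resident_blocks(256))
```

With 1000 elements and 256-thread blocks the grid has 4 blocks, so the last block contains threads whose global index is past the end of the data; a real kernel guards against that with a bounds check.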

35 CUDA Features (Memory)
- Global memory (up to 4 GB per card): very slow (400-600 cycles)
- Texture memory (64 KB per card), cached, read-only: useful for some kinds of access patterns
- Constant memory (64 KB per card), cached, read-only: 2 cycles (when all threads in a warp read the same address)
- Shared memory (16 KB per SM): 8 banks (4-byte stride); 2 cycles if no bank conflict (consecutive accesses)
- Register memory: registers/SM (16 per thread if 1024 threads, 32 if 512 threads)
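Two of the rules on this slide are plain arithmetic and can be sketched in a few lines of Python. The register budget of 16384 per SM is not stated on the slide; it is inferred here from the two data points given (16 x 1024 = 32 x 512). The bank-conflict model assumes thread i reads 4-byte word i * stride and that as many threads access shared memory together as there are banks; the function names are hypothetical.

```python
from collections import Counter


def registers_per_thread(registers_per_sm, resident_threads):
    """Registers each thread can use when the SM's budget is split evenly."""
    return registers_per_sm // resident_threads


def bank_conflict_degree(stride_words, n_threads, n_banks):
    """Largest number of threads hitting the same bank when thread i reads
    word i * stride_words. A result of 1 means no conflict."""
    hits = Counter((i * stride_words) % n_banks for i in range(n_threads))
    return max(hits.values())


# Assumption: 16384 registers per SM, inferred from 16 * 1024 and 32 * 512.
REGISTERS_PER_SM = 16 * 1024
print(registers_per_thread(REGISTERS_PER_SM, 1024))   # 16
print(registers_per_thread(REGISTERS_PER_SM, 512))    # 32

# Bank count taken from the slide; thread count chosen to match it.
N_BANKS = 8
for stride in (1, 2, 3, 8):
    print(stride, bank_conflict_degree(stride, N_BANKS, N_BANKS))
```

Consecutive accesses (stride 1) and odd strides spread over all banks, while a stride equal to the bank count sends every thread to the same bank and serializes the access.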

36 Data Movements and Kernel Launch

37 Oil and Gas Prospection

38 RTM on GPU: Experience on Mapping
Forward
- Stencil + Hessian (GPU)
- Boundary Conditions (GPU)
- Shot insertion (GPU)
- Receivers (GPU), for synthetic traces
- Write to disk (CPU)
Backward
- Stencil (GPU)
- Boundary Conditions (GPU)
- Receivers shots insertions (GPU)
- Read from disk
- Correlation
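The forward and backward phases listed above can be illustrated with a toy one-dimensional model. This is a sketch under strong assumptions, not the course's RTM code: it uses a 3-point stencil on a 1-D grid, fixed zero boundaries, an in-memory list in place of the disk, and a plain product as the correlation step. The Hessian term from the slide is not modelled, and every name and parameter is hypothetical.

```python
def step(u_prev, u_curr, c2):
    """One explicit time step of the 1-D wave equation (3-point stencil).
    The two boundary cells stay at zero, a crude boundary condition."""
    n = len(u_curr)
    u_next = [0.0] * n
    for i in range(1, n - 1):
        lap = u_curr[i - 1] - 2.0 * u_curr[i] + u_curr[i + 1]
        u_next[i] = 2.0 * u_curr[i] - u_prev[i] + c2 * lap
    return u_next


def rtm_1d(n, nt, src_pos, rec_pos, wavelet, c2=0.25):
    # Forward phase: stencil, shot insertion, receiver recording, store.
    snapshots, trace = [], []
    u_prev, u_curr = [0.0] * n, [0.0] * n
    for t in range(nt):
        u_next = step(u_prev, u_curr, c2)
        u_next[src_pos] += wavelet[t]      # shot insertion
        trace.append(u_next[rec_pos])      # receiver (synthetic trace)
        snapshots.append(u_next)           # stands in for "write to disk"
        u_prev, u_curr = u_curr, u_next

    # Backward phase: stencil, receiver insertion, load, correlation.
    image = [0.0] * n
    v_prev, v_curr = [0.0] * n, [0.0] * n
    for t in reversed(range(nt)):
        v_next = step(v_prev, v_curr, c2)
        v_next[rec_pos] += trace[t]        # receiver data inserted in reverse
        fwd = snapshots[t]                 # stands in for "read from disk"
        for i in range(n):
            image[i] += fwd[i] * v_next[i]  # correlation of both wavefields
        v_prev, v_curr = v_curr, v_next
    return trace, image


trace, image = rtm_1d(n=32, nt=20, src_pos=8, rec_pos=24,
                      wavelet=[1.0] + [0.0] * 19)
print(len(trace), len(image))
```

The backward loop consumes the stored forward wavefields in reverse time order, which is why the slide pairs a write to disk in the forward phase with a read from disk in the backward phase.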

39 RTM Port to GPU Timeline
Three months progress for a new CUDA developer

40 RTM kernel on GPUs Current Results
Three months progress for a new CUDA developer

41 RTM on GPU: Kernel bottlenecks
Naïve: uses global memory only
- Store all the matrices in global memory
- Unroll the loops and create as many TBs as necessary
- Bottleneck: global accesses are very slow
Shared memory
- Use shared memory to store the values of the previous time step
- Drawback: divergent branches to load the ghost area
- Bottleneck: shared memory usage
- Bad useful/total reads ratio due to the big stencil
2D sliding window: proposed by Paulius Micikevicius (NVIDIA Total)
- Store the Y (geophysical) stencil dimension in registers
- Only store the ZX plane in shared memory
- Better useful/total reads ratio
- Slide the plane to the end of the cube
- Bottleneck: register usage
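The "useful/total reads ratio" can be made concrete with a small calculation. The sketch below compares a cubic shared-memory tile with a planar one, assuming the ghost area is a full border of width equal to the stencil radius on every side held in shared memory. That is a simplification (a star-shaped stencil does not need the corners), and the tile size of 16 and radius of 4 are hypothetical values chosen only for illustration.

```python
def useful_read_ratio_3d(tile, radius):
    """Useful/total loads for a tile x tile x tile block with a ghost border
    of `radius` cells on every face."""
    loaded = (tile + 2 * radius) ** 3
    return tile ** 3 / loaded


def useful_read_ratio_2d(tile, radius):
    """Same ratio when only one tile x tile plane (plus its border) lives in
    shared memory and the third dimension is kept in registers."""
    loaded = (tile + 2 * radius) ** 2
    return tile ** 2 / loaded


TILE, RADIUS = 16, 4   # hypothetical values, not from the slides
print(round(useful_read_ratio_3d(TILE, RADIUS), 3))   # 0.296
print(round(useful_read_ratio_2d(TILE, RADIUS), 3))   # 0.444
```

In this model the planar layout wastes a smaller share of its loads on ghost cells, which matches the slide's claim of a better ratio; the cost moves to registers, which hold the third stencil dimension.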

42 Benchmarks and Lessons Learned

App.      Archit. Bottleneck                                  Simult. T   Kernel X   App X
H.264     Registers, global memory latency                    3,
LBM       Shared memory capacity                              3,
RC5-72    Registers                                           3,
FEM       Global memory bandwidth                             4,
RPES      Instruction issue rate                              4,
PNS       Global memory capacity                              2,
LINPACK   Global memory bandwidth, CPU-GPU data transfer      12,
TRACF     Shared memory capacity                              4,
FDTD      Global memory bandwidth                             1,
MRI-Q     Instruction issue rate                              8,

[HKR HotChips-2007]
