Profiling techniques for parallel applications

Analyzing program performance with HPCToolkit
03/10/2016, PRACE Autumn School 2016

Introduction
Focus of this session:
- Profiling of parallel applications
  - Statistical sampling
  - Introduction to HPCToolkit
- Strategies for finding optimization potential (not limited to HPCToolkit)
  - High-penalty and waste metrics
  - Profiling using expectations

Outline
- Overview: Basic profiling techniques
  - Statistical sampling vs. code instrumentation
- HPCToolkit: A quick introduction
- Effective analysis strategies
  - Pinpointing inefficiencies
  - Pinpointing scalability bottlenecks
- Practical part
  - Analysis of program profiles (hpcviewer)
  - Analysis of program traces (hpctraceviewer)

Prerequisites for Practical Part
- Download the HPCToolkit profile and trace viewers from http://hpctoolkit.org/software.html
  - hpcviewer-5.4.2
  - hpctraceviewer-5.4.2
- Try to launch them (Java required)
- Download the prepared profiles from http://tinyurl.com/pas2016-hpctk

Overview
- Statistical sampling
  - Sampling: Program flow is periodically interrupted and the current program state is examined.
  - Asynchronous sampling:
    - Timers
    - Hardware counters (CPU cycles, L3 cache misses, etc.)
  - Synchronous sampling: Calls to certain library functions are intercepted (malloc, fread, ...)
- Code instrumentation
  - Instrumentation: Code for collecting profiling information is inserted into the original program.
  - Approaches:
    - Manual (measurement APIs)
    - Automatic source level
    - Compiler assisted (e.g. gprof)
    - Binary translation
    - Runtime instrumentation

Statistical sampling: Advantages
- No changes to program or build process
  - Recommended: Debugging symbols
- No blind spots: Measurements cover
  - Library functions
  - Functions with unavailable source code
- Low overhead, typically 3 to 5%

Statistical sampling: Limitations
- Statistical sampling involves some degree of uncertainty
  - Information attributed to source lines may not be accurate
- Certain types of information are not available:
  - Number of calls of a certain function
  - Average runtime per call of a certain function

HPCToolkit: A quick introduction
- Suite of tools for program performance analysis
- Developed at Rice University, Houston, Texas
- Features:
  - Statistical sampling
  - Full call-path unwinding
  - Attribution of metrics at the level of functions, loops and source lines
  - Computation of user-defined metrics

HPCToolkit: A quick introduction
- Supports:
  - Asynchronous sampling: system timers, hardware counters (PAPI library)
  - Synchronous sampling (via LD_PRELOAD)
- Suited for:
  - Threaded applications
  - MPI applications
  - Hybrid applications (threading + MPI)

HPCToolkit: Basic workflow

Step  Command                      Description
(1)   hpcrun (or hpclink)          Measures program performance
(2)   hpcstruct                    Recovers program structure from the binary
(3)   hpcprof / hpcprof-mpi        Creates an experiment database
(4)   hpcviewer / hpctraceviewer   Displays experiment database (profile or trace view)
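
The four steps can be chained in a small script. The following is a minimal sketch and not part of the original slides: the helper names run and profile_workflow, the DRY_RUN switch, and the directory names m<N> and db-<N> are assumptions for illustration. Only options listed in this session are used, and the -I source path option is left out for brevity. With DRY_RUN=1 (the default in this sketch) the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch of the four-step workflow. All names (run, profile_workflow,
# DRY_RUN, m<N>, db-<N>) are placeholders chosen for this example.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# profile_workflow BINARY NPROCS PERIOD
profile_workflow() {
    prog=$1      # path to the binary, e.g. ./myprog
    nprocs=$2    # number of MPI processes
    period=$3    # sampling period in CPU cycles
    meas="m$nprocs"
    db="db-$nprocs"

    # (1) measure, (2) recover structure, (3) build database, (4) view
    run mpirun -n "$nprocs" hpcrun -o "$meas" -e "PAPI_TOT_CYC@$period" "$prog"
    run hpcstruct "$prog"
    run hpcprof -S "$(basename "$prog").hpcstruct" -o "$db" "$meas"
    run hpcviewer "$db"
}

profile_workflow ./myprog 4 4100100
```

Setting DRY_RUN=0 would run the tools for real, which requires HPCToolkit and an MPI launcher on the PATH.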

Step (1): Performance measurement
    # A) Sequential or threaded applications:
    hpcrun [options] command [args]
    # B) MPI or hybrid applications:
    mpirun [mpi-opts] hpcrun [options] command [args]
    # Important options:
    #   -e event@period   Specify sampling sources
    #   -t                Enable trace data collection
    #   -f frac           Enable measurement only with probability frac.
    #                     Supported number formats: 0.1 or 1/10
    #   -o outpath        Specify measurement output directory
    # Example - sample every ~4 million CPU cycles:
    mpirun -n 4 hpcrun -e PAPI_TOT_CYC@4100100 ./myprog --some-arg

Step (1): When using static linking
    # 1a) Link your application with the hpclink linker wrapper
    hpclink linker-command linker-args
    # e.g. when using mpicc
    hpclink mpicc -o myprog myprog.o module1.o module2.o ...
    # 1b) Launch your MPI application as usual
    # Use environment variables for HPCToolkit configuration
    # Example:
    export HPCRUN_EVENT_LIST="PAPI_TOT_CYC@4100100"
    mpirun -n 4 ./myprog --some-arg

Step (1): When using static linking
    # Supported environment variables:
    hpclink --help
    # Output:
    HPCRUN_EVENT_LIST=<event1>[@<period1>];...;<eventN>[@<periodN>]
        : Sampling event list; hpcrun -e/--event
    HPCRUN_TRACE=1
        : Enable tracing; hpcrun -t/--trace
    HPCRUN_PROCESS_FRACTION=<f>
        : Measure only a fraction <f> of the execution's processes;
          hpcrun -f/-fp/--process-fraction
    HPCRUN_OUT_PATH=<outpath>
        : Set output directory; hpcrun -o/--output
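
Since HPCRUN_EVENT_LIST separates its entries with semicolons, a small helper can assemble the value from individual event@period pairs. This is a sketch only: the function name hpcrun_event_list is hypothetical, and the period chosen for PAPI_L3_TCM is an arbitrary illustration value, not a recommendation from the slides.

```shell
# Join event[@period] specifications with ';', the separator shown in the
# hpclink --help output above. The function name is made up for this sketch.
hpcrun_event_list() {
    list=""
    for spec in "$@"; do
        if [ -z "$list" ]; then
            list=$spec
        else
            list="$list;$spec"
        fi
    done
    printf '%s\n' "$list"
}

# The PAPI_L3_TCM period below is an arbitrary example value.
HPCRUN_EVENT_LIST=$(hpcrun_event_list PAPI_TOT_CYC@4100100 PAPI_L3_TCM@100000)
export HPCRUN_EVENT_LIST
echo "$HPCRUN_EVENT_LIST"
```

The variable has to be quoted when it is set by hand, because an unquoted semicolon would end the shell command.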

Step (2): Program structure recovery
    # Analyze program structure (recovers loops from optimized binaries):
    hpcstruct [options] binary
    # Example:
    hpcstruct ./myprog

Step (3): Experiment database creation
    # Join (i) measurements, (ii) program structure and (iii) source code
    # together in a so-called "experiment database"
    # Three alternatives:
    # (a) threaded or small MPI executions
    hpcprof [options] measurement-directory...
    # (b) medium size MPI executions
    hpcprof-mpi [options] measurement-directory...
    # (c) large MPI executions
    mpirun [mpi-opts] hpcprof-mpi [options] measurement-directory...

Step (3): Experiment database creation
    # Important options for hpcprof and hpcprof-mpi:
    #   -I path-to-source    Location of source code
    #   -S structure-file    Specify the file generated by hpcstruct
    #   -o outpath           Name of the experiment database directory
    #   -M metric            Aggregation level for metric output:
    #        sum             Only metric sums
    #        stats           Sum, mean, stddev, min, max for each metric
    #        thread          Per-thread/process info (no aggregation)
    # Example:
    hpcprof -I ./src/+ -S myprog.hpcstruct -M stats measurements

Step (3): Experiment database creation
hpcprof vs. hpcprof-mpi
- Option -M thread
  - Not supported by hpcprof-mpi
- Per-process/thread metric creation
  - Only supported by hpcprof-mpi
  - Enables metric plots and histograms in profile viewer
- Profiles generated with hpcprof-mpi are larger

Step (4): Profile analysis
    # Profile analysis
    hpcviewer experiment-database
    # Trace analysis
    hpctraceviewer experiment-database

HPCToolkit: An example
    # (1) Measure performance of ./myprog running with 4 and 8 MPI processes
    mpirun -n 4 hpcrun -o m4 -e PAPI_TOT_CYC@4100100 ./myprog --some-arg
    mpirun -n 8 hpcrun -o m8 -e PAPI_TOT_CYC@4100100 ./myprog --some-arg
    # (2) Program structure recovery; generates ./myprog.hpcstruct
    hpcstruct ./myprog
    # (3) Metric attribution
    hpcprof -S myprog.hpcstruct -I ./src/'*' -o db-4-8 m4 m8
    # (4) View profile
    hpcviewer db-4-8

Questions: Selecting sampling sources
1. Which sampling sources are available?
2. Which sampling source(s) should I select?
3. What is an appropriate sampling frequency?

(1) Available sampling sources
    # List available sampling sources:
    hpcrun -L
    # Output (shortened):
    ===========================================================================
    Available Timer events
    ===========================================================================
    Name        Description
    ---------------------------------------------------------------------------
    WALLCLOCK   Wall clock time used by the process in microseconds.
    REALTIME    Real clock time used by the thread in microseconds.
    CPUTIME     CPU clock time used by the thread in microseconds.

    Note: do not use multiple timer events in the same run.

(1) Available sampling sources
    ===========================================================================
    Available PAPI preset events
    ===========================================================================
    Name           Profilable   Description
    ---------------------------------------------------------------------------
    PAPI_TOT_CYC   Yes          Total cycles
    PAPI_STL_ICY   Yes          Cycles with no instruction issue
    ...
    PAPI_L3_TCM    Yes          Level 3 cache misses
    ...
    PAPI_BR_CN     Yes          Conditional branch instructions
    PAPI_BR_MSP    Yes          Conditional branch instructions mispredicted
    ...
    PAPI_FP_INS    No           Floating point instructions
    PAPI_FDV_INS   Yes          Floating point divide instructions
    ...
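
Only events marked as profilable can serve as sampling sources, so it can help to filter the listing. The sketch below assumes the three-column layout shown above (Name, Profilable, Description) and may need adjusting if the output format of hpcrun -L differs on a given installation. The function name profilable_papi_events is hypothetical.

```shell
# Print the names of PAPI events whose "Profilable" column reads "Yes".
# Assumes whitespace-separated columns: Name, Profilable, Description.
profilable_papi_events() {
    awk '$1 ~ /^PAPI_/ && $2 == "Yes" { print $1 }'
}

# Demonstration on two lines copied from the listing above.
# On a system with HPCToolkit: hpcrun -L | profilable_papi_events
printf '%s\n' \
    'PAPI_TOT_CYC Yes Total cycles' \
    'PAPI_FP_INS No Floating point instructions' |
    profilable_papi_events
```

In this listing PAPI_FP_INS is reported as not profilable, so the filter drops it and keeps PAPI_TOT_CYC.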

(1) Available sampling sources
    ===========================================================================
    Other available events
    ===========================================================================
    Name      Description
    ---------------------------------------------------------------------------
    RETCNT    Each time a procedure returns, the return count for that
              procedure is incremented (experimental feature, x86 only)
    MEMLEAK   The number of bytes allocated and freed per dynamic context
    IO        The number of bytes read and written per dynamic context

(2) Selecting sampling sources
- Most important sampling source: PAPI_TOT_CYC, CPU cycles (measures execution time)
- Alternatives: WALLCLOCK, REALTIME, CPUTIME
- My experience: Most problems are traceable just by looking at execution time (PAPI_TOT_CYC).

(2) Selecting sampling sources
Sampling sources for detecting inefficiencies:
- PAPI_STL_ICY: CPU cycles without activity (waiting times)
- PAPI_L3_TCM: L3 cache misses (inefficient data access patterns)
  - Solutions: Data restructuring, loop tiling, ...
- PAPI_FP_INS, PAPI_FDV_INS: Floating point instructions
- IO: Bytes read/written
- PAPI_BR_CN, PAPI_BR_MSP: Branch misprediction

(2) Selecting sampling sources
Other potentially interesting sampling sources:
- MEMLEAK: Allocated/freed bytes, may be used for debugging
- RETCNT: Number of times a function is being called
- My experience: MEMLEAK can be helpful for debugging, but does not always work. Had problems when running with OpenMPI.

(3) Selecting the sampling frequency
Rules of thumb:
- Between 10 and 1000 samples per second and process (or thread).
- More than 1000 samples/s can
  - distort the profiling results
  - make profiles/traces unnecessarily big
- Profiling overhead should remain below 5%.
- For profiling: Longer runs with lower frequency
- For tracing: Shorter runs with higher frequency
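
To check a cycle-based sampling period against this rule of thumb, the clock rate can be divided by the period. The sketch below does this with integer shell arithmetic. The function names and the 2600 MHz clock rate are assumptions for illustration. The period 4100100 is the one used in the hpcrun examples of this session.

```shell
# samples_per_second CPU_MHZ PERIOD
# Approximate sampling rate for a period given in CPU cycles
# (integer division, so the result is rounded down).
samples_per_second() {
    echo $(( $1 * 1000000 / $2 ))
}

# rate_in_range RATE: succeeds if 10 <= RATE <= 1000 (the rule of thumb)
rate_in_range() {
    [ "$1" -ge 10 ] && [ "$1" -le 1000 ]
}

# 2600 MHz is an assumed clock rate for this example.
rate=$(samples_per_second 2600 4100100)
if rate_in_range "$rate"; then
    echo "$rate samples/s: within 10..1000"
else
    echo "$rate samples/s: outside 10..1000"
fi
```

With these example numbers the rate is 634 samples/s, which lies inside the recommended range.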

(3) Selecting the sampling frequency
- Formula for the PAPI_TOT_CYC sampling period (in cycles):
  - [CPU GHz] * 10^8 gives about 10 samples/s
  - [CPU GHz] * 10^6 gives about 1000 samples/s
  - Choose something in between
- Good frequencies for other metrics are always application and problem dependent (typically lower than the frequency used for PAPI_TOT_CYC)
- For synchronous events (IO, MEMLEAK) no sampling frequency needs to be specified. Instead, e.g. for MEMLEAK, a sampling probability can be specified: hpcrun -mp 0.1 or hpcrun -mp 1/10
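
The period for a desired sampling rate follows from dividing the clock rate (in cycles per second) by the target number of samples per second. A minimal sketch of that calculation is given below. The function name tot_cyc_event and the 2500 MHz clock rate are assumptions for illustration.

```shell
# tot_cyc_event CPU_MHZ SAMPLES_PER_SECOND
# Prints an event specification suitable for hpcrun -e. The period is the
# clock rate in cycles/s divided by the target sampling rate.
tot_cyc_event() {
    echo "PAPI_TOT_CYC@$(( $1 * 1000000 / $2 ))"
}

# Lower bound, middle ground and upper bound of the rule of thumb,
# for an assumed 2500 MHz CPU.
for rate in 10 100 1000; do
    tot_cyc_event 2500 "$rate"
done
```

The printed value can be passed directly, e.g. hpcrun -e "$(tot_cyc_event 2500 100)" ./myprog.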

Performance analysis strategies
Detecting inefficiencies:
- Monitor high-penalty events, e.g.
  - PAPI_L3_TCM
  - PAPI_STL_ICY
- Define your own waste metrics
  - E.g. "missed floating point opportunities": 2 * PAPI_TOT_CYC - PAPI_FP_INS
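
As a worked illustration of the waste metric, the sketch below evaluates it for a pair of counter values. It assumes that the factor 2 stands for a peak of two floating point instructions per cycle, and makes that factor an optional third argument. The function names and the counter values are made up for the example.

```shell
# fp_waste CYCLES FP_INS [FP_PER_CYCLE]
# Missed floating point opportunities: peak FP instructions that could have
# been issued in CYCLES, minus those actually executed. FP_PER_CYCLE
# defaults to 2 (assumed peak issue rate).
fp_waste() {
    echo $(( ${3:-2} * $1 - $2 ))
}

# fp_waste_percent CYCLES FP_INS [FP_PER_CYCLE]
# The same quantity as an integer percentage of the peak.
fp_waste_percent() {
    peak=$(( ${3:-2} * $1 ))
    echo $(( (peak - $2) * 100 / peak ))
}

# Made-up counter values: 1,000,000 cycles and 500,000 FP instructions.
fp_waste 1000000 500000
fp_waste_percent 1000000 500000
```

For these values the waste is 1500000 missed opportunities, or 75% of the assumed peak.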

Performance analysis strategies
Detecting scalability bottlenecks: Profiling using expectations
- Define your own metrics, reflecting your expectations
- Example: Strong scaling
  - Experiment database with measurements for N and 2N processes (fixed problem size)
  - Define your own metric for parallel overhead, e.g. OVERHEAD = PAPI_TOT_CYC(2N) - PAPI_TOT_CYC(N)
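
The expectation behind this metric is presumably that, under ideal strong scaling, the cycles summed over all processes stay the same when the process count doubles, so any increase counts as overhead. The sketch below computes the metric for two totals. The function names and the cycle counts are invented for illustration.

```shell
# scaling_overhead CYC_N CYC_2N
# OVERHEAD = PAPI_TOT_CYC(2N) - PAPI_TOT_CYC(N), both summed over all processes.
scaling_overhead() {
    echo $(( $2 - $1 ))
}

# overhead_percent CYC_N CYC_2N
# Overhead as an integer percentage of the cycles spent in the 2N run.
overhead_percent() {
    echo $(( ($2 - $1) * 100 / $2 ))
}

# Invented totals: 8e9 cycles with N processes, 1e10 cycles with 2N.
scaling_overhead 8000000000 10000000000
overhead_percent 8000000000 10000000000
```

With these numbers, 2000000000 cycles (20% of the 2N run) would not be explained by useful work.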

Performance analysis strategies
Further reading:
- HPCToolkit User's Manual: http://hpctoolkit.org/documentation.html
- References given in the User's Manual, in particular [3], [5], [8], [9].

Detecting inefficiencies (1/4)
- Go to directory 1-inefficiency
- Open 1a-before-simple with hpcviewer.
  - What is the hot path w.r.t. execution time?
  - Within the routine mover_pc, which lines of code are long-running?
  - Do you spot optimization potential?
- Close experiment database.

Detecting inefficiencies (2/4)
- Stay in directory 1-inefficiency
- Open 1a-before-allmetrics with hpcviewer.
  - Deselect exclusive metric columns for display
  - What is the hot path with respect to
    - Stalled CPU cycles?
    - L3 cache misses?
- Leave database open.

Detecting inefficiencies (3/4)
- In the opened database, 1a-before-allmetrics:
  - Deselect all columns except
    - PAPI_TOT_CYC:Sum (I)
    - PAPI_FP_INS:Sum (I)
  - Define a metric for missed floating point opportunities: FPWASTE = 2 * PAPI_TOT_CYC - PAPI_FP_INS
  - What is the hot path w.r.t. FPWASTE?
- Leave database open.

Detecting inefficiencies (4/4)
- In addition to 1a-before-allmetrics, open database 1b-after-allmetrics.
- Do the same for 1b-after-allmetrics as for 1a-before-allmetrics:
  - Display only PAPI_TOT_CYC:Sum (I) and PAPI_FP_INS:Sum (I)
  - Define metric FPWASTE
- Compare databases: Execution time and FPWASTE
  - Of whole run (main)
  - Of function mover_pc
- What has changed in the source code of mover_pc?
- Close both databases.

Detecting load imbalance (1/1)
- Go to directory 2-imbalance.
- Open trace-totcyc-stats with hpcviewer.
  - Display only PAPI_TOT_CYC:Mean (I) and PAPI_TOT_CYC:Max (I).
  - Define metric IMBALANCE: PAPI_TOT_CYC:Max (I) / PAPI_TOT_CYC:Mean (I)
  - Within the longest-running loop of main: Do you spot a routine with high runtime and high IMBALANCE?
- Close database, and re-open with hpctraceviewer.
  - Do you find the routine in the trace? What is happening?
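
The IMBALANCE ratio can also be reproduced outside the viewer from per-process cycle counts. The following sketch reads one value per line and prints max/mean. The function name imbalance and the sample values are hypothetical. A result of 1.00 means perfectly balanced load, and larger values mean that the slowest process deviates more from the average.

```shell
# imbalance: read one per-process cycle count per line from stdin and
# print max / mean with two decimals. Prints nothing for empty or
# all-zero input.
imbalance() {
    awk '{ sum += $1; if (NR == 1 || $1 > max) max = $1 }
         END { if (NR > 0 && sum > 0) printf "%.2f\n", max / (sum / NR) }'
}

# Hypothetical example: three processes at 100 units, one at 200.
printf '%s\n' 100 100 100 200 | imbalance
```

For the sample input the mean is 125 and the maximum is 200, so the ratio is 1.60.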

Pinpointing scalability bottlenecks (1/2)
- Go to directory 3-scalbility
- Open 1-before-128-256 with hpcviewer
  - Define a metric OVERHEAD as the difference of:
    - 2.PAPI_TOT_CYC:Sum (I) (256 procs)
    - 1.PAPI_TOT_CYC:Sum (I) (128 procs)
  - What are the hot paths w.r.t. execution time and OVERHEAD?
- Leave database open.

Pinpointing scalability bottlenecks (2/2)
- In addition to 1-before-128-256, open 2-after-128-256
  - How has the overall runtime changed?
  - Has the hot path w.r.t. execution time changed?
  - How has the source code changed in exchange.c?
- Close both databases.

Debugging
- Go to directory 4-debugging
- Open profile-mem-io with hpcviewer.
  - Which routines read/write most of the data?
  - Plot different metrics for main.
- Close database.

References
- HPCToolkit documentation: http://hpctoolkit.org/documentation.html