Profiling techniques for parallel applications
Analyzing program performance with HPCToolkit
17/04/2014, PRACE Spring School 2014

Introduction
- Thomas Ponweiser, Johannes Kepler University Linz (JKU)
- Involved in PRACE 3IP, WP 7
  - Subtask: Debugging and profiling techniques
  - Support expert for Preparatory Access Type C
- Personal background:
  - Software developer since 2008
  - Currently finishing my studies in Technical Mathematics

Introduction: Focus of this session
- Profiling of parallel applications
  - Statistical sampling
  - Introduction to HPCToolkit
- Strategies for finding optimization potential (not limited to HPCToolkit)
  - High-penalty and waste metrics
  - Profiling using expectations

Outline
- Overview: Basic profiling techniques
  - Statistical sampling vs. code instrumentation
- HPCToolkit: A quick introduction
- Effective analysis strategies
  - Pinpointing inefficiencies
  - Pinpointing scalability bottlenecks
- Practical part
  - Analysis of program profiles (hpcviewer)
  - Analysis of program traces (hpctraceviewer)

Prerequisites for the practical part
- Download the HPCToolkit profile and trace viewers from http://hpctoolkit.org/software.html
  - hpcviewer-5.3.2
  - hpctraceviewer-5.3.2
- Try to launch them (Java required)
- Download the prepared profiles

Overview

Statistical sampling
- Sampling: Program flow is periodically interrupted and the current program state is examined.
- Asynchronous sampling:
  - Timers
  - Hardware counters (CPU cycles, L3 cache misses, etc.)
- Synchronous sampling:
  - Calls to certain library functions are intercepted (e.g. malloc, fread)

Code instrumentation
- Instrumentation: Code for collecting profiling information is inserted into the original program.
- Approaches:
  - Manual (measurement APIs)
  - Automatic source level
  - Compiler assisted (e.g. gprof)
  - Binary translation
  - Runtime instrumentation
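To make the "manual instrumentation" approach concrete, here is a minimal Python sketch. It is not part of HPCToolkit; the `instrumented` decorator, the `PROFILE` registry and the `work` function are hypothetical names chosen for illustration. The inserted measurement code records exactly the two quantities that sampling cannot deliver: the number of calls and the average runtime per call.

```python
import functools
import time

# Hypothetical registry: function name -> [number of calls, accumulated seconds]
PROFILE = {}


def instrumented(func):
    """Wrap func so that every call is counted and timed (manual instrumentation)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs after the call, even if func raised an exception.
            entry = PROFILE.setdefault(func.__name__, [0, 0.0])
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper


@instrumented
def work(n):
    """Toy workload standing in for a real routine."""
    return sum(i * i for i in range(n))


for _ in range(5):
    work(10_000)

calls, seconds = PROFILE["work"]
print(f"work: {calls} calls, {seconds / calls:.6f} s per call on average")
```

The price is that the program has to be changed, and every call pays for two timer reads, which is why very short, frequently called functions tend to be distorted by instrumentation.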

Statistical sampling: Advantages
- No changes to program or build process
  - Recommended: debugging symbols
- No blind spots: measurements cover
  - Library functions
  - Functions with unavailable source code
- Low overhead, typically 3 to 5%

Statistical sampling: Limitations
- Statistical sampling involves some degree of uncertainty
  - Information attributed to source lines may not be accurate
- Certain types of information are not available:
  - Number of calls of a certain function
  - Average runtime per call of a certain function
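A rough feeling for the "degree of uncertainty" can be obtained from a simple binomial model. The sketch below assumes that samples are independent and that each sample lands in a given function with a fixed probability; real samples are correlated, so treat the numbers as an order-of-magnitude estimate only. The function name `relative_std_error` and the example run length are made up for illustration.

```python
import math


def relative_std_error(fraction, n_samples):
    """Relative standard error of a sampled time fraction (binomial model).

    fraction  -- true share of samples that land in the code region, in (0, 1]
    n_samples -- total number of samples taken during the run
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # std(p_hat) = sqrt(p * (1 - p) / n); dividing by p gives the relative error.
    return math.sqrt((1.0 - fraction) / (fraction * n_samples))


# Hypothetical run: 100 samples/s for 100 s, i.e. 10,000 samples in total.
n = 100 * 100
for fraction in (0.5, 0.05, 0.001):
    err = relative_std_error(fraction, n)
    print(f"{fraction:6.1%} of samples -> relative error about {err:.1%}")
```

Under this model, hot regions are measured reliably, while numbers attributed to rarely hit source lines should be read with caution.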

HPCToolkit: A quick introduction
- Suite of tools for program performance analysis
- Developed at Rice University, Houston, Texas
- Features:
  - Statistical sampling
  - Full call-path unwinding
  - Attribution of metrics at the level of functions, loops and source lines
  - Computation of user-defined metrics

HPCToolkit: A quick introduction
- Supports:
  - Asynchronous sampling: system timers, hardware counters (PAPI library)
  - Synchronous sampling (via LD_PRELOAD)
- Suited for:
  - Threaded applications
  - MPI applications
  - Hybrid applications (threading + MPI)

HPCToolkit: Basic workflow

Step  Command                      Description
(1)   hpcrun                       Measures program performance
(2)   hpcstruct                    Recovers program structure from the binary
(3)   hpcprof / hpcprof-mpi        Creates an experiment database
(4)   hpcviewer / hpctraceviewer   Displays experiment database (profile or trace view)

Step (1): Performance measurement

    # A) Sequential or threaded applications:
    hpcrun [options] command [args]

    # B) MPI or hybrid applications:
    mpirun [mpi-opts] hpcrun [options] command [args]

    # Important options:
    #   -e event@period   Specify sampling sources
    #   -t                Enable trace data collection
    #   -f frac           Enable measurement only with probability frac.
    #                     Supported number formats: 0.1 or 1/10
    #   -o outpath        Specify measurement output directory

    # Example - sample every ~4 million CPU cycles:
    mpirun -n 4 hpcrun -e PAPI_TOT_CYC@4100100 ./myprog --some-arg

Step (2): Program structure recovery

    # Analyze program structure (recovers loops from optimized binaries):
    hpcstruct [options] binary

    # Example:
    hpcstruct ./myprog

Step (3): Experiment database creation

    # Join (i) measurements, (ii) program structure and (iii) source code
    # together in a so-called "experiment database".
    # Three alternatives:

    # (a) threaded or small MPI executions
    hpcprof [options] measurement-directory...

    # (b) medium size MPI executions
    hpcprof-mpi [options] measurement-directory...

    # (c) large MPI executions
    mpirun [mpi-opts] hpcprof-mpi [options] measurement-directory...

Step (3): Experiment database creation

    # Important options for hpcprof and hpcprof-mpi:
    #   -I path-to-source   Location of source code
    #   -S structure-file   Specify the file generated by hpcstruct
    #   -o outpath          Name of the experiment database directory
    #   -M metric           Aggregation level for metric output:
    #                         sum     Only metric sums
    #                         stats   Sum, mean, stddev, min, max for each metric
    #                         thread  Per-thread/process info (no aggregation)

    # Example:
    hpcprof -I ./src/'*' -S myprog.hpcstruct -M stats measurements

Step (3): Experiment database creation
hpcprof vs. hpcprof-mpi
- Option -M thread
  - Not supported by hpcprof-mpi
- Per-process/thread metric creation
  - Only supported by hpcprof-mpi
  - Enables metric plots and histograms in the profile viewer
  - Profiles generated with hpcprof-mpi are larger

Step (4): Profile analysis

    # Profile analysis
    hpcviewer experiment-database

    # Trace analysis
    hpctraceviewer experiment-database

HPCToolkit: An example

    # (1) Measure performance of ./myprog running with 4 and 8 MPI processes
    mpirun -n 4 hpcrun -o m4 -e PAPI_TOT_CYC@4100100 ./myprog --some-arg
    mpirun -n 8 hpcrun -o m8 -e PAPI_TOT_CYC@4100100 ./myprog --some-arg

    # (2) Program structure recovery; generates ./myprog.hpcstruct
    hpcstruct ./myprog

    # (3) Metric attribution
    hpcprof -S myprog.hpcstruct -I ./src/'*' -o db-4-8 m4 m8

    # (4) View profile
    hpcviewer db-4-8
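When the same workflow is repeated for several process counts, it can help to generate the command lines from a script. The following Python sketch only assembles and prints the commands from the example (a dry run); it does not execute anything. The helper names `measurement_cmd` and `workflow` and the directory naming scheme (`m<N>`, `db-<N1>-<N2>`) are assumptions taken from the example, not HPCToolkit conventions.

```python
import os
import shlex


def measurement_cmd(nprocs, program, args, event="PAPI_TOT_CYC", period=4100100):
    """Step (1): hpcrun under mpirun, writing measurements to directory m<nprocs>."""
    return ["mpirun", "-n", str(nprocs), "hpcrun", "-o", f"m{nprocs}",
            "-e", f"{event}@{period}", program, *args]


def workflow(proc_counts, program="./myprog", args=("--some-arg",), src="./src/*"):
    """Return the command lines of steps (1) to (4) as argument lists."""
    cmds = [measurement_cmd(n, program, args) for n in proc_counts]
    # Step (2): structure recovery; the example states it generates <binary>.hpcstruct.
    cmds.append(["hpcstruct", program])
    struct_file = os.path.basename(program) + ".hpcstruct"
    # Step (3): one experiment database combining all measurement directories.
    database = "db-" + "-".join(str(n) for n in proc_counts)
    cmds.append(["hpcprof", "-S", struct_file, "-I", src, "-o", database,
                 *[f"m{n}" for n in proc_counts]])
    # Step (4): open the database in the profile viewer.
    cmds.append(["hpcviewer", database])
    return cmds


for cmd in workflow([4, 8]):
    # shlex.join quotes the '*' so the shell passes it on literally.
    print(shlex.join(cmd))
```

To actually run the steps, each argument list could be passed to `subprocess.run(cmd, check=True)` on a system where MPI and HPCToolkit are installed.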

Questions: Selecting sampling sources
1. Which sampling sources are available?
2. Which sampling source(s) should I select?
3. What is an appropriate sampling frequency?

(1) Available sampling sources

    # List available sampling sources:
    hpcrun -l

    # Output (shortened):
    ===========================================================================
    Available Timer events
    ===========================================================================
    Name        Description
    ---------------------------------------------------------------------------
    WALLCLOCK   Wall clock time used by the process in microseconds.
    REALTIME    Real clock time used by the thread in microseconds.
    CPUTIME     CPU clock time used by the thread in microseconds.

    Note: do not use multiple timer events in the same run.

(1) Available sampling sources

    ===========================================================================
    Available PAPI preset events
    ===========================================================================
    Name           Profilable   Description
    ---------------------------------------------------------------------------
    PAPI_TOT_CYC   Yes          Total cycles
    PAPI_STL_ICY   Yes          Cycles with no instruction issue
    ...
    PAPI_L3_TCM    Yes          Level 3 cache misses
    ...
    PAPI_BR_CN     Yes          Conditional branch instructions
    PAPI_BR_MSP    Yes          Conditional branch instructions mispredicted
    ...
    PAPI_FP_INS    No           Floating point instructions
    PAPI_FDV_INS   Yes          Floating point divide instructions
    ...

(1) Available sampling sources

    ===========================================================================
    Other available events
    ===========================================================================
    Name      Description
    ---------------------------------------------------------------------------
    RETCNT    Each time a procedure returns, the return count for that
              procedure is incremented (experimental feature, x86 only)
    MEMLEAK   The number of bytes allocated and freed per dynamic context
    IO        The number of bytes read and written per dynamic context

(2) Selecting sampling sources
- Most important sampling source: PAPI_TOT_CYC
  - CPU cycles (measures execution time)
  - Alternatives: WALLCLOCK, REALTIME, CPUTIME
- My experience: Most problems are traceable just by looking at execution time (PAPI_TOT_CYC).

(2) Selecting sampling sources
Sampling sources for detecting inefficiencies:
- PAPI_STL_ICY: CPU cycles without activity (waiting times)
- PAPI_L3_TCM: L3 cache misses (inefficient data access patterns)
  - Solutions include data restructuring and loop tiling
- PAPI_FP_INS, PAPI_FDV_INS: Floating point instructions
- IO: Bytes read/written
- PAPI_BR_CN, PAPI_BR_MSP: Branch misprediction

(2) Selecting sampling sources
Other potentially interesting sampling sources:
- MEMLEAK: Allocated/freed bytes, may be used for debugging
- RETCNT: Number of times a function is called
- My experience: MEMLEAK can be helpful for debugging, but does not always work. I had problems when running with OpenMPI.

(3) Selecting the sampling frequency
Rules of thumb:
- Between 10 and 1000 samples per second and process (or thread).
- More than 1000 samples/s can
  - distort the profiling results
  - make profiles/traces unnecessarily big
- Profiling overhead should remain below 5%.
- For profiling: longer runs with lower frequency
- For tracing: shorter runs with higher frequency

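(3) Selecting the sampling frequency
- Formula for the PAPI_TOT_CYC sampling period (in cycles):
  - $[\text{CPU GHz}] \cdot 10^{8}$ gives about 10 samples/s
  - $[\text{CPU GHz}] \cdot 10^{6}$ gives about 1000 samples/s
  - Choose something in between
  - Reasoning: a CPU running at $f$ GHz executes $f \cdot 10^{9}$ cycles per second, so a period of $P$ cycles yields $f \cdot 10^{9} / P$ samples/s
- Good frequencies for other metrics are always application and problem dependent
- For synchronous events (IO, MEMLEAK) no sampling frequency needs to be specified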
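The period for a given target rate is simply the clock rate divided by the desired number of samples per second. The following Python sketch does this arithmetic; the 2.5 GHz clock rate is an assumed example value, and the helper names are hypothetical. Only the `event@period` syntax is taken from the hpcrun slides.

```python
def sampling_period(cpu_ghz, samples_per_second):
    """Cycles between two samples for a target sampling rate."""
    cycles_per_second = cpu_ghz * 1e9
    return round(cycles_per_second / samples_per_second)


def sampling_rate(cpu_ghz, period):
    """Samples per second that result from a given period in cycles."""
    return cpu_ghz * 1e9 / period


def hpcrun_event(event, period):
    """Format an event specification as passed to 'hpcrun -e'."""
    return f"{event}@{period}"


CPU_GHZ = 2.5  # assumed clock rate, adapt to your machine

for rate in (10, 100, 1000):
    period = sampling_period(CPU_GHZ, rate)
    print(f"{rate:5d} samples/s -> -e {hpcrun_event('PAPI_TOT_CYC', period)}")

# Rate that the period used in the hpcrun examples would give on this CPU:
print(f"period 4100100 -> {sampling_rate(CPU_GHZ, 4100100):.0f} samples/s")
```

On such a machine, the period 4100100 used in the earlier examples lies between the two rule-of-thumb bounds.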

Performance analysis strategies
- Detecting inefficiencies:
  - Monitor high-penalty events, e.g. PAPI_L3_TCM, PAPI_STL_ICY
  - Define your own waste metrics, e.g. "missed floating point opportunities":
    2 * PAPI_TOT_CYC - PAPI_FP_INS
    (the factor 2 presumably stands for the peak number of floating point operations per cycle)
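A waste metric of this kind is plain arithmetic on two counters. The sketch below assumes a peak of 2 floating point operations per cycle (hardware dependent) and uses made-up counter values; `fp_waste` and `fp_waste_fraction` are hypothetical helper names, not HPCToolkit functions.

```python
def fp_waste(tot_cyc, fp_ins, peak_flops_per_cycle=2):
    """Missed floating point opportunities: peak possible minus actually executed."""
    return peak_flops_per_cycle * tot_cyc - fp_ins


def fp_waste_fraction(tot_cyc, fp_ins, peak_flops_per_cycle=2):
    """Waste relative to the peak, between 0 (at peak) and 1 (no FP work at all)."""
    return fp_waste(tot_cyc, fp_ins, peak_flops_per_cycle) / (peak_flops_per_cycle * tot_cyc)


# Made-up counter values for two routines: (PAPI_TOT_CYC, PAPI_FP_INS)
counters = {
    "mover": (1_000_000, 500_000),
    "solver": (400_000, 600_000),
}

for name, (cyc, fp) in counters.items():
    print(f"{name}: waste = {fp_waste(cyc, fp)}, "
          f"fraction of peak wasted = {fp_waste_fraction(cyc, fp):.2f}")
```

The absolute waste points to where most opportunities are lost, while the fraction shows how far a routine is from peak. Sorting by the absolute value is what finding the "hot path" with respect to a waste metric amounts to.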

Performance analysis strategies
- Detecting scalability bottlenecks: profiling using expectations
  - Define your own metrics, reflecting your expectations
- Example: Strong scaling
  - Experiment database with measurements for N and 2N processes (fixed problem size)
  - Define your own metric for parallel overhead, e.g.
    OVERHEAD = PAPI_TOT_CYC(2N) - PAPI_TOT_CYC(N)
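The expectation behind this metric is that, for a fixed problem size, the cycles summed over all processes stay the same when the process count doubles; anything on top is overhead. A small Python sketch with made-up per-routine cycle sums (the routine names and numbers are purely illustrative):

```python
def parallel_overhead(cycles_n, cycles_2n):
    """Per-scope OVERHEAD = PAPI_TOT_CYC(2N) - PAPI_TOT_CYC(N), summed over processes."""
    scopes = sorted(set(cycles_n) | set(cycles_2n))
    return {s: cycles_2n.get(s, 0) - cycles_n.get(s, 0) for s in scopes}


# Made-up cycle sums (in millions) per routine for N and 2N processes.
cycles_n = {"compute": 800, "exchange": 150, "io": 50}
cycles_2n = {"compute": 810, "exchange": 420, "io": 55}

overhead = parallel_overhead(cycles_n, cycles_2n)

# Rank routines by overhead, largest first: the top entry is the scalability suspect.
for scope, value in sorted(overhead.items(), key=lambda item: item[1], reverse=True):
    print(f"{scope:10s} overhead = {value}")
```

In this toy data, `compute` scales almost perfectly, while `exchange` accounts for nearly all of the additional cycles.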

Performance analysis strategies
- Further reading:
  - HPCToolkit User's Manual: http://hpctoolkit.org/documentation.html
  - References given in the User's Manual, in particular [3], [5], [8], [9]

Detecting inefficiencies (1/4)
- Go to directory 1-inefficiency
- Open 1a-before-simple with hpcviewer.
  - What is the hot path w.r.t. execution time?
  - Within the routine mover_pc, which lines of code are long-running?
  - Do you spot optimization potential?
- Close experiment database.

Detecting inefficiencies (2/4)
- Stay in directory 1-inefficiency
- Open 1a-before-allmetrics with hpcviewer.
  - Deselect exclusive metric columns for display
  - What is the hot path with respect to
    - Stalled CPU cycles?
    - L3 cache misses?
- Leave database open.

Detecting inefficiencies (3/4)
- In the opened database 1a-before-allmetrics:
  - Deselect all columns except PAPI_TOT_CYC:Sum (I) and PAPI_FP_INS:Sum (I)
  - Define a metric for missed floating point opportunities:
    FPWASTE = 2 * PAPI_TOT_CYC - PAPI_FP_INS
  - What is the hot path w.r.t. FPWASTE?
- Leave database open.

Detecting inefficiencies (4/4)
- In addition to 1a-before-allmetrics, open database 1b-after-allmetrics.
- Do the same for 1b-after-allmetrics as for 1a-before-allmetrics:
  - Display only PAPI_TOT_CYC:Sum (I) and PAPI_FP_INS:Sum (I)
  - Define metric FPWASTE
- Compare databases: execution time and FPWASTE
  - Of whole run (main)
  - Of function mover_pc
- What has changed in the source code of mover_pc?
- Close both databases.

Detecting load imbalance (1/1)
- Go to directory 2-imbalance.
- Open trace-totcyc-stats with hpcviewer.
  - Display only PAPI_TOT_CYC:Mean (I) and PAPI_TOT_CYC:Max (I).
  - Define metric IMBALANCE:
    PAPI_TOT_CYC:Max (I) / PAPI_TOT_CYC:Mean (I)
  - Within the longest-running loop of main: Do you spot a routine with high runtime and high IMBALANCE?
- Close database, and re-open with hpctraceviewer.
  - Do you find the routine in the trace? What is happening?
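To see how the IMBALANCE metric behaves, here is a minimal Python sketch with made-up per-process cycle counts for two routines; the function name `imbalance` and all values are illustrative assumptions. A value of 1.0 means perfectly balanced work, and larger values mean that the slowest process does correspondingly more than the average.

```python
def imbalance(cycles_per_process):
    """IMBALANCE = Max / Mean of a metric over all processes (1.0 = balanced)."""
    if not cycles_per_process:
        raise ValueError("need at least one process")
    mean = sum(cycles_per_process) / len(cycles_per_process)
    return max(cycles_per_process) / mean


# Made-up inclusive cycle counts per MPI process for two routines.
routines = {
    "balanced_routine": [100, 100, 100, 100],
    "skewed_routine": [100, 100, 100, 300],
}

for name, cycles in routines.items():
    print(f"{name}: mean = {sum(cycles) / len(cycles):.0f}, "
          f"max = {max(cycles)}, IMBALANCE = {imbalance(cycles):.2f}")
```

A high ratio alone is not enough to matter: a routine is worth a closer look when it combines a high IMBALANCE with a high share of the runtime, which is exactly what the exercise asks for.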

Pinpointing scalability bottlenecks (1/2)
- Go to directory 3-scalbility
- Open 1-before-128-256 with hpcviewer
  - Define a metric OVERHEAD as the difference of
    - 2.PAPI_TOT_CYC:Sum (I) (256 procs) and
    - 1.PAPI_TOT_CYC:Sum (I) (128 procs)
  - What are the hot paths w.r.t. execution time and OVERHEAD?
- Leave database open.

Pinpointing scalability bottlenecks (2/2)
- In addition to 1-before-128-256, open 2-after-128-256
  - How has the overall runtime changed?
  - Has the hot path w.r.t. execution time changed?
  - How has the source code changed in exchange.c?
- Close both databases.

Debugging
- Go to directory 4-debugging
- Open profile-mem-io with hpcviewer.
  - Which routines read/write most of the data?
  - Plot different metrics for main.
- Close database.

References
- HPCToolkit documentation: http://hpctoolkit.org/documentation.html