Communication Avoiding Successive Band Reduction


Communication Avoiding Successive Band Reduction
Grey Ballard, James Demmel, Nicholas Knight (UC Berkeley), PPoPP '12

Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227). Additional support comes from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, and Samsung.

Talk Summary

- For high performance in linear algebra, we must reformulate existing algorithms in order to reduce data movement, i.e., avoid communication.
- We want to tridiagonalize a symmetric band matrix.
- Motivation: the dense symmetric eigenproblem (eigenvalues only).
- Our improved band reduction algorithm:
  - moves asymptotically less data than previous algorithms
  - attains speedups against tuned libraries on a multicore platform: up to 2x serial, 6x parallel
- With our band-reduction approach, two-step tridiagonalization of a dense matrix is communication-optimal for all problem sizes.

Grey Ballard Communication Avoiding Successive Band Reduction 1

Motivation

By communication we mean:
- moving data within the memory hierarchy on a sequential computer
- moving data between processors on a parallel computer

[Diagram: sequential case, transfers between fast and slow memory; parallel case, transfers between processors' local memories.]

Communication is expensive, so our goal is to minimize it:
- in many cases we need new algorithms
- in many cases we can prove lower bounds and optimality

Direct vs. Two-Step Tridiagonalization

Application: solving the dense symmetric eigenproblem via reduction to tridiagonal form (tridiagonalization).
- Direct: the conventional approach (e.g., LAPACK) is direct tridiagonalization (A -> T).
- Two-step: reduce first to band form, then band to tridiagonal (A -> B -> T).

[Plot: MFLOPS vs. matrix dimension n for MatMul, Direct, and Two-step; the two-step approach is substantially faster than direct tridiagonalization.]

Why is direct tridiagonalization slow? Communication costs!

[Plot: MFLOPS vs. n for MatMul, Direct, and Two-step.]

  Approach        Flops            Words Moved
  Direct          (4/3)n^3         O(n^3)
  Two-step (1)    (4/3)n^3         O(n^3 / sqrt(M))
           (2)    O(n^2 sqrt(M))   O(n^2 sqrt(M))

  (M = fast memory size)

- The direct approach achieves O(1) data re-use.
- The two-step approach moves fewer words than the direct approach, using intermediate bandwidth b = Theta(sqrt(M)).
- The full-to-banded step (1) achieves O(sqrt(M)) data re-use; this is optimal.
- The band reduction step (2) achieves O(1) data re-use. Can we do better?
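The communication counts above can be checked numerically. The sketch below is our own illustration (not from the talk): it evaluates the words-moved expressions for both approaches, with an assumed fast-memory size M of 3 * 2^20 words chosen only for illustration.

```python
from math import sqrt

def words_direct(n):
    # Direct tridiagonalization: O(1) re-use, so words moved scale like
    # the flop count, O(n^3).
    return n**3

def words_two_step(n, M):
    b = sqrt(M)                # intermediate bandwidth b = Theta(sqrt(M))
    step1 = n**3 / sqrt(M)     # full-to-banded: O(sqrt(M)) re-use (optimal)
    step2 = n**2 * b           # band-to-tridiagonal: O(1) re-use -> O(n^2 b) words
    return step1 + step2

n, M = 24000, 3 * 2**20        # M is an illustrative cache size (assumption)
print(words_direct(n) / words_two_step(n, M))   # two-step moves far fewer words
```

For these sample values the direct approach moves more than an order of magnitude more data, which matches the plot's gap between the direct and two-step curves.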

Band Reduction: previous work

- 1963 Rutishauser: Givens-based down diagonals, and Householder-based
- 1968 Schwarz: Givens-based up columns
- 1975 Murata-Horikoshi: improved Rutishauser's Householder-based algorithm
- 1984 Kaufman: vectorized Schwarz's algorithm
- 1993 Lang: parallelized Murata-Horikoshi's algorithm (distributed-memory)
- 2000 Bischof-Lang-Sun: generalized everything but Schwarz's algorithm
- 2009 Davis-Rajamanickam: Givens-based in blocks
- 2011 Luszczek-Ltaief-Dongarra: parallelized Murata-Horikoshi's algorithm (shared-memory)
- 2011 Haidar-Ltaief-Dongarra: combined L-L-D and D-R


Successive Band Reduction (bulge-chasing)

[Diagram: a sweep annihilates a parallelogram of entries with an orthogonal transformation Q_1, creating a bulge that is chased down the band by Q_2, ..., Q_5.]

- b = bandwidth
- c = columns (per parallelogram)
- d = diagonals (eliminated per sweep)
- constraint: c + d <= b
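The sweep-and-chase idea can be made concrete with a small numerical sketch. The following code is our own simplified Givens-based reduction in the spirit of Schwarz (one bulge at a time, one diagonal removed per sweep, dense storage for clarity); it is not the blocked CA-SBR algorithm itself.

```python
import numpy as np

def givens(a, b):
    """Rotation [[c, s], [-s, c]] that zeros the second component of (a, b)."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

def remove_outer_diagonal(A, b):
    """One sweep: annihilate diagonal b (entries A[j+b, j]) with Givens
    rotations, chasing each resulting bulge down and off the band."""
    A = A.copy()
    n = A.shape[0]
    for j in range(n - b):
        r, q = j + b, j                  # entry to annihilate
        while r < n:
            c, s = givens(A[r - 1, q], A[r, q])
            G = np.eye(n)
            G[r - 1, r - 1], G[r - 1, r] = c, s
            G[r, r - 1], G[r, r] = -s, c
            A = G @ A @ G.T              # two-sided update preserves eigenvalues
            q, r = r - 1, r + b          # the bulge lands b rows further down
    return A

def band_to_tridiag(A, bw):
    """Reduce a symmetric band matrix of bandwidth bw to tridiagonal form,
    removing d = 1 diagonal per sweep."""
    for b in range(bw, 1, -1):
        A = remove_outer_diagonal(A, b)
    return A
```

Each elimination at (r, q) fills in a bulge entry b rows further down (at distance b + 1 from the diagonal), which the inner loop chases to the bottom of the matrix; this is exactly the access pattern the c and omega parameters on the following slides are designed to block.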

How do we get data re-use?

1. Increase the number of columns in the parallelogram (c):
   - permits blocking Householder updates: O(c) re-use
   - constraint c + d <= b: a trade-off between re-use and progress
2. Chase multiple bulges at a time (omega):
   - apply several updates to the band while it's in cache: O(omega) re-use
   - bulges cannot overlap, and the working set must fit in cache


Data access patterns

[Diagrams: the portion of the band touched when chasing one bulge at a time vs. four bulges at a time.]


Communication-Avoiding SBR: theory

Trade-off between c and omega:
- c: number of columns in each parallelogram
- omega: number of bulges chased at a time

CA-SBR cuts the remaining bandwidth in half at each sweep:
- starts with big c and halves it at each sweep
- starts with small omega and doubles it at each sweep

  Algorithm   Flops     Words Moved              Data Re-use
  Schwarz     4n^2 b    O(n^2 b)                 O(1)
  M-H         6n^2 b    O(n^2 b)                 O(1)
  B-L-S*      5n^2 b    O(n^2 log b)             O(b / log b)
  CA-SBR      5n^2 b    O((n^2 b^2 / M) log b)   O(M / b)

  (*with optimal parameter choices, assuming 1 <= b <= sqrt(M/3))

We have similar theoretical improvements in the distributed-memory parallel case.
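As a sanity check on the table, the model expressions can be compared directly. This sketch is our own (leading constants and lower-order terms dropped; base-2 logarithm and the cache size M are assumptions for illustration):

```python
from math import log2

def words_moved(n, b, M):
    """Asymptotic words-moved expressions from the table above,
    with constants dropped."""
    return {
        "Schwarz": n**2 * b,
        "M-H":     n**2 * b,
        "B-L-S":   n**2 * log2(b),
        "CA-SBR":  (n**2 * b**2 / M) * log2(b),
    }

# Illustrative values: the largest experiment in the talk (n = 24000,
# b = 300) and an assumed fast memory of 3 * 2**20 words.
w = words_moved(n=24000, b=300, M=3 * 2**20)
for name, words in w.items():
    print(f"{name:8s} {words:10.3e}")
```

For these values, CA-SBR's count is roughly an order of magnitude below B-L-S and three below Schwarz/M-H, consistent with the data re-use column.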

Shared-Memory Parallel Implementation

- lots of dependencies: use pipelining
- threads maintain working sets which never overlap

Search Space for Autotuning

Main tuning parameters:
1. Number of sweeps, and diagonals per sweep: {d_i} satisfying sum(d_i) = b - 1 (reducing bandwidth b to tridiagonal)
2. Parameters for the i-th sweep:
   a. number of columns in each parallelogram: c_i, satisfying c_i + d_i <= b_i
   b. number of bulges chased at a time: omega_i
   c. number of times a bulge is chased in a row: l_i
3. Parameters for an individual bulge chase:
   a. algorithm choice (BLAS-1, BLAS-2, BLAS-3 varieties)
   b. inner blocking size for BLAS-3
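For illustration, the first two levels of this search space can be generated programmatically. The sketch below is an assumed simplification (the function names are ours; it ignores omega's cache-fit constraint, the chase-count l_i, and the per-chase parameters):

```python
def halving_schedule(b):
    """Diagonals removed per sweep when the remaining bandwidth is cut in
    half each time (the CA-SBR schedule); sum(d_i) == b - 1."""
    plan, rem = [], b
    while rem > 1:
        d = rem - rem // 2        # remove the upper half of the remaining band
        plan.append(d)
        rem -= d
    return plan

def sweep_params(b, plan):
    """Per-sweep (d_i, c_i, omega_i): c starts big and shrinks with the
    bandwidth, omega starts small and doubles (cache constraint ignored)."""
    out, b_i = [], b
    for i, d in enumerate(plan):
        c = max(1, b_i - d)       # largest c allowed by c_i + d_i <= b_i
        out.append((d, c, 2 ** i))
        b_i -= d
    return out

print(halving_schedule(8))                    # -> [4, 2, 1]
print(sweep_params(8, halving_schedule(8)))   # -> [(4, 4, 1), (2, 2, 2), (1, 1, 4)]
```

An autotuner would search over such schedules (and the per-chase kernel choices) rather than fixing the halving plan in advance.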

Experimental Platform

- Intel Westmere-EX ("Boxboro"): 4 sockets, 10 cores per socket, hyperthreading
- 24MB shared L3 per socket, 256KB private L2 per core
- MKL v10.3, PLASMA v2.4.1, ICC v11.1
- Experiments run on a single socket (up to 10 threads)

CA-SBR vs. MKL (dsbtrd), sequential

Speedup (rows: matrix dimension n; columns: bandwidth b):

    n \ b    50   100   150   200   250   300
    24000   1.0   1.2   1.6   1.8   2.0   2.0
    20000   1.0   1.1   1.5   1.8   1.9   2.0
    16000   0.9   0.9   1.4   1.7   1.8   1.9
    12000   1.0   0.9   1.2   1.5   1.7   1.8
     8000   1.0   0.9   1.1   1.3   1.4   1.6
     4000   0.9   0.9   1.1   1.2   1.2   1.2

CA-SBR (10 threads) vs. CA-SBR (1 thread)

Speedup (rows: matrix dimension n; columns: bandwidth b):

    n \ b    50   100   150   200   250   300
    24000   8.8   8.1   9.4   9.2   8.5   8.4
    20000   9.2   8.8   9.2   8.9   8.2   8.3
    16000   8.9   9.3   9.2   8.6   8.0   7.8
    12000   9.0   9.8   8.9   7.9   7.4   7.4
     8000   8.7   9.2   8.1   6.8   5.9   6.0
     4000   8.2   6.7   5.6   4.4   3.6   3.6

CA-SBR vs. PLASMA (pdsbrdt), 10 threads

Speedup (rows: matrix dimension n; columns: bandwidth b):

    n \ b    50   100   150   200   250   300
    24000   4.0   3.2   4.6   4.7   5.2   5.9
    20000   4.2   3.6   4.5   4.3   5.0   5.9
    16000   4.4   4.6   4.5   4.1   4.8   5.5
    12000   4.7   5.1   4.7   3.6   4.4   5.5
     8000   6.7   5.7   5.5   4.0   3.8   5.0
     4000   6.2   5.7   3.7   3.4   3.0   3.8

Best serial speedups on Boxboro

On the largest experimental problem (n = 24000, b = 300), our serial CA-SBR implementation attained:
- 2x speedup vs. MKL dsbtrd (p = 1 thread)
- 36% of dgemm peak (50% counting actual flops)

Notes:
- dsbtrd is a vectorized version of the Schwarz algorithm (O(1) re-use)
- dsbtrd performance did not improve with p, so we compared only serial implementations
- MKL also provides an implementation of SBR (dsyrdb) but does not expose the band-to-tridiagonal routine, so we could not compare

Best parallel speedups on Boxboro

On the largest experimental problem (n = 24000, b = 300), our multithreaded CA-SBR implementation attained:
- 6x speedup vs. PLASMA pdsbrdt (p = 10 threads)
- 30% of dgemm peak (40% counting actual flops)

Notes:
- In PLASMA v2.4.1, pdsbrdt is a tiled, multithreaded, dynamically scheduled implementation of the M-H algorithm (O(1) re-use)
- We are collaborating with the PLASMA developers; they have since improved their pdsbrdt scheduler (current version is 2.4.5)
- Our CA-SBR implementation is not NUMA-aware, so we restricted our tests to a single socket (10 cores)

Conclusions and Future Work

Theoretical results:
- analysis of the communication costs of existing algorithms
- CA-SBR reduces communication below the lower bound for matmul. Is it optimal?

Practical results:
- heuristic tuning leads to speedups, for both the band reduction problem and the dense eigenproblem
- the implementation exposes important tuning parameters; next step is to automate the tuning process

Extensions:
- handle eigenvector updates (results here are for eigenvalues only)
- extend to the bidiagonal reduction (SVD) case
- distributed-memory parallel algorithm

Thank you!

Grey Ballard, Jim Demmel, Nick Knight
{ballard,demmel,knight}@cs.berkeley.edu

References I

- Aggarwal, A., and Vitter, J. S. The input/output complexity of sorting and related problems. Comm. ACM 31, 9 (1988), 1116-1127.
- Agullo, E., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Langou, J., Ltaief, H., Luszczek, P., and YarKhan, A. PLASMA users' guide, 2009. http://icl.cs.utk.edu/plasma/.
- Ballard, G., Demmel, J., Holtz, O., and Schwartz, O. Minimizing communication in linear algebra. SIAM Journal on Matrix Analysis and Applications 32, 3 (2011), 866-901.
- Bischof, C., Lang, B., and Sun, X. A framework for symmetric band reduction. ACM Trans. Math. Soft. 26, 4 (2000), 581-601.
- Bischof, C. H., Lang, B., and Sun, X. Algorithm 807: The SBR Toolbox: software for successive band reduction. ACM Trans. Math. Soft. 26, 4 (2000), 602-616.
- Demmel, J., Grigori, L., Hoemmen, M., and Langou, J. Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput. (2011). To appear.

References II

- Dongarra, J., Hammarling, S., and Sorensen, D. Block reduction of matrices to condensed forms for eigenvalue computations. Journal of Computational and Applied Mathematics 27 (1989).
- Fuller, S. H., and Millett, L. I., Eds. The Future of Computing Performance: Game Over or Next Level? The National Academies Press, Washington, D.C., 2011.
- Haidar, A., Ltaief, H., and Dongarra, J. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels. Proceedings of the ACM/IEEE Conference on Supercomputing (2011).
- Howell, G., Demmel, J., Fulton, C., Hammarling, S., and Marmol, K. Cache efficient bidiagonalization using BLAS 2.5 operators. ACM Trans. Math. Softw. 34, 3 (2008), 14:1-14:33.
- Kaufman, L. Banded eigenvalue solvers on vector machines. ACM Trans. Math. Softw. 10 (1984), 73-86.
- Kaufman, L. Band reduction algorithms revisited. ACM Trans. Math. Softw. 26 (December 2000), 551-567.

References III

- Lang, B. A parallel algorithm for reducing symmetric banded matrices to tridiagonal form. SIAM J. Sci. Comput. 14, 6 (1993), 1320-1338.
- Lang, B. Efficient eigenvalue and singular value computations on shared memory machines. Par. Comp. 25, 7 (1999), 845-860.
- Ltaief, H., Luszczek, P., and Dongarra, J. High performance bidiagonal reduction using tile algorithms on homogeneous multicore architectures. Tech. Rep. 247, LAPACK Working Note, May 2011. Submitted to ACM TOMS.
- Luszczek, P., Ltaief, H., and Dongarra, J. Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2011).
- Murata, K., and Horikoshi, K. A new method for the tridiagonalization of the symmetric band matrix. Information Processing in Japan 15 (1975), 108-112.

References IV

- Rajamanickam, S. Efficient Algorithms for Sparse Singular Value Decomposition. PhD thesis, University of Florida, 2009.
- Rutishauser, H. On Jacobi rotation patterns. In Proceedings of Symposia in Applied Mathematics (1963), vol. 15, pp. 219-239.
- Schwarz, H. Algorithm 183: Reduction of a symmetric bandmatrix to triple diagonal form. Comm. ACM 6, 6 (June 1963), 315-316.
- Schwarz, H. Tridiagonalization of a symmetric band matrix. Numerische Mathematik 12 (1968), 231-241.

Anatomy of a bulge-chase

[Diagram: a parallelogram of width c on the band is annihilated by a QR factorization; PRE, SYM, and POST mark the regions of the band touched by the left, two-sided, and right updates.]

- QR: create zeros
- PRE: A <- Q^T A
- SYM: A <- Q^T A Q
- POST: A <- A Q
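The four steps can be demonstrated on a small dense symmetric matrix with NumPy. This is our own simplification (it forms Q explicitly via a full QR of a single column panel rather than using blocked Householder reflectors on the band):

```python
import numpy as np

def eliminate_column(A, k):
    """One QR / PRE / SYM / POST cycle on column k of symmetric A:
    QR builds an orthogonal Q that zeros A[k+2:, k]; PRE applies Q^T from
    the left, POST applies Q from the right; together they realize the
    SYM update A <- Q^T A Q (restricted to rows/cols k+1..n-1)."""
    n = A.shape[0]
    rows = np.arange(k + 1, n)
    Q, _ = np.linalg.qr(A[rows, k:k + 1], mode="complete")  # QR: create zeros
    A = A.copy()
    A[rows, :] = Q.T @ A[rows, :]                           # PRE
    A[:, rows] = A[:, rows] @ Q                             # POST
    return A                                                # net effect: SYM

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
A = A + A.T                     # symmetric test matrix
B = eliminate_column(A, 0)      # B[2:, 0] is now zero; eigenvalues unchanged
```

Because the left and right applications use the same orthogonal Q, symmetry and the spectrum are preserved, which is why only half of the SYM region needs to be computed in an optimized kernel.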

CA-SBR sequential performance (p = 1)

    n \ b      50    100    150    200    250    300
    24000    1.78   1.85   2.25   2.55   2.78   2.93
    20000    1.77   1.86   2.27   2.56   2.80   2.94
    16000    1.77   1.87   2.27   2.57   2.80   2.95
    12000    1.78   1.87   2.27   2.58   2.81   2.95
     8000    1.80   1.85   2.27   2.59   2.80   2.96
     4000    1.63   1.87   2.28   2.58   2.82   2.88

Table: Performance of sequential CA-SBR in GFLOPS. Each row corresponds to a matrix dimension, and each column corresponds to a matrix bandwidth. Effective flop rates are shown; actual performance may be up to 50% higher.

CA-SBR parallel performance (p = 10)

    n \ b       50     100     150     200     250     300
    24000    15.59   14.92   21.17   23.43   23.48   24.79
    20000    16.29   16.47   20.81   22.78   22.89   24.56
    16000    15.80   17.32   20.81   22.02   22.34   23.08
    12000    16.06   18.29   20.19   20.28   20.76   21.74
     8000    15.64   17.14   18.39   17.62   16.56   17.80
     4000    13.36   12.56   12.82   11.48   10.26   10.44

Table: Performance of parallel CA-SBR in GFLOPS. Each row corresponds to a matrix dimension, and each column corresponds to a matrix bandwidth. Effective flop rates are shown; actual performance may be up to 50% higher.