Solution of Linear Systems


Solution of Linear Systems
Parallel and Distributed Computing
Department of Computer Science and Engineering (DEI)
Instituto Superior Técnico
November 30, 2011

CPD (DEI / IST) Parallel and Distributed Computing 22 2011-11-30 1 / 28

Outline

Solving Linear Systems
Direct Methods: the solution is sought directly, at once
- Gaussian Elimination
- LU Factorization
- Pivoting

Linear Systems

Probably the single most used procedure in the world. Linear systems are the model for many modern-day problems: in mathematics, in physics, in economics, and in pretty much any field.

What about nonlinear systems? Is that not a more general model? Yes, but how do we solve nonlinear systems? We linearize and iterate until we have a solution; at each iteration we solve a linear system.

Also, how do we solve differential equations? We discretize in time and solve for each timepoint. At each timepoint it may be a nonlinear system, so we linearize it. In the end we still solve a linear system, actually many of them.
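The linearize-and-iterate loop can be made concrete with a small sketch: Newton's method on a two-variable nonlinear system, where every iteration solves a 2×2 linear system for the update. The example system and all names here are illustrative, not from the slides.

```python
def newton_2d(f, jac, x0, tol=1e-10, max_iter=50):
    """Newton's method: at each step, linearize and solve a 2x2 linear system."""
    x, y = x0
    for _ in range(max_iter):
        f1, f2 = f(x, y)
        if max(abs(f1), abs(f2)) < tol:
            break
        (a, b), (c, d) = jac(x, y)
        det = a * d - b * c                  # solve J * (dx, dy) = -f by Cramer's rule
        dx = (-f1 * d + f2 * b) / det
        dy = (-a * f2 + c * f1) / det
        x, y = x + dx, y + dy
    return x, y

# Illustrative system: x^2 + y^2 = 4, x = y  ->  x = y = sqrt(2)
f = lambda x, y: (x * x + y * y - 4.0, x - y)
jac = lambda x, y: ((2 * x, 2 * y), (1.0, -1.0))
root = newton_2d(f, jac, (1.0, 1.0))
```

Each pass through the loop is exactly the pattern the slide describes: linearize (the Jacobian), solve a linear system, repeat.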

Direct Methods: Gaussian Elimination

Reduce Ax = b to an upper triangular system Tx = c (Forward Elimination), then use Back Substitution to solve Tx = c.

$$\begin{bmatrix} a_{00} & a_{01} & a_{02} & \cdots & a_{0n} \\ a_{10} & a_{11} & a_{12} & \cdots & a_{1n} \\ a_{20} & a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{m0} & a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}$$

Forward elimination produces the triangular system

$$\begin{bmatrix} t_{00} & t_{01} & t_{02} & \cdots & t_{0n} \\ 0 & t_{11} & t_{12} & \cdots & t_{1n} \\ 0 & 0 & t_{22} & \cdots & t_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & t_{mn} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}$$

Back Substitution:
1. one element of x can be immediately computed
2. use this value to simplify the system, revealing another element that can be immediately computed
3. repeat

Forward Elimination, recall steps

 1x0 + 1x1 - 1x2 + 4x3 = 8
 1x0 - 1x1 - 4x2 + 5x3 = 13
-1x0 + 1x1 + 6x2 - 8x3 = -13
-1x0 + 1x1 + 2x2       = -9

Pivot: p21 = -a21/a11, multiply by the 1st row, add to the 2nd row:

 1x0 + 1x1 - 1x2 + 4x3 = 8
      -2x1 - 3x2 + 1x3 = 5
-1x0 + 1x1 + 6x2 - 8x3 = -13
-1x0 + 1x1 + 2x2       = -9

Also for p31 = -a31/a11, p41 = -a41/a11:

 1x0 + 1x1 - 1x2 + 4x3 = 8
      -2x1 - 3x2 + 1x3 = 5
       2x1 + 5x2 - 4x3 = -5
       2x1 + 1x2 + 4x3 = -1

Pivot: p32 = -a32/a22:

 1x0 + 1x1 - 1x2 + 4x3 = 8
      -2x1 - 3x2 + 1x3 = 5
             2x2 - 3x3 = 0
       2x1 + 1x2 + 4x3 = -1

Continuing with p42 and p43 eliminates the remaining subdiagonal entries, yielding the upper triangular system.
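The elimination steps above translate directly into plain Python. The example system below reuses the slide's numbers, with the minus signs restored from the worked back-substitution steps (treat the exact signs as a reconstruction).

```python
def forward_eliminate(A, b):
    """Reduce Ax = b to upper triangular form by adding pivot multiples of rows."""
    n = len(A)
    for k in range(n - 1):
        for i in range(k + 1, n):
            p = -A[i][k] / A[k][k]      # p_ik = -a_ik / a_kk
            for j in range(k, n):       # add p times row k to row i
                A[i][j] += p * A[k][j]
            b[i] += p * b[k]
    return A, b

# The slide's example system (signs reconstructed)
A = [[1, 1, -1, 4], [1, -1, -4, 5], [-1, 1, 6, -8], [-1, 1, 2, 0]]
b = [8, 13, -13, -9]
forward_eliminate(A, b)     # A is now upper triangular, b is the new RHS c
```

After the call, `A` and `b` hold the triangular system that the back-substitution slides start from.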

Back Substitution

 1x0 + 1x1 - 1x2 + 4x3 = 8
      -2x1 - 3x2 + 1x3 = 5
             2x2 - 3x3 = 0
                   2x3 = 4

From the last row, x3 = 2. Substituting:

 1x0 + 1x1 - 1x2 = 0
      -2x1 - 3x2 = 3
             2x2 = 6
             2x3 = 4

Now x2 = 3. Substituting:

 1x0 + 1x1 = 3
      -2x1 = 12
       2x2 = 6
       2x3 = 4

Now x1 = -6. Substituting:

 1x0 = 9
-2x1 = 12
 2x2 = 6
 2x3 = 4

Solution: x3 = 2, x2 = 3, x1 = -6, x0 = 9

Pseudo-code for Back Substitution

for i = n - 1 down to 0 do
    x[i] = b[i] / a[i,i]
    for j = 0 to i - 1 do
        b[j] = b[j] - x[i] * a[j,i]
    endfor
endfor

Complexity: Θ(n²)

Parallelization:
- cannot execute the outer loop in parallel
- can execute the inner loop in parallel
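A direct Python translation of the pseudocode, applied to the triangular system of the worked example:

```python
def back_substitute(a, b):
    """Solve the upper triangular system a x = b; mirrors the slide's pseudocode."""
    n = len(a)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):      # i = n-1 down to 0
        x[i] = b[i] / a[i][i]
        for j in range(i):              # j = 0 to i-1: fold x[i] into the RHS
            b[j] -= x[i] * a[j][i]
    return x

# Triangular system from the worked example
T = [[1, 1, -1, 4], [0, -2, -3, 1], [0, 0, 2, -3], [0, 0, 0, 2]]
c = [8, 5, 0, 4]
x = back_substitute(T, c)               # -> [9.0, -6.0, 3.0, 2.0]
```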

Row-oriented Algorithm

for i = n - 1 down to 0 do
    x[i] = b[i] / a[i,i]
    for j = 0 to i - 1 do
        b[j] = b[j] - x[i] * a[j,i]
    endfor
endfor

- associate a primitive task with each row of A and the corresponding elements of x and b
- during iteration i, the task associated with row j computes the new value of b_j
- task i must compute x_i and broadcast its value
- agglomerate using a rowwise interleaved striped decomposition
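The row-oriented algorithm can be sketched as a sequential simulation: the "processes" and the broadcast are simulated (plain assignments, not MPI), and the comments mark which process would perform each step under rowwise interleaved striping.

```python
def parallel_back_substitute(a, b, p):
    """Simulate the row-oriented algorithm: with rowwise interleaved striping,
    row j (and b[j]) lives on process j % p."""
    n = len(a)
    owner = [j % p for j in range(n)]   # data distribution (documents the mapping)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = b[i] / a[i][i]           # computed by process owner[i], then broadcast
        for j in range(i):              # each process updates the rows it owns
            b[j] -= x[i] * a[j][i]      # performed by process owner[j]
    return x

T = [[1, 1, -1, 4], [0, -2, -3, 1], [0, 0, 2, -3], [0, 0, 0, 2]]
c = [8, 5, 0, 4]
x = parallel_back_substitute(T, c, p=3)
```

The simulation makes the data dependence visible: every iteration of the outer loop needs the freshly broadcast x[i], which is why only the inner loop parallelizes.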

Complexity Analysis

for i = n - 1 down to 0 do
    x[i] = b[i] / a[i,i]
    for j = 0 to i - 1 do
        b[j] = b[j] - x[i] * a[j,i]
    endfor
endfor

Computation complexity: each process performs about n/(2p) iterations of loop j, over a total of n - 1 iterations of loop i. Overall computational complexity: Θ(n²/p)

Communication complexity: one broadcast per iteration, each costing Θ(log p), over n - 1 iterations. Overall communication complexity: Θ(n log p)

Isoefficiency Analysis

Isoefficiency analysis: T(n, 1) ≥ C·T0(n, p), where T(n, 1) is the sequential time and T0(n, p) is the parallel overhead.

Sequential time complexity: T(n, 1) = O(n²)
Parallel overhead is dominated by the broadcasts, O(n log p) per process: T0(n, p) = p · O(n log p)

n² ≥ C p n log p  ⇒  n ≥ C p log p

Scalability function: M(f(p))/p, with memory requirement M(n) = n²:

M(C p log p)/p = C² p log² p

Poor scalability...

LU Factorization

Useful when solving for multiple right-hand sides with the same matrix: Ax = b1, Ax = b2, ...

Compute the factorization A = LU, where L is unit lower triangular and U is upper triangular. The solution is then obtained in two steps:
- Ly = b: lower triangular system, solved by forward substitution to obtain the vector y
- Ux = y: upper triangular system, solved by back substitution to obtain the solution x to the original system
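The two-step solve can be sketched in plain Python; L and U below are a small hand-built factorization (illustrative numbers, not from the slides) reused for two right-hand sides:

```python
def forward_substitute(L, b):
    """Solve L y = b with L unit lower triangular."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    return y

def back_substitute(U, y):
    """Solve U x = y with U upper triangular."""
    n = len(U)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

# A = L U for A = [[2, 1], [4, 5]]: factor once, solve twice
L = [[1.0, 0.0], [2.0, 1.0]]
U = [[2.0, 1.0], [0.0, 3.0]]
xs = [back_substitute(U, forward_substitute(L, b))
      for b in ([3.0, 9.0], [1.0, 2.0])]      # two RHS, one factorization
```

The O(n³) factorization cost is paid once; each additional right-hand side costs only the two O(n²) triangular solves.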

Factorization by Gaussian Elimination

LU factorization can be computed by Gaussian elimination as follows, where U overwrites A:

for k = 1 to n - 1                  {loop over columns}
    for i = k + 1 to n              {compute multipliers}
        l_ik = a_ik / a_kk          {for current column}
    end
    for j = k + 1 to n              {apply transformation to}
        for i = k + 1 to n          {remaining submatrix}
            a_ij = a_ij - l_ik * a_kj
        end
    end
end
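A plain-Python transcription of this loop nest (a sketch: multipliers are stored in a separate L rather than overwriting the strict lower triangle, and pivoting is still ignored):

```python
def lu_factor(A):
    """kij Gaussian elimination: U overwrites A, multipliers fill L (no pivoting)."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]   # unit lower triangular
    for k in range(n - 1):                  # loop over columns
        for i in range(k + 1, n):
            L[i][k] = A[i][k] / A[k][k]     # compute multipliers for current column
            A[i][k] = 0.0                   # entry is eliminated
        for j in range(k + 1, n):           # apply transformation to
            for i in range(k + 1, n):       # remaining submatrix
                A[i][j] -= L[i][k] * A[k][j]
    return L, A                             # A now holds U

A0 = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
L, U = lu_factor([row[:] for row in A0])    # factor a copy, keep A0 for checking
```

Multiplying L and U back together recovers the original matrix, which is an easy sanity check for this kind of code.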

Factorization by Gaussian Elimination

In general, row interchanges (pivoting) may be required to ensure existence of the LU factorization and numerical stability of the Gaussian elimination algorithm, but for simplicity we temporarily ignore this issue.

Gaussian elimination requires about n³/3 paired additions and multiplications, so we model the serial time as T1 = t_c n³/3, where t_c is the time required for a multiply-add operation. About n²/2 divisions are also required, but we ignore this lower-order term.

Loop Orderings for Gaussian Elimination

Gaussian elimination has the general form of a triple-nested loop in which the entries of L and U overwrite those of A:

for ...
    for ...
        for ...
            a_ij = a_ij - (a_ik / a_kk) * a_kj
        end
    end
end

Perhaps most promising for parallel implementation are the kij and kji forms, which differ only in accessing the matrix by rows or columns, respectively.

Gaussian Elimination Algorithm

kij form of Gaussian elimination:

for k = 1 to n - 1
    for i = k + 1 to n
        l_ik = a_ik / a_kk
    end
    for j = k + 1 to n
        for i = k + 1 to n
            a_ij = a_ij - l_ik * a_kj
        end
    end
end

Multipliers l_ik are computed outside the inner loop for greater efficiency.

Parallel Algorithm

Partition: for i, j = 1, ..., n, fine-grain task (i, j) stores a_ij and computes and stores
- u_ij, if i ≤ j
- l_ij, if i > j
yielding a 2-D array of n² fine-grain tasks.

Communication:
- broadcast entries of A vertically to tasks below
- broadcast entries of L horizontally to tasks to the right

Fine-Grain Tasks and Communication [figure]

Fine-Grain Parallel Algorithm

for k = 1 to min(i, j) - 1
    recv broadcast of a_kj from task (k, j)             {vert bcast}
    recv broadcast of l_ik from task (i, k)             {horiz bcast}
    a_ij = a_ij - l_ik * a_kj                           {update entry}
end
if i <= j then
    broadcast a_ij to tasks (k, j), k = i + 1, ..., n   {vert bcast}
else
    recv broadcast of a_jj from task (j, j)             {vert bcast}
    l_ij = a_ij / a_jj                                  {multiplier}
    broadcast l_ij to tasks (i, k), k = j + 1, ..., n   {horiz bcast}
end

Agglomeration

With an n × n array of fine-grain tasks, natural strategies are:
- 2-D: combine a k × k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k)² coarse-grain tasks
- 1-D column: combine the n fine-grain tasks in each column into a coarse-grain task, yielding n coarse-grain tasks
- 1-D row: combine the n fine-grain tasks in each row into a coarse-grain task, yielding n coarse-grain tasks

Mapping

- 2-D: assign (n/k)²/p coarse-grain tasks to each of p processes using any desired mapping in each dimension, treating the target network as a 2-D mesh
- 1-D: assign n/p coarse-grain tasks to each of p processes using any desired mapping, treating the target network as a 1-D mesh

Scalability for 2-D Agglomeration

Updating by each process at step k requires about (n − k)²/p operations. Summing over the n − 1 steps:

$$T_{\text{comp}} \approx \sum_{k=1}^{n-1} t_c (n-k)^2 / p \approx t_c n^3/(3p)$$

Scalability for 2-D Agglomeration

Similarly, the amount of data broadcast at step k along each process row and column is about (n − k)/√p, so on a 2-D mesh

$$T_{\text{comm}} \approx \sum_{k=1}^{n-1} 2\,(t_s + t_w (n-k)/\sqrt{p}) \approx 2 t_s n + t_w n^2/\sqrt{p}$$

where we have allowed for overlap of broadcasts for successive steps. The total execution time is

$$T_p \approx t_c n^3/(3p) + 2 t_s n + t_w n^2/\sqrt{p}$$
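The execution-time model can be evaluated numerically to see how its three terms trade off. The machine constants tc, ts, tw below are illustrative placeholders, not measurements.

```python
from math import sqrt

def t_parallel(n, p, tc=1e-9, ts=1e-5, tw=1e-8):
    """Slide's model: T_p ~ tc*n^3/(3p) + 2*ts*n + tw*n^2/sqrt(p).
    tc, ts, tw are illustrative machine constants (assumptions, not measured)."""
    compute = tc * n ** 3 / (3 * p)     # t_c n^3 / (3p): shrinks linearly in p
    latency = 2 * ts * n                # 2 t_s n: does not shrink with p
    bandwidth = tw * n ** 2 / sqrt(p)   # t_w n^2 / sqrt(p): shrinks as sqrt(p)
    return compute + latency + bandwidth

# At n = 4096 the compute term dominates, so adding processes keeps helping
times = [t_parallel(4096, p) for p in (1, 4, 16, 64)]
```

Plugging in different n and p makes the limits visible: the latency term 2·t_s·n is independent of p, so for fixed n the model flattens out as p grows.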

Isoefficiency Analysis

Isoefficiency analysis: T(n, 1) ≥ C·T0(n, p), where T(n, 1) is the sequential time and T0(n, p) is the parallel overhead.

Sequential time complexity: T(n, 1) = O(n³)
Parallel overhead is dominated by the broadcasts, O(2 t_s n + t_w n²/√p) = O(n²/√p) per process: T0(n, p) = p · O(n²/√p)

n³ ≥ C √p n²  ⇒  n ≥ C √p

Scalability function: M(f(p))/p, with M(n) = n²:

M(C √p)/p = C² p / p = C²

Perfect scalability!

Pivoting

Pivoting is the action of exchanging matrix rows to use a different pivot.
- One reason is to choose a pivot that creates fewer fill-ins during elimination (a fill-in is a previously non-existent element that becomes nonzero)
- Other reasons are numerical

Partial pivoting complicates the parallel implementation of Gaussian elimination and significantly affects its potential performance:
- with the 2-D algorithm, the pivot search is parallel but requires communication within a process column and inhibits the overlapping of successive steps
- with the 1-D column algorithm, the pivot search requires no communication but is purely serial
- once the pivot is found, the index of the pivot row must be communicated to the other processes, and rows must be explicitly or implicitly interchanged in each process
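A minimal sequential sketch of partial pivoting grafted onto the factorization loop (illustrative code, not the parallel formulation): the pivot search picks the largest-magnitude entry in the current column, and rows are explicitly interchanged.

```python
def lu_factor_pivot(A):
    """Gaussian elimination with partial pivoting: returns perm, L, U."""
    n = len(A)
    perm = list(range(n))
    L = [[0.0] * n for _ in range(n)]
    for k in range(n - 1):
        # pivot search: row with the largest |a_ik| in column k
        m = max(range(k, n), key=lambda i: abs(A[i][k]))
        if m != k:                          # interchange rows (plus bookkeeping)
            A[k], A[m] = A[m], A[k]
            L[k], L[m] = L[m], L[k]
            perm[k], perm[m] = perm[m], perm[k]
        for i in range(k + 1, n):
            L[i][k] = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= L[i][k] * A[k][j]
    for i in range(n):
        L[i][i] = 1.0                       # L is unit lower triangular
    return perm, L, A                       # A now holds U

# Zero leading pivot: elimination would fail without the interchange
perm, L, U = lu_factor_pivot([[0.0, 1.0], [2.0, 1.0]])
```

The same search-then-swap step is exactly what the parallel variants must coordinate: the `max` over a column becomes a reduction, and the row swap becomes the communicated interchange described above.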

Next Class

Efficient parallelization of numerical algorithms:
- Relaxation Methods
- Finite Difference discretization