Solution of Linear Systems
Parallel and Distributed Computing
Department of Computer Science and Engineering (DEI)
Instituto Superior Técnico
November 30, 2011
CPD (DEI / IST) Parallel and Distributed Computing 22 2011-11-30 1 / 28
Outline
Solving Linear Systems
Direct Methods: the solution is sought directly, at once
- Gaussian Elimination
- LU Factorization
- Pivoting
Linear Systems
Probably the single most used procedure in the world.
Linear systems are the model for many modern-day problems:
- in mathematics
- in physics
- in economics
- and in pretty much any field
What about nonlinear systems? Are they not a more general model?
- Yes, but how do we solve nonlinear systems?
- We linearize and iterate until we have a solution.
- At each iteration we solve a linear system.
Also, how do we solve differential equations?
- We discretize in time and solve for each timepoint.
- At each timepoint we may face a nonlinear system, so we linearize it.
- In the end we still solve linear systems, in fact many of them.
Direct Methods
Gaussian Elimination
Reduce Ax = b to an upper triangular system, Tx = c: Forward Elimination
Use Back Substitution to solve Tx = c.

[ a00 a01 a02 ... a0n ] [x0]   [b0]
[ a10 a11 a12 ... a1n ] [x1]   [b1]
[ a20 a21 a22 ... a2n ] [x2] = [b2]
[ ...             ... ] [..]   [..]
[ am0 am1 am2 ... amn ] [xn]   [bn]

Forward Elimination turns this into

[ t00 t01 t02 ... t0n ] [x0]   [c0]
[  0  t11 t12 ... t1n ] [x1]   [c1]
[  0   0  t22 ... t2n ] [x2] = [c2]
[ ...             ... ] [..]   [..]
[  0   0   0  ... tmn ] [xn]   [cn]

Back Substitution
1. one element of x can be immediately computed
2. use this value to simplify the system, revealing another element that can be immediately computed
3. repeat
Forward Elimination, recall steps

 1x0 +1x1 -1x2 +4x3 =   8
 1x0 -1x1 -4x2 +5x3 =  13
-1x0 +1x1 +6x2 -8x3 = -13
-1x0 +1x1 +2x2      =  -9

Multiplier: p21 = -a21/a11 = -1; multiply the 1st row by p21 and add it to the 2nd row:

 1x0 +1x1 -1x2 +4x3 =   8
     -2x1 -3x2 +1x3 =   5
-1x0 +1x1 +6x2 -8x3 = -13
-1x0 +1x1 +2x2      =  -9

Likewise for p31 = -a31/a11 = 1 and p41 = -a41/a11 = 1:

 1x0 +1x1 -1x2 +4x3 =   8
     -2x1 -3x2 +1x3 =   5
     +2x1 +5x2 -4x3 =  -5
     +2x1 +1x2 +4x3 =  -1

Multiplier: p32 = -a32/a22 = 1; multiply the 2nd row by p32 and add it to the 3rd row:

 1x0 +1x1 -1x2 +4x3 =   8
     -2x1 -3x2 +1x3 =   5
          +2x2 -3x3 =   0
     +2x1 +1x2 +4x3 =  -1

The remaining steps (p42 = 1, then p43 = 1) eliminate the last row's entries below the diagonal, yielding the triangular system.
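The elimination steps above can be carried out mechanically; the following is a minimal sketch in Python (NumPy assumed, which the slides do not use), applied to the example system:

```python
import numpy as np

def forward_eliminate(A, b):
    """Reduce Ax = b to an upper triangular system Tx = c.
    No pivoting: assumes the pivot entries stay nonzero."""
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    for k in range(n - 1):              # current pivot row
        for i in range(k + 1, n):       # rows below the pivot
            p = -A[i, k] / A[k, k]      # multiplier p_ik = -a_ik / a_kk
            A[i, k:] += p * A[k, k:]    # add p * (row k) to row i
            b[i] += p * b[k]
    return A, b

# the example system from the slides
A = np.array([[ 1,  1, -1,  4],
              [ 1, -1, -4,  5],
              [-1,  1,  6, -8],
              [-1,  1,  2,  0]])
b = np.array([8, 13, -13, -9])
T, c = forward_eliminate(A, b)
# T is upper triangular; the last equation reads 2*x3 = 4
```
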
Back Substitution

 1x0 +1x1 -1x2 +4x3 = 8
     -2x1 -3x2 +1x3 = 5
          +2x2 -3x3 = 0
               +2x3 = 4

The last equation gives x3 = 2; substituting into the others:

 1x0 +1x1 -1x2 = 0
     -2x1 -3x2 = 3
          +2x2 = 6

The last equation gives x2 = 3; substituting:

 1x0 +1x1 = 3
     -2x1 = 12

The last equation gives x1 = -6; substituting:

 1x0 = 9

Solution: x0 = 9, x1 = -6, x2 = 3, x3 = 2.
Pseudo-code for Back Substitution

for i = n - 1 down to 0 do
    x[i] = b[i] / a[i,i]
    for j = 0 to i - 1 do
        b[j] = b[j] - x[i] * a[j,i]
    endfor
endfor

(the loop must run down to 0 so that x[0] is also computed; for i = 0 the inner loop is empty)

Complexity: Θ(n^2)

Parallelization:
- cannot execute the outer loop in parallel: iteration i depends on the b[j] updates of later iterations
- can execute the inner loop in parallel: the updates to the b[j] are independent
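The pseudo-code translates almost line for line into Python (a plain-list sketch, no libraries); running it on the triangular system from the worked example:

```python
def back_substitute(a, b):
    """Solve the upper triangular system a x = b,
    following the column-oriented pseudo-code above."""
    n = len(b)
    b = list(b)                      # keep the caller's b intact
    x = [0.0] * n
    for i in range(n - 1, -1, -1):   # i = n-1 down to 0
        x[i] = b[i] / a[i][i]
        for j in range(i):           # j = 0 to i-1
            b[j] -= x[i] * a[j][i]   # eliminate column i from the rows above
    return x

a = [[1,  1, -1,  4],
     [0, -2, -3,  1],
     [0,  0,  2, -3],
     [0,  0,  0,  2]]
b = [8, 5, 0, 4]
x = back_substitute(a, b)            # -> [9.0, -6.0, 3.0, 2.0]
```
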
Row-oriented Algorithm

for i = n - 1 down to 0 do
    x[i] = b[i] / a[i,i]
    for j = 0 to i - 1 do
        b[j] = b[j] - x[i] * a[j,i]
    endfor
endfor

- associate a primitive task with each row of A and the corresponding elements of x and b
- during iteration i, the task associated with row j computes the new value of b[j]
- the task associated with row i must compute x[i] and broadcast its value
- agglomerate using a rowwise interleaved striped decomposition
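A sequential simulation may clarify the decomposition; here each of p (hypothetical) processes owns the rows j with j mod p == q (rowwise interleaved striping), and the broadcast of x[i] is modelled by a plain shared variable:

```python
def row_oriented_back_substitute(a, b, p=2):
    """Simulated row-oriented parallel back substitution.
    Rows are dealt to p processes in interleaved fashion."""
    n = len(b)
    b = list(b)
    x = [0.0] * n
    owner = [j % p for j in range(n)]      # rowwise interleaved striping
    for i in range(n - 1, -1, -1):
        xi = b[i] / a[i][i]                # computed by process owner[i] ...
        x[i] = xi                          # ... then broadcast to everyone
        for q in range(p):                 # each process, conceptually in
            for j in range(i):             # parallel, updates only its rows
                if owner[j] == q:
                    b[j] -= xi * a[j][i]
    return x

a = [[1,  1, -1,  4],
     [0, -2, -3,  1],
     [0,  0,  2, -3],
     [0,  0,  0,  2]]
x = row_oriented_back_substitute(a, [8, 5, 0, 4])
```

The per-process work is the same arithmetic as the serial loop, just partitioned by row ownership, so the result is identical.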
Complexity Analysis

for i = n - 1 down to 0 do
    x[i] = b[i] / a[i,i]
    for j = 0 to i - 1 do
        b[j] = b[j] - x[i] * a[j,i]
    endfor
endfor

Computation complexity:
- each process performs about n/(2p) iterations of loop j, on average, per outer iteration
- there are about n outer iterations in all
- overall computational complexity: Θ(n^2/p)
Communication complexity:
- one broadcast per outer iteration, each taking Θ(log p) time, over about n iterations
- overall communication complexity: Θ(n log p)
Isoefficiency Analysis

Isoefficiency relation: T(n, 1) >= C T0(n, p)
(T(n, 1): sequential time; T0(n, p): total parallel overhead)
Sequential time complexity: T(n, 1) = Θ(n^2)
Parallel overhead is dominated by the broadcasts, O(n log p) per process:
T0(n, p) = p * O(n log p)
n^2 >= C p n log p  =>  n >= C p log p
Scalability function: M(f(p))/p, with memory requirement M(n) = n^2:
M(C p log p)/p = C^2 p log^2 p
Poor scalability...
LU Factorization

Useful if solving for multiple right-hand sides (same matrix): Ax = b1, Ax = b2, ...
Compute the LU factorization A = LU, where L is unit lower triangular and U is upper triangular.
The solution is then obtained in two steps:
- Ly = b: lower triangular system, solved by forward substitution to obtain the vector y
- Ux = y: upper triangular system, solved by back substitution to obtain the solution x to the original system
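The two-step solve can be sketched as follows (Python with NumPy, an assumption; the factors L and U here are hand-picked for illustration), reusing one factorization for two right-hand sides:

```python
import numpy as np

def solve_lu(L, U, b):
    """Given A = L U (L unit lower triangular, U upper triangular),
    solve A x = b by forward substitution then back substitution."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):                        # L y = b, top to bottom
        y[i] = b[i] - L[i, :i] @ y[:i]        # L[i,i] == 1, no division
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):            # U x = y, bottom to top
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

# one factorization, two right-hand sides
L = np.array([[1.0, 0.0], [0.5, 1.0]])
U = np.array([[4.0, 2.0], [0.0, 3.0]])
A = L @ U                                     # [[4, 2], [2, 4]]
for b in (np.array([6.0, 6.0]), np.array([10.0, 2.0])):
    x = solve_lu(L, U, b)
    assert np.allclose(A @ x, b)
```
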
Factorization by Gaussian Elimination

LU factorization can be computed by Gaussian elimination as follows, where U overwrites A:

for k = 1 to n - 1 do                  {loop over columns}
    for i = k + 1 to n do              {compute multipliers}
        l[i,k] = a[i,k] / a[k,k]       {for current column}
    endfor
    for j = k + 1 to n do              {apply transformation to}
        for i = k + 1 to n do          {remaining submatrix}
            a[i,j] = a[i,j] - l[i,k] * a[k,j]
        endfor
    endfor
endfor
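A direct transcription of the loops above into Python (NumPy assumed), with U overwriting a copy of A and the multipliers collected into L:

```python
import numpy as np

def lu_factor_nopivot(A):
    """Gaussian elimination without pivoting: returns L, U with A = L U.
    U overwrites a copy of A; the multipliers l_ik go into L."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):                                 # loop over columns
        L[k+1:, k] = A[k+1:, k] / A[k, k]                  # compute multipliers
        A[k+1:, k] = 0.0                                   # exact zeros below pivot
        A[k+1:, k+1:] -= np.outer(L[k+1:, k], A[k, k+1:])  # update remaining submatrix
    return L, A

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
L, U = lu_factor_nopivot(A)
```
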
Factorization by Gaussian Elimination

In general, row interchanges (pivoting) may be required to ensure the existence of the LU factorization and the numerical stability of the Gaussian elimination algorithm, but for simplicity we temporarily ignore this issue.
Gaussian elimination requires about n^3/3 paired additions and multiplications, so we model the serial time as
T1 = tc * n^3/3
where tc is the time required for one multiply-add operation.
About n^2/2 divisions are also required, but we ignore this lower-order term.
Loop Orderings for Gaussian Elimination

Gaussian elimination has the general form of a triple-nested loop, in which the entries of L and U overwrite those of A:

for ...
    for ...
        for ...
            a[i,j] = a[i,j] - (a[i,k] / a[k,k]) * a[k,j]
        endfor
    endfor
endfor

The loop indices i, j, k can be nested in any order, giving the different forms of the algorithm.
Perhaps most promising for parallel implementation are the kij and kji forms, which differ only in accessing the matrix by rows or by columns, respectively.
Gaussian Elimination Algorithm

kij form of Gaussian elimination:

for k = 1 to n - 1 do
    for i = k + 1 to n do
        l[i,k] = a[i,k] / a[k,k]
    endfor
    for i = k + 1 to n do
        for j = k + 1 to n do
            a[i,j] = a[i,j] - l[i,k] * a[k,j]
        endfor
    endfor
endfor

Multipliers l[i,k] are computed outside the inner loop for greater efficiency.
Parallel Algorithm

Partition
- For i, j = 1, ..., n, fine-grain task (i, j) stores a[i,j] and computes and stores
  u[i,j] if i <= j, or l[i,j] if i > j,
  yielding a 2-D array of n^2 fine-grain tasks
Communication
- Broadcast entries of A vertically to the tasks below
- Broadcast entries of L horizontally to the tasks to the right
Fine-Grain Tasks and Communication
Fine-Grain Parallel Algorithm

for k = 1 to min(i, j) - 1 do
    recv broadcast of a[k,j] from task (k, j)             {vert bcast}
    recv broadcast of l[i,k] from task (i, k)             {horiz bcast}
    a[i,j] = a[i,j] - l[i,k] * a[k,j]                     {update entry}
endfor
if i <= j then
    broadcast a[i,j] to tasks (k, j), k = i + 1, ..., n   {vert bcast}
else
    recv broadcast of a[j,j] from task (j, j)             {vert bcast}
    l[i,j] = a[i,j] / a[j,j]                              {multiplier}
    broadcast l[i,j] to tasks (i, k), k = j + 1, ..., n   {horiz bcast}
endif
Agglomeration

Agglomerate
With an n x n array of fine-grain tasks, natural strategies are:
- 2-D: combine a k x k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k)^2 coarse-grain tasks
- 1-D column: combine the n fine-grain tasks in each column into a coarse-grain task, yielding n coarse-grain tasks
- 1-D row: combine the n fine-grain tasks in each row into a coarse-grain task, yielding n coarse-grain tasks
Mapping

Map
- 2-D: assign (n/k)^2/p coarse-grain tasks to each of p processes using any desired mapping in each dimension, treating the target network as a 2-D mesh
- 1-D: assign n/p coarse-grain tasks to each of p processes using any desired mapping, treating the target network as a 1-D mesh
Scalability for 2-D Agglomeration

Updating by each process at step k requires about (n - k)^2/p operations.
Summing over the n - 1 steps:

T_comp ≈ Σ_{k=1..n-1} tc (n - k)^2 / p ≈ tc n^3 / (3p)
Scalability for 2-D Agglomeration

Similarly, the amount of data broadcast at step k along each process row and column is about (n - k)/√p, so on a 2-D mesh

T_comm ≈ Σ_{k=1..n-1} 2 (ts + tw (n - k)/√p) ≈ 2 ts n + tw n^2/√p

where we have allowed for overlap of broadcasts for successive steps.
Total execution time is

Tp ≈ tc n^3/(3p) + 2 ts n + tw n^2/√p
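Plugging illustrative numbers into the model gives a feel for how efficiency behaves as p grows at fixed n (the constants tc, ts, tw below are invented for illustration, not measurements):

```python
from math import sqrt

def t_parallel(n, p, tc=1e-9, ts=1e-5, tw=1e-8):
    """Model Tp ~ tc n^3/(3p) + 2 ts n + tw n^2/sqrt(p)."""
    return tc * n**3 / (3 * p) + 2 * ts * n + tw * n**2 / sqrt(p)

def efficiency(n, p, tc=1e-9, ts=1e-5, tw=1e-8):
    """E = T1 / (p * Tp), with the serial model T1 = tc n^3/3."""
    return (tc * n**3 / 3) / (p * t_parallel(n, p, tc, ts, tw))

n = 4000
effs = [efficiency(n, p) for p in (1, 4, 16, 64)]
# efficiency decays slowly as p grows at fixed n, driven by the overhead terms
```
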
Isoefficiency Analysis

Isoefficiency relation: T(n, 1) >= C T0(n, p)
(T(n, 1): sequential time; T0(n, p): total parallel overhead)
Sequential time complexity: T(n, 1) = O(n^3)
Parallel overhead is dominated by the broadcasts: O(2 ts n + tw n^2/√p) = O(n^2/√p) per process, so
T0(n, p) = p * O(n^2/√p) = O(√p n^2)
n^3 >= C √p n^2  =>  n >= C √p
Scalability function: M(f(p))/p, with M(n) = n^2:
M(C √p)/p = C^2 p / p = C^2
Perfect scalability!
Pivoting

Pivoting is the action of exchanging matrix elements so as to use a different pivot.
- The main reason is to choose a pivot that creates fewer fill-ins during elimination (a fill-in is a previously non-existent element that becomes nonzero)
- Other reasons are numerical
Partial pivoting complicates the parallel implementation of Gaussian elimination and significantly affects potential performance:
- With the 2-D algorithm, the pivot search is parallel but requires communication within a process column and inhibits the overlapping of successive steps
- With the 1-D column algorithm, the pivot search requires no communication but is purely serial
- Once the pivot is found, the index of the pivot row must be communicated to the other processes, and the rows must be explicitly or implicitly interchanged in each process
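A sketch of what partial pivoting adds to the factorization (Python/NumPy, an assumption as before): at each step the entry of largest magnitude in the current column is swapped into the pivot position, and the interchanges are recorded in a permutation, giving P A = L U:

```python
import numpy as np

def lu_partial_pivot(A):
    """Gaussian elimination with partial pivoting: P A = L U.
    At step k the largest |a_ik|, i >= k, is swapped into the pivot position."""
    A = A.astype(float).copy()
    n = A.shape[0]
    piv = np.arange(n)                         # row permutation, P A == A[piv]
    L = np.eye(n)
    for k in range(n - 1):
        m = k + np.argmax(np.abs(A[k:, k]))    # pivot search in column k
        if m != k:                             # interchange rows k and m
            A[[k, m], :] = A[[m, k], :]
            L[[k, m], :k] = L[[m, k], :k]      # swap already-computed multipliers
            piv[[k, m]] = piv[[m, k]]
        L[k+1:, k] = A[k+1:, k] / A[k, k]
        A[k+1:, k] = 0.0
        A[k+1:, k+1:] -= np.outer(L[k+1:, k], A[k, k+1:])
    return piv, L, A

A = np.array([[1e-12, 1.0],                    # tiny pivot without interchange
              [1.0,   1.0]])
piv, L, U = lu_partial_pivot(A)
```

With the interchange, every multiplier satisfies |l_ij| <= 1, which is the stability benefit of partial pivoting.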
Next Class
Efficient parallelization of numerical algorithms
- Relaxation Methods
- Finite Difference discretization