
Page 1: Parallel Systems

Parallel Systems

Dr. Guy Tel-Zur

Page 2: Parallel Systems

Agenda

• Barnes-Hut (final remarks)
• Continue slides5 from the previous lecture
• MPI Virtual Topologies
• ScaLAPACK
• Mixing programming languages
• Impressions from the SC12 conference
• Home Assignment #2

Please think hard and send me proposals for the final presentation!!!

Page 4: Parallel Systems

In this example, you will put together some of the previous examples to implement a simple Jacobi iteration for approximating the solution to a linear system of equations. Here we solve the Laplace equation in two dimensions with finite differences. This may sound involved, but it really amounts only to a simple computation, combined with the previous example of a parallel mesh data structure.

Any numerical analysis text will show that iterating

while (not converged) {
    for (i,j)
        xnew[i][j] = (x[i+1][j] + x[i-1][j] + x[i][j+1] + x[i][j-1]) / 4;
    for (i,j)
        x[i][j] = xnew[i][j];
}

Page 5: Parallel Systems

will compute an approximation for the solution of Laplace's equation. There is one last detail: this replacement of xnew with the average of the values around it is applied only in the interior; the boundary values are left fixed. In practice, this means that if the mesh is n by n, then the values

x[0][j], x[n-1][j], x[i][0], x[i][n-1]

are left unchanged. Of course, these refer to the complete mesh; you'll have to figure out what to do for the decomposed data structures (xlocal). Because the values are replaced by averaging around them, these techniques are called relaxation methods.

We wish to compute this approximation in parallel. Write a program to apply this approximation. For convergence testing, compute

diffnorm = 0;
for (i,j)
    diffnorm += (xnew[i][j] - x[i][j]) * (xnew[i][j] - x[i][j]);
diffnorm = sqrt(diffnorm);
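Putting the two fragments together, here is a minimal serial sketch (not part of the exercise code) of the full iteration; the 12 x 12 mesh size and the uniform -1 boundary are illustrative assumptions:

#include <math.h>
#include <stdio.h>
#define N 12                          /* illustrative mesh size */

int main(void)
{
    double x[N][N] = {{0.0}}, xnew[N][N], diffnorm;
    int i, j, itcnt = 0;

    for (i = 0; i < N; i++)           /* fixed boundary: -1 on all sides */
        x[i][0] = x[i][N-1] = x[0][i] = x[N-1][i] = -1.0;

    do {
        itcnt++;
        diffnorm = 0.0;
        for (i = 1; i < N-1; i++)     /* relax interior points only */
            for (j = 1; j < N-1; j++) {
                xnew[i][j] = (x[i+1][j] + x[i-1][j]
                            + x[i][j+1] + x[i][j-1]) / 4.0;
                diffnorm += (xnew[i][j] - x[i][j])
                          * (xnew[i][j] - x[i][j]);
            }
        for (i = 1; i < N-1; i++)     /* copy back after the full sweep */
            for (j = 1; j < N-1; j++)
                x[i][j] = xnew[i][j];
        diffnorm = sqrt(diffnorm);
    } while (diffnorm > 1.0e-2 && itcnt < 100);

    printf("stopped after %d iterations, diff %e\n", itcnt, diffnorm);
    return 0;
}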

Page 6: Parallel Systems

You'll need to use MPI_Allreduce for this. (Why not use MPI_Reduce?) Have process zero write out the value of diffnorm and the iteration count at each iteration. When diffnorm is less than 1.0e-2, consider the iteration converged. Also, if you reach 100 iterations, exit the loop.
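To see the difference concretely, here is a minimal sketch (not from the slides): MPI_Allreduce makes the global sum available on every rank in one call, whereas MPI_Reduce leaves it valid only on the root, which then needs an extra MPI_Bcast so that all ranks evaluate the same loop test.

#include "mpi.h"
#include <math.h>
#include <stdio.h>

int main( int argc, char **argv )
{
    int rank;
    double diffnorm, gdiffnorm;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    diffnorm = (double)rank;          /* stand-in for the local sum */

    /* Option 1: one call, result available on every rank */
    MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD );

    /* Option 2: the equivalent two-step version */
    MPI_Reduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM, 0,
                MPI_COMM_WORLD );
    MPI_Bcast( &gdiffnorm, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD );

    printf( "rank %d sees %e\n", rank, sqrt(gdiffnorm) );
    MPI_Finalize( );
    return 0;
}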

Page 7: Parallel Systems

For simplicity, consider a 12 x 12 mesh on 4 processors

The example solution uses the boundary values from the previous exercise; they are -1 on the top and bottom, and the rank of the process on the side. The initial data (the values of x that are being relaxed) are also the same; the interior points have the same value as the rank of the process. This is shown below:

-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 3  3  3  3  3  3  3  3  3  3  3  3
 3  3  3  3  3  3  3  3  3  3  3  3
 2  2  2  2  2  2  2  2  2  2  2  2
 2  2  2  2  2  2  2  2  2  2  2  2
 2  2  2  2  2  2  2  2  2  2  2  2
 1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1
 0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

Page 8: Parallel Systems

#include "mpi.h"

/* This example handles a 12 x 12 mesh, on 4 processors only. */#define maxn 12

int main( argc, argv )int argc;char **argv;{ int rank, value, size, errcnt, toterr, i, j, itcnt; int i_first, i_last; MPI_Status status; double diffnorm, gdiffnorm; double xlocal[(12/4)+2][12]; double xnew[(12/3)+2][12];

MPI_Init( &argc, &argv );

MPI_Comm_rank( MPI_COMM_WORLD, &rank ); MPI_Comm_size( MPI_COMM_WORLD, &size );

if (size != 4) MPI_Abort( MPI_COMM_WORLD, 1 );

Page 9: Parallel Systems

    /* xlocal[0][] is the lower ghost row, xlocal[maxn/size+1][] the upper */

    /* Note that top and bottom processes have one less row of interior points */
    i_first = 1;
    i_last  = maxn/size;
    if (rank == 0)        i_first++;
    if (rank == size - 1) i_last--;

    /* Fill the data as specified */
    for (i=1; i<=maxn/size; i++)
        for (j=0; j<maxn; j++)
            xlocal[i][j] = rank;
    for (j=0; j<maxn; j++) {
        xlocal[i_first-1][j] = -1;
        xlocal[i_last+1][j]  = -1;
    }

Page 10: Parallel Systems

    itcnt = 0;
    do {
        /* Send up unless I'm at the top, then receive from below */
        /* Note the use of xlocal[i] for &xlocal[i][0] */
        if (rank < size - 1)
            MPI_Send( xlocal[maxn/size], maxn, MPI_DOUBLE, rank + 1, 0,
                      MPI_COMM_WORLD );
        if (rank > 0)
            MPI_Recv( xlocal[0], maxn, MPI_DOUBLE, rank - 1, 0,
                      MPI_COMM_WORLD, &status );
        /* Send down unless I'm at the bottom */
        if (rank > 0)
            MPI_Send( xlocal[1], maxn, MPI_DOUBLE, rank - 1, 1,
                      MPI_COMM_WORLD );
        if (rank < size - 1)
            MPI_Recv( xlocal[maxn/size+1], maxn, MPI_DOUBLE, rank + 1, 1,
                      MPI_COMM_WORLD, &status );

Page 11: Parallel Systems

        /* Compute new values (but not on boundary) */
        itcnt++;
        diffnorm = 0.0;
        for (i=i_first; i<=i_last; i++)
            for (j=1; j<maxn-1; j++) {
                xnew[i][j] = (xlocal[i][j+1] + xlocal[i][j-1] +
                              xlocal[i+1][j] + xlocal[i-1][j]) / 4.0;
                diffnorm += (xnew[i][j] - xlocal[i][j]) *
                            (xnew[i][j] - xlocal[i][j]);
            }
        /* Only transfer the interior points */
        for (i=i_first; i<=i_last; i++)
            for (j=1; j<maxn-1; j++)
                xlocal[i][j] = xnew[i][j];
        MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD );
        gdiffnorm = sqrt( gdiffnorm );
        if (rank == 0)
            printf( "At iteration %d, diff is %e\n", itcnt, gdiffnorm );
    } while (gdiffnorm > 1.0e-2 && itcnt < 100);

    MPI_Finalize( );
    return 0;
}

Page 12: Parallel Systems

The Makefile:

# Generated automatically from Makefile.in by configure.
ALL: jacobi
SHELL = /bin/sh
DIRS =

jacobi: jacobi.c
	mpicc -o jacobi jacobi.c -lm

profile.alog: jacobi.c
	mpicc -o jacobi.log -mpilog jacobi.c -lm
	mpirun -np 4 jacobi.log
	/bin/mv jacobi.log_profile.log profile.alog

clean:
	/bin/rm -f jacobi jacobi.o jacobi.log
	#for dir in $(DIRS) ; do \
	#    ( cd $$dir ; make clean ) ; done

Page 14: Parallel Systems

Cartesian Constructor

Parameter   Meaning of Parameter
comm_old    input communicator (handle)
ndims       number of dimensions of the Cartesian grid (integer)
dims        integer array of size ndims specifying the number of processes in each dimension
periods     logical array of size ndims specifying whether the grid is periodic (true) or not (false) in each dimension
reorder     ranking may be reordered (true) or not (false) (logical)
comm_cart   communicator with the new Cartesian topology (handle)

int MPI_Cart_create ( MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart )

Page 15: Parallel Systems

Cartesian Convenience - MPI_Dims_create

int MPI_Dims_create( int nnodes, int ndims, int *dims)

Parameter   Meaning of Parameter
nnodes      number of nodes in a grid (integer)
ndims       number of Cartesian dimensions (integer)
dims        integer array of size ndims specifying the number of nodes in each dimension

MPI_Dims_create creates a division of processors in a Cartesian grid; entries of dims set to 0 are filled in by the routine.
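As a quick illustration (a minimal sketch, not from the slides), zeroed dims entries let MPI choose the factorization, and the result can be passed directly to MPI_Cart_create:

#include "mpi.h"
#include <stdio.h>

int main( int argc, char **argv )
{
    int size, rank;
    int dims[2]    = {0, 0};   /* 0 = let MPI_Dims_create choose   */
    int periods[2] = {0, 0};   /* non-periodic in both dimensions  */
    MPI_Comm cartcomm;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    MPI_Dims_create( size, 2, dims );   /* e.g. 12 -> 4 x 3         */
    MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, 1, &cartcomm );

    MPI_Comm_rank( cartcomm, &rank );
    if (rank == 0)
        printf( "process grid: %d x %d\n", dims[0], dims[1] );

    MPI_Finalize( );
    return 0;
}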

Page 16: Parallel Systems

See more:

http://www.rc.usf.edu/tutorials/classes/tutorial/mpi/chapter10.html

Page 17: Parallel Systems

#include "mpi.h"#include <stdio.h>#define SIZE 16#define UP 0#define DOWN 1#define LEFT 2#define RIGHT 3

int main(argc,argv)int argc;char *argv[]; {int numtasks, rank, source, dest, outbuf, i, tag=1, inbuf[4]={MPI_PROC_NULL,MPI_PROC_NULL,MPI_PROC_NULL,MPI_PROC_NULL,}, nbrs[4], dims[2]={4,4}, periods[2]={0,0}, reorder=0, coords[2];

MPI_Request reqs[8];MPI_Status stats[8];MPI_Comm cartcomm;

MPI_Init(&argc,&argv);MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

An Example

Page 18: Parallel Systems

    if (numtasks == SIZE) {
        MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, reorder, &cartcomm );
        MPI_Comm_rank( cartcomm, &rank );
        MPI_Cart_coords( cartcomm, rank, 2, coords );
        MPI_Cart_shift( cartcomm, 0, 1, &nbrs[UP], &nbrs[DOWN] );
        MPI_Cart_shift( cartcomm, 1, 1, &nbrs[LEFT], &nbrs[RIGHT] );

        printf( "rank= %d coords= %d %d neighbors(u,d,l,r)= %d %d %d %d\n",
                rank, coords[0], coords[1],
                nbrs[UP], nbrs[DOWN], nbrs[LEFT], nbrs[RIGHT] );
        outbuf = rank;

        /* Exchange ranks with the four neighbors.  With reorder = 0 the
           ranks in cartcomm are identical to those in MPI_COMM_WORLD,
           which is why communicating on MPI_COMM_WORLD with neighbor
           ranks from cartcomm is safe here. */
        for (i = 0; i < 4; i++) {
            dest   = nbrs[i];
            source = nbrs[i];
            MPI_Isend( &outbuf, 1, MPI_INT, dest, tag,
                       MPI_COMM_WORLD, &reqs[i] );
            MPI_Irecv( &inbuf[i], 1, MPI_INT, source, tag,
                       MPI_COMM_WORLD, &reqs[i+4] );
        }

        MPI_Waitall( 8, reqs, stats );
        printf( "rank= %d inbuf(u,d,l,r)= %d %d %d %d\n",
                rank, inbuf[UP], inbuf[DOWN], inbuf[LEFT], inbuf[RIGHT] );
    }
    else
        printf( "Must specify %d processors. Terminating.\n", SIZE );

    MPI_Finalize( );
    return 0;
}

Page 20: Parallel Systems

ScaLAPACK

Page 21: Parallel Systems
Page 22: Parallel Systems

Hands-On Exercises for ScaLAPACK:
http://acts.nersc.gov/scalapack/hands-on/

PBLAS Quick Reference Card:
http://www.netlib.org/scalapack/pblas_qref.html

ScaLAPACK home page:
http://www.netlib.org/scalapack/

Page 23: Parallel Systems

C interface to BLACS

http://www.netlib.org/blacs/cblacsqref.ps

See next slide

Page 24: Parallel Systems

BLACS

Page 25: Parallel Systems
Page 27: Parallel Systems
Page 28: Parallel Systems

Mixing C and FORTRAN

• Demo on my other laptop
  – ~/tests/cprog.c and ffunction.f
  – http://www.cae.tntech.edu/help/programming/mixed_languages
  – Important! Column-major order vs. row-major order: C => row, Fortran => column (see the sketch below)
  – Ref: http://en.wikipedia.org/wiki/Row-major_order
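The demo files themselves are not reproduced in the slides, so here is a minimal sketch under assumed names (a cprog.c calling a hypothetical Fortran subroutine standing in for ffunction.f): with gcc/gfortran the Fortran symbol gets a trailing underscore, and every argument is passed by reference.

/* cprog.c -- hypothetical sketch; build with:
   gfortran -c ffunction.f && gcc cprog.c ffunction.o -lgfortran

   Assumed Fortran side (ffunction.f), summing an array:
         subroutine ffunction(a, n, s)
         integer n
         double precision a(n), s
         integer i
         s = 0.0d0
         do 10 i = 1, n
   10       s = s + a(i)
         return
         end                                                       */
#include <stdio.h>

void ffunction_( double *a, int *n, double *s );  /* note the underscore */

int main(void)
{
    double a[3] = { 1.0, 2.0, 3.0 };
    int    n = 3;
    double s;

    ffunction_( a, &n, &s );       /* all arguments passed by address */
    printf( "sum = %f\n", s );     /* prints 6.000000 */
    return 0;
}

For 2-D arrays the memory-layout difference matters as well: a C array handed to Fortran this way is seen transposed, because C stores rows contiguously and Fortran stores columns contiguously.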

Page 29: Parallel Systems

Example: PDGEMV()

This sample uses PDGEMV(), which computes a distributed matrix-vector product y = alpha*A*x + beta*y. Let

A = [1 4 7 10 13 ; 3 6 9 12 15 ; 5 8 11 14 17 ; 7 10 13 16 19 ; 9 12 15 18 21]

and x = [1, 1, 0, 0, 1]^T, a column vector (T denotes transpose). Call the PDGEMV() routine to compute y = Ax. The correct result is y = [18, 24, 30, 36, 42]^T (e.g., the first entry is 1*1 + 4*1 + 7*0 + 10*0 + 13*1 = 18).

Code sample in C: pdgemv.c. The function call in C looks like:

double alpha = 1.0;
double beta  = 0.0;
pdgemv_( "N", &M, &M, &alpha, A, &ONE, &ONE, descA,
         x, &ONE, &ONE, descx, &ONE,
         &beta, y, &ONE, &ONE, descy, &ONE );
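The call above assumes a BLACS process grid and array descriptors are already in place. Since pdgemv.c itself is not reproduced here, the following is a minimal sketch of that surrounding setup, with assumed C prototypes for the BLACS/ScaLAPACK entry points; allocating and filling the local pieces of A, x and y is omitted.

/* Sketch only: grid and descriptor setup around a pdgemv_ call. */
extern void Cblacs_pinfo( int *mypnum, int *nprocs );
extern void Cblacs_get( int context, int request, int *value );
extern void Cblacs_gridinit( int *context, char *order, int nprow, int npcol );
extern void Cblacs_gridinfo( int context, int *nprow, int *npcol,
                             int *myrow, int *mycol );
extern void Cblacs_gridexit( int context );
extern void Cblacs_exit( int error_code );
extern int  numroc_( int *n, int *nb, int *iproc, int *isrcproc, int *nprocs );
extern void descinit_( int *desc, int *m, int *n, int *mb, int *nb,
                       int *irsrc, int *icsrc, int *ictxt, int *lld, int *info );

int main( void )
{
    int ictxt, mypnum, nprocs, myrow, mycol;
    int nprow = 2, npcol = 2;               /* 2 x 2 process grid        */
    int M = 5, NB = 2, ONE = 1, ZERO = 0;   /* 5 x 5 matrix, 2 x 2 blocks */
    int descA[9], descx[9], descy[9], lld, info;

    Cblacs_pinfo( &mypnum, &nprocs );
    Cblacs_get( -1, 0, &ictxt );            /* default system context    */
    Cblacs_gridinit( &ictxt, "Row", nprow, npcol );
    Cblacs_gridinfo( ictxt, &nprow, &npcol, &myrow, &mycol );

    lld = numroc_( &M, &NB, &myrow, &ZERO, &nprow );  /* local row count */
    if (lld < 1) lld = 1;

    descinit_( descA, &M, &M,   &NB, &NB,  &ZERO, &ZERO, &ictxt, &lld, &info );
    descinit_( descx, &M, &ONE, &NB, &ONE, &ZERO, &ZERO, &ictxt, &lld, &info );
    descinit_( descy, &M, &ONE, &NB, &ONE, &ZERO, &ZERO, &ictxt, &lld, &info );

    /* ... fill the local pieces of A and x, then make the pdgemv_
       call shown above ... */

    Cblacs_gridexit( ictxt );
    Cblacs_exit( 0 );
    return 0;
}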

Page 31: Parallel Systems

Purpose
=======

PDGEMV performs one of the matrix-vector operations

   sub( Y ) := alpha*sub( A ) *sub( X ) + beta*sub( Y ), or
   sub( Y ) := alpha*sub( A )'*sub( X ) + beta*sub( Y ),

where

   sub( A ) denotes A(IA:IA+M-1,JA:JA+N-1).

When TRANS = 'N',

   sub( X ) denotes X(IX:IX,JX:JX+N-1), if INCX = M_X,
                    X(IX:IX+N-1,JX:JX), if INCX = 1 and INCX <> M_X,
   and,
   sub( Y ) denotes Y(IY:IY,JY:JY+M-1), if INCY = M_Y,
                    Y(IY:IY+M-1,JY:JY), if INCY = 1 and INCY <> M_Y,

and, otherwise

   sub( X ) denotes X(IX:IX,JX:JX+M-1), if INCX = M_X,
                    X(IX:IX+M-1,JX:JX), if INCX = 1 and INCX <> M_X,
   and,
   sub( Y ) denotes Y(IY:IY,JY:JY+N-1), if INCY = M_Y,
                    Y(IY:IY+N-1,JY:JY), if INCY = 1 and INCY <> M_Y.

Alpha and beta are scalars, and sub( X ) and sub( Y ) are subvectors, and sub( A ) is an m by n submatrix.