
Page 1: Parallel Systems

Parallel Systems

Dr. Guy Tel-Zur

Page 2: Parallel Systems

Agenda

• Barnes-Hut (final remarks)
• Continue slides5 from the previous lecture
• MPI Virtual Topologies
• ScaLAPACK
• Mixing programming languages
• Impressions from the SC12 conference
• Home Assignment #2

Please think hard and send me proposals for the final presentation!!!

Page 4: Parallel Systems

In this example, you will put together some of the previous examples to implement a simple Jacobi iteration for approximating the solution to a linear system of equations. Here we solve the Laplace equation in two dimensions with finite differences. This may sound involved, but it really amounts only to a simple computation, combined with the previous example of a parallel mesh data structure.

Any numerical analysis text will show that iterating

while (not converged) {
    for (i,j)
        xnew[i][j] = (x[i+1][j] + x[i-1][j] + x[i][j+1] + x[i][j-1]) / 4;
    for (i,j)
        x[i][j] = xnew[i][j];
}

Page 5: Parallel Systems

will compute an approximation for the solution of Laplace's equation. There is one last detail: this replacement of xnew with the average of the values around it is applied only in the interior; the boundary values are left fixed. In practice, this means that if the mesh is n by n, then the values

x[0][j], x[n-1][j], x[i][0], x[i][n-1]

are left unchanged. Of course, these refer to the complete mesh; you'll have to figure out what to do for the decomposed data structures (xlocal). Because the values are replaced by averaging around them, these techniques are called relaxation methods.

We wish to compute this approximation in parallel. Write a program to apply this approximation. For convergence testing, compute

diffnorm = 0;
for (i,j)
    diffnorm += (xnew[i][j] - x[i][j]) * (xnew[i][j] - x[i][j]);
diffnorm = sqrt(diffnorm);
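Putting the two fragments together, here is a minimal serial sketch (not part of the exercise code) of the full iteration; the 12 x 12 mesh size and the uniform -1 boundary are illustrative assumptions:

#include <math.h>
#include <stdio.h>
#define N 12                          /* illustrative mesh size */

int main(void)
{
    double x[N][N] = {{0.0}}, xnew[N][N], diffnorm;
    int i, j, itcnt = 0;

    for (i = 0; i < N; i++)           /* fixed boundary: -1 on all sides */
        x[i][0] = x[i][N-1] = x[0][i] = x[N-1][i] = -1.0;

    do {
        itcnt++;
        diffnorm = 0.0;
        for (i = 1; i < N-1; i++)     /* relax interior points only */
            for (j = 1; j < N-1; j++) {
                xnew[i][j] = (x[i+1][j] + x[i-1][j]
                            + x[i][j+1] + x[i][j-1]) / 4.0;
                diffnorm += (xnew[i][j] - x[i][j])
                          * (xnew[i][j] - x[i][j]);
            }
        for (i = 1; i < N-1; i++)     /* copy back after the full sweep */
            for (j = 1; j < N-1; j++)
                x[i][j] = xnew[i][j];
        diffnorm = sqrt(diffnorm);
    } while (diffnorm > 1.0e-2 && itcnt < 100);

    printf("stopped after %d iterations, diff %e\n", itcnt, diffnorm);
    return 0;
}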

Page 6: Parallel Systems

You'll need to use MPI_Allreduce for this. (Why not use MPI_Reduce?) Have process zero write out the value of diffnorm and the iteration count at each iteration. When diffnorm is less than 1.0e-2, consider the iteration converged. Also, if you reach 100 iterations, exit the loop.
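To see the difference concretely, here is a minimal sketch (not from the slides): MPI_Allreduce makes the global sum available on every rank in one call, whereas MPI_Reduce leaves it valid only on the root, which then needs an extra MPI_Bcast so that all ranks evaluate the same loop test.

#include "mpi.h"
#include <math.h>
#include <stdio.h>

int main( int argc, char **argv )
{
    int rank;
    double diffnorm, gdiffnorm;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    diffnorm = (double)rank;          /* stand-in for the local sum */

    /* Option 1: one call, result available on every rank */
    MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD );

    /* Option 2: the equivalent two-step version */
    MPI_Reduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM, 0,
                MPI_COMM_WORLD );
    MPI_Bcast( &gdiffnorm, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD );

    printf( "rank %d sees %e\n", rank, sqrt(gdiffnorm) );
    MPI_Finalize( );
    return 0;
}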

Page 7: Parallel Systems

For simplicity, consider a 12 x 12 mesh on 4 processors

The example solution uses the boundary values from the previous exercise; they are -1 on the top and bottom, and the rank of the process on the side. The initial data (the values of x that are being relaxed) are also the same; the interior points have the same value as the rank of the process. This is shown below:

-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 3  3  3  3  3  3  3  3  3  3  3  3
 3  3  3  3  3  3  3  3  3  3  3  3
 2  2  2  2  2  2  2  2  2  2  2  2
 2  2  2  2  2  2  2  2  2  2  2  2
 2  2  2  2  2  2  2  2  2  2  2  2
 1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1
 0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

Page 8: Parallel Systems

#include "mpi.h"

/* This example handles a 12 x 12 mesh, on 4 processors only. */#define maxn 12

int main( argc, argv )int argc;char **argv;{ int rank, value, size, errcnt, toterr, i, j, itcnt; int i_first, i_last; MPI_Status status; double diffnorm, gdiffnorm; double xlocal[(12/4)+2][12]; double xnew[(12/3)+2][12];

MPI_Init( &argc, &argv );

MPI_Comm_rank( MPI_COMM_WORLD, &rank ); MPI_Comm_size( MPI_COMM_WORLD, &size );

if (size != 4) MPI_Abort( MPI_COMM_WORLD, 1 );

Page 9: Parallel Systems

    /* xlocal[0][] is the lower ghost row, xlocal[maxn/size+1][] the upper */

    /* Note that top and bottom processes have one less row of interior points */
    i_first = 1;
    i_last  = maxn/size;
    if (rank == 0)        i_first++;
    if (rank == size - 1) i_last--;

    /* Fill the data as specified */
    for (i=1; i<=maxn/size; i++)
        for (j=0; j<maxn; j++)
            xlocal[i][j] = rank;
    for (j=0; j<maxn; j++) {
        xlocal[i_first-1][j] = -1;
        xlocal[i_last+1][j]  = -1;
    }

Page 10: Parallel Systems

    itcnt = 0;
    do {
        /* Send up unless I'm at the top, then receive from below */
        /* Note the use of xlocal[i] for &xlocal[i][0] */
        if (rank < size - 1)
            MPI_Send( xlocal[maxn/size], maxn, MPI_DOUBLE, rank + 1, 0,
                      MPI_COMM_WORLD );
        if (rank > 0)
            MPI_Recv( xlocal[0], maxn, MPI_DOUBLE, rank - 1, 0,
                      MPI_COMM_WORLD, &status );
        /* Send down unless I'm at the bottom */
        if (rank > 0)
            MPI_Send( xlocal[1], maxn, MPI_DOUBLE, rank - 1, 1,
                      MPI_COMM_WORLD );
        if (rank < size - 1)
            MPI_Recv( xlocal[maxn/size+1], maxn, MPI_DOUBLE, rank + 1, 1,
                      MPI_COMM_WORLD, &status );

Page 11: Parallel Systems

        /* Compute new values (but not on boundary) */
        itcnt++;
        diffnorm = 0.0;
        for (i=i_first; i<=i_last; i++)
            for (j=1; j<maxn-1; j++) {
                xnew[i][j] = (xlocal[i][j+1] + xlocal[i][j-1] +
                              xlocal[i+1][j] + xlocal[i-1][j]) / 4.0;
                diffnorm += (xnew[i][j] - xlocal[i][j]) *
                            (xnew[i][j] - xlocal[i][j]);
            }
        /* Only transfer the interior points */
        for (i=i_first; i<=i_last; i++)
            for (j=1; j<maxn-1; j++)
                xlocal[i][j] = xnew[i][j];
        MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD );
        gdiffnorm = sqrt( gdiffnorm );
        if (rank == 0)
            printf( "At iteration %d, diff is %e\n", itcnt, gdiffnorm );
    } while (gdiffnorm > 1.0e-2 && itcnt < 100);

    MPI_Finalize( );
    return 0;
}

Page 12: Parallel Systems

The Makefile:

# Generated automatically from Makefile.in by configure.
ALL: jacobi
SHELL = /bin/sh
DIRS =

jacobi: jacobi.c
	mpicc -o jacobi jacobi.c -lm

profile.alog: jacobi.c
	mpicc -o jacobi.log -mpilog jacobi.c -lm
	mpirun -np 4 jacobi.log
	/bin/mv jacobi.log_profile.log profile.alog

clean:
	/bin/rm -f jacobi jacobi.o jacobi.log
	#for dir in $(DIRS) ; do \
	#    ( cd $$dir ; make clean ) ; done

Page 14: Parallel Systems

Cartesian Constructor

Parameter   Meaning of Parameter
comm_old    input communicator (handle)
ndims       number of dimensions of the Cartesian grid (integer)
dims        integer array of size ndims specifying the number of processes in each dimension
periods     logical array of size ndims specifying whether the grid is periodic (true) or not (false) in each dimension
reorder     ranking may be reordered (true) or not (false) (logical)
comm_cart   communicator with the new Cartesian topology (handle)

int MPI_Cart_create ( MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart )

Page 15: Parallel Systems

Cartesian Convenience - MPI_Dims_create

int MPI_Dims_create( int nnodes, int ndims, int *dims)

Parameter   Meaning of Parameter
nnodes      number of nodes in a grid (integer)
ndims       number of Cartesian dimensions (integer)
dims        integer array of size ndims specifying the number of nodes in each dimension

MPI_Dims_create creates a division of processors in a Cartesian grid; entries of dims set to 0 are filled in by the routine.
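As a quick illustration (a minimal sketch, not from the slides), zeroed dims entries let MPI choose the factorization, and the result can be passed directly to MPI_Cart_create:

#include "mpi.h"
#include <stdio.h>

int main( int argc, char **argv )
{
    int size, rank;
    int dims[2]    = {0, 0};   /* 0 = let MPI_Dims_create choose   */
    int periods[2] = {0, 0};   /* non-periodic in both dimensions  */
    MPI_Comm cartcomm;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    MPI_Dims_create( size, 2, dims );   /* e.g. 12 -> 4 x 3         */
    MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, 1, &cartcomm );

    MPI_Comm_rank( cartcomm, &rank );
    if (rank == 0)
        printf( "process grid: %d x %d\n", dims[0], dims[1] );

    MPI_Finalize( );
    return 0;
}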

Page 16: Parallel Systems

See more:

http://www.rc.usf.edu/tutorials/classes/tutorial/mpi/chapter10.html

Page 17: Parallel Systems

#include "mpi.h"#include <stdio.h>#define SIZE 16#define UP 0#define DOWN 1#define LEFT 2#define RIGHT 3

int main(argc,argv)int argc;char *argv[]; {int numtasks, rank, source, dest, outbuf, i, tag=1, inbuf[4]={MPI_PROC_NULL,MPI_PROC_NULL,MPI_PROC_NULL,MPI_PROC_NULL,}, nbrs[4], dims[2]={4,4}, periods[2]={0,0}, reorder=0, coords[2];

MPI_Request reqs[8];MPI_Status stats[8];MPI_Comm cartcomm;

MPI_Init(&argc,&argv);MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

An Example

Page 18: Parallel Systems

    if (numtasks == SIZE) {
        MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, reorder, &cartcomm );
        MPI_Comm_rank( cartcomm, &rank );
        MPI_Cart_coords( cartcomm, rank, 2, coords );
        MPI_Cart_shift( cartcomm, 0, 1, &nbrs[UP], &nbrs[DOWN] );
        MPI_Cart_shift( cartcomm, 1, 1, &nbrs[LEFT], &nbrs[RIGHT] );

        printf( "rank= %d coords= %d %d neighbors(u,d,l,r)= %d %d %d %d\n",
                rank, coords[0], coords[1],
                nbrs[UP], nbrs[DOWN], nbrs[LEFT], nbrs[RIGHT] );
        outbuf = rank;

        /* Exchange ranks with the four neighbors.  With reorder = 0 the
           ranks in cartcomm are identical to those in MPI_COMM_WORLD,
           which is why communicating on MPI_COMM_WORLD with neighbor
           ranks from cartcomm is safe here. */
        for (i = 0; i < 4; i++) {
            dest   = nbrs[i];
            source = nbrs[i];
            MPI_Isend( &outbuf, 1, MPI_INT, dest, tag,
                       MPI_COMM_WORLD, &reqs[i] );
            MPI_Irecv( &inbuf[i], 1, MPI_INT, source, tag,
                       MPI_COMM_WORLD, &reqs[i+4] );
        }

        MPI_Waitall( 8, reqs, stats );
        printf( "rank= %d inbuf(u,d,l,r)= %d %d %d %d\n",
                rank, inbuf[UP], inbuf[DOWN], inbuf[LEFT], inbuf[RIGHT] );
    }
    else
        printf( "Must specify %d processors. Terminating.\n", SIZE );

    MPI_Finalize( );
    return 0;
}

Page 20: Parallel Systems

ScaLAPACK

Page 21: Parallel Systems
Page 22: Parallel Systems

Hands-On Exercises for ScaLAPACK:
http://acts.nersc.gov/scalapack/hands-on/

PBLAS Quick Reference Card:
http://www.netlib.org/scalapack/pblas_qref.html

ScaLAPACK home page:
http://www.netlib.org/scalapack/

Page 23: Parallel Systems

C interface to BLACS

http://www.netlib.org/blacs/cblacsqref.ps

See next slide

Page 24: Parallel Systems

BLACS

Page 25: Parallel Systems
Page 27: Parallel Systems
Page 28: Parallel Systems

Mixing C and FORTRAN

• Demo on my other laptop
  – ~/tests/cprog.c and ffunction.f
  – http://www.cae.tntech.edu/help/programming/mixed_languages
  – Important! Column-major order vs. row-major order: C => row, Fortran => column (see the sketch below)
  – Ref: http://en.wikipedia.org/wiki/Row-major_order
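The demo files themselves are not reproduced in the slides, so here is a minimal sketch under assumed names (a cprog.c calling a hypothetical Fortran subroutine standing in for ffunction.f): with gcc/gfortran the Fortran symbol gets a trailing underscore, and every argument is passed by reference.

/* cprog.c -- hypothetical sketch; build with:
   gfortran -c ffunction.f && gcc cprog.c ffunction.o -lgfortran

   Assumed Fortran side (ffunction.f), summing an array:
         subroutine ffunction(a, n, s)
         integer n
         double precision a(n), s
         integer i
         s = 0.0d0
         do 10 i = 1, n
   10       s = s + a(i)
         return
         end                                                       */
#include <stdio.h>

void ffunction_( double *a, int *n, double *s );  /* note the underscore */

int main(void)
{
    double a[3] = { 1.0, 2.0, 3.0 };
    int    n = 3;
    double s;

    ffunction_( a, &n, &s );       /* all arguments passed by address */
    printf( "sum = %f\n", s );     /* prints 6.000000 */
    return 0;
}

For 2-D arrays the memory-layout difference matters as well: a C array handed to Fortran this way is seen transposed, because C stores rows contiguously and Fortran stores columns contiguously.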

Page 29: Parallel Systems

Example: PDGEMV()

This sample uses PDGEMV(), which computes a distributed matrix-vector product y = alpha*A*x + beta*y. Let

A = [1 4 7 10 13 ; 3 6 9 12 15 ; 5 8 11 14 17 ; 7 10 13 16 19 ; 9 12 15 18 21]

and x = [1, 1, 0, 0, 1]^T, a column vector (T denotes transpose). Call the PDGEMV() routine to compute y = Ax. The correct result is y = [18, 24, 30, 36, 42]^T (e.g., the first entry is 1*1 + 4*1 + 7*0 + 10*0 + 13*1 = 18).

Code sample in C: pdgemv.c. The function call in C looks like:

double alpha = 1.0;
double beta  = 0.0;
pdgemv_( "N", &M, &M, &alpha, A, &ONE, &ONE, descA,
         x, &ONE, &ONE, descx, &ONE,
         &beta, y, &ONE, &ONE, descy, &ONE );
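The call above assumes a BLACS process grid and array descriptors are already in place. Since pdgemv.c itself is not reproduced here, the following is a minimal sketch of that surrounding setup, with assumed C prototypes for the BLACS/ScaLAPACK entry points; allocating and filling the local pieces of A, x and y is omitted.

/* Sketch only: grid and descriptor setup around a pdgemv_ call. */
extern void Cblacs_pinfo( int *mypnum, int *nprocs );
extern void Cblacs_get( int context, int request, int *value );
extern void Cblacs_gridinit( int *context, char *order, int nprow, int npcol );
extern void Cblacs_gridinfo( int context, int *nprow, int *npcol,
                             int *myrow, int *mycol );
extern void Cblacs_gridexit( int context );
extern void Cblacs_exit( int error_code );
extern int  numroc_( int *n, int *nb, int *iproc, int *isrcproc, int *nprocs );
extern void descinit_( int *desc, int *m, int *n, int *mb, int *nb,
                       int *irsrc, int *icsrc, int *ictxt, int *lld, int *info );

int main( void )
{
    int ictxt, mypnum, nprocs, myrow, mycol;
    int nprow = 2, npcol = 2;               /* 2 x 2 process grid        */
    int M = 5, NB = 2, ONE = 1, ZERO = 0;   /* 5 x 5 matrix, 2 x 2 blocks */
    int descA[9], descx[9], descy[9], lld, info;

    Cblacs_pinfo( &mypnum, &nprocs );
    Cblacs_get( -1, 0, &ictxt );            /* default system context    */
    Cblacs_gridinit( &ictxt, "Row", nprow, npcol );
    Cblacs_gridinfo( ictxt, &nprow, &npcol, &myrow, &mycol );

    lld = numroc_( &M, &NB, &myrow, &ZERO, &nprow );  /* local row count */
    if (lld < 1) lld = 1;

    descinit_( descA, &M, &M,   &NB, &NB,  &ZERO, &ZERO, &ictxt, &lld, &info );
    descinit_( descx, &M, &ONE, &NB, &ONE, &ZERO, &ZERO, &ictxt, &lld, &info );
    descinit_( descy, &M, &ONE, &NB, &ONE, &ZERO, &ZERO, &ictxt, &lld, &info );

    /* ... fill the local pieces of A and x, then make the pdgemv_
       call shown above ... */

    Cblacs_gridexit( ictxt );
    Cblacs_exit( 0 );
    return 0;
}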

Page 31: Parallel Systems

Purpose
=======

PDGEMV performs one of the matrix-vector operations

   sub( Y ) := alpha*sub( A ) *sub( X ) + beta*sub( Y ), or
   sub( Y ) := alpha*sub( A )'*sub( X ) + beta*sub( Y ),

where

   sub( A ) denotes A(IA:IA+M-1,JA:JA+N-1).

When TRANS = 'N',

   sub( X ) denotes X(IX:IX,JX:JX+N-1), if INCX = M_X,
                    X(IX:IX+N-1,JX:JX), if INCX = 1 and INCX <> M_X,
   and,
   sub( Y ) denotes Y(IY:IY,JY:JY+M-1), if INCY = M_Y,
                    Y(IY:IY+M-1,JY:JY), if INCY = 1 and INCY <> M_Y,

and, otherwise

   sub( X ) denotes X(IX:IX,JX:JX+M-1), if INCX = M_X,
                    X(IX:IX+M-1,JX:JX), if INCX = 1 and INCX <> M_X,
   and,
   sub( Y ) denotes Y(IY:IY,JY:JY+N-1), if INCY = M_Y,
                    Y(IY:IY+N-1,JY:JY), if INCY = 1 and INCY <> M_Y.

Alpha and beta are scalars, and sub( X ) and sub( Y ) are subvectors, and sub( A ) is an m by n submatrix.