CEE 618 Scientific Parallel Computing (Lecture 4): Message-Passing Interface (MPI)
Albert S. Kim
Department of Civil and Environmental Engineering
University of Hawai‘i at Manoa
2540 Dole Street, Holmes 383, Honolulu, Hawaii 96822
Table of Contents
1 Cluster progress
2 Introduction to MPI (Message-Passing Interface)
3 Calculation of π using MPI: Basics, Wall Time, Broadcast, Barrier, Data Types, Reduce, Operation Types, Resources
Cluster progress
Outline
1 Cluster progress
2 Introduction to MPI (Message-Passing Interface)
3 Calculation of π using MPI: Basics, Wall Time, Broadcast, Barrier, Data Types, Reduce, Operation Types, Resources
Cluster progress
My first home-made cluster, UCLA 1997
1 CPU: Pentium II 450 MHz
2 Memory: 128 MB per PC
3 Network card: Netgear FX310, 100/10 Mbps Ethernet card
4 Switch: Netgear 8-port, 100/10 Mbps Ethernet switch
5 KVM sharing device: Belkin Omni Cube 4 port
Cluster progress
UH 2001, the second home-made cluster
1 Composed of 16 PCs sharing ONE keyboard, monitor, and mouse
2 Red Hat Linux 7.2 installed (free)
3 Connected to a private network by a data switch (192.168.0.0 – 192.168.255.255)
4 More than 30 times faster than a Pentium 1.0 GHz system
Cluster progress
UH 2007, the third from Dell
1 Linux Cluster from Dell Inc. under support from NSF.
2 Initially 16 nodes, 2 Intel(R) Xeon(TM) CPU 2.80GHz, and 2 GB memory per core
3 Queuing system: Platform Lava → LSF, Platform Computing Inc.
4 Programming Language: Intel FORTRAN 77/90 and Intel C/C++
5 Libraries: BLAS, ATLAS, GotoBLAS, BLACS, LAPACK, ScaLAPACK, OPENMPI-1.2.8 (http://www.open-mpi.org/)
Cluster progress
1 GNU-4.1.2 & Intel-11.1 compilers, OpenMPI-1.4.1, PBSPro-10.2
2 Host name: jaws.mhpcc.hawaii.edu
3 IP addresses: 132.160.97.245 & 132.160.97.246
Cluster progress
UH 2013, the system updated
1 The second rack was added.
2 Additional 3 nodes, 8 Intel(R) Xeon(R) CPU E5345 2.33GHz per node, and 2 GB memory per core
3 Currently a total of 56 cores with 2 GB memory each.
4 Queuing system: PBS (Portable Batch System), Torque
5 Programming Language: Intel FORTRAN and Intel C/C++ (version 13.1.0)
6 Libraries: OPENMPI-1.6.1
Introduction to MPI (Message-Passing Interface)
Outline
1 Cluster progress
2 Introduction to MPI (Message-Passing Interface)
3 Calculation of π using MPI: Basics, Wall Time, Broadcast, Barrier, Data Types, Reduce, Operation Types, Resources
Introduction to MPI (Message-Passing Interface)
What is MPI?
MESSAGE-PASSING INTERFACE
1 A program library, NOT a language.
2 Called from FORTRAN 77/90, C/C++, and Python (and Java).
3 Most widely used parallel library, but NOT a revolutionary way for parallel computation.
4 A collection of the best features of (many) existing message-passing systems.
Introduction to MPI (Message-Passing Interface)
How MPI works
Figure: Distributed memory system using 4 nodes.
Suppose we have a cluster composed of four computers: alpha, beta, gamma, and delta (each computer has one core). Usually, the first computer (alpha) is the master machine (and file server), e.g., fractal. You have an MPI code, mympi.f90, in your working directory on alpha.
1 Compile the code¹: mpif90 mympi.f90 -o mympi.x
2 Run it using 4 nodes²: mpirun -np 4 mympi.x
¹ “mpif90” is generated when MPI is installed with specific compilers.
² We don’t execute mpirun directly. In practice, we will use a Makefile (and later the ‘qsub’ command).
Introduction to MPI (Message-Passing Interface)
TCP/IP communication through ssh
MPI uses a default machine file that contains (for example)
172.20.0.100 → compute-01-00’s IP (alpha)
172.20.0.101 → compute-01-01’s IP (beta)
172.20.0.102 → compute-01-02’s IP (gamma)
172.20.0.103 → compute-01-03’s IP (delta)
Basically, when an MPI job is submitted using mpirun, this machine file is read, and node numbers (ranks) are automatically assigned in a sequence (not always ordered).
172.20.0.100 → 0
172.20.0.101 → 1
172.20.0.102 → 2
172.20.0.103 → 3
In most cases, including ours, a queueing system (PBS, LSF, or others) takes care of job allocation to optimize computational resources.
Inter-node communication was through rsh (remote shell) in the past, but is now ssh (secure shell) (rsh + data encryption).
Introduction to MPI (Message-Passing Interface)
Basic Structure of MPI programs
program mympi
  implicit none
  include 'mpif.h'
  integer :: numprocs, rank, ierr, rc, RESULTLEN
  character(len=20) :: PNAME

  call MPI_INIT(ierr)
  if (ierr /= MPI_SUCCESS) then
     print *, 'Error starting MPI program. Terminating.'
     call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
  end if

  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
  call MPI_GET_PROCESSOR_NAME(PNAME, RESULTLEN, ierr)

  write (*, "('No. of procs = ',1X,i4,1x,', My rank = ',1X,i4,', pname = ',A20)") &
       numprocs, rank, PNAME

  call MPI_FINALIZE(ierr)
end
How many “MPI” calls?
5
Introduction to MPI (Message-Passing Interface)
Makefile
mpif90=/opt/openmpi-1.6.1-intel/bin/mpif90 -limf
mpirun=/opt/openmpi-1.6.1-intel/bin/mpirun
mpiopt=-mca btl tcp,self --mca btl_tcp_if_include eth0

srcroot=mympi
srcfile=$(srcroot).f90
exefile=$(srcroot).x
numprocs=6

all:
	$(mpif90) $(srcfile) -o $(exefile)

srun:
	$(mpirun) -mca btl tcp,self -np 1 ./$(exefile)

prun:
	$(mpirun) $(mpiopt) --hostfile ./mpihosts -np $(numprocs) ./$(exefile)

edit:
	vim $(srcfile)
Introduction to MPI (Message-Passing Interface)
PBS
We will discuss how to use Open MPI with PBS later.
Introduction to MPI (Message-Passing Interface)
Job output file
No. of procs = 6, My rank = 0, pname = compute-1-01
No. of procs = 6, My rank = 2, pname = compute-1-02
No. of procs = 6, My rank = 4, pname = compute-1-03
No. of procs = 6, My rank = 5, pname = compute-1-03
No. of procs = 6, My rank = 3, pname = compute-1-02
No. of procs = 6, My rank = 1, pname = compute-1-01
Note that ranks are not well ordered.
Introduction to MPI (Message-Passing Interface)
1. call MPI_INIT (ierr)
call MPI_INIT(ierr)
if (ierr /= MPI_SUCCESS) then
   print *, 'Error starting MPI program. Terminating.'
   call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
end if

call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
call MPI_GET_PROCESSOR_NAME(PNAME, RESULTLEN, ierr)

Initializes the MPI execution environment. This function must be called in every MPI program, must be called before any other MPI functions, and must be called only once in an MPI program.
ierr = error code, which must be either MPI_SUCCESS (=0) or an implementation-defined error code.
Where is MPI_SUCCESS (=0) set? [Ans.] mpi.h
Introduction to MPI (Message-Passing Interface)
2. call MPI_ABORT (MPI_COMM_WORLD, rc, ierr)
call MPI_INIT(ierr)
if (ierr /= MPI_SUCCESS) then
   print *, 'Error starting MPI program. Terminating.'
   call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
end if

call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
call MPI_GET_PROCESSOR_NAME(PNAME, RESULTLEN, ierr)

Terminates (aborts) all MPI processes associated with the communicator.
In most MPI implementations it terminates ALL processes regardless of the communicator specified.
MPI_COMM_WORLD = default communicator; it defines one context and the set of all processes, and is one of the items defined in ‘mpi.h’.
rc = error code
Introduction to MPI (Message-Passing Interface)
3. call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
call MPI_INIT(ierr)
if (ierr /= MPI_SUCCESS) then
   print *, 'Error starting MPI program. Terminating.'
   call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
end if

call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
call MPI_GET_PROCESSOR_NAME(PNAME, RESULTLEN, ierr)

Determines the rank of the calling process within the communicator. whoami?
Initially, each process will be assigned a unique integer rank between 0 and (number of processors − 1) (i.e., numprocs-1), within the communicator MPI_COMM_WORLD.
This rank is often referred to as a task ID.
If 8 processors are used, then ranks are (?)
0, 1, 2, 3, 4, 5, 6, and 7.
Introduction to MPI (Message-Passing Interface)
4. call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
call MPI_INIT(ierr)
if (ierr /= MPI_SUCCESS) then
   print *, 'Error starting MPI program. Terminating.'
   call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
end if

call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
call MPI_GET_PROCESSOR_NAME(PNAME, RESULTLEN, ierr)

Determines (or obtains) the number of processes in the group associated with a communicator.
Generally used within the communicator MPI_COMM_WORLD to determine the number of processes (numprocs) being used by your own application.
It matches the number after “-np” in
  mpirun -np 4 mympi.x
where “4” is read as “numprocs” by MPI_COMM_SIZE.
Introduction to MPI (Message-Passing Interface)
5. call MPI_GET_PROCESSOR_NAME(PNAME, RESULTLEN, ierr)
call MPI_INIT(ierr)
if (ierr /= MPI_SUCCESS) then
   print *, 'Error starting MPI program. Terminating.'
   call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
end if

call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
call MPI_GET_PROCESSOR_NAME(PNAME, RESULTLEN, ierr)

Returns the name of the local processor at the time of the call, i.e., PNAME.
‘RESULTLEN’ is the character length of PNAME.
Introduction to MPI (Message-Passing Interface)
6. call MPI_FINALIZE (ierr)
write (*, "('No. of procs = ',1X,i4,1x,', My rank = ',1X,i4,', pname = ',A20)") &
     numprocs, rank, pname

call MPI_FINALIZE(ierr)
end

Terminates the MPI execution environment.
This function should be the last MPI routine called in every MPI program; NO other MPI routines may be called after it.
Note that all MPI programs
  start with MPI_INIT (ierr),
  do something in-between, and
  end with MPI_FINALIZE (ierr).
Question: Who is/are doing the “write” above? Ans. all 4 processes
Introduction to MPI (Message-Passing Interface)
Summary: rank and the number of processors
If the number of processors is Nprocs, then the Nprocs processors will have ranks of 0, 1, 2, ..., Nprocs − 2, and Nprocs − 1.
For example, if 6 processors are used for a parallel MPI calculation, then the ranks of the processors will be 0, 1, 2, 3, 4, and 5.
Introduction to MPI (Message-Passing Interface)
Random Quiz
What MPI routines are required to create a minimal parallel code? And how many?
ANS: MPI_INIT and MPI_FINALIZE; two in total.
Calculation of π using MPI
Outline
1 Cluster progress
2 Introduction to MPI (Message-Passing Interface)
3 Calculation of π using MPI: Basics, Wall Time, Broadcast, Barrier, Data Types, Reduce, Operation Types, Resources
Calculation of π using MPI Basics
Mathematical Identity
Numerically integrate

g(x) = 4 / (1 + x²)   (1)

from x = 0 to 1. Here, g(x) is f(x) in the reference book³.

∫ from x=0 to x=1 of 4/(1 + x²) dx = 4 [tan⁻¹(x)] from x=0 to x=1 = 4 (tan⁻¹(1) − tan⁻¹(0)) = 4 (π/4 − 0) = π   (2)

Make your own derivation, substituting x = tan y!
³ Using MPI: Portable Parallel Programming with the Message-Passing Interface, by William Gropp, Ewing Lusk, and Anthony Skjellum, The MIT Press, pages 21–26.
Calculation of π using MPI Basics
Integration Scheme
Figure: Integrating to find the value of π with n = 5, where n is the number of points (or rectangles) for the integration.
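The rectangle (midpoint) rule in the figure can be checked serially before parallelizing it. The lecture's code is Fortran; the following is an illustrative Python sketch (the function name approximate_pi is my own, not from the lecture):

```python
# Midpoint-rule integration of g(x) = 4/(1 + x^2) over [0, 1], which equals pi.
# Serial sketch in Python; the lecture's fpi.f90 does the same sum in Fortran.

def approximate_pi(n):
    """Midpoint rule with n rectangles of width h = 1/n."""
    h = 1.0 / n
    total = 0.0
    for i in range(1, n + 1):
        x = h * (i - 0.5)            # midpoint of the i-th rectangle
        total += 4.0 / (1.0 + x * x)
    return h * total

print(approximate_pi(100_000))       # close to 3.141592653...
```

The error of the midpoint rule shrinks like h², so n = 100,000 already agrees with π to about 10 decimal digits.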
Calculation of π using MPI Basics
The number of integration points and the number of processes (= cores)
1 If n = 100 and Nprocs = 4, then each process takes care of 25 points.
  Processor 0: n = 1–25
  Processor 1: n = 26–50
  Processor 2: n = 51–75
  Processor 3: n = 76–100
2 However, what if n/Nprocs is NOT an integer, for example, n = 100 and Nprocs = 6? You can try n/Nprocs or n/(Nprocs − 1). Then how do you handle remainders?
Calculation of π using MPI Basics
Optimum load balance
* If n/Nprocs is NOT an integer, one possible optimal way is jumping as many steps as the number of processes (i.e., 6):

Processor 0: n = 1, 7, 13, . . . , 91, 97
Processor 1: n = 2, 8, 14, . . . , 92, 98
Processor 2: n = 3, 9, 15, . . . , 93, 99
Processor 3: n = 4, 10, 16, . . . , 94, 100
Processor 4: n = 5, 11, 17, . . . , 95
Processor 5: n = 6, 12, 18, . . . , 96

This method is generally applicable without any restriction. The tasks assigned to the processes are almost identical in size.
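The cyclic ("jumping") assignment above is easy to express in code. A small Python sketch (illustrative only; the lecture's fpi.f90 does this with a Fortran do-loop stride):

```python
# Cyclic decomposition: rank r takes points r+1, r+1+P, r+1+2P, ...
# for P processes, exactly the table above for P = 6 and n = 100.

def cyclic_points(rank, nprocs, n):
    """Return the 1-based point indices owned by `rank`."""
    return list(range(rank + 1, n + 1, nprocs))

for r in range(6):
    pts = cyclic_points(r, 6, 100)
    print(r, pts[:3], "...", pts[-1], "count =", len(pts))
```

Every point from 1 to n is covered exactly once, and the per-rank counts differ by at most one.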
Calculation of π using MPI Basics
fpi.f90 – the first half
program main
  include 'mpif.h'
  double precision :: PI25DT
  parameter (PI25DT = 3.141592653589793238462643d0)
  double precision :: mypi, pi, h, sum, x, f, a
  double precision :: T1, T2
  integer :: n, myid, numprocs, i, rc
  ! function to integrate
  f(a) = 4.d0 / (1.d0 + a*a)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

  n = 100000000
  h = 1.0d0 / dble(n)
  T1 = MPI_WTIME()

  call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
Calculation of π using MPI Basics
INTEGER Types
1 INTEGER*2 (non-standard) (2 bytes = 16 bits): from −32768 to 32767; range 2¹⁶ ≈ 10^4.8
2 INTEGER*4 (4 bytes = 32 bits), MPI default: from −2147483648 to 2147483647; range 2³² ≈ 10^9.63 ≈ 10^10
3 INTEGER*8 (8 bytes = 64 bits): from −(2⁶⁴/2) to (2⁶⁴/2 − 1); range 2⁶⁴ ≈ 10^19.26 ≈ 10^20
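The two's-complement ranges quoted above can be verified quickly; a small Python check (illustrative, since Python integers are arbitrary precision and only model the fixed-width types):

```python
# Verify the signed ranges and decimal orders of magnitude for
# 16-, 32-, and 64-bit two's-complement integers.
import math

for bits in (16, 32, 64):
    lo, hi = -2**(bits - 1), 2**(bits - 1) - 1
    print(f"{bits}-bit: {lo} .. {hi}  (2^{bits} ~ 10^{math.log10(2**bits):.2f})")
```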
Calculation of π using MPI Wall Time
MPI_WTIME()
T1 = MPI_WTIME()
...
T2 = MPI_WTIME()
T2 - T1

1 Returns an elapsed wall-clock time in seconds (double precision) on the calling processor.
2 Time in seconds since an arbitrary time in the past, e.g., 1970.0.0.0.0.0.
3 Only the difference, T2 − T1, makes sense.
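The same pattern can be tried without MPI. In this sketch, Python's time.perf_counter() stands in for MPI_WTIME: it is also an arbitrary-origin clock where only differences are meaningful.

```python
# Wall-clock timing sketch: the absolute values of t1 and t2 are from an
# arbitrary origin; only the difference t2 - t1 means anything.
import time

t1 = time.perf_counter()
total = sum(i * i for i in range(100_000))   # some work to time
t2 = time.perf_counter()
print(f"elapsed = {t2 - t1:.6f} s")
```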
Calculation of π using MPI Broadcast
MPI_BCAST
call MPI_BCAST( n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr )
MPI_BCAST (buffer, count, datatype, root, comm, ierr)
broadcasts a message from the process with rank "root" to all other processes of the group.
1 buffer - name (or starting address) of the buffer, i.e., the variable (choice)
2 count - the number of entries in the buffer, i.e., how many (integer)
3 datatype - data type of the buffer, such as integer, double precision, ... (handle)
4 root - rank of the broadcast root, i.e., from whom? the broadcaster (integer)
5 comm - communicator (handle), i.e., a communication language; the default is MPI_COMM_WORLD
Note: This MPI_BCAST call is included only to explain how it works in the example code, which means π will be properly calculated without this call. This is because each process already knows the value of n.
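The semantics can be pictured with a toy simulation (plain Python, no MPI; the bcast function below is my own illustration, not an MPI binding): before the call only the root's buffer holds n, and afterwards every rank's buffer does.

```python
# Toy simulation of MPI_BCAST semantics: copy the root's buffer value
# into every rank's buffer. Real MPI does this with messages, not shared lists.

def bcast(buffers, root):
    """Copy the root's value into every rank's buffer and return the list."""
    for rank in range(len(buffers)):
        buffers[rank] = buffers[root]
    return buffers

buffers = [None] * 4        # 4 ranks; only rank 0 knows n before the call
buffers[0] = 100_000_000    # n on the root
print(bcast(buffers, root=0))   # → [100000000, 100000000, 100000000, 100000000]
```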
Calculation of π using MPI Broadcast
MPI_BCAST
call MPI_BCAST( n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr )MPI_BCAST (buffer, count, datatype, root, comm, ierr )broadcasts a message from the process with rank "root" toall other processes of the group.
1 buffer - name (or starting address) of buffer, i.e., variable (choice)
2 count - the number of entries in buffer, i.e., how many (integer)3 datatype - data type of buffer such as integer, double precision, ...
(handle)4 root - rank of broadcast root, i.e., from whom?, broadcaster (integer)5 comm - communicator (handle), i.e., a communication language,
default is MPI_COMM_WORLD
Note: This MPI_BCAST call is only to explain how it works in theexample code, which means π will be properly calculated withoutthis call. This is because each process already knows the value ofn.
33 / 48
Calculation of π using MPI Broadcast
MPI_BCAST
call MPI_BCAST( n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr )MPI_BCAST (buffer, count, datatype, root, comm, ierr )broadcasts a message from the process with rank "root" toall other processes of the group.
1 buffer - name (or starting address) of buffer, i.e., variable (choice)2 count - the number of entries in buffer, i.e., how many (integer)
3 datatype - data type of buffer such as integer, double precision, ...(handle)
4 root - rank of broadcast root, i.e., from whom?, broadcaster (integer)5 comm - communicator (handle), i.e., a communication language,
default is MPI_COMM_WORLD
Note: This MPI_BCAST call is only to explain how it works in theexample code, which means π will be properly calculated withoutthis call. This is because each process already knows the value ofn.
33 / 48
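The effect of MPI_BCAST can be sketched serially in Python. This is only an analogy, not real MPI: the `bcast` function and the `nbcast` dict of per-rank copies are hypothetical stand-ins for each process's private memory.

```python
# Serial analogy of MPI_BCAST: copy the root's value into every rank's
# private copy of the variable. Real MPI does this over the network.
def bcast(values, root):
    """values: dict mapping rank -> that rank's local copy of the variable."""
    for rank in values:
        values[rank] = values[root]
    return values

# As in the example code: every rank starts with -10, rank 0 sets 10.
nbcast = {rank: -10 for rank in range(6)}
nbcast[0] = 10
bcast(nbcast, root=0)
print(nbcast)  # every rank now sees the root's value, 10
```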
Calculation of π using MPI Broadcast
Example of MPI_BCAST
program broadcast
  implicit none
  include 'mpif.h'
  integer :: numprocs, rank, ierr, rc
  integer :: Nbcast

  call MPI_INIT(ierr)
  if (ierr /= MPI_SUCCESS) then
    print *, 'Error starting MPI program. Terminating.'
    call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
  end if
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

  Nbcast = -10
  print *, 'I am', rank, 'of', numprocs, 'and Nbcast =', Nbcast
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)

  if (rank == 0) Nbcast = 10
  call MPI_BCAST(Nbcast, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
  print *, 'I am', rank, 'of', numprocs, 'and Nbcast =', Nbcast

  call MPI_FINALIZE(ierr)
end
An MPI program is executed by every process concurrently; the initial conditions may differ from process to process, i.e., be rank-specific.
34 / 48
Calculation of π using MPI Broadcast
Outcome of MPI_BCAST example
I am 0 of 6 and Nbcast = -10
I am 1 of 6 and Nbcast = -10
I am 2 of 6 and Nbcast = -10
I am 5 of 6 and Nbcast = -10
I am 3 of 6 and Nbcast = -10
I am 4 of 6 and Nbcast = -10
I am 0 of 6 and Nbcast = 10
I am 1 of 6 and Nbcast = 10
I am 2 of 6 and Nbcast = 10
I am 4 of 6 and Nbcast = 10
I am 5 of 6 and Nbcast = 10
I am 3 of 6 and Nbcast = 10
Note: Without the MPI_BARRIER call, the messages above would appear in a highly disordered sequence, because the processes compete with each other to print to stdout. Change the executable file name for each run.
35 / 48
Calculation of π using MPI Barrier
MPI_BARRIER
call MPI_BARRIER( MPI_COMM_WORLD, ierr )
MPI_BARRIER (comm, ierr)
creates a barrier synchronization within a group.
Each task, on reaching the MPI_Barrier call, blocks until all tasks in the group have reached the same MPI_Barrier call.

Let's check that everybody is on the bus, and then drive to the vacation spot. No one left home alone!
36 / 48
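The same rendezvous behavior exists in Python's standard library as threading.Barrier. This small sketch is an analogy with threads instead of MPI processes: no thread gets past the barrier until all of them have arrived.

```python
import threading

NTHREADS = 4
barrier = threading.Barrier(NTHREADS)  # all NTHREADS must arrive to release
log = []
lock = threading.Lock()

def worker(rank):
    with lock:
        log.append(("before", rank))
    barrier.wait()           # block here until every thread has arrived
    with lock:
        log.append(("after", rank))

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NTHREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every "before" entry is logged ahead of every "after" entry:
# the barrier guarantees no thread proceeds until all have checked in.
print(log)
```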
Calculation of π using MPI Data Types
MPI FORTRAN Data Types
1 MPI_CHARACTER : character(1)
2 MPI_INTEGER : integer(4)
3 MPI_REAL : real(4)
4 MPI_DOUBLE_PRECISION : double precision = real(8)
5 MPI_COMPLEX : complex
6 MPI_LOGICAL : logical
7 MPI_BYTE : 8 binary digits
8 MPI_PACKED : data packed or unpacked with MPI_Pack()/MPI_Unpack
Note: C/C++ do not have complex and logical data types.
37 / 48
Calculation of π using MPI
fpi.f90 – second half
sum = 0.0d0
do i = myid+1, n, numprocs
  x = h * (dble(i) - 0.5d0)
  sum = sum + f(x)
enddo
mypi = h * sum

call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
T2 = MPI_WTIME()

if (myid == 0) then
  write (*, "('pi is approximately: ', F18.16)") pi
  write (*, "('Error is:            ', F18.16)") abs(pi - PI25DT)
  write (*,*) "Start time =   ", T1
  write (*,*) "End time =     ", T2
  write (*,*) "Elapsed time = ", T2 - T1
  write (*,*) "The number of processes = ", numprocs
endif
call MPI_FINALIZE(rc)
stop
end
38 / 48
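The loop do i = myid+1, n, numprocs stripes the quadrature points across the ranks: rank 0 takes points 1, 1+numprocs, 1+2*numprocs, ..., rank 1 takes 2, 2+numprocs, and so on. A serial Python analogy of this partitioning and the final MPI_SUM (the `mypi` dict playing the role of each rank's private partial sum):

```python
import math

def f(x):
    # Integrand 4/(1+x^2); its integral over [0,1] is pi.
    return 4.0 / (1.0 + x * x)

n = 100000          # number of midpoint-rule intervals
numprocs = 6
h = 1.0 / n

# Each "rank" sums only its stripe: i = myid+1, myid+1+numprocs, ...
mypi = {}
for myid in range(numprocs):
    s = 0.0
    for i in range(myid + 1, n + 1, numprocs):
        x = h * (i - 0.5)   # midpoint of interval i
        s += f(x)
    mypi[myid] = h * s

# MPI_REDUCE with MPI_SUM combines the partial sums into pi on rank 0;
# serially that is just a sum over the per-rank contributions.
pi = sum(mypi.values())
print(abs(pi - math.pi))    # midpoint-rule error shrinks as O(h^2)
```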
Calculation of π using MPI Reduce
MPI_REDUCE
call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0,MPI_COMM_WORLD, ierr)
MPI_REDUCE (sendbuf, recvbuf, count, datatype, op, root,comm, ierr)reduces values on all processes to a single value.
1 sendbuf - address of the send buffer on each process, i.e., the variable (choice)
2 recvbuf - address of the receive buffer on the reducing process, i.e., a different variable (choice)
3 count - number of elements in the send buffer, i.e., how many (integer)
4 datatype - data type of the elements of the send buffer, such as integer, real, ... (handle)
5 op - reducing arithmetic operation (handle)
6 root - rank of the root process, i.e., the process that receives the reduced result (integer)
7 comm - communicator, i.e., a communication context; the default is MPI_COMM_WORLD (handle)
39 / 48
Calculation of π using MPI Reduce
Example of MPI_REDUCE
program broadcast
  implicit none
  include 'mpif.h'
  integer :: numprocs, rank, ierr, rc
  integer :: Ireduced, Nreduced

  call MPI_INIT(ierr)
  if (ierr /= MPI_SUCCESS) then
    print *, 'Error starting MPI program. Terminating.'
    call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
  end if

  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

  Ireduced = 1 + rank
  Nreduced = 0
  print *, 'I am', rank, 'of', numprocs, 'and Ireduced =', Ireduced
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)

  call MPI_REDUCE &
    (Ireduced, Nreduced, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
  if (rank == 0) &
    print *, 'I am', rank, 'of', numprocs, 'and Nreduced =', Nreduced

  call MPI_FINALIZE(ierr)
end
40 / 48
Calculation of π using MPI Reduce
Outcome of MPI_REDUCE example
I am 0 of 6 and Ireduced = 1
I am 1 of 6 and Ireduced = 2
I am 2 of 6 and Ireduced = 3
I am 3 of 6 and Ireduced = 4
I am 4 of 6 and Ireduced = 5
I am 5 of 6 and Ireduced = 6
I am 0 of 6 and Nreduced = 21
1 + 2 + 3 + 4 + 5 + 6 = 21

Note that rank = Ireduced - 1. Without the MPI_BARRIER call, the final answer of 21 could appear anywhere in the output.
41 / 48
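The reduction in the example above is just a sum over the per-rank values; serially it can be sketched with Python's functools.reduce (an analogy only, with a list standing in for the distributed send buffers):

```python
import functools
import operator

numprocs = 6
# Each rank contributes Ireduced = 1 + rank, exactly as in the Fortran example.
ireduced = [1 + rank for rank in range(numprocs)]

# MPI_REDUCE(..., MPI_SUM, root=0, ...) combines the values pairwise with +
# and delivers the result only to the root; functools.reduce is the same fold.
nreduced = functools.reduce(operator.add, ireduced)
print(nreduced)  # 1 + 2 + 3 + 4 + 5 + 6 = 21
```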
Calculation of π using MPI Operation Types
MPI Operations
1 MPI_MAX – maximum (integer, real, complex)
2 MPI_MIN – minimum (integer, real, complex)
3 MPI_SUM – sum (integer, real, complex)
4 MPI_PROD – product (integer, real, complex)
5 MPI_LAND – logical AND (logical)
6 MPI_LOR – logical OR (logical)
7 MPI_LXOR – logical XOR (logical)
8 MPI_BAND – bit-wise AND (integer, MPI_BYTE)
9 MPI_BOR – bit-wise OR (integer, MPI_BYTE)
10 MPI_BXOR – bit-wise XOR (integer, MPI_BYTE)
11 MPI_MAXLOC – max value and location (real, complex, double precision)
12 MPI_MINLOC – min value and location (real, complex, double precision)
42 / 48
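MPI_MAXLOC reduces (value, rank) pairs, returning the maximum value together with the rank that holds it. A serial Python sketch of that semantics (the `values` dict of per-rank data is hypothetical):

```python
# Per-rank values; MPI_MAXLOC keeps the pair with the largest value,
# breaking ties in favor of the lowest rank.
values = {0: 3.5, 1: 9.1, 2: 2.7, 3: 9.1, 4: 0.4}

# Taking max over (value, -rank) mimics "largest value, lowest rank on ties":
# between ranks 1 and 3 (both 9.1), (9.1, -1) > (9.1, -3), so rank 1 wins.
maxval, rank = max((v, -r) for r, v in values.items())
rank = -rank
print(maxval, rank)  # 9.1 1
```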
Calculation of π using MPI Resources
Resources and References
1 Gropp et al. Using MPI: Portable parallel programming with theMessage-Passing Interface. MIT Press (1997)
43 / 48
Calculation of π using MPI Resources
Makefile
mpif90=/opt/openmpi-1.6.1-intel/bin/mpif90 -limf
foropt=-diag-disable 8290 -diag-disable 8291 -diag-disable 10145
mpirun=/opt/openmpi-1.6.1-intel/bin/mpirun
mpiopt=-mca btl tcp,self --mca btl_tcp_if_include eth0

srcroot=fpi
srcfile=$(srcroot).f90
exefile=$(srcroot).x
numprocs=6

all:
	$(mpif90) $(foropt) $(srcfile) -o $(exefile)

srun:
	time $(mpirun) -mca btl tcp,self -np 1 ./$(exefile)

prun:
	time $(mpirun) $(mpiopt) --hostfile ./mpihosts -np $(numprocs) ./$(exefile)

edit:
	vim $(srcfile)
44 / 48
Calculation of π using MPI Resources
script file
We will discuss this later.
45 / 48
Calculation of π using MPI Resources
Job output file – part
pi is approximately: 3.1415926535897389
Error is:            0.0000000000000542
Start time   = 1359673097.68033
End time     = 1359673097.92917
Elapsed time = 0.248842000961304
The number of processes = 6
0.08user 0.04system 0:01.84elapsed 6%CPU (0avgtext+0avgdata 12896maxresident)k
0inputs+0outputs (0major+4125minor)pagefaults 0swaps
46 / 48
Calculation of π using MPI
Lab work
1 MPI codes are stored at /opt/cee618s13/class04/
2 Copy the examples under the subdirectories MPI-basic, MPI-bcast, MPI-reduce, and MPI-pi, and study them.
3 Type/enter "make" followed by "make prun".
4 Start your homework.
47 / 48
Calculation of π using MPI
Speed up test
1 Change the number of processes of the MPI-pi application from 1 to 8 and see how much speedup is achieved.
48 / 48