
Training of Reedbush-U

Takahiro Katagiri, Professor, Information Technology Center, Nagoya University


Short Course on "High-Performance Computing", Mathematics Division, National Center for Theoretical Sciences

Agenda
1. How to log in to the Reedbush-U
2. How to use the supercomputer: execute sample programs
3. An Example of Parallel Summation
4. Homework


How to Log In to the Reedbush-U
(Your first supercomputer login)


Obtain data from the Reedbush-U to your own PC
Type the following:
$ scp tYYxxx@<Reedbush-U login node>:~/a.f90 ./
"tYYxxx" is the user name; YYxxx is a number.
This copies "a.f90" in the home directory of the Reedbush-U to the current directory on your PC.
If you want to copy all files of a directory, specify "-r":
$ scp -r tYYxxx@<Reedbush-U login node>:~/SAMP ./
This copies all files in the "SAMP" folder in the home directory of the Reedbush-U to the current directory on your PC.


Send data from your own PC to the Reedbush-U
Type the following:
$ scp ./a.f90 tYYxxx@<Reedbush-U login node>:
"tYYxxx" is the user name; YYxxx is a number.
This copies "a.f90" in the current directory on your PC to the home directory on the Reedbush-U.
If you want to copy all files of a directory, specify "-r":
$ scp -r ./SAMP tYYxxx@<Reedbush-U login node>:
This copies all files in the "SAMP" folder in the current directory on your PC to the home directory on the Reedbush-U.


How to log in to the supercomputer and run test programs


Agenda

1. How to use the supercomputer: execute sample parallel programs
2. An Example of Parallel Summation


Execute Test Parallel Programs


Notes on UNIX
Execute emacs: emacs <file name>
^x ^s ("^" is the control key) : save the current text.
^x ^c : quit.
^g : go back to input mode.
^k : delete from the cursor to the end of the line; the deleted lines are temporarily memorized.
^y : paste the lines deleted with "^k" at the current cursor location.
^s <character string> : move to <character string>.
M-x goto-line : move to a specified line.


Notes on UNIX
rm <file name> : remove <file name>.
rm *~ : delete all backup files ending with "~", such as "test.c~".
ls : show the file names in the current directory.
cd <folder name> : move to <folder name>.
cd .. : move up one directory level.
cd ~ : move to the home directory.
cat <file name> : show the contents of <file name>.
make : build the executable file ("Makefile" is needed in the current directory).
make clean : delete the executable file (a definition of "clean" in the Makefile is needed).


Execute test parallel programs


Name of the Sample Program
Common file for the C and Fortran90 languages: Samples-reedbush.tar
After running the "tar" command, a directory containing the C and Fortran90 versions is created:
C/ : for the C language
F/ : for the Fortran90 language
Directory that has the above file: /lustre/gt00/z30082/


Let's compile the parallel Hello program (1/3)
Change the current directory to the execution directory. On the Reedbush-U, the execution directory is:
/lustre/gt31/<your account name>
Type the following:
$ cd /lustre/gt31/<your account name>


Let's compile the parallel Hello program (2/3)
1. Copy Samples-reedbush.tar from /lustre/gt00/z30082/ to your directory:
$ cp /lustre/gt00/z30082/Samples-reedbush.tar ./
2. Unpack Samples-reedbush.tar:
$ tar xvf Samples-reedbush.tar
3. Enter the Samples folder:
$ cd Samples
4. For the C language: $ cd C    For Fortran90: $ cd F
5. Enter the Hello folder:
$ cd Hello


Let's compile the parallel Hello program (3/3)
6. Copy the "Makefile" for pure MPI:
$ cp Makefile_pure Makefile
7. Do make:
$ make
8. Confirm the executable file named "hello":
$ ls


What is a batch job?
In a supercomputer environment, interactive execution (command-line execution) is usually not supported. Jobs are executed through a batch job system.


[Figure: a user sends a job request to the supercomputer's batch queues; the batch system picks up jobs from the queues and executes them.]

Running Jobs
• Batch Jobs
– Only batch jobs are allowed.
– Interactive execution of jobs is not allowed.
• How to run
– write a job script
– submit the job
– check the job status
– check the results
• Utilization of computational resources
– One node (36 cores) is occupied by each job.
– Your node is not shared with other jobs.

This material comes from Prof. Nakajima.


Job Script
• <$O-S1>/hello.sh
• Scheduling + shell script

#!/bin/sh
#PBS -q u-lecture1                 Name of "QUEUE"
#PBS -N HELLO                      Job name
#PBS -l select=1:mpiprocs=4        Number of nodes, MPI processes per node
#PBS -Wgroup_list=gt31             Group name (wallet)
#PBS -l walltime=00:05:00          Computation time
#PBS -e err                        Standard error
#PBS -o hello.lst                  Standard output

cd $PBS_O_WORKDIR                  Go to the submission directory
. /etc/profile.d/modules.sh        (ESSENTIAL)
export I_MPI_PIN_DOMAIN=socket     Execution on each socket
export I_MPI_PERHOST=4             = mpiprocs, stable
mpirun ./impimap.sh ./a.out        Execution

This material comes from Prof. Nakajima.


impimap.sh
NUMA: utilize the resources (e.g., memory) of the core where the job is running; performance is stable.

#!/bin/sh
numactl --localalloc $@

Number of processes
#PBS -l select=1:mpiprocs=4     1 node, 4 processes
#PBS -l select=1:mpiprocs=16    1 node, 16 processes
#PBS -l select=1:mpiprocs=36    1 node, 36 processes
#PBS -l select=2:mpiprocs=32    2 nodes, 32 x 2 = 64 processes
#PBS -l select=8:mpiprocs=36    8 nodes, 36 x 8 = 288 processes

This material comes from Prof. Nakajima.


Available QUEUEs
• The following 2 queues are available; up to 8 nodes can be used.
– u-lecture
• 8 nodes (288 cores), 10 min., valid until the end of March 2018
• Shared by all "educational" users
– u-lecture1
• 8 nodes (288 cores), 10 min., active during class time
• More jobs (compared to u-lecture) can be processed, depending on availability.

This material comes from Prof. Nakajima.


Submitting & Checking Jobs
• Submitting a job:              qsub <script name>
• Checking the status of jobs:   rbstat
• Deleting/aborting a job:       qdel <job ID>
• Checking the status of queues: rbstat --rsc
• Detailed queue information:    rbstat --rsc -x
• Number of running jobs:        rbstat --rsc -b
• History of submissions:        rbstat -H
• Limits on submission:          rbstat --limit

Example output of "rbstat --rsc"


$ rbstat --rsc
QUEUE                  STATUS           NODE
u-debug                [ENABLE ,START]    54
u-short                [ENABLE ,START]    16
u-regular              [ENABLE ,START]
|---- u-small          [ENABLE ,START]   294
|---- u-medium         [ENABLE ,START]   294
|---- u-large          [ENABLE ,START]   294
|---- u-x-large        [ENABLE ,START]   294
u-interactive          [ENABLE ,START]
|---- u-interactive_1  [ENABLE ,START]    54
|---- u-interactive_4  [ENABLE ,START]    54

QUEUE: available queue names (resource groups). STATUS: whether the queue is enabled. NODE: number of physical nodes.

Example output of "rbstat --rsc -x"


$ rbstat --rsc -x
QUEUE            STATUS          MIN_NODE MAX_NODE MAX_ELAPSE REMAIN_ELAPSE MEM(GB)/NODE PROJECT
u-debug          [ENABLE ,START]     1       24    00:30:00   00:30:00      244GB        gt31
u-short          [ENABLE ,START]     1        8    04:00:00   04:00:00      244GB        gt31
u-regular        [ENABLE ,START]
|---- u-small    [ENABLE ,START]     4       16    48:00:00   48:00:00      244GB        gt31
|---- u-medium   [ENABLE ,START]    17       32    48:00:00   48:00:00      244GB        gt31
|---- u-large    [ENABLE ,START]    33       64    48:00:00   48:00:00      244GB        gt31
|---- u-x-large  [ENABLE ,START]    65      128    24:00:00   24:00:00      244GB        gt31

QUEUE: available queue names (resource groups). STATUS: whether the queue is enabled. The remaining columns give per-queue run-time limits: MIN_NODE, MAX_NODE, MAX_ELAPSE, REMAIN_ELAPSE, MEM(GB)/NODE, and PROJECT.

Example output of "rbstat --rsc -b"


$ rbstat --rsc -b
QUEUE            STATUS          TOTAL RUNNING QUEUED HOLD BEGUN WAIT EXIT TRANSIT NODE
u-debug          [ENABLE ,START]     9       3      6    0     0    0    0       0   54
u-short          [ENABLE ,START]    70      13     12   45     0    0    0       0   16
u-regular        [ENABLE ,START]
|---- u-small    [ENABLE ,START]    74      21     53    0     0    0    0       0  294
|---- u-medium   [ENABLE ,START]     7       2      3    2     0    0    0       0  294
|---- u-large    [ENABLE ,START]     2       0      2    0     0    0    0       0  294
|---- u-x-large  [ENABLE ,START]     2       0      1    1     0    0    0       0  294

QUEUE: available queue names (resource groups). STATUS: whether the queue is enabled. TOTAL: total number of jobs. RUNNING: number of running jobs. QUEUED: number of queued jobs. NODE: number of physical nodes.

A Sample Job Script (for Pure MPI) (hello-pure.bash, common for C and Fortran)


#!/bin/sh
#PBS -q u-lecture1
#PBS -N 1d
#PBS -l select=8:mpiprocs=32
#PBS -Wgroup_list=gt31
#PBS -l walltime=00:10:00
#PBS -e err
#PBS -o test.lst
cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=32
mpirun ./impimap.sh ./hello

Annotations:
– Queue name: u-lecture1.
– Number of nodes (8) and number of MPI processes per node (32).
– Maximum execution time: 10 minutes.
– The MPI job runs with 32 processes * 8 nodes = 256 processes.


Pure MPI execution within a node
[Figure: a point of view of a node on the Reedbush-U. The node has two sockets of Intel Xeon E5-2695 v4 (Broadwell-EP), 18 cores each, connected by QPI; each socket has 128 GB of DDR4 memory with 76.8 GB/s of bandwidth. In pure MPI execution, one MPI process is placed on each core.]

Let's execute the sample parallel Hello program (pure MPI)
The job script file of the sample code is: hello-pure.bash
In the provided sample, the queue name is set to "u-lecture1". If you want to modify the queue name, edit the script with emacs:
$ emacs hello-pure.bash


Let's execute the sample parallel Hello program (pure MPI)
1. Execute the following in the Hello folder:
$ qsub hello-pure.bash
2. Confirm your submitted job:
$ rbstat
3. After the job finishes, the following files are created:
err
test.lst
4. Confirm the contents of the standard output file:
$ cat test.lst
5. If you see "Hello parallel world!" printed 32 processes * 8 nodes = 256 times, the job ended successfully.


Standard output and standard error of a batch job
After the batch job finishes, files for standard output and standard error are created in the directory from which you submitted the job.
Standard output is stored in the standard output file, and standard error is stored in the standard error file.


test.lst --- standard output file
err --- standard error file
You can change these names in the batch job script.

Let's compile the sample parallel Hello program: a case of hybrid MPI
1. Copy the "Makefile" for hybrid MPI:
$ cp Makefile_hy16 Makefile
2. Do "make":
$ make clean
$ make
3. Confirm creation of the file "hello":
$ ls
4. If you want to modify the queue name in the job script (hello-hy16.bash), edit it with emacs:
$ emacs hello-hy16.bash


Let's execute the sample parallel Hello program: a case of hybrid MPI
1. Execute the following in the Hello folder:
$ qsub hello-hy16.bash
2. Confirm your submitted job:
$ rbstat
3. After the job finishes, the following files are created:
test2.lst
err
4. Confirm the contents of the standard output file:
$ cat test2.lst
5. If you see "Hello parallel world!" printed 2 processes * 8 nodes = 16 times, the job ended successfully.


A Sample Job Script (for Hybrid MPI) (hello-hy16.bash, common for C and Fortran)

#!/bin/sh
#PBS -q u-lecture1
#PBS -N hybrid
#PBS -l select=8:mpiprocs=2
#PBS -Wgroup_list=gt31
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test2.lst
cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh
export OMP_NUM_THREADS=16
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=2
export KMP_AFFINITY=granularity=fine,compact,1,0
mpirun ./impimap.sh ./hello

Annotations:
– Queue name: u-lecture1.
– Number of nodes (8) and MPI processes per node (2); the MPI job runs with 2 * 8 = 16 processes.
– Maximum execution time: 5 minutes.
– 16 threads are created per MPI process.


Hybrid MPI execution within a node
[Figure: the same two-socket Reedbush-U node (2 x 18-core Intel Xeon E5-2695 v4, Broadwell-EP, QPI, 128 GB of DDR4 memory per socket at 76.8 GB/s). In hybrid MPI execution, one MPI process is placed on each socket and its threads run on the cores of that socket.]

Parallel Hello Program (C Language)


#include <stdio.h>
#include <stdlib.h>   /* for exit() */
#include <mpi.h>

int main(int argc, char* argv[]) {

   int myid, numprocs;
   int ierr, rc;

   ierr = MPI_Init(&argc, &argv);
   ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
   ierr = MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

   printf("Hello parallel world! Myid:%d \n", myid);

   rc = MPI_Finalize();

   exit(0);
}

Annotations:
– MPI_Init: initialization of MPI.
– MPI_Comm_rank: obtain own ID number (rank); a different value in each PE.
– MPI_Comm_size: obtain the total number of processes; the same value in each PE (256 or 16 in the lecture environment).
– MPI_Finalize: finalize MPI.
– This program is copied to all PEs.

Parallel Hello Program (Fortran Language)


program main
  include 'mpif.h'   ! MPI header (needed for MPI constants such as MPI_COMM_WORLD)

  common /mpienv/myid,numprocs

  integer myid, numprocs
  integer ierr

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

  print *, "Hello parallel world! Myid:", myid

  call MPI_FINALIZE(ierr)

  stop
  end

Annotations:
– MPI_INIT: initialization of MPI.
– MPI_COMM_RANK: obtain own ID number (rank); a different value in each PE.
– MPI_COMM_SIZE: obtain the total number of processes; the same value in each PE (256 or 16 in the lecture environment).
– MPI_FINALIZE: finalize MPI.
– This program is copied to all PEs.

Elapsed Time Measurement (C Language)


double t0, t1, t2, t_w;
..
ierr = MPI_Barrier(MPI_COMM_WORLD);
t1 = MPI_Wtime();

< This part is measured. >

ierr = MPI_Barrier(MPI_COMM_WORLD);
t2 = MPI_Wtime();

t0 = t2 - t1;
ierr = MPI_Reduce(&t0, &t_w, 1,
                  MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

– After the barrier, the time is stored.
– The values of t0 differ in each PE; here, the maximum time is stored in rank 0.
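For reference, here is a minimal self-contained sketch showing how the fragment above could be embedded in a full program. It is only an illustration, not the Cpi_m sample: the timed work (a local loop) and the variable name s are assumptions made for this sketch.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int myid, numprocs, ierr;
    double t0, t1, t2, t_w;

    ierr = MPI_Init(&argc, &argv);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    ierr = MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    ierr = MPI_Barrier(MPI_COMM_WORLD);      /* synchronize before starting the timer */
    t1 = MPI_Wtime();

    /* < this part is measured > : an arbitrary local computation, assumed for illustration */
    double s = 0.0;
    for (int i = 0; i < 10000000; i++) s += (double)i;

    ierr = MPI_Barrier(MPI_COMM_WORLD);      /* synchronize before stopping the timer */
    t2 = MPI_Wtime();

    t0 = t2 - t1;                            /* local elapsed time; differs in each PE */
    ierr = MPI_Reduce(&t0, &t_w, 1, MPI_DOUBLE,
                      MPI_MAX, 0, MPI_COMM_WORLD);   /* maximum time is stored in rank 0 */

    if (myid == 0) printf("Max elapsed time = %lf [s] (check value: %lf)\n", t_w, s);

    ierr = MPI_Finalize();
    return 0;
}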

Elapsed Time Measurement (Fortran Language)


double precision t0, t1, t2, t_w
double precision MPI_WTIME
..
call MPI_BARRIER(MPI_COMM_WORLD, ierr)
t1 = MPI_WTIME()

< This part is measured. >

call MPI_BARRIER(MPI_COMM_WORLD, ierr)
t2 = MPI_WTIME()

t0 = t2 - t1
call MPI_REDUCE(t0, t_w, 1,
&    MPI_DOUBLE_PRECISION,
&    MPI_MAX, 0, MPI_COMM_WORLD, ierr)

– After the barrier, the time is stored.
– The values of t0 differ in each PE; here, the maximum time is stored in rank 0.

Kinds of the sample programs
Hello/ : Parallel Hello program
  hello-pure.bash, hello-hy16.bash : batch job scripts
Cpi/ : Computation of PI
  cpi-pure.bash : a batch job script
Wa1/ : Summation with sequential sending
  wa1-pure.bash : a batch job script
Wa2/ : Parallel summation with binary-tree sending
  wa2-pure.bash : a batch job script
Cpi_m/ : Computation of PI with elapsed time measurement
  cpi_m-pure.bash : a batch job script


Parallel Summation and Its Communication Costs


Summation with Sequential Sending
Problem: distributed data are summed over all processes, and the summed result is gathered in one process.

A simple implementation (sequential summation):
1. If I am not rank 0, receive data from the left process.
2. When the data comes from the left process:
   1. Receive it;
   2. Add <own data> to <received data>;
   3. If I am not the final rank (#191 in this sample program), send <the result of 2> to the right process;
   4. Exit.

Notes for the implementation:
– The left process is rank (myid-1); the right process is rank (myid+1).
– If myid = 0, there is no left process, hence no receiving.
– If myid = p-1, there is no right process, hence no sending.

Summation with Sequential Sending


[Figure: four CPUs hold their own data 0, 1, 2, 3. CPU0 sends 0 to CPU1 (0 + 1 = 1), CPU1 sends 1 to CPU2 (1 + 2 = 3), and CPU2 sends 3 to CPU3 (3 + 3 = 6); the final result is held by CPU3.]

An Example of 1-to-1 Communication (Sequential Sending, C Language)


void main(int argc, char* argv[]) {
   MPI_Status istatus;
   ....
   dsendbuf = myid;
   drecvbuf = 0.0;
   if (myid != 0) {
      ierr = MPI_Recv(&drecvbuf, 1, MPI_DOUBLE, myid-1, 0,
                      MPI_COMM_WORLD, &istatus);
   }
   dsendbuf = dsendbuf + drecvbuf;
   if (myid != nprocs-1) {
      ierr = MPI_Send(&dsendbuf, 1, MPI_DOUBLE, myid+1, 0,
                      MPI_COMM_WORLD);
   }
   if (myid == nprocs-1) printf("Total = %4.2lf \n", dsendbuf);
   ....
}

Annotations:
– MPI_Status istatus: allocate the MPI status object.
– MPI_Recv: receive one double from rank (myid-1) and store it in drecvbuf.
– MPI_Send: send the double in dsendbuf to rank (myid+1).

An Example of 1-to-1 Communication (Sequential Sending, Fortran Language)


program main
  integer istatus(MPI_STATUS_SIZE)
  ....
  dsendbuf = myid
  drecvbuf = 0.0
  if (myid .ne. 0) then
     call MPI_RECV(drecvbuf, 1, MPI_DOUBLE_PRECISION,
&         myid-1, 0, MPI_COMM_WORLD, istatus, ierr)
  endif
  dsendbuf = dsendbuf + drecvbuf
  if (myid .ne. numprocs-1) then
     call MPI_SEND(dsendbuf, 1, MPI_DOUBLE_PRECISION,
&         myid+1, 0, MPI_COMM_WORLD, ierr)
  endif
  if (myid .eq. numprocs-1) then
     print *, "Total = ", dsendbuf
  endif
  ....
  stop
  end

Annotations:
– istatus(MPI_STATUS_SIZE): allocate the MPI status array.
– MPI_RECV: receive one double precision value from rank (myid-1) and store it in drecvbuf.
– MPI_SEND: send the double precision value in dsendbuf to rank (myid+1).

Parallel Summation Program (Binary-tree Type Sending)
Binary-tree type sending (a C sketch follows below):
1. k = 1;
2. for (i = 0; i < log2(nprocs); i++)
3.    if ((myid & k) == k) { Receive from rank (myid - k); Add own data to the received data; k = k * 2; }
4.    else { Send to rank (myid + k); Exit; }
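To make the pseudocode concrete, here is a minimal hedged C sketch of binary-tree type summation. It is only an illustration, not the Wa2 sample program: the variable names (my_val, recv_val) are assumptions for this sketch, each rank's own data is taken to be its rank number (as in the figure that follows), and nprocs is assumed to be a power of two.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int myid, nprocs;
    MPI_Status istatus;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double my_val = (double)myid;    /* own data: the rank number (assumption for illustration) */

    int k = 1;
    while (k < nprocs) {             /* at most log2(nprocs) steps when nprocs is a power of two */
        if ((myid & k) == k) {
            /* the corresponding bit of my rank is 1: receive from rank (myid - k) and accumulate */
            double recv_val;
            MPI_Recv(&recv_val, 1, MPI_DOUBLE, myid - k, 0, MPI_COMM_WORLD, &istatus);
            my_val += recv_val;
            k = k * 2;               /* proceed to the next step of the tree */
        } else {
            /* otherwise: send my partial sum to rank (myid + k) and leave the loop */
            MPI_Send(&my_val, 1, MPI_DOUBLE, myid + k, 0, MPI_COMM_WORLD);
            break;
        }
    }

    if (myid == nprocs - 1)          /* the last rank holds the final result */
        printf("Total = %4.2lf \n", my_val);

    MPI_Finalize();
    return 0;
}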


Parallel Summation Program (Binary-tree Type Sending)
[Figure: binary-tree summation with 8 ranks (0-7). First step: even ranks send to odd ranks, and the partial results are held by ranks 1, 3, 5, 7. Second step: ranks 1 and 5 send, and the partial results are held by ranks 3 and 7. Third step (= log2(8)): rank 3 sends to rank 7, which holds the final result.]

Parallel Summation Program (Binary-tree Type Sending)
Notes on the implementation
Point: use the binary representation of the rank number.
In step i, with k = 2^(i-1):
– A process whose rank satisfies (myid & k) == k, i.e., whose i-th bit is 1, receives data (from rank myid - k).
– Seen from the sending side, the rank of the receiving process is myid + k; the distance between communicating ranks in step i is therefore 2^(i-1), because of the binary tree.
– The same bit rule determines the sending processes.


Parallel Summation Program (Binary-tree Type Sending)
The number of communications for sequential sending is: nprocs - 1.
The number of communications for binary-tree type sending: assuming that no collisions occur within each step and the communications of a step proceed in parallel, the number of communications equals the number of steps, i.e., log2(nprocs).
Comparing the two communication times: as the number of processes increases, the difference grows, which means the gap in communication time grows. With 1024 processes, it is 1023 vs. 10.
However, binary-tree type sending is not always faster, e.g., when collisions are heavy.

Lessons
1. Execute the program for summation with sequential sending, using the sample program "Wa1".
2. Execute the program for parallel summation with binary-tree type sending, using the sample program "Wa2".
3. Execute the program with elapsed time measurement, using the sample program "Cpi_m".
4. Vary the number of processes for the sample programs and execute them.
5. Modify the Hello program as follows: using MPI_Send, send the string "Hello World!!" (char type) from rank 0 to another rank, such as rank 8. In the receiving rank, receive the data with MPI_Recv and print it.
