TRANSCRIPT
Training of Reedbush-U
Takahiro Katagiri, Professor, Information Technology Center, Nagoya University
Introduction to Parallel Programming for Multicore/Manycore Clusters
National Center for Theoretical Sciences, Mathematics Division, "High-Performance Computing" Short Course
Agenda
1. How to log in to the Reedbush-U
2. How to use the supercomputer: execute sample programs
3. An Example of Parallel Summation
4. Homework
How to log in to the Reedbush-U
Your first supercomputer login
Copy data from the Reedbush-U to your own PC
Type the following:
$ scp [email protected]:~/a.f90 ./
"tYYxxx" is your user name (YYxxx is a number). This copies "a.f90" in your home directory on the Reedbush-U to the current directory on your PC.
If you want to copy all files of a directory, specify "-r":
$ scp -r [email protected]:~/SAMP ./
This copies all files in the "SAMP" folder in your home directory on the Reedbush-U to the current directory on your PC.
Copy data from your own PC to the Reedbush-U
Type the following:
$ scp ./a.f90 [email protected]:
"tYYxxx" is your user name (YYxxx is a number). This copies "a.f90" in the current directory on your PC to your home directory on the Reedbush-U.
If you want to copy all files of a directory, specify "-r":
$ scp -r ./SAMP [email protected]:
This copies all files in the "SAMP" folder in the current directory on your PC to your home directory on the Reedbush-U.
How to log in to the supercomputer and run test programs
Agenda
1. How to use the supercomputer: execute sample parallel programs
2. An Example of Parallel Summation
Execute Test Parallel Programs
Notes on UNIX
Run emacs: emacs <file name>
^x ^s ("^" is the Control key): save the current text.
^x ^c : quit.
^g : cancel and return to input mode.
^k : delete from the cursor to the end of the line. The deleted lines are temporarily stored.
^y : paste the lines deleted with "^k" at the current cursor position.
^s <character string> : search for <character string>.
M-x goto-line : move to a specified line.
Notes on UNIX
rm <file name> : remove <file name>.
rm *~ : delete all files ending with "~", such as "test.c~".
ls : show file names in the current directory.
cd <folder name> : move to <folder name>.
cd .. : move up one directory level.
cd ~ : move to the home directory.
cat <file name> : show the contents of <file name>.
make : build the executable file (a "Makefile" is needed in the current directory).
make clean : delete the executable file (a "clean" target must be defined in the Makefile).
Execute test parallel programs
Name of the Sample Program
Common file for the C and Fortran90 languages: Samples-reedbush.tar
After running the "tar" command, a directory containing the C and Fortran90 versions is created.
C/ : for the C language
F/ : for the Fortran90 language
Directory that contains the above file: /lustre/gt00/z30082/
Let's compile the parallel Hello program (1/3)
Change the current directory to the execution directory.
On the Reedbush-U, the execution directory is: /lustre/gt31/<your account name>
Type the following:
$ cd /lustre/gt31/<your account name>
Let's compile the parallel Hello program (2/3)
1. Copy Samples-reedbush.tar from /lustre/gt00/z30082/ to your directory:
   $ cp /lustre/gt00/z30082/Samples-reedbush.tar ./
2. Unpack Samples-reedbush.tar:
   $ tar xvf Samples-reedbush.tar
3. Enter the Samples folder:
   $ cd Samples
4. For the C language: $ cd C      For Fortran90: $ cd F
5. Enter the Hello folder:
   $ cd Hello
Let's compile the parallel Hello program (3/3)
6. Copy the "Makefile" for pure MPI:
   $ cp Makefile_pure Makefile
7. Do make:
   $ make
8. Confirm that the executable file named "hello" was created:
   $ ls
What is a batch job?
In a supercomputer environment, interactive execution (command-line execution) is usually not supported. Jobs are executed through a batch job system.
[Figure: a user sends a job request to the supercomputer; the request waits in the batch queues, the batch system picks up jobs, and they are executed.]
Running Jobs
• Batch jobs
  – Only batch jobs are allowed.
  – Interactive execution of jobs is not allowed.
• How to run
  – write a job script
  – submit the job
  – check the job status
  – check the results
• Utilization of computational resources
  – Each job occupies 1 node (36 cores).
  – Your node is not shared with other jobs.
This material comes from Prof. Nakajima.
Job Script
• <$O-S1>/hello.sh
• Scheduling + shell script

#!/bin/sh
#PBS -q u-lecture1                Name of "QUEUE"
#PBS -N HELLO                     Job name
#PBS -l select=1:mpiprocs=4       Number of nodes, MPI processes per node
#PBS -Wgroup_list=gt31            Group name (wallet)
#PBS -l walltime=00:05:00         Computation time
#PBS -e err                       Standard error
#PBS -o hello.lst                 Standard output
cd $PBS_O_WORKDIR                 Go to the submission directory
. /etc/profile.d/modules.sh       (ESSENTIAL)
export I_MPI_PIN_DOMAIN=socket    Execution on each socket
export I_MPI_PERHOST=4            = mpiprocs, stable
mpirun ./impimap.sh ./a.out       Execute

This material comes from Prof. Nakajima.
impimap.sh
NUMA: use the resources (e.g. memory) local to the core where the job is running, so that performance is stable.

#!/bin/sh
numactl --localalloc $@

Process number
#PBS -l select=1:mpiprocs=4     1 node, 4 processes
#PBS -l select=1:mpiprocs=16    1 node, 16 processes
#PBS -l select=1:mpiprocs=36    1 node, 36 processes
#PBS -l select=2:mpiprocs=32    2 nodes, 32x2 = 64 processes
#PBS -l select=8:mpiprocs=36    8 nodes, 36x8 = 288 processes

This material comes from Prof. Nakajima.
Available QUEUEs
• The following 2 queues are available; up to 8 nodes can be used.
  – u-lecture
    • 8 nodes (288 cores), 10 min., valid until the end of March 2018
    • Shared by all "educational" users
  – u-lecture1
    • 8 nodes (288 cores), 10 min., active during class time
    • More jobs (compared to u-lecture) can be processed, depending on availability.
This material comes from Prof. Nakajima.
Submitting & Checking Jobs
• Submitting a job:              qsub SCRIPT NAME
• Checking the status of jobs:   rbstat
• Deleting/aborting a job:       qdel JOB ID
• Checking the status of queues: rbstat --rsc
• Detailed info on queues:       rbstat --rsc -x
• Number of running jobs:        rbstat --rsc -b
• History of submissions:        rbstat -H
• Limits on submission:          rbstat --limit
Example of executing "rbstat --rsc"
$ rbstat --rsc
QUEUE                  STATUS           NODE
u-debug                [ENABLE ,START]    54
u-short                [ENABLE ,START]    16
u-regular              [ENABLE ,START]
|---- u-small          [ENABLE ,START]   294
|---- u-medium         [ENABLE ,START]   294
|---- u-large          [ENABLE ,START]   294
|---- u-x-large        [ENABLE ,START]   294
u-interactive          [ENABLE ,START]
|---- u-interactive_1  [ENABLE ,START]    54
|---- u-interactive_4  [ENABLE ,START]    54

QUEUE: available queue names (resource groups). STATUS: whether the queue is enabled. NODE: number of nodes in the physical configuration.
Example of executing "rbstat --rsc -x"
$ rbstat --rsc -x
QUEUE            STATUS           MIN_NODE MAX_NODE MAX_ELAPSE REMAIN_ELAPSE MEM(GB)/NODE PROJECT
u-debug          [ENABLE ,START]      1      24     00:30:00   00:30:00     244GB        gt31
u-short          [ENABLE ,START]      1       8     04:00:00   04:00:00     244GB        gt31
u-regular        [ENABLE ,START]
|---- u-small    [ENABLE ,START]      4      16     48:00:00   48:00:00     244GB        gt31
|---- u-medium   [ENABLE ,START]     17      32     48:00:00   48:00:00     244GB        gt31
|---- u-large    [ENABLE ,START]     33      64     48:00:00   48:00:00     244GB        gt31
|---- u-x-large  [ENABLE ,START]     65     128     24:00:00   24:00:00     244GB        gt31

QUEUE: available queue names (resource groups). STATUS: whether the queue is enabled. The remaining columns give run-time information for each queue (MIN_NODE, MAX_NODE, MAX_ELAPSE, REMAIN_ELAPSE, MEM(GB)/NODE, PROJECT).
Example of executing "rbstat --rsc -b"
$ rbstat --rsc -b
QUEUE            STATUS           TOTAL RUNNING QUEUED HOLD BEGUN WAIT EXIT TRANSIT NODE
u-debug          [ENABLE ,START]      9       3      6    0     0    0    0       0   54
u-short          [ENABLE ,START]     70      13     12   45     0    0    0       0   16
u-regular        [ENABLE ,START]
|---- u-small    [ENABLE ,START]     74      21     53    0     0    0    0       0  294
|---- u-medium   [ENABLE ,START]      7       2      3    2     0    0    0       0  294
|---- u-large    [ENABLE ,START]      2       0      2    0     0    0    0       0  294
|---- u-x-large  [ENABLE ,START]      2       0      1    1     0    0    0       0  294

QUEUE: available queue names (resource groups). STATUS: whether the queue is enabled. TOTAL: total number of jobs. RUNNING: number of running jobs. QUEUED: number of queued jobs. NODE: number of nodes in the physical configuration.
A Sample of Job Script (for Pure MPI) (hello-pure.bash, common for C and Fortran)
#!/bin/sh
#PBS -q u-lecture1
#PBS -N 1d
#PBS -l select=8:mpiprocs=32
#PBS -Wgroup_list=gt31
#PBS -l walltime=00:10:00
#PBS -e err
#PBS -o test.lst
cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=32
mpirun ./impimap.sh ./hello

Specify the queue name: u-lecture1
Specify the number of nodes (8) and the number of MPI processes per node (32).
Maximum execution time: 10 minutes.
This executes an MPI job with 32 processes/node x 8 nodes = 256 processes.
A Point of View of a Node on the Reedbush-U: pure MPI execution within a node.
[Figure: one node consists of two sockets of Intel Xeon E5-2695 v4 (Broadwell-EP), 18 cores each, connected by QPI. Each socket is attached to 128 GB of DDR4 memory with 76.8 GB/s of bandwidth. In pure MPI execution, MPI processes are placed directly on the cores.]
Let's execute the sample code of the parallel Hello program (pure MPI).
The job script file of the sample code is: hello-pure.bash
In the provided sample code, the queue name is set to "u-lecture1". If you want to modify the queue name, edit the script with emacs:
$ emacs hello-pure.bash
Let's execute the sample code of the parallel Hello program (pure MPI).
1. Execute the following in the Hello folder:
   $ qsub hello-pure.bash
2. Confirm that your job has been submitted:
   $ rbstat
3. After the job finishes, the following files are created:
   err
   test.lst
4. Confirm the contents of the standard output file:
   $ cat test.lst
5. If you see "Hello parallel world!" printed 32 processes x 8 nodes = 256 times, the job ended successfully.
Standard output and standard error of a batch job
After the batch job finishes, the files for standard output and standard error are created in the directory from which you submitted the job.
Standard output is stored in the standard output file, and standard error is stored in the standard error file.
test.lst --- standard output file
err      --- standard error file
You can change these names in the batch job script.
Let's compile the sample code of the parallel Hello program: a case of hybrid MPI.
1. Copy the "Makefile" for hybrid MPI:
   $ cp Makefile_hy16 Makefile
2. Do "make":
   $ make clean
   $ make
3. Confirm that the file "hello" has been created:
   $ ls
4. If you want to modify the queue name in the job script (hello-hy16.bash), edit it with emacs:
   $ emacs hello-hy16.bash
Let's execute the sample code of the parallel Hello program: a case of hybrid MPI.
1. Execute the following in the Hello folder:
   $ qsub hello-hy16.bash
2. Confirm that your job has been submitted:
   $ rbstat
3. After the job finishes, the following files are created:
   test2.lst
   err
4. Confirm the contents of the standard output file:
   $ cat test2.lst
5. If you see "Hello parallel world!" printed 2 processes x 8 nodes = 16 times, the job ended successfully.
A Sample of Job Script (for Hybrid MPI) (hello-hy16.bash, common for C and Fortran)

#!/bin/sh
#PBS -q u-lecture1
#PBS -N hybrid
#PBS -l select=8:mpiprocs=2
#PBS -Wgroup_list=gt31
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test2.lst
cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh
export OMP_NUM_THREADS=16
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=2
export KMP_AFFINITY=granularity=fine,compact,1,0
mpirun ./impimap.sh ./hello
Specify queue name:u-lecture1
Execute MPI job with 2*8 = 16 processes.
Specify number of nodes (8) and MPI processes per node (2).
Maximum execution time: 5 minutes.
16 threads are created per MPI process.
A Point of View of a Node on the Reedbush-U: hybrid MPI execution within a node.
[Figure: one node consists of two sockets of Intel Xeon E5-2695 v4 (Broadwell-EP), 18 cores each, connected by QPI. Each socket is attached to 128 GB of DDR4 memory with 76.8 GB/s of bandwidth. In hybrid MPI execution, one MPI process runs on each socket and spawns threads on that socket's cores.]
Parallel Hello Program (C Language)
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
   int myid, numprocs;
   int ierr, rc;

   ierr = MPI_Init(&argc, &argv);
   ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
   ierr = MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

   printf("Hello parallel world! Myid:%d \n", myid);

   rc = MPI_Finalize();

   exit(0);
}
Initialization of MPI.
Obtain own ID number (rank): different value in each PE.
Obtain total number of processes: same value in each PE (256 or 16 in this lecture environment).
Finalize MPI.
This program is copied to all PEs.
Parallel Hello Program (Fortran Language)
      program main
      include 'mpif.h'
      common /mpienv/ myid, numprocs

      integer myid, numprocs
      integer ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

      print *, "Hello parallel world! Myid:", myid

      call MPI_FINALIZE(ierr)

      stop
      end
Initialization of MPI.
Obtain own ID number (rank): different value in each PE.
Obtain total number of processes: same value in each PE (256 or 16 in this lecture environment).
Finalize MPI.
This program is copied to all PEs.
Elapsed Time Measurement (C Language)
double t0, t1, t2, t_w;
..
ierr = MPI_Barrier(MPI_COMM_WORLD);
t1 = MPI_Wtime();

< This part is measured. >

ierr = MPI_Barrier(MPI_COMM_WORLD);
t2 = MPI_Wtime();
t0 = t2 - t1;
ierr = MPI_Reduce(&t0, &t_w, 1, MPI_DOUBLE,
                  MPI_MAX, 0, MPI_COMM_WORLD);
After barrier, time is stored.
Values of t0 differ in each PE. In this case, the maximum time is stored in rank 0.
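Below is a self-contained sketch of the same timing pattern; the dummy work loop and the final print on rank 0 are illustrative additions, not part of the provided samples.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int myid, numprocs, i, ierr;
    double t0, t1, t2, t_w, s = 0.0;

    ierr = MPI_Init(&argc, &argv);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    ierr = MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    ierr = MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    /* <measured part: here just some dummy local work> */
    for (i = 0; i < 1000000; i++) s += (double)i;

    ierr = MPI_Barrier(MPI_COMM_WORLD);
    t2 = MPI_Wtime();

    t0 = t2 - t1;
    ierr = MPI_Reduce(&t0, &t_w, 1, MPI_DOUBLE,
                      MPI_MAX, 0, MPI_COMM_WORLD);

    /* Only rank 0 holds the maximum elapsed time. */
    if (myid == 0) printf("Max elapsed time = %lf [s]\n", t_w);

    ierr = MPI_Finalize();
    return 0;
}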
Elapsed Time Measurement (Fortran Language)
      double precision t0, t1, t2, t_w
      double precision MPI_WTIME
      ..
      call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      t1 = MPI_WTIME()

< This part is measured. >

      call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      t2 = MPI_WTIME()
      t0 = t2 - t1
      call MPI_REDUCE(t0, t_w, 1,
     &     MPI_DOUBLE_PRECISION,
     &     MPI_MAX, 0, MPI_COMM_WORLD, ierr)
After barrier, time is stored.
Values of t0 differ in each PE. In this case, the maximum time is stored in rank 0.
Kinds of the sample programs
Hello/
Parallel Hello program
hello-pure.bash, hello-hy16.bash : Batch Job Scripts
Cpi/ Computation of PI
cpi-pure.bash :A Batch Job Script
Wa1/ Summation with Sequential Sending
wa1-pure.bash : A Batch Job Script
Wa2/ Parallel Summation with Binary Tree Sending
wa2-pure.bash : A Batch Job Script
Cpi_m/ Computation of PI with elapsed-time measurement
cpi_m-pure.bash : A Batch Job Script
Parallel Summation and Its Communication Cost
Summation with Sequential Sending
Problem: data distributed across all processes are summed, and the result is gathered in one process.

A simple implementation (sequential summation):
1. If I am not rank 0, wait for data from the left process.
2. When data arrives from the left process:
   1. Receive it;
   2. Add <own data> to <received data>;
   3. If I am not the final rank (#191 in this sample program), send <the result of 2> to the right process;
   4. Exit.

Notes for the implementation:
  The left process is rank (myid-1); the right process is rank (myid+1).
  If myid = 0, there is no left process, hence no receiving.
  If myid = p-1, there is no right process, hence no sending.
Summation with Sequential Sending
[Figure: sequential summation over 4 ranks. CPU0 starts with its own data 0 and sends it to CPU1; CPU1 computes 0 + 1 = 1 and sends it on; CPU2 computes 1 + 2 = 3 and sends it on; CPU3 computes 3 + 3 = 6, which is the final result.]
An example of 1-to-1 Communication (Sequential Sending, C Language)
void main(int argc, char* argv[]) {
   MPI_Status istatus;
   ....
   dsendbuf = myid;
   drecvbuf = 0.0;
   if (myid != 0) {
      ierr = MPI_Recv(&drecvbuf, 1, MPI_DOUBLE, myid-1, 0,
                      MPI_COMM_WORLD, &istatus);
   }
   dsendbuf = dsendbuf + drecvbuf;
   if (myid != nprocs-1) {
      ierr = MPI_Send(&dsendbuf, 1, MPI_DOUBLE, myid+1, 0,
                      MPI_COMM_WORLD);
   }
   if (myid == nprocs-1) printf("Total = %4.2lf \n", dsendbuf);
   ....
}
Declare the status object used by MPI.
Receive data of type double from rank (myid-1) and store it in drecvbuf.
Send the data in dsendbuf, of type double, to rank (myid+1).
An example of 1-to-1 Communication (Sequential Sending, Fortran Language)
      program main
      integer istatus(MPI_STATUS_SIZE)
      ....
      dsendbuf = myid
      drecvbuf = 0.0
      if (myid .ne. 0) then
         call MPI_RECV(drecvbuf, 1, MPI_DOUBLE_PRECISION,
     &        myid-1, 0, MPI_COMM_WORLD, istatus, ierr)
      endif
      dsendbuf = dsendbuf + drecvbuf
      if (myid .ne. numprocs-1) then
         call MPI_SEND(dsendbuf, 1, MPI_DOUBLE_PRECISION,
     &        myid+1, 0, MPI_COMM_WORLD, ierr)
      endif
      if (myid .eq. numprocs-1) then
         print *, "Total = ", dsendbuf
      endif
      ....
      stop
      end
Declare the status array used by MPI.
Receive data of type double precision from rank (myid-1) and store it in drecvbuf.
Send the data in dsendbuf, of type double precision, to rank (myid+1).
Parallel Summation Program (Binary-tree Type Sending)
Binary-tree type sending:
1. k = 1;
2. for (i = 0; i < log2(nprocs); i++)
3.    if ((myid & k) == k) { receive from rank (myid - k); add own data to the received data; k = k * 2; }
4.    else { send to rank (myid + k); exit; }
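As a concrete illustration of this pseudocode, here is a minimal C sketch of the binary-tree summation. It assumes nprocs is a power of two (as in the figure below) and reuses the variable names myid, nprocs, dsendbuf, drecvbuf from the earlier example; it is not the provided Wa2 sample itself.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int myid, nprocs, k;
    double dsendbuf, drecvbuf;
    MPI_Status istatus;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    dsendbuf = myid;          /* own data */
    k = 1;
    while (k < nprocs) {      /* at most log2(nprocs) steps */
        if ((myid & k) == k) {
            /* This rank receives from rank (myid - k) and keeps summing. */
            MPI_Recv(&drecvbuf, 1, MPI_DOUBLE, myid - k, 0,
                     MPI_COMM_WORLD, &istatus);
            dsendbuf += drecvbuf;
            k = k * 2;
        } else {
            /* This rank sends its partial sum to rank (myid + k) and is done. */
            MPI_Send(&dsendbuf, 1, MPI_DOUBLE, myid + k, 0,
                     MPI_COMM_WORLD);
            break;
        }
    }
    /* The final result ends up in the last rank (nprocs - 1). */
    if (myid == nprocs - 1) printf("Total = %4.2lf \n", dsendbuf);

    MPI_Finalize();
    return 0;
}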
Parallel Summation Program(Binary-tree Type Sending)
[Figure: binary-tree summation over 8 ranks (0-7). First step: ranks 1, 3, 5, 7 receive from their left neighbors. Second step: ranks 3 and 7 receive. Third step (= log2(8)): rank 7 receives and holds the final result.]
Parallel Summation Program (Binary-tree Type Sending)
Notes on the implementation:
Point: use the binary representation of the rank number.
In step i (i = 1, 2, ...), with k = 2^(i-1), a process whose rank satisfies (myid & k) == k, i.e. whose i-th lowest bit is 1, receives data from rank (myid - k).
A process that does not satisfy this condition sends its partial sum to rank (myid + k), so the distance between sender and receiver is 2^(i-1), because of the binary tree.
Parallel Summation Program (Binary-tree Type Sending)
The number of communications for sequential sending is nprocs - 1.
The number of communications for binary-tree type sending: assuming no collisions occur and the communications within each step proceed in parallel, the number of communications equals the number of steps, i.e. log2(nprocs).
Comparing the two: as the number of processes increases, the difference grows, i.e. the gap in communication time becomes large. With 1024 processes, it is 1023 vs. 10.
However, binary-tree type sending is not always faster, e.g. in the case of heavy collisions.
Lessons
1. Execute the program for summation with sequential sending, using the sample program in "Wa1".
2. Execute the program for parallel summation with binary-tree type sending, using the sample program in "Wa2".
3. Execute the program with elapsed-time measurement, using the sample program in "Cpi_m".
4. Vary the number of processes for the sample programs and execute them.
5. Modify the Hello program as follows: using MPI_Send, send the string "Hello World!!" (char type) from rank 0 to another rank, such as rank 8. In the receiving rank, receive the data with MPI_Recv and print it. (One possible sketch is shown below.)
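A possible sketch for item 5, assuming the job runs with more than 8 MPI processes (rank 8 is just an example receiver); treat it as a starting point rather than the expected solution.

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int myid, numprocs;
    char msg[32] = "Hello World!!";
    MPI_Status istatus;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    if (myid == 0) {
        /* Rank 0 sends the string (including the terminating '\0'). */
        MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 8, 0, MPI_COMM_WORLD);
    } else if (myid == 8) {
        /* Rank 8 receives the string and prints it. */
        MPI_Recv(msg, 32, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &istatus);
        printf("Rank %d received: %s \n", myid, msg);
    }

    MPI_Finalize();
    return 0;
}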