
Page 1

Towards Auto-tuning Facilities into Supercomputers in Operation

- The FIBER approach and minimizing software-stack requirements -

Takahiro Katagiri (片桐 孝洋), Information Technology Center,

The University of Tokyo

(東京大学 情報基盤センター)

2014 ATAT in HPSC, National Taiwan University, March 15, 2014 (Saturday), 10:10-10:30

Joint work with: Satoshi Ohshima (大島 聡史) and Masaharu Matsumoto (松本 正晴)

Page 2

Overview

1. Background and ppOpen-HPC Project

2. ppOpen-AT Basics

3. Adaptation to an FDM Application

4. Performance Evaluation

5. Conclusion


Page 3

Overview

1. Background and ppOpen-HPC Project

2. ppOpen-AT Basics

3. Adaptation to an FDM Application

4. Performance Evaluation

5. Conclusion


Page 4

Background: High-Thread Parallelism (HTP)

◦ Multi-core and many-core processors are pervasive.
  Multi-core CPUs: 8-16 cores, 16-64 threads with Hyper-Threading (HT) or Simultaneous Multithreading (SMT).
  Many-core CPU: Xeon Phi – 60 cores, 240 threads with HT.

◦ Exploiting the full thread parallelism is important.

Performance Portability (PP)

◦ Maintaining high performance across multiple computer environments.
  Not only multiple CPUs, but also multiple compilers.
  Run-time information, such as loop lengths and the number of threads, is important.

◦ Auto-tuning (AT) is one candidate technology for establishing PP across multiple computer environments.

Page 5

ppOpen-HPC Project: Middleware for HPC and Its AT

◦ Supported by JST CREST, from FY2011 to FY2016.

◦ PI: Professor Kengo Nakajima (U. Tokyo)

ppOpen-HPC
◦ An open-source infrastructure for reliable simulation codes on post-peta (pp) scale parallel computers.
◦ Consists of various types of libraries, which cover five kinds of discretization methods for scientific computations.

ppOpen-AT
◦ An auto-tuning language for ppOpen-HPC codes.
◦ Uses knowledge from a previous project, the ABCLibScript project.
◦ A directive-based auto-tuning language.

Page 6

[Figure: Software Architecture of ppOpen-HPC. A user's program calls ppOpen-APPL (FEM, FDM, FVM, BEM, DEM), ppOpen-MATH (MG, GRAPH, VIS, MP), ppOpen-AT (STATIC, DYNAMIC), and ppOpen-SYS (COMM, FT). The Auto-Tuning Facility performs code generation for optimization candidates, search for the best candidate, and automatic execution of the optimization; a Resource Allocation Facility specifies the best execution allocations. Target hardware: many-core CPUs, GPUs, low-power CPUs, and vector CPUs.]

Page 7

Overview

1. Background and ppOpen-HPC Project

2. ppOpen-AT Basics

3. Adaptation to an FDM Application

4. Performance Evaluation

5. Conclusion


Page 8

Overview of FIBER (Framework of Install-time, Before Execute-time and Run-time Auto-tuning) [T. Katagiri et al., 2003]

[Figure: The FIBER workflow. Legacy codes annotated with AT directives (#pragma oat ...) are fed to a preprocessor of the AT directives, which generates codes with AT functions and the optimization candidates (#implementation1, #implementation2, #implementation3) specified by the directives. After compiling, the executable codes with AT functions are tuned at the three AT timings defined by FIBER and specified by the AT directives: install-time, before execute-time, and run-time. The user specifies parameters through an API on FIBER, and the best parameters are stored in a performance database.]

Page 9

A Scenario for Software Developers Using ppOpen-AT

[Figure: developer workflow. The software developer writes a description of AT using ppOpen-AT for optimizations that cannot be established by compilers, invokes the dedicated preprocessor to obtain a program with AT functions, and builds an executable code with optimization candidates and an AT function.]

#pragma oat install unroll (i,j,k) region start
#pragma oat varied (i,j,k) from 1 to 8
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
      A[i][j] = A[i][j] + B[i][k] * C[k][j];
    }
  }
}
#pragma oat install unroll (i,j,k) region end

■ Automatically generated functions: optimization candidates, performance monitor, parameter search, performance modeling.

■ Description by the software developer: optimizations for source codes, computer resources, and power consumption.

Page 10

Compiler Optimization and AT

1. Loop length is unknown at compile time. The optimal loop split and loop fusion can only be chosen at run time, and run-time compilation is still only a research topic.

2. Loop splitting in the presence of data dependencies. Some loop splits require extra computation or memory space. Some compilers provide a directive for this, but the directive is not standardized, and the resulting code optimization is also not standardized across compilers (see the sketch after this list).

3. Restrictions from the operation of supercomputers. Some supercomputer environments cannot supply the required "software stack", or the software stack cannot be used due to operational restrictions, or the system is out of scope due to hardware restrictions. Example: CAPS on the K computer. Operation costs (budgets), vendor strategy, etc.
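To make item 2 concrete, here is a minimal sketch of such a vendor-specific hint; the directive shown is Intel Fortran's loop-distribution directive, whose spelling and semantics are compiler-specific (exactly the portability problem that a directive-based AT language such as ppOpen-AT sidesteps). The subroutine and array names are made up for illustration.

! Sketch: asking one particular compiler (Intel Fortran) to split a loop.
! "!DIR$ DISTRIBUTE POINT" is Intel-specific; other compilers use different
! spellings or have no equivalent, so the tuning is not portable.
subroutine split_hint_sketch(n, a, b, c)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: a(n), b(n)
  real, intent(in)    :: c(n)
  integer :: i

  do i = 1, n
     a(i) = a(i) + c(i)
!DIR$ DISTRIBUTE POINT
     ! The compiler may distribute (split) the loop at this point.
     b(i) = b(i) + a(i) * c(i)
  end do
end subroutine split_hint_sketch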

Page 11

Overview

1. Background and ppOpen-HPC Project

2. ppOpen-AT Basics

3. Adaptation to an FDM Application

4. Performance Evaluation

5. Conclusion


Page 12

EARLY EXPERIENCE IN AN EXPLICIT METHOD (FINITE DIFFERENCE METHOD)

Page 13

Target Application: Seism3D

◦ Simulation software for seismic wave analysis; a strategic simulation code in Japan.
◦ Developed by Professor Furumura at the University of Tokyo.
◦ The code has been re-constructed as ppOpen-APPL/FDM.

Finite Difference Method (FDM), 3D simulation
◦ 3D arrays are allocated.
◦ Data type: single precision (real*4)

Source: http://www.eri.u-tokyo.ac.jp/furumura/tsunami/tsunami.html

Page 14

The Heaviest Loop (20%+ of Total Time)

!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX
      RL1 = LAM (I,J,K)
      RM1 = RIG (I,J,K)
      RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
      DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
      D3V3 = DXVX1 + DYVY1 + DZVZ1
      SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1)) * DT
      SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1)) * DT
      SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1)) * DT
      DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
      DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K); DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
      SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
      SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
      SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
    END DO
  END DO
END DO
!$omp end parallel do

A flow dependency: RM1 is computed in the first half of the loop body and used again in the second half (the SXY/SXZ/SYZ updates), which is what makes splitting this loop non-trivial.

Page 15

Optimization Possibilities

Loop Splitting
◦ To reduce spill code.
◦ To maximize register usage.

Loop Fusion (Loop Collapse)
◦ The three-nested loop is transformed by one of the following two approaches.
◦ One nested loop (full collapse):
  To increase outer-loop parallelism for thread parallelism.
◦ Two nested loops:
  To increase outer-loop parallelism for thread parallelism.
  To utilize pre-fetching in the inner loop.

Page 16

Loop Fusion – One-dimensional (a loop collapse)

!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO KK = 1, NZ * NY * NX
  K = (KK-1)/(NY*NX) + 1
  J = mod((KK-1)/NX, NY) + 1
  I = mod(KK-1, NX) + 1
  RL1 = LAM (I,J,K)
  RM1 = RIG (I,J,K)
  RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
  DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
  D3V3 = DXVX1 + DYVY1 + DZVZ1
  SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1)) * DT
  SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1)) * DT
  SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1)) * DT
  DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
  DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K); DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
  SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
  SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
  SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
END DO
!$omp end parallel do

Merit: the collapsed loop length is huge (NZ * NY * NX), which is good for OpenMP thread parallelism.
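As a quick sanity check of the index recovery above (a toy sketch, not part of Seism3D or ppOpen-APPL/FDM; the program and its dimensions are made up), the collapsed index KK reproduces (I, J, K) in the same order as the original triple loop:

! Toy check: the collapsed index KK recovers (I,J,K) in the original
! loop order (I fastest, then J, then K) for small trial dimensions.
program collapse_check
  implicit none
  integer, parameter :: NX = 2, NY = 3, NZ = 4
  integer :: KK, I, J, K

  do KK = 1, NZ * NY * NX
     K = (KK-1) / (NY*NX) + 1
     J = mod((KK-1) / NX, NY) + 1
     I = mod(KK-1, NX) + 1
     print '(A,I3,A,3I3)', 'KK=', KK, '  (I,J,K)=', I, J, K
  end do
end program collapse_check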

Page 17

Loop Fusion – Two-dimensional

!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO KK = 1, NZ * NY
  K = (KK-1)/NY + 1
  J = mod(KK-1, NY) + 1
  DO I = 1, NX
    RL1 = LAM (I,J,K)
    RM1 = RIG (I,J,K)
    RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
    DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
    D3V3 = DXVX1 + DYVY1 + DZVZ1
    SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1)) * DT
    SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1)) * DT
    SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1)) * DT
    DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
    DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K); DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
    SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
    SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
    SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
  END DO
END DO
!$omp end parallel do

Merit: the fused KK loop is still long, which is good for OpenMP thread parallelism, and the remaining I-loop gives an opportunity for pre-fetching.

Page 18

Perfect Splitting: Two 3-nested Loops

!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX
      RL1 = LAM (I,J,K)
      RM1 = RIG (I,J,K)
      RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
      DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
      D3V3 = DXVX1 + DYVY1 + DZVZ1
      SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1)) * DT
      SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1)) * DT
      SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1)) * DT
    END DO
    DO I = 1, NX
      RM1 = RIG (I,J,K)
      DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
      DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K); DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
      SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
      SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
      SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
    END DO
  END DO
END DO
!$omp end parallel do

Re-computation of RM1 (a copy) is needed in the second loop.
⇒ Compilers do not apply this split without a directive.

Page 19

New Directives for ppOpen-AT
• m_stress.f90 (ppohFDM_update_stress)

!OAT$ install LoopFusionSplit region start
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
do k = NZ00, NZ01
  do j = NY00, NY01
    do i = NX00, NX01
      RL1 = LAM (I,J,K)
!OAT$ SplitPointCopyDef sub region start
      RM1 = RIG (I,J,K)
!OAT$ SplitPointCopyDef sub region end
      RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
      DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
      D3V3 = DXVX1 + DYVY1 + DZVZ1
      SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1)) * DT
      SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1)) * DT
      SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1)) * DT
!OAT$ SplitPoint (K,J,I)
!OAT$ SplitPointCopyInsert
      DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
      DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K); DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
      SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
      SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
      SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
    end do
  end do
end do
!$omp end parallel do
!OAT$ install LoopFusionSplit region end

The SplitPointCopyDef sub-region defines the statement (RM1 = RIG(I,J,K)) that must be re-calculated; SplitPoint (K,J,I) marks the loop split point; SplitPointCopyInsert marks where the re-calculation is inserted when the loop is split.

Page 20

Candidates of Auto-generated Codes

#1 [Baseline]: Original three-nested loop.
#2 [Split]: Loop split at the k-loop (two separated three-nested loops).
#3 [Split]: Loop split at the j-loop.
#4 [Split]: Loop split at the i-loop.
#5 [Fusion]: Loop fusion of the k-loop and j-loop (a two-nested loop).
#6 [Split and Fusion]: Loop fusion of the k-loop and j-loop applied to the loops in #2.
#7 [Fusion]: Loop fusion of the k-loop, j-loop, and i-loop (loop collapse).
#8 [Split and Fusion]: Loop fusion of the k-loop, j-loop, and i-loop applied to the loops in #2 (loop collapses of the two separated loops).
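To make the selection among these candidates concrete, the following is a minimal sketch of a before execute-time brute-force search in the spirit of FIBER: time each generated candidate and keep the fastest one. It is not the code ppOpen-AT actually generates; the program, the subroutine run_candidate, and the constants are hypothetical stand-ins.

! Sketch of a brute-force AT search (before execute-time): time every
! auto-generated candidate and remember the fastest one ("best SW").
program at_bruteforce_sketch
  use omp_lib
  implicit none
  integer, parameter :: NUM_CAND = 8      ! e.g., candidates #1..#8 above
  integer, parameter :: NUM_REPEAT = 100  ! each kernel is repeated 100 times in AT mode
  integer :: id, it, best_id
  double precision :: t0, t1, t_cur, t_best

  best_id = 1
  t_best  = huge(1.0d0)
  do id = 1, NUM_CAND
     t0 = omp_get_wtime()
     do it = 1, NUM_REPEAT
        call run_candidate(id)            ! dispatch to candidate #id
     end do
     t1 = omp_get_wtime()
     t_cur = (t1 - t0) / NUM_REPEAT
     if (t_cur < t_best) then
        t_best  = t_cur
        best_id = id
     end if
  end do
  print *, 'best candidate:', best_id, '  time [s]:', t_best

contains

  subroutine run_candidate(id)
    integer, intent(in) :: id
    ! Stand-in for one auto-generated loop variant of update_stress;
    ! in the real ppOpen-AT output this would call the generated kernel.
  end subroutine run_candidate

end program at_bruteforce_sketch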

Page 21

Overview

1. Background and ppOpen-HPC Project

2. ppOpen-AT Basics

3. Adaptation to an FDM Application

4. Performance Evaluation

5. Conclusion


Page 22

PERFORMANCE EVALUATION


Page 23

An Example of a Seism3D Simulation

The 2000 western Tottori earthquake in Japan ([1], p. 14). A region of 820 km x 410 km x 128 km is discretized with a 0.4 km grid spacing:
NX x NY x NZ = 2050 x 1025 x 320 (a ratio of roughly 6.4 : 3.2 : 1).

[1] T. Furumura, "Large-scale Parallel FDM Simulation for Seismic Waves and Strong Shaking", Supercomputing News, Information Technology Center, The University of Tokyo, Vol. 11, Special Edition 1, 2009 (in Japanese).

Figure: Seismic wave propagation in the 2000 western Tottori earthquake. (a) Measured waves; (b) simulation results. (Source: [1], p. 13)

Page 24

Test Conditions

Software versions
◦ ppOpen-APPL/FDM version 0.2
◦ ppOpen-AT version 0.2

Target kernels in ppOpen-APPL/FDM
◦ Top 10 kernels (all three-nested loops):
  Update_stress
  Update_vel
  Update_spong
  The other 7 kernels in the finite difference computations.

AT timing
◦ Before execute-time: after the problem size and the number of threads have been fixed by the user, AT is applied when the library routine is called.
◦ All AT candidates are evaluated (brute-force search): only 8 + 3 + 6 + 7*3 = 38 candidates in total.

Number of repeats for each kernel in AT mode
◦ 100 times

Page 25

The Xeon Phi Cluster System

Intel Xeon (Ivy Bridge): host CPU
◦ OS: Red Hat Enterprise Linux Server release 6.2
◦ #Nodes: 32 (14 nodes available)
◦ CPU: Intel Xeon E5-2670 v2 @ 2.50 GHz, 2 sockets x 10 cores
◦ Hyper-Threading: ON
◦ Theoretical peak performance per node (CPU): 400 GFLOPS
◦ Memory per node: 64 GB
◦ Interconnect: InfiniBand
◦ Compiler: Intel Fortran version 14.0.0.080 Build 20130728
◦ Compiler options: -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel
◦ KMP_AFFINITY=granularity=fine,compact (all threads packed on one socket)

Intel Xeon Phi co-processor (Xeon Phi): accelerator
◦ CPU: Xeon Phi 5110P (B1 stepping), 1.053 GHz, 60 cores
◦ Memory: 8 GB
◦ Theoretical peak performance: 1 TFLOPS (= 1.053 GHz x 16 FLOPS x 60 cores)
◦ One board connected to each node of the cluster
◦ Native mode
◦ Compiler: Intel Fortran version 14.0.0.080 Build 20130728
◦ Compiler options: -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel -mmic -align array64byte
◦ KMP_AFFINITY=granularity=fine,balanced (threads are distributed evenly over the cores)

Page 26

RESULTS ON THE XEON PHI

Page 27

Execution Details

• ppOpen-APPL/FDM ver. 0.2
• ppOpen-AT ver. 0.2
• Target problem size:
  – NX x NY x NZ = 256 x 96 x 100 per node
  – NX x NY x NZ = 32 x 16 x 20 per core (not per MPI process)
• Native mode on the MIC (Xeon Phi)
• Target MPI processes and threads on the Xeon Phi:
  – 1 node of the Xeon Phi with 4-way HT (Hyper-Threading)
  – PXTY: X MPI processes and Y threads per process (X x Y = 240 in all cases)
  – P240T1: pure MPI, 4 HT per core
  – P120T2
  – P60T4
  – P16T15
  – P8T30: the minimum hybrid MPI-OpenMP execution for ppOpen-APPL/FDM, since it needs at least 8 MPI processes.
• Number of iterations for the kernels: 100

Page 28

AT Effect (update_stress, Xeon Phi)

Execution time [seconds] (KMP_AFFINITY=balanced, -align array64byte):

Config    Without AT   With AT   Speedup   Best SW (candidate)
P240T1    2.11         1.29      1.63      #6
P120T2    2.32         1.70      1.36      #5
P60T4     2.33         1.74      1.34      #5
P16T15    2.96         1.91      1.55      #5
P8T30     3.14         1.97      1.59      #6

Speedup = (time without AT) / (time with AT). "Best SW" is the auto-generated candidate (numbering as on Page 20) selected by the AT; in every configuration it is one of the new kernels rather than the baseline.

Page 29

Conclusion

Loop fusion to obtain high parallelism is one of the key techniques for current multi-core and many-core architectures.
◦ Execution with 240 threads per MPI process on the Xeon Phi.
◦ Strong scaling with more than 10,000 cores on the FX10.

To perform AT on supercomputers in operation, minimizing the required "software stack" is a practical way to establish AT.

Page 30

ppOpen-AT is free software!

ppOpen-AT version 0.2 is available!

The license is MIT.

Please access the following page:

http://ppopenhpc.cc.u-tokyo.ac.jp/


Page 31


Thank you for your attention!

Questions?