
An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code

Takashi Shimokawabe∗, Takayuki Aoki∗‡, Chiashi Muroi†, Junichi Ishida†, Kohei Kawano†, Toshio Endo∗‡, Akira Nukada∗‡, Naoya Maruyama∗‡, and Satoshi Matsuoka∗‡§

∗ Tokyo Institute of Technology
† Japan Meteorological Agency
‡ Japan Science and Technology Agency, CREST
§ National Institute of Informatics

[email protected], [email protected]

Abstract—Regional weather forecasting demands fast simulation over fine-grained grids, resulting in extremely memory-bottlenecked computation, a difficult problem on conventional supercomputers. Early work on accelerating the mainstream weather code WRF using GPUs, with their high memory performance, resulted in only minor speedups due to partial GPU porting of the huge code. Our full CUDA port of the high-resolution weather prediction model ASUCA is, to our knowledge, the first of its kind; ASUCA is a next-generation, production weather code developed by the Japan Meteorological Agency, similar to WRF in the underlying physics (non-hydrostatic model). Benchmarks on the 528-GPU (NVIDIA GT200 Tesla) TSUBAME Supercomputer at the Tokyo Institute of Technology demonstrated an over 80-fold speedup and good weak scaling, achieving 15.0 TFlops in single precision for a 6956 × 6052 × 48 mesh. Further benchmarks on TSUBAME 2.0, which will embody over 4000 NVIDIA Fermi GPUs and be deployed in October 2010, will be presented.

I. INTRODUCTION

Weather forecasting is one of the most important research challenges that remain to be significantly improved for humankind, and thus improving its accuracy with more sophisticated weather models has been an active research topic. Numerical weather models describe fluid dynamics and physical processes involving natural phenomena such as clouds, rain, and snow, and consist of a large number of partial differential equations that demand not only extremely large raw processing power but also fast, efficient memory systems with high bandwidth.

Large-scale weather simulations have long been a challenge in the field of high performance computing. In 2002, a group at the Japan Marine Science and Technology Center (JAMSTEC) achieved a sustained performance of 26.58 TFlops for a spectral atmospheric general circulation model called AFES on the Earth Simulator (ES) [1]. This result was obtained using 640 nodes (5120 cores) of this massively parallel vector supercomputer, and the achievement was awarded the Gordon Bell Prize for excellent computing performance at the SC2002 conference. They ran the code on an ultra-high-resolution mesh; however, AFES is based on a hydrostatic model and is not applicable to high-resolution meshes finer than 10 km. Another notable work was done by Skamarock et al., who achieved 8.8 TFlops for the Weather Research and Forecasting (WRF) model [2], a mesoscale non-hydrostatic model. This work was selected as a finalist for the Gordon Bell Prize at SC2007 [3]. The entire Jaguar Cray XT5 system at the Oak Ridge National Laboratory was used to sustain 50 TFlops for the WRF benchmark test using 148,480 cores in 2009 [4]. Our current results allow us to project that we could exceed 100 TFlops for a mesoscale non-hydrostatic model with our GPU port of ASUCA once TSUBAME 2.0 is commissioned in the fall of 2010, with merely 4000+ GPUs.

Exploiting Graphics Processing Units (GPUs) for general-purpose computing, i.e., GPGPU, has emerged as an effective technique to accelerate many important classes of scientific applications, including computational fluid dynamics (CFD) [5], [6], fast Fourier transforms [7], and astrophysical N-body simulations [8]. Although the GPU was traditionally designed solely for graphics applications, recent progress toward more general-purpose computing has enabled significant speedups of those applications compared to conventional CPUs, owing to the GPU's much higher raw processing power and wider memory bandwidth, available at relatively low cost.

Acceleration of numerical weather prediction with GPUs has been reported in the literature [9], [10]; however, the previous research achieved only limited speedups due to partial GPU porting. Michalakes et al. reported a twenty-fold speedup of the WRF Single Moment 5-tracer (WSM5) microphysics, a computationally intensive physics module of the WRF model, on an NVIDIA GPU [9]. Similarly, Linford et al. reported an eight-fold performance increase using an NVIDIA Tesla C1060 for a computationally expensive chemical kinetics kernel from WRF with Chemistry (WRF-Chem) [10]. These efforts, however, result in only a minor improvement (e.g., 1.3× in [9]) in the overall application time due to Amdahl's law.


Partial GPU porting also imposes on applications the overhead of transferring data between host memory and GPU memory. In the current GPGPU model, a GPU is connected to the other system components, including CPUs and system memory, via PCI Express lanes whose bandwidth is at most 8 GB/s, whereas the GPU memory bandwidth can reach as much as 150 GB/s. While significant speedups with GPUs are possible in many cases, the partial porting of previous research, with frequent data copies through the relatively narrow PCI Express bus, severely limits the overall speedups.

To fully exploit the benefits of GPUs, the whole application should be executed on the GPUs, minimizing interaction with the host CPU and memory. We are currently working on a full GPU implementation of ASUCA [11], a next-generation, high-resolution mesoscale atmospheric model being developed by the Japan Meteorological Agency (JMA). As a first step, we have successfully implemented its dynamical core and a portion of the physics processes as a full GPU application, an important step toward establishing a complete framework for a fully GPU-based ASUCA. The GPU code is written from scratch in NVIDIA CUDA (Compute Unified Device Architecture) [12], using the original Fortran code as a reference. The numerical results obtained from the GPU code agree with those from the CPU code within the margin of machine round-off error.

This paper reports our GPU implementation strategy and the performance results of ASUCA's dynamical core together with the physics processes, a combination that has not yet been accelerated in WRF. We demonstrate the performance of both single- and multi-GPU computation using the NVIDIA Tesla S1070 GPUs of the TSUBAME supercomputer at the Tokyo Institute of Technology. A single GPU achieves 44.3 GFlops in single precision and 14.6 GFlops in double precision. Our multi-GPU version, which combines GPUs distributed over InfiniBand-connected nodes with MPI, achieves 15.0 TFlops in single precision using 528 GPUs.

II. NON-HYDROSTATIC MODEL ASUCA

This section describes an outline and the formulation of ASUCA, a next-generation, high-resolution mesoscale atmospheric model developed by the JMA. ASUCA succeeds the Japan Meteorological Agency Non-Hydrostatic Model (JMA-NHM) [13] as the operational non-hydrostatic regional model at the JMA. The model introduces non-hydrostatic equations that conserve mass, along with several highly efficient numerical methods in fluid dynamics that are increasingly popular in numerical weather prediction models.

ASUCA uses generalized coordinates $(x^1, x^2, x^3)$. Employing the Einstein summation convention, its flux-form non-hydrostatic balanced equations for the dynamical core are written as follows:

\[
\frac{\partial}{\partial t}\left(\frac{\rho u_i}{J}\right)
+ \frac{\partial}{\partial x^j}\left(\frac{\rho u_i u^j}{J}\right)
+ \frac{\partial}{\partial x^n}\left(\frac{1}{J}\frac{\partial x^n}{\partial x_i}\,p\right)
- \frac{\rho g_i}{J} = \frac{F_i}{J}, \qquad (1)
\]

\[
\frac{\partial}{\partial t}\left(\frac{\rho}{J}\right)
+ \frac{\partial}{\partial x^i}\left(\frac{\rho u^i}{J}\right) = \frac{F_\rho}{J}, \qquad (2)
\]

\[
\frac{\partial}{\partial t}\left(\frac{\rho\theta_m}{J}\right)
+ \frac{\partial}{\partial x^i}\left(\frac{\rho\theta_m u^i}{J}\right) = \frac{F_{\rho\theta_m}}{J}, \qquad (3)
\]

\[
\frac{\partial}{\partial t}\left(\frac{\rho q_\alpha}{J}\right)
+ \frac{\partial}{\partial x^i}\left(\frac{\rho q_\alpha u^i_\alpha}{J}\right) = \frac{F_{\rho q_\alpha}}{J}
\qquad (\alpha = v, c, r, i, s, g, h), \qquad (4)
\]

\[
p = R_d\,\pi\,(\rho\theta_m), \qquad (5)
\]

where $u_i$ ($i = 1, 2, 3$) and $u^i$ ($i = 1, 2, 3$) represent the velocity components in Cartesian coordinates and generalized coordinates, respectively. Here, $J$ is the Jacobian of the coordinate transformation, $\pi$ is the Exner function, $\rho$ is the total mass density, and $q_\alpha$ is the ratio of the density of water substance $\alpha$ to the total mass density. ASUCA can simulate water vapor, cloud water, rain, cloud ice, snow, graupel, and hail, represented by $q_v$, $q_c$, $q_r$, $q_i$, $q_s$, $q_g$, and $q_h$, respectively. The velocity $u^i_\alpha$ in the equation for water substances denotes $u^i + u^i_{t\alpha}$, where $u^i_{t\alpha}$ is the terminal fall velocity of water substance $\alpha$. Let $\theta_m$ be $\theta(\rho_d/\rho + \epsilon\,\rho_v/\rho)$, where $\theta$ is the potential temperature, $\rho_v = q_v\rho$, $\rho_d = \rho(1 - q_v - q_c - q_r - q_i - q_s - q_g - q_h)$, and $\epsilon$ is the ratio of the gas constant for water vapor $R_v$ to that for dry air $R_d$. The notation $F_i$ contains the Coriolis force, diffusion, diabatic effects and the turbulence process, and $F_i$, $F_\rho$ and $F_{\rho\theta_m}$ involve terms arising from the density change due to precipitation. $F_{\rho q_\alpha}$ represents interactions between water substances calculated in the cloud microphysics processes.

In ASUCA, the Lorenz coordinate is used on the Arakawa C grid. The equations are discretized using the finite volume method (FVM). The flux limiter function proposed by Koren [14] is employed to maintain monotonicity and avoid numerical oscillations. Because the vertical grid spacing is much smaller than the horizontal one, the sound speed in the vertical direction would otherwise determine the time step. To avoid this, ASUCA introduces the horizontally explicit and vertically implicit (HE-VI) scheme with a time-splitting method [15]. In this method, a time integration step consists of several short time steps and a long time step. The short time steps, which employ the second-order Runge-Kutta scheme, are used for the horizontal propagation of sound waves and gravity waves, with implicit treatment of the vertical propagation. For the long time steps, the third-order Runge-Kutta method proposed by Wicker et al. [16] is adopted for time integration. In a long time step, the advection of momentum, density, potential temperature and water substances, the Coriolis force, diffusion and other effects from the physics processes are calculated. ASUCA currently employs a Kessler-type warm-rain scheme for cloud-microphysics parameterization, which is also used in the JMA-NHM [17]. This scheme can simulate water vapor, cloud water, and rain drops.
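As a structural sketch of this nesting (plain C with stub routines written only for illustration; the routine names, the stage lengths and the sub-step counts are simplifications and are not ASUCA's actual code), the long/short time-step loop can be pictured as follows:

    /* Structural sketch of the HE-VI time splitting: an outer third-order
     * Runge-Kutta long step whose stages each advance the sound- and
     * gravity-wave terms with several short, horizontally explicit /
     * vertically implicit sub-steps.  All routine names are placeholders. */

    void slow_tendencies(double dt)    { (void)dt;   /* advection, Coriolis, physics, ... */ }
    void acoustic_substep(double dtau) { (void)dtau; /* HE-VI short step (RK2 + vertical implicit solve) */ }

    void long_time_step(double dt, int num_short_steps)
    {
        /* Wicker-Skamarock-style RK3: stages over dt/3, dt/2 and dt. */
        const double stage_dt[3] = { dt / 3.0, dt / 2.0, dt };
        for (int stage = 0; stage < 3; ++stage) {
            slow_tendencies(stage_dt[stage]);             /* long-time-step terms    */
            int ns = (int)(num_short_steps * stage_dt[stage] / dt);
            if (ns < 1) ns = 1;
            for (int s = 0; s < ns; ++s)                  /* sound and gravity waves */
                acoustic_substep(stage_dt[stage] / ns);
        }
    }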

III. NVIDIA CUDA AND THE TSUBAME SUPERCOMPUTER

In this section, we introduce the CUDA GPU and its programming environment. The TSUBAME 1.2 supercomputer [18] at the Tokyo Institute of Technology, which is used for our preliminary evaluation, is also described.

The NVIDIA Tesla S1070 used in our research physically comprises four GPUs, each of which has 4 GB of GDDR3 SDRAM device memory and 240 streaming processors (SPs) running at 1.44 GHz and supporting more than a thousand concurrent threads. The theoretical peak performance of one GPU is 691.2 GFlops in single precision and 86.4 GFlops in double precision. Eight SPs are bundled into a streaming multiprocessor (SM) as a SIMD (single instruction, multiple data stream) unit; thus each GPU contains 30 SMs. The device memory (also called global memory in CUDA), shared by all the SMs in the GPU, provides 102 GB/s of peak bandwidth per Tesla S1070 GPU. In spite of this excellent memory bandwidth, each access to the device memory takes 400 to 600 clock cycles. To hide this latency and exploit locality, another memory layer, called shared memory, which works as a software-managed cache, is provided. A 16 KB shared memory is included in each SM and shared among its eight SPs, with an access time of about two cycles.

Application programs on NVIDIA GPUs are written in theCUDA language, which is a parallel extension of C and C++language that allows to distribute thousands of threads acrossa large number of SPs. CUDA programs consist of CPU codesand GPU codes, the latter of which is called kernel functions.A kernel function runs in parallel by a large number of threads,which are composed hierarchically. The threads are bundledinto a thread block, and all the thread blocks compose a grid.When a kernel is launched from the CPU code, the programerspecifies the number of threads per block and number of blocksper grid. All the threads in a grid are able to access thedevice memory on the GPU. On the other hand, the fast sharedmemory is visible only to threads included by a correspondingthread block. Thus threads in a single block can cooperate witheach other efficiently through the shared memory assigned tothe block as scratchpad memory. Additionally, CUDA supportsfast barrier synchronization in a thread block.
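As a minimal, self-contained illustration of this execution model (a toy kernel of our own, not part of ASUCA), the following code launches a grid of thread blocks, stages data in shared memory, and uses the in-block barrier:

    #include <cuda_runtime.h>

    // Toy kernel: each block stages a tile of the input (plus a one-cell halo)
    // in shared memory, synchronizes, and writes a smoothed value back out.
    __global__ void smooth(const float *in, float *out, int n) {
        __shared__ float tile[256 + 2];              // software-managed cache per block
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int lid = threadIdx.x + 1;                   // +1 leaves room for the left halo cell
        tile[lid] = in[gid];
        if (threadIdx.x == 0)                        // load halo cells at the tile edges
            tile[0] = (gid > 0) ? in[gid - 1] : in[gid];
        if (threadIdx.x == blockDim.x - 1)
            tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : in[gid];
        __syncthreads();                             // fast barrier within the block
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float));
        // Launch configuration: blocks per grid and threads per block.
        smooth<<<n / 256, 256>>>(in, out, n);
        cudaDeviceSynchronize();   // cudaThreadSynchronize in the CUDA 2.x toolkits of this era
        cudaFree(in);
        cudaFree(out);
        return 0;
    }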

The TSUBAME 1.2 supercomputer at the Tokyo Institute of Technology, one of the largest supercomputers that harness modern acceleration technology, is equipped with 170 NVIDIA Tesla S1070 boxes (680 GPUs). The main component of the system is Sun Fire X4600 computing nodes, each of which has eight dual-core 2.4 GHz AMD Opteron processors and 32 GB of main memory. In this system, more than 300 nodes have GPUs; each such node is connected to two S1070 GPU devices via PCI Express Gen 1.0 ×8. Nodes are connected to each other via dual-rail SDR InfiniBand links (2 GB/s). In order to perform multi-GPU computing on TSUBAME, programmers use two types of communication libraries: (1) an MPI library for inter-node communication, and (2) the CUDA runtime library for communication between CPUs and GPUs. In our preliminary evaluation, we use the Voltaire MPI library and CUDA version 2.3.

As described later, the Tokyo Institute of Technology is going to introduce a new system, named TSUBAME 2.0, which will embody over 4000 GPUs. We will show the performance of the multi-GPU execution of ASUCA with thousands of Fermi GPUs at the SC2010 conference.

[Fig. 1: Computational components in both the long and the short time steps on a GPU. Kernels for the short time steps are launched several times within a single long time step.]

IV. SINGLE GPU IMPLEMENTATION

A. Implementation

Before describing our final implementation of ASUCA, we describe a single GPU version for simplicity. Figure 1 illustrates the execution flow.

When the execution starts, the CPU reads initial data from input files into main memory and then transfers them to the global memory on the GPU. After that, the GPU performs all the computations. Finally, when forecast data are output, only the minimum necessary data are transferred from the GPU to main memory on the CPU and then written to files. As already described, the computational part iterates over long time steps, each of which includes several short time steps. The time steps consist of several computational components, which we have coded as separate GPU kernels. They include, for example, advection, the Coriolis force, the horizontal pressure gradient force, the equation of continuity, a 1D Helmholtz-like elliptic equation, the equation of state, physical processes and Jacobian operations, as well as array copies.

Generally, in order to improve performance on GPUs, we should follow three strategies. First, kernels should be invoked with many more threads than the number of SPs in a GPU, so that the cost of accessing the device memory is hidden by multithreading. Second, since almost all memory operations in the kernels act on 3-D arrays of variables, the layout of these arrays in device memory should be chosen carefully to enable coalesced memory access. Third, the components should make use of the shared memory as a software-managed cache to reduce accesses to the global memory.

In this section, we focus on (1) the array ordering optimized for GPU device memory and on the implementation of two components as examples: (2) the advection component and (3) the 1D Helmholtz-like elliptic equation component.

1) Order of arrays: For simplicity, we hereafter use x, y and z for x^1, x^2 and x^3, respectively. In the original CPU code of ASUCA, the 3-D arrays of variables are stored sequentially in the order z, x, y (kij-ordering). This is advantageous for the cache hit rate when the calculation proceeds along the z axis. In our implementation, by contrast, the arrays are allocated in the order x, z, y in the GPU device memory in order to perform parallel computation more efficiently. In the HE-VI scheme adopted in ASUCA, a 1D Helmholtz-like elliptic equation must be solved in the vertical direction. It has to be computed sequentially in the z direction, while it can be computed in parallel within an xy slice. Thus the kij-ordering, which works well on CPUs, should be avoided on GPUs in order to enable coalesced access to the global memory. Another factor that determines the ordering is the domain decomposition method used in the multi-GPU version with MPI, described in the next section. Considering the 2D domain decomposition that we adopt, we place the variable arrays in the order x, z, y so that halo data in the y direction can be transferred efficiently.
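As a concrete illustration of this x, z, y ordering (the helper name and signature below are ours, not ASUCA's):

    // Hypothetical index helper for a 3-D field stored in x-z-y order:
    // x is the fastest-varying (contiguous) dimension, so threads with
    // consecutive threadIdx.x touch consecutive addresses (coalesced access),
    // and the xz slab for a fixed j occupies one contiguous block of memory,
    // which also makes y-direction halo transfers a single contiguous copy.
    __host__ __device__ inline int idx_xzy(int i, int k, int j, int nx, int nz) {
        return i + nx * (k + nz * j);
    }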

2) Advection: In the advection component, to calculate advection for a given grid size (nx, ny, nz), the kernel functions are configured for execution with (nx/64, nz/4, 1) blocks, each with (64, 4, 1) threads. Each thread is assigned an (x, z) point and performs calculations from j = 0 to j = ny − 1, marching in the y direction. Also, in order to facilitate the implementation of the kernel functions for domain decomposition with MPI, the z direction in physical space is mapped to the y direction of the CUDA thread space (Figure 2 (a)). Hereafter, our discussion is based on the physical space.

Computing advection for a single point requires data from neighboring points; in ASUCA, a four-point stencil is required in each direction. To carry out the calculations for the j-th slice, each element in the slice is needed by several threads, so using shared memory for data sharing is effective [19]. On the other hand, neighboring elements aligned in the y direction are used only by a single thread; for example, the (i, j − 1, k) point is used for the computation of (i, j, k) but not for (i + 1, j, k). Thus the data for preceding or succeeding planes are not shared, and using registers is sufficient (Figure 3). To sum up, when a thread block computes part of the j-th slice, it holds an array of (64 + 3) × (4 + 3) × 1 elements in its shared memory, including both the 2D sub-domain data (red points in the figure) and the halos (green points). Elements along the y axis, on the other hand, are stored in the registers of the corresponding thread. When the computation proceeds to the next, (j + 1)-th slice, the data in the registers are reused [19].
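The following sketch illustrates the shared-memory-plus-registers marching pattern described above. To keep it short, the stencil is a simple ±1 average with a one-cell halo rather than ASUCA's four-point, flux-limited stencil, the kernel and variable names are ours, and nx and nz are assumed to be multiples of the block dimensions; it is launched with dim3 block(BX, BZ) and dim3 grid(nx/BX, nz/BZ):

    #define BX 64
    #define BZ 4

    // Simplified marching-in-y stencil kernel (a stand-in for the real advection
    // kernel): shared memory caches the current xz tile plus a one-cell halo,
    // while the y-direction neighbours are kept in per-thread registers and
    // reused as the thread marches from j = 1 to j = ny - 2.  Arrays are in
    // x-z-y order, i.e. index = i + nx * (k + nz * j).
    __global__ void stencil_march_y(const float *in, float *out,
                                    int nx, int nz, int ny) {
        __shared__ float tile[BZ + 2][BX + 2];
        int i = blockIdx.x * BX + threadIdx.x;   // global x index
        int k = blockIdx.y * BZ + threadIdx.y;   // global z index
        int li = threadIdx.x + 1, lk = threadIdx.y + 1;

        // Registers holding this thread's y-neighbours.
        float ym = in[i + nx * (k + nz * 0)];    // value at j - 1
        float yc = in[i + nx * (k + nz * 1)];    // value at j

        for (int j = 1; j < ny - 1; ++j) {
            float yp = in[i + nx * (k + nz * (j + 1))];   // value at j + 1

            // Stage the current xz slice (with a clamped one-cell halo) in shared memory.
            tile[lk][li] = yc;
            if (threadIdx.x == 0)
                tile[lk][0]      = in[(i > 0 ? i - 1 : i)        + nx * (k + nz * j)];
            if (threadIdx.x == BX - 1)
                tile[lk][BX + 1] = in[(i + 1 < nx ? i + 1 : i)   + nx * (k + nz * j)];
            if (threadIdx.y == 0)
                tile[0][li]      = in[i + nx * ((k > 0 ? k - 1 : k)      + nz * j)];
            if (threadIdx.y == BZ - 1)
                tile[BZ + 1][li] = in[i + nx * ((k + 1 < nz ? k + 1 : k) + nz * j)];
            __syncthreads();

            // Placeholder arithmetic standing in for the flux-limited advection update.
            out[i + nx * (k + nz * j)] =
                (tile[lk][li - 1] + tile[lk][li + 1] +
                 tile[lk - 1][li] + tile[lk + 1][li] + ym + yp) / 6.0f;

            ym = yc;          // shift the register pipeline and advance to j + 1
            yc = yp;
            __syncthreads();  // keep any thread from overwriting the tile early
        }
    }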

[Fig. 2: Layouts of threads and blocks for the advection (a) and the 1D Helmholtz-like elliptic equation (b).]

[Fig. 3: The (64 + 3) × (4 + 3) elements held in shared memory and the 3 elements stored in registers along the y axis in the advection kernel.]

3) 1D Helmholtz-like elliptic equation: Unlike the advection component, the 1D Helmholtz-like elliptic equation is solved in the vertical (z) direction because of the characteristics of the HE-VI scheme adopted in ASUCA. Discretizing the equation yields a tridiagonal matrix. The basic algorithm of the solver for this matrix is similar to that of the advection component. However, since the elements must be computed sequentially in the z direction, the threads march along the z axis, unlike in the advection component. In our implementation, for a given grid size (nx, ny, nz), we configure the kernel invocation to run with (nx/64, ny/4, 1) blocks, each of which has (64, 4, 1) threads (Figure 2 (b)).
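A per-column tridiagonal solve of this kind might look as follows (a standard Thomas algorithm written by us for illustration, with placeholder array names; we assume one thread owns one (i, j) column, x-z-y array ordering, and an nz no larger than a small compile-time bound, which is not ASUCA's actual kernel):

    #define TBX 64
    #define TBY 4
    #define NZMAX 64   // assumed upper bound on nz for the per-thread scratch arrays

    // Hypothetical per-column tridiagonal (Thomas algorithm) solver: one thread
    // handles one vertical (i, j) column, sweeping k = 0 .. nz-1 and back.
    // a, b, c are the sub/main/super-diagonals, d the right-hand side, x the
    // solution; all fields are stored in x-z-y order so that the threads of a
    // warp read consecutive addresses at every k level (coalesced access).
    // Launch with dim3 block(TBX, TBY), dim3 grid((nx+TBX-1)/TBX, (ny+TBY-1)/TBY).
    __global__ void tridiag_z(const float *a, const float *b, const float *c,
                              const float *d, float *x, int nx, int nz, int ny) {
        int i = blockIdx.x * TBX + threadIdx.x;       // global x index
        int j = blockIdx.y * TBY + threadIdx.y;       // global y index
        if (i >= nx || j >= ny || nz > NZMAX) return;

        float cp[NZMAX], dp[NZMAX];                   // per-thread scratch
        int base = i + nx * nz * j;                   // offset of (i, k = 0, j)

        // Forward elimination.
        cp[0] = c[base] / b[base];
        dp[0] = d[base] / b[base];
        for (int k = 1; k < nz; ++k) {
            int idx = base + nx * k;
            float m = 1.0f / (b[idx] - a[idx] * cp[k - 1]);
            cp[k] = c[idx] * m;
            dp[k] = (d[idx] - a[idx] * dp[k - 1]) * m;
        }
        // Back substitution.
        x[base + nx * (nz - 1)] = dp[nz - 1];
        for (int k = nz - 2; k >= 0; --k)
            x[base + nx * k] = dp[k] - cp[k] * x[base + nx * (k + 1)];
    }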

B. Performance

Figure 4 shows the performance of ASUCA on a single GPU of a TSUBAME 1.2 node for a mountain wave test [20], in both single- and double-precision floating-point calculation, for eight different grid sizes. In this simulation, an ideal mountain is placed at the center of the computational domain. As the initial condition, a 10.0 m/s wind blows in the x direction, and normal pressure, temperature, density and amounts of water substances are given. The time integration step is 5.0 sec. All kernels used for the computation of real data, including the physics processes, are executed, except the kernels for the treatment of boundaries. Although the boundaries are updated using real forecast data in a real weather forecast, periodic boundary conditions are adopted in this mountain wave test.

[Fig. 4: Performance of ASUCA on a GPU (NVIDIA Tesla S1070) and a CPU (AMD Opteron), plotted as performance in GFlops versus grid size nx × ny × nz. The solid blue and red points indicate the performance of the GPU version in single and double precision, respectively. The magenta outline points show the performance of the original Fortran code running on a CPU core, compiled with the PGI Fortran Compiler.]

The figure also shows the CPU performance of the original ASUCA written in Fortran; it is measured on a 2.4 GHz Opteron core of a TSUBAME node, with all calculations done in double precision. In all cases, nx is set to 320 and nz to 48, and ny is varied from 32 to 256. The amount of memory on the Tesla S1070 (4 GB) limits the grid size to no more than 320 × 256 × 48 in single precision on the GPU. For the same reason, in the case of double-precision computation on the GPU, the maximum grid size is 320 × 128 × 48. In order to measure the GFlops performance on the GPU, we count the number of floating-point operations of ASUCA running on a CPU with the hardware performance counters provided by the Performance API (PAPI) [21]; the measured code is a C/C++ implementation corresponding to the GPU code. Using the obtained operation counts and the GPU computation time, the performance on the GPU is evaluated.

The performance of our ASUCA implementation is 44.3 GFlops in single precision for a 320 × 256 × 48 mesh on a single GPU. The graph also shows the double-precision performance of 14.6 GFlops, a 26.3-fold speedup over a CPU core. When we compare the GPU performance in single precision with the CPU performance in double precision, the former is 83.4 times faster. We note that this comparison is not strictly fair because of the difference in floating-point precision. However, since single precision is often sufficient in weather forecasting, we emphasize that our performance results have significant practical implications that have not been observed before. Such high performance is achieved owing to careful implementation strategies that take the characteristics of the GPU architecture into account, including the array ordering, the use of shared memory and the ordering of computation.

[Fig. 5: Relationship between arithmetic intensity (Flop/Byte ratio) and performance (GFlops) for key kernels in ASUCA on the Tesla S1070: (1) coordinate transformation for density, (2) pressure gradient force in the x direction, (3) advection, (4) Helmholtz-like equation, and (5) warm rain. The curved line represents the estimated performance for a given arithmetic intensity; the single-precision peak is 691.2 GFlops.]

The performance in double precision is about 30% of that in single precision; we consider the reasons for this difference as follows. If the number of hardware floating-point units perfectly determined the performance, the double-precision performance would be 12.5% of the single-precision one, since each SM has one double-precision unit and eight single-precision units. On the other hand, if the bandwidth of the device memory were the only factor, the ratio would be 50%, due to the difference in element sizes. From the above discussion, we consider that both of these factors affect the performance.

For a deeper analysis, we examine individual kernels, since the kernels have largely different characteristics. Figure 5 shows the relationship between the arithmetic intensity and the performance of several kernels in single precision. The curved line in the graph gives the estimated performance that the Tesla S1070 can sustain for a given arithmetic intensity, expressed as

\[
\text{Performance} = \frac{\mathit{FLOP}}{\mathit{FLOP}/F_{\mathrm{peak}} + \mathit{Byte}/B_{\mathrm{peak}} + \alpha} \qquad (6)
\]

where FLOP is the number of floating-point operations in the application, Byte is the amount of memory access in bytes, F_peak is the peak floating-point performance (691.2 GFlops on the Tesla S1070), B_peak is the peak memory bandwidth (102.4 GB/s), and α represents the time taken by operations other than floating-point and memory-access operations, which is set to zero in this graph for simplicity.
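As a hypothetical worked example of Eq. (6), consider a kernel that performs, per element, one floating-point operation and twelve bytes of memory traffic (two 4-byte reads and one 4-byte write, as reported below for kernel (1)), with α = 0:

\[
\text{Performance} \approx
\frac{1\ \text{Flop}}{\dfrac{1\ \text{Flop}}{691.2\ \text{GFlops}} + \dfrac{12\ \text{B}}{102.4\ \text{GB/s}}}
\approx \frac{1\ \text{Flop}}{0.00145\ \text{ns} + 0.117\ \text{ns}}
\approx 8.4\ \text{GFlops}.
\]

Such a kernel has an arithmetic intensity of 1/12 ≈ 0.08 Flop/Byte and therefore sits firmly in the bandwidth-limited part of the curve in Figure 5.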

For five key kernels used in ASUCA, points are plotted on the graph. Kernel (1) in Figure 5 is the coordinate transformation calculation for density, and kernels (2) to (4) are components of the dynamical core of ASUCA. Kernel (5) is part of the physics processes and is computationally intensive. The graph indicates that the performance of kernels (1) to (4) is limited by memory bandwidth and does not reach the computational limit of the Tesla S1070. We observe that the performance of kernel (1) is the lowest among these kernels. This kernel converts the prognostic density ρ/J in generalized coordinates back to the density ρ in Cartesian coordinates by multiplying by the Jacobian J of the coordinate transformation,

\[
\rho = J\left(\frac{\rho}{J}\right), \qquad (7)
\]

and performs two memory reads, one memory write and one floating-point operation per element. Since ASUCA adopts generalized coordinates, this kernel or equivalent kernels are applied to the momentum components, density, potential temperature and water substances several times in one time integration step. Kernel (5), which calculates the warm-rain scheme, elicits the high performance of the GPU because it contains mathematical functions, such as log and exp, with few memory accesses. This kernel, however, is called once per time step and accounts for only 1.0% of the GPU time; therefore it does not contribute much to raising the overall performance of ASUCA. The kernels invoked more frequently are kernels (2) to (4) in the dynamical core, which compute the pressure gradient force in the x direction, the advection of the x component of momentum, and the 1D Helmholtz-like elliptic equation in the vertical direction, respectively. Kernels (2) and (4) are called in the short time step. Kernels for advection, including kernel (3), are applied to the momentum components, density, potential temperature and water substances. Because these kernels are called many times, their performance characterizes the performance of the whole application.

V. MULTI-GPU COMPUTING

This section describes our domain decomposition and parallelization strategies for exploiting multiple GPUs across cluster nodes. Using multiple GPUs is necessary when simulating grids larger than those that fit on a single GPU. For example, a single GPU of a Tesla S1070 has 4 GB of memory, which can only hold a grid of up to 320 × 256 × 48.

We decompose the given grid in both the x and y directions (2D decomposition) and allocate each sub-domain to a single GPU. Since the z dimension is relatively small in our simulation targets (approximately one hundred at most), each GPU is responsible for all the elements in the z direction. Because GPUs cannot directly access data stored in the global memory of other GPUs, the host CPUs are used as bridges to exchange boundary data between neighboring GPUs. As illustrated in Figure 6, after each GPU transfers its boundary data to a buffer allocated on the associated CPU, the MPI library is used to transfer the boundary data between the buffers on the CPUs.
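A minimal sketch of how such a 2D process grid and its neighbor ranks might be set up with MPI (hypothetical function and variable names; ASUCA's actual setup may differ):

    #include <mpi.h>

    // Hypothetical setup of a Px x Py process grid; each rank owns one GPU and
    // one sub-domain, and obtains the ranks of its x- and y-neighbours for the
    // halo exchanges described in the text.
    void setup_decomposition(int px, int py, MPI_Comm *cart,
                             int *left, int *right, int *down, int *up) {
        int dims[2]    = { px, py };
        int periods[2] = { 0, 0 };       // non-periodic decomposition
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, cart);
        MPI_Cart_shift(*cart, 0, 1, left, right);   // neighbours in x
        MPI_Cart_shift(*cart, 1, 1, down, up);      // neighbours in y
    }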

To achieve better scalability, we optimize the boundary exchange by overlapping its communication with computation. We discuss its performance effects on the TSUBAME supercomputer and show that the GPU version of ASUCA achieves up to 15 TFlops using 528 GPUs.

A. Overlapping computation and communication

To hide the communication costs and achieve good scalability with a large number of GPUs, we employ three optimization techniques in the multi-GPU ASUCA implementation. All of them share the same principle: independent data sets provide opportunities for overlapping their communication and computation by streamlining them in a pipelined fashion.

[Fig. 6: Multi-GPU communication: each GPU copies its boundary data to a buffer on its host CPU via CUDA APIs, and the buffers are exchanged between CPUs with MPI.]

[Fig. 7: Overlapping method exploiting inter-variable independence: the data exchange of one variable proceeds in parallel with the computation of the next.]

The first optimization exploits inter-variable independence in the weather model. Specifically, when one variable in the weather model can be computed independently of another, its communication can be overlapped with the computation of the other variable. In ASUCA, the advection of 13 variables related to water substances can be computed independently; we apply this overlapping method to these 13 variables (Figure 7).

The second optimization relies on data independence within a single variable. Since each grid element of a variable can be computed independently of the others within one time step, the boundary region where data exchange is required can be separated from the rest of the domain. By dividing a single kernel into three kernels, one for the inner domain, another for the x boundaries, and the other for the y boundaries, we can overlap the computation of the inner domain with the communication of the boundary regions. Note that since the performance advantage of the GPU is often maximized when the number of threads is large, dividing the computational domain and thereby reducing the number of concurrent threads tends to degrade the performance. However, we observe that the benefit of overlapping often outweighs this negative impact, resulting in increased overall performance.

In ASUCA, this method is applied to the computation and data transfer of the three components of momentum and the potential temperature in the short time step. Figure 8 illustrates the flow of this second method. First, (1) the y-boundary values are computed in a CUDA stream named stream1. Next, (2) the x-boundary values are computed in another stream named stream2, while simultaneously (5) the y-boundary values are exchanged in stream1; this sequence consists of asynchronous memory copies from the GPUs to the CPUs executed by CUDA memory operations, data exchanges between CPUs with MPI, and asynchronous memory copies from the CPUs back to the GPUs. In stream2, when (2) the computation of the x-boundary values is completed, (3) the x-boundary values are copied to a contiguous buffer allocated on the GPU in order to prepare an efficient data transfer to the host CPU memory. These copies are executed by kernels rather than CUDA memory operations. Note that this preparation is not necessary for the y boundaries because they are already located in contiguous regions of the GPU memory. Once the preparation of the x boundaries is done, (4) a kernel for the inner domain starts its execution in a third stream named stream3. Simultaneously, in stream2, (6) the x-boundary values in the contiguous buffer are exchanged between neighboring nodes by first transferring the data to host memory buffers and exchanging them with MPI communication between the host CPUs. The exchanged data on the host memory are then transferred back to the contiguous buffer in GPU memory, and (7) the values in the buffer are copied to the appropriate locations of the x boundaries.
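The principle behind this pipeline can be illustrated with the following stripped-down, self-contained sketch, in which a halo buffer is copied to the host, exchanged with a neighbor rank and copied back in one stream while a stand-in for the inner-domain kernel runs in another. The kernel, buffer and stream names are ours; ASUCA's real exchange involves packing kernels, corner handling and several variables:

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Toy kernel standing in for the inner-domain computation.
    __global__ void inner_compute(float *f, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) f[i] = f[i] * 0.5f + 1.0f;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int right = (rank + 1) % size, left = (rank - 1 + size) % size;

        const int n = 1 << 20, halo = 1024;
        float *d_field, *d_halo, *h_send, *h_recv;
        cudaMalloc(&d_field, n * sizeof(float));
        cudaMalloc(&d_halo, halo * sizeof(float));
        cudaMallocHost(&h_send, halo * sizeof(float));   // pinned, for async copies
        cudaMallocHost(&h_recv, halo * sizeof(float));
        cudaMemset(d_field, 0, n * sizeof(float));
        cudaMemset(d_halo, 0, halo * sizeof(float));

        cudaStream_t s_comm, s_comp;
        cudaStreamCreate(&s_comm);
        cudaStreamCreate(&s_comp);

        // Inner computation runs concurrently with the halo traffic issued below.
        inner_compute<<<n / 256, 256, 0, s_comp>>>(d_field, n);

        cudaMemcpyAsync(h_send, d_halo, halo * sizeof(float),
                        cudaMemcpyDeviceToHost, s_comm);
        cudaStreamSynchronize(s_comm);                   // halo is now on the host
        MPI_Sendrecv(h_send, halo, MPI_FLOAT, right, 0,
                     h_recv, halo, MPI_FLOAT, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpyAsync(d_halo, h_recv, halo * sizeof(float),
                        cudaMemcpyHostToDevice, s_comm);

        cudaStreamSynchronize(s_comm);
        cudaStreamSynchronize(s_comp);                   // both pipelines finished

        cudaFree(d_field); cudaFree(d_halo);
        cudaFreeHost(h_send); cudaFreeHost(h_recv);
        cudaStreamDestroy(s_comm); cudaStreamDestroy(s_comp);
        MPI_Finalize();
        return 0;
    }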

In the momentum kernels of ASUCA, computing the value at position (i, j, k) requires values at positions (i ± 1, j ± 1, k ± 1). Thus each sub-domain must also obtain the values at the nearest corners of its diagonal neighbor domains. This could be implemented in a straightforward way by first exchanging the y boundaries between GPUs and then exchanging the x boundaries, together with the neighboring y-boundary values on the x-boundary faces. In our implementation, in order to partially overlap the exchanges of the x and y boundaries and thereby reduce the total exchange time, we coordinate the exchange of the corner values on the host memory, as depicted in Figure 8. Directly appending the corner values to the x-boundary buffers allows the two boundary exchanges to proceed in parallel.

For the y boundaries, each sub-domain exchanges data with its two neighboring sub-domains, and the two exchanges can be performed independently. In order to efficiently use the available bandwidth between host and GPU memory as well as between neighboring nodes, we first transfer the boundary data for one neighbor from GPU memory to host memory. Next, these data are exchanged with that neighbor using MPI, while at the same time the boundary data for the other neighbor are transferred from GPU memory to host memory, effectively overlapping the two boundary exchanges. Transferring the received data back to the GPU is performed similarly, with as much overlap as possible between the two exchanges. Note that for the x boundaries, since we pack both boundary regions into a single contiguous buffer, we simply transfer the buffer to host memory and exchange the boundaries with the neighboring sub-domains using asynchronous MPI communication. Thanks to this, the data transfers for the x boundaries are less frequent than those for the y boundaries.

Figure 9 shows the breakdown of the computation and communication times of the kernels when 528 GPUs of TSUBAME are used. The communication time consists of the elapsed times of the asynchronous transfers between GPU and CPU and of the MPI communication between CPUs, for both the x and y boundaries.

[Fig. 8: Scheme of the second overlapping method, with the domains projected onto the xy plane. Operations (1) to (7) are carried out: (1), (2), (3), (4) and (7) are executed by kernels, while (5) and (6) are performed by CUDA memory operations and the CPU. After operation (1) is executed, operations (5) and (6) proceed in parallel with operations (2), (3) and (4); operation (7) is executed after this parallel section has finished.]

Figure 9 compares, for each variable, the execution time of the original single kernel used in the non-overlapping computation with that of the three kernels to which this optimization is applied, and shows that the optimization increases the computation time relative to the single-kernel version in all cases, owing to the reduced parallelism within each kernel. However, the communication time, depicted in the top bar of the graph, can be overlapped with the divided kernels, whereas the single-kernel version must perform computation and communication serially. Thus, the total performance is improved by this second optimization.

Similar to the first optimization, the third technique exploits the independence of different variables: it logically fuses kernels for different variables into one, so that computation time that can be overlapped with communication is shared by the fused kernels. Ideally, this allows more of the communication cost to be hidden by computation, especially when one kernel spends much longer in computation than in communication. If such a kernel is fused with a communication-bottlenecked kernel, part of the communication time of the second kernel can be hidden in the computation time of the first.

[Fig. 9: Breakdown of computation time and communication time for the three kernels used in the overlapping method and the single kernel used in the non-overlapping method, measured using 528 GPUs (6956 × 6052 × 48 mesh, 22 × 24 GPUs, single precision). Part of the exchange of the y boundaries between CPUs with MPI is performed in parallel with part of the transfer between a GPU and a CPU.]

Figure 9 indicates that the time spent by the kernels for density is not enough to hide their communication time. Since in ASUCA density and potential temperature can be computed independently, and the latter spends less time in communication than in computation, fusing them into one logical kernel allows us to hide more of the communication time of the density kernel. In our actual implementation, we use separate kernels for density and potential temperature, but we treat them as a single logical kernel and apply the second optimization, i.e., dividing a kernel into three for the x boundaries, the y boundaries and the inner domain, in order to improve the effectiveness of the overlapping. Note that since the kernels other than density have sufficient computation time to overlap their communication, as indicated in Figure 9, this third optimization is used only for the density kernel in our ASUCA implementation.

In summary, we adaptively employ the three optimization methods for more effective overlapping as follows. When the data transfer for one variable can be hidden by the computation of another variable, the first method is adopted. When the first method cannot be used for a kernel because of data dependences with preceding or succeeding kernels, the second method is adopted. When the second method cannot provide sufficient computation time to hide the communication for the variable, the third method is adopted, provided such an independent kernel exists.

B. Performance

To evaluate the performance of our multi-GPU ASUCA code, we used the TSUBAME supercomputer at the Tokyo Institute of Technology. Each node is based on a Sun Fire X4600, which has 16 AMD Opteron cores at 2.4 GHz and 32 GB of main memory, and the nodes are interconnected with dual-rail SDR InfiniBand links, whose peak throughput is 2 GB/s. Two GPUs of an NVIDIA Tesla S1070 (1.44 GHz, 4 GB of device memory) are connected to each node through PCI Express Gen 1.0 ×8.

In order to maximize the performance, the mesh size that each GPU computes is set to 320 × 256 × 48. This is the maximum size that fits in the available memory capacity (i.e., 4 GB) and that minimizes uncoalesced memory accesses.

[Fig. 10: Performance of ASUCA on multiple GPUs and on CPUs of the TSUBAME supercomputer, plotted as performance in TFlops versus the number of GPUs / CPU cores. The solid blue and red points indicate the performance of multi-GPU computation in single precision with the overlapping and non-overlapping methods, respectively. The magenta outline points show the performance of the C/C++ code running on CPUs in double precision.]

TABLE I: Numbers of GPUs and mesh sizes for multi-GPU computing.

  Number of GPUs (Px × Py)    Mesh size (nx × ny × nz)
  6    (2 × 3)                636 × 760 × 48
  20   (4 × 5)                1268 × 1264 × 48
  54   (6 × 9)                1900 × 2272 × 48
  80   (8 × 10)               2532 × 2524 × 48
  120  (10 × 12)              3164 × 3028 × 48
  168  (12 × 14)              3796 × 3532 × 48
  192  (12 × 16)              3796 × 4036 × 48
  252  (14 × 18)              4428 × 4540 × 48
  320  (16 × 20)              5060 × 5044 × 48
  360  (18 × 20)              5692 × 5044 × 48
  396  (18 × 22)              5692 × 5548 × 48
  440  (20 × 22)              6324 × 5548 × 48
  480  (20 × 24)              6324 × 6052 × 48
  528  (22 × 24)              6956 × 6052 × 48

The communication times in Figure 9 indicate that a bandwidth of 438 MB/s is achieved between neighboring nodes with InfiniBand/MPI.

Figure 10 shows the performance of ASUCA running on multiple GPUs. These calculations use two GPUs per node. The numbers of GPUs and the mesh sizes used for the multi-GPU computation are shown in Table I. We measured the performance of ASUCA with both the overlapping method, in which the three optimizations are applied, and the non-overlapping method. Furthermore, to show the performance improvement obtained by using GPUs, we also measured the performance of the C/C++-based ASUCA on the CPUs. As in the single-GPU performance study, the mountain wave test is used as the benchmark. As shown in the graph, the overlapping version achieves 15.0 TFlops in single precision with 528 GPUs.

[Fig. 11: Computation time and communication time (MPI and CPU-GPU transfer) taken in one time step using the non-overlapping and the overlapping computation on 528 GPUs (6956 × 6052 × 48 mesh, 22 × 24 GPUs, single precision). The total time indicates the actual elapsed time, not the sum of the computation, MPI communication and GPU-CPU communication times.]

The overlapping improves the performance by approximately 14%. The weak scaling efficiency is above 93% for 6324 × 6052 × 48 on 528 GPUs with respect to the 6-GPU performance.

Figure 11 shows the breakdown of the computation and communication times taken in one time integration step using the non-overlapping and overlapping methods on 528 GPUs. The communication and computation times of the overlapping method increase compared to the non-overlapping version because of the reduced parallelism within kernels and the decreased absolute performance of asynchronous data transfer. However, the total time of the overlapping version is shorter than that of the other by approximately 11%, since a large part of the communication time is hidden in the computation time. More specifically, the computation, MPI communication, and GPU-CPU communication take 763 ms, 336 ms, and 145 ms, respectively, and the overall time is 988 ms. The difference between the overall time and the computation time is the communication time that was not overlapped with the computation, whereas approximately 460 ms in total is spent on communication. We see that approximately 53% of the communication time is successfully hidden by the computation.

Compared to the fastest reported performance of 50 TFlops, achieved on the Jaguar supercomputer with WRF, we argue that our achievement of 15 TFlops should still be considered an important contribution to weather forecasting, because our performance is achieved with fewer than 600 GPUs on 300 nodes, whereas Jaguar consists of more than 18,000 nodes with nearly 150,000 CPU cores. Furthermore, our ASUCA model is still under active development and therefore does not yet have as many physics processes as WRF. Once more physics processes are incorporated, the effective performance of ASUCA will also increase, because typical physics processes are compute bound and can readily exploit the GPU's processing power.

VI. RESULTS OF SIMULATIONS USING REAL DATA

This section demonstrates that the GPU version of ASUCA can successfully simulate a basic set of real-world weather phenomena, including the full dynamical core and warm rain. Figure 12 shows the horizontal wind velocity (arrows), pressure (contours) and precipitation (color contours) after two, four and six hours of model integration with a 1900 × 2272 × 48 mesh using 54 GPUs on the TSUBAME supercomputer. This simulation was performed in single precision. Note that we used only 54 GPUs in this particular case because of the small size of the real mesh data at hand. In this simulation, the time integration step was 0.5 sec and the mesh resolution was 500 meters in the horizontal directions. The full dynamical core and the warm-rain scheme are treated. The simulation was performed with initial and boundary data for the area around the southern islands of Japan in early October 2009. The JMA mesoscale analysis data (MANAL data) were used as the initial data, and boundary data were prepared for every hour from the forecast data calculated by a global spectral model developed at the JMA. The simulation results indicate that the GPU version of ASUCA is able to simulate this basic set of real weather phenomena; supporting a wider variety of physics processes, such as snow, is a subject of future work.

VII. PERFORMANCE ESTIMATES OF THE GPU ASUCA ON TSUBAME 2.0

We have so far reported preliminary results obtained on the current TSUBAME supercomputer; in October 2010, the TSUBAME 2.0 supercomputer is scheduled to be delivered to the Tokyo Institute of Technology, which will provide a much larger scale of GPU-accelerated computing with more than 4000 NVIDIA Fermi GPUs. Each node will have three GPUs and will be interconnected with high-speed networks such as full-bisection dual-rail QDR InfiniBand, whose peak point-to-point bandwidth is 8 GB/s.

The performance of the GPU version of ASUCA on TSUBAME 2.0 can be estimated at as much as 150 TFlops. More specifically, because of the improvements in the intra- and inter-node interconnects, we expect that each GPU of TSUBAME 2.0 will be able to use more than four times the bandwidth available to each GPU on the current TSUBAME 1.2. Assuming that a Fermi GPU provides roughly the same computational performance and device memory bandwidth as a Tesla S1070 GPU, that the communication time decreases at least by half, and given that on TSUBAME 1.2 approximately half of the communication time was already hidden by the computation, we estimate that the communication will be completely hidden by the computation on TSUBAME 2.0. Therefore, assuming perfect weak scaling over 4000 GPUs, the performance will reach 15 TFlops × 988 ms/763 ms × 4000 GPUs/528 GPUs ≈ 150 TFlops, where 988 ms and 763 ms are the observed total time and computation time with the overlapping computation in Figure 11. Furthermore, future developments of ASUCA will introduce more computationally intensive physics processes in order to offer higher accuracy, which will result in increased Flops thanks to the abundant raw processing power of the GPU. In October 2010, it is highly likely that the performance of ASUCA on TSUBAME 2.0 will exceed 150 TFlops.

[Fig. 12: Horizontal wind velocity (arrows), pressure (contours) and precipitation (color contours) after two (a), four (b) and six (c) hours of model integration with a 1900 × 2272 × 48 mesh using 54 GPUs.]

Compared to the current fastest performance of 50 TFlops using more than 18,000 nodes and 150,000 CPU cores of Jaguar, TSUBAME 2.0 would thus achieve a three-fold speedup with only 4000 GPUs. This significant performance improvement will be further emphasized as more physics processes are implemented in our GPU version of ASUCA. Moreover, our performance estimate is conservative with respect to the actual performance of the NVIDIA Fermi GPUs, so the actual overall performance on TSUBAME 2.0 will likely be higher than 150 TFlops.

VIII. CONCLUSION

This paper presented our ongoing work on the full GPU implementation of the ASUCA weather forecasting code and reported its 80-fold performance improvement over a single CPU core. We have successfully ported the dynamical core and part of the physics processes of the original CPU-only ASUCA. Similar to WRF, our GPU version of ASUCA can be further extended with microphysics modules for more accurate and complex weather modeling. Unlike previous attempts to exploit GPUs in weather forecasting codes such as WRF, our implementation virtually eliminates all host-GPU memory transfers during simulation runs, resulting in much larger performance improvements.

The effective utilization of shared memory on the GPU has resulted in a performance of 44.3 GFlops in single precision, which is 83.4 times faster than the original Fortran code running on a single CPU core. Furthermore, our multi-GPU ASUCA can effectively utilize a large number of distributed GPUs, using optimizations that hide communication overheads by overlapping computation and communication. Our performance studies using a large number of GPUs of the TSUBAME supercomputer have successfully demonstrated the scalability of our implementation, reaching 15 TFlops for a mesh of 6956 × 6052 × 48 with 528 GPUs. We also showed the results of our multi-GPU ASUCA with real observed and forecast data.

We have also discussed the estimated performance of the GPU version of ASUCA on the TSUBAME 2.0 supercomputer, which will be deployed in October 2010 with more than 4,000 NVIDIA Fermi GPUs. With a much larger number of GPUs as well as significantly improved intra- and inter-node bandwidths, we estimate that the performance will reach up to 150 TFlops for this real production weather forecasting code, which, to our knowledge, has not been achieved even with massive systems such as the Earth Simulator and the Jaguar supercomputer. Introducing more physics processes to improve prediction accuracy, in both the original and the GPU-based ASUCA, will be a subject of our future work.

ACKNOWLEDGMENT

This research was supported in part by the Global Center of Excellence Program "Computationism as a Foundation for the Sciences" and KAKENHI, Grant-in-Aid for Scientific Research (B) 19360043, from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan, and in part by the Japan Science and Technology Agency (JST) Core Research of Evolutional Science and Technology (CREST) research program "ULP-HPC: Ultra Low-Power, High-Performance Computing via Modeling and Optimization of Next Generation HPC Technologies".

REFERENCES

[1] S. Shingu, H. Takahara, H. Fuchigami, M. Yamada, Y. Tsuda, W. Ohfuchi, Y. Sasaki, K. Kobayashi, T. Hagiwara, S.-i. Habata, M. Yokokawa, H. Itoh, and K. Otsuka, "A 26.58 Tflops global atmospheric simulation with the spectral transform method on the Earth Simulator," in Supercomputing '02: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing. Los Alamitos, CA, USA: IEEE Computer Society Press, 2002, pp. 1-19.

[2] W. C. Skamarock, J. B. Klemp, J. Dudhia, D. O. Gill, D. M. Barker, M. G. Duda, X.-Y. Huang, W. Wang, and J. G. Powers, "A Description of the Advanced Research WRF Version 3," National Center for Atmospheric Research, 2008.

[3] J. Michalakes, J. Hacker, R. Loft, M. O. McCracken, A. Snavely, N. J. Wright, T. Spelce, B. Gorda, and R. Walkup, "WRF nature run," in SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. New York, NY, USA: ACM, 2007, pp. 1-6.

[4] A. S. Bland, R. A. Kendall, D. B. Kothe, J. H. Rogers, and G. M. Shipman, "Jaguar: The world's most powerful computer," in 2009 CUG Meeting, 2009, pp. 1-7.

[5] J. C. Thibault and I. Senocak, "CUDA implementation of a Navier-Stokes solver on multi-GPU desktop platforms for incompressible flows," in Proceedings of the 47th AIAA Aerospace Sciences Meeting, no. AIAA 2009-758, Jan. 2009.

[6] T. Brandvik and G. Pullan, "Acceleration of a 3D Euler solver using commodity graphics hardware," in 46th AIAA Aerospace Sciences Meeting. American Institute of Aeronautics and Astronautics, Jan. 2008.

[7] A. Nukada and S. Matsuoka, "Auto-tuning 3-D FFT library for CUDA GPUs," in SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. New York, NY, USA: ACM, 2009, pp. 1-10.

[8] T. Hamada, T. Narumi, R. Yokota, K. Yasuoka, K. Nitadori, and M. Taiji, "42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence," in SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. New York, NY, USA: ACM, 2009, pp. 1-12.

[9] J. Michalakes and M. Vachharajani, "GPU acceleration of numerical weather prediction," in IPDPS. IEEE, 2008, pp. 1-7.

[10] J. C. Linford, J. Michalakes, M. Vachharajani, and A. Sandu, "Multi-core acceleration of chemical kinetics for simulation and prediction," in SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. New York, NY, USA: ACM, 2009, pp. 1-11.

[11] J. Ishida, C. Muroi, K. Kawano, and Y. Kitamura, "Development of a new nonhydrostatic model 'ASUCA' at JMA," CAS/JSC WGNE Research Activities in Atmospheric and Oceanic Modelling, 2010.

[12] "CUDA Programming Guide 2.3," http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf, NVIDIA, 2009.

[13] K. Saito, T. Fujita, Y. Yamada, J.-i. Ishida, Y. Kumagai, K. Aranami, S. Ohmori, R. Nagasawa, S. Kumagai, C. Muroi, T. Kato, H. Eito, and Y. Yamazaki, "The operational JMA nonhydrostatic mesoscale model," Monthly Weather Review, vol. 134, pp. 1266-1298, 2006.

[14] B. Koren, "A robust upwind discretization method for advection, diffusion and source terms," CWI Report NM-R9308, 1993.

[15] W. C. Skamarock and J. B. Klemp, "Efficiency and accuracy of the Klemp-Wilhelmson time-splitting technique," Monthly Weather Review, vol. 122, pp. 2623-, 1994.

[16] L. J. Wicker and W. C. Skamarock, "Time-splitting methods for elastic models using forward time schemes," Monthly Weather Review, vol. 130, pp. 2088-2097, 2002.

[17] M. Ikawa and K. Saito, "Description of a non-hydrostatic model developed at the Forecast Research Department of the MRI," Technical Reports of the Meteorological Research Institute, vol. 28, pp. 238-, 1991.

[18] T. Endo, A. Nukada, S. Matsuoka, and N. Maruyama, "Linpack evaluation on a supercomputer with heterogeneous accelerators," in Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS'10). Atlanta, GA, USA: IEEE, Apr. 2010.

[19] P. Micikevicius, "3D finite difference computation on GPUs using CUDA," in GPGPU-2: Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units. New York, NY, USA: ACM, 2009, pp. 79-84.

[20] T. Satomura, T. Iwasaki, K. Saito, C. Muroi, and K. Tsuboki, "Accuracy of terrain following coordinates over isolated mountain: Steep mountain model intercomparison project (st-MIP)," Annuals of the Disaster Prevention Research Institute, Kyoto University, vol. 46 B, pp. 337-346, 2003.

[21] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, "A portable programming interface for performance evaluation on modern processors," Int. J. High Perform. Comput. Appl., vol. 14, no. 3, pp. 189-204, 2000.