High-Performance Quantum Simulation: A Challenge to the Schrödinger Equation on 256^4 Grids
Toshiyuki Imamura (今村俊幸) (1,3). Thanks to Susumu Yamada (2,3), Takuma Kano (2), and Masahiko Machida (2,3)
1. UEC (University of Electro-Communications)
2. CCSE, JAEA (Japan Atomic Energy Agency)
3. CREST, JST (Japan Science and Technology Agency)
Jan. 4-8, 2008, RANMEP2008, NCTS, Taiwan (National Tsing Hua University, Hsinchu, Taiwan)
□ Outline
I. Physics, Review of Quantum Simulation
II. Mathematics, Numerical Algorithm
III. Grand Challenge, Parallel Computing on ES
IV. Numerical Results
V. Conclusion
I. Physics, Review of Quantum Simulation, etc.
□ 1.1 Quantum Simulation (1/2)
[Figure: down-sizing of a junction from width W to W'; layer labels S, I, S]
Crossover from classical to quantum?
Classical Equation of Motion
Schrödinger Equation
□ 1.2 Quantum Simulation (2/2)
Numerical simulation for the coupled Schrödinger equation
• α: coupling
• β: 1/mass ∝ 1/W
• H: Hamiltonian; exact diagonalization of the Hamiltonian is required
• Spectral expansion by the eigenvectors {u_n}
• A possible state is not a value but a vector!
A numerical method is needed to solve the above equation.
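The spectral expansion named on this slide can be written out as below. This is the standard textbook form for a time-independent Hamiltonian, not a transcription of the slide's own formula; the symbols $c_n$, $E_n$, and $\hbar$ are notation assumed here.

```latex
H u_n = E_n u_n, \qquad
\psi(0) = \sum_n c_n u_n, \quad c_n = \langle u_n, \psi(0) \rangle, \qquad
\psi(t) = \sum_n c_n \, e^{-i E_n t / \hbar} \, u_n
```

Once the lowest eigenpairs $(E_n, u_n)$ of the discretized Hamiltonian are available, time evolution reduces to updating phases, which is why the eigenvalue problem is the expensive step.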
II. Mathematics, Numerical Algorithm, etc.
□ 2.1 Krylov Subspace Iteration
Lanczos (traditional method)
• Krylov + GS (Gram-Schmidt): simple, but a shift+invert version is needed
LOBPCG (Locally Optimal Block PCG)
• {Krylov base, Ritz vector, prior vector}: CG approach
• Restart at every iteration
• INVERSE-free -> less communication
[Figure: LOBPCG vs. Lanczos]
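As a point of comparison, the "Krylov + GS" structure of the traditional method can be sketched in a few lines. This is a minimal plain Lanczos recurrence on a small dense matrix, assuming no shift+invert and no reorthogonalization; the function names and the 2x2 test matrix are illustrative and do not come from the talk.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    # Dense product for illustration; a large-scale solver would apply H matrix-free.
    return [dot(row, x) for row in A]

def lanczos(A, v0, m):
    """Run up to m Lanczos steps.

    Returns the diagonal (alphas) and off-diagonal (betas) of the
    tridiagonal matrix T whose eigenvalues approximate those of A.
    """
    norm0 = math.sqrt(dot(v0, v0))
    v = [x / norm0 for x in v0]
    v_prev = [0.0] * len(v0)
    beta = 0.0
    alphas, betas = [], []
    for _ in range(m):
        w = matvec(A, v)  # one matrix-vector product per step
        alpha = dot(w, v)
        # Gram-Schmidt against the two most recent Krylov vectors only.
        w = [wi - alpha * vi - beta * pi for wi, vi, pi in zip(w, v, v_prev)]
        alphas.append(alpha)
        beta = math.sqrt(dot(w, w))
        if beta < 1e-14:  # the Krylov subspace has become invariant
            break
        betas.append(beta)
        v_prev, v = v, [wi / beta for wi in w]
    return alphas, betas

# Hypothetical 2x2 example with eigenvalues 1 and 3.
alphas, betas = lanczos([[2.0, 1.0], [1.0, 2.0]], [1.0, 0.0], 2)
print(alphas, betas)
```

For this example T reproduces the input matrix, so its eigenvalues are exactly those of A. Plain Lanczos converges to extreme eigenvalues first, and clustered low-lying states are the usual reason a shift+invert variant is brought in.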
□ 2.2 LOBPCG
Costly! Since the block is updated at every iteration, an MV (matrix-vector) operation is also required!!
1*MV / every iteration
3*MV / every iteration
Other difficulties in implementation
• Breakdown of linear independence: make our own DSYGV using LDL and deflation (not Cholesky)
• Growth of numerical error in {W,X,P}: detect numerical error and recalculate them automatically
• Choice of the shift
• Portability
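The iteration over the subspace {X, W, P} can be illustrated with a single-vector, unpreconditioned sketch. It is not the authors' implementation: it assumes a block size of one, handles loss of linear independence by Gram-Schmidt deflation rather than by an LDL-based DSYGV replacement, and solves the small projected problem with Jacobi rotations. All function names and the 6x6 test matrix are hypothetical.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    return [dot(row, x) for row in A]

def orthonormalize(vectors, tol=1e-10):
    """Gram-Schmidt with deflation: nearly dependent vectors are dropped."""
    basis = []
    for v in vectors:
        w = list(v)
        for _ in range(2):  # second pass improves orthogonality
            for q in basis:
                c = dot(q, w)
                w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def jacobi_eigh(S, sweeps=30):
    """Eigenvalues and eigenvectors (columns of V) of a small symmetric matrix."""
    n = len(S)
    S = [row[:] for row in S]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        for p in range(n):
            for q in range(p + 1, n):
                if abs(S[p][q]) < 1e-15:
                    continue
                d = S[q][q] - S[p][p]
                sgn = math.copysign(1.0, d)
                theta = 0.5 * math.atan2(2.0 * sgn * S[p][q], abs(d))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # columns p, q of S
                    S[k][p], S[k][q] = c * S[k][p] - s * S[k][q], s * S[k][p] + c * S[k][q]
                for k in range(n):  # rows p, q of S
                    S[p][k], S[q][k] = c * S[p][k] - s * S[q][k], s * S[p][k] + c * S[q][k]
                for k in range(n):  # accumulate eigenvectors
                    V[k][p], V[k][q] = c * V[k][p] - s * V[k][q], s * V[k][p] + c * V[k][q]
    return [S[i][i] for i in range(n)], V

def lobpcg_smallest(A, x0, iters=200, tol=1e-10):
    """Smallest eigenpair of symmetric A by Rayleigh-Ritz on span{x, w, p}."""
    x = orthonormalize([x0])[0]
    p = None
    for _ in range(iters):
        Ax = matvec(A, x)
        lam = dot(x, Ax)
        w = [a - lam * xi for a, xi in zip(Ax, x)]  # residual (no preconditioner)
        if math.sqrt(dot(w, w)) < tol:
            break
        Q = orthonormalize([x, w] + ([p] if p is not None else []))
        AQ = [matvec(A, q) for q in Q]  # one MV per basis vector: the costly part
        S = [[dot(qi, aqj) for aqj in AQ] for qi in Q]
        vals, V = jacobi_eigh(S)
        k = vals.index(min(vals))
        y = [V[i][k] for i in range(len(Q))]
        n = len(x)
        p = [sum(y[i] * Q[i][j] for i in range(1, len(Q))) for j in range(n)]
        x = orthonormalize([[y[0] * Q[0][j] + p[j] for j in range(n)]])[0]
    return dot(x, matvec(A, x)), x

# Hypothetical test problem: 1D Laplacian tridiag(-1, 2, -1) of size 6.
n = 6
A = [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
lam, vec = lobpcg_smallest(A, [1.0] * n)
print(lam)
```

The subspace is rebuilt from scratch in every pass, which mirrors "restart at every iteration", and no linear system is ever solved. The sketch spends a matrix-vector product on every basis vector in every pass; the slide's remark on recalculating {W,X,P} when numerical error grows points at the trade-off of reusing earlier products instead.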
□ 2.3 Preconditioning
$T \approx H^{-1}$
$H = A + B_1 + B_2 + B_3 + B_4 + C_{12} + C_{23} + C_{34}$
Preconditioners compared:
• No preconditioner
• H1 (Point Jacobi): $H \approx A$
• H2 (LDL): $H \approx (A + B_1)$
• H3 (LDL): $H \approx (A + B_1) A^{-1} (A + B_2)$
Here, A is diagonal and A + Bx is block-tridiagonal; shift + LDLt is used.
[Figure: residual error (log scale, 1e-6 to 100) vs. iteration count (0 to 500) for the four cases above]
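To make "shift + LDLt" concrete, here is a minimal sketch of factoring and solving with a shifted symmetric tridiagonal matrix. It assumes scalar rather than block-tridiagonal structure and no pivoting, so it simplifies what the slides describe; the function names and the 3x3 example are hypothetical.

```python
def ldlt_tridiagonal(diag, off, shift=0.0):
    """Factor the symmetric tridiagonal matrix T - shift*I as L D L^T.

    diag: main diagonal of T, off: sub/super-diagonal of T.
    Returns (d, l): the diagonal of D and the subdiagonal of unit-lower L.
    """
    n = len(diag)
    d = [diag[0] - shift] + [0.0] * (n - 1)
    l = [0.0] * (n - 1)
    for i in range(1, n):
        l[i - 1] = off[i - 1] / d[i - 1]
        d[i] = (diag[i] - shift) - l[i - 1] * off[i - 1]
    return d, l

def ldlt_solve(d, l, b):
    """Solve (L D L^T) x = b by forward, diagonal, and backward sweeps."""
    n = len(d)
    y = list(b)
    for i in range(1, n):  # L y = b
        y[i] -= l[i - 1] * y[i - 1]
    x = [y[i] / d[i] for i in range(n)]  # D z = y
    for i in range(n - 2, -1, -1):  # L^T x = z
        x[i] -= l[i] * x[i + 1]
    return x

def point_jacobi(diag, r):
    """Point Jacobi preconditioner: divide the residual by the diagonal."""
    return [ri / di for ri, di in zip(r, diag)]

# Hypothetical 3x3 example: tridiag(-1, 2, -1) with right-hand side (1, 0, 1).
d, l = ldlt_tridiagonal([2.0, 2.0, 2.0], [-1.0, -1.0])
x = ldlt_solve(d, l, [1.0, 0.0, 1.0])
print(x)
```

Applying the preconditioner then costs two sweeps along the tridiagonal direction per vector, with no coupling across the other grid dimensions. The shift is what keeps the pivots in D away from zero when the operator itself is close to singular at the target eigenvalue.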
III. Grand Challenge, Parallel Computing on ES, etc.
□ 3.2 Technical Issues on the Earth Simulator
Programming model: hybrid of distributed parallelism and thread parallelism (3-level parallelism)
• Inter-node: MPI (Message Passing Interface); low latency (6.63 us), very fast (11.63 GB/s)
• Intra-node: auto-parallelization, OpenMP (thread-level parallelism)
• Vector processor (innermost loops): auto-/manual vectorization
[Figure: nodes linked inter-node; within each node, Processor 0 to Processor 7 (intra-node), each doing vector processing]
□ 3.3 Quantum Simulation Parallel Code
Application flow chart
1. Eigenmode calculation: parallel LOBPCG solver developed on ES
2. Time integrator: parallel code on ES
3. Quantum state analyzer: parallel code on ES
4. Visualization: visualized by AVS
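The hand-off from the eigenmode calculation to the time integrator and the state analyzer can be sketched as follows, assuming the integrator works by spectral expansion over precomputed eigenpairs with a time-independent Hamiltonian. The two-level system, the function names, and the choice of units (hbar = 1) are illustrative assumptions, not details taken from the talk.

```python
import cmath
import math

def evolve(eigvals, eigvecs, psi0, t, hbar=1.0):
    """Return psi(t) = sum_n c_n exp(-i E_n t / hbar) u_n with c_n = <u_n, psi0>."""
    psi_t = [0j] * len(psi0)
    for energy, u in zip(eigvals, eigvecs):
        c = sum(complex(ui).conjugate() * pi for ui, pi in zip(u, psi0))
        phase = cmath.exp(-1j * energy * t / hbar)
        for i, ui in enumerate(u):
            psi_t[i] += c * phase * ui
    return psi_t

def expectation(values, psi):
    """<q> = sum_i q_i |psi_i|^2 for an observable that is diagonal in this basis."""
    return sum(q * abs(a) ** 2 for q, a in zip(values, psi))

# Hypothetical two-level Hamiltonian [[0, 1], [1, 0]]: eigenvalues +1 and -1.
s = 1.0 / math.sqrt(2.0)
eigvals = [1.0, -1.0]
eigvecs = [[s, s], [s, -s]]
psi = evolve(eigvals, eigvecs, [1.0, 0.0], math.pi / 2)
q_mean = expectation([0.0, 1.0], psi)
print(psi, q_mean)
```

At t = pi/2 the state has moved entirely from the first basis state to the second, so the mean of the diagonal observable rises from 0 to 1. On the real grid the same pattern applies per grid point, with the expectation sum becoming a global reduction across processes.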
□ 3.4 Handling of Huge Data
Data distribution in case of a 4D array with indices (i, j, k, l)
• (k, l) / NP: 2-dimensional loop decomposition (inter-node)
• j / MP: 1-dimensional loop decomposition (intra-node parallelization)
• i: loop length = 256, vector processing
NP: number of MPI processes
MP: number of microtasking processes (= 8)
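The decomposition above can be turned into index arithmetic. The sketch below assumes a contiguous block distribution of the flattened (k, l) index, one MPI process per 8-processor node, and 8-byte real vector entries; none of these three choices is stated in the slides, and the helper names are hypothetical.

```python
def block_range(n, parts, rank):
    """Half-open range [lo, hi) owned by `rank` when n items are split
    into `parts` contiguous, near-equal blocks."""
    base, extra = divmod(n, parts)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

def owned_kl(n, num_procs, rank):
    """(k, l) pairs owned by one MPI process, with k varying fastest."""
    lo, hi = block_range(n * n, num_procs, rank)
    return [(idx % n, idx // n) for idx in range(lo, hi)]

N = 256            # grid points per dimension
MP = 8             # microtasking processes per node (from the slide)
NP = 4096 // MP    # assumed: 4096 CPUs, one MPI process per node

pairs_per_process = len(owned_kl(N, NP, 0))
j_lo, j_hi = block_range(N, MP, MP - 1)
bytes_per_vector = N ** 4 * 8  # assumed double-precision real entries

print(pairs_per_process, (j_lo, j_hi), bytes_per_vector / 2 ** 30)
```

Under these assumptions each process holds 128 (k, l) pairs, each thread handles 32 values of j, and the full i range of 256 stays contiguous for the vector unit. A single 256^4 vector occupies 32 GiB, which is why the solver's working set has to be spread over many nodes.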
□ 3.5 Parallel LOBPCG
• The core of the implementation is the MATRIX-VECTOR multiplication.
• 3-level parallelism is carefully done in our implementation.
• In inter-node parallelization, communication pipelining is used.
• In the Rayleigh-Ritz part, ScaLAPACK is used.
LOBPCG (Acg.f)
do l=1,256                  ! inter-node parallelism
  do k=1,256                ! inter-node parallelism
    do j=1,256              ! intra-node (thread) parallelism
      do i=1,256            ! vectorization
        w(i,j,k,l)=a(i,j,k,l)*v(i,j,k,l) &
                  +b*(v(i+1,j,k,l)+ ... ) &
                  +c*(v(i+1,j+1,k,l)+ ... )
      enddo
    enddo
  enddo
enddo
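The loop nest shows that the Hamiltonian is applied as a stencil and never stored as a matrix. The following 2D analogue illustrates that idea only: the neighbour sets for the b and c terms, the zero boundary condition, and all names are assumptions made here, since the original 4D stencil is elided in the slide.

```python
def apply_h(a, b, c, v):
    """Matrix-free w = H v on an n x n grid.

    a: per-point diagonal term, b: nearest-neighbour coupling,
    c: diagonal-neighbour coupling. Values outside the grid count as zero.
    """
    n = len(v)

    def at(i, j):
        return v[i][j] if 0 <= i < n and 0 <= j < n else 0.0

    w = [[0.0] * n for _ in range(n)]
    for j in range(n):      # outer loop: the level that would be split across threads
        for i in range(n):  # inner loop: the level that would be vectorized
            w[i][j] = (a[i][j] * v[i][j]
                       + b * (at(i + 1, j) + at(i - 1, j) + at(i, j + 1) + at(i, j - 1))
                       + c * (at(i + 1, j + 1) + at(i - 1, j - 1)
                              + at(i + 1, j - 1) + at(i - 1, j + 1)))
    return w

# Hypothetical 3x3 example: apply H to a unit vector at the grid centre.
n = 3
a = [[4.0] * n for _ in range(n)]
v = [[0.0] * n for _ in range(n)]
v[1][1] = 1.0
w = apply_h(a, -1.0, -0.25, v)
print(w)
```

Only the diagonal array and two scalars are stored, so memory grows with the number of grid points rather than with its square. In a distributed version the neighbour reads that cross a process boundary are the values that must be exchanged, which is where communication pipelining would come in.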
IV. Numerical Results
□ 4.1 Numerical Result
Preliminary test of our eigensolver
4-junction system -> 256^4 dimension

Performance (5 eigenmodes)
CPUs    time [s]    TFLOPS
2048    3118        3.65
3072    2535        4.49
4096    1621        7.02

Convergence history (10 eigenmodes)
[Figure: residual error (log scale, 1e-12 to 1e+4) vs. iteration count (0 to 3000) for the ground state and the 2nd through 10th lowest states]
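The performance table can be cross-checked with a few lines of arithmetic. The sketch assumes a peak of 8 GFLOPS per Earth Simulator processor, a hardware figure that is not stated in the slides; the function names are illustrative.

```python
PEAK_GFLOPS_PER_CPU = 8.0  # assumed Earth Simulator per-processor peak

# (CPUs, time in seconds, sustained TFLOPS) from the performance table
runs = [(2048, 3118.0, 3.65), (3072, 2535.0, 4.49), (4096, 1621.0, 7.02)]

def percent_of_peak(cpus, tflops):
    """Sustained performance as a percentage of aggregate peak."""
    return 100.0 * tflops * 1000.0 / (cpus * PEAK_GFLOPS_PER_CPU)

def speedup(base, run):
    """Wall-clock speedup of `run` relative to `base`."""
    return base[1] / run[1]

for cpus, seconds, tflops in runs:
    total_work = seconds * tflops  # total floating-point work in TFLOP
    print(cpus,
          round(percent_of_peak(cpus, tflops), 1),
          round(speedup(runs[0], (cpus, seconds, tflops)), 2),
          round(total_work))
```

With that assumption the 4096-CPU run comes out at 21.4% of peak. Going from 2048 to 4096 CPUs gives a speedup of about 1.92, and the product of time and TFLOPS is nearly the same in all three rows (about 1.14e4 TFLOP), so the rows describe the same amount of work.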
□ 4.2 Numerical Result (Scenario)
The simplest case: two junctions
• Initial state, then a potential change in only a single junction
• Capacitive coupling
Question: synchronization or independence (localization)?
□ 4.3 Numerical Result: Two-Stacked Intrinsic Josephson Junction
[Figure: two stacked junctions, labeled 1 and 2]
Classical regime: independent dynamics
Quantum regime: ??
□ α = 0.4, β = 0.2
[Figure: four snapshots in the (q1, q2) plane at t = 0.0, 2.9, 9.2, and 10.0 (a.u.)]
□ α = 0.4, β = 1.0
[Figure: four snapshots in the (q1, q2) plane at t = 0.0, 2.5, 4.2, and 10.0 (a.u.)]
□ Two Junctions
Weakly quantum (classical): independence
Strongly quantum: synchronization
□ Three Junctions
□ α = 0.4, β = 0.2
□ α = 0.4, β = 1.0
□ 4 Junctions: Quantum Assisted Synchronization
[Figure: <q1>, <q2>, <q3>, <q4> vs. t (a.u.); panel (a) α = 0.4, β = 0.2; panel (b) α = 0.4, β = 1.0]
V. Conclusion
□ 5. Conclusion
Collective MQT in Intrinsic Josephson Junctions via parallel computing on ES
• Direct quantum simulation (4 junctions)
• Quantum (synchronous) vs. classical (localized)
• Quantum assisted synchronization
High Performance Computing
• Novel eigenvalue algorithm: LOBPCG
• Communication-free (or less) implementation
• Sustained 7 TFLOPS (21.4% of peak)
• Toward peta-scale computing?
Thank you! 謝謝
Further information
Physics: [email protected] (jaea.go.jp)
HPC: [email protected] (im.uec.ac.jp)