
  • QCD in GPU

    Ting-Wai Chiu (趙挺偉)

    Department of Physics Center for Quantum Science and Engineering

    National Taiwan University

    GPU Technology Conference Beijing, China

    December 14-15, 2011

  • Quantum Chromodynamics (QCD)

    The quantum field theory for the strong interaction between quarks and gluons, which build up the hadrons (e.g., neutron, proton, pion, etc.).

    QCD provides the framework to understand the nuclear force/energy from first principles.

    QCD plays an important role in the evolution of the early universe, from the quark-gluon "plasma" phase to the hadron phase.

    T.W. Chiu, GTC-Asia-2011 2011/12/14


  • Outline

    Introduction

    Lattice Dirac Operator : Quark Matrix

    CUDA Kernels for Conjugate Gradient

    Conclusion & Outlook

    References:

    Taiwan Lattice QCD Collaboration (TWQCD):

    arXiv: 0911.5029, 1101.0405, 1101.0402, 1101.0423,

    1105.4414, 1109.3675


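  • Gluon fields on the Lattice

    The color gluon field $A_\mu(x)$ is defined on each link connecting $x$ and $x + a\hat\mu$, through the link variable

    $$U_\mu(x) = \exp\left[ i a g A_\mu\left( x + \tfrac{a}{2}\hat\mu \right) \right] \in SU(3)$$

    Each link variable is a $3 \times 3$ unitary matrix:

    $$U_\mu(x, y, z, t) = \begin{pmatrix} U_{11} & U_{12} & U_{13} \\ U_{21} & U_{22} & U_{23} \\ U_{31} & U_{32} & U_{33} \end{pmatrix}, \qquad \mu = \hat x, \hat y, \hat z, \hat t, \qquad U U^\dagger = U^\dagger U = I$$

    Then the gluon action on the lattice can be written as

    $$S_g[U] = \frac{6}{g^2} \sum_{\text{plaquette}} \left[ 1 - \frac{1}{3} \mathrm{Re}\, \mathrm{tr}(U_p) \right] \xrightarrow{a \to 0} \frac{1}{2} \int d^4x \, \mathrm{tr}\left[ F_{\mu\nu}(x) F_{\mu\nu}(x) \right]$$

    where

    $$U_p = U_\mu(x) \, U_\nu(x + a\hat\mu) \, U_\mu^\dagger(x + a\hat\nu) \, U_\nu^\dagger(x)$$

    [Figure: a plaquette with corners $x$, $x + a\hat\mu$, $x + a\hat\mu + a\hat\nu$, $x + a\hat\nu$]

    T.W. Chiu, GPU Workshop, Jan 16, '09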
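    The plaquette term of the gluon action can be checked on a single plaquette. The following is a minimal sketch, not the TWQCD code: the helper names (`diag_su3`, `plaquette_term`) are hypothetical, and only diagonal SU(3) matrices are used to keep the example short. It evaluates $1 - \frac{1}{3}\mathrm{Re}\,\mathrm{tr}(U_p)$ for trivial links, for pure-gauge links of the form $g(x)\,g^\dagger(x + \hat\mu)$, and for one non-trivial link.

```python
import cmath
import math

def matmul(a, b):
    """Product of two 3x3 complex matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def dagger(a):
    """Hermitian conjugate of a 3x3 matrix."""
    return [[a[j][i].conjugate() for j in range(3)] for i in range(3)]

def diag_su3(t1, t2):
    """Hypothetical helper: diag(e^{i t1}, e^{i t2}, e^{-i(t1+t2)}), unit determinant."""
    phases = (t1, t2, -(t1 + t2))
    return [[cmath.exp(1j * phases[i]) if i == j else 0j for j in range(3)]
            for i in range(3)]

def plaquette_term(u_mu_x, u_nu_xmu, u_mu_xnu, u_nu_x):
    """1 - (1/3) Re tr(U_p), U_p = U_mu(x) U_nu(x+mu) U_mu^dag(x+nu) U_nu^dag(x)."""
    u_p = matmul(matmul(u_mu_x, u_nu_xmu),
                 matmul(dagger(u_mu_xnu), dagger(u_nu_x)))
    return 1.0 - sum(u_p[i][i] for i in range(3)).real / 3.0

identity = diag_su3(0.0, 0.0)

# Trivial links: U_p = I, so the plaquette contributes nothing to the action.
free_term = plaquette_term(identity, identity, identity, identity)

# Pure-gauge links U_mu(x) = g(x) g^dag(x + mu) also give U_p = I.
# g0..g3 sit at the corners x, x+mu, x+mu+nu, x+nu.
g0, g1 = diag_su3(0.3, -0.1), diag_su3(1.1, 0.4)
g2, g3 = diag_su3(-0.7, 0.2), diag_su3(0.5, 0.9)
pure_gauge_term = plaquette_term(matmul(g0, dagger(g1)), matmul(g1, dagger(g2)),
                                 matmul(g3, dagger(g2)), matmul(g0, dagger(g3)))

# One non-trivial link: the term becomes strictly positive.
curved_term = plaquette_term(diag_su3(0.3, -0.1), identity, identity, identity)
print(free_term, pure_gauge_term, curved_term)
```

    A production code would store general (non-diagonal) SU(3) matrices on every link and sum this term over all plaquettes of the lattice.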

  • The Quark Matrix (Lattice Dirac Operator)

    $$D = (D_{ij}), \qquad i = (\alpha, a, x), \quad j = (\beta, b, y)$$

    $a, b = 1, 2, 3$ (color indices)

    $\alpha, \beta = 1, 2, 3, 4$ (Dirac spinor indices)

    $x, y = 1, \ldots, N_{\text{sites}}$ (lattice site indices)

  • Salient Features of the Quark Matrix

    D is prohibitively large for exact solvers.

    In general, D is a sparse matrix, since it only involves (next-)nearest-neighbor interactions on a 4-dimensional or 5-dimensional lattice.

    Iterative algorithms (conjugate gradient, Lanczos, etc.) are used, which involve the matrix-vector multiplication.

    CUDA kernels can be optimized for the matrix-vector multiplication in QCD.
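    The sparsity argument can be illustrated with a matrix-free product. This sketch uses a toy scalar nearest-neighbor operator on a periodic 4-dimensional lattice as a stand-in; it is not the Wilson or domain-wall operator, and the names `neighbors` and `apply_stencil` are hypothetical. The point is that each output site reads only 1 + 2d input sites, so the matrix never has to be stored.

```python
import math
from itertools import product

def neighbors(site, shape):
    """Yield the 2*d nearest neighbors of `site` on a periodic lattice."""
    for mu, extent in enumerate(shape):
        for step in (+1, -1):
            nb = list(site)
            nb[mu] = (nb[mu] + step) % extent
            yield tuple(nb)

def apply_stencil(v, shape, mass=0.5):
    """Matrix-free product w = D v for the toy operator
    D = (mass + d) * 1 - (1/2) * sum over mu of (forward hop + backward hop)."""
    d = len(shape)
    w = {}
    for site in product(*(range(n) for n in shape)):
        hop = sum(v[nb] for nb in neighbors(site, shape))
        w[site] = (mass + d) * v[site] - 0.5 * hop
    return w

shape = (4, 4, 4, 4)
ones = {site: 1.0 for site in product(*(range(n) for n in shape))}
result = apply_stencil(ones, shape, mass=0.5)

n_sites = math.prod(shape)            # number of rows of the toy matrix
nonzeros_per_row = 1 + 2 * len(shape)  # diagonal entry plus 2*d hopping terms
print(n_sites, nonzeros_per_row)
```

    On a GPU the loop over sites would map naturally to one thread per site, which is the kind of kernel the slide refers to; the actual kernel layout used by the author is not given in the transcript.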

  • Lattice QCD

    The QCD action

    $$S = S_G[U] + \bar\psi^{f}_{\alpha a x} \, D^{f}_{\alpha a x, \beta b y}[U] \, \psi^{f}_{\beta b y}$$

    where $S_G[U]$ is the action of the gluon fields, and

    $f = u, d, s, c, b, t$ (flavor index)

    $a, b = 1, 2, 3$ (color index)

    $\alpha, \beta = 1, 2, 3, 4$ (Dirac index)

    $x, y = 1, \ldots, N_{\text{sites}} = N_x N_y N_z N_t$ (site index)

    For example, on the $16^3 \times 32$ lattice, for each flavor, $D$ is a complex matrix of size $1{,}572{,}864 \times 1{,}572{,}864$.

    $$\langle \mathcal{O} \rangle = \frac{\int dU \, d\bar\psi \, d\psi \; \mathcal{O}(U, \bar\psi, \psi) \, e^{-S}}{\int dU \, d\bar\psi \, d\psi \; e^{-S}} = \frac{\int dU \, \det(D) \, \mathcal{O}(U) \, e^{-S_G}}{\int dU \, \det(D) \, e^{-S_G}}$$

    Kenneth G. Wilson, Nobel Prize (1982)
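    The matrix size quoted for the $16^3 \times 32$ lattice follows from counting 3 colors and 4 Dirac components per site. The short sketch below reproduces that count and, as an added illustration, the storage a dense copy would need, assuming 16 bytes per double-precision complex entry (the function name is hypothetical).

```python
def quark_matrix_dimension(nx, ny, nz, nt, n_color=3, n_dirac=4):
    """Rows (= columns) of the quark matrix for one flavor."""
    return n_color * n_dirac * nx * ny * nz * nt

dim = quark_matrix_dimension(16, 16, 16, 32)

# Dense storage, assuming 16 bytes per double-precision complex number.
dense_bytes = dim * dim * 16
dense_tib = dense_bytes / 2**40

print(dim, dense_tib)
```

    The dense estimate comes out at 36 TiB, which is why the operator is only ever applied as a sparse matrix-vector product.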

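  • Domain-Wall Fermions

    $D_{\text{dwf}}$ is a local operator with nearest-neighbor coupling along the fifth dimension $\hat s$.

    [Figure: the fifth dimension $s = 1, 2, \ldots, N_s$, with fields $\psi(x, 1)$, $\psi(x, s)$, $\psi(x, N_s)$; the two boundaries are labeled "Left-handed" and "Right-handed"]

    $$\int d\bar\psi \, d\psi \, \exp\left( -\bar\psi D_{\text{dwf}} \psi \right) \propto \det D_c, \qquad D_c = \frac{1 + \gamma_5 S}{1 - \gamma_5 S}$$

    $$N_s \to \infty: \quad S \to \frac{H}{\sqrt{H^2}}, \qquad D_c \gamma_5 + \gamma_5 D_c = 0 \quad \text{(exact chiral symmetry)}$$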

  • DWF with even-odd preconditioning


  • Mixed-Precision CG (2)

    To solve $A x = b$ to a relative accuracy $\varepsilon$:

    1. $r_k = b - A x_k$

    2. If $|r_k| < \varepsilon |b|$, then stop

    3. Solve $A t_k = r_k$ in single precision to an accuracy $\delta < 1$

    4. $x_{k+1} = x_k + t_k$

    5. Go to 1.

    Proof: Let $u_k = r_k - A t_k$, with $|u_k| < \delta |r_k|$; then

    $$r_{k+1} = b - A x_{k+1} = b - A x_k - A t_k = r_k - A t_k = u_k, \qquad |r_{k+1}| < \delta |r_k| < |r_k|$$
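    The five steps above can be sketched on the CPU. This is a minimal illustration under stated assumptions, not the author's CUDA implementation: single precision is emulated by rounding through `struct`, the inner solver is a plain CG on a small symmetric positive-definite test matrix (chosen here only for the demonstration), and the tolerance names `eps` and `delta` are placeholders for the symbols on the slide.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def matvec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(v):
    return dot(v, v) ** 0.5

def cg_single(a, r, delta, max_iter=200):
    """Inner solve of A t = r by CG, rounding every stored number to single
    precision. Stops when the recursively updated residual is below delta*|r|."""
    a32 = [[f32(aij) for aij in row] for row in a]
    t = [0.0] * len(r)
    res = [f32(ri) for ri in r]
    p = res[:]
    rr = f32(dot(res, res))
    target = delta * norm(r)
    for _ in range(max_iter):
        if rr ** 0.5 < target:
            break
        ap = [f32(y) for y in matvec(a32, p)]
        alpha = f32(rr / f32(dot(p, ap)))
        t = [f32(ti + alpha * pi) for ti, pi in zip(t, p)]
        res = [f32(ri - alpha * api) for ri, api in zip(res, ap)]
        rr_new = f32(dot(res, res))
        beta = f32(rr_new / rr)
        p = [f32(ri + beta * pi) for ri, pi in zip(res, p)]
        rr = rr_new
    return t

def mixed_precision_solve(a, b, eps=1e-12, delta=1e-3, max_outer=50):
    x = [0.0] * len(b)
    history = []                                             # |r_k| per outer step
    for _ in range(max_outer):
        r = [bi - axi for bi, axi in zip(b, matvec(a, x))]   # step 1 (double)
        history.append(norm(r))
        if history[-1] < eps * norm(b):                      # step 2
            break
        t = cg_single(a, r, delta)                           # step 3 (single)
        x = [xi + ti for xi, ti in zip(x, t)]                # step 4 (double)
    return x, history

# Small test system: tridiagonal matrix with 4 on the diagonal and -1 beside it.
n = 8
a = [[4.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [float(i + 1) for i in range(n)]
x, history = mixed_precision_solve(a, b)
print(len(history), history[-1] / norm(b))
```

    The outer residual is always recomputed in double precision, so the final accuracy is set by `eps` even though almost all arithmetic happens in the cheaper single-precision inner solve.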


  • Benchmarks

    GPU      Dw (single)  M5 (single)  Dw (double)  M5 (double)  CG (mixed)
    GTX285       177          346           33           69          181
    C1060        128          290           29           61          132
    C2070        171          244           22           96          156
    GTX480       293          309           37          116          252
    GTX580       338          445           41          150          317

    All numbers are in units of Gflops, tested with ODWF on the $16^3 \times 32 \times 16$ lattice.

    CG (mixed precision) attains 317 Gflops on the GTX580.

    The bottleneck is the single-precision Dw multiplication.

  • GPU Cluster at NTU

    16 units of Nvidia Tesla S1070 (total 64 GPUs), connected to 16 servers (total 48 Intel QC Xeon)

    32 Nvidia C1060 (total 32 GPUs), connected to 16 servers (total 16 Intel i7)

    122 Nvidia GTX285 (total 122 GPUs), connected to 62 servers (total 62 Intel i7)

    22/36/7 Nvidia C2070/GTX580/GTX480 (total 47 GPUs), connected to 27 servers (total 27 Intel i7)

    Hard disk storage > 300 TB, Lustre cluster file system

    280 Nvidia GPUs with peak performance > 300 TFLOPS

    Attaining 80 TFLOPS (sustained) for LQCD with Optimal DWF

    Developed efficient CUDA codes for full QCD: 320/156/180/132 Gflops for GTX580/C2070/GTX285/C1060

  • GPU Cluster at NTU (a partial view)


  • Conclusions and Outlook

    We have implemented efficient CUDA codes for lattice QCD with domain-wall fermions. On the GTX580, our CG solver attains 320 Gflops (sustained).

    GPUs provide the optimal price/performance for large-scale lattice QCD simulations.

    Lattice QCD with domain-wall fermions on the $24^3 \times 48 \times 16$ lattice can be simulated efficiently with a single GPU, and larger lattices (e.g., $32^3 \times 64 \times 16$) with multiple GPUs.

    GPUs have revolutionized the advancement of QCD.
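    A rough way to see why the larger lattice calls for several GPUs is to estimate the size of one 5-dimensional fermion vector. The sketch below assumes 12 complex components (3 colors times 4 Dirac components) per 5D site and 8 bytes per single-precision complex number; the actual memory layout of the TWQCD code is not described in the slides, and a CG solver holds several such vectors plus the gauge field.

```python
def fermion_vector_bytes(nx, ny, nz, nt, ns, n_color=3, n_dirac=4,
                         bytes_per_complex=8):
    """Bytes for one fermion vector on an nx*ny*nz*nt*ns lattice (assumed layout)."""
    sites_5d = nx * ny * nz * nt * ns
    return sites_5d * n_color * n_dirac * bytes_per_complex

single_gpu_case = fermion_vector_bytes(24, 24, 24, 48, 16)
multi_gpu_case = fermion_vector_bytes(32, 32, 32, 64, 16)

print(single_gpu_case / 2**30, multi_gpu_case / 2**30)   # sizes in GiB
```

    Under these assumptions one vector takes about 0.95 GiB on the smaller lattice and 3 GiB on the larger one; even-odd preconditioning would halve the vector length.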

  • Conclusions and Outlook (cont.)

    First Physics Results from the NTU GPU cluster:

    1. T.W. Chiu, T.H. Hsieh, Y.Y. Mao (TWQCD), “Topological Susceptibility in Two Flavors Lattice QCD with the Optimal Domain-Wall Fermion”, Phys. Lett. B 702 (2011) 131.

    2. T.W. Chiu, T.H. Hsieh, Y.Y. Mao (TWQCD), “Pseudoscalar Meson in Two Flavors QCD with the Optimal Domain-Wall Fermion”, arXiv:1109.3675

  • Backup slides

  • Quantum Chromodynamics (QCD)

    The quantum field theory for the strong interaction between quarks and gluons, which build up the hadrons (e.g., neutron, proton, pion, etc.). QCD provides the framework to understand the nuclear force/energy from first principles, and plays an important role in the evolution of the early universe, from the quark-gluon "plasma" phase to the hadron phase.

    Salient features:

    Gauge group $SU(3)$: gluons have self-interactions.

    Asymptotic freedom: $g(r) \to 0$ as $r \to 0$.

    IR slavery: quark (color) confinement, $g(r) \sim 1$ at $r \sim 1$ fm.

    Spontaneous chiral symmetry breaking

  • It took 23 years (1974-1997) to realize that lattice QCD with exact chiral symmetry is the ideal theoretical framework to study the nonperturbative physics from the first principles of QCD.

    But it is challenging to perform the simulation such that the chiral symmetry is preserved to very high precision and all topological sectors are sampled ergodically.

    Since 2009, the TWQCD collaboration has been simulating QCD with optimal domain-wall quarks. The chiral symmetry is preserved to a good precision, with $m_{\text{res}} a \simeq 0.0004$, and all topological sectors are sampled ergodically.

  • Exact Chiral Symmetry on the Lattice

    The proper way to break the chiral symmetry at finite lattice spacing is to impose the Ginsparg-Wilson relation (1982)

    $$D \gamma_5 + \gamma_5 D = D \gamma_5 D$$

    equivalently,

    $$D^{-1}(x, y) \, \gamma_5 + \gamma_5 \, D^{-1}(x, y) = \gamma_5 \, \delta_{x, y}$$

    which is realized by the Domain-Wall Fermion (Kaplan, 1992), and the overlap Dirac operator (Neuberger, 1998)

    $$D = I + \gamma_5 \frac{H}{\sqrt{H^2}}, \qquad H^\dagger = H$$

    $D$ is exponentially local for sufficiently smooth gauge fields. In the continuum limit ($a \to 0$), $D \propto \gamma_\mu (\partial_\mu + i g A_\mu)$.
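    The Ginsparg-Wilson relation can be checked numerically on a toy example. The sketch below assumes the overlap operator in the normalization $D = 1 + \gamma_5 H / \sqrt{H^2}$ and the relation in the form $D\gamma_5 + \gamma_5 D = D\gamma_5 D$; it uses a 2x2 stand-in with $\gamma_5 = \mathrm{diag}(1, -1)$ and a traceless Hermitian $H$, for which $H^2$ is proportional to the identity so the inverse square root is trivial. All function names are hypothetical.

```python
def mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

GAMMA5 = [[1.0, 0.0], [0.0, -1.0]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]

def toy_overlap(a, b):
    """D = 1 + gamma5 H / sqrt(H^2) for H = [[a, b], [conj(b), -a]]
    (a real, b complex), where H^2 = (a^2 + |b|^2) * identity."""
    scale = (a * a + abs(b) ** 2) ** 0.5
    sign_h = [[a / scale, b / scale], [b.conjugate() / scale, -a / scale]]
    return add(IDENTITY, mul(GAMMA5, sign_h))

def gw_violation(d):
    """Largest entry of D gamma5 + gamma5 D - D gamma5 D in absolute value."""
    lhs = add(mul(d, GAMMA5), mul(GAMMA5, d))
    rhs = mul(mul(d, GAMMA5), d)
    return max(abs(l - r) for row_l, row_r in zip(lhs, rhs)
               for l, r in zip(row_l, row_r))

violation = gw_violation(toy_overlap(0.3, 0.4 + 0.2j))
counterexample = gw_violation(IDENTITY)   # D = 1 does not satisfy the relation
print(violation, counterexample)
```

    The check relies only on $\gamma_5^2 = 1$ and $(H/\sqrt{H^2})^2 = 1$, which is why the same algebra carries over to the full lattice operator.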

  • Central Problems in Lattice QCD

    To compute the quark propagator $D^{-1}$

    To compute the (low-lying) eigenmodes of $D$

    The matrix $D$ is prohibitively large for exact solvers.

    Iterative algorithms involve the matrix-vector multiplication.

    For the overlap fermion operator, it involves

    $$\frac{H}{\sqrt{H^2}} \, Y$$

    The inverse square root cannot be computed exactly.

    What is the best way to proceed?

  • Nested Conjugate Gradient

    $$\frac{H}{\sqrt{H^2}} \, Y \simeq H \, R^{(n-1, n)}(H^2) \, Y$$

    $$R^{(n-1, n)}(H^2) \, Y = \sum_{l=1}^{n} \frac{b_l}{H^2 + d_l} \, Y = \sum_{l=1}^{n} b_l Z_l, \qquad (H^2 + d_l) \, Z_l = Y \quad \text{(solved by CG with multi-shifts)}$$

    To compute the quark propagator requires nested CG:

    $$D \, Y = \left( I + \gamma_5 \frac{H}{\sqrt{H^2}} \right) Y = b$$

    Q: Can we avoid the inverse square root?

    A: Yes, by introducing an extra dimension (degree of freedom).
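    The partial-fraction form $h \sum_l b_l / (h^2 + d_l)$ can be tried out on a scalar $h$ in place of the matrix $H$. The slide does not list the coefficients $b_l$, $d_l$, so this sketch swaps in the polar approximation to the sign function, whose partial-fraction coefficients are known in closed form; it is a stand-in of the same $(n-1, n)$ type, not the coefficients used in the talk. In the matrix case each term $(H^2 + d_l)^{-1} Y$ would come from the multi-shift CG.

```python
import math

def polar_coefficients(n):
    """Coefficients (b_l, d_l), l = 1..n, of the polar approximation
    sign(h) ~ h * sum_l b_l / (h^2 + d_l)."""
    coeffs = []
    for l in range(1, n + 1):
        theta = math.pi * (l - 0.5) / (2 * n)
        coeffs.append((1.0 / (n * math.cos(theta) ** 2), math.tan(theta) ** 2))
    return coeffs

def sign_partial_fractions(h, n):
    """Evaluate the approximation term by term, as a multi-shift solver would."""
    return h * sum(b / (h * h + d) for b, d in polar_coefficients(n))

def sign_closed_form(h, n):
    """Same rational function written as a ratio of powers."""
    plus, minus = (1 + h) ** (2 * n), (1 - h) ** (2 * n)
    return (plus - minus) / (plus + minus)

for h in (0.2, 0.5, 1.0, 3.0):
    print(h, sign_partial_fractions(h, 16))
```

    With n = 16 the approximation is already very close to 1 for the sample points, and it degrades as h approaches zero, which is why the quality of the rational approximation over the spectrum of $H$ matters.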

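  • Domain-Wall Fermions

    $D_{\text{dwf}}$ is a local operator with nearest-neighbor coupling along the fifth dimension $\hat s$.

    [Figure: the fifth dimension $s = 1, 2, \ldots, N_s$, with fields $\psi(x, 1)$, $\psi(x, s)$, $\psi(x, N_s)$; the two boundaries are labeled "Left-handed" and "Right-handed"]

    $$\int d\bar\psi \, d\psi \, \exp\left( -\bar\psi D_{\text{dwf}} \psi \right) \propto \det D_c, \qquad D_c = \frac{1 + \gamma_5 S}{1 - \gamma_5 S}$$

    $$N_s \to \infty: \quad S \to \frac{H}{\sqrt{H^2}}, \qquad D_c \gamma_5 + \gamma_5 D_c = 0 \quad \text{(exact chiral symmetry)}$$

    For finite $N_s$, which DWF gives the best rational approximation $S \simeq H / \sqrt{H^2}$?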

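  • Optimal Domain-Wall Fermion

    [ TWC, Phys. Rev. Lett. 90 (2003) 071601 ]

    $$A_{\text{odwf}} = \sum_{s, s' = 1}^{N_s} \sum_{x, x'} \bar\psi_{x, s} \left[ (I + \omega_s D_w)_{x, x'} \, \delta_{s, s'} - (I - \omega_s D_w)_{x, x'} \left( P_- \delta_{s', s+1} + P_+ \delta_{s', s-1} \right) \right] \psi_{x', s'} \equiv \bar\psi \, D_{\text{odwf}} \, \psi$$

    $$D_w = \sum_{\mu = 1}^{4} \gamma_\mu t_\mu + W - m_0, \qquad m_0 \in (0, 2)$$

    $$t_\mu(x, x') = \frac{1}{2} \left[ U_\mu(x) \, \delta_{x', x + \hat\mu} - U_\mu^\dagger(x') \, \delta_{x', x - \hat\mu} \right]$$

    $$W(x, x') = \sum_{\mu = 1}^{4} \frac{1}{2} \left[ 2 \delta_{x, x'} - U_\mu(x) \, \delta_{x', x + \hat\mu} - U_\mu^\dagger(x') \, \delta_{x', x - \hat\mu} \right]$$

    with boundary conditions

    $$P_+ \psi(x, 0) = -\frac{m_q}{2 m_0} P_+ \psi(x, N_s), \qquad m_q: \text{bare quark mass}$$

    $$P_- \psi(x, N_s + 1) = -\frac{m_q}{2 m_0} P_- \psi(x, 1), \qquad P_\pm = \frac{1}{2} (1 \pm \gamma_5)$$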

  • The state-of-the-art in the simulations of unquenched QCD with exact chiral symmetry

    RBC and UKQCD Collaborations
    Machine: QCDOC
    Lattice fermion: Domain-Wall Fermion
    Lattice sizes: $16^3 \times 32 \times 16$, $24^3 \times 48 \times 16$, $32^3 \times 64 \times 16$

    JLQCD Collaboration
    Machine: IBM BlueGene/L
    Lattice fermion: Overlap Fermion (with fixed topology $Q_t = 0$)
    Lattice sizes: $16^3 \times 32$, $16^3 \times 48$

    TWQCD Collaboration
    Machine: GPU cluster with 280 GPUs
    Lattice fermion: Optimal Domain-Wall Fermion
    Lattice sizes: $16^3 \times 32 \times 16$, $20^3 \times 40 \times 16$, $24^3 \times 48 \times 16$
