Schematic diagrams of GPU architectures and time evolution of theoretical FLOPS and bandwidth
TRANSCRIPT

GPU performance over time (theoretical compute performance)
Product lines: GeForce (gaming), Quadro (computer graphics), Tesla (GPGPU)
Based on material published at http://docs.nvidia.com/cuda/cuda-c-programming-guide/
[Figure: Theoretical GFLOP/s versus year, 2001 to 2015. Footnotes: *1 code name, *2 product family]
Labeled GeForce GPUs: GeForce FX 5800, GeForce 6800 Ultra, GeForce 7800 GTX, GeForce 8800 GTX, GeForce GTX 280, GeForce GTX 480, GeForce GTX 580, GeForce GTX 680, GeForce GTX TITAN, GeForce 780 Ti, GeForce GTX Titan X (labeled twice)
Labeled Tesla GPUs: Tesla C1060, Tesla C2050, Tesla M2090, Tesla K20X, Tesla K40, Tesla K80, Tesla P100
Labeled GPU architectures: Tesla*1, Fermi, Kepler, Maxwell, Pascal
Labeled Intel CPUs: Pentium 4, Woodcrest, Harpertown, Sandy Bridge, Ivy Bridge, Haswell, Broadwell
Excel Sheet
2 2017/4/1

GPU performance over time (theoretical bandwidth)
Product lines: GeForce (gaming), Quadro (computer graphics), Tesla (GPGPU)
Based on material published at http://docs.nvidia.com/cuda/cuda-c-programming-guide/
[Figure: Theoretical GB/s versus year, 2003 to 2015]
Labeled GeForce GPUs: GeForce FX 5900, GeForce 6800 GT, GeForce 7800 GTX, GeForce 8800 GTX, GeForce GTX 280, GeForce GTX 480, GeForce GTX 680, GeForce 780 Ti, GeForce GTX Titan X (labeled twice)
Labeled Tesla GPUs: Tesla C1060, Tesla C2050, Tesla M2090, Tesla K20X, Tesla K40, Tesla K80, Tesla P100
Labeled GPU architectures: Tesla*1, Fermi, Kepler, Maxwell, Pascal
Labeled Intel CPUs: Northwood, Prescott, Woodcrest, Harpertown, Bloomfield, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell
Excel Sheet

Tesla architecture: Tesla C1060 specifications
Number of SMs: 30
Number of CUDA cores: 240 (= 8 cores/SM × 30 SMs)
No cache
Based on an image published at http://www.anandtech.com/show/2549/2
[Diagram: each Streaming Multiprocessor (SM) contains 8 SPs, 2 SFUs, 16 KB Shared Memory, and a Register File (16384 × 32-bit); the chip is an array of SMs]

Fermi architecture: Tesla M2050 specifications
Number of SMs: 14
Number of CUDA cores: 448 (= 32 cores/SM × 14 SMs)
L1/L2 caches
ECC (error correction) support
For details, see http://www.nvidia.co.jp/docs/IO/81860/NVIDIA_Fermi_Architecture_Whitepaper_FINAL_J.pdf
Based on an image published at http://www.anandtech.com/show/2849/3
[Diagram: each SM contains 32 Cores, SFU×4, a Register File (16384 × 32-bit), and 64 KB Shared Memory / L1 Cache; the chip contains 4 GPCs (each with a Raster Engine and SMs), an L2 Cache, the GigaThread Engine, a PCI Express 3.0 Host Interface, and 6 Memory Controllers]

Kepler architecture: Tesla K20c/m specifications
Number of SMXs: 13
Number of CUDA cores: 2,496 (= 192 cores/SMX × 13 SMXs)
Based on an image published at https://library.creativecow.net/kaufman_debra/NVIDIA-VGX/1
For details, see https://www.nvidia.co.jp/content/apac/pdf/tesla/nvidia-kepler-gk110-architecture-whitepaper-jp.pdf
[Diagram: each SMX contains 192 Cores, 64 DP Units, 32 SFUs, a Register File (65536 × 32-bit), 64 KB Shared Memory / L1 Cache, and a 48 KB Read-Only Data Cache; the chip contains the SMXs, an L2 Cache, the GigaThread Engine, a PCI Express 3.0 Host Interface, and 6 Memory Controllers]
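
The Kepler slide gives the register file as a count of 32-bit entries. A minimal sketch of the conversion to bytes follows; the function name `register_file_bytes` is hypothetical, and the chip-wide total is an illustration that simply multiplies by the 13 SMXs stated on the slide rather than a figure taken from it.

```python
def register_file_bytes(entries: int, bits_per_entry: int = 32) -> int:
    """Capacity in bytes of a register file holding `entries` registers."""
    return entries * bits_per_entry // 8


# Values copied from the Kepler slide: 65536 x 32-bit per SMX, 13 SMXs.
KEPLER_ENTRIES_PER_SMX = 65536
KEPLER_SMX_COUNT = 13

# Per-SMX capacity in KiB and chip-wide capacity in MiB (illustrative only).
per_smx_kib = register_file_bytes(KEPLER_ENTRIES_PER_SMX) / 1024
total_mib = KEPLER_SMX_COUNT * register_file_bytes(KEPLER_ENTRIES_PER_SMX) / 2**20

if __name__ == "__main__":
    print(f"Register file per SMX: {per_smx_kib:.0f} KiB")
    print(f"Register file across {KEPLER_SMX_COUNT} SMXs: {total_mib:.2f} MiB")
```

Under these assumptions the register file comes to 256 KiB per SMX, four times the 64 KB Shared Memory / L1 Cache listed on the same slide.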

Maxwell architecture: GeForce GTX TITAN X specifications
Number of SMMs: 24
Number of CUDA cores: 3,072 (= 128 cores/SMM × 24 SMMs)
No double-precision units
Based on an image published at http://www.itmedia.co.jp/pcuser/articles/1409/19/news051.html
For details of the first generation, see https://www.nvidia.co.jp/content/product-detail-pages/geforce-gtx-750-ti/geforce-gtx-750ti-whitepaper.pdf
[Diagram: each SMM contains PolyMorph Engine 3.0, 64 KB Shared Memory, L1 Cache, and 4 blocks, each with a Register File (16,384 × 32-bit), 32 Cores, and 8 SFUs; the chip contains 4 GPCs (each with a Raster Engine and SMMs), an L2 Cache, the GigaThread Engine, a PCI Express 3.0 Host Interface, and 4 Memory Controllers]

Pascal architecture: Tesla P100 specifications
Number of SMs: 56
Number of CUDA cores: 3584 (= 64 cores/SM × 56 SMs)
For details, see http://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf (the diagram is based on an image published in the same document)
[Diagram: each SM contains 64 KB Shared Memory / L1 Cache, a 48 KB Read-Only Data Cache, and 2 blocks, each with a Register File (32768 × 32-bit), 32 Cores, 16 DP Units, and 8 SFUs; the chip contains the SMs, an L2 Cache, the GigaThread Engine, a PCI Express 3.0 Host Interface, 8 Memory Controllers, 4 High Bandwidth Memory 2 blocks, and a High-Speed Hub with 4 NVLink connections]
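
Each architecture slide states the CUDA core count as cores per SM times the number of SMs. The sketch below collects the figures from the five slides and checks that each product matches the stated total; the table layout and the names `SPECS`, `total_cores`, and `check_specs` are hypothetical and chosen only for illustration.

```python
# (number of SMs, cores per SM, total stated on the slide), copied from the slides.
SPECS = {
    "Tesla C1060 (Tesla)": (30, 8, 240),
    "Tesla M2050 (Fermi)": (14, 32, 448),
    "Tesla K20c/m (Kepler)": (13, 192, 2496),
    "GeForce GTX TITAN X (Maxwell)": (24, 128, 3072),
    "Tesla P100 (Pascal)": (56, 64, 3584),
}


def total_cores(sm_count: int, cores_per_sm: int) -> int:
    """CUDA core count as the product of SM count and cores per SM."""
    return sm_count * cores_per_sm


def check_specs(specs=SPECS) -> dict:
    """Map each GPU to True if the product equals the total stated on its slide."""
    return {
        name: total_cores(sm, cores) == stated
        for name, (sm, cores, stated) in specs.items()
    }


if __name__ == "__main__":
    for name, (sm, cores, stated) in SPECS.items():
        print(f"{name}: {sm} x {cores} = {total_cores(sm, cores)} (slide: {stated})")
```

Note that the number of cores per SM does not grow monotonically across generations: it rises from 8 to 192 and then falls to 128 and 64, while the number of SMs rises.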

Theoretical compute performance (Embedded Excel Sheet)
Data estimated from the figure in the programming guide. Only the GPU double-precision values have been corrected to the exact values; the rest are approximations.
NVIDIA GPU Single Precision (year, GFLOP/s)
2003.000 0.00E+00
2004.248 7.72E+01
2005.413 1.54E+02
2006.832 5.17E+02
2008.462 9.28E+02
2009.751 1.34E+03
2010.846 1.52E+03
2012.224 3.07E+03
2013.137 4.50E+03
2013.855 5.36E+03
2015.203 6.14E+03
2016.594 1.02E+04
NVIDIA GPU Double Precision (year, GFLOP/s)
2008.462 7.80E+01
2009.751 5.15E+02
2011.369 6.66E+02
2012.864 1.31E+03
2013.877 1.43E+03
2014.872 1.87E+03
2016.594 5.30E+03
Intel CPU Single Precision (year, GFLOP/s)
2003.000 7.60E+00
2005.413 2.66E+01
2006.825 5.12E+01
2008.456 9.00E+01
2009.233 1.10E+02
2010.204 1.68E+02
2011.151 4.26E+02
2013.688 5.32E+02
2014.871 9.90E+02
2016.160 1.32E+03
Intel CPU Double Precision (year, GFLOP/s)
2003.000 3.80E+00
2005.413 1.33E+01
2006.825 2.66E+01
2008.456 4.24E+01
2009.233 5.26E+01
2010.204 6.29E+01
2011.151 2.16E+02
2013.688 2.66E+02
2014.871 4.95E+02
2016.160 6.68E+02
[Chart: Theoretical GFLOP/s (0 to 11000) versus year (2001 to 2017). Series: NVIDIA GPU Double Precision, NVIDIA GPU Single Precision, Intel CPU Double Precision, Intel CPU Single Precision]
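
The double-precision series in this sheet can be summarized with two derived numbers: an average yearly growth factor and the GPU-to-CPU ratio at the most recent data point. The sketch below is one possible way to compute them; the helper names `annual_growth` and `latest_ratio` are hypothetical, and fitting a single growth factor between the first and last points is a simplifying assumption, not something the slide does.

```python
# (year, GFLOP/s) pairs copied from the double-precision columns of the sheet.
GPU_DP = [(2008.462, 7.80e1), (2009.751, 5.15e2), (2011.369, 6.66e2),
          (2012.864, 1.31e3), (2013.877, 1.43e3), (2014.872, 1.87e3),
          (2016.594, 5.30e3)]
CPU_DP = [(2003.000, 3.80e0), (2005.413, 1.33e1), (2006.825, 2.66e1),
          (2008.456, 4.24e1), (2009.233, 5.26e1), (2010.204, 6.29e1),
          (2011.151, 2.16e2), (2013.688, 2.66e2), (2014.871, 4.95e2),
          (2016.160, 6.68e2)]


def annual_growth(series):
    """Average multiplicative growth per year between the first and last points."""
    (y0, v0), (y1, v1) = series[0], series[-1]
    return (v1 / v0) ** (1.0 / (y1 - y0))


def latest_ratio(numerator, denominator):
    """Ratio of the last values of two series (their years may differ slightly)."""
    return numerator[-1][1] / denominator[-1][1]


if __name__ == "__main__":
    print(f"GPU DP growth per year: {annual_growth(GPU_DP):.2f}x")
    print(f"CPU DP growth per year: {annual_growth(CPU_DP):.2f}x")
    print(f"GPU/CPU DP ratio at the last points: {latest_ratio(GPU_DP, CPU_DP):.1f}")
```

Because the two series start in different years and most values are approximations read off a figure, these numbers are only a rough summary of the trend.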

Theoretical bandwidth * (Embedded Excel Sheet)
* Also included in the Excel sheet on the previous slide, but repeated here just in case.
Data estimated from the figure in the programming guide. Only the Tesla GPU values have been corrected to the exact values; the rest are approximations.
GeForce GPU (year, GB/s)
2003.000 1.26E+01
2004.000 3.08E+01
2005.000 5.35E+01
2006.000 8.56E+01
2008.000 1.42E+02
2009.000 1.77E+02
2012.000 1.92E+02
2013.000 3.36E+02
2015.196 3.36E+02
2016.604 4.80E+02
Tesla GPU (year, GB/s)
2008.000 1.02E+02
2009.000 1.49E+02
2010.000 1.78E+02
2012.000 2.50E+02
2013.000 2.88E+02
2014.884 4.80E+02
2016.351 7.32E+02
Intel CPU (year, GB/s)
2003.000 6.29E+00
2005.000 8.81E+00
2006.000 1.07E+01
2007.000 1.32E+01
2009.000 3.21E+01
2010.000 3.21E+01
2012.000 5.10E+01
2013.000 5.98E+01
2014.879 6.81E+01
2016.189 7.77E+01
[Chart: Theoretical GB/s (0 to 800) versus year (2001 to 2017). Series: GeForce GPU, Tesla GPU, Intel CPU]
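
The compute and bandwidth sheets can be read together by dividing peak FLOP/s by peak bandwidth, which gives the number of floating-point operations a processor must perform per byte of memory traffic to reach its theoretical peak. The sketch below does this for the last data points only. Pairing the last point of the bandwidth sheet with the last double-precision point of the compute sheet is an assumption (the years differ slightly), and the name `flops_per_byte` is hypothetical.

```python
# Last data points copied from the two sheets: (year, value).
TESLA_BW_LAST = (2016.351, 7.32e2)   # GB/s, Tesla GPU
CPU_BW_LAST = (2016.189, 7.77e1)     # GB/s, Intel CPU
GPU_DP_LAST = (2016.594, 5.30e3)     # GFLOP/s, NVIDIA GPU double precision
CPU_DP_LAST = (2016.160, 6.68e2)     # GFLOP/s, Intel CPU double precision


def flops_per_byte(gflops: float, gbytes_per_s: float) -> float:
    """Peak floating-point operations per byte of memory traffic."""
    return gflops / gbytes_per_s


gpu_balance = flops_per_byte(GPU_DP_LAST[1], TESLA_BW_LAST[1])
cpu_balance = flops_per_byte(CPU_DP_LAST[1], CPU_BW_LAST[1])
bandwidth_ratio = TESLA_BW_LAST[1] / CPU_BW_LAST[1]

if __name__ == "__main__":
    print(f"GPU: {gpu_balance:.2f} DP FLOP per byte")
    print(f"CPU: {cpu_balance:.2f} DP FLOP per byte")
    print(f"GPU/CPU bandwidth ratio: {bandwidth_ratio:.2f}")
```

Under this pairing both processors need on the order of 7 to 9 double-precision operations per byte to reach their theoretical peak, so a kernel that does less work per byte than that would be limited by bandwidth rather than by compute on either one.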