P.1 HUF 2017 2017/10/16~10/20
JAXA Site Presentation
~Reliability of Data Management~
Japan Aerospace Exploration Agency
Supercomputer Division
FUJITA, Naoyuki
P.2
Short Japanese language Lesson (^_^)
JAXAのサイトプレゼンテーション: JAXA Site Presentation
~データ管理の信頼性~: ~Reliability of Data Management~
宇宙航空研究開発機構: Japan Aerospace Exploration Agency (≒ Research and Development)
スーパーコンピュータ活用課: Supercomputer Division (≒ Utilization Division)
藤田 直行: WISTERIA field, Go straight
P.3
藤田 (Family name): WISTERIA field
直行 (First name): Go straight
http://photohito.com/photo/1185633/
P.4
■Supercomputer Facility Overview
P.5
■J-SPACE H/W Configuration
■Supercomputer Facility Overview
[Figure: J-SPACE hardware configuration diagram. Components named in the diagram are listed below; per-link counts are omitted.]
Networks: HPSS Data (IB), HPSS Data2 (10GbE), HPSS Control (1GbE), Admin (GbE)
Ethernet switch: Nexus5548P 10GbE
FC switches: SAN48B-5, 48-port 8Gb FC SW (two shown)
Client Machines
Core Server: x3750 M4, 32 Core, 128GB
VFS, Login: x3650 M4, 16 Core, 32GB
Disk Mover: x3650 M4, 16 Core, 32GB
Tape Mover x5: x3650 M4, 16 Core, 32GB
Tape Mover: x3650 M4, 16 Core, 32GB
Tape library: TS3500, 14 Frame with HA, TS1140 x40, TS1150 x4
Tape library: TS3500, 4 Frame with HA, LTO4 x2, TS1140 x3
Disk: DS3524 Turbo, 146GB SAS x48
Disk: DDN SFA12K, 3TB x280, 6TB x20
P.6
■J-SPACE S/W version & services
Operation started in August 2014
HPSS: 7.4.2.1
DB2: 10.5 Fixpack3a
OS: RHEL 6.4
AIX => N/A (Replaced with RHEL) …orz
Tape Drives: IBM TS1140 x (40+3) drives, IBM TS1150 x 4 drives, IBM LTO4 x 2 drives
Tape Cartridges: 3592: 11317+1382, LTO4: 2884
Cache Disk: DDN SFA12K 0.73PB
User Data: 11.0PB (Logical) / 6.2M Files
Number of used COS: 26
UI: pftp; hsi/htar; VFS, NFS via VFS, ftp via VFS, scp via VFS
■Supercomputer Facility Overview
P.7
■Space Usage on HPSS
■Monitoring
[Figure: space usage on HPSS by Year/Month, 2009/01 to 2017/06. Line: Amount of Data [TB], axis 0 to 12000. Bar: Number of Files, axis 0 to 70,000,000.]
P.8
■Number of files on HPSS (by size, by year)
■Monitoring
[Figure: "Distribution of File Size #1 (2013-2017)". X axis: File Size, in bins <1K, <4K, <16K, <64K, <256K, <1M, <4M, <16M, <64M, <256M, <1G, <4G, <16G, <64G, <256G, <1T, <4T, <16T, <64T, 64T<=. Y axis: # of Files, 0 to 25,000,000. Series: 2013, 2014, 2015, 2016, 2017. Data labels shown on the chart (bin assignment not recoverable): 136626, 1137, 114725, 158, 419, 5599, 6, 1.]
P.9
■Amount of DATA on HPSS (by size, by year)
■Monitoring
[Figure: "Distribution of File Size (2014-2017)". X axis: File Size, in bins <1K, <4K, <16K, <64K, <256K, <1M, <4M, <16M, <64M, <256M, <1G, <4G, <16G, <64G, <256G, <1T, <4T, <16T, <64T, 64T<=. Y axis: Amount of Data (TB), 0 to 3,500. Series: 2014, 2015, 2016, 2017. Data labels shown on the chart (bin assignment not recoverable): 0.4GB, 2.9GB, 29.1GB, 69.0GB, 461.4GB, 3.9TB.]
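The size classes on the two file-size charts step by a factor of four, from <1K up to 64T<=. The following is a minimal Python sketch of that binning, counting both files and bytes per class. It assumes binary units (1K = 1024 bytes), which the slides do not state, and the names `size_bin` and `distribution` are hypothetical; they are not part of HPSS or of any JAXA tool.

```python
import bisect
from collections import Counter

# Upper bounds of the bins: 1K, 4K, 16K, ..., 64T (assumed binary units).
BOUNDS = [1024 * 4 ** k for k in range(19)]


def _label(bound):
    """Render a bound such as 4096 as '4K'."""
    for exp, unit in ((40, "T"), (30, "G"), (20, "M"), (10, "K")):
        if bound >= 1 << exp:
            return f"{bound >> exp}{unit}"
    raise ValueError("bound below 1K")


# 19 "<bound" bins plus one open-ended bin, as on the chart axis.
LABELS = [f"<{_label(b)}" for b in BOUNDS] + ["64T<="]


def size_bin(size_bytes):
    """Return the chart bin label for a file of the given size."""
    return LABELS[bisect.bisect_right(BOUNDS, size_bytes)]


def distribution(sizes):
    """Return (label, file count, total bytes) for every bin, in axis order."""
    files, volume = Counter(), Counter()
    for size in sizes:
        label = size_bin(size)
        files[label] += 1
        volume[label] += size
    return [(label, files[label], volume[label]) for label in LABELS]


if __name__ == "__main__":
    # Hypothetical sample sizes in bytes, for illustration only.
    for label, count, total in distribution([10, 2000, 3000, 5 * 2**30]):
        if count:
            print(label, count, total)
```

The file count per bin corresponds to the "# of Files" chart, and the byte total per bin corresponds to the "Amount of Data" chart.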
P.10
■HPSS Activity and Client Sessions (Usage Monitoring by Ganglia)
■Monitoring
P.11
■Transfer Data & Files (Usage Monitoring by Ganglia)
■Monitoring
We are not sure whether VFS data transfers can be monitored in terms of data amount and number of files, but we would like to do so.
P.12
■Migrate/Purge Data & Files (Usage Monitoring by Ganglia)
■Monitoring
We are waiting for a new HPSS version that can report staging statistics.
P.13
System failure history
■Supercomputer Facility Overview
[Figure: timeline of system failures. Labels: Tape: 40PB; HUF2017 Banquet Day. Marked dates: 16 Jul. 2015; 1 May 2017; 7 May 2017; 9 May 2017 (1st copy); 9 May 2017 (2nd copy); 8 Sep. 2017; 21 Sep. 2017; 16 Oct. 2017.]
P.14
System failure history
■Supercomputer Facility Overview
[Figure: failure timeline as on P.13, with the table below.]
Date          Issue / Cause              Remark
16 Jul. 2015  Silent Data Corruption
1 May 2017    Disk controller failure?   S.D.C.?
7 May 2017    Disk controller failure?   S.D.C.?
9 May 2017    1st copy unreadable        Media damage?
9 May 2017    2nd copy unreadable        Debris?
8 Sep. 2017   Metadata mismatch          File System S/W bug? H/W failure?
21 Sep. 2017  Metadata mismatch
16 Oct. 2017  Metadata mismatch
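The slides do not say how the silent data corruption cases were detected. One general way to catch corruption that the hardware does not report is to record a checksum when a file is written and compare it when the file is read back. The following is a minimal Python sketch of that idea, assuming SHA-256 and ordinary local file paths; `sha256_of` and `verify` are hypothetical helper names, not HPSS interfaces.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file still matches the digest recorded earlier."""
    return sha256_of(path) == expected_hex
```

In use, the digest would be stored alongside the file at ingest, and `verify` would be run after each recall from tape or on a periodic scan. A mismatch shows that the bytes changed but not which layer changed them.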
P.15
mini-Survey on HPSS Reliability
Have you lost any bitfile on your HPSS? (JAXA: Yes) ...9 raised hands
Have you lost a user file on your HPSS? (JAXA: No) ...9 raised hands
Have you experienced Silent Data Corruption (including suspected cases)? (JAXA: Maybe Yes) ...4 raised hands
Have you experienced unknown (unresolved) data loss? (JAXA: Yes) ...2 raised hands
P.16
mini-Survey on File System Reliability
Have you lost (a part of) a file on your file system? (JAXA: Yes) ...8 raised hands
Have you lost a user file on your file system? (JAXA: Yes) ...8 raised hands
Have you experienced Silent Data Corruption (including suspected cases) on your file system? (JAXA: Yes) ...6 raised hands
Have you experienced unknown data loss on your file system? (JAXA: No) ...0 raised hands
P.17
What should we pay attention to?
Media product quality?
Collision and rubbing between tape cartridge and tape drive?
Machine room environment?
Which media conditions should we monitor to avoid data loss?
Mount count?
Mileage of tape media?
Retention count?
Error logs from media?
What are you monitoring?
P.18
What should we do for end users?
Thank you very much.