Page 1: JAXA Site Presentation · 2017. 10. 20. · HUF 2017, 2017/10/16~10/20

JAXA Site Presentation

~Reliability of Data Management~

Japan Aerospace Exploration Agency

Supercomputer Division

FUJITA, Naoyuki [email protected]

Page 2

JAXAのサイトプレゼンテーション: JAXA Site Presentation

~データ管理の信頼性~: ~Reliability of Data Management~

宇宙航空研究開発機構: Japan Aerospace Exploration Agency (≒ Research and Development)

スーパーコンピュータ活用課: Supercomputer Division (≒ Utilization Division)

藤田 直行: WISTERIA field, Go straight

[email protected]

Short Japanese language Lesson (^_^)

Page 3

藤田 (Family name): WISTERIA field

直行 (First name): Go straight

http://photohito.com/photo/1185633/

Page 4

■Supercomputer Facility Overview

Page 5

■J-SPACE H/W Configuration

■Supercomputer Facility Overview

[Figure: J-SPACE hardware configuration diagram. Components shown:]

Networks: HPSS Data (IB), HPSS Data2 (10GbE), Admin (GbE), HPSS Control (1GbE)

Switches: Nexus5548P 10GbE (8); SAN48B-5 48-port 8Gb FC switch

Client Machines

Core Server: x3750 M4, 32 cores, 128GB

VFS, Login: x3650 M4, 16 cores, 32GB

Disk Mover: x3650 M4, 16 cores, 32GB

Tape Mover x5: x3650 M4, 16 cores, 32GB

Tape Mover: x3650 M4, 16 cores, 32GB

Tape library: TS3500, 14 frames with HA, TS1140 x40, TS1150 x4

Tape library: TS3500, 4 frames with HA, LTO4 x2, TS1140 x3

Disk: DS3524 Turbo, 146GB SAS x48

Disk: DDN SFA12K, 3TB x280, 6TB x20

Page 6

■J-SPACE S/W version & services

Operation started in August 2014

HPSS: 7.4.2.1

DB2: 10.5 Fixpack3a

OS: RHEL 6.4 (AIX => N/A, replaced with RHEL) …orz

Tape Drives: IBM TS1140 x(40+3), IBM TS1150 x4, IBM LTO4 x2

Tape Cartridges: 3592: 11317+1382; LTO4: 2884

Cache Disk: DDN SFA12K 0.73PB

User Data: 11.0PB (Logical) / 6.2M Files

Number of used COS: 26

UI: pftp; hsi/htar; VFS, NFS via VFS, ftp via VFS, scp via VFS

■Supercomputer Facility Overview

Page 7

■Space Usage on HPSS

■Monitoring

[Figure: space usage on HPSS by Year/Month, 2009/01 to 2017/06. Line: Amount of Data [TB], axis 0 to 12,000. Bar: Number of Files, axis 0 to 70,000,000.]

Page 8

■Number of files on HPSS(by size, by year)

■Monitoring

[Figure: Distribution of File Size #1 (2013-2017). Y axis: # of Files, 0 to 25,000,000. X axis: File Size, in bins <1K, <4K, <16K, <64K, <256K, <1M, <4M, <16M, <64M, <256M, <1G, <4G, <16G, <64G, <256G, <1T, <4T, <16T, <64T, 64T<=. One series per year, 2013 to 2017. Data labels shown: 136626, 1137, 114725, 158, 419, 5599, 6, 1.]
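The file-size bins on this chart grow by a factor of four per bin. As a minimal sketch of how such a histogram could be produced from a list of file sizes in bytes, the following assumes binary units (1K = 1024 bytes), which the slide does not state; the function names `bin_label` and `histogram` are illustrative and not taken from the presentation.

```python
from collections import Counter


def make_bins():
    """Build the bin labels (<1K ... <64T) and their upper bounds in bytes."""
    labels = []
    for unit in "KMGT":
        for n in (1, 4, 16, 64, 256):
            labels.append(f"<{n}{unit}")
    # The chart stops at <64T, so drop the trailing <256T label.
    labels = labels[: labels.index("<64T") + 1]
    # Assumption: binary units, so the first bound is 1024 bytes and each
    # following bound is four times larger.
    bounds = [1024 * 4**i for i in range(len(labels))]
    return labels, bounds


LABELS, BOUNDS = make_bins()
OVERFLOW = "64T<="


def bin_label(size_bytes):
    """Return the label of the first bin whose upper bound exceeds the size."""
    for label, bound in zip(LABELS, BOUNDS):
        if size_bytes < bound:
            return label
    return OVERFLOW


def histogram(sizes):
    """Count files per bin, keeping empty bins so every label is present."""
    counts = Counter(bin_label(s) for s in sizes)
    return {label: counts.get(label, 0) for label in LABELS + [OVERFLOW]}
```

Calling `histogram` once per year on that year's file sizes would give one series per year, as on the chart.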

Page 9

■Amount of DATA on HPSS(by size, by year)

■Monitoring

[Figure: Distribution of File Size (2014-2017). Y axis: Amount of Data (TB), 0 to 3,500. X axis: File Size, in bins <1K, <4K, <16K, <64K, <256K, <1M, <4M, <16M, <64M, <256M, <1G, <4G, <16G, <64G, <256G, <1T, <4T, <16T, <64T, 64T<=. One series per year, 2014 to 2017. Data labels shown: 0.4GB, 2.9GB, 29.1GB, 69.0GB, 461.4GB, 3.9TB.]

Page 10

■HPSS Activity and Client Sessions(Usage Monitoring by Ganglia)

■Monitoring

Page 11

■Transfer Data & Files(Usage Monitoring by Ganglia)

■Monitoring

We are not sure whether VFS data transfers can be monitored in terms of data amount and number of files, but we would like to do so.

Page 12

■Migrate/Purge Data & Files(Usage Monitoring by Ganglia)

■Monitoring

We are waiting for a new HPSS version that can provide staging statistics.

Page 13

System failure history

■Supercomputer Facility Overview

[Figure: timeline of system failures. Labels: Tape: 40PB; 16 Jul. 2015; 1 May 2017; 7 May 2017; 9 May 2017 (1st copy); 9 May 2017 (2nd copy); 8 Sep. 2017; 21 Sep. 2017; 16 Oct. 2017; Banquet Day, HUF2017.]

Page 14

System failure history

■Supercomputer Facility Overview

[Figure: timeline of system failures (Tape: 40PB), marking the dates tabulated below and the HUF2017 banquet day.]

Date          Issue / Cause              Remark
16 Jul. 2015  Silent Data Corruption
1 May 2017    Disk controller failure?   S.D.C.?
7 May 2017    Disk controller failure?   S.D.C.?
9 May 2017    1st copy unreadable        Media damage?
9 May 2017    2nd copy unreadable        Debris?
8 Sep. 2017   Metadata mismatch          File System S/W bug? H/W failure?
21 Sep. 2017  Metadata mismatch
16 Oct. 2017  Metadata mismatch
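Several of the entries above involve confirmed or suspected silent data corruption. One common, general way to detect such corruption is to record a checksum when a file is written and compare it when the file is read back. The sketch below illustrates that idea with SHA-256 from the Python standard library; it is not described in the presentation as JAXA's procedure, and the function names `sha256_of` and `verify` are illustrative.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file still matches the digest recorded at write time."""
    return sha256_of(path) == expected_hex
```

A digest stored separately from the data (for example in a catalog) lets a later read distinguish an intact copy from one that changed without any I/O error being reported.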

Page 15

mini-Survey on HPSS Reliability

Have you lost any bitfile on your HPSS? (JAXA: Yes) ... 9 raised hands

Have you lost a user file on your HPSS? (JAXA: No) ... 9 raised hands

Have you experienced Silent Data Corruption (including suspected cases)? (JAXA: Maybe yes) ... 4 raised hands

Have you experienced unknown (unresolved) data loss? (JAXA: Yes) ... 2 raised hands

Page 16

mini-Survey on File System Reliability

Have you lost (a part of) a file on your file system? (JAXA: Yes) ... 8 raised hands

Have you lost a user file on your file system? (JAXA: Yes) ... 8 raised hands

Have you experienced Silent Data Corruption (including suspected cases) on your file system? (JAXA: Yes) ... 6 raised hands

Have you experienced unknown data loss on your file system? (JAXA: No) ... 0 raised hands

Page 17

What should we pay attention to?

Media product quality?

Tape cartridge and tape drive collision and rubbing?

Machine room environment?

Which media conditions should we monitor to avoid data loss?

Mount count?

Mileage of tape media?

Retention count?

Error logs from media?

What are you monitoring?
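The candidate indicators listed above (mount count, tape mileage, retention count, media error logs) could be combined into a simple per-cartridge check. The following is only a sketch of that idea: the record layout, the field names, and every threshold value are hypothetical placeholders, not figures from the presentation or from any vendor.

```python
from dataclasses import dataclass


@dataclass
class CartridgeStats:
    """Per-cartridge counters; all field names are hypothetical."""
    label: str
    mount_count: int
    mileage_km: float      # length of tape passed over the head
    retention_count: int
    error_count: int       # errors reported in media logs


# Hypothetical limits for illustration only; real values would have to come
# from site experience or vendor guidance.
LIMITS = {
    "mount_count": 5000,
    "mileage_km": 1000.0,
    "retention_count": 50,
    "error_count": 1,
}


def exceeded(stats, limits=LIMITS):
    """Return the names of all indicators at or above their limit."""
    return [name for name, limit in limits.items()
            if getattr(stats, name) >= limit]


def cartridges_to_review(all_stats, limits=LIMITS):
    """Map each cartridge label to its exceeded indicators, skipping clean ones."""
    report = {}
    for stats in all_stats:
        flags = exceeded(stats, limits)
        if flags:
            report[stats.label] = flags
    return report
```

Keeping the limits in one dictionary makes it easy to adjust them as a site learns which indicators actually precede unreadable media.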

Page 18

What should we do for end-user?

Thank you very much.