Site Report: Tokyo
Tomoaki Nakamura, ICEPP, The University of Tokyo
2013/12/13 Tomoaki Nakamura ICEPP, UTokyo 1
ICEPP regional analysis center
WLCG pledge:

                 2013                  2014                  2015
CPU pledge       16000 [HS06]          20000 [HS06]          24000 [HS06]
CPU deployed     43673.6 [HS06-SL5]    46156.8 [HS06-SL6]    -
                 (2560 cores)          (2560 cores)
Disk pledge      1600 [TB]             2000 [TB]             2400 [TB]
Disk deployed    2000 [TB]             2000 [TB]             -

Resource overview: We support only the ATLAS VO in WLCG as a Tier2 site, and the ATLAS-Japan collaboration as Tier3. The first production system for WLCG was deployed in 2005, after several years of R&D. Almost all hardware is procured on a three-year lease, and the system has been upgraded every three years; the current system is the third generation.

Human resources:
- Tetsuro Mashimo (associate professor): fabric operation, procurement, Tier3 support
- Nagataka Matsui (technical staff): fabric operation, Tier3 support
- Tomoaki Nakamura (project assistant professor): Tier2 operation, Tier3 analysis environment
- Hiroshi Sakamoto (professor): site representative, coordination, ADCoS (AP leader)
- Ikuo Ueda (assistant professor): ADC coordinator, site contact with ADC
- System engineers from the vendor (2 FTE): fabric maintenance, system setup
Computing resource (2013-2015)
2nd system (2010-2012) vs. 3rd system (2013-2015):

Computing node, total:
  2nd: 720 nodes, 5720 cores (including service nodes); CPU: Intel Xeon X5560 (Nehalem, 2.8 GHz, 4 cores/CPU)
  3rd: 624 nodes, 9984 cores (including service nodes); CPU: Intel Xeon E5-2680 (Sandy Bridge, 2.7 GHz, 8 cores/CPU)

Computing node, non-grid (Tier3):
  2nd: 96 (496) nodes, 768 (3968) cores; Memory: 16 GB/node; NIC: 1 Gbps/node; Network BW: 20 Gbps/16 nodes; Disk: 300 GB SAS x 2
  3rd: 416 nodes, 6656 cores; Memory: 16 GB/node (to be upgraded); NIC: 10 Gbps/node; Network BW: 40 Gbps/16 nodes; Disk: 600 GB SAS x 2

Computing node, Tier2:
  2nd: 464 (144) nodes, 3712 (1152) cores; Memory: 24 GB/node; NIC: 10 Gbps/node; Network BW: 30 Gbps/16 nodes; Disk: 300 GB SAS x 2
  3rd: 160 nodes, 2560 cores; Memory: 32 GB/node (to be upgraded); NIC: 10 Gbps/node; Network BW: 80 Gbps/16 nodes; Disk: 600 GB SAS x 2

Disk storage, total:
  2nd: Capacity: 5280 TB (RAID6); Disk arrays: 120 units (HDD: 2 TB x 24); File servers: 64 nodes (blade); FC: 4 Gbps/disk array, 8 Gbps/FS
  3rd: Capacity: 6732 TB (RAID6) + α; Disk arrays: 102 units (HDD: 3 TB x 24); File servers: 102 nodes (1U); FC: 8 Gbps/disk array, 8 Gbps/FS

Disk storage, non-grid (Tier3): mainly NFS (2nd) → mainly GPFS (3rd)
Disk storage, Tier2: DPM 1.36 PB (2nd) → DPM 2.64 PB (3rd)

Network bandwidth, LAN:
  2nd: 10GE ports in switch: 192; switch interlink: 80 Gbps
  3rd: 10GE ports in switch: 352; switch interlink: 160 Gbps

Network bandwidth, WAN:
  2nd: ICEPP-UTnet: 10 Gbps (+10 Gbps); SINET-USA: 10 Gbps x 2; ICEPP-EU: 10 Gbps
  3rd: ICEPP-UTnet: 10 Gbps (+10 Gbps); SINET-USA: 10 Gbps x 3; ICEPP-EU: 10 Gbps (+10 Gbps); LHCONE
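As a sanity check, the quoted 3rd-system capacity of 6732 TB is reproduced by a back-of-the-envelope calculation, under the assumption (not stated on the slide) that each 24-disk array forms a single RAID6 group with two parity disks:

```python
# 3rd-system disk storage: 102 arrays of 24 x 3 TB HDDs in RAID6.
# Assumption: one RAID6 group per array, i.e. 2 of 24 disks hold parity.
arrays, disks_per_array, tb_per_disk = 102, 24, 3
usable_tb = arrays * (disks_per_array - 2) * tb_per_disk
print(usable_tb)  # 6732, matching the quoted capacity
```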
Evolution of disk storage capacity for Tier2
[Figure: total capacity in DPM vs. time, showing the WLCG pledge (■) and the deployed capacity assigned to ATLAS (●), together with the number of disk arrays and file servers per generation; the 3rd system reaches 2.4 PB.]

- Pilot system for R&D
- 1st system (2007-2009): 16 x 500 GB HDD / array, 5 disk arrays / server, XFS on RAID6, 4G-FC via FC switch, 10GE NIC
- 2nd system (2010-2012): 24 x 2 TB HDD / array, 1 disk array / server, XFS on RAID6, 8G-FC via FC switch, 10GE NIC
- 3rd system (2013-2015): 24 x 3 TB HDD / array, 2 disk arrays / server, XFS on RAID6, 8G-FC without FC switch, 10GE NIC
- 4th system (2016-2019): 5 TB HDD?
ATLAS disk and LocalGroupDisk in DPM
2014's pledge has been deployed since Feb. 20, 2013:
  DATADISK: 1740 TB (including the old GroupDisk, MCDisk, HotDisk)
  PRODDISK: 60 TB
  SCRATCHDISK: 200 TB
  -------------------------------
  Total: 2000 TB
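The space-token breakdown sums to the deployed total, which a one-line check confirms:

```python
# Space tokens deployed for the 2014 pledge, in TB (from the slide).
tokens_tb = {"DATADISK": 1740, "PRODDISK": 60, "SCRATCHDISK": 200}
print(sum(tokens_tb.values()))  # 2000
```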
[Figure: LocalGroupDisk usage over one year, annotated "filled 1.6 PB", with a 500 TB reference line and markers where one-year-old, half-year-old, and heavy-user datasets were deleted.]

We will keep LocalGroupDisk below 500 TB at present, if there is no further request from users. But that is unlikely…
Fraction of ATLAS jobs at Tokyo Tier2
[Figure: fraction of completed ATLAS jobs [%] at Tokyo Tier2, per 6-month period, covering the 2nd system, the system migration, and the 3rd system.]
SL6 migration
Migration of the worker nodes to SL6 was completed in a few weeks in Oct. 2013. It was done as a rolling transition through a TOKYO-SL6 queue, to minimize downtime and to hedge the risk of a massive misconfiguration.

Performance improvement: the HepSpec06 score increased by ~5%.
  SL5, 32-bit compile mode: 17.06 ± 0.02
  SL6, 32-bit compile mode: 18.03 ± 0.02
R. M. Llamas, ATLAS S&C workshop Oct. 2013
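For reference, the relative gain implied by the two scores above works out to just under 6%, consistent with the ~5% quoted:

```python
# HepSpec06 scores, 32-bit compile mode (from the slide).
sl5, sl6 = 17.06, 18.03
gain_pct = (sl6 - sl5) / sl5 * 100
print(round(gain_pct, 1))  # 5.7
```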
WLCG service instance and middle-ware
Service        Num.  EMI version  OS        Machinery
CREAM-CE       2     EMI-2        SL6       BM
Worker node    160   EMI-2        SL6       BM
DPM-MySQL      1     EMI-2        SL5       BM
DPM-Disk       40    EMI-2        SL5       BM
Mon (APEL)     1     EMI-2        SL5       BM
WMS            1     EMI-3        SL5       VM
L&B            1     EMI-2        SL5       VM
MyProxy        1     EMI-2        SL5       VM
Site-BDII      1     EMI-2        SL5       VM
Top-BDII       1     EMI-2        SL5       VM
Argus          1     EMI-2        SL6       VM
perfSONAR-PS   2     3.2.2        CentOS 5  BM
Squid          4     -            SL5       BM
Memory upgrade
Memory has been upgraded on half of the worker nodes (Nov. 6, 2013):

lcg-ce01.icepp.jp: 80 nodes (1280 cores), 32 GB/node: 2 GB RAM/core (= job slot), 4 GB vmem/core
lcg-ce02.icepp.jp: 80 nodes (1280 cores), 64 GB/node: 4 GB RAM/core (= job slot), 6 GB vmem/core

• Useful for memory-consuming ATLAS production jobs.
• Sites offering this much memory are rare.
• We might be able to upgrade the remaining half of the WNs in 2014.

New PanDA queues: TOKYO_HIMEM / ANALY_TOKYO_HIMEM (since Dec. 5, 2013)
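The per-core figures above follow from node memory divided by job slots, taking the 16 cores per node of the 3rd-system worker nodes from the system table earlier in this report:

```python
# 3rd-system worker node: 2 x E5-2680 = 16 cores (= job slots) per node.
cores_per_node = 16
for host, mem_gb in [("lcg-ce01", 32), ("lcg-ce02", 64)]:
    print(host, mem_gb // cores_per_node, "GB RAM/core")
# lcg-ce01 2 GB RAM/core
# lcg-ce02 4 GB RAM/core
```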
I/O performance study
The number of CPU cores per worker node increased from 8 to 16 in the new system, so local I/O performance of the data staging area may become a bottleneck. We checked the performance by comparing a normal worker node against a special worker node equipped with an SSD for local storage, under production conditions with real ATLAS jobs.
Normal worker node:
  HDD: HGST Ultrastar C10K600, 600 GB SAS, 10k rpm; RAID1: DELL PERC H710P; FS: ext3
  Sequential I/O: ~150 MB/s; IOPS: ~650 (fio tool)

Special worker node:
  SSD: Intel SSD DC S3500, 450 GB; RAID0: DELL PERC H710P; FS: ext3
  Sequential I/O: ~400 MB/s; IOPS: ~40000 (fio tool)
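The IOPS figures above were measured with the fio tool; the idea of the measurement can be sketched in a few lines of Python. This is a toy random-read loop, not a replacement for fio: it does not use O_DIRECT, so the OS page cache inflates the result compared with fio's direct=1 mode.

```python
import os
import random
import tempfile
import time

def random_read_iops(path, block=4096, duration=0.5):
    """Issue random block-sized reads for `duration` seconds; return reads/sec.
    Without O_DIRECT the page cache serves repeated blocks, so this is an
    upper bound on device IOPS."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    ops, start = 0, time.monotonic()
    while time.monotonic() - start < duration:
        os.pread(fd, block, random.randrange(0, size - block))
        ops += 1
    os.close(fd)
    return ops / duration

# Toy usage on a small scratch file; a real test targets a large file
# on the device under study.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(4 * 1024 * 1024))
print(f"~{random_read_iops(f.name):.0f} reads/sec")
os.unlink(f.name)
```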
Results
Direct mount via XRootD
J. Elmsheuser, ATLAS S&C workshop Oct. 2013
I/O performance [staging to local disk vs. direct I/O from storage]: some improvements have been reported, especially for DPM storage.

From the ADC/user's point of view, jobs would be almost freed from limits on input file size and the number of input files. But this should be checked more precisely…
Summary of I/O study and others
Local I/O performance:
• The local I/O performance of our worker nodes has been studied by comparing HDD and SSD under a mixture of real ATLAS production jobs and analysis jobs.
• We confirmed that the HDD in the worker nodes at the Tokyo Tier2 center is not a bottleneck for long batch-type jobs, at least with 16 jobs running concurrently.
• The same should be checked for the next-generation worker nodes, which will have more than 16 CPU cores, and also for the ext4 and XFS file systems.
• The improvement in I/O performance from direct mounting of DPM should be confirmed by ourselves more precisely.
• Local network usage should be checked toward the next system design.
DPM:
• The database on the DPM head node has become very large (~40 GB).
• We will add more RAM to the DPM head node, bringing it to 128 GB in total.
• We are also planning to install Fusion-io (high-performance NAND flash memory attached via the PCI-Express bus) in the DPM head node to improve the maintainability of the DB.
• A redundant configuration of the MySQL DB should be studied for daily backups.
Wide area network:
• We will migrate all production instances to LHCONE as soon as possible (to be discussed on Monday).

Tier3:
• SL6 migration is ongoing.
• CVMFS will be available for ATLAS_LOCAL_ROOT_BASE and the nightly releases.