Site Report: Tokyo
Tomoaki Nakamura, ICEPP, The University of Tokyo
2013/12/13 Tomoaki Nakamura ICEPP, UTokyo 1
ICEPP regional analysis center
WLCG pledge:

                 2013                  2014                  2015
CPU pledge       16000 [HS06]          20000 [HS06]          24000 [HS06]
CPU deployed     43673.6 [HS06-SL5]    46156.8 [HS06-SL6]    -
                 (2560 cores)          (2560 cores)
Disk pledge      1600 [TB]             2000 [TB]             2400 [TB]
Disk deployed    2000 [TB]             2000 [TB]             -

Resource overview: We support only the ATLAS VO in WLCG as a Tier2 site, and the ATLAS-Japan collaboration as Tier3. The first production system for WLCG was deployed in 2005, after several years of R&D. Almost all hardware is procured on a three-year lease, and the system has been upgraded every three years; the current system is the third generation.

Human resources:
- Tetsuro Mashimo (associate professor): fabric operation, procurement, Tier3 support
- Nagataka Matsui (technical staff): fabric operation, Tier3 support
- Tomoaki Nakamura (project assistant professor): Tier2 operation, Tier3 analysis environment
- Hiroshi Sakamoto (professor): site representative, coordination, ADCoS (AP leader)
- Ikuo Ueda (assistant professor): ADC coordinator, site contact with ADC
- System engineers from the vendor (2 FTE): fabric maintenance, system setup
Computing resource (2013-2015)
2nd system (2010-2012) vs. 3rd system (2013-2015):

Computing node, total:
  2nd: 720 nodes, 5720 cores (including service nodes); CPU: Intel Xeon X5560 (Nehalem, 2.8 GHz, 4 cores/CPU)
  3rd: 624 nodes, 9984 cores (including service nodes); CPU: Intel Xeon E5-2680 (Sandy Bridge, 2.7 GHz, 8 cores/CPU)

Computing node, non-grid (Tier3):
  2nd: 96 (496) nodes, 768 (3968) cores; Memory: 16 GB/node; NIC: 1 Gbps/node; Network BW: 20 Gbps/16 nodes; Disk: 300 GB SAS x 2
  3rd: 416 nodes, 6656 cores; Memory: 16 GB/node (to be upgraded); NIC: 10 Gbps/node; Network BW: 40 Gbps/16 nodes; Disk: 600 GB SAS x 2

Computing node, Tier2:
  2nd: 464 (144) nodes, 3712 (1152) cores; Memory: 24 GB/node; NIC: 10 Gbps/node; Network BW: 30 Gbps/16 nodes; Disk: 300 GB SAS x 2
  3rd: 160 nodes, 2560 cores; Memory: 32 GB/node (to be upgraded); NIC: 10 Gbps/node; Network BW: 80 Gbps/16 nodes; Disk: 600 GB SAS x 2

Disk storage, total:
  2nd: Capacity: 5280 TB (RAID6); Disk arrays: 120 units (HDD: 2 TB x 24); File servers: 64 nodes (blade); FC: 4 Gbps/disk array, 8 Gbps/FS
  3rd: Capacity: 6732 TB (RAID6) + α; Disk arrays: 102 units (HDD: 3 TB x 24); File servers: 102 nodes (1U); FC: 8 Gbps/disk array, 8 Gbps/FS

Disk storage, non-grid (Tier3): mainly NFS (2nd) → mainly GPFS (3rd)
Disk storage, Tier2: DPM 1.36 PB (2nd) → DPM 2.64 PB (3rd)

Network bandwidth, LAN:
  2nd: 10GE ports in switch: 192; switch interlink: 80 Gbps
  3rd: 10GE ports in switch: 352; switch interlink: 160 Gbps

Network bandwidth, WAN:
  2nd: ICEPP-UTnet: 10 Gbps (+10 Gbps); SINET-USA: 10 Gbps x 2; ICEPP-EU: 10 Gbps
  3rd: ICEPP-UTnet: 10 Gbps (+10 Gbps); SINET-USA: 10 Gbps x 3; ICEPP-EU: 10 Gbps (+10 Gbps); LHCONE
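As a sanity check, the quoted 3rd-system capacity of 6732 TB is reproduced by a back-of-the-envelope calculation, under the assumption (not stated on the slide) that each 24-disk array forms a single RAID6 group with two parity disks:

```python
# 3rd-system disk storage: 102 arrays of 24 x 3 TB HDDs in RAID6.
# Assumption: one RAID6 group per array, i.e. 2 of 24 disks hold parity.
arrays, disks_per_array, tb_per_disk = 102, 24, 3
usable_tb = arrays * (disks_per_array - 2) * tb_per_disk
print(usable_tb)  # 6732, matching the quoted capacity
```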
Evolution of disk storage capacity for Tier2
[Figure: total capacity in DPM vs. time, showing the WLCG pledge (■) and the deployed capacity assigned to ATLAS (●), together with the number of disk arrays and file servers per generation; the 3rd system reaches 2.4 PB.]

- Pilot system for R&D
- 1st system (2007-2009): 16 x 500 GB HDD / array, 5 disk arrays / server, XFS on RAID6, 4G-FC via FC switch, 10GE NIC
- 2nd system (2010-2012): 24 x 2 TB HDD / array, 1 disk array / server, XFS on RAID6, 8G-FC via FC switch, 10GE NIC
- 3rd system (2013-2015): 24 x 3 TB HDD / array, 2 disk arrays / server, XFS on RAID6, 8G-FC without FC switch, 10GE NIC
- 4th system (2016-2019): 5 TB HDD?
ATLAS disk and LocalGroupDisk in DPM
2014's pledge has been deployed since Feb. 20, 2013:
  DATADISK: 1740 TB (including the old GroupDisk, MCDisk, HotDisk)
  PRODDISK: 60 TB
  SCRATCHDISK: 200 TB
  -------------------------------
  Total: 2000 TB
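The space-token breakdown sums to the deployed total, which a one-line check confirms:

```python
# Space tokens deployed for the 2014 pledge, in TB (from the slide).
tokens_tb = {"DATADISK": 1740, "PRODDISK": 60, "SCRATCHDISK": 200}
print(sum(tokens_tb.values()))  # 2000
```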
[Figure: LocalGroupDisk usage over one year, annotated "filled 1.6 PB", with a 500 TB reference line and markers where one-year-old, half-year-old, and heavy-user datasets were deleted.]

We will keep LocalGroupDisk below 500 TB at present, if there is no further request from users. But that is unlikely…
Fraction of ATLAS jobs at Tokyo Tier2
[Figure: fraction of completed ATLAS jobs [%] at Tokyo Tier2, per 6-month period, covering the 2nd system, the system migration, and the 3rd system.]
SL6 migration
Migration of the worker nodes to SL6 was completed in a few weeks in Oct. 2013. It was done as a rolling transition through a TOKYO-SL6 queue, to minimize downtime and to hedge the risk of a massive misconfiguration.

Performance improvement: the HepSpec06 score increased by ~5%.
  SL5, 32-bit compile mode: 17.06 ± 0.02
  SL6, 32-bit compile mode: 18.03 ± 0.02
R. M. Llamas, ATLAS S&C workshop Oct. 2013
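For reference, the relative gain implied by the two scores above works out to just under 6%, consistent with the ~5% quoted:

```python
# HepSpec06 scores, 32-bit compile mode (from the slide).
sl5, sl6 = 17.06, 18.03
gain_pct = (sl6 - sl5) / sl5 * 100
print(round(gain_pct, 1))  # 5.7
```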
WLCG service instance and middle-ware
Service        Num.  EMI version  OS        Machinery
CREAM-CE       2     EMI-2        SL6       BM
Worker node    160   EMI-2        SL6       BM
DPM-MySQL      1     EMI-2        SL5       BM
DPM-Disk       40    EMI-2        SL5       BM
Mon (APEL)     1     EMI-2        SL5       BM
WMS            1     EMI-3        SL5       VM
L&B            1     EMI-2        SL5       VM
MyProxy        1     EMI-2        SL5       VM
Site-BDII      1     EMI-2        SL5       VM
Top-BDII       1     EMI-2        SL5       VM
Argus          1     EMI-2        SL6       VM
perfSONAR-PS   2     3.2.2        CentOS 5  BM
Squid          4     -            SL5       BM
Memory upgrade
Memory has been upgraded on half of the worker nodes (Nov. 6, 2013):

lcg-ce01.icepp.jp: 80 nodes (1280 cores), 32 GB/node: 2 GB RAM/core (= job slot), 4 GB vmem/core
lcg-ce02.icepp.jp: 80 nodes (1280 cores), 64 GB/node: 4 GB RAM/core (= job slot), 6 GB vmem/core

• Useful for memory-consuming ATLAS production jobs.
• Sites offering this much memory are rare.
• We might be able to upgrade the remaining half of the WNs in 2014.

New PanDA queues: TOKYO_HIMEM / ANALY_TOKYO_HIMEM (since Dec. 5, 2013)
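The per-core figures above follow from node memory divided by job slots, taking the 16 cores per node of the 3rd-system worker nodes from the system table earlier in this report:

```python
# 3rd-system worker node: 2 x E5-2680 = 16 cores (= job slots) per node.
cores_per_node = 16
for host, mem_gb in [("lcg-ce01", 32), ("lcg-ce02", 64)]:
    print(host, mem_gb // cores_per_node, "GB RAM/core")
# lcg-ce01 2 GB RAM/core
# lcg-ce02 4 GB RAM/core
```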
I/O performance study
The number of CPU cores per worker node increased from 8 to 16 in the new system, so local I/O performance of the data staging area may become a bottleneck. We checked the performance by comparing a normal worker node against a special worker node equipped with an SSD for local storage, under production conditions with real ATLAS jobs.
Normal worker node:
  HDD: HGST Ultrastar C10K600, 600 GB SAS, 10k rpm; RAID1: DELL PERC H710P; FS: ext3
  Sequential I/O: ~150 MB/s; IOPS: ~650 (fio tool)

Special worker node:
  SSD: Intel SSD DC S3500, 450 GB; RAID0: DELL PERC H710P; FS: ext3
  Sequential I/O: ~400 MB/s; IOPS: ~40000 (fio tool)
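The IOPS figures above were measured with the fio tool; the idea of the measurement can be sketched in a few lines of Python. This is a toy random-read loop, not a replacement for fio: it does not use O_DIRECT, so the OS page cache inflates the result compared with fio's direct=1 mode.

```python
import os
import random
import tempfile
import time

def random_read_iops(path, block=4096, duration=0.5):
    """Issue random block-sized reads for `duration` seconds; return reads/sec.
    Without O_DIRECT the page cache serves repeated blocks, so this is an
    upper bound on device IOPS."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    ops, start = 0, time.monotonic()
    while time.monotonic() - start < duration:
        os.pread(fd, block, random.randrange(0, size - block))
        ops += 1
    os.close(fd)
    return ops / duration

# Toy usage on a small scratch file; a real test targets a large file
# on the device under study.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(4 * 1024 * 1024))
print(f"~{random_read_iops(f.name):.0f} reads/sec")
os.unlink(f.name)
```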
Results
Direct mount via XRootD
J. Elmsheuser, ATLAS S&C workshop Oct. 2013
I/O performance [staging to local disk vs. direct I/O from storage]: some improvements have been reported, especially for DPM storage.

From the ADC/user's point of view, jobs would be almost freed from limits on input file size and the number of input files. But this should be checked more precisely…
Summary of I/O study and others
Local I/O performance:
• The local I/O performance of our worker nodes has been studied by comparing HDD and SSD under a mixture of real ATLAS production jobs and analysis jobs.
• We confirmed that the HDD in the worker nodes at the Tokyo Tier2 center is not a bottleneck for long batch-type jobs, at least with 16 jobs running concurrently.
• The same should be checked for the next-generation worker nodes, which will have more than 16 CPU cores, and also for the ext4 and XFS file systems.
• The improvement in I/O performance from direct mounting of DPM should be confirmed by ourselves more precisely.
• Local network usage should be checked toward the next system design.
DPM:
• The database on the DPM head node has become very large (~40 GB).
• We will add more RAM to the DPM head node, bringing it to 128 GB in total.
• We are also planning to install Fusion-io (high-performance NAND flash memory attached via the PCI-Express bus) in the DPM head node to improve the maintainability of the DB.
• A redundant configuration of the MySQL DB should be studied for daily backups.
Wide area network:
• We will migrate all production instances to LHCONE as soon as possible (to be discussed on Monday).

Tier3:
• SL6 migration is ongoing.
• CVMFS will be available for ATLAS_LOCAL_ROOT_BASE and the nightly releases.