Developing & Managing A Large Linux Farm – The Brookhaven Experience

CHEP2004 – Interlaken, September 27, 2004. Tomasz Wlodek - BNL


Page 1: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Developing & Managing A Large Linux Farm – The Brookhaven Experience

CHEP2004 – Interlaken

September 27, 2004

Tomasz Wlodek - BNL

Page 2: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Background

Brookhaven National Lab (BNL) is a multi-disciplinary research laboratory funded by the US government.

BNL is the site of the Relativistic Heavy Ion Collider (RHIC) and four of its experiments.

The RHIC Computing Facility (RCF) was formed in the mid-90s to address the computing needs of the RHIC experiments.

Page 3: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Background (cont.)

BNL has also been chosen as the site of the Tier-1 ATLAS Computing Facility (ACF) for the ATLAS experiment at CERN.

RCF/ACF supports HENP and HEP scientific computing efforts and various general services (backup, e-mail, web, off-site data transfer, Grid, etc.).

Page 4: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Background (cont.)

The Linux Farm is the main source of CPU (and increasingly storage) resources in the RCF/ACF

RCF/ACF is transforming itself from a local resource into a national and global resource

Growing design and operational complexity

Increasing staffing levels to handle additional responsibilities

Page 5: Developing & Managing A Large Linux Farm – The Brookhaven Experience

RCF/ACF Structure

Page 6: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Staff Growth at the RCF/ACF

[Chart: staff levels (FTE) by year, 1997 to 2005 (est.); vertical axis 0 to 35]

Page 7: Developing & Managing A Large Linux Farm – The Brookhaven Experience

The Pre-Grid Era

Rack-mounted commodity hardware

Self-contained, localized resources

Resources available only to local users

Little interaction with external resources at remote locations

Considerable freedom to set own usage policies

Page 8: Developing & Managing A Large Linux Farm – The Brookhaven Experience

The (Near-Term) Future

Resources available globally

Distributed computing architecture

Extensive interaction with remote resources requires closer software inter-operability and higher network bandwidth

Constraints on freedom to set own policies

Page 9: Developing & Managing A Large Linux Farm – The Brookhaven Experience

How do we get there?

Change in management philosophy

Evolution in hardware requirements

Evolution in software packages

Different security protocol(s)

Change in access policy

Page 10: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Change in Management Philosophy

Automated monitoring & management of servers is a must in large clusters

Remote power management, predictive hardware failure analysis and preventive maintenance are important

High availability is based on a large number of identical servers, not on 24-hour support

Increasingly large clusters are only manageable if servers are identical → avoid specialized servers

Page 11: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Evolution in Hardware Requirements

Early acquisitions emphasized CPU power over local storage capacity

Increasing affordability of local disk storage has changed this philosophy

Hardware chosen by optimal combination of CPU power, storage capacity, server density and price

Buy from high-quality vendors to avoid labor-intensive maintenance issues
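One way to picture the selection criteria on this slide is a weighted ranking of candidate servers. The following is only an illustrative sketch: the `ServerOffer` type, the weights, and the two example offers are hypothetical and do not come from the talk, which does not say how the four criteria were actually combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerOffer:
    """A candidate server quote (all fields are illustrative)."""
    name: str
    price_usd: float
    specint2000: float   # CPU benchmark score per server
    disk_gb: float       # local disk capacity per server
    rack_units: int      # rack space occupied (proxy for density)


def rank_offers(offers, weights=(0.5, 0.3, 0.2)):
    """Rank offers by a weighted mix of CPU per dollar, disk per dollar,
    and CPU per rack unit. Each metric is normalized to the best offer,
    so the scores are relative to the candidates given."""
    w_cpu, w_disk, w_density = weights
    best_cpu = max(o.specint2000 / o.price_usd for o in offers)
    best_disk = max(o.disk_gb / o.price_usd for o in offers)
    best_density = max(o.specint2000 / o.rack_units for o in offers)

    def score(o):
        return (w_cpu * (o.specint2000 / o.price_usd) / best_cpu
                + w_disk * (o.disk_gb / o.price_usd) / best_disk
                + w_density * (o.specint2000 / o.rack_units) / best_density)

    return sorted(offers, key=score, reverse=True)


# Hypothetical quotes, used only to exercise the function.
example_offers = [
    ServerOffer("legacy-2U", 2000.0, 500.0, 100.0, 2),
    ServerOffer("dense-1U", 2000.0, 1000.0, 200.0, 1),
]
ranked = rank_offers(example_offers)
```

Vendor quality, the fourth point on the slide, is not modeled here; in practice it would more likely act as a filter on which offers are considered at all.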

Page 12: Developing & Managing A Large Linux Farm – The Brookhaven Experience

The Growth of the Linux Farm

[Chart: farm capacity in KSpecInt2000 by year, 1999 to 2004; vertical axis 0 to 1400]

Page 13: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Drop in Server Price as a Function of Performance

[Chart: Cost/SpecInt2000 (in U.S. dollars) by year, 1999 to 2004; vertical axis 0 to 14]

Page 14: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Drop in Cost of Local Storage

[Chart: Cost/GB (in U.S. dollars) by year, 1999 to 2004; vertical axis 0 to 70]

Page 15: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Total Distributed Storage Capacity

[Chart: total storage capacity (TB) by year, 1999 to 2004; vertical axis 0 to 250]

Page 16: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Growth of Storage Capacity per Server

[Chart: GB/server by year, 1999 to 2004; vertical axis 0 to 450]

Page 17: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Server Reliability

[Chart: failures per machine-month by year, 2000 to 2004; vertical axis 0 to 0.012]

Failure rate: about 1/week at current size
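The relation between a per-machine failure rate and the farm-wide rate quoted above is simple arithmetic, sketched below. The function names are made up for illustration, the talk gives neither the exact rate nor the farm size, and the Poisson step assumes failures are independent, which is an assumption rather than a claim from the slides.

```python
import math

WEEKS_PER_MONTH = 52 / 12  # average number of weeks in a month


def expected_failures_per_week(n_servers, failures_per_machine_month):
    """Farm-wide expected failures per week, given a per-machine monthly rate."""
    return n_servers * failures_per_machine_month / WEEKS_PER_MONTH


def servers_for_one_failure_per_week(failures_per_machine_month):
    """Farm size at which the expected rate reaches one failure per week."""
    return WEEKS_PER_MONTH / failures_per_machine_month


def prob_at_least_one_failure(expected_per_week):
    """Chance of seeing one or more failures in a given week,
    assuming independent failures (Poisson model)."""
    return 1.0 - math.exp(-expected_per_week)
```

Even at an average of one failure per week, this model implies that a fair share of weeks pass with no failure and others bring several, which is one reason automated monitoring matters more than the average suggests.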

Page 18: Developing & Managing A Large Linux Farm – The Brookhaven Experience

The Factors Enforcing Evolution in Software Packages

Cost

Farm size / scalability

Security

External influences / wide acceptance

Page 19: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Cost

Red Hat Linux → Scientific Linux

LSF → Condor

Page 20: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Farm Size / Scalability

Home-built batch system for data reconstruction → Condor-based batch system

Home-built monitoring system → Ganglia

Page 21: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Security

Started with NIS/telnet in the 90s

Cyber-security threats prompted the installation of firewalls, gatekeepers and migration to ssh → stricter security standards than in the past

Ongoing change to Kerberos 5. Ongoing phase-out of NIS passwords.

Testing GSI → limited support for GSI

Page 22: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Security Changes (cont.)

Authorization & authentication controlled by local site (NIS and Kerberos)

Migration to GSI requires a central CA and regional VOs for authentication → local sites perform final authentication before granting access

Accept certificates from multiple CAs?

Difficult transition from complete to partial control over security issues
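The layered decision described on this slide (CA, then VO, then the local site) can be sketched as a chain of checks. This is a toy model only: it does not use any real GSI or Globus API and performs no cryptographic verification, and every name in it (`Credential`, `authorize`, the CA, VO and account tables) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credential:
    subject: str  # certificate subject (distinguished name)
    issuer: str   # name of the CA that signed the certificate
    vo: str       # virtual organization the user claims membership in


# Hypothetical site configuration, for illustration only.
TRUSTED_CAS = {"Example Grid CA"}
VO_MEMBERS = {"example-vo": {"/CN=Jane Doe"}}
LOCAL_ACCOUNTS = {"/CN=Jane Doe": "jdoe"}


def authorize(cred, trusted_cas=TRUSTED_CAS, vo_members=VO_MEMBERS,
              local_accounts=LOCAL_ACCOUNTS):
    """Return the local account name for a credential, or None if refused.

    Step 1: the certificate must come from a CA the site accepts.
    Step 2: the VO must list the subject as a member.
    Step 3: the local site has the final say via its own account mapping.
    """
    if cred.issuer not in trusted_cas:
        return None
    if cred.subject not in vo_members.get(cred.vo, set()):
        return None
    return local_accounts.get(cred.subject)


granted = authorize(Credential("/CN=Jane Doe", "Example Grid CA", "example-vo"))
refused = authorize(Credential("/CN=Jane Doe", "Unknown CA", "example-vo"))
```

In this picture, accepting certificates from multiple CAs amounts to enlarging `TRUSTED_CAS`, while the site keeps control only of the last step, which is the shift from complete to partial control noted above.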

Page 23: Developing & Managing A Large Linux Farm – The Brookhaven Experience

External Influences / Wide Acceptance

Ganglia – used by RHIC experiments to monitor the RCF and external farms in order to manage their job submission.

HRM / dCache – used by other labs

Condor – widely used by the ATLAS community

Page 24: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Software Evolution - summary

Package               Old               New                 Date
OS                    RedHat Linux      Scientific Linux    2004
Batch                 Home-Built/LSF    Condor/LSF          2004/2000
Monitoring            Home-Built        Ganglia             2003
Security              NIS               K5/GSI              2003/2004
Distributed Storage   -----------       HRM/dCache          2004/?

Page 25: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Ganglia at the RCF/ACF

Page 26: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Condor at the RCF/ACF

Page 27: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Summary

RCF/ACF is going through a transition from a local facility to a regional (global) facility → many changes

Linux Farm built with commodity hardware is increasingly affordable and reliable

Distributed storage is also increasingly affordable → management software issues.

Page 28: Developing & Managing A Large Linux Farm – The Brookhaven Experience

Summary (cont.)

Inter-operability with remote sites (software and services) plays an increasingly important role in our software choices

Transition with security and access issues

Migration will take longer and be more difficult than generally expected → change in hardware and software needs to be complemented by a change in management philosophy