Page 1: EGEE-II INFSO-RI-031688

EGEE-II INFSO-RI-031688 EGEE and gLite are registered trademarks

The EGEE Production Grid

A Bird’s-Eye View

Ian Bird

EGEE Operations Manager

EGEE’06

Geneva, 27th September 2006

Enabling Grids for E-sciencE

Page 2: EGEE-II INFSO-RI-031688

[email protected] EGEE’06; Geneva; 25-29th September 2006

Outline

• Some history
  – What led up to where we are now?
• What is the EGEE grid infrastructure today?
  – What has been achieved?
  – How does it compare and relate to other production grids?
• What is the outlook in the short term?
  – Timescale of EGEE-II …
  – What are the outstanding issues?
• What should happen next?

Page 3: EGEE-II INFSO-RI-031688

Some history … LHC EGEE Grid

• 1999 – Monarc Project
  – Early discussions on how to organise distributed computing for LHC
• 2000 – growing interest in grid technology
  – HEP community was the driver in launching the DataGrid project
• 2001-2004 – EU DataGrid project
  – middleware & testbed for an operational grid
• 2002-2005 – LHC Computing Grid – LCG
  – deploying the results of DataGrid to provide a production facility for LHC experiments
• 2004-2006 – EU EGEE project phase 1
  – starts from the LCG grid
  – shared production infrastructure
  – expanding to other communities and sciences
• 2006-2008 – EU EGEE-II
  – Building on phase 1
  – Expanding applications and communities …
• … and in the future
  – Worldwide grid infrastructure??
  – Interoperating and co-operating infrastructures?

CERN

Page 4: EGEE-II INFSO-RI-031688

Some history …

• EGEE grew out of the EDG and LCG projects
  – LCG built the first production middleware distributions and set up the initial grid infrastructure, which became EGEE in 2004
  – HEP (and LHC in particular) are very strong drivers for EGEE
  – Branding is changing slowly (gLite-3.0)
• Difficult to get the right balance:
  – LCG is pushing the boundaries
      • Data sizes and rates
      • Workloads: 50k jobs/day now; ~500k jobs/day in 1 year
      • MoU for service reliability/availability is first real SLA
  – Crucial that other applications push as hard
      • Biomedical – application security aspects
      • And others with their own requirements …
      • Data challenges are a good way to move forward: LCG, Wisdom
• Is the balance right?
  – Everyone complains equally …

Page 5: EGEE-II INFSO-RI-031688

The EGEE Infrastructure

Infrastructure:
• Physical test-beds & services
• Support organisations & procedures
• Policy groups

Test-beds & Services
• Certification testbeds (SA3)
• Pre-production service
• Production service

Support Structures
• Operations Coordination Centre
• Regional Operations Centres
• Global Grid User Support
• EGEE Network Operations Centre (SA2)
• Operational Security Coordination Team

Security & Policy Groups
• Operations Advisory Group (+NA4)
• Joint Security Policy Group
• EuGridPMA (& IGTF)
• Grid Security Vulnerability Group

Page 6: EGEE-II INFSO-RI-031688

Certification & release preparation

• The goal is to produce a middleware distribution that can be deployed widely
  – Not the same as middleware releases from development projects
  – More like a Linux distribution – bringing together many pieces from several sources

Page 7: EGEE-II INFSO-RI-031688

Middleware release

• Technical Coordination Group
  – Agrees the contents and priorities for what goes into the integration and testing process
• Not all desired new components or updates may make the next distribution
  – Depends on priorities and urgency for other pieces
• Moving away from big-bang releases to component upgrades
  – Concept of a baseline release and then updates and patches
  – New baseline when significant changes (dependencies, …)

Page 8: EGEE-II INFSO-RI-031688

Certification

• Extensive certification test-bed:
  – Close to 100 machines involved
  – Main test-bed at CERN, test-beds for specific tasks at SA3 partner sites
• Emulate the deployment environments
  – Or at least the main ones …
• Certification testing:
  – Installation and configuration
  – Component (service) functionality
  – System testing (trying to emulate real workloads and stress testing)
  – Beginning to use virtualization to simplify the testing environment
• Deployment into the pre-production system
  – Final step of certification – validation by real sites
  – Validation by applications – also allows apps to be prepared for new versions

Page 9: EGEE-II INFSO-RI-031688

Snapshot of (part of) the certification test-bed

Page 10: EGEE-II INFSO-RI-031688

Middleware releases & deployment

• Once a distribution is ready for deployment, it takes several months to get this to the majority of sites
  – Seems to be a constant
  – Advantage of decoupling the components – VOs can encourage sites to update the pieces they require
  – Client tools can be simply installed (remotely) even without site upgrade
• Deployment onto the EGEE infrastructure is managed and supported by the Regional Operations Centres

Page 11: EGEE-II INFSO-RI-031688

Pre-production service

• Pre-production service is now ~20 sites
• Provides access to some 500 CPU
  – Some sites allow access to their full production batch systems for scale tests
• Sites install and test different configurations and sets of services
  – Try to get a good feeling for the quality of the release or updates before general release to production
  – Feedback to: certification, integration, developers, etc.
• P-PS is now used in the way it was intended
  – For some time it was acting as a second certification test-bed for the gLite-1.x branch
  – Some services may be demonstrated in this environment before going to production (or they may need more work)

Page 12: EGEE-II INFSO-RI-031688

Production service

Size of the infrastructure today:

• 192 sites in 40 countries
• ~25 000 CPU
• ~3 PB disk, + tape MSS

[Charts: sites; CPU]

Page 13: EGEE-II INFSO-RI-031688

EGEE Resources

Region            #countries  #sites   #cpu  #cpu DoW  disk (TB)
CERN                       0       1   4400      1800        770*
UK/I                       2      23   4306      2010        310
Italy                      1      27   2800      2280        373
France                     1      10   2316      1252        300*
De/CH                      2      13   2895      1852        280*
Northern Europe            6      16   2379      1860         64
SW Europe                  2      13    956       898         16*
SE Europe                  8      26   1101      1189         30
Central Europe             7      21   1584      1163         70
Russia                     1      15    515       445         38
Asia-Pacific               8      19    840       751         72
North America              2       8   4069         -        229
Totals                    40     192  28161     20265       2552

* Estimates taken from reporting as IS publishes total MSS space
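
As a quick consistency check on the resources table, the per-region columns can be summed and compared with the Totals row. This is only a sketch: the figures are copied from the table as transcribed, the names `regions` and `column_totals` are illustrative, and the "#cpu DoW" column is left out because North America has no entry there.

```python
# Per-region figures copied from the table: (countries, sites, cpu, disk_tb).
# The "#cpu DoW" column is omitted because one region has no value.
regions = {
    "CERN": (0, 1, 4400, 770),
    "UK/I": (2, 23, 4306, 310),
    "Italy": (1, 27, 2800, 373),
    "France": (1, 10, 2316, 300),
    "De/CH": (2, 13, 2895, 280),
    "Northern Europe": (6, 16, 2379, 64),
    "SW Europe": (2, 13, 956, 16),
    "SE Europe": (8, 26, 1101, 30),
    "Central Europe": (7, 21, 1584, 70),
    "Russia": (1, 15, 515, 38),
    "Asia-Pacific": (8, 19, 840, 72),
    "North America": (2, 8, 4069, 229),
}


def column_totals(rows):
    """Sum each column across all regions."""
    return tuple(sum(column) for column in zip(*rows.values()))


print(column_totals(regions))  # (40, 192, 28161, 2552)
```

These four sums agree with the Totals row of the table.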

Page 14: EGEE-II INFSO-RI-031688

Usage of the infrastructure

[Chart: EGEE workload, jobs/month per VO, Jan-05 to Aug-06, axis 0 to 1,800,000. VOs shown: alice, atlas, biomed, cms, compchem, dteam, egeode, egrid, esr, fusion, geant4, lhcb, magic, ops, planck, other VOs]

[Chart: Normalized CPU time, k.SI2k.hours per VO, Jan-05 to Aug-06, axis 0 to 6,000,000. Same VOs]

>50k jobs/day

~7000 CPU-months/month
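
The two headline figures (>50k jobs/day and ~7000 CPU-months/month) can be put on the same monthly footing as the chart axes with a little arithmetic. The sketch below assumes a nominal 30-day month, which the slides do not specify. Comparing the CPU figure with the k.SI2k.hours axis would also need an assumed per-CPU rating, which the slides do not give either.

```python
DAYS_PER_MONTH = 30                    # assumption: nominal 30-day month
HOURS_PER_MONTH = 24 * DAYS_PER_MONTH  # 720 hours


def jobs_per_month(jobs_per_day):
    """Convert a daily job rate into a monthly job count."""
    return jobs_per_day * DAYS_PER_MONTH


def cpu_hours_per_month(cpu_months):
    """Convert CPU-months delivered in one month into CPU-hours."""
    return cpu_months * HOURS_PER_MONTH


print(jobs_per_month(50_000))      # 1500000
print(cpu_hours_per_month(7_000))  # 5040000
```

Under these assumptions, 50k jobs/day corresponds to about 1.5 million jobs per month, and 7000 CPU-months/month to about 5 million CPU-hours per month.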

Page 15: EGEE-II INFSO-RI-031688

Non-LHC VOs

[Chart: EGEE workload, jobs/month, axis 0 to 250,000. VOs shown: planck, ops, magic, geant4, fusion, esr, egrid, egeode, compchem, biomed, other VOs]

[Chart: Normalized CPU time, k.SI2k.hours, axis 0 to 800,000. VOs shown: planck, ops, magic, geant4, fusion, esr, egrid, egeode, dteam, compchem, biomed, other VOs]

Workloads of the “other VOs” start to be significant – approaching 8-10K jobs per day, and 1000 CPU-months/month

• one year ago this was the overall scale of work for all VOs

Page 16: EGEE-II INFSO-RI-031688

Use of the infrastructure

20k jobs running simultaneously

Page 17: EGEE-II INFSO-RI-031688

Use for massive data transfer

Large LHC experiments now transferring ~ 1PB/month each

Page 18: EGEE-II INFSO-RI-031688

Grid management: structure

• Operations Coordination Centre (OCC)
  – management, oversight of all operational and support activities
• Regional Operations Centres (ROC)
  – providing the core of the support infrastructure, each supporting a number of resource centres within its region
  – Grid Operator on Duty
• Resource centres
  – providing resources (computing, storage, network, etc.)
• Global Grid User Support (GGUS)
  – At FZK, coordination and management of user support, single point of contact for users

Page 19: EGEE-II INFSO-RI-031688

Grid monitoring

The goal is to proactively monitor the operational state of the Grid and its performance, initiating corrective action to remedy problems arising with either core infrastructure or Grid resources

[Diagram: OSCT; Grid Operator on-duty (COD); several Regional Operations Centres, each with its Resource Centres. Annotation: Monitoring shows a problem]

Page 20: EGEE-II INFSO-RI-031688

Grid Operator on Duty

• Role:
  – Watch the problems detected by the grid monitoring tools
  – Problem diagnosis
  – Report these problems (GGUS tickets)
  – Follow and escalate them if needed (well defined procedure)
  – Provide help, propose solutions
  – Build and maintain a central knowledge database (wiki)
• Who?
  – 9 ROC teams working in pairs (one lead and one backup) on a weekly rotation
  – CERN, France, Italy, UK, Russia, Asia-Pacific, Southeastern-Europe, Central-Europe, Germany-Switzerland
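
The weekly lead/backup rotation among the nine ROC teams can be illustrated with a short sketch. The slide does not say how pairs are formed; the scheme below (this week's backup becomes next week's lead) is purely an assumption, and `cod_rotation` is a hypothetical helper name.

```python
from itertools import islice

# The nine ROC teams listed on the slide, in the order given there.
ROC_TEAMS = [
    "CERN", "France", "Italy", "UK", "Russia", "Asia-Pacific",
    "Southeastern-Europe", "Central-Europe", "Germany-Switzerland",
]


def cod_rotation(teams):
    """Yield (week, lead, backup) indefinitely.

    Assumed pairing: the backup is the team that leads the following week.
    """
    week = 0
    while True:
        lead = teams[week % len(teams)]
        backup = teams[(week + 1) % len(teams)]
        yield week + 1, lead, backup
        week += 1


for week, lead, backup in islice(cod_rotation(ROC_TEAMS), 3):
    print(f"week {week}: lead={lead}, backup={backup}")
```

With nine teams, this assumed scheme repeats every nine weeks, and each team serves one week as backup immediately before its week as lead.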

Page 21: EGEE-II INFSO-RI-031688

Grid monitoring tools

• Tools used by the Grid Operator on Duty team to detect problems
• Distributed responsibility
• CIC portal
  – single entry point
  – Integrated view of monitoring tools
• Site Functional Tests (SFT) -> Service Availability Monitoring (SAM)
• Grid Operations Centre Core Database (GOCDB)
• GIIS monitor (Gstat)
• GOC certificate lifetime
• GOC job monitor
• Others

Page 22: EGEE-II INFSO-RI-031688

Site Functional Tests

• Site Functional Tests (SFT)
  – Framework to test (sample) services at all sites
  – Shows results matrix
  – Detailed test log available for troubleshooting and debugging
  – History of individual tests is kept
  – Can include VO-specific tests (e.g. sw environment)
  – Normally >80% of sites pass SFTs
• Very important in stabilising sites:
  • Apps use only good sites
  • Bad sites are automatically excluded
  • Sites work hard to fix problems

Page 23: EGEE-II INFSO-RI-031688

Service Availability Monitoring

• Service Availability Monitoring (SAM)
  – Monitoring of all grid services
  – web service based access to data
  – availability metric calculation
  – Will be used to generate alarms
      • to generate trouble tickets
      • to call out support staff
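
To make "availability metric calculation" and alarm generation concrete, here is a minimal sketch. It is not SAM's actual algorithm, which the slides do not describe: it assumes availability is simply the fraction of passed test results for a service, and the 0.8 alarm threshold and all names are hypothetical.

```python
def availability(results):
    """Fraction of test results that passed (results is a list of booleans)."""
    if not results:
        return None  # no data: availability is unknown rather than zero
    return sum(results) / len(results)


def needs_alarm(results, threshold=0.8):
    """True when measured availability falls below the (assumed) threshold."""
    value = availability(results)
    return value is not None and value < threshold


# Hypothetical test histories for two services.
samples = {
    "site-a CE": [True] * 9 + [False],
    "site-b SE": [True, False, False, False],
}
for service, results in samples.items():
    print(service, availability(results), needs_alarm(results))
```

Treating "no results" as unknown rather than as zero availability is a design choice made here so that a service that has not yet been tested does not raise an alarm.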

Page 24: EGEE-II INFSO-RI-031688

Site metrics - availability

Page 25: EGEE-II INFSO-RI-031688

Support - GGUS

Page 26: EGEE-II INFSO-RI-031688

GGUS

Page 27: EGEE-II INFSO-RI-031688

The EGEE Network Operations Centre

• Creating a “Network Support unit” in the EGEE operational model;
• Based on the work done during EGEE:
• First implementation in EGEE-II:
  – First “iteration”;
  – Planned developments in the next months.
• Tasks:
  – Receive tickets from NRENs, and forward to GGUS if impact on grid
  – Receive tickets from GGUS if TPM determines a network issue
  – Troubleshoot them provided that the ENOC has access to suitable monitoring tools;
  – Contact identified faulty domains or reassign ticket to the associated site if there is no evidence of a backbone problem (e.g. LAN issue).
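
The ENOC ticket flow in the task list can be sketched as a small routing function. This is an illustration only: the `Ticket` fields, the returned action strings, and the "no action" outcome for NREN tickets without grid impact are assumptions, not part of any ENOC or GGUS interface.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    source: str                     # "NREN" or "GGUS" (hypothetical labels)
    affects_grid: bool = False      # NREN tickets: does it impact the grid?
    backbone_problem: bool = False  # GGUS tickets: evidence of a backbone fault?


def route(ticket):
    """Return the ENOC action for a ticket, following the task list."""
    if ticket.source == "NREN":
        # Forward to GGUS only if the incident has an impact on the grid.
        return "forward to GGUS" if ticket.affects_grid else "no action"
    if ticket.source == "GGUS":
        # TPM has already classified this as a network issue.
        if ticket.backbone_problem:
            return "contact faulty domain"
        return "reassign to site"  # e.g. a LAN issue at the site
    raise ValueError(f"unknown ticket source: {ticket.source}")


print(route(Ticket("NREN", affects_grid=True)))
print(route(Ticket("GGUS", backbone_problem=False)))
```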

Page 28: EGEE-II INFSO-RI-031688

Current ENOC status

• Interface with NRENs is running as in EGEE:
  – ENOC receives Trouble Tickets (incident, maintenance) from GÉANT and the NRENs (currently France, Germany, Greece, Hungary, Ireland, Italy, Russia, Spain, Switzerland, and United Kingdom);
      • More to come: Poland, the Netherlands, Czech Republic;
  – Forwards them to GGUS after analysis and if relevant to EGEE.
• Identified as the Network Support unit in GGUS:
  – 2nd level support for network related issues
• Identified as the point of contact for EGEE by the NRENs and GÉANT2

[Diagram: Users, GGUS, Support Units, ENOC, NRENs, GÉANT2, EGEE Network]

Page 29: EGEE-II INFSO-RI-031688

Security & Policy

Collaborative policy development
  – Many policy aspects are collaborative works; e.g.:

• Joint Security Policy Group
• Certification Authorities
  – EUGridPMA, IGTF, etc.
• Grid Acceptable Use Policy (AUP)
  – common, general and simple AUP
  – for all VO members using many Grid infrastructures
      • EGEE, OSG, SEE-GRID, DEISA, national Grids…
• Incident Handling and Response
  – defines basic communications paths
  – defines requirements (MUSTs) for IR
  – not to replace or interfere with local response plans

[Diagram: Security & Availability Policy, surrounded by Usage Rules; Certification Authorities; Audit Requirements; Incident Response; User Registration & VO Management; Application Development & Network Admin Guide; VO Security]

Page 30: EGEE-II INFSO-RI-031688

Security groups

• Joint Security Policy Group:
  – Joint with WLCG, OSG, and others
  – Focus on policy issues
  – Strong input to e-IRG
• EUGridPMA
  – Pan-European trust federation of CAs
  – Included in IGTF (and was model for it)
  – Success: most grid projects now subscribe to the IGTF
• Grid Security Vulnerability Group
  – New group in EGEE-II
  – Looking at how to manage vulnerabilities
  – Risk analysis is fundamental
  – Hard to balance between openness and giving away insider info
• Operational Security Coordination Team
  – Main day-to-day operational security work
  – Incident response and follow up
  – Members in all ROCs and sites
  – Recent security incident (not grid-related) was good shakedown

[Diagram: TAGPMA (The Americas Grid PMA); EUGridPMA (European Grid PMA); APGridPMA (Asia-Pacific Grid PMA)]

Page 31: EGEE-II INFSO-RI-031688

Operational Advisory Group

• Role
  – Negotiate access to resources for applications and VOs
  – Manage procedures:
      • To recognize new VOs & define MoUs
  – Identify and manage major procedural problems between VOs and Operations
• Membership
  – Co-chaired by SA1 and NA4
  – Members: VO Managers, ROC managers
• Status
  – New simpler VO registration procedure in place
  – MoU with DILIGENT in progress
  – Tools to show high level resource allocation by region and VO are planned
• Issues
  – Resource negotiation procedures have to be developed
      • This has to be done by region
      • Resource allocation summary tools are a pre-requisite
      • Escalation procedures in case of unsatisfied requests have to be found
  – The operation of the OAG itself has to be changed
      • No EGAAP any longer
      • User Forum and EGEE Conference now more important for face-to-face meetings

Page 32: EGEE-II INFSO-RI-031688

Interoperation

• Interoperability and interoperation (or co-operation)
• EGEE has interoperability activities (enabling the middlewares to work together) with:
  – Open Science Grid (U.S.) – quite far advanced
  – Nordugrid (ARC) – task in EGEE-II, 4 workshops and ongoing activity
  – UNICORE – task in EGEE-II
  – NAREGI (Japan) – 1 workshop, continued activity
  – GIN (OGF) – active in several areas
• EGEE has interoperation activities (enabling the infrastructures to co-operate) with:
  – Open Science Grid – actually in use
  – Anticipated with NorduGrid (NDGF) for WLCG

Page 33: EGEE-II INFSO-RI-031688

Interoperating information systems

[Diagram: EGEE, OSG, Naregi, Teragrid, Pragma, Nordugrid]

Page 34: EGEE-II INFSO-RI-031688

Related infrastructure projects

[Map labels: DEISA, TeraGrid]

Coordination in SA1 for:
• EELA, BalticGrid, EUMedGrid, EUChinaGrid, SEE-GRID

Interoperation with:
• OSG, NAREGI

SA3:
• DEISA, ARC, NAREGI

Page 35: EGEE-II INFSO-RI-031688

Summary of status

• Today we have an operating production infrastructure
  – Probably the largest in the world, supporting many science domains
  – Relied upon by several as their primary source of computing
• We have a managed operations process addressing most areas
  – Constantly evolving
• Inter/Co-operation is a fact and is becoming more important very quickly
  – Several applications need to work across grids – and they need support for that
• A large fraction of the value of the operations activity is in the intangibles – processes, structures, expertise, etc.
• We recognise that there are many outstanding problems with the current state of things

Page 36: EGEE-II INFSO-RI-031688

Some (personal) observations

• Production grids turned out to be a lot harder than anticipated
  – Often not production quality software, not designed with services and service management in mind – really should regard as advanced prototypes
  – Rediscovered the wheel: takes a long time to go from prototypes to production quality
  – We have done a lot: but it is hard to use, hard to support, and hard to manage …
• Complexity
  – We have a lot of complexity – often a clue that something is not right
  – Perhaps it is necessary …? But we should be careful.
• Many of the reliability issues are not grid specific
  – Site management problems are reflected in the overall service
• … and any major changes we make have to be implemented in such a way that does not break the production service

Page 37: EGEE-II INFSO-RI-031688

Short term issues

• Urgently need to support more platforms
  – Migration to GT4 (pre-WS)
  – Simplify effort involved in porting
• Deployment of new services
  – Especially secure data management
  – Better support for MPI
  – …
• Site reliability, stability
  – Understanding of the system in detail
  – Not more monitoring tools
  – We need more sensors (do we monitor enough or the right things?)
  – We need data interpretation (knowledge not information)
      • And use that to generate actions – alarms, self-repairing systems, …
  – We have to manage a dynamically changing distributed system with many levels of reliability, management, stability
      • And hide this from the applications!
• How do we scale up to 10-50 times the workload we have now?

Page 38: EGEE-II INFSO-RI-031688

Long term outlook

• Reliability & management must be built in, not added on
  – We have been through 1 round of “re-engineering” – now we need the “real thing”
• We need a real architecture (avoid “VO Boxes”)
  – How do we deploy application level services in an acceptable way?
• Information systems:
  – Crucial to a grid infrastructure, more and more information needs to be published
  – See the limits of current system
  – Our experience and knowledge is encapsulated in GLUE, not the implementation
• Security model
  – Proxy renewals? Is this what we want?
  – Complexity again
  – Where does Shibboleth fit?
  – Is there a better way?
• Dynamic VOs (the original idea of what a VO could be)
  – How to achieve this? It seems a long way off
  – cf ITU exercise where we hi-jacked an existing VO

Page 39: EGEE-II INFSO-RI-031688

Standardisation

• Interoperation, interoperability is the best way to drive real standards
  – But must take care not to constrain ourselves – things are still changing rapidly
• See already practical work on information systems, data, job submission, etc.
  – Both in EGEE and related projects and in the GGF/OGF GIN work
• EGEE Operations have a wealth of experience now
  – Procedures, issues, what works, what does not work
  – The problem is finding the time to publish this knowledge
  – We need to start documenting this now – especially at the ROC level
• The value of EGEE is in the infrastructure – not just the production service, but all the parts that fit around that
  – This is more or less independent of any specific set of middleware (although that makes life more or less easy …)

Page 40: EGEE-II INFSO-RI-031688

Conclusions

• We have come a long way in the last 5-6 years
  – From specific solutions for HEP to a vision of a global infrastructure
  – But we need to be careful to walk before we run … and clarify expectations
• Many complex issues need to be addressed
• We should expect to see major changes in implementations
  – But the infrastructure should remain and evolve across these changes
• Standardisation will come with co-operating grids
• Many opportunities for collaboration
  – With other projects, with industry, with applications