Enabling Grids for E-sciencE
The EGEE Production Grid: A Bird's-Eye View
Ian Bird
EGEE Operations Manager
EGEE'06
Geneva, 27th September 2006
EGEE-II INFSO-RI-031688
EGEE and gLite are registered trademarks
EGEE'06; Geneva; 25-29th September 2006
Outline
• Some history
  - What led up to where we are now?
• What is the EGEE grid infrastructure today?
  - What has been achieved?
  - How does it compare and relate to other production grids?
• What is the outlook in the short term?
  - Timescale of EGEE-II …
  - What are the outstanding issues?
• What should happen next?
Some history … LHC EGEE Grid
• 1999 - MONARC Project
  - Early discussions on how to organise distributed computing for LHC
• 2000 - growing interest in grid technology
  - HEP community was the driver in launching the DataGrid project
• 2001-2004 - EU DataGrid project
  - middleware & testbed for an operational grid
• 2002-2005 - LHC Computing Grid (LCG)
  - deploying the results of DataGrid to provide a production facility for LHC experiments
• 2004-2006 - EU EGEE project phase 1
  - starts from the LCG grid
  - shared production infrastructure
  - expanding to other communities and sciences
• 2006-2008 - EU EGEE-II
  - Building on phase 1
  - Expanding applications and communities …
• … and in the future
  - Worldwide grid infrastructure??
  - Interoperating and co-operating infrastructures?
Some history …
• EGEE grew out of the EDG and LCG projects
  - LCG built the first production middleware distributions and set up the initial grid infrastructure, which became EGEE in 2004
  - HEP (and LHC in particular) are very strong drivers for EGEE
  - Branding is changing slowly (gLite-3.0)
• Difficult to get the right balance:
  - LCG is pushing the boundaries
    - Data sizes and rates
    - Workloads: 50k jobs/day now; ~500k jobs/day in 1 year
    - MoU for service reliability/availability is first real SLA
  - Crucial that other applications push as hard
    - Biomedical: application security aspects
    - And others with their own requirements …
    - Data challenges are a good way to move forward: LCG, Wisdom
• Is the balance right?
  - Everyone complains equally …
The EGEE Infrastructure
Test-beds & Services
• Certification testbeds (SA3)
• Pre-production service
• Production service
Support Structures
• Operations Coordination Centre
• Regional Operations Centres
• Global Grid User Support
• EGEE Network Operations Centre (SA2)
• Operational Security Coordination Team
Security & Policy Groups
• Operations Advisory Group (+NA4)
• Joint Security Policy Group
• EUGridPMA (& IGTF)
• Grid Security Vulnerability Group
Infrastructure:
• Physical test-beds & services
• Support organisations & procedures
• Policy groups
Certification & release preparation
• The goal is to produce a middleware distribution that can be deployed widely
  - Not the same as middleware releases from development projects
  - More like a Linux distribution: bringing together many pieces from several sources
Middleware release
• Technical Coordination Group
  - Agrees the contents and priorities for what goes into the integration and testing process
• Not all desired new components or updates may make the next distribution
  - Depends on priorities and urgency for other pieces
• Moving away from big-bang releases to component upgrades
  - Concept of a baseline release and then updates and patches
  - New baseline when significant changes (dependencies, …)
Certification
• Extensive certification test-bed:
  - Close to 100 machines involved
  - Main test-bed at CERN, test-beds for specific tasks at SA3 partner sites
• Emulate the deployment environments
  - Or at least the main ones …
• Certification testing:
  - Installation and configuration
  - Component (service) functionality
  - System testing (trying to emulate real workloads and stress testing)
  - Beginning to use virtualization to simplify the testing environment
• Deployment into the pre-production system
  - Final step of certification: validation by real sites
  - Validation by applications, which also allows apps to be prepared for new versions
Snapshot of (part of) the certification test-bed
Middleware releases & deployment
• Once a distribution is ready for deployment, it takes several months to get it to the majority of sites
  - Seems to be a constant
  - Advantage of decoupling the components: VOs can encourage sites to update the pieces they require
  - Client tools can be simply installed (remotely) even without site upgrade
• Deployment onto the EGEE infrastructure is managed and supported by the Regional Operations Centres
Pre-production service
• Pre-production service is now ~20 sites
• Provides access to some 500 CPU
  - Some sites allow access to their full production batch systems for scale tests
• Sites install and test different configurations and sets of services
  - Try to get a good feeling for the quality of the release or updates before general release to production
  - Feedback to: certification, integration, developers, etc.
• P-PS is now used in the way it was intended
  - For some time it was acting as a second certification test-bed for the gLite-1.x branch
  - Some services may be demonstrated in this environment before going to production (or they may need more work)
Production service
Size of the infrastructure today:
• 192 sites in 40 countries
• ~25 000 CPU
• ~3 PB disk, + tape MSS
[Charts: sites; CPU]
EGEE Resources
Region           #countries  #sites   #cpu  #cpu DoW  disk (TB)
CERN                      0       1   4400      1800       770*
UK/I                      2      23   4306      2010       310
Italy                     1      27   2800      2280       373
France                    1      10   2316      1252       300*
De/CH                     2      13   2895      1852       280*
Northern Europe           6      16   2379      1860        64
SW Europe                 2      13    956       898        16*
SE Europe                 8      26   1101      1189        30
Central Europe            7      21   1584      1163        70
Russia                    1      15    515       445        38
Asia-Pacific              8      19    840       751        72
North America             2       8   4069         -       229
Totals                   40     192  28161     20265      2552
* Estimates taken from reporting as IS publishes total MSS space
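The totals row of the resources table can be cross-checked with a few lines of Python. This is only a sketch over the figures as transcribed in the table; the variable and function names are illustrative. The "-" entry for North America is treated as missing, and the "#cpu DoW" column is summed but not expected to match, since the rows as transcribed add up to 15500 rather than the 20265 given in the totals row.

```python
# Rows as transcribed from the table:
# (region, countries, sites, cpu, cpu_dow, disk_tb).
# None marks the "-" entry; the "*" (estimate) markers are dropped.
ROWS = [
    ("CERN",            0,  1, 4400, 1800, 770),
    ("UK/I",            2, 23, 4306, 2010, 310),
    ("Italy",           1, 27, 2800, 2280, 373),
    ("France",          1, 10, 2316, 1252, 300),
    ("De/CH",           2, 13, 2895, 1852, 280),
    ("Northern Europe", 6, 16, 2379, 1860,  64),
    ("SW Europe",       2, 13,  956,  898,  16),
    ("SE Europe",       8, 26, 1101, 1189,  30),
    ("Central Europe",  7, 21, 1584, 1163,  70),
    ("Russia",          1, 15,  515,  445,  38),
    ("Asia-Pacific",    8, 19,  840,  751,  72),
    ("North America",   2,  8, 4069, None, 229),
]

STATED_TOTALS = {"countries": 40, "sites": 192, "cpu": 28161,
                 "cpu_dow": 20265, "disk_tb": 2552}


def column_sums(rows):
    """Sum each numeric column, counting missing (None) entries as zero."""
    names = ("countries", "sites", "cpu", "cpu_dow", "disk_tb")
    return {name: sum(row[i + 1] or 0 for row in rows)
            for i, name in enumerate(names)}


sums = column_sums(ROWS)
for name, total in sums.items():
    status = "matches" if total == STATED_TOTALS[name] else "differs from"
    print(f"{name}: {total} ({status} stated total {STATED_TOTALS[name]})")
```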
Usage of the infrastructure
[Chart: EGEE workload, jobs/month by VO, Jan-05 to Aug-06; y-axis 0 to 1,800,000. VOs: alice, atlas, biomed, cms, compchem, dteam, egeode, egrid, esr, fusion, geant4, lhcb, magic, ops, planck, other VOs]
[Chart: Normalized CPU time, kSI2k.hours by VO, Jan-05 to Aug-06; y-axis 0 to 6,000,000. Same VOs]
>50k jobs/day
~7000 CPU-months/month
Non-LHC VOs
[Chart: EGEE workload, jobs/month; y-axis 0 to 250,000. VOs: biomed, compchem, egeode, egrid, esr, fusion, geant4, magic, ops, planck, other VOs]
[Chart: Normalized CPU time, kSI2k.hours; y-axis 0 to 800,000. VOs: biomed, compchem, dteam, egeode, egrid, esr, fusion, geant4, magic, ops, planck, other VOs]
Workloads of the "other VOs" start to be significant: approaching 8-10K jobs per day, and 1000 cpu-months/month
• one year ago this was the overall scale of work for all VOs
Use of the infrastructure
20k jobs running simultaneously
Use for massive data transfer
Large LHC experiments now transferring ~ 1PB/month each
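To give a feel for what ~1 PB/month means as a sustained rate, the conversion can be sketched as below. The decimal petabyte (10^15 bytes) and the 30-day month are assumptions made for the illustration, not figures from the slides.

```python
def sustained_rate_mb_per_s(petabytes_per_month, days_per_month=30):
    """Average transfer rate in MB/s for a monthly volume.

    Assumes decimal units (1 PB = 1e15 bytes, 1 MB = 1e6 bytes)
    and a 30-day month by default.
    """
    total_bytes = petabytes_per_month * 1e15
    seconds = days_per_month * 24 * 3600
    return total_bytes / seconds / 1e6


rate = sustained_rate_mb_per_s(1.0)
# Convert MB/s to Gbit/s: multiply by 8 bits per byte, divide by 1000.
print(f"{rate:.0f} MB/s, about {rate * 8 / 1000:.1f} Gbit/s sustained")
```

Under those assumptions, each such experiment needs an average of a few hundred MB/s around the clock, before allowing for any peaks.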
Grid management: structure
• Operations Coordination Centre (OCC)
  - management, oversight of all operational and support activities
• Regional Operations Centres (ROC)
  - providing the core of the support infrastructure, each supporting a number of resource centres within its region
  - Grid Operator on Duty
• Resource centres
  - providing resources (computing, storage, network, etc.)
• Grid User Support (GGUS)
  - At FZK: coordination and management of user support, single point of contact for users
Grid monitoring
The goal is to proactively monitor the operational state of the Grid and its performance, initiating corrective action to remedy problems arising with either core infrastructure or Grid resources
[Diagram: Regional Operations Centres, each with Resource Centres; OSCT; Grid Operator on Duty (COD). Label: Monitoring shows a problem]
Grid Operator on Duty
• Role:
  - Watch the problems detected by the grid monitoring tools
  - Problem diagnosis
  - Report these problems (GGUS tickets)
  - Follow and escalate them if needed (well defined procedure)
  - Provide help, propose solutions
  - Build and maintain a central knowledge database (WIKI)
• Who?
  - 9 ROC teams working in pairs (one lead and one backup) on a weekly rotation
  - CERN, France, Italy, UK, Russia, Asia-Pacific, Southeastern-Europe, Central-Europe, Germany-Switzerland
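The weekly lead/backup rotation of the Grid Operator on Duty teams could be sketched as follows. The slides only say that 9 ROC teams work in pairs on a weekly rotation; the pairing rule used here (this week's backup becomes next week's lead) and the team order are assumptions made purely for illustration.

```python
from itertools import islice

# Team names as listed on the slide; the order is arbitrary here.
ROC_TEAMS = ["CERN", "France", "Italy", "UK", "Russia", "Asia-Pacific",
             "Southeastern-Europe", "Central-Europe", "Germany-Switzerland"]


def cod_rota(teams):
    """Yield (week number, lead, backup) indefinitely.

    ASSUMPTION: simple round-robin in which the backup of one week
    is the lead of the following week.
    """
    week = 0
    while True:
        lead = teams[week % len(teams)]
        backup = teams[(week + 1) % len(teams)]
        yield week + 1, lead, backup
        week += 1


for week, lead, backup in islice(cod_rota(ROC_TEAMS), 4):
    print(f"week {week}: lead={lead}, backup={backup}")
```

With nine teams, this hypothetical scheme puts each team on lead duty once every nine weeks.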
Grid monitoring tools
• Tools used by the Grid Operator on Duty team to detect problems
• Distributed responsibility
• CIC portal
  - single entry point
  - Integrated view of monitoring tools
• Site Functional Tests (SFT) -> Service Availability Monitoring (SAM)
• Grid Operations Centre Core Database (GOCDB)
• GIIS monitor (Gstat)
• GOC certificate lifetime
• GOC job monitor
• Others
Site Functional Tests
• Site Functional Tests (SFT)
  - Framework to test (sample) services at all sites
  - Shows results matrix
  - Detailed test log available for troubleshooting and debugging
  - History of individual tests is kept
  - Can include VO-specific tests (e.g. sw environment)
  - Normally >80% of sites pass SFTs
• Very important in stabilising sites:
  - Apps use only good sites
  - Bad sites are automatically excluded
  - Sites work hard to fix problems
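The idea behind the Site Functional Tests results matrix, where applications use only sites that pass and failing sites are excluded, can be illustrated with a small sketch. The site names, test names and the notion of a "critical" test set below are hypothetical; this is not the SFT framework's actual data model.

```python
# Hypothetical results matrix: site -> {test name -> passed?}
RESULTS = {
    "site-a": {"job-submit": True, "replica-mgmt": True,  "sw-env": True},
    "site-b": {"job-submit": True, "replica-mgmt": False, "sw-env": True},
    "site-c": {"job-submit": True, "replica-mgmt": True,  "sw-env": True},
    "site-d": {"job-submit": True, "replica-mgmt": True,  "sw-env": False},
    "site-e": {"job-submit": True, "replica-mgmt": True,  "sw-env": True},
}

# ASSUMPTION: only these tests decide whether a site counts as "good".
CRITICAL = ("job-submit", "replica-mgmt")


def good_sites(results, critical_tests):
    """Sites that pass every critical test; a missing result counts as a failure."""
    return sorted(site for site, tests in results.items()
                  if all(tests.get(test, False) for test in critical_tests))


def pass_fraction(results, critical_tests):
    """Fraction of all sites that pass the critical tests."""
    return len(good_sites(results, critical_tests)) / len(results)


print(good_sites(RESULTS, CRITICAL))
print(f"{pass_fraction(RESULTS, CRITICAL):.0%} of sites pass")
```

A VO-specific test such as the software environment check could be added to the critical set for that VO only, which is one way to read the "can include VO-specific tests" point.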
Service Availability Monitoring
• Service Availability Monitoring (SAM)
  - Monitoring of all grid services
  - web service based access to data
  - availability metric calculation
  - Will be used to generate alarms
    - to generate trouble tickets
    - to call out support staff
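As a rough illustration of an availability metric feeding alarms, the sketch below computes availability as the fraction of passed test samples in a time window and derives follow-up actions when it drops below a threshold. Both the metric definition and the 90% threshold are assumptions for the example; the slides do not specify how SAM calculates availability.

```python
def availability(samples):
    """Fraction of test samples in a window that passed (True)."""
    if not samples:
        raise ValueError("no samples in window")
    return sum(samples) / len(samples)


def actions(service, samples, threshold=0.9):
    """Return follow-up actions if availability is below the threshold.

    ASSUMPTION: the threshold value and the action wording are hypothetical.
    """
    avail = availability(samples)
    if avail >= threshold:
        return []
    return [
        f"alarm: {service} availability {avail:.0%} below {threshold:.0%}",
        f"open trouble ticket for {service}",
        f"call out support staff for {service}",
    ]


print(actions("ce01.example.org", [True, True, False, True]))
```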
Site metrics - availability
Support - GGUS
GGUS
The EGEE Network Operations Centre
• Creating a "Network Support unit" in the EGEE operational model;
• Based on the work done during EGEE:
• First implementation in EGEE-II:
  - First "iteration";
  - Planned developments in the next months.
• Tasks:
  - Receive tickets from NRENs, and forward to GGUS if impact on grid
  - Receive tickets from GGUS if TPM determines a network issue
  - Troubleshoot them provided that the ENOC has access to suitable monitoring tools;
  - Contact identified faulty domains or reassign ticket to the associated site if there is no evidence of a backbone problem (e.g. LAN issue).
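The ENOC tasks listed above amount to a small routing decision per ticket, which could be sketched like this. The `Ticket` fields and the returned action strings are hypothetical names chosen for the illustration; they do not describe the real GGUS or ENOC interfaces.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    source: str                        # "NREN" or "GGUS" (hypothetical labels)
    impacts_grid: bool = False         # relevant for tickets coming from NRENs
    monitoring_available: bool = True  # can the ENOC inspect the path?
    backbone_evidence: bool = False    # evidence of a backbone problem?


def route(ticket):
    """Return the next action for a ticket, following the listed ENOC tasks."""
    if ticket.source == "NREN":
        # Only network tickets that affect the grid are passed on.
        return "forward to GGUS" if ticket.impacts_grid else "no grid impact"
    if ticket.source == "GGUS":
        # Troubleshooting depends on access to suitable monitoring tools.
        if not ticket.monitoring_available:
            return "cannot troubleshoot"
        if ticket.backbone_evidence:
            return "contact faulty domain"
        # No backbone evidence, e.g. a LAN issue at the site.
        return "reassign to site"
    raise ValueError(f"unknown ticket source: {ticket.source}")


print(route(Ticket(source="NREN", impacts_grid=True)))
print(route(Ticket(source="GGUS")))
```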
Current ENOC status
• Interface with NRENs is running as in EGEE:
  - ENOC receives Trouble Tickets (incident, maintenance) from GÉANT and the NRENs (currently France, Germany, Greece, Hungary, Ireland, Italy, Russia, Spain, Switzerland, and United Kingdom);
    - More to come: Poland, the Netherlands, Czech Republic;
  - Forwards them to GGUS after analysis and if relevant to EGEE.
• Identified as the Network Support unit in GGUS:
  - 2nd level support for network related issues
• Identified as the point of contact for EGEE by the NRENs and GÉANT2
[Diagram labels: GGUS, Users, Support Units, ENOC, NRENs, GÉANT2, EGEE Network]
Security & Policy
Collaborative policy development
  - Many policy aspects are collaborative works; e.g.:
• Joint Security Policy Group
• Certification Authorities
  - EUGridPMA, IGTF, etc.
• Grid Acceptable Use Policy (AUP)
  - common, general and simple AUP
  - for all VO members using many Grid infrastructures
    - EGEE, OSG, SEE-GRID, DEISA, national Grids…
• Incident Handling and Response
  - defines basic communications paths
  - defines requirements (MUSTs) for IR
  - not to replace or interfere with local response plans
[Diagram labels: Security & Availability Policy; Usage Rules; Certification Authorities; Audit Requirements; Incident Response; User Registration & VO Management; Application Development & Network Admin Guide; VO Security]
Security groups
• Joint Security Policy Group
  - Joint with WLCG, OSG, and others
  - Focus on policy issues
  - Strong input to e-IRG
• EUGridPMA
  - Pan-European trust federation of CAs
  - Included in IGTF (and was model for it)
  - Success: most grid projects now subscribe to the IGTF
• Grid Security Vulnerability Group
  - New group in EGEE-II
  - Looking at how to manage vulnerabilities
  - Risk analysis is fundamental
  - Hard to balance between openness and giving away insider info
• Operational Security Coordination Team
  - Main day-to-day operational security work
  - Incident response and follow up
  - Members in all ROCs and sites
  - Recent security incident (not grid-related) was good shakedown
[Diagram: TAGPMA (The Americas Grid PMA); EUGridPMA (European Grid PMA); APGridPMA (Asia-Pacific Grid PMA)]
Operational Advisory Group
• Role
  - Negotiate access to resources for applications and VOs
  - Manage procedures:
    - To recognize new VOs & define MoUs
  - Identify and manage major procedural problems between VOs and Operations
• Membership
  - Co-chaired by SA1 and NA4
  - Members: VO Managers, ROC managers
• Status
  - New simpler VO registration procedure in place
  - MoU with DILIGENT in progress
  - Tools to show high level resource allocation by region and VO are planned
• Issues
  - Resource negotiation procedures have to be developed
    - This has to be done by region
    - Resource allocation summary tools are a pre-requisite
    - Escalation procedures in case of unsatisfied requests have to be found
  - The operation of the OAG itself has to be changed
    - No EGAAP any longer
    - User Forum and EGEE Conference now more important for face-to-face meetings
Interoperation
• Interoperability and interoperation (or co-operation)
• EGEE has interoperability activities with: (enabling the middlewares to work together)
  - Open Science Grid (U.S.): quite far advanced
  - NorduGrid (ARC): task in EGEE-II, 4 workshops and ongoing activity
  - UNICORE: task in EGEE-II
  - NAREGI (Japan): 1 workshop, continued activity
  - GIN (OGF): active in several areas
• EGEE has interoperation activities with: (enabling the infrastructures to co-operate)
  - Open Science Grid: actually in use
  - Anticipated with NorduGrid (NDGF) for WLCG
Interoperating information systems
[Diagram: information systems of EGEE, OSG, NAREGI, TeraGrid, PRAGMA, NorduGrid]
Related infrastructure projects
DEISA, TeraGrid
Coordination in SA1 for:
• EELA, BalticGrid, EUMedGrid, EUChinaGrid, SEE-GRID
Interoperation with:
• OSG, NAREGI
SA3:
• DEISA, ARC, NAREGI
Summary of status
• Today we have an operating production infrastructure
  - Probably the largest in the world, supporting many science domains
  - Relied upon by several as their primary source of computing
• We have a managed operations process addressing most areas
  - Constantly evolving
• Inter/Co-operation is a fact and is becoming more important very quickly
  - Several applications need to work across grids, and they need support for that
• A large fraction of the value of the operations activity is in the intangibles: processes, structures, expertise, etc.
• We recognise that there are many outstanding problems with the current state of things
Some (personal) observations
• Production grids turned out to be a lot harder than anticipated
  - Often not production quality software, not designed with services and service management in mind; it really should be regarded as advanced prototypes
  - Rediscovered the wheel: it takes a long time to go from prototypes to production quality
  - We have done a lot, but it is hard to use, hard to support, and hard to manage …
• Complexity
  - We have a lot of complexity, often a clue that something is not right
  - Perhaps it is necessary …? But we should be careful.
• Many of the reliability issues are not grid specific
  - Site management problems are reflected in the overall service
• … and any major changes we make have to be implemented in such a way that does not break the production service
Short term issues
• Urgently need to support more platforms
  - Migration to GT4 (pre-WS)
  - Simplify effort involved in porting
• Deployment of new services
  - Especially secure data management
  - Better support for MPI
  - …
• Site reliability, stability
  - Understanding of the system in detail
  - Not more monitoring tools
  - We need more sensors (do we monitor enough or the right things?)
  - We need data interpretation (knowledge not information)
    - And use that to generate actions: alarms, self-repairing systems, …
  - We have to manage a dynamically changing distributed system with many levels of reliability, management, stability
    - And hide this from the applications!
• How do we scale up to 10-50 times the workload we have now?
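To put the 10-50 times scale-up question from the short term issues in numbers, the sketch below scales the ">50k jobs/day" figure quoted earlier in the talk. Treating 50,000 as the exact baseline and assuming an even spread of submissions over the day are simplifications for illustration.

```python
SECONDS_PER_DAY = 86_400
CURRENT_JOBS_PER_DAY = 50_000  # baseline taken from the ">50k jobs/day" figure


def scaled(jobs_per_day, factor):
    """Return (jobs per day, average jobs per second) after scaling."""
    per_day = jobs_per_day * factor
    return per_day, per_day / SECONDS_PER_DAY


for factor in (1, 10, 50):
    per_day, per_sec = scaled(CURRENT_JOBS_PER_DAY, factor)
    print(f"x{factor}: {per_day:,} jobs/day, {per_sec:.1f} jobs/s on average")
```

The factor of 10 reproduces the ~500k jobs/day expected by the LHC experiments within a year; the factor of 50 corresponds to 2.5 million jobs/day.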
Long term outlook
• Reliability & management must be built in, not added on
  - We have been through 1 round of "re-engineering"; now we need the "real thing"
• We need a real architecture (avoid "VO Boxes")
  - How do we deploy application level services in an acceptable way?
• Information systems:
  - Crucial to a grid infrastructure, more and more information needs to be published
  - See the limits of current system
  - Our experience and knowledge is encapsulated in GLUE, not the implementation
• Security model
  - Proxy renewals? Is this what we want?
  - Complexity again
  - Where does Shibboleth fit?
  - Is there a better way?
• Dynamic VOs (the original idea of what a VO could be)
  - How to achieve this? It seems a long way off
  - cf. ITU exercise where we hi-jacked an existing VO
Standardisation
• Interoperation, interoperability is the best way to drive real standards
  - But must take care not to constrain ourselves: things are still changing rapidly
• See already practical work on information systems, data, job submission, etc.
  - Both in EGEE and related projects and in the GGF/OGF GIN work
• EGEE Operations have a wealth of experience now
  - Procedures, issues, what works, what does not work
  - The problem is finding the time to publish this knowledge
  - We need to start documenting this now, especially at the ROC level
• The value of EGEE is in the infrastructure: not just the production service, but all the parts that fit around that
  - This is more or less independent of any specific set of middleware (although that makes life more or less easy …)
Conclusions
• We have come a long way in the last 5-6 years
  - From specific solutions for HEP to a vision of a global infrastructure
  - But we need to be careful to walk before we run … and clarify expectations
• Many complex issues need to be addressed
• We should expect to see major changes in implementations
  - But the infrastructure should remain and evolve across these changes
• Standardisation will come with co-operating grids
• Many opportunities for collaboration
  - With other projects, with industry, with applications