
Review of Recent CASTOR Database Problems at RAL

Gordon D. Brown

Rutherford Appleton Laboratory

3D/WLCG Workshop, CERN, Geneva

11th-14th November 2008

Overview

• Current setup

• Issues

• Lessons Learnt

• Monitoring

• Future

RAL CASTOR Architecture

• Our setup is for:
– Atlas (Stager, SRM)
– CMS (Stager, SRM)
– LHCb (Stager, SRM)
– General (SRM)
– Name Server
– DLF
– Gen Stager
– Repack

RAL CASTOR Architecture

• 12 nodes to use
– Need production and test

• Options included:
– Single instance (or small cluster) for each schema
– One huge RAC
– Combination of the above

• Constraints:
– Licenses
– Single points of failure (did lose all paths at one point)
– Resources

RAL CASTOR Architecture

• Outcome
– 2 x 5-node production clusters
– 1 x 2-node test cluster

Node     Schemas
neptune1 Atlas DLF, LHCb DLF
neptune2 Atlas SRM
neptune3 LHCb Stager
neptune4 LHCb SRM
neptune5 Atlas Stager
pluto1   Name Server, CMS Stager
pluto2   CMS SRM
pluto3   Gen Stager
pluto4   Gen SRM, Repack
pluto5   CMS DLF, Gen DLF

RAL CASTOR Architecture

• Oracle Enterprise RAC
– Production 10.2.0.4
– Test 10.2.0.3
– All clusters patched with the July CPU

• Backups
– RMAN to disk
– Tape to Atlas Data Store

• Monitoring
– Oracle Enterprise Manager
– Nagios and Ganglia on machines
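A minimal sketch of the RMAN-to-disk backup described above (channel name and format path are illustrative, not the production ones):

run {
  allocate channel d1 device type disk format '/backup/rman/%U';
  backup database plus archivelog;
}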

Village of CASTOR, Cambridgeshire, UK

Issues – “crosstalk”

• Terminology
– SQL executing in the wrong schema

• Issue
– 14,000 files lost on LHCb

• Evidence
– Garbage collection on CASTOR: “Deleting local file which is no longer in the stager catalog”
– Also in the LHCb stager log: “No object found for id : 1517806678”
– That id is in the Atlas files2delete table
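A hypothetical check for this kind of crosstalk (the schema and column names below are assumptions, not the production ones): look up the id the LHCb stager could not find in each stager schema's files2delete table.

-- illustrative schema/column names; id taken from the log line above
select count(*) from atlas_stager.files2delete where fileid = 1517806678;
select count(*) from lhcb_stager.files2delete  where fileid = 1517806678;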

Issues – “crosstalk”

• Suspicion
– Not seen by Oracle in 10.2.0.3
– Redo logs inconclusive
– Lots of areas with possible wrong config:
• Disk server tnsnames entries
• IP addresses for VIPs on database servers
• Puppet config (on disk servers and central servers)
• Connection to the wrong schema

• Outcome
– Synchronisation is suspended
– Haven’t recreated it
– Difficult for Oracle to analyse

Issues – core dumping

• Issue
– ORA-600 sometimes when deleting from the id2type table
– Happens twice a week on average

• Evidence
– Seen on at least two stager schemas (and nodes)
– Application and Oracle logs

• Outcome
– Application recovers
– SR open and RDA being performed

Issues – cursor invalidation

• Issue
– Detected after getting a DML partition lock (ORA-14403)

• Strangeness
– Oracle say it was resolved in 10.2.0.4 (which we’re on!)
– Action from Oracle: “nothing to be done, error should never be returned to user”
– Cannot recreate at will

• Outcome
– SR open
– Parameter to implement (needs an instance restart)

Issues – constraint violations

• Issue
– Violation of a primary key constraint (ORA-00001)
– Seen on the Atlas Stager id2type table
– Complicated

• Outcome
– Implemented Eric’s code to trap the error and log it to the alert log (will take effect once existing Stager processes are restarted)
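Eric’s actual code isn’t reproduced here; a minimal sketch of the trapping technique, assuming execute rights on sys.dbms_system:

declare
  dup_key exception;
  pragma exception_init(dup_key, -1);  -- map ORA-00001
  l_id number := 8868517;              -- illustrative id
begin
  insert into id2type (id, type) values (l_id, 1002);
exception
  when dup_key then
    -- dest = 2 writes the message into the instance alert log
    sys.dbms_system.ksdwrt(2, 'ORA-00001 trapped on ID2TYPE, id = ' || l_id);
end;
/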

Issues – Big IDs

• Issue
– Huge numbers appearing in INSERT statements
– Not from any sequences on the database
– Complicated

Example:

insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values ('8868517','1002');
insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values ('8868518','1008');
insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values ('58432730170283524000','1005');
insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values ('58432730307722478000','1002');
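One hypothetical way to flag such rogue values (the sequence name IDS_SEQ is an assumption): compare stored ids against the sequence’s high-water mark, since no database sequence could have produced anything larger.

-- ids above the sequence high-water mark cannot have come from it
select id
from   srmcms.id2type
where  id > (select last_number
             from   dba_sequences
             where  sequence_owner = 'SRMCMS'
             and    sequence_name  = 'IDS_SEQ');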

Issues – performance

• Issue 1
– Stale statistics appeared even though they had been gathered
– Noticed because of poor performance
– Re-gathered, pool flushed and all was fine

• Issue 2
– A well-used SQL query’s time degraded on the Stager (by 300%)
– A new SQL Profile improved performance again
– Due to stats on fluctuating tables?
– Cluster waits on Atlas, high network I/O on Atlas/LHCb
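The Issue 1 remedy, sketched (owner and table names are illustrative):

begin
  -- re-gather statistics on the affected table, including its indexes
  dbms_stats.gather_table_stats(ownname => 'STAGER',
                                tabname => 'SUBREQUEST',
                                cascade => true);
end;
/
-- force cached plans to be re-parsed against the fresh statistics
alter system flush shared_pool;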

Issues – performance

• Issue 3
– CPU load increasing over 3-4 days
– Bonny cleared up the subrequest table
– Shrank the table and it was solved
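A minimal sketch of that shrink (table name from the slide; online segment shrink assumes an ASSM tablespace):

alter table subrequest enable row movement;
-- release the space freed by the clean-up back to the tablespace
alter table subrequest shrink space cascade;
alter table subrequest disable row movement;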

Monitoring

• DB Load
– Difficult to know if load is linked to requests/files
– Tools showing CASTOR “load” would be useful
– Is the application “good” at being on RAC?

• Oracle Services
– Currently one “preferred” node and one “available” node for each schema
– Stagers fail over to SRM, for example
– Are two nodes per Stager better?
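The preferred/available layout in 10g srvctl terms (database, service and instance names here are hypothetical):

srvctl add service -d neptune -s atlas_stager_svc -r neptune5 -a neptune2
srvctl start service -d neptune -s atlas_stager_svc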

Lessons Learnt 1

• Machine configuration
– Be careful with tnsnames
– IP and VIP addresses need care
– Hardware should be similar
– Schema names are similar

• Database Administration
– We can add/remove a cluster node without downtime
– Tuning, shrinking and profiles experience
– LogMiner skills
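The sort of LogMiner session behind those skills (the archived log path is hypothetical):

begin
  dbms_logmnr.add_logfile(logfilename => '/arch/neptune_1_1234.arc',
                          options     => dbms_logmnr.new);
  dbms_logmnr.start_logmnr(options => dbms_logmnr.dict_from_online_catalog);
end;
/
-- inspect the redo generated against the problem table
select scn, sql_redo from v$logmnr_contents where seg_name = 'ID2TYPE';
begin
  dbms_logmnr.end_logmnr;
end;
/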

Lessons Learnt 2

• Volume
– Very high number of transactions
– 200GB of archived redo logs per day (DB is only 80GB)
– Recovery would be an issue? Image copies?
– Need lots of space for LogMiner

• Space
– Space needed for analysis (e.g. LogMiner)
– More space needed for redo logs/backups
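One answer to the “image copies?” question: RMAN’s incrementally updated image copies, so a restore need not replay a full day’s 200GB of redo (the tag name is illustrative):

run {
  recover copy of database with tag 'castor_img';
  backup incremental level 1 for recover of copy with tag 'castor_img' database;
}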

CASTOR River, Ontario, Canada

People

• DBAs
– Team of four
– Good to share skills and experience
– Not enough knowledge of application
– Pressure

• CASTOR team at RAL
– Excellent communication with DBAs
– Gained knowledge of databases
– Difficult to know if database or application at fault

People

• CERN and other Tier-1s
– Invaluable support
– Good communication via email lists
– Thanks!
– More work together for future architecture
– Wiki page appreciated

• Oracle
– Metalink support has been very good

Next Steps

• Set-up
– Moving to single instance for 2-3 weeks
– Don’t change too much at once!
– Difficult to rule out DB issues
– Hardware resilience
– Auditing? Overhead.

• Performance
– Any more data to clean out?
– Tune more SQL
– More tests on failover
– Backup/recovery
– Proactivity
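The kind of statement the “Auditing? Overhead.” bullet weighs up (object name taken from the earlier Big IDs slide):

-- by-access auditing records every DML statement on the table:
-- useful forensics for crosstalk/big-id issues, but it adds load
audit insert, update, delete on srmcms.id2type by access;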

CASTOR star in Gemini (second brightest)

Questions for CERN/Tier-1s

• CASTOR Reporting Tools
– Shaun has produced stats on SRM showing transactions
– What do others use?
– What would be useful?

• Monitoring
– What do you monitor (DB and application)?
– What’s important in the logs?
– Any custom threshold alerts in OEM/LEMON?

Questions for CERN/Tier-1s

• Database
– Do you gather stats every night? Full?
– Any other regular DB jobs? Shrinking?
– Amount of transactions/redo logs?
– CPU levels?
– Plans for 11g?
– Backups – full? Level 1? Validate every night?

• People
– How many DBAs (working on CASTOR)?
– DBAs’ knowledge of the application?
– 3D/CASTOR collaboration

Questions and (hopefully) Answers

[email protected]