Review of Recent CASTOR Database Problems at RAL
Gordon D. Brown
Rutherford Appleton Laboratory
3D/WLCG Workshop, CERN, Geneva
11th-14th November 2008
RAL CASTOR Architecture
• Our setup is for:
– Atlas (Stager, SRM)
– CMS (Stager, SRM)
– LHCb (Stager, SRM)
– General (SRM)
– Name Server
– DLF
– Gen Stager
– Repack
RAL CASTOR Architecture
• 12 nodes to use
– Need both production and test
• Options included:
– Single instance (or small cluster) for each schema
– One huge RAC
– Combination of the above
• Constraints:
– Licences
– Single points of failure (did lose all paths at one point)
– Resources
RAL CASTOR Architecture
• Outcome
– 2 x 5-node production clusters
– 1 x 2-node test cluster
neptune1 – Atlas DLF, LHCb DLF
neptune2 – Atlas SRM
neptune3 – LHCb Stager
neptune4 – LHCb SRM
neptune5 – Atlas Stager
pluto1 – Name Server, CMS Stager
pluto2 – CMS SRM
pluto3 – Gen Stager
pluto4 – Gen SRM, Repack
pluto5 – CMS DLF, Gen DLF
RAL CASTOR Architecture
• Oracle Enterprise RAC
– Production 10.2.0.4
– Test 10.2.0.3
– All clusters patched with the July CPU
• Backups
– RMAN to disk
– Tape to the Atlas Data Store
• Monitoring
– Oracle Enterprise Manager
– Nagios and Ganglia on the machines
Issues – “crosstalk”
• Terminology
– SQL executing in the wrong schema
• Issue
– 14,000 files lost on LHCb
• Evidence
– Garbage collection on CASTOR: “Deleting local file which is no longer in the stager catalog”
– Also in the LHCb stager log: “No object found for id : 1517806678”
– That id is in the Atlas files2delete table
Issues – “crosstalk”
• Suspicion
– Not seen by Oracle in 10.2.0.3
– Redo logs inconclusive
– Many areas where the configuration could be wrong:
• Disk server tnsnames entries
• IP addresses for VIPs on the database servers
• Puppet config (on disk servers and central servers)
• Connection to the wrong schema
• Outcome
– Synchronisation is suspended
– Haven’t been able to recreate the problem
– Difficult for Oracle to analyse
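One way to pin down this kind of crosstalk is to take the id from the error message and check which instance’s tables it actually lives in. A minimal sketch of that cross-check, using in-memory sqlite3 databases as stand-ins for the Atlas and LHCb schemas (table and column names are illustrative, not the real CASTOR schema):

```python
import sqlite3

# Stand-ins for the two stager schemas; in reality these would be
# separate Oracle connections to the Atlas and LHCb instances.
atlas = sqlite3.connect(":memory:")
lhcb = sqlite3.connect(":memory:")
for db in (atlas, lhcb):
    db.execute("CREATE TABLE files2delete (id INTEGER PRIMARY KEY)")

# The orphan id was only ever inserted on the Atlas side.
atlas.execute("INSERT INTO files2delete (id) VALUES (1517806678)")

def owner_of(orphan_id):
    """Return which schema(s) actually contain the id from the error log."""
    owners = []
    for name, db in (("atlas", atlas), ("lhcb", lhcb)):
        row = db.execute(
            "SELECT 1 FROM files2delete WHERE id = ?", (orphan_id,)
        ).fetchone()
        if row:
            owners.append(name)
    return owners

# The LHCb stager logged "No object found for id : 1517806678",
# yet the id turns out to live in the Atlas table: crosstalk.
print(owner_of(1517806678))  # -> ['atlas']
```

Run against the real instances, an id reported missing by one stager but present in another schema’s tables is direct evidence of SQL landing in the wrong schema.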
Issues – core dumping
• Issue
– ORA-600 sometimes raised when deleting from the id2type table
– Happens twice a week on average
• Evidence
– Seen on at least two stager schemas (and nodes)
– Application and Oracle logs
• Outcome
– Application recovers
– SR open; RDA being performed
Issues – cursor invalidation
• Issue
– Detected after getting a DML partition lock error (ORA-14403)
• Strangeness
– Oracle says this was resolved in 10.2.0.4 (which we’re on!)
– Action from Oracle: “nothing to be done, error should never be returned to user”
– Cannot recreate at will
• Outcome
– SR open
– Parameter to implement (needs an instance restart)
Issues – constraint violations
• Issue
– Violation of a primary key constraint (ORA-00001)
– Seen on the Atlas Stager id2type table
– Complicated
• Outcome
– Implemented Eric’s code to trap the error and log it to the alert log (will take effect when the existing Stager processes are restarted)
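Eric’s trap is PL/SQL inside the Stager schema; the idea can be sketched in Python with sqlite3, where a duplicate-key insert surfaces as IntegrityError instead of ORA-00001 (table and function names here are illustrative):

```python
import logging
import sqlite3

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("alert")  # stand-in for the Oracle alert log

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE id2type (id INTEGER PRIMARY KEY, type INTEGER)")
db.execute("INSERT INTO id2type VALUES (8868517, 1002)")

def insert_id2type(ident, typ):
    """Insert a row; trap a duplicate-key error and log it instead of failing."""
    try:
        db.execute("INSERT INTO id2type VALUES (?, ?)", (ident, typ))
        return True
    except sqlite3.IntegrityError as exc:
        # Oracle equivalent: ORA-00001 unique constraint violated
        log.warning("constraint violation on id2type id=%s: %s", ident, exc)
        return False

print(insert_id2type(8868517, 1005))  # duplicate id -> False, and logged
```

The point of trapping rather than re-raising is exactly what the slide describes: the offending key ends up in the alert log for later diagnosis while the application keeps running.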
Issues – Big IDs
• Issue
– Huge numbers appearing in INSERT statements
– Not from any sequence on the database
– Complicated
• Example:
insert into "SRMCMS"."ID2TYPE" ("ID","TYPE") values ('8868517','1002');
insert into "SRMCMS"."ID2TYPE" ("ID","TYPE") values ('8868518','1008');
insert into "SRMCMS"."ID2TYPE" ("ID","TYPE") values ('58432730170283524000','1005');
insert into "SRMCMS"."ID2TYPE" ("ID","TYPE") values ('58432730307722478000','1002');
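Since a legitimate id cannot exceed the last value handed out by the sequence, anomalies like the two huge values above can be flagged by comparing each inserted id against the sequence high-water mark. A rough Python sketch (the high-water value is illustrative; in Oracle it would come from a query against user_sequences):

```python
# Hypothetical high-water mark read from the Oracle sequence;
# the value here is illustrative only.
SEQUENCE_HIGH_WATER = 10_000_000

# The four ids from the captured INSERT statements above.
inserted_ids = [
    8868517,
    8868518,
    58432730170283524000,  # far beyond any value a sequence has issued
    58432730307722478000,
]

def suspicious_ids(ids, high_water):
    """Flag ids that no database sequence could have generated."""
    return [i for i in ids if i > high_water]

print(suspicious_ids(inserted_ids, SEQUENCE_HIGH_WATER))
# -> [58432730170283524000, 58432730307722478000]
```

A check like this, run over mined redo or application logs, separates ordinary inserts from the anomalous ones and narrows the search to whatever client generated the oversized values.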
Issues – performance
• Issue 1
– Statistics went stale even though they had been gathered
– Noticed because of poor performance
– Re-gathered stats, flushed the pool, and all was fine
• Issue 2
– Query time of a well-used SQL statement degraded on the Stager (by 300%)
– A new SQL Profile restored performance
– Due to stats on fluctuating tables?
– Cluster waits on Atlas; high network I/O on Atlas/LHCb
Issues – performance
• Issue 3
– CPU load increasing over 3–4 days
– Bonny cleared up the subrequest table
– Shrank the table and the problem was solved
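Why shrinking helped: a full scan reads every allocated block, including the empty ones left behind by deletes, so a heavily churned table costs CPU long after its rows are gone. A toy model of the effect (not the real subrequest table; block counts are made up):

```python
# Toy model of a table segment: 1000 live rows plus 9000 blocks of
# dead space left behind by deletes.
segment = list(range(1000)) + [None] * 9000

def full_scan(seg):
    """A full table scan touches every allocated block, live or dead."""
    return sum(1 for _ in seg)

def shrink(seg):
    """Compact the segment and release the empty blocks
    (the effect of Oracle's ALTER TABLE ... SHRINK SPACE)."""
    return [row for row in seg if row is not None]

print(full_scan(segment))   # 10000 blocks scanned before the shrink
segment = shrink(segment)
print(full_scan(segment))   # 1000 blocks scanned afterwards
```

The same scan does a tenth of the work after compaction, which matches the observed drop in CPU load once the table was cleaned up and shrunk.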
CASTOR Oil Plant
Monitoring
• DB load
– Difficult to know whether it is linked to requests/files
– Tools measuring CASTOR “load” would be useful
– Is the application “good” at running on RAC?
• Oracle Services
– Currently one “preferred” node and one “available” node for each schema
– Stagers fail over to SRM nodes, for example
– Would two nodes per Stager be better?
Lessons Learnt 1
• Machine configuration
– Be careful with tnsnames
– IP and VIP addresses need care
– Hardware should be similar
– Schema names are similar
• Database administration
– We can add/remove a cluster node without downtime
– Gained experience with tuning, shrinking and SQL Profiles
– LogMiner skills
Lessons Learnt 2
• Volume
– Very high number of transactions
– 200 GB of archived redo logs per day (on an 80 GB database)
– Recovery would be an issue? Image copies?
– Need lots of space for LogMiner
• Space
– Space needed for analysis (e.g. LogMiner)
– More space needed for redo logs/backups
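The redo-to-database ratio makes the space problem concrete. A back-of-the-envelope calculation using the figures on this slide, with a hypothetical 1 TB analysis area for LogMiner:

```python
# Figures from the slide; the 1 TB analysis area is a made-up example.
REDO_GB_PER_DAY = 200
DB_SIZE_GB = 80
ANALYSIS_AREA_GB = 1000

# Redo generated each day relative to total database size.
ratio = REDO_GB_PER_DAY / DB_SIZE_GB

# How many days of archived redo a 1 TB area could hold for mining.
days_of_redo = ANALYSIS_AREA_GB // REDO_GB_PER_DAY

print(f"daily redo is {ratio:.1f}x the database size")   # 2.5x
print(f"1 TB holds {days_of_redo} days of redo")          # 5 days
```

Generating two and a half times the database’s size in redo every day is why both point-in-time recovery and LogMiner analysis become dominated by log storage rather than by the database itself.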
People
• DBAs
– Team of four
– Good to share skills and experience
– Not enough knowledge of the application
– Pressure
• CASTOR team at RAL
– Excellent communication with the DBAs
– Gained knowledge of databases
– Difficult to know whether the database or the application is at fault
People
• CERN and other Tier-1s
– Invaluable support
– Good communication via email lists
– Thanks!
– More work together on future architecture
– Wiki page appreciated
• Oracle
– Metalink support has been very good
Next Steps
• Set-up
– Moving to single instances for 2–3 weeks
– Don’t change too much at once!
– Difficult to rule out DB issues
– Hardware resilience
– Auditing? Overhead.
• Performance
– Any more data to clean out?
– Tune more SQL
– More failover tests
– Backup/recovery
– Proactivity
Questions for CERN/Tier-1s
• CASTOR reporting tools
– Shaun has produced SRM stats showing transactions
– What do others use?
– What would be useful?
• Monitoring
– What do you monitor (DB and application)?
– What’s important in the logs?
– Any custom threshold alerts in OEM/LEMON?
Questions for CERN/Tier-1s
• Database
– Do you gather stats every night? Full?
– Any other regular DB jobs? Shrinking?
– Volume of transactions/redo logs?
– CPU levels?
– Plans for 11g?
– Backups – full? Level 1? Validate every night?
• People
– How many DBAs (working on CASTOR)?
– DBAs’ knowledge of the application?
– 3D/CASTOR collaboration