uri lifshitz - optare consulting 03 stc10977 *dsnv086e -dbx1 db2 abnormal termination...

Post on 20-Apr-2018

Uri Lifshitz, Optare Consulting, Israel

Uri@OptareConsulting.com

“Saya No Uchi” (鞘の内で勝つ)

Literally: “The victory is in the scabbard”

Meaning: The victory is decided before the battle has begun.

July 10 2

Average cost of computer downtime (Source: Giga Group)

Understanding

Preparation

Training

Important note: we will not talk about simple scenarios. If you want to read chapter 20 in the Administration Guide, feel free.

We will talk about some nasty things that could happen to your:

DB2 Catalog

DB2 Directory

DB2 Bootstrap datasets

Am I going to talk about Data Sharing?

No, because Data Sharing works:

If you lose one DB2, the group still functions

We all know about Restart Light

CF structure duplexing

The main DRP advantage of Data Sharing is that a downed DB2 is not a crisis

BUT the problem is still getting the fallen DB2 back up.

Advice: If you want to hear about Data Sharing, go to a performance presentation.

DB2 Bootstrap datasets.

Simple questions:

Are you using dual BSDS?

More complicated question:

How often do you back up your BSDS?

Trick question: BSDS are backed up automatically when you archive your active log datasets.

Do you remember where your BSDS backups reside?

Do you know how to recover your BSDS?

Recover both or just the corrupted BSDS?

Another trick question: you recycled DB2 and only one of your BSDS is corrupted. What will happen?

14.30.51 STC10977 DSNY001I -DBX1 SUBSYSTEM STARTING

14.30.52 STC10977 DSNJ107I -DBX1 READ ERROR ON BSDS 350

350 DSNAME=DSNDBX0.DBX1.BSDS01, ERROR STATUS=0874

14.30.52 STC10977 DSNJ117I -DBX1 INITIALIZATION ERROR READING BSDS 351

351 DSNAME=DSNDBX0.DBX1.BSDS01, ERROR STATUS=0874

14.30.53 STC10977 DSNJ119I -DBX1 BOOTSTRAP ACCESS INITIALIZATION PROCESSING FA

14.31.03 STC10977 *DSNV086E -DBX1 DB2 ABNORMAL TERMINATION REASON=00E80084

14.31.03 STC10977 IEA794I SVC DUMP HAS CAPTURED: 357

357 DUMPID=001 REQUESTED BY JOB (DBX1MSTR)

357 DUMP TITLE=DBX1,ABND=04E-00E80084,U=SYSOPR ,M=(?),C=810.IPC

357 SNYSIRM,M=DSNYECTE,PSW=077C2000A021B300,A=0079

14.31.05 STC10977 IEF450I DBX1MSTR DBX1MSTR - ABEND=S04E U0000 REASON=00E80084

373 TIME=14.31.05

You guessed it: DB2 will not go up.
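
When this happens at 2 a.m., the first job is to confirm which BSDS copy the messages point at. The following is a minimal Python sketch, assuming the console output has been saved as text in the layout shown above (a DSNJ107I or DSNJ117I message followed by a continuation line carrying DSNAME=). The function name and the parsing rules are illustrative assumptions, not part of DB2 or of the original presentation.

```python
import re

# Message IDs copied from the console log above.
BSDS_ERROR_IDS = {"DSNJ107I", "DSNJ117I"}


def find_failing_bsds(console_lines):
    """Return the BSDS dataset names named in BSDS read-error messages.

    Assumption: the DSNAME= continuation line directly follows the
    message line (blank lines in between are ignored).
    """
    failing = set()
    expect_dsname = False
    for line in console_lines:
        if not line.strip():
            continue  # skip blank lines without losing state
        if any(token in BSDS_ERROR_IDS for token in line.split()):
            expect_dsname = True
            continue
        if expect_dsname:
            match = re.search(r"DSNAME=([A-Z0-9.$#@]+)", line)
            if match:
                failing.add(match.group(1))
            expect_dsname = False
    return failing


sample_log = [
    "14.30.51 STC10977 DSNY001I -DBX1 SUBSYSTEM STARTING",
    "14.30.52 STC10977 DSNJ107I -DBX1 READ ERROR ON BSDS 350",
    "350 DSNAME=DSNDBX0.DBX1.BSDS01, ERROR STATUS=0874",
    "14.30.52 STC10977 DSNJ117I -DBX1 INITIALIZATION ERROR READING BSDS 351",
    "351 DSNAME=DSNDBX0.DBX1.BSDS01, ERROR STATUS=0874",
]
print(sorted(find_failing_bsds(sample_log)))
```

On the log from this slide, the sketch reports only BSDS01, which matches the scenario: one corrupted copy is enough to stop the restart.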

How do you fix that?

Easy, just copy the correct BSDS over the corrupted BSDS. What’s the problem?

Now suppose you copy a backup over the corrupted BSDS instead.

WWDD?

You guessed it again: DB2 will not start.

15.38.06 STC00002 DSNY001I -DBX1 SUBSYSTEM STARTING

15.38.07 STC00002 DSNJ127I -DBX1 SYSTEM TIMESTAMP FOR BSDS= 09.160 05:46:48.3

15.38.09 STC00002 DSNJ012I -DBX1 DSNJR005 ERROR 00D10348 READING RBA 856

856 0003F4FAA000 IN DATA SET DSNDBX0.DBX1.LOGCOPY1.DS04.

856 CONNECTION-ID=DBX1, CORRELATION-ID=004.JW006 00

15.38.15 STC00002 IEA794I SVC DUMP HAS CAPTURED: 858

858 DUMPID=002 REQUESTED BY JOB (DBX1MSTR)

858 DUMP TITLE=DBX1,ABND=04E-00D10348,U=SYSOPR ,M=(?),C=810.RLM

858 SNJLGR ,M=DSNJRE01,LOC=DSNJL002.DSNJR005+0408

15.38.15 STC00002 DSNJ232I -DBX1 OUTPUT DATA SET CONTROL 859

859 INITIALIZATION PROCESSING FAILED

15.38.15 STC00002 *DSNV086E -DBX1 DB2 ABNORMAL TERMINATION REASON=00E80084

15.38.17 STC00002 IEF450I DBX1MSTR DBX1MSTR - ABEND=S04E U0000 REASON=00E80084

895 TIME=15.38.17
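
The DSNJ127I line is the quickest clue in this log: it shows the system timestamp recorded in the BSDS that DB2 actually opened. A small Python sketch for pulling that timestamp out of a saved console line follows. It assumes the value is laid out as yy.ddd hh:mm:ss.f (two-digit year, day of year), which is how it appears above; the function name is hypothetical.

```python
import re
from datetime import datetime


def parse_bsds_timestamp(line):
    """Extract the BSDS system timestamp from a DSNJ127I console line.

    Assumption: the timestamp is formatted as yy.ddd hh:mm:ss.f,
    as in the log shown above. Returns None if no timestamp is found.
    """
    match = re.search(r"BSDS=\s*(\d{2}\.\d{3} \d{2}:\d{2}:\d{2}\.\d+)", line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%y.%j %H:%M:%S.%f")


line = "15.38.07 STC00002 DSNJ127I -DBX1 SYSTEM TIMESTAMP FOR BSDS= 09.160 05:46:48.3"
print(parse_bsds_timestamp(line))
```

Comparing that value with the time of the last log activity you expect is one way to spot a restored BSDS that is older than the active logs it is supposed to describe.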

We all know that the DB2 catalog and directory are in sync.

Until they are not…

And then we have to fix it. But sometimes it’s not so simple to even know you have a problem in your Catalog or Directory.

When will you know that your DB2 has a catalog problem?

Myth number 1: If I have a serious catalog problem, DB2 will not start.

Let’s test that:

We decided that emptying the SYSDBASE dataset is a serious enough catalog problem.

Guess what? ’START DB2’ normal completion:

11.54.20 STC06004 DSN9022I -DBX1 DSNYASCP 'START DB2' NORMAL COMPLETION

11.54.26 STC06004 DSNP012I -DBX1 DSNPCNP0 - ERROR IN VSAM CATALOG 531

531 LOCATE FUNCTION FOR DSNDBX0.DSNDBC.DSNDB06.SYSDBASE.I0001.A00

531 CTLGRC=AAAAAA08

531 CTLGRSN=AAAAAA08

531 CONNECTION-ID=DB2CALL, CORRELATION-ID=TMONDB2,

531 LUW-ID=*

11.54.29 STC06004 DSNP012I -DBX1 DSNPCNP0 - ERROR IN VSAM CATALOG 619

619 LOCATE FUNCTION FOR DSNDBX0.DSNDBC.DSNDB06.SYSDBASE.I0001.A00

619 CTLGRC=AAAAAA08

Maybe deleting SYSDBASE was not serious enough for you?

How about deleting the SYSCOPY VSAM dataset?

15.38.55 STC06005 DSN9022I -DBX1 DSNYASCP 'START DB2' NORMAL COMPLETION

That’s right: DB2 starts with no problem.

You will discover the problem only when you try to access SYSCOPY.

14.35.10 STC06053 DSNP012I -DBX2 DSNPCNP0 - ERROR IN VSAM CATALOG 031

031 LOCATE FUNCTION FOR DSNDBX0.DSNDBC.DSNDB06.SYSCOPY.I0001.A001

031 CTLGRC=AAAAAA08

031 CTLGRSN=AAAAAA08

031 CONNECTION-ID=DB2CALL, CORRELATION-ID=URI,

031 LUW-ID=*

Until then, your DB2 works as if it were business as usual.

Now you need to:

Recover your SYSCOPY

While other utilities get -904 on SYSCOPY

P.S.

Do you know where your SYSCOPY backups are? (Because you damn sure can’t look them up in SYSIBM.SYSCOPY)

OK, let’s pick on the Directory for a change.

I believe one of the worst risks is having a garbled DBD in your Directory.

So we decided to garble some DBDs.

Do it yourself: how to garble your DBDs in three easy steps

1. Use DSN1PRNT to print DSNDB01

2. Use the REPAIR utility to locate the target database

3. Use the REPAIR utility with the REPLACE option at your database’s DBD offset

Voila! You have caused your very own 00C90101

How to fix this?

Locate the corrupted object in the DB2 Catalog and Directory using the REPAIR utility.

Fix the corruption using REPAIR or a targeted recovery rather than recovering all of your catalog/directory.

Simple questions:

When was the last time you did any of the operations I just mentioned?

And how long will it take you to do it now?

Understanding

Preparation

Training

Some things you can do before the crisis:

1. Work Procedures

2. Prepared JCL and Commands

3. Products

4. Team Preparation

Have a clear crisis management procedure:

Who is managing the crisis (Only ONE!)

Who is the technical manager (Only one)

Who should be notified

Timelines for escalation

Periodic checkpoints
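
One way to keep such a procedure honest is to write it down as data rather than as a memo. Here is a minimal Python sketch of that idea; the class, field names, people, and time limits are all hypothetical and only mirror the five points above. Using a single field for each manager role makes "only one" a property of the structure rather than a rule someone has to remember.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class CrisisProcedure:
    crisis_manager: str      # exactly one person manages the crisis
    technical_manager: str   # exactly one person leads the technical work
    notify: list = field(default_factory=list)
    # (elapsed time since the crisis started, who to escalate to) pairs
    escalation: list = field(default_factory=list)
    checkpoint_every: timedelta = timedelta(minutes=30)

    def due_escalations(self, elapsed):
        """Return everyone who should have been escalated to by now."""
        return [who for after, who in sorted(self.escalation) if elapsed >= after]


# Illustrative values only.
procedure = CrisisProcedure(
    crisis_manager="Dana",
    technical_manager="Avi",
    notify=["operations shift lead", "application owners"],
    escalation=[
        (timedelta(minutes=30), "IT director"),
        (timedelta(hours=2), "CIO"),
    ],
)
print(procedure.due_escalations(timedelta(hours=1)))
```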

As silly as it may sound:

Having the right JCL with the right commands will save time and prevent mistakes.

The last thing you want is to open a book and start pondering which command does what you need.

Some products that you really want to have:

Log Scanners (backing out several bad changes or updating the DRP site)

Can I suggest BMC LOG MASTER?

System Recovery Tools (returning your whole system to a safe point)

Can I suggest BMC RECOVERY MANAGER?

Utility Enhancers (quicker copy/recover, massive utility generation)

Can I suggest BMC COPY/RECOVERY PLUS?

Why Utility Enhancers?

Because sometimes you will need to rebuild hundreds of indexes following a recovery

Because you want all your recovery jobs to run in parallel

Bottom line: you want to finish your recovery as soon as possible

Why use System Recovery Tools?

Because you want to be able to recover EVERYTHING to ANY point in time, and you want it done now.

Why use Log Scanners?

Because you might want to know what happened during the time you skip.

Because you want to know who did what before the crisis

To keep your recovery site updated

Hey Uri, why do you want us to buy so many products? Are these guys paying you?

Company   Money Uri got for this presentation
BMC       12 NIS (Moshe bought me a cup of coffee)
IBM       0 NIS
CA        0 NIS
Others    0 NIS

There are a lot of things you can do so that your team will be ready to deal with a crisis:

Knowledge management mechanism (May I suggest MediaWiki? It’s free)

Share information among team members

Make sure everyone is aware of the procedures

But I believe that the most important thing you could do to make your team ready to handle an emergency is:

Understanding

Preparation

Training

What should you do?

Revise your recovery work procedures

Meet with colleagues and compare procedures

Keep your team up to date (reading, lectures, DB2 courses and certification tests)

Select one DBA to coordinate and test recovery scenarios

Remember: nothing is better than hands-on experience!

Perform DRP Drills!

The methodology we developed at Discount Bank:

Let the DRP coordinator simulate a DRP scenario in the system environment

Let another DBA at random handle the problem
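
The random pick is the part people tend to skip, so it helps to make it mechanical. Below is a minimal Python sketch of the assignment step, assuming the coordinator has already chosen the scenario and must not be the one who handles it. The function name, the team names, and the scenario text are hypothetical.

```python
import random


def plan_drill(scenario, dbas, coordinator, rng=None):
    """Assign a coordinator-chosen scenario to a randomly selected other DBA."""
    rng = rng or random.Random()
    # The coordinator set up the failure, so exclude them from handling it.
    candidates = [dba for dba in dbas if dba != coordinator]
    if not candidates:
        raise ValueError("need at least one DBA besides the coordinator")
    return {
        "scenario": scenario,
        "coordinator": coordinator,
        "handler": rng.choice(candidates),
    }


team = ["Dana", "Avi", "Noa", "Yossi"]  # illustrative names
drill = plan_drill("one corrupted BSDS copy at restart", team, coordinator="Dana")
print(drill)
```

Passing in a seeded `random.Random` makes the draw repeatable if you want to record how assignments were made.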

This is the only way to be really ready.

Hope you got some good ideas from this presentation
