uri lifshitz - optare consulting 03 stc10977 *dsnv086e -dbx1 db2 abnormal termination...

Uri Lifshitz Optare Consulting Israel [email protected]


Uri Lifshitz, Optare Consulting, Israel

[email protected]


“Saya No Uchi” (鞘の内で勝つ)

Literally: “The victory is in the scabbard”

Meaning: The victory is decided before the battle has begun.

July 10 2


[Figure: Average cost of computer downtime. Source: Giga Group]


Understanding

Preparation

Training


We will talk about some nasty things that could happen to your:

DB2 Catalog

DB2 Directory

DB2 Bootstrap datasets

Important note: we will not talk about simple scenarios. If you want to read chapter 20 in the Administration Guide, feel free.


Am I going to talk about Data Sharing?

No, because Data Sharing works:

If you lose one DB2, the group still functions

We all know about Restart Light

CF structure duplexing

The main DRP advantage of Data Sharing is that a downed DB2 is not a crisis

BUT the problem is still: getting the fallen DB2 back up.


Advice: If you want to hear about Data Sharing, go to a performance presentation.


DB2 Bootstrap datasets.

Simple question:

Are you using dual BSDS?

More complicated question:

How often do you back up your BSDS?

Trick question: BSDS are backed up automatically when you archive your active log datasets.

Do you remember where your BSDS backups reside?

Do you know how to recover your BSDS?

Do you recover both, or just the corrupted BSDS?


Another trick question: you recycled DB2 and only one of your BSDS is corrupted. What will happen?

14.30.51 STC10977 DSNY001I -DBX1 SUBSYSTEM STARTING

14.30.52 STC10977 DSNJ107I -DBX1 READ ERROR ON BSDS 350

350 DSNAME=DSNDBX0.DBX1.BSDS01, ERROR STATUS=0874

14.30.52 STC10977 DSNJ117I -DBX1 INITIALIZATION ERROR READING BSDS 351

351 DSNAME=DSNDBX0.DBX1.BSDS01, ERROR STATUS=0874

14.30.53 STC10977 DSNJ119I -DBX1 BOOTSTRAP ACCESS INITIALIZATION PROCESSING FA

14.31.03 STC10977 *DSNV086E -DBX1 DB2 ABNORMAL TERMINATION REASON=00E80084

14.31.03 STC10977 IEA794I SVC DUMP HAS CAPTURED: 357

357 DUMPID=001 REQUESTED BY JOB (DBX1MSTR)

357 DUMP TITLE=DBX1,ABND=04E-00E80084,U=SYSOPR ,M=(?),C=810.IPC

357 SNYSIRM,M=DSNYECTE,PSW=077C2000A021B300,A=0079

14.31.05 STC10977 IEF450I DBX1MSTR DBX1MSTR - ABEND=S04E U0000 REASON=00E80084

373 TIME=14.31.05

You guessed it: DB2 will not come up.
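When a start fails like this, the first job is to pull the message IDs and reason codes out of the MSTR job log quickly. Here is a minimal Python sketch of that step, assuming the log lines follow the layout of the excerpt above (time, job ID, optional `*`, message ID, text). The function name `summarize_startup` and the keys it returns are hypothetical, chosen only for illustration.

```python
import re

# Assumed layout, taken from the console excerpt: "HH.MM.SS STCnnnnn [*]MSGID text".
# Continuation lines (e.g. "350 DSNAME=...") do not match and are skipped for IDs.
LINE_RE = re.compile(
    r"^(?P<time>\d{2}\.\d{2}\.\d{2})\s+(?P<job>STC\d+)\s+"
    r"\*?(?P<msgid>[A-Z]{3,4}\d{3,4}[A-Z])\b"
)
REASON_RE = re.compile(r"REASON=([0-9A-F]{8})")


def summarize_startup(lines):
    """Collect message IDs and reason codes from console log lines."""
    msgids = []
    reasons = set()
    for raw in lines:
        line = raw.strip()
        match = LINE_RE.match(line)
        if match:
            msgids.append(match.group("msgid"))
        # Reason codes can appear on any line, so search every line.
        reasons.update(REASON_RE.findall(line))
    return {
        "msgids": msgids,
        "reasons": sorted(reasons),
        # DSNJ107I is the BSDS read error shown in the excerpt.
        "bsds_read_error": "DSNJ107I" in msgids,
        # DSNV086E is the abnormal termination message shown in the excerpt.
        "abnormal_termination": "DSNV086E" in msgids,
    }
```

Feeding it the saved job log (one string per line) gives a short summary you can paste into the crisis notes before anyone opens the messages manual.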


How do you fix that?

Easy, just copy the correct BSDS over the corrupted BSDS. What’s the problem?


Now suppose you copy a backup over the corrupted BSDS.

WWDD (What Would DB2 Do)?

You guessed it again: DB2 will not start.

15.38.06 STC00002 DSNY001I -DBX1 SUBSYSTEM STARTING

15.38.07 STC00002 DSNJ127I -DBX1 SYSTEM TIMESTAMP FOR BSDS= 09.160 05:46:48.3

15.38.09 STC00002 DSNJ012I -DBX1 DSNJR005 ERROR 00D10348 READING RBA 856

856 0003F4FAA000 IN DATA SET DSNDBX0.DBX1.LOGCOPY1.DS04.

856 CONNECTION-ID=DBX1, CORRELATION-ID=004.JW006 00

15.38.15 STC00002 IEA794I SVC DUMP HAS CAPTURED: 858

858 DUMPID=002 REQUESTED BY JOB (DBX1MSTR)

858 DUMP TITLE=DBX1,ABND=04E-00D10348,U=SYSOPR ,M=(?),C=810.RLM

858 SNJLGR ,M=DSNJRE01,LOC=DSNJL002.DSNJR005+0408

15.38.15 STC00002 DSNJ232I -DBX1 OUTPUT DATA SET CONTROL 859

859 INITIALIZATION PROCESSING FAILED

15.38.15 STC00002 *DSNV086E -DBX1 DB2 ABNORMAL TERMINATION REASON=00E80084

15.38.17 STC00002 IEF450I DBX1MSTR DBX1MSTR - ABEND=S04E U0000 REASON=00E80084

895 TIME=15.38.17


We all know that the DB2 catalog and directory are consistent with each other.

Until they are not…


And then we have to fix it. But sometimes it’s not so simple even to know that you have a problem in your Catalog or Directory.


When will you know that your DB2 has a catalog problem?

Myth number 1 – If I have a serious catalog problem, DB2 will not start.


Let’s test that:


We decided that emptying the SYSDBASE dataset is a serious enough catalog problem.

Guess what?

11.54.20 STC06004 DSN9022I -DBX1 DSNYASCP 'START DB2' NORMAL COMPLETION

11.54.26 STC06004 DSNP012I -DBX1 DSNPCNP0 - ERROR IN VSAM CATALOG 531

531 LOCATE FUNCTION FOR DSNDBX0.DSNDBC.DSNDB06.SYSDBASE.I0001.A00

531 CTLGRC=AAAAAA08

531 CTLGRSN=AAAAAA08

531 CONNECTION-ID=DB2CALL, CORRELATION-ID=TMONDB2,

531 LUW-ID=*

11.54.29 STC06004 DSNP012I -DBX1 DSNPCNP0 - ERROR IN VSAM CATALOG 619

619 LOCATE FUNCTION FOR DSNDBX0.DSNDBC.DSNDB06.SYSDBASE.I0001.A00

619 CTLGRC=AAAAAA08a

’START DB2’ normal completion:


Maybe deleting SYSDBASE was not serious enough for you?

How about deleting the SYSCOPY VSAM dataset?


15.38.55 STC06005 DSN9022I -DBX1 DSNYASCP 'START DB2'

NORMAL COMPLETION

That’s right – DB2 starts with no problem.


You will discover this problem only when you try to access SYSCOPY.

14.35.10 STC06053 DSNP012I -DBX2 DSNPCNP0 - ERROR IN VSAM CATALOG 031

031 LOCATE FUNCTION FOR DSNDBX0.DSNDBC.DSNDB06.SYSCOPY.I0001.A001

031 CTLGRC=AAAAAA08

031 CTLGRSN=AAAAAA08

031 CONNECTION-ID=DB2CALL, CORRELATION-ID=URI,

031 LUW-ID=*


Until then, your DB2 works as if it were business as usual.
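Because DB2 keeps running, the only early warning is the DSNP012I message itself. The Python sketch below shows one possible way to list which page sets the console log reports as missing. It assumes the `LOCATE FUNCTION FOR` continuation line shown in the excerpt and the usual DB2 dataset name layout (high-level qualifier, `DSNDBC` or `DSNDBD`, database, page set, and so on). The function name `missing_page_sets` is hypothetical.

```python
import re

# Matches the continuation line from the excerpt, for example:
# "031 LOCATE FUNCTION FOR DSNDBX0.DSNDBC.DSNDB06.SYSCOPY.I0001.A001"
LOCATE_RE = re.compile(r"LOCATE FUNCTION FOR\s+(\S+)")


def missing_page_sets(lines):
    """Return sorted (database, page set) pairs named in LOCATE failures."""
    found = set()
    for line in lines:
        match = LOCATE_RE.search(line)
        if not match:
            continue
        qualifiers = match.group(1).split(".")
        # Assumed layout: hlq.DSNDBC.database.pageset.I0001.Annn
        if len(qualifiers) >= 4 and qualifiers[1] in ("DSNDBC", "DSNDBD"):
            found.add((qualifiers[2], qualifiers[3]))
    return sorted(found)
```

Running something like this against the console log on a schedule is a cheap way to notice a missing catalog dataset before a utility trips over it.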


Now you need to:

Recover your SYSCOPY

While other utilities get -904 on SYSCOPY

P.S.

Do you know where your SYSCOPY backups are? (Because you damn sure can’t look them up in SYSIBM.SYSCOPY)


OK, let’s pick on the Directory for a change.

I believe one of the worst risks is having a garbled DBD in your Directory.

So we decided to garble some DBDs *


Do it yourself: how to garble your DBDs in three easy steps

1. Use DSN1PRNT to print DSNDB01

2. Use the REPAIR utility to locate the target database

3. Use the REPAIR utility with the REPLACE option at your database’s DBD offset

Voila! You have caused your very own 00C90101


How to fix this?

Locate the corrupted object in the DB2 Catalog and Directory using the REPAIR utility.

Fix the corruption using REPAIR or a targeted recovery rather than recovering all of your catalog/directory.


Simple questions:


When was the last time you did any of the operations I just mentioned?

And how long will it take you to do it now?


Understanding

Preparation

Training


Some things you could do before the crisis:

1. Work Procedures

2. Prepared JCL and Commands

3. Products

4. Team Preparation


Have a clear crisis management procedure:

Who is managing the crisis (Only ONE!)

Who is the technical manager (Only one)

Who should be notified

Timelines for escalation

Periodic checkpoints


As silly as it may sound:

Having the right JCL with the right command will save time and prevent mistakes.

The last thing you want is to open a book and start pondering which command does what you need.


Some products that you really want to have:

Log Scanners (backing out several bad changes or updating DRP site). Can I suggest BMC LOG MASTER?

System Recovery Tools (returning your whole system to a safe point). Can I suggest BMC RECOVERY MANAGER?

Utility Enhancers (quicker copy/recover, massive utility generation). Can I suggest BMC COPY/RECOVERY PLUS?


Why Utility Enhancers?

Because sometimes you will need to rebuild hundreds of indexes following a recovery

Because you want all your recoveries to run in parallel

Bottom line: you want to finish your recovery as soon as possible


Why use System Recovery Tools?

Because you want to be able to recover EVERYTHING to ANY TIME and you want it done now.

Why use Log Scanners?

Because you might want to know what happened during the time you skipped.

Because you want to know who did what before the crisis

To keep your recovery site updated


Hey Uri, why do you want us to buy so many products? Are these guys paying you?

Company: Money Uri got for this presentation

BMC: 12 NIS (Moshe bought me a cup of coffee)

IBM: 0 NIS

CA: 0 NIS

Others: 0 NIS


There are a lot of things you could do so that your team will be ready to deal with a crisis:

Knowledge management mechanism (May I suggest MediaWiki? It’s free)

Share information among team members

Make sure everyone is aware of the procedures


But I believe that the most important thing you could do to make your team ready to handle an emergency is:


Understanding

Preparation

Training


What should you do?

Revise your recovery work procedures

Meet with colleagues and compare procedures

Keep your team up to date (reading, lectures, DB2 courses and certification tests)

Select one DBA to coordinate and test recovery scenarios

Remember: nothing is better than hands-on experience!


Perform DRP Drills!

The methodology we developed at Discount Bank:

Let the DRP coordinator simulate a DRP scenario in the system environment

Let another DBA, chosen at random, handle the problem


This is the only way to be really ready.
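The random assignment in the drill methodology can be made explicit so that nobody gets to volunteer (or hide). Here is a minimal Python sketch, assuming the team is a simple list of names and the coordinator is excluded from handling the problem; the function name `pick_drill_handler` and the sample names in the usage line are hypothetical.

```python
import random


def pick_drill_handler(team, coordinator, rng=None):
    """Pick a random DBA, other than the coordinator, to handle the drill."""
    # A caller may pass a seeded random.Random for repeatable selections.
    rng = rng or random.Random()
    candidates = [dba for dba in team if dba != coordinator]
    if not candidates:
        raise ValueError("need at least one DBA besides the coordinator")
    return rng.choice(candidates)
```

For example, `pick_drill_handler(["Dana", "Avi", "Noa"], coordinator="Dana")` returns either "Avi" or "Noa".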


Hope you got some good ideas from this presentation.
