troubleshooting xenserver deployments

Troubleshooting XenServer deployments Tomasz Czajka, Sr. Support Engineer 8th of October 2010

Upload: elijah-mills

Post on 31-Dec-2015


Troubleshooting XenServer deployments

Tomasz Czajka, Sr. Support Engineer

8th of October 2010

Agenda

• Case study: “Production down”

• Learn: “XenServer crash”

• Case study: “Single-pathing”

• Q & A

“Production down”

VMs don’t start - why?

Basic troubleshooting in XenCenter

• Cannot start a VM: “The SR is not available” error

• Storage Repository (SR) in “broken” state

• “Repair” does not work

Use the CLI to troubleshoot

# xe pbd-list currently-attached=false

Broken storage

What is “broken”?

[Diagram: XenServer_1 and XenServer_2 are each connected to the SR through a PBD.]

• PBD = Physical Block Device; each PBD has a UUID (unique ID) and a SCSI ID

• SR = Volume Group, named <Prefix> + SR UUID

Storage troubleshooting

Goal: Reproduce and analyse the logs

/var/log/xensource.log*, SMlog*, messages*

# tail -f /var/log/messages > /tmp/ShortLog

# date

# echo "Unplugging cable" >> messages

messages (UTC) <> xensource.log (local)

PBD unplugged

Plugging PBD manually

# xe pbd-list host-uuid=... sr-uuid=...

# xe pbd-plug uuid=...

SR_BACKEND_FAILURE_47: The SR is not available
no such volume group: VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4

# xe sr-list name-label="My SR" params=uuid

19856cba-830c-e298-79fa-84a79eb658f4

# grep "PBD.plug" xensource.log
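The error message above already names the volume group the PBD expects, and that name carries the SR UUID. A minimal sketch for pulling the UUID out of such a message so it can be compared with the `xe sr-list` output; the function name `sr_uuid_from_vg` is made up for illustration, and the `VG_XenStorage-` prefix is taken from the message shown on this slide.

```shell
# Hypothetical helper: print the SR UUID embedded in a volume group name.
# Assumes the "VG_XenStorage-" prefix seen in the error message above.
sr_uuid_from_vg() {
    # $1: a VG name, or a whole error line containing one
    printf '%s\n' "$1" |
        sed -n 's/^.*VG_XenStorage-\([0-9a-f-]\{36\}\).*$/\1/p'
}
```

Usage: pass the error line from `xe pbd-plug` as the argument; an empty result means no VG name of that form was found in the text.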

Volume Group

What is VG?

[Diagram: Logical Volume Manager (LVM) stack]

• HDD / LUN → Physical Volume (PV) → Volume Group (VG) → Logical Volume (LV)

• Storage Repository (SR) = Volume Group (VG)

• Virtual Disk (VDI) = Logical Volume (LV)

• Example: 3 VMs, 1 virtual disk each

Volume Group

Matching the UUID

# vgs

VG                                                 #PV #LV #SN Attr   VSize   VFree
VG_XenStorage-090d4717-9f91-92de-83c3-5458274802e9   1  18   0 wz--n-  89.99G  19.48G
VG_XenStorage-5239de43-6a74-0365-f825-b799aa6de853   1   2   0 wz--n- 129.07G 129.05G
VG_XenStorage-70a029cf-7f35-c035-4af7-07eaf31e2e88   1  11   0 wz--n-  49.99G   2.84G
VG_XenStorage-9be18df5-3fd2-4835-b864-d0ffbccbaeb3   1   1   0 wz--n-   1.99G   1.98G

# vgs 'VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4'

Volume group "VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4" not found
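The manual comparison above can be sketched as a small filter over `vgs` output. This is only an illustration: `vg_present` is a hypothetical name, and it assumes the VG name is the first column and uses the `VG_XenStorage-` prefix shown on the slide.

```shell
# Hypothetical helper: succeed if `vgs` output (on stdin) lists the
# volume group that belongs to the given SR UUID.
vg_present() {
    # $1: SR UUID
    awk -v vg="VG_XenStorage-$1" '$1 == vg { found = 1 } END { exit !found }'
}
```

Usage, on the host: `vgs | vg_present 19856cba-830c-e298-79fa-84a79eb658f4 || echo "VG for this SR is missing"`.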

Examining HDD/LUN

Checking SCSI ID

• Check the SCSI ID (unique for each SCSI device) stored in the PBD

# xe pbd-list params=device-config sr-uuid=...

device-config SCSIid: 360a9800050334f49633459

Examining HDD/LUN

Can the Linux kernel see this block device (SCSI device)?

# hdparm -t /dev/disk/by-id/scsi-360a98045234t654...

Timing buffered disk reads: 138 MB in 3.02 seconds = 45.68 MB/sec

(LUN readable!)

Addressing SCSI disks

# ls -lR /dev/disk | grep 360a9800050334f4963345767656c546

• /dev/disk/by-id

  scsi-360a9800050334f4963345767656c546a -> /dev/sde

• /dev/disk/by-scsibus

  360a9800050334f4963345767656c546a-1:0:0:5 -> /dev/sdc
  360a9800050334f4963345767656c546a-2:0:0:5 -> /dev/sde

• /dev/mapper/360a9800050334f4963345767656c546

Also check /dev/disk/by-path
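To see at a glance which sdX devices sit behind one SCSI ID, the symlink listing above can be reduced to the link targets. A rough sketch, assuming the `name -> target` layout printed by `ls -l`; `devices_for_scsi_id` is a made-up name.

```shell
# Hypothetical helper: from `ls -lR /dev/disk` output on stdin, print the
# distinct device names that symlinks containing the SCSI ID point to.
devices_for_scsi_id() {
    # $1: SCSI ID (or a leading part of it)
    awk -v id="$1" '/->/ && index($0, id) {
        n = split($NF, part, "/")   # last field is the link target
        print part[n]               # keep only the device name, e.g. sde
    }' | sort -u
}
```

Usage, on the host: `ls -lR /dev/disk | devices_for_scsi_id 360a9800050334f4963345767656c546`. More than one device name suggests the LUN is visible over several paths.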

Examining HDD/LUN

Is the LUN empty?

# udevinfo -q all -n /dev/disk/by-id/scsi-360a9800050334f496334576765...

...

ID_FS_TYPE=LVM2 member

...

“If this is an LVM member, why is there no VG on it?”

Examining HDD/LUN

Is there a VG created on the PV?

# pvs

# pvs | grep 360a9800050334f496334595a32306431

PV                                            VG                                             Fmt  Attr PSize  PFree
/dev/mapper/360a9800050334f496334595a32306431 VG_XenStorage-332432-430d-3423-4332434-5485974 lvm2 a-   14.99G 14.99G

# xe sr-list name-label="My SR" params=uuid

19856cba-830c-e298-79fa-84a79eb658f4

VG_XenStorage-<UUID> differs from the SR UUID!

PV                              VG                                     Fmt  Attr PSize   PFree
/dev/mapper/360a9800050334f4963 VG_XenStorage-090d4717-9f91-92de-83c3- lvm2 a-    89.99G  19.48G
/dev/mapper/360a9800050334f4963 VG_XenStorage-70a029cf-7f35-c035-4af7- lvm2 a-    49.99G   2.84G
/dev/mapper/360a9800050334f4963 VG_XenStorage-19856cba-830c-e298-79fa- lvm2 a-    14.99G   6.45G
/dev/mapper/360a9800050334f4965 VG_XenStorage-9be18df5-3fd2-4835-b864- lvm2 a-     1.99G   1.98G
/dev/sda3                       VG_XenStorage-5239de43-6a74-0365-f825- lvm2 a-   129.07G 129.05G
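The check on this slide (which VG sits on the PV, and does its name carry the SR UUID?) can be sketched as two small filters over `pvs` output. Both function names are hypothetical, and the sketch assumes the PV path is the first column and the VG name the second, as in the listing above.

```shell
# Hypothetical helpers; `pvs` output is read from stdin.

# Print the VG name found on the given PV device.
vg_on_pv() {
    # $1: PV device path, e.g. /dev/mapper/<SCSI ID>
    awk -v dev="$1" '$1 == dev { print $2 }'
}

# Succeed only if the VG on the PV is named after the given SR UUID.
pv_matches_sr() {
    # $1: PV device path, $2: SR UUID
    [ "$(vg_on_pv "$1")" = "VG_XenStorage-$2" ]
}
```

Usage, on the host: `pvs | pv_matches_sr /dev/mapper/<SCSI ID> <SR UUID> || echo "VG on this PV does not belong to the SR"`.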

No original VG on the LUN

Potential reasons:

• (Re)installation of a host in the same pool
  • Unplug FC / Zoning

• (Re)installation of a host in another pool
  • Zoning

• Adding an SR with “xe sr-create” in the CLI

...BE VERY CAREFUL!

Volume Group

...has been recreated!

• Lost LVM metadata

• Lost 100 MB of the VDI data

Action steps:

• Don’t shut down running VMs

• Online backup of running VMs (now)

• Block-level clone of the whole LUN (now)

• Assess professional data recovery

Volume Group

Looking for LVM metadata backup

/etc/lvm/backup/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4

• Check backup timestamp (within the file)

LVs in backup file

# cat /etc/lvm/backup/VG... | grep VHD

VDIs in xapi database

# xe vdi-list sr-uuid=<uuid> params=uuid

Make a copy first

# cp /etc/lvm/backup/* /root/backup/

[Diagram: the LVs in the backup file shown next to the VDIs in the xapi database.]
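Comparing the two lists by eye gets tedious with many disks. A sketch of the same comparison, assuming LV names in the backup file have the form `VHD-<VDI uuid>` followed by `{` (as in LVM's text metadata format) and that the VDI UUIDs have been saved one per line in a file. Both function names are made up for illustration.

```shell
# Hypothetical helpers for comparing the LVM backup with the xapi database.

# Print the UUID part of every VHD-<uuid> logical volume in a backup file.
lv_uuids_in_backup() {
    # $1: path to the LVM metadata backup file
    sed -n 's/^[[:space:]]*VHD-\([0-9a-f-]\{36\}\)[[:space:]]*{.*$/\1/p' "$1"
}

# Print LV UUIDs from the backup that have no matching VDI UUID.
lvs_without_vdi() {
    # $1: backup file, $2: file with one VDI UUID per line
    lv_uuids_in_backup "$1" | grep -vxFf "$2"
}
```

One way to produce the second file would be `xe vdi-list sr-uuid=<uuid> params=uuid --minimal | tr ',' '\n'`. Empty output from `lvs_without_vdi` means every LV in the backup has a VDI record.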

Volume Group

Removing new VG and PV

# vgremove "VG_XenStorage-<new SR uuid>"

# pvremove /dev/mapper/<SCSI ID>

Volume Group

Recreating PV and VG from backup

# pvcreate --uuid <PV uuid from backup file> --restorefile /etc/lvm/backup/VG_XenStorage-<SR UUID> /dev/mapper/<SCSI ID>

# vgcfgrestore -f /etc/lvm/backup/VG_XenStorage-<SR UUID> VG_XenStorage-<SR UUID>
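The `--uuid` value for `pvcreate` has to be read out of the backup file. A minimal sketch of that lookup, assuming the usual LVM text layout where the PV's `id = "..."` line appears inside the `physical_volumes { ... }` section and the VG has a single PV (as the `vgs` output earlier shows). The function name is hypothetical.

```shell
# Hypothetical helper: print the UUID of the first physical volume
# recorded in an LVM metadata backup file.
pv_uuid_from_backup() {
    # $1: path to the backup file
    awk '
        /physical_volumes/ { in_pv = 1 }      # skip the VG-level id above it
        in_pv && $1 == "id" {
            gsub(/"/, "", $3)                 # strip the surrounding quotes
            print $3
            exit
        }
    ' "$1"
}
```

Usage: `pv_uuid_from_backup /etc/lvm/backup/VG_XenStorage-<SR UUID>`. Check the printed value against the file by eye before passing it to `pvcreate`.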

Examining HDD/LUN

Confirm that VG name contains SR uuid...

# pvs | grep 360a9800050334f496334595a32306431

PV                                            VG                                                 Fmt  Attr PSize  PFree
/dev/mapper/360a9800050334f496334595a32306431 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 lvm2 a-   14.99G 14.99G

# xe sr-list name-label="My SR" params=uuid

19856cba-830c-e298-79fa-84a79eb658f4

VG_XenStorage-<UUID> matches the SR UUID

Volume Group

Checking Logical Volumes

# lvs

MGT                                      VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi---   4.00M
VHD-352d31ec-aeb6-4601-8ea9-990575dab395 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi--- 520.00M
VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi---   4.02G
VHD-fbce18dd-397e-444e-9470-b6fa240243d9 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi---   4.02G
VHD-ff744448-1b7f-4cc8-80b1-cd38b6c90c98 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi--- 520.00M

Storage Repository

Plugging PBD again...

# xe pbd-plug uuid=…

Success! But no VDIs shown...

# xe sr-scan uuid=…

Error code: SR_BACKEND_FAILURE_46

Error parameters: , The VDI is not available [opterr=Error scanning VDI 7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32]

# xe vdi-list uuid=<above number>

# lvremove /dev/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4/VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32

# xe sr-scan uuid=…

Success! All VDIs shown... Well done!
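The UUID needed for the `xe vdi-list` and `lvremove` steps is buried in the `opterr` text of the scan error. A small sketch for extracting it, assuming the wording "Error scanning VDI <uuid>" shown above; the function name is made up.

```shell
# Hypothetical helper: read an sr-scan error on stdin and print the UUID
# of the VDI that could not be scanned.
vdi_uuid_from_scan_error() {
    sed -n 's/^.*Error scanning VDI \([0-9a-f-]\{36\}\).*$/\1/p'
}
```

Usage: pipe the failed `xe sr-scan` output (stdout and stderr) through `vdi_uuid_from_scan_error`, then inspect that VDI before deciding to remove its LV.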

What we’ve learned

...by troubleshooting the “Production Down” issue

• What a PBD needs in order to get plugged:

• LUN/HDD → PV → VG (SR) → LV (VDI)

• VG name is generated from the SR uuid (+ prefix)

• LV name is generated from the VDI uuid (+ prefix)

• Displaying VG (vgs), PV (pvs), LV (lvs)

• Addressing block devices (/dev/disk)

• Examining HDD/LUN with "hdparm -t"

• Restoring PV & VG from backup

“The XenServer Crash”

The XenServer Crash?

Unresponsive or rebooting host

• Kernel panic or crash dump
  • Error on console, host locked
  • Memory addressing, bug in OS, hardware failure

• No kernel panic and no crash dump
  • Host rebooting / frozen / no errors on the console
  • Hardware failure, OS busy (I/O), user action

/var/crash/<date> exists

[Flowchart with two entry points, “Symptom: Host rebooted itself” and “Symptom: Host is unresponsive”. Steps shown:]

• Serial console / No serial console

• HA enabled: Host fenced? Check /var/log/xha.log; disable HA

• HA disabled: add “noreboot” option in extlinux.conf

• Analyse /var/log/messages, xensource.log (also for HA reasons)

• Still rebooting? Examine hardware

• Connect local console: any errors on the console?

• Boot the host to the console (CTX120540) & reboot

• Generate crashdump (CTX120540) & reboot

• Review crashdump / No crashdump

• Take photos and reboot

• Contact Citrix Tech Support

Getting into details…

As easy as grep

Analyse /var/log/messages, xensource.log

Startup strings:

# cd /var/log

# grep -B100 "klogd" messages

# grep -B100 "SERVER START" xensource.log
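When only the moment of the reboot matters, the `grep -B100` context can be narrowed to the single line logged just before the startup string. A sketch under the same assumption as the slide, namely that a line containing "klogd" marks a boot; the function name is hypothetical.

```shell
# Hypothetical helper: print the log line written immediately before the
# most recent "klogd" startup line, i.e. the last entry before the reboot.
last_line_before_boot() {
    # $1: path to a syslog file such as /var/log/messages
    awk '
        /klogd/ { hit = prev }     # remember the line preceding a startup
        { prev = $0 }
        END { if (hit != "") print hit }
    ' "$1"
}
```

Usage: `last_line_before_boot /var/log/messages`. The timestamp on the printed line gives a lower bound for when the host went down.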

Inside crash log directory

/var/crash/<stamp>

[Diagram: contents of the crash log directory, “Review crashdump” step]

• crash.log

• Domain0.log

• Domain1,2,3...log

• Debug.log

• xen-memory-dump

Callouts on the diagram:

• Hypervisor console ring

• Domain0 console ring

• CPU stack - to be analysed by Citrix Tech Support

• HA activity, page fault, driver, storage issues

Investigating crash.log

Xen console ring

• Located at the bottom of the file

(XEN) Watchdog timer fired for domain 0
(XEN) Domain 0 shutdown: watchdog rebooting machine.

• Why was the watchdog triggered? /var/log/xha.log (network or storage heartbeat failed)

• Why did the heartbeat fail? /var/log/messages (DMP, kernel, drivers, I/O errors)

Review crashdump (cont)

Investigating crash.log

Page fault

Other examples:

(XEN) ****************************************
(XEN) Panic on CPU 6:
(XEN) FATAL TRAP: vector = 14 (page fault)
(XEN) [error_code=0000] , IN INTERRUPT CONTEXT
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
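The two crash.log patterns shown so far (watchdog reboot and fatal trap) can be told apart with a couple of greps. A rough first-pass sketch only: it knows just these two strings from the slides, and `crash_reason` is a made-up name. Anything else still needs a proper read of the file.

```shell
# Hypothetical helper: give a one-line first guess at why the host crashed,
# based only on the two message patterns shown in the slides.
crash_reason() {
    # $1: path to crash.log
    if grep -q 'Watchdog timer fired' "$1"; then
        echo "watchdog"
    elif grep -q 'FATAL TRAP' "$1"; then
        # Print the trap description, e.g. "vector = 14 (page fault)"
        grep 'FATAL TRAP' "$1" | sed 's/^.*FATAL TRAP: *//' | tail -n 1
    else
        echo "unknown"
    fi
}
```

Usage: `crash_reason /var/crash/<stamp>/crash.log`. A "watchdog" result points to /var/log/xha.log as the next place to look.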

What we’ve learned

Learn: XenServer crash

• Has the host really crashed?

•Kernel Panic

•Crashdump

•Triggering Crashdump manually

•Locating host reboot in the logs

•Reviewing crashdump logs

“Single-Pathing”

Storage Performance issue

• DMP has been enabled to improve performance

• Virtual Machines are running on different iSCSI SRs

LinuxGuestVM:~# hdparm -t /dev/xvdb

/dev/xvdb:

Timing buffered disk reads: 96 MB in 3.07 seconds = 30.41 MB/sec

Storage Performance

Checking multipath status

# mpathutil status

360a9800050334f496334596c71665246 dm-13 NETAPP,LUN
[size=2.0G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=4][enabled]
 \_ 3:0:0:2 sdk 8:160 [active][ready]
 \_ 4:0:0:2 sdj 8:144 [active][ready]

• Multipath device: /dev/mapper/....

• Single-path devices: /dev/

Storage Performance

Determining current performance on domain0

• Testing multi-path device

# hdparm -t /dev/mapper/<scsi id>

• Testing single-path devices

# hdparm -t /dev/sdj

# hdparm -t /dev/sdm

In all cases: 30 MB/sec

Storage Performance

Determining usage of paths

# iostat -x <device>

# iostat -x /dev/sdk /dev/sdj 5

Device  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sdk         803.50        33.0      4122       160
sdj         784.00        32.8      3922       155

Both paths are used equally
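"Used equally" can be made concrete by turning the read rates into percentages. A sketch that assumes the column layout shown on the slide, with the device name in the first column and Blk_read/s in the second; `path_share` is a hypothetical name.

```shell
# Hypothetical helper: from iostat device lines on stdin, print each sdX
# device with its share of the total read rate, in percent.
path_share() {
    awk '
        $1 ~ /^sd/ { rate[$1] = $2; total += $2 }
        END {
            if (total > 0)
                for (d in rate)
                    printf "%s %.0f\n", d, 100 * rate[d] / total
        }
    ' | sort
}
```

For the figures above this gives 49 for sdj and 51 for sdk, which agrees with the slide: the load really is spread over both block devices, so the bottleneck must be elsewhere.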

Storage Performance

Checking if there are really 2 iSCSI sessions

# ls -alR /dev/disk/by-path/ | egrep "(sdk|sdj)"

ip-10.1.200.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdk

ip-10.1.201.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdj

Storage Performance

Checking if different paths are really used

# tcpdump -i any port 3260

# watch "ifconfig | egrep '(eth0 |eth1 )' -A5 | egrep '(eth|bytes)' "

eth0 Link encap:Ethernet HWaddr 00:1D:09:70:88:2C

RX bytes:1490076463 (1.3 GiB) TX bytes:170615419 (162.7 MiB)

eth1 Link encap:Ethernet HWaddr 00:1D:09:70:88:2E

RX bytes:1801238 (166 MiB) TX bytes:46695876 (44.5 MiB)

Storage Performance

Checking source IP addresses for iSCSI sessions

# netstat -at | grep iscsi

10.1.200.138:53049 10.1.200.40:iscsi-target ESTABLISHED

10.1.200.178:46684 10.1.201.40:iscsi-target ESTABLISHED

Storage Performance

Checking kernel routing table

# route

Destination  Gateway     Genmask        Iface
10.1.200.0   *           255.255.255.0  xenbr0
10.1.200.0   *           255.255.255.0  xenbr1
default      10.1.200.1  0.0.0.0        xenbr0
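The problem in this table is that the same subnet is reachable through two interfaces, so the kernel sends all traffic for it through one of them. A sketch for spotting that condition automatically, assuming Destination is the first column and Genmask the third, as in the listing above. The function name is made up.

```shell
# Hypothetical helper: from `route` output on stdin, print every
# destination/netmask pair that appears in more than one route.
duplicate_subnets() {
    awk '
        $1 ~ /^[0-9]/ { seen[$1 "/" $3]++ }   # skip headers and "default"
        END { for (net in seen) if (seen[net] > 1) print net }
    ' | sort
}
```

Usage, on the host: `route -n | duplicate_subnets`. Any output means two interfaces share a subnet and iSCSI traffic may not use both paths.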

Storage Performance

Configuration of management interfaces in XenCenter

Change the IP address of ISCSI_2 to 10.1.201.78

Storage Performance

Determining current performance on domain0

# route

Destination  Gateway     Genmask        Iface
10.1.200.0   *           255.255.255.0  xenbr0
10.1.201.0   *           255.255.255.0  xenbr1
default      10.1.200.1  0.0.0.0        xenbr0

Storage Performance

Configuring kernel routing table

...or (not recommended)

• Add to /etc/rc.local

# route add -host 10.1.200.40 xenbr0

# route add -host 10.1.201.40 xenbr1

• What about Pool Upgrade and Pool Join?

Storage Performance

Determining current performance on VM

LinuxVM:~# hdparm -t /dev/xvdb

/dev/xvdb:

Timing buffered disk reads: 45 MB/sec

Well done!

What we’ve learned

Case study: Single-pathing

• /dev/ locations for single and multi-path devices

• # mpathutil status

• # hdparm -t

• # iostat

• # ifconfig, # tcpdump, # netstat, # route

• # watch

• Best practices for iSCSI storage

Questions

Resources

First aid kit

• http://docs.xensource.com - XenServer documentation

• http://support.citrix.com/product/xens/ - Knowledge Center

• http://forums.citrix.com/support - Support forums

• http://community.citrix.com/citrixready/xenserver - XenServer Central (one-stop information center)

Before you leave…

• Session surveys are available online at www.citrixsynergy.com starting Thursday, 7 October

• Provide your feedback and pick up a complimentary gift card at the registration desk

• Download presentations starting Friday, 15 October, from your My Organiser Tool located in your My Synergy Microsite event account