Troubleshooting XenServer deployments Tomasz Czajka, Sr. Support Engineer 8th of October 2010


Page 1: Troubleshooting XenServer deployments

Troubleshooting XenServer deployments

Tomasz Czajka, Sr. Support Engineer

8th of October 2010

Page 2: Troubleshooting XenServer deployments

Agenda

• Case study: “Production down”

• Learn: “XenServer crash”

• Case study: “Single-pathing”

• Q & A

Page 3: Troubleshooting XenServer deployments

“Production down”

Page 4: Troubleshooting XenServer deployments

VM doesn’t start: why? (Basic troubleshooting in XenCenter)

• Cannot start a VM: “The SR is not available” error

• Storage Repository (SR) in “broken” state

• “Repair” does not work

• Use the CLI to troubleshoot

Page 5: Troubleshooting XenServer deployments

Broken storage: what is “broken”?

# xe pbd-list currently-attached=false

[Diagram: XenServer_1 and XenServer_2 each connect to the SR through their own PBD]

• PBD = Physical Block Device; it has a UUID (unique ID) and carries the SCSI ID of the LUN

• The SR is backed by a Volume Group

• Volume Group name: <Prefix> + SR UUID

Page 6: Troubleshooting XenServer deployments

Storage troubleshooting

Goal: reproduce the problem and analyse the logs

/var/log/xensource.log* ; SMlog* ; messages*

# tail -f /var/log/messages > /tmp/ShortLog

# date

# echo "Unplugging cable" >> /var/log/messages

Note: timestamps in messages are in UTC, timestamps in xensource.log are in local time
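Because messages is stamped in UTC while xensource.log uses local time, a marker that carries both timestamps can make the two logs easier to line up. A minimal sketch; `log_marker` is a hypothetical helper name, not a XenServer tool:

```shell
# Hypothetical helper: print one marker line stamped in both UTC and local time.
# $1 = free-text description of the action being performed.
log_marker() {
    printf 'MARKER utc=%s local=%s msg=%s\n' \
        "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
        "$(date '+%Y-%m-%dT%H:%M:%S%z')" \
        "$1"
}
```

For example, `log_marker "Unplugging cable" >> /tmp/ShortLog` could be run just before the cable is pulled.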

Page 7: Troubleshooting XenServer deployments

PBD unplugged: plugging the PBD manually

# xe pbd-list host-uuid=... sr-uuid=...

# xe pbd-plug uuid=...

SR_BACKEND_FAILURE_47: The SR is not available no such volume group: VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4

# grep "PBD.plug" xensource.log

# xe sr-list name-label="My SR" params=uuid

19856cba-830c-e298-79fa-84a79eb658f4

Page 8: Troubleshooting XenServer deployments

Volume Group: what is a VG?

[Diagram: Logical Volume Manager (LVM) layers]

• HDD / LUN → Physical Volume (PV) → Volume Group (VG) → Logical Volume (LV)

• Storage Repository (SR) corresponds to the Volume Group (VG)

• Virtual disk (VDI) corresponds to a Logical Volume (LV)

• Example: 3 VMs with 1 virtual disk each give 3 VDIs in the SR

Page 9: Troubleshooting XenServer deployments

Volume Group: matching the UUID

# vgs

VG                                                  #PV  #LV  #SN  Attr    VSize    VFree
VG_XenStorage-090d4717-9f91-92de-83c3-5458274802e9  1    18   0    wz--n-  89.99G   19.48G
VG_XenStorage-5239de43-6a74-0365-f825-b799aa6de853  1    2    0    wz--n-  129.07G  129.05G
VG_XenStorage-70a029cf-7f35-c035-4af7-07eaf31e2e88  1    11   0    wz--n-  49.99G   2.84G
VG_XenStorage-9be18df5-3fd2-4835-b864-d0ffbccbaeb3  1    1    0    wz--n-  1.99G    1.98G

# vgs 'VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4'

Volume group "VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4" not found
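The check above can be scripted. A minimal sketch, assuming the VG_XenStorage- prefix shown on the slides; `vg_name_for_sr` and `sr_has_vg` are hypothetical helper names:

```shell
# Build the VG name expected for an SR UUID ($1), using the prefix from the slides.
vg_name_for_sr() {
    printf 'VG_XenStorage-%s\n' "$1"
}

# Read VG names on stdin (one per line, surrounding spaces allowed) and
# succeed only if the VG expected for SR UUID $1 is among them.
sr_has_vg() {
    expected=$(vg_name_for_sr "$1")
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' | grep -Fqx -- "$expected"
}
```

A possible use is `vgs --noheadings -o vg_name | sr_has_vg <SR UUID> || echo "VG missing"`.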

Page 10: Troubleshooting XenServer deployments

Examining HDD/LUN: checking the SCSI ID

• Check the SCSI ID (unique for each SCSI device), which is stored in the PBD

# xe pbd-list params=device-config sr-uuid=...

device-config SCSIid: 360a9800050334f49633459

Page 11: Troubleshooting XenServer deployments

Examining HDD/LUN: can the Linux kernel see this block device (SCSI device)?

# hdparm -t /dev/disk/by-id/scsi-360a98045234t654...

Timing buffered disk reads: 138 MB in 3.02 seconds = 45.68 MB/sec

(LUN readable!)

Page 12: Troubleshooting XenServer deployments

Addressing SCSI disks

# ls -lR /dev/disk | grep 360a9800050334f4963345767656c546

• /dev/disk/by-id

   • scsi-360a9800050334f4963345767656c546a -> /dev/sde

• /dev/disk/by-scsibus

   • 360a9800050334f4963345767656c546a-1:0:0:5 -> /dev/sdc

   • 360a9800050334f4963345767656c546a-2:0:0:5 -> /dev/sde

• /dev/mapper/360a9800050334f4963345767656c546

• Also check /dev/disk/by-path

Page 13: Troubleshooting XenServer deployments

Examining HDD/LUN: is the LUN empty?

# udevinfo -q all -n /dev/disk/by-id/scsi-360a9800050334f496334576765...

...

ID_FS_TYPE=LVM2_member

...

“If this is an LVM member, why is there no VG on it?”

Page 14: Troubleshooting XenServer deployments

Examining HDD/LUN: is there a VG created on the PV?

# pvs

PV                               VG                                       Fmt   Attr  PSize    PFree
/dev/mapper/360a9800050334f4963  VG_XenStorage-090d4717-9f91-92de-83c3-  lvm2  a-    89.99G   19.48G
/dev/mapper/360a9800050334f4963  VG_XenStorage-70a029cf-7f35-c035-4af7-  lvm2  a-    49.99G   2.84G
/dev/mapper/360a9800050334f4963  VG_XenStorage-19856cba-830c-e298-79fa-  lvm2  a-    14.99G   6.45G
/dev/mapper/360a9800050334f4965  VG_XenStorage-9be18df5-3fd2-4835-b864-  lvm2  a-    1.99G    1.98G
/dev/sda3                        VG_XenStorage-5239de43-6a74-0365-f825-  lvm2  a-    129.07G  129.05G

# pvs | grep 360a9800050334f496334595a32306431

/dev/mapper/360a9800050334f496334595a32306431 VG_XenStorage-332432-430d-3423-4332434-5485974 lvm2 a- 14.99G 14.99G

# xe sr-list name-label="My SR" params=uuid

19856cba-830c-e298-79fa-84a79eb658f4

The UUID in the VG name (VG_XenStorage-<UUID>) differs from the SR UUID!
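The comparison between the VG found on the LUN and the SR UUID can be sketched as a small filter over `pvs` output. `sr_uuid_on_lun` is a hypothetical helper name; it assumes the PV path is in column 1 and the VG name in column 2, as in the listing above:

```shell
# Read pvs-style rows on stdin. For the PV whose path contains SCSI ID $1,
# print the UUID embedded in its VG name (the part after "VG_XenStorage-").
sr_uuid_on_lun() {
    awk -v id="$1" '
        index($1, id) && $2 ~ /^VG_XenStorage-/ {
            sub(/^VG_XenStorage-/, "", $2)
            print $2
        }'
}
```

If the value printed by `pvs | sr_uuid_on_lun <SCSI ID>` is not the UUID returned by `xe sr-list`, the VG on the LUN does not belong to this SR.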

Page 15: Troubleshooting XenServer deployments

No original VG on the LUN: potential reasons

• (Re)installation of a host in the same pool

   • Unplug FC / Zoning

• (Re)installation of a host in another pool

   • Zoning

• Adding an SR with “xe sr-create” in the CLI

   • ...BE VERY CAREFUL!

Page 16: Troubleshooting XenServer deployments

Volume Group ...has been recreated!

• Lost LVM metadata

• Lost 100 MB of the VDI data

Action steps:

• Don’t shut down running VMs

• Online backup for running VMs (now)

• Block-level clone of the whole LUN (now)

• Assess professional data recovery

Page 17: Troubleshooting XenServer deployments

Volume Group: looking for the LVM metadata backup

/etc/lvm/backup/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4

• Check the backup timestamp (within the file)

• Make a copy first

# cp /etc/lvm/backup/* /root/backup/

• LVs in the backup file

# cat /etc/lvm/backup/VG... | grep VHD

• VDIs in the xapi database

# xe vdi-list sr-uuid=<uuid> params=uuid

[Diagram: LVs in the backup file compared with VDIs in the xapi database]
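Comparing the two lists by eye is error-prone, so the comparison can be sketched as a set difference. This assumes the backup file names each LV as VHD-<VDI uuid>, as in the listings on these slides; `backup_vdi_uuids` and `missing_from_xapi` are hypothetical helper names:

```shell
# Print the VDI UUIDs named by VHD-<uuid> entries in an LVM backup file ($1).
backup_vdi_uuids() {
    grep -o 'VHD-[0-9a-f-]*' "$1" | sed 's/^VHD-//' | sort -u
}

# Print UUIDs that appear in the backup file ($1) but not in the
# xapi list ($2, a file with one VDI UUID per line).
missing_from_xapi() {
    tmp=$(mktemp) || return 1
    sort -u "$2" > "$tmp"
    backup_vdi_uuids "$1" | comm -23 - "$tmp"
    rm -f "$tmp"
}
```

The xapi list could be produced with something like `xe vdi-list sr-uuid=<uuid> params=uuid --minimal | tr ',' '\n'`; the `--minimal` form is an assumption about the xe version in use.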

Page 18: Troubleshooting XenServer deployments

Volume Group: removing the new VG and PV

# vgremove "VG_XenStorage-<new SR uuid>"

# pvremove /dev/mapper/<SCSI ID>

Page 19: Troubleshooting XenServer deployments

Volume Group: recreating the PV and VG from backup

# pvcreate --uuid <PV uuid from backup file> --restorefile /etc/lvm/backup/VG_XenStorage-<SR UUID> /dev/mapper/<SCSI ID>

# vgcfgrestore -f /etc/lvm/backup/VG_XenStorage-<SR UUID> VG_XenStorage-<SR UUID>
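The PV UUID needed by pvcreate has to be read out of the backup file. A minimal sketch, assuming the usual LVM backup layout in which a physical_volumes section contains entries such as pv0 with an id = "..." line; `pv_uuid_from_backup` is a hypothetical helper name and it prints only the first PV it finds:

```shell
# Print the id of the first physical volume listed in an LVM backup file ($1).
pv_uuid_from_backup() {
    awk '
        /physical_volumes/ { in_pv = 1 }      # section start
        in_pv && $1 == "id" {                 # first id line inside the section
            gsub(/"/, "", $3)
            print $3
            exit
        }' "$1"
}
```

The value should still be checked by eye against the backup file before it is passed to pvcreate, especially for VGs that span more than one PV.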

Page 20: Troubleshooting XenServer deployments

Examining HDD/LUN: confirm that the VG name contains the SR UUID...

# pvs | grep 360a9800050334f496334595a32306431

PV VG Fmt Attr PSize PFree

/dev/mapper/360a9800050334f496334595a32306431 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 lvm2 a- 14.99G 14.99G

# xe sr-list name-label="My SR" params=uuid

19856cba-830c-e298-79fa-84a79eb658f4

The UUID in the VG name (VG_XenStorage-<UUID>) matches the SR UUID

Page 21: Troubleshooting XenServer deployments

Volume Group: checking Logical Volumes

# lvs

MGT                                       VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  -wi---  4.00M
VHD-352d31ec-aeb6-4601-8ea9-990575dab395  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  -wi---  520.00M
VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  -wi---  4.02G
VHD-fbce18dd-397e-444e-9470-b6fa240243d9  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  -wi---  4.02G
VHD-ff744448-1b7f-4cc8-80b1-cd38b6c90c98  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  -wi---  520.00M

Page 22: Troubleshooting XenServer deployments

Storage Repository: plugging the PBD again...

# xe pbd-plug uuid=…

Success! But no VDIs shown...

# xe sr-scan uuid=…

Error code: SR_BACKEND_FAILURE_46

Error parameters: , The VDI is not available [opterr=Error scanning VDI 7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32]

# xe vdi-list uuid=<VDI UUID from the error above>

# lvremove /dev/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4/VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32

# xe sr-scan uuid=…

Success! All VDIs shown... Well done!
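The LV path passed to lvremove follows directly from the SR UUID and the VDI UUID in the scan error. A minimal sketch that only prints the path, so it can be reviewed before anything is removed; `lv_path_from_scan_error` is a hypothetical helper name:

```shell
# $1 = SR UUID, $2 = text of the sr-scan error.
# Print the LV device path of the VDI named in "Error scanning VDI <uuid>",
# or fail if the error text does not name a VDI.
lv_path_from_scan_error() {
    vdi=$(printf '%s\n' "$2" |
        sed -n 's/.*Error scanning VDI \([0-9a-f-]*\).*/\1/p')
    [ -n "$vdi" ] || return 1
    printf '/dev/VG_XenStorage-%s/VHD-%s\n' "$1" "$vdi"
}
```

Keeping the destructive step (lvremove) manual is deliberate: the helper only builds the name.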

Page 23: Troubleshooting XenServer deployments

What we’ve learned ...by troubleshooting the “Production down” issue

• What a PBD needs in order to get plugged:

• LUN/HDD → PV → VG (SR) → LV (VDI)

• VG name is generated from the SR uuid (+ prefix); LV name is generated from the VDI uuid (+ prefix)

• Displaying VG (vgs), PV (pvs), LV (lvs)

• Addressing block devices (/dev/disk)

• Examining HDD/LUN with "hdparm -t"

• Restoring PV & VG from backup

Page 24: Troubleshooting XenServer deployments

“The XenServer Crash”

Page 25: Troubleshooting XenServer deployments

The XenServer crash? Unresponsive or rebooting host

• Kernel panic or crash dump

   • Error on console, host locked

   • Memory addressing, bug in OS, hardware failure

• No kernel panic and no crash dump

   • Host rebooting / frozen / no errors on the console

   • Hardware failure, OS busy (I/O), user action

Page 26: Troubleshooting XenServer deployments

[Flowchart: troubleshooting a host that rebooted itself or is unresponsive]

Symptom: Host rebooted itself

• /var/crash/<date> exists: review crashdump

• No crashdump, HA enabled: Host fenced? Check /var/log/xha.log; analyse /var/log/messages, xensource.log for HA reasons; disable HA

• No crashdump, HA disabled: add “noreboot” option in extlinux.conf; analyse /var/log/messages, xensource.log; still rebooting? Examine hardware

Symptom: Host is unresponsive

• Serial console: generate crashdump (CTX120540) & reboot; review crashdump; analyse /var/log/messages, xensource.log

• No serial console: connect local console; any errors on the console? Take photos and reboot; boot the host to the console (CTX120540) & reboot

• Contact Citrix Tech Support

Page 27: Troubleshooting XenServer deployments

Getting into details: as easy as grep

Analyse /var/log/messages, xensource.log

Startup strings:

# cd /var/log

# grep "klogd" messages -B100

# grep "SERVER START" xensource.log -B100
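When a log contains several startups, only the lines leading up to the most recent one are usually of interest. A minimal sketch of that narrowing; `context_before_last` is a hypothetical helper name:

```shell
# $1 = fixed string to search for, $2 = log file, $3 = number of lines of
# leading context (default 100). Prints the last matching line together
# with the context lines before it; fails if the string never occurs.
context_before_last() {
    n=$(grep -nF -- "$1" "$2" | tail -n 1 | cut -d: -f1)
    [ -n "$n" ] || return 1
    start=$((n - ${3:-100}))
    [ "$start" -ge 1 ] || start=1
    sed -n "${start},${n}p" "$2"
}
```

For example, `context_before_last "SERVER START" xensource.log` shows what xapi logged just before its latest start.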

Page 28: Troubleshooting XenServer deployments

Review crashdump: inside the crash log directory /var/crash/<stamp>

Files:

• crash.log

• Domain0.log

• Domain1,2,3...log

• Debug.log

• xen-memory-dump

What to look for:

• Hypervisor console ring

• Domain0 console ring

• CPU stack: to be analysed by Citrix Tech Support

• HA activity, page fault, driver, storage issues

Page 29: Troubleshooting XenServer deployments

Review crashdump (cont.): investigating crash.log, Xen console ring

• Located at the bottom of the file

(XEN) Watchdog timer fired for domain 0
(XEN) Domain 0 shutdown: watchdog rebooting machine.

• Why was the watchdog triggered? Check /var/log/xha.log (network or storage heartbeat failed)

• Why did the heartbeat fail? Check /var/log/messages (DMP, kernel, drivers, I/O errors)

Page 30: Troubleshooting XenServer deployments

Investigating crash.log: page fault

Other examples:

(XEN) ****************************************
(XEN) Panic on CPU 6:
(XEN) FATAL TRAP: vector = 14 (page fault)
(XEN) [error_code=0000] , IN INTERRUPT CONTEXT
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
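A first pass over crash.log can be automated by looking for the two signatures shown on these slides. A minimal sketch; `crash_reason` is a hypothetical helper name, and it recognises only these two strings:

```shell
# $1 = path to crash.log. Prints a coarse classification based on the
# console-ring messages quoted in the slides.
crash_reason() {
    if grep -q 'Watchdog timer fired' "$1"; then
        echo watchdog        # next stop: /var/log/xha.log
    elif grep -q 'FATAL TRAP' "$1"; then
        echo fatal-trap      # hypervisor panic, e.g. a page fault
    else
        echo unknown
    fi
}
```

Anything reported as unknown still needs to be read by hand or passed to support.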

Page 31: Troubleshooting XenServer deployments

What we’ve learned: XenServer crash

• Host really crashed?

• Kernel panic

• Crashdump

• Triggering a crashdump manually

• Locating the host reboot in the logs

• Reviewing crashdump logs

Page 32: Troubleshooting XenServer deployments

“Single-Pathing”

Page 33: Troubleshooting XenServer deployments

Storage Performance issue

• DMP has been enabled to improve performance

• Virtual Machines are running on different iSCSI SRs

LinuxGuestVM:~# hdparm -t /dev/xvdb

/dev/xvdb:

Timing buffered disk reads: 96 MB in 3.07 seconds = 30.41 MB/sec

Page 34: Troubleshooting XenServer deployments

Storage Performance: checking multipath status

# mpathutil status

360a9800050334f496334596c71665246 dm-13 NETAPP,LUN
[size=2.0G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=4][enabled]
 \_ 3:0:0:2 sdk 8:160 [active][ready]
 \_ 4:0:0:2 sdj 8:144 [active][ready]

• The multipath device is found under /dev/mapper/....

• The single-path devices (sdk, sdj) are found under /dev/

Page 35: Troubleshooting XenServer deployments

Storage Performance: determining current performance on domain0

• Testing the multipath device

# hdparm -t /dev/mapper/<scsi id>

• Testing the single-path devices

# hdparm -t /dev/sdj

# hdparm -t /dev/sdk

In all cases: 30 MB/sec

Page 36: Troubleshooting XenServer deployments

Storage Performance: determining usage of paths

# iostat -x <device>

# iostat -x /dev/sdk /dev/sdj 5

Device  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sdk     803.50      33.0        4122      160
sdj     784.00      32.8        3922      155

Both paths are used equally
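"Used equally" can be put into numbers by computing each path's share of the read rate. A minimal sketch, assuming the column layout shown above (device name first, Blk_read/s second); `path_read_share` is a hypothetical helper name:

```shell
# Read iostat-style rows on stdin and print each sd* device with its
# percentage of the summed Blk_read/s column (column 2).
path_read_share() {
    awk '
        $1 ~ /^sd/ { rate[$1] = $2; total += $2 }
        END {
            for (d in rate)
                printf "%s %.1f%%\n", d, (total ? 100 * rate[d] / total : 0)
        }' | sort
}
```

With the figures from the slide, the two paths come out at roughly 49.4% (sdj) and 50.6% (sdk).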

Page 37: Troubleshooting XenServer deployments

Storage Performance: checking if there are really 2 iSCSI sessions

# ls -alR /dev/disk/by-path/ | egrep "(sdk|sdj)"

ip-10.1.200.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdk

ip-10.1.201.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdj

Page 38: Troubleshooting XenServer deployments

Storage Performance: checking if different paths are really used

# tcpdump -i any port 3260

# watch "ifconfig | egrep '(eth0 |eth1 )' -A5 | egrep '(eth|bytes)' "

eth0 Link encap:Ethernet HWaddr 00:1D:09:70:88:2C

RX bytes:1490076463 (1.3 GiB) TX bytes:170615419 (162.7 MiB)

eth1 Link encap:Ethernet HWaddr 00:1D:09:70:88:2E

RX bytes:1801238 (166 MiB) TX bytes:46695876 (44.5 MiB)

Page 39: Troubleshooting XenServer deployments

Storage Performance: checking source IP addresses for iSCSI sessions

# netstat -at | grep iscsi

10.1.200.138:53049 10.1.200.40:iscsi-target ESTABLISHED

10.1.200.178:46684 10.1.201.40:iscsi-target ESTABLISHED

Page 40: Troubleshooting XenServer deployments

Storage Performance: checking the kernel routing table

# route

Destination  Gateway     Genmask        Iface
10.1.200.0   *           255.255.255.0  xenbr0
10.1.200.0   *           255.255.255.0  xenbr1
default      10.1.200.1  0.0.0.0        xenbr0
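The problem in this table is that the same subnet is reachable through two interfaces, so the kernel sends all iSCSI traffic out of one of them. Spotting that can be sketched as a filter over `route` output; `duplicate_routes` is a hypothetical helper name, and it assumes the destination is in column 1, the netmask in column 3 and the interface in the last column:

```shell
# Read `route` output on stdin and print every destination/netmask pair
# that is listed on more than one interface, followed by those interfaces.
duplicate_routes() {
    awk '
        $1 != "default" && $1 != "Destination" && NF >= 4 {
            key = $1 "/" $3
            if (count[key]++)
                ifaces[key] = ifaces[key] "," $NF
            else
                ifaces[key] = $NF
        }
        END {
            for (k in count)
                if (count[k] > 1)
                    print k, ifaces[k]
        }'
}
```

Run as `route | duplicate_routes`; empty output means no subnet is shared between interfaces.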

Page 41: Troubleshooting XenServer deployments

Storage Performance: configuration of management interfaces in XenCenter

Change the IP address of ISCSI_2 to 10.1.201.78

Page 42: Troubleshooting XenServer deployments

Storage Performance: determining current performance on domain0

# route

Destination  Gateway     Genmask        Iface
10.1.200.0   *           255.255.255.0  xenbr0
10.1.201.0   *           255.255.255.0  xenbr1
default      10.1.200.1  0.0.0.0        xenbr0

Page 43: Troubleshooting XenServer deployments

Storage Performance: configuring the kernel routing table

...or (not recommended)

• Add to /etc/rc.local

# route add -host 10.1.200.40 xenbr0

# route add -host 10.1.201.40 xenbr1

• What about Pool Upgrade and Pool Join?

Page 44: Troubleshooting XenServer deployments

Storage Performance: determining current performance on the VM

LinuxVM:~# hdparm -t /dev/xvdb

/dev/xvdb:

Timing buffered disk reads: 45 MB/sec

Well done!

Page 45: Troubleshooting XenServer deployments

What we’ve learned: case study “Single-pathing”

• /dev/ locations for single-path and multipath devices

• # mpathutil status

• # hdparm -t

• # iostat

• # ifconfig, # tcpdump, # netstat, # route

• # watch

• Best practices for iSCSI storage

Page 46: Troubleshooting XenServer deployments

Questions

Page 47: Troubleshooting XenServer deployments

Resources: first aid kit

• http://docs.xensource.com - XenServer documentation

• http://support.citrix.com/product/xens/ - Knowledge Center

• http://forums.citrix.com/support - Support forums

• http://community.citrix.com/citrixready/xenserver - XenServer Central (one-stop information center)

Page 48: Troubleshooting XenServer deployments

Before you leave…

• Session surveys are available online at www.citrixsynergy.com starting Thursday, 7 October

• Provide your feedback and pick up a complimentary gift card at the registration desk

• Download presentations starting Friday, 15 October, from your My Organiser Tool located in your My Synergy Microsite event account

Page 49: Troubleshooting XenServer deployments