Post on 16-May-2018

Page 1: Xen Virtualization at Huawei - Schedschd.ws/hosted_files/xendeveloperanddesignsummit2017/ba/Xen...Xen Virtualization at Huawei Usages, Value-adds, ... • Network support –ovs +
Xen Virtualization at Huawei: Usages, Value-adds, and Challenges

Liu Jinsong, Chief Architect for Virtualization, [email protected]

Usages

• Huawei’s Unified Virtual Platform (UVP) supporting Huawei’s all-cloud

• public cloud

• private cloud

• NFV

• Per all-cloud, UVP enables dozens of features based on Xen/KVM

• vm life-cycle management

• vm management

• Storage support -- vims, dsware, b-cache, NVMe SSD

• Network support – ovs + dpdk, smart NIC, vlan, vxlan

• live migration and DRS

• GPU graphics virtualization and virtual desktop

• GPGPU computing and FPGA virtualization

• ARM virtualization

Usages

• vm life-cycle management

• Works with Huawei’s OpenStack

• Image and vm configuration (password, static IP, etc.)

• vm start / suspend / resume / reboot / shutdown / destroy

• vm snapshot

• vm sanity check

• HA/ FT

• vm management

• CPU hotplug

• Memory over-commitment

• usb, cd-rom, vnc, scsi support

• pass-through/ SR-IOV

• Micro-vm

Usages

• Storage

• NVMe SSD

• To satisfy extreme IOPS and latency requirements

• vims

• vSAN, OCFS2 enhancement supporting 64 nodes, used in Huawei’s private cloud

• dsware

• GFS/HDFS-like storage, supporting hundreds of nodes in 1 POD, used in public cloud

• b-cache

• read I/O cache, 10x IOPS speedup

• network

• ovs + dpdk

• Smart NIC

• offloading via Huawei’s ARM + FPGA smartNIC

Usages

• GPU graphics virtualization and virtual desktop

• 4 GPU graphics virtualization technologies

• Huawei uses 2 of these GPU graphics virtualization technologies, providing AWS G2-like instances

• API forwarding

• compatibility issues for Windows guests

• vGPU solution

• Nvidia K1/K2/M60 + XenGT

• GPU sr-iov is under evaluation

• Huawei’s desktop protocol

• HDP (Huawei Desktop Protocol) virtual desktop

Usages

• GPGPU virtualization

• supports HPC / AI cloud, providing AWS P2-like instances

• GPU pass-through

• GPGPU compute capability is comparable to native

• GPU – GPU data transfer bottleneck

• Nvidia GPUDirect (P2P) virtualization

• FPGA virtualization

• Not friendly to virtualization

• Security issues

Value-add: live migration

• Live migration@virtualization

• Zero-page scanning

• Frequency reduction

• TSC scaling

• Guest whitelist

• Live migration@cloud

• Event-based, via xenstore

• Parallel migration

• Closed-loop control

• Safe roll-back

• ~100% of VMs stay alive when migration fails
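
The zero-page scanning item above can be illustrated with a minimal sketch: before a guest page is copied, check whether it is all zeros and, if so, send only a marker instead of the page contents. The page size and the names `is_zero_page`, `encode_pages`, and `bytes_on_wire` are hypothetical and chosen for illustration; they are not UVP or Xen APIs.

```python
PAGE_SIZE = 4096              # assumed page size, for illustration only
ZERO_PAGE = bytes(PAGE_SIZE)  # reference page filled with zero bytes


def is_zero_page(page: bytes) -> bool:
    """Return True if the page contains only zero bytes."""
    return page == ZERO_PAGE


def encode_pages(pages):
    """Yield (index, payload) pairs; payload is None for zero pages."""
    for index, page in enumerate(pages):
        yield (index, None if is_zero_page(page) else page)


def bytes_on_wire(encoded):
    """Count payload bytes that would actually be transferred."""
    return sum(len(payload) for _, payload in encoded if payload is not None)
```

In this sketch a guest with mostly untouched memory transfers far fewer bytes, at the cost of scanning every page, which is presumably why the slide pairs scanning with a frequency reduction.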

[Figure: event-based live migration flow. Lanes: control system, network, storage, src VM, dst VM. Xend, Libvirtd, Storaged, and Networkd exchange live migration status events through Xenstore.]

Labels recovered from the figure (order approximate):

• Migration pre-check; send migrate command; re-connect image at dst

• Socket connection, VM config; create VM

• N-1 iterations of memcpy; ELB session copy

• Flush cache at src; VM suspend; last iteration of memcpy; last-iteration ELB session copy

• Save VM & QEMU context; QEMU save/load; restore memory; restore VM & QEMU

• Success: image disconnect at src; destroy PV devices and QEMU; image connect at dst in r/w mode; VM unpause/pause; frontend-backend reconnect; send gratuitous ARP at dst; flush Windows ARP cache; VM runs at dst; VM destroy

• Fail or timeout: ELB session cancel; tapdisk reopen; create vif; QEMU recreate and load; VM resume; frontend-backend reconnect; image connect at src; safe roll-back

• Judge and split-brain prevention; VM database update
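
The success and fail-or-timeout paths of the flow can be sketched as a small driver that runs forward steps in order and, on the first failure, runs roll-back steps so the VM keeps running at the source. This is only an illustrative sketch: `MigrationError`, `run_migration`, and the step lists are hypothetical names, and the real control system is event-driven rather than a single function.

```python
class MigrationError(Exception):
    """Raised by a step to signal failure or timeout."""


def run_migration(forward_steps, rollback_steps):
    """Run forward steps in order; roll back on the first failure.

    Each step is a (name, callable) pair. Returns (where_vm_runs, log),
    where where_vm_runs is "dst" on success and "src" after roll-back.
    """
    log = []
    try:
        for name, action in forward_steps:
            action()
            log.append(name)
    except MigrationError:
        # Failure or timeout: restore the source side so the VM stays alive.
        for name, action in rollback_steps:
            action()
            log.append("rollback: " + name)
        return "src", log
    return "dst", log
```

A caller would pass steps such as pre-check, VM suspend, and restore at dst as the forward list, and steps such as tapdisk reopen, QEMU recreate and load, and VM resume as the roll-back list.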

Challenges: GPGPU virtualization

• GPGPU virtualization

• Trivial GPU computing capability loss

• GPU – GPU data transfer bottleneck

• Caffe: cudaMemcpy

• GPUDirect P2P data path

• GPUDirect P2P virtualization

• Nvidia’s P2P probing mechanism is unknown

• Exposing PCIe topology

• gpa -> hpa translation

• Varies with server topology

• QPI, IOH, PCIe switch layout

• How to assign a GPU set to a vm

• Unified GPGPU virtualization framework?
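
The gpa -> hpa item can be made concrete with a small sketch: a guest programs a peer-to-peer transfer with guest physical addresses, but the device needs host physical addresses. Below, a plain dictionary stands in for the hypervisor's frame mapping; `gpa_to_hpa`, the `p2m` dictionary, and the 4 KiB page size are assumptions for illustration, not the actual Xen or Nvidia mechanism.

```python
PAGE_SHIFT = 12                      # assumed 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1    # low bits hold the offset within a page


def gpa_to_hpa(p2m, gpa):
    """Translate a guest physical address using a {guest_frame: host_frame} map."""
    gfn = gpa >> PAGE_SHIFT
    if gfn not in p2m:
        raise KeyError(f"guest frame {gfn:#x} is not mapped")
    # Swap the frame number, keep the in-page offset.
    return (p2m[gfn] << PAGE_SHIFT) | (gpa & PAGE_MASK)
```

For example, with `p2m = {0x10: 0x8a}`, the guest address `0x10123` maps to the host address `0x8a123`.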

[Figure: server topologies and the guest view. Labels: GPU-0 to GPU-3 behind two PCIe switches under CPU-1 and CPU-2; alternative CPU / IOH / switch / GPU layouts; guest chipset "440 or Q35 chipset?"; topology expose; pass-through; gpa; hpa; hypervisor; one path marked X.]
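
One way to think about the "how to assign a GPU set to a vm" question is to prefer GPUs that share a PCIe switch, since peer-to-peer traffic between them does not have to cross the CPU interconnect. The sketch below assumes the host already knows which switch each GPU sits behind; `pick_gpu_set` and the topology dictionary are hypothetical, and the GPU-to-switch mapping in the example is an assumption, not read from the slide.

```python
from collections import defaultdict


def pick_gpu_set(gpu_to_switch, free_gpus, count):
    """Pick `count` free GPUs behind the same PCIe switch, or None if impossible."""
    by_switch = defaultdict(list)
    for gpu in sorted(free_gpus):
        by_switch[gpu_to_switch[gpu]].append(gpu)
    # Scan switches in a stable order and take the first one with enough free GPUs.
    for switch in sorted(by_switch):
        if len(by_switch[switch]) >= count:
            return by_switch[switch][:count]
    return None
```

With GPU-0 and GPU-1 behind one switch and GPU-2 and GPU-3 behind another, a request for two GPUs while GPU-0 is busy returns GPU-2 and GPU-3 rather than a pair split across switches.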

Challenges: FPGA virtualization

• Not mature/friendly to virtualization

• Very useful for AI inference

• FPGA partition for multi-tenants

• Static partitioning is possible but expensive

• Dynamic partitioning is not currently possible; it depends on Xilinx/Intel tool support

• Pass-through solution works, but:

• Expensive

• Security issue if the FPGA is owned by a malicious guest

• PCIe bandwidth stuck

• Overheating

• Security issue if the bitstream is owned by the host and hijacked

• Isolation

• IP and data leaking

Challenges: hot-upgrade and security

• XSA/CVE hot-patch

• Only ~75% of XSA/CVE security holes can be hot-patched

• data structure, newly added functions, booting stage security holes

• vmexit handler, inline function

• NMI handler

• OS/Hypervisor online upgrade

• Live migration is too heavy

• Security

• Currently, security in cloud environments focuses on defending against network attacks

• A malicious guest can work around network defenses and attack the system from the inside

• e.g., QEMU VENOM

• Intelligent hardware

• Smart NIC, FPGA, GPU, etc

Backup

Live migration based on sync-calls

[Figure: live migration flow based on synchronous calls. Lanes: control system, network, storage, src VM, dst VM. Migration status advances from Start through migration step 1, step 2, ... step N to Done.]

Labels recovered from the figure (order approximate):

• Migration start; pre-check; storage lock; send migrate cmd to src; image read-only open at dst

• Socket connect, VM config; create VM

• N-1 iterations of memcpy; ELB session copy

• Send flush cache to src; flush b-cache at src; VM suspend; tapdisk pause; last iteration of memcpy; last-iteration ELB session copy

• Save VM & QEMU context; QEMU save/load; restore memory; restore VM & QEMU

• Success: image disconnect at src; storage & network disconnect; destroy PV devices and QEMU; tapdisk open; image r/w open at dst; VM unpause/pause; waiting for handshake; frontend-backend connection; send gratuitous ARP at dst; flush Windows ARP cache; VM running at dst; destroy VM

• Fail or timeout: SLB session cancel at dst; tapdisk reopen; create vif; recreate QEMU and load status; VM resume; frontend-backend re-connection; image r/w open at src; VM safely rolled back

• Split-brain prevention; database update
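
The "N-1 iter memcpy" and "last iter memcpy" labels describe iterative pre-copy: memory is copied in rounds while the guest keeps running, and only the final round happens with the VM suspended. The simulation below is a minimal sketch of that loop under simplified assumptions (pages as dictionary entries, guest writes supplied per round); `precopy` and its parameters are hypothetical names.

```python
def precopy(src, writes_per_round, threshold=1, max_rounds=10):
    """Simulate iterative pre-copy from src (dict page_no -> bytes).

    writes_per_round lists the guest writes ({page_no: bytes}) that land
    while each live round is being copied. Returns (dst, live_rounds,
    pages_sent_while_suspended).
    """
    dst = {}
    dirty = set(src)                 # the first round sends every page
    writes = iter(writes_per_round)
    rounds = 0
    while len(dirty) > threshold and rounds < max_rounds:
        for page in dirty:
            dst[page] = src[page]
        # The guest is still running: its writes dirty pages again.
        new_writes = next(writes, {})
        src.update(new_writes)
        dirty = set(new_writes)
        rounds += 1
    # Last iteration: the VM is suspended, so nothing new gets dirtied.
    for page in dirty:
        dst[page] = src[page]
    return dst, rounds, len(dirty)
```

The fewer pages left for the last iteration, the shorter the VM stays suspended, which is the point of running the earlier rounds while the guest is live.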

Thank You