2012/09/22 カーネル/VM探検隊 (Kernel/VM Expedition) @ Tsukuba — presentation slides: "What is a Nested VMM?" (transcript)
What is a Nested VMM?
カーネル/VM探検隊 @ Tsukuba — @syu_cream
About Nested VMMs
• What is a nested VMM? – Just what the name says: one VMM running nested on another
• Examples of nested VMMs – KVM on KVM – BHyVe on VMware – BitVisor on VMware
Example nested VMM stack
Hardware
L0: host VMM
L1: guest VMM
L2: guest OS
What do we gain by nesting VMMs?
• No need to set aside a physical machine just to run a VMM
• Debugging the VMM gets easier – You can quickly debug a VMM with GDB and the like – Our lab mostly tests on real machines, which is painful
• The light in our eyes dies together with the VMM
• @m_bird: "Nested VMMs make the food taste better"
( ˘⊖˘) 。o(Hold on... how do you actually nest a VMM...?)
Designing a Nested VMM
• How can a VMM be nested?
• What follows is a summary of The Turtles Project [OSDI '10] and the Xen Summit material on nested VMMs
Intel VT basics (VMX, part 1)
• Provides a hardware mode for running VMs – On x86, sensitive instructions are painful to handle – so a dedicated VM mode (VMX non-root mode) was added
• When an instruction that must be trapped is executed, the CPU transitions to VMX root mode (VMExit)
• After suitable emulation, control returns to VMX non-root mode (VMEntry)
[Figure: VMX root mode ⇄ VMX non-root mode; VMExit into root mode, VMEntry back; transitions controlled via the VMX instructions]
Intel VT basics (VMX, part 2)
• VMCS (Virtual Machine Control Structure) – Holds the configuration and machine state used by the VMX instructions – Roughly three kinds of data:
• host machine state • guest machine state • control information, e.g. the conditions that cause a VMExit
Intel VT basics (EPT)
• Paging is painful under virtualization – Resolving a physical address takes a two-stage translation – so the VMM builds its own page table that resolves it in one step
• shadow page table
• EPT (Extended Page Tables) – The hardware performs the tedious address translation the VMM would otherwise have to do itself
– Reduces virtualization overhead
Nested VMM design
• To nest VMMs, three things must be dealt with – virtualizing VMX – virtualizing memory access – handling I/O
Trapping and emulating VMX instructions
• L1 cannot execute VMX instructions itself – L1 runs in VMX non-root mode – → so the work is thrown to L0!
• When L1 executes a VMX instruction: 1. VMExit from L1 to L0 2. L0 emulates the instruction 3. VMEntry from L0 back to L1
• A VMEntry from L1 to L2 becomes: 1. VMExit from L1 to L0 2. VMEntry from L0 to L2
[Figure: hardware / L0 host VMM / L1 guest VMM / L2 guest OS; 1. VMExit L1→L0, 2. VMEntry L0→L2 — a "virtual VMEntry" from L1's point of view]
VMCS shadowing
• Nesting VMX also requires multiple VMCSs 1. VMCS0,1 for L0 to control L1 2. VMCS1,2 for L1 to control L2 3. VMCS0,2 for L0 to control L2
• VMCS1,2 is shadowed • This makes direct L0↔L2 transitions possible
[Figure: hardware / L0 host VMM / L1 guest VMM / L2 guest OS; L0 holds VMCS0,1 and VMCS0,2; L1's VMCS1,2 is shadowed]
Emulating VMEntry/VMExit
• VMEntries and VMExits involving L2 must be emulated carefully – events that concern L0 – events that concern L1
Events that concern L0 • Events L0 itself should handle
• Events specified in VMCS0,2 but not in VMCS1,2 • e.g. external interrupts and NMIs
• L0 handles the event, then resumes L2
[Figure: L0-handled event: 1. VMExit L2→L0, 2. L0 handles the event, 3. VMEntry L0→L2; L0 holds VMCS0,1 and VMCS0,2, VMCS1,2 shadowed]
Events that concern L1 • Events specified in VMCS1,2 • L0 forwards the event to L1
– L0 updates the Exit Reason in VMCS1,2 so the exit appears to have come from L2
• L1 is resumed and handles the event; afterwards L2 is resumed again via L0
[Figure: L1-handled event: 1. VMExit L2→L0, 2. L0 updates VMCS1,2, 3. VMEntry L0→L1, 4. L1 handles the event, 5. VMExit L1→L0, 6. VMEntry L0→L2 — a virtual VMEntry/VMExit as seen from L1]
Memory management in a nested VMM
• A nested VMM needs three-stage address translation – L2 virtual → L2 physical → L1 physical → L0 physical – Hardware support exists for two-stage translation, but...
• EPT, NPT • How do we handle the one extra stage of translation?
Getting by with shadow page tables
• Shadow-on-shadow – L0 and L1 each keep a shadow page table – Accesses are translated straight to L0 physical addresses via L0's SPT0,2
• Shadow-on-EPT – L0 uses EPT; L1 keeps a shadow page table – Large overhead
• Every page fault in L2 forces a VMExit to L0 followed by an SPT update in L1...
Figure 4: MMU alternatives for nested virtualization

Shadow-on-shadow is used when the processor does not support two-dimensional page tables, and is the least efficient method. Initially, L0 creates a shadow page table to run L1 (SPT0→1). L1, in turn, creates a shadow page table to run L2 (SPT1→2). L0 cannot use SPT1→2 to run L2 because this table translates L2 guest virtual addresses to L1 host physical addresses. Therefore, L0 compresses SPT0→1 and SPT1→2 into a single shadow page table, SPT0→2. This new table translates directly from L2 guest virtual addresses to L0 host physical addresses. Specifically, for each guest virtual address in SPT1→2, L0 creates an entry in SPT0→2 with the corresponding L0 host physical address.

Shadow-on-EPT is the most straightforward approach to use when the processor supports EPT. L0 uses the EPT hardware, but L1 cannot use it, so it resorts to shadow page tables. L1 uses SPT1→2 to run L2. L0 configures the MMU to use SPT1→2 as the first translation table and EPT0→1 as the second translation table. In this way, the processor first translates from L2 guest virtual address to L1 host physical address using SPT1→2, and then translates from the L1 host physical address to the L0 host physical address using EPT0→1.

Though the Shadow-on-EPT approach uses the EPT hardware, it still has a noticeable overhead due to page faults and page table modifications in L2. These must be handled in L1, to maintain the shadow page table. Each of these faults and writes causes VMExits and must be forwarded from L0 to L1 for handling. In other words, Shadow-on-EPT is slow for exactly the same reasons that shadow paging itself was slow for single-level virtualization, but it is even slower because nested exits are slower than non-nested exits.

In multi-dimensional page tables, as in two-dimensional page tables, each level creates its own separate translation table. For L1 to create an EPT table, L0 exposes EPT capabilities to L1, even though the hardware only provides a single EPT table.

Since only one EPT table is available in hardware, the two EPT tables should be compressed into one: let us assume that L0 runs L1 using EPT0→1, and that L1 creates an additional table, EPT1→2, to run L2, because L0 exposed a virtualized EPT capability to L1. The L0 hypervisor could then compress EPT0→1 and EPT1→2 into a single EPT0→2 table as shown in Figure 4. Then L0 could run L2 using EPT0→2, which translates directly from the L2 guest physical address to the L0 host physical address, reducing the number of page fault exits and improving nested virtualization performance. In Section 4.1.2 we demonstrate more than a three-fold speedup of some useful workloads with multi-dimensional page tables, compared to shadow-on-EPT.

The L0 hypervisor launches L2 with an empty EPT0→2 table, building the table on the fly, on L2 EPT-violation exits. These happen when a translation for a guest physical address is missing in the EPT table. If there is no translation in EPT1→2 for the faulting address, L0 first lets L1 handle the exit and update EPT1→2. L0 can now create an entry in EPT0→2 that translates the L2 guest physical address directly to the L0 host physical address: EPT1→2 is used to translate the L2 physical address to an L1 physical address, and EPT0→1 translates that into the desired L0 physical address.

To maintain correctness of EPT0→2, the L0 hypervisor needs to know of any changes that L1 makes to EPT1→2. L0 sets the memory area of EPT1→2 as read-only, thereby causing a trap when L1 tries to update it. L0 will then update EPT0→2 according to the changed entries in EPT1→2. L0 also needs to trap all L1 INVEPT instructions, and invalidate the EPT cache accordingly.

By using huge pages [34] to back guest memory, L0 can create smaller and faster EPT tables. Finally, to further improve performance, L0 also allows L1 to use VPIDs. With this feature, the CPU tags each translation in the TLB with a numeric virtual-processor id, eliminating the need for TLB flushes on every VMEntry and VMExit. Since each hypervisor is free to choose these VPIDs arbitrarily, they might collide, and therefore L0 needs to map the VPIDs that L1 uses into valid L0 VPIDs.

3.4 I/O: Multi-level Device Assignment

I/O is the third major challenge in server virtualization. There are three approaches commonly used to provide I/O services to a guest virtual machine. Either the hypervisor emulates a known device and the guest uses an unmodified driver to interact with it [47], or a para-virtual driver is installed in the guest [6, 42], or the host assigns a real device to the guest, which then controls the device directly [11, 31, 37, 52, 53]. Device assignment generally provides the best performance [33, 38, 53], since it minimizes the number of I/O-related world switches between the virtual machine and its hypervisor, and although it complicates live migration, device assignment and live
Multi-dimensional Paging
• Lets a single EPT translate L2 physical → L0 physical addresses – L0 keeps EPT0,2, the composition of two EPTs
• namely L0's EPT0,1 and L1's EPT1,2 – Fewer page-fault VMExits, so lower overhead – L1's updates to EPT1,2 must be detected and reflected in EPT0,2
• L0 traps L1's writes to EPT1,2 and its INVEPT instructions
Handling I/O in a nested VMM
• Virtualized I/O follows one of three approaches – emulation – para-virtualization – pass-through (device assignment)
• With a nested VMM, L0 and L1 each choose one, so 3 × 3 = 9 combinations exist
References
• The Turtles Project [OSDI '10] paper – http://static.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
• The KVM implementation – arch/x86/kvm/vmx.c