2012/09/22 カーネル/VM探検隊@つくば presentation slides: "What Is a Nested VMM?"

What Is a Nested VMM? (カーネル/VM探検隊@つくば, @syu_cream)


DESCRIPTION

Slides presented at カーネル/VM探検隊@つくば on 2012/09/22: "What Is a Nested VMM?"

TRANSCRIPT

Page 1

What Is a Nested VMM?

カーネル/VM探検隊@つくば   @syu_cream

Page 2

About Nested VMMs

• What is a nested VMM?
  – Exactly what the name says: a VMM running nested on top of another VMM

• Examples of nested VMMs
  – KVM on KVM
  – BHyVe on VMware
  – BitVisor on VMware

Page 3

Example nested VMM configuration (layer diagram)

Hardware

L0: host VMM

L1: guest VMM

L2: guest OS

Page 4

What's the point of nesting VMMs?

• No need to set aside a physical machine just to run a VMM
• Debugging a VMM becomes easier
  – You can quickly debug the VMM with GDB and the like
  – Our lab mostly tests on real machines, which is painful

• As the VMM dies, so does the light in our eyes

• @m_bird: "Nested VMMs make the food taste better"

Page 5

( ˘⊖˘) 。o(Wait a minute... how do you actually nest a VMM...?)

Page 6

Designing a nested VMM

• How can a VMM be nested?

• This talk summarizes what the Turtles Project [OSDI '10] paper and Xen Summit material say about nested VMMs

Page 7

Intel VT explained (VMX, part 1)

• The hardware provides a dedicated mode for running VMs
  – Handling sensitive instructions on x86 is a pain
  – So a VM-only mode (VMX non-root mode) is provided

• When an instruction that must be trapped is executed, the CPU transitions to VMX root mode (VMExit)

• The VMM performs the appropriate emulation and returns to VMX non-root mode (VMEntry)

(Diagram: VMX root mode and VMX non-root mode, switching via VMExit/VMEntry, controlled with the VMX instructions)
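
As a rough illustration of this loop, here is a minimal C sketch of the root-mode side of a VMM. It is not real code: vmcs_read(), vm_entry(), and the emulate_*/handle_* helpers are hypothetical stand-ins for the VMREAD/VMLAUNCH/VMRESUME instructions and for the VMM's emulation logic, and on real hardware a VMExit resumes at the host RIP stored in the VMCS rather than "returning" from a function. The exit-reason values and the exit-reason field encoding are the ones Intel VT-x defines.

```c
#include <stdint.h>

/* Hypothetical wrappers; a real VMM issues VMREAD/VMLAUNCH/VMRESUME
 * directly (e.g. via inline assembly). */
uint64_t vmcs_read(uint32_t field);   /* VMREAD: read a VMCS field          */
void     vm_entry(void);              /* VMLAUNCH/VMRESUME; in this sketch
                                         it "returns" on the next VMExit    */
void     emulate_cpuid(void);         /* emulate the trapped instruction    */
void     handle_host_interrupt(void); /* hand the interrupt to the host     */
void     handle_other_exit(uint32_t reason);

/* Exit reasons and the exit-reason field encoding defined by Intel VT-x
 * (see the SDM, or arch/x86/include/uapi/asm/vmx.h in Linux). */
#define EXIT_REASON_EXTERNAL_INTERRUPT  1
#define EXIT_REASON_CPUID              10
#define VM_EXIT_REASON             0x4402

/* The VMExit/VMEntry loop: run the guest in VMX non-root mode, come back
 * to root mode when a trapped condition occurs, emulate, and re-enter. */
static void run_guest_loop(void)
{
    for (;;) {
        vm_entry();   /* VMEntry into VMX non-root mode; the guest runs    */

        /* Back in VMX root mode: a VMExit happened.  Ask the VMCS why
         * (advancing the guest RIP after emulation is omitted here).      */
        uint32_t reason = (uint32_t)vmcs_read(VM_EXIT_REASON) & 0xffff;

        switch (reason) {
        case EXIT_REASON_CPUID:
            emulate_cpuid();
            break;
        case EXIT_REASON_EXTERNAL_INTERRUPT:
            handle_host_interrupt();
            break;
        default:
            handle_other_exit(reason);
        }
        /* loop around: the next vm_entry() resumes the guest */
    }
}
```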

Page 8

Intel VT explained (VMX, part 2)

• VMCS (Virtual Machine Control Structure)
  – Holds the configuration and machine state used by the VMX instructions
  – The data falls into roughly three groups:
    • Host machine state
    • Guest machine state
    • Control information, such as the conditions that cause a VMExit
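
The real VMCS is an opaque, processor-defined region that software must access with VMREAD/VMWRITE and per-field encodings, so it cannot be read as a plain struct; the following is only an illustrative C sketch of the three groups of data listed above, with field names chosen for readability.

```c
#include <stdint.h>

/* Conceptual view only: the hardware VMCS layout is processor-defined and
 * is accessed via VMREAD/VMWRITE field encodings, never as a C struct. */

struct vmcs_host_state {      /* loaded by the CPU on VMExit (back to the VMM) */
    uint64_t cr3, rip, rsp;
    uint16_t cs, ss, tr;
};

struct vmcs_guest_state {     /* loaded by the CPU on VMEntry (into the VM)    */
    uint64_t cr0, cr3, cr4;
    uint64_t rip, rsp, rflags;
    /* ... segment registers, some MSRs, ... */
};

struct vmcs_controls {        /* what causes a VMExit, plus entry/exit behaviour */
    uint32_t pin_based_exec;  /* e.g. exit on external interrupts / NMIs   */
    uint32_t cpu_based_exec;  /* e.g. exit on HLT, CR accesses, I/O, ...   */
    uint32_t vmexit_controls;
    uint32_t vmentry_controls;
    uint64_t eptp;            /* EPT pointer, when EPT is in use           */
};

struct vmcs_conceptual {
    struct vmcs_host_state  host;   /* 1. host machine state   */
    struct vmcs_guest_state guest;  /* 2. guest machine state  */
    struct vmcs_controls    ctl;    /* 3. control information  */
    uint32_t exit_reason;           /* read-only VMExit info   */
};
```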

Page 9

Intel VT explained (EPT)

• Paging is awkward to handle under virtualization
  – Resolving a physical address requires a two-stage translation
  – The classic fix is a VMM-maintained page table that resolves it in one step: the shadow page table

• EPT (Extended Page Table)
  – The hardware takes over the tedious address translation the VMM would otherwise have to do itself
  – Reduces virtualization overhead
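
To make the two-stage translation concrete, here is a toy C sketch that composes the two lookups. lookup_guest_pt() and lookup_ept() are hypothetical one-shot walkers; real translations walk multi-level tables and check permissions. With EPT the hardware performs both stages itself; without it, the VMM pre-composes the two stages into a shadow page table that maps guest-virtual addresses straight to host-physical ones, and it must keep that table in sync with the guest's page tables.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gva_t;  /* guest virtual address  */
typedef uint64_t gpa_t;  /* guest physical address */
typedef uint64_t hpa_t;  /* host physical address  */

/* Hypothetical single-shot walkers over the two tables. */
bool lookup_guest_pt(gva_t gva, gpa_t *gpa);  /* guest page table: GVA -> GPA */
bool lookup_ept(gpa_t gpa, hpa_t *hpa);       /* EPT:              GPA -> HPA */

/* What the MMU effectively does for every guest memory access when EPT is
 * enabled: first the guest's own paging, then the VMM-controlled EPT.     */
bool translate_with_ept(gva_t gva, hpa_t *hpa)
{
    gpa_t gpa;

    if (!lookup_guest_pt(gva, &gpa))
        return false;             /* guest-visible page fault              */
    if (!lookup_ept(gpa, hpa))
        return false;             /* EPT violation: VMExit into the VMM    */
    return true;
}
```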

Page 10

Nested VMM design

• To nest VMMs, three things have to be dealt with:
  – Virtualizing VMX
  – Virtualizing memory access
  – Handling I/O

Page 11

Trapping and emulating VMX instructions

• L1 cannot execute VMX instructions itself
  – L1 runs in VMX non-root mode
  – → so the work is thrown to L0!

• When L1 executes a VMX instruction:
  1. VMExit from L1 to L0
  2. L0 emulates the instruction
  3. VMEntry from L0 back to L1

• A VMEntry from L1 into L2 becomes:
  1. VMExit from L1 to L0
  2. VMEntry from L0 into L2

(Diagram: hardware / L0 host VMM / L1 guest VMM / L2 guest OS; the real "1. VMExit" from L1 and "2. VMEntry" into L2 together appear to L1 as a single virtual VMEntry. A rough C sketch of this flow on the L0 side follows.)
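
A minimal sketch, in C, of what L0 might do when a VMExit arrives from L1 because L1 executed a VMX instruction. The exit-reason values for the VMX instructions are the ones Intel defines; vmcs01/vmcs12/vmcs02 and all the helper functions are hypothetical names used only to show the control flow.

```c
#include <stdint.h>

struct vmcs;                                  /* kept opaque in this sketch   */
extern struct vmcs vmcs01, vmcs12, vmcs02;    /* see the VMCS-shadowing slide */

/* Hypothetical helpers. */
void emulate_vmwrite(struct vmcs *shadow_vmcs12);
void prepare_vmcs02(struct vmcs *vmcs01, struct vmcs *vmcs12, struct vmcs *vmcs02);
void enter_l2(struct vmcs *vmcs02);
void resume_l1(void);
void handle_ordinary_l1_exit(uint32_t reason);

/* VMX-instruction exit reasons defined by Intel VT-x. */
#define EXIT_REASON_VMLAUNCH 20
#define EXIT_REASON_VMRESUME 24
#define EXIT_REASON_VMWRITE  25

/* L0's handler for a VMExit taken while L1 was running. */
static void l0_handle_exit_from_l1(uint32_t exit_reason)
{
    switch (exit_reason) {
    case EXIT_REASON_VMWRITE:
        /* L1 is editing the VMCS it keeps for L2 (VMCS1,2): emulate the
         * write against the in-memory copy held on L1's behalf, then
         * VMEntry back into L1 (steps 1-3 on the slide). */
        emulate_vmwrite(&vmcs12);
        resume_l1();
        break;

    case EXIT_REASON_VMLAUNCH:
    case EXIT_REASON_VMRESUME:
        /* L1 asked to enter L2.  L0 does it on L1's behalf: refresh the
         * VMCS it actually uses to run L2 (VMCS0,2) from VMCS1,2 and
         * VMCS0,1, then perform the real VMEntry into L2.  To L1 this
         * looks like a single "virtual VMEntry". */
        prepare_vmcs02(&vmcs01, &vmcs12, &vmcs02);
        enter_l2(&vmcs02);
        break;

    default:
        handle_ordinary_l1_exit(exit_reason);
        resume_l1();
        break;
    }
}
```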

Page 12

VMCS shadowing

• Nesting VMX means keeping several VMCSs:
  1. VMCS0,1 — used by L0 to control L1
  2. VMCS1,2 — used by L1 to control L2
  3. VMCS0,2 — used by L0 to control L2

• VMCS1,2 is shadowed into VMCS0,2
• This is what makes direct L0 ⇄ L2 transitions possible

(Diagram: hardware / L0 host VMM / L1 guest VMM / L2 guest OS, with VMCS0,1 and VMCS0,2 held by L0 and VMCS1,2 shadowed into VMCS0,2. A sketch of the merge follows.)
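
Below is a minimal sketch of the merge itself, reusing the conceptual struct from the VMCS slide. It only shows the general idea (guest state from VMCS1,2, host state from L0, exit conditions combined so that L0 sees both its own and L1's); a real implementation, such as KVM's nested VMX code, copies and sanitizes many more fields.

```c
/* Assumes the illustrative `struct vmcs_conceptual` from the earlier sketch. */
static void merge_into_vmcs02(const struct vmcs_conceptual *vmcs01,
                              const struct vmcs_conceptual *vmcs12,
                              struct vmcs_conceptual *vmcs02)
{
    /* Guest state: L2 must look exactly the way L1 set it up. */
    vmcs02->guest = vmcs12->guest;

    /* Host state: a real VMExit always lands in L0, never directly in L1,
     * so the host state is L0's own (taken from VMCS0,1). */
    vmcs02->host = vmcs01->host;

    /* Controls: L0 needs every exit it wants for itself plus every exit
     * L1 asked for on L2, so the exit conditions are combined. */
    vmcs02->ctl.pin_based_exec = vmcs01->ctl.pin_based_exec |
                                 vmcs12->ctl.pin_based_exec;
    vmcs02->ctl.cpu_based_exec = vmcs01->ctl.cpu_based_exec |
                                 vmcs12->ctl.cpu_based_exec;
    /* entry/exit controls, the EPT pointer, etc. need similar treatment. */
}
```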

Page 13

Emulating VMEntry/VMExit

• The VMEntries and VMExits involving L2 have to be emulated carefully
  – Events that belong to L0
  – Events that belong to L1

Page 14

Handling events that belong to L0

• Events L0 itself should handle
  – Conditions specified in VMCS0,2 but not in VMCS1,2
  – External interrupts, NMIs, and so on

• L0 handles the event and then resumes L2 directly (sketch below)

(Diagram: hardware / L0 host VMM / L1 guest VMM / L2 guest OS, with VMCS0,1, VMCS0,2 and the shadowed VMCS1,2; flow: 1. VMExit → 2. handle the event in L0 → 3. VMEntry back into L2)
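
A small C sketch of how L0 might route a VMExit taken while L2 was running: if the exit condition is one only L0 cares about (set in VMCS0,2 but not in VMCS1,2), L0 handles it and resumes L2 directly. l1_wants_exit() and the other helpers are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

struct vmcs;                       /* opaque in this sketch */
extern struct vmcs vmcs12;

/* Hypothetical helpers. */
bool l1_wants_exit(const struct vmcs *vmcs12, uint32_t exit_reason);
void handle_in_l0(uint32_t exit_reason);        /* e.g. external interrupt, NMI */
void resume_l2(void);                           /* VMEntry into L2 via VMCS0,2  */
void reflect_exit_to_l1(uint32_t exit_reason);  /* see the next sketch          */

/* Called in L0 for every VMExit taken while L2 was running via VMCS0,2. */
static void l0_handle_exit_from_l2(uint32_t exit_reason)
{
    if (!l1_wants_exit(&vmcs12, exit_reason)) {
        /* The condition was requested only by L0 (e.g. an external
         * interrupt or NMI destined for the host): handle it here and put
         * L2 straight back on the CPU.  L1 never sees this exit. */
        handle_in_l0(exit_reason);
        resume_l2();
    } else {
        /* The condition belongs to L1: forward it (Page 15). */
        reflect_exit_to_l1(exit_reason);
    }
}
```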

Page 15

Handling events that belong to L1

• Events whose exit conditions are specified in VMCS1,2
• L0 forwards the event to L1
  – L0 updates the Exit Reason in VMCS1,2 so that, from L1's point of view, the exit appears to have come from L2

• L0 resumes L1; after L1 has handled the event, control goes back through L0, which resumes L2 (sketch below)

(Diagram: hardware / L0 host VMM / L1 guest VMM / L2 guest OS, with VMCS0,1, VMCS0,2 and the shadowed VMCS1,2; flow: 1. VMExit → 2. update VMCS1,2 → 3. VMEntry into L1 → 4. L1 handles the event → 5. VMExit → 6. VMEntry into L2; steps 2 through 6 appear to L1 as a virtual VMExit/VMEntry)
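
A sketch of the forwarding path, i.e. the "virtual VMExit": L0 rewrites VMCS1,2 so that, to L1, it looks as though L2 exited straight into L1. The exit_reason field and sync_vmcs02_to_vmcs12() follow the conceptual structures used in the earlier sketches and are hypothetical; a real implementation copies much more (exit qualification, interruption info, guest state, and so on).

```c
#include <stdint.h>

/* Assumes the illustrative `struct vmcs_conceptual` from the VMCS sketch. */
extern struct vmcs_conceptual vmcs12, vmcs02;

/* Hypothetical helpers. */
void sync_vmcs02_to_vmcs12(const struct vmcs_conceptual *vmcs02,
                           struct vmcs_conceptual *vmcs12);
void resume_l1(void);   /* VMEntry into L1 via VMCS0,1 */

/* Forward an exit that belongs to L1 (its condition is set in VMCS1,2). */
static void reflect_exit_to_l1(uint32_t exit_reason)
{
    /* Steps 1-2: record in VMCS1,2 why "L2 exited", copying the relevant
     * exit information and guest state out of VMCS0,2, where the hardware
     * actually wrote it. */
    vmcs12.exit_reason = exit_reason;
    sync_vmcs02_to_vmcs12(&vmcs02, &vmcs12);

    /* Steps 3-4: VMEntry into L1, which wakes up in its own VMExit handler
     * and handles the event, believing it came directly from L2. */
    resume_l1();

    /* Steps 5-6: when L1 later executes VMRESUME to go back to L2, that
     * instruction itself traps to L0, which performs the real VMEntry
     * into L2 via VMCS0,2 (see the sketch on Page 11). */
}
```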

Page 16

Memory management in a nested VMM

• A nested VMM needs a three-stage address translation
  – L2 virtual → L2 physical → L1 physical → L0 physical
  – Hardware support (EPT, NPT) exists for two-stage translation, but not for three

• How should the extra stage of address translation be handled?

Page 17

Getting by with shadow page tables

• Shadow-on-shadow
  – Both L0 and L1 maintain shadow page tables
  – Accesses are translated to L0 physical addresses in one step through L0's SPT0→2

• Shadow-on-EPT
  – L0 uses EPT, while L1 maintains a shadow page table
  – Large overhead: every time L2 takes a page fault there is a VMExit to L0, after which L1 must update its shadow page table...

(Excerpt from the Turtles Project paper [OSDI '10], as embedded in the slide:)

Figure 4: MMU alternatives for nested virtualization

Shadow-on-shadow is used when the processor does not support two-dimensional page tables, and is the least efficient method. Initially, L0 creates a shadow page table to run L1 (SPT0→1). L1, in turn, creates a shadow page table to run L2 (SPT1→2). L0 cannot use SPT1→2 to run L2 because this table translates L2 guest virtual addresses to L1 host physical addresses. Therefore, L0 compresses SPT0→1 and SPT1→2 into a single shadow page table, SPT0→2. This new table translates directly from L2 guest virtual addresses to L0 host physical addresses. Specifically, for each guest virtual address in SPT1→2, L0 creates an entry in SPT0→2 with the corresponding L0 host physical address.

Shadow-on-EPT is the most straightforward approach to use when the processor supports EPT. L0 uses the EPT hardware, but L1 cannot use it, so it resorts to shadow page tables. L1 uses SPT1→2 to run L2. L0 configures the MMU to use SPT1→2 as the first translation table and EPT0→1 as the second translation table. In this way, the processor first translates from L2 guest virtual address to L1 host physical address using SPT1→2, and then translates from the L1 host physical address to the L0 host physical address using the EPT0→1.

Though the Shadow-on-EPT approach uses the EPT hardware, it still has a noticeable overhead due to page faults and page table modifications in L2. These must be handled in L1, to maintain the shadow page table. Each of these faults and writes cause VMExits and must be forwarded from L0 to L1 for handling. In other words, Shadow-on-EPT is slow for exactly the same reasons that Shadow itself was slow for single-level virtualization—but it is even slower because nested exits are slower than non-nested exits.

In multi-dimensional page tables, as in two-dimensional page tables, each level creates its own separate translation table. For L1 to create an EPT table, L0 exposes EPT capabilities to L1, even though the hardware only provides a single EPT table.

Since only one EPT table is available in hardware, the two EPT tables should be compressed into one: Let us assume that L0 runs L1 using EPT0→1, and that L1 creates an additional table, EPT1→2, to run L2, because L0 exposed a virtualized EPT capability to L1. The L0 hypervisor could then compress EPT0→1 and EPT1→2 into a single EPT0→2 table as shown in Figure 4. Then L0 could run L2 using EPT0→2, which translates directly from the L2 guest physical address to the L0 host physical address, reducing the number of page fault exits and improving nested virtualization performance. In Section 4.1.2 we demonstrate more than a three-fold speedup of some useful workloads with multi-dimensional page tables, compared to shadow-on-EPT.

The L0 hypervisor launches L2 with an empty EPT0→2 table, building the table on-the-fly, on L2 EPT-violation exits. These happen when a translation for a guest physical address is missing in the EPT table. If there is no translation in EPT1→2 for the faulting address, L0 first lets L1 handle the exit and update EPT1→2. L0 can now create an entry in EPT0→2 that translates the L2 guest physical address directly to the L0 host physical address: EPT1→2 is used to translate the L2 physical address to an L1 physical address, and EPT0→1 translates that into the desired L0 physical address.

To maintain correctness of EPT0→2, the L0 hypervisor needs to know of any changes that L1 makes to EPT1→2. L0 sets the memory area of EPT1→2 as read-only, thereby causing a trap when L1 tries to update it. L0 will then update EPT0→2 according to the changed entries in EPT1→2. L0 also needs to trap all L1 INVEPT instructions, and invalidate the EPT cache accordingly.

By using huge pages [34] to back guest memory, L0 can create smaller and faster EPT tables. Finally, to further improve performance, L0 also allows L1 to use VPIDs. With this feature, the CPU tags each translation in the TLB with a numeric virtual-processor id, eliminating the need for TLB flushes on every VMEntry and VMExit. Since each hypervisor is free to choose these VPIDs arbitrarily, they might collide and therefore L0 needs to map the VPIDs that L1 uses into valid L0 VPIDs.

3.4 I/O: Multi-level Device Assignment

I/O is the third major challenge in server virtualization. There are three approaches commonly used to provide I/O services to a guest virtual machine. Either the hypervisor emulates a known device and the guest uses an unmodified driver to interact with it [47], or a para-virtual driver is installed in the guest [6, 42], or the host assigns a real device to the guest which then controls the device directly [11, 31, 37, 52, 53]. Device assignment generally provides the best performance [33, 38, 53], since it minimizes the number of I/O-related world switches between the virtual machine and its hypervisor, and although it complicates live migration, device assignment and live ...
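
The "compression" step for shadow-on-shadow described in the excerpt can be illustrated with a toy C sketch: every mapping in SPT1→2 (L2 virtual to L1 physical) is rewritten as an entry of SPT0→2 (L2 virtual to L0 physical). The flat single-level tables and translate_l1_to_l0() are purely illustrative assumptions; real shadow page tables are multi-level and are updated incrementally on faults and on guest page-table writes.

```c
#include <stdbool.h>
#include <stdint.h>

#define NPAGES 1024   /* toy address-space size, in pages */

/* A single-level "page table" used only for illustration. */
struct flat_pt {
    bool     present[NPAGES];
    uint64_t pfn[NPAGES];     /* target page frame number */
};

/* Hypothetical: resolve an L1 physical frame to an L0 physical frame
 * (what SPT0->1 / L0's memory map for L1 provides). */
bool translate_l1_to_l0(uint64_t l1_pfn, uint64_t *l0_pfn);

/* Build SPT0->2 (L2 virtual -> L0 physical) out of SPT1->2
 * (L2 virtual -> L1 physical). */
static void compress_spt(const struct flat_pt *spt12, struct flat_pt *spt02)
{
    for (uint64_t vpn = 0; vpn < NPAGES; vpn++) {
        uint64_t l0_pfn;

        if (spt12->present[vpn] &&
            translate_l1_to_l0(spt12->pfn[vpn], &l0_pfn)) {
            spt02->present[vpn] = true;
            spt02->pfn[vpn]     = l0_pfn;  /* one-step L2 virtual -> L0 physical */
        } else {
            spt02->present[vpn] = false;   /* will fault and be filled lazily    */
        }
    }
}
```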

Page 18

Multi-dimensional paging

• Makes L2-physical → L0-physical translation possible with a single EPT
  – L0 keeps an EPT0→2 that is the composition of the two EPTs: L0's EPT0→1 and L1's EPT1→2
  – Page-fault VMExits decrease, so the overhead goes down
  – Updates that L1 makes to EPT1→2 have to be detected and reflected into EPT0→2

• L0 traps L1's updates to EPT1→2 and its INVEPT instructions (sketch below)
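
Following the description in the paper excerpt on Page 17, here is a minimal C sketch of how L0 might fill EPT0→2 lazily on L2 EPT violations. The ept*_lookup/ept02_map/forward helpers are hypothetical names; the point is composing EPT1→2 and EPT0→1 into a single entry, and forwarding the violation to L1 when L1 has not mapped the page yet.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;   /* L2 guest physical address  */
typedef uint64_t l1pa_t;  /* L1 "host" physical address */
typedef uint64_t hpa_t;   /* L0 host physical address   */

/* Hypothetical walkers/helpers over the three tables involved. */
bool ept12_lookup(gpa_t l2_gpa, l1pa_t *l1_pa);   /* L1's EPT1->2          */
bool ept01_lookup(l1pa_t l1_pa, hpa_t *l0_pa);    /* L0's EPT0->1          */
void ept02_map(gpa_t l2_gpa, hpa_t l0_pa);        /* add entry to EPT0->2  */
void forward_ept_violation_to_l1(gpa_t l2_gpa);   /* virtual VMExit to L1  */
void resume_l2(void);

/* L0's handler for an EPT violation taken while running L2 on the merged
 * EPT0->2, which starts out empty and is filled on demand. */
static void handle_l2_ept_violation(gpa_t faulting_l2_gpa)
{
    l1pa_t l1_pa;
    hpa_t  l0_pa;

    if (!ept12_lookup(faulting_l2_gpa, &l1_pa)) {
        /* L1 has not mapped this L2 page yet: let L1 handle the violation
         * and update EPT1->2; we will fault here again afterwards. */
        forward_ept_violation_to_l1(faulting_l2_gpa);
        return;
    }

    /* Compose the two translations into one EPT0->2 entry:
     * L2 physical -> L1 physical (EPT1->2) -> L0 physical (EPT0->1). */
    if (ept01_lookup(l1_pa, &l0_pa))
        ept02_map(faulting_l2_gpa, l0_pa);

    resume_l2();

    /* To keep EPT0->2 coherent, L0 additionally write-protects the memory
     * backing EPT1->2 and traps L1's INVEPT, invalidating or updating
     * EPT0->2 entries when L1 changes or flushes its table. */
}
```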


Page 19

Handling I/O in a nested VMM

• In a virtualized environment there are three common ways to provide I/O:
  – Emulation
  – Para-virtualization
  – Pass-through (device assignment)

• With a nested VMM, L0 and L1 can each pick one, giving 3 × 3 = 9 possible combinations

Page 20

References

• The Turtles Project [OSDI '10] paper
  – http://static.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

• The KVM implementation
  – arch/x86/kvm/vmx.c