クラウドコンピューティングと要素技術 technologies behind cloud...
TRANSCRIPT
-
クラウドコンピューティングと要素技術TechnologiesbehindCloudComputing
ソフトウェア・クラウド開発プロジェクト実践III
浅井大史
2016年4月22日
-
COMPUTERARCHITECTUREANDOPERATINGSYSTEM
クラウドコンピューティングの要素技術 2
-
ComputerArchitecture
クラウドコンピューティングの要素技術 3
Processor
Core Core
Core Core
Processor
Core Core
Core CoreInterconnect
DRAM
Memorycontroller
Memorycontroller
DRAM
Bus Bus
IOH (I/OHub) PCIe devicePCIe bus
ICH(I/OControllerHub)
Peripherals(USB,Keyboardetc.)
DirectMediaInterface
※ LatestIntelandAMDprocessorsandchipsets
NUMA:Non-UniformMemoryAccess
-
InstructionSetArchitecture(ISA)
• Microprocessorinstructionsetarchitecture– CISC(Complexinstructionsetcomputing)• x86,x86-64(Intel®64,AMD64)
– RISC(Reducedinstructionsetcomputing)• SPARC,MIPS,POWER/PowerPC,ARM
クラウドコンピューティングの要素技術 4
-
InstructionExecutionCycle(abstracted)
1. Instructionfetch(IF)– Fetchinstructionfrominstructioncachememory
2. Instructiondecode(ID)– Decodethefetchedinstruction– Selectregisterscorrespondingtotheinstruction
3. Execute(EX)– Executetheinstruction
4. Memoryaccess(MA)– Load/Storeaccesstomemory
5. Write-back(WB)– Write-backtheresultofexecutiontoregisters
クラウドコンピューティングの要素技術 5
-
Pipeline
クラウドコンピューティングの要素技術 6
IF ID EX MA WBInstruction1
IF ID EX MA WBInstruction2
IF ID EX MA WBInstruction3
IF ID EX MA WBInstruction4
IF ID EX MA WBInstruction5
Time
IF+ID+EX+MA+WBInstruction1
Instruction2
Instruction3
Time
IF+ID+EX+MA+WB
IF+ID+EX+MA+WB
・・・
・・・
“Pipeline”improvesthroughput.
-
Hazards
• Structuralhazards– duetosharedhardwareresource
• Datahazards– duetodatadependencies
• Controlhazards
クラウドコンピューティングの要素技術 7
IF ID EX MA WB
IF ID EX MA WB
IF ID EX MA WB
IF ID ID
Time
ID EX MA WB
IF ID EX MA WB
IF ID ID ID ID
Time
r3ßr1+r2
r4ßr3*r1 EX MA WB
IF ID EX MA WB
IF IF IF IF IF
Time
jmp
r3ßr1+r2 ID EX MA WB
-
Superscalar
クラウドコンピューティングの要素技術 8
IF ID EX MA WBInstruction1
IF ID EX MA WBInstruction2
IF ID EX MA WBInstruction3
IF ID EX MA WBInstruction4
IF ID EX MA WBInstruction5
Time
・・・
Scalarprocessor:Noinstruction-levelparallelismVector:processor:Paralleldataprocessing à Superscalar:Instruction-levelparallelism
-
EX
Out-of-OrderExecution/Completion
クラウドコンピューティングの要素技術 9
IF ID EX MA WBr3←r1+r2
ID ID EX MA WBr1←r2+r3
EXr6←r4+r5
Time
・・・
ID ID ID
IF ID EX MA WBr3←r1+r2
r1←r2+r3
ID EX MA WBr6←r4+r5
Time
・・・
EX MA WBEXEXID
IF
IF
ID ID EX MA WBID ID IDIF
IF
Out-of-OrderExecution
Out-of-OrderCompletion
In-orderexecution/completion
-
OtherTechnologies
• Cache–Write-Throughvs.Write-Back– FullyAssociativevs.SetAssociative– Coherency(Multicore/Multiprocessor)
• Virtualmemory– Pagetable• TLB(TranslationLookaside Buffer)
• SimultaneousMultithreading (Hyper-ThreadingTechnology)
クラウドコンピューティングの要素技術 10
-
OperatingSystem(OS)
• Resourcesharing– CPU:Processscheduling– Memory:Memorymanagement
• Privilegemanagement– Notallowtoexecuteprivilegedinstructionstousers’process• Systemcall:Dispatchaprivilegedinstructiontokernel
– e.g.,I/Oinstructions,cli/sti
• Inter-processcommunication• etc…
クラウドコンピューティングの要素技術 11
-
x86/x86-64RingProtection
クラウドコンピューティングの要素技術 12
Ring0Ring1
Ring2
Ring3
mov cr0,raxetc…
xor rax,rbx etc…
rdmsr
KernelinBSD/Linux
Userland inBSD/Linux
Systemcall
-
MultitaskOS:ResourceSharing
クラウドコンピューティングの要素技術 13
Time
Space
Core#0Core#1
Core#2Core#3
Time
Space
Processor
Memory
-
SchedulingAlgorithms
• Systems– Batch
• First-ComeFirst-Serve• ShortestJobFirst• ShortestRemainingTimeNext
– Interactive• Round-RobinScheduling• PriorityScheduling• PrioritySchedulingw/MultipleQueue(differentquantum)• ShortestProcessNext• GuaranteedScheduling• LotteryScheduling• Fair-ShareScheduling
– Realtime
クラウドコンピューティングの要素技術 14
-
TechnologiesbehindIaaS
クラウドコンピューティングの要素技術 15
-
TechnologiesbehindIaaS
クラウドコンピューティングの要素技術 16
Virtualmachines
Virtualmachines
Virtualmachines
IaaSVirtualNetwork,
SoftwareDefinedNetwork(detailsinnextweek)
Hypervisor/VMM,Hardware(peripheral)emulations
-
TechnologiesbehindIaaS
クラウドコンピューティングの要素技術 17
Virtualmachines
Virtualmachines
Virtualmachines
IaaSHypervisor/VMM- Resourcesharing
- CPU- Memory
- Privilegemanagement- Virtualchipset/controllers
- PIC/APIC- PIT- RTCCMOS
Virtualhardware(peripheralemulations)- IDE/SATAHDD/CDdrive- EthernetNIC- Virtualdisplay(videoRAM)- BIOS/UEFI- etc…
maybecombinedwithhypervisor
-
Terminology
• Hypervisor/VirtualMachineMonitor(VMM)– Apiecesoftware/firmwarethatrunsvirtualmachines
• VirtualMachine(VM)– Acomputeremulatedbysoftware• (usuallybytheassistanceofhardware)
• GuestOS– AnOSthatisexecutedonavirtualmachine
クラウドコンピューティングの要素技術 18
-
TwoTypesofHypervisor
クラウドコンピューティングの要素技術 19
Hardware
Hypervisor
PrivilegedOS VM
OS
VM
OS
Hardware
OS
VM
OS
VM
OS
Hypervisor Hypervisor
Type1:Native(baremetal)hypervisor Type2:Hostedhypervisor
Implements- UIforhypervisor- virtualhardwaredevices
Implements- UIforhypervisor- virtualhardwaredevices
Note:Someimplementationsmayimplementvirtualhardwaredevicescombinedwithhypervisor.
e.g.,- Xen- VMwareESX/ESXi (vSphere Hypervisor)
e.g.,- VirtualBox- VMwareWorkstation- LinuxKVM(sometimesclassifiedastype1)
-
TwoTypesofVirtualization
• Fullvirtualization– Supportthefullsetofinstructions(ofcorrespondingISA)inVMs• Pros:NomodificationstoguestOSarerequired.• Cons:Hardware-assistanceisrequired.
• Paravirtualization– Supportapartialsetofinstructions(ofcorrespondingISA)inVMs• Pros:Hardware-assistanceisnotrequired.• Cons:ModificationstoguestOSarerequired.
– Privilegedinstructionsmustbereplacedwithhypervisorcall.
クラウドコンピューティングの要素技術 20
-
Paravirtualization:x86/x86-64RingProtection
クラウドコンピューティングの要素技術 21
Ring0Ring1
Ring2
Ring3
mov cr0,raxetc…
xor rax,rbx etc…
rdmsr
KernelinBSD/Linux
Userland inBSD/Linux
Systemcall
-
Paravirtualization:x86/x86-64RingProtection
クラウドコンピューティングの要素技術 22
Ring0Ring1
Ring2
Ring3
mov cr0,raxetc…
xor rax,rbx etc…
rdmsr
Hypervisor (Xen)
Guestuserland (BSD/Linux)
Guestkernel(BSD/Linux)Hypervisorcall
Systemcall
-
DifferencesbetweenOSandHypervisor• Hypervisor/VMM
– Resourcesharing• CPU• Memory
– Privilegemanagement– Virtualchipset/controllers
• PIC/APIC• PIT• RTCCMOS
• Virtualhardware(peripheralemulations)– IDE/SATAHDD/CDdrive– EthernetNIC– Virtualdisplay(videoRAM)– BIOS/UEFI– etc…
クラウドコンピューティングの要素技術 23
-
DifferencesbetweenOSandHypervisor
クラウドコンピューティングの要素技術 24
Hypervisor
OS
Resourcesharing(e.g.,scheduler)
PrivilegemanagementVirtualchipset/hardware
POSIX
Usermanagement
Exclusivecontrol
Inter-processcommunication
-
ResourceSharing
クラウドコンピューティングの要素技術 25
Time
Space
Core#0Core#1
Core#2Core#3
Time
Space
Processor
Memory
-
DifferencesbetweenOSandHypervisor• Scheduler
– OS:Processscheduler• Blocked: Inter-processcommunication
– Hypervisor:VMscheduler• Blocked:Notinter-VM;betweenVMandvirtualdevices
à Simpler
• Memorymanagement– OS:Varietyofallocationsizes(severalbyte– ~Gbytes)
• Usualpagetableentrysize:4KiB/2MiB• Complexallocationalgorithm
– Linux:Buddysystem(forpageallocation),Slaballocator(forsmallersizeallocation)
– Hypervisor:~Gbytesà Easier
クラウドコンピューティングの要素技術 26
Ready
RunningBlocked
-
Hardware-assistedVirtualizationTechnology
• Intel®VT-x,AMD-V
クラウドコンピューティングの要素技術 27
Ring0
Ring3
OS
Userland
HypervisorRing0
Genericdesignofhypervisorwithhardware’svirtualizationassist(Intel®VT-x)
Ring1
Ring3
OS
Userland
OS
Userland
Hypervisor Ring0
GenericdesignofhypervisorwithIARingprotectionarchitecturewithouthardware’svirtualizationassist
VMMNon-root
VMMRoot
Ring3
VMExit
VMEntrysyscallsysret
syscallsysret
-
LiveMigration:PreCopy
クラウドコンピューティングの要素技術 28
Source
Cleardirtybit
Destinationto_copy bitmap
Somebecomedirty
Copypages
page_table_with_dirty_bit
Copythedirtypagesagain
Copyallpages
Startcopy
If#ofdirtypagesbelowthreshold1. pausethe VM2. copytheremainingpages3. starttheVMatdestination
-
LiveMigration:PostCopy
クラウドコンピューティングの要素技術 29
Destination Source(ii)
Destination(i)
Destination(iii)
1
2
• Read/Writeoperations– Read
• Ifthepagehasalreadycopied†todestination– (i)readthepagefromdestination
• Otherwise– (ii)readthepagefromsourceandcopyittodestination
– Write• (iii)writetodestination
(†Copiedpagesaremanagedbyabitmap)
-
RUNNING(?)CODEOFOPERATINGSYSTEMANDHYPERVISOR
“Wereject:kings,presidentsandvoting.Webelievein:roughconsensusandrunningcode.”byDavidD.Clark
クラウドコンピューティングの要素技術 30
-
PageTable(x86-64longmode)
クラウドコンピューティングの要素技術 31
4-20 Vol. 3A
PAGING
from CR3[11:0] to 000H (see also Section 4.10.4.1). In addition, the logical processor subsequently determines the memory type used to access the PML4 table using CR3.PWT and CR3.PCD, which had been bits 4:3 of the PCID.
IA-32e paging may map linear addresses to 4-KByte pages, 2-MByte pages, or 1-GByte pages.1 Figure 4-8 illus-trates the translation process when it produces a 4-KByte page; Figure 4-9 covers the case of a 2-MByte page, and Figure 4-10 the case of a 1-GByte page.
1. Not all processors support 1-GByte pages; see Section 4.1.4.
Figure 4-8. Linear-Address Translation to a 4-KByte Page using IA-32e Paging
Directory Ptr
PTE
Linear Address
Page Table
PDPTE
CR3
39 38
Pointer Table
99
40
129
40
4-KByte Page
Offset
Physical Addr
PDE with PS=0
Table011122021
Directory30 29
Page-Directory-
Page-Directory
PML447
9
PML4E
40
40
40
Figure fromIntel®64andIA-32ArchitecturesSoftwareDeveloper’sManual Vol.3A4.5
PML4E:PageMapLevel4EntriesPDPTE:Page-Directory-PointerTableEntriesPDE:Page-DirectoryEntriesPTE:Page-TableEntries
-
PageTable(x86-64longmode)
クラウドコンピューティングの要素技術 32
4-28 Vol. 3A
PAGING
.
4.6 ACCESS RIGHTSThere is a translation for a linear address if the processes described in Section 4.3, Section 4.4.2, and Section 4.5 (depending upon the paging mode) completes and produces a physical address. Whether an access is permitted by a translation is determined by the access rights specified by the paging-structure entries controlling the transla-tion;1 paging-mode modifiers in CR0, CR4, and the IA32_EFERMSR; and the mode of the access.
63
62
61
60
59
58
57
56
55
54
53
52
51
M1
NOTES:1. M is an abbreviation for MAXPHYADDR.
M-1 32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10 9 8 7 6 5 4 3 2 1 0
Reserved2
2. Reserved fields must be 0.
Address of PML4 table IgnoredPCD
PWT
Ign. CR3
XD3
3. If IA32_EFER.NXE = 0 and the P flag of a paging-structure entry is 1, the XD flag (bit 63) is reserved.
Ignored Rsvd. Address of page-directory-pointer table Ign. Rsvd
Ign
APCD
PWT
U/S
R/W
1 PML4E:present
Ignored 0PML4E:
notpresent
XD
Ignored Rsvd. Address of1GB page frame ReservedPAT
Ign. G 1 D APCD
PWT
U/S
R/W
1PDPTE:
1GBpage
XD
Ignored Rsvd. Address of page directory Ign. 0Ign
APCD
PWT
U/S
R/W
1PDPTE:page
directory
Ignored 0PDTPE:
notpresent
XD
Ignored Rsvd. Address of2MB page frame ReservedPAT
Ign. G 1 D APCD
PWT
U/S
R/W
1PDE:2MBpage
XD
Ignored Rsvd. Address of page table Ign. 0Ign
APCD
PWT
U/S
R/W
1PDE:pagetable
Ignored 0PDE:not
present
XD
Ignored Rsvd. Address of 4KB page frame Ign. GPAT
D APCD
PWT
U/S
R/W
1PTE:4KBpage
Ignored 0PTE:not
present
Figure 4-11. Formats of CR3 and Paging-Structure Entries with IA-32e Paging
1. With PAE paging, the PDPTEs do not determine access rights.
Figure fromIntel®64andIA-32ArchitecturesSoftwareDeveloper’sManual Vol.3A4.5
-
PageTable(x86-64longmode)
クラウドコンピューティングの要素技術 33
/* Create 64bit page table */pg_setup:
movl $KERNEL_PGT,%ebx /* Low 12 bit must be zero */movl %ebx,%edixorl %eax,%eaxmovl $(512*8*6/4),%ecxrep stosl /* Initialize %ecx*4 bytes from %edi */
/* with %eax *//* Level 4 page map */
leal 0x1007(%ebx),%eaxmovl %eax,(%ebx)
/* Page directory pointers (PDPE) */leal 0x1000(%ebx),%edileal 0x2007(%ebx),%eaxmovl $4,%ecx
pg_setup.1:movl %eax,(%edi)addl $8,%ediaddl $0x1000,%eaxloop pg_setup.1
/* Page directories (PDE) */leal 0x2000(%ebx),%edimovl $0x183,%eaxmovl $(512*4),%ecx
pg_setup.2:movl %eax,(%edi)addl $8,%ediaddl $0x00200000,%eaxloop pg_setup.2
/* Setup page table register */movl %ebx,%cr3
-
MultitaskOS:ContextSwitch
クラウドコンピューティングの要素技術 34
RSP
RFLAGS
RIP
0
0 63
Lowestaddress
HighestaddressDirectionofpop fromstack
CS
0SS
IRETQ
RSP
Intel®64andIA-32ArchitecturesSoftwareDeveloper’sManual Vol.3A5.8.5.1
-
MultitaskOS:TSS(TaskStateSegment)
クラウドコンピューティングの要素技術 35
Figure fromIntel®64andIA-32ArchitecturesSoftwareDeveloper’sManual Vol.3A7.7
Pointtokernel stack(Ring0)
-
MultitaskOS:ContextSwitch(SourceCode)
クラウドコンピューティングの要素技術 36
/* Restart a task */_task_restart:
/* If the next task is not assigned, immediately restart */cmpq $0,(_next_task)jz 1f/* Save stack pointer */movq (_cur_task),%raxmovq %rsp,TASK_RP(%rax)/* Task switch (set the stack frame of the new task) */movq (_next_task),%raxmovq %rax,(_cur_task)movq TASK_RP(%rax),%rsp /* Copy sp0 */movq $0,(_next_task)/* ToDo: Load LDT/cr3 *//* Setup sp0 in TSS */movq (_cur_task),%raxmovq TASK_SP0(%rax),%rdxmovq (_tss),%rbpmovq %rdx,TSS_SP0(%rbp)
1:/* Pop registers */intr_irq_doneiretq
-
Hypervisor
クラウドコンピューティングの要素技術 37
Ring0
Ring3
OS
Userland
HypervisorRing0
VMMNon-root
VMMRoot
Ring3
VMExit
VMEntry
syscallsysret
1. VMREAD/VMWRITE/VMCLEAR toconfigurecurrentVMCS
2. Setupandprepareperipherals(virtualdevices)3. VMLAUNCH4. Handletheinstructions corresponding toVMexits
andschedule/switchVMsw/VMRESUME• VMexitscausedby
1. instructions ofVM2. signalbyhypervisor’s VMX-preemption timer3. VMCALL fromVM
DetailsinIntel®64andIA-32ArchitecturesSoftwareDeveloper’sManualVol.3CChapter23
VMXON:EnableVMXVMXOFF:DisableVMX
VirtualMachineControlStructure(VMCS)- Guest-statearea:Registersetc.- Host-statearea:Hoststate/exitpointoneveryVMexit.- VM-executioncontrolfields- VM-exitcontrolfields- VM-entrycontrolfields- VM-exitinformation fields
*CurrentVMCSperlogicalprocessor (ofphysicalmachine)*ActiveVMCSpervirtualprocessor
-
HowtoImplementLiveMigration• Memorymigration
– PrecopyorPostcopy(orhybrid)
• VMstatecopy– VMexit(atsourcehypervisor)– JustcopyVMCS– VMRESUME tostartVMatdestinationhypervisor
• Storage– Networkstorage– Distributedstorage– Storagemigrationlikememorymigration
• Network– Copyconfiguration (i.e.,MACaddress)ofvirtualNICs
クラウドコンピューティングの要素技術 38
-
LiveMigrationSupport
クラウドコンピューティングの要素技術 39
4-28 Vol. 3A
PAGING
.
4.6 ACCESS RIGHTSThere is a translation for a linear address if the processes described in Section 4.3, Section 4.4.2, and Section 4.5 (depending upon the paging mode) completes and produces a physical address. Whether an access is permitted by a translation is determined by the access rights specified by the paging-structure entries controlling the transla-tion;1 paging-mode modifiers in CR0, CR4, and the IA32_EFERMSR; and the mode of the access.
63
62
61
60
59
58
57
56
55
54
53
52
51
M1
NOTES:1. M is an abbreviation for MAXPHYADDR.
M-1 32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10 9 8 7 6 5 4 3 2 1 0
Reserved2
2. Reserved fields must be 0.
Address of PML4 table IgnoredPCD
PWT
Ign. CR3
XD3
3. If IA32_EFER.NXE = 0 and the P flag of a paging-structure entry is 1, the XD flag (bit 63) is reserved.
Ignored Rsvd. Address of page-directory-pointer table Ign. Rsvd
Ign
APCD
PWT
U/S
R/W
1 PML4E:present
Ignored 0PML4E:
notpresent
XD
Ignored Rsvd. Address of1GB page frame ReservedPAT
Ign. G 1 D APCD
PWT
U/S
R/W
1PDPTE:
1GBpage
XD
Ignored Rsvd. Address of page directory Ign. 0Ign
APCD
PWT
U/S
R/W
1PDPTE:page
directory
Ignored 0PDTPE:
notpresent
XD
Ignored Rsvd. Address of2MB page frame ReservedPAT
Ign. G 1 D APCD
PWT
U/S
R/W
1PDE:2MBpage
XD
Ignored Rsvd. Address of page table Ign. 0Ign
APCD
PWT
U/S
R/W
1PDE:pagetable
Ignored 0PDE:not
present
XD
Ignored Rsvd. Address of 4KB page frame Ign. GPAT
D APCD
PWT
U/S
R/W
1PTE:4KBpage
Ignored 0PTE:not
present
Figure 4-11. Formats of CR3 and Paging-Structure Entries with IA-32e Paging
1. With PAE paging, the PDPTEs do not determine access rights.
Figure fromIntel®64andIA-32ArchitecturesSoftwareDeveloper’sManual Vol.3A4.5
Dirtybit
-
TechnologiesbehindPaaS
クラウドコンピューティングの要素技術 40
-
TechnologiesbehindPaaS
• LXC(LinuxContainers)– Lightweightvirtualmachine
• Runmanyinstances• PlatformsandFrameworks– Platforms
• Hadoop (fordistributedcomputing)• Docker (forVMimagemanagement/deployment)
– Frameworks• RubyonRails(forWeb)
– Loadbalancing• (Thesenetworkportionwillbeinnextweek.)
クラウドコンピューティングの要素技術 41
-
TechnologiesbehindSaaS
クラウドコンピューティングの要素技術 42
-
TechnologiesbehindSaaS• RichUIoverbrowser
– HTML5• Supportofnewfeatures
– Canvas– Localstorage/Sessionstorage– Localfileaccessw/opagetransition(e.g.,FileReader)– Drag&Drop– Multimediasupportw/othird-partyplugins– Offlineapplicationw/cache
• WebSockets /WebRTC– CSS3
• Newstylemodulese.g.,gradient– JavaScript
• DevelopmentoffasterJavaScriptengines– V8– asm.js
クラウドコンピューティングの要素技術 43
-
GroupPBL• IaaS
– (A1)Cloudcontroller• Hypervisor:KVM• Cloudcontroller:(usinglibvirt API)
– (A2)Cloudstorage• High-speedstorage:10GbE/RAIDcard+SSD
• PaaS– (B1)Hadoop cluster
• Hadoop (MapReduce)– (B2)PaaS cloud
• Linuxcontainer• SaaS
– (C1)SaaS application• Server:Scale-outcapableapplication• Client:Browser
クラウドコンピューティングの要素技術 44
YourCloud
-
ForGroupPBL
• Sendyourinfo&preference– Student#(学生証番号)– Name(氏名)– E-mailaddress(メールアドレス)– 4 rankedpreferencesfromthelistinnextslide• 希望のPBL課題番号を4つ(希望順に)
to.SubjectMUSTbe“CloudPBLcourse2016”
クラウドコンピューティングの要素技術 45