
Page 1: SnowFlock: Rapid Virtual Machine Cloning for Cloud Computing

H. Andrés Lagar-Cavilla, Joseph A. Whitney, Adin Scannell, Philip Patchin, Stephen M. Rumble, Eyal de Lara, Michael Brudno, M. Satyanarayanan

University of Toronto and Carnegie Mellon University
http://sysweb.cs.toronto.edu/snowflock

游清權

Page 2: Outline

• Introduction
• VM Fork
• Design Rationale
• SnowFlock Implementation
• Application Evaluation
• Conclusion and Future Directions

Page 3: Introduction

• VM technology is widely adopted as an enabler of cloud computing
• Benefits
  – Security
  – Performance isolation
  – Ease of management
  – Flexibility (user-customized environments)
  – Use a variable number of physical machines and VM instances depending on the needs of the problem
    • E.g. a task may need only a single CPU during some phases of execution

Page 4: Introduction

• Introduces VM fork
  – Simplifies development and deployment of cloud applications
  – Allows for the rapid (< 1 second) instantiation of stateful computing elements in a cloud environment
• VM fork is similar to the process fork
  – Child VMs receive a copy of all of the state generated by the parent VM prior to forking
  – It differs from process fork in three fundamental ways

Page 5: Introduction

1. The VM fork primitive allows the forked copies to be instantiated on a set of different physical machines
   – Enables the task to take advantage of large compute clusters
   – Previous work [Vrable 2005] is limited to cloning VMs within the same host
2. The primitive is parallel, enabling the creation of multiple child VMs with a single call
3. VM fork replicates all of the processes and threads of the originating VM
   – Enables effective replication of multiple cooperating processes
   – E.g. a customized LAMP (Linux/Apache/MySQL/PHP) stack

Page 6: Introduction

• Enables the trivial implementation of several useful and well-known patterns that are based on stateful replication
• Pseudocode for four of these is illustrated in Figure 1:
  – Sandboxing of untrusted code
  – Instantiating new worker nodes to handle increased load (e.g. due to flash crowds)
  – Enabling parallel computation
  – Opportunistically utilizing unused cycles with short tasks

Page 7: Introduction

• SnowFlock
  – Provides swift parallel stateful VM cloning with little runtime overhead and frugal consumption of cloud I/O resources
• Takes advantage of several key techniques. First:
  – SnowFlock utilizes lazy state replication to minimize the amount of state propagated to the child VMs
  – Clones are instantiated extremely fast by initially copying only the minimal necessary VM data, and transmitting only the fraction of the parent's state that clones actually need

Page 8: Introduction

• Takes advantage of several key techniques. Second:
  – A set of avoidance heuristics eliminates substantial superfluous memory transfers for the common case of clones allocating new private state
• Finally:
  – Child VMs execute very similar code paths and access common data structures
  – A multicast distribution technique for VM state provides scalability and prefetching

Page 9: Introduction

• Evaluated SnowFlock by focusing on a demanding instance of Figure 1(b): interactive parallel computation
• Conducted experiments with applications from bioinformatics, quantitative finance, rendering, and parallel compilation that can be deployed as Internet services
• On 128 processors, SnowFlock achieves speedups coming within 7% or better of optimal execution

Page 10: VM Fork

• Advantages
  – Forked VMs execute independently on different physical hosts
  – Isolation
  – Ease of software development associated with VMs
  – Greatly reduces the performance overhead of creating a collection of identical VMs on a number of physical machines
• Each forked VM proceeds with an identical view of the system
  – Save for a unique identifier (vmid), which allows a VM to be distinguished as parent or child
• Each forked VM has its own independent copy of the OS and virtual disk
  – State updates are not propagated between VMs

Page 11: VM Fork

• Forked VMs are transient entities
  – Their memory image and virtual disk are discarded once they exit
• Any application-specific state or values the children generate must be explicitly communicated to the parent VM
  – E.g. via message passing or a distributed file system
• Conflicts may arise if multiple processes within the same VM simultaneously invoke VM forking
• VM fork is expected to be used in VMs that have been carefully customized to run a single application (like serving a web page)

Page 12: VM Fork

• The semantics of VM fork include integration with a dedicated, isolated virtual network connecting child VMs with their parent
  – Each child is configured with a new IP address based on its vmid, and it is placed on the same virtual subnet (see the sketch below)
  – Child VMs cannot communicate with hosts outside this virtual network
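A minimal sketch of how a clone might derive its private address from its vmid. The subnet base and the mapping are assumptions for illustration; the slides only state that the address is a function of the vmid.

```c
/* Hypothetical sketch: derive a child's private IP from its vmid.
 * The 10.0.0.0/16 base and the layout are assumptions, not SnowFlock's code. */
#include <stdio.h>
#include <stdint.h>

static void vmid_to_ip(uint16_t vmid, char *buf, size_t len)
{
    /* Place parent (vmid 0) and children on 10.0.x.y */
    snprintf(buf, len, "10.0.%u.%u",
             (unsigned)((vmid >> 8) & 0xff), (unsigned)(vmid & 0xff));
}

int main(void)
{
    char ip[16];
    vmid_to_ip(3, ip, sizeof ip);   /* e.g. the third clone */
    printf("clone 3 -> %s\n", ip);
    return 0;
}
```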

Page 13: VM Fork

1. The user must be conscious of the IP reconfiguration semantics
   – Network shares must be (re)mounted after cloning
2. A NAT layer is provided to allow the clones to connect to certain external IP addresses
   – The NAT performs firewalling and throttling
   – Only external inbound connections to the parent VM are allowed
   – Useful to implement a web-based frontend, or to allow access to a dataset provided by another party

Page 14: Design Rationale

• Plotting the cost of suspending and resuming a 1 GB VM to an increasing number of hosts over NFS (see Section 5 for details on the testbed)
• Shows a direct relationship between the I/O involved and fork latency, with latency growing to the order of hundreds of seconds

Page 15: Design Rationale

• Method 1: implement VM fork using existing VM suspend/resume
  – The wholesale copying of a VM to multiple hosts is far too taxing
  – Decreases overall system scalability by clogging the network with gigabytes of data
  – Contention caused by the simultaneous requests from all children turns the source host into a hot spot
• Live migration [Clark 2005, VMotion], a popular mechanism for consolidating VMs in clouds [Steinder 2007, Wood 2007], uses the same algorithm plus extra rounds of copying, taking even longer to replicate VMs

Page 16: Design Rationale

• Method 2: attack VM fork latency with our multicast library
  – Multicast delivers state simultaneously to all hosts
  – Overhead is still in the range of minutes, although the total amount of VM state pushed over the network is substantially reduced
• The fast VM fork implementation is instead based on these insights:
  – Start executing the child VM on a remote site by initially replicating only minimal state
  – Children will typically access only a fraction of the parent's original memory image
  – It is common for children to allocate memory after forking
  – Children often execute similar code and access common data structures

Page 17: Design Rationale

• VM Descriptors
  – A lightweight mechanism that instantiates a new forked VM with only the critical metadata needed to start execution on a remote site
• Memory-On-Demand
  – A mechanism whereby clones lazily fetch portions of VM state over the network as it is accessed
• Experience
  – It is possible to start a child VM by shipping only 0.1% of the parent's state
  – Children require only a fraction of the parent's original memory image; they read portions of a remote dataset or allocate local storage
  – For typical application footprints, these optimizations can reduce communication from 1 GB to roughly 40 MB, about 4%

Page 18: Design Rationale

• Memory-on-demand: a non-intrusive approach
  – Reduces state transfer without altering the behavior of the guest OS
• Another non-intrusive approach: copy-on-write, used by Potemkin [Vrable 2005] (same host only)
  – Potemkin does NOT provide runtime stateful cloning, since all new VMs are copies of a frozen template
• Multicast replies to memory page requests
  – There is high correlation across the memory accesses of the children (insight iv)
  – Prevents the parent from becoming a hot spot
• Multicast provides scalability and prefetching
  – Children operate independently and individually
  – A child waiting for a page does not prevent others from making progress

Page 19: SnowFlock Implementation

• SnowFlock is an open-source project built on the Xen 3.0.3 VMM
• Xen
  – A hypervisor running at the highest processor privilege level, controlling the execution of domains (VMs); the domain kernels are paravirtualized
• SnowFlock
  – Modifications to the Xen VMM plus daemons running in domain0
  – The daemons form a distributed system that controls the life-cycle of VMs (cloning and deallocation)
  – Policy decisions such as resource accounting and allocation of VMs to physical hosts are delegated to suitable cluster management software via a plug-in architecture
  – Uses lazy state replication and avoidance heuristics to minimize state transfer

Page 20: SnowFlock Implementation

• Four mechanisms are used to fork a VM:
1. The parent VM is temporarily suspended to produce a VM descriptor
   – A small file containing VM metadata and guest kernel memory-management data
   – Distributed to other physical hosts to spawn new VMs in sub-second time
2. Memory-on-demand mechanism
   – Lazily fetches additional VM memory state
3. Avoidance heuristics
   – Reduce the amount of memory that needs to be fetched on demand
4. Multicast distribution system (mcdist)
   – Delivers VM state simultaneously and efficiently, providing implicit prefetching

Page 21: Implementation - 1. API

• VM fork in SnowFlock consists of two stages
1. sf_request_ticket: a reservation for the desired number of clones
   – To optimize for common use cases on SMP hardware, cloned VMs span multiple hosts and the processes within each VM span the underlying physical cores
   – Due to user quotas, current load, and other policies, the cluster management system may allocate fewer VMs than requested
2. Fork the VM across the hosts with the sf_clone call
   – When a child VM finishes its part of the computation, it calls sf_exit
   – A parent VM can wait for its children to terminate with sf_join, or force their termination with sf_kill (see the sketch after this list)
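A usage sketch of the two-stage pattern described above, in the style of Figure 1(b). Only the call names (sf_request_ticket, sf_clone, sf_exit, sf_join) come from the slides; the prototypes and return-value conventions below are assumptions for illustration, with the functions assumed to be provided by the SnowFlock toolkit.

```c
/* Sketch of the two-stage VM fork pattern. Prototypes are illustrative. */
#include <stdio.h>

typedef struct { int allocated; } sf_ticket;   /* assumed ticket shape        */
sf_ticket sf_request_ticket(int nclones);      /* reserve clones (may get fewer) */
int  sf_clone(sf_ticket t);                    /* assume: 0 in parent, id > 0 in clones */
void sf_exit(void);
void sf_join(sf_ticket t);

void parallel_job(int id, int nworkers);       /* application-specific work   */

void run(void)
{
    sf_ticket t = sf_request_ticket(32);       /* stage 1: reservation        */
    int id = sf_clone(t);                      /* stage 2: fork across hosts  */
    if (id > 0) {                              /* child: do its share, exit   */
        parallel_job(id, t.allocated);
        sf_exit();
    }
    sf_join(t);                                /* parent: wait for children   */
    printf("all %d workers done\n", t.allocated);
}
```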

Page 22: Implementation - 1. API

• The API is simple and flexible, requiring little modification of existing code bases
• The widely used Message Passing Interface (MPI) library allows unmodified parallel applications to use SnowFlock's capabilities

Page 23: Implementation - 2. VM Descriptors

• A condensed VM image that allows swift VM replication to a separate physical host
• Creation starts by spawning a thread in the VM kernel that quiesces its I/O devices
  – Deactivates all but one of the virtual processors (VCPUs)
  – Issues a hypercall suspending the VM's execution
• When the hypercall succeeds, the suspended VM memory is mapped to populate the descriptor
• The descriptor contains:
  1. Metadata describing the VM and its virtual devices
  2. A few memory pages shared between the VM and the Xen hypervisor
  3. The registers of the main VCPU
  4. The Global Descriptor Tables (GDT) used by the x86 segmentation hardware for memory protection
  5. The page tables of the VM

Page 24: Implementation - 2. VM Descriptors

• The page tables make up the bulk of a VM descriptor
  – Each process in the VM needs a small number of additional page tables
  – The cumulative size of a VM descriptor is thus loosely dependent on the number of processes the VM is executing
• Entries in a page table are "canonicalized" before saving
  – Translated from references to host-specific pages to frame numbers within the VM's private contiguous physical space ("machine" and "physical" addresses in Xen parlance, respectively); a sketch follows below
• A few other values are included in the descriptor, e.g. the cr3 register of the saved VCPU (also canonicalized)
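A minimal sketch of what "canonicalizing" a page-table entry could look like: the host-specific machine frame number (MFN) is replaced by the VM's own physical frame number (PFN) so the entry stays meaningful on another host. The helper names and the m2p lookup table are illustrative assumptions, not SnowFlock's actual code.

```c
/* Sketch: rewrite a PTE from machine frame numbers to VM-physical frames. */
#include <stdint.h>

#define PAGE_SHIFT 12
#define PTE_FLAGS  0xfffUL            /* low bits: present, writable, etc. */

extern uint64_t m2p[];                /* machine-to-physical map (assumed) */

static uint64_t canonicalize_pte(uint64_t pte)
{
    uint64_t mfn   = pte >> PAGE_SHIFT;   /* host machine frame            */
    uint64_t flags = pte & PTE_FLAGS;     /* keep permission bits          */
    uint64_t pfn   = m2p[mfn];            /* VM-private frame number       */
    return (pfn << PAGE_SHIFT) | flags;
}
```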

Page 25: Implementation - 2. VM Descriptors

• The descriptor is multicast to multiple physical hosts (mcdist, Section 4.5)
• Its metadata is used to allocate a VM with the appropriate virtual devices and memory footprint
• All state saved in the descriptor is then loaded:
  – Pages shared with Xen
  – Segment descriptors
  – Page tables
  – VCPU registers
• Physical addresses in page table entries are translated to use the new mapping between VM-specific physical addresses and host machine addresses
• The VM replica resumes execution, enables the extra VCPUs, and reconnects its virtual I/O devices to the new frontends

Page 26: Implementation - 2. VM Descriptors

• Evaluation: time spent replicating a single-processor VM with 1 GB of RAM to n clones on n physical hosts

Page 27: Implementation - 2. VM Descriptors

• The VM descriptor for these experiments was 1051 ± 7 KB
• The time to create a descriptor = "Save Time" (our code) + "Xend Save" (recycled, unmodified Xen code)
• "Starting Clones" is the time spent distributing the order to spawn a clone to each host
• Clone creation on each host is composed of "Fetch Descriptor" (waiting for the descriptor to arrive), "Restore Time" (our code), and "Xend Restore" (recycled Xen code)
• Overall, VM replication is a fast operation (600 to 800 milliseconds)
• Replication time is largely independent of the number of clones created

Page 28: Implementation - 3. Memory-On-Demand

• SnowFlock's memory-on-demand subsystem is called memtap
• After being instantiated from a descriptor, a clone finds it is missing state needed to proceed
• memtap handles this by lazily populating the clone VM's memory with state fetched from the parent (an immutable copy of the parent VM's memory)
• memtap = hypervisor logic + a userspace domain0 process associated with the clone VM
• Fetch path (sketched below):
  1. A missing page is accessed
  2. The hypervisor pauses that VCPU
  3. The memtap process is notified
  4. memtap fetches the page contents from the parent
  5. memtap notifies the hypervisor, and the VCPU may be unpaused
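A minimal sketch of the five-step fault path listed above, from the clone's side. All names (the presence bitmap, pause/notify helpers) are illustrative assumptions; the real logic is split between the Xen hypervisor and the memtap process.

```c
/* Sketch of the clone-side memory-on-demand fault path (steps 1-5 above). */
#include <stdbool.h>
#include <stdint.h>

extern uint8_t present_bitmap[];           /* one bit per guest page        */
extern void pause_vcpu(int vcpu);
extern void unpause_vcpu(int vcpu);
extern void notify_memtap(uint64_t pfn);   /* memtap fetches from parent    */
extern void wait_for_memtap(uint64_t pfn);

static bool page_present(uint64_t pfn)
{
    return present_bitmap[pfn >> 3] & (1 << (pfn & 7));
}

void on_first_access(int vcpu, uint64_t pfn)
{
    if (page_present(pfn))
        return;                            /* already fetched               */
    pause_vcpu(vcpu);                      /* 2. pause the faulting VCPU    */
    notify_memtap(pfn);                    /* 3. ask memtap for the page    */
    wait_for_memtap(pfn);                  /* 4. memtap fetches from parent */
    unpause_vcpu(vcpu);                    /* 5. resume execution           */
}
```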

Page 29: Implementation - 3. Memory-On-Demand

• To allow the hypervisor to trap memory accesses to pages that have not yet been fetched, SnowFlock uses Xen shadow page tables
  – The x86 cr3 register is pointed at an initially empty shadow page table
  – The shadow page table is filled on demand from the real page table as faults on empty entries occur
• On the first access to a page that has not yet been fetched, the hypervisor notifies memtap
• Fetches are also triggered by accesses from domain0 to the VM's memory for the purpose of virtual device DMA

Page 30: Implementation - 3. Memory-On-Demand

• On the parent VM, memtap implements copy-on-write (sketched below)
• Shadow page tables are used in "log-dirty" mode
  – All parent VM memory write attempts are trapped by disabling the writable bit in the shadow page table
  – The hypervisor duplicates the page and patches the mapping of the memtap server process to point to the duplicate
  – The parent VM is then allowed to continue execution
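A minimal sketch of the parent-side copy-on-write step described above: a trapped write duplicates the page and repoints memtap's view at the copy, so the version served to clones stays immutable. All helper names are illustrative assumptions.

```c
/* Sketch of the parent-side CoW handler for a trapped write. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

extern void *alloc_host_page(void);
extern void  remap_memtap_view(uint64_t pfn, void *new_page);
extern void  make_writable_again(uint64_t pfn);

void on_parent_write_fault(uint64_t pfn, void *page)
{
    void *copy = alloc_host_page();
    memcpy(copy, page, PAGE_SIZE);      /* preserve pre-fork contents        */
    remap_memtap_view(pfn, copy);       /* memtap now serves the frozen copy */
    make_writable_again(pfn);           /* parent resumes writing original   */
}
```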

Page 31: Implementation - 3. Memory-On-Demand

• Evaluation: to understand the overhead involved, a microbenchmark of multiple runs of ten thousand page fetches was performed, Figure 4(a)
• A page fetch operation is split into six components:
  – "Page Fault": hardware page fault overhead caused by using shadow page tables
  – "Xen": Xen hypervisor shadow page table logic
  – "HV Logic": hypervisor logic
  – "Dom0 Switch": context switch to the domain0 memtap process
  – "Memtap Logic": memtap internals, mapping the faulting VM page
  – "Network": software (libc and the Linux kernel TCP stack) and hardware overheads

Page 32: Implementation - 4. Avoidance Heuristics

• Fetching pages from the parent still incurs an overhead that may prove excessive for many workloads
• The VM kernel was augmented with two fetch-avoidance heuristics
  – They bypass a large number of unnecessary memory fetches while retaining correctness
• First heuristic
  – Optimizes the general case in which a clone VM allocates new state
  – Intercepts pages selected by the kernel's page allocator
  – The kernel page allocator is invoked when more memory is needed
  – The recipient of the selected pages does not care about the pages' previous contents, so they need not be fetched
  – (… page 6, right)

Page 33: Implementation - 4. Avoidance Heuristics

• The second heuristic addresses the case where a virtual I/O device writes to guest memory
  – Consider block I/O: the target page is typically a kernel buffer that is being recycled and whose previous contents do not need to be preserved
  – Again, there is no need to fetch this page
• The fetch-avoidance heuristics are implemented by mapping the memtap bitmap into the guest kernel's address space (sketched below)
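A minimal sketch of how the guest kernel could use the memtap bitmap, mapped into its address space as described above, to skip useless fetches. The helper names are illustrative; only the shared-bitmap idea comes from the slides.

```c
/* Sketch of the two fetch-avoidance heuristics against a shared bitmap. */
#include <stdint.h>

extern uint8_t *memtap_bitmap;   /* shared with memtap; 1 = page present */

static void mark_present(uint64_t pfn)
{
    memtap_bitmap[pfn >> 3] |= 1 << (pfn & 7);
}

/* Heuristic 1: a freshly allocated page's old contents are irrelevant. */
void on_page_allocated(uint64_t pfn)
{
    mark_present(pfn);           /* pretend it was fetched; never ask parent */
}

/* Heuristic 2: a full-page virtual I/O write overwrites everything anyway. */
void on_io_write_full_page(uint64_t pfn)
{
    mark_present(pfn);
}
```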

Page 34: Implementation - 4. Avoidance Heuristics

• Evaluation: the heuristics result in substantial benefits in both runtime and data transfer
• With the heuristics, state transmissions to clones are reduced to 40 MB, a tiny fraction (3.5%) of the VM's footprint

Page 35: Implementation - 5. Multicast Distribution

• mcdist: a multicast distribution system that efficiently provides data to all cloned VMs simultaneously
• It serves two goals that point-to-point communication does not:
  – First: data needed by clones is often prefetched; when a single clone requests a page, the response also reaches all other clones
  – Second: network load is greatly reduced, since sending a piece of data to all VM clones is a single operation

Page 36: Implementation - 5. Multicast Distribution

• The mcdist server design is minimalistic, containing only switch programming and flow control logic
• Reliability is ensured with a timeout mechanism
• IP multicast sends data to multiple hosts simultaneously and is supported by most off-the-shelf commercial Ethernet hardware
• IP multicast hardware is capable of scaling to thousands of hosts and multicast groups, automatically relaying multicast frames across multiple hops

Page 37: Implementation - 5. Multicast Distribution

• mcdist clients are memtap processes
  – They receive pages asynchronously and unpredictably, in response to requests by fellow VM clones
  – memtap clients batch received pages until a threshold is hit or a page that has been explicitly requested arrives
• A single hypercall is then invoked to map the pages in a batch (sketched below)
• A threshold of 1024 pages has proven to work well in practice
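A minimal sketch of the client-side batching described above: asynchronously received pages are buffered and mapped with one hypercall when the batch reaches 1024 pages or an explicitly requested page arrives. The function names are illustrative assumptions.

```c
/* Sketch of memtap's batched mapping of multicast-delivered pages. */
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define BATCH_THRESHOLD 1024

static uint64_t batch[BATCH_THRESHOLD];
static size_t   batch_len;

extern void map_pages_hypercall(const uint64_t *pfns, size_t n);

void on_page_received(uint64_t pfn, bool explicitly_requested)
{
    batch[batch_len++] = pfn;
    if (batch_len == BATCH_THRESHOLD || explicitly_requested) {
        map_pages_hypercall(batch, batch_len);   /* one hypercall per batch */
        batch_len = 0;
    }
}
```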

Page 38: Implementation - 5. Multicast Distribution

• To maximize total goodput, the server uses flow control logic to limit its sending rate
  – The server and clients estimate their send and receive rates
  – Clients provide explicit feedback
  – The server increases its rate limit linearly; when loss is detected, it scales its rate limit back
• Another server flow control mechanism is lockstep detection
  – When multiple clones issue requests for the same page, duplicate requests are ignored
• A sketch of both mechanisms follows

Page 39: Implementation - 5. Multicast Distribution

• Evaluation: results obtained with SHRiMP
• Shows that multicast distribution's lockstep avoidance works effectively
  – Lockstep-executing VMs issue simultaneous requests that are satisfied by a single response from the server
  – Hence the difference between the "Requests" and "Served" bars in the multicast experiments

Page 40: Implementation - 5. Multicast Distribution

Page 41: Implementation - 5. Multicast Distribution

• Figure 4(c) shows the benefit of mcdist for a case where an important portion of memory state is needed after cloning, so the avoidance heuristics cannot help
• Experiment (NCBI BLAST)
  – Executes queries against a 256 MB portion of the NCBI genome database that the parent caches into memory before cloning
• Speedup results compare SnowFlock with unicast vs. multicast distribution
• The idealized zero-cost fork configuration uses VMs that have been previously allocated, with no cloning or state-fetching overhead

Page 42: Implementation - 6. Virtual I/O Devices -- Virtual Disk

• Implemented with a blocktap [Warfield 2005] driver
  – Multiple views of the virtual disk are supported by a hierarchy of copy-on-write (COW) slices located at the site where the parent VM runs
• Each fork operation adds a new COW slice, rendering the previous state of the disk immutable
• Children access a sparse local version of the disk, fetched on demand from the disk server
• The virtual disk exploits the same optimizations as the memory subsystem
  – Unnecessary fetches during writes are avoided using heuristics
  – The original disk state is provided to all clients simultaneously via multicast

Page 43: Implementation - 6. Virtual I/O Devices -- Virtual Disk

• The virtual disk is used as the base root partition for the VMs
• For data-intensive tasks
  – The vision is to serve data volumes to the clones through network file systems such as NFS, or suitable big-data filesystems such as Hadoop or Lustre [Braam 2002]
• Most work done by clones is processor intensive
  – Writes do not result in fetches
  – The little remaining disk activity mostly hits kernel caches
• This design largely exceeds the demands of many realistic tasks and did not cause any noticeable overhead in the experiments of Section 5

Page 44: Implementation - 6. Virtual I/O Devices -- Network Isolation

• A mechanism is employed to isolate the virtual network (preventing interference and eavesdropping)
• It operates at the level of Ethernet packets, the primitive exposed by Xen virtual network devices
• Before a packet is sent
  – The source MAC address of packets sent by a SnowFlock VM is rewritten as a special address which is a function of both the parent and child identifiers (sketched below)
  – Simple filtering rules are used by all hosts to ensure that no packets delivered to a VM come from VMs that are not its parent or a sibling
• When a packet is delivered
  – The destination MAC address is rewritten to be as expected, rendering the entire process transparent
• A small number of special rewriting rules are required for protocols with payloads containing MAC addresses, such as ARP
• Filtering and rewriting impose an imperceptible overhead while maintaining full IP compatibility
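A minimal sketch of the source-MAC rewrite described above. The slides only state that the synthetic address is a function of the parent and child identifiers; the prefix bytes and field layout below are assumptions.

```c
/* Sketch: rewrite a clone's source MAC as a function of parent and child ids. */
#include <stdint.h>

void rewrite_source_mac(uint8_t mac[6], uint16_t parent_id, uint16_t child_id)
{
    mac[0] = 0x02;                   /* locally administered, unicast        */
    mac[1] = 0x5f;                   /* arbitrary tag byte (assumption)      */
    mac[2] = parent_id >> 8;
    mac[3] = parent_id & 0xff;
    mac[4] = child_id >> 8;
    mac[5] = child_id & 0xff;
}
```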

Page 45: Application Evaluation

• The evaluation focuses on a particularly demanding scenario: delivering interactive parallel computation, in which a VM forks multiple workers to participate in a short-lived, computationally intensive parallel job
• Scenario
  – Users interact with a web frontend and submit queries
  – A parallel algorithm runs on a compute cluster
• Testbed: a cluster of 32 Dell PowerEdge 1950 blade servers

Page 46: Application Evaluation

• Each host
  – 4 GB of RAM
  – 4 Intel Xeon 3.2 GHz cores
  – Broadcom NetXtreme II BCM5708 gigabit NIC
• All machines ran the SnowFlock prototype (Xen 3.0.3)
• Paravirtualized Linux 2.6.16.29 in both guest and host
• All machines were connected to two daisy-chained Dell PowerConnect 5324 gigabit switches

Page 47: Applications

• Three typical applications from bioinformatics, plus three applications from graphics rendering, parallel compilation, and financial services
• Each is driven by a workflow shell script that clones the VM and launches the application
• NCBI BLAST: computational tool used by biologists
• SHRiMP: tool for aligning large collections of very short DNA sequences
• ClustalW: multiple alignment of a collection of protein or DNA sequences
• QuantLib: toolkit widely used in quantitative finance
• Aqsis (RenderMan): used in film and television visual effects [Pixar]
• distcc: parallel compilation

Page 48: Results

• 32 4-core SMP VMs on 32 physical hosts
• The evaluation aims to answer the following questions:
  – How does SnowFlock compare to other methods for instantiating VMs?
  – How close does SnowFlock come to achieving optimal application speedup?
  – How scalable is SnowFlock?

Page 49: Results - Comparison

• SHRiMP on 128 processors under three configurations:
  – SnowFlock with all the mechanisms enabled
  – Xen's standard suspend/resume using NFS
  – Xen's suspend/resume using multicast to distribute the suspended VM image

Page 50: Results - Application Performance

Page 51: Results - Application Performance

• Compares SnowFlock to an optimal "zero-cost fork" baseline
• Baseline
  – 128 threads to measure overhead
  – One thread to measure speedup
• Zero-cost
  – VMs previously allocated, with no cloning or state-fetching overhead, sitting idle
  – Overly optimistic and not representative of cloud computing environments
• Zero-cost VMs are vanilla Xen 3.0.3 domains, configured identically to SnowFlock VMs

Page 52: Results - Application Performance

• SnowFlock performs extremely well, reducing execution time from hours to tens of seconds for all the benchmarks
• Speedups are very close to the zero-cost optimal, coming within 7% of the optimal runtime
• The overhead of VM replication and on-demand state fetching is small
• ClustalW shows the best results: less than 2 seconds of overhead for a 25-second task

Page 53: Scale and Agility

• Addresses SnowFlock's capability to support multiple concurrent forking VMs
  – Launch four VMs that each fork 32 uniprocessor VMs
• After completing a parallel task, each parent VM joins and terminates its children, then launches another parallel task, repeating five times
• Each parent VM runs a different application
  – An "adversarial allocation" was employed in which each task uses 32 processors, one per physical host
  – 128 SnowFlock VMs are active at most times
  – Each physical host needs to fetch state from four parent VMs

Page 54: Scale and Agility

• SnowFlock is capable of withstanding the increased demands of multiple concurrent forking VMs
• We believe that further optimizing mcdist would make running times more consistent
• SnowFlock can perform a 32-host parallel computation of 40 seconds or less with five seconds or less of overhead

Page 55: Conclusion and Future Directions

• Introduced VM fork and SnowFlock, a Xen-based implementation
• VM fork
  – Instantiates dozens of VMs on different hosts in sub-second time, with low runtime overhead and frugal use of cloud I/O resources
• SnowFlock
  – Drastically reduces cloning time by copying only the critical state and fetching the VM's memory image efficiently on demand
• Simple modifications to the guest kernel reduce network traffic by eliminating the transfer of pages that will be overwritten
• Multicast exploits the locality of memory accesses across cloned VMs at low cost

Page 56: Conclusion and Future Directions

• SnowFlock is an active open-source project
  – Plans involve adapting SnowFlock to big-data applications
• The interactions of VM fork with data-parallel APIs are fertile research ground
• SnowFlock's objective has been performance rather than reliability
  – Memory-on-demand provides performance but creates a dependency on a single source of VM state
  – An open question is how to push VM state in the background without sacrificing …
• A further wish: wide-area VM migration