Apache Mesos - meetupfiles.meetup.com/15980712/mesos-london-09-25-14.pdf
TRANSCRIPT
Benjamin Hindman – @benh
Apache Mesos (at Twitter) mesos.apache.org
@ApacheMesos
agenda ① Cluster Management at Twitter
② Mesos
③ Mesos at Twitter
④ VMs, IaaS, and Mesos
⑤ …
Monorail
cluster management
(configuration/package management) (deployment)
circa 2010
Woodstar Monorail Macaw TweetyPie memcached
challenges
① failures
Woodstar Monorail Macaw TweetyPie memcached
② maintenance (aka “planned failures”)
Woodstar Monorail Macaw TweetyPie memcached
planning for failure/maintenance
challenges ③ utilization
Rails
Hadoop
memcached
utilization
buy fewer machines
or run more applications!
planning for utilization
intra-machine resource sharing: share a single machine’s resources between multiple applications (multi-tenancy)
intra-datacenter resource sharing: share multiple machines’ resources between multiple applications
agenda ① Cluster Management at Twitter
② Mesos
③ Mesos at Twitter
④ VMs, IaaS, and Mesos
⑤ …
origins Mesos started as a research project at Berkeley in early 2009 by Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
our motivation
increase performance and utilization of
clusters
our intuition
① static partitioning considered harmful
our intuition
② build new applications
“Map/Reduce is a big hammer, but not everything is a nail!”
workers
distributed system* anatomy
coordinator
* overlooking peer-to-peer distributed systems
static partitioning
coordinator coordinator
Mesos (slaves)
Mesos: level of indirection
coordinator
Mesos (master)
coordinator
Mesos: a level of indirection
① enable running multiple distributed systems on the same cluster of machines and dynamically share the resources more efficiently!
static partitioning considered harmful
Mesos: a level of indirection
② provide common functionality every new distributed system re-implements, like failure detection, task distribution, task starting, task monitoring, task killing, task cleanup!
build new applications
Mesos ≈ cluster manager
cluster management
• PBS (Portable Batch System) • TORQUE • SGE (Sun Grid Engine)
batch computation!
Mesos is an evolution of the cluster manager, designed to run general purpose distributed systems (i.e., not just focused on batch)
Mesos
batch service storage … streaming
support many different types of distributed systems
(1) coordinate for resources (aka resource allocation)
(2) launch tasks
(3) launch tasks
(4) task termination
(5) task status update
agenda ① Cluster Management at Twitter
② Mesos
③ Mesos at Twitter
④ VMs, IaaS, and Mesos
⑤ …
stateless services!
Mesos
batch service storage …
…
streaming
Apache Aurora (incubating)
Apache Aurora (incubating), a scheduler for running stateless services written in any language (but primarily used at Twitter for JVM services)
developer workflow
configuration/package management
(1) bundle services as jar, tar/gzip
(2) upload to HDFS
deployment
(1) describe service using Python-based DSL (service.aurora)
(2) submit service to Aurora using CLI
developer workflow
configuration/package management
(1) bundle services as jar, tar/gzip
(2) upload to HDFS
deployment
(1) describe service using JSON (service.json)
(2) submit service to Marathon via REST
service discovery
Apache ZooKeeper
using Apache ZooKeeper and server sets (github.com/twitter/commons)
(1) service gets launched on machine
(2) service gets registered in a server set in ZooKeeper
(3) other services use ZooKeeper to find services they need
(4) services connect directly with one another
service discovery alternative
(1) service gets launched on machine
(2) update HAProxy with new service location
(3) other services send traffic through HAProxy
ZooKeeper/server sets require injecting code into your clients!
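The server-set flow described above (launch, register, look up, connect directly) can be sketched with an in-memory registry. This is only an illustration: `ServerSetRegistry`, its methods, and the host names and ports are hypothetical stand-ins, not the ZooKeeper or twitter/commons API.

```python
from collections import defaultdict


class ServerSetRegistry:
    """In-memory stand-in for ZooKeeper server sets (hypothetical, for illustration)."""

    def __init__(self):
        self._members = defaultdict(set)

    def register(self, service, host, port):
        # (2) a launched instance registers itself in the service's server set
        self._members[service].add((host, port))

    def deregister(self, service, host, port):
        # an instance that exits or fails leaves the server set
        self._members[service].discard((host, port))

    def lookup(self, service):
        # (3) clients ask the registry where instances currently live
        return sorted(self._members[service])


registry = ServerSetRegistry()
registry.register("tweetypie", "host-a", 31001)
registry.register("tweetypie", "host-b", 31007)
endpoints = registry.lookup("tweetypie")
# (4) a client would now connect directly to one of these endpoints
```

The lookup step is what forces client-side code: every caller has to know how to query the registry, which is the trade-off the HAProxy alternative avoids.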
where are we today?
ops developers
deploys decoupled from ops (many deploys per day, per service)
maintenance consists of “draining” hosts, getting tasks rescheduled, then pulling the cord
challenges revisited
① failures
② maintenance
③ utilization
Mesos
batch service storage … streaming
(5) task status update
(1) when resources become idle, they can be scheduled and reused by other schedulers
(2) multi-tenancy on individual machines
multi-tenancy
[diagram: tasks running inside containers on a single machine]
containerization
started leveraging containerization technology in ~2010
2010: LXC
2012: cgroups
2013: namespaces (preliminary)
2014
agenda ① Cluster Management at Twitter
② Mesos
③ Mesos at Twitter
④ VMs, IaaS, and Mesos
⑤ …
wait … don’t virtual machines solve my cluster management challenges?
No. VMs are neither sufficient nor necessary!
challenges revisited
① failures: public or private IaaS, failures still occur (on EC2, instead of racks, have availability zones, instead of datacenters, have regions)
② maintenance
③ utilization: provider wins with public IaaS, better resource sharing with private IaaS, but a static partition of VMs is still a static partition!
physical machines virtual machines
aggregation not virtualization
physical machines “datacenter computer”
aggregation not virtualization
coordinator
Mesos (master)
Mesos: level of abstraction
resources
machines
PaaS: deploy and manage applications/services
IaaS: provision and manage machines
Mesos: build and run distributed systems using resources
PaaS on Mesos
build and run a PaaS on top of Mesos: Apache Aurora and Marathon
Mesos on IaaS/bare metal
use OpenStack or EC2 or physical machines to run Mesos
agenda ① Cluster Management at Twitter
② Mesos
③ Mesos at Twitter
④ VMs, IaaS, and Mesos
⑤ Mesos Deeper Dive
cluster manager status quo
before Mesos, the status quo was batch scheduling
application/human → cluster manager: specification
the specification includes as much information as possible to assist the cluster manager in scheduling and execution
application/human: wait for task to be executed
cluster manager → application/human: result
problems with specifications
① hard to specify certain desires or constraints
② hard to update specifications dynamically as tasks execute and finish/fail
MapReduce specification
① what would it look like? ② who submits it?
an alternative model
masters
scheduler
request 3 CPUs 2 GB RAM
a request is a purposely simplified subset of a specification, mainly including the required resources at that point in time
what should you do if you can’t satisfy a request?
① wait until you can …
② offer best you can immediately
Mesos model
masters
scheduler
offer hostname 4 CPUs 4 GB RAM
resources are allocated via resource offers
a resource offer represents a snapshot of available resources that a scheduler can use to run tasks
an analogue: non-blocking sockets
kernel
application
write(s, buffer, size);
42 of 100 bytes written!
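The socket analogue can be tried directly. The sketch below is an illustration, not part of the talk: it uses a local socket pair as a stand-in for a real connection, and the payload size is an arbitrary choice meant to exceed a typical socket buffer. A non-blocking `send` returns however many bytes the kernel could accept right away, just as a resource offer hands over whatever is available now.

```python
import socket

# A connected pair of local sockets stands in for a real network connection.
left, right = socket.socketpair()
left.setblocking(False)

payload = b"x" * (16 * 1024 * 1024)  # arbitrary size, larger than a typical socket buffer

# send() does not wait for the whole payload to fit; it reports how much was accepted.
sent = left.send(payload)
remaining = len(payload) - sent
print(f"{sent} of {len(payload)} bytes written")

left.close()
right.close()
```

The caller then decides what to do with the remainder, which mirrors a scheduler deciding what to do with an offer smaller than it hoped for.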
Mesos model
masters
scheduler
offer hostname 4 CPUs 4 GB RAM
scheduler uses the offers to decide what tasks to run
task 3 CPUs 2 GB RAM
“two-level scheduling”
Mesos: controls resource allocations to schedulers
schedulers: make decisions about what tasks to run given allocated resources
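A minimal sketch of the two levels, assuming a first-fit placement policy. `Offer`, `Task`, `allocate`, and `schedule` are hypothetical names for illustration and are not the Mesos API; the numbers reuse the 4 CPUs / 4 GB offer and the 3 CPUs / 2 GB task from the slides.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    hostname: str
    cpus: float
    mem_gb: float


@dataclass
class Task:
    name: str
    cpus: float
    mem_gb: float


def allocate(free_hosts):
    """Level 1 (Mesos): decide which resources to offer to a scheduler."""
    return [Offer(h, cpus, mem) for h, (cpus, mem) in free_hosts.items()]


def schedule(offers, pending):
    """Level 2 (scheduler): decide which tasks to run on the offered resources."""
    launched = []
    for offer in offers:
        cpus, mem = offer.cpus, offer.mem_gb
        for task in list(pending):
            # first fit: launch any pending task that still fits in this offer
            if task.cpus <= cpus and task.mem_gb <= mem:
                launched.append((offer.hostname, task.name))
                cpus -= task.cpus
                mem -= task.mem_gb
                pending.remove(task)
    return launched


offers = allocate({"host-a": (4, 4)})
pending = [Task("web", 3, 2), Task("cache", 3, 2), Task("cron", 1, 1)]
launched = schedule(offers, pending)
```

Here the allocator never sees the task list and the scheduler never sees the whole cluster; each level only needs the other's output.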
two-level scheduling
Mesos was influenced by operating-system-supported user-space scheduling and the ideas behind scheduler activations
Mesos is designed less like a “cluster manager” and more like an operating system (or kernel)
design comparison: Google’s Omega
Omega
database
scheduler
snapshot: scheduler receives a snapshot of all available resources
transaction: scheduler submits a transaction to “acquire” resources
proposal: Mesos is isomorphic to Omega if it makes offers for everything available
Omega and Mesos
Omega: database → scheduler: snapshot
Mesos: masters → scheduler: offer hostname 4 CPUs 4 GB RAM
Omega: scheduler → database: transaction
Mesos: scheduler → masters: task 3 CPUs 2 GB RAM
offers represent the current snapshot of available resources a framework can use
concurrency control
optimistic: all offers overlap with one another, thus causing frameworks to “compete” first-come-first-served
pessimistic: offers made to different frameworks are disjoint
Omega: requests are complementary, but not necessary
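The difference between the two modes can be sketched as follows. This is a simplified illustration under assumed names (`pessimistic_offers`, `optimistic_offers`, `claim`), with whole hosts as the unit of allocation and round-robin splitting chosen only for brevity; it is not how Mesos or Omega implement allocation.

```python
def pessimistic_offers(resources, frameworks):
    """Split resources so that no two frameworks are offered the same host."""
    offers = {f: [] for f in frameworks}
    for i, host in enumerate(sorted(resources)):
        offers[frameworks[i % len(frameworks)]].append(host)
    return offers


def optimistic_offers(resources, frameworks):
    """Offer every host to every framework; conflicts are resolved at claim time."""
    return {f: sorted(resources) for f in frameworks}


def claim(free, framework, host, owners):
    """First-come-first-served: a claim succeeds only if the host is still free."""
    if host in free:
        free.remove(host)
        owners[host] = framework
        return True
    return False


hosts = {"host-a", "host-b"}
frameworks = ["aurora", "marathon"]

disjoint = pessimistic_offers(hosts, frameworks)
overlapping = optimistic_offers(hosts, frameworks)

free, owners = set(hosts), {}
first = claim(free, "aurora", "host-a", owners)
second = claim(free, "marathon", "host-a", owners)  # loses the race
```

With disjoint offers a framework can never lose a claim, but resources it ignores sit idle until re-offered; with overlapping offers nothing sits idle, at the cost of occasional failed claims.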
agenda ① Cluster Management at Twitter
② Mesos
③ Mesos at Twitter
④ VMs, IaaS, and Mesos
⑤ Mesos Deeper Dive
⑥ Mesos Ecosystem
built on Mesos:
2009 2010 2013 2014
ported to Mesos:
2011 2012 2013 2014
some of our adopters …
2010 2013 2014 …
releases
0.18.0 (2014-04-01)
0.18.1 (2014-04-29)
0.18.2 (2014-05-13)
0.19.0 (2014-06-04)
0.19.1 (2014-07-14)
0.20.0 (2014-08-21)
contributors
38 contributors in the past 6 months
23 committers (also PMC members)
(Storm: 13; Kafka: 15; ZooKeeper: 16; Cassandra: 25; Thrift: 26; Spark: 32; Hadoop: 82)
3 new committers in past 3 months!
what have they been up to?
① containerization and isolation
containerization
started leveraging containerization technology in ~2010
2010: LXC
2012: cgroups
2013: namespaces (network)
2014: Docker (0.20.0)
first-class Docker support, i.e., use Docker images to run containers with Docker primitives like volumes, entrypoints, etc. (0.20.0)
Learn More:
https://github.com/apache/mesos/blob/master/docs/docker-containerizer.md
② container statistics
monitor all the things
CPU, memory, network (0.20.0)
mesos-slave
GET /monitor/statistics.json HTTP/1.1
{
  "source": "sample_executor",
  "statistics": {
    "cpus_system_time_secs": 154.42,
    "cpus_user_time_secs": 258.74,
    "mem_file_bytes": 30613504,
    "mem_rss_bytes": 140341248,
    "net_rx_bytes": 2402099,
    "net_rx_dropped": 0,
    "net_tx_bytes": 1507798,
    "net_tx_dropped": 0
  }
}
Learn More:
https://github.com/apache/mesos/blob/master/docs/network-monitoring.md
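One way to use these counters, sketched under assumptions: the CPU fields are cumulative seconds, so usage comes from the difference between two samples. The first sample below reuses values from the slide; the second sample and the 10-second interval are made up for illustration, and the real endpoint's response may be shaped differently from this simplified object.

```python
import json

# First sample: values shown on the slide. Second sample: made-up values
# standing in for a reading taken later (hypothetical).
sample_t0 = json.loads("""
{"source": "sample_executor",
 "statistics": {"cpus_system_time_secs": 154.42, "cpus_user_time_secs": 258.74,
                "mem_rss_bytes": 140341248}}
""")
sample_t1 = json.loads("""
{"source": "sample_executor",
 "statistics": {"cpus_system_time_secs": 156.42, "cpus_user_time_secs": 261.74,
                "mem_rss_bytes": 141000000}}
""")


def cpu_seconds(sample):
    """Total CPU time (system + user) consumed so far, in seconds."""
    stats = sample["statistics"]
    return stats["cpus_system_time_secs"] + stats["cpus_user_time_secs"]


def cpus_used(earlier, later, interval_secs):
    """Average number of CPUs busy over the interval between two samples."""
    return (cpu_seconds(later) - cpu_seconds(earlier)) / interval_secs


usage = cpus_used(sample_t0, sample_t1, interval_secs=10.0)
rss_mib = sample_t0["statistics"]["mem_rss_bytes"] / (1024 * 1024)
```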
③ authentication and authorization
authentication
credentials: principals and secrets
protocol built using SASL, designed to swap in/out other mechanisms, e.g., Kerberos
authorization
introduced access control lists (ACLs)
"run_tasks": [
  { "principals": { "type": "NONE" },
    "users": { "values": ["root"] } }]
action: run_tasks
subjects: principals
objects: users
“action performed by subjects on objects”
Learn More:
https://github.com/apache/mesos/blob/master/docs/authorization.md
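To make “action performed by subjects on objects” concrete, here is a simplified evaluation sketch. The functions `matches` and `may_run_task`, the first-match rule, and the permissive default are assumptions made for illustration; the authorization document linked above is the reference for how Mesos actually evaluates ACLs.

```python
ANY, NONE = "ANY", "NONE"


def matches(entity, value):
    """Does an ACL entity (a type or an explicit list of values) cover `value`?"""
    if entity.get("type") == ANY:
        return True
    if entity.get("type") == NONE:
        return False
    return value in entity.get("values", [])


def may_run_task(acls, principal, user, permissive=True):
    """Simplified check: may `principal` (subject) run a task as `user` (object)?"""
    for acl in acls.get("run_tasks", []):
        if matches(acl["users"], user):
            # Assumed rule: the first ACL covering the object decides,
            # based on whether its subjects include the principal.
            return matches(acl["principals"], principal)
    return permissive  # no ACL mentions this user


# The ACL from the slide: no principal may run tasks as root.
acls = {
    "run_tasks": [
        {"principals": {"type": NONE}, "users": {"values": ["root"]}},
    ]
}

as_root = may_run_task(acls, "aurora", "root")
as_www = may_run_task(acls, "aurora", "www")
```

Under these assumptions the slide's ACL denies every principal the ability to run tasks as `root`, while tasks run as any other user fall through to the default.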
④ fault tolerance and high availability
mesos-slave recovery
[diagram: the mesos-slave process on a slave goes away and comes back while the executor and its tasks keep running in their containers]
Since 0.14.0! Running in production at Twitter for ~1 year (enabled by default since 0.15.0)
Learn More:
https://github.com/apache/mesos/blob/master/docs/slave-recovery.md
what’s being built?
① primitives for stateful applications
stateful applications
better support for running frameworks like HDFS, Cassandra, directly on Mesos!
Learn More:
https://issues.apache.org/jira/browse/MESOS-1554
② primitives for maintenance
maintenance (aka “planned failures”)
[diagram: a task on one mesos-slave is replaced by a task on another mesos-slave, which goes from staging to running]
Learn More:
https://issues.apache.org/jira/browse/MESOS-1592
③ primitives for smarter resource allocation and scheduling
resource allocation
the original implementation of resource allocation in Mesos was “pessimistic”
Google’s Omega introduced the concept of “optimistic” resource allocation, which Mesos is a natural fit for!
Learn More:
https://issues.apache.org/jira/browse/MESOS-1607
④ improvements in resource isolation
resource isolation
① network bandwidth
② disk block I/O
③ ...
Learn More:
https://issues.apache.org/jira/browse/MESOS-1585
⑤ revised scheduler/executor API
API
2009 relic: HTTP (instead of Thrift) but not REST
(still evaluating JSON-RPC vs REST)
Last major change before 1.0! Existing API and libraries will remain backwards compatible, but be deprecated till 2.0 and then removed
agenda ① Cluster Management at Twitter
② Mesos
③ Mesos at Twitter
④ VMs, IaaS, and Mesos
⑤ Mesos Deeper Dive
⑥ Mesos Ecosystem
⑦ Conclusion
my other computer is a datacenter*
* collection of physical and/or virtual machines
the ops perspective
the datacenter is just another form factor
why can’t we run apps on our datacenters just like we run applications on our mobile phones?
Hadoop Cassandra Rails Jenkins memcached
the dev perspective
applications don’t fit on a single computer anymore
"BIG"
(1) lots of data … (2) lots of users … growing every day
today’s applications need lots of resources (CPUs, memory, disk)
we’re all building distributed systems
but everybody keeps reinventing the wheel
how many more buggy failure detectors will we build until we stop?
desktop computer
server datacenter
OS
OS
OS
the datacenter computer needs an operating system
operating system “a collection of software that manages the computer hardware resources and provides common services for computer programs”
- Wikipedia
datacenter operating system “a collection of software that manages the datacenter computer hardware resources and provides common services for computer programs”
- Wikipedia
Apache Mesos: datacenter kernel
Apache Mesos: distributed systems kernel
Thank You!
mesos.apache.org
mesos.apache.org/blog
@ApacheMesos