Hadoop: Past, Present and Future (v1.2)
A session focused on ramping you up on what Hadoop is, how it works and what it's capable of. We will also look at what Hadoop 2.x and YARN bring to the table and some future projects in the Hadoop space to keep an eye on.

7/12/14
Prepared for: Orange County Java Users Group
Presented by: “Big Data Joe” Rossi, @bigdatajoerossi
Hadoop Past, Present and Future

Roadmap
~1 hour
1. What Makes Up Hadoop 1.x?
2. What’s New In Hadoop 2.x?
3. The Future Of Hadoop …

What Makes Up Hadoop 1.x?

Hadoop 1.0: HDFS + MapReduce
[Diagram: Client, NameNode, JobTracker and four DataNode / TaskTracker nodes; labels 1-1, 1-2, 1-3]
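The labels 1-1, 1-2 and 1-3 on this slide appear to stand for the pieces of one client file stored in HDFS. As a rough illustration only, the sketch below splits a file into fixed-size blocks and assigns each block to several DataNodes. It is plain Python, not Hadoop code: the function names, the node names and the round-robin placement are assumptions made for the example (real HDFS placement is rack-aware), while the 64 MB block size and replication factor of 3 are the usual Hadoop 1.x defaults.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the size of each block; the last block may be smaller."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([remainder] if remainder else [])


def place_replicas(block_count, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes.

    Round-robin placement is a simplification assumed for this sketch.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    placement = {}
    for block in range(block_count):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


# Hypothetical 150 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(150)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Because every block lives on several nodes, losing a single DataNode in this model never removes the only copy of a block.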

Hadoop 1.0: HDFS + MapReduce
[Diagram: the same cluster (Client, NameNode, JobTracker, four DataNode / TaskTracker nodes) with labels 1-1 through 4-3 spread across the nodes, plus Map and Reduce tasks]
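The Map and Reduce steps named on this slide can be illustrated with a minimal, single-process word count. This is a sketch in plain Python rather than Hadoop API code; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample input are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one input block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key between map and reduce."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(blocks):
    mapped = []
    for block in blocks:  # on a cluster, each block would be mapped on its own node
        mapped.extend(map_phase(block))
    return reduce_phase(shuffle(mapped))


# Hypothetical input: three small "blocks" of text.
blocks = ["hadoop stores data", "hadoop processes data", "yarn schedules hadoop"]
counts = word_count(blocks)
```

On a real cluster the map calls run in parallel next to the data, and the shuffle moves the intermediate pairs across the network to the reducers.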

MapReduce v1 Limitations
Scalability: Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000
Availability: JobTracker failure kills all queued and running jobs
Resources Partitioned into Map and Reduce: Hard partitioning of Map and Reduce slots led to low resource utilization
No Support for Alternate Paradigms / Services: Only MapReduce batch jobs, nothing else
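The utilization problem caused by fixed Map and Reduce slots can be shown with a small back-of-the-envelope model. The sketch below is an illustration under strong assumptions: every task needs the same amount of resources, and the slot counts and task counts are made-up numbers. The function names are hypothetical and nothing here is Hadoop API.

```python
def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """MRv1-style model: map tasks may only use map slots, reduce tasks only reduce slots."""
    busy = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return busy / (map_slots + reduce_slots)


def container_utilization(containers, map_tasks, reduce_tasks):
    """Generic-container model: any free container can run any task."""
    busy = min(containers, map_tasks + reduce_tasks)
    return busy / containers


# Hypothetical node with capacity for 6 tasks and a map-heavy workload:
# 6 map tasks waiting, no reduce tasks yet.
mrv1 = slot_utilization(map_slots=4, reduce_slots=2, map_tasks=6, reduce_tasks=0)
generic = container_utilization(containers=6, map_tasks=6, reduce_tasks=0)
```

In this example the two reduce slots sit idle while map tasks queue, so the slot model reaches only two thirds of the node's capacity while the generic model uses all of it.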

Apache Hadoop 1.0: Single Use System (Batch Apps)
HDFS (redundant, reliable storage)
MapReduce (cluster resource management and data processing)
Pig, Hive

What’s New In Hadoop 2.x?

YARN Replaces MapReduce
Yet Another Resource Negotiator
YARN
YARN will be the de facto distributed operating system for Big Data

YARN: Taking Hadoop Beyond Batch
Store DATA in one place
Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service
Applications Run Natively IN Hadoop
HDFS2 (redundant, reliable storage)
YARN (cluster resource management)
BATCH (MapReduce)
INTERACTIVE (Tez)
ONLINE (HBase)
STREAMING (DataTorrent)
GRAPH (Giraph)

YARN: Applications
MapReduce v2
Stream Processing
Master-Worker Online
In-Memory
Apache Storm
Running all on the same Hadoop cluster to give applications access to all the same source data!

YARN: Moving Quickly
[Timeline: 2010, 2011, 2012, 2013, 2014 (today)]
Conceived at Yahoo!
Alpha Releases – 2.0
Beta Releases – 2.1
GA Released – 2.2
Version 2.3
Version 2.4
100,000+ nodes, 400,000+ jobs daily, 10 million+ hours of compute daily

YARN: Dr. Evil Approved

YARN: How It Works
[Diagram: Client, ResourceManager, Scheduler, four NodeManagers, one ApplicationMaster and three Containers]
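The relationship between the components on this slide can be sketched as a toy model: a ResourceManager that knows the free memory on each NodeManager, and an application that first receives a container for its ApplicationMaster and then containers for its tasks. Everything below is an assumption made for illustration: the class and function names mirror the slide labels but are not the YARN API, the memory sizes are invented, and first-fit placement stands in for the real, pluggable Scheduler.

```python
class NodeManager:
    """One worker node with a fixed amount of memory to hand out."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []


class ResourceManager:
    """Tracks all NodeManagers and places containers on them."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, label, memory_mb):
        """Scheduler step (first-fit, an assumption): place one container."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                node.containers.append(label)
                return node.name
        return None  # no node currently has enough free memory


def submit_application(rm, app_id, task_count, am_mb=512, task_mb=1024):
    """Client submits an app: one ApplicationMaster container, then task containers."""
    placements = {}
    am_label = f"{app_id}-AM"
    am_node = rm.allocate(am_label, am_mb)
    if am_node is None:
        return placements  # the application cannot even start
    placements[am_label] = am_node
    for i in range(task_count):
        label = f"{app_id}-task{i}"
        placements[label] = rm.allocate(label, task_mb)
    return placements


# Hypothetical two-node cluster with 2 GB per node.
nodes = [NodeManager("nm1", 2048), NodeManager("nm2", 2048)]
rm = ResourceManager(nodes)
placements = submit_application(rm, "app1", task_count=3)
```

The point of the model is that containers are generic: the same allocation path serves the ApplicationMaster and its tasks, whatever kind of work they run.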

YARN: What Has Changed?
YARN                                           MRv1
ResourceManager (RM), ApplicationMaster (AM)   JobTracker (JT)
Scheduler                                      Scheduler
NodeManager (NM)                               TaskTracker (TT)
Container                                      Map, Reduce
[Diagram: YARN side shows a ResourceManager with Scheduler and NodeManagers hosting an ApplicationMaster and Containers; MRv1 side shows a JobTracker with Scheduler and TaskTrackers hosting Map and Reduce]

6 Benefits of YARN
- Scale
- New programming models and services
- Improved cluster utilization
- Agility
- Backwards compatible with MapReduce v1
- Mixed workloads on the same source of data

The Future of Hadoop: Projects and Roadmap

SQL on Hadoop
Speed: Deliver interactive query performance.
SQL: Support an array of SQL semantics for analytic applications running against Hadoop.
Scale: SQL interface to Hadoop designed for queries that scale from Terabytes to Petabytes

Next Gen SQL on Hadoop
Hive on Apache Tez (Hortonworks)
Hive on Apache Spark (Cloudera)
Cloudera Impala (Cloudera)
Apache Drill (MapR)

HOYA: HBase (NoSQL) on YARN
Dynamic Scaling: On-demand cluster size. Increase and decrease the size with load.
Easier Deployment: APIs to create, start, stop and delete HBase clusters.
Availability: Recover from Region Server loss with a new container.

Microsoft REEF: Retainable Evaluator Execution Framework
Machine Learning: Framework well suited for building machine learning jobs.
Scalable / Fault Tolerant: Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.
Maintain State: Users can build jobs that utilize data from where it’s needed and also maintain state after jobs are done.

Heterogeneous Storages in HDFS
[Diagram: NameNode with Storage; NameNode with SATA, SSD and Fusion IO]

Hadoop Roadmap
Q3 2014: Apache Hadoop 2.5
- NodeManager Restart w/o disruption
- Dynamic Resource Configuration
Q4 2014: Apache Hadoop 2.6
- Memory As Storage Tier
- Support For Docker Containers

I Know You Have Questions … No such thing as a stupid question.

One Last Thing …
OC Big Data Meetup
meetup.com/ocbigdata
3rd Wednesday Of The Month
Next: July 16th @ 5:45 PM

Thank You!
Big Data Joe Rossi http://bigdatajoe.io/ @bigdatajoerossi