Hadoop: Past, Present and Future (v1.2)

Post on 11-Aug-2014

DESCRIPTION

A session focused on ramping you up on what Hadoop is, how it works and what it's capable of. We will also look at what Hadoop 2.x and YARN bring to the table, and at some future projects in the Hadoop space to keep an eye on.

TRANSCRIPT

7/12/14

Hadoop: Past, Present and Future

Prepared for: Orange County Java Users Group

Presented by: "Big Data Joe" Rossi (@bigdatajoerossi)

Roadmap (~1 hour)

1 - What Makes Up Hadoop 1.x?

2 - What's New In Hadoop 2.x?

3 - The Future Of Hadoop …

What Makes Up Hadoop 1.x?

Hadoop 1.0: HDFS + MapReduce

[Diagram: a Client holding blocks 1-1, 1-2 and 1-3; a NameNode and a JobTracker; four worker nodes, each labeled DataNode / TaskTracker.]

Hadoop 1.0: HDFS + MapReduce

[Diagram: the same cluster (Client, NameNode, JobTracker, four DataNode / TaskTracker nodes) with blocks 1-1 through 4-3 spread across the worker nodes and Map and Reduce tasks running on them.]
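The Map and Reduce labels in the diagram can be illustrated with a tiny in-process word count. This is a plain-Python sketch of the programming model only, not Hadoop API code; the function names and the sample blocks are made up for illustration.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one input block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(blocks):
    """Run map over every block, then shuffle, then reduce each key."""
    pairs = [pair for block in blocks for pair in map_phase(block)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input: each string stands in for one HDFS block.
blocks = ["hadoop stores data", "hadoop processes data", "yarn schedules work"]
counts = word_count(blocks)
```

On a real cluster the map calls would run on the nodes that hold each block, and the shuffle would move data over the network; here everything runs in one process.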

MapReduce v1 Limitations

Scalability: Maximum cluster size is 4,000 nodes and the maximum number of concurrent tasks is 40,000.

Availability: JobTracker failure kills all queued and running jobs.

Resources Partitioned into Map and Reduce: Hard partitioning of Map and Reduce slots led to low resource utilization.

No Support for Alternate Paradigms / Services: Only MapReduce batch jobs, nothing else.
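The utilization point can be made concrete with a little arithmetic. The sketch below compares a node whose capacity is split into fixed Map and Reduce slots with a node whose capacity is one shared pool. The slot counts and task counts are hypothetical numbers chosen for illustration, not figures from the talk.

```python
def partitioned_in_use(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Fixed slots: map tasks may only use map slots, reduce tasks only reduce slots."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


def shared_in_use(total_slots, map_tasks, reduce_tasks):
    """Shared pool: any free unit of capacity may run either kind of task."""
    return min(map_tasks + reduce_tasks, total_slots)


# Hypothetical node: 8 map slots and 4 reduce slots.
MAP_SLOTS, REDUCE_SLOTS = 8, 4
TOTAL = MAP_SLOTS + REDUCE_SLOTS

# Early in a job: 12 map tasks are waiting and no reduce task can start yet.
partitioned = partitioned_in_use(MAP_SLOTS, REDUCE_SLOTS, map_tasks=12, reduce_tasks=0)
shared = shared_in_use(TOTAL, map_tasks=12, reduce_tasks=0)

partitioned_utilization = partitioned / TOTAL  # reduce slots sit idle
shared_utilization = shared / TOTAL            # all capacity is busy
```

With these assumed numbers the partitioned node runs 8 of the 12 waiting map tasks while its 4 reduce slots stay empty; the shared pool runs all 12.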

Apache Hadoop 1.0: Single Use System

HADOOP 1.0: Single Use System (Batch Apps)

HDFS (redundant, reliable storage)

MapReduce (cluster resource management and data processing)

Pig, Hive

What's New In Hadoop 2.x?

YARN Replaces MapReduce

YARN: Yet Another Resource Negotiator

YARN will be the de facto distributed operating system for Big Data

YARN: Taking Hadoop Beyond Batch

Store DATA in one place

Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service

Applications Run Natively IN Hadoop

HDFS2 (redundant, reliable storage)

YARN (cluster resource management)

BATCH (MapReduce)

INTERACTIVE (Tez)

ONLINE (HBase)

STREAMING (DataTorrent)

GRAPH (Giraph)

Running all on the same Hadoop cluster to give applications access to all the same source data!

YARN: Applications

MapReduce v2

Stream Processing

Master-Worker Online

In-Memory

Apache Storm

YARN: Moving Quickly

[Timeline: 2010, 2011, 2012, 2013, 2014, Today]

Conceived at Yahoo!

Alpha Releases – 2.0

Beta Releases – 2.1

GA Released – 2.2

Version 2.3

Version 2.4

100,000+ nodes, 400,000+ jobs daily, 10 million+ hours of compute daily

YARN: Dr. Evil Approved

YARN: How It Works

[Diagram: a Client; a ResourceManager containing the Scheduler; four NodeManagers hosting one ApplicationMaster and three Containers.]
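The components in the diagram can be sketched as a toy simulation: a client submits an application, a container is allocated for its ApplicationMaster, and further containers are then allocated for its tasks. The class names mirror the diagram labels, but this is not Hadoop's API; the capacities, the placement policy and the method names are all assumptions made for illustration.

```python
class NodeManager:
    """Toy stand-in for a worker node that hosts containers."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity  # assumed: number of containers this node can host
        self.containers = []      # labels of containers currently running here

    def free(self):
        return self.capacity - len(self.containers)

    def launch(self, label):
        self.containers.append(label)


class ResourceManager:
    """Toy stand-in for the cluster-wide resource arbiter and its scheduler."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, label):
        # Assumed policy: place the container on the node with the most free capacity.
        node = max(self.nodes, key=lambda n: n.free())
        if node.free() == 0:
            return None  # no capacity left anywhere in the cluster
        node.launch(label)
        return node.name

    def submit(self, app_name, num_tasks):
        # Step 1: allocate a container for the application's ApplicationMaster.
        am_node = self.allocate(f"{app_name}-AM")
        if am_node is None:
            return None
        # Step 2: stands in for the ApplicationMaster requesting one container per task.
        placements = [self.allocate(f"{app_name}-task{i}") for i in range(num_tasks)]
        return am_node, placements


# Hypothetical cluster: four NodeManagers, two containers each.
nodes = [NodeManager(f"nm{i}", capacity=2) for i in range(1, 5)]
rm = ResourceManager(nodes)
am_node, placements = rm.submit("wordcount", num_tasks=3)
```

In this run the ApplicationMaster lands on one node and the three task containers on the other three, matching the one-ApplicationMaster, three-Container layout in the diagram.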

YARN: What Has Changed?

YARN                       MRv1
RM  ResourceManager        JT  JobTracker
AM  ApplicationMaster
Scheduler                  Scheduler
NM  NodeManager            TT  TaskTracker
Container                  Map, Reduce

[Diagram: side by side, a ResourceManager (with Scheduler) above NodeManagers hosting an ApplicationMaster and Containers, and a JobTracker (with Scheduler) above TaskTrackers hosting Map and Reduce.]

6 Benefits of YARN

- Scale
- New programming models and services
- Improved cluster utilization
- Agility
- Backwards compatible with MapReduce v1
- Mixed workloads on the same source of data

The Future of Hadoop: Projects and Roadmap

SQL on Hadoop

Speed: Deliver interactive query performance.

SQL: Support an array of SQL semantics for analytic applications running against Hadoop.

Scale: SQL interface to Hadoop designed for queries that scale from terabytes to petabytes.

Next Gen SQL on Hadoop

Hive on Apache Tez (Hortonworks)

Hive on Apache Spark (Cloudera)

Cloudera Impala (Cloudera)

Apache Drill (MapR)

HOYA: HBase (NoSQL) on YARN

Dynamic Scaling: On-demand cluster size. Increase and decrease the size with load.

Easier Deployment: APIs to create, start, stop and delete HBase clusters.

Availability: Recover from Region Server loss with a new container.

Microsoft REEF: Retainable Evaluator Execution Framework

Machine Learning: Framework well suited for building machine learning jobs.

Scalable / Fault Tolerant: Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.

Maintain State: Users can build jobs that utilize data from where it's needed and also maintain state after jobs are done.

Heterogeneous Storages in HDFS

[Diagram: two panels, one showing a NameNode over a single Storage type, the other a NameNode over SATA, SSD and Fusion IO storage.]

Hadoop Roadmap

Apache Hadoop 2.5 (Q3 2014)
- NodeManager Restart w/o disruption
- Dynamic Resource Configuration

Apache Hadoop 2.6 (Q4 2014)
- Memory As Storage Tier
- Support For Docker Containers

I Know You Have Questions … No such thing as a stupid question.

One Last Thing …

OC Big Data Meetup

meetup.com/ocbigdata

3rd Wednesday Of The Month. Next: July 16th @ 5:45P

Thank You!

Big Data Joe Rossi

http://bigdatajoe.io/

@bigdatajoerossi
