hadoop - past, present and future - v1.2

7/12/14 Prepared for: Orange County Java Users Group Presented by: “Big Data Joe” Rossi @bigdatajoerossi Hadoop Past, Present and Future

Author: big-data-joe-rossi

Post on 11-Aug-2014

Category:

Data & Analytics


DESCRIPTION

A session focused on ramping you up on what Hadoop is, how it works and what it's capable of. We will also look at what Hadoop 2.x and YARN bring to the table and some future projects in the Hadoop space to keep an eye on.

TRANSCRIPT

  • Hadoop: Past, Present and Future. 7/12/14. Prepared for: Orange County Java Users Group. Presented by: "Big Data Joe" Rossi, @bigdatajoerossi
  • Roadmap (~1 hour): 1- What Makes Up Hadoop 1.x? 2- What's New In Hadoop 2.x? 3- The Future Of Hadoop
  • What Makes Up Hadoop 1.x?
  • Hadoop 1.0: HDFS + MapReduce. Diagram: Client, NameNode, JobTracker and four DataNode / TaskTracker nodes (diagram labels: 1-1, 1-2, 1-3)
  • Hadoop 1.0: HDFS + MapReduce. Diagram: Client, NameNode, JobTracker and four DataNode / TaskTracker nodes running Map and Reduce tasks (diagram labels: 1-1, 1-2, 1-3, 2-1, 2-2, 2-3, 3-1, 3-2, 3-3, 4-1, 4-2, 4-3)
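The slide only shows Map and Reduce tasks spread across the DataNode / TaskTracker nodes, so the following is a minimal, single-process sketch of the programming model rather than anything taken from the talk. The word-count job, the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are all assumptions chosen for illustration; a real job would be written against the Hadoop MapReduce API and run across the cluster.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one input block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(blocks):
    """Run map, shuffle and reduce in sequence over a list of text blocks."""
    # On a cluster, each block would be mapped by a task on a node holding it.
    mapped = [pair for block in blocks for pair in map_phase(block)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: three blocks of one file.
blocks = ["hadoop stores data", "hadoop processes data", "yarn schedules work"]
counts = run_job(blocks)
```

Here `counts` maps each distinct word to the number of times it appears across all blocks. The point of the split is that `map_phase` needs only one block and `reduce_phase` needs only one key, which is what lets the work be spread over many nodes.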
  • MapReduce v1 Limitations. Scalability: Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000. Availability: JobTracker failure kills all queued and running jobs. Resources Partitioned into Map and Reduce: Hard partitioning of Map and Reduce slots led to low resource utilization. No Support for Alternate Paradigms / Services: Only MapReduce batch jobs, nothing else.
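The utilization point on this slide can be illustrated with a little arithmetic. This is a rough sketch under stated assumptions: the slot counts and task counts are made-up numbers, the function names are hypothetical, and every task is treated as needing exactly one slot (real YARN containers are sized by resources such as memory, not by uniform slots).

```python
def typed_slots_running(pending_map, pending_reduce, map_slots, reduce_slots):
    """Tasks that can run when slots are hard-partitioned by type (MRv1 style)."""
    return min(pending_map, map_slots) + min(pending_reduce, reduce_slots)


def pooled_slots_running(pending_map, pending_reduce, total_slots):
    """Tasks that can run when any slot may host any kind of task."""
    return min(pending_map + pending_reduce, total_slots)


# Hypothetical node: 8 Map slots and 4 Reduce slots, 12 slots in total.
map_slots, reduce_slots = 8, 4
total_slots = map_slots + reduce_slots

# Hypothetical map-heavy phase: 12 Map tasks waiting, no Reduce tasks yet.
pending_map, pending_reduce = 12, 0

typed = typed_slots_running(pending_map, pending_reduce, map_slots, reduce_slots)
pooled = pooled_slots_running(pending_map, pending_reduce, total_slots)

typed_utilization = typed / total_slots    # the 4 Reduce slots sit idle
pooled_utilization = pooled / total_slots  # every slot is busy
```

With these numbers the typed layout runs 8 of the 12 waiting tasks while 4 Reduce slots stay empty, whereas the pooled layout runs all 12.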
  • Apache Hadoop 1.0: Single Use System (Batch Apps). Stack: Pig and Hive, on MapReduce (cluster resource management and data processing), on HDFS (redundant, reliable storage)
  • What's New In Hadoop 2.x?
  • YARN Replaces MapReduce. YARN: Yet Another Resource Negotiator. YARN will be the de facto distributed operating system for Big Data
  • YARN: Taking Hadoop Beyond Batch. Store DATA in one place. Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service. Applications Run Natively IN Hadoop: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (DataTorrent), GRAPH (Giraph), all on YARN (cluster resource management) over HDFS2 (redundant, reliable storage)
  • YARN: Applications. MapReduce v2, Stream Processing, Master-Worker, Online, In-Memory, Apache Storm. All running on the same Hadoop cluster to give applications access to all the same source data!
  • YARN: Moving Quickly. Timeline, 2010 to 2014 (today): Conceived at Yahoo!; Alpha Releases (2.0); Beta Releases (2.1); GA Released (2.2); Version 2.3; Version 2.4. 100,000+ nodes, 400,000+ jobs daily, 10 million+ hours of compute daily
  • YARN: Dr. Evil Approved
  • YARN: How It Works. Diagram: Client, ResourceManager (with Scheduler), and four NodeManagers hosting one ApplicationMaster and three Containers
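The flow between the boxes in this diagram can be sketched as a toy, in-memory model. This is not the YARN API: the class and method names simply mirror the labels on the slide, the "one unit of capacity per container" accounting is an assumption, and the "most free capacity first" placement rule stands in for a real scheduler, which also weighs memory, CPU, queues and data locality.

```python
class NodeManager:
    """Toy per-node agent: tracks free capacity and the containers it hosts."""

    def __init__(self, name, capacity):
        self.name = name
        self.free = capacity
        self.containers = []

    def launch(self, label):
        self.free -= 1
        self.containers.append(label)
        return (self.name, label)


class ApplicationMaster:
    """Toy per-application master: asks the ResourceManager for task containers."""

    def __init__(self, app_name, resource_manager):
        self.app_name = app_name
        self.resource_manager = resource_manager

    def run(self, num_tasks):
        granted = []
        for i in range(num_tasks):
            container = self.resource_manager.allocate(f"{self.app_name}/task-{i}")
            if container is not None:
                granted.append(container)
        return granted


class ResourceManager:
    """Toy cluster-wide manager: its allocate() plays the Scheduler role."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, label):
        # Assumed placement rule: pick the node with the most free capacity.
        node = max(self.node_managers, key=lambda nm: nm.free)
        if node.free == 0:
            return None
        return node.launch(label)

    def submit(self, app_name, num_tasks):
        # Step 1: the client's request gets a container for the ApplicationMaster.
        am_container = self.allocate(f"{app_name}/ApplicationMaster")
        if am_container is None:
            raise RuntimeError("no capacity for the ApplicationMaster")
        # Step 2: the ApplicationMaster negotiates containers for its tasks.
        master = ApplicationMaster(app_name, self)
        return am_container, master.run(num_tasks)


# Hypothetical cluster matching the slide: four NodeManagers, one slot each.
nodes = [NodeManager(f"nm{i}", capacity=1) for i in range(1, 5)]
rm = ResourceManager(nodes)
am_container, task_containers = rm.submit("job", num_tasks=3)
```

In this run the ApplicationMaster lands on the first NodeManager and the three task containers on the remaining three, which matches the layout drawn on the slide. The design point is that the ResourceManager only hands out containers; per-job coordination lives in the ApplicationMaster.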
  • YARN: What Has Changed? YARN vs. MRv1: ResourceManager (RM) and ApplicationMaster (AM) vs. JobTracker (JT); Scheduler vs. Scheduler; NodeManager (NM) vs. TaskTracker (TT); Container vs. Map / Reduce
  • 6 Benefits of YARN: Scale; New programming models and services; Improved cluster utilization; Agility; Backwards compatible with MapReduce v1; Mixed workloads on the same source of data
  • The Future of Hadoop: Projects and Roadmap
  • SQL on Hadoop. Speed: Deliver interactive query performance. SQL: Support an array of SQL semantics for analytic applications running against Hadoop. Scale: SQL interface to Hadoop designed for queries that scale from Terabytes to Petabytes
  • Next Gen SQL on Hadoop: Hive on Apache Tez (Hortonworks); Hive on Apache Spark (Cloudera); Cloudera Impala (Cloudera); Apache Drill (MapR)
  • HOYA: HBase (NoSQL) on YARN. Dynamic Scaling: On-demand cluster size. Increase and decrease the size with load. Easier Deployment: APIs to create, start, stop and delete HBase clusters. Availability: Recover from Region Server loss with a new container.
  • Microsoft REEF: Retainable Evaluator Execution Framework. Machine Learning: Framework well suited for building machine learning jobs. Scalable / Fault Tolerant: Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models. Maintain State: Users can build jobs that utilize data from where it's needed and also maintain state after jobs are done.
  • Heterogeneous Storages in HDFS. Diagram labels: NameNode, Storage; NameNode, SATA, SSD, Fusion IO
  • Hadoop Roadmap. Q3 2014, Apache Hadoop 2.5: NodeManager Restart w/o disruption; Dynamic Resource Configuration. Q4 2014, Apache Hadoop 2.6: Memory As Storage Tier; Support For Docker Containers
  • I Know You Have Questions. No such thing as a stupid question. Hadoop: Past, Present and Future
  • One Last Thing: OC Big Data Meetup, meetup.com/ocbigdata. 3rd Wednesday Of The Month. Next: July 16th @ 5:45P
  • Thank You! Hadoop: Past, Present and Future. Big Data Joe Rossi, http://bigdatajoe.io/, @bigdatajoerossi