How YARN Enables Multiple Data Processing Engines in Hadoop

Post on 13-Aug-2015

TRANSCRIPT

Page 1: How YARN Enables Multiple Data Processing Engines in Hadoop

Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

How YARN Enables Multiple Data Processing Engines in Hadoop

We Do Hadoop

Eric Mizell - Director, Solution Engineering

Page 2: How YARN Enables Multiple Data Processing Engines in Hadoop

Agenda

• HDFS Overview - Storage

• YARN 101 - Compute – Yet Another Resource Negotiator

• Enabling a Modern Data Architecture

• YARN in action – Demo of streaming application

• Hadoop Tools – Demos

• Sample Code - https://github.com/emizell/HBase-Code-Samples

Page 3: How YARN Enables Multiple Data Processing Engines in Hadoop

HDFS Overview

Page 4: How YARN Enables Multiple Data Processing Engines in Hadoop

HDFS Overview

•  Typical hardware for DataNodes
   –  2@8 Core
   –  256GB RAM
   –  2@24TB Disk
   –  10 GbE

•  Hadoop is rack aware
   –  Data is replicated across racks, so losing a single rack does not lose data

•  Scale up or down
   –  Add or remove DataNodes; HDFS re-replicates blocks when a node leaves, and the balancer redistributes existing data when nodes are added

•  HDFS is a file system
   –  Store any kind of data
   –  Inexpensive storage
   –  Replication factor of 3 by default (can be changed)
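The effect of the replication factor on usable space can be sketched with simple arithmetic. This is a rough illustration only: the function name and the example figures (10 DataNodes with 48 TB of raw disk each) are assumptions made for the example, not numbers from the talk, and the sketch ignores disk space reserved for non-HDFS use.

```python
def usable_capacity_tb(datanodes: int, raw_tb_per_node: float, replication: int = 3) -> float:
    """Approximate logical HDFS capacity: total raw disk divided by the replication factor."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return datanodes * raw_tb_per_node / replication


# Hypothetical cluster: 10 DataNodes, 48 TB raw each.
print(usable_capacity_tb(10, 48.0))                 # default replication of 3 -> 160.0
print(usable_capacity_tb(10, 48.0, replication=2))  # lower replication -> 240.0
```

Lowering the replication factor buys capacity at the cost of fewer copies of each block.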

Page 5: How YARN Enables Multiple Data Processing Engines in Hadoop

YARN Concepts

• Application
  – An application is a job submitted to the framework
  – Example: MapReduce job

• Container
  – Basic unit of allocation
  – Fine-grained resource allocation across multiple resource types (memory, cpu, disk, network, gpu etc.)
    – container_0 = 2GB, 1 CPU
    – container_1 = 1GB, 6 CPU
  – Replaces the fixed map/reduce slots
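The container idea can be sketched as a small resource model. This is a simplified illustration in plain Python, not the YARN API: the `Resource` and `Node` classes and the 8 GB / 8 vcore node size are assumptions for the example, and only memory and CPU are modelled. The two requests mirror container_0 and container_1 from the slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int


@dataclass
class Node:
    capacity: Resource
    allocated: list = field(default_factory=list)

    def free(self) -> Resource:
        """Capacity minus everything already handed out as containers."""
        used_mem = sum(r.memory_mb for r in self.allocated)
        used_cpu = sum(r.vcores for r in self.allocated)
        return Resource(self.capacity.memory_mb - used_mem, self.capacity.vcores - used_cpu)

    def try_allocate(self, request: Resource) -> bool:
        """Grant a container of any shape as long as every resource type fits."""
        free = self.free()
        if request.memory_mb <= free.memory_mb and request.vcores <= free.vcores:
            self.allocated.append(request)
            return True
        return False


node = Node(Resource(memory_mb=8192, vcores=8))      # hypothetical node size
container_0 = Resource(memory_mb=2048, vcores=1)     # 2GB, 1 CPU
container_1 = Resource(memory_mb=1024, vcores=6)     # 1GB, 6 CPU
print(node.try_allocate(container_0))                # True
print(node.try_allocate(container_1))                # True
print(node.free())                                   # Resource(memory_mb=5120, vcores=1)
print(node.try_allocate(Resource(1024, 2)))          # False: only 1 vcore is left
```

Unlike fixed map/reduce slots, each request here carries its own size, so a memory-heavy task and a CPU-heavy task can share one node.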

Page 6: How YARN Enables Multiple Data Processing Engines in Hadoop

YARN Architecture

• Resource Manager
  – Global resource scheduler
  – Hierarchical queues
  – Application management

• Node Manager
  – Per-machine agent
  – Manages the life-cycle of containers
  – Container resource monitoring

• Application Master
  – Per-application
  – Manages application scheduling and task execution
  – E.g. MapReduce Application Master
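How the three roles divide the work can be sketched as a toy simulation. This is plain Python under stated assumptions, not YARN's API: the class and method names are hypothetical, only memory is tracked, and the first-fit placement is a stand-in for the real scheduler with its hierarchical queues.

```python
class NodeManager:
    """Per-machine agent: tracks free memory and the containers it hosts."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []

    def launch(self, container_id, memory_mb):
        self.free_mb -= memory_mb
        self.containers.append(container_id)


class ResourceManager:
    """Global scheduler: picks a node for every container request (first fit here)."""

    def __init__(self, node_managers):
        self.node_managers = node_managers
        self._next_id = 0

    def allocate(self, app_id, memory_mb):
        for nm in self.node_managers:
            if nm.free_mb >= memory_mb:
                container_id = f"{app_id}_c{self._next_id}"
                self._next_id += 1
                nm.launch(container_id, memory_mb)
                return container_id, nm.name
        return None  # no node can satisfy the request right now


class ApplicationMaster:
    """Per-application: asks the ResourceManager for the containers its tasks need."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def run_tasks(self, task_count, memory_mb):
        grants = []
        for _ in range(task_count):
            grant = self.rm.allocate(self.app_id, memory_mb)
            if grant is None:
                break  # cluster is full; a real AM would wait and retry
            grants.append(grant)
        return grants


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 4096)])
am_grant = rm.allocate("app1", 1024)          # the AM itself runs in a container
am = ApplicationMaster("app1", rm)
task_grants = am.run_tasks(task_count=4, memory_mb=2048)
print(am_grant)                               # ('app1_c0', 'node1')
print(len(task_grants))                       # 3: the fourth task does not fit
```

The point of the split is that the ResourceManager knows nothing about the application's logic; a different Application Master can run a different engine on the same cluster.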

Page 7: How YARN Enables Multiple Data Processing Engines in Hadoop

YARN – Running Apps

[Diagram: Hadoop Client 1 and Hadoop Client 2 each create and submit an application (app1, app2) to the ResourceManager, which contains the Scheduler and the ASM. NodeManagers are spread across Rack1, Rack2 … RackN and host the containers: AM1 with containers C1.1–C1.4 for app1, and AM2 with containers C2.1–C2.3 for app2. Relationship labels in the figure: negotiates, reports to, partitions, status report; the Scheduler is shown with queues and resources.]

Page 8: How YARN Enables Multiple Data Processing Engines in Hadoop

Hadoop 2.x Stack – Enabled by YARN

[Diagram: the Hadoop 2.x stack. Storage is HDFS (Hadoop Distributed File System); above it sits YARN: Data Operating System (Cluster Resource Management), spanning nodes 1 … N. Batch, interactive & real-time data access engines run on YARN: Script (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), In-Memory (Spark) and Others (ISV Engines), with Tez and Slider shown as layers between some engines and YARN. Governance (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS. Security (Authentication, Authorization, Accounting, Data Protection): Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger. Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie). Deployment choice: Linux, Windows, On-Premises, Cloud.]

YARN is the architectural center of HDP

Enables batch, interactive and real-time workloads

Provides comprehensive enterprise capabilities

The widest range of deployment options

Page 9: How YARN Enables Multiple Data Processing Engines in Hadoop

Hadoop 2.2.x Stack – Versions

Page 10: How YARN Enables Multiple Data Processing Engines in Hadoop

Enabling a Modern Data Architecture with Apache Hadoop

Hortonworks. We do Hadoop.

Page 11: How YARN Enables Multiple Data Processing Engines in Hadoop

Existing Siloed Data Architectures Under Pressure

[Diagram: three layers. APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications. DATA SYSTEM: RDBMS, EDW and MPP, each drawn as separate silos. SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs).]

Data growth: New Data Types

• OLTP, ERP, CRM systems

• Unstructured docs, emails

• Clickstream

• Server logs

• Social/Web data

• Sensor, machine data

• Geolocation

85% (Source: IDC)

• Can’t manage new data paradigm

• Constrains data to specific schema

• Siloed data

• Limited scalability

• Economically unfeasible

• Limited analytics

Page 12: How YARN Enables Multiple Data Processing Engines in Hadoop

HDP2 and YARN enable the Modern Data Architecture

Hortonworks architected and led development of YARN

Common data set, multiple applications

•  Optionally land all data in a single cluster

•  Batch, interactive & real-time use cases

•  Support multi-tenant access, processing & segmentation of data

YARN: Architectural center of Hadoop

•  Consistent security, governance & operations

•  Ecosystem applications certified by Hortonworks to run natively in Hadoop

[Diagram: three layers. SOURCES: Existing Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured. DATA SYSTEM: RDBMS, EDW and MPP alongside Hadoop, where YARN: Data Operating System runs Batch, Interactive and Real-Time workloads across nodes 1 … N on HDFS (Hadoop Distributed File System). APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications.]

Page 13: How YARN Enables Multiple Data Processing Engines in Hadoop

YARN in Action

Hortonworks. We do Hadoop.

Page 14: How YARN Enables Multiple Data Processing Engines in Hadoop

Trucking Company’s YARN-enabled Architecture

[Diagram: components shown are Truck Sensors; Inbound Messaging (Kafka); Many Workloads: YARN, with Stream Processing (Storm), Real-time Serving (HBase) and Interactive Query (Hive on Tez); Distributed Storage: HDFS; Alerts & Events (ActiveMQ); a Real-Time User Interface; and Microsoft Excel.]

Page 15: How YARN Enables Multiple Data Processing Engines in Hadoop

Components of the Topology

• 9-node HDP 2.2 cluster with Storm and HBase on YARN

• 4-node Kafka 0.8 cluster

• 1-node ActiveMQ with STOMP protocol enabled

• Spring 4.0 WebMVC web app using SockJS & ActiveMQ over STOMP

Page 16: How YARN Enables Multiple Data Processing Engines in Hadoop

Topology Architecture

[Diagram: components shown, roughly in data-flow order.
• Truck Simulator: trucks T(1), T(2) … T(N); Truck Stream Generator via Akka
• KafkaCollector
• Kafka Grid (captures all driving events): brokers BR(1)–BR(5), ZK, topic truck_events
• Storm on YARN on HDP: Kafka Spout, HBase Bolt, Monitoring Bolt, WebSocket Bolt
• HBase on HDP: driver dangerous events, driver dangerous events count
• Email alerts; ActiveMQ Alert Topic; ActiveMQ Violation Events Topic
• Real-Time Streaming Driver Monitoring App: Spring WebApp with SockJS WebSockets]

Page 17: How YARN Enables Multiple Data Processing Engines in Hadoop

Demo

Page 18: How YARN Enables Multiple Data Processing Engines in Hadoop

Hadoop Tools

Hortonworks. We do Hadoop.

Page 19: How YARN Enables Multiple Data Processing Engines in Hadoop

Agenda

•  The Basics

•  MapReduce & Java

•  Pig

•  Hive

•  HBase, Solr & Spark

•  Abstractions: .NET, Cascading and Spring XD

•  Intro to the Sandbox

•  Basic Hello World using Hive and Pig

•  HBase and Phoenix demo and code discussion

Page 20: How YARN Enables Multiple Data Processing Engines in Hadoop

Hortonworks Data Platform 2.2

HDP Delivers Enterprise Hadoop

[Diagram: the HDP 2.2 stack. Storage is HDFS (Hadoop Distributed File System); above it sits YARN: Data Operating System (Cluster Resource Management), spanning nodes 1 … N. Batch, interactive & real-time data access engines run on YARN: Script (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr) and In-Memory (Spark), with Tez and Slider shown as layers between some engines and YARN. Governance (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS. Security (Authentication, Authorization, Audit, Data Protection): Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox. Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie). Deployment choice: Linux, Windows, Cloud.]

YARN is the architectural center of HDP

•  Common data set across all applications

•  Batch, interactive & real-time workloads

•  Multi-tenant access & processing

Provides comprehensive enterprise capabilities

•  Governance

•  Security

•  Operations

Enables broad ecosystem adoption

•  ISVs can plug directly into Hadoop

The widest range of deployment options

•  Linux & Windows

•  On premises & cloud

Page 21: How YARN Enables Multiple Data Processing Engines in Hadoop

Hortonworks Data Platform 2.2

HDP Delivers Enterprise Hadoop

[Diagram: the same HDP 2.2 stack as on the previous slide: HDFS, YARN: Data Operating System (Cluster Resource Management), the data access engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, Others/ISV Engines, with Tez and Slider), Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS), Security (Authentication, Authorization, Audit, Data Protection), Operations (Ambari, Zookeeper, Oozie) and Deployment Choice (Linux, Windows, Cloud, On-Premises).]

We will cover:

•  What it is & where it is used

•  Basic elements

Page 22: How YARN Enables Multiple Data Processing Engines in Hadoop

MapReduce

MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner

Developers use it to…

•  They don’t have to write it directly anymore

•  Many tools have been created to abstract this complexity
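The programming model that those tools abstract away can be sketched with a word count. This is a single-process illustration in plain Python, not Hadoop's Java MapReduce API: the function names and the sample input are assumptions for the example, and the shuffle step stands in for the grouping the framework does between map and reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


lines = ["Hadoop runs MapReduce", "YARN runs Hadoop engines"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts["hadoop"], counts["runs"], counts["yarn"])   # 2 2 1
```

On a cluster the map calls run in parallel on different blocks of the input and the reduce calls run in parallel on different keys; the logic per call stays this small.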

[Diagram: groups of map tasks (M) feeding reduce tasks (R), with HDFS as the storage layer.]

Page 23: How YARN Enables Multiple Data Processing Engines in Hadoop

Pig

•  Apache™ Pig allows you to write complex MapReduce transformations using a simple scripting language.

•  Pig Latin (the language) defines a set of transformations on a data set, such as aggregate, join and sort.

•  Pig Latin can be extended with UDFs (User Defined Functions), written in Java or a scripting language and then called directly from Pig Latin.

Developers use Pig for

•  ETL

•  Basic “spreadsheet” functions

•  Preparing data for data science

Page 24: How YARN Enables Multiple Data Processing Engines in Hadoop

Example

RAW_LOGS = LOAD '/user/paul/data/apache/access' USING TextLoader AS (line:chararray);

CLICKS_RAW = LOAD '$input' USING PigStorage('|') AS (sls_key:chararray, sls_item_ln_id:int, chn_id:int, loc_id:int, all_chnl_rpt_chn_id:int, all_chnl_rpt_loc_id:int, sls_bsns_dt:chararray, sku_id:int);

RECORDS = LOAD 'config' USING org.apache.hcatalog.pig.HCatLoader();

Page 25: How YARN Enables Multiple Data Processing Engines in Hadoop

Pig Operators

Page 26: How YARN Enables Multiple Data Processing Engines in Hadoop

Hive

•  Apache Hive is the de facto standard for SQL queries over petabytes of data in Hadoop.

•  Created by a team at Facebook.

•  Provides a standard SQL interface to data stored in Hadoop.

•  Quickly find value in raw data files.

•  Proven at petabyte scale.

•  Compatible with popular BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc.

Developers use it to:

•  Perform SQL queries

•  Interface with existing tools via JDBC/ODBC

Page 27: How YARN Enables Multiple Data Processing Engines in Hadoop

Sample SQL with Hive

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
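A concrete query in this shape can be tried without a cluster. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in SQL engine, so it is not Hive: the Hive-specific `CLUSTER BY`, `DISTRIBUTE BY` and `SORT BY` clauses are left out because SQLite does not support them, and `ORDER BY` is added only to make the output order deterministic. The table name, columns and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE truck_events (driver TEXT, event_type TEXT)")
conn.executemany(
    "INSERT INTO truck_events VALUES (?, ?)",
    [
        ("alice", "speeding"),
        ("alice", "speeding"),
        ("alice", "normal"),
        ("bob", "speeding"),
        ("carol", "normal"),
    ],
)

# SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... LIMIT, as in the template above.
query = """
SELECT driver, COUNT(*) AS violations
FROM truck_events
WHERE event_type <> 'normal'
GROUP BY driver
HAVING COUNT(*) >= 1
ORDER BY violations DESC
LIMIT 10
"""
rows = conn.execute(query).fetchall()
print(rows)   # [('alice', 2), ('bob', 1)]
```

The same statement, minus SQLite-specific setup, is the kind of query a BI tool would send to Hive over JDBC/ODBC.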

Page 28: How YARN Enables Multiple Data Processing Engines in Hadoop

Hive - Select Syntax

Page 29: How YARN Enables Multiple Data Processing Engines in Hadoop

Hive Demonstration

HDP Sandbox

•  Up and running with a Hadoop environment in minutes

•  Basic and advanced tutorials with discrete learning paths

•  Ecosystem partner tutorials

hortonworks.com/sandbox

Page 30: How YARN Enables Multiple Data Processing Engines in Hadoop

HBase

•  Apache™ HBase is a non-relational (NoSQL) database that runs on top of the Hadoop® Distributed File System (HDFS).

•  It is column-oriented and provides fault-tolerant storage and quick access to large quantities of sparse data.

•  It also adds row-level transactional capabilities to Hadoop, allowing users to perform updates, inserts and deletes.

•  HBase was created for hosting very large tables with billions of rows and millions of columns.

Developers use it to:

•  Provide low-latency access to massive amounts of data (e.g. recommendation engine results)

•  Document store
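The sparse, row-keyed data model can be sketched in a few lines. This is an in-memory illustration in plain Python, not the HBase client API: the `SparseTable` class, its methods, and the row keys and `family:qualifier` column names are all hypothetical. It shows the three ideas from the bullets: only cells that exist are stored, rows can be updated and deleted in place, and rows are kept sorted by key so a range of keys can be scanned.

```python
from bisect import bisect_left, insort


class SparseTable:
    def __init__(self):
        self._rows = {}   # row key -> {"family:qualifier": value}
        self._keys = []   # row keys kept in sorted order for range scans

    def put(self, row_key, column, value):
        """Insert or update a single cell; absent columns cost nothing."""
        if row_key not in self._rows:
            self._rows[row_key] = {}
            insort(self._keys, row_key)
        self._rows[row_key][column] = value

    def get(self, row_key):
        return dict(self._rows.get(row_key, {}))

    def delete(self, row_key, column):
        self._rows.get(row_key, {}).pop(column, None)

    def scan(self, start, stop):
        """Yield non-empty rows with start <= key < stop, in key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            key = self._keys[i]
            if self._rows[key]:
                yield key, dict(self._rows[key])
            i += 1


table = SparseTable()
table.put("driver#001", "events:speeding", 3)
table.put("driver#002", "events:lane_departure", 1)
table.put("driver#001", "info:name", "alice")
table.put("driver#001", "events:speeding", 4)          # update in place
table.delete("driver#002", "events:lane_departure")    # row is now empty
print(table.get("driver#001"))   # {'events:speeding': 4, 'info:name': 'alice'}
print([key for key, _ in table.scan("driver#001", "driver#003")])   # ['driver#001']
```

Choosing row keys so that related rows sort next to each other is what makes range scans cheap in this model.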

Page 31: How YARN Enables Multiple Data Processing Engines in Hadoop

Phoenix

•  Apache™ Phoenix is a high-performance relational database layer over HBase for low-latency applications.

•  SQL queries are compiled into a series of HBase scans, producing regular JDBC result sets.

•  Table metadata is stored in an HBase table and versioned, and can be queried by version.

•  Query performance is in the millisecond to low-seconds range.

•  The largest known table size is a trillion+ rows, with query response times in the 30-second range.

Developers use it for:

•  Low-latency queries

•  SQL skin on HBase

Page 32: How YARN Enables Multiple Data Processing Engines in Hadoop

Phoenix Functions

Page 33: How YARN Enables Multiple Data Processing Engines in Hadoop

HBase/Phoenix Demonstration

HDP Sandbox

•  Up and running with a Hadoop environment in minutes

•  Basic and advanced tutorials with discrete learning paths

•  Ecosystem partner tutorials

hortonworks.com/sandbox

Page 34: How YARN Enables Multiple Data Processing Engines in Hadoop

Storm

•  Apache™ Storm is a distributed real-time computation system for processing fast, large streams of data. Storm adds reliable real-time data processing capabilities to Hadoop.

•  Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size.

•  Apache Kafka is a publish-subscribe messaging system that works well with Storm.

Developers use it to:

•  Analyze stream data in real-time
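The shape of a stream-processing topology, a source feeding a chain of processing steps, can be sketched with Python generators. This is an illustration of the idea only and does not use the Storm API: the function names borrow Storm's spout/bolt vocabulary, and the event fields and sample events are assumptions for the example.

```python
from collections import Counter


def truck_event_spout(events):
    """Spout: the source of the stream; here it simply replays a list."""
    for event in events:
        yield event


def violation_filter_bolt(stream):
    """Bolt: pass along only events that are not normal driving."""
    for event in stream:
        if event["type"] != "normal":
            yield event


def count_by_driver_bolt(stream):
    """Bolt: keep a running violation count per driver and emit each update."""
    counts = Counter()
    for event in stream:
        counts[event["driver"]] += 1
        yield event["driver"], counts[event["driver"]]


events = [
    {"driver": "d1", "type": "speeding"},
    {"driver": "d2", "type": "normal"},
    {"driver": "d1", "type": "lane_departure"},
]
topology = count_by_driver_bolt(violation_filter_bolt(truck_event_spout(events)))
updates = list(topology)
print(updates)   # [('d1', 1), ('d1', 2)]
```

Each event flows through the whole chain as soon as it arrives, which is the difference from a batch job that waits for the full input.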

Page 35: How YARN Enables Multiple Data Processing Engines in Hadoop

Solr

•  Apache Solr provides full-text search and near real-time indexing for data stored in Hadoop.

•  Whether users search for tabular, text, geo-location or sensor data in Hadoop, they find it quickly with Apache Solr.

•  Apache Solr indexes via XML, JSON, CSV or binary over HTTP. Users can query petabytes of data via HTTP GET and receive XML, JSON, CSV or binary results.

Developers use it to:

•  Provide search capability for a cluster

•  Data scientists often use it to explore data found in HDFS
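Querying over HTTP GET comes down to building a URL. The sketch below only constructs the request URL for Solr's `/select` handler with the standard `q`, `rows` and `wt` parameters; it does not contact a server. The helper name, the host `localhost:8983`, the collection `truck_events` and the field `event_type` are assumptions for the example.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, collection, query, rows=10, fmt="json"):
    """Build a GET URL for Solr's /select handler (q = query, wt = response format)."""
    params = urlencode({"q": query, "rows": rows, "wt": fmt})
    return f"{base_url}/solr/{collection}/select?{params}"


# Hypothetical host, collection and field names.
url = solr_select_url("http://localhost:8983", "truck_events", "event_type:speeding")
print(url)
# http://localhost:8983/solr/truck_events/select?q=event_type%3Aspeeding&rows=10&wt=json
```

Changing `fmt` to `xml` or `csv` asks for the same results in a different response format.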

Page 36: How YARN Enables Multiple Data Processing Engines in Hadoop

Spark

•  Spark is a general-purpose engine for ad-hoc interactive analytics, iterative machine learning, and other use cases well suited to interactive, in-memory processing of GB to TB sized datasets.

•  Spark loads data into memory so it can be queried repeatedly. It can create a “shadow” of data that can be used in the next iteration of a query.

•  Spark provides simple APIs for data scientists and engineers familiar with Scala (programming language) to build applications.

•  Spark is YARN-ready – another engine on YARN!

Developers use it for:

•  Data science: machine learning and iterative analytics

Page 37: How YARN Enables Multiple Data Processing Engines in Hadoop

Cascading

•  Cascading is an application development framework for building data applications. It converts applications into MapReduce jobs.

•  The Cascading SDK provides a collection of tools, documentation, libraries, tutorials and example projects.

•  Lingual. Simplifies systems integration through ANSI SQL compatibility and a JDBC driver.

•  Pattern. Enables various machine learning scoring algorithms through PMML compatibility.

•  Scalding. Enables development with Scala, a powerful language for solving functional problems.

•  Cascalog. Enables development with Clojure, a Lisp dialect.

Developers use it to:

•  Build complex native Hadoop applications without getting into MapReduce.

Page 38: How YARN Enables Multiple Data Processing Engines in Hadoop

.NET

•  The Microsoft .NET SDK for Hadoop provides API access to HDP and Microsoft HDInsight, including HDFS, HCatalog, Oozie and Ambari, along with some PowerShell scripts for cluster management.

•  There are also libraries for MapReduce and LINQ to Hive. The latter is especially interesting because it builds on LINQ, the established way for .NET developers to access most data sources, to deliver the capabilities of Hive, the de facto standard for Hadoop data query.

Developers use it to:

•  Build complex Microsoft .NET Hadoop applications.

Page 39: How YARN Enables Multiple Data Processing Engines in Hadoop

Java & Spring XD

•  Spring for Apache Hadoop (SHDP) provides a developer API for Pig, Hive and Cascading, and provides extensions to Spring Batch for orchestrating Hadoop-based workflows.

•  It integrates with other Spring ecosystem projects such as Spring Integration and Spring Batch.

•  These foundational parts of the Spring IO platform make Hadoop development more accessible to a wider range of Java developers.

Developers use it to:

•  Build complex Hadoop applications using Java and the Spring framework

Page 40: How YARN Enables Multiple Data Processing Engines in Hadoop

Hadoop Summit 2015

Page 41: How YARN Enables Multiple Data Processing Engines in Hadoop

Thank You!

Eric Mizell – Director, Solutions Engineering

@ericmizell