Run Your First Hadoop 2.x Program

© 2015 BlueCamphor Technologies (P) Ltd. Run Your First Hadoop Program

Upload: skillspeed

Post on 15-Apr-2017

Page 1: Run Your First Hadoop 2.x Program

Run Your First Hadoop Program

Page 2: Run Your First Hadoop 2.x Program

© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com 2

Know your Instructor

Page 3: Run Your First Hadoop 2.x Program

Session Objectives

This Session will help you to:

ᗍ Understand
• Introduction to Big Data
• Introduction to Hadoop 2.x
• HDFS Fundamentals
• MapReduce & YARN
• Hive Introduction

Page 4: Run Your First Hadoop 2.x Program

What is BIG Data

ᗍ Big Data refers to data sets so large and complex, and often unstructured, that they become difficult to manage and process with traditional RDBMS tools

ᗍ Every day we create roughly 2.5 quintillion bytes of data; 90% of the world's collected data has been generated in the last 2 years alone

ᗍ Data sizes are now measured in terabytes, petabytes, exabytes and zettabytes

Page 5: Run Your First Hadoop 2.x Program

Structured vs Unstructured Data - II

Page 6: Run Your First Hadoop 2.x Program

Traditional Systems vs. New Systems

Traditional Systems | New Systems
Not scalable to meet new business demands | Scalable to meet new business demands
Cannot process massive data at high speed | Can process massive data at high speed
Can only be scaled up, not scaled out | Can be scaled up and scaled out
Cost of system, processing and data management is not economical | Cost of system, processing and data management is economical

Page 7: Run Your First Hadoop 2.x Program

Introduction to Hadoop

ᗍ Hadoop is a framework for storing, processing and analysing Big Data

ᗍ Allows distributed storage and distributed processing of large data sets across clusters of commodity computers using a simple programming model

ᗍ It is an Apache Open Source framework

Page 8: Run Your First Hadoop 2.x Program

Key Features of Hadoop

It’s based on the Master - Slave architecture

Designed for massive scale

Highly available System

Low software and hardware costs

Distributed storage and processing achieves high performance

No license costs; supported by a very large developer community

Page 9: Run Your First Hadoop 2.x Program

Hadoop Ecosystem

[Figure: Hadoop ecosystem stack]

Pig (Data Flow)
MR (Batch)
Hive (SQL)
Others (Cascading)
RT Stream, Graph (Storm, Giraph)
Services (HBase)
Tez (Execution Engine)
YARN (Cluster Resource Management)
HDFS (Redundant, Reliable Storage)

Page 10: Run Your First Hadoop 2.x Program

Hadoop Core Components

Distributed Data Storage Framework

Distributed Data Processing Framework

Page 11: Run Your First Hadoop 2.x Program

Hadoop Architecture

ᗍ HDFS - Storage
• NameNode
• DataNode
• Secondary NameNode

ᗍ MapReduce on YARN - Processing
• ResourceManager
• NodeManager

Page 12: Run Your First Hadoop 2.x Program

HDFS Architecture

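[Figure: HDFS architecture. Clients send metadata ops to the NameNode, which holds the metadata (name, replicas, ...): /home/foo/data, 3, … The NameNode sends block ops to the DataNodes. Clients read blocks from and write blocks to DataNodes directly, and blocks are replicated between DataNodes on Rack 1 and Rack 2.]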
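As a rough illustration of the block and replica bookkeeping in this figure, the sketch below computes how many blocks a file occupies and how much raw storage its replicas consume. The 128 MB block size is the usual Hadoop 2.x default, and the replication factor of 3 matches the metadata example "/home/foo/data, 3". The function name and the 1 GB file size are made up for illustration.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed: common Hadoop 2.x default (dfs.blocksize)
REPLICATION = 3                 # matches the "/home/foo/data, 3" metadata example


def block_layout(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, total block replicas, raw bytes stored)."""
    blocks = math.ceil(file_size_bytes / block_size) if file_size_bytes else 0
    # The last block only occupies as much space as the data it holds,
    # so raw storage is simply file size times the replication factor.
    return blocks, blocks * replication, file_size_bytes * replication


if __name__ == "__main__":
    one_gb = 1024 ** 3  # hypothetical file size
    blocks, replicas, raw = block_layout(one_gb)
    print(blocks, "blocks,", replicas, "replicas,", raw // 1024 ** 2, "MB raw")
```

The NameNode tracks only this kind of metadata; the block contents themselves live on the DataNodes.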

Page 13: Run Your First Hadoop 2.x Program

HDFS File Read Operation

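[Figure: HDFS file read. The HDFS client runs in the client JVM on the client node and reads through DistributedFileSystem and FSDataInputStream. The NameNode runs on the admin node; the DataNodes run on slave nodes.]

1. Open: the HDFS client opens the file through DistributedFileSystem
2. Get block locations: DistributedFileSystem asks the NameNode where the blocks are stored
3. Read: the client reads from FSDataInputStream
4. Read: FSDataInputStream reads a block from a DataNode
5. Read: FSDataInputStream reads the next block from another DataNode
6. Close: the client closes FSDataInputStream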
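The read steps can be mimicked with a small in-memory model. This is only a toy sketch, not the HDFS client API: the classes ToyNameNode and ToyDataNode, the block ids and the node names are all hypothetical. It shows the key idea that the NameNode supplies block locations while the data itself comes from DataNodes.

```python
class ToyNameNode:
    """Holds only metadata: which blocks make up a file and where replicas live."""

    def __init__(self):
        self.block_map = {}  # path -> list of (block_id, [datanode names])

    def get_block_locations(self, path):
        return self.block_map[path]


class ToyDataNode:
    """Holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, namenode, datanodes):
    """Open the file, look up block locations, then read block by block."""
    locations = namenode.get_block_locations(path)  # steps 1-2: open, get block locations
    data = b""
    for block_id, replicas in locations:            # steps 3-5: read each block in order
        node = datanodes[replicas[0]]               # toy policy: use the first listed replica
        data += node.read_block(block_id)
    return data                                     # step 6: nothing left to read, close


if __name__ == "__main__":
    dn1, dn2 = ToyDataNode("dn1"), ToyDataNode("dn2")
    dn1.blocks["blk_1"] = b"hello "
    dn2.blocks["blk_1"] = b"hello "
    dn2.blocks["blk_2"] = b"hadoop"
    nn = ToyNameNode()
    nn.block_map["/home/foo/data"] = [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2"])]
    print(read_file("/home/foo/data", nn, {"dn1": dn1, "dn2": dn2}))
```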

Page 14: Run Your First Hadoop 2.x Program

HDFS File Write Operation

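[Figure: HDFS file write. The HDFS client writes through DistributedFileSystem. The NameNode runs on the admin node; blocks are written to a pipeline of DataNodes running on slave nodes.]

1. Create: the HDFS client calls create on DistributedFileSystem
2. Create: DistributedFileSystem asks the NameNode to create the file
3. Write: the client writes data to the output stream
4. Write packet: each packet is sent to the first DataNode and forwarded along the pipeline of DataNodes
5. Ack packet: acknowledgements travel back along the pipeline
6. Close: the client closes the stream
7. Complete: the NameNode is told that the file is complete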
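The "write packet" and "ack packet" steps can be pictured with a toy pipeline. This is a simplified simulation, not real HDFS code: the node names, the 4-byte packet size and both function names are assumptions made for illustration, and real failure handling is left out.

```python
def write_packet(packet, pipeline, stores):
    """Forward one packet down the DataNode pipeline and return the acks.

    pipeline: ordered list of (hypothetical) DataNode names
    stores:   dict mapping DataNode name -> list of packets stored so far
    """
    acks = []
    for node in pipeline:        # write packet: stored on each node in pipeline order
        stores[node].append(packet)
        acks.append(node)
    acks.reverse()               # ack packet: acks travel back from the last node to the first
    return acks


def write_file(data, pipeline, packet_size=4):
    """Split data into packets and push each one through the pipeline."""
    stores = {node: [] for node in pipeline}
    for start in range(0, len(data), packet_size):
        packet = data[start:start + packet_size]
        acks = write_packet(packet, pipeline, stores)
        if len(acks) != len(pipeline):
            raise IOError("pipeline did not acknowledge a packet")
    # After the stream is closed, every node in the pipeline holds a full copy.
    return {node: b"".join(packets) for node, packets in stores.items()}


if __name__ == "__main__":
    copies = write_file(b"hello hadoop", ["dn1", "dn2", "dn3"])
    for node, content in copies.items():
        print(node, content)
```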

Page 17: Run Your First Hadoop 2.x Program

Traditional Solution

[Figure: Traditional solution. Very big data is divided into several pieces of split data. A separate grep runs on each split and produces its matches, and cat combines the per-split matches into all matches.]

Page 18: Run Your First Hadoop 2.x Program

MapReduce Solution

[Figure: MapReduce solution. Very big input is divided into several pieces of split data. Inside the MapReduce framework, MAP runs on each split and REDUCE combines the results into all matches.]
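The grep example maps naturally onto the two phases: the per-split grep is the map step and the final cat is the reduce step. The following is a minimal single-machine sketch of that idea using a thread pool in place of a cluster; the function names and the default of three splits are assumptions for illustration, not Hadoop API.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_data(lines, n_splits):
    """Divide the input lines into roughly equal splits."""
    size = max(1, -(-len(lines) // n_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_grep(split, pattern):
    """MAP: run the 'grep' over one split and return its matches."""
    regex = re.compile(pattern)
    return [line for line in split if regex.search(line)]


def reduce_cat(partial_matches):
    """REDUCE: concatenate the per-split matches, like 'cat' in the traditional flow."""
    return [line for matches in partial_matches for line in matches]


def grep_mapreduce(lines, pattern, n_splits=3):
    splits = split_data(lines, n_splits)
    with ThreadPoolExecutor() as pool:
        # pool.map preserves split order, so matches stay in input order
        partial = list(pool.map(lambda s: map_grep(s, pattern), splits))
    return reduce_cat(partial)


if __name__ == "__main__":
    log = ["error a", "ok", "error b", "fine", "error c"]
    print(grep_mapreduce(log, "error"))
```

In a real cluster the framework, not the programmer, handles the splitting, scheduling and collection of results.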

Page 19: Run Your First Hadoop 2.x Program

Understanding MapReduce Paradigm

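[Figure: MapReduce Word Count Process Flow]

Stages: Input, Splitting, Mapping, Shuffling, Reducing, Final Result

Key/value types: K1,V1 (map input); List(K2,V2) (map output); K2,List(V2) (reduce input); List(K3,V3) (reduce output)

Input:
Jack Bill Joe
Don Don Joe
Jack Don Bill

Splitting:
Split 1: Jack Bill Joe
Split 2: Don Don Joe
Split 3: Jack Don Bill

Mapping:
Split 1: Jack, 1 Bill, 1 Joe, 1
Split 2: Don, 1 Don, 1 Joe, 1
Split 3: Jack, 1 Don, 1 Bill, 1

Shuffling:
Bill, (1,1)
Don, (1,1,1)
Jack, (1,1)
Joe, (1,1)

Reducing:
Bill, 2
Don, 3
Jack, 2
Joe, 2

Final Result:
Bill, 2
Don, 3
Jack, 2
Joe, 2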
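The word count flow can be reproduced in a few lines of plain Python. This is a single-process sketch of the map, shuffle and reduce stages using the same three input lines, not a Hadoop job; the function names are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(split):
    """Mapping: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle_phase(mapped):
    """Shuffling: group all values by key, giving K2 -> List(V2), sorted by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reducing: sum the list of counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


if __name__ == "__main__":
    splits = ["Jack Bill Joe", "Don Don Joe", "Jack Don Bill"]
    mapped = [map_phase(s) for s in splits]
    grouped = shuffle_phase(mapped)
    print(grouped)                # {'Bill': [1, 1], 'Don': [1, 1, 1], 'Jack': [1, 1], 'Joe': [1, 1]}
    print(reduce_phase(grouped))  # {'Bill': 2, 'Don': 3, 'Jack': 2, 'Joe': 2}
```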

Page 20: Run Your First Hadoop 2.x Program

What is Hive?

Hive is a data warehouse query tool built on top of HDFS and YARN

Provides HiveQL, which is very similar to SQL

Used for querying and analyzing large structured data sets

It is extensible by User Defined Functions (UDFs)

Page 21: Run Your First Hadoop 2.x Program

RDBMS Vs. Hive

RDBMS | Hive
Schema on write: the table schema is enforced at data load time, so data that does not conform to the schema is rejected | Schema on read: the schema is not verified when the data is loaded
Not very scalable; scaling up is costly | Easily scalable at low cost
Data can be read and written many times | Follows the Hadoop model of write once, read many times
Record-level updates, insertions, deletes, transactions and indexes are possible | Record-level updates are not possible
Both OLTP (On-line Transaction Processing) and OLAP (On-line Analytical Processing) are supported | OLTP (On-line Transaction Processing) is not yet supported; OLAP (On-line Analytical Processing) is supported
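The schema-on-write versus schema-on-read distinction can be made concrete with a toy loader. The sketch below is plain Python, not Hive or any database API; the two-column schema, the sample rows and all function names are hypothetical. Values that do not fit the schema become None at read time, mirroring how Hive returns NULL for such fields.

```python
SCHEMA = [("name", str), ("age", int)]  # hypothetical two-column table


def parse_row(line):
    """Convert one comma-separated line to typed values; raise ValueError on mismatch."""
    fields = line.split(",")
    if len(fields) != len(SCHEMA):
        raise ValueError("wrong number of columns: " + line)
    return tuple(cast(value.strip()) for (_, cast), value in zip(SCHEMA, fields))


def load_schema_on_write(lines):
    """RDBMS style: validate every row at load time and reject bad data."""
    return [parse_row(line) for line in lines]  # raises on the first bad row


def load_schema_on_read(lines):
    """Hive style: loading just stores the raw lines, with no validation."""
    return list(lines)


def cast_or_none(cast, value):
    """Apply the column type, or give None when the value is missing or does not fit."""
    if value is None:
        return None
    try:
        return cast(value)
    except ValueError:
        return None


def query_schema_on_read(stored):
    """The schema is applied only when the data is read."""
    rows = []
    for line in stored:
        fields = [f.strip() for f in line.split(",")]
        fields += [None] * (len(SCHEMA) - len(fields))  # pad missing columns
        rows.append(tuple(cast_or_none(cast, v) for (_, cast), v in zip(SCHEMA, fields)))
    return rows


if __name__ == "__main__":
    data = ["alice, 30", "bob, unknown"]
    print(query_schema_on_read(load_schema_on_read(data)))
    try:
        load_schema_on_write(data)
    except ValueError as err:
        print("rejected at load time:", err)
```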

Page 22: Run Your First Hadoop 2.x Program

Doubts Time

Page 23: Run Your First Hadoop 2.x Program

One more thing… A VERY SPECIAL SURPRISE FOR YOU

Page 24: Run Your First Hadoop 2.x Program

Why SkillSpeed?

Course Curriculum from Industry Experts

Instructor Led Live Virtual Sessions

Lifetime access to Course Content via LMS

100% Placement Assistance

24x7 Support

Page 25: Run Your First Hadoop 2.x Program

Corporate Partners

Page 26: Run Your First Hadoop 2.x Program