Run Your First Hadoop 2.x Program

© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com


Session Objectives

This session will help you to:

ᗍ Understand
• Introduction to Big Data
• Introduction to Hadoop 2.x
• HDFS Fundamentals
• MapReduce & YARN
• Hive Introduction


What is Big Data?

ᗍ Big Data refers to data sets so large, complex or unstructured that they become difficult to manage and process with traditional RDBMS tools

ᗍ Every day we create roughly 2.5 quintillion bytes of data; 90% of the world's collected data has been generated in the last 2 years alone

ᗍ Data sizes are now measured in terabytes, petabytes, exabytes and zettabytes


Structured vs Unstructured Data - II


Traditional Systems vs. New Systems

Traditional Systems | New Systems
Not scalable to meet new business demands | Scalable to meet new business demands
Cannot process massive data at high speed | Can process massive data at high speed
Can only be scaled up, not scaled out | Can be scaled up and scaled out
Cost of system, processing and data management is not economical | Cost of system, processing and data management is economical


Introduction to Hadoop

ᗍ Hadoop is a framework for storing, processing and analysing Big Data

ᗍ It allows distributed storage and distributed processing of large data sets across clusters of commodity computers using a simple programming model

ᗍ It is an Apache open-source framework


Key Features of Hadoop

It is based on a master-slave architecture

Designed for massive scale

Highly available system

Low software and hardware costs

Distributed storage and processing achieves high performance

No license costs; supported by a very large developer community


Hadoop Ecosystem

Components shown in the ecosystem diagram:
• Pig (Data Flow)
• Hive (SQL)
• Others (Cascading)
• MR (Batch)
• RT Stream, Graph (Storm, Giraph)
• Services (HBase)
• TEZ (Execution Engine)
• YARN (Cluster Resource Management)
• HDFS (Redundant Reliable Storage)


Hadoop Core Components

Distributed Data Storage Framework

Distributed Data Processing Framework


Hadoop Architecture

ᗍ HDFS - Storage
• NameNode
• DataNode
• Secondary NameNode

ᗍ MapReduce - Processing
• Resource Manager
• Node Manager


HDFS Architecture

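Diagram: HDFS architecture
• NameNode: holds the metadata (name, replicas, ...), for example /home/foo/data, 3,…
• Client to NameNode: metadata ops
• NameNode to DataNodes: block ops
• Client to DataNodes: read and write
• DataNodes on Rack 1 and Rack 2 store the blocks; replication copies blocks between DataNodes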
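
To make the metadata idea concrete, here is a minimal Python sketch of what a NameNode might track: which blocks make up a file and which DataNodes hold each replica. It is a toy model, not HDFS code. The function name `place_blocks`, the DataNode names, the tiny block size and the round-robin placement are all assumptions for illustration; real HDFS uses much larger blocks and a rack-aware placement policy.

```python
BLOCK_SIZE = 4      # bytes per block; assumed tiny value so the example is readable
REPLICATION = 3     # replicas per block, matching the "3" in the metadata label


def place_blocks(path, data, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split data into blocks and record which DataNodes hold each block.

    Placement is plain round-robin, a simplification of the real policy.
    Needs at least `replication` DataNodes so the replicas are distinct.
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    metadata = {path: []}
    for index in range(len(blocks)):
        replicas = [datanodes[(index + r) % len(datanodes)] for r in range(replication)]
        metadata[path].append({"block": index, "replicas": replicas})
    return blocks, metadata


datanodes = ["dn1", "dn2", "dn3", "dn4"]          # hypothetical DataNode names
blocks, metadata = place_blocks("/home/foo/data", b"0123456789", datanodes)
for entry in metadata["/home/foo/data"]:
    print(entry)
```

Note that the metadata dictionary never contains file bytes: the NameNode only knows where blocks live, while the DataNodes hold the data itself.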


HDFS File Read Operation

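Diagram: a client node (client JVM running the HDFS client), the NameNode on the admin node, and three DataNodes on slave nodes. Steps:

1. Open: the HDFS client opens the file through DistributedFileSystem
2. Get block locations: DistributedFileSystem asks the NameNode
3. Read: the HDFS client reads from FSDataInputStream
4. Read: FSDataInputStream reads from a DataNode
5. Read: FSDataInputStream reads from the next DataNode
6. Close: the HDFS client closes FSDataInputStream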
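
The read steps above can be sketched as a small Python simulation. This is not the Hadoop client API: the `NameNode` and `DataNode` classes, the `read_file` function, and the block and file names are hypothetical stand-ins chosen only to show who is asked for what.

```python
class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.block_map = {}      # path -> list of (block_id, [DataNode names])

    def get_block_locations(self, path):
        return self.block_map[path]


class DataNode:
    """Holds the actual block bytes."""

    def __init__(self):
        self.blocks = {}         # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, namenode, datanodes):
    # Steps 1-2: open the file and ask the NameNode for the block locations.
    locations = namenode.get_block_locations(path)
    chunks = []
    # Steps 3-5: read each block, in order, from the first DataNode that holds it.
    for block_id, holders in locations:
        chunks.append(datanodes[holders[0]].read_block(block_id))
    # Step 6: close (nothing to release in this toy model).
    return b"".join(chunks)


# A toy cluster with one file stored as two blocks on two DataNodes.
namenode = NameNode()
datanodes = {"dn1": DataNode(), "dn2": DataNode()}
datanodes["dn1"].blocks["blk_0"] = b"hello "
datanodes["dn2"].blocks["blk_1"] = b"hadoop"
namenode.block_map["/demo.txt"] = [("blk_0", ["dn1"]), ("blk_1", ["dn2"])]

content = read_file("/demo.txt", namenode, datanodes)
print(content)   # b'hello hadoop'
```

The point of the design is that file data never passes through the NameNode; the client goes to it once for locations and then talks to the DataNodes directly.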


HDFS File Write Operation

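Diagram: the HDFS client, the NameNode on the admin node, and a pipeline of three DataNodes on slave nodes that store the blocks. Steps:

1. Create: the HDFS client creates the file through DistributedFileSystem
2. Create: DistributedFileSystem asks the NameNode to create the file
3. Write: the HDFS client writes data
4. Write packet: each packet is forwarded along the pipeline of DataNodes
5. Ack packet: acknowledgements travel back along the pipeline
6. Close: the HDFS client closes the stream
7. Complete: the NameNode is told that the file is complete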
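
Steps 4 and 5 (write packet, ack packet) are the least obvious part of the write path, so here is a rough Python sketch of a packet travelling down a pipeline of DataNodes and the acknowledgement coming back. It is an illustration under assumptions, not HDFS code: `write_packet`, `write_file`, the packet size and the node names are invented for this example, and failure handling is left out.

```python
def write_packet(packet, pipeline, stores):
    """Send one packet down the pipeline; return True once every node has acked."""
    if not pipeline:
        return True                        # past the last node: nothing left to ack
    head, rest = pipeline[0], pipeline[1:]
    stores[head].append(packet)            # step 4: this DataNode stores the packet
    # The node forwards the packet downstream; the downstream ack travels
    # back through this node to the sender (step 5).
    return write_packet(packet, rest, stores)


def write_file(data, pipeline, packet_size=4):
    """Cut data into packets and push each one through the whole pipeline."""
    stores = {node: [] for node in pipeline}
    for i in range(0, len(data), packet_size):
        acked = write_packet(data[i:i + packet_size], pipeline, stores)
        if not acked:
            raise RuntimeError("packet was not acknowledged by the whole pipeline")
    return stores


stores = write_file(b"hello hadoop", ["dn1", "dn2", "dn3"])   # hypothetical node names
for node, packets in stores.items():
    print(node, b"".join(packets))
```

After the call, every DataNode in the pipeline holds the same bytes, which is how the replicas come into existence during the write rather than afterwards.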


Traditional Solution

Diagram: Very Big Data is cut into several Split Data chunks; a separate grep runs on each chunk and produces its own matches; cat combines them into All matches.
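
The split, grep, cat flow in the diagram can be sketched in a few lines of Python. The helper names `split_data`, `grep` and `cat` mirror the diagram labels but are otherwise made up, the sample lines are invented, and the chunks are processed one after another here, whereas the traditional solution would run each grep on its own machine.

```python
import re


def split_data(lines, parts):
    """Cut the input into roughly `parts` equal chunks."""
    size = max(1, -(-len(lines) // parts))      # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def grep(pattern, chunk):
    """Return the lines of one chunk that match the pattern."""
    return [line for line in chunk if re.search(pattern, line)]


def cat(results):
    """Concatenate the per-chunk matches into one list."""
    return [line for part in results for line in part]


lines = ["error: disk full", "ok", "error: timeout", "ok", "warning", "error: retry"]
chunks = split_data(lines, 3)
all_matches = cat(grep(r"error", chunk) for chunk in chunks)
print(all_matches)   # ['error: disk full', 'error: timeout', 'error: retry']
```

Everything outside these three functions (copying chunks to machines, restarting failed greps, collecting the outputs) has to be handled by hand in the traditional approach.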


MapReduce Solution

Diagram: Very Big Input is cut into several Split Data chunks; inside the MapReduce Framework, MAP processes the chunks and REDUCE combines the results into All matches.


Understanding MapReduce Paradigm

MapReduce Word Count Process Flow

Stages: Input, Splitting, Mapping, Shuffling, Reducing, Final Result

Key/value types: K1,V1 (map input); List(K2,V2) (map output); K2,List(V2) (reduce input); List(K3,V3) (reduce output)

Input: Jack Bill Joe Don Don Joe Jack Don Bill

Splitting:
• Jack Bill Joe
• Don Don Joe
• Jack Don Bill

Mapping:
• Jack, 1 Bill, 1 Joe, 1
• Don, 1 Don, 1 Joe, 1
• Jack, 1 Don, 1 Bill, 1

Shuffling:
• Bill, (1,1)
• Don, (1,1,1)
• Jack, (1,1)
• Joe, (1,1)

Reducing:
• Bill, 2
• Don, 3
• Jack, 2
• Joe, 2

Final Result: Bill, 2 Don, 3 Jack, 2 Joe, 2
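
The word count flow above can be reproduced with a short, single-process Python sketch. It only imitates the map, shuffle and reduce phases in memory; it is not a Hadoop job, and the function names `map_phase`, `shuffle_phase` and `reduce_phase` are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle_phase(mapped):
    """Shuffle: group all values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reduce: sum the list of ones for each word."""
    return {key: sum(values) for key, values in grouped.items()}


splits = ["Jack Bill Joe", "Don Don Joe", "Jack Don Bill"]
mapped = [map_phase(s) for s in splits]
grouped = shuffle_phase(mapped)
counts = reduce_phase(grouped)

print(grouped)  # {'Bill': [1, 1], 'Don': [1, 1, 1], 'Jack': [1, 1], 'Joe': [1, 1]}
print(counts)   # {'Bill': 2, 'Don': 3, 'Jack': 2, 'Joe': 2}
```

In a real cluster each call to `map_phase` would run on the node that holds that split, and the framework would carry out the shuffle across the network; only the map and reduce logic is written by the programmer.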


What is Hive?

Hive is a data warehouse query tool built on top of HDFS and YARN

Provides HiveQL, which is very similar to SQL

Used for querying and analyzing large structured data sets

It is extensible by User Defined Functions (UDFs)


RDBMS Vs. Hive

RDBMS | Hive
Schema on write: the table schema is enforced at data load time, so data that does not conform to the schema is rejected | Schema on read: the schema is not verified when the data is loaded
Not very scalable; scaling up is costly | Easily scalable at low cost
Data can be read and written many times | Follows the Hadoop model of write once, read many times
Record-level updates, insertions and deletes, transactions and indexes are possible | Record-level updates are not possible
Both OLTP (On-line Transaction Processing) and OLAP (On-line Analytical Processing) are supported | OLTP (On-line Transaction Processing) is not yet supported, but OLAP (On-line Analytical Processing) is
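
The schema-on-write versus schema-on-read row is easier to see with a small Python sketch. Nothing here is Hive or RDBMS code: the two-column schema, the sample rows and all function names are assumptions made up for this illustration. The behaviour it imitates is that a schema-on-write load rejects non-conforming data up front, while a schema-on-read load accepts anything and a field that does not fit the schema shows up as a null value at query time.

```python
SCHEMA = [("name", str), ("age", int)]     # hypothetical two-column table


def parse(raw_row):
    """Convert one comma-separated row to typed values; raise ValueError if it does not fit."""
    fields = [field.strip() for field in raw_row.split(",")]
    if len(fields) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} fields, got {len(fields)}")
    return tuple(cast(field) for (_, cast), field in zip(SCHEMA, fields))


def load_schema_on_write(raw_rows):
    """RDBMS style: validate while loading; a bad row makes the load fail."""
    return [parse(row) for row in raw_rows]


def load_schema_on_read(raw_rows):
    """Hive style: loading only stores the raw rows, with no checks."""
    return list(raw_rows)


def query_schema_on_read(stored_rows):
    """Apply the schema at query time; a field that does not fit becomes None."""
    result = []
    for row in stored_rows:
        typed = []
        for (_, cast), field in zip(SCHEMA, row.split(",")):
            try:
                typed.append(cast(field.strip()))
            except ValueError:
                typed.append(None)
        result.append(tuple(typed))
    return result


rows = ["Ann, 31", "Bob, unknown"]         # the second row has a non-numeric age

try:
    load_schema_on_write(rows)
    write_load_ok = True
except ValueError:
    write_load_ok = False                   # the load is rejected

stored = load_schema_on_read(rows)          # the load always succeeds
queried = query_schema_on_read(stored)
print(write_load_ok)   # False
print(queried)         # [('Ann', 31), ('Bob', None)]
```

The trade-off is that schema on read makes loading fast and cheap, but moves the cost of discovering bad data to query time.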


Doubts Time


One more thing…A VERY SPECIAL SURPRISE FOR YOU


Why SkillSpeed?

Course Curriculum from Industry Experts

Instructor Led Live Virtual Sessions

Lifetime access to Course Content via LMS

100% Placement Assistance

24x7 Support

