Run Your First Hadoop 2.x Program
TRANSCRIPT
© 2015 BlueCamphor Technologies (P) Ltd.
Run Your First Hadoop Program
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com 2
Session Objectives
This Session will help you to:
ᗍ Understand:
• Introduction to Big Data
• Introduction to Hadoop 2.x
• HDFS Fundamentals
• MapReduce & YARN
• Hive Introduction
What is BIG Data
ᗍ Big Data refers to data sets so large, complex or unstructured that they become difficult to manage and process with traditional RDBMS tools
ᗍ Every day we create roughly 2.5 quintillion bytes of data; 90% of the world's collected data has been generated in the last 2 years
ᗍ Data sizes are now measured in terabytes, petabytes, exabytes and zettabytes
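To put the 2.5 quintillion bytes figure on the unit scale listed above, here is a minimal Python sketch. It assumes decimal (SI) units, where each step is a factor of 1000; the helper name `humanize` is made up for illustration.

```python
# Decimal (SI) byte units, smallest to largest.
UNITS = ["bytes", "kilobytes", "megabytes", "gigabytes", "terabytes",
         "petabytes", "exabytes", "zettabytes"]

def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

daily = 2.5e18  # 2.5 quintillion bytes, the daily figure quoted above
print(humanize(daily))  # 2.5 exabytes
```

So the daily volume quoted above is about 2.5 exabytes, or 2,500 petabytes.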
Structured vs Unstructured Data - II
Traditional Systems vs. New Systems
Traditional Systems | New Systems
It is not scalable to meet new business demands | It is scalable to meet new business demands
Cannot process massive data at high speed | Can process massive data at high speed
It can only be scaled up and cannot be scaled out | It can be scaled up and scaled out
Cost of system, processing and data management is not economical | Cost of system, processing and data management is economical
Introduction to Hadoop
ᗍ Hadoop is a framework for storing, processing and analysing Big Data
ᗍ Allows distributed storage and distributed processing of large data sets across clusters of commodity computers using a simple programming model
ᗍ It is an Apache Open Source framework
Key Features of Hadoop
Based on a master-slave architecture
Designed for massive scale
Highly available system
Low software and hardware costs
Distributed storage and processing achieves high performance
No license costs; supported by a very large developer community
Hadoop Ecosystem
[Figure: Hadoop ecosystem stack, top to bottom]
Pig (Data Flow), Hive (SQL), Others (Cascading)
MR (Batch), TEZ (Execution Engine)
RT Stream, Graph (Storm, Giraph)
Services (HBase)
YARN (Cluster Resource Management)
HDFS (Redundant Reliable Storage)
Hadoop Core Components
Distributed Data Storage Framework (HDFS)
Distributed Data Processing Framework (MapReduce)
Hadoop Architecture
ᗍ HDFS - Storage
• NameNode
• DataNode
• Secondary NameNode
ᗍ MapReduce on YARN - Processing
• ResourceManager
• NodeManager
HDFS Architecture
[Figure: HDFS architecture]
NameNode: Metadata (Name, replicas, ...): /home/foo/data, 3, …
Client to NameNode: Metadata ops
NameNode to DataNodes: Block ops
Client to DataNodes: Read, Write
DataNodes (Rack 1, Rack 2): Blocks, Replication
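The split between metadata on the NameNode and blocks on the DataNodes can be sketched as a toy Python model. Everything here is an assumption made for illustration: the DataNode names, the rack layout, the tiny block size and the round-robin placement, which is much simpler than the rack-aware policy a real cluster uses.

```python
from itertools import cycle

BLOCK_SIZE = 128   # toy block size in bytes, chosen only to keep the example small
REPLICATION = 3    # matches the "3" in the metadata example above

# Hypothetical cluster layout: DataNode name -> rack.
DATANODES = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "dn4": "rack2"}

def place_file(path, size, namespace):
    """Record, NameNode-style, which DataNodes hold each block of a file."""
    num_blocks = -(-size // BLOCK_SIZE)  # ceiling division
    nodes = cycle(DATANODES)             # simple round-robin, not rack-aware
    blocks = []
    for i in range(num_blocks):
        replicas = [next(nodes) for _ in range(REPLICATION)]
        blocks.append({"block": f"{path}#blk{i}", "replicas": replicas})
    # The NameNode keeps only this metadata; block bytes live on the DataNodes.
    namespace[path] = {"replication": REPLICATION, "blocks": blocks}
    return blocks

namespace = {}
for block in place_file("/home/foo/data", 300, namespace):
    print(block["block"], block["replicas"])
```

A 300-byte file with a 128-byte block size needs three blocks, and each block is listed on three different DataNodes.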
HDFS File Read Operation
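[Figure: HDFS file read]
Client Node (Client JVM): HDFS Client, DistributedFileSystem, FSDataInputStream
Admin Node: NameNode
Slave Nodes: DataNode, DataNode, DataNode
1. Open (HDFS Client to DistributedFileSystem)
2. Get block locations (DistributedFileSystem to NameNode)
3. Read (HDFS Client to FSDataInputStream)
4. Read (FSDataInputStream to the first DataNode)
5. Read (FSDataInputStream to the next DataNode)
6. Close (HDFS Client to FSDataInputStream)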
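The six read steps can be walked through with a toy Python model. This is a simplified sketch, not the Hadoop client API: the classes `NameNode` and `DataNode`, the block ids and the sample bytes are all invented for illustration.

```python
class NameNode:
    """Holds only metadata: which DataNodes store each block of a file."""
    def __init__(self, block_map):
        self.block_map = block_map  # path -> list of (block_id, [datanode names])

    def get_block_locations(self, path):
        return self.block_map[path]

class DataNode:
    """Stores the actual block bytes."""
    def __init__(self, blocks):
        self.blocks = blocks  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(path, namenode, datanodes):
    # Steps 1-2: open the file and ask the NameNode for the block locations.
    locations = namenode.get_block_locations(path)
    data = b""
    # Steps 3-5: read each block, in order, from a DataNode that holds it.
    for block_id, holders in locations:
        data += datanodes[holders[0]].read_block(block_id)
    # Step 6: close (nothing to release in this toy model).
    return data

datanodes = {
    "dn1": DataNode({"blk0": b"hello "}),
    "dn2": DataNode({"blk1": b"hadoop"}),
}
namenode = NameNode({"/home/foo/data": [("blk0", ["dn1"]), ("blk1", ["dn2"])]})
print(read_file("/home/foo/data", namenode, datanodes))  # b'hello hadoop'
```

Note that the file bytes never pass through the NameNode; it only answers the location lookup.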
HDFS File Write Operation
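[Figure: HDFS file write]
Client: HDFS Client, DistributedFileSystem
Admin Node: NameNode
Slave Nodes: pipeline of three DataNodes, storing blocks
1. Create (HDFS Client to DistributedFileSystem)
2. Create (DistributedFileSystem to NameNode)
3. Write (HDFS Client to the output stream)
4. Write packet (forwarded down the pipeline of DataNodes)
5. Ack packet (returned back up the pipeline)
6. Close (HDFS Client)
7. Complete (DistributedFileSystem to NameNode)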
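Steps 4 and 5, the packet pipeline and its acknowledgements, can be sketched in a few lines of Python. This is a toy model under stated assumptions: the DataNode names and the 4-byte packet size are invented, and failure handling is left out.

```python
def write_packet(packet, pipeline, storage):
    """Send one packet down the DataNode pipeline; return True once all have it."""
    if not pipeline:
        return True  # past the last DataNode: the ack starts travelling back
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(packet)  # step 4: store, then forward
    return write_packet(packet, rest, storage)   # step 5: ack propagates upstream

def write_file(data, pipeline, packet_size=4):
    storage = {}  # DataNode name -> list of packets it has stored
    packets = [data[i:i + packet_size] for i in range(0, len(data), packet_size)]
    for packet in packets:
        if not write_packet(packet, pipeline, storage):
            raise IOError("packet was not acknowledged")
    # Steps 6-7 (close, then complete on the NameNode) would follow here.
    return storage

stored = write_file(b"hello hadoop", ["dn1", "dn2", "dn3"])
for name, packets in stored.items():
    print(name, b"".join(packets))
```

Each DataNode in the pipeline ends up with a full copy of the data, which is how the write achieves replication.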
Traditional Solution
[Figure: traditional solution]
Very Big Data -> Split Data (several splits) -> grep (one per split) -> matches (one set per split) -> cat -> All matches
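The split, grep and cat flow can be mimicked in plain Python. This is a minimal single-process sketch: the sample log lines, the pattern and the number of splits are all made up, and a real setup would run each grep on a separate machine.

```python
import re

def split_data(lines, num_splits):
    """Deal the input lines out into num_splits chunks."""
    return [lines[i::num_splits] for i in range(num_splits)]

def grep(pattern, chunk):
    """Keep only the lines of one chunk that match the pattern."""
    return [line for line in chunk if re.search(pattern, line)]

def cat(match_lists):
    """Concatenate the per-chunk matches into one list."""
    return [line for matches in match_lists for line in matches]

very_big_data = ["error: disk full", "ok", "error: timeout",
                 "ok", "warn", "error: oom"]
splits = split_data(very_big_data, 4)
all_matches = cat(grep(r"error", chunk) for chunk in splits)
print(all_matches)
```

The splitting, the distribution of work and the final merge all have to be scripted by hand here, which is the part the MapReduce framework takes over.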
MapReduce Solution
[Figure: MapReduce solution]
Very Big Input -> Split Data (several splits) -> MAP -> REDUCE -> All matches
MAP and REDUCE run inside the MapReduce Framework
Understanding MapReduce Paradigm
MapReduce Word Count Process Flow
Stages: Input -> Splitting -> Mapping -> Shuffling -> Reducing -> Final Result
Key-value types: K1,V1 -> List(K2,V2) -> K2,List(V2) -> List(K3,V3)
Input: Jack Bill Joe Don Don Joe Jack Don Bill
Splitting: Jack Bill Joe | Don Don Joe | Jack Don Bill
Mapping: Jack, 1 Bill, 1 Joe, 1 | Don, 1 Don, 1 Joe, 1 | Jack, 1 Don, 1 Bill, 1
Shuffling: Bill, (1,1) | Don, (1,1,1) | Jack, (1,1) | Joe, (1,1)
Reducing: Bill, 2 | Don, 3 | Jack, 2 | Joe, 2
Final Result: Bill, 2 Don, 3 Jack, 2 Joe, 2
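The same word count can be traced in plain Python, one function per stage. This is a single-process sketch of the paradigm, not Hadoop's MapReduce API; the function names are chosen for illustration.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def shuffle_phase(mapped_splits):
    """Shuffle: group the values by key across every mapper's output."""
    groups = defaultdict(list)
    for pairs in mapped_splits:
        for word, count in pairs:
            groups[word].append(count)
    return dict(sorted(groups.items()))  # sorted by key, as in the flow above

def reduce_phase(groups):
    """Reduce: sum the list of counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["Jack Bill Joe", "Don Don Joe", "Jack Don Bill"]
mapped = [map_phase(s) for s in splits]
shuffled = shuffle_phase(mapped)
result = reduce_phase(shuffled)
print(shuffled)  # {'Bill': [1, 1], 'Don': [1, 1, 1], 'Jack': [1, 1], 'Joe': [1, 1]}
print(result)    # {'Bill': 2, 'Don': 3, 'Jack': 2, 'Joe': 2}
```

In a real job the map calls run in parallel on different nodes, and the framework performs the shuffle between them and the reducers.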
What is Hive?
Hive is a data warehouse query tool built on top of HDFS and YARN
Provides HiveQL, which is very similar to SQL
Used for querying and analyzing large structured data sets
It is extensible by User Defined Functions (UDFs)
RDBMS Vs. Hive
RDBMS | Hive
Schema on write: the table schema is enforced at data load time, so data that does not conform to the schema is rejected | Schema on read: the schema is not verified when the data is loaded
Not very scalable; scaling up is costly | Easily scalable at low cost
Data can be read and written many times | Based on the Hadoop notion of write once, read many times
Record-level updates, insertions and deletes, transactions and indexes are possible | Record-level updates are not possible in Hive
Both OLTP (On-line Transaction Processing) and OLAP (On-line Analytical Processing) are supported | OLTP (On-line Transaction Processing) is not yet supported in Hive, but OLAP (On-line Analytical Processing) is
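The schema-on-write versus schema-on-read row of the comparison can be illustrated with a small Python sketch. It is a toy model: the two-column schema, the sample rows and the function names are hypothetical, and returning None for a field that does not fit stands in for the NULL a Hive query would show.

```python
SCHEMA = {"name": str, "age": int}  # hypothetical two-column table

def parse(row):
    """Apply SCHEMA to a raw CSV row; raise ValueError if it does not fit."""
    fields = row.split(",")
    if len(fields) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} fields: {row!r}")
    return {col: typ(val) for (col, typ), val in zip(SCHEMA.items(), fields)}

def load_schema_on_write(rows):
    """RDBMS style: every row is validated at load time; a bad row is rejected."""
    return [parse(row) for row in rows]

def load_schema_on_read(rows):
    """Hive style: raw rows are stored untouched; nothing is checked at load."""
    return list(rows)

def query_schema_on_read(stored):
    """The schema is applied only when reading; fields that do not fit become None."""
    result = []
    for row in stored:
        fields = row.split(",")
        record = {}
        for i, (col, typ) in enumerate(SCHEMA.items()):
            try:
                record[col] = typ(fields[i])
            except (IndexError, ValueError):
                record[col] = None
        result.append(record)
    return result

rows = ["Jack,30", "Don,unknown"]
stored = load_schema_on_read(rows)   # succeeds even with the bad row
print(query_schema_on_read(stored))
```

Calling `load_schema_on_write(rows)` on the same input raises a ValueError at load time, because "unknown" is not an integer.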
Time for Questions