Hadoop Week 2 PPT
DESCRIPTION
Hadoop Week 2 PPT Transcript
-
Course Topics
Week 1 Introduction to HDFS
Week 2 Setting Up Hadoop Cluster
Week 3 Map-Reduce Basics, Types and Formats
Week 4 PIG
Week 5 HIVE
Week 6 HBASE
Week 7 ZOOKEEPER
Week 8 SQOOP
-
Topics for Today
Revision
Hadoop Modes
Terminal Commands
Web UI URLs
Use Case in Healthcare
Sample Examples List in Hadoop
Running Teragen Example
Hadoop Configuration Files
Slaves & Masters
Name Node Recovery
Dump of MR jobs
Data Loading Techniques
-
Class 1 - Revision
HDFS: Hadoop Distributed File System (storage)
MapReduce (processing)
-
Let's Revise
1. What is HDFS?
2. What is the difference between a Hadoop database and a relational database?
3. What is a Namenode?
4. What is a Secondary Namenode?
5. Gen 1 vs. Gen 2 Hadoop.
-
Hadoop Modes
Hadoop can be run in one of three modes:
Standalone (or local) mode: no daemons; everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.
Pseudo-distributed mode: the Hadoop daemons run on the local machine.
Fully distributed mode: the Hadoop daemons run on a cluster of machines.
-
Terminal Commands
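The command screenshots from this slide are not in the transcript. A minimal sketch of the kind of HDFS terminal commands covered at this stage (paths and filenames are illustrative):

# list the contents of the HDFS root directory
hadoop fs -ls /
# create a directory in HDFS and copy a local file into it
hadoop fs -mkdir /user/training/input
hadoop fs -put localfile.txt /user/training/input/
# print a file stored in HDFS, then delete it
hadoop fs -cat /user/training/input/localfile.txt
hadoop fs -rm /user/training/input/localfile.txt
# report capacity and datanode status for the filesystem
hadoop dfsadmin -report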
-
Web UI URLs
NameNode status: http://localhost:50070/dfshealth.jsp
JobTracker status: http://localhost:50030/jobtracker.jsp
TaskTracker status: http://localhost:50060/tasktracker.jsp
DataBlock Scanner Report: http://localhost:50075/blockScannerReport
-
Sample Examples List
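The examples list itself was a screenshot in the original deck. Running the bundled examples jar with no program name prints the available examples; the jar path below is an assumption (it varies by install):

hadoop jar $HADOOP_HOME/hadoop-examples-*.jar
# prints the list of valid program names, including, e.g.:
#   teragen: Generate data for the terasort
#   terasort: Run the terasort
#   wordcount: A map/reduce program that counts the words in the input files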
-
Running the Teragen Example
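The invocation was shown as a screenshot; a sketch of a typical teragen run (the row count and output path are illustrative; teragen takes the number of 100-byte rows to generate and an HDFS output directory):

hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teragen 10000 /user/training/teragen-out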
-
Checking the Output
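The output screenshots are not in the transcript; checking the teragen output typically looks like this (paths match the sketch above):

# list the generated part files
hadoop fs -ls /user/training/teragen-out
# inspect the start of one part file
hadoop fs -cat /user/training/teragen-out/part-00000 | head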
-
Hadoop Configuration Files
-
Sample Cluster Configuration
Master: NameNode, JobTracker
Slave01: DataNode, TaskTracker
Slave02: DataNode, TaskTracker
Slave03: DataNode, TaskTracker
Slave04: DataNode, TaskTracker
Slave05: DataNode, TaskTracker
-
Hadoop Configuration Files
Filename                    Description
hadoop-env.sh               Environment variables that are used in the scripts that run Hadoop
core-site.xml               Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce
hdfs-site.xml               Configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes
mapred-site.xml             Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers
masters                     A list of machines (one per line) that each run a secondary namenode
slaves                      A list of machines (one per line) that each run a datanode and a tasktracker
hadoop-metrics.properties   Properties for controlling how metrics are published in Hadoop
log4j.properties            Properties for system log files, the namenode audit log, and the task log for the tasktracker child process
-
Configuration File for Each Component
Core: core-site.xml
HDFS: hdfs-site.xml
MapReduce: mapred-site.xml
-
core-site.xml and hdfs-site.xml
core-site.xml: fs.default.name = hdfs://localhost:8020/
hdfs-site.xml: dfs.replication = 1
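As XML, a minimal sketch of the two files for this pseudo-distributed setup, using the values above:

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>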
-
Defining HDFS details in hdfs-site.xml

Property: dfs.data.dir
Value: /disk1/hdfs/data,/disk2/hdfs/data
Description: A list of directories where the datanode stores blocks. Each block is stored in only one of these directories.
Default: ${hadoop.tmp.dir}/dfs/data

Property: fs.checkpoint.dir
Value: /disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary
Description: A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.
Default: ${hadoop.tmp.dir}/dfs/namesecondary
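A sketch of the corresponding hdfs-site.xml entries (directory paths taken from the table above):

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary</value>
  </property>
</configuration>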
-
mapred-site.xml
mapred.job.tracker = localhost:8021
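As XML, a minimal mapred-site.xml for the pseudo-distributed setup:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>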
-
All Properties
1. http://hadoop.apache.org/docs/r1.1.2/core-default.html
2. http://hadoop.apache.org/docs/r1.1.2/mapred-default.html
3. http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html
-
Defining mapred-site.xml

Property: mapred.job.tracker
Value: localhost:8021
Description: The hostname and the port that the jobtracker's RPC server runs on. If set to the default value of local, the jobtracker is run in-process, on demand, when you run a MapReduce job (you don't need to start the jobtracker in this case; in fact, you will get an error if you try to start it in this mode).

Property: mapred.local.dir
Value: ${hadoop.tmp.dir}/mapred/local
Description: A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.

Property: mapred.system.dir
Value: ${hadoop.tmp.dir}/mapred/system
Description: The directory, relative to fs.default.name, where shared files are stored during a job run.

Property: mapred.tasktracker.map.tasks.maximum
Value: 2
Description: The number of map tasks that may be run on a tasktracker at any one time.

Property: mapred.tasktracker.reduce.tasks.maximum
Value: 2
Description: The number of reduce tasks that may be run on a tasktracker at any one time.
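A sketch of setting the tasktracker slot counts from the table above in mapred-site.xml:

<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>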
-
Slaves and Masters
Two files are used by the startup and shutdown commands:
slaves: contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers
masters: contains a list of hosts, one per line, that are to host secondary NameNode servers
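A sketch of the two files for the sample cluster shown earlier, assuming the secondary namenode also runs on the master host (an assumption; hostnames are the illustrative ones from that slide):

# conf/masters
master

# conf/slaves
slave01
slave02
slave03
slave04
slave05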
-
Per-process runtime environment
hadoop-env.sh sets the JVM runtime environment for each Hadoop process.
This file also offers a way to provide custom parameters for each of the servers. hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in the conf/ directory of the installation.
Set the JAVA_HOME parameter here.
-
hadoop-env.sh
Examples of environment variables that you can specify:
export HADOOP_DATANODE_HEAPSIZE="128"
export HADOOP_TASKTRACKER_HEAPSIZE="512"
-
hadoop-env.sh Sample
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.6-sun

# Extra Java runtime options. Empty by default.
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true ${HADOOP_OPTS}"
..
..
..
# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER
-
Namenode Recovery
1. Shut down the secondary NameNode.
2. Copy the contents of the secondary's fs.checkpoint.dir to the namenode's dfs.name.dir.
3. Copy the contents of the secondary's fs.checkpoint.edits to the namenode's dfs.name.edits.dir.
4. When the copy completes, start the NameNode and restart the secondary NameNode.
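A sketch of the recovery steps in shell form, assuming the directory layout from the hdfs-site.xml slide (the actual paths come from your own configuration):

# 1. stop the secondary namenode
bin/hadoop-daemon.sh stop secondarynamenode
# 2. copy the checkpoint image into the namenode's dfs.name.dir
cp -r /disk1/hdfs/namesecondary/* /disk1/hdfs/name/
# 3. likewise copy fs.checkpoint.edits into dfs.name.edits.dir, if configured separately
# 4. start the namenode, then restart the secondary namenode
bin/hadoop-daemon.sh start namenode
bin/hadoop-daemon.sh start secondarynamenode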
-
Reporting
hadoop-metrics.properties controls the reporting of metrics.
The default is not to report.
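A sketch of hadoop-metrics.properties entries for the Hadoop 1.x metrics framework (the output path is illustrative):

# default: do not report
# dfs.class=org.apache.hadoop.metrics.spi.NullContext
# report DFS metrics to a local file every 10 seconds
dfs.class=org.apache.hadoop.metrics.file.FileContext
dfs.period=10
dfs.fileName=/tmp/dfsmetrics.log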
-
Dump of a MR Job
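The job dump was shown as a screenshot; the same information can be pulled from the terminal (the job ID is illustrative):

# list all jobs, running and completed
hadoop job -list all
# print the status and counters of one job
hadoop job -status job_201301010000_0001
# dump the history of a completed job from its output directory
hadoop job -history /user/training/teragen-out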
-
Data Loading Techniques
Data can be loaded into HDFS:
Using Hadoop copy commands
Using Flume
Using Sqoop
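A sketch of the Hadoop copy commands (local and HDFS paths are illustrative):

# copy a local file into HDFS
hadoop fs -put /local/data/events.log /user/training/raw/
# -copyFromLocal is equivalent for local sources
hadoop fs -copyFromLocal /local/data/events.log /user/training/raw/
# copy data back out of HDFS
hadoop fs -get /user/training/raw/events.log /local/backup/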
-
FLUME
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.
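A minimal sketch of a Flume NG agent definition that tails a log file into HDFS; all names, paths, and the source command are illustrative:

# flume.conf
agent.sources = src1
agent.channels = ch1
agent.sinks = sink1
agent.sources.src1.type = exec
agent.sources.src1.command = tail -F /var/log/syslog
agent.sources.src1.channels = ch1
agent.channels.ch1.type = memory
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.path = hdfs://localhost:8020/user/training/flume
agent.sinks.sink1.channel = ch1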
-
SQOOP
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
http://hadoop.apache.org/
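A sketch of a Sqoop import from a relational table into HDFS; the connection string, table, credentials, and target directory are illustrative:

sqoop import \
  --connect jdbc:mysql://localhost/healthcare_db \
  --table patients \
  --username training --password training \
  --target-dir /user/training/patients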
-
Assignment for this Week
Attempt the following assignment using the document present in the LMS under the Week 2 tab: Flume Set-up on Cloudera.
Refresh your Java skills using the Java for Hadoop tutorial on the LMS.
-
Ask your doubts
Q & A..?
-
Thank You! See You in Class Next Week.