Hadoop Week 2 PPT
DESCRIPTION
Hadoop Week 2 PPT Transcript
-
Course Topics
Week 1 Introduction to HDFS
Week 2 Setting Up Hadoop Cluster
Week 3 Map-Reduce Basics, Types and Formats
Week 4 PIG
Week 5 HIVE
Week 6 HBASE
Week 7 ZOOKEEPER
Week 8 SQOOP
-
Topics for Today
Revision
Hadoop Modes
Terminal Commands
Web UI URLs
Use Case in Healthcare
Sample Examples List in Hadoop
Running Teragen Example
Hadoop Configuration Files
Slaves & Masters
Name Node Recovery
Dump of MR jobs
Data Loading Techniques
-
Class 1 - Revision
HDFS: Hadoop Distributed File System (storage)
MapReduce (processing)
-
Let's Revise
1. What is HDFS?
2. What is the difference between a Hadoop database and a relational database?
3. What is a Namenode?
4. What is a Secondary Namenode?
5. Gen 1 vs. Gen 2 Hadoop.
-
Hadoop Modes
Hadoop can be run in one of three modes:
Standalone (or local) mode: no daemons; everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.
Pseudo-distributed mode: the Hadoop daemons run on the local machine.
Fully distributed mode: the Hadoop daemons run on a cluster of machines.
-
Terminal Commands
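The command screenshots from this slide are not in the transcript. A minimal sketch of the kind of HDFS terminal commands covered at this stage (paths and filenames are illustrative):

# list the contents of the HDFS root directory
hadoop fs -ls /
# create a directory in HDFS and copy a local file into it
hadoop fs -mkdir /user/training/input
hadoop fs -put localfile.txt /user/training/input/
# print a file stored in HDFS, then delete it
hadoop fs -cat /user/training/input/localfile.txt
hadoop fs -rm /user/training/input/localfile.txt
# report capacity and datanode status for the filesystem
hadoop dfsadmin -report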
-
Web UI URLs
NameNode status: http://localhost:50070/dfshealth.jsp
JobTracker status: http://localhost:50030/jobtracker.jsp
TaskTracker status: http://localhost:50060/tasktracker.jsp
DataBlock Scanner Report: http://localhost:50075/blockScannerReport
-
Sample Examples List
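The examples list itself was a screenshot in the original deck. Running the bundled examples jar with no program name prints the available examples; the jar path below is an assumption (it varies by install):

hadoop jar $HADOOP_HOME/hadoop-examples-*.jar
# prints the list of valid program names, including, e.g.:
#   teragen: Generate data for the terasort
#   terasort: Run the terasort
#   wordcount: A map/reduce program that counts the words in the input files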
-
Running the Teragen Example
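The invocation was shown as a screenshot; a sketch of a typical teragen run (the row count and output path are illustrative; teragen takes the number of 100-byte rows to generate and an HDFS output directory):

hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teragen 10000 /user/training/teragen-out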
-
Checking the Output
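The output screenshots are not in the transcript; checking the teragen output typically looks like this (paths match the sketch above):

# list the generated part files
hadoop fs -ls /user/training/teragen-out
# inspect the start of one part file
hadoop fs -cat /user/training/teragen-out/part-00000 | head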
-
Hadoop Configuration Files
-
Sample Cluster Configuration
Master: NameNode, JobTracker
Slave01: DataNode, TaskTracker
Slave02: DataNode, TaskTracker
Slave03: DataNode, TaskTracker
Slave04: DataNode, TaskTracker
Slave05: DataNode, TaskTracker
-
Hadoop Configuration Files
Filename                    Description
hadoop-env.sh               Environment variables that are used in the scripts that run Hadoop
core-site.xml               Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce
hdfs-site.xml               Configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes
mapred-site.xml             Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers
masters                     A list of machines (one per line) that each run a secondary namenode
slaves                      A list of machines (one per line) that each run a datanode and a tasktracker
hadoop-metrics.properties   Properties for controlling how metrics are published in Hadoop
log4j.properties            Properties for system log files, the namenode audit log, and the task log for the tasktracker child process
-
Configuration File for Each Component
Core: core-site.xml
HDFS: hdfs-site.xml
MapReduce: mapred-site.xml
-
core-site.xml and hdfs-site.xml
core-site.xml: fs.default.name = hdfs://localhost:8020/
hdfs-site.xml: dfs.replication = 1
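As XML, a minimal sketch of the two files for this pseudo-distributed setup, using the values above:

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>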
-
Defining HDFS details in hdfs-site.xml

Property: dfs.data.dir
Value: /disk1/hdfs/data,/disk2/hdfs/data
Description: A list of directories where the datanode stores blocks. Each block is stored in only one of these directories.
Default: ${hadoop.tmp.dir}/dfs/data

Property: fs.checkpoint.dir
Value: /disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary
Description: A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.
Default: ${hadoop.tmp.dir}/dfs/namesecondary
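A sketch of the corresponding hdfs-site.xml entries (directory paths taken from the table above):

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary</value>
  </property>
</configuration>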
-
mapred-site.xml
mapred.job.tracker = localhost:8021
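As XML, a minimal mapred-site.xml for the pseudo-distributed setup:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>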
-
All Properties
1. http://hadoop.apache.org/docs/r1.1.2/core-default.html
2. http://hadoop.apache.org/docs/r1.1.2/mapred-default.html
3. http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html
-
Defining mapred-site.xml

Property: mapred.job.tracker
Value: localhost:8021
Description: The hostname and the port that the jobtracker's RPC server runs on. If set to the default value of local, the jobtracker is run in-process, on demand, when you run a MapReduce job (you don't need to start the jobtracker in this case; in fact, you will get an error if you try to start it in this mode).

Property: mapred.local.dir
Value: ${hadoop.tmp.dir}/mapred/local
Description: A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.

Property: mapred.system.dir
Value: ${hadoop.tmp.dir}/mapred/system
Description: The directory, relative to fs.default.name, where shared files are stored during a job run.

Property: mapred.tasktracker.map.tasks.maximum
Value: 2
Description: The number of map tasks that may be run on a tasktracker at any one time.

Property: mapred.tasktracker.reduce.tasks.maximum
Value: 2
Description: The number of reduce tasks that may be run on a tasktracker at any one time.
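A sketch of setting the tasktracker slot counts from the table above in mapred-site.xml:

<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>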
-
Slaves and Masters
Two files are used by the startup and shutdown commands:
slaves: contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers
masters: contains a list of hosts, one per line, that are to host secondary NameNode servers
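A sketch of the two files for the sample cluster shown earlier, assuming the secondary namenode also runs on the master host (an assumption; hostnames are the illustrative ones from that slide):

# conf/masters
master

# conf/slaves
slave01
slave02
slave03
slave04
slave05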
-
Per-process runtime environment
hadoop-env.sh sets the JVM runtime environment for each Hadoop process.
This file also offers a way to provide custom parameters for each of the servers. hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in the conf/ directory of the installation.
Set the JAVA_HOME parameter here.
-
hadoop-env.sh
Examples of environment variables that you can specify:
export HADOOP_DATANODE_HEAPSIZE="128"
export HADOOP_TASKTRACKER_HEAPSIZE="512"
-
hadoop-env.sh Sample
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.6-sun

# Extra Java runtime options. Empty by default.
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true ${HADOOP_OPTS}"
..
..
..
# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER
-
Namenode Recovery
1. Shut down the secondary NameNode.
2. Copy the contents of the secondary's fs.checkpoint.dir to the namenode's dfs.name.dir.
3. Copy the contents of the secondary's fs.checkpoint.edits to the namenode's dfs.name.edits.dir.
4. When the copy completes, start the NameNode and restart the secondary NameNode.
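A sketch of the recovery steps in shell form, assuming the directory layout from the hdfs-site.xml slide (the actual paths come from your own configuration):

# 1. stop the secondary namenode
bin/hadoop-daemon.sh stop secondarynamenode
# 2. copy the checkpoint image into the namenode's dfs.name.dir
cp -r /disk1/hdfs/namesecondary/* /disk1/hdfs/name/
# 3. likewise copy fs.checkpoint.edits into dfs.name.edits.dir, if configured separately
# 4. start the namenode, then restart the secondary namenode
bin/hadoop-daemon.sh start namenode
bin/hadoop-daemon.sh start secondarynamenode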
-
Reporting
hadoop-metrics.properties controls the reporting of metrics.
The default is not to report.
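A sketch of hadoop-metrics.properties entries for the Hadoop 1.x metrics framework (the output path is illustrative):

# default: do not report
# dfs.class=org.apache.hadoop.metrics.spi.NullContext
# report DFS metrics to a local file every 10 seconds
dfs.class=org.apache.hadoop.metrics.file.FileContext
dfs.period=10
dfs.fileName=/tmp/dfsmetrics.log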
-
Dump of a MR Job
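The job dump was shown as a screenshot; the same information can be pulled from the terminal (the job ID is illustrative):

# list all jobs, running and completed
hadoop job -list all
# print the status and counters of one job
hadoop job -status job_201301010000_0001
# dump the history of a completed job from its output directory
hadoop job -history /user/training/teragen-out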
-
Data Loading Techniques
Data can be loaded into HDFS:
Using Hadoop copy commands
Using Flume
Using Sqoop
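A sketch of the Hadoop copy commands (local and HDFS paths are illustrative):

# copy a local file into HDFS
hadoop fs -put /local/data/events.log /user/training/raw/
# -copyFromLocal is equivalent for local sources
hadoop fs -copyFromLocal /local/data/events.log /user/training/raw/
# copy data back out of HDFS
hadoop fs -get /user/training/raw/events.log /local/backup/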
-
FLUME
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.
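A minimal sketch of a Flume NG agent definition that tails a log file into HDFS; all names, paths, and the source command are illustrative:

# flume.conf
agent.sources = src1
agent.channels = ch1
agent.sinks = sink1
agent.sources.src1.type = exec
agent.sources.src1.command = tail -F /var/log/syslog
agent.sources.src1.channels = ch1
agent.channels.ch1.type = memory
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.path = hdfs://localhost:8020/user/training/flume
agent.sinks.sink1.channel = ch1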
-
SQOOP
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
http://hadoop.apache.org/
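A sketch of a Sqoop import from a relational table into HDFS; the connection string, table, credentials, and target directory are illustrative:

sqoop import \
  --connect jdbc:mysql://localhost/healthcare_db \
  --table patients \
  --username training --password training \
  --target-dir /user/training/patients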
-
Assignment for this Week
Attempt the following assignment using the document present in the LMS under the Week 2 tab: Flume Set-up on Cloudera.
Refresh your Java skills using the Java for Hadoop tutorial on the LMS.
-
Ask your doubts
Q & A..?
-
Thank You! See You in Class Next Week.