TRANSCRIPT
MapReduce Programming
Yue-Shan Chang
[Figure: MapReduce execution overview. (1) The user program forks the master and the workers. (2) The master assigns map tasks and reduce tasks to workers. (3) Map workers read the input splits (split 0 – split 4) and (4) write intermediate files to local disk. (5) Reduce workers remotely read the intermediate data and (6) write the output files (output file 0, output file 1). Input files → map phase → intermediate files (on local disk) → reduce phase → output files.]
MapReduce Program Structure
Class MapReduceClass
    Class Mapper { ... Map code ... }
    Class Reducer { ... Reduce code ... }
    Main()                                   // main-program configuration area
        JobConf conf = new JobConf("MRClass");
        // other configuration-parameter code
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Emit <word, 1> for every token on the line.
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Sum the counts for each word.
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
MapReduce Job
Handled parts
Configuration of a Job
• JobConf object
  – JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution.
  – JobConf typically specifies the Mapper, combiner (if any), Partitioner, Reducer, InputFormat, and OutputFormat implementations to be used.
  – Indicates the set of input files (setInputPaths(JobConf, Path...), addInputPath(JobConf, Path), setInputPaths(JobConf, String), addInputPaths(JobConf, String)) and where the output files should be written (setOutputPath(Path)).
Configuration of a Job
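A minimal configuration sketch using the old org.apache.hadoop.mapred API; the job class MyJob, the MyMapper/MyReducer classes, and the paths are hypothetical placeholders, not part of the original example:

JobConf conf = new JobConf(MyJob.class);
conf.setJobName("myjob");
// Mapper and Reducer implementations to be used
conf.setMapperClass(MyMapper.class);
conf.setReducerClass(MyReducer.class);
// Input and output format implementations
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
// Input can be given as Path objects or as comma-separated String paths
FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));
FileInputFormat.addInputPath(conf, new Path("/user/demo/more-input"));
// Where the output files should be written
FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output"));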
Input Splitting
• An input split will normally be a contiguous group of records from a single input file.
  – If the number of requested map tasks is larger than the number of files, or the individual files are larger than the suggested fragment size, there may be multiple input splits constructed from each input file.
• The user has considerable control over the number of input splits.
Specifying Input Formats
• The Hadoop framework provides a large variety of input formats:
  – KeyValueTextInputFormat: key/value pairs, one per line.
  – TextInputFormat: the key is the position (byte offset) of the line in the file, and the value is the line itself.
  – NLineInputFormat: similar to KeyValueTextInputFormat, but the splits are based on N lines of input rather than Y bytes of input.
  – MultiFileInputFormat: an abstract class that lets the user implement an input format that aggregates multiple files into one split.
  – SequenceFileInputFormat: the input file is a Hadoop sequence file containing serialized key/value pairs.
Specifying Input Formats
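A short sketch of selecting an input format on the JobConf from the WordCount example above; the choice of format and the line count for NLineInputFormat are assumptions for illustration:

// Key/value pairs, one per line (key = text before the first tab, value = rest of the line)
conf.setInputFormat(KeyValueTextInputFormat.class);

// Or: split on N lines of input rather than on byte size
conf.setInputFormat(NLineInputFormat.class);
conf.setInt("mapred.line.input.format.linespermap", 100);   // assumed N = 100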
Setting the Output Parameters
• The framework requires that the output parameters be configured, even if the job will not produce any output.
• The framework will collect the output from the specified tasks and place it into the configured output directory.
Setting the Output Parameters
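A short sketch of the required output parameters, again on a JobConf named conf; the types and the output path are illustrative assumptions:

conf.setOutputFormat(TextOutputFormat.class);      // how the output files are written
conf.setOutputKeyClass(Text.class);                // type of the output keys
conf.setOutputValueClass(IntWritable.class);       // type of the output values
FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output"));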
A Simple Map Function: IdentityMapper
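The slide refers to org.apache.hadoop.mapred.lib.IdentityMapper; a minimal sketch of what such an identity mapper looks like in the old API (not necessarily the verbatim library source):

public class IdentityMapper<K, V> extends MapReduceBase
        implements Mapper<K, V, K, V> {
    // Pass each input key/value pair straight through to the output, unchanged.
    public void map(K key, V val, OutputCollector<K, V> output, Reporter reporter)
            throws IOException {
        output.collect(key, val);
    }
}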
A Simple Reduce Function: IdentityReducer
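Similarly, a minimal sketch in the style of org.apache.hadoop.mapred.lib.IdentityReducer (not necessarily the verbatim library source):

public class IdentityReducer<K, V> extends MapReduceBase
        implements Reducer<K, V, K, V> {
    // Write every value for a key straight through to the output, unchanged.
    public void reduce(K key, Iterator<V> values, OutputCollector<K, V> output,
            Reporter reporter) throws IOException {
        while (values.hasNext()) {
            output.collect(key, values.next());
        }
    }
}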
Configuring the Reduce Phase
• The user must supply the framework with five pieces of information (sketched below):
  – The number of reduce tasks; if zero, no reduce phase is run.
  – The class supplying the reduce method.
  – The input key and value types for the reduce task; by default, the same as the reduce output.
  – The output key and value types for the reduce task.
  – The output file type for the reduce task output.
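A sketch of supplying those five pieces of information on a JobConf named conf; the reducer class, the type choices, and the output format are assumptions for illustration:

conf.setNumReduceTasks(4);                        // number of reduce tasks (0 = no reduce phase)
conf.setReducerClass(MyReducer.class);            // class supplying the reduce method
conf.setMapOutputKeyClass(Text.class);            // reduce input key type, if it differs from the output
conf.setMapOutputValueClass(IntWritable.class);   // reduce input value type
conf.setOutputKeyClass(Text.class);               // reduce output key type
conf.setOutputValueClass(IntWritable.class);      // reduce output value type
conf.setOutputFormat(TextOutputFormat.class);     // output file type for the reduce task output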
How Many Maps
• The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
• The right level of parallelism for maps seems to be around 10–100 maps per node.
• It is best if the maps take at least a minute to execute.
• setNumMapTasks(int) — see the sketch below.
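A rough sketch of the relationship between input size and map count; the sizes are assumed values, and setNumMapTasks() is only a hint to the framework:

long totalInputBytes = 10L * 1024 * 1024 * 1024;          // assume ~10 GB of input
long blockSize = 64L * 1024 * 1024;                       // assume 64 MB blocks
int expectedMaps = (int) (totalInputBytes / blockSize);   // roughly one map per block (~160)
conf.setNumMapTasks(expectedMaps);                        // a hint; the computed splits still decide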
Reducer
• Reducer reduces a set of intermediate values which share a key to a smaller set of values.
• Reducer has 3 primary phases: shuffle, sort, and reduce.
• Shuffle
  – Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP.
• Sort
  – The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.
  – The shuffle and sort phases occur simultaneously; while map outputs are being fetched, they are merged.
How Many Reduces
• The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> × mapred.tasktracker.reduce.tasks.maximum).
• With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish.
• With 1.75, the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing (see the sketch below).
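A sketch of applying the 0.95 / 1.75 rule of thumb; the node count is an assumed value, and the per-node reduce slot count is read from the cluster configuration:

int nodes = 10;                                   // assumed number of TaskTracker nodes
int reduceSlotsPerNode =
        conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
int numReduces = (int) (0.95 * nodes * reduceSlotsPerNode);
conf.setNumReduceTasks(numReduces);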
How Many Reduces
• Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.
• Reducer NONE
  – It is legal to set the number of reduce tasks to zero if no reduction is desired.
  – In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by setOutputPath(Path).
  – The framework does not sort the map outputs before writing them out to the FileSystem.
Reporter
• Reporter is a facility for MapReduce applications to report progress, set application-level status messages, and update Counters.
• Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive, as in the sketch below.
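A sketch of a map method that uses the Reporter; the counter group and counter name are hypothetical:

public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    reporter.setStatus("processing record at offset " + key);   // application-level status message
    reporter.incrCounter("MyApp", "RecordsSeen", 1);             // update a custom Counter
    reporter.progress();                                         // just indicate that the task is alive
    // ... normal map logic ...
}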
JobTracker
• JobTracker is the central location for submitting and tracking MR jobs in a network environment.
• JobClient is the primary interface by which a user job interacts with the JobTracker.
  – It provides facilities to submit jobs, track their progress, access component-task reports and logs, get the MapReduce cluster's status information, and so on (see the sketch below).
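A sketch of submitting and monitoring a job through JobClient: JobClient.runJob(conf), used in the WordCount example, blocks until the job finishes, while submitJob() returns a RunningJob handle that can be polled:

JobClient client = new JobClient(conf);
RunningJob job = client.submitJob(conf);          // non-blocking submission
while (!job.isComplete()) {
    System.out.printf("map %.0f%%  reduce %.0f%%%n",
            job.mapProgress() * 100, job.reduceProgress() * 100);
    Thread.sleep(5000);                           // poll every five seconds
}
System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");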
Job Submission and Monitoring
• The job submission process involves:
  – Checking the input and output specifications of the job.
  – Computing the InputSplit values for the job.
  – Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
  – Copying the job's JAR and configuration to the MapReduce system directory on the FileSystem.
  – Submitting the job to the JobTracker and optionally monitoring its status.
MapReduce Details for Multimachine Clusters
Introduction
• Why?
  – Datasets that can't fit on a single machine.
  – Time constraints that are impossible to satisfy with a small number of machines.
  – The need to rapidly scale the computing power applied to a problem due to varying input set sizes.
Requirements for Successful MapReduce Jobs
• Mapper
  – Ingest the input and process the input records, sending forward the records that can be passed to the reduce task or to the final output directly.
• Reducer
  – Accept the key and value groups that passed through the mapper and generate the final output.
• The job must be configured with the location and type of the input data, the mapper class to use, the number of reduce tasks required, and the reducer class and I/O types.
Requirements for Successful MapReduce Jobs
• The TaskTracker service will actually run your map and reduce tasks, and the JobTracker service will distribute the tasks and their input splits to the various trackers.
• The cluster must be configured with the nodes that will run the TaskTrackers, and with the number of TaskTrackers to run per node.
Requirements for Successful MapReduce Jobs
• There are three levels of configuration to address when configuring MapReduce on your cluster:
  – configure the machines,
  – configure the Hadoop MapReduce framework, and
  – configure the jobs themselves.
Launching MapReduce Jobs
• Launch the preceding example from the command line:
  > bin/hadoop [-libjars jar1.jar,jar2.jar,jar3.jar] jar myjar.jar MyClass
MapReduce-Specific Configuration for Each Machine in a Cluster
• Install any standard JARs that your application uses.
• It is probable that your applications will have a runtime environment that is deployed from a configuration management application, which you will also need to deploy to each machine.
• The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks.
• The conf/slaves file should have the set of machines to serve as TaskTracker nodes.
DistributedCache
• Distributes application-specific, large, read-only files efficiently.
• A facility provided by the MapReduce framework to cache files (text, archives, JARs, and so on) needed by applications.
• The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node (see the sketch below).
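A sketch of caching a read-only file with the DistributedCache; the HDFS path is hypothetical:

// At job-configuration time: register a file that is already on the FileSystem
DistributedCache.addCacheFile(new URI("/user/demo/lookup.dat"), conf);

// Inside a task (for example, in Mapper.configure(JobConf)): locate the local copies
Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);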
Adding Resources to the Task Classpath
• Methods (see the sketch below):
  – JobConf.setJar(String jar): sets the user JAR for the MapReduce job.
  – JobConf.setJarByClass(Class cls): determines the JAR that contains the class cls and calls JobConf.setJar(jar) with that JAR.
  – DistributedCache.addArchiveToClassPath(Path archive, Configuration conf): adds an archive path to the current set of classpath entries.
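A sketch combining these calls; the JAR and archive paths are hypothetical:

conf.setJarByClass(WordCount.class);              // ship the JAR that contains WordCount
DistributedCache.addFileToClassPath(new Path("/user/demo/lib/helper.jar"), conf);
DistributedCache.addArchiveToClassPath(new Path("/user/demo/lib/deps.zip"), conf);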
Configuring the Hadoop Core Cluster Information
• Setting the Default File System URI
• You can also use the JobConf object to set the default file system:
  – conf.set("fs.default.name", "hdfs://NamenodeHostname:PORT");
Configuring the Hadoop Core Cluster Information
• Setting the JobTracker Location
• Use the JobConf object to set the JobTracker information:
  – conf.set("mapred.job.tracker", "JobtrackerHostname:PORT");
split 0split 1split 2split 3split 4
worker
worker
worker
worker
worker
Master
UserProgram
outputfile 0
outputfile 1
(1) fork (1) fork (1) fork
(2) assign map(2) assign reduce
(3) read(4) local write
(5) remote read(6) write
Inputfiles
Mapphase
Intermediate files(on local disk)
Reducephase
Outputfiles
MapReduce Program Structure
Class MapReduceClass Mapper hellip Map程式碼Class Reduer hellip Reduce程式碼Main() 主程式設定區JobConf Conf=new JobConf(ldquoMRClassrdquo)其他設定參數程式碼
package orgmyorg import javaioIOException import javautil import orgapachehadoopfsPath import orgapachehadoopconf import orgapachehadoopio import orgapachehadoopmapred import orgapachehadooputil public class WordCount public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1) private Text word = new Text() public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException String line = valuetoString() StringTokenizer tokenizer = new StringTokenizer(line) while (tokenizerhasMoreTokens()) wordset(tokenizernextToken()) outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritablegt public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException int sum = 0 while (valueshasNext()) sum += valuesnext()get() outputcollect(key new IntWritable(sum)) public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass) confsetJobName(wordcount) confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass) confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass) confsetReducerClass(Reduceclass) confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass) FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1])) JobClientrunJob(conf)
MapReduce Job
Handled parts
Configuration of a Jobbull JobConf objectndash JobConf is the primary interface for a user to describe
a map-reduce job to the Hadoop framework for execution
ndash JobConf typically specifies the Mapper combiner (if any) Partitioner Reducer InputFormat and OutputFormat implementations to be used
ndash Indicates the set of input files (setInputPaths(JobConf Path) addInputPath(JobConf Path)) and (setInputPaths(JobConf String) addInputPaths(JobConf String)) and where the output files should be written (setOutputPath(Path))
Configuration of a Job
Input Splitting
bull An input split will normally be a contiguous group of records from a single input filendash If the number of requested map tasks is larger
than number of files ndash the individual files are larger than the suggested
fragment size there may be multiple input splits constructed of each input file
bull The user has considerable control over the number of input splits
Specifying Input Formatsbull The Hadoop framework provides a large variety of
input formatsndash KeyValueTextInputFormat Keyvalue pairs one per linendash TextInputFormant The key is the line number and the
value is the linendash NLineInputFormat Similar to KeyValueTextInputFormat
but the splits are based on N lines of input rather than Y bytes of input
ndash MultiFileInputFormat An abstract class that lets the user implement an input format that aggregates multiple files into one split
ndash SequenceFIleInputFormat The input file is a Hadoop sequence file containing serialized keyvalue pairs
Specifying Input Formats
Setting the Output Parameters
bull The framework requires that the output parameters be configured even if the job will not produce any output
bull The framework will collect the output from the specified tasks and place them into the configured output directory
Setting the Output Parameters
A Simple Map Function IdentityMapper
A Simple Reduce Function IdentityReducer
A Simple Reduce Function IdentityReducer
Configuring the Reduce Phase
bull the user must supply the framework with five pieces of informationndash The number of reduce tasks if zero no reduce
phase is runndash The class supplying the reduce methodndash The input key and value types for the reduce task
by default the same as the reduce outputndash The output key and value types for the reduce
taskndash The output file type for the reduce task output
How Many Maps
bull The number of maps is usually driven by the total size of the inputs that is the total number of blocks of the input files
bull The right level of parallelism for maps seems to be around 10-100 maps per-node
bull it is best if the maps take at least a minute to execute
bull setNumMapTasks(int)
Reducerbull Reducer reduces a set of intermediate values which
share a key to a smaller set of valuesbull Reducer has 3 primary phases shuffle sort and
reducebull Shufflendash Input to the Reducer is the sorted output of the mappers
In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP
bull Sortndash The framework groups Reducer inputs by keys (since
different mappers may have output the same key) in this stage
ndash The shuffle and sort phases occur simultaneously while map-outputs are being fetched they are merged
How Many Reduces
bull The right number of reduces seems to be 095 or 175 multiplied by (ltno of nodesgt mapredtasktrackerreducetasksmaximum)
bull With 095 all of the reduces can launch immediately and start transferring map outputs as the maps finish
bull With 175 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing
How Many Reducesbull Increasing the number of reduces increases the
framework overhead but increases load balancing and lowers the cost of failures
bull Reducer NONEndash It is legal to set the number of reduce-tasks to zero if
no reduction is desiredndash In this case the outputs of the map-tasks go directly
to the FileSystem into the output path set by setOutputPath(Path)
ndash The framework does not sort the map-outputs before writing them out to the FileSystem
Reporter
bull Reporter is a facility for MapReduce applications to report progress set application-level status messages and update Counters
bull Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive
JobTracker
bull JobTracker is the central location for submitting and tracking MR jobs in a network environment
bull JobClient is the primary interface by which user-job interacts with the JobTrackerndash provides facilities to submit jobs track their
progress access component-tasks reports and logs get the MapReduce clusters status information and so on
Job Submission and Monitoring
bull The job submission process involvesndash Checking the input and output specifications of
the jobndash Computing the InputSplit values for the jobndash Setting up the requisite accounting information
for the DistributedCache of the job if necessary ndash Copying the jobs jar and configuration to the
MapReduce system directory on the FileSystem ndash Submitting the job to the JobTracker and
optionally monitoring its status
MapReduce Details forMultimachine Clusters
Introduction
bull Whyndash datasets that canrsquot fit on a single machine ndash have time constraints that are impossible to
satisfy with a small number of machines ndash need to rapidly scale the computing power
applied to a problem due to varying input set sizes
Requirements for Successful MapReduce Jobs
bull Mapperndash ingest the input and process the input record sending
forward the records that can be passed to the reduce task or to the final output directly
bull Reducerndash Accept the key and value groups that passed through the
mapper and generate the final output
bull job must be configured with the location and type of the input data the mapper class to use the number of reduce tasks required and the reducer class and IO types
Requirements for Successful MapReduce Jobs
bull The TaskTracker service will actually run your map and reduce tasks and the JobTracker service will distribute the tasks and their input split to the various trackers
bull The cluster must be configured with the nodes that will run the TaskTrackers and with the number of TaskTrackers to run per node
Requirements for Successful MapReduce Jobs
bull Three levels of configuration to address to configure MapReduce on your clusterndash configure the machines ndash the Hadoop MapReduce framework ndash the jobs themselves
Launching MapReduce Jobs
bull launch the preceding example from the command linegt binhadoop [-libjars jar1jarjar2jarjar3jar] jar myjarjar MyClass
MapReduce-Specific Configuration for Each Machine in a Cluster
bull install any standard JARs that your application usesbull It is probable that your applications will have a
runtime environment that is deployed from a configuration management application which you will also need to deploy to each machine
bull The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks
bull The confslaves file should have the set of machines to serve as TaskTracker nodes
DistributedCache
bull distributes application-specific large read-only files efficiently
bull a facility provided by the MapReduce framework to cache files (text archives jars and so on) needed by applications
bull The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node
Adding Resources to the Task Classpath
bull Methodsndash JobConfsetJar(String jar) Sets the user JAR for the
MapReduce jobndash JobConfsetJarByClass(Class cls) Determines the
JAR that contains the class cls and calls JobConfsetJar(jar) with that JAR
ndash DistributedCacheaddArchiveToClassPath(Path archive Configuration conf) Adds an archive path to the current set of classpath entries
Configuring the Hadoop Core Cluster Information
bull Setting the Default File System URI
bull You can also use the JobConf object to set the default file systemndash confset( fsdefaultname
hdfsNamenodeHostnamePORT)
Configuring the Hadoop Core Cluster Information
bull Setting the JobTracker Location
bull use the JobConf object to set the JobTracker informationndash confset( mapredjobtracker
JobtrackerHostnamePORT)
- MapReduce Programming
- Slide 2
- MapReduce Program Structure
- Slide 4
- Slide 5
- MapReduce Job
- Handled parts
- Configuration of a Job
- Slide 9
- Input Splitting
- Specifying Input Formats
- Slide 12
- Setting the Output Parameters
- Slide 14
- A Simple Map Function IdentityMapper
- A Simple Reduce Function IdentityReducer
- Slide 17
- Configuring the Reduce Phase
- How Many Maps
- Reducer
- How Many Reduces
- Slide 22
- Reporter
- JobTracker
- Job Submission and Monitoring
- MapReduce Details for Multimachine Clusters
- Introduction
- Requirements for Successful MapReduce Jobs
- Slide 29
- Slide 30
- Launching MapReduce Jobs
- Slide 32
- MapReduce-Specific Configuration for Each Machine in a Cluster
- DistributedCache
- Adding Resources to the Task Classpath
- Configuring the Hadoop Core Cluster Information
- Slide 37
-
MapReduce Program Structure
Class MapReduceClass Mapper hellip Map程式碼Class Reduer hellip Reduce程式碼Main() 主程式設定區JobConf Conf=new JobConf(ldquoMRClassrdquo)其他設定參數程式碼
package orgmyorg import javaioIOException import javautil import orgapachehadoopfsPath import orgapachehadoopconf import orgapachehadoopio import orgapachehadoopmapred import orgapachehadooputil public class WordCount public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1) private Text word = new Text() public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException String line = valuetoString() StringTokenizer tokenizer = new StringTokenizer(line) while (tokenizerhasMoreTokens()) wordset(tokenizernextToken()) outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritablegt public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException int sum = 0 while (valueshasNext()) sum += valuesnext()get() outputcollect(key new IntWritable(sum)) public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass) confsetJobName(wordcount) confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass) confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass) confsetReducerClass(Reduceclass) confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass) FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1])) JobClientrunJob(conf)
MapReduce Job
Handled parts
Configuration of a Jobbull JobConf objectndash JobConf is the primary interface for a user to describe
a map-reduce job to the Hadoop framework for execution
ndash JobConf typically specifies the Mapper combiner (if any) Partitioner Reducer InputFormat and OutputFormat implementations to be used
ndash Indicates the set of input files (setInputPaths(JobConf Path) addInputPath(JobConf Path)) and (setInputPaths(JobConf String) addInputPaths(JobConf String)) and where the output files should be written (setOutputPath(Path))
Configuration of a Job
Input Splitting
bull An input split will normally be a contiguous group of records from a single input filendash If the number of requested map tasks is larger
than number of files ndash the individual files are larger than the suggested
fragment size there may be multiple input splits constructed of each input file
bull The user has considerable control over the number of input splits
Specifying Input Formatsbull The Hadoop framework provides a large variety of
input formatsndash KeyValueTextInputFormat Keyvalue pairs one per linendash TextInputFormant The key is the line number and the
value is the linendash NLineInputFormat Similar to KeyValueTextInputFormat
but the splits are based on N lines of input rather than Y bytes of input
ndash MultiFileInputFormat An abstract class that lets the user implement an input format that aggregates multiple files into one split
ndash SequenceFIleInputFormat The input file is a Hadoop sequence file containing serialized keyvalue pairs
Specifying Input Formats
Setting the Output Parameters
bull The framework requires that the output parameters be configured even if the job will not produce any output
bull The framework will collect the output from the specified tasks and place them into the configured output directory
Setting the Output Parameters
A Simple Map Function IdentityMapper
A Simple Reduce Function IdentityReducer
A Simple Reduce Function IdentityReducer
Configuring the Reduce Phase
bull the user must supply the framework with five pieces of informationndash The number of reduce tasks if zero no reduce
phase is runndash The class supplying the reduce methodndash The input key and value types for the reduce task
by default the same as the reduce outputndash The output key and value types for the reduce
taskndash The output file type for the reduce task output
How Many Maps
bull The number of maps is usually driven by the total size of the inputs that is the total number of blocks of the input files
bull The right level of parallelism for maps seems to be around 10-100 maps per-node
bull it is best if the maps take at least a minute to execute
bull setNumMapTasks(int)
Reducerbull Reducer reduces a set of intermediate values which
share a key to a smaller set of valuesbull Reducer has 3 primary phases shuffle sort and
reducebull Shufflendash Input to the Reducer is the sorted output of the mappers
In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP
bull Sortndash The framework groups Reducer inputs by keys (since
different mappers may have output the same key) in this stage
ndash The shuffle and sort phases occur simultaneously while map-outputs are being fetched they are merged
How Many Reduces
bull The right number of reduces seems to be 095 or 175 multiplied by (ltno of nodesgt mapredtasktrackerreducetasksmaximum)
bull With 095 all of the reduces can launch immediately and start transferring map outputs as the maps finish
bull With 175 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing
How Many Reducesbull Increasing the number of reduces increases the
framework overhead but increases load balancing and lowers the cost of failures
bull Reducer NONEndash It is legal to set the number of reduce-tasks to zero if
no reduction is desiredndash In this case the outputs of the map-tasks go directly
to the FileSystem into the output path set by setOutputPath(Path)
ndash The framework does not sort the map-outputs before writing them out to the FileSystem
Reporter
bull Reporter is a facility for MapReduce applications to report progress set application-level status messages and update Counters
bull Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive
JobTracker
bull JobTracker is the central location for submitting and tracking MR jobs in a network environment
bull JobClient is the primary interface by which user-job interacts with the JobTrackerndash provides facilities to submit jobs track their
progress access component-tasks reports and logs get the MapReduce clusters status information and so on
Job Submission and Monitoring
bull The job submission process involvesndash Checking the input and output specifications of
the jobndash Computing the InputSplit values for the jobndash Setting up the requisite accounting information
for the DistributedCache of the job if necessary ndash Copying the jobs jar and configuration to the
MapReduce system directory on the FileSystem ndash Submitting the job to the JobTracker and
optionally monitoring its status
MapReduce Details forMultimachine Clusters
Introduction
bull Whyndash datasets that canrsquot fit on a single machine ndash have time constraints that are impossible to
satisfy with a small number of machines ndash need to rapidly scale the computing power
applied to a problem due to varying input set sizes
Requirements for Successful MapReduce Jobs
bull Mapperndash ingest the input and process the input record sending
forward the records that can be passed to the reduce task or to the final output directly
bull Reducerndash Accept the key and value groups that passed through the
mapper and generate the final output
bull job must be configured with the location and type of the input data the mapper class to use the number of reduce tasks required and the reducer class and IO types
Requirements for Successful MapReduce Jobs
bull The TaskTracker service will actually run your map and reduce tasks and the JobTracker service will distribute the tasks and their input split to the various trackers
bull The cluster must be configured with the nodes that will run the TaskTrackers and with the number of TaskTrackers to run per node
Requirements for Successful MapReduce Jobs
bull Three levels of configuration to address to configure MapReduce on your clusterndash configure the machines ndash the Hadoop MapReduce framework ndash the jobs themselves
Launching MapReduce Jobs
bull launch the preceding example from the command linegt binhadoop [-libjars jar1jarjar2jarjar3jar] jar myjarjar MyClass
MapReduce-Specific Configuration for Each Machine in a Cluster
bull install any standard JARs that your application usesbull It is probable that your applications will have a
runtime environment that is deployed from a configuration management application which you will also need to deploy to each machine
bull The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks
bull The confslaves file should have the set of machines to serve as TaskTracker nodes
DistributedCache
bull distributes application-specific large read-only files efficiently
bull a facility provided by the MapReduce framework to cache files (text archives jars and so on) needed by applications
bull The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node
Adding Resources to the Task Classpath
bull Methodsndash JobConfsetJar(String jar) Sets the user JAR for the
MapReduce jobndash JobConfsetJarByClass(Class cls) Determines the
JAR that contains the class cls and calls JobConfsetJar(jar) with that JAR
ndash DistributedCacheaddArchiveToClassPath(Path archive Configuration conf) Adds an archive path to the current set of classpath entries
Configuring the Hadoop Core Cluster Information
bull Setting the Default File System URI
bull You can also use the JobConf object to set the default file systemndash confset( fsdefaultname
hdfsNamenodeHostnamePORT)
Configuring the Hadoop Core Cluster Information
bull Setting the JobTracker Location
bull use the JobConf object to set the JobTracker informationndash confset( mapredjobtracker
JobtrackerHostnamePORT)
- MapReduce Programming
- Slide 2
- MapReduce Program Structure
- Slide 4
- Slide 5
- MapReduce Job
- Handled parts
- Configuration of a Job
- Slide 9
- Input Splitting
- Specifying Input Formats
- Slide 12
- Setting the Output Parameters
- Slide 14
- A Simple Map Function IdentityMapper
- A Simple Reduce Function IdentityReducer
- Slide 17
- Configuring the Reduce Phase
- How Many Maps
- Reducer
- How Many Reduces
- Slide 22
- Reporter
- JobTracker
- Job Submission and Monitoring
- MapReduce Details for Multimachine Clusters
- Introduction
- Requirements for Successful MapReduce Jobs
- Slide 29
- Slide 30
- Launching MapReduce Jobs
- Slide 32
- MapReduce-Specific Configuration for Each Machine in a Cluster
- DistributedCache
- Adding Resources to the Task Classpath
- Configuring the Hadoop Core Cluster Information
- Slide 37
-
package orgmyorg import javaioIOException import javautil import orgapachehadoopfsPath import orgapachehadoopconf import orgapachehadoopio import orgapachehadoopmapred import orgapachehadooputil public class WordCount public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1) private Text word = new Text() public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException String line = valuetoString() StringTokenizer tokenizer = new StringTokenizer(line) while (tokenizerhasMoreTokens()) wordset(tokenizernextToken()) outputcollect(word one)
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritablegt public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException int sum = 0 while (valueshasNext()) sum += valuesnext()get() outputcollect(key new IntWritable(sum)) public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass) confsetJobName(wordcount) confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass) confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass) confsetReducerClass(Reduceclass) confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass) FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1])) JobClientrunJob(conf)
MapReduce Job
Handled parts
Configuration of a Jobbull JobConf objectndash JobConf is the primary interface for a user to describe
a map-reduce job to the Hadoop framework for execution
ndash JobConf typically specifies the Mapper combiner (if any) Partitioner Reducer InputFormat and OutputFormat implementations to be used
ndash Indicates the set of input files (setInputPaths(JobConf Path) addInputPath(JobConf Path)) and (setInputPaths(JobConf String) addInputPaths(JobConf String)) and where the output files should be written (setOutputPath(Path))
Configuration of a Job
Input Splitting
bull An input split will normally be a contiguous group of records from a single input filendash If the number of requested map tasks is larger
than number of files ndash the individual files are larger than the suggested
fragment size there may be multiple input splits constructed of each input file
bull The user has considerable control over the number of input splits
Specifying Input Formatsbull The Hadoop framework provides a large variety of
input formatsndash KeyValueTextInputFormat Keyvalue pairs one per linendash TextInputFormant The key is the line number and the
value is the linendash NLineInputFormat Similar to KeyValueTextInputFormat
but the splits are based on N lines of input rather than Y bytes of input
ndash MultiFileInputFormat An abstract class that lets the user implement an input format that aggregates multiple files into one split
ndash SequenceFIleInputFormat The input file is a Hadoop sequence file containing serialized keyvalue pairs
Specifying Input Formats
Setting the Output Parameters
bull The framework requires that the output parameters be configured even if the job will not produce any output
bull The framework will collect the output from the specified tasks and place them into the configured output directory
Setting the Output Parameters
A Simple Map Function IdentityMapper
A Simple Reduce Function IdentityReducer
A Simple Reduce Function IdentityReducer
Configuring the Reduce Phase
bull the user must supply the framework with five pieces of informationndash The number of reduce tasks if zero no reduce
phase is runndash The class supplying the reduce methodndash The input key and value types for the reduce task
by default the same as the reduce outputndash The output key and value types for the reduce
taskndash The output file type for the reduce task output
How Many Maps
bull The number of maps is usually driven by the total size of the inputs that is the total number of blocks of the input files
bull The right level of parallelism for maps seems to be around 10-100 maps per-node
bull it is best if the maps take at least a minute to execute
bull setNumMapTasks(int)
Reducerbull Reducer reduces a set of intermediate values which
share a key to a smaller set of valuesbull Reducer has 3 primary phases shuffle sort and
reducebull Shufflendash Input to the Reducer is the sorted output of the mappers
In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP
bull Sortndash The framework groups Reducer inputs by keys (since
different mappers may have output the same key) in this stage
ndash The shuffle and sort phases occur simultaneously while map-outputs are being fetched they are merged
How Many Reduces
bull The right number of reduces seems to be 095 or 175 multiplied by (ltno of nodesgt mapredtasktrackerreducetasksmaximum)
bull With 095 all of the reduces can launch immediately and start transferring map outputs as the maps finish
bull With 175 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing
How Many Reducesbull Increasing the number of reduces increases the
framework overhead but increases load balancing and lowers the cost of failures
bull Reducer NONEndash It is legal to set the number of reduce-tasks to zero if
no reduction is desiredndash In this case the outputs of the map-tasks go directly
to the FileSystem into the output path set by setOutputPath(Path)
ndash The framework does not sort the map-outputs before writing them out to the FileSystem
Reporter
bull Reporter is a facility for MapReduce applications to report progress set application-level status messages and update Counters
bull Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive
JobTracker
bull JobTracker is the central location for submitting and tracking MR jobs in a network environment
bull JobClient is the primary interface by which user-job interacts with the JobTrackerndash provides facilities to submit jobs track their
progress access component-tasks reports and logs get the MapReduce clusters status information and so on
Job Submission and Monitoring
bull The job submission process involvesndash Checking the input and output specifications of
the jobndash Computing the InputSplit values for the jobndash Setting up the requisite accounting information
for the DistributedCache of the job if necessary ndash Copying the jobs jar and configuration to the
MapReduce system directory on the FileSystem ndash Submitting the job to the JobTracker and
optionally monitoring its status
MapReduce Details forMultimachine Clusters
Introduction
bull Whyndash datasets that canrsquot fit on a single machine ndash have time constraints that are impossible to
satisfy with a small number of machines ndash need to rapidly scale the computing power
applied to a problem due to varying input set sizes
Requirements for Successful MapReduce Jobs
bull Mapperndash ingest the input and process the input record sending
forward the records that can be passed to the reduce task or to the final output directly
bull Reducerndash Accept the key and value groups that passed through the
mapper and generate the final output
bull job must be configured with the location and type of the input data the mapper class to use the number of reduce tasks required and the reducer class and IO types
Requirements for Successful MapReduce Jobs
bull The TaskTracker service will actually run your map and reduce tasks and the JobTracker service will distribute the tasks and their input split to the various trackers
bull The cluster must be configured with the nodes that will run the TaskTrackers and with the number of TaskTrackers to run per node
Requirements for Successful MapReduce Jobs
bull Three levels of configuration to address to configure MapReduce on your clusterndash configure the machines ndash the Hadoop MapReduce framework ndash the jobs themselves
Launching MapReduce Jobs
bull launch the preceding example from the command linegt binhadoop [-libjars jar1jarjar2jarjar3jar] jar myjarjar MyClass
MapReduce-Specific Configuration for Each Machine in a Cluster
bull install any standard JARs that your application usesbull It is probable that your applications will have a
runtime environment that is deployed from a configuration management application which you will also need to deploy to each machine
bull The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks
bull The confslaves file should have the set of machines to serve as TaskTracker nodes
DistributedCache
bull distributes application-specific large read-only files efficiently
bull a facility provided by the MapReduce framework to cache files (text archives jars and so on) needed by applications
bull The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node
Adding Resources to the Task Classpath
bull Methodsndash JobConfsetJar(String jar) Sets the user JAR for the
MapReduce jobndash JobConfsetJarByClass(Class cls) Determines the
JAR that contains the class cls and calls JobConfsetJar(jar) with that JAR
ndash DistributedCacheaddArchiveToClassPath(Path archive Configuration conf) Adds an archive path to the current set of classpath entries
Configuring the Hadoop Core Cluster Information
bull Setting the Default File System URI
bull You can also use the JobConf object to set the default file systemndash confset( fsdefaultname
hdfsNamenodeHostnamePORT)
Configuring the Hadoop Core Cluster Information
bull Setting the JobTracker Location
bull use the JobConf object to set the JobTracker informationndash confset( mapredjobtracker
JobtrackerHostnamePORT)
- MapReduce Programming
- Slide 2
- MapReduce Program Structure
- Slide 4
- Slide 5
- MapReduce Job
- Handled parts
- Configuration of a Job
- Slide 9
- Input Splitting
- Specifying Input Formats
- Slide 12
- Setting the Output Parameters
- Slide 14
- A Simple Map Function IdentityMapper
- A Simple Reduce Function IdentityReducer
- Slide 17
- Configuring the Reduce Phase
- How Many Maps
- Reducer
- How Many Reduces
- Slide 22
- Reporter
- JobTracker
- Job Submission and Monitoring
- MapReduce Details for Multimachine Clusters
- Introduction
- Requirements for Successful MapReduce Jobs
- Slide 29
- Slide 30
- Launching MapReduce Jobs
- Slide 32
- MapReduce-Specific Configuration for Each Machine in a Cluster
- DistributedCache
- Adding Resources to the Task Classpath
- Configuring the Hadoop Core Cluster Information
- Slide 37
-
public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritablegt public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException int sum = 0 while (valueshasNext()) sum += valuesnext()get() outputcollect(key new IntWritable(sum)) public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass) confsetJobName(wordcount) confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass) confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass) confsetReducerClass(Reduceclass) confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass) FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1])) JobClientrunJob(conf)
MapReduce Job
Handled parts
Configuration of a Jobbull JobConf objectndash JobConf is the primary interface for a user to describe
a map-reduce job to the Hadoop framework for execution
ndash JobConf typically specifies the Mapper combiner (if any) Partitioner Reducer InputFormat and OutputFormat implementations to be used
ndash Indicates the set of input files (setInputPaths(JobConf Path) addInputPath(JobConf Path)) and (setInputPaths(JobConf String) addInputPaths(JobConf String)) and where the output files should be written (setOutputPath(Path))
Configuration of a Job
Input Splitting
bull An input split will normally be a contiguous group of records from a single input filendash If the number of requested map tasks is larger
than number of files ndash the individual files are larger than the suggested
fragment size there may be multiple input splits constructed of each input file
bull The user has considerable control over the number of input splits
Specifying Input Formatsbull The Hadoop framework provides a large variety of
input formatsndash KeyValueTextInputFormat Keyvalue pairs one per linendash TextInputFormant The key is the line number and the
value is the linendash NLineInputFormat Similar to KeyValueTextInputFormat
but the splits are based on N lines of input rather than Y bytes of input
ndash MultiFileInputFormat An abstract class that lets the user implement an input format that aggregates multiple files into one split
ndash SequenceFIleInputFormat The input file is a Hadoop sequence file containing serialized keyvalue pairs
Specifying Input Formats
Setting the Output Parameters
bull The framework requires that the output parameters be configured even if the job will not produce any output
bull The framework will collect the output from the specified tasks and place them into the configured output directory
Setting the Output Parameters
A Simple Map Function IdentityMapper
A Simple Reduce Function IdentityReducer
A Simple Reduce Function IdentityReducer
Configuring the Reduce Phase
bull the user must supply the framework with five pieces of informationndash The number of reduce tasks if zero no reduce
phase is runndash The class supplying the reduce methodndash The input key and value types for the reduce task
by default the same as the reduce outputndash The output key and value types for the reduce
taskndash The output file type for the reduce task output
How Many Maps
bull The number of maps is usually driven by the total size of the inputs that is the total number of blocks of the input files
bull The right level of parallelism for maps seems to be around 10-100 maps per-node
bull it is best if the maps take at least a minute to execute
bull setNumMapTasks(int)
Reducerbull Reducer reduces a set of intermediate values which
share a key to a smaller set of valuesbull Reducer has 3 primary phases shuffle sort and
reducebull Shufflendash Input to the Reducer is the sorted output of the mappers
In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP
bull Sortndash The framework groups Reducer inputs by keys (since
different mappers may have output the same key) in this stage
ndash The shuffle and sort phases occur simultaneously while map-outputs are being fetched they are merged
How Many Reduces
bull The right number of reduces seems to be 095 or 175 multiplied by (ltno of nodesgt mapredtasktrackerreducetasksmaximum)
bull With 095 all of the reduces can launch immediately and start transferring map outputs as the maps finish
bull With 175 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing
How Many Reducesbull Increasing the number of reduces increases the
framework overhead but increases load balancing and lowers the cost of failures
bull Reducer NONEndash It is legal to set the number of reduce-tasks to zero if
no reduction is desiredndash In this case the outputs of the map-tasks go directly
to the FileSystem into the output path set by setOutputPath(Path)
ndash The framework does not sort the map-outputs before writing them out to the FileSystem
Reporter
bull Reporter is a facility for MapReduce applications to report progress set application-level status messages and update Counters
bull Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive
JobTracker
bull JobTracker is the central location for submitting and tracking MR jobs in a network environment
bull JobClient is the primary interface by which user-job interacts with the JobTrackerndash provides facilities to submit jobs track their
progress access component-tasks reports and logs get the MapReduce clusters status information and so on
Job Submission and Monitoring
bull The job submission process involvesndash Checking the input and output specifications of
the jobndash Computing the InputSplit values for the jobndash Setting up the requisite accounting information
for the DistributedCache of the job if necessary ndash Copying the jobs jar and configuration to the
MapReduce system directory on the FileSystem ndash Submitting the job to the JobTracker and
optionally monitoring its status
MapReduce Details forMultimachine Clusters
Introduction
bull Whyndash datasets that canrsquot fit on a single machine ndash have time constraints that are impossible to
satisfy with a small number of machines ndash need to rapidly scale the computing power
applied to a problem due to varying input set sizes
Requirements for Successful MapReduce Jobs
bull Mapperndash ingest the input and process the input record sending
forward the records that can be passed to the reduce task or to the final output directly
bull Reducerndash Accept the key and value groups that passed through the
mapper and generate the final output
bull job must be configured with the location and type of the input data the mapper class to use the number of reduce tasks required and the reducer class and IO types
Requirements for Successful MapReduce Jobs
bull The TaskTracker service will actually run your map and reduce tasks and the JobTracker service will distribute the tasks and their input split to the various trackers
bull The cluster must be configured with the nodes that will run the TaskTrackers and with the number of TaskTrackers to run per node
Requirements for Successful MapReduce Jobs
bull Three levels of configuration to address to configure MapReduce on your clusterndash configure the machines ndash the Hadoop MapReduce framework ndash the jobs themselves
Launching MapReduce Jobs
bull launch the preceding example from the command linegt binhadoop [-libjars jar1jarjar2jarjar3jar] jar myjarjar MyClass
MapReduce-Specific Configuration for Each Machine in a Cluster
bull install any standard JARs that your application usesbull It is probable that your applications will have a
runtime environment that is deployed from a configuration management application which you will also need to deploy to each machine
bull The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks
bull The confslaves file should have the set of machines to serve as TaskTracker nodes
DistributedCache
bull distributes application-specific large read-only files efficiently
bull a facility provided by the MapReduce framework to cache files (text archives jars and so on) needed by applications
bull The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node
Adding Resources to the Task Classpath
bull Methodsndash JobConfsetJar(String jar) Sets the user JAR for the
MapReduce jobndash JobConfsetJarByClass(Class cls) Determines the
JAR that contains the class cls and calls JobConfsetJar(jar) with that JAR
ndash DistributedCacheaddArchiveToClassPath(Path archive Configuration conf) Adds an archive path to the current set of classpath entries
Configuring the Hadoop Core Cluster Information
bull Setting the Default File System URI
bull You can also use the JobConf object to set the default file systemndash confset( fsdefaultname
hdfsNamenodeHostnamePORT)
Configuring the Hadoop Core Cluster Information
bull Setting the JobTracker Location
bull use the JobConf object to set the JobTracker informationndash confset( mapredjobtracker
JobtrackerHostnamePORT)
- MapReduce Programming
- Slide 2
- MapReduce Program Structure
- Slide 4
- Slide 5
- MapReduce Job
- Handled parts
- Configuration of a Job
- Slide 9
- Input Splitting
- Specifying Input Formats
- Slide 12
- Setting the Output Parameters
- Slide 14
- A Simple Map Function IdentityMapper
- A Simple Reduce Function IdentityReducer
- Slide 17
- Configuring the Reduce Phase
- How Many Maps
- Reducer
- How Many Reduces
- Slide 22
- Reporter
- JobTracker
- Job Submission and Monitoring
- MapReduce Details for Multimachine Clusters
- Introduction
- Requirements for Successful MapReduce Jobs
- Slide 29
- Slide 30
- Launching MapReduce Jobs
- Slide 32
- MapReduce-Specific Configuration for Each Machine in a Cluster
- DistributedCache
- Adding Resources to the Task Classpath
- Configuring the Hadoop Core Cluster Information
- Slide 37
-
Specifying Input Formats
Setting the Output Parameters
bull The framework requires that the output parameters be configured even if the job will not produce any output
bull The framework will collect the output from the specified tasks and place them into the configured output directory
Setting the Output Parameters
A Simple Map Function IdentityMapper
A Simple Reduce Function IdentityReducer
A Simple Reduce Function IdentityReducer
Configuring the Reduce Phase
bull the user must supply the framework with five pieces of informationndash The number of reduce tasks if zero no reduce
phase is runndash The class supplying the reduce methodndash The input key and value types for the reduce task
by default the same as the reduce outputndash The output key and value types for the reduce
taskndash The output file type for the reduce task output
How Many Maps
bull The number of maps is usually driven by the total size of the inputs that is the total number of blocks of the input files
bull The right level of parallelism for maps seems to be around 10-100 maps per-node
bull it is best if the maps take at least a minute to execute
bull setNumMapTasks(int)
Reducerbull Reducer reduces a set of intermediate values which
share a key to a smaller set of valuesbull Reducer has 3 primary phases shuffle sort and
reducebull Shufflendash Input to the Reducer is the sorted output of the mappers
In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP
bull Sortndash The framework groups Reducer inputs by keys (since
different mappers may have output the same key) in this stage
ndash The shuffle and sort phases occur simultaneously while map-outputs are being fetched they are merged
How Many Reduces
bull The right number of reduces seems to be 095 or 175 multiplied by (ltno of nodesgt mapredtasktrackerreducetasksmaximum)
bull With 095 all of the reduces can launch immediately and start transferring map outputs as the maps finish
bull With 175 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing
How Many Reducesbull Increasing the number of reduces increases the
framework overhead but increases load balancing and lowers the cost of failures
bull Reducer NONEndash It is legal to set the number of reduce-tasks to zero if
no reduction is desiredndash In this case the outputs of the map-tasks go directly
to the FileSystem into the output path set by setOutputPath(Path)
ndash The framework does not sort the map-outputs before writing them out to the FileSystem
Reporter
bull Reporter is a facility for MapReduce applications to report progress set application-level status messages and update Counters
bull Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive
JobTracker
bull JobTracker is the central location for submitting and tracking MR jobs in a network environment
bull JobClient is the primary interface by which user-job interacts with the JobTrackerndash provides facilities to submit jobs track their
progress access component-tasks reports and logs get the MapReduce clusters status information and so on
Job Submission and Monitoring
bull The job submission process involvesndash Checking the input and output specifications of
the jobndash Computing the InputSplit values for the jobndash Setting up the requisite accounting information
for the DistributedCache of the job if necessary ndash Copying the jobs jar and configuration to the
MapReduce system directory on the FileSystem ndash Submitting the job to the JobTracker and
optionally monitoring its status
MapReduce Details forMultimachine Clusters
Introduction
bull Whyndash datasets that canrsquot fit on a single machine ndash have time constraints that are impossible to
satisfy with a small number of machines ndash need to rapidly scale the computing power
applied to a problem due to varying input set sizes
Requirements for Successful MapReduce Jobs
bull Mapperndash ingest the input and process the input record sending
forward the records that can be passed to the reduce task or to the final output directly
bull Reducerndash Accept the key and value groups that passed through the
mapper and generate the final output
bull job must be configured with the location and type of the input data the mapper class to use the number of reduce tasks required and the reducer class and IO types
Requirements for Successful MapReduce Jobs
bull The TaskTracker service will actually run your map and reduce tasks and the JobTracker service will distribute the tasks and their input split to the various trackers
bull The cluster must be configured with the nodes that will run the TaskTrackers and with the number of TaskTrackers to run per node
Requirements for Successful MapReduce Jobs
bull Three levels of configuration to address to configure MapReduce on your clusterndash configure the machines ndash the Hadoop MapReduce framework ndash the jobs themselves
Launching MapReduce Jobs
bull launch the preceding example from the command linegt binhadoop [-libjars jar1jarjar2jarjar3jar] jar myjarjar MyClass
MapReduce-Specific Configuration for Each Machine in a Cluster
bull install any standard JARs that your application usesbull It is probable that your applications will have a
runtime environment that is deployed from a configuration management application which you will also need to deploy to each machine
bull The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks
bull The confslaves file should have the set of machines to serve as TaskTracker nodes
DistributedCache
bull distributes application-specific large read-only files efficiently
bull a facility provided by the MapReduce framework to cache files (text archives jars and so on) needed by applications
bull The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node
Adding Resources to the Task Classpath
• Methods:
– JobConf.setJar(String jar): sets the user JAR for the MapReduce job
– JobConf.setJarByClass(Class cls): determines the JAR that contains the class cls and calls JobConf.setJar(jar) with that JAR
– DistributedCache.addArchiveToClassPath(Path archive, Configuration conf): adds an archive path to the current set of classpath entries
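A short sketch combining these methods in a driver; the archive path is a placeholder and must already exist on the shared FileSystem (addArchiveToClassPath is declared to throw IOException):

  conf.setJarByClass(MyDriver.class);                 // ship the JAR that contains the job classes
  DistributedCache.addArchiveToClassPath(             // put an extra archive on every task's classpath
      new Path("/libs/extra-lib.jar"), conf);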
Configuring the Hadoop Core Cluster Information
• Setting the Default File System URI
• You can also use the JobConf object to set the default file system:
– conf.set("fs.default.name", "hdfs://NamenodeHostname:PORT")
Configuring the Hadoop Core Cluster Information
• Setting the JobTracker Location
• Use the JobConf object to set the JobTracker information:
– conf.set("mapred.job.tracker", "JobtrackerHostname:PORT")
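Putting the two settings together in a driver, with placeholder host names and ports:

  JobConf conf = new JobConf(MyDriver.class);
  conf.set("fs.default.name", "hdfs://namenode.example.com:8020");   // default FileSystem URI
  conf.set("mapred.job.tracker", "jobtracker.example.com:9001");     // JobTracker host:port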
Reporter
bull Reporter is a facility for MapReduce applications to report progress set application-level status messages and update Counters
bull Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive
JobTracker
bull JobTracker is the central location for submitting and tracking MR jobs in a network environment
bull JobClient is the primary interface by which user-job interacts with the JobTrackerndash provides facilities to submit jobs track their
progress access component-tasks reports and logs get the MapReduce clusters status information and so on
Job Submission and Monitoring
bull The job submission process involvesndash Checking the input and output specifications of
the jobndash Computing the InputSplit values for the jobndash Setting up the requisite accounting information
for the DistributedCache of the job if necessary ndash Copying the jobs jar and configuration to the
MapReduce system directory on the FileSystem ndash Submitting the job to the JobTracker and
optionally monitoring its status
MapReduce Details forMultimachine Clusters
Introduction
bull Whyndash datasets that canrsquot fit on a single machine ndash have time constraints that are impossible to
satisfy with a small number of machines ndash need to rapidly scale the computing power
applied to a problem due to varying input set sizes
Requirements for Successful MapReduce Jobs
bull Mapperndash ingest the input and process the input record sending
forward the records that can be passed to the reduce task or to the final output directly
bull Reducerndash Accept the key and value groups that passed through the
mapper and generate the final output
bull job must be configured with the location and type of the input data the mapper class to use the number of reduce tasks required and the reducer class and IO types
Requirements for Successful MapReduce Jobs
bull The TaskTracker service will actually run your map and reduce tasks and the JobTracker service will distribute the tasks and their input split to the various trackers
bull The cluster must be configured with the nodes that will run the TaskTrackers and with the number of TaskTrackers to run per node
Requirements for Successful MapReduce Jobs
bull Three levels of configuration to address to configure MapReduce on your clusterndash configure the machines ndash the Hadoop MapReduce framework ndash the jobs themselves
Launching MapReduce Jobs
bull launch the preceding example from the command linegt binhadoop [-libjars jar1jarjar2jarjar3jar] jar myjarjar MyClass
MapReduce-Specific Configuration for Each Machine in a Cluster
bull install any standard JARs that your application usesbull It is probable that your applications will have a
runtime environment that is deployed from a configuration management application which you will also need to deploy to each machine
bull The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks
bull The confslaves file should have the set of machines to serve as TaskTracker nodes
DistributedCache
bull distributes application-specific large read-only files efficiently
bull a facility provided by the MapReduce framework to cache files (text archives jars and so on) needed by applications
bull The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node
Adding Resources to the Task Classpath
bull Methodsndash JobConfsetJar(String jar) Sets the user JAR for the
MapReduce jobndash JobConfsetJarByClass(Class cls) Determines the
JAR that contains the class cls and calls JobConfsetJar(jar) with that JAR
ndash DistributedCacheaddArchiveToClassPath(Path archive Configuration conf) Adds an archive path to the current set of classpath entries
Configuring the Hadoop Core Cluster Information
bull Setting the Default File System URI
bull You can also use the JobConf object to set the default file systemndash confset( fsdefaultname
hdfsNamenodeHostnamePORT)
Configuring the Hadoop Core Cluster Information
bull Setting the JobTracker Location
bull use the JobConf object to set the JobTracker informationndash confset( mapredjobtracker
JobtrackerHostnamePORT)
- MapReduce Programming
- Slide 2
- MapReduce Program Structure
- Slide 4
- Slide 5
- MapReduce Job
- Handled parts
- Configuration of a Job
- Slide 9
- Input Splitting
- Specifying Input Formats
- Slide 12
- Setting the Output Parameters
- Slide 14
- A Simple Map Function IdentityMapper
- A Simple Reduce Function IdentityReducer
- Slide 17
- Configuring the Reduce Phase
- How Many Maps
- Reducer
- How Many Reduces
- Slide 22
- Reporter
- JobTracker
- Job Submission and Monitoring
- MapReduce Details for Multimachine Clusters
- Introduction
- Requirements for Successful MapReduce Jobs
- Slide 29
- Slide 30
- Launching MapReduce Jobs
- Slide 32
- MapReduce-Specific Configuration for Each Machine in a Cluster
- DistributedCache
- Adding Resources to the Task Classpath
- Configuring the Hadoop Core Cluster Information
- Slide 37
-
Setting the Output Parameters
A Simple Map Function IdentityMapper
A Simple Reduce Function IdentityReducer
A Simple Reduce Function IdentityReducer
Configuring the Reduce Phase
bull the user must supply the framework with five pieces of informationndash The number of reduce tasks if zero no reduce
phase is runndash The class supplying the reduce methodndash The input key and value types for the reduce task
by default the same as the reduce outputndash The output key and value types for the reduce
taskndash The output file type for the reduce task output
How Many Maps
bull The number of maps is usually driven by the total size of the inputs that is the total number of blocks of the input files
bull The right level of parallelism for maps seems to be around 10-100 maps per-node
bull it is best if the maps take at least a minute to execute
bull setNumMapTasks(int)
Reducerbull Reducer reduces a set of intermediate values which
share a key to a smaller set of valuesbull Reducer has 3 primary phases shuffle sort and
reducebull Shufflendash Input to the Reducer is the sorted output of the mappers
In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP
bull Sortndash The framework groups Reducer inputs by keys (since
different mappers may have output the same key) in this stage
ndash The shuffle and sort phases occur simultaneously while map-outputs are being fetched they are merged
How Many Reduces
bull The right number of reduces seems to be 095 or 175 multiplied by (ltno of nodesgt mapredtasktrackerreducetasksmaximum)
bull With 095 all of the reduces can launch immediately and start transferring map outputs as the maps finish
bull With 175 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing
How Many Reducesbull Increasing the number of reduces increases the
framework overhead but increases load balancing and lowers the cost of failures
bull Reducer NONEndash It is legal to set the number of reduce-tasks to zero if
no reduction is desiredndash In this case the outputs of the map-tasks go directly
to the FileSystem into the output path set by setOutputPath(Path)
ndash The framework does not sort the map-outputs before writing them out to the FileSystem
Reporter
bull Reporter is a facility for MapReduce applications to report progress set application-level status messages and update Counters
bull Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive
JobTracker
bull JobTracker is the central location for submitting and tracking MR jobs in a network environment
bull JobClient is the primary interface by which user-job interacts with the JobTrackerndash provides facilities to submit jobs track their
progress access component-tasks reports and logs get the MapReduce clusters status information and so on
Job Submission and Monitoring
bull The job submission process involvesndash Checking the input and output specifications of
the jobndash Computing the InputSplit values for the jobndash Setting up the requisite accounting information
for the DistributedCache of the job if necessary ndash Copying the jobs jar and configuration to the
MapReduce system directory on the FileSystem ndash Submitting the job to the JobTracker and
optionally monitoring its status
MapReduce Details forMultimachine Clusters
Introduction
bull Whyndash datasets that canrsquot fit on a single machine ndash have time constraints that are impossible to
satisfy with a small number of machines ndash need to rapidly scale the computing power
applied to a problem due to varying input set sizes
Requirements for Successful MapReduce Jobs
bull Mapperndash ingest the input and process the input record sending
forward the records that can be passed to the reduce task or to the final output directly
bull Reducerndash Accept the key and value groups that passed through the
mapper and generate the final output
bull job must be configured with the location and type of the input data the mapper class to use the number of reduce tasks required and the reducer class and IO types
Requirements for Successful MapReduce Jobs
bull The TaskTracker service will actually run your map and reduce tasks and the JobTracker service will distribute the tasks and their input split to the various trackers
bull The cluster must be configured with the nodes that will run the TaskTrackers and with the number of TaskTrackers to run per node
Requirements for Successful MapReduce Jobs
bull Three levels of configuration to address to configure MapReduce on your clusterndash configure the machines ndash the Hadoop MapReduce framework ndash the jobs themselves
Launching MapReduce Jobs
bull launch the preceding example from the command linegt binhadoop [-libjars jar1jarjar2jarjar3jar] jar myjarjar MyClass
MapReduce-Specific Configuration for Each Machine in a Cluster
bull install any standard JARs that your application usesbull It is probable that your applications will have a
runtime environment that is deployed from a configuration management application which you will also need to deploy to each machine
bull The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks
bull The confslaves file should have the set of machines to serve as TaskTracker nodes
DistributedCache
bull distributes application-specific large read-only files efficiently
bull a facility provided by the MapReduce framework to cache files (text archives jars and so on) needed by applications
bull The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node
Adding Resources to the Task Classpath
bull Methodsndash JobConfsetJar(String jar) Sets the user JAR for the
MapReduce jobndash JobConfsetJarByClass(Class cls) Determines the
JAR that contains the class cls and calls JobConfsetJar(jar) with that JAR
ndash DistributedCacheaddArchiveToClassPath(Path archive Configuration conf) Adds an archive path to the current set of classpath entries
Configuring the Hadoop Core Cluster Information
bull Setting the Default File System URI
bull You can also use the JobConf object to set the default file systemndash confset( fsdefaultname
hdfsNamenodeHostnamePORT)
Configuring the Hadoop Core Cluster Information
bull Setting the JobTracker Location
bull use the JobConf object to set the JobTracker informationndash confset( mapredjobtracker
JobtrackerHostnamePORT)
- MapReduce Programming
- Slide 2
- MapReduce Program Structure
- Slide 4
- Slide 5
- MapReduce Job
- Handled parts
- Configuration of a Job
- Slide 9
- Input Splitting
- Specifying Input Formats
- Slide 12
- Setting the Output Parameters
- Slide 14
- A Simple Map Function IdentityMapper
- A Simple Reduce Function IdentityReducer
- Slide 17
- Configuring the Reduce Phase
- How Many Maps
- Reducer
- How Many Reduces
- Slide 22
- Reporter
- JobTracker
- Job Submission and Monitoring
- MapReduce Details for Multimachine Clusters
- Introduction
- Requirements for Successful MapReduce Jobs
- Slide 29
- Slide 30
- Launching MapReduce Jobs
- Slide 32
- MapReduce-Specific Configuration for Each Machine in a Cluster
- DistributedCache
- Adding Resources to the Task Classpath
- Configuring the Hadoop Core Cluster Information
- Slide 37
-
A Simple Map Function IdentityMapper
A Simple Reduce Function IdentityReducer
A Simple Reduce Function IdentityReducer
Configuring the Reduce Phase
bull the user must supply the framework with five pieces of informationndash The number of reduce tasks if zero no reduce
phase is runndash The class supplying the reduce methodndash The input key and value types for the reduce task
by default the same as the reduce outputndash The output key and value types for the reduce
taskndash The output file type for the reduce task output
How Many Maps
bull The number of maps is usually driven by the total size of the inputs that is the total number of blocks of the input files
bull The right level of parallelism for maps seems to be around 10-100 maps per-node
bull it is best if the maps take at least a minute to execute
bull setNumMapTasks(int)
Reducerbull Reducer reduces a set of intermediate values which
share a key to a smaller set of valuesbull Reducer has 3 primary phases shuffle sort and
reducebull Shufflendash Input to the Reducer is the sorted output of the mappers
In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP
bull Sortndash The framework groups Reducer inputs by keys (since
different mappers may have output the same key) in this stage
ndash The shuffle and sort phases occur simultaneously while map-outputs are being fetched they are merged
How Many Reduces
bull The right number of reduces seems to be 095 or 175 multiplied by (ltno of nodesgt mapredtasktrackerreducetasksmaximum)
bull With 095 all of the reduces can launch immediately and start transferring map outputs as the maps finish
bull With 175 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing
How Many Reducesbull Increasing the number of reduces increases the
framework overhead but increases load balancing and lowers the cost of failures
bull Reducer NONEndash It is legal to set the number of reduce-tasks to zero if
no reduction is desiredndash In this case the outputs of the map-tasks go directly
to the FileSystem into the output path set by setOutputPath(Path)
ndash The framework does not sort the map-outputs before writing them out to the FileSystem
Reporter
bull Reporter is a facility for MapReduce applications to report progress set application-level status messages and update Counters
bull Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive
JobTracker
bull JobTracker is the central location for submitting and tracking MR jobs in a network environment
bull JobClient is the primary interface by which user-job interacts with the JobTrackerndash provides facilities to submit jobs track their
progress access component-tasks reports and logs get the MapReduce clusters status information and so on
Job Submission and Monitoring
bull The job submission process involvesndash Checking the input and output specifications of
the jobndash Computing the InputSplit values for the jobndash Setting up the requisite accounting information
for the DistributedCache of the job if necessary ndash Copying the jobs jar and configuration to the
MapReduce system directory on the FileSystem ndash Submitting the job to the JobTracker and
optionally monitoring its status
MapReduce Details forMultimachine Clusters
Introduction
bull Whyndash datasets that canrsquot fit on a single machine ndash have time constraints that are impossible to
satisfy with a small number of machines ndash need to rapidly scale the computing power
applied to a problem due to varying input set sizes
Requirements for Successful MapReduce Jobs
bull Mapperndash ingest the input and process the input record sending
forward the records that can be passed to the reduce task or to the final output directly
bull Reducerndash Accept the key and value groups that passed through the
mapper and generate the final output
bull job must be configured with the location and type of the input data the mapper class to use the number of reduce tasks required and the reducer class and IO types
Requirements for Successful MapReduce Jobs
bull The TaskTracker service will actually run your map and reduce tasks and the JobTracker service will distribute the tasks and their input split to the various trackers
bull The cluster must be configured with the nodes that will run the TaskTrackers and with the number of TaskTrackers to run per node
Requirements for Successful MapReduce Jobs
bull Three levels of configuration to address to configure MapReduce on your clusterndash configure the machines ndash the Hadoop MapReduce framework ndash the jobs themselves
Launching MapReduce Jobs
bull launch the preceding example from the command linegt binhadoop [-libjars jar1jarjar2jarjar3jar] jar myjarjar MyClass
MapReduce-Specific Configuration for Each Machine in a Cluster
bull install any standard JARs that your application usesbull It is probable that your applications will have a
runtime environment that is deployed from a configuration management application which you will also need to deploy to each machine
bull The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks
bull The confslaves file should have the set of machines to serve as TaskTracker nodes
DistributedCache
bull distributes application-specific large read-only files efficiently
bull a facility provided by the MapReduce framework to cache files (text archives jars and so on) needed by applications
bull The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node
Adding Resources to the Task Classpath
bull Methodsndash JobConfsetJar(String jar) Sets the user JAR for the
MapReduce jobndash JobConfsetJarByClass(Class cls) Determines the
JAR that contains the class cls and calls JobConfsetJar(jar) with that JAR
ndash DistributedCacheaddArchiveToClassPath(Path archive Configuration conf) Adds an archive path to the current set of classpath entries
Configuring the Hadoop Core Cluster Information
bull Setting the Default File System URI
bull You can also use the JobConf object to set the default file systemndash confset( fsdefaultname
hdfsNamenodeHostnamePORT)
Configuring the Hadoop Core Cluster Information
bull Setting the JobTracker Location
bull use the JobConf object to set the JobTracker informationndash confset( mapredjobtracker
JobtrackerHostnamePORT)
- MapReduce Programming
- Slide 2
- MapReduce Program Structure
- Slide 4
- Slide 5
- MapReduce Job
- Handled parts
- Configuration of a Job
- Slide 9
- Input Splitting
- Specifying Input Formats
- Slide 12
- Setting the Output Parameters
- Slide 14
- A Simple Map Function IdentityMapper
- A Simple Reduce Function IdentityReducer
- Slide 17
- Configuring the Reduce Phase
- How Many Maps
- Reducer
- How Many Reduces
- Slide 22
- Reporter
- JobTracker
- Job Submission and Monitoring
- MapReduce Details for Multimachine Clusters
- Introduction
- Requirements for Successful MapReduce Jobs
- Slide 29
- Slide 30
- Launching MapReduce Jobs
- Slide 32
- MapReduce-Specific Configuration for Each Machine in a Cluster
- DistributedCache
- Adding Resources to the Task Classpath
- Configuring the Hadoop Core Cluster Information
- Slide 37
-
A Simple Reduce Function IdentityReducer
A Simple Reduce Function IdentityReducer
Configuring the Reduce Phase
bull the user must supply the framework with five pieces of informationndash The number of reduce tasks if zero no reduce
phase is runndash The class supplying the reduce methodndash The input key and value types for the reduce task
by default the same as the reduce outputndash The output key and value types for the reduce
taskndash The output file type for the reduce task output
How Many Maps
bull The number of maps is usually driven by the total size of the inputs that is the total number of blocks of the input files
bull The right level of parallelism for maps seems to be around 10-100 maps per-node
bull it is best if the maps take at least a minute to execute
bull setNumMapTasks(int)
Reducerbull Reducer reduces a set of intermediate values which
share a key to a smaller set of valuesbull Reducer has 3 primary phases shuffle sort and
reducebull Shufflendash Input to the Reducer is the sorted output of the mappers
In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP
bull Sortndash The framework groups Reducer inputs by keys (since
different mappers may have output the same key) in this stage
ndash The shuffle and sort phases occur simultaneously while map-outputs are being fetched they are merged
How Many Reduces
bull The right number of reduces seems to be 095 or 175 multiplied by (ltno of nodesgt mapredtasktrackerreducetasksmaximum)
bull With 095 all of the reduces can launch immediately and start transferring map outputs as the maps finish
bull With 175 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing
How Many Reducesbull Increasing the number of reduces increases the
framework overhead but increases load balancing and lowers the cost of failures
bull Reducer NONEndash It is legal to set the number of reduce-tasks to zero if
no reduction is desiredndash In this case the outputs of the map-tasks go directly
to the FileSystem into the output path set by setOutputPath(Path)
ndash The framework does not sort the map-outputs before writing them out to the FileSystem
Reporter
bull Reporter is a facility for MapReduce applications to report progress set application-level status messages and update Counters
bull Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive
JobTracker
bull JobTracker is the central location for submitting and tracking MR jobs in a network environment
bull JobClient is the primary interface by which user-job interacts with the JobTrackerndash provides facilities to submit jobs track their
progress access component-tasks reports and logs get the MapReduce clusters status information and so on
Job Submission and Monitoring
bull The job submission process involvesndash Checking the input and output specifications of
the jobndash Computing the InputSplit values for the jobndash Setting up the requisite accounting information
for the DistributedCache of the job if necessary ndash Copying the jobs jar and configuration to the
MapReduce system directory on the FileSystem ndash Submitting the job to the JobTracker and
optionally monitoring its status
MapReduce Details forMultimachine Clusters
Introduction
bull Whyndash datasets that canrsquot fit on a single machine ndash have time constraints that are impossible to
satisfy with a small number of machines ndash need to rapidly scale the computing power
applied to a problem due to varying input set sizes
Requirements for Successful MapReduce Jobs
bull Mapperndash ingest the input and process the input record sending
forward the records that can be passed to the reduce task or to the final output directly
bull Reducerndash Accept the key and value groups that passed through the
mapper and generate the final output
bull job must be configured with the location and type of the input data the mapper class to use the number of reduce tasks required and the reducer class and IO types
Requirements for Successful MapReduce Jobs
bull The TaskTracker service will actually run your map and reduce tasks and the JobTracker service will distribute the tasks and their input split to the various trackers
bull The cluster must be configured with the nodes that will run the TaskTrackers and with the number of TaskTrackers to run per node
Requirements for Successful MapReduce Jobs
bull Three levels of configuration to address to configure MapReduce on your clusterndash configure the machines ndash the Hadoop MapReduce framework ndash the jobs themselves
Launching MapReduce Jobs
bull launch the preceding example from the command linegt binhadoop [-libjars jar1jarjar2jarjar3jar] jar myjarjar MyClass
MapReduce-Specific Configuration for Each Machine in a Cluster
bull install any standard JARs that your application usesbull It is probable that your applications will have a
runtime environment that is deployed from a configuration management application which you will also need to deploy to each machine
bull The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks
bull The confslaves file should have the set of machines to serve as TaskTracker nodes
DistributedCache
bull distributes application-specific large read-only files efficiently
bull a facility provided by the MapReduce framework to cache files (text archives jars and so on) needed by applications
bull The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node
Adding Resources to the Task Classpath
bull Methodsndash JobConfsetJar(String jar) Sets the user JAR for the
MapReduce jobndash JobConfsetJarByClass(Class cls) Determines the
JAR that contains the class cls and calls JobConfsetJar(jar) with that JAR
ndash DistributedCacheaddArchiveToClassPath(Path archive Configuration conf) Adds an archive path to the current set of classpath entries
Configuring the Hadoop Core Cluster Information
bull Setting the Default File System URI
bull You can also use the JobConf object to set the default file systemndash confset( fsdefaultname
hdfsNamenodeHostnamePORT)
Configuring the Hadoop Core Cluster Information
bull Setting the JobTracker Location
bull use the JobConf object to set the JobTracker informationndash confset( mapredjobtracker
JobtrackerHostnamePORT)
- MapReduce Programming
- Slide 2
- MapReduce Program Structure
- Slide 4
- Slide 5
- MapReduce Job
- Handled parts
- Configuration of a Job
- Slide 9
- Input Splitting
- Specifying Input Formats
- Slide 12
- Setting the Output Parameters
- Slide 14
- A Simple Map Function IdentityMapper
- A Simple Reduce Function IdentityReducer
- Slide 17
- Configuring the Reduce Phase
- How Many Maps
- Reducer
- How Many Reduces
- Slide 22
- Reporter
- JobTracker
- Job Submission and Monitoring
- MapReduce Details for Multimachine Clusters
- Introduction
- Requirements for Successful MapReduce Jobs
- Slide 29
- Slide 30
- Launching MapReduce Jobs
- Slide 32
- MapReduce-Specific Configuration for Each Machine in a Cluster
- DistributedCache
- Adding Resources to the Task Classpath
- Configuring the Hadoop Core Cluster Information
- Slide 37
-
A Simple Reduce Function IdentityReducer
Configuring the Reduce Phase
bull the user must supply the framework with five pieces of informationndash The number of reduce tasks if zero no reduce
phase is runndash The class supplying the reduce methodndash The input key and value types for the reduce task
by default the same as the reduce outputndash The output key and value types for the reduce
taskndash The output file type for the reduce task output
How Many Maps
bull The number of maps is usually driven by the total size of the inputs that is the total number of blocks of the input files
bull The right level of parallelism for maps seems to be around 10-100 maps per-node
bull it is best if the maps take at least a minute to execute
bull setNumMapTasks(int)
Reducerbull Reducer reduces a set of intermediate values which
share a key to a smaller set of valuesbull Reducer has 3 primary phases shuffle sort and
reducebull Shufflendash Input to the Reducer is the sorted output of the mappers
In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP
bull Sortndash The framework groups Reducer inputs by keys (since
different mappers may have output the same key) in this stage
ndash The shuffle and sort phases occur simultaneously while map-outputs are being fetched they are merged
How Many Reduces
bull The right number of reduces seems to be 095 or 175 multiplied by (ltno of nodesgt mapredtasktrackerreducetasksmaximum)
bull With 095 all of the reduces can launch immediately and start transferring map outputs as the maps finish
bull With 175 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing
How Many Reducesbull Increasing the number of reduces increases the
framework overhead but increases load balancing and lowers the cost of failures
bull Reducer NONEndash It is legal to set the number of reduce-tasks to zero if
no reduction is desiredndash In this case the outputs of the map-tasks go directly
to the FileSystem into the output path set by setOutputPath(Path)
ndash The framework does not sort the map-outputs before writing them out to the FileSystem
Reporter
bull Reporter is a facility for MapReduce applications to report progress set application-level status messages and update Counters
bull Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive
JobTracker
bull JobTracker is the central location for submitting and tracking MR jobs in a network environment
bull JobClient is the primary interface by which user-job interacts with the JobTrackerndash provides facilities to submit jobs track their
progress access component-tasks reports and logs get the MapReduce clusters status information and so on
Job Submission and Monitoring
bull The job submission process involvesndash Checking the input and output specifications of
the jobndash Computing the InputSplit values for the jobndash Setting up the requisite accounting information
for the DistributedCache of the job if necessary ndash Copying the jobs jar and configuration to the
MapReduce system directory on the FileSystem ndash Submitting the job to the JobTracker and
optionally monitoring its status
MapReduce Details forMultimachine Clusters
Introduction
bull Whyndash datasets that canrsquot fit on a single machine ndash have time constraints that are impossible to
satisfy with a small number of machines ndash need to rapidly scale the computing power
applied to a problem due to varying input set sizes
Requirements for Successful MapReduce Jobs
bull Mapperndash ingest the input and process the input record sending
forward the records that can be passed to the reduce task or to the final output directly
bull Reducerndash Accept the key and value groups that passed through the
mapper and generate the final output
bull job must be configured with the location and type of the input data the mapper class to use the number of reduce tasks required and the reducer class and IO types
Requirements for Successful MapReduce Jobs
bull The TaskTracker service will actually run your map and reduce tasks and the JobTracker service will distribute the tasks and their input split to the various trackers
bull The cluster must be configured with the nodes that will run the TaskTrackers and with the number of TaskTrackers to run per node
Requirements for Successful MapReduce Jobs
bull Three levels of configuration to address to configure MapReduce on your clusterndash configure the machines ndash the Hadoop MapReduce framework ndash the jobs themselves
Launching MapReduce Jobs
bull launch the preceding example from the command linegt binhadoop [-libjars jar1jarjar2jarjar3jar] jar myjarjar MyClass
MapReduce-Specific Configuration for Each Machine in a Cluster
bull install any standard JARs that your application usesbull It is probable that your applications will have a
runtime environment that is deployed from a configuration management application which you will also need to deploy to each machine
bull The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks
bull The confslaves file should have the set of machines to serve as TaskTracker nodes
DistributedCache
bull distributes application-specific large read-only files efficiently
bull a facility provided by the MapReduce framework to cache files (text archives jars and so on) needed by applications
bull The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node
Adding Resources to the Task Classpath
bull Methodsndash JobConfsetJar(String jar) Sets the user JAR for the
MapReduce jobndash JobConfsetJarByClass(Class cls) Determines the
JAR that contains the class cls and calls JobConfsetJar(jar) with that JAR
ndash DistributedCacheaddArchiveToClassPath(Path archive Configuration conf) Adds an archive path to the current set of classpath entries
Configuring the Hadoop Core Cluster Information
bull Setting the Default File System URI
bull You can also use the JobConf object to set the default file systemndash confset( fsdefaultname
hdfsNamenodeHostnamePORT)
Configuring the Hadoop Core Cluster Information
bull Setting the JobTracker Location
bull use the JobConf object to set the JobTracker informationndash confset( mapredjobtracker
JobtrackerHostnamePORT)
- MapReduce Programming
- Slide 2
- MapReduce Program Structure
- Slide 4
- Slide 5
- MapReduce Job
- Handled parts
- Configuration of a Job
- Slide 9
- Input Splitting
- Specifying Input Formats
- Slide 12
- Setting the Output Parameters
- Slide 14
- A Simple Map Function IdentityMapper
- A Simple Reduce Function IdentityReducer
- Slide 17
- Configuring the Reduce Phase
- How Many Maps
- Reducer
- How Many Reduces
- Slide 22
- Reporter
- JobTracker
- Job Submission and Monitoring
- MapReduce Details for Multimachine Clusters
- Introduction
- Requirements for Successful MapReduce Jobs
- Slide 29
- Slide 30
- Launching MapReduce Jobs
- Slide 32
- MapReduce-Specific Configuration for Each Machine in a Cluster
- DistributedCache
- Adding Resources to the Task Classpath
- Configuring the Hadoop Core Cluster Information
- Slide 37
-
Configuring the Reduce Phase
bull the user must supply the framework with five pieces of informationndash The number of reduce tasks if zero no reduce
phase is runndash The class supplying the reduce methodndash The input key and value types for the reduce task
by default the same as the reduce outputndash The output key and value types for the reduce
taskndash The output file type for the reduce task output
How Many Maps
bull The number of maps is usually driven by the total size of the inputs that is the total number of blocks of the input files
bull The right level of parallelism for maps seems to be around 10-100 maps per-node
bull it is best if the maps take at least a minute to execute
bull setNumMapTasks(int)
Reducerbull Reducer reduces a set of intermediate values which
share a key to a smaller set of valuesbull Reducer has 3 primary phases shuffle sort and
reducebull Shufflendash Input to the Reducer is the sorted output of the mappers
In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP
bull Sortndash The framework groups Reducer inputs by keys (since
different mappers may have output the same key) in this stage
ndash The shuffle and sort phases occur simultaneously while map-outputs are being fetched they are merged
How Many Reduces
bull The right number of reduces seems to be 095 or 175 multiplied by (ltno of nodesgt mapredtasktrackerreducetasksmaximum)
bull With 095 all of the reduces can launch immediately and start transferring map outputs as the maps finish
bull With 175 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing
How Many Reducesbull Increasing the number of reduces increases the
framework overhead but increases load balancing and lowers the cost of failures
bull Reducer NONEndash It is legal to set the number of reduce-tasks to zero if
no reduction is desiredndash In this case the outputs of the map-tasks go directly
to the FileSystem into the output path set by setOutputPath(Path)
ndash The framework does not sort the map-outputs before writing them out to the FileSystem
Reporter
bull Reporter is a facility for MapReduce applications to report progress set application-level status messages and update Counters
bull Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive
JobTracker
bull JobTracker is the central location for submitting and tracking MR jobs in a network environment
bull JobClient is the primary interface by which user-job interacts with the JobTrackerndash provides facilities to submit jobs track their
progress access component-tasks reports and logs get the MapReduce clusters status information and so on
Job Submission and Monitoring
bull The job submission process involvesndash Checking the input and output specifications of
the jobndash Computing the InputSplit values for the jobndash Setting up the requisite accounting information
for the DistributedCache of the job if necessary ndash Copying the jobs jar and configuration to the
MapReduce system directory on the FileSystem ndash Submitting the job to the JobTracker and
optionally monitoring its status
MapReduce Details forMultimachine Clusters
Introduction
bull Whyndash datasets that canrsquot fit on a single machine ndash have time constraints that are impossible to
satisfy with a small number of machines ndash need to rapidly scale the computing power
applied to a problem due to varying input set sizes
Requirements for Successful MapReduce Jobs
bull Mapperndash ingest the input and process the input record sending
forward the records that can be passed to the reduce task or to the final output directly
bull Reducerndash Accept the key and value groups that passed through the
mapper and generate the final output
bull job must be configured with the location and type of the input data the mapper class to use the number of reduce tasks required and the reducer class and IO types
Requirements for Successful MapReduce Jobs
bull The TaskTracker service will actually run your map and reduce tasks and the JobTracker service will distribute the tasks and their input split to the various trackers
bull The cluster must be configured with the nodes that will run the TaskTrackers and with the number of TaskTrackers to run per node
Requirements for Successful MapReduce Jobs
bull Three levels of configuration to address to configure MapReduce on your clusterndash configure the machines ndash the Hadoop MapReduce framework ndash the jobs themselves
Launching MapReduce Jobs
bull launch the preceding example from the command linegt binhadoop [-libjars jar1jarjar2jarjar3jar] jar myjarjar MyClass
MapReduce-Specific Configuration for Each Machine in a Cluster
bull install any standard JARs that your application usesbull It is probable that your applications will have a
runtime environment that is deployed from a configuration management application which you will also need to deploy to each machine
bull The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks
bull The confslaves file should have the set of machines to serve as TaskTracker nodes
DistributedCache
bull distributes application-specific large read-only files efficiently
bull a facility provided by the MapReduce framework to cache files (text archives jars and so on) needed by applications
bull The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node
Adding Resources to the Task Classpath
bull Methodsndash JobConfsetJar(String jar) Sets the user JAR for the
MapReduce jobndash JobConfsetJarByClass(Class cls) Determines the
JAR that contains the class cls and calls JobConfsetJar(jar) with that JAR
ndash DistributedCacheaddArchiveToClassPath(Path archive Configuration conf) Adds an archive path to the current set of classpath entries
Configuring the Hadoop Core Cluster Information
bull Setting the Default File System URI
bull You can also use the JobConf object to set the default file systemndash confset( fsdefaultname
hdfsNamenodeHostnamePORT)
Configuring the Hadoop Core Cluster Information
bull Setting the JobTracker Location
bull use the JobConf object to set the JobTracker informationndash confset( mapredjobtracker
JobtrackerHostnamePORT)
- MapReduce Programming
- Slide 2
- MapReduce Program Structure
- Slide 4
- Slide 5
- MapReduce Job
- Handled parts
- Configuration of a Job
- Slide 9
- Input Splitting
- Specifying Input Formats
- Slide 12
- Setting the Output Parameters
- Slide 14
- A Simple Map Function IdentityMapper
- A Simple Reduce Function IdentityReducer
- Slide 17
- Configuring the Reduce Phase
- How Many Maps
- Reducer
- How Many Reduces
- Slide 22
- Reporter
- JobTracker
- Job Submission and Monitoring
- MapReduce Details for Multimachine Clusters
- Introduction
- Requirements for Successful MapReduce Jobs
- Slide 29
- Slide 30
- Launching MapReduce Jobs
- Slide 32
- MapReduce-Specific Configuration for Each Machine in a Cluster
- DistributedCache
- Adding Resources to the Task Classpath
- Configuring the Hadoop Core Cluster Information
- Slide 37
-
How Many Maps
bull The number of maps is usually driven by the total size of the inputs that is the total number of blocks of the input files
bull The right level of parallelism for maps seems to be around 10-100 maps per-node
bull it is best if the maps take at least a minute to execute
bull setNumMapTasks(int)
Reducerbull Reducer reduces a set of intermediate values which
share a key to a smaller set of valuesbull Reducer has 3 primary phases shuffle sort and
reducebull Shufflendash Input to the Reducer is the sorted output of the mappers
In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP
bull Sortndash The framework groups Reducer inputs by keys (since
different mappers may have output the same key) in this stage
ndash The shuffle and sort phases occur simultaneously while map-outputs are being fetched they are merged
How Many Reduces
bull The right number of reduces seems to be 095 or 175 multiplied by (ltno of nodesgt mapredtasktrackerreducetasksmaximum)
bull With 095 all of the reduces can launch immediately and start transferring map outputs as the maps finish
bull With 175 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing
How Many Reducesbull Increasing the number of reduces increases the
framework overhead but increases load balancing and lowers the cost of failures
bull Reducer NONEndash It is legal to set the number of reduce-tasks to zero if
no reduction is desiredndash In this case the outputs of the map-tasks go directly
to the FileSystem into the output path set by setOutputPath(Path)
ndash The framework does not sort the map-outputs before writing them out to the FileSystem
Reporter
bull Reporter is a facility for MapReduce applications to report progress set application-level status messages and update Counters
bull Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive
JobTracker
bull JobTracker is the central location for submitting and tracking MR jobs in a network environment
bull JobClient is the primary interface by which user-job interacts with the JobTrackerndash provides facilities to submit jobs track their
progress access component-tasks reports and logs get the MapReduce clusters status information and so on
Job Submission and Monitoring
bull The job submission process involvesndash Checking the input and output specifications of
the jobndash Computing the InputSplit values for the jobndash Setting up the requisite accounting information
for the DistributedCache of the job if necessary ndash Copying the jobs jar and configuration to the
MapReduce system directory on the FileSystem ndash Submitting the job to the JobTracker and
optionally monitoring its status
MapReduce Details forMultimachine Clusters
Introduction
bull Whyndash datasets that canrsquot fit on a single machine ndash have time constraints that are impossible to
satisfy with a small number of machines ndash need to rapidly scale the computing power
applied to a problem due to varying input set sizes
Requirements for Successful MapReduce Jobs
bull Mapperndash ingest the input and process the input record sending
forward the records that can be passed to the reduce task or to the final output directly
bull Reducerndash Accept the key and value groups that passed through the
mapper and generate the final output
bull job must be configured with the location and type of the input data the mapper class to use the number of reduce tasks required and the reducer class and IO types
Requirements for Successful MapReduce Jobs
bull The TaskTracker service will actually run your map and reduce tasks and the JobTracker service will distribute the tasks and their input split to the various trackers
bull The cluster must be configured with the nodes that will run the TaskTrackers and with the number of TaskTrackers to run per node
Requirements for Successful MapReduce Jobs
bull Three levels of configuration to address to configure MapReduce on your clusterndash configure the machines ndash the Hadoop MapReduce framework ndash the jobs themselves
Launching MapReduce Jobs
bull launch the preceding example from the command linegt binhadoop [-libjars jar1jarjar2jarjar3jar] jar myjarjar MyClass
MapReduce-Specific Configuration for Each Machine in a Cluster
bull install any standard JARs that your application usesbull It is probable that your applications will have a
runtime environment that is deployed from a configuration management application which you will also need to deploy to each machine
bull The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks
bull The confslaves file should have the set of machines to serve as TaskTracker nodes
DistributedCache
bull distributes application-specific large read-only files efficiently
bull a facility provided by the MapReduce framework to cache files (text archives jars and so on) needed by applications
bull The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node
Adding Resources to the Task Classpath
bull Methodsndash JobConfsetJar(String jar) Sets the user JAR for the
MapReduce jobndash JobConfsetJarByClass(Class cls) Determines the
JAR that contains the class cls and calls JobConfsetJar(jar) with that JAR
ndash DistributedCacheaddArchiveToClassPath(Path archive Configuration conf) Adds an archive path to the current set of classpath entries
Configuring the Hadoop Core Cluster Information
bull Setting the Default File System URI
bull You can also use the JobConf object to set the default file systemndash confset( fsdefaultname
hdfsNamenodeHostnamePORT)
Configuring the Hadoop Core Cluster Information
bull Setting the JobTracker Location
bull use the JobConf object to set the JobTracker informationndash confset( mapredjobtracker
JobtrackerHostnamePORT)
- MapReduce Programming
- Slide 2
- MapReduce Program Structure
- Slide 4
- Slide 5
- MapReduce Job
- Handled parts
- Configuration of a Job
- Slide 9
- Input Splitting
- Specifying Input Formats
- Slide 12
- Setting the Output Parameters
- Slide 14
- A Simple Map Function IdentityMapper
- A Simple Reduce Function IdentityReducer
- Slide 17
- Configuring the Reduce Phase
- How Many Maps
- Reducer
- How Many Reduces
- Slide 22
- Reporter
- JobTracker
- Job Submission and Monitoring
- MapReduce Details for Multimachine Clusters
- Introduction
- Requirements for Successful MapReduce Jobs
- Slide 29
- Slide 30
- Launching MapReduce Jobs
- Slide 32
- MapReduce-Specific Configuration for Each Machine in a Cluster
- DistributedCache
- Adding Resources to the Task Classpath
- Configuring the Hadoop Core Cluster Information
- Slide 37
-
Reducerbull Reducer reduces a set of intermediate values which
share a key to a smaller set of valuesbull Reducer has 3 primary phases shuffle sort and
reducebull Shufflendash Input to the Reducer is the sorted output of the mappers
In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP
bull Sortndash The framework groups Reducer inputs by keys (since
different mappers may have output the same key) in this stage
ndash The shuffle and sort phases occur simultaneously while map-outputs are being fetched they are merged
How Many Reduces
bull The right number of reduces seems to be 095 or 175 multiplied by (ltno of nodesgt mapredtasktrackerreducetasksmaximum)
bull With 095 all of the reduces can launch immediately and start transferring map outputs as the maps finish
bull With 175 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing
How Many Reducesbull Increasing the number of reduces increases the
framework overhead but increases load balancing and lowers the cost of failures
bull Reducer NONEndash It is legal to set the number of reduce-tasks to zero if
no reduction is desiredndash In this case the outputs of the map-tasks go directly
to the FileSystem into the output path set by setOutputPath(Path)
ndash The framework does not sort the map-outputs before writing them out to the FileSystem
Reporter
bull Reporter is a facility for MapReduce applications to report progress set application-level status messages and update Counters
bull Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive
JobTracker
bull JobTracker is the central location for submitting and tracking MR jobs in a network environment
bull JobClient is the primary interface by which user-job interacts with the JobTrackerndash provides facilities to submit jobs track their
progress access component-tasks reports and logs get the MapReduce clusters status information and so on
Job Submission and Monitoring
bull The job submission process involvesndash Checking the input and output specifications of
the jobndash Computing the InputSplit values for the jobndash Setting up the requisite accounting information
for the DistributedCache of the job if necessary ndash Copying the jobs jar and configuration to the
MapReduce system directory on the FileSystem ndash Submitting the job to the JobTracker and
optionally monitoring its status
MapReduce Details forMultimachine Clusters
Introduction
bull Whyndash datasets that canrsquot fit on a single machine ndash have time constraints that are impossible to
satisfy with a small number of machines ndash need to rapidly scale the computing power
applied to a problem due to varying input set sizes
Requirements for Successful MapReduce Jobs
bull Mapperndash ingest the input and process the input record sending
forward the records that can be passed to the reduce task or to the final output directly
bull Reducerndash Accept the key and value groups that passed through the
mapper and generate the final output
bull job must be configured with the location and type of the input data the mapper class to use the number of reduce tasks required and the reducer class and IO types
Requirements for Successful MapReduce Jobs
bull The TaskTracker service will actually run your map and reduce tasks and the JobTracker service will distribute the tasks and their input split to the various trackers
bull The cluster must be configured with the nodes that will run the TaskTrackers and with the number of TaskTrackers to run per node
Requirements for Successful MapReduce Jobs
bull Three levels of configuration to address to configure MapReduce on your clusterndash configure the machines ndash the Hadoop MapReduce framework ndash the jobs themselves
Launching MapReduce Jobs
bull launch the preceding example from the command linegt binhadoop [-libjars jar1jarjar2jarjar3jar] jar myjarjar MyClass
MapReduce-Specific Configuration for Each Machine in a Cluster
• Install any standard JARs that your application uses.
• It is probable that your applications will have a runtime environment that is deployed from a configuration management application, which you will also need to deploy to each machine.
• The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks.
• The conf/slaves file should have the set of machines to serve as TaskTracker nodes.
DistributedCache
• DistributedCache distributes application-specific, large, read-only files efficiently.
• It is a facility provided by the MapReduce framework to cache files (text, archives, JARs, and so on) needed by applications.
• The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
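A small sketch of both sides of the DistributedCache API; the HDFS path is a hypothetical example.

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheExample {
        // Job-side: register a read-only file before submitting the job.
        public static void setup(JobConf conf) throws Exception {
            DistributedCache.addCacheFile(new URI("/shared/lookup.txt"), conf);
        }

        // Task-side (e.g. from Mapper.configure): find the local copies
        // that the framework placed on the slave node.
        public static Path[] localCopies(JobConf conf) throws Exception {
            return DistributedCache.getLocalCacheFiles(conf);
        }
    }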
Adding Resources to the Task Classpath
• Methods:
  – JobConf.setJar(String jar): sets the user JAR for the MapReduce job.
  – JobConf.setJarByClass(Class cls): determines the JAR that contains the class cls and calls JobConf.setJar(jar) with that JAR.
  – DistributedCache.addArchiveToClassPath(Path archive, Configuration conf): adds an archive path to the current set of classpath entries.
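The same methods in use, as a brief sketch (the archive path is a hypothetical example):

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class ClasspathSetup {
        public static void addResources(JobConf conf) throws Exception {
            // Ship the JAR that contains the job's classes.
            conf.setJarByClass(ClasspathSetup.class);
            // Add an extra archive to the classpath of every task.
            DistributedCache.addArchiveToClassPath(new Path("/libs/extra-lib.jar"), conf);
        }
    }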
Configuring the Hadoop Core Cluster Information
• Setting the Default File System URI
• You can also use the JobConf object to set the default file system:
  – conf.set("fs.default.name", "hdfs://NamenodeHostname:PORT");
Configuring the Hadoop Core Cluster Information
• Setting the JobTracker Location
• Use the JobConf object to set the JobTracker information:
  – conf.set("mapred.job.tracker", "JobtrackerHostname:PORT");
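Putting the two settings together in one place (hostnames and ports are placeholders, as on the slides):

    import org.apache.hadoop.mapred.JobConf;

    public class ClusterLocation {
        public static JobConf pointAtCluster() {
            JobConf conf = new JobConf();
            conf.set("fs.default.name", "hdfs://NamenodeHostname:PORT");   // default file system URI
            conf.set("mapred.job.tracker", "JobtrackerHostname:PORT");     // JobTracker location
            return conf;
        }
    }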