TRANSCRIPT
Hadoop Framework
Spring - 2014
Jordi Torres, UPC - BSC www.JordiTorres.eu @JordiTorresBCN
technology basics for data scientists
Warning!
Slides are only a presentation guide.
We will discuss and debate additional concepts/ideas that come up
during your participation! (and we may skip part of the content)
Hadoop MapReduce
§ Hadoop is the dominant open source MapReduce implementation
§ Funded by Yahoo, it emerged in 2006
§ The Hadoop project is now hosted by Apache
§ Implemented in Java
§ (The data to be processed must first be loaded into, e.g., the Hadoop Distributed Filesystem)
Source: Wikipedia
§ Hadoop is an open source MapReduce runtime provided by the Apache Software Foundation
§ De-facto standard, free, open-source MapReduce implementation.
§ Endorsed by: http://wiki.apache.org/hadoop/PoweredBy
Hadoop - Architecture
Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/Part2.EEDC_.BigData.HADOOP.pdf
Hadoop: Very high-level overview
§ When data is loaded into the system, it is split into “blocks” of 64 MB/128 MB
§ Map tasks typically work on a single block
§ A master program allocates work to nodes (that work in parallel) such that a Map task will work on a block of data stored locally on that node
§ If a node fails, the master will detect that failure and re-assign the work to a different node in the system
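This allocate-locally-and-recover behaviour can be sketched with a toy Python function (the function name and data structures are invented for this illustration; this is not Hadoop's actual scheduler):

```python
def assign_tasks(block_locations, live_nodes):
    """Toy data-local scheduler: assign each block's map task to a live
    node that stores a replica of that block, when one exists."""
    assignments = {}
    for block, replicas in block_locations.items():
        local = [n for n in replicas if n in live_nodes]
        # prefer a node holding the block locally; otherwise any live node
        assignments[block] = local[0] if local else sorted(live_nodes)[0]
    return assignments

# blocks b1..b3, each replicated on two of three nodes
blocks = {"b1": ["node1", "node2"],
          "b2": ["node2", "node3"],
          "b3": ["node1", "node3"]}

plan = assign_tasks(blocks, {"node1", "node2", "node3"})
# if node1 fails, the master re-assigns its blocks to surviving replicas
plan_after_failure = assign_tasks(blocks, {"node2", "node3"})
```

Each block still runs on a node holding a replica after the failure, which is the point of replicated, sharded storage.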
Hadoop essentials
§ Computation:
  – Move the computation to the data
§ Storage:
  – Keeping track of the data and metadata
  – Data is sharded across the cluster
§ Cluster management tools
§ ...
(default) Hadoop’s Stack
[Stack diagram, top to bottom: Applications; Data Services (HBase: NoSQL database); Compute Services (Hadoop's MapReduce); Storage Services (Hadoop Distributed File System, HDFS); Resource Fabrics.]
More detail in the next part!
Basic Cluster Components
• One of each:
  – Namenode (NN)
  – Jobtracker (JT)
• Set of each per slave machine:
  – Tasktracker (TT)
  – Datanode (DN)
Putting Everything Together
[Diagram: each slave node runs a datanode daemon and a tasktracker on top of its local Linux file system; on the master side, the namenode runs the namenode daemon, and the job submission node runs the jobtracker.]
Anatomy of a Job
§ MapReduce program in Hadoop = Hadoop job
§ Jobs are divided into map and reduce tasks
§ An instance of running a task is called a task attempt
§ Multiple jobs can be composed into a workflow
§ Job submission process
• Client (i.e., driver program) creates a job, configures it, and submits it to job tracker
• JobClient computes input splits (on client end)
• Job data (jar, configuration XML) are sent to JobTracker
• JobTracker puts job data in shared location, enqueues tasks
• TaskTrackers poll for tasks
• Off to the races…
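The "computes input splits" step can be illustrated with a toy split calculation (the name `compute_splits` is invented for this sketch; real Hadoop derives splits via its InputFormat and also respects record boundaries):

```python
def compute_splits(file_size, block_size=64 * 1024 * 1024):
    """Toy input-split computation: carve a file's byte range into
    block-sized (offset, length) pairs, one map task per split."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

# a 150 MB file with 64 MB blocks: two full blocks plus a 22 MB tail
print(compute_splits(150 * 1024 * 1024))
```

The number of splits computed on the client is what determines how many map tasks the JobTracker enqueues.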
Running MapReduce job with Hadoop
§ Steps:
  – Defining the MapReduce stages in a Java program
  – Loading the data into the Hadoop Distributed Filesystem
  – Submitting the job for execution
  – Retrieving the results from the filesystem
MapReduce has been implemented in a variety of other programming languages and systems.
Several NoSQL database systems have integrated MapReduce (later in this course)
Hadoop and enterprise?
§ Hadoop is a complement to a relational data warehouse
  – Enterprises are generally not replacing their relational data warehouse with Hadoop
§ Hadoop's strengths
  – Inexpensive
  – High reliability
  – Extreme scalability
  – Flexibility: data can be added without defining a schema
§ Hadoop's weaknesses
  – Hadoop is not an interactive query environment
  – Processing data in Hadoop requires writing code
Who is using Hadoop?
Source: Wikipedia, April 2013
What is MapReduce model used for?
§ At Google:
  – Index construction for Google Search
  – Article clustering for Google News
  – Statistical machine translation
§ At Yahoo!:
  – “Web map” powering Yahoo! Search
  – Spam detection for Yahoo! Mail
§ At Facebook:
  – Data mining
  – Ad optimization
  – Spam detection
Hadoop 1.0
04-01-2012: § The Apache Software Foundation delivers Hadoop 1.0, the much-anticipated 1.0 version of the popular open-source platform for storing and processing large amounts of data.
– The result of six years of development, production experience, extensive testing, and feedback from hundreds of knowledgeable users, data scientists and systems engineers, culminating in a highly stable, enterprise-ready release of the fastest-growing big data platform.
Getting Started with Hadoop
§ Different ways to write jobs:
– Java API
– Hadoop Streaming (for Python, Perl, etc.)
– Pipes API (C++)
– R
– …
Hadoop API
• Different APIs to write Hadoop programs:
  – A rich Java API (the main way to write Hadoop programs)
  – A Streaming API that can be used to write map and reduce functions in any programming language (using standard input and output)
  – A C++ API (Hadoop Pipes)
  – Higher-level languages (e.g., Pig, Hive)
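The Streaming contract — mapper and reducer communicating over standard input/output, with a sort between them — can be mimicked locally. A simplified Python sketch (the pipe and sort are simulated in-process; real Streaming launches the scripts as separate processes):

```python
def streaming_map(lines):
    # emit "word\t1" for every word, as a streaming mapper would print
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reduce(sorted_pairs):
    # consume sorted "word\tcount" lines and sum counts per word,
    # relying on the sort to group equal keys together
    current, total = None, 0
    for pair in sorted_pairs:
        word, count = pair.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# simulate: cat input | mapper | sort | reducer
mapped = sorted(streaming_map(["Hello World", "Hello MapReduce"]))
result = list(streaming_reduce(mapped))
```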
Hadoop API
• Mapper
  – void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
  – void configure(JobConf job)
  – void close() throws IOException
• Reducer/Combiner
  – void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
  – void configure(JobConf job)
  – void close() throws IOException
• Partitioner
  – int getPartition(K2 key, V2 value, int numPartitions)
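Hadoop's default partitioner (HashPartitioner) implements getPartition by hashing the key; the idea can be sketched in Python (using Python's hash in place of Java's hashCode, so the exact partition numbers differ from Hadoop's):

```python
def get_partition(key, num_partitions):
    # non-negative hash of the key, modulo the number of reduce tasks;
    # equal keys therefore always land in the same partition
    return (hash(key) & 0x7FFFFFFF) % num_partitions

# every occurrence of the same word goes to the same reducer
partitions = [get_partition(w, 4) for w in ["Hello", "World", "Hello"]]
```

This invariant — same key, same partition — is what lets the reduce function see all values for a key together.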
WordCount.java
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {
    ……
}
WordCount.java
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
WordCount.java
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
WordCount.java
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
E.g. Common wordcount
Hello World Hello MapReduce
Fig1: Sample input
Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/Part2.EEDC_.BigData.HADOOP.pdf
E.g. Common wordcount
void map(string i, string line):
    for word in line:
        print word, 1
Fig 2: wordcount – map function
Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/Part2.EEDC_.BigData.HADOOP.pdf
E.g. Common wordcount
void reduce(string word, list partial_counts):
    total = 0
    for c in partial_counts:
        total += c
    print word, total
Fig 3: wordcount – reduce function
Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/Part2.EEDC_.BigData.HADOOP.pdf
E.g. Common wordcount
Input:
    Hello World
    Hello MapReduce

  | MAP
  v

First intermediate output:        Second intermediate output:
    Hello , 1                         Hello , 1
    World , 1                         MapReduce , 1

  | REDUCE
  v

Final output:
    Hello , 2
    World , 1
    MapReduce , 1
Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/Part2.EEDC_.BigData.HADOOP.pdf
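The dataflow in this figure can be replayed with a few lines of plain Python (a model of the programming model only, not of Hadoop itself):

```python
from collections import defaultdict

def map_phase(records):
    # one (word, 1) pair per word, per input record
    return [(word, 1) for line in records for word in line.split()]

def reduce_phase(pairs):
    # group the pairs by key and sum the partial counts
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

counts = reduce_phase(map_phase(["Hello World", "Hello MapReduce"]))
```

Running it reproduces the final output of the figure: Hello 2, World 1, MapReduce 1.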
Word Count Python Mapper

import sys

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()
Source: Robert Grossman – Tutorial Supercomputing 2011
Word Count R Mapper

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)
Source: Robert Grossman – Tutorial Supercomputing 2011
Word Count Java Mapper

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
Source: Robert Grossman – Tutorial Supercomputing 2011
Word Count Python Reducer

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

if __name__ == "__main__":
    main()
Source: Robert Grossman – Tutorial Supercomputing 2011
Word Count R Reducer

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}

env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
Source: Robert Grossman – Tutorial Supercomputing 2011
Word Count R Reducer (cont'd)

    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    } else
        assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")
Word Count Java Reducer

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Source: Robert Grossman – Tutorial Supercomputing 2011