
Page 1:

Hadoop Framework

Spring - 2014

Jordi Torres, UPC - BSC www.JordiTorres.eu @JordiTorresBCN

technology basics for data scientists

Page 2:

Warning!

The slides are only a presentation guide.

We will discuss and debate additional concepts/ideas that come up during your participation (and we may skip part of the content).

 

Page 3:

Hadoop MapReduce

§  Hadoop is the dominant open source MapReduce implementation

§  Funded by Yahoo!, it emerged in 2006

§  The Hadoop project is now hosted by Apache

§  Implemented in Java

§  (The data to be processed must be loaded into, e.g., the Hadoop Distributed Filesystem)

Source: Wikipedia

Page 4:

Hadoop MapReduce

§  Hadoop is an open source MapReduce runtime provided by the Apache Software Foundation

§  The de facto standard, free, open-source MapReduce implementation

§  Endorsed by many organizations: http://wiki.apache.org/hadoop/PoweredBy

Page 5:

Hadoop - Architecture

Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/Part2.EEDC_.BigData.HADOOP.pdf

Page 6:

Hadoop: Very high-level overview

§  When data is loaded into the system, it is split into “blocks” of 64 MB or 128 MB

§  A Map task typically works on a single block

§  A master program allocates work to the nodes (which work in parallel) such that a Map task will work on a block of data stored locally on that node (see the toy sketch below)

§  If a node fails, the master detects that failure and re-assigns the work to a different node in the system
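To make the allocation idea concrete, here is a toy sketch in Java. This is not Hadoop's scheduler code; every class and field name in it is hypothetical. It only captures the two points above: prefer a node that already holds the block, and hand a failed node's blocks to other live nodes.

// Toy sketch, NOT Hadoop's real JobTracker code; all names are hypothetical.
import java.util.*;

class ToyMaster {
    // blockId -> nodes holding a replica of that block (as HDFS would report)
    private final Map<Integer, List<String>> blockReplicas;
    private final Set<String> liveNodes;   // must be mutable for handleFailure()

    ToyMaster(Map<Integer, List<String>> blockReplicas, Set<String> liveNodes) {
        this.blockReplicas = blockReplicas;
        this.liveNodes = liveNodes;
    }

    // Prefer a node that stores the block locally; otherwise pick any live node
    // (which then has to read the block over the network).
    String assignMapTask(int blockId) {
        for (String node : blockReplicas.getOrDefault(blockId, Collections.emptyList())) {
            if (liveNodes.contains(node)) return node;      // data-local assignment
        }
        return liveNodes.iterator().next();                 // assumes at least one live node
    }

    // On failure, drop the node and re-assign its blocks to other nodes.
    Map<Integer, String> handleFailure(String failedNode, List<Integer> itsBlocks) {
        liveNodes.remove(failedNode);
        Map<Integer, String> reassigned = new HashMap<>();
        for (int blockId : itsBlocks) {
            reassigned.put(blockId, assignMapTask(blockId));
        }
        return reassigned;
    }
}

In real Hadoop this role is played by the JobTracker, using block locations reported by the NameNode and heartbeats from the TaskTrackers (see the cluster-components slides that follow).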

Page 7:

Hadoop essentials

§  Computation:
  –  Move the computation to the data

§  Storage:
  –  Keeping track of the data and metadata
  –  Data is sharded across the cluster

§  Cluster management tools

§  ...

Page 8:

(default) Hadoop’s Stack

  Applications
  Data Services:     HBase (NoSQL databases)
  Compute Services:  Hadoop’s MapReduce
  Storage Services:  Hadoop Distributed File System (HDFS)
  Resource Fabrics

more detail in the next part!

Page 9:

Basic Cluster Components

•  One of each:
  –  Namenode (NN)
  –  Jobtracker (JT)

•  A set of each per slave machine:
  –  Tasktracker (TT)
  –  Datanode (DN)

Page 10:

Putting Everything Together

  namenode:             namenode daemon
  job submission node:  jobtracker
  slave node (x3):      tasktracker + datanode daemon, on top of the Linux file system

Page 11:

Anatomy of a Job

§  A MapReduce program in Hadoop = a Hadoop job

§  Jobs are divided into map and reduce tasks

§  An instance of running a task is called a task attempt

§  Multiple jobs can be composed into a workflow (a small chaining sketch follows)
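The chaining sketch promised above: a workflow can be as simple as two jobs composed by feeding the output path of the first job to the second (classic mapred API, as used later in these slides). The per-job mapper/reducer configuration is deliberately elided here.

// Illustrative sketch only: a two-step "workflow" where the output of the
// first job becomes the input of the second. The per-job mapper/reducer
// settings are elided; see the WordCount.java slides for a full driver.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoStepWorkflow {
    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // output of step 1, input of step 2
        Path output = new Path(args[2]);

        JobConf step1 = new JobConf();
        step1.setJobName("workflow-step-1");
        // ... set mapper, reducer, key/value classes for step 1 here ...
        FileInputFormat.setInputPaths(step1, input);
        FileOutputFormat.setOutputPath(step1, intermediate);
        JobClient.runJob(step1);                 // blocks until step 1 finishes

        JobConf step2 = new JobConf();
        step2.setJobName("workflow-step-2");
        // ... set mapper, reducer, key/value classes for step 2 here ...
        FileInputFormat.setInputPaths(step2, intermediate);
        FileOutputFormat.setOutputPath(step2, output);
        JobClient.runJob(step2);
    }
}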

Page 12:

Anatomy of a Job

§  Job submission process

•  The client (i.e., the driver program) creates a job, configures it, and submits it to the JobTracker

•  The JobClient computes the input splits (on the client side)

•  Job data (JAR, configuration XML) are sent to the JobTracker

•  The JobTracker puts the job data in a shared location and enqueues the tasks

•  TaskTrackers poll for tasks

•  Off to the races… (a minimal submission sketch follows)
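To ground this sequence, here is a hedged driver sketch using the classic org.apache.hadoop.mapred API that appears later in these slides. It assumes the WordCount.Map and WordCount.Reduce classes from the WordCount.java pages are on the classpath; submitJob() returns immediately, and the client simply polls while the TaskTrackers pick up the enqueued tasks.

// Illustrative sketch only (old "mapred" API). Assumes the WordCount.Map and
// WordCount.Reduce classes shown on the later slides are available.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SubmitAndPoll {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount-submit-sketch");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordCount.Map.class);
        conf.setReducerClass(WordCount.Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submit without blocking: the JobTracker enqueues the tasks and the
        // TaskTrackers poll for work; the client just watches the progress.
        JobClient client = new JobClient(conf);
        RunningJob job = client.submitJob(conf);
        while (!job.isComplete()) {
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");
    }
}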

Page 13:

Running a MapReduce job with Hadoop

§  Steps:
  –  Defining the MapReduce stages in a Java program
  –  Loading the data into the Hadoop Distributed Filesystem
  –  Submitting the job for execution
  –  Retrieving the results from the filesystem
  (a small HDFS I/O sketch follows below)

MapReduce has been implemented in a variety of other programming languages and systems.

Several NoSQL database systems have integrated MapReduce (later in this course).
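As an illustration of the "load" and "retrieve" steps, here is a small hedged sketch using the HDFS FileSystem Java API; all paths and file names are invented for the example.

// Illustrative sketch: copy input into HDFS before a job and fetch the results
// afterwards. All paths here are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // 1) Load the data into the Hadoop Distributed Filesystem
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/demo/input/input.txt"));

        // 2) ... submit the MapReduce job (see the WordCount.java slides) ...

        // 3) Retrieve the results from the filesystem
        fs.copyToLocalFile(new Path("/user/demo/output/part-00000"),
                           new Path("/tmp/wordcount-result.txt"));

        fs.close();
    }
}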

Page 14:

Hadoop and enterprise?

§  Hadoop is a complement to a relational data warehouse
  –  Enterprises are generally not replacing their relational data warehouse with Hadoop

§  Hadoop’s strengths
  –  Inexpensive
  –  High reliability
  –  Extreme scalability
  –  Flexibility: data can be added without defining a schema

§  Hadoop’s weaknesses
  –  Hadoop is not an interactive query environment
  –  Processing data in Hadoop requires writing code

Page 15:

Who is using Hadoop?

Source: Wikipedia, April 2013

Page 16:

What is MapReduce model used for?

§  At Google: –  Index construction for Google Search –  Article clustering for Google News –  Statistical machine translation

§  At Yahoo!: –  “Web map” powering Yahoo! Search –  Spam detection for Yahoo! Mail

§  At Facebook: –  Data mining –  Ad optimization –  Spam detection

Page 17:

Hadoop 1.0

04-01-2012:
§  The Apache Software Foundation delivers Hadoop 1.0, the much-anticipated 1.0 version of the popular open-source platform for storing and processing large amounts of data
  –  The result of six years of development, production experience, extensive testing, and feedback from hundreds of knowledgeable users, data scientists and systems engineers, culminating in a highly stable, enterprise-ready release of the fastest-growing big data platform

Page 18:

Getting Started with Hadoop

§ Different ways to write jobs:

  –  Java API
  –  Hadoop Streaming (for Python, Perl, etc.)
  –  Pipes API (C++)
  –  R
  –  …

Page 19:

Hadoop API

•  Different APIs to write Hadoop programs:
  –  A rich Java API (the main way to write Hadoop programs)
  –  A Streaming API that can be used to write map and reduce functions in any programming language, using standard input and output (see the sketch below)
  –  A C++ API (Hadoop Pipes)
  –  Higher-level languages (e.g., Pig, Hive)
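Because the Streaming contract is just "read lines from standard input, write key<TAB>value lines to standard output", even plain Java can serve as a streaming mapper. The sketch below only illustrates that contract; the following slides show the same idea in Python and R.

// Illustration of the Streaming contract: a mapper is any program that reads
// lines on stdin and emits "key<TAB>value" lines on stdout.
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");   // emit (word, 1)
                }
            }
        }
    }
}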

Page 20:

Hadoop API

•  Mapper
  –  void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
  –  void configure(JobConf job)
  –  void close() throws IOException

•  Reducer/Combiner
  –  void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
  –  void configure(JobConf job)
  –  void close() throws IOException

•  Partitioner
  –  int getPartition(K2 key, V2 value, int numPartitions)
  (a minimal Partitioner sketch follows)
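As a hedged example of the Partitioner interface listed above, here is a minimal implementation for the old mapred API. Functionally it does what the default HashPartitioner already does, so it is purely illustrative.

// Minimal illustrative Partitioner (old "mapred" API): routes each key to a
// reducer by hash, like the default HashPartitioner.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {
    @Override
    public void configure(JobConf job) {
        // nothing to configure in this sketch
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // mask the sign bit so the result is always a valid partition index
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be wired into a job with conf.setPartitionerClass(WordPartitioner.class) in the driver.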

Page 21:

WordCount.java  

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {
    ……
}

Page 22:

WordCount.java  

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

Page 23:

WordCount.java  

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

Page 24:

WordCount.java    

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}

Page 25:

E.g. Common wordcount

Hello World
Hello MapReduce

Fig1: Sample input

Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/Part2.EEDC_.BigData.HADOOP.pdf

Page 26:

E.g. Common wordcount


void map(string i, string line):
    for word in line:
        print word, 1

Fig 2: wordcount – map function

Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/Part2.EEDC_.BigData.HADOOP.pdf

Page 27:

E.g. Common wordcount

void reduce(string word, list partial_counts):
    total = 0
    for c in partial_counts:
        total += c
    print word, total

Fig 3: wordcount – reduce function

Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/Part2.EEDC_.BigData.HADOOP.pdf

Page 28:

E.g. Common wordcount

Input:                         Hello World / Hello MapReduce
                                   |  MAP
                                   v
First intermediate output:     Hello, 1 / World, 1
Second intermediate output:    Hello, 1 / MapReduce, 1
                                   |  REDUCE
                                   v
Final output:                  Hello, 2 / MapReduce, 1 / World, 1

Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/Part2.EEDC_.BigData.HADOOP.pdf

Page 29:

Word Count Python Mapper

import sys

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()

Source: Robert Grossman – Tutorial Supercomputing 2011

Page 30:

Word Count R Mapper

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
# splitIntoWords is not shown on this slide; a typical definition would be:
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)

Source: Robert Grossman – Tutorial Supercomputing 2011

Page 31:

Word Count Java Mapper

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

Source: Robert Grossman – Tutorial Supercomputing 2011

Page 32:

Code Comparison – Word Count Mapper

Python

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print '%s%s%d' % (word, separator, 1)

R

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)

Java

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

Source: Robert Grossman – Tutorial Supercomputing 2011

Page 33:

Word Count Python Reducer

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

if __name__ == "__main__":
    main()

Source: Robert Grossman – Tutorial Supercomputing 2011

Page 34:

Word Count R Reducer

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count

Source: Robert Grossman – Tutorial Supercomputing 2011

Page 35:

Word Count R Reducer (cont’d)

    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    } else
        assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

Page 36:

Word Count Java Reducer

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Page 37:

Code Comparison – Word Count Reducer

Python

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

R

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    } else
        assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

Java

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Source: Robert Grossman – Tutorial Supercomputing 2011