
Page 1:

Hadoop Framework

Spring - 2014

Jordi Torres, UPC - BSC www.JordiTorres.eu @JordiTorresBCN

technology basics for data scientists

Page 2:

Warning!

The slides are only a presentation guide.

We will discuss and debate additional concepts/ideas that come up during your participation (and we may skip part of the content).

 

Page 3:

Hadoop MapReduce

§  Hadoop is the dominant open source MapReduce implementation

§  Funded by Yahoo!, it emerged in 2006

§  The Hadoop project is now hosted by Apache

§  Implemented in Java

§  (The data to be processed must be loaded into, e.g., the Hadoop Distributed Filesystem)

Source: Wikipedia

Page 4:

Hadoop MapReduce

§  Hadoop is an open source MapReduce runtime provided by the Apache Software Foundation

§  The de facto standard, free, open-source MapReduce implementation

§  Endorsed by many organizations: http://wiki.apache.org/hadoop/PoweredBy

Page 5:

Hadoop - Architecture

Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/Part2.EEDC_.BigData.HADOOP.pdf

Page 6:

Hadoop: Very high-level overview

§  When data is loaded into the system, it is split into “blocks” of 64 MB or 128 MB

§  A Map task typically works on a single block

§  A master program allocates work to the nodes (which work in parallel) such that a Map task will work on a block of data stored locally on that node (see the toy sketch below)

§  If a node fails, the master detects that failure and re-assigns the work to a different node in the system
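To make the allocation idea concrete, here is a toy sketch in Java. This is not Hadoop's scheduler code; every class and field name in it is hypothetical. It only captures the two points above: prefer a node that already holds the block, and hand a failed node's blocks to other live nodes.

// Toy sketch, NOT Hadoop's real JobTracker code; all names are hypothetical.
import java.util.*;

class ToyMaster {
    // blockId -> nodes holding a replica of that block (as HDFS would report)
    private final Map<Integer, List<String>> blockReplicas;
    private final Set<String> liveNodes;   // must be mutable for handleFailure()

    ToyMaster(Map<Integer, List<String>> blockReplicas, Set<String> liveNodes) {
        this.blockReplicas = blockReplicas;
        this.liveNodes = liveNodes;
    }

    // Prefer a node that stores the block locally; otherwise pick any live node
    // (which then has to read the block over the network).
    String assignMapTask(int blockId) {
        for (String node : blockReplicas.getOrDefault(blockId, Collections.emptyList())) {
            if (liveNodes.contains(node)) return node;      // data-local assignment
        }
        return liveNodes.iterator().next();                 // assumes at least one live node
    }

    // On failure, drop the node and re-assign its blocks to other nodes.
    Map<Integer, String> handleFailure(String failedNode, List<Integer> itsBlocks) {
        liveNodes.remove(failedNode);
        Map<Integer, String> reassigned = new HashMap<>();
        for (int blockId : itsBlocks) {
            reassigned.put(blockId, assignMapTask(blockId));
        }
        return reassigned;
    }
}

In real Hadoop this role is played by the JobTracker, using block locations reported by the NameNode and heartbeats from the TaskTrackers (see the cluster-components slides that follow).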

Page 7:

Hadoop essentials

§  Computation:
  –  Move the computation to the data

§  Storage:
  –  Keeping track of the data and metadata
  –  Data is sharded across the cluster

§  Cluster management tools

§  ...

Page 8:

(default) Hadoop’s Stack

  Applications
  Data Services:     HBase (NoSQL databases)
  Compute Services:  Hadoop’s MapReduce
  Storage Services:  Hadoop Distributed File System (HDFS)
  Resource Fabrics

more detail in the next part!

Page 9:

Basic Cluster Components

•  One of each:
  –  Namenode (NN)
  –  Jobtracker (JT)

•  A set of each per slave machine:
  –  Tasktracker (TT)
  –  Datanode (DN)

Page 10:

Putting Everything Together

  namenode:             namenode daemon
  job submission node:  jobtracker
  slave node (x3):      tasktracker + datanode daemon, on top of the Linux file system

Page 11:

Anatomy of a Job

§  A MapReduce program in Hadoop = a Hadoop job

§  Jobs are divided into map and reduce tasks

§  An instance of running a task is called a task attempt

§  Multiple jobs can be composed into a workflow (a small chaining sketch follows)
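The chaining sketch promised above: a workflow can be as simple as two jobs composed by feeding the output path of the first job to the second (classic mapred API, as used later in these slides). The per-job mapper/reducer configuration is deliberately elided here.

// Illustrative sketch only: a two-step "workflow" where the output of the
// first job becomes the input of the second. The per-job mapper/reducer
// settings are elided; see the WordCount.java slides for a full driver.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoStepWorkflow {
    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // output of step 1, input of step 2
        Path output = new Path(args[2]);

        JobConf step1 = new JobConf();
        step1.setJobName("workflow-step-1");
        // ... set mapper, reducer, key/value classes for step 1 here ...
        FileInputFormat.setInputPaths(step1, input);
        FileOutputFormat.setOutputPath(step1, intermediate);
        JobClient.runJob(step1);                 // blocks until step 1 finishes

        JobConf step2 = new JobConf();
        step2.setJobName("workflow-step-2");
        // ... set mapper, reducer, key/value classes for step 2 here ...
        FileInputFormat.setInputPaths(step2, intermediate);
        FileOutputFormat.setOutputPath(step2, output);
        JobClient.runJob(step2);
    }
}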

Page 12:

Anatomy of a Job

§  Job submission process

•  The client (i.e., the driver program) creates a job, configures it, and submits it to the JobTracker

•  The JobClient computes the input splits (on the client side)

•  Job data (JAR, configuration XML) are sent to the JobTracker

•  The JobTracker puts the job data in a shared location and enqueues the tasks

•  TaskTrackers poll for tasks

•  Off to the races… (a minimal submission sketch follows)
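To ground this sequence, here is a hedged driver sketch using the classic org.apache.hadoop.mapred API that appears later in these slides. It assumes the WordCount.Map and WordCount.Reduce classes from the WordCount.java pages are on the classpath; submitJob() returns immediately, and the client simply polls while the TaskTrackers pick up the enqueued tasks.

// Illustrative sketch only (old "mapred" API). Assumes the WordCount.Map and
// WordCount.Reduce classes shown on the later slides are available.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SubmitAndPoll {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount-submit-sketch");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordCount.Map.class);
        conf.setReducerClass(WordCount.Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submit without blocking: the JobTracker enqueues the tasks and the
        // TaskTrackers poll for work; the client just watches the progress.
        JobClient client = new JobClient(conf);
        RunningJob job = client.submitJob(conf);
        while (!job.isComplete()) {
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");
    }
}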

Page 13:

Running a MapReduce job with Hadoop

§  Steps:
  –  Defining the MapReduce stages in a Java program
  –  Loading the data into the Hadoop Distributed Filesystem
  –  Submitting the job for execution
  –  Retrieving the results from the filesystem
  (a small HDFS I/O sketch follows below)

MapReduce has been implemented in a variety of other programming languages and systems.

Several NoSQL database systems have integrated MapReduce (later in this course).
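As an illustration of the "load" and "retrieve" steps, here is a small hedged sketch using the HDFS FileSystem Java API; all paths and file names are invented for the example.

// Illustrative sketch: copy input into HDFS before a job and fetch the results
// afterwards. All paths here are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // 1) Load the data into the Hadoop Distributed Filesystem
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/demo/input/input.txt"));

        // 2) ... submit the MapReduce job (see the WordCount.java slides) ...

        // 3) Retrieve the results from the filesystem
        fs.copyToLocalFile(new Path("/user/demo/output/part-00000"),
                           new Path("/tmp/wordcount-result.txt"));

        fs.close();
    }
}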

Page 14:

Hadoop and enterprise?

§  Hadoop is a complement to a relational data warehouse
  –  Enterprises are generally not replacing their relational data warehouse with Hadoop

§  Hadoop’s strengths
  –  Inexpensive
  –  High reliability
  –  Extreme scalability
  –  Flexibility: data can be added without defining a schema

§  Hadoop’s weaknesses
  –  Hadoop is not an interactive query environment
  –  Processing data in Hadoop requires writing code

Page 15:

Who is using Hadoop?

Source: Wikipedia, April 2013

Page 16:

What is MapReduce model used for?

§  At Google: –  Index construction for Google Search –  Article clustering for Google News –  Statistical machine translation

§  At Yahoo!: –  “Web map” powering Yahoo! Search –  Spam detection for Yahoo! Mail

§  At Facebook: –  Data mining –  Ad optimization –  Spam detection

Page 17:

Hadoop 1.0

04-01-2012:
§  The Apache Software Foundation delivers Hadoop 1.0, the much-anticipated 1.0 version of the popular open-source platform for storing and processing large amounts of data
  –  The result of six years of development, production experience, extensive testing, and feedback from hundreds of knowledgeable users, data scientists and systems engineers, culminating in a highly stable, enterprise-ready release of the fastest-growing big data platform

Page 18:

Getting Started with Hadoop

§ Different ways to write jobs:

  –  Java API
  –  Hadoop Streaming (for Python, Perl, etc.)
  –  Pipes API (C++)
  –  R
  –  …

Page 19:

Hadoop API

•  Different APIs to write Hadoop programs:
  –  A rich Java API (the main way to write Hadoop programs)
  –  A Streaming API that can be used to write map and reduce functions in any programming language, using standard input and output (see the sketch below)
  –  A C++ API (Hadoop Pipes)
  –  Higher-level languages (e.g., Pig, Hive)
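Because the Streaming contract is just "read lines from standard input, write key<TAB>value lines to standard output", even plain Java can serve as a streaming mapper. The sketch below only illustrates that contract; the following slides show the same idea in Python and R.

// Illustration of the Streaming contract: a mapper is any program that reads
// lines on stdin and emits "key<TAB>value" lines on stdout.
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");   // emit (word, 1)
                }
            }
        }
    }
}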

Page 20:

Hadoop API

•  Mapper
  –  void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
  –  void configure(JobConf job)
  –  void close() throws IOException

•  Reducer/Combiner
  –  void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
  –  void configure(JobConf job)
  –  void close() throws IOException

•  Partitioner
  –  int getPartition(K2 key, V2 value, int numPartitions)
  (a minimal Partitioner sketch follows)
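As a hedged example of the Partitioner interface listed above, here is a minimal implementation for the old mapred API. Functionally it does what the default HashPartitioner already does, so it is purely illustrative.

// Minimal illustrative Partitioner (old "mapred" API): routes each key to a
// reducer by hash, like the default HashPartitioner.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {
    @Override
    public void configure(JobConf job) {
        // nothing to configure in this sketch
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // mask the sign bit so the result is always a valid partition index
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be wired into a job with conf.setPartitionerClass(WordPartitioner.class) in the driver.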

Page 21:

WordCount.java  

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {
    ……
}

Page 22:

WordCount.java  

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

Page 23:

WordCount.java  

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

Page 24:

WordCount.java    

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}

Page 25:

E.g. Common wordcount

Hello World
Hello MapReduce

Fig1: Sample input

Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/Part2.EEDC_.BigData.HADOOP.pdf

Page 26:

E.g. Common wordcount


void map(string i, string line):
    for word in line:
        print word, 1

Fig 2: wordcount – map function

Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/Part2.EEDC_.BigData.HADOOP.pdf

Page 27:

E.g. Common wordcount

void reduce(string word, list partial_counts):
    total = 0
    for c in partial_counts:
        total += c
    print word, total

Fig 3: wordcount – reduce function

Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/Part2.EEDC_.BigData.HADOOP.pdf

Page 28:

E.g. Common wordcount

Input:                         Hello World / Hello MapReduce
                                   |  MAP
                                   v
First intermediate output:     Hello, 1 / World, 1
Second intermediate output:    Hello, 1 / MapReduce, 1
                                   |  REDUCE
                                   v
Final output:                  Hello, 2 / MapReduce, 1 / World, 1

Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/Part2.EEDC_.BigData.HADOOP.pdf

Page 29:

Word Count Python Mapper

import sys

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()

Source: Robert Grossman – Tutorial Supercomputing 2011

Page 30:

Word Count R Mapper

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
# splitIntoWords is not shown on this slide; a typical definition would be:
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)

Source: Robert Grossman – Tutorial Supercomputing 2011

Page 31:

Word Count Java Mapper

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

Source: Robert Grossman – Tutorial Supercomputing 2011

Page 32:

Code Comparison – Word Count Mapper

Python

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print '%s%s%d' % (word, separator, 1)

R

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)

Java

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

Source: Robert Grossman – Tutorial Supercomputing 2011

Page 33:

Word Count Python Reducer

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

if __name__ == "__main__":
    main()

Source: Robert Grossman – Tutorial Supercomputing 2011

Page 34:

Word Count R Reducer

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count

Source: Robert Grossman – Tutorial Supercomputing 2011

Page 35:

Word Count R Reducer (cont’d)

    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    } else
        assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

Page 36:

Word Count Java Reducer

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Page 37:

Code Comparison – Word Count Reducer

Python

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

R

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    } else
        assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

Java

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Source: Robert Grossman – Tutorial Supercomputing 2011