
Page 1:

Big Data Analytics with R and Hadoop
Chapter 4: Using HadoopStreaming with R

Department of Computer Science, SE Lab, Amarmend, 2015-4-9

Page 2:

Content

• Understanding the basics of Hadoop streaming
• Understanding how to run Hadoop streaming with R
• Exploring the HadoopStreaming R package

Page 3:

Understanding the basics of Hadoop streaming

Hadoop streaming is a Hadoop utility for running Hadoop MapReduce jobs with executable scripts, such as a Mapper and a Reducer.

This is similar to the pipe operation in Linux.

1. The text input file is printed on the stream (stdin).
2. This stream is provided as input to the Mapper.
3. The output (stdout) of the Mapper is provided as input to the Reducer.
4. The Reducer writes the output to the HDFS directory.
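The four steps above can be simulated locally with an ordinary Linux pipe, which is exactly the analogy the slide draws. In this sketch, awk stands in for the R scripts, and the input rows are invented (date,country,city,pagePath) records for illustration:

```shell
# Simulate the streaming data flow locally:
#   stdin -> mapper -> sort (shuffle) -> reducer -> stdout
# With the real R scripts this would be:
#   cat input.txt | Rscript ga-mapper.R | sort | Rscript ga-reducer.R
printf 'd1,in,Ahmedabad,/mba\nd2,in,Surat,/mca\nd3,in,Ahmedabad,/pharmacy\n' \
  | awk -F',' '{ print $3 "\t" $4 }' \
  | sort \
  | awk -F'\t' '{ a[$1] = a[$1] "," $2 }
      END { for (k in a) printf "%s\t%s\n", k, substr(a[k], 2) }' \
  | sort
```

Each stage simply reads stdin and writes stdout, which is all that Hadoop streaming requires of a Mapper or a Reducer.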

Page 4:

Understanding the basics of Hadoop streaming

The main advantage of the Hadoop streaming utility is that it allows both Java and non-Java MapReduce jobs to be executed over Hadoop clusters.

Hadoop streaming supports Perl, Python, PHP, R, and C++.

The application logic is translated into the Mapper and Reducer sections with the key and value output elements.

Three main components: Mapper, Reducer, and Driver.

Page 5:

Understanding the basics of Hadoop streaming

HadoopStreaming Components

Page 6:

Understanding the basics of Hadoop streaming

Now, assume we have implemented our Mapper and Reducer as code_mapper.R and code_reducer.R.

Format of the Hadoop streaming command:

bin/hadoop command [genericOptions] [streamingOptions]
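A concrete instance of this format, using the script names above, might look as follows; the streaming jar path and the HDFS input/output paths are illustrative and vary by installation, so this is a sketch rather than a command to copy verbatim:

```shell
bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -input  /input/ga-data.csv \
    -output /output/ga \
    -file code_mapper.R  -mapper  "Rscript code_mapper.R" \
    -file code_reducer.R -reducer "Rscript code_reducer.R"
```

Here -file ships each script to the cluster nodes, while -mapper and -reducer tell the streaming jar how to execute them.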

Page 7:

Understanding how to run Hadoop streaming with R

Understanding a MapReduce application

Gujarat Technological University (http://www.gtuniversity.com) offers Medical, Hotel Management, Architecture, Pharmacy, MBA, and MCA programs.

Purpose: Identify the fields that visitors are interested in, geographically.

Input dataset: The extracted Google Analytics dataset contains four data columns.

Page 8:

Understanding how to run Hadoop streaming with R

Understanding a MapReduce application

• date: The date of the visit, in the form YYYY/MM/DD.
• country: The country of the visitor.
• city: The city of the visitor.
• pagePath: The URL of a page of the website.
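One record of such a dataset might look like this (values invented for illustration, following the four-column layout above):

```
2015/03/10,India,Ahmedabad,/courses/mba
```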

Page 9:

Understanding how to run Hadoop streaming with R

Understanding how to code a MapReduce application

A MapReduce application consists of:
• Mapper code
• Reducer code

Mapper code: This R script, named ga-mapper.R, takes care of the Map phase of a MapReduce job.

The Mapper extracts (key, value) pairs and passes them to the Reducer to be grouped/aggregated. Here, city is the key and pagePath is the value.

Page 10:

Understanding how to run Hadoop streaming with R

Understanding how to code a MapReduce application

ga-mapper.R (R script):

input <- file("stdin", "r")
while (length(currentLine <- readLines(input, n = 1, warn = FALSE)) > 0) {
  fields <- unlist(strsplit(currentLine, ","))
  city <- as.character(fields[3])
  pagepath <- as.character(fields[4])
  writeLines(paste(city, pagepath, sep = "\t"), stdout())
}
close(input)

Page 11:

Understanding how to run Hadoop streaming with R

Understanding how to code a MapReduce application

ga-reducer.R (R script; the slide's listing is cut off, so the closing key-change branch and the final flush are restored here following the same pattern):

input <- file("stdin", "r")
city.key <- NA
page.value <- 0.0
while (length(currentLine <- readLines(input, n = 1)) > 0) {
  fields <- strsplit(currentLine, "\t")
  key <- fields[[1]][1]
  value <- as.character(fields[[1]][2])
  if (is.na(city.key)) {
    city.key <- key
    page.value <- value
  } else {
    if (city.key == key) {
      page.value <- c(page.value, value)
    } else {
      writeLines(paste(city.key, paste(page.value, collapse = ","), sep = "\t"), stdout())
      city.key <- key
      page.value <- value
    }
  }
}
writeLines(paste(city.key, paste(page.value, collapse = ","), sep = "\t"), stdout())
close(input)

Page 12:

Understanding how to run Hadoop streaming with R

Executing a Hadoop streaming job from the command prompt

Page 13:

Understanding how to run Hadoop streaming with R

Executing the Hadoop streaming job from R or an RStudio console

Page 14:

Understanding how to run Hadoop streaming with R

Exploring an output from the command prompt

Page 15:

Understanding how to run Hadoop streaming with R

Exploring an output from R or an RStudio console

Page 16:

Understanding how to run Hadoop streaming with R

Monitoring the Hadoop MapReduce job

Even a small syntax error causes the MapReduce job to fail. Check the Hadoop administration page, or inspect the job history:

$ bin/hadoop job -history /output/location

Page 17:

Exploring the HadoopStreaming R package

The hsTableReader function is designed for reading data in the table format.

hsTableReader(con, cols, chunkSize=3, skip, sep, carryMemLimit, carryMaxRows)

str <- "key1\t1.91\nkey1\t2.1\nkey1\t20.2\nkey1\t3.2\nkey2\t1.2\nkey2\t10\nkey3\t2.5\nkey3\t2.1\nkey4\t1.2\n"
cols <- list(key = '', val = 0)
con <- textConnection(str, open = "r")
hsTableReader(con, cols, chunkSize = 3)

Page 18:

Exploring the HadoopStreaming R package

The hsKeyValReader function is designed for reading data available in the key-value pair format.

hsKeyValReader(con, chunkSize, skip, sep)

printkeyval <- function(k,v) {

cat('A chunk:\n')

cat(paste(k,v,sep=': '),sep='\n')

}

str <- "key1\tval1\nkey2\tval2\nkey3\tval3\n"

con <- textConnection(str, open = "r")

hsKeyValReader(con, chunkSize = 1, FUN = printkeyval)

Page 19:

Exploring the HadoopStreaming R package

The hsLineReader function is designed for reading an entire line as a string, without performing any data parsing.

hsLineReader(file="",chunkSize=2,skip="")

str <- " This is HadoopStreaming!!\n here are,\n examples for chunk dataset!!\n in R\n ?"

con <- textConnection(str, open = "r")

hsLineReader(con,chunkSize=2)

Page 20:

Thank you.
