
Page 1:

Big Data Analytics with R and Hadoop
Chapter 4: Using HadoopStreaming with R

Department of Computer Science, SE Lab, Amarmend, 2015-4-9

Page 2:

Content

• Understanding the basics of Hadoop streaming
• Understanding how to run Hadoop streaming with R
• Exploring the HadoopStreaming R package

Page 3:

Understanding the basics of Hadoop streaming

Hadoop streaming is a Hadoop utility for running Hadoop MapReduce jobs with executable scripts, such as a Mapper and a Reducer.

This is similar to the pipe operation in Linux.

1. The text input file is printed on the stream (stdin).
2. This stream is provided as input to the Mapper.
3. The output (stdout) of the Mapper is provided as input to the Reducer.
4. The Reducer writes the output to the HDFS directory.
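The four steps above can be simulated locally with an ordinary Linux pipe, which is exactly the analogy the slide draws. In this sketch, awk stands in for the R scripts, and the input rows are invented (date,country,city,pagePath) records for illustration:

```shell
# Simulate the streaming data flow locally:
#   stdin -> mapper -> sort (shuffle) -> reducer -> stdout
# With the real R scripts this would be:
#   cat input.txt | Rscript ga-mapper.R | sort | Rscript ga-reducer.R
printf 'd1,in,Ahmedabad,/mba\nd2,in,Surat,/mca\nd3,in,Ahmedabad,/pharmacy\n' \
  | awk -F',' '{ print $3 "\t" $4 }' \
  | sort \
  | awk -F'\t' '{ a[$1] = a[$1] "," $2 }
      END { for (k in a) printf "%s\t%s\n", k, substr(a[k], 2) }' \
  | sort
```

Each stage simply reads stdin and writes stdout, which is all that Hadoop streaming requires of a Mapper or a Reducer.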

Page 4:

Understanding the basics of Hadoop streaming

The main advantage of the Hadoop streaming utility is that it allows both Java and non-Java MapReduce jobs to be executed over Hadoop clusters.

Hadoop streaming supports Perl, Python, PHP, R, and C++.

The application logic is translated into the Mapper and Reducer sections with the key and value output elements.

Three main components: Mapper, Reducer, and Driver.

Page 5:

Understanding the basics of Hadoop streaming

HadoopStreaming Components

Page 6:

Understanding the basics of Hadoop streaming

Now, assume we have implemented our Mapper and Reducer as code_mapper.R and code_reducer.R.

Format of the Hadoop streaming command:

bin/hadoop command [genericOptions] [streamingOptions]
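A concrete instance of this format, using the script names above, might look as follows; the streaming jar path and the HDFS input/output paths are illustrative and vary by installation, so this is a sketch rather than a command to copy verbatim:

```shell
bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -input  /input/ga-data.csv \
    -output /output/ga \
    -file code_mapper.R  -mapper  "Rscript code_mapper.R" \
    -file code_reducer.R -reducer "Rscript code_reducer.R"
```

Here -file ships each script to the cluster nodes, while -mapper and -reducer tell the streaming jar how to execute them.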

Page 7:

Understanding how to run Hadoop streaming with R

Understanding a MapReduce application

Gujarat Technological University (http://www.gtuniversity.com) offers Medical, Hotel Management, Architecture, Pharmacy, MBA, and MCA programs.

Purpose: Identify the fields that visitors are interested in, geographically.

Input dataset: The extracted Google Analytics dataset contains four data columns.

Page 8:

Understanding how to run Hadoop streaming with R

Understanding a MapReduce application

• date: The date of the visit, in the form YYYY/MM/DD.
• country: The country of the visitor.
• city: The city of the visitor.
• pagePath: The URL of a page of the website.
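One record of such a dataset might look like this (values invented for illustration, following the four-column layout above):

```
2015/03/10,India,Ahmedabad,/courses/mba
```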

Page 9:

Understanding how to run Hadoop streaming with R

Understanding how to code a MapReduce application

A MapReduce application consists of:
• Mapper code
• Reducer code

Mapper code: This R script, named ga-mapper.R, takes care of the Map phase of a MapReduce job.

The Mapper extracts (key, value) pairs and passes them to the Reducer to be grouped/aggregated. Here, city is the key and pagePath is the value.

Page 10:

Understanding how to run Hadoop streaming with R

Understanding how to code a MapReduce application

ga-mapper.R (R script):

input <- file("stdin", "r")
while (length(currentLine <- readLines(input, n = 1, warn = FALSE)) > 0) {
  fields <- unlist(strsplit(currentLine, ","))
  city <- as.character(fields[3])
  pagepath <- as.character(fields[4])
  writeLines(paste(city, pagepath, sep = "\t"), stdout())
}
close(input)

Page 11:

Understanding how to run Hadoop streaming with R

Understanding how to code a MapReduce application

ga-reducer.R (R script; the slide's listing is cut off, so the closing key-change branch and the final flush are restored here following the same pattern):

input <- file("stdin", "r")
city.key <- NA
page.value <- 0.0
while (length(currentLine <- readLines(input, n = 1)) > 0) {
  fields <- strsplit(currentLine, "\t")
  key <- fields[[1]][1]
  value <- as.character(fields[[1]][2])
  if (is.na(city.key)) {
    city.key <- key
    page.value <- value
  } else {
    if (city.key == key) {
      page.value <- c(page.value, value)
    } else {
      writeLines(paste(city.key, paste(page.value, collapse = ","), sep = "\t"), stdout())
      city.key <- key
      page.value <- value
    }
  }
}
writeLines(paste(city.key, paste(page.value, collapse = ","), sep = "\t"), stdout())
close(input)

Page 12:

Understanding how to run Hadoop streaming with R

Executing a Hadoop streaming job from the command prompt

Page 13:

Understanding how to run Hadoop streaming with R

Executing the Hadoop streaming job from R or an RStudio console

Page 14:

Understanding how to run Hadoop streaming with R

Exploring an output from the command prompt

Page 15:

Understanding how to run Hadoop streaming with R

Exploring an output from R or an RStudio console

Page 16:

Understanding how to run Hadoop streaming with R

Monitoring the Hadoop MapReduce job

Even a small syntax error causes the MapReduce job to fail. Check the Hadoop administration page, or inspect the job history:

$ bin/hadoop job -history /output/location

Page 17:

Exploring the HadoopStreaming R package

The hsTableReader function is designed for reading data in the table format.

hsTableReader(con, cols, chunkSize=3, skip, sep, carryMemLimit, carryMaxRows)

str <- "key1\t1.91\nkey1\t2.1\nkey1\t20.2\nkey1\t3.2\nkey2\t1.2\nkey2\t10\nkey3\t2.5\nkey3\t2.1\nkey4\t1.2\n"
cols <- list(key = '', val = 0)
con <- textConnection(str, open = "r")
hsTableReader(con, cols, chunkSize = 3)

Page 18:

Exploring the HadoopStreaming R package

The hsKeyValReader function is designed for reading data available in the key-value pair format.

hsKeyValReader(con, chunkSize, skip, sep)

printkeyval <- function(k,v) {

cat('A chunk:\n')

cat(paste(k,v,sep=': '),sep='\n')

}

str <- "key1\tval1\nkey2\tval2\nkey3\tval3\n"

con <- textConnection(str, open = "r")

hsKeyValReader(con, chunkSize = 1, FUN = printkeyval)

Page 19:

Exploring the HadoopStreaming R package

The hsLineReader function is designed for reading an entire line as a string, without performing any data parsing.

hsLineReader(file="",chunkSize=2,skip="")

str <- " This is HadoopStreaming!!\n here are,\n examples for chunk dataset!!\n in R\n ?"

con <- textConnection(str, open = "r")

hsLineReader(con,chunkSize=2)

Page 20:

Thank you.
