
Components of Hadoop

Hanyang University, Department of Computer Engineering: 이정호 (2011453307)
Hanyang University, Department of Computer Engineering: 서동우 (2012451605)

1. Introduction
2. Working with files in HDFS
3. Anatomy of a MapReduce program
4. Reading and writing

1. Introduction

• The Hadoop framework has three components: HDFS, the IO component, and MapReduce.
• This chapter covers each component at the programming level.

[Figure: Hadoop and its three components: HDFS, MapReduce, and the IO component]

2. WORKING WITH FILES IN HDFS

2.1. Basic file commands
2.2. Reading and writing to HDFS programmatically

2.1. Basic file commands

File command form:

hadoop fs -cmd <args>

hadoop fs -ls <args>

hadoop fs -lsr <args>

$ hadoop fs -ls /

drwxr-xr-x - chuck supergroup 0 2009-01-14 10:23 /user

$ hadoop fs -lsr /

drwxr-xr-x - chuck supergroup 0 2009-01-14 10:23 /user

drwxr-xr-x - chuck supergroup 0 2009-01-14 11:02 /user/chuck

-rw-r--r-- 1 chuck supergroup 264 2009-01-14 11:02 /user/chuck/example.txt

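• The second column of the listing shows the replication factor, that is, the number of copies of the file.
• In pseudo-distributed mode it is set to 1.
• In a typical cluster it is 3, but it can be any other positive integer.
• Replication does not apply to directories, so they show '-' in this column.

Pseudo-distributed mode: the cluster consists of a single machine, and all daemons run on that one machine. This mode complements standalone mode when debugging code, because it lets you examine memory usage, HDFS input/output issues, and interactions with the other daemons.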

Listing Files


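Make directory:

hadoop fs -mkdir <paths>

$ hadoop fs -mkdir /user/chuck

Add file:

hadoop fs -put <localsrc> ... <dst>

$ hadoop fs -put example.txt .

$ hadoop fs -put example.txt /user/chuck

$ hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile

• A destination of '.' refers to the current HDFS working directory, which defaults to /user/<user>. Here that is /user/chuck, the directory created by the previous command.

• A source of '-' reads the input from standard input (stdin), for example from the keyboard.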




Get file:

hadoop fs -get <file_path> <local_dir_path OR local_file_path>

$ hadoop fs -get example.txt .

Display file:

hadoop fs -cat <file_path>

$ hadoop fs -cat example.txt




Delete file:

hadoop fs -rm URI [URI …]

$ hadoop fs -rm example.txt

Look up help:

hadoop fs -help <cmd>

$ hadoop fs -help ls

-ls <path>: List the contents that match the specified file pattern. If
    path is not specified, the contents of /user/<currentUser>
    will be listed. Directory entries are of the form
        dirName (full path) <dir>
    and file entries are of the form
        fileName(full path) <r n> size
    where n is the number of replicas specified for the file
    and size is the size of the file, in bytes.



Hadoop FS Shell Guide

http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html




2.2. Reading and writing to HDFS programmatically

• The main methods of the FileSystem class are described below.
– They provide functionality similar to the basic shell commands.

FileSystem class methods
• FSDataOutputStream create(Path f): creates a file
• FSDataInputStream open(Path f): opens a file for reading
• boolean delete(Path f): deletes a file
• boolean exists(Path f): returns whether the file exists
• boolean mkdirs(Path f): creates a directory
• boolean rename(Path src, Path dst): changes the path of a file from src to dst
• FileStatus[] listStatus(Path dir): returns the status of the directories and files under dir

FileSystem class static methods
• FileSystem get(Configuration conf): returns the configured file system
• LocalFileSystem getLocal(Configuration conf): returns the local file system

    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);
    FileSystem local = FileSystem.getLocal(conf);



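• An example built on the FileSystem class
– Function: read the files under a local path and store them in HDFS merged into one file.
– Motivation:
  • An Apache server produces many log files; when these are stored in HDFS, they are saved as a single file.
  • This is because HDFS usually keeps the information that belongs to one key in a single file.

[Figure: Function of the PutMerge class. PutMerge merges Log 1, Log 2, …, Log N from the Apache server into a single Log file in HDFS.]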



• The source of the example program that reads the files under a local path and merges them into one file in HDFS
– The FileSystem objects are obtained with get() and getLocal().

PutMerge.java

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutMerge {

        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem hdfs = FileSystem.get(conf);
            FileSystem local = FileSystem.getLocal(conf);

            // Set the local input directory and the HDFS output file
            Path inputDir = new Path(args[0]);
            Path hdfsFile = new Path(args[1]);

            try {
                // List the files in the local directory
                FileStatus[] inputFiles = local.listStatus(inputDir);
                // Create the HDFS output stream
                FSDataOutputStream out = hdfs.create(hdfsFile);

                for (int i = 0; i < inputFiles.length; i++) {
                    System.out.println(inputFiles[i].getPath().getName());
                    // Open a local input stream
                    FSDataInputStream in = local.open(inputFiles[i].getPath());
                    byte[] buffer = new byte[256];
                    int bytesRead = 0;
                    while ((bytesRead = in.read(buffer)) > 0) {
                        out.write(buffer, 0, bytesRead);
                    }
                    in.close();
                }
                out.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

3. ANATOMY OF A MAPREDUCE PROGRAM

3.1. Hadoop data types
3.2. Mapper
3.3. Reducer
3.4. Partitioner
3.5. Combiner
3.6. Word counting example


• A MapReduce process handles data as key/value pairs.
– map: (K1,V1) ➞ list(K2,V2)
– reduce: (K2,list(V2)) ➞ list(K3,V3)

• The input data is distributed to the nodes, and each node runs the Map process. The resulting key/value pairs are sent to the Reduce process on the node responsible for each key, and the output of the Reducers is stored.

[Figure: The general MapReduce data flow]
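The map and reduce signatures above can be tried out without a cluster. The following is a minimal in-memory sketch in plain Java, not Hadoop code: the class name DataFlowSketch and its methods are hypothetical, and the "shuffle" step is an ordinary sorted map standing in for what the framework does between the two phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the MapReduce data flow; no Hadoop classes are used.
class DataFlowSketch {

    // map: (K1, V1) -> list(K2, V2). Here K1 is a line number, V1 a line of text.
    static List<Map.Entry<String, Integer>> map(int lineNo, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle: group every intermediate value under its key, giving (K2, list(V2)).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // reduce: (K2, list(V2)) -> V3. Here it sums the counts of one word.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence over in-memory input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            intermediate.addAll(map(i, lines.get(i)));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(intermediate).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(run(Arrays.asList("a b a", "b c")));
    }
}
```

In a real job the map calls run on different nodes and the grouping happens over the network; the sketch only shows how the types line up from one phase to the next.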


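3.1. Hadoop data types

• The MapReduce framework defines how keys and values are serialized.
– Keys use the WritableComparable interface (values only need Writable).

• The WritableComparable interface
– Objects can be compared with each other through a Comparator.
– A class must implement it to be used as a key in the MapReduce framework.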

Class            Description
BooleanWritable  Wrapper for a standard Boolean variable
ByteWritable     Wrapper for a single byte
DoubleWritable   Wrapper for a Double
FloatWritable    Wrapper for a Float
IntWritable      Wrapper for an Integer
LongWritable     Wrapper for a Long
Text             Wrapper to store text using the UTF8 format
NullWritable     Placeholder when the key or value is not needed

List of frequently used types for the key/value pairs

[Figure: Hierarchy of WritableComparable. WritableComparable<T> extends Writable and Comparable<T>; BooleanWritable, ByteWritable, NullWritable, … implement it.]

[Reference: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/WritableComparable.html]









• readFields() and write() are inherited from the Writable interface.
– Data is sent and received through readFields() and write().

• compareTo() is inherited from the Comparable interface.
– compareTo() is used to put objects in order.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    public class Edge implements WritableComparable<Edge> {

        private String departureNode;
        private String arrivalNode;

        public String getDepartureNode() {
            return departureNode;
        }

        // Restore the fields in the same order in which write() stored them
        @Override
        public void readFields(DataInput in) throws IOException {
            departureNode = in.readUTF();
            arrivalNode = in.readUTF();
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(departureNode);
            out.writeUTF(arrivalNode);
        }

        // Order by departure node first, then by arrival node
        @Override
        public int compareTo(Edge o) {
            int byDeparture = departureNode.compareTo(o.departureNode);
            return (byDeparture != 0)
                    ? byDeparture
                    : arrivalNode.compareTo(o.arrivalNode);
        }
    }
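To see what write() and readFields() do in practice, a key can be serialized to a byte array and read back. The sketch below uses only java.io, so it runs without Hadoop on the classpath; EdgeRoundTrip is a hypothetical stand-in that mirrors the Edge methods but deliberately does not implement Hadoop's WritableComparable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for an Edge-like key; mirrors write()/readFields()/compareTo()
// with plain java.io types instead of the Hadoop interface.
class EdgeRoundTrip implements Comparable<EdgeRoundTrip> {
    String departureNode = "";
    String arrivalNode = "";

    // A no-argument constructor lets a fresh instance be filled by readFields().
    EdgeRoundTrip() { }

    EdgeRoundTrip(String departureNode, String arrivalNode) {
        this.departureNode = departureNode;
        this.arrivalNode = arrivalNode;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(departureNode);
        out.writeUTF(arrivalNode);
    }

    // Fields must be read in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
        departureNode = in.readUTF();
        arrivalNode = in.readUTF();
    }

    @Override
    public int compareTo(EdgeRoundTrip o) {
        int byDeparture = departureNode.compareTo(o.departureNode);
        return byDeparture != 0 ? byDeparture : arrivalNode.compareTo(o.arrivalNode);
    }

    // Serializes an edge to bytes and deserializes it into a new instance.
    static EdgeRoundTrip roundTrip(EdgeRoundTrip edge) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            edge.write(out);
            out.flush();
            EdgeRoundTrip copy = new EdgeRoundTrip();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        EdgeRoundTrip copy = roundTrip(new EdgeRoundTrip("Seoul", "Busan"));
        System.out.println(copy.departureNode + " -> " + copy.arrivalNode);
    }
}
```

The copy compares equal to the original under compareTo(), which is the property the framework relies on when it moves keys between nodes and sorts them.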



• Hadoop frequently uses hashCode() to assign keys to partitions.
• hashCode() must therefore return the same value for the same key in different JVM instances.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
        // Some data
        private int counter;
        private long timestamp;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(counter);
            out.writeLong(timestamp);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            counter = in.readInt();
            timestamp = in.readLong();
        }

        // Order by counter
        @Override
        public int compareTo(MyWritableComparable o) {
            return Integer.compare(counter, o.counter);
        }

        // Depends only on field values, so it is stable across JVMs
        @Override
        public int hashCode() {
            final int prime = 31;
            int result = 1;
            result = prime * result + counter;
            result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
            return result;
        }
    }

[Reference: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/WritableComparable.html]









3.2. Mapper

• To implement a Mapper, implement the Mapper interface and extend MapReduceBase.
– Implement the map() function declared by the Mapper interface.
– The MapReduceBase class provides configure() and close(), which act as constructor and destructor.

• Hadoop provides several predefined Mapper implementations.

[Figure: Implementation of a custom Mapper. CustomMapper extends MapReduceBase and implements Mapper<K1,V1,K2,V2>. MapReduceBase supplies the constructor and destructor methods void configure(JobConf job) and void close().]

Class                 Description
IdentityMapper<K,V>   Implements Mapper<K,V,K,V> and maps inputs directly to outputs
InverseMapper<K,V>    Implements Mapper<K,V,V,K> and reverses the key/value pair
RegexMapper<K>        Implements Mapper<K,Text,Text,LongWritable> and generates a (match, 1) pair for every regular expression match
TokenCountMapper<K>   Implements Mapper<K,Text,Text,LongWritable> and generates a (token, 1) pair when the input value is tokenized

Mapper implementations predefined by Hadoop

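• void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter) throws IOException
• map() takes the input key and value, builds (K2, V2) pairs, and adds them to output.
• The Reporter offers an optional way to record additional information about the progress of the current process.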
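The shape of a map() call can be illustrated with a small stand-alone sketch that behaves like the RegexMapper row of the table: one (match, 1) pair per regular-expression match. This is plain Java, not the Hadoop class; the Collector interface is a hypothetical stand-in for OutputCollector<K2,V2>, and the Reporter argument is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java sketch of a regex-style mapper; names here are illustrative only.
class RegexMapSketch {

    // Hypothetical stand-in for Hadoop's OutputCollector<K2,V2>.
    interface Collector<K, V> {
        void collect(K key, V value);
    }

    private final Pattern pattern;

    // Plays the role of configure(): one-time setup before any map() call.
    RegexMapSketch(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // Emits one (match, 1) pair for every match found in the input value.
    void map(long key, String value, Collector<String, Long> output) {
        Matcher m = pattern.matcher(value);
        while (m.find()) {
            output.collect(m.group(), 1L);
        }
    }

    // Helper that gathers the emitted pairs as "key=value" strings.
    static List<String> mapLine(String regex, String line) {
        List<String> emitted = new ArrayList<>();
        new RegexMapSketch(regex).map(0L, line, (k, v) -> emitted.add(k + "=" + v));
        return emitted;
    }

    public static void main(String[] args) {
        // prints [200=1, 404=1]
        System.out.println(mapLine("[0-9]+", "GET /a 200, GET /b 404"));
    }
}
```

Note that map() never returns its results; it pushes each pair into the collector, which is why a single input record can produce zero, one, or many output pairs.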


3.3. Reducer

[Figure: Implementation of a custom Reducer. CustomReducer extends MapReduceBase and implements Reducer<K2,V2,K3,V3>.]

• void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter) throws IOException

• reduce() receives the values that belong to key and writes the result to output as (K3, V3) pairs.

• The Reporter plays the same role as in the Mapper.

Class                 Description
IdentityReducer<K,V>  Implements Reducer<K,V,K,V> and maps inputs directly to outputs
LongSumReducer<K>     Implements Reducer<K,LongWritable,K,LongWritable> and determines the sum of all values corresponding to the given key

Reducer implementations predefined by Hadoop


3.4. Partitioner

• The Partitioner assigns the output of the Mappers to the Reducers.
• To write a custom Partitioner, implement the Partitioner interface.
– getPartition() decides which Reducer node a key/value pair is assigned to.

Implementation of Partitioner

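    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class EdgePartitioner implements Partitioner<Edge, Writable> {

        @Override
        public int getPartition(Edge key, Writable value, int numPartitions) {
            // Clear the sign bit so a negative hashCode() cannot yield a negative partition
            return (key.getDepartureNode().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }

        @Override
        public void configure(JobConf conf) { }
    }

[Figure: Role of the Partitioner. EdgePartitioner implements Partitioner<K2, V2>.]

• getPartition() returns the partition number.
• If the number of partitions is N, the result must lie in the range [0, N-1].
• configure() receives the JobConf of the job so that the partitioner can initialize itself.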
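The [0, N-1] requirement deserves care, because String.hashCode() can be negative and Java's % operator keeps the sign of its left operand. The sketch below is plain Java with a hypothetical class name; it masks off the sign bit before taking the remainder, which is one common way to keep the result in range.

```java
// Plain-Java sketch of hash partitioning; no Hadoop classes are used.
class PartitionSketch {

    // Maps a key's hash code to a partition index in [0, numPartitions - 1].
    static int getPartition(String departureNode, int numPartitions) {
        // Clearing the sign bit keeps the value non-negative even when
        // hashCode() is negative; a bare "hashCode() % n" could be below zero.
        return (departureNode.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        for (String node : new String[] {"Seoul", "Busan", "Daegu"}) {
            System.out.println(node + " -> partition " + getPartition(node, 4));
        }
    }
}
```

Because the result depends only on the key, every pair with the same departure node lands on the same Reducer, no matter which Mapper produced it.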


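3.5. Combiner

• A Combiner reduces the output of the Mapper locally, on the node where the Mapper ran.
• This local reduce is performed before the Mapper output is distributed to the Reducers.

[Figure: Role of the Combiner. A "Local Reduce" step runs on each Mapper node.]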
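The effect of a local reduce can be shown with a few records. This is a plain-Java sketch with hypothetical names, assuming a word-count style job in which each Mapper emits (word, 1) pairs; it collapses the pairs of one Mapper into (word, count) pairs before anything would be sent over the network.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of a combiner for word counting; no Hadoop classes are used.
class CombinerSketch {

    // Local reduce: turns the (word, 1) pairs of one mapper into (word, count).
    static Map<String, Long> combine(List<String> mapperOutputKeys) {
        Map<String, Long> local = new TreeMap<>();
        for (String word : mapperOutputKeys) {
            local.merge(word, 1L, Long::sum);
        }
        return local;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("the", "cat", "the", "the");
        Map<String, Long> combined = combine(emitted);
        // prints 4 records before, 2 after: {cat=1, the=3}
        System.out.println(emitted.size() + " records before, "
                + combined.size() + " after: " + combined);
    }
}
```

This only works because summing is associative and commutative: adding partial sums on the Reducer gives the same total as adding the original ones. Hadoop treats the combiner as an optimization and may run it zero or more times, so a job must produce the same result without it.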


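3.6. Word counting example

• An example that counts words
– Create a JobClient object and a JobConf object.
– Set the basic job information on the JobConf object.
– Store the JobConf object in the JobClient object and run the job.

WordCount2.java

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.LongSumReducer;
    import org.apache.hadoop.mapred.lib.TokenCountMapper;

    public class WordCount2 {
        public static void main(String[] args) {
            // Create the JobClient and JobConf objects
            JobClient client = new JobClient();
            JobConf conf = new JobConf(WordCount2.class);

            // Set the input and output paths
            FileInputFormat.addInputPath(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Set the output types
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(LongWritable.class);

            // Set the Mapper, Combiner, and Reducer classes on the JobConf
            conf.setMapperClass(TokenCountMapper.class);
            conf.setCombinerClass(LongSumReducer.class);
            conf.setReducerClass(LongSumReducer.class);

            client.setConf(conf);
            try {
                JobClient.runJob(conf);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }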

4. READING AND WRITING

4.1. Input Format
4.2. Output Format


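• MapReduce is built on one assumption that makes distributed processing possible.
– The data can be processed in parallel.

• The input data is divided into multiple input splits.

[Figure: Left, "Data Is Processing": Process 1 and Process 2 work on the input data as a whole. Right, "Process Data in Parallel": the input is divided into Split 1, Split 2, Split 3, …, Split N, which Process 1, Process 2, … handle in parallel.]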

4.1. Input Format

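• InputFormat has one method that returns an array of splits and another that returns a RecordReader object.
• The splits are distributed for parallel processing, and a RecordReader is handed to each process so that the distributed data can be handled as key/value pairs.

[Figure: Role of InputFormat. getSplits() divides the input into Split 1, Split 2, Split 3, …, Split N, which are distributed to Process 1, Process 2, …, Process N. getRecordReader() hands each process a RecordReader, an object that parses the contents of a split and returns them as key/value pairs.]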
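The splitting step can be pictured as cutting a file into byte ranges. The sketch below is a simplified plain-Java illustration, not the Hadoop getSplits() implementation: the class and method names are hypothetical, and it assumes a fixed split size, whereas real input formats also take block boundaries and job settings into account.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of dividing a file into input splits; illustrative only.
class SplitSketch {

    // Returns {offset, length} pairs that together cover fileLength bytes.
    static List<long[]> getSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            // The last split may be shorter than splitSize.
            splits.add(new long[] {offset, Math.min(splitSize, fileLength - offset)});
        }
        return splits;
    }

    public static void main(String[] args) {
        for (long[] split : getSplits(250L, 100L)) {
            System.out.println("offset=" + split[0] + " length=" + split[1]);
        }
    }
}
```

Each split is only a description of a byte range, not a copy of the data; the RecordReader is what later opens the file at that offset and turns the bytes into records.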


• InputFormat classes are built by implementing the InputFormat interface.
• The most commonly used InputFormat classes are TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, and NLineInputFormat.

[Figure: Hierarchy of the InputFormat interface. The InputFormat classes implement InputFormat<K,V>.]

Class                          Description
TextInputFormat                Each line in the text files is a record. Key is the byte
                               offset of the line, and value is the content of the line.
                               • key: LongWritable
                               • value: Text
KeyValueTextInputFormat        Each line in the text files is a record. The first separator
                               character divides each line. Everything before the
                               separator is the key, and everything after is the value.
                               The separator is set by the
                               key.value.separator.in.input.line property, and the
                               default is the tab (\t) character.
                               • key: Text
                               • value: Text
SequenceFileInputFormat<K,V>   An InputFormat for reading in sequence files. Key and
                               value are user defined. Sequence file is a Hadoop-specific
                               compressed binary file format. It's optimized for
                               passing data between the output of one MapReduce job
                               to the input of some other MapReduce job.
                               • key: K (user defined)
                               • value: V (user defined)
NLineInputFormat               Same as TextInputFormat, but each split is guaranteed
                               to have exactly N lines. The
                               mapred.line.input.format.linespermap property, which
                               defaults to one, sets N.
                               • key: LongWritable
                               • value: Text

Main InputFormat Classes
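The KeyValueTextInputFormat rule, "split at the first separator", is easy to get subtly wrong when a line contains the separator more than once. The following plain-Java sketch shows the rule on a single line. The class name is hypothetical, and treating a line without a separator as a key with an empty value is an assumption made for this sketch.

```java
// Plain-Java sketch of splitting a line into key and value; illustrative only.
class KeyValueLineSketch {

    // Splits at the FIRST separator: before it is the key, after it the value.
    static String[] parse(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // Assumption: without a separator the whole line becomes the key.
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, pos), line.substring(pos + 1)};
    }

    public static void main(String[] args) {
        String[] kv = parse("17:16:18\thttp://hadoop.apache.org/core/docs", '\t');
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

Any later separators stay inside the value, so a value that itself contains tabs is passed to the Mapper unchanged.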



• A custom InputFormat requires implementations of both the InputFormat interface and the RecordReader interface.

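TimeUrlTextInputFormat.java

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class TimeUrlTextInputFormat extends FileInputFormat<Text, URLWritable> {
        // Splitting is inherited from FileInputFormat; only the reader is custom
        public RecordReader<Text, URLWritable> getRecordReader(
                InputSplit input, JobConf job, Reporter reporter) throws IOException {
            return new TimeUrlLineRecordReader(job, (FileSplit) input);
        }
    }

URLWritable.java

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;

    import org.apache.hadoop.io.Writable;

    public class URLWritable implements Writable {
        protected URL url;

        public URLWritable() { }

        public URLWritable(URL url) {
            this.url = url;
        }

        public void write(DataOutput out) throws IOException {
            out.writeUTF(url.toString());
        }

        public void readFields(DataInput in) throws IOException {
            url = new URL(in.readUTF());
        }

        public void set(String s) throws MalformedURLException {
            url = new URL(s);
        }
    }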



TimeUrlLineRecordReader.java

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueLineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;

    class TimeUrlLineRecordReader implements RecordReader<Text, URLWritable> {
        // Delegate line reading and key/value splitting to the built-in reader
        private KeyValueLineRecordReader lineReader;
        private Text lineKey, lineValue;

        public TimeUrlLineRecordReader(JobConf job, FileSplit split) throws IOException {
            lineReader = new KeyValueLineRecordReader(job, split);
            lineKey = lineReader.createKey();
            lineValue = lineReader.createValue();
        }

        // Read the next record and convert the text value into a URLWritable
        public boolean next(Text key, URLWritable value) throws IOException {
            if (!lineReader.next(lineKey, lineValue)) {
                return false;
            }
            key.set(lineKey);
            value.set(lineValue.toString());
            return true;
        }

        public Text createKey() {
            return new Text("");
        }

        public URLWritable createValue() {
            return new URLWritable();
        }

        public long getPos() throws IOException {
            return lineReader.getPos();
        }

        public float getProgress() throws IOException {
            return lineReader.getProgress();
        }

        public void close() throws IOException {
            lineReader.close();
        }
    }

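4.2. Output Format

• The OutputFormat interface is similar to the InputFormat interface, except that it does not split anything.
• Output files are written through a RecordWriter; each Reducer writes its own file from the node it runs on.
– All of the files are saved in one common output directory.

[Figure: Role of OutputFormat. Each Reducer (Reducer 02, …, Reducer NN) obtains a RecordWriter from the OutputFormat through getRecordWriter() and calls write() to produce its own output file (part-02, …, part-NN).]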


• The main predefined OutputFormat classes
– TextOutputFormat is the default and the most frequently used OutputFormat class.
– SequenceFileOutputFormat writes the output as Hadoop sequence files.
– NullOutputFormat writes nothing.

Class                           Description
TextOutputFormat<K,V>           Writes each record as a line of text. Keys and values
                                are written as strings and separated by a tab (\t)
                                character, which can be changed in the
                                mapred.textoutputformat.separator property.
SequenceFileOutputFormat<K,V>   Writes the key/value pairs in Hadoop's proprietary
                                sequence file format. Works in conjunction with
                                SequenceFileInputFormat.
NullOutputFormat<K,V>           Outputs nothing.

Main OutputFormat Classes
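The TextOutputFormat layout, one record per line with a separator between key and value, can be sketched in a few lines of plain Java. The class and method names below are hypothetical, and the sketch builds a string in memory rather than writing part files; it only shows what the content of such a file would look like.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of a text record layout; no Hadoop classes are used.
class TextOutputSketch {

    // Formats each record as key + separator + value, one record per line.
    static String format(Map<String, Long> records, String separator) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : records.entrySet()) {
            sb.append(e.getKey()).append(separator).append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("hadoop", 3L);
        counts.put("hdfs", 1L);
        // Tab is used as the separator, matching the default described above.
        System.out.print(format(counts, "\t"));
    }
}
```

Because the tab-separated layout matches what KeyValueTextInputFormat expects, output written this way can be read back as key/value pairs by a following job.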
