A First Look at MapReduce

HELLO! Uncle@Mars

Motivation

✘ CCDH
  ✗ Multiple-choice questions
  ✗ Focuses on Hadoop and MapReduce, with a little Hive
  ✗ About to be retired

http://www.cloudera.com/content/cloudera/en/training/certification/ccdh.html

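Motivation

✘ CCP: Data Engineer
  ✗ The exam content is more comprehensive
  ✗ Benefits for the individual
    ■ Emphasizes hands-on work
    ■ Focuses on skills rather than on using a particular product
    ■ Recommendation and certification
    ■ The exam keeps changing, but the focus stays on verifying data-processing skills
  ✗ Benefits for the company
    ■ Find people who can actually do the hands-on work
    ■ The people found hold an authoritative certification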

http://www.cloudera.com/content/cloudera/en/training/certification/ccp-data-engineer.html

MapReduce

Main Components

✘ Client-side Java program
✘ Custom Mapper
✘ Custom Reducer
✘ Client-side libraries
✘ Remote libraries
✘ JAR file containing the client-side Java program

Sample Program

Mapper & Reducer

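✘ setNumMapTasks can be used to request a number of Mappers
  ✗ It is only a hint: the actual number of Mappers is still determined by how the input data is split
✘ setNumReduceTasks sets the number of Reducers; the default is 1
  ✗ With 2 Reducers, the job writes one part file per Reducer:

15/07/30 20:43:43 INFO mapred.MapTask: numReduceTasks: 2

-rw-r--r-- 3 javakid supergroup 0 2015-07-30 20:43 output/_SUCCESS
-rw-r--r-- 3 javakid supergroup 15 2015-07-30 20:43 output/part-00000
-rw-r--r-- 3 javakid supergroup 62 2015-07-30 20:43 output/part-00001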

Mapper & Reducer

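✘ Number of Reducers
  ✗ More Reducers usually means a faster job; the downside is too many output files
  ✗ A good target is a number that lets each Reduce task finish in about 5 minutes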

Hadoop Classpath

✘ Tell the hadoop command which JAR to run

export HADOOP_CLASSPATH=/home/javakid/git_repository/HelloHadoop/target/hellohadoop-1.0-SNAPSHOT.jar

✘ Run

hadoop idv.jk.hellohadoop.WordCountOldAPI input/test1.txt output

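Overview

✘ JobConf
  ✗ The main interface a MapReduce job uses to set Hadoop parameters
✘ TextInputFormat
  ✗ Declares the format of the input data; the sample program sets it to plain text
  ✗ Extends FileInputFormat, so it is an InputFormat
✘ TextOutputFormat
  ✗ Running the job a second time throws the following error:
    hdfs://javakid01:9000/user/javakid/output already exists
  ✗ Make sure the output directory configured for the job does not exist yet
  ✗ hdfs dfs -rm -R output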

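Overview

✘ FileInputFormat.setInputPaths()
  ✗ Sets the paths of the input files
✘ FileOutputFormat.setOutputPath()
  ✗ Sets the path where the output files are written
✘ conf.setOutputKeyClass() and conf.setOutputValueClass()
  ✗ Set the key and value classes of the output data; they must match what the Reducer emits, otherwise an exception is thrown at run time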

Overview

✘ Keys and values in the output file are separated by a TAB by default; this can be changed with conf.set("mapreduce.textoutputformat.separator", ",")

✘ With the new API: configuration.set("mapreduce.output.textoutputformat.separator", ",");

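Limitations of MapReduce 1

✘ Batch processing: everything runs as batch jobs, one after another

✘ It is the foundation of every application

✘ Every job has to be translated into MapReduce, e.g. Pig and Hive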

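YARN (MapReduce 2)

Evolution in MapReduce 2

✘ Resource management is pulled out of MapReduce and becomes YARN

✘ MapReduce becomes just another batch job running on top of YARN

Evolution in MapReduce 2

✘ Existing applications (e.g. Pig, Hive) can move to a more suitable computing framework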

YARN

✘ Programs written with the new API
✘ To run on YARN, set the following in conf/mapred-site.xml

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

YARN

✘ Command

yarn jar hellohadoop-1.0-SNAPSHOT.jar idv.jk.hellohadoop.WordCountNewAPI input/lyrics.txt output/lyrics

YARN

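✘ YARN Client
  ✗ Creates the YARN application
✘ ResourceManager
  ✗ The central YARN daemon, responsible for scheduling and managing resources
✘ NodeManager
  ✗ A process that runs on every node
  ✗ Responsible for launching and managing containers

YARN

✘ ApplicationMaster
  ✗ Launched by the ResourceManager, one per application
  ✗ Responsible for requesting containers to carry out the work
✘ Container
  ✗ The YARN process that actually runs the application's work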

General Form of Map and Reduce

✘ Map
  ✗ (K1, V1) -> list(K2, V2)
✘ Reduce
  ✗ (K2, list(V2)) -> list(K3, V3)
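The two signatures above can be made concrete with a word count written in plain Java. This is only a sketch of the data shapes: it uses no Hadoop classes, the class and method names (MapReduceShapes, map, reduce, run) are made up for illustration, and the grouping step in run stands in for what the framework does between the two phases.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop.
final class MapReduceShapes {

    // Map: (K1 = line offset, V1 = line text) -> list(K2 = word, V2 = 1)
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return out;
    }

    // Reduce: (K2 = word, list(V2) = counts) -> list(K3 = word, V3 = total)
    static List<Map.Entry<String, Integer>> reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(new SimpleEntry<String, Integer>(word, total));
        return out;
    }

    // Driver: map every line, group the values by key (sorted), then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1; // assumes single-byte characters and a '\n' terminator
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            for (Map.Entry<String, Integer> kv : reduce(group.getKey(), group.getValue())) {
                result.put(kv.getKey(), kv.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("a b a");
        lines.add("b c");
        System.out.println(run(lines));
    }
}
```

Here K1 is unused by the map function, which is typical for word count: only the line text matters.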

Combiner

✘ map
  ✗ (K1, V1) -> list(K2, V2)
✘ combiner
  ✗ (K2, list(V2)) -> list(K2, V2)
✘ reduce
  ✗ (K2, list(V2)) -> list(K3, V3)

Combiner

✘ Output of the first Mapper

(1950, 0) (1950, 20) (1950, 10)

Output of the second Mapper

(1950, 25) (1950, 15)

✘ Reducer input: (1950, [0, 20, 10, 25, 15])

Output: (1950, 25)

Combiner

✘ The combiner turns (1950, 0) (1950, 20) (1950, 10) into

(1950, 20)

✘ and (1950, 25) (1950, 15) into

(1950, 25)

✘ Reducer input: (1950, [20, 25])

Output: (1950, 25)
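The same numbers can be checked with a few lines of plain Java. This is a sketch, not Hadoop code: MaxCombinerDemo and its methods are hypothetical names, and each inner list stands for the values one Mapper emitted for the key 1950.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not part of Hadoop.
final class MaxCombinerDemo {

    // The same function serves as combiner and reducer: the maximum value for one key.
    static int max(List<Integer> values) {
        return Collections.max(values);
    }

    // No combiner: the reducer sees every value emitted by every mapper.
    static int withoutCombiner(List<List<Integer>> mapperOutputs) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> oneMapper : mapperOutputs) {
            all.addAll(oneMapper);
        }
        return max(all);
    }

    // With a combiner: each mapper's values are reduced locally first,
    // so the reducer only sees one value per mapper.
    static int withCombiner(List<List<Integer>> mapperOutputs) {
        List<Integer> partial = new ArrayList<>();
        for (List<Integer> oneMapper : mapperOutputs) {
            partial.add(max(oneMapper));
        }
        return max(partial);
    }

    public static void main(String[] args) {
        List<List<Integer>> outputs = Arrays.asList(
                Arrays.asList(0, 20, 10),
                Arrays.asList(25, 15));
        System.out.println(withoutCombiner(outputs)); // 25
        System.out.println(withCombiner(outputs));    // 25
    }
}
```

Both paths give the same answer, but with the combiner only two values instead of five have to cross the network to the Reducer.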

Combiner

✘ Why it works

max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25

✘ Limitations?

Combiner

✘ Limitation

mean(0, 20, 10, 25, 15) = 14

But mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
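The arithmetic above can be reproduced in plain Java. The sketch below also shows one common workaround that the slides do not cover, so treat it as an assumption: the combiner emits a (sum, count) pair instead of a mean, and only the final step divides. MeanCombinerDemo and its method names are hypothetical.

```java
// Hypothetical illustration class; not part of Hadoop.
final class MeanCombinerDemo {

    static double mean(double... values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Wrong as a combiner: averaging per-mapper means ignores how many values each mapper had.
    static double meanOfMeans(double[][] mapperOutputs) {
        double[] partialMeans = new double[mapperOutputs.length];
        for (int i = 0; i < mapperOutputs.length; i++) {
            partialMeans[i] = mean(mapperOutputs[i]);
        }
        return mean(partialMeans);
    }

    // Workaround (an assumption, not from the slides): combine (sum, count) pairs,
    // which can be merged in any grouping, and divide only at the end.
    static double meanViaSumAndCount(double[][] mapperOutputs) {
        double totalSum = 0;
        long totalCount = 0;
        for (double[] oneMapper : mapperOutputs) {
            double partialSum = 0;
            for (double v : oneMapper) {
                partialSum += v;
            }
            totalSum += partialSum;          // combiner output: partial sum
            totalCount += oneMapper.length;  // combiner output: partial count
        }
        return totalSum / totalCount;
    }

    public static void main(String[] args) {
        double[][] outputs = { { 0, 20, 10 }, { 25, 15 } };
        System.out.println(mean(0, 20, 10, 25, 15));     // 14.0
        System.out.println(meanOfMeans(outputs));        // 15.0
        System.out.println(meanViaSumAndCount(outputs)); // 14.0
    }
}
```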

Shuffle & Sort Phase

✘ Decides which Reducer processes each piece of Map output
✘ Guarantees that the input to any given Reducer is sorted by key
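Both guarantees can be sketched in plain Java. The partition formula below is the one Hadoop's default HashPartitioner uses, (hashCode & Integer.MAX_VALUE) % numReduceTasks; everything else (the ShuffleSortDemo class, the use of a TreeMap to model sorted input) is a simplification made up for illustration and says nothing about how the real shuffle is implemented.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop.
final class ShuffleSortDemo {

    // Picks the reducer for a key; same formula as Hadoop's default HashPartitioner.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Routes every (key, value) pair to one reducer and keeps each reducer's keys sorted.
    static List<TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReduceTasks) {
        List<TreeMap<String, List<Integer>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            reducers.add(new TreeMap<String, List<Integer>>());
        }
        for (Map.Entry<String, Integer> kv : mapOutput) {
            reducers.get(partition(kv.getKey(), numReduceTasks))
                    .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                    .add(kv.getValue());
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(new SimpleEntry<String, Integer>("b", 1));
        mapOutput.add(new SimpleEntry<String, Integer>("a", 1));
        mapOutput.add(new SimpleEntry<String, Integer>("b", 1));
        mapOutput.add(new SimpleEntry<String, Integer>("c", 1));
        List<TreeMap<String, List<Integer>>> reducers = shuffle(mapOutput, 2);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("reducer " + i + ": " + reducers.get(i));
        }
    }
}
```

All values for one key land on the same reducer because the reducer index depends only on the key, which is also why each reducer ends up writing its own part file.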

InputFormat

✘ TextInputFormat
  ✗ The default InputFormat
  ✗ Each line is one value
  ✗ The key is a LongWritable holding the byte offset of the line within the file

On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
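The offsets 0, 33, 57 and 89 follow directly from the line lengths plus one byte per line terminator. A small plain-Java sketch that recomputes them, assuming UTF-8 text and '\n' line endings (LineOffsetDemo is a hypothetical name, and this is not how Hadoop's record reader is implemented):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of Hadoop.
final class LineOffsetDemo {

    // Returns the starting byte offset of each line, assuming '\n' line endings.
    static long[] offsets(String text) {
        String[] lines = text.split("\n");
        long[] result = new long[lines.length];
        long position = 0;
        for (int i = 0; i < lines.length; i++) {
            result[i] = position;
            // Advance by the line's byte length plus one byte for the '\n'.
            position += lines[i].getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "On the top of the Crumpetty Tree\n"
                + "The Quangle Wangle sat,\n"
                + "But his face you could not see,\n"
                + "On account of his Beaver Hat.\n";
        System.out.println(Arrays.toString(offsets(text))); // [0, 33, 57, 89]
    }
}
```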

InputFormat

✘ KeyValueTextInputFormat

✘ Everything before the first TAB (the default separator) is the key; the rest of the line is the value

line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.

(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
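The splitting rule can be sketched in plain Java: cut at the first occurrence of the separator and keep any later separators inside the value. KeyValueLineDemo is a hypothetical name, and the handling of a line without a separator (whole line as key, empty value) is my understanding of Hadoop's behavior rather than something stated in the slides.

```java
// Hypothetical illustration class; not part of Hadoop.
final class KeyValueLineDemo {

    // Splits a line into {key, value} at the first occurrence of the separator.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // Assumption: with no separator, the whole line is the key and the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("line1\tOn the top of the Crumpetty Tree", '\t');
        System.out.println("(" + kv[0] + ", " + kv[1] + ")");
    }
}
```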

InputFormat

✘ KeyValueTextInputFormat

✘ To change the default separator, set mapreduce.input.keyvaluelinerecordreader.key.value.separator

InputFormat

✘ NLineInputFormat

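✘ N is the number of lines each map receives; the default is 1
✘ N can be changed by setting mapreduce.input.lineinputformat.linespermap
✘ For example, with N set to 2, each map receives 2 lines of input

(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)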
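The grouping itself is simple to picture: consecutive lines are handed out N at a time, and the last group may be shorter. A plain-Java sketch of that idea (NLineDemo and group are hypothetical names; the real NLineInputFormat works on file splits rather than in-memory lists):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Hadoop.
final class NLineDemo {

    // Groups consecutive lines into chunks of at most n lines, one chunk per map task.
    static List<List<String>> group(List<String> lines, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += n) {
            int end = Math.min(start + n, lines.size());
            groups.add(new ArrayList<>(lines.subList(start, end)));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> poem = Arrays.asList(
                "On the top of the Crumpetty Tree",
                "The Quangle Wangle sat,",
                "But his face you could not see,",
                "On account of his Beaver Hat.");
        // With n = 2 the four lines are split across two map tasks.
        for (List<String> oneMapInput : group(poem, 2)) {
            System.out.println(oneMapInput);
        }
    }
}
```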

Self-Study Resources

Distributions

✘ Cloudera QuickStart VM
  ✗ Tutorial in VM
  ✗ Tutorial on Web
  ✗ Cloudera Essentials for Apache Hadoop
✘ Hortonworks Data Platform
  ✗ Tutorial

http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-4-x.html

http://cloudera.com/content/cloudera/en/training.html

http://cloudera.com/content/cloudera/en/training/library/hadoop-essentials.html

http://hortonworks.com/hdp/whats-new/

http://hortonworks.com/tutorials/

Reference Books

THANKS! Any questions?
