A First Look at MapReduce
TRANSCRIPT
Motivation
✘ CCDH
✗ Multiple-choice questions
✗ Focuses on Hadoop, MapReduce, and a small amount of Hive
✗ Close to being retired
Motivation
✘ CCP: Data Engineer
✗ The exam covers a broader range of content
✗ Benefits for the individual
■ Emphasizes hands-on implementation
■ Tests skills rather than the use of a particular product
■ Recommendations and certification
■ The exam keeps changing, but its focus stays on verifying data-processing skills
✗ Benefits for the company
■ Helps find people who can actually build things
■ The people found hold an authoritative certification
Mapper & Reducer
✘ setNumMapTasks can be used to suggest the number of Mappers
✗ The actual number of Mappers is still determined by how the input data is split
✘ setNumReduceTasks controls the number of Reducers; the default is 1
15/07/30 20:43:43 INFO mapred.MapTask: numReduceTasks: 2
-rw-r--r--   3 javakid supergroup    0 2015-07-30 20:43 output/_SUCCESS
-rw-r--r--   3 javakid supergroup   15 2015-07-30 20:43 output/part-00000
-rw-r--r--   3 javakid supergroup   62 2015-07-30 20:43 output/part-00001
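As a sketch, the two knobs above are set on the old-API JobConf like this; this is a configuration fragment, not a full driver, and the driver class name WordCountOldAPI comes from this deck while the map-task count is illustrative:

```java
// Old (org.apache.hadoop.mapred) API: configure task counts on the JobConf.
JobConf conf = new JobConf(WordCountOldAPI.class);
conf.setNumMapTasks(4);     // only a hint; the real count follows the input splits
conf.setNumReduceTasks(2);  // exact; this is why the listing shows part-00000 and part-00001
```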
Hadoop Classpath
✘ Tell the hadoop command which jar to run:
export HADOOP_CLASSPATH=/home/javakid/git_repository/HelloHadoop/target/hellohadoop-1.0-SNAPSHOT.jar
✘ Run:
hadoop idv.jk.hellohadoop.WordCountOldAPI input/test1.txt output
Overview
✘ JobConf
✗ The primary interface a MapReduce job uses to configure Hadoop parameters
✘ TextInputFormat
✗ Declares the format of the input data; here it is set to plain text
✗ A subclass of InputFormat
✘ TextOutputFormat
✗ Running the job a second time throws: hdfs://javakid01:9000/user/javakid/output already exists
✗ Make sure the job's output directory does not exist before running
✗ hdfs dfs -rm -R output
Overview
✘ FileInputFormat.setInputPaths()
✗ Sets the paths of the input files
✘ FileOutputFormat.setOutputPath()
✗ Sets the path where the output files will be written
✘ conf.setOutputKeyClass() and conf.setOutputValueClass()
✗ Set the key and value classes of the output data; they must match the Reducer's output types, otherwise a RuntimeException is thrown
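Putting these pieces together, here is a hedged sketch of an old-API word-count driver body. It is a configuration fragment rather than a runnable program: WordCountOldAPI is the class named in this deck, while WordCountMapper and WordCountReducer are assumed names for the map/reduce classes.

```java
// Old-API (org.apache.hadoop.mapred) driver wiring up the settings above.
JobConf conf = new JobConf(WordCountOldAPI.class);
conf.setJobName("wordcount");

conf.setInputFormat(TextInputFormat.class);    // input is plain text
conf.setOutputFormat(TextOutputFormat.class);  // output is plain text

FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. input/test1.txt
FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // must not already exist

conf.setMapperClass(WordCountMapper.class);    // assumed class name
conf.setReducerClass(WordCountReducer.class);  // assumed class name

// Must match the Reducer's output types, or a RuntimeException is thrown.
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

JobClient.runJob(conf);
```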
Overview
✘ Fields in the output files are TAB-separated by default; this can be changed with conf.set("mapreduce.textoutputformat.separator", ",")
✘ With the new API: configuration.set("mapreduce.output.textoutputformat.separator", ",");
YARN
✘ A program using the new API
✘ To run on YARN, add to conf/mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
YARN
✘ Command:
yarn jar hellohadoop-1.0-SNAPSHOT.jar idv.jk.hellohadoop.WordCountNewAPI input/lyrics.txt output/lyrics
YARN
✘ YARN Client
✗ Creates the YARN application
✘ ResourceManager
✗ The central YARN daemon, responsible for resource scheduling and management
✘ NodeManager
✗ A process running on every node
✗ Responsible for launching and managing containers
Combiner
✘ map: (K1, V1) -> list(K2, V2)
✘ combiner: (K2, list(V2)) -> list(K2, V2)
✘ reduce: (K2, list(V2)) -> list(K3, V3)
Combiner
✘ The first Mapper emits:
(1950, 0) (1950, 20) (1950, 10)
✘ The second Mapper emits:
(1950, 25) (1950, 15)
✘ Reducer input: (1950, [0, 20, 10, 25, 15])
Output: (1950, 25)
Combiner
✘ The combiner turns (1950, 0) (1950, 20) (1950, 10) into
(1950, 20)
✘ And turns (1950, 25) (1950, 15) into
(1950, 25)
✘ Reducer input: (1950, [20, 25])
Output: (1950, 25)
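The max function works as a combiner because taking a max of per-mapper maxima gives the same answer as taking the max of everything. A plain-Java simulation of the two slides above (no Hadoop involved; class and method names are illustrative):

```java
import java.util.Arrays;
import java.util.List;

// Simulates a max() combiner for key 1950: each mapper's output is
// pre-reduced locally, and the reducer sees only the combined values.
public class MaxCombinerDemo {
    static int max(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).max().getAsInt();
    }

    public static void main(String[] args) {
        List<Integer> mapper1 = Arrays.asList(0, 20, 10);  // first Mapper's values
        List<Integer> mapper2 = Arrays.asList(25, 15);     // second Mapper's values

        // With a combiner: reducer input is (1950, [20, 25])
        int withCombiner = max(Arrays.asList(max(mapper1), max(mapper2)));
        // Without a combiner: reducer input is (1950, [0, 20, 10, 25, 15])
        int withoutCombiner = max(Arrays.asList(0, 20, 10, 25, 15));

        System.out.println(withCombiner + " == " + withoutCombiner); // both are 25
    }
}
```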
Combiner
✘ Limitation: the combiner must not change the final result, so the function has to be commutative and associative; max qualifies, but mean does not:
mean(0, 20, 10, 25, 15) = 14
But mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
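The limitation above can be verified in plain Java: averaging per-mapper averages weights each mapper equally instead of each record, so the result shifts. (Class and method names here are illustrative, not from the deck.)

```java
import java.util.Arrays;
import java.util.List;

// Shows why mean() cannot be used directly as a combiner:
// the mean of per-mapper means differs from the overall mean.
public class MeanCombinerPitfall {
    static double mean(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).average().getAsDouble();
    }

    public static void main(String[] args) {
        List<Integer> mapper1 = Arrays.asList(0, 20, 10);  // mean = 10
        List<Integer> mapper2 = Arrays.asList(25, 15);     // mean = 20

        double overall = mean(Arrays.asList(0, 20, 10, 25, 15));  // 14.0
        double meanOfMeans = mean(Arrays.asList(
                (int) mean(mapper1), (int) mean(mapper2)));        // mean(10, 20) = 15.0

        System.out.println(overall + " vs " + meanOfMeans);
    }
}
```

A common workaround is to have the combiner emit (sum, count) pairs instead, since summation is associative and the reducer can compute the true mean at the end.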
InputFormat
✘ TextInputFormat
✗ The default InputFormat
✗ Each line becomes a value
✗ The key is a LongWritable: the byte offset of the line

On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

(0,  On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
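The keys above are just cumulative byte offsets. A plain-Java sketch recomputing them, assuming UTF-8 text with one-byte '\n' line endings (the class is illustrative, not Hadoop's actual record reader):

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Recomputes TextInputFormat-style keys: the byte offset at which
// each line of the input starts, assuming '\n' line endings.
public class ByteOffsetDemo {
    static Map<Long, String> records(String text) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            out.put(offset, line);
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return out;
    }

    public static void main(String[] args) {
        String poem = "On the top of the Crumpetty Tree\n"
                    + "The Quangle Wangle sat,\n"
                    + "But his face you could not see,\n"
                    + "On account of his Beaver Hat.\n";
        records(poem).forEach((k, v) -> System.out.println("(" + k + ", " + v + ")"));
        // yields keys 0, 33, 57, 89 as in the slide
    }
}
```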
InputFormat
✘ KeyValueTextInputFormat
✗ Everything before the first TAB (the default) is the key; the rest is the value

line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.

(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
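The splitting rule above is simple to mimic in plain Java: split each line at the first separator character. This is a sketch of the idea, not Hadoop's actual KeyValueLineRecordReader:

```java
// Mimics KeyValueTextInputFormat's record parsing: the text before the
// first separator (TAB by default) is the key, the remainder the value.
public class KeyValueSplitDemo {
    static String[] split(String line, char separator) {
        int i = line.indexOf(separator);
        if (i < 0) {
            return new String[] { line, "" }; // no separator: whole line is the key
        }
        return new String[] { line.substring(0, i), line.substring(i + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("line1\tOn the top of the Crumpetty Tree", '\t');
        System.out.println("(" + kv[0] + ", " + kv[1] + ")");
        // (line1, On the top of the Crumpetty Tree)
    }
}
```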
InputFormat
✘ KeyValueTextInputFormat
✗ To change the default separator, set mapreduce.input.keyvaluelinerecordreader.key.value.separator
InputFormat
✘ NLineInputFormat
✗ N is the number of lines per mapper; the default is 1
✘ N can be adjusted via mapreduce.input.lineinputformat.linespermap
✘ For example, set it to 2 and each map's input is 2 lines:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
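A configuration sketch for getting the two-line splits above with the new API, either through the property or the NLineInputFormat helper (the job name is illustrative):

```java
// New-API configuration: give each mapper 2 input lines per split.
Job job = Job.getInstance(new Configuration(), "nline-demo");
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 2);
// equivalently:
// job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 2);
```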
Distributions
✘ Cloudera QuickStart VM
✗ Tutorial in VM
✗ Tutorial on Web
✗ Cloudera Essentials for Apache Hadoop
✘ Hortonworks Data Platform
✗ Tutorial