[C12] Genki Hadoop! Let's Analyze Oracle with Hadoop, by Daisuke Hirama

Upload: insight-technology-inc

Post on 04-Jul-2015

Page 1: [C12] Genki Hadoop! Let's Analyze Oracle with Hadoop by Daisuke Hirama

Genki Hadoop!

Copyright © 2013 Insight Technology, Inc. All Rights Reserved.

Daisuke Hirama

Insight Technology, Inc.

Page 2

"Big Data," "Big Data," "Big Data"!

Page 3

Big data means Hadoop... but why?

[Figure; surviving label: PB]

Page 4

The core of Hadoop is HDFS and MapReduce
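As a rough illustration of the MapReduce half of that statement, the Python sketch below runs the map, shuffle, and reduce phases in memory on two lines of text. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for this example, and HDFS, which would hold the input and output as blocks spread across the nodes, is not modeled at all.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and yield (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for each whitespace-separated word."""
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key, values):
    """Add up the counts collected for one word."""
    return sum(values)


def run_job(records):
    """Chain the three phases for a word-count job."""
    return reduce_phase(shuffle(map_phase(records, word_mapper)), sum_reducer)


counts = run_job(["select from lineitem", "select from stock"])
```

On a real cluster the mapper and reducer calls would run on different nodes, and the shuffle would move data over the network; the sketch only shows the data flow.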

Page 5

DB engineers want to use Hadoop too!

• 4-node Hadoop cluster

• Cloudera CDH4 4.4.0

• Cloudera Manager Standard 4.7.2

• Master (NameNode, JobTracker): 1 node

• Slave (DataNode, TaskTracker): 4 nodes

(one of them shares a machine with the Master)

• Disks: 12 SSDs ("Akiba model")

• Inter-node communication over InfiniBand

Super Hadoop 2013 !

Page 6

Let's analyze the DB server logs!

• Oracle Database 12c

• 4-node RAC configuration

• 3 storage nodes (home-built PCs)

• Disks: 18 SSDs ("Akiba model")

• Inter-node communication over InfiniBand

Super RAC 2013 !

Workloads run on Super RAC

Nightly batch processing: TPC-H starting at 1:00 a.m. (about 10 minutes)

Daytime OLTP processing: TPC-C starting at 8:00 a.m. (1 hour)

Page 7

Part 1: Let's analyze the performance logs

"Dstat 0.7.0 CSV output"

"Author:","Dag Wieers <dag@wieers.com>",,,,"URL:","http://dag.wieers.com/home-made/dstat/"

"Host:","iq-4node-db3",,,,"User:","root"

"Cmdline:","dstat -C 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23 --output /data01/logs/iq-4node-db3/dstat_cpu_201310300000.log 1 86400",,,,"Date:","30 Oct 2013 00:00:01 JST"

"cpu0 usage",,,,,,"cpu1 usage",,,,,,"cpu2 usage",,,,,,"cpu3 usage",,,,,,"cpu4 usage",,,,,,"cpu5 usage",,,,,,"cpu6 usage",,,,,,"cpu7 usage",,,,,,"cpu8 usage",,,,,,"cpu9 usage",,,,,,"cpu10 usage",,,,,,"cpu11 usage",,,,,,"cpu12 usage",,,,,,"cpu13 usage",,,,,,"cpu14 usage",,,,,,"cpu15 usage",,,,,,"cpu16 usage",,,,,,"cpu17 usage",,,,,,"cpu18 usage",,,,,,"cpu19 usage",,,,,,"cpu20 usage",,,,,,"cpu21 usage",,,,,,"cpu22 usage",,,,,,"cpu23 usage",,,,,,"dsk/total",,"net/total",,"paging",,"system",

"usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","read","writ","recv","send","in","out","int","csw"

1.086,0.600,98.202,0.106,0.0,0.005,1.825,0.491,97.593,0.086,0.0,0.005,0.677,0.225,99.070,0.017,0.0,0.011,0.417,0.140,99.427,0.011,0.0,0.005,0.372,0.095,99.525,0.006,0.0,0.002,0.269,0.034,99.694,0.003,0.0,0.001,1.811,0.750,97.091,0.326,0.0,0.021,0.866,0.291,98.793,0.043,0.0,0.006,0.727,0.293,98.899,0.074,0.0,0.007,0.795,0.185,99.000,0.017,0.0,0.003,0.224,0.072,99.697,0.006,0.0,0.001,0.163,0.064,99.735,0.036,0.0,0.001,0.534,0.454,98.991,0.018,0.0,0.002,2.596,0.577,96.145,0.524,0.000,0.157,0.386,0.335,99.264,0.008,0.0,0.008,0.229,0.146,99.618,0.005,0.0,0.003,0.122,0.074,99.801,0.003,0.0,0.001,0.106,0.026,99.868,0.000,0.0,0.000,0.637,0.443,98.853,0.060,0.000,0.006,0.395,0.324,99.265,0.013,0.0,0.003,1.052,0.472,97.868,0.105,0.001,0.502,1.491,0.369,98.040,0.017,0.000,0.082,0.136,0.076,99.774,0.011,0.0,0.003,0.146,0.072,99.771,0.010,0.0,0.001,5091436.512,1060554.722,0.0,0.0,0.0,0.0,8545.433,11683.463

0.0,0.990,99.010,0.0,0.0,0.0,0.0,1.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.990,0.0,99.010,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,19456.0,1536.0,8754.0,10138.0,0.0,0.0,5587.0,8866.0

1.0,0.0,99.0,0.0,0.0,0.0,2.020,0.0,97.980,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.010,4.040,94.949,0.0,0.0,0.0,1.010,1.010,97.980,0.0,0.0,0.0,1.010,0.0,98.990,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.990,0.990,98.020,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.980,0.990,95.050,0.990,0.0,0.990,0.0,1.0,99.0,0.0,0.0,0.0,1.0,1.0,98.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,1.0,1.0,98.0,0.0,0.0,0.0,1.0,1.0,98.0,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,171008.0,50688.0,341116.0,209218.0,0.0,0.0,7972.0,10555.0

1.0,1.0,98.0,0.0,0.0,0.0,3.0,0.0,97.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.010,5.051,93.939,0.0,0.0,0.0,0.0,1.020,98.980,0.0,0.0,0.0,0.990,0.990,98.020,0.0,0.0,0.0,0.990,0.0,99.010,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,1.0,99.0,0.0,0.0,0.0,3.061,2.041,94.898,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,2.0,3.0,95.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,2.0,98.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,23552.0,1536.0,502456.0,436070.0,0.0,0.0,10631.0,13814.0

1.010,0.0,98.990,0.0,0.0,0.0,2.0,1.0,97.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.990,0.0,99.010,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.980,0.990,97.030,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.010,0.0,98.990,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,7.071,1.010,90.909,0.0,0.0,1.010,1.0,1.0,98.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,1.0,99.0,0.0,0.0,0.0,0.0,1.0,99.0,0.0,0.0,0.0,1.980,0.990,97.030,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,32256.0,44544.0,1323810.0,1625256.0,0.0,0.0,9064.0,12098.0

10.891,0.990,87.129,0.990,0.0,0.0,16.667,2.941,79.412,0.980,0.0,0.0,3.0,0.0,97.0,0.0,0.0,0.0,6.931,0.990,92.079,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,2.020,0.0,95.960,2.020,0.0,0.0,1.020,0.0,98.980,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,4.040,2.020,93.939,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,4.0,2.0,94.0,0.0,0.0,0.0,65.306,6.122,19.388,5.102,0.0,4.082,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,4602880.0,4922880.0,6463056.0,6093078.0,0.0,0.0,16241.0,20141.0

Prepend the server name, date, and time to each row:

iq-4node-db3,20131030,1,0.0,0.990,99.010,0.0,0.0,0.0,0.0,1.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.990,0.0,99.010,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,19456.0,1536.0,8754.0,10138.0,0.0,0.0,5587.0,8866.0

iq-4node-db3,20131030,2,1.0,0.0,99.0,0.0,0.0,0.0,2.020,0.0,97.980,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.010,4.040,94.949,0.0,0.0,0.0,1.010,1.010,97.980,0.0,0.0,0.0,1.010,0.0,98.990,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.990,0.990,98.020,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.980,0.990,95.050,0.990,0.0,0.990,0.0,1.0,99.0,0.0,0.0,0.0,1.0,1.0,98.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,1.0,1.0,98.0,0.0,0.0,0.0,1.0,1.0,98.0,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,171008.0,50688.0,341116.0,209218.0,0.0,0.0,7972.0,10555.0

iq-4node-db3,20131030,3,1.0,1.0,98.0,0.0,0.0,0.0,3.0,0.0,97.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.010,5.051,93.939,0.0,0.0,0.0,0.0,1.020,98.980,0.0,0.0,0.0,0.990,0.990,98.020,0.0,0.0,0.0,0.990,0.0,99.010,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,1.0,99.0,0.0,0.0,0.0,3.061,2.041,94.898,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,2.0,3.0,95.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,2.0,98.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,23552.0,1536.0,502456.0,436070.0,0.0,0.0,10631.0,13814.0

iq-4node-db3,20131030,4,1.010,0.0,98.990,0.0,0.0,0.0,2.0,1.0,97.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.990,0.0,99.010,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.980,0.990,97.030,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.010,0.0,98.990,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,7.071,1.010,90.909,0.0,0.0,1.010,1.0,1.0,98.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,1.0,99.0,0.0,0.0,0.0,0.0,1.0,99.0,0.0,0.0,0.0,1.980,0.990,97.030,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,32256.0,44544.0,1323810.0,1625256.0,0.0,0.0,9064.0,12098.0

iq-4node-db3,20131030,5,10.891,0.990,87.129,0.990,0.0,0.0,16.667,2.941,79.412,0.980,0.0,0.0,3.0,0.0,97.0,0.0,0.0,0.0,6.931,0.990,92.079,0.0,0.0,0.0,1.0,0.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,2.020,0.0,95.960,2.020,0.0,0.0,1.020,0.0,98.980,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,4.040,2.020,93.939,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,4.0,2.0,94.0,0.0,0.0,0.0,65.306,6.122,19.388,5.102,0.0,4.082,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,4602880.0,4922880.0,6463056.0,6093078.0,0.0,0.0,16241.0,20141.0

Workflow: collect performance logs with dstat, drop the header and the first data row, reformat, then load into Hadoop.

tail -86400 $fn | cat -n | sed 's/\s\+/,/g' | sed "s/^/${SVRNAME},${YESTERDAY}/"

hadoop fs -put dstat_cpu_iq-4node-db3_20131030.csv dstat_cpu
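The same preprocessing step can be sketched in Python, which may be easier to test than the `sed` pipeline. This is only an illustration under assumptions: `tag_dstat_rows` is a hypothetical helper, not part of the original tooling, and unlike `tail` it skips blank lines before counting.

```python
def tag_dstat_rows(lines, server, ymd, keep=86400):
    """Keep the last `keep` rows and prefix each with server, date, and a 1-based counter.

    Mirrors `tail -86400 | cat -n | sed ...`: taking the tail drops the dstat
    header block and the first (since-boot average) row when the file holds
    exactly one day of one-second samples after them.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    tail = rows[-keep:]
    return [f"{server},{ymd},{i},{row}" for i, row in enumerate(tail, start=1)]


sample = [
    '"Dstat 0.7.0 CSV output"',
    '"usr","sys"',
    "1.086,0.600",   # first row: average since boot, to be dropped
    "0.0,0.990",
    "1.0,0.0",
]
tagged = tag_dstat_rows(sample, "iq-4node-db3", "20131030", keep=2)
```

With `keep=2` the two header lines and the since-boot row fall outside the tail, so only the two per-second samples are tagged.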

Page 8

For CSV, go with Hive!

[Figure labels: CSV; metastore database (PostgreSQL, etc.)]

Page 9

Define the CSV as a table

create external table dstat_cpu (

servername string,

create_ymd string,

create_second int,

cpu0_user DOUBLE,

cpu0_sys DOUBLE,

cpu0_idle DOUBLE,

page_in DOUBLE,

page_out DOUBLE,

system_int DOUBLE,

system_csw DOUBLE

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

STORED AS TEXTFILE LOCATION '/user/root/dstat_cpu';
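The slide shows only a few of the table's columns. Going by the dstat header above (6 values for each of 24 CPUs, then 8 disk, network, paging, and system values), the full table would have 3 + 144 + 8 = 155 columns. The Python sketch below generates such a column list rather than typing it by hand. Only `cpu0_user`, `cpu0_sys`, `cpu0_idle`, `page_in`, `page_out`, `system_int`, and `system_csw` appear on the slide; every other column name here (`cpuN_wait`, `cpuN_hiq`, `cpuN_siq`, `disk_read`, `disk_write`, `net_recv`, `net_send`) is an assumption made for illustration.

```python
# Suffixes for the six per-CPU dstat values (usr, sys, idl, wai, hiq, siq).
# "user", "sys", "idle" follow the slide; "wait", "hiq", "siq" are assumed names.
CPU_FIELDS = ["user", "sys", "idle", "wait", "hiq", "siq"]

# Trailing columns; the last four appear on the slide, the first four are assumed.
TAIL_FIELDS = [
    "disk_read", "disk_write", "net_recv", "net_send",
    "page_in", "page_out", "system_int", "system_csw",
]


def dstat_columns(n_cpus=24):
    """Return (name, type) pairs for the preprocessed dstat CSV."""
    cols = [("servername", "string"), ("create_ymd", "string"), ("create_second", "int")]
    for cpu in range(n_cpus):
        cols += [(f"cpu{cpu}_{field}", "DOUBLE") for field in CPU_FIELDS]
    cols += [(name, "DOUBLE") for name in TAIL_FIELDS]
    return cols


def render_ddl(cols, table="dstat_cpu", location="/user/root/dstat_cpu"):
    """Render a CREATE EXTERNAL TABLE statement in the style of the slide."""
    body = ",\n".join(f"  {name} {type_}" for name, type_ in cols)
    return (
        f"create external table {table} (\n{body}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"STORED AS TEXTFILE LOCATION '{location}';"
    )


columns = dstat_columns()
ddl = render_ddl(columns)
```

Generating the DDL keeps the column order tied to the dstat header, which matters because a delimited text table maps fields to columns purely by position.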

Page 10

Now let's run it!

Page 11

Don't expect speed from Hive

• Overhead of translating queries into MapReduce jobs

• Overhead of MapReduce itself

• Its selling point is high development productivity, not execution speed

[Chart: execution time in seconds (axis 0 to 8000) at SF=10 (GB) and SF=100 (GB)]

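However, it is strong at batch processing of large data volumes!

Example: the data volume grew 10x, yet processing time stayed within 2x. * The 22 TPC-H queries were run, with some modified for Hive.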
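One way to read the "10x data, about 2x time" result is a simple model in which run time is a fixed overhead plus a cost proportional to data size. The Python sketch below fits that two-parameter model to the two points. It is a back-of-the-envelope illustration, not a measurement: times are normalized so that the SF=10 run equals 1.0, the 2x ratio is taken as exact, and the linear model itself is an assumption.

```python
def fit_overhead_model(size_small, time_small, size_large, time_large):
    """Fit time = fixed + per_unit * size through two (size, time) points."""
    per_unit = (time_large - time_small) / (size_large - size_small)
    fixed = time_small - per_unit * size_small
    return fixed, per_unit


def overhead_share(fixed, per_unit, size):
    """Fraction of the modeled run time spent on the fixed part."""
    return fixed / (fixed + per_unit * size)


# SF=10 takes 1.0 time units, SF=100 takes 2.0 (the "within 2x" case at its limit).
fixed, per_gb = fit_overhead_model(10, 1.0, 100, 2.0)
share_sf10 = overhead_share(fixed, per_gb, 10)
share_sf100 = overhead_share(fixed, per_gb, 100)
```

Under these assumptions the fixed part comes to 8/9 of the SF=10 run and 4/9 of the SF=100 run, which fits the slide's point that Hive's overheads dominate small jobs and are amortized on large ones.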

Page 12

I hear something fast called Impala is out

(Figure source: Cloudera website)

Page 13

How fast is Impala?

Page 14

Impala is blazing fast! But...

[Chart: TPC-H Q3, SF=10 (GB). Execution time in seconds (axis 0 to 120), Hive vs. Impala. Callout: less than 1/5!]

Page 15

With large data volumes it gets a bit tough...

[Chart: TPC-H Q3, SF=100 (GB). Execution time in seconds (axis 0 to 450), Hive vs. Impala.]

Page 16

Depending on the volume and nature of the data, leaving it to a specialist is also an option

[Chart: TPC-H Q3, SF=100 (GB). Execution time in seconds (axis 0 to 450) for Hive, Impala, and "a certain world's-fastest RDB". Callout: 2.7 seconds]

Page 17

Part 2: Can we spot suspicious SQL?

select * from CUSTOMER

where C_LAST = 'Hirama';

Is this SQL something that normal business operations would issue?

-- Run in the CDB

alter system set audit_trail=xml, extended sid='*'

scope=spfile;

-- Run in the PDB

AUDIT SELECT TABLE BY ACCESS;

AUDIT INSERT TABLE BY ACCESS;

AUDIT UPDATE TABLE BY ACCESS;

AUDIT DELETE TABLE BY ACCESS;

[Figure labels: Oracle audit trail; DB audit tool]

Log volume: 64 GB per day...

Page 18

Let's extract the SQL from the audit log

<AuditRecord><Audit_Type>1</Audit_Type><Session_Id>140037</Session_Id>

<DBID>409456161</DBID>

<Sql_Text>select

l_returnflag,

l_linestatus,

sum(l_quantity) as sum_qty,

order by

l_returnflag,

l_linestatus</Sql_Text>

</AuditRecord>

select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,

sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1

- l_discount)) as sum_disc_price, sum(l_extendedprice * (1 -

l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty,

avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc,

count(*) as count_order from lineitem where l_shipdate &lt;= date

&apos;1998-12-01&apos; - interval &apos;91&apos; day (3) group

by l_returnflag, l_linestatus order by l_returnflag, l_linestatus

Extract only the SQL and strip the newlines so that each statement sits on one line

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar \
-D mapred.reduce.tasks=0 \
-inputreader "StreamXmlRecordReader,begin=<Sql_Text>,end=</Sql_Text>" \
-input XmlSql \
-output Sql \
-mapper cutlftag.sh \
-file cutlftag.sh

• Extract the contents of an XML tag with Hadoop Streaming

• With Hadoop Streaming, a MapReduce job can be written as a script

#!/bin/sh

tr -d "\n" | sed -e "s/\t/\ /g" | sed -e "s/<Sql_Text>//g" | sed -e "s/<\/Sql_Text>/\n/g"
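Hadoop Streaming accepts any executable as the mapper, so the same step could also be written in Python. The sketch below is a hypothetical alternative to `cutlftag.sh`, not the author's script, and it deviates from it in two labeled ways: it replaces newlines and tabs with single spaces instead of deleting newlines (deleting them can fuse tokens at line boundaries, as in `total_revenuefrom` in the test data later in the deck), and it decodes XML entities such as `&lt;` and `&apos;`.

```python
import html
import re
import sys
from typing import List

# Non-greedy match so that several <Sql_Text> elements in one input are kept apart.
_SQL_TEXT = re.compile(r"<Sql_Text>(.*?)</Sql_Text>", re.DOTALL)


def flatten_sql(record: str) -> List[str]:
    """Return every SQL statement in `record` as a single whitespace-normalized line."""
    statements = []
    for body in _SQL_TEXT.findall(record):
        text = html.unescape(body)               # &lt; -> <, &apos; -> '
        statements.append(" ".join(text.split()))  # collapse newlines, tabs, spaces
    return statements


def main(stream=sys.stdin, out=sys.stdout):
    """Mapper entry point: read everything on stdin, print one SQL statement per line."""
    for sql in flatten_sql(stream.read()):
        out.write(sql + "\n")


record = (
    "<AuditRecord><Sql_Text>select\n\tl_returnflag\n"
    "from lineitem where l_shipdate &lt;= date &apos;1998-12-01&apos;"
    "</Sql_Text></AuditRecord>"
)
flat = flatten_sql(record)
```

To use it as the `-mapper`, the file would need a shebang line and a call to `main()` at the bottom. Reading all of stdin at once mirrors what the `tr` pipeline does, but it holds the mapper's whole input in memory, which may matter for large splits.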

Page 19

Can machine learning find suspicious SQL?

Page 20

Classification with Mahout

• Steps for running a classification

1. Label the training data by hand

2. Convert to sequence files

3. Convert to vector data

4. Split into training data and test data

5. Train and build the model (train)

6. Test the model

This time the classifier is Naive Bayes

→ It is widely used in spam filters!
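To make the idea concrete, here is a minimal multinomial Naive Bayes classifier in plain Python, trained on three toy SQL statements. It is a teaching sketch under assumptions, not Mahout's implementation: the class name `NaiveBayes`, the Laplace smoothing constant, and the training sentences are all chosen for this example, and Mahout's `trainnb -c` trains the complementary variant rather than this plain one.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over lowercase, whitespace-separated tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.doc_counts = Counter()              # documents per label
        self.term_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def fit(self, docs):
        """Train on an iterable of (label, text) pairs."""
        for label, text in docs:
            tokens = text.lower().split()
            self.doc_counts[label] += 1
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        """Return the label with the highest log posterior."""
        total_docs = sum(self.doc_counts.values())
        tokens = [t for t in text.lower().split() if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.term_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokens:
                score += math.log((counts[token] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit([
    ("tpch", "select l_returnflag, sum(l_quantity) from lineitem group by l_returnflag"),
    ("tpcc", "UPDATE stock SET s_quantity = :1 WHERE s_i_id = :2 AND s_w_id = :3"),
    ("suspicious", "select * from CUSTOMER where C_LAST = 'Hirama'"),
])
guess = model.predict("SELECT s_quantity FROM stock WHERE s_i_id = :1 AND s_w_id = :2")
```

The classifier picks the label whose training statements share the most tokens with the query, which is why the choice of tokenizer and the amount of labeled data matter so much in the steps that follow.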

Page 21

Classifying and converting the data

• Classify the training data by hand

[Figure: directory trainSql containing subdirectories tpch, tpcc, and suspicious]

• Convert to sequence files

$ mahout seqdirectory -i trainSql -o trainSeq

Key: /tpcc/part-00000: Value: SELECT /* N-07 */ s_quantity, s_dist_01, s_dist_02, s_dist_03, s_dist_04, s_dist_05,

s_dist_06, s_dist_07, s_dist_08, s_dist_09, s_dist_10, s_data FROM stock WHERE s_i_id = :1 AND s_w_id = :2 FOR

UPDATE

UPDATE /* N-08 */ stock SET s_quantity = :1 , s_ytd = s_ytd + :2 , s_order_cnt = s_order_cnt + 1, s_remote_cnt =

s_remote_cnt + :3 WHERE s_i_id = :4 AND s_w_id = :5

INSERT /* N-09 */ INTO order_line (ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_delivery_d,

ol_quantity, ol_amount, ol_dist_info) VALUES (:1 , :2 , :3 , :4 , :5 , :6 , NULL, :7 , :8 , :9 )

The contents look like the sample above.

• Convert to vector data

$ mahout seq2sparse -i trainSeq -o trainSparse \
-a org.apache.lucene.analysis.WhitespaceAnalyzer

The contents look like this:

Key: /tpcc/part-00001: Value:

{543:26.124736785888672,542:36.76076126098633,541:51.987571716308594,539:116.10087585449219,538:82.09571075439453

,529:82.09571075439453,528:13.946792602539062,527:13.946792602539062,524:25.92806053161621,523:37.858341217041016

,522:25.92806053161621,521:11.3875093460083,520:11.3875093460083,519:11.3875093460083,518:25.92806053161621,517:2

6.76988983154297,516:65.2334213256836,514:37.858341217041016,513:37.858341217041016,512:53.53977966308594,501:53.

889198303222656,500:36.94595718383789,499:26.124736785888672,498:52.05316925048828,497:19.723743438720703,496:19.
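The sparse `{index:weight}` pairs above are TF-IDF weights keyed by term index. The Python sketch below shows the general idea with whitespace tokenization and the textbook formula tf × ln(N / df). It is an approximation for illustration only: the function name `tfidf_vectors` is made up, and Mahout's exact weighting and index assignment differ, so the numbers will not match the sample.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (term -> index, list of sparse {index: weight} vectors)."""
    tokenized = [doc.split() for doc in docs]            # whitespace analyzer
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({index[t]: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return index, vectors


index, vectors = tfidf_vectors([
    "select a from t",
    "select b from t",
    "update t",
])
```

A term that occurs in every document, such as `t` here, gets weight 0, while a term unique to one document gets the largest weight. That is the property that lets a classifier tell workloads apart by their distinctive table and column names.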

Page 22

Building and testing the model

• Split into training data and test data

$ mahout split -i trainSparse/tfidf-vectors --trainingOutput trainData \
--testOutput trainTestData --randomSelectionPct 50 \
--overwrite --sequenceFiles --method sequential

• Train and build the model

$ mahout trainnb -i trainData -o trainModel -li trainIndex -ow -c -el

• Test the model

$ mahout testnb -i trainTestData -o trainTestResult \
-m trainModel -l trainIndex -ow -c
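The split step holds back a percentage of the labeled vectors for testing. A minimal Python sketch of a percentage-based random split follows; `split_by_percent` and its fixed seed are assumptions made for reproducibility here, and Mahout's own selection logic may well differ in detail.

```python
import random


def split_by_percent(items, test_pct, seed=0):
    """Shuffle a copy of `items` and return (training, test) lists.

    `test_pct` is the percentage of items placed in the test set,
    playing the role of --randomSelectionPct in this sketch.
    """
    rng = random.Random(seed)       # seeded so the split is repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_pct / 100)
    return shuffled[n_test:], shuffled[:n_test]


documents = [f"doc{i}" for i in range(16)]
train_docs, test_docs = split_by_percent(documents, 50)
```

Keeping the test items out of training is what makes the later accuracy figure meaningful, although with so few items the result is still only a rough signal.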

select s_suppkey, s_name,

s_address, s_phone,

total_revenuefrom supplier,

revenue0where s_suppkey =

supplier_no and total_revenue = (

select max(total_revenue) from

revenue0 )order by s_suppkey

SELECT /* N-07 */ s_quantity,

s_dist_01, s_dist_02, s_dist_03,

s_dist_04, s_dist_05, s_dist_06,

s_dist_07, s_dist_08, s_dist_09,

s_dist_10, s_data FROM stock

WHERE s_i_id = :1 AND s_w_id =

:2 FOR UPDATE

SELECT C_ID FROM

TPCC.CUSTOMER WHERE

C_ID=:B3 AND

C_D_ID=:B2 AND

C_W_ID=:B1

Test data: 3 TPC-H statements, 4 TPC-C statements, 1 suspicious SQL statement

Page 23

So, what are the test results?

Summary

-------------------------------------------------------

Correctly Classified Instances : 8 100%

Incorrectly Classified Instances : 0 0%

Total Classified Instances : 8

=======================================================

Confusion Matrix

-------------------------------------------------------

a b c <--Classified as

1 0 0 | 1 a = suspicious

0 4 0 | 4 b = tpcc

0 0 3 | 3 c = tpch

100% accuracy!

So can we now find suspicious SQL?
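The summary figures follow directly from the confusion matrix: accuracy is the sum of the diagonal divided by the total. The Python sketch below recomputes them from the matrix printed above. The helper names are hypothetical, and the second matrix is an invented what-if case, not a result from the talk.

```python
def accuracy(matrix):
    """Correctly classified instances (the diagonal) over all instances."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(labels, matrix):
    """For each true class (row), the fraction classified as that class."""
    return {
        label: row[i] / sum(row)
        for i, (label, row) in enumerate(zip(labels, matrix))
    }


labels = ["suspicious", "tpcc", "tpch"]
reported = [[1, 0, 0], [0, 4, 0], [0, 0, 3]]   # rows = true class, as printed above
acc = accuracy(reported)
recall = per_class_recall(labels, reported)

# Hypothetical what-if: the single suspicious statement is classified as tpcc.
what_if = [[0, 1, 0], [0, 4, 0], [0, 0, 3]]
acc_what_if = accuracy(what_if)
recall_what_if = per_class_recall(labels, what_if)
```

In the what-if case overall accuracy only drops to 7/8, while recall for the `suspicious` class falls to zero. With a single suspicious example in the test set, the 100% figure says little about how reliably such statements would be caught.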

Page 24

Still a long way from applying this in production...

Quantity and quality of training data

Understanding of machine learning

Operational skills with Mahout

Page 25

How DB engineers can use Hadoop with energy

1. Think of it as "unearthing diamonds in the rough"

2. The right tool for the right job: Hadoop complements the RDBMS rather than replacing it

3. In machine learning, data is everything

This is where DB engineers can show their skills

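Page 26

• Unauthorized reproduction is prohibited.

• This document is provided for reference only, and the information in it is subject to change without notice.

• Insight Technology, Inc. makes no warranty of any kind regarding the contents of this document and accepts no liability for any damages related to its contents.

• Product and service names used in this document are trademarks or registered trademarks of their respective companies.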