20121215 DevLOVE2012 Mahout on AWS
Parade of the Yellow Elephant Riders: Mahout on AWS
都元ダイスケ, 2012-12-15 @DevLOVE2012
About me
• 都元ダイスケ (@daisuke_m)
• I'm a Java developer
• I came from java-ja (and so on)
Java, object-oriented programming, Eclipse, 恭ライセンス, medicine, Mahout, Spring, XML, Jiemamy, DDD, Hadoop, OSGi, Haskell, Scala, Maven, Wicket, AWS, alcohol
Works
• Nikkei Software
• Introductory Java articles
• Eclipse articles
Mahoutインアクション (Mahout in Action)
What is Mahout?
• Implemented in Java
• Scalable
• Open source
• A machine learning library
Typical machine learning tasks
• Recommendation
• Clustering
• Classification
• And many others
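Applications and machine learning
• CRUD (create, read, update, delete)
• FILTER (where)
• AGGREGATE (count, sum, avg, max, min...)
• SORT (order by)
• INTELLIGENCE (machine learning)
Scalable
• The accuracy of machine learning depends on the amount of data
• As the data grows, the amount of computation grows exponentially
• Large-scale computing resources are needed
• Hadoop (MapReduce)
• AWS Elastic MapReduce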
Recommendation
【input】
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
...
↓ recommendation
【output】
1128 [1179:5.0, 3160:4.6582785, ..., 797:4.0637455]
1136 [33493:4.8670673, 6934:4.86497, ..., 230:4.335819]
...
Non-distributed recommendation
Input data (intro.csv)
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
A simple recommender
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.*;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// Load the ratings (userID,itemID,score) from the CSV file
DataModel model = new FileDataModel(new File("intro.csv"));
// Compare users by the Pearson correlation of their ratings
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
// Use the 2 most similar users as the neighborhood
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);

Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

// Ask for 2 recommendations for user 1
List<RecommendedItem> recommendations = recommender.recommend(1, 2);

for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation);
}
Result!
RecommendedItem[item:104, value:4.257081]
RecommendedItem[item:106, value:4.0]
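How the recommendation works
• Users 1 to 5
• Items 101 to 107
• And the scores
• User 1 and user 5 are similar
• User 1 and user 4 are also somewhat similar
• User 2 seems to have the opposite taste?
• No relationship with user 3 is visible
• 1 vs 5 = 0.94
• 1 vs 4 = 0.99
• 1 vs 2 = -0.76
• 1 vs 3 = NaN
• 1 vs 1 = 1.0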
Correlation coefficient
• 1 vs 1 = 1.0
• 1 vs 2 = -0.7642652566278799
• 1 vs 3 = NaN
• 1 vs 4 = 0.9999999999999998
• 1 vs 5 = 0.944911182523068
This is how strongly each person influences the predicted rating for user 1
http://ja.wikipedia.org/wiki/相関係数
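The coefficients above can be reproduced by hand. The following is a minimal sketch, not Mahout's own code: it assumes the plain Pearson correlation computed only over the items that both users have rated, using the intro.csv ratings. The class and method names (`PearsonSketch`, `pearson`) are made up for illustration. With a single item in common (user 1 vs user 3) the variance is zero, so the division yields NaN.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PearsonSketch {

    /** Pearson correlation over the items that both users have rated (hypothetical helper). */
    static double pearson(Map<Integer, Double> a, Map<Integer, Double> b) {
        // Collect the co-rated items
        List<Integer> common = new ArrayList<>();
        for (Integer item : a.keySet()) {
            if (b.containsKey(item)) {
                common.add(item);
            }
        }
        int n = common.size();

        // Mean rating of each user over the co-rated items
        double sumA = 0.0;
        double sumB = 0.0;
        for (Integer item : common) {
            sumA += a.get(item);
            sumB += b.get(item);
        }
        double meanA = sumA / n;
        double meanB = sumB / n;

        // Covariance and variances (unnormalized; the factor 1/n cancels out)
        double cov = 0.0;
        double varA = 0.0;
        double varB = 0.0;
        for (Integer item : common) {
            double da = a.get(item) - meanA;
            double db = b.get(item) - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        // 0.0 / 0.0 gives NaN when there is no variance to compare
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        // Ratings taken from intro.csv: itemID -> score, per user
        Map<Integer, Map<Integer, Double>> users = new LinkedHashMap<>();
        users.put(1, Map.of(101, 5.0, 102, 3.0, 103, 2.5));
        users.put(2, Map.of(101, 2.0, 102, 2.5, 103, 5.0, 104, 2.0));
        users.put(3, Map.of(101, 2.5, 104, 4.0, 105, 4.5, 107, 5.0));
        users.put(4, Map.of(101, 5.0, 103, 3.0, 104, 4.5, 106, 4.0));
        users.put(5, Map.of(101, 4.0, 102, 3.0, 103, 2.0, 104, 4.0, 105, 3.5, 106, 4.0));

        for (Map.Entry<Integer, Map<Integer, Double>> other : users.entrySet()) {
            System.out.println("1 vs " + other.getKey() + " = "
                    + pearson(users.get(1), other.getValue()));
        }
    }
}
```

The printed values should land close to the list above; the last digits may differ slightly from Mahout's output because of floating-point rounding.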
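Weighted average
• Item 104: 4.25 = (0.94 × 4.0 + 0.99 × 4.5) / 1.93
• Item 105: 3.50 = (0.94 × 3.5) / 0.94
• Item 106: 4.00 = (0.94 × 4.0 + 0.99 × 4.0) / 1.93
• The ratings from users 2 and 3 are no longer relied on, because their correlation coefficient is low or NaN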
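The weighted average can be checked with a few lines of code. This is a minimal sketch under stated assumptions, not Mahout's implementation: the neighbors of user 1 are taken to be users 4 and 5, their weights are the full-precision correlation coefficients listed earlier, and the ratings come from intro.csv. The names `WeightedAverageSketch` and `estimate` are hypothetical.

```java
import java.util.Map;

class WeightedAverageSketch {

    /**
     * Similarity-weighted average of the neighbors' ratings for one item.
     * Neighbors that have not rated the item are skipped.
     */
    static double estimate(Map<Integer, Double> similarityByUser,
                           Map<Integer, Double> ratingByUser) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<Integer, Double> entry : similarityByUser.entrySet()) {
            Double rating = ratingByUser.get(entry.getKey());
            if (rating == null) {
                continue; // this neighbor has no rating for the item
            }
            weightedSum += entry.getValue() * rating;
            totalWeight += entry.getValue();
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // Similarity of user 1 to users 4 and 5 (values from the correlation list)
        Map<Integer, Double> similarity = Map.of(4, 0.9999999999999998, 5, 0.944911182523068);

        // Ratings per item by user 4 and user 5, from intro.csv
        System.out.println("item 104: " + estimate(similarity, Map.of(4, 4.5, 5, 4.0)));
        System.out.println("item 105: " + estimate(similarity, Map.of(5, 3.5)));
        System.out.println("item 106: " + estimate(similarity, Map.of(4, 4.0, 5, 4.0)));
    }
}
```

With the unrounded coefficients, the estimate for item 104 should come out close to the 4.257081 that Mahout reports, rather than the 4.25 obtained from the two-digit weights.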
Result! (shown again)
RecommendedItem[item:104, value:4.257081]
RecommendedItem[item:106, value:4.0]
Distributed recommendation
【input】 (S3)
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
...
↓ recommendation (EMR)
【output】 (S3)
1128 [1179:5.0, 3160:4.6582785, ..., 797:4.0637455]
1136 [33493:4.8670673, 6934:4.86497, ..., 230:4.335819]
...
MovieLens 10M
• 10,000 items
• 72,000 users
• 10 million ratings
Honestly, I think even this is still small-scale
Processing the data
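The slide does not show what the processing step actually does. As a sketch under that caveat: the MovieLens 10M ratings file uses lines of the form UserID::MovieID::Rating::Timestamp, while Mahout's recommender input is userID,itemID,score, so one plausible step is to rewrite the separators and drop the timestamp. The class name and the command-line handling below are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class RatingsConverterSketch {

    /** Converts one "UserID::MovieID::Rating::Timestamp" line to "userID,itemID,rating". */
    static String convertLine(String line) {
        String[] fields = line.split("::");
        if (fields.length < 3) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        // Keep user, item and rating; the timestamp is dropped
        return fields[0] + "," + fields[1] + "," + fields[2];
    }

    /** Streams the input file line by line so the whole data set never sits in memory. */
    static void convert(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                writer.write(convertLine(line));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: RatingsConverterSketch <input> <output>");
            return;
        }
        convert(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

For example, it could be run with the downloaded ratings file as the first argument and mahout-10m-ratings.dat as the second.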
Preparing the S3 input
• Create a bucket: mahoutinaction-jp
• Upload two files
  • mahout/mahout-core-0.7-job.jar
  • input10m/mahout-10m-ratings.dat
upload by code
import java.io.File;
import com.amazonaws.auth.*;
import com.amazonaws.services.s3.*;
import com.amazonaws.services.s3.model.Region;

AWSCredentials cred = new BasicAWSCredentials("AccessKeyID", "SecretAccessKey");
AmazonS3 s3 = new AmazonS3Client(cred);

// Create the bucket in the Tokyo region
s3.createBucket("mahoutinaction-jp", Region.AP_Tokyo);

// Upload the Mahout job JAR
s3.putObject(
    "mahoutinaction-jp",
    "mahout/mahout-core-0.7-job.jar",
    new File("mahout-core-0.7-job.jar"));

// Upload the ratings data
s3.putObject(
    "mahoutinaction-jp",
    "input10m/mahout-10m-ratings.dat",
    new File("mahout-10m-ratings.dat"));
Launching EMR
• JAR Location
mahoutinaction-jp/mahout/mahout-core-0.7-job.jar
• JAR Arguments
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
-Dmapred.map.tasks=40
-Dmapred.reduce.tasks=19
-Dmapred.input.dir=s3n://mahoutinaction-jp/input10m
-Dmapred.output.dir=s3n://mahoutinaction-jp/output10m
--numRecommendations 100
--similarityClassname SIMILARITY_PEARSON_CORRELATION
compute by code
import com.amazonaws.auth.*;
import com.amazonaws.services.elasticmapreduce.*;
import com.amazonaws.services.elasticmapreduce.model.*;
import com.amazonaws.services.elasticmapreduce.util.*;

AWSCredentials cred = new BasicAWSCredentials("AccessKeyID", "SecretAccessKey");
AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(cred);
emr.setEndpoint("elasticmapreduce.ap-northeast-1.amazonaws.com");

RunJobFlowRequest runRequest = new RunJobFlowRequest()
    .withName("mahout-10m")
    .withSteps( ... )     // detailed on next page
    .withInstances( ... ) // detailed on next page
    .withAmiVersion("2.1.4")
    .withLogUri("s3n://mahoutinaction-jp/log");

RunJobFlowResult runResult = emr.runJobFlow(runRequest);
// Also requires: import java.util.Arrays;
RunJobFlowRequest runRequest = new RunJobFlowRequest()
    .withName("mahout-10m")
    .withSteps(
        new StepConfig()
            .withName("Setup Hadoop Debugging")
            .withActionOnFailure(ActionOnFailure.TERMINATE_JOB_FLOW)
            .withHadoopJarStep(
                new StepFactory("ap-northeast-1.elasticmapreduce")
                    .newEnableDebuggingStep()),
        new StepConfig()
            .withName("Custom Jar")
            .withActionOnFailure(ActionOnFailure.TERMINATE_JOB_FLOW)
            .withHadoopJarStep(new HadoopJarStepConfig()
                .withJar("s3n://mahoutinaction-jp/mahout/mahout-core-0.7-job.jar")
                .withMainClass("org.apache.mahout.cf.taste.hadoop.item.RecommenderJob")
                .withArgs(Arrays.asList(
                    "-Dmapred.map.tasks=40",
                    "-Dmapred.reduce.tasks=19",
                    "-Dmapred.input.dir=s3n://mahoutinaction-jp/input10m",
                    "-Dmapred.output.dir=s3n://mahoutinaction-jp/output10m",
                    "--numRecommendations", "100",
                    "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION"))))
    .withInstances(new JobFlowInstancesConfig()
        .withPlacement(new PlacementType("ap-northeast-1a"))
        .withInstanceCount(20)
        .withMasterInstanceType("m1.small")
        .withSlaveInstanceType("m1.small")
        .withKeepJobFlowAliveWhenNoSteps(false)
        .withHadoopVersion("0.20.205"))
    .withAmiVersion("2.1.4")
    .withLogUri("s3n://mahoutinaction-jp/logs");
Take your time reading this later
watch by code
AmazonElasticMapReduce emr = ...;
RunJobFlowResult runResult = ...;

String jobFlowId = runResult.getJobFlowId();
DescribeJobFlowsRequest describeRequest =
    new DescribeJobFlowsRequest().withJobFlowIds(jobFlowId);
DescribeJobFlowsResult describeResult =
    emr.describeJobFlows(describeRequest);
JobFlowDetail detail = describeResult.getJobFlows().get(0);
JobFlowExecutionStatusDetail statusDetail =
    detail.getExecutionStatusDetail();
JobFlowExecutionState state =
    JobFlowExecutionState.fromValue(statusDetail.getState());

// COMPLETED, FAILED, TERMINATED, RUNNING, SHUTTING_DOWN,
// STARTING, WAITING, BOOTSTRAPPING
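The snippet above reads the state once; to wait for the job to finish, the check would typically be repeated until a final state shows up. The following is a rough sketch of such a loop, kept free of the AWS SDK so that it runs on its own: the `Supplier<String>` is a hypothetical stand-in for the describeJobFlows call, and treating COMPLETED, FAILED and TERMINATED as the final states is an assumption based on the state names listed above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.function.Supplier;

class JobFlowWatcherSketch {

    /** States assumed to be final; every other state means "keep waiting". */
    static final Set<String> TERMINAL_STATES =
            new HashSet<>(Arrays.asList("COMPLETED", "FAILED", "TERMINATED"));

    /**
     * Polls the supplier until it reports a terminal state and returns that state.
     * In real use the supplier would wrap the describeJobFlows call.
     */
    static String waitForTerminalState(Supplier<String> stateSupplier, long intervalMillis) {
        while (true) {
            String state = stateSupplier.get();
            System.out.println("state = " + state);
            if (TERMINAL_STATES.contains(state)) {
                return state;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        // Canned sequence standing in for real EMR responses
        Iterator<String> canned = Arrays.asList(
                "STARTING", "BOOTSTRAPPING", "RUNNING", "SHUTTING_DOWN", "COMPLETED").iterator();
        String last = waitForTerminalState(canned::next, 10L);
        System.out.println("finished with " + last);
    }
}
```

Against the real service, a polling interval of tens of seconds would be more reasonable than the 10 ms used in this demo.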
Retrieving the results
Several files have been generated at the specified location.
download by code
import java.io.InputStream;
import java.util.List;
import com.amazonaws.auth.*;
import com.amazonaws.services.s3.*;
import com.amazonaws.services.s3.model.*;

AWSCredentials cred = new BasicAWSCredentials("AccessKeyID", "SecretAccessKey");
AmazonS3 s3 = new AmazonS3Client(cred);

// List everything under the output prefix
ObjectListing listing = s3.listObjects("mahoutinaction-jp", "output10m");
List<S3ObjectSummary> summaries = listing.getObjectSummaries();
for (S3ObjectSummary summary : summaries) {
    System.out.println(summary.getKey());
    // Skip the marker file that only signals success
    if (summary.getKey().endsWith("/_SUCCESS")) {
        continue;
    }
    S3Object obj = s3.getObject("mahoutinaction-jp", summary.getKey());
    InputStream in = obj.getObjectContent();
    // ...
}
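Once a result file has been read, each line still has to be split into a user ID and its recommended items. Below is a minimal sketch of such a parser. It assumes the line layout shown in the 【output】 example (a user ID, whitespace, then a bracketed list of itemID:score pairs); the sample line is a shortened one built from the values on the slide, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RecommendationLineSketch {

    /** Returns the user ID that precedes the bracketed list. */
    static long userIdOf(String line) {
        return Long.parseLong(line.substring(0, line.indexOf('[')).trim());
    }

    /** Returns itemID -> estimated score, in the order the pairs appear on the line. */
    static Map<Long, Double> itemsOf(String line) {
        String body = line.substring(line.indexOf('[') + 1, line.lastIndexOf(']'));
        Map<Long, Double> items = new LinkedHashMap<>();
        for (String pair : body.split(",")) {
            String trimmed = pair.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate an empty list "[]"
            }
            int colon = trimmed.indexOf(':');
            items.put(Long.parseLong(trimmed.substring(0, colon)),
                      Double.parseDouble(trimmed.substring(colon + 1)));
        }
        return items;
    }

    public static void main(String[] args) {
        // Shortened sample line; real lines carry up to --numRecommendations pairs
        String line = "1128\t[1179:5.0,3160:4.6582785,797:4.0637455]";
        System.out.println(userIdOf(line) + " -> " + itemsOf(line));
    }
}
```

A `LinkedHashMap` is used so that the items keep the order in which they appear on the line, which is presumably the ranking order.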
Summary
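• Machine learning is a feature that adds a bit of intelligence
• There are distributed and non-distributed algorithms
• Non-distributed: run it online
• Distributed: run it on EMR in AWS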