20121215 devlove2012 mahout on aws

Parade of the Yellow Elephant Riders: Mahout on AWS. 都元ダイスケ (Daisuke Miyamoto), 2012-12-15 @DevLOVE2012


TRANSCRIPT

Page 1: 20121215 DevLOVE2012 Mahout on AWS

Parade of the Yellow Elephant Riders: Mahout on AWS

都元ダイスケ (Daisuke Miyamoto), 2012-12-15 @DevLOVE2012

Page 2: 20121215 DevLOVE2012 Mahout on AWS

About me

• 都元ダイスケ (Daisuke Miyamoto, @daisuke_m)

• Java developer

• I come from java-ja (and so on)

Keywords: Java, object orientation, Eclipse, 恭ライセンス, Mahout, Spring, XML, Jiemamy, DDD, Hadoop, OSGi, Haskell, Scala, Maven, Wicket, AWS

Page 3: 20121215 DevLOVE2012 Mahout on AWS

Works

• Nikkei Software (日経ソフトウエア)

• Introductory Java articles

• Eclipse articles

Page 4: 20121215 DevLOVE2012 Mahout on AWS

Mahout in Action (Mahoutインアクション)

Page 5: 20121215 DevLOVE2012 Mahout on AWS

What is Mahout?

• Implemented in Java

• Scalable

• Open source

• A machine learning library

Page 6: 20121215 DevLOVE2012 Mahout on AWS

Typical machine learning tasks

• Recommendation

• Clustering

• Classification

• And many others

Page 7: 20121215 DevLOVE2012 Mahout on AWS

Applications and machine learning

• CRUD (create, read, update, delete)

• FILTER (where)

• AGGREGATE (count, sum, avg, max, min...)

• SORT (order by)

• INTELLIGENCE (machine learning)

Page 8: 20121215 DevLOVE2012 Mahout on AWS

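Scalable

• The accuracy of machine learning depends on the amount of data

• The amount of computation grows rapidly (faster than linearly) with the amount of data

• Large-scale computing resources are required

• Hadoop (MapReduce)

• AWS Elastic MapReduce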

Page 9: 20121215 DevLOVE2012 Mahout on AWS

Recommendation

[input]
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
...

→ recommendation →

[output]
1128 [1179:5.0, 3160:4.6582785, ..., 797:4.0637455]
1136 [33493:4.8670673, 6934:4.86497, ..., 230:4.335819]
...

Page 10: 20121215 DevLOVE2012 Mahout on AWS

Non-distributed recommendation

Page 11: 20121215 DevLOVE2012 Mahout on AWS

Input data (intro.csv)

1,101,5.0
1,102,3.0
1,103,2.5

2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0

3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0

4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0

5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0

Page 12: 20121215 DevLOVE2012 Mahout on AWS

A simple recommender

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.*;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// Load the user,item,rating triples.
DataModel model = new FileDataModel(new File("intro.csv"));
// Compare users by the Pearson correlation of their ratings.
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
// Use the 2 most similar users as the neighborhood.
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);

Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

// Ask for 2 recommendations for user 1.
List<RecommendedItem> recommendations = recommender.recommend(1, 2);

for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation);
}

Page 13: 20121215 DevLOVE2012 Mahout on AWS

Result!

RecommendedItem[item:104, value:4.257081]
RecommendedItem[item:106, value:4.0]

Page 14: 20121215 DevLOVE2012 Mahout on AWS

How the recommendation works

• "Users" 1 to 5

• "Items" 101 to 107

• And the scores

Page 15: 20121215 DevLOVE2012 Mahout on AWS
Page 16: 20121215 DevLOVE2012 Mahout on AWS

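• User 1 and user 5 look similar

• User 1 and user 4 also look somewhat similar

• User 2 seems to have the opposite taste?

• No relationship with user 3 is visible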

Page 17: 20121215 DevLOVE2012 Mahout on AWS

• 1 vs 5 = 0.94

• 1 vs 4 = 0.99

• 1 vs 2 = -0.76

• 1 vs 3 = NaN

• 1 vs 1 = 1.0

Page 18: 20121215 DevLOVE2012 Mahout on AWS

Correlation coefficients

• 1 vs 1 = 1.0

• 1 vs 2 = -0.7642652566278799

• 1 vs 3 = NaN

• 1 vs 4 = 0.9999999999999998

• 1 vs 5 = 0.944911182523068

These values are how strongly each person influences user 1's predicted rating
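The coefficients above can be reproduced, approximately, with a few lines of plain Java. The following is a minimal sketch and not Mahout's own implementation: it assumes the Pearson correlation is taken only over the items both users have rated, and the class and method names (`PearsonSketch`, `pearson`, `between`) are made up for illustration. The ratings are the ones from intro.csv.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: Pearson correlation over co-rated items, written in plain Java.
// Class and method names are hypothetical and not part of Mahout.
class PearsonSketch {

    // Ratings from intro.csv: user ID -> (item ID -> rating).
    static final Map<Integer, Map<Integer, Double>> RATINGS = new LinkedHashMap<>();
    static {
        RATINGS.put(1, row(101, 5.0, 102, 3.0, 103, 2.5));
        RATINGS.put(2, row(101, 2.0, 102, 2.5, 103, 5.0, 104, 2.0));
        RATINGS.put(3, row(101, 2.5, 104, 4.0, 105, 4.5, 107, 5.0));
        RATINGS.put(4, row(101, 5.0, 103, 3.0, 104, 4.5, 106, 4.0));
        RATINGS.put(5, row(101, 4.0, 102, 3.0, 103, 2.0, 104, 4.0, 105, 3.5, 106, 4.0));
    }

    // Builds one user's ratings from alternating (item ID, rating) values.
    static Map<Integer, Double> row(double... itemThenRating) {
        Map<Integer, Double> m = new LinkedHashMap<>();
        for (int i = 0; i < itemThenRating.length; i += 2) {
            m.put((int) itemThenRating[i], itemThenRating[i + 1]);
        }
        return m;
    }

    // Pearson correlation restricted to items rated by both users.
    // Yields NaN when there is no variance to compare (e.g. one shared item).
    static double pearson(Map<Integer, Double> a, Map<Integer, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb != null) {
                n++;
                sumA += e.getValue();
                sumB += rb;
            }
        }
        double meanA = sumA / n, meanB = sumB / n;
        double cov = 0, varA = 0, varB = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb == null) continue;
            double da = e.getValue() - meanA, db = rb - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        return cov / Math.sqrt(varA * varB);
    }

    static double between(int userA, int userB) {
        return pearson(RATINGS.get(userA), RATINGS.get(userB));
    }

    public static void main(String[] args) {
        for (int other = 1; other <= 5; other++) {
            System.out.println("1 vs " + other + " = " + between(1, other));
        }
    }
}
```

User 3 shares only item 101 with user 1, so there is no variance to correlate and the result is NaN, which matches the slide.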

Page 20: 20121215 DevLOVE2012 Mahout on AWS

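Weighted average

Weights: 0.94 for user 5, 0.99 for user 4 (the ratings are taken from intro.csv)

Item 104: (0.94 × 4.0 + 0.99 × 4.5) / 1.93 ≈ 4.26

Item 105: (0.94 × 3.5) / 0.94 = 3.50

Item 106: (0.94 × 4.0 + 0.99 × 4.0) / 1.93 = 4.00

Ratings from users whose correlation coefficient is low or NaN are no longer relied on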
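The weighted average can be checked with a short program. This is a simplified sketch of the idea on the slide, not Mahout's actual code (the library may apply further rules, such as discarding estimates based on too few neighbors). The class name `WeightedAverageSketch` and the method `predict` are hypothetical; the weights are the unrounded correlation coefficients for users 5 and 4 listed earlier.

```java
// Sketch only: similarity-weighted average of neighbor ratings.
// Names are hypothetical; this is not Mahout's implementation.
class WeightedAverageSketch {

    // weights[i] is the similarity of neighbor i to the target user,
    // ratings[i] is that neighbor's rating of the item being predicted.
    static double predict(double[] weights, double[] ratings) {
        double weightedSum = 0;
        double totalWeight = 0;
        for (int i = 0; i < weights.length; i++) {
            if (Double.isNaN(weights[i])) {
                continue; // a NaN similarity carries no information
            }
            weightedSum += weights[i] * ratings[i];
            totalWeight += weights[i];
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        double simUser5 = 0.944911182523068;
        double simUser4 = 0.9999999999999998;

        // Item 104: user 5 rated it 4.0, user 4 rated it 4.5.
        System.out.println("item 104: "
                + predict(new double[] {simUser5, simUser4}, new double[] {4.0, 4.5}));
        // Item 105: only user 5 (3.5) contributes.
        System.out.println("item 105: "
                + predict(new double[] {simUser5}, new double[] {3.5}));
        // Item 106: both neighbors rated it 4.0.
        System.out.println("item 106: "
                + predict(new double[] {simUser5, simUser4}, new double[] {4.0, 4.0}));
    }
}
```

With the unrounded weights, item 104 comes out at roughly 4.2571, in line with the `value:4.257081` reported by the recommender.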

Page 21: 20121215 DevLOVE2012 Mahout on AWS

Result! (repeated)

RecommendedItem[item:104, value:4.257081]
RecommendedItem[item:106, value:4.0]

Page 22: 20121215 DevLOVE2012 Mahout on AWS

Distributed recommendation

Page 23: 20121215 DevLOVE2012 Mahout on AWS

Distributed recommendation

[input] (S3)
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
...

→ recommendation (EMR) →

[output] (S3)
1128 [1179:5.0, 3160:4.6582785, ..., 797:4.0637455]
1136 [33493:4.8670673, 6934:4.86497, ..., 230:4.335819]
...

Page 25: 20121215 DevLOVE2012 Mahout on AWS

MovieLens 10M

• 10,000 items

• 72,000 users

• 10 million ratings

Honestly, I think this is still a small data set

Page 26: 20121215 DevLOVE2012 Mahout on AWS

Preparing the data
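The transcript does not show how the data was prepared. One possible conversion is sketched below, under two assumptions: that the raw MovieLens 10M ratings file uses the `UserID::MovieID::Rating::Timestamp` layout, and that the recommender job expects `userID,itemID,rating` lines like the input shown on the earlier slides. The class name `RatingsConverter` and the sample line are illustrative only.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only: turns "user::item::rating::timestamp" lines into "user,item,rating".
// The class name and the assumed input layout are not taken from the slides.
class RatingsConverter {

    // Converts a single line, dropping the timestamp field.
    static String toCsvLine(String datLine) {
        String[] fields = datLine.split("::");
        if (fields.length < 3) {
            throw new IllegalArgumentException("Unexpected line: " + datLine);
        }
        return fields[0] + "," + fields[1] + "," + fields[2];
    }

    // Converts a whole file line by line, skipping empty lines.
    static void convert(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in);
             BufferedWriter writer = Files.newBufferedWriter(out)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                writer.write(toCsvLine(line));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) {
        // Illustrative sample line, not real data from the talk.
        System.out.println(toCsvLine("1::122::5::838985046"));
    }
}
```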

Page 27: 20121215 DevLOVE2012 Mahout on AWS

Preparing the S3 input

• Create a bucket: mahoutinaction-jp

• Upload two files

  • mahout/mahout-core-0.7-job.jar

  • input10m/mahout-10m-ratings.dat

Page 28: 20121215 DevLOVE2012 Mahout on AWS

upload by code

import java.io.File;
import com.amazonaws.auth.*;
import com.amazonaws.services.s3.*;
import com.amazonaws.services.s3.model.Region;

AWSCredentials cred = new BasicAWSCredentials("AccessKeyID", "SecretAccessKey");

AmazonS3 s3 = new AmazonS3Client(cred);

// Create the bucket in the Tokyo region, then upload the job JAR and the ratings file.
s3.createBucket("mahoutinaction-jp", Region.AP_Tokyo);
s3.putObject(
    "mahoutinaction-jp",
    "mahout/mahout-core-0.7-job.jar",
    new File("mahout-core-0.7-job.jar"));

s3.putObject(
    "mahoutinaction-jp",
    "input10m/mahout-10m-ratings.dat",
    new File("mahout-10m-ratings.dat"));

Page 29: 20121215 DevLOVE2012 Mahout on AWS

Launching EMR

• JAR Location
  mahoutinaction-jp/mahout/mahout-core-0.7-job.jar

• JAR Arguments
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
  -Dmapred.map.tasks=40
  -Dmapred.reduce.tasks=19
  -Dmapred.input.dir=s3n://mahoutinaction-jp/input10m
  -Dmapred.output.dir=s3n://mahoutinaction-jp/output10m
  --numRecommendations 100
  --similarityClassname SIMILARITY_PEARSON_CORRELATION

Page 30: 20121215 DevLOVE2012 Mahout on AWS

compute by code

import com.amazonaws.auth.*;
import com.amazonaws.services.elasticmapreduce.*;
import com.amazonaws.services.elasticmapreduce.model.*;
import com.amazonaws.services.elasticmapreduce.util.*;

AWSCredentials cred = new BasicAWSCredentials("AccessKeyID", "SecretAccessKey");

// Talk to the EMR endpoint in the Tokyo region.
AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(cred);
emr.setEndpoint("elasticmapreduce.ap-northeast-1.amazonaws.com");

RunJobFlowRequest runRequest = new RunJobFlowRequest()
    .withName("mahout-10m")
    .withSteps( ... )      // detailed on next page
    .withInstances( ... )  // detailed on next page
    .withAmiVersion("2.1.4")
    .withLogUri("s3n://mahoutinaction-jp/log");

RunJobFlowResult runResult = emr.runJobFlow(runRequest);

Page 31: 20121215 DevLOVE2012 Mahout on AWS

import java.util.Arrays;

RunJobFlowRequest runRequest = new RunJobFlowRequest()
    .withName("mahout-10m")
    .withSteps(
        // Step 1: enable Hadoop debugging.
        new StepConfig()
            .withName("Setup Hadoop Debugging")
            .withActionOnFailure(ActionOnFailure.TERMINATE_JOB_FLOW)
            .withHadoopJarStep(
                new StepFactory("ap-northeast-1.elasticmapreduce")
                    .newEnableDebuggingStep()),
        // Step 2: run RecommenderJob from the uploaded Mahout job JAR.
        new StepConfig()
            .withName("Custom Jar")
            .withActionOnFailure(ActionOnFailure.TERMINATE_JOB_FLOW)
            .withHadoopJarStep(new HadoopJarStepConfig()
                .withJar("s3n://mahoutinaction-jp/mahout/mahout-core-0.7-job.jar")
                .withMainClass("org.apache.mahout.cf.taste.hadoop.item.RecommenderJob")
                .withArgs(Arrays.asList(
                    "-Dmapred.map.tasks=40",
                    "-Dmapred.reduce.tasks=19",
                    "-Dmapred.input.dir=s3n://mahoutinaction-jp/input10m",
                    "-Dmapred.output.dir=s3n://mahoutinaction-jp/output10m",
                    "--numRecommendations", "100",
                    "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION"))))
    // Cluster: 20 m1.small instances in ap-northeast-1a, shut down once the steps finish.
    .withInstances(new JobFlowInstancesConfig()
        .withPlacement(new PlacementType("ap-northeast-1a"))
        .withInstanceCount(20)
        .withMasterInstanceType("m1.small")
        .withSlaveInstanceType("m1.small")
        .withKeepJobFlowAliveWhenNoSteps(false)
        .withHadoopVersion("0.20.205"))
    .withAmiVersion("2.1.4")
    .withLogUri("s3n://mahoutinaction-jp/logs");

Take your time reading this later

Page 32: 20121215 DevLOVE2012 Mahout on AWS

watch by code

AmazonElasticMapReduce emr = ...;
RunJobFlowResult runResult = ...;

// Look up the job flow that was just started and read its current state.
String jobFlowId = runResult.getJobFlowId();
DescribeJobFlowsRequest describeRequest =
    new DescribeJobFlowsRequest().withJobFlowIds(jobFlowId);
DescribeJobFlowsResult describeResult =
    emr.describeJobFlows(describeRequest);
JobFlowDetail detail = describeResult.getJobFlows().get(0);
JobFlowExecutionStatusDetail statusDetail =
    detail.getExecutionStatusDetail();
JobFlowExecutionState state =
    JobFlowExecutionState.fromValue(statusDetail.getState());

// COMPLETED, FAILED, TERMINATED, RUNNING, SHUTTING_DOWN,
// STARTING, WAITING, BOOTSTRAPPING
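The snippet above reads the state once. To wait for the job to finish, the same lookup could be repeated until a final state appears. The sketch below shows only that loop and deliberately avoids the AWS SDK: the `Supplier<String>` is a hypothetical stand-in for "call describeJobFlows and return the state name", and treating COMPLETED, FAILED and TERMINATED as the final states is an assumption based on the list of state names above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.function.Supplier;

// Sketch only: poll a state source until the job flow reaches a final state.
// JobFlowWatcher and its methods are hypothetical, not part of the AWS SDK.
class JobFlowWatcher {

    // States assumed to be final; every other state means "keep waiting".
    static final Set<String> TERMINAL_STATES =
            new HashSet<>(Arrays.asList("COMPLETED", "FAILED", "TERMINATED"));

    static boolean isTerminal(String state) {
        return TERMINAL_STATES.contains(state);
    }

    // Asks stateSource for the current state every intervalMillis
    // and returns the first final state it sees.
    static String waitForCompletion(Supplier<String> stateSource, long intervalMillis) {
        while (true) {
            String state = stateSource.get();
            if (isTerminal(state)) {
                return state;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        // A canned sequence of states stands in for real API calls.
        Iterator<String> fakeStates = Arrays.asList(
                "STARTING", "BOOTSTRAPPING", "RUNNING", "SHUTTING_DOWN", "COMPLETED").iterator();
        System.out.println(waitForCompletion(fakeStates::next, 10L));
    }
}
```

In real use the supplier would wrap the lookup shown on this slide, and the interval would be much longer than the 10 ms used for the demonstration.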

Page 33: 20121215 DevLOVE2012 Mahout on AWS

Retrieving the results

Several files have been generated at the specified location.

Page 34: 20121215 DevLOVE2012 Mahout on AWS
Page 35: 20121215 DevLOVE2012 Mahout on AWS

download by code

import java.io.InputStream;
import java.util.List;
import com.amazonaws.auth.*;
import com.amazonaws.services.s3.*;
import com.amazonaws.services.s3.model.*;

AWSCredentials cred = new BasicAWSCredentials("AccessKeyID", "SecretAccessKey");

AmazonS3 s3 = new AmazonS3Client(cred);

// List everything under the output prefix.
ObjectListing listing = s3.listObjects("mahoutinaction-jp", "output10m");

List<S3ObjectSummary> summaries = listing.getObjectSummaries();
for (S3ObjectSummary summary : summaries) {
    System.out.println(summary.getKey());
    // Skip the marker file that carries no recommendations.
    if (summary.getKey().endsWith("/_SUCCESS")) {
        continue;
    }
    S3Object obj = s3.getObject("mahoutinaction-jp", summary.getKey());
    InputStream in = obj.getObjectContent();
    // ...
}
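What happens at the `// ...` is left open on the slide. One option is to parse each downloaded line into a user ID and its recommended items. The sketch below assumes each line has the shape shown on the output slides, a user ID followed by a bracketed list of `itemID:score` pairs; the class name `RecommendationLineParser` and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: parses one output line such as "1128 [1179:5.0, 3160:4.6582785]".
// The class and the assumed line layout are illustrative.
class RecommendationLineParser {

    // Everything before the opening bracket is the user ID.
    static long userId(String line) {
        return Long.parseLong(line.substring(0, line.indexOf('[')).trim());
    }

    // Returns item ID -> score, in the order the items appear on the line.
    static Map<Long, Double> items(String line) {
        String body = line.substring(line.indexOf('[') + 1, line.lastIndexOf(']'));
        Map<Long, Double> result = new LinkedHashMap<>();
        for (String pair : body.split(",")) {
            String trimmed = pair.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.indexOf(':');
            long itemId = Long.parseLong(trimmed.substring(0, colon));
            double score = Double.parseDouble(trimmed.substring(colon + 1));
            result.put(itemId, score);
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "1128 [1179:5.0, 3160:4.6582785, 797:4.0637455]";
        System.out.println(userId(line) + " -> " + items(line));
    }
}
```

Reading the `InputStream` line by line, for example through a `BufferedReader`, and passing each line to these two methods would give the recommendations per user.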

Page 36: 20121215 DevLOVE2012 Mahout on AWS

Summary

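• Machine learning is a slightly "intelligent" feature

• There are distributed and non-distributed algorithms

• Non-distributed: run it online

• Distributed: run it on AWS EMR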