Applications of Apache Mahout in E-Commerce

James Chen, Etu Solution. Hadoop in TW 2013, Sep 28, 2013

Upload: james-chen

Post on 08-Sep-2014

DESCRIPTION

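Apache Mahout is an algorithm library built on top of MapReduce. It ships with many classic algorithms and uses MapReduce's parallel processing architecture to make the analysis of very large data sets easier. E-commerce is becoming increasingly personalized, which drives demand for user behavior analysis. Personalized services such as precise recommendation, targeted ad delivery, personalized product recommendation, and personalized content recommendation keep appearing, and this makes Apache Mahout an increasingly important part of the Hadoop ecosystem. Combining MapReduce's parallel computing power with machine learning helps e-commerce operators extract valuable user behavior data from massive web site logs. In this session, Etu introduces the principles behind Mahout's built-in product recommendation algorithms and shows, step by step, how to build an end-to-end product recommendation system.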

TRANSCRIPT

Page 1: Applications of Apache Mahout in E-Commerce

James Chen, Etu Solution

Hadoop in TW 2013, Sep 28, 2013

Page 2: Taiwan Hadoop 2013 Status Survey

Fill in the survey for a chance to win two movie tickets. The survey closes on 2013/10/7.

Page 3: Agenda

• Apache Mahout Introduction
• Machine Learning Use Cases
• Building a recommendation system
• Collaborative Filtering
• System Architecture
• Performance
• Future Roadmap

Page 4: Apache Mahout

• ASF project to create scalable machine learning libraries
– http://mahout.apache.org
• Why Mahout?
– Many open source machine learning libraries either:
  • Lack Community
  • Lack Documentation and Examples
  • Lack Scalability
  • Lack the Apache License

Page 5: Algorithms in Mahout

[Figure: algorithm categories in Mahout: Regression, Recommenders, Clustering, Classification, Freq. Pattern Mining, Vector Similarity, Non-MR Algorithms, Dimension Reduction, Evolution, Examples]

See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

Page 6: Algorithms in Mahout

• Classification
– Logistic Regression
– Bayesian
– Support Vector Machines
– Perceptron and Winnow
– Neural Network
– Random Forests
– Restricted Boltzmann Machines
– Online Passive Aggressive
– Boosting
– Hidden Markov Models
• Clustering
– Canopy Clustering
– K-Means
– Fuzzy K-Means
– Expectation Maximization
– Mean Shift
– Hierarchical Clustering
– Dirichlet Process Clustering
– Latent Dirichlet Allocation
– Spectral Clustering
– Minhash Clustering
– Top Down Clustering

Page 7: Algorithms in Mahout – Cont.

• Pattern Mining
– Parallel FP Growth
• Regression
– Locally Weighted Linear Regression
• Dimension Reduction
– SVD
– Stochastic SVD with PCA
– PCA
– Independent Component Analysis
– Gaussian Discriminative Analysis
• Evolution Algorithms
– Genetic Algorithms
• Recommenders
– Non-distributed recommenders (“Taste”)
– Distributed Item-Based Collaborative Filtering
– Collaborative Filtering using a parallel matrix factorization
– Slope One

Page 8: Algorithms in Mahout – Cont.

• Vector Similarity
– RowSimilarityJob (MR)
– VectorDistanceJob (MR)
• Other
– Collocations
• Non-MapReduce algorithms

See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

Page 9: Mahout Focus on Scalability

• Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
– Some algorithms won't scale to massive machine clusters
– Others fit logically on a MapReduce framework like Apache Hadoop
– Still others will need alternative distributed programming models
– Be pragmatic
• Most Mahout implementations are MapReduce enabled
• (Always a) Work in Progress

Page 10: Prepare Data from Raw Content

• Lucene integration
– bin/mahout lucenevector …
• Document Vectorizer
– bin/mahout seqdirectory …
– bin/mahout seq2sparse …
• Programmatically
– See the Utils module in Mahout
• Database (JDBC)
• File System (HDFS)

Page 11: Machine Learning

• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
– Intro. to Machine Learning by E. Alpaydin
• Subset of Artificial Intelligence
• Lots of related fields:
– Information Retrieval
– Stats
– Biology
– Linear algebra
– Many more

Page 12: Use Case: Recommendation

Page 13: Use Case: Classification

Page 14: Use Case: Clustering

Page 15: More Use Cases

• Recommend products/books/friends …
• Classify content into predefined groups
• Find similar content based on object properties
• Find associations/patterns in actions/behaviors
• Identify key topics in large collections of text
• Detect anomalies in machine output
• Rank search results (PageRank)
• Others

Page 16: Building a Recommendation System

• Help users find items they might like based on historical preferences

Page 17: Approach

• Collect User Preferences -> User vs Item Matrix
• Find Similar Users or Items (Neighborhood-based approach)
• Works by finding similarly rated items in the user-item matrix (e.g. cosine, Pearson correlation, Tanimoto coefficient)
• Estimates a user's preference towards an item by looking at his/her preferences towards similar items

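Page 18: Collaborative Filtering – User Based

Find User Similarity

1. How do we predict User 1's preference for Item 4?
2. Find the n users who are most similar to User 1 and who have purchased Item 4 (ratings are derived from purchase records). Call them User n.
3. Take User n's ratings of Item 4 and back-fill the result, using user similarity as the weight.
4. Repeat steps 1 to 3 for every user combination until all empty cells are filled.

[Figure: user-item matrix with Items as columns; the unknown cell ("?") in the User 1 row is back-filled from the User n rows]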
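The four steps above can be sketched in a few lines of Python. This is only an illustration, not Mahout's implementation: the function names, the dictionary layout, and the choice of cosine similarity over co-rated items are all assumptions made for the example.

```python
import math

def cosine_on_shared(ratings_a, ratings_b):
    """Cosine similarity restricted to items both users rated (an assumed choice)."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in shared)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict_user_based(ratings, user, item, n=2):
    """Predict `user`'s preference for `item` from the n most similar users."""
    # Step 2: candidate neighbors are other users who have a rating for the item.
    candidates = [
        (cosine_on_shared(ratings[user], other_ratings), other)
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    neighbors = sorted(candidates, reverse=True)[:n]
    # Step 3: similarity-weighted average of the neighbors' ratings.
    weight_sum = sum(abs(sim) for sim, _ in neighbors)
    if weight_sum == 0:
        return None  # no usable neighbor, the cell stays empty
    return sum(sim * ratings[other][item] for sim, other in neighbors) / weight_sum

# Hypothetical toy data: {user: {item: rating}}
toy = {
    "user1": {"item1": 5, "item2": 3},
    "user2": {"item1": 5, "item2": 3, "item4": 4},
    "user3": {"item1": 4, "item2": 2, "item4": 2},
}
print(predict_user_based(toy, "user1", "item4"))
```

Step 4 would simply call `predict_user_based` for every empty user/item cell.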

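Page 19: Collaborative Filtering – Item Based

Find Item Similarity (Amazon)

1. How do we predict User 1's preference for Item 4?
2. For each item n in User 1's history, compute the similarity between item n and Item 4 (using other users' histories).
3. Take User 1's ratings of item n and back-fill the result, using item similarity as the weight.
4. Repeat steps 1 to 3 for every item combination until all empty cells are filled.

[Figure: user-item matrix with Users as rows and Items as columns; the unknown cell ("?") is back-filled]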

Page 20: Test Drive of Mahout Recommender

• GroupLens dataset: http://www.grouplens.org/node/12
• 1,000,209 anonymous ratings of 3,900 movies made by 6,040 MovieLens users
• movies.dat (movie ids with title and category)
• ratings.dat (ratings of movies)
• users.dat (user information)

Page 21: Ratings File

• Each line of the ratings file has the format UserID::MovieID::Rating::Timestamp
• Mahout requires the following CSV format: UserID,ItemID,Value
• tr -s ':' ',' < ratings.dat | cut -f1-3 -d, > rating.csv
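The same conversion can be done in Python when `tr` and `cut` are not available. This is a minimal sketch: the function name is made up for the example, and the file names in the usage note are the ones from the slide.

```python
def ratings_to_csv(lines):
    """Convert 'UserID::MovieID::Rating::Timestamp' lines to 'UserID,ItemID,Value'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, movie_id, rating, _timestamp = line.split("::")
        yield f"{user_id},{movie_id},{rating}"

# Example with one line in the ratings.dat format
print(list(ratings_to_csv(["1::1193::5::978300760\n"])))
```

To convert a whole file, open `ratings.dat` for reading and `rating.csv` for writing, then write each yielded row followed by a newline.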

Page 22: Run Recommendation Job

• $ mahout recommenditembased \
    -i [input-hdfs-path] \
    -o [output-hdfs-path] \
    --usersFile [File listing users] \
    --tempDir

Page 23: Recommendation Result

• The recommendation result looks like: UserID [ItemID:Weight, ItemID:Weight,…]
• Each line represents a UserID with its associated recommended ItemIDs
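A result line in this shape is easy to parse before loading it into a serving store. The sketch below assumes the UserID and the bracketed list are separated by whitespace (a tab or a space). The function name is hypothetical.

```python
def parse_recommendation_line(line):
    """Parse 'UserID [ItemID:Weight, ItemID:Weight]' into (user_id, [(item_id, weight), ...])."""
    user_id, items_part = line.strip().split(None, 1)  # split on the first whitespace run
    items_part = items_part.strip().strip("[]")
    recommendations = []
    for entry in items_part.split(","):
        entry = entry.strip()
        if not entry:
            continue  # empty list or trailing comma
        item_id, weight = entry.split(":")
        recommendations.append((item_id, float(weight)))
    return user_id, recommendations

# Made-up sample line in the format shown on the slide
print(parse_recommendation_line("1\t[1193:5.0, 661:4.5]"))
```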

Page 24: Collect User Behavior Events

Implicit (Easy to collect) | Explicit (Hard to collect)
View | Rating (0~5)
Shopping Cart (0 or 1) | Voting (0 or 1)
Order or Buy (0 or 1) | Forward or Share (0 or 1)
Duration Time (Noisy) | Add favorite (0 or 1)
 | Tag (text analysis)
 | Comments (text analysis)

Page 25: Process Event into Preference

• Group events by type and calculate similarity per event type, e.g. Also View, Also Buy
• Weighting:
– Explicit Event > Implicit Event
– Order, Cart > View
• Noise Reduction
• Normalization
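One possible reading of the weighting, noise reduction, and normalization bullets is sketched below. The slide only gives the ordering Order, Cart > View. The concrete weights, the cap, and the function name are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical weights; only the ordering order/cart > view comes from the slide.
EVENT_WEIGHTS = {"view": 1.0, "cart": 3.0, "order": 5.0}

def events_to_preferences(events, weights=EVENT_WEIGHTS, cap=5.0):
    """Turn (user_id, item_id, event_type) events into {(user_id, item_id): preference}."""
    preferences = defaultdict(float)
    for user_id, item_id, event_type in events:
        if event_type not in weights:
            continue  # crude noise reduction: ignore event types we do not score
        preferences[(user_id, item_id)] += weights[event_type]
    # Crude normalization: cap the accumulated score so heavy users do not dominate.
    return {key: min(value, cap) for key, value in preferences.items()}

events = [
    ("u1", "p1", "view"), ("u1", "p1", "cart"), ("u1", "p1", "order"),
    ("u2", "p1", "view"), ("u2", "p2", "hover"),
]
print(events_to_preferences(events))
```

The resulting triples can be written out as UserID,ItemID,Value rows. Grouping by event type instead would mean building one such table per event type.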

Page 26: Similarity (Vector Similarity)

• Euclidean Distance

• Pearson Correlation Coefficient (-1 ~ +1)

• Cosine Similarity

• Tanimoto Coefficient
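For reference, the four measures can be written out in plain Python. This is a textbook-style sketch, not Mahout's code: the first three take two equal-length rating vectors, and Tanimoto takes two sets of item IDs (for example, items a user bought).

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation coefficient, in the range -1 to +1."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # undefined for constant vectors; 0 is a convention used here
    return cov / (norm_a * norm_b)

def cosine(a, b):
    """Cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def tanimoto(items_a, items_b):
    """Size of the intersection divided by size of the union (ignores rating values)."""
    a, b = set(items_a), set(items_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

print(euclidean_distance([0, 0], [3, 4]))
print(tanimoto({"p1", "p2", "p3"}, {"p2", "p3", "p4"}))
```

Tanimoto is a natural fit for 0-or-1 events such as cart and order, while Pearson and cosine need graded values.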

Page 27: Complementary

• Sometimes CF cannot generate enough recommendations for all users
• Cold start problem
• New user and new item
• Some statistical approaches can be complementary
• Ranking is very easy to implement in MapReduce (Word Count?)
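The Word Count remark can be made concrete: a popularity ranking is a map step that emits (item, 1) and a reduce step that sums. The sketch below runs in a single process rather than on Hadoop, and the event tuple layout is an assumption.

```python
from collections import Counter

def map_event(event):
    """Map: emit (item_id, 1) for every view event, like Word Count emits (word, 1)."""
    _user_id, item_id, action = event
    if action == "view":
        yield item_id, 1

def reduce_counts(pairs):
    """Reduce: sum the counts per item."""
    counts = Counter()
    for item_id, count in pairs:
        counts[item_id] += count
    return counts

def top_items(events, n=10):
    """Most-viewed items, usable as a fallback list for new users or new items."""
    pairs = (pair for event in events for pair in map_event(event))
    return reduce_counts(pairs).most_common(n)

events = [("u1", "p1", "view"), ("u2", "p1", "view"), ("u1", "p2", "view"), ("u1", "p3", "cart")]
print(top_items(events, n=2))
```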

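Page 28: Etu Recommender Application

The Whole System

[Figure: overall system. Data sources feed the Etu Recommender running on the Etu Appliance, which produces recommendation lists.]

Data sources
• Transaction Info
– Historical order data
– Product purchase records
• Web interaction data
– Browse
– Click
– Search
– Cart
– Check-out
– Rating
• Mobile interaction data
– Download
– Click
– Check-in
– Payment
– Location
• Social Media (3rd party feed)

Etu Recommender (on Etu Appliance)
• Data capture
• Collaborative Filtering analysis
– Customer similarity analysis
– Product association analysis
• Conversion rate analysis
• Recommendation engine
• Recommendation list

Recommendation types
• Personalized recommendations for the user
• Customers who viewed this item also viewed
• Customers who bought this item also bought
• Recommendations for items in the cart
• Items bought together
• Based on your browsing, you might like
• Customers who viewed this item ultimately bought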

Page 29: Data Process Flow

[Figure: data process flow with the following components and labels]

• Front End
– Front-end JavaScript (request, access log)
– Event Collector (Nginx)
– Rec API
– Item Mgmt. API
• Log Parser (Preprocess & Dispatch)
• HDFS
• Core Engine (Schedule & Flow Control)
– Mahout Job: User Based, Item Based
– MR Job: Ranking & Stats.
• HBase
• Backend Admin
– Dashboard & Mgmt Console

Page 30: System Components

• Nginx
– Event Collector & Request Forwarder
• Log Parser
– Preprocesses collected logs and dispatches them to HDFS
• HDFS
– Fundamental storage of the system
• Core Engine
– Scheduling & Workflow Control
– Job Driver
• Management Console
– Dashboard (PV, UV, Conv. Rate)
– Scheduling, Log Viewer, System Configuration

Page 31: System Components – Cont.

• Recommendation Jobs
– Mahout jobs for CF
– MR jobs for Ranking
• HBase
– Stores recommendation results for querying
• Recommendation API
– API wrapper that lets the front end query results from HBase tables
– Business logic and policy are handled here
• Item Management API
– API interface for front-end item management
– Allow List, Exception List
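As an illustration of the kind of policy the Recommendation API might apply, the sketch below filters a precomputed list against an exception list and a per-item availability flag. The function name, the argument shapes, and the exact semantics of the lists are assumptions. The deck does not describe the real implementation.

```python
def apply_policy(recommended, available, exception_list, limit=5):
    """Filter a ranked list of item IDs before returning it to the front end.

    recommended:     item IDs in ranked order (e.g. read from a result table)
    available:       {item_id: bool}, whether the item may currently be recommended
    exception_list:  item IDs that must never be shown (assumed meaning)
    """
    result = []
    for item_id in recommended:
        if item_id in exception_list:
            continue  # explicitly blocked by the operator
        if not available.get(item_id, False):
            continue  # unknown or unavailable items are skipped
        result.append(item_id)
        if len(result) == limit:
            break
    return result

print(apply_policy(
    ["p1", "p2", "p3", "p4"],
    {"p1": True, "p2": False, "p3": True, "p4": True},
    {"p3"},
    limit=2,
))
```

Keeping this filter in the API layer means the batch jobs do not have to be rerun when an item is blocked or goes out of stock.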

Page 32: HBase Table

Table | Rowkey | Column
CATEGORY | CategoryID | f:id (category ID); f:rank (ranking by view); f:rank_cart (ranking by cart); f:rank_order (ranking by order); f:rank_view (ranking by view)
ERUID_USER | ErUid | f:uid (ERUID/UID mapping)
USER_ERUID | uid | f:eruid (UID/ERUID mapping)
SEARCH | Keyword | f:id (search ID); f:rec (item list)

Page 33: HBase Table – Cont.

Table | Rowkey | Column
STATS | date (e.g. 2013-06-25) | f:amount (site-wide transaction amount); f:item (site-wide number of items sold); f:order (site-wide number of orders); f:pv (site-wide PV); f:uv (site-wide UV); f:erAmount (transaction amount from recommendations); f:erItem (number of items sold from recommendations); f:erOrder (number of orders from recommendations); f:erPv (PV of recommendation slots); f:erUv (UV of recommendation slots)

Page 34: HBase Table – Item Table

Table | Rowkey | Column
ITEM | PID | f:avl_cat (whether this category can currently be recommended); f:avl_item (whether this item can currently be recommended); f:cat (category the item belongs to); f:id (item ID); f:pry (priority, the recommendation priority weight); f:rec_view (items recommended for viewing, computed by Mahout); f:rec_cart (items recommended for adding to the cart, computed by Mahout); f:rec_order (items recommended for purchase, computed by Mahout); f:rec_order_co_occurrence (items frequently bought together with this item)

Page 35: HBase Table – User Table

Table | Rowkey | Column
USER | UID | f:id (user ID); f:rec_cart (items recommended for adding to the cart, computed by Mahout); f:rec_view (items recommended for viewing, computed by Mahout); f:rec_order (items recommended for purchase, computed by Mahout); f:rec_view_last_item_views (most-viewed items recommended to the user)

Page 36: Tracking Code Snippet

<script id="etu-recommender" type="text/javascript">
  var erHostname = '${erHostname}';
  var _qevents = _qevents || [];
  _qevents.push({
    ${paramName} : '${paramValue}',
    ...
  });
  // Match the protocol of the current page; the prefix already ends with '/'.
  var erUrlPrefix = ('https:' == document.location.protocol ? 'https://' : 'http://') + erHostname + '/';
  (function() {
    // Load er.js asynchronously; the timestamp defeats caching.
    var er = document.createElement('script');
    er.type = 'text/javascript';
    er.async = true;
    er.src = erUrlPrefix + 'er.js?' + (new Date().getTime());
    var currentJs = document.getElementById('etu-recommender');
    currentJs.parentNode.insertBefore(er, currentJs);
  })();
</script>

Page 37: Sample parameters for tracking a "view" action

# | Parameter Name | Parameter Type | Sample Value | Required
1 | cid | String | "www.etusolution.com" | Yes
2 | uid | String | "johnny_nien" | Yes
3 | act | String | "view" | Yes
4 | pid | String | "P00001" | Yes
5 | cat | String Array | [ "C", "C00001" ] | No, but please treat it as required
6 | avl * | Boolean (0 or 1) | 1 | No

Note: An explanation of "avl" will be available later

Page 38: Query Recommendations

<script id="etu-recommender" type="text/javascript">
  var erUrlPrefix = '${erUrlPrefix}';
  var _qquery = _qquery || [];
  _qquery.push({
    ${paramName} : '${paramValue}',
    ...
  });
  function etuRecQueryCallBack(queryParams, queryResult) {
    // Implement Your Logic Here!!!
  }
  // The slide rebuilds erUrlPrefix from erHostname at this point, but erHostname
  // is not defined in this snippet, so the value set above is used instead.
  (function() {
    var er = document.createElement('script');
    er.type = 'text/javascript';
    er.async = true;
    er.src = erUrlPrefix + '/er.js?' + (new Date().getTime());
    var currentJs = document.getElementById('etu-recommender');
    currentJs.parentNode.insertBefore(er, currentJs);
  })();
</script>

Page 39: Sample parameters for Also Buy … (Item based)

# | Parameter Name | Parameter Type | Sample Value | Required
1 | cid | String | "www.etusolution.com" | Yes
2 | type | String | "item" | Yes
3 | act | String | "order" | Yes
4 | pid | String | "P001" | Yes
5 | cat | String | "C001" | No, but highly recommended

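Page 40: Network Topology

[Figure: network topology. The internet reaches a Router. The Web Server Farm (Web Servers) sits on the User LAN behind an L2 switch. The Recommender Cluster (Nginx / NN / HM node and DN / RS nodes) sits on an isolated Private LAN behind an L2 switch. Labels: Public IP (22, 443, 8888), Master IP (22, 443, 8888).]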

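Page 41: User Event Collection

Online Behavior: embed JavaScript to capture customers' online behavior

• Relevant pages
– Home page for logged-in users
– Search page
– Product detail page
– Add-to-cart page
– Order and payment page
– Payment completed page
• Customer behavior (Event)
– Browse, Click
– Search
– Add to Cart
– Check-out
– Rating

Online / Offline Records: Batch Import into the Recommender

• Transaction data
– Historical orders
– Product data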

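Page 42: Generate Recommendation Result

• Personalized recommendations for the user
– Data source: browsing records + cart records + purchase records
– Purpose: strengthen personalization, improve the user experience, raise order conversion
• Customers who viewed this item also viewed
– Data source: browsing records
– Purpose: lower the bounce rate, raise order conversion
• Customers who bought this item also bought
– Data source: purchase records
– Purpose: strengthen cross-selling and encourage customers to order again
• Customers who viewed this item ultimately bought
– Data source: browsing records + purchase records
– Purpose: lower the bounce rate, help users reach a decision, raise order conversion
• Recommendations for items in the cart
– Data source: cart records + purchase records
– Purpose: up-selling. Once the customer's basic need is met, guide them to more items of interest, which raises sales volume and gross margin
• Items bought together
– Data source: cart records + purchase records
– Purpose: product bundles that improve cross-selling and raise sales volume
• Based on your browsing, you might like
– Data source: browsing records
– Purpose: minimize the bounce rate, increase customer loyalty and the repurchase rate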

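Page 43: How the Recommender Is Applied in E-Commerce

Etu Recommender Application

[Figure: inputs flow into the Etu Recommender on the Etu Appliance, and recommendation lists flow out to the site's touchpoints]

• Inputs
– Historical orders
– Product data
– Real-time orders
– Browse, Click
– Add to cart
– Check-out
– Online reviews
– Search
• Etu Recommender (on Etu Appliance)
– Data capture
– Collaborative Filtering analysis
– Conversion rate analysis
– Recommendation engine
– Recommendation list
• Touchpoints
– Product Pages
– Category Pages
– Search Results
– Cart Pages
– Email Confirmation
– EDM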

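Page 44: Recommender Conversion Rate Analysis

Online Performance Tracking

[Figure: Item A is reached in two ways, by clicking the recommendation list and via the home page or any other page. Counters shown: PV1, UV1, U-Cart 1 and PV2, UV2, U-Cart 2.]

• Recommended-item click-through rate = (PV2 or UV2) / (PV1 or UV1)
• Recommended-item conversion rate (via the recommendation list) = U-Cart 2 / (UV2 or PV2)

** PV: page views. UV: unique visitors. U-Cart: added to cart by UV.

Algorithm Benchmark

• Train vs Test (80-20)
• A/B test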
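The two tracking ratios above are simple divisions once the counters are available. The sketch below uses hypothetical function names and made-up counter values. It assumes the counters are read from wherever the site statistics are stored.

```python
def ratio(numerator, denominator):
    """Safe division: returns 0.0 when there is no traffic to divide by."""
    return numerator / denominator if denominator else 0.0

def recommendation_click_rate(pv2, pv1):
    """Recommended-item click-through rate = PV2 / PV1 (UV2 / UV1 works the same way)."""
    return ratio(pv2, pv1)

def recommendation_conversion_rate(u_cart2, uv2):
    """Recommended-item conversion rate = U-Cart 2 / UV2."""
    return ratio(u_cart2, uv2)

# Made-up counters for one day
print(recommendation_click_rate(pv2=50, pv1=1000))
print(recommendation_conversion_rate(u_cart2=5, uv2=40))
```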

Page 45: Summary

• Mahout is very useful if you would like to build a machine learning application on top of Hadoop
• BUT a recommendation system is more than an algorithm
• DON'T reinvent the wheel. Leverage Mahout and Hadoop
• Put most of your effort into integration, performance tuning, and business logic

Page 46: Future Roadmap

• Offline to online integration -> Offline User Event Collection
• 360 Degree CRM -> CRM Connector
• Social Recommendation -> Social Connector
• Retargeting -> Customer Behavior Data Warehouse
• Go real-time!

Page 47: WE'RE HIRING!

Page 48: Contact

[email protected]

Taipei, Taiwan: 318, Rueiguang Rd., Taipei 114, Taiwan. T: +886 2 7720 1888, F: +886 2 8798 6069

Beijing, China: Room B-26, Landgent Center, No. 24, East Third Ring Middle Rd., Beijing, China 100022. T: +86 10 8441 7988, F: +86 10 8441 7227

Page 49: Recommendation

User | Matrix | Alien | Inception
Alice | 5 | 1 | 4
Bob | ? | 2 | 5
Peter | 4 | 3 | 2

Page 50: Algorithms Examples – Recommendation

• Prediction: Estimate Bob's preference towards “The Matrix”
1. Look at all items that
– a) are similar to “The Matrix”
– b) have been rated by Bob
=> “Alien”, “Inception”
2. Estimate the unknown preference with a weighted sum
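The weighted sum in step 2 can be checked with a few lines of Python. The sketch uses Bob's two ratings from the deck and the Matrix/Alien and Matrix/Inception similarities that the deck reports for this example (-0.47 and 0.47). The function name and the normalization by the sum of absolute similarities are assumptions.

```python
def predict_item_based(user_ratings, similarities):
    """Similarity-weighted average of the user's ratings of similar items."""
    rated = [item for item in user_ratings if item in similarities]
    numerator = sum(similarities[item] * user_ratings[item] for item in rated)
    denominator = sum(abs(similarities[item]) for item in rated)
    return numerator / denominator if denominator else None

bob = {"Alien": 2, "Inception": 5}                     # Bob's known ratings
sim_to_matrix = {"Alien": -0.47, "Inception": 0.47}    # similarity of each item to "The Matrix"

# (-0.47 * 2 + 0.47 * 5) / (0.47 + 0.47) is approximately 1.5
print(predict_item_based(bob, sim_to_matrix))
```

This matches the value of 1.5 that the deck fills in for Bob in its final matrix.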

Page 51: Algorithms Examples – Recommendation

• MapReduce phase 1
– Map – Make user the key

Input:
(Alice, Matrix, 5)
(Alice, Alien, 1)
(Alice, Inception, 4)
(Bob, Alien, 2)
(Bob, Inception, 5)
(Peter, Matrix, 4)
(Peter, Alien, 3)
(Peter, Inception, 2)

Output:
Alice (Matrix, 5)
Alice (Alien, 1)
Alice (Inception, 4)
Bob (Alien, 2)
Bob (Inception, 5)
Peter (Matrix, 4)
Peter (Alien, 3)
Peter (Inception, 2)

Page 52: Algorithms Examples – Recommendation

• MapReduce phase 1
– Reduce – Create inverted index

Input:
Alice (Matrix, 5)
Alice (Alien, 1)
Alice (Inception, 4)
Bob (Alien, 2)
Bob (Inception, 5)
Peter (Matrix, 4)
Peter (Alien, 3)
Peter (Inception, 2)

Output:
Alice (Matrix, 5) (Alien, 1) (Inception, 4)
Bob (Alien, 2) (Inception, 5)
Peter (Matrix, 4) (Alien, 3) (Inception, 2)

Page 53: Algorithms Examples – Recommendation

• MapReduce phase 2
– Map – Isolate all co-occurred ratings (all cases where a user rated both items)

Input:
Alice (Matrix, 5) (Alien, 1) (Inception, 4)
Bob (Alien, 2) (Inception, 5)
Peter (Matrix, 4) (Alien, 3) (Inception, 2)

Output:
Matrix, Alien (5,1)
Matrix, Alien (4,3)
Alien, Inception (1,4)
Alien, Inception (2,5)
Alien, Inception (3,2)
Matrix, Inception (4,2)
Matrix, Inception (5,4)
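Phase 1 and the phase 2 map step can be imitated in a single Python process to see how the pairs arise. This is an illustrative sketch only, not Mahout's MapReduce code. The function names are made up, and it assumes every user's items arrive in the same order so that each item pair always gets the same key.

```python
from collections import defaultdict
from itertools import combinations

RATINGS = [
    ("Alice", "Matrix", 5), ("Alice", "Alien", 1), ("Alice", "Inception", 4),
    ("Bob", "Alien", 2), ("Bob", "Inception", 5),
    ("Peter", "Matrix", 4), ("Peter", "Alien", 3), ("Peter", "Inception", 2),
]

def phase1(ratings):
    """Phase 1: key each rating by user (map), then collect per user (reduce)."""
    by_user = defaultdict(list)
    for user, item, rating in ratings:
        by_user[user].append((item, rating))
    return dict(by_user)

def phase2_map(by_user):
    """Phase 2 map: for each user, emit every pair of items that user rated."""
    co_ratings = defaultdict(list)
    for item_ratings in by_user.values():
        for (item_a, rating_a), (item_b, rating_b) in combinations(item_ratings, 2):
            co_ratings[(item_a, item_b)].append((rating_a, rating_b))
    return dict(co_ratings)

for pair, values in phase2_map(phase1(RATINGS)).items():
    print(pair, values)
```

The phase 2 reduce step would then compute one similarity value per item pair from its list of co-occurred ratings.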

Page 54: Algorithms Examples – Recommendation

• MapReduce phase 2
– Reduce – Compute similarities

Input:
Matrix, Alien (5,1)
Matrix, Alien (4,3)
Alien, Inception (1,4)
Alien, Inception (2,5)
Alien, Inception (3,2)
Matrix, Inception (4,2)
Matrix, Inception (5,4)

Output:
Matrix, Alien (-0.47)
Matrix, Inception (0.47)
Alien, Inception (-0.63)

Page 55: Recommendation

User | Matrix | Alien | Inception
Alice | 5 | 1 | 4
Bob | 1.5 | 2 | 5
Peter | 4 | 3 | 2