Post on 08-Sep-2014
Applying Apache Mahout to E-Commerce
James Chen, Etu Solution
Hadoop in TW 2013, Sep 28, 2013
2
Taiwan Hadoop 2013 Status Survey
Fill in the survey for a chance to win two movie tickets. Deadline: 2013/10/7
3
Agenda
• Apache Mahout Introduction
• Machine Learning Use Cases
• Building a recommendation system
• Collaborative Filtering
• System Architecture
• Performance
• Future Roadmap
4
Apache Mahout
• ASF project to create scalable machine learning libraries
  – http://mahout.apache.org
• Why Mahout?
  – Many open source machine learning libraries either:
    • Lack Community
    • Lack Documentation and Examples
    • Lack Scalability
    • Lack the Apache License
5
Algorithms in Mahout
• Regression
• Recommenders
• Clustering
• Classification
• Freq. Pattern Mining
• Vector Similarity
• Non-MR Algorithms
• Examples
• Dimension Reduction
• Evolution
See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
6
Algorithms in Mahout
• Classification
  – Logistic Regression
  – Bayesian
  – Support Vector Machines
  – Perceptron and Winnow
  – Neural Network
  – Random Forests
  – Restricted Boltzmann Machines
  – Online Passive Aggressive
  – Boosting
  – Hidden Markov Models
• Clustering
  – Canopy Clustering
  – K-Means
  – Fuzzy K-Means
  – Expectation Maximization
  – Mean Shift
  – Hierarchical Clustering
  – Dirichlet Process Clustering
  – Latent Dirichlet Allocation
  – Spectral Clustering
  – Minhash Clustering
  – Top Down Clustering
7
Algorithms in Mahout – Cont.
• Pattern Mining
  – Parallel FP Growth
• Regression
  – Locally Weighted Linear Regression
• Dimension Reduction
  – SVD
  – Stochastic SVD with PCA
  – PCA
  – Independent Component Analysis
  – Gaussian Discriminative Analysis
• Evolution Algorithms
  – Genetic Algorithms
• Recommenders
  – Non-distributed recommenders (“Taste”)
  – Distributed Item-Based Collaborative Filtering
  – Collaborative Filtering using a parallel matrix factorization
  – Slope One
8
Algorithms in Mahout – Cont.
• Vector Similarity
  – RowSimilarityJob (MR)
  – VectorDistanceJob (MR)
• Other
  – Collocations
• Non-MapReduce algorithms
See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
9
Mahout Focus on Scalability
• Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
  – Some algorithms won’t scale to massive machine clusters
  – Others fit logically on a Map Reduce framework like Apache Hadoop
  – Still others will need alternative distributed programming models
  – Be pragmatic
• Most Mahout implementations are Map Reduce enabled
• (Always a) Work in Progress
10
Prepare Data from Raw content
• Lucene integration
  – bin/mahout lucenevector …
• Document Vectorizer
  – bin/mahout seqdirectory …
  – bin/mahout seq2sparse …
• Programmatically
  – See the Utils module in Mahout
• Database (JDBC)
• File System (HDFS)
11
Machine Learning
• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
  – Intro. to Machine Learning by E. Alpaydin
• Subset of Artificial Intelligence
• Lots of related fields:
  – Information Retrieval
  – Stats
  – Biology
  – Linear algebra
  – Many more
12
Use Case : Recommendation
13
Use Cases : Classification
14
Use Cases : Clustering
15
More use cases
• Recommend products/books/friends …
• Classify content into predefined groups
• Find similar content based on object properties
• Find associations/patterns in actions/behaviors
• Identify key topics in large collections of text
• Detect anomalies in machine output
• Ranking search results (PageRank)
• Others
16
Building a recommendation system
• Help users find items they might like based on historical preferences
17
Approach
• Collect user preferences -> user vs. item matrix
• Find similar users or items (neighborhood-based approach)
• Works by finding similarly rated items in the user-item matrix (e.g. cosine, Pearson correlation, Tanimoto coefficient)
• Estimates a user's preference towards an item by looking at his/her preferences towards similar items
18
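Collaborative Filtering – User Based
Find User Similarity
1. How do we predict how much User 1 will like Item 4?
2. Find the n users who are most similar to User 1 and who have purchased Item 4 (ratings are derived from purchase records); call them User n.
3. Fill in the missing value from User n's ratings of Item 4, weighted by user similarity.
4. Repeat steps 1 to 3 for every user combination until all empty cells are filled.
[Figure: user-item matrix with items as columns and users (User 1 ... User n) as rows; the "?" cell for User 1 is the value to be filled in]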
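The four steps above can be sketched in Python as follows. This is only an illustration, not Mahout's implementation: the toy ratings, the function names `cosine` and `predict_user_based`, and the choice of cosine similarity over co-rated items are all assumptions made for the example.

```python
from math import sqrt

# Toy user -> {item: rating} data (hypothetical values, for illustration only).
ratings = {
    "user1": {"item1": 5, "item2": 3, "item3": 4},
    "user2": {"item1": 4, "item2": 3, "item3": 5, "item4": 4},
    "user3": {"item1": 1, "item2": 5, "item4": 2},
}

def cosine(a, b):
    """Cosine similarity restricted to the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict_user_based(ratings, user, item, n=2):
    """Predict `user`'s rating of `item` from the n most similar users who rated it."""
    # Step 2: candidate neighbours are the other users who have rated the item.
    candidates = [
        (cosine(ratings[user], prefs), prefs[item])
        for other, prefs in ratings.items()
        if other != user and item in prefs
    ]
    top = sorted(candidates, reverse=True)[:n]
    # Step 3: similarity-weighted average of the neighbours' ratings.
    total_weight = sum(sim for sim, _ in top)
    if total_weight == 0:
        return None
    return sum(sim * rating for sim, rating in top) / total_weight

prediction = predict_user_based(ratings, "user1", "item4")
```

Step 4 would simply call `predict_user_based` for every empty cell of the matrix.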
19
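Collaborative Filtering – Item Based
Find Item Similarity (Amazon)
1. How do we predict how much User 1 will like Item 4?
2. For each Item n in User 1's history, compute the similarity between Item n and Item 4 (using other users' history).
3. Fill in the missing value from User 1's ratings of Item n, weighted by item similarity.
4. Repeat steps 1 to 3 for every item combination until all empty cells are filled.
[Figure: user-item matrix with items as columns and users as rows; the "?" cell is the value to be filled in]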
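A minimal Python sketch of the item-based steps above, under stated assumptions: the toy ratings, the helper names (`item_vector`, `cosine`, `predict_item_based`) and the use of cosine similarity are illustrative choices, not the method used by Mahout or by Etu Recommender.

```python
from math import sqrt

# Toy user -> {item: rating} data (hypothetical values, for illustration only).
ratings = {
    "user1": {"item1": 5, "item2": 3, "item3": 4},
    "user2": {"item1": 4, "item2": 3, "item3": 5, "item4": 4},
    "user3": {"item1": 1, "item2": 5, "item4": 2},
}

def item_vector(ratings, item):
    """Column of the user-item matrix: {user: rating} for one item."""
    return {user: prefs[item] for user, prefs in ratings.items() if item in prefs}

def cosine(a, b):
    """Cosine similarity restricted to the users who rated both items."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(a[u] ** 2 for u in common))
    norm_b = sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def predict_item_based(ratings, user, item):
    """Weight the user's own ratings by how similar each rated item is to `item`."""
    target = item_vector(ratings, item)
    numerator = denominator = 0.0
    for rated_item, rating in ratings[user].items():
        sim = cosine(target, item_vector(ratings, rated_item))  # step 2
        numerator += sim * rating                                # step 3
        denominator += abs(sim)
    return numerator / denominator if denominator else None

prediction = predict_item_based(ratings, "user1", "item4")
```

Because the similarities depend only on other users' ratings of both items, they can be precomputed offline, which is one reason item-based filtering suits batch jobs.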
20
Test Drive of Mahout Recommender
• GroupLens dataset: http://www.grouplens.org/node/12
• 1,000,209 anonymous ratings of 3,900 movies made by 6,040 MovieLens users
• movies.dat (movie ids with title and category)
• ratings.dat (ratings of movies)
• users.dat (user information)
21
Ratings File
• Each line of the ratings file has the format
  UserID::MovieID::Rating::Timestamp
• Mahout requires the following CSV format
  UserID,ItemID,Value
• tr -s ':' ',' < ratings.dat | cut -f1-3 -d, > rating.csv
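The same conversion can be done without the tr/cut pipeline. The sketch below is one possible Python equivalent; the function names are hypothetical and the sample line in the usage note is only an example of the `UserID::MovieID::Rating::Timestamp` layout.

```python
def convert_rating_line(line):
    """Turn 'UserID::MovieID::Rating::Timestamp' into 'UserID,ItemID,Value'."""
    user_id, movie_id, rating, _timestamp = line.strip().split("::")
    return ",".join((user_id, movie_id, rating))

def convert_ratings(src_path, dst_path):
    """Convert a whole ratings file, skipping blank lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(convert_rating_line(line) + "\n")
```

For example, `convert_rating_line("1::1193::5::978300760")` returns `"1,1193,5"`, and `convert_ratings("ratings.dat", "rating.csv")` produces the file the pipeline above writes.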
22
Run Recommendation Job
• $ mahout recommenditembased \
    -i [input-hdfs-path] \
    -o [output-hdfs-path] \
    --usersFile [File listing users] \
    --tempDir
23
Recommendation Result
• Recommendation results will look like
  UserID [ItemID:Weight, ItemID:Weight,…]
• Each line represents a UserID with its associated recommended ItemIDs
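A result line in the format described above can be parsed with a few lines of Python. This is a sketch under assumptions: the function name is hypothetical, IDs are assumed to be integers, and the user ID is assumed to be separated from the bracketed list by whitespace.

```python
def parse_recommendation_line(line):
    """Parse 'UserID [ItemID:Weight, ItemID:Weight]' into (user_id, [(item_id, weight), ...])."""
    user_id, items = line.strip().split(None, 1)
    pairs = []
    for entry in items.strip("[]").split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate an empty recommendation list
        item_id, weight = entry.split(":")
        pairs.append((int(item_id), float(weight)))
    return int(user_id), pairs
```

Parsed pairs like these are what a loader would write into a serving store so that the front end can query them by user ID.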
24
Collect User Behavior Events
Implicit (Easy to collect) | Explicit (Hard to collect)
View | Rating (0~5)
Shopping Cart (0 or 1) | Voting (0 or 1)
Order or Buy (0 or 1) | Forward or Share (0 or 1)
Duration Time (Noisy) | Add favorite (0 or 1)
 | Tag (text analysis)
 | Comments (text analysis)
25
Process Event into Preference
• Group events by type and calculate similarity per event type, e.g. Also View, Also Buy
• Weighting:
  – Explicit Event > Implicit Event
  – Order, Cart > View
• Noise Reduction
• Normalization
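One possible way to turn raw events into preference values is sketched below. The weights, the event names and the per-user max normalization are assumptions chosen only to respect the ordering "Order, Cart > View"; the slide does not state the actual values or the normalization used.

```python
from collections import defaultdict

# Hypothetical weights that respect the ordering on the slide: Order, Cart > View.
EVENT_WEIGHTS = {"view": 1.0, "cart": 3.0, "order": 5.0}

def events_to_preferences(events):
    """events: iterable of (user, item, event_type).

    Returns {(user, item): preference}, scaled to (0, 1] per user.
    """
    raw = defaultdict(float)
    for user, item, event_type in events:
        # Unknown event types contribute nothing (a crude form of noise reduction).
        raw[(user, item)] += EVENT_WEIGHTS.get(event_type, 0.0)

    # Normalization: divide by each user's largest score so heavy users do not dominate.
    max_per_user = defaultdict(float)
    for (user, _item), score in raw.items():
        max_per_user[user] = max(max_per_user[user], score)

    return {
        key: score / max_per_user[key[0]]
        for key, score in raw.items()
        if score > 0
    }

sample_events = [
    ("u1", "a", "view"), ("u1", "a", "view"),
    ("u1", "b", "order"),
    ("u2", "a", "cart"),
]
preferences = events_to_preferences(sample_events)
```

Keeping one preference table per event type instead of a single weighted sum would give the separate "Also View" and "Also Buy" models mentioned above.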
26
Similarity (Vector Similarity)
• Euclidean Distance
• Pearson Correlation Coefficient (-1 ~ +1)
• Cosine Similarity
• Tanimoto Coefficient
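The four measures listed above can be written out directly. The sketch below uses their standard textbook definitions in plain Python; the function names are illustrative and are not Mahout class names.

```python
from math import sqrt

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length rating vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient, in the range -1 to +1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # undefined for constant vectors; 0 is a convention chosen here
    return cov / (dev_x * dev_y)

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

def tanimoto(a, b):
    """Tanimoto coefficient of two item sets: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Tanimoto ignores rating values and only looks at which items overlap, so it fits boolean data such as "bought or not", while the other three need numeric preferences.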
27
Complementary
• Sometimes CF cannot generate enough recommendations for all users
• Cold start problem
• New users and new items
• Some statistical approaches can be complementary
• Ranking is very easy to implement in MapReduce: it is essentially Word Count
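The Word Count analogy can be made concrete with a small sketch: count events per item, then fall back to the most popular items when collaborative filtering has nothing for a user. The function names and sample data are hypothetical, and the single-process `Counter` stands in for what would be a MapReduce job at scale.

```python
from collections import Counter

def rank_items(events, top_n=3):
    """Map: emit (item, 1) per event. Reduce: sum per item. Same shape as Word Count."""
    counts = Counter(item for _user, item in events)
    return [item for item, _count in counts.most_common(top_n)]

def recommend(cf_results, user, popular):
    """Use CF recommendations when available, otherwise fall back to the ranking."""
    return cf_results.get(user) or popular

events = [("u1", "a"), ("u2", "a"), ("u3", "b"), ("u1", "c"), ("u2", "b"), ("u4", "a")]
popular = rank_items(events)
```

A brand-new user with no history, for example `recommend({}, "new_user", popular)`, simply receives the ranked list.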
28
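Etu Recommender Application
The Whole System
[Figure: overall system architecture. Labels recovered from the diagram:]
• Data sources
  – Transaction Info: historical order data, product purchase records
  – Web interaction data: Browse, Click, Search, Cart, Check-out, Rating
  – Mobile interaction data: Download, Click, Check-in, Payment, Location
  – Social Media (3rd party feed)
• Etu Recommender (on Etu Appliance)
  – Data capture
  – Recommendation engine
    • Collaborative Filtering analysis
    • Customer similarity analysis
    • Product association analysis
  – Conversion rate analysis
  – Recommendation lists
• Recommendation outputs
  – Personalized recommendations for the user
  – Customers who viewed this item also viewed
  – Customers who bought this item also bought
  – Recommendations for items in the shopping cart
  – Items frequently bought together
  – Based on your browsing, you may also like
  – Customers who viewed this item ultimately bought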
29
Data Process Flow
[Figure: data process flow. Components and labels recovered from the diagram:]
• Front End JavaScript -> Event Collector (Nginx): request, access log
• Log Parser: Preprocess & Dispatch
• HDFS
• Core Engine: Schedule & Flow Control
  – Mahout Job: User Based, Item Based
  – MR Job: Ranking & Stats.
• HBase
• Front End: Rec API, Item Mgmt. API
• Backend Admin: Dashboard & Mgmt Console
30
System Components
• Nginx
  – Event Collector & Request Forwarder
• Log Parser
  – Preprocess collected log and dispatch log to HDFS
• HDFS
  – Fundamental storage of the system
• Core Engine
  – Scheduling & Workflow Control
  – Job Driver
• Management Console
  – Dashboard (PV, UV, Conv. Rate)
  – Scheduling, Log Viewer, System Configuration
31
System Components – Cont.
• Recommendation Jobs
  – Mahout jobs for CF
  – MR jobs for Ranking
• HBase
  – Recommendation Result for query
• Recommendation API
  – API wrapper for frontend to query result from HBase table
  – Handle business logic and policy here
• Item Management API
  – API interface for frontend item management
  – Allow List, Exception List
32
HBase Table
Table | Rowkey | Column
CATEGORY | CategoryID | column=f:id (Category ID)
 | | column=f:rank (ranking by view)
 | | column=f:rank_cart (ranking by cart)
 | | column=f:rank_order (ranking by order)
 | | column=f:rank_view (ranking by view)
ERUID_USER | ErUid | column=f:uid (ERUID/UID mapping)
USER_ERUID | uid | column=f:eruid (UID/ERUID mapping)
SEARCH | Keyword | column=f:id (search ID)
 | | column=f:rec (item list)
33
HBase Table – Cont.
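Table | Rowkey | Column
STATS | date (e.g. 2013-06-25) | column=f:amount (site-wide transaction amount)
 | | column=f:item (site-wide number of items sold)
 | | column=f:order (site-wide number of orders)
 | | column=f:pv (site-wide PV count)
 | | column=f:uv (site-wide UV count)
 | | column=f:erAmount (transaction amount from recommendations)
 | | column=f:erItem (number of items sold from recommendations)
 | | column=f:erOrder (number of orders from recommendations)
 | | column=f:erPv (PV of recommendation slots)
 | | column=f:erUv (UV of recommendation slots)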
34
HBase Table – Item Table
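Table | Rowkey | Column
ITEM | PID | column=f:avl_cat (whether this category is currently available for recommendation)
 | | column=f:avl_item (whether this item is currently available for recommendation)
 | | column=f:cat (the category this item belongs to)
 | | column=f:id (item ID)
 | | column=f:pry (Priority: recommendation priority weight)
 | | column=f:rec_view (items recommended for viewing, computed by Mahout)
 | | column=f:rec_cart (items recommended for adding to the cart, computed by Mahout)
 | | column=f:rec_order (items recommended for purchase, computed by Mahout)
 | | column=f:rec_order_co_occurrence (items frequently bought together with this item)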
35
HBase Table – User Table
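Table | Rowkey | Column
USER | UID | column=f:id (user ID)
 | | column=f:rec_cart (items recommended for adding to the cart, computed by Mahout)
 | | column=f:rec_view (items recommended for viewing, computed by Mahout)
 | | column=f:rec_order (items recommended for purchase, computed by Mahout)
 | | column=f:rec_view_last_item_views (most-viewed items recommended to the user)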
36
Tracking Code Snippet
<script id="etu-recommender" type="text/javascript">
var erHostname = '${erHostname}';
var _qevents = _qevents || [];
_qevents.push({
  ${paramName} : '${paramValue}',
  ...
});
var erUrlPrefix = ('https:' == document.location.protocol ? 'https://' : 'http://') + erHostname + '/';
(function() {
  // Load er.js asynchronously; the timestamp query string avoids a cached copy.
  var er = document.createElement('script');
  er.type = 'text/javascript';
  er.async = true;
  er.src = erUrlPrefix + '/er.js?' + (new Date().getTime());
  // Insert the loader script just before this snippet.
  var currentJs = document.getElementById('etu-recommender');
  currentJs.parentNode.insertBefore(er, currentJs);
})();
</script>
37
Sample parameters for tracking a "view" action
# | Parameter Name | Parameter Type | Sample Value | Required
1 | cid | String | "www.etusolution.com" | Yes
2 | uid | String | "johnny_nien" | Yes
3 | act | String | "view" | Yes
4 | pid | String | "P00001" | Yes
5 | cat | String Array | [ "C", "C00001" ] | No, but please treat it as required
6 | avl * | Boolean (0 or 1) | 1 | No
* Note: "avl" is explained later.
38
Query Recommendations
<script id="etu-recommender" type="text/javascript">
var erUrlPrefix = '${erUrlPrefix}';
var _qquery = _qquery || [];
_qquery.push({
  ${paramName} : '${paramValue}',
  ...
});
function etuRecQueryCallBack(queryParams, queryResult) {
  // Implement Your Logic Here!!!
}
// Note: erHostname is not declared in this snippet; it is presumably set as in the tracking snippet.
var erUrlPrefix = ('https:' == document.location.protocol ? 'https://' : 'http://') + erHostname + '/';
(function() {
  // Load er.js asynchronously; the timestamp query string avoids a cached copy.
  var er = document.createElement('script');
  er.type = 'text/javascript';
  er.async = true;
  er.src = erUrlPrefix + '/er.js?' + (new Date().getTime());
  var currentJs = document.getElementById('etu-recommender');
  currentJs.parentNode.insertBefore(er, currentJs);
})();
</script>
39
Sample parameter for Also Buy … (Item based)
# | Parameter Name | Parameter Type | Sample Value | Required
1 | cid | String | "www.etusolution.com" | Yes
2 | type | String | "item" | Yes
3 | act | String | "order" | Yes
4 | pid | String | "P001" | Yes
5 | cat | String | "C001" | No, but highly recommended
40
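Network Topology
[Figure: network topology. Labels recovered from the diagram:]
• internet, Router
• Web Server Farm: Web Server, Web Server
• Recommender Cluster: one node labeled Nginx / NN / HM, two nodes labeled DN / RS
• L2, User LAN
• L2, Private LAN (isolated)
• Public IP (22,443,8888)
• Master IP (22,443,8888)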
41
User Event Collection
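• Online Behavior: embedded JavaScript captures customers' online behavior
  – Relevant pages
    • Home page of logged-in users
    • Search page
    • Product detail page
    • Add-to-cart page
    • Order and payment page
    • Payment completed page
  – Customer behavior (Event)
    • Browse, Click
    • Search
    • Cart
    • Check-out
    • Rating
• Online / Offline Records: Batch Import
  – Transaction data
    • Historical orders
    • Product data
• Both feed into the Recommender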
42
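Generate Recommendation Result
• Personalized recommendations for the user
  – Data source: browsing records + cart records + purchase records
  – Purpose: make recommendations more personal, improve the user experience, raise the order conversion rate
• Customers who viewed this item also viewed
  – Data source: browsing records
  – Purpose: lower the bounce rate, raise the order conversion rate
• Customers who bought this item also bought
  – Data source: purchase records
  – Purpose: strengthen cross-selling, encourage customers to order again
• Customers who viewed this item ultimately bought
  – Data source: browsing records + purchase records
  – Purpose: lower the bounce rate, help users decide, raise the order conversion rate
• Recommendations for items in the shopping cart
  – Data source: cart records + purchase records
  – Purpose: up-selling; once the customer's basic need is met, guide them towards more items of interest to raise sales volume and gross margin
• Items frequently bought together
  – Data source: cart records + purchase records
  – Purpose: product bundles that improve cross-selling and raise sales volume
• Based on your browsing, you may also like
  – Data source: browsing records
  – Purpose: minimize the bounce rate, improve customer loyalty, increase the repurchase rate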
43
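How the Recommender Is Applied to E-Commerce
Etu Recommender Application
[Figure: application overview. Labels recovered from the diagram:]
• Inputs: historical orders, product data, real-time orders, browse and click, add to cart, check-out, online reviews, search
• Etu Recommender (on Etu Appliance): data capture, recommendation engine (Collaborative Filtering analysis), conversion rate analysis, recommendation lists
• Where recommendations appear: Product Pages, Category Pages, Search Results, Cart Pages, Email Confirmation, EDM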
44
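Conversion Rate Analysis of the Recommender
Online Performance Tracking
• Item A reached via the home page or any other page: PV1, UV1, U-Cart 1
• Item A reached by clicking the recommendation list: PV2, UV2, U-Cart 2
• Recommended-item click-through rate = (PV2 or UV2) / (PV1 or UV1)
• Recommended-item conversion rate (via the recommendation list) = U-Cart 2 / (UV2 or PV2)
** PV: page view; UV: unique visitor; U-Cart: added to cart by UV
Algorithm Benchmark
• Train vs Test (80-20)
• A/B test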
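The two ratios and the 80-20 split can be sketched as follows. The function names, the fixed random seed and the sample counts in the usage note are assumptions for illustration; the slides do not describe how the benchmark was actually implemented.

```python
import random

def click_through_rate(via_recommendation, via_all_pages):
    """Recommended-item click-through rate: PV2 / PV1 (or UV2 / UV1)."""
    return via_recommendation / via_all_pages if via_all_pages else 0.0

def conversion_rate(u_cart_2, visits_via_recommendation):
    """Recommended-item conversion rate: U-Cart 2 / UV2 (or PV2)."""
    return u_cart_2 / visits_via_recommendation if visits_via_recommendation else 0.0

def split_80_20(records, seed=42):
    """Shuffle preference records and hold out 20% for offline evaluation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

train, test = split_80_20(range(100))
```

With hypothetical counts PV1 = 1000, PV2 = 50 and U-Cart 2 = 5, `click_through_rate(50, 1000)` is 5% and `conversion_rate(5, 50)` is 10%.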
45
Summary
• Mahout is very useful if you would like to build a machine learning application on top of Hadoop
• BUT, a recommendation system is not only about the algorithm
• DON’T reinvent the wheel. Leverage Mahout and Hadoop
• Put most of your effort into integration, performance tuning, and business logic
46
Future Roadmaps
• Offline to online integration -> Offline User Event Collection
• 360 Degree CRM -> CRM Connector
• Social Recommendation -> Social Connector
• Retargeting -> Customer Behavior Data Warehouse
• Go real-time!
47
WE’RE HIRING!
Contact
www.etusolution.com
info@etusolution.com
Taipei, Taiwan: 318, Rueiguang Rd., Taipei 114, Taiwan. T: +886 2 7720 1888 F: +886 2 8798 6069
Beijing, China: Room B-26, Landgent Center, No. 24, East Third Ring Middle Rd., Beijing, China 100022. T: +86 10 8441 7988 F: +86 10 8441 7227
49
Recommendation
      | Matrix | Alien | Inception
Alice | 5 | 1 | 4
Bob   | ? | 2 | 5
Peter | 4 | 3 | 2
50
Algorithms Examples – Recommendation
• Prediction: Estimate Bob's preference towards “The Matrix”
  1. Look at all items that
     – a) are similar to “The Matrix”
     – b) have been rated by Bob
     => “Alien”, “Inception”
  2. Estimate the unknown preference with a weighted sum
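The weighted sum in step 2 can be checked with the numbers that appear elsewhere in this deck: Bob rated Alien 2 and Inception 5, and the item similarities to Matrix are quoted as -0.47 (Alien) and 0.47 (Inception). The sketch below assumes the sum is normalized by the absolute similarities, which is an assumption, though it does reproduce the 1.5 shown in the final ratings table.

```python
def weighted_sum_prediction(user_ratings, similarities):
    """Estimate a rating as sum(sim * rating) / sum(|sim|) over the user's rated items."""
    rated = [item for item in user_ratings if item in similarities]
    numerator = sum(similarities[item] * user_ratings[item] for item in rated)
    denominator = sum(abs(similarities[item]) for item in rated)
    return numerator / denominator if denominator else None

bob_ratings = {"Alien": 2, "Inception": 5}
similarity_to_matrix = {"Alien": -0.47, "Inception": 0.47}

prediction = weighted_sum_prediction(bob_ratings, similarity_to_matrix)
```

Here the negative similarity of Alien pulls the estimate down: (-0.47 × 2 + 0.47 × 5) / (0.47 + 0.47) is approximately 1.5.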
51
Algorithms Examples – Recommendation
• MapReduce phase 1
  – Map – Make user the key
Input:
(Alice, Matrix, 5)
(Alice, Alien, 1)
(Alice, Inception, 4)
(Bob, Alien, 2)
(Bob, Inception, 5)
(Peter, Matrix, 4)
(Peter, Alien, 3)
(Peter, Inception, 2)
Output:
Alice (Matrix, 5)
Alice (Alien, 1)
Alice (Inception, 4)
Bob (Alien, 2)
Bob (Inception, 5)
Peter (Matrix, 4)
Peter (Alien, 3)
Peter (Inception, 2)
52
Algorithms Examples – Recommendation
• MapReduce phase 1
  – Reduce – Create inverted index
Input:
Alice (Matrix, 5)
Alice (Alien, 1)
Alice (Inception, 4)
Bob (Alien, 2)
Bob (Inception, 5)
Peter (Matrix, 4)
Peter (Alien, 3)
Peter (Inception, 2)
Output:
Alice (Matrix, 5) (Alien, 1) (Inception, 4)
Bob (Alien, 2) (Inception, 5)
Peter (Matrix, 4) (Alien, 3) (Inception, 2)
53
Algorithms Examples – Recommendation
• MapReduce phase 2
  – Map – Isolate all co-occurred ratings (all cases where a user rated both items)
Input:
Alice (Matrix, 5) (Alien, 1) (Inception, 4)
Bob (Alien, 2) (Inception, 5)
Peter (Matrix, 4) (Alien, 3) (Inception, 2)
Output:
Matrix, Alien (5,1)
Matrix, Alien (4,3)
Alien, Inception (1,4)
Alien, Inception (2,5)
Alien, Inception (3,2)
Matrix, Inception (4,2)
Matrix, Inception (5,4)
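Phase 1 and the map step of phase 2 can be simulated in a single Python process using the example ratings from these slides. This is a sketch of the data flow only: the function names are hypothetical, plain dictionaries stand in for the MapReduce shuffle, and the similarity computation of the phase 2 reduce step is not included.

```python
from collections import defaultdict
from itertools import combinations

triples = [
    ("Alice", "Matrix", 5), ("Alice", "Alien", 1), ("Alice", "Inception", 4),
    ("Bob", "Alien", 2), ("Bob", "Inception", 5),
    ("Peter", "Matrix", 4), ("Peter", "Alien", 3), ("Peter", "Inception", 2),
]

def phase1(triples):
    """Map: key each rating by user. Reduce: collect each user's (item, rating) list."""
    by_user = defaultdict(list)
    for user, item, rating in triples:
        by_user[user].append((item, rating))  # grouping by key stands in for the shuffle
    return dict(by_user)

def phase2_map(by_user):
    """Emit (item_a, item_b) -> (rating_a, rating_b) for every pair one user rated."""
    pairs = defaultdict(list)
    for prefs in by_user.values():
        # Items are paired in input order here; a real job would order item IDs canonically.
        for (item_a, rating_a), (item_b, rating_b) in combinations(prefs, 2):
            pairs[(item_a, item_b)].append((rating_a, rating_b))
    return dict(pairs)

by_user = phase1(triples)
co_ratings = phase2_map(by_user)
```

Each key of `co_ratings` then carries exactly the rating pairs a reducer needs to compute one item-to-item similarity.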
54
Algorithms Examples – Recommendation
• MapReduce phase 2
  – Reduce – Compute similarities
Input:
Matrix, Alien (5,1)
Matrix, Alien (4,3)
Alien, Inception (1,4)
Alien, Inception (2,5)
Alien, Inception (3,2)
Matrix, Inception (4,2)
Matrix, Inception (5,4)
Output:
Matrix, Alien (-0.47)
Matrix, Inception (0.47)
Alien, Inception (-0.63)
55
Recommendation
      | Matrix | Alien | Inception
Alice | 5 | 1 | 4
Bob   | 1.5 (predicted) | 2 | 5
Peter | 4 | 3 | 2