2010-09-17 Prof. 張志威 (Edward Chang): Confucius & “Its” Intelligent Disciples

2010-9-17 Ed Chang@NCCU Invited Talk
Confucius & “Its” Intelligent Disciples: Search + Social
Edward Chang, Director, Google Research, Beijing

Upload: nccuscience
Posted on 05-Jul-2015
564 views
Category: Education
3 downloads

TRANSCRIPT

Page 1

Confucius & “Its” Intelligent Disciples: Search + Social

Edward Chang, Director, Google Research, Beijing

Page 2

Google China: Where Are We? (Google中国:我们在哪里?)

Page 3

Web 1.0

[Figure: a web of interlinked documents — .htm, .jpg, .doc, and .msg files]

Page 4

Web 2.0 --- Web with People

[Figure: the same documents (.htm, .jpg, .doc, .xls, .msg), now connected through people]

Page 5

Page 6

Page 7

Outline

• Search + Social Synergy
• Search → Social
• Social → Search
• Scalability and Elasticity

Page 8

Related Papers

• AdHeat (Social Ads):
– AdHeat: An Influence-based Diffusion Model for Propagating Hints to Match Ads, H. J. Bao and E. Y. Chang, WWW 2010 (best-paper candidate), April 2010.
– Parallel Spectral Clustering in Distributed Systems, Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and E. Y. Chang, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2010.
• UserRank:
– Confucius and Its Intelligent Disciples, X. Si, E. Y. Chang, Z. Gyongyi, VLDB, September 2010.
– Topic-dependent User Rank, Xiance Si, Z. Gyongyi, E. Y. Chang, and M. S. Sun, Google Technical Report.
• Large-scale Collaborative Filtering:
– PLDA+: Parallel Latent Dirichlet Allocation for Large-Scale Applications, ACM Transactions on Internet Technology, 2010.
– Collaborative Filtering for Orkut Communities: Discovery of User Latent Behavior, W.-Y. Chen, J. Chu, E. Y. Chang, WWW 2009: 681-690.
– Combinational Collaborative Filtering for Personalized Community Recommendation, W.-Y. Chen, E. Y. Chang, KDD 2008: 115-123.
– Parallel SVMs, E. Y. Chang, et al., NIPS 2007.

Page 9

Query: What are must-see attractions at Yellowstone

Page 10

Query: What are must-see attractions at Yellowstone

Page 11

Query: What are must-see attractions at Yosemite

Page 12

Query: What are must-see attractions at Beijing

Hotel ads

Page 13

Page 14

Page 15

Page 16

Page 17

Confucius: Google Q&A
Providing High-Quality Answers in a Timely Fashion

• Trigger a discussion/question session during search
• Provide labels to a post (semi-automatically)
• Given a post, find similar posts (automatically)
• Evaluate quality of a post: relevance and originality
• Evaluate user credentials in a topic-sensitive way
• Route questions to experts
• Provide the most relevant, high-quality content for Search to index
• Generate answers using NLP

[Diagram: Search (S), Community (C), and CCS/Q&A linked by users (U); discussion/question sessions (D) yield Differentiated Content (DC)]

Page 18

Confucius: Google Q&A
Providing High-Quality Answers in a Timely Fashion

Page 19

Q&A Uses Machine Learning

• Label suggestion using the LDA algorithm
• Real-time topic-to-topic (T2T) recommendation using the LDA algorithm
• Gives related high-quality links to previous questions before human answers appear

Page 20

Collaborative Filtering

[Figure: a users-by-questions/labels matrix of 0/1 memberships]

Based on membership so far, and the memberships of others, predict further membership.
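The membership prediction sketched on this slide can be illustrated with a minimal user-based collaborative-filtering snippet. Everything here (the 0/1 matrix and the cosine-similarity weighting) is a hypothetical toy, not the production recommender:

```python
# Minimal user-based collaborative filtering sketch (hypothetical data):
# rows are users, columns are question labels; 1 = user engaged with label.

def cosine(u, v):
    """Cosine similarity between two membership rows."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def predict(matrix, user, item):
    """Score an unseen (user, item) cell by similarity-weighted votes."""
    scores = [(cosine(matrix[user], row), row[item])
              for i, row in enumerate(matrix) if i != user]
    total = sum(s for s, _ in scores)
    return sum(s * v for s, v in scores) / total if total else 0.0

ratings = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
]
score = predict(ratings, 0, 2)  # how likely is user 0 to want label 2?
```

With these toy rows, both other users hold label 2 and resemble user 0, so the empty cell gets a high score.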

Page 21

FIM-based Recommendation

Page 22

FIM Preliminaries

• Observation 1: If an item A is not frequent, no pattern containing A can be frequent [R. Agrawal] → use a support threshold to eliminate infrequent items (if {A} is infrequent, so is {A, B}).
• Observation 2: Patterns containing A can be found from just the transactions containing A [J. Han] → divide and conquer: select the transactions containing A to form a conditional database (CDB), and mine patterns containing A from that CDB alone.

Example: {A, B}, {A, C}, {A} form the CDB of A; {A, B}, {B, C} form the CDB of B.
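The two observations can be sketched directly in code. This is a minimal illustration over the slide's toy transactions; the `min_support` threshold is an assumed value:

```python
from collections import Counter

def frequent_items(db, min_support):
    """Observation 1: an item below the support threshold can be dropped,
    since no pattern containing it can be frequent."""
    counts = Counter(item for txn in db for item in txn)
    return {item for item, c in counts.items() if c >= min_support}

def conditional_db(db, item):
    """Observation 2: patterns containing `item` live entirely in the
    transactions that contain `item` (its conditional database)."""
    return [txn - {item} for txn in db if item in txn]

db = [{"A", "B"}, {"A", "C"}, {"A"}, {"B", "C"}]
freq = frequent_items(db, min_support=2)   # items occurring at least twice
cdb_a = conditional_db(db, "A")            # the rest of each A-transaction
```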

Page 23

Preprocessing

• According to Observation 1, we count the support of each item by scanning the database, and eliminate the infrequent items from the transactions.
• According to Observation 3, we sort the items in each transaction in descending order of support value.

Page 24

Example of Projection

Projecting a database into CDBs. Left: transactions sorted in the order f, c, a, b, m, p. Right: conditional databases of the frequent items.
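A minimal sketch of this projection step, using reconstructed example transactions consistent with the slide's item order f, c, a, b, m, p (the database in the figure is an image, so the data below is an assumption):

```python
from collections import Counter

# Reconstructed toy database (an assumption; the slide's figure is an image).
db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]

support = Counter(item for txn in db for item in txn)   # f:4, c:4, a:3, ...
order = "fcabmp"   # the slide's descending-support item order

# Sort each transaction by that order, as in the preprocessing step.
sorted_db = [sorted(txn, key=order.index) for txn in db]

def cdb(item):
    """Conditional database of `item`: the prefix before `item` in every
    sorted transaction that contains it."""
    return [txn[:txn.index(item)] for txn in sorted_db if item in txn]
```

Here `cdb("p")` yields the prefixes fcam, cb, fcam, matching the p-conditional database {fcam / fcam / cb} shown on the MapReduce projection slide.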

Page 25

Page 26

Page 27

Recursive Projections [H. Li, et al., ACM RS 08]

• Recursive projection forms a search tree.
• Each node is a CDB.
• Using the order of items prevents duplicated CDBs.
• Each level of breadth-first search of the tree can be done by one MapReduce iteration.
• Once a CDB is small enough to fit in memory, we invoke FP-growth to mine it, and that sub-tree grows no further.

Page 28

Projection using MapReduce

p: {fcam / fcam / cb} → p:3, pc:3

Page 29

Collaborative Filtering

[Figure: the users-by-questions matrix again, now predicting labels/related questions]

Based on membership so far, and the memberships of others, predict further membership.

Page 30

Distributed Latent Dirichlet Allocation (LDA)

• Other collaborative filtering apps:
– Recommend Users → Users
– Recommend Music → Users
– Recommend Ads → Users
– Recommend Answers → Questions
• Predict the ? in the light-blue cells of the users-by-users/music/ads/questions matrix
• Search: construct a latent topic layer for better semantic matching
• Example queries: “iPhone crack” vs. “Apple pie”

[Figure: documents such as “1 recipe pastry for a 9 inch double crust, 9 apples, 1/2 cup brown sugar” and “How to install apps on Apple mobile phones?” map to topic distributions, which in turn match user queries like “iPhone crack” and “Apple pie”]

Page 31

Latent Dirichlet Allocation [D. Blei, M. Jordan 04]

• α: uniform Dirichlet prior on the per-document topic distribution (corpus-level parameter)
• β: uniform Dirichlet prior on the per-topic word distribution φ (corpus-level parameter)
• θd: the topic distribution of document d (document level)
• zdj: the topic of the j-th word in d; wdj: the specific word (word level)

[Figure: LDA in plate notation — α → θ → z → w with φ ← β; plates over the M documents, the Nm words per document, and the K topics]

Page 32: 2010 09-17-張志威老師-confucius & “its” intelligent disciples

Combinational Collaborative Filtering Model (CCF)【KDD2008】

Communities

users descriptionsusers descriptions

Communities Communities

332010-9-17 Ed Chang@NCCU Invited Talk

Page 33

LDA Gibbs Sampling: Inputs & Outputs

Inputs:
1. Training data: documents as bags of words
2. Parameter: the number of topics

Outputs:
1. Model parameters: a co-occurrence matrix of topics and words
2. By-product: a co-occurrence matrix of topics and documents

[Figure: a docs-by-words matrix factored into docs-by-topics and topics-by-words matrices]
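The inputs and outputs above can be tied together with a minimal collapsed Gibbs sampler. This is a didactic sketch (the uniform priors `alpha` and `beta` and the toy corpus are assumed values), not the parallel implementation discussed on the following slides:

```python
import random

def lda_gibbs(docs, n_topics, n_iters=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.
    Inputs: documents as bags of word ids, and the number of topics.
    Outputs: topic-word counts (the model) and doc-topic counts (by-product)."""
    rng = random.Random(seed)
    vocab = 1 + max(w for doc in docs for w in doc)
    nzw = [[0] * vocab for _ in range(n_topics)]   # topic-word co-occurrences
    ndz = [[0] * n_topics for _ in docs]           # doc-topic co-occurrences
    nz = [0] * n_topics
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):                 # initialize the counts
        for j, w in enumerate(doc):
            t = z[d][j]
            nzw[t][w] += 1; ndz[d][t] += 1; nz[t] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                t = z[d][j]                        # remove current assignment
                nzw[t][w] -= 1; ndz[d][t] -= 1; nz[t] -= 1
                # P(topic k | rest) ∝ (n_dk + α) · (n_kw + β) / (n_k + Vβ)
                weights = [(ndz[d][k] + alpha) * (nzw[k][w] + beta)
                           / (nz[k] + vocab * beta) for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][j] = t                        # resample and restore
                nzw[t][w] += 1; ndz[d][t] += 1; nz[t] += 1
    return nzw, ndz

docs = [[0, 0, 1], [1, 2, 2], [0, 2]]   # toy corpus of word ids
topic_word, doc_topic = lda_gibbs(docs, n_topics=2)
```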

Page 34

Parallel Gibbs Sampling

Inputs and outputs are the same as in the sequential case: documents as bags of words plus a topic count in; topic-word and topic-document co-occurrence matrices out.

[Figure: the docs-by-words matrix partitioned across machines for parallel sampling]

Page 35

PLDA: Enhanced Parallel LDA [ACM Transactions on IT]

• PLDA is restricted by memory: the topic-word matrix has to fit in memory.
• Restricted by Amdahl's Law: communication costs are too high.
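The Amdahl's Law bound the slide alludes to is easy to check numerically. The 99% parallel fraction below is a hypothetical figure, chosen only to show how a small serial (communication) share caps speedup:

```python
def amdahl_speedup(parallel_fraction, machines):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n).
    The serial share (e.g. inter-machine communication) bounds the gain."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / machines)

# Hypothetical: 99% parallelizable work on 2,000 machines still yields
# less than a 100x speedup; the 1% communication share dominates.
speedup = amdahl_speedup(0.99, 2000)
```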

Page 36

PLDA: Enhanced Parallel LDA

• Take advantage of bag-of-words modeling: each Pw machine processes the vocabulary in word order.
• Pipelining: fetch the updated topic-distribution matrix while doing Gibbs sampling.

Page 37

Speedup 1,500x using 2,000 machines

Page 38

Page 39

Confucius: Google Q&A

• Trigger a discussion/question session during search
• Provide labels to a post (semi-automatically)
• Given a post, find similar posts (automatically)
• Evaluate quality of a post: relevance and originality
• Evaluate user credentials in a topic-sensitive way
• Route questions to experts
• Provide the most relevant, high-quality content for Search to index
• NLQA (natural-language question answering)

[Diagram: Search (S), Community (C), and CCS/Q&A linked by users (U); discussion/question sessions (D) yield Differentiated Content (DC)]

Page 40

UserRank

• Rank users by quantity (number of links) and quality (weights on links) of contributions.
• Quality includes:
– Relevance: is an answer relevant to the question? Measured by the KL divergence between the latent-topic vectors of A and Q.
– Coverage: compared among the different answers.
– Originality: detect potential plagiarism and spam.
– Promptness: the time between the Q and A posting times.
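The KL-divergence relevance measure in the first quality bullet can be sketched as follows; the three-topic vectors are made up for illustration, and the `eps` smoothing is an assumption to keep zero probabilities from blowing up the logarithm:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) between two topic distributions; smaller = more relevant.
    eps smoothing guards against zero probabilities (an assumption here)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

question = [0.7, 0.2, 0.1]      # hypothetical latent-topic vectors
on_topic = [0.6, 0.3, 0.1]
off_topic = [0.1, 0.1, 0.8]
# the on-topic answer should score a smaller divergence from the question
```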

Page 41

Outline

• Search + Social Synergy
• Search → Social
• Social → Search
• Scalability and Elasticity

Page 42

Outline

• Search + Social Synergy
• Search → Social
• Social → Search
• Scalability and Elasticity

Page 43

Social + Data Networks

Page 44

Scalability & Elasticity (可擴展性和彈性)

• The scale and extremes of computing
• Design challenges of cloud-computing platforms
• Google's foundational cloud-computing technologies
• Google's new cloud-computing technologies

Page 45

The Scale of Computing (計算的規模)

[Figure: a scale of magnitudes from millions (百万) through billions and trillions up to 10^20 and beyond (京, 亥), illustrated by the distance 3.8×10^5 km]

Page 46

The Scale of Computing

[Figure: the same scale, adding 1.2×10^10 km]

Page 47

The Scale of Computing

[Figure: the same scale, adding 1.5×10^16 km]

Page 48

The Extremes of High-Performance Computing (高性能計算的極端)

Application           Users   Precision        Reliability      Data volume
Scientific computing  few     extremely high   low to medium    Tera
Stock trading         many    high             extremely high   Giga
Genome sequencing     few     high             high             Tera to Peta
Search engines        many    medium to high   medium           Peta

Page 49

Scalability & Elasticity (可擴展性和彈性)

• The scale and extremes of computing
• Design challenges of cloud-computing platforms
• Foundational cloud-computing technologies
• New cloud-computing technologies

Page 50

Google File System (GFS) (Google文件系统)

• The master manages metadata.
• Data transfers happen directly between clients and chunkservers.
• Files are broken into chunks (typically 64 MB).
• Chunks are triplicated across three machines for safety.
• See the SOSP '03 paper at http://labs.google.com/papers/gfs.html
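The chunking and triplication bullets can be sketched in a few lines. The chunkserver names and round-robin placement below are hypothetical; real GFS placement also weighs racks and disk utilization:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS's typical chunk size

def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Number of chunks a file of `file_size` bytes occupies."""
    return (file_size + chunk_size - 1) // chunk_size

def place_replicas(chunk_id, chunkservers, replication=3):
    """Toy placement: spread three replicas across distinct chunkservers."""
    n = len(chunkservers)
    return [chunkservers[(chunk_id + i) % n] for i in range(replication)]

servers = ["cs0", "cs1", "cs2", "cs3"]           # hypothetical chunkservers
chunks = split_into_chunks(200 * 1024 * 1024)    # a 200 MB file -> 4 chunks
placement = {c: place_replicas(c, servers) for c in range(chunks)}
```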

[Figure: GFS architecture — a GFS master, clients, and chunkservers holding replicated chunks C0-C5]

Page 51

Bigtable (大表格)

• Memory management (内存管理)
• A <Row, Column, Timestamp> triple as the key; lookup, insert, and delete API
• Example:
– Row: a web page
– Columns: various aspects of the page
– Timestamps: the times when a column was fetched
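A toy version of the <Row, Column, Timestamp> data model makes the lookup/insert/delete API concrete. This sketches the model only (`TinyTable` is invented for illustration), not Bigtable's actual SSTable-based storage:

```python
class TinyTable:
    """Toy sketch of Bigtable's data model: a map from
    (row, column, timestamp) to a value; lookup returns the newest version."""
    def __init__(self):
        self.cells = {}

    def insert(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = value

    def lookup(self, row, column):
        """Return the most recent value for (row, column), or None."""
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

    def delete(self, row, column, timestamp):
        self.cells.pop((row, column, timestamp), None)

t = TinyTable()
t.insert("com.example/index", "contents", 1, "<html>v1</html>")
t.insert("com.example/index", "contents", 2, "<html>v2</html>")
latest = t.lookup("com.example/index", "contents")   # newest timestamp wins
```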

Page 52

MapReduce

[Figure: five data blocks each processed by a map task; a reduce step merges the mapped data into results]
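The figure's flow — data blocks mapped in parallel, then reduced into results — can be mimicked in-process. This toy skips the distributed shuffle, scheduling, and fault handling that real MapReduce provides; the word-count job and its input blocks are hypothetical:

```python
from collections import defaultdict

def mapreduce(blocks, map_fn, reduce_fn):
    """Toy in-process MapReduce: map each block, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for block in blocks:                  # "map" phase over data blocks
        for key, value in map_fn(block):
            shuffled[key].append(value)   # shuffle: group values by key
    return {key: reduce_fn(key, values)   # "reduce" phase per key
            for key, values in shuffled.items()}

# Classic word count over hypothetical data blocks.
blocks = ["data data results", "data results", "data"]
counts = mapreduce(
    blocks,
    map_fn=lambda block: [(word, 1) for word in block.split()],
    reduce_fn=lambda key, values: sum(values),
)
```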

Page 53

Computing Platform (計算機平台)

Page 54

A Sample Platform (樣品平台)

• Machine (計算機)
– 16 GB DRAM; 160 GB SSD; 5 × 1 TB disks
• Rack (計算機機架)
– 40 machines
– 48-port Gigabit Ethernet switch
• Data center (計算機中心)
– 10,000 machines (250 racks)
– 2K-port Gigabit Ethernet switch

Page 55

Storage: a Single Machine (儲存---計算機單機)

Page 56

Storage: a Rack (儲存---計算機機架)

Page 57

Storage: a Data Center (儲存---計算機中心)

Page 58

Design Challenges (設計挑戰)

• New programming models and technologies
– File system
– Flash (SSD); GPUs?
• Energy savings
– Hardware, software, warehouse
– Encode/compress/transmit data well
• Failure recovery
– Hardware/software faults
– Deal with stragglers

Page 59

New File-System Technology: Colossus (新文件系統技術)

• Latency reduction
– Metadata distribution
– Encoding and replication
– Data migration
• Throughput improvement
– Adaptive cluster-level file system
• Fault tolerance
– Reed-Solomon coding
– Avoid correlated faults

Page 60

Colossus [SIGMOD 09]

Page 61

New Parallel Algorithms (新parallel算法)

Criterion                                    MapReduce              Pregel     MPI
GFS/IO and task-rescheduling overhead
  between iterations                         Yes                    No (+1)    No (+1)
Flexibility of the computation model         AllReduce only (+0.5)  ? (+1)     Flexible (+1)
Efficient AllReduce                          Yes (+1)               Yes (+1)   Yes (+1)
Recover from faults between iterations       Yes (+1)               Yes (+1)   Apps
Recover from faults within each iteration    Yes (+1)               Yes (+1)   Apps
Final score for scalable machine learning    3.5                    5          4

Page 62

Design Challenges (設計挑戰)

• New programming models
– Parallel
– Flash (SSD); GPUs?
• Energy savings
– Hardware, software, warehouse
– Encode/compress/transmit data well
• Failure recovery
– Hardware/software faults
– Deal with stragglers

Page 63

Global Growth Trend of Energy Consumption (能源消耗全球增長趨勢)

Page 64

Regional Growth Trend of Energy Consumption (能源消耗地區增長趨勢)

Page 65

Power Usage Effectiveness (電源使用效率)

Page 66

Power Usage Distribution (電源使用分布)

Page 67

Energy-Saving Measures (節能措施)

Page 68

Energy Consumption in Oil Equivalent (能源消耗---油當量)

Page 69

Design Challenges (設計挑戰)

• New programming models: Parallel; Flash (SSD); GPUs?
• Energy savings: hardware, software, warehouse; encode/compress/transmit data well
• Failure recovery: hardware/software faults; deal with stragglers

Page 70

Failure Rates (故障率)

• 99.9% uptime per year → how many hours of downtime? About 9 hours of downtime per year.
• With 99.9% uptime per day per machine, what is the probability that none of 10,000 machines goes down?
– 4.5173346 × 10^-5
• For 10,000 machines (data-center scale):
– 0.25 power outages
– 3 router failures
– 1,000 machine failures
– 1,000s of disk failures
– etc., etc., etc.
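The two headline numbers on this slide can be reproduced with a couple of lines of arithmetic:

```python
# 99.9% yearly uptime -> hours of downtime per year (the slide rounds to 9).
hours_down_per_year = (1 - 0.999) * 365 * 24   # about 8.76 hours

# Each machine is up 99.9% of a day; the chance that all 10,000 stay up:
p_all_up = 0.999 ** 10_000                     # about 4.5173e-05
```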

Page 71

Fast Recovery After Failures (故障後快速修復)

• Replication (複製)
• Where possible:
– Loose consistency (鬆散的一致性)
– Approximate answers
– Incomplete answers

Page 72

Data Scale vs. Algorithm Accuracy (數據規模算法精度)

Page 73

Data Scale vs. Algorithm Accuracy

Page 74

Data Scale vs. Algorithm Accuracy

Page 75

Data Scale vs. Algorithm Accuracy

Page 76

Examples of Winning with Scale (赢在规模的例证)

• Google Translate (谷歌翻譯)
• Google speech recognition (谷歌語音識別)
• Trend prediction (趨勢預測)

Page 77

Flu Trends (流感趨勢圖)

Page 78

Outline

• Search + Social Synergy
• Search → Social
• Social → Search
• Scalability and Elasticity

Page 79

Confucius and Disciples

Page 80

Concluding Remarks

• Search + Social
• The increasing quantity and complexity of data demand scalable solutions.
• Key subroutines for mining massive data sets have been parallelized:
– Spectral Clustering [ECML 08]
– Frequent Itemset Mining [ACM RS 08]
– PLSA [KDD 08]
– LDA [WWW 09, AAIM 09]
– UserRank [Google TR 09]
– Support Vector Machines [NIPS 07]
• Relevant papers: http://infolab.stanford.edu/~echang/
• Open source PSVM and PLDA:
– http://code.google.com/p/psvm/
– http://code.google.com/p/plda/

Page 81

Collaborators

• Prof. Chih-Jen Lin (NTU)
• Hongjie Bai (Google)
• Hongji Bao (Google)
• Wen-Yen Chen (UCSB)
• Jon Chu (MIT)
• Haoyuan Li (PKU)
• Zhiyuan Liu (Tsinghua)
• Xiance Si (Tsinghua)
• Yangqiu Song (Tsinghua)
• Matt Stanton (CMU)
• Yi Wang (Google)
• Dong Zhang (Google)
• Kaihua Zhu (Google)

Page 82

References

[1] Alexa Internet. http://www.alexa.com/.
[2] D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In Proc. of the 21st International Conference on Machine Learning, pages 373-380, 2004.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. of the 17th International Conference on Machine Learning, pages 167-174, 2000.
[5] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, pages 430-436, 2001.
[6] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.
[8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.
[9] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of Uncertainty in Artificial Intelligence, pages 289-296, 1999.
[10] T. Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1):89-115, 2004.
[11] A. McCallum, A. Corrada-Emmanuel, and X. Wang. The author-recipient-topic model for topic and role discovery in social networks: experiments with Enron and academic email. Technical report, Computer Science, University of Massachusetts Amherst, 2004.
[12] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 20, 2007.
[13] M. Ramoni, P. Sebastiani, and P. Cohen. Bayesian clustering by dynamics. Machine Learning, 47(1):91-121, 2002.

Page 83

References (cont.)

[14] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proc. of the 24th International Conference on Machine Learning, pages 791-798, 2007.
[15] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: a large-scale study in the Orkut social network. In Proc. of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 678-684, 2005.
[16] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 306-315, 2004.
[17] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research (JMLR), 3:583-617, 2002.
[18] T. Zhang and V. S. Iyengar. Recommender systems using linear classifiers. Journal of Machine Learning Research, 2:313-334, 2002.
[19] S. Zhong and J. Ghosh. Generative model-based clustering of documents: a comparative study. Knowledge and Information Systems (KAIS), 8:374-384, 2005.
[20] L. Adamic and E. Adar. How to search a social network. 2004.
[21] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, pages 5228-5235, 2004.
[22] H. Kautz, B. Selman, and M. Shah. Referral Web: combining social networks and collaborative filtering. Communications of the ACM, 3:63-65, 1997.
[23] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22:207-216, 1993.
[24] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proc. of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998.
[25] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems, 22(1):143-177, 2004.

Page 84

References (cont.)

[26] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. of the 10th International World Wide Web Conference, pages 285-295, 2001.
[27] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems, 22(1):143-177, 2004.
[28] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. of the 10th International World Wide Web Conference, pages 285-295, 2001.
[29] M. Brand. Fast online SVD revisions for lightweight recommender systems. In Proc. of the 3rd SIAM International Conference on Data Mining, 2003.
[30] D. Goldberg, D. Nichols, B. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61-70, 1992.
[31] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: an open architecture for collaborative filtering of netnews. In Proc. of the ACM Conference on Computer Supported Cooperative Work, pages 175-186, 1994.
[32] J. Konstan, et al. GroupLens: applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77-87, 1997.
[33] U. Shardanand and P. Maes. Social information filtering: algorithms for automating “word of mouth”. In Proc. of ACM CHI, 1:210-217, 1995.
[34] G. Linden, B. Smith, and J. York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7:76-80, 2003.
[35] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning Journal, 42(1):177-196, 2001.
[36] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proc. of the International Joint Conference on Artificial Intelligence, 1999.
[37] http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/collaborativefiltering.html
[38] E. Y. Chang, et al. Parallelizing support vector machines on distributed computers. NIPS, 2007.
[39] Wen-Yen Chen, Dong Zhang, and E. Y. Chang. Combinational collaborative filtering for personalized community recommendation. ACM KDD, 2008.
[40] Y. Sun, W.-Y. Chen, H. Bai, C.-J. Lin, and E. Y. Chang. Parallel spectral clustering. ECML, 2008.
[41] Wen-Yen Chen, Jon Chu, Junyi Luan, Hongjie Bai, Yi Wang, and Edward Y. Chang. Collaborative filtering for Orkut communities: discovery of user latent behavior. International World Wide Web Conference (WWW), Madrid, Spain, April 2009.
[42] Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang. PLDA: parallel latent Dirichlet allocation. International Conference on Algorithmic Aspects in Information and Management, June 2009.