A Topic Detection and Tracking method combining NLP with Suffix Tree Clustering

Author: Yaohong JIN

Source: International Conference on Computer Science and Electronics Engineering (ICCSEE)

Date: 2013/10/7

Presenter: 曹昌林

Outline

Introduction

CLUSTERING ALGORITHM

TOPIC DETECTION AND TRACKING ALGORITHM

Conclusion

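TDT (Topic Detection and Tracking)

An information-processing technique

Used to identify the main topics and to track the topics that develop from them

Applied to news mining, where a topic shifts over time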

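suffix tree

The suffix tree T of a string S containing m symbols (characters or words) is a tree with exactly m leaf nodes. Every edge is labeled with a non-empty substring of S, and no two edges leaving the same node may have labels that begin with the same symbol.

Example: bananas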
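As a rough illustration of this definition, the sketch below builds a suffix tree for the example string by inserting one suffix at a time. It is a naive quadratic-time construction chosen for readability, not a linear-time algorithm, and the names `Node`, `build_suffix_tree`, and `count_leaves` are hypothetical helpers introduced here. It also assumes the last symbol occurs only once in the string (true for "bananas"), so that every suffix ends at its own leaf.

```python
class Node:
    """A tree node; children maps the first symbol of an edge to (label, child)."""
    def __init__(self):
        self.children = {}


def insert_suffix(root, suffix):
    """Walk down from the root, splitting an edge where the suffix diverges."""
    node = root
    while suffix:
        first = suffix[0]
        if first not in node.children:
            # No edge starts with this symbol: hang a new leaf here.
            node.children[first] = (suffix, Node())
            return
        label, child = node.children[first]
        # Length of the common prefix of the edge label and the suffix.
        k = 0
        while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
            k += 1
        if k < len(label):
            # The suffix leaves the edge part-way: split the edge at position k.
            middle = Node()
            middle.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], middle)
            child = middle
        node = child
        suffix = suffix[k:]


def build_suffix_tree(s):
    """Naive construction: insert every suffix of s into an empty tree."""
    root = Node()
    for i in range(len(s)):
        insert_suffix(root, s[i:])
    return root


def count_leaves(node):
    """A node without children is a leaf."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for _, child in node.children.values())


tree = build_suffix_tree("bananas")
print(count_leaves(tree))      # 7 leaves for the 7 symbols of "bananas"
print(sorted(tree.children))   # ['a', 'b', 'n', 's']
```

For "bananas", m = 7 and the tree has seven leaves. The four edges leaving the root start with different symbols, as the definition requires.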

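suffix tree clustering (generalized suffix tree) (1)

Combining n strings into a single suffix tree gives a generalized suffix tree. Each leaf node is labeled (j, i); concatenating the edge labels along the whole path from the root to that leaf yields the suffix of string j (0 < j ≤ n) that starts at position i.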

suffix tree clustering (generalized suffix tree) (2)

Example: S = { "cat ate cheese", "mouse ate cheese too", "cat ate mouse too" }
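To make the example concrete, the sketch below lists the word-level phrases that occur in at least two of the three strings, which is the kind of information the shared nodes of the generalized suffix tree carry. It is a simplified stand-in that uses a dictionary of phrases instead of an actual tree, so it also reports phrases (such as "cat" on its own) that would not be separate nodes in the tree. The function name `shared_phrases` and the parameter `min_docs` are assumptions made for this illustration.

```python
from collections import defaultdict


def shared_phrases(strings, min_docs=2):
    """Map each word-level phrase to the set of strings (1-based) containing it."""
    phrase_docs = defaultdict(set)
    for j, text in enumerate(strings, start=1):
        words = text.split()
        # Every prefix of every suffix is a phrase of string j.
        for i in range(len(words)):
            for end in range(i + 1, len(words) + 1):
                phrase_docs[tuple(words[i:end])].add(j)
    # Keep only phrases shared by at least min_docs strings.
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}


S = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
for phrase, docs in sorted(shared_phrases(S).items()):
    print(" ".join(phrase), sorted(docs))
```

Each shared phrase together with its set of strings is a candidate cluster, which is why one string can belong to several clusters at once.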

CLUSTERING ALGORITHM

Feature Selection(1)

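NLP algorithms are used to select the more meaningful words for clustering

A stop-word table is used to filter out high-frequency words (such as "the", "I", "a")

TF-IDF is used to compute the weight of each word and to filter out commonly used words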
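The two filtering steps above can be sketched as follows. The slides do not say which TF-IDF variant or which stop-word table was used, so the formula here (term frequency times the logarithm of inverse document frequency), the tiny `STOP_WORDS` set, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

# Assumed stop-word table; the slides only give "the", "I", "a" as examples.
STOP_WORDS = {"the", "i", "a"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tf_idf(docs):
    """Return one {word: weight} dictionary per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            w: (c / len(tokens)) * math.log(n / df[w])
            for w, c in counts.items()
        })
    return weights


docs = ["the cat ate cheese", "a mouse ate cheese too", "the cat ate a mouse too"]
for w in tf_idf(docs):
    print(w)
```

With this variant, a word that occurs in every document (here "ate") gets weight 0, so a simple cut-off on the weight removes commonly used words.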

Feature Selection(2)

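STC is initialized so that it can track phrases of any length

Every word is tagged with its part of speech and its meaning

Nouns, verbs, and meanings are selected as the keywords of a document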
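A minimal sketch of the keyword selection step, assuming the words have already been tagged by some part-of-speech tagger. The tag names `NOUN` and `VERB`, the sample tags, and the function `select_keywords` are hypothetical; the tagging of meanings mentioned on the slide is not modeled here.

```python
# Assumed tag names; a real tagger may use a different tag set.
KEEP_TAGS = {"NOUN", "VERB"}


def select_keywords(tagged_tokens):
    """Keep only the words whose part-of-speech tag is in KEEP_TAGS."""
    return [word for word, tag in tagged_tokens if tag in KEEP_TAGS]


# Hand-tagged sample input, for illustration only.
tagged = [("cat", "NOUN"), ("quickly", "ADV"), ("ate", "VERB"), ("cheese", "NOUN")]
print(select_keywords(tagged))  # ['cat', 'ate', 'cheese']
```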

Suffix Tree Clustering

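The output of feature selection is fed into STC

The punctuation in the text and its positional relationships are preserved

The advantages are that a document can appear in multiple clusters and that inserting any sentence into the tree takes only linear time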

Scoring Clusters(1)

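Each day's news headlines are distributed across a series of clusters

The importance of a cluster depends on how many articles contain the topic and how many media outlets cover it; a topic that is high on both receives the most attention

Using the formula on the next slide, the 50 highest-scoring clusters are selected as the source for TDT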

Scoring Clusters(2)

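[Equation: $D(T_i)$ expressed in terms of $|T_i|$, $\sum_{i=1}^{n} |T_i|$, $|m_i|$, and $|M|$]

$D(T_i)$ is the importance of the topic $T_i$

$|T_i|$ is the number of articles in the topic

$\sum_{i=1}^{n} |T_i|$ is the total number of articles in the day

$|m_i|$ is the number of media outlets in which the topic $T_i$ is involved

$|M|$ is the total number of media outlets in the corpus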
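The ranking step can be sketched as below. The exact formula for the importance score is not available in this transcript, so the code assumes, purely for illustration, that the score is the product of the article share and the media share; this is consistent with the statement that a topic high on both counts ranks highest, but it should not be read as the author's formula. The function name, the dictionary keys, and the sample numbers are hypothetical.

```python
def score_clusters(clusters, total_media, top_n=50):
    """Rank clusters by an ASSUMED score: article share times media share."""
    total_articles = sum(c["articles"] for c in clusters)
    if total_articles == 0 or total_media == 0:
        return []
    scored = []
    for c in clusters:
        article_share = c["articles"] / total_articles  # |T_i| / sum of |T_i|
        media_share = c["media"] / total_media          # |m_i| / |M|
        scored.append((article_share * media_share, c["label"]))
    # Highest score first; ties broken by label for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_n]


# Made-up counts for one day, for illustration only.
clusters = [
    {"label": "flood", "articles": 6, "media": 3},
    {"label": "election", "articles": 3, "media": 4},
    {"label": "sports", "articles": 1, "media": 1},
]
for score, label in score_clusters(clusters, total_media=4):
    print(label, round(score, 3))
```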

TOPIC DETECTION AND TRACKING ALGORITHM(1)

Suppose $A = \{a_1, a_2, \ldots, a_n\}$ is the set of topics in one period of time. Initially, A is an empty set.

$B = \{b_{i1}, b_{i2}, \ldots, b_{im}\}$ is the set of clusters in one day, where i denotes the i-th day and m is 50.

Step 1: initialize the topic set A;

Step 2: if A is the empty set, add all the elements of B to A;

TOPIC DETECTION AND TRACKING ALGORITHM(2)

Step 3: compute the similarity of each pair $(a_k, b_{ij})$;

Step 4: if a cluster $b_{ij}$ is similar to $a_k$, $b_{ij}$ is linked to $a_k$ (this procedure is tracking), and $b_{ij}$ is called a sub-topic of $a_k$;

Step 5: if $b_{ij}$ is not similar to any element of A, $b_{ij}$ is a new topic and is added to A (this procedure is detection);

Step 6: generate a description for each topic.
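Steps 1 to 5 can be sketched as a single loop over the days. The sketch assumes that each cluster is represented as a set of words, that similarity is a Jaccard ratio, and that "similar" means the ratio reaches a cut-off; the `threshold` value of 0.5, the data layout, and the function names are assumptions, not values from the source. It also assumes that clusters of the same day are compared only with topics that existed before that day. Step 6 (description generation) is left out.

```python
def jaccard(x, y):
    """Shared words divided by all distinct words of the two sets."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def detect_and_track(days, threshold=0.5):
    """days[i-1] is the list of word sets b_i1 ... b_im for day i."""
    topics = []  # the topic set A, initially empty (step 1)
    for i, clusters in enumerate(days, start=1):
        # Topics known before day i. On the first day this list is empty,
        # so every cluster is added to A, which covers step 2.
        known = list(topics)
        for j, b in enumerate(clusters, start=1):
            # Step 3: similarity of each pair (a_k, b_ij).
            similar = [a for a in known if jaccard(a["words"], b) >= threshold]
            if similar:
                # Step 4 (tracking): link b_ij as a sub-topic of each similar a_k.
                for a in similar:
                    a["subtopics"].append((i, j))
            else:
                # Step 5 (detection): b_ij becomes a new topic in A.
                topics.append({"words": set(b), "subtopics": [(i, j)]})
    return topics


days = [
    [{"cat", "ate", "cheese"}, {"rain", "flood"}],
    [{"cat", "ate", "mouse"}, {"election", "vote"}],
]
for topic in detect_and_track(days):
    print(sorted(topic["words"]), topic["subtopics"])
```

In this toy run the first cluster of day 2 is tracked as a sub-topic of the first topic, and the second cluster of day 2 is detected as a new topic.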

TOPIC DETECTION AND TRACKING ALGORITHM(3)

The difficulty of the TDT algorithm above lies in computing the similarity of clusters, because the focus of a topic gradually shifts over time

Similarity computation has to take this shifting phenomenon into account

A new description has to be generated from a list of topics if a topic is linked by other topics

Similarity of two Clusters(1)

The Vector Space Model (VSM) is used to represent the content of a cluster

In addition to the label of the cluster, the top K words are added to the vector

The K words are extracted from the nodes of the suffix tree by the Mutual Information algorithm

K is set to 50
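A possible way to build such a vector is sketched below. The slides do not spell out the Mutual Information computation, so the code swaps in pointwise mutual information between a word and a cluster, estimated from document counts; that choice, the representation of documents as word sets, and the names `top_k_words` and `cluster_vector` are assumptions made for this illustration.

```python
import math


def top_k_words(docs, cluster_ids, k=50):
    """Rank the words of a cluster by pointwise mutual information (assumed)."""
    n = len(docs)
    in_cluster = [docs[i] for i in cluster_ids]
    if n == 0 or not in_cluster:
        return []
    p_c = len(in_cluster) / n  # probability that a document is in the cluster
    scores = {}
    for w in set().union(*in_cluster):
        p_w = sum(w in d for d in docs) / n         # documents containing w
        p_wc = sum(w in d for d in in_cluster) / n  # ... that are also in the cluster
        scores[w] = math.log(p_wc / (p_w * p_c))
    # Highest score first; ties broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def cluster_vector(label_words, docs, cluster_ids, k=50):
    """The vector holds the cluster label plus the top K words."""
    return set(label_words) | set(top_k_words(docs, cluster_ids, k))


# Toy corpus: each document is a set of words; documents 0 and 1 form the cluster.
docs = [
    {"cat", "ate", "cheese"},
    {"cat", "ate", "mouse"},
    {"rain", "ate"},
    {"rain", "flood"},
]
print(top_k_words(docs, [0, 1], k=2))
print(sorted(cluster_vector(["cat", "ate"], docs, [0, 1], k=3)))
```

Words that occur only inside the cluster score highest, while a word such as "ate" that also occurs elsewhere ranks lower.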

Similarity of two Clusters(2)

The Jaccard coefficient is used to measure the correlation between the vectors of two clusters

Its numerator is the number of words that appear in both clusters

Its denominator is the total number of words in the two clusters
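Written out, the measure described above takes the standard Jaccard form. The notation $W(\cdot)$ for the word set of a cluster vector is introduced here as an assumption, and "the total number of words in the two clusters" is read as the number of distinct words in their union. The second line works through a small made-up example.

```latex
\[
J(a_k, b_{ij}) = \frac{|W(a_k) \cap W(b_{ij})|}{|W(a_k) \cup W(b_{ij})|}
\]
% Example: W(a_k) = {cat, ate, cheese}, W(b_ij) = {cat, ate, mouse}
\[
J = \frac{|\{\text{cat}, \text{ate}\}|}{|\{\text{cat}, \text{ate}, \text{cheese}, \text{mouse}\}|}
  = \frac{2}{4} = 0.5
\]
```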

Similarity of two Clusters(3)

A sufficiently high similarity value means the two clusters are similar and can be linked

A lower value means they are not similar, and a new topic has to be added

Description Generation

Semantic analysis based on the Hierarchical Network of Concepts (HNC) theory is used to extract the description from the labels

Words with the same meaning, or related by hyponymy, have to be filtered out, and nouns are given priority when deciding what to retain in the list

The common phrase has to be extracted from the remaining word list

Conclusion

Advantage

Can track the topics effectively

Drawback

The different aspects of a topic were revealed correctly, but not linked with each other

The ambiguity of topic detection and tracking was not processed very well

Combine semantic analysis technology with TDT to deal with the ambiguity of topic detection and tracking
