Text Clustering: A Case Study A Multilingual Text Mining Approach Based On Self-Organizing Maps

Upload: lila-martin

Post on 03-Jan-2016


DESCRIPTION

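Text Clustering: A Case Study A Multilingual Text Mining Approach Based On Self-Organizing Maps. Background. 1. Research on multilingual text mining. The main focus is the advanced application of text mining to mixed Chinese/English corpora; the aim of this study is to propose a machine learning method, based on Self-Organizing Maps neural networks, for detecting content-related documents in a mixed Chinese/English document collection. The main original contributions of this work include: - PowerPoint PPT Presentation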

TRANSCRIPT

Page 1: Text Clustering:  A Case Study A Multilingual Text Mining Approach Based On  Self-Organizing Maps

Text Clustering: A Case Study

A Multilingual Text Mining Approach Based On

Self-Organizing Maps


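Page 2: Background

1. Research on multilingual text mining

The main focus is the advanced application of text mining to mixed Chinese/English corpora. The aim of this study is to propose a machine learning method, based on Self-Organizing Maps neural networks, for detecting content-related documents in a mixed Chinese/English document collection. The main original contributions of this work include:

• The first theoretical model for research on multilingual (Chinese/English) text mining
• A text mining model built mainly on Self-Organizing Maps neural networks, designed to be a language-neutral algorithm
• Overcoming the difficulty of applying data mining theory to cross-lingual information processing
• A basis for further computation of semantic relatedness between documents and for further theoretical research in linguistics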

Page 3: Advanced Issues

2. Research on applying text mining to the construction of the next-generation Internet

This study applies text mining based mainly on Self-Organizing Maps neural networks to support part of the engineering of the Semantic Web, addressing Knowledge Representation problems on the Semantic Web, including:

• Automatic construction of web directories and hierarchies
• Automatic document classification
• Ontology construction

In this application area, the study also uses different text mining algorithms and computing platforms, including text mining techniques based on Self-Organizing Maps (SOM) and Support Vector Machines (SVM).

Page 4: Introduction

Related Concepts of Text Mining
--Data mining, Information Retrieval (IR)
--Machine learning, automatic organization
--Text categorization
--Unstructured / semi-structured data

Why Multilingual Text Mining?
--Monolingual vs. multilingual
--Parallel corpora
--Language-independent algorithm

Page 5: System Architecture

[Diagram: system architecture. Stages: preprocessing, training, analysis. Components: Corpora Selection, Feature Selection, Translation, SOM Discovery Algorithm, Words Cluster Map, Documents Cluster Map, Semantic Analysis.]

Page 6: Preprocessing stage

• Vector-Space-Model

documents -> index files -> document vectors

Sample document (Chinese original; its parallel English text appears on Page 10):

翻開民國六十五年元月的光華創刊號,發現最早期的「光華畫報雜誌」,的確只是重大建設、觀光勝地、風土民情的「圖片集錦」簿冊,文宣味十足,並且只對海外發行。然而很快地,它開始有了改變,有時是漸進式地日見豐實,有時則是大幅度的改頭換面,終於成為第一本能反映社會現況,探詢先人智慧寶藏、介紹東西文化交流的獨特刊物。

Index terms shown for the sample: 民國, 光華, 雜誌

Document vectors (one per document):

[x_1, x_2, x_3, x_4, x_5, …, x_n]
...
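The vector-space step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes documents are already segmented into index terms and uses raw term counts as vector components, whereas the slides do not say which weighting (binary, term frequency, or otherwise) was used. The token lists and the function names `build_vocabulary` and `to_vector` are hypothetical.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Return the sorted list of distinct index terms across all documents."""
    return sorted({term for doc in tokenized_docs for term in doc})


def to_vector(tokens, vocabulary):
    """Map one tokenized document to a term-count vector over the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Hypothetical, already-segmented documents (assumption: segmentation and
# stemming happened in an earlier preprocessing step).
docs = [
    ["民國", "光華", "雜誌"],
    ["光華", "雜誌", "光華"],
    ["sinorama", "magazin"],
]

vocab = build_vocabulary(docs)
vectors = [to_vector(doc, vocab) for doc in docs]

for doc, vec in zip(docs, vectors):
    print(doc, vec)
```

Because the vocabulary is shared, Chinese and English terms occupy dimensions of the same vector space, which is what lets a single map be trained on a mixed corpus.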

Page 7: Self-Organizing Maps (SOM)

--Unsupervised learning
--Automatic cluster generation
--Maps high-dimensional input onto a two-dimensional grid
--Intuitive neighborhood relations
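A minimal sketch of SOM training in plain Python may help make these properties concrete. It is a generic textbook-style SOM, not the algorithm used in the study: the grid size, the linearly decaying learning rate, the Gaussian neighborhood, and all function names (`make_som`, `best_matching_unit`, `train`) are assumptions chosen for illustration.

```python
import math
import random


def make_som(rows, cols, dim, seed=0):
    """Create a rows x cols grid of weight vectors with values in [0, 1)."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(som, x):
    """Return (row, col) of the neuron whose weights are closest to x."""
    cells = ((r, c) for r in range(len(som)) for c in range(len(som[0])))
    return min(cells, key=lambda rc: sq_dist(som[rc[0]][rc[1]], x))


def train(som, samples, epochs=20, lr0=0.5, radius0=None):
    """Unsupervised training: pull the winner and its grid neighbors toward x."""
    rows, cols = len(som), len(som[0])
    if radius0 is None:
        radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)    # shrinking neighborhood
        for x in samples:
            br, bc = best_matching_unit(som, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                    w = som[r][c]
                    for i in range(len(w)):
                        w[i] += lr * h * (x[i] - w[i])
    return som


# Toy input: two groups of 3-dimensional vectors (hypothetical data).
samples = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
som = train(make_som(4, 4, 3, seed=42), samples)
for x in samples:
    print(x, "->", best_matching_unit(som, x))
```

No labels are supplied at any point; clusters emerge because neighboring neurons are updated together, so similar inputs tend to end up at nearby grid positions.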

Page 8: SOM Abstraction Illustration

[Diagram: N samples -> M neurons -> C clusters]
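One way to read this abstraction is that each of the N samples is assigned to its nearest neuron among the M neurons, and the samples sharing a neuron form one of the C clusters. The sketch below follows that reading, which is an assumption; the slide does not state how clusters are derived from neurons. The prototype vectors, the sample data, and the function names are hypothetical.

```python
from collections import defaultdict


def nearest(prototypes, x):
    """Index of the prototype (neuron weight vector) closest to x."""
    return min(range(len(prototypes)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(prototypes[j], x)))


def group_by_neuron(prototypes, samples):
    """Group sample indices by the neuron each sample is assigned to."""
    clusters = defaultdict(list)
    for i, x in enumerate(samples):
        clusters[nearest(prototypes, x)].append(i)
    return dict(clusters)


# M = 3 hypothetical trained neurons and N = 4 hypothetical samples.
prototypes = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
samples = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.8], [0.0, 0.2]]

clusters = group_by_neuron(prototypes, samples)
print(clusters)  # neurons that attract no sample do not appear, so C <= M
```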

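Page 9: Measure of similarity for clustered items

Similarity between two words / documents p and q:

$$D(p,q) = \left(1 + \lVert G(N_p) - G(N_q) \rVert^{2}\right)^{-1}$$

The slide does not define the symbols. N_p and N_q appear to denote the map neurons to which p and q are assigned, and G(·) a neuron's position on the map grid. The placement of the exponents is reconstructed from the extracted slide text.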
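The measure can be sketched in code under explicit assumptions: that the formula reads D(p,q) = (1 + ||G(N_p) - G(N_q)||^2)^(-1), that G returns a neuron's (row, column) grid position, and that the norm is Euclidean. None of these is confirmed by the slide, and the function name `grid_similarity` is hypothetical.

```python
def grid_similarity(pos_p, pos_q):
    """Similarity of two items from the grid positions of their neurons.

    Assumed form: 1 / (1 + squared Euclidean grid distance).
    Items on the same neuron score 1.0; the score falls toward 0 with distance.
    """
    d2 = sum((a - b) ** 2 for a, b in zip(pos_p, pos_q))
    return 1.0 / (1.0 + d2)


# Hypothetical grid positions of the neurons for three clustered items.
same_cell = grid_similarity((2, 3), (2, 3))
adjacent = grid_similarity((2, 3), (2, 4))
far_apart = grid_similarity((0, 0), (3, 4))
print(same_cell, adjacent, far_apart)
```

Under this form the score depends only on where items land on the trained map, not on their language, which fits the language-neutral aim of the approach.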

Page 10: Experimental Discussion

Corpora Selection: Sinorama Magazine

003e.txt (English):

If you could flip through the first issue, from January of 1976, you would discover that the early Sinorama Pictorial was a slim collection of photos of national development, scenic spots, and traditional customs. It had a heavily propagandistic feel, and was only for overseas distribution. Nevertheless, it rapidly began to change. Sometimes the changes were gradual, as the contents became richer and more realistic. Sometimes there were major changes of format. Ultimately, it has become a unique publication which reflects current society, explores the wisdom of our ancestors, and introduces East-West cultural interchange.

003c.txt (Chinese):

翻開民國六十五年元月的光華創刊號,發現最早期的「光華畫報雜誌」,的確只是重大建設、觀光勝地、風土民情的「圖片集錦」簿冊,文宣味十足,並且只對海外發行。然而很快地,它開始有了改變,有時是漸進式地日見豐實,有時則是大幅度的改頭換面,終於成為第一本能反映社會現況,探詢先人智慧寶藏、介紹東西文化交流的獨特刊物。

Page 11: Word cluster map

sinorama 工作 光華 bad bridg caught childlik chingju chrissi commun comprehens contribut countryw cultiv curios drove easili endlessli event eventu fulfil goal greatest highest impart inexhaust inferior jiafong joi magazin mission model modesti plant potenti problem profession pursu record repres respons scholar sens serv spark specialist transmit wai wang wonder 人中 人生 不只 不亞於 不斷 之間 充當 本國 生態 目的 丟人 她們 成就 自然 似乎 我們 赤忱 使命感 其他 委員 孩子 後來 後進 既然 根基 留下 追求 做好 執著 培養 專家 帶動 啟發 深厚 現在 責任 這些 提到 傳遞 敬業樂群 楷模 態度 榮譽 撰述 潛力 稿子 學者 擔任 樹立 橋樑 環保 總是 總編輯 謙遜 職守 灌輸

An example of resulting word clusters from the trained word cluster map.

Page 12: Multilingual Text Mining from Parallel Chinese-English Corpora

• The document cluster map for the tested English articles

[Map figure; grid positions are not recoverable. File names that were run together in the extracted text are listed on one line.]

E005_002.txt
E002_001.txt, E006_001.txt
E003_002.txt, E004_002.txt
E001_001.txt, E001_002.txt, E009_001.txt
E005_001.txt
E006_002.txt
E003_001.txt, E004_001.txt
E007_001.txt, E007_002.txt
E002_002.txt
E008_001.txt, E008_002.txt
E009_002.txt

• The document cluster map for the tested Chinese articles

[Map figure; grid positions are not recoverable. File names that were run together in the extracted text are listed on one line.]

C008_001.txt, C009_001.txt
C009_002.txt
C007_001.txt, C007_002.txt
C008_002.txt
C004_001.txt, C004_002.txt
C001_001.txt, C001_002.txt
C006_001.txt
C003_001.txt, C005_002.txt
C002-001.txt, C002_002.txt
C006_002.txt
C005-001.txt
C003_002.txt

Page 13: Multilingual Text Mining from Hybrid Chinese-English Corpora

• The document cluster map for the hybrid corpus that contains tested English and Chinese articles.

[Map figure; the labels below are listed line by line as extracted, and cell boundaries within each line are not recoverable.]

E15 E42 E45 E29

C49 E49 C02 E02 E20 C20

E33 E47 E34 E51 E55 C55 E27

C54 E01 E43 E48 E54 C08 E08 E26 E40 C19 E19 C39 E39

C48 E32 C07 E07 C37 E37 E05 C56 E04 E16 E56

C00 C01 C03 C04 C05 C06 C09 C10 C11 C12 C13 C14 C15 C16 C17 C18 C21 C22 C23 C24 C25 C26 C27 C28 C29 C30 C31 C32 C33 C34 C35 C36 C38 C40 C41 C42 C43 C45 C46 C47 C50 C51 C52 C53 C57 E00 E03 E06 E09 E10 E13 E17 E18 E25 E28 E30 E31 E35 E36 E38 E41 E46 E50 E52 E57

C44 E21 E44 E22 E11 E12 E14 E53

Page 14: Conclusions

The main original contributions of this work include:

• The first theoretical model for research on multilingual (Chinese/English) text mining
• A text mining model built mainly on Self-Organizing Maps neural networks, designed to be a language-neutral algorithm
• Overcoming the difficulty of applying data mining theory to cross-lingual information processing
• A basis for further computation of semantic relatedness between documents and for further theoretical research in linguistics