Text Analysis Method Using Latent Topics for Field Notes in Area Studies

Taizo Yamada Historiographical Institute,

The University of Tokyo

2013/12/13 PNC2013 1

Contribution
Text analysis for Area Studies
– Applying a topic model to a field note for Area Studies
  • We use LDA (Latent Dirichlet Allocation) as the topic model.
  • Similar fragments or scenes in the field note can be obtained.
– Visualization of the relationships between place names
  • The place information does not include latitude and longitude.
  • We do not have any place-name dictionaries.

Outline
Background, purpose
Methodology of text analysis
– Text structuring
– Term extraction
– Characterization of terms
– Method of obtaining similar text fragments
– Visualization and system
Conclusion

Background
Recently, Area Studies has made remarkable progress.
– Researchers in Area Studies can search and analyze large volumes of data easily and quickly,
– using information technology such as web technology, data analysis, and data engineering.
– To promote such analysis, researchers have published databases.
  • Catalogues, images, statistical data, spatial data, and temporal data.
To advance the field further,
– we believe that text analysis is one of the essential elements.
– A text such as a field note describes sights, scenes, and customs,
– but latent topics or subjects can be key elements characterizing the area.

Purpose
A text analysis method for a field note in Area Studies.
– We prepare a field note database in which the data unit is a description of a sight or a scene.
– To detect latent topics, we use latent Dirichlet allocation (LDA).
  • LDA is a topic model.
  • In LDA, each text can be viewed as a mixture of various latent topics, and each topic can be viewed as a mixture of various words.
– We also detect the track of the investigator in a field note.
  • Visualizing the track shows the relations between place names.

Text (1)
Target: Koichi Takaya, “The Field note collection2 Sumatra” (in Japanese)
– 1984.10.19 to 1985.1.18
– Covers the whole of Sumatra Island

Text structuring (1)


Text structuring (2)


Term extraction (1)
Morphological analysis
– MeCab + IPAdic (morphological analyzer; dictionary)

Text (a scene)
マングローブ。前面の海にはバガン (魚取り用の櫓) いくつもある。
(English: “Mangroves. In the sea in front there are many bagan (towers for catching fish).”)

Result of morphological analysis
マングローブ  名詞,一般,*,*,*,*,マングローブ,マングローブ,マングローブ
。  記号,句点,*,*,*,*,。,。,。
前面  名詞,一般,*,*,*,*,前面,ゼンメン,ゼンメン
の  助詞,連体化,*,*,*,*,の,ノ,ノ
海  名詞,一般,*,*,*,*,海,ウミ,ウミ
に  助詞,格助詞,一般,*,*,*,に,ニ,ニ
は  助詞,係助詞,*,*,*,*,は,ハ,ワ
バガン  名詞,一般,*,*,*,*,*
。  記号,句点,*,*,*,*,。,。,。
魚  名詞,一般,*,*,*,*,魚,サカナ,サカナ
取り  名詞,接尾,一般,*,*,*,取り,トリ,トリ
用  名詞,接尾,一般,*,*,*,用,ヨウ,ヨー
の  助詞,連体化,*,*,*,*,の,ノ,ノ
櫓  名詞,一般,*,*,*,*,櫓,ロ,ロ
。  記号,句点,*,*,*,*,。,。,。
いくつ  名詞,代名詞,一般,*,*,*,いくつ,イクツ,イクツ
も  助詞,係助詞,*,*,*,*,も,モ,モ
ある  動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
。  記号,句点,*,*,*,*,。,。,。
EOS

“名詞”: noun, “助詞”: postpositional particle, “記号”: symbol, “動詞”: verb

Term extraction (2)
Extraction target: nouns only. The following types are not extracted:
– pronoun, number,

Bag-of-Words (built from the result of morphological analysis)
Bakauhumi: 1
マングローブ (mangrove): 1
前面 (front): 1
海 (sea): 1
バガン (bagan): 1
魚取り用 (for catching fish): 1
櫓 (tower): 1
ココヤシ (coconut palm): 1
下 (below): 1
家 (house): 1
チョウジ (clove): 1
斜面 (slope): 1

The number of distinct terms is 5,666.
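The noun filter and the bag-of-words count can be sketched as follows. This is a minimal illustration, not the system used in the talk: it parses MeCab-style output (surface form, a tab, then comma-separated features) that is embedded as a string, so MeCab itself is not needed to run it. Joining consecutive nouns into one term is an assumption inferred from 魚取り用 appearing as a single entry in the bag-of-words; the names `parse_mecab` and `bag_of_words` are hypothetical.

```python
from collections import Counter

# IPAdic noun subtypes that are skipped: pronoun (代名詞) and number (数).
EXCLUDED_SUBTYPES = {"代名詞", "数"}

def parse_mecab(output):
    """Turn MeCab-style output lines into (surface, pos, subtype) tuples."""
    tokens = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue
        surface, features = line.split("\t", 1)
        fields = features.split(",")
        tokens.append((surface, fields[0], fields[1]))
    return tokens

def bag_of_words(tokens):
    """Count noun terms; consecutive nouns are joined (an assumption)."""
    bag = Counter()
    compound = []
    # A sentinel token flushes a compound that ends the input.
    for surface, pos, subtype in tokens + [("", "EOS", "")]:
        if pos == "名詞" and subtype not in EXCLUDED_SUBTYPES:
            compound.append(surface)
        else:
            if compound:
                bag["".join(compound)] += 1
            compound = []
    return bag

SAMPLE = (
    "魚\t名詞,一般,*,*,*,*,魚,サカナ,サカナ\n"
    "取り\t名詞,接尾,一般,*,*,*,取り,トリ,トリ\n"
    "用\t名詞,接尾,一般,*,*,*,用,ヨウ,ヨー\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "櫓\t名詞,一般,*,*,*,*,櫓,ロ,ロ\n"
    "いくつ\t名詞,代名詞,一般,*,*,*,いくつ,イクツ,イクツ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "EOS\n"
)

bag = bag_of_words(parse_mecab(SAMPLE))
print(dict(bag))
```

The pronoun いくつ is tagged as a noun by IPAdic but is dropped by the subtype filter, which matches the exclusion rule stated on the slide.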

Term extraction (3)
Mark up the extracted terms.
– The terms may characterize the scene in the text.
– The extracted terms differ from scene to scene.
What features do the terms have?
– We need a method for detecting these features.
– But we do not have any thesauri or dictionaries.
Therefore, to detect the features, we introduce a topic model.
– Using a topic model, we can detect latent topics as the features.

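Field note excerpt (original Japanese, with English translation in parentheses)
720km: Jakarta 出発 (Depart Jakarta.)
830km: Bakauhumi (*1)
  ① マングローブ。前面の海にはバガン (魚取り用の櫓) いくつもある。 (Mangroves. In the sea in front there are many bagan, towers for catching fish.)
  ② ココヤシ多い。この下に少し家ある。 (Many coconut palms. A few houses beneath them.)
  ③ チョウジの多い斜面。 (A slope with many clove trees.)
853km: 稲。今若実り。 (Rice. The grain is still young.)
54km: このあたりよりチョウジ多くなる。その下を時に耕している。トウモロコシを植えるらしい。 (From around here cloves become numerous. The ground beneath them is sometimes tilled. Maize is apparently planted.)
70km: 水田をよく見る。東に海見える。 (Paddy fields are seen often. The sea is visible to the east.)
77-79km: ココヤシが多い。時に水田あり、それ実っている。 (Many coconut palms. Occasional paddy fields, which are ripening.)
85km: ココヤシ園広い。時にチョウジがある。 (Extensive coconut plantations. Occasionally cloves.)
90km: 西海岸に来る。マングローブあるが、その背後にはココヤシ多い。 (Arrive at the west coast. There are mangroves, but behind them are many coconut palms.)
97km: チョウジが多い。この辺りは殆どがジャワ人だという。 (Many cloves. Most people around here are said to be Javanese.)
01km: Sidomulyo。周り、シラス台地。 (Sidomulyo. The surroundings are a shirasu plateau.)
11km: 5~10 年生のココヤシ多い。他に、チョウジ、バナナ、ランブータン、ドリアン。 (Many coconut palms 5 to 10 years old. Also cloves, bananas, rambutan, and durian.)
18km: 左の海にはバガンが 100 基ほど見える。 (About 100 bagan are visible in the sea on the left.)
22km: 海岸は広くココヤシ。これ 60 年生。高みはチョウジ多い。 (The coast is widely covered with coconut palms, about 60 years old. The higher ground has many cloves.)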

Using topic model (1)
We use LDA (Latent Dirichlet Allocation) as the topic model.
– Topic model
  • Models the co-occurrence of terms.
  • The results give a classification of terms.
– Kinds of topic model
  • LSI (Latent Semantic Indexing): a model that introduces latent topics into the VSM (Vector Space Model).
  • PLSI (Probabilistic Latent Semantic Indexing): a redefinition of LSI as a probabilistic model.
  • LDA: an improvement of PLSI based on Bayesian learning.

Using topic model (2)
LDA: D. M. Blei et al., “Latent Dirichlet Allocation”, 2003.
– A document generation model in which the generating probability of latent topics follows a Dirichlet distribution.
– Latent topics can be determined once the parameters of LDA have been tuned.
– Parameters of LDA
  • z_k: latent topic
  • θ: generating probability of the latent topics
  • d: document; w: term; N_d: the total number of terms in d
  • Dir: Dirichlet distribution


– θ is generated from α.
– A topic is generated according to θ.
– A term is generated according to the topic z_k and β.
– A document is generated from its terms.
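The generative story above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the inference procedure used in the talk: it follows the standard LDA formulation (θ drawn from Dir(α), a topic drawn from θ for each term position, then a term drawn from that topic's row of β). The vocabulary, the values of α and β, and the function names are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def generate_document(alpha, beta, vocabulary, n_terms, rng):
    """Generate one document: theta from alpha, then (topic, term) pairs."""
    theta = sample_dirichlet(alpha, rng)
    topics = list(range(len(alpha)))
    document = []
    for _ in range(n_terms):
        # A topic is generated according to theta.
        z = rng.choices(topics, weights=theta)[0]
        # A term is generated according to the topic z and beta.
        w = rng.choices(vocabulary, weights=beta[z])[0]
        document.append((z, w))
    return theta, document

# Hypothetical toy setting: two topics over four terms.
vocabulary = ["マングローブ", "海", "ココヤシ", "チョウジ"]
alpha = [0.5, 0.5]
beta = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0: mostly coastal terms
    [0.05, 0.05, 0.45, 0.45],  # topic 1: mostly plantation terms
]

rng = random.Random(0)
theta, document = generate_document(alpha, beta, vocabulary, 5, rng)
print(theta)
print(document)
```

Fitting LDA runs this story in reverse: given only the observed terms, it estimates θ for each scene and β for each topic.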

Detection of latent topic
Features of LDA
– Text
  • A set of terms
  • Has multiple topics
– Term
  • Belongs to multiple topics
  • Not only to one specific topic
Spatial change (scene change)
– By visualizing the detection results, we can understand the change.
– Latent topics change as the location changes.
Which scenes are similar to one another?

Similarity between texts (1)
We introduce the VSM (Vector Space Model).
– The VSM requires feature vectors.
– Each element of a vector is the total number of terms per topic.
– Similarity between vectors is calculated by cosine similarity.
– x, y: texts (scenes)
– w_{x,k}: the weight of topic z_k in text x (tf.idf weighting)
– tf_{x,k}: the frequency of topic z_k in text x
– df_k: the number of texts that have topic z_k
– N: the number of texts
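The similarity computation can be sketched as below. The exact weighting formula on the slide did not survive, so this sketch assumes the common form: weight = (frequency of the topic in the text) × log(N / number of texts that have the topic), followed by cosine similarity between the weighted vectors. The per-topic counts and the function names are hypothetical.

```python
import math

def tfidf_vectors(topic_counts):
    """Weight per-topic term counts by tf * log(N / df) (assumed form)."""
    n_texts = len(topic_counts)
    n_topics = len(topic_counts[0])
    # df[k]: number of texts in which topic k occurs at least once.
    df = [sum(1 for counts in topic_counts if counts[k] > 0)
          for k in range(n_topics)]
    vectors = []
    for counts in topic_counts:
        vectors.append([
            counts[k] * math.log(n_texts / df[k]) if df[k] else 0.0
            for k in range(n_topics)
        ])
    return vectors

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical counts: rows are scenes, columns are topics.
topic_counts = [
    [3, 0, 1],
    [2, 0, 1],
    [0, 4, 1],
]
vectors = tfidf_vectors(topic_counts)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(sim_01, sim_02)
```

In this toy data the third topic occurs in every scene, so its weight is zero and it does not affect the similarity. Scenes 0 and 1 share their only distinctive topic, while scenes 0 and 2 share none.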

Similarity between texts (2)


Track of investigation (1)
Beginning of the text
– Date: Oct. 19, ‘84
– “Jakarta より Kotabumi へ行く。” (“Go from Jakarta to Kotabumi.”)
– The text describes a movement from “Jakarta” to “Kotabumi”.
Tracking the movement
– Extract place names.
– Rule:
  • from: ○○[から|より|出発|…]
  • to: ○○[へ|まで|に|泊|…]
– Unfortunately, we do not have any dictionaries or gazetteers.
– For the time being, we connect the extracted place names.
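The from/to rule can be sketched with regular expressions. This is an illustration only: it assumes that place names are written in Latin letters with an initial capital, as in the example sentence, and it uses only the markers listed on the slide (the elided “…” entries are not filled in). The names `FROM_RE`, `TO_RE`, and `extract_moves` are hypothetical.

```python
import re

# Markers listed on the slide; the slide's list continues with "…".
FROM_MARKERS = ("から", "より", "出発")
TO_MARKERS = ("へ", "まで", "に", "泊")

# Assumption: a place name is a capitalized run of Latin letters.
PLACE = r"([A-Z][A-Za-z]+)"

FROM_RE = re.compile(PLACE + r"\s*(?:" + "|".join(FROM_MARKERS) + ")")
TO_RE = re.compile(PLACE + r"\s*(?:" + "|".join(TO_MARKERS) + ")")

def extract_moves(sentence):
    """Return (origins, destinations) found by the from/to rule."""
    origins = FROM_RE.findall(sentence)
    destinations = TO_RE.findall(sentence)
    return origins, destinations

origins, destinations = extract_moves("Jakarta より Kotabumi へ行く。")
print(origins, destinations)
```

Without a gazetteer, any capitalized word followed by a marker is accepted, so the rule can produce false positives. That is consistent with the slide's remark that the names are only connected “for the time being”.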

Track of investigation (2)
Using D3.js (http://d3js.org/): Force-Directed Graph
[Figure labels: Oct. ‘84, Nov. ‘84, Dec. ‘84, Jan. ‘85; Jakarta, Solok, Tembilahan, Pekanbaru, Singapore]
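Connecting the extracted place names amounts to building a node-link structure, which is the kind of input a force-directed layout typically consumes. The sketch below is a hypothetical preprocessing step, not the system shown in the talk: it turns an ordered list of visited places into nodes and index-based links and serializes them as JSON. The route used here is illustrative, not the investigator's actual itinerary.

```python
import json

def build_graph(route):
    """Build nodes and index-based links from an ordered list of places."""
    nodes = []
    index = {}
    for place in route:
        if place not in index:
            index[place] = len(nodes)
            nodes.append({"name": place})
    # Link each place to the next one visited; skip repeated stays.
    links = [
        {"source": index[a], "target": index[b]}
        for a, b in zip(route, route[1:])
        if a != b
    ]
    return {"nodes": nodes, "links": links}

# Hypothetical route for illustration only.
route = ["Jakarta", "Kotabumi", "Jakarta"]
graph = build_graph(route)
graph_json = json.dumps(graph, ensure_ascii=False)
print(graph_json)
```

Because no coordinates are available for the places, a force-directed layout is a natural fit: node positions come from the link structure alone.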

Conclusion, Future works
We introduced text analysis for a field note in Area Studies.
– Using the topic model LDA
– Tracking of the investigator
Future work
– Improvement of text analysis for Area Studies.
  • What kind of system do researchers in Area Studies want?
  • We will consider the answer and develop a system accordingly.

Thank you for listening to my presentation.

– E-mail: t_yamada@hi.u-tokyo.ac.jp
