Context discovery with SHY (Song Huiyao – 宋會要)


Page 1: Context discovery with SHY                    (Song Huiyao –  宋會要 )


Context discovery with SHY

(Song Huiyao – 宋會要 )

Jieh Hsiang ( 項潔 )

National Taiwan University and Academia Sinica

2012/12/07 PNC 2012, Berkeley

Page 2: Context discovery with SHY                    (Song Huiyao –  宋會要 )

Joint work with

• Hsieh-Chang Tu ( 杜協昌 ), NTU

• Shih-Pei Chen ( 陳詩沛 ), Harvard

With special thanks to

• Cheng-yun Liu ( 劉錚雲 ) of IHP, Academia Sinica

• Peter Bol of Harvard University


Page 3: Context discovery with SHY                    (Song Huiyao –  宋會要 )

Songhuiyao《宋會要》

• Huiyao (會要):

– Decrees and laws, usually collected throughout a dynasty

• Songhuiyao (宋會要): the huiyao of the Song Dynasty (960–1279 AD), the most important government record of the Song Dynasty

• The current version is only a remnant, extracted by Xu Song (徐松, Qing dynasty) around 1800 from the Yongle Dadian (永樂大典)

Page 4: Context discovery with SHY                    (Song Huiyao –  宋會要 )

Songhuiyao《宋會要》

• 35,000,000 words in full text, 17 categories

• Full text done by the Institute of History and Philology (IHP) of the Academia Sinica and the China Biographical Database (CBDB) project of Harvard University

• Included in the Scripta Sinica of IHP

• Why another system?

– Songhuiyao is fragmented and very difficult to use

– Need a better way to re-contextualize the material

Page 5: Context discovery with SHY                    (Song Huiyao –  宋會要 )


Introducing THDL

• Originally designed as a system for a Chinese corpus of full text historical documents related to Taiwan (thus the name THDL: Taiwan History Digital Library)

• Tailored for scholarly use with many special features


Page 6: Context discovery with SHY                    (Song Huiyao –  宋會要 )

Key design philosophy of THDL

• Assumes that documents are related

• Treats a query result as a sub-collection of interrelated documents

• Provides ways to discover the collective meanings of a sub-collection

• Contexts, contexts, contexts (preserving old, creating new, observing different)

Page 7: Context discovery with SHY                    (Song Huiyao –  宋會要 )

Features in THDL

• Main goal: provide ways to show collective meanings (contexts) of documents

– Multi-level classification of query result

– Term co-occurrence analysis

– GIS/time distributions

• Term extraction tools

• Text mining tools

• Annotation/correction tools

Page 8: Context discovery with SHY                    (Song Huiyao –  宋會要 )

THDL as a shell

• THDL

– Taiwanese land deeds

– Ming Qing court documents

– Dan-Xin archives

• KMT (Nationalist Party) archives

• Taiwanese democratic magazines

• Songhuiyao (宋會要) (this talk)

• Qingshilu – Veritable Records of Qing (清實錄) (IP)

• Gujin tushu jicheng (古今圖書集成) and other leishu (類書) (IP), other smaller books

• Over 400,000,000 Chinese words, 1,000,000 metadata records, 2,000,000 images

Page 9: Context discovery with SHY                    (Song Huiyao –  宋會要 )

XMLize the data

CBDB processed the data into 80,396 entries in Excel format, each with 7 fields: category, emperor, dates (4 fields), and full text
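
A minimal sketch of how one spreadsheet row could be turned into an XML record. The slide only states that each entry has 7 fields (category, emperor, 4 date fields, full text); the element names, the split of the 4 date fields into era/year/month/day, and the sample row are all assumptions made for illustration, not the actual SHY schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names: the slide gives the number of fields, not their names.
FIELDS = ["category", "emperor", "era", "year", "month", "day", "text"]

def row_to_xml(doc_id, row):
    """Turn one spreadsheet row (a sequence of 7 strings) into a <document> element."""
    if len(row) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}")
    doc = ET.Element("document", id=str(doc_id))
    for name, value in zip(FIELDS, row):
        ET.SubElement(doc, name).text = value
    return doc

# Made-up sample row; the last field stands in for the full text of an entry.
row = ["瑞異", "太宗", "太平興國", "2", "7", "", "(full text of the entry)"]
element = row_to_xml(1, row)
print(ET.tostring(element, encoding="unicode"))
```

Keeping the date parts as separate elements leaves room for a later step to add a converted Western-calendar date without touching the original values.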


Page 10: Context discovery with SHY                    (Song Huiyao –  宋會要 )

XMLize the data

• Dates: use DDBC from Dharma Drum to convert the dates to the Western calendar (61,002 documents)

• Extract names for SHY

– 9,470 person names from CBDB (CBDB has 35,632 Song names)

– 3,366 official titles from CBDB

– 4,010 locations from CBDB

– Text-mined 11,901 additional potential names (estimated correctness: 33%)
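
One common way to mark up known names in unsegmented Chinese text is greedy longest-match lookup against a term list. The sketch below illustrates that general technique; it is not necessarily how SHY tags names, and the tiny lexicon and the sample sentence are invented for the example.

```python
def tag_terms(text, lexicon):
    """Return (start, term, kind) for each lexicon term found, preferring the longest match."""
    max_len = max(map(len, lexicon), default=0)
    hits, i = [], 0
    while i < len(text):
        # Try the longest candidate first so that longer names win over their prefixes.
        for n in range(min(max_len, len(text) - i), 0, -1):
            kind = lexicon.get(text[i:i + n])
            if kind:
                hits.append((i, text[i:i + n], kind))
                i += n
                break
        else:
            i += 1  # no term starts here; move one character forward
    return hits

# Hypothetical mini-lexicon standing in for the CBDB person, office, and location lists.
lexicon = {"史彌遠": "person", "溫州": "location", "饒州": "location", "知州": "office"}
hits = tag_terms("史彌遠言溫州、饒州事", lexicon)
for start, term, kind in hits:
    print(start, term, kind)
```

Text-mined candidate names, which the slide estimates to be correct only about a third of the time, would need human confirmation before being added to such a lexicon.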

Page 11: Context discovery with SHY                    (Song Huiyao –  宋會要 )

Features of SHY (1)

• Finding documents

– Full text search, plus logical operations

• Multiple contextual presentations of query results

• Term frequency and co-occurrence (contextual) analysis of people, locations, and offices

• Biography of people (from CBDB)
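
As a rough illustration of "full text search, plus logical operations", the sketch below combines plain substring matching with AND, OR, and NOT conditions. The function name, its parameters, and the toy documents are assumptions; a production system would use an index rather than scanning every text.

```python
def search(docs, all_of=(), any_of=(), none_of=()):
    """Return ids of documents matching a simple AND / OR / NOT substring query."""
    result = []
    for doc_id, text in docs.items():
        if not all(term in text for term in all_of):
            continue  # AND: every required term must occur
        if any_of and not any(term in text for term in any_of):
            continue  # OR: at least one optional term must occur
        if any(term in text for term in none_of):
            continue  # NOT: no excluded term may occur
        result.append(doc_id)
    return result

# Toy documents in English for readability.
docs = {1: "locust drought", 2: "locust tax", 3: "tax"}
print(search(docs, all_of=["locust"], none_of=["tax"]))
print(search(docs, any_of=["locust", "tax"]))
```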


Page 12: Context discovery with SHY                    (Song Huiyao –  宋會要 )

Features of SHY (2)

• Chronological distribution of query results

• Geographic distribution of query results

• Self-defined document sets (with all the features above)

• Chronological comparison of two query result sets

• User-feedback mechanism (especially useful for the Song research community)

• Appositional term analysis

Page 13: Context discovery with SHY                    (Song Huiyao –  宋會要 )

Full text search in SHY

Query term “locust”

Page 14: Context discovery with SHY                    (Song Huiyao –  宋會要 )

Multi-contextual classification

• Years

• Era (of emperors)

• Categories

• Subcategories

• Error detection
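
Classifying a query result by year, era, category, or subcategory amounts to counting documents per metadata value, and "facets within a facet" repeats the count inside each group. A minimal sketch, assuming each document is a dictionary of metadata; the field names, era labels, and sample records are invented.

```python
from collections import Counter

def facet_counts(docs, facet):
    """Count the documents of a query result per value of one metadata field."""
    return Counter(d.get(facet, "unknown") for d in docs)

def nested_facets(docs, outer, inner):
    """Facets within a facet: counts of the inner field inside each outer value."""
    nested = {}
    for d in docs:
        group = nested.setdefault(d.get(outer, "unknown"), Counter())
        group[d.get(inner, "unknown")] += 1
    return nested

# Invented sample result set; "EraA" and "EraB" are placeholders, not real era names.
results = [
    {"id": 1, "category": "Ruiyi", "era": "EraA"},
    {"id": 2, "category": "Ruiyi", "era": "EraB"},
    {"id": 3, "category": "Shihuo", "era": "EraA"},
]
by_category = facet_counts(results, "category")
by_era_in_category = nested_facets(results, "category", "era")
print(by_category)
print(by_era_in_category["Ruiyi"])
```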


Page 15: Context discovery with SHY                    (Song Huiyao –  宋會要 )


Error detection using facets

• Years that are not supposed to exist (e.g., 2nd month of first year of Xinguo)
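
Once dates are faceted by era and year, impossible combinations can also be flagged automatically by checking them against a table of era spans. The sketch below assumes such a table exists; the era name and its bounds are placeholders rather than verified historical data, and intercalary months are ignored.

```python
# Illustrative bounds only (NOT verified historical data): for each era, the first
# month that exists in year 1 and the last year and month of the era.
ERA_BOUNDS = {
    "EraA": {"first_month": 12, "last_year": 9, "last_month": 11},
}

def is_impossible(era, year, month):
    """Return True if (era, year, month) falls outside the era's known span."""
    bounds = ERA_BOUNDS.get(era)
    if bounds is None or year < 1 or not 1 <= month <= 12:
        return True
    if year == 1 and month < bounds["first_month"]:
        return True
    if year > bounds["last_year"]:
        return True
    return year == bounds["last_year"] and month > bounds["last_month"]

def suspicious(docs):
    """Ids of documents whose extracted date cannot exist."""
    return [d["id"] for d in docs if is_impossible(d["era"], d["year"], d["month"])]

sample = [
    {"id": 1, "era": "EraA", "year": 1, "month": 2},   # before the era began
    {"id": 2, "era": "EraA", "year": 1, "month": 12},
    {"id": 3, "era": "EraA", "year": 9, "month": 12},  # after the era ended
    {"id": 4, "era": "EraA", "year": 5, "month": 6},
]
print(suspicious(sample))
```

Flagged documents are candidates for review, since the error may lie in the source text, the transcription, or the date extraction.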


Page 16: Context discovery with SHY                    (Song Huiyao –  宋會要 )

Facets within a facet

• Distribution of the results of the query “locust” within the category Ruiyi (瑞異, strange phenomena)

Page 17: Context discovery with SHY                    (Song Huiyao –  宋會要 )


Biography of people from CBDB

• Click “biography” (生平) next to any name to get that person's information from CBDB


Page 18: Context discovery with SHY                    (Song Huiyao –  宋會要 )


Term frequency analysis

• Common names and locations in the query result

• df: document frequency; tf: term frequency

• Example with six documents:

A…B…A | A…C | A…A | B…A…B | D…B…C…C | D

df(A)=4, tf(A)=6; df(B)=3, tf(B)=4; df(C)=2, tf(C)=3; df(D)=2, tf(D)=2
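
The counts on this slide can be reproduced with a few lines of code. The sketch reads the example as six documents (A…B…A, A…C, A…A, B…A…B, D…B…C…C, D), which is an interpretation of the slide layout, chosen because it is consistent with the stated df and tf values.

```python
from collections import Counter

# The slide's example, read as six documents.
docs = ["A…B…A", "A…C", "A…A", "B…A…B", "D…B…C…C", "D"]

def tf_df(docs, terms):
    """tf: total occurrences of each term; df: number of documents containing it."""
    tf, df = Counter(), Counter()
    for text in docs:
        for term in terms:
            n = text.count(term)
            if n:
                tf[term] += n
                df[term] += 1
    return tf, df

tf, df = tf_df(docs, ["A", "B", "C", "D"])
for term in "ABCD":
    print(f"df({term})={df[term]}, tf({term})={tf[term]}")
```

Sorting terms by df rather than tf favors names that recur across many documents over names that are repeated often within a single one.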


Page 19: Context discovery with SHY                    (Song Huiyao –  宋會要 )


Term frequency analysis

df(t): given a query q, the number of documents in the query result in which term t appears.

tq(t): df(t) as a percentage of the total number of documents in the corpus in which t appears (the higher it is, the more relevant t is to q).

Query: 「史彌遠」 (Shi Miyuan)
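
A small sketch of the tq measure, under the reading that the denominator counts the documents containing t in the whole corpus while the numerator counts only those in the query result. The function name and the toy corpus are assumptions.

```python
def tq(term, query_result, corpus):
    """Percentage of the documents containing `term` that fall inside the query result."""
    in_result = sum(1 for text in query_result if term in text)
    in_corpus = sum(1 for text in corpus if term in text)
    return 100.0 * in_result / in_corpus if in_corpus else 0.0

# Toy corpus; the query result is the subset of documents that mention "B".
corpus = ["A B", "A B", "A", "C B", "A"]
query_result = [text for text in corpus if "B" in text]

print(tq("A", query_result, corpus))  # A is in 4 documents, 2 of them in the result
print(tq("C", query_result, corpus))  # C is in 1 document, and it is in the result
```

A term with low df but high tq appears rarely, yet almost only alongside the query, which is often the more interesting signal for a historian.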


Page 20: Context discovery with SHY                    (Song Huiyao –  宋會要 )


Chronological distribution of documents

• Chronological distribution of documents is often useful

• Among the 80,396 documents in Songhuiyao, 61,002 have dates that were extracted automatically
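
With 61,002 of 80,396 documents dated (roughly 76%), a chronological view has to count the dated documents per year and keep track of how many are left out. A minimal sketch, assuming each document carries a `year` field that is `None` when no date was extracted; the field name and the sample records are invented.

```python
from collections import Counter

def year_histogram(docs):
    """Return (documents per year, number of undated documents)."""
    years = [d["year"] for d in docs if d.get("year") is not None]
    return Counter(years), len(docs) - len(years)

# Invented sample; the years are arbitrary values inside the Song period.
docs = [
    {"id": 1, "year": 1040},
    {"id": 2, "year": 1040},
    {"id": 3, "year": None},
    {"id": 4, "year": 1075},
]
hist, undated = year_histogram(docs)
for year in sorted(hist):
    print(year, "#" * hist[year])
print("undated:", undated)
```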


Page 21: Context discovery with SHY                    (Song Huiyao –  宋會要 )

Comparing timelines of two queries

• q1 vs. q2

• Example: Wenzhou vs. Raozhou

Grey: with Raozhou

Red: with Wenzhou
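
Comparing two timelines can be sketched as aligning two per-year counts on a shared axis, so that years present in only one result set still show up with a zero for the other. The year lists below are made up and do not reflect the actual Wenzhou or Raozhou results.

```python
from collections import Counter

def compare_timelines(years_a, years_b):
    """Return (year, count_a, count_b) rows over the union of years, in order."""
    hist_a, hist_b = Counter(years_a), Counter(years_b)
    return [(y, hist_a[y], hist_b[y]) for y in sorted(set(hist_a) | set(hist_b))]

# Made-up years of the dated documents returned by each query.
wenzhou_years = [1130, 1130, 1165]
raozhou_years = [1130, 1200]
rows = compare_timelines(wenzhou_years, raozhou_years)
for year, wenzhou, raozhou in rows:
    print(year, wenzhou, raozhou)
```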


Page 22: Context discovery with SHY                    (Song Huiyao –  宋會要 )


Geographic distribution

• Locations (with df) plotted on map.

• Location names obtained from CBDB

Query “locust”

Page 23: Context discovery with SHY                    (Song Huiyao –  宋會要 )


Self-defined folders

• Users can define their own folders of documents for later use

• All the features described above apply to all self-defined folders (i.e., any sets of documents, not only query results)


Page 24: Context discovery with SHY                    (Song Huiyao –  宋會要 )

Self-defined folders

• Light green means that the document has been saved to a folder

Page 25: Context discovery with SHY                    (Song Huiyao –  宋會要 )


User feedback mechanism

• Simple way for users to report errors in metadata or full-text

• Also used effectively by SHY users to verify the correctness of new names found through term extraction

Page 26: Context discovery with SHY                    (Song Huiyao –  宋會要 )


User feedback mechanism

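• Currently, to the right of each “other” (其他) term in the term frequency analysis there is a “report error” (錯誤回報) link

• At the lower right of the full text there is a “correct full-text errors” (更正全文錯誤) link

Correction and reporting of person and place name terms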

Page 27: Context discovery with SHY                    (Song Huiyao –  宋會要 )


User feedback mechanism

Feedback on terms

Feedback on full text


Page 28: Context discovery with SHY                    (Song Huiyao –  宋會要 )


Ask the user community to check for correctness


Page 29: Context discovery with SHY                    (Song Huiyao –  宋會要 )


So far: 966 names confirmed from 2,390 candidates


Page 30: Context discovery with SHY                    (Song Huiyao –  宋會要 )


Appositional term analysis

• Given a set of documents, what terms (and their frequencies) appear before or after a certain word

– Example: what words appear before “tax” (which also gives an indication of what types of taxes there were)

• Simple interface: type the keyword and a number indicating how many words before or after the keyword to include
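
The idea can be sketched as counting the fixed-length string that comes immediately before or after each occurrence of the keyword. The sketch treats the "number of words" as a number of characters, which is an assumption that suits unsegmented Chinese text; the function name and the two sample strings are invented.

```python
from collections import Counter

def neighbors(docs, keyword, n, side="before"):
    """Count the n characters directly before (or after) each occurrence of keyword."""
    counts = Counter()
    for text in docs:
        start = text.find(keyword)
        while start != -1:
            if side == "before":
                chunk = text[max(0, start - n):start]
            else:
                end = start + len(keyword)
                chunk = text[end:end + n]
            if len(chunk) == n:  # skip occurrences too close to the text boundary
                counts[chunk] += 1
            start = text.find(keyword, start + 1)
    return counts

# Invented sample strings; 稅 means "tax".
docs = ["商稅與鹽稅", "增商稅"]
before = neighbors(docs, "稅", 1)
after = neighbors(docs, "稅", 1, side="after")
print(before.most_common())
print(after.most_common())
```

Here the characters before 稅 surface 商稅 (commercial tax) twice and 鹽稅 (salt tax) once, which is the kind of "what types of taxes were there" overview the slide describes.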


Page 31: Context discovery with SHY                    (Song Huiyao –  宋會要 )


Statistics of x tax


Page 32: Context discovery with SHY                    (Song Huiyao –  宋會要 )


Can directly read the text


Page 33: Context discovery with SHY                    (Song Huiyao –  宋會要 )

Discussion

• SHY is an example of a new methodology for search systems

• It can analyze the contexts of documents resulting from a query

• The first prototype of SHY was completed within a week (fine-tuning took longer)

– Critical input from CBDB, especially on terms, locations, calendar, biography

• THDL as a shell is very effective for quickly prototyping such systems

Page 34: Context discovery with SHY                    (Song Huiyao –  宋會要 )

Thank you
