Kyoungryol Kim Meeting Information Extraction from Meeting Announcement in Korean


Post on 04-Jan-2016


Page 1: Kyoungryol Kim

Kyoungryol Kim

Meeting Information Extraction from Meeting Announcement in Korean

Page 2: Kyoungryol Kim


Table of Contents

1. Introduction
   - Motivation
   - Goal
   - Problem Definition

2. The Proposed Method
   - Problem Modeling / Checklist
   - Overall Architecture
   - Normalization Process

Page 3: Kyoungryol Kim


Introduction

Page 4: Kyoungryol Kim


Motivation

Every day we receive many meeting announcements: conferences, seminars, workshops, meetings, appointments…

Meeting announcements account for about 17% (30,201 out of 183,022) of the emails in the Enron Email Dataset.*

Smartphone era: many people manage their schedules with an online calendar on a smartphone, e.g. Google Calendar.

But typing on a touch-screen keyboard is difficult and error-prone.

* Enron Email Dataset, August 21, 2009 version, http://www.cs.cmu.edu/~enron/

Page 5: Kyoungryol Kim


Goal

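Automatically extract schedule information from a meeting announcement and add it to the calendar.

Meeting Announcement -> Extract -> Update

Meeting announcement (input):

무더운 날씨가 본격적으로 시작되는 즈음하여 유니브캐스트의 상반기 평가와 하반기 운영을 위한 정기팀장회의를 개최합니다.
날짜 : 7월 19일 (토) 오후 2시
장소 : 민들레영토
민들레영토 오는길
지도와 같이 명동역 8번 출구로 나오셔서 쭉 상가 끼고 걸어가시면 저기 YMCA 빌딩 1층에 있습니다.

English translation:

As the hot weather begins in earnest, we are holding the regular team leaders' meeting to review the first half of the year at 유니브캐스트 and to plan operations for the second half.
Date: July 19 (Sat), 2 p.m.
Place: 민들레영토
Directions to 민들레영토
As shown on the map, leave Myeongdong Station by Exit 8 and walk straight along the shops; it is on the 1st floor of the YMCA building.

Extracted information (output):

startTime: 2011-07-19T14:00
isHeldAt:
  Administrative Address: 대한민국 서울특별시 중구 명동 1가 1-1 민들레영토 명동점
  Geocode: (37.5647312, 126.9861426)
  Semantic Type: Café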

Page 6: Kyoungryol Kim


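Problem Definition

To find the meeting location, the problem is divided into 2 parts:

1. Finding locations in the text for each predefined type of complexity.

2. Named entity disambiguation on the found locations.

Example announcement:

무더운 날씨가 본격적으로 시작되는 즈음하여 유니브캐스트의 상반기 평가와 하반기 운영을 위한 정기팀장회의를 개최합니다.

날짜 : 7월 19일 (토) 오후 2시
장소 : 민들레영토
기본 안건
- 제작지원비 지급 지연에 대한 설명
- 기금 조정 운영안
- 가을 워크샵 준비위 구성
- 기타 (기타 안건으로 상정할 것이 있으면 각 팀장들은 제안해 주시기 바랍니다)

민들레영토 오는길
지도와 같이 명동역 8번 출구로 나오셔서 쭉 상가 끼고 걸어가시면 저기 YMCA 빌딩 1층에 있습니다.

참고하세요

English translation:

As the hot weather begins in earnest, we are holding the regular team leaders' meeting to review the first half of the year at 유니브캐스트 and to plan operations for the second half.

Date: July 19 (Sat), 2 p.m.
Place: 민들레영토
Main agenda
- Explanation of the delay in paying the production support funds
- Fund adjustment and operation plan
- Forming the preparation committee for the autumn workshop
- Other (team leaders, please propose any further items for the agenda)

Directions to 민들레영토
As shown on the map, leave Myeongdong Station by Exit 8 and walk straight along the shops; it is on the 1st floor of the YMCA building.

Please take note.

Steps illustrated on this announcement:

1. Finding Target Locations

2. Disambiguation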

Page 7: Kyoungryol Kim


The Proposed Method

Page 8: Kyoungryol Kim


Problem Modeling

Meeting Announcement Text -> Location on the Map

Extract location strings
  1. How to extract location strings?

Extract address information and limit the boundary
  2. How to extract address information?

Search the location from the DB
  3. What kind of DB can we use?
  4. How to manipulate the query?

Search the location from external resources
  5. What kind of external resources can we use?

Disambiguation among found locations
  6. What are the measures to find the desired location?

Location on the Map
  7. How to represent the location?

Page 9: Kyoungryol Kim


Problem List (1/2)

1. How to extract location strings from the given text?

2. How to extract address information from location strings?

3. To search the location, what kind of database can we use?

4. To search the location, how to manipulate the query?

5. To search the location, what kind of external resources can we use?

6. What are the measures to find desired locations among candidates?

7. How to represent the location?

Page 10: Kyoungryol Kim


Problem List (2/2) - Reorganized

1. How to extract location strings from the given text?

2. How to extract address information from location strings?
   1) How to check whether address information is included?
   2) How to construct a database that can limit the boundary of an address?

3. To search the location, what resources can we use?
   1) Internal database: how to construct the internal database?
   2) External resources: what external resources are available?

4. To search the location, how to manipulate the query?

5. What are the measures to find desired locations among candidates?

6. How to represent the location?
   1) To store the location in the DB
   2) To represent the location on the map

Page 11: Kyoungryol Kim


Problem Checklist (6/6)

How to represent the location?

1) To store the location in the DB
   Use the OpenStreetMap representation: Node / Way / Relation

2) To represent the location on the map
   WGS84 (standard): ( latitude, longitude [, altitude] )

Page 12: Kyoungryol Kim

Representation of Meeting Location

Follows the basic representation of a Node in OpenStreetMap to represent a location.

Regard the meeting location as a Point of Interest.

Variable attributes (key-value pairs):
- Map features: http://wiki.openstreetmap.org/wiki/Map_Features
- used_as_meeting_location=true
- search_query=user's query (comma separated)

Meeting locations can be imported to an OSM server (interoperability).

<node id="850918486" lat="37.4936384" lon="127.0137745" user="cyana" uid="74529" visible="true" version="3" changeset="5478335" timestamp="2010-08-13T02:26:19Z">
  <tag k="name" v="교대 (Gyodae)"/>
  <tag k="name:en" v="Gyodae"/>
  <tag k="name:ko_rm" v="Gyodae"/>
  <tag k="railway" v="station"/>
</node>

<node id="368738707" lat="37.4990100" lon="127.0275800" user="cyana" uid="74529" visible="true" version="2" changeset="4370541" timestamp="2010-04-09T08:09:50Z">
  <tag k="amenity" v="dentist"/>
  <tag k="name" v="미소드림치과 (Misodeurim Dental Clinic)"/>
  <tag k="name:en" v="Misodeurim Dental Clinic"/>
  <tag k="name:ko" v="미소드림치과"/>
  <tag k="name:ko_rm" v="Misodeurimchigwa"/>
  <tag k="ncat" v="치과"/>
</node>
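A minimal sketch of how such a node could be read back into a meeting-location record, assuming Python's standard xml.etree.ElementTree. The node attributes are copied from the first example; the values of the used_as_meeting_location and search_query tags, and the function name parse_node, are hypothetical and only illustrate the key-value convention described on this slide.

```python
import xml.etree.ElementTree as ET

# Node attributes copied from the slide; the two meeting-location tags
# carry hypothetical values for illustration.
NODE_XML = """
<node id="850918486" lat="37.4936384" lon="127.0137745" version="3">
  <tag k="name" v="교대 (Gyodae)"/>
  <tag k="railway" v="station"/>
  <tag k="used_as_meeting_location" v="true"/>
  <tag k="search_query" v="교대역,교대"/>
</node>
"""


def parse_node(xml_text):
    """Turn one OSM <node> element into a plain dictionary."""
    node = ET.fromstring(xml_text.strip())
    # Every <tag k="..." v="..."/> child becomes one key-value pair.
    tags = {tag.get("k"): tag.get("v") for tag in node.findall("tag")}
    return {
        "id": int(node.get("id")),
        "lat": float(node.get("lat")),
        "lon": float(node.get("lon")),
        "tags": tags,
        # search_query is comma separated, as stated on the slide.
        "search_queries": [q for q in tags.get("search_query", "").split(",") if q],
    }


location = parse_node(NODE_XML)
```

Keeping the meeting-specific attributes as ordinary tags means the record stays a valid OSM node, which is what makes the import to an OSM server possible.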

Page 13: Kyoungryol Kim

node
  id            int
  lat           double
  lng           double
  user          varchar(100) : email
  version       int
  changeset     int
  timestamp     varchar(20)

changeset
  node_id       int
  id            int
  created_at    varchar(20)
  num_changes   int
  closed_at     varchar(20)
  open          boolean
  user          varchar(100) : email

changeset_tag
  node_id       int
  changeset_id  int
  id            int
  key           varchar(100)
  value         varchar(100)

node_tag
  node_id       int
  id            int
  key           varchar(100)
  value         varchar(100)

bounds
  id            int
  country_code  char(2) : ISO-3166
  admin_div1    varchar(100)
  admin_div2    varchar(100)
  admin_div3    varchar(100)
  admin_div4    varchar(100)
  southwest_lat double
  southwest_lng double
  northeast_lat double
  northeast_lng double
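As a rough illustration of the bounds table, the sketch below creates it in an in-memory SQLite database and looks up one bounding box. SQLite, the function names, the mapping of 서울특별시 / 중구 / 소공동 onto admin_div1 to admin_div3, and the lookup by admin_div3 alone are all assumptions; the coordinates are the ones shown in the 소공동 example later in the deck.

```python
import sqlite3


def create_bounds_table(conn):
    """Create the bounds table with the columns listed on the slide."""
    conn.execute(
        """
        CREATE TABLE bounds (
            id            INTEGER PRIMARY KEY,
            country_code  CHAR(2),       -- ISO 3166
            admin_div1    VARCHAR(100),
            admin_div2    VARCHAR(100),
            admin_div3    VARCHAR(100),
            admin_div4    VARCHAR(100),
            southwest_lat DOUBLE,
            southwest_lng DOUBLE,
            northeast_lat DOUBLE,
            northeast_lng DOUBLE
        )
        """
    )


def lookup_bounds(conn, admin_div3):
    """Return ((sw_lat, sw_lng), (ne_lat, ne_lng)) for a division, or None."""
    row = conn.execute(
        "SELECT southwest_lat, southwest_lng, northeast_lat, northeast_lng "
        "FROM bounds WHERE admin_div3 = ?",
        (admin_div3,),
    ).fetchone()
    if row is None:
        return None
    return (row[0], row[1]), (row[2], row[3])


conn = sqlite3.connect(":memory:")
create_bounds_table(conn)
# Hypothetical row: the division levels are assumed, the box is from the deck.
conn.execute(
    "INSERT INTO bounds VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (1, "KR", "서울특별시", "중구", "소공동", None,
     37.4346, 126.7968, 37.6956, 127.1823),
)
bounds = lookup_bounds(conn, "소공동")
```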

Page 14: Kyoungryol Kim

Example: bounds

Bounds information is constructed using the Google Maps API.

The closed world is the area of South Korea (it can possibly be expanded).

Page 15: Kyoungryol Kim


Overall Architecture

[Figure: overall architecture diagram. Recoverable labels:]

Testing System: Input Document, Finding Target Locations (Location NER, Relation-type Classification), Normalization, Disambiguation, OUTPUT

Training System: Training Corpus, Document Annotation, Adding Document to Corpus (Corpus Expansion), Train Models, Expand Gazetteer

Resources: Trained Models (CRFs, SVMs), Gazetteer, OpenAPI Map Services, Personal Information

Page 16: Kyoungryol Kim

Pre-Processing

Input query: 프란치스코교육회관 2층

{ "query" : { "full" : "프란치스코교육회관 2층" } }

Clean the query:
- Remove HTML tags, URLs, and ㈜
- Replace (), [], {} with a space

Split the query into 2 parts: Main part / Extra part
- Main: chunks that include the main location information.
- Extra: chunks that include floor/room information.

{ "query" : { "full" : "프란치스코교육회관 2층", "main" : "프란치스코교육회관", "extra" : "2층" } }
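The pre-processing step could be sketched as below. This is an illustration under assumptions, not the author's implementation: the regular expressions, the function names, and in particular the rule that a floor/room chunk is a number followed by 층 (floor) or 호 (room) are guesses based on the two example queries in this deck.

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+")
BRACKETS = re.compile(r"[()\[\]{}]")
# Assumption: an "extra" chunk is digits followed by 층 (floor) or 호 (room).
EXTRA_CHUNK = re.compile(r"^\d+(층|호)$")


def clean(text):
    """Remove HTML tags, URLs and ㈜; replace brackets with a space."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = text.replace("㈜", " ")
    text = BRACKETS.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def split_query(full):
    """Split a location query into its main part and its floor/room part."""
    cleaned = clean(full)
    # Join a number with a detached unit, e.g. "2 층" becomes "2층".
    cleaned = re.sub(r"(\d)\s+(층|호)", r"\1\2", cleaned)
    chunks = cleaned.split()
    main = [c for c in chunks if not EXTRA_CHUNK.match(c)]
    extra = [c for c in chunks if EXTRA_CHUNK.match(c)]
    return {"query": {"full": full,
                      "main": " ".join(main),
                      "extra": " ".join(extra)}}


result = split_query("프란치스코교육회관 2층")
```

A house number such as 159-1 does not match the extra pattern, so it stays in the main part, which the address extraction step relies on.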


Normalization Process

Page 17: Kyoungryol Kim

Extract Address Information

[Flowchart labels: Has address info? (Yes / No); Includes house no.? (Yes / No); Get bounds info from address (SW, NE), from the Bounds DB; Geocoding by query]

1. If the query has no address information:
Search the databases and APIs without any boundary limitation.

How address information is detected:

1) Split the main query into chunks on spaces.

2) Iterate over the chunks and check whether each chunk
   - ends with "-시", "-시/-구/-군", "-동/-가/-면/-읍", or "-리" (city, district/county, neighborhood/township/town, village), and
   - starts with a value from the 시/구/동/리 columns of the DB.
   Store the matched column and value.

3) If address information is included, check whether it also includes a house number (번지):
   [0-9]+, [0-9]+\-[0-9]+, [0-9]+번지, [0-9]+\-[0-9]+번지
   - If a house number is included, geocode immediately.
   - If there is no house number, fetch the bounds for that region from the DB.
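The detection steps above can be sketched as follows. The suffix groups and the house-number patterns come from the slide; the level names, the function names, and the small KNOWN_DIVISIONS set (a stand-in for the 시/구/동/리 columns of the bounds DB) are hypothetical.

```python
import re

# Suffix groups from the slide; the level names are illustrative labels.
SUFFIXES = {
    "si": ("시",),
    "gu": ("구", "군"),
    "dong": ("동", "가", "면", "읍"),
    "ri": ("리",),
}
# House-number patterns from the slide: 159, 159-1, 159번지, 159-1번지.
HOUSE_NO = re.compile(r"^[0-9]+(-[0-9]+)?(번지)?$")

# Hypothetical stand-in for the 시/구/동/리 column values of the bounds DB.
KNOWN_DIVISIONS = {"서울", "강남", "삼성", "소공"}


def extract_address_info(main_query, known=KNOWN_DIVISIONS):
    """Return (matched divisions, whether a house number follows them)."""
    divisions = []
    has_house_no = False
    for chunk in main_query.split():          # 1) chunk on spaces
        if divisions and HOUSE_NO.match(chunk):
            has_house_no = True               # 3) house number present
            continue
        for level, endings in SUFFIXES.items():
            # 2) the chunk must end with a suffix AND start with a DB value
            if chunk.endswith(endings) and any(chunk.startswith(k) for k in known):
                divisions.append((level, chunk))
                break
    return divisions, has_house_no


def next_step(main_query):
    """Pick the branch of the flowchart for a main query."""
    divisions, has_house_no = extract_address_info(main_query)
    if not divisions:
        return "search_without_bounds"
    return "geocode" if has_house_no else "search_within_bounds"
```

With the three example queries from this deck, next_step returns "search_without_bounds" for 프란치스코교육회관, "geocode" for 서울시 강남구 삼성동 159-1 무역회관, and "search_within_bounds" for 소공동 코리아나 호텔.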

Example (no address information):

{ "query" : { "full" : "프란치스코교육회관 2층", "main" : "프란치스코교육회관", "extra" : "2층" } }


Page 18: Kyoungryol Kim

Extract Address Information

2. If the query has address information (with a house number):
Geocode the address information and return. (Disambiguation finished.)

Example (address with house number):

{
  "query" : {
    "full" : "서울시 강남구 삼성동 159-1 무역회관 2001호",
    "main" : "서울시 강남구 삼성동 159-1 무역회관",
    "extra" : "2001호"
  },
  "found_locations" : [
    {
      "title" : "대한민국 서울특별시 강남구 삼성동 159-1",
      "administrative_address" : "대한민국 서울특별시 강남구 삼성동 159-1",
      "geometry_location" : { "lat" : 37.5103598, "lng" : 127.0611803 }
    }
  ]
}


Page 19: Kyoungryol Kim

Extract Address Information

3. If the query has address information (no house number):
Get the bounds information and search for the location within the bounds.

Example (address without house number):

{
  "query" : {
    "full" : "소공동 코리아나 호텔",
    "main" : "소공동 코리아나 호텔",
    "extra" : ""
  },
  "limited_bound" : {
    "name" : "대한민국 서울특별시 중구 소공동",
    "southwest" : { "lat" : 37.4346000, "lng" : 126.7968000 },
    "northeast" : { "lat" : 37.6956000, "lng" : 127.1823000 }
  }
}
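One simple way to "search within the bounds" is to keep only the candidates whose coordinates fall inside the southwest/northeast box. The sketch below assumes the limited_bound and geometry_location shapes from the examples in this deck; the function names are hypothetical, and a real system might instead pass the box to the search API.

```python
def within_bounds(point, bound):
    """True if a {"lat", "lng"} point lies inside the SW/NE bounding box."""
    sw, ne = bound["southwest"], bound["northeast"]
    return (sw["lat"] <= point["lat"] <= ne["lat"]
            and sw["lng"] <= point["lng"] <= ne["lng"])


def filter_by_bound(candidates, bound):
    """Keep only the candidate locations inside the limited bound."""
    return [c for c in candidates
            if within_bounds(c["geometry_location"], bound)]


# Bounding box copied from the 소공동 example.
limited_bound = {
    "name": "대한민국 서울특별시 중구 소공동",
    "southwest": {"lat": 37.4346, "lng": 126.7968},
    "northeast": {"lat": 37.6956, "lng": 127.1823},
}
# Hypothetical candidates: one inside the box, one far outside it.
candidates = [
    {"title": "inside", "geometry_location": {"lat": 37.5647312, "lng": 126.9861426}},
    {"title": "outside", "geometry_location": {"lat": 36.3414225, "lng": 127.3914705}},
]
kept = filter_by_bound(candidates, limited_bound)
```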


Page 20: Kyoungryol Kim

Find Candidate Locations

Input:

{ "query" : { "full" : "소공동 코리아나 호텔", "main" : "소공동 코리아나 호텔", "extra" : "" }, "limited_bound" : { "name" : "대한민국 서울특별시 중구 소공동", "southwest" : { "lat" : 37.4346000, "lng" : 126.7968000 }, "northeast" : { "lat" : 37.6956000, "lng" : 127.1823000 } } }

Sources searched, by priority:
- User Meeting Location DB (Priority 1)
- SWRC Meeting Location DB (Priority 2)
- Open API: OpenStreetMap, Naver (Priority 3)

Remove Duplicated Addresses

Output:

{
  "query" : {
    "full" : "소공동 코리아나 호텔",
    "main" : "소공동 코리아나 호텔",
    "extra" : ""
  },
  "limited_bound" : {
    "name" : "대한민국 서울특별시 중구 소공동",
    "southwest" : { "lat" : 37.4346000, "lng" : 126.7968000 },
    "northeast" : { "lat" : 37.6956000, "lng" : 127.1823000 }
  },
  "found_locations" : [
    {
      "query" : "밀레니엄 힐튼 서울",
      "title" : "밀레니엄 힐튼 서울",
      "administrative_address" : "대한민국 서울특별시 중구 태평로 1가 61-1",
      "geometry_location" : { "lat" : 37.5103598, "lng" : 127.0611803 }
    },
    { ..... }
  ]
}

[Diagram labels: Geocoding; Coordinate Conversion (KTM -> WGS84); Local Search; SWRC DB; User DB; Open API; WMS]
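A sketch of the candidate search, assuming each source is a callable that takes the query and returns a list of location dictionaries. The function name, the stub sources, and the choice of administrative_address as the duplicate key are assumptions; the priority order is the one on the slide.

```python
def find_candidates(query, sources):
    """Query each source in priority order and drop duplicated addresses.

    `sources` is a list of callables ordered from highest to lowest
    priority; each returns a list of location dictionaries.
    """
    seen_addresses = set()
    found = []
    for priority, search in enumerate(sources, start=1):
        for location in search(query):
            address = location["administrative_address"]
            if address in seen_addresses:
                continue  # a higher-priority source already returned it
            seen_addresses.add(address)
            found.append({**location, "priority": priority})
    return found


# Hypothetical stub sources standing in for the user DB, the SWRC DB
# and the open APIs.
def user_db(query):
    return [{"title": "A", "administrative_address": "addr-1"}]


def swrc_db(query):
    return [{"title": "A (SWRC)", "administrative_address": "addr-1"},
            {"title": "B", "administrative_address": "addr-2"}]


def open_api(query):
    return [{"title": "C", "administrative_address": "addr-3"}]


found_locations = find_candidates("소공동 코리아나 호텔", [user_db, swrc_db, open_api])
```

Because the sources are visited in priority order, a duplicated address always keeps the entry from the highest-priority source.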


Page 21: Kyoungryol Kim


Disambiguation


Original query: 밀레니엄 힐튼 서울

Candidates:

Title | Query | Address
동강밀레니엄래프팅 | 밀레니엄 | 대한민국 강원도 영월군 영월읍 거운리 547-1
밀레니엄피시방 서현점 | 밀레니엄 | 대한민국 경기도 성남시 분당구 서현동 307
밀레니엄모텔 | 밀레니엄 | 대한민국 광주광역시 북구 오룡동 1114-1
서울힐튼호텔 | 밀레니엄 힐튼 서울 | 대한민국 서울특별시 중구 남대문로 5가 395

Disambiguation measures:
- Number of matched characters: query-title, query-original query, query-address
- (Can be used) Semantic type / Personal annotation DB / Distance between locations / Landmark
- Personal address book / Search history / GPS log

Result:

서울힐튼호텔 : 대한민국 서울특별시 중구 남대문로 5가 395 (36.3414225, 127.3914705) (Hotel)
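The matched-character measure could be sketched as below. The slide does not define how characters are matched or how the three comparisons are combined, so this sketch makes two assumptions: a match is a shared character counted with multiplicity (spaces ignored), and the score is the sum of the matches between the original query and each of the title, query, and address fields. The candidate rows are the ones listed on this slide.

```python
from collections import Counter


def matched_chars(a, b):
    """Number of characters shared by a and b, counted with multiplicity."""
    counts_a = Counter(a.replace(" ", ""))
    counts_b = Counter(b.replace(" ", ""))
    return sum((counts_a & counts_b).values())


def score(original_query, candidate):
    """Assumed combination: sum of matches over title, query and address."""
    return sum(matched_chars(original_query, candidate[field])
               for field in ("title", "query", "address"))


def disambiguate(original_query, candidates):
    """Return the candidate with the highest matched-character score."""
    return max(candidates, key=lambda c: score(original_query, c))


CANDIDATES = [
    {"title": "동강밀레니엄래프팅", "query": "밀레니엄",
     "address": "대한민국 강원도 영월군 영월읍 거운리 547-1"},
    {"title": "밀레니엄피시방 서현점", "query": "밀레니엄",
     "address": "대한민국 경기도 성남시 분당구 서현동 307"},
    {"title": "밀레니엄모텔", "query": "밀레니엄",
     "address": "대한민국 광주광역시 북구 오룡동 1114-1"},
    {"title": "서울힐튼호텔", "query": "밀레니엄 힐튼 서울",
     "address": "대한민국 서울특별시 중구 남대문로 5가 395"},
]

best = disambiguate("밀레니엄 힐튼 서울", CANDIDATES)
```

Under these assumptions the hotel row scores highest, which agrees with the result shown on the slide; the other measures listed (semantic type, personal data, distance) are not modeled here.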