Korea Terminology Research Center for Language and Knowledge Engineering Infrastructures in Korea and for the Korean Language Key-Sun Choi

Upload: ethan-warner

Post on 24-Dec-2015


Page 1

Korea Terminology Research Center for Language and Knowledge Engineering

Infrastructures in Korea and for the Korean Language

Key-Sun Choi

Page 2

Academic Society

• SIG-Korean Language Computing, under the Korea Information Science Society: 300 members
• Korea Information Society: linguistics oriented

Page 3

KIBS: Korea Information Base and Systems

Purpose:
• To improve Korean language processing technology
• To promote the Korean software industry
• In the planning phase (1993), targeted at Hangul word processors, machine translation and Korean linguistic research

1995 - 1997 (Phase 1): “word”
• Joint project of two ministries (Ministry of Science & Technology, Ministry of Culture) + industry

1998 - 2000 (Phase 2): “sentence”
• Ministry of Science & Technology only + industry
• To be evaluated in October 2000

2001 - 2003 (Phase 3): “discourse” (not decided)

http://kibs.kaist.ac.kr/

Page 4

King Sejong Project

Purpose:
• To promote Korean language research on the linguistics side
• To prepare for language planning, both for the unification of South and North Korea and for international use of Korean

Sponsor: Ministry of Culture
Period: 1998 - 2007 (10 years)
Items: corpus, dictionary, internationalization, terminology, education, font, old Korean

http://www.sejong.or.kr/

Page 5

KIBS: Architecture

[Figure: KIBS layered architecture. Labels recovered from the diagram:]
• Levels: Knowledge Source Level, Knowledge Level, Engine Module Level, Engine Level, Application Level
• Engine modules: MA1, MA2, TA1, TA2, PA1, PA2, WSD1, WSD2, DA1, DA2, RM1, RM2
• Engines: MT engine, IR engine, spell checker, style checker, UI engine
• Applications: word processor, MT system, information retrieval system, automatic speech translation
• Knowledge: ontology, common knowledge, domain knowledge, electronic dictionary, terminology DB
• Knowledge sources: basic DB, corpus, MRD, knowledge extractor
• Other components: Quality Management System, Terminology -- System, Distributed Resource Management System, Master DB, Tagging Support Tool
• Users: end user, user (programmer), user (lexicographer), user (dictionary)

Page 6

KIBS: Introduction

Title of Project
• KIBS I: Integrated Korean Information Base
• KIBS II: On Development of Deep-Level Processing and Quality Management Technology for Very Large Korean Information Base

Outline
• Term: 1994.12.4 ~ 2004.9.30 (10 years)
• Sponsor: Ministry of Science and Technology
• Staff: 50 person/year

Page 7

The Goal of First Step

• The Standardization & the Specification for Korean Information Base
• The Development of an Integrated Environment and Support Management System
• The Construction of Korean Information Base

•Standard Module Interface
•Corpus and Electronic Dictionary Development and Management System
•Korean Part-of-Speech Tagging System
•Korean Syntactic Tagging System
•Korean/English Alignment System

•Terminological Data Base Development and Management System
•Standard Korean Input/Output Environment
•Standardized Methodology for the Construction of a Balanced Corpus
•Part-Of-Speech Transfer Dictionary Rules and an Example Package

•Tree-Tagged Corpus
•Word-Level Narrative Speech Data Base
•Hand-written Hangul scripts of high frequency

Page 8

The Goal of Second Step

• Quality Management System for Language Information Processing
• Terminology Dictionary and Development/Management System
• Development/Management System of Electronic Dictionary for Sentence Analysis/Generation (100,000 entries)

•Terminology Entries
•Domain-specific Corpus for Terminology Building
•Sublanguage Analysis and Extraction of Terminology

•Development/Management System for Information Base
•Development of Integrated Management System for Distributed Resources

•Syntactic Information Base for Syntactic Analysis/Generation
•Semantic Information Base for Semantic Analysis/Generation
•Additional Information on Language and GUI for Developing Applications

Page 9

Development Tools

• Korean Concordance Program (KCP)
• Compound Noun Browser
• Corpus Browser
• Corpus Browser by Category
• Automatic English-to-Korean Transliteration System (TLEK)
• KAIST Ontology Browser
• Korean Morphological Analyser
• Korean Tagger
• Korean Syntactic Analyser
• Editing Support Tools to Electronic Dictionary
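To illustrate what a concordance tool such as the one listed above does, here is a minimal keyword-in-context (KWIC) sketch. It is not the KCP implementation: the function name `kwic`, the `width` parameter and the whitespace tokenization are assumptions made for illustration, and a concordancer for Korean would more likely operate on morphologically analysed text.

```python
def kwic(tokens, keyword, width=3):
    """Return (left, keyword, right) context triples for every hit.

    Hypothetical helper: `width` is the number of context tokens
    kept on each side of the keyword.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Toy English sentence; whitespace splitting is a simplification.
text = "the corpus browser shows the corpus by category"
hits = kwic(text.split(), "corpus", width=2)
for left, kw, right in hits:
    print(f"{left:>20} [{kw}] {right}")
```

Each hit keeps a fixed window of context on both sides, so the keyword lines up in one column when printed.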

Page 10

Results & Distribution

Major Results

The first (KIBS I): 1997.6 ~ present (80 sites)
• Text corpus: 10 million word phrases
• POS-tagged corpus: 1 million word phrases
• Syntactic-structure-tagged corpus: 10 thousand sentences
• TDMS, speech DB samples, hand-written character DB samples

The second (KIBS II): 1998.12 ~ present (140 sites)
• Raw corpus: 10 million word phrases
• POS-tagged corpus: 200 thousand word phrases

The third (KIBS III): 2000 (pending)
• Proper nouns: 10 thousand entries
• Compound nouns: 20 thousand entries
• Verb sentence pattern dictionary: 3 thousand entries
• ...

Plan to maintain and distribute ...

Page 11

KORTERM

Korea Terminology Center for Language and Knowledge Engineering

http://korterm.or.kr/
http://korterm.org/

Page 12

Goals of KORTERM

• Through worldwide terminology collection, and its standardization and harmonization in the local society, distribution, publication and application in language and knowledge engineering are promoted.

• Through education and consultation on terminology R&D methodology for each subject field, high-quality, highly reliable terminology and its infrastructure and systems are achieved.

Center of Terminology and Knowledge Engineering

Page 13

Phases and Subjects of KORTERM

Phase 1 (1998-2000): R&D Environment and Basic Data Collection
Integration of Working Terminology
•Terminology Collection (Basic S&T, Industry Standard, Economics)
•Electronic Terminology (Publication)
•R&D Environment (System Standardization)
•Terminology Theory and Education Infrastructure

Phase 2 (2001-2003): Value-Added Working System
Value-Added Terminology Integration
•Terminology Collection (Extended S&T)
•Extension & Maintenance (Industry Standards)
•High-Quality Terminology
•Application in Language Industry
•Verification for High-Reliability and Distribution

Phase 3 (2004-2007): Operation
Multi-lingual Terminology Integration
•Terminology Collection (Humanity and Social Science)
•Maintenance and Extension
•Large-Scale Knowledge Base for Terminology
•Terminology Education Curriculum Development
•Application Product Development

Phase 4 (2008 - ): Maintenance and Extension
Continuous Extension and Management
•Terminology Study Promotion
•Distribution of Terminology Information Base
•Continuous Terminology Extension and Management

Page 14

R & D (1)

Basic Data (Corpus)
• Corpus for Each Subject Domain

Electronic Dictionary for Basic Vocabulary
• Everyday Vocabulary consists of General Vocabulary and Everyday Terminology

Internationalization of Korean Language
• South-North Korean Terminology Standardization, Korean Language Input Methods

Korean Language Engineering
• Standardized Term Use for Information Retrieval, Machine Translation and Document Classification

Page 15

R & D (2)

Language Engineering
• Information Retrieval: Effective Internet Information Creation and Information/Knowledge Acquisition; Multi-lingualism
• Machine Translation: Efficient Information Generation through Terminology and Vocabulary Collection and Standardization
• Wordprocessor: High Productivity by Spelling Correction, Summarization and Efficient Use

Page 16

R & D (3)

Language, Information and Terminology
• Language Education: Technical Thinking and Technical Communication; Terminology-based Education
• Language Study: Domain-specific Language Study

Page 17

Terminology Sponsors

Support from government, organizations and industry, according to each specialty:
• Ministry of Culture and Tourism (KORTERM Center Operation)
• Ministry of Science and Technology (R&D Fund)
• Ministry of Information and Telecommunication (R&D Fund)
• Ministry of Diplomacy and Trade
• Ministry of Industry and Resource
• Ministry of Education
• Korea Science and Technology Foundation (Event Support)

Page 18

Task Configuration

[Figure: task configuration diagram. Labels recovered from the diagram:]
• Terminology Base (Collection): Non-standards, International Term Standard, Terminology Standard
• Standardization & Harmonization
• Terminological Conceptual Space
• Terminology Symbolization
• Terminology Access Standard Channel
• Grid Size Controller
• R&D Environment, Terminology Information Environment, Language Education Environment
• Application Use: Application-Specific Dictionary, Language & Knowledge Product, Language Education Adaptable to Student
• R&D, Industry, Living, Communication

Page 19

Large-Scale Speech/Language/Image DB Construction and Evaluation

Supported by Ministry of Science and Technology

Two Year Project (1999.10-2001.10)

Page 20

Goals

Final Goal: Speech/Language/Image Evaluation Standardization

Organization
•Working Group Organization
•Survey and Planning

Specification Standardization
Language
•IR Test Suite and Evaluation Model Recommend
•MT Test Suite and Evaluation Model Recommend
Speech
•Sentence-unit Speech DB
•Prosody for Speech Synthesis
Image
•Image Attribute Format
•Color-Lexical Entry
•MPEG7 Specification

Test Suite
Language
•IR/QA 90 query/200K doc, MT 5,000 sentences
Speech
•word-unit telephone speech DB: 100 token * 500
Image
•Image 300 kinds - Meta Data

Page 21

Question-Answering IR Test Suites

Test Suites for IR/QA

Documents
• 207,067 records (370 MB)
• Newspapers

Query Generation
• 90 queries (through analysis of 300 quiz queries)
• Queries for WH-questions and other various types of answers, for NLP problem solving
• Relevant document set, including the answer, built by using four kinds of commercialized IR systems with 16 kinds of methods
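One common way to assemble a candidate set of relevant documents from several retrieval systems is pooling: take the top-ranked documents of every run and merge them per query for later judgment. The slides do not say that this exact procedure was used, so the sketch below is only an assumption; `build_pool`, the `depth` parameter and the toy document ids are all hypothetical.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` document ids from every run, per query.

    Each run is a dict {query_id: ranked list of document ids}, standing
    in for the output of one system/method combination (assumed format).
    """
    pool = {}
    for run in runs:
        for query_id, ranking in run.items():
            pool.setdefault(query_id, set()).update(ranking[:depth])
    return pool


# Two toy runs; a real setting would have one run per system and method.
run_a = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d5"]}
run_b = {"q1": ["d1", "d9", "d3"], "q2": ["d8", "d2"]}
pool = build_pool([run_a, run_b], depth=2)
for query_id in sorted(pool):
    print(query_id, sorted(pool[query_id]))
```

The pooled documents would then be checked by hand to confirm which ones actually contain the answer.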

Page 22

English-Korean MT Test Suites

• Type classification: about 300 kinds
• Test sentences and test queries: 5,000 records
  - Extracted from textbooks and grammar books (1999-2000)
  - To be extracted from real usage such as the web and newspapers (2000-2001)
• Evaluation by yes/no questions
• Tested on 4 commercialized English-Korean MT systems
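Yes/no evaluation over typed test sentences lends itself to a per-type pass rate. The following is a minimal sketch under assumed data shapes: the function `pass_rates`, the `(type_label, passed)` tuple format and the sample type names are hypothetical and do not come from the test suite itself.

```python
from collections import defaultdict


def pass_rates(judgments):
    """Aggregate yes/no judgments into a pass rate per sentence type.

    `judgments` is an iterable of (type_label, passed) pairs, where
    `passed` is True when the yes/no question was answered "yes".
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for type_label, passed in judgments:
        totals[type_label] += 1
        if passed:
            passes[type_label] += 1
    return {t: passes[t] / totals[t] for t in totals}


# Invented sample judgments for one MT system.
sample = [("passive", True), ("passive", False), ("relative-clause", True)]
rates = pass_rates(sample)
for type_label in sorted(rates):
    print(type_label, rates[type_label])
```

Running the same aggregation for each system gives a type-by-type comparison rather than a single overall score.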

Page 23

MT Evaluation Workbench

Page 24

Image Meta Data Editor

Meta data Input Workbench by XML
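As a rough idea of what XML-based image metadata entry could produce, the sketch below builds one record with Python's standard `xml.etree.ElementTree`. The element names (`image`, `attribute`), the field names and the values are invented for illustration; the slides do not show the actual schema.

```python
import xml.etree.ElementTree as ET


def make_record(image_id, attributes):
    """Build one <image> element with a child per attribute (assumed schema)."""
    record = ET.Element("image", id=image_id)
    for name, value in attributes.items():
        ET.SubElement(record, "attribute", name=name).text = value
    return record


# Hypothetical metadata for a single image.
record = make_record("img-001", {"color": "red", "category": "flower"})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Records of this kind can be parsed back and filtered by attribute, which is the basis for retrieving images by metadata.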

Page 25

Image Retrieval by Meta data

Page 26

http://korterm.kaist.ac.kr/ksurimal/