
Hsin-Hsi Chen Introduction-1

Natural Language Processing

Hsin-Hsi Chen ( 陳信希 )

Department of Computer Science and Information Engineering

National Taiwan University

[email protected]

What is Natural Language Processing (NLP)?

• Computational Linguistics (CL)
  – The study of computer systems for understanding and generating natural languages
  – To make the computer a fluent user of ordinary language in all kinds of conversation tasks
• Human Language Technology (HLT)
• Natural Language Processing FAQ
  – ftp://rtfm.mit.edu/pub/usenet-by-hierarchy/comp/ai/nat-lang/Natural_Language_Processing_FAQ [faq]
  – http://www1.cs.columbia.edu/~radev/nlpfaq.txt [faq]
• Wiki
  – http://en.wikipedia.org/wiki/Natural_language_processing [wiki]

Importance

• NLP is one of the ten emerging technology trends that will change the world
  – (MIT Technology Review, January/February 2001)
• NLP is one of the twelve most important information technologies for 2000 to 2010
  – (Gartner Group, November 2000)
• NLP is a key technology
  – One of the research directions of Microsoft Research Asia (MSRA) [msra]
  – Web search as a computational challenge (Peter Norvig, Director of Research, Google)
  – …

Universities

• Brown University
• Buffalo, SUNY at
• California at Berkeley, University of
• California at Los Angeles, University of
• Carnegie-Mellon University
• Columbia University
• Delaware, University of
• Duke University
• Georgetown University
• Georgia, University of
• Georgia Institute of Technology
• Harvard University

http://en.wikipedia.org/wiki/List_of_NLP_Courses [c]

Universities (Continued)

• Indiana University
• Johns Hopkins University
• Massachusetts at Amherst, University of
• Massachusetts Institute of Technology
• New Mexico State University
• New York University
• Pennsylvania, University of
• Rochester, University of
• Southern California, University of
• Stanford University
• SUNY, Buffalo
• Wisconsin - Milwaukee, University of
• Yale University

Universities

• Carnegie Mellon University
  – Language Technologies Institute, School of Computer Science
  – All aspects of language technology and information management
• Stanford University
  – Center for the Study of Language and Information (CSLI)
  – Integrated theories of language, information and computation

Universities

• University of Pennsylvania
  – LINC Laboratory, Department of Computer and Information Science
• Massachusetts Institute of Technology
  – Spoken Language Systems Group
• ...

Applications of NLP

• Machine Translation
• Natural Language Interface (to Databases)
• Text Processing (Understanding/Generation)
• Writing Aids (Spelling Checker, Grammar Checker, Style Checker)
• Speech Recognition/Synthesis
• OCR and OLCR

Applications of NLP (Continued)

• Intelligent Information Retrieval
• Digital Libraries
• NLP for the World Wide Web
• Text Data Mining
• Summarization
• Question Answering
• Language Modeling of Biological Data
• ...

Machine Translation

• Translation Model

[Figure: translation model in three stages, ANALYSIS → TRANSFER → SYNTHESIS, from Source Sentence to Target Sentence. Components shown: Tagger, Chunker, Subcat-Tractor, Simple Transfer, Lexical Selection, Generator. Knowledge sources shown: Probabilistic POS Grammar, Probabilistic Chunk Grammar, Predicate Argument Structure, Mapping Rules, Bilingual Dictionary, Markov Model.]

Google Translator: The Universal Language

• MT test by the National Institute of Standards and Technology (NIST)
  – Arabic-to-English
    • Google: 0.5137
    • USC ISI: 0.4657
    • IBM: 0.4646
  – Chinese-to-English
    • Google: 0.3537
    • USC ISI: 0.3073
    • IBM: 0.2571
• Google trained its system on United Nations documents, feeding it about 200 billion words in all.
• "Free software ignites machine translation 2.0" (Scientific American, Chinese edition, April 2006) http://140.127.234.182/sa/pdf.file/ch/c050/c050p066.pdf [mt]

Natural Language Interface (to Databases)

• Keyword extraction or pattern matching
• Parsing
  – LADDER system (semantic grammar, U.S. Navy DB)
  – INTELLECT system (grammar rules)
  – Rendezvous system (phrase grammar)
• Query Mapping
  – KDA system

Text Understanding

• A news report dated March 3, 1989
  – A cargo train running from Lima to Lorohia was derailed before dawn today after hitting a dynamite charge.
  – Inspector Eulogio Flores died in the explosion.
  – The police reported that the accident took place past midnight in the Carahuaichi-Jaurin area.

http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html [muc]

Text Understanding (Continued)

• Database entries (MUC-3, 1991)
  Incident: Date                 30 Mar 89
  Incident: Location             Peru: Carahuaichi-Jaurin (area)
  Incident: Type                 Bombing
  Physical Target: Description   “cargo train”
  Physical Target: Effect        Some Damage: “cargo train”
  Human Target: Name             “Eulogio Flores”
  Human Target: Description      “inspector”: “Eulogio Flores”
  Human Target: Effect           Death: “Eulogio Flores”
• 15 sites, Oct 1990 (1300 texts), Feb 1991 (100 new messages), May 1991 (another 100 messages)

Speech Recognition/Synthesis

[Figure: speech recognition pipeline. Blocks shown: Speech Signal, Representation, Modeling/Classification, Search, Recognized Words. Knowledge sources shown: Training Data, Acoustic Models, Lexical Models, Language Models.]

Optical Character Recognition

[Figure: OCR system with two modules. Image Processing Module (Preprocessing): Image Document, Image Segmentation, Feature Extraction, Feature Matching, Image Database, Feature Database. Language Processing Module (Postprocessing): Dictionary, Markov Character Bigram Model, Character Bigram Table, Character Unigram Table, Text Document. Also shown: Analysis of Error Count Distributions.]

Intelligent Information Retrieval

• Conceptual Text Retrieval

[Figure: Queries and Documents feed a Similarity Computation, which drives Retrieval of similar items.]
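The slide does not say which similarity measure is used. As a minimal sketch, assuming a bag-of-words representation and cosine similarity (both are assumptions, as are the names `term_vector`, `cosine`, and `retrieve`), the query-document matching step could look like this:

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lowercase whitespace tokens mapped to counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, documents):
    """Return (score, document) pairs, most similar document first."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

A conceptual retrieval system would replace the raw word counts with richer features (stems, thesaurus classes, phrases); the ranking step stays the same.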

Intelligent Information Retrieval (Continued)

• Morphological information
• Linguistically sensitive collection information
• Dictionary and thesaurus methods
• Syntax-based template and pattern matching
• Semantic nets
• Statistical and syntactic based phrasal indexing

Natural Language Processing for World Wide Web

• WWW
  – a powerful medium for human communication and dissemination of information
• Information on WWW
  – natural language texts
• Issues
  – apply NLP techniques for searching, retrieving, presenting, or generating texts

NLP for WWW

• Applications
  – automatic and interactive summarization
  – machine translation of WWW documents
  – information brokering
  – document filtering and personalized newspaper
  – automatic generation of WWW documents
  – …

Summarization

[Screenshot: Table of Today News Stories (October 15, 2000). Labels: date, news headlines, political news, local news, international news, society news, event clusters, browsing summary, focusing summary.]

Browsing (1)

[Screenshot: first story of an event cluster. Labels: title, data source, next, news stories in the same event clusters.]

Browsing (2)

[Screenshot: 2nd story in the event cluster. Labels: title, data source, previous, next, novelty fragments.]

Browsing (3)

[Screenshot: 3rd story in the same cluster. Labels: title, data source, previous, next, novelty fragment.]

Overview of a Multi-Document Summarization System

[Figure: Newspaper 1, Newspaper 2, Newspaper 3, ..., Newspaper n → A News Clusterer → Event 1, Event 2, Event 3, ..., Event m → A News Summarizer → Summary for event 1, Summary for event 2, Summary for event 3, ..., Summary for event m.]

Steps listed on the slide:
• Employing a segmentation system
• Extracting named entities
• Applying a tagger
• Clustering the news stream
• Partitioning a Chinese text
• Linking the meaningful units
• Displaying the summarization results


Question and Answering

Question Types

• Who
  – 誰發現 DNA 的架構? (Who discovered the structure of DNA?)
• Where
  – 亞洲首座恐龍博物館在哪裡? (Where is the first dinosaur museum in Asia?)
• When
  – 第一位試管嬰兒出生在哪一年? (When was the first test-tube baby born?)
  – 盤尼西林什麼時候發現? (When was penicillin discovered?)
• How many
  – 有多少美國人罹患氣喘? (How many Americans suffer from asthma?)
• What
  – 阿茲海默症的症狀為何? (What are the symptoms of Alzheimer's disease?)
• Why
  – 為什麼會有阿茲海默症? (Why does Alzheimer's disease occur?)
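The question types above can be assigned by simple surface rules. The following is only a sketch for English questions: the rule table, the `Unknown` fallback, and the name `classify_question` are hypothetical and are not taken from the system described in these slides.

```python
import re

# Hypothetical rule table: (question type, pattern anchored at the start
# of the question). "How many" is listed first so the two-word cue wins.
QUESTION_RULES = [
    ("How many", re.compile(r"^how\s+many\b", re.IGNORECASE)),
    ("Who", re.compile(r"^who\b", re.IGNORECASE)),
    ("Where", re.compile(r"^where\b", re.IGNORECASE)),
    ("When", re.compile(r"^when\b", re.IGNORECASE)),
    ("What", re.compile(r"^what\b", re.IGNORECASE)),
    ("Why", re.compile(r"^why\b", re.IGNORECASE)),
]


def classify_question(question):
    """Return the first question type whose pattern matches, else 'Unknown'."""
    text = question.strip()
    for question_type, pattern in QUESTION_RULES:
        if pattern.search(text):
            return question_type
    return "Unknown"
```

Chinese questions such as the ones on the slide would need cues that can occur anywhere in the sentence (for example 誰 or 哪裡), so a real rule set would not anchor on the first word.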

Overview of a Question Answering System

[Figure: question answering architecture. Blocks shown: Question Type Classification, Question Foci, IR System, Relevant Documents, NE Identifier, POS Tagger and Parser, Answer Candidates Anchoring, Ranking Scores Evaluation. Resources shown: Knowledge Base, Question Type Rules, Thesaurus.]

Extracting information from biological texts

• Automatically extracting protein-protein interactions from the biological literature (Bioinformatics 2001, vol. 17, no. 2, pp. 155-161)

selection of target text
↓
identification of protein names
↓
processing of compound or complex sentences
↓
recognition of protein-protein interaction
↓
extraction of protein interactions

Natural Language Processing for Biology

• Send BioNLP mailing list submissions to [email protected]
• To subscribe or unsubscribe via the World Wide Web, visit https://lists.ccs.neu.edu/bin/listinfo/bionlp
• or, via email, send a message with subject or body 'help' to [email protected]
• You can reach the person managing the list at [email protected]

Why is NLP Difficult?

• NLP is difficult because natural language is highly ambiguous.
• Example: “Our company is training workers” has 3 parses (i.e., syntactic analyses); see the following slides.
• “List the sales of the products produced in 1973 with the products produced in 1972” has 455 parses.
• Therefore, a practical NLP system must be good at disambiguating word sense, word category, syntactic structure, and semantic scope.

(1.11) Parse trees for “Our company is training workers”, in bracket notation:

a. [S [NP Our company] [VP [Aux is] [VP [V training] [NP workers]]]]

b. [S [NP Our company] [VP [V is] [NP [VP [V training] [NP workers]]]]]

c. [S [NP Our company] [VP [V is] [NP [AdjP training] [N workers]]]]

Critical Problems in NLP

• Ambiguity Resolution
  – Lexical
    • current: noun vs. adjective
    • bank (noun): money vs. river
    • order: hundreds of candidates per sentence
  – Syntactic
    • [saw [the boy] [in the park]]
    • [saw [the boy in the park]]
    • order: hundreds to thousands
  – Semantic
    • [the police] were ordered [to stop drinking] by midnight
    • agent vs. patient
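The two bracketings of "saw the boy in the park" can be counted mechanically. The following is a minimal sketch, assuming a toy grammar in Chomsky normal form and a CKY-style chart that stores the number of trees per span; the grammar, the lexicon, and the name `count_parses` are illustrative assumptions, not material from the slides.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an assumption for illustration).
BINARY_RULES = [
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "saw": ["V"], "the": ["Det"], "boy": ["N"], "park": ["N"],
    "telescope": ["N"], "in": ["P"], "with": ["P"],
}


def count_parses(words, root="VP"):
    """Count the distinct parse trees of `words` rooted in `root`."""
    n = len(words)
    chart = defaultdict(int)  # (start, end, label) -> number of trees
    for i, word in enumerate(words):
        for tag in LEXICON.get(word, []):
            chart[(i, i + 1, tag)] += 1
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for split in range(start + 1, end):
                for parent, left, right in BINARY_RULES:
                    chart[(start, end, parent)] += (
                        chart[(start, split, left)] * chart[(split, end, right)]
                    )
    return chart[(0, n, root)]
```

Under this toy grammar the phrase has 2 parses, and appending "with the telescope" raises the count to 5, which shows how quickly attachment ambiguity grows with each added prepositional phrase.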


Human Language Capability

(from Dr Eric Chang)

Pragmatics

• W84MEEE
• Read as English: “Wait for ME”
• Read as Chinese: 王八是我 (“I am the bastard”)

Critical Problems in NLP (Continued)

• Ill-Formedness
  – typographic errors
  – grammatical errors, e.g., subject-verb agreement
• Robustness Problem
  – change in domain
  – Internet language: draws on dialect and slang, foreign languages, abbreviations, homophones, and even combinations of symbols used for pictographic effect

Main Topics in Large-Scale NLPS Design

• Knowledge representation
  – How to organize and describe linguistic knowledge for the critical problems
• Knowledge strategies
  – How to use knowledge for efficient parsing, ambiguity resolution, and recovery from ill-formed input

Main Topics in Large-Scale NLPS Design (Continued)

• Knowledge acquisition
  – How to set up a knowledge base systematically and cost-effectively
  – How to maintain knowledge base consistency
• Knowledge integration
  – How to jointly consider various knowledge sources effectively

Today’s Approach to NLP

• From ~1970-1989, people were concerned with the science of the mind and built small (toy) systems that attempted to behave intelligently.
• Recently, there has been more interest in engineering practical solutions using automatic learning (knowledge induction).

Approaches

• Rule-based approach
• Corpus-based approach
• Hybrid approach

Rule-Based Approach

• Sample Grammar

S --> NP, VP
NP --> DET, N
NP --> PRON
VP --> IV
VP --> TV, NP
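To see the sample grammar at work, here is a small top-down recognizer. It is a sketch only: the five rules come from the slide, but the lexicon, the helper `derive`, and the function `accepts` are hypothetical additions for illustration.

```python
# The five rules from the slide, keyed by left-hand side.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"], ["PRON"]],
    "VP": [["IV"], ["TV", "NP"]],
}

# Hypothetical lexicon; the slide gives no words.
LEXICON = {
    "the": "DET", "a": "DET", "dog": "N", "cat": "N",
    "she": "PRON", "he": "PRON", "sleeps": "IV", "sees": "TV",
}


def derive(symbol, words, pos):
    """Yield every position where a `symbol` phrase starting at `pos` can end."""
    if symbol not in GRAMMAR:  # preterminal: match a single word
        if pos < len(words) and LEXICON.get(words[pos]) == symbol:
            yield pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        ends = [pos]
        for child in rhs:  # thread end positions through the right-hand side
            ends = [e2 for e in ends for e2 in derive(child, words, e)]
        yield from ends


def accepts(sentence):
    """True if the whole sentence can be derived from S."""
    words = sentence.lower().split()
    return len(words) in derive("S", words, 0)
```

For example, `accepts("she sees the cat")` succeeds through `VP --> TV, NP`, while `accepts("the dog sees")` fails because the transitive verb has no object.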

Rule-Based Approach (Continued)

• Advantages
  – No need to prepare a database
  – Easy to incorporate existing linguistic knowledge
  – Generalizes better to an unseen domain
  – Reasoning processes are explainable and traceable
  – Operation mechanism is easy to understand

Rule-Based Approach (Continued)

• Disadvantages
  – Hard to maintain consistency (between different people, on different occasions)
  – Hard to handle uncertain knowledge (not easy to objectively quantify the uncertainty factor)
  – Hard to deal with complex, irregular information
  – Knowledge acquisition is very time-consuming
  – Not easy to obtain high coverage (completeness) for a given domain
  – Not easy to avoid redundancy

Corpus-Based Approach

• What is a corpus?
  – Webster’s Dictionary: A collection of recorded utterances used as a basis for the descriptive analysis of a language
  – Oxford: body, collection, especially of writings on a specified subject or of materials for study
  – TEI (Text Encoding Initiative): A corpus is a body of texts put together in a principled way, typically in order to construct a sample of a given language or sublanguage.

Corpus-Based Approach (Continued)

• Advantages
  – Knowledge acquisition can be done automatically by the computer
  – Uncertain knowledge can be objectively quantified
  – Consistency and completeness are easy to obtain
  – Well suited to handling large amounts of fine-grained information (with many parameters)
  – Well-established statistical theories and techniques are available

Corpus-Based Approach (Continued)

• Disadvantages
  – Preparing the database is time-consuming and tedious
  – Generalization is poor when the database is small
  – Reasoning processes are implicit and inaccessible to humans
  – Parameters interact, so the effect of any single one is hard to isolate

Corpus-Based Approach (Continued)

• Types of Corpus
  – Alternative 1
    • pure text
    • text annotated with parts of speech, semantic tags, syntactic structures, etc.
    • bilingual corpora
      – parallel corpus (document-aligned, sentence-aligned, word-aligned)
      – comparable corpus
    • speech corpora
    • spoken corpora

LOB Corpus (untagged)

A01 1 **[001 TEXT A01**]
A01 2 *<*'*7STOP ELECTING LIFE PEERS**'*>
A01 3 *<*4By TREVOR WILLIAMS*>
A01 4 |^A *0MOVE to stop \0Mr. Gaitskell from nominating any more Labour
A01 5 life Peers is to be made at a meeting of Labour {0M P}s tomorrow.
A01 6 |^\0Mr. Michael Foot has put down a resolution on the subject and
A01 7 he is to be backed by \0Mr. Will Griffiths, {0M P} for Manchester
A01 8 Exchange.

LOB Corpus (parts of speech)

A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.
A01 3 ^ by_IN Trevor_NP Williams_NP ._.
A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
A01 4 nominating_VBG any_DTI more_AP labour_NN
A01 5 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN
A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
A01 6 ^ \0Mr_NPT Michael_NP Foot_NP has_HVZ put_VBN down_RP a_AT
A01 6 resolution_NN on_IN the_ATI subject_NN and_CC
A01 7 he_PP3A is_BEZ to_TO be_BE backed_VBN by_IN \0Mr_NPT Will_NP
A01 7 Griffiths_NP ,_, \0MP_NPT for_IN Manchester_NP
A01 8 Exchange_NP ._.
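The `word_TAG` layout of the tagged sample is easy to read programmatically. The following is a minimal sketch, assuming the line reference (such as `A01 4`) has already been stripped; the function name `parse_tagged_line` is hypothetical and the snippet is not part of any official LOB tooling.

```python
from collections import Counter


def parse_tagged_line(line):
    """Split a 'word_TAG word_TAG ...' line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last underscore so punctuation such as '._.' works.
        word, separator, tag = token.rpartition("_")
        if separator:  # skip items without a tag, e.g. the '^' sentence marker
            pairs.append((word, tag))
    return pairs


# One line of the sample above, without its 'A01 4' reference.
line = r"^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN"
pairs = parse_tagged_line(line)
tag_counts = Counter(tag for _, tag in pairs)
```

Counts such as `tag_counts` are the raw material for the tag statistics that a corpus-based tagger estimates.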

ASBC corpus (segmented and tagged)

1. 。(PERIODCATEGORY) 依據(P) 行政院(Nc) 主計處(Nc) 的(DE) 統計(Na) ,(COMMACATEGORY)
***********************************************
2. ,(COMMACATEGORY) 十月份(Nd) 一(Neu) 到(Caa) 二十日(Nd) ,(COMMACATEGORY)
***********************************************
3. ,(COMMACATEGORY) 我國(Nc) 出口(VC) 及(Caa) 進口(VC) 金額(Na) 比起(P) 去年(Nd) 同(Nes) 期(Nf) 均(D) 有(D) 增加(VHC) ,(COMMACATEGORY)
***********************************************
4. ,(COMMACATEGORY) 但(Cbb) 總計(Da) 一月(Nd) 到(Caa) 十月(Nd) 二十日(Nd) 的(DE) 出超(VH) 統計(Na) 則(D) 比(P) 去年(Nd) 同(Nes) 期(Nf) 減少(VHC) 了(Di) 百分之八點六(Neqa) ,(COMMACATEGORY)
***********************************************
5. ,(COMMACATEGORY) 僅有(VJ) 一百零四億七千二百萬(Neu) 美元(Nf) 。(PERIODCATEGORY)

Susanne Corpus (syntactic structures)

A01:0010a - YB <minbrk> - [Oh.Oh]
A01:0010b - AT The the [O[S[Nns:s.
A01:0010c - NP1s Fulton Fulton [Nns.
A01:0010d - NNL1cb County county .Nns]
A01:0010e - JJ Grand grand .
A01:0010f - NN1c Jury jury .Nns:s]
A01:0010g - VVDv said say [Vd.Vd]
A01:0010h - NPD1 Friday Friday [Nns:t.Nns:t]
A01:0010i - AT1 an an [Fn:o[Ns:s.
A01:0010j - NN1n investigation investigation .
A01:0020a - IO of of [Po.
A01:0020b - NP1t Atlanta Atlanta [Ns[G[Nns.Nns]
A01:0020c - GG +<apos>s - .G]
A01:0020d - JJ recent recent .
A01:0020e - JJ primary primary .
A01:0020f - NN1n election election .Ns]Po]Ns:s]
A01:0020g - VVDv produced produce [Vd.Vd]
A01:0020h - YIL <ldquo> - .
A01:0020i - ATn +no no [Ns:o.
A01:0020j - NN1u evidence evidence .
A01:0020k - YIR +<rdquo> - .
A01:0020m - CST that that [Fn.
A01:0030a - DDy any any [Np:s.
A01:0030b - NN2 irregularities irregularity .Np:s]
A01:0030c - VVDv took take [Vd.Vd]
A01:0030d - NNL1c place place [Ns:o.Ns:o]Fn]Ns:o]Fn:o]S]
A01:0030e - YF +. - .O]

A Sample from NTU Treebank

NTU01:0001:0000 ---- e e [S[S[NP[N''[N'.N']N'']NP]
NTU01:0001:0010 ---- 如何 qadv [VP[RP.RP]
NTU01:0001:0020 ---- 修憲 vi [V'.V']VP]S]
NTU01:0001:0030 ---- 還 vadv [VP[RP.RP]
NTU01:0001:0040 ---- 有待 vs [V'.
NTU01:0001:0050 ---- 上級 nc [S[NP[N''[N'.N']N'']NP]
NTU01:0001:0060 ---- 決定 vn [VP[V'.
NTU01:0001:0070 ---- e e [NP[N''[N'.N']N'']NP]V']VP]S]V']VP]S]

Bilingual Corpus (sentence-aligned)

%%This edition applies to Version 3, Release 1, of the Query Management Facility, Licensed Programs <;sa&mvspn.<;ee (MVS environment only) and <;sa&vmpn<;ee (VM/SP environment only).
%本版適用於「查詢管理機能」( Query Managemnet Facility )版本 3 ,版次 1 ,特許程式 &mvspn. (僅供 MVS環境使用)及 &mvspn. (僅供 VM/SP 環境使用)。
%%This edition also applies to any subsequent releases until otherwise indicated in new editions or technical newsletters.
%本版亦適用於任何後續版次,除非新版或技術通報中有提出任何指示。
“本書所發行的國家,可能並未提供本書中所述及之系統配備或系統特性。如您欲知貴國內所提供之配備和特性,請與當地IBM 授權之經銷商或 IBM 業務部門聯繫 " 。


Comparable Corpus

Taiwan's Parliament Approves Emergency Order

TAIPEI, Sept 28 (AFP) - Taiwan's parliament Tuesday approved a state of emergency order issued by President Lee Teng-hui last week, a parliament official said.

Lawmakers voted 201-2 to endorse the six-month state of emergency for the country declared by Lee Saturday, the official said.

Lee's decree allows the government to use troops to force evacuations, appropriate private buildings and vehicles, re-prioritise budgets and disregard all planning laws during reconstruction.

The central bank will fund low or no interest reconstruction loans and the government can raise up to 2.5 billion US dollars from new bonds.


立院通過追認緊急命令

【記者徐孝慈台北報導】立法院院會昨日以 兩百零一票同意、兩票不同意、一票廢票,通 過追認李總統日前為因應九二一大地震所發布 之緊急命令。不過,為免行政機關藉此擴權, 新黨主張立法院應設立「緊急命令執行監督委 員會」,民進黨亦主張設立「九二一救災防災 監督及調查委員會」,這兩件提案排入立法院 本週五的院會議程。

行政院長蕭萬長獲悉追認案完成法定程序, 對立法院表示感謝之意,並指將在緊急命令時 限內,確立所有安置與重建計畫的實施方式、 步驟及時程。憲法增修條文規定,經行政院決 議、呈請總統發布的緊急命令案,須於十日內 交由立法院追認通過,否則緊急命令即屬失效 。由於朝野黨團對通過該案具高度共識,立法 院昨日由全院委員會對此案進行審查後,隨即 改開院會,進行投票,採無記名方式。

多數國民黨籍立委在政策會執行長洪玉欽帶 頭及黨團要求下,採策略性亮票表達立場﹔少 數民進黨籍立委亦以亮票方式,表示對緊急命 令案的支持。開票過程中,部分立委拉起紅布 條,支持九二一災後孤兒認養運動。


Speech Corpus MAT

Spoken Corpus

SPK01:0001:0010 ---- L: -
SPK01:0001:0020 ---- 只 vadv
SPK01:0001:0030 ---- 喝酒 vi
SPK01:0001:0040 ---- 啊 -
SPK01:0001:0050 ---- =,- -

SPK01:0002:0010 ---- ... -
SPK01:0002:0020 ---- 就 vadv
SPK01:0002:0030 ---- 〔 -
SPK01:0002:0040 ---- 完 a , suf , vn
SPK01:0002:0050 ---- <@ -
SPK01:0002:0060 ---- 蛋 nc
SPK01:0002:0070 ---- ^ -
SPK01:0002:0080 ---- 啦 -
SPK01:0002:0090 ---- @>〕 .\ -

Transcript:

1 L:只喝酒啊 =,-
2 ...就 [完 <@ 蛋 ^啦@>].\
3 H:[對啊 ].\
4 都 -- {R1,1-1,1}
5 ..都沒錄到這樣子啊 .\
6 ...(.8)[那你%,-
7 ...給我們一人 ],-
8 L:[唉可是 ]--
9 H:... 一百塊好啦 .\
10 ...(1.6)談話費 .\
11 ...xxx.\

Corpus-Based Approach (Continued)

  – Alternative 2
    Balanced corpora, e.g., Brown Corpus (1M words), Birmingham Corpus (7.5M words), LOB Corpus (1M words), etc.
  – Alternative 3
    Corpora of special domains or styles, e.g., newspapers, the Bible, etc.

Balanced Corpus

Brown Corpus: text categories and number of samples

A  Press: reportage                                  44
B  Press: editorial                                  27
C  Press: reviews                                    17
D  Religion                                          17
E  Skills and hobbies                                36
F  Popular lore                                      48
G  Belles lettres, biography, memoirs, etc.          75
H  Miscellaneous (mainly government documents)       30
J  Learned (including science and technology)        80
K  General fiction                                   29
L  Mystery and detective fiction                     24
M  Science fiction                                    6
N  Adventure and western fiction                     29
P  Romance and love story                            29
R  Humour                                             9
   Total                                            500

Page 63: Hsin-Hsi ChenIntroduction-1 Natural Language Processing Hsin-Hsi Chen ( 陳信希 ) Department of Computer Science and Information Engineering National Taiwan

Hsin-Hsi Chen Introduction-63

Corpus-Based Approach (Continued)

• Information in Corpora
– Information within a pure text corpus: language usage in the real world, word distribution, co-occurrence, etc.
– Information within a tagged corpus: correlation among parts of speech, structures, and features
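As a rough illustration of the "correlation among parts of speech" that a tagged corpus exposes, the sketch below estimates how often one tag follows another by relative frequency. The toy corpus, the tag names (DET, ADJ, NOUN, VERB), and the function name `tag_transition_probs` are assumptions made for this example and do not come from the slides.

```python
from collections import Counter

# Hypothetical toy tagged corpus: each sentence is a list of (word, tag) pairs.
tagged = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("black", "ADJ"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
]

def tag_transition_probs(sentences):
    """Estimate P(next tag | current tag) by relative frequency."""
    pair_counts = Counter()   # counts of (current tag, next tag)
    first_counts = Counter()  # counts of current tag when a next tag exists
    for sent in sentences:
        tags = [tag for _, tag in sent]
        for prev, nxt in zip(tags, tags[1:]):
            pair_counts[(prev, nxt)] += 1
            first_counts[prev] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

probs = tag_transition_probs(tagged)
for (prev, nxt), p in sorted(probs.items()):
    print(f"P({nxt} | {prev}) = {p:.2f}")
```

With only three sentences the estimates are of course unreliable; the point is merely that such statistics cannot be read off an untagged text.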


Things that can be done with Text Corpora I: Word Counts

• Word counts help to find out:
– What the most common words in the text are.
– How many words are in the text (word tokens and word types).
– What the average frequency of each word in the text is.
• Limitation of word counts: most words appear very infrequently, and it is hard to predict much about the behavior of words that do not occur often in a corpus. ==> Zipf’s Law.
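The three quantities listed above can be computed in a few lines. This is a minimal sketch under a simplifying assumption: a token is any lowercased run of letters or apostrophes, which is far cruder than a real tokenizer. The function name `word_counts` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def word_counts(text):
    """Return (counter, n_tokens, n_types, average frequency per type)."""
    # Assumption: a token is a maximal run of letters/apostrophes, lowercased.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)   # running words (word tokens)
    n_types = len(counts)    # distinct words (word types)
    avg_freq = n_tokens / n_types if n_types else 0.0
    return counts, n_tokens, n_types, avg_freq

sample = "the cat sat on the mat and the dog sat on the log"
counts, n_tokens, n_types, avg_freq = word_counts(sample)
print("most common:", counts.most_common(1))
print("tokens:", n_tokens, "types:", n_types, "average frequency:", avg_freq)
```

Even in this tiny sample most types occur only once, which is the limitation the last bullet points at.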


Things that can be done with Text Corpora II: Zipf’s Law

• If we count up how often each word type of a language occurs in a large corpus and then list the words in order of their frequency of occurrence, we can explore the relationship between the frequency of a word, f, and its position in the list, known as its rank, r.

• Zipf’s Law says that $f \propto \frac{1}{r}$, i.e., there is a constant $k$ such that $f \cdot r = k$.

• Significance of Zipf’s Law: For most words, our data about their use will be exceedingly sparse. Only for a few words will we have a lot of examples.
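To see what f · r = k means in practice, one can rank word types by frequency and inspect the product of rank and frequency. The sketch below does this on hypothetical toy counts that were constructed to follow f = 12 / r exactly; on a real corpus the product is only roughly constant. The function name `rank_frequency` is an assumption for this example.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency; ties are broken alphabetically so the
    # ranking is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ordered, start=1)]

# Hypothetical toy data built so that frequency = 12 / rank.
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3
rows = rank_frequency(tokens)
for rank, word, freq, product in rows:
    print(rank, word, freq, product)
```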


Things that can be done with Text Corpora III: Collocations

• A collocation is any turn of phrase or accepted usage where somehow the whole is perceived as having an existence beyond the sum of its parts (e.g., disk drive, make up, bacon and eggs).

• Collocations are important for machine translation.

• Collocations can be extracted from a text (for example, by extracting the most common bigrams). However, since many of these bigrams are insignificant (e.g., “at the”, “of a”), they need to be filtered out.
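The extract-then-filter idea can be sketched as follows. The filter used here, dropping any bigram that contains a word from a small hand-picked function-word list, is one assumed choice among several (a part-of-speech filter is another); the list `FUNCTION_WORDS`, the function name `candidate_collocations`, and the sample text are all illustrative.

```python
from collections import Counter

# Assumption: a tiny hand-picked function-word list; a real system would use
# a larger list or a part-of-speech filter.
FUNCTION_WORDS = {"a", "an", "the", "of", "at", "in", "on", "to", "and"}

def candidate_collocations(tokens, min_count=2):
    """Return frequent bigrams that contain no function word."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    kept = {
        pair: n for pair, n in bigrams.items()
        if n >= min_count and not (set(pair) & FUNCTION_WORDS)
    }
    # Most frequent first; ties broken alphabetically.
    return sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))

text = ("the disk drive failed at the office so the new disk drive "
        "arrived at the office in a box")
result = candidate_collocations(text.split())
print(result)
```

In this sample "at the" and "the office" are just as frequent as "disk drive", but only the last survives the filter.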


Things that can be done with Text Corpora IV: Concordances

• Finding concordances corresponds to finding the different contexts in which a given word occurs.

• One can use a Key Word In Context (KWIC) concordancing program.

• Concordances are useful both for building dictionaries for learners of foreign languages and for guiding statistical parsers.
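A Key Word In Context display can be approximated with a short function. This is only a sketch: it assumes whitespace tokenization and a fixed window of `width` tokens on each side, and the name `kwic` and the sample sentence are invented for the example.

```python
def kwic(tokens, keyword, width=3):
    """Return one 'left [keyword] right' line per occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

tokens = ("she sat by the bank of the river while he went "
          "to the bank to deposit money").split()
lines = kwic(tokens, "bank")
for line in lines:
    print(line)
```

Placing the two contexts of "bank" side by side makes the two senses easy to tell apart, which is the kind of evidence a dictionary builder looks for.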


Corpus-Based Approach (Continued)

• Source of Corpora
– Association of Computational Linguistics’ Data Collection Initiative (ACL/DCI)
– European Corpus Initiative (ECI)
– International Computer Archive of Modern English (ICAME)
– Linguistic Data Consortium (LDC)
– Consortium for Lexical Research (CLR)
– Electronic Dictionary Research (EDR)
– Text Encoding Initiative (TEI)
– European Language Resources Distribution Agency (ELDA)
– Association for Computational Linguistics and Chinese Language Processing (ROCLING)


1998 RELEASES

Price  Set-of  Description                                    Catalog ID
MO     19      1996 Broadcast News Speech (CSR-V)             LDC97S44
MO     ftp     1996 Broadcast News Transcripts (CSR-V)        LDC97T22
MO     2       1996 Broadcast News Dev and Eval (CSR-V)       LDC97S66
MO     2       1996 CSR Hub-4 Language Model                  LDC98T31
MO     18      1997 Broadcast News Speech Corpus (CSR-VI)     LDC98S71
MO     ftp     1997 Broadcast News Transcripts (CSR-VI)       LDC98T28
MO     8       1997 Mandarin Broadcast News (HUB-4NE)         LDC98S73
MO     9       1997 Spanish Broadcast News (HUB-4NE)          LDC98S74
1500   ftp     COMLEX: English Syntax Lexicon 3.0             LDC98L21
MO     3       CSR-III Speech: Dev. & Eval. Data              LDC95S23*
MO     4       CSR-III Text: Language Model                   LDC95T6*
300    3       HTIMIT (Handset TIMIT)                         LDC98S67
1000   2       Hub-5 Mandarin Telephone Speech Corpus         LDC98S69
1000   ftp     Hub-5 Mandarin Telephone Transcripts           LDC98T26
1500   5       Hub-5 Spanish Telephone Speech Corpus          LDC98S70
1500   ftp     Hub-5 Spanish Telephone Transcripts            LDC98T27


Price  Set-of  Description                                    Catalog ID
TBA    3       JURIS: Legal Text (500M words)                 4th Quarter
500    1       KING Speaker Verification                      LDC95S22*
200            LLHDB (Lincoln Lab Handset DataBase)           LDC98S68
MO     2       North American News Text Supplement            LDC98T30
600    2       1998 Speaker Recognition Evaluation Set        LDC98S76
5000   26      Switchboard-2 Phase I                          LDC98S75
200    ftp     TDT Pilot Study Corpus                         LDC98T25
750    2       Taiwanese Putonghua Corpus                     LDC98S72
MO     1       Voicemail Corpus Part-I                        LDC98S77
500    1       YOHO Speaker Verification                      LDC94S16*


2001

• CALLHOME Spanish Dialogue Act Annotation
• TDT3 Multilanguage Text Version 2.0
• TDT2 Multilanguage Text Version 4.0
• Arabic Newswire Part 1
• Message Understanding Conference (MUC) 7
• Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio
• 2000 NIST Speaker Recognition Evaluation
• TDT3 Broadcast News Mandarin Corpus (Audio)
• TDT3 English Audio
• TDT2 Mandarin Audio Corpus
• 1997 HUB-4 Broadcast News Evaluation Non English Test Material


2000

• Speech in Noisy Environments (SPINE) Evaluation Transcripts
• Voice of America (VOA) Broadcast News Czech Transcript Corpus
• TREC Mandarin
• TREC Spanish
• Hong Kong Hansards Parallel Text
• Speech in Noisy Environments (SPINE) Training Transcripts
• Chinese Treebank Final Release
• Hong Kong Laws Parallel Text
• Hong Kong News Parallel Text
• Korean Newswire
• TDT2 Careful Transcription Text
• BLLIP 1987-89 WSJ Corpus Release 1
• Speech in Noisy Environments (SPINE) Evaluation Audio
• TDT2 Careful Transcription Audio
• Voice of America (VOA) Czech Broadcast News Audio
• 1999 HUB-4 Broadcast News Evaluation English Test Material
• Speech in Noisy Environments (SPINE) Training Audio
• 1998 HUB-4 Broadcast News Evaluation English Test Material
• Santa Barbara Corpus of Spoken American English Part-I


2004

• 2000 Communicator Dialogue Act Tagged
• NIST Meeting Pilot Corpus Speech
• TIDES Extraction (ACE) 2003 Multilingual Training Data
• Proposition Bank
• 2001 Communicator Dialogue Act Tagged
• 2002 NIST Speaker Recognition Evaluation
• Arabic Treebank: Part 2 v 2.0
• Arabic Treebank: Part 3 v 1.0
• Chinese Treebank Version 4.0
• Czech Broadcast News Speech
• Czech Broadcast News Transcripts
• FORM1 Kinematic Gesture
• Hong Kong Parallel Text
• ICSI Meeting Speech
• ICSI Meeting Transcripts
• ISL Meeting Speech Part 1
• ISL Meeting Transcripts Part 1
• Klex: Finite-State Lexical Transducer for Korean
• MDE RT-03 Training Data Speech
• MDE RT-03 Training Data Text and Annotations
• Morphologically Annotated Korean Text
• Multiple-Translation Chinese (MTC) Part 3


ROCLING

• Chinese dictionary
• Word-segmented and part-of-speech-tagged corpus
• Chinese parse tree corpus


Hybrid Approach

• When should we adopt a rule-based approach?
– It is not easy to establish a large database

– The rule base needed is not large (the phenomena can be governed by a small number of rules, or they are well behaved)

– Rules with good coverage already exist

– Extensional knowledge is important to the system


Hybrid Approach (Continued)

• When should we adopt a corpus-based approach?
– Establishing a large database is affordable

– The knowledge needed to solve the problem is huge and intricate, and not easy for humans to acquire

– Intensional knowledge is enough for the system

– A good model or formulation can be found


World-Wide Web resources

• The Association for Computational Linguistics site (the major international organization in the field)
http://www.aclweb.org

• The ACL NLP/CL Universe (the largest index of Computational Linguistics and Natural Language Processing resources on the Web)
http://tangra.si.umich.edu/clair/universe-rk/html/u/db/acl/


World-Wide Web resources (continued)

• The Survey of the State of the Art of Human Language Technology
http://www.cse.ogi.edu/CSLU/HLTsurvey/

• The Linguistic Data Consortium (creates, collects, and distributes speech and text databases, lexicons, and other resources for research and development purposes)
http://www.ldc.upenn.edu/

• ACL Anthology (a digital archive of research papers in computational linguistics; as of September 2004, the anthology contains 8350 papers)
http://www.aclweb.org/anthology


Major Publications

• COMPUTATIONAL LINGUISTICS
• COMPUTER SPEECH & LANGUAGE
• MACHINE TRANSLATION
• SPEECH TECHNOLOGY
• JOURNAL OF NATURAL LANGUAGE ENGINEERING
• JOURNAL OF LOGIC, LANGUAGE AND INFORMATION
• ACM Transactions on Asian Language Information Processing (TALIP)
• Artificial Intelligence


Professional Organizations, Associations

• ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL)
• ASSOCIATION FOR MACHINE TRANSLATION IN THE AMERICAS (AMTA)
• COGNITIVE SCIENCE SOCIETY
• AMERICAN ASSOCIATION FOR ARTIFICIAL INTELLIGENCE (AAAI)
• The Association for Computational Linguistics and Chinese Language Processing (ROCLING)


Major Conferences

• Annual Meeting of the Association for Computational Linguistics (ACL)

• International Conference on Computational Linguistics (COLING)

• EACL, NAACL, IJCNLP, CoNLL, IWPT, TMI, ICSLP, Eurospeech, AAAI, IJCAI, SIGIR, AIRS, ROCLING, ...


Evaluation Competitions

• Message Understanding Conference (MUC)
– named entity categorization
– word sense disambiguation
– mini-MUC (contents scanning, template filling)
– co-reference identification
– predicate-argument identification

• Document Understanding Conference (DUC)
– Automatic summarization evaluation

• Text Retrieval Conference (TREC)
– Information retrieval using NLP/statistical techniques

• SENSEVAL
– Evaluating word sense disambiguation systems


Course

• Rule-Based Approach
– G. Gazdar and C. Mellish, Natural Language Processing: An Introduction to Computational Linguistics, Addison-Wesley, 1989.

• Corpus-Based Approach
– Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.

– Eugene Charniak, Statistical Language Learning, MIT Press, 1993.


Material

• Rule-Based Approach
– Finite-state techniques

– Recursive and augmented transition networks

– Grammars

– Parsing, search and ambiguity

– Well-formed substring table and chart

– Features and the lexicon

– Semantics


Material

• Corpus-Based Approach
– Mathematical Foundations
– Corpus-based work
– Collocations
– Statistical Inference: n-gram models over sparse data
– Word Sense Disambiguation
– Lexical Acquisition
– Markov Models
– Part-of-Speech Tagging
– Probabilistic Context Free Grammars
– Probabilistic Parsing
– Applications


Grading

• Midterm Examination
• Term Examination
• Term Project