foundations of statistical nlp chapter 4. corpus-based work 박 태 원박 태 원

Foundations of Statistical NLP

Chapter 4. Corpus-Based Work

박 태 원


Abstract

Getting Set Up
– Computers, Corpora, Software
Looking at Text
– Low-level formatting issues
– Tokenization : What is a word?
– Morphology
– Sentences
Mark-up Data
– Markup schemes
– Grammatical tagging


Getting Set Up (1/2)

Text corpora are usually big.
– a major limitation on the use of corpora
– overcome by advances in computers
Corpora
– use text corpora distributed by the major organizations
– corpus : a special collection of textual material
– the general issue is whether the corpus is a representative sample of the population of interest


Getting Set Up (2/2)

Software
– Text editors : show the text fairly literally
– Regular expressions : find certain patterns
– Programming languages : C, C++, Perl
– Programming techniques
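
As an illustration of the "find certain patterns" point, here is a minimal sketch using Python's standard re module. The sample sentence and the date pattern are assumptions made for this example, not material from the chapter.

```python
import re

# Hypothetical sample text for the illustration.
text = "The committee met on 12 March 1998 and again on 3 April 1998."

# Assumed pattern: a 1-2 digit day, a capitalized month name, a 4-digit year.
date_pattern = re.compile(r"\b(\d{1,2}) ([A-Z][a-z]+) (\d{4})\b")

# findall returns one (day, month, year) tuple per match.
dates = date_pattern.findall(text)
print(dates)
```

The same idea carries over to the regular expression facilities of Perl or a text editor; only the syntax details differ.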


Looking at Text

Text comes in raw format or marked up.
Markup
– a term used for putting codes of some sort into a computer file
– commercial word processing : WYSIWYG
Features of text in human languages
– difficult to process automatically


Low-level formatting issues

Junk formatting/content
– junk : document headers, separators, tables, diagrams, etc.
– OCR output : if the program deals only with English text, remove the junk (other text)
Uppercase and lowercase
– The original Brown corpus : * was used to mark a capital letter
– Should we treat brown in Richard Brown and brown paint as the same?
– proper name detection : a difficult problem
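
One simple heuristic for the uppercase/lowercase question can be sketched as follows: lowercase only the first token of each sentence and leave all other capitals alone. This is an illustrative assumption, not a method the slides prescribe, and the function name fold_sentence_initial and the sample sentences are hypothetical.

```python
def fold_sentence_initial(sentences):
    """Lowercase only the first token of each sentence.

    `sentences` is a list of token lists. Capitals elsewhere are kept,
    so a mid-sentence "Brown" stays distinct from "brown".
    """
    folded = []
    for tokens in sentences:
        if tokens:
            tokens = [tokens[0].lower()] + tokens[1:]
        folded.append(tokens)
    return folded


sentences = [
    ["Brown", "paint", "is", "cheap"],
    ["We", "met", "Richard", "Brown"],
]
print(fold_sentence_initial(sentences))
```

The sketch also shows why proper name detection is hard: a sentence that starts with a name such as Brown would be lowercased incorrectly.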


Tokenization : What is a word?(1)

Tokenization
– To divide the input text into units called tokens
– what is a word?
• graphic word (Kucera and Francis, 1967) : "a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks"
-> a workable definition, but items such as $22.50, Micro$oft, C|net do not strictly fit it
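
The graphic word definition can be read quite literally as a test on whitespace-separated strings. The sketch below is one such strict reading, restricted to ASCII letters, digits, hyphens and the ASCII apostrophe; the names GRAPHIC_WORD, is_graphic_word and graphic_words are hypothetical.

```python
import re

# Only letters, digits, hyphens and apostrophes, with at least one
# alphanumeric character somewhere in the token.
GRAPHIC_WORD = re.compile(r"(?=.*[A-Za-z0-9])[A-Za-z0-9'-]+")


def is_graphic_word(token):
    """True if the whole token fits the strict graphic-word reading."""
    return GRAPHIC_WORD.fullmatch(token) is not None


def graphic_words(text):
    """Split on whitespace and keep only tokens that pass the test."""
    return [t for t in text.split() if is_graphic_word(t)]


print(graphic_words("I paid $22.50 for a 26-year-old book"))
for token in ["co-operate", "isn't", "$22.50", "Micro$oft", "C|net"]:
    print(token, is_graphic_word(token))
```

Under this reading a token with trailing punctuation, such as "book.", is also rejected, which is one reason the definition needs extra handling in practice.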



Period
– distinguishing end-of-sentence punctuation marks from abbreviation marks, as in etc. or Wash.
Single apostrophes
– English contractions : I’ll or isn’t
– dog’s : dog is, dog has, or the genitive case
Hyphenation
– a line-breaking hyphen is present in the typographical source
– e-mail, 26-year-old, co-operate
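
The line-breaking hyphen problem can be sketched as follows: when a word is split across a line break, join the halves only if the joined form is a known word, and otherwise keep the hyphen. The vocabulary lookup is an assumption of this sketch (a real system would need a much larger lexicon), and rejoin_line_breaks is a hypothetical name.

```python
import re


def rejoin_line_breaks(text, vocabulary):
    """Undo hyphenation at line ends, guided by a set of known words."""

    def join(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in vocabulary:
            return joined              # the hyphen was only a line-break hyphen
        return left + "-" + right      # assume the hyphen belongs to the word

    # A word fragment, a hyphen, a newline, then the rest of the word.
    return re.sub(r"(\w+)-\n(\w+)", join, text)


sample = "The tokeni-\nzation of e-\nmail is hard."
print(rejoin_line_breaks(sample, {"tokenization"}))
```

The sketch cannot tell apart cases where both forms exist, for example co-operate and cooperate, so the ambiguity described above remains.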



The same form representing multiple “words”
– homographs : ‘saw’ has two lexemes (chap 7)
Word segmentation in other languages
– Many languages do not put spaces between words
Whitespace not indicating a word break
– the New York-New Haven railroad
Variant coding of information of a certain semantic type


Morphology

Stemming
– a process that strips off affixes and leaves you with a stem
Lemmatization
– attempting to find the lemma or lexeme of which one is looking at an inflected form
The IR community has shown that stemming does not help performance
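
The difference between stemming and lemmatization can be illustrated with a deliberately crude sketch. The suffix list and the small lemma table are assumptions made up for this example; this is not the Porter stemmer or any other published algorithm.

```python
# Hypothetical suffix list, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical lookup table for irregular forms.
LEMMAS = {"went": "go", "mice": "mouse", "better": "good"}


def crude_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def crude_lemma(word):
    """Look the word up in the table, falling back to the crude stem."""
    return LEMMAS.get(word, crude_stem(word))


for w in ["walking", "walked", "boxes", "dogs", "running", "went"]:
    print(w, crude_stem(w), crude_lemma(w))
```

Stripping gives "runn" for "running" and leaves "went" unchanged, while the table lookup maps "went" to "go". The stem is only a truncated string, whereas the lemma is an actual lexeme.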


Sentences

What is a sentence?
– something ending with a ‘.’, ‘?’ or ‘!’
– the part before a colon, semicolon, or dash can also be regarded as a sentence
Recent research on sentence boundary detection
– Riley (1989) : statistical classification trees
– Palmer and Hearst (1994; 1997) : a neural network to predict sentence boundaries
– Mikheev (1998) : a Maximum Entropy approach to the problem
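
For contrast with the statistical approaches listed above, here is a minimal rule-based baseline: end a sentence at a token ending in '.', '?' or '!' unless the token is a known abbreviation. This is not any of the cited methods. The abbreviation list and the name split_sentences are assumptions of this sketch.

```python
# Hypothetical abbreviation list, stored in lowercase.
ABBREVIATIONS = {"etc.", "wash.", "dr.", "mr."}


def split_sentences(text):
    """Split whitespace-tokenized text at '.', '?' or '!' tokens,
    skipping tokens that appear in the abbreviation list."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends = token.endswith((".", "?", "!"))
        if ends and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text with no final punctuation
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Smith moved to Wash. last year. Did he like it? Yes!"))
```

The baseline fails when an abbreviation such as etc. actually ends a sentence, which is the kind of case the learned classifiers are meant to handle.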


Mark-up Schemes

Early markup schemes
– including header information in texts (giving author, date, title, etc.)
SGML
– a general language that lets one define a grammar for texts
XML
– a subset of SGML particularly designed for the web
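
To make the idea of marked-up text concrete, the sketch below parses a small XML fragment with Python's standard xml.etree.ElementTree module. The element names (text, header, author, title, body, s) are hypothetical and do not come from any particular corpus encoding scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up document: header information plus sentence markup.
document = """<text>
  <header><author>A. Writer</author><title>Sample</title></header>
  <body><s>This is one sentence.</s><s>Here is another.</s></body>
</text>"""

root = ET.fromstring(document)

# Header fields are reachable by a simple path from the root element.
author = root.findtext("header/author")
title = root.findtext("header/title")

# Every <s> element, at any depth, holds one sentence.
sentences = [s.text for s in root.iter("s")]

print(author, title, sentences)
```

With markup like this, the header information and the sentence boundaries are explicit in the file and do not have to be guessed from the raw text.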


Grammatical tagging

First step of analysis
– automatic grammatical tagging for categories
– categories can be fine-grained, e.g. distinguishing comparative and superlative adjectives
Tag sets (Table 4.5)
– incorporate morphological distinctions of a particular language
The design of a tag set
– target feature of classification
• useful information about the grammatical class of a word
– predictive feature
• predicting the behavior of other words in the context
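
A tagged corpus is often stored as word/TAG pairs, and the sketch below reads such a line and counts the tags. The word/TAG layout and the sample sentence are assumptions of this example. The tags follow Brown-style conventions (AT article, JJR comparative adjective, JJT superlative adjective, NN noun, VBD past-tense verb).

```python
from collections import Counter

# Hypothetical tagged sentence in word/TAG format.
tagged = "The/AT bigger/JJR dog/NN chased/VBD the/AT smallest/JJT cat/NN"


def parse_tagged(text):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # rpartition splits at the last slash, so words containing '/' survive.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged(tagged)
tag_counts = Counter(tag for _, tag in pairs)
print(pairs)
print(tag_counts)
```

The separate JJR and JJT tags show the kind of fine-grained distinction a tag set can encode beyond the conventional parts of speech.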
