Foundations of Statistical NLP, Chapter 4. Corpus-Based Work, 박 태 원
Post on 19-Jan-2018
Foundations of Statistical NLP
Chapter 4. Corpus-Based Work
박 태 원
2
Abstract
Getting Set Up
– Computers, Corpora, Software
Looking at Text
– Low-level formatting issues
– Tokenization: What is a word?
– Morphology
– Sentences
Marked-up Data
– Markup schemes
– Grammatical tagging
3
Getting Set Up (1/2)
Text corpora are usually big.
– Their size used to be a major limitation on the use of corpora
– Overcome by advances in computers
Corpora
– Use text corpora distributed by the main organizations
– Corpus: a special collection of textual material
– The general issue is whether the corpus is a representative sample of the population of interest.
4
Getting Set Up (2/2)
Software
– Text editors: show the text fairly literally
– Regular expressions: find certain patterns in the text
– Programming languages: C, C++, Perl
– Programming techniques
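As an illustration of the "regular expressions find certain patterns" point, here is a minimal sketch in Python (the slide lists C, C++ and Perl; Python is used here only for brevity). The sample sentence, the function name and both patterns are assumptions made up for this example.

```python
import re

def find_patterns(text):
    """Return (dollar amounts, four-digit years) found in text."""
    # A dollar sign, digits, and an optional decimal part, e.g. $22.50 or $30.
    amounts = re.findall(r"\$\d+(?:\.\d+)?", text)
    # A standalone four-digit number starting with 19 or 20.
    years = re.findall(r"\b(?:19|20)\d{2}\b", text)
    return amounts, years

amounts, years = find_patterns("The price rose from $22.50 to $30 in 1998.")
print(amounts)  # ['$22.50', '$30']
print(years)    # ['1998']
```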
5
6
Looking at Text
Text comes in raw format or marked up.
Markup
– a term used for putting codes of some sort into a computer file
– commercial word processing: WYSIWYG
Features of text in human languages
– make it difficult to process automatically
7
Low-level formatting issues
Junk formatting/content
– junk: document headers, separators, tables, diagrams, etc.
– OCR: if the program deals only with connected English text, remove the junk (all other content)
Uppercase and lowercase
– The original Brown corpus: * was used to mark capital letters
– Should we treat brown in Richard Brown and brown paint as the same?
– proper name detection: a difficult problem
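The uppercase/lowercase question can be made concrete with a small sketch. It assumes one simple heuristic: lowercase only the first token of each sentence and leave other capitalized tokens alone, on the guess that they are names. The function name and the pre-tokenized input are assumptions for illustration.

```python
SENTENCE_END = {".", "?", "!"}

def fold_sentence_initial(tokens):
    """Lowercase only tokens that start a sentence; keep the rest unchanged."""
    result = []
    at_start = True  # the first token starts a sentence
    for tok in tokens:
        result.append(tok.lower() if at_start else tok)
        at_start = tok in SENTENCE_END
    return result

tokens = ["Brown", "paint", "dries", ".", "Richard", "Brown", "left", "."]
print(fold_sentence_initial(tokens))
# ['brown', 'paint', 'dries', '.', 'richard', 'Brown', 'left', '.']
```

The sentence-initial name Richard is wrongly lowercased, which is the kind of error that makes proper name detection a difficult problem.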
8
Tokenization: What is a word? (1)
Tokenization
– dividing the input text into units called tokens
– what is a word?
• graphic word (Kucera and Francis, 1967): “a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks”
-> a workable definition must also cover items such as $22.50, Micro$oft, C|net
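The graphic word definition quoted above can be written as a regular expression. This is a sketch under two assumptions: "alphanumeric" is read as ASCII letters and digits, and the apostrophe is the straight ASCII one. The function names are hypothetical.

```python
import re

# Letters, digits, hyphens and apostrophes only; no other punctuation.
GRAPHIC_WORD = re.compile(r"[A-Za-z0-9'-]+")

def is_graphic_word(chunk):
    """True if the whole whitespace-delimited chunk fits the definition."""
    return GRAPHIC_WORD.fullmatch(chunk) is not None

def split_graphic_words(text):
    """Split on whitespace and separate chunks that fit from chunks that do not."""
    chunks = text.split()
    words = [c for c in chunks if is_graphic_word(c)]
    rest = [c for c in chunks if not is_graphic_word(c)]
    return words, rest

words, rest = split_graphic_words("co-operate isn't $22.50 Micro$oft C|net")
print(words)  # ['co-operate', "isn't"]
print(rest)   # ['$22.50', 'Micro$oft', 'C|net']
```

The second list shows the items from the slide that fall outside the definition even though one would want to treat them as words.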
9
Tokenization: What is a word? (2)
Periods
– distinguish end-of-sentence punctuation marks from abbreviation marks, as in etc. or Wash.
Single apostrophes
– English contractions: I’ll or isn’t
– dog’s: dog is, dog has, or genitive case
Hyphenation
– line-breaking hyphens are present in typographical sources
– e-mail, 26-year-old, co-operate
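One way to handle line-breaking hyphens is sketched below, assuming a word list is available: the two fragments are joined without the hyphen only when the joined form is a known word, otherwise the hyphen is kept. The function name and the tiny vocabulary are assumptions for illustration, not a method given on the slides.

```python
import re

def join_line_break_hyphens(text, vocabulary):
    """Rejoin words split by a hyphen at a line end.

    The hyphen is dropped only if the joined form is in vocabulary;
    otherwise it is treated as a real hyphen and kept.
    """
    def repl(match):
        left, right = match.group(1), match.group(2)
        if (left + right).lower() in vocabulary:
            return left + right
        return left + "-" + right

    # word characters, a hyphen, a newline, then more word characters
    return re.sub(r"(\w+)-\n(\w+)", repl, text)

text = "tokeni-\nzation of a 26-\nyear-old text"
print(join_line_break_hyphens(text, {"tokenization"}))
# tokenization of a 26-year-old text
```

The quality of the result depends entirely on the word list, so forms such as co-operate and cooperate remain ambiguous.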
10
Tokenization: What is a word? (3)
The same form representing multiple “words”
– homographs: ‘saw’ has two lexemes (chap 7)
Word segmentation in other languages
– Many languages do not put spaces in between words
Whitespace not indicating a word break
– the New York-New Haven railroad
Variant coding of information of a certain semantic type
11
Morphology
Stemming
– a process that strips off affixes and leaves you with a stem
Lemmatization
– one is attempting to find the lemma or lexeme of which one is looking at an inflected form
The IR community has shown that stemming does not help the performance of classical IR systems
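To make "strips off affixes and leaves you with a stem" concrete, here is a deliberately naive suffix stripper. It is a toy sketch, not the Porter stemmer or any published algorithm; the suffix list and the minimum stem length of 3 are assumptions chosen for this example.

```python
# Hypothetical, deliberately tiny suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["walking", "walked", "boxes", "dogs", "saw", "business"]:
    print(w, "->", naive_stem(w))
# walking -> walk
# walked -> walk
# boxes -> box
# dogs -> dog
# saw -> saw
# business -> busines
```

The last two lines show the limits of stemming: saw is not mapped to the lemma see, and business is over-stripped. Lemmatization would need a lexicon and the word's context.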
12
Sentences
What is a sentence?
– something ending with a ‘.’, ‘?’ or ‘!’
– the text before or after a colon, semicolon, or dash may also be regarded as a sentence
Recent research on sentence boundary detection
– Riley (1989): statistical classification trees
– Palmer and Hearst (1994; 1997): a neural network to predict sentence boundaries
– Mikheev (1998): Maximum Entropy approach to the problem
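Before the statistical methods listed above, a baseline heuristic is often tried: end a sentence at ‘.’, ‘?’ or ‘!’ unless the token is a known abbreviation. The sketch below assumes whitespace tokenization and a small hand-made abbreviation list; both are illustrative assumptions, not any of the cited systems.

```python
# Hypothetical abbreviation list, stored in lowercase.
ABBREVIATIONS = {"etc.", "wash.", "dr.", "mr."}

def split_sentences(text):
    """Split text at tokens ending in . ? ! that are not known abbreviations."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "?", "!"))
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("He moved to Wash. last year. Did he like it? Yes!"))
# ['He moved to Wash. last year.', 'Did he like it?', 'Yes!']
```

The heuristic fails when an abbreviation also ends a sentence, which is one reason trained classifiers are used for this task.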
13
Mark-up Schemes
Early markup schemes
– including header information in texts (giving author, date, title, etc.)
SGML
– a general language that lets one define a grammar for texts
XML
– a simplified subset of SGML particularly designed for web applications
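A short sketch of how marked-up text with header information can be read programmatically, using Python's standard xml.etree.ElementTree module. The element names (text, header, author, title, p, s) and the sample content are hypothetical and are not taken from any particular corpus encoding scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up document: a header plus sentence-level markup.
DOC = """<text>
  <header><author>A. Writer</author><title>Sample</title></header>
  <p><s>First sentence.</s><s>Second sentence.</s></p>
</text>"""

def read_document(xml_string):
    """Return (title, list of sentence strings) from the marked-up text."""
    root = ET.fromstring(xml_string)
    title = root.findtext("header/title")
    sentences = [s.text for s in root.iter("s")]
    return title, sentences

title, sentences = read_document(DOC)
print(title)      # Sample
print(sentences)  # ['First sentence.', 'Second sentence.']
```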
14
Grammatical tagging
First step of analysis
– automatic grammatical tagging for categories
– tags can be fine grained, e.g. distinguishing comparative and superlative forms
Tag sets (Table 4.5)
– incorporate morphological distinctions of a particular language
The design of a tag set
– target feature of classification
• useful information about the grammatical class of a word
– predictive feature
• predicting the behavior of other words in the context
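The difference between a fine-grained and a coarse tag set can be sketched as a mapping. The example assumes Penn Treebank style tags (JJ, JJR, JJS for base, comparative and superlative adjectives; RB, RBR, RBS for adverbs); it does not reproduce Table 4.5, and the coarse labels ADJ and ADV are made up for illustration.

```python
# Fine-grained tags collapsed to hypothetical coarse classes.
COARSE = {
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",   # adjective, comparative, superlative
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",   # adverb, comparative, superlative
}

def coarsen(tagged_words):
    """Map each (word, tag) pair to its coarse class; unknown tags pass through."""
    return [(word, COARSE.get(tag, tag)) for word, tag in tagged_words]

tagged = [("big", "JJ"), ("bigger", "JJR"), ("biggest", "JJS"), ("dog", "NN")]
print(coarsen(tagged))
# [('big', 'ADJ'), ('bigger', 'ADJ'), ('biggest', 'ADJ'), ('dog', 'NN')]
```

Collapsing the tags keeps the grammatical class (the target feature) but discards the comparative and superlative distinction, which may matter for predicting neighboring words.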
15