research methods in corpus linguistics xiaofei lu

Post on 23-Dec-2015


Research methods in corpus linguistics

Xiaofei Lu

Overview

What is a corpus?
Types of corpora
Corpus design
Where to obtain corpora
Corpus annotation
Corpus analysis
Note on research project design
Exercises and demos in between
Future courses on corpus linguistics

What is a corpus?

Leech (1992): an unexciting phenomenon, a helluva lot of text, stored on a computer
Francis (1982): a collection of texts assumed to be representative of a given language, dialect, or other subset of a language to be used for linguistic analysis
Sinclair (1991): a collection of naturally-occurring language text, chosen to characterise a state or a variety of language

Types of corpora

General-purpose monolingual corpora
  The British National Corpus
Specialized corpora
  Lancaster Corpus of Academic Written English
Learner corpora
  International Corpus of Learner English
Parallel & comparable corpora
  The JRC-Acquis Multilingual Parallel Corpus
  The English-Chinese Parallel Concordancer
Corpora and varieties
  International Corpus of English
Synchronic and diachronic corpora

Corpus design

Purpose
Comparability
Type
Content: mode, interaction, domain, medium
Structure: proportions
Size
Sampling?
Design of the BNC

Where to obtain corpora

Linguistic Data Consortium
Bookmarks for corpus-based linguists
Ask on the Corpora list
Compile your own corpora
  Design your corpus
  Get permission
  File format, metadata, and data markup
  Text capture
    Scanning, typing, electronic files, web crawlers, e.g., WebSPHINX
    Transcription tools, e.g., Transcriber
  A Guide to Good Practice

Corpus annotation

Why annotate
Levels of corpus annotation
Difficulties for corpus annotation
Tools for corpus annotation

Why annotate

For linguistic research
  Allow more effective corpus searches
For natural language processing
  Spelling and grammar checking
  Text summarization
  Machine translation
  Question answering

Levels of corpus annotation

Sentence segmentation
Word segmentation/tokenization
Part-of-speech (POS) tagging
Chunking/shallow parsing
Syntactic parsing
Semantic annotation
Pragmatic annotation
Parallel corpora: sentence alignment
Learner corpora: error annotation
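The first two levels, sentence segmentation and tokenization, can be sketched with regular expressions. This is only a naive illustration, not a tool named on the slides: the function names are made up, and the splitting rules are assumptions that will mis-handle abbreviations such as "Dr." or "e.g.".

```python
import re

def split_sentences(text):
    """Naive segmentation: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive tokenization: words (keeping internal apostrophes) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

# Invented two-sentence example
text = "I saw a pig. It didn't see me!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

Real annotation pipelines use trained segmenters and tokenizers; the point here is only that each level produces the units the next level works on.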

Difficulties for corpus annotation

Ambiguity
  I saw a pig with binoculars.
  Problems for tagging, parsing, & word sense disambiguation (WSD)
Unknown words
  Identification
  POS tagging
  Semantic annotation

Tools for corpus annotation

Bookmarks for corpus-based linguists
Corpora and Corpus Annotation Tools on the WWW
POS tagger demonstration
  Sentence segmentation
  POS tagging
  Extracting NPs of the form DT NN NN
Dexter: Tools for analyzing language data
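The "DT NN NN" extraction step from the demonstration can be sketched as a sliding window over tagged tokens. The slides do not say which tagger or tagset was used, so this assumes Penn Treebank-style tags and a sentence tagged by hand; `extract_dt_nn_nn` is a hypothetical helper name.

```python
def extract_dt_nn_nn(tagged):
    """Return word triples whose tags are exactly DT, NN, NN in sequence."""
    matches = []
    for i in range(len(tagged) - 2):
        window = tagged[i:i + 3]
        if [tag for _, tag in window] == ["DT", "NN", "NN"]:
            matches.append(" ".join(word for word, _ in window))
    return matches

# Hand-tagged example (tags assigned manually for illustration, not tagger output)
tagged_sentence = [
    ("The", "DT"), ("corpus", "NN"), ("linguist", "NN"), ("built", "VBD"),
    ("a", "DT"), ("word", "NN"), ("list", "NN"), ("from", "IN"),
    ("the", "DT"), ("texts", "NNS"), (".", "."),
]
noun_phrases = extract_dt_nn_nn(tagged_sentence)
```

Here "the texts" is skipped because its noun is tagged NNS and only two tokens long, which shows how strictly the pattern is matched.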

Corpus analysis

Levels of corpus analysis
Tools for corpus analysis
Interpreting corpus data

Levels of corpus analysis

Word frequency lists
Concordances
  Collocation (lexical patterning)
  Colligation (syntactic patterning)
Keyword lists
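A concordance lists every occurrence of a target word with its immediate context (key word in context, KWIC). The following is a minimal sketch under simple assumptions: lowercased word tokens, punctuation ignored, and a context measured in tokens rather than characters. The function name and sample text are invented.

```python
import re

def concordance(text, target, width=3):
    """Return KWIC lines with up to `width` tokens of context on each side of each hit."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

sample = "The cat sat on the mat. The dog chased the cat around the garden."
kwic = concordance(sample, "cat", width=2)
```

Reading down the left and right contexts of such lines is how recurring lexical patterns (collocation) and syntactic patterns (colligation) are usually spotted.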

Tools for corpus analysis

Bookmarks for corpus-based linguists

Recommendations:
  WordSmith Tools (not free)
  AntConc (free)
  TextStat (free)
Unix tools
Write your own scripts

Exercise (part 1)

Download and install AntConc
Download some text for processing
  Project Gutenberg
Generate a word frequency list for your mini-corpus
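The exercise uses AntConc, but the same frequency list can be approximated with a short script. This sketch assumes a crude definition of "word" (lowercase letter runs with an optional internal apostrophe); the function name and the sample sentence are invented for illustration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercase word tokens; a word is a run of letters with an optional apostrophe part."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

sample = "The cat sat on the mat. The mat was flat."
freqs = word_frequencies(sample)
top = freqs.most_common(2)  # the two most frequent words with their counts
```

For a downloaded text, the sample string would be replaced by the file contents, for example `open("mini_corpus.txt", encoding="utf-8").read()`, where the file name is hypothetical.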

Interpreting corpus data

Are frequency differences statistically significant?
  w appears x times in an n-word corpus, and y times in an m-word corpus
  Chi-square test (unreliable when expected cell frequencies are small)
  Fisher’s Exact Test (in its standard form, limited to 2×2 tables)
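For the setup above, the chi-square test works on a 2×2 table: occurrences of w versus all other tokens, in each of the two corpora. The sketch below assumes the Pearson statistic without continuity correction and uses the fact that, with one degree of freedom, the p-value equals erfc(sqrt(chi2 / 2)). The function name and the counts are invented.

```python
import math

def chi_square_2x2(x, n, y, m):
    """Pearson chi-square (no continuity correction) for a word occurring
    x times in an n-word corpus and y times in an m-word corpus."""
    observed = [[x, n - x], [y, m - y]]
    row_totals = [n, m]
    total = n + m
    col_totals = [x + y, total - (x + y)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Invented counts: 30 hits in 1,000 words versus 10 hits in 1,000 words
chi2, p = chi_square_2x2(30, 1000, 10, 1000)
```

The function divides by the expected counts, so it fails if the word occurs in neither corpus, and its result should not be trusted when any expected count is small.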

Exercise (part 2)

Compare your word frequency list with that of the BNC
  Anything interesting?
Run the chi-square test and Fisher’s Exact Test on some interesting words
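For words with low counts, Fisher's Exact Test can be computed directly from the hypergeometric distribution. This is a sketch of the usual two-sided version, which sums the probabilities of all tables no more likely than the observed one. The function name is made up, and the approach is only practical for modest counts because it enumerates every possible table.

```python
from math import comb

def fisher_exact_2x2(x, n, y, m):
    """Two-sided Fisher's exact test for the table [[x, n - x], [y, m - y]]."""
    k = x + y          # total occurrences of the word in both corpora
    total = n + m      # total tokens in both corpora
    denom = comb(total, k)

    def prob(a):
        # Hypergeometric probability that a of the k occurrences fall in corpus 1
        return comb(n, a) * comb(m, k - a) / denom

    p_observed = prob(x)
    lo, hi = max(0, k - m), min(k, n)
    # Sum the probabilities of all tables at least as unlikely as the observed one
    # (the small tolerance guards against floating-point ties)
    return sum(p for p in (prob(a) for a in range(lo, hi + 1))
               if p <= p_observed * (1 + 1e-9))

# Invented small table: 8 of 10 versus 1 of 6
p_small = fisher_exact_2x2(8, 10, 1, 6)
```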

Interpreting corpus data (cont.)

Collocational analysis: How strongly are X and Y associated?
  Mutual information
    Compares the observed frequency of (X, Y) with the frequency expected by chance
    Higher MI, stronger association
    Doesn’t work well for low frequencies
  T-test
    Measures confidence with which to claim strong association between X and Y
    Higher t-score, greater confidence in the association
  Online calculations
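Both measures can be computed from four counts: the co-occurrence frequency of X and Y, the individual frequencies of X and Y, and the corpus size N. The sketch below assumes the formulations commonly used in corpus linguistics, MI = log2(O / E) and t = (O - E) / sqrt(O) with E = f(X) f(Y) / N. The slides do not give formulas, and some tools also scale E by the window size, so treat these as one reasonable variant. All counts are invented.

```python
import math

def expected_cooccurrence(f_x, f_y, n):
    """Co-occurrence frequency expected if X and Y were independent."""
    return f_x * f_y / n

def mutual_information(f_xy, f_x, f_y, n):
    """MI score: log2 of observed over expected co-occurrence frequency."""
    return math.log2(f_xy / expected_cooccurrence(f_x, f_y, n))

def t_score(f_xy, f_x, f_y, n):
    """t-score: (observed - expected) divided by the square root of observed."""
    return (f_xy - expected_cooccurrence(f_x, f_y, n)) / math.sqrt(f_xy)

# Invented counts in a 1,000,000-word corpus
mi = mutual_information(f_xy=20, f_x=100, f_y=200, n=1_000_000)
t = t_score(f_xy=20, f_x=100, f_y=200, n=1_000_000)

# Two words that each occur once, together: MI is very high, t stays low
rare_mi = mutual_information(f_xy=1, f_x=1, f_y=1, n=1_000_000)
rare_t = t_score(f_xy=1, f_x=1, f_y=1, n=1_000_000)
```

The last two lines illustrate the low-frequency caveat: a single chance co-occurrence of two rare words gets a higher MI than the well-attested pair, while its t-score remains small.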

Exercise (part 3)

Generate a concordance for a target word

Find a word that co-occurs frequently with the target word

Test if the word is strongly associated with the target word
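The second step of this exercise, finding words that co-occur frequently with the target, can be sketched by counting the tokens in a small window around each hit. The window size, tokenization, function name, and sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

def collocates(text, target, span=2):
    """Count tokens occurring within `span` positions to the left or right of the target."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

sample = ("Strong tea is popular. She drinks strong tea daily. "
          "He prefers strong coffee.")
neighbours = collocates(sample, "strong", span=1)
best = neighbours.most_common(1)
```

The count for the top candidate, together with the individual frequencies of both words and the corpus size, gives the inputs needed for an association measure such as MI or the t-score.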

Note on research project design

Purpose of project
Corpus compilation and annotation
Corpus analysis
  Bottom-up: from observations of recurring patterns to hypotheses and generalizations
  Top-down: start with given categories and search for evidence of use and variance
Caution on generalizability

Future courses on corpus linguistics

Spring 2007
  APLING 597E: Introduction to Corpus Linguistics
    Hands-on course on principles and tools for corpus compilation, annotation, processing, and analysis
Spring 2008
  APLING 597: Seminar on Corpus Linguistics
    Advanced seminar on using corpora for serious research projects
