Research methods in corpus linguistics
Xiaofei Lu
Overview
- What is a corpus?
- Types of corpora
- Corpus design
- Where to obtain corpora
- Corpus annotation
- Corpus analysis
- Note on research project design
- Exercises and demos in between
- Future courses on corpus linguistics
What is a corpus?
- Leech (1992): an unexciting phenomenon, a helluva lot of text, stored on a computer
- Francis (1982): a collection of texts assumed to be representative of a given language, dialect, or other subset of a language to be used for linguistic analysis
- Sinclair (1991): a collection of naturally-occurring language text, chosen to characterise a state or a variety of language
Types of corpora
- General-purpose monolingual corpora
  - The British National Corpus
- Specialized corpora
  - Lancaster Corpus of Academic Written English
- Learner corpora
  - International Corpus of Learner English
- Parallel & comparable corpora
  - The JRC-Acquis Multilingual Parallel Corpus
  - The English-Chinese Parallel Concordancer
- Corpora and varieties
  - International Corpus of English
- Synchronic and diachronic corpora
Corpus design
- Purpose
- Comparability
- Type
- Content: mode, interaction, domain, medium
- Structure: proportions
- Size
- Sampling?
- Design of the BNC
Where to obtain corpora
- Linguistic Data Consortium
- Bookmarks for corpus-based linguists
- Ask on the corpora list
- Compile your own corpora
  - Design your corpus
  - Getting permission
  - File format, metadata, and data markup
  - Text capture
    - Scanning, typing, electronic files, web crawlers, e.g., WebSPHINX
    - Transcription tools, e.g., Transcriber
  - A Guide to Good Practice
Corpus annotation
- Why annotate
- Levels of corpus annotation
- Difficulties for corpus annotation
- Tools for corpus annotation
Why annotate
- For linguistic research
  - Allow more effective corpus searches
- For natural language processing
  - Spelling and grammar checking
  - Text summarization
  - Machine translation
  - Question answering
Levels of corpus annotation
- Sentence segmentation
- Word segmentation/tokenization
- Part-of-speech (POS) tagging
- Chunking/shallow parsing
- Syntactic parsing
- Semantic annotation
- Pragmatic annotation
- Parallel corpora: sentence alignment
- Learner corpora: error annotation
Difficulties for corpus annotation
- Ambiguity
  - I saw a pig with binoculars.
  - Problems for tagging, parsing, and word sense disambiguation (WSD)
- Unknown words
  - Identification
  - POS tagging
  - Semantic annotation
Tools for corpus annotation
- Bookmarks for corpus-based linguists
- Corpora and Corpus Annotation Tools on the WWW
- POS tagger demonstration
  - Sentence segmentation
  - POS tagging
  - Extracting NPs of the form DT NN NN
- Dexter: Tools for analyzing language data
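The last step of the demonstration, extracting NPs of the form DT NN NN, can be sketched in a few lines of Python. This is a minimal illustration rather than the demonstration tool itself: the sentence is tagged by hand, the tags are assumed to follow the Penn Treebank convention (the slides do not name a tagset), and `extract_dt_nn_nn` is a hypothetical helper name.

```python
# Hand-tagged toy sentence as (word, tag) pairs. The Penn Treebank-style
# tags are an assumption: DT = determiner, NN = singular common noun.
tagged = [
    ("The", "DT"), ("grammar", "NN"), ("checker", "NN"),
    ("flagged", "VBD"),
    ("a", "DT"), ("spelling", "NN"), ("error", "NN"),
    ("in", "IN"), ("the", "DT"), ("report", "NN"), (".", "."),
]


def extract_dt_nn_nn(tagged_tokens):
    """Return every three-word sequence whose tags are exactly DT NN NN."""
    pattern = ("DT", "NN", "NN")
    matches = []
    # Slide a three-token window over the sentence and compare tag sequences.
    for i in range(len(tagged_tokens) - len(pattern) + 1):
        window = tagged_tokens[i:i + len(pattern)]
        if tuple(tag for _, tag in window) == pattern:
            matches.append(" ".join(word for word, _ in window))
    return matches


noun_phrases = extract_dt_nn_nn(tagged)
print(noun_phrases)
```

Note that a fixed three-tag window also matches the first three words of a longer sequence such as DT NN NN NN, so a real extraction would need to decide how to treat longer noun phrases.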
Corpus analysis
- Levels of corpus analysis
- Tools for corpus analysis
- Interpreting corpus data
Levels of corpus analysis
- Word frequency lists
- Concordances
  - Collocation (lexical patterning)
  - Colligation (syntactic patterning)
- Keyword lists
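To make the notion of a concordance concrete, here is a minimal key-word-in-context (KWIC) sketch in Python. The regex tokenizer is a deliberately crude assumption, and the function names `tokenize` and `concordance` are hypothetical; dedicated concordancers offer sorting, wildcards, and much more.

```python
import re


def tokenize(text):
    """Lowercase the text and keep alphabetic words (a crude assumption)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def concordance(tokens, target, width=3):
    """Return (left context, keyword, right context) for each hit of target."""
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


sample = "The cat sat on the mat. The dog chased the cat."
hits = concordance(tokenize(sample), "cat", width=2)
for left, keyword, right in hits:
    # Right-align the left context so the keywords line up in one column.
    print(f"{left:>12}  [{keyword}]  {right}")
```

Reading down the aligned context columns is what makes recurring lexical patterns (collocation) and syntactic patterns (colligation) visible.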
Tools for corpus analysis
- Bookmarks for corpus-based linguists
- Recommendations:
  - WordSmith Tools (not free)
  - AntConc (free)
  - TextStat (free)
- Unix tools
- Write your own scripts
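As an illustration of the "write your own scripts" option, the following Python sketch builds a word frequency list. It assumes a simple regex tokenizer and an inline sample string; `word_frequencies` is a hypothetical name, and for a real corpus the text would be read from files instead.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase alphabetic word forms in a text."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


sample = "The cat sat on the mat. The dog chased the cat."
freqs = word_frequencies(sample)

# Print the list in descending order of frequency, one word per line.
for word, count in freqs.most_common():
    print(f"{count}\t{word}")
```

Choices such as lowercasing, or whether to count word forms or lemmas, change the resulting list, so they are worth reporting alongside any frequency figures.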
Exercise (part 1)
- Download and install AntConc
- Download some text for processing
  - Project Gutenberg
- Generate a word frequency list for your mini-corpus
Interpreting corpus data
- Are frequency differences statistically significant?
  - w appears x times in an n-word corpus, and y times in an m-word corpus
  - Chi-square test (unreliable when expected frequencies are small)
  - Fisher's Exact Test (in its standard form, applies only to a 2×2 cross table)
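Both tests can be sketched in plain Python for the 2×2 case described above. The sketch makes several assumptions: the chi-square statistic is Pearson's, without a continuity correction; Fisher's test is two-sided; every margin of the table is non-zero; and the counts in the example are invented for illustration. All function names are hypothetical.

```python
import math


def contingency(x, n, y, m):
    """2x2 table: one row per corpus, columns = (target word, other words)."""
    return [[x, n - x], [y, m - y]]


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom)."""
    (a, b), (c, d) = table
    total = a + b + c + d
    chi2 = total * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def fisher_exact_2x2(table):
    """Two-sided Fisher's exact p-value from the hypergeometric distribution."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    denominator = math.comb(row1 + row2, col1)

    def prob(i):
        # Probability of a table with i in the top-left cell, margins fixed.
        return math.comb(row1, i) * math.comb(row2, col1 - i) / denominator

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(col1, row1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(
        p for p in (prob(i) for i in range(low, high + 1))
        if p <= p_observed * (1 + 1e-7)
    )


# Invented counts: w occurs 30 times in one 10,000-word corpus
# and 10 times in another 10,000-word corpus.
table = contingency(30, 10_000, 10, 10_000)
chi2, chi2_p = chi_square_2x2(table)
fisher_p = fisher_exact_2x2(table)
print(f"chi-square = {chi2:.2f}, p = {chi2_p:.4f}")
print(f"Fisher's exact p = {fisher_p:.4f}")
```

The Fisher computation places the word's total frequency in the binomial coefficients, which keeps the numbers manageable even when the corpora themselves are large.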
Exercise (part 2)
- Compare your word frequency list with that of the BNC
  - Anything interesting?
- Run the chi-square test and Fisher's Exact Test on some interesting words
Interpreting corpus data (cont.)
- Collocational analysis: How strongly are X and Y associated?
  - Mutual information
    - Compares the observed frequency of (X,Y) with the frequency expected by chance
    - Higher MI, stronger association
    - Doesn't work well for low frequencies
  - T-test
    - Measures confidence with which to claim strong association between X and Y
    - Higher t-score, stronger association
  - Online calculations
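The two measures can be computed directly from frequency counts. The sketch below uses formulations that are common in corpus linguistics, which is an assumption since the slides give no formulas: expected co-occurrence E = f(X)·f(Y)/N, MI = log2(O/E), and t = (O − E)/√O, with no adjustment for the size of the collocation window. The counts are invented for illustration.

```python
import math


def expected_cooccurrence(f_x, f_y, n):
    """Co-occurrence frequency expected if X and Y were independent."""
    return f_x * f_y / n


def mutual_information(f_xy, f_x, f_y, n):
    """MI score: log2 of observed over expected co-occurrence frequency."""
    return math.log2(f_xy / expected_cooccurrence(f_x, f_y, n))


def t_score(f_xy, f_x, f_y, n):
    """t-score: (observed - expected) / sqrt(observed)."""
    return (f_xy - expected_cooccurrence(f_x, f_y, n)) / math.sqrt(f_xy)


N = 1_000_000  # corpus size in tokens (invented)

# A pair seen together 20 times; X occurs 100 times and Y 200 times.
mi_frequent = mutual_information(20, 100, 200, N)
t_frequent = t_score(20, 100, 200, N)

# A pair of one-off words that happen to occur together once.
mi_rare = mutual_information(1, 1, 1, N)
t_rare = t_score(1, 1, 1, N)

print(f"frequent pair: MI = {mi_frequent:.2f}, t = {t_frequent:.2f}")
print(f"rare pair:     MI = {mi_rare:.2f}, t = {t_rare:.2f}")
```

The rare pair receives a higher MI score than the frequent pair despite a single co-occurrence, which illustrates why MI is unreliable at low frequencies, while its t-score stays low.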
Exercise (part 3)
- Generate a concordance for a target word
- Find a word that co-occurs frequently with the target word
- Test if the word is strongly associated with the target word
Note on research project design
- Purpose of project
- Corpus compilation and annotation
- Corpus analysis
  - Bottom-up: from observations of recurring patterns to hypotheses and generalizations
  - Top-down: start with given categories and search for evidence of use and variance
- Caution on generalizability
Future courses on corpus linguistics
- Spring 2007
  - APLING 597E: Introduction to Corpus Linguistics
    - Hands-on course on principles and tools for corpus compilation, annotation, processing, and analysis
- Spring 2008
  - APLING 597: Seminar on Corpus Linguistics
    - Advanced seminar on using corpora for serious research projects