Introduction to Corpus Linguistics Dr. Mubarak Alkhatnai King Saud University

Post on 08-Feb-2017


Page 1: Corpus linguistics

Introduction to Corpus Linguistics

Dr. Mubarak Alkhatnai
King Saud University

Page 2: Corpus linguistics

What is a Corpus?

• Definition
• Why are corpora used?
• What are they considered to be?
  – Method vs. theory
• Types of corpora
  – Monolingual vs. multilingual
  – Parallel vs. translated

Page 3: Corpus linguistics

Corpus Linguistics

• CL history
  – 1960: 1st generation, e.g. Brown
  – 1975: 2nd generation, e.g. COBUILD
  – 1990: 3rd generation, e.g. BoE (Bank of English)
• Roots
  – CL and linguistics
    • Comparative linguistics
    • Syntax and semantics
  – Chomskyan revolution
• Technology and the progress of CL
• Benefits of CL
• Problems of CL

Page 4: Corpus linguistics

Building the Corpora

• General corpora
  – E.g. BNC, the Brown Corpus
• Specialized corpora
  – How the corpora are used (written – spoken)
  – Materials for creating the corpora (newspapers, books, documents, etc.)
  – General (social, science, art, etc.)
• Multilingual corpora
  – Parallel corpora
• Learner corpora (International Corpus of Learner English)
• Monitor corpus (The Bank of English)
• Historical corpus

Page 5: Corpus linguistics

Advantages and Disadvantages

• More reliable than intuition
• Language patterns are easily identified
• Deconstruct texts to discover patterns
• Track the development of specific features in the history of English
• Test hypotheses about specific language features empirically
• Follow language acquisition properly
• Draw conclusions from large amounts of linguistic data
• Not always a complete picture
• Shows frequency rather than possibility

Page 6: Corpus linguistics

CL terminology

• Concordance
  – Where and in what context?
  – Frequency
• Annotation
  – Mark-up
• Tagging
  – POS tagging
  – Syntactic treebank
  – Semantic tagging
• Coding
• Metadata
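To make the first two terms concrete, here is a minimal sketch of a keyword-in-context (KWIC) concordance and a frequency count. It is an illustration only, not how any of the tools named in these slides work: the `tokenize` and `concordance` helpers, the regex tokenizer, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (a crude assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of `keyword`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical two-sentence "corpus" used only for illustration.
sample = "The cat sat on the mat. The dog saw the cat."
tokens = tokenize(sample)

# Frequency: how often does each word form occur?
freq = Counter(tokens)

# Concordance: where, and in what context, does "cat" occur?
kwic = concordance(tokens, "cat", width=2)

for left, key, right in kwic:
    print(f"{left:>10} [{key}] {right}")
```

The frequency table answers "how often?", while the concordance lines answer "where and in what context?", which is the distinction the slide draws.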

Page 7: Corpus linguistics

Famous Corpora

Credits: Nadja Nesselhauf

Page 8: Corpus linguistics

Corpora and Translation

• Corpus translation studies (CTS)
• Descriptive translation
• Equivalence
• Corpus-based translation
• The process vs. the product
• The third code
• Simplification vs. normalization

Page 9: Corpus linguistics

Methods of Research in CL

• Quantitative
• Qualitative
  – Context
• Quantitative and qualitative

Page 10: Corpus linguistics

Corpus Software

• AntConc
• MICASE: Michigan Corpus of Academic Spoken English
• TACT: Text Analysis Computing Tools
• TACTWeb: a concordance program based on TACT but for the Web
• SARA: the concordance program written specifically for the British National Corpus

Page 11: Corpus linguistics

Corpus Software Continued

• BNCweb: a web-based client program for searching and retrieving lexical, grammatical and textual data from the British National Corpus (BNC). It relies on the Corpus Query Processor (CQP) of the IMS Open Corpus Workbench to provide a convenient interface between the user and the rich variety of annotated text in the 100-million-word BNC in its most recent incarnation, the XML version.
• BNC Web Index: the web front end to David Lee's BNC Index spreadsheet. For an introduction to BNC Index, please see David's web site.
• CLAWS: part-of-speech tagging software for English.
• Clustertool: allows you to perform Hierarchical Agglomerative Cluster Analysis on your own data.
• CQPweb: an extension of BNCweb but designed for use with any corpus.
• LL Calculator: calculates log-likelihood values from a 2x2 contingency table. LL is a more reliable alternative to the standard Pearson's chi-squared test; see Dunning (1993).
• LWAC: a tool for constructing corpora from web data.
• Sentrick: stream-oriented Java library and a set of command-line tools for high-quality sentence boundary detection (sentence segmentation / splitting / disambiguation). Currently has one model for German (trained on general text and Wikipedia lynx dumps).
• SigTest: Flexible Significance Test System: chi-squared test, log-likelihood test and Fisher exact test for any kind of contingency table, using R.
• USAS: semantic tagger developed for English and extended to Finnish and Russian.
• VARD: Variant Detector software that facilitates the pre-processing of corpora for normalisation of spelling variation (e.g. Early Modern English).
• Wmatrix: a corpus comparison and annotation tool incorporating CLAWS and USAS in a web front end.
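As a rough illustration of the log-likelihood statistic that the LL Calculator entry refers to, the sketch below computes G² = 2 · Σ O · ln(O / E) over the four cells of a 2x2 contingency table. This is not the calculator's own code: the function names, the table layout, and the example counts are assumptions chosen for this example, and the real tool may lay out or approximate the table differently.

```python
import math


def log_likelihood_2x2(a, b, c, d):
    """G-squared for the 2x2 table [[a, b], [c, d]] of observed counts."""
    observed = [[a, b], [c, d]]
    total = a + b + c + d
    if total == 0:
        raise ValueError("contingency table is empty")
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            # Expected count under independence of rows and columns.
            e = row_totals[i] * col_totals[j] / total
            if o > 0:  # a zero cell contributes nothing (0 * ln 0 is taken as 0)
                g2 += o * math.log(o / e)
    return 2.0 * g2


def word_table(freq1, size1, freq2, size2):
    """Assumed layout: word vs. all other words (rows), corpus 1 vs. corpus 2 (columns)."""
    return freq1, freq2, size1 - freq1, size2 - freq2


# Hypothetical counts: a word occurs 30 times in a 1,000-word corpus
# and 10 times in another 1,000-word corpus.
g2 = log_likelihood_2x2(*word_table(30, 1000, 10, 1000))
print(round(g2, 2))
```

With one degree of freedom, a value above 3.84 is conventionally read as significant at the 5% level, so the same threshold used for the chi-squared test applies here.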

Page 12: Corpus linguistics

Additional Resources

• University of Lancaster Centre for Computer Corpus Research on Language (Summer School): http://ucrel.lancs.ac.uk/
• ESRC Centre for Corpus Approaches to Social Science (CASS), University of Lancaster
• McEnery, Tony, and Wilson, Andrew. Corpus Linguistics, 2nd ed. Edinburgh University Press, 2001.
• Aston, Guy, and Burnard, Lou. The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh University Press, 1998.
• Biber, Douglas, Conrad, Susan, and Reppen, Randi. Corpus Linguistics: Investigating Language Structure and Use. CUP, 1998.

Page 13: Corpus linguistics

Questions/Comments