corpus linguistics
Introduction to Corpus Linguistics
Dr. Mubarak Alkhatnai, King Saud University
What is a Corpus?
• Definition
• Why are they used?
• What are they considered to be?
  – Method vs. theory
• Types of corpora
  – Monolingual vs. multilingual
  – Parallel vs. translated
Corpus Linguistics
• CL history
  – 1960: 1st generation, e.g. Brown
  – 1975: 2nd generation, e.g. Cobuild
  – 1990: 3rd generation, e.g. BoE (Bank of English)
• Roots
  – CL and linguistics
    • Comparative linguistics
    • Syntax and semantics
  – Chomskyan revolution
• Technology and the progress of CL
• Benefits of CL
• Problems of CL
Building the Corpora
• General corpora
  – E.g. BNC, the Brown Corpus
• Specialized corpora
  – How corpora are used (written, spoken)
  – Materials for creating the corpora (newspapers, books, documents, etc.)
  – General (social, science, art, etc.)
• Multilingual corpora
  – Parallel corpora
• Learner corpora (International Corpus of Learner English)
• Monitor corpus (the Bank of English)
• Historical corpus
Advantages and Disadvantages
• More reliable than intuition
• Language patterns are easily identified
• Deconstruct texts to discover patterns
• Track the development of specific features in the history of English
• Test hypotheses on specific language features empirically
• Follow language acquisition properly
• Draw conclusions from large amounts of linguistic data
• Not always a complete picture
• Shows what is frequent rather than what is possible
CL terminology
• Concordance
  – Where and in what context?
  – Frequency
• Annotation
  – Mark-up
• Tagging
  – POS tagging
  – Syntactic treebank
  – Semantic tagging
• Coding
• Metadata
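The two concordance questions above (where does a word occur and in what context, and how often) can be illustrated with a minimal Python sketch. It is only an illustration under simple assumptions: the sample sentence, the regex tokenizer and the function names `tokenize`, `frequency_list` and `concordance` are hypothetical and are not taken from any of the tools named in these slides.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def concordance(tokens, node, window=3):
    """Return key-word-in-context (KWIC) triples: (left context, node, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Hypothetical two-sentence "corpus", used only for demonstration.
sample = "The cat sat on the mat. The dog chased the cat across the garden."
tokens = tokenize(sample)

# Frequency: how often does each word occur?
for word, count in frequency_list(tokens)[:3]:
    print(f"{word}\t{count}")

# Concordance: where does "cat" occur, and in what context?
for left, node, right in concordance(tokens, "cat", window=2):
    print(f"{left:>12} [{node}] {right}")
```

Real concordancers work on far larger texts and usually on annotated tokens (for example POS tags), but the principle is the same: find every hit of the node word and show it with a fixed window of context.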
Famous Corpora
Credits: Nadja Nesselhauf
Corpora and Translation
• Corpus translation studies (CTS)
• Descriptive translation
• Equivalence
• Corpus-based translation
• The process vs. the product
• The third code
• Simplification vs. normalization
Methods of Research in CL
• Quantitative
• Qualitative
  – Context
• Quantitative and Qualitative
Corpus Software
• AntConc
• MICASE: Michigan Corpus of Academic Spoken English
• TACT: Text Analysis Computing Tools
• TACTWeb: a concordance program based on TACT but for the Web
• SARA: the concordance program written specifically for the British National Corpus
Corpus Software Continued
• BNCweb: a web-based client program for searching and retrieving lexical, grammatical and textual data from the British National Corpus (BNC). It relies on the Corpus Query Processor (CQP) of the IMS Open Corpus Workbench to provide a convenient interface between the user and the rich variety of annotated text in the 100-million-word BNC in its most recent incarnation, the XML version.
• BNC Web Index: the web front end to David Lee's BNC Index spreadsheet. For an introduction to BNC Index, please see David's web site.
• CLAWS: part-of-speech tagging software for English.
• Clustertool: allows you to perform hierarchical agglomerative cluster analysis on your own data.
• CQPweb: an extension of BNCweb, designed for use with any corpus.
• LL Calculator: calculates log-likelihood values from a 2x2 contingency table. LL is a more reliable alternative to the standard Pearson's chi-squared test; see Dunning (1993).
• LWAC: a tool for constructing corpora from web data.
• Sentrick: a stream-oriented Java library and set of command-line tools for high-quality sentence boundary detection (sentence segmentation / splitting / disambiguation). Currently has one model for German (trained on general text and Wikipedia lynx dumps).
• SigTest: Flexible Significance Test System: chi-squared test, log-likelihood test and Fisher exact test for any kind of contingency table, using R.
• USAS: semantic tagger developed for English and extended to Finnish and Russian.
• VARD: Variant Detector software that facilitates the pre-processing of corpora for normalisation of spelling variation (e.g. Early Modern English).
• Wmatrix: a corpus comparison and annotation tool incorporating CLAWS and USAS in a web front end.
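To make the LL Calculator entry concrete, here is a hedged Python sketch of the log-likelihood (G2) statistic for a 2x2 contingency table: one word's frequency in two corpora, set against the remaining tokens in each. The function name `log_likelihood` and the example counts are invented for illustration. The sketch sums over all four cells of the table, which is the textbook form of the statistic and may not match the output of any particular online calculator exactly.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """G2 for a word seen freq1 times in size1 tokens and freq2 times in size2 tokens.

    Builds the 2x2 table [[word, other words] per corpus] and sums
    observed * ln(observed / expected) over all four cells.
    """
    observed = [
        [freq1, size1 - freq1],
        [freq2, size2 - freq2],
    ]
    total = size1 + size2
    row_totals = [size1, size2]
    col_totals = [freq1 + freq2, total - freq1 - freq2]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / total  # expected count under independence
            if o > 0:  # a zero cell contributes nothing (limit of x * ln x is 0)
                g2 += o * math.log(o / e)
    return 2 * g2


# Hypothetical counts: a word occurs 50 times in corpus A and 10 times in corpus B,
# each corpus containing 1,000 tokens.
score = log_likelihood(50, 1000, 10, 1000)
print(f"LL = {score:.2f}")
```

With one degree of freedom, a value above 3.84 corresponds to p < 0.05 on the chi-squared distribution, so a score well above that threshold suggests the word is used at different rates in the two corpora.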
Additional Resources
• University of Lancaster Centre for Computer Corpus Research on Language (Summer School): http://ucrel.lancs.ac.uk/
• McEnery, Tony, and Wilson, Andrew. Corpus Linguistics, 2nd ed. Edinburgh University Press, 2001.
• ESRC Centre for Corpus Approaches to Social Science (CASS), University of Lancaster
• Aston, Guy, and Burnard, Lou. The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh University Press, 1998.
• Biber, Douglas, Conrad, Susan, and Reppen, Randi. Corpus Linguistics: Investigating Language Structure and Use. CUP, 1998.
Questions/Comments