Corpora and Statistical Methods

Introduction

Lecture 1, Part I

Albert Gatt
CSA5011 -- Corpora and Statistical Methods

Tutorial

Next Monday at 11:00.

This will take the form of a discussion of the following paper:
  Jurafsky, D. (2003). Probabilistic knowledge in psycholinguistics. (Available from the course web page.)

Course goals

Introduce the field of statistical natural language processing (statistical NLP).

Describe the main directions, problems, and algorithms in the field.

Discuss the theoretical foundations.

Involve students in hands-on experiments with real problems.

A general introduction

Language

We can define a language formally as:
  a set of symbols (an alphabet)
  a set of rules for combining those symbols
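
The formal definition above can be made concrete with a toy example. The sketch below is an illustration only: the alphabet {a, b} and the two rules S -> a S b and S -> a b are assumptions chosen for brevity, not part of the lecture. The function checks whether a string belongs to the language those rules generate.

```python
# Illustrative toy language (an assumption, not from the lecture):
# alphabet {a, b}; rules S -> a S b | a b, which generate a^n b^n for n >= 1.
ALPHABET = {"a", "b"}


def is_sentence(string):
    """Return True if `string` can be derived from S using the two rules."""
    # Every symbol must come from the alphabet.
    if not string or any(symbol not in ALPHABET for symbol in string):
        return False
    # Base rule: S -> a b
    if string == "ab":
        return True
    # Recursive rule: S -> a S b
    if len(string) > 2 and string[0] == "a" and string[-1] == "b":
        return is_sentence(string[1:-1])
    return False


print(is_sentence("aabb"))  # symbols and rules both respected
print(is_sentence("abab"))  # valid symbols, but not derivable from the rules
```

The second call shows that using the right symbols is not enough: the rules of combination decide membership.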

This mathematical definition covers many classes of languages, not just human language.

Java: an artificial (formal) language

fixed set of basic symbols:
  public, static, for, while, {, }

fixed syntax for symbol combination:

  public static void main(String[] args) {
      for (int i = 0; i < args.length; i++) {}
  }

Natural language

Often much more complicated than an artificial language.
NB: Some theorists view natural language as a special kind of formal language as well (Montague).

It does conform to the formal definition:
  there are symbols
  there are modes of combination

However, there are many levels at which these symbols and rules are defined.

Levels of analysis in natural language (I)

Acoustic properties (phonetics)
  defines a basic set of sounds in terms of their features
  studies the combination of these phonemes

Higher-order acoustic features (phonology)
  how phonemes combine into larger units, with suprasegmental features such as intonation

Levels of analysis in natural language (II)

Word formation (morphology)
  combines morphemes into words

Combination into longer units in a structure-dependent way (syntax)
  legal word combinations in a language
  recursive phrasal combination

Interpretation (semantics):
  of words (lexical semantics)
  of longer units (sentential/propositional semantics)

Interpretation in context (pragmatics)

Natural Language Processing

Studies language at all its levels.
  phonology, morphology, syntax, semantics
  focuses on process (Sparck-Jones 2007)
  computational methods to understand and generate human language

Often, the distinction between NLP and computational linguistics is fuzzy.

Kindred disciplines: Linguistics

Theoretical linguistics tends to be less process-oriented than NLP.
  Q: How can we characterise the knowledge that native speakers have of their language?
  this leads to declarative models of speakers' knowledge of language
  tends to say less about how speakers process language in real time
  NB: This depends on the theoretical orientation!

NLP has strong ties to theoretical linguistics.
  it has also been an important contributor: process models can serve as tests for declarative models

Kindred disciplines: Psycholinguistics

Like NLP, psycholinguistics tends to be strongly process-oriented.
  studies the online processes of language understanding and language production

NLP has benefited from such models.

NLP has also been a contributor:
  it is increasingly common to test psycholinguistic theories by building computational models.

Paradigms in NLP (I)

Knowledge-based: the system is based on a priori rules and constraints
  e.g. a syntactic parser might have hand-crafted rules such as:
    NP → Det AdjP N
    AdjP → A+
  Problem: it is extremely difficult to hand-code all the relevant knowledge.
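
To see why hand-coding is difficult, here is a minimal sketch of a recogniser for the two rules above. The lexicon is a hypothetical toy one, and the function name `is_np` is an illustrative choice; nothing here is taken from an actual parser.

```python
# Hypothetical toy lexicon; the categories follow the two rules above.
LEXICON = {
    "the": "Det", "a": "Det",
    "tall": "A", "strange": "A",
    "woman": "N", "boy": "N",
}


def is_np(words):
    """Recognise NP -> Det AdjP N, where AdjP -> A+ (one or more adjectives)."""
    # Words missing from the lexicon map to None and can never match.
    tags = [LEXICON.get(word.lower()) for word in words]
    # The rules require at least a Det, one A, and an N.
    if len(tags) < 3:
        return False
    first, middle, last = tags[0], tags[1:-1], tags[-1]
    return first == "Det" and last == "N" and all(tag == "A" for tag in middle)


print(is_np("the tall strange woman".split()))  # fits the rules
print(is_np("the woman".split()))               # no adjective: rejected
print(is_np("the tall giraffe".split()))        # unknown word: rejected
```

The last two calls fail for reasons a speaker would find unconvincing: the rules as written demand an adjective, and the lexicon has never seen "giraffe". Each such gap needs another hand-written rule or entry.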

Paradigms in NLP (II)

Statistical:
  starting point is a large repository of text or speech (a corpus)
  corpus is often annotated with relevant information, e.g.:
    parsed corpora (syntax)
    tagged corpora (part-of-speech)
    word-sense annotated corpora (semantics)
  tries to learn a model from the data
  tries to generalise this model to new data

The paradigms: a bird's-eye view

We find similar divisions within mainstream linguistics:
  generative linguistics tends to formulate generalisations about internalised speaker knowledge of language (competence, I-Language)
  corpus linguistics tends to formulate generalisations based on patterns observed in corpora

The two paradigms are viewed as having roots in different traditions:
  rationalist tradition (Plato, Descartes)
  empiricist tradition (Locke)

The idea of linguistic knowledge

Traditional linguistic theory (since the 1950s) introduced a dichotomy:
  competence: a person's knowledge of language, formalised as a set of rules
  performance: actual production and perception of language in concrete situations

Much of linguistic theory has focused on characterising competence.

The idea of linguistic knowledge

The use of data (corpora) involves an increased focus on performance.

The idea is that exposure to the regularities found in such data is a crucial part of human language learning.
(Evidence for this is our topic for Monday's tutorial!)

An initial example

Suppose you're a linguist interested in the syntax of verb phrases. Some verbs are transitive, some intransitive:
  I ate the meat pie (transitive)
  I swam (intransitive)

What about:
  quiver
  quake

Corpus data suggests they have transitive uses:
  the insect quivered its wings
  it quaked his bowels (with fear)

Most traditional grammars characterise these as intransitive.

Example II: lexical semantics

Quasi-synonymous lexical items exhibit subtle differences in context.
  strong
  powerful

A fine-grained theory of lexical semantics would benefit from data about these contextual cues to meaning.
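
One simple way to gather such contextual data is to count the words that occur immediately after a target word. The sketch below does this over a handful of invented sentences; the sentences, the name `right_collocates`, and the choice of a one-word window to the right are all assumptions made for illustration, and a real study would use a large corpus.

```python
from collections import Counter

# Invented sentences for illustration only; a real study would use a corpus
# such as the British National Corpus.
TOY_CORPUS = [
    "a strong wind blew all night",
    "she spoke with a strong accent",
    "the cheese has a strong flavour",
    "a powerful engine drives the pump",
    "the army tested a powerful weapon",
    "a strong wind damaged the roof",
]


def right_collocates(target, sentences):
    """Count the words that immediately follow `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Pair each token with its right-hand neighbour.
        for current, following in zip(tokens, tokens[1:]):
            if current == target:
                counts[following] += 1
    return counts


print(right_collocates("strong", TOY_CORPUS).most_common())
print(right_collocates("powerful", TOY_CORPUS).most_common())
```

Comparing the two counts shows which nouns each adjective tends to accompany in the sample.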

Example II continued

Some differences between strong and powerful (source: British National Corpus):

  strong:    wind, feeling, accent, flavour
  powerful:  tool, weapon, punch, engine

The differences are subtle, but examining their collocates helps.

Statistical approaches to language

Do not rely on categorical judgements of grammaticality etc. Examples:

Degrees of grammaticality: people often do not have categorical judgements of acceptability.

Category blending: We live nearer town than you thought.
  Is near an adjective or a preposition?

Syntactic ambiguity: She killed the man with the gun.
  What is the most likely parse?

Statistical NLP vs. Corpus Linguistics (I)

Corpus linguistics became popular with the arrival of large, machine-readable corpora.
  generally viewed as a methodology
  tests hypotheses empirically on data
  aim is to refine a theory of language, or discover novel generalisations

Statistical NLP shares these aims; however:
  it is often corpus-driven rather than corpus-based
  the theory or model learned is often not given a priori

Statistical NLP vs. Corpus Linguistics (II)

The term corpus may mean different things to different people:
  To a corpus linguist, a corpus is a balanced, representative sample of a particular language variety (e.g. the British National Corpus).
  Representativeness allows generalisations to be made more rigorously.

In statistical NLP, there has traditionally been less emphasis on these properties.
  emphasis on algorithms for learning language models
  we frequently find the tacit assumption that the algorithm can be applied to any set of data, given the right annotations

Some applications of Statistical NLP

[Figure: diagram of Language Technology, linking Text, Meaning and Speech via Natural Language Understanding, Natural Language Generation, Speech Recognition, Speech Synthesis and Machine Translation]

A (very) rough division of NLP tasks

understanding: typically take as input free text or speech, and conduct some structural or semantic analysis
  POS tagging, parsing, semantic role labelling, sentiment/opinion mining, named entity recognition

generation: typically take textual or non-linguistic input, outputting some text/speech
  automatic weather reporting, summarisation, machine translation

How effective are statistical NLP tools at carrying out these and other tasks?
Are statistical techniques actually useful for learning things about language?

Example 1: Semantics

Target word: goat

  sheep     0.359
  cow       0.345
  pig       0.331
  rabbit    0.305
  cattle    0.304
  deer      0.289
  lamb      0.286
  donkey    0.276
  poultry   0.262
  boar      0.261
  camel     0.259
  elephant  0.258
  calf      0.258
  pony      0.255

Example of an automatically acquired thesaurus of similar words.
Data: 1.5 bn words obtained from the web. (www.sketchengine.co.uk)
How does this work?

Example 1: Semantics (cont/d)

Corpus-based lexical semantic acquisition typically uses vector-space models.
  represent a word as a vector containing information about the contexts in which it is likely to occur
  some models also include grammatical relations (subject-of, object-of etc.)

Example 2: POS Tagging

The tall woman and the strange boy thought statistical NLP was pointless.

The tall woman and the strange boy thought statistical NLP was pointless.

Output from a statistical POS tagger trained on the Brown Corpus (LingPipe demo library).
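
To give a feel for how a tagger can be learned from annotated data, here is a minimal sketch of a most-frequent-tag baseline. This is deliberately much simpler than the tagger used for the demo above, and it is not that tagger's method. The training sentences and the tagset (DET, ADJ, NOUN, VERB) are invented for illustration.

```python
from collections import Counter, defaultdict

# Invented training data with a simplified, hypothetical tagset;
# a real tagger would be trained on a corpus such as Brown.
TRAINING = [
    [("the", "DET"), ("tall", "ADJ"), ("woman", "NOUN"), ("thought", "VERB")],
    [("the", "DET"), ("thought", "NOUN"), ("was", "VERB"), ("strange", "ADJ")],
    [("the", "DET"), ("boy", "NOUN"), ("thought", "VERB")],
]


def train(tagged_sentences):
    """Map each word to the tag it received most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NOUN"):
    """Tag each word; unseen words fall back to `default`."""
    return [(word, model.get(word.lower(), default)) for word in words]


model = train(TRAINING)
print(tag("The strange boy thought".split(), model))
```

The word "thought" appears as both a noun and a verb in the training data; the baseline always picks the more frequent tag and ignores context, which is the limitation that sequence models are designed to address.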

Uses of POS Tagging:
  pre-parsing
  corpus analysis for linguistics

Example 3: Parsing

Parsed using the Stanford Parser.
  based on a probabilistic context-free grammar of English
  trained on a treebank
  CFG rules with probabilities
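
The idea of CFG rules with probabilities can be sketched as follows: the probability of a parse tree is the product of the probabilities of the rules used to build it, so competing parses of an ambiguous sentence can be ranked. The rule probabilities below are made up for illustration (a real grammar estimates them from a treebank), lexical probabilities are ignored, and the tuple-based tree format is an assumption of this sketch rather than anything used by the Stanford Parser.

```python
import math

# Made-up rule probabilities for illustration; in a real PCFG they are
# estimated from a treebank. Probabilities for each left-hand side sum to 1.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V", "NP", "PP")): 0.3,
    ("NP", ("Pro",)): 0.3,
    ("NP", ("Det", "N")): 0.5,
    ("NP", ("NP", "PP")): 0.2,
    ("PP", ("P", "NP")): 1.0,
}


def tree_prob(tree):
    """Probability of a tree: the product of the probabilities of its rules."""
    label, *children = tree
    # A preterminal such as ("N", "man") dominates a single word; lexical
    # probabilities are ignored (treated as 1) to keep the sketch short.
    if len(children) == 1 and isinstance(children[0], str):
        return 1.0
    rhs = tuple(child[0] for child in children)
    return RULE_PROB[(label, rhs)] * math.prod(tree_prob(c) for c in children)


# Two parses of "She killed the man with the gun".
she = ("NP", ("Pro", "She"))
the_man = ("NP", ("Det", "the"), ("N", "man"))
with_gun = ("PP", ("P", "with"), ("NP", ("Det", "the"), ("N", "gun")))

# The PP attaches to the verb phrase: the gun is the instrument.
verb_attach = ("S", she, ("VP", ("V", "killed"), the_man, with_gun))
# The PP attaches to the noun phrase: the man has the gun.
noun_attach = ("S", she, ("VP", ("V", "killed"), ("NP", the_man, with_gun)))

print(tree_prob(verb_attach), tree_prob(noun_attach))
```

With these invented numbers the verb-attachment parse scores higher; different rule probabilities could reverse the preference, which is why the estimates need to come from data.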

Example 4: Machine translation

Input:
  (Maltese translation of example sentence)

Output:
  The wife and son long strange nonetheless feels that the statistical NLP is without purpose.

Translated using Maltese-English Google Translate.

Obvious shortcomings, but robust, i.e. some output returned, even if garbled.

Based on automatic alignment between parallel text corpora.

Example 5: Generation/Summarisation

  [...] No laboratories offering molecular genetic testing for prenatal diagnosis of 3-M syndrome are listed in the GeneTests Laboratory Directory. However, prenatal testing may be available for families in which the disease-causing mutations have been identified [...]

Automatically generated article about 3-M syndrome (Sauper and Barzilay 2009).
Now on Wikipedia! (http://en.wikipedia.org/wiki/3-M_syndrome)
Summarised from multiple documents drawn from the web.
Uses automatically acquired templates from human-authored texts to ensure coherence.

Features of Statistical NLP systems

Robustness: typically, they don't break down with new or unknown input

Portability: statistical learning algorithms can in principle be ported to new domains (given data)

Sensitivity to training data: if (say) a POS tagger is trained on medical text, its performance will decline on a new genre (e.g. news).

Some important concepts

All the systems surveyed rely on regularities in large repositories of training data, expressed as probabilities.
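
As a minimal sketch of what expressing regularities as probabilities can mean, the code below estimates how likely one word is to follow another by relative frequency in a tiny invented text. The text, the function name, and the choice of word pairs (bigrams) as the regularity are assumptions made for illustration; the systems surveyed use far larger data and richer models.

```python
from collections import Counter

# Invented toy training data; real systems use very large corpora.
TRAINING_TEXT = "the cat sat on the mat the cat slept"


def bigram_probabilities(text):
    """Relative-frequency estimate of P(next word | current word)."""
    tokens = text.split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Count each word as a context, i.e. every token except the last one.
    contexts = Counter(tokens[:-1])
    return {
        (w1, w2): count / contexts[w1]
        for (w1, w2), count in bigrams.items()
    }


probs = bigram_probabilities(TRAINING_TEXT)
print(probs[("the", "cat")])  # "the" occurs 3 times as context, twice before "cat"
print(probs[("the", "mat")])
```

Word pairs never seen in the training text get no estimate at all here, which is one reason evaluation on unseen data matters.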

In practice, we distinguish between:
  training/development data: for learning a model and fine-tuning
  test data: for evaluation on unseen but compatible data

References

Sparck-Jones, K. (2007). Computational Linguistics: What about the linguistics? Computational Linguistics 33 (3): 437-441.

McEnery, T., Xiao, R. & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London: Routledge. (Contains an interesting discussion of corpus-based vs. corpus-driven approaches.)
