subtle patterns of learner language: 13 topics for further research


Upload: steve-pepper

Post on 25-May-2015


DESCRIPTION

A presentation of (some of) the subtle and hitherto undetected patterns in the lexicon of Norwegian language learners revealed by a Discriminant Analysis of texts in the ASK corpus.

TRANSCRIPT

Page 1: Subtle patterns of learner language: 13 topics for further research

Subtle patterns of learner language

Steve Pepper      2013-09-26     ASKeladden

13 topics for further research

[Word cloud of Norwegian words: og, er, det, å, i, jeg, som, en, at, på, for, de, til, ikke, har, med, vi, kan, av, man, men, om, et, mange, den, var, må, eller, seg, også, mye, veldig, når, være, fra, norge, andre, alle, skal, meg, du, vil, noen, hvis, mer, mennesker, ha, dette, barn, bare, blir, viktig, fordi, folk, da, han, min, barna, hva, noe, få, dem, bli, synes, hvor, selv, etter, hadde, oss, land, år, kommer, ting, gjøre, alt, enn, dag, der, livet, tror, venner, gå, flere, stor, får, trenger]

Page 2

Introduction

• An application of the detection-based argument (Jarvis 2010)
– Modelled on Jarvis & Crossley (2012)

• Use of data mining methods to
1) automatically detect (predict) the L1
2) identify (lexical) features that serve to discriminate between L1 groups, i.e. L1 predictors

• Major advantages:
– Ability to recognize positive as well as negative transfer
– Ability to detect very subtle patterns that might otherwise escape notice

Jarvis & Crossley (2012)

Page 3

Evidence of the third kind...

• The method supplies the first two kinds of evidence “out of the box”
– The focus here is therefore on supplying the third kind

• Sources of type 3 evidence
– the learner’s L1 performance
– comparable users’ L1 performance
– contrastive grammars
– traditional grammars

• Involves Contrastive Interlanguage Analysis (Granger 1996)
– IL(L2) <> NL(L1)

Evidence for transfer (Jarvis 2010)

1. Intergroup heterogeneity
2. Intragroup homogeneity
3. Cross-language congruity
4. Intralingual contrasts

Page 4

L1 predictors

• 55 features (i.e. words) selected using Discriminant Analysis (see box)
– DA explained on Saturday at LCR 2013

• Subjected to post-hoc analysis using Tukey’s HSD
– a single-step multiple comparison procedure and statistical test, used in conjunction with an ANOVA to find means that differ significantly from each other

• The output is not very easy to interpret…

andre, at, av, bare, barn, barna, bo, da, de, den, det, du, eller, en, enn, er, et, for, fordi, fra, han, har, hun, i, ikke, jeg, kan, liker, man, mange, med, meg, men, mennesker, mer, min, mye, norge, norsk, når, og, også, om, på, skal, som, sted, så, til, veldig, venner, vi, viktig, være, å
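To make the feature set concrete: each text can be represented as a vector with one value per feature word. The sketch below is only an illustration. The tokenizer, the per-1,000-token normalization and the name `feature_vector` are assumptions, not details reported for the study.

```python
import re
from collections import Counter

# The 55 feature words listed in the box above.
FEATURES = ("andre at av bare barn barna bo da de den det du eller en enn er et "
            "for fordi fra han har hun i ikke jeg kan liker man mange med meg men "
            "mennesker mer min mye norge norsk når og også om på skal som sted så "
            "til veldig venner vi viktig være å").split()

def feature_vector(text, per=1000):
    """Return the rate of each feature word per `per` tokens (assumed normalization)."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    n = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[w] * per / n for w in FEATURES]

vec = feature_vector("Jeg skal lage middag i kveld og jeg liker det")
print(dict(zip(FEATURES, vec))["jeg"])  # 2 of 10 tokens -> 200.0
```

Vectors of this kind, one per text and labelled with the writer's L1, are the sort of input a classifier or an ANOVA per feature would work on.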

Page 5

SH EN PL DE NO RU
               X
   Y  Y  Y  Y
X  X  X

             Df Sum Sq Mean Sq F value   Pr(>F)
myData$L1     5   1790   358.1   10.11 2.65e-09 ***
Residuals   594  21044    35.4
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = myData[, X] ~ myData$L1)

$`myData$L1`
        diff        lwr         upr     p adj
en-de -1.373 -3.7796269  1.03362692 0.5781845
no-de  0.032 -2.3746269  2.43862692 1.0000000
pl-de -0.239 -2.6456269  2.16762692 0.9997514
ru-de  3.186  0.7793731  5.59262692 0.0023298
sh-de -2.434 -4.8406269 -0.02737308 0.0456381
no-en  1.405 -1.0016269  3.81162692 0.5528485
pl-en  1.134 -1.2726269  3.54062692 0.7583997
ru-en  4.559  2.1523731  6.96562692 0.0000013
sh-en -1.061 -3.4676269  1.34562692 0.8063672
pl-no -0.271 -2.6776269  2.13562692 0.9995400
ru-no  3.154  0.7473731  5.56062692 0.0026907
sh-no -2.466 -4.8726269 -0.05937308 0.0409536
ru-pl  3.425  1.0183731  5.83162692 0.0007589
sh-pl -2.195 -4.6016269  0.21162692 0.0969624
sh-ru -5.620 -8.0266269 -3.21337308 0.0000000

   sh    en    pl    de    no    ru
2.806 3.867 5.001 5.240 5.272 8.426

feature: den
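The F value in the ANOVA table can be checked by hand: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. The short Python check below uses only the figures printed in the table; the variable names are illustrative.

```python
# Figures copied from the ANOVA table for the feature "den".
df_between, ss_between = 5, 1790     # row myData$L1
df_within, ss_within = 594, 21044    # row Residuals

ms_between = ss_between / df_between   # mean square for L1
ms_within = ss_within / df_within      # residual mean square
f_value = ms_between / ms_within

print(round(ms_between, 1), round(ms_within, 1), round(f_value, 2))
```

The mean square for L1 comes out marginally lower than the printed 358.1, presumably because the printed sums of squares are themselves rounded.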

NOTE:

Tukey’s HSD was performed for groups of six L1s at a time. There were six such “groups of six”:

– DE, EN, PL and RU were always included (along with the control group NO)

– NL, SH, SP, SO, SQ and VI were each added in turn

– The example above shows the homogeneity table for the group of L1s that includes SH

– Examples to follow (including the next one) contain up to six homogeneity tables at once
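The six “groups of six” described in the note can be written out explicitly. A minimal Python sketch (the list names are illustrative):

```python
# L1s present in every comparison, plus the native-speaker control group NO.
CORE = ["DE", "EN", "PL", "RU", "NO"]
# L1s that are added one at a time.
EXTRA = ["NL", "SH", "SP", "SO", "SQ", "VI"]

groups_of_six = [CORE + [extra] for extra in EXTRA]

for group in groups_of_six:
    print(" ".join(group))
```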

Essence represented visually as a “homogeneity table”
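A homogeneity table can be derived mechanically from the Tukey output: sort the groups by mean and collect the maximal runs of neighbouring groups in which no pair differs significantly. The Python sketch below does this for den, using the means and adjusted p-values printed on this slide. The 0.05 threshold matches the 95% family-wise level shown there; the function names are illustrative and this is not necessarily how the original tables were produced.

```python
# Group means and adjusted p-values copied from the Tukey output for "den".
means = {"sh": 2.806, "en": 3.867, "pl": 5.001, "de": 5.240, "no": 5.272, "ru": 8.426}
p_adj = {
    ("en", "de"): 0.5781845, ("no", "de"): 1.0000000, ("pl", "de"): 0.9997514,
    ("ru", "de"): 0.0023298, ("sh", "de"): 0.0456381, ("no", "en"): 0.5528485,
    ("pl", "en"): 0.7583997, ("ru", "en"): 0.0000013, ("sh", "en"): 0.8063672,
    ("pl", "no"): 0.9995400, ("ru", "no"): 0.0026907, ("sh", "no"): 0.0409536,
    ("ru", "pl"): 0.0007589, ("sh", "pl"): 0.0969624, ("sh", "ru"): 0.0000000,
}

def differs(a, b, alpha=0.05):
    """True if the pair differs significantly at the given level."""
    p = p_adj.get((a, b), p_adj.get((b, a)))
    return p < alpha

def homogeneous_subsets(means, alpha=0.05):
    """Maximal runs of neighbouring groups (sorted by mean) with no significant pair."""
    order = sorted(means, key=means.get)
    runs = []
    for i in range(len(order)):
        j = i
        # Extend the run while the next group differs from none of its members.
        while j + 1 < len(order) and not any(
            differs(x, order[j + 1], alpha) for x in order[i:j + 1]
        ):
            j += 1
        run = order[i:j + 1]
        # Keep the run only if it is not contained in the previous one.
        if not runs or not set(run) <= set(runs[-1]):
            runs.append(run)
    return runs

for subset in homogeneous_subsets(means):
    print(" ".join(s.upper() for s in subset))
```

For den this gives three subsets (SH EN PL; EN PL DE NO; RU), which correspond to the three rows of X and Y marks in the table.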

Page 6

#1 NL speakers overuse skal

• Finite form of modal auxiliary skulle; used to form the future tense
han skal lage middag i kveld
‘he will make dinner tonight’
– Other methods:
• non-past: han lager middag i kveld
• construction komme til + infinitive

• Recognized tendency for beginners to overuse this form
– Partly due to overly simplistic explanations in teaching materials
• “Futurum lager vi av skal + infinitiv” (‘We form the future with skal + infinitive’) (Greftegreff 1985)

• Analysis shows that skal is overused by NL, SH, SO, SQ and VI learners

Homogeneity tables for skal (columns: RU DE EN NO PL + sixth L1): RU, DE, EN, NO and PL always share one subset (X). NL, SH, SO, SQ and VI each form a separate subset (Y); SP falls in the same subset (X) as the other five.

? proficiency
? thematic bias
? transfer

Page 7

Proficiency?

• We have CEFR ratings for 7 of the 10 L1 groups (not NL, SH, SQ)
– VI and SO score lowest
– DE and EN score highest

• For these 7 L1 groups, overuse of skal thus correlates with linguistic and/or cultural distance
– VI and SO communities in Norway originated as refugees
– If lower proficiency explains overuse of skal for VI and SO, chances are that it also does so for SH and SQ
– But this does not explain the NL case

• So could the reason for NL users’ overuse be thematic bias?

[Chart: distribution of CEFR ratings (A2, A2/B1, B1, B1/B2, B2, B2/C1, C1) on a scale of 0 to 100 for the L1 groups SO, VI, SP, RU, PL, EN and DE]

Page 8

Thematic bias?

• Some topics are more concerned with future events than others
– Over half the occurrences of skal are in 6 of the 46 topics

• Cf. occurrences per text (“freq”) with the topic held constant
– 4.9 (NL) >> 2.9 (SP)
– 1.3 (NL) >> 0.5 (EN) and 0.6 (SP)
– 1.1 (NL) >> 0.7 (DE) and 0.4 (EN)

• Even with the topic held constant, the tendency is clear
• Thematic bias can thus be ruled out

                                  DE            EN            NL            SP
Topic                             wc  tc  freq  wc  tc  freq  wc  tc  freq  wc  tc  freq
Framtida                          -   -   -     -   -   -     39  8   4.9   29  10  2.9
Bomiljø                           -   -   -     20  38  0.5   21  16  1.3   14  23  0.6
Bolig og bosted                   -   -   -     -   -   -     13  9   1.4   -   -   -
Frivillig hjelp i organisasjoner  2   5   0.4   -   -   -     9   2   4.5   -   -   -
Nyheter                           7   10  0.7   4   9   0.4   8   7   1.1   2   -
Reise                             -   -   -     -   -   -     8   14  0.6   -   -   -

Topic glosses: Framtida ‘the future’; Bomiljø ‘living environment’; Bolig og bosted ‘housing and place of residence’; Frivillig hjelp i organisasjoner ‘voluntary work in organizations’; Nyheter ‘news’; Reise ‘travel’.
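The freq column can be reproduced from the other two columns if wc is read as the number of occurrences of skal and tc as the number of texts. That reading is an assumption, but it matches every complete cell in the table. A short Python sketch (incomplete cells are left out; the dictionary layout is illustrative):

```python
# (occurrences of skal, number of texts) per topic and L1, copied from the table;
# cells shown as "-" or left incomplete in the table are omitted.
counts = {
    "Framtida": {"NL": (39, 8), "SP": (29, 10)},
    "Bomiljø": {"EN": (20, 38), "NL": (21, 16), "SP": (14, 23)},
    "Bolig og bosted": {"NL": (13, 9)},
    "Frivillig hjelp i organisasjoner": {"DE": (2, 5), "NL": (9, 2)},
    "Nyheter": {"DE": (7, 10), "EN": (4, 9), "NL": (8, 7)},
    "Reise": {"NL": (8, 14)},
}

def freq(wc, tc):
    """Occurrences per text, rounded to one decimal as on the slide."""
    return round(wc / tc, 1)

for topic, by_l1 in counts.items():
    row = ", ".join(f"{l1} {freq(wc, tc)}" for l1, (wc, tc) in by_l1.items())
    print(f"{topic}: {row}")
```

Several cells rest on very few texts (for example, 2 texts for NL under Frivillig hjelp i organisasjoner), so the per-topic comparison is worth reading alongside the text counts.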

Page 9

Cross-linguistic explanation

• In NL the future tenses are formed with the auxiliary zullen
hij zal het diner vanavond maken
‘he will make dinner tonight’

• NL zullen cognate with skulle – finite form zal similar in form to skal
– EN shall also cognate with skal and similar in form, but much less frequent in EN than ’ll, will and going to
– DE werden is neither cognate nor similar in form

• Conclusion: Strong tendency for NL speakers to overuse skal appears to be a case of formal lexical transfer
– Caveat: NL has other means to express future action, including the non-past tense (hij maakt het diner vanavond) and the auxiliary gaan
– Further investigation of relative frequencies necessary in order to confirm or disconfirm possible transfer effects

➔ Is there anything else that should be considered???

Page 10

#2 DE speakers overuse en

• Speakers of Slavic languages use the indefinite articles en (m.) and et (n.) much less frequently than learners from other L1 backgrounds
– Also applies to SO, SQ and VI, as expected

• But why do DE speakers use the masculine form en more than everyone else?
– DE forms ein (m., n.), eine (f.) bear strong formal resemblance to en
– Tendency to use en instead of et because of this?
– Detailed error analysis required

• Hypothesis
– That DE speakers commit errors of type
<sic type="W" corr="et"><word>en</word></sic>
more frequently than other L1 groups

➔ Comments???
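The hypothesis could be tested by counting this error type per L1 group in the error-annotated corpus. The Python sketch below shows one way to do so with the standard library. Only the `<sic>` / `<word>` markup is taken from the slide; the `<corpus>` and `<text l1="...">` wrappers and the sample sentences are invented for illustration and may not match the real ASK file format.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical mini-corpus: only the <sic>/<word> markup follows the slide;
# the <corpus>/<text l1="..."> wrapper and the sentences are invented.
SAMPLE = """
<corpus>
  <text l1="de"><word>jeg</word> <word>har</word>
    <sic type="W" corr="et"><word>en</word></sic> <word>hus</word></text>
  <text l1="en"><word>jeg</word> <word>har</word> <word>et</word> <word>hus</word></text>
  <text l1="de"><word>vi</word> <word>har</word>
    <sic type="W" corr="et"><word>en</word></sic> <word>barn</word></text>
</corpus>
"""

def count_en_for_et(xml_string):
    """Count <sic type="W" corr="et"> corrections of the word 'en' per L1."""
    errors = Counter()
    for text in ET.fromstring(xml_string).iter("text"):
        for sic in text.iter("sic"):
            word = sic.findtext("word", default="").strip().lower()
            if sic.get("type") == "W" and sic.get("corr") == "et" and word == "en":
                errors[text.get("l1")] += 1
    return errors

print(count_en_for_et(SAMPLE))
```

For a fair comparison the raw counts would still need to be normalized, for instance per text or per occurrence of en, before the L1 groups are compared.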

[Homogeneity tables for en; L1s in each group of six ordered from lowest to highest mean use]
PL RU EN NO NL DE
PL SH RU EN NO DE
PL RU SO EN NO DE
PL RU SP EN NO DE
PL RU SQ EN NO DE
PL VI RU EN NO DE

Page 11

#3 EN speakers overuse et

• Cross-linguistic explanation?
– Avoidance of en (as indefinite article) due to identification with the numeral ‘one’?
– Greater similarity between EN ‘a’ [ə] and NO et (short vowel, unvoiced dental plosive) than between ‘a’ and NO en (formal lexical transfer)?

• Greater similarity between en and EN ‘an’, but ‘an’ much less frequent than ‘a’
– Wiktionary rankings #102 and #5 respectively
– ‘a’ occurs 11 times more often than ‘an’
– Evidence that frequency constrains transfer?

• Conclusion: L1 transfer appears to be at work when EN speakers overuse et

➔ But how can this be proved beyond doubt???

[Homogeneity tables for et; L1s in each group of six ordered from lowest to highest mean use]
RU PL DE NL NO EN
RU PL SH DE NO EN
SO RU PL DE NO EN
RU PL DE SP NO EN
RU PL SQ DE NO EN
RU PL DE VI NO EN

Page 12

#4 PL and RU speakers: den and det

• These are 3SG pronouns, demonstratives, and (preposed) definite articles

• RU speakers use den (m.) significantly more often than all other L1 groups, including PL speakers

• PL speakers use det (n.) significantly more often than RU speakers
– Absolute usage figures:
• den PL 122, RU 166 (~40:60)
• det PL 668, RU 496 (~60:40)

➔ Why???

➔ How can we find out???
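The approximate ratios follow directly from the absolute figures. A small Python check (the function name `shares` is illustrative):

```python
# Absolute usage figures from the slide.
usage = {"den": {"PL": 122, "RU": 166}, "det": {"PL": 668, "RU": 496}}

def shares(by_l1):
    """Percentage share of each L1 in the combined count, rounded to whole numbers."""
    total = sum(by_l1.values())
    return {l1: round(100 * n / total) for l1, n in by_l1.items()}

for word, by_l1 in usage.items():
    print(word, shares(by_l1))
```

The exact shares are about 42:58 for den and 57:43 for det, slightly less extreme than the rounded 40:60 and 60:40. Raw counts of this kind compare directly only if both groups contributed the same amount of text.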

NOTE:
• 3SG personal pronouns are identical in PL (on, ona, ono) and RU (он, она, оно)
• Demonstrative pronouns
– PL ten, ta, to
– RU этот, это, эта

$den
SH EN PL DE NO RU
               X
   Y  Y  Y  Y
X  X  X

$det
NO RU SH EN DE PL
      X  X  X  X
   Y  Y  Y  Y
X  X  X  X

Page 13

#5 EN speakers overuse er

• EN speakers use er ‘is, are’ significantly more than all other L1 groups (except PL and SH)

• Most likely explanation: formal transfer
– formal resemblance er [æɾ] ~ are [ɑ(ɹ)]

     EN         NO
     sg   pl    sg  pl
1.   am   are   er  er
2.   are  are   er  er
3.   is   are   er  er

• High salience of ‘to be’ in English (not least because of present continuous)
– And yet, the English-Norwegian Parallel Corpus (ENPC) shows finite forms of NO være to be more frequent than finite forms of EN be
• 8,182 vs. 6,566 occurrences

➔ So how to explain EN overuse???

[Homogeneity tables for er; L1s in each group of six ordered from lowest to highest mean use]
RU NO NL DE PL EN
RU NO DE PL SH EN
SO RU NO DE PL EN
RU NO DE SP PL EN
RU NO SQ DE PL EN
RU VI NO DE PL EN

Page 14

#6 While RU speakers underuse er

• PL and SH speakers use er more than RU speakers
– Even though all three are Slavic languages

• PL and SH have a copula in the present tense (być and бити ~ biti)
PL  dom jest tam
SH  кућа је тамо ~ kuća je tamo
‘the house is there’

• RU no longer has such a copula
RU  дом _ там
‘the house is there’

➔ Case proved???

[Homogeneity tables for er; L1s in each group of six ordered from lowest to highest mean use]
RU NO NL DE PL EN
RU NO DE PL SH EN
SO RU NO DE PL EN
RU NO DE SP PL EN
RU NO SQ DE PL EN
RU VI NO DE PL EN

Page 15

#7 Many L1 groups underuse være

Underuse by RU, SH, SO, SQ and VI

Possible cross-linguistic explanations:

RU  no copula in present tense
VI  copula là not used with adjectives (because adjectives are verbal), thus:
    Mai là sinh viên ‘Mai is (a) student’
    but
    Mai cao ‘Mai is tall’
SH  copula exists but little used due to contact with other Balkan languages
SO  yahay ‘to be’ contracts with adjectives, losing its root (-ah-) in the process
SQ  no infinitives (është is finite form)

➔ Case proved???

[Homogeneity tables for være; L1s in each group of six ordered from lowest to highest mean use]
RU NL DE PL NO EN
SH RU DE PL NO EN
SO RU DE PL NO EN
RU DE PL NO SP EN
SQ RU DE PL NO EN
VI RU DE PL NO EN

Page 16

#8 But EN speakers overuse være

• Overuse by EN speakers
– Difference is statistically significant w.r.t. RU, SH, SO, SQ and VI

• Difference w.r.t. NO not statistically significant, but still noticeable
– In the English-Norwegian Parallel Corpus (ENPC), be occurs much more frequently in English texts (both fiction and non-fiction) than være does in Norwegian texts
• be: 3,126 occurrences
• være: 1,193 occurrences
– Worthy of a more detailed investigation using ENPC

➔ Alternative explanations?

[Homogeneity tables for være; L1s in each group of six ordered from lowest to highest mean use]
RU NL DE PL NO EN
SH RU DE PL NO EN
SO RU DE PL NO EN
RU DE PL NO SP EN
SQ RU DE PL NO EN
VI RU DE PL NO EN

Page 17

#9 Prepositions i and på

• Preposition på ‘on’
– EN (overuse) vs. DE (underuse)
– Investigate using error analysis
– Check type and token frequencies of constructions in which corresponding L1 forms (on and auf) are congruent in one L1 but not the other, e.g.:
– NO på søndag ≡ EN on Sunday but ≠ DE am Sonntag
whereas
– NO på engelsk ≡ DE auf Englisch but ≠ EN in English

• Preposition i ‘in’
– RU (overuse) vs. PL (underuse)
– Investigate using error analysis

➔ Any suggestions???

[Homogeneity tables for i and på; L1s in each group of six ordered from lowest to highest mean use]

$i
PL EN DE NO NL RU
PL EN DE SH NO RU
PL EN DE SO NO RU
PL EN SP DE NO RU
PL EN DE NO SQ RU
PL EN DE NO VI RU

$på
DE RU NO NL PL EN
DE RU NO SH PL EN
SO DE RU NO PL EN
DE RU NO SP PL EN
DE SQ RU NO PL EN
DE RU NO VI PL EN

Prepositions, especially spatial prepositions, are renowned for being “among the hardest expressions to acquire when learning a second language” (Coventry & Garrod 2004: 4) and they have already been the subject of some interesting work based on ASK (Szymanska 2010; Malcher 2011).

Page 18

#10 Prepositions til and fra

• Preposition til ‘to’
– underused by all L1 groups, especially DE, SH and SQ
– …

• Preposition fra ‘from’
– used significantly more often by EN speakers than by PL or native speakers
– …

➔ Any suggestions here???

[Homogeneity tables for til and fra; L1s in each group of six ordered from lowest to highest mean use]

$til
DE RU PL NL EN NO
SH DE RU PL EN NO
DE RU SO PL EN NO
DE RU SP PL EN NO
SQ DE RU PL EN NO
DE RU PL VI EN NO

$fra
NO PL DE NL RU EN
NO PL SH DE RU EN
NO PL DE SO RU EN
NO PL DE SP RU EN
NO PL DE SQ RU EN
NO PL DE VI RU EN

Page 19

#11 Underuse and overuse of og

• Striking contrast between PL speakers (underuse) and RU speakers (overuse) of og ‘and’
– Cannot be formal transfer, since PL i and RU и are phonologically identical

• Different token frequencies in L1s?
– Wiktionary frequency lists (WFREQ)*
• RU и ranked as #1
• PL i ranked as #2 (after w ‘in’)
– Raw frequencies not comparable in WFREQ
• Zipfian distribution?

• Requires further investigation

➔ Your suggestions???
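The Zipfian question can be made concrete. Under an idealized Zipf's law, a word's share of all tokens is proportional to 1/rank^s, so ranks alone yield only a rough estimate of relative frequency. The Python sketch below assumes an exponent s = 1 and an arbitrary vocabulary size; neither value comes from the WFREQ lists.

```python
def zipf_share(rank, vocab_size=10_000, s=1.0):
    """Estimated share of all tokens for the word at `rank` under Zipf's law."""
    normalizer = sum(1 / r**s for r in range(1, vocab_size + 1))
    return (1 / rank**s) / normalizer

ru_share = zipf_share(1)  # RU и, ranked #1
pl_share = zipf_share(2)  # PL i, ranked #2
print(round(ru_share / pl_share, 2))  # the ratio equals 2**s whatever the vocab_size
```

Real frequency lists often deviate from this idealization at the very top ranks, so measured token frequencies from comparable PL and RU corpora would still be needed to answer the question.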

[Homogeneity tables for og; L1s in each group of six ordered from lowest to highest mean use]
PL DE NL EN NO RU
PL DE SH EN NO RU
PL DE EN SO NO RU
PL SP DE EN NO RU
PL SQ DE EN NO RU
VI PL DE EN NO RU

* http://en.wiktionary.org/wiki/Wiktionary:FREQ

Page 20

#12 Overuse and underuse of eller

• DE and EN speakers overuse eller ‘or’
– Difference w.r.t. NL is highly significant
• This seems odd. (Are the Dutch more decisive than the English and Germans?)
– Difference between DE and NO also significant
– Frequency related?
• Mutual correspondence between NO eller and EN ‘or’ is 84%

• RU speakers underuse eller
– Strong formal resemblance with или (ili)

• Possible cross-linguistic explanation
– или has a more restricted distribution
– Not used in negative contexts
он не любит ни футбол, ни теннис
‘he doesn’t like football or tennis’

[Homogeneity tables for eller; L1s in each group of six ordered from lowest to highest mean use]
RU NO NL PL EN DE
RU SH NO PL EN DE
RU SO NO PL EN DE
RU NO PL SP EN DE
RU SQ NO PL EN DE
RU VI NO PL EN DE

Page 21

#13 More general questions

• Misclassification can also be revealing
– Texts written by EN learners are more often misclassified as SP than as NL or DE, despite EN being more closely related to the latter

➔ Why???

– Texts by SO and SQ learners are most often misclassified as RU, whilst texts by VI learners are most often misclassified as PL

➔ Again, why???

• All 12 patterns discussed above pertain to the Indo-European languages most closely related to NO (DE, EN, NL; PL, RU)
– There are no really clear-cut predictors for the most distantly related L1s, i.e. SO, SQ and VI

➔ Why???
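Misclassification patterns of this kind can be read off the classifier's output by tallying, for each true L1, which wrong label is assigned most often. The Python sketch below shows the idea. The labels in the example are invented purely for illustration and are not results from the study.

```python
from collections import Counter

def most_common_confusions(true_labels, predicted_labels):
    """For each true L1, return the wrong label it is most often assigned (if any)."""
    confusions = {}
    for true, pred in zip(true_labels, predicted_labels):
        if true != pred:
            confusions.setdefault(true, Counter())[pred] += 1
    return {l1: c.most_common(1)[0][0] for l1, c in confusions.items()}

# Invented labels for illustration only; not results from the study.
true_l1 = ["EN", "EN", "EN", "EN", "VI", "VI", "VI"]
pred_l1 = ["EN", "SP", "SP", "DE", "PL", "PL", "VI"]
print(most_common_confusions(true_l1, pred_l1))
```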

Page 22

Conclusion

• Discriminant analysis reveals subtle patterns of L2 usage that would otherwise go undetected

• Homogeneity tables based on Tukey’s HSD can help us understand those patterns

• Contrastive analysis is required in order to confirm that the patterns are due to cross-linguistic influence

• All 13 issues discussed in this chapter are suitable topics for further research using ASK

• This study has merely scratched the surface…

Page 23

13 research questions

1. Why do NL speakers overuse skal?

2. Why do DE speakers overuse en?

3. Why do EN speakers overuse et?

4. Why do PL and RU speakers differ so much in their use of den and det?

5. Why do EN speakers overuse er?

6. Why do RU speakers underuse er?

7. Why do many L1 groups underuse være?

8. Why do EN speakers, on the other hand, overuse være?

9. Why do EN speakers overuse på, while DE speakers underuse it? And why do RU speakers overuse i, while PL speakers underuse it?

10. Why do all L1 groups underuse til, and why do EN speakers overuse fra?

11. Why do PL and RU speakers differ so markedly in their use of og?

12. Why do EN and DE speakers overuse eller and why do RU speakers underuse it?

13. What lies behind the misclassification patterns, and why are there no good predictors for SO, SQ and VI?

Page 24

References

Donaldson, Bruce. 1997. Dutch: A Comprehensive Grammar. London: Routledge.
Granger, Sylviane. 1996. From CA to CIA and back: An integrated approach to computerized bilingual and learner corpora. In Karin Aijmer, Bengt Altenberg and Mats Johansson (eds.) Languages in Contrast. Papers from a symposium on text-based cross-linguistic studies. Lund 4–5 March 1994. Lund: Lund University Press [Lund Studies in English 88], 37–51.
Greftegreff, Liv Astrid. 1985. Enkel norsk grammatikk. Oslo: NKS-Forlaget.
Husby, Olaf. 1999. En kort innføring i albansk. Trondheim: Tapir.
Husby, Olaf. 2001. En kort innføring i somali. Trondheim: Tapir.
Jarvis, Scott. 2010. Comparison-based and detection-based approaches to transfer research. EUROSLA Yearbook 10, 169–192.
Jarvis, Scott & Scott A. Crossley (eds.) 2012. Approaching Language Transfer through Text Classification. Explorations in the detection-based approach. Bristol: Multilingual Matters.
Koolhoven, H. 1961. Teach yourself Dutch. London: The English Universities Press.
Lie, Svein. 2005. Kontrastiv grammatikk – med norsk i sentrum, 3rd Edition. Oslo: Novus.
Malcher, Jenny. 2011. Jeg liker å treffe folk i café. Man må nyter de fine tingene på verden! Preposisjoner og morsmålstransfer – en korpusbasert studie med i og på i fokus. Master’s thesis, Department of Linguistics and Scandinavian Studies, University of Oslo.
Mønnesland, Svein. 1990. Serbokroatisk-norsk kontrastiv grammatikk. In Hvenekilde, Anne (ed.) Med to språk: Fem kontrastive språkstudier for lærere. Oslo: Cappelen.
Saeed, John Ibrahim. 1993. Somali Reference Grammar, 2nd Edition. Kensington, MD: Dunwoody Press.
Szymanska, Oliwia. 2010b. A conceptual approach towards the use of prepositional phrases in Norwegian – the case of i and på. Folia Scandinavica 11, 173–183.
Wade, Terence. 2011. A Comprehensive Russian Grammar. Malden, MA: Wiley.
Wiull, Hans Olaf. 2007. Bli bedre i norsk – se forskjellene mellom norsk og vietnamesisk. Oslo: VOX.