
Grammar Profile for Spoken Learner Data

By Brendan Flanagan (1), Emiko Kaneko (2), Emi Izumi (3), Sachio Hirokawa (4)

(1) Kyushu University, JSPS Research Fellow
(2) Aizu University
(3) Doshisha University
(4) Kyushu University

Overview

• Introduction
• Equivalent Proficiency Levels
• Grammar Pattern Item Dataset
• SVM & Optimal Feature Selection
• Characteristic Grammar Profiles
  • A1 vs A2
  • A2 vs B1
  • B1 vs B2
• Conclusion

Introduction

Equivalent Proficiency Levels: The NICT-JLE Corpus and CEFR-J

The NICT-JLE Corpus is made up of 1280 transcripts of the ACTFL-ALC SST (Standard Speaking Test), an English oral proficiency interview test.

There are 9 proficiency levels based on the SST scoring method.

Equivalent Proficiency Levels: The NICT-JLE Corpus and CEFR-J

SST Level 4 is categorized as CEFR-J Level A2 (in this presentation).

Target Proficiency Levels:

CEFR-J: A1, A2, B1, B2

CEFR-J Level   # Samples (SST 4 as CEFR-J A1)   # Samples (SST 4 as CEFR-J A2)
A1             236                              257
A2             738                              717
B1             263                              263
B2             40                               40

Grammar Pattern Item Dataset

•  The NICT JLE corpus exam and data structure:

Stage   Task   Follow-up
1
2       ●      ●
3       ●      ●
4       ●      ●
5

•  Each section was preprocessed to count the occurrences of 493 grammar patterns, e.g.:

Grammar pattern                                          # 00015   # 00253   # 00287
1: Personal pronoun, nominative (I) + be: I am           2         2         4
1-1: Personal pronoun, nominative (I) + be: I am not     0         0         0
1-2: Personal pronoun, nominative (I) + be: Am I ...?    0         0         0
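The counting step can be sketched as follows. This is a minimal illustration only: the slides do not give the matching rules used for the corpus, so the regular expressions, the `GRAMMAR_PATTERNS` list, the `count_patterns` helper, and the sample sentence are all hypothetical. In particular, it assumes that "I am not" is counted under pattern 1-1 only and not also under pattern 1.

```python
import re

# Hypothetical matching rules for three of the 493 grammar patterns.
# The real rules behind the corpus counts are not given in the slides.
GRAMMAR_PATTERNS = [
    ("1: I am", re.compile(r"\bI am\b(?! not\b)", re.IGNORECASE)),
    ("1-1: I am not", re.compile(r"\bI am not\b", re.IGNORECASE)),
    ("1-2: Am I ...?", re.compile(r"\bam I\b[^.?!]*\?", re.IGNORECASE)),
]


def count_patterns(section_text):
    """Return one occurrence count per grammar pattern, in a fixed order."""
    return [len(regex.findall(section_text)) for _, regex in GRAMMAR_PATTERNS]


# Made-up section text, used only to exercise the counters.
section = "I am a student. I am not tired. Am I late? I am from Fukuoka."
for (name, _), count in zip(GRAMMAR_PATTERNS, count_patterns(section)):
    print(f"{name}\t{count}")
```

Because the counts come back in a fixed order, the list returned for each section can be used directly as one row of the feature matrix.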


•  The "Follow-up" sections were excluded from the analysis because they contain free dialog.

•  The Task sections are the target sections for analysis.


SVM & Grammar Item Dataset

•  The preprocessed dataset was then vectorized to create a special-purpose search engine using GETA [1].

•  The dataset was divided into randomly selected parts to evaluate the classification performance of SVM models by 10-fold cross-validation.

•  The SVMlight [2] linear kernel was used to train and test models.

•  To rank the importance of grammar items for feature selection, an SVM model was initially trained using all features.

•  The SVM model score for each individual grammar item w_i was analyzed to determine the weight(w_i) ranking.

[1] http://geta.cs.nii.ac.jp
[2] http://svmlight.joachims.org
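The fold splitting and the weight ranking described above might look roughly like this. It is a sketch under stated assumptions, not the authors' pipeline: it does not call GETA or SVMlight, the helper names (`k_fold_indices`, `train_test_splits`, `rank_by_weight`) are hypothetical, and the weights in the example are invented. It also assumes the ranking is by signed weight, so that the most positive items point toward one level of a binary classifier and the most negative items toward the other; the slides do not say whether signed or absolute weights were used.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_samples, k=10, seed=0):
    """Yield (train, test) index lists, holding out one fold at a time."""
    folds = k_fold_indices(n_samples, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


def rank_by_weight(weights, top_n=3):
    """Return the top_n most positive and the top_n most negative items."""
    ordered = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    positive = [name for name, w in ordered[:top_n] if w > 0]
    negative = [name for name, w in reversed(ordered[-top_n:]) if w < 0]
    return positive, negative


# Invented weights for the three example grammar patterns (illustration only).
example_weights = {"1": 0.8, "1-1": -0.5, "1-2": 0.1}
positive_items, negative_items = rank_by_weight(example_weights, top_n=2)
print("positive:", positive_items)
print("negative:", negative_items)
```

In a full run, each `(train, test)` pair would be passed to the SVM trainer, and the weights of the model trained on all features would be fed to `rank_by_weight` to choose which grammar items to keep.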

SVM & Optimal Feature Selection

Grammar Profiles: Extracting Characteristics

Analysis by SVM

Grammar Profiles: Classification

Grammar Profiles: Extracting Characteristics

Visualization

Grammar Profiles: Visualizing Characteristics

Conclusion

• We classified the English proficiency levels of data in a spoken learner corpus using SVM.
• Characteristic grammar items for each CEFR-J level were extracted.
• To aid interpretation of the results, we visualized grammar item features with a decision tree.
• In future work, we will extract the error features of spoken learner data.
