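osakar_4: Introduction to Corpus Analysis with R

Introduction to Corpus Analysis with R

Yuichiro Kobayashi

(Osaka University / Japan Society for the Promotion of Science)

December 2, 2010 (Sat), Osaka University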

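Self-introduction

• Yuichiro Kobayashi

– Graduate School of Language and Culture, Osaka University

– Areas of interest (not "areas of expertise"): corpus linguistics and statistical text mining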

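R and Me

• How I met R

– 2004: Learned that it existed (because I could not afford SPSS)

– 2005: Taught myself from Funao & Takanami (2005) and Nakazawa (2003)

– Five years on, being a humanities person, I have not improved much...

• Involvement with the R community

– 2010/04/09 "Introduction to Statistical Testing with R", Osaka.R#2

– 2010/06/19 "Introduction to Plotting with R", Nagoya.R#3

– 2010/06/19 "Introduction to Discriminant Analysis with R", Nagoya.R#3

– 2010/06/26 "Introduction to Text Mining with R", Osaka.R#3

(I have also attended Tokyo.R and Shiga.R)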

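Today's topic

• Introduction to Corpus Analysis with R

• A sequel, or companion piece, to the previous talk, "Introduction to Text Mining with R"

• I want to do text processing and corpus analysis using only R's basic functions (without relying on external software or packages)!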

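Computer-based language research

[Figure: linguistics and informatics, with the fields of corpus linguistics, computational linguistics, natural language processing, and text mining]

Linguistics topics shown: phonetics/phonology, script/orthography, vocabulary, grammar/constructions, meaning, discourse/style, varieties, translation/contrastive studies, acquisition/education

Informatics topics shown: databases, information retrieval, artificial intelligence, data mining, statistics/probability, machine learning, web mining, visualization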

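What is a corpus?

• Corpus

– Machine-readable

– Authentic

– Representative (balanced)

(McEnery et al., 2006)

* A corpus (in the narrow sense) is different from a mere archive or database

• Corpus linguistics

– "Linguistics in general that analyzes and describes language by searching computer-processable electronic corpora" (Saito et al., 2004)

– (1) linguistic performance rather than competence, (2) description of individual languages rather than elucidation of linguistic universals, (3) quantitative as well as qualitative analysis, (4) empiricist rather than rationalist (Leech, 1992)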

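General workflow of corpus analysis

Data construction → Text processing → Statistical processing → Qualitative analysis

• Data construction: text collection, digitization, etc.

• Text processing: building word lists, extracting examples, etc.

• Statistical processing: significance tests, multivariate analysis, etc.

• Qualitative analysis: interpreting the results, substantive (domain-based) discussion, etc.

I want to do not only the statistical processing but also the text processing that precedes it in R!!

Because I love R☆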

Okay then, let's try it in R!

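Making a word list in R

# Read the file, one line per element (on a Mac, use file.choose() instead of choose.files())
textfile <- scan(choose.files(), what="char", sep="\n", quote="", comment.char="")

# Convert all characters to lowercase
textfile2 <- tolower(textfile)

# Split each line into words at runs of non-word characters
word.list <- strsplit(textfile2, "\\W+")
word.vector <- unlist(word.list)

# Drop the empty strings left over from the split
word.vector <- word.vector[word.vector != ""]

# Build the frequency table, most frequent word first
freq.list <- table(word.vector)
sorted.freq.list <- sort(freq.list, decreasing=TRUE)

# Write to a file, one "word"<TAB>frequency pair per line
freq.table <- data.frame(freq=as.vector(sorted.freq.list), row.names=names(sorted.freq.list))
write.table(freq.table, file="wordlist.txt", sep="\t", col.names=FALSE)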

It looks like this

Input: original text

THE HOUND OF THE BASKERVILLES

By A. Conan Doyle

Chapter 1. Mr. Sherlock Holmes

Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated at the breakfast table. I stood upon the hearth-rug and picked up the stick which our visitor had left behind him the night before. It was a fine, thick piece of wood, bulbous-headed, of the sort which is known as a "Penang lawyer." Just under the head was a broad silver band nearly an inch across. "To James Mortimer, M.R.C.S., from his friends of the C.C.H.," was engraved upon it, with the date "1884." It was just such a stick as the old-fashioned family practitioner used to carry--dignified, solid, and reassuring.

(rest omitted)

Output: word list

"the" 3331
"and" 1628
"of" 1594
"i" 1501
"to" 1408
"a" 1308
"that" 1144
"it" 1010
"he" 919
"in" 911
"you" 834
"was" 803
"his" 692
"is" 624
"have" 542

(rest omitted)
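The word-list steps can also be tried without a file dialog. The following is a minimal sketch, assuming a short in-memory character vector (`sample.text`, a name introduced here purely for illustration) in place of the file read with scan(). It drops empty strings by value rather than by position.

```r
# A few sample lines standing in for the file read by scan() (illustrative data)
sample.text <- c("Mr. Sherlock Holmes, who was usually very late in the mornings,",
                 "was seated at the breakfast table. I stood upon the hearth-rug.")

# Lowercase, then split at runs of non-word characters
words <- unlist(strsplit(tolower(sample.text), "\\W+"))

# Drop any empty strings left by the split
words <- words[nzchar(words)]

# Frequency table, most frequent word first
freq <- sort(table(words), decreasing = TRUE)
print(head(freq, 3))
```

Note that this split pattern breaks "hearth-rug" into two tokens, which is one of the design choices a word list has to make.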

Making a concordance in R (extracting examples)

# Provisional version (the output is not quite right yet)
textfile <- scan(choose.files(), what="char", sep="\n", quote="", comment.char="")

# Example: find the lines that contain "the"
conc <- grep("\\bthe\\b", textfile, ignore.case=TRUE, value=TRUE, perl=TRUE)

# Format the concordance lines by putting a tab on each side of every match
conc2 <- gsub("\\b(the)\\b", "\t\\1\t", conc, ignore.case=TRUE, perl=TRUE)

# Write to a file
cat("PRECEDING_CONTEXT\tMATCH\tSUBSEQUENT_CONTEXT", conc2, file="concordance.txt", sep="\n")


PRECEDING_CONTEXT MATCH SUBSEQUENT_CONTEXT

THE HOUND OF

Mr. Sherlock Holmes, who was usually very late in the mornings, save

at the breakfast table. I stood upon

stick which our visitor had left behind him the night before. It was a

fine, thick piece of wood, bulbous-headed, of the sort which is known as

a "Penang lawyer." Just under the head was a broad silver band nearly

an inch across. "To James Mortimer, M.R.C.S., from his friends of the

C.C.H.," was engraved upon it, with the date "1884." It was just such a

stick as the old-fashioned family practitioner used to carry--dignified,

I think, said I, following as far as I could the methods of my

that to be the Something Hunt,


chair and lighting a cigarette. "I am bound to say that in all the

admiration and to the attempts which I had made to give publicity to

the stick from my hands and examined it for a few minutes with his naked

and carrying the cane to

favourite corner of the settee. "There are certainly one or two

indications upon the stick. It gives us

in noting your fallacies I was occasionally guided towards the truth.

Not that you are entirely wrong in this instance. The man is certainly a

hospital than from a hunt, and that when the initials 'C.C.' are placed

before that hospital the words 'Charing Cross' very naturally suggest

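* Extracting the lines that contain "the" works, but there is room for improvement

1) Should the line breaks be removed before building the concordance lines?

2) What should happen when a single line contains more than one "the"?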
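One possible answer to both questions, offered as a sketch rather than a finished tool: join the lines first, split on whitespace, and emit one row per occurrence of the node word. The function name `kwic` and its `window` argument are hypothetical, introduced only for this illustration, and the sample lines stand in for the text read with scan().

```r
# Hypothetical helper (not from the slides): one concordance row per match
kwic <- function(lines, node, window = 5) {
  # Join the lines so that a context can run across line breaks
  text <- paste(lines, collapse = " ")
  tokens <- unlist(strsplit(text, "\\s+", perl = TRUE))
  tokens <- tokens[nzchar(tokens)]
  # Compare tokens with the node after stripping punctuation and lowercasing
  bare <- tolower(gsub("[^[:alnum:]]", "", tokens))
  hits <- which(bare == tolower(node))
  rows <- vapply(hits, function(i) {
    left  <- tail(tokens[seq_len(i - 1)], window)
    right <- if (i < length(tokens)) head(tokens[(i + 1):length(tokens)], window) else character(0)
    paste(paste(left, collapse = " "), tokens[i], paste(right, collapse = " "), sep = "\t")
  }, character(1))
  c("PRECEDING_CONTEXT\tMATCH\tSUBSEQUENT_CONTEXT", rows)
}

# Illustrative input: two lines, three occurrences of "the"
sample.lines <- c("I stood upon the hearth-rug and picked up the stick which our",
                  "visitor had left behind him the night before.")
conc <- kwic(sample.lines, "the", window = 3)
cat(conc, sep = "\n")
```

Because every match gets its own row, a line with several occurrences of the node word yields several concordance lines, each with exactly three tab-separated fields.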

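Making a collocation table in R (co-occurrence analysis)

# Read the file and convert it to lowercase (on a Mac, use file.choose() instead of choose.files())
corpus.file <- tolower(scan(file=choose.files(), what="char", sep="\n", quiet=TRUE))

# Clean the data: put a space before each punctuation character, then trim leading and trailing whitespace
cleaned.corpus.file <- gsub("([^-a-z0-9\\s])", " \\1", corpus.file, perl=TRUE)
cleaned.corpus.file2 <- gsub("(^\\s+|\\s+$)", "", cleaned.corpus.file, perl=TRUE)

# Split the lines into words
corpus.words.vector <- unlist(strsplit(cleaned.corpus.file2, "([^-a-z0-9]+|--)"))

# Specify the node word (example: "see")
node.word <- "\\bsee\\b"

# Specify the span (example: L3 to R3)
span <- (-3:3)

# Specify the output file (on a Mac, use file.choose() instead of choose.files())
# Prepare the file in advance
output.file <- choose.files()

(continued)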

# Find the positions where the node word occurs
positions.of.matches <- grep(node.word, corpus.words.vector, perl=TRUE)

# Extract the collocates at each position in the span
results <- list()
for (i in 1:length(span)) {
  collocate.positions <- positions.of.matches+span[i]
  # Keep only the positions that fall inside the corpus
  collocate.positions <- collocate.positions[collocate.positions >= 1 & collocate.positions <= length(corpus.words.vector)]
  collocates <- corpus.words.vector[collocate.positions]
  sorted.collocates <- sort(table(collocates), decreasing=TRUE)
  results[[i]] <- sorted.collocates
}

# Write to the file: a header row, then one row per frequency rank
lengths <- sapply(results, length)
cat(paste(rep(c("W_", "F_"), length(span)), rep(span, each=2), sep=""), "\n", sep="\t", file=output.file)
for (k in 1:max(lengths)) {
  output.string <- paste(names(sapply(results, "[", k)), sapply(results, "[", k), sep="\t")
  # Columns that have run out of collocates are left blank
  output.string2 <- gsub("NA\tNA", "\t", output.string, perl=TRUE)
  cat(output.string2, "\n", sep="\t", file=output.file, append=TRUE)
}
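The extraction loop can be tried on a tiny token vector before running it on a whole novel. This is a minimal sketch under stated assumptions: the function name `collocates.by.position`, its arguments, and the sample tokens are all hypothetical, and the node word is matched by exact equality rather than by a regular expression.

```r
# Hypothetical helper (not from the slides): collocates of `node` at each offset in `span`
collocates.by.position <- function(tokens, node, span = -3:3) {
  hits <- which(tokens == node)
  out <- lapply(span, function(offset) {
    pos <- hits + offset
    pos <- pos[pos >= 1 & pos <= length(tokens)]   # stay inside the token vector
    sort(table(tokens[pos]), decreasing = TRUE)
  })
  names(out) <- paste("W", span, sep = "_")
  out
}

# Illustrative tokens: "see" occurs twice, each time followed by "the"
tokens <- c("i", "see", "the", "dog", "and", "you", "see", "the", "cat")
result <- collocates.by.position(tokens, "see", span = -1:1)
print(result[["W_1"]])
```

Returning a named list keeps each span position separate, so the tables can be written out column by column as in the script.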


W_-3        F_-3 W_-2      F_-2 W_-1       F_-1 W_0  F_0  W_1       F_1  W_2       F_2  W_3       F_3
            8    i         9    to         35   see  113  the       15   it        6              7
i           6    you       8    you        17             that      14   he        5    of        6
you         5    did       5    i          16             it        7    the       5    i         5
and         4              4    will       8              how       5    you       4    the       4
was         3    do        4    could      7              what      5              3    and       3
a           2    he        4    and        6              him       4    now       3    is        3
as          2    can       3    can        3              if        4    there     3    my        3
be          2    could     3    ll         2              you       4    through   3    this      3
cried       2    let       3    us         2              a         3    we        3    a         2
for         2    we        3               1              his       3    yes       3    are       2
have        2    as        2    any        1              me        3    a         2    but       2
he          2    expected  2    baronet    1              no        3    both      2    can       2
in          2    glad      2    did        1                        2    henry     2    for       2
it          2    ha        2    distinctly 1              anyone    2    holmes    2    has       2
light       2    hall      2    even       1              anything  2    i         2    it        2
or          2    him       2    he         1              her       2    if        2    there     2
that        2    seemed    2    me         1              of        2    that      2    was       2
the         2    that      2    merely     1              our       2    they      2    were      2
very        2    time      2    must       1              said      2    after     1    again     1
well        2    was       2    never      1              sir       2    and       1    ah        1
ah          1    when      2    not        1              something 2    barrymore 1    am        1
are         1    able      1    should     1              this      2    beast     1    at        1
at          1    and       1    t          1              and       1    beauties  1    barrymore 1
baskerville 1    astounded 1    villain    1              as        1    beneath   1    beauties  1

* How should hyphenated words be handled?

(Is "what is a word?" a question for philosophy?)

* Is it picking up blanks as well???
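A possible explanation for the blanks, assuming the same split pattern as in the script: strsplit() returns a leading empty string whenever a line begins with a separator, such as an opening quotation mark. The sketch below shows this on one sample line and also contrasts two ways of treating a hyphenated word. All variable names and the sample strings are illustrative.

```r
# A line that begins with a separator (here, a quotation mark)
line <- "\"to james mortimer, m.r.c.s.,\" was engraved"
tokens <- unlist(strsplit(line, "([^-a-z0-9]+|--)"))
print(tokens)                            # the first element is an empty string

# Dropping zero-length tokens before counting removes the blank collocates
tokens.clean <- tokens[nzchar(tokens)]
print(tokens.clean)

# Two choices for a hyphenated word
keep.hyphen  <- unlist(strsplit("old-fashioned family", "[^-a-z0-9]+"))
split.hyphen <- unlist(strsplit("old-fashioned family", "[^a-z0-9]+"))
print(keep.hyphen)                       # hyphenated form kept as one token
print(split.hyphen)                      # hyphen treated as a separator
```

Which hyphen treatment is appropriate depends on the research question; the point here is only that the choice is made in the split pattern.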

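References

Gries, S. (2009). Quantitative Corpus Linguistics with R: A Practical Introduction. New York: Routledge.

(This is the source material for this talk... especially Chapter 4...)

間瀬茂 [Mase, S.] (2007). 『Rプログラミングマニュアル』 [R Programming Manual]. 東京: 数理工学社.

(If there is something you do not understand about an R function, look it up here first!)

McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-Based Language Studies: An Advanced Resource Book. New York: Routledge.

(If I had to choose just one introductory book on corpus linguistics, this is it!)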

Thank you for your attention.

Please do tell me about any ways to improve all this

m(_ _)m

@langstat

[email protected]
