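Improving Vietnamese Word Segmentation and POS Tagging using MEM with Various Kinds of Resources
Presenter: Kanji Takahashi, Natural Language Processing Laboratory, Nagaoka University of Technology
Paper: Oanh Thi Tran, Cuong Anh Le, Thuy Quang Ha. Journal of Natural Language Processing, Vol. 17, No. 3, pp. 41-60, The Association for Natural Language Processing, 2010
Literature review, 19 November 2015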


Page 1: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

Improving Vietnamese Word Segmentation and POS Tagging using MEM with Various Kinds of Resources

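Kanji Takahashi, Natural Language Processing Laboratory, Nagaoka University of Technology

Oanh Thi Tran, Cuong Anh Le, Thuy Quang Ha
Journal of Natural Language Processing, Vol. 17, No. 3, pp. 41-60, The Association for Natural Language Processing, 2010

Literature review, 19 November 2015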

Page 2: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

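Overview
• Improves the accuracy of Vietnamese word segmentation and POS tagging by using various resources
• Integrates a dictionary-based model, an N-gram model, and a named entity model into a maximum entropy model (MEM)
• For POS tagging, proposes new features that exploit characteristics of Chinese and Vietnamese
• F-measure: 95.30% for word segmentation, 89.64% for POS tagging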


Page 3: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

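The role of word segmentation
• A basic processing step that applications build on
• Vietnamese word segmentation is inherently difficult
  ➢ Isolating language
  ➢ No conjugation or inflection
  ➢ Written in an alphabetic script
  ➢ No explicit word boundaries (spaces separate syllables, not words)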


Page 4: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

Ông già đi nhanh quá
• [Ông già] [đi] [nhanh quá]
  -> The old man walks too fast
  -> My father walks too fast
• [Ông già] [đi] [nhanh quá]
  -> The old man died too fast
  -> My father died too fast
• [Ông] [già đi] [nhanh quá]
  -> You get old too fast
  -> Grandfather gets old too fast

Many interpretations are possible


Page 5: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

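Related work
• Word segmentation
  ➢ N-gram based (Ha 2003)
  ➢ CRF and SVM (Cam-Tu Nguyen 2007)
  ➢ MEM (Dien and Thuy 2006)
    ◆ An approach that separates unknown word handling from word segmentation
• POS tagging (uses word information)
  ➢ Statistics-based vnQTAG (Huyen 2003)
  ➢ SVM with various features (Minh and Dien 2008)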


Page 6: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

Overview of the related models


Page 7: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

Dictionary-based model
• Typical methods are longest matching and minimum word count

• Longest matching
  Original sent.:               Đó là cách để truyền thông tin
  ×: 1st word-segmented sent.:  Đó là cách để truyền_thông tin
  ○: 2nd word-segmented sent.:  Đó là cách để truyền thông_tin

• Minimum word count
  Original sent.:               Học sinh học sinh học
  ?: 1st word-segmented sent.:  Học_sinh học_sinh học
  ?: 2nd word-segmented sent.:  Học_sinh học sinh_học
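
The two dictionary-based heuristics can be sketched in Python as follows. This is a minimal illustration and not the paper's implementation: the toy `lexicon`, the helper `norm`, the function names, and the 4-syllable cap on candidate words are all assumptions made for the example.

```python
import unicodedata

def norm(text):
    """Lowercase and NFC-normalize so lexicon lookups compare like with like."""
    return unicodedata.normalize("NFC", text).lower()

def longest_matching(syllables, lexicon, max_len=4):
    """Greedy left-to-right pass: at each position take the longest lexicon entry."""
    words, i = [], 0
    while i < len(syllables):
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            cand = syllables[i:i + size]
            # A single syllable is always accepted as a fallback.
            if size == 1 or norm(" ".join(cand)) in lexicon:
                words.append("_".join(cand))
                i += size
                break
    return words

def min_word_count(syllables, lexicon, max_len=4):
    """Dynamic programming: a segmentation with the fewest words (ties: first found)."""
    n = len(syllables)
    cost = [0] + [n + 1] * n   # cost[e] = fewest words covering syllables[:e]
    back = [0] * (n + 1)       # back[e] = start index of the last word in that cover
    for end in range(1, n + 1):
        for size in range(1, min(max_len, end) + 1):
            start = end - size
            cand = norm(" ".join(syllables[start:end]))
            if (size == 1 or cand in lexicon) and cost[start] + 1 < cost[end]:
                cost[end], back[end] = cost[start] + 1, start
    words, end = [], n
    while end > 0:
        words.append("_".join(syllables[back[end]:end]))
        end = back[end]
    return words[::-1]

# Toy lexicon (hypothetical): only the multi-syllable entries needed here.
lexicon = {norm(w) for w in ["truyền thông", "thông tin", "học sinh", "sinh học"]}

sent1 = "Đó là cách để truyền thông tin".split()
sent2 = "Học sinh học sinh học".split()
print(longest_matching(sent1, lexicon))
print(min_word_count(sent2, lexicon))
```

With this toy lexicon the greedy pass commits to truyền_thông before it sees tin, which is the segmentation marked × on the slide. For the second sentence both candidate segmentations contain three words, so the word-count criterion alone cannot choose between them.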

Page 8: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

N-gram-based model
• Achieved good results for Vietnamese word segmentation (Le Ha 2003)


Page 9: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

Named entity recognition model
• Classifies named entities
• Machine learning methods are used (Tri et al. 2007)
• Closely related to word segmentation


[PERSON Ông Nguyễn Hữu Minh] được đề cử chức tổng giám đốc của [ORG Công ty Đại Á] nhiệm kỳ [DTIME 2002-2006]

[PERSON Mr Nguyen Huu Minh] is recommended as the CEO of [ORG Dai A corporation] for [DTIME 2002-2006]

Page 10: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

Word segmentation model
• Uses named entities
• Handles unknown words (about 10% in a preliminary survey)


Page 11: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

Features used for word segmentation


Page 12: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

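Dictionary-based model
• Uses the syllables to the left of the target syllable
  ➢ Motivated by experience with longest matching
• Words consist of at most 4 syllables, so the window width is 4
  ➢ SC(-3,0), SC(-2,0), SC(-1,0), SC(0,0)
  ➢ "1" if the span exists in the dictionary, "0" otherwise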
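
A hedged sketch of how these four binary features might be computed. It assumes that SC(k,0) covers the syllables from offset k up to the target syllable itself; that reading, the function name `dictionary_features`, the helper `norm`, and the toy `lexicon` are not taken from the paper.

```python
import unicodedata

def norm(text):
    """Lowercase and NFC-normalize so lexicon lookups compare like with like."""
    return unicodedata.normalize("NFC", text).lower()

def dictionary_features(syllables, i, lexicon):
    """Binary SC(k,0) features for the syllable at index i, for k = -3..0."""
    feats = {}
    for k in (-3, -2, -1, 0):
        start = i + k
        if start < 0:
            feats[f"SC({k},0)"] = 0  # span would begin before the sentence start
            continue
        span = norm(" ".join(syllables[start:i + 1]))
        feats[f"SC({k},0)"] = 1 if span in lexicon else 0
    return feats

# Toy lexicon (hypothetical entries).
lexicon = {norm(w) for w in ["truyền thông", "thông tin", "tin"]}
sent = "Đó là cách để truyền thông tin".split()
print(dictionary_features(sent, 6, lexicon))  # features for the last syllable
```

For the final syllable, only the one-syllable and two-syllable spans are in the toy lexicon, so SC(0,0) and SC(-1,0) fire while the wider spans stay at 0.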


Page 13: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

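Named entity model
• Targets entities that can be captured with regular expressions
  ➢ Vietnamese person names (about 20,000)
    ◆ Name = family name + middle name + given name
    ◆ Nguyễn Văn Hải
  ➢ Vietnamese place names (about 800)
• "1" if matched, "0" otherwise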
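
As an illustration of the person-name feature, the sketch below flags syllables that fall inside a family + middle + given name sequence. It uses plain set lookups in place of the regular expressions mentioned on the slide, and the three tiny name lists and the function name `person_name_flags` are hypothetical stand-ins for the roughly 20,000-name resource.

```python
import unicodedata

def nfc(text):
    """NFC-normalize so accented syllables compare reliably."""
    return unicodedata.normalize("NFC", text)

# Hypothetical mini gazetteers; the real resource is far larger.
FAMILY = {nfc(s) for s in ["Nguyễn", "Trần", "Lê"]}
MIDDLE = {nfc(s) for s in ["Văn", "Thị", "Hữu"]}
GIVEN = {nfc(s) for s in ["Hải", "Minh", "Oanh"]}

def person_name_flags(syllables):
    """Return one 0/1 flag per syllable: 1 if it lies inside a matched name."""
    toks = [nfc(s) for s in syllables]
    flags = [0] * len(toks)
    for i in range(len(toks) - 2):
        if toks[i] in FAMILY and toks[i + 1] in MIDDLE and toks[i + 2] in GIVEN:
            flags[i:i + 3] = [1, 1, 1]
    return flags

sent = "Ông Nguyễn Văn Hải đi nhanh quá".split()
print(person_name_flags(sent))
```

A place-name feature could follow the same pattern with a list of place names instead of the three name-part sets.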


Page 14: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

N-gram model
• 14M syllables collected from Vietnamese Wikipedia
• Uses 2-grams and 3-grams
• Scores normalized to values between 0 and 1
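
The slide does not say how the scores are normalized. One simple option, shown here purely as an assumption, is the relative frequency of an n-gram given its first n-1 syllables, which always lies between 0 and 1. The function names and the two-sentence corpus are illustrative only.

```python
from collections import Counter

def build_model(sentences, n):
    """Count syllable n-grams and how often each (n-1)-syllable history starts one."""
    counts, history = Counter(), Counter()
    for sent in sentences:
        syls = sent.lower().split()
        for i in range(len(syls) - n + 1):
            gram = tuple(syls[i:i + n])
            counts[gram] += 1
            history[gram[:-1]] += 1
    return counts, history

def score(gram, model):
    """Relative frequency count(gram) / count(history); 0.0 for an unseen history."""
    counts, history = model
    seen = history[gram[:-1]]
    return counts[gram] / seen if seen else 0.0

# Tiny stand-in for the Wikipedia syllable corpus.
corpus = ["học sinh học sinh học", "sinh học là khoa học"]
bigrams = build_model(corpus, 2)
trigrams = build_model(corpus, 3)

toks = corpus[0].split()
print(score((toks[0], toks[1]), bigrams))            # "học sinh"
print(score((toks[1], toks[2], toks[3]), trigrams))  # "sinh học sinh"
```

Other normalizations (for example dividing by the largest count) would also satisfy the 0 to 1 constraint; the choice here is only for illustration.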


Page 15: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

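Experiments
• A maximum entropy model is used for classification
  ➢ Labels are B_W and I_W
• 8,000 sentences extracted from newspapers
• Evaluated with 5-fold cross-validation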
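
To make the labeling scheme concrete, the sketch below converts a segmented sentence (underscores joining the syllables of one word) into per-syllable B_W / I_W labels and back, and builds 5-fold train/test index splits. It assumes B_W marks the first syllable of a word and I_W any later syllable; the function names and the unshuffled, contiguous folds are assumptions for illustration.

```python
def to_labels(segmented):
    """Turn 'a b_c' into [('a', 'B_W'), ('b', 'B_W'), ('c', 'I_W')]."""
    pairs = []
    for word in segmented.split():
        for k, syl in enumerate(word.split("_")):
            pairs.append((syl, "B_W" if k == 0 else "I_W"))
    return pairs

def from_labels(pairs):
    """Rebuild the underscore-joined segmentation from (syllable, label) pairs."""
    words = []
    for syl, label in pairs:
        if label == "I_W" and words:
            words[-1] += "_" + syl
        else:
            words.append(syl)
    return " ".join(words)

def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists using contiguous, unshuffled test blocks."""
    fold = n_items // k
    for f in range(k):
        lo = f * fold
        hi = n_items if f == k - 1 else lo + fold
        test = list(range(lo, hi))
        train = list(range(0, lo)) + list(range(hi, n_items))
        yield train, test

gold = "Đó là cách để truyền thông_tin"
pairs = to_labels(gold)
print([label for _, label in pairs])
print(from_labels(pairs))

folds = list(k_fold_indices(8000, k=5))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

A classifier then predicts one of the two labels for each syllable, and `from_labels` turns its predictions back into a segmentation.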


Page 16: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

Results

All resources used

Unknown words improve by about 20%

Page 17: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

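POS tagging
• Corpus creation
  ➢ 14 POS tags plus symbol tags
  ➢ 8,000 sentences on various topics, already tagged
    ◆ http://vnlp.net/blog/?p=164
• Vietnamese POS tagging model
  ➢ Word-based and morpheme-based features
    ◆ Position, context within a window of width 2, previous POS tag
    ◆ Previous one and two POS tags, is it punctuation?, is it capitalized?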
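
A minimal sketch of a feature extractor in the spirit of this list. The exact feature templates are not given on the slide, so the feature names, the padding symbols, the treatment of the syllables inside a word as the morpheme-level information, and the tag names in the usage lines are all assumptions.

```python
import string

def pos_features(words, i, prev_tags):
    """Features for tagging words[i], given the tags already assigned to its left."""
    def at(j):
        return words[j] if 0 <= j < len(words) else "<PAD>"

    word = words[i]
    t1 = prev_tags[-1] if len(prev_tags) >= 1 else "<S>"
    t2 = prev_tags[-2] if len(prev_tags) >= 2 else "<S>"
    syllables = word.split("_")  # syllables stand in for morphemes here
    return {
        "position": i,
        # Word context within a window of width 2.
        "w0": word, "w-1": at(i - 1), "w-2": at(i - 2),
        "w+1": at(i + 1), "w+2": at(i + 2),
        # Previous tag, and the pair of the two previous tags.
        "t-1": t1, "t-2,t-1": t2 + " " + t1,
        # Orthographic checks.
        "is_punct": all(ch in string.punctuation for ch in word),
        "is_cap": word[:1].isupper(),
        # Morpheme-level (syllable-level) information.
        "first_syl": syllables[0], "last_syl": syllables[-1],
        "n_syl": len(syllables),
    }

words = ["Học_sinh", "học", "sinh_học", "."]
feats = pos_features(words, 2, ["N", "V"])  # "N" and "V" are hypothetical tags
print(feats["w-1"], feats["t-2,t-1"], feats["n_syl"])
```

Each dictionary would typically be converted to binary indicator features before being passed to a maximum entropy classifier.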


Page 18: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

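Experiments
• Corpora
  ➢ VietTreebank (VNLP 2009)
    ◆ 17 POS tags
    ◆ 10,000 sentences from various topics
  ➢ vnPOS
    ◆ Created in this work
    ◆ 14 POS tags
• 5-fold cross-validation on each corpus
• Uses the BLMVM algorithm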


Page 19: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

Results

Morpheme-based features always improve accuracy

Page 20: Improving vietnamese word segmentation and pos tagging using MEM with various kinds of resources

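Summary
• Proposed Vietnamese word segmentation that uses various kinds of knowledge
• Also proposed morpheme-based POS tagging, reaching 89.64%
• Created and released a tagged corpus (8,000 sentences)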
