Integrating Surface and Abstract Features for Robust Cross-Domain Chinese Word Segmentation

Author: Xiaoqing Li. Presenter: Zhao Anbang (赵安邦)

Upload: donkor

Post on 21-Mar-2016



Page 1: Integrating Surface and Abstract Features for Robust Cross-Domain Chinese Word Segmentation

Integrating Surface and Abstract Features for Robust Cross-Domain Chinese Word Segmentation

Author: Xiaoqing Li
Presenter: Zhao Anbang (赵安邦)

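Page 2

Problem
• Cross-domain OOV (out-of-vocabulary) words: 刑法 (criminal law), 民法 (civil law), 宪法 (constitutional law), ... and 吸星大法 (a martial-arts technique name from wuxia fiction)
• The tag distribution of the same word differs across domains:
  酸 (s), 酸的 (b), 酸性 (b)
  硫酸 (e), 盐酸 (e), 硝酸 (e)

The tag in parentheses is the position of 酸 (acid, sour) within the word: s (single-character word), b (word-initial), e (word-final).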
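The tag shift for 酸 can be made concrete with a small sketch. It assumes the usual b/m/e/s character-tagging scheme and uses the slide's six example words as toy "corpora"; the function names are hypothetical, and the counts illustrate the idea rather than real corpus statistics.

```python
from collections import Counter

def word_to_tags(word):
    """Map a segmented word to per-character tags: s, or b m* e."""
    if len(word) == 1:
        return ["s"]
    return ["b"] + ["m"] * (len(word) - 2) + ["e"]

def tag_distribution(words, char):
    """Count which tags `char` receives across a list of segmented words."""
    counts = Counter()
    for word in words:
        for c, tag in zip(word, word_to_tags(word)):
            if c == char:
                counts[tag] += 1
    return counts

# Toy word lists taken from the slide's examples (not real corpus data).
general_words = ["酸", "酸的", "酸性"]
chemistry_words = ["硫酸", "盐酸", "硝酸"]

general = tag_distribution(general_words, "酸")
chemistry = tag_distribution(chemistry_words, "酸")
print(dict(general))    # {'s': 1, 'b': 2}
print(dict(chemistry))  # {'e': 3}
```

In this toy data 酸 is never tagged e in the general list and always tagged e in the chemistry list, which is the kind of mismatch a model trained on one domain cannot anticipate in the other.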

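Page 3

Solution
• Introduce a domain dictionary. Without bringing in new domain knowledge, this problem is hard to solve. Dictionaries are relatively easy to obtain, for example chemical-engineering term dictionaries and medical term dictionaries.

Ways to use the dictionary
(1) Mechanical matching
(2) Use information about the longest dictionary word that matches (widely used in discriminative segmentation methods)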
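Mechanical matching, option (1), is commonly implemented as forward maximum matching. The sketch below is one minimal version under that assumption; the function name and the dictionary entry 报道 are added for illustration and do not come from the slides.

```python
def forward_max_match(sentence, dictionary, max_len=None):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Toy dictionary; "报道" is an assumed extra entry for the demonstration.
toy_dictionary = {"新华社", "华社", "报道"}
segmented = forward_max_match("新华社报道。", toy_dictionary)
print(segmented)  # ['新华社', '报道', '。']
```

Because the match is purely greedy, any word missing from the dictionary falls apart into single characters, which is one reason option (2) feeds the match information to a statistical model instead of trusting it directly.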

Page 4

Example
Sentence: 新华社报道。
Dictionary: 华社, 新华社
Length of the longest matching word: 3
Extracted feature: C0=华, L=3, m (m presumably marks 华 as the middle character of 新华社)
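The feature in this example can be computed with a short routine. This is a sketch under the assumption that the feature is the pair (length of the longest dictionary word covering the character, position of the character inside that word); `longest_match_feature` is a hypothetical name.

```python
def longest_match_feature(sentence, index, dictionary):
    """Return (L, tag) for the longest dictionary word covering sentence[index].

    tag is the character's position in that word (b, m, e, or s);
    (0, None) means no dictionary word covers the character.
    """
    max_len = max((len(w) for w in dictionary), default=0)
    best = None  # (start, end) of the longest covering word, end exclusive
    for start in range(max(0, index - max_len + 1), index + 1):
        for end in range(index + 1, min(len(sentence), start + max_len) + 1):
            if sentence[start:end] in dictionary:
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
    if best is None:
        return (0, None)
    start, end = best
    length = end - start
    if length == 1:
        tag = "s"
    elif index == start:
        tag = "b"
    elif index == end - 1:
        tag = "e"
    else:
        tag = "m"
    return (length, tag)

sentence = "新华社报道。"
dictionary = {"华社", "新华社"}
print(longest_match_feature(sentence, 1, dictionary))  # (3, 'm') for 华
print(longest_match_feature(sentence, 3, dictionary))  # (0, None) for 报
```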

Page 5

Related concepts

Surface features
• N-gram probabilities

Abstract features
• Whether a character takes the tag it has in its longest matching dictionary word follows a distribution that is almost invariant across domains. (Mapping)

Page 6

Dictionary Coverage Status

• A set of five elements: {No-Dictionary-Word, No-Ambiguity, Crossed-Ambiguity, Included-Ambiguity, Mixed-Ambiguity}

Purpose: classifies the kind of ambiguity among the dictionary words matched for a character.
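The slides do not spell out how the five values are assigned, so the following is only a sketch under an assumed reading: all dictionary words covering a character are collected, and the label depends on whether those words nest inside each other, cross each other, or both. The function names and the example 结合成 are illustrative assumptions.

```python
def covering_spans(sentence, index, dictionary):
    """All (start, end) spans of dictionary words that cover sentence[index]."""
    max_len = max((len(w) for w in dictionary), default=0)
    spans = []
    for start in range(max(0, index - max_len + 1), index + 1):
        for end in range(index + 1, min(len(sentence), start + max_len) + 1):
            if sentence[start:end] in dictionary:
                spans.append((start, end))
    return spans

def coverage_status(spans):
    """Assumed mapping from covering spans to a Dictionary Coverage Status."""
    if not spans:
        return "No-Dictionary-Word"
    if len(spans) == 1:
        return "No-Ambiguity"
    nested = crossed = False
    for a in range(len(spans)):
        for b in range(a + 1, len(spans)):
            (s1, e1), (s2, e2) = spans[a], spans[b]
            # Every pair overlaps (both cover the same character), so a pair
            # is either nested (one inside the other) or crossing.
            if (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2):
                nested = True
            else:
                crossed = True
    if nested and crossed:
        return "Mixed-Ambiguity"
    return "Included-Ambiguity" if nested else "Crossed-Ambiguity"

news = covering_spans("新华社报道。", 1, {"华社", "新华社"})
print(coverage_status(news))   # Included-Ambiguity: 华社 lies inside 新华社
cross = covering_spans("结合成", 1, {"结合", "合成"})
print(coverage_status(cross))  # Crossed-Ambiguity: 结合 and 合成 overlap on 合
```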

Page 7

Dictionary Coverage Status

• Example: Included-Ambiguity

Page 8

Dictionary Coverage Status

• Example: Crossed-Ambiguity

Page 9

Tag Matching Status

• A set of four elements: {Following-Longest-Word, Only-Following-Shorter-Word, Not-Following-Any-Word, Inapplicable}

Purpose: classifies the relationship between a character's tag and the tags implied by the matched dictionary words.
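One plausible reading of the four values is sketched below: a character's candidate tag is compared with the position it would have in each dictionary word covering it. The rules are an assumption inferred from the value names, not the authors' stated definition, and the spans are supplied by hand as hypothetical input.

```python
def position_tag(index, span):
    """Tag a character would get if the word at `span` were a real word."""
    start, end = span
    if end - start == 1:
        return "s"
    if index == start:
        return "b"
    if index == end - 1:
        return "e"
    return "m"

def tag_matching_status(tag, index, spans):
    """Assumed mapping from a candidate tag to a Tag Matching Status."""
    if not spans:
        return "Inapplicable"  # no dictionary word covers the character
    longest = max(end - start for start, end in spans)
    if any(position_tag(index, sp) == tag
           for sp in spans if sp[1] - sp[0] == longest):
        return "Following-Longest-Word"
    if any(position_tag(index, sp) == tag
           for sp in spans if sp[1] - sp[0] < longest):
        return "Only-Following-Shorter-Word"
    return "Not-Following-Any-Word"

# 华 (index 1) in 新华社报道。 is covered by 新华社 (0, 3) and 华社 (1, 3).
spans_for_hua = [(0, 3), (1, 3)]
print(tag_matching_status("m", 1, spans_for_hua))  # Following-Longest-Word
print(tag_matching_status("b", 1, spans_for_hua))  # Only-Following-Shorter-Word
print(tag_matching_status("s", 1, spans_for_hua))  # Not-Following-Any-Word
print(tag_matching_status("b", 3, []))             # Inapplicable
```

Under this reading the status depends only on how the tag relates to the dictionary matches, not on which characters are involved, which fits the earlier claim that abstract features are nearly domain-invariant.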

Page 10

Derivation of the generative model
• Traditional generative model

• Generative model with dictionary features

Page 11

Derivation of the generative model

Approximated as:

Page 12

Derivation of the generative model

The abstract feature and the surface feature can be given different weights.
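The slide's weighting formula did not survive in this transcript. A common way to weight two probability factors is a log-linear combination, sketched below purely as an assumed form; the function name, weights, and probabilities are invented for illustration.

```python
import math

def weighted_log_score(p_surface, p_abstract, w_surface=1.0, w_abstract=1.0):
    """Weighted sum of log-probabilities (assumed log-linear combination)."""
    return w_surface * math.log(p_surface) + w_abstract * math.log(p_abstract)

# Two hypothetical tag candidates with made-up probabilities.
candidate_a = {"p_surface": 0.6, "p_abstract": 0.2}
candidate_b = {"p_surface": 0.3, "p_abstract": 0.5}

# Equal weights favor candidate B; down-weighting the abstract
# feature to 0.5 flips the ranking toward candidate A.
equal_a = weighted_log_score(**candidate_a)
equal_b = weighted_log_score(**candidate_b)
tuned_a = weighted_log_score(**candidate_a, w_abstract=0.5)
tuned_b = weighted_log_score(**candidate_b, w_abstract=0.5)
print(equal_a < equal_b, tuned_a > tuned_b)  # True True
```

The flip shows why the weights matter: they decide how much a domain-stable dictionary signal can override n-gram evidence learned from the training domain.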

Page 13

Derivation of the generative model

This model can be further integrated with a discriminative model, which gives the following formula:

Page 14

Experiments
• Experimental setup
• Training corpus: PKU-News7 from CIPS-SIGHAN-2010
• In-domain test corpus: PKU-News testing corpus of SIGHAN-2005
• Cross-domain test corpora: corpora of CIPS-SIGHAN-2010 (literature, computer science, medicine, finance)

Page 15

Experiments
• Generative model results

B is the baseline system.
Abstract feature formula of G1:
Abstract feature formula of G2:

Page 16

Experiments
• Generative + discriminative model results

SBest is the baseline system (best results of SIGHAN 2005 (News) and CIPS-SIGHAN 2010 (other domains)).
ED is the discriminative system improved with the dictionary (Enhanced Discriminative).
EG (Generative) and EI (Integrated) are the corresponding generative and integrated systems.