Abstract
Context is an essential resource in statistical linguistics, both for acquiring linguistic knowledge and for solving many practical problems in natural language processing. In recent years, character-based word-position tagging has greatly improved the performance of Chinese word segmentation. This approach recasts segmentation as the problem of tagging each character with its position within a word, and the tag of the current character is determined with the help of that character's context. To avoid relying on guesses based only on subjective experience, this paper uses a four-tag word-position set and a conditional random field model to study how much the preceding context and the following context each contribute to segmentation performance. Closed tests are carried out on the PKU and MSRA corpora from Bakeoff-2005, the second international Chinese word segmentation evaluation, with comparative experiments using feature template sets that represent the preceding context and the following context separately. The results show that the following context contributes more than 13 percentage points more to segmentation performance than the preceding context.
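The tagging scheme and the split between preceding-context and following-context features can be illustrated with a minimal Python sketch. The abstract does not list the tag names, the window size, or the actual feature templates, so everything here is an assumption made for illustration: the B/M/E/S tag names, the two-character window, the `<pad>` symbol, and the helper names `words_to_tags`, `tags_to_words`, and `context_features` are all hypothetical.

```python
from typing import Dict, List, Tuple

PAD = "<pad>"  # assumed padding symbol for positions outside the sentence


def words_to_tags(words: List[str]) -> List[Tuple[str, str]]:
    """Turn a segmented sentence into (character, tag) pairs.

    Assumed four-tag set: B (word start), M (word middle),
    E (word end), S (single-character word).
    """
    pairs: List[Tuple[str, str]] = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))
            continue
        pairs.append((word[0], "B"))
        for ch in word[1:-1]:
            pairs.append((ch, "M"))
        pairs.append((word[-1], "E"))
    return pairs


def tags_to_words(chars: List[str], tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # keep a trailing unfinished word instead of dropping it
        words.append(current)
    return words


def context_features(chars: List[str], i: int, mode: str,
                     window: int = 2) -> Dict[str, str]:
    """Unigram character features for position i.

    mode="preceding" looks only at the current character and the
    characters before it; mode="following" looks only at the current
    character and the characters after it. The window size is assumed.
    """
    if mode == "preceding":
        offsets = range(-window, 1)
    elif mode == "following":
        offsets = range(0, window + 1)
    else:
        raise ValueError("mode must be 'preceding' or 'following'")
    feats: Dict[str, str] = {}
    for k in offsets:
        j = i + k
        feats[f"C[{k}]"] = chars[j] if 0 <= j < len(chars) else PAD
    return feats
```

Feature dictionaries built this way, one per character, could be passed to any sequence labeller. Training one model on the "preceding" features and another on the "following" features would give the kind of comparison the abstract describes, although the templates used in the paper itself may differ.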
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journals (北大核心)
2011, Issue 4, pp. 117-120 (4 pages)
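Funding
Specialized Research Fund for the Doctoral Program of Higher Education (No. 20050007023)
Young Backbone Teachers Program of Henan Higher Education Institutions (No. 2009GGJS-108)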
Keywords
Chinese word segmentation
context
conditional random fields
word-position tagging
feature template