Abstract
For the Gene Ontology Evidence Sentences (GOES) extraction problem, extraction systems built on traditional features and a Bayesian classification model are inefficient, which results in low recall. To address this, two systems are built: one for single-sentence extraction and one for mixed multi-sentence extraction. System 1 combines a Least Squares Support Vector Machine (LS-SVM) model with a new feature combination and a sentence filtering module, which addresses the incomplete coverage of traditional features. System 2 adds a Conditional Random Field (CRF) model and candidate-sentence identification rules to System 1, which addresses the merging of consecutive sentences. Experimental results show that, on single-sentence extraction, System 1 improves recall and F-value by 39.7% and 12.9%, respectively, over the Bayesian baseline system. On mixed multi-sentence extraction, System 2 improves recall by 37.1% over the Learning from Positive and Unlabeled Documents for Retrieval (LPU) system.
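The abstract reports a larger gain in recall than in F-value. The sketch below shows one way this can happen when sentence extraction is scored on sets of sentence IDs. It assumes the F-value is the balanced F1 score, which the abstract does not state, and the sentence IDs are hypothetical toy data rather than results from the paper.

```python
def precision_recall_f(predicted, gold):
    """Return (precision, recall, F-value) for two collections of sentence IDs.

    Assumption: the F-value is the balanced F1 score (harmonic mean of
    precision and recall); the abstract does not specify the weighting.
    """
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical toy data (sentence IDs), not taken from the paper.
gold = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}                    # true evidence sentences
baseline = {1, 2, 3, 11}                                  # extracts few sentences
improved = {1, 2, 3, 4, 5, 6, 7, 11, 12, 13, 14, 15}      # extracts many more

p_base, r_base, f_base = precision_recall_f(baseline, gold)
p_new, r_new, f_new = precision_recall_f(improved, gold)

print(f"baseline: P={p_base:.3f} R={r_base:.3f} F={f_base:.3f}")
print(f"improved: P={p_new:.3f} R={r_new:.3f} F={f_new:.3f}")
```

In this toy case the second extractor finds more true evidence sentences but also more false ones, so precision falls while recall rises, and the F-value moves less than recall does. Whether the percentages in the abstract are absolute or relative improvements is not stated there.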
Source
《计算机工程》
CAS
CSCD
PKU Core (Peking University Core Journals)
2015, No. 5, pp. 207-212 (6 pages)
Computer Engineering
Keywords
biological evidence sentence
feature combination
Support Vector Machine (SVM)
Least Squares Support Vector Machine (LS-SVM)
Conditional Random Field (CRF)