Abstract
[Objective] To reduce the learning cost of semantic annotation for Chinese species description texts. [Methods] A weakly supervised learning method based on Bootstrapping is designed. Starting from a small amount of data, it performs learning and annotation iteratively; in each iteration, the annotated data with the highest confidence are used to expand the knowledge base, which improves annotation ability. [Results] The algorithm was tested on a dataset of 15,041 sentences, and the average F-value reached 0.9112. [Limitations] On overly sparse data, annotation efficiency is relatively low. [Conclusions] The method designed in this study effectively reduces the amount of training data required for learning and can also improve annotation efficiency.
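The iterative scheme summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token-vote knowledge base, the helper names (`learn`, `tag`, `bootstrap`), the parameters `top_k`, `min_conf`, and `max_iter`, and the English toy tokens are all assumptions made for illustration. It shows only the general Bootstrapping idea of learning from a small seed, annotating the remaining data, and adding the most confident annotations back into the knowledge base.

```python
from collections import Counter, defaultdict


def learn(labeled):
    """Build a toy knowledge base: token -> Counter of labels seen with it."""
    kb = defaultdict(Counter)
    for tokens, label in labeled:
        for tok in tokens:
            kb[tok][label] += 1
    return kb


def tag(kb, tokens):
    """Return (label, confidence); confidence is the winning label's vote share."""
    votes = Counter()
    for tok in tokens:
        votes.update(kb.get(tok, {}))
    if not votes:
        return None, 0.0  # no known token: nothing to base a label on
    label, count = votes.most_common(1)[0]
    return label, count / sum(votes.values())


def bootstrap(seed, unlabeled, top_k=2, min_conf=0.6, max_iter=10):
    """Alternate learning and tagging, keeping only the most confident new labels."""
    labeled = list(seed)
    pending = list(unlabeled)
    for _ in range(max_iter):
        kb = learn(labeled)
        candidates = []
        for tokens in pending:
            label, conf = tag(kb, tokens)
            if label is not None and conf >= min_conf:
                candidates.append((conf, label, tokens))
        if not candidates:
            break  # nothing confident enough; stop iterating
        candidates.sort(key=lambda c: c[0], reverse=True)
        for conf, label, tokens in candidates[:top_k]:
            labeled.append((tokens, label))  # expand the knowledge source
            pending.remove(tokens)
    return labeled, pending


# Hypothetical toy data (not from the paper).
seed = [(["leaf", "ovate"], "leaf"), (["petal", "white"], "flower")]
unlabeled = [
    ["leaf", "margin", "serrate"],
    ["margin", "entire"],
    ["petal", "yellow"],
    ["stem", "erect"],
]
labeled, pending = bootstrap(seed, unlabeled)
```

In this toy run, `["margin", "entire"]` can only be labeled in the second iteration, after `margin` has entered the knowledge base through a confident annotation from the first. `["stem", "erect"]` shares no token with anything labeled and stays in `pending`, which loosely mirrors the sparse-data limitation noted in the abstract.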
Source
《现代图书情报技术》
CSSCI
Peking University Core Journals (北大核心)
2014, Issue 5, pp. 83-89 (7 pages)
New Technology of Library and Information Service
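Funding
One of the research outputs of the National Social Science Fund of China General Project "Research on Web-Based Chinese Academic Information Extraction Based on Unsupervised Semantic Annotation" (Project No. 11BTQ024)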