
基于弱监督学习的海量网络数据关系抽取 (Cited by: 34)

Extracting Relations from the Web via Weakly Supervised Learning
Abstract  In the era of big data, information extraction from massive web data has become an important topic in natural language processing and information retrieval. Weakly supervised relation extraction in particular has attracted wide attention because it requires little human involvement and adapts easily to new domains. Existing work on weak supervision has focused mainly on English resources and on conventional features such as word-based lexical features and dependency-based syntactic features. Lexical features, however, suffer from severe data sparsity, while syntactic features depend heavily on the performance of syntactic analysis tools. This paper proposes n-gram features to relieve the sparsity of conventional lexical features; such features can also compensate for syntactic features in languages where syntactic analysis tools are unreliable, which makes them important for cross-lingual relation extraction. Furthermore, to cope with the imperfect reliability of the automatically labeled training data used in weakly supervised learning, a bootstrapping-style co-training method is introduced to strengthen the weakly supervised relation extraction model, and the strategies for combining the predictions of the different training views are analyzed in detail. Experiments on large-scale Chinese and English datasets show that combining conventional features with n-gram features and applying co-training improves weakly supervised relation extraction in both languages, meeting the needs of multilingual relation extraction.
Source  Journal of Computer Research and Development (《计算机研究与发展》; EI, CSCD, Peking University Core Journal), 2013, No. 9: 1825-1835 (11 pages)
Funding  National High-Tech Research and Development Program of China (863 Program) (2012AA011101); National Natural Science Foundation of China (61272344, 61202233)
Keywords  relation extraction; weakly supervised learning; maximum entropy model; co-training; knowledge base construction
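
To make the method summarized in the abstract concrete, the sketch below shows one plausible reading of the described procedure: two maximum-entropy classifiers (scikit-learn's LogisticRegression serves as a stand-in for MaxEnt), one per feature view (word n-grams versus a syntactic-style view), and a bootstrapping loop that promotes unlabeled relation instances on which both views agree with high confidence into the training set. The function names, thresholds, agreement rule, and toy data are illustrative assumptions, not the authors' actual implementation.

# Minimal co-training sketch for weakly supervised relation extraction.
# Assumptions (not from the paper): LogisticRegression as the maximum-entropy
# classifier, an "agree-and-both-confident" combination rule, and toy data.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def co_train(lab_v1, lab_v2, labels, unl_v1, unl_v2,
             rounds=5, conf=0.7, per_round=50):
    """lab_v1/lab_v2: two textual views of the seed (distantly labeled) instances.
    unl_v1/unl_v2: the same two views of the unlabeled pool."""
    # Separate vectorizers per view; n-grams soften the sparsity of unigram lexical features.
    v1 = CountVectorizer(ngram_range=(1, 3))   # lexical / n-gram view
    v2 = CountVectorizer(ngram_range=(1, 2))   # syntactic-style view
    X1, X2 = v1.fit_transform(lab_v1), v2.fit_transform(lab_v2)
    U1, U2 = v1.transform(unl_v1), v2.transform(unl_v2)
    y = np.asarray(labels)
    pool = np.arange(len(unl_v1))              # pool indices still unlabeled

    for _ in range(rounds):
        if pool.size == 0:
            break
        c1 = LogisticRegression(max_iter=1000).fit(X1, y)
        c2 = LogisticRegression(max_iter=1000).fit(X2, y)
        P1, P2 = c1.predict_proba(U1[pool]), c2.predict_proba(U2[pool])
        pred1, pred2 = P1.argmax(axis=1), P2.argmax(axis=1)
        score = np.minimum(P1.max(axis=1), P2.max(axis=1))
        # Combination rule: keep instances where both views predict the same
        # relation and the weaker of the two confidences clears `conf`.
        agree = (pred1 == pred2) & (score >= conf)
        if not agree.any():
            break
        order = np.argsort(-score)
        take = np.array([i for i in order if agree[i]][:per_round])
        # Promote the selected instances into the labeled set for both views.
        X1 = vstack([X1, U1[pool[take]]])
        X2 = vstack([X2, U2[pool[take]]])
        y = np.concatenate([y, c1.classes_[pred1[take]]])
        pool = np.delete(pool, take)

    # Retrain the final per-view classifiers on the augmented data.
    c1 = LogisticRegression(max_iter=1000).fit(X1, y)
    c2 = LogisticRegression(max_iter=1000).fit(X2, y)
    return v1, v2, c1, c2

if __name__ == "__main__":
    # Toy relation mentions: view 1 is the surface context between the two
    # entities, view 2 is a crude dependency-path string (both made up here).
    lab_v1 = ["<e1> was born in <e2>", "<e1> works for <e2>"]
    lab_v2 = ["nsubjpass born prep_in", "nsubj works prep_for"]
    labels = ["birthplace", "employer"]
    unl_v1 = ["<e1> , born in <e2> ,", "<e1> joined <e2> as an engineer"]
    unl_v2 = ["appos born prep_in", "nsubj joined prep_as"]
    v1, _, c1, _ = co_train(lab_v1, lab_v2, labels, unl_v1, unl_v2,
                            conf=0.5, per_round=1)
    print(c1.predict(v1.transform(["<e1> was born in <e2>"])))

Taking the minimum of the two view confidences is only one possible way to combine the views' predictions; the paper analyzes several combination strategies, so the rule and threshold above are placeholders.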

References (39)

  • 1 Sundheim B, Chinchor N. Survey of the message understanding conferences [C] //Proc of HLT'93. Stroudsburg, PA: ACL, 1993: 56-60. (Cited by: 1)
  • 2 Banko M, Cafarella M, Soderland S, et al. Open information extraction from the Web [C] //Proc of IJCAI 2007. New York: ACM, 2007: 2670-2676. (Cited by: 1)
  • 3 Fader A, Soderland S, Etzioni O. Identifying relations for open information extraction [C] //Proc of EMNLP 2011. Stroudsburg, PA: ACL, 2011: 1535-1545. (Cited by: 1)
  • 4 Carlson A, Betteridge J, Kisiel B, et al. Toward an architecture for never-ending language learning [C] //Proc of AAAI 2010. Palo Alto, CA: AAAI, 2010: 1306-1313. (Cited by: 1)
  • 5 Craven M, Kumlien J. Constructing biological knowledge bases by extracting information from text sources [C] //Proc of the 7th Int Conf on Intelligent Systems for Molecular Biology. Palo Alto, CA: AAAI, 1999: 77-86. (Cited by: 1)
  • 6 Blum A, Mitchell T. Combining labeled and unlabeled data with co-training [C] //Proc of ICML 1998. New York: ACM, 1998: 92-100. (Cited by: 1)
  • 7 车万翔, 刘挺, 李生. 实体关系自动抽取 [J]. 中文信息学报, 2005, 19(2): 1-6. (Cited by: 116)
  • 8 刘克彬, 李芳, 刘磊, 韩颖. 基于核函数中文关系自动抽取系统的实现 [J]. 计算机研究与发展, 2007, 44(8): 1406-1411. (Cited by: 58)
  • 9 董静, 孙乐, 冯元勇, 黄瑞红. 中文实体关系抽取中的特征选择研究 [J]. 中文信息学报, 2007, 21(4): 80-85. (Cited by: 55)
  • 10 Wu F, Hoffmann R, Weld D. Information extraction from Wikipedia: Moving down the long tail [C] //Proc of ACM SIGKDD 2008. New York: ACM, 2008: 731-739. (Cited by: 1)

Secondary references (71)

  • 1 车万翔, 刘挺, 李生. 实体关系自动抽取 [J]. 中文信息学报, 2005, 19(2): 1-6. (Cited by: 116)
  • 2 梁晗, 陈群秀, 吴平博. 基于事件框架的信息抽取系统 [J]. 中文信息学报, 2006, 20(2): 40-46. (Cited by: 38)
  • 3 Chapelle O, Scholkopf B, Zien A. Semi-supervised Learning [M]. Cambridge: MIT Press, 2006. (Cited by: 1)
  • 4 Zhu Xiaojin. Semi-supervised Learning with Graphs [D]. Carnegie Mellon University, doctoral thesis, 2005. (Cited by: 1)
  • 5 Blum A, Chawla S. Learning from labeled and unlabeled data using graph mincuts [A]. Proceedings of the 18th International Conference on Machine Learning [C]. Williamstown, MA, 2001: 19-26. (Cited by: 1)
  • 6 Szummer M, Jaakkola T. Partially labeled classification with Markov random walks [A]. Advances in Neural Information Processing Systems 14 [C]. Cambridge, MA: MIT Press, 2002: 945-952. (Cited by: 1)
  • 7 Joachims T. Transductive inference for text classification using support vector machines [A]. Proceedings of the 16th International Conference on Machine Learning [C]. New York, USA, 1999: 200-209. (Cited by: 1)
  • 8 Tong S, Koller D. Support vector machine active learning with applications to text classification [A]. Proceedings of the 17th International Conference on Machine Learning [C]. Stanford, USA, 2000: 999-1006. (Cited by: 1)
  • 9 Nigam K, McCallum A K, Thrun S, Mitchell T. Text classification from labeled and unlabeled documents using EM [J]. Machine Learning, 2000, 39(2-3): 103-134. (Cited by: 1)
  • 10 Cozman F G, Cohen I, Cirelo M C. Semi-supervised learning of mixture models [A]. Proceedings of the 20th International Conference on Machine Learning [C]. Citeseer, 2003: 99-106. (Cited by: 1)

Co-citing literature (182)

Co-cited literature (281)

Citing literature (34)

Secondary citing literature (1839)
