
Synonymous Entity Expansion Based Information De-duplication (Cited by: 3)
Abstract: Information de-duplication is an important task in information extraction. This paper addresses de-duplication of information that is represented by multiple fields. Previous work usually treats every field uniformly. To address the problems this causes, we separate the fields into categories according to how difficult redundancy is to judge for each, generalize a similarity computation method for each category, and use the resulting similarities as features for a classifier that decides whether two pieces of information are duplicates. For the named-entity field, which is the hardest to judge, we start from the other fields whose similarity is easy to predict, automatically expand the set of synonymous (co-referring) named entities, and use the expanded knowledge to improve de-duplication performance.
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list (北大核心), 2012, No. 1, pp. 42-50 (9 pages)
Keywords: information extraction; information de-duplication; named entity
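
The pipeline outlined in the abstract (per-field similarities used as classifier features, plus synonym expansion for the named-entity field) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the field names `date`, `location`, and `entity`, the use of `difflib.SequenceMatcher` as the string similarity, the averaged-threshold rule standing in for a trained classifier, and the rule that two records agreeing on all easy fields yield a synonymous entity pair.

```python
from difflib import SequenceMatcher

# Hypothetical record layout: these field names are illustrative only.
EASY_FIELDS = ("date", "location")  # fields whose similarity is easy to judge
ENTITY_FIELD = "entity"             # named-entity field, the hardest to judge


def string_sim(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for any string metric."""
    return SequenceMatcher(None, a, b).ratio()


def entity_sim(a: str, b: str, synonyms: set) -> float:
    """Treat known synonymous entities as identical, else fall back to string_sim."""
    if a == b or frozenset((a, b)) in synonyms:
        return 1.0
    return string_sim(a, b)


def features(r1: dict, r2: dict, synonyms: set) -> list:
    """One similarity feature per field, in a fixed order."""
    feats = [string_sim(r1[f], r2[f]) for f in EASY_FIELDS]
    feats.append(entity_sim(r1[ENTITY_FIELD], r2[ENTITY_FIELD], synonyms))
    return feats


def is_duplicate(feats: list, threshold: float = 0.8) -> bool:
    """Placeholder for a trained classifier: average the features and threshold."""
    return sum(feats) / len(feats) >= threshold


def expand_synonyms(records: list, synonyms: set, easy_threshold: float = 0.95) -> set:
    """Add entity pairs from record pairs whose easy fields all agree strongly."""
    expanded = set(synonyms)
    for i, r1 in enumerate(records):
        for r2 in records[i + 1:]:
            easy = [string_sim(r1[f], r2[f]) for f in EASY_FIELDS]
            if min(easy) >= easy_threshold and r1[ENTITY_FIELD] != r2[ENTITY_FIELD]:
                expanded.add(frozenset((r1[ENTITY_FIELD], r2[ENTITY_FIELD])))
    return expanded


records = [
    {"date": "2011-03-01", "location": "Beijing", "entity": "IBM"},
    {"date": "2011-03-01", "location": "Beijing",
     "entity": "International Business Machines"},
]
before = is_duplicate(features(records[0], records[1], set()))
synonyms = expand_synonyms(records, set())
after = is_duplicate(features(records[0], records[1], synonyms))
```

In this toy run the two records are not flagged as duplicates until the entity pair has been added to the synonym set. A real system would replace `is_duplicate` with a classifier trained on labeled pairs and would need a stricter expansion rule, since agreement on easy fields alone can pair unrelated entities.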

References (17)

  • 1. Mikhail Bilenko, Raymond J. Mooney. Adaptive Duplicate Detection Using Learnable String Similarity Measures [C]//Proceedings of KDD, Washington, DC, USA, 2003: 39-48. Cited by: 1
  • 2. Rohan Baxter, Peter Christen, Tim Churches. A Comparison of Fast Blocking Methods for Record Linkage [C]//Proceedings of KDD, Washington, DC, USA, 2003: 25-27. Cited by: 1
  • 3. Lifang Gu, Rohan Baxter. Adaptive Filtering for Efficient Record Linkage [C]//Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, 2004: 477-481. Cited by: 1
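  • 4. Li Feng (李峰), Li Fang (李芳). Chinese Word Semantic Similarity Computation Based on HowNet (《知网》) 2000 [J]. Journal of Chinese Information Processing, 2007, 21(3): 99-105. Cited by: 106
  • 5. Wang Rongbo (王荣波), Chi Zheru (池哲儒). A Method for Computing Chinese Sentence Structure Similarity Based on Part-of-Speech Strings [J]. Journal of Chinese Information Processing, 2005, 19(1): 21-29. Cited by: 28
  • 6. Zhang Qi (张奇), Huang Xuanjing (黄萱菁), Wu Lide (吴立德). A New Sentence Similarity Measure and Its Application to Automatic Text Summarization [J]. Journal of Chinese Information Processing, 2005, 19(2): 93-99. Cited by: 34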
  • 7. William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg. A Comparison of String Distance Metrics for Name-Matching Tasks [C]//Proceedings of IJCAI, 2003: 73-78. Cited by: 1
  • 8. http://www.cs.umass.edu/~mccallum/code-data.html [OL]. Cited by: 1
  • 9. M. Vilain, J. Burger, J. Aberdeen, et al. A Model-Theoretic Coreference Scoring Scheme [C]//Proceedings of the 6th Conference on Message Understanding, Columbia, Maryland, USA, 1995: 45-52. Cited by: 1
  • 10. Amit Bagga, Breck Baldwin. Algorithms for Scoring Coreference Chains [C]//Proceedings of The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, 1998: 563-566. Cited by: 1


