Abstract
Removing redundant information (de-duplication) is an important task in information extraction. This paper addresses de-duplication of information represented by multiple fields. Previous work usually treats every field uniformly. To address the problems this causes, we group the fields into categories according to how difficult redundancy is to judge for each one, generalize a similarity measure for each category, and use the resulting similarities as features for a classifier that decides whether two pieces of information are redundant. For named-entity fields, which are the hardest to judge, we start from the other fields whose similarity is easy to determine and automatically expand a set of synonymous (co-referent) named entities, which improves de-duplication performance.
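The pipeline described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the field names (date, topic, org), the three similarity functions, the alias-harvesting threshold, the toy records, and the small logistic-regression trainer that stands in for whatever classifier the authors used.

```python
import math
from difflib import SequenceMatcher

def exact_sim(a, b):
    """Easy field (e.g. a normalized date): match or no match."""
    return 1.0 if a == b else 0.0

def token_sim(a, b):
    """Medium field (free text): Jaccard overlap of lowercase tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def entity_sim(a, b, aliases=frozenset()):
    """Hard field (named entity): known alias pair, else surface similarity."""
    if frozenset((a, b)) in aliases:
        return 1.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def features(r1, r2, aliases=frozenset()):
    """One similarity per field; the vector is the classifier input."""
    return [exact_sim(r1["date"], r2["date"]),
            token_sim(r1["topic"], r2["topic"]),
            entity_sim(r1["org"], r2["org"], aliases)]

def expand_aliases(pairs, min_topic=0.8):
    """Harvest entity pairs from records whose easy fields agree strongly."""
    found = set()
    for r1, r2 in pairs:
        easy_match = (exact_sim(r1["date"], r2["date"]) == 1.0
                      and token_sim(r1["topic"], r2["topic"]) >= min_topic)
        if easy_match and r1["org"] != r2["org"]:
            found.add(frozenset((r1["org"], r2["org"])))
    return frozenset(found)

def train(X, y, epochs=500, lr=0.5):
    """Plain logistic regression by stochastic gradient descent."""
    w = [0.0] * (len(X[0]) + 1)  # w[0] is the bias term
    for _ in range(epochs):
        for x, label in zip(X, y):
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            grad = 1.0 / (1.0 + math.exp(-z)) - label
            w[0] -= lr * grad
            for i, xi in enumerate(x):
                w[i + 1] -= lr * grad * xi
    return w

def predict(w, x):
    """True if the pair is classified as redundant."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z)) >= 0.5

# Hypothetical toy records; the org field differs only in surface form.
r1 = {"date": "2012-01-05", "topic": "new phone launch",
      "org": "International Business Machines"}
r2 = {"date": "2012-01-05", "topic": "new phone launch", "org": "IBM"}

# Hypothetical labeled feature vectors (1 = redundant pair, 0 = distinct).
X = [[1.0, 1.0, 1.0], [1.0, 0.8, 0.9], [0.0, 0.0, 0.2], [0.0, 0.1, 0.0]]
y = [1, 1, 0, 0]

aliases = expand_aliases([(r1, r2)])
w = train(X, y)
is_duplicate = predict(w, features(r1, r2, aliases))
```

In this sketch the named-entity similarity for "IBM" and "International Business Machines" is low on surface form alone. Once the pair has been harvested through the easy fields, the entity feature becomes 1.0, which mirrors the expansion idea in the abstract without claiming to reproduce the paper's actual procedure.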
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals (北大核心)
2012, Issue 1, pp. 42-50 (9 pages)
Keywords
information extraction
information de-duplication
named entity