Abstract
Topic detection underpins the mining of news hotspots on the Internet. Traditional topic detection methods make little use of the category information and named-entity information in news reports to improve detection. To address this, a topic detection method based on multi-vector similarity calculation and secondary clustering is proposed. Reports are classified hierarchically according to the hierarchy of the site on which they appear, and named entities in the news text, such as places and persons, are used to distinguish reports. Exploiting the tendency of reports to cluster in time, the method first clusters the reports from the same day locally and then merges the resulting clusters with the existing topics. Experimental results show that the method achieves a normalized detection cost, (CDet)Norm, of 0.197, about 8% better than traditional topic detection algorithms.
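The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the similarity formula, the clustering algorithm, or any parameter values, so the weighted sum of cosine similarities, the single-pass clustering, the weights in WEIGHTS, the two thresholds, and all names (Cluster, make_report, merge_into, detect_topics) are assumptions made for the example. The cost function follows the standard TDT definition of normalized detection cost, with the usual TDT constants as default values; whether the paper uses exactly these constants is likewise an assumption.

```python
import math
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical weights for the three vectors; the abstract gives no values.
WEIGHTS = {"content": 0.6, "people": 0.2, "places": 0.2}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Cluster:
    """A report, or a group of reports, described by several vectors."""
    category: str                                   # site-hierarchy category
    content: Counter = field(default_factory=Counter)
    people: Counter = field(default_factory=Counter)
    places: Counter = field(default_factory=Counter)
    report_ids: list = field(default_factory=list)

    def absorb(self, other: "Cluster") -> None:
        """Add another cluster's term counts and reports to this one."""
        self.content.update(other.content)
        self.people.update(other.people)
        self.places.update(other.places)
        self.report_ids.extend(other.report_ids)


def make_report(rid, category, words, people=(), places=()):
    """Wrap a single report as a one-element cluster."""
    return Cluster(category, Counter(words), Counter(people), Counter(places), [rid])


def similarity(a: Cluster, b: Cluster) -> float:
    """Multi-vector similarity: compare only within a category (assumed),
    then take a weighted sum of per-vector cosine similarities."""
    if a.category != b.category:
        return 0.0
    return sum(w * cosine(getattr(a, name), getattr(b, name))
               for name, w in WEIGHTS.items())


def merge_into(clusters: list, items: list, threshold: float) -> list:
    """Single-pass clustering: each item joins its most similar cluster
    if the similarity reaches the threshold, else starts a new cluster."""
    for item in items:
        best = max(clusters, key=lambda c: similarity(c, item), default=None)
        if best is not None and similarity(best, item) >= threshold:
            best.absorb(item)
        else:
            clusters.append(item)
    return clusters


def detect_topics(topics, todays_reports, local_threshold=0.5, merge_threshold=0.4):
    """Stage 1: cluster one day's reports locally.
    Stage 2: merge the local clusters with the existing topics."""
    local = merge_into([], todays_reports, local_threshold)
    return merge_into(topics, local, merge_threshold)


def normalized_detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """TDT normalized detection cost; lower is better."""
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return c_det / min(c_miss * p_target, c_fa * (1.0 - p_target))
```

A caller would invoke detect_topics once per day, passing the topic list returned by the previous call together with that day's reports. Clustering within the day first keeps the number of comparisons against old topics small, which is one plausible reading of why the abstract separates the two stages.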
Source
《计算机工程与设计》
CSCD
Peking University Core Journal
2012, No. 8, pp. 3214-3218 (5 pages)
Computer Engineering and Design
Funding
Guangdong Provincial Science and Technology Plan Project (2010B010600017)
Keywords
topic detection
news hotspot
named entity
similarity calculation
clustering