Abstract
This paper proposes single-pass*, an improved single-pass text clustering algorithm that accounts for the time distance between Internet events and for the differing importance of feature items at different positions in semi-structured Web pages. Its advantages are that it weights feature items according to their position in the Web text, and that it only needs to compute the similarity between a new document and the seed document of the same category. Experimental results show that, compared with single-pass, the improved algorithm greatly reduces the miss rate and the false detection rate, lessens the system performance degradation caused by computing similarities among documents within the new text stream, and improves Web text clustering efficiency by 40% on average. The clustered Web texts are then applied to Internet public opinion analysis, covering topic attention analysis and topic hotness analysis.
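The abstract gives no pseudocode, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a bag-of-words representation, cosine similarity, and an exponential time decay. The position weights (`POSITION_WEIGHTS`), the `threshold`, the `half_life_days` parameter, and the document fields (`title`, `body`, `day`) are all hypothetical choices made for illustration. What it does take from the abstract is that terms are weighted by their position in the page, that the time distance between documents is considered, and that a new document is compared only with each cluster's seed document rather than with every member.

```python
import math
from collections import Counter

# Hypothetical position weights; the abstract does not state the actual values.
POSITION_WEIGHTS = {"title": 3.0, "body": 1.0}


def weighted_vector(doc):
    """Build a term-weight vector, scaling each term count by its position weight."""
    vec = Counter()
    for position, weight in POSITION_WEIGHTS.items():
        for term in doc.get(position, "").lower().split():
            vec[term] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def single_pass(docs, threshold=0.3, half_life_days=7.0):
    """Assign each document, in arrival order, to the best-matching cluster.

    Only the seed (first) document of each cluster is compared with the
    new document. The similarity is damped by an assumed exponential decay
    in the time gap between the new document and the seed.
    """
    clusters = []  # each entry: {"seed_vec", "seed_day", "members"}
    for idx, doc in enumerate(docs):
        vec = weighted_vector(doc)
        best, best_sim = None, 0.0
        for cluster in clusters:
            gap = abs(doc["day"] - cluster["seed_day"])
            decay = 0.5 ** (gap / half_life_days)
            sim = cosine(vec, cluster["seed_vec"]) * decay
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
        else:
            # No sufficiently similar seed: this document seeds a new cluster.
            clusters.append({"seed_vec": vec, "seed_day": doc["day"], "members": [idx]})
    return [c["members"] for c in clusters]


# Toy documents (invented for illustration only).
docs = [
    {"title": "earthquake relief", "body": "rescue teams arrive earthquake zone", "day": 0},
    {"title": "earthquake rescue", "body": "teams continue earthquake relief work", "day": 1},
    {"title": "football final", "body": "team wins the football cup", "day": 1},
    {"title": "earthquake relief", "body": "rescue teams arrive earthquake zone", "day": 60},
]
clusters = single_pass(docs)
print(clusters)
```

In this toy run the fourth document has the same text as the first but arrives 60 days later, so the assumed time decay pushes it below the threshold and it starts a new cluster. With a very large `half_life_days` it would join the first cluster instead.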
Source
《电子科技大学学报》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2015, Issue 4, pp. 599-604 (6 pages)
Journal of University of Electronic Science and Technology of China
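Funding
National Natural Science Foundation of China (61100045, 61165013)
Specialized Research Fund for the Doctoral Program of Higher Education (20110184120008)
China Postdoctoral Science Foundation Special Funded Project (201104697)
Ministry of Education Humanities and Social Sciences Research Youth Fund (14YJCZH046)
Fundamental Research Funds for the Central Universities (2682013BR023)
Open Project of the Guangxi Colleges and Universities Key Laboratory of Scientific Computing and Intelligent Information Processing (GXSCIIP201407)
Scientific Research Project funded by the Education Department of Sichuan Province (14ZB0458)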