Abstract
[Objective] This paper explores how combining social tags with text content affects text clustering. [Methods] Using Chinese and English blog posts from Engadget, text features are extracted with three methods (TF×IDF, TextRank, and TextRank×IDF), tag similarity and content similarity are weighted with a linear function and a Sigmoid function, and the posts are clustered with the Affinity Propagation (AP) algorithm. [Results] TF×IDF yields the best clustering among the three feature extraction methods. Both weighting schemes improve the clustering of English posts to varying degrees, but for Chinese posts the Sigmoid-weighted results decline slightly. Of the two schemes, linear weighting performs better than Sigmoid weighting. [Limitations] The optimal weight coefficients for tag similarity and content similarity were not identified. The AP clustering algorithm cannot be applied to big data, and the large number of clusters it produces hampers presentation of the results. [Conclusions] Linearly weighting the similarity of social tags and text content can improve Web text clustering.
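The weighting step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: cosine similarity for content, Jaccard overlap for tags, the mixing parameter `alpha`, and in particular the form of `sigmoid_weight` (a logistic function of tag similarity used as the mixing weight, with hypothetical steepness `k`).

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(tags_a, tags_b):
    """Tag similarity as set overlap (an assumed choice, not from the paper)."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def linear_weight(content_sim, tag_sim, alpha=0.5):
    """Linear combination; alpha is a hypothetical weight coefficient."""
    return alpha * content_sim + (1.0 - alpha) * tag_sim


def sigmoid_weight(content_sim, tag_sim, k=5.0):
    """Assumed Sigmoid scheme: the tag weight grows with tag similarity."""
    w = 1.0 / (1.0 + math.exp(-k * (tag_sim - 0.5)))
    return (1.0 - w) * content_sim + w * tag_sim


def similarity_matrix(docs, combine):
    """Pairwise combined similarity for docs given as (vector, tags) pairs."""
    return [
        [combine(cosine(va, vb), jaccard(ta, tb)) for vb, tb in docs]
        for va, ta in docs
    ]


# Toy documents: term weights (e.g. TF×IDF scores) plus social tags.
docs = [
    ({"phone": 0.9, "camera": 0.4}, ["mobile", "review"]),
    ({"phone": 0.7, "battery": 0.5}, ["mobile", "news"]),
    ({"laptop": 0.8, "battery": 0.3}, ["pc", "review"]),
]
linear_sims = similarity_matrix(docs, linear_weight)
sigmoid_sims = similarity_matrix(docs, sigmoid_weight)
```

A matrix built this way could then be handed to an AP implementation that accepts precomputed similarities, for example scikit-learn's `AffinityPropagation(affinity="precomputed")`. That pairing is a suggestion, not a statement of the tooling the authors used.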
Source
New Technology of Library and Information Service (《现代图书情报技术》)
CSSCI
PKU Core Journal (北大核心)
2014, No. 11, pp. 45-52 (8 pages)
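Funding
National Social Science Fund of China project "Research on User-Based Knowledge Organization Models in Online Social Networks" (Grant No. 14BTQ033)
One of the research outcomes of the Ministry of Education Humanities and Social Sciences Planning Fund project "Research on the Generation and Clustering of Multilingual High-Quality Social Tags" (Grant No. 13YJA870020)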
Keywords
Social tag
Feature selection
Text clustering