Abstract
[Purpose/significance] Online forums are a major platform where public opinion gathers and spreads, and hot topic detection is a focus of online public opinion monitoring. Existing hot topic detection methods perform poorly on the large number of topic-specific forums. This article analyzes the text characteristics of online forums (BBS) and aims to construct a more scientific and accurate method for calculating term weights. [Method/process] TF-IDF is improved by introducing a term category weight, a part-of-speech weight and a position weight, yielding the TF-IDF-PPC method. [Result/conclusion] TF-IDF-PPC is compared on F1 score, on the same data set, with the traditional TF-IDF algorithm and with the improved methods TF-CRF, TW-TF-IDF, and TF-IDF combined with CHI. It is also applied in a hot topic detection case test. The experiments show that TF-IDF-PPC has a clear advantage, and that the algorithm can also be applied effectively to feature representation and topic extraction of forum texts.
Source
Information Studies: Theory & Application (《情报理论与实践》)
CSSCI
Peking University Core Journal (北大核心)
2021, Issue 5, pp. 187-192 (6 pages)