Abstract
To meet the needs of public opinion analysis in Tibetan, this paper takes Tibetan news text as its object of study and proposes a multi-feature fusion method for detecting hot events in Tibetan news. First, the characteristics of how hot news events arise are studied: the term frequency, term frequency growth rate, and website influence of hot words are analyzed, a heat measure is proposed, and a hot word set is obtained by heat filtering. Second, the distribution of event word pairs is analyzed, a word pair generation model and a word pair semantic gravity model are built, and a hot word pair set is obtained by heat filtering. Finally, agglomerative hierarchical clustering is applied to the mixed representation of hot words and word pairs to detect hot events in Tibetan news. Experimental results show that the method reaches a best F value of 0.6000, outperforming the comparison methods; it detects hot events fairly effectively and has some practical value.
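The abstract names three hot-word features (term frequency, term frequency growth rate, website influence) but does not give the heat formula. The following is only a minimal sketch of what a heat filter of this kind might look like. It assumes a weighted linear combination, add-one smoothing for the growth rate, and already-segmented tokens. The function name `hot_words`, the weights, the threshold, and the sample data are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def hot_words(current, previous, site_influence, threshold,
              weights=(0.4, 0.4, 0.2)):
    """Score words in the current time window and keep those above threshold.

    current / previous: lists of (site, tokens) for two consecutive windows.
    site_influence: hypothetical per-site influence values in [0, 1].
    The linear combination below is an assumption, not the paper's formula.
    """
    tf_now = Counter(tok for _, toks in current for tok in toks)
    tf_prev = Counter(tok for _, toks in previous for tok in toks)
    max_tf = max(tf_now.values(), default=1)

    # Highest influence among the sites that mention each word.
    influence = defaultdict(float)
    for site, toks in current:
        for tok in set(toks):
            influence[tok] = max(influence[tok], site_influence.get(site, 0.0))

    w_tf, w_growth, w_site = weights
    scores = {}
    for tok, count in tf_now.items():
        freq = count / max_tf  # normalized term frequency
        # Growth rate with add-one smoothing, clipped to [0, 1].
        growth = (count - tf_prev[tok]) / (tf_prev[tok] + 1)
        growth = max(0.0, min(1.0, growth))
        scores[tok] = w_tf * freq + w_growth * growth + w_site * influence[tok]

    # Heat filtering: keep only words whose score reaches the threshold.
    return {tok: s for tok, s in scores.items() if s >= threshold}


# Toy data with English placeholder tokens (real input would be segmented Tibetan).
current = [("site_a", ["quake", "quake", "rescue"]),
           ("site_b", ["quake", "weather"])]
previous = [("site_a", ["weather", "weather"])]
site_influence = {"site_a": 1.0, "site_b": 0.5}

result = hot_words(current, previous, site_influence, threshold=0.5)
```

In this toy run, words that rise sharply on an influential site pass the filter, while a word whose frequency falls does not. The same scoring idea could in principle be applied to word pairs before clustering, although the paper's word pair generation and semantic gravity models are not specified in the abstract.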
Authors
KONG Chunwei (孔春伟)
LYU Xueqiang (吕学强)
ZHANG Le (张乐)
ZHAO Haixing (赵海兴)
Affiliations: School of Computer Science, Qinghai Normal University, Xi'ning, Qinghai 810008, China; Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
Source
Journal of Chinese Information Processing (《中文信息学报》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2023, No. 2, pp. 53-61 (9 pages)
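Funding
Open Project Fund of the Qinghai Provincial Key Laboratory of Tibetan Information Processing and Machine Translation / Key Laboratory of Tibetan Information Processing, Ministry of Education (2019Z002)
Beijing Natural Science Foundation (4212020)
National Natural Science Foundation of China (61671070)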
Keywords
event detection
hot words
word pair
semantic gravity
hierarchical clustering