Abstract
In recent years, bursty events have occurred frequently across many fields and have affected social stability and development. This paper proposes a microblog bursty event detection model based on multiple word features. The model detects bursty events in massive microblog data, helping decision-makers monitor microblogs and guide public opinion so as to reduce the harm such events cause to society. First, the microblog data are sliced into time windows according to their timestamps, and within each window the word frequency feature, the topic tag (hashtag) feature, and the word frequency growth rate feature are computed for every word. Next, D-S evidence theory and the analytic hierarchy process are used to determine the weight of each feature, and the weighted features are fused into a bursty feature value for each word. Words with large bursty feature values are selected to form the bursty feature word set, and a coupling degree matrix over this set is built from co-occurrence degree and binding tightness. Finally, the coupling degree matrix is fed to a hierarchical agglomerative clustering algorithm, which produces a binary tree whose leaf nodes are the bursty words; a binary tree pruning algorithm based on internal similarity then partitions the clustering result, yielding the bursty events of the corresponding time window. Experiments show that the model performs best when the intra-cluster similarity threshold is 1.1, reaching a precision of 0.8462, a recall of 0.8684, and an F-measure of 0.8571, which indicates that the proposed method is effective.
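The first two stages described in the abstract (per-window word features, then weighted fusion into a bursty feature value) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `normalize` and `bursty_words`, the input layout, the max-normalization, the smoothed growth-rate formula, and the fixed `WEIGHTS` values. The paper derives its weights with D-S evidence theory and the analytic hierarchy process, which this sketch does not reproduce, and the coupling degree matrix and clustering stages are omitted.

```python
from collections import Counter

# Hypothetical fusion weights (assumption). The paper obtains its weights
# from D-S evidence theory and AHP; these fixed values are placeholders.
WEIGHTS = {"tf": 0.4, "tag": 0.3, "growth": 0.3}


def normalize(scores):
    """Scale a {word: value} map into [0, 1] by dividing by its maximum."""
    top = max(scores.values(), default=0.0)
    return {w: (v / top if top > 0 else 0.0) for w, v in scores.items()}


def bursty_words(prev_window, curr_window, top_k=3):
    """Rank the words of curr_window by a weighted sum of three features.

    Each window is a list of posts; each post is a pair
    (tokens, hashtag_tokens) of already-segmented words.
    """
    prev_tf = Counter(t for tokens, _ in prev_window for t in tokens)
    curr_tf = Counter(t for tokens, _ in curr_window for t in tokens)
    tag_tf = Counter(t for _, tags in curr_window for t in tags)

    # Growth rate relative to the previous window, with add-one smoothing
    # in the denominator so that unseen words do not divide by zero.
    # Negative growth is clipped to 0 (assumed formula).
    growth = {
        w: max(0.0, (c - prev_tf[w]) / (prev_tf[w] + 1))
        for w, c in curr_tf.items()
    }

    tf_n = normalize(dict(curr_tf))
    # Only words that also occur in post bodies receive a hashtag score.
    tag_n = normalize({w: tag_tf[w] for w in curr_tf})
    growth_n = normalize(growth)

    fused = {
        w: WEIGHTS["tf"] * tf_n[w]
        + WEIGHTS["tag"] * tag_n[w]
        + WEIGHTS["growth"] * growth_n[w]
        for w in curr_tf
    }
    # Highest fused value first; ties are broken alphabetically.
    return sorted(fused, key=lambda w: (-fused[w], w))[:top_k]
```

Calling `bursty_words(prev_window, curr_window)` on two consecutive time windows returns the candidate bursty feature words for the later window. In the full model these words would then be clustered using the coupling degree matrix.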
Authors
张仰森
段宇翔
王建
吴云芳
ZHANG Yang-sen; DUAN Yu-xiang; WANG Jian; WU Yun-fang (Institute of Intelligent Information Processing, Beijing Information Science and Technology University, Beijing 100101, China; Institute of Computational Linguistics, Peking University, Beijing 100871, China; Beijing Laboratory of National Economic Security Early-warning Engineering, Beijing 100044, China)
Source
《电子学报》
EI
CAS
CSCD
Peking University Core Journals
2019, No. 9, pp. 1919-1928 (10 pages)
Acta Electronica Sinica
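Funding
National Natural Science Foundation of China (No. 61772081)
Science and Technology Innovation Service Capacity Building, Research Base Construction, Beijing Laboratory: Beijing Laboratory of National Economic Security Early-warning Engineering project (No. PXM2018-014224-000010)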
Keywords
microblog
bursty events
bursty feature words
D-S evidence theory
hierarchical agglomerative clustering