Abstract
【Purpose/significance】Topic identification has long been a research focus in the study of public opinion, and a substantial body of work now exists. Existing studies mostly represent public opinion text with traditional bag-of-words models, topic models, or word-vector models. These can only assign each word a single fixed vector, and the traditional models require the text to be segmented into words first, so segmentation errors, data sparsity, and out-of-vocabulary words can degrade identification performance.【Method/process】This paper builds a topic identification model for online public opinion based on multi-sample Bidirectional Encoder Representations from Transformers (BERT). The text does not need to be segmented before training. Overlong texts are truncated by combining their head and tail. Feature embeddings are extracted along three dimensions (character, segment, and position), and the public opinion text is represented through the self-attention mechanism. During training, discriminative fine-tuning and multi-sample dropout are used to strengthen generalization and improve identification performance.【Result/conclusion】Experimental results show that the proposed model performs well on the public opinion topic classification task and can identify topics accurately without segmenting the text.【Innovation/limitation】The contribution of this paper is a new topic identification model for online public opinion. Its limitation is the complexity of the algorithm, and further parameter tuning and optimization are the focus of future work.
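The abstract names head-and-tail truncation and multi-sample dropout without giving details, so the following is only a minimal sketch of the two ideas under stated assumptions, not the authors' implementation. The function names, the 510-character budget, the 128-character head, the dropout rate, and the number of dropout samples are all hypothetical choices made for illustration. The sketch uses plain Python lists in place of a real BERT encoder output.

```python
import math
import random


def head_tail_truncate(text, max_len=510, head_len=128):
    """Keep the first head_len characters plus the last (max_len - head_len).

    Works on characters, so no word segmentation is needed.
    The default lengths are assumptions, not values from the paper.
    """
    if max_len <= 0 or not 0 <= head_len <= max_len:
        raise ValueError("need max_len > 0 and 0 <= head_len <= max_len")
    if len(text) <= max_len:
        return text
    tail_len = max_len - head_len
    # text[-0:] would return the whole string, so handle tail_len == 0 explicitly.
    tail = text[-tail_len:] if tail_len > 0 else ""
    return text[:head_len] + tail


def dropout(vec, p, rng):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in vec]


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def multi_sample_dropout_loss(features, weights, label, p=0.5, samples=4, seed=0):
    """Average the loss over several dropout masks of the same features.

    `features` stands in for the pooled encoder output and `weights` for a
    shared linear classifier (one row per class); both are hypothetical.
    """
    if not 0.0 <= p < 1.0 or samples < 1:
        raise ValueError("need 0 <= p < 1 and samples >= 1")
    rng = random.Random(seed)
    losses = []
    for _ in range(samples):
        dropped = dropout(features, p, rng)
        # The same classifier weights are reused for every dropout sample.
        logits = [sum(w * x for w, x in zip(row, dropped)) for row in weights]
        losses.append(cross_entropy(logits, label))
    return sum(losses) / len(losses)
```

For example, `head_tail_truncate("abcdefghij", max_len=6, head_len=2)` keeps `"ab"` from the head and `"ghij"` from the tail. In an actual model the averaged loss from `multi_sample_dropout_loss` would be backpropagated once per batch, which is what distinguishes multi-sample dropout from simply training for more steps.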
Authors
孙靖超
刘为军
SUN Jing-chao; LIU Wei-jun (School of Criminal Investigation, People's Public Security University of China, Beijing 100076, China)
Source
《情报科学》
CSSCI
Peking University Core Journal
2021, Issue 7, pp. 147-152 (6 pages)
Information Science
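Funding
National Social Science Fund of China, Major Special Project "Research on Integrating Core Socialist Values into Public Security Law Enforcement" (2018VHJ012).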
Keywords
network public opinion
topic identification
Bidirectional Encoder Representations from Transformers
topic classification
self-attention mechanism