Constructing a Topic Generation Model Based on Chinese Text Category Information
Abstract [Purpose/significance] To address the insufficient semantic description and weak topic coherence of the traditional LDA model in text topic recognition, this paper integrates text category information into LDA, forming a new topic generation model based on Chinese text category information, the CLCI-LDA model, which offers a new tool for text analysis and knowledge discovery in data mining. [Method/process] To extract topics with CLCI-LDA, the deep-learning sentence-embedding model Sentence-BERT first converts each text into a sentence embedding vector, which is concatenated with the document-topic vector generated by the LDA model to enrich the semantics and relatedness of the text vector. The K-means clustering algorithm then clusters the texts to obtain their category information. Finally, the high-frequency keywords in each cluster are obtained from topic-word frequencies and used to condense the topics. [Result/conclusion] A literature topic extraction experiment on China's "smart library" research field compared CLCI-LDA with the traditional LDA model. The results show that CLCI-LDA obtains topic words that better carry semantic information, and that its topic coherence score is better than that of the traditional LDA model. [Innovation/limitation] Compared with the traditional LDA model, CLCI-LDA has advantages in the depth of text semantic representation and the soundness of topic condensation. However, its parameter tuning remains insufficient and its depth of semantic understanding needs further improvement; in addition, the generality of the CLCI-LDA model has yet to be tested.
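The three steps named in the abstract (vector concatenation, K-means clustering, and per-cluster high-frequency keywords) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The sentence embeddings and document-topic vectors are hand-made toy numbers standing in for Sentence-BERT and LDA outputs, the tokens are pre-segmented English placeholders (real Chinese text would first need word segmentation), and the names `concat_vectors`, `kmeans`, and `top_keywords` are hypothetical. The clustering is a plain Lloyd's iteration seeded with the first k points, chosen only to keep the example deterministic and dependency-free.

```python
from collections import Counter
from math import dist


def concat_vectors(sentence_vecs, topic_vecs):
    """Join each sentence embedding with the matching document-topic vector."""
    return [list(s) + list(t) for s, t in zip(sentence_vecs, topic_vecs)]


def kmeans(points, k, iters=50):
    """Plain Lloyd's K-means; the first k points seed the centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def top_keywords(docs_tokens, labels, n=2):
    """Return the n most frequent tokens in each cluster."""
    counts = {}
    for tokens, lab in zip(docs_tokens, labels):
        counts.setdefault(lab, Counter()).update(tokens)
    return {lab: [w for w, _ in c.most_common(n)] for lab, c in counts.items()}


# Toy corpus: four pre-tokenized "documents" (hypothetical data).
docs_tokens = [
    ["library", "service", "smart"],
    ["library", "service", "reader"],
    ["algorithm", "data", "model"],
    ["algorithm", "model", "index"],
]
# Stand-ins for Sentence-BERT embeddings and LDA document-topic vectors.
sentence_vecs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
topic_vecs = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]

features = concat_vectors(sentence_vecs, topic_vecs)
labels = kmeans(features, k=2)
keywords = top_keywords(docs_tokens, labels)
print(labels, keywords)
```

In practice the toy vectors would be replaced by real model outputs, and the number of clusters and the number of keywords kept per cluster would need tuning, which the abstract itself lists as an open limitation.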
Authors DONG Tongqiang; ZHU Yanjun; MA Xiufeng (Department of Communication, Qufu Normal University, Rizhao 276826, China; Shandong Jianzhu University Library, Jinan 250102, China)
Source Information Science (《情报科学》), PKU Core Journal, 2024, No. 4, pp. 36-42 (7 pages)
Funding National Social Science Fund of China project "Construction and Application of a Chinese Text Topic Generation Model for Knowledge Flow Analysis" (18BTQ069).
Keywords topic model; model construction; topic recognition; deep learning; text clustering