Abstract
To address the unclear topic identification that arises when an LDA topic model is used for text feature extraction, this paper proposes a text feature extraction method based on the Labeled-LDA model. First, an LDA topic model extracts the topic words of each latent topic in the text, and the TF-IDF algorithm extracts keywords for each text category. Second, the Simhash algorithm computes the similarity between the extracted topic words and the category keywords, which assigns each latent topic to a category and yields its feature words. Experiments show that the combined method achieves higher classification accuracy than feature extraction based on TF-IDF alone or on the traditional LDA topic model: accuracy increased by 3.40%, recall by 4.40%, and the F value by 3.92%.
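The matching step described in the abstract can be illustrated with a minimal sketch. It assumes the LDA topic words and the TF-IDF category keywords have already been computed and are available as word-to-weight dictionaries. The function names (`simhash`, `hamming`, `closest_category`), the 64-bit fingerprint width, the MD5-based word hash, and the sample words are illustrative assumptions, not details taken from the paper.

```python
import hashlib


def simhash(weighted_words, bits=64):
    """Return a `bits`-wide Simhash fingerprint for a {word: weight} dict."""
    totals = [0.0] * bits
    for word, weight in weighted_words.items():
        # Stable per-word hash (assumption: MD5 truncated to `bits` bits).
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each set bit votes +weight, each clear bit votes -weight.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def closest_category(topic_words, category_keywords):
    """Pick the category whose keyword fingerprint is nearest to the topic's."""
    topic_fp = simhash(topic_words)
    return min(
        category_keywords,
        key=lambda c: hamming(topic_fp, simhash(category_keywords[c])),
    )


# Hypothetical inputs: one LDA topic and two TF-IDF keyword sets.
topic = {"match": 0.30, "team": 0.20, "goal": 0.10}
categories = {
    "sports": {"match": 0.30, "team": 0.20, "goal": 0.10},
    "finance": {"stock": 0.30, "bank": 0.20, "market": 0.10},
}
label = closest_category(topic, categories)
```

A smaller Hamming distance between fingerprints indicates more similar word sets, so the topic is labelled with the category at minimum distance. How the paper weights words or thresholds the distance is not stated in the abstract.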
Authors
Wang Rui (王瑞)
Long Hua (龙华)
Shao Yubin (邵玉斌)
Du Qingzhi (杜庆治)
Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650000, China
Source
Electronic Measurement Technology (《电子测量技术》)
2020, No. 1, pp. 141-146 (6 pages)