Algorithm Design of Text Duplicated-checking Based on Natural Language Processing (cited by: 9)
Abstract: To investigate duplication among the practice reports that university students submit during internships, categorized text data were collected from a university teaching management department. The texts were segmented with the Jieba word segmentation tool and converted to word vectors with Word2vec to support semantic analysis of the natural-language text. Considering three aspects (topic word extraction, probability distribution, and time complexity), Python's OS library was used for batch processing to remove duplicates, stop words, and non-Chinese words. Importance sampling was applied to optimize the LDA (latent Dirichlet allocation) model, yielding a new training model, ISLDA (importance sampling latent Dirichlet allocation), which extracts the topic words, and cosine similarity was used to calculate the repetition rate. A comparison of the two models' topic word categories and per-word probability distributions shows that the new training model improves on the topic model, raises the training accuracy of the computational model, strengthens duplicate checking on test texts, and thereby realizes the text duplicate-checking design reasonably well.
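The cleaning and similarity steps named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: it assumes the text is already segmented (a tool such as Jieba would do that step), it compares raw term-count vectors rather than ISLDA topic words or Word2vec vectors, and the stop-word set, the function names, and the toy reports are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word set for illustration; the paper's list is not given.
STOP_WORDS = {"的", "了", "和", "在", "是"}

# Matches tokens made up only of characters in the basic CJK ideograph block.
CHINESE_ONLY = re.compile(r"[\u4e00-\u9fff]+")


def clean_tokens(tokens):
    """Drop stop words and any token containing non-Chinese characters."""
    return [t for t in tokens if CHINESE_ONLY.fullmatch(t) and t not in STOP_WORDS]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as sharing nothing
    return dot / (norm_a * norm_b)


# Toy, pre-segmented reports (assumed input; segmentation is out of scope here).
report_a = clean_tokens("我 在 公司 完成 了 数据 分析 实习 2021".split())
report_b = clean_tokens("我 在 公司 完成 了 数据 分析 实习 report".split())
report_c = clean_tokens("学校 组织 了 安全 培训".split())

sim_ab = cosine_similarity(report_a, report_b)  # identical after cleaning
sim_ac = cosine_similarity(report_a, report_c)  # no shared terms
```

In this sketch the first two reports differ only in a stop-word-free, non-Chinese token, so they become identical after cleaning, while the third shares no terms with the first. A threshold on the similarity score (not specified in the abstract) would then decide what counts as a duplicate.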
Authors: DONG Xing-tong (董星彤); CHEN Shi-hong (陈士宏); CHEN Shu-xin (陈淑鑫) (School of Chemical and Materials Engineering, Beijing Technology and Business University, Beijing 100048, China; Department of Communication and Electronic Engineering, Qiqihar University, Qiqihar 161006, China; Department of Computer Science and Technology, Tianjin Ren'ai College, Tianjin 301636, China)
Source: Science Technology and Engineering (《科学技术与工程》), PKU Core journal, 2022, No. 3, pp. 1091-1097 (7 pages)
Funding: National Natural Science Foundation of China (U2031142); National Natural Science Foundation of China, Young Scientists Fund (11803013).
Keywords: semantic analysis; duplicated-checking model; importance sampling; text vectorization; similarity calculation
