Text feature extraction method based on convolution neural network


Abstract: As the number of Internet users grows, people contribute a wide variety of documents online, and these documents form a massive body of text data with great potential value. Classifying and organizing documents is a very challenging task, and extracting document feature information has become an important research direction. To address the high feature dimensionality and low processing efficiency of traditional text feature extraction methods, this paper designs a text feature extraction method based on a convolutional neural network: a convolutional neural network model is built and its parameters are selected. The experimental input is text from a Chinese corpus. The Word2vec toolkit converts the text into vectors, the convolutional neural network extracts the text features, and the K-means clustering algorithm is used to validate the extracted features, confirming the effectiveness of the proposed method.
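The pipeline named in the abstract (word vectors, convolutional feature extraction, K-means validation) can be illustrated with a minimal sketch. This is not the authors' model: the abstract does not give the network architecture, parameters, or corpus, so everything below is an assumption for illustration. The random lookup table `embed` stands in for Word2vec, the filters are random and untrained, and the names `conv_features`, `kmeans`, `DIM`, `WIDTH`, `N_FILTERS`, and the toy English token lists are hypothetical.

```python
import random

def embed(tokens, table, dim, rng):
    """Look up (or lazily create) a vector per token; a stand-in for Word2vec."""
    for tok in tokens:
        if tok not in table:
            table[tok] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return [table[tok] for tok in tokens]

def conv_features(vectors, filters, width):
    """One convolutional layer with ReLU, followed by max-over-time pooling."""
    dim = len(vectors[0])
    # Pad short documents with zero vectors so at least one window exists.
    padded = vectors + [[0.0] * dim] * max(0, width - len(vectors))
    features = []
    for filt in filters:  # each filter is a flat list of width * dim weights
        best = 0.0  # starting at 0 applies the ReLU floor to the pooled value
        for start in range(len(padded) - width + 1):
            window = [x for vec in padded[start:start + width] for x in vec]
            best = max(best, sum(w * x for w, x in zip(filt, window)))
        features.append(best)
    return features  # one value per filter: a fixed-length document feature

def kmeans(points, k, iterations=20, seed=0):
    """Plain K-means (Lloyd's algorithm) with squared Euclidean distance."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

rng = random.Random(42)
DIM, WIDTH, N_FILTERS = 8, 2, 6  # assumed toy sizes, not the paper's parameters
filters = [[rng.uniform(-1.0, 1.0) for _ in range(WIDTH * DIM)]
           for _ in range(N_FILTERS)]
docs = [  # hypothetical pre-segmented documents
    ["neural", "network", "training"],
    ["network", "training", "data"],
    ["stock", "market", "price"],
    ["market", "price", "index"],
]
table = {}
features = [conv_features(embed(d, table, DIM, rng), filters, WIDTH) for d in docs]
labels = kmeans(features, k=2)
print(labels)
```

Each document, whatever its length, is reduced to `N_FILTERS` numbers, which is the dimensionality reduction the abstract is aiming for. In a real setting the embeddings would come from a trained Word2vec model and the filters would be learned rather than random, so the cluster labels printed here carry no meaning beyond showing the data flow.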

Authors: Liu Gang; Li Zongchen; Guo Jianwei (College of Computer Science and Engineering, Changchun University of Technology, Changchun 130012, China; Modern Education Center, Changchun Finance College, Changchun 130012, China)

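Affiliations: College of Computer Science and Engineering, Changchun University of Technology; Modern Education Center, Changchun Finance College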

Source: Jiangsu Science and Technology Information, 2018, No. 14, pp. 21-23, 28 (4 pages)

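Funding: Major Science and Technology Bidding Special Project of the Jilin Provincial Department of Science and Technology (No. 20160203010GX); Industrial Innovation Special Fund Project of the Jilin Provincial Development and Reform Commission (No. 20170505MA2)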

Keywords: Word2vec; text analysis; K-means; convolution neural network

Classification number: G27 [Cultural Sciences: Archival Science]


