Abstract
Chinese short texts contain few feature words, are loosely standardized, and arrive in large volumes, while the ERNIE pre-trained model has a large memory footprint. Together these lead to sparse vector spaces, inaccurate text pre-training, and high time complexity in short text classification. To address these problems, a Chinese short text classification method based on an ERNIE-RCNN model is proposed. The model uses ERNIE to produce word vectors, masking entities and word-level semantic units; a Transformer encoding layer then encodes the word embedding vectors output by the ERNIE layer, which mitigates over-fitting and improves generalization. The RCNN model extracts features from the word vectors supplied by ERNIE: the convolution layer uses kernels of different sizes to extract features at different scales, the pooling layer performs a mapping step, and softmax performs the final classification. The model and seven deep learning text classification models were trained on a Chinese news dataset and compared on accuracy, precision, recall, F1 score, number of iterations, and running time. The results show that ERNIE-RCNN extracts the feature information in the text well, reduces training time, and effectively addresses the difficulties of Chinese short text classification, achieving good classification performance.
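The classification stage described in the abstract (convolution kernels of several sizes, pooling, then softmax) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the random vectors merely stand in for ERNIE output, the recurrent part of an RCNN is omitted, and all names, dimensions, and kernel widths (`conv_max_pool`, `classify`, `KERNEL_WIDTHS`, and so on) are assumptions chosen for illustration.

```python
import math
import random


def conv_max_pool(embeddings, kernel, bias):
    """Slide one filter over the token axis, apply ReLU, then max-pool over time.

    embeddings: list of token vectors (seq_len x emb_dim)
    kernel:     list of weight vectors (width x emb_dim)
    """
    width = len(kernel)
    best = 0.0  # starting at 0 folds the ReLU into the running maximum
    for t in range(len(embeddings) - width + 1):
        score = bias
        for i in range(width):
            score += sum(e * k for e, k in zip(embeddings[t + i], kernel[i]))
        best = max(best, score)
    return best


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embeddings, filters, w_out, b_out):
    """Concatenate one pooled feature per filter, then apply a linear layer and softmax."""
    features = [conv_max_pool(embeddings, kernel, bias) for kernel, bias in filters]
    logits = [
        sum(f * w for f, w in zip(features, class_weights)) + b
        for class_weights, b in zip(w_out, b_out)
    ]
    return softmax(logits)


# Hypothetical sizes, far smaller than a real ERNIE model, for illustration only.
SEQ_LEN, EMB_DIM, N_CLASSES = 12, 8, 4
KERNEL_WIDTHS = (2, 3, 4)   # assumed kernel sizes; the abstract does not list them
FILTERS_PER_WIDTH = 2

rng = random.Random(0)


def rand_matrix(rows, cols, scale):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# Random vectors stand in for the word embeddings an ERNIE encoder would output.
embeddings = rand_matrix(SEQ_LEN, EMB_DIM, 1.0)
filters = [
    (rand_matrix(width, EMB_DIM, 0.1), 0.0)
    for width in KERNEL_WIDTHS
    for _ in range(FILTERS_PER_WIDTH)
]
w_out = rand_matrix(N_CLASSES, len(filters), 0.1)
b_out = [0.0] * N_CLASSES

probs = classify(embeddings, filters, w_out, b_out)
predicted = max(range(N_CLASSES), key=probs.__getitem__)
print("class probabilities:", [round(p, 3) for p in probs])
print("predicted class index:", predicted)
```

Each kernel width sees a different number of adjacent tokens, so the pooled features cover short and longer n-gram patterns; max pooling reduces every filter to one value regardless of text length, which suits short inputs of varying size.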
Authors
WANG Hao-chang (王浩畅)
SUN Ming-ze (孙铭泽)
School of Computer and Information Technology, Northeast Petroleum University, Daqing 163318, China
Source
Computer Technology and Development (《计算机技术与发展》)
2022, No. 6, pp. 28-33 (6 pages)
Funding
National Natural Science Foundation of China (61402099, 61702093).