Abstract
Filtering and recognizing spam short messages has considerable social value. Traditional hand-crafted feature selection methods for short messages suffer from data sparseness, insufficient co-occurrence of feature information, and difficulty in feature extraction. To address these problems, a spam message recognition method based on word embeddings and a Convolutional Neural Network (CNN) is proposed. First, the skip-gram model of word2vec is trained on the Chinese Wikipedia corpus to obtain a word embedding for each word in the short message dataset, and the embeddings of the words in each message are stacked into a two-dimensional feature matrix representing that message. Then, the feature matrix is used as the input to the CNN: convolution kernels of different sizes in the convolution layer extract multi-scale message features, and a 1-max pooling strategy selects the locally optimal features. Finally, the locally optimal features are combined into a fused feature vector and passed to a softmax classifier to obtain the classification result. Experiments on 100,000 short messages show that, with the same feature extraction method, the CNN-based model reaches a recognition accuracy of 99.5%, which is 2.4% to 5.1% higher than that of traditional machine learning models, and the recognition accuracy of every model remains above 94%. These results indicate that the proposed method recognizes spam messages well and effectively improves recognition accuracy.
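The pipeline described in the abstract (embedding matrix, multi-scale convolution, 1-max pooling, softmax) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the embedding dimension, message length, kernel heights (2, 3, 4), number of kernels, ReLU activation, and the random weights standing in for trained word2vec vectors and trained CNN parameters are all assumptions chosen for illustration.

```python
import math
import random


def conv_1max(matrix, kernel, bias=0.0):
    """Slide a full-width kernel over consecutive word rows, apply ReLU,
    and keep only the largest response (1-max pooling)."""
    h = len(kernel)  # kernel height = number of words covered per window
    responses = []
    for start in range(len(matrix) - h + 1):
        s = bias
        for i in range(h):
            for x, w in zip(matrix[start + i], kernel[i]):
                s += x * w
        responses.append(max(0.0, s))  # ReLU (assumed activation)
    return max(responses)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(matrix, kernels, out_weights, out_bias):
    """Fuse one pooled feature per kernel, then apply a softmax layer."""
    features = [conv_1max(matrix, k) for k in kernels]
    scores = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(out_weights, out_bias)
    ]
    return softmax(scores)


# Hypothetical sizes; the abstract does not state the real ones.
EMBED_DIM = 8   # length of each word vector
MSG_LEN = 10    # number of words in the (padded) message
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Stand-in for the word2vec vectors of one message (one row per word).
message = rand_matrix(MSG_LEN, EMBED_DIM)
# Two untrained kernels for each assumed height 2, 3 and 4.
kernels = [rand_matrix(h, EMBED_DIM) for h in (2, 3, 4) for _ in range(2)]
# Untrained softmax layer for two classes (e.g. spam vs. normal).
out_weights = rand_matrix(2, len(kernels))
out_bias = [0.0, 0.0]

probs = classify(message, kernels, out_weights, out_bias)
print(probs)
```

Because each kernel spans the full embedding width, it slides only along the word axis, so kernels of different heights respond to word windows of different lengths; 1-max pooling then reduces each kernel's responses to a single value regardless of message length.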
Authors
赖文辉
乔宇鹏
LAI Wenhui; QIAO Yupeng (School of Automation Science and Engineering, South China University of Technology, Guangzhou, Guangdong 510640, China)
Source
《计算机应用》
CSCD
Peking University Core Journals (北大核心)
2018, No. 9, pp. 2469-2476 (8 pages)
Journal of Computer Applications