Abstract
In authorship attribution, the key task is to select writing features that can uniquely identify an author; such features are also called "writing fingerprints". Traditional approaches use bag-of-words, function-word, and structural features. Bag-of-words and function-word features can achieve fairly good classification results, but they ignore the associations between words and lose the semantic information of the text. Based on an analysis of Chinese grammar and sentence structure, this paper applies part-of-speech tagging and association rule mining to extract associated part-of-speech sequences from the text; the resulting features are called part-of-speech association features. These are combined with features based on the parts of speech of Chinese function words, sentiment bias, and text structure, giving four feature categories that together form the vector space (VS) of author features. Random Forest (RF), Logistic Regression (LR), and K-Nearest Neighbor (KNN) classifiers are then compared, and the best-performing one is selected to build the authorship attribution model. Experiments on the collected works of martial arts (wuxia) novelists of the same era verify the accuracy and stability of the multi-level feature vectors.
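The feature-extraction idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not name the mining procedure, so the code below substitutes a simple frequent-sequence miner over contiguous part-of-speech tags. The function names, the `min_support` threshold, and the tag labels (`r`, `v`, `n`, `d`, `u`, `y`) are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Sentence = List[str]        # one sentence, given as a list of POS tags
Pattern = Tuple[str, ...]   # a contiguous POS tag sequence


def ngrams(tags: Sentence, n: int) -> List[Pattern]:
    """Return all contiguous tag sequences of length n in one sentence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def mine_frequent_pos_sequences(
    sentences: List[Sentence], min_support: float = 0.5, max_len: int = 3
) -> Dict[Pattern, float]:
    """Keep POS sequences (length 2..max_len) whose support, i.e. the
    fraction of sentences containing them, is at least min_support."""
    if not sentences:
        return {}
    frequent: Dict[Pattern, float] = {}
    for n in range(2, max_len + 1):
        counts: Counter = Counter()
        for tags in sentences:
            # set() so a pattern is counted at most once per sentence
            counts.update(set(ngrams(tags, n)))
        for pattern, count in counts.items():
            support = count / len(sentences)
            if support >= min_support:
                frequent[pattern] = support
    return frequent


def feature_vector(sentences: List[Sentence], patterns: List[Pattern]) -> List[float]:
    """For one document, the fraction of its sentences containing each pattern."""
    total = max(len(sentences), 1)
    return [
        sum(1 for tags in sentences if p in ngrams(tags, len(p))) / total
        for p in patterns
    ]


# Hypothetical POS-tagged corpus; the tags are illustrative only.
corpus = [
    ["r", "v", "n"],
    ["r", "d", "v", "n"],
    ["n", "v", "u", "n"],
    ["r", "v", "n", "y"],
]
frequent = mine_frequent_pos_sequences(corpus, min_support=0.5)
patterns = sorted(frequent)  # fixed ordering of feature dimensions
print(patterns)
print(feature_vector(corpus[:2], patterns))
```

In a full pipeline, one such vector per document would be concatenated with the function-word, sentiment, and structural features before the classifiers are compared.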
Authors
ZHONG Min (钟敏)
WANG Yang (汪洋)
Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074; FiberHome Communications Science & Technology Development Co., Ltd., Nanjing 210079
Source
Computer & Digital Engineering (《计算机与数字工程》)
2020, No. 5, pp. 1159-1163, 1171 (6 pages)
Keywords
authorship attribution
part-of-speech connections
data mining