
Application of Multilevel Features Based on Computational Stylistics to Authorship Attribution

Applied Research of Authorship Attribution Based on Computation Stylistics and Multilevel Characteristics
Abstract: In authorship attribution, the most important step is selecting writing features that can uniquely identify an author; these features are also called "writing fingerprints". Traditional approaches use bag-of-words, function-word, and structural features. Although bag-of-words and function-word features can achieve fairly good classification results, they ignore the associations between words and lose the semantic information of the text. By analyzing the characteristics of Chinese grammar and sentence construction, this paper applies association rule mining to extract associated part-of-speech sequences from the text as features, called part-of-speech association features. Together with function-word part-of-speech features, sentiment bias, and text structure features, these four categories of features form the author's feature vector space. Random Forest (RF), Logistic Regression (LR), and K-Nearest Neighbor (KNN) classifiers are then compared, and the best-performing one is selected to build the authorship attribution model. The experimental data are the collected works of wuxia (martial arts) novelists of the same era, which are used to verify the accuracy and stability of the multilevel feature vector.
Authors: ZHONG Min (钟敏); WANG Yang (汪洋) (Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074; FiberHome Communications Science & Technology Development Co., Ltd., Nanjing 210079)
Source: Computer & Digital Engineering (《计算机与数字工程》), 2020, No. 5, pp. 1159-1163, 1171 (6 pages)
Keywords: authorship attribution; part-of-speech association; data mining
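The part-of-speech association features described in the abstract can be illustrated with a minimal sketch. The abstract does not say which association mining algorithm the paper uses, so this example substitutes a simpler stand-in: it keeps contiguous part-of-speech sequences whose document support meets a threshold, then represents each document by the relative frequency of those sequences. The function names, the `min_support` and `max_len` parameters, and the toy tag sequences are all hypothetical, and the documents are assumed to be already segmented and part-of-speech tagged.

```python
from collections import Counter
from typing import List, Sequence, Tuple

PosSeq = Tuple[str, ...]


def pos_ngrams(tags: Sequence[str], n: int) -> List[PosSeq]:
    """Return all contiguous POS-tag sequences of length n."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def mine_frequent_pos_sequences(
    docs: Sequence[Sequence[str]],
    min_support: float = 0.6,
    max_len: int = 3,
) -> List[PosSeq]:
    """Keep POS sequences (length 2..max_len) found in at least
    min_support of the documents. A stand-in for association mining,
    not the paper's algorithm."""
    doc_freq: Counter = Counter()
    for tags in docs:
        seen = set()
        for n in range(2, max_len + 1):
            seen.update(pos_ngrams(tags, n))
        doc_freq.update(seen)  # count each sequence once per document
    threshold = min_support * len(docs)
    return sorted(seq for seq, df in doc_freq.items() if df >= threshold)


def vectorize(tags: Sequence[str], features: Sequence[PosSeq]) -> List[float]:
    """Relative frequency of each mined sequence within one document."""
    counts: Counter = Counter()
    for n in {len(f) for f in features}:
        counts.update(pos_ngrams(tags, n))
    total = max(len(tags), 1)  # avoid division by zero on empty input
    return [counts[f] / total for f in features]


# Toy POS-tagged documents (illustrative tags, not real data).
docs = [
    ["n", "v", "u", "n"],
    ["n", "v", "u", "v"],
    ["r", "d", "v", "n"],
]
features = mine_frequent_pos_sequences(docs, min_support=0.6, max_len=3)
vectors = [vectorize(tags, features) for tags in docs]
```

Vectors built this way could be concatenated with the other three feature categories and passed to any of the classifiers the abstract compares.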