Abstract
This paper proposes a method based on the rank correlation of high-frequency words for investigating the authorship of disputed texts. The word types in each corpus are sorted by descending frequency of occurrence and assigned ranks; the correlation between the high-frequency word ranks of two corpora is then computed and used to infer how similar their language styles are. The method is compared with a method based on word-type co-occurrence rate and a method based on word-token co-occurrence rate. The 120 chapters of The Dream of Red Mansions (《红楼梦》) are divided evenly into 12 corpora, and the proposed method is used to compute the correlation between every pair of them. The results show that the correlation is high between any two of the first 8 corpora and also high between any two of the last 4 corpora, whereas the correlation between the first 8 corpora and the last 4 corpora is low. It is inferred that the first 80 chapters of The Dream of Red Mansions were written by one author and the last 40 chapters by another.
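The ranking-and-correlation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which correlation coefficient the authors use, so the code below assumes Spearman's rank correlation computed over the word types that appear among the top `n` of both corpora; the function names `top_ranks` and `rank_correlation`, the cutoff `n`, and the tie-breaking rule are all hypothetical choices made for illustration, and the input is assumed to be already segmented into words.

```python
from collections import Counter


def top_ranks(tokens, n):
    """Map each of the n most frequent word types to its rank (1 = most frequent)."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically so ranks are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ordered[:n], start=1)}


def rank_correlation(tokens_a, tokens_b, n=100):
    """Spearman correlation of high-frequency word ranks (an assumed choice of coefficient)."""
    ranks_a = top_ranks(tokens_a, n)
    ranks_b = top_ranks(tokens_b, n)
    # Only word types that are high-frequency in both corpora can be compared.
    shared = set(ranks_a) & set(ranks_b)
    k = len(shared)
    if k < 2:
        return 0.0  # too few shared words to define a correlation
    # Re-rank the shared words 1..k within each corpus; ranks are distinct, so no ties.
    pos_a = {w: i for i, w in enumerate(sorted(shared, key=ranks_a.get), start=1)}
    pos_b = {w: i for i, w in enumerate(sorted(shared, key=ranks_b.get), start=1)}
    d_squared = sum((pos_a[w] - pos_b[w]) ** 2 for w in shared)
    return 1 - 6 * d_squared / (k * (k * k - 1))
```

Under these assumptions, applying `rank_correlation` to every pair of the 12 corpora would yield a 12 × 12 matrix of the kind the abstract describes, in which two blocks of mutually high values would correspond to the two groups of chapters.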
Authors
马创新
陈小荷
MA Chuangxin; CHEN Xiaohe (Linguistic Sciences and Arts School of Jiangsu Normal University, Xuzhou, Jiangsu 221009, China; College of Liberal Arts, Nanjing Normal University, Nanjing, Jiangsu 210097, China)
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2018, No. 11, pp. 97-102 (6 pages)
Journal of Chinese Information Processing
Funding
Jiangsu Provincial Social Science Fund (15YYC001)
Keywords
high-frequency word types
rank
correlation
author identification