Abstract
Identifying heterographic puns is an important branch of humor research and is emerging as a research area in its own right. This paper proposes a heterographic pun identification model based on feature sets in four dimensions: semantic transparency, semantic relevance, phonetic expansibility, and syntactic features. Semantic transparency comprises two features, lexical item statistics and sentence character length; the syntactic feature set comprises five, namely person names, capitalization, tense, part of speech, and position. The nine features across these four dimensions are fed into a binary decision tree whose thresholds are obtained by K-means clustering, and the tree performs the pun identification. On the SemEval-2017 Task 7 corpus, the method achieves an F1 score higher than that of the top-ranked participating team. The experiments show that binary decision tree classification over the four feature dimensions is effective for identifying heterographic puns. Among the features, phonetic expansibility and the syntactic feature set contribute most, which is consistent with the expectation that phonetics plays a large role in identifying heterographic puns.
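The idea of deriving decision thresholds by clustering, as described in the abstract, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names `kmeans_threshold` and `classify`, the feature names `phonetic` and `relevance`, the sample scores, the hand-picked 0.3 threshold, and the choice to place each threshold at the midpoint between two 1-D cluster centroids.

```python
from statistics import mean


def kmeans_threshold(values, max_iter=100):
    """Run 1-D k-means with k=2 and return the midpoint of the two centroids.

    Assumption: the midpoint serves as the split threshold for one feature.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo  # all values identical: no meaningful split
    for _ in range(max_iter):
        # Assign each value to the nearer centroid (ties go to the low cluster).
        low_cluster = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high_cluster = [v for v in values if abs(v - lo) > abs(v - hi)]
        new_lo, new_hi = mean(low_cluster), mean(high_cluster)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def classify(node, features):
    """Walk a binary decision tree.

    A node is either a bool leaf (True = pun) or a tuple
    (feature_name, threshold, subtree_if_not_above, subtree_if_above).
    """
    while not isinstance(node, bool):
        name, threshold, below, above = node
        node = above if features[name] > threshold else below
    return node


# Hypothetical per-sentence scores for one feature.
phonetic_scores = [0.1, 0.15, 0.2, 0.8, 0.85, 0.9]
phonetic_threshold = kmeans_threshold(phonetic_scores)

# Hypothetical two-level tree: test the phonetic feature first, then relevance.
tree = ("phonetic", phonetic_threshold, False, ("relevance", 0.3, False, True))
```

Calling `classify(tree, {"phonetic": 0.9, "relevance": 0.6})` follows the "above" branch at both nodes and reaches the `True` leaf. A full model along the lines of the abstract would have one threshold per feature across all nine features; the tree shape used here is only a placeholder.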
Authors
Linhong XU (徐琳宏)
Hongfei LIN (林鸿飞)
Ruihua QI (祁瑞华)
Liang YANG (杨亮)
Affiliations: Software School, Dalian University of Foreign Languages, Dalian 116044, China; Computer Department, Dalian University of Technology, Dalian 116024, China
Source
Scientia Sinica Informationis (《中国科学:信息科学》)
CSCD
PKU Core Journal (北大核心)
2018, Issue 11, pp. 1510-1520 (11 pages)
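Funding
Supported by:
National Natural Science Foundation of China, Key Program (Grant No. 61632011)
National Natural Science Foundation of China (Grant Nos. 61772103, 61702080)
National Social Science Fund of China, General Program (Grant No. 15BYY028)
Natural Science Foundation of Liaoning Province (Grant Nos. 20170540230, 2015020017, 20170540232)
Liaoning Province Excellent Talents Program (Grant No. LJQ2014127)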
Keywords
heterographic pun
sentiment analysis
binary decision tree
semantic feature set
clustering