Abstract
Traditional short-text classification methods rely heavily on grammatical tags and thesauri, which makes them language dependent. To address this problem, a short-text classification method based on language-independent semantic kernel learning (SKL) is proposed. First, patterns are extracted from documents using the semantic information of short texts. Then, each extracted pattern is annotated with three annotation layers (word, document, and category). Finally, the similarity between documents is computed from the three annotation layers, and classification is performed according to that similarity. Experiments on English and Chinese datasets verify the effectiveness of the method, and the results show that it achieves better classification performance than several other kernel methods.
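The three steps named in the abstract (pattern extraction, three-layer annotation, similarity-based classification) can be illustrated with a small sketch. The abstract does not specify how patterns are extracted or how the layers are combined, so everything below is an assumption made for illustration: patterns are taken to be contiguous token bigrams, the three layers are compared with Jaccard and cosine scores averaged with equal weight, and classification is a nearest-neighbour lookup. All function names (`extract_patterns`, `annotate`, `profile`, `similarity`, `classify`) are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def extract_patterns(tokens, n=2):
    # Assumption: a "pattern" is a contiguous token n-gram (not the paper's extractor).
    if len(tokens) < n:
        return [tuple(tokens)] if tokens else []
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def annotate(train):
    # For every pattern seen in training, record the document layer (ids of
    # documents containing it) and the category layer (label counts).
    # The word layer is the pattern's own tokens, so it needs no extra storage.
    layers = {}
    for doc_id, (tokens, label) in enumerate(train):
        for p in extract_patterns(tokens):
            entry = layers.setdefault(p, {"docs": set(), "cats": Counter()})
            entry["docs"].add(doc_id)
            entry["cats"][label] += 1
    return layers


def profile(tokens, layers):
    # Aggregate the three layers over all patterns of one document.
    words, docs, cats = set(), set(), Counter()
    for p in extract_patterns(tokens):
        words.update(p)
        if p in layers:
            docs |= layers[p]["docs"]
            cats.update(layers[p]["cats"])
    return words, docs, cats


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity(x, y, layers):
    # Assumption: equal-weight average of the three layer similarities.
    wx, dx, cx = profile(x, layers)
    wy, dy, cy = profile(y, layers)
    return (jaccard(wx, wy) + jaccard(dx, dy) + cosine(cx, cy)) / 3.0


def classify(tokens, train, layers):
    # Assign the label of the most similar training document.
    best = max(train, key=lambda ex: similarity(tokens, ex[0], layers))
    return best[1]


# Toy data, invented for this sketch only.
train = [
    (["cheap", "flight", "tickets"], "travel"),
    (["book", "flight", "tickets", "online"], "travel"),
    (["football", "match", "score"], "sports"),
    (["live", "football", "match"], "sports"),
]
layers = annotate(train)
print(classify(["flight", "tickets", "deal"], train, layers))
```

Because the sketch only consumes token lists and never consults a tagger or thesaurus, the same functions accept text in any language once it is split into tokens, for example a Chinese sentence split into characters. This mirrors the language-independence goal of the abstract without claiming to reproduce the published kernel.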
Source
《计算机应用与软件》
CSCD
2015, No. 7, pp. 314-318 (5 pages)
Computer Applications and Software
Keywords
Short-text classification
Semantic kernel learning
Similarity measure
Language independence
Annotation layers
Semantic annotation of patterns