Abstract
Because scientific papers are typically semi-structured documents, this paper proposes a multi-level (hierarchical) classification model that exploits their metadata, where the metadata comprise the title, keyword set, abstract and similar fields. Experiments show that classifying on metadata alone achieves precision close to that of traditional classification based on the full text. Moreover, when a taxonomy derived from domain knowledge is used, first classifying papers coarsely with metadata at the higher levels of the taxonomy and then classifying them with full text at the lower levels yields precision higher than that of the best known algorithm. Because metadata are far smaller than the full text, and each coarse class contains far fewer papers than the whole collection, the model can greatly shorten classification time when the number of classes is large and the documents are distributed fairly evenly across the taxonomy.
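The two-stage procedure described in the abstract can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the abstract does not name the base classifier, so a tiny multinomial Naive Bayes is used here as a stand-in, and the names `NaiveBayes`, `TwoStageClassifier`, the dictionary keys (`metadata`, `full_text`, `top`, `leaf`) and the toy corpus `PAPERS` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems need more)."""
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (stand-in classifier)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token frequencies
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class TwoStageClassifier:
    """Coarse step on metadata (top level), fine step on full text (lower level)."""

    def fit(self, papers):
        # Stage 1: one classifier over the small metadata strings.
        self.coarse = NaiveBayes().fit(
            [p["metadata"] for p in papers], [p["top"] for p in papers]
        )
        # Stage 2: one full-text classifier per top-level class,
        # each trained only on the papers that belong to that class.
        by_top = defaultdict(list)
        for p in papers:
            by_top[p["top"]].append(p)
        self.fine = {
            top: NaiveBayes().fit(
                [p["full_text"] for p in group], [p["leaf"] for p in group]
            )
            for top, group in by_top.items()
        }
        return self

    def predict(self, metadata, full_text):
        top = self.coarse.predict(metadata)
        return top, self.fine[top].predict(full_text)


# Hypothetical toy corpus: metadata stands for title + keywords + abstract.
PAPERS = [
    {"metadata": "query index database",
     "full_text": "relational query index transaction storage",
     "top": "cs", "leaf": "databases"},
    {"metadata": "routing protocol network",
     "full_text": "packet routing protocol latency congestion",
     "top": "cs", "leaf": "networks"},
    {"metadata": "laser lens optics",
     "full_text": "laser lens refraction beam wavelength",
     "top": "physics", "leaf": "optics"},
    {"metadata": "force motion mechanics",
     "full_text": "force motion momentum friction torque",
     "top": "physics", "leaf": "mechanics"},
]

model = TwoStageClassifier().fit(PAPERS)
print(model.predict("network routing latency",
                    "packet congestion latency protocol"))
```

In this sketch the full text is consulted only by the one fine classifier selected in the coarse step, which is where the time saving claimed in the abstract would come from when classes are numerous and evenly populated.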
Source
《山东大学学报(理学版)》
CAS
CSCD
Peking University Core Journals (北大核心)
2006, No. 3, pp. 14-16, 123 (4 pages in total)
Journal of Shandong University(Natural Science)
Funding
Supported by the Ministry of Education Fund for Key Teachers (教技司[2000]65)