The study on automatic classification of digital documents of scientific papers

Cited by: 5
Abstract: Since scientific papers are usually semi-structured documents, a hierarchical classification model based on the metadata of scientific papers is proposed, where the metadata include the title, keyword set, abstract, and similar information. Experiments show that classification using the metadata alone achieves precision close to that of traditional classification based on the full text. Furthermore, when papers are classified against a taxonomy derived from domain knowledge in two stages (first coarsely, using the metadata, at the higher levels of the taxonomy; then on the full text, at the lower levels), the resulting precision is higher than that of the best known algorithm. Because the metadata are far smaller than the full text, and each coarse class contains far fewer papers than the whole collection, the model greatly shortens classification time when the number of classes is large and the documents are distributed fairly evenly across the taxonomy.
Source: Journal of Shandong University (Natural Science) (《山东大学学报(理学版)》), indexed in CAS, CSCD, and 北大核心 (PKU Core); 2006, No. 3, pp. 14-16, 123 (4 pages)
Funding: Supported by the Ministry of Education Backbone Teacher Fund (教技司[2000]65)
Keywords: technical literature; text categorization; hierarchy; accuracy; efficiency
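
The two-stage scheme described in the abstract (coarse classification on metadata, then fine classification on full text within the chosen coarse class) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the underlying classifier, so a small multinomial naive Bayes is used as a stand-in, and the names `NaiveBayes`, `TwoStageClassifier`, the dictionary keys, and the toy papers are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (stand-in classifier)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + smoothed log likelihood of each token
            score = math.log(self.doc_counts[label] / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class TwoStageClassifier:
    """Stage 1: metadata -> coarse class. Stage 2: full text -> fine class."""

    def fit(self, papers):
        # One coarse model trained on the (small) metadata of every paper.
        self.coarse = NaiveBayes().fit(
            [p["metadata"] for p in papers], [p["coarse"] for p in papers]
        )
        # One fine model per coarse class, trained only on that class's papers.
        groups = defaultdict(list)
        for p in papers:
            groups[p["coarse"]].append(p)
        self.fine = {
            c: NaiveBayes().fit(
                [p["full_text"] for p in g], [p["fine"] for p in g]
            )
            for c, g in groups.items()
        }
        return self

    def predict(self, metadata, full_text):
        coarse = self.coarse.predict(metadata)
        # Full text is compared only against the subclasses of `coarse`.
        return coarse, self.fine[coarse].predict(full_text)


# Toy training data (invented for illustration only).
papers = [
    {"metadata": "sorting algorithm complexity", "coarse": "cs", "fine": "algorithms",
     "full_text": "quicksort partitions the array around a pivot"},
    {"metadata": "database query index", "coarse": "cs", "fine": "databases",
     "full_text": "the query optimizer uses a btree index"},
    {"metadata": "protein gene sequence", "coarse": "bio", "fine": "genetics",
     "full_text": "the gene encodes a protein in the cell"},
    {"metadata": "ecology species habitat", "coarse": "bio", "fine": "ecology",
     "full_text": "species diversity depends on habitat area"},
]

model = TwoStageClassifier().fit(papers)
print(model.predict("query index tuning", "a btree index speeds up the query"))
```

The efficiency argument in the abstract shows up in `predict`: the first stage reads only the short metadata string, and the second stage scores the full text against the subclasses of a single coarse class instead of every class in the taxonomy.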
