Abstract
Most current research on academic text mining relies on word-based, window-based, or full-text methods. These methods tend to ignore the internal structure of academic texts, which gives rise to many ambiguity problems. To address this shortcoming, this paper proposes a structure-function framework for research papers that defines the function of each section and the logical structure of academic texts. On this basis, the paper discusses the automatic classification of structure function at three levels: based on section headers, based on section content together with headers, and based on paragraphs. At the first level (section headers), it reports an automatic classification experiment that combines a vocabulary list with sequence labeling, which achieved satisfactory results.
Source
《情报学报》
CSSCI
PKU Core Journal
2014, No. 9, pp. 979-985 (7 pages)
Journal of the China Society for Scientific and Technical Information
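Funding
This paper is one of the research outcomes of the following projects:
National Natural Science Foundation of China, General Program, "Research on Modeling and Framework Implementation of General Entity Retrieval Based on Language Models" (Grant No. 71173164)
Ministry of Education Key Project of Humanities and Social Sciences Research Bases, "Research on Fine-Grained Web Information Retrieval Models and Framework Construction" (Grant No. 10JJD630014)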
Keywords
text mining
structure function
automatic classification