Abstract
This paper proposes a method for reconstructing the logical structure of flow documents based on paragraph role recognition. Paragraph roles are determined from formatting information, and the chapter-and-section logical structure and content of the document are described in XML. A prototype system targeting OOXML is implemented that automatically converts standard documents into XML documents carrying logical structure information. Experiments verify the feasibility of the approach, which provides an effective basis for subsequent document data mining.
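The general idea of turning a flat run of paragraphs into a nested XML structure can be illustrated with a minimal sketch. This is not the authors' system: the function names `role_of` and `build_structure`, the element names `document`, `section`, `title`, and `para`, and the rule that a style named "Heading N" marks a level-N heading are all assumptions made for illustration. Real paragraph role recognition would draw on richer formatting information than a style name.

```python
import re
import xml.etree.ElementTree as ET


def role_of(style):
    """Hypothetical role rule: 'Heading N' means a level-N heading, anything else is body text (0)."""
    match = re.fullmatch(r"Heading\s*(\d+)", style)
    return int(match.group(1)) if match else 0


def build_structure(paragraphs):
    """Build a nested XML tree from a flat list of (style, text) paragraphs."""
    root = ET.Element("document")
    # Stack of (heading level, element) for the sections that are currently open.
    stack = [(0, root)]
    for style, text in paragraphs:
        level = role_of(style)
        if level == 0:
            # Body text belongs to the innermost open section.
            ET.SubElement(stack[-1][1], "para").text = text
            continue
        # A new heading closes every open section at the same or a deeper level.
        while len(stack) > 1 and stack[-1][0] >= level:
            stack.pop()
        section = ET.SubElement(stack[-1][1], "section", {"level": str(level)})
        ET.SubElement(section, "title").text = text
        stack.append((level, section))
    return root


if __name__ == "__main__":
    sample = [
        ("Heading1", "Scope"),
        ("Normal", "This standard applies to ..."),
        ("Heading2", "Exclusions"),
        ("Normal", "It does not cover ..."),
        ("Heading1", "Terms"),
    ]
    print(ET.tostring(build_structure(sample), encoding="unicode"))
```

The stack keeps track of which sections are still open, so a level-2 heading nests under the most recent level-1 heading, and the next level-1 heading starts a new top-level section.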
Authors
ZHAO Xue (赵雪)
HOU Xia (侯霞)
Computer School, Beijing Information Science & Technology University, Beijing 100101, China
Source
Journal of Beijing Information Science and Technology University (Natural Science Edition) (《北京信息科技大学学报(自然科学版)》)
2017, No. 5, pp. 56-61, 66 (7 pages)
Funding
Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (CIT&TCD201504056)