
Multimodal cascaded document layout analysis network based on Transformer
Abstract: A Transformer-based multimodal cascaded document layout analysis network (MCOD-Net) was proposed to address two shortcomings of existing methods: the embeddings of the text- and image-modality pretraining objectives are misaligned, and document images are preprocessed with convolutional neural network (CNN) based structures, which complicates the pipeline and inflates the number of model parameters. A word block alignment embedding module (WAEM) was designed to align the embeddings of the text and image pretraining objectives, and the model was pretrained with masked language modeling (MLM), masked image modeling (MIM) and word-patch alignment (WPA) to strengthen its representation learning across the text and image modalities. The raw document images were used directly, with linear projections of image patches serving as the image representation, which simplified the model structure and reduced the number of parameters. Experimental results show that the proposed model achieves a mean average precision (mAP) of 95.1% on the public PubLayNet dataset, a 2.5% overall improvement over other models, with strong generalization ability and the best comprehensive performance.
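To make the "linear projection of image patches" idea concrete, the sketch below shows a minimal ViT-style patch embedding in PyTorch. It is an illustration only: the class name PatchEmbed, the 224x224 input size, the 16x16 patch size, and the 768-dimensional embedding are assumptions for the example and are not taken from the paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Illustrative ViT-style patch embedding: the document image is split into
    fixed-size patches and each patch is mapped to an embedding by a single
    linear projection, in place of a CNN preprocessing backbone."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A Conv2d with kernel = stride = patch_size is equivalent to applying
        # one linear projection to each flattened, non-overlapping patch.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                 # x: (B, 3, H, W)
        x = self.proj(x)                  # (B, embed_dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
        return x

# Usage: a 224x224 page image becomes a sequence of 196 patch embeddings.
page = torch.randn(1, 3, 224, 224)
tokens = PatchEmbed()(page)
print(tokens.shape)  # torch.Size([1, 196, 768])
```

Because each patch is handled by a single linear projection, the image branch stays lightweight, which matches the paper's stated goal of removing the CNN preprocessing stage and reducing the overall parameter count.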
Authors: WEN Shaojie, WU Ruigang, FENG Chaowen, LIU Yingli (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Yunnan Key Laboratory of Computer Technologies Application, Kunming University of Science and Technology, Kunming 650500, China)
Source: Journal of Zhejiang University: Engineering Science, 2024, No. 2, pp. 317-324, 369 (9 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (52061020, 61971208); Open Fund of the Yunnan Key Laboratory of Computer Technologies Application (2020103); Yunnan Provincial Major Science and Technology Project (202302AG050009).
Keywords: document layout analysis; word-block alignment embedding; Transformer; MCOD-Net model