Abstract
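Objective: Scene text detection is one of the important tasks in scene understanding and text recognition. Although deep learning based algorithms have significantly improved detection accuracy, existing methods are weak at extracting both the local semantics of text and the global semantics among text instances. As a result, the multi-level semantics of text are not modeled and detection accuracy remains unsatisfactory. To address this problem, a scene text detection algorithm based on hierarchical semantic fusion is proposed. Method: The method consists of a text-fragment based local semantic understanding module and a text-instance based global semantic understanding module, which guide the network to attend to the multi-level semantic information within text and among text instances, respectively. First, the text-fragment based local semantic understanding module divides the text into multiple fragments according to relative position and, under the supervision of fine-grained optimization objectives, strengthens the network's perception of local semantics. Then, the text-instance based global semantic understanding module uses the coarse segmentation results of the text fragments to filter out background regions and extract reliable text-region features, and applies an attention mechanism to adaptively capture the global semantic information of arbitrarily shaped text and obtain the final segmentation result. In addition, to reduce the interference of prediction noise in boundary regions with the aggregation of hierarchical semantic information, a boundary-aware loss function is proposed to reduce the ambiguity of boundary-region features. Result: The algorithm was evaluated on three commonly used scene text detection datasets and compared with other algorithms, and the proposed method achieved significant performance gains. On the Total-Text dataset, the F-measure is 87.0%, an improvement of 1.0% over other models; on the MSRA-TD500 (MSRA text detection 500 database) dataset, the F-measure is 88.2%, an improvement of 1.0% over other models; on the ICDAR 2015 (International Conference on Document Analysis and Recognition) dataset, the F-measure is 87.0%. Conclusion: By separately constructing the semantic context at different levels and imposing an additional penalty on ambiguous features, the proposed model addresses the insufficient extraction of hierarchical semantics and achieves higher detection accuracy.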
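The background-filtering and attention step of the global module can be illustrated with a small sketch. This is not the authors' implementation: the function name `masked_global_attention`, the 0.5 threshold, the plain scaled dot-product attention and the residual addition are all assumptions chosen for illustration, and pixels are represented as flat Python lists rather than feature maps.

```python
import math


def masked_global_attention(features, coarse_prob, threshold=0.5):
    """Refine per-pixel features with attention restricted to likely text pixels.

    features:    list of N feature vectors (lists of floats), one per pixel.
    coarse_prob: list of N coarse text probabilities in [0, 1].
    threshold:   hypothetical cut-off used to filter background pixels.
    """
    # Keep only pixels that the coarse segmentation marks as text (assumed rule).
    fg = [i for i, p in enumerate(coarse_prob) if p > threshold]
    out = [list(f) for f in features]  # background pixels stay unchanged
    if not fg:
        return out

    dim = len(features[0])
    scale = math.sqrt(dim)
    for i in fg:
        # Scaled dot-product scores against every foreground pixel.
        scores = [
            sum(a * b for a, b in zip(features[i], features[j])) / scale
            for j in fg
        ]
        # Numerically stable softmax over the foreground pixels.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Attention-weighted context vector, added back as a residual.
        context = [
            sum(w / total * features[j][c] for w, j in zip(exps, fg))
            for c in range(dim)
        ]
        out[i] = [features[i][c] + context[c] for c in range(dim)]
    return out


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 1.0]]
    probs = [0.9, 0.8, 0.1, 0.7]  # the third pixel is treated as background
    for row in masked_global_attention(feats, probs):
        print(row)
```

Restricting attention to pixels that pass the coarse mask keeps background features out of the aggregated context, which is the intuition the abstract gives for filtering before attention.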
Objective: Scene text detection is an essential computer vision task that aims to localize text instances in a given image. It benefits text-related applications such as scene understanding, translation, and text visual question answering. With the rise of deep learning, convolutional neural networks (CNNs) have been widely applied to text detection. Much research has focused on locating text by regressing quadrangular bounding boxes. However, because regression-based methods do not fit text of arbitrary shape (e.g., curved text), many approaches have turned to segmentation-based methods. Fully convolutional networks (FCNs) are commonly used to obtain high-resolution feature maps, from which a pixel-level mask is predicted to locate the text instances. Owing to the extreme aspect ratios and varied sizes of text instances, existing models struggle to integrate local-level and global-level semantics in a single feature map. More feature maps can be introduced from multiple levels of the network, with hierarchical semantics generated from the corresponding feature maps. However, these modules require the network to optimize the hierarchical features simultaneously, which may distract the network to a certain extent. Hence, existing networks still need to capture hierarchical semantics more accurately. Method: To resolve this problem, a segmentation-based text detection method built on a hierarchical semantic fusion network is developed. We decouple the local and global feature extraction processes and learn the corresponding semantics. Specifically, two mutually beneficial components are introduced to enhance the local and global features: a sub-region based local semantic understanding module (SLM) and an instance based global semantic understanding module (IGM). First, SLM segments the text instance into a kernel and multiple sub-regions according to their positions within the text. SLM can then be used to learn their segmentation, which is
Authors
王紫霄
谢洪涛
王裕鑫
张勇东
Wang Zixiao; Xie Hongtao; Wang Yuxin; Zhang Yongdong (School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China)
Source
《中国图象图形学报》
CSCD
Peking University Core Journals (北大核心)
2023, No. 8, pp. 2343-2355 (13 pages)
Journal of Image and Graphics
Funding
Science Fund for Creative Research Groups of the National Natural Science Foundation of China (62121002, 62022076, U1936210).