Abstract
A new method for optical formula recognition and analysis is proposed, in which RL (run-length) features are used during formula symbol extraction and recognition to improve the recognition rate. Symbol images are extracted with a two-layer connected-component search algorithm. The first layer extracts symbols on the basis of RL features and yields the overall connected region of each composite symbol; the second layer applies a conventional search to determine the single symbols contained in those composite symbols. A dedicated formula symbol recognizer identifies the symbols. The logical structure of the formula is then derived from the semantic information and geometric relationships among the symbols and is finally expressed as a formula structure tree. Recognition experiments on formulas in printed documents gave good results, indicating that the method has good application prospects.
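The two-layer search described in the abstract can be illustrated with a small sketch. The abstract does not define the RL feature or the search details, so everything below is an assumption made for illustration: the first layer is approximated by vertical run-length smearing (short background runs between foreground pixels in a column are filled), and the second layer is ordinary 8-connected labelling on the original pixels. The names `smear_vertical`, `label_components`, `two_layer_extract`, and the parameter `max_gap` are hypothetical and do not come from the paper.

```python
from collections import deque


def label_components(grid):
    """Conventional 8-connected search; returns bounding boxes (top, left, bottom, right)."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if not grid[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def smear_vertical(grid, max_gap):
    """Assumed RL step: fill background runs of length <= max_gap
    that lie between two foreground pixels in the same column."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for c in range(w):
        last_fg = None
        for r in range(h):
            if grid[r][c]:
                if last_fg is not None and 0 < r - last_fg - 1 <= max_gap:
                    for k in range(last_fg + 1, r):
                        out[k][c] = 1
                last_fg = r
    return out


def two_layer_extract(grid, max_gap=2):
    """Layer 1: regions of composite symbols from the smeared image.
    Layer 2: single symbols found inside each region of the original image.
    Simplification: each region is cropped by its bounding box, so pixels of a
    neighbouring symbol that intrude into the box would be included."""
    result = []
    for top, left, bottom, right in label_components(smear_vertical(grid, max_gap)):
        sub = [row[left:right + 1] for row in grid[top:bottom + 1]]
        parts = [(t + top, l + left, b + top, r + left)
                 for t, l, b, r in label_components(sub)]
        result.append(((top, left, bottom, right), parts))
    return result


# Toy image: an "=" sign (two separate bars) next to a vertical stroke.
image = [
    [1, 1, 1, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 0, 1],
]
for region, parts in two_layer_extract(image, max_gap=1):
    print(region, parts)
```

In the toy image the two bars of "=" are disconnected, so a conventional search alone would report them as two unrelated symbols. The smearing step joins them into one composite region first, and the second pass then recovers the two bars inside it.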
Source
《光学技术》
EI
CAS
CSCD
PKU Core (北大核心)
2007, No. 1, pp. 79-82 (4 pages)
Optical Technique
Funding
Supported by the Natural Science Foundation of Hebei Province (F2004000132)
Keywords
OCR
optical formula recognition
symbol recognition
structural analysis
RL feature