Abstract
Conventional optical character recognition (OCR) systems are typically trained on samples that contain no italic characters, so they fail to recognize italic text correctly, which hampers knowledge acquisition from agricultural literature. Adding italic characters to the training samples would increase sample complexity and could also degrade recognition of upright (regular) characters. To address this problem, this paper presents a method for detecting and correcting English italics. Each text line is first split into words, and each word is further divided into individual characters. The morphological features of each character are then examined, and the style of the word (italic or upright) is determined from them. Finally, all words detected as italic are collected and used to calculate an accurate slant angle for the italic characters, which is then corrected. Results from practical knowledge acquisition on agricultural literature show that the method achieves good detection and correction performance.
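The detect-then-correct pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify which morphological features or angle formula the authors use, so this sketch substitutes a common stand-in: it scores candidate slant angles by how sharply the vertical projection profile peaks after de-shearing. The function names, the 5-degree italic threshold, the candidate angle range, and the use of the median to pool angles are all assumptions made for illustration, and word segmentation is assumed to have been done already (each word is a binary NumPy array with ink = 1).

```python
import numpy as np

def deshear(img, angle_deg):
    """Shift each row horizontally to undo a slant of angle_deg (positive = leaning right)."""
    h, w = img.shape
    t = np.tan(np.radians(angle_deg))
    pad = int(np.ceil(abs(t) * h))  # room for the largest row shift
    out = np.zeros((h, w + 2 * pad), dtype=img.dtype)
    for y in range(h):
        # Rows near the top are displaced furthest in slanted text,
        # so they are shifted back by the largest amount.
        dx = int(round((h - 1 - y) * t))
        out[y, pad - dx : pad - dx + w] = img[y]
    return out

def estimate_slant(img, angles=range(-5, 31)):
    """Return the candidate angle whose de-sheared image has the sharpest column profile."""
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        cols = deshear(img, a).sum(axis=0).astype(float)
        score = float((cols ** 2).sum())  # peaks when strokes become vertical
        if score > best_score:
            best_angle, best_score = float(a), score
    return best_angle

def correct_italic_words(words, threshold_deg=5.0):
    """Flag words as italic by estimated slant, pool their angles, and de-shear them.

    threshold_deg is a hypothetical cut-off, not a value from the paper.
    Returns (list of word images, pooled angle or None if no italic word was found).
    """
    angles = [estimate_slant(w) for w in words]
    italic = [i for i, a in enumerate(angles) if a >= threshold_deg]
    if not italic:
        return list(words), None
    # Pool the per-word estimates so one noisy word does not set the angle.
    pooled = float(np.median([angles[i] for i in italic]))
    out = [deshear(w, pooled) if i in italic else w for i, w in enumerate(words)]
    return out, pooled
```

Pooling the angle over all italic words, rather than correcting each word with its own estimate, mirrors the abstract's step of collecting the italic words before computing the angle. Short words with few vertical strokes would otherwise give unstable estimates.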
Source
《河北农业大学学报》
CAS
CSCD
Peking University Core Journals (北大核心)
2015, Issue 6, pp. 124-128 (5 pages)
Journal of Hebei Agricultural University
Funding
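Youth Fund for Science and Technology Research of Higher Education Institutions of Hebei Province (Z2012142)
Baoding Science and Technology Research and Development Guidance Program (13ZN025, 13ZF098)
Natural Science Project of the Baoding Association for Science and Technology (KX2013A20)
Science and Engineering Fund of Hebei Agricultural University (LG20120604)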
Keywords
OCR
italic detection
italic correction
agricultural knowledge acquisition