Abstract
HCL2000 is one of the most influential handwritten Chinese character databases. Because it was designed for studying the regularities of handwritten Chinese characters, its samples are organized and stored as files, one per writer. This organization is not always the most effective for other research, such as studies of sample selection. Starting from that application, this paper reorganizes the samples in HCL2000 and corrects the errors in the database, producing a new handwritten Chinese character database, HCL2004. Two experiments are then carried out on HCL2004 using directional line element features. The first studies how the number of training samples affects recognition performance and concludes that, for the large character set of 3755 classes, the best number of training samples is 300 per character. The second examines the selection of training and test samples during recognition.
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2005, No. 5, pp. 97-104 (8 pages)
Journal of Chinese Information Processing
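Funding
Supported by the Ministry of Education Trans-Century Talents Fund and a Ministry of Education Key Scientific Research Project (02029)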