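Abstract: To address the scarcity of multi-font Tibetan character datasets and the problem of recognizing multi-font printed Tibetan characters, a Tibetan Printed Character Dataset (TPCD) containing 48,960 character images was constructed, and the dataset was preprocessed by labeling, normalization, and binarization. Recognition experiments on the Tibetan characters in the dataset were carried out with a range of linear statistical and deep learning methods, including support vector machines, feedforward neural networks, and convolutional networks. Evaluation of the experimental results shows that the proposed neural-network-based model reaches a recognition rate, recall, and F1 score of 97%, 96.6%, and 96.6%, respectively, on the test set for the multi-font printed Tibetan recognition task. This confirms the effectiveness of the approach and provides a theoretical and research basis for subsequent Tibetan text recognition.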
Funding: The National Natural Science Foundation of China (No. 60963016); the National Social Science Foundation of China (No. 17BXW037)
Abstract: To improve the recognition accuracy of off-line handwritten Tibetan characters, local gradient direction histograms based on the wavelet transform are proposed as the recognition features. First, for a Tibetan character sample image, the first-level approximation component of the Haar wavelet transform is calculated. Second, the approximation component is partitioned into several equal-sized zones. Finally, the gradient direction histogram of each zone is calculated, and the local direction histograms of the approximation component are taken as the features of the character sample image. The proposed method is tested on a recently developed off-line Tibetan handwritten character sample database. The experimental results demonstrate the effectiveness and efficiency of the proposed feature extraction method. Furthermore, compared with the detail components, the approximation component contributes more to the recognition accuracy.
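The three steps in this abstract (Haar approximation, zoning, per-zone gradient direction histograms) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not give the grid size, the number of direction bins, the gradient operator, or the histogram weighting, so the 4x4 grid, 8 bins, `np.gradient`, and magnitude weighting below are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def haar_approximation(img):
    """Return the level-1 Haar approximation (LL) sub-band of a 2-D image."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    img = img[: h - h % 2, : w - w % 2]  # crop to even height and width
    # Orthonormal 2-D Haar: the sum of each 2x2 block divided by 2.
    return (img[0::2, 0::2] + img[0::2, 1::2]
            + img[1::2, 0::2] + img[1::2, 1::2]) / 2.0


def zone_direction_histograms(img, grid=(4, 4), n_bins=8):
    """Concatenate per-zone gradient direction histograms of the LL sub-band.

    grid and n_bins are illustrative defaults, not values from the paper.
    """
    approx = haar_approximation(img)

    # Assumption: central-difference gradients; the paper may use another operator.
    gy, gx = np.gradient(approx)
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), 2.0 * np.pi)  # direction in [0, 2*pi)

    # Quantize each direction into one of n_bins equal angular sectors.
    bins = np.minimum((angle / (2.0 * np.pi) * n_bins).astype(int), n_bins - 1)

    # Split rows and columns into zones (equal-sized when the sub-band
    # dimensions are divisible by the grid dimensions).
    row_groups = np.array_split(np.arange(approx.shape[0]), grid[0])
    col_groups = np.array_split(np.arange(approx.shape[1]), grid[1])

    features = []
    for rows in row_groups:
        for cols in col_groups:
            zone_bins = bins[np.ix_(rows, cols)].ravel()
            zone_mag = magnitude[np.ix_(rows, cols)].ravel()
            # Assumption: each pixel votes with its gradient magnitude.
            features.append(
                np.bincount(zone_bins, weights=zone_mag, minlength=n_bins)
            )
    return np.concatenate(features)
```

For a 64x64 character image, `zone_direction_histograms(image)` returns a vector of length `grid[0] * grid[1] * n_bins` (128 with the defaults), which could then be passed to any classifier.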