Abstract
Tibetan machine reading comprehension is still in its infancy, and building a high-quality corpus is an urgent prerequisite for progress in the field. This study used crowdsourcing to annotate in detail the sections on Tibetan medicinal plant materials and terminology explanations in the Tibetan medical classic The Four Medical Tantras (《四部医典》). A Tibetan masked data augmentation strategy was then applied to enlarge the dataset, yielding 13k valid question-answer pairs. Based on this dataset, an efficient Tibetan machine reading comprehension model is proposed by optimizing the traditional attention mechanism. The work is significant for advancing Tibetan information processing technology and helps improve the ability of machines to understand Tibetan text, thereby supporting the inheritance and preservation of Tibetan culture.
Authors
Danzeng Luobu (旦增罗布)
Laba Ciren (拉巴次仁)
Wang Haochang (王浩畅)
Xiao Ciren (小次仁)
Affiliations: Shannan Power Supply Company of State Grid Tibet Electric Power Company Limited, Lhoka 856000, China; University of Tibetan Medicine, Lhasa 850000, China; School of Computer and Information Technology, Northeast Petroleum University, Daqing 163318, China
Source
Xizang Science and Technology (《西藏科技》)
2024, No. 9, pp. 73-80 (8 pages)
Funding
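2023 Research Funding Program for Tibetan Medicine Doctoral Program Construction and Chinese-Tibetan Pharmacy Doctoral Program Cultivation (BSDJS-23-15)
National Natural Science Foundation of China (61402099)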
Keywords
Tibetan machine reading comprehension
The Four Medical Tantras
Tibetan corpus
Attention mechanism