Abstract
Benefiting from deep learning over large-scale linguistic resources, pre-trained language models have acquired strong semantic representation learning abilities. Through transfer learning in specific task scenarios, they provide important support for optimizing model performance. Pre-trained language models have now been introduced into machine reading comprehension (MRC) research and have shown considerable optimization capability. However, on domain-specific data, fine-tuned pre-trained models still suffer from a domain adaptability problem: they cannot handle novel language phenomena in unseen domains. To address this, this paper proposes an unsupervised domain adaptation model that combines transfer self-training with a multi-task learning mechanism. Specifically, the paper couples a generative reading comprehension network with a mask prediction mechanism to form a multi-task learning framework, and uses this framework to realize unsupervised model transfer across domains (from the source domain to the target domain). In addition, the paper designs a text normalization step and a transfer self-training scheme that adapt the target-domain data distribution to the source-domain distribution, thereby improving the quality of transfer learning. Experiments take TweetQA as the target-domain dataset and SQuAD, CoQA, and NarrativeQA as source-domain datasets. The results show that the proposed method significantly outperforms the baseline models, improving BLEU-1, METEOR, and ROUGE-L by at least 2.5, 2.7, and 2.0 percentage points respectively, which validates its ability to improve domain adaptability.
Benefiting from deep learning over large-scale linguistic resources, pre-trained language models have obtained strong semantic representation learning capabilities. They can leverage transfer learning in specific task scenarios to provide important support for optimizing model performance. Pre-trained language models such as BERT and UniLM have been widely used in natural language processing tasks such as text summarization, machine translation, and sentiment analysis. They have now also been introduced into the field of machine reading comprehension (MRC), where they have shown considerable optimization capability. However, for domain-specific data, fine-tuned pre-trained models still suffer from weak domain adaptability; in other words, they cannot tackle novel language phenomena in unknown domains. In the social media domain, the “colloquial” and “symbolic” characteristics of the text make it difficult to form a standardized and normalized language representation. In addition, in practical application scenarios with innumerable domain classes, the timeliness of manual annotation is often difficult to guarantee. Current research is mainly oriented toward text normalization, and existing MRC models based on supervised learning necessarily require large-scale training data, whereas data for the social media domain is relatively scarce. Therefore, although fine-tuning on top of large-scale pre-trained language models is possible, existing social media data is not large enough to support a complete language system, because it differs from the pre-training corpora in its specific linguistic phenomena. Moreover, previous studies mainly design MRC models as cloze, multiple-choice, or extractive models. Such models lack generalization ability on real MRC data, and generative models are closer to practical applications. To this end, at the model level, this paper proposes an unsupervised domain adaptation model that combines transfer self-training and multi-task learning. Specifically, a generative reading comprehension network and a mask prediction mechanism are coupled into a multi-task learning framework that enables unsupervised model transfer from the source domain to the target domain, while text normalization and a transfer self-training scheme further adapt the target-domain data distribution to the source-domain distribution. Experiments with TweetQA as the target domain and SQuAD, CoQA, and NarrativeQA as source domains show improvements of at least 2.5, 2.7, and 2.0 percentage points on BLEU-1, METEOR, and ROUGE-L over the baseline models.
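To make the multi-task coupling of generative reading comprehension and mask prediction concrete, here is a minimal sketch. It is an illustration under stated assumptions rather than the authors' implementation: the paper's backbone is not reproduced here, so `facebook/bart-base` from Hugging Face `transformers` stands in as a hypothetical encoder-decoder, and the masking rate (15%) and loss weight `lambda_mlm` are placeholder choices.

```python
# A minimal multi-task sketch: generative reading comprehension (answer
# generation) jointly trained with mask prediction. Assumptions, not
# taken from the paper: BART stands in for the backbone, 15% of tokens
# are masked at random, and the losses are mixed with a fixed weight.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def multitask_loss(question, context, answer, lambda_mlm=0.5):
    # Task 1: generative reading comprehension. Encode the question and
    # context as a pair and train the decoder to generate the answer.
    src = tokenizer(question, context, return_tensors="pt", truncation=True)
    tgt = tokenizer(answer, return_tensors="pt", truncation=True)
    gen_loss = model(input_ids=src.input_ids,
                     attention_mask=src.attention_mask,
                     labels=tgt.input_ids).loss

    # Task 2: mask prediction. Randomly mask non-special input tokens
    # and train the model to reconstruct the original sequence.
    special = torch.tensor(tokenizer.get_special_tokens_mask(
        src.input_ids[0].tolist(), already_has_special_tokens=True)).bool()
    masked = src.input_ids.clone()
    noise = (torch.rand(masked.shape) < 0.15) & ~special
    masked[noise] = tokenizer.mask_token_id
    mlm_loss = model(input_ids=masked,
                     attention_mask=src.attention_mask,
                     labels=src.input_ids).loss

    # Joint objective: a weighted sum, so gradients from both tasks
    # update the shared encoder-decoder parameters.
    return gen_loss + lambda_mlm * mlm_loss
```

Calling `multitask_loss(...).backward()` inside a standard training loop would update the shared parameters with gradients from both tasks; `lambda_mlm` is a tunable assumption, not a value reported in the paper.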
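Transfer self-training with text normalization, as summarized in the abstract, can likewise be sketched as a generic loop. Everything below is an assumption-labeled skeleton, not the paper's exact procedure: `model.fit` and `model.predict_with_score` are hypothetical interfaces, the confidence threshold is a placeholder, and the normalization rules (URL, @mention, hashtag, and letter-repetition handling) are common tweet-cleaning heuristics rather than the paper's normalization scheme.

```python
# A generic transfer self-training skeleton (illustrative only). A
# model fine-tuned on the source domain pseudo-labels normalized
# target-domain examples; confident pseudo-labels are folded back into
# the training mixture before re-training.
import re

def normalize(text: str) -> str:
    """Hypothetical rule-based tweet normalization: strip URLs and
    @mentions, unwrap hashtags, and collapse repeated letters so the
    target text looks more like source-domain prose."""
    text = re.sub(r"https?://\S+", "", text)      # drop URLs
    text = re.sub(r"@\w+", "", text)              # drop @mentions
    text = re.sub(r"#(\w+)", r"\1", text)         # '#great' -> 'great'
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)   # 'sooo' -> 'soo'
    return " ".join(text.split())

def self_train(model, source_data, target_pairs, rounds=3, threshold=0.9):
    """model.fit and model.predict_with_score are placeholder
    interfaces; predict_with_score returns (answer, confidence)."""
    model.fit(source_data)                        # source-domain fine-tuning
    for _ in range(rounds):
        pseudo = []
        for question, context in target_pairs:    # unlabeled target data
            context = normalize(context)
            answer, conf = model.predict_with_score(question, context)
            if conf >= threshold:                 # keep confident labels only
                pseudo.append((question, context, answer))
        model.fit(list(source_data) + pseudo)     # re-train on the mixture
    return model
```

In the paper's setting, `source_data` would come from SQuAD, CoQA, or NarrativeQA and `target_pairs` from TweetQA; the round count and threshold here are illustrative defaults.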
Authors
LIU Hao (刘皓), HONG Yu (洪宇), ZHU Qiao-Ming (朱巧明)
(School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; BigData Intelligence Engineering Lab of Jiangsu Province, Soochow University, Suzhou, Jiangsu 215006)
Source
Chinese Journal of Computers (《计算机学报》), 2022, No. 10, pp. 2133-2150 (18 pages)
Indexed in: EI, CAS, CSCD, Peking University Core Journals (北大核心)
Funding
National Key Research and Development Program of China (2020YFB1313601)
National Natural Science Foundation of China (62076174, 61836007)
Keywords
unsupervised domain adaptation
transfer self-training
multi-task learning
generative reading comprehension
mask prediction