Abstract
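The artificial intelligence revolution sparked in natural language and image processing has brought new ideas and research paradigms to protein computation. One major advance is the pre-trained protein language models obtained by self-supervised learning on massive numbers of protein sequences. These pre-trained models encode many kinds of information, including protein sequence, evolution, structure, and even function; they transfer readily to a variety of downstream tasks and show strong generalization ability. Building on this, researchers are further developing multimodal pre-trained models that integrate more types of data. Because protein structure is the main determinant of protein function, pre-trained protein models that incorporate structural information can better support a variety of downstream tasks, and this paper introduces and summarizes work in this direction. In addition, it briefly covers pre-trained protein models that incorporate prior knowledge, RNA language models, and protein design, and discusses the current status, difficulties, and possible solutions in these areas.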
The AI revolution, sparked by natural language and image processing, has brought new ideas and research paradigms to the field of protein computing. One significant advance is the development of pre-trained protein language models through self-supervised learning on massive protein sequence data. These pre-trained models encode various kinds of information about protein sequences, evolution, structures, and even functions; they can be easily transferred to various downstream tasks and demonstrate robust generalization capabilities. Recently, researchers have further developed multimodal pre-trained models that integrate more diverse types of data. This paper summarizes and reviews recent studies in this direction from the following aspects. Firstly, protein pre-trained models that integrate protein structures into language models are reviewed; this is particularly important because protein structure is the primary determinant of protein function. Secondly, pre-trained models that integrate protein dynamic information are introduced. These models may benefit downstream tasks such as protein-protein interactions, soft docking of ligands, and interactions involving allosteric proteins and intrinsically disordered proteins. Thirdly, pre-trained models that integrate knowledge such as gene ontology are described. Fourthly, we briefly introduce pre-trained models in the RNA field. Finally, we introduce the most recent developments in protein design and discuss how these models relate to the aforementioned pre-trained models that integrate protein structure information.
Authors
汤天一
熊翊名
张睿格
张建
李文飞
王骏
王炜
Tang Tian-Yi; Xiong Yi-Ming; Zhang Rui-Ge; Zhang Jian; †Li Wen-Fei; Wang Jun; Wang Wei (School of Physics, Nanjing University, Nanjing 210093, China; Institute of Brain Science, Nanjing University, Nanjing 210093, China)
Source
《物理学报》
SCIE
EI
CAS
CSCD
Peking University Core Journal
2024, Issue 18, pp. 1-15 (15 pages in total)
Acta Physica Sinica
Funding
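Project supported by the Science and Technology Innovation Project of the Ministry of Science and Technology (Grant No. 2030-2021ZD0201300) and
the National Natural Science Foundation of China (Grant No. 11934008).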