Abstract
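Protein-protein interaction (PPI) is an important research topic in biomedicine. At present, the results of PPI experiments are stored mainly in the form of literature, and as the volume of biomedical literature grows rapidly, collecting this information by hand can no longer meet practical needs. To address this, a weakly supervised PPI identification method based on the distributional hypothesis is proposed. First, lexical patterns expressing semantic relations are extracted from the contexts that describe protein interactions, a small number of interacting protein pairs form the initial seed set, and, following the distributional hypothesis, a vector space model is built from the distribution of the lexical patterns over the seed set. Next, the lexical patterns are clustered by similarity into semantically similar pattern clusters, and these clusters are used to find new patterns with similar distributions in the corpus, which are added to the candidate set. Finally, the protein pairs in the candidate set and their patterns are evaluated, and the pairs that meet the criteria are added to the seed set for the next iteration, ultimately yielding the interacting protein pairs. Compared with existing methods, this method takes the semantic relatedness of the context into account, and experimental results show that it achieves high precision and recall with a very small seed set.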
Protein-protein interaction (PPI) is an important topic in biological research. The results of PPI experiments carried out in biomedical research are mainly stored in the form of literature. With the rapid growth of the biomedical literature, manually collecting this information can no longer meet practical needs. To address this, we propose a weakly supervised protein-protein interaction identification approach based on the distributional hypothesis. First, a few interacting protein pairs are collected as seeds, and the lexical patterns of all protein pairs that express semantic relations are extracted. Based on the distributional hypothesis, a vector space model is constructed according to the distribution of patterns over the seeds. Then, the lexical patterns are clustered by similarity. Using these clusters, new semantically related patterns are recognized and added to the candidates. Lastly, based on the scores of the lexical patterns, the protein pairs in the candidates are evaluated and selected for the seed set. The seed set is expanded iteratively, and finally the interacting protein pairs are identified. This approach considers semantic relations in context and achieves high precision and recall with a small seed set compared with the results of previous studies.
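The iterative procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the clustering algorithm, or the selection criterion, so cosine similarity, a greedy single-pass clustering, and a "matched by at least `min_score` trusted patterns" rule are all assumptions made here for illustration. The function names, thresholds, and toy data are likewise hypothetical.

```python
import math
from collections import defaultdict


def pattern_vectors(occurrences, seeds):
    """Map each lexical pattern to a count vector over the seed pairs."""
    index = {pair: i for i, pair in enumerate(seeds)}
    vectors = defaultdict(lambda: [0] * len(seeds))
    for pair, pattern in occurrences:
        if pair in index:
            vectors[pattern][index[pair]] += 1
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity (assumed measure; the abstract names none)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_patterns(vectors, threshold):
    """Greedy single-pass clustering (assumed): a pattern joins the first
    cluster whose summed vector is similar enough, else starts a new one."""
    clusters = []
    for pattern, vec in sorted(vectors.items()):
        for cluster in clusters:
            if cosine(vec, cluster["centroid"]) >= threshold:
                cluster["patterns"].append(pattern)
                cluster["centroid"] = [a + b for a, b in zip(cluster["centroid"], vec)]
                break
        else:
            clusters.append({"patterns": [pattern], "centroid": list(vec)})
    return clusters


def bootstrap(occurrences, seeds, threshold=0.8, min_cluster_size=2,
              min_score=2, max_rounds=10):
    """Expand the seed set until no candidate pair passes the (assumed) test."""
    seeds = list(seeds)
    for _ in range(max_rounds):
        vectors = pattern_vectors(occurrences, seeds)
        clusters = cluster_patterns(vectors, threshold)
        # Patterns in sufficiently large clusters are treated as trusted.
        trusted = {p for c in clusters if len(c["patterns"]) >= min_cluster_size
                   for p in c["patterns"]}
        seed_set = set(seeds)
        matched = defaultdict(set)
        for pair, pattern in occurrences:
            if pair not in seed_set and pattern in trusted:
                matched[pair].add(pattern)
        # Score a candidate pair by how many distinct trusted patterns it matches.
        new_pairs = sorted(p for p, pats in matched.items() if len(pats) >= min_score)
        if not new_pairs:
            break
        seeds.extend(new_pairs)
    return seeds


# Toy corpus of (protein pair, lexical pattern) co-occurrences; all made up.
occurrences = [
    (("A", "B"), "X interacts with Y"), (("A", "B"), "X binds to Y"),
    (("A", "B"), "X and Y were measured"),
    (("C", "D"), "X interacts with Y"), (("C", "D"), "X binds to Y"),
    (("E", "F"), "X interacts with Y"), (("E", "F"), "X binds to Y"),
    (("G", "H"), "X and Y were measured"),
]
result = bootstrap(occurrences, [("A", "B"), ("C", "D")])
print(result)
```

On this toy corpus the two interaction patterns share the same distribution over the seeds and form one cluster, so the pair ("E", "F") is promoted to the seed set, while ("G", "H"), which only matches the isolated pattern, is not.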
Authors
毛宇薇
牛耘
MAO Yu-wei; NIU Yun (School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
Source
《计算机技术与发展》
2018, Issue 9, pp. 34-37 (4 pages)
Computer Technology and Development
Funding
National Natural Science Foundation of China (61202132)
Keywords
protein-protein interaction
distributional hypothesis
weakly supervised method
relational similarity