Abstract
Objective: Using protein sequence data from the NCBI RefSeq gene sets, a database of gene and protein information for 11 eukaryotic model organisms, the Eukaryotic Protein Database (EPD), was built. The protein sequence length frequency distribution of each model organism was examined and used to explore how protein sequence length has changed over the course of evolution. Methods: Computer programs were used to extract and tally, for each species, the total number of genes, the number of protein sequences supported by experimental data (labeled NP_), and the number of theoretically predicted protein sequences (labeled XP_). Results: EPD contains 347,042 genes and 469,887 proteins, an average of 1.35 proteins per gene. Proteins 300-350 amino acids long are the most common, while proteins shorter than 100 amino acids or longer than 2,500 amino acids are rare. Conclusions: The protein sequence length frequency distributions of model organisms follow a regular pattern. The distribution curves of human and clawed frog are the most similar in shape, whereas in terms of how close the curves lie to each other, human and mouse are closer. Organisms preferentially use protein sequences of medium length.
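The tallying step described under Methods can be illustrated with a minimal sketch. The abstract does not name the software the authors used, so everything here is an assumption made for illustration: the function names `read_fasta` and `summarize` are hypothetical, the input is assumed to be FASTA-formatted text whose header lines begin with the accession, the records in the demo are made up, and each 50-residue bin is taken to be half-open (300 up to but not including 350).

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (accession, sequence) pairs from FASTA-formatted lines."""
    accession = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if accession is not None:
                yield accession, "".join(chunks)
            parts = line[1:].split()
            accession = parts[0] if parts else ""
            chunks = []
        elif accession is not None:
            chunks.append(line)  # sequence may span several lines
    if accession is not None:
        yield accession, "".join(chunks)


def summarize(records: Iterable[Tuple[str, str]], bin_width: int = 50):
    """Count records per accession prefix and per length bin.

    Returns (prefix_counts, length_bins). A key of length_bins is the
    lower edge of a half-open bin [start, start + bin_width); the bin
    width of 50 is an assumption based on the 300-350 range in the text.
    """
    prefix_counts = Counter()
    length_bins = Counter()
    for accession, sequence in records:
        prefix_counts[accession[:3]] += 1  # e.g. "NP_" or "XP_"
        start = (len(sequence) // bin_width) * bin_width
        length_bins[start] += 1
    return prefix_counts, length_bins


if __name__ == "__main__":
    # Made-up records, used only to exercise the functions.
    demo_lines = [
        ">NP_000001.1 made-up record",
        "M" * 320,
        ">XP_000002.1 made-up record",
        "M" * 60,
        "A" * 30,
        ">NP_000003.1 made-up record",
        "M" * 349,
    ]
    prefixes, bins = summarize(read_fasta(demo_lines))
    print(dict(prefixes))
    for start in sorted(bins):
        print(f"{start}-{start + 50}: {bins[start]}")
```

Dividing each bin count by the total number of sequences would turn the counts into the frequencies plotted in a length distribution curve; counting genes requires gene annotations that a protein FASTA file alone does not carry, so that step is left out of this sketch.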
Source
《中国医学物理学杂志》
CSCD
2014, No. 6, pp. 5318-5321, 5332 (5 pages in total)
Chinese Journal of Medical Physics
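Funding
National Natural Science Foundation of China (11362015)
Inner Mongolia University for the Nationalities Scientific Research Project (NMD1220)
Project funded by the Inner Mongolia University for the Nationalities Research and Innovation Team Building Program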
Keywords
eukaryote
database
frequency distribution