Abstract
How biological sequence data are represented and stored is critical to processing them efficiently. Existing database management systems do not effectively support biological sequence data types and their operations, so users must store sequences either in text-typed columns or directly in text files. This makes tasks such as sequence alignment and pattern discovery inefficient. This paper examines the characteristics of biological sequence data, analyzes and summarizes users' query requirements, and proposes a new biological sequence data model named BioSeg. The BioSeg model consists of a description part and a multidimensional array: the description part holds annotations and other related information about a sequence, and the multidimensional array stores the sequence itself (for example, the DNA sequence "ATCCCGTA"). The model provides algebraic operations for querying biological sequence data. Compared with text-based storage, queries on BioSeg are more efficient and more flexible.
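The two-part structure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the layout of the array or the names of the algebraic operations, so everything here is an assumption made for illustration: the class name `BioSegRecord`, the one-dimensional array part, and the operations `subsequence`, `find`, and `select` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class BioSegRecord:
    """Hypothetical record: a description part plus an array part."""
    # Description part: annotations and other related information.
    description: dict = field(default_factory=dict)
    # Array part: one residue per cell (1-D here for simplicity; the
    # paper speaks of a multidimensional array).
    residues: list = field(default_factory=list)

    @classmethod
    def from_text(cls, sequence, **description):
        """Build a record from a plain string such as "ATCCCGTA"."""
        return cls(description=dict(description), residues=list(sequence))

    def subsequence(self, start, stop):
        """Illustrative slice operation; copies the description part."""
        return BioSegRecord(dict(self.description), self.residues[start:stop])

    def find(self, pattern):
        """Return start offsets of exact matches of pattern (naive scan)."""
        p = list(pattern)
        n, m = len(self.residues), len(p)
        if m == 0:
            return []
        return [i for i in range(n - m + 1) if self.residues[i:i + m] == p]


def select(records, predicate):
    """Illustrative selection: keep records whose description satisfies predicate."""
    return [r for r in records if predicate(r.description)]


# Usage with the DNA sequence quoted in the abstract.
rec = BioSegRecord.from_text("ATCCCGTA", kind="DNA")
print(rec.find("CC"))                    # [2, 3]
print(rec.subsequence(2, 5).residues)    # ['C', 'C', 'C']
print(len(select([rec], lambda d: d.get("kind") == "DNA")))  # 1
```

Keeping annotations apart from the residue array lets a query filter on the description part without scanning sequence content, and lets positional operations work on array indices rather than on parsed text.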
Source
Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》)
CSCD
2008, No. 1, pp. 77-96 (20 pages)
Funding
National Natural Science Foundation of China, Grant No. 60573093
National High-Tech Research and Development Plan of China (863 Program), Grant No. 2006AA02Z329
Keywords
biological sequence
database management system (DBMS)
data model
bioinformatics