Abstract
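From a linguistic point of view, a DNA sequence can be regarded as a finite string over the letters A, G, C and T, which the transcription machinery recognizes through a certain grammar and morphology. How, then, are words distributed within a sequence? This paper studies the theoretical distribution of words under different conditions and shows that the self-overlap of a word (that is, a CODE) has a very strong influence on the probability distribution of the word in the sequence; this is verified on an example. Combined with the empirical distribution, two methods are proposed for identifying anomalous words in DNA sequences. The conclusion is that whether the letters A, G, C and T occur with equal or unequal probabilities is an important condition in judging whether a word is anomalous.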
A linguistic approach to understanding the meaning of DNA sequences is adopted. A DNA sequence is composed of the nucleotides A, C, G and T, and can be transcribed under a special ‘morphology’ and ‘grammar’. Which factors influence the number of occurrences of a word in the DNA text? How can ‘anomalous’ words be found? The theoretical probability distribution of words is derived, and it reveals the strong influence of a word's overlapping capability on that distribution. The effect is illustrated with an example DNA fragment. Together with the empirical distribution, two ways of finding ‘anomalous’ words are presented.
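The role of self-overlap described in the abstract can be illustrated with a minimal sketch. It is not the paper's own procedure, which the abstract does not spell out; the function names (`count_overlapping`, `self_overlaps`, `expected_count`) are hypothetical, and the sequence model assumed here is the simplest one, with letters drawn independently with fixed probabilities.

```python
from math import prod


def count_overlapping(seq, word):
    """Count occurrences of `word` in `seq`, allowing occurrences to overlap."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)


def self_overlaps(word):
    """Return the shifts s (0 < s < len(word)) at which `word` matches
    a copy of itself shifted by s positions, i.e. its self-overlaps."""
    k = len(word)
    return [s for s in range(1, k) if word[s:] == word[:k - s]]


def expected_count(word, n, probs):
    """Expected number of (possibly overlapping) occurrences of `word`
    in a length-n sequence of independent letters (assumed model)."""
    p = prod(probs[c] for c in word)
    return (n - len(word) + 1) * p


# Assumed equal letter probabilities; unequal values can be substituted.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(self_overlaps("AAA"), self_overlaps("ACG"))
print(expected_count("AAA", 1000, uniform), expected_count("ACG", 1000, uniform))
```

Under equal letter probabilities the two words have the same expected count, but "AAA" overlaps itself at shifts 1 and 2 while "ACG" does not overlap itself at all. Occurrences of a self-overlapping word therefore tend to arrive in clumps, so the spread of its count around the expected value differs even though the mean is identical. This is why a word's observed count should be judged against a distribution that accounts for overlap, not against the mean alone.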
Source
《云南大学学报(自然科学版)》
CAS
CSCD
1998, No. 6, pp. 432-436 (5 pages)
Journal of Yunnan University (Natural Sciences Edition)
Funding
Applied Basic Research Fund of Yunnan Province
Keywords
words
random sequences
self-overlap
empirical distribution
DNA sequences
words, random sequences, non-random sequences, overlap, empirical distribution