Measures relating word frequencies and expectations have been constantly ofinterest in Bioinformatics studies. With sequence data becoming massively available, exhaustiveenumeration of such measures have become concei...Measures relating word frequencies and expectations have been constantly ofinterest in Bioinformatics studies. With sequence data becoming massively available, exhaustiveenumeration of such measures have become conceivable, and yet pose significant computational burdeneven when limited to words of bounded maximum length. In addition, the display of the huge tablespossibly resulting from these counts poses practical problems of visualization and inference.VERBUMCULUS is a suite of software tools for the efficient and fast detection of over- orunder-represented words in nucleotide sequences. The inner core of VERBUMCULUS rests on subtlyinterwoven properties of statistics, pattern matching and combinatorics on words, that enable one tolimit drastically and a priori the set of over-or under-represented candidate words of all lengthsin a given sequence, thereby rendering it more feasible both to detect and visualize such words in afast and practically useful way. This paper is devoted to the description of the facility at theoutset and to report experimental results, ranging from simulations on synthetic data to thediscovery of regulatory elements on the upstream regions of a set of genes of the yeast.展开更多
In this paper, an improved algorithm, named STC-I, is proposed for Chinese Web page clustering based on Chinese language characteristics, which adopts a new unit choice principle and a novel suffix tree construction p...In this paper, an improved algorithm, named STC-I, is proposed for Chinese Web page clustering based on Chinese language characteristics, which adopts a new unit choice principle and a novel suffix tree construction policy. The experimental results show that the new algorithm keeps advantages of STC, and is better than STC in precision and speed when they are used to cluster Chinese Web page. Key words clustering - suffix tree - Web mining CLC number TP 311 Foundation item: Supported by the National Information Industry Development Foundation of ChinaBiography: YANG Jian-wu (1973-), male, Ph. D, research direction: information retrieval and text mining.展开更多
One method for identifying noncoding regulatory regions of a genome is to quantify rates of divergence between related species, as functional sequence will generally diverge more slowly. Most approaches to identifying...One method for identifying noncoding regulatory regions of a genome is to quantify rates of divergence between related species, as functional sequence will generally diverge more slowly. Most approaches to identifying these conserved noncoding sequences (CNSs) based on alignment have had relatively large minimum sequence lengths (≥15 bp) compared with the average length of known transcription factor binding sites. To circumvent this constraint, STAG-CNS that can simultaneously integrate the data from the promoters of conserved orthologous genes in three or more species was developed. Using the data from up to six grass species made it possible to identify conserved sequences as short as 9 bp with false discovery rate ≤0.05. These CNSs exhibit greater overlap with open chromatin regions identified using DNase I hypersensitivity assays, and are enriched in the promoters of genes involved in transcriptional regulation. STAG-CNS was further employed to characterize loss of conserved noncoding sequences associated with retained duplicate genes from the ancient maize polyploidy. Genes with fewer retained CNSs show lower overall expression, although this bias is more apparent in samples of complex organ systems containing many cell types, suggesting that CNS loss may correspond to a reduced number of expression contexts rather than lower expression levels across the entire ancestral expression domain.展开更多
基金美国自然科学基金,Purdue Research Foundation,Italian Ministry of University and Research, and the Research Program of the University of Padova 及Purdue Research Foundation,the Italian Ministry of University and Re-search, the Research Program of the University of Padova and Bourns College of Engineering, University of California,Riverside
文摘Measures relating word frequencies and expectations have been constantly ofinterest in Bioinformatics studies. With sequence data becoming massively available, exhaustiveenumeration of such measures have become conceivable, and yet pose significant computational burdeneven when limited to words of bounded maximum length. In addition, the display of the huge tablespossibly resulting from these counts poses practical problems of visualization and inference.VERBUMCULUS is a suite of software tools for the efficient and fast detection of over- orunder-represented words in nucleotide sequences. The inner core of VERBUMCULUS rests on subtlyinterwoven properties of statistics, pattern matching and combinatorics on words, that enable one tolimit drastically and a priori the set of over-or under-represented candidate words of all lengthsin a given sequence, thereby rendering it more feasible both to detect and visualize such words in afast and practically useful way. This paper is devoted to the description of the facility at theoutset and to report experimental results, ranging from simulations on synthetic data to thediscovery of regulatory elements on the upstream regions of a set of genes of the yeast.
文摘In this paper, an improved algorithm, named STC-I, is proposed for Chinese Web page clustering based on Chinese language characteristics, which adopts a new unit choice principle and a novel suffix tree construction policy. The experimental results show that the new algorithm keeps advantages of STC, and is better than STC in precision and speed when they are used to cluster Chinese Web page. Key words clustering - suffix tree - Web mining CLC number TP 311 Foundation item: Supported by the National Information Industry Development Foundation of ChinaBiography: YANG Jian-wu (1973-), male, Ph. D, research direction: information retrieval and text mining.
文摘One method for identifying noncoding regulatory regions of a genome is to quantify rates of divergence between related species, as functional sequence will generally diverge more slowly. Most approaches to identifying these conserved noncoding sequences (CNSs) based on alignment have had relatively large minimum sequence lengths (≥15 bp) compared with the average length of known transcription factor binding sites. To circumvent this constraint, STAG-CNS that can simultaneously integrate the data from the promoters of conserved orthologous genes in three or more species was developed. Using the data from up to six grass species made it possible to identify conserved sequences as short as 9 bp with false discovery rate ≤0.05. These CNSs exhibit greater overlap with open chromatin regions identified using DNase I hypersensitivity assays, and are enriched in the promoters of genes involved in transcriptional regulation. STAG-CNS was further employed to characterize loss of conserved noncoding sequences associated with retained duplicate genes from the ancient maize polyploidy. Genes with fewer retained CNSs show lower overall expression, although this bias is more apparent in samples of complex organ systems containing many cell types, suggesting that CNS loss may correspond to a reduced number of expression contexts rather than lower expression levels across the entire ancestral expression domain.