Diatoms comprise a diverse and ecologically important group of eukaryotic phytoplankton that signifi- cantly contributes to marine primary production and global carbon cycling. Phaeodactylum tricornutum is commonly us...Diatoms comprise a diverse and ecologically important group of eukaryotic phytoplankton that signifi- cantly contributes to marine primary production and global carbon cycling. Phaeodactylum tricornutum is commonly used as a model organism for studying diatom biology. Although its genome was sequenced in 2008, a high-quality genome annotation is still not available for this diatom. Here we report the develop- ment of an integrated proteogenomic pipeline and its application for improved annotation of P. tricornutum genome using mass spectrometry (MS)-based proteomics data. Our proteogenomic analysis unambigu- ously identified approximately 8300 genes and revealed 606 novel proteins, 506 revised genes, 94 splice variants, 58 single amino acid variants, and a holistic view of post-translational modifications in P. tricor- nutum. We experimentally confirmed a subset of novel events and obtained MS evidence for more than 200 micropeptides in P. tricornutum. These findings expand the genomic landscape of P. tricornutum and provide a rich resource for the study of diatom biology. The proteogenomic pipeline we developed in this study is applicable to any sequenced eukaryote and thus represents a significant contribution to the toolset for eukaryotic proteogenomic analysis. The pipeline and its source code are freely available at https://sourceforge.net/projects/gapeproteogeno mic.展开更多
Published genomes frequently contain erroneous gene models that represent issues associated with identification of open reading frames,start sites,splice sites,and related structural features.The source of these incon...Published genomes frequently contain erroneous gene models that represent issues associated with identification of open reading frames,start sites,splice sites,and related structural features.The source of these inconsistencies is often traced back to integration across text file formats designed to describe long read alignments and predicted gene structures.In addition,the majority of gene prediction frameworks do not provide robust downstream filtering to remove problematic gene annotations,nor do they represent these annotations in a format consistent with current file standards.These frameworks also lack consideration for functional attributes,such as the presence or absence of protein domains that can be used for gene model validation.To provide oversight to the increasing number of published genome annotations,we present a software package,the Gene Filtering,Analysis,and Conversion(gFACs),to filter,analyze,and convert predicted gene models and alignments.The software operates across a wide range of alignment,analysis,and gene prediction files with a flexible framework for defining gene models with reliable structural and functional attributes.gFACs supports common downstream applications,including genome browsers,and generates extensive details on the filtering process,including distributions that can be visualized to further assess the proposed gene space.gFACs is freely available and implemented in Perl with support from BioPerl libraries at https://gitlab.com/PlantGenomicsLab/gFACs.展开更多
The rapid development of high-throughput sequencing technologies has led to a dramatic decrease in the money and time required for de novo genome sequencing or genome resequencing projects, with new genome sequences c...The rapid development of high-throughput sequencing technologies has led to a dramatic decrease in the money and time required for de novo genome sequencing or genome resequencing projects, with new genome sequences constantly released every week. Among such projects, the plethora of updated genome assemblies induces the requirement of versiondependent annotation files and other compatible public dataset for downstream analysis. To handlethese tasks in an efficient manner, we developed the reference-based genome assembly and annotation tool(RGAAT), a flexible toolkit for resequencing-based consensus building and annotation update. RGAAT can detect sequence variants with comparable precision, specificity, and sensitivity to GATK and with higher precision and specificity than Freebayes and SAMtools on four DNAseq datasets tested in this study. RGAAT can also identify sequence variants based on cross-cultivar or cross-version genomic alignments. Unlike GATK and SAMtools/BCFtools, RGAAT builds the consensus sequence by taking into account the true allele frequency. Finally, RGAAT generates a coordinate conversion file between the reference and query genomes using sequence variants and supports annotation file transfer. Compared to the rapid annotation transfer tool(RATT),RGAAT displays better performance characteristics for annotation transfer between different genome assemblies, strains, and species. In addition, RGAAT can be used for genome modification,genome comparison, and coordinate conversion. RGAAT is available at https://sourceforge.net/projects/rgaat/and https://github.com/wushyer/RGAAT;2 at no cost.展开更多
Novel genomes are today often annotated by small consortia or individuals whose background is not from bioinformatics.This audience requires tools that are easy to use.Such need has been addressed by several genome an...Novel genomes are today often annotated by small consortia or individuals whose background is not from bioinformatics.This audience requires tools that are easy to use.Such need has been addressed by several genome annotation tools and pipelines.Visualizing resulting annotation is a crucial step of quality control.The UCSC Genome Browser is a powerful and popular genome visualization tool.Assembly Hubs,which can be hosted on any publicly available web server,allow browsing genomes via UCSC Genome Browser servers.The steps for creating custom Assembly Hubs are well documented and the required tools are publicly available.However,the number of steps for creating a novel Assembly Hub is large.In some cases,the format of input files needs to be adapted,which is a difficult task for scientists without programming background.Here,we describe Make Hub,a novel command line tool that generates Assembly Hubs for the UCSC Genome Browser in a fully automated fashion.The pipeline also allows extending previously created Hubs by additional tracks.Make Hub is freely available for downloading at https://github.com/Gaius-Augustus/Make Hub.展开更多
The KEGG pathway maps are widely used as a reference data set for inferring high-level functions of the organism or the ecosystem from its genome or metagenome sequence data. The KEGG modules, which are tighter functi...The KEGG pathway maps are widely used as a reference data set for inferring high-level functions of the organism or the ecosystem from its genome or metagenome sequence data. The KEGG modules, which are tighter functional units often corresponding to subpathways in the KEGG pathway maps, are designed for better automation of genome interpretation. Each KEGG module is represented by a simple Boolean expression of KEGG Orthology (KO) identifiers (K numbers), enabling automatic evaluation of the completeness of genes in the genome. Here we focus on metabolic functions and introduce reaction modules for improving annotation and signature modules for inferring metabolic capacity. We also describe how genome annotation is performed in KEGG using the manually created KO database and the computationaUy generated SSDB database. The resulting KEGG GENES database with KO (K number) annotation is a reference sequence database to be compared for automated annotation and interpretation of newly determined genomes.展开更多
Annotation of the genome sequence of the SARS-CoV (severe acute respiratory syndrome-associated coronavirus) is indispensable to understand its evolution and pathogenesis. We have performed a full annotation of the SA...Annotation of the genome sequence of the SARS-CoV (severe acute respiratory syndrome-associated coronavirus) is indispensable to understand its evolution and pathogenesis. We have performed a full annotation of the SARS-CoV genome sequences by using annotation programs publicly available or developed by ourselves. Totally, 21 open reading frames (ORFs) of genes or putative uncharacterized proteins (PUPs) were predicted. Seven PUPs had not been reported previously, and two of them were predicted to contain transmembrane regions. Eight ORFs partially overlapped with or embedded into those of known genes, revealing that the SARS-CoV genome is a small and compact one with overlapped coding regions. The most striking discovery is that an ORF locates on the minus strand. We have also annotated non-coding regions and identified the transcription regulating sequences (TRS) in the intergenic regions. The analysis of TRS supports the minus strand extending transcription mechanism of coronavirus. The SNP analysis of different isolates reveals that mutations of the sequences do not affect the prediction results of ORFs.展开更多
基金This work was supported by the National Key Research and Development Program (2016YFA0501304), the National Natural Science Foundation of China (grant no. 31570829), and the Strategic Priority Research Program of the Chinese Academy of Sciences (grant no. XDB14030202).
文摘Diatoms comprise a diverse and ecologically important group of eukaryotic phytoplankton that signifi- cantly contributes to marine primary production and global carbon cycling. Phaeodactylum tricornutum is commonly used as a model organism for studying diatom biology. Although its genome was sequenced in 2008, a high-quality genome annotation is still not available for this diatom. Here we report the develop- ment of an integrated proteogenomic pipeline and its application for improved annotation of P. tricornutum genome using mass spectrometry (MS)-based proteomics data. Our proteogenomic analysis unambigu- ously identified approximately 8300 genes and revealed 606 novel proteins, 506 revised genes, 94 splice variants, 58 single amino acid variants, and a holistic view of post-translational modifications in P. tricor- nutum. We experimentally confirmed a subset of novel events and obtained MS evidence for more than 200 micropeptides in P. tricornutum. These findings expand the genomic landscape of P. tricornutum and provide a rich resource for the study of diatom biology. The proteogenomic pipeline we developed in this study is applicable to any sequenced eukaryote and thus represents a significant contribution to the toolset for eukaryotic proteogenomic analysis. The pipeline and its source code are freely available at https://sourceforge.net/projects/gapeproteogeno mic.
基金supported by the National Science Foundation Plant Genome Research Program of the United States(Grant No.1444573)
文摘Published genomes frequently contain erroneous gene models that represent issues associated with identification of open reading frames,start sites,splice sites,and related structural features.The source of these inconsistencies is often traced back to integration across text file formats designed to describe long read alignments and predicted gene structures.In addition,the majority of gene prediction frameworks do not provide robust downstream filtering to remove problematic gene annotations,nor do they represent these annotations in a format consistent with current file standards.These frameworks also lack consideration for functional attributes,such as the presence or absence of protein domains that can be used for gene model validation.To provide oversight to the increasing number of published genome annotations,we present a software package,the Gene Filtering,Analysis,and Conversion(gFACs),to filter,analyze,and convert predicted gene models and alignments.The software operates across a wide range of alignment,analysis,and gene prediction files with a flexible framework for defining gene models with reliable structural and functional attributes.gFACs supports common downstream applications,including genome browsers,and generates extensive details on the filtering process,including distributions that can be visualized to further assess the proposed gene space.gFACs is freely available and implemented in Perl with support from BioPerl libraries at https://gitlab.com/PlantGenomicsLab/gFACs.
基金supported by grants from the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA08020102)National Natural Science Foundation of China (Grant Nos. 81701071, 31501042, 31271385, and 31200957)+2 种基金Shenzhen Science and Technology Program (Grant No. JCYJ20170306171013613),ChinaKing Abdulaziz City for Science and Technology (KACSTGrant No. 1035-35),Kingdom of Saudi Arabia
文摘The rapid development of high-throughput sequencing technologies has led to a dramatic decrease in the money and time required for de novo genome sequencing or genome resequencing projects, with new genome sequences constantly released every week. Among such projects, the plethora of updated genome assemblies induces the requirement of versiondependent annotation files and other compatible public dataset for downstream analysis. To handlethese tasks in an efficient manner, we developed the reference-based genome assembly and annotation tool(RGAAT), a flexible toolkit for resequencing-based consensus building and annotation update. RGAAT can detect sequence variants with comparable precision, specificity, and sensitivity to GATK and with higher precision and specificity than Freebayes and SAMtools on four DNAseq datasets tested in this study. RGAAT can also identify sequence variants based on cross-cultivar or cross-version genomic alignments. Unlike GATK and SAMtools/BCFtools, RGAAT builds the consensus sequence by taking into account the true allele frequency. Finally, RGAAT generates a coordinate conversion file between the reference and query genomes using sequence variants and supports annotation file transfer. Compared to the rapid annotation transfer tool(RATT),RGAAT displays better performance characteristics for annotation transfer between different genome assemblies, strains, and species. In addition, RGAAT can be used for genome modification,genome comparison, and coordinate conversion. RGAAT is available at https://sourceforge.net/projects/rgaat/and https://github.com/wushyer/RGAAT;2 at no cost.
基金US National Institutes of Health(Grant No.HG000783),gave rise to the development of MakeHubUniversitat Greifswald,Germany.
文摘Novel genomes are today often annotated by small consortia or individuals whose background is not from bioinformatics.This audience requires tools that are easy to use.Such need has been addressed by several genome annotation tools and pipelines.Visualizing resulting annotation is a crucial step of quality control.The UCSC Genome Browser is a powerful and popular genome visualization tool.Assembly Hubs,which can be hosted on any publicly available web server,allow browsing genomes via UCSC Genome Browser servers.The steps for creating custom Assembly Hubs are well documented and the required tools are publicly available.However,the number of steps for creating a novel Assembly Hub is large.In some cases,the format of input files needs to be adapted,which is a difficult task for scientists without programming background.Here,we describe Make Hub,a novel command line tool that generates Assembly Hubs for the UCSC Genome Browser in a fully automated fashion.The pipeline also allows extending previously created Hubs by additional tracks.Make Hub is freely available for downloading at https://github.com/Gaius-Augustus/Make Hub.
文摘The KEGG pathway maps are widely used as a reference data set for inferring high-level functions of the organism or the ecosystem from its genome or metagenome sequence data. The KEGG modules, which are tighter functional units often corresponding to subpathways in the KEGG pathway maps, are designed for better automation of genome interpretation. Each KEGG module is represented by a simple Boolean expression of KEGG Orthology (KO) identifiers (K numbers), enabling automatic evaluation of the completeness of genes in the genome. Here we focus on metabolic functions and introduce reaction modules for improving annotation and signature modules for inferring metabolic capacity. We also describe how genome annotation is performed in KEGG using the manually created KO database and the computationaUy generated SSDB database. The resulting KEGG GENES database with KO (K number) annotation is a reference sequence database to be compared for automated annotation and interpretation of newly determined genomes.
文摘Annotation of the genome sequence of the SARS-CoV (severe acute respiratory syndrome-associated coronavirus) is indispensable to understand its evolution and pathogenesis. We have performed a full annotation of the SARS-CoV genome sequences by using annotation programs publicly available or developed by ourselves. Totally, 21 open reading frames (ORFs) of genes or putative uncharacterized proteins (PUPs) were predicted. Seven PUPs had not been reported previously, and two of them were predicted to contain transmembrane regions. Eight ORFs partially overlapped with or embedded into those of known genes, revealing that the SARS-CoV genome is a small and compact one with overlapped coding regions. The most striking discovery is that an ORF locates on the minus strand. We have also annotated non-coding regions and identified the transcription regulating sequences (TRS) in the intergenic regions. The analysis of TRS supports the minus strand extending transcription mechanism of coronavirus. The SNP analysis of different isolates reveals that mutations of the sequences do not affect the prediction results of ORFs.