The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4).

Huntemann M, Ivanova NN, Mavromatis K, Tripp HJ, Paez-Espino D, Palaniappan K, Szeto E, Pillay M, Chen IM, Pati A, Nielsen T, Markowitz VM, Kyrpides NC - Stand Genomic Sci (2015)

Bottom Line: Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. Structural annotation is followed by assignment of protein product names and functions.


Affiliation: Genome Biology Program, Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, USA.

ABSTRACT
The DOE-JGI Microbial Genome Annotation Pipeline (MGAP) performs structural and functional annotation of microbial genomes, which are then included in the Integrated Microbial Genomes (IMG) comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires a description of the project and its associated metadata in GOLD. MGAP sequence data processing consists of feature prediction, including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. Structural annotation is followed by assignment of protein product names and functions.




Fig. 1: Genome sequence data preprocessing (i), structural (ii) and functional annotation (iii) steps of the MGAP v.4

Mentions: All genome datasets undergo preprocessing to ensure that only good-quality sequences reach the gene prediction stage, as illustrated in Fig. 1(i). First, all ambiguous nucleotides in the sequence datasets are replaced by N’s, and sequences containing characters outside the {A,C,G,T,N} set are not considered further. Additionally, the headers in multi-FASTA files are changed to ensure that all contig and scaffold names are unique and compatible with the tools employed in subsequent stages; the pipeline creates a mapping file that records the correspondence between old and new sequence headers. Furthermore, sequences shorter than 150 nt are removed. Second, the sequences are trimmed to remove trailing N’s. The trimmed sequences then pass through low-complexity filtering, in which noisy low-complexity sequences are identified with the DUST [3] application and eliminated. Finally, for finished circular genomes, the pipeline attempts to detect the origin of replication by running BLASTx against an in-house curated database of genes located near the origin of replication. If an origin of replication is successfully detected, the sequence is permuted so that it starts at that position.
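The preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the MGAP implementation: the function names, the `contig_` header prefix, the choice of IUPAC ambiguity codes as the "ambiguous nucleotides", and the upper-casing of input are all assumptions made for illustration. The DUST low-complexity filter and the BLASTx-based origin search are left out; `rotate_to_origin` only shows the permutation step once an origin coordinate is already known.

```python
# Hypothetical sketch of the preprocessing steps; not the actual MGAP code.
IUPAC_AMBIGUOUS = set("RYSWKMBDHV")  # assumption: "ambiguous" means IUPAC codes
ALLOWED = set("ACGTN")
MIN_LENGTH = 150  # threshold stated in the text (nt)


def preprocess(records, prefix="contig"):
    """Clean (header, sequence) pairs.

    Returns the cleaned records and a list of (old_header, new_header)
    pairs, standing in for the mapping file mentioned in the text.
    """
    cleaned = []
    mapping = []  # a list, because old headers are not guaranteed unique
    for old_header, seq in records:
        seq = seq.upper()  # assumption: case is not significant
        # Replace ambiguous nucleotides with N.
        seq = "".join("N" if c in IUPAC_AMBIGUOUS else c for c in seq)
        # Discard sequences with characters outside {A, C, G, T, N}.
        if set(seq) - ALLOWED:
            continue
        # Remove sequences shorter than 150 nt.
        if len(seq) < MIN_LENGTH:
            continue
        # Trim trailing Ns.
        seq = seq.rstrip("N")
        # Assign a unique, tool-friendly header and record the mapping.
        new_header = f"{prefix}_{len(cleaned) + 1:06d}"
        mapping.append((old_header, new_header))
        cleaned.append((new_header, seq))
    return cleaned, mapping


def rotate_to_origin(seq, origin):
    """Permute a circular sequence so that it starts at 0-based `origin`."""
    return seq[origin:] + seq[:origin]
```

Keeping the header mapping as a list of pairs rather than a dictionary means that duplicate input headers, which the renaming step exists to resolve, do not overwrite each other.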

