Quantitative frame analysis and the annotation of GC-rich (and other) prokaryotic genomes. An application to Anaeromyxobacter dehalogenans.

Oden S, Brocchieri L - Bioinformatics (2015)

Bottom Line: Graphical representations of contrasts in GC usage among codon frame positions (frame analysis) provide evidence of genes missing from the annotations of prokaryotic genomes of high GC content, but the qualitative approach of visual frame analysis prevents its applicability on a genomic scale. We developed two quantitative methods for the identification and statistical characterization in sequence regions of three-base periodicity (hits) associated with open reading frame structures. We applied the NPACT procedures to two recently annotated strains of the deltaproteobacterium Anaeromyxobacter dehalogenans, identifying in both genomes numerous conserved ORFs not included in the published annotation of coding regions.


Affiliation: Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL 32610, USA and Genetics Institute, University of Florida, Gainesville, FL 32610, USA.


Fig. 1. Associations of nucleotide-type usages with codon position. (A) Base usages at the three codon positions of genes in relation to the overall usage of the same base. Each point represents the average usage in the corresponding codon position from the collection of all coding sequences annotated in one genome. (B) Log-odds-ratio scores associated to each base at the three codon base positions as a function of GC content. (C) Corresponding log-odds-ratio scores for the 61 codons with the highest- and lowest-scoring codon types indicated for low, intermediate and high GC content (W = AT, S = CG, R = AG, Y = CT).
© Copyright Policy - creative-commons

Mentions: To identify three-base periodicities associated with coding regions, we considered all ‘reading frames’ of the input sequence, i.e. continuous sequences of trinucleotides not interrupted by in-frame stop codons (TAA, TGA or TAG) that may contain coding regions. Each reading frame was also used to identify the local composition of the sequence, based on which the nucleotide composition of the three codon positions can be predicted. Specifically, the expected frequency of nucleotide type X (= A, C, G, T) at codon position i (= 0, 1, 2) can be estimated by linear regression over its overall nucleotide frequency from collections of known genes (Fig. 1A). To identify, within a reading frame, sequence segments of any length with the three-base periodicities expected in coding regions, we derived an implementation of the ‘high-scoring segment’ approach (Karlin and Altschul, 1990) based on cumulating, along the sequence, scores appropriate for identifying segments with expected phase-specific nucleotide frequencies, such as log-odds-ratio scores, assigned to each nucleotide and codon position (Fig. 1B).
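The two steps described above (enumerating stop-free reading frames, then locating the best-scoring segment inside one) can be illustrated with a minimal Python sketch. It is not the NPACT implementation. All function names are hypothetical, only the three forward frames are scanned, and the score is assumed to take the usual log-odds form log(p_i(X) / q(X)), where p_i(X) is the expected frequency of base X at codon position i and q(X) is its overall frequency; the exact expression used in the paper did not survive in this excerpt. The position-specific frequencies are passed in as plain numbers rather than derived by regression, and a standard maximum-subarray scan stands in for the high-scoring segment search.

```python
import math

STOPS = {"TAA", "TGA", "TAG"}

def reading_frames(seq, min_codons=1):
    """Yield (start, end) for each maximal stop-free run of codons
    in the three forward frames (end is exclusive)."""
    seq = seq.upper()
    for frame in range(3):
        run_start = frame
        last = frame + 3 * ((len(seq) - frame) // 3)  # end of last full codon
        for pos in range(frame, last, 3):
            if seq[pos:pos + 3] in STOPS:
                if pos - run_start >= 3 * min_codons:
                    yield run_start, pos
                run_start = pos + 3
        if last - run_start >= 3 * min_codons:
            yield run_start, last

def log_odds_scores(pos_freqs, background):
    """Assumed score form: log(p_i(X) / q(X)) for codon position i, base X."""
    return [{x: math.log(pos_freqs[i][x] / background[x]) for x in "ACGT"}
            for i in range(3)]

def best_segment(seq, start, end, scores):
    """Cumulate codon scores along one reading frame and return
    (score, seg_start, seg_end) of the maximal-scoring run of codons."""
    seq = seq.upper()
    best = (0.0, start, start)
    running, seg_start = 0.0, start
    for pos in range(start, end, 3):
        codon = seq[pos:pos + 3]
        # Bases outside ACGT (e.g. N) contribute a neutral score of 0.
        s = sum(scores[i].get(codon[i], 0.0) for i in range(3))
        if running <= 0:
            running, seg_start = s, pos   # restart the candidate segment
        else:
            running += s
        if running > best[0]:
            best = (running, seg_start, pos + 3)
    return best

# Illustrative numbers only: uniform background, G/C-enriched third position.
uniform = {x: 0.25 for x in "ACGT"}
gc3 = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
scores = log_odds_scores([uniform, uniform, gc3], uniform)

seq = "AAAAAT" + "AAGAACAAG" + "AAAAAT"
for start, end in reading_frames(seq):
    print(start, end, best_segment(seq, start, end, scores))
```

In a fuller treatment the reverse strand would be scanned as well, and each segment score would be assessed for statistical significance; neither is attempted here.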

