Quantitative frame analysis and the annotation of GC-rich (and other) prokaryotic genomes. An application to Anaeromyxobacter dehalogenans.

Oden S, Brocchieri L - Bioinformatics (2015)

Bottom Line: Graphical representations of contrasts in GC usage among codon frame positions (frame analysis) provide evidence of genes missing from the annotations of prokaryotic genomes of high GC content, but the qualitative approach of visual frame analysis prevents its applicability on a genomic scale. We developed two quantitative methods for the identification and statistical characterization of sequence regions of three-base periodicity (hits) associated with open reading frame structures. We applied the NPACT procedures to two recently annotated strains of the deltaproteobacterium Anaeromyxobacter dehalogenans, identifying in both genomes numerous conserved ORFs not included in the published annotation of coding regions.


Affiliation: Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL 32610, USA and Genetics Institute, University of Florida, Gainesville, FL 32610, USA.


Fig. 2. An example of the procedure used to identify ‘G-type’ significant hits of any length within a reading frame of 240 trinucleotides containing a coding region starting at position m. Ia and Ib are G-scores assigned to codon positions i, with 1 ≤ i ≤ 240, corresponding to G-values calculated over the intervals [i,240], with maximum G-score Gm at position m. II are G-scores assigned to codon positions j calculated over the intervals [m,j], with m ≤ j ≤ 240 and maximum G-score GM at position M. III is the G-type hit identified by the two positions of maximum G-score. Threshold values of GM corresponding to two significance levels are indicated by dashed lines.

To identify three-base periodicities of any type within a reading frame, we propose a procedure based on the G statistic (Sokal and Rohlf, 1994), which is often used as an alternative to the χ² statistic for testing associations. In our implementation, we used values of the G statistic (G-values) to score the association of nucleotides with codon positions, as illustrated in Figure 2 and detailed in Supplementary Methods. Briefly, in a reading frame of L codons, a G-value is calculated over each interval [i,L] (1 ≤ i ≤ L) and assigned to codon position i as its ‘G-score’. The codon position i = m of maximum G-score identifies the sequence interval [m,L]. Within this interval, G-values are calculated over the intervals [m,j] and assigned to codon positions j (m ≤ j ≤ L). The position j = M (M ≥ m) of maximum G-score defines, together with position m, the sequence segment [m,M], which is characterized as ‘high-scoring’ if its G-score GM exceeds the threshold score for a chosen significance level. We determined significance thresholds for the GM statistic for a large number of compositions and different sequence lengths from large samples of randomly generated sequences, as described for the H-test. As for the H statistic, we found that threshold values of GM depend on sequence composition and are linearly related to the log-log-length of the sequence (Supplementary Fig. S4), so they can similarly be inferred for any sequence length and composition. We refer to this test of association as the ‘G-test’, and to the high-scoring segments identified by the G-test as ‘G-type hits’ or ‘G-hits’.
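The two-pass scan described above can be illustrated with a short Python sketch. It is not the NPACT implementation: the exact scoring is given in Supplementary Methods, which are not reproduced here, so the sketch assumes the standard G statistic of independence on a 4 × 3 table of nucleotide by codon position. The names `g_value`, `add_codon`, `g_hit` and `empirical_threshold` are hypothetical, and the threshold is estimated by brute-force simulation for one length and composition rather than by the log-log-length relationship reported by the authors.

```python
import math
import random

NUC_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def g_value(counts):
    """G statistic of independence for a 4x3 table counts[nucleotide][codon position]."""
    total = sum(map(sum, counts))
    if total == 0:
        return 0.0
    rows = [sum(r) for r in counts]
    cols = [sum(r[c] for r in counts) for c in range(3)]
    g = 0.0
    for r in range(4):
        for c in range(3):
            o = counts[r][c]
            if o:  # empty cells contribute nothing
                g += o * math.log(o * total / (rows[r] * cols[c]))
    return 2.0 * g

def add_codon(counts, codon):
    """Add one codon to the table; characters other than A, C, G, T are skipped."""
    for pos, nuc in enumerate(codon):
        r = NUC_INDEX.get(nuc)
        if r is not None:
            counts[r][pos] += 1

def g_hit(seq):
    """Return (m, M, G_M) with 1-based codon positions for one reading frame."""
    seq = seq.upper()
    codons = [seq[k:k + 3] for k in range(0, len(seq) - len(seq) % 3, 3)]
    L = len(codons)
    if L == 0:
        raise ValueError("sequence must contain at least one codon")
    # Pass I: G-scores over [i, L], scanning i from L down to 1.
    counts = [[0] * 3 for _ in range(4)]
    m, g_m = L, -1.0
    for i in range(L, 0, -1):
        add_codon(counts, codons[i - 1])
        g = g_value(counts)
        if g > g_m:
            m, g_m = i, g
    # Pass II: G-scores over [m, j], scanning j from m up to L.
    counts = [[0] * 3 for _ in range(4)]
    M, g_M = m, -1.0
    for j in range(m, L + 1):
        add_codon(counts, codons[j - 1])
        g = g_value(counts)
        if g > g_M:
            M, g_M = j, g
    return m, M, g_M

def empirical_threshold(n_codons, freqs, alpha, n_samples=200, seed=0):
    """Estimate the (1 - alpha) quantile of G_M from random sequences.

    freqs are the A, C, G, T weights of the assumed composition.
    """
    rng = random.Random(seed)
    scores = sorted(
        g_hit("".join(rng.choices("ACGT", weights=freqs, k=3 * n_codons)))[2]
        for _ in range(n_samples)
    )
    return scores[int((1 - alpha) * (n_samples - 1))]

if __name__ == "__main__":
    frame = "ACGT" * 75 + "GCA" * 140  # 240 codons, periodic signal in the last 140
    m, M, g_M = g_hit(frame)
    cutoff = empirical_threshold(240, [0.25, 0.25, 0.25, 0.25], alpha=0.01)
    print(m, M, round(g_M, 2), "hit" if g_M > cutoff else "no hit")
```

Accumulating the counts incrementally keeps each pass linear in the number of codons, because each G-value is recomputed from a fixed 12-cell table rather than from the sequence itself. Estimating a threshold this way requires a fresh simulation for every length and composition, which is the cost that the authors' fitted relationship avoids.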

