Global Shifts in Genome and Proteome Composition Are Very Tightly Coupled.

Brbić M, Warnecke T, Kriško A, Supek F - Genome Biol Evol (2015)

Bottom Line: Qualitatively similar results were obtained for 49 fungal genomes, where 80% of the variability in AAC could be explained by the composition of introns and intergenic regions. Moreover, highly expressed genes do not exhibit more prominent environment-related AAC signatures than lowly expressed genes, despite contributing more to the effective proteome. Thus, evolutionary shifts in overall AAC appear to occur almost exclusively through factors shaping the global oligonucleotide content of the genome.


Affiliation: Division of Electronics, Rudjer Boskovic Institute, Zagreb, Croatia; Molecular Basis of Ageing, Mediterranean Institute for Life Sciences (MedILS), Split, Croatia.


Nonlinear SVM regression models that predict amino acid usage in proteomes from G + C and dinucleotide frequencies in noncoding DNA. Dependency of relative frequencies of Ala (A) and Met (B) in proteomes on the G + C content of DNA, as examples of a linear and nonlinear relationship, respectively. Each dot is a prokaryotic chromosome (>200 kb in size). Red curves show SVM predictions. Several examples which deviate strongly from the dominant trend are highlighted by the vertical lines that show residuals of the regression. SVM regression models that regress the relative frequency of Thr (C) and Val (D) in proteomes against a combination of the G + C content and the frequency of the ApC + GpT dinucleotide.
© Copyright Policy - creative-commons

Mentions: Introducing the relative frequencies of dinucleotides in intergenic DNA (Materials and Methods) in addition to G + C content considerably improves AAC prediction, both for the G + C balanced amino acids (median R² = 0.647) and overall (fig. 1; R² = 0.736, 0.632–0.879 [median, Q1–Q3 over amino acids]). Observed dinucleotide frequencies were normalized by the frequency expected from G + C content in order to capture orthogonal information (Materials and Methods). We gain further predictive accuracy by adding trinucleotide composition as a predictor (fig. 1; R² = 0.840, 0.761–0.905). Testing this regression model on an additional set of 600 genomes yielded similar results (median R² = 0.820; supplementary fig. S9C, Supplementary Material online). Of note, in some cases the regression models involve complex interactions between features. For instance, for valine and threonine, a combination of G + C content and ApC/GpT dinucleotide frequencies exhibits nonadditive effects in determining the frequency of the amino acids (fig. 2C and D). A simple linear regression model where the number of free parameters was further penalized (by Akaike information criterion [AIC], see Materials and Methods) still yields a median cross-validation R² = 0.728 (supplementary fig. S11, Supplementary Material online), suggesting that our SVM estimates of fit are not inflated as a result of overfitting.
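
The normalization of dinucleotide frequencies by their G + C expectation can be sketched as below. This is a minimal illustration, not the authors' procedure (which is given in their Materials and Methods and may differ). It assumes that the expected frequency of a dinucleotide is the product of the single-base frequencies implied by G + C content alone, with G and C each at GC/2 and A and T each at (1 - GC)/2. All function names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    total = sum(seq.count(b) for b in BASES)
    return (seq.count("G") + seq.count("C")) / total


def dinucleotide_frequencies(seq):
    """Relative frequency of each of the 16 dinucleotides (overlapping windows)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    # Ignore windows containing ambiguous symbols such as N.
    total = sum(counts[d] for d in DINUCLEOTIDES)
    return {d: counts[d] / total for d in DINUCLEOTIDES}


def expected_from_gc(gc):
    """ASSUMPTION: expectation is a product of base frequencies set by G + C alone."""
    p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    return {x + y: p[x] * p[y] for x, y in product(BASES, repeat=2)}


def normalized_dinucleotides(seq):
    """Observed / expected ratio per dinucleotide; NaN where the expectation is 0."""
    observed = dinucleotide_frequencies(seq)
    expected = expected_from_gc(gc_content(seq))
    return {
        d: observed[d] / expected[d] if expected[d] > 0 else float("nan")
        for d in DINUCLEOTIDES
    }


if __name__ == "__main__":
    toy_intergenic = "ACGT" * 50  # toy sequence, not real data
    ratios = normalized_dinucleotides(toy_intergenic)
    for d in ("AC", "GT", "AA"):
        print(d, round(ratios[d], 3))
```

A ratio above 1 means the dinucleotide is more common than G + C content alone would predict, and a ratio below 1 means it is rarer. Ratios of this kind carry information that G + C content does not, which is why they can serve as additional predictors alongside it in a regression model.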

