Global Shifts in Genome and Proteome Composition Are Very Tightly Coupled.
Bottom Line: Qualitatively similar results were obtained for 49 fungal genomes, where 80% of the variability in AAC could be explained by the composition of introns and intergenic regions. Moreover, highly expressed genes do not exhibit more prominent environment-related AAC signatures than lowly expressed genes, despite contributing more to the effective proteome. Thus, evolutionary shifts in overall AAC appear to occur almost exclusively through factors shaping the global oligonucleotide content of the genome.
Affiliation: Division of Electronics, Rudjer Boskovic Institute, Zagreb, Croatia; Molecular Basis of Ageing, Mediterranean Institute for Life Sciences (MedILS), Split, Croatia.
Mentions: Introducing the relative frequencies of dinucleotides in intergenic DNA (Materials and Methods) in addition to G + C content considerably improves AAC prediction, both for the G + C balanced amino acids (median R² = 0.647) and overall (fig. 1; R² = 0.736, 0.632–0.879 [median, Q1–Q3 over amino acids]). Observed dinucleotide frequencies were normalized by the frequency expected from G + C content in order to capture orthogonal information (Materials and Methods). We gain further predictive accuracy by adding trinucleotide composition as a predictor (fig. 1; R² = 0.840, 0.761–0.905). Testing this regression model on an additional set of 600 genomes yielded similar results (median R² = 0.820; supplementary fig. S9C, Supplementary Material online). Of note, in some cases the regression models involve complex interactions between features. For instance, for valine and threonine, a combination of G + C content and ApC/GpT dinucleotide frequencies exhibits nonadditive effects in determining the frequency of the amino acids (fig. 2C and D). A simple linear regression model in which the number of free parameters was further penalized (by the Akaike information criterion [AIC], see Materials and Methods) still yields a median cross-validation R² = 0.728 (supplementary fig. S11, Supplementary Material online), suggesting that our SVM estimates of fit are not inflated as a result of overfitting.
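The normalization of dinucleotide frequencies by their G + C expectation can be illustrated with a short sketch. The exact procedure is given in the Materials and Methods, which are not reproduced here, so the following rests on assumptions: the expected frequency of a dinucleotide XY is taken as p(X)·p(Y), with p(G) = p(C) = GC/2 and p(A) = p(T) = (1 − GC)/2, and reverse-complementary dinucleotides (such as ApC and GpT) are not merged. The function names are hypothetical.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return (seq.count("G") + seq.count("C")) / acgt


def normalized_dinucleotide_freqs(seq: str) -> dict[str, float]:
    """Observed/expected ratio for each of the 16 dinucleotides.

    ASSUMPTION: the expectation depends on G + C content only, i.e.
    p(G) = p(C) = gc / 2 and p(A) = p(T) = (1 - gc) / 2, and
    expected(XY) = p(X) * p(Y). The paper's own formula may differ.
    """
    seq = seq.upper()
    gc = gc_content(seq)
    base_p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}

    # Count overlapping dinucleotides; skip any that contain ambiguous bases.
    valid = {a + b for a, b in product("ACGT", repeat=2)}
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(n for dinuc, n in counts.items() if dinuc in valid)
    if total == 0:
        raise ValueError("sequence contains no unambiguous dinucleotides")

    ratios = {}
    for dinuc in sorted(valid):
        observed = counts.get(dinuc, 0) / total
        expected = base_p[dinuc[0]] * base_p[dinuc[1]]
        # expected is 0 only when gc is exactly 0 or 1; report NaN then.
        ratios[dinuc] = observed / expected if expected > 0 else float("nan")
    return ratios
```

A ratio above 1 indicates a dinucleotide that is more frequent than G + C content alone would predict. Ratios of this kind, computed on intergenic DNA, are the sort of feature that can sit alongside G + C content in a regression model without simply restating it.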