Projection of gene-protein networks to the functional space of the proteome and its application to analysis of organism complexity.

Kanapin AA, Mulder N, Kuznetsov VA - BMC Genomics (2010)

Bottom Line: We identify and provide characteristics of functional switches in the polyform group of TUs in different organisms. Based on comparison of mouse and human SFNs, a role of alternative splicing as a necessary source of evolution towards more complex organisms is demonstrated. The entire set of FL across many organisms could be used as a draft of the catalogue of the functional space of the proteome world.


Affiliation: Ontario Institute for Cancer Research, Toronto, Canada. alexander.kanapin@oicr.on.ca

ABSTRACT

Background: We consider the problem of biological complexity via a projection of protein-coding genes of complex organisms onto the functional space of the proteome. The latter can be defined as the set of all functions performed by the proteins of an organism. Alternative splicing (AS) allows an organism to generate diverse mature RNA transcripts from a single pre-mRNA, and thus it could be one of the key mechanisms for increasing the functional complexity of the organism's proteome and a driving force of biological evolution. The projection of transcription units (TU) and alternative splice-variant (SV) forms onto the proteome functional space could therefore generate new types of relational networks (e.g. SV-protein function networks, SFN) and lead to the discovery of novel evolutionarily conserved functional modules. Such networks might provide new reliable characteristics of organism complexity and a better understanding of the evolutionary integration and plasticity of the interconnections among genome, transcriptome and proteome functions.

Results: We use the InterPro and UniProt databases to attribute descriptive features (keywords) to protein sequences. The UniProt database includes a controlled and curated vocabulary of specific descriptors, or keywords. The keywords are assigned to a protein sequence via conserved domains or via similarity with annotated sequences. We then treat each unique combination of keywords as a protein functional label (FL), which characterizes the biological functions of the given protein. From these labels we construct contingency tables and graphs that project transcription units (TU) and alternative splice-variants (SV) onto all FLs of the proteome of a given organism. We constructed SFNs for organisms with different evolutionary histories and levels of complexity, and performed a detailed statistical parameterization of the networks.
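The labelling and projection steps described in the Results can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layout, the identifiers (`TU1`, `TU1.a`, ...), the keywords and the helper names `functional_label` and `project` are all hypothetical, and the toy records are invented purely for illustration. The sketch assumes that an FL is the unordered set of keywords attached to one protein product.

```python
from collections import defaultdict

# Hypothetical toy annotations: (TU id, splice-variant id, keyword set).
# These records are invented for illustration; they are not data from the paper.
records = [
    ("TU1", "TU1.a", {"Kinase", "ATP-binding"}),
    ("TU1", "TU1.b", {"Kinase", "ATP-binding"}),
    ("TU2", "TU2.a", {"Transport"}),
    ("TU2", "TU2.b", {"Transport", "Membrane"}),
    ("TU3", "TU3.a", {"Transport"}),
]


def functional_label(keywords):
    """Assumption: an FL is the unique, unordered combination of keywords."""
    return frozenset(keywords)


def project(records):
    """Project TUs and SVs onto FLs, returning both directions of the mapping."""
    tu_to_fl = defaultdict(set)   # TU -> distinct FLs of its products
    fl_to_tu = defaultdict(set)   # FL -> distinct TUs carrying it
    sv_edges = []                 # (SV, FL) edges of an SV-function network
    for tu, sv, keywords in records:
        fl = functional_label(keywords)
        tu_to_fl[tu].add(fl)
        fl_to_tu[fl].add(tu)
        sv_edges.append((sv, fl))
    return tu_to_fl, fl_to_tu, sv_edges


tu_to_fl, fl_to_tu, sv_edges = project(records)

# Margins of the TU x FL contingency table.
fls_per_tu = {tu: len(fls) for tu, fls in tu_to_fl.items()}
tus_per_fl = {fl: len(tus) for fl, tus in fl_to_tu.items()}
```

In this toy example `TU2` is the only TU whose splice variants carry two different FLs, while `TU1` and `TU3` each map to a single FL. The two margin dictionaries are the kind of counts whose frequency distributions are shown in the figure panels.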

Conclusions: The application of the algorithm to organisms with different evolutionary history and level of biological complexity (nematode, fruit fly, vertebrata) reveals that the parameters describing SFN correlate with the complexity of a given organism. Using statistical analysis of the links of the functional networks, we propose new features of evolution of protein function acquisition. We reveal a group of genes and corresponding functions, which could be attributed to an early conservative part of the cellular machinery essential for cell viability and survival. We identify and provide characteristics of functional switches in the polyform group of TUs in different organisms. Based on comparison of mouse and human SFNs, a role of alternative splicing as a necessary source of evolution towards more complex organisms is demonstrated. The entire set of FL across many organisms could be used as a draft of the catalogue of the functional space of the proteome world.


Figure 3: Best-fit statistics of three transcript-protein relation functions in the mouse (left) and human (right) data sets. A and B: best-fit frequency distributions of the number of FLs in a given TU. C and D: best-fit frequency distributions of the number of distinct TUs attributed with a given FL in a proteome subset related to selected TUs. E and F: frequency distributions of the number of splice variant events per TU. The mixture probabilistic model (1) was used for identification of the empirical frequency distributions. Blue symbols: data used for parameterisation of the first model (P1); blue lines: best-fit function P1. Red symbols: data used for parameterisation of the second model (P2); blue lines: best-fit function P2. SigmaPlot analytical and graphical tools were used.

Mentions: First, we found 23640 FLs in the mouse proteome and 20929 FLs in the human proteome. For each of these two sets of FLs, we selected the TUs related to at least one FL: 20928 mouse TUs and 18260 human TUs. Note that, in our analysis, these TUs can represent mRNAs translated to known proteins. We then identified a model of the frequency distribution of the number of FLs in a given TU. Figures 3A and 3B display the empirical frequency distributions of the number of FLs in a given TU in the mouse and human data sets, respectively. Both distributions are skewed: the most frequent outcome is a single event, although rare events can occur over a large dynamic range [23,24]. Interestingly, the major fraction (89% (20928/23640)) of mouse TUs translated to proteins exhibits one-to-one relations with the corresponding FLs. A very similar fraction of human TUs (87% (18260/20929)) exhibits one-to-one relations with the corresponding FLs. The maximum number of distinct FLs in a TU is 9 for both mouse and human. An exponential function fits both distributions well, with similar values of the exponent constant (Figures 3A and 3B; Table 4). However, a deviation from the simple exponential distribution can be seen in the right tail of the empirical distributions. Note that this skewed frequency pattern was also observed in all the statistics analysed for other species (data not presented).
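The arithmetic behind the quoted percentages and the idea of an exponential fit can be sketched as follows. The ratios use the TU counts and the FL totals stated at the start of the passage (23640 mouse FLs, 20929 human FLs). The fitting routine is only an illustrative log-linear least-squares fit of f(k) = a·exp(−b·k); the source states that SigmaPlot was used and does not give the fitting procedure, so the function `fit_exponential` and the synthetic frequencies below are assumptions, not the published method or data.

```python
import math

# Ratios quoted in the text: selected TUs over FLs found in each proteome.
mouse_fraction = 20928 / 23640
human_fraction = 18260 / 20929
mouse_percent = round(100 * mouse_fraction)
human_percent = round(100 * human_fraction)


def fit_exponential(ks, freqs):
    """Fit f(k) = a * exp(-b * k) by ordinary least squares on log(f).

    Hypothetical helper for illustration; requires all freqs > 0.
    Returns the pair (a, b).
    """
    ys = [math.log(f) for f in freqs]
    n = len(ks)
    mean_k = sum(ks) / n
    mean_y = sum(ys) / n
    sxx = sum((k - mean_k) ** 2 for k in ks)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in zip(ks, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_k
    return math.exp(intercept), -slope


# Synthetic, noise-free frequencies for k = 1..9 FLs per TU (not paper data),
# generated from a known exponential so the fit can be checked.
ks = list(range(1, 10))
freqs = [5.0 * math.exp(-2.0 * k) for k in ks]
a_hat, b_hat = fit_exponential(ks, freqs)
```

On real counts a log-linear fit weights the sparse right tail heavily, which is exactly where the text reports deviation from a simple exponential, so a weighted or nonlinear fit may be a better choice there.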

