Modularity and dynamics of cellular networks.
Affiliation: MIT's Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts, United States of America.
Mentions: Unlike random networks, cellular networks contain characteristic topological patterns that enable their functionality. To find the basic building blocks of cellular networks, simple units consisting of a few components were enumerated, and some of them were found to be significantly overrepresented [11]. These recurring units were defined as network motifs. For instance, transcriptional network motifs include feed-forward loops, single-input motifs, and multi-input motifs (Figure 2) [3,5,12]. In a feed-forward loop, a transcription factor (TF) regulates a second TF, and the two TFs jointly regulate a common target gene. A single-input motif contains one TF that regulates a set of target genes, such as the subunits of a protein complex. A multi-input motif consists of multiple TFs that regulate a set of target genes, which makes combinatorial control possible. These motifs are found in multiple organisms, including bacteria, yeast, and humans. This structural conservation suggests that network motifs are functionally important for transcriptional regulation.
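As an illustration of how such units can be enumerated, the following minimal sketch lists the feed-forward loops in a small directed network. The network, the gene names, and the function feed_forward_loops are hypothetical and chosen only for this example; the sketch does not test for overrepresentation, which would additionally require comparison against randomized networks.

```python
from itertools import permutations

# Hypothetical toy regulatory network: an edge (a, b) means "a regulates b".
edges = {
    ("TF1", "TF2"), ("TF1", "G1"), ("TF2", "G1"),  # a feed-forward loop
    ("TF3", "G2"), ("TF3", "G3"), ("TF3", "G4"),   # a single-input motif
}

def feed_forward_loops(edges):
    """Return all triples (x, y, z) with x -> y, y -> z and x -> z."""
    nodes = {node for edge in edges for node in edge}
    return sorted(
        (x, y, z)
        for x, y, z in permutations(nodes, 3)
        if (x, y) in edges and (y, z) in edges and (x, z) in edges
    )

ffls = feed_forward_loops(edges)
print(ffls)
```

Here the only triple reported is (TF1, TF2, G1): TF1 regulates TF2, and both regulate G1. Brute-force enumeration over all ordered triples is adequate for a toy network but would need a more efficient search on genome-scale data.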
(a) Clustering
Clustering methods are widely used to find modules in transcriptional regulation. An expression profile dataset can be represented as a two-dimensional matrix in which rows index genes and columns index experimental conditions. Clustering methods partition genes into groups such that the genes in each group show similar expression across conditions or through a time series [47] (Figure 6). Since regulation by common TFs may occur only under certain conditions, bi-clustering methods [48] have been developed to identify genes that are expressed similarly under a subset of conditions. Note that genes with similar expression are not necessarily co-regulated, and that clustering does not by itself identify the corresponding regulators. Therefore, genes clustered together may not fully represent modules in transcriptional regulatory networks.
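The motivation for bi-clustering can be made concrete with a small sketch. The expression values, gene names, and the helper pearson below are hypothetical; Pearson correlation is used here only as one common similarity measure and is not prescribed by the methods cited above.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation between two equally long expression profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)

# Hypothetical expression matrix: rows are genes, columns are six conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 1.0, 1.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 9.0, 3.0],
}

# Similarity over all conditions versus over the first four conditions only.
r_all = pearson(expression["geneA"], expression["geneB"])
r_subset = pearson(expression["geneA"][:4], expression["geneB"][:4])
print(round(r_all, 2), round(r_subset, 2))
```

The two profiles are perfectly correlated over the first four conditions but only weakly correlated over all six, so a method that compares whole rows could miss a relationship that holds under a subset of conditions.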
Traditional clustering methods, such as K-means, require a predefined and fixed number of gene clusters, which may be hard to choose in practice and can greatly influence the results. They also do not model temporal dependence between expression profiles. To address these issues, Schliep and coworkers [49] and Beal and Krishnamurthy [50] applied Hidden Markov Models to cluster gene expression time course data. Specifically, both groups used Hidden Markov Models to model the temporal dependence of gene expression, instead of treating different time points independently. While Schliep and coworkers proposed a heuristic approach to determine the number of clusters, Beal and Krishnamurthy used a nonparametric prior distribution on mixture weights, so that genes can be clustered without a predefined number of clusters.
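The dependence of K-means on the chosen number of clusters can be seen in a minimal sketch. The profiles and the functions sq_dist and kmeans are hypothetical, and the deterministic initialization (the first k profiles serve as initial centroids) is a simplification made for reproducibility; this is plain K-means, not the Hidden Markov Model approaches of [49] or [50].

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(profiles, k, iterations=20):
    """Plain K-means; the first k profiles are used as initial centroids."""
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical profiles: two low genes, two high genes, one intermediate gene.
profiles = [
    [0.0, 0.1, 0.2],
    [5.0, 5.1, 5.2],
    [0.1, 0.0, 0.3],
    [5.2, 5.0, 5.1],
    [2.5, 2.6, 2.4],
]

labels_k2 = kmeans(profiles, 2)
labels_k3 = kmeans(profiles, 3)
print(labels_k2, labels_k3)
```

With k = 2 the intermediate gene is forced into the low-expression cluster, whereas with k = 3 it forms a cluster of its own. Neither partition is wrong in itself; the example only shows that the answer changes with a parameter that must be fixed in advance.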
(b) Topology-based analysis
Interaction networks are often visualized as graphs in which nodes represent genes or proteins and edges represent interactions. Modular structures can be inferred from topological features of the networks. For example, densely connected subgraphs can be exhaustively identified in PPI networks (Figure 7); such subgraphs suggest the existence of multi-protein complexes [16]. Modules can also be identified using topological distances in the networks. More specifically, the distance between two nodes is defined as the length of the shortest path(s) between them. A matrix of distances between all pairwise combinations of nodes can then be used for clustering [17]. The underlying assumption is that proteins in a module have similar distances to proteins outside of the given module.
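Both ideas can be sketched on a toy graph. The network ppi, its node names, and the helper functions below are hypothetical; breadth-first search is used because it yields shortest path lengths in an unweighted graph.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected PPI network stored as adjacency sets.
ppi = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}

def shortest_path_lengths(graph, source):
    """Breadth-first search: shortest path length from source to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def is_fully_connected(graph, members):
    """True if every pair of members interacts directly (a clique)."""
    return all(v in graph[u] for u, v in combinations(sorted(members), 2))

# Distance matrix over all pairs of nodes, as a dict of dicts.
distance = {u: shortest_path_lengths(ppi, u) for u in ppi}

outside = ["D", "E", "F"]
profile_a = [distance["A"][n] for n in outside]
profile_b = [distance["B"][n] for n in outside]
print(profile_a, profile_b, is_fully_connected(ppi, {"A", "B", "C"}))
```

Proteins A and B have identical distance profiles to the nodes outside the triangle A, B, C, which is the kind of similarity that distance-based clustering exploits; the triangle itself is a fully connected subgraph.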
(c) Probabilistic graphical models
Nodes of probabilistic graphical models represent variables, and edges encode dependence relations among the variables, so that missing edges correspond to conditional independencies (Figure 8). According to the directionality of their edges, graphical models can be classified into two major categories: Bayesian networks and Markov random fields. A Bayesian network is a directed acyclic graphical model: if there is an edge pointing from node X to another node Y, then the values of variable Y depend directly on the values of X, and X is called a parent of Y. Coupled with intervention data, Bayesian networks can be used to learn causal relationships, and are thus suitable for modeling transcriptional regulatory networks [14,51,52] or signaling pathways [39]. In contrast, the edges in Markov random fields are undirected, which makes them suitable for modeling PPI networks or other networks of symmetric interactions [53].
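The two categories correspond to two standard factorizations of the joint distribution, written out here for reference. The notation is an assumption of this sketch and does not appear in the original text: Pa(X_i) denotes the parents of X_i, \mathcal{C} a set of cliques of the undirected graph, \psi_C a non-negative potential function, and Z the normalizing constant.

```latex
% Bayesian network: one conditional distribution per node, given its parents
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)

% Markov random field: product of clique potentials, normalized by Z
P(x_1, \dots, x_n) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),
\qquad
Z = \sum_{x_1, \dots, x_n} \prod_{C \in \mathcal{C}} \psi_C(x_C)
```

For the single edge from X to Y described above, the first formula reduces to P(X, Y) = P(X) P(Y | X).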
To use graphical models, we need to systematically learn the structures of the networks from biological data and to estimate the parameters of these networks [54]. The learned graphical models reveal how proteins and genes interact, and they can be used to answer different biological queries, each posed as an inference problem. For example, when the activity of a protein is suppressed, cells may respond by changing the expression levels of other genes. Such responses can be predicted from a learned regulatory network.
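Such a query can be sketched on a very small network, assuming the structure and parameters have already been learned. The chain protein -> G1 -> G2, all probability values, and the function prob_g2_on are hypothetical. Because the protein has no parents in this toy network, suppressing it amounts to fixing its value to 0.

```python
# Hypothetical learned parameters for a chain: protein -> G1 -> G2.
p_protein_active = 0.7
p_g1_on_given_protein = {1: 0.9, 0: 0.2}  # the protein activates G1
p_g2_on_given_g1 = {1: 0.1, 0: 0.8}       # G1 represses G2

def bernoulli(p_on, value):
    """Probability of a binary value when P(value = 1) is p_on."""
    return p_on if value == 1 else 1.0 - p_on

def prob_g2_on(protein_clamped=None):
    """P(G2 = 1) by enumeration, optionally with the protein fixed."""
    total = 0.0
    for protein in (0, 1):
        if protein_clamped is None:
            weight = bernoulli(p_protein_active, protein)
        else:
            weight = 1.0 if protein == protein_clamped else 0.0
        for g1 in (0, 1):
            joint = weight * bernoulli(p_g1_on_given_protein[protein], g1)
            total += joint * p_g2_on_given_g1[g1]
    return total

baseline = prob_g2_on()
suppressed = prob_g2_on(protein_clamped=0)
print(round(baseline, 3), round(suppressed, 3))
```

Under these assumed parameters, suppressing the protein raises the probability that G2 is expressed from about 0.32 to 0.66, because G1, which represses G2, is less likely to be on.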
While the task of learning Bayesian networks has been well addressed [51,55], learning Markov random fields is still at an early stage [56,57]. If graphical models are used to model large-scale biological networks that contain structural loops, such as PPI networks, the inference problem is not trivial. Monte Carlo methods or approximate inference methods can be used to solve such problems [55,58–62].
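As a minimal sketch of Monte Carlo inference, Gibbs sampling can be applied to a Markov random field with a single loop. The three-node cycle, the potential values, and the function names are hypothetical; the model is small enough that the exact marginal can be computed by enumeration for comparison, which would not be feasible for a large network.

```python
import random
from itertools import product

# Hypothetical three-node cycle of binary variables (one structural loop).
EDGES = [(0, 1), (1, 2), (2, 0)]

def weight(state):
    """Unnormalized probability: a bias on node 0 times pairwise potentials."""
    w = 3.0 if state[0] == 1 else 1.0            # node 0 prefers state 1
    for i, j in EDGES:
        w *= 2.0 if state[i] == state[j] else 1.0  # neighbors prefer to agree
    return w

def exact_marginal(node):
    """P(node = 1) by summing over all eight joint states."""
    states = list(product((0, 1), repeat=3))
    z = sum(weight(s) for s in states)
    return sum(weight(s) for s in states if s[node] == 1) / z

def gibbs_marginal(node, sweeps=20000, burn_in=1000, seed=0):
    """Estimate P(node = 1) by resampling each variable given the others."""
    rng = random.Random(seed)
    state = [0, 0, 0]
    hits = 0
    for sweep in range(sweeps + burn_in):
        for i in range(3):
            state[i] = 1
            w1 = weight(state)
            state[i] = 0
            w0 = weight(state)
            state[i] = 1 if rng.random() < w1 / (w0 + w1) else 0
        if sweep >= burn_in:
            hits += state[node]
    return hits / sweeps

exact = exact_marginal(2)
approx = gibbs_marginal(2)
print(round(exact, 3), round(approx, 3))
```

The sampled estimate should lie close to the exact value of 34/56; the number of sweeps and the burn-in length are arbitrary choices for this sketch and would need tuning on a real network.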
(d) Integration of various data sources
Individual high-throughput biological datasets are usually both incomplete and error-prone. Data integration is therefore indispensable for modeling cellular networks accurately and for making functional inferences [45]. For example, both yeast two-hybrid [63,64] and affinity-purification/mass-spectrometry experiments [8,9] have been applied to the mapping of PPI networks. Overlapping the two data sources enables the identification of high-confidence interactions [65]. In addition, yeast two-hybrid detects binary relationships, whereas affinity-purification/mass-spectrometry detects proteins as members of a complex. Integrating these two types of data helps to model the actual topology of protein complexes [66]. Furthermore, if temporal, spatial, or conditional expression data are available, it may be possible to provide a dynamic view of protein complexes under physiological conditions (Figure 9).
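The overlap of the two data types can be sketched with set operations. The protein names and both datasets are hypothetical, and treating every pair of co-purified proteins as a candidate pair is a simplifying assumption of this sketch, not a method taken from [65] or [66].

```python
from itertools import combinations

# Hypothetical yeast two-hybrid result: unordered binary interactions.
y2h_pairs = {
    frozenset(pair)
    for pair in [("A", "B"), ("B", "C"), ("C", "E"), ("D", "F")]
}

# Hypothetical affinity-purification result: sets of co-purified proteins.
apms_complexes = [{"A", "B", "C"}, {"D", "E"}]

def co_complex_pairs(complexes):
    """All unordered pairs of proteins that share at least one complex."""
    return {
        frozenset(pair)
        for members in complexes
        for pair in combinations(sorted(members), 2)
    }

# Interactions supported by both data sources.
high_confidence = y2h_pairs & co_complex_pairs(apms_complexes)
print(sorted(tuple(sorted(pair)) for pair in high_confidence))
```

In this toy case the complex containing A, B, and C is supported by the binary contacts A-B and B-C but not A-C, which suggests a chain-like arrangement within the complex rather than a fully connected one.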