Limits...
Genome-wide de novo prediction of cis-regulatory binding sites in prokaryotes.

Zhang S, Xu M, Li S, Su Z - Nucleic Acids Res. (2009)

Bottom Line: Although cis-regulatory binding sites (CRBSs) are at least as important as the coding sequences in a genome, our general understanding of them in most sequenced genomes is very limited due to the lack of efficient and accurate experimental and computational methods for their characterization, which has largely hindered our understanding of many important biological processes.We designed our algorithm to circumvent three identified difficulties for CRBS prediction using comparative genomics principles based on a new method for the selection of reference genomes, a new metric for measuring the similarity of CRBSs, and a new graph clustering procedure.When compared with the prior state-of-the-art algorithms, our algorithm outperforms them in both prediction sensitivity and specificity.

View Article: PubMed Central - PubMed

Affiliation: Department of Bioinformatics and Genomics, Bioinformatics Research Center, the University of North Carolina at Charlotte, Charlotte, NC 28223, USA.

ABSTRACT
Although cis-regulatory binding sites (CRBSs) are at least as important as the coding sequences in a genome, our general understanding of them in most sequenced genomes is very limited due to the lack of efficient and accurate experimental and computational methods for their characterization, which has largely hindered our understanding of many important biological processes. In this article, we describe a novel algorithm for genome-wide de novo prediction of CRBSs with high accuracy. We designed our algorithm to circumvent three identified difficulties for CRBS prediction using comparative genomics principles based on a new method for the selection of reference genomes, a new metric for measuring the similarity of CRBSs, and a new graph clustering procedure. When operon structures are correctly predicted, our algorithm can predict 81% of known individual binding sites belonging to 94% of known cis-regulatory motifs in the Escherichia coli K12 genome, while achieving high prediction specificity. Our algorithm has also achieved similar prediction accuracy in the Bacillus subtilis genome, suggesting that it is very robust, and thus can be applied to any other sequenced prokaryotic genome. When compared with the prior state-of-the-art algorithms, our algorithm outperforms them in both prediction sensitivity and specificity.

Show MeSH

Related in: MedlinePlus

Construction of the motif similarity graph. Multiple (k) motif-finding tools are used and each predicts multiple motifs in each inter-operonic sequence set Io. All the predicted motifs form the set of input motifs. A node in the motif similarity graph represents a motif. Two nodes are connected if their corresponding motifs have a similarity score ≥β with the score being the weight on the edge (not shown for clarity). A solid line represents an edge between two nodes associated with two different Io, and a dashed line represents an edge between two nodes associated with the same Io.
© Copyright Policy - openaccess
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC2691844&req=5

Figure 2: Construction of the motif similarity graph. Multiple (k) motif-finding tools are used and each predicts multiple motifs in each inter-operonic sequence set Io. All the predicted motifs form the set of input motifs. A node in the motif similarity graph represents a motif. Two nodes are connected if their corresponding motifs have a similarity score ≥β with the score being the weight on the edge (not shown for clarity). A solid line represents an edge between two nodes associated with two different Io, and a dashed line represents an edge between two nodes associated with the same Io.

Mentions: The GLECLUBS algorithm is based on a comparative genomics approach and its flowchart is shown in Figure 1. Given a target genome, e.g. E. coli K12, for which we want to predict all possible CRBSs, we first select a group of closely related reference genomes using a new method (see below). For each operon o in the target genome, we identify its orthologous operons in all the reference genomes, and extract their corresponding upstream inter-operonic sequences to form a sequence set Io. We assume that some sequences in Io share some similar CRBSs for a set of orthologous TFs encoded in the target genome and some of the reference genomes. Based on this assumption, and for the convenience of discussion, we loosely define a cis-regulatory motif as the set or a subset of the binding sites of a TF in the target genome and of its orthologs in the reference genomes. Since the existing motif-finding algorithms can only predict a small fraction of the true CRBSs present in the input intergenic sequences, and different algorithms produce complementary predictions (29,30), we applied multiple (k) complementary motif-finding algorithms to each sequence set Io to harvest as many true binding sites as possible. In addition, since a motif with the highest score returned by a motif-finding algorithm may not necessarily be the true binding motif, a relatively low-ranked prediction can be a true one instead (29,30), we kept the top mi motifs found by the i-th motif-finding algorithm to include even more true binding sites. All these putative motifs are designated the input motifs. In order to separate the correctly predicted motifs from spurious ones in the set of input motifs, we computed the similarity scores between pairs of input motifs using a new metric (see below), and constructed a weighted graph called the motif similarity graph, in which a node represents an input motif, and two nodes are connected by an edge if and only if the similarity score between their corresponding input motifs is greater than a preset cutoff β with the similarity score being the weight of the edge (Figure 2). We reason that a true motif is more likely to be predicted by multiple tools than are spurious ones, when based on the same inter-operonic sequence set associated with an operon in the target genome if its binding sites are conserved beyond a certain level. Furthermore, a true motif is also more likely than a spurious one to have multiple similar motifs predicted in different sets of inter-operonic sequences associated with different operons, simply because all of these operons are regulated by the same TF, and therefore are expected to contain similar CRBSs. In other words, true binding site motifs are more likely to form highly connected sub-graphs with high weights on the edges in the motif similarity graph than are spurious ones, because the probability for multiple similar spurious motifs to occur by chance should be low. Our algorithm was then designed to identify ‘condensed sub-graphs’ as possible true binding site motifs. However, due to the degenerate nature, the similarity between two subsets of a motif may not be significantly high (see below), making our task of ‘separating true motifs from spurious ones’ in a single step not easy. To overcome this difficulty and find the condensed sub-graphs, we designed a graph clustering algorithm that iteratively constructs and clusters graphs to gradually filter out the spurious motifs in the motif similarity graph (Figure 1). We found that only three iterations were needed to asymptotically converge on the optimal prediction accuracy. We then refined the resulting sub-graphs/clusters, and ranked them according to the quality of the best motif that each cluster contains. Each top-ranked cluster is predicted to be a putative cis-regulatory binding motif, and the genes associated with it are predicted to form a regulon.Figure 1.


Genome-wide de novo prediction of cis-regulatory binding sites in prokaryotes.

Zhang S, Xu M, Li S, Su Z - Nucleic Acids Res. (2009)

Construction of the motif similarity graph. Multiple (k) motif-finding tools are used and each predicts multiple motifs in each inter-operonic sequence set Io. All the predicted motifs form the set of input motifs. A node in the motif similarity graph represents a motif. Two nodes are connected if their corresponding motifs have a similarity score ≥β with the score being the weight on the edge (not shown for clarity). A solid line represents an edge between two nodes associated with two different Io, and a dashed line represents an edge between two nodes associated with the same Io.
© Copyright Policy - openaccess
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC2691844&req=5

Figure 2: Construction of the motif similarity graph. Multiple (k) motif-finding tools are used and each predicts multiple motifs in each inter-operonic sequence set Io. All the predicted motifs form the set of input motifs. A node in the motif similarity graph represents a motif. Two nodes are connected if their corresponding motifs have a similarity score ≥β with the score being the weight on the edge (not shown for clarity). A solid line represents an edge between two nodes associated with two different Io, and a dashed line represents an edge between two nodes associated with the same Io.
Mentions: The GLECLUBS algorithm is based on a comparative genomics approach and its flowchart is shown in Figure 1. Given a target genome, e.g. E. coli K12, for which we want to predict all possible CRBSs, we first select a group of closely related reference genomes using a new method (see below). For each operon o in the target genome, we identify its orthologous operons in all the reference genomes, and extract their corresponding upstream inter-operonic sequences to form a sequence set Io. We assume that some sequences in Io share some similar CRBSs for a set of orthologous TFs encoded in the target genome and some of the reference genomes. Based on this assumption, and for the convenience of discussion, we loosely define a cis-regulatory motif as the set or a subset of the binding sites of a TF in the target genome and of its orthologs in the reference genomes. Since the existing motif-finding algorithms can only predict a small fraction of the true CRBSs present in the input intergenic sequences, and different algorithms produce complementary predictions (29,30), we applied multiple (k) complementary motif-finding algorithms to each sequence set Io to harvest as many true binding sites as possible. In addition, since a motif with the highest score returned by a motif-finding algorithm may not necessarily be the true binding motif, a relatively low-ranked prediction can be a true one instead (29,30), we kept the top mi motifs found by the i-th motif-finding algorithm to include even more true binding sites. All these putative motifs are designated the input motifs. In order to separate the correctly predicted motifs from spurious ones in the set of input motifs, we computed the similarity scores between pairs of input motifs using a new metric (see below), and constructed a weighted graph called the motif similarity graph, in which a node represents an input motif, and two nodes are connected by an edge if and only if the similarity score between their corresponding input motifs is greater than a preset cutoff β with the similarity score being the weight of the edge (Figure 2). We reason that a true motif is more likely to be predicted by multiple tools than are spurious ones, when based on the same inter-operonic sequence set associated with an operon in the target genome if its binding sites are conserved beyond a certain level. Furthermore, a true motif is also more likely than a spurious one to have multiple similar motifs predicted in different sets of inter-operonic sequences associated with different operons, simply because all of these operons are regulated by the same TF, and therefore are expected to contain similar CRBSs. In other words, true binding site motifs are more likely to form highly connected sub-graphs with high weights on the edges in the motif similarity graph than are spurious ones, because the probability for multiple similar spurious motifs to occur by chance should be low. Our algorithm was then designed to identify ‘condensed sub-graphs’ as possible true binding site motifs. However, due to the degenerate nature, the similarity between two subsets of a motif may not be significantly high (see below), making our task of ‘separating true motifs from spurious ones’ in a single step not easy. To overcome this difficulty and find the condensed sub-graphs, we designed a graph clustering algorithm that iteratively constructs and clusters graphs to gradually filter out the spurious motifs in the motif similarity graph (Figure 1). We found that only three iterations were needed to asymptotically converge on the optimal prediction accuracy. We then refined the resulting sub-graphs/clusters, and ranked them according to the quality of the best motif that each cluster contains. Each top-ranked cluster is predicted to be a putative cis-regulatory binding motif, and the genes associated with it are predicted to form a regulon.Figure 1.

Bottom Line: Although cis-regulatory binding sites (CRBSs) are at least as important as the coding sequences in a genome, our general understanding of them in most sequenced genomes is very limited due to the lack of efficient and accurate experimental and computational methods for their characterization, which has largely hindered our understanding of many important biological processes.We designed our algorithm to circumvent three identified difficulties for CRBS prediction using comparative genomics principles based on a new method for the selection of reference genomes, a new metric for measuring the similarity of CRBSs, and a new graph clustering procedure.When compared with the prior state-of-the-art algorithms, our algorithm outperforms them in both prediction sensitivity and specificity.

View Article: PubMed Central - PubMed

Affiliation: Department of Bioinformatics and Genomics, Bioinformatics Research Center, the University of North Carolina at Charlotte, Charlotte, NC 28223, USA.

ABSTRACT
Although cis-regulatory binding sites (CRBSs) are at least as important as the coding sequences in a genome, our general understanding of them in most sequenced genomes is very limited due to the lack of efficient and accurate experimental and computational methods for their characterization, which has largely hindered our understanding of many important biological processes. In this article, we describe a novel algorithm for genome-wide de novo prediction of CRBSs with high accuracy. We designed our algorithm to circumvent three identified difficulties for CRBS prediction using comparative genomics principles based on a new method for the selection of reference genomes, a new metric for measuring the similarity of CRBSs, and a new graph clustering procedure. When operon structures are correctly predicted, our algorithm can predict 81% of known individual binding sites belonging to 94% of known cis-regulatory motifs in the Escherichia coli K12 genome, while achieving high prediction specificity. Our algorithm has also achieved similar prediction accuracy in the Bacillus subtilis genome, suggesting that it is very robust, and thus can be applied to any other sequenced prokaryotic genome. When compared with the prior state-of-the-art algorithms, our algorithm outperforms them in both prediction sensitivity and specificity.

Show MeSH
Related in: MedlinePlus