Measuring global credibility with application to local sequence alignment.

Webb-Robertson BJ, McCue LA, Lawrence CE - PLoS Comput. Biol. (2008)

Bottom Line: Because sequence alignment is arguably the most extensively used procedure in computational biology, we employ it here to make these general concepts more concrete. The maximum similarity estimator (i.e., the alignment that maximizes the likelihood) and the centroid estimator (i.e., the alignment that minimizes the mean Hamming distance from the posterior weighted ensemble of alignments) are used to demonstrate the application of Bayesian credibility limits to alignment estimators. Application of Bayesian credibility limits to the alignment of 20 human/rodent orthologous sequence pairs and 125 orthologous sequence pairs from six Shewanella species shows that credibility limits of the alignments of promoter sequences of these species vary widely, and that centroid alignments dependably have tighter credibility limits than traditional maximum similarity alignments.


Affiliation: Computational Biology and Bioinformatics, Pacific Northwest National Laboratory, Richland, Washington, United States of America. bj@pnl.gov

ABSTRACT
Computational biology is replete with high-dimensional (high-D) discrete prediction and inference problems, including sequence alignment, RNA structure prediction, phylogenetic inference, motif finding, prediction of pathways, and model selection problems in statistical genetics. Even though prediction and inference in these settings are uncertain, little attention has been focused on the development of global measures of uncertainty. Regardless of the procedure employed to produce a prediction, when a procedure delivers a single answer, that answer is a point estimate selected from the solution ensemble, the set of all possible solutions. For high-D discrete space, these ensembles are immense, and thus there is considerable uncertainty. We recommend the use of Bayesian credibility limits to describe this uncertainty, where a (1−α)%, 0 ≤ α ≤ 1, credibility limit is the minimum Hamming distance radius of a hyper-sphere containing (1−α)% of the posterior distribution. Because sequence alignment is arguably the most extensively used procedure in computational biology, we employ it here to make these general concepts more concrete. The maximum similarity estimator (i.e., the alignment that maximizes the likelihood) and the centroid estimator (i.e., the alignment that minimizes the mean Hamming distance from the posterior weighted ensemble of alignments) are used to demonstrate the application of Bayesian credibility limits to alignment estimators. Application of Bayesian credibility limits to the alignment of 20 human/rodent orthologous sequence pairs and 125 orthologous sequence pairs from six Shewanella species shows that credibility limits of the alignments of promoter sequences of these species vary widely, and that centroid alignments dependably have tighter credibility limits than traditional maximum similarity alignments.


Figure 1 (pcbi-1000077-g001): Plot of ND95 values for the EC versus the MS of 20 pairwise sequence alignments. The ND95 values associated with the 20 highly conserved gene sequences are represented as green circles. The sequence alignments that represent alignment of random, unrelated sequences are represented as black triangles. The ND95 values for the intergenic sequences upstream of the coding genes are shown as blue squares. The four example alignment ND distributions displayed in Figure 2 are indicated by a letter next to the corresponding square.

Mentions: Hamming distances will, in general, depend on the lengths of the ensemble members. For example, in alignment, longer sequences will tend to return larger distances simply because the alignment matrix is larger. Thus, normalization is in order, and we employ a normalization factor based on maximum realized alignment lengths. Specifically, when calculating a credibility limit, the length of the estimating alignment (LE) is known, and the maximum length of an alignment in the ensemble is the length of the shorter of the two sequences (I). Thus, the maximum Hamming distance between an estimating alignment and the longest member of the ensemble is (LE+I). However, in our studies, we found that using this sum as a normalizing factor was misleading for cases in which the posterior space of alignments tended to be dominated by shorter local alignments. For example, the local alignments of the randomly shuffled sequences described in the Results section (see Figure 1) were dominated by short alignments. As a result, using (LE+I) as the normalizing constant in this case produced normalized distances that were not close to one, even when there were no base pairs in common between a sampled alignment and the estimating alignment. To adjust for these differences, we used the length of the longest sampled alignment, LS, as the second term in our normalizing sum, so that the normalized distance (ND) between the estimating alignment Ax and the ith alignment in the sample, Ai, is ND(Ax, Ai) = HD(Ax, Ai)/(LE+LS), with LS = max{Lj : j ∈ S}, where HD is the Hamming distance, Lj is the length of the jth sampled alignment, and S indicates the set of sampled alignments. Using this normalization factor yields normalized distances with values between zero and one. A perfect match would yield an ND score of zero, and in the case where the longest sampled alignment has no base pairings in common with the estimating alignment, the ND score would be one. We define the credibility of the alignment at (1−α) to be ND(1−α).
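
The normalization described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: each alignment is represented as a set of aligned base pairs (i, j), the Hamming distance between two alignments is the number of pairs present in exactly one of them, alignment length is the number of aligned pairs, and the sampled alignments are equally weighted draws from the posterior. The function names (hamming, normalized_distances, credibility_limit) are hypothetical.

```python
import math


def hamming(a, b):
    """Hamming distance between two alignments stored as sets of (i, j) pairs.

    Assumption: this equals the number of cells that differ in the binary
    alignment matrices, i.e. the size of the symmetric difference.
    """
    return len(a ^ b)


def normalized_distances(estimate, samples):
    """Return ND(Ax, Ai) = HD(Ax, Ai) / (LE + LS) for every sampled alignment."""
    if not samples:
        raise ValueError("at least one sampled alignment is required")
    l_e = len(estimate)                   # LE: length of the estimating alignment
    l_s = max(len(s) for s in samples)    # LS: longest sampled alignment
    denom = l_e + l_s
    if denom == 0:                        # every alignment is empty
        return [0.0] * len(samples)
    return [hamming(estimate, s) / denom for s in samples]


def credibility_limit(estimate, samples, alpha=0.05):
    """Smallest ND radius containing at least a (1 - alpha) share of the samples."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must lie in [0, 1]")
    nd = sorted(normalized_distances(estimate, samples))
    k = max(1, math.ceil((1 - alpha) * len(nd)))
    return nd[k - 1]


if __name__ == "__main__":
    # The longest sample shares no base pairs with the estimate,
    # so HD = 2 + 3 = 5 and LE + LS = 5, which gives ND = 1.0.
    estimate = {(0, 0), (1, 1)}
    samples = [{(3, 0), (4, 1), (5, 2)}]
    print(normalized_distances(estimate, samples))  # [1.0]
```

With alpha = 0.05, credibility_limit returns the empirical counterpart of the ND95 value plotted in Figure 1, under the assumptions stated above.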

