Statistical power of phylo-HMM for evolutionarily conserved element detection.

Fan X, Zhu J, Schadt EE, Liu JS - BMC Bioinformatics (2007)

Bottom Line: The conservation ratio of conserved elements and the expected length of the conserved elements are major factors in the power of phylo-HMM. In contrast, the topology and the nucleotide substitution model have a relatively minor influence. Our results provide general guidelines on how to select the number of genomes and their evolutionary distance in comparative genomics studies, as well as the level of power we can expect under different parameter settings.


Affiliation: Department of Statistics, Harvard University, Boston, MA, USA. xfan@fas.harvard.edu

ABSTRACT

Background: An important goal of comparative genomics is the identification of functional elements through conservation analysis. Phylo-HMM was recently introduced to detect conserved elements based on multiple genome alignments, but the method has not been rigorously evaluated.

Results: We report here a simulation study to investigate the power of phylo-HMM. We show that the power of the phylo-HMM approach depends on many factors, the most important being the number of species-specific genomes used and the evolutionary distances between pairs of species. This finding is consistent with results reported by other groups for simpler comparative genomics models. In addition, the conservation ratio of conserved elements and the expected length of the conserved elements are also major factors. In contrast, the topology and the nucleotide substitution model have a relatively minor influence.

Conclusion: Our results provide general guidelines on how to select the number of genomes and their evolutionary distance in comparative genomics studies, as well as the level of power we can expect under different parameter settings.


Figure 2: Topology of the baseline phylogenetic tree. For the un-rooted version of this tree, the branch between the mouse-rat and the human-dog pairs is called the "middle branch".

Mentions: We estimated the baseline parameters of the phylo-HMM from the promoter sequences of dog, human, mouse, and rat, whose evolutionary relationship is represented by the phylogenetic tree in Figure 2. The promoter sequence of a gene is defined as the region 2000 bp upstream to 500 bp downstream of the annotated transcription start site of that gene. MLAGAN [32] was used to align the promoter regions of genes in each orthologous gene cluster, which is defined as a set of 4 genes, each from a different species, of which any pair are reciprocal-best BLAST hits of each other (filtered by blastn threshold e-value = 0.001). All gaps in the alignment were treated as missing data [7]. If the average branch length of the phylogenetic tree for the conserved sites of a gene's promoter region was estimated to be greater than one substitution per site, the corresponding promoter alignment was considered unreliable and, thus, removed from our simulation studies. After filtering out such alignments, 8533 orthologous gene clusters remained. Using the REV model for the substitution rate matrix, we fitted the phylo-HMM to each of these 8533 alignments separately. The baseline is defined by setting the parameters (π, μ, ν, Q, β, ρ) to be the median values of fitted parameters from these 8533 alignments.
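The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the dictionary layouts, the 0-based half-open coordinate convention, and the use of scalar stand-ins ("mu", "nu", "rho") for the fitted parameters are all assumptions made for the example. Vector- or matrix-valued parameters such as π and Q would presumably need a component-wise median, which the sketch does not cover.

```python
from itertools import combinations
from statistics import median


def promoter_interval(tss, strand, upstream=2000, downstream=500):
    """Return a 0-based, half-open (start, end) promoter interval.

    Assumption: `tss` is a 0-based coordinate and the window spans
    `upstream` bp before it and `downstream` bp after it, mirrored
    for genes on the minus strand. The start is clamped at 0.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end


def is_orthologous_cluster(genes, best_hit):
    """Check that every pair of genes is a reciprocal-best hit.

    genes:    dict mapping species name -> gene id (one gene per species)
    best_hit: dict mapping (gene id, target species) -> best-hit gene id,
              assumed to be built beforehand from e-value-filtered BLAST output
    """
    for (sp_a, g_a), (sp_b, g_b) in combinations(genes.items(), 2):
        if best_hit.get((g_a, sp_b)) != g_b:
            return False
        if best_hit.get((g_b, sp_a)) != g_a:
            return False
    return True


def baseline_from_fits(fits, max_avg_branch_length=1.0):
    """Drop unreliable alignments, then take per-parameter medians.

    fits: list of dicts, one per alignment, each holding an
          "avg_branch_length" entry (substitutions per site at conserved
          sites) plus scalar fitted parameters under hypothetical keys.
    Returns (baseline parameter dict, number of alignments kept).
    """
    kept = [f for f in fits if f["avg_branch_length"] <= max_avg_branch_length]
    if not kept:
        raise ValueError("no alignment passed the branch-length filter")
    names = [k for k in kept[0] if k != "avg_branch_length"]
    baseline = {name: median(f[name] for f in kept) for name in names}
    return baseline, len(kept)
```

With the paper's data, `baseline_from_fits` would be called on the per-alignment fits and would report 8533 retained alignments; the sketch itself makes no claim about reproducing the published median values.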

