PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions.

Lin MF, Jungreis I, Kellis M - Bioinformatics (2011)

Bottom Line: We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models. We show that PhyloCSF's classification performance in 12-species Drosophila genome alignments exceeds all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, often initially recognized by their lack of protein coding potential rather than conserved RNA secondary structures.

Affiliation: Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street 32-D510, Cambridge, MA 02139, USA. mlin@mit.edu

ABSTRACT

Motivation: As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models.

Results: We show that PhyloCSF's classification performance in 12-species Drosophila genome alignments exceeds all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, often initially recognized by their lack of protein coding potential rather than conserved RNA secondary structures.

Availability and implementation: The Objective Caml source code and executables for GNU/Linux and Mac OS X are freely available at http://compbio.mit.edu/PhyloCSF

Contact: mlin@mit.edu; manoli@mit.edu

Figure 1: PhyloCSF method overview. (A) PhyloCSF uses phylogenetic codon models estimated from genome-wide training data based on known coding and non-coding regions. These models include a phylogenetic tree and codon substitution rate matrices QC and QN for coding and non-coding regions, respectively, shown here for 12 Drosophila species. QC captures the characteristic evolutionary signatures of codon substitutions in conserved coding regions, while QN captures the typical evolutionary rates of triplet sites in non-coding regions. (B) PhyloCSF applied to a short region from the first exon of the D.melanogaster homeobox gene Dfd. The alignment of this region shows only synonymous substitutions compared with the inferred ancestral sequence (green). Using the maximum likelihood estimate of a scale factor ρ applied to the assumed branch lengths, the alignment has higher probability under the coding model than the non-coding model, resulting in a positive log-likelihood ratio Λ. (C) PhyloCSF applied to a conserved region within a Dfd intron. In contrast to the exonic alignment, this region shows many non-synonymous substitutions (red), nonsense substitutions (blue, purple) and frameshifts (orange). The alignment has lower probability under the coding model, resulting in a negative score.
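
The score Λ described in the caption can be written compactly as follows. This is only a notational sketch: the symbols A (the alignment), T (the phylogenetic tree with its assumed branch lengths) and the per-model scale factors ρ_C and ρ_N are labels introduced here for illustration, and the base of the logarithm is left unspecified.

```latex
\Lambda \;=\; \log
\frac{\max_{\rho_C} \; P\left(A \mid T,\, Q_C,\, \rho_C\right)}
     {\max_{\rho_N} \; P\left(A \mid T,\, Q_N,\, \rho_N\right)}
```

A positive Λ means the alignment is more probable under the coding model, as in panel (B); a negative Λ means it is more probable under the non-coding model, as in panel (C).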

Mentions: In summary (Fig. 1), PhyloCSF relies on two empirical codon models (ECMs) fit to genome-wide training data, which include estimates for the branch lengths, codon frequencies and codon substitution rates for known coding and non-coding regions. To evaluate a given nucleotide sequence alignment, PhyloCSF (i) determines the maximum likelihood estimate (MLE) of the scale factor ρ on the branch lengths for each of these models, (ii) computes the likelihood of each model (the probability of the alignment under the model) using the MLE of the scale factor, and (iii) reports the log-likelihood ratio between the coding and non-coding models.
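
The three steps can be illustrated with a deliberately small Python sketch. It is not the PhyloCSF implementation (which is written in Objective Caml) and makes several simplifying assumptions that are not in the paper: a toy alphabet of four "codons" encoding two amino acids instead of real codons, a star-shaped tree with three species instead of the 12-species Drosophila phylogeny, hand-picked rate matrices instead of models trained on genome-wide data, and a coarse grid search standing in for a proper optimizer of ρ. All names (`rate_matrix`, `log_likelihood_ratio`, and so on) are hypothetical.

```python
import math

# Toy "codons": a1/a2 encode amino acid A, b1/b2 encode amino acid B (assumption).
STATES = ["a1", "a2", "b1", "b2"]

def rate_matrix(syn, nonsyn):
    """Symmetric toy rate matrix, so the stationary distribution is uniform."""
    n = len(STATES)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                same_aa = STATES[i][0] == STATES[j][0]
                q[i][j] = syn if same_aa else nonsyn
        q[i][i] = -sum(q[i])  # rows of a rate matrix sum to zero
    return q

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(q, t, squarings=10, terms=12):
    """Transition probabilities exp(q * t) via Taylor series with scaling and squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    m = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

def log_likelihood(alignment, q, branch_lengths, rho):
    """Log-probability of the alignment on a star tree with branches scaled by rho."""
    n = len(STATES)
    p = [expm(q, rho * t) for t in branch_lengths]
    total = 0.0
    for column in alignment:
        idx = [STATES.index(s) for s in column]
        # Sum over the unobserved root state, weighted by the uniform prior 1/n.
        site = sum(math.prod(p[k][r][x] for k, x in enumerate(idx)) / n
                   for r in range(n))
        total += math.log(site)
    return total

# Step (i): candidate scale factors, log-spaced from 0.01 to 100.
RHO_GRID = [10 ** (e / 20) for e in range(-40, 41)]

def max_log_likelihood(alignment, q, branch_lengths):
    """Steps (i) and (ii): likelihood of one model at its best rho on the grid."""
    return max(log_likelihood(alignment, q, branch_lengths, rho) for rho in RHO_GRID)

def log_likelihood_ratio(alignment, q_coding, q_noncoding, branch_lengths):
    """Step (iii): coding minus non-coding maximized log-likelihood (natural log)."""
    return (max_log_likelihood(alignment, q_coding, branch_lengths)
            - max_log_likelihood(alignment, q_noncoding, branch_lengths))

# Hand-picked toy parameters (assumptions, not trained values).
Q_CODING = rate_matrix(syn=1.0, nonsyn=0.05)    # synonymous changes favored
Q_NONCODING = rate_matrix(syn=0.4, nonsyn=0.4)  # all changes equally likely
BRANCH_LENGTHS = [0.2, 0.2, 0.2]

# Each tuple is one alignment column across three species.
CODING_LIKE = [("a1", "a1", "a1"), ("b1", "b2", "b1"), ("a2", "a2", "a1"),
               ("b2", "b2", "b2"), ("a1", "a2", "a1"), ("b1", "b1", "b2"),
               ("a2", "a1", "a2"), ("b2", "b2", "b2")]
NONCODING_LIKE = [("a1", "a1", "a1"), ("b1", "a2", "b1"), ("a2", "a2", "b1"),
                  ("b2", "b2", "b2"), ("a1", "b2", "a1"), ("b1", "b1", "b1"),
                  ("a2", "a2", "a2"), ("b2", "a1", "b2")]

for name, aln in [("coding-like", CODING_LIKE), ("non-coding-like", NONCODING_LIKE)]:
    print(name, log_likelihood_ratio(aln, Q_CODING, Q_NONCODING, BRANCH_LENGTHS))
```

With these toy parameters, the first alignment (only synonymous changes) scores positive and the second (only amino-acid-changing differences) scores negative, mirroring the contrast between panels (B) and (C) of Fig. 1. A realistic implementation would need a full pruning algorithm over the actual tree and 64-state codon matrices.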

