A computational pipeline to discover highly phylogenetically informative genes in sequenced genomes: application to Saccharomyces cerevisiae natural strains.

Ramazzotti M, Berná L, Stefanini I, Cavalieri D - Nucleic Acids Res. (2012)

Bottom Line: The disadvantageous cost-benefit ratio of NGS (the amount of detail disclosed against the time-consuming and expertise-demanding data assembly process) still precludes the application of these techniques to the routine assignment of yeast strains, making the selection of the most reliable molecular markers greatly desirable. We found 13 genes whose variability can be used to recapitulate the phylogeny obtained from genome-wide sequences. The same approach, which proved successful in yeasts, can be generalized to any other population of individuals given the availability of high-quality genomic sequences and of a clear population structure to be targeted.


Affiliation: Department of Preclinical and Clinical Pharmacology, University of Florence, Viale G. Pieraccini 6, 50139 Firenze, Italy.

ABSTRACT
The quest for genes representing genetic relationships of strains or individuals within populations and their evolutionary history is acquiring a novel dimension of complexity with the advancement of next-generation sequencing (NGS) technologies. In fact, sequencing an entire genome uncovers genetic variation in coding and non-coding regions and offers the possibility of studying Saccharomyces cerevisiae populations at the strain level. Nevertheless, the disadvantageous cost-benefit ratio (the amount of detail disclosed by NGS against the time-consuming and expertise-demanding data assembly process) still precludes the application of these techniques to the routine assignment of yeast strains, making the selection of the most reliable molecular markers greatly desirable. In this work we propose an original computational approach to discover genes that can be used as descriptors of the population structure. We found 13 genes whose variability can be used to recapitulate the phylogeny obtained from genome-wide sequences. The same approach, which proved successful in yeasts, can be generalized to any other population of individuals given the availability of high-quality genomic sequences and of a clear population structure to be targeted.


Figure 1. Scheme of the analysis pipeline developed in this work. After the collection of coding sequences from genomic files (with uniform annotation for all the ortholog genes), one per organism, the pipeline converts per-organism files into per-gene files. Multiple sequence alignments are then performed using ClustalW. The alignments can be clipped at the extremities to cope with alternative starts or premature stops and can then, optionally, be subjected to the removal of non-informative loci. The genes are then analyzed, as single entities, combined in doublets, triplets or other groupings, or even all together, using Phylip (in this work with the Kimura two-parameter distance metric followed by neighbor-joining) with optional bootstrap support. Finally, the trees are screened for the presence of predefined clusters, allowing selection of genes or combinations that perfectly satisfy an arbitrary scheme that, in this work, was based on the phylogenetic tree obtained by Liti et al. (12) with a genome-wide survey.
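The first steps in the caption (regrouping per-organism files into per-gene files, and optionally dropping non-informative loci) can be sketched as follows. This is only an illustrative sketch, not the authors' script: the function names `regroup_by_gene` and `informative_columns` are hypothetical, sequences are assumed to be already aligned (the paper uses ClustalW for that step), and "non-informative" is assumed here to mean alignment columns in which every strain carries the same character.

```python
from collections import defaultdict


def regroup_by_gene(per_strain):
    """Turn {strain: {gene: sequence}} into {gene: {strain: sequence}}."""
    per_gene = defaultdict(dict)
    for strain, genes in per_strain.items():
        for gene, seq in genes.items():
            per_gene[gene][strain] = seq
    return dict(per_gene)


def informative_columns(alignment):
    """Keep only the alignment columns that vary among strains.

    `alignment` maps strain name -> aligned sequence, with '-' as the gap
    character, so both SNP and indel columns are retained.
    Assumption: a column is "non-informative" when it is invariant.
    """
    rows = list(alignment.values())
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    keep = [i for i in range(len(rows[0])) if len({r[i] for r in rows}) > 1]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Toy data: three hypothetical strains, one hypothetical gene.
per_strain = {
    "S1": {"geneA": "ACGT-A"},
    "S2": {"geneA": "ACTTAA"},
    "S3": {"geneA": "ACGTAA"},
}
per_gene = regroup_by_gene(per_strain)
reduced = informative_columns(per_gene["geneA"])
```

In the toy data only the third column (a substitution) and the fifth column (an indel) differ among strains, so `reduced` keeps two characters per strain.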

Mentions: We developed an automated pipeline to construct and test complete phylogenetic analyses of genes or gene combinations, using either full-length sequences or only the polymorphic/indel sites (SNPs/indels). Specifically, we tested the phylogenetic relationships of 39 available strains of S. cerevisiae. Given a set of complete known genomic sequences, the script was designed to (i) generate a multi-fasta file for each gene, (ii) calculate the different combinations of genes (one, two or three), (iii) perform the nucleotide alignments, (iv) identify the SNPs/indels, (v) use distance and neighbor-joining methods to compute the phylogenetic tree using the SNPs/indels or the full-length gene or gene combination, (vi) perform the bootstrap analysis, and finally (vii) identify the phylogenetic trees that satisfy the evolutionary relationships proposed by the sequence of the entire genomes (see a schematic representation of the pipeline in Figure 1).
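Steps (ii), (v) and (vii) above can be illustrated with a small sketch. It is not the published pipeline, which delegates distances and tree building to Phylip: all function names here are hypothetical, the distance is the standard Kimura two-parameter formula d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)] with P and Q the proportions of transitions and transversions, and trees are assumed to be given as nested tuples of strain names rather than as Newick files. Because neighbor-joining trees are unrooted, a cluster is accepted when either the strain set or its complement forms a subtree.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def gene_combinations(genes, max_size=3):
    """Yield every combination of 1..max_size genes (singles, doublets, triplets)."""
    for k in range(1, max_size + 1):
        yield from combinations(sorted(genes), k)


def k2p_distance(a, b):
    """Kimura two-parameter distance between two aligned sequences.

    Columns with gaps or ambiguous bases are skipped (an assumption of this
    sketch). math.log raises ValueError for saturated sequence pairs.
    """
    transitions = transversions = compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p, q = transitions / compared, transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def leaves(node):
    """Set of strain names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def has_cluster(tree, members):
    """True if `members` (or its complement) is exactly the leaf set of a subtree."""
    target, everything = frozenset(members), leaves(tree)

    def walk(node):
        found = leaves(node)
        if found == target or everything - found == target:
            return True
        return not isinstance(node, str) and any(walk(child) for child in node)

    return walk(tree)


# Toy example with hypothetical strain names.
tree = ((("A", "B"), "C"), ("D", "E"))
combos = list(gene_combinations(["g2", "g1", "g3"], max_size=2))
d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")  # one transition in ten sites
```

A gene or combination would then be retained only if `has_cluster` returns true for every predefined cluster of the reference scheme.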

