REFGEN and TREENAMER: automated sequence data handling for phylogenetic analysis in the genomic era.

Leonard G, Stevens JR, Richards TA - Evol. Bioinform. Online (2009)

Bottom Line: Unfortunately, most phylogenetic analysis programs do not tolerate the sequence naming conventions of genome databases. This facility enables the user to keep track of the branching order of the sequences/taxa during multiple tree calculations and re-optimisations. Together these programs drastically reduce the time required for managing sequence alignments and labelling phylogenetic figures.


Affiliation: Centre for Eukaryotic Evolutionary Microbiology, School of Biosciences, University of Exeter, Geoffrey Pope Building, Exeter, EX4 4QD, U.K.

ABSTRACT
The phylogenetic analysis of nucleotide sequences, and increasingly that of amino acid sequences, is used to address a number of biological questions. Access to extensive datasets, including numerous genome projects, means that standard phylogenetic analyses can include many hundreds of sequences. Unfortunately, most phylogenetic analysis programs do not tolerate the sequence naming conventions of genome databases. Managing large numbers of sequences and standardizing sequence labels for use in phylogenetic analysis programs can be a time-consuming and laborious task. Here we report the availability of an online resource for the management of gene sequences recovered from public access genome databases such as GenBank. These web utilities include the facility for renaming every sequence in a FASTA alignment file, with each sequence label derived from a user-defined combination of the species name and/or database accession number. This facility enables the user to keep track of the branching order of the sequences/taxa during multiple tree calculations and re-optimisations. After phylogenetic analysis, these webpages can then be used to rename every label in the resulting tree files (with a user-defined combination of species name and/or database accession number). Together these programs drastically reduce the time required for managing sequence alignments and labelling phylogenetic figures. Additional features of our platform include the automatic removal of identical accession numbers (recorded in the report file) and generation of species and accession number lists for use in supplementary materials or figure legends.


Figure 1: REFGEN conversion of FASTA files for use in phylogenetic programs. A) Snapshot of CPN60 alignment. Sequences are derived from GenBank and the DOE JGI Phytophthora ramorum databases (please note that although the DOE JGI sequence does not conform to the long identification line format, it is accommodated by REFGEN). All CPN60 sequences are curtailed after the first 70 amino acid positions for the purpose of this figure. Note the long database identifier lines given to each sequence. B) Screenshot of REFGEN formatting options. C) Output from REFGEN, with sequence labels now compatible with all phylogenetic programs and ready for analysis.

Prior to any phylogenetic analysis, it is often necessary to conduct a series of BLAST searches to sample gene sequences from GenBank and other online databases (e.g. the Department of Energy Joint Genome Initiative, http://genome.jgi-psf.org) and then to collect the returned FASTA-formatted sequences in a single alignment file. As is standard in GenBank and other data repositories, each database sequence carries a long identification line (Fig. 1A), also known as the header or definition line. Each identification line records the accession number and the species name (species names are generally contained within square brackets).

Our first program, REFGEN (http://www.exeter.ac.uk/ceem/refgen.html), takes an uploaded FASTA sequence alignment file and converts each long identification line, according to a set of renaming rules, into a REFGEN ID (Fig. 1C). The REFGEN ID is a combination of the taxonomic name and/or the accession number (Fig. 1B). Where this generates two or more identical REFGEN IDs, the last character of each identical ID is removed and A, B, C, etc. is appended in its place. Where no species name exists in the long identification line, text from within the parentheses is used, and where this is not found the first two words in the locus are recorded. REFGEN then outputs a FASTA sequence file with the new shortened names (REFGEN IDs), which are compatible with all phylogenetic analysis programs.

The great majority of GenBank protein sequences have the standard header/definition line format; this is also the case, to a lesser extent, for GenBank nucleotide sequences. However, there are exceptions, and we have therefore designed the program so that any sequence data that do not follow this format (e.g. many JGI sequences) are accommodated by being appended untouched to the end of the results file for direct user evaluation.
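
The renaming workflow described above can be illustrated with a minimal Python sketch. This is not the REFGEN implementation: the header pattern (`HEADER`), the label scheme in `short_id` (three letters of genus, three of species, then the digits of the accession), the 10-character limit (`max_len`) and all function names are assumptions made for illustration only. The sketch covers just three of the behaviours described in the text: building a short ID from the species name and accession number, replacing the last character of identical IDs with A, B, C, etc., and appending non-conforming records untouched to the end of the output.

```python
import re
from collections import defaultdict
from string import ascii_uppercase

# Assumed header shape (hypothetical, for illustration):
#   >gi|<number>|<db>|<accession>| description [Genus species]
HEADER = re.compile(r"^>gi\|\d+\|\w+\|([^|\s]+)\|.*\[([^\]]+)\]\s*$")

def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            records[-1][1] += line  # join wrapped sequence lines
    return [(header, seq) for header, seq in records]

def short_id(header, max_len=10):
    """Build a short label, or return None if the header does not match HEADER.

    The label scheme is an assumption, not REFGEN's actual renaming rules.
    """
    match = HEADER.match(header)
    if match is None:
        return None
    accession, species = match.group(1), match.group(2)
    prefix = "".join(word[:3].capitalize() for word in species.split()[:2])
    digits = re.sub(r"\D", "", accession.split(".")[0])
    return (prefix + digits)[:max_len]

def disambiguate(labels):
    """Replace the last character of identical labels with A, B, C, etc.

    Simplification: handles at most 26 identical labels per group and does not
    re-check the new labels against the other labels in the file.
    """
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    result = list(labels)
    for label, indices in groups.items():
        if len(indices) > 1:
            for letter, index in zip(ascii_uppercase, indices):
                result[index] = label[:-1] + letter
    return result

def rename(text):
    """Rename conforming records; append non-conforming ones untouched at the end."""
    renamed, untouched = [], []
    for header, seq in parse_fasta(text):
        label = short_id(header)
        if label is None:
            untouched.append((header, seq))
        else:
            renamed.append((label, seq))
    labels = disambiguate([label for label, _ in renamed])
    lines = []
    for label, (_, seq) in zip(labels, renamed):
        lines += [">" + label, seq]
    for header, seq in untouched:
        lines += [header, seq]
    return "\n".join(lines) + "\n"

# Made-up example records (not real database entries).
SAMPLE = """>gi|111|ref|NP_000001.1| chaperonin [Homo sapiens]
MKV
>gi|222|ref|NP_000001.2| chaperonin [Homo sapiens]
MKL
>jgi|Phyra1|12345|some_model
MAA
"""

if __name__ == "__main__":
    print(rename(SAMPLE), end="")
```

With the made-up sample above, the two GenBank-style records both map to the same short label and are therefore distinguished by a trailing A and B, while the JGI-style record, which does not match the assumed header pattern, is written unchanged at the end of the output.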

