EasyCluster: a fast and efficient gene-oriented clustering tool for large-scale transcriptome data.

Picardi E, Mignone F, Pesole G - BMC Bioinformatics (2009)

Bottom Line: Subsequently, it parses results creating an initial collection of pseudo-clusters by grouping ESTs according to the overlap of their genomic coordinates on the same strand. Additional datasets including the Unigene cluster Hs.122986 and ESTs related to the human HOXA gene family have also been used to demonstrate the better clustering capability of EasyCluster over current genome-based web service tools such as ASmodeler and BIPASS. EasyCluster has also been used to provide a first compilation of gene-oriented clusters in the Ricinus communis oilseed plant for which no Unigene clusters are yet available, as well as an evaluation of the alternative splicing in this plant species.


Affiliation: Dipartimento di Biochimica e Biologia Molecolare E. Quagliariello, Università degli Studi di Bari, 70126 Bari, Italy. e.picardi@biologia.uniba.it

ABSTRACT

Background: ESTs and full-length cDNAs represent an invaluable source of evidence for inferring reliable gene structures and discovering potential alternative splicing events. In newly sequenced genomes, these tasks may not be practicable owing to the lack of appropriate training sets. However, when expression data are available, they can be used to build EST clusters related to specific genomic transcribed loci. Common strategies recently employed to this end are based on sequence similarity between transcripts and can lead, in specific conditions, to inconsistent and erroneous clustering. In order to improve the cluster building and facilitate all downstream annotation analyses, we developed a simple genome-based methodology to generate gene-oriented clusters of ESTs when a genomic sequence and a pool of related expressed sequences are provided. Our procedure has been implemented in the software EasyCluster and takes into account the spliced nature of ESTs after an ad hoc genomic mapping.

Methods: EasyCluster uses the well-known GMAP program in order to perform a very quick EST-to-genome mapping in addition to the detection of reliable splice sites. Given a genomic sequence and a pool of ESTs/FL-cDNAs, EasyCluster starts building genomic and EST local databases and runs GMAP. Subsequently, it parses results creating an initial collection of pseudo-clusters by grouping ESTs according to the overlap of their genomic coordinates on the same strand. In the final step, EasyCluster refines the clustering by again running GMAP on each pseudo-cluster and groups together ESTs sharing at least one splice site.
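The pseudo-clustering step described in the Methods can be illustrated with a minimal Python sketch. It is not EasyCluster's actual code: the Alignment record, the pseudo_clusters function and the example coordinates are hypothetical, and the sketch assumes a single genomic sequence with each EST already reduced to one strand and one genomic span (in the real tool these would come from parsed GMAP output).

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record for one EST-to-genome alignment."""
    est_id: str
    strand: str  # "+" or "-"
    start: int   # leftmost genomic coordinate (inclusive)
    end: int     # rightmost genomic coordinate (inclusive)


def pseudo_clusters(alignments):
    """Group ESTs whose genomic spans overlap on the same strand."""
    by_strand = defaultdict(list)
    for aln in alignments:
        by_strand[aln.strand].append(aln)

    clusters = []
    for strand in sorted(by_strand):
        current, current_end = [], None
        # Sweep left to right; extend the open cluster while spans overlap.
        for aln in sorted(by_strand[strand], key=lambda a: (a.start, a.end)):
            if current and aln.start <= current_end:
                current.append(aln.est_id)
                current_end = max(current_end, aln.end)
            else:
                if current:
                    clusters.append(current)
                current, current_end = [aln.est_id], aln.end
        if current:
            clusters.append(current)
    return clusters


# Made-up coordinates, for illustration only.
example = [
    Alignment("est1", "+", 100, 500),
    Alignment("est2", "+", 450, 900),
    Alignment("est3", "+", 2000, 2600),
    Alignment("est4", "-", 300, 700),
]
print(pseudo_clusters(example))
```

In this toy input, est1 and est2 overlap on the plus strand and fall into one pseudo-cluster, whereas est4 stays separate because it maps to the opposite strand even though its coordinates overlap.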

Results: The higher accuracy of EasyCluster with respect to other clustering tools has been verified by means of a manually curated benchmark of human EST clusters. Additional datasets including the Unigene cluster Hs.122986 and ESTs related to the human HOXA gene family have also been used to demonstrate the better clustering capability of EasyCluster over current genome-based web service tools such as ASmodeler and BIPASS. EasyCluster has also been used to provide a first compilation of gene-oriented clusters in the Ricinus communis oilseed plant for which no Unigene clusters are yet available, as well as an evaluation of the alternative splicing in this plant species.


Figure 1: Graphical overview of the EasyCluster algorithm and workflow. In EasyCluster, genomic and EST sequences are initially used to build local databases. Next, GMAP is used to produce EST-to-genome alignments, and the results are parsed to build a first round of pseudo-clusters according to overlapping coordinates. For each pseudo-cluster, a refinement procedure generates the final clusters by taking exon/intron boundaries into account. Before results are reported, the prediction of alternative splicing events per cluster can optionally be requested.
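The refinement step, in which ESTs sharing at least one splice site are grouped together, can be sketched as a union-find pass over splice-site positions. This is an illustrative assumption rather than the published implementation: the function name, the input layout (a dictionary from EST identifier to a set of genomic splice-site coordinates) and the example values are hypothetical. Merging transitively and leaving ESTs without splice sites as singletons are also choices made only for this sketch.

```python
def refine_by_splice_sites(splice_sites):
    """Merge the ESTs of one pseudo-cluster that share a splice site.

    splice_sites: dict mapping EST id -> set of genomic splice-site positions.
    """
    parent = {est: est for est in splice_sites}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # splice-site position -> first EST observed with it
    for est, sites in splice_sites.items():
        for site in sites:
            if site in first_seen:
                union(first_seen[site], est)
            else:
                first_seen[site] = est

    groups = {}
    for est in splice_sites:
        groups.setdefault(find(est), []).append(est)
    return sorted(sorted(group) for group in groups.values())


# Made-up splice-site coordinates, for illustration only.
sites = {
    "est1": {500, 800},
    "est2": {800, 1200},
    "est3": {3000},
    "est4": set(),  # unspliced EST: stays alone in this sketch
}
print(refine_by_splice_sites(sites))
```

Here est1 and est2 share the splice site at position 800 and end up in the same final cluster, while est3 and est4 remain on their own.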

Mentions: The algorithm implemented in EasyCluster has new and unique features that improve the clustering process and, at the same time, make it easier for researchers without advanced bioinformatics skills to generate gene-oriented clusters. EasyCluster can, in fact, be used interactively by providing only two FASTA files containing the genomic and EST sequences, respectively. The main steps of the algorithm are shown in the flow chart in Figure 1 and can be summarized in the following six points:

