A pipeline for the de novo assembly of the Themira biloba (Sepsidae: Diptera) transcriptome using a multiple k-mer length approach.

Melicher D, Torson AS, Dworkin I, Bowsher JH - BMC Genomics (2014)

Bottom Line: We assembled transcriptomes from an additional three non-model organisms to demonstrate that our pipeline assembled a higher-quality transcriptome than single k-mer approaches across multiple species. The pipeline we have developed for assembly and analysis increases contig length, recovers unique transcripts, and assembles more base pairs than other methods through the use of a meta-assembly. The T. biloba transcriptome is a critical resource for performing large-scale RNA-Seq investigations of gene expression patterns, and is the first transcriptome sequenced in this Dipteran family.


Affiliation: Department of Biological Sciences, North Dakota State University, 1340 Bolley Drive, 218 Stevens Hall, Fargo, ND 58102, USA. Dacotah.Melicher@ndsu.edu.

ABSTRACT

Background: The Sepsidae family of flies is a model for investigating how sexual selection shapes courtship and sexual dimorphism in a comparative framework. However, like many non-model systems, there are few molecular resources available. Large-scale sequencing and assembly have not been performed in any sepsid, and the lack of a closely related genome makes investigation of gene expression challenging. Our goal was to develop an automated pipeline for de novo transcriptome assembly, and to use that pipeline to assemble and analyze the transcriptome of the sepsid Themira biloba.

Results: Our bioinformatics pipeline uses cloud computing services to assemble and analyze the transcriptome with off-site data management, processing, and backup. It uses a multiple k-mer length approach combined with a second meta-assembly to extend transcripts and recover more bases of transcript sequences than standard single k-mer assembly. We used 454 sequencing to generate 1.48 million reads from cDNA generated from embryo, larva, and pupae of T. biloba and assembled a transcriptome consisting of 24,495 contigs. Annotation identified 16,705 transcripts, including those involved in embryogenesis and limb patterning. We assembled transcriptomes from an additional three non-model organisms to demonstrate that our pipeline assembled a higher-quality transcriptome than single k-mer approaches across multiple species.

Conclusions: The pipeline we have developed for assembly and analysis increases contig length, recovers unique transcripts, and assembles more base pairs than other methods through the use of a meta-assembly. The T. biloba transcriptome is a critical resource for performing large-scale RNA-Seq investigations of gene expression patterns, and is the first transcriptome sequenced in this Dipteran family.


Figure 2: BLAST strategy to identify unique transcripts. Unique transcripts in each individual assembly were identified by reserving the contigs from one assembly and pooling all contigs from the remaining assemblies. The contigs from the single assembly were aligned to the pooled contigs. Contigs that failed to align were considered unique to that single assembly. The unique contigs were annotated by aligning them to the D. melanogaster transcriptome.

Mentions: The T. biloba sequence data were used to generate assemblies with k-mer lengths of 17, 19, 21, 23, 25, 27, 29, and 31 base pairs. To demonstrate that assemblies with different k-mer lengths recover unique transcripts, the stand-alone BLAST algorithm was used to align the contigs from each assembly to a pool of contigs from all other assemblies, with the resulting unaligned contigs representing those unique to one assembly (Figure 2). For example, to determine the number of contigs unique to the K17 assembly, the K17 contigs were aligned with BLAST against the pooled contigs from all other assemblies. If a contig did not align, it was considered unique to the K17 assembly. Contigs shorter than 200 base pairs were discarded. Next, BLAST was performed against D. melanogaster to annotate the unique contigs, and only those contigs with orthology to D. melanogaster were reported (Table 2). After the initial analysis, the pooled assemblies were also annotated using the D. melanogaster transcriptome to generate a total number of transcripts for the pool, to which the number of unique transcripts could be compared (Table 2). A substantial number of transcripts were represented in only one of the single k-mer length assemblies (Table 2). In total, 2,296 transcripts were identified as unique to a specific assembly using BLAST analysis. For k-mer lengths 17–27, unique transcripts were approximately 2% of each assembly, and this percentage did not decrease with increasing k-mer length. However, at K29, unique transcripts decreased to only 0.8% of the total. The number of unique transcripts generated from this analysis is a low estimate because it contains only conserved Drosophila orthologs, and excludes transcripts unique to T. biloba and those too divergent to be identified by BLAST. Therefore, the number of unique transcripts recovered from different k-mer assemblies is likely higher.
Our analysis confirms that restricting assemblies to only a single k-mer length limits the number of transcripts recovered, regardless of which k-mer length is chosen.
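
The leave-one-out comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names (`unique_contigs`, `toy_aligns`) are hypothetical, the alignment test is a simple substring check standing in for a real BLAST search, and the length filter is applied to the query contigs only, which is one possible reading of the 200 base pair cutoff.

```python
from typing import Callable, Dict, List

# One assembly maps contig IDs to nucleotide sequences.
Assembly = Dict[str, str]


def toy_aligns(query: str, pool: List[str]) -> bool:
    """Hypothetical stand-in for a BLAST hit: exact containment either way.

    A real analysis would run BLAST and apply an e-value or identity cutoff.
    """
    return any(query in subject or subject in query for subject in pool)


def unique_contigs(
    assemblies: Dict[int, Assembly],
    aligns: Callable[[str, List[str]], bool] = toy_aligns,
    min_length: int = 200,
) -> Dict[int, List[str]]:
    """For each k-mer length, list contig IDs with no match in the other assemblies."""
    unique: Dict[int, List[str]] = {}
    for k, contigs in assemblies.items():
        # Pool the contigs of every assembly except the one held out.
        pool = [
            seq
            for other_k, other in assemblies.items()
            if other_k != k
            for seq in other.values()
        ]
        # Keep query contigs that pass the length filter and fail to align.
        unique[k] = [
            contig_id
            for contig_id, seq in contigs.items()
            if len(seq) >= min_length and not aligns(seq, pool)
        ]
    return unique


# Toy data (made up for illustration; real contigs are far longer).
toy = {
    17: {"c1": "ACGTACGTAC", "c2": "TTTTGGGGCC"},
    19: {"c1": "ACGTACGTACGG"},
    21: {"c1": "GGCCAATT", "c2": "ACG"},
}
result = unique_contigs(toy, min_length=5)
print(result)  # {17: ['c2'], 19: [], 21: ['c1']}
```

In this toy run the K17 contig `c2` and the K21 contig `c1` have no counterpart in the other assemblies, so they are reported as unique, while the short K21 contig `c2` is dropped by the length filter before any comparison. The subsequent annotation step against D. melanogaster is not modeled here.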

