VTBuilder: a tool for the assembly of multi isoform transcriptomes.

Archer J, Whiteley G, Casewell NR, Harrison RA, Wagstaff SC - BMC Bioinformatics (2014)

Bottom Line: From the simulated reads, VTBuilder constructed 55 transcripts, 50 of which had a greater than 99% sequence similarity to 48 of the SSTs. Unlike other approaches, VTBuilder strives to maintain the relationships between co-evolving sites within the constructed transcripts, and thus increases transcript utility for a wide range of research areas ranging from transcriptomics to phylogenetics and including the monitoring of drug resistant parasite populations. Additionally, improving the quality of transcripts assembled from read data will have an impact on future studies that query these data.


Affiliation: Department of Parasitology, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool, L3 5QA2, UK. john.archer.jpa@gmail.com.

ABSTRACT

Background: Within many research areas, such as transcriptomics, the millions of short DNA fragments (reads) produced by current sequencing platforms need to be assembled into transcript sequences before they can be utilized. Despite recent advances in assembly software, creating such transcripts from read data harboring isoform variation remains challenging. This is because current approaches fail to identify all variants present or they create chimeric transcripts within which relationships between co-evolving sites and other evolutionary factors are disrupted. We present VTBuilder, a tool for constructing non-chimeric transcripts from read data that has been sequenced from sources containing isoform complexity.

Results: We validated VTBuilder using reads simulated from 54 Sanger sequenced transcripts (SSTs) expressed in the venom gland of the saw-scaled viper, Echis ocellatus. The SSTs were selected to represent genes from major co-expressed toxin groups known to harbor isoform variants. From the simulated reads, VTBuilder constructed 55 transcripts, 50 of which had a greater than 99% sequence similarity to 48 of the SSTs. In contrast, the popular assembler Trinity (r2013-02-25) constructed only 14 transcripts with a similar level of sequence identity, matching just 11 SSTs. Furthermore, VTBuilder produced transcripts with a similar length distribution to the SSTs, while those produced by Trinity were considerably shorter. To demonstrate that our approach can be scaled to real-world data, we assembled the venom gland transcriptome of the African puff adder Bitis arietans using paired-end reads sequenced on Illumina's MiSeq platform. VTBuilder constructed 1481 transcripts from 5 million reads and, following annotation, all major toxin genes were recovered, demonstrating reconstruction of complex underlying sequence and isoform diversity.

Conclusion: Unlike other approaches, VTBuilder strives to maintain the relationships between co-evolving sites within the constructed transcripts, and thus increases transcript utility for a wide range of research areas ranging from transcriptomics to phylogenetics and including the monitoring of drug resistant parasite populations. Additionally, improving the quality of transcripts assembled from read data will have an impact on future studies that query these data. VTBuilder has been implemented in Java and is available, under the GPL GPU V0.3 license, from http://www.lstmed.ac.uk/vtbuilder .

Figure 4: Scaling up to real data. Reads from the venom gland of Bitis arietans were assembled using VTBuilder and annotated using BLAST2GO [59]. (A) Box and whisker plot depicting the length distribution of the constructed transcripts (see Figure 3 for details of whiskers). (B) Transcripts were categorized into four groups: (i) Toxins, (ii) Non-Toxins, (iii) No significant match, and (iv) Bacterial or Viral DNA. (C) The Toxin group in (B) was split into subcategories representing the different protein families present.

To demonstrate the application of our software to real-world data, we sequenced the venom gland transcriptome of the Nigerian puff adder Bitis arietans. Venom glands were dissected and homogenised, total RNA extracted (TRIzol Plus RNA purification kit; Invitrogen), DNase treated (PureLink DNase Set; Invitrogen), and poly(A) selected (Dynabeads mRNA DIRECT purification kit; Life Technologies). Sequencing was performed on the Illumina MiSeq platform with 250 bp paired-end reads, producing 7,114,760 reads in total (Centre for Genomics Research, University of Liverpool). These were processed to remove low quality and unpaired reads, leaving a total of 3,511,257 pairs. After quality filtering, the mean read length was 150 nucleotides. Reads were loaded into both VTBuilder and Trinity for assembly. VTBuilder constructed 1481 transcripts ranging in length from 300 to 5,598 nucleotides (mean length: 751), while Trinity constructed 61,709 transcripts ranging in length from 201 to 8,815 nucleotides (mean length: 440), 31,477 of which were less than 300 nucleotides in length (Additional file 3: Figure S3 and Figure 3A). Transcripts produced by VTBuilder were annotated using BLAST2GO [59] (BlastX; RefSeq Database Release 62, E-value < 10×10⁻⁵) and subsequently sorted into four categories (Figure 4B): (i) toxins, i.e. transcripts homologous to transcripts found in the NCBI database coding for proteins previously identified as toxins. These made up 33.71% of the transcriptome and comprised 101 unique transcripts. Note: SVMP and SP inhibitors have been included within this group. (ii) non-toxins, i.e. transcripts homologous to proteins with no known pathology, e.g. housekeeping genes. These made up 38.02% of the transcriptome and comprised 913 unique transcripts. (iii) no significant match found, i.e. transcripts with no match in the database or where the E-value of the match is > 10×10⁻⁵. These made up 28.17% of the transcriptome and comprised 463 unique transcripts. (iv) bacterial or viral DNA: these made up 0.11% of the transcriptome and comprised 4 unique transcripts. Transcripts defined as toxins were subdivided into protein families (Figure 4C). All major viperid toxin families were accounted for, demonstrating that VTBuilder had accurately reconstructed the underlying transcriptome. Of note are the 101 unique toxin transcripts, which contribute just 6.81% of the total diversity present within the transcriptome (i.e. 101 out of 1481 unique transcripts) but make up 33.71% of the expressed transcriptome. These unique toxin transcripts fall largely into four main toxin families (Table 2), and highlight the importance of distinguishing between isoforms within the underlying data. For example, 31 closely related but unique CTL isoforms were identified, making up 44.87% of the toxins category. Our software demonstrates how NGS data can be exploited to provide a more accurate, high-resolution picture of complex transcriptomes, such as snake venom gland transcriptomes.
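The contrast drawn above between a category's share of unique transcripts (diversity) and its share of the expressed transcriptome can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of VTBuilder: the counts and expression percentages are copied from the text, while the names `CATEGORIES`, `diversity_share`, and `summarize` are hypothetical and chosen only for illustration. Computed directly, 101 of 1481 comes to roughly 6.8%.

```python
# Illustrative sketch only: per-category unique-transcript counts and
# expression shares (%) as reported in the text for the B. arietans assembly.
CATEGORIES = {
    "toxins": (101, 33.71),
    "non-toxins": (913, 38.02),
    "no significant match": (463, 28.17),
    "bacterial or viral DNA": (4, 0.11),
}

# Total number of unique transcripts across all four categories.
total_transcripts = sum(count for count, _ in CATEGORIES.values())


def diversity_share(count, total):
    """Percentage of unique transcripts that fall into one category."""
    return 100.0 * count / total


def summarize():
    """Return (name, count, diversity %, expression %) for each category."""
    return [
        (name, count, diversity_share(count, total_transcripts), expression)
        for name, (count, expression) in CATEGORIES.items()
    ]


if __name__ == "__main__":
    for name, count, diversity, expression in summarize():
        print(
            f"{name:24s} {count:5d} transcripts, "
            f"{diversity:6.2f}% of diversity, {expression:6.2f}% of expression"
        )
```

Laid out this way, the toxin group stands out as the only category whose expression share is several times larger than its diversity share, which is the point the text makes about highly expressed, closely related isoforms.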

