What's in your next-generation sequence data? An exploration of unmapped DNA and RNA sequence reads from the bovine reference individual.

Whitacre LK, Tizioto PC, Kim J, Sonstegard TS, Schroeder SG, Alexander LJ, Medrano JF, Schnabel RD, Taylor JF, Decker JE - BMC Genomics (2015)

Bottom Line: While improvements in alignment algorithms and computational hardware have greatly enhanced the efficiency and accuracy of alignments, a significant percentage of reads often remain unmapped. These species are either not present in the US or are not known to infect taurine cattle and the reference animal appears to have been host to unsequenced sister species. We demonstrate the importance of exploring unmapped reads to ascertain sequences that are either absent or misassembled in the reference assembly and for detecting sequences indicative of parasitic or commensal organisms.


Affiliation: Informatics Institute, University of Missouri, Columbia, MO, 65211, USA. lwhitacre@mail.missouri.edu.

ABSTRACT

Background: Next-generation sequencing projects commonly commence by aligning reads to a reference genome assembly. While improvements in alignment algorithms and computational hardware have greatly enhanced the efficiency and accuracy of alignments, a significant percentage of reads often remain unmapped.

Results: We generated de novo assemblies of unmapped reads from the DNA and RNA sequencing of the Bos taurus reference individual and identified the closest matching sequence to each contig by alignment to the NCBI non-redundant nucleotide database using BLAST. As expected, many of these contigs represent vertebrate sequence that is absent, incomplete, or misassembled in the UMD3.1 reference assembly. However, numerous additional contigs represent invertebrate species. Most prominent were several species of Spirurid nematodes and a blood-borne parasite, Babesia bigemina. These species are either not present in the US or are not known to infect taurine cattle and the reference animal appears to have been host to unsequenced sister species.
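The step of identifying the closest matching sequence for each contig can be illustrated with a minimal sketch. It assumes the BLAST results were written in the standard 12-column tabular format (`-outfmt 6`) and that "closest match" means the hit with the highest bit score; the source confirms neither choice. The function name `best_hits`, the `max_evalue` cutoff, and all contig and subject identifiers in the example are hypothetical.

```python
import csv
import io

# Standard 12 columns of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-5):
    """Return the highest-bitscore hit for each query contig.

    Hits with an E-value above max_evalue are ignored. The cutoff is an
    assumed value for illustration, not the one used in the paper.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        if hit["evalue"] > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Invented example rows (not data from the study).
rows = [
    ["contig_1", "subject_A", "95.0", "200", "10", "0", "1", "200", "1", "200", "1e-80", "300"],
    ["contig_1", "subject_B", "99.0", "250", "2", "0", "1", "250", "5", "254", "1e-120", "450"],
    ["contig_2", "subject_C", "88.5", "150", "17", "0", "1", "150", "10", "159", "1e-30", "120"],
    ["contig_3", "subject_D", "80.0", "40", "8", "0", "1", "40", "1", "40", "0.5", "30"],
]
example = "\n".join("\t".join(r) for r in rows)

hits = best_hits(io.StringIO(example))
for contig, hit in sorted(hits.items()):
    print(contig, hit["sseqid"], hit["pident"])
```

In this invented input, `contig_3` has no hit below the cutoff, so it would be counted among contigs without a significant alignment.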

Conclusions: We demonstrate the importance of exploring unmapped reads to ascertain sequences that are either absent or misassembled in the reference assembly and for detecting sequences indicative of parasitic or commensal organisms.




Fig. 2: Most common alignments from RNA. Bar chart of the ten most common species with significant alignments from the pairwise alignment of de novo assembled contigs from unmapped RNA-seq reads by tissue. Trend line represents the overall median percent identity for each species across tissues.

Mentions: The pairwise alignment of the de novo assembled contigs generated from the unmapped RNA-seq reads to the nt database produced similar results to the alignment of the DNA contigs. Overall, 81 % of the RNA-seq contigs had significant alignments to sequences in the nt database. Across all tissues, Bos taurus produced the largest number of alignments. Also prominent were alignments to Bison bison bison, Bubalus bubalis, and Bos mutus, all species that are closely related to cattle (Fig. 2). Significant BLAST alignments of the RNA-seq unmapped read contigs to cattle or these other closely related species indicate the existence of coding regions that are missing or misassembled in the reference assembly. By mapping the GI number of the most significant BLAST alignment to a gene symbol, we detected alignments to 4412 B. taurus and 4029 B. bison bison, B. bubalis, or B. mutus genes. As the total number of Bos taurus genes reported by Ensembl is 19,994 [8], this suggests that as many as 42.2 % of the bovine protein-coding genes are misassembled (although these misassemblies likely represent a small fraction of total transcriptome base pairs). Additionally, approximately 5 % of RNA alignments failed to map to a gene with an assigned symbol, likely corresponding to unannotated structural or regulatory RNAs. Further results and discussion of these analyses are included in Additional file 2: Note 2.
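The 42.2 % figure can be reproduced from the counts quoted above. The sketch below assumes the two gene counts are simply summed, that is, that they are treated as non-overlapping sets, which is consistent with the "as many as" wording but is not stated explicitly in the excerpt. The variable names are illustrative.

```python
# Counts quoted in the text above.
b_taurus_genes = 4412          # genes with alignments to B. taurus sequences
related_species_genes = 4029   # genes with alignments to bison, water buffalo, or yak
ensembl_total_genes = 19994    # total Bos taurus genes reported by Ensembl

# Assumption: the two sets do not overlap, so the sum is an upper bound.
genes_with_alignments = b_taurus_genes + related_species_genes
percent = round(100 * genes_with_alignments / ensembl_total_genes, 1)

print(genes_with_alignments, percent)
```

If the same gene symbol were recovered from both a cattle and a related-species alignment, the true count would be lower than the sum, which is why the figure is best read as an upper bound.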

