Successful Recovery of Nuclear Protein-Coding Genes from Small Insects in Museums Using Illumina Sequencing.

Kanda K, Pflug JM, Sproul JS, Dasenko MA, Maddison DR - PLoS ONE (2015)

Bottom Line: Although we chose a number of old, small specimens for which we expected low sequence recovery, we successfully recovered at least some low-copy nuclear protein-coding genes from all specimens. Even in the least successfully sequenced carabid specimen, reference-based assembly yielded fragments that were at least 50% of the target length for 34 of 67 nuclear protein-coding gene fragments. A few possible inaccuracies in the sequences were detected, but these rarely affected the phylogenetic placement of the samples.


Affiliation: Department of Integrative Biology, Oregon State University, Corvallis, Oregon, United States of America.

ABSTRACT
In this paper we explore high-throughput Illumina sequencing of nuclear protein-coding, ribosomal, and mitochondrial genes in small, dried insects stored in natural history collections. We sequenced one tenebrionid beetle and 12 carabid beetles ranging in size from 3.7 to 9.7 mm in length that have been stored in various museums for 4 to 84 years. Although we chose a number of old, small specimens for which we expected low sequence recovery, we successfully recovered at least some low-copy nuclear protein-coding genes from all specimens. For example, in one 56-year-old beetle, 4.4 mm in length, our de novo assembly recovered about 63% of approximately 41,900 nucleotides in a target suite of 67 nuclear protein-coding gene fragments, and 70% using a reference-based assembly. Even in the least successfully sequenced carabid specimen, reference-based assembly yielded fragments that were at least 50% of the target length for 34 of 67 nuclear protein-coding gene fragments. Exploration of alternative references for reference-based assembly revealed few signs of bias created by the reference. For all specimens we recovered almost complete copies of ribosomal and mitochondrial genes. We verified the general accuracy of the sequences through comparisons with sequences obtained from PCR and Sanger sequencing, including of conspecific, fresh specimens, and through phylogenetic analysis that tested the placement of sequences in predicted regions. A few possible inaccuracies in the sequences were detected, but these rarely affected the phylogenetic placement of the samples. 
Although our sample sizes are low, an exploratory regression study suggests that the dominant factor in predicting success at recovering nuclear protein-coding genes is a high number of Illumina reads, with success at PCR of COI and killing by immersion in ethanol being secondary factors; in analyses of only high-read samples, the primary significant explanatory variable was body length, with small beetles being more successfully sequenced.


pone.0143929.g007: Recovery success of seven focal genes, with comparison of de novo and reference-based assemblies. For protein-coding genes, values in cells are the fractional recovery of the query sequence (for de novo assemblies) or reference sequence (for reference-based assemblies). Cells are shaded in a gray-scale ramp, with black indicating recovery of 100% of the fragment length and white indicating 0%. For ribosomal genes, values in cells are the fractional recovery of the query sequence (for de novo assemblies), and for reference-based assemblies, values in cells represent the fractional recovery of the assembly relative to the de novo assembly (as opposed to the query or reference sequence). Values less than 1.0 indicate that some bases were missing from the reference-based assembly. A comparison of the de novo assembly sequence to the reference sequence shows that those missing regions are very different between the museum sample and the reference, and thus that region of the reference-based assembly failed. If there are no base differences between the reference-based and the de novo assemblies, the cell is colored using a blue ramp, with pure blue indicating 100% recovery. If there are base differences, the cell is colored red, with the number of base differences shown in parentheses. An asterisk (*) indicates that the sequence so marked is not in the predicted place in the maximum likelihood tree including the DeNovo, FarRef, and NearRef sequences; two asterisks (**) indicate that this placement failure is supported by a bootstrap value of over 50%. “-” indicates that no attempt was made to find this fragment in the assemblies. Four specimens with fewer than 34 million reads have specimen abbreviation and age shown in gray.
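
The two quantities reported in the figure cells, fractional recovery and the number of base differences between assemblies, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the choice to count only A, C, G, and T as recovered bases, and the toy sequences are all assumptions made for illustration, and the sequences are assumed to be already aligned to the target fragment.

```python
# Illustrative sketch only; names and conventions here are hypothetical,
# not taken from the paper's methods.

CALLED = set("ACGT")  # assumption: only unambiguous bases count as recovered


def fractional_recovery(aligned: str, target_length: int) -> float:
    """Fraction of the target fragment covered by called bases."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    called = sum(1 for base in aligned.upper() if base in CALLED)
    return min(called / target_length, 1.0)


def base_differences(seq_a: str, seq_b: str) -> int:
    """Count sites where both aligned sequences have a called base but disagree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in CALLED and b in CALLED and a != b
    )


def count_at_least(recoveries, threshold=0.5):
    """Number of fragments whose fractional recovery meets the threshold."""
    return sum(1 for r in recoveries if r >= threshold)


# Toy 12-base fragment: the de novo contig lacks the last four bases.
de_novo = "ACGTACGTNN--"
reference_based = "ACGTACCTACGT"

print(fractional_recovery(de_novo, 12))
print(base_differences(de_novo, reference_based))
print(count_at_least([0.2, 0.5, 0.9]))
```

Sites where either sequence has a gap or an ambiguous base are skipped when counting differences, so missing data lowers the recovery value without being counted as a disagreement. A summary such as "at least 50% of the target length for 34 of 67 fragments" corresponds to `count_at_least` applied to the per-fragment recovery values.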

Mentions: We recovered the entire target region of 18S and 28S sequences from all de novo assemblies except for Bembidion cf. “Desert Spotted” 3978, which was missing 12 bases at the start of the sequence (Fig 7). Many of the de novo assemblies contained multiple contigs that were returned in BLAST searches as matches for 18S or 28S (S4 Table), but the single contig chosen by our selection procedure in all instances passed our phylogenetic test for accuracy (see below). Reference-based assemblies performed slightly worse than de novo assemblies at recovering ribosomal genes in a few specimens, especially in regions of the genes with extensive insertions and deletions. However, with very few exceptions more than 90% of the length of the ribosomal gene fragments was recovered.

