High-throughput gene and SNP discovery in Eucalyptus grandis, an uncharacterized genome.

Novaes E, Drost DR, Farmerie WG, Pappas GJ, Grattapaglia D, Sederoff RR, Kirst M - BMC Genomics (2008)

Bottom Line: However, it is questionable how effective the sequencing of large numbers of short reads for species with essentially no prior gene sequence information will support contig assemblies and sequence annotation. In providing an abundance of foundational transcript sequences where limited prior genomic information existed, this work created part of the foundation for the annotation of the E. grandis genome that is being sequenced by the US Department of Energy. In addition we demonstrated that SNPs sampled in large-scale with 454 pyrosequencing can be used to detect evolutionary signatures among genes, providing one of the first genome-wide assessments of nucleotide diversity and Ka/Ks for a non-model plant species.

Affiliation: School of Forest Resources and Conservation, University of Florida, PO Box 110410, Gainesville, USA. evandro@ufl.edu

ABSTRACT

Background: Benefits from high-throughput sequencing using 454 pyrosequencing technology may be most apparent for species with high societal or economic value but few genomic resources. Rapid means of gene sequence and SNP discovery using this novel sequencing technology provide a set of baseline tools for genome-level research. However, it is questionable how effective the sequencing of large numbers of short reads for species with essentially no prior gene sequence information will support contig assemblies and sequence annotation.

Results: To generate the first broad survey of gene sequences in Eucalyptus grandis, the most widely planted hardwood tree species, we used 454 technology to sequence and assemble 148 Mbp of expressed sequences (ESTs). ESTs were generated from a normalized cDNA pool composed of multiple tissues and genotypes, promoting discovery of homologues to almost half of Arabidopsis genes and a comprehensive survey of allelic variation in the transcriptome. By aligning the sequencing reads from multiple genotypes, we detected 23,742 SNPs, 83% of which were validated in a sample. Genome-wide nucleotide diversity was estimated for 2,392 contigs using a modified theta (θ) parameter, adapted for measuring genetic diversity from polymorphisms detected by randomly sequencing a multi-genotype cDNA pool. Diversity estimates at non-synonymous nucleotides were on average 4x smaller than at synonymous nucleotides, suggesting purifying selection. The ratio of non-synonymous to synonymous substitutions (Ka/Ks) among 2,001 contigs averaged 0.30, with a distribution skewed to the right, further supporting that most genes are under purifying selection. Comparison of these estimates among contigs identified major functional classes of genes under purifying and diversifying selection, in agreement with previous research.

Conclusion: By providing an abundance of foundational transcript sequences where limited prior genomic information existed, this work created part of the foundation for the annotation of the E. grandis genome that is being sequenced by the US Department of Energy. In addition, we demonstrated that SNPs sampled on a large scale with 454 pyrosequencing can be used to detect evolutionary signatures among genes, providing one of the first genome-wide assessments of nucleotide diversity and Ka/Ks for a non-model plant species.
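
The Results mention a modified theta parameter for nucleotide diversity, but the abstract does not give its form. As a rough point of reference only, the sketch below computes the standard Watterson estimator of theta per site, which is an assumption swapped in here and not the authors' modified estimator. The function name and the input numbers are hypothetical.

```python
def watterson_theta(segregating_sites: int, n_sequences: int, n_sites: int) -> float:
    """Standard Watterson estimator of theta per site.

    NOTE: this is the textbook estimator, used here as a stand-in.
    It is NOT the modified theta described in the abstract.
    """
    if n_sequences < 2:
        raise ValueError("need at least two sequences")
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    # a_n = 1 + 1/2 + ... + 1/(n-1), the usual sample-size correction
    a_n = sum(1.0 / i for i in range(1, n_sequences))
    return segregating_sites / (a_n * n_sites)


# Invented example: 11 polymorphic sites in a 600 bp alignment of 4 sequences.
theta_example = watterson_theta(segregating_sites=11, n_sequences=4, n_sites=600)
print(f"theta per site = {theta_example:.4f}")
```

Computing such an estimate separately for synonymous and non-synonymous sites and taking their ratio would give a comparison of the kind reported in the Results, although the correction the authors applied for a pooled, randomly sequenced cDNA sample is not reproduced here.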

Figure 1: Proportion of E. grandis unigenes with homology to gene models. Proportion of E. grandis unigenes (contigs + singlets) without (-) and with homology to the Arabidopsis (A), Populus (P) and Oryza (O) gene models. (a) Effect of the sequence length on the proportion of homology to gene models (E value 10^-5). (b) Proportion of E. grandis unigenes longer than 100 bp with and without homology to gene models at three different E values (10^-5, 10^-10 and 10^-20).

Mentions: An E. grandis unigene set was generated by combining all 71,384 assembled contigs and 118,722 non-assembled reads (singlets) generated by the three 454 runs. The unigene set was annotated by searching for sequence similarities using BlastX against Arabidopsis (TAIR v. 7.0), Populus (JGI v. 1.1) and Oryza (TIGR v. 5.0) gene models. As expected, the likelihood of finding similarity to previously described gene models is highly dependent on the length of the query sequence (Figure 1a). A logistic regression testing the effect of sequence length on whether or not the query sequence has at least one BlastX hit (E value 10^-5) was highly significant (p-value < 0.00001). For instance, sequences longer than 1000 bp have significant similarity (E value 10^-5) with gene models from all three species in 96% of cases, whereas 88% of sequences shorter than 100 bp have no similarity to any annotated gene model. Among the 118,013 unigenes longer than 100 bp, 38% have similarity to at least one gene model at an E value of 10^-5, 28% at an E value of 10^-10, and 15% at an E value of 10^-20 (Figure 1b). The low proportion of BlastX hits is mainly due to the high frequency of shorter sequences (75th percentile = 252 bp).

