Are differences in genomic data sets due to true biological variants or errors in genome assembly: an example from two chloroplast genomes.

Wu Z, Tembrock LR, Ge S - PLoS ONE (2015)

Bottom Line: Based on the whole genome alignment, GU592209 was more similar to the reference genome (O. sativa: AY522330) with 99.2% sequence identity (SI value) compared with the 98.8% SI value in the KJ830774 genome; whereas the opposite result was obtained when the SI values in coding and noncoding regions of GU592209 and KJ830774 were compared. Thus, we concluded that the genomic composition of GU592209 was heterogeneous in coding and non-coding regions. These findings should impel biologists to carefully consider the quality of sequencing and assembly when working with next-generation data.


Affiliation: State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, China; Department of Biology, Colorado State University, Fort Collins, Colorado, United States of America.

ABSTRACT
DNA sequencing has been revolutionized by the development of high-throughput sequencing technologies. Plummeting costs and the massive throughput capacities of second- and third-generation sequencing platforms have transformed many fields of biological research. Concurrently, new data processing pipelines have made rapid de novo genome assemblies possible. However, high-quality data are critically important for all investigations in the genomic era. We used chloroplast genomes of one Oryza species (O. australiensis) to compare differences in sequence quality: one genome (GU592209) was obtained through Illumina sequencing and reference-guided assembly and the other genome (KJ830774) was obtained via target enrichment libraries and shotgun sequencing. Based on the whole genome alignment, GU592209 was more similar to the reference genome (O. sativa: AY522330) with 99.2% sequence identity (SI value) compared with the 98.8% SI value in the KJ830774 genome; whereas the opposite result was obtained when the SI values in coding and noncoding regions of GU592209 and KJ830774 were compared. Additionally, the junctions between the two single-copy regions and the repeat copies in the chloroplast genome exhibited differences. Phylogenetic analyses were conducted using these sequences, and the different data sets yielded dissimilar topologies: the phylogenetic placements of the two individuals were remarkably different depending on whether whole genome sequences, SNP data, or insertion and deletion (indel) data were used. Thus, we concluded that the genomic composition of GU592209 was heterogeneous in coding and non-coding regions. These findings should impel biologists to carefully consider the quality of sequencing and assembly when working with next-generation data.
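
The contrast between whole-genome and per-region SI values can be illustrated with a minimal sketch. It assumes that SI is the percentage of identical columns in a pairwise alignment, that a gap opposite a base counts as a mismatch, and that columns gapped in both sequences are skipped; the abstract does not state the exact definition used in the study. The function names, the region list, and the toy sequences are hypothetical.

```python
def count_matches(ref_aln, qry_aln, start=0, end=None):
    """Return (matches, compared) over alignment columns [start, end)."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    end = len(ref_aln) if end is None else end
    matches = compared = 0
    for r, q in zip(ref_aln[start:end].upper(), qry_aln[start:end].upper()):
        if r == "-" and q == "-":
            continue  # column is gapped in both sequences; skip it
        compared += 1
        if r == q:
            matches += 1  # a gap opposite a base never matches (assumption)
    return matches, compared


def percent_identity(ref_aln, qry_aln, start=0, end=None):
    """Percent identity over the chosen alignment columns."""
    matches, compared = count_matches(ref_aln, qry_aln, start, end)
    return 100.0 * matches / compared if compared else float("nan")


def identity_by_class(ref_aln, qry_aln, regions):
    """Pool columns per class label, e.g. 'coding' vs 'noncoding'.

    regions: iterable of (label, start, end) in alignment coordinates.
    """
    counts = {}
    for label, start, end in regions:
        m, c = count_matches(ref_aln, qry_aln, start, end)
        prev_m, prev_c = counts.get(label, (0, 0))
        counts[label] = (prev_m + m, prev_c + c)
    return {label: 100.0 * m / c for label, (m, c) in counts.items() if c}


# Toy aligned sequences (hypothetical, not taken from the study).
ref = "ATGCATGC--AT"
qry = "ATGAATGCTTAT"
regions = [("coding", 0, 8), ("noncoding", 8, 12)]
print(percent_identity(ref, qry))            # whole-alignment identity
print(identity_by_class(ref, qry, regions))  # per-class identity
```

Pooling columns per class before dividing, rather than averaging per-region percentages, keeps long regions from being underweighted. A genome can rank higher on the whole-alignment value and lower within a class when the two classes differ in length and divergence.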


Figure 3 (pone.0118019.g003): Identity plot that compares the chloroplast genomes of the two O. australiensis data sets used in this study with O. sativa ssp. japonica (AY522330) as the reference sequence. The vertical scale indicates the percentage of identity, ranging from 50% to 100%. The horizontal axis indicates the base position within the chloroplast genome. Genome regions are color coded as protein-coding, rRNA, tRNA, intron, and conserved noncoding sequences (CNS).

Mentions: First, the mVISTA program [52] was used to show whole genome variation, with O. sativa ssp. japonica (AY522330) as the reference for comparison with the two plastomes (Fig. 3). On the whole, the organization of the plastome was rather conserved between the two individuals, and no translocations or inversions were detected in the architecture of the two genomes. The two IR regions were more conserved than the LSC and SSC regions. However, we found more local variations in O. australiensis (KJ830774) than in O. australiensis (GU592209). For example, two variations in the rpoC2 gene were found in KJ830774 but not in GU592209. Many variations in intergenic regions (ndhC-trnV, rbcL-psaI and others) were found in KJ830774, but no such variations were found in GU592209. The results indicated that the full sequence of GU592209 was more similar to AY522330 and that KJ830774 was more divergent compared with GU592209.
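
An identity plot of this kind can be approximated by computing percent identity in windows along a pairwise alignment against the reference. The sketch below is only an illustration of that idea and is not mVISTA's own algorithm; the function name, the window and step sizes, and the gap handling (a gap opposite a base counts as a mismatch, columns gapped in both sequences are skipped) are assumptions.

```python
def windowed_identity(ref_aln, qry_aln, window=100, step=50):
    """Return [(window_start, percent_identity), ...] along an alignment.

    Coordinates are alignment columns, not ungapped reference positions.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    profile = []
    last_start = max(len(ref_aln) - window, 0)
    for start in range(0, last_start + 1, step):
        cols = zip(ref_aln[start:start + window].upper(),
                   qry_aln[start:start + window].upper())
        # Drop columns that are gapped in both sequences.
        pairs = [(r, q) for r, q in cols if not (r == "-" and q == "-")]
        if pairs:
            matches = sum(r == q for r, q in pairs)
            profile.append((start, 100.0 * matches / len(pairs)))
    return profile


# Toy example (hypothetical sequences): the second window is fully divergent.
print(windowed_identity("AAAAAAAA", "AAAATTTT", window=4, step=4))
```

Running the same function on each query genome against the same reference gives two profiles that can be compared window by window, which is how local dips such as those reported for rpoC2 or the intergenic regions would show up.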

