The Use of Non-Variant Sites to Improve the Clinical Assessment of Whole-Genome Sequence Data.

Ferrarini A, Xumerle L, Griggio F, Garonzi M, Cantaloni C, Centomo C, Vargas SM, Descombes P, Marquis J, Collino S, Franceschi C, Garagnani P, Salisbury BA, Harvey JM, Delledonne M - PLoS ONE (2015)

Bottom Line: The reference genome sequence contains a large number of sites represented by rare alleles, by known pathogenic alleles and by alleles strongly associated with disease by GWAS. Here we show that an alternative analytical approach based on the analysis of both variant and non-variant sites from WGS data makes it possible to genotype more than 92% of sites corresponding to known SNPs, compared to 6% genotyped by standard variant analysis. Altogether, our findings indicate that characterization of both variant and non-variant clinically informative sites in the genome is necessary to allow an accurate clinical assessment of a personal genome.


Affiliation: Functional Genomics Center, Department of Biotechnology, University of Verona, 37134, Verona, Italy.

ABSTRACT
Genetic testing, which is now a routine part of clinical practice and disease management protocols, is often based on the assessment of small panels of variants or genes. On the other hand, continuous improvements in the speed and per-base costs of sequencing have now made whole exome sequencing (WES) and whole genome sequencing (WGS) viable strategies for targeted or complete genetic analysis, respectively. Standard WGS/WES data analytical workflows generally rely on calling sequence variants with respect to the reference genome sequence. However, the reference genome sequence contains a large number of sites represented by rare alleles, by known pathogenic alleles and by alleles strongly associated with disease by GWAS. It is thus critical, for clinical applications of WGS and WES, to determine whether non-variant sites are homozygous for the reference allele or whether the corresponding genotype cannot be reliably called. Here we show that an alternative analytical approach based on the analysis of both variant and non-variant sites from WGS data makes it possible to genotype more than 92% of sites corresponding to known SNPs, compared to 6% genotyped by standard variant analysis. These include homozygous reference sites of clinical interest, thus leading to the broad and comprehensive characterization of variation that is necessary for an accurate evaluation of disease risk. Altogether, our findings indicate that characterization of both variant and non-variant clinically informative sites in the genome is necessary to allow an accurate clinical assessment of a personal genome. Finally, we propose a highly efficient extended VCF (eVCF) file format which stores genotype calls for sites of clinical interest while remaining compatible with current variant interpretation software.
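The distinction the abstract draws, between a site that is confidently homozygous for the reference allele and a site whose genotype simply cannot be called, can be illustrated with a minimal sketch. This is not the authors' pipeline: the `SiteEvidence` type, the `classify_site` function and both thresholds (`min_depth`, `max_alt_fraction`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SiteEvidence:
    """Read counts observed at one genomic position (hypothetical structure)."""
    ref_reads: int  # reads supporting the reference allele
    alt_reads: int  # reads supporting any non-reference allele


def classify_site(ev: SiteEvidence, min_depth: int = 5,
                  max_alt_fraction: float = 0.1) -> str:
    """Label a site as 'no_call', 'hom_ref' or 'variant'.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    depth = ev.ref_reads + ev.alt_reads
    if depth < min_depth:
        # Too little evidence: absence of a variant call says nothing here.
        return "no_call"
    if ev.alt_reads / depth <= max_alt_fraction:
        # Enough reads, and essentially all of them support the reference.
        return "hom_ref"
    return "variant"


if __name__ == "__main__":
    for ev in (SiteEvidence(30, 0), SiteEvidence(2, 0), SiteEvidence(15, 15)):
        print(ev, classify_site(ev))
```

A variant-only workflow would report just the third site and leave the first two indistinguishable, which is the ambiguity the abstract describes.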


Fig 1 (pone.0132180.g001): Exonic regions coverage. Percentage of exonic regions covered at a read depth ≥ 5, an alignment score ≥ 10 and a basecall quality ≥ 10 from WGS subsets of the original full set with different average X-fold coverage values.

We compared WGS sequence data from the NA12892 CEU individual of European ancestry produced using a standard library preparation protocol (STD) and the PCR-free library preparation protocol (PCRFree). The two filtered datasets were normalized to 106 estimated X-fold coverage to allow unbiased comparison. In both cases, more than 90% of the reads were mapped to the hg19 reference sequence (STD 93.0%, PCRFree 98.5%) and the two datasets showed overall good uniformity of coverage (S1 Fig). The PCRFree protocol reduced the number of gaps, defined as regions longer than 10 bp with a low read depth (< 5), low alignment score (Q score < 10) and low basecall quality (Q < 10), from 134,001 to 78,898. This corresponded to a reduction in the number of bases included in gaps from 102.7 Mbp to 54.7 Mbp. The PCRFree dataset also achieved a significant increase in the average read depth and coverage uniformity in regions with high (≥75%) and extreme (≥85%) GC content, and in the presence of repeated AT dinucleotides, with increases ranging from 151% to 664% (S2 and S3 Figs). The typical WGS coverage generated by an Illumina HiSeq X-Ten run is 30–40 X-fold, so we evaluated the percentage of exonic regions covered (read depth > 5, mapping score > 10 and basecall quality score > 10) by datasets with different average coverage values, ranging from 20 to 100 X-fold, generated by sub-sampling the full available dataset. A previous study, based on sequencing of Illumina standard libraries, reported that an average mapped depth of 50 X-fold is required to produce confident genotype calls for >80% of the exome and showed that 40–45 X-fold coverage is needed to detect most SNPs [20]. We found that the 40 X-fold PCRFree dataset covered more than 96.7% of the exonic regions with the given thresholds and that increasing the average genome coverage up to 100 X-fold only increased the percentage of exonic regions covered by 0.81% (Fig 1).
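The gap definition used above (regions longer than 10 bp with low read depth) can be sketched as a scan over per-base depth values. This is an illustration under stated assumptions, not the authors' code: the input is assumed to be a list of per-base depths that already counts only reads passing the alignment and basecall quality filters, and `find_gaps` and `percent_reduction` are hypothetical helper names.

```python
from typing import List, Tuple


def find_gaps(depth: List[int], min_depth: int = 5,
              min_length: int = 10) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where depth stays below
    min_depth for more than min_length consecutive bases."""
    gaps = []
    run_start = None  # start of the current low-depth run, if any
    for pos, d in enumerate(depth):
        if d < min_depth:
            if run_start is None:
                run_start = pos
        else:
            if run_start is not None and pos - run_start > min_length:
                gaps.append((run_start, pos))
            run_start = None
    # Close a run that extends to the end of the region.
    if run_start is not None and len(depth) - run_start > min_length:
        gaps.append((run_start, len(depth)))
    return gaps


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from 'before' to 'after', in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # A 12 bp low-depth run (reported) and a 10 bp run (not longer than 10 bp).
    toy = [30] * 5 + [2] * 12 + [30] * 5 + [1] * 10 + [30] * 3
    print(find_gaps(toy))
    print(round(percent_reduction(134001, 78898), 1))  # gap count, STD vs PCRFree
    print(round(percent_reduction(102.7, 54.7), 1))    # gap bases in Mbp
```

Applied to the figures reported in the text, the relative reductions work out to roughly 41% fewer gaps and roughly 47% fewer bases in gaps for the PCRFree dataset.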
Based on these results, we decided to use the PCRFree protocol for library preparation and perform WGS at a target mean coverage of 35–40 X-fold. It is worth noting that a higher mean X-fold coverage might be required to genotype INDELs reliably. Indeed, recent work by Fang et al. showed that 60 X-fold mean coverage is needed to recover 95% of INDELs [21].
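The text does not say how the lower-coverage subsets were drawn from the 106 X-fold dataset. One common approach, shown here purely as an assumption, is to keep each read independently with probability equal to the ratio of target to full coverage. The function names and the fixed seed are hypothetical.

```python
import random
from typing import List, Sequence


def subsample_fraction(full_coverage: float, target_coverage: float) -> float:
    """Fraction of reads to keep to go from full to target mean coverage."""
    if not 0 < target_coverage <= full_coverage:
        raise ValueError("target coverage must be in (0, full_coverage]")
    return target_coverage / full_coverage


def subsample_reads(reads: Sequence[str], fraction: float,
                    seed: int = 0) -> List[str]:
    """Keep each read independently with the given probability.

    A fixed seed makes the subset reproducible between runs.
    """
    rng = random.Random(seed)
    return [r for r in reads if rng.random() < fraction]


if __name__ == "__main__":
    frac = subsample_fraction(106, 40)  # about 0.38 of the reads
    reads = [f"read_{i}" for i in range(10_000)]
    kept = subsample_reads(reads, frac)
    print(f"fraction={frac:.3f}, kept {len(kept)} of {len(reads)} reads")
```

Random selection of this kind preserves the relative coverage profile on average, so the subsets remain comparable across coverage levels.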

