A bioinformatic filter for improved base-call accuracy and polymorphism detection using the Affymetrix GeneChip whole-genome resequencing platform.

Pandya GA, Holmes MH, Sunkara S, Sparks A, Bai Y, Verratti K, Saeed K, Venepally P, Jarrahi B, Fleischmann RD, Peterson SN - Nucleic Acids Res. (2007)

Bottom Line: While this specificity in the target DNA population reduces the potential for artifacts caused by cross-hybridization, the subsampling of the query genome limits the sequence coverage that can be obtained and therefore reduces the technique's resolution as a genotyping method. A set of bioinformatic filters that targeted systematic base-calling errors caused by cross-hybridization between the whole-genome sample and the array probes and by deletions in the sample DNA relative to the chip reference sequence were developed. Our approach eliminated 91% of the false-positive single-nucleotide polymorphism calls identified in the SCHU S4 query sample, at the cost of 10.7% of the true positives, yielding a total base-calling accuracy of 99.992%.


Affiliation: Pathogen Functional Genomics Resource Center, The Institute for Genomic Research at the J. Craig Venter Institute, Rockville, MD 20850, USA.

ABSTRACT
DNA resequencing arrays enable rapid acquisition of high-quality sequence data. This technology represents a promising platform for rapid high-resolution genotyping of microorganisms. Traditional array-based resequencing methods have relied on the use of specific PCR-amplified fragments from the query samples as hybridization targets. While this specificity in the target DNA population reduces the potential for artifacts caused by cross-hybridization, the subsampling of the query genome limits the sequence coverage that can be obtained and therefore reduces the technique's resolution as a genotyping method. We have developed and validated an Affymetrix Inc. GeneChip® array-based, whole-genome resequencing platform for Francisella tularensis, the causative agent of tularemia. A set of bioinformatic filters that targeted systematic base-calling errors caused by cross-hybridization between the whole-genome sample and the array probes and by deletions in the sample DNA relative to the chip reference sequence were developed. Our approach eliminated 91% of the false-positive single-nucleotide polymorphism calls identified in the SCHU S4 query sample, at the cost of 10.7% of the true positives, yielding a total base-calling accuracy of 99.992%.



Figure 2: ROC curve showing the effect of different delta binding energy threshold values on the true positive and false positive rates. The values on the line graph are the delta energy values.
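A threshold sweep of the kind summarized in Figure 2 can be sketched as follows. This is only an illustration under stated assumptions: each SNP call is assumed to carry a delta binding energy (kcal/mol) and a validated true/false label, a call is assumed to be retained when its delta energy exceeds the threshold, and the rates are computed over retained calls. The function name `roc_points` and the input layout are hypothetical, not taken from the article.

```python
from typing import List, Sequence, Tuple


def roc_points(
    calls: Sequence[Tuple[float, bool]],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, true_positive_rate, false_positive_rate) triples.

    calls: (delta_energy, is_true_snp) pairs; hypothetical input layout.
    A call is retained when delta_energy > threshold (assumed convention).
    """
    n_true = sum(1 for _, is_true in calls if is_true)
    n_false = len(calls) - n_true
    if n_true == 0 or n_false == 0:
        raise ValueError("need at least one true and one false SNP call")

    points = []
    for thr in thresholds:
        kept = [is_true for delta, is_true in calls if delta > thr]
        tp = sum(kept)            # retained calls that are real SNPs
        fp = len(kept) - tp       # retained calls that are artifacts
        points.append((thr, tp / n_true, fp / n_false))
    return points
```

Plotting the second and third elements of each triple against each other, labelled with the first, would give a curve of the same general form as the one described in the caption.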

Mentions: The alternate homology filter identifies SNP calls that may have arisen as a result of this effect. For each SNP call in the analyzed results, the SNP 25-mer probe sequence is used to search for all perfect, ungapped alignments of at least 13 bases with the LVS genome sequence. The requirement of a minimum alignment length of 13 bases guarantees that the SNP base will be included in all alignments found. The program ExamineSNPs.pl examines the SNP alignments and calculates the binding energies, using the MUMmer package to obtain the sequence alignments (11) and the binding energy calculator from the ArrayOligoSelector package (12) for the binding energy calculations. The alignment representing the highest binding energy is selected and compared with the free energy of binding of the reference 25-mer to its reverse complement. If the difference between these two binding energies is ≤11.5 kcal/mol, the SNP call is assumed to be an artifact of the alternate sequence homology and it is removed from the list of high-confidence SNP calls. The set of SNP calls from the hybridization of a SCHU S4 sample was used to determine the threshold binding energy difference that identifies probable alternate homology artifacts. A delta binding energy threshold of 11.5 kcal/mol was chosen based on the effect of different threshold values on the false-negative and false-positive calls (Figure 2).
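The logic of the alternate homology filter described above can be sketched in Python. This is not ExamineSNPs.pl: the brute-force matcher stands in for MUMmer and searches the forward strand only, and `toy_binding_energy` is a placeholder score (2 per G/C, 1 per A/T) that stands in for the ArrayOligoSelector calculator, so the 11.5 kcal/mol default is not calibrated to it. All function names here are hypothetical. The sketch also assumes the SNP base sits at the centre of the 25-mer (index 12), which is why every window of 13 or more bases contains it.

```python
from typing import Callable, List, NamedTuple


class Hit(NamedTuple):
    probe_start: int   # 0-based start of the match within the probe
    genome_start: int  # 0-based start of the match within the genome
    length: int        # number of consecutive matching bases


def exact_matches(probe: str, genome: str, min_len: int = 13) -> List[Hit]:
    """Return maximal perfect, ungapped matches of at least min_len bases.

    Brute-force scan over alignment diagonals (forward strand only);
    a simple stand-in for MUMmer, not a reimplementation of it.
    """
    hits = []
    n, m = len(probe), len(genome)
    for offset in range(-(n - min_len), m - min_len + 1):
        # probe[i] is aligned with genome[i + offset] on this diagonal
        lo, hi = max(0, -offset), min(n, m - offset)
        run_start = None
        for i in range(lo, hi + 1):
            if i < hi and probe[i] == genome[i + offset]:
                if run_start is None:
                    run_start = i
                continue
            # the run (if any) ended at i; keep it if it is long enough
            if run_start is not None and i - run_start >= min_len:
                hits.append(Hit(run_start, run_start + offset, i - run_start))
            run_start = None
    return hits


def toy_binding_energy(seq: str) -> float:
    """Placeholder stability score, NOT a thermodynamic model."""
    return sum(2.0 if base in "GC" else 1.0 for base in seq)


def is_alternate_homology_artifact(
    snp_probe: str,
    reference_probe: str,
    genome: str,
    energy: Callable[[str], float] = toy_binding_energy,
    threshold: float = 11.5,
    min_len: int = 13,
) -> bool:
    """Flag a SNP call whose probe binds an alternate site almost as well
    as the reference 25-mer binds its own reverse complement."""
    hits = exact_matches(snp_probe, genome, min_len)
    if not hits:
        return False  # no alternate homology found; keep the call
    best_alt = max(
        energy(snp_probe[h.probe_start:h.probe_start + h.length]) for h in hits
    )
    return energy(reference_probe) - best_alt <= threshold
```

Energies are treated here as positive stability scores, so a small difference means the alternate site competes closely with the intended target. Swapping in a real nearest-neighbour calculator only requires passing a different `energy` callable.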

