A bioinformatic filter for improved base-call accuracy and polymorphism detection using the Affymetrix GeneChip whole-genome resequencing platform.

Pandya GA, Holmes MH, Sunkara S, Sparks A, Bai Y, Verratti K, Saeed K, Venepally P, Jarrahi B, Fleischmann RD, Peterson SN - Nucleic Acids Res. (2007)

Bottom Line: While this specificity in the target DNA population reduces the potential for artifacts caused by cross-hybridization, the subsampling of the query genome limits the sequence coverage that can be obtained and therefore reduces the technique's resolution as a genotyping method. A set of bioinformatic filters was developed that targeted systematic base-calling errors caused by cross-hybridization between the whole-genome sample and the array probes and by deletions in the sample DNA relative to the chip reference sequence. Our approach eliminated 91% of the false-positive single-nucleotide polymorphism calls identified in the SCHU S4 query sample, at the cost of 10.7% of the true positives, yielding a total base-calling accuracy of 99.992%.

Affiliation: Pathogen Functional Genomics Resource Center, The Institute for Genomic Research at the J. Craig Venter Institute, Rockville, MD 20850, USA.

ABSTRACT
DNA resequencing arrays enable rapid acquisition of high-quality sequence data. This technology represents a promising platform for rapid high-resolution genotyping of microorganisms. Traditional array-based resequencing methods have relied on the use of specific PCR-amplified fragments from the query samples as hybridization targets. While this specificity in the target DNA population reduces the potential for artifacts caused by cross-hybridization, the subsampling of the query genome limits the sequence coverage that can be obtained and therefore reduces the technique's resolution as a genotyping method. We have developed and validated an Affymetrix Inc. GeneChip® array-based, whole-genome resequencing platform for Francisella tularensis, the causative agent of tularemia. A set of bioinformatic filters was developed that targeted systematic base-calling errors caused by cross-hybridization between the whole-genome sample and the array probes and by deletions in the sample DNA relative to the chip reference sequence. Our approach eliminated 91% of the false-positive single-nucleotide polymorphism calls identified in the SCHU S4 query sample, at the cost of 10.7% of the true positives, yielding a total base-calling accuracy of 99.992%.

Figure 4: Representation of the ‘footprint effect’. Query locations are in bold and mismatches are shown in red. Chip oligonucleotides and sample DNA alignments at SNP location (central 13th position) and SNP location plus two bases are shown.

Mentions: The remaining SNP calls are next put through the footprint effect filter. The occurrence of a real SNP in a query sample results in a destabilizing effect on 25-mers in the immediate vicinity of the SNP. This artifact, called the ‘footprint effect’, is illustrated in Figure 4. The locus on the resequencing chip at which the SNP occurs contains two probes that hybridize perfectly with the sample over the entire 25-base length of the forward- and reverse-complement probes (only the forward strand is shown in the figure). However, at adjacent loci on the chip, which represent base positions near the SNP base, there are no probes that hybridize perfectly with the sample DNA. This is because, in general, the chip design tiles probes based on a single reference sequence, which does not contain the SNP base. As a result, the probes on the chip that represent reference sequence positions within 12 bases of the SNP location will all contain at least a single-base mismatch with the sample DNA. This mismatch decreases the reference probe hybridization intensities and increases the likelihood that an alternate sequence from a second location in the sample DNA will hybridize more strongly to a non-reference probe pair. This results in a mixture of reference calls, SNP calls and no-calls at the loci within 12 base positions adjacent to a genuine SNP, with reference calls predominant. This effect is exacerbated when two genuine SNPs occur within the same 25-base window. The footprint effect, like the alternate homology effect, is expected to be more pronounced in the context of a whole-genome hybridization, because of the larger number of hybridization targets in the sample.
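
The geometry behind the footprint effect can be illustrated with a small sketch. This is not the authors' filter; it only models which tiled loci lose their perfect-match probe when the sample carries one SNP. The function names, the toy sequences, and the simplifications (forward strand only, no indels, sample already aligned to the reference) are all assumptions made for illustration.

```python
PROBE_LEN = 25
HALF = PROBE_LEN // 2  # 12 bases on each side of the central query base


def probes_at(reference, i):
    """Return the four 25-mers tiled at reference position i.

    All four share the reference flanks and differ only at the
    central (13th) base, which is the queried position.
    """
    window = reference[i - HALF : i + HALF + 1]
    return {b: window[:HALF] + b + window[HALF + 1 :] for b in "ACGT"}


def min_mismatches(reference, sample, i):
    """Fewest mismatches between any probe at locus i and the sample."""
    target = sample[i - HALF : i + HALF + 1]
    return min(
        sum(x != y for x, y in zip(probe, target))
        for probe in probes_at(reference, i).values()
    )


def footprint(reference, sample):
    """Loci where no probe matches the sample perfectly."""
    return [
        i
        for i in range(HALF, len(reference) - HALF)
        if min_mismatches(reference, sample, i) > 0
    ]


# Toy data (hypothetical): a 60-base reference and a sample with one SNP.
reference = "ACGT" * 15
snp_pos = 30  # 0-based; the reference base here is 'G'
sample = reference[:snp_pos] + "T" + reference[snp_pos + 1 :]

affected = footprint(reference, sample)
print(affected[0], affected[-1], len(affected))  # 18 42 24
print(snp_pos in affected)                       # False
```

In this toy case the SNP locus itself keeps a perfect-match probe (the one whose central base is the SNP allele), while the 12 loci on either side each carry at least one mismatch, which corresponds to the window described in the text.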

