Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques.

Duitama J, McEwen GK, Huebsch T, Palczewski S, Schulz S, Verstrepen K, Suk EK, Hoehe MR - Nucleic Acids Res. (2011)

Bottom Line: We expanded the haplotypes for NA12878 by combining the current haplotypes with our fosmid-based haplotypes, producing near-to-complete new gold-standard haplotypes containing almost 98% of heterozygous SNPs. This improvement includes notable fractions of disease-related and GWA SNPs. Integrated with other molecular biological data sets, this phase information will advance the emerging field of diploid genomics.

Affiliation: Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany. Jorge.DuitamaCastellanos@biw.vib-kuleuven.be

ABSTRACT
Determining the underlying haplotypes of individual human genomes is an essential, but currently difficult, step toward a complete understanding of genome function. Fosmid pool-based next-generation sequencing allows genome-wide generation of 40-kb haploid DNA segments, which can be phased into contiguous molecular haplotypes computationally by Single Individual Haplotyping (SIH). Many SIH algorithms have been proposed, but the accuracy of such methods has been difficult to assess due to the lack of real benchmark data. To address this problem, we generated whole genome fosmid sequence data from a HapMap trio child, NA12878, for which reliable haplotypes have already been produced. We assembled haplotypes using eight algorithms for SIH and carried out direct comparisons of their accuracy, completeness and efficiency. Our comparisons indicate that fosmid-based haplotyping can deliver highly accurate results even at low coverage and that our SIH algorithm, ReFHap, is able to efficiently produce high-quality haplotypes. We expanded the haplotypes for NA12878 by combining the current haplotypes with our fosmid-based haplotypes, producing near-to-complete new gold-standard haplotypes containing almost 98% of heterozygous SNPs. This improvement includes notable fractions of disease-related and GWA SNPs. Integrated with other molecular biological data sets, this phase information will advance the emerging field of diploid genomics.

Figure 3. Comparison of algorithms for SIH on NA12878 whole genome fosmid sequence data. (A) Adjusted N50 (AN50), which takes into consideration block length and number of phased SNPs but not quality; (B) switch error rate, calculated by comparison with gold-standard trio haplotypes; (C) quality-adjusted N50 (QAN50), which combines measures of completeness and quality; (D) runtimes of each algorithm on this data set (log scale); (E) QAN50 for ReFHap, DGS, FastHare and HapCUT on subsets of the data built by varying the number of fosmid pools considered; (F) QAN50 for ReFHap, DGS, FastHare and HapCUT for different heterozygosity rates obtained by varying the percentage of SNPs considered.
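
To make panel (A) concrete, the following is a minimal sketch of an adjusted-N50-style statistic. The exact AN50 definition is not given in this excerpt, so the formulation here is an assumption: each block's span is scaled by the fraction of SNPs inside it that were phased, and the statistic is the adjusted span at which half of all phased SNPs lie in blocks at least that long. The function name `adjusted_n50` and the `(span_bp, phased_snps, total_snps)` tuple layout are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical block record: (span in bp, phased SNPs, all het SNPs in the span).
Block = Tuple[int, int, int]


def adjusted_n50(blocks: Iterable[Block]) -> float:
    """Adjusted-N50-style statistic (assumed formulation, see lead-in)."""
    # Scale each span by the fraction of SNPs in the block that were phased.
    adjusted = [
        (span * phased / total, phased)
        for span, phased, total in blocks
        if total > 0
    ]
    # Walk blocks from longest to shortest adjusted span.
    adjusted.sort(key=lambda item: item[0], reverse=True)
    half = sum(phased for _, phased in adjusted) / 2
    running = 0
    for adj_span, phased in adjusted:
        running += phased
        if running >= half:
            return adj_span
    return 0.0  # no blocks supplied


if __name__ == "__main__":
    toy_blocks = [(100_000, 40, 50), (40_000, 30, 30), (10_000, 30, 30)]
    print(adjusted_n50(toy_blocks))
```

Because this sketch ignores correctness, it corresponds only to the completeness side of the comparison; a quality-adjusted variant would first have to cut blocks at detected switch errors.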

Mentions: A comparison of all heuristic SIH algorithms across four measures is shown in Figure 3, A–D: (A) AN50, (B) switch error rate, (C) QAN50 (described above) and (D) runtime on our dataset. Using ReFHap, 91.7% of SNPs were phased and the QAN50 block size was 117.8 kb. ReFHap had the lowest switch error rate (1.69%) and the highest QAN50 of the eight SIH algorithms. DGS and FastHare phased about the same number of SNPs as ReFHap but with slightly higher switch error rates (1.82 and 1.74%, respectively). HapCUT, for which we ran 10 iterations, phased slightly more SNPs than any other algorithm: 1068 more than ReFHap (0.06% of input SNPs). HapCUT also covered the largest fraction of the genome (1.82 Gb), measured by adding up the lengths of the blocks in which no switch error can be detected and adjusting for unphased SNPs. ReFHap, FastHare and DGS were close, with 1.8 Gb (1.79 Gb for DGS). However, as expected, HapCUT also had a significantly longer running time than the other methods (Figure 3D). While ReFHap, DGS and FastHare were all able to phase full chromosomes within a few seconds, HapCUT can take hours for a single iteration. This happens because the runtime of the first three methods depends mainly on the number of overlapping fragments in one block, while for HapCUT it depends on the maximum number of SNPs connected in one block. Fosmids are able to connect large numbers of SNPs at low coverage, so algorithms such as ReFHap require significantly fewer computational resources. Chromosome 6 is an extreme case, with HapCUT taking more than 10 h to complete a single iteration compared to 3.29 s for ReFHap; this is mainly due to the large blocks of connected SNPs within the MHC region. As fosmid coverage and the number of heterozygous variants analyzed increase, the number of connected components also increases, making the instances more difficult for HapCUT to solve.
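
The switch error rate reported above can be illustrated with a short sketch. It assumes, as a simplification not taken from the paper, that each block is given as one predicted haplotype and one gold-standard haplotype over the same heterozygous SNPs, encoded as strings of `0`/`1` with `-` for an unphased site. A switch is counted whenever the predicted phase flips relative to the gold standard between two consecutive comparable SNPs. The function names are hypothetical.

```python
from typing import Iterable, Tuple


def count_switches(predicted: str, gold: str) -> Tuple[int, int]:
    """Return (switches, comparable adjacent SNP pairs) for one block."""
    if len(predicted) != len(gold):
        raise ValueError("haplotypes must cover the same SNPs")
    # For each SNP phased in both haplotypes, record whether the alleles agree.
    agree = [p == g for p, g in zip(predicted, gold) if p in "01" and g in "01"]
    # A switch is a change of agreement state between consecutive SNPs, so a
    # block that is the full complement of the gold standard has no switches.
    switches = sum(1 for a, b in zip(agree, agree[1:]) if a != b)
    return switches, max(len(agree) - 1, 0)


def switch_error_rate(blocks: Iterable[Tuple[str, str]]) -> float:
    """Switches divided by comparable adjacent pairs, pooled over all blocks."""
    total_switches = 0
    total_pairs = 0
    for predicted, gold in blocks:
        switches, pairs = count_switches(predicted, gold)
        total_switches += switches
        total_pairs += pairs
    return total_switches / total_pairs if total_pairs else 0.0


if __name__ == "__main__":
    toy = [("0011", "0000"), ("0101", "0101")]
    print(f"{switch_error_rate(toy):.3f}")
```

Pooling switches over blocks, rather than averaging per-block rates, keeps short blocks from dominating the result; whether the paper pools in the same way is not stated in this excerpt.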

