Machine learning approach for pooled DNA sample calibration.

Hellicar AD, Rahman A, Smith DV, Henshall JM - BMC Bioinformatics (2015)

Bottom Line: The approach is tested on SNPs genotyped with the Sequenom iPLEX platform and compared to existing state-of-the-art calibration methods. The new method is capable of reducing the mean square error in allele frequency to half that achievable with existing approaches. This paper demonstrates that improvements in pooled allele frequency estimates result if the genotyping platform is characterised at allele frequencies other than the homozygous and heterozygous cases.

Affiliation: CSIRO Computational Informatics, Castray Esplanade, Hobart, Australia. andrew.hellicar@csiro.au.

ABSTRACT

Background: Despite ongoing reduction in genotyping costs, genomic studies involving large numbers of species with low economic value (such as Black Tiger prawns) remain cost-prohibitive. In this scenario, DNA pooling is an attractive option to reduce genotyping costs. However, genotyping of pooled samples comprising DNA from many individuals is challenging due to the presence of errors that exceed the allele frequency quantisation size and therefore cannot be simply corrected by clustering techniques. The solution to the calibration problem is a correction to the allele frequency to mitigate errors incurred in the measurement process. We highlight the limitations of the existing calibration solutions, such as the fact that they impose assumptions on the variation between allele frequencies 0, 0.5, and 1.0, and address a limited set of error types. We propose a novel machine learning method to address the limitations identified.

Results: The approach is tested on SNPs genotyped with the Sequenom iPLEX platform and compared to existing state-of-the-art calibration methods. The new method is capable of reducing the mean square error in allele frequency to half that achievable with existing approaches. Furthermore, for the first time, we demonstrate the importance of carefully considering the choice of training data when using calibration approaches built from pooled data.

Conclusion: This paper demonstrates that improvements in pooled allele frequency estimates result if the genotyping platform is characterised at allele frequencies other than the homozygous and heterozygous cases. Techniques capable of incorporating such information are described along with aspects of implementation.


Fig. 2: Data cleaning workflow. The steps involve removing all data corresponding to bad SNPs, then removing all SNP results for bad samples, including both individual and pool samples. Finally, low-amplitude detections are removed. The size of the data set is shown next to the arrows. Input data files are shown lightly shaded; output files are dark shaded.

Mentions: Experimental data from Black Tiger prawns (Penaeus monodon) was acquired from the Sequenom iPLEX platform [15]. The raw data, typical of a commercial run, comprised (HA,HB) measurement pairs for 61 SNPs on 1041 individuals. This data was then processed for quality control (Figure 2). A second experiment was conducted in which every step of the genotyping process was carried out as rigorously as possible; however, due to the increased cost and time of the rigorous process, only a smaller set of 47 individuals was genotyped (randomly selected from the 1041 in the larger experiment). The calls from the more accurate experiment were used to rank the SNP accuracy of the larger experiment. In total, 13 SNPs were identified as being inconsistent between the two experiments and were removed, leaving 48 SNPs. The 1041 individuals were ranked in terms of number of available calls. Low-quality samples (<80% calls) were removed from the data set, leaving 850 samples. We also removed 1279 (HA,HB) measurement pairs with values below a threshold (R<1), resulting in 39521 (HA,HB) pairs. Twenty-two pool samples, each containing between 18 and 26 individuals, were created from the 1041 individuals. An individual was in at most one pool sample. Three pools resulted in no data and were removed, leaving 19 pooled samples, of which 11 (HA,HB) pairs were below the minimum signal threshold and were removed, leaving 901 measurement pairs. Because the pools were constructed from individuals selected from the 1041 genotyped individuals, we can calculate the ground truth allele frequencies of these pools using the 39521 individual results. A small fraction (3.4%) of the pool and SNP combinations contained individuals that did not pass quality control; these individuals were therefore not included when calculating pool ground truth.
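
The data-set sizes quoted above are mutually consistent, and the ground-truth calculation can be illustrated with a short Python sketch. This is not the authors' code: the function names are hypothetical, the genotype coding (number of A alleles per individual: 0, 1 or 2) is an assumption, and the frequency calculation assumes each individual contributes equally to the pool.

```python
# Sketch only: function names are hypothetical, and the genotype coding
# (count of A alleles per individual: 0, 1 or 2) is an assumption.

def qc_counts():
    """Recompute the data-set sizes quoted in the text from its own figures."""
    snps_kept = 61 - 13                                  # inconsistent SNPs removed
    samples_kept = 850                                   # individuals kept after the 80% call filter
    individual_pairs = samples_kept * snps_kept - 1279   # low-signal pairs (R < 1) removed
    pools_kept = 22 - 3                                  # pools that returned no data removed
    pool_pairs = pools_kept * snps_kept - 11             # low-signal pool pairs removed
    return snps_kept, individual_pairs, pools_kept, pool_pairs


def pool_truth_frequency(a_allele_counts):
    """Ground-truth A-allele frequency of one pool at one SNP.

    a_allele_counts has one entry per pooled individual: 0, 1 or 2 copies
    of allele A, or None if that individual failed quality control.
    Entries that are None are skipped, as described in the text.
    Assumes equal DNA contribution from every individual in the pool.
    """
    called = [g for g in a_allele_counts if g is not None]
    if not called:
        return None  # no usable individuals for this pool/SNP combination
    return sum(called) / (2 * len(called))


if __name__ == "__main__":
    print(qc_counts())                                   # (48, 39521, 19, 901)
    print(pool_truth_frequency([2, 1, 0, None, 1]))      # 0.5
```

The first function confirms that 850 samples by 48 SNPs, less the 1279 low-signal pairs, gives the 39521 individual pairs, and that 19 pools by 48 SNPs, less 11 pairs, gives the 901 pool pairs.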

