Comparison of methods to detect copy number alterations in cancer using simulated and real genotyping data.

Mosén-Ansorena D, Aransay AM, Rodríguez-Ezpeleta N - BMC Bioinformatics (2012)

Bottom Line: GAP's generally better performance supports the viability of approaches other than the common hidden Markov model (HMM)-based ones. We devised and implemented a comprehensive model to generate data that simulate tumoural samples genotyped using SNP arrays. The validity of the model is supported by the similarity of the results obtained with synthetic and real data.


Affiliation: Genome Analysis Platform, CIC bioGUNE-CIBERehd, Technologic Park of Bizkaia, Building 502, 48160 Derio, Spain. dmosen.gn@cicbiogune.es

ABSTRACT

Background: The detection of genomic copy number alterations (CNA) in cancer based on SNP arrays requires methods that take into account tumour-specific factors such as normal cell contamination and tumour heterogeneity. A number of tools have recently been developed, but their performance has yet to be thoroughly assessed. To this end, a comprehensive model is indispensable: one that integrates normal cell contamination and intra-tumour heterogeneity and that can be translated into synthetic data on which to run benchmarks.

Results: We propose such a model and implement it in an R package called CnaGen to synthetically generate a wide range of alterations under different normal cell contamination levels. Six recently published methods for CNA and loss of heterozygosity (LOH) detection on tumour samples were assessed on these synthetic data and on a dilution series of a breast cancer cell-line: ASCAT, GAP, GenoCNA, GPHMM, MixHMM and OncoSNP. We report the recall rates in terms of normal cell contamination levels and alteration characteristics (length, copy number and LOH state), as well as the false discovery rate distribution for each copy number under different normal cell contamination levels. The assessed methods are in general better at detecting alterations with low copy number and under low normal cell contamination levels. All methods except GPHMM, which failed to recognize the alteration pattern in the cell-line samples, provided similar results for the synthetic and cell-line sample sets. MixHMM and GenoCNA are the worst-performing methods, while GAP generally performed better. This supports the viability of approaches other than the common hidden Markov model (HMM)-based ones.

Conclusions: We devised and implemented a comprehensive model to generate data that simulate tumoural samples genotyped using SNP arrays. The validity of the model is supported by the similarity of the results obtained with synthetic and real data. Based on these results and on the software implementation of the methods, we recommend GAP for advanced users and GPHMM for a fully driven analysis.


Figure 5: Recall rates of CNAs on cell-line samples. Recall rates (y-axis), by normal cell contamination level and copy number (x-axis), with respect to a high-confidence subset of CNAs detected by GAP on the cell-line pure tumour sample. Colour code: GAP (orange), updated GAP (golden), ASCAT (purple), GPHMM (black), OncoSNP (blue), MixHMM (grey), MixHMM with manual parameterization (dashed grey) and GenoCNA (green).
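As a rough illustration of the quantity plotted in Figure 5, the following minimal sketch computes recall per reference copy number from per-probe calls. It assumes a probe-level comparison against a reference copy number track and groups copy numbers 4 and higher into one bin; the function name, the `cap` parameter and the probe-level definition are illustrative assumptions, not necessarily the procedure used in the paper.

```python
from collections import defaultdict


def recall_by_copy_number(reference, called, cap=4):
    """Per-copy-number recall from two per-probe copy number sequences.

    Hypothetical helper: `reference` holds the reference copy number at
    each probe and `called` the copy number a method assigned there.
    Copy numbers of `cap` or more are grouped into a single bin.
    """
    if len(reference) != len(called):
        raise ValueError("reference and called must have the same length")

    totals = defaultdict(int)  # probes per reference copy number bin
    hits = defaultdict(int)    # probes whose call falls in the same bin

    for ref_cn, call_cn in zip(reference, called):
        ref_bin = min(ref_cn, cap)
        call_bin = min(call_cn, cap)
        totals[ref_bin] += 1
        if ref_bin == call_bin:
            hits[ref_bin] += 1

    return {cn: hits[cn] / totals[cn] for cn in sorted(totals)}


if __name__ == "__main__":
    reference = [1, 1, 3, 3, 3, 3, 5, 6]
    called = [1, 2, 3, 3, 4, 3, 4, 7]
    print(recall_by_copy_number(reference, called))
```

Repeating this computation once per contamination level would give one recall curve per level, which is the layout the figure caption describes.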

Mentions: We examined the recall rates at the available contamination levels closest to those in the synthetic data: 0%, 21%, 50% and 77% (see Figure 5). Except at 77% contamination, ASCAT proves to be the best performer together with GAP and the updated GAP, and its recall rates are stable across the different copy numbers. The updated GAP recognizes the pattern under contamination, although, as we saw on synthetic data, this does not always happen. Additionally, both ASCAT and the updated GAP fail to recognize the alteration pattern at 77% contamination, an issue also observed with synthetic data, and the old GAP is the only method that still correctly estimates the LRR shift at that contamination level, a fact that is independent of having selected GAP as a reference. Despite generally good performance, OncoSNP has trouble with copy number 3 regions under contamination, as it had with the synthetic near-triploid samples. Cell-line results also confirm that MixHMM and GenoCNA are unable to correctly recall most high copy number regions and tend to call higher copy numbers as contamination increases. Manually providing the contamination and LRR shift parameters improves MixHMM recall rates, but its high sensitivity to changes in intensity results in overfragmented calls (see Additional file 9). Finally, whereas GPHMM manages to carry out a rather correct breakpoint detection, it fails to estimate the LRR baseline shift at all four contamination levels. Hence, all copy numbers are increased and recall rates are minimal except for copy numbers 4 and higher, given that they are grouped. GPHMM’s results stress the importance of correct LRR baseline shift estimation, whether it is made directly or through ploidy or DNA index estimation. We wish to note that the baseline shift can be correctly estimated (see GPHMM paper [13]) if a PFB [19] with a modified specification is used.
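To see why a wrong LRR baseline shifts every copy number call in the same direction, consider the following minimal sketch. It assumes an idealized mixture model in which the measured signal is proportional to the average copy number of tumour and normal (diploid) cells, and in which LRR is the log2 ratio of that average to a baseline copy number. Real SNP array intensities are compressed relative to this ideal, and none of the assessed tools is claimed to use exactly this formula; the function names and the `baseline_cn` parameter are hypothetical.

```python
import math


def expected_lrr(tumour_cn, contamination, baseline_cn=2.0):
    """Idealized LRR of a region with tumour copy number `tumour_cn`.

    `contamination` is the fraction of normal (copy number 2) cells;
    `baseline_cn` is the copy number that corresponds to LRR = 0.
    """
    mean_cn = contamination * 2.0 + (1.0 - contamination) * tumour_cn
    return math.log2(mean_cn / baseline_cn)


def infer_tumour_cn(lrr, contamination, baseline_cn=2.0):
    """Invert expected_lrr: recover the tumour copy number from an LRR."""
    if not 0.0 <= contamination < 1.0:
        raise ValueError("contamination must be in [0, 1)")
    mean_cn = baseline_cn * 2.0 ** lrr
    return (mean_cn - 2.0 * contamination) / (1.0 - contamination)


if __name__ == "__main__":
    contamination = 0.21
    for true_cn in (1, 2, 3):
        lrr = expected_lrr(true_cn, contamination, baseline_cn=2.0)
        right = infer_tumour_cn(lrr, contamination, baseline_cn=2.0)
        wrong = infer_tumour_cn(lrr, contamination, baseline_cn=3.0)
        print(f"true={true_cn} correct-baseline={right:.2f} "
              f"wrong-baseline={wrong:.2f}")
```

Under these assumptions, inverting with the correct baseline recovers the true copy number, whereas an overestimated baseline inflates every inferred copy number, consistent with the behaviour described above for a method that misestimates the baseline shift.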

