cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate.

Klambauer G, Schwarzbauer K, Mayr A, Clevert DA, Mitterecker A, Bodenhofer U, Hochreiter S - Nucleic Acids Res. (2012)

Bottom Line: Using a Bayesian approach, cn.MOPS decomposes variations in the depth of coverage across samples into integer copy numbers and noise by means of its mixture components and Poisson distributions, respectively. The noise estimate allows for reducing the FDR by filtering out detections having high noise that are likely to be false detections. We compared cn.MOPS with the five most popular methods for CNV detection in NGS data using four benchmark datasets: (i) simulated data, (ii) NGS data from a male HapMap individual with implanted CNVs from the X chromosome, (iii) data from HapMap individuals with known CNVs, (iv) high coverage data from the 1000 Genomes Project. cn.MOPS outperformed its five competitors in terms of precision (1-FDR) and recall for both gains and losses in all benchmark data sets.


Affiliation: Institute of Bioinformatics, Johannes Kepler University, A-4040 Linz, Austria.

ABSTRACT
Quantitative analyses of next-generation sequencing (NGS) data, such as the detection of copy number variations (CNVs), remain challenging. Current methods detect CNVs as changes in the depth of coverage along chromosomes. Technological or genomic variations in the depth of coverage thus lead to a high false discovery rate (FDR), even upon correction for GC content. In the context of association studies between CNVs and disease, a high FDR means many false CNVs, thereby decreasing the discovery power of the study after correction for multiple testing. We propose 'Copy Number estimation by a Mixture Of PoissonS' (cn.MOPS), a data processing pipeline for CNV detection in NGS data. In contrast to previous approaches, cn.MOPS incorporates modeling of depths of coverage across samples at each genomic position. Therefore, cn.MOPS is not affected by read count variations along chromosomes. Using a Bayesian approach, cn.MOPS decomposes variations in the depth of coverage across samples into integer copy numbers and noise by means of its mixture components and Poisson distributions, respectively. The noise estimate allows for reducing the FDR by filtering out detections having high noise that are likely to be false detections. We compared cn.MOPS with the five most popular methods for CNV detection in NGS data using four benchmark datasets: (i) simulated data, (ii) NGS data from a male HapMap individual with implanted CNVs from the X chromosome, (iii) data from HapMap individuals with known CNVs, (iv) high coverage data from the 1000 Genomes Project. cn.MOPS outperformed its five competitors in terms of precision (1-FDR) and recall for both gains and losses in all benchmark data sets. The software cn.MOPS is publicly available as an R package at http://www.bioinf.jku.at/software/cnmops/ and at Bioconductor.

Definitions for the evaluation of copy number detection methods. A genome is split into equally sized evaluation segments of a length shorter than the shortest CNV. Top panel: Knowing the true CNV regions (green), the evaluation segments are labeled as class 1 (CNV segment) or class −1 (non-CNV segment). Middle panel: A CNV detection method classifies each evaluation segment into CNV segments (blue, class 1) and non-CNV segments (class −1). Bottom panel: In the first line, positives (known CNV regions) are divided into true positives (TP, green) and false negatives (FN, red). In the second line, negatives (no overlap with known CNV regions) are divided into true negatives (TN, green) and false positives (FP, red). Segments partly overlapping with known or predicted CNV regions are not considered (‘na’).
© Copyright Policy - creative-commons


Mentions: We assume that the true CNVs are known and to be rediscovered. Each chromosome is split into equally sized evaluation segments, the size of which is chosen to accommodate the shortest known CNV. An evaluation segment is called a true positive (TP) if it is entirely contained both in a true CNV and in a detected CNV segment. It is called a false negative (FN) if it is entirely contained in a true CNV but does not overlap with any predicted CNV segment. An evaluation segment is called a false positive (FP) if it is entirely detected as a CNV segment but does not overlap with any true CNV. Finally, it is called a true negative (TN) if it overlaps neither with a true CNV nor with a detected CNV segment. These definitions imply that all evaluation segments that partly overlap with true CNVs or detected CNV segments are ignored, as the copy numbers in these segments are ambiguous. Figure 3 illustrates the definitions of the four categories of evaluation segments. The two measures we employ hereafter are recall [#TP/(#TP + #FN)] and precision [#TP/(#TP + #FP)]. Note that precision is 1-FDR, in which we are especially interested. A CNV calling threshold governs the trade-off between recall and precision or, in other words, the trade-off between FNs and FPs, because more detected CNVs lead to more FPs but fewer FNs, and vice versa. To assess the performance of methods at different CNV calling thresholds, we use precision-recall curves. Precision-recall curves are independent of the number of TNs, which makes them an ideal tool for our evaluation, as the majority of samples are negatives (non-CNVs).
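The segment labeling and the two measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names, the half-open `(start, end)` interval convention, and the assumption that adjacent or overlapping CNV regions have already been merged are all choices made here for the example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end); a convention assumed here


def contained(seg: Interval, regions: List[Interval]) -> bool:
    """True if seg lies entirely inside a single region."""
    return any(r[0] <= seg[0] and seg[1] <= r[1] for r in regions)


def overlaps(seg: Interval, regions: List[Interval]) -> bool:
    """True if seg shares at least one position with any region."""
    return any(seg[0] < r[1] and r[0] < seg[1] for r in regions)


def classify_segment(seg: Interval, true_cnvs: List[Interval],
                     detected: List[Interval]) -> str:
    """Label one evaluation segment as TP, FN, FP, TN, or 'na'."""
    in_true, in_det = contained(seg, true_cnvs), contained(seg, detected)
    hits_true, hits_det = overlaps(seg, true_cnvs), overlaps(seg, detected)
    if in_true and in_det:
        return "TP"
    if in_true and not hits_det:
        return "FN"
    if in_det and not hits_true:
        return "FP"
    if not hits_true and not hits_det:
        return "TN"
    return "na"  # partial overlap: copy number is ambiguous, so ignore


def evaluate(chrom_length: int, seg_len: int, true_cnvs: List[Interval],
             detected: List[Interval]) -> Dict[str, int]:
    """Split [0, chrom_length) into equal segments and count each label."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0, "na": 0}
    for start in range(0, chrom_length - seg_len + 1, seg_len):
        counts[classify_segment((start, start + seg_len), true_cnvs, detected)] += 1
    return counts


def precision_recall(counts: Dict[str, int]) -> Tuple[float, float]:
    """precision = TP/(TP+FP) = 1 - FDR; recall = TP/(TP+FN)."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Toy example: one true CNV, two detected segments on a 100-position chromosome.
counts = evaluate(100, 10, true_cnvs=[(20, 50)], detected=[(30, 60), (80, 90)])
precision, recall = precision_recall(counts)
```

Neither measure uses the TN count, which is why the large number of non-CNV segments does not affect the result. Repeating the computation for the detections obtained at each CNV calling threshold yields the points of a precision-recall curve.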

