Limits...
Detection of copy number variation from array intensity and sequencing read depth using a stepwise Bayesian model.

Zhang ZD, Gerstein MB - BMC Bioinformatics (2010)

Bottom Line: Copy number variants (CNVs) have been demonstrated to occur at a high frequency and are now widely believed to make a significant contribution to the phenotypic variation in human populations.Sampling from the posterior distribution using a Markov chain Monte Carlo method, we get not only best estimates for these unknown parameters but also Bayesian credible intervals for the estimates.Furthermore, given its Bayesian origin, our method can also be seen as a technique to refine CNVs identified by fast point-estimate methods and also as a framework to integrate array-CGH and sequencing data with other CNV-related biological knowledge, all through informative priors.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Genetics, Albert Einstein College of Medicine, Bronx, NY 10461, USA. zhengdong.zhang@einstein.yu.edu

ABSTRACT

Background: Copy number variants (CNVs) have been demonstrated to occur at a high frequency and are now widely believed to make a significant contribution to the phenotypic variation in human populations. Array-based comparative genomic hybridization (array-CGH) and newly developed read-depth approach through ultrahigh throughput genomic sequencing both provide rapid, robust, and comprehensive methods to identify CNVs on a whole-genome scale.

Results: We developed a Bayesian statistical analysis algorithm for the detection of CNVs from both types of genomic data. The algorithm can analyze such data obtained from PCR-based bacterial artificial chromosome arrays, high-density oligonucleotide arrays, and more recently developed high-throughput DNA sequencing. Treating parameters--e.g., the number of CNVs, the position of each CNV, and the data noise level--that define the underlying data generating process as random variables, our approach derives the posterior distribution of the genomic CNV structure given the observed data. Sampling from the posterior distribution using a Markov chain Monte Carlo method, we get not only best estimates for these unknown parameters but also Bayesian credible intervals for the estimates. We illustrate the characteristics of our algorithm by applying it to both synthetic and experimental data sets in comparison to other segmentation algorithms.

Conclusions: In particular, the synthetic data comparison shows that our method is more sensitive than other approaches at low false positive rates. Furthermore, given its Bayesian origin, our method can also be seen as a technique to refine CNVs identified by fast point-estimate methods and also as a framework to integrate array-CGH and sequencing data with other CNV-related biological knowledge, all through informative priors.

Show MeSH
The ROC curves of array-CGH data analysis methods. (A) The complete plot. All curves were calculated using the same data set. They show that our Bayesian algorithm (black line) is appreciably more sensitive than all other methods (gray lines) at low (< 10%) false positive rates. (B) Details of the ROC curves in the low FPR region of (A) inside the box with dashed border. See Lai et al. [28] for the identities of the gray ROC curves. TPR, the true positive rate; FPR, the false positive rate.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC2992546&req=5

Figure 4: The ROC curves of array-CGH data analysis methods. (A) The complete plot. All curves were calculated using the same data set. They show that our Bayesian algorithm (black line) is appreciably more sensitive than all other methods (gray lines) at low (< 10%) false positive rates. (B) Details of the ROC curves in the low FPR region of (A) inside the box with dashed border. See Lai et al. [28] for the identities of the gray ROC curves. TPR, the true positive rate; FPR, the false positive rate.

Mentions: Lai et al. [28] examined the performance of 11 array-CGH data analysis methods: CGHseg, quantreg, CLAC, GLAD, CBS, HMM, wavelet, lowess, ChARM, GA, and ACE. To assess the performance of our algorithm in conjunction with these methods, we used the same simulated data as Lai et al. used for the assessment in their study to calculate the true positive rates (TPR) and the false positive rates (FPR) as the threshold for determining a CNV is varied. See Lai at al. for the definitions of TPR and FPR and the details of the simulated data sets. We calculated the receiver operation characteristic (ROC) curve of our algorithm using the most noisy (thus the lowest signal-to-noise ratio, SNR = 1) data set with the CNV width of 40 probes. This ROC curve, together with the ROC curves of other array-CGH methods based on the same data set, was plotted in Figure 4. These curves show that our Bayesian algorithm is appreciably more sensitive than all other methods at low (< 10%) false positive rates. We need to point out that the comparison was conducted in a fair manner, if not to the disadvantage of our method: all the results from Lai et al. were used directly without modification and our method has no free parameters to tune.


Detection of copy number variation from array intensity and sequencing read depth using a stepwise Bayesian model.

Zhang ZD, Gerstein MB - BMC Bioinformatics (2010)

The ROC curves of array-CGH data analysis methods. (A) The complete plot. All curves were calculated using the same data set. They show that our Bayesian algorithm (black line) is appreciably more sensitive than all other methods (gray lines) at low (< 10%) false positive rates. (B) Details of the ROC curves in the low FPR region of (A) inside the box with dashed border. See Lai et al. [28] for the identities of the gray ROC curves. TPR, the true positive rate; FPR, the false positive rate.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC2992546&req=5

Figure 4: The ROC curves of array-CGH data analysis methods. (A) The complete plot. All curves were calculated using the same data set. They show that our Bayesian algorithm (black line) is appreciably more sensitive than all other methods (gray lines) at low (< 10%) false positive rates. (B) Details of the ROC curves in the low FPR region of (A) inside the box with dashed border. See Lai et al. [28] for the identities of the gray ROC curves. TPR, the true positive rate; FPR, the false positive rate.
Mentions: Lai et al. [28] examined the performance of 11 array-CGH data analysis methods: CGHseg, quantreg, CLAC, GLAD, CBS, HMM, wavelet, lowess, ChARM, GA, and ACE. To assess the performance of our algorithm in conjunction with these methods, we used the same simulated data as Lai et al. used for the assessment in their study to calculate the true positive rates (TPR) and the false positive rates (FPR) as the threshold for determining a CNV is varied. See Lai at al. for the definitions of TPR and FPR and the details of the simulated data sets. We calculated the receiver operation characteristic (ROC) curve of our algorithm using the most noisy (thus the lowest signal-to-noise ratio, SNR = 1) data set with the CNV width of 40 probes. This ROC curve, together with the ROC curves of other array-CGH methods based on the same data set, was plotted in Figure 4. These curves show that our Bayesian algorithm is appreciably more sensitive than all other methods at low (< 10%) false positive rates. We need to point out that the comparison was conducted in a fair manner, if not to the disadvantage of our method: all the results from Lai et al. were used directly without modification and our method has no free parameters to tune.

Bottom Line: Copy number variants (CNVs) have been demonstrated to occur at a high frequency and are now widely believed to make a significant contribution to the phenotypic variation in human populations.Sampling from the posterior distribution using a Markov chain Monte Carlo method, we get not only best estimates for these unknown parameters but also Bayesian credible intervals for the estimates.Furthermore, given its Bayesian origin, our method can also be seen as a technique to refine CNVs identified by fast point-estimate methods and also as a framework to integrate array-CGH and sequencing data with other CNV-related biological knowledge, all through informative priors.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Genetics, Albert Einstein College of Medicine, Bronx, NY 10461, USA. zhengdong.zhang@einstein.yu.edu

ABSTRACT

Background: Copy number variants (CNVs) have been demonstrated to occur at a high frequency and are now widely believed to make a significant contribution to the phenotypic variation in human populations. Array-based comparative genomic hybridization (array-CGH) and newly developed read-depth approach through ultrahigh throughput genomic sequencing both provide rapid, robust, and comprehensive methods to identify CNVs on a whole-genome scale.

Results: We developed a Bayesian statistical analysis algorithm for the detection of CNVs from both types of genomic data. The algorithm can analyze such data obtained from PCR-based bacterial artificial chromosome arrays, high-density oligonucleotide arrays, and more recently developed high-throughput DNA sequencing. Treating parameters--e.g., the number of CNVs, the position of each CNV, and the data noise level--that define the underlying data generating process as random variables, our approach derives the posterior distribution of the genomic CNV structure given the observed data. Sampling from the posterior distribution using a Markov chain Monte Carlo method, we get not only best estimates for these unknown parameters but also Bayesian credible intervals for the estimates. We illustrate the characteristics of our algorithm by applying it to both synthetic and experimental data sets in comparison to other segmentation algorithms.

Conclusions: In particular, the synthetic data comparison shows that our method is more sensitive than other approaches at low false positive rates. Furthermore, given its Bayesian origin, our method can also be seen as a technique to refine CNVs identified by fast point-estimate methods and also as a framework to integrate array-CGH and sequencing data with other CNV-related biological knowledge, all through informative priors.

Show MeSH