Detection of copy number variation from array intensity and sequencing read depth using a stepwise Bayesian model.

Zhang ZD, Gerstein MB - BMC Bioinformatics (2010)

Bottom Line: Copy number variants (CNVs) have been demonstrated to occur at a high frequency and are now widely believed to make a significant contribution to the phenotypic variation in human populations. Sampling from the posterior distribution using a Markov chain Monte Carlo method, we get not only best estimates for these unknown parameters but also Bayesian credible intervals for the estimates. Furthermore, given its Bayesian origin, our method can also be seen as a technique to refine CNVs identified by fast point-estimate methods and also as a framework to integrate array-CGH and sequencing data with other CNV-related biological knowledge, all through informative priors.


Affiliation: Department of Genetics, Albert Einstein College of Medicine, Bronx, NY 10461, USA. zhengdong.zhang@einstein.yu.edu

ABSTRACT

Background: Copy number variants (CNVs) have been demonstrated to occur at a high frequency and are now widely believed to make a significant contribution to the phenotypic variation in human populations. Array-based comparative genomic hybridization (array-CGH) and the newly developed read-depth approach based on ultrahigh-throughput genomic sequencing both provide rapid, robust, and comprehensive methods to identify CNVs on a whole-genome scale.

Results: We developed a Bayesian statistical analysis algorithm for the detection of CNVs from both types of genomic data. The algorithm can analyze such data obtained from PCR-based bacterial artificial chromosome arrays, high-density oligonucleotide arrays, and more recently developed high-throughput DNA sequencing. Treating parameters--e.g., the number of CNVs, the position of each CNV, and the data noise level--that define the underlying data generating process as random variables, our approach derives the posterior distribution of the genomic CNV structure given the observed data. Sampling from the posterior distribution using a Markov chain Monte Carlo method, we get not only best estimates for these unknown parameters but also Bayesian credible intervals for the estimates. We illustrate the characteristics of our algorithm by applying it to both synthetic and experimental data sets in comparison to other segmentation algorithms.

Conclusions: In particular, the synthetic data comparison shows that our method is more sensitive than other approaches at low false positive rates. Furthermore, given its Bayesian origin, our method can also be seen as a technique to refine CNVs identified by fast point-estimate methods and also as a framework to integrate array-CGH and sequencing data with other CNV-related biological knowledge, all through informative priors.


Figure 7: The posterior distribution of model parameters. (A) A view of the posterior probability distribution landscape in the two-dimensional s1-s2 parameter space. The simulated data D (M = 700) were generated with θ = {N = 2, (s1 = 100, w1 = 20, a1 = 2), (s2 = 600, w2 = 20, a2 = -2), σ2 = 0.42}. The posterior probability was evaluated with various combinations of s1 and s2 while all other parameters were kept fixed at their true values. The terrain color varies from green to yellow to red and then to gray as the posterior probability increases. (B) The contour of the same posterior distribution. A closed contour line traces points (s1, s2) of equal posterior probability density. As expected, the global maximum of the posterior probability occurs where s1 = 100 and s2 = 600, two values that were used to generate the underlying data.

Mentions: The surface plot in Figure 7A shows, as expected, a global maximum at s1 = 100 and s2 = 600, and an overall very rugged posterior 'terrain': the landscape is full of local maxima, most notably two prominent 'ridges' of local maxima at s1 = 100 and s2 = 600, respectively. It is clear from Figures 7A and 7B that if the Markov chain of a random-walk Metropolis-Hastings (RWMH) sampler reaches a local maximum on the ridge at s1 = 100 or s2 = 600 that happens to lie far from the global maximum, it will be trapped on the ridge and in practice cannot reach the global peak if the random update interval is small (which is almost always the case). Based on these observations, we chose the Gibbs sampling algorithm for our Bayesian analysis of the genomic copy number data, as the Gibbs sampler is well suited to exploring this 'ridged' terrain: it uses full conditionals to scan the landscape along ridges to find the global maximum.
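
The idea of scanning along a ridge with a full conditional can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a piecewise-constant signal with additive Gaussian noise, a flat prior on the start position, and a noise standard deviation of 0.4; the function names (mean_signal, log_likelihood, full_conditional_s1) are hypothetical. With s2 and all other parameters held at the values from the Figure 7 caption, the full conditional of s1 is evaluated over every admissible start position, so a single Gibbs update can move any distance along the ridge rather than taking small random-walk steps.

```python
import numpy as np

def mean_signal(M, cnvs):
    """Piecewise-constant mean: amplitude a on [s, s + w), zero elsewhere (assumed model)."""
    mu = np.zeros(M)
    for s, w, a in cnvs:
        mu[s:s + w] += a
    return mu

def log_likelihood(data, cnvs, sigma):
    """Gaussian log-likelihood up to an additive constant."""
    resid = data - mean_signal(len(data), cnvs)
    return -0.5 * np.sum(resid ** 2) / sigma ** 2

def full_conditional_s1(data, s2, w, a1, a2, sigma):
    """Normalized conditional of s1 on a grid, given all other parameters.

    Assumes a flat prior on s1, so the conditional is proportional
    to the likelihood.
    """
    starts = np.arange(0, len(data) - w + 1)
    logp = np.array([
        log_likelihood(data, [(s, w, a1), (s2, w, a2)], sigma)
        for s in starts
    ])
    logp -= logp.max()          # stabilize before exponentiating
    p = np.exp(logp)
    return starts, p / p.sum()

# Simulated data following the Figure 7 caption; sigma = 0.4 is an assumption.
rng = np.random.default_rng(0)
M, w, sigma = 700, 20, 0.4
data = mean_signal(M, [(100, w, 2.0), (600, w, -2.0)]) + rng.normal(0.0, sigma, M)

# One Gibbs update of s1: draw from its full conditional along the s2 = 600 ridge.
starts, p = full_conditional_s1(data, s2=600, w=w, a1=2.0, a2=-2.0, sigma=sigma)
s1_draw = rng.choice(starts, p=p)
s1_mode = starts[np.argmax(p)]
```

A complete sampler would cycle such updates over every parameter (positions, widths, amplitudes, noise level); only the s1 step is shown here.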

