Detection of copy number variation from array intensity and sequencing read depth using a stepwise Bayesian model.

Zhang ZD, Gerstein MB - BMC Bioinformatics (2010)

Bottom Line: Copy number variants (CNVs) have been demonstrated to occur at a high frequency and are now widely believed to make a significant contribution to the phenotypic variation in human populations. Sampling from the posterior distribution using a Markov chain Monte Carlo method, we get not only best estimates for these unknown parameters but also Bayesian credible intervals for the estimates. Furthermore, given its Bayesian origin, our method can also be seen as a technique to refine CNVs identified by fast point-estimate methods and also as a framework to integrate array-CGH and sequencing data with other CNV-related biological knowledge, all through informative priors.


Affiliation: Department of Genetics, Albert Einstein College of Medicine, Bronx, NY 10461, USA. zhengdong.zhang@einstein.yu.edu

ABSTRACT

Background: Copy number variants (CNVs) have been demonstrated to occur at a high frequency and are now widely believed to make a significant contribution to the phenotypic variation in human populations. Array-based comparative genomic hybridization (array-CGH) and the newly developed read-depth approach based on ultrahigh-throughput genomic sequencing both provide rapid, robust, and comprehensive methods to identify CNVs on a whole-genome scale.

Results: We developed a Bayesian statistical analysis algorithm for the detection of CNVs from both types of genomic data. The algorithm can analyze such data obtained from PCR-based bacterial artificial chromosome arrays, high-density oligonucleotide arrays, and more recently developed high-throughput DNA sequencing. Treating the parameters that define the underlying data-generating process (e.g., the number of CNVs, the position of each CNV, and the data noise level) as random variables, our approach derives the posterior distribution of the genomic CNV structure given the observed data. Sampling from the posterior distribution using a Markov chain Monte Carlo method, we obtain not only the best estimates for these unknown parameters but also Bayesian credible intervals for the estimates. We illustrate the characteristics of our algorithm by applying it to both synthetic and experimental data sets in comparison to other segmentation algorithms.

Conclusions: In particular, the synthetic data comparison shows that our method is more sensitive than other approaches at low false positive rates. Furthermore, given its Bayesian origin, our method can also be seen as a technique to refine CNVs identified by fast point-estimate methods and also as a framework to integrate array-CGH and sequencing data with other CNV-related biological knowledge, all through informative priors.


Figure 2: Parameter estimation by MCMC simulation for a simulated array-CGH data set. (A) Plot of log-ratio vs. probe genomic index for a simulated one-CNV array-CGH data set. The data D (M = 500) were generated with θ = {N = 1, (s_1 = 200, w_1 = 50, a_1 = 1.5), σ² = 0.42}. (B) The logarithm of the posterior probability (calculated up to a multiplicative constant) at 500 consecutive MCMC sampling iterations. In the stationary phase, the posterior probability of the MCMC-sampled parameter values given data D fluctuates closely beneath the maximum value p(θ|D). (C-F) Histograms of the 500 estimates of s_1, w_1, a_1, and σ, respectively. (G-J) Traces of the estimates of s_1, w_1, a_1, and σ through the 500 consecutive MCMC sampling iterations.

Mentions: We first used simulated array-CGH data sets to test our Bayesian model and its implementation. To generate such synthetic data, we first specified values for the parameters in θ = {N, (s_j, w_j, a_j), σ²}, j = 1, 2, ..., N, in which N and (s_j, w_j, a_j) define the artificial genomic CNV structure encoded as a step function and σ² determines the overall noise level in the data. The simulated data were then generated by adding random Gaussian noise to this predefined step function. Typical simulated array data with one and multiple CNVs are shown in Figures 2A and 3A, respectively.
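
The data-generation step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. It assumes that each triple (s, w, a) denotes the start probe index, width in probes, and log-ratio amplitude of one CNV, and that the baseline log-ratio outside CNVs is zero. The function names, the 0-based indexing, and the noise standard deviation of 0.4 are hypothetical choices made for the example.

```python
import random


def step_function(num_probes, cnvs):
    """Noise-free signal: 0.0 outside CNVs, amplitude a inside each CNV.

    cnvs is a list of (start, width, amplitude) tuples with 0-based start
    indices (an assumption of this sketch, not taken from the paper).
    """
    signal = [0.0] * num_probes
    for start, width, amplitude in cnvs:
        # Clip the CNV to the array so an overlong width cannot overflow.
        for i in range(start, min(start + width, num_probes)):
            signal[i] = amplitude
    return signal


def simulate_array_cgh(num_probes, cnvs, noise_sd, seed=0):
    """Add independent Gaussian noise (mean 0, sd noise_sd) to the step function."""
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    signal = step_function(num_probes, cnvs)
    return [level + rng.gauss(0.0, noise_sd) for level in signal]


# One-CNV example shaped like the Figure 2A setting: M = 500 probes,
# one CNV starting at probe 200, 50 probes wide, amplitude 1.5.
# noise_sd = 0.4 is an illustrative value chosen for this sketch.
data = simulate_array_cgh(num_probes=500, cnvs=[(200, 50, 1.5)], noise_sd=0.4)

inside = data[200:250]
outside = data[:200] + data[250:]
print("mean log-ratio inside CNV: ", sum(inside) / len(inside))
print("mean log-ratio outside CNV:", sum(outside) / len(outside))
```

Passing several tuples in `cnvs` gives a multi-CNV data set of the kind referred to for Figure 3A. The sample means inside and outside the CNV should land near the chosen amplitude and near zero, respectively.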

