A computational framework discovers new copy number variants with functional importance.

Banerjee S, Oldridge D, Poptsova M, Hussain WM, Chakravarty D, Demichelis F - PLoS ONE (2011)

Bottom Line: Using data generated for a subset of individuals by a 42 million marker platform, we validated the majority of the variants; the highest validation rate (66.7%) was for variants larger than 1 kb. We investigated the possible enrichment for the variants' regulatory effect and found that smaller variants (<1 kb) are more likely to regulate gene transcripts than larger variants (p-value = 2.04e-08). Our results support the validity of the computational framework to detect novel variants relevant to disease susceptibility studies and provide evidence of the importance of genetic variants in regulatory network studies.

Affiliation: Department of Public Health, Weill Cornell Medical College, New York, New York, United States of America.

ABSTRACT
Structural variants which cause changes in copy numbers constitute an important component of genomic variability. They account for 0.7% of genomic differences between two individual genomes, of which copy number variants (CNVs) are the largest component. A recent population-based CNV study revealed the need for better characterization of CNVs, especially the small ones (<500 bp). We propose a three-step computational framework (Identification of germline Changes in Copy Number, or IgC2N) to discover and genotype germline CNVs. First, we detect candidate CNV loci by combining information across multiple samples without imposing restrictions on the number of covering markers or on the variant size. Second, we fine-tune the detection of rare variants and infer the putative copy number classes for each locus. Last, for each variant we combine the relative distance between consecutive copy number classes with genetic information in a novel attempt to estimate the reference model bias. This computational approach is applied to genome-wide data from 1250 HapMap individuals. Novel variants were discovered and characterized in terms of size, minor allele frequency, type of polymorphism (gains, losses or both), and mechanism of formation. Using data generated for a subset of individuals by a 42 million marker platform, we validated the majority of the variants; the highest validation rate (66.7%) was for variants larger than 1 kb. Finally, we queried transcriptomic data from 129 individuals determined by RNA-sequencing as further validation and to assess the functional role of the new variants. We investigated the possible enrichment for the variants' regulatory effect and found that smaller variants (<1 kb) are more likely to regulate gene transcripts than larger variants (p-value = 2.04e-08). Our results support the validity of the computational framework to detect novel variants relevant to disease susceptibility studies and provide evidence of the importance of genetic variants in regulatory network studies.


Figure 2 (pone-0017539-g002): Results of Power Simulation Study as a function of Size and Coverage. In silico power computations for IgC2N. The panel of 8 plots is organized in rows by sample size of the datasets used for simulations and in columns by the number of markers covering a CNV and the size (in kb) of a CNV. Each plot shows the average power to detect CNVs at three different frequencies, i.e. 1%, 5% and 15% for the dotted, dashed and solid lines, respectively.

Mentions: First, we evaluated the predictive performance of IgC2N on simulated data. We simulated datasets with several CNVs spanning a wide range of characteristics in terms of size, incidence (frequency in the population) and type of variant, namely deletion, gain or deletion/gain. To determine how many datasets the simulation requires, we monitored the behavior and stabilization of the false positive rate (File S1) and of the power (data not shown) by generating up to 500 datasets with a sample size of 200, and found that both stabilize at 100 datasets. Figure 2 shows the average empirical power (described in detail in the Materials and Methods section) for each CNV characteristic (e.g. size and frequency). Note that the size of a CNV is inherently related to the number of markers covering the CNV, which depends on the marker distribution of the specific platform in question. Figure 2 indicates that, in datasets of sample size 200, IgC2N has at least 80% power to detect polymorphisms of frequency 5% that are at least 100 kb long or have at least 10 covering markers. With a larger sample size (2000 or more), IgC2N has 80% or more power to detect similar CNVs (in terms of size of the CNV and number of covering markers) with a frequency of 1% or more. The False Positive Rate (FPR) for detecting CNVs is reported in greater detail in File S1. Briefly, the mean (maximum) FPR (over 100 datasets) was 0.0072 (0.04), 0.0026 (0.01), 0.0029 (0.01) and 0.0113 (0.03) for sample sizes of 200, 400, 800 and 2000, respectively.
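The summary statistics quoted above (average empirical power, mean and maximum FPR over datasets, and the stabilization check over an increasing number of datasets) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the per-dataset count inputs, and the definitions of power (detected simulated CNVs divided by all simulated CNVs) and FPR (false calls divided by non-variant loci) are assumptions made for illustration; the paper's exact definitions are in its Materials and Methods section and File S1.

```python
from statistics import mean


def empirical_power(detected_per_dataset, n_true_cnvs):
    """Average, over simulated datasets, of the fraction of simulated CNVs detected.

    Assumed definition (hypothetical): power = detected true CNVs / simulated true CNVs.
    """
    return mean(d / n_true_cnvs for d in detected_per_dataset)


def fpr_summary(false_calls_per_dataset, n_null_loci):
    """Return (mean, maximum) false positive rate over simulated datasets.

    Assumed definition (hypothetical): FPR = false calls / loci carrying no simulated CNV.
    """
    rates = [f / n_null_loci for f in false_calls_per_dataset]
    return mean(rates), max(rates)


def running_mean(values):
    """Cumulative mean after 1, 2, ..., n datasets.

    Inspecting where this curve flattens is one simple way to judge
    how many datasets are needed for a statistic to stabilize.
    """
    out = []
    total = 0.0
    for i, v in enumerate(values, start=1):
        total += v
        out.append(total / i)
    return out
```

For example, `fpr_summary(false_calls, n_null_loci)` applied to the per-dataset false-call counts of 100 simulated datasets would yield a pair in the same "mean (maximum)" form as the values reported in the text, and `running_mean` applied to the per-dataset rates gives the curve one would inspect for stabilization.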

