A computational framework discovers new copy number variants with functional importance.

Banerjee S, Oldridge D, Poptsova M, Hussain WM, Chakravarty D, Demichelis F - PLoS ONE (2011)

Bottom Line: Using data generated for a subset of individuals by a 42 million marker platform, we validated the majority of the variants; the highest validation rate (66.7%) was for variants larger than 1 kb. We investigated possible enrichment for regulatory effects of the variants and found that smaller variants (<1 kb) are more likely to regulate gene transcripts than larger variants (p-value = 2.04e-08). Our results support the validity of the computational framework to detect novel variants relevant to disease susceptibility studies and provide evidence of the importance of genetic variants in regulatory network studies.


Affiliation: Department of Public Health, Weill Cornell Medical College, New York, New York, United States of America.

ABSTRACT
Structural variants that cause changes in copy number constitute an important component of genomic variability. They account for 0.7% of genomic differences between two individual genomes, of which copy number variants (CNVs) are the largest component. A recent population-based CNV study revealed the need for better characterization of CNVs, especially the small ones (<500 bp). We propose a three-step computational framework (Identification of germline Changes in Copy Number, or IgC2N) to discover and genotype germline CNVs. First, we detect candidate CNV loci by combining information across multiple samples without imposing restrictions on the number of coverage markers or on the variant size. Second, we fine-tune the detection of rare variants and infer the putative copy number classes for each locus. Last, for each variant we combine the relative distance between consecutive copy number classes with genetic information in a novel attempt to estimate the reference model bias. This computational approach is applied to genome-wide data from 1250 HapMap individuals. Novel variants were discovered and characterized in terms of size, minor allele frequency, type of polymorphism (gains, losses or both), and mechanism of formation. Using data generated for a subset of individuals by a 42 million marker platform, we validated the majority of the variants; the highest validation rate (66.7%) was for variants larger than 1 kb. Finally, we queried transcriptomic data from 129 individuals, determined by RNA-sequencing, as further validation and to assess the functional role of the new variants. We investigated possible enrichment for regulatory effects of the variants and found that smaller variants (<1 kb) are more likely to regulate gene transcripts than larger variants (p-value = 2.04e-08). Our results support the validity of the computational framework to detect novel variants relevant to disease susceptibility studies and provide evidence of the importance of genetic variants in regulatory network studies.


Figure 1: Schematic of the approach used for the Identification of germline Changes in Copy Numbers (IgC2N). IgC2N is a multistep approach, which includes the identification of potential CNV loci along the genome (A–F), a bias correction step (H) and the CN genotyping (I), leveraging the experimental data from many samples. (A–I) The log2 intensity ratio signal or the segmented signal (A) is dichotomized on a marker and sample basis (B). A genome-wide score vector S is obtained by summing the transformed signal across all samples on a marker basis (C). The distribution of the score is obtained by permutations in order to identify the level of significant deviation of the score S, S_sig, from the baseline signal (D). The S_sig value corresponding to a pre-specified FDR threshold is applied to the data vector S (E). The intermediate output is a collection of putative polymorphic loci across the genome. No restriction on size or coverage is applied (F). A Gaussian Mixture Model (GMM) is applied to predict the CN classes (genotypes not assigned) (G). The distance between the medians of consecutive CN classes (1 CN class difference) is compared to the 1 CN class difference of all CNVs, and relative classes are inferred (H). Along with 1 CN class differences, the presence of a “0” class and the expected direction of bias are also considered to infer the genotypes of these CN classes, and the reference model bias is estimated (I).
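
The locus-detection steps of the schematic (A–F) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dichotomization cutoff of 0.2, the within-sample shuffling used as the permutation scheme, and the use of a plain null quantile in place of the FDR-based choice of S_sig are all assumptions made for illustration.

```python
import random

def dichotomize(log_ratios, cutoff=0.2):
    """Step B: 1 if |log2 ratio| exceeds the (hypothetical) cutoff, else 0."""
    return [[1 if abs(v) > cutoff else 0 for v in sample] for sample in log_ratios]

def score_vector(binary):
    """Step C: sum the dichotomized signal across samples, marker by marker."""
    return [sum(column) for column in zip(*binary)]

def permutation_threshold(binary, n_perm=200, alpha=0.05, seed=0):
    """Step D (simplified): shuffle marker positions within each sample to get a
    baseline distribution of scores, then return its (1 - alpha) quantile.
    Assumption: a null quantile stands in for the FDR-based S_sig."""
    rng = random.Random(seed)
    null_scores = []
    for _ in range(n_perm):
        shuffled = [rng.sample(row, len(row)) for row in binary]
        null_scores.extend(score_vector(shuffled))
    null_scores.sort()
    index = min(len(null_scores) - 1, int((1 - alpha) * len(null_scores)))
    return null_scores[index]

def call_loci(scores, s_sig):
    """Steps E-F: merge runs of consecutive markers with score > s_sig into
    half-open index intervals; no minimum size or marker count is imposed."""
    loci, start = [], None
    for i, s in enumerate(scores):
        if s > s_sig and start is None:
            start = i
        elif s <= s_sig and start is not None:
            loci.append((start, i))
            start = None
    if start is not None:
        loci.append((start, len(scores)))
    return loci
```

Given a samples-by-markers matrix of log2 ratios, the calls chain as `dichotomize`, `score_vector`, `permutation_threshold`, then `call_loci`. Because no size filter is applied, a locus covered by a single marker is reported just like a long one.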

Mentions: In this paper, we propose a 3-step method for the Identification of germline Changes in Copy Number, IgC2N, to discover and genotype CNVs and show its application to genome-wide array-based data (Figure 1).
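
The copy number class prediction (panel G of Figure 1) relies on a Gaussian Mixture Model. As a rough sketch only, the following fits a one-dimensional mixture by expectation-maximization and assigns each sample to a relative class ordered by mean. The function names, the fixed number of components `k`, the quantile-based initialization and the variance floor are assumptions for illustration; the source does not describe how the model is fitted or how the number of classes is chosen.

```python
import math

def _normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_1d(values, k, n_iter=200, var_floor=1e-4):
    """Fit a k-component 1-D Gaussian mixture by EM.
    Returns (weights, means, variances). k is assumed to be known."""
    xs = sorted(values)
    n = len(xs)
    # Initialise means at evenly spaced quantiles, variances from the pooled data.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    grand_var = sum((x - grand_mean) ** 2 for x in xs) / n
    variances = [max(grand_var / k, var_floor)] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * _normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            if total == 0.0:  # numerical underflow guard
                resp.append([1.0 / k] * k)
            else:
                resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted members.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, var_floor)
            weights[j] = nj / n
    return weights, means, variances

def assign_classes(values, weights, means, variances):
    """Label each value with its most likely component, numbered by increasing
    mean (0 = lowest). These are relative classes, not copy number genotypes."""
    order = sorted(range(len(means)), key=lambda j: means[j])
    rank = {j: r for r, j in enumerate(order)}
    labels = []
    for x in values:
        dens = [w * _normal_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        labels.append(rank[max(range(len(dens)), key=lambda j: dens[j])])
    return labels
```

The labels deliberately stop at relative classes, mirroring the caption's note that genotypes are not assigned at this stage; mapping classes to absolute copy numbers is the job of the later bias-estimation steps.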

