Detecting common copy number variants in high-throughput sequencing data by using JointSLM algorithm.

Magi A, Benelli M, Yoon S, Roviello F, Torricelli F - Nucleic Acids Res. (2011)



Affiliation: Laboratory Department, Diagnostic Genetic Unit, Careggi Hospital, Florence 5014, Italy. albertomagi@gmail.com

ABSTRACT
The discovery of genomic structural variants (SVs), such as copy number variants (CNVs), is essential for understanding the genetic variation of human populations and complex diseases. In recent years, the advent of new high-throughput sequencing (HTS) platforms has opened many opportunities for SV discovery, and a very promising approach consists of measuring the depth of coverage (DOC) of reads aligned to the human reference genome. At present, few computational methods have been developed for the analysis of DOC data, and all of them can analyse only one sample at a time. For these reasons, we developed a novel algorithm (JointSLM) that detects common CNVs among individuals by analysing DOC data from multiple samples simultaneously. We test the performance of JointSLM on synthetic and real data and show its unprecedented resolution, which enables the detection of recurrent CNV regions as small as 500 bp in size. When we apply JointSLM to chromosome 1 of eight genomes with different ancestry, we identify 3000 regions with recurrent CNVs of different frequency and size; hierarchical clustering on these regions segregates the eight individuals into two groups that reflect their ancestry, demonstrating the potential utility of JointSLM for population genetics studies.
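
As a rough illustration of the DOC signal the abstract refers to, the following minimal sketch counts read start positions in fixed-size windows and stacks the per-sample profiles so that several samples share the same windows. The function name `depth_of_coverage`, the 100 bp window size and the toy read positions are assumptions made for this example; they are not taken from the paper, and the sketch does not implement JointSLM itself.

```python
def depth_of_coverage(read_starts, chrom_length, window=100):
    """Count reads whose start position falls in each fixed-size window.

    read_starts: 0-based start positions of aligned reads (hypothetical input).
    window: window size in bp (assumed value, not taken from the paper).
    """
    n_windows = (chrom_length + window - 1) // window  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < chrom_length:  # ignore positions outside the chromosome
            counts[pos // window] += 1
    return counts


# Hypothetical toy data: read start positions for two samples on a 300 bp region.
samples = {
    "sample_a": [5, 120, 130, 250, 260, 270],
    "sample_b": [10, 20, 110, 255, 290],
}

# One DOC profile per sample over identical windows; a multi-sample method
# would take such a matrix as its input.
doc_matrix = {
    name: depth_of_coverage(starts, chrom_length=300)
    for name, starts in samples.items()
}
print(doc_matrix)
```

Keeping the windows identical across samples is what makes a joint analysis possible: each column of the resulting matrix refers to the same genomic interval in every individual.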



Figure 1: TPR and FPR estimates for different values of η and ω on synthetic data made of 10 chromosomes. Each point of the plot is obtained by averaging the JointSLM results over 100 repeated simulations. (a) Each curve represents the TPR estimate against deletion events of different size. Each plot reports the curves for different values of the fraction of altered samples f (with f ranging between 0.1 and 1). (b) Each curve represents the FPR estimate against the size of falsely detected events.

Mentions: We applied JointSLM with different parameter settings to data sets made of 10, 30 and 50 normal copy number chromosomes, and we evaluated the false positive rate (FPR) by counting the number of detected alterations. The specificity of the algorithm can be controlled by the parameters η and ω (see Supplementary Data): the higher the values of η and ω, the larger the number of detected false positive (FP) events. For instance, in the 10-sample analysis (Figure 1b), when we set η = 10⁻³ and ω = 0.3 we detect an average of 9.4 FP events that range between 100 and 500 bp in size (6.0 events of 100 bp, 2.6 of 200 bp, 0.62 of 300 bp, 0.12 of 400 bp and 0.08 of 500 bp), while using a more conservative set of parameters (η = 10⁻⁶ and ω = 0.1) we detect an average of 0.91 FP events (0.66 of 100 bp, 0.22 of 200 bp, 0.02 of 300 bp and 0.01 of 400 bp). In the analysis of the 30-sample data set (Supplementary Figure), we detected an average of 21.6 FP events (FPR = 0.2%) with ω = 0.3 and η = 10⁻³ and an average of 9.6 events (FPR = 0.1%) with ω = 0.1 and η = 10⁻⁶, while for the 50-sample analysis the FPR grows to 0.3% with ω = 0.3 and η = 10⁻³ and to 0.2% with ω = 0.1 and η = 10⁻⁶. These results show that the use of the most conservative set of parameters allows us to obtain a global FPR smaller than 0.01%.
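
The per-size FP counts quoted above for the 10-sample analysis can be tallied to check the reported averages and to see how strongly false positives concentrate at the smallest size. The sketch below uses only the numbers given in the text; the dictionary layout, the setting labels and the helper name `summarize` are assumptions made for this illustration.

```python
# Average FP events per size (bp) in the 10-sample analysis, as quoted in the text.
fp_by_size = {
    "eta=1e-3, omega=0.3": {100: 6.0, 200: 2.6, 300: 0.62, 400: 0.12, 500: 0.08},
    "eta=1e-6, omega=0.1": {100: 0.66, 200: 0.22, 300: 0.02, 400: 0.01},
}


def summarize(counts):
    """Return the total FP count and the share contributed by the smallest size."""
    total = sum(counts.values())
    smallest = min(counts)  # smallest event size present, in bp
    return total, counts[smallest] / total


for setting, counts in fp_by_size.items():
    total, share = summarize(counts)
    print(f"{setting}: total FP = {total:.2f}, share at {min(counts)} bp = {share:.0%}")

# Ratio between the permissive and the conservative setting.
ratio = summarize(fp_by_size["eta=1e-3, omega=0.3"])[0] / summarize(
    fp_by_size["eta=1e-6, omega=0.1"]
)[0]
print(f"reduction factor: {ratio:.1f}x")
```

The totals come out at roughly 9.4 and 0.91, matching the averages stated in the text, so the conservative setting reduces the FP count about tenfold, and in both settings most false detections are 100 bp in size.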

