The distribution and mutagenesis of short coding INDELs from 1,128 whole exomes.

Challis D, Antunes L, Garrison E, Banks E, Evani US, Muzny D, Poplin R, Gibbs RA, Marth G, Yu F - BMC Genomics (2015)

Bottom Line: Identifying insertion/deletion polymorphisms (INDELs) with high confidence has been intrinsically challenging in short-read sequencing data. This consensus exome INDEL call set features 7,210 INDELs, from 1,128 individuals across 13 populations included in the 1000 Genomes Phase 1 dataset, with a false discovery rate (FDR) of about 7.0%. In our study we further characterize the patterns and distributions of these exonic INDELs with respect to density, allele length, and site frequency spectrum, as well as the potential mutagenic mechanisms of coding INDELs in humans.


Affiliation: Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA. dannychallis@gmail.com.

ABSTRACT

Background: Identifying insertion/deletion polymorphisms (INDELs) with high confidence has been intrinsically challenging in short-read sequencing data. Here we report our approach for improving INDEL calling accuracy by using a machine learning algorithm to combine call sets generated with three independent methods, and by leveraging the strengths of each individual pipeline. Utilizing this approach, we generated a consensus exome INDEL call set from a large dataset generated by the 1000 Genomes Project (1000G), maximizing both the sensitivity and the specificity of the calls.

Results: This consensus exome INDEL call set features 7,210 INDELs, from 1,128 individuals across 13 populations included in the 1000 Genomes Phase 1 dataset, with a false discovery rate (FDR) of about 7.0%.

Conclusions: In our study we further characterize the patterns and distributions of these exonic INDELs with respect to density, allele length, and site frequency spectrum, as well as the potential mutagenic mechanisms of coding INDELs in humans.



Figure 2: Exome INDEL call set analysis per individual. a. To characterize the exome INDEL length distribution in individuals, we calculated the density of exome INDELs of different lengths in each individual. Shorter INDELs are generally more common, with a strong bias towards INDEL lengths that are a multiple of 3 (in-frame). The INDEL densities for the HuRef and YanHuang exomes are also shown for reference. Deletions are shown with negative length. b. To characterize the allele frequency distribution in individuals, we calculated the exome INDEL density in each individual across 7 logarithmic allele frequency bins. The results confirm that most of the exome INDELs in an individual are common (>10%), but most individuals harbor a small number of extremely rare exome INDELs (i.e. singletons).
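The binning described for panel b can be illustrated with a minimal sketch. The caption does not give the bin edges, so the code below assumes 7 bins of equal width on a log10 scale between a singleton frequency and 1.0. The singleton frequency of 1/2256 (one allele among 1,128 diploid individuals), the function name `log_bin_index`, and the example frequencies are all illustrative assumptions, not values taken from the paper.

```python
import math
from collections import Counter

N_BINS = 7                      # number of bins stated in the caption
MIN_FREQ = 1 / (2 * 1128)       # assumed singleton frequency (hypothetical lower edge)

def log_bin_index(freq: float, min_freq: float = MIN_FREQ, n_bins: int = N_BINS) -> int:
    """Return the index (0 = rarest) of the log10-spaced bin containing freq."""
    if not 0 < freq <= 1:
        raise ValueError("allele frequency must be in (0, 1]")
    low = math.log10(min_freq)
    width = (0.0 - low) / n_bins            # upper edge is log10(1.0) = 0
    index = int((math.log10(freq) - low) // width)
    # Clamp so that freq == 1.0 (or values below min_freq) stay in range.
    return max(0, min(n_bins - 1, index))

# Made-up allele frequencies for the INDELs carried by one individual.
example_freqs = [MIN_FREQ, 0.01, 0.1, 0.5, 1.0]
bin_counts = Counter(log_bin_index(f) for f in example_freqs)
print(sorted(bin_counts.items()))
```

Dividing each bin count by the number of megabases of targeted sequence would turn the counts into per-individual densities comparable in form to those plotted in the figure.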

Mentions: Frameshift INDELs (where the INDEL length is not a multiple of three bases) are under especially strong scrutiny as they generally result in a nonsense mutation and changes in amino acid sequences. To further characterize this effect, we evaluated the distribution of INDEL lengths (Figure 2a). In our call set we discovered INDELs ranging in size from −84 to 12 base pairs (deletion lengths are represented as negative numbers). The vast majority (95%) of the INDELs are short (<10 base pairs), and the distribution is strongly enriched for in-frame INDELs (i.e. where the INDEL length is a multiple of three bases) (39%). Short INDELs were the most frequent, with an average of 1.50 1-bp INDELs per Mb of sequence, and an average of 0.19 large (>10 bp) INDELs per Mb (Figure 2a). At the individual exome level, selection against frameshift INDELs is even more pronounced, with an average of 2.31 frameshift INDELs per Mb, and an average of 3.21 in-frame INDELs per Mb. In fact, larger in-frame INDELs are observed more frequently than shorter frameshift INDELs. These findings are directly related to the power law of INDEL mutagenesis and clearly demonstrate selection against frameshift INDELs.
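The frameshift versus in-frame distinction and the per-Mb densities quoted above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the example lengths, and the 2 Mb target size are hypothetical, and the sketch is not the authors' pipeline. It follows the text's convention that deletions carry negative lengths.

```python
from collections import Counter

def classify_indel(length: int) -> str:
    """Classify an INDEL by signed length (negative = deletion, positive = insertion)."""
    if length == 0:
        raise ValueError("an INDEL must have a non-zero length")
    # A length that is a multiple of three bases preserves the reading frame.
    return "inframe" if abs(length) % 3 == 0 else "frameshift"

def density_per_mb(lengths, target_bp: int) -> dict:
    """Count frameshift and in-frame INDELs per megabase of targeted sequence."""
    counts = Counter(classify_indel(n) for n in lengths)
    megabases = target_bp / 1_000_000
    return {kind: counts[kind] / megabases for kind in ("frameshift", "inframe")}

# Made-up INDEL lengths for one exome and a hypothetical 2 Mb target region.
example_lengths = [-84, -3, -1, 1, 3, 12, -2, 6]
print(density_per_mb(example_lengths, target_bp=2_000_000))
```

With these made-up inputs, five of the eight INDELs are in-frame and three are frameshift, so the densities are 2.5 and 1.5 per Mb respectively.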

