SInC: an accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data.

Pattnaik S, Gupta S, Rao AA, Panda B - BMC Bioinformatics (2014)

Bottom Line: SInC is capable of generating single- and paired-end reads with user-defined insert size and with higher efficiency than other existing tools. SInC, due to its multi-threaded capability during read generation, has a low time footprint. We have developed a user-friendly multi-variant simulator and read-generator tool called SInC.


Affiliation: Ganit Labs, Bio-IT Centre, Institute of Bioinformatics and Applied Biotechnology, Biotech Park, Electronic City Phase I, Bangalore 560100, India. binay@ganitlabs.in.

ABSTRACT

Background: Rapid advances in genome sequencing are aiding our understanding of many biological systems. In the last five years, computational biologists and bioinformatics specialists have developed newer, better and more efficient tools for the discovery, analysis and interpretation of different genomic variants from high-throughput sequencing data. A reliable simulated dataset is essential, and is the first step in testing any newly developed analytical tool for variant discovery. Although tools that simulate variants are currently available, none can simulate all three major types of variation (Single Nucleotide Polymorphisms, Insertions and Deletions, and Copy Number Variations) and also generate reads that take a realistic error model into consideration. An efficient simulator and read generator is therefore needed that can simulate variants while taking the error rates of true biological samples into consideration.

Results: We report SInC (Snp, Indel and Cnv), an open-source variant simulator and read generator capable of simulating all three common types of biological variants, taking into account the distribution of base quality scores from a commonly used next-generation sequencing instrument from Illumina. SInC is capable of generating single- and paired-end reads with user-defined insert size and with higher efficiency than other existing tools. SInC, due to its multi-threaded capability during read generation, has a low time footprint. SInC is currently optimised to work on limited infrastructure and can efficiently exploit the commonly used quad-core desktop architecture to simulate short sequence reads with deep coverage for large genomes.

Conclusions: We have developed a user-friendly multi-variant simulator and read-generator tool called SInC. SInC can be downloaded from http://sourceforge.net/projects/sincsimulator.


Figure 2: Variant rediscovery statistics. Percentages of simulated variants rediscovered using GATK and Pindel are shown for A) SNVs and B) indels, respectively. The rediscovery of indels by size was also assessed and is given in Additional file 3. The rediscovery percentages of C) heterozygous and D) homozygous SNVs are compared.

Mentions: We used the human chromosome 22 sequence from the UCSC build hg19 to generate SNVs and indels using all four SRG simulators. The SNV rate, indel percentage and coverage were kept the same across all the tools, and the resulting reads were aligned using Novoalign[14]. The mapped files were subjected to SNV and indel detection by GATK[4] and Pindel[6], respectively (see Results, Figure 2). The predicted SNVs and indels from the different simulators were compared with the actual number of incorporated variants to estimate the percentage rediscovery. Rediscovery using Pindel has a limitation: it merges short indels that lie within a span of 40 nucleotides of each other, leading to a slight loss (less than 1%) of rediscovered indels across all the simulators (see Additional file 2).
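
The rediscovery comparison described above can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the `Variant` tuple layout, the function names `rediscovery_percentage` and `indels_within_span`, the exact-match rule on position and alleles, and the inclusive 40-nucleotide cutoff are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

# Hypothetical record layout: (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def rediscovery_percentage(simulated: Iterable[Variant],
                           predicted: Iterable[Variant]) -> float:
    """Percentage of simulated variants that also appear among the predicted calls.

    Assumes a call counts as rediscovered only on an exact match of
    chromosome, position and alleles; false positives do not affect the value.
    """
    truth = set(simulated)
    if not truth:
        raise ValueError("no simulated variants to compare against")
    return 100.0 * len(truth & set(predicted)) / len(truth)


def indels_within_span(indels: Iterable[Variant], span: int = 40) -> Set[Variant]:
    """Simulated indels lying within `span` nt of another indel on the same chromosome.

    These are candidates for being merged by a caller, and so for being
    counted as lost. Treating the cutoff as inclusive is an assumption.
    """
    ordered = sorted(set(indels), key=lambda v: (v[0], v[1]))
    close: Set[Variant] = set()
    # After sorting, checking adjacent pairs is enough to find every close indel.
    for left, right in zip(ordered, ordered[1:]):
        if left[0] == right[0] and right[1] - left[1] <= span:
            close.update((left, right))
    return close


# Toy data, invented for this example only.
simulated_snvs = [("chr22", 100, "A", "G"), ("chr22", 200, "C", "T"),
                  ("chr22", 300, "G", "A"), ("chr22", 400, "T", "C")]
predicted_snvs = simulated_snvs[:3] + [("chr22", 500, "A", "T")]
snv_rediscovery = rediscovery_percentage(simulated_snvs, predicted_snvs)

simulated_indels = [("chr22", 1000, "A", "ATG"), ("chr22", 1030, "CTT", "C"),
                    ("chr22", 2000, "G", "GA")]
merge_candidates = indels_within_span(simulated_indels)
```

With the toy data, three of the four simulated SNVs are recovered, and the two indels 30 nucleotides apart are flagged as merge candidates. Reporting such candidates alongside the percentage makes it easier to separate losses caused by merging from genuinely missed calls.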

