An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome.

Ribeiro A, Golicz A, Hackett CA, Milne I, Stephen G, Marshall D, Flavell AJ, Bayer M - BMC Bioinformatics (2015)

Bottom Line: We used error- and variant-free simulated reads to ensure that every SNP found was indeed a false positive. All of the experimental factors tested had statistically significant effects on the number of FP SNPs generated and there was a considerable amount of interaction between the different factors. The choice of reference assembler, mapper and variant caller also significantly affected the outcome.


Affiliation: The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, Scotland, UK. antonio.ribeiro@hutton.ac.uk.

ABSTRACT

Background: Single Nucleotide Polymorphisms (SNPs) are widely used molecular markers, and their use has increased massively since the inception of Next Generation Sequencing (NGS) technologies, which allow detection of large numbers of SNPs at low cost. However, both NGS data and their analysis are error-prone, which can lead to the generation of false positive (FP) SNPs. We explored the relationship between FP SNPs and seven factors involved in mapping-based variant calling - quality of the reference sequence, read length, choice of mapper and variant caller, mapping stringency and filtering of SNPs by read mapping quality and read depth. This resulted in 576 possible factor level combinations. We used error- and variant-free simulated reads to ensure that every SNP found was indeed a false positive.

Results: The number of FP SNPs generated ranged from 0 to 36,621 for the 120 million base pair (Mbp) genome. All of the experimental factors tested had statistically significant effects on the number of FP SNPs generated, and there was a considerable amount of interaction between the different factors. Using a fragmented reference sequence led to a dramatic increase in the number of FP SNPs generated, as did relaxed read mapping and a lack of SNP filtering. The choice of reference assembler, mapper and variant caller also significantly affected the outcome. The effect of read length was more complex and suggests a possible interaction between mapping specificity and the potential for contributing more false positives as read length increases.

Conclusions: The choice of tools and parameters involved in variant calling can have a dramatic effect on the number of FP SNPs produced, with particularly poor combinations of software and/or parameter settings yielding tens of thousands in this experiment. Between-factor interactions make simple recommendations difficult for a SNP discovery pipeline but the quality of the reference sequence is clearly of paramount importance. Our findings are also a stark reminder that it can be unwise to use the relaxed mismatch settings provided as defaults by some read mappers when reads are being mapped to a relatively unfinished reference sequence from e.g. a non-model organism in its early stages of genomic exploration.


Fig. 1: Experimental design. (a) The A. thaliana genome was used to generate simulated reads of different lengths. De novo assemblies were computed from the 150 bp read datasets using different assemblers. (b) With the assemblies as references, separate read mappings were carried out for each of the different read length datasets and with different combinations of factor levels, using the original genome as a control. (c) SNP detection was carried out with different variant callers and the results were analysed to detect whether the mismatched reads causing the SNPs were due to mismapping. SNP annotation was performed to detect enrichment for particular genomic features at SNP positions.

Mentions: The five chromosome sequences of Arabidopsis thaliana, available at ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes, served as the template for the generation of the simulated reads for our study. The SimSeq read simulator (last update 4.12.2011; https://github.com/jstjohn/SimSeq) was used to generate haploid, error-free paired-end and mate-pair reads (the latter created specifically for the assembly stage) from each of the chromosome sequences (see Additional file 1: Supplementary data section SD.1). This sampling mode allowed us to assume that every SNP encountered in the mappings must be an FP SNP due to read mismapping, as there were no other sources of variant alleles. Paired-end reads were produced with 100-fold coverage depth and at lengths of 50, 100, 150, 300, 500, and 1000 bp (Fig. 1). Fragment sizes for these were 90, 180, 270, 540, 900 and 1800 bp, respectively. Mate-pair reads were produced with 50-fold coverage depth at a length of 150 bp, with a fragment size of 3000 bp. Full details of the fragment sizes are provided in Additional file 1: Supplementary data (section SD.1.1, Table S4).
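To make the read-simulation arithmetic above concrete, the following is a minimal sketch of how many read pairs each paired-end dataset would need to reach 100-fold coverage. It rests on assumptions not stated in the text: the genome size is the rounded 120 Mbp figure from the abstract, coverage is taken to count both reads of a pair, and the function name `read_pairs_needed` is hypothetical. It does not reproduce how SimSeq was actually invoked.

```python
import math

# Assumption: rounded genome size quoted in the abstract (120 Mbp).
GENOME_SIZE_BP = 120_000_000
# Paired-end coverage depth stated in the text.
COVERAGE = 100
# Read length (bp) -> fragment size (bp), as listed in the text.
FRAGMENT_SIZE_BP = {50: 90, 100: 180, 150: 270, 300: 540, 500: 900, 1000: 1800}


def read_pairs_needed(genome_size_bp: int, coverage: int, read_length_bp: int) -> int:
    """Number of read pairs so that 2 * read_length * pairs >= coverage * genome_size.

    Assumes coverage is computed from the bases of both reads in each pair.
    """
    total_bases = genome_size_bp * coverage
    return math.ceil(total_bases / (2 * read_length_bp))


def mate_overlap_bp(read_length_bp: int, fragment_size_bp: int) -> int:
    """Bases shared by the two mates when the fragment is shorter than two reads."""
    return max(0, 2 * read_length_bp - fragment_size_bp)


for read_len, fragment in FRAGMENT_SIZE_BP.items():
    pairs = read_pairs_needed(GENOME_SIZE_BP, COVERAGE, read_len)
    overlap = mate_overlap_bp(read_len, fragment)
    print(f"{read_len:>5} bp reads, {fragment:>5} bp fragments: "
          f"{pairs:>11,} pairs, mates overlap by {overlap} bp")
```

Two things follow directly from the listed numbers: every fragment size is 1.8 times its read length, so under this reading the two mates of each pair overlap by 0.2 read lengths, and the number of pairs needed for a fixed coverage falls in inverse proportion to read length.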

