Best practices for evaluating single nucleotide variant calling methods for microbial genomics.

Olson ND, Lund SP, Colman RE, Foster JT, Sahl JW, Schupp JM, Keim P, Morrow JB, Salit ML, Zook JM - Front Genet (2015)

Bottom Line: As experience grows, researchers increasingly recognize that analyzing the wealth of data provided by these new sequencing platforms requires careful attention to detail for robust results. Missing, however, is a focus on critical evaluation of variant callers for these genomes. Variant calling is essential for comparative genomics as it yields insights into nucleotide-level organismal differences.


Affiliation: Biosystems and Biomaterials Division, Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA.

ABSTRACT
Innovations in sequencing technologies have allowed biologists to make incredible advances in understanding biological systems. As experience grows, researchers increasingly recognize that analyzing the wealth of data provided by these new sequencing platforms requires careful attention to detail for robust results. Thus far, much of the scientific community's focus for use in bacterial genomics has been on evaluating genome assembly algorithms and rigorously validating assembly program performance. Missing, however, is a focus on critical evaluation of variant callers for these genomes. Variant calling is essential for comparative genomics as it yields insights into nucleotide-level organismal differences. Variant calling is a multistep process with a host of potential error sources that may lead to incorrect variant calls. Identifying and resolving these incorrect calls is critical for bacterial genomics to advance. The goal of this review is to provide guidance on validating algorithms and pipelines used in variant calling for bacterial genomics. First, we will provide an overview of the variant calling procedures and the potential sources of error associated with the methods. We will then identify appropriate datasets for use in evaluating algorithms and describe statistical methods for evaluating algorithm performance. As variant calling moves from basic research to the applied setting, standardized methods for performance evaluation and reporting are required; it is our hope that this review provides the groundwork for the development of these standards.


Figure 3: Contingency table. True Positive, False Positive, False Negative, and True Negative are defined by the relationship between variants called by the SNP calling algorithm and known differences between the reference genome and the analyzed sample.

Mentions: Contingency tables, also referred to as confusion matrices, are used to evaluate classifiers such as variant calling algorithms (Figure 3). Two-by-two contingency tables present the relationship between variant labels assigned by an algorithm and labels from a declared truth set. The four basic values in a 2 × 2 contingency table, true positive (TP), true negative (TN), false positive (FP), and false negative (FN), are used to assess algorithm performance. Of note, a contingency table can depend on parameter values or threshold choices. For example, changing the variant call quality score threshold used to determine variant call sets can alter the resulting contingency table.
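The dependence of the contingency table on a quality threshold can be illustrated with a minimal Python sketch. It is not the authors' method; it assumes, purely for illustration, that variants are matched by genomic position alone (alleles are ignored), that every call carries a single quality score, and that the number of evaluated positions is known. The names `contingency_table`, `calls`, `truth`, `n_positions`, and `min_qual` are hypothetical.

```python
from typing import Dict, NamedTuple, Set


class Contingency(NamedTuple):
    tp: int  # called and present in the truth set
    fp: int  # called but absent from the truth set
    fn: int  # in the truth set but not called
    tn: int  # neither called nor in the truth set


def contingency_table(
    calls: Dict[int, float],
    truth: Set[int],
    n_positions: int,
    min_qual: float = 0.0,
) -> Contingency:
    """Build a 2 x 2 contingency table for one quality threshold.

    calls: hypothetical mapping of genomic position -> call quality score.
    truth: positions that truly differ from the reference (declared truth set).
    n_positions: total number of positions evaluated (assumed known).
    min_qual: calls scoring below this threshold are discarded.
    """
    # Keep only calls that pass the quality threshold.
    called = {pos for pos, qual in calls.items() if qual >= min_qual}
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    # Every remaining evaluated position is a true negative.
    tn = n_positions - tp - fp - fn
    return Contingency(tp, fp, fn, tn)


# Toy data (made up for illustration only).
calls = {101: 50.0, 250: 12.0, 400: 35.0, 777: 8.0}
truth = {101, 400, 900}

for threshold in (0.0, 20.0, 40.0):
    print(threshold, contingency_table(calls, truth, 1000, threshold))
```

In this toy example, raising the threshold from 0 to 20 removes both false positives without losing a true positive, while raising it further to 40 converts a true positive into a false negative. Real evaluations would also need to compare alleles and handle positions that cannot be assessed.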

