Evaluating the impact of scoring parameters on the structure of intra-specific genetic variation using RawGeno, an R package for automating AFLP scoring.

Arrigo N, Tuszynski JW, Ehrich D, Gerdes T, Alvarez N - BMC Bioinformatics (2009)

Bottom Line: Using a new scoring algorithm, RawGeno, we show that scoring errors--in particular "bin oversplitting" (i.e. when variant sizes of the same AFLP marker are not considered as homologous) and "technical homoplasy" (i.e. when two AFLP markers that differ slightly in size are mistakenly considered as being homologous)--induce a loss of discriminatory power, decrease the robustness of results and, in extreme cases, introduce erroneous information in genetic structure analyses. Our algorithm appears to perform as well as a commercial program in automating AFLP scoring, at least in the context of population genetics or phylogeographic studies. To our knowledge, RawGeno is the only freely available public-domain software for fully automated AFLP scoring, from electropherogram files to user-defined working binary matrices.


Affiliation: Laboratory of Evolutionary Botany, Institute of Biology, University of Neuchâtel, 11 rue Emile-Argand, CH-2000 Neuchâtel, Switzerland. nils.arrigo@unine.ch

ABSTRACT

Background: Since the transfer and application of modern sequencing technologies to the analysis of amplified fragment-length polymorphisms (AFLP), evolutionary biologists have included an increasing number of samples and markers in their studies. Although justified in this context, the use of automated scoring procedures may result in technical biases that weaken the power and reliability of further analyses.

Results: Using a new scoring algorithm, RawGeno, we show that scoring errors--in particular "bin oversplitting" (i.e. when variant sizes of the same AFLP marker are not considered as homologous) and "technical homoplasy" (i.e. when two AFLP markers that differ slightly in size are mistakenly considered as being homologous)--induce a loss of discriminatory power, decrease the robustness of results and, in extreme cases, introduce erroneous information in genetic structure analyses. In the present study, we evaluate several descriptive statistics that can be used to optimize the scoring of the AFLP analysis, and we describe a new statistic, the information content per bin (Ibin), which represents a valuable estimator during the optimization process. This statistic can be computed at any stage of the AFLP analysis without requiring the inclusion of replicated samples. Finally, we show that downstream analyses are not equally sensitive to scoring errors. Indeed, although the optimization of the scoring procedure tolerates a reasonable amount of flexibility without considerably changing the genetic structure patterns detected, notable discrepancies are observed when estimating genetic diversities from differently scored datasets.
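Because Ibin only compares individuals to one another, it can indeed be recomputed after every scoring step without replicated samples. The R sketch below illustrates one way the statistic could be obtained from a 0/1 scoring matrix; the simple mismatch (Manhattan) distance between individuals is an assumption made for illustration, not necessarily the distance implemented in RawGeno.

# Minimal sketch, not RawGeno's own code: information content per bin (Ibin),
# i.e. the mean inter-individual distance divided by the number of bins.
# The mismatch (Manhattan) distance on the 0/1 matrix is an illustrative choice.
ibin <- function(bin_matrix) {
  # bin_matrix: individuals in rows, bins (AFLP markers) in columns, scored 0/1
  d <- dist(bin_matrix, method = "manhattan")  # pairwise numbers of mismatching bins
  mean(d) / ncol(bin_matrix)                   # mean inter-individual distance per bin
}

# Toy example: 4 individuals scored over 6 bins
set.seed(1)
m <- matrix(rbinom(24, 1, 0.5), nrow = 4,
            dimnames = list(paste0("ind", 1:4), paste0("bin", 1:6)))
ibin(m)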

Conclusion: Our algorithm appears to perform as well as a commercial program in automating AFLP scoring, at least in the context of population genetics or phylogeographic studies. To our knowledge, RawGeno is the only freely available public-domain software for fully automated AFLP scoring, from electropherogram files to user-defined working binary matrices. RawGeno was implemented as an R CRAN package (with a user-friendly GUI) and can be found at http://sourceforge.net/projects/rawgeno.


Figure 3: Error rates and information content. In light grey, RawGeno datasets (labelled with the prefix "RG"); in dark grey, GeneMapper datasets (labelled with the prefix "GM"). Only the quartiles are displayed on the box-plots. A. Mismatch error rate (Eb). B. and C. Bayesian error rates, representing the probability (%) of mis-scoring allele presence (ε1.0) or absence (ε0.1). D. Information content per bin (Ibin), calculated as the mean inter-individual distance divided by the number of bins in the dataset. This statistic could also be computed for the manually scored dataset, since it does not require replicated samples.

Mentions: The previous statistics used the manually scored dataset as a reference. This strategy, however, might be misleading for several reasons: first, such a reference point is not available when analysing a new dataset; second, manual scoring may introduce subjective biases into the dataset (for instance, resulting from different filtering strategies). The evaluation of dataset quality therefore requires a more absolute criterion, such as error rates. However, the properties of the different estimators must be evaluated taking technical homoplasy and bin oversplitting into account. For instance, the mismatch error rate (Eb), defined as the number of mismatches between a sample and its replicate divided by the number of bins (Eb = Mrepl/nbin), failed to unambiguously detect oversplitting, while technical homoplasy was only weakly detected (Figure 3A). On the one hand, oversplitting could not be detected as an increase in the mismatch error rate because it dramatically increased the number of bins. On the other hand, technical homoplasy was hardly detected because it artificially decreased the number of mismatches between a sample and its replicate, leading to an underestimation of Eb. Similar results were described by Holland et al. [15], where the Eb rate (referred to by those authors as the Euclidean error rate) was shown to be unable to discriminate among datasets scored with varying bin sizes. As a result, we advise against using the mismatch error rate to optimize bin definition and the scoring of AFLPs. This criterion did, however, provide valuable information when the number of bins remained similar among datasets, a situation encountered, for instance, during further quality-checking steps such as fluorescence or bin-reproducibility filtering. Interestingly, the two Bayesian error rates (Figure 3B and 3C) detected both oversplitting (with ε1.0) and technical homoplasy (with ε0.1). These indices thus provided a valuable solution for optimizing the whole range of scoring parameters, including downstream filtering procedures, as shown by Whitlock et al. [14]. These statistics, however, required substantial computation time, and a quicker alternative may be desirable. We therefore propose a new statistic, the information content per bin (Figure 3D), as a quality criterion for optimizing the first steps of AFLP scoring (see above for explanations regarding this criterion). In our study, this statistic detected both oversplitting and technical homoplasy: oversplitting increased the number of bins faster than the number of mismatches, whereas technical homoplasy decreased the accuracy of the dataset faster than it decreased the number of bins. Maximizing Ibin thus represents an attractive trade-off between the accuracy of the dataset and the number of bins used to record the AFLP information. We propose applying this criterion to optimize bin definition, while the other quality criteria (e.g. Whitlock et al. [14]) can be used during downstream scoring steps. According to the Ibin criterion, the RG_2 and GM_2 datasets were of reasonably good quality. For RG_2, this result was confirmed by both the correlation with the manually scored dataset and the partial constrained correspondence analysis (see Figure 2), whereas according to these same statistics, GM_1 might perform better than GM_2.
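As a complement to the definition Eb = Mrepl/nbin above, the R sketch below shows how the mismatch error rate could be computed from paired sample/replicate scoring matrices. The matrix layout and object names are hypothetical, chosen for illustration rather than taken from RawGeno's internal representation.

# Minimal sketch, not RawGeno's own code: mismatch error rate Eb = Mrepl / nbin.
# Assumes two 0/1 matrices with identical bins (columns) and matched rows,
# i.e. row i of 'replicates' is the replicate of row i of 'samples' (hypothetical objects).
eb_rate <- function(samples, replicates) {
  stopifnot(all(dim(samples) == dim(replicates)))
  mrepl <- rowSums(samples != replicates)  # Mrepl: mismatches per sample/replicate pair
  mrepl / ncol(samples)                    # divide by the number of bins (nbin)
}

# Toy example: 3 sample/replicate pairs scored over 5 bins, with one discordant score
s <- matrix(c(1, 0, 1, 1, 0,
              0, 1, 1, 0, 1,
              1, 1, 0, 0, 1), nrow = 3, byrow = TRUE)
r <- s
r[2, 4] <- 1                               # simulate a single scoring mismatch
eb_rate(s, r)                              # per-pair Eb; take mean() for a dataset-wide rate
# As discussed above, oversplitting inflates the number of bins and homoplasy deflates
# Mrepl, so a low Eb alone does not guarantee a well-scored dataset.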
It bears repeating, however, that datasets scored with bin widths up to 2 bp probably included some level of technical homoplasy. The presence of stutter bands (i.e. PCR artefacts) might contribute to this. On scoring, this phenomenon could either produce rare alleles leading to the definition of non-informative bins, or cause erroneous results when the range of the stutter bands extended into another bin (e.g. oversplit bins or mis-scored absences and presences). Moreover, stutter bands generally showed lower fluorescence and their individual peaks might not be reproducible. Therefore, scoring parameters that treated a cluster of stutter bands as a single allele probably handled the situation more appropriately.
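To make this last point concrete, the sketch below clusters peaks that lie within a chosen bin width so that a run of stutter peaks is scored as a single allele. This is an illustration of the principle only, not RawGeno's binning algorithm; the 1 bp threshold and the peak sizes are arbitrary assumptions.

# Illustrative sketch only, not RawGeno's binning algorithm: peaks separated by less
# than 'min_gap' base pairs are merged into a single bin, so that a cluster of
# stutter peaks is treated as one allele rather than oversplit into several bins.
merge_close_peaks <- function(peak_sizes, min_gap = 1) {
  sizes  <- sort(peak_sizes)
  bin_id <- cumsum(c(TRUE, diff(sizes) > min_gap))  # open a new bin when the gap exceeds min_gap
  tapply(sizes, bin_id, mean)                       # one representative size per merged bin
}

# A 150 bp allele with stutter peaks at 149.3 and 149.7 bp collapses into one bin,
# while a genuine neighbouring allele at 152 bp remains separate.
merge_close_peaks(c(149.3, 149.7, 150.0, 152.0))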

