MHC genotyping of non-model organisms using next-generation sequencing: a new methodology to deal with artefacts and allelic dropout.

Sommer S, Courtiol A, Mazzoni CJ - BMC Genomics (2013)

Bottom Line: We developed the analytical methodology and validated a data processing procedure which can be applied to any organism. It allows the separation of true alleles from artefacts and the evaluation of genotyping reliability, which in addition to artefacts considers for the first time the possibility of allelic dropout due to unbalanced amplification efficiencies across alleles. Combining our workflow with the study of amplification efficiency offers the chance for researchers to evaluate enormous amounts of NGS-generated data in great detail, improving confidence over the downstream analyses and subsequent applications.


Affiliation: Evolutionary Genetics, Leibniz-Institute for Zoo and Wildlife Research, Alfred-Kowalke-Straße 17, D-10315 Berlin, Germany. sommer@izw-berlin.de

ABSTRACT

Background: The Major Histocompatibility Complex (MHC) is the most important genetic marker for studying patterns of adaptive genetic variation that determine pathogen resistance and associated life history decisions. It is used in many research fields, ranging from human medicine and molecular evolution to functional biodiversity studies. Correct assessment of individual allelic diversity and the underlying structural sequence variation is the basic requirement for addressing the functional importance of MHC variability. Next-generation sequencing (NGS) technologies are likely to replace traditional genotyping methods to a great extent in the near future, but the first empirical studies strongly indicate the need for a rigorous quality control pipeline. Strict approaches to data validation and allele calling are required to distinguish true alleles from artefacts.

Results: We developed an analytical methodology and validated a data processing procedure that can be applied to any organism. It allows true alleles to be separated from artefacts and genotyping reliability to be evaluated; in addition to artefacts, it considers for the first time the possibility of allelic dropout due to unbalanced amplification efficiencies across alleles. Finally, we developed a method to assess the confidence level of each genotype a posteriori, which helps to decide which alleles and individuals should be included in downstream analyses. This method could also be used to optimize experimental designs in the future.

Conclusions: Combining our workflow with the study of amplification efficiency offers researchers the chance to evaluate enormous amounts of NGS-generated data in great detail, improving confidence in downstream analyses and subsequent applications.



Figure 6: Influence of the number of alleles per individual (A) and of the lowest read frequency (B) on the minimum number of reads required to determine a complete genotype with a 99.9% confidence level. Each point represents a different individual (N = 36). The minimum number of reads required (T1Resampled) was computed a posteriori by resampling both amplicon replicates (T1Resampled1, T1Resampled2) and retaining the larger of the two values. The dashed line in (A) shows T1Galan Simul, computed according to the definition of allelic dropout given by Galan et al. [13] (i.e. an allele associated with two reads or fewer; see text for details).
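The resampling idea behind the caption can be illustrated with a small Monte Carlo sketch. It is not the authors' implementation: the function names, the with-replacement resampling scheme, the number of simulations and the search step are all assumptions made for illustration. Only the dropout rule (an allele with two reads or fewer counts as dropped), the 99.9% confidence level and the "keep the larger of the two replicates" step come from the caption.

```python
import random
from collections import Counter


def detection_rate(read_counts, n_reads, min_reads=3, n_sim=1000, rng=None):
    """Share of resampled amplicons in which every allele keeps >= min_reads reads.

    Assumption: reads are resampled with replacement, with probabilities
    proportional to the observed read counts of one amplicon replicate.
    """
    if any(count <= 0 for count in read_counts.values()):
        raise ValueError("every allele needs a positive observed read count")
    if rng is None:
        rng = random.Random(0)
    alleles = list(read_counts)
    weights = [read_counts[a] for a in alleles]
    complete = 0
    for _ in range(n_sim):
        draw = Counter(rng.choices(alleles, weights=weights, k=n_reads))
        # An allele with two reads or fewer is treated as dropped out.
        if all(draw[a] >= min_reads for a in alleles):
            complete += 1
    return complete / n_sim


def min_reads_resampled(read_counts, confidence=0.999, min_reads=3,
                        step=5, n_sim=1000, seed=0):
    """Smallest tested read depth whose simulated detection rate reaches `confidence`.

    The search moves in increments of `step`, so the result is approximate.
    """
    rng = random.Random(seed)
    n_reads = min_reads * len(read_counts)
    while detection_rate(read_counts, n_reads, min_reads, n_sim, rng) < confidence:
        n_reads += step
    return n_reads


def min_reads_both_replicates(rep1, rep2, **kwargs):
    """Resample each replicate separately and retain the larger requirement."""
    return max(min_reads_resampled(rep1, **kwargs),
               min_reads_resampled(rep2, **kwargs))


if __name__ == "__main__":
    # Hypothetical read counts for one individual in two amplicon replicates.
    replicate_1 = {"allele_1": 50, "allele_2": 50}
    replicate_2 = {"allele_1": 90, "allele_2": 10}
    print(min_reads_both_replicates(replicate_1, replicate_2))
```

With only 1000 simulations per depth the estimate is noisy near the 99.9% threshold; a larger `n_sim` and a smaller `step` tighten it at the cost of run time.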

Mentions: Subsequently, we further relaxed the assumption that amplification efficiency is fixed for each allele by allowing each allele to have a different amplification efficiency in each amplicon. Under these circumstances, T1 values (called T1Resampled) were between 0.93 and 6.28 times (mean = 1.76) as high as T1Galan Simul and between 0.76 and 2.76 times (mean = 1.28) as high as T1Simul VarAmplEff (Table 2). T1Resampled increased with the number of alleles observed, as predicted by Galan et al. for T1Galan Negmult. Nonetheless, knowing only the number of alleles does not seem sufficient to predict T1 accurately (Figure 6A; Spearman rank correlation between T1Galan Simul and T1Resampled, rho = +0.54, P < 0.001, N = 36). Figure 6B shows, however, that knowing the lowest observed read frequency in a genotype allows T1Resampled to be predicted almost perfectly (rho = −0.996, P < 0.001, N = 36). Interestingly, the lowest read frequency can itself be predicted accurately (rho = +0.79, P < 0.001, N = 36) from the ratio between the lowest allele amplification efficiency and the sum of amplification efficiencies across all alleles constituting a genotype. Thus, in our system, T1Resampled values can be predicted accurately from the lowest amplification efficiency within a genotype (rho = −0.80, P < 0.001, N = 36).
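One way to see why the lowest read frequency could be such a strong predictor is to look only at the rarest allele and treat its read count as binomial. The sketch below is an approximation under that assumption, not the procedure used in the paper: it ignores dropout of the other alleles, and the function names are hypothetical. It uses the dropout rule quoted above (two reads or fewer) and the 99.9% confidence level.

```python
from math import comb


def dropout_probability(n_reads, freq, max_dropout_reads=2):
    """P(an allele with read frequency `freq` gets <= max_dropout_reads of n_reads reads).

    Assumption: the allele's read count is Binomial(n_reads, freq).
    """
    return sum(
        comb(n_reads, k) * freq**k * (1 - freq) ** (n_reads - k)
        for k in range(max_dropout_reads + 1)
    )


def min_reads_for_lowest_allele(freq, confidence=0.999, max_dropout_reads=2):
    """Smallest read depth at which the rarest allele escapes dropout with `confidence`."""
    if not 0 < freq <= 1:
        raise ValueError("freq must be in (0, 1]")
    n_reads = max_dropout_reads + 1
    while dropout_probability(n_reads, freq, max_dropout_reads) > 1 - confidence:
        n_reads += 1
    return n_reads


if __name__ == "__main__":
    # Hypothetical lowest read frequencies within a genotype.
    for lowest_freq in (0.5, 0.2, 0.1, 0.05):
        print(lowest_freq, min_reads_for_lowest_allele(lowest_freq))
```

Under this approximation the required depth rises steeply as the lowest read frequency falls, which is in the same direction as the strong negative rank correlation reported for Figure 6B.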

