MHC genotyping of non-model organisms using next-generation sequencing: a new methodology to deal with artefacts and allelic dropout.

Sommer S, Courtiol A, Mazzoni CJ - BMC Genomics (2013)

Bottom Line: We developed the analytical methodology and validated a data processing procedure which can be applied to any organism. It allows the separation of true alleles from artefacts and the evaluation of genotyping reliability, which in addition to artefacts considers for the first time the possibility of allelic dropout due to unbalanced amplification efficiencies across alleles. Combining our workflow with the study of amplification efficiency offers the chance for researchers to evaluate enormous amounts of NGS-generated data in great detail, improving confidence over the downstream analyses and subsequent applications.

Affiliation: Evolutionary Genetics, Leibniz-Institute for Zoo and Wildlife Research, Alfred-Kowalke-Straße 17, D-10315 Berlin, Germany. sommer@izw-berlin.de

ABSTRACT

Background: The Major Histocompatibility Complex (MHC) is the most important genetic marker to study patterns of adaptive genetic variation determining pathogen resistance and associated life history decisions. It is used in many different research fields, ranging from human medicine and molecular evolution to functional biodiversity studies. Correct assessment of the individual allelic diversity pattern and the underlying structural sequence variation is the basic requirement to address the functional importance of MHC variability. Next-generation sequencing (NGS) technologies are likely to replace traditional genotyping methods to a great extent in the near future, but the first empirical studies strongly indicate the need for a rigorous quality control pipeline. Strict approaches for data validation and allele calling to distinguish true alleles from artefacts are required.

Results: We developed the analytical methodology and validated a data processing procedure which can be applied to any organism. It allows the separation of true alleles from artefacts and the evaluation of genotyping reliability, which in addition to artefacts considers for the first time the possibility of allelic dropout due to unbalanced amplification efficiencies across alleles. Finally, we developed a method to assess the confidence level per genotype a posteriori, which helps to decide which alleles and individuals should be included in any further downstream analyses. The latter method could also be used for optimizing experimental designs in the future.

Conclusions: Combining our workflow with the study of amplification efficiency offers the chance for researchers to evaluate enormous amounts of NGS-generated data in great detail, improving confidence in the downstream analyses and subsequent applications.

Figure 4: Frequency distribution for all final cluster (= variant) classifications. The cluster frequencies within their respective amplicons are indicated based on the final classification categories after the allele and artefact identification workflow. The frequencies on the x axis represent the proportion of reads within an amplicon (i.e. intra-amplicon frequency) that represent a given variant or cluster, while the y axis shows the total number of clusters over all amplicons for each intra-amplicon frequency range. The grey zone indicates the overlapping zone of artefacts and putative alleles, i.e. the zone between the most frequent variant classified as ‘putative artefact’ and the least frequent variant among the ‘putative alleles’ within an amplicon. The dotted line at 4.37% represents the threshold T2 according to Galan et al. [13] to separate putative alleles from putative artefacts. The most frequent artefact within an amplicon represented 5.4% of the total number of reads of this specific amplicon.

Mentions: The variant frequencies within each amplicon (i.e. cluster frequencies) were analysed according to the classification categories of Figure 2. Figure 4 shows the prevalence of artefacts over alleles in the total number of clusters (artefactual clusters: 3674; putative allele clusters: 390). Although we confirmed only between 2-9 (3-11) alleles per individual (Figure 3), the number of variants varied between 6 and 126 (57 ± 26) per amplicon. The intra-amplicon classification for ‘putative artefacts’ was clearly dominated by ‘chimeras’ (37.7 ± 22.3), followed by ‘1-2 bp diff’ (9.7 ± 10.5) and ‘> 2 bp diff’ (3.7 ± 3) (see Table 1 for definitions). The most frequent variant classified as ‘putative artefact’ represented 5.4% of an amplicon, and the least frequent variant among the ‘putative alleles’ represented only 1.5% of the total number of reads within an amplicon (after excluding all intra-amplicon singletons).
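
The overlap described above, with an artefact at 5.4% and a putative allele at 1.5%, implies that no single frequency cut-off can separate the two classes. The following Python sketch illustrates this with a toy amplicon. It is not the authors' workflow: the read counts, variant names, labels and function names are all hypothetical, and it assumes that intra-amplicon frequencies are computed over the reads remaining after singletons are dropped. The toy counts are chosen so that the edges of the grey zone coincide with the 1.5% and 5.4% values quoted in the text; only the 4.37% threshold T2 is taken from the figure caption.

```python
# Threshold T2 quoted in the Figure 4 caption (Galan et al. [13]).
T2 = 0.0437


def intra_amplicon_frequencies(read_counts):
    """Return variant -> proportion of reads within one amplicon.

    Assumption: singletons (variants seen once) are removed first and
    the proportions are taken over the remaining reads.
    """
    kept = {v: n for v, n in read_counts.items() if n > 1}
    total = sum(kept.values())
    return {v: n / total for v, n in kept.items()}


def overlap_zone(freqs, labels):
    """Return (lowest allele freq, highest artefact freq) if they overlap.

    `labels` stands in for a final classification of each variant as
    "allele" or "artefact"; it is supplied by hand here.
    """
    alleles = [f for v, f in freqs.items() if labels[v] == "allele"]
    artefacts = [f for v, f in freqs.items() if labels[v] == "artefact"]
    if not alleles or not artefacts:
        return None
    low, high = min(alleles), max(artefacts)
    return (low, high) if low < high else None


def threshold_disagreements(freqs, labels, threshold=T2):
    """List variants that a fixed frequency threshold would call differently."""
    wrong = []
    for v, f in freqs.items():
        called = "allele" if f >= threshold else "artefact"
        if called != labels[v]:
            wrong.append(v)
    return sorted(wrong)


# Hypothetical amplicon: 1000 reads after the singleton is dropped.
reads = {"a1": 520, "a2": 381, "a3": 15, "chim1": 54, "err1": 30, "err2": 1}
labels = {
    "a1": "allele", "a2": "allele", "a3": "allele",
    "chim1": "artefact", "err1": "artefact", "err2": "artefact",
}

freqs = intra_amplicon_frequencies(reads)
zone = overlap_zone(freqs, labels)
mismatches = threshold_disagreements(freqs, labels)
print(zone, mismatches)
```

In this toy case the fixed threshold disagrees with the hand-assigned labels for the low-frequency allele `a3` and the high-frequency chimera `chim1`, which is the situation the grey zone in Figure 4 depicts.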

