Statistical resolution of ambiguous HLA typing data.

Listgarten J, Brumme Z, Kadie C, Xiaojiang G, Walker B, Carrington M, Goulder P, Heckerman D - PLoS Comput. Biol. (2008)

Bottom Line: Our improvements are achieved by using a parsimonious parameterization for haplotype distributions and by smoothing the maximum likelihood (ML) solution. These improvements make it possible to scale the refinement to a larger number of alleles and loci in a more computationally efficient and stable manner. We also show how to augment our method in order to incorporate ethnicity information (as HLA allele distributions vary widely according to race/ethnicity as well as geographic area), and demonstrate the potential utility of this experimentally.


Affiliation: Microsoft Research, Redmond, Washington, United States of America. jennl@microsoft.com

ABSTRACT
High-resolution HLA typing plays a central role in many areas of immunology, such as in identifying immunogenetic risk factors for disease, in studying how the genomes of pathogens evolve in response to immune selection pressures, and also in vaccine design, where identification of HLA-restricted epitopes may be used to guide the selection of vaccine immunogens. Perhaps one of the most immediate applications is in direct medical decisions concerning the matching of stem cell transplant donors to unrelated recipients. However, high-resolution HLA typing is frequently unavailable due to its high cost or the inability to re-type historical data. In this paper, we introduce and evaluate a method for statistical, in silico refinement of ambiguous and/or low-resolution HLA data. Our method, which requires an independent, high-resolution training data set drawn from the same population as the data to be refined, uses linkage disequilibrium in HLA haplotypes as well as four-digit allele frequency data to probabilistically refine HLA typings. Central to our approach is the use of haplotype inference. We introduce new methodology to this area, improving upon the Expectation-Maximization (EM)-based approaches currently used within the HLA community. Our improvements are achieved by using a parsimonious parameterization for haplotype distributions and by smoothing the maximum likelihood (ML) solution. These improvements make it possible to scale the refinement to a larger number of alleles and loci in a more computationally efficient and stable manner. We also show how to augment our method in order to incorporate ethnicity information (as HLA allele distributions vary widely according to race/ethnicity as well as geographic area), and demonstrate the potential utility of this experimentally. A tool based on our approach is freely available for research purposes at http://microsoft.com/science.

Figure 4 (pcbi-1000016-g004): Sensitivity to Variable Ordering. The top two rows are for the Hispanic data set; the bottom two rows are for the European data set. Within each pair, the top row shows the geometric mean probabilities and the bottom row shows the percent correct MAP predictions. The numbers of masked alleles in the Hispanic 30% mask and locus masks (A, B, C) were 306 and 354, respectively. The numbers of masked alleles in the European 30% mask and locus masks (A, B, C) were 2669 and 3012, respectively.
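The two quantities reported in the figure can be illustrated with a minimal sketch. It assumes that "geometric mean probability" is the geometric mean of the probabilities the model assigns to the true (masked) alleles, and that a MAP prediction is the allele with the highest posterior probability. The function names, allele labels, and posterior values are hypothetical and are not taken from the paper.

```python
import math

def geometric_mean_probability(true_probs):
    """Geometric mean of the probabilities assigned to the true alleles."""
    # exp of the mean log-probability; assumes every probability is > 0
    return math.exp(sum(math.log(p) for p in true_probs) / len(true_probs))

def percent_correct_map(posteriors, truths):
    """Percentage of masked alleles whose highest-probability call is the true allele."""
    correct = sum(
        1 for post, truth in zip(posteriors, truths)
        if max(post, key=post.get) == truth
    )
    return 100.0 * correct / len(truths)

# Hypothetical posteriors over four-digit alleles for three masked positions
posteriors = [
    {"A*0201": 0.8, "A*0205": 0.2},
    {"B*0702": 0.5, "B*0705": 0.25, "B*0706": 0.25},
    {"C*0401": 0.4, "C*0407": 0.6},
]
truths = ["A*0201", "B*0702", "C*0401"]

true_probs = [post[truth] for post, truth in zip(posteriors, truths)]
gm = geometric_mean_probability(true_probs)   # cube root of 0.8 * 0.5 * 0.4
pc = percent_correct_map(posteriors, truths)  # two of three MAP calls are correct
```

The geometric mean rewards well-calibrated probabilities on every masked allele, whereas the MAP percentage only checks the top-ranked call, so the two rows of the figure can move independently.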

Mentions: Because we use softmax regression functions in our haplotype model, the order in which we apply the chain rule (Equation 2) to our loci will have an effect on predictive accuracy. We examined the sensitivity of performance to variable ordering on three loci (A, B, C) using the European and Hispanic data sets. The results are shown in Figure 4, in which a locus order of ‘B A C’ means we used p(B) p(A | B) p(C | A, B). The experiments labeled ‘30% mask’ denote the performance using the 30% random masking procedure we used in our earlier experiments. Additionally, we systematically masked all (and only) A alleles (‘A mask’), and separately, all and only B alleles (‘B mask’), and all and only C alleles (‘C mask’). This procedure allows us to see whether the variable ordering differentially affects our ability to predict particular loci. Statistical significance was measured only on the difference in test log likelihoods.

