Probability genotype imputation method and integrated weighted lasso for QTL identification.

Demetrashvili N, Van den Heuvel ER, Wit EC - BMC Genet. (2013)

Bottom Line: The results confirm previously identified regions; however, several new markers are also found. Our imputation method shows higher accuracy in terms of sensitivity and specificity compared to an alternative imputation method. This means that under realistic missing data settings this methodology can be used for QTL identification.


Affiliation: Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen, Groningen 9747 AG, The Netherlands. n.demetrashvili@rug.nl.

ABSTRACT

Background: Many QTL studies have two common features: (1) marker information is often missing, and (2) among the many markers involved in the biological process, only a few are causal. In statistics, the second issue falls under the headings "sparsity" and "causal inference". The goal of this work is to develop a two-step statistical methodology for QTL mapping for markers with binary genotypes. The first step introduces a novel imputation method for missing genotypes. The outcomes of the proposed imputation method are probabilities, which serve as weights in the second step, a weighted lasso. Sparse phenotype inference is then employed to select a set of predictive markers for the trait of interest.

Results: Simulation studies validate the proposed methodology under a wide range of realistic settings. Furthermore, the methodology outperforms alternative imputation and variable selection methods in such studies. The methodology was applied to an Arabidopsis experiment containing 69 markers for 165 recombinant inbred lines of an F8 generation. The results confirm previously identified regions; however, several new markers are also found. On the basis of the inferred ROC behavior, these markers show good potential for being real, especially for the germination trait Gmax.

Conclusions: Our imputation method shows higher accuracy in terms of sensitivity and specificity compared to an alternative imputation method. Also, the proposed weighted lasso outperforms commonly practiced multiple regression as well as the traditional lasso and the adaptive lasso with three weighting schemes. This means that under realistic missing data settings this methodology can be used for QTL identification.


Figure 3: Possible scenarios for genotypes of three consecutive markers located at locations t0, t (at mark ?) and t1 for RIL i: (scenario 1) the y axis shows the probability of observing Sha at location t given that we observe Sha at locations t0 and t1; (scenario 2) the y axis shows the probability of observing Sha at location t given that we observe Bay at t0 and Sha at t1.

Mentions: This physical intertwining during meiosis naturally (though not necessarily) induces a Markov dependence structure, whereby the configuration of chromatids at a particular marker depends only on the neighboring configurations. With respect to the shape of the dependence structure, we note that it is natural for the absolute value of the derivative of the probability model to be minimal exactly halfway through the interval between two known markers. This reflects the fact that the amount of change of information is smallest when we are farthest from the known markers. This leads to four possible scenarios for a marker with immediate neighbors. Figure 3 shows two of these scenarios; the remaining two are mirror images (about the x-axis) of the ones depicted in the figure. These scenarios are meaningful only for markers that have both neighbors, that is, markers that are not at the edge of a chromosome. For edge markers, only two scenarios are possible (not depicted here). Considering the scenarios given in Figure 3, we propose a model that exhibits the same shape. We introduce a parameter α ∈ [0, ∞), which technically acts as a scaling parameter and biologically plays a role similar to the recombination rate. The probability model for a RIL i and a marker located in the middle of a chromosome is:
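The model's formula is not reproduced in this excerpt. Purely to illustrate the shape properties described above, the following Python sketch defines a hypothetical curve family. It is not the authors' model: the function name `prob_sha`, the power-law form, and the way `alpha` enters are all assumptions made for illustration. The curve matches the observed genotypes at the flanking markers and has its smallest absolute slope at the midpoint of the interval.

```python
def prob_sha(t, t0, t1, left, right, alpha=1.0):
    """Hypothetical P(Sha at t) given flanking genotypes at t0 and t1.

    `left` and `right` are "Sha" or "Bay". This is an illustrative curve,
    NOT the formula from the paper: it only mimics the described shape
    (absolute slope smallest at the midpoint between the known markers).
    """
    if not (t0 < t1 and t0 <= t <= t1):
        raise ValueError("need t0 < t1 and t0 <= t <= t1")
    if alpha < 0:
        raise ValueError("alpha must be non-negative")

    u = 2.0 * (t - t0) / (t1 - t0) - 1.0  # rescale to [-1, 1]; 0 is the midpoint
    bend = abs(u) ** (1.0 + alpha)        # 0 at the midpoint, 1 at either flank

    l = 1.0 if left == "Sha" else 0.0
    r = 1.0 if right == "Sha" else 0.0

    if l == r:
        # Flanks agree: equal to the flank value at both ends,
        # drifting toward 0.5 (maximal uncertainty) mid-interval.
        return 0.5 + (l - 0.5) * bend

    # Flanks differ: monotone transition from the left value to the right value.
    sign = 1.0 if u >= 0 else -1.0
    return 0.5 + (r - 0.5) * sign * bend


if __name__ == "__main__":
    # Scenario 1 (Sha, ?, Sha) and scenario 2 (Bay, ?, Sha) on a 0..10 interval
    for t in (0.0, 2.5, 5.0, 7.5, 10.0):
        print(t, prob_sha(t, 0.0, 10.0, "Sha", "Sha"),
              prob_sha(t, 0.0, 10.0, "Bay", "Sha"))
```

In this sketch, swapping the flanking genotypes gives the mirrored curves, which corresponds to the two scenarios not shown in Figure 3. With `alpha = 0` the curves are piecewise linear, and larger values flatten them around the midpoint.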

