Gene-based bin analysis of genome-wide association studies.

Omont N, Forner K, Lamarine M, Martin G, Képès F, Wojcik J - BMC Proc (2008)

Bottom Line: However, new analytical methods have to be developed to face the problems induced by this data scale-up, such as statistical multiple testing, data quality control and computational tractability. We present a novel method to analyze the results of genome-wide association studies. In practice, the method overcomes the scale-up problems and makes it possible to identify new putative regions statistically associated with the disease.

Affiliation: Merck Serono International S.A., 9 chemin des Mines, 1202 Geneva, Switzerland.

ABSTRACT

Background: With the improvement of genotyping technologies and the exponentially growing number of available markers, case-control genome-wide association studies promise to be a key tool for the investigation of complex diseases. However, new analytical methods have to be developed to face the problems induced by this data scale-up, such as statistical multiple testing, data quality control and computational tractability.

Results: We present a novel method to analyze the results of genome-wide association studies. The algorithm is based on a Bayesian model that integrates genotyping errors and genomic structure dependencies. p-values are assigned to genomic regions termed bins, which are defined from a gene-biased partitioning of the genome, and the false-discovery rate is estimated. We have applied this algorithm to data from three genome-wide association studies of Multiple Sclerosis.

Conclusion: In practice, the method overcomes the scale-up problems and makes it possible to identify new putative regions statistically associated with the disease.

Figure 5: FDR versus number of bins selected using L3 naive likelihood (top) and L2 two-marker likelihood (bottom). A: solid, B: dash, C: dash dot, ABC: thick.

Mentions: Figure 4 shows the FDR plotted against p-values computed using the two-marker likelihood L2 or the naive likelihood L3 for the three-collection design. The two-marker FDR remains below the naive FDR up to a p-value level of 0.01, and both increase slowly towards 1. Plots of FDR against the number of selected SNPs are detailed by collection in Figure 5. As observed in other studies [13], the FDR is not monotonic with the p-value. The oscillations are smaller for the three-collection design, possibly because of the threefold increase in sample size. With an FDR threshold of 5%, only between 2 and 6 bins are selected, depending on the collections and likelihood considered (Table 2, top). Most of them are located in the Major Histocompatibility Complex (MHC) region, mainly in the class III subregion. The class II subregion is known to be associated with MS [14]. The three-collection design selects more associated bins than the one-collection designs, regardless of the likelihood. Results with a less stringent FDR threshold of 50% (Table 2, middle) show a greater power of L2 over L3 for the three-collection design. However, FDR is misleading in this study because the MHC region is known to be associated with MS. This leads to an overestimation of the FDR at which bins outside of this region are selected. The MHC region contains 12 of the 33 bins selected by L2 on the three-collection design. As a result, only 10 bins, and not 21, are selected (Table 2, bottom).
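The two observations above (an FDR that is not monotonic with the p-value, and a bin count that changes once a strongly associated region is set aside) can be reproduced on toy numbers. The sketch below is only an illustration: it assumes a Benjamini-Hochberg-style estimate, FDR(p) = m * p / k, with m the number of bins tested and k the number of bins with a p-value at or below p. The excerpt does not state which estimator the authors use, and the function names, the bin labels and the p-values are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def raw_fdr_curve(p_values: List[float]) -> List[Tuple[float, int, float]]:
    """Return (p, k, fdr) per sorted p-value, with fdr = min(1, m * p / k).

    Assumption: a Benjamini-Hochberg-style estimate, not necessarily
    the estimator used in the paper.
    """
    m = len(p_values)
    curve = []
    for k, p in enumerate(sorted(p_values), start=1):
        curve.append((p, k, min(1.0, m * p / k)))
    return curve


def count_selected(p_values: List[float], threshold: float) -> int:
    """Largest k whose estimated FDR is at or below the threshold (0 if none)."""
    selected = 0
    for _, k, fdr in raw_fdr_curve(p_values):
        if fdr <= threshold:
            selected = k
    return selected


def drop_region(bins: Dict[str, float], region: Set[str]) -> List[float]:
    """p-values of the bins whose (hypothetical) label is not in `region`."""
    return [p for name, p in bins.items() if name not in region]


# Made-up bins: two strongly associated ones plus four others.
toy_bins = {
    "strong_1": 0.001, "strong_2": 0.002,
    "bin_a": 0.04, "bin_b": 0.041, "bin_c": 0.5, "bin_d": 0.9,
}
all_p = list(toy_bins.values())

for p, k, fdr in raw_fdr_curve(all_p):
    print(f"p={p:<6} k={k} estimated FDR={fdr:.4f}")

print("selected at 7%:", count_selected(all_p, 0.07))
outside = drop_region(toy_bins, {"strong_1", "strong_2"})
print("selected at 7% without the strong bins:", count_selected(outside, 0.07))
```

On these toy values the estimate rises at the third bin and falls again at the fourth, which is the kind of oscillation described above. Recomputing the estimate after removing the two strong bins also selects fewer of the remaining bins than the full analysis did, in the same direction as the 21 versus 10 contrast reported for Table 2.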

