Detecting Genetic Interactions for Quantitative Traits Using m-Spacing Entropy Measure.

Yee J, Kwon MS, Jin S, Park T, Park M - Biomed Res Int (2015)

Bottom Line: Information gain based on an entropy measure has previously been successful in identifying genetic associations with binary traits. Hence, the information gain can be obtained for any phenotype distribution. Here, we show its use to successfully identify the main effect, as well as the genetic interactions, associated with a quantitative trait.

View Article: PubMed Central - PubMed

Affiliation: Department of Physiology and Biophysics, Eulji University, Daejeon, Republic of Korea.

ABSTRACT
A number of statistical methods for detecting gene-gene interactions have been developed for genetic association studies with binary traits. However, many phenotype measures are intrinsically quantitative, and categorizing continuous traits may not always be straightforward or meaningful. The association of gene-gene interactions with the observed distribution of such phenotypes needs to be investigated directly, without categorization. Information gain based on an entropy measure has previously been successful in identifying genetic associations with binary traits. We extend the usefulness of this information gain by proposing a nonparametric method for evaluating the conditional entropy of a quantitative phenotype given a genotype. Hence, the information gain can be obtained for any phenotype distribution. Because no functional form, such as a Gaussian, is assumed for the distribution of a trait, either overall or for a given genotype, this method is expected to be robust enough to apply to any phenotypic association data. Here, we show its use to successfully identify the main effect, as well as the genetic interactions, associated with a quantitative trait.



Figure 2: The n-independence and constant offset from the true value of the estimates averaged over all possible m values for each n. Each symbol represents the result of sampling from N(0, σ²). For each n, the number of samples, the estimated entropy values were averaged over all possible values of the sample-spacing m; ⟨m⟩ denotes this averaging, which should not depend on weighting, given the virtually identical standard deviations shown in Figure 1(b). Over a wide range of n, the estimated entropy stays effectively constant, demonstrating n-independence over the practical range of sample sizes. Moreover, the nearly flat line connecting the symbols shifts up or down exactly following the change in the true value indicated by the horizontal arrows. The rise at extremely small n should be within the tolerable error of any specific application, because the contribution of such a case to the conditional entropy would be suppressed by a weighting based on the marginal probability, which should be proportional to n.

Mentions: The estimator in (4) has both n and m as parameters. In genetic association studies, a sample size n of several hundred is common. When the conditional entropy is estimated, however, a minor allele may have a much smaller number of corresponding samples. Moreover, the choice of the sample-spacing m affects the resulting entropy estimate. An entropy estimation scheme is therefore needed that is independent of the number of samples and does not require choosing a particular value of the sample-spacing.

To illustrate this requirement, an ensemble of 3,000 sets of random deviates from N(0, 1²) was generated for each data point in Figure 1, where the mean and standard deviation of the estimates are plotted for each ensemble. In the left panel, Figure 1(a), m is fixed to 10 and 20 while n is varied. The analytic formula for the entropy of a normal distribution is [20], where e is Euler's number:

$$H = \ln\!\left(\sigma\sqrt{2\pi e}\right). \tag{5}$$

The calculated value of (5) is marked on the vertical axis by a horizontal arrow with the corresponding σ above it. The n-dependence of the estimator is evident in this plot: the estimate approaches the analytic value as n increases, with $\sqrt{n}$-consistency, as expected [24]. In Figure 1(b), n is fixed to 400 while m is varied. Here the estimated entropy again changes in value throughout the possible range of m, and the estimate is always smaller than the analytically calculated value. Therefore, assigning a particular value to m, such as the typical choice $m = \sqrt{n}$ [25], would not be appropriate in this sampling range.

Because of these n- and m-dependences, we modify the estimator in (4) as follows:

$$H_{\langle m\rangle,n} = \frac{1}{n-1}\sum_{m=1}^{n-1}\left[\frac{1}{n-m}\sum_{k=1}^{n-m}\ln\!\left(\frac{n}{m}\bigl(X_{n,k+m}-X_{n,k}\bigr)\right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m\right]. \tag{6}$$

In this modification, the entropy estimator is averaged over the possible m values for each n, which is denoted by ⟨m⟩. This estimator is used to plot entropy versus number of samples in Figure 2. Over a wide range of n, it yields very stable values, in contrast to Figure 1(a). An increase in the extremely small n range should be within the tolerable error in a genome-wide association application, as the contribution of such a minor allele to the conditional entropy would be suppressed by the weighting factor of the marginal probability, which is proportional to the number of corresponding samples. Analytically obtained entropy values for N(0, σ²), for three different σ's, are marked on the right-hand vertical axis. Regardless of the value of σ, the difference Δ between the analytic value and the estimator's value stays essentially the same. Because the association study measures the difference between the entropy and the corresponding conditional entropy, stability is a more critical issue than the absolute value of the estimates, so compensation for Δ is unnecessary as long as it is stable. Likewise, the underestimation of the entropy shown in the plot should have little effect on the association strength. An entropy estimator has thus been set up that achieves practical n-independence without the need to find a proper sample-spacing.
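To make the scheme concrete, the following is a minimal sketch in Python of a bias-corrected m-spacing entropy estimate and its ⟨m⟩-averaged form (6), together with the n_g-weighted conditional entropy described above. It assumes NumPy and SciPy (scipy.special.digamma supplies Γ′(m)/Γ(m)); the function names, the tie-handling guard, and the genotype-grouping scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.special import digamma  # digamma(m) = Gamma'(m)/Gamma(m)

def h_m_spacing(x, m):
    """Bias-corrected m-spacing entropy estimate of a 1-D sample (cf. eq. (4))."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    spacings = xs[m:] - xs[:n - m]                  # X_(k+m) - X_(k), k = 1..n-m
    spacings = np.maximum(spacings, np.finfo(float).tiny)  # guard against ties
    return np.mean(np.log((n / m) * spacings)) - digamma(m) + np.log(m)

def h_m_averaged(x):
    """Eq. (6): average the m-spacing estimate over all admissible m = 1..n-1."""
    n = len(x)
    return np.mean([h_m_spacing(x, m) for m in range(1, n)])

def conditional_entropy(y, g):
    """H(Y | G): per-genotype entropies weighted by the marginal probability n_g/n."""
    y, g = np.asarray(y, dtype=float), np.asarray(g)
    total = 0.0
    for level in np.unique(g):
        yg = y[g == level]
        if yg.size >= 2:                            # at least one spacing is needed
            total += (yg.size / y.size) * h_m_averaged(yg)
    return total

# Check against the analytic value H = ln(sigma * sqrt(2*pi*e)) for N(0, sigma^2),
# as in Figure 2.
rng = np.random.default_rng(7)
sigma, n = 1.0, 400
x = rng.normal(0.0, sigma, n)
print(h_m_averaged(x), np.log(sigma * np.sqrt(2.0 * np.pi * np.e)))
```

Under these assumptions, the information gain for a marker would be the difference h_m_averaged(y) − conditional_entropy(y, g); since both terms carry approximately the same stable offset Δ, the offset largely cancels in the difference, consistent with the stability argument above.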

