Does replication groups scoring reduce false positive rate in SNP interaction discovery?

Toplak M, Curk T, Demsar J, Zupan B - BMC Genomics (2010)

Bottom Line: Recently, Gayan et al. (2008) proposed to reduce the number of false positives by combining results of interaction analysis performed on subsets of data (replication groups), rather than analyzing the entire data set directly. Because Gayan et al. do not compare their approach to the standard interaction analysis techniques, we here investigate whether replication groups indeed reduce the number of reported false positive interactions. Compared with the direct scoring approach, the use of replication groups does not reduce false positive rates, and may, depending on the data set, often perform worse.


Affiliation: Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia.

ABSTRACT

Background: Computational methods that infer single nucleotide polymorphism (SNP) interactions from phenotype data may uncover new biological mechanisms in non-Mendelian diseases. However, practical aspects of such analysis face many problems. Present experimental studies typically use SNP arrays with hundreds of thousands of SNPs but record only hundreds of samples. Candidate SNP pairs inferred by interaction analysis may include a high proportion of false positives. Recently, Gayan et al. (2008) proposed to reduce the number of false positives by combining results of interaction analysis performed on subsets of data (replication groups), rather than analyzing the entire data set directly. If performing as hypothesized, replication groups scoring could improve interaction analysis and also any type of feature ranking and selection procedure in systems biology. Because Gayan et al. do not compare their approach to the standard interaction analysis techniques, we here investigate if replication groups indeed reduce the number of reported false positive interactions.

Results: A set of simulated and false interaction-imputed experimental SNP data sets was used to compare the inference of SNP-SNP interactions by means of replication groups to the standard approach where the entire data set was directly used to score all candidate SNP pairs. In all our experiments, the inference of interactions from the entire data set (i.e., without using the replication groups) reported fewer false positives.

Conclusions: Compared with the direct scoring approach, the use of replication groups does not reduce false positive rates, and may, depending on the data set, often perform worse.
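
The two scoring strategies compared in the abstract can be illustrated with a minimal sketch. The abstract does not say which interaction score was used or how Gayan et al. combine results across replication groups, so everything below is an assumption made for illustration: the chi-square statistic on the joint genotype-by-phenotype table, the random split into k disjoint groups, the minimum-over-groups combination rule, and the function names `chi2_score`, `direct_score` and `replication_score`.

```python
import random
from collections import Counter

def chi2_score(samples):
    """Chi-square statistic of a two-SNP genotype vs. phenotype table.

    samples: list of ((a, b), label), where a and b are genotype codes
    and label is 0 (control) or 1 (disease). Illustrative score only.
    """
    n = len(samples)
    cell = Counter(samples)                  # observed count per (genotype, label)
    geno = Counter(g for g, _ in samples)    # row totals
    lab = Counter(y for _, y in samples)     # column totals
    score = 0.0
    for g, ng in geno.items():
        for y, ny in lab.items():
            expected = ng * ny / n
            score += (cell[(g, y)] - expected) ** 2 / expected
    return score

def direct_score(samples):
    """Direct approach: score the SNP pair on the entire data set."""
    return chi2_score(samples)

def replication_score(samples, k, rng):
    """Replication groups (assumed variant): split the samples into k
    disjoint groups, score each group, and keep the weakest score, so
    a pair ranks high only if it scores well in every group."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    groups = [shuffled[i::k] for i in range(k)]
    return min(chi2_score(g) for g in groups)
```

Under these assumptions, candidate SNP pairs would be ranked once by `direct_score` and once by `replication_score`, and the number of known non-interacting pairs near the top of each ranking would be compared. Each replication group sees only 1/k of the samples, which is one plausible reason the split could hurt rather than help.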

Figure 1: Disease penetrance models. Penetrance models used to simulate epistasis between two SNPs. Allele frequencies are denoted by p and q. For example, model 1 specifies that 10% of individuals with genotypes AABb, AaBB, Aabb or aaBb, and none of the individuals with other genotypes, have the disease.
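
As a minimal sketch, model 1 from the caption can be written as a 3x3 penetrance table and used to draw a balanced case-control sample. The table entries follow the caption. The rest is assumed for illustration: genotypes are drawn under Hardy-Weinberg proportions, p and q are taken to be the frequencies of alleles A and B, sampling is by rejection until both groups are full, and the names `MODEL_1`, `sample_genotype` and `simulate` are hypothetical.

```python
import random

# Penetrance table for model 1, as described in the Figure 1 caption.
# Index = number of lowercase alleles: AA/BB -> 0, Aa/Bb -> 1, aa/bb -> 2.
MODEL_1 = [
    [0.0, 0.1, 0.0],  # AA with BB, Bb, bb
    [0.1, 0.0, 0.1],  # Aa with BB, Bb, bb
    [0.0, 0.1, 0.0],  # aa with BB, Bb, bb
]

def sample_genotype(freq_upper, rng):
    """Draw a genotype code (0, 1 or 2 lowercase alleles), assuming
    Hardy-Weinberg proportions; freq_upper is the uppercase-allele frequency."""
    return sum(rng.random() >= freq_upper for _ in range(2))

def simulate(n_cases, n_controls, p, q, rng, model=MODEL_1):
    """Rejection-sample individuals until both groups are full."""
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        a = sample_genotype(p, rng)
        b = sample_genotype(q, rng)
        diseased = rng.random() < model[a][b]
        group, limit = (cases, n_cases) if diseased else (controls, n_controls)
        if len(group) < limit:
            group.append((a, b))
    return cases, controls
```

For example, `simulate(200, 200, 0.5, 0.5, random.Random(0))` returns 200 disease and 200 control genotype pairs for one interacting SNP pair; the allele frequencies 0.5 here are placeholders, not values taken from the figure.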

Mentions: We followed the data synthesis protocol proposed by Ritchie et al. (2003). The simulated data sets were generated according to six two-SNP epistasis models (see Figure 1). Unlike Ritchie et al. (2003), our data sets included multiple interactions, but each SNP interacted with at most one other SNP. Two types of data sets, differing in the number of SNPs, were crafted, each comprising 200 control and 200 disease samples:

