Investigating the effect of paralogs on microarray gene-set analysis.

Faure AJ, Seoighe C, Mulder NJ - BMC Bioinformatics (2011)

Bottom Line: In order to interpret the results obtained from a microarray experiment, researchers often shift focus from analysis of individual differentially expressed genes to analyses of sets of genes. The Indygene tool efficiently removes paralogy relationships from a given dataset and we found that such a reduction, performed prior to GSA, has the ability to generate significantly different results that often represent novel and plausible biological hypotheses. This was demonstrated for three different GSA approaches when applied to the reanalysis of previously published microarray datasets and suggests that the redundancy and non-independence of paralogs is an important consideration when dealing with GSA methodologies.

Affiliation: Computational Biology Group, Department of Clinical Laboratory Sciences, University of Cape Town, Cape Town, South Africa. andrefau@ebi.ac.uk

ABSTRACT

Background: In order to interpret the results obtained from a microarray experiment, researchers often shift focus from analysis of individual differentially expressed genes to analyses of sets of genes. These gene-set analysis (GSA) methods use previously accumulated biological knowledge to group genes into sets and then aim to rank these gene sets in a way that reflects their relative importance in the experimental situation in question. We suspect that the presence of paralogs affects the ability of GSA methods to accurately identify the most important sets of genes for subsequent research.

Results: We show that paralogs, which typically have high sequence identity and similar molecular functions, also exhibit high correlation in their expression patterns. We investigate this correlation as a potential confounding factor common to current GSA methods using Indygene (http://www.cbio.uct.ac.za/indygene), a web tool that reduces a supplied list of genes so that it includes no pairwise paralogy relationships above a specified sequence similarity threshold. We use the tool to reanalyse previously published microarray datasets and determine the potential utility of accounting for the presence of paralogs.

Conclusions: The Indygene tool efficiently removes paralogy relationships from a given dataset and we found that such a reduction, performed prior to GSA, has the ability to generate significantly different results that often represent novel and plausible biological hypotheses. This was demonstrated for three different GSA approaches when applied to the reanalysis of previously published microarray datasets and suggests that the redundancy and non-independence of paralogs is an important consideration when dealing with GSA methodologies.
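
The kind of reduction described in the Results (a gene list with no pairwise paralogy relationships above a similarity threshold) can be illustrated with a minimal sketch. This is not Indygene's actual algorithm, which is not described here. It is one simple greedy strategy, and the function name `reduce_paralogs` and the `(gene_a, gene_b, similarity)` input format are assumptions made for illustration only.

```python
from collections import defaultdict


def reduce_paralogs(genes, paralog_pairs, threshold):
    """Return a subset of `genes` with no paralog pair above `threshold`.

    paralog_pairs: iterable of (gene_a, gene_b, similarity) tuples
    (a hypothetical input format, not Indygene's).
    """
    kept = set(genes)
    neighbours = defaultdict(set)
    # Keep only paralogy relationships above the threshold between supplied genes.
    for a, b, similarity in paralog_pairs:
        if similarity > threshold and a in kept and b in kept and a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    # Greedy choice (an assumption): repeatedly drop the gene that still has
    # the most paralogs, until no paralogous pair remains in the kept set.
    while True:
        candidates = [g for g in neighbours if neighbours[g]]
        if not candidates:
            break
        worst = max(candidates, key=lambda g: (len(neighbours[g]), g))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        kept.discard(worst)
    return kept
```

Dropping the most-connected gene first tends to keep more genes than dropping arbitrary members of each pair, but it does not guarantee the largest possible paralog-free set.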

Figure 5: Removing paralogs leads to significantly different GSA results. Estimated distribution for τ used to determine whether the paralog-reduced dataset (red vertical line at τ = 0.65) produces significantly different GSA results. The abscissa gives Kendall's correlation (τ) between the ranked GO term lists before and after randomly reducing the dataset by 6126 genes. The black line indicates the approximate probability density function of the distribution, estimated using a Gaussian smoothing kernel.

Mentions: We performed GSA on the original microarray dataset using an ORA approach (see Methods) and compared the resulting list of GO SLIM terms from the Biological Process ontology to that obtained after reducing the dataset by 6126 genes using Indygene. The obtained correlation value of τ = 0.65 quantifies the difference between the ranking of terms in the two lists. To determine whether this difference was statistically significant and not merely related to the removal of a large number of genes, we estimated the distribution for τ using a Monte Carlo sampling procedure (see Figure 5). When compared to this distribution, a nonparametric P-value ≈ 0.007 was obtained, indicating that the presence of paralogs can significantly affect results from GSA. In other words, a paralog reduction as performed by Indygene can result in a significantly different GSA term ranking, not simply attributable to the reduction in the number of genes.
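
The Monte Carlo comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: `score_terms` is a hypothetical stand-in for whichever GSA method produces one score per GO term, the tau-a variant of Kendall's correlation is used for simplicity, and the one-sided, add-one form of the empirical P-value is a choice made here rather than something the text specifies.

```python
import random
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def monte_carlo_p_value(genes, reduced_genes, score_terms, n_samples=1000, seed=0):
    """Empirical P-value for tau between full and paralog-reduced term rankings.

    genes, reduced_genes: lists of gene identifiers.
    score_terms(gene_list) -> list of per-term scores in a fixed term order
    (hypothetical; substitute the GSA method of interest).
    """
    rng = random.Random(seed)
    full_scores = score_terms(genes)
    observed_tau = kendall_tau(full_scores, score_terms(reduced_genes))
    n_keep = len(reduced_genes)
    null_taus = []
    for _ in range(n_samples):
        # Remove the same number of genes, but chosen at random.
        random_subset = rng.sample(genes, n_keep)
        null_taus.append(kendall_tau(full_scores, score_terms(random_subset)))
    # One-sided: how often does a random reduction agree with the original
    # ranking no better than the paralog reduction does?
    extreme = sum(1 for t in null_taus if t <= observed_tau)
    p_value = (extreme + 1) / (n_samples + 1)
    return observed_tau, null_taus, p_value
```

A small P-value here means that random removals of equally many genes rarely disturb the term ranking as much as the paralog reduction does. The returned `null_taus` are the values whose density a plot like Figure 5 would show.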

