Global meta-analysis of transcriptomics studies.

Caldas J, Vinga S - PLoS ONE (2014)

Bottom Line: Current methods are based on breaking down studies into multiple comparisons between phenotypes (e.g. disease vs. healthy), based on the studies' experimental designs, followed by computing the overlap between the resulting differential expression signatures. Finally, we highlight via case studies how the framework can be used to derive novel biological hypotheses regarding related studies and the genes that drive those connections. Our proposed statistical framework shows that it is possible to perform a meta-analysis of transcriptomics studies with arbitrary experimental designs by deriving global expression features rather than decomposing studies into multiple phenotype comparisons.


Affiliation: INESC-ID, Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento, Lisboa, Portugal.

ABSTRACT
Transcriptomics meta-analysis aims at re-using existing data to derive novel biological hypotheses, and is motivated by the public availability of a large number of independent studies. Current methods are based on breaking down studies into multiple comparisons between phenotypes (e.g. disease vs. healthy), based on the studies' experimental designs, followed by computing the overlap between the resulting differential expression signatures. While useful, in this methodology each study yields multiple independent phenotype comparisons, and connections are established not between studies, but rather between subsets of the studies corresponding to phenotype comparisons. We propose a rank-based statistical meta-analysis framework that establishes global connections between transcriptomics studies without breaking down studies into sets of phenotype comparisons. By using a rank product method, our framework extracts global features from each study, corresponding to genes that are consistently among the most expressed or differentially expressed genes in that study. Those features are then statistically modelled via a term-frequency inverse-document frequency (TF-IDF) model, which is then used for connecting studies. Our framework is fast and parameter-free; when applied to large collections of Homo sapiens and Streptococcus pneumoniae transcriptomics studies, it performs better than similarity-based approaches in retrieving related studies, using a Medical Subject Headings gold standard. Finally, we highlight via case studies how the framework can be used to derive novel biological hypotheses regarding related studies and the genes that drive those connections. Our proposed statistical framework shows that it is possible to perform a meta-analysis of transcriptomics studies with arbitrary experimental designs by deriving global expression features rather than decomposing studies into multiple phenotype comparisons.
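
The TF-IDF step described in the abstract can be illustrated with a minimal sketch. It assumes that each study has already been reduced to a set of significant genes (the binary variant), that the weighting is the textbook tf × log(N/df) form, and that studies are connected by cosine similarity; the abstract does not specify these details, so the function names `tfidf_vectors` and `cosine` and the exact formulas are illustrative assumptions rather than the authors' implementation.

```python
import math

def tfidf_vectors(study_features):
    """Weight each study's gene features by inverse document frequency.

    study_features maps a study id to a dict {gene: value}, where value is
    1 for a binary significance matrix or a positive count for a discrete
    one. Assumption: standard tf * log(N / df) weighting.
    """
    n_studies = len(study_features)
    # Document frequency: number of studies in which each gene is a feature.
    df = {}
    for genes in study_features.values():
        for gene, value in genes.items():
            if value > 0:
                df[gene] = df.get(gene, 0) + 1
    return {
        study: {
            gene: value * math.log(n_studies / df[gene])
            for gene, value in genes.items()
            if value > 0
        }
        for study, genes in study_features.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[g] for g, w in u.items() if g in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy example with hypothetical gene and study names.
studies = {
    "A": {"g1": 1, "g2": 1},
    "B": {"g1": 1, "g2": 1, "g3": 1},
    "C": {"g1": 1, "g4": 1},
}
vectors = tfidf_vectors(studies)
print(cosine(vectors["A"], vectors["B"]), cosine(vectors["A"], vectors["C"]))
```

In this toy example the gene g1 is a feature of every study, so its weight is zero and it contributes nothing to any connection; studies A and C, which share only g1, end up unrelated, while A and B remain connected through g2.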


Retrieval performance for all methods. Mean and standard error of the AP scores for four variants of our framework, as well as for three Spearman correlation-based methods. Regarding the x-axis labels, “binary” indicates the binary significance matrix variant of our framework while “discrete” indicates the discrete significance matrix variant; “bin + IDF” and “disc + IDF” indicate those two same variants, but also performing the TF-IDF modeling step prior to establishing connections between studies; finally, the “max-SP”, “mean-SP”, and “median-SP” labels refer to the maximum, mean, and median statistic summarization approaches in the competing Spearman correlation methods.

Mentions: We then assessed the performance differences between our framework variants and the competing Spearman correlation-based methods. Figure 4 displays the mean and standard error of the AP scores for both our framework and the Spearman correlation-based methods. Our framework attains higher scores in all data sets. Finally, on the human data sets the discrete variants perform better, while on the S. pneumoniae data set the binary variant performs best.
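
The summary statistics reported in Figure 4 can be sketched as follows, assuming that AP denotes average precision computed over a ranked list of retrieved studies, with relevance decided by a gold standard such as shared MeSH terms. The normalization used here (dividing by the number of relevant studies in the ranking) and the function names are assumptions for illustration; the text above does not give the exact definition.

```python
import math

def average_precision(ranked_relevance):
    """Average precision for one query study.

    ranked_relevance is a list of booleans (or 0/1), ordered from the most
    to the least similar retrieved study; True marks a relevant study.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

def mean_and_standard_error(scores):
    """Mean and standard error of the mean (sample standard deviation / sqrt(n))."""
    n = len(scores)
    mean = sum(scores) / n
    if n < 2:
        return mean, 0.0
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)

# Toy example: relevance of ranked results for three hypothetical query studies.
queries = [[1, 0, 1], [0, 1], [1, 1, 0]]
ap_scores = [average_precision(q) for q in queries]
print(mean_and_standard_error(ap_scores))
```

Each query study yields one AP score, and a method is then summarized by the mean and standard error of those scores across all query studies.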

