Limits...
Selecting high-dimensional mixed graphical models using minimal AIC or BIC forests.

Edwards D, de Abreu GC, Labouriau R - BMC Bioinformatics (2010)

Bottom Line: In the genetics of gene expression context the method identifies a network approximating the joint distribution of the DNA markers and the gene expression levels.Trees and forests are unrealistically simple models for biological systems, but can provide useful insights.Uses include the following: identification of distinct connected components, which can be analysed separately (dimension reduction); identification of neighbourhoods for more detailed analyses; as initial models for search algorithms with a larger search space, for example decomposable models or Bayesian networks; and identification of interesting features, such as hub nodes.

View Article: PubMed Central - HTML - PubMed

Affiliation: Institute of Genetics and Biotechnology, Faculty of Agricultural Sciences, Aarhus University, Aarhus, Denmark. David.Edwards@agrsci.dk

ABSTRACT

Background: Chow and Liu showed that the maximum likelihood tree for multivariate discrete distributions may be found using a maximum weight spanning tree algorithm, for example Kruskal's algorithm. The efficiency of the algorithm makes it tractable for high-dimensional problems.

Results: We extend Chow and Liu's approach in two ways: first, to find the forest optimizing a penalized likelihood criterion, for example AIC or BIC, and second, to handle data with both discrete and Gaussian variables. We apply the approach to three datasets: two from gene expression studies and the third from a genetics of gene expression study. The minimal BIC forest supplements a conventional analysis of differential expression by providing a tentative network for the differentially expressed genes. In the genetics of gene expression context the method identifies a network approximating the joint distribution of the DNA markers and the gene expression levels.

Conclusions: The approach is generally useful as a preliminary step towards understanding the overall dependence structure of high-dimensional discrete and/or continuous data. Trees and forests are unrealistically simple models for biological systems, but can provide useful insights. Uses include the following: identification of distinct connected components, which can be analysed separately (dimension reduction); identification of neighbourhoods for more detailed analyses; as initial models for search algorithms with a larger search space, for example decomposable models or Bayesian networks; and identification of interesting features, such as hub nodes.

Show MeSH

Related in: MedlinePlus

The radius eight neighbourhood of strainin the minimal BIC forest for the E. coli data. The class variable strain is shown as a red circle, and genes that are among the 40 top ranked in the limma analysis are shown as green circles.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC2823705&req=5

Figure 5: The radius eight neighbourhood of strainin the minimal BIC forest for the E. coli data. The class variable strain is shown as a red circle, and genes that are among the 40 top ranked in the limma analysis are shown as green circles.

Mentions: Figure 5 shows the radius eight neighbourhood of strain, that is to say the subgraph of vertices whose path length from strain is less than or equal to 8. There are three variables adjacent to strain. The short arm links to the knockout gene itself via an intergenic region (IG) tRNA gene. This arm just reflects the marked downregulation of Lrp in the knockout strain. The other two arms suggest that Lrp targets just two genes, serA and gltD. It is instructive to compare Figure 4 with a conventional analysis of differential expression using the limma library [31]. If a false discovery rate of 0.2 is used, 40 genes are flagged as possibly differentially regulated. Although the two analysis approaches are very different -- limma is based on gene-by-gene hypothesis testing, and is concerned with the operating characteristics of this, while the present approach is based on approximating the joint distribution of the entire variable set -- the results are broadly consistent. Of the 40 genes identified by the limma analysis, 35 have a path length less or equal to 8 to strain in the minimum BIC forest, and so appear in Figure 5. The remaining 5 genes, however, are very distant from strain, with path lengths ranging from 59 to 81. This could suggest that their apparent regulation by Lrp is spurious.


Selecting high-dimensional mixed graphical models using minimal AIC or BIC forests.

Edwards D, de Abreu GC, Labouriau R - BMC Bioinformatics (2010)

The radius eight neighbourhood of strainin the minimal BIC forest for the E. coli data. The class variable strain is shown as a red circle, and genes that are among the 40 top ranked in the limma analysis are shown as green circles.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC2823705&req=5

Figure 5: The radius eight neighbourhood of strainin the minimal BIC forest for the E. coli data. The class variable strain is shown as a red circle, and genes that are among the 40 top ranked in the limma analysis are shown as green circles.
Mentions: Figure 5 shows the radius eight neighbourhood of strain, that is to say the subgraph of vertices whose path length from strain is less than or equal to 8. There are three variables adjacent to strain. The short arm links to the knockout gene itself via an intergenic region (IG) tRNA gene. This arm just reflects the marked downregulation of Lrp in the knockout strain. The other two arms suggest that Lrp targets just two genes, serA and gltD. It is instructive to compare Figure 4 with a conventional analysis of differential expression using the limma library [31]. If a false discovery rate of 0.2 is used, 40 genes are flagged as possibly differentially regulated. Although the two analysis approaches are very different -- limma is based on gene-by-gene hypothesis testing, and is concerned with the operating characteristics of this, while the present approach is based on approximating the joint distribution of the entire variable set -- the results are broadly consistent. Of the 40 genes identified by the limma analysis, 35 have a path length less or equal to 8 to strain in the minimum BIC forest, and so appear in Figure 5. The remaining 5 genes, however, are very distant from strain, with path lengths ranging from 59 to 81. This could suggest that their apparent regulation by Lrp is spurious.

Bottom Line: In the genetics of gene expression context the method identifies a network approximating the joint distribution of the DNA markers and the gene expression levels.Trees and forests are unrealistically simple models for biological systems, but can provide useful insights.Uses include the following: identification of distinct connected components, which can be analysed separately (dimension reduction); identification of neighbourhoods for more detailed analyses; as initial models for search algorithms with a larger search space, for example decomposable models or Bayesian networks; and identification of interesting features, such as hub nodes.

View Article: PubMed Central - HTML - PubMed

Affiliation: Institute of Genetics and Biotechnology, Faculty of Agricultural Sciences, Aarhus University, Aarhus, Denmark. David.Edwards@agrsci.dk

ABSTRACT

Background: Chow and Liu showed that the maximum likelihood tree for multivariate discrete distributions may be found using a maximum weight spanning tree algorithm, for example Kruskal's algorithm. The efficiency of the algorithm makes it tractable for high-dimensional problems.

Results: We extend Chow and Liu's approach in two ways: first, to find the forest optimizing a penalized likelihood criterion, for example AIC or BIC, and second, to handle data with both discrete and Gaussian variables. We apply the approach to three datasets: two from gene expression studies and the third from a genetics of gene expression study. The minimal BIC forest supplements a conventional analysis of differential expression by providing a tentative network for the differentially expressed genes. In the genetics of gene expression context the method identifies a network approximating the joint distribution of the DNA markers and the gene expression levels.

Conclusions: The approach is generally useful as a preliminary step towards understanding the overall dependence structure of high-dimensional discrete and/or continuous data. Trees and forests are unrealistically simple models for biological systems, but can provide useful insights. Uses include the following: identification of distinct connected components, which can be analysed separately (dimension reduction); identification of neighbourhoods for more detailed analyses; as initial models for search algorithms with a larger search space, for example decomposable models or Bayesian networks; and identification of interesting features, such as hub nodes.

Show MeSH
Related in: MedlinePlus