Decision tree-based method for integrating gene expression, demographic, and clinical data to determine disease endotypes.

Williams-DeVane CR, Reif DM, Hubal EC, Bushel PR, Hudgens EE, Gallagher JE, Edwards SW - BMC Syst Biol (2013)

Bottom Line: However, significant challenges exist with regard to how to segregate individuals into suitable subtypes of the disease and understand the distinct biological mechanisms of each when the goal is to maximize the discovery potential of these data sets. This application should be considered a complement to ongoing efforts to better define and diagnose known endotypes. When coupled with existing methods developed to determine the genetics of gene expression, these methods provide a mechanism for linking genetics and exposomics data and thereby accounting for both major determinants of disease.


Affiliation: National Health and Environmental Effects Research Laboratory - Integrated Systems Toxicology Division, U.S. Environmental Protection Agency, Research Triangle Park, Durham, NC 27711, USA. clarlynda.williams@nccu.edu.

ABSTRACT

Background: Complex diseases are often difficult to diagnose, treat and study due to the multi-factorial nature of the underlying etiology. Large data sets are now widely available that can be used to define novel, mechanistically distinct disease subtypes (endotypes) in a completely data-driven manner. However, significant challenges exist with regard to how to segregate individuals into suitable subtypes of the disease and understand the distinct biological mechanisms of each when the goal is to maximize the discovery potential of these data sets.

Results: A multi-step decision tree-based method is described for defining endotypes based on gene expression, clinical covariates, and disease indicators, using childhood asthma as a case study. We attempted to use alternative approaches such as the Student's t-test, single data domain clustering, and the Modk-prototypes algorithm, which incorporates multiple data domains into a single analysis; none performed as well as the novel multi-step decision tree method. This new method gave the best segregation of asthmatics and non-asthmatics, and it provides easy access to all genes and clinical covariates that distinguish the groups.

Conclusions: The multi-step decision tree method described here will lead to better understanding of complex disease in general by allowing purely data-driven disease endotypes to facilitate the discovery of new mechanisms underlying these diseases. This application should be considered a complement to ongoing efforts to better define and diagnose known endotypes. When coupled with existing methods developed to determine the genetics of gene expression, these methods provide a mechanism for linking genetics and exposomics data and thereby accounting for both major determinants of disease.



Figure 5: Scatterplots for each of the 81 Clinical Covariate Clustering Methods (Table 3); A: Silhouette, B: Baker & Hubert, C: Hubert & Levine, D: Prespecified cluster count = 11, E: Prespecified cluster count = 12, F: Prespecified cluster count = 14. Colors represent individual clusters.

Mentions: Clinical covariate domain clustering was performed using all 81 clinical covariates and, separately, with a subset of 67 covariates for comparison to the multi-domain Modk method (see Methods). Cluster validity testing with 81 covariates suggested that the best clinical chemistry clustering occurred with 2 or 25 clusters (Table 5). The Silhouette clustering yielded two distinct and balanced clusters; however, both were composed equally of asthmatics and non-asthmatics (Figure 5A). The Baker & Hubert clustering yielded two clusters, one of which contained a single subject, as was also seen in the gene expression results (Figure 5B). The Hubert & Levine clustering yielded 25 clusters (Figure 5C). Some of the clusters were composed entirely of asthmatics or entirely of non-asthmatics; however, the well-segregated groups contained fewer than five subjects each (see Additional file 1: Table S2). While this represents a better segregation of asthmatics and non-asthmatics than seen with gene expression, the group sizes are inadequate for meaningful interpretation. Less optimal clustering combinations were also chosen using 81 covariates, based on index values, to ascertain whether other cluster counts were more suitable for asthma segregation and mechanistic interpretation (Figure 5D-F). These clustering combinations yielded one clearly non-asthmatic cluster and several other clusters composed equally of non-asthmatics and asthmatics. The use of 67 covariates yielded similar results (Table 6, Figure 6, and Additional file 1: Table S3). One individual was a constant outlier based on the clustering results, appearing as the only member of a cluster in each of the clustering methods. However, this individual did not meet any of the standard metrics for defining outliers. To determine whether this potential outlier was driving the clustering results, the individual was removed and the clustering repeated. The resulting clusters still did not segregate asthmatics from non-asthmatics. Since the individual could not be defined as an outlier by any formal test, the original results including this individual are shown.
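
The pattern reported for the Silhouette clustering (geometrically clean clusters that nonetheless mix asthmatics and non-asthmatics evenly) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the data points, labels, and the function names `silhouette_score` and `composition` are hypothetical, the distance is assumed to be Euclidean, and singleton clusters are given a silhouette width of 0 by convention.

```python
from collections import Counter
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette width over all points, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Convention: a singleton cluster contributes a width of 0.
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


def composition(labels, status):
    """Count subjects per cluster, broken down by disease status."""
    table = {}
    for lab, s in zip(labels, status):
        table.setdefault(lab, Counter())[s] += 1
    return table


# Hypothetical toy data: two tight, well-separated groups of four subjects.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
status = ["asthma", "control"] * 4  # disease status alternates within clusters

score = silhouette_score(points, labels)
table = composition(labels, status)
print(f"mean silhouette width: {score:.3f}")
for cluster, counts in sorted(table.items()):
    print(cluster, dict(counts))
```

In this toy case the validity index is high while each cluster holds equal numbers of both groups, which is why a composition table against the disease indicator is worth checking alongside any cluster validity index.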

