Carotta: Revealing Hidden Confounder Markers in Metabolic Breath Profiles.

Hauschild AC, Frisch T, Baumbach JI, Baumbach J - Metabolites (2015)

Bottom Line: On the one hand, this allows us to eliminate the effect of potential confounding factors hindering disease classification, such as smoking. It does not require much prior knowledge or technical skills to operate. We demonstrate its power and applicability by means of one artificial dataset.


Affiliation: Computational Systems Biology Group, Max Planck Institute for Informatics, Saarbrücken 66123, Germany. a.hauschild@mpi-inf.mpg.de.

ABSTRACT
Computational breath analysis is a growing research area that aims to identify volatile organic compounds (VOCs) in human breath to support next-generation medical diagnostics. While inexpensive and non-invasive bioanalytical technologies for metabolite detection in exhaled air and bacterial/fungal vapor exist, and the first studies on the power of supervised machine learning methods for profiling the resulting data have been conducted, we lack methods to extract hidden data features emerging from confounding factors. Here, we present Carotta, a new cluster analysis framework dedicated to uncovering such hidden substructures by sophisticated unsupervised statistical learning methods. We study the power of transitivity clustering and hierarchical clustering to identify groups of VOCs with similar expression behavior over most patient breath samples and/or groups of patients with a similar VOC intensity pattern. This enables the discovery of dependencies between metabolites. On the one hand, this allows us to eliminate the effect of potential confounding factors hindering disease classification, such as smoking. On the other hand, we may also identify VOCs associated with disease subtypes or concomitant diseases. Carotta is open-source software with an intuitive graphical user interface that supports data handling, analysis and visualization. The back-end is designed to be modular, allowing for easy extension with plugins in the future, such as new clustering methods and statistics. It does not require much prior knowledge or technical skills to operate. We demonstrate its power and applicability by means of one artificial dataset. We also apply Carotta, as an example, to a real-world dataset on chronic obstructive pulmonary disease (COPD). While the artificial data are utilized as a proof of concept, we demonstrate how Carotta finds candidate markers in our real dataset that are associated with confounders rather than the primary disease (COPD) and bronchial carcinoma (BC). Carotta is publicly available at http://carotta.compbio.sdu.dk [1].


Figure 5. Comparison of the clustering results on the entire COPD dataset, i.e., using all metabolites, and on the four most interesting metabolite clusters (the two best and two of the worst). The plot shows the F-measure for different clustering thresholds, computed against the disease annotation (COPD, COPD with bronchial carcinoma (BC) and healthy). The x-axis corresponds to the clustering threshold, in this case the number of splits. Given three groups of patients in the annotation, we are particularly interested in the clustering performance at T ~ 2. This is shown in more detail in the zoomed cutout, as well as in the table of F-measure values at this position. Two subsets of metabolites overlap with the patients’ disease annotation better than the clusterings based on the entire metabolite set. The two other metabolite subsets result in reduced F-measures, indicating a relation to confounding factors, in this case menthol (see the text).
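
The F-measure comparison described in the caption can be sketched as follows. This is a minimal illustration, not Carotta's code: the text here does not spell out which F-measure variant Carotta uses, so the sketch assumes one common definition for clusterings (for each annotated group, take the best F1 score over all clusters, then average weighted by group size). The function name `clustering_f_measure` and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_f_measure(labels_true, labels_pred):
    """Class-size-weighted best-match F-measure of a clustering.

    Assumed variant (not confirmed by the source): each annotated group is
    matched to the cluster that maximizes its F1 score, and the per-group
    scores are averaged with weights proportional to group size.
    """
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    # Number of samples shared by each (group, cluster) pair.
    overlap = Counter(zip(labels_true, labels_pred))

    total = 0.0
    for cls, cls_n in class_sizes.items():
        best = 0.0
        for clu, clu_n in cluster_sizes.items():
            hit = overlap[(cls, clu)]
            if hit == 0:
                continue
            precision = hit / clu_n
            recall = hit / cls_n
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (cls_n / n) * best
    return total


# Toy annotation with the three patient groups named in the caption.
annotation = ["COPD", "COPD", "COPD+BC", "COPD+BC", "healthy", "healthy"]
perfect = clustering_f_measure(annotation, [0, 0, 1, 1, 2, 2])
single_cluster = clustering_f_measure(annotation, [0, 0, 0, 0, 0, 0])
```

Under this definition, a clustering that reproduces the annotation exactly scores 1, and the score drops as patient clusters mix the annotated groups, which is the behavior the caption attributes to the confounder-related metabolite subsets.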

Mentions: This dataset was evaluated with Carotta following the previously introduced workflow. First, all 120 metabolites were clustered by hierarchical agglomerative clustering (HAC) on the Pearson correlation (converted to a dissimilarity, as explained above). Several thresholds (and thus varying numbers and sizes of clusters) were investigated, with T = 40 giving the best result. Subsequently, the set of metabolites was split into 40 subsets, one for each cluster of correlating metabolites. We then excluded all clusters with fewer than three compounds, leaving a total of 14 metabolite sets. Finally, HAC was performed on the correlation matrix of the patients (converted to a distance matrix, as previously described) for each of these metabolite sets. Carotta subsequently evaluates the overlap of the patient clusters with the three patient groups over varying clustering thresholds using the F-measure. Figure 5 plots the results for four of the 14 metabolite subsets, as well as the results when using the entire set of metabolites. For better visualization, we restricted the figure to the five most interesting results: the entire metabolite set, the two best-performing metabolite sets (highest F-measures) and two of the worst-performing ones (lowest F-measures).
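
The metabolite-clustering step of this workflow can be sketched in a few lines. This is a rough illustration under stated assumptions, not Carotta's implementation: the correlation-to-dissimilarity conversion is taken to be 1 - r and the linkage to be average linkage, neither of which is specified in this excerpt. The names `pearson`, `average_linkage` and `toy_profiles` are hypothetical, and the toy data are invented purely to make the example run.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long intensity profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: merge the closest pair until n_clusters remain.

    Assumptions (not taken from the source): dissimilarity = 1 - Pearson r,
    and cluster distance = average pairwise dissimilarity.
    """
    n = len(profiles)
    dist = [[1.0 - pearson(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average dissimilarity.
        _, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


# Invented toy data: one row per metabolite, one column per patient.
toy_profiles = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.5],
    [1.1, 2.0, 3.2, 3.9, 5.1],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [10.0, 8.0, 6.5, 4.0, 2.0],
    [1.0, 5.0, 2.0, 4.0, 3.0],
]
clusters = average_linkage(toy_profiles, n_clusters=3)
# Keep only clusters with at least three metabolites, as in the text.
kept = [c for c in clusters if len(c) >= 3]
```

The same function could be reused for the patient-level step by passing one row per patient, restricted to the metabolites of a single kept cluster. This quadratic-memory, pure-Python version is only meant to show the logic on small inputs.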

