Decision tree supported substructure prediction of metabolites from GC-MS profiles.

Hummel J, Strehmel N, Selbig J, Walther D, Kopka J - Metabolomics (2010)

Bottom Line: Structural feature extraction was applied to sub-divide the metabolite space contained in the GMD and to define the prediction target classes. Decision tree (DT)-based prediction of the most frequent substructures, based on mass spectral features and RI information, is demonstrated to result in highly sensitive and specific detection of substructures contained in the compounds. The underlying set of DTs can be inspected by the user and is made available for batch processing via SOAP (Simple Object Access Protocol)-based web services.


ABSTRACT
Gas chromatography coupled to mass spectrometry (GC-MS) is one of the most widespread routine technologies applied to the large-scale screening and discovery of novel metabolic biomarkers. However, the majority of mass spectral tags (MSTs) currently remain unidentified due to the lack of authenticated pure reference substances required for compound identification by GC-MS. Here, we accessed the information on reference compounds stored in the Golm Metabolome Database (GMD) to apply supervised machine learning approaches to the classification and identification of unidentified MSTs without relying on library searches. Non-annotated MSTs with mass spectral and retention index (RI) information, together with data of already identified metabolites and reference substances, have been archived in the GMD. Structural feature extraction was applied to sub-divide the metabolite space contained in the GMD and to define the prediction target classes. Decision tree (DT)-based prediction of the most frequent substructures, based on mass spectral features and RI information, is demonstrated to result in highly sensitive and specific detection of substructures contained in the compounds. The underlying set of DTs can be inspected by the user and is made available for batch processing via SOAP (Simple Object Access Protocol)-based web services. The GMD mass spectral library with the integrated DTs is freely accessible for non-commercial use at http://gmd.mpimp-golm.mpg.de/. All matching and structure search functionalities are available as SOAP-based web services. An XML + HTTP interface, which follows Representational State Transfer (REST) principles, facilitates read-only access to database entities.


Fig. 2: Two synchronized web controls facilitate access to all currently available DTs and substructure predictions provided by the GMD interface. The upper table describes the investigated functional group, with optional use of the RI system, and records the date of DT generation as well as the number and proportion of MSTs linked to present calls of the substructure. DTs are characterised by a 50-fold CV error, with corresponding precision, recall, F-measure and MCC values (cf. Sect. 2.2.4). The table can be sorted by activating the selected column header. The full list of all available DTs is accessible through a paging function at the bottom left of the display. The exemplary DT shown was trained to classify mass spectra with regard to the presence or absence of the amine substructure, using the VAR5 RI information and 582 mass spectra representing metabolites with an amine moiety, contrasted by 1,662 MSTs of metabolites lacking this moiety. The DT is depicted on the left-hand side. Standard SQL Server Analysis Services Decision Tree properties of the activated tree node (indicated by underlay) are displayed to the right. Light (green) bars indicate the proportion of MSTs linked to a compound with an amine moiety; dark (red) bars indicate the proportion of MSTs linked to compounds lacking this moiety. Note that we use the terms Decision Tree and Mining Model to indicate the data structures according to the SQL Server Analysis Services terminology.
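The four quality measures named in the caption can all be derived from the confusion counts of a binary classifier. The following is a minimal sketch using the standard textbook definitions, not the GMD implementation. The class sizes (582 amine, 1,662 non-amine MSTs) come from the caption, while the split into true and false calls is purely hypothetical and chosen only for illustration.

```python
from math import sqrt


def binary_metrics(tp, fp, fn, tn):
    """Return precision, recall, F-measure and MCC from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "mcc": mcc}


# Hypothetical confusion counts: tp + fn = 582 and fp + tn = 1662 match the
# class sizes in the caption; the individual values are NOT from the paper.
metrics = binary_metrics(tp=500, fp=60, fn=82, tn=1602)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC is a useful complement to the F-measure here because the two classes are unbalanced: it takes true negatives into account, whereas precision, recall and the F-measure do not.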

Mentions: For the reasons stated above, a 50-fold CV by iterative exclusion of randomly chosen MSTs was routinely implemented. As an alternative, we explored a heuristic validation process that excluded all technological replicate MSTs of single analytes (data not shown). Because the GMD will provide a steadily increasing number of replicate mass spectra and RIs for each analyte, which is expected to improve DT-based substructure prediction, we decided to use random choice of MSTs for CV. Our implemented procedure characterises DT performance and enables the calculation of overall error estimates for each substructure tree. CV results are displayed along with the respective DT information and visualization (Fig. 2). To demonstrate the implementation of the chosen CV process, exemplary results of a ten-fold CV are shown for the DT classification that assesses the presence or absence of an amine moiety (Table 2).
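The two fold-assignment strategies contrasted above can be sketched as follows. This is an illustrative outline only, not the authors' SQL Server Analysis Services procedure: the function names, the seed handling and the toy analyte identifiers are assumptions introduced here. `random_folds` corresponds to the random exclusion of MSTs, and `grouped_folds` to the alternative in which all replicate MSTs of one analyte are held out together.

```python
import random
from collections import defaultdict


def random_folds(n_items, k, seed=0):
    """Randomly assign item indices to k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def grouped_folds(analyte_ids, k, seed=0):
    """Assign indices to k folds so replicates of one analyte share a fold."""
    groups = defaultdict(list)
    for idx, analyte in enumerate(analyte_ids):
        groups[analyte].append(idx)
    analytes = sorted(groups)
    random.Random(seed).shuffle(analytes)
    folds = [[] for _ in range(k)]
    for i, analyte in enumerate(analytes):
        folds[i % k].extend(groups[analyte])
    return folds


def cross_validate(folds, train_and_score):
    """Hold out each fold in turn and return the mean per-fold error.

    train_and_score(train_idx, test_idx) is a caller-supplied (hypothetical)
    function that fits a model on train_idx and returns its error on test_idx.
    """
    errors = []
    for held_out in folds:
        train = [i for fold in folds if fold is not held_out for i in fold]
        errors.append(train_and_score(train, held_out))
    return sum(errors) / len(errors)


# 2,244 MSTs = 582 amine + 1,662 non-amine spectra, split into 50 folds.
folds = random_folds(2244, k=50)

# Toy analyte labels (hypothetical) to show the replicate-aware variant.
toy_analytes = ["A1", "A1", "A2", "A2", "A2", "A3", "A4", "A4"]
toy_folds = grouped_folds(toy_analytes, k=2)
```

With random folds, replicates of the same analyte can appear in both the training and the held-out set, so the error estimate reflects performance on analytes already represented in the library. The grouped variant instead estimates performance on analytes the tree has never seen.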

