Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry.

Kind T, Fiehn O - BMC Bioinformatics (2007)

Bottom Line: Only 0.6% of the PubChem compounds checked did not pass all rules. The rules were also shown to effectively reduce the complete set of all eight billion theoretically possible C, H, N, S, O, P formulas up to 2000 Da to only 623 million most probable elemental compositions. Corresponding software and supplemental data are available for download from the authors' website.


Affiliation: University of California Davis, Genome Center, Davis, CA 95616, USA. tkind@ucdavis.edu

ABSTRACT

Background: Structure elucidation of unknown small molecules by mass spectrometry is a challenge despite advances in instrumentation. The first crucial step is to obtain correct elemental compositions. In order to automatically constrain the thousands of possible candidate structures, rules need to be developed to select the most likely and chemically correct molecular formulas.

Results: An algorithm for filtering molecular formulas is derived from seven heuristic rules: (1) restrictions for the number of elements, (2) LEWIS and SENIOR chemical rules, (3) isotopic patterns, (4) hydrogen/carbon ratios, (5) element ratios of nitrogen, oxygen, phosphorus, and sulphur versus carbon, (6) element ratio probabilities and (7) presence of trimethylsilylated compounds. Formulas are ranked according to their isotopic patterns and subsequently constrained by presence in public chemical databases. The seven rules were developed on 68,237 existing molecular formulas and were validated in four experiments. First, 432,968 formulas covering five million PubChem database entries were checked for consistency. Only 0.6% of these compounds did not pass all rules. Second, the rules were shown to effectively reduce the complete set of all eight billion theoretically possible C, H, N, S, O, P formulas up to 2000 Da to only 623 million most probable elemental compositions. Third, 6,000 pharmaceutical, toxic and natural compounds were selected from the DrugBank, TSCA and DNP databases. The correct formulas were retrieved as top hit at 80-99% probability when assuming data acquisition with complete resolution of unique compounds, 5% absolute isotope ratio deviation and 3 ppm mass accuracy. Last, some exemplary compounds were analyzed by Fourier transform ion cyclotron resonance mass spectrometry and by gas chromatography-time of flight mass spectrometry. In each case, the correct formula was ranked as top hit when combining the seven rules with database queries.
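To make the 3 ppm mass accuracy assumption concrete, the following is a minimal Python sketch of a mass tolerance check for a candidate formula. It is not the authors' software: the function names, the dictionary-based formula representation and the choice to treat the molecule as neutral (no adduct or electron-mass correction) are assumptions made for illustration. The monoisotopic masses are standard values for the most abundant isotopes.

```python
# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "P": 30.97376163,
    "S": 31.97207100,
}


def monoisotopic_mass(counts):
    """Sum the monoisotopic masses for a formula given as {element: count}."""
    return sum(MONOISOTOPIC_MASS[element] * n for element, n in counts.items())


def ppm_error(measured, theoretical):
    """Relative mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def within_tolerance(measured, counts, tol_ppm=3.0):
    """True if the candidate formula matches the measured mass within tol_ppm.

    Assumption: neutral molecule; ion adducts and electron mass are ignored.
    """
    return abs(ppm_error(measured, monoisotopic_mass(counts))) <= tol_ppm


glucose = {"C": 6, "H": 12, "O": 6}
print(round(monoisotopic_mass(glucose), 4))  # 180.0634
```

A mass window like this only narrows the candidate list; the abstract describes the remaining candidates as being filtered by the seven rules and ranked by isotopic pattern.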

Conclusion: The seven rules enable an automatic exclusion of molecular formulas that are either wrong or contain unlikely high or low numbers of elements. The correct molecular formula is assigned with a probability of 98% if the formula exists in a compound database. For truly novel compounds that are not present in databases, the correct formula is found in the first three hits with a probability of 65-81%. Corresponding software and supplemental data are available for download from the authors' website.



Figure 3: Frequency distribution for the molecular masses of all elemental compositions downloaded from the PubChem database (2006) covering more than 5 million single compounds.

Mentions: As stated above, the rules were developed from the Wiley mass spectral libraries and the DNP database (~47,000 and ~31,000 unique elemental compositions, respectively, for a combined total of 68,237 unique formulas). The NIST02 mass spectral database was additionally used for special test cases. Mass spectral databases were selected for rule development because mass spectrometry is the usual method for molecular formula determination. For validation, we compiled a much larger database of 432,968 unique molecular formulas derived from 5 million compounds in the PubChem database (version February 2006). At the time of writing, the PubChem database has almost doubled to 10 million compounds. Even if online queries against such databases were possible, query times would be rather slow. We have therefore downloaded or purchased the databases for in-house use and rapid access. The PubChem database was found to represent the known molecular space of small molecules well. 27% of all elemental compositions were found between 400 and 500 Da (see Figure 3). Out of a total of more than five million chemical structures, 99.6% of all formulas comprised fewer than 99 carbon or hydrogen atoms, making this number a reasonable cutoff, which we have used in our script implementation. Interestingly, only 2665 molecular formulas (0.6%) failed to pass the set of seven rules we have developed. Among them, 1589 molecular formulas did not pass the Lewis and Senior check, either because of conversion errors or wrong formulas, or because they were true radical-containing compounds. Most of the other failed candidates (1481) did not pass rule #6, the HNOPS probability check, because they did not contain any hydrogen. The next largest group (1349 candidates) did not pass rule #4, the hydrogen/carbon ratio test. An overlap analysis revealed that 21,783 molecular formulas (5%) from Wiley and DNP were not contained in the PubChem validation set. These results demonstrate the good performance of the set of rules, despite the comparatively small size of the development databases.
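The checks named in this passage can be illustrated with a short Python sketch. It assumes simple Hill-style formula strings such as "C6H12O6" (no brackets or charges) and covers only C, H, N, O, P and S. The valence table is an assumption chosen for illustration and is not taken from the authors' script; the Senior-style test shown is a simplified parity and connectivity check, not the full rule as implemented in the paper. Only the cutoff of 99 carbon or hydrogen atoms comes from the text above.

```python
import re

# Assumed typical valences, for illustration only (not the authors' table).
VALENCE = {"C": 4, "H": 1, "N": 3, "O": 2, "P": 3, "S": 2}


def parse_formula(formula):
    """Parse a simple formula string such as 'C6H12O6' into {element: count}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def within_count_cutoff(counts, cutoff=99):
    """Fewer than `cutoff` carbon atoms and fewer than `cutoff` hydrogen atoms."""
    return counts.get("C", 0) < cutoff and counts.get("H", 0) < cutoff


def passes_senior_check(counts):
    """Simplified Senior-style test under the assumed valences:
    the valence sum must be even and at least 2 * (number of atoms - 1)."""
    total_valence = sum(VALENCE[element] * n for element, n in counts.items())
    atoms = sum(counts.values())
    return total_valence % 2 == 0 and total_valence >= 2 * (atoms - 1)


def hydrogen_carbon_ratio(counts):
    """H/C ratio, or None when the formula contains no carbon."""
    carbons = counts.get("C", 0)
    return counts.get("H", 0) / carbons if carbons else None


glucose = parse_formula("C6H12O6")
print(within_count_cutoff(glucose), passes_senior_check(glucose),
      hydrogen_carbon_ratio(glucose))  # True True 2.0
```

Under these assumed valences an odd-electron formula such as CH3 fails the parity test, which matches the observation that true radical-containing compounds do not pass the Lewis and Senior check. The acceptable H/C range for rule #4 is not given in this excerpt, so the sketch only computes the ratio and leaves the threshold to the caller.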

