Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry.

Kind T, Fiehn O - BMC Bioinformatics (2007)

Bottom Line: Of 432,968 formulas covering five million PubChem database entries, only 0.6% did not pass all rules. The rules were also shown to effectively reduce all eight billion theoretically possible C, H, N, S, O, P formulas up to 2000 Da to only 623 million most probable elemental compositions. Corresponding software and supplemental data are available for download from the authors' website.


Affiliation: University of California Davis, Genome Center, Davis, CA 95616, USA. tkind@ucdavis.edu

ABSTRACT

Background: Structure elucidation of unknown small molecules by mass spectrometry is a challenge despite advances in instrumentation. The first crucial step is to obtain correct elemental compositions. In order to automatically constrain the thousands of possible candidate structures, rules need to be developed to select the most likely and chemically correct molecular formulas.

Results: An algorithm for filtering molecular formulas is derived from seven heuristic rules: (1) restrictions for the number of elements, (2) LEWIS and SENIOR chemical rules, (3) isotopic patterns, (4) hydrogen/carbon ratios, (5) element ratios of nitrogen, oxygen, phosphorus, and sulphur versus carbon, (6) element ratio probabilities and (7) presence of trimethylsilylated compounds. Formulas are ranked according to their isotopic patterns and subsequently constrained by presence in public chemical databases. The seven rules were developed on 68,237 existing molecular formulas and were validated in four experiments. First, 432,968 formulas covering five million PubChem database entries were checked for consistency. Only 0.6% of these compounds did not pass all rules. Next, the rules were shown to effectively reduce all eight billion theoretically possible C, H, N, S, O, P formulas up to 2000 Da to only 623 million most probable elemental compositions. Third, 6,000 pharmaceutical, toxic and natural compounds were selected from the DrugBank, TSCA and DNP databases. The correct formulas were retrieved as top hit at 80-99% probability when assuming data acquisition with complete resolution of unique compounds, 5% absolute isotope ratio deviation and 3 ppm mass accuracy. Last, some exemplary compounds were analyzed by Fourier transform ion cyclotron resonance mass spectrometry and by gas chromatography-time of flight mass spectrometry. In each case, the correct formula was ranked as top hit when combining the seven rules with database queries.

Conclusion: The seven rules enable an automatic exclusion of molecular formulas which are either wrong or which contain unlikely high or low numbers of elements. The correct molecular formula is assigned with a probability of 98% if the formula exists in a compound database. For truly novel compounds that are not present in databases, the correct formula is found in the first three hits with a probability of 65-81%. Corresponding software and supplemental data are available for download from the authors' website.
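Rule (2) in the abstract names the LEWIS and SENIOR chemical rules without spelling them out. As a rough illustration only, the sketch below checks the three conditions usually stated for Senior's theorem: the valence sum is even, it is at least twice the maximum valence, and it is at least twice the number of atoms minus one. The valence table (lowest common valences for C, H, N, O, S, P) and the helper names `parse_formula` and `passes_senior` are assumptions made for this example and are not taken from the authors' software, which may treat higher valence states differently.

```python
import re

# Assumption: lowest common valences; higher valence states of N, P and S
# (for example P(V) or S(VI)) are deliberately ignored in this sketch.
VALENCE = {"C": 4, "H": 1, "N": 3, "O": 2, "S": 2, "P": 3}


def parse_formula(formula):
    """Parse a flat formula such as 'C6H12O6' into {'C': 6, 'H': 12, 'O': 6}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def passes_senior(formula):
    """Return True if the formula satisfies the three Senior conditions."""
    counts = parse_formula(formula)
    valence_sum = sum(VALENCE[el] * n for el, n in counts.items())
    atoms = sum(counts.values())
    max_valence = max(VALENCE[el] for el in counts)
    return (
        valence_sum % 2 == 0                 # (i) even sum of valences
        and valence_sum >= 2 * max_valence   # (ii) at least twice the max valence
        and valence_sum >= 2 * (atoms - 1)   # (iii) enough bonds to connect all atoms
    )


if __name__ == "__main__":
    for candidate in ["C6H12O6", "CH4", "CH5", "CH6"]:
        print(candidate, passes_senior(candidate))
```

A check like this is cheap, so it would plausibly run before the more expensive isotopic pattern comparison. Note that it assumes neutral, even-electron species; radical ions would need a relaxed parity condition.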



Figure 1: Isotopic patterns of 45,000 compound formulas from the Wiley mass spectral database and 60,000 peptide formulas in the small molecule space < 1000 Dalton. M+1 and M+2 are given as relative abundances in [%] and are normalized to 100% of the highest isotope abundance in the molecular formula.

Mentions: Compounds synthesized from natural precursors exhibit monoisotopic and isotopic masses according to the natural average abundances of stable isotopes, which are listed for each element [18]. Previously, we have shown that applying isotopic abundance patterns removes most of the wrongly assigned molecular formulas from a given mass measurement experiment [10], and we proved that without isotope information, even a hypothetical spectrometer capable of 0.1 ppm mass accuracy could not report unique elemental compositions in the range above 185.9760 Dalton when the elements C, H, N, S, O and P were included in the search list. Therefore, isotope ratio abundance was included in the algorithm as an additional orthogonal constraint, assuming high-quality data acquisition, specifically sufficient ion statistics and a high signal/noise ratio for the detection of the M+1 and M+2 abundances. Mass spectrometers such as time-of-flight instruments are commercially available that report isotopic abundance patterns with a very low relative error (around 2–5% relative standard deviation). Figure 1 shows all M+1 and M+2 patterns for the formulas from the Wiley mass spectral database and a data set of calculated peptide formulas. This figure demonstrates that for a few compound classes, such as brominated and chlorinated small molecules but also sulphur-containing peptides, the isotope pattern is a valuable tool to immediately determine the presence of such elements. For monoisotopic elements (F, Na, P, I) this rule has no impact. The relative impact of the isotopic ratio evaluations against the other rules is highlighted below.
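To make the M+1 and M+2 abundances in this paragraph concrete, the following minimal sketch estimates them for a given elemental composition by repeatedly convolving per-element isotope distributions at nominal mass resolution. The abundance values are rounded natural abundances entered here as assumptions for illustration, and the names `ISOTOPES`, `convolve` and `isotope_pattern` are hypothetical; this is not the authors' implementation. Peaks are normalized to the highest of M, M+1 and M+2 only, so heavier peaks (for example M+4 of polyhalogenated compounds) are not considered.

```python
# Assumption: rounded natural isotope abundances (fractions) at nominal mass
# offsets 0, +1, +2. Minor isotopes beyond +2 (e.g. 36S) are ignored.
ISOTOPES = {
    "C": [0.9893, 0.0107, 0.0],
    "H": [0.999885, 0.000115, 0.0],
    "N": [0.99636, 0.00364, 0.0],
    "O": [0.99757, 0.00038, 0.00205],
    "S": [0.9499, 0.0075, 0.0425],
    "P": [1.0, 0.0, 0.0],            # monoisotopic: no effect on the pattern
    "Cl": [0.7576, 0.0, 0.2424],
    "Br": [0.5069, 0.0, 0.4931],
}


def convolve(a, b, size=3):
    """Multiply two isotope distributions, keeping only offsets below `size`."""
    out = [0.0] * size
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < size:
                out[i + j] += x * y
    return out


def isotope_pattern(counts):
    """Return (M, M+1, M+2) in percent, scaled so the largest of the three is 100."""
    dist = [1.0, 0.0, 0.0]
    for element, n in counts.items():
        for _ in range(n):
            dist = convolve(dist, ISOTOPES[element])
    top = max(dist)
    return tuple(100.0 * p / top for p in dist)


if __name__ == "__main__":
    glucose = {"C": 6, "H": 12, "O": 6}
    bromomethane = {"C": 1, "H": 3, "Br": 1}
    for name, counts in [("glucose", glucose), ("bromomethane", bromomethane)]:
        m0, m1, m2 = isotope_pattern(counts)
        print(f"{name}: M={m0:.1f}%  M+1={m1:.2f}%  M+2={m2:.2f}%")
```

Running the example shows the effect described in the text: the brominated compound has an M+2 peak nearly as intense as M, whereas a plain C, H, O compound has only a small M+2 peak. A candidate formula could then be accepted or rejected by comparing such calculated values with the measured M+1 and M+2 abundances within a chosen tolerance.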

