The Chemical Validation and Standardization Platform (CVSP): large-scale automated validation of chemical structure datasets.

Karapetyan K, Batchelor C, Sharpe D, Tkachenko V, Williams AJ - J Cheminform (2015)

Bottom Line: There are presently hundreds of online databases hosting millions of chemical compounds and associated data. As a result of the number of cheminformatics software tools that can be used to produce the data, subtle differences between the various cheminformatics platforms, as well as the naivety of the software users, there are a myriad of issues that can exist with chemical structure representations online. In this work we review the results of the automated validation of the DrugBank dataset, a popular drug and drug target database utilized by the community, and of the ChEMBL 17 data set.


Affiliation: Royal Society of Chemistry, US Office, 904 Tamaras Circle, Wake Forest, NC 27587 USA.

ABSTRACT

Background: There are presently hundreds of online databases hosting millions of chemical compounds and associated data. Because of the number of cheminformatics software tools that can be used to produce the data, subtle differences between the various cheminformatics platforms, and the naivety of the software users, a myriad of issues can exist with chemical structure representations online. To facilitate validation and standardization of chemical structure datasets from various sources, we have delivered a freely available internet-based platform to the community for the processing of chemical compound datasets.

Results: The chemical validation and standardization platform (CVSP) both validates and standardizes chemical structure representations according to sets of systematic rules. The chemical validation algorithms detect issues with submitted molecular representations using pre-defined or user-defined dictionary-based molecular patterns that are chemically suspicious or may require manual review. Each identified issue is assigned one of three levels of severity (Information, Warning, or Error) to inform the user of the need to browse and review subsets of their data. The validation process covers atoms and bonds (e.g., flagging query atoms and bonds), valences, and stereochemistry. The standard form of submission of collections of data, the SDF file, allows the user to map the data fields to predefined CVSP fields for the purpose of cross-validating associated SMILES and InChIs with the connection tables contained within the SDF file. This platform has been applied to the analysis of a large number of data sets prepared for deposition to our ChemSpider database and to the preparation of data for the Open PHACTS project. In this work we review the results of the automated validation of the DrugBank dataset, a popular drug and drug target database utilized by the community, and of the ChEMBL 17 data set. The CVSP web site is located at http://cvsp.chemspider.com/.

Conclusion: A platform for the validation and standardization of chemical structure representations of various formats has been developed and made available to the community to assist and encourage the processing of chemical structure files, with the aim of producing more homogeneous compound representations for exchange between online databases. While the CVSP platform is designed so that the rules used for processing the data are flexible, we have produced a recommended rule set based on our own experience with large data sets such as DrugBank, ChEMBL, and data sets from ChemSpider.



Fig. 14: Depiction of the ring-walking algorithm. If we start at the topmost node on the ring and proceed clockwise, recording whether there is a left turn (L) or a right turn (R) at every node, then we obtain a six-character “signature”. This indicates whether the hexagon is homotropous (all turns are the same), a chair, a boat or a yet more exotic shape.


One particular challenge for molecular structure identifier generation is the depiction of carbohydrate stereochemistry: the InChI algorithm [23], for example, cannot cope with “perspective” drawings of pyranose rings in the chair conformation. A full description of our approach, including standardization, is out of scope for this paper; in outline, the algorithm proceeds as follows. The Indigo toolkit is used to detect the unfused six-membered rings in the molecular structure. The shape of each ring is then classified using the ring-walking algorithm depicted in Fig. 14, which assigns the ring a “signature” according to whether, on walking around the ring, you take a left or a right turn at each node. Table 3 shows the eight possible shapes for hexagons along with common names, where they exist, and a small selection of signatures. The code compares the hexagon signature against all cyclic permutations of both a left-first and a right-first “canonical” signature. By cyclic permutations we mean that, for example, LLLRRR, LLRRRL, LRRRLL, RRRLLL, RRLLLR and RLLLRR are all treated as the same signature. We also use this code to detect L-pyranose rings, as in many cases these are the result of the author having inadvertently mirrored the ring (John Blunt, personal communication).
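The signature comparison described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the CVSP implementation: the function names (`turn_signature`, `rotations`, `canonical`, `is_homotropous`) are hypothetical, the 2D coordinates are assumed to have the y axis pointing up, and treating a signature and its L/R-swapped form as the same shape is our reading of the "left-first and right-first" comparison. The shape names of Table 3 are deliberately not reproduced.

```python
from itertools import product
from math import sqrt

def turn_signature(points):
    """Return the L/R turn string for ring atoms listed in walking order.

    points: list of (x, y) tuples, y axis pointing up (assumption).
    """
    n = len(points)
    turns = []
    for i in range(n):
        (px, py), (cx, cy), (nx, ny) = points[i - 1], points[i], points[(i + 1) % n]
        # z component of (incoming edge) x (outgoing edge); positive means a left turn
        cross = (cx - px) * (ny - cy) - (cy - py) * (nx - cx)
        if cross == 0:
            raise ValueError("collinear ring atoms: turn is undefined")
        turns.append("L" if cross > 0 else "R")
    return "".join(turns)

def rotations(sig):
    """All cyclic permutations of a signature string."""
    return {sig[i:] + sig[:i] for i in range(len(sig))}

SWAP = str.maketrans("LR", "RL")

def canonical(sig):
    """Smallest string among the rotations of sig and of its L/R-swapped form."""
    return min(rotations(sig) | rotations(sig.translate(SWAP)))

def is_homotropous(sig):
    """True when every turn is in the same direction."""
    return len(set(sig)) == 1

# Regular hexagon walked clockwise, starting at the topmost node.
h = sqrt(3) / 2
hexagon = [(0, 1), (h, 0.5), (h, -0.5), (0, -1), (-h, -0.5), (-h, 0.5)]
signature = turn_signature(hexagon)
shape_classes = {canonical("".join(p)) for p in product("LR", repeat=6)}
```

Two signatures describe the same shape under these assumptions when `canonical(a) == canonical(b)`. Enumerating all 64 six-character strings this way yields eight equivalence classes, which agrees with the eight hexagon shapes mentioned in the text.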

