Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents.

Senger S, Bartek L, Papadatos G, Gaulton A - J Cheminform (2015)

Bottom Line: When looking at the percentage of chemical structures successfully extracted from a set of patents, using SciFinder as our reference, 59 and 51 % were also found in our comparison in SureChEMBL and IBM SIIP, respectively. SureChEMBL and IBM SIIP found 62 and 59 %, respectively, of the compound-patent pairs obtained from Reaxys. Nevertheless, our results will hopefully help to manage expectations of users of patent chemistry databases of this type and provide a useful framework for more studies like ours as well as guide future developments of the workflows used for the automated extraction of chemical structures from patents.


Affiliation: GlaxoSmithKline, Stevenage, Hertfordshire SG1 2NY UK.

ABSTRACT

Background: First public disclosure of new chemical entities often takes place in patents, which makes them an important source of information. However, with an ever increasing number of patent applications, manual processing and curation on such a large scale becomes even more challenging. An alternative approach better suited for this large corpus of documents is the automated extraction of chemical structures. A number of patent chemistry databases generated by using the latter approach are now available but little is known that can help to manage expectations when using them. This study aims to address this by comparing two such freely available sources, SureChEMBL and IBM SIIP (IBM Strategic Intellectual Property Insight Platform), with manually curated commercial databases.

Results: When looking at the percentage of chemical structures successfully extracted from a set of patents, using SciFinder as our reference, 59 and 51 % were also found in our comparison in SureChEMBL and IBM SIIP, respectively. When performing this comparison with compounds as the starting point, i.e. establishing whether, for a list of compounds, the databases provide the links between chemical structures and the patents they appear in, we obtained similar results. SureChEMBL and IBM SIIP found 62 and 59 %, respectively, of the compound-patent pairs obtained from Reaxys.

Conclusions: In our comparison of automatically generated vs. manually curated patent chemistry databases, the former successfully provided approximately 60 % of links between chemical structure and patents. It needs to be stressed that only a very limited number of patents and compound-patent pairs were used for our comparison. Nevertheless, our results will hopefully help to manage expectations of users of patent chemistry databases of this type and provide a useful framework for more studies like ours as well as guide future developments of the workflows used for the automated extraction of chemical structures from patents. The challenges we have encountered whilst performing this study highlight that more needs to be done to make such assessments easier. Above all, more adequate, preferably open access to relevant 'gold standards' is required.




Fig2: SureChEMBL Chemical Corpus Count of Compounds. Number of compounds from the set of 35 patents with the SciFinder annotation “Biological Studies” within eight ranges of SureChEMBL chemical corpus count

Mentions: Considering that in SureChEMBL chemical entities are automatically extracted from patent documents (regardless of whether they are claimed compounds, reagents, synthetic intermediates etc.), it is perhaps not surprising that the number of molecules associated with a patent in SureChEMBL tends to be much larger than the corresponding number in SciFinder. For the set of 46 patents used for this analysis, only 12 % of the compounds associated with the patents in SureChEMBL are also associated with these patents in SciFinder. For compounds annotated with “Biological Studies” in SciFinder the overlap is only 5 %. This prompts the question of how to identify the relevant compounds amongst what one could call artefacts or noise (e.g. reagents, intermediates, solvents, radicals, fragments, reference compounds, and existing marketed drugs). A SureChEMBL annotation that can help to address this challenge is the “chemical corpus count”: the number of occasions on which a chemical structure has been observed within the whole currently available chemical corpus of SureChEMBL. Figure 2 shows how the compounds from the subset of 35 patents for which SciFinder contains compounds with the annotation “Biological Studies” are distributed across different ranges of chemical corpus count. As can be seen, approximately 90 % of these compounds have a chemical corpus count of less than or equal to 30. At the same time, only 16 % of all molecules found in SureChEMBL for the 35 patents have a chemical corpus count of less than or equal to 30. Hence, applying a filter of chemical corpus count ≤30 enriches the compounds that we consider here to be of greatest interest by a factor of approximately 5 (90 % / 16 %).
It is worth pointing out that, as expected, the majority of the molecules with the SciFinder annotation “Biological Studies” that have larger values for the chemical corpus count are, for example, marketed drugs or well-known pharmacological tool compounds that are of minor relevance in the context of the use cases discussed here. The SureChEMBL output contains a number of further annotations (e.g. Ring Count, Molecular Weight, Log P, Radical, Med Chem Alert) that can also be used for filtering, depending on the specific use case. Similarly, substructure searching is a viable option in cases where patents contain compounds that share a common structural entity. Examples of further data mining and cheminformatics methods applied to SureChEMBL compounds from patent documents are available as IPython Notebooks [16]. The notebooks aim to identify genuinely novel structures and series as claimed in a patent, along with methods to retrospectively flag key compounds [17, 18].
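
The enrichment quoted for the chemical corpus count filter is the ratio of two fractions: the share of “Biological Studies” compounds that pass the filter (approx. 90 %) divided by the share of all extracted molecules that pass it (16 %), i.e. 0.90 / 0.16 ≈ 5.6, reported as a factor of 5. The following is a minimal Python sketch of that calculation, not the workflow used in the study. The `Compound` record, its field names, and the toy data are assumptions made for illustration; they do not reflect the actual SureChEMBL export format, and the toy counts are shaped only to mirror the two reported percentages.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Compound:
    """One extracted structure. All field names are hypothetical."""
    compound_id: str
    corpus_count: int        # SureChEMBL "chemical corpus count"
    biological_study: bool   # annotated "Biological Studies" in SciFinder


def fraction_passing(compounds: Iterable[Compound],
                     max_corpus_count: int = 30) -> float:
    """Fraction of compounds with corpus count <= max_corpus_count."""
    items = list(compounds)
    if not items:
        raise ValueError("empty compound set")
    passing = sum(c.corpus_count <= max_corpus_count for c in items)
    return passing / len(items)


def enrichment_factor(compounds: Iterable[Compound],
                      max_corpus_count: int = 30) -> float:
    """Pass rate of compounds of interest divided by the overall pass rate."""
    items = list(compounds)
    of_interest = [c for c in items if c.biological_study]
    overall = fraction_passing(items, max_corpus_count)
    if overall == 0:
        raise ValueError("no compound passes the filter")
    return fraction_passing(of_interest, max_corpus_count) / overall


def toy_compounds() -> List[Compound]:
    """Made-up data: 9 of 10 compounds of interest and 16 of 100 overall pass."""
    data = []
    # 10 compounds of interest, 9 of them rarely seen in the corpus.
    for i in range(10):
        data.append(Compound(f"bio-{i}", 5 if i < 9 else 500, True))
    # 90 other extracted structures (reagents, solvents, ...), 7 rarely seen.
    for i in range(90):
        data.append(Compound(f"other-{i}", 12 if i < 7 else 2000, False))
    return data


if __name__ == "__main__":
    factor = enrichment_factor(toy_compounds())
    print(f"enrichment at corpus count <= 30: {factor:.1f}")
```

Making the threshold a parameter allows the same ratio to be evaluated at other cut-offs, for example at the boundaries of the corpus count ranges shown in Fig. 2.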

