Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents.

Senger S, Bartek L, Papadatos G, Gaulton A - J Cheminform (2015)

Bottom Line: When looking at the percentage of chemical structures successfully extracted from a set of patents, using SciFinder as our reference, 59 and 51 % were also found in our comparison in SureChEMBL and IBM SIIP, respectively. SureChEMBL and IBM SIIP found 62 and 59 %, respectively, of the compound-patent pairs obtained from Reaxys. Nevertheless, our results will hopefully help to manage expectations of users of patent chemistry databases of this type and provide a useful framework for more studies like ours as well as guide future developments of the workflows used for the automated extraction of chemical structures from patents.


Affiliation: GlaxoSmithKline, Stevenage, Hertfordshire SG1 2NY UK.

ABSTRACT

Background: The first public disclosure of new chemical entities often takes place in patents, which makes them an important source of information. However, with an ever-increasing number of patent applications, manual processing and curation on such a large scale becomes even more challenging. An alternative approach better suited to this large corpus of documents is the automated extraction of chemical structures. A number of patent chemistry databases generated by the latter approach are now available, but little is known that can help to manage expectations when using them. This study aims to address this by comparing two such freely available sources, SureChEMBL and IBM SIIP (IBM Strategic Intellectual Property Insight Platform), with manually curated commercial databases.

Results: Using SciFinder as our reference for the chemical structures extracted from a set of patents, 59 % and 51 % of those structures were also found in SureChEMBL and IBM SIIP, respectively. When performing the comparison with compounds as the starting point, i.e. establishing whether, for a list of compounds, the databases provide the links between chemical structures and the patents they appear in, we obtained similar results: SureChEMBL and IBM SIIP found 62 % and 59 %, respectively, of the compound-patent pairs obtained from Reaxys.
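The percentages above are coverage figures: the share of reference entries that an automatically generated database also contains. A minimal sketch of that calculation follows, assuming each structure (or compound-patent pair) has already been reduced to a comparable key. The function name `coverage` and the identifiers in the toy data are hypothetical and do not come from the study.

```python
from typing import Hashable, Set


def coverage(reference: Set[Hashable], candidate: Set[Hashable]) -> float:
    """Return the percentage of reference entries also present in candidate.

    Entries may be structure keys or (compound, patent) tuples; how
    structures are normalised into keys is outside this sketch.
    """
    if not reference:
        raise ValueError("reference set is empty")
    return 100.0 * len(reference & candidate) / len(reference)


# Toy data with made-up identifiers (not taken from the paper).
reference_db = {"cpd-1", "cpd-2", "cpd-3", "cpd-4", "cpd-5"}
automated_db = {"cpd-1", "cpd-3", "cpd-4", "cpd-9"}

# Entries found only in the automated database (cpd-9) do not count,
# because the manually curated source is treated as the reference.
print(f"{coverage(reference_db, automated_db):.0f} %")  # prints "60 %"
```

The same function applies to the compound-patent comparison if the sets hold tuples such as `("cpd-1", "US0000000")` instead of bare compound keys.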

Conclusions: In our comparison of automatically generated vs. manually curated patent chemistry databases, the former successfully provided approximately 60 % of links between chemical structure and patents. It needs to be stressed that only a very limited number of patents and compound-patent pairs were used for our comparison. Nevertheless, our results will hopefully help to manage expectations of users of patent chemistry databases of this type and provide a useful framework for more studies like ours as well as guide future developments of the workflows used for the automated extraction of chemical structures from patents. The challenges we have encountered whilst performing this study highlight that more needs to be done to make such assessments easier. Above all, more adequate, preferably open access to relevant 'gold standards' is required.




Fig. 4: Distribution of compounds found by one or both of the two automatically curated databases. Percentage of compounds that are in SciFinder for the set of 45 patents and are only in SureChEMBL (orange), only in IBM SIIP (blue), or in both (green). US4231938 is not included since none of the compounds in SciFinder were found in either SureChEMBL or IBM SIIP (cf. Table 1). The results used to generate this visualisation can be found in Additional file 3.

Mentions: To better understand how much overlap there is between compounds that are in SciFinder as well as in SureChEMBL and/or IBM SIIP for the set of 46 patents, the relevant information is shown in Fig. 4. There are 9 patents where every compound in SciFinder that was identified by one of the data sources was also found by the other. For one patent, none of the compounds in SciFinder were found in either SureChEMBL or IBM SIIP. For two of the patents, IBM SIIP found 32 additional compounds that are not in SureChEMBL, whereas SureChEMBL did not find any compound that is not also in IBM SIIP. For 16 patents, SureChEMBL found compounds that IBM SIIP did not find and vice versa: across these 16 patents, SureChEMBL found 207 molecules that IBM SIIP did not find, whereas IBM SIIP identified only 69 molecules that were missed by SureChEMBL. For the remaining 18 patents, only SureChEMBL found compounds that were not identified by the other data source; in these 18 patents SureChEMBL found 120 additional compounds.
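The five groups of patents described above can be reproduced by a simple per-patent set comparison. The sketch below is one possible way to do this, not the authors' workflow; the function `classify_patent`, the category labels, and the toy compound keys are all hypothetical. Only the patent counts in the final check (9, 1, 2, 16, 18) are taken from the text.

```python
from typing import Set


def classify_patent(reference: Set[str], surechembl: Set[str], ibm_siip: Set[str]) -> str:
    """Assign a patent to one of the five overlap groups.

    Only compounds present in the reference set (SciFinder in the study)
    are considered; anything else a database reports is ignored.
    """
    in_sc = reference & surechembl
    in_ibm = reference & ibm_siip
    only_sc = in_sc - in_ibm
    only_ibm = in_ibm - in_sc

    if not in_sc and not in_ibm:
        return "neither"            # no reference compound found by either source
    if not only_sc and not only_ibm:
        return "identical"          # both sources found exactly the same compounds
    if only_sc and only_ibm:
        return "both_have_extras"   # each source found compounds the other missed
    return "surechembl_extras" if only_sc else "ibm_siip_extras"


# Toy example with made-up compound keys.
ref = {"a", "b", "c", "d"}
print(classify_patent(ref, {"a", "b", "c"}, {"a", "b"}))  # prints "surechembl_extras"

# Consistency check on the patent counts reported in the text.
reported = {
    "identical": 9,
    "neither": 1,
    "ibm_siip_extras": 2,
    "both_have_extras": 16,
    "surechembl_extras": 18,
}
assert sum(reported.values()) == 46
```

The five groups add up to the 46 patents examined. Figure 4 shows 45 of them because the single "neither" patent is left out.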

