Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents.

Senger S, Bartek L, Papadatos G, Gaulton A - J Cheminform (2015)

Bottom Line: When looking at the percentage of chemical structures successfully extracted from a set of patents, using SciFinder as our reference, 59 and 51 % were also found in our comparison in SureChEMBL and IBM SIIP, respectively. SureChEMBL and IBM SIIP found 62 and 59 %, respectively, of the compound-patent pairs obtained from Reaxys. Nevertheless, our results will hopefully help to manage expectations of users of patent chemistry databases of this type and provide a useful framework for more studies like ours as well as guide future developments of the workflows used for the automated extraction of chemical structures from patents.


Affiliation: GlaxoSmithKline, Stevenage, Hertfordshire SG1 2NY UK.

ABSTRACT

Background: First public disclosure of new chemical entities often takes place in patents, which makes them an important source of information. However, with an ever increasing number of patent applications, manual processing and curation on such a large scale becomes even more challenging. An alternative approach better suited for this large corpus of documents is the automated extraction of chemical structures. A number of patent chemistry databases generated by using the latter approach are now available but little is known that can help to manage expectations when using them. This study aims to address this by comparing two such freely available sources, SureChEMBL and IBM SIIP (IBM Strategic Intellectual Property Insight Platform), with manually curated commercial databases.

Results: When looking at the percentage of chemical structures successfully extracted from a set of patents, using SciFinder as our reference, 59 and 51 % were also found in our comparison in SureChEMBL and IBM SIIP, respectively. When performing this comparison with compounds as starting point, i.e. establishing if for a list of compounds the databases provide the links between chemical structures and patents they appear in, we obtained similar results. SureChEMBL and IBM SIIP found 62 and 59 %, respectively, of the compound-patent pairs obtained from Reaxys.

Conclusions: In our comparison of automatically generated vs. manually curated patent chemistry databases, the former successfully provided approximately 60 % of links between chemical structure and patents. It needs to be stressed that only a very limited number of patents and compound-patent pairs were used for our comparison. Nevertheless, our results will hopefully help to manage expectations of users of patent chemistry databases of this type and provide a useful framework for more studies like ours as well as guide future developments of the workflows used for the automated extraction of chemical structures from patents. The challenges we have encountered whilst performing this study highlight that more needs to be done to make such assessments easier. Above all, more adequate, preferably open access to relevant 'gold standards' is required.




Fig. 5: Bar charts showing the number of substances found in the databases. Bar charts with the number of substances found (green) or not found (red) in at least one of the two patent databases, depending on the number of patents in Reaxys (left) and the number of components (right).

Mentions: From the 438 compounds, a total of 294 compounds were found in at least one of the two sources. 231 compounds were found in both sources, 13 compounds were only found in IBM SIIP, and 50 compounds were only found in SureChEMBL. Hence, 55.7 % of the 438 compounds were found in IBM SIIP and 64.2 % were found in SureChEMBL. For the set of 91 compounds mentioned above (covering the period 2011 onwards) only 50 compounds (54.5 %) were found in SureChEMBL. To reiterate, at this point we only checked for the presence or absence of particular compounds in the two patent databases. Figure 5 (left) shows how the likelihood that a compound is found in at least one of the patent databases increases with the number of patents that were found for a compound (in Reaxys). For example, 59.3 % of compounds with only one patent in Reaxys were found whereas almost all molecules (95.3 %) with more than five patents in Reaxys are in either SureChEMBL or IBM SIIP (or in both). This is exactly what one would expect to observe. Figure 5 also shows (on the right) how the likelihood of a compound being found in either SureChEMBL or IBM SIIP seems to depend on the number of components it contains. Whereas 83.2 % of compounds with only one component were found, only 12.9 % of compounds with two components (e.g. salts) were found. None of the six compounds with three components were found in either SureChEMBL or IBM SIIP. This finding suggests (maybe not surprisingly) that it is more challenging to extract multi-component compounds correctly from patents than it is to extract individual molecules. Whether or not this plays a role in the relevant use cases will most likely have a significant effect on the result.
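The per-database percentages quoted above follow directly from the three overlap counts (found in both, only in IBM SIIP, only in SureChEMBL). The following is a minimal sketch of that arithmetic, not the authors' code: the function name `coverage` and its parameter names are hypothetical, and the combined figure for "at least one source" is derived here from the stated counts rather than quoted from the paper.

```python
def coverage(both: int, only_a: int, only_b: int, total: int) -> dict:
    """Summarise how many of `total` compounds two sources (A, B) cover.

    `both`   : compounds found in both sources
    `only_a` : compounds found only in source A
    `only_b` : compounds found only in source B
    """
    found_a = both + only_a              # everything source A contains
    found_b = both + only_b              # everything source B contains
    found_any = both + only_a + only_b   # union of the two sources
    return {
        "found_a": found_a,
        "found_b": found_b,
        "found_any": found_any,
        "pct_a": round(100 * found_a / total, 1),
        "pct_b": round(100 * found_b / total, 1),
        "pct_any": round(100 * found_any / total, 1),
    }


# Counts taken from the text: A = IBM SIIP, B = SureChEMBL.
stats = coverage(both=231, only_a=13, only_b=50, total=438)
print(stats)
# found_any is 294; pct_a is 55.7 (IBM SIIP); pct_b is 64.2 (SureChEMBL)
```

Separating the "both" and "only" counts makes the size of each source's unique contribution explicit: here SureChEMBL adds 50 compounds that IBM SIIP misses, while IBM SIIP adds 13 that SureChEMBL misses.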

