Mining semantic networks of bioinformatics e-resources from the literature.

Afzal H, Eales J, Stevens R, Nenadic G - J Biomed Semantics (2011)

Bottom Line: These efforts rely on manual curation, making it difficult to cope with the huge influx of various electronic resources that have been provided by the bioinformatics community. Since such representations are typically sparse (on average 13.77 features per resource), we used lexical kernel metrics to identify semantically related resources via descriptor smoothing. Resources are then clustered or linked into semantic networks, providing the users (bioinformaticians, curators and service/tool crawlers) with a possibility to explore algorithms, tools, services and datasets based on their relatedness.


Affiliation: College of Telecommunication Engineering, National University of Sciences and Technology, Islamabad, Pakistan. Hammad.Afzal@deri.org.

ABSTRACT

Background: There have been a number of recent efforts (e.g. BioCatalogue, BioMoby) to systematically catalogue bioinformatics tools, services and datasets. These efforts rely on manual curation, making it difficult to cope with the huge influx of various electronic resources that have been provided by the bioinformatics community. We present a text mining approach that utilises the literature to automatically extract descriptions and semantically profile bioinformatics resources to make them available for resource discovery and exploration through semantic networks that contain related resources.

Results: The method identifies the mentions of resources in the literature and assigns a set of co-occurring terminological entities (descriptors) to represent them. We have processed 2,691 full-text bioinformatics articles and extracted profiles of 12,452 resources containing associated descriptors with binary and tf*idf weights. Since such representations are typically sparse (on average 13.77 features per resource), we used lexical kernel metrics to identify semantically related resources via descriptor smoothing. Resources are then clustered or linked into semantic networks, providing the users (bioinformaticians, curators and service/tool crawlers) with a possibility to explore algorithms, tools, services and datasets based on their relatedness. Manual exploration of links between a set of 18 well-known bioinformatics resources suggests that the method was able to identify and group semantically related entities.

Conclusions: The results have shown that the method can reconstruct interesting functional links between resources (e.g. linking data types and algorithms), in particular when tf*idf-like weights are used for profiling. This demonstrates the potential of combining literature mining and simple lexical kernel methods to model relatedness between resource descriptors in particular when there are few features, thus potentially improving the resource description, discovery and exploration process. The resource profiles are available at http://gnode1.mib.man.ac.uk/bioinf/semnets.html.



Figure 6: Semantic network of bioinformatics resources (using method 2 and values shown in Figure 4). Node size represents frequency in the corpus; edge thickness represents how similar the two connected nodes are. Node colour is determined by the semantic class of the node: red for Data, green for Data resource, blue for Algorithm and yellow for Application. (A) The scores based on binary weights. (B) The scores based on tf*idf. The image was generated using Cytoscape [23]; the network was laid out with the Cytoscape layout algorithm ‘Edge-Weighted Spring Embedded’, using the edge weight data in the network.

Mentions: Even though similarity data alone can identify important semantic links, we further explored the importance of the number and strength of links between resources. In Figure 6 we present the similarity data as edges in a network connecting each node (representing an individual resource) with those that have some similarity to it. Each edge is weighted by the similarity between the resources it connects, so that thick edges represent strong relationships and thin edges represent weak ones. We have removed all edges that have a weight below the median edge weight for the network, or below the median weight for a given node. Nodes that are left with no edges are not presented in the resulting networks. Our intention was to remove edges that exist due to chance alone and to better highlight the strongest relationships in the network.
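
The edge filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `prune_edges`, the node labels and the weights are hypothetical, and the phrase "below the median weight for a given node" is read here as "below the median weight of the edges incident to either endpoint", which is one possible interpretation of the text.

```python
from collections import defaultdict
from statistics import median


def prune_edges(edges):
    """Drop weak edges from a weighted similarity network.

    edges: dict mapping a (node_u, node_v) tuple to a similarity weight.
    Returns the surviving edges and the set of nodes that still have an edge.
    """
    if not edges:
        return {}, set()

    # Median over every edge weight in the network.
    global_median = median(edges.values())

    # Median over the weights of the edges incident to each node.
    incident = defaultdict(list)
    for (u, v), w in edges.items():
        incident[u].append(w)
        incident[v].append(w)
    node_median = {n: median(ws) for n, ws in incident.items()}

    # ASSUMPTION: an edge survives only if it is at or above the global
    # median and at or above the per-node median of both of its endpoints.
    kept = {
        (u, v): w
        for (u, v), w in edges.items()
        if w >= global_median and w >= node_median[u] and w >= node_median[v]
    }

    # Nodes left without any edge are not shown in the network.
    nodes = {n for edge in kept for n in edge}
    return kept, nodes


# Hypothetical toy network; labels and weights are made up for illustration.
edges = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.5, ("C", "D"): 0.1}
kept, nodes = prune_edges(edges)
print(kept, nodes)
```

With this reading, the filter is strict: an edge must be strong relative to the whole network and relative to both of its endpoints. A looser reading (for example, comparing against only one endpoint's median) would keep more edges, so the choice should be checked against the full paper before reuse.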

