OntoGene web services for biomedical text mining.

Rinaldi F, Clematide S, Marques H, Ellendorff T, Romacker M, Rodriguez-Esteban R - BMC Bioinformatics (2014)

Bottom Line: Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are attracting increased interest. We present an approach towards providing advanced text mining capabilities through web services, using a recently proposed standard for textual data interchange (BioC). The web services leverage a state-of-the-art platform for text mining (OntoGene) which has been tested in several community-organized evaluation challenges, with top-ranked results in several of them.


ABSTRACT
Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in the process of database curation, or for exploration and enrichment of biomedical data within the pharmaceutical industry. Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are attracting increased interest. We present an approach towards providing advanced text mining capabilities through web services, using a recently proposed standard for textual data interchange (BioC). The web services leverage a state-of-the-art platform for text mining (OntoGene) which has been tested in several community-organized evaluation challenges, with top-ranked results in several of them.


Figure 1: Dependency parser output example. Complete analysis of the sentence "HA-tagged Tim18p coimmunoprecipitated with Tim12p, with all of Tim22p and with a portion of Tim54p". The figure shows the linguistic processing performed by OntoGene: tokenization (each word is separated and assigned a unique identifier), part-of-speech tagging (assignment of a grammatical category to each word, e.g. NN for noun), lemmatization (detection of the grammatical root form, e.g. coimmunoprecipitate), and dependency-based syntactic analysis, represented in the figure by the tree-like orange arrows, which depict the syntactic structure of the sentence, with, for example, Tim18p as subject.

Mentions: Candidate interactions are generated by simple co-occurrence of entities within the same syntactic units. However, in order to increase precision, we parse the sentences with our state-of-the-art dependency parser [45], which generates a syntactic representation of the sentence. We have used an existing manually annotated corpus [46] as training corpus for the interaction detection task, based on an approach described in [47]. The corpus was parsed with our dependency parser, which has been adapted to and evaluated on the biomedical domain [48,49]. The paths that are extracted from the corpus can be used directly for interaction detection. For example, in the sentence shown in Figure 1, a pattern with the decision 'yes' exists for the relation between Tim18 and Tim12, i.e. the pattern with top node coimmunoprecipitate, left path [subj] and right path [pobj]. The details of the algorithm are presented in [31]. The information delivered by the syntactic analysis is used as a factor to score and filter candidate interactions, based on the syntactic fragment which connects the two participating entities. All available lexical and syntactic information is used to provide an optimized ranking for candidate interactions. The ranking of relation candidates is further optimized by a supervised machine learning method [50]. Since the term recognizer aims at high recall, it introduces several noisy concepts, which we want to identify automatically in order to penalize them. The goal is to identify some global preferences or biases which can be found in the reference database. One technique is to weight individual concepts according to their likelihood of appearing as an entity in a correct relation, as seen in the target database.
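
The pattern lookup and the concept weighting described above can be illustrated with a minimal Python sketch. It is not the OntoGene implementation (the details of the algorithm are in [31]): the `Token` structure, the hand-built three-token parse, the `PATTERNS` table, the function names and all counts are hypothetical, and the relative-frequency weight is only one plausible reading of "likelihood of appearing in a correct relation".

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    lemma: str
    head: int    # id of the governing token; 0 marks the root
    deprel: str  # label of the dependency arc from the head to this token


def path_to_root(tokens, idx):
    """Return the token ids from idx up to the root, inclusive."""
    path = []
    while idx != 0:
        path.append(idx)
        idx = tokens[idx].head
    return path


def connecting_pattern(tokens, a, b):
    """Return (top-node lemma, left path, right path) linking tokens a and b."""
    up_a, up_b = path_to_root(tokens, a), path_to_root(tokens, b)
    on_b = set(up_b)
    top = next(i for i in up_a if i in on_b)  # lowest common ancestor
    # Arc labels are listed top-down, from the top node towards each entity.
    left = tuple(tokens[i].deprel for i in reversed(up_a[:up_a.index(top)]))
    right = tuple(tokens[i].deprel for i in reversed(up_b[:up_b.index(top)]))
    return tokens[top].lemma, left, right


def decide(tokens, a, b, patterns):
    """Look up the decision stored for the connecting pattern (None if unseen)."""
    return patterns.get(connecting_pattern(tokens, a, b))


def concept_weight(concept, in_correct_relations, recognized):
    """Share of a concept's recognitions that took part in a correct relation."""
    seen = recognized[concept]
    return in_correct_relations[concept] / seen if seen else 0.0


# Hand-built, simplified parse of the Figure 1 sentence (assumption).
TOKENS = {
    1: Token("Tim18p", 2, "subj"),
    2: Token("coimmunoprecipitate", 0, "root"),
    3: Token("Tim12p", 2, "pobj"),
}

# Hypothetical pattern table, as it might be learned from a parsed corpus.
PATTERNS = {("coimmunoprecipitate", ("subj",), ("pobj",)): "yes"}

# Hypothetical counts: recognitions vs. appearances in reference-database relations.
RECOGNIZED = Counter({"Tim18p": 4, "portion": 10})
IN_CORRECT = Counter({"Tim18p": 3})

if __name__ == "__main__":
    print(connecting_pattern(TOKENS, 1, 3))
    print(decide(TOKENS, 1, 3, PATTERNS))
    print(concept_weight("Tim18p", IN_CORRECT, RECOGNIZED))
    print(concept_weight("portion", IN_CORRECT, RECOGNIZED))
```

In this toy setting, a noisy concept such as "portion" is recognized often but never appears in a correct relation, so its weight is zero and any candidate interaction involving it would be penalized in the ranking.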

