The strength of co-authorship in gene name disambiguation.

Farkas R - BMC Bioinformatics (2008)

Bottom Line: Supervised and unsupervised machine learning WSD techniques have been applied in the biomedical field with promising results. We found that a disambiguation decision can be made in 85% of cases with an extremely high (99.5%) precision rate just by using information obtained from the inverse co-author graph. Based on the promising results obtained so far we suggest that the co-authorship information and the circumstances of the articles' release (like the title of the journal, the year of publication) can be a crucial building block of any sophisticated similarity measure among biological articles and hence the methods introduced here should be useful for other biomedical natural language processing tasks (like organism or target disease detection) as well.


Affiliation: Hungarian Academy of Science, Research Group on Artificial Intelligence, Aradi vertanuk tere, Szeged, Hungary. rfarkas@inf.u-szeged.hu

ABSTRACT

Background: A biomedical entity mention in articles and other free texts is often ambiguous. For example, 13% of the gene names (aliases) might refer to more than one gene. The task of Gene Symbol Disambiguation (GSD) - a special case of Word Sense Disambiguation (WSD) - is to assign a unique gene identifier to every identified gene name alias in biology-related articles. Supervised and unsupervised machine learning WSD techniques have been applied in the biomedical field with promising results. Here we use graph-based semi-supervised methods to examine how one special feature of biological articles, namely that the authors of each document are known, can be exploited for the GSD task.

Results: Our key hypothesis is that a biologist refers to each particular gene by a fixed gene alias and that this holds for the co-authors as well. To make use of the co-authorship information we built the inverse co-author graph on MedLine abstracts. The nodes of the inverse co-author graph are articles, and there is an edge between two nodes if and only if the two articles share an author. We introduce here two methods for the GSD task that use graph-based distances between abstracts. We found that a disambiguation decision can be made in 85% of cases with an extremely high (99.5%) precision rate just by using information obtained from the inverse co-author graph. We incorporated the co-authorship information into two GSD systems in order to attain full coverage, and in experiments our procedure achieved precision of 94.3%, 98.85%, 96.05% and 99.63% on the human, mouse, fly and yeast GSD evaluation sets, respectively.
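The graph definition above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_inverse_coauthor_graph`, the input layout (a mapping from article ID to a set of author names) and the toy IDs are all assumptions made for illustration, and real MedLine author names would need normalisation that is not shown here.

```python
from collections import defaultdict
from itertools import combinations


def build_inverse_coauthor_graph(article_authors):
    """Return an adjacency map {article_id: set of neighbouring article_ids}.

    Two articles are neighbours if and only if they share at least one author.
    `article_authors` (hypothetical layout) maps article_id -> set of author names.
    """
    # Index articles by author so that only articles sharing an author are compared.
    articles_by_author = defaultdict(set)
    for article, authors in article_authors.items():
        for author in authors:
            articles_by_author[author].add(article)

    graph = {article: set() for article in article_authors}
    for articles in articles_by_author.values():
        # Every pair of articles written by the same author gets an edge.
        for a, b in combinations(sorted(articles), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Toy data: the IDs and names are invented for illustration.
toy_articles = {
    "pmid1": {"Smith J", "Lee K"},
    "pmid2": {"Lee K"},
    "pmid3": {"Nagy P"},
}
graph = build_inverse_coauthor_graph(toy_articles)
```

In this toy input, "pmid1" and "pmid2" are connected through their shared author, while "pmid3" stays isolated and so would receive no graph-based evidence.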

Conclusion: Based on the promising results obtained so far we suggest that the co-authorship information and the circumstances of the articles' release (like the title of the journal, the year of publication) can be a crucial building block of any sophisticated similarity measure among biological articles and hence the methods introduced here should be useful for other biomedical natural language processing tasks (like organism or target disease detection) as well.


Figure 1: Precision-coverage curves on the human GSD dataset. The three curves represent different weighting strategies, and the points on each curve correspond to different levels of filtering of the inverse co-author graph. Authors who had over 100, 50 or 20 MedLine publications were ignored, yielding three points in the precision-coverage space, while the fourth point of each curve shows the case without any filtering.

Mentions: The different degrees of filtering resulted in different precision and coverage value pairs. Figure 1 shows the precision-coverage curves obtained using the three weighting methods (i.e. non-weighted, Dsum and Dmin). According to these results, ignoring more authors from the co-author graph yields a higher precision but at the price of lower coverage. The filtering threshold thus acts as a parameter that trades precision against coverage. A precision of 100% can be maintained with a coverage of 54.42%, while the best coverage achieved by this method was 84.67% with a decrease in precision to 84.76%. The difference between the performance of the three weighting (or non-weighting) methods is significant. The right choice of a method can yield a 2–3% improvement in precision at a given level of coverage. The minmax method seems to outperform the other two, but it does not perform well on the unfiltered graph, so we cannot regard it as the ultimate 'winning' solution here.
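The trade-off described above can be made concrete with a small Python sketch. It rests on stated assumptions rather than on the paper's code: precision is taken as correct decisions divided by decisions made, coverage as decisions made divided by all test cases, and publication counts are computed from the toy input itself rather than from MedLine. The names `filter_prolific_authors` and `precision_and_coverage`, and all IDs, are hypothetical.

```python
from collections import Counter


def filter_prolific_authors(article_authors, max_pubs):
    """Drop every author who appears on more than `max_pubs` articles in the input."""
    counts = Counter(a for authors in article_authors.values() for a in authors)
    return {
        article: {a for a in authors if counts[a] <= max_pubs}
        for article, authors in article_authors.items()
    }


def precision_and_coverage(predictions, gold):
    """`predictions` maps case -> gene ID, or None when no decision was made."""
    decided = {case: gene for case, gene in predictions.items() if gene is not None}
    coverage = len(decided) / len(gold)
    correct = sum(1 for case, gene in decided.items() if gold[case] == gene)
    precision = correct / len(decided) if decided else 0.0
    return precision, coverage


# Toy data, invented for illustration.
toy_articles = {
    "pmid1": {"Smith J", "Lee K"},
    "pmid2": {"Lee K"},
    "pmid3": {"Nagy P"},
}
filtered = filter_prolific_authors(toy_articles, max_pubs=1)

gold = {"m1": "g1", "m2": "g5", "m3": "g2", "m4": "g4"}
predictions = {"m1": "g1", "m2": None, "m3": "g2", "m4": "g9"}
precision, coverage = precision_and_coverage(predictions, gold)
```

With a lower `max_pubs`, more authors are removed, so fewer articles remain connected in the graph and fewer cases receive a decision; this is the mechanism by which stricter filtering lowers coverage.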

