MeSH: a window into full text for document summarization.

Bhattacharya S, Ha-Thuc V, Srinivasan P - Bioinformatics (2011)

Bottom Line: We create reduced versions of full text documents that contain only important portions. Our experiments confirm the ability of our approach to pick the important text portions. Human evaluation of the baselines and our proposed strategies further corroborates the ability of our method to select important sentences from the full texts.


Affiliation: Department of Computer Science, The University of Iowa, Iowa City, IA 52242, USA. sanmitra-bhattacharya@uiowa.edu

ABSTRACT

Motivation: Previous research in the biomedical text-mining domain has historically been limited to titles, abstracts and metadata available in MEDLINE records. Recent research initiatives such as TREC Genomics and BioCreAtIvE strongly point to the merits of moving beyond abstracts and into the realm of full texts. Full texts are, however, more expensive to process, not only in terms of the resources needed but also in terms of accuracy. Since full texts contain embellishments that elaborate, contextualize, contrast, supplement, etc., there is a greater risk of false positives. Motivated by this, we explore an approach that offers a compromise between the extremes of abstracts and full texts. Specifically, we create reduced versions of full text documents that contain only important portions. In the long term, our goal is to explore the use of such summaries for functions such as document retrieval and information extraction. Here, we focus on designing summarization strategies. In particular, we explore the use of MeSH terms, manually assigned to documents by trained annotators, as clues to select important text segments from the full text documents.

Results: Our experiments confirm the ability of our approach to pick the important text portions. Using the ROUGE measures for evaluation, we were able to achieve maximum ROUGE-1, ROUGE-2 and ROUGE-SU4 F-scores of 0.4150, 0.1435 and 0.1782, respectively, for our MeSH term-based method versus the maximum baseline scores of 0.3815, 0.1353 and 0.1428, respectively. Using a MeSH profile-based strategy, we were able to achieve maximum ROUGE F-scores of 0.4320, 0.1497 and 0.1887, respectively. Human evaluation of the baselines and our proposed strategies further corroborates the ability of our method to select important sentences from the full texts.

Contact: sanmitra-bhattacharya@uiowa.edu; padmini-srinivasan@uiowa.edu.


Figure 1: A simplified diagram of the summarization system.

Mentions: Starting with an individual full text document to reduce, we remove the abstract and break the rest apart into sentences using MxTerminator (Reynar and Ratnaparkhi, 1997), followed by some post-processing. We then index the collection of sentences using the SMART retrieval system (Salton, 1971). Note that each sentence is regarded as an independent record. We then create a query consisting of the MeSH terms designated as major index terms/topics (marked with asterisks) in the corresponding MEDLINE record. We consider only the descriptor phrases of those terms. Finally, we run retrieval with this query, which returns a ranking of sentences. The top N (a parameter) sentences are selected for our document summary. Figure 1 shows the underlying algorithm. While indexing, term features are weighted using the standard atc approach: a for augmented term frequency weights, t for inverse document (in this case, sentence) frequency weights and c for cosine normalization (Salton, 1971).
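
The ranking step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it does not use MxTerminator or SMART, the regular-expression tokenizer is a stand-in, the function names (tokenize, atc_vectors, summarize) are hypothetical, and weighting the query terms by inverse sentence frequency alone is an assumption, since the excerpt does not state how the query is weighted. The sentence weights follow the common reading of atc: augmented term frequency 0.5 + 0.5 * tf / max_tf, multiplied by log(N / df), then cosine-normalized.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplistic lowercase tokenizer (assumption; no stemming or stop list).
    return re.findall(r"[a-z0-9]+", text.lower())


def atc_vectors(sentences):
    """Weight each sentence as its own record using 'atc'-style weights."""
    token_lists = [tokenize(s) for s in sentences]
    n = len(token_lists)
    # Sentence frequency: number of sentences containing each term.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        max_tf = max(tf.values(), default=1)
        # a: augmented term frequency, t: inverse sentence frequency.
        weights = {t: (0.5 + 0.5 * f / max_tf) * idf[t] for t, f in tf.items()}
        # c: cosine normalization to unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors, idf


def summarize(sentences, mesh_descriptors, top_n):
    """Rank sentences against a query built from MeSH descriptors; keep top_n."""
    vectors, idf = atc_vectors(sentences)
    query_terms = set(tokenize(" ".join(mesh_descriptors)))
    # Assumption: query terms are weighted by inverse sentence frequency only.
    query = {t: idf[t] for t in query_terms if t in idf}
    scores = [sum(w * query.get(t, 0.0) for t, w in vec.items()) for vec in vectors]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:top_n]]


if __name__ == "__main__":
    body = [
        "Insulin resistance was measured in mice.",
        "The weather was fine.",
        "We thank the reviewers.",
        "Mice were fed a high fat diet.",
    ]
    # Hypothetical major MeSH descriptor for this toy document.
    print(summarize(body, ["Insulin Resistance"], top_n=1))
```

Here N corresponds to top_n. Because the sketch returns sentences in rank order, a caller that wants a readable summary may prefer to re-sort the selected sentences by their original position; the excerpt does not say which order the authors use.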

