A bioinformatics knowledge discovery in text application for grid computing.

Castellano M, Mastronardi G, Bellotti R, Tarricone G - BMC Bioinformatics (2009)

Bottom Line: It included a graphical user interface to access a node search system, a load balancing system and a transfer optimizer that reduces communication costs. It was written in Java on Globus Toolkit 4 to build a grid infrastructure based on GNU/Linux grid nodes. As an example, a Knowledge Discovery in Databases computation was applied to the output produced by the KDT user module to extract new knowledge about symptom and pathology bio-entities.


Affiliation: DEE Dipartimento di Elettrotecnica ed Elettronica, Politecnico di Bari, via Orabona, 4, 70125, Bari, Italy. castellano@poliba.it

ABSTRACT

Background: A fundamental activity in biomedical research is Knowledge Discovery, the ability to search through large amounts of biomedical information such as documents and data. High-performance computational infrastructures, such as Grid technologies, are emerging as a possible way to tackle the intensive use of information and communication resources in the life sciences. The goal of this work was to develop a software middleware solution that lets knowledge discovery applications exploit scalable, distributed computing systems and thereby make intensive use of ICT resources.

Methods: The development of a grid application for Knowledge Discovery in Text using a middleware-based methodology is presented. The system must be able to represent a user application model and process jobs, creating many parallel jobs to distribute across the computational nodes. Finally, the system must be aware of the computational resources available and their status, and must be able to monitor the execution of the parallel jobs. These operative requirements led to the design of a middleware that is specialized by user application modules. It includes a graphical user interface to access a node search system, a load balancing system and a transfer optimizer that reduces communication costs.

Results: A middleware solution prototype and its performance evaluation in terms of the speed-up factor are presented. The prototype was written in Java on Globus Toolkit 4 to build a grid infrastructure based on GNU/Linux grid nodes. A test was carried out, and results are shown for a named entity recognition search for symptoms and pathologies, applied to a collection of 5,000 scientific documents taken from PubMed.

Conclusion: In this paper we discuss the development of a grid application based on a middleware solution. It has been tested on a knowledge discovery in text process that extracts new and useful information about symptoms and pathologies from a large collection of unstructured scientific documents. As an example, a Knowledge Discovery in Databases computation was applied to the output produced by the KDT user module to extract new knowledge about symptom and pathology bio-entities.


Figure 1: Knowledge discovery in text process. This figure shows the Knowledge Discovery in Text process, which is composed of a Text Refining phase and a Text Mining phase. The former transforms a free-form text document into a chosen Intermediate Form, while the latter deduces patterns or knowledge from the Intermediate Form. The input to Text Refining is unstructured data, such as plain text, or semi-structured data, such as HTML pages. It consists of Tokenization, which splits a text document into a stream of words by removing all punctuation marks and by replacing tabs and other non-text characters with single white spaces, and Filtering methods, which remove words such as articles, conjunctions and prepositions from the documents. Lemmatization methods try to map verb forms to the infinitive and nouns to their singular form. Stemming methods attempt to build the basic forms of words, for example by stripping the plural 's' from nouns, the 'ing' from verbs, or other affixes. Additional linguistic pre-processing may be needed to enhance the available information about terms: N-gram identification finds generic n-word sequences that do not necessarily correspond to an idiomatic use; Anaphora resolution identifies the relationship between a linguistic expression (the anaphor) and the preceding phrase it refers to, thus determining the corresponding reference; Part-of-speech (POS) tagging determines the part-of-speech tag (noun, verb, adjective, etc.) for each term; Text chunking groups adjacent words in a sentence; Word Sense Disambiguation (WSD) tries to resolve the ambiguity in the meaning of single words or phrases; Parsing produces a full parse tree of a sentence (subject, object, etc.). The output of Text Refining can be stored in a database, an XML file or another structured form, which is referred to as the Intermediate Form. Text Mining techniques are then applied to the Intermediate Form. The Text Mining phases are: document clustering, document categorization, and pattern extraction.
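The Text Refining steps in the caption can be sketched in a few lines of Java (the language the paper's prototype uses). This is a minimal illustration only, not the paper's middleware code; the class name, the tiny stop-word list and the naive suffix-stripping stemmer are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Minimal sketch of the Text Refining phase: tokenization, stop-word
// filtering, and a naive suffix-stripping stemmer. All names here are
// illustrative, not taken from the paper's middleware.
public class TextRefiner {

    // Filtering: a tiny illustrative stop-word list
    // (articles, conjunctions, prepositions).
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "and", "or", "of", "in", "to", "with");

    // Tokenization: lower-case the text, strip punctuation, collapse
    // tabs and other non-text characters into single spaces.
    public static List<String> tokenize(String text) {
        String cleaned = text.toLowerCase().replaceAll("[^a-z0-9]+", " ").trim();
        List<String> tokens = new ArrayList<>();
        for (String t : cleaned.split(" ")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Naive stemming: strip a plural 's' or an 'ing' suffix, as in the
    // examples given in the figure caption.
    public static String stem(String word) {
        if (word.endsWith("ing") && word.length() > 4) return word.substring(0, word.length() - 3);
        if (word.endsWith("s") && word.length() > 3) return word.substring(0, word.length() - 1);
        return word;
    }

    // Full refining pipeline: tokenize, filter stop words, stem.
    // The resulting word list stands in for the Intermediate Form.
    public static List<String> refine(String text) {
        List<String> result = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (!STOP_WORDS.contains(token)) result.add(stem(token));
        }
        return result;
    }
}
```

For instance, refining "The symptoms of a pathology" drops the stop words and reduces "symptoms" to "symptom", leaving the two content words that later phases would mine.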


Mentions: Knowledge Discovery in Text refers generally to the process of extracting interesting information from a large amount of unstructured text. Specific methods and algorithms are required in order to extract useful patterns. Text mining is the application of these algorithms and methods, drawn from machine learning and statistics, to texts; the goal is to find useful patterns. To mine large document collections it is necessary to pre-process the text documents and store the information in a data structure that is more appropriate for further processing than a plain text file. Although several methods also try to exploit the syntactic structure and semantics of text, most text mining approaches are based on the idea that a text document can be represented by a set of words, that is, a text document is described by the set of words it contains. Figure 1 shows the KDT process based on two phases: Text Refining (or Text Pre-processing) and Text Mining. The central element of the Text Mining process is understanding the concepts presented in the text. The process considers not only the presence or frequency of words in a document but also aims at finding the relationships between them. By doing so, the mining process attempts to find the information contained within a text. The Text Mining phases are: document clustering, document categorization, and pattern extraction. Document clustering is the assignment of a multivariate entity to a small number of categories (classes, groups) that are not defined in advance. The goal is to gather similar entities together. Textual clustering is an automatic process which divides a collection of documents into groups; inside each group, the documents are similar with respect to selected characteristics: author, length, dates, keywords.
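To make the clustering idea concrete, here is a minimal Java sketch that groups documents (represented as keyword sets) by Jaccard similarity. The single-pass strategy, the class name and the similarity threshold are assumptions for illustration; they are not the clustering method of the paper.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative single-pass document clustering: each document is a set
// of keywords, and a document joins the first cluster whose founding
// member shares enough keywords with it (Jaccard similarity);
// otherwise it starts a new cluster.
public class KeywordClustering {

    // Jaccard similarity between two keyword sets: |A ∩ B| / |A ∪ B|.
    public static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Assigns each document to the first sufficiently similar cluster,
    // comparing against that cluster's first member.
    public static List<List<Set<String>>> cluster(List<Set<String>> docs, double threshold) {
        List<List<Set<String>>> clusters = new ArrayList<>();
        for (Set<String> doc : docs) {
            boolean placed = false;
            for (List<Set<String>> c : clusters) {
                if (jaccard(c.get(0), doc) >= threshold) {
                    c.add(doc);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                List<Set<String>> fresh = new ArrayList<>();
                fresh.add(doc);
                clusters.add(fresh);
            }
        }
        return clusters;
    }
}
```

With a threshold of 0.5, two documents about fever and cough end up in one cluster while a document about grid middleware opens a second one, which is exactly the topic-based grouping described above.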
Textual clustering can be used to provide an overview of the contents of a document collection, to identify hidden similarities, to facilitate browsing, and to find correlated or similar information. If the clustering works with keywords or features that represent the semantics of the documents, the identified groups will be distinguished on the basis of the different topics discussed in the corpus. In Document categorization, the objects must be attributed to one or more classes or categories that have already been identified. Classification is the process in which meaningful correlations among frequent data are identified, and association rules can be used for text categorization. Such algorithms operate in two phases to produce association rules. First, all keywords whose support is greater than or equal to a reference threshold are listed to create what is called the frequent set. Then, all the association rules that can be derived from the frequent set and that satisfy a given confidence are established. In Pattern Extraction, patterns are identified through the analysis of associations and tendencies. The discovery of associations is the process in which meaningful correlations among frequent data are found. Predictive Text Mining is used to identify tendencies in a collection of documents over a given time period, while Association Analysis identifies relationships among the attributes, for example whether the presence of one pattern implies the presence of another pattern in the documents [17-21].
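The two-phase association-rule procedure just described can be sketched as follows, restricted to single-keyword antecedents and consequents for brevity. This is a hypothetical illustration; the class and method names are invented, and the caller is assumed to pass in a frequent set whose members all have non-zero support.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of the two-phase association-rule procedure: phase 1 keeps
// keywords whose support meets a minimum threshold (the frequent set);
// phase 2 keeps the rules "a -> b" whose confidence meets a minimum.
public class KeywordAssociation {

    // support(k1..kn) = fraction of documents containing all keywords.
    public static double support(List<Set<String>> docs, String... keywords) {
        int count = 0;
        for (Set<String> doc : docs) {
            boolean all = true;
            for (String k : keywords) {
                if (!doc.contains(k)) { all = false; break; }
            }
            if (all) count++;
        }
        return docs.isEmpty() ? 0.0 : (double) count / docs.size();
    }

    // Phase 2: among keywords already in the frequent set, keep the
    // rules "a -> b" satisfying the minimum confidence, where
    // confidence(a -> b) = support(a, b) / support(a).
    public static List<String> rules(List<Set<String>> docs, List<String> frequent, double minConfidence) {
        List<String> out = new ArrayList<>();
        for (String a : frequent) {
            for (String b : frequent) {
                if (a.equals(b)) continue;
                double confidence = support(docs, a, b) / support(docs, a);
                if (confidence >= minConfidence) out.add(a + " -> " + b);
            }
        }
        return out;
    }
}
```

For example, over four documents where "flu" always co-occurs with "fever" but not vice versa, the rule "flu -> fever" reaches confidence 1.0 while "fever -> flu" does not, so only the former is kept at a 0.9 confidence threshold.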

