A bioinformatics knowledge discovery in text application for grid computing.

Castellano M, Mastronardi G, Bellotti R, Tarricone G - BMC Bioinformatics (2009)

Bottom Line: The middleware included a graphical user interface to access a node search system, a load balancing system and a transfer optimizer to reduce communication costs. It was written in Java on Globus Toolkit 4 to build the grid infrastructure on GNU/Linux grid nodes. As an example, a Knowledge Discovery in Databases computation was applied to the output produced by the KDT user module to extract new knowledge about symptom and pathology bio-entities.


Affiliation: DEE Dipartimento di Elettrotecnica ed Elettronica, Politecnico di Bari, via Orabona, 4, 70125, Bari, Italy. castellano@poliba.it

ABSTRACT

Background: A fundamental activity in biomedical research is Knowledge Discovery, which searches through large amounts of biomedical information such as documents and data. High-performance computational infrastructures, such as Grid technologies, are emerging as a possible means to tackle the intensive use of Information and Communication Technology (ICT) resources in life science. The goal of this work was to develop a software middleware solution that lets knowledge discovery applications run on scalable, distributed computing systems and make intensive use of ICT resources.

Methods: The development of a grid application for Knowledge Discovery in Text (KDT) using a middleware-based methodology is presented. The system must be able to model a user application and process its jobs so as to create many parallel jobs to distribute across the computational nodes. The system must also be aware of the available computational resources and their status, and must be able to monitor the execution of the parallel jobs. These operational requirements led to the design of a middleware that is specialized through user application modules. It included a graphical user interface to access a node search system, a load balancing system and a transfer optimizer to reduce communication costs.

Results: A middleware solution prototype and its performance evaluation in terms of the speed-up factor are shown. The prototype was written in Java on Globus Toolkit 4, which was used to build the grid infrastructure on GNU/Linux grid nodes. A test was carried out, and results are shown for named entity recognition of symptoms and pathologies. The search was applied to a collection of 5,000 scientific documents taken from PubMed.

Conclusion: In this paper we discuss the development of a grid application based on a middleware solution. It has been tested on a knowledge discovery in text process that extracts new and useful information about symptoms and pathologies from a large collection of unstructured scientific documents. As an example, a Knowledge Discovery in Databases computation was applied to the output produced by the KDT user module to extract new knowledge about symptom and pathology bio-entities.

Figure 3: Grid Middleware functioning. This figure shows how the different parts of the system interact to meet the objectives of the Grid Middleware. To start the system, the end user interacts with the node search system through the interface. Using shell-script calls to the Condor scheduler, the node search system retrieves information on the nodes in a computational grid and presents them to the user. The user selects the nodes for computation and specifies the input data set and the local directory that will contain the remote computing results. At this point, the user selects the SIMD application to be distributed in the grid from a list of SIMD applications. Once these specifications have been fixed, the load balancer analyzes the input data set and the node set to be used for the remote computing, and provides a peer distribution of the workload across the different nodes of the grid. After this subdivision, the transfer optimizer compresses the data set and the instruction set to be sent to the different remote nodes. Following the compression, the remote computation on the different grid nodes is started by calls to the shell script system. At the end of the remote computing, each node compresses its computing results locally; the middleware then retrieves the compressed results and makes them available in decompressed form on the end user node.
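
The "peer distribution of the workload" in the caption can be illustrated with a minimal Java sketch. It assumes that "peer" means a near-equal split of the input items across the selected nodes; the source does not give the actual algorithm, and the class name PeerLoadBalancerSketch and the method split are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: near-equal ("peer") split of an input data set across grid nodes. */
final class PeerLoadBalancerSketch {

    /**
     * Splits items into nodeCount contiguous chunks whose sizes differ by at most one.
     * Assumption: every node is treated as equally capable; the paper's balancer may weigh nodes differently.
     */
    static <T> List<List<T>> split(List<T> items, int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        List<List<T>> chunks = new ArrayList<>(nodeCount);
        int base = items.size() / nodeCount;      // minimum chunk size
        int remainder = items.size() % nodeCount; // the first 'remainder' nodes get one extra item
        int start = 0;
        for (int node = 0; node < nodeCount; node++) {
            int size = base + (node < remainder ? 1 : 0);
            chunks.add(new ArrayList<>(items.subList(start, start + size)));
            start += size;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Placeholder document names; a real run would list the input data set.
        List<String> documents = Arrays.asList("doc1", "doc2", "doc3", "doc4", "doc5", "doc6", "doc7");
        List<List<String>> jobs = split(documents, 3);
        for (int node = 0; node < jobs.size(); node++) {
            System.out.println("node " + node + " -> " + jobs.get(node));
        }
    }
}
```

Each chunk would then correspond to one parallel job. A balancer that accounts for node status or speed would replace the equal-share rule with weights.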

Mentions: The middleware solution allows the user to move from the situation in Figure 2A to that in Figure 2B. To achieve these goals, the system will operate on an open source grid infrastructural middleware based on a data-intensive grid toolkit solution. The operations required of the system are: locating, submitting, monitoring and deleting remote jobs on Grid-based computer resources; reliable data transfer optimized for high-bandwidth wide-area networks; transfer optimization; node discovery; load balancing; management of the user SIMD applications; and finally a simple Graphical User Interface (GUI). Figure 3 shows the functional components of the system and their organization. The computational grid with its services is the physical layer. It interacts with the system through shell scripts. The functional management system layer obtains the user requests through the graphical user interface and executes them by parameterizing and invoking the shell scripts. The GUI allows the user to specify the application module that will be distributed on the Grid, the data set on which the user module will be executed, and the computational grid nodes. The user selects the SIMD application that performs the job and specifies the data set and the directory where the results produced by the grid will be stored. The data and the instruction set are sent to the remote nodes after being compressed by the transfer optimizer. Next, remote computation on the grid nodes begins. At the end of the remote computing, each node locally compresses the computing results, and the system makes the results available to the user. The functional management system is composed of several interoperating components. The node search system relies on a scheduler to obtain all the information about the status of the grid nodes, which allows the grid nodes to be managed dynamically.
The Load Balancer maintains a reasonable sharing of the workload among the grid nodes selected by the end user for the grid computation. The user job is processed by analyzing the input data set and the computing node set, with the aim of transforming a single job into parallel jobs. The system provides a SIMD applications management component, based on a plug-in approach, to extend the set of applications. After a preliminary phase in which the new application is linked to the system, the end user can select it through the GUI. It is well known that the time required by communications is a critical factor in the elapsed time of grid distributed applications. For our applications, communication consists of transferring the data (input data set) and the instructions (instruction set) over the network, followed by downloading the remote computing results to the user node. To reduce the communication time, the data are compressed, which significantly reduces the volume of data sent over the network. The compression factor, and hence the advantage of this technique, depends on the type of source data. The transfer optimizer produces compressed packages containing a data set and an instruction set before they are sent to the remote nodes; compression is also applied when the computation results are returned. The Shell Script Grid Interface consists of a collection of script modules used to start a remote or local job and to allow the parallel execution of calls [22-24].
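
The effect of the transfer optimizer can be sketched as a compress and decompress round trip. This is only an illustration under stated assumptions: the source does not name the compression format, so gzip from the Java standard library (java.util.zip) is used as a stand-in, and the class name TransferOptimizerSketch and the sample text are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: compress a payload before transfer and restore it on arrival. */
final class TransferOptimizerSketch {

    /** Compresses raw bytes with gzip (an assumed format, not confirmed by the paper). */
    static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        }
        return buffer.toByteArray();
    }

    /** Restores the original bytes from a gzip-compressed payload. */
    static byte[] decompress(byte[] packed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] block = new byte[4096];
            int read;
            while ((read = gzip.read(block)) != -1) {
                buffer.write(block, 0, read);
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Repetitive plain text stands in for a document data set; real ratios depend on the data type.
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            text.append("patient reports fever and headache. ");
        }
        byte[] raw = text.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        byte[] restored = decompress(packed);
        System.out.println("raw bytes: " + raw.length);
        System.out.println("packed bytes: " + packed.length);
        System.out.println("round trip ok: " + Arrays.equals(raw, restored));
    }
}
```

As the text notes, the gain depends on the source data type: plain-text documents usually shrink considerably, whereas already compressed inputs would gain little.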

