NFU-Enabled FASTA: moving bioinformatics applications onto wide area networks.

Baker EJ, Lin GN, Liu H, Kosuri R - Source Code Biol Med (2007)

Bottom Line: We also find that genome-scale sizes of the stored data are easily adaptable to logistical networks. In situations where computation is subject to parallel solution over very large data sets, this approach provides a means to allow distributed collaborators access to a shared storage resource capable of storing the large volumes of data associated with modern life science. In addition, it provides a computation framework that removes the burden of computation from the client and places it within the network.


Affiliation: Department of Computer Science, School of Engineering and Computer Science, Baylor University, Waco, TX, USA. Erich_Baker@baylor.edu.

ABSTRACT

Background: Advances in Internet technologies have allowed life science researchers to reach beyond the lab-centric research paradigm to create distributed collaborations. Of the existing technologies that support distributed collaborations, there are currently none that simultaneously support data storage and computation as a shared network resource, enabling the computational burden to be wholly removed from participating clients. Software using the computation-enabled logistical networking components of the Internet Backplane Protocol provides a suitable means to accomplish these tasks. Here, we demonstrate software that enables this approach by distributing both the FASTA algorithm and appropriate data sets within the framework of a wide area network.
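
As an illustration of the distribution step described above, the sketch below splits a FASTA-formatted database into record-preserving chunks that could each be placed on a separate storage node. This is a minimal sketch only; the file name, chunk count, and round-robin chunking policy are assumptions made for illustration and do not reflect the authors' IBP/NFU implementation.

# Minimal sketch: split a FASTA database into n_chunks pieces so that each piece
# can be uploaded to a separate storage node. Sequence records are kept intact
# (a record is never split across chunks). File names and the chunk count are
# hypothetical.
def split_fasta(path, n_chunks):
    """Read a FASTA file and return n_chunks lists of complete records."""
    records = []
    header, seq_lines = None, []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if header is not None:
                    records.append(header + "".join(seq_lines))
                header, seq_lines = line, []
            else:
                seq_lines.append(line)
        if header is not None:  # flush the final record
            records.append(header + "".join(seq_lines))

    # Round-robin assignment keeps the number of records per chunk balanced.
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks

if __name__ == "__main__":
    # Hypothetical usage: stripe a mouse sequence database across 20 nodes,
    # writing one chunk file per node for subsequent upload.
    for node_id, chunk in enumerate(split_fasta("m_musculus.fa", 20)):
        with open("chunk_%02d.fa" % node_id, "w") as out:
            out.writelines(chunk)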

Results: For large datasets, computation-enabled logistical networks provide a significant reduction in FASTA algorithm running time over local and non-distributed logistical networking frameworks. We also find that genome-scale sizes of the stored data are easily adaptable to logistical networks.

Conclusion: The network function unit-enabled Internet Backplane Protocol effectively distributes FASTA algorithm computation over large data sets stored within the scalable network. In situations where computation is subject to parallel solution over very large data sets, this approach provides a means to allow distributed collaborators access to a shared storage resource capable of storing the large volumes of data associated with modern life science. In addition, it provides a computation framework that removes the burden of computation from the client and places it within the network.



Figure 8: Number of queries vs. response time in M. musculus. FASTA-formatted genome sequence databases were either kept locally as an unformatted dataset, distributed within a local IBP node in 20 chunks, or distributed within a non-local IBP network to 1, 5, 10 or 20 nodes. Query times for multiple 500 bp alignments against M. musculus databases demonstrated that as the number of distributed nodes increases, in either local or non-local systems, there is an overall reduction in query time. Data sets distributed across 20 nodes show greater improvement in query time than locally run FASTA algorithms. The average of three iterations is shown.

Mentions: Figures 7 and 8 show the query times of multiple 500 bp alignments against the C. elegans and M. musculus databases, respectively. In both cases, as the number of distributed nodes is increased, in either local or non-local systems, there is an overall reduction in query time. Data sets distributed across 20 nodes show greater improvement in query time than locally run FASTA algorithms. In both scenarios the NFU-enabled nodes performed exceptionally well; the query failure rate at each node was less than 0.01%, and each failed query was identified and repeated on data stored on mirrored nodes. As expected, as the dataset grows in size there is a clearer benefit to using distributed nodes to process the algorithm. We recognize that there are shortcomings in running the FASTA algorithm in parallel, as expectation values (E-values) depend, in part, on the size of the search space, which is often difficult to reconstruct accurately in striped datasets [16]. Our results from the merged distributed returns are not significantly different from those of FASTA runs in stand-alone mode (results not shown), with the vast majority of results showing zero variance.
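
The merge-and-retry behaviour described above can be pictured with the short sketch below. It is a hedged illustration only: run_fasta_on_node, mirror_of, and the (subject_id, score, e_value) hit layout are hypothetical stand-ins for this example, not the paper's NFU client API.

# Minimal sketch of the merge/retry pattern described in the text. The callable
# run_fasta_on_node, the mirror_of mapping, and the hit tuple layout are
# assumptions made for illustration.
def query_all_nodes(query_seq, nodes, run_fasta_on_node, mirror_of):
    """Query every node's chunk, retry failures on mirrored nodes, and merge
    the returned hits into a single ranked list."""
    merged = []
    for node in nodes:
        try:
            hits = run_fasta_on_node(node, query_seq)
        except RuntimeError:
            # A failed query is repeated on the node holding the mirrored copy,
            # matching the failure handling reported above.
            hits = run_fasta_on_node(mirror_of[node], query_seq)
        merged.extend(hits)

    # Each hit is assumed to be (subject_id, score, e_value). Because each node
    # searches only its own chunk, per-node E-values reflect a smaller search
    # space; ranking by raw alignment score avoids that bias, while exact
    # agreement with a stand-alone run would require rescaling E-values by the
    # full database size.
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged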

