MRPack: Multi-Algorithm Execution Using Compute-Intensive Approach in MapReduce.

Idris M, Hussain S, Siddiqi MH, Hassan W, Syed Muhammad Bilal H, Lee S - PLoS ONE (2015)

Bottom Line: The performance study of the proposed system shows that it is time, I/O, and memory efficient compared to the default MapReduce. The proposed approach reduces the execution time by 200% with an approximately 50% decrease in I/O cost. Complexity and qualitative results analysis shows significant performance improvement.


Affiliation: Ubiquitous Computing Lab., Department of Computer Engineering, Kyung Hee University, Yongin-si, Gyeonggi-do, Republic of Korea.

ABSTRACT
Large quantities of data have been generated from multiple sources at exponential rates in recent years. These data are generated at high velocity, as real-time and streaming data, in a variety of formats. These characteristics give rise to challenges in their modeling, computation, and processing. Hadoop MapReduce (MR) is a well-known data-intensive distributed processing framework that uses a distributed file system (DFS) for Big Data. Current implementations of MR support execution of only a single algorithm per job across the entire Hadoop cluster. In this paper, we propose MapReducePack (MRPack), a variation of MR that supports execution of a set of related algorithms in a single MR job. We exploit the computational capability of a cluster by increasing the compute-intensiveness of MapReduce while maintaining its data-intensive approach. MRPack uses the available computing resources by dynamically managing task assignment and intermediate data. Intermediate data from multiple algorithms are managed using multi-key and skew-mitigation strategies. The performance study of the proposed system shows that it is time, I/O, and memory efficient compared to the default MapReduce. The proposed approach reduces the execution time by 200% with an approximately 50% decrease in I/O cost. Complexity and qualitative results analysis shows significant performance improvement.
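The core idea described in the abstract — running several related algorithms over one input scan by tagging intermediate records with an algorithm identifier (a "multi-key") — can be sketched in miniature. This is a hypothetical illustration, not the authors' implementation; the function names, the `(algo_id, key)` composite-key scheme, and the in-memory grouping stand in for Hadoop's distributed shuffle:

```python
# Hypothetical sketch (not the authors' code): several MapReduce-style
# algorithms share a single pass over the input, with intermediate pairs
# kept separate by a composite (algorithm-id, key) "multi-key".

from collections import defaultdict

def multi_algorithm_mapreduce(records, algorithms):
    """Run every (map_fn, reduce_fn) pair over `records` in one scan.

    `algorithms` maps an algorithm id to a (map_fn, reduce_fn) pair.
    Each map_fn yields (key, value) pairs; they are grouped under the
    composite key (algo_id, key) so different algorithms never mix.
    """
    intermediate = defaultdict(list)
    for record in records:                      # single scan of the input
        for algo_id, (map_fn, _) in algorithms.items():
            for key, value in map_fn(record):
                intermediate[(algo_id, key)].append(value)

    results = defaultdict(dict)
    for (algo_id, key), values in intermediate.items():
        reduce_fn = algorithms[algo_id][1]
        results[algo_id][key] = reduce_fn(key, values)
    return dict(results)

# Example: word count and line-length sum over the same input, one pass.
algos = {
    "wordcount": (lambda line: ((w, 1) for w in line.split()),
                  lambda k, vs: sum(vs)),
    "linelen":   (lambda line: [("total", len(line))],
                  lambda k, vs: sum(vs)),
}
out = multi_algorithm_mapreduce(["a b a", "c"], algos)
```

In the actual system the input would be scanned once per split by a mapper rather than held in memory, which is where the claimed I/O savings over running each algorithm as a separate job would come from.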



pone.0136259.g004: Cluster-based analysis of MRPack performance compared to that of generic MapReduce.

Mentions: The results of the cluster-size experiment are shown in Fig 4. The total computation time for each cluster size is recorded by the NameNode of the Hadoop cluster; the data size is held constant throughout the experiment, and nodes are added or removed after each test. The effect of cluster size is also measured under different data volumes. For a fixed data size above a certain threshold, cluster performance improves as the number of nodes increases, as shown in Fig 4. When the data size is increased from 2 GB to 8 GB, the cluster's performance still improves with the addition of extra nodes. In other words, the benefit of adding nodes grows with the data size (i.e., on large datasets, adding nodes decreases the total time).
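The trend described above — that adding nodes pays off more on larger datasets — can be illustrated with a toy scaling model. The constants and the formula here are assumptions for illustration only, not figures from the paper:

```python
# Illustrative back-of-envelope model (an assumption, not from the paper):
# total job time ~= fixed overhead + parallelizable work / number of nodes.

def estimated_time(data_gb, nodes, overhead_s=30.0, secs_per_gb=120.0):
    """Rough cluster-scaling model: the per-GB work divides across nodes,
    while startup/coordination overhead does not."""
    return overhead_s + (data_gb * secs_per_gb) / nodes

# Doubling the cluster from 4 to 8 nodes saves more time on 8 GB than on
# 2 GB, matching the trend in Fig 4.
small_gain = estimated_time(2, 4) - estimated_time(2, 8)
large_gain = estimated_time(8, 4) - estimated_time(8, 8)
```

Because the fixed overhead term does not shrink with cluster size, small datasets see diminishing returns from extra nodes sooner than large ones.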

