Two-level incremental checkpoint recovery scheme for reducing system total overheads.

Li H, Pang L, Wang Z - PLoS ONE (2014)

Bottom Line: Checkpointing is used to reduce the losses in the event of a failure. The comparison results show that the total overheads of setting checkpoints, the total re-computing time and the total system overheads in the two-level incremental checkpoint recovery scheme are all significantly smaller than those in the two-level checkpoint recovery scheme. Finally, the limitations of our study are discussed, and open questions and possible future work are given.


Affiliation: School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an, China; Department of Computer Science, Wayne State University, Detroit, Michigan, United States of America.

ABSTRACT
Long-running applications are often subject to failures, and a failure can lead to unacceptable system overheads. Checkpointing is used to reduce the losses in the event of a failure. In the two-level checkpoint recovery scheme used for long-running tasks, the system unavoidably has to transfer a huge memory context to remote stable storage periodically. The overheads of setting checkpoints and the re-computing time therefore become critical issues that directly affect the total system overheads. Motivated by these concerns, this paper presents a new model that introduces i-checkpoints into the existing two-level checkpoint recovery scheme, so that the more probable failures are handled at lower cost and higher speed. The proposed scheme is independent of the specific failure distribution type and can be applied to different failure distribution types. We compare the two-level incremental and two-level checkpoint recovery schemes under the Weibull distribution and the exponential distribution, both of which fit the actual failure distribution best. The comparison results show that the total overheads of setting checkpoints, the total re-computing time and the total system overheads in the two-level incremental checkpoint recovery scheme are all significantly smaller than those in the two-level checkpoint recovery scheme. Finally, the limitations of our study are discussed, and open questions and possible future work are given.

pone-0104591-g003: The relationship between the optimal number of i-checkpoints and pn for different values of u = Oi/Om.

Mentions: According to Section 3.3, we can determine the optimal number of i-checkpoints between two neighboring N-checkpoints for specific parameters. Fig. 3 shows the trend of the optimal number of i-checkpoints through several groups of examples. The parameter pn is the probability of recovering from an N-checkpoint, and u = Oi/Om is the ratio of the overhead of setting an i-checkpoint to that of setting an m-checkpoint. The range of u is (0, 1). Note that we do not consider the cases u = 0 and u = 1. u = 0 means that the overhead of an i-checkpoint is 0; u = 1 means that the overhead of an i-checkpoint equals the overhead of an m-checkpoint. In both cases, the two-level incremental checkpoint recovery scheme degenerates to the two-level checkpoint recovery scheme. The range of pn is (0, 1). The cases pn = 0 and pn = 1 mean that only permanent failures or only transient failures occur in the system, which does not match actual situations, so we do not consider these two cases either. In practice, the checkpoint recovery scheme is also affected by other factors, such as network throughput and I/O interaction [30], [31]. However, the existing schemes [15], [16], [18] considered only the main factors that affect system performance, and their experiments ignore the effects of these other factors. Other research now studies their effects on the checkpoint recovery scheme, and this has come to be regarded as a separate, independent research topic. In our experiment, to compare the performance of the existing schemes and ours under the same circumstances, we also ignore these factors, as in [15], [16], [18]. Studying the effects of these factors is not a contribution of this paper and may be part of our future work.
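
The parameter ranges described above can be illustrated with a minimal Python sketch. It only encodes the two open intervals stated in the text, u = Oi/Om in (0, 1) and pn in (0, 1); it does not reproduce the optimal-number formula of Section 3.3, which is not given here. The function names (overhead_ratio, check_pn, pn_grid) and the evenly spaced grid are assumptions made for illustration, not part of the paper.

```python
def overhead_ratio(o_i: float, o_m: float) -> float:
    """Return u = Oi/Om, the i-checkpoint to m-checkpoint overhead ratio.

    Hypothetical helper: rejects the excluded endpoints u = 0 and u = 1,
    where the text says the scheme degenerates to the two-level scheme.
    """
    if o_m <= 0:
        raise ValueError("m-checkpoint overhead Om must be positive")
    u = o_i / o_m
    if not 0.0 < u < 1.0:
        raise ValueError(f"u = {u} lies outside the open interval (0, 1)")
    return u


def check_pn(p_n: float) -> float:
    """Return pn unchanged if it lies strictly inside (0, 1)."""
    if not 0.0 < p_n < 1.0:
        raise ValueError(f"pn = {p_n} lies outside the open interval (0, 1)")
    return p_n


def pn_grid(steps: int) -> list[float]:
    """Evenly spaced pn values strictly inside (0, 1) (assumed sampling).

    Both endpoints are skipped because pn = 0 and pn = 1 are excluded.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [check_pn(k / (steps + 1)) for k in range(1, steps + 1)]
```

A parameter sweep in the style of Fig. 3 could then loop over a few values of u and over pn_grid(steps), calling whatever implementation of the Section 3.3 result is available for each pair.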

