Checkpointing of parallel applications in a grid environment

Kreeteeraj Sajadah, Gabor Terstyansky, Stephen C. Winter, P. Kacsuk

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Abstract

Jobs in Grid workflows are exposed to different types of failure. It is important to develop fault tolerant mechanisms to ensure a good level of reliability during the execution of Grid jobs. While checkpointing is the most common method to achieve fault tolerance, there is still a lot of work to be done to improve the efficiency of the mechanism. This paper gives an overview of a checkpoint solution for checkpointing parallel applications executed on multiple sites in the Grid environment. The checkpointing mechanism is an improvement of the PGRADE checkpointing solution.

Original language: English
Title of host publication: Distributed and Parallel Systems: In Focus: Desktop Grid Computing
Publisher: Springer US
Pages: 179-187
Number of pages: 9
ISBN (Print): 9780387698571
DOI: https://doi.org/10.1007/978-0-387-79448-8_16
Publication status: Published - 2007

Fingerprint

Fault tolerance

Keywords

  • Checkpointing
  • Critical Region
  • First Order Approximation
  • Natural Synchronisation Points

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Sajadah, K., Terstyansky, G., Winter, S. C., & Kacsuk, P. (2007). Checkpointing of parallel applications in a grid environment. In Distributed and Parallel Systems: In Focus: Desktop Grid Computing (pp. 179-187). Springer US. https://doi.org/10.1007/978-0-387-79448-8_16

Checkpointing of parallel applications in a grid environment. / Sajadah, Kreeteeraj; Terstyansky, Gabor; Winter, Stephen C.; Kacsuk, P.

Distributed and Parallel Systems: In Focus: Desktop Grid Computing. Springer US, 2007. p. 179-187.

@inbook{e32307fb13de4a3ca69163bd6bfe95de,
title = "Checkpointing of parallel applications in a grid environment",
abstract = "Jobs in Grid workflows are exposed to different types of failure. It is important to develop fault tolerant mechanisms to ensure a good level of reliability during the execution of Grid jobs. While checkpointing is the most common method to achieve fault tolerance, there is still a lot of work to be done to improve the efficiency of the mechanism. This paper gives an overview of a checkpoint solution for checkpointing parallel applications executed on multiple sites in the Grid environment. The checkpointing mechanism is an improvement of the PGRADE checkpointing solution.",
keywords = "Checkpointing, Critical Region, First Order Approximation, Natural Synchronisation Points",
author = "Kreeteeraj Sajadah and Gabor Terstyansky and Winter, {Stephen C.} and P. Kacsuk",
year = "2007",
doi = "10.1007/978-0-387-79448-8_16",
language = "English",
isbn = "9780387698571",
pages = "179--187",
booktitle = "Distributed and Parallel Systems: In Focus: Desktop Grid Computing",
publisher = "Springer US",
}

TY  - CHAP
T1  - Checkpointing of parallel applications in a grid environment
AU  - Sajadah, Kreeteeraj
AU  - Terstyansky, Gabor
AU  - Winter, Stephen C.
AU  - Kacsuk, P.
PY  - 2007
Y1  - 2007
N2  - Jobs in Grid workflows are exposed to different types of failure. It is important to develop fault tolerant mechanisms to ensure a good level of reliability during the execution of Grid jobs. While checkpointing is the most common method to achieve fault tolerance, there is still a lot of work to be done to improve the efficiency of the mechanism. This paper gives an overview of a checkpoint solution for checkpointing parallel applications executed on multiple sites in the Grid environment. The checkpointing mechanism is an improvement of the PGRADE checkpointing solution.
AB  - Jobs in Grid workflows are exposed to different types of failure. It is important to develop fault tolerant mechanisms to ensure a good level of reliability during the execution of Grid jobs. While checkpointing is the most common method to achieve fault tolerance, there is still a lot of work to be done to improve the efficiency of the mechanism. This paper gives an overview of a checkpoint solution for checkpointing parallel applications executed on multiple sites in the Grid environment. The checkpointing mechanism is an improvement of the PGRADE checkpointing solution.
KW  - Checkpointing
KW  - Critical Region
KW  - First Order Approximation
KW  - Natural Synchronisation Points
UR  - http://www.scopus.com/inward/record.url?scp=84889967505&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84889967505&partnerID=8YFLogxK
U2  - 10.1007/978-0-387-79448-8_16
DO  - 10.1007/978-0-387-79448-8_16
M3  - Chapter
AN  - SCOPUS:84889967505
SN  - 9780387698571
SP  - 179
EP  - 187
BT  - Distributed and Parallel Systems: In Focus: Desktop Grid Computing
PB  - Springer US
ER  -