Coarse grain computation-communication overlap for efficient application-level checkpointing for GPUs
Abstract:
Graphics Processing Units (GPUs) are increasingly used to solve non-graphical scientific problems. However, GPU reliability has been shown to be a concern because of the occurrence of soft and hard errors. Checkpoint/restart is the most commonly used technique for achieving fault tolerance in the presence of failures. This work presents an application-level checkpoint scheme for systems composed of GPUs. Our scheme exploits divide-and-conquer and communication-computation overlap to improve execution time and checkpoint overhead. By dividing the problem and its checkpointing into n subprocesses, we show that our scheme reduces the checkpoint overhead by a factor of n. We also show that dividing the problem with finer granularity is not beneficial. © 2010 IEEE.
Publication year:
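The factor-of-n claim in the abstract can be illustrated with a small timing model. This is only a sketch under stated assumptions, not the authors' implementation: the function name `pipelined_times` and its parameters are hypothetical, the work is assumed to split into n equal chunks, and the checkpoint copy of one chunk is assumed to run concurrently with the computation of the next chunk. Per-transfer fixed costs are not modelled, so the sketch does not reproduce the abstract's observation that finer granularity stops helping.

```python
def pipelined_times(compute_time, checkpoint_time, n):
    """Return (total_time, exposed_overhead) for n equal chunks.

    Assumed model (hypothetical, for illustration only): one compute
    engine and one copy engine; the checkpoint copy of chunk i may
    overlap with the computation of chunk i + 1.
    """
    chunk_compute = compute_time / n
    chunk_copy = checkpoint_time / n
    compute_done = 0.0  # time at which the current chunk's compute ends
    copy_done = 0.0     # time at which the latest checkpoint copy ends
    for _ in range(n):
        compute_done += chunk_compute
        # A copy starts once its chunk is computed and the copy engine is free.
        copy_done = max(copy_done, compute_done) + chunk_copy
    total = copy_done
    return total, total - compute_time


if __name__ == "__main__":
    for n in (1, 2, 4):
        total, overhead = pipelined_times(100.0, 20.0, n)
        print(n, total, overhead)
```

In this model, when the checkpoint copy is shorter than the computation, only the copy of the last chunk remains exposed, so the overhead falls from the full checkpoint time to checkpoint_time / n. When the copy is longer than the computation, the copy engine becomes the bottleneck and the reduction is smaller.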
2010
Keywords:
- Fault tolerance
- Checkpoint
- CUDA
- GPU
- Tesla
Source:
scopus
google
Document type:
Conference Object
Status:
Restricted access
Knowledge areas:
- Computer science
Dewey subject areas:
- Computer science
Sustainable Development Goals:
- SDG 9: Industry, innovation and infrastructure
- SDG 17: Partnerships for the goals
- SDG 8: Decent work and economic growth