Module prototype for online failure prediction for the IBM Blue Gene/L


Abstract:

The growing complexity of scientific applications has led to the design and deployment of large-scale parallel systems. The IBM Blue Gene/L can hold in excess of 200K processors and has been designed for high performance and reliability. However, failures in this large-scale parallel system are a major concern, since it has been demonstrated that a failure will significantly reduce the performance of the system. Although reactive fault-tolerant policies effectively minimize the effects of faults, it has been shown that these techniques drastically reduce system performance. Proactive fault-tolerant policies have emerged as an alternative due to the reduced performance degradation they impose. These policies are based on the analysis of information about the state of the system. The monitoring system of the IBM Blue Gene/L generates online information about the state of the system's hardware and software and stores that information in the RAS event log. In this study, we design and implement a module prototype for online failure prediction. This prototype is tested and validated, in a realistic scenario, using the RAS event log of an IBM Blue Gene/L system. We show that our module prototype for failure prediction predicts up to 70% of the fatal events. © 2008 IEEE.

Publication year:

2008

Keywords:

  • Failure analysis
  • Blue Gene/L
  • Computer Fault Tolerance
  • Software fault tolerance

Source:

Scopus
Google

Document type:

Conference Object

Status:

Restricted access

Knowledge areas:

  • Computer simulation

Dewey subject areas:

  • Computer science
  • Computer programming, programs, data, security
  • Precision instruments and other devices
Processed with AI

Sustainable Development Goals:

  • SDG 9: Industry, innovation and infrastructure
  • SDG 17: Partnerships for the goals
  • SDG 8: Decent work and economic growth
Processed with AI