Module prototype for online failure prediction for the IBM Blue Gene/L


Abstract:

The growing complexity of scientific applications has led to the design and deployment of large-scale parallel systems. The IBM Blue Gene/L can hold in excess of 200K processors and has been designed for high performance and reliability. However, failures in this large-scale parallel system are a major concern, since it has been demonstrated that a failure will significantly reduce the performance of the system. Although reactive fault-tolerant policies effectively minimize the effects of faults, it has been shown that these techniques drastically reduce system performance. Proactive fault-tolerant policies have emerged as an alternative because they impose less performance degradation. Proactive fault-tolerant policies are based on the analysis of information about the state of the system. The monitoring system of the IBM Blue Gene/L generates online information about the state of the system's hardware and software and stores that information in the RAS event log. In this study, we design and implement a module prototype for online failure prediction. This prototype is tested and validated, in a realistic scenario, using the RAS event log of an IBM Blue Gene/L system. We show that our module prototype for failure prediction predicts up to 70% of the fatal events. © 2008 IEEE.

Publication year:

2008

Keywords:

  • Failure analysis
  • Blue Gene/L
  • Computer Fault Tolerance
  • Software fault tolerance

Source:

Google
Scopus

Document type:

Conference Object

Status:

Restricted access

Knowledge areas:

  • Computer simulation

Subject areas:

  • Computer science
  • Computer programming, programs, data, security
  • Precision instruments and other devices