RAS and Job log data analysis for failure prediction for the IBM Blue Gene/L
Abstract:
The computational needs of scientific applications have grown to the point where computers with a very high degree of parallelism are necessary. The IBM Blue Gene/L can hold in excess of 200K processors and has been designed for high performance. However, failures in such a large system are a major concern, since it has been demonstrated that a failure drastically decreases the performance of the system. Checkpointing and log schemes have been used to overcome these failures; however, these techniques have been shown to be less effective than desired. Therefore, proactive failure detection and prediction have gained interest in the research community. In this study, we collected the RAS event and Job logs from a large IBM Blue Gene/L over a three-month period. We investigated the relationship between fatal and non-fatal events with the aim of proactive failure prediction. Based on our observations, we developed a scheme for predicting fatal events based on the spatial and temporal relations among fatal and non-fatal events. We show that with our scheme up to 84% of fatal events could be effectively predicted.
Publication year:
2008
Keywords:
- Fault tolerance
- IBM Blue Gene/L
- Large scale systems
Source:


Document type:
Conference Object
Status:
Restricted access
Knowledge areas:
- Computer simulation
- Computer science
Subject areas:
- Computer programming, programs, data, security