Inferring Workflows with Job Dependencies from Distributed Processing Systems Logs


Abstract:

We consider the problem of evaluating new improvements to distributed processing platforms such as Spark and Hadoop. A common approach when evaluating these systems is to replay workloads published by companies that operate large data clusters, such as Google and Facebook. These evaluations seek to demonstrate the benefits of improvements to critical framework components, such as the job scheduler, under realistic workloads. However, published workloads typically contain no information on dependencies between jobs. This is problematic, because ignoring dependencies can lead to significantly misestimating the speedup obtained from a particular improvement. In this position paper, we discuss why it is important to include job dependency information when evaluating distributed processing frameworks, and show that workflow mining techniques can be used to obtain dependencies from job traces that lack …
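As a rough illustration of the kind of workflow mining the abstract refers to (this sketch is not from the paper itself), one simple heuristic infers a candidate dependency edge from one job to another when the second job starts shortly after the first finishes. The job names and the `max_gap` threshold below are purely illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    start: float  # submission time in the trace
    end: float    # completion time in the trace

def infer_dependencies(jobs, max_gap=5.0):
    """Infer candidate edges (a, b) when job b starts within
    max_gap time units after job a finishes (temporal precedence)."""
    edges = set()
    for a in jobs:
        for b in jobs:
            if a is not b and 0.0 <= b.start - a.end <= max_gap:
                edges.add((a.name, b.name))
    return edges

# Toy trace: three jobs that happen to run back-to-back.
trace = [Job("extract", 0, 10), Job("transform", 11, 25), Job("load", 26, 30)]
print(sorted(infer_dependencies(trace)))
# [('extract', 'transform'), ('transform', 'load')]
```

Real workflow-mining techniques combine such temporal signals with other evidence (for example, shared input/output data) to reduce false positives; this sketch shows only the temporal-precedence idea.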

Year of publication:

2017

Keywords:

    Source:

    Google

    Document type:

    Other

    Status:

    Open access

    Knowledge areas:

    • Data analysis
    • Computer science

    Subject areas:

    • Computer programming, programs, data, security

    Contributors: