Inferring workflows with job dependencies from distributed processing system logs (or: how to evaluate your systems with realistic workflows not pulled out of thin air)
Abstract:
We consider the problem of evaluating new improvements to distributed processing platforms such as Spark and Hadoop. One approach commonly used when evaluating these systems is to replay workloads published by companies with large data clusters, such as Google and Facebook. These evaluations seek to demonstrate, under realistic workloads, the benefits of improvements to critical framework components such as the job scheduler. However, published workloads typically do not contain information on dependencies between jobs. This is problematic: ignoring dependencies can lead to significantly misestimating the speedup obtained from a particular improvement. In this position paper, we discuss why it is important to include job dependency information when evaluating distributed processing frameworks, and show that workflow mining techniques can be used to recover dependencies from job traces that lack them. As a proof of concept, we show that the proposed methodology is able to find workflows in traces published by Google.
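To make the idea concrete, the sketch below shows one simple way dependencies might be mined from a trace that lacks them. It is an illustrative temporal-precedence heuristic, not the method from the paper: the trace format (batch id, job name, start, end), the job names, and the `min_support` and `max_gap` thresholds are all assumptions made for the example. It infers a candidate edge A -> B when, across recurring batches of the workload, B consistently starts shortly after A finishes.

```python
from collections import defaultdict

# Hypothetical trace: (batch_id, job_name, start, end). A batch_id groups
# the jobs of one execution of a recurring workload (e.g., one daily run);
# job names recur across batches. All names and values here are made up.
TRACE = [
    (1, "extract", 0, 10),    (1, "transform", 11, 25),   (1, "load", 26, 30),
    (2, "extract", 100, 112), (2, "transform", 113, 130), (2, "load", 131, 137),
    (3, "extract", 200, 208), (3, "transform", 210, 224), (3, "load", 225, 233),
]

def mine_dependencies(trace, min_support=0.9, max_gap=5):
    """Emit a candidate edge A -> B when, in at least `min_support` of the
    batches containing both jobs, B starts within `max_gap` time units
    after A ends (a simple temporal-precedence heuristic)."""
    batches = defaultdict(dict)
    for batch, job, start, end in trace:
        batches[batch][job] = (start, end)

    follows = defaultdict(int)  # (A, B) -> batches where B follows A closely
    both = defaultdict(int)     # (A, B) -> batches containing both jobs
    for jobs in batches.values():
        for a, (_, a_end) in jobs.items():
            for b, (b_start, _) in jobs.items():
                if a == b:
                    continue
                both[(a, b)] += 1
                if 0 <= b_start - a_end <= max_gap:
                    follows[(a, b)] += 1
    return {edge for edge, n in follows.items() if n / both[edge] >= min_support}

print(sorted(mine_dependencies(TRACE)))
# [('extract', 'transform'), ('transform', 'load')]
```

A miner applied to a real trace such as Google's would additionally have to identify recurring jobs (e.g., by logical job name) and prune transitive or coincidental edges, but the core idea of exploiting recurring temporal precedence is the same.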
Year of publication:
2018
Keywords:
- Hadoop
- Data mining
- Workflows
- Distributed processing
- Clusters
- Workloads
Source:
Document type:
Conference Object
Status:
Restricted access
Knowledge areas:
- Data analysis
- Computer science
Subject areas:
- Special computer methods