Improving Memory Utilization by Sharing DNN Models for Serverless Inference


Abstract:

Serving deep neural network (DNN) models requires substantial memory and has relatively long response times. Researchers are therefore integrating DNN inference services with serverless architectures, whose short-lived, bursty workloads are well suited to the serverless model. In addition, DNN services are increasingly deployed on edge clouds to reduce response time. However, serving DNN models in a serverless manner suffers from excessive memory usage caused by data duplication, a problem that is even more serious on strongly resource-constrained edge clouds. To address this problem, we designed ShmFaas on an open-source serverless platform running on Kubernetes with minimal code changes. First, we implemented a serverless system with lightweight memory isolation that shares DNN models in memory, avoiding the model duplication problem. We also designed an LRU-based model eviction algorithm for efficient memory usage on the edge cloud. In our experiments, the system's memory usage is reduced by more than 29.4% compared to the existing system, and the overhead introduced by the proposed system is negligible enough for real-world workloads.
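
The abstract names two mechanisms (a shared in-memory model store and LRU-based eviction) without reproducing the paper's implementation. As a minimal illustrative sketch only, the Python code below keeps serialized model weights in POSIX shared-memory segments that concurrent function instances can attach to without copying, and unlinks the least recently used segment when a byte budget is exceeded. All names here (SharedModelCache, the "model_" segment prefix, capacity_bytes) are hypothetical and not taken from ShmFaas.

```python
from collections import OrderedDict
from multiprocessing import shared_memory


class SharedModelCache:
    """LRU cache that publishes serialized model weights in POSIX
    shared memory, so multiple function instances map one copy."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        # Insertion order doubles as recency order: oldest entry first.
        self._segments = OrderedDict()  # model name -> SharedMemory

    def put(self, name: str, weights: bytes) -> shared_memory.SharedMemory:
        """Publish one model's weights, evicting LRU entries to fit.
        (A sketch: a single model larger than the budget would still
        be admitted after evicting everything else.)"""
        if name in self._segments:
            self._segments.move_to_end(name)  # refresh recency
            return self._segments[name]
        while self._segments and self.used_bytes + len(weights) > self.capacity_bytes:
            self._evict_lru()
        shm = shared_memory.SharedMemory(
            name=f"model_{name}", create=True, size=len(weights))
        shm.buf[:len(weights)] = weights
        self._segments[name] = shm
        self.used_bytes += len(weights)
        return shm

    def get(self, name: str):
        """Return the shared segment for `name`, refreshing recency."""
        shm = self._segments.get(name)
        if shm is not None:
            self._segments.move_to_end(name)  # mark most recently used
        return shm

    def _evict_lru(self) -> None:
        """Unlink the least recently used segment from /dev/shm."""
        _, shm = self._segments.popitem(last=False)
        self.used_bytes -= shm.size
        shm.close()
        shm.unlink()
```

A function instance in another process would attach to a published model by name, e.g. `shared_memory.SharedMemory(name="model_resnet50")`, and read the weights through the zero-copy `buf` memoryview; how ShmFaas actually isolates and maps models inside its Kubernetes-based platform is described only in the full paper.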

Year of publication:

2023

Keywords:

  • Serverless
  • DNN inference
  • Cloud computing

Source:

Scopus

Document type:

Conference Object

Status:

Restricted access

Knowledge areas:

  • Machine learning
  • Computer science

Subject areas:

  • Computer programming, programs, data, security