Low-latency computing network resource scheduling and allocation algorithms for deep learning jobs

Date
2022-08
Authors
Yu, Menglu
Major Professor
Yu, Menglu's major professor: Liu, Jia (Kevin)
Committee Members
Rajan, Hridesh
Zhang, Wensheng
Aduri, Pavan
Zhang, Hongwei
Abstract
This dissertation focuses on modeling and designing efficient resource scheduling and allocation algorithms for deep learning jobs in distributed machine learning systems and computing clusters based on mainstream frameworks (e.g., TensorFlow and PyTorch). Due to the rapid growth of training dataset sizes and model complexity, it has become prevalent to leverage data parallelism to expedite the training process. However, data communication between computing devices (e.g., GPUs) typically becomes the bottleneck to scaling the system, so alleviating the communication bottleneck when scheduling deep learning jobs in distributed systems has recently attracted increasing attention in both academia and industry. Designing such resource allocation and scheduling algorithms is highly non-trivial: the problem typically has packing-type constraints (due to resource capacity limits), covering-type constraints (due to job workload requirements), and non-convex constraints (due to topology, contention, etc.), and is NP-hard in general. Moreover, the required integer decision variables add another layer of difficulty. To overcome these challenges, we design a suite of provable algorithms that schedule jobs efficiently. We start with a resource allocation algorithm for computing clusters, focusing on resource allocation without considering placement for DNN jobs. We then extend this work to distributed machine learning systems and computing clusters by jointly optimizing placement and resource scheduling for DNN jobs. Throughout the thesis, we design schedulers for deep learning jobs with various objectives (e.g., minimizing the overall training completion time, minimizing the makespan, and maximizing the overall job utility). We first design efficient scheduling algorithms under simplified assumptions, such as reserved bandwidth for each job and a complete-graph network topology. We then extend this work by taking practical concerns (e.g., topology mapping and contention among multiple jobs) into consideration when developing schedulers for distributed machine learning systems and computing clusters.
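To make the constraint structure described in the abstract concrete, the following is a minimal, illustrative integer-program sketch, not the dissertation's actual formulation. All symbols are assumptions introduced here for illustration: x_{ij} is a hypothetical amount of resource type i allocated to job j, r_{ij} and w_{ij} are per-unit resource usage and work rates, C_i is the capacity of resource i, D_j is job j's workload demand, and T_j(x) is job j's resulting completion time.

\begin{align}
\min_{x} \quad & \max_{j} \; T_j(x) && \text{(e.g., minimize the makespan)} \\
\text{s.t.} \quad & \sum_{j} r_{ij}\, x_{ij} \le C_i, \quad \forall i && \text{(packing: capacity of resource } i\text{)} \\
& \sum_{i} w_{ij}\, x_{ij} \ge D_j, \quad \forall j && \text{(covering: workload demand of job } j\text{)} \\
& x_{ij} \in \mathbb{Z}_{\ge 0}, \quad \forall i, j && \text{(integer allocation decisions)}
\end{align}

In this sketch, the non-convexity the abstract mentions would enter through T_j(x), which in practice depends on topology mapping and contention among co-located jobs rather than on the allocation alone.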
Type
dissertation