Low-latency computing network resource scheduling and allocation algorithms for deep learning jobs

Date
2022-08
Authors
Yu, Menglu
Major Professor
Yu, Menglu's major professor: Liu, Jia (Kevin)
Committee Members
Rajan, Hridesh
Zhang, Wensheng
Aduri, Pavan
Zhang, Hongwei
Abstract
This dissertation focuses on modeling and designing efficient resource scheduling and allocation algorithms for deep learning jobs in distributed machine learning systems and computing clusters based on mainstream frameworks (e.g., TensorFlow and PyTorch). Due to the rapid growth of training dataset sizes and model complexity, it has become prevalent to leverage data parallelism to expedite the training process. However, data communication between computing devices (e.g., GPUs) typically becomes the bottleneck to scaling the system, so alleviating the communication bottleneck when scheduling deep learning jobs in distributed systems has recently attracted increasing attention in both academia and industry. Designing such resource allocation and scheduling algorithms is highly non-trivial: the problem typically has packing-type constraints (due to resource capacity limits), covering-type constraints (due to job workload requirements), and non-convex constraints (due to topology, contention, etc.), and is NP-hard in general. Moreover, the required integer decision variables add another layer of difficulty. To overcome these challenges, we design a suite of provable algorithms that schedule jobs efficiently. We start with a resource allocation algorithm for computing clusters, focusing on resource allocation without considering placement for DNN jobs. We then extend this work to distributed machine learning systems and computing clusters by jointly optimizing placement and resource scheduling for DNN jobs. Throughout the thesis, we design schedulers for deep learning jobs with various objectives (e.g., minimizing the overall training completion time, minimizing the makespan, and maximizing the overall job utility). We first design efficient scheduling algorithms under simplified assumptions, such as reserved bandwidth for each job and a complete-graph network topology. We then extend this work by taking practical concerns (e.g., topology mapping and contention among multiple jobs) into consideration when developing schedulers for distributed machine learning systems and computing clusters.
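To make the constraint structure described in the abstract concrete, the following is a minimal, illustrative integer-program sketch, not the dissertation's actual formulation. All symbols are assumptions introduced here for illustration: x_{ij} is a hypothetical amount of resource type i allocated to job j, r_{ij} and w_{ij} are per-unit resource usage and work rates, C_i is the capacity of resource i, D_j is job j's workload demand, and T_j(x) is job j's resulting completion time.

\begin{align}
\min_{x} \quad & \max_{j} \; T_j(x) && \text{(e.g., minimize the makespan)} \\
\text{s.t.} \quad & \sum_{j} r_{ij}\, x_{ij} \le C_i, \quad \forall i && \text{(packing: capacity of resource } i\text{)} \\
& \sum_{i} w_{ij}\, x_{ij} \ge D_j, \quad \forall j && \text{(covering: workload demand of job } j\text{)} \\
& x_{ij} \in \mathbb{Z}_{\ge 0}, \quad \forall i, j && \text{(integer allocation decisions)}
\end{align}

In this sketch, the non-convexity the abstract mentions would enter through T_j(x), which in practice depends on topology mapping and contention among co-located jobs rather than on the allocation alone.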
Type
dissertation