Low-latency computing network resource scheduling and allocation algorithms for deep learning jobs

dc.contributor.advisor Liu, Jia (Kevin)
dc.contributor.advisor Rajan, Hridesh
dc.contributor.advisor Zhang, Wensheng
dc.contributor.advisor Aduri, Pavan
dc.contributor.advisor Zhang, Hongwei
dc.contributor.author Yu, Menglu
dc.contributor.department Department of Computer Science
dc.date.accessioned 2022-11-09T00:15:39Z
dc.date.available 2022-11-09T00:15:39Z
dc.date.embargo 2024-09-08T00:00:00Z
dc.date.issued 2022-08
dc.date.updated 2022-11-09T00:15:39Z
dc.description.abstract This dissertation focuses on modeling and designing efficient resource scheduling and allocation algorithms for deep learning jobs in distributed machine learning systems and computing clusters based on mainstream frameworks (e.g., TensorFlow and PyTorch). Due to the rapid growth of training dataset size and model complexity, it has become prevalent to leverage data parallelism to expedite the training process. However, data communication between computing devices (e.g., GPUs) typically becomes the bottleneck to scaling the system. Thus, how to alleviate the communication bottleneck when scheduling deep learning jobs in distributed systems has recently attracted increasing attention in both academia and industry. Designing such resource allocation and scheduling algorithms, however, is highly non-trivial. Specifically, the problem typically has packing-type constraints (due to resource capacity limits), covering-type constraints (due to job workload requirements), and non-convex constraints (due to topology, contention, etc.), which makes it NP-hard in general. Moreover, the need for integer decision variables adds another layer of difficulty to solving the problem. To overcome these challenges, we design a suite of algorithms with provable performance guarantees to schedule jobs efficiently. In this thesis, we start with resource allocation algorithm design for computing clusters, where we focus on resource allocation without considering placement for DNN jobs. We then extend our work to distributed machine learning systems and computing clusters by jointly optimizing placement and resource scheduling for DNN jobs. We design schedulers for deep learning jobs with various objectives (e.g., minimizing the overall training completion time, minimizing the makespan, and maximizing the overall job utility). We first design efficient scheduling algorithms under simplifying assumptions, such as reserved bandwidth for each job and a complete-graph underlying network. We then extend our work by taking practical concerns (e.g., topology mapping and contention among multiple jobs) into consideration when developing schedulers for distributed machine learning systems and computing clusters.
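As an informal illustration of the problem structure described in the abstract (a hedged sketch with assumed symbols, not the dissertation's actual formulation), a scheduling problem combining packing-type, covering-type, and integrality constraints can be written as:

% Illustrative sketch only; all symbols below are assumptions for exposition.
% x_{j,r,t}: units of resource r allocated to job j in time slot t
% C_r: capacity of resource r; W_j: workload of job j; s_j: per-unit service rate; T_j: completion time of job j
% Requires amsmath and amssymb.
\begin{align*}
\min_{x}\ & \sum_{j} T_j && \text{(e.g., total job completion time)} \\
\text{s.t.}\ & \sum_{j} x_{j,r,t} \le C_r, && \forall r, t \quad \text{(packing: resource-capacity limits)} \\
& \sum_{t} s_j\, x_{j,r,t} \ge W_j, && \forall j \quad \text{(covering: workload requirements)} \\
& x_{j,r,t} \in \mathbb{Z}_{\ge 0}, && \forall j, r, t \quad \text{(integer allocation decisions)}
\end{align*}

Additional non-convex coupling constraints (e.g., topology mapping and contention among jobs), which the dissertation also considers, are omitted from this sketch.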
dc.format.mimetype PDF
dc.identifier.doi https://doi.org/10.31274/td-20240329-477
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/1wge8QMr
dc.language.iso en
dc.language.rfc3066 en
dc.subject.disciplines Computer science en_US
dc.subject.keywords Deep Learning Jobs en_US
dc.subject.keywords Distributed Systems en_US
dc.subject.keywords Networking en_US
dc.subject.keywords Optimization en_US
dc.subject.keywords Resource Scheduling en_US
dc.title Low-latency computing network resource scheduling and allocation algorithms for deep learning jobs
dc.type dissertation en_US
dc.type.genre dissertation en_US
dspace.entity.type Publication
relation.isOrgUnitOfPublication f7be4eb9-d1d0-4081-859b-b15cee251456
thesis.degree.discipline Computer science en_US
thesis.degree.grantor Iowa State University en_US
thesis.degree.level dissertation
thesis.degree.name Doctor of Philosophy en_US
File
Original bundle
Name: Yu_iastate_0097E_20231.pdf
Size: 2.05 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 0 B
Format: Item-specific license agreed upon to submission