Low-latency computing network resource scheduling and allocation algorithms for deep learning jobs

dc.contributor.advisor Liu, Jia (Kevin)
dc.contributor.advisor Rajan, Hridesh
dc.contributor.advisor Zhang, Wensheng
dc.contributor.advisor Aduri, Pavan
dc.contributor.advisor Zhang, Hongwei
dc.contributor.author Yu, Menglu
dc.contributor.department Department of Computer Science
dc.date.accessioned 2022-11-09T00:15:39Z
dc.date.available 2022-11-09T00:15:39Z
dc.date.embargo 2024-09-08T00:00:00Z
dc.date.issued 2022-08
dc.date.updated 2022-11-09T00:15:39Z
dc.description.abstract This dissertation focuses on modeling and designing efficient resource scheduling and allocation algorithms for deep learning jobs in distributed machine learning systems and computing clusters based on mainstream frameworks (e.g., TensorFlow and PyTorch). Due to the rapid growth of training dataset size and model complexity, it has become prevalent to leverage data parallelism to expedite the training process. However, data communication between computing devices (e.g., GPUs) typically becomes the bottleneck to scaling the system. Thus, how to alleviate the communication bottleneck when scheduling deep learning jobs in distributed systems has recently attracted increasing attention in both academia and industry. Designing such resource allocation and scheduling algorithms, however, is highly non-trivial. Specifically, the problem typically has packing-type constraints (due to resource capacity limits), covering-type constraints (due to job workload requirements), and non-convex constraints (due to topology, contention, etc.), which makes it NP-hard in general. Moreover, the need for integer decision variables adds another layer of difficulty to solving the problem. To overcome these challenges, we design a suite of algorithms with provable performance guarantees to schedule jobs efficiently. In this thesis, we start with resource allocation algorithm design for computing clusters, where we focus on resource allocation without considering placement for DNN jobs. We then extend our work to distributed machine learning systems and computing clusters by jointly optimizing placement and resource scheduling for DNN jobs. We design schedulers for deep learning jobs with various objectives (e.g., minimizing the overall training completion time, minimizing the makespan, and maximizing the overall job utility). We first design efficient scheduling algorithms under simplifying assumptions, such as reserved bandwidth for each job and a complete-graph underlying network. We then extend our work by taking practical concerns (e.g., topology mapping and contention among multiple jobs) into consideration when developing schedulers for distributed machine learning systems and computing clusters.
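As an informal illustration of the problem structure described in the abstract (a hedged sketch with assumed symbols, not the dissertation's actual formulation), a scheduling problem combining packing-type, covering-type, and integrality constraints can be written as:

% Illustrative sketch only; all symbols below are assumptions for exposition.
% x_{j,r,t}: units of resource r allocated to job j in time slot t
% C_r: capacity of resource r; W_j: workload of job j; s_j: per-unit service rate; T_j: completion time of job j
% Requires amsmath and amssymb.
\begin{align*}
\min_{x}\ & \sum_{j} T_j && \text{(e.g., total job completion time)} \\
\text{s.t.}\ & \sum_{j} x_{j,r,t} \le C_r, && \forall r, t \quad \text{(packing: resource-capacity limits)} \\
& \sum_{t} s_j\, x_{j,r,t} \ge W_j, && \forall j \quad \text{(covering: workload requirements)} \\
& x_{j,r,t} \in \mathbb{Z}_{\ge 0}, && \forall j, r, t \quad \text{(integer allocation decisions)}
\end{align*}

Additional non-convex coupling constraints (e.g., topology mapping and contention among jobs), which the dissertation also considers, are omitted from this sketch.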
dc.format.mimetype PDF
dc.identifier.doi https://doi.org/10.31274/td-20240329-477
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/1wge8QMr
dc.language.iso en
dc.language.rfc3066 en
dc.subject.disciplines Computer science en_US
dc.subject.keywords Deep Learning Jobs en_US
dc.subject.keywords Distributed Systems en_US
dc.subject.keywords Networking en_US
dc.subject.keywords Optimization en_US
dc.subject.keywords Resource Scheduling en_US
dc.title Low-latency computing network resource scheduling and allocation algorithms for deep learning jobs
dc.type dissertation en_US
dc.type.genre dissertation en_US
dspace.entity.type Publication
relation.isOrgUnitOfPublication f7be4eb9-d1d0-4081-859b-b15cee251456
thesis.degree.discipline Computer science en_US
thesis.degree.grantor Iowa State University en_US
thesis.degree.level dissertation
thesis.degree.name Doctor of Philosophy en_US
File
Original bundle
Name: Yu_iastate_0097E_20231.pdf
Size: 2.05 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 0 B
Format: Item-specific license agreed upon to submission