Toward efficient online scheduling for large-scale distributed machine learning system

dc.contributor.advisor Jia Liu
dc.contributor.author Yu, Menglu
dc.contributor.department Computer Science
dc.date 2019-09-19T12:58:47.000
dc.date.accessioned 2020-06-30T03:17:58Z
dc.date.available 2020-06-30T03:17:58Z
dc.date.copyright Wed May 01 00:00:00 UTC 2019
dc.date.embargo 2021-01-24
dc.date.issued 2019-01-01
dc.description.abstract Thanks to the rise of machine learning (ML) and its vast applications, recent years have witnessed rapid growth of large-scale distributed ML frameworks, which exploit the massive parallelism of computing clusters to expedite ML training jobs. However, the proliferation of large-scale distributed ML frameworks also introduces many unique technical challenges in computing system design and optimization. In a networked computing cluster that supports a large number of training jobs, a central question is how to design efficient scheduling algorithms to allocate workers and parameter servers across different machines to minimize the overall training time. Toward this end, in this paper, we develop an online scheduling algorithm that jointly optimizes resource allocation and locality decisions. Our main contributions are three-fold: i) We develop a new analytical model that considers both resource allocation and locality; ii) Based on an equivalent reformulation and close observations on the worker-parameter server locality configurations, we transform the problem into a mixed cover/packing integer program, which enables approximation algorithm design; iii) We propose a meticulously designed randomized rounding approximation algorithm and rigorously prove its performance. Collectively, our results contribute to a comprehensive and fundamental understanding of distributed ML system optimization and algorithm design.
dc.format.mimetype application/pdf
dc.identifier archive/lib.dr.iastate.edu/etd/17374/
dc.identifier.articleid 8381
dc.identifier.contextkey 15016870
dc.identifier.s3bucket isulib-bepress-aws-west
dc.identifier.submissionpath etd/17374
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/31557
dc.language.iso en
dc.source.bitstream archive/lib.dr.iastate.edu/etd/17374/Yu_iastate_0097M_17793.pdf|||Fri Jan 14 21:21:37 UTC 2022
dc.subject.disciplines Computer Sciences
dc.title Toward efficient online scheduling for large-scale distributed machine learning system
dc.type article
dc.type.genre thesis
dspace.entity.type Publication
relation.isOrgUnitOfPublication f7be4eb9-d1d0-4081-859b-b15cee251456
thesis.degree.discipline Computer Science
thesis.degree.level thesis
thesis.degree.name Master of Science
File
Original bundle
Name:
Yu_iastate_0097M_17793.pdf
Size:
433.01 KB
Format:
Adobe Portable Document Format