The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large

dc.contributor.author Biswas, Sumon
dc.contributor.author Wardat, Mohammad
dc.contributor.department Computer Science
dc.date.accessioned 2021-12-17T13:57:11Z
dc.date.available 2021-12-17T13:57:11Z
dc.date.issued 2021-12-02
dc.description.abstract An increasing number of software systems today include data science components for descriptive, predictive, and prescriptive analytics. The collection of data science stages, from acquisition to cleaning/curation to modeling and so on, is referred to as a data science pipeline. To facilitate research and practice on data science pipelines, it is essential to understand their nature. What are the typical stages of a data science pipeline? How are they connected? Do pipelines differ between their theoretical representations and their use in practice? Today we do not fully understand these architectural characteristics of data science pipelines. In this work, we present a three-pronged comprehensive study to answer these questions for the state of the art, data science in-the-small, and data science in-the-large. Our study analyzes three datasets: a collection of 71 proposals for data science pipelines and related concepts in theory, a collection of over 105 implementations of curated data science pipelines from Kaggle competitions to understand data science in-the-small, and a collection of 21 mature data science projects from GitHub to understand data science in-the-large. Our study has led to three representations of data science pipelines that capture the essence of our subjects in theory, in-the-small, and in-the-large.
dc.description.comments This is a preprint made available through arXiv: https://arxiv.org/abs/2112.01590. This work is licensed under the Creative Commons Attribution 4.0 License.
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/OrD8qMEr
dc.language.iso en
dc.publisher © Author(s) 2021
dc.source.uri https://arxiv.org/abs/2112.01590
dc.subject data science pipelines
dc.subject data science processes
dc.subject descriptive
dc.subject predictive
dc.subject.disciplines DegreeDisciplines::Physical Sciences and Mathematics::Computer Sciences::Software Engineering
dc.subject.disciplines DegreeDisciplines::Physical Sciences and Mathematics::Computer Sciences
dc.title The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large
dc.type Preprint
dspace.entity.type Publication
relation.isAuthorOfPublication 4e3f4631-9a99-4a4d-ab81-491621e94031
relation.isOrgUnitOfPublication f7be4eb9-d1d0-4081-859b-b15cee251456
File
Original bundle
Name: 2021-RajanH-ArtPracticePreprint.pdf
Size: 1.18 MB
Format: Adobe Portable Document Format