Shared Data Science Infrastructure for Genomics Data

Bagheri, Hamid
Muppirala, Usha
Masonbrink, Rick E.
Rajan, Hridesh
Professor and Department Chair of Computer Science
Severin, Andrew
Manager Research
Organizational Units
Organizational Unit
Office of Biotechnology
The Office of Biotechnology facilitates and advances programs in research, education, and outreach that contribute to the goals of Iowa State University’s Strategic Plan in the area of biotechnology. The Office oversees the biotechnology programs developed by the university’s Biotechnology Council and the Office of the Vice President for Research. The Office of Biotechnology works with the university’s biotechnology faculty and administrators to ensure effectiveness in research, education, and technology transfer related to the application of molecular biology to the development of useful products and processes.
Organizational Unit
Computer Science

Computer Science (the theory, representation, processing, communication and use of information) is fundamentally transforming every aspect of human endeavor. The Department of Computer Science at Iowa State University advances computational and information sciences through: 1. educational and research programs within and beyond the university; 2. active engagement to help define national and international research and educational agendas; and 3. sustained commitment to graduating leaders for academia, industry and government.

The Computer Science Department was officially established in 1969, with Robert Stewart serving as the founding Department Chair. The faculty was composed of joint appointments with Mathematics, Statistics, and Electrical Engineering. In 1969, the building that now houses the Computer Science department, then simply called the Computer Science building, was completed. Later it was named Atanasoff Hall. From the 1980s to the present, the department has expanded and developed its teaching and research agendas to cover many areas of computing.

Organizational Unit
Genome Informatics Facility
The Genome Informatics Facility serves as a centralized resource of expertise on the application of emerging sequencing technologies and open source software as applied to biological systems. Its mission is to integrate this knowledge into pipelines that are easy to understand and use by faculty, staff and students to enable the transformation of ‘big data’ into data that dramatically accelerates our understanding of biology and evolutionary processes.
Background: Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting, and analyzing relevant data. Shared data science infrastructures like Boag are needed to efficiently process and parse data contained in large data repositories. The main features of Boag are inspired by existing languages for data-intensive computing, and Boag can easily integrate data from biological data repositories.

Results: As a proof of concept, Boa for genomics, Boag, has been implemented to analyze RefSeq’s 153,848 annotation (GFF) and assembly (FASTA) file metadata. Boag provides a massive improvement over existing solutions like Python and MongoDB by utilizing a domain-specific language that uses Hadoop infrastructure for a smaller storage footprint, scales well, and requires fewer lines of code. We execute scripts through Boag to answer questions about the genomes in RefSeq. We identify the largest and smallest genomes deposited, explore exon frequencies for assemblies after 2016, identify the most commonly used bacterial genome assembly program, and address how animal genome assemblies have improved since 2016. Boag databases significantly reduce the storage required for the raw data and significantly speed up queries on large datasets, owing to the automated parallelization and distribution that the Hadoop infrastructure provides during computations.

Conclusions: In order to keep pace with our ability to produce biological data, innovative methods are required. The Shared Data Science Infrastructure, Boag, gives researchers greater access to efficiently explore data in new ways. We demonstrate the potential of the domain-specific language Boag using the RefSeq database to explore how deposited genome assemblies and annotations are changing over time. This is a small example of how Boag could be used with large biological datasets.
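As an illustration only, the kind of aggregation described in the abstract (finding the largest and smallest deposited genomes) can be sketched as a map and reduce over assembly metadata. This is plain Python, not Boag's domain-specific language, and it does not reproduce the article's scripts or results; the record fields (`accession`, `total_length`), the accession names, and the lengths are hypothetical placeholders. The point is that the combining step is associative, which is the property that lets a Hadoop-style framework evaluate partitions of the data in parallel and merge the partial results.

```python
from functools import reduce

# Hypothetical assembly metadata records. Field names and values are
# illustrative placeholders, not data from RefSeq or from the article.
records = [
    {"accession": "ASM_A", "total_length": 4_600_000},
    {"accession": "ASM_B", "total_length": 3_100_000_000},
    {"accession": "ASM_C", "total_length": 580_000},
]


def map_record(rec):
    """Map step: turn one record into a (smallest, largest) candidate pair."""
    pair = (rec["total_length"], rec["accession"])
    return (pair, pair)


def combine(a, b):
    """Reduce step: merge two partial results.

    The operation is associative, so partial results computed on separate
    partitions of the data can be merged in any grouping.
    """
    return (min(a[0], b[0]), max(a[1], b[1]))


def genome_size_extremes(recs):
    """Return ((min_length, accession), (max_length, accession))."""
    return reduce(combine, map(map_record, recs))


smallest, largest = genome_size_extremes(records)
print("smallest:", smallest[1], smallest[0])
print("largest:", largest[1], largest[0])
```

In a distributed setting, each worker would apply `map_record` and `combine` to its own partition, and the partial pairs would be merged with the same `combine` function.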
This article is published as Bagheri, Hamid, Usha Muppirala, Rick E. Masonbrink, Andrew J. Severin, and Hridesh Rajan. "Shared data science infrastructure for genomics data." BMC Bioinformatics 20, no. 1 (2019): 1-13. DOI: 10.1186/s12859-019-2967-2. Copyright 2020 The Author(s). Attribution 4.0 International (CC BY 4.0). Posted with permission.
Mon Jan 01 00:00:00 UTC 2018