Bringing ultra-large-scale software repository mining to the masses with Boa

Dyer, Robert

dc.contributor.advisor	Hridesh Rajan
dc.contributor.author	Dyer, Robert
dc.contributor.department	Computer Science
dc.date	2018-08-11T09:12:49.000
dc.date.accessioned	2020-06-30T02:50:31Z
dc.date.available	2020-06-30T02:50:31Z
dc.date.copyright	Tue Jan 01 00:00:00 UTC 2013
dc.date.embargo	2015-07-30
dc.date.issued	2013-01-01
dc.description.abstract	Mining software repositories provides developers and researchers a chance to learn from previous development activities and apply that knowledge to the future. Ultra-large-scale open source repositories (e.g., SourceForge with 350,000+ projects, GitHub with 250,000+ projects, and Google Code with 250,000+ projects) provide an extremely large corpus on which to perform such mining tasks. This large corpus gives researchers the opportunity to test new mining techniques and empirically validate new approaches on real-world data. However, the barrier to entry is often extremely high. Researchers interested in mining must know a large number of techniques, languages, tools, etc., each of which is often complex. Additionally, mining at the scale described above adds further complexity and is often difficult to achieve.
The Boa language and infrastructure were developed to solve these problems. We provide users a domain-specific language tailored for software repository mining and allow them to submit queries via our web-based interface. These queries are then automatically parallelized and executed on a cluster, analyzing a dataset containing almost 700,000 projects, history information from millions of revisions, millions of Java source files, and billions of AST nodes. The language also provides an easy-to-comprehend visitor syntax to ease writing source code mining queries. The underlying infrastructure contains several optimizations, including query optimizations to make single queries faster as well as a fusion optimization to group queries from multiple users into a single query. The latter optimization is important as Boa is intended to be a shared community resource. Finally, we show the potential benefit of Boa to the community by reproducing a previously published case study and performing a new case study on the adoption of Java language features.
dc.format.mimetype	application/pdf
dc.identifier	archive/lib.dr.iastate.edu/etd/13553/
dc.identifier.articleid	4560
dc.identifier.contextkey	5050393
dc.identifier.doi	https://doi.org/10.31274/etd-180810-3277
dc.identifier.s3bucket	isulib-bepress-aws-west
dc.identifier.submissionpath	etd/13553
dc.identifier.uri	https://dr.lib.iastate.edu/handle/20.500.12876/27740
dc.language.iso	en
dc.source.bitstream	archive/lib.dr.iastate.edu/etd/13553/Dyer_iastate_0097E_13923.pdf\|\|\|Fri Jan 14 19:55:24 UTC 2022
dc.subject.disciplines	Computer Sciences
dc.title	Bringing ultra-large-scale software repository mining to the masses with Boa
dc.type	article
dc.type.genre	dissertation
dspace.entity.type	Publication
relation.isOrgUnitOfPublication	f7be4eb9-d1d0-4081-859b-b15cee251456
thesis.degree.level	dissertation
thesis.degree.name	Doctor of Philosophy

File

Original bundle

Name: Dyer_iastate_0097E_13923.pdf
Size: 1.53 MB
Format: Adobe Portable Document Format

Collections

Theses and Dissertations