Case-Specific Random Forests for Big Data Prediction

Date
2015-01-01
Authors
Zimmerman, Joshua
Nettleton, Dan
Department
Statistics
Abstract

Some training datasets may be too large for storage on a single computer. Such datasets may be partitioned and stored on separate computers connected in a parallel computing environment. To predict the response associated with a specific target case when training data are partitioned, we propose a method for finding, within each partition, the training cases that are most relevant for predicting the target case's response. These relevant cases from all partitions can be combined into a single dataset, a subset of the full training data small enough to store and analyze in memory on a single computer. To generate a prediction from this selected subset, we use Case-Specific Random Forests, a variation of random forests that replaces the uniform bootstrap sampling used to build each tree with weighted bootstrap sampling, in which training cases more similar to the target case receive greater weight. We demonstrate our method with an example involving a concrete dataset. Our results show that predictions generated from a small selected subset of a partitioned training dataset can be as accurate as predictions generated in the traditional manner from the entire training dataset.
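The two steps the abstract describes, per-partition selection of relevant training cases followed by a weighted-bootstrap forest, can be sketched in a few lines of Python. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the function names select_relevant_cases and csrf_predict are hypothetical, and both the Euclidean nearest-case selection rule and the Gaussian-kernel similarity weights are placeholder choices standing in for the paper's own relevance and weighting definitions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def select_relevant_cases(partitions, x_target, k):
    # From each partition (X, y), keep the k training cases closest to
    # the target (Euclidean distance is an illustrative stand-in for
    # the paper's relevance measure) and pool them into one dataset.
    X_keep, y_keep = [], []
    for X, y in partitions:
        dists = np.linalg.norm(X - x_target, axis=1)
        nearest = np.argsort(dists)[:k]
        X_keep.append(X[nearest])
        y_keep.append(y[nearest])
    return np.vstack(X_keep), np.concatenate(y_keep)

def csrf_predict(X, y, x_target, n_trees=500, bandwidth=1.0, seed=0):
    # Case-specific random forest: each tree is grown on a weighted
    # bootstrap sample in which training cases more similar to the
    # target are drawn with higher probability. The Gaussian-kernel
    # weights below are an assumed similarity measure.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    dists = np.linalg.norm(X - x_target, axis=1)
    w = np.exp(-(dists / bandwidth) ** 2)
    w /= w.sum()  # sampling probabilities for the weighted bootstrap
    preds = np.empty(n_trees)
    for t in range(n_trees):
        idx = rng.choice(n, size=n, replace=True, p=w)
        tree = DecisionTreeRegressor(max_features="sqrt",
                                     random_state=int(rng.integers(2**31)))
        tree.fit(X[idx], y[idx])
        preds[t] = tree.predict(x_target.reshape(1, -1))[0]
    return preds.mean()  # average of per-tree predictions for the target

In a genuinely partitioned setting, select_relevant_cases would run locally on each machine holding a partition, and only the few selected cases would be shipped back and pooled before csrf_predict runs on a single computer.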

Comments

This proceeding is published as Zimmerman, J., Nettleton, D. (2015). Case-specific random forests for big data prediction. In JSM Proceedings, General Methodology. Alexandria, VA: American Statistical Association, pp. 2537–2543. Posted with permission.
