Reducing labeling complexity in streaming data mining
Abstract
Supervised machine learning is an approach in which an algorithm estimates a mapping
function from labeled data, i.e., from data attributes paired with target values. One of the major
obstacles in supervised learning is the labeling step: obtaining labeled data is expensive
because it typically requires human effort. A model trained on too little data tends
to overfit, so a minimum number of labeled examples is needed to achieve reasonable
prediction accuracy. This also holds for streaming machine learning models, whose key
concepts are maintaining a model without rebuilding it and performing prediction without
ever storing input samples. A successful and widely used streaming
model is the Hoeffding tree, which, however, has high labeling complexity. In this work, we present the Frugal
Hoeffding tree, a variation of the Hoeffding tree that uses less labeled data while providing
performance similar to that of the original Hoeffding tree. We conduct experiments on large real-world datasets
in which we compare the performance of traditional batch decision trees, the Hoeffding tree, and
the Frugal Hoeffding tree. We show that the Frugal Hoeffding tree consumes less labeled data
yet achieves classification performance similar to that of the Hoeffding tree.
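
To make the notion of labeling complexity concrete, the sketch below computes the standard Hoeffding bound that a Hoeffding tree relies on when deciding whether it has seen enough labeled examples to commit to a split. It illustrates only the general bound, not the Frugal Hoeffding tree itself; the function names and the parameter values (value_range, delta, epsilon) are assumptions chosen for illustration.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability at least 1 - delta,
    the true mean of a variable with the given range lies within
    epsilon of the mean observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def labels_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which the Hoeffding bound is at most epsilon.
    Obtained by solving the bound for n and rounding up."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Illustrative values only: a split metric with range 1 and a
    # confidence parameter delta of 1e-7.
    for n in (100, 1000, 10000):
        print(n, hoeffding_bound(1.0, 1e-7, n))
    print(labels_needed(1.0, 1e-7, 0.05))
```

Because the bound shrinks only with the square root of n, halving epsilon requires roughly four times as many labeled examples, which is one way to see why reducing label consumption in such models is worthwhile.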