Reducing labeling complexity in streaming data mining

Date
2018-01-01
Authors
Izenov, Yesdaulet
Advisor
Srikanta Tirthapura
Abstract

Supervised machine learning is an approach in which an algorithm estimates a mapping function from labeled data, i.e., from data attributes paired with target values. One of the major obstacles in supervised learning is the labeling step: obtaining labeled data is expensive because it typically requires human effort. A model trained on too little data tends to overfit, so a minimum number of labeled examples is needed to achieve reasonable prediction accuracy. This also holds for streaming machine learning models, whose key concepts are maintaining a model without rebuilding it and performing prediction without ever storing input samples. A successful and widely used streaming model is the Hoeffding tree, which has a large labeling complexity. In this work, we present the Frugal Hoeffding tree, a variation of the Hoeffding tree that uses less labeled data while providing performance similar to that of the original Hoeffding tree. We conduct experiments on large real-world datasets in which we compare the performance of traditional batch decision trees, the Hoeffding tree, and the Frugal Hoeffding tree. We show that the Frugal Hoeffding tree consumes less labeled data yet can achieve classification performance similar to that of the Hoeffding tree.
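
For readers unfamiliar with why a Hoeffding tree needs many labeled examples, the following is a minimal sketch of the standard Hoeffding bound test that such trees commonly use to decide when a leaf has seen enough labeled samples to split. It is not taken from the thesis and does not describe the Frugal Hoeffding tree itself; the function names, the parameter names, and the example values are illustrative assumptions.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Standard Hoeffding bound (names here are illustrative assumptions).

    With probability at least 1 - delta, the true mean of a variable with
    range `value_range` lies within the returned epsilon of the mean
    observed over n independent samples.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(best_gain: float, second_gain: float,
              value_range: float, delta: float, n: int) -> bool:
    """Split only when the gap between the two best attributes exceeds epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # The bound shrinks as more labeled examples arrive at a leaf,
    # which is why each split decision consumes labels.
    for n in (100, 1000, 10000):
        print(n, round(hoeffding_bound(1.0, 1e-7, n), 4))
```

In this sketch, the same gain gap that is too small to justify a split after 100 labeled examples can become sufficient after 1000, which illustrates the labeling cost the abstract refers to.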

Type
thesis
Copyright
2018-05-01