A balanced approach to the multi-class imbalance problem

Mosley, Lawrence

A balanced approach to the multi-class imbalance problem

File

Mosley_iastate_0097E_13904.pdf (2.76 MB)

Date

2013-01-01

Authors

Mosley, Lawrence

Advisor

Sigurdur Olafsson

Altmetrics

Abstract

The multi-class class-imbalance problem is a subset of supervised machine learning tasks where the classification variable of interest consists of three or more categories with unequal sample sizes. In the fields of manufacturing and business, common machine learning classification tasks such as failure mode, fraud, and threat detection often exhibit class imbalance due to the infrequent occurrence of one or more event states. Though machine learning as a discipline is well established, the study of class imbalance with respect to multi-class learning does not yet have the same deep, rich history. In its current state, the class imbalance literature leverages the use of biased sampling and increasing model complexity to improve predictive performance, and while some have made advances, there is still no standard model evaluation criteria for which to compare their performance. In the presence of substantial multi-class distributional skew, of the model evaluation criteria that can scale beyond the binary case, many become invalid due to their over-emphasis on the majority class observations.

Going a step further, many of the evaluation criteria utilized in practice vary significantly across the class imbalance literature and so far no single measure has been able to galvanize consensus due not only to implementation complexity, but the existence of undesirable properties. Therefore, the focus of this research is to introduce a new performance measure, Class Balance Accuracy, designed specifically for model validation in the presence of multi-class imbalance. This paper begins with the statement of definition for Class Balance Accuracy and provides an intuitive proof for its interpretation as a simultaneous lower bound for the average per class recall and average per class precision. Results from comparison studies show that models chosen by maximizing the training class balance accuracy consistently yield both high overall accuracy and per class recall on the test sets compared to the models chosen by other criteria. Simulation studies were then conducted to highlight specific scenarios where the use of class balance accuracy outperforms model selection based on regular accuracy. The measure is then invoked in two novel applications, one as the maximization criteria in the instance selection biased sampling technique and the other as a model selection tool in a multiple classifier system prediction algorithm. In the case of instance selection, the use of class balance accuracy shows improvement over traditional accuracy in scenarios of multi-class class-imbalance data sets with low separability between the majority and minority classes. Likewise, the use of CBA in the multiple classifier system resulted in improved predictions over state of the art methods such as adaBoost for some of the U.C.I. machine learning repository test data sets. The paper then concludes with a discussion of the climbR package, a repository of functions designed to aid in the model evaluation and prediction of class imbalance machine learning problems.

Academic or Administrative Unit

Department of Industrial and Manufacturing Systems Engineering

Type

dissertation

Copyright

Tue Jan 01 00:00:00 UTC 2013