An Efficient k-modes Algorithm for Clustering Categorical Datasets

Date
2020-01-01
Authors
Dorman, Karin
Maitra, Ranjan
Major Professor
Advisor
Committee Member
Journal Title
Journal ISSN
Volume Title
Publisher
Altmetrics
Authors
Research Projects
Organizational Units
Statistics
Organizational Unit
Journal Issue
Series
Department
StatisticsGenetics, Development and Cell Biology
Abstract

Mining clusters from datasets is an important endeavor in many applications. The k-means algorithm is a popular and efficient distribution-free approach for clustering numerical-valued data but can not be applied to categorical-valued observations. The k-modes algorithm addresses this lacuna by taking the k-means objective function, replacing the dissimilarity measure and using modes instead of means in the modified objective function. Unlike many other clustering algorithms, both k-modes and k-means are scalable, because they do not require calculation of all pairwise dissimilarities. We provide a fast and computationally efficient implementation of k-modes, OTQT, and prove that it can find superior clusterings to existing algorithms. We also examine five initialization methods and three types of K-selection methods, many of them novel, and all appropriate for k-modes. By examining the performance on real and simulated datasets, we show that simple random initialization is the best intializer, a novel K-selection method is more accurate than two methods adapted from k-means, and that the new OTQT algorithm is more accurate and almost always faster than existing algorithms.

Comments

This is a pre-print of the article Dorman, Karin S., and Ranjan Maitra. "An Efficient k-modes Algorithm for Clustering Categorical Datasets." arXiv preprint arXiv:2006.03936 (2020). Posted with permission.

Description
Keywords
Citation
DOI
Source
Collections