An efficient k-modes algorithm for clustering categorical datasets

Dorman, Karin S.; Maitra, Ranjan

An efficient k-modes algorithm for clustering categorical datasets

File

2022-Maitra-EfficientKModes.pdf (1.21 MB)

Date

2022-02

Authors

Dorman, Karin S.

Maitra, Ranjan

Publisher

Wiley Periodicals LLC

Abstract

Mining clusters from data is an important endeavor in many applications. The k-means method is a popular, efficient, and distribution-free approach for clustering numerical-valued data, but does not apply for categorical-valued observations. The k-modes method addresses this lacuna by replacing the Euclidean with the Hamming distance and the means with the modes in the k-means objective function. We provide a novel, computationally efficient implementation of k-modes, called Optimal Transfer Quick Transfer (OTQT). We prove that OTQT finds updates to improve the objective function that are undetectable to existing k-modes algorithms. Although slightly slower per iteration due to algorithmic complexity, OTQT is always more accurate and almost always faster (and only barely slower on some datasets) to the final optimum. Thus, we recommend OTQT as the preferred, default algorithm for k-modes optimization.

Versions

Article

An Efficient k-modes Algorithm for Clustering Categorical Datasets

( 2020-01-01) Dorman, Karin ; Maitra, Ranjan ; Statistics (LAS) ; Department of Genetics, Development, and Cell Biology (LAS)

Mining clusters from datasets is an important endeavor in many applications. The k-means algorithm is a popular and efficient distribution-free approach for clustering numerical-valued data but can not be applied to categorical-valued observations. The k-modes algorithm addresses this lacuna by taking the k-means objective function, replacing the dissimilarity measure and using modes instead of means in the modified objective function. Unlike many other clustering algorithms, both k-modes and k-means are scalable, because they do not require calculation of all pairwise dissimilarities. We provide a fast and computationally efficient implementation of k-modes, OTQT, and prove that it can find superior clusterings to existing algorithms. We also examine five initialization methods and three types of K-selection methods, many of them novel, and all appropriate for k-modes. By examining the performance on real and simulated datasets, we show that simple random initialization is the best intializer, a novel K-selection method is more accurate than two methods adapted from k-means, and that the new OTQT algorithm is more accurate and almost always faster than existing algorithms.

Academic or Administrative Unit

Department of Genetics, Development, and Cell Biology (LAS)

Statistics (CALS)

Type

article

Comments

This article is published as Dorman, Karin S., and Ranjan Maitra. "An efficient k‐modes algorithm for clustering categorical datasets." Statistical Analysis and Data Mining: The ASA Data Science Journal 15, no. 1 (2022): 83-97. doi:10.1002/sam.11546.

Rights Statement

© 2021 The Authors. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.