A Hierarchical Clustering and Validity Index for Mixed Data

Thumbnail Image
Yang, Rui
Major Professor
Sigurdur Olafsson
Committee Member
Journal Title
Journal ISSN
Volume Title
Research Projects
Organizational Units
Organizational Unit
Industrial and Manufacturing Systems Engineering
The Department of Industrial and Manufacturing Systems Engineering teaches the design, analysis, and improvement of the systems and processes in manufacturing, consulting, and service industries by application of the principles of engineering. The Department of General Engineering was formed in 1929. In 1956 its name changed to Department of Industrial Engineering. In 1989 its name changed to the Department of Industrial and Manufacturing Systems Engineering.
Journal Issue
Is Version Of
Industrial and Manufacturing Systems Engineering

This study develops novel approaches to partition mixed data into natural groups, that is, clustering datasets containing both numeric and nominal attributes. Such data arises in many diverse applications. Our approach addresses two important issues regarding clustering mixed datasets. One is how to find the optimal number of clusters which is important because this is unknown in many applications. The other is how to group the objects "naturally" according to a suitable similarity measurement. These problems are especially difficult for the mixed datasets since they involve determining how to unify the two different representation schemes for numeric and nominal data.

To address the issue of constructing clusters, that is, to naturally group objects, we compare the performance of four distances capable of dealing with the mixed datasets when incorporating into a classical agglomerative hierarchical clustering approach. Based on these results, we conclude that the so-called co-occurrence distance to measure the dissimilarity performs well as this distance is found to obtain good clustering results with reasonable computation, thus balancing effectiveness and efficiency.

The second important contribution of this research is to define an entropy-based validity index to validate the sequence of partitions generated by the hierarchical clustering with the co-occurrence distance. A cluster validity index called the BK index is modified for mixed data and used in conjunction with the proposed clustering algorithm. This index is compared to three well-known indices, namely, the Calinski-Harabasz index (CH), the Dunn index (DU), and the Silhouette index (SI). The results show that the modified BK index outperforms the three other indices for its ability to identify the true number of clusters.

Finally, the study also identifies the limitation of the hierarchical clustering with a co-occurrence distance, and provides some remedies to improve not only the clustering accuracy but especially the ability to correctly identify best number of classes of the mixed datasets.

Subject Categories
Sun Jan 01 00:00:00 UTC 2012