Principal component analysis and classification of discrete and mixed feature datasets using Gaussian copula

Zhu, Yifan

File

Zhu_iastate_0097E_19820.pdf (763.55 KB)

Date

2021-12

Authors

Zhu, Yifan

Advisor

Maitra, Ranjan

Dai, Xiongtao

Meeker, William

Yu, Cindy

Chyzh, Olga

Abstract

Datasets with discrete features are now common. Such datasets have either purely discrete features or a mix of continuous and discrete features, and many of them are high dimensional. However, models for datasets with mixed features are limited, as are the dimensionality reduction and classification methods available for them. In this thesis, we propose a model based on the Gaussian copula to perform dimensionality reduction and classification for datasets with purely discrete or mixed features. In Chapter 2, we develop a principal component analysis (PCA) for discrete datasets that is scale invariant and robust to data contamination, and we further extend it to ordinal categorical datasets. This method addresses the scale dependence and outlier sensitivity of the usual PCA, and enables dimensionality reduction on discrete data by obtaining a low dimensional representation from the latent normal random vectors under the assumed Gaussian copula model. A rank based estimator for the correlation matrix in the Gaussian copula for discrete datasets is developed, extending the current rank based estimator, which applies only to continuous data. A methodology for obtaining surrogate PC scores as the low dimensional representation is developed, using a sampling algorithm that samples from the truncated normal distribution; the same algorithm is also used in the classification problem to calculate the posterior probability. In Chapter 3, a classification model based on a mixture of discrete Gaussian copula family distributions is proposed. The optimal classification rule that minimizes the misclassification probability is derived under this model. With the results in Chapter 2, we can estimate each component of the mixture distribution. One of the biggest challenges is the evaluation of the posterior probability in this model, which involves the numerical evaluation of a high dimensional integral over a rectangular region.
A low dimensional approximation of the correlation matrix is used to reduce the high dimensional integral to a lower dimensional one. Based on that, three different methods are introduced to obtain the posterior probability. In Chapter 4, we extend the classification model in Chapter 3 to datasets with mixed continuous and discrete features. A rank based estimator for the correlation matrix in the Gaussian copula is developed that extends the current estimator for continuous datasets and the estimators developed in Chapter 2 for discrete datasets. The methodology for evaluating posterior probabilities is developed based on results from Chapter 3. In Chapter 5, we conclude the thesis with a summary of the results from Chapters 2 to 4 and some directions for future work.

Academic or Administrative Unit

Statistics (LAS)

Type

dissertation