Principal component analysis and classification of discrete and mixed feature datasets using Gaussian copula
Date
2021-12
Authors
Zhu, Yifan
Major Professor
Advisor
Maitra, Ranjan
Dai, Xiongtao
Meeker, William
Yu, Cindy
Chyzh, Olga
Committee Member
Abstract
Datasets with discrete features are now common. Such datasets have either purely discrete
features or a mix of continuous and discrete features, and many of them are high dimensional.
However, few models are available for datasets with mixed features, and dimensionality
reduction and classification methods for them are similarly limited. In this thesis, we propose
a model based on the Gaussian copula to perform dimensionality reduction and classification for
datasets with purely discrete or mixed features.
In Chapter 2, we develop a scale-invariant, contamination-robust principal component
analysis (PCA) for discrete datasets, which is further extended to ordinal categorical
datasets. This method addresses the scale dependence and outlier sensitivity of the usual PCA,
and enables dimensionality reduction on discrete data by obtaining a low-dimensional
representation from the latent normal random vectors under the assumed Gaussian copula model.
A rank-based estimator of the correlation matrix in the Gaussian copula for discrete datasets
is developed, extending the existing rank-based estimator, which applies only to continuous
data. A methodology for obtaining surrogate PC scores as the low-dimensional representation is
developed, using an algorithm that samples from the truncated normal distribution; this
algorithm is also used in the classification problem to calculate posterior probabilities.
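
As an illustration of the kind of rank-based estimator that the chapter extends, the sketch below implements the well-known continuous-data case, in which each latent correlation is recovered from Kendall's tau through rho = sin(pi * tau / 2). It is not the discrete-data estimator developed in the thesis; the function name `kendall_copula_corr` and the simulated data are assumptions made for this example.

```python
import numpy as np
from scipy.stats import kendalltau


def kendall_copula_corr(X):
    """Rank-based estimate of the Gaussian copula correlation matrix.

    Continuous-margin case only: rho_jk = sin(pi/2 * tau_jk), where tau_jk
    is Kendall's tau between columns j and k. (Illustrative helper, not the
    thesis's estimator for discrete data.)
    """
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    R = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            tau, _ = kendalltau(X[:, j], X[:, k])
            R[j, k] = R[k, j] = np.sin(np.pi * tau / 2.0)
    return R


# Simulated check: latent normals with correlation 0.6, then monotone
# transforms of each margin. Ranks, and hence the estimate, are unaffected
# by the transforms, which is the scale invariance the abstract refers to.
rng = np.random.default_rng(0)
true_R = np.array([[1.0, 0.6], [0.6, 1.0]])
Z = rng.multivariate_normal(np.zeros(2), true_R, size=2000)
X = np.column_stack([np.exp(Z[:, 0]), Z[:, 1] ** 3])
R_hat = kendall_copula_corr(X)
```

Because the estimate depends on the data only through ranks, rescaling or monotonically transforming a feature leaves `R_hat` unchanged, unlike the sample covariance used by the usual PCA.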
In Chapter 3, we propose a classification model based on a mixture of discrete Gaussian
copula family distributions. The optimal classification rule that minimizes the
misclassification probability is derived under this model. Using the results of Chapter 2, we
can estimate each component of the mixture distribution. One of the biggest challenges is the
evaluation of the posterior probability in this model, which involves the numerical evaluation
of a high-dimensional integral over a rectangular region. A low-dimensional approximation of
the correlation matrix is used to reduce the high-dimensional integral to a lower-dimensional
one. Based on that, three different methods are introduced to obtain the posterior probability.
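
To make the rectangular-region integral concrete, the sketch below evaluates the probability of a discrete observation under a Gaussian copula by plain Monte Carlo and plugs it into Bayes' rule. This is a simple stand-in under stated assumptions, not the low-dimensional approximation or any of the methods proposed in the thesis: the Poisson margins, the class parameters, and all function names are hypothetical choices made for this example.

```python
import numpy as np
from scipy.stats import norm, poisson


def poisson_cdf(mu):
    # Hypothetical marginal CDF used only for this illustration.
    return lambda k: poisson.cdf(k, mu)


def rectangle_bounds(x, cdfs):
    # Under a discrete Gaussian copula, {X = x} corresponds to the latent
    # rectangle a_j < Z_j <= b_j with a_j = Phi^{-1}(F_j(x_j - 1)) and
    # b_j = Phi^{-1}(F_j(x_j)); the bounds may be -inf or +inf.
    lower = np.array([norm.ppf(F(xj - 1)) for xj, F in zip(x, cdfs)])
    upper = np.array([norm.ppf(F(xj)) for xj, F in zip(x, cdfs)])
    return lower, upper


def mc_rectangle_prob(lower, upper, R, n_draws=200_000, seed=0):
    # Monte Carlo estimate of P(lower < Z <= upper) for Z ~ N(0, R).
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(len(lower)), R, size=n_draws)
    inside = np.all((Z > lower) & (Z <= upper), axis=1)
    return inside.mean()


def posterior(x, priors, cdfs_by_class, R_by_class):
    # Bayes' rule: posterior_k is proportional to prior_k * P(X = x | class k).
    w = np.array([
        prior * mc_rectangle_prob(*rectangle_bounds(x, cdfs), R)
        for prior, cdfs, R in zip(priors, cdfs_by_class, R_by_class)
    ])
    return w / w.sum()


# Two hypothetical classes with Poisson margins and latent correlation 0.5.
R = np.array([[1.0, 0.5], [0.5, 1.0]])
cdfs_by_class = [[poisson_cdf(2.0)] * 2, [poisson_cdf(5.0)] * 2]
post = posterior([1, 2], [0.5, 0.5], cdfs_by_class, [R, R])
label = int(np.argmax(post))  # rule minimizing misclassification probability
```

The cost of this brute-force estimate grows quickly with the number of features, since the probability of landing in a small rectangle shrinks with dimension; that difficulty is what motivates reducing the integral to a lower-dimensional one.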
In Chapter 4, we extend the classification model of Chapter 3 to datasets with mixed
continuous and discrete features. A rank-based estimator of the correlation matrix in the
Gaussian copula is developed that extends both the existing estimator for continuous datasets
and the estimators developed in Chapter 2 for discrete datasets. The methodology for evaluating
posterior probabilities is developed based on results from Chapter 3.
In Chapter 5, we conclude the thesis with a summary of the results from Chapters 2 through 4
and some directions for future work.
Type
dissertation