Principal Components Analysis of Discrete Datasets
We propose a Gaussian copula based method for performing principal component analysis on discrete data. By assuming that the data follow a discrete distribution in the Gaussian copula family, we can regard the discrete random vectors as being generated from a latent multivariate normal random vector. We first estimate the correlation matrix of the latent multivariate normal distribution and then use this estimated latent correlation matrix to obtain estimates of the principal components. We also consider categorical sequence data with multinomial marginal distributions. In this case the marginal distributions are not univariate, so the usual Gaussian copula does not apply directly. We propose an optimal mapping method that converts the original data with multivariate discrete marginals into mapped data with univariate marginals. The usual Gaussian copula can then be used to model the mapped data, and we apply the discrete principal component analysis to them. Senators' voting data are used as an example in the experiment. Finally, we propose a matrix Gaussian copula method for data with multivariate marginals. It can be viewed as an extension of the Gaussian copula, and we use the latent correlation matrix in the matrix Gaussian copula to obtain the principal components.
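The two-step idea (estimate the latent correlation matrix, then extract principal components from it) can be illustrated with a minimal sketch for the simplest discrete case, binary variables. The sketch makes assumptions that are not part of the abstract: each binary variable is taken to be a thresholded latent standard normal, each pairwise latent correlation is estimated by matching the observed joint frequency of ones (a tetrachoric-type moment estimator, which may differ from the estimator used in the paper), and only the leading component is extracted, by power iteration. All function names and the simulated data are hypothetical, and only the Python standard library is used so the example stays self-contained. The optimal mapping and matrix Gaussian copula steps are not covered.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: pdf, cdf, inv_cdf


def joint_upper_prob(a, b, rho, n_grid=400):
    """Approximate P(Z1 > a, Z2 > b) for a standard bivariate normal pair
    with correlation rho, using the midpoint rule on [a, 8]."""
    upper = 8.0
    if a >= upper:
        return 0.0
    step = (upper - a) / n_grid
    cond_sd = math.sqrt(1.0 - rho * rho)  # sd of Z2 given Z1
    total = 0.0
    for i in range(n_grid):
        z = a + (i + 0.5) * step
        # P(Z2 > b | Z1 = z), weighted by the density of Z1 at z
        total += STD_NORMAL.pdf(z) * (1.0 - STD_NORMAL.cdf((b - rho * z) / cond_sd))
    return total * step


def latent_correlation(x, y):
    """Estimate the latent correlation of two binary samples, assuming
    X = 1{Z > cutoff}, by matching the observed frequency of (1, 1)."""
    n = len(x)
    # Keep proportions away from 0 and 1 so inv_cdf stays finite.
    clip = lambda p: min(max(p, 0.5 / n), 1.0 - 0.5 / n)
    a = STD_NORMAL.inv_cdf(1.0 - clip(sum(x) / n))
    b = STD_NORMAL.inv_cdf(1.0 - clip(sum(y) / n))
    p11 = sum(xi * yi for xi, yi in zip(x, y)) / n
    lo, hi = -0.999, 0.999
    for _ in range(40):  # bisection: the joint probability increases with rho
        mid = 0.5 * (lo + hi)
        if joint_upper_prob(a, b, mid) < p11:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def latent_correlation_matrix(rows):
    """Pairwise latent correlation estimates with a unit diagonal."""
    cols = list(zip(*rows))
    d = len(cols)
    R = [[1.0] * d for _ in range(d)]
    for j in range(d):
        for k in range(j + 1, d):
            R[j][k] = R[k][j] = latent_correlation(cols[j], cols[k])
    return R


def leading_component(R, n_iter=500):
    """Leading eigenpair of R by power iteration. Assumes the dominant
    eigenvalue is positive and the start vector is not orthogonal to it."""
    d = len(R)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(R[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    eigval = sum(v[i] * sum(R[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigval, v


def simulate_binary(n, cutoffs, r, rng):
    """Hypothetical test data: equicorrelated latent normals (correlation r)
    thresholded at the given cutoffs."""
    rows = []
    for _ in range(n):
        f = rng.gauss(0.0, 1.0)  # shared factor
        z = [math.sqrt(r) * f + math.sqrt(1.0 - r) * rng.gauss(0.0, 1.0) for _ in cutoffs]
        rows.append([1 if zj > c else 0 for zj, c in zip(z, cutoffs)])
    return rows


data = simulate_binary(2000, [0.0, 0.3, -0.3, 0.5], 0.6, random.Random(0))
R_hat = latent_correlation_matrix(data)
eigval, loadings = leading_component(R_hat)
print("estimated latent correlations:", [[round(v, 2) for v in row] for row in R_hat])
print("leading eigenvalue:", round(eigval, 2))
print("leading loadings:", [round(v, 2) for v in loadings])
```

Because the correlations are estimated pair by pair, the resulting matrix is not guaranteed to be positive semidefinite; in practice one might project it onto the nearest correlation matrix and use a full eigendecomposition routine to obtain all components rather than only the first.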