Two sample inference for high dimensional data and nonparametric variable selection for census data

Li, Jun
Major Professor
Song Xi Chen

In the first part of this thesis, we address the question of how new testing methods can be developed for two sample inference for high dimensional data. In particular, chapter 2 focuses on testing the equality of two high dimensional covariance matrices, which can be directly applied to evaluating the difference in genetic correlation for different populations subject to various biological conditions. As we demonstrate in chapter 2, the test we propose requires no normality assumption and allows the dimension to be much larger than the sample sizes. In both respects it goes beyond the capacity of classical tests such as the likelihood ratio test. Testing the equality of high dimensional mean vectors is another important two-sample testing problem. Most tests for the equality of two mean vectors are not powerful against sparse alternatives, in which the difference between the two population mean vectors is confined to a small number of coordinates. In chapter 3, we propose two tests designed to achieve better power against sparse alternatives by conducting both variance reduction and signal enhancement, through thresholding and transformation, respectively.
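The role of thresholding under a sparse alternative can be illustrated with a small simulation. The following is a minimal sketch, not the tests proposed in chapter 3: the function names, the per-coordinate t-type statistic, and the threshold level 2 log p are all assumptions chosen for illustration. It sums the centered squared statistics only over coordinates that exceed the threshold, so that the many coordinates carrying pure noise contribute nothing.

```python
import math
import random


def coordinate_t_squared(x, y):
    """Squared two-sample t-type statistic for each coordinate.

    x and y are lists of observations; each observation is a list of p values.
    """
    n1, n2, p = len(x), len(y), len(x[0])
    stats = []
    for k in range(p):
        xk = [row[k] for row in x]
        yk = [row[k] for row in y]
        m1, m2 = sum(xk) / n1, sum(yk) / n2
        v1 = sum((v - m1) ** 2 for v in xk) / (n1 - 1)
        v2 = sum((v - m2) ** 2 for v in yk) / (n2 - 1)
        stats.append((m1 - m2) ** 2 / (v1 / n1 + v2 / n2))
    return stats


def thresholded_sum(stats, level):
    """Sum (t_k^2 - 1) over the coordinates whose t_k^2 exceeds the level."""
    return sum(t - 1.0 for t in stats if t > level)


def demo(seed=0, n=50, p=200, n_signal=5, shift=1.5):
    """Simulate a sparse mean shift; return the plain sum, the thresholded sum,
    and the indices of the coordinates that survive thresholding."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    # Only the first n_signal coordinates of the second sample are shifted.
    y = [[rng.gauss(shift if k < n_signal else 0.0, 1.0) for k in range(p)]
         for _ in range(n)]
    stats = coordinate_t_squared(x, y)
    level = 2.0 * math.log(p)  # illustrative threshold (an assumption)
    plain = sum(t - 1.0 for t in stats)
    kept = [k for k, t in enumerate(stats) if t > level]
    return plain, thresholded_sum(stats, level), kept


if __name__ == "__main__":
    plain, thresholded, kept = demo()
    print("plain sum:", round(plain, 2))
    print("thresholded sum:", round(thresholded, 2))
    print("coordinates kept:", kept)
```

In this setup the plain sum mixes the few signal coordinates with noise from all p coordinates, whereas the thresholded sum is driven mainly by the coordinates that pass the threshold. A formal test would additionally need the null mean and variance of the thresholded statistic, which this sketch does not attempt.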

The second part of this thesis is on variable selection for census data. Human populations are heterogeneous in that the probability of enumerating an individual depends on the characteristics of the individual. For the US Census, a group of variables is chosen to reflect much of the heterogeneity, and the relevance of these variables to the enumeration function needs to be investigated. In chapter 4, we introduce a nonparametric variable selection method based on the optimal bandwidths obtained by minimizing the cross-validation function. The relevance of each variable to the enumeration function is reflected by the asymptotic convergence of the associated optimal bandwidth. In addition, a bootstrap procedure is introduced to formally test the significance of each variable.
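The idea of reading variable relevance from cross-validated bandwidths can be sketched as follows. This is an illustration under stated assumptions, not the procedure of chapter 4: it assumes a Nadaraya-Watson estimator with a product Gaussian kernel, a continuous response, leave-one-out cross-validation, and a coarse grid search, and all function names are hypothetical. A common reading, assumed here, is that a covariate whose selected bandwidth is very large is effectively smoothed out of the fit.

```python
import itertools
import math
import random


def loo_cv_score(xs, ys, bandwidths):
    """Leave-one-out mean squared error of a Nadaraya-Watson fit
    with a product Gaussian kernel (one bandwidth per covariate)."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            if j == i:
                continue  # leave observation i out of its own prediction
            w = math.exp(-0.5 * sum(((a - b) / h) ** 2
                                    for a, b, h in zip(xs[i], xs[j], bandwidths)))
            num += w * ys[j]
            den += w
        pred = num / den if den > 0 else 0.0
        total += (ys[i] - pred) ** 2
    return total / n


def select_bandwidths(xs, ys, grid):
    """Grid search: return the bandwidth tuple that minimizes the CV score."""
    d = len(xs[0])
    return min(itertools.product(grid, repeat=d),
               key=lambda hs: loo_cv_score(xs, ys, hs))


def demo(seed=1, n=80):
    """Simulated data in which only the first of two covariates matters."""
    rng = random.Random(seed)
    xs = [[rng.random(), rng.random()] for _ in range(n)]
    # The response depends on the first covariate only; the second is noise.
    ys = [math.sin(2.0 * math.pi * x[0]) + rng.gauss(0.0, 0.2) for x in xs]
    grid = [0.05, 0.1, 0.3, 1.0, 10.0]  # illustrative grid (an assumption)
    return select_bandwidths(xs, ys, grid)


if __name__ == "__main__":
    h_relevant, h_irrelevant = demo()
    print("bandwidth for relevant covariate:", h_relevant)
    print("bandwidth for irrelevant covariate:", h_irrelevant)
```

Comparing the two selected bandwidths gives an informal indication of relevance. A formal significance test, such as the bootstrap procedure mentioned above, would require resampling under the null hypothesis and is not shown here.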

Tue Jan 01 00:00:00 UTC 2013