Improving statistical inference for gene expression profiling data by borrowing information

Date
2010-01-01
Authors
Qu, Long
Major Professor
Advisor
Jack C. Dekkers
Dan Nettleton
Abstract

Gene expression profiling experiments, microarray experiments in particular, are popular in genomics research. Alongside the great opportunities such experiments provide, however, statistical challenges arise in the analysis of expression profiling data. This thesis discusses statistical issues associated with gene expression profiling experiments and develops new statistical methods to tackle some of these problems.

In Chapter 2, we consider the problem of insufficient sample size in detecting differential gene expression. We address the problem by developing and evaluating methods for variance model selection. The idea is that information learned from related datasets might improve the estimation of error variances. We develop a modified multiresponse permutation procedure (MRPP), modified cross-validation procedures, and the right AICc (corrected Akaike’s information criterion) for choosing a variance model. Through realistic simulations based on three real microarray studies, we evaluate the proposed methods and offer practical recommendations for data analysis.
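
As a rough illustration of variance model selection by AICc, the sketch below compares a common-variance model with a group-specific-variance model for two groups of Gaussian observations. It uses the standard small-sample formula AICc = -2 log L + 2k + 2k(k+1)/(n-k-1); the thesis's own AICc variant is not specified in this abstract, and the data values and function names here are hypothetical.

```python
import math

def gaussian_loglik(values, mean, var):
    """Log-likelihood of independent N(mean, var) observations."""
    n = len(values)
    ss = sum((v - mean) ** 2 for v in values)
    return -0.5 * n * math.log(2 * math.pi * var) - ss / (2 * var)

def aicc(loglik, k, n):
    """Standard corrected AIC for k estimated parameters and n observations."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return -2 * loglik + 2 * k + 2 * k * (k + 1) / (n - k - 1)

def mean(values):
    return sum(values) / len(values)

def mle_var(values, center):
    """Maximum-likelihood variance (divides by n, not n - 1)."""
    return sum((v - center) ** 2 for v in values) / len(values)

# Hypothetical expression values for one gene under two treatments.
group_a = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95]
group_b = [5.0, 9.0, 1.0, 7.0, 3.0, 6.0]
n = len(group_a) + len(group_b)
mean_a, mean_b = mean(group_a), mean(group_b)

# Model 1: two means, one common variance (k = 3 parameters).
residuals = [v - mean_a for v in group_a] + [v - mean_b for v in group_b]
pooled_var = mle_var(residuals, 0.0)
loglik_pooled = (gaussian_loglik(group_a, mean_a, pooled_var)
                 + gaussian_loglik(group_b, mean_b, pooled_var))
aicc_pooled = aicc(loglik_pooled, k=3, n=n)

# Model 2: two means, two group-specific variances (k = 4 parameters).
loglik_separate = (gaussian_loglik(group_a, mean_a, mle_var(group_a, mean_a))
                   + gaussian_loglik(group_b, mean_b, mle_var(group_b, mean_b)))
aicc_separate = aicc(loglik_separate, k=4, n=n)

# The model with the smaller AICc is preferred.
best = "separate" if aicc_separate < aicc_pooled else "pooled"
print(best, round(aicc_pooled, 2), round(aicc_separate, 2))
```

With these made-up values the two groups differ sharply in spread, so the group-specific-variance model is preferred despite its extra parameter.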

In Chapter 3, we address the multiple testing problem by improving the estimation of the distribution of noncentrality parameters given a large number of two-sample t-tests. We provide parametric, nonparametric and semiparametric estimators for the distribution of noncentrality parameters, as well as false discovery rates (FDR) and local FDR. Simulations show that our density estimates are closer to the underlying truth and that our estimates of FDR are also improved relative to competing methods under a variety of situations.

In Chapter 4, we develop a novel combination of two statistical techniques with the aim of bypassing the curse of dimensionality in detecting differential expression of genes. We accept the fact that, in “small N, large p” situations, the data do not provide enough information about dependency across genes. Hence, we suggest using a priori biological knowledge to assist statistical inference. We first use multidimensional scaling (MDS) methods to summarize prior knowledge about inter-gene relationships into a set of pseudo-covariates. Then, we develop a hierarchical additive logistic regression model conditional upon the generated pseudo-covariates. Simulations and analysis of real microarray data suggest that our strategy is more powerful than methods that do not use a priori information.
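
The MDS step can be sketched as follows: a minimal example of classical (Torgerson) MDS, assuming prior knowledge is available as a matrix of pairwise gene dissimilarities. The abstract does not say which MDS variant the thesis uses, so classical MDS is an assumption here, and the toy distance matrix and the function name `classical_mds` are hypothetical.

```python
import numpy as np

def classical_mds(dist, k=2):
    """Embed n items in k dimensions from an n x n distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to obtain an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the k largest.
    vals, vecs = np.linalg.eigh(gram)
    order = np.argsort(vals)[::-1][:k]
    # Clip tiny negative eigenvalues caused by rounding before the square root.
    top_vals = np.clip(vals[order], 0.0, None)
    return vecs[:, order] * np.sqrt(top_vals)

# Hypothetical prior dissimilarities among 4 genes (corners of a unit square).
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# Each row holds one gene's pseudo-covariates.
coords = classical_mds(dist, k=2)

# Distances between the embedded points should match the input distances.
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.allclose(recovered, dist))
```

Each row of `coords` could then serve as the pseudo-covariates for one gene in a downstream regression model.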

Future research directions are discussed at the end of the thesis.

Type
dissertation
Copyright
2010-01-01