Assessing differential expression when the distribution of effect sizes is asymmetric and evaluating concordance of differential expression across multiple gene expression experiments

Orr, Megan
Major Professor
Peng Liu
Dan Nettleton
Committee Member

The emergence and development of gene expression technologies have resulted in an ever-increasing number of high-dimensional data sets available for analysis. The availability of these data sets has prompted much research into methods for statistically analyzing gene expression experiments. Many of these methods focus on identifying genes that are differentially expressed (DE), i.e., that exhibit changes in mean expression levels between treatments, in a single experiment. This dissertation presents novel methods for detecting differential expression in one experiment and proposes methods for analyzing gene expression data from two independent experiments.

Many methods have been proposed for estimating the number of genes that are equivalently expressed (EE), and thus the number of DE genes, in a single gene expression experiment, but many researchers are interested in comparing the results of two independent experiments. Estimating the number of genes that are DE in two independent experiments is generally performed in two steps. First, data from each experiment are analyzed separately, and a list of genes identified as DE is obtained for each experiment. Each list is generally produced by a method that attempts to control the false discovery rate (FDR) at some desired level α. Then, the number of genes common to both lists is used as an estimate of the number of genes DE in both experiments. A major flaw of this approach is that the resulting estimate can vary greatly depending on the value of α. Chapter 2 proposes a new method that estimates the number of genes that are DE in both of two independent experiments by analyzing the p-values from the two experiments simultaneously, which yields a single estimate that does not depend on α. Simulation studies show the advantages of this approach. Chapter 3 extends the idea of Chapter 2 by proposing a new method for identifying genes that are DE in both of two independent experiments while controlling FDR, and compares it to two existing methods. Simulation studies comparing the three methods show that the proposed method controls FDR better and provides similar or better power than the existing methods.
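The conventional two-step procedure described above, and its sensitivity to α, can be illustrated with a minimal sketch. It assumes the Benjamini-Hochberg procedure as the FDR-controlling method (the abstract does not name one) and uses simulated z-statistics. The function names, effect size, and gene counts are hypothetical choices for illustration. This is not the method proposed in Chapter 2.

```python
import math
import numpy as np

def bh_reject(pvalues, alpha):
    """Benjamini-Hochberg step-up procedure: boolean mask of rejected hypotheses."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank meeting its threshold
        reject[order[:k + 1]] = True
    return reject

def overlap_count(p1, p2, alpha):
    """Step 2 of the conventional approach: size of the intersection of the two DE lists."""
    return int(np.sum(bh_reject(p1, alpha) & bh_reject(p2, alpha)))

def two_sided_p(z):
    """Two-sided p-values for standard normal test statistics."""
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

# Simulated data (assumption): 2000 genes, 200 of them truly DE in both experiments.
rng = np.random.default_rng(0)
m = 2000
mu1 = np.zeros(m)
mu1[:400] = 2.5      # genes 0-399 are DE in experiment 1
mu2 = np.zeros(m)
mu2[200:600] = 2.5   # genes 200-599 are DE in experiment 2
p1 = two_sided_p(rng.normal(mu1, 1.0))
p2 = two_sided_p(rng.normal(mu2, 1.0))

# The overlap-based estimate changes as alpha changes.
for alpha in (0.01, 0.05, 0.10, 0.20):
    print(alpha, overlap_count(p1, p2, alpha))
```

Because the Benjamini-Hochberg rejection set grows with α, the overlap count can only stay the same or increase as α increases, so the estimate depends on a choice that has no natural value.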

Chapter 4 proposes a new method for calculating q-values when the distribution of effect sizes in a gene expression experiment is asymmetric. This method first estimates the number of genes that are EE in an experiment based on the distribution of all p-values. Then, the p-values are split into two subsets based on the signs of their corresponding test statistics, and q-values are then calculated separately for each subset. Simulation study results show that the proposed method, when compared to the traditional q-value method, generally provides a better ranking for genes as well as a higher number of truly DE genes identified as DE, while still adequately controlling FDR.
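The three steps described above (estimate the number of EE genes from all p-values, split the p-values by the sign of the test statistic, and compute q-values within each subset) can be sketched roughly as follows. The abstract does not give the estimator or the formulas, so this sketch makes two loudly labeled assumptions: the proportion of EE genes is estimated with a Storey-type estimator at a fixed λ = 0.5, and the estimated EE genes are assumed to split equally between the two signs. All function names are hypothetical, and this should not be read as the dissertation's actual algorithm.

```python
import numpy as np

def estimate_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the proportion of EE genes (assumed estimator)."""
    p = np.asarray(pvalues, dtype=float)
    return min(1.0, float(np.mean(p > lam)) / (1.0 - lam))

def qvalues_given_m0(pvalues, m0):
    """q-values for one subset, given the number of EE genes attributed to it."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    if n == 0:
        return np.array([], dtype=float)
    order = np.argsort(p)
    # Estimated FDR when rejecting the j smallest p-values: m0 * p_(j) / j
    ratios = m0 * p[order] / np.arange(1, n + 1)
    # Enforce monotonicity: q_(i) = min over j >= i of the ratios
    q_sorted = np.minimum.accumulate(ratios[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

def sign_split_qvalues(pvalues, statistics, lam=0.5):
    """q-values computed separately for positive and negative test statistics."""
    p = np.asarray(pvalues, dtype=float)
    t = np.asarray(statistics, dtype=float)
    m0_hat = estimate_pi0(p, lam) * p.size  # step 1: uses all p-values
    q = np.empty(p.size)
    positive = t >= 0                        # step 2: split by sign
    for mask in (positive, ~positive):
        # Assumption: half of the estimated EE genes fall in each sign subset.
        q[mask] = qvalues_given_m0(p[mask], m0_hat / 2.0)  # step 3
    return q
```

When most DE genes share one sign, the subset with that sign contains many small p-values but is charged only half of the estimated EE genes, so its q-values come out smaller than a pooled calculation would give. That is the intuition behind treating the two signs separately.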