Classical and Bayesian mixed model analysis of microarray data for detecting gene expression and DNA differences

Demirkale, Cumhur
Major Professor
Dan Nettleton
Tapabrata Maiti
Committee Member

This thesis focuses on classical and Bayesian mixed model analysis of microarray data for detecting gene expression and DNA differences. It consists of three research papers. The first study discusses the selection of gene-specific linear mixed models in microarray data analysis. In a microarray experiment, a single experimental design is used to obtain expression measures for all genes. One popular analysis method involves fitting the same linear mixed model for each gene, obtaining gene-specific p-values for tests of interest involving fixed effects, and then choosing a significance threshold intended to control the false discovery rate (FDR) at a desired level. When one or more random factors have zero variance components for some genes, the standard practice of fitting the same full linear mixed model for all genes can fail to control FDR. We propose a new method that combines results from the fits of full and selected linear mixed models to identify differentially expressed genes and to provide FDR control at target levels when the true underlying random effects structure varies across genes.
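As an illustration of the thresholding step described above, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure applied to gene-specific p-values. The abstract does not name the FDR procedure used in the thesis, so the choice of Benjamini-Hochberg, the function name `benjamini_hochberg`, and the example p-values are assumptions made purely for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Flag hypotheses rejected by the Benjamini-Hochberg step-up rule.

    Assumption: this is a generic FDR procedure chosen for illustration,
    not necessarily the one used in the thesis.
    """
    m = len(p_values)
    # Indices of the p-values in increasing order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per gene.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
example_rejected = benjamini_hochberg(example_p, alpha=0.05)
```

The procedure assumes the p-values are valid under the null hypothesis. If the fitted mixed model is misspecified for some genes, as in the zero variance component case discussed above, that assumption may not hold.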

The second study discusses a hierarchical Bayesian modeling strategy for microarray data analysis. Some microarray experiments have complex experimental designs that call for modeling of multiple sources of variation through the inclusion of multiple random factors. While large amounts of data on thousands of genes are collected in these experiments, the sample size for each gene is usually small. Therefore, in a classical gene-by-gene mixed linear model analysis, there will be very few degrees of freedom to estimate the variance components of all random factors considered in the model and low statistical power for testing fixed effects of interest. To address these challenges, we propose a hierarchical Bayesian modeling strategy to account for important experimental factors and complex correlation structure among the expression measurements for each gene. We use half-Cauchy priors for the standard deviation parameters of the random factors with few effects. We rank genes with respect to evidence of differential expression across the levels of a factor of interest by calculating a single summary statistic per gene from the posterior distribution of the treatment effects considered in the model. Simulation shows that our hierarchical Bayesian approach is much better than a traditional gene-by-gene mixed linear model analysis at distinguishing differentially expressed genes from non-differentially expressed genes.

The third study focuses on the identification of Single Feature Polymorphisms (SFPs) using Affymetrix gene expression data. In microarray data analysis, identifying SFPs is important for producing more accurate expression measurements when comparing samples of different genotypes. In addition, portions of DNA that differ between parental lines can serve as markers for tracking DNA inheritance in offspring. We summarize several SFP discovery methods from the literature. To identify single probes defining SFPs in the data, we developed two new algorithms in which a difference value is defined for each probe after accounting for the overall gene expression level differences in the probe set. The first method contrasts the difference value of each probe with the average of the difference values for the rest of the probes in that probe set. The second method is a robust version of the first. The performance of all methods is compared using two publicly available data sets in which the truth about the sequence polymorphism is known for some "Gold Standard" probes. Our algorithms outperformed the other methods in ordering probes by evidence of SFPs.
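The leave-one-out contrast described above can be sketched as follows. The abstract does not give the exact definition of the difference value or of the robust variant, so everything here is an assumption for illustration: the difference value is taken to be the per-probe difference in mean log intensity between two genotypes, the robust version swaps the mean for the median, and the function names are hypothetical.

```python
from statistics import mean, median


def probe_differences(group_a, group_b):
    """Per-probe difference in mean log intensity between two genotypes.

    group_a, group_b: lists of samples, each a list of log intensities
    with one entry per probe in the probe set.
    Assumption: this stands in for the thesis's "difference value".
    """
    n_probes = len(group_a[0])
    return [
        mean(sample[j] for sample in group_a) - mean(sample[j] for sample in group_b)
        for j in range(n_probes)
    ]


def leave_one_out_contrast(diffs, center=mean):
    """Contrast each probe's difference with the center of the other probes.

    center=mean mirrors the first method as described in the abstract;
    center=median is an assumed stand-in for the robust version.
    """
    scores = []
    for j, d in enumerate(diffs):
        rest = diffs[:j] + diffs[j + 1:]
        scores.append(d - center(rest))
    return scores


# Hypothetical probe set of four probes, where the third probe stands out.
diffs = [1.0, 1.0, 3.0, 1.0]
mean_scores = leave_one_out_contrast(diffs, center=mean)
robust_scores = leave_one_out_contrast(diffs, center=median)
```

Subtracting the center of the remaining probes removes the overall expression difference shared by the probe set, so a large absolute score points to a probe that behaves differently from the rest. Probes can then be ordered by absolute score as evidence of an SFP.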