Statistical methods in detecting differential expressed genes, analyzing insertion tolerance for genes and group selection for survival data

Thumbnail Image
Date
2015-01-01
Authors
Liu, Fangfang
Major Professor
Advisor
Peng Liu
Chong Wang
Committee Member
Journal Title
Journal ISSN
Volume Title
Publisher
Altmetrics
Abstract

The thesis is composed of three independent projects: (i) analyzing transposon-sequencing data to infer functions of genes on bacteria growth (chapter 2), (ii) developing semi-parametric Bayesian method method for differential gene expression analysis with RNA-sequencing data (chapter 3), (iii) solving group selection problem for survival data (chapter 4). All projects

are motivated by statistical challenges raised in biological research.

The first project is motivated by the need to develop statistical models to accommodate the transposon insertion sequencing (Tn-Seq) data, Tn-Seq data consist of sequence reads around each transposon insertion site.

The detection of transposon insertion at a given site indicates that the disruption of genomic sequence at this site does not cause essential function loss and the bacteria can still grow.

Hence, such measurements have been used to infer the functions of each gene on bacteria growth. We propose a zero-inflated Poisson regression method for analyzing the Tn-Seq count data, and derive an Expectation-Maximization (EM) algorithm to obtain parameter estimates. We also propose a multiple testing procedure that categorizes genes into each of the three states, hypo-tolerant, tolerant, and hyper-tolerant, while controlling false discovery rate. Simulation studies show our method provides good

estimation of model parameters and inference on gene functions.

In the second project, we model the count data from RNA-sequencing experiment for each gene using a Poisson-Gamma hierarchical model, or equivalently, a negative binomial (NB) model. We derive a full semi-parametric Bayesian approach with Dirichlet process as the prior for the fold changes between two treatment means. An inference strategy using Gibbs algorithm is developed for differential expression analysis. We evaluate our method with several simulation studies, and the results demonstrate that our method outperforms other methods including the popularly applied ones such as edgeR and DESeq.

In the third project, we develop a new semi-parametric Bayesian method to address the group variable selection problem and study the dependence of survival outcomes on the grouped predictors using the Cox proportional hazard model. We use indicators for groups to induce sparseness and obtain the posterior inclusion probability for each group. Bayes factors are used to evaluate whether the groups should be selected or not. We compare our method with one frequentist method (HPCox) based on several simulation studies and show that our method performs better than HPCox method.

In summary, this dissertation tackles several statistical problems raised in biological research, including high-dimensional genomic data analysis and survival analysis. All proposed methods are evaluated with simulation studies and show satisfactory performances. We also apply the proposed methods to real data analysis.

Series Number
Journal Issue
Is Version Of
Versions
Series
Academic or Administrative Unit
Type
dissertation
Comments
Rights Statement
Copyright
Thu Jan 01 00:00:00 UTC 2015
Funding
Subject Categories
Keywords
Supplemental Resources
Source